Ilya Sutskever: Meta Learning and Self Play

Presenter: Ilya Sutskever
Host Institute: MIT AGI Course
Host: Lex Fridman
This post distills a lecture by Ilya Sutskever (then Chief Scientist at OpenAI) titled Meta Learning and Self Play, delivered at MIT's AGI course on February 1, 2018, introduced by Lex Fridman. Ilya surveys OpenAI's research in reinforcement learning, meta learning, and self play — three pillars he saw as converging toward general intelligence. The talk is remarkably prescient: it foreshadows RLHF, the scaling hypothesis, and even the alignment problem, all before GPT-1 was published. In the Q&A, Ilya makes his famous prediction: "just training bigger, deeper language models will achieve surprising results — scale up."

Why Do Neural Networks Work?

Ilya opens with a foundational question: why should deep learning work at all?

The theoretically optimal hypothesis class is the set of short programs. (From Kolmogorov complexity theory — Solomonoff, 1964; Kolmogorov, 1965. A “short program” is the shortest computer program that produces a given dataset; its length, the Kolmogorov complexity, measures the data’s intrinsic information content.) A provable mathematical theorem states that the shortest program fitting your data yields the best possible generalization. The intuition is clean: if you can compress data into a short program, you’ve extracted all conceivable regularities; if no short program exists, the data is essentially random.

The problem is that searching over short programs is computationally intractable. But what about small circuits? (From circuit complexity theory: a Boolean circuit is a DAG of logic gates, and a neural network can be viewed as a parameterized circuit. “Small circuit search” means finding the simplest circuit that fits the data — analogous to shortest-program search, but feasible via gradient descent.) Here’s the key insight: backpropagation is small circuit search. When you constrain a neural network’s architecture and iteratively adjust its weights via gradient descent, you are effectively searching for the smallest circuit satisfying \(F(x_i; \theta) = y_i\). Neural network training = solving a neural equation.
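As a concrete (and entirely illustrative) reading of “training = circuit search,” the sketch below fits a tiny two-layer circuit to XOR by plain gradient descent; the architecture and hyperparameters are my own choices, not anything from the talk.

```python
import numpy as np

# A minimal sketch of backprop as "small circuit search": fix a tiny
# two-layer circuit F(x; theta) and adjust its weights by gradient
# descent until F(x_i; theta) ≈ y_i. The "circuit" here learns XOR,
# which no single linear gate can compute.
rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # hidden layer of 8 gates
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # output gate
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(10_000):
    # Forward pass: evaluate the circuit on all four inputs.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: the "search" step, nudging the circuit toward the data.
    g_logit = (p - y) / len(X)                 # d(cross-entropy)/d(logit)
    g_W2, g_b2 = h.T @ g_logit, g_logit.sum(0)
    g_h = (g_logit @ W2.T) * (1 - h ** 2)
    g_W1, g_b1 = X.T @ g_h, g_h.sum(0)
    lr = 0.5
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

pred = sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2)
```

The point is not the toy task but the mechanism: weight adjustment by gradient descent plays the role that exhaustive circuit enumeration plays in theory.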

Why does backprop actually find good solutions? Mostly a mystery. Ilya attributes it to “the great variety that exists in most natural datasets” — an inexplicable fact that powers all of modern AI.

A 50-layer network is essentially a 50-step parallel computer. You can accomplish impressive tasks in just 50 parallel steps — for instance, a medium-sized network can learn to sort \(n\) numbers, executing what amounts to \(O(\log n)\) parallel sorting steps. That 50 layers of threshold-gated neurons can carry out nontrivial logic and reasoning is the deep mystery underlying everything else in the talk.

Reinforcement Learning and Meta Learning

The Framework

Ilya presents RL as a framework for evaluating an agent’s ability to achieve goals in complex stochastic environments. The formulation is simple: find a policy that maximizes expected reward. But he makes a subtle philosophical point that often goes unnoticed.

In the standard RL diagram, the environment sends a reward signal \(r_t\) back to the agent — reward is a well-defined component of the MDP, and the math is perfectly correct. But Ilya raises a question about what this means as a model of biological agents.

Consider what happens physically when you touch a hot stove. The stove does not “send” pain — it is simply an object at 300°C. All it transmits is thermal energy. Your nociceptors (pain receptors) in the skin convert this temperature into a neural signal, and your brain interprets that signal as suffering. If your pain nerves were severed, the same stove, the same contact, would produce zero reward signal. The environment hasn’t changed; the agent’s internal wiring has.

This means that in biological reality, the environment provides only observations (temperature, photons, pressure waves), and the agent’s own neural circuitry decides which observations constitute reward. The RL formalism packages this internal construction as an exogenous signal \(r_t = R(s_t, a_t)\), which is mathematically convenient but epistemologically misleading — it makes reward look like an objective property of the environment when it is actually a subjective construction of the agent.

The implication is reductionist: the only reward that evolution “cares about” is survival — existence versus non-existence. Everything else — pleasure, pain, hunger, curiosity, social approval — is a proxy that evolution hard-coded into our nervous system because it correlated with survival in ancestral environments. If we want AI to learn with biological-like flexibility, merely specifying an external reward function may not be enough — the agent may need some form of endogenous reward construction, an internal mechanism for deciding what matters. This foreshadows the later discussion of RLHF and value functions as emotions.

He covers two classes of model-free RL:

  • Policy gradients: “Just take the gradient.” Stable, easy to use, on-policy. The derivative has a beautiful form — it tells you to try actions, and if you like them, increase their log-probability. (For a detailed derivation, see our post on Policy Gradient and Actor-Critic.)
  • Q-learning: Less stable, more sample-efficient, off-policy. Can learn from anyone’s data trying to achieve any goal. (For why scaling Q-learning is hard, see Challenges in Scaling Q-Learning.)

The on-policy vs off-policy distinction matters deeply: on-policy means “I can only learn from my own actions,” while off-policy means “I can learn from anyone trying to achieve any goal.” The mathematical machinery behind this — importance sampling — is what makes off-policy correction possible (see Importance Sampling: Why and How). This distinction becomes critical for Hindsight Experience Replay.
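The off-policy correction can be made concrete with a toy importance-sampling estimate; the two-action policies and returns below are invented purely for illustration.

```python
import random

# Importance sampling in miniature: estimate E_pi[f(a)] using samples
# drawn from a *different* behavior policy mu, by reweighting each
# sample with the ratio pi(a)/mu(a). This is the machinery that lets
# off-policy methods "learn from anyone's data."
random.seed(0)

pi = {0: 0.8, 1: 0.2}   # target policy over two actions
mu = {0: 0.5, 1: 0.5}   # behavior policy that generated the data
f = {0: 1.0, 1: 10.0}   # per-action return

samples = [random.choices([0, 1], weights=[mu[0], mu[1]])[0]
           for _ in range(100_000)]

# Unweighted average estimates E_mu[f] — the wrong policy's value.
naive = sum(f[a] for a in samples) / len(samples)
# Reweighted average estimates E_pi[f] from the same data.
weighted = sum(pi[a] / mu[a] * f[a] for a in samples) / len(samples)

true_value = pi[0] * f[0] + pi[1] * f[1]   # 0.8*1 + 0.2*10 = 2.8
```

The same data yields two different answers: only the reweighted estimate reflects the target policy's expected return.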

Meta Learning: Learn to Learn

The dream of meta learning: train a system on many tasks so it can solve new tasks quickly. The modern approach reduces this to conventional deep learning with an elegant trick — training tasks become training examples. You feed a model all information about a new task plus test cases, and ask it to predict. Tasks = training cases. That’s it. Everything else is details.
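One way to picture “tasks = training cases” is the episode construction used in few-shot benchmarks like Omniglot: each training example handed to the meta-learner is an entire task — a support set describing it plus a held-out query. The helper below is a hypothetical sketch, not code from any cited paper.

```python
import random

# Sketch of N-way K-shot episode sampling: a "training case" for the
# meta-learner is (support set, query), where the support set IS the
# task description. Names and the toy task pool are invented.
random.seed(0)

def sample_episode(task_pool, n_way=3, k_shot=1):
    """Sample one N-way K-shot episode from a pool of {label: examples}."""
    classes = random.sample(sorted(task_pool), n_way)
    support, queries = [], []
    for slot, label in enumerate(classes):
        shots = random.sample(task_pool[label], k_shot + 1)
        support += [(x, slot) for x in shots[:k_shot]]   # task description
        queries.append((shots[k_shot], slot))            # held-out test case
    return support, random.choice(queries)

# Toy "task distribution": character classes with a few instances each.
pool = {c: [f"{c}_{i}" for i in range(5)] for c in "abcdefg"}
support, query = sample_episode(pool)
```

A meta-learner trained on millions of such episodes never sees the same task twice — tasks really are just training examples.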

Ilya covers several success stories:

  • Omniglot: 98% accuracy on 1-shot 20-way character recognition (Mishra et al., 2017) — a dataset designed as a challenge for deep learning.
  • Neural Architecture Search (Zoph & Le, 2017): find an architecture on a small dataset that transfers to large ones. Meta learning over architectures.

But the biggest limitation of high-capacity meta learning is stark: training task distribution must equal test task distribution. You can do well on Omniglot and some robotics tasks, but can you teach an agent math, counting, reading, programming — and have it be capable of learning chemistry as a result? That’s the real question, and it remains open.

Hindsight Experience Replay

Hindsight Experience Replay (HER, Andrychowicz et al. 2017) addresses a fundamental RL problem: exploration. If you never receive reward, how can you learn? HER’s insight is that you can learn from failure.

The setup: build a system that can reach any state. Goal: reach state A. But any trajectory ends up at some other state B. The key idea — use this as training data to reach state B. You aimed for A, ended up at B, and that’s disappointing. But reframe: you’ve actually reached B successfully. You now have training data for how to reach B, obtained for free while trying to reach A.
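The relabeling trick can be sketched in a few lines; the sparse 0/1 reward and tuple layout below are my simplification, not the paper’s exact formulation.

```python
# HER relabeling in miniature: a failed trajectory toward goal A is
# stored twice — once with the original goal (a failure), and once
# pretending the final state B was the goal all along (a success).
def her_relabel(trajectory, goal):
    """trajectory: list of (state, action, next_state) tuples."""
    achieved = trajectory[-1][2]            # the state B we actually reached
    replay = []
    for state, action, next_state in trajectory:
        # Original goal: sparse reward, 1 only if we reached A.
        replay.append((state, action, next_state, goal,
                       int(next_state == goal)))
        # Hindsight goal: relabel with B; the final step becomes a success.
        replay.append((state, action, next_state, achieved,
                       int(next_state == achieved)))
    return replay

# A two-step trajectory that aimed for "sA" but ended in "s2".
traj = [("s0", "right", "s1"), ("s1", "up", "s2")]
replay = her_relabel(traj, goal="sA")
```

Every failed rollout now contributes at least one successful, fully rewarded transition to the replay buffer — free training signal under sparse rewards.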

This works because it leverages off-policy learning — when you actually land at state B, you’re effectively doing off-policy learning, because the actions you’d take if genuinely trying to reach B would be different. HER combined with DDPG (Deep Deterministic Policy Gradient — Lillicrap et al., 2015 — a continuous-action extension of Q-learning that pairs a learned Q-function with a deterministic policy network) lets a robotic arm learn to push objects to target positions, even with sparse rewards where conventional RL completely fails (no reward = no learning).

Ilya emphasizes the direction is correct: you want to utilize all data, not just the small fraction where you succeeded. The next step is the same algorithm but with high-level states — and the key question becomes: where do high-level states come from? This is where representation learning and unsupervised learning become critical.

Sim2Real with Meta Learning

Training robots in simulation and transferring policies to real hardware is attractive but fundamentally hard — simulators can never perfectly match reality. Simulating contact is NP-complete (or close to it), so there will always be a sim-real gap.

The solution is simulation randomization (Peng et al., 2017): randomize gravity, friction, torques, object dimensions, and contact types during training. The policy is never told how the simulator is configured — it must infer physics from experience and adapt. This is meta learning applied to domain transfer.
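A hedged sketch of what “randomize the simulator” means in practice — the parameter names and ranges below are invented for illustration, not taken from the paper:

```python
import random

# Simulation randomization: resample the physics at every episode so the
# policy never sees a fixed simulator, and must infer the dynamics from
# its own observations rather than memorize one configuration.
random.seed(0)

def randomized_physics():
    return {
        "gravity":      random.uniform(8.0, 12.0),   # m/s^2, around Earth's 9.81
        "friction":     random.uniform(0.2, 1.2),
        "torque_scale": random.uniform(0.7, 1.3),
        "puck_radius":  random.uniform(0.03, 0.07),
    }

def run_episode(policy, physics):
    # Placeholder: a real loop would step a simulator configured with
    # `physics`. Crucially, `physics` is *not* shown to the policy —
    # it only ever sees observations.
    ...

# Each episode gets its own freshly sampled world.
episodes = [randomized_physics() for _ in range(3)]
```

Because the policy can never rely on one fixed physics, reality becomes just one more draw from the distribution it has already learned to handle.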

The result: a policy trained in randomized simulation can adapt to real physics. A robot trained to push a puck to a target struggles badly without randomization (the sim-real gap defeats it), but with randomization it learns to quickly infer the simulator’s properties and complete the task. Transfer to real hardware works because reality is just another “randomized” setting.

Self Play

Converting Compute into Data

Self play is the most exciting part of the talk. Ilya’s thesis: self play converts compute into data, and this conversion will become increasingly important as neural network processors get faster.

He traces the history from TD-Gammon (Tesauro, 1992) — Q-learning + neural networks + self play that beat all human backgammon players and discovered strategies humans hadn’t noticed — through AlphaGo Zero (learning Go from scratch with no human data) to OpenAI’s Dota 2 bot (pure self play in a complex real-time strategy game, going from random play to beating top professionals in ~5 months).

The promise of self play:

  • Simple environment → extremely complex strategy: a simple game with simple rules can produce arbitrarily sophisticated behavior.
  • Convert compute into data: with self play, more compute = more and better training data, automatically.
  • Perfect curriculum: your opponent is always at your level. You always face a fair challenge, whether you’re a beginner or world champion.

Ilya shows OpenAI’s “Sumo” experiment (Bansal et al., 2017): two humanoid figures learning to wrestle with no prior knowledge of standing, balance, or gravity. Through pure competition, they discover walking, pushing, dodging — general physical dexterity emerges from the pressure to win. The trained agents even exhibit transfer learning: you can apply random forces to them and they maintain balance, because they’ve already learned to handle being pushed by an unpredictable opponent.

Can We Train AGI via Self Play?

Ilya makes a speculative but compelling argument. He cites a Science paper noting that corvids and apes independently evolved complex cognitive abilities — language-like communication, tool use, theory of mind — despite vastly different brain structures, because they needed to solve similar socioecological problems. Social life incentivizes intelligence.

The implication: if you create a multi-agent environment where agents must communicate, negotiate, cooperate, and compete, open-ended self play should produce theory of mind, social skills, empathy, and eventually real language understanding. The human brain tripled in size over two million years, likely driven by social competition — survival depended not on outrunning tigers but on managing your tribe’s social dynamics.

If agent society is a reasonable venue for general intelligence, and if we accept the rapid competence increases observed in Dota, then once we get the details right, we should see rapid capability increases in agents living in agent societies.

This also raises alignment issues — how do we ensure that agents trained through self play behave as we hope? Ilya explicitly flags this as a concern.

Alignment and the Future

Learning from Human Feedback

In a section that proved remarkably prescient, Ilya presents RLHF (Christiano et al., 2017) — years before it would become the foundation of ChatGPT and modern LLM alignment.

The problem: how do you communicate goals to an agent? Ilya frames this as a technical problem that is critical because “the agents we train may eventually be smarter than us.”

The method: human judges see pairs of agent behaviors and click on whichever looks better. From ~500 such clicks, you fit a scalar reward function to the human preferences (a triplet loss: if human deems A > B, learn a reward consistent with this). Then optimize this learned reward via RL. (The idea that a language model itself can serve as this reward predictor — replacing human judges — is explored in our post Can Language Models Be Critic Functions?.)
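The preference-fitting step can be sketched as a logistic loss on reward differences — a 1-D simplification of my own, standing in for the paper’s neural reward model and real comparison data:

```python
import math

# Fit a scalar reward from pairwise human preferences: model
# P(A preferred over B) = sigmoid(r(A) - r(B)) and minimize -log P.
def preference_loss(r_a, r_b):
    """Negative log-likelihood that A is preferred over B."""
    return -math.log(1 / (1 + math.exp(-(r_a - r_b))))

# Toy 1-D reward model r(x) = w * x over scalar behavior features.
w = 0.0
comparisons = [(2.0, 1.0), (3.0, 0.5), (1.5, 1.0)]  # (preferred, rejected)
lr = 0.1
for _ in range(200):
    for xa, xb in comparisons:
        p = 1 / (1 + math.exp(-(w * xa - w * xb)))   # current P(A > B)
        # Gradient of -log p w.r.t. w is -(1 - p) * (xa - xb); descend it.
        w += lr * (1 - p) * (xa - xb)
```

After fitting, the learned reward ranks preferred behaviors above rejected ones, and that scalar can then be optimized by an ordinary RL algorithm.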

The results are surprisingly effective — with just a few thousand bits of human feedback, you can train Atari agents and even teach unusual goals (like making a car follow closely behind another car in a racing game) that would be hard to specify programmatically. (For how this evolved into modern LLM alignment, see Policy Optimization without a Critic: The GRPO Family and RL on Language under Single-step Settings.)

Ilya closes the main talk with an alignment slide that captures the tension perfectly: “Will likely solve the technical alignment problem. But what are the right goals? Political problem.” The technical challenge of making AI do what we want is tractable; deciding what we should want is the hard part.

Q&A Highlights

The Q&A session contains several gems:

On backpropagation vs. the brain: Ilya acknowledges that backprop doesn’t happen in biological brains (signals propagate forward along axons; backprop requires sending errors backward). But he argues backprop solves circuit search — “a profoundly fundamental problem” — and will remain central to AI until we understand how brains actually work. “We will build systems that are fully at human level and beyond before we understand how the brain works.”

On cooperation in self play: In sufficiently open-ended games, cooperation will emerge as a winning strategy. “We will eventually choose to cooperate, because cooperation is more beneficial than not.” Understanding other agents’ goals, strategies, and beliefs becomes essential for both competition and communication.

On the future of language models — the most prophetic exchange. When asked “the current state of generative language models is very bad; what is the most promising research direction?”, Ilya responds:

I want to say that just training bigger, deeper language models will achieve surprising results — scale up. If you train a language model with a thousand layers of the same type, I think it will be a very impressive language model. We haven’t reached that point yet, but I think things will change quickly.

This was February 2018 — months before GPT-1, years before GPT-3 vindicated this prediction spectacularly.

On continual learning: Ilya draws an analogy to education — you go to school, learn useful but incomplete things, then join the workforce and must continue learning. Your degree doesn’t fully prepare you; it gives you a starting point. “I think this is what schools should be doing.” The AI equivalent: pre-train, then deploy into an environment that violates some of your assumptions, and continue training to reconcile new data with old knowledge.

On the alignment problem as political: “What I can say is, at a very high level, every time you advance into the future, or every time you build a machine that can do what people do better, the impact on society will be enormous and overwhelming. Even if you try very hard, it’s hard to imagine.”
