Ilya Sutskever: Meta Learning and Self Play
Why Do Neural Networks Work?
神经网络为什么有效?
Ilya opens with a foundational question: why should deep learning work at all?
The theoretically optimal hypothesis class is the set of short programs. (From Kolmogorov complexity theory: Solomonoff, 1964; Kolmogorov, 1965. A “short program” is the shortest computer program that produces a given dataset; its length, the Kolmogorov complexity, measures the data’s intrinsic information content.) There is a provable mathematical result here: the shortest program fitting your data yields the best possible generalization. The intuition is clean: if you can compress data into a short program, you’ve extracted every conceivable regularity; if no short program exists, the data is essentially random.
The problem is that searching over short programs is computationally intractable. But what about small circuits? (From circuit complexity theory: a Boolean circuit is a DAG of logic gates, and a neural network can be viewed as a parameterized circuit. Small-circuit search means finding the simplest circuit that fits the data, analogous to finding the shortest program but feasible via gradient descent.) Here’s the key insight: backpropagation is small circuit search. When you constrain a neural network’s architecture and iteratively adjust its weights via gradient descent, you are effectively searching for the smallest circuit satisfying \(F(x_i; \theta) = y_i\). Neural network training = solving a neural equation.
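This search can be made concrete in a few lines. Below is a minimal sketch, not from the talk: the architecture (2 → 8 → 1), the XOR task, and the step size are illustrative choices, but the mechanism is exactly the one described — a fixed circuit whose weights are tuned by gradient descent until \(F(x_i; \theta) \approx y_i\).

```python
import numpy as np

# Gradient descent as small-circuit search: the circuit's shape is fixed
# (2 -> 8 -> 1, two parallel steps), and backprop tunes the weights until
# F(x_i; theta) ~= y_i. XOR is the classic function that needs depth > 1.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)                      # circuit step 1
    return h, 1 / (1 + np.exp(-(h @ W2 + b2)))    # circuit step 2

_, p = forward(X)
loss_before = float(np.mean((p - y) ** 2))

for _ in range(2000):
    h, p = forward(X)
    d2 = (p - y) * p * (1 - p)        # backprop through the sigmoid
    d1 = (d2 @ W2.T) * (1 - h ** 2)   # backprop through the tanh
    W2 -= 0.5 * h.T @ d2; b2 -= 0.5 * d2.sum(0)
    W1 -= 0.5 * X.T @ d1; b1 -= 0.5 * d1.sum(0)

_, p = forward(X)
loss_after = float(np.mean((p - y) ** 2))
print(loss_before, "->", loss_after)
```

The search never considers other architectures; it only slides through the weight space of this one circuit, which is what makes it tractable where program search is not.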
Why does backprop actually find good solutions? Mostly a mystery. Ilya attributes it to “the great variety that exists in most natural datasets” — an inexplicable fact that powers all of modern AI.
A 50-layer network is essentially a 50-step parallel computer. You can accomplish impressive tasks in just 50 parallel steps — for instance, a medium-sized network can learn to sort \(n\) numbers, executing what amounts to \(O(\log n)\) parallel sorting steps. That 50 layers of threshold-gated neurons can carry out significant logic and reasoning is the deep mystery underlying everything else in the talk.
Reinforcement Learning and Meta Learning
强化学习与元学习
The Framework
基本框架
Ilya presents RL as a framework for evaluating an agent’s ability to achieve goals in complex stochastic environments. The formulation is simple: find a policy that maximizes expected reward. But he makes a subtle philosophical point that often goes unnoticed.
In the standard RL diagram, the environment sends a reward signal \(r_t\) back to the agent — reward is a well-defined component of the MDP, and the math is perfectly correct. But Ilya raises a question about what this means as a model of biological agents.
Consider what happens physically when you touch a hot stove. The stove does not “send” pain — it is simply an object at 300°C. All it transmits is thermal energy. Your nociceptors (pain receptors) in the skin convert this temperature into a neural signal, and your brain interprets that signal as suffering. If your pain nerves were severed, the same stove, the same contact, would produce zero reward signal. The environment hasn’t changed; the agent’s internal wiring has.
This means that in biological reality, the environment provides only observations (temperature, photons, pressure waves), and the agent’s own neural circuitry decides which observations constitute reward. The RL formalism packages this internal construction as an exogenous signal \(r_t = R(s_t, a_t)\), which is mathematically convenient but epistemologically misleading — it makes reward look like an objective property of the environment when it is actually a subjective construction of the agent.
The implication is reductionist: the only reward that evolution “cares about” is survival — existence versus non-existence. Everything else — pleasure, pain, hunger, curiosity, social approval — is a proxy that evolution hard-coded into our nervous system because it correlated with survival in ancestral environments. If we want AI to learn with biological-like flexibility, merely specifying an external reward function may not be enough — the agent may need some form of endogenous reward construction, an internal mechanism for deciding what matters. This foreshadows the later discussion of RLHF and value functions as emotions.
He covers two classes of model-free RL:
- Policy gradients: “Just take the gradient.” Stable, easy to use, on-policy. The derivative has a beautiful form — it tells you to try actions, and if you like them, increase their log-probability. (For a detailed derivation, see our post on Policy Gradient and Actor-Critic.)
- Q-learning: Less stable, more sample-efficient, off-policy. Can learn from anyone’s data trying to achieve any goal. (For why scaling Q-learning is hard, see Challenges in Scaling Q-Learning.)
The on-policy vs off-policy distinction matters deeply: on-policy means “I can only learn from my own actions,” while off-policy means “I can learn from anyone trying to achieve any goal.” The mathematical machinery behind this — importance sampling — is what makes off-policy correction possible (see Importance Sampling: Why and How). This distinction becomes critical for Hindsight Experience Replay.
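The policy-gradient rule — try actions, and if you like them, increase their log-probability — fits in a few lines. Here is a minimal sketch; the 3-armed bandit, the learning rates, and the running baseline are illustrative choices, not from the talk.

```python
import numpy as np

# REINFORCE on a toy 3-armed bandit with a softmax policy.
# For a softmax, grad log pi(a) = one_hot(a) - probs; we scale it by
# the advantage (reward minus a running baseline) and ascend.
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])   # hypothetical arm payoffs
logits = np.zeros(3)
baseline = 0.0

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(10000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)                  # try an action...
    r = true_means[a] + rng.normal(scale=0.05)  # ...and observe the reward
    grad_logp = -probs
    grad_logp[a] += 1.0                         # grad of log pi(a | logits)
    logits += 0.1 * (r - baseline) * grad_logp  # liked it -> raise log-prob
    baseline += 0.05 * (r - baseline)           # running average reward

print(softmax(logits).round(2))
```

Note this is on-policy: every gradient step uses an action the current policy itself chose, which is exactly the restriction importance sampling exists to lift.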
Meta Learning: Learn to Learn
元学习:学会学习
The dream of meta learning: train a system on many tasks so it can solve new tasks quickly. The modern approach reduces this to conventional deep learning with an elegant trick — training tasks become training examples. You feed a model all information about a new task plus test cases, and ask it to predict. Tasks = training cases. That’s it. Everything else is details.
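Concretely, “tasks = training cases” means each example fed to the meta learner is an entire task. A sketch under toy assumptions — Gaussian class prototypes standing in for Omniglot characters, 1-shot 5-way, and a flat packing format that is purely illustrative:

```python
import numpy as np

# Each "training example" is a whole few-shot task: a labeled support
# set plus a query, with the query's label as the target. A high-capacity
# model trained across many such examples learns to learn.
rng = np.random.default_rng(0)

def sample_task(n_way=5, dim=16):
    protos = rng.normal(size=(n_way, dim))                    # one prototype per class
    support_x = protos + 0.1 * rng.normal(size=(n_way, dim))  # 1-shot support set
    support_y = np.eye(n_way)                                 # one-hot support labels
    c = int(rng.integers(n_way))
    query_x = protos[c] + 0.1 * rng.normal(size=dim)          # query from class c
    # pack the entire task plus the query into one input vector
    meta_input = np.concatenate([support_x.ravel(), support_y.ravel(), query_x])
    return meta_input, c                                      # target = query label

x, label = sample_task()
print(x.shape)  # 5*16 support + 5*5 labels + 16 query = (121,)
```

Once tasks are shaped like this, meta learning is just supervised learning over task-examples; everything else is details.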
Ilya covers several success stories:
- Omniglot: 98% accuracy on 1-shot 20-way character recognition (Mishra et al., 2017) — a dataset designed as a challenge for deep learning.
- Neural Architecture Search (Zoph & Le, 2017): find an architecture on a small dataset that transfers to large ones. Meta learning over architectures.
But the biggest limitation of high-capacity meta learning is stark: training task distribution must equal test task distribution. You can do well on Omniglot and some robotics tasks, but can you teach an agent math, counting, reading, programming — and have it be capable of learning chemistry as a result? That’s the real question, and it remains open.
Hindsight Experience Replay
后见经验回放
Hindsight Experience Replay (HER, Andrychowicz et al. 2017) addresses a fundamental RL problem: exploration. If you never receive reward, how can you learn? HER’s insight is that you can learn from failure.
The setup: build a system that can reach any state. Goal: reach state A. But any trajectory ends up at some other state B. The key idea — use this as training data to reach state B. You aimed for A, ended up at B, and that’s disappointing. But reframe: you’ve actually reached B successfully. You now have training data for how to reach B, obtained for free while trying to reach A.
This works because it leverages off-policy learning: when you land at state B while aiming for A, the data is off-policy with respect to the goal of reaching B, since the actions you would take if genuinely trying to reach B would be different. HER combined with DDPG (Deep Deterministic Policy Gradient, Lillicrap et al., 2015: a continuous-action extension of Q-learning that pairs a learned Q-function with a deterministic policy network) lets a robotic arm learn to push objects to target positions, even with sparse rewards where conventional RL completely fails (no reward = no learning).
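The relabeling trick itself is small enough to sketch directly. The replay-tuple format and the toy 1-D chain below are illustrative; the idea — store each failed trajectory a second time with the achieved state as the goal — is HER's.

```python
# HER relabeling: a trajectory that failed to reach goal A is stored
# twice in the replay buffer -- once with the original goal (a failure)
# and once with the achieved final state as the goal (a success).
def her_relabel(trajectory, goal):
    """trajectory: list of (state, action, next_state) tuples."""
    buffer = []
    achieved = trajectory[-1][2]  # the state B we actually reached
    for (s, a, s2) in trajectory:
        # tuple format: (state, action, next_state, goal, reward)
        buffer.append((s, a, s2, goal,     float(s2 == goal)))      # aimed at A
        buffer.append((s, a, s2, achieved, float(s2 == achieved)))  # "reached B"
    return buffer

# toy 1-D chain: tried to reach state 5, only wandered to state 2
traj = [(0, +1, 1), (1, +1, 2)]
buf = her_relabel(traj, goal=5)
rewards = [r for (*_, r) in buf]
print(rewards)  # [0.0, 0.0, 0.0, 1.0]: the relabeled final step succeeds
```

An off-policy learner such as DQN or DDPG can then consume both copies, so even reward-free trajectories produce a learning signal.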
Ilya emphasizes the direction is correct: you want to utilize all data, not just the small fraction where you succeeded. The next step is the same algorithm but with high-level states — and the key question becomes: where do high-level states come from? This is where representation learning and unsupervised learning become critical.
Sim2Real with Meta Learning
从模拟到现实的元学习
Training robots in simulation and transferring policies to real hardware is attractive but fundamentally hard — simulators can never perfectly match reality. Simulating contact is NP-complete (or close to it), so there will always be a sim-real gap.
The solution is simulation randomization (Peng et al., 2017): randomize gravity, friction, torques, object dimensions, and contact types during training. The policy is never told how the simulator is configured — it must infer physics from experience and adapt. This is meta learning applied to domain transfer.
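The structure of the idea can be sketched in a toy 1-D “push to target” world. The parameter names, ranges, dynamics, and the hand-written feedback policy below are all illustrative, not from Peng et al.; the essential point is only that the physics are resampled each episode and hidden from the policy.

```python
import random

# Simulation randomization: every episode draws fresh physics parameters
# that the policy never observes directly -- it must cope with (or infer)
# them from how the world responds to its actions.
rng = random.Random(0)

def sample_physics():
    return {"friction": rng.uniform(0.2, 1.5),   # illustrative ranges
            "mass": rng.uniform(0.5, 2.0)}

def step(pos, vel, force, phys):
    # toy dynamics: acceleration = (force - friction * vel) / mass
    acc = (force - phys["friction"] * vel) / phys["mass"]
    vel += 0.05 * acc
    pos += 0.05 * vel
    return pos, vel

def run_episode(target=1.0, steps=200):
    phys = sample_physics()          # hidden from the policy
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        # a simple feedback "policy" standing in for a learned one
        force = 2.0 * (target - pos) - 0.5 * vel
        pos, vel = step(pos, vel, force, phys)
    return abs(pos - target)         # final error under this draw of physics

errors = [run_episode() for _ in range(20)]
print(max(errors))
```

A policy that only works for one fixed draw of physics fails this test; one that works across all draws has, in effect, learned to handle reality as just another sample.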
The result: a policy trained in randomized simulation can adapt to real physics. A robot trained to push a puck to a target struggles badly without randomization (the sim-real gap defeats it), but with randomization it learns to quickly infer the simulator’s properties and complete the task. Transfer to real hardware works because reality is just another “randomized” setting.
Self Play
自我博弈
Converting Compute into Data
将计算转化为数据
Self play is the most exciting part of the talk. Ilya’s thesis: self play converts compute into data, and this conversion will become increasingly important as neural network processors get faster.
He traces the history from TD-Gammon (Tesauro, 1992) — Q-learning + neural networks + self play that reached the level of the best human backgammon players and discovered strategies humans hadn’t noticed — through AlphaGo Zero (learning Go from scratch with no human data) to OpenAI’s Dota 2 bot (pure self play in a complex real-time game, going from random play to beating top professionals in ~5 months).
The promise of self play:
- Simple environment → extremely complex strategy: a simple game with simple rules can produce arbitrarily sophisticated behavior.
- Convert compute into data: with self play, more compute = more and better training data, automatically.
- Perfect curriculum: your opponent is always at your level. You always face a fair challenge, whether you’re a beginner or world champion.
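The “perfect curriculum” property can be sketched as a loop in which the opponent is always drawn from a pool of the learner's own recent checkpoints. The scalar “skill” and the noisy-comparison game below are stand-ins for a real environment and a real learning rule.

```python
import random

# Self-play skeleton: the opponent pool tracks the learner's own level,
# so every game is a fair challenge, from beginner to expert. The toy
# "game" (higher skill plus noise wins) is purely illustrative.
rng = random.Random(0)

def play_game(skill_a, skill_b):
    # a noisy comparison stands in for an actual game rollout
    return (skill_a + rng.gauss(0, 1)) > (skill_b + rng.gauss(0, 1))

skill = 0.0
pool = [skill]                      # pool of past selves (checkpoints)
for step in range(1000):
    opponent = rng.choice(pool)     # opponent is at (or near) our level
    if play_game(skill, opponent):
        skill += 0.01               # "learning": beating peers improves us
    pool.append(skill)
    if len(pool) > 50:              # keep only recent checkpoints
        pool.pop(0)

print(round(skill, 2))
```

Playing against slightly-stale copies (rather than only the latest self) is a common stabilizer in practice; here it also keeps the win rate near 50%, which is what makes the curriculum automatic.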
Ilya shows OpenAI’s “Sumo” experiment (Bansal et al., 2017): two humanoid figures learning to wrestle with no prior knowledge of standing, balance, or gravity. Through pure competition, they discover walking, pushing, dodging — general physical dexterity emerges from the pressure to win. The trained agents even exhibit transfer learning: you can apply random forces to them and they maintain balance, because they’ve already learned to handle being pushed by an unpredictable opponent.
Can We Train AGI via Self Play?
能否通过自我博弈训练 AGI?
Ilya makes a speculative but compelling argument. He cites a Science paper noting that corvids and apes independently evolved complex cognitive abilities — language-like communication, tool use, theory of mind — despite vastly different brain structures, because they needed to solve similar socioecological problems. Social life incentivizes intelligence.
The implication: if you create a multi-agent environment where agents must communicate, negotiate, cooperate, and compete, open-ended self play should produce theory of mind, social skills, empathy, and eventually real language understanding. The human brain tripled in size over two million years, likely driven by social competition — survival depended not on outrunning tigers but on managing your tribe’s social dynamics.
If agent society is a reasonable venue for general intelligence, and if we accept the rapid competence increases observed in Dota, then once we get the details right, we should see rapid capability increases in agents living in agent societies.
This also raises alignment issues — how do we ensure that agents trained through self play behave as we hope? Ilya explicitly flags this as a concern.
Alignment and the Future
对齐与未来
Learning from Human Feedback
从人类反馈中学习
In a section that proved remarkably prescient, Ilya presents RLHF (Christiano et al., 2017) — years before it would become the foundation of ChatGPT and modern LLM alignment.
The problem: how do you communicate goals to an agent? Ilya frames this as a technical problem that is critical because “the agents we train may eventually be smarter than us.”
The method: human judges see pairs of agent behaviors and click on whichever looks better. From ~500 such clicks, you fit a scalar reward function to the human preferences (a pairwise ranking loss: if the human deems A > B, learn a reward function that scores A above B). Then optimize this learned reward via RL. (The idea that a language model itself can serve as this reward predictor — replacing human judges — is explored in our post Can Language Models Be Critic Functions?.)
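Fitting a reward to pairwise clicks can be sketched in the spirit of Christiano et al. (2017) with a Bradley-Terry style logistic loss. Everything below is toy: the 4-dimensional features, the linear reward model, and the simulated “judge” who prefers behaviors with a larger first feature are all illustrative stand-ins for video clips, a neural reward net, and a human.

```python
import numpy as np

# Learn a scalar reward from ~500 pairwise preferences.
# Model: P(A preferred over B) = sigmoid(r(A) - r(B)); ascend log-likelihood.
rng = np.random.default_rng(0)
w = np.zeros(4)                                  # reward model parameters

def reward(x):
    return x @ w                                 # linear reward (toy)

for _ in range(500):                             # ~500 preference "clicks"
    a, b = rng.normal(size=4), rng.normal(size=4)
    human_prefers_a = a[0] > b[0]                # simulated judge (assumption)
    x_win, x_lose = (a, b) if human_prefers_a else (b, a)
    p = 1 / (1 + np.exp(-(reward(x_win) - reward(x_lose))))
    w += 0.1 * (1 - p) * (x_win - x_lose)        # gradient ascent on log P

print(np.round(w, 1))  # the weight on feature 0 dominates
```

The learned `reward` then plays the role of \(r_t\) in an ordinary RL loop — which is precisely the point made earlier about reward being a construction rather than a property of the environment.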
The results are surprisingly effective — with just a few thousand bits of human feedback, you can train Atari agents and even teach unusual goals (like making a car follow closely behind another car in a racing game) that would be hard to specify programmatically. (For how this evolved into modern LLM alignment, see Policy Optimization without a Critic: The GRPO Family and RL on Language under Single-step Settings.)
Ilya closes the main talk with an alignment slide that captures the tension perfectly: “Will likely solve the technical alignment problem. But what are the right goals? Political problem.” The technical challenge of making AI do what we want is tractable; deciding what we should want is the hard part.
Q&A Highlights
问答精选
The Q&A session contains several gems:
On backpropagation vs. the brain: Ilya acknowledges that backprop doesn’t happen in biological brains (signals propagate forward along axons; backprop requires sending errors backward). But he argues backprop solves circuit search — “a profoundly fundamental problem” — and will remain central to AI until we understand how brains actually work. “We will build systems that are fully at human level and beyond before we understand how the brain works.”
On cooperation in self play: In sufficiently open-ended games, cooperation will emerge as a winning strategy. “We will eventually choose to cooperate, because cooperation is more beneficial than not.” Understanding other agents’ goals, strategies, and beliefs becomes essential for both competition and communication.
On the future of language models — the most prophetic exchange. When asked “the current state of generative language models is very bad; what is the most promising research direction?”, Ilya responds:
I want to say that just training bigger, deeper language models will achieve surprising results — scale up. If you train a language model with a thousand layers of the same type, I think it will be a very impressive language model. We haven’t reached that point yet, but I think things will change quickly.
This was February 2018 — months before GPT-1, years before GPT-3 vindicated this prediction spectacularly.
On continual learning: Ilya draws an analogy to education — you go to school, learn useful but incomplete things, then join the workforce and must continue learning. Your degree doesn’t fully prepare you; it gives you a starting point. “I think this is what schools should be doing.” The AI equivalent: pre-train, then deploy into an environment that violates some of your assumptions, and continue training to reconcile new data with old knowledge.
On the alignment problem as political: “What I can say is, at a very high level, every time you advance into the future, or every time you build a machine that can do what people do better, the impact on society will be enormous and overwhelming. Even if you try very hard, it’s hard to imagine.”