Ilya Sutskever: An Observation on Generalization
Supervised vs. Unsupervised Learning
Why Supervised Learning Works
The starting point is a question we rarely ask anymore: why should learning work at all? Why should data have regularity that machine learning models can capture?
Supervised learning has a clean answer. Statistical learning theory (PAC learning, VC theory) gives a precise condition under which learning must succeed: if you achieve low training error and your model's degrees of freedom are fewer than the number of training examples, you are guaranteed low test error.
The proof is elementary — the slide above shows all the math. If you have a finite function class \(\mathcal{F}\), the probability that training error deviates far from test error for any function is bounded by a union bound over \(\lvert \mathcal{F} \rvert\) followed by a Hoeffding-type concentration inequality. Three lines. The key assumption: the training distribution and test distribution must be the same.
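The three-line argument can be written out explicitly (a standard statement of the bound, not copied verbatim from the slide): for a finite class \(\mathcal{F}\) and \(n\) i.i.d. samples,

```latex
% Union bound over the finite class, then Hoeffding's inequality for each fixed f:
\Pr\Big[\, \exists f \in \mathcal{F} :
    \big|\widehat{\mathrm{err}}(f) - \mathrm{err}(f)\big| > \varepsilon \,\Big]
\;\le\; \sum_{f \in \mathcal{F}}
    \Pr\Big[\big|\widehat{\mathrm{err}}(f) - \mathrm{err}(f)\big| > \varepsilon\Big]
\;\le\; \lvert \mathcal{F} \rvert \cdot 2\, e^{-2 n \varepsilon^{2}}.
```

Once \(n\) exceeds a modest multiple of \(\log \lvert \mathcal{F} \rvert / \varepsilon^2\), the right-hand side is tiny, so low training error implies low test error for every function in the class simultaneously.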
Ilya makes a side comment on VC dimension: it was invented solely to handle parameters with infinite precision. In practice, all our floats are finite precision (and shrinking — FP16, INT8, …), so you can reduce back to the finite function class bound, which gives you essentially the same generalization guarantees with a simpler proof.
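To see why finite precision suffices, here is a minimal numeric sketch of the finite-class bound, assuming \(\lvert \mathcal{F} \rvert = 2^{P \cdot b}\) for \(P\) parameters at \(b\) bits each (the parameter counts below are hypothetical, chosen only for illustration):

```python
import math

def log_failure_prob(num_params, bits_per_param, n, eps):
    """Log of the union-bound guarantee 2 * |F| * exp(-2 n eps^2),
    with |F| = 2^(num_params * bits_per_param). Computed in log space
    because |F| is astronomically large."""
    log_F = num_params * bits_per_param * math.log(2)
    return math.log(2) + log_F - 2 * n * eps**2

P, b, eps = 10**6, 16, 0.01  # hypothetical: 1M params at 16-bit precision

# With too little data the bound is vacuous (log-probability far above 0) ...
print(log_failure_prob(P, b, n=10**6, eps=eps))
# ... but once n comfortably exceeds ~ P*b / eps^2, it becomes meaningful.
print(log_failure_prob(P, b, n=10**11, eps=eps))
```

Lower precision (fewer bits per parameter) shrinks \(\log \lvert \mathcal{F} \rvert\) directly, which is the sense in which FP16 or INT8 weights make the finite-class bound tighter.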
The Mystery of Unsupervised Learning
Supervised learning is conceptually settled. But what about unsupervised learning?
The old dream of unsupervised learning — look at images or text without labels and somehow discover the true hidden structure — has been fulfilled empirically. GPT, BERT, diffusion models, all testify to this. But why should it work? There is no analogue of the supervised learning guarantee. You optimize one objective (reconstruction, denoising, next-token prediction) but care about a different objective (downstream task performance). You have no mathematical reason to expect the first to help with the second.
Consider the pathological case: if your data is drawn from the uniform distribution, no unsupervised learning algorithm can extract useful structure. So unsupervised learning clearly requires assumptions about the data. The question is: what assumptions, and can we formalize them?
Distribution Matching
Before presenting the main argument, Ilya describes one form of unsupervised learning that provably works: distribution matching.
Given two data sources \(X\) and \(Y\) with no correspondence between them (e.g., English sentences and French sentences), find a function \(f\) such that the distribution of \(f(X)\) matches the distribution of \(Y\):
\[f^\ast = \arg\min_f \, D\big(P_{f(X)},\; P_Y\big)\]If \(X\) and \(Y\) are high-dimensional, this constraint is highly informative — it almost fully determines \(f\). This is guaranteed to work in the same sense as supervised learning: the constraint is sufficient to recover the mapping.
This framework covers unsupervised machine translation, substitution ciphers, and any setting where two distributions are related by a structured transformation. Ilya independently discovered this idea in 2015 and was excited that something mathematically meaningful could be said about unsupervised learning. But real unsupervised learning settings are rarely this clean.
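A toy illustration of the idea: in one dimension, matching the first two moments of \(f(X)\) to those of \(Y\) already pins down an affine map from unpaired samples alone. Moment matching is only a crude stand-in for minimizing a full divergence \(D(P_{f(X)}, P_Y)\), and every number below is invented for illustration:

```python
import random
import statistics

# Unpaired samples: xs come from X, ys come from Y = a_true * X' + b_true,
# where X' is a *different* draw from the same distribution (no correspondence).
random.seed(0)
a_true, b_true = 2.0, 3.0
xs = [random.gauss(0.0, 1.0) for _ in range(20_000)]
ys = [a_true * random.gauss(0.0, 1.0) + b_true for _ in range(20_000)]

# Distribution matching via first two moments: f(x) = a_hat * x + b_hat.
# Note the residual ambiguity: moments alone cannot distinguish a_hat from
# -a_hat; in high dimensions the constraint becomes far more determining.
a_hat = statistics.stdev(ys) / statistics.stdev(xs)
b_hat = statistics.mean(ys) - a_hat * statistics.mean(xs)
print(a_hat, b_hat)  # close to the true (2.0, 3.0)
```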
Compression as the Foundation of Unsupervised Learning
The Thought Experiment
It is well known that compression and prediction are equivalent: every compressor can be converted to a predictor and vice versa (Shannon’s source coding theorem). But Ilya argues that for reasoning about unsupervised learning, the language of compression offers conceptual advantages.
Say you have two datasets \(X\) and \(Y\) — two files on disk. You have a really good compressor \(C\). Consider what happens when you compress \(X\) and \(Y\) jointly (concatenate them, then compress) versus separately:
\[C(X \circ Y) \leq C(X) + C(Y)\]A sufficiently good compressor will use patterns in \(X\) to help compress \(Y\), and vice versa. The gap
\[\Delta = \big[C(X) + C(Y)\big] - C(X \circ Y)\]is the shared structure — the algorithmic mutual information between the two datasets. The better the compressor, the more shared structure it extracts.
This is the key: let \(X\) be your unsupervised data and \(Y\) be your supervised task data. Suddenly you have a mathematical reason for patterns in \(X\) to help with \(Y\).
This also generalizes distribution matching. If \(X\) is language 1 and \(Y\) is language 2, and there exists a simple function \(f\) transforming one into the other, a good compressor will discover that mapping internally — because exploiting it reduces the compressed length.
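The compression gap \(\Delta\) can be observed with any off-the-shelf compressor. A quick sketch using zlib as a weak stand-in for a "really good" compressor (the example strings are invented):

```python
import zlib

# Two datasets that share structure: sentences built from the same vocabulary,
# a miniature stand-in for "patterns in X help compress Y".
x = ("the model predicts the next token from the context " * 50).encode()
y = ("the context determines which token the model predicts " * 50).encode()

c_x = len(zlib.compress(x, 9))   # C(X): compress X alone
c_y = len(zlib.compress(y, 9))   # C(Y): compress Y alone
c_xy = len(zlib.compress(x + y, 9))  # C(X ∘ Y): compress the concatenation

gap = (c_x + c_y) - c_xy  # shared structure the compressor exploits
print(c_x, c_y, c_xy, gap)
```

Even this weak compressor finds a positive gap, partly because the sliding window built while compressing \(X\) is still available when \(Y\) begins, and partly because only one stream header is needed. A better compressor extracts more of the shared structure.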
Kolmogorov Complexity: The Ultimate Compressor
The slide shows the core theorem of Kolmogorov complexity: for any computable compressor \(C\), the Kolmogorov complexity \(K(X)\) is bounded by the compressed output length \(\lvert C(X)\rvert\) plus the description length of \(C\) itself, \(K(C)\), plus a constant. The proof is the simulation argument: the universal Turing machine can run \(C\)'s code to decompress, so it only needs \(C\)'s program plus \(C\)'s output. This single idea — that the best compressor can simulate all others — is the foundation of everything that follows.
To formalize “a really good compressor”, Ilya invokes Kolmogorov complexity: the length of the shortest program that outputs a given string \(x\):
\[K(x) = \min_{p:\; U(p) = x} \lvert p \rvert\]where \(U\) is a universal Turing machine. The Kolmogorov compressor has a remarkable property: it has minimal regret relative to any other compressor. Specifically:
\[K(x) \leq C(x) + \lvert \text{program implementing } C \rvert\]for any compressor \(C\). The proof is a simulation argument: the Kolmogorov compressor can simulate any compressor \(C\) by running \(C\)’s source code. You just pay the one-time cost of specifying \(C\)’s program. This is also why it’s not computable — it simulates all programs — but it is the best possible compressor.
Conditional Kolmogorov Complexity as the Solution
The slide above frames unsupervised learning as a regret minimization problem. Given an ML algorithm \(A\) that tries to compress (predict) your labeled data \(Y\) while having access to unlabeled data \(X\), the question is: how much prediction quality are you leaving on the table? Low regret means you extracted all the value from \(X\) that any algorithm could — even if \(X\) turns out to be useless (e.g., uniform noise), you know you haven't missed anything.
Now generalize to conditional Kolmogorov complexity:
\[K(Y \mid X) = \min_{p:\; U(p, X) = Y} \lvert p \rvert\]the length of the shortest program that outputs \(Y\) if it is allowed to probe \(X\). Using this to compress your supervised task data \(Y\) while conditioning on unsupervised data \(X\) is definitionally the optimal unsupervised learning algorithm: no algorithm can extract more predictive value from \(X\) for the task \(Y\).
If \(K(Y \mid X)\) is much smaller than \(K(Y)\), the unsupervised data is highly informative for your task. If \(K(Y \mid X) \approx K(Y)\), the unsupervised data is useless (e.g., the uniform distribution case). Either way, using the conditional Kolmogorov compressor means you have zero regret — you’ve done the best possible job of leveraging unlabeled data.
Ilya phrases this beautifully: “you can sleep soundly at night knowing that no one could have done a better job at benefiting from your unlabeled data.”
The slide above formalizes the conditional Kolmogorov theorem: for any computable compressor \(C\), the Kolmogorov compressor conditioned on \(X\) is at least as good, up to the constant overhead of specifying \(C\)'s code. Two details from the slide deserve emphasis: we condition on the entire dataset \(X\), not a single example — this is what makes it a theory of unsupervised learning rather than few-shot learning. And the theorem guarantees that this conditional compressor will extract all predictive value from \(X\) for the task \(Y\).
From Conditional to Joint Compression
There is a practical issue: conditioning on a massive dataset \(X\) is unnatural in a machine learning context. We can fit large datasets but we can’t truly condition on them (not yet, at least).
The resolution: joint compression of the concatenation \(X \circ Y\) achieves the same quality as conditional compression \(K(Y \mid X)\). The theoretical basis is the chain rule of Kolmogorov complexity (also called the symmetry of algorithmic information):
\[K(X, Y) = K(X) + K(Y \mid X) + O(\log n)\]This means the joint compressed length \(K(X, Y)\) decomposes into the cost of compressing \(X\) alone plus the conditional cost of compressing \(Y\) given \(X\), up to a logarithmic term. A good joint compressor of \(X \circ Y\) will automatically discover and exploit the shared structure — it effectively performs conditional compression internally. The only “waste” is \(K(X)\), which you’d pay regardless, plus the negligible \(O(\log n)\) overhead.
Ilya notes in the talk that “proving this is slightly more tricky” and skips the proof, but the chain rule is a classical result in algorithmic information theory (Li & Vitányi, 2008).
And joint compression maps directly to maximum likelihood training: the total negative log-likelihood of a dataset under a model is the cost of compressing that dataset with that model, and adding more data sources just adds terms to the sum.
\[-\sum_i \log p_\theta(x_i) \;\longleftrightarrow\; \text{compression cost of the dataset under model } \theta\]So the practical recipe is simple: take all your data, concatenate it, and train a model to maximize likelihood. This is what we already do — but now we have a theoretical justification.
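The correspondence is concrete: a model that assigns probability \(p\) to a symbol lets an optimal code spend \(-\log_2 p\) bits on it (Shannon), so summing negative log-likelihoods over a dataset gives its compressed length under that model. A tiny sketch with a hand-picked categorical model (the probabilities and data are made up):

```python
import math

# Hypothetical categorical model over four symbols with dyadic probabilities,
# so the ideal code lengths are whole bits: a -> 1, b -> 2, c -> 3, d -> 3.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
data = "aabacabd" * 100  # 800 symbols

# Total negative log-likelihood in bits = the length an arithmetic coder
# driven by this model would need (up to a couple of bits of rounding).
nll_bits = sum(-math.log2(p[s]) for s in data)
print(nll_bits / len(data), "bits per symbol")  # 1.75
```

A better model assigns higher probability to the data, lowering the sum; that is exactly "better compression" in the sense used throughout the talk.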
Neural Networks and Empirical Validation
Neural Networks as Approximate Kolmogorov Compressors
The Kolmogorov compressor searches over all programs. SGD over neural networks searches over a restricted class of programs — but this class is remarkably expressive.
Neural networks are circuits. Circuits are computers. SGD is program search over these computing machines. All of deep learning rides on the “SGD miracle”: that gradient descent can actually find good circuits from data.
This is also why it’s hard to improve neural network architectures — the simulation argument applies. Your new architecture can usually be straightforwardly simulated by the old one, so the improvement is small. The exceptions are when the old architecture has a genuine bottleneck: for example, the RNN’s fixed-size hidden state makes it hard to simulate a Transformer’s attention mechanism.
Ilya notes: had we found a way to engineer an RNN with a very large hidden state, perhaps it would have matched the Transformer.
Bigger neural networks approximate the Kolmogorov compressor better. This may be why we like scaling: larger models have less regret as compressors, extracting more predictive value from data.
iGPT: Validation in the Image Domain
To validate this theory beyond text (where next-token prediction can be explained without invoking compression), Ilya and collaborators tested it in the image domain with iGPT (2020):
- Take an image, flatten it into a sequence of pixels, discretize intensity values
- Train a Transformer on next-pixel prediction — the same autoregressive objective as GPT, but on pixels
- Evaluate unsupervised learning quality via linear probe accuracy: pick the best internal layer, fit a linear classifier, measure downstream performance
Results on CIFAR-10: as the model gets better at next-pixel prediction, linear probe accuracy improves monotonically. With a 6B parameter model on 64×64 images, they achieved 99% on CIFAR-10 — remarkably strong for a method that never sees any labels during pretraining.
The result confirmed: better compression → better unsupervised representations, even in a domain (vision) where the connection to text-based next-token prediction is non-obvious.
The Mystery of Linear Representations
The compression theory explains why unsupervised learning extracts useful features, but it does not explain why those features are linearly separable. Linear probes work surprisingly well on intermediate representations, and the reason “must be deep and profound.”
One interesting observation: autoregressive models produce better linear representations than BERT-style models. Ilya’s speculation for why:
- In next-pixel prediction, predicting pixel \(t+1\) from pixels \(1, \ldots, t\) sometimes requires understanding long-range structure (e.g., predicting a pixel at the start of a new row from the entire preceding image context)
- In BERT, you mask ~25% of tokens and predict them from context on both sides. Any prediction can be done well by looking a little into the past and a little into the future — the local context suffices
- BERT removes the hardest prediction tasks, the ones that force the model to build global representations. Autoregressive models must solve these hard cases, which drives the formation of more informative features
This also implies that diffusion models, which are conceptually similar to BERT (denoising from a corrupted version), should have worse linear representations than autoregressive models — for the same reason.
Q&A Highlights
On the limits of the analogy. The biggest gap between Kolmogorov complexity and neural networks is the search procedure. Kolmogorov complexity enumerates all programs from scratch each time; SGD is an incremental, order-dependent search. Data ordering matters for neural networks precisely because SGD is a weak search procedure — early features get “baked in” and persist throughout training. Ilya cautions that the analogy is “very clearly not universally applicable” but is valuable for explaining where unsupervised learning comes from.
On VC dimension. A questioner defends VC dimension by citing a 2007 result on PAC learning of quantum states, where the VC dimension is exponentially smaller than the log of the hypothesis class size. Ilya concedes this is a cool example where VC dimension remains essential.
On energy-based models. Energy-based models offer another way to turn neural networks into probability distributions — assign an energy to each configuration, then normalize. Ratios of distributions correspond to differences in energy. This connects to the distinguishability ↔ prediction equivalence from cryptography.