Ilya Sutskever: An Observation on Generalization
Supervised vs. Unsupervised Learning
Why Supervised Learning Works
The starting point is a question we rarely ask anymore: why should learning work at all? Why should data have regularity that machine learning models can capture?
Supervised learning has a clean answer. Statistical learning theory (PAC learning, VC theory) gives a precise condition under which learning must succeed: if you achieve low training error and your model's degrees of freedom are fewer than the number of training examples, you are guaranteed low test error.
The proof is elementary — the slide above shows all the math. If you have a finite function class \(\mathcal{F}\), the probability that training error deviates far from test error for any function is bounded by a union bound over \(\lvert \mathcal{F} \rvert\) followed by a Hoeffding-type concentration inequality. Three lines. The key assumption: the training distribution and test distribution must be the same.
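The three-line argument can be written out explicitly (a standard statement of the bound, not copied verbatim from the slide): for a finite class \(\mathcal{F}\) and \(n\) i.i.d. samples,

```latex
% Union bound over the finite class, then Hoeffding's inequality for each fixed f:
\Pr\Big[\, \exists f \in \mathcal{F} :
    \big|\widehat{\mathrm{err}}(f) - \mathrm{err}(f)\big| > \varepsilon \,\Big]
\;\le\; \sum_{f \in \mathcal{F}}
    \Pr\Big[\big|\widehat{\mathrm{err}}(f) - \mathrm{err}(f)\big| > \varepsilon\Big]
\;\le\; \lvert \mathcal{F} \rvert \cdot 2\, e^{-2 n \varepsilon^{2}}.
```

Once \(n\) exceeds a modest multiple of \(\log \lvert \mathcal{F} \rvert / \varepsilon^2\), the right-hand side is tiny, so low training error implies low test error for every function in the class simultaneously.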
Ilya makes a side comment on VC dimension: it was invented solely to handle parameters with infinite precision. In practice, all our floats are finite precision (and shrinking — FP16, INT8, …), so you can reduce back to the finite function class bound, which gives you essentially the same generalization guarantees with a simpler proof.
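To see why finite precision suffices, here is a minimal numeric sketch of the finite-class bound, assuming \(\lvert \mathcal{F} \rvert = 2^{P \cdot b}\) for \(P\) parameters at \(b\) bits each (the parameter counts below are hypothetical, chosen only for illustration):

```python
import math

def log_failure_prob(num_params, bits_per_param, n, eps):
    """Log of the union-bound guarantee 2 * |F| * exp(-2 n eps^2),
    with |F| = 2^(num_params * bits_per_param). Computed in log space
    because |F| is astronomically large."""
    log_F = num_params * bits_per_param * math.log(2)
    return math.log(2) + log_F - 2 * n * eps**2

P, b, eps = 10**6, 16, 0.01  # hypothetical: 1M params at 16-bit precision

# With too little data the bound is vacuous (log-probability far above 0) ...
print(log_failure_prob(P, b, n=10**6, eps=eps))
# ... but once n comfortably exceeds ~ P*b / eps^2, it becomes meaningful.
print(log_failure_prob(P, b, n=10**11, eps=eps))
```

Lower precision (fewer bits per parameter) shrinks \(\log \lvert \mathcal{F} \rvert\) directly, which is the sense in which FP16 or INT8 weights make the finite-class bound tighter.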
The Mystery of Unsupervised Learning
Supervised learning is conceptually settled. But what about unsupervised learning?
The old dream of unsupervised learning — look at images or text without labels and somehow discover the true hidden structure — has been fulfilled empirically. GPT, BERT, diffusion models, all testify to this. But why should it work? There is no analogue of the supervised learning guarantee. You optimize one objective (reconstruction, denoising, next-token prediction) but care about a different objective (downstream task performance). You have no mathematical reason to expect the first to help with the second.
Consider the pathological case: if your data is drawn from the uniform distribution, no unsupervised learning algorithm can extract useful structure. So unsupervised learning clearly requires assumptions about the data. The question is: what assumptions, and can we formalize them?
Distribution Matching
Before presenting the main argument, Ilya describes one form of unsupervised learning that provably works: distribution matching.
Given two data sources \(X\) and \(Y\) with no correspondence between them (e.g., English sentences and French sentences), find a function \(f\) such that the distribution of \(f(X)\) matches the distribution of \(Y\):
\[f^\ast = \arg\min_f \, D\big(P_{f(X)},\; P_Y\big)\]If \(X\) and \(Y\) are high-dimensional, this constraint is highly informative — it almost fully determines \(f\). This is guaranteed to work in the same sense as supervised learning: the constraint is sufficient to recover the mapping.
This framework covers unsupervised machine translation, substitution ciphers, and any setting where two distributions are related by a structured transformation. Ilya independently discovered this idea in 2015 and was excited that something mathematically meaningful could be said about unsupervised learning. But real unsupervised learning settings are rarely this clean.
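A toy illustration of the idea: in one dimension, matching the first two moments of \(f(X)\) to those of \(Y\) already pins down an affine map from unpaired samples alone. Moment matching is only a crude stand-in for minimizing a full divergence \(D(P_{f(X)}, P_Y)\), and every number below is invented for illustration:

```python
import random
import statistics

# Unpaired samples: xs come from X, ys come from Y = a_true * X' + b_true,
# where X' is a *different* draw from the same distribution (no correspondence).
random.seed(0)
a_true, b_true = 2.0, 3.0
xs = [random.gauss(0.0, 1.0) for _ in range(20_000)]
ys = [a_true * random.gauss(0.0, 1.0) + b_true for _ in range(20_000)]

# Distribution matching via first two moments: f(x) = a_hat * x + b_hat.
# Note the residual ambiguity: moments alone cannot distinguish a_hat from
# -a_hat; in high dimensions the constraint becomes far more determining.
a_hat = statistics.stdev(ys) / statistics.stdev(xs)
b_hat = statistics.mean(ys) - a_hat * statistics.mean(xs)
print(a_hat, b_hat)  # close to the true (2.0, 3.0)
```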
Compression as the Foundation of Unsupervised Learning
The Thought Experiment
It is well known that compression and prediction are equivalent: every compressor can be converted to a predictor and vice versa (Shannon’s source coding theorem). But Ilya argues that for reasoning about unsupervised learning, the language of compression offers conceptual advantages.
Say you have two datasets \(X\) and \(Y\) — two files on disk. You have a really good compressor \(C\). Consider what happens when you compress \(X\) and \(Y\) jointly (concatenate them, then compress) versus separately:
\[C(X \circ Y) \leq C(X) + C(Y)\]A sufficiently good compressor will use patterns in \(X\) to help compress \(Y\), and vice versa. The gap
\[\Delta = \big[C(X) + C(Y)\big] - C(X \circ Y)\]is the shared structure — the algorithmic mutual information between the two datasets. The better the compressor, the more shared structure it extracts.
This is the key: let \(X\) be your unsupervised data and \(Y\) be your supervised task data. Suddenly you have a mathematical reason for patterns in \(X\) to help with \(Y\).
This also generalizes distribution matching. If \(X\) is language 1 and \(Y\) is language 2, and there exists a simple function \(f\) transforming one into the other, a good compressor will discover that mapping internally — because exploiting it reduces the compressed length.
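The compression gap \(\Delta\) can be observed with any off-the-shelf compressor. A quick sketch using zlib as a weak stand-in for a "really good" compressor (the example strings are invented):

```python
import zlib

# Two datasets that share structure: sentences built from the same vocabulary,
# a miniature stand-in for "patterns in X help compress Y".
x = ("the model predicts the next token from the context " * 50).encode()
y = ("the context determines which token the model predicts " * 50).encode()

c_x = len(zlib.compress(x, 9))   # C(X): compress X alone
c_y = len(zlib.compress(y, 9))   # C(Y): compress Y alone
c_xy = len(zlib.compress(x + y, 9))  # C(X ∘ Y): compress the concatenation

gap = (c_x + c_y) - c_xy  # shared structure the compressor exploits
print(c_x, c_y, c_xy, gap)
```

Even this weak compressor finds a positive gap, partly because the sliding window built while compressing \(X\) is still available when \(Y\) begins, and partly because only one stream header is needed. A better compressor extracts more of the shared structure.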
Kolmogorov Complexity: The Ultimate Compressor
The slide shows the core theorem of Kolmogorov complexity: for any computable compressor \(C\), the Kolmogorov complexity \(K(X)\) is bounded by the compressed output length \(\lvert C(X)\rvert\) plus the description length of \(C\) itself, \(K(C)\), plus a constant. The proof is the simulation argument: the universal Turing machine can run \(C\)'s code to decompress, so it only needs \(C\)'s program plus \(C\)'s output. This single idea — that the best compressor can simulate all others — is the foundation of everything that follows.
To formalize “a really good compressor”, Ilya invokes Kolmogorov complexity: the length of the shortest program that outputs a given string \(x\):
\[K(x) = \min_{p:\; U(p) = x} \lvert p \rvert\]where \(U\) is a universal Turing machine. The Kolmogorov compressor has a remarkable property: it has minimal regret relative to any other compressor. Specifically:
\[K(x) \leq C(x) + \lvert \text{program implementing } C \rvert\]for any compressor \(C\). The proof is a simulation argument: the Kolmogorov compressor can simulate any compressor \(C\) by running \(C\)’s source code. You just pay the one-time cost of specifying \(C\)’s program. This is also why it’s not computable — it simulates all programs — but it is the best possible compressor.
Conditional Kolmogorov Complexity as the Solution
The slide above frames unsupervised learning as a regret minimization problem. Given an ML algorithm \(A\) that tries to compress (predict) your labeled data \(Y\) while having access to unlabeled data \(X\), the question is: how much prediction quality are you leaving on the table? Low regret means you extracted all the value from \(X\) that any algorithm could — even if \(X\) turns out to be useless (e.g., uniform noise), you know you haven't missed anything.
Now generalize to conditional Kolmogorov complexity:
\[K(Y \mid X) = \min_{p:\; U(p, X) = Y} \lvert p \rvert\]the length of the shortest program that outputs \(Y\) if it is allowed to probe \(X\). Using this to compress your supervised task data \(Y\) while conditioning on unsupervised data \(X\) is definitionally the optimal unsupervised learning algorithm: no algorithm can extract more predictive value from \(X\) for the task \(Y\).
If \(K(Y \mid X)\) is much smaller than \(K(Y)\), the unsupervised data is highly informative for your task. If \(K(Y \mid X) \approx K(Y)\), the unsupervised data is useless (e.g., the uniform distribution case). Either way, using the conditional Kolmogorov compressor means you have zero regret — you’ve done the best possible job of leveraging unlabeled data.
Ilya phrases this beautifully: “you can sleep soundly at night knowing that no one could have done a better job at benefiting from your unlabeled data.”
The slide above formalizes the conditional Kolmogorov theorem: for any computable compressor \(C\), the Kolmogorov compressor conditioned on \(X\) is at least as good, up to the constant overhead of specifying \(C\)'s code. Two details from the slide deserve emphasis: we condition on the entire dataset \(X\), not a single example — this is what makes it a theory of unsupervised learning rather than few-shot learning. And the theorem guarantees that this conditional compressor will extract all predictive value from \(X\) for the task \(Y\).
From Conditional to Joint Compression
There is a practical issue: conditioning on a massive dataset \(X\) is unnatural in a machine learning context. We can fit large datasets but we can’t truly condition on them (not yet, at least).
The resolution: joint compression of the concatenation \(X \circ Y\) achieves the same quality as conditional compression \(K(Y \mid X)\). The theoretical basis is the chain rule of Kolmogorov complexity (also called the symmetry of algorithmic information):
\[K(X, Y) = K(X) + K(Y \mid X) + O(\log n)\]This means the joint compressed length \(K(X, Y)\) decomposes into the cost of compressing \(X\) alone plus the conditional cost of compressing \(Y\) given \(X\), up to a logarithmic term. A good joint compressor of \(X \circ Y\) will automatically discover and exploit the shared structure — it effectively performs conditional compression internally. The only “waste” is \(K(X)\), which you’d pay regardless, plus the negligible \(O(\log n)\) overhead.
Ilya notes in the talk that “proving this is slightly more tricky” and skips the proof, but the chain rule is a classical result in algorithmic information theory (Li & Vitányi, 2008).
And joint compression maps directly to maximum likelihood training: the total negative log-likelihood of a dataset under a model is the cost of compressing that dataset with that model, and adding more data sources just adds terms to the sum.
\[-\sum_i \log p_\theta(x_i) \;\longleftrightarrow\; \text{compression cost of the dataset under model } \theta\]So the practical recipe is simple: take all your data, concatenate it, and train a model to maximize likelihood. This is what we already do — but now we have a theoretical justification.
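The correspondence is concrete: a model that assigns probability \(p\) to a symbol lets an optimal code spend \(-\log_2 p\) bits on it (Shannon), so summing negative log-likelihoods over a dataset gives its compressed length under that model. A tiny sketch with a hand-picked categorical model (the probabilities and data are made up):

```python
import math

# Hypothetical categorical model over four symbols with dyadic probabilities,
# so the ideal code lengths are whole bits: a -> 1, b -> 2, c -> 3, d -> 3.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
data = "aabacabd" * 100  # 800 symbols

# Total negative log-likelihood in bits = the length an arithmetic coder
# driven by this model would need (up to a couple of bits of rounding).
nll_bits = sum(-math.log2(p[s]) for s in data)
print(nll_bits / len(data), "bits per symbol")  # 1.75
```

A better model assigns higher probability to the data, lowering the sum; that is exactly "better compression" in the sense used throughout the talk.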
Neural Networks and Empirical Validation
Neural Networks as Approximate Kolmogorov Compressors
The Kolmogorov compressor searches over all programs. SGD over neural networks searches over a restricted class of programs — but this class is remarkably expressive.
Neural networks are circuits. Circuits are computers. SGD is program search over these computing machines. All of deep learning rides on the “SGD miracle”: that gradient descent can actually find good circuits from data.
This is also why it’s hard to improve neural network architectures — the simulation argument applies. Your new architecture can usually be straightforwardly simulated by the old one, so the improvement is small. The exceptions are when the old architecture has a genuine bottleneck: for example, the RNN’s fixed-size hidden state makes it hard to simulate a Transformer’s attention mechanism.
Ilya notes: had we found a way to engineer an RNN with a very large hidden state, perhaps it would have matched the Transformer.
Bigger neural networks approximate the Kolmogorov compressor better. This may be why we like scaling: larger models have less regret as compressors, extracting more predictive value from data.
iGPT: Validation in the Image Domain
To validate this theory beyond text (where next-token prediction can be explained without invoking compression), Ilya and collaborators tested it in the image domain with iGPT (2020):
- Take an image, flatten it into a sequence of pixels, discretize intensity values
- Train a Transformer on next-pixel prediction — the same autoregressive objective as GPT, but on pixels
- Evaluate unsupervised learning quality via linear probe accuracy: pick the best internal layer, fit a linear classifier, measure downstream performance
Results on CIFAR-10: as the model gets better at next-pixel prediction, linear probe accuracy improves monotonically. With a 6B parameter model on 64×64 images, they achieved 99% on CIFAR-10 — remarkably strong for a method that never sees any labels during pretraining.
The result confirmed: better compression → better unsupervised representations, even in a domain (vision) where the connection to text-based next-token prediction is non-obvious.
The Mystery of Linear Representations
The compression theory explains why unsupervised learning extracts useful features, but it does not explain why those features are linearly separable. Linear probes work surprisingly well on intermediate representations, and the reason “must be deep and profound.”
One interesting observation: autoregressive models produce better linear representations than BERT-style models. Ilya’s speculation for why:
- In next-pixel prediction, predicting pixel \(t+1\) from pixels \(1, \ldots, t\) sometimes requires understanding long-range structure (e.g., predicting a pixel at the start of a new row from the entire preceding image context)
- In BERT, you mask ~25% of tokens and predict them from context on both sides. Any prediction can be done well by looking a little into the past and a little into the future — the local context suffices
- BERT removes the hardest prediction tasks, the ones that force the model to build global representations. Autoregressive models must solve these hard cases, which drives the formation of more informative features
This also implies that diffusion models, which are conceptually similar to BERT (denoising from a corrupted version), should have worse linear representations than autoregressive models — for the same reason.
Q&A Highlights
On the limits of the analogy. The biggest gap between Kolmogorov complexity and neural networks is the search procedure. Kolmogorov complexity enumerates all programs from scratch each time; SGD is an incremental, order-dependent search. Data ordering matters for neural networks precisely because SGD is a weak search procedure — early features get “baked in” and persist throughout training. Ilya cautions that the analogy is “very clearly not universally applicable” but is valuable for explaining where unsupervised learning comes from.
On VC dimension. A questioner defends VC dimension by citing a 2007 result on PAC learning of quantum states, where the VC dimension is exponentially smaller than the log of the hypothesis class size. Ilya concedes this is a cool example where VC dimension remains essential.
On energy-based models. Energy-based models offer another way to turn neural networks into probability distributions — assign an energy to each configuration, then normalize. Ratios of distributions correspond to differences in energy. This connects to the distinguishability ↔ prediction equivalence from cryptography.