Ilya Sutskever: An Observation on Generalization

Presenter: Ilya Sutskever
Host Institute: Simons Institute @ Berkeley
Host: Yejin Choi
This post distills a talk by Ilya Sutskever titled An Observation on Generalization, delivered at the Simons Institute. Ilya describes results from OpenAI circa 2016 that shaped his thinking about unsupervised learning. The central claim: compression provides a mathematical foundation for unsupervised learning, analogous to the guarantees that PAC/statistical learning theory provides for supervised learning.

Supervised vs. Unsupervised Learning

Why Supervised Learning Works

Supervised Learning
  • Low training error + more training data than "degrees of freedom" = low test error
$$\Pr_{S \sim D^{\lvert S \rvert}}\!\left[\text{Test}_D(f) - \text{Train}_S(f) \leq \sqrt{\frac{\log \lvert \mathcal{F} \rvert + \log 1/\delta}{\lvert S \rvert}} \;\; \forall f \!\in\! \mathcal{F}\right] > 1 - \delta$$
$$\begin{aligned}\Pr\!\left[\text{Test}_D(f) - \text{Train}_S(f) > t ,\; \exists\, f \!\in\! \mathcal{F}\right] &\leq \textstyle\sum_{f \in \mathcal{F}} \Pr\!\left[\text{Test}_D(f) - \text{Train}_S(f) > t\right] \\ &\leq \lvert \mathcal{F} \rvert \exp\!\left(-\lvert S \rvert\, t^2\right)\end{aligned}$$

The starting point is a question we rarely ask anymore: why should learning work at all? Why should data have regularity that machine learning models can capture?

Supervised learning has a clean answer. Statistical learning theory (PAC learning, VC theory) gives a precise condition under which learning must succeed: if you achieve low training error and your model's degrees of freedom are fewer than the number of training examples, you are guaranteed low test error.

The proof is elementary — the slide above shows all the math. For a finite function class \(\mathcal{F}\), the probability that training error deviates far from test error for some function in the class is bounded by a union bound over \(\lvert \mathcal{F} \rvert\) followed by a Hoeffding-type concentration inequality. Three lines. The key assumption: the training distribution and the test distribution must be the same.

Ilya makes a side comment on VC dimension: it was invented solely to handle parameters with infinite precision. In practice, all our floats are finite precision (and shrinking — FP16, INT8, …), so you can reduce back to the finite function class bound, which gives you essentially the same generalization guarantees with a simpler proof.
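
Ilya's finite-precision point can be made concrete. If a model has \(n\) parameters stored at \(b\) bits each, the function class is finite with \(\log_2 \lvert \mathcal{F} \rvert = nb\), and the bound from the slide can be evaluated numerically. A toy calculation (the sizes are illustrative, not a claim about any real model):

```python
import math

def generalization_gap_bound(num_params: int, bits_per_param: int,
                             num_samples: int, delta: float = 0.01) -> float:
    """Finite-class bound from the slide:
    gap <= sqrt((ln|F| + ln(1/delta)) / |S|), with |F| = 2^(num_params * bits_per_param).
    """
    ln_F = num_params * bits_per_param * math.log(2)
    return math.sqrt((ln_F + math.log(1 / delta)) / num_samples)

# Halving the precision (32-bit -> 16-bit floats) halves log|F|
# and tightens the guaranteed gap, as the side comment suggests.
g32 = generalization_gap_bound(10**6, 32, 10**9)
g16 = generalization_gap_bound(10**6, 16, 10**9)
```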

The Mystery of Unsupervised Learning

Unsupervised Learning is confusing
  • Very much unlike supervised learning
  • You optimize one objective
    • Say reconstruction or next word prediction
  • But you care about a different objective

Supervised learning is conceptually settled. But what about unsupervised learning?

The old dream of unsupervised learning — look at images or text without labels and somehow discover the true hidden structure — has been fulfilled empirically: GPT, BERT, and diffusion models all testify to this. But why should it work? There is no analogue of the supervised learning guarantee. You optimize one objective (reconstruction, denoising, next-token prediction) but care about a different one (downstream task performance). You have no mathematical reason to expect the first to help with the second.

Consider the pathological case: if your data is drawn from the uniform distribution, no unsupervised learning algorithm can extract useful structure. So unsupervised learning clearly requires assumptions about the data. The question is: what assumptions, and can we formalize them?

Distribution Matching

Digression: Unsupervised Learning via Distribution Matching
  • Got data X and Y, no correspondence
  • Find F so that distribution(F(X)) ≈ distribution(Y)
  • Works for high-D X and Y's
    • Substitution ciphers
    • Unsupervised machine translation
  • Critically: it's clear why Distribution Matching has to work

Before presenting the main argument, Ilya describes one form of unsupervised learning that provably works: distribution matching.

Given two data sources \(X\) and \(Y\) with no correspondence between them (e.g., English sentences and French sentences), find a function \(f\) such that the distribution of \(f(X)\) matches the distribution of \(Y\):

\[f^\ast = \arg\min_f \, D\big(P_{f(X)},\; P_Y\big)\]

If \(X\) and \(Y\) are high-dimensional, this constraint is highly informative — it almost fully determines \(f\). This is guaranteed to work in the same sense as supervised learning: the constraint is sufficient to recover the mapping.

This framework covers unsupervised machine translation, substitution ciphers, and any setting where two distributions are related by a structured transformation. Ilya independently discovered this idea in 2015 and was excited that something mathematically meaningful could be said about unsupervised learning. But real unsupervised learning settings are rarely this clean.
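
A one-dimensional illustration of why the constraint is so informative: a Caesar shift (the simplest substitution cipher) can be recovered purely by matching letter-frequency distributions, with no aligned plaintext/ciphertext pairs. The frequency table below is a rough approximation of English; the exact values don't matter for the argument:

```python
from collections import Counter

# Rough English letter frequencies (percent); approximate values suffice.
ENGLISH_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
    'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
    'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
    'k': 0.8, 'j': 0.15, 'x': 0.15, 'q': 0.1, 'z': 0.07,
}

def shift(text: str, k: int) -> str:
    return ''.join(chr((ord(c) - 97 + k) % 26 + 97) if c.isalpha() else c
                   for c in text)

def crack_caesar(ciphertext: str) -> int:
    """Pick the shift whose letter distribution best matches English:
    pure distribution matching, no plaintext/ciphertext correspondence."""
    def score(k: int) -> float:
        letters = [c for c in shift(ciphertext, -k) if c.isalpha()]
        counts, n = Counter(letters), len(letters)
        # chi-squared distance between candidate distribution and English
        return sum((100 * counts[l] / n - f) ** 2 / f
                   for l, f in ENGLISH_FREQ.items())
    return min(range(26), key=score)

plain = "the unsupervised learner sees only the ciphertext distribution"
cipher = shift(plain, 7)
k = crack_caesar(cipher)   # recovers the shift without any aligned examples
```

With high-dimensional \(X\) and \(Y\) (full substitution ciphers, whole sentences), the matching constraint becomes far stronger, which is the point of the slide.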

Compression as the Foundation of Unsupervised Learning

The Thought Experiment

Compression for reasoning about unsupervised learning
  • Say you have datasets X and Y
  • You have a good compression algorithm C(data)
  • And say you compress X and Y jointly
  • What will a "sufficiently good compressor" do?
    • Use patterns that exist in X to help compress Y!

It is well known that compression and prediction are equivalent: every compressor can be converted to a predictor and vice versa (Shannon’s source coding theorem). But Ilya argues that for reasoning about unsupervised learning, the language of compression offers conceptual advantages.

Say you have two datasets \(X\) and \(Y\) — two files on disk. You have a really good compressor \(C\). Consider what happens when you compress \(X\) and \(Y\) jointly (concatenate them, then compress) versus separately:

\[C(X \circ Y) \leq C(X) + C(Y)\]

A sufficiently good compressor will use patterns in \(X\) to help compress \(Y\), and vice versa. The gap

\[\Delta = \big[C(X) + C(Y)\big] - C(X \circ Y)\]

is the shared structure — the algorithmic mutual information between the two datasets. The better the compressor, the more shared structure it extracts.

This is the key: let \(X\) be your unsupervised data and \(Y\) be your supervised task data. Suddenly you have a mathematical reason for patterns in \(X\) to help with \(Y\).

This also generalizes distribution matching. If \(X\) is language 1 and \(Y\) is language 2, and there exists a simple function \(f\) transforming one into the other, a good compressor will discover that mapping internally — because exploiting it reduces the compressed length.
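
The inequality and the gap \(\Delta\) are easy to observe with an off-the-shelf compressor. The sketch below draws two independent samples from the same toy "source" and compresses them with zlib; the exact byte counts depend on zlib internals, but the gap is positive because the two files share statistical structure:

```python
import random
import zlib

random.seed(0)
vocab = ["compression", "prediction", "structure", "pattern",
         "dataset", "model", "program", "theory"]

def sample_text(n_words: int) -> bytes:
    # Two independent samples from the same "source": shared statistics,
    # no alignment between them.
    return " ".join(random.choice(vocab) for _ in range(n_words)).encode()

X = sample_text(400)
Y = sample_text(400)

separate = len(zlib.compress(X)) + len(zlib.compress(Y))   # C(X) + C(Y)
joint = len(zlib.compress(X + Y))                          # C(X ∘ Y)
gap = separate - joint    # shared structure the compressor exploits
```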

Kolmogorov Complexity: The Ultimate Compressor

Kolmogorov complexity as the ultimate compressor
  • If C is a computable compressor, then:
For all X,    $$K(X) < \lvert C(X) \rvert + K(C) + O(1)$$
  • Proof: the simulation argument

The slide shows the core theorem of Kolmogorov complexity: for any computable compressor C, the Kolmogorov-compressed length \(K(X)\) is bounded by the compressed output \(\lvert C(X)\rvert\) plus the description length of C itself, \(K(C)\), plus a constant. The proof is the simulation argument: the universal Turing machine can run C’s code to decompress, so it only needs C’s program plus C’s output. This single idea — that the best compressor can simulate all others — is the foundation of everything that follows.

To formalize “a really good compressor”, Ilya invokes Kolmogorov complexity: the length of the shortest program that outputs a given string \(x\):

\[K(x) = \min_{p:\; U(p) = x} \lvert p \rvert\]

where \(U\) is a universal Turing machine. The Kolmogorov compressor has a remarkable property: it has minimal regret relative to any other compressor. Specifically:

\[K(x) \leq C(x) + \lvert \text{program implementing } C \rvert\]

for any compressor \(C\). The proof is a simulation argument: the Kolmogorov compressor can simulate any compressor \(C\) by running \(C\)’s source code. You just pay the one-time cost of specifying \(C\)’s program. This is also why it’s not computable — it simulates all programs — but it is the best possible compressor.

Conditional Kolmogorov Complexity as the Solution

Can we formalize this?
  • Consider an ML algorithm, A, that tries to compress Y
  • Say it has access to X
  • What's our regret of using this algorithm?
    • And regret relative to what?
  • Low regret = "we got all the value" out of the unlabelled data X
    • And nobody could get much more value than we did!

The slide above frames unsupervised learning as a regret minimization problem. Given an ML algorithm A that tries to compress (predict) your labeled data Y while having access to unlabeled data X, the question is: how much prediction quality are you leaving on the table? Low regret means you extracted all the value from X that any algorithm could — even if X turns out to be useless (e.g., uniform noise), you know you haven’t missed anything.

Now generalize to conditional Kolmogorov complexity:

\[K(Y \mid X) = \min_{p:\; U(p, X) = Y} \lvert p \rvert\]

the length of the shortest program that outputs \(Y\) if it is allowed to probe \(X\). Using this to compress your supervised task data \(Y\) while conditioning on unsupervised data \(X\) is definitionally the optimal unsupervised learning algorithm: no algorithm can extract more predictive value from \(X\) for the task \(Y\).

If \(K(Y \mid X)\) is much smaller than \(K(Y)\), the unsupervised data is highly informative for your task. If \(K(Y \mid X) \approx K(Y)\), the unsupervised data is useless (e.g., the uniform distribution case). Either way, using the conditional Kolmogorov compressor means you have zero regret — you’ve done the best possible job of leveraging unlabeled data.

Ilya phrases this beautifully: “you can sleep soundly at night knowing that no one could have done a better job at benefiting from your unlabeled data.”
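
zlib offers a rough operational analogue of conditioning: a preset dictionary. The compressor below is "conditioned" on \(X\) and only pays for what \(Y\) adds beyond it (a sketch; zlib's 32 KB window limits how much of \(X\) can actually be used, and the strings are toy stand-ins for real datasets):

```python
import zlib

# X: the unlabeled corpus; Y: task data sharing structure with X.
X = b"unsupervised learning extracts structure from unlabeled data; " * 50
Y = b"supervised learning, in contrast, extracts structure from labeled data"

unconditional = zlib.compress(Y)                   # ~ C(Y)

comp = zlib.compressobj(zdict=X[-32768:])          # condition on a window of X
conditional = comp.compress(Y) + comp.flush()      # ~ C(Y | X)

deco = zlib.decompressobj(zdict=X[-32768:])
restored = deco.decompress(conditional) + deco.flush()
```

When \(Y\) shares structure with \(X\), the conditional code is shorter; when it shares nothing, conditioning gains nothing — mirroring the \(K(Y \mid X) \ll K(Y)\) versus \(K(Y \mid X) \approx K(Y)\) cases above.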

Conditional Kolmogorov complexity as the solution
  • If C is a computable compressor, then:
For all x,    $$K(Y \vert X) < \lvert C(Y \vert X) \rvert + K(C) + O(1)$$
  • Conditioning on a dataset, not an example
  • Will extract all "value" out of X for predicting Y

The slide above formalizes the conditional Kolmogorov theorem: for any computable compressor C, the Kolmogorov compressor conditioned on X is at least as good, up to the constant overhead of specifying C’s code. Two details from the slide deserve emphasis: we condition on the entire dataset X, not a single example — this is what makes it a theory of unsupervised learning rather than few-shot learning. And the theorem guarantees that this conditional compressor will extract all predictive value from X for the task Y.

From Conditional to Joint Compression

There is a practical issue: conditioning on a massive dataset \(X\) is unnatural in a machine learning context. We can fit large datasets but we can’t truly condition on them (not yet, at least).

The resolution: joint compression of the concatenation \(X \circ Y\) achieves the same quality as conditional compression \(K(Y \mid X)\). The theoretical basis is the chain rule of Kolmogorov complexity (also called the symmetry of algorithmic information):

\[K(X, Y) = K(X) + K(Y \mid X) + O(\log n)\]

This means the joint compressed length \(K(X, Y)\) decomposes into the cost of compressing \(X\) alone plus the conditional cost of compressing \(Y\) given \(X\), up to a logarithmic term. A good joint compressor of \(X \circ Y\) will automatically discover and exploit the shared structure — it effectively performs conditional compression internally. The only “waste” is \(K(X)\), which you’d pay regardless, plus the negligible \(O(\log n)\) overhead.

Ilya notes in the talk that “proving this is slightly more tricky” and skips the proof, but the chain rule is a classical result in algorithmic information theory (Li & Vitányi, 2008).

And joint compression maps directly to maximum likelihood training: the total log-likelihood over a dataset is the cost of compressing that dataset, and adding more data sources just adds terms to the sum.

\[-\sum_i \log p_\theta(x_i) \;\longleftrightarrow\; \text{compression cost of the dataset under model } \theta\]

So the practical recipe is simple: take all your data, concatenate it, and train a model to maximize likelihood. This is what we already do — but now we have a theoretical justification.
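
The correspondence is checkable: under any probabilistic model, the (arithmetic-coding) code length of a string is its negative log-likelihood in bits. A toy character-level unigram model with Laplace smoothing (an illustrative model, not one from the talk):

```python
import math
from collections import Counter

train = "take all your data, concatenate it, and train a model to maximize likelihood. "
test_str = "concatenate the data and maximize likelihood "

# Fit a Laplace-smoothed character unigram model on the training text.
counts = Counter(train)
alphabet = set(train) | set(test_str)
total = len(train) + len(alphabet)
prob = {c: (counts[c] + 1) / total for c in alphabet}

# Compression cost of the test string = -sum log2 p(c), in bits.
model_bits = -sum(math.log2(prob[c]) for c in test_str)
uniform_bits = len(test_str) * 8     # naive 8-bits-per-character baseline
```

A better model assigns higher likelihood and therefore fewer bits; maximizing likelihood over the concatenated data is exactly minimizing its compression cost.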

Neural Networks and Empirical Validation

Neural Networks as Approximate Kolmogorov Compressors

The Kolmogorov compressor searches over all programs. SGD over neural networks searches over a restricted class of programs — but this class is remarkably expressive.

Neural networks are circuits. Circuits are computers. SGD is program search over these computing machines. All of deep learning rides on the “SGD miracle”: that gradient descent can actually find good circuits from data.

This is also why it’s hard to improve neural network architectures — the simulation argument applies. Your new architecture can usually be straightforwardly simulated by the old one, so the improvement is small. The exceptions are when the old architecture has a genuine bottleneck: for example, the RNN’s fixed-size hidden state makes it hard to simulate a Transformer’s attention mechanism.

Ilya notes: had we found a way to engineer an RNN with a very large hidden state, perhaps it would have matched the Transformer.

Bigger neural networks approximate the Kolmogorov compressor better. This may be why we like scaling: larger models have less regret as compressors, extracting more predictive value from data.

iGPT: Validation in the Image Domain

To validate this theory beyond text (where next-token prediction can be explained without invoking compression), Ilya and collaborators tested it in the image domain with iGPT (2020):

  • Take an image, flatten it into a sequence of pixels, discretize intensity values
  • Train a Transformer on next-pixel prediction — the same autoregressive objective as GPT, but on pixels
  • Evaluate unsupervised learning quality via linear probe accuracy: pick the best internal layer, fit a linear classifier, measure downstream performance
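
The probe itself is simple: freeze the features, fit a linear classifier, report accuracy. The sketch below runs a perceptron probe on fabricated two-dimensional "features" standing in for a pretrained model's activations (synthetic data, purely illustrative):

```python
import random

random.seed(0)

# Synthetic "frozen features": two blobs standing in for the activations
# of a pretrained model's best layer (fabricated data).
def blob(cx, cy, label, n=100):
    return [((cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5)), label)
            for _ in range(n)]

data = blob(-2.0, 0.0, 0) + blob(2.0, 0.0, 1)
random.shuffle(data)
train, test = data[:150], data[150:]

# Linear probe: a perceptron trained on the frozen features.
w, b = [0.0, 0.0], 0.0
for _ in range(20):
    for (x1, x2), y in train:
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        err = y - pred                      # in {-1, 0, 1}
        w[0] += err * x1; w[1] += err * x2; b += err

accuracy = sum((1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == y
               for (x1, x2), y in test) / len(test)
```

The probe's capacity is fixed, so accuracy changes only when the features themselves improve — which is why it is a clean measure of representation quality.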

Results on CIFAR-10: as the model gets better at next-pixel prediction, linear probe accuracy improves monotonically. With a 6B parameter model on 64×64 images, they achieved 99% on CIFAR-10 — remarkably strong for a method that never sees any labels during pretraining.

The result confirmed: better compression → better unsupervised representations, even in a domain (vision) where the connection to text-based next-token prediction is non-obvious.

The Mystery of Linear Representations

The compression theory explains why unsupervised learning extracts useful features, but it does not explain why those features are linearly separable. Linear probes work surprisingly well on intermediate representations, and the reason “must be deep and profound.”

One interesting observation: autoregressive models produce better linear representations than BERT-style models. Ilya’s speculation for why:

  • In next-pixel prediction, predicting pixel \(t+1\) from pixels \(1, \ldots, t\) sometimes requires understanding long-range structure (e.g., predicting a pixel at the start of a new row from the entire preceding image context)
  • In BERT, you mask ~25% of tokens and predict them from context on both sides. Any prediction can be done well by looking a little into the past and a little into the future — the local context suffices
  • BERT removes the hardest prediction tasks, the ones that force the model to build global representations. Autoregressive models must solve these hard cases, which drives the formation of more informative features

This also implies that diffusion models, which are conceptually similar to BERT (denoising from a corrupted version), should have worse linear representations than autoregressive models — for the same reason.

Q&A Highlights

On the limits of the analogy. The biggest gap between Kolmogorov complexity and neural networks is the search procedure. Kolmogorov complexity enumerates all programs from scratch each time; SGD is an incremental, order-dependent search. Data ordering matters for neural networks precisely because SGD is a weak search procedure — early features get “baked in” and persist throughout training. Ilya cautions that the analogy is “very clearly not universally applicable” but is valuable for explaining where unsupervised learning comes from.

On VC dimension. A questioner defends VC dimension by citing a 2007 result on PAC learning of quantum states, where the VC dimension is exponentially smaller than the log of the hypothesis class size. Ilya concedes this is a cool example where VC dimension remains essential.

On energy-based models. Energy-based models offer another way to turn neural networks into probability distributions — assign an energy to each configuration, then normalize. Ratios of distributions correspond to differences in energy. This connects to the distinguishability ↔ prediction equivalence from cryptography.
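
Spelled out in standard energy-based-model notation (not from the slides), the construction is:

\[p(x) = \frac{e^{-E_p(x)}}{Z_p}, \qquad Z_p = \sum_{x'} e^{-E_p(x')}, \qquad \log \frac{p(x)}{q(x)} = E_q(x) - E_p(x) + \log \frac{Z_q}{Z_p}\]

The log-ratio of two energy-based distributions reduces to an energy difference plus a constant independent of \(x\), which is why ratios stay tractable even when the normalizers \(Z\) are not.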
