Short Programs = Best Generalization

Kolmogorov complexity and the theoretical foundation of learning

A short program that fits the data has, by necessity, captured every regularity present in the data.

If no short program exists, the data is essentially random — nothing to learn.

Theorem (informal, in the spirit of Solomonoff induction): the shortest program that fits the training data yields the best possible generalization.

But finding the shortest program is uncomputable, and even bounded search over short programs is intractable...
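A crude but computable proxy makes the idea concrete: a general-purpose compressor gives an upper bound on description length, so structured data compresses far below its raw size while random data does not. A minimal sketch, using Python's `zlib` as a stand-in for the uncomputable $K(x)$:

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Upper bound on description length: a compressor is a crude,
    computable stand-in for the uncomputable Kolmogorov complexity K(x)."""
    return len(zlib.compress(data, level=9))

structured = b"0123456789" * 100   # 1000 bytes generated by an obvious short program
random_ish = os.urandom(1000)      # 1000 bytes with (almost surely) no short program

print(compressed_size(structured))  # tiny: the regularity is captured
print(compressed_size(random_ish))  # near 1000: nothing to learn
```

The gap between the two numbers is the compressor's estimate of how much learnable regularity each string contains.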

Backprop = Small Circuit Search

If we can't search programs, what about circuits?

A neural network is a parameterized circuit — each layer is a set of gates.

$$F(x_i;\theta) = y_i \quad \Longrightarrow \quad \text{backprop searches for a small circuit that fits}$$

Neural network training = solving the "neural equation": find $\theta$ such that $F(x_i;\theta) = y_i$ on every training pair.

Gradient descent searches the space of circuits efficiently, despite the landscape being non-convex.
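A minimal NumPy sketch of this circuit search: a tiny two-layer "circuit" of tanh gates is fit to $y = x^2$ by hand-written backprop. The specifics (8 hidden gates, learning rate 0.1, 2000 steps) are illustrative choices, not anything prescribed by the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the neural equation F(x_i; theta) = y_i, for y = x^2.
X = np.linspace(-1, 1, 64).reshape(-1, 1)
Y = X ** 2

# A small parameterized circuit: 1 -> 8 -> 1, with tanh gates.
W1 = rng.normal(0, 0.5, (1, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)

def forward(X):
    H = np.tanh(X @ W1 + b1)       # one parallel layer of gates
    return H, H @ W2 + b2

lr = 0.1
_, pred = forward(X)
initial_loss = float(np.mean((pred - Y) ** 2))

for _ in range(2000):
    H, pred = forward(X)
    grad_out = 2 * (pred - Y) / len(X)           # dL/d(pred) for mean squared error
    gW2 = H.T @ grad_out; gb2 = grad_out.sum(0)
    grad_h = (grad_out @ W2.T) * (1 - H ** 2)    # backprop through tanh
    gW1 = X.T @ grad_h; gb1 = grad_h.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, pred = forward(X)
final_loss = float(np.mean((pred - Y) ** 2))
print(initial_loss, final_loss)
```

Despite the non-convex loss surface, plain gradient descent drives the error down by orders of magnitude: the search over circuit parameters is feasible in a way that search over programs is not.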

A 50-Layer Network = 50-Step Parallel Computer

Depth = sequential steps, width = parallelism within each step

All neurons in one layer fire simultaneously — the layer is one parallel step.

A 50-layer network gets 50 sequential operations, each applied to thousands of values at once.

Example: sorting $n$ numbers requires only $O(\log n)$ parallel compare-and-swap rounds.

A medium-sized network can learn to sort — the sorting "circuit" fits in a shallow depth.
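To make the round count concrete, here is a sketch of a bitonic sorting network, which needs $O(\log^2 n)$ parallel compare-and-swap rounds (the tighter $O(\log n)$ bound comes from the far less practical AKS construction). Every comparator within a round touches a disjoint pair, so each round is one parallel step, i.e. one "layer":

```python
def bitonic_rounds(n):
    """Parallel compare-and-swap rounds of a bitonic sorting network.
    n must be a power of two. Comparators within one round act on
    disjoint pairs, so each round is a single parallel step."""
    rounds = []
    k = 2
    while k <= n:                  # stages that merge growing bitonic runs
        j = k >> 1
        while j >= 1:              # each j-value is one parallel round
            pairs = [(i, i ^ j, (i & k) == 0)
                     for i in range(n) if (i ^ j) > i]
            rounds.append(pairs)
            j >>= 1
        k <<= 1
    return rounds

def bitonic_sort(values):
    a = list(values)
    for pairs in bitonic_rounds(len(a)):
        for i, p, ascending in pairs:   # all pairs in a round are independent
            if (a[i] > a[p]) == ascending:
                a[i], a[p] = a[p], a[i]
    return a

print(len(bitonic_rounds(1024)))  # 55 rounds suffice for 1024 numbers
print(bitonic_sort([9, 3, 7, 1, 6, 0, 4, 2]))
```

Sorting 1024 numbers takes only 55 rounds here, so the whole network comfortably fits in the depth budget of a 50-ish-layer model.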

50 steps of massively parallel computation can express significant logic and reasoning.

That this works at all — that backprop finds good circuits in a vast, non-convex landscape — remains mostly a mystery.

Why Neural Networks Work: The Full Picture

Three levels of understanding

| Level | Claim | Formalism | Status |
|---|---|---|---|
| Theory | Short programs give optimal generalization | $K(x)$ = Kolmogorov complexity | ✘ Computationally intractable |
| Practice | Small circuits found via backprop | $\nabla_\theta \mathcal{L}$ guides the search | ✔ Feasible via gradient descent |
| Mystery | Why does gradient descent find good circuits? | "Great variety in natural data" | ⚠ Unexplained — powers all of AI |

This is the foundation upon which everything else in the talk is built.