Kolmogorov complexity and the theoretical foundation of learning
A short program that fits the data has captured all conceivable regularities.
If no short program exists, the data is essentially random — nothing to learn.
Theorem (informal, in the spirit of Solomonoff induction): the shortest program consistent with the training data gives the best possible generalization to unseen data.
But searching over short programs is computationally intractable...
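Kolmogorov complexity itself is uncomputable, but any off-the-shelf compressor gives a computable upper bound on it, which makes the "short program" intuition concrete. A minimal sketch using Python's `zlib` (the choice of compressor is illustrative, not part of the original argument):

```python
import os
import zlib

# The compressed length upper-bounds Kolmogorov complexity: a compressor is
# one particular (crude) stand-in for the uncomputable shortest program.
regular = b"ab" * 5000        # highly regular: a tiny "program" generates it
random_ = os.urandom(10000)   # incompressible with overwhelming probability

print(len(zlib.compress(regular)))   # tiny: the regularity was captured
print(len(zlib.compress(random_)))   # close to 10000: nothing to learn
```

The gap between the two compressed lengths is exactly the "regularity" the slide refers to: the regular string has one, the random string does not.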
If we can't search programs, what about circuits?
A neural network is a parameterized circuit — each layer is a set of gates.
Neural network training = a search over this space of circuits for one that fits the data.
Gradient descent searches the space of circuits efficiently, despite the landscape being non-convex.
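A minimal illustration of this circuit search: plain gradient descent on a two-layer network fitting XOR, a function no single linear unit can compute. The loss surface is non-convex, yet descent routinely finds a fitting circuit. All sizes, the seed, and the learning rate here are illustrative choices, not from the talk:

```python
import numpy as np

# Tiny "circuit search": gradient descent on a 2-layer network fitting XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # hidden layer: 8 "gates"
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # output layer
lr = 0.5

for step in range(5000):
    h = np.tanh(X @ W1 + b1)             # one parallel layer: all units at once
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))
    loss = np.mean((p - y) ** 2)
    # Backprop: manual gradients of the squared error.
    dz = 2 * (p - y) / len(X) * p * (1 - p)
    dW2 = h.T @ dz;  db2 = dz.sum(0)
    dpre = dz @ W2.T * (1 - h ** 2)
    dW1 = X.T @ dpre; db1 = dpre.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2; W1 -= lr * dW1; b1 -= lr * db1

print(loss)  # close to zero if the run converged
```

Nothing here explains *why* the search succeeds; it only demonstrates that it does, which is the mystery the slide points at.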
Depth = sequential steps, width = parallelism within each step
All neurons in one layer fire simultaneously — the layer is one parallel step.
A 50-layer network gets 50 sequential operations, each applied to thousands of values at once.
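The depth-versus-width picture can be sketched directly: each layer is one matrix multiply, applied to every unit simultaneously. The sizes below (50 layers of 256 units, random He-scaled weights) are hypothetical, chosen only to mirror the slide's framing:

```python
import numpy as np

# Depth = sequential steps, width = parallelism within each step.
rng = np.random.default_rng(0)
x = rng.normal(size=256)

for layer in range(50):                        # 50 sequential steps (depth)
    W = rng.normal(size=(256, 256)) * np.sqrt(2 / 256)
    x = np.maximum(W @ x, 0.0)                 # all 256 units update at once (width)

print(x.shape)  # each of the 50 steps was a single fully parallel operation
```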
Example: sorting $n$ numbers takes only $O(\log^2 n)$ parallel compare-and-swap rounds with a practical sorting network (e.g. bitonic sort); $O(\log n)$ is achievable in theory via the AKS network.
A medium-sized network can learn to sort — the sorting "circuit" fits in a shallow depth.
50 steps of massively parallel computation can express significant logic and reasoning.
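The round count can be made concrete with a bitonic sorting network: all comparators within one pass touch disjoint pairs, so each pass is one parallel round. This sketch runs the network sequentially but counts the rounds; the implementation and round-counting are illustrative, not from the talk:

```python
def bitonic_sort(a):
    """Sort a list whose length is a power of two using a bitonic network.
    Comparators inside one (k, j) pass are data-independent, so each pass
    is one parallel round; the total is log2(n) * (log2(n) + 1) / 2."""
    a, n, rounds = list(a), len(a), 0
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):           # independent comparators: one round
                l = i ^ j
                if l > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[l]) == ascending:
                        a[i], a[l] = a[l], a[i]
            rounds += 1
            j //= 2
        k *= 2
    return a, rounds

data = [7, 3, 15, 0, 9, 1, 12, 5, 2, 14, 8, 4, 11, 6, 13, 10]
print(bitonic_sort(data))  # sorted list, in 10 rounds for n = 16
```

For $n = 16$ the network uses $4 \cdot 5 / 2 = 10$ rounds, comfortably inside the budget of a 50-layer network.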
That this works at all — that backprop finds good circuits in a vast, non-convex landscape — remains mostly a mystery.
Three levels of understanding
| Level | Key object | Status |
| --- | --- | --- |
| Short programs give optimal generalization | $K(x)$ = Kolmogorov complexity | ✘ computationally intractable |
| Small circuits found via backprop | $\nabla_\theta \mathcal{L}$ guides the search | ✔ feasible via gradient descent |
| Why does gradient descent find good circuits? | "Great variety in natural data" | ⚠ unexplained; powers all of AI |
This is the foundation upon which everything else in the talk is built.