Kolmogorov Complexity
Kolmogorov complexity measures the intrinsic information content of an individual object — not in terms of probability (like Shannon entropy), but in terms of the length of the shortest program that produces it. It is a foundational concept in algorithmic information theory, with deep connections to computability, statistics, learning theory, and even philosophy.
Textbook: Ming Li and Paul Vitányi, An Introduction to Kolmogorov Complexity and Its Applications (Springer, 4th edition, 2019). The definitive reference, covering theory, applications, and connections to statistics and learning.
Recommended reading:
- Scholarpedia — Algorithmic Complexity
- Wikipedia — Kolmogorov Complexity — Unusually good for a Wikipedia article; well-sourced.
- Greg Chaitin’s home page — Original papers and expository essays from one of the co-founders.
Definition
Fix a universal Turing machine \(U\). The Kolmogorov complexity of a finite binary string \(x\) is:
\[K(x) = \min \{ \vert p \vert : U(p) = x \}\]

where \(\vert p \vert\) denotes the length of program \(p\) in bits. In words: \(K(x)\) is the length of the shortest program that, when run on \(U\), outputs exactly \(x\) and halts.
Key properties:
- Invariance theorem: The choice of universal Turing machine matters only up to an additive constant. If \(U_1\) and \(U_2\) are two universal machines, then \(\vert K_{U_1}(x) - K_{U_2}(x) \vert \le c\) for all \(x\), where \(c\) depends only on \(U_1, U_2\).
- Upper bound: \(K(x) \le \vert x \vert + c\) — we can always just hardcode \(x\).
- Incompressibility: Most strings are incompressible. For any \(n\), at least \(2^n - 2^{n-c} + 1\) strings of length \(n\) satisfy \(K(x) \ge n - c\).
- Uncomputability: \(K(x)\) is not computable. This follows from a Berry-paradox-style argument. We cannot write a program that, given \(x\), outputs \(K(x)\).
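The incompressibility bound follows from a counting argument worth spelling out: there are simply fewer short programs than long strings.

\[\#\{p : \vert p \vert < n - c\} \le \sum_{i=0}^{n-c-1} 2^i = 2^{n-c} - 1\]

Since each program outputs at most one string, at most \(2^{n-c} - 1\) of the \(2^n\) strings of length \(n\) can have \(K(x) < n - c\); the remaining \(2^n - 2^{n-c} + 1\) must satisfy \(K(x) \ge n - c\).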
Kolmogorov Complexity vs. Shannon Entropy
Shannon entropy and Kolmogorov complexity both measure “information,” but from fundamentally different angles:
| | Shannon Entropy | Kolmogorov Complexity |
|---|---|---|
| Object | Random variable / distribution | Individual string |
| Measure | Expected code length | Shortest program length |
| Computable? | Yes (given the distribution) | No |
| Depends on | Probability model | Universal Turing machine (up to constant) |
The deep connection: for a random variable \(X\) taking values in \(\{0,1\}^n\) with a computable distribution, the expected Kolmogorov complexity equals the Shannon entropy up to an additive constant that depends on (the complexity of) the distribution:

\[\mathbb{E}[K(X)] = H(X) + O(1)\]

In other words, Shannon entropy is the average-case version of Kolmogorov complexity. Shannon tells you about the source; Kolmogorov tells you about the individual output.
This distinction matters. A string of a billion zeros, drawn from the uniform distribution over billion-bit strings, comes from a maximum-entropy source, yet its Kolmogorov complexity is near zero — a tiny program generates it. Kolmogorov complexity captures the intuitive notion that this string is “simple” regardless of where it came from.
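Real-world compressors make this concrete: the compressed length of a string is a computable upper bound on its Kolmogorov complexity (up to the constant-size decompressor). A minimal sketch using Python's `zlib`:

```python
import os
import zlib

def compressed_len(data: bytes) -> int:
    """Length of zlib-compressed data: a computable upper bound
    on K(data), up to the constant-size decompressor."""
    return len(zlib.compress(data, 9))

zeros = b"\x00" * 1_000_000          # highly regular: tiny shortest description
noise = os.urandom(1_000_000)        # almost certainly incompressible

print(compressed_len(zeros))  # orders of magnitude below a million bytes
print(compressed_len(noise))  # close to the full million bytes
```

Note the asymmetry: compression can certify that a string is simple, but it can never certify randomness — `K` is uncomputable, so a string that `zlib` fails to compress might still have a short description that the compressor cannot find.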
Applications
Minimum Description Length (MDL)
Rissanen’s MDL principle (1978) uses Kolmogorov complexity as the ideal but uncomputable model selection criterion: the best hypothesis \(H\) for data \(D\) is the one that minimizes \(L(H) + L(D \mid H)\), where \(L(H)\) is the description length of the hypothesis and \(L(D \mid H)\) is the description length of the data given the hypothesis. In practice, we replace Kolmogorov complexity with computable approximations (e.g., two-part codes, normalized maximum likelihood).
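A minimal two-part-code sketch, selecting a polynomial degree for noisy data (the coding choices — 32 bits per coefficient, a Gaussian-style code for residuals — are illustrative assumptions, not part of the MDL literature cited above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2 * x**2 - x + rng.normal(scale=0.1, size=x.size)  # true model: degree 2

def description_length(degree: int) -> float:
    """Crude two-part code: bits to state the hypothesis (32 bits per
    coefficient) plus bits to encode the residuals under a Gaussian code."""
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    n = x.size
    model_bits = 32.0 * (degree + 1)                     # L(H)
    data_bits = 0.5 * n * np.log2(max(rss / n, 1e-12))   # L(D | H)
    return model_bits + data_bits

best = min(range(1, 10), key=description_length)
print(best)
```

Underfitting (degree 1) pays in residual bits; overfitting (degree 9) pays in hypothesis bits; the two-part total is minimized near the true degree.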
Normalized Compression Distance (NCD)
Since \(K(x)\) is uncomputable, Cilibrasi and Vitányi (2005) proposed approximating it with real-world compressors (gzip, bzip2). The normalized compression distance between strings \(x\) and \(y\) is:
\[\text{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}\]

where \(C\) is a compressor and \(xy\) is concatenation. This has been successfully applied to clustering music by genre, classifying languages, detecting plagiarism, and even phylogenetic tree construction.
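The formula translates directly into code. A sketch with `zlib` standing in for \(C\) (the example strings are my own, chosen to show the similar/dissimilar contrast):

```python
import zlib

def C(data: bytes) -> int:
    """Compressed length: a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance of Cilibrasi & Vitanyi (2005)."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

s1 = b"the quick brown fox jumps over the lazy dog " * 20
s2 = b"the quick brown fox jumps over the lazy cat " * 20  # nearly identical
s3 = bytes(range(256)) * 4                                 # unrelated bytes

print(ncd(s1, s2))  # small: s2 is almost redundant given s1
print(ncd(s1, s3))  # larger: little shared structure
```

The intuition: if \(y\) shares structure with \(x\), compressing the concatenation \(xy\) costs little more than compressing \(x\) alone, driving the numerator toward zero.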
Solomonoff Induction
Solomonoff (1964) proposed the ultimate prior for prediction: assign to each hypothesis (program) \(p\) a prior probability \(2^{-\vert p \vert}\). The resulting universal prior \(M(x) = \sum_{p: U(p) = x} 2^{-\vert p \vert}\) dominates every computable probability distribution and converges to the true distribution at an optimal rate. It is incomputable but provides the theoretical gold standard for inductive inference.
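A toy illustration of the mixture idea (hedged heavily: true Solomonoff induction sums over *all* programs and is incomputable; this sketch uses a tiny hand-picked hypothesis class with assumed code lengths):

```python
# Each hypothesis: (assumed code length |p| in bits, next-bit predictor).
hypotheses = {
    "all_zeros": (3, lambda seq: 0),
    "all_ones":  (3, lambda seq: 1),
    "alternate": (4, lambda seq: 1 - seq[-1] if seq else 0),
}

def predict(seq):
    """Mixture prediction: weight each hypothesis consistent with the
    observed sequence by its prior 2^-|p|, then return P(next bit = 1)."""
    def consistent(f):
        return all(f(seq[:i]) == seq[i] for i in range(len(seq)))
    weights = {name: 2.0 ** -length
               for name, (length, f) in hypotheses.items() if consistent(f)}
    total = sum(weights.values())
    return sum(w for name, w in weights.items()
               if hypotheses[name][1](seq) == 1) / total

print(predict([0, 1, 0, 1]))  # only "alternate" survives; it predicts 0
print(predict([1, 1, 1]))     # only "all_ones" survives; it predicts 1
```

Inconsistent hypotheses drop out as data arrives, and among the survivors the shortest (highest-prior) programs dominate the prediction — the essence of Solomonoff's scheme, minus its uncountable program space.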
Connection to Learning Theory
Kolmogorov complexity provides an alternative foundation for generalization bounds. Instead of VC dimension or Rademacher complexity, one can bound generalization error using the Kolmogorov complexity of the hypothesis. This connects to the intuition that “simpler” models generalize better — with “simpler” now given a precise, model-independent meaning.
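One standard form of such a bound (an Occam-style bound in common textbook notation, not drawn from a specific source above): if hypothesis \(h\) is described by \(\vert h \vert\) bits, then a union bound over the at most \(2^{\vert h \vert}\) hypotheses of that length, combined with Hoeffding's inequality, gives with probability at least \(1 - \delta\) over \(m\) i.i.d. samples:

\[\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + \sqrt{\frac{\vert h \vert \ln 2 + \ln(1/\delta)}{2m}}\]

Shorter descriptions mean a smaller \(\vert h \vert \ln 2\) term, hence a tighter guarantee — a quantitative Occam's razor.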
References
- Li, M. & Vitányi, P. (2019). An Introduction to Kolmogorov Complexity and Its Applications (4th ed.). Springer. — The textbook. Chapters 1–2 for foundations, Chapter 4 for Shannon vs. Kolmogorov, Chapter 8 for applications.
- Cover, T. & Thomas, J. (2006). Elements of Information Theory (2nd ed.). Wiley. — Chapter 14 covers Kolmogorov complexity from an information-theoretic perspective.
- Grünwald, P. (2007). The Minimum Description Length Principle. MIT Press. — The definitive treatment of MDL and its connections to Kolmogorov complexity.
- Cilibrasi, R. & Vitányi, P. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523–1545.
- Solomonoff, R. (1964). A formal theory of inductive inference. Information and Control, 7(1), 1–22.