Kolmogorov Complexity

Kolmogorov complexity measures the intrinsic information content of an individual object — not in terms of probability (like Shannon entropy), but in terms of the length of the shortest program that produces it. It is a foundational concept in algorithmic information theory, with deep connections to computability, statistics, learning theory, and even philosophy.

Textbook: Ming Li and Paul Vitányi, An Introduction to Kolmogorov Complexity and Its Applications (Springer, 4th edition, 2019). The definitive reference, covering theory, applications, and connections to statistics and learning.

Recommended reading: see the References section at the end of this article.


Definition

Fix a universal Turing machine \(U\). The Kolmogorov complexity of a finite binary string \(x\) is:

\[K(x) = \min \{ \vert p \vert : U(p) = x \}\]

where \(\vert p \vert\) denotes the length of program \(p\) in bits. In words: \(K(x)\) is the length of the shortest program that, when run on \(U\), outputs exactly \(x\) and halts.

Key properties:

  • Invariance theorem: The choice of universal Turing machine matters only up to an additive constant. If \(U_1\) and \(U_2\) are two universal machines, then \(\vert K_{U_1}(x) - K_{U_2}(x) \vert \le c\) for all \(x\), where \(c\) depends only on \(U_1, U_2\).
  • Upper bound: \(K(x) \le \vert x \vert + c\) — we can always just hardcode \(x\).
  • Incompressibility: Most strings are incompressible. For any \(n\), at least \(2^n - 2^{n-c} + 1\) strings of length \(n\) satisfy \(K(x) \ge n - c\).
  • Uncomputability: \(K(x)\) is not computable. This follows from a Berry-paradox-style argument: a program that could compute \(K\) could search for and print the first string whose complexity far exceeds that program's own length, a contradiction. We cannot write a program that, given \(x\), outputs \(K(x)\).


Kolmogorov Complexity vs. Shannon Entropy

Shannon entropy and Kolmogorov complexity both measure “information,” but from fundamentally different angles:

                Shannon Entropy                   Kolmogorov Complexity
  Object        Random variable / distribution    Individual string
  Measure       Expected code length              Shortest program length
  Computable?   Yes (given the distribution)      No
  Depends on    Probability model                 Universal Turing machine (up to constant)

The deep connection: for a random variable \(X\) taking values in \(\{0,1\}^n\) with a computable distribution, the expected Kolmogorov complexity equals the Shannon entropy up to an additive constant that depends only on the distribution:

\[\mathbb{E}[K(X)] = H(X) + O(1)\]

In other words, Shannon entropy is the average-case version of Kolmogorov complexity. Shannon tells you about the source; Kolmogorov tells you about the individual output.

This distinction matters. The string \(0000...0\) (a billion zeros) is a perfectly legitimate draw from a uniform source whose Shannon entropy is maximal, yet its Kolmogorov complexity is near zero: a tiny program generates it. Kolmogorov complexity captures the intuitive notion that this string is “simple” regardless of where it came from.
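
A real compressor makes this concrete: compressed size is a computable upper bound on \(K\), up to the compressor's format overhead. A sketch using Python's zlib, with a megabyte standing in for the billion zeros:

```python
import os
import zlib

zeros = b"\x00" * 10**6    # highly regular: a short description (run length) suffices
noise = os.urandom(10**6)  # random bytes: incompressible with high probability

# zlib squeezes the zeros down to roughly a kilobyte, while the random
# data stays essentially at its original size.
print(len(zlib.compress(zeros, 9)))
print(len(zlib.compress(noise, 9)))
```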


Applications

Minimum Description Length (MDL)

Rissanen’s MDL principle (1978) uses Kolmogorov complexity as the ideal but uncomputable model selection criterion: the best hypothesis for data \(x\) is the one that minimizes \(\vert H \vert + \vert D_{H} \vert\), where \(\vert H \vert\) is the description length of the hypothesis and \(\vert D_H \vert\) is the description length of the data given the hypothesis. In practice, we replace Kolmogorov complexity with computable approximations (e.g., two-part codes, normalized maximum likelihood).
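
A toy instance of the two-part code (the function names and quantization scheme are ours, for illustration only): model a binary string with a Bernoulli parameter quantized to \(k\) bits, and charge \(k\) bits for the hypothesis plus the ideal code length of the data under it.

```python
import math

def two_part_code_length(data: str, k: int) -> float:
    """|H| + |D given H| in bits, for a Bernoulli model with a k-bit parameter."""
    n, ones = len(data), data.count("1")
    # Quantize the maximum-likelihood estimate to a k-bit grid inside (0, 1).
    p = max(1, min(2 ** k - 1, round(ones / n * 2 ** k))) / 2 ** k
    bits_for_data = -(ones * math.log2(p) + (n - ones) * math.log2(1 - p))
    return k + bits_for_data

data = "1101101110" * 50  # 70% ones, length 500
# MDL picks a middling precision: too few bits misstate p, too many are wasted.
best_k = min(range(1, 16), key=lambda k: two_part_code_length(data, k))
```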

Normalized Compression Distance (NCD)

Since \(K(x)\) is uncomputable, Cilibrasi and Vitányi (2005) proposed approximating it with real-world compressors (gzip, bzip2). The normalized compression distance between strings \(x\) and \(y\) is:

\[\text{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}\]

where \(C\) is a compressor and \(xy\) is concatenation. This has been successfully applied to clustering music by genre, classifying languages, detecting plagiarism, and even phylogenetic tree construction.
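
A sketch of NCD with zlib playing the role of the compressor \(C\) (any off-the-shelf compressor works; the sample strings below are ours):

```python
import zlib

def C(b: bytes) -> int:
    """Approximate description length: compressed size in bytes."""
    return len(zlib.compress(b, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

s1 = b"the quick brown fox jumps over the lazy dog " * 20
s2 = b"the quick brown fox leaps over the lazy dog " * 20   # near-copy of s1
s3 = b"colorless green ideas sleep furiously tonight " * 20  # unrelated

# Similar strings share compressible structure, so their NCD is smaller.
print(ncd(s1, s2), ncd(s1, s3))
```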

Solomonoff Induction

Solomonoff (1964) proposed the ultimate prior for prediction: assign to each hypothesis (program) \(p\) a prior probability \(2^{-\vert p \vert}\). The resulting universal prior \(M(x) = \sum_{p: U(p) = x} 2^{-\vert p \vert}\) dominates every computable probability distribution and converges to the true distribution at an optimal rate. It is uncomputable but provides the theoretical gold standard for inductive inference.
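
The weights \(2^{-\vert p \vert}\) form a legitimate (semi)measure because the programs of a prefix-free machine satisfy the Kraft inequality: their weights sum to at most 1. A sketch using the prefix-free code \(\{1, 01, 001, \dots\}\) as a stand-in program set:

```python
# Each "program" 0^k 1 gets prior weight 2^-(k+1); no codeword is a
# prefix of another, so the Kraft inequality bounds the total by 1.
programs = ["0" * k + "1" for k in range(20)]
prior = {p: 2.0 ** -len(p) for p in programs}
total = sum(prior.values())
print(total)  # 1 - 2**-20, safely below 1
```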

Connection to Learning Theory

Kolmogorov complexity provides an alternative foundation for generalization bounds. Instead of VC dimension or Rademacher complexity, one can bound generalization error using the Kolmogorov complexity of the hypothesis. This connects to the intuition that “simpler” models generalize better — with “simpler” now given a precise, model-independent meaning.
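
One standard form of this idea is an Occam bound (sketched here under our own naming, not taken from the text above): a Hoeffding bound plus a union bound over the at most \(2^d\) hypotheses describable in \(d\) bits.

```python
import math

def occam_bound(emp_err: float, d_bits: int, m: int, delta: float = 0.05) -> float:
    """With probability 1 - delta over m samples, every hypothesis describable
    in d bits satisfies: true_err <= emp_err + sqrt((d*ln2 + ln(1/delta)) / (2m))."""
    return emp_err + math.sqrt((d_bits * math.log(2) + math.log(1 / delta)) / (2 * m))

# Same training error, same data: the shorter description certifies a
# tighter guarantee on the true error.
print(occam_bound(0.05, 100, 10_000))
print(occam_bound(0.05, 10_000, 10_000))
```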


References

  • Li, M. & Vitányi, P. (2019). An Introduction to Kolmogorov Complexity and Its Applications (4th ed.). Springer. — The textbook. Chapters 1–2 for foundations, Chapter 4 for Shannon vs. Kolmogorov, Chapter 8 for applications.
  • Cover, T. & Thomas, J. (2006). Elements of Information Theory (2nd ed.). Wiley. — Chapter 14 covers Kolmogorov complexity from an information-theoretic perspective.
  • Grünwald, P. (2007). The Minimum Description Length Principle. MIT Press. — The definitive treatment of MDL and its connections to Kolmogorov complexity.
  • Cilibrasi, R. & Vitányi, P. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523–1545.
  • Solomonoff, R. (1964). A formal theory of inductive inference. Information and Control, 7(1), 1–22.