Kolmogorov Complexity

Kolmogorov complexity measures the intrinsic information content of an individual object — not in terms of probability (like Shannon entropy), but in terms of the length of the shortest program that produces it. It is a foundational concept in algorithmic information theory, with deep connections to computability, statistics, learning theory, and even philosophy.

Textbook: Ming Li and Paul Vitányi, An Introduction to Kolmogorov Complexity and Its Applications (Springer, 4th edition, 2019). The definitive reference, covering theory, applications, and connections to statistics and learning.

Recommended reading: see the References section at the end of this article.


Definition

Fix a universal Turing machine \(U\). The Kolmogorov complexity of a finite binary string \(x\) is:

\[K(x) = \min \{ \vert p \vert : U(p) = x \}\]

where \(\vert p \vert\) denotes the length of program \(p\) in bits. In words: \(K(x)\) is the length of the shortest program that, when run on \(U\), outputs exactly \(x\) and halts.

Key properties:

  • Invariance theorem: The choice of universal Turing machine matters only up to an additive constant. If \(U_1\) and \(U_2\) are two universal machines, then \(\vert K_{U_1}(x) - K_{U_2}(x) \vert \le c\) for all \(x\), where \(c\) depends only on \(U_1, U_2\).
  • Upper bound: \(K(x) \le \vert x \vert + c\) — we can always just hardcode \(x\).
  • Incompressibility: Most strings are incompressible. For any \(n\), at least \(2^n - 2^{n-c} + 1\) strings of length \(n\) satisfy \(K(x) \ge n - c\).
  • Uncomputability: \(K(x)\) is not computable. This follows from a Berry-paradox-style argument: a program that could compute \(K\) could search for and print the first string whose complexity far exceeds that program's own length, a contradiction. We cannot write a program that, given \(x\), outputs \(K(x)\).


Kolmogorov Complexity vs. Shannon Entropy

Shannon entropy and Kolmogorov complexity both measure “information,” but from fundamentally different angles:

                Shannon Entropy                   Kolmogorov Complexity
  Object        Random variable / distribution    Individual string
  Measure       Expected code length              Shortest program length
  Computable?   Yes (given the distribution)      No
  Depends on    Probability model                 Universal Turing machine (up to constant)

The deep connection: for a random variable \(X\) taking values in \(\{0,1\}^n\) with a computable distribution, the expected Kolmogorov complexity equals the Shannon entropy up to an additive constant that depends only on the distribution:

\[\mathbb{E}[K(X)] = H(X) + O(1)\]

In other words, Shannon entropy is the average-case version of Kolmogorov complexity. Shannon tells you about the source; Kolmogorov tells you about the individual output.

This distinction matters. The string \(0000...0\) (a billion zeros) is a perfectly legitimate draw from a uniform source whose Shannon entropy is maximal, yet its Kolmogorov complexity is near zero: a tiny program generates it. Kolmogorov complexity captures the intuitive notion that this string is “simple” regardless of where it came from.
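
A real compressor makes this concrete: compressed size is a computable upper bound on \(K\), up to the compressor's format overhead. A sketch using Python's zlib, with a megabyte standing in for the billion zeros:

```python
import os
import zlib

zeros = b"\x00" * 10**6    # highly regular: a short description (run length) suffices
noise = os.urandom(10**6)  # random bytes: incompressible with high probability

# zlib squeezes the zeros down to roughly a kilobyte, while the random
# data stays essentially at its original size.
print(len(zlib.compress(zeros, 9)))
print(len(zlib.compress(noise, 9)))
```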


Applications

Minimum Description Length (MDL)

Rissanen’s MDL principle (1978) uses Kolmogorov complexity as the ideal but uncomputable model selection criterion: the best hypothesis for data \(x\) is the one that minimizes \(\vert H \vert + \vert D_{H} \vert\), where \(\vert H \vert\) is the description length of the hypothesis and \(\vert D_H \vert\) is the description length of the data given the hypothesis. In practice, we replace Kolmogorov complexity with computable approximations (e.g., two-part codes, normalized maximum likelihood).
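
A toy instance of the two-part code (the function names and quantization scheme are ours, for illustration only): model a binary string with a Bernoulli parameter quantized to \(k\) bits, and charge \(k\) bits for the hypothesis plus the ideal code length of the data under it.

```python
import math

def two_part_code_length(data: str, k: int) -> float:
    """|H| + |D given H| in bits, for a Bernoulli model with a k-bit parameter."""
    n, ones = len(data), data.count("1")
    # Quantize the maximum-likelihood estimate to a k-bit grid inside (0, 1).
    p = max(1, min(2 ** k - 1, round(ones / n * 2 ** k))) / 2 ** k
    bits_for_data = -(ones * math.log2(p) + (n - ones) * math.log2(1 - p))
    return k + bits_for_data

data = "1101101110" * 50  # 70% ones, length 500
# MDL picks a middling precision: too few bits misstate p, too many are wasted.
best_k = min(range(1, 16), key=lambda k: two_part_code_length(data, k))
```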

Normalized Compression Distance (NCD)

Since \(K(x)\) is uncomputable, Cilibrasi and Vitányi (2005) proposed approximating it with real-world compressors (gzip, bzip2). The normalized compression distance between strings \(x\) and \(y\) is:

\[\text{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}\]

where \(C\) is a compressor and \(xy\) is concatenation. This has been successfully applied to clustering music by genre, classifying languages, detecting plagiarism, and even phylogenetic tree construction.
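
A sketch of NCD with zlib playing the role of the compressor \(C\) (any off-the-shelf compressor works; the sample strings below are ours):

```python
import zlib

def C(b: bytes) -> int:
    """Approximate description length: compressed size in bytes."""
    return len(zlib.compress(b, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

s1 = b"the quick brown fox jumps over the lazy dog " * 20
s2 = b"the quick brown fox leaps over the lazy dog " * 20   # near-copy of s1
s3 = b"colorless green ideas sleep furiously tonight " * 20  # unrelated

# Similar strings share compressible structure, so their NCD is smaller.
print(ncd(s1, s2), ncd(s1, s3))
```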

Solomonoff Induction

Solomonoff (1964) proposed the ultimate prior for prediction: assign to each hypothesis (program) \(p\) a prior probability \(2^{-\vert p \vert}\). The resulting universal prior \(M(x) = \sum_{p: U(p) = x} 2^{-\vert p \vert}\) dominates every computable probability distribution and converges to the true distribution at an optimal rate. It is uncomputable but provides the theoretical gold standard for inductive inference.
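
The weights \(2^{-\vert p \vert}\) form a legitimate (semi)measure because the programs of a prefix-free machine satisfy the Kraft inequality: their weights sum to at most 1. A sketch using the prefix-free code \(\{1, 01, 001, \dots\}\) as a stand-in program set:

```python
# Each "program" 0^k 1 gets prior weight 2^-(k+1); no codeword is a
# prefix of another, so the Kraft inequality bounds the total by 1.
programs = ["0" * k + "1" for k in range(20)]
prior = {p: 2.0 ** -len(p) for p in programs}
total = sum(prior.values())
print(total)  # 1 - 2**-20, safely below 1
```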

Connection to Learning Theory

Kolmogorov complexity provides an alternative foundation for generalization bounds. Instead of VC dimension or Rademacher complexity, one can bound generalization error using the Kolmogorov complexity of the hypothesis. This connects to the intuition that “simpler” models generalize better — with “simpler” now given a precise, model-independent meaning.
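
One standard form of this idea is an Occam bound (sketched here under our own naming, not taken from the text above): a Hoeffding bound plus a union bound over the at most \(2^d\) hypotheses describable in \(d\) bits.

```python
import math

def occam_bound(emp_err: float, d_bits: int, m: int, delta: float = 0.05) -> float:
    """With probability 1 - delta over m samples, every hypothesis describable
    in d bits satisfies: true_err <= emp_err + sqrt((d*ln2 + ln(1/delta)) / (2m))."""
    return emp_err + math.sqrt((d_bits * math.log(2) + math.log(1 / delta)) / (2 * m))

# Same training error, same data: the shorter description certifies a
# tighter guarantee on the true error.
print(occam_bound(0.05, 100, 10_000))
print(occam_bound(0.05, 10_000, 10_000))
```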


References

  • Li, M. & Vitányi, P. (2019). An Introduction to Kolmogorov Complexity and Its Applications (4th ed.). Springer. — The textbook. Chapters 1–2 for foundations, Chapter 4 for Shannon vs. Kolmogorov, Chapter 8 for applications.
  • Cover, T. & Thomas, J. (2006). Elements of Information Theory (2nd ed.). Wiley. — Chapter 14 covers Kolmogorov complexity from an information-theoretic perspective.
  • Grünwald, P. (2007). The Minimum Description Length Principle. MIT Press. — The definitive treatment of MDL and its connections to Kolmogorov complexity.
  • Cilibrasi, R. & Vitányi, P. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523–1545.
  • Solomonoff, R. (1964). A formal theory of inductive inference. Information and Control, 7(1), 1–22.