Dynamic Programming: Foundations

Dynamic programming (DP) is one of the most widely reused ideas in computer science — from edit distance and shortest paths to the Bellman equation at the heart of reinforcement learning. Its core insight is embarrassingly simple: if a problem decomposes into overlapping subproblems whose solutions combine optimally, remember each subproblem's answer instead of recomputing it. This post walks through the two principles behind DP, the naive-vs-memoized-vs-tabulated spectrum, and a grid example that previews the Bellman-style reasoning that reappears throughout RL.

Two Principles

A problem admits a DP solution when it has both:

  1. Optimal substructure. The optimal solution to the whole problem can be assembled from optimal solutions to smaller subproblems. Shortest paths have this: any prefix of a shortest path from \(s\) to \(t\) is itself a shortest path from \(s\) to the intermediate node.

  2. Overlapping subproblems. The recursion revisits the same subproblem many times. Without overlap, plain divide-and-conquer already suffices — merge sort does not benefit from memoization.

When both hold, caching subproblem answers collapses a potentially exponential recursion into polynomial time. When only (1) holds, use divide-and-conquer; when only (2) holds (but no optimality is being sought), use plain memoization.

Motivating Example: Fibonacci

The textbook motivator is Fibonacci:

\[f(0) = 0, \quad f(1) = 1, \quad f(n) = f(n-1) + f(n-2) \;\;\text{for } n \geq 2.\]

The naive recursion is correct and catastrophically slow. The number of calls \(T(n)\) satisfies \(T(n) = T(n-1) + T(n-2) + 1\), so \(T(n) = \Theta(\varphi^n)\) with \(\varphi = (1+\sqrt{5})/2 \approx 1.618\).
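One way to sanity-check the \(\Theta(\varphi^n)\) claim is to count the calls directly. The sketch below (with a hypothetical helper `calls`) evaluates the recurrence \(T(n) = T(n-1) + T(n-2) + 1\) and watches successive ratios settle toward \(\varphi\):

```python
PHI = (1 + 5 ** 0.5) / 2

def calls(n):
    # Call count of the naive recursion: T(n) = T(n-1) + T(n-2) + 1,
    # with T(0) = T(1) = 1 (a single call each).
    if n < 2:
        return 1
    return calls(n - 1) + calls(n - 2) + 1

# Ratios T(n+1) / T(n) approach phi ≈ 1.618 as n grows.
ratios = [calls(n + 1) / calls(n) for n in range(10, 15)]
```

In fact \(T(n) = 2f(n+1) - 1\), so the call count is itself a shifted, scaled Fibonacci sequence.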

The figure above draws the full recursion tree for \(f(n)\). Toggle Memoized to mark duplicate subproblems as cache hits (gray): once a node has been computed, subsequent encounters cost \(O(1)\). The exponential tree collapses — only \(n+1\) distinct subproblems are ever computed.

Three equivalent implementations capture the standard DP workflow:

# 1. Naive recursion — O(phi^n) time, O(n) stack.
def fib_naive(n):
    if n < 2: return n
    return fib_naive(n - 1) + fib_naive(n - 2)

# 2. Top-down memoization — O(n) time and space.
def fib_memo(n, cache=None):
    if cache is None: cache = {}  # fresh cache per top-level call, not a shared mutable default
    if n < 2: return n
    if n in cache: return cache[n]
    cache[n] = fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
    return cache[n]

# 3. Bottom-up tabulation — O(n) time, O(1) space with rolling variables.
def fib_tab(n):
    a, b = 0, 1
    for _ in range(n): a, b = b, a + b
    return a

“Top-down” (memoization) starts from the original problem and recurses, caching along the way; “bottom-up” (tabulation) fills a table from base cases up to the target. They do exactly the same work and produce the same answers — pick whichever makes the recursion clearer, or whichever avoids Python’s recursion limit.

Anatomy of a DP Solution

Almost every DP solution can be written as the same three-line recipe:

  1. State. What arguments uniquely identify a subproblem? The state space must be small (polynomial) for DP to be efficient. For Fibonacci, the state is a single integer \(n\).

  2. Transition. How does a state’s answer relate to other states’ answers? This is the recurrence. For Fibonacci, \(f(n) = f(n-1) + f(n-2)\).

  3. Base case. Which states are resolved without recursion? For Fibonacci, \(f(0)=0,\,f(1)=1\).

Designing a DP is mostly about finding a good state. A bad state either fails to capture what the transition needs (so the recurrence is wrong) or grows exponentially (so memoization doesn’t help). A common trick: if a natural state leaves out some context needed for the transition, add that context as an extra dimension — at the cost of time/space proportional to the new state space.
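As a minimal sketch of the recipe in code (assuming only the standard library), `functools.lru_cache` serves as the memo table keyed by the state, and the function body is literally base case plus transition:

```python
from functools import lru_cache

@lru_cache(maxsize=None)        # memo table, keyed by the state
def f(n):                       # state: the single integer n
    if n < 2:                   # base cases: f(0) = 0, f(1) = 1
        return n
    return f(n - 1) + f(n - 2)  # transition: f(n) = f(n-1) + f(n-2)
```

Any recurrence whose state is hashable fits this shape; only the three labeled pieces change.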

A Classic DP: Minimum Path Sum in a Grid

Given a grid of non-negative costs \(c[i,j]\), find the minimum total cost of a path from \((0,0)\) to \((R-1, C-1)\) that moves only right or down. This is the \(\ell^1\)-cousin of shortest-path problems.

  • State: \(dp[i,j]\) = minimum total cost of reaching cell \((i,j)\).
  • Transition: \(dp[i,j] = c[i,j] + \min\!\big(dp[i-1,j],\; dp[i,j-1]\big)\).
  • Base case: \(dp[0,0] = c[0,0]\); for the first row/column only one predecessor exists.

The answer is \(dp[R-1, C-1]\). To recover the path (not just its cost), store which predecessor achieved the minimum as you fill the table — a parent pointer — then trace it backward from the goal.
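A sketch of this forward table-fill with parent pointers (the function name `min_path_sum` is hypothetical; it assumes a rectangular list-of-lists grid):

```python
def min_path_sum(c):
    R, C = len(c), len(c[0])
    INF = float("inf")
    dp = [[0] * C for _ in range(R)]
    parent = [[None] * C for _ in range(R)]   # predecessor achieving the min
    for i in range(R):
        for j in range(C):
            if i == 0 and j == 0:
                dp[i][j] = c[i][j]            # base case
                continue
            up = dp[i - 1][j] if i > 0 else INF
            left = dp[i][j - 1] if j > 0 else INF
            if up <= left:
                dp[i][j] = c[i][j] + up
                parent[i][j] = (i - 1, j)
            else:
                dp[i][j] = c[i][j] + left
                parent[i][j] = (i, j - 1)
    # Trace parent pointers backward from the goal to recover the path.
    path, cell = [], (R - 1, C - 1)
    while cell is not None:
        path.append(cell)
        cell = parent[cell[0]][cell[1]]
    return dp[R - 1][C - 1], path[::-1]
```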

The same problem can also be solved in the Bellman / cost-to-go direction. Instead of asking “what is the cheapest cost to arrive here?”, define

\[V[i,j] = \text{minimum remaining cost from } (i,j) \text{ to } (R-1,C-1).\]

Now the dependencies point forward along the path:

\[V[i,j] = c[i,j] + \min\!\big(V[i+1,j],\; V[i,j+1]\big),\]

with boundary condition \(V[R-1,C-1] = c[R-1,C-1]\). Cells on the last row can only move right, and cells on the last column can only move down. To make the dependencies available, fill the table backward: from bottom-right toward top-left, for example by decreasing \(i\) and \(j\). The answer is then \(V[0,0]\).

This is the Bellman optimality equation in grid clothing: at state \((i,j)\), choose the action (right or down) whose one-step cost plus optimal future cost is smallest.

Press Step to fill one cell at a time. The yellow cell is the one being computed; dashed arrows mark its dependencies (up and/or left). When every cell is filled, the red polyline shows the optimal path recovered from parent pointers. Randomize to try a new cost grid.

Two details worth noting:

  • Fill order matters, but only mildly. Any order that processes \((i,j)\) after both \((i-1,j)\) and \((i,j-1)\) works — row-major, column-major, or diagonal sweeps all produce the same table.

  • Space can be \(O(C)\), not \(O(RC)\). Each row depends only on the previous row, so a rolling 1D array suffices — a common DP space-saving trick, at the cost of losing easy path recovery.
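The rolling-array trick can be sketched like this (hypothetical name `min_path_sum_1d`; same right/down moves as above):

```python
def min_path_sum_1d(c):
    R, C = len(c), len(c[0])
    row = [0] * C
    # First grid row: only moves from the left are possible.
    row[0] = c[0][0]
    for j in range(1, C):
        row[j] = row[j - 1] + c[0][j]
    for i in range(1, R):
        row[0] += c[i][0]                      # first column: only from above
        for j in range(1, C):
            # row[j] still holds dp[i-1][j]; row[j-1] already holds dp[i][j-1].
            row[j] = c[i][j] + min(row[j], row[j - 1])
    return row[-1]
```

The key invariant is that `row[j]` is read before it is overwritten, so at that moment it still holds the previous row's value — exactly the \(dp[i-1,j]\) the transition needs.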

Why This Matters Beyond Algorithms

The grid recurrence is a finite, deterministic special case of the Bellman optimality equation. Read \(dp[i,j]\) as the optimal cost-to-go from \((i,j)\) to the goal (flipping the direction), and you get

\[V^\star(s) = \min_{a}\, \big[ c(s,a) + V^\star(s') \big],\]

which is exactly the Bellman equation for a deterministic shortest-path MDP with no discounting. Filling the DP table row-by-row is a form of value iteration that converges in one sweep because the state graph is a DAG (no cycles, so no need to iterate to a fixed point). The moment the MDP has stochastic transitions, loops, or a discount factor, the one-sweep trick breaks — and we get the full iterative value iteration / policy iteration machinery that makes up a large part of reinforcement learning theory.

This connection is why DP vocabulary (value function, Bellman backup, optimal substructure) dominates RL even though RL typically solves problems where the state graph is too large to tabulate. The algorithmic ideas transfer; only the implementation (function approximation, sampling) changes.

Worked Example: Bellman Backup on a 3×3 Grid

Take the canonical cost grid

\[c = \begin{bmatrix} 1 & 3 & 1 \\ 1 & 5 & 1 \\ 4 & 2 & 1 \end{bmatrix},\]

with goal \((R-1,C-1) = (2,2)\). Apply the Bellman recurrence

\[V[i,j] = c[i,j] + \min\!\big(V[i+1,j],\; V[i,j+1]\big)\]

starting from the boundary \(V[2,2] = c[2,2] = 1\) and sweeping toward \((0,0)\). Cells on the last row have only \(V[i,j+1]\) available; cells on the last column have only \(V[i+1,j]\).

step cell computation \(V\)
1 \((2,2)\) base case \(1\)
2 \((2,1)\) \(2 + V[2,2]\) \(3\)
3 \((2,0)\) \(4 + V[2,1]\) \(7\)
4 \((1,2)\) \(1 + V[2,2]\) \(2\)
5 \((1,1)\) \(5 + \min(2, 3)\) \(7\)
6 \((1,0)\) \(1 + \min(7, 7)\) \(8\)
7 \((0,2)\) \(1 + V[1,2]\) \(3\)
8 \((0,1)\) \(3 + \min(3, 7)\) \(6\)
9 \((0,0)\) \(1 + \min(6, 8)\) \(7\)

The completed value table is

\[V = \begin{bmatrix} 7 & 6 & 3 \\ 8 & 7 & 2 \\ 7 & 3 & 1 \end{bmatrix}.\]

The optimal policy at each cell is the action attaining the \(\min\) in the Bellman update — i.e. \(\pi^\star(i,j) = \arg\min\{V[i+1,j], V[i,j+1]\}\):

\[(0,0) \xrightarrow{\text{right}} (0,1) \xrightarrow{\text{right}} (0,2) \xrightarrow{\text{down}} (1,2) \xrightarrow{\text{down}} (2,2),\]

with running cost \(1 + 3 + 1 + 1 + 1 = 7\), exactly \(V[0,0]\).

def min_path_bellman(c):
    R, C = len(c), len(c[0])
    V = [[0] * C for _ in range(R)]
    V[R-1][C-1] = c[R-1][C-1]
    for j in range(C-2, -1, -1):                    # last row: only right
        V[R-1][j] = c[R-1][j] + V[R-1][j+1]
    for i in range(R-2, -1, -1):                    # last column: only down
        V[i][C-1] = c[i][C-1] + V[i+1][C-1]
    for i in range(R-2, -1, -1):                    # interior: backward sweep
        for j in range(C-2, -1, -1):
            V[i][j] = c[i][j] + min(V[i+1][j], V[i][j+1])
    return V[0][0]

Two things worth pointing out:

  • One sweep, not value iteration. The dependency graph is a DAG (each cell depends only on cells strictly closer to the goal), so a single backward pass solves Bellman exactly. In a stochastic or cyclic MDP we would have to iterate \(V \leftarrow \mathcal{T}V\) until convergence.

  • No parent pointers needed. The optimal path is implicit in \(V\): at any state, take the action whose successor has the smaller \(V\). Storing \(V\) alone is enough to reconstruct \(\pi^\star\) on demand.
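That on-demand reconstruction can be sketched as follows (hypothetical helper `greedy_path`, taking a completed cost-to-go table \(V\)):

```python
def greedy_path(V):
    # Follow the smaller-V successor from (0,0) until the goal is reached.
    R, C = len(V), len(V[0])
    i = j = 0
    path = [(0, 0)]
    while (i, j) != (R - 1, C - 1):
        down = V[i + 1][j] if i + 1 < R else float("inf")
        right = V[i][j + 1] if j + 1 < C else float("inf")
        if right <= down:
            j += 1
        else:
            i += 1
        path.append((i, j))
    return path
```

On the worked 3×3 table this recovers the same right, right, down, down path as the parent-pointer approach.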

Takeaways

  • DP is just caching plus a recurrence — the hard part is usually picking a good state.
  • Top-down memoization and bottom-up tabulation are interchangeable; choose for readability or stack safety.
  • If you can write the problem as “the optimal solution is some function of the optimal solutions to smaller subproblems,” a DP is sitting there waiting to be written.
  • The Bellman equation is DP lifted to stochastic, possibly-looping environments — same idea, different plumbing.
  • When the state graph has cycles or stochastic transitions, the one-sweep trick generalizes to value iteration: initialize \(V_0\) arbitrarily and repeatedly apply the Bellman operator \(V_{k+1} = \mathcal{T} V_k\) until \(\|V_{k+1} - V_k\|_\infty < \varepsilon\). For discount \(\gamma < 1\), \(\mathcal{T}\) is a \(\gamma\)-contraction in sup-norm, so the iterates converge geometrically to the unique fixed point \(V^\star\). The grid is the lucky DAG case where a single backward sweep already lands on \(V^\star\).
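The value-iteration loop described in the takeaway above can be sketched for a small finite MDP. Everything here is a hypothetical illustration: `model(s, a)` is assumed to return `(prob, cost, next_state)` triples, and the loop applies the Bellman operator until the sup-norm change drops below `eps`:

```python
def value_iteration(states, actions, model, gamma=0.9, eps=1e-8):
    # model(s, a) -> list of (prob, cost, next_state) triples.
    V = {s: 0.0 for s in states}
    while True:
        # One synchronous Bellman backup: V_new = T V.
        V_new = {
            s: min(sum(p * (c + gamma * V[s2]) for p, c, s2 in model(s, a))
                   for a in actions)
            for s in states
        }
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < eps:
            return V
```

With \(\gamma < 1\) the sup-norm gap to \(V^\star\) shrinks by a factor of \(\gamma\) per sweep, so the loop terminates; on a DAG like the grid, values become exact after a number of sweeps bounded by the longest path length.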