Dynamic Programming: Foundations

Dynamic programming (DP) is one of the most widely reused ideas in computer science — from edit distance and shortest paths to the Bellman equation at the heart of reinforcement learning. Its core insight is embarrassingly simple: if a problem decomposes into overlapping subproblems whose solutions combine optimally, remember each subproblem's answer instead of recomputing it. This post walks through the two principles behind DP, the naive-vs-memoized-vs-tabulated spectrum, and a grid example that previews the Bellman-style reasoning that reappears throughout RL.

Two Principles

A problem admits a DP solution when it has both:

  1. Optimal substructure. The optimal solution to the whole problem can be assembled from optimal solutions to smaller subproblems. Shortest paths have this: any prefix of a shortest path from \(s\) to \(t\) is itself a shortest path from \(s\) to the intermediate node.

  2. Overlapping subproblems. The recursion revisits the same subproblem many times. Without overlap, plain divide-and-conquer already suffices — merge sort does not benefit from memoization.

When both hold, caching subproblem answers collapses a potentially exponential recursion into polynomial time. When only (1) holds, use divide-and-conquer; when only (2) holds (but no optimality is being sought), use plain memoization.

Motivating Example: Fibonacci

The textbook motivator is Fibonacci:

\[f(0) = 0, \quad f(1) = 1, \quad f(n) = f(n-1) + f(n-2) \;\;\text{for } n \geq 2.\]

The naive recursion is correct and catastrophically slow. The number of calls \(T(n)\) satisfies \(T(n) = T(n-1) + T(n-2) + 1\), so \(T(n) = \Theta(\varphi^n)\) with \(\varphi = (1+\sqrt{5})/2 \approx 1.618\).
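One way to sanity-check the \(\Theta(\varphi^n)\) claim is to count the calls directly. The sketch below (with a hypothetical helper `calls`) evaluates the recurrence \(T(n) = T(n-1) + T(n-2) + 1\) and watches successive ratios settle toward \(\varphi\):

```python
PHI = (1 + 5 ** 0.5) / 2

def calls(n):
    # Call count of the naive recursion: T(n) = T(n-1) + T(n-2) + 1,
    # with T(0) = T(1) = 1 (a single call each).
    if n < 2:
        return 1
    return calls(n - 1) + calls(n - 2) + 1

# Ratios T(n+1) / T(n) approach phi ≈ 1.618 as n grows.
ratios = [calls(n + 1) / calls(n) for n in range(10, 15)]
```

In fact \(T(n) = 2f(n+1) - 1\), so the call count is itself a shifted, scaled Fibonacci sequence.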

The figure above draws the full recursion tree for \(f(n)\). Toggle Memoized to mark duplicate subproblems as cache hits (gray): once a node has been computed, subsequent encounters cost \(O(1)\). The exponential tree collapses — only \(n+1\) distinct subproblems are ever computed.

Three equivalent implementations capture the standard DP workflow:

# 1. Naive recursion — O(phi^n) time, O(n) stack.
def fib_naive(n):
    if n < 2: return n
    return fib_naive(n - 1) + fib_naive(n - 2)

# 2. Top-down memoization — O(n) time and space.
def fib_memo(n, cache=None):
    if cache is None: cache = {}  # fresh cache per top-level call, not a shared mutable default
    if n < 2: return n
    if n in cache: return cache[n]
    cache[n] = fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
    return cache[n]

# 3. Bottom-up tabulation — O(n) time, O(1) space with rolling variables.
def fib_tab(n):
    a, b = 0, 1
    for _ in range(n): a, b = b, a + b
    return a

“Top-down” (memoization) starts from the original problem and recurses, caching along the way; “bottom-up” (tabulation) fills a table from base cases up to the target. They do exactly the same work and produce the same answers — pick whichever makes the recursion clearer, or whichever avoids Python’s recursion limit.

Anatomy of a DP Solution

Almost every DP solution can be written as the same three-line recipe:

  1. State. What arguments uniquely identify a subproblem? The state space must be small (polynomial) for DP to be efficient. For Fibonacci, the state is a single integer \(n\).

  2. Transition. How does a state’s answer relate to other states’ answers? This is the recurrence. For Fibonacci, \(f(n) = f(n-1) + f(n-2)\).

  3. Base case. Which states are resolved without recursion? For Fibonacci, \(f(0)=0,\,f(1)=1\).

Designing a DP is mostly about finding a good state. A bad state either fails to capture what the transition needs (so the recurrence is wrong) or grows exponentially (so memoization doesn’t help). A common trick: if a natural state leaves out some context needed for the transition, add that context as an extra dimension — at the cost of time/space proportional to the new state space.
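As a minimal sketch of the recipe in code (assuming only the standard library), `functools.lru_cache` serves as the memo table keyed by the state, and the function body is literally base case plus transition:

```python
from functools import lru_cache

@lru_cache(maxsize=None)        # memo table, keyed by the state
def f(n):                       # state: the single integer n
    if n < 2:                   # base cases: f(0) = 0, f(1) = 1
        return n
    return f(n - 1) + f(n - 2)  # transition: f(n) = f(n-1) + f(n-2)
```

Any recurrence whose state is hashable fits this shape; only the three labeled pieces change.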

A Classic DP: Minimum Path Sum in a Grid

Given a grid of non-negative costs \(c[i,j]\), find the minimum total cost of a path from \((0,0)\) to \((R-1, C-1)\) that moves only right or down. This is the \(\ell^1\)-cousin of shortest-path problems.

  • State: \(dp[i,j]\) = minimum total cost of reaching cell \((i,j)\).
  • Transition: \(dp[i,j] = c[i,j] + \min\!\big(dp[i-1,j],\; dp[i,j-1]\big)\).
  • Base case: \(dp[0,0] = c[0,0]\); for the first row/column only one predecessor exists.

The answer is \(dp[R-1, C-1]\). To recover the path (not just its cost), store which predecessor achieved the minimum as you fill the table — a parent pointer — then trace it backward from the goal.
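A sketch of this forward table-fill with parent pointers (the function name `min_path_sum` is hypothetical; it assumes a rectangular list-of-lists grid):

```python
def min_path_sum(c):
    R, C = len(c), len(c[0])
    INF = float("inf")
    dp = [[0] * C for _ in range(R)]
    parent = [[None] * C for _ in range(R)]   # predecessor achieving the min
    for i in range(R):
        for j in range(C):
            if i == 0 and j == 0:
                dp[i][j] = c[i][j]            # base case
                continue
            up = dp[i - 1][j] if i > 0 else INF
            left = dp[i][j - 1] if j > 0 else INF
            if up <= left:
                dp[i][j] = c[i][j] + up
                parent[i][j] = (i - 1, j)
            else:
                dp[i][j] = c[i][j] + left
                parent[i][j] = (i, j - 1)
    # Trace parent pointers backward from the goal to recover the path.
    path, cell = [], (R - 1, C - 1)
    while cell is not None:
        path.append(cell)
        cell = parent[cell[0]][cell[1]]
    return dp[R - 1][C - 1], path[::-1]
```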

The same problem can also be solved in the Bellman / cost-to-go direction. Instead of asking “what is the cheapest cost to arrive here?”, define

\[V[i,j] = \text{minimum remaining cost from } (i,j) \text{ to } (R-1,C-1).\]

Now the dependencies point forward along the path:

\[V[i,j] = c[i,j] + \min\!\big(V[i+1,j],\; V[i,j+1]\big),\]

with boundary condition \(V[R-1,C-1] = c[R-1,C-1]\). Cells on the last row can only move right, and cells on the last column can only move down. To make the dependencies available, fill the table backward: from bottom-right toward top-left, for example by decreasing \(i\) and \(j\). The answer is then \(V[0,0]\).

This is the Bellman optimality equation in grid clothing: at state \((i,j)\), choose the action (right or down) whose one-step cost plus optimal future cost is smallest.

Press Step to fill one cell at a time. The yellow cell is the one being computed; dashed arrows mark its dependencies (up and/or left). When every cell is filled, the red polyline shows the optimal path recovered from parent pointers. Randomize to try a new cost grid.

Two details worth noting:

  • Fill order matters, but only mildly. Any order that processes \((i,j)\) after both \((i-1,j)\) and \((i,j-1)\) works — row-major, column-major, or diagonal sweeps all produce the same table.

  • Space can be \(O(C)\), not \(O(RC)\). Each row depends only on the previous row, so a rolling 1D array suffices — a common DP space-saving trick, at the cost of losing easy path recovery.
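The rolling-array trick can be sketched like this (hypothetical name `min_path_sum_1d`; same right/down moves as above):

```python
def min_path_sum_1d(c):
    R, C = len(c), len(c[0])
    row = [0] * C
    # First grid row: only moves from the left are possible.
    row[0] = c[0][0]
    for j in range(1, C):
        row[j] = row[j - 1] + c[0][j]
    for i in range(1, R):
        row[0] += c[i][0]                      # first column: only from above
        for j in range(1, C):
            # row[j] still holds dp[i-1][j]; row[j-1] already holds dp[i][j-1].
            row[j] = c[i][j] + min(row[j], row[j - 1])
    return row[-1]
```

The key invariant is that `row[j]` is read before it is overwritten, so at that moment it still holds the previous row's value — exactly the \(dp[i-1,j]\) the transition needs.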

Why This Matters Beyond Algorithms

The grid recurrence is a finite, deterministic special case of the Bellman optimality equation. Read \(dp[i,j]\) as the optimal cost-to-go from \((i,j)\) to the goal (flipping the direction), and you get

\[V^\star(s) = \min_{a}\, \big[ c(s,a) + V^\star(s') \big],\]

which is exactly the Bellman equation for a deterministic shortest-path MDP with no discounting. Filling the DP table row-by-row is a form of value iteration that converges in one sweep because the state graph is a DAG (no cycles, so no need to iterate to a fixed point). The moment the MDP has stochastic transitions, loops, or a discount factor, the one-sweep trick breaks — and we get the full iterative value iteration / policy iteration machinery that makes up a large part of reinforcement learning theory.

This connection is why DP vocabulary (value function, Bellman backup, optimal substructure) dominates RL even though RL typically solves problems where the state graph is too large to tabulate. The algorithmic ideas transfer; only the implementation (function approximation, sampling) changes.

Worked Example: Bellman Backup on a 3×3 Grid

Take the canonical cost grid

\[c = \begin{bmatrix} 1 & 3 & 1 \\ 1 & 5 & 1 \\ 4 & 2 & 1 \end{bmatrix},\]

with goal \((R-1,C-1) = (2,2)\). Apply the Bellman recurrence

\[V[i,j] = c[i,j] + \min\!\big(V[i+1,j],\; V[i,j+1]\big)\]

starting from the boundary \(V[2,2] = c[2,2] = 1\) and sweeping toward \((0,0)\). Cells on the last row have only \(V[i,j+1]\) available; cells on the last column have only \(V[i+1,j]\).

step cell computation \(V\)
1 \((2,2)\) base case \(1\)
2 \((2,1)\) \(2 + V[2,2]\) \(3\)
3 \((2,0)\) \(4 + V[2,1]\) \(7\)
4 \((1,2)\) \(1 + V[2,2]\) \(2\)
5 \((1,1)\) \(5 + \min(2, 3)\) \(7\)
6 \((1,0)\) \(1 + \min(7, 7)\) \(8\)
7 \((0,2)\) \(1 + V[1,2]\) \(3\)
8 \((0,1)\) \(3 + \min(3, 7)\) \(6\)
9 \((0,0)\) \(1 + \min(6, 8)\) \(7\)

The completed value table is

\[V = \begin{bmatrix} 7 & 6 & 3 \\ 8 & 7 & 2 \\ 7 & 3 & 1 \end{bmatrix}.\]

The optimal policy at each cell is the action attaining the \(\min\) in the Bellman update — i.e. \(\pi^\star(i,j) = \arg\min\{V[i+1,j], V[i,j+1]\}\):

\[(0,0) \xrightarrow{\text{right}} (0,1) \xrightarrow{\text{right}} (0,2) \xrightarrow{\text{down}} (1,2) \xrightarrow{\text{down}} (2,2),\]

with running cost \(1 + 3 + 1 + 1 + 1 = 7\), exactly \(V[0,0]\).

def min_path_bellman(c):
    R, C = len(c), len(c[0])
    V = [[0] * C for _ in range(R)]
    V[R-1][C-1] = c[R-1][C-1]
    for j in range(C-2, -1, -1):                    # last row: only right
        V[R-1][j] = c[R-1][j] + V[R-1][j+1]
    for i in range(R-2, -1, -1):                    # last column: only down
        V[i][C-1] = c[i][C-1] + V[i+1][C-1]
    for i in range(R-2, -1, -1):                    # interior: backward sweep
        for j in range(C-2, -1, -1):
            V[i][j] = c[i][j] + min(V[i+1][j], V[i][j+1])
    return V[0][0]

Two things worth pointing out:

  • One sweep, not value iteration. The dependency graph is a DAG (each cell depends only on cells strictly closer to the goal), so a single backward pass solves Bellman exactly. In a stochastic or cyclic MDP we would have to iterate \(V \leftarrow \mathcal{T}V\) until convergence.

  • No parent pointers needed. The optimal path is implicit in \(V\): at any state, take the action whose successor has the smaller \(V\). Storing \(V\) alone is enough to reconstruct \(\pi^\star\) on demand.
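That on-demand reconstruction can be sketched as follows (hypothetical helper `greedy_path`, taking a completed cost-to-go table \(V\)):

```python
def greedy_path(V):
    # Follow the smaller-V successor from (0,0) until the goal is reached.
    R, C = len(V), len(V[0])
    i = j = 0
    path = [(0, 0)]
    while (i, j) != (R - 1, C - 1):
        down = V[i + 1][j] if i + 1 < R else float("inf")
        right = V[i][j + 1] if j + 1 < C else float("inf")
        if right <= down:
            j += 1
        else:
            i += 1
        path.append((i, j))
    return path
```

On the worked 3×3 table this recovers the same right, right, down, down path as the parent-pointer approach.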

Takeaways

  • DP is just caching plus a recurrence — the hard part is usually picking a good state.
  • Top-down memoization and bottom-up tabulation are interchangeable; choose for readability or stack safety.
  • If you can write the problem as “the optimal solution is some function of the optimal solutions to smaller subproblems,” a DP is sitting there waiting to be written.
  • The Bellman equation is DP lifted to stochastic, possibly-looping environments — same idea, different plumbing.
  • When the state graph has cycles or stochastic transitions, the one-sweep trick generalizes to value iteration: initialize \(V_0\) arbitrarily and repeatedly apply the Bellman operator \(V_{k+1} = \mathcal{T} V_k\) until \(\|V_{k+1} - V_k\|_\infty < \varepsilon\). For discount \(\gamma < 1\), \(\mathcal{T}\) is a \(\gamma\)-contraction in sup-norm, so the iterates converge geometrically to the unique fixed point \(V^\star\). The grid is the lucky DAG case where a single backward sweep already lands on \(V^\star\).
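The value-iteration loop described in the takeaway above can be sketched for a small finite MDP. Everything here is a hypothetical illustration: `model(s, a)` is assumed to return `(prob, cost, next_state)` triples, and the loop applies the Bellman operator until the sup-norm change drops below `eps`:

```python
def value_iteration(states, actions, model, gamma=0.9, eps=1e-8):
    # model(s, a) -> list of (prob, cost, next_state) triples.
    V = {s: 0.0 for s in states}
    while True:
        # One synchronous Bellman backup: V_new = T V.
        V_new = {
            s: min(sum(p * (c + gamma * V[s2]) for p, c, s2 in model(s, a))
                   for a in actions)
            for s in states
        }
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < eps:
            return V
```

With \(\gamma < 1\) the sup-norm gap to \(V^\star\) shrinks by a factor of \(\gamma\) per sweep, so the loop terminates; on a DAG like the grid, values become exact after a number of sweeps bounded by the longest path length.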