Policy Improvement Theorem

This post presents the policy improvement theorem, a foundational result in reinforcement learning that guarantees monotonic improvement when we greedily improve a policy. The exposition follows Nan Jiang's CS 443 lecture notes at UIUC.

Setup

Consider an MDP with states \(s \in \mathcal{S}\), actions \(a \in \mathcal{A}\), transition \(P(s' \vert s,a)\), reward \(r(s,a)\), and discount factor \(\gamma \in [0,1)\). For a policy \(\pi\), define the value function and action-value function:

\[V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle\vert\; s_0 = s\right],\] \[Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle\vert\; s_0 = s,\, a_0 = a\right].\]

The advantage function measures how much better action \(a\) is compared to following \(\pi\):

\[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s).\]

Note that \(\mathbb{E}_{a \sim \pi(\cdot \vert s)}[A^\pi(s,a)] = 0\) for all \(s\).
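
As a quick sanity check, the zero-mean property of the advantage can be verified numerically. The sketch below uses a tiny 2-state, 2-action MDP with made-up transitions and rewards (all numbers are hypothetical, chosen only for illustration), evaluating \(\pi\) by iterating the Bellman backup:

```python
# Tiny 2-state, 2-action MDP (hypothetical numbers, for illustration only).
# P[s][a]: deterministic next state; R[s][a]: reward.
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}
GAMMA = 0.9
ACTIONS = [0, 1]

def evaluate(pi, tol=1e-12):
    """Iterative policy evaluation: V(s) <- E_{a~pi}[r(s,a) + gamma * V(s')]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * (R[s][a] + GAMMA * V[P[s][a]]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# A stochastic policy pi(a|s).
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.8, 1: 0.2}}
V = evaluate(pi)
Q = {s: {a: R[s][a] + GAMMA * V[P[s][a]] for a in ACTIONS} for s in P}
A = {s: {a: Q[s][a] - V[s] for a in ACTIONS} for s in P}

# E_{a ~ pi}[A^pi(s, a)] = 0 at every state.
for s in P:
    assert abs(sum(pi[s][a] * A[s][a] for a in ACTIONS)) < 1e-9
```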

The Policy Improvement Theorem

Theorem (Policy Improvement). Let \(\pi\) be any policy, and let \(\pi'\) be a policy satisfying

\[\mathbb{E}_{a \sim \pi'(\cdot \vert s)}[A^\pi(s,a)] \geq 0, \quad \forall s \in \mathcal{S}.\]

Then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s \in \mathcal{S}\).

The condition says: at every state, actions chosen by \(\pi'\) are at least as good as those chosen by \(\pi\), as evaluated by \(\pi\)’s own value function. The conclusion is global: the new policy is better everywhere.

Why is the advantage evaluated under \(\pi\), not \(\pi'\)? Evaluating under \(\pi'\) itself is vacuous: by definition of advantage,

\[\mathbb{E}_{a \sim \pi'(\cdot \vert s)}[A^{\pi'}(s,a)] = \mathbb{E}_{a \sim \pi'}[Q^{\pi'}(s,a)] - V^{\pi'}(s) = V^{\pi'}(s) - V^{\pi'}(s) = 0,\]

which holds for any \(\pi'\) and any state \(s\) — it tells us nothing about whether \(\pi'\) is better or worse.

Instead, \(\mathbb{E}_{a \sim \pi'(\cdot \vert s)}[A^\pi(s,a)]\) asks a non-trivial question: if I follow \(\pi'\) for one step at state \(s\) and then revert to \(\pi\) for all future steps, am I better off than following \(\pi\) the whole time? The theorem says that if the answer is “yes” (or “tie”) at every state, then following \(\pi'\) the entire time is also at least as good: each one-step improvement cascades forward via a telescoping argument. Crucially, the condition involves only \(V^\pi\) and \(Q^\pi\), which we have already computed, so there is no need to evaluate \(\pi'\) first. This bootstrapping is exactly what makes policy iteration work.
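
To make this concrete, here is a minimal sketch of one round of policy iteration on a toy MDP (all transition and reward numbers are hypothetical): evaluate \(\pi\), act greedily with respect to \(Q^\pi\) so that the theorem's condition holds by construction, and check that the new policy's value dominates everywhere.

```python
# One round of policy improvement on a tiny MDP (hypothetical numbers).
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}          # P[s][a]: next state
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}  # R[s][a]: reward
GAMMA = 0.9
ACTIONS = [0, 1]

def evaluate(pi, tol=1e-12):
    """Iterative policy evaluation for a stochastic policy pi(a|s)."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * (R[s][a] + GAMMA * V[P[s][a]]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.8, 1: 0.2}}
V_pi = evaluate(pi)
Q_pi = {s: {a: R[s][a] + GAMMA * V_pi[P[s][a]] for a in ACTIONS} for s in P}

# Greedy improvement: pi'(s) = argmax_a Q^pi(s, a). By construction,
# E_{a~pi'}[A^pi(s, a)] = max_a Q^pi(s, a) - V^pi(s) >= 0 at every state.
greedy = {s: max(ACTIONS, key=lambda a: Q_pi[s][a]) for s in P}
pi_new = {s: {a: float(a == greedy[s]) for a in ACTIONS} for s in P}
V_new = evaluate(pi_new)

# The theorem's conclusion: V^{pi'}(s) >= V^pi(s) at every state.
for s in P:
    assert V_new[s] >= V_pi[s] - 1e-9
```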

Try it yourself in the interactive grid world below. The left grid shows the current policy \(\pi\) (fixed). On the right grid, click the directional triangles to change \(\pi'\) at each state. Green triangles indicate positive advantage (\(A^\pi > 0\)); red indicates negative. The theorem predicts: if every arrow is green or neutral, then \(V^{\pi'}(s) \geq V^\pi(s)\) at every state.

Proof

The proof uses a telescoping argument (following Nan Jiang’s CS 443 notes). Start from \(V^\pi(s)\) and repeatedly expand:

\[\begin{aligned} V^\pi(s) &= \mathbb{E}_{a \sim \pi'}\!\left[Q^\pi(s,a)\right] - \mathbb{E}_{a \sim \pi'}\!\left[A^\pi(s,a)\right] && \text{(} Q = A + V \text{, re-arrange)} \\ &\leq \mathbb{E}_{a \sim \pi'}\!\left[Q^\pi(s,a)\right] && \text{(assumption: } \mathbb{E}_{\pi'}[A^\pi] \geq 0 \text{)} \\ &= \mathbb{E}_{a \sim \pi'}\!\left[r(s,a) + \gamma \mathbb{E}_{s'}[V^\pi(s')]\right] && \text{(Bellman equation for } Q^\pi \text{)} \\ &\leq \mathbb{E}_{a \sim \pi'}\!\left[r(s,a) + \gamma \mathbb{E}_{s'}\!\left[\mathbb{E}_{a' \sim \pi'}[Q^\pi(s',a')]\right]\right] && \text{(apply } Q = A + V \text{, assumption again at } s' \text{)} \\ &= \mathbb{E}_{\pi'}\!\left[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2)\right] && \text{(expand } Q^\pi(s',a') \text{ via Bellman)} \\ &\leq \cdots && \text{(repeat at every future state)} \\ &\leq \mathbb{E}_{\pi'}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] = V^{\pi'}(s). && \square \end{aligned}\]
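
The telescoping chain can also be traced numerically. Define \(f_k(s) = \mathbb{E}_{\pi'}\big[\sum_{t<k} \gamma^t r_t + \gamma^k V^\pi(s_k)\big]\), which satisfies \(f_0 = V^\pi\) and \(f_{k+1}(s) = \mathbb{E}_{a \sim \pi'}[r(s,a) + \gamma f_k(s')]\). The proof says each step of the chain is nondecreasing, with limit \(V^{\pi'}\). A sketch on a toy MDP (hypothetical numbers):

```python
# Trace the telescoping chain f_k from the proof on a tiny MDP (hypothetical
# numbers). f_0 = V^pi; f_{k+1}(s) = E_{a~pi'}[r(s,a) + gamma * f_k(s')].
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}
GAMMA = 0.9
ACTIONS = [0, 1]

def evaluate(pi, tol=1e-12):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * (R[s][a] + GAMMA * V[P[s][a]]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.8, 1: 0.2}}
V_pi = evaluate(pi)
Q_pi = {s: {a: R[s][a] + GAMMA * V_pi[P[s][a]] for a in ACTIONS} for s in P}

# pi' greedy w.r.t. Q^pi, so the theorem's condition holds by construction.
greedy = {s: max(ACTIONS, key=lambda a: Q_pi[s][a]) for s in P}
pi_new = {s: {a: float(a == greedy[s]) for a in ACTIONS} for s in P}

f = dict(V_pi)  # f_0 = V^pi
for k in range(200):
    f_next = {s: sum(pi_new[s][a] * (R[s][a] + GAMMA * f[P[s][a]])
                     for a in ACTIONS) for s in P}
    for s in P:
        assert f_next[s] >= f[s] - 1e-9  # each expansion step is nondecreasing
    f = f_next

# The chain converges to V^{pi'}.
V_new = evaluate(pi_new)
for s in P:
    assert abs(f[s] - V_new[s]) < 1e-6
```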

Performance Difference Lemma

A more precise version gives the exact performance difference, not just an inequality. This is sometimes called the performance difference lemma (Kakade & Langford, 2002):

\[V^{\pi'}(s) - V^\pi(s) = \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d_{s}^{\pi'}}\!\left[\mathbb{E}_{a \sim \pi'(\cdot \vert s')}[A^\pi(s',a)]\right],\]

where \(d_{s}^{\pi'}(s') = (1-\gamma) \sum_{t=0}^\infty \gamma^t \Pr(s_t = s' \vert s_0 = s, \pi')\) is the discounted state visitation distribution under \(\pi'\) starting from \(s\).

This lemma is the backbone of modern policy optimization: it tells us that improving the expected advantage under the new policy’s state distribution leads to global improvement. The challenge is that \(d_s^{\pi'}\) depends on \(\pi'\) itself, creating a chicken-and-egg problem.
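
The lemma can be checked numerically on a small MDP by estimating \(d_s^{\pi'}\) with a truncated discounted sum (all MDP numbers below are hypothetical; note the identity holds for any \(\pi'\), improving or not):

```python
# Check the performance difference lemma on a tiny MDP (hypothetical numbers).
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}
GAMMA = 0.9
ACTIONS = [0, 1]
STATES = list(P)

def evaluate(pi, tol=1e-12):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * (R[s][a] + GAMMA * V[P[s][a]]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def visitation(pi, s0, horizon=1000):
    """d_{s0}^{pi}(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s_0, pi),
    approximated by truncating the sum at `horizon`."""
    dist = {s: float(s == s0) for s in STATES}  # Pr(s_t = s)
    d = {s: 0.0 for s in STATES}
    for t in range(horizon):
        for s in STATES:
            d[s] += (1 - GAMMA) * GAMMA ** t * dist[s]
        nxt = {s: 0.0 for s in STATES}
        for s in STATES:
            for a in ACTIONS:
                nxt[P[s][a]] += dist[s] * pi[s][a]
        dist = nxt
    return d

pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.8, 1: 0.2}}      # old policy
pi_new = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.1, 1: 0.9}}  # new policy (arbitrary)

V_pi, V_new = evaluate(pi), evaluate(pi_new)
Q_pi = {s: {a: R[s][a] + GAMMA * V_pi[P[s][a]] for a in ACTIONS} for s in P}
A_pi = {s: {a: Q_pi[s][a] - V_pi[s] for a in ACTIONS} for s in P}

for s0 in STATES:
    d = visitation(pi_new, s0)
    rhs = sum(d[s] * sum(pi_new[s][a] * A_pi[s][a] for a in ACTIONS)
              for s in STATES) / (1 - GAMMA)
    # Exact identity: V^{pi'}(s0) - V^pi(s0) equals the RHS of the lemma.
    assert abs((V_new[s0] - V_pi[s0]) - rhs) < 1e-6
```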

References