Policy Improvement Theorem
Setup
Consider an MDP with states \(s \in \mathcal{S}\), actions \(a \in \mathcal{A}\), transition kernel \(P(s' \vert s,a)\), reward \(r(s,a)\), and discount factor \(\gamma \in [0,1)\). For a policy \(\pi\), define the value function and action-value function:
\[V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \,\middle\vert\, s_0 = s,\ a_t \sim \pi(\cdot \vert s_t)\right], \qquad Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \vert s,a)}\!\left[V^\pi(s')\right].\]
The advantage function measures how much better action \(a\) is compared to following \(\pi\):
\[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s).\]
Note that \(\mathbb{E}_{a \sim \pi(\cdot \vert s)}[A^\pi(s,a)] = 0\) for all \(s\).
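These definitions, including the zero-mean property of the advantage, can be checked numerically. Below is a minimal sketch on a tiny MDP with made-up random transitions and rewards (all the numbers are illustrative, not from the text); \(V^\pi\) is computed exactly by solving the linear Bellman system.

```python
import numpy as np

# Tiny random MDP; the numbers are made up purely for illustration.
rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)          # P[s, a, s'] = P(s' | s, a)
R = rng.random((nS, nA))                   # r(s, a)
pi = rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True)        # pi(a | s)

# Exact policy evaluation: V^pi solves (I - gamma * P_pi) V = r_pi.
P_pi = np.einsum('sa,sat->st', pi, P)      # state-to-state kernel under pi
r_pi = np.einsum('sa,sa->s', pi, R)        # expected one-step reward under pi
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
Q = R + gamma * P @ V                      # Q^pi(s,a) = r(s,a) + gamma E[V^pi(s')]
A = Q - V[:, None]                         # A^pi(s,a) = Q^pi(s,a) - V^pi(s)

print(np.einsum('sa,sa->s', pi, A))        # ≈ 0 at every state, up to float error
```

The final line checks \(\mathbb{E}_{a \sim \pi(\cdot \vert s)}[A^\pi(s,a)] = 0\) at every state.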
The Policy Improvement Theorem
Theorem (Policy Improvement). Let \(\pi\) be any policy, and let \(\pi'\) be another policy such that
\[\mathbb{E}_{a \sim \pi'(\cdot \vert s)}[A^\pi(s,a)] \geq 0, \quad \forall s \in \mathcal{S}.\]Then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s\).
The condition says: at every state, actions chosen by \(\pi'\) are at least as good as those chosen by \(\pi\), as evaluated by \(\pi\)’s own value function. The conclusion is global: the new policy is better everywhere.
Why is the advantage evaluated under \(\pi\), not \(\pi'\)? Evaluating under \(\pi'\) itself is vacuous: by definition of advantage,
\[\mathbb{E}_{a \sim \pi'(\cdot \vert s)}[A^{\pi'}(s,a)] = \mathbb{E}_{a \sim \pi'}[Q^{\pi'}(s,a)] - V^{\pi'}(s) = V^{\pi'}(s) - V^{\pi'}(s) = 0,\]which holds for any \(\pi'\) and any state \(s\) — it tells us nothing about whether \(\pi'\) is better or worse.
Instead, \(\mathbb{E}_{a \sim \pi'(\cdot \vert s)}[A^\pi(s,a)]\) asks a non-trivial question: if I follow \(\pi'\) for one step at state \(s\), then revert to \(\pi\) for all future steps, am I better off than following \(\pi\) the whole time? The theorem says: if the answer is “yes” (or “tie”) at every state, then following \(\pi'\) the entire time is also better; each one-step improvement cascades forward via the telescoping argument. Crucially, we only need \(V^\pi\) and \(Q^\pi\), which we have already computed. There is no need to evaluate \(\pi'\) first — this bootstrap is exactly what makes policy iteration work.
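The policy-iteration use of the theorem can be sketched numerically. The snippet below (random MDP, illustrative numbers only) builds the greedy policy \(\pi'\) with respect to \(Q^\pi\), which puts all mass on \(\arg\max_a Q^\pi(s,a)\) and therefore satisfies the theorem's condition by construction, and checks that \(V^{\pi'} \geq V^\pi\) at every state.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)          # P[s, a, s']
R = rng.random((nS, nA))

def evaluate(pi):
    """Exact V^pi and Q^pi via the linear system (I - gamma * P_pi) V = r_pi."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, R)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return V, R + gamma * P @ V

pi = np.full((nS, nA), 1 / nA)             # start from the uniform policy
V, Q = evaluate(pi)

# Greedy pi' puts all mass on argmax_a Q^pi(s, a), so
# E_{a ~ pi'}[A^pi(s, a)] = max_a A^pi(s, a) >= 0 at every state.
pi_new = np.zeros_like(pi)
pi_new[np.arange(nS), Q.argmax(axis=1)] = 1.0
V_new, _ = evaluate(pi_new)

print((V_new >= V - 1e-12).all())          # True: improvement at every state
```

Note that only \(Q^\pi\) (for the old policy) is needed to construct \(\pi'\); \(V^{\pi'}\) is computed here purely to verify the theorem's conclusion.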
Try it yourself in the interactive grid world below. The left grid shows the current policy \(\pi\) (fixed). On the right grid, click the directional triangles to change \(\pi'\) at each state. Green triangles indicate positive advantage (\(A^\pi > 0\)), red indicates negative. The theorem predicts: if all arrows are green or neutral, \(V^{\pi'}\) must improve everywhere.
Proof
The proof uses a telescoping argument (following Nan Jiang’s CS 443 notes). Start from \(V^\pi(s)\) and repeatedly expand:
\[\begin{aligned}
V^\pi(s) &\leq \mathbb{E}_{a \sim \pi'(\cdot \vert s)}\left[Q^\pi(s,a)\right] && \text{(assumption: } \mathbb{E}_{a \sim \pi'}[A^\pi(s,a)] \geq 0\text{)}\\
&= \mathbb{E}_{a \sim \pi'(\cdot \vert s)}\!\left[r(s,a) + \gamma\, \mathbb{E}_{s_1 \sim P(\cdot \vert s,a)}[V^\pi(s_1)]\right]\\
&\leq \mathbb{E}_{a \sim \pi'(\cdot \vert s)}\!\left[r(s,a) + \gamma\, \mathbb{E}_{s_1}\mathbb{E}_{a_1 \sim \pi'(\cdot \vert s_1)}[Q^\pi(s_1,a_1)]\right]\\
&\leq \cdots \leq \mathbb{E}\!\left[\sum_{t=0}^{T-1} \gamma^t r(s_t,a_t) + \gamma^T V^\pi(s_T) \,\middle\vert\, s_0 = s,\ \pi'\right].
\end{aligned}\]
Each step applies the assumption once, swapping one more action from \(\pi\) to \(\pi'\). Since \(\gamma < 1\) and \(V^\pi\) is bounded, the remainder term \(\gamma^T\, \mathbb{E}[V^\pi(s_T)]\) vanishes as \(T \to \infty\), leaving \(V^\pi(s) \leq V^{\pi'}(s)\).
Performance Difference Lemma
A more precise version gives the exact performance difference, not just an inequality. This is sometimes called the performance difference lemma (Kakade & Langford, 2002):
\[V^{\pi'}(s) - V^\pi(s) = \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d_{s}^{\pi'}}\!\left[\mathbb{E}_{a \sim \pi'(\cdot \vert s')}[A^\pi(s',a)]\right],\]where \(d_{s}^{\pi'}(s') = (1-\gamma) \sum_{t=0}^\infty \gamma^t \Pr(s_t = s' \vert s_0 = s, \pi')\) is the discounted state visitation distribution under \(\pi'\) starting from \(s\).
This lemma is the backbone of modern policy optimization: it tells us that improving the expected advantage under the new policy’s state distribution leads to global improvement. The challenge is that \(d_s^{\pi'}\) depends on \(\pi'\) itself, creating a chicken-and-egg problem.
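In the tabular case the lemma can be verified exactly, since \(d_s^{\pi'}\) has the closed form \((1-\gamma)\,e_s^\top (I - \gamma P_{\pi'})^{-1}\). The sketch below (random MDP with made-up numbers, and an arbitrary \(\pi'\) that need not improve on \(\pi\)) checks both sides of the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)          # P[s, a, s']
R = rng.random((nS, nA))

def evaluate(pi):
    """Exact V^pi, Q^pi, and the state-to-state kernel P_pi."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, R)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return V, R + gamma * P @ V, P_pi

pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
pi2 = rng.random((nS, nA)); pi2 /= pi2.sum(axis=1, keepdims=True)  # arbitrary pi'

V1, Q1, _ = evaluate(pi)
V2, _, P_pi2 = evaluate(pi2)
A1 = Q1 - V1[:, None]                      # A^pi(s, a)

# Row s of D is the visitation distribution d_s^{pi'}:
# D = (1 - gamma) (I - gamma * P_{pi'})^{-1}; each row sums to 1.
D = (1 - gamma) * np.linalg.inv(np.eye(nS) - gamma * P_pi2)
adv = np.einsum('sa,sa->s', pi2, A1)       # E_{a ~ pi'}[A^pi(s', a)]

lhs = V2 - V1                              # exact performance difference
rhs = D @ adv / (1 - gamma)                # lemma's right-hand side
print(np.max(np.abs(lhs - rhs)))           # ≈ 0: the identity holds exactly
```

Because \(\pi'\) here is arbitrary, the advantage terms can be negative; the lemma still holds as an exact identity, unlike the theorem's one-sided inequality.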
References
- Nan Jiang, CS 443: Reinforcement Learning, UIUC.
- Sham Kakade and John Langford, “Approximately Optimal Approximate Reinforcement Learning,” ICML 2002.
- Sutton and Barto, Reinforcement Learning: An Introduction, Section 4.2.
- Alekh Agarwal, Nan Jiang, Sham Kakade, and Wen Sun, Reinforcement Learning: Theory and Algorithms.