Policy Improvement Theorem

This post presents the policy improvement theorem, a foundational result in reinforcement learning that guarantees monotonic improvement when we greedily improve a policy. The exposition follows Nan Jiang's CS 443 lecture notes at UIUC.

Setup

Consider an MDP with states \(s \in \mathcal{S}\), actions \(a \in \mathcal{A}\), transition \(P(s' \vert s,a)\), reward \(r(s,a)\), and discount factor \(\gamma \in [0,1)\). For a policy \(\pi\), define the value function and action-value function:

\[V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle\vert\; s_0 = s\right],\] \[Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle\vert\; s_0 = s,\, a_0 = a\right].\]

The advantage function measures how much better action \(a\) is compared to following \(\pi\):

\[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s).\]

Note that \(\mathbb{E}_{a \sim \pi(\cdot \vert s)}[A^\pi(s,a)] = 0\) for all \(s\).
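
As a quick sanity check, the zero-mean property of the advantage can be verified numerically. The sketch below uses a tiny 2-state, 2-action MDP with made-up transitions and rewards (all numbers are hypothetical, chosen only for illustration), evaluating \(\pi\) by iterating the Bellman backup:

```python
# Tiny 2-state, 2-action MDP (hypothetical numbers, for illustration only).
# P[s][a]: deterministic next state; R[s][a]: reward.
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}
GAMMA = 0.9
ACTIONS = [0, 1]

def evaluate(pi, tol=1e-12):
    """Iterative policy evaluation: V(s) <- E_{a~pi}[r(s,a) + gamma * V(s')]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * (R[s][a] + GAMMA * V[P[s][a]]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# A stochastic policy pi(a|s).
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.8, 1: 0.2}}
V = evaluate(pi)
Q = {s: {a: R[s][a] + GAMMA * V[P[s][a]] for a in ACTIONS} for s in P}
A = {s: {a: Q[s][a] - V[s] for a in ACTIONS} for s in P}

# E_{a ~ pi}[A^pi(s, a)] = 0 at every state.
for s in P:
    assert abs(sum(pi[s][a] * A[s][a] for a in ACTIONS)) < 1e-9
```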

The Policy Improvement Theorem

Theorem (Policy Improvement). Let \(\pi\) be any policy, and let \(\pi'\) be a policy satisfying

\[\mathbb{E}_{a \sim \pi'(\cdot \vert s)}[A^\pi(s,a)] \geq 0, \quad \forall s \in \mathcal{S}.\]

Then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s \in \mathcal{S}\).

The condition says: at every state, actions chosen by \(\pi'\) are at least as good as those chosen by \(\pi\), as evaluated by \(\pi\)’s own value function. The conclusion is global: the new policy is better everywhere.

Why is the advantage evaluated under \(\pi\), not \(\pi'\)? Evaluating under \(\pi'\) itself is vacuous: by definition of advantage,

\[\mathbb{E}_{a \sim \pi'(\cdot \vert s)}[A^{\pi'}(s,a)] = \mathbb{E}_{a \sim \pi'}[Q^{\pi'}(s,a)] - V^{\pi'}(s) = V^{\pi'}(s) - V^{\pi'}(s) = 0,\]

which holds for any \(\pi'\) and any state \(s\) — it tells us nothing about whether \(\pi'\) is better or worse.

Instead, \(\mathbb{E}_{a \sim \pi'(\cdot \vert s)}[A^\pi(s,a)]\) asks a non-trivial question: if I follow \(\pi'\) for one step at state \(s\) and then revert to \(\pi\) for all future steps, am I better off than following \(\pi\) the whole time? The theorem says that if the answer is “yes” (or “tie”) at every state, then following \(\pi'\) the entire time is also at least as good: each one-step improvement cascades forward via a telescoping argument. Crucially, the condition involves only \(V^\pi\) and \(Q^\pi\), which we have already computed, so there is no need to evaluate \(\pi'\) first. This bootstrapping is exactly what makes policy iteration work.
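
To make this concrete, here is a minimal sketch of one round of policy iteration on a toy MDP (all transition and reward numbers are hypothetical): evaluate \(\pi\), act greedily with respect to \(Q^\pi\) so that the theorem's condition holds by construction, and check that the new policy's value dominates everywhere.

```python
# One round of policy improvement on a tiny MDP (hypothetical numbers).
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}          # P[s][a]: next state
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}  # R[s][a]: reward
GAMMA = 0.9
ACTIONS = [0, 1]

def evaluate(pi, tol=1e-12):
    """Iterative policy evaluation for a stochastic policy pi(a|s)."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * (R[s][a] + GAMMA * V[P[s][a]]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.8, 1: 0.2}}
V_pi = evaluate(pi)
Q_pi = {s: {a: R[s][a] + GAMMA * V_pi[P[s][a]] for a in ACTIONS} for s in P}

# Greedy improvement: pi'(s) = argmax_a Q^pi(s, a). By construction,
# E_{a~pi'}[A^pi(s, a)] = max_a Q^pi(s, a) - V^pi(s) >= 0 at every state.
greedy = {s: max(ACTIONS, key=lambda a: Q_pi[s][a]) for s in P}
pi_new = {s: {a: float(a == greedy[s]) for a in ACTIONS} for s in P}
V_new = evaluate(pi_new)

# The theorem's conclusion: V^{pi'}(s) >= V^pi(s) at every state.
for s in P:
    assert V_new[s] >= V_pi[s] - 1e-9
```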

Try it yourself in the interactive grid world below. The left grid shows the current policy \(\pi\) (fixed). On the right grid, click the directional triangles to change \(\pi'\) at each state. Green triangles indicate positive advantage (\(A^\pi > 0\)); red indicates negative. The theorem predicts: if every arrow is green or neutral, then \(V^{\pi'}(s) \geq V^\pi(s)\) at every state.

Proof

The proof uses a telescoping argument (following Nan Jiang’s CS 443 notes). Start from \(V^\pi(s)\) and repeatedly expand:

\[\begin{aligned} V^\pi(s) &= \mathbb{E}_{a \sim \pi'}\!\left[Q^\pi(s,a)\right] - \mathbb{E}_{a \sim \pi'}\!\left[A^\pi(s,a)\right] && \text{(} Q = A + V \text{, re-arrange)} \\ &\leq \mathbb{E}_{a \sim \pi'}\!\left[Q^\pi(s,a)\right] && \text{(assumption: } \mathbb{E}_{\pi'}[A^\pi] \geq 0 \text{)} \\ &= \mathbb{E}_{a \sim \pi'}\!\left[r(s,a) + \gamma \mathbb{E}_{s'}[V^\pi(s')]\right] && \text{(Bellman equation for } Q^\pi \text{)} \\ &\leq \mathbb{E}_{a \sim \pi'}\!\left[r(s,a) + \gamma \mathbb{E}_{s'}\!\left[\mathbb{E}_{a' \sim \pi'}[Q^\pi(s',a')]\right]\right] && \text{(apply } Q = A + V \text{, assumption again at } s' \text{)} \\ &= \mathbb{E}_{\pi'}\!\left[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2)\right] && \text{(expand } Q^\pi(s',a') \text{ via Bellman)} \\ &\leq \cdots && \text{(repeat at every future state)} \\ &\leq \mathbb{E}_{\pi'}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] = V^{\pi'}(s). && \square \end{aligned}\]
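
The telescoping chain can also be traced numerically. Define \(f_k(s) = \mathbb{E}_{\pi'}\big[\sum_{t<k} \gamma^t r_t + \gamma^k V^\pi(s_k)\big]\), which satisfies \(f_0 = V^\pi\) and \(f_{k+1}(s) = \mathbb{E}_{a \sim \pi'}[r(s,a) + \gamma f_k(s')]\). The proof says each step of the chain is nondecreasing, with limit \(V^{\pi'}\). A sketch on a toy MDP (hypothetical numbers):

```python
# Trace the telescoping chain f_k from the proof on a tiny MDP (hypothetical
# numbers). f_0 = V^pi; f_{k+1}(s) = E_{a~pi'}[r(s,a) + gamma * f_k(s')].
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}
GAMMA = 0.9
ACTIONS = [0, 1]

def evaluate(pi, tol=1e-12):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * (R[s][a] + GAMMA * V[P[s][a]]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.8, 1: 0.2}}
V_pi = evaluate(pi)
Q_pi = {s: {a: R[s][a] + GAMMA * V_pi[P[s][a]] for a in ACTIONS} for s in P}

# pi' greedy w.r.t. Q^pi, so the theorem's condition holds by construction.
greedy = {s: max(ACTIONS, key=lambda a: Q_pi[s][a]) for s in P}
pi_new = {s: {a: float(a == greedy[s]) for a in ACTIONS} for s in P}

f = dict(V_pi)  # f_0 = V^pi
for k in range(200):
    f_next = {s: sum(pi_new[s][a] * (R[s][a] + GAMMA * f[P[s][a]])
                     for a in ACTIONS) for s in P}
    for s in P:
        assert f_next[s] >= f[s] - 1e-9  # each expansion step is nondecreasing
    f = f_next

# The chain converges to V^{pi'}.
V_new = evaluate(pi_new)
for s in P:
    assert abs(f[s] - V_new[s]) < 1e-6
```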

Performance Difference Lemma

A more precise version gives the exact performance difference, not just an inequality. This is sometimes called the performance difference lemma (Kakade & Langford, 2002):

\[V^{\pi'}(s) - V^\pi(s) = \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d_{s}^{\pi'}}\!\left[\mathbb{E}_{a \sim \pi'(\cdot \vert s')}[A^\pi(s',a)]\right],\]

where \(d_{s}^{\pi'}(s') = (1-\gamma) \sum_{t=0}^\infty \gamma^t \Pr(s_t = s' \vert s_0 = s, \pi')\) is the discounted state visitation distribution under \(\pi'\) starting from \(s\).

This lemma is the backbone of modern policy optimization: it tells us that improving the expected advantage under the new policy’s state distribution leads to global improvement. The challenge is that \(d_s^{\pi'}\) depends on \(\pi'\) itself, creating a chicken-and-egg problem.
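
The lemma can be checked numerically on a small MDP by estimating \(d_s^{\pi'}\) with a truncated discounted sum (all MDP numbers below are hypothetical; note the identity holds for any \(\pi'\), improving or not):

```python
# Check the performance difference lemma on a tiny MDP (hypothetical numbers).
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}
GAMMA = 0.9
ACTIONS = [0, 1]
STATES = list(P)

def evaluate(pi, tol=1e-12):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * (R[s][a] + GAMMA * V[P[s][a]]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def visitation(pi, s0, horizon=1000):
    """d_{s0}^{pi}(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s_0, pi),
    approximated by truncating the sum at `horizon`."""
    dist = {s: float(s == s0) for s in STATES}  # Pr(s_t = s)
    d = {s: 0.0 for s in STATES}
    for t in range(horizon):
        for s in STATES:
            d[s] += (1 - GAMMA) * GAMMA ** t * dist[s]
        nxt = {s: 0.0 for s in STATES}
        for s in STATES:
            for a in ACTIONS:
                nxt[P[s][a]] += dist[s] * pi[s][a]
        dist = nxt
    return d

pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.8, 1: 0.2}}      # old policy
pi_new = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.1, 1: 0.9}}  # new policy (arbitrary)

V_pi, V_new = evaluate(pi), evaluate(pi_new)
Q_pi = {s: {a: R[s][a] + GAMMA * V_pi[P[s][a]] for a in ACTIONS} for s in P}
A_pi = {s: {a: Q_pi[s][a] - V_pi[s] for a in ACTIONS} for s in P}

for s0 in STATES:
    d = visitation(pi_new, s0)
    rhs = sum(d[s] * sum(pi_new[s][a] * A_pi[s][a] for a in ACTIONS)
              for s in STATES) / (1 - GAMMA)
    # Exact identity: V^{pi'}(s0) - V^pi(s0) equals the RHS of the lemma.
    assert abs((V_new[s0] - V_pi[s0]) - rhs) < 1e-6
```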

References