The Goal: Monotonic Policy Improvement

Given a policy $\pi$, can we find a better one $\pi'$?

In RL, we evaluate policies by their value function:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle|\; s_0 = s\right]$$

The advantage tells us how much better action $a$ is versus following $\pi$:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

Note: $\mathbb{E}_{a \sim \pi}[A^\pi(s,a)] = 0$ — the advantage is zero on average under $\pi$.
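These definitions can be checked numerically. A minimal sketch on a hypothetical 2-state, 2-action MDP (all transition probabilities and rewards below are illustrative, not from the source): solve the Bellman equation for $V^\pi$ exactly, form $Q^\pi$ and $A^\pi$, and confirm the advantage averages to zero under $\pi$:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers are illustrative only).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])  # P[s, a, s'] transition probs
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                # r[s, a] rewards
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])               # uniform policy

# Exact policy evaluation: V^pi = (I - gamma * P_pi)^{-1} r_pi
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sa->s', pi, r)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Q^pi(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) V^pi(s')
Q = r + gamma * P @ V
A = Q - V[:, None]

# E_{a ~ pi}[A^pi(s, a)] = 0 at every state
print((pi * A).sum(axis=1))  # ≈ [0, 0]
```

The zero-mean identity falls out immediately: averaging $Q^\pi$ under $\pi$ reproduces $V^\pi$ by definition, so the centered quantity averages to zero.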

If $\pi'$ picks actions with nonnegative advantage at every state, it can never be worse — and it strictly improves wherever the advantage is strictly positive.

This is the policy improvement theorem.

Policy Improvement Theorem

The foundational guarantee for policy iteration

Theorem. If for all states $s \in \mathcal{S}$:

$$\mathbb{E}_{a \sim \pi'(\cdot|s)}[A^\pi(s,a)] \geq 0$$

then $V^{\pi'}(s) \geq V^{\pi}(s)$ for all $s$.

The condition is local (check each state independently).

The conclusion is global (the value improves everywhere).

The greedy policy $\pi'(s) = \arg\max_a Q^\pi(s,a)$ always satisfies this condition, since $\max_a Q^\pi(s,a) \geq \mathbb{E}_{a \sim \pi}[Q^\pi(s,a)] = V^\pi(s)$!

This is why policy iteration converges to the optimal policy.
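The theorem can be seen in action with a minimal policy-iteration sketch. The MDP below is hypothetical (randomly generated, purely for illustration); the assertion inside the loop is exactly the theorem's guarantee that each greedy step is monotone:

```python
import numpy as np

# Policy iteration on a hypothetical random 3-state, 2-action MDP.
rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a dist. over s'
r = rng.uniform(size=(nS, nA))

def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

pi = np.full((nS, nA), 1.0 / nA)  # start from the uniform policy
V_prev = evaluate(pi)
for _ in range(20):
    Q = r + gamma * np.einsum('sat,t->sa', P, V_prev)
    pi = np.eye(nA)[Q.argmax(axis=1)]   # greedy improvement step
    V = evaluate(pi)
    assert (V >= V_prev - 1e-10).all()  # monotone, per the theorem
    if np.allclose(V, V_prev):          # greedy step changed nothing:
        break                           # Bellman optimality holds
    V_prev = V
```

When the loop exits, $V$ satisfies $V(s) = \max_a Q(s,a)$, i.e. the fixed point of greedy improvement is the optimal value function.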

Proof: The Telescoping Argument

Following Nan Jiang's CS 443 lecture notes

At each step, replace $V^\pi(s')$ with $\mathbb{E}_{a' \sim \pi'}[Q^\pi(s',a')]$ — this can only increase the value, since $\mathbb{E}_{\pi'}[A^\pi] \geq 0$ means $\mathbb{E}_{a' \sim \pi'}[Q^\pi(s',a')] \geq V^\pi(s')$.

The inequality cascades forward, and the telescoping produces $V^{\pi'}(s)$.

Each step uses $Q^\pi = A^\pi + V^\pi$ and the assumption $\mathbb{E}_{\pi'}[A^\pi] \geq 0$.
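Spelled out, the chain of inequalities is (a sketch in the notation above, with $P$ the transition kernel; each $\leq$ invokes the assumption once):

$$\begin{aligned}
V^\pi(s) &\leq \mathbb{E}_{a \sim \pi'}\!\left[Q^\pi(s,a)\right] \\
&= \mathbb{E}_{a \sim \pi'}\!\left[r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\!\left[V^\pi(s')\right]\right] \\
&\leq \mathbb{E}_{a \sim \pi'}\!\left[r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[\mathbb{E}_{a' \sim \pi'}\!\left[Q^\pi(s',a')\right]\right]\right] \\
&\leq \cdots \leq \mathbb{E}_{\pi'}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle|\; s_0 = s\right] = V^{\pi'}(s)
\end{aligned}$$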

Performance Difference Lemma

Kakade & Langford (2002): the exact gap, not just an inequality

$$V^{\pi'}(s) - V^\pi(s) = \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d_s^{\pi'}}\!\left[\mathbb{E}_{a \sim \pi'}[A^\pi(s',a)]\right]$$

where $d_s^{\pi'}$ is the discounted state visitation under $\pi'$ from $s$.

This is an equality: the performance gap is exactly the expected advantage under $\pi'$'s own visitation.

Problem: $d_s^{\pi'}$ depends on $\pi'$ itself — a chicken-and-egg issue.

This is why practical algorithms approximate or bound this quantity.
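The lemma itself can be verified numerically on a small MDP where everything is computable in closed form. A minimal sketch (hypothetical 2-state, 2-action MDP; all numbers illustrative): build $d_s^{\pi'}$ explicitly from the transition matrix and check the two sides agree exactly:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers are illustrative only).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])  # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])

def evaluate(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

pi  = np.array([[0.5, 0.5], [0.5, 0.5]])  # baseline policy pi
pi2 = np.array([[0.9, 0.1], [0.2, 0.8]])  # the new policy pi'

V, V2 = evaluate(pi), evaluate(pi2)
Q = r + gamma * P @ V
A = Q - V[:, None]           # advantage of the OLD policy pi

# Discounted visitation of pi' from s0:
# d = (1 - gamma) * sum_t gamma^t Pr(s_t = . | s_0, pi')
s0 = 0
P2 = np.einsum('sa,sat->st', pi2, P)
d = (1 - gamma) * np.linalg.solve((np.eye(2) - gamma * P2).T,
                                  np.eye(2)[s0])

lhs = V2[s0] - V[s0]                               # performance gap
rhs = (pi2 * A).sum(axis=1) @ d / (1 - gamma)      # PDL right-hand side
print(np.isclose(lhs, rhs))  # True
```

Note the asymmetry that makes the lemma useful and awkward at once: the advantage is measured under the old policy $\pi$, but the states are weighted by the new policy's visitation $d_s^{\pi'}$ — the quantity practical algorithms must approximate.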