Given a policy $\pi$, can we find a better one $\pi'$?
In RL, we evaluate policies by their value function:
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s,\ a_t \sim \pi(\cdot|s_t)\right]$$
The advantage tells us how much better action $a$ is versus following $\pi$:
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$
Note: $\mathbb{E}_{a \sim \pi}[A^\pi(s,a)] = 0$ — the advantage is zero on average under $\pi$.
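This zero-mean property follows directly from $A^\pi = Q^\pi - V^\pi$ and $V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s,a)]$. A quick numeric check on a single state (the Q-values and policy below are made up for illustration):

```python
import numpy as np

# Hypothetical single state with three actions; Q and pi are illustrative.
Q = np.array([1.0, 2.0, 4.0])   # Q^pi(s, a) for a = 0, 1, 2
pi = np.array([0.5, 0.3, 0.2])  # pi(a | s)

V = pi @ Q                      # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
A = Q - V                       # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

print(pi @ A)                   # E_{a~pi}[A^pi(s, a)] -> 0 up to float error
```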
If $\pi'$ picks actions with positive advantage at every state, it must be better.
This is the policy improvement theorem.
The foundational guarantee for policy iteration
Theorem. If for all states $s \in \mathcal{S}$:
$$\mathbb{E}_{a \sim \pi'(\cdot|s)}[A^\pi(s,a)] \geq 0$$
then $V^{\pi'}(s) \geq V^{\pi}(s)$ for all $s$.
The condition is local (check each state independently).
The conclusion is global (the value improves everywhere).
The greedy policy $\pi'(s) = \arg\max_a Q^\pi(s,a)$ always satisfies this condition!
This is why policy iteration converges to the optimal policy.
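A minimal policy iteration sketch makes the improvement step concrete. The 2-state, 2-action MDP below (transitions `P`, rewards `R`, discount `gamma`) is an illustrative assumption, not from the source; the loop alternates exact evaluation with the greedy improvement $\pi'(s) = \arg\max_a Q^\pi(s,a)$:

```python
import numpy as np

# Made-up 2-state, 2-action MDP for illustration.
n_s, gamma = 2, 0.9
P = np.zeros((n_s, 2, n_s))            # P[s, a, s'] transition probabilities
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.7, 0.3]; P[1, 1] = [0.1, 0.9]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])             # R[s, a]

def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    P_pi = P[np.arange(n_s), pi]       # (n_s, n_s) under deterministic pi
    R_pi = R[np.arange(n_s), pi]
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)

pi = np.zeros(n_s, dtype=int)          # arbitrary initial policy
while True:
    V = evaluate(pi)                   # V^pi
    Q = R + gamma * P @ V              # Q^pi(s, a) = R[s, a] + gamma * E[V^pi(s')]
    pi_new = Q.argmax(axis=1)          # greedy improvement step
    if np.array_equal(pi_new, pi):
        break                          # greedy policy unchanged: pi is optimal
    pi = pi_new

print(pi, V)
```

Each iteration satisfies the theorem's condition by construction, so the value sequence is monotone, and with finitely many deterministic policies the loop must terminate at a fixed point of the greedy step.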
Following Nan Jiang's CS 443 lecture notes
At each step, replace $V^\pi(s')$ with the no-smaller quantity $\mathbb{E}_{a' \sim \pi'}[Q^\pi(s',a')]$, using $\mathbb{E}_{\pi'}[A^\pi] \geq 0$.
The inequality cascades forward, and the telescoping produces $V^{\pi'}(s)$.
Each step uses $Q^\pi = A^\pi + V^\pi$ and the assumption $\mathbb{E}_{\pi'}[A^\pi] \geq 0$.
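Written out, the cascade is a chain of inequalities (a sketch in standard notation, with $r(s,a)$ the reward and $s' \sim P(\cdot \mid s,a)$ the next state):

```latex
\begin{align*}
V^{\pi}(s)
  &= \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s,a)\right] \\
  &\leq \mathbb{E}_{a \sim \pi'}\left[Q^{\pi}(s,a)\right]
     && \text{since } \mathbb{E}_{a \sim \pi'}[A^{\pi}(s,a)] \geq 0 \\
  &= \mathbb{E}_{a \sim \pi'}\left[r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[V^{\pi}(s')\right]\right] \\
  &\leq \mathbb{E}_{a \sim \pi'}\left[r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[\mathbb{E}_{a' \sim \pi'}[Q^{\pi}(s',a')]\right]\right] \\
  &\leq \cdots
   \leq \mathbb{E}_{\pi'}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s\right]
   = V^{\pi'}(s).
\end{align*}
```

Every "$\leq$" is the same move: swap one more step of action selection from $\pi$ to $\pi'$, which can only increase the value by the assumption.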
Kakade & Langford (2002): the exact gap, not just an inequality
$$V^{\pi'}(s) - V^{\pi}(s) = \frac{1}{1-\gamma}\,\mathbb{E}_{s' \sim d_s^{\pi'},\; a \sim \pi'(\cdot|s')}\!\left[A^{\pi}(s',a)\right]$$
where $d_s^{\pi'}$ is the normalized discounted state visitation distribution under $\pi'$ starting from $s$.
This is an equality: the performance gap is exactly the expected advantage under $\pi'$'s own visitation.
Problem: $d_s^{\pi'}$ depends on $\pi'$ itself — a chicken-and-egg issue.
This is why practical algorithms approximate or bound this quantity.
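The lemma can be verified numerically. The 2-state, 2-action MDP and the two stochastic policies below are made-up assumptions for illustration; the visitation distribution is computed from $(I - \gamma P_{\pi'})^{-1} = \sum_t \gamma^t P_{\pi'}^t$:

```python
import numpy as np

# Made-up MDP: P[s, a, s'] transitions, R[s, a] rewards, discount gamma.
n_s, gamma = 2, 0.9
P = np.zeros((n_s, 2, n_s))
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.7, 0.3]; P[1, 1] = [0.1, 0.9]
R = np.array([[0.0, 1.0], [2.0, 0.0]])

pi1 = np.array([[1.0, 0.0], [1.0, 0.0]])   # base policy pi(a|s)
pi2 = np.array([[0.2, 0.8], [0.6, 0.4]])   # comparison policy pi'(a|s)

def value(policy):
    """Exact V^pi for a stochastic policy, plus its transition matrix."""
    P_pol = np.einsum('sa,sat->st', policy, P)
    R_pol = np.einsum('sa,sa->s', policy, R)
    return np.linalg.solve(np.eye(n_s) - gamma * P_pol, R_pol), P_pol

V1, _ = value(pi1)
V2, P2 = value(pi2)
Q1 = R + gamma * P @ V1                     # Q^pi(s, a)
A1 = Q1 - V1[:, None]                       # A^pi(s, a)

# Row s of D is d_s^{pi'}: the normalized discounted visitation from s.
D = (1 - gamma) * np.linalg.inv(np.eye(n_s) - gamma * P2)

lhs = V2 - V1                               # performance gap per start state
rhs = D @ np.einsum('sa,sa->s', pi2, A1) / (1 - gamma)
print(np.allclose(lhs, rhs))                # the gap matches exactly
```

Note that computing `D` required the transition matrix of $\pi'$ itself, which is exactly the chicken-and-egg issue: in practice we can evaluate advantages of $\pi$, but the visitation weights belong to the unknown $\pi'$.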