where $r_\theta = \pi_\theta / \pi_{\text{old}}$ is the importance sampling ratio.
The first term drives policy improvement (reward-weighted); the second is a forward-KL trust region (conservative), with $\pi_{\text{old}}$ in the role of teacher and $\pi_\theta$ as student.
This encourages $\pi_\theta$ to stay close to $\pi_{\text{old}}$ — mean-seeking.
Inherited from TRPO: importance sampling ratio + forward KL trust region.
What if we had a teacher policy $\pi_{\text{teach}}$?
- Reverse KL: distill from the teacher (aggressive, mode-seeking)
- Forward KL: trust region from PPO (conservative, stabilizing)
PPO (maximized):

$$\mathbb{E}\!\left[r_\theta(s,a)\;\color{#c62828}{\hat{A}(s,a)} \;-\; \beta\,D_{KL}(\pi_{\text{old}} \| \pi_\theta)\right]$$

OPD (minimized):

$$\mathbb{E}\!\left[\frac{\pi_\theta}{\pi_{\text{old}}}\;\color{#1565c0}{(\log\pi_\theta - \log\pi_{\text{teach}})} \;+\; \beta\,D_{KL}(\pi_{\text{old}} \| \pi_\theta)\right]$$

The advantage $\hat{A}$ (from reward) is replaced by a reverse-KL term (from teacher); note that PPO's objective is maximized while OPD's is minimized, which is why the sign in front of the trust-region term flips.
Privileged information enters through $\pi_{\text{teach}}$ instead of through reward $r$.
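To make the two objectives concrete, here is a minimal numpy sketch of their per-token sample estimates. All names, shapes, and values are hypothetical placeholders for illustration, not taken from any particular implementation.

```python
import numpy as np

def ppo_objective(logp_new, logp_old, adv, beta=0.1):
    """Per-token PPO objective (to be maximized):
    importance ratio * advantage - beta * KL(pi_old || pi_theta),
    using the sampled-token estimate log(pi_old) - log(pi_theta) for the KL."""
    ratio = np.exp(logp_new - logp_old)
    kl_old_new = logp_old - logp_new           # sample estimate of KL(pi_old || pi_theta)
    return float(np.mean(ratio * adv - beta * kl_old_new))

def opd_objective(logp_new, logp_old, logp_teacher, beta=0.1):
    """Per-token OPD objective (to be minimized):
    importance ratio * reverse KL to teacher + beta * forward-KL trust region."""
    ratio = np.exp(logp_new - logp_old)
    reverse_kl = logp_new - logp_teacher       # sample estimate of KL(pi_theta || pi_teach)
    kl_old_new = logp_old - logp_new
    return float(np.mean(ratio * reverse_kl + beta * kl_old_new))

# Hypothetical per-token log-probs for a short rollout of 8 tokens.
rng = np.random.default_rng(0)
logp_old = rng.normal(-2.0, 0.3, size=8)
logp_new = logp_old + rng.normal(0.0, 0.05, size=8)
logp_teacher = rng.normal(-1.5, 0.3, size=8)
adv = rng.normal(0.0, 1.0, size=8)

print(ppo_objective(logp_new, logp_old, adv))
print(opd_objective(logp_new, logp_old, logp_teacher))
```

Note that when the student already matches the teacher and the old policy, the OPD objective is exactly zero, which is its minimum in expectation.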
Taking the gradient of $D_{KL}(Q_\theta \| P) = \mathbb{E}_{x \sim Q_\theta}\!\left[\log Q_\theta(x) - \log P(x)\right]$ splits it into two terms:

$$\nabla_\theta D_{KL}(Q_\theta \| P) = \nabla_\theta\,\mathbb{E}_{x \sim Q_\theta}\!\left[\log Q_\theta(x)\right] \;-\; \nabla_\theta\,\mathbb{E}_{x \sim Q_\theta}\!\left[\log P(x)\right]$$

For the second term, the REINFORCE trick (log-derivative trick) gives

$$\nabla_\theta\,\mathbb{E}_{x \sim Q_\theta}\!\left[\log P(x)\right] = \mathbb{E}_{x \sim Q_\theta}\!\big[\!\log P(x) \cdot \nabla_\theta \log Q_\theta(x)\big]$$

which has exactly the form of a REINFORCE gradient:

$$\mathbb{E}\!\Big[\sum_t G_t\,\nabla_\theta \log\pi_\theta(a_t \mid s_t)\Big]$$
A large $\log P(x)$ means the teacher agrees with action $x$.
It serves as the reward signal — replacing the return $G$ with teacher approval.
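The log-derivative identity above can be verified numerically for a small categorical $Q_\theta$ parametrized by a softmax over three outcomes; the distributions here are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical fixed "teacher" distribution P and student parameters theta.
P = np.array([0.7, 0.2, 0.1])
theta = np.array([0.3, -0.5, 0.1])

def objective(theta):
    # E_{x ~ Q_theta}[log P(x)], computed exactly over the 3 outcomes.
    return float(softmax(theta) @ np.log(P))

# Score-function (REINFORCE) form: E_{x ~ Q_theta}[log P(x) * grad log Q_theta(x)],
# also computed exactly; for a softmax, grad_i log Q(x) = 1[x == i] - Q(i).
Q = softmax(theta)
score_grad = np.array([
    sum(Q[x] * np.log(P[x]) * ((x == i) - Q[i]) for x in range(3))
    for i in range(3)
])

# Finite-difference check of the same gradient.
eps = 1e-6
fd_grad = np.array([
    (objective(theta + eps * np.eye(3)[i]) - objective(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.max(np.abs(score_grad - fd_grad)))
```

The two gradients agree to finite-difference precision, confirming that weighting $\nabla_\theta \log Q_\theta(x)$ by the "reward" $\log P(x)$ recovers the true gradient.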
Forward KL is NOT REINFORCE: in $D_{KL}(P \| Q_\theta)$ the expectation is taken over the fixed $P$, not over the distribution being optimized, so its gradient is the ordinary supervised (cross-entropy) gradient $-\mathbb{E}_{x \sim P}\!\left[\nabla_\theta \log Q_\theta(x)\right]$ and needs no log-derivative trick.
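This contrast can also be checked numerically: for a softmax $Q_\theta$ and a fixed target $P$, the forward-KL (cross-entropy) gradient is just the supervised closed form $Q - P$, with no score-function estimator involved. The numbers below are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

P = np.array([0.6, 0.3, 0.1])        # fixed target distribution (hypothetical)
theta = np.array([0.2, -0.1, 0.4])   # student softmax parameters

def cross_entropy(theta):
    # H(P, Q_theta) = -sum_x P(x) log Q_theta(x); differs from KL(P || Q) by a constant.
    return float(-(P @ np.log(softmax(theta))))

# Closed-form supervised gradient for a softmax parametrization: Q - P.
analytic = softmax(theta) - P

# Finite-difference check.
eps = 1e-6
fd = np.array([
    (cross_entropy(theta + eps * np.eye(3)[i]) - cross_entropy(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.max(np.abs(analytic - fd)))
```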
- Distill from teacher: mode-seeking, via $\log\pi_\theta - \log\pi_{\text{teach}}$
- Trust region (PPO): mean-seeking, via $D_{KL}(\pi_{\text{old}} \| \pi_\theta)$
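The mode-seeking vs mean-seeking distinction can be seen in a toy example: fit a single-peak categorical $Q_\mu$ to a bimodal target $P$ by grid search over $\mu$. All distributions and numbers here are made up for illustration.

```python
import numpy as np

x = np.arange(0, 10)
sigma = 1.0

# Bimodal target P: equal-weight peaks at 2 and 7 (hypothetical).
P = np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x - 7.0) ** 2)
P /= P.sum()

def Q(mu):
    # Single-peak candidate distribution centered at mu.
    q = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return q / q.sum()

def reverse_kl(mu):   # KL(Q || P): mode-seeking
    q = Q(mu)
    return float(np.sum(q * (np.log(q) - np.log(P))))

def forward_kl(mu):   # KL(P || Q): mean-seeking
    q = Q(mu)
    return float(np.sum(P * (np.log(P) - np.log(q))))

grid = np.arange(0.0, 9.01, 0.05)
mu_rev = grid[np.argmin([reverse_kl(m) for m in grid])]
mu_fwd = grid[np.argmin([forward_kl(m) for m in grid])]
print(mu_rev, mu_fwd)  # reverse KL locks onto one peak; forward KL sits between them
```

The reverse-KL fit collapses onto one of the two peaks (aggressive distillation), while the forward-KL fit spreads over both, landing near the mean of $P$ (conservative stabilization), which is the behavior the table above summarizes.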
Aggressive distillation + conservative stabilization = OPD
SFT has exposure bias. RL reward is too sparse. OPD bridges the gap.