PPO Objective

$$J(\theta) = \mathbb{E}_{s \sim d^{\pi_{\text{old}}},\, a \sim \pi_{\text{old}}}\!\left[r_\theta(s,a)\,\hat{A}(s,a) \;-\; \beta\, D_{KL}\!\big(\pi_{\text{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big)\right]$$

where $r_\theta(s,a) = \pi_\theta(a|s) / \pi_{\text{old}}(a|s)$ is the importance-sampling ratio.

The two terms of the objective:

- $r_\theta \hat{A}$: policy improvement (reward-weighted)
- $D_{KL}(\pi_{\text{old}} \| \pi_\theta)$: forward KL trust region (conservative)

PPO's Forward KL Penalty

$$-\beta\,D_{KL}(\pi_{\text{old}} \| \pi_\theta) = -\beta\,\mathbb{E}_{a \sim \pi_{\text{old}}}\!\left[\log \frac{\pi_{\text{old}}}{\pi_\theta}\right] = \beta\,\mathbb{E}_{a \sim \pi_{\text{old}}}\!\left[\log \frac{\pi_\theta}{\pi_{\text{old}}}\right]$$

teacher = $\pi_{\text{old}}$, student = $\pi_\theta$

This encourages $\pi_\theta$ to stay close to $\pi_{\text{old}}$ — mean-seeking.

Inherited from TRPO: importance sampling ratio + forward KL trust region.
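As a concrete sketch, the penalized objective can be computed for a batch of discrete-action states with NumPy. This is a minimal illustration under assumed shapes (`logits` of shape `(B, A)`, one action and advantage per state); the function and variable names are mine, not from any particular implementation:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable row-wise log-softmax over the action dimension."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ppo_penalty_objective(logits_new, logits_old, actions, advantages, beta=0.1):
    """Batch estimate of J(theta) = E[r_theta * A_hat - beta * KL(pi_old || pi_theta)]."""
    logp_new = log_softmax(logits_new)  # (B, A)
    logp_old = log_softmax(logits_old)  # (B, A)
    rows = np.arange(len(actions))
    # Importance-sampling ratio r_theta = pi_theta(a|s) / pi_old(a|s) on the taken actions
    ratio = np.exp(logp_new[rows, actions] - logp_old[rows, actions])
    # Forward KL(pi_old || pi_theta), summed over actions for each state
    p_old = np.exp(logp_old)
    kl = (p_old * (logp_old - logp_new)).sum(axis=-1)
    return np.mean(ratio * advantages - beta * kl)
```

With `logits_new == logits_old` the ratio is 1 and the KL vanishes, so the objective reduces to the mean advantage.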

OPD: Replace $\hat{A}$ with Reverse KL

What if we had a teacher policy $\pi_{\text{teach}}$?

$$\min_\theta\;\mathbb{E}_{x \sim \pi_{\text{old}}}\!\left[\sum_{t=0}^{T-1}\frac{\pi_\theta}{\pi_{\text{old}}}\big(\log\pi_\theta - \log\pi_{\text{teach}}\big) \;+\; \beta\,D_{KL}\!\big(\pi_{\text{old}} \| \pi_\theta\big)\right]$$

The two terms:

- $\log\pi_\theta - \log\pi_{\text{teach}}$: reverse KL, distill from the teacher (aggressive, mode-seeking)
- $D_{KL}(\pi_{\text{old}} \| \pi_\theta)$: forward KL, trust region from PPO (conservative, stabilizing)
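Under the same discrete-action setup, the OPD loss can be sketched in NumPy. Names are illustrative, and the per-step sum over $t$ is folded into the batch dimension for simplicity:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable row-wise log-softmax over the action dimension."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def opd_loss(logits_new, logits_old, logits_teach, actions, beta=0.1):
    """E_{x~pi_old}[(pi_theta/pi_old)(log pi_theta - log pi_teach) + beta*KL(pi_old||pi_theta)]."""
    logp_new, logp_old, logp_teach = map(
        log_softmax, (logits_new, logits_old, logits_teach)
    )
    rows = np.arange(len(actions))
    ratio = np.exp(logp_new[rows, actions] - logp_old[rows, actions])
    # Importance-weighted reverse-KL term on the sampled actions: distill toward the teacher
    distill = ratio * (logp_new[rows, actions] - logp_teach[rows, actions])
    # Forward-KL trust region keeping pi_theta near pi_old
    p_old = np.exp(logp_old)
    kl = (p_old * (logp_old - logp_new)).sum(axis=-1)
    return np.mean(distill + beta * kl)
```

When the student already matches both the old policy and the teacher, every term vanishes and the loss is exactly zero.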

PPO vs. OPD: Spot the Difference

PPO

$$\mathbb{E}\!\left[r_\theta(s,a)\;\color{#c62828}{\hat{A}(s,a)} \;-\; \beta\,D_{KL}(\pi_{\text{old}} \| \pi_\theta)\right]$$
↓ replace $\hat{A}$ ↓

OPD

$$\mathbb{E}\!\left[\frac{\pi_\theta}{\pi_{\text{old}}}\;\color{#1565c0}{(\log\pi_\theta - \log\pi_{\text{teach}})} \;+\; \beta\,D_{KL}(\pi_{\text{old}} \| \pi_\theta)\right]$$

The advantage $\hat{A}$ (from reward) is replaced by a reverse KL term (from teacher).

Privileged information enters through $\pi_{\text{teach}}$ instead of through reward $r$.

Gradient of Reverse KL

Taking the gradient of $D_{KL}(Q_\theta \| P)$:

$$\nabla_\theta D_{KL}(Q_\theta \| P) = -\nabla_\theta H(Q_\theta) - \nabla_\theta \mathbb{E}_{x \sim Q_\theta}\![\log P(x)]$$

The second term, by the REINFORCE trick (log-derivative trick):

$$\nabla_\theta \mathbb{E}_{x \sim Q_\theta}\![\log P(x)] = \mathbb{E}_{x \sim Q_\theta}\!\big[\log P(x) \cdot \nabla_\theta \log Q_\theta(x)\big]$$

This has exactly the form of a REINFORCE gradient.

Reverse KL = REINFORCE with $G = \log P(x)$

- Reverse KL gradient: $\mathbb{E}_{x \sim Q_\theta}\!\big[\log P(x) \cdot \nabla_\theta \log Q_\theta(x)\big]$
- Step-wise REINFORCE: $\mathbb{E}\!\Big[\sum_t G_t\,\nabla_\theta \log\pi_\theta(a_t | s_t)\Big]$

A large $\log P(x)$ means the teacher assigns high probability to $x$. It serves as the reward signal, replacing the return $G$ with teacher approval.

Forward KL is NOT REINFORCE: in $D_{KL}(P \| Q_\theta)$ the expectation is taken under the fixed distribution $P$, not under the distribution $Q_\theta$ being optimized, so no score-function term appears.
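The identity can be checked numerically for a small categorical $Q_\theta$ parameterized by softmax logits, comparing the score-function form of the gradient (with the expectation computed exactly over the support) against central finite differences. A self-contained NumPy sketch, with an arbitrary fixed target $P$:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
theta = rng.normal(size=K)                 # softmax logits of Q_theta
logP = np.log(rng.dirichlet(np.ones(K)))   # fixed target distribution P

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def objective(t):
    # E_{x ~ Q_theta}[log P(x)], computed exactly over the support
    return softmax(t) @ logP

# Score-function (REINFORCE) form of the gradient, expectation taken exactly:
# grad_j = sum_x q_x * log P(x) * d(log q_x)/d(theta_j),
# where for a softmax, d(log q_x)/d(theta_j) = 1[x = j] - q_j
q = softmax(theta)
score_grad = q * (logP - q @ logP)

# Central finite differences of the objective as an independent check
eps = 1e-6
fd_grad = np.array([
    (objective(theta + eps * np.eye(K)[j])
     - objective(theta - eps * np.eye(K)[j])) / (2 * eps)
    for j in range(K)
])
```

The two gradients agree to numerical precision, confirming $\nabla_\theta \mathbb{E}_{x\sim Q_\theta}[\log P(x)] = \mathbb{E}_{x\sim Q_\theta}\big[\log P(x)\,\nabla_\theta \log Q_\theta(x)\big]$ for this family.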

OPD: The Complete Picture

- Reverse KL, $\log\pi_\theta - \log\pi_{\text{teach}}$: distill from the teacher (mode-seeking)
- Forward KL, $D_{KL}(\pi_{\text{old}} \| \pi_\theta)$: trust region from PPO (mean-seeking)

Aggressive distillation + conservative stabilization = OPD
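The mode-seeking vs. mean-seeking contrast can also be seen numerically: fit a single discretized Gaussian to a bimodal target by grid search, minimizing each KL direction in turn. A toy NumPy sketch; the grid, peak locations, and widths are arbitrary choices:

```python
import numpy as np

x = np.arange(100)
# Bimodal target: two well-separated peaks at 20 and 80
P = np.exp(-0.5 * ((x - 20) / 4) ** 2) + np.exp(-0.5 * ((x - 80) / 4) ** 2)
P /= P.sum()

def gauss(mu, sigma):
    """Discretized, normalized Gaussian bump on the grid."""
    q = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return q / q.sum()

def kl(a, b):
    """D_KL(a || b), restricted to the numerical support of a."""
    m = a > 0
    return np.sum(a[m] * (np.log(a[m]) - np.log(b[m] + 1e-300)))

grid = [(m, s) for m in np.arange(0.0, 100.0) for s in np.arange(2.0, 60.0)]
# Forward KL(P || Q): Q must cover all of P's mass -> mean-seeking
_, mu_f, sig_f = min((kl(P, gauss(m, s)), m, s) for m, s in grid)
# Reverse KL(Q || P): Q avoids regions where P is small -> mode-seeking
_, mu_r, sig_r = min((kl(gauss(m, s), P), m, s) for m, s in grid)
```

The forward-KL fit straddles both peaks (mean near 50, large sigma), while the reverse-KL fit collapses onto a single peak with a narrow sigma.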

SFT has exposure bias. RL reward is too sparse. OPD bridges the gap.