Forward KL: Teacher in the Driver's Seat

$$D_{KL}(P_{\text{teacher}} \| Q_{\text{student}}) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]$$

Expectation is under $P$ (teacher) — samples come from the teacher.

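Since $\mathbb{E}_{x \sim P}[\log P(x)]$ does not depend on $Q$, minimizing this is equivalent to maximizing $\mathbb{E}_{x \sim P}[\log Q(x)]$: the student is pushed to raise $Q(x)$ wherever $P(x) > 0$.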

The student must cover every mode of the teacher.
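As a rough illustration of "samples come from the teacher", the sketch below computes forward KL exactly for a small discrete distribution and then estimates it by sampling from $P$. The three-outcome `teacher` and `student` vectors, the sample count, and the function names are made up for the example.

```python
import numpy as np

def forward_kl(p, q):
    """Exact D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0  # outcomes with P(x) = 0 contribute nothing to the sum
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def forward_kl_monte_carlo(p, q, n_samples=200_000, seed=0):
    """Estimate D_KL(P || Q) as the average log-ratio over samples drawn from P."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    x = rng.choice(len(p), size=n_samples, p=p)  # samples come from the teacher
    return float(np.mean(np.log(p[x] / q[x])))

# Hypothetical toy distributions, chosen only for illustration
teacher = np.array([0.5, 0.4, 0.1])
student = np.array([0.3, 0.3, 0.4])

exact = forward_kl(teacher, student)
estimate = forward_kl_monte_carlo(teacher, student)
print(f"exact = {exact:.4f}, Monte Carlo = {estimate:.4f}")
```

The two numbers should agree closely, since the Monte Carlo average is an unbiased estimate of the expectation under $P$.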

Zero-Avoiding Nature of Forward KL

What happens when $Q(x)=0$ but $P(x)>0$?

$$P(x) \log \frac{P(x)}{Q(x)} \;\to\; +\infty \quad\text{when } Q(x) \to 0$$

The loss explodes — it hates when the student assigns zero probability to teacher-likely regions.

Result: $Q$ spreads its mass to cover all modes → mean-seeking.

Caveat: $Q$ is normalized, so it cannot be higher everywhere. Raising $Q$ where $P$ is large necessarily lowers it where $P$ is small.
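The blow-up can be checked numerically. This sketch assumes a two-outcome teacher with equal probabilities and shrinks the student's probability on the second outcome; the values of `eps` are arbitrary.

```python
import numpy as np

def forward_kl(p, q):
    """D_KL(P || Q) for discrete distributions with strictly positive entries."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical teacher: two equally likely outcomes
teacher = np.array([0.5, 0.5])

# Student probability on the second outcome shrinks toward zero
epsilons = [1e-1, 1e-3, 1e-6, 1e-12]
losses = [forward_kl(teacher, np.array([1.0 - eps, eps])) for eps in epsilons]

for eps, loss in zip(epsilons, losses):
    print(f"Q(x2) = {eps:.0e}  forward KL = {loss:.3f}")
```

The loss grows like $-\tfrac{1}{2}\log \varepsilon$, so it is unbounded as the student's probability on a teacher-supported outcome goes to zero.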

Forward KL is Mean-Seeking

Bimodal teacher $P$ (solid) vs. $\arg\min_Q D_{KL}(P \| Q)$ (dashed)

A unimodal student (e.g., a single Gaussian) covers both modes, placing mass in the valley between them.

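PPO's KL-penalty variant uses forward KL, $D_{KL}(\pi_{\text{old}} \| \pi_\theta)$ with the expectation under the old policy, as a soft trust region: this prevents aggressive updates.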
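The mean-seeking fit can be reproduced in a few lines. When the student is restricted to a single Gaussian, minimizing forward KL reduces to matching the teacher's mean and variance. The mixture below (equal weights on $N(-3, 1)$ and $N(3, 1)$) is an assumed stand-in for the bimodal teacher, not the one from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical bimodal teacher: equal mixture of N(-3, 1) and N(3, 1)
component = rng.integers(0, 2, size=n)
samples = rng.normal(loc=np.where(component == 0, -3.0, 3.0), scale=1.0)

# For a Gaussian student, the forward-KL minimizer matches mean and variance
mu_q = samples.mean()
sigma_q = samples.std()

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Compare densities at x = 0, the valley between the two teacher modes
teacher_at_valley = 0.5 * normal_pdf(0.0, -3.0, 1.0) + 0.5 * normal_pdf(0.0, 3.0, 1.0)
student_at_valley = normal_pdf(0.0, mu_q, sigma_q)

print(f"student: mu = {mu_q:.2f}, sigma = {sigma_q:.2f}")
print(f"density at 0: teacher = {teacher_at_valley:.4f}, student = {student_at_valley:.4f}")
```

The fitted student is centered between the modes and is much wider than either of them, so its density peaks exactly where the teacher has almost none.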

Reverse KL: Student in the Driver's Seat

$$D_{KL}(Q_{\text{student}} \| P_{\text{teacher}}) = \mathbb{E}_{x \sim Q}\!\left[\log \frac{Q(x)}{P(x)}\right]$$

Expectation is under $Q$ (student) — samples come from the student.

Minimizing this forces $Q(x) \to 0$ wherever $P(x) = 0$.

A student that cannot represent every mode typically concentrates on a single mode of the teacher.
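To make "samples come from the student" concrete, here is a minimal sketch that computes reverse KL exactly and estimates it from student samples. The toy `teacher` and `student` vectors and the function names are assumptions for illustration.

```python
import numpy as np

def reverse_kl(q, p):
    """Exact D_KL(Q || P) for discrete distributions given as probability vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    support = q > 0  # outcomes with Q(x) = 0 contribute nothing to the sum
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

def reverse_kl_monte_carlo(q, p, n_samples=200_000, seed=0):
    """Estimate D_KL(Q || P) as the average log-ratio over samples drawn from Q."""
    rng = np.random.default_rng(seed)
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    x = rng.choice(len(q), size=n_samples, p=q)  # samples come from the student
    return float(np.mean(np.log(q[x] / p[x])))

# Hypothetical toy distributions, chosen only for illustration
teacher = np.array([0.5, 0.4, 0.1])
student = np.array([0.3, 0.3, 0.4])

exact = reverse_kl(student, teacher)
estimate = reverse_kl_monte_carlo(student, teacher)
print(f"exact = {exact:.4f}, Monte Carlo = {estimate:.4f}")
```

Only the student needs to be sampled here; the teacher is only evaluated on the student's own outputs.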

Zero-Forcing Nature of Reverse KL

What happens when $P(x)=0$ but $Q(x)>0$?

$$Q(x) \log \frac{Q(x)}{P(x)} \;\to\; +\infty \quad\text{when } P(x) \to 0$$

The loss explodes — the student must not place mass where the teacher is zero.

Result: $Q$ collapses onto one peak → mode-seeking.

This is aggressive: the student commits to one interpretation.
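Both halves of this argument can be checked on a two-outcome toy problem (all numbers below are assumed for illustration): the loss grows without bound when the student keeps mass where the teacher's probability vanishes, yet dropping one of two equal teacher modes entirely costs only $\log 2$.

```python
import numpy as np

def reverse_kl(q, p):
    """D_KL(Q || P) for discrete distributions; terms with Q(x) = 0 are skipped."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    support = q > 0
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

# Student keeps half its mass on an outcome the teacher makes vanishingly unlikely
student = np.array([0.5, 0.5])
epsilons = [1e-1, 1e-3, 1e-6, 1e-12]
losses = [reverse_kl(student, np.array([1.0 - eps, eps])) for eps in epsilons]
for eps, loss in zip(epsilons, losses):
    print(f"P(x2) = {eps:.0e}  reverse KL = {loss:.3f}")

# Dropping one of two equal teacher modes costs only a finite amount (log 2)
teacher_two_modes = np.array([0.5, 0.5])
collapsed_student = np.array([1.0, 0.0])
collapse_cost = reverse_kl(collapsed_student, teacher_two_modes)
print(f"cost of ignoring one mode = {collapse_cost:.3f}")
```

This asymmetry is what makes collapsing onto one peak an acceptable solution under reverse KL, while the same collapse would be infinitely penalized under forward KL.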

Reverse KL is Mode-Seeking

Bimodal teacher $P$ (solid) vs. $\arg\min_Q D_{KL}(Q \| P)$ (dotted)

The student locks onto one mode, ignoring the other entirely.

OPD uses reverse KL to distill from the teacher — aggressive but targeted.
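The mode-locking behaviour can be sketched with a brute-force search. The example assumes a teacher that is an equal mixture of $N(-3, 1)$ and $N(3, 1)$ and a single-Gaussian student, and evaluates reverse KL by numerical integration on a grid. The grid ranges and resolutions are arbitrary choices, and a gradient-based fit would be the more usual approach in practice.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Integration grid and log-density of the hypothetical bimodal teacher
x = np.linspace(-20.0, 20.0, 8001)
dx = x[1] - x[0]
log_p = np.log(0.5) + np.logaddexp(
    log_normal_pdf(x, -3.0, 1.0), log_normal_pdf(x, 3.0, 1.0)
)

def reverse_kl(mu, sigma):
    """D_KL(Q || P) for Q = N(mu, sigma^2), approximated by a Riemann sum."""
    log_q = log_normal_pdf(x, mu, sigma)
    return float(np.sum(np.exp(log_q) * (log_q - log_p)) * dx)

# Brute-force search over candidate student parameters
mus = np.linspace(-4.0, 4.0, 41)
sigmas = np.linspace(0.5, 4.0, 36)
candidates = [(reverse_kl(mu, sigma), mu, sigma) for mu in mus for sigma in sigmas]
best_kl, best_mu, best_sigma = min(candidates)

# For comparison: the broad, moment-matched Gaussian that forward KL would pick
broad_kl = reverse_kl(0.0, np.sqrt(10.0))

print(f"best student: mu = {best_mu:.1f}, sigma = {best_sigma:.1f}, KL = {best_kl:.3f}")
print(f"broad student: KL = {broad_kl:.3f}")
```

The search settles on a narrow Gaussian sitting on one of the two modes. Which mode it picks is a tie broken by search order here, and by initialization in a gradient-based fit.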

Forward vs. Reverse KL: Side by Side

Forward KL  $D_{KL}(P \| Q)$

Mean-seeking

Covers all modes, broad

Reverse KL  $D_{KL}(Q \| P)$

Mode-seeking

Locks onto one mode, sharp

OPD = reverse KL (distill from teacher) + forward KL (trust region from PPO)