Forward KL: $D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}[\log P(x) - \log Q(x)]$. The expectation is under $P$ (teacher), so samples come from the teacher.
Minimizing this pushes $Q(x)$ up wherever $P(x) > 0$.
The student must cover every mode of the teacher.
What happens when $Q(x)=0$ but $P(x)>0$?
The loss explodes: the $P(x)\log\frac{P(x)}{Q(x)}$ term diverges when the student assigns zero probability to a teacher-likely region.
Result: $Q$ spreads its mass to cover all modes → mean-seeking.
Also: since $Q$ must sum to one, it cannot simply raise its curve everywhere; covering every teacher-likely region means spreading its mass thin.
Bimodal teacher $P$ (solid) vs. $\arg\min_Q D_{KL}(P \| Q)$ (dashed)
The student covers both modes, placing mass in the valley between them.
PPO uses forward KL as a trust region, $D_{KL}(\pi_{\text{old}} \| \pi_{\text{new}})$, which prevents aggressive updates.
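The mean-seeking behavior can be checked numerically. A minimal sketch (the bimodal teacher, grids, and single-Gaussian student family are all illustrative, not from the source): fit one Gaussian to a two-mode teacher by grid search under forward KL.

```python
import numpy as np

# Toy bimodal teacher P: equal-weight Gaussians at -2 and +2, discretized.
x = np.linspace(-6, 6, 1201)

def gauss(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2)

p = gauss(x, -2, 0.5) + gauss(x, 2, 0.5)
p /= p.sum()

def forward_kl(p, q):
    # E_{x~P}[log P(x)/Q(x)]: any q(x) ~ 0 where p(x) > 0 blows the sum up.
    q = np.clip(q, 1e-300, None)  # avoid log(0); the penalty stays enormous
    return np.sum(p * np.log(p / q))

# Grid-search the best single-Gaussian student under forward KL.
best = (np.inf, None, None)
for mu in np.linspace(-3, 3, 61):
    for sd in np.linspace(0.3, 3.0, 28):
        q = gauss(x, mu, sd)
        q /= q.sum()
        kl = forward_kl(p, q)
        if kl < best[0]:
            best = (kl, mu, sd)

kl, mu, sd = best
# mu lands near 0 (between the modes) and sd is wide enough to cover both.
print(f"forward-KL optimum: mu={mu:.2f}, sd={sd:.2f}")
```

The optimum sits in the valley between the modes with a broad sd, matching the dashed curve described above.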
Reverse KL: $D_{KL}(Q \| P) = \mathbb{E}_{x \sim Q}[\log Q(x) - \log P(x)]$. The expectation is under $Q$ (student), so samples come from the student.
Minimizing this forces $Q(x) \to 0$ wherever $P(x) = 0$.
The student concentrates on a single mode of the teacher.
What happens when $P(x)=0$ but $Q(x)>0$?
The loss explodes: the $Q(x)\log\frac{Q(x)}{P(x)}$ term diverges, so the student must not place mass where the teacher is zero.
Result: $Q$ collapses onto one peak → mode-seeking.
This is aggressive: the student commits to one interpretation.
Bimodal teacher $P$ (solid) vs. $\arg\min_Q D_{KL}(Q \| P)$ (dotted)
The student locks onto one mode, ignoring the other entirely.
OPD uses reverse KL to distill from the teacher — aggressive but targeted.
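The same toy experiment, now under reverse KL, shows the mode-seeking collapse (again a sketch with illustrative numbers, not from the source):

```python
import numpy as np

# Same toy bimodal teacher P as before, discretized on a grid.
x = np.linspace(-6, 6, 1201)

def gauss(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2)

p = gauss(x, -2, 0.5) + gauss(x, 2, 0.5)
p /= p.sum()

def reverse_kl(q, p):
    # E_{x~Q}[log Q(x)/P(x)]: terms with q(x) = 0 contribute nothing,
    # so Q pays no price for ignoring a mode, but pays dearly for
    # putting mass where p(x) is tiny (the valley between modes).
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

best = (np.inf, None, None)
for mu in np.linspace(-3, 3, 61):
    for sd in np.linspace(0.3, 3.0, 28):
        q = gauss(x, mu, sd)
        q /= q.sum()
        kl = reverse_kl(q, p)
        if kl < best[0]:
            best = (kl, mu, sd)

kl, mu, sd = best
# mu locks onto one of the two modes and sd stays narrow.
print(f"reverse-KL optimum: mu={mu:.2f}, sd={sd:.2f}")
```

Unlike the forward-KL fit, the winner sits on a single peak: ignoring the other mode is free, but bridging the valley is not.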
Forward KL $D_{KL}(P \| Q)$: mean-seeking; covers all modes, broad.
Reverse KL $D_{KL}(Q \| P)$: mode-seeking; locks onto one mode, sharp.
OPD = reverse KL (distill from teacher) + forward KL (trust region from PPO)
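The distillation half of this recipe can be sketched numerically. A minimal example (toy 5-token distributions; assumes the distillation term is a per-token reverse KL estimated from student samples, as the "samples come from the student" line suggests):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy next-token distributions over a 5-token vocab (illustrative numbers).
teacher = np.array([0.70, 0.10, 0.10, 0.05, 0.05])  # P
student = np.array([0.30, 0.30, 0.20, 0.10, 0.10])  # Q

# On-policy: sample tokens from the *student*, score each by log Q - log P.
# The mean is an unbiased Monte Carlo estimate of reverse KL D_KL(Q || P).
tokens = rng.choice(len(student), size=100_000, p=student)
estimate = np.mean(np.log(student[tokens]) - np.log(teacher[tokens]))

exact = np.sum(student * np.log(student / teacher))
print(f"MC estimate {estimate:.3f} vs exact {exact:.3f}")
```

Tokens the student never produces contribute nothing to this estimate, which is the mode-seeking behavior above in sampled form; the forward-KL trust region against the old policy would be added on top and is omitted here.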