**Available before any rollout $\tau_\pi$ (priors):**
- Optimal trajectory $\tau^\ast$ (e.g., a ground-truth solution trace)
- Optimal policy $\pi^\ast$ (e.g., a stronger teacher model)

**Available after $\tau_\pi$ is generated (posteriors):**
- Structured reward $r$ (e.g., a correctness score or pass/fail signal)
- Unstructured reward $\hat{r}$ (e.g., natural-language feedback)
Priors enable direct imitation; posteriors require credit assignment.
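To make that distinction concrete, here is a minimal PyTorch sketch (function names and tensor shapes are illustrative assumptions, not taken from any cited work): a prior like $\tau^\ast$ supports a plain imitation loss, while a posterior reward only exists after the student's own rollout and must be assigned back to the sampled tokens.

```python
import torch
import torch.nn.functional as F

# Prior: the optimal trajectory tau* is known up front, so direct
# imitation is just token-level cross-entropy on the target trace.
def imitation_loss(student_logits, target_tokens):
    # student_logits: (T, V); target_tokens: (T,)
    return F.cross_entropy(student_logits, target_tokens)

# Posterior: a scalar reward arrives only after sampling tau_pi, so it
# must be spread over the sampled tokens (REINFORCE-style credit assignment).
def reinforce_loss(student_logits, sampled_tokens, reward):
    logp = F.log_softmax(student_logits, dim=-1)
    logp_taken = logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    return -(reward * logp_taken).sum()
```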
- **Policy gradient** (REINFORCE, PPO): reward-weighted log-prob updates
- **On-policy distillation** (OPD): reverse KL from the teacher, plus a forward-KL trust region
- **In-context learning** (no gradient): privileged information placed directly in the prompt
PG is well-understood; ICL is simple but limited. OPD is the interesting middle ground.
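A minimal sketch of that middle ground, under my reading of the setup (`opd_loss`, `beta`, and the exact trust-region weighting are assumptions, not any paper's official formulation):

```python
import torch.nn.functional as F

def opd_loss(student_logits, teacher_logits, beta=0.1):
    # Both logit tensors are (T, V) over the *student's own* rollout,
    # which is what makes the distillation on-policy; the teacher is
    # queried only for targets, so its logits are detached.
    t_logp = F.log_softmax(teacher_logits.detach(), dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)

    # Reverse KL, KL(student || teacher): mode-seeking; pushes the
    # student toward high-probability teacher behavior.
    reverse_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)

    # Forward KL, KL(teacher || student): mass-covering; acts as a
    # trust region that discourages collapse onto a single mode.
    forward_kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1)

    return (reverse_kl + beta * forward_kl).mean()
```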
| Privileged info ↓ / Optimization → | PG (2025–2026) | OPD (2026) | ICL (2024–2025) |
|---|---|---|---|
| Optimal Trajectory | POPE, InT | OPSD, SDFT | Not novel |
| Optimal Policy | (not interesting) | Vanilla OPD | Not novel |
| Unstructured Reward | Guiding PRM | SDPO | RLEF |
| Structured Reward | Part of loss in all works | Not fine-grained enough | Not fine-grained enough |
Off-policyness (roughly):
optimal trajectory $\approx$ optimal policy $>$ unstructured reward $>$ structured reward
Optimization signal: OPD $>$ PG (dense per-token KL supervision vs. a sparse scalar reward).
When a good teacher is available, reverse KL directly targets the teacher distribution.
Structured reward is the weakest form of privileged information: it appears as part of the loss in essentially every method, but it is not fine-grained enough to drive learning on its own.
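For illustration only, reusing the sketches above (the 0.5 weight is a hypothetical choice, not from any of the listed works): the scalar reward enters as one term of a combined objective, while the dense per-token distillation term carries the fine-grained credit.

```python
# Hypothetical combined objective: structured reward contributes a term,
# but the dense per-token distillation signal does the fine-grained work.
loss = opd_loss(student_logits, teacher_logits) \
       + 0.5 * reinforce_loss(student_logits, sampled_tokens, reward)
```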