**Available before any rollout $\tau_\pi$ (priors):**
- Optimal trajectory $\tau^\ast$ (e.g., a ground-truth solution trace)
- Optimal policy $\pi^\ast$ (e.g., a stronger teacher model)

**Available after $\tau_\pi$ is generated (posteriors):**
- Structured reward $r$ (e.g., a correctness score or pass/fail signal)
- Unstructured reward $\hat{r}$ (e.g., natural-language feedback)
Priors enable direct imitation; posteriors require credit assignment.
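To make that distinction concrete, here is a minimal PyTorch sketch (function names and tensor shapes are illustrative assumptions, not taken from any cited work): a prior like $\tau^\ast$ supports a plain imitation loss, while a posterior reward only exists after the student's own rollout and must be assigned back to the sampled tokens.

```python
import torch
import torch.nn.functional as F

# Prior: the optimal trajectory tau* is known up front, so direct
# imitation is just token-level cross-entropy on the target trace.
def imitation_loss(student_logits, target_tokens):
    # student_logits: (T, V); target_tokens: (T,)
    return F.cross_entropy(student_logits, target_tokens)

# Posterior: a scalar reward arrives only after sampling tau_pi, so it
# must be spread over the sampled tokens (REINFORCE-style credit assignment).
def reinforce_loss(student_logits, sampled_tokens, reward):
    logp = F.log_softmax(student_logits, dim=-1)
    logp_taken = logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    return -(reward * logp_taken).sum()
```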
- **Policy gradient** (REINFORCE, PPO): reward-weighted log-prob updates
- **On-policy distillation** (OPD): reverse KL from the teacher, plus a forward-KL trust region
- **In-context learning** (no gradient): privileged information placed directly in the prompt
PG is well-understood; ICL is simple but limited. OPD is the interesting middle ground.
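A minimal sketch of that middle ground, under my reading of the setup (`opd_loss`, `beta`, and the exact trust-region weighting are assumptions, not any paper's official formulation):

```python
import torch.nn.functional as F

def opd_loss(student_logits, teacher_logits, beta=0.1):
    # Both logit tensors are (T, V) over the *student's own* rollout,
    # which is what makes the distillation on-policy; the teacher is
    # queried only for targets, so its logits are detached.
    t_logp = F.log_softmax(teacher_logits.detach(), dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)

    # Reverse KL, KL(student || teacher): mode-seeking; pushes the
    # student toward high-probability teacher behavior.
    reverse_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)

    # Forward KL, KL(teacher || student): mass-covering; acts as a
    # trust region that discourages collapse onto a single mode.
    forward_kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1)

    return (reverse_kl + beta * forward_kl).mean()
```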
| Privileged info ↓ / Optimization → | PG (2025–2026) | OPD (2026) | ICL (2024–2025) |
|---|---|---|---|
| Optimal Trajectory | POPE, InT | OPSD, SDFT | Not novel |
| Optimal Policy | (not interesting) | Vanilla OPD | Not novel |
| Unstructured Reward | Guiding PRM | SDPO | RLEF |
| Structured Reward | Part of loss in all works | Not fine-grained enough | Not fine-grained enough |
Off-policyness (roughly):
optimal trajectory $\approx$ optimal policy $>$ unstructured reward $>$ structured reward
Optimization signal: OPD $>$ PG (dense per-token KL supervision vs. a sparse scalar reward).
When a good teacher is available, reverse KL directly targets the teacher distribution.
Structured reward is the weakest form of privileged information: it appears as part of the loss in essentially every method, but it is not fine-grained enough to drive learning on its own.
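For illustration only, reusing the sketches above (the 0.5 weight is a hypothetical choice, not from any of the listed works): the scalar reward enters as one term of a combined objective, while the dense per-token distillation term carries the fine-grained credit.

```python
# Hypothetical combined objective: structured reward contributes a term,
# but the dense per-token distillation signal does the fine-grained work.
loss = opd_loss(student_logits, teacher_logits) \
       + 0.5 * reinforce_loss(student_logits, sampled_tokens, reward)
```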