Privileged Information: Taxonomy

Priors

Available before any policy rollout $\tau_\pi$ is generated

Optimal trajectory $\tau^\ast$

e.g., ground-truth solution trace

Optimal policy $\pi^\ast$

e.g., a stronger teacher model

Posteriors

Available after $\tau_\pi$ is generated

Structured reward $r$

e.g., correctness score, pass/fail

Unstructured reward $\hat{r}$

e.g., natural language feedback

Priors enable direct imitation; posteriors require credit assignment.

Optimization Methods

Policy Gradient

REINFORCE, PPO

Reward-weighted log-prob updates

Surrogate PG (OPD)
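A minimal sketch of the reward-weighted log-prob update behind REINFORCE, on a hypothetical toy bandit (the softmax policy, reward table, learning rate, and step count are all illustrative, not from any cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.zeros(3)               # policy parameters: one logit per action
lr = 0.5
reward = np.array([0.0, 0.2, 1.0])  # assumed reward table; action 2 is best

for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)      # sample an action from the policy
    r = reward[a]
    # grad of log pi(a) w.r.t. softmax logits: onehot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * r * grad_logp    # reward-weighted log-prob update
```

PPO adds clipping and a learned baseline on top of this same gradient; the core signal is still the reward-weighted score function.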

On-Policy Distillation

Reverse KL from teacher + forward KL trust region
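The OPD objective above can be sketched on a single token distribution. This is a hypothetical illustration, not a specific paper's loss: the reverse KL pulls the student toward the teacher, and a forward KL against the previous student checkpoint acts as the trust region ($\beta$ and all distributions are made up):

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for discrete distributions
    return float(np.sum(p * np.log(p / q)))

student = np.array([0.5, 0.3, 0.2])  # current student next-token distribution
teacher = np.array([0.1, 0.2, 0.7])  # stronger teacher's distribution
prev = student.copy()                # previous checkpoint (trust-region anchor)

beta = 0.1
# Reverse KL from the teacher (mode-seeking, evaluated on the student side),
# plus a forward KL trust region keeping the update near the old policy.
loss = kl(student, teacher) + beta * kl(prev, student)
```

In practice both terms are computed on sequences sampled from the student (hence "on-policy"), with gradients flowing only through the student's log-probs.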

In-Context Learning

No gradient

Privileged info directly in prompt
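The ICL route needs no gradient step at all: the privileged information (a prior like a ground-truth trace, or a posterior like feedback on a previous attempt) is simply concatenated into the prompt. A hypothetical sketch, with illustrative field names and wording:

```python
def build_icl_prompt(task: str, privileged: str) -> str:
    """Place privileged information directly in the prompt (no gradient update)."""
    return (
        f"Reference information:\n{privileged}\n\n"
        f"Task:\n{task}\n"
        "Use the reference information to produce your answer."
    )

prompt = build_icl_prompt(
    task="Solve 2x + 3 = 7.",
    privileged="Ground-truth trace: subtract 3, divide by 2, so x = 2.",
)
```

This is why ICL is "simple but limited": the model conditions on the privileged information for this one query, but nothing is distilled into the weights.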

PG is well-understood; ICL is simple but limited. OPD is the interesting middle ground.

The Matrix: Privileged Info $\times$ Optimization

| Privileged Info \ Optimization | PG (2025–2026) | OPD (2026) | ICL (2024–2025) |
| --- | --- | --- | --- |
| Optimal trajectory | POPE, InT | OPSD, SDFT | Not novel |
| Optimal policy | (not interesting) | Vanilla OPD | Not novel |
| Unstructured reward | Guiding PRM | SDPO | RLEF |
| Structured reward | Part of loss in all works | Not fine-grained enough | Not fine-grained enough |

Key Observations

Off-policyness (roughly):

optimal trajectory $\approx$ optimal policy $>$ unstructured reward $>$ structured reward

Optimization: OPD $>$ PG

When a good teacher is available, reverse KL directly targets the teacher distribution.

Structured reward is the weakest form of privileged information.

It appears as part of the loss in all of these works, but is not fine-grained enough to drive learning on its own.