How to Use Privileged Information in RL

In reinforcement learning for language models, we often have access to information at training time that is unavailable at test time — an optimal solution, a teacher policy, or structured feedback from a verifier. This privileged information is the secret ingredient behind many recent advances in LLM reasoning and agentic RL. But how exactly should we incorporate it into the training objective?

This post organizes the landscape along two axes: what kind of privileged information you have, and how you optimize with it. The punchline is that the choice of KL divergence direction — forward vs. reverse — has deep consequences, and a family of methods called On-Policy Distillation (OPD) emerges as a principled way to leverage privileged information through reverse KL.

Taxonomy and the Landscape

Privileged information in RL can be divided into two broad categories based on when it becomes available relative to the learner’s trajectory \(\tau_\pi\). Priors are available before the learner generates any trajectory: an optimal trajectory \(\tau^\ast\) (a ground-truth solution trace) or an optimal policy \(\pi^\ast\) (a stronger teacher model). Posteriors are available only after the learner generates \(\tau_\pi\): a structured reward \(r\) (correctness score, code execution result) or unstructured feedback \(\hat{r}\) (natural language critique from a judge). The distinction matters because priors enable direct imitation while posteriors require the learner to do its own credit assignment.

Given privileged information, there are three families of optimization: policy gradient (REINFORCE, PPO), which uses reward signals; surrogate policy gradient / On-Policy Distillation (OPD), which distills from a teacher via reverse KL; and in-context learning (ICL), which provides the privileged information directly in the prompt with no gradient update. The following table maps recent methods onto these two axes, and a few patterns emerge: OPD dominates PG in sample efficiency when a good teacher is available; structured reward is the weakest form of privileged information; and richer privileged information introduces more distribution mismatch.

| Privileged Info \ Optimization | PG (2025–2026) | OPD (2026) | ICL (2024–2025) |
| --- | --- | --- | --- |
| Optimal Trajectory | POPE, InT | OPSD, SDFT | (not novel) |
| Optimal Policy | (not interesting) | Vanilla OPD | (not novel) |
| Unstructured Reward | Guiding PRM | SDPO | RLEF |
| Structured Reward | (part of loss in all works) | (not fine-grained enough) | (not fine-grained enough) |
Taxonomy of privileged information and the comparison matrix. Priors (optimal trajectory, optimal policy) are available before any learner trajectory; posteriors (structured/unstructured reward) come after.

Forward vs. Reverse KL

The choice of KL direction is central to everything that follows. The forward KL divergence \(D_{KL}(P_{\text{teacher}} \| Q_{\text{student}})\) takes expectations under the teacher: it hates when \(Q(x) = 0\) but \(P(x) > 0\), forcing the student to cover every mode. The result is mean-seeking — a broad distribution that hedges across all peaks. PPO uses forward KL as its trust region constraint:

\[J(\theta) = \mathbb{E}_{s \sim d^{\pi_{\text{old}}}, a \sim \pi_{\text{old}}(\cdot \vert s)}\left[r_\theta(s, a)\,\hat{A}(s, a) - \beta\, D_{KL}(\pi_{\text{old}}(\cdot \vert s) \| \pi_\theta(\cdot \vert s))\right]\]

where the KL penalty expands to \(\beta\, \mathbb{E}_{a \sim \pi_{\text{old}}}[\log \frac{\pi_{\text{old}}}{\pi_\theta}]\) — penalizing the updated policy for drifting away from the data-collecting policy. This mean-seeking nature prevents PPO from becoming too aggressive.
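As a concrete sketch of this objective, the snippet below evaluates the penalty surrogate on a batch of actions sampled from \(\pi_{\text{old}}\) (numpy, scalar value only; in training `logp_new` would carry gradients, and `ppo_penalty_objective` is an illustrative name, not an API from any library):

```python
import numpy as np

def ppo_penalty_objective(logp_new, logp_old, advantages, beta=0.01):
    """Sample-based estimate of the PPO penalty objective (a sketch).

    Inputs are per-action log-probs of actions sampled from pi_old, so the
    penalty D_KL(pi_old || pi_theta) is estimated by the sample mean of
    log pi_old - log pi_theta.
    """
    ratio = np.exp(logp_new - logp_old)   # r_theta(s, a)
    kl_est = logp_old - logp_new          # per-sample forward-KL estimate
    return float(np.mean(ratio * advantages - beta * kl_est))

# When pi_theta == pi_old, the ratio is 1 and the KL penalty vanishes,
# so the objective reduces to the mean advantage.
logp = np.log(np.array([0.5, 0.25, 0.8]))
adv = np.array([1.0, -0.5, 2.0])
```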

The reverse KL divergence \(D_{KL}(Q_{\text{student}} \| P_{\text{teacher}})\) takes expectations under the student: it hates when \(P(x) = 0\) but \(Q(x) > 0\), forcing the student to zero out mass where the teacher is absent. The result is mode-seeking — the student locks onto a single peak, ignoring others. For a multi-modal teacher, \(\arg\min_Q D_{KL}(Q \| P)\) picks one mode and sharpens around it. This is aggressive — the student commits to a single interpretation rather than hedging.

Forward KL is mean-seeking (covers all modes); reverse KL is mode-seeking (locks onto one peak).
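The two behaviors are easy to verify numerically. The sketch below (numpy; all names illustrative) fits a single Gaussian \(Q\) to a bimodal teacher \(P\) by grid search under each KL direction: forward KL picks a wide Gaussian between the modes, reverse KL a narrow Gaussian on one mode.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma)

def normalize(logp):
    m = logp.max()
    return logp - (m + np.log(np.exp(logp - m).sum()))  # log-softmax over the grid

x = np.linspace(-8.0, 8.0, 1601)
# Bimodal teacher P: equal-weight Gaussians at -2 and +2.
logP = normalize(np.logaddexp(log_gauss(x, -2.0, 0.5), log_gauss(x, 2.0, 0.5)))
P = np.exp(logP)

best = {"fwd": (np.inf, 0.0, 0.0), "rev": (np.inf, 0.0, 0.0)}
for mu in np.linspace(-4.0, 4.0, 161):
    for sigma in np.linspace(0.3, 4.0, 75):
        logQ = normalize(log_gauss(x, mu, sigma))
        Q = np.exp(logQ)
        fwd = np.sum(P * (logP - logQ))  # D_KL(P || Q): punishes Q for missing a mode
        rev = np.sum(Q * (logQ - logP))  # D_KL(Q || P): punishes Q for mass where P has none
        if fwd < best["fwd"][0]:
            best["fwd"] = (fwd, mu, sigma)
        if rev < best["rev"][0]:
            best["rev"] = (rev, mu, sigma)

# Forward KL lands between the modes with a wide sigma (mean-seeking);
# reverse KL locks onto one mode with a narrow sigma (mode-seeking).
print("forward KL fit: mu=%.2f sigma=%.2f" % best["fwd"][1:])
print("reverse KL fit: mu=%.2f sigma=%.2f" % best["rev"][1:])
```

The forward-KL fit is close to moment matching (mean 0, variance spanning both modes), which is exactly the hedging behavior described above.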

On-Policy Distillation (OPD)

OPD combines both KL directions into a single objective:

\[\min_\theta \; \mathbb{E}_{x \sim \pi_{\text{old}}}\left[\sum_{t=0}^{T-1} \frac{\pi_\theta}{\pi_{\text{old}}} \left(\log \pi_\theta - \log \pi_{\text{teach}}\right) + \beta\, D_{KL}(\pi_{\text{old}} \| \pi_\theta)\right]\]

The first term is a reverse KL between \(\pi_\theta\) and \(\pi_{\text{teach}}\) (importance-weighted), and the second is the forward KL trust region from PPO. The key insight: OPD replaces the advantage \(\hat{A}\) in PPO with a reverse KL distillation term toward the teacher.
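A minimal numpy sketch of evaluating this objective on one trajectory sampled from \(\pi_{\text{old}}\) (in a real implementation `logp_student` would carry gradients; `opd_loss` is a hypothetical name):

```python
import numpy as np

def opd_loss(logp_student, logp_old, logp_teacher, beta=0.1):
    """OPD loss on one trajectory sampled from pi_old (a sketch).

    Each input is a 1-D array of per-token log-probs of the sampled tokens
    under the respective policy. The trust-region term is the sample-based
    estimate of D_KL(pi_old || pi_theta), valid because the tokens were
    drawn from pi_old.
    """
    ratio = np.exp(logp_student - logp_old)          # importance weights
    distill = ratio * (logp_student - logp_teacher)  # reverse-KL term toward the teacher
    trust = beta * (logp_old - logp_student)         # forward-KL trust region
    return float(np.sum(distill + trust))

# If the student already matches both the teacher and the sampling policy,
# both terms vanish and the loss is zero.
logp = np.log(np.array([0.9, 0.6, 0.8]))
```

When the teacher assigns higher probability than the student to the sampled tokens, the loss goes negative, and minimizing it pushes the student's log-probs up toward the teacher's.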

Why reverse KL? The cross-entropy piece of \(D_{KL}(Q_\theta \| P) = \mathbb{E}_{x \sim Q_\theta}[\log Q_\theta(x) - \log P(x)]\) is \(-\mathbb{E}_{x \sim Q_\theta}[\log P(x)]\); taking its gradient and applying the REINFORCE trick yields:

\[\nabla_\theta \mathbb{E}_{x \sim Q_\theta}[\log P(x)] = \mathbb{E}_{x \sim Q_\theta}\left[\log P(x) \cdot \nabla_\theta \log Q_\theta(x)\right]\]

This is REINFORCE with the return \(G\) replaced by \(\log P(x)\) — the teacher’s log-probability serves as the reward. A large \(\log P(x)\) means the teacher agrees with action \(x\), so the gradient increases its probability under the student. Forward KL does not have this interpretation, because in forward KL the expectation is under \(P\) (the teacher), and \(P\) is not the distribution being optimized.
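This identity can be checked on a toy categorical student and teacher (numpy; names illustrative). The analytic gradient via the softmax Jacobian should match the REINFORCE estimate that uses \(\log P(x)\) as the reward:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete setting: student Q_theta = softmax(z), fixed teacher P.
z = np.array([0.2, -0.5, 1.0])            # student logits (the parameters)
Q = np.exp(z) / np.exp(z).sum()
logP = np.log(np.array([0.7, 0.2, 0.1]))  # teacher log-probs

# Analytic gradient of E_{x~Q}[log P(x)] w.r.t. the logits,
# via the softmax Jacobian dQ_x/dz_i = Q_x * (delta_xi - Q_i).
analytic = Q * (logP - np.sum(Q * logP))

# REINFORCE estimate: sample x ~ Q, use log P(x) as the "reward",
# with score function grad_z log Q(x) = onehot(x) - Q.
n = 200_000
xs = rng.choice(3, size=n, p=Q)
onehots = np.eye(3)[xs]
estimate = np.mean(logP[xs][:, None] * (onehots - Q), axis=0)

print(analytic, estimate)  # the two agree up to Monte Carlo noise
```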

The On-Policy Distillation paper (TML, Oct 2025) motivates OPD from the observation that SFT suffers from exposure bias (off-policyness) while RL rewards are too sparse for multi-step tasks; OPD can be mixed with either — e.g., simply replacing the advantage in PPO with the reverse-KL distillation term toward the teacher yields OPD.

From PPO to OPD: the advantage function is replaced by a reverse KL distillation term toward the teacher, while the forward KL trust region is preserved. The reverse KL gradient has a REINFORCE interpretation with G = log P(x).

Self-Distillation: OPSD and SDPO

A natural question: where does the teacher come from? Two recent papers show that the student itself can serve as the teacher, differentiated only by privileged information in the prompt.

OPSD (Self-Distilled Reasoner, UCLA, Jan 2026) uses optimal trajectories. The student \(p_S(\cdot \vert x) := p_\theta(\cdot \vert x)\) sees only the problem, while the teacher \(p_T(\cdot \vert x, y^\ast) := p_\theta(\cdot \vert x, y^\ast)\) sees the ground-truth answer \(y^\ast\) as well. The student generates on-policy samples \(\hat{y} \sim p_S(\cdot \vert x)\), and the learning objective is a per-token divergence \(D(p_T(\cdot \vert x, y^\ast, \hat{y}_{<n}) \| p_S(\cdot \vert x, \hat{y}_{<n}))\), with gradients flowing only through the student’s logits. The teacher knows the answer and can assign per-token credit: “is this token moving toward the correct solution?”

SDPO (Reinforcement Learning via Self-Distillation, ETH Zurich, Jan 2026) replaces the optimal trajectory with unstructured feedback. Given question \(x\), the student generates \(y \sim \pi_\theta(\cdot \vert x)\); a judge provides feedback \(f\) (e.g., “Don’t include n”); and the self-teacher \(\pi_\theta(y \vert x, f)\) re-evaluates the response given the feedback, performing credit assignment to determine which tokens to reinforce and which to suppress. This extends OPD to the setting where no ground-truth solution exists — only post-hoc critique.

Both methods share the same core idea: same model weights, different privileged context. The reverse KL objective drives the student toward the teacher’s corrected distribution, and on-policy sampling avoids SFT’s exposure bias.
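The shared per-token objective can be sketched in a few lines (numpy; `self_distill_loss` is an illustrative name, and shapes are simplified to one trajectory):

```python
import numpy as np

def log_softmax(logits):
    m = logits.max(axis=-1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))

def self_distill_loss(student_logits, teacher_logits):
    """Per-token D_KL(p_T || p_S), summed over positions (a sketch).

    Both logits come from the *same* model: the teacher pass additionally
    conditions on privileged context (y* for OPSD, feedback f for SDPO).
    In training, gradients flow only through student_logits; the teacher
    pass is a constant target. Shapes: (num_tokens, vocab_size).
    """
    logS = log_softmax(student_logits)
    logT = log_softmax(teacher_logits)
    pT = np.exp(logT)
    return float(np.sum(pT * (logT - logS)))

# Identical contexts give identical distributions and zero loss;
# any disagreement induced by the privileged context gives a positive
# loss that pulls the student toward the teacher's corrected distribution.
logits = np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]])
```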

Self-distillation: the teacher is the student itself, given privileged information. OPSD uses optimal trajectories; SDPO uses natural language feedback. Both use reverse KL to drive the student toward the teacher's corrected distribution.

Takeaways

The choice of how to use privileged information is not just an engineering detail — it determines the training dynamics:

  • Forward KL (PPO-style) is conservative and mean-seeking. Good for trust regions, bad for sharp imitation.
  • Reverse KL (OPD-style) is aggressive and mode-seeking. Good for distilling from a teacher, but needs the forward KL trust region to stay stable.
  • The type of privileged information matters as much as the optimization method. Having an optimal trajectory is strictly more informative than having a scalar reward, and the optimization should reflect this.
  • Self-distillation — using the same model as both student and teacher, differentiated only by privileged information in the prompt — is an elegant way to construct teachers without a separate model, and both OPSD and SDPO demonstrate its effectiveness.