How to Use Privileged Information in RL: On-policy Distillation
In reinforcement learning for language models, we often have access to information at training time that is unavailable at test time — an optimal solution, a teacher policy, or structured feedback from a verifier. This privileged information is the secret ingredient behind many recent advances in LLM reasoning and agentic RL. But how exactly should we incorporate it into the training objective?
This post organizes the landscape along two axes: what kind of privileged information you have, and how you optimize with it. The punchline is that the choice of KL divergence direction — forward vs. reverse — has deep consequences, and a family of methods called On-Policy Distillation (OPD) emerges as a principled way to leverage privileged information through reverse KL.
Taxonomy and the Landscape
Privileged information in RL can be divided into two broad categories based on when it becomes available relative to the learner’s trajectory \(\tau_\pi\).
Priors are available before the learner generates any trajectory:
- Optimal trajectory \(\tau^\ast\): a ground-truth solution trace (e.g., a correct chain-of-thought for a math problem). This is the richest form of prior — it tells the learner not just what the answer is, but how to get there.
- Optimal policy \(\pi^\ast\): a stronger teacher model that can be queried for next-token probabilities. Slightly weaker than a full trajectory because it provides guidance only at the distribution level, not a concrete solution path.
Posteriors are available only after the learner generates \(\tau_\pi\):
- Structured reward \(r\): a fixed-format, machine-readable signal — typically a scalar score or binary pass/fail (e.g., code execution result, math answer match). “Structured” means the format is predetermined and parseable, not that the content is rich. In fact this is the weakest form of privileged information — it tells you whether the trajectory was good, but not where it went wrong or how to fix it.
- Unstructured feedback \(\hat{r}\): free-form, variable-length natural language critique from a judge model (e.g., “the error is on step 3, you forgot to carry the sign”). “Unstructured” means the format is open-ended text, not a fixed schema — but paradoxically the content is richer than a scalar reward because it can localize the error and describe what went wrong. The trade-off: it requires the learner to interpret the feedback, which introduces noise.
The distinction matters because priors enable direct imitation while posteriors require the learner to do its own credit assignment. There is a rough hierarchy of informativeness:
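\[\text{Informativeness:} \quad \tau^\ast \;\gtrsim\; \pi^\ast \;>\; \hat{r} \;>\; r\]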
Ways of Optimization
Given privileged information, there are three families of optimization:
- Policy gradient (PG): REINFORCE, PPO = REINFORCE + trust region. The classic approach — sample trajectories, compute rewards, update via the policy gradient theorem. Works with any reward signal but can be sample-inefficient when the reward is sparse.
- Surrogate policy gradient / On-Policy Distillation (OPD): Instead of optimizing a reward, OPD distills from a teacher via reverse KL divergence. The key difference from standard distillation (SFT) is that OPD samples on-policy from the student, avoiding exposure bias. This is the main focus of this post.
- In-context learning (ICL): Provide privileged information directly in the prompt with no gradient update. For example, RLEF puts natural language feedback in the context window and lets the model self-correct. No training required, but the model’s ability to use the information is limited by its in-context learning capacity.
The Matrix
The following table maps recent methods onto these two axes:
| Privileged Info / Optimization | PG (2025–2026) | OPD (2026) | ICL (2024–2025) |
|---|---|---|---|
| Optimal Trajectory | POPE, InT | OPSD, SDFT | (not novel) |
| Optimal Policy | (not interesting) | Vanilla OPD | (not novel) |
| Unstructured Feedback | Guiding PRM | SDPO | RLEF |
| Structured Reward | (always used; not standalone) | (not fine-grained enough) | (not fine-grained enough) |
A few patterns emerge. First, there is a rough ranking of off-policyness — the gap between the privileged information available at training time and what the model sees at test time:
\[\text{Off-policyness:} \quad \text{optimal trajectory} \approx \text{optimal policy} > \text{unstructured feedback} > \text{structured reward}\]

Second, there is a ranking of optimization methods when a good teacher is available:
\[\text{Optimization:} \quad \text{OPD} > \text{PG}\]

Concretely:
- Optimization: OPD > PG in sample efficiency because OPD gets per-token credit assignment from the teacher rather than relying on sparse trajectory-level rewards.
- Privileged info: Richer information strictly helps, but also introduces more distribution mismatch (the gap between what the teacher knows and what the student sees at test time).
- Structured reward is the weakest form and is usually folded into the loss function rather than being the sole training signal. Both OPD and ICL struggle to make use of it because a scalar reward is not fine-grained enough for per-token distillation or in-context correction.
Forward vs. Reverse KL
The choice of KL direction is central to everything that follows. Let us start with the definitions.
Forward KL
The forward KL divergence places the teacher \(P\) in the first argument:
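\[D_{KL}(P \,\|\, Q) = \sum_x P(x)\, \log \frac{P(x)}{Q(x)}\]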
For continuous distributions, replace the sum with an integral. The crucial point: the expectation is under \(P\) (the teacher). This has an important consequence called the zero-avoiding property:
- When \(Q(x) = 0\) but \(P(x) > 0\), the log ratio \(\log \frac{P(x)}{Q(x)} \to +\infty\). The loss blows up.
- So forward KL forces \(Q(x) > 0\) wherever \(P(x) > 0\) — the student must cover every mode of the teacher.
- It also maximizes \(\mathbb{E}_{x \sim P}[\log Q(x)]\), pushing the \(Q\) curve higher wherever the teacher has support.
The result is mean-seeking behavior: when the teacher is multi-modal (e.g., two peaks), the student spreads its mass to cover both peaks rather than committing to one. For a bimodal teacher, \(\arg\min_Q D_{KL}(P \| Q)\) places mass on both modes, resulting in a broad, hedging distribution.
Example: PPO Uses Forward KL
PPO’s objective uses forward KL as its trust region constraint, a design inherited from TRPO:
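\[J_{\text{PPO}}(\theta) = \mathbb{E}_{a \sim \pi_{\text{old}}}\!\left[\, r\, \hat{A} \,\right] \;-\; \beta\, D_{KL}\!\left(\pi_{\text{old}} \,\|\, \pi_\theta\right)\]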
where \(r = \pi_\theta / \pi_{\text{old}}\) is the importance sampling ratio. The KL penalty expands as:
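\[D_{KL}\!\left(\pi_{\text{old}} \,\|\, \pi_\theta\right) = \mathbb{E}_{a \sim \pi_{\text{old}}}\!\left[\log \frac{\pi_{\text{old}}(a \mid s)}{\pi_\theta(a \mid s)}\right]\]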
Why forward KL here? Because \(\pi_{\text{old}}\) collected the data — the expectation must be under \(\pi_{\text{old}}\) since those are the samples we have. The negative of this forward KL penalty is \(\beta\, \mathbb{E}_{a \sim \pi_{\text{old}}}[\log \frac{\pi_\theta}{\pi_{\text{old}}}]\), which is, up to a constant, a log-likelihood term on the old samples: learning from old data. PPO is thus the importance-sampling ratio plus a forward KL trust region. The mean-seeking nature prevents the updated policy from becoming too aggressive — it stays close to the data-collecting policy.
SFT Is Also Forward KL
Note that supervised fine-tuning (SFT) is also forward KL. The SFT loss:
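\[\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right]\]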
is equivalent to minimizing \(D_{KL}(p_{\text{data}} \| p_\theta)\) up to a constant \(H(p_{\text{data}})\). The expectation is taken under the data distribution, not the model — so SFT inherits the same mode-covering, mean-seeking behavior: the model is penalized heavily wherever the data has support but the model assigns near-zero probability, leading it to spread mass across all modes rather than committing to one. This is precisely why SFT tends to produce “hedging” behavior (e.g., generating safe but generic responses), and why reverse-KL methods like OPD can produce sharper, more committed outputs.
Reverse KL
The reverse KL divergence swaps \(P\) and \(Q\):
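\[D_{KL}(Q \,\|\, P) = \sum_x Q(x)\, \log \frac{Q(x)}{P(x)}\]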
Now the expectation is under \(Q\) (the student). This has the opposite property — mode-seeking:
- When \(P(x) = 0\) but \(Q(x) > 0\), the log ratio \(\log \frac{Q(x)}{P(x)} \to +\infty\). The loss blows up.
- So reverse KL forces \(Q(x) \to 0\) wherever \(P(x) = 0\) — the student must avoid placing mass where the teacher is absent.
- It also maximizes \(\mathbb{E}_{x \sim Q}[\log P(x)]\), pulling probability mass toward the teacher’s high-density regions and concentrating it there.
The intuitive question is: what happens when the student is confident but the teacher is not? Reverse KL penalizes this — the student must zero out its mass in those regions. For a bimodal teacher, \(\arg\min_Q D_{KL}(Q \| P)\) locks onto a single mode and sharpens around it, ignoring the other. This is aggressive in nature — the student commits fully to one interpretation rather than hedging across possibilities.
Why the Direction Matters
To summarize the contrast with a concrete picture: imagine a teacher with two well-separated peaks.
- Forward KL (\(\min_Q D_{KL}(P \| Q)\)): the student covers both peaks with a broad distribution — mean-seeking. This is what keeps PPO from being too aggressive.
- Reverse KL (\(\min_Q D_{KL}(Q \| P)\)): the student picks one peak and concentrates there — mode-seeking. This is aggressive but sharp.
In policy optimization, forward KL is the safe conservative choice (PPO, SFT). Reverse KL is the aggressive distillation choice (OPD). As we’ll see, OPD uses both — reverse KL for distillation and forward KL for trust region stability. (For more on the subtleties of using KL as an estimator vs. as an optimization loss, see KL Estimation vs. Optimization.)
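To see the contrast concretely in code, here is a small self-contained Python sketch (a toy illustration, not from any of the papers discussed): fit a single Gaussian to a two-peaked target on a discrete grid by brute-force search, once under each KL direction.

```python
import numpy as np

# Toy illustration: fit one Gaussian Q to a bimodal target P on a discrete grid,
# minimizing each KL direction by brute-force search over (mu, sigma).
x = np.linspace(-6, 6, 601)

def normal(mu, sigma):
    p = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return p / p.sum()

P = 0.5 * normal(-2.0, 0.5) + 0.5 * normal(+2.0, 0.5)   # teacher: two well-separated peaks

def kl(a, b, eps=1e-12):
    return np.sum(a * (np.log(a + eps) - np.log(b + eps)))

candidates = [(mu, sigma) for mu in np.linspace(-3, 3, 61)
                          for sigma in np.linspace(0.2, 3.0, 57)]

best_fwd = min(candidates, key=lambda t: kl(P, normal(*t)))   # forward KL: D(P || Q)
best_rev = min(candidates, key=lambda t: kl(normal(*t), P))   # reverse KL: D(Q || P)

print("forward KL (mean-seeking):", best_fwd)   # mu near 0, large sigma: covers both peaks
print("reverse KL (mode-seeking):", best_rev)   # mu near one peak, small sigma: commits to it
```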
On-Policy Distillation (OPD)
The OPD Loss
OPD combines both KL directions into a single objective:
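\[J_{\text{OPD}}(\theta) = \mathbb{E}_{a \sim \pi_{\text{old}}}\!\left[\frac{\pi_\theta}{\pi_{\text{old}}}\,\Bigl(-D_{KL}\bigl(\pi_\theta \,\|\, \pi_{\text{teach}}\bigr)\Bigr)\right] \;-\; \beta\, D_{KL}\!\left(\pi_{\text{old}} \,\|\, \pi_\theta\right)\]

(Written schematically; exact per-token details vary across implementations.)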
The first term is a reverse KL between \(\pi_\theta\) and \(\pi_{\text{teach}}\) (importance-weighted by \(\pi_\theta / \pi_{\text{old}}\)), and the second is the forward KL trust region inherited from PPO. The key structural insight: compare this to the PPO objective:
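\[J_{\text{PPO}}(\theta) = \mathbb{E}_{a \sim \pi_{\text{old}}}\!\left[\frac{\pi_\theta}{\pi_{\text{old}}}\, \hat{A}\right] \;-\; \beta\, D_{KL}\!\left(\pi_{\text{old}} \,\|\, \pi_\theta\right)\]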
OPD is just PPO with the advantage \(\hat{A}\) replaced by a reverse KL distillation term toward the teacher. The forward KL trust region stays — it prevents the update from deviating too far from the data-collecting policy.
Why is the first term reverse KL? Recall:
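\[D_{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{teach}}\right) = \mathbb{E}_{a \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(a \mid s)}{\pi_{\text{teach}}(a \mid s)}\right]\]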
This is student \(\|\) teacher — the student’s perspective on how far it is from the teacher. The importance weight \(\pi_\theta / \pi_{\text{old}}\) corrects for the fact that we sample from \(\pi_{\text{old}}\) but want expectations under \(\pi_\theta\).
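To make the structure concrete, here is a minimal PyTorch-style sketch of this objective for a single sampled sequence. The tensor names, the sampled-token estimate of the trust-region term, and the \(\beta\) value are illustrative assumptions rather than any particular implementation.

```python
import torch
import torch.nn.functional as F

def opd_loss(student_logits, teacher_logits, old_logprobs, actions, beta=0.1):
    """Sketch of a per-token OPD loss on one sampled sequence (to be minimized).

    student_logits / teacher_logits: [T, V] logits at each generated position,
    old_logprobs: [T] log pi_old of the sampled tokens (detached), actions: [T] token ids.
    Names and the beta value are illustrative, not from a specific codebase.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)   # log pi_theta(. | s_t)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)   # log pi_teach(. | s_t), no grad

    # Reverse KL per token: D_KL(pi_theta || pi_teach), summed over the vocabulary.
    reverse_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)     # [T]

    # Importance weight pi_theta / pi_old on the sampled tokens.
    new_logp_taken = student_logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # [T]
    ratio = (new_logp_taken - old_logprobs).exp()

    # Forward KL trust region, estimated on the sampled tokens: E_old[log(pi_old / pi_theta)].
    forward_kl = (old_logprobs - new_logp_taken).mean()

    # Maximizing  E[ratio * (-reverse KL)] - beta * forward KL  ==  minimizing this loss.
    return (ratio * reverse_kl).mean() + beta * forward_kl
```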
Another Perspective: Reverse KL as REINFORCE
Let’s extract the reverse KL term and take its gradient. We can decompose:
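\[\nabla_\theta\, D_{KL}\!\left(Q_\theta \,\|\, P\right) = -\,\nabla_\theta H(Q_\theta) \;-\; \nabla_\theta\, \mathbb{E}_{x \sim Q_\theta}\!\left[\log P(x)\right]\]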
The first term \(-\nabla_\theta H(Q_\theta)\) is an entropy bonus that encourages exploration. For the second term, we apply the REINFORCE trick (the log-derivative trick):
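\[\nabla_\theta\, \mathbb{E}_{x \sim Q_\theta}\!\left[\log P(x)\right] = \mathbb{E}_{x \sim Q_\theta}\!\left[\log P(x)\, \nabla_\theta \log Q_\theta(x)\right]\]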
This is exactly the REINFORCE gradient! Compare it to the standard step-wise REINFORCE:
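\[\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\!\left[\, G\, \nabla_\theta \log \pi_\theta(a \mid s)\,\right]\]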
The reverse KL gradient is REINFORCE with the return \(G\) replaced by \(\log P(x)\). The teacher’s log-probability serves as the reward: a large \(\log P(x)\) means the teacher agrees with the student’s action \(x\), so the gradient increases its probability.
This is a deep connection. The reverse KL is not just a distillation loss — it’s a policy gradient where the teacher’s agreement is the reward signal. The entropy term \(-\nabla_\theta H(Q_\theta)\) provides exploration, just as entropy bonuses do in standard RL.
Important: forward KL does not have this REINFORCE interpretation. In forward KL, the expectation is under \(P\) (the teacher), and \(P\) is not the distribution being optimized — so the log-derivative trick does not apply in the same way.
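The decomposition is easy to check numerically. Below is a small self-contained sketch on an arbitrary five-category example: the autograd gradient of the reverse KL matches the entropy term plus the REINFORCE-style term with \(\log P(x)\) as the reward.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)      # parameters of the student Q_theta
P = torch.softmax(torch.randn(5), dim=0)         # fixed teacher distribution

def grad_of(scalar):
    """Gradient of a scalar expression with respect to the student logits."""
    return torch.autograd.grad(scalar, logits)[0]

# Gradient of the reverse KL D_KL(Q_theta || P), computed directly by autograd.
Q = torch.softmax(logits, dim=0)
lhs = grad_of((Q * (Q.log() - P.log())).sum())

# Decomposition: -grad H(Q_theta)  minus  the REINFORCE term E_Q[log P(x) * grad log Q(x)].
Q = torch.softmax(logits, dim=0)
grad_neg_entropy = grad_of((Q * Q.log()).sum())                    # gradient of -H(Q_theta)
Q = torch.softmax(logits, dim=0)
grad_reinforce = grad_of((Q.detach() * P.log() * Q.log()).sum())   # E_Q[log P * grad log Q]

print(torch.allclose(lhs, grad_neg_entropy - grad_reinforce))      # True
```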
Motivation: Why OPD?
The On-Policy Distillation paper (TML, Oct 2025) motivates OPD from three observations:
- SFT has exposure bias (off-policyness): SFT trains on teacher-generated tokens but the student generates its own tokens at test time. Errors compound because the student never sees its own mistakes during training.
- RL reward is too sparse: For multi-step agentic tasks, a binary success/fail reward at the end of a trajectory provides almost no learning signal per step. The credit assignment problem is severe.
- OPD can be mixed with both: OPD bridges the gap — it can be combined with SFT (adding an on-policy distillation term to the SFT loss) or with RL (replacing the advantage in PPO with a reverse KL distillation term). The simplest recipe: keep PPO’s machinery, swap the advantage \(\hat{A}\) for a reverse KL distillation term toward the teacher, and you get OPD.
Self-Distillation: OPSD, SDFT, and SDPO
A natural question: where does the teacher come from? Vanilla OPD assumes access to a separate, stronger teacher model. But several recent papers show an elegant alternative: the student itself can serve as the teacher, differentiated only by what privileged information appears in the prompt. This is the “OPD + teacher replacement” paradigm. Three key methods instantiate this idea with different types of privileged information.
OPSD: Self-Distillation with Optimal Trajectories
OPSD (Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models, UCLA, Jan 2026) uses optimal trajectories as the privileged information. The key idea: construct two prompts for the same LLM \(p_\theta\).
- Student policy: \(p_S(\cdot \vert x) := p_\theta(\cdot \vert x)\) — sees only the problem \(x\).
- Teacher policy: \(p_T(\cdot \vert x, y^\ast) := p_\theta(\cdot \vert x, y^\ast)\) — sees both the problem \(x\) and the ground-truth chain-of-thought answer \(y^\ast\).
The training loop works as follows:
- Sample a problem \((x, y^\ast) \sim \mathcal{S}\) from the dataset.
- The student generates an on-policy sample \(\hat{y} \sim p_S(\cdot \vert x)\).
- The teacher evaluates this sample with privileged information: \(p_T(\cdot \vert x, y^\ast, \hat{y}_{<n})\) — it reads the student’s partial response and, knowing the correct answer, assigns per-token credit.
- The learning objective is a per-token divergence:
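\[\mathcal{L}(\theta) = \mathbb{E}_{\hat{y} \sim p_S(\cdot \mid x)}\!\left[\, \sum_{n} D_{KL}\!\Bigl(p_S\bigl(\cdot \mid x, \hat{y}_{<n}\bigr) \,\Big\|\, p_T\bigl(\cdot \mid x, y^\ast, \hat{y}_{<n}\bigr)\Bigr)\right]\]

(Written schematically with the student-first, reverse KL direction used throughout this post.)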
Crucially, gradients flow only through the student’s logits. The teacher’s output is treated as a fixed target — even though it shares the same weights, it is not being optimized directly.
Why does this work? The teacher knows the answer \(y^\ast\) and can perform fine-grained credit assignment: “given what the student has written so far, is the next token moving toward or away from the correct solution?” This is much richer than a binary reward at the end of the trajectory.
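A rough sketch of this loop, assuming a Hugging Face-style causal LM and tokenizer; the prompt template, helper names, and slicing details are illustrative rather than the paper’s exact recipe.

```python
import torch
import torch.nn.functional as F

def opsd_step(model, tokenizer, x, y_star, optimizer, max_new_tokens=256):
    """One OPSD-style update: same weights as student and teacher, different prompts."""
    # Student prompt sees only the problem; teacher prompt also sees the solution y*.
    student_ids = tokenizer(x, return_tensors="pt").input_ids
    teacher_ids = tokenizer(x + "\n\nReference solution:\n" + y_star,
                            return_tensors="pt").input_ids

    # 1) On-policy sample from the student (no gradients during sampling).
    with torch.no_grad():
        sampled = model.generate(student_ids, max_new_tokens=max_new_tokens, do_sample=True)
    response = sampled[:, student_ids.shape[1]:]                     # \hat{y}

    # 2) Student logits on its own response (gradients flow here).
    s_logits = model(torch.cat([student_ids, response], dim=1)).logits
    s_logits = s_logits[:, student_ids.shape[1] - 1 : -1]            # positions predicting \hat{y}_n

    # 3) Teacher logits: same weights, privileged prompt, treated as a fixed target.
    with torch.no_grad():
        t_logits = model(torch.cat([teacher_ids, response], dim=1)).logits
        t_logits = t_logits[:, teacher_ids.shape[1] - 1 : -1]

    # 4) Per-token reverse KL: student distribution against the privileged teacher.
    s_logp, t_logp = F.log_softmax(s_logits, -1), F.log_softmax(t_logits, -1)
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```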
SDFT: Self-Distillation Fine-Tuning
SDFT (Self-Distillation Enables Continual Learning, Shenfeld et al., Jan 2026) takes a similar approach to OPSD but with a different motivation: continual learning. The teacher is again the student model itself conditioned on the optimal trajectory, but SDFT frames the objective as preventing catastrophic forgetting while learning new tasks.
Like OPSD, SDFT uses OPD with optimal trajectories as the privileged information: the teacher sees the correct solution and the student does not. The per-token reverse KL objective ensures the student stays aligned with the teacher’s corrected distribution. The key contribution of SDFT is showing that this self-distillation framework is not just useful for one-shot training — it enables the model to continually incorporate new knowledge without degrading performance on previously learned tasks.
SDPO: Self-Distillation with Feedback
SDPO (Reinforcement Learning via Self-Distillation, ETH Zurich, Jan 2026) extends the same paradigm to settings where no ground-truth solution exists — only post-hoc critique.
The pipeline has four steps:
- Question \(x\): the problem to solve (e.g., “Write a Python function that returns all numbers from 1 to n”).
- Answer \(y \sim \pi_\theta(\cdot \vert x)\): the student generates a response on-policy.
- Feedback \(f\): a judge (possibly an external model or verifier) provides natural language feedback on the response (e.g., “Don’t include n” — pointing out an off-by-one error).
- Credit assignment by self-teacher \(\pi_\theta(y \vert x, f)\): the same model, now conditioned on the feedback, re-evaluates the student’s response. It assigns per-token credit — which tokens were correct and which led to the error the feedback identified.
The difference from OPSD: instead of Teacher = student + optimal solution, we have Teacher = student + feedback (step-wise reward). This is strictly less informative (the feedback may be vague or incomplete), but it works in domains where ground-truth solutions don’t exist — e.g., open-ended code generation, creative writing, or agentic tasks.
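In code, the only change relative to the OPSD sketch above is the teacher’s prompt; a hypothetical variant (the template string is illustrative):

```python
# SDPO-style teacher prompt: the privileged context is the judge's feedback f
# on the student's attempt, rather than a reference solution y*. (Illustrative template.)
teacher_ids = tokenizer(x + "\n\nFeedback on your previous attempt:\n" + f,
                        return_tensors="pt").input_ids
# Everything else (on-policy sampling, per-token reverse KL against the
# feedback-conditioned teacher, gradients only through the student) is unchanged.
```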
The Common Pattern
All three methods — OPSD, SDFT, and SDPO — share the same core idea:
- Same model weights for student and teacher — no need for a separate, larger teacher model.
- Different privileged context in the prompt — the teacher sees extra information that is unavailable at test time.
- Reverse KL drives the student toward the teacher’s corrected distribution.
- On-policy sampling avoids SFT’s exposure bias — the student learns from its own mistakes, not from pre-generated teacher trajectories.
The elegance is that the “teacher” is free — it’s just the same model, prompted differently. The privileged information (optimal solution or feedback) is the only thing that separates teacher from student.
Method Overview
The interactive matrix below maps all methods discussed in this post (and a few related ones) onto the two axes: privileged information type and optimization family. Click any method name to see its pipeline and key idea.
Takeaways
The choice of how to use privileged information is not just an engineering detail — it determines the training dynamics. Here are the key principles:
KL Direction
- Forward KL (PPO-style, SFT) is conservative and mean-seeking. It covers all modes, prevents aggressive updates, and is the natural choice for trust regions. But it produces hedging, generic outputs.
- Reverse KL (OPD-style) is aggressive and mode-seeking. It locks onto the teacher’s best mode, producing sharp, committed outputs. But it needs the forward KL trust region to stay stable — pure reverse KL can collapse.
- OPD uses both: reverse KL for distillation (replacing the advantage function) and forward KL for trust region stability. This is the key architectural insight.
Privileged Information Hierarchy
Not all privileged information is created equal. Roughly ordered from most to least informative (which also means from most to least off-policy):
- Optimal trajectory provides the richest signal but also the largest distribution mismatch — the teacher’s trajectory may follow a reasoning path the student would never take.
- Optimal policy is similarly informative (you can always sample trajectories from a policy) but avoids committing to a specific trace.
- Unstructured feedback is weaker — it tells you what went wrong but not the right answer, so the student must do more work to convert critique into improvement.
- Structured reward is the weakest — a scalar signal at the end of a long trajectory. This is why pure RL (with only pass/fail reward) struggles on complex agentic tasks.
Optimization Method Ranking
When a good teacher is available:
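\[\text{Optimization:} \quad \text{OPD} > \text{PG}\]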
OPD provides per-token credit assignment from the teacher, while PG relies on trajectory-level reward signals. The gap is especially large for multi-step tasks where reward is sparse.
Self-Distillation
Self-distillation — using the same model as both student and teacher, differentiated only by privileged information in the prompt — is an elegant way to construct teachers without a separate model. OPSD and SDPO show that this works surprisingly well: the “privileged” version of the same model can assign meaningful per-token credit, even though it has the same underlying capabilities as the student. The gap comes purely from having access to the answer (or feedback) in context.
This also suggests a broader principle: the value of privileged information lies not in the model’s capacity, but in the information asymmetry. A model that knows the answer can guide a model that doesn’t, even if they share the same weights.