How to Use Privileged Information in RL: On-policy Distillation

In reinforcement learning for language models, we often have access to information at training time that is unavailable at test time — an optimal solution, a teacher policy, or structured feedback from a verifier. This privileged information is the secret ingredient behind many recent advances in LLM reasoning and agentic RL. But how exactly should we incorporate it into the training objective?

This post organizes the landscape along two axes: what kind of privileged information you have, and how you optimize with it. The punchline is that the choice of KL divergence direction — forward vs. reverse — has deep consequences, and a family of methods called On-Policy Distillation (OPD) emerges as a principled way to leverage privileged information through reverse KL.

Taxonomy and the Landscape

Privileged information in RL can be divided into two broad categories based on when it becomes available relative to the learner’s trajectory \(\tau_\pi\).

Priors are available before the learner generates any trajectory:

  • Optimal trajectory \(\tau^\ast\): a ground-truth solution trace (e.g., a correct chain-of-thought for a math problem). This is the richest form of prior — it tells the learner not just what the answer is, but how to get there.
  • Optimal policy \(\pi^\ast\): a stronger teacher model that can be queried for next-token probabilities. Slightly weaker than a full trajectory because it provides guidance only at the distribution level, not a concrete solution path.

Posteriors are available only after the learner generates \(\tau_\pi\):

  • Structured reward \(r\): a fixed-format, machine-readable signal — typically a scalar score or binary pass/fail (e.g., code execution result, math answer match). “Structured” means the format is predetermined and parseable, not that the content is rich. In fact this is the weakest form of privileged information — it tells you whether the trajectory was good, but not where it went wrong or how to fix it.
  • Unstructured feedback \(\hat{r}\): free-form, variable-length natural language critique from a judge model (e.g., “the error is on step 3, you forgot to carry the sign”). “Unstructured” means the format is open-ended text, not a fixed schema — but paradoxically the content is richer than a scalar reward because it can localize the error and describe what went wrong. The trade-off: it requires the learner to interpret the feedback, which introduces noise.

The distinction matters because priors enable direct imitation while posteriors require the learner to do its own credit assignment. There is a rough hierarchy of informativeness:

\[\text{optimal trajectory} \approx \text{optimal policy} > \text{unstructured feedback} > \text{structured reward}\]

Ways of Optimization

Given privileged information, there are three families of optimization:

  1. Policy gradient (PG): REINFORCE, PPO = REINFORCE + trust region. The classic approach — sample trajectories, compute rewards, update via the policy gradient theorem. Works with any reward signal but can be sample-inefficient when the reward is sparse.

  2. Surrogate policy gradient / On-Policy Distillation (OPD): Instead of optimizing a reward, OPD distills from a teacher via reverse KL divergence. The key difference from standard distillation (SFT) is that OPD samples on-policy from the student, avoiding exposure bias. This is the main focus of this post.

  3. In-context learning (ICL): Provide privileged information directly in the prompt with no gradient update. For example, RLEF puts natural language feedback in the context window and lets the model self-correct. No training required, but the model’s ability to use the information is limited by its in-context learning capacity.

The Matrix

The following table maps recent methods onto these two axes:

| Privileged Info / Optimization | PG (2025–2026) | OPD (2026) | ICL (2024–2025) |
| --- | --- | --- | --- |
| Optimal Trajectory | POPE, InT | OPSD, SDFT | (not novel) |
| Optimal Policy | (not interesting) | Vanilla OPD | (not novel) |
| Unstructured Feedback | Guiding PRM | SDPO | RLEF |
| Structured Reward | (always used; not standalone) | (not fine-grained enough) | (not fine-grained enough) |

A few patterns emerge. First, there is a rough ranking of off-policyness — the gap between the privileged information available at training time and what the model sees at test time:

\[\text{Off-policyness:} \quad \text{optimal trajectory} \approx \text{optimal policy} > \text{unstructured feedback} > \text{structured reward}\]

Second, there is a ranking of optimization methods when a good teacher is available:

\[\text{Optimization:} \quad \text{OPD} > \text{PG}\]

Concretely:

  • Optimization: OPD > PG in sample efficiency because OPD gets per-token credit assignment from the teacher rather than relying on sparse trajectory-level rewards.
  • Privileged info: Richer information strictly helps, but also introduces more distribution mismatch (the gap between what the teacher knows and what the student sees at test time).
  • Structured reward is the weakest form and is usually folded into the loss function rather than being the sole training signal. Both OPD and ICL struggle to make use of it because a scalar reward is not fine-grained enough for per-token distillation or in-context correction.

Forward vs. Reverse KL

The choice of KL direction is central to everything that follows. Let us start with the definitions.

Forward KL

The forward KL divergence places the teacher \(P\) in the first argument:

\[D_{KL}(P_{\text{teacher}} \| Q_{\text{student}}) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]\]

For continuous distributions, replace the sum with an integral. The crucial point: the expectation is under \(P\) (the teacher). This has an important consequence called the zero-avoiding property:

  • When \(Q(x) = 0\) but \(P(x) > 0\), the log ratio \(\log \frac{P(x)}{Q(x)} \to +\infty\). The loss blows up.
  • So forward KL forces \(Q(x) > 0\) wherever \(P(x) > 0\) — the student must cover every mode of the teacher.
  • Since \(D_{KL}(P \| Q) = -H(P) - \mathbb{E}_{x \sim P}[\log Q(x)]\), minimizing it is the same as maximizing \(\mathbb{E}_{x \sim P}[\log Q(x)]\) — it pushes \(Q\) up everywhere the teacher has support.

The result is mean-seeking behavior: when the teacher is multi-modal (e.g., two peaks) and the student is restricted to a simpler family (e.g., unimodal), the student spreads its mass to cover both peaks rather than committing to one. For such a bimodal teacher, \(\arg\min_Q D_{KL}(P \| Q)\) places mass across both modes, resulting in a broad, hedging distribution.

Example: PPO Uses Forward KL

PPO’s objective uses forward KL as its trust region constraint, a design inherited from TRPO:

\[J(\theta) = \mathbb{E}_{s \sim d^{\pi_{\text{old}}}, a \sim \pi_{\text{old}}(\cdot \vert s)}\left[r_\theta(s, a)\,\hat{A}(s, a) - \beta\, D_{KL}(\pi_{\text{old}}(\cdot \vert s) \| \pi_\theta(\cdot \vert s))\right]\]

where \(r = \pi_\theta / \pi_{\text{old}}\) is the importance sampling ratio. The KL penalty expands as:

\[D_{KL}(\pi_{\text{old}} \| \pi_\theta) = \mathbb{E}_{a \sim \pi_{\text{old}}}\left[\log \frac{\pi_{\text{old}}(a \vert s)}{\pi_\theta(a \vert s)}\right]\]

Why forward KL here? Because \(\pi_{\text{old}}\) collected the data — the expectation must be under \(\pi_{\text{old}}\), since those are the only samples we have. Negating the penalty gives \(\beta\, \mathbb{E}_{a \sim \pi_{\text{old}}}[\log \frac{\pi_\theta}{\pi_{\text{old}}}]\), which behaves like a maximum-likelihood term on the old policy's samples. PPO is thus importance sampling ratio + forward KL trust region. The mean-seeking nature prevents the updated policy from becoming too aggressive — it stays close to the data-collecting policy.
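
To make the per-token structure concrete, here is a minimal sketch (my own, not from any specific implementation) of the KL-penalized loss in PyTorch. Real PPO implementations typically use the clipped surrogate instead of, or alongside, the KL penalty; the function name and tensor shapes here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ppo_kl_penalty_loss(new_logits, old_logits, actions, advantages, beta=0.02):
    """Sketch of the KL-penalized PPO objective (single sequence, no clipping),
    with the sign flipped so it can be minimized.

    new_logits, old_logits: [T, V] per-step logits; actions: [T] sampled token ids
    (long dtype); advantages: [T]. old_logits come from the data-collecting policy.
    """
    logp_new = F.log_softmax(new_logits, dim=-1)
    logp_old = F.log_softmax(old_logits.detach(), dim=-1)

    # Importance ratio pi_theta(a|s) / pi_old(a|s) on the sampled actions.
    idx = actions.unsqueeze(-1)
    ratio = (logp_new.gather(-1, idx) - logp_old.gather(-1, idx)).squeeze(-1).exp()

    # Forward KL trust region: D_KL(pi_old || pi_theta), expectation under pi_old.
    kl_old_new = (logp_old.exp() * (logp_old - logp_new)).sum(-1)

    # Maximize ratio * A - beta * KL  ==  minimize the negative.
    return -(ratio * advantages - beta * kl_old_new).mean()
```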

SFT Is Also Forward KL

Note that supervised fine-tuning (SFT) is also forward KL. The SFT loss:

\[\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]\]

is equivalent to minimizing \(D_{KL}(p_{\text{data}} \| p_\theta)\) up to a constant \(H(p_{\text{data}})\). The expectation is taken under the data distribution, not the model — so SFT inherits the same mode-covering, mean-seeking behavior: the model is penalized heavily wherever the data has support but the model assigns near-zero probability, leading it to spread mass across all modes rather than committing to one. This is precisely why SFT tends to produce “hedging” behavior (e.g., generating safe but generic responses), and why reverse-KL methods like OPD can produce sharper, more committed outputs.
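
Spelling out the equivalence from the definitions above:

\[D_{KL}(p_{\text{data}} \| p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_\theta(x)}\right] = -H(p_{\text{data}}) - \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)] = \mathcal{L}_{\text{SFT}} - H(p_{\text{data}})\]

Since \(H(p_{\text{data}})\) does not depend on \(\theta\), minimizing the SFT loss is exactly minimizing the forward KL.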

Reverse KL

The reverse KL divergence swaps \(P\) and \(Q\):

\[D_{KL}(Q_{\text{student}} \| P_{\text{teacher}}) = \sum_x Q(x) \log \frac{Q(x)}{P(x)} = \mathbb{E}_{x \sim Q}\left[\log \frac{Q(x)}{P(x)}\right]\]

Now the expectation is under \(Q\) (the student). This has the opposite property — mode-seeking:

  • When \(P(x) = 0\) but \(Q(x) > 0\), the log ratio \(\log \frac{Q(x)}{P(x)} \to +\infty\). The loss blows up.
  • So reverse KL forces \(Q(x) \to 0\) wherever \(P(x) = 0\) — the student must avoid placing mass where the teacher is absent.
  • Because the expectation is under \(Q\), there is no penalty for leaving teacher modes uncovered — the objective concentrates the student's mass on regions where \(P\) is large rather than spreading it.

The intuitive question is: what happens when the student is confident but the teacher is not? Reverse KL penalizes this — the student must push its mass out of those regions. For a bimodal teacher and a student restricted to a simpler (e.g., unimodal) family, \(\arg\min_Q D_{KL}(Q \| P)\) locks onto a single mode and sharpens around it, ignoring the other. This is aggressive in nature — the student commits fully to one interpretation rather than hedging across possibilities.

Why the Direction Matters

To summarize the contrast with a concrete picture: imagine a teacher with two well-separated peaks.

  • Forward KL (\(\min_Q D_{KL}(P \| Q)\)): the student covers both peaks with a broad distribution — mean-seeking. This is what keeps PPO from being too aggressive.
  • Reverse KL (\(\min_Q D_{KL}(Q \| P)\)): the student picks one peak and concentrates there — mode-seeking. This is aggressive but sharp.

In policy optimization, forward KL is the safe conservative choice (PPO, SFT). Reverse KL is the aggressive distillation choice (OPD). As we’ll see, OPD uses both — reverse KL for distillation and forward KL for trust region stability. (For more on the subtleties of using KL as an estimator vs. as an optimization loss, see KL Estimation vs. Optimization.)
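
To see the contrast numerically, here is a small self-contained sketch (illustrative, not from the post): fit a single Gaussian student to a bimodal teacher on a 1-D grid by brute-force search, once for each KL direction.

```python
import numpy as np

xs = np.linspace(-6, 6, 2001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal teacher: two well-separated peaks at -2 and +2.
P = 0.5 * gauss(xs, -2.0, 0.5) + 0.5 * gauss(xs, 2.0, 0.5)
P /= P.sum() * dx

def kl(a, b):
    eps = 1e-12
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

best_fwd, best_rev = None, None
for mu in np.linspace(-3, 3, 61):
    for sigma in np.linspace(0.2, 4.0, 39):
        Q = gauss(xs, mu, sigma)
        Q /= Q.sum() * dx
        fwd = kl(P, Q)   # forward KL: teacher || student (mean-seeking)
        rev = kl(Q, P)   # reverse KL: student || teacher (mode-seeking)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("forward-KL fit: mu=%.2f sigma=%.2f" % best_fwd[1:])  # ~mu=0, broad sigma: covers both peaks
print("reverse-KL fit: mu=%.2f sigma=%.2f" % best_rev[1:])  # ~mu=+/-2, narrow sigma: locks onto one peak
```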

On-Policy Distillation (OPD)

The OPD Loss

OPD combines both KL directions into a single objective:

\[\min_\theta \; \mathbb{E}_{x \sim \pi_{\text{old}}}\left[\sum_{t=0}^{T-1} \frac{\pi_\theta}{\pi_{\text{old}}} \underbrace{\left(\log \pi_\theta - \log \pi_{\text{teach}}\right)}_{\text{Reverse KL}} + \beta\, \underbrace{D_{KL}(\pi_{\text{old}} \| \pi_\theta)}_{\text{Forward KL}}\right]\]

The first term is a reverse KL between \(\pi_\theta\) and \(\pi_{\text{teach}}\) (importance-weighted by \(\pi_\theta / \pi_{\text{old}}\)), and the second is the forward KL trust region inherited from PPO. The key structural insight: compare this to the PPO objective:

\[J_{\text{PPO}}(\theta) = \mathbb{E}_{s \sim d^{\pi_{\text{old}}}, a \sim \pi_{\text{old}}(\cdot \vert s)}\left[r_\theta(s, a)\,\hat{A}(s, a) - \beta\, D_{KL}(\pi_{\text{old}}(\cdot \vert s) \| \pi_\theta(\cdot \vert s))\right]\]

OPD is just PPO with the advantage \(\hat{A}\) replaced by a reverse KL distillation term toward the teacher — up to sign, minimizing the distillation term is the same as maximizing the PPO objective with the per-token advantage set to \(\log \pi_{\text{teach}} - \log \pi_\theta\). The forward KL trust region stays — it prevents the update from deviating too far from the data-collecting policy.

Why is the first term reverse KL? Recall:

\[D_{KL}(\pi_\theta \| \pi_{\text{teach}}) = \mathbb{E}_{a \sim \pi_\theta}\left[\log \frac{\pi_\theta(a)}{\pi_{\text{teach}}(a)}\right] = \mathbb{E}_{a \sim \pi_\theta}\left[\log \pi_\theta(a) - \log \pi_{\text{teach}}(a)\right]\]

This is student \(\|\) teacher — the student’s perspective on how far it is from the teacher. The importance weight \(\pi_\theta / \pi_{\text{old}}\) corrects for the fact that we sample from \(\pi_{\text{old}}\) but want expectations under \(\pi_\theta\).
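
As a concrete reading of the objective, here is a minimal per-token sketch in PyTorch (my own, not the paper's code). The function name and shapes are assumptions, and whether to stop gradients through the importance ratio is an implementation choice the equation leaves open.

```python
import torch
import torch.nn.functional as F

def opd_loss(student_logits, old_logits, teacher_logits, tokens, beta=0.1):
    """Sketch of the OPD objective for a single sequence.

    student_logits, old_logits, teacher_logits: [T, V] per-step logits;
    tokens: [T] token ids (long dtype) actually sampled from pi_old.
    Only student_logits carry gradients; old and teacher are fixed.
    """
    logp_student = F.log_softmax(student_logits, dim=-1)        # [T, V]
    logp_old     = F.log_softmax(old_logits.detach(), dim=-1)   # [T, V]
    logp_teacher = F.log_softmax(teacher_logits.detach(), dim=-1)

    # Log-probs of the sampled tokens.
    idx = tokens.unsqueeze(-1)
    lp_s = logp_student.gather(-1, idx).squeeze(-1)  # log pi_theta(a_t)
    lp_o = logp_old.gather(-1, idx).squeeze(-1)      # log pi_old(a_t)
    lp_t = logp_teacher.gather(-1, idx).squeeze(-1)  # log pi_teach(a_t)

    # Reverse-KL distillation term, importance-weighted by pi_theta / pi_old.
    ratio = (lp_s - lp_o).exp()
    distill = (ratio * (lp_s - lp_t)).sum()

    # Forward-KL trust region: D_KL(pi_old || pi_theta), full vocab, per step.
    trust = (logp_old.exp() * (logp_old - logp_student)).sum(-1).sum()

    return distill + beta * trust
```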

Another Perspective: Reverse KL as REINFORCE

Let’s extract the reverse KL term and take its gradient. We can decompose:

\[\begin{aligned} \nabla_\theta D_{KL}(Q_\theta \| P) &= \nabla_\theta \left[-H(Q_\theta) - \mathbb{E}_{x \sim Q_\theta}[\log P(x)]\right] \\ &= -\nabla_\theta H(Q_\theta) - \nabla_\theta \mathbb{E}_{x \sim Q_\theta}[\log P(x)] \end{aligned}\]

The first term \(-\nabla_\theta H(Q_\theta)\) is an entropy bonus that encourages exploration. For the second term, we apply the REINFORCE trick (the log-derivative trick):

\[\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \int Q_\theta(x) \log P(x)\, dx \\ &= \int \nabla_\theta Q_\theta(x) \log P(x)\, dx \\ &= \int Q_\theta(x) \nabla_\theta \log Q_\theta(x) \log P(x)\, dx \\ &= \mathbb{E}_{x \sim Q_\theta}\left[\log P(x) \cdot \nabla_\theta \log Q_\theta(x)\right] \end{aligned}\]

This is exactly the REINFORCE gradient! Compare it to the standard step-wise REINFORCE:

\[\mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t G_t \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right]\]

The reverse KL gradient is REINFORCE with the return \(G\) replaced by \(\log P(x)\). The teacher’s log-probability serves as the reward: a large \(\log P(x)\) means the teacher agrees with the student’s action \(x\), so the gradient increases its probability.

This is a deep connection. The reverse KL is not just a distillation loss — it’s a policy gradient where the teacher’s agreement is the reward signal. The entropy term \(-\nabla_\theta H(Q_\theta)\) provides exploration, just as entropy bonuses do in standard RL.

Important: forward KL does not have this REINFORCE interpretation. In forward KL, the expectation is under \(P\) (the teacher), and \(P\) is not the distribution being optimized — so the log-derivative trick does not apply in the same way.
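
The identity is easy to sanity-check numerically for a small categorical student \(Q_\theta = \mathrm{softmax}(\theta)\) and a fixed teacher \(P\): the REINFORCE form of the gradient, computed as an exact expectation, matches a finite-difference gradient of \(\mathbb{E}_{x \sim Q_\theta}[\log P(x)]\). The toy setup below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5
theta = rng.normal(size=V)                 # student logits, Q = softmax(theta)
P = rng.dirichlet(np.ones(V))              # fixed teacher distribution
Q = np.exp(theta) / np.exp(theta).sum()

def J(th):
    q = np.exp(th) / np.exp(th).sum()
    return float(q @ np.log(P))            # E_{x~Q}[log P(x)]

# REINFORCE form of the gradient, computed as an exact expectation:
# grad_j = E_{x~Q}[ log P(x) * (1[x==j] - Q(j)) ]   (softmax score function)
score = np.eye(V) - Q[None, :]             # score[x, j] = d log Q(x) / d theta_j
grad_reinforce = (Q[:, None] * np.log(P)[:, None] * score).sum(axis=0)

# Compare against a central finite-difference gradient of J.
eps = 1e-6
grad_fd = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                    for e in np.eye(V)])
print(np.allclose(grad_reinforce, grad_fd, atol=1e-6))  # True
```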

Motivation: Why OPD?

The On-Policy Distillation paper (TML, Oct 2025) motivates OPD from three observations:

  1. SFT has exposure bias (off-policyness): SFT trains on teacher-generated tokens but the student generates its own tokens at test time. Errors compound because the student never sees its own mistakes during training.

  2. RL reward is too sparse: For multi-step agentic tasks, a binary success/fail reward at the end of a trajectory provides almost no learning signal per step. The credit assignment problem is severe.

  3. OPD can be mixed with both: OPD bridges the gap — it can be combined with SFT (adding an on-policy distillation term to the SFT loss) or with RL (replacing the advantage in PPO with a reverse KL distillation term). The simplest recipe: keep the PPO machinery, swap the advantage for a per-token reverse KL toward the teacher, and you get OPD.

Self-Distillation: OPSD, SDFT, and SDPO

A natural question: where does the teacher come from? Vanilla OPD assumes access to a separate, stronger teacher model. But several recent papers show an elegant alternative: the student itself can serve as the teacher, differentiated only by what privileged information appears in the prompt. This is the “OPD + teacher replacement” paradigm. Three key methods instantiate this idea with different types of privileged information.

OPSD: Self-Distillation with Optimal Trajectories

OPSD (Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models, UCLA, Jan 2026) uses optimal trajectories as the privileged information. The key idea: construct two prompts for the same LLM \(p_\theta\).

  • Student policy: \(p_S(\cdot \vert x) := p_\theta(\cdot \vert x)\) — sees only the problem \(x\).
  • Teacher policy: \(p_T(\cdot \vert x, y^\ast) := p_\theta(\cdot \vert x, y^\ast)\) — sees both the problem \(x\) and the ground-truth chain-of-thought answer \(y^\ast\).

The training loop works as follows:

  1. Sample a problem \((x, y^\ast) \sim \mathcal{S}\) from the dataset.
  2. The student generates an on-policy sample \(\hat{y} \sim p_S(\cdot \vert x)\).
  3. The teacher evaluates this sample with privileged information: \(p_T(\cdot \vert x, y^\ast, \hat{y}_{<n})\) — it reads the student’s partial response and, knowing the correct answer, assigns per-token credit.
  4. The learning objective is a per-token divergence:

\[D\left(p_T(\cdot \vert x, y^\ast, \hat{y}_{<n}) \;\|\; p_S(\cdot \vert x, \hat{y}_{<n})\right)\]

Crucially, gradients flow only through the student’s logits. The teacher’s output is treated as a fixed target — even though it shares the same weights, it is not being optimized directly.

Why does this work? The teacher knows the answer \(y^\ast\) and can perform fine-grained credit assignment: “given what the student has written so far, is the next token moving toward or away from the correct solution?” This is much richer than a binary reward at the end of the trajectory.
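
A minimal sketch of one OPSD-style update (my own reading, not the paper's code), assuming hypothetical helpers `sample_fn` and `completion_logits_fn` for generation and per-token scoring:

```python
import torch
import torch.nn.functional as F

def opsd_step(model, sample_fn, completion_logits_fn, x_ids, y_star_ids, optimizer):
    """One OPSD-style update: same weights, two prompts.

    sample_fn(model, prompt_ids) -> sampled completion ids [1, N]          (hypothetical)
    completion_logits_fn(model, prompt_ids, y_ids) -> logits [1, N, V]     (hypothetical)
        next-token logits aligned with each position of y_ids.
    """
    student_prompt = x_ids                                    # problem only
    teacher_prompt = torch.cat([x_ids, y_star_ids], dim=-1)   # problem + ground-truth y*

    # 1) On-policy sample from the *student* prompt.
    with torch.no_grad():
        y_hat = sample_fn(model, student_prompt)

    # 2) Per-token distributions along the student's own sample.
    logq = F.log_softmax(completion_logits_fn(model, student_prompt, y_hat), dim=-1)
    with torch.no_grad():                                     # teacher is a fixed target
        logp = F.log_softmax(completion_logits_fn(model, teacher_prompt, y_hat), dim=-1)

    # 3) Per-token divergence; reverse KL (student || teacher) shown here.
    #    (The paper writes a generic per-token D(p_T || p_S); importance weighting
    #    and a trust-region term are omitted in this sketch.)
    loss = (logq.exp() * (logq - logp)).sum(-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```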

SDFT: Self-Distillation Fine-Tuning

SDFT (Self-Distillation Enables Continual Learning, Shenfeld et al., Jan 2026) takes a similar approach to OPSD but with a different motivation: continual learning. The teacher is again the student model itself conditioned on the optimal trajectory, but SDFT frames the objective as preventing catastrophic forgetting while learning new tasks.

Like OPSD, SDFT uses OPD with optimal trajectories as the privileged information: the teacher sees the correct solution and the student does not. The per-token reverse KL objective ensures the student stays aligned with the teacher’s corrected distribution. The key contribution of SDFT is showing that this self-distillation framework is not just useful for one-shot training — it enables the model to continually incorporate new knowledge without degrading performance on previously learned tasks.

SDPO: Self-Distillation with Feedback

SDPO (Reinforcement Learning via Self-Distillation, ETH Zurich, Jan 2026) extends the same paradigm to settings where no ground-truth solution exists — only post-hoc critique.

The pipeline has four steps:

  1. Question \(x\): the problem to solve (e.g., “Write a Python function that returns all numbers from 1 to n”).
  2. Answer \(y \sim \pi_\theta(\cdot \vert x)\): the student generates a response on-policy.
  3. Feedback \(f\): a judge (possibly an external model or verifier) provides natural language feedback on the response (e.g., “Don’t include n” — pointing out an off-by-one error).
  4. Credit assignment by self-teacher \(\pi_\theta(y \vert x, f)\): the same model, now conditioned on the feedback, re-evaluates the student’s response. It assigns per-token credit — which tokens were correct and which led to the error the feedback identified.

The difference from OPSD: instead of Teacher = student + optimal solution, we have Teacher = student + feedback, which the teacher converts into step-wise (per-token) credit. This is strictly less informative (the feedback may be vague or incomplete), but it works in domains where ground-truth solutions don’t exist — e.g., open-ended code generation, creative writing, or agentic tasks.

The Common Pattern

All three methods — OPSD, SDFT, and SDPO — share the same core idea:

  • Same model weights for student and teacher — no need for a separate, larger teacher model.
  • Different privileged context in the prompt — the teacher sees extra information that is unavailable at test time.
  • Reverse KL drives the student toward the teacher’s corrected distribution.
  • On-policy sampling avoids SFT’s exposure bias — the student learns from its own mistakes, not from pre-generated teacher trajectories.

The elegance is that the “teacher” is free — it’s just the same model, prompted differently. The privileged information (optimal solution or feedback) is the only thing that separates teacher from student.

Method Overview

The matrix in “The Matrix” section above maps the methods discussed in this post (and a few related ones) onto the two axes: privileged information type and optimization family.

Takeaways

The choice of how to use privileged information is not just an engineering detail — it determines the training dynamics. Here are the key principles:

KL Direction

  • Forward KL (PPO-style, SFT) is conservative and mean-seeking. It covers all modes, prevents aggressive updates, and is the natural choice for trust regions. But it produces hedging, generic outputs.
  • Reverse KL (OPD-style) is aggressive and mode-seeking. It locks onto the teacher’s best mode, producing sharp, committed outputs. But it needs the forward KL trust region to stay stable — pure reverse KL can collapse.
  • OPD uses both: reverse KL for distillation (replacing the advantage function) and forward KL for trust region stability. This is the key architectural insight.

Privileged Information Hierarchy

Not all privileged information is created equal. Roughly ordered by decreasing off-policyness (and decreasing informativeness):

\[\text{optimal trajectory} \approx \text{optimal policy} > \text{unstructured feedback} > \text{structured reward}\]
  • Optimal trajectory provides the richest signal but also the largest distribution mismatch — the teacher’s trajectory may follow a reasoning path the student would never take.
  • Optimal policy is similarly informative (you can always sample trajectories from a policy) but avoids committing to a specific trace.
  • Unstructured feedback is weaker — it tells you what went wrong but not the right answer, so the student must do more work to convert critique into improvement.
  • Structured reward is the weakest — a scalar signal at the end of a long trajectory. This is why pure RL (with only pass/fail reward) struggles on complex agentic tasks.

Optimization Method Ranking

When a good teacher is available:

\[\text{OPD} > \text{PG (policy gradient)}\]

OPD provides per-token credit assignment from the teacher, while PG relies on trajectory-level reward signals. The gap is especially large for multi-step tasks where reward is sparse.

Self-Distillation

Self-distillation — using the same model as both student and teacher, differentiated only by privileged information in the prompt — is an elegant way to construct teachers without a separate model. OPSD and SDPO show that this works surprisingly well: the “privileged” version of the same model can assign meaningful per-token credit, even though it has the same underlying capabilities as the student. The gap comes purely from having access to the answer (or feedback) in context.

This also suggests a broader principle: the value of privileged information lies not in the model’s capacity, but in the information asymmetry. A model that knows the answer can guide a model that doesn’t, even if they share the same weights.
