The teacher is the student itself, given privileged information.
- Student: $p_\theta(\cdot \,|\, x)$, sees only the problem $x$.
- Teacher: $p_\theta(\cdot \,|\, x, \text{priv})$, sees $x$ plus privileged info.
Same weights $\theta$, different context.
Two variants: OPSD (priv = optimal trajectory) and SDPO (priv = feedback).
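The setup in miniature: one set of weights, queried twice under different contexts. A toy sketch, where the weight table and two-token vocabulary are invented for illustration (a real system would use a shared transformer):

```python
import math

# Toy stand-in for shared weights theta: (context, token) -> logit.
# All values here are illustrative, not from either paper.
WEIGHTS = {("x", "a"): 1.0, ("x", "b"): 0.0,
           ("x+priv", "a"): 0.0, ("x+priv", "b"): 2.0}
VOCAB = ["a", "b"]

def p_theta(context):
    """Softmax over the vocabulary for a given context; same WEIGHTS either way."""
    logits = [WEIGHTS[(context, t)] for t in VOCAB]
    z = sum(math.exp(l) for l in logits)
    return {t: math.exp(l) / z for t, l in zip(VOCAB, logits)}

student = p_theta("x")        # sees only the problem
teacher = p_theta("x+priv")   # same weights, privileged context
```

Same `p_theta`, two calls: the only difference between student and teacher is what sits in the context window.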
Self-Distilled Reasoner (UCLA, Jan 2026)
The teacher sees $x + y^\ast$ → can assign per-token credit.
Gradients flow only through the student's logits.
On-policy sampling avoids SFT's exposure bias.
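One way the per-token credit assignment could look: reverse KL at each position between the student's and the frozen teacher's next-token distributions. A minimal numpy sketch under stated assumptions (the function and the logit arrays are illustrative; in practice both sets of logits come from the same model under different contexts, with the teacher treated as a constant so gradients flow only through the student):

```python
import numpy as np

def per_token_reverse_kl(student_logits, teacher_logits):
    """Reverse KL per position: KL(student || stop_grad(teacher)).

    student_logits, teacher_logits: (seq_len, vocab) arrays.
    Returns one nonnegative value per token position.
    """
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    s = softmax(student_logits)
    t = softmax(teacher_logits)  # in training, no gradient would flow here
    return (s * (np.log(s) - np.log(t))).sum(axis=-1)
```

Positions where the teacher (having seen $y^\ast$) disagrees with the student yield large KL, i.e. the most corrective signal; positions where they agree contribute nothing.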
RL via Self-Distillation (ETH Zurich, Jan 2026)
The model drafts:

```python
def f(n): return list(range(1, n+1))  # bug: includes n
```

The judge's feedback: "Don't include n."
No optimal trajectory needed — only post-hoc critique.
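A toy sketch of that loop under stated assumptions: the two-action "model", `dist`, `judge`, and the weights are all invented for illustration (real SDPO samples full trajectories from an LLM), but the shape is the same, i.e. sample, critique, re-query the same weights with the critique in context:

```python
import math

ACTIONS = ("incl_n", "excl_n")  # toy stand-ins for "buggy draft" vs "fixed draft"

def dist(theta, context):
    """p_theta(action | context): softmax over context-dependent logits.

    theta maps (context, action) -> logit; the same weights serve every context."""
    logits = [theta.get((context, a), 0.0) for a in ACTIONS]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

def judge(action):
    # Post-hoc critique: no ground-truth trajectory required, just feedback text.
    return "Don't include n." if action == "incl_n" else "Looks good."

# Illustrative weights: the feedback-conditioned context already steers to the fix.
theta = {("x + Don't include n.", "excl_n"): 3.0}

draft = "incl_n"                              # suppose the on-policy sample was buggy
student = dist(theta, "x")                    # p_theta(. | x)
teacher = dist(theta, "x + " + judge(draft))  # p_theta(. | x, f), same theta
# A training step would then pull `student` toward `teacher` via reverse KL.
```

The teacher puts most of its mass on the corrected action purely because the critique is in its context, which is the whole mechanism in one line.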
| | OPSD | SDPO |
|---|---|---|
| Priv. info | Optimal trajectory $y^\ast$ | NL feedback $f$ |
| Teacher | $p_\theta(\cdot \mid x, y^\ast)$ | $p_\theta(\cdot \mid x, f)$ |
| Pro | Direct, fine-grained signal | Works with any judge |
| Con | Needs ground-truth solutions | Feedback quality matters |
Both use self-distillation: same model, different privileged context.
Reverse KL drives the student toward the teacher's corrected distribution.
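In symbols, a sketch of the shared objective (up to per-token weighting and baselines, which the two papers may handle differently; notation follows the definitions above):

$$\mathcal{L}(\theta) \;=\; \mathbb{E}_{y \sim p_\theta(\cdot\,|\,x)} \sum_{t} \mathrm{KL}\!\left(\, p_\theta(\cdot \,|\, x, y_{<t}) \;\middle\|\; \bar{p}_\theta(\cdot \,|\, x, \text{priv}, y_{<t}) \right),$$

where $\bar{p}_\theta$ is the teacher with gradients stopped, and priv is $y^\ast$ for OPSD or $f$ for SDPO.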