Self-Distillation: Same Model, Different Prompts

The teacher is the student itself, given privileged information.

Student: $p_\theta(\cdot \,|\, x)$, sees only the problem $x$.

Teacher: $p_\theta(\cdot \,|\, x, \text{priv})$, sees $x$ plus privileged info.

Same weights $\theta$, different context.

Two variants: OPSD (priv = optimal trajectory) and SDPO (priv = feedback).
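The student/teacher split above can be sketched with a toy model: a single random projection stands in for the shared weights $\theta$, and fixed vectors stand in for the tokenized contexts (all names and sizes here are illustrative, not from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, VOCAB = 8, 5

# Shared weights theta: one projection used by both roles.
W = rng.normal(size=(DIM, VOCAB))

def p_theta(context):
    """Next-token distribution of the *same* model under a given context."""
    logits = context @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

x = rng.normal(size=DIM)      # stands in for the problem x
priv = rng.normal(size=DIM)   # stands in for the privileged info

p_student = p_theta(x)         # student: sees only x
p_teacher = p_theta(x + priv)  # teacher: same W, sees x plus priv
```

Both distributions come from the same $W$; only the conditioning differs, which is all "self-distillation" requires.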

OPSD: OPD (on-policy distillation) + Optimal Trajectory

Self-Distilled Reasoner (UCLA, Jan 2026)

Dataset: $\{(x, y^\ast)\}$

Student: $p_S(\cdot \,|\, x)$

Sample: $\hat{y} \sim p_S$

Teacher: $p_T(\cdot \,|\, x, y^\ast)$

Loss: $D_{\mathrm{KL}}(p_S \,\|\, p_T)$ (reverse KL; teacher held fixed)

The teacher sees $x + y^\ast$ → can assign per-token credit.

Gradients flow only through the student's logits.

On-policy sampling avoids SFT's exposure bias.
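A minimal sketch of one OPSD update at a single sampled-token position, assuming the teacher's logits (computed with $y^\ast$ in context) are already in hand and treated as a fixed target. The gradient of the reverse KL is written out analytically so it visibly touches only the student's logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def opsd_step(student_logits, teacher_logits, lr=0.5):
    """One gradient step on D_KL(p_S || p_T) at one token position.

    teacher_logits come from the same weights with y* in context, but
    are a constant here: no gradient flows into the teacher.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    log_ratio = np.log(p_s) - np.log(p_t)
    kl = float(p_s @ log_ratio)        # D_KL(p_S || p_T)
    grad = p_s * (log_ratio - kl)      # d KL / d student_logits
    return student_logits - lr * grad, kl
```

Iterating this step drives $p_S$ toward $p_T$; in the real method the same divergence is backpropagated through the network rather than applied to raw logits.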

SDPO: OPD + Natural Language Feedback

RL via Self-Distillation (ETH Zurich, Jan 2026)

1. Question: $x$
2. Answer: $y \sim \pi_\theta(\cdot \,|\, x)$
3. Feedback: $f$ from a judge
4. Self-teacher: $\pi_\theta(\cdot \,|\, x, f)$
5. Distill: reverse KL (RKLD) from student to self-teacher

Example: def f(n): return list(range(1, n+1))

Feedback: "Don't include n."

No optimal trajectory needed — only post-hoc critique.
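The five steps above can be sketched as a loop. Every component here is a stand-in stub (a real run would sample from the policy $\pi_\theta$ and call an actual judge); the answers reuse the off-by-one range example from this slide:

```python
# Toy sketch of the SDPO loop; all functions are illustrative stubs,
# not the paper's implementation.

def sample_answer(question, feedback=None):
    """Stand-in for pi_theta: with feedback in context, it 'corrects' itself."""
    if feedback is None:
        return "def f(n): return list(range(1, n + 1))"  # buggy draft
    return "def f(n): return list(range(1, n))"          # corrected

def judge(question, answer):
    """Stand-in judge: returns natural-language feedback, or None if OK."""
    return "Don't include n." if "n + 1" in answer else None

def sdpo_round(question):
    y = sample_answer(question)             # 2. answer  y ~ pi_theta(.|x)
    f = judge(question, y)                  # 3. feedback f from the judge
    if f is None:
        return y, y                         # nothing to distill
    y_teacher = sample_answer(question, f)  # 4. self-teacher pi_theta(.|x, f)
    # 5. in the real method: minimize the reverse KL from the student's
    #    distribution to this feedback-conditioned one.
    return y, y_teacher
```

Note there is no $y^\ast$ anywhere in the loop: the only extra signal is the judge's post-hoc critique.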

OPSD vs. SDPO

OPSD

Priv info: Optimal trajectory $y^\ast$

Teacher: $p_\theta(\cdot | x, y^\ast)$

Pro: Direct, fine-grained signal

Con: Needs ground-truth solutions

SDPO

Priv info: NL feedback $f$

Teacher: $\pi_\theta(\cdot | x, f)$

Pro: Works with any judge

Con: Feedback quality matters

Both use self-distillation: same model, different privileged context.

Reverse KL drives the student toward the teacher's corrected distribution.
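Spelled out (writing $\text{priv}$ for either $y^\ast$ or $f$), the per-example objective both variants minimize is

$$
D_{\mathrm{KL}}\!\left(p_S \,\|\, p_T\right)
= \mathbb{E}_{y \sim p_\theta(\cdot \,|\, x)}\!\left[\log p_\theta(y \,|\, x) - \log p_\theta(y \,|\, x, \text{priv})\right],
$$

where the teacher term is held fixed (stop-gradient), so minimizing the expectation moves only the student's distribution, consistent with gradients flowing through the student's logits alone.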