OPSD Pipeline: Self-Distillation with Optimal Trajectories

Dataset: $(x, y^\ast)$

Student: $p_S = p_\theta(\cdot \mid x)$, sees only $x$

On-policy rollout: $\hat{y} \sim p_S$

Teacher: $p_T = p_\theta(\cdot \mid x, y^\ast)$, sees $x$ + $y^\ast$

Loss: per-token $D_{\mathrm{KL}}(p_S \,\|\, p_T)$
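Reading the stages in order, the objective can be written out in full. This is a sketch under assumptions: the reverse-KL direction follows the "reverse KL per token" note below, $\hat{y}_{<t}$ denotes the student's sampled prefix, and $\operatorname{sg}[\cdot]$ is a stop-gradient so that gradients flow only through the student logits:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{\hat{y} \sim p_\theta(\cdot \mid x)} \left[ \sum_{t} D_{\mathrm{KL}}\!\left( p_\theta(\cdot \mid x, \hat{y}_{<t}) \,\middle\|\, \operatorname{sg}\!\left[ p_\theta(\cdot \mid x, y^\ast, \hat{y}_{<t}) \right] \right) \right]
$$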

Privileged information = the optimal trajectory $y^\ast$ (the ground-truth chain of thought, CoT).
Teacher = the student conditioned on the answer → it knows the solution, so it can assign per-token credit.
Objective = reverse KL per token. Gradients flow only through the student logits; the teacher distribution is detached.
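The per-token loss above can be sketched numerically. A minimal sketch with numpy; the function names are mine, and the KL direction follows the "reverse KL per token" note (student distribution first, teacher treated as a constant):

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def per_token_reverse_kl(student_logits, teacher_logits):
    # D_KL(p_S || p_T) at each token position, summed over the vocab.
    # Both arrays have shape (seq_len, vocab_size) and score the same
    # student-sampled tokens; the teacher additionally conditioned on y*.
    # In a training loop the teacher term would be detached (stop-gradient)
    # so gradients flow only through the student logits.
    p_s = softmax(student_logits)
    log_ratio = np.log(p_s) - np.log(softmax(teacher_logits))
    return (p_s * log_ratio).sum(axis=-1)  # shape (seq_len,)
```

When the two distributions agree at a position, that position's loss is exactly zero, so credit concentrates on the tokens where the answer-conditioned teacher diverges from the student.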

Self-Distilled Reasoner (UCLA, Jan 2026), arXiv preprint.