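[Figure: self-distillation setup. Input pair $(x, y^\ast)$. Student $p_S = p_\theta(\cdot \vert x)$ sees only $x$ and samples $\hat{y} \sim p_S$. Teacher $p_T = p_\theta(\cdot \vert x, y^\ast)$ sees $x$ and $y^\ast$. Loss: per-token $D(p_T \| p_S)$.]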
Privileged info = optimal trajectory $y^\ast$ (ground-truth CoT).
Teacher = student + answer → knows the solution, assigns per-token credit.
Objective = reverse KL per token. Gradients flow only through student logits.
Source: Self-Distilled Reasoner (UCLA, Jan 2026), arXiv.
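The per-token objective in the notes above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation. It assumes "reverse KL" means $\mathrm{KL}(p_S \| p_T)$, i.e. the expectation is taken under the student, which is the usual on-policy convention. The $D(p_T \| p_S)$ notation alone does not fix the argument order. The function names (`reverse_kl`, `reverse_kl_grad`, `sequence_loss`) and the toy logits are hypothetical. Teacher logits are treated as constants, so the only gradient computed is with respect to the student logits.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(p_S || p_T) at a single token position (assumed reading of 'reverse KL')."""
    log_ps = log_softmax(student_logits)
    log_pt = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_ps, log_pt))


def reverse_kl_grad(student_logits, teacher_logits):
    """Gradient of KL(p_S || p_T) w.r.t. the student logits only.

    Teacher logits are constants here, mirroring 'gradients flow only
    through student logits'. For p_S = softmax(z), the i-th component is
    p_S[i] * ((log p_S[i] - log p_T[i]) - KL).
    """
    log_ps = log_softmax(student_logits)
    log_pt = log_softmax(teacher_logits)
    kl = sum(math.exp(a) * (a - b) for a, b in zip(log_ps, log_pt))
    return [math.exp(a) * ((a - b) - kl) for a, b in zip(log_ps, log_pt)]


def sequence_loss(student_seq, teacher_seq):
    """Mean per-token reverse KL over the positions of a sampled response."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)


# Toy example: 2 token positions, vocabulary of size 3 (made-up numbers).
student_seq = [[0.5, -1.0, 2.0], [0.0, 0.0, 0.0]]
teacher_seq = [[1.0, 0.0, -0.5], [2.0, 0.0, -1.0]]
loss = sequence_loss(student_seq, teacher_seq)
grads = [reverse_kl_grad(s, t) for s, t in zip(student_seq, teacher_seq)]
print(loss, grads)
```

In the actual setup both sets of logits would come from the same model $p_\theta$ evaluated on the student-sampled prefix of $\hat{y}$, once conditioned on $x$ alone and once on $x$ and $y^\ast$. An autodiff framework would get the same effect by detaching the teacher logits before computing the loss.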