- Data: $(x, y^\ast)$
- Student $p_S = p_\theta(\cdot \vert x)$: sees only $x$; samples $\hat{y} \sim p_S$
- Teacher $p_T = p_\theta(\cdot \vert x, y^\ast)$: sees $x$ and $y^\ast$
- Loss: $D(p_T \| p_S)$ + old-task regularizer
Privileged information = the optimal trajectory $y^\ast$ (same as OPSD).
Teacher = student + answer: the same model, conditioned on $y^\ast$, serves as its own teacher. Same self-distillation mechanism.
Key difference: SDFT adds continual learning, so the model learns new tasks without forgetting old ones.
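A minimal NumPy sketch of the loss shape described above: the teacher distribution (conditioned on $x$ and $y^\ast$) is distilled into the student distribution (conditioned on $x$ alone), plus a KL regularizer toward a frozen pre-update policy to protect old tasks. All distributions and the weight `lam` are toy values chosen for illustration, not from the paper.

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) for categorical distributions (assumes p, q > 0 everywhere)
    return float(np.sum(p * np.log(p / q)))

# Toy next-token distributions over a 4-symbol vocabulary (hypothetical numbers).
p_teacher = np.array([0.7, 0.1, 0.1, 0.1])      # p_theta(. | x, y*): sees the answer
p_student = np.array([0.4, 0.3, 0.2, 0.1])      # p_theta(. | x): sees only x
p_old     = np.array([0.25, 0.25, 0.25, 0.25])  # frozen old policy for the reg. term

lam = 0.1  # regularization weight (assumed)

# Distillation term pulls the student toward the answer-conditioned teacher;
# the regularizer keeps it close to the old policy, limiting forgetting.
loss = kl(p_teacher, p_student) + lam * kl(p_old, p_student)
```

In practice both terms would be token-level KLs averaged over a batch; the scalar version here just shows how the two objectives combine.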
Self-Distillation Enables Continual Learning (Shenfeld et al., Jan 2026) — arxiv