SDFT Pipeline: Self-Distillation for Continual Learning

New Task $(x, y^\ast)$ → Student $p_\theta(\cdot \vert x)$ (sees only $x$) → On-Policy rollout $\hat{y} \sim p_S$ → Teacher $p_\theta(\cdot \vert x, y^\ast)$ (sees $x + y^\ast$) → Loss $D(p_T \| p_S)$ + old-task regularization

Privileged information = the optimal trajectory $y^\ast$ (same as OPSD).
Teacher = student + answer: the same self-distillation mechanism.
Key difference: SDFT adds continual learning, so the model learns new tasks without forgetting old ones.
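
To make the pipeline concrete, here is a minimal PyTorch sketch of one update step. It assumes a causal LM where `model(ids)` returns next-token logits of shape [batch, seq, vocab]; the helper `sample_from`, the weight `beta`, and the frozen-copy form of the old-task regularizer are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of one SDFT update step (PyTorch).
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_from(model, ctx, n_new):
    """On-policy rollout: sample n_new tokens from p_theta(. | ctx)."""
    ids = ctx
    for _ in range(n_new):
        probs = F.softmax(model(ids)[:, -1, :], dim=-1)   # next-token distribution
        ids = torch.cat([ids, torch.multinomial(probs, 1)], dim=1)
    return ids[:, ctx.size(1):]                           # sampled continuation only

def token_kl(t_logits, s_logits):
    """D(p_T || p_S), averaged per token. F.kl_div takes log p_S as input."""
    return F.kl_div(F.log_softmax(s_logits, -1).flatten(0, 1),
                    F.log_softmax(t_logits, -1).flatten(0, 1),
                    log_target=True, reduction="batchmean")

def sdft_loss(model, x, y_star, old_ref=None, old_x=None, n_new=64, beta=0.1):
    # 1) Student sees only x and produces an on-policy rollout y_hat ~ p_S.
    y_hat = sample_from(model, x, n_new)
    # 2) Student re-scores its own rollout, p_theta(y_hat | x), with gradients.
    s_logits = model(torch.cat([x, y_hat], dim=1))[:, x.size(1) - 1:-1, :]
    # 3) Teacher = same weights conditioned on (x, y_star); detached so it
    #    acts as a fixed target, as is standard in self-distillation.
    with torch.no_grad():
        t_in = torch.cat([x, y_star, y_hat], dim=1)
        t_logits = model(t_in)[:, x.size(1) + y_star.size(1) - 1:-1, :]
    loss = token_kl(t_logits, s_logits)
    # 4) Old-task regularizer (illustrative choice, not from the paper):
    #    stay close to a frozen pre-update copy `old_ref` on old-task inputs.
    if old_ref is not None and old_x is not None:
        with torch.no_grad():
            ref_logits = old_ref(old_x)
        loss = loss + beta * token_kl(ref_logits, model(old_x))
    return loss
```

Gradients flow only through the student's re-scoring pass; the rollout and the teacher are both detached, so the update pulls $p_S$ toward the answer-conditioned $p_T$ on trajectories the student actually produces.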

Self-Distillation Enables Continual Learning (Shenfeld et al., Jan 2026), arXiv.