SDPO Pipeline: Self-Distillation with Feedback

1. Question: prompt $x$.
2. Answer: sample $y \sim \pi_\theta(\cdot \vert x)$ on-policy from the current student.
3. Feedback: natural-language feedback $f$ from a judge (e.g. "Don't include n").
4. Self-Teacher: the same policy conditioned on the feedback, $\pi_\theta(y \vert x, f)$, re-scores the response and provides per-token credit assignment.
5. Distill: per-token reverse KL divergence (RKLD) between the student $\pi_\theta(\cdot \vert x)$ and the self-teacher $\pi_\theta(\cdot \vert x, f)$ (one plausible objective is sketched below).
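
One plausible way to write steps 4 and 5 as a single objective, assuming RKLD means the reverse KL $\mathrm{KL}(\text{student} \,\Vert\, \text{teacher})$ taken per response token with the feedback-conditioned self-teacher held fixed (the paper's exact form, KL direction, and weighting may differ):

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \vert x),\; f}\left[ \sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \vert x, y_{<t}) \,\Big\Vert\, \pi_{\bar{\theta}}(\cdot \vert x, f, y_{<t}) \right) \right],
$$

where $\bar{\theta}$ denotes a stop-gradient copy of the current weights acting as the self-teacher.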

Privileged info = natural language feedback $f$ (no ground-truth solution needed).
Teacher = student + feedback → re-evaluates the response knowing what went wrong.
Key difference: works where no ground-truth $y^\ast$ exists (open-ended code, creative writing, agentic tasks).
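
Below is a minimal PyTorch sketch of one such training step, assuming a Hugging Face-style causal LM `model` and `tokenizer`; the feedback prompt template, the judge call, the helper name `per_token_rkld_loss`, and the reading of RKLD as KL(student || teacher) are illustrative assumptions, not the paper's actual interface.

```python
# Sketch of one SDPO-style step: distill the feedback-conditioned self-teacher
# pi_theta(. | x, f) into the student pi_theta(. | x) with a per-token reverse KL.
# `model`/`tokenizer` are assumed to be a Hugging Face causal LM and its tokenizer;
# the feedback prompt template below is a placeholder, not the paper's.
import torch
import torch.nn.functional as F


def per_token_rkld_loss(model, tokenizer, question: str, feedback: str, response: str):
    resp_ids = tokenizer(response, return_tensors="pt", add_special_tokens=False).input_ids  # (1, T)

    def response_logprobs(context: str, grad: bool) -> torch.Tensor:
        ctx_ids = tokenizer(context, return_tensors="pt").input_ids
        ids = torch.cat([ctx_ids, resp_ids], dim=1)
        with torch.set_grad_enabled(grad):
            logits = model(ids).logits                       # (1, L, vocab)
        # Logits at position i predict token i+1, so the distributions over the
        # response tokens start at the last context position.
        start = ctx_ids.size(1) - 1
        return F.log_softmax(logits[:, start:start + resp_ids.size(1), :], dim=-1)

    student_logp = response_logprobs(question, grad=True)
    # Self-teacher: same weights, but the context also contains the judge feedback.
    teacher_logp = response_logprobs(question + "\n\nFeedback: " + feedback, grad=False)

    # Reverse KL per token: sum_v p_student * (log p_student - log p_teacher)
    rkl_per_token = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)  # (1, T)
    return rkl_per_token.mean()


# Usage (hypothetical judge): loss = per_token_rkld_loss(model, tok, x, judge(x, y), y)
# followed by loss.backward() and an optimizer step on the student.
```

Note that the teacher here is the same network evaluated with extra context, so no separate teacher model is stored; only the feedback-free student path carries gradients.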

Reinforcement Learning via Self-Distillation (ETH Zurich, Jan 2026), arXiv.