Pipeline (reconstructed from the method figure):
1. Prompt $x$; sample an on-policy response $y \sim \pi_\theta(\cdot \vert x)$.
2. A judge returns natural-language feedback $f$, e.g. "Don't include n".
3. The same policy conditioned on the feedback serves as the teacher: $\pi_\theta(y \vert x, f)$.
4. Per-token reverse KL divergence (RKLD) between student and teacher does the credit assignment (sketched below).
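
If RKLD here means the reverse KL $\mathrm{KL}(\pi_\theta(\cdot \vert x) \,\|\, \pi_\theta(\cdot \vert x, f))$ evaluated independently at each position of the sampled response, a minimal PyTorch sketch of the per-token loss could look like the following. The function name `per_token_rkld` and the detached teacher are my assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def per_token_rkld(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL, KL(pi_student || pi_teacher).

    Both inputs: (batch, response_len, vocab_size) logits for the
    same sampled response y. Returns (batch, response_len), one
    scalar per token, which is what gives per-token credit assignment.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    # Assumption: no gradient flows through the teacher side.
    teacher_logp = F.log_softmax(teacher_logits.detach(), dim=-1)
    # E_{t ~ student}[log student - log teacher], summed over vocab.
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
```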
Privileged info = natural language feedback $f$ (no ground-truth solution needed).
Teacher = student + feedback → re-evaluates the response knowing what went wrong (see the sketch after these notes).
Key difference: works where no $y^\ast$ exists (open-ended code, creative writing, agentic tasks).
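
A rough sketch of how the feedback-conditioned teacher could be obtained from the same model, assuming a Hugging Face-style causal LM. The bracketed feedback template is invented for illustration; the paper's exact formatting may differ.

```python
import torch

def teacher_logits(model, tokenizer, prompt: str, feedback: str,
                   response_ids: torch.Tensor) -> torch.Tensor:
    """Score the *same* sampled response under the feedback-conditioned
    context (x, f), yielding the teacher distribution per token."""
    ctx = f"{prompt}\n[Judge feedback: {feedback}]\n"  # assumed template
    ctx_ids = tokenizer(ctx, return_tensors="pt").input_ids
    full_ids = torch.cat([ctx_ids, response_ids], dim=1)
    with torch.no_grad():  # teacher only provides targets
        logits = model(full_ids).logits
    # Logits at position i predict token i+1, so this slice picks
    # exactly the predictions for each response token.
    start = ctx_ids.shape[1]
    return logits[:, start - 1 : full_ids.shape[1] - 1, :]
```

The student side would be the same forward pass without the feedback in the context and with gradients enabled; the two logit tensors then feed `per_token_rkld` above.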
Reinforcement Learning via Self-Distillation (ETH Zurich, Jan 2026), arXiv.