Gradient of $L_\text{clip}$ w.r.t. $\rho$: why async clipping pulls $\pi_\theta$ back toward $\pi_\text{behav}$

At $\rho \gg 1$: positive-advantage tokens have zero gradient (they sit in the clip zone), negative-advantage tokens push $\rho$ down — the net force pulls $\pi_\theta$ back toward $\pi_\text{behav}$

Top: $L_\text{clip}$ (objective). Bottom: $\partial L_\text{clip}/\partial \rho$ (gradient the optimizer sees). Shaded region = clip zone $[1{-}\epsilon, 1{+}\epsilon]$.
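The asymmetry at $\rho \gg 1$ can be checked numerically. A minimal sketch, assuming the standard PPO clipped surrogate $L_\text{clip} = \min(\rho A,\ \text{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\,A)$ and an illustrative $\epsilon = 0.2$ (both are assumptions, not taken from the figure):

```python
import numpy as np

def l_clip(rho, A, eps=0.2):
    # Standard PPO clipped surrogate: min(rho*A, clip(rho, 1-eps, 1+eps)*A)
    return np.minimum(rho * A, np.clip(rho, 1 - eps, 1 + eps) * A)

def grad_l_clip(rho, A, eps=0.2):
    # dL_clip/d(rho) via central finite differences
    h = 1e-6
    return (l_clip(rho + h, A, eps) - l_clip(rho - h, A, eps)) / (2 * h)

rho = 1.5  # well above the clip zone [0.8, 1.2]
print(grad_l_clip(rho, A=+1.0))  # ~0: positive-advantage token is clipped out
print(grad_l_clip(rho, A=-1.0))  # ~A = -1: negative-advantage token still pushes rho down
```

Summed over a batch with mixed advantage signs, only the downward forces survive at large $\rho$, which is the pull toward $\pi_\text{behav}$ the figure shows.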