At $\rho_0 \gg 1$: positive-advantage tokens have zero gradient, negative-advantage tokens push $\rho$ down — net force is toward $\pi_\text{behav}$
Top: $L_\text{clip}$ (objective). Bottom: $\partial L_\text{clip}/\partial \rho$ (gradient the optimizer sees). Shaded region = clip zone $[1{-}\epsilon, 1{+}\epsilon]$.