PPO Clipped Objective: $\min\!\big(\rho\,\hat{A},\;\mathrm{clip}(\rho,1{-}\epsilon,1{+}\epsilon)\,\hat{A}\big)$

Naive async: $\rho = \pi_\theta / \pi_\text{behav}$ starts far from 1, so optimization can begin in the flat (zero-gradient) region: $\rho > 1{+}\epsilon$ with $\hat{A} > 0$, or $\rho < 1{-}\epsilon$ with $\hat{A} < 0$

Left: $\hat{A} > 0$ (good completion). Right: $\hat{A} < 0$ (bad completion). Dashed = async $\rho_0$.
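
The flat region can be checked numerically. Below is a minimal per-sample sketch of the clipped objective and its derivative with respect to $\rho$. The function names, the default $\epsilon = 0.2$, and the sample $\rho_0$ values are illustrative assumptions, not taken from the source.

```python
def ppo_clip_surrogate(rho, adv, eps=0.2):
    """Per-sample clipped surrogate: min(rho*A, clip(rho, 1-eps, 1+eps)*A)."""
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    return min(rho * adv, clipped * adv)


def ppo_clip_grad_wrt_rho(rho, adv, eps=0.2):
    """Derivative of the surrogate w.r.t. rho (valid away from the kinks)."""
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    # The unclipped branch rho*A is the active minimum: slope is A.
    if rho * adv <= clipped * adv:
        return adv
    # The clipped branch is active and constant in rho: slope is 0.
    return 0.0


if __name__ == "__main__":
    # Hypothetical initial ratios: on-policy (1.0) vs. stale async (1.5, 0.5).
    for rho0, adv in [(1.0, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
        print(rho0, adv,
              ppo_clip_surrogate(rho0, adv),
              ppo_clip_grad_wrt_rho(rho0, adv))
```

With these assumed values, $\rho_0 = 1.5$ with $\hat{A} > 0$ and $\rho_0 = 0.5$ with $\hat{A} < 0$ both give a zero derivative, while $\rho_0 = 1.5$ with $\hat{A} < 0$ still has slope $\hat{A}$: the objective is flat only on the side where the clipped term is the minimum.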