Naive async: $\rho = \pi_\theta / \pi_\text{behav}$ starts far from 1 — optimization begins in the flat (zero-gradient) region
Left: $\hat{A} > 0$ (good completion). Right: $\hat{A} < 0$ (bad completion). Dashed = async $\rho_0$.