Given prompt $q$, sample $G$ completions $\{o_1, \ldots, o_G\}$ from $\pi_{\theta_\text{old}}$.
For the $l$-th token of the $i$-th completion, compute the importance ratio:
$$\rho_i^l = \frac{\pi_\theta(o_i^l \mid q,\, o_i^{\lt l})}{\pi_{\theta_\text{old}}(o_i^l \mid q,\, o_i^{\lt l})}$$

$\pi_\theta$ = current policy | $\pi_{\theta_\text{old}}$ = rollout policy | $o_i^{\lt l}$ = preceding tokens
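In code, the per-token ratio is typically computed from log-probabilities; a minimal sketch with toy values (all numbers illustrative):

```python
import math

# Toy per-token log-probs for one completion under the current policy
# (pi_theta) and the rollout policy (pi_theta_old); values are illustrative.
logp_new = [-1.2, -0.7, -2.1]   # log pi_theta(o_i^l | q, o_i^{<l})
logp_old = [-1.0, -0.7, -2.5]   # log pi_theta_old(o_i^l | q, o_i^{<l})

# rho_i^l = exp(log-prob difference): one importance ratio per token.
rho = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
```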
The group advantage $\hat A_i$ is computed from reward statistics across the group:

$$\hat A_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}$$
Each token is then clipped independently (PPO-style):

$$\min\!\Big(\rho_i^l \hat A_i,\; \operatorname{clip}\big(\rho_i^l,\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat A_i\Big)$$
$\varepsilon = 0.2$ typically | same $\hat A_i$ for all tokens in $o_i$
$|o_i|$ separate ratios, $|o_i|$ separate clip decisions — but only one advantage per completion.
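The pieces above can be sketched in a few lines; the rewards and helper name are illustrative, not from any particular implementation:

```python
import math

# Sketch of GRPO's two per-completion pieces (toy rewards; helper name is
# illustrative). eps = 0.2 as in the text.
rewards = [1.0, 0.0, 0.5, 0.5]                          # one scalar reward per completion
mean_r = sum(rewards) / len(rewards)
std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / len(rewards))
adv = [(r - mean_r) / (std_r + 1e-8) for r in rewards]  # group advantage A_i

def clipped_term(rho, A, eps=0.2):
    """PPO-style clip applied to ONE token ratio, with the completion-level A."""
    rho_clipped = min(max(rho, 1 - eps), 1 + eps)
    return min(rho * A, rho_clipped * A)
```

Note the asymmetry the text highlights: `adv` is one number per completion, but `clipped_term` is applied once per token.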
A 10-token completion. Each token has its own $\rho_i^l$:
Most tokens: $\rho \approx 1$ (common tokens, unchanged).
A few tokens shift substantially — these carry the real policy change.
But clipping treats each one in isolation, without considering the sequence as a whole.
Token-level $\rho_i^l$ does not correspond to valid importance sampling.
The sequence-level ratio is a product of token ratios:
$$\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)} = \prod_{l=1}^{|o_i|} \rho_i^l$$

Clipping each $\rho_i^l \in [0.8, 1.2]$ individually allows the product to range over:
10 tokens: $[0.8^{10},\; 1.2^{10}] = [0.107,\; 6.19]$
100 tokens: $[0.8^{100},\; 1.2^{100}] \approx [2 \times 10^{-10},\; 8 \times 10^7]$
Token-level clipping provides no effective constraint on the sequence-level ratio.
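The bound arithmetic above is easy to verify numerically:

```python
# Token-level clipping bounds each rho_i^l in [0.8, 1.2], but the
# sequence-level ratio (the product of token ratios) is barely constrained.
for length in (10, 100):
    lo, hi = 0.8 ** length, 1.2 ** length
    print(f"{length} tokens: product range [{lo:.3g}, {hi:.3g}]")
```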
Replace $|o_i|$ token ratios with a single sequence-level quantity:

$$s_i(\theta) = \left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)}\right)^{1/|o_i|} = \left(\prod_{l=1}^{|o_i|} \rho_i^l\right)^{1/|o_i|}$$
This is the geometric mean of per-token ratios.
One scalar per completion. The sequence-level ratio is directly bounded.
Length-normalized: a 10-token and 100-token completion are on the same scale.
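A minimal sketch of the sequence-level ratio, computed as the exponential of the mean per-token log-ratio for numerical stability (toy log-probs; the 0.8/1.2 clip range mirrors the earlier example, though practical GSPO clip ranges may differ):

```python
import math

# Toy per-token log-probs for one completion (illustrative values).
logp_new = [-1.2, -0.7, -2.1, -0.3]
logp_old = [-1.0, -0.7, -2.5, -0.4]

# s_i(theta): geometric mean of per-token ratios = exp(mean of log-ratios).
log_ratio = [n - o for n, o in zip(logp_new, logp_old)]
s = math.exp(sum(log_ratio) / len(log_ratio))   # one scalar per completion

# Clipping this single scalar bounds the sequence-level ratio directly.
s_clipped = min(max(s, 0.8), 1.2)
```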
Same policy update, same completion. What gets clipped?
GRPO: most tokens pass the clip individually, yet the product can explode.
GSPO: the geometric mean is one number, directly bounded by the clip.
GSPO clips $\mathbf{100\times}$ more often than GRPO — yet trains better.
GRPO's low clip rate is not "gentle optimization" — it is lack of control.
Most tokens pass the clip check individually, but their product drifts unconstrained.
GSPO's high clip rate means the constraint is actively shaping the update — as intended.
How gradient variance scales with response length:
GRPO: noise $\propto$ length. Long reasoning traces = noisy gradients.
GSPO: noise $\approx$ constant. The length-normalized ratio absorbs the scaling.
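A toy simulation (my assumption, not from the source: i.i.d. Gaussian per-token log-ratios) illustrates the scaling: the variance of the summed log-ratio grows with length, while the variance of the length-normalized mean shrinks:

```python
import random

# Toy model: i.i.d. Gaussian per-token log-ratios with std sigma.
# GRPO's sequence log-ratio is their SUM (variance grows ~ length);
# GSPO's length-normalized log-ratio is their MEAN (variance ~ 1/length).
random.seed(0)

def log_ratio_variances(length, trials=2000, sigma=0.05):
    sums, means = [], []
    for _ in range(trials):
        lr = [random.gauss(0.0, sigma) for _ in range(length)]
        sums.append(sum(lr))
        means.append(sum(lr) / length)
    # second moment about zero; equals variance here since the mean is ~0
    var = lambda xs: sum(x * x for x in xs) / len(xs)
    return var(sums), var(means)

v_sum_10, v_mean_10 = log_ratio_variances(10)
v_sum_100, v_mean_100 = log_ratio_variances(100)
```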
Mixture-of-Experts models route each token to different experts.
GRPO needs per-token likelihoods — sensitive to routing decisions.
Requires "Routing Replay" (cache + replay routing patterns) to stabilize training.
GSPO only needs the sequence-level log-probability.
Insensitive to per-token routing variations. No Routing Replay needed.
GSPO was used to train the Qwen3 model family.
| | GRPO | GSPO |
| --- | --- | --- |
| Importance ratio | Token-level $\rho_i^l$ | Sequence-level $s_i(\theta)$ |
| Clipping | $\lvert o_i \rvert$ ratios, each clipped | 1 ratio (geometric mean) |
| Sequence-level constraint | Product unconstrained | Directly bounded by clip |
| Gradient noise | $\propto$ length | $\approx$ constant |
| Origin | DeepSeek, 2024 | Qwen/Alibaba, 2025 |
Same group advantage $\hat A_i$. Different optimization target.
GSPO fixes the how of the update, not the what (credit assignment remains task-level).