GRPO: per-token importance ratios

Given prompt $q$, sample $G$ completions $\{o_1, \ldots, o_G\}$ from $\pi_{\theta_\text{old}}$.

For the $l$-th token of the $i$-th completion, compute the importance ratio:

$$\rho_i^l = \frac{\pi_\theta(o_i^l \mid q,\, o_i^{\lt l})}{\pi_{\theta_\text{old}}(o_i^l \mid q,\, o_i^{\lt l})}$$

$\pi_\theta$ = current policy  |  $\pi_{\theta_\text{old}}$ = rollout policy  |  $o_i^{\lt l}$ = preceding tokens
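In practice both policies expose per-token log-probabilities, so the ratio is computed as an exponentiated log-difference. A minimal sketch (function name and the toy log-prob values are illustrative, not from the source):

```python
import math

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios rho_i^l = exp(logp_new - logp_old)."""
    return [math.exp(a - b) for a, b in zip(logp_new, logp_old)]

# Toy log-probs for a 4-token completion (made-up values for illustration)
logp_old = [-1.2, -0.5, -2.0, -0.8]
logp_new = [-1.1, -0.5, -1.7, -0.9]
print(token_ratios(logp_new, logp_old))
```

Working in log space first avoids under/overflow from exponentiating large log-probabilities directly.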

The group advantage $\hat A_i$ is computed from reward statistics across the group:

$$\hat A_i = \frac{r(q, o_i) - \mu}{\sigma}, \qquad \mu = \frac{1}{G}\sum_{j=1}^{G} r(q, o_j), \quad \sigma = \operatorname{std}\big(\{r(q, o_j)\}_{j=1}^{G}\big)$$
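The group normalization is just a z-score over the $G$ rewards. A sketch, assuming scalar rewards per completion (the population standard deviation is used here; implementations often also add a small $\epsilon$ to $\sigma$ to guard against identical rewards):

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantages: (r - mean) / std over G completions."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std; add eps in practice
    return [(r - mu) / sigma for r in rewards]

# One advantage per completion; they sum to zero within the group
print(group_advantages([1.0, 0.0, 0.5, 0.5]))
```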

Each token is then clipped independently (PPO-style):

$$\min\!\big(\rho_i^l\,\hat A_i,\;\text{clip}(\rho_i^l,\,1\!-\!\varepsilon,\,1\!+\!\varepsilon)\,\hat A_i\big)$$

$\varepsilon = 0.2$ typically  |  same $\hat A_i$ for all tokens in $o_i$

$|o_i|$ separate ratios, $|o_i|$ separate clip decisions — but only one advantage per completion.
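The per-token surrogate term can be sketched directly from the formula above (function name is illustrative):

```python
def grpo_token_term(rho, adv, eps=0.2):
    """PPO-style clipped surrogate for one token: min(rho*A, clip(rho)*A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, rho))
    return min(rho * adv, clipped * adv)

print(grpo_token_term(1.5, 1.0))   # ratio above 1+eps, positive advantage: capped
print(grpo_token_term(0.5, -1.0))  # ratio below 1-eps, negative advantage: pessimistic
```

Note the `min` is pessimistic in both directions: with a negative advantage it keeps the more negative of the two terms.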

What token-level ratios look like

A 10-token completion. Each token has its own $\rho_i^l$:

Most tokens: $\rho \approx 1$ (common tokens, unchanged).

A few tokens shift substantially — these carry the real policy change.

But clipping treats each one in isolation, without considering the sequence as a whole.

The problem: noise accumulates with length

Token-level $\rho_i^l$ does not correspond to valid importance sampling: the reward is assigned to the whole sequence, so the correct importance weight is the sequence-level likelihood ratio, not any single token's ratio.

The sequence-level ratio is a product of token ratios:

$$\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)} = \prod_{l=1}^{|o_i|} \rho_i^l$$

Clipping each $\rho_i^l \in [0.8, 1.2]$ individually allows the product to range over:

$|o_i| = 10$: $\;[0.8^{10},\; 1.2^{10}] = [0.107,\; 6.19]$

$|o_i| = 100$: $\;[0.8^{100},\; 1.2^{100}] \approx [0,\; 8 \times 10^7]$

Token-level clipping provides no effective constraint on the sequence-level ratio.
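The numbers above can be checked in two lines:

```python
# Worst-case sequence-level ratio when every token ratio sits at a clip boundary
for L in (10, 100):
    lo, hi = 0.8 ** L, 1.2 ** L
    print(f"|o_i| = {L}: product range [{lo:.3g}, {hi:.3g}]")
```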

GSPO: sequence-level ratio

Replace $|o_i|$ token ratios with a single sequence-level quantity:

$$s_i(\theta) = \left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)}\right)^{1/|o_i|} = \left(\prod_{l=1}^{|o_i|} \rho_i^l\right)^{1/|o_i|}$$

This is the geometric mean of per-token ratios.

$$\text{Clip } s_i(\theta) \in [1-\varepsilon,\; 1+\varepsilon]$$

One scalar per completion. The sequence-level ratio is directly bounded.

Length-normalized: a 10-token and 100-token completion are on the same scale.
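Like the token ratios, $s_i(\theta)$ is best computed in log space: the $1/|o_i|$ root becomes a mean of per-token log-ratios. A sketch (function name is illustrative):

```python
import math

def gspo_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp(mean of per-token log-ratios),
    i.e. the geometric mean of the token-level importance ratios."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)

print(gspo_ratio([-1.1, -0.5], [-1.2, -0.5]))  # one scalar for the whole sequence
```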

Side by side: clipping behavior

Same policy update, same completion. What gets clipped?

GRPO: most tokens pass the clip check individually  |  the product can still explode

GSPO: the geometric mean is one scalar  |  directly bounded by the clip

Effective clipping rate

GSPO clips $\mathbf{100\times}$ more often than GRPO — yet trains better.

GRPO's low clip rate is not "gentle optimization" — it is lack of control.

Most tokens pass the clip check individually, but their product drifts unconstrained.

GSPO's high clip rate means the constraint is actively shaping the update — as intended.

Length sensitivity

How gradient variance scales with response length:

GRPO: noise $\propto$ length  |  long reasoning traces yield noisy gradients

GSPO: noise $\approx$ constant  |  the length-normalized ratio absorbs the scaling
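This scaling is easy to see with a toy Monte Carlo. Under the (simplifying, not from the source) assumption that per-token log-ratios are i.i.d. Gaussian noise, the log of the product ratio has standard deviation growing as $\sqrt{L}$, while the log of the geometric mean shrinks as $1/\sqrt{L}$:

```python
import math
import random
import statistics

random.seed(0)

def spread(length, trials=2000, sigma=0.05):
    """Std of log(product ratio) vs log(geometric-mean ratio) over random draws."""
    prod_logs, geo_logs = [], []
    for _ in range(trials):
        logs = [random.gauss(0.0, sigma) for _ in range(length)]
        prod_logs.append(sum(logs))          # log of the sequence-level product
        geo_logs.append(sum(logs) / length)  # log of the geometric mean
    return statistics.pstdev(prod_logs), statistics.pstdev(geo_logs)

for L in (10, 100):
    p, g = spread(L)
    print(f"L={L}: std(log product)={p:.3f}, std(log geo-mean)={g:.4f}")
```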

Bonus: MoE stability

Mixture-of-Experts models route each token to different experts.

GRPO needs per-token likelihoods — sensitive to routing decisions.

Requires "Routing Replay" (cache + replay routing patterns) to stabilize training.

GSPO only needs the sequence-level log-probability.

Insensitive to per-token routing variations. No Routing Replay needed.

GSPO was used to train the Qwen3 model family.

GRPO vs GSPO: Summary

GRPO: token-level $\rho_i^l$  |  $|o_i|$ ratios, each clipped  |  product unconstrained  |  noise $\propto$ length  |  DeepSeek, 2024

GSPO: sequence-level $s_i(\theta)$  |  one ratio (the geometric mean)  |  directly bounded by the clip  |  noise $\approx$ constant  |  Qwen/Alibaba, 2025

Same group advantage $\hat A_i$. Different optimization target.

GSPO fixes the how of the update, not the what (credit assignment remains task-level).
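Putting the pieces together, a GSPO-style surrogate for one group can be sketched as follows. This is a simplified single-update sketch under the assumptions used throughout (scalar rewards, per-token log-prob lists); names and the zero-std guard are illustrative, not the authors' implementation:

```python
import math

def gspo_objective(group_logp_new, group_logp_old, rewards, eps=0.2):
    """Average clipped surrogate over a group of G completions.

    group_logp_*: list of G lists of per-token log-probs; rewards: G scalars.
    """
    G = len(rewards)
    mu = sum(rewards) / G
    # Guard against identical rewards (sigma = 0) with a fallback of 1.0
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / G) or 1.0
    total = 0.0
    for lp_new, lp_old, r in zip(group_logp_new, group_logp_old, rewards):
        adv = (r - mu) / sigma                       # same advantage as GRPO
        s = math.exp(sum(a - b for a, b in zip(lp_new, lp_old)) / len(lp_new))
        s_clip = max(1.0 - eps, min(1.0 + eps, s))   # one clip per completion
        total += min(s * adv, s_clip * adv)
    return total / G
```

With identical old and new policies every $s_i = 1$ and the objective reduces to the mean advantage, which is zero by construction.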