Policy Optimization without a Critic: The GRPO Family

Motivation: Why Not Learn a Critic?

The standard recipe for RLHF — PPO with a learned value function — requires training and maintaining a critic alongside the policy. In the LM setting, this means a second copy of the base model with a scalar value head, doubling memory and introducing its own training challenges (representation drift, readout sensitivity, etc.). For single-turn tasks like math and code generation, a natural question arises: can we skip the critic entirely?

The answer is yes — if we are willing to trade a learned value estimate for a statistical one. Instead of training \(V_\phi(s)\) to predict expected return, we can sample multiple completions from the same prompt and use their empirical reward statistics as a baseline. This is the core idea behind GRPO, and the starting point for DAPO and GSPO.

GRPO: Group Relative Policy Optimization

GRPO (DeepSeek, 2024) replaces the learned critic with group-level sampling statistics. Given a prompt \(q\), sample a group of \(G\) completions \(\{o_1, \ldots, o_G\}\) from the current policy \(\pi_\theta\), score each with a reward function \(r(q, o_i)\), and compute normalized advantages:

\[\hat{A}_i = \frac{r(q, o_i) - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)}\]

The “value function” is the group mean — a Monte Carlo estimate that converges to \(\mathbb{E}_\pi[r \vert q]\) as \(G \to \infty\). No learned parameters, no value head, no critic training loop.
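
As a concrete reference, here is a minimal sketch of the group-normalized advantage (PyTorch; the helper name, the binary verifier reward, and the epsilon guard on the standard deviation are illustrative choices, not part of the formula):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within one group of G completions for the same prompt.

    rewards: shape (G,), one scalar per completion, e.g. 1.0/0.0 from a verifier.
    Returns shape (G,) advantages, shared by every token of each completion.
    """
    mean = rewards.mean()
    std = rewards.std()
    # eps guards the all-identical-rewards case (std == 0), which then yields zero advantages.
    return (rewards - mean) / (std + eps)

# Example: 3 of 4 completions solved the problem.
print(group_advantages(torch.tensor([1.0, 1.0, 0.0, 1.0])))  # ~[0.5, 0.5, -1.5, 0.5]
```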

The GRPO loss applies PPO-style clipping with per-token importance ratios:

\[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \left[\min\!\left(\rho_i^l \, \hat{A}_i,\; \text{clip}(\rho_i^l, 1\!-\!\varepsilon, 1\!+\!\varepsilon)\, \hat{A}_i\right) - \beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\right]\]

where:

  • \(\rho_i^l = \frac{\pi_\theta(o_i^l \vert q, o_i^{<l})}{\pi_{\theta_{\text{old}}}(o_i^l \vert q, o_i^{<l})}\) is the per-token importance ratio — the probability of the \(l\)-th token of the \(i\)-th completion under the current policy \(\pi_\theta\) divided by its probability under the rollout policy \(\pi_{\theta_{\text{old}}}\). A ratio \(\rho > 1\) means the current policy assigns higher probability to this token than the rollout policy did; \(\rho < 1\) means lower.
  • \(\hat{A}_i\) is the group-normalized advantage for completion \(o_i\), shared across all tokens \(l\) in that completion. This is the key simplification: GRPO assigns a single scalar credit to the entire response, not per-token.
  • \(\text{clip}(\rho, 1-\varepsilon, 1+\varepsilon)\) clamps the ratio to \([1-\varepsilon, 1+\varepsilon]\) (typically \([0.8, 1.2]\)), preventing any single token’s probability from changing too much in one update — the same mechanism as PPO.
  • \(\beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\) is a KL penalty against a reference policy \(\pi_{\text{ref}}\) (typically the SFT model), preventing the policy from drifting too far from the pretrained distribution.
  • The outer average \(\frac{1}{G} \sum_i \frac{1}{\vert o_i \vert} \sum_l\) first averages over tokens within each completion, then averages over completions in the group.
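
Putting the pieces together, a hedged sketch of the objective above (PyTorch; tensor shapes, default hyperparameter values, and the particular per-token KL estimator are assumptions for illustration, not the reference implementation):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask,
              eps_clip: float = 0.2, beta: float = 0.04):
    """Clipped GRPO loss for one group of G completions.

    logp_new / logp_old / logp_ref: (G, T) token log-probs under pi_theta,
        pi_theta_old (rollout policy), and pi_ref (e.g. the SFT model).
    advantages: (G,) group-normalized advantages, broadcast to every token.
    mask: (G, T) float mask, 1.0 on real completion tokens, 0.0 on padding.
    """
    ratio = torch.exp(logp_new - logp_old)                     # per-token rho_i^l
    adv = advantages.unsqueeze(1)                              # same scalar for all tokens of o_i
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv)
    # One common per-token KL estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1 >= 0.
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1
    per_token = surrogate - beta * kl
    # GRPO normalization: average over tokens within each completion, then over the group.
    per_sample = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return -per_sample.mean()
```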

The \(G \to \infty\) Limit: REINFORCE with a Baseline

What happens when the group size grows? By the law of large numbers, the group mean converges to the true expected reward under the current policy:

\[\frac{1}{G}\sum_{j=1}^{G} r(q, o_j) \xrightarrow{G \to \infty} \mathbb{E}_{o \sim \pi_\theta(\cdot \vert q)}[r(q, o)]\]

Note that this is not a state value function \(V(s)\) in the RL sense — there is no multi-step MDP here. GRPO operates at the task level: one prompt in, one complete response out, one reward. The expectation above is simply the average reward the current policy would get on this prompt if it generated infinitely many responses.

In this limit the GRPO advantage (ignoring the std normalization) becomes:

\[\hat{A}_i \;\to\; r(q, o_i) - \mathbb{E}_{o \sim \pi_\theta}[r(q, o)]\]

This is a constant baseline subtraction — the same idea as REINFORCE with a baseline, but the baseline is prompt-specific and estimated purely by sampling. A learned critic in standard RLHF serves the same role (predicting the expected reward for a given input), but it does so with a neural network trained across the entire dataset. GRPO replaces that network with a sample mean computed from the \(G\) completions at hand.

The difference is practical. A learned critic accumulates information across the entire training run: every prompt it has ever seen refines its predictions. GRPO’s group mean estimates the expected reward from scratch each time, using only the \(G\) completions sampled for that prompt in that batch. With finite \(G\) (DeepSeek uses \(G = 64\), DAPO uses \(G = 16\)), this estimate is noisy — and the noise is the price of not maintaining a critic. Whether the simplicity is worth the noise depends on the task: for math and code with reliable verifiers, the answer has been a clear yes.
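
To make the noise concrete, a small simulation (pure Python; the pass rate and group sizes are illustrative) of the group-mean baseline's spread for a binary verifier reward. Its standard deviation shrinks roughly as \(1/\sqrt{G}\):

```python
import random
import statistics

def baseline_noise(p_correct: float, G: int, trials: int = 10_000) -> float:
    """Std of the group-mean baseline when rewards are Bernoulli(p_correct)."""
    means = [sum(random.random() < p_correct for _ in range(G)) / G
             for _ in range(trials)]
    return statistics.stdev(means)

# True expected reward is 0.5; the group mean only estimates it.
for G in (4, 16, 64):
    print(G, round(baseline_noise(0.5, G), 3))  # roughly 0.25, 0.125, 0.0625, i.e. sqrt(p(1-p)/G)
```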

The Fundamental Limitation: Task-Level Credit

The advantage \(\hat{A}_i\) is task-level: every token in completion \(o_i\) receives the same advantage. GRPO cannot distinguish which part of a response was responsible for success or failure — it only knows that this completion, as a whole, was better or worse than its peers. For single-turn tasks (solve a math problem, write a function) this is acceptable. For multi-step agent tasks, it is not — see the ArCHer post for a detailed comparison.

When Does GRPO Learn Nothing?

GRPO’s learning signal depends entirely on outcome diversity within the group:

  • All completions correct (G of G pass): all rewards identical → \(\hat{A}_i = 0\) for every \(i\) → zero gradient
  • All completions wrong (0 of G pass): same situation → zero gradient
  • Mixed outcomes: some succeed, some fail → nonzero advantages → useful gradient

As training progresses and the model improves, more prompts become “too easy” (pass rate → 100%) and stop contributing signal. This is the central failure mode that DAPO addresses.
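
The degenerate cases are easy to check numerically; a tiny sketch (same normalization as before, with an epsilon guard so uniform rewards give zeros rather than a division by zero):

```python
import torch

def group_advantages(rewards, eps=1e-6):
    # Same helper as in the GRPO section above.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_advantages(torch.tensor([1.0, 1.0, 1.0, 1.0])))  # all correct -> all zeros, no gradient
print(group_advantages(torch.tensor([0.0, 0.0, 0.0, 0.0])))  # all wrong   -> all zeros, no gradient
print(group_advantages(torch.tensor([1.0, 0.0, 1.0, 1.0])))  # mixed       -> nonzero advantages
```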

DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization

DAPO (ByteDance Seed, 2025) identifies entropy collapse as the central failure mode of naive GRPO and introduces four techniques to combat it. A fifth, less obvious change: DAPO completely removes the KL penalty (\(\beta = 0\)). The reasoning is that for long chain-of-thought models, the policy is expected to diverge significantly from the SFT initialization — the KL constraint actively fights against the distribution shift that training is trying to achieve. Instead of relying on KL to prevent collapse, DAPO uses the mechanisms below.

1. Dynamic Sampling

The single most impactful contribution (+8 points on AIME 2024). During rollout, each prompt gets \(G\) sampled completions. If the number of correct completions is 0 or \(G\), the prompt is discarded and a new prompt is sampled to replace it. The filtering criterion is:

\[0 < \vert\{o_i : \text{correct}(o_i)\}\vert < G\]

This ensures every prompt in the training batch has mixed outcomes — guaranteeing nonzero gradient signal. As the model improves and more prompts become trivial, dynamic sampling automatically shifts the training distribution toward the model’s learning frontier: problems that are neither too easy nor too hard.
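
A sketch of the rollout-side filter, assuming hypothetical `policy_sample` and `verify` callables (how prompts are drawn and how the batch is refilled are engineering choices, not specified here):

```python
import random

def dynamic_sample(prompt_pool, policy_sample, verify, batch_size: int, G: int):
    """Keep only prompts whose G rollouts have mixed outcomes (0 < #correct < G).

    prompt_pool:   prompts to draw from
    policy_sample: (prompt, G) -> list of G sampled completions (e.g. an inference-engine call)
    verify:        (prompt, completion) -> bool
    """
    batch = []
    while len(batch) < batch_size:
        q = random.choice(prompt_pool)
        completions = policy_sample(q, G)
        n_correct = sum(verify(q, o) for o in completions)
        if 0 < n_correct < G:          # DAPO's filtering criterion
            batch.append((q, completions))
        # otherwise the prompt is discarded and a new one is sampled
    return batch
```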

2. Clip-Higher (Asymmetric Clipping)

Standard PPO clips the importance ratio symmetrically: \([1-\varepsilon, 1+\varepsilon] = [0.8, 1.2]\). DAPO decouples the bounds:

\[\varepsilon_{\text{low}} = 0.2, \quad \varepsilon_{\text{high}} = 0.28 \quad \Rightarrow \quad [0.8, 1.28]\]

The asymmetry lets low-probability tokens increase more freely, counteracting entropy collapse. The policy can explore more while maintaining a stable lower bound.
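
In code this is a one-line change to the clipping call; a sketch with the bounds quoted above:

```python
import torch

def clip_higher_surrogate(ratio, adv, eps_low: float = 0.2, eps_high: float = 0.28):
    """PPO-style surrogate with decoupled clipping bounds (DAPO's Clip-Higher).

    ratio: per-token importance ratios rho; adv: broadcastable advantages.
    The looser upper bound (1 + eps_high) lets low-probability tokens grow more per update.
    """
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.minimum(ratio * adv, clipped * adv)
```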

3. Token-Level Loss Normalization

Standard GRPO averages token losses within each sample, then averages across samples — giving short and long responses equal weight. DAPO averages directly across all tokens in the batch:

\[\mathcal{L} = \frac{1}{\sum_i \vert o_i \vert} \sum_{i=1}^{G} \sum_{l=1}^{\vert o_i \vert} \ell(o_i^l)\]

This gives longer sequences proportionally more influence, properly penalizing verbose low-quality outputs.
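
A sketch of the two normalizations side by side (PyTorch; `per_token_loss` is the clipped per-token term and `mask` marks valid completion tokens):

```python
import torch

def sample_level_mean(per_token_loss, mask):
    """GRPO default: average tokens within each sample, then average samples (length-agnostic)."""
    per_sample = (per_token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_sample.mean()

def token_level_mean(per_token_loss, mask):
    """DAPO: average over all tokens in the batch, so longer responses carry proportional weight."""
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```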

4. Overlong Reward Shaping

Truncated outputs (hitting max length) receive a soft graduated penalty rather than being discarded or harshly penalized. In the last \(L_{\text{cache}}\) tokens before the max length, the reward linearly decreases toward \(-1\). This discourages unnecessary verbosity without punishing responses that are on the right track but ran out of budget.
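
A minimal sketch of the shaping term, assumed to be added to the task reward; the length budget and cache size below are illustrative numbers:

```python
def overlong_penalty(length: int, max_len: int, cache: int) -> float:
    """Soft length penalty: 0 until max_len - cache, then linear down to -1 at max_len.

    `cache` plays the role of L_cache in the text.
    """
    overflow = length - (max_len - cache)
    if overflow <= 0:
        return 0.0
    return -min(overflow / cache, 1.0)

print(overlong_penalty(12000, 16384, 4096))  #  0.0  within budget
print(overlong_penalty(14336, 16384, 4096))  # -0.5  halfway into the soft window
print(overlong_penalty(16384, 16384, 4096))  # -1.0  hit the maximum length
```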

Ablation Results (AIME 2024)

Configuration                            Score
Naive GRPO                                  30
+ Overlong Filtering                        36
+ Clip-Higher                               38
+ Token-Level Loss                          42
+ Dynamic Sampling (full DAPO)              50
DeepSeek-R1-Zero (2x training steps)        47

Dynamic sampling alone contributes +8 points — the largest single improvement — validating that the core bottleneck in GRPO training is wasted compute on prompts with uniform outcomes.

(Interactive figure, GRPO vs GSPO: token-level clipping allows the sequence-level ratio to drift unconstrained, while GSPO's geometric mean provides a direct bound. The slides illustrate how noise accumulates with length and why GSPO clips 100x more often yet trains better.)

GSPO: Group Sequence Policy Optimization

GSPO (Alibaba/Qwen, 2025) takes a different angle: rather than fixing GRPO’s sampling strategy, it fixes GRPO’s optimization target. The core claim is that GRPO’s per-token importance ratios are a theoretically unsound application of importance sampling, introducing noise that accumulates with response length.

The Problem with Token-Level Ratios

The policy gradient theorem operates at the sequence level — the quantity we want to optimize is:

\[\nabla_\theta \, \mathbb{E}_{o \sim \pi_{\theta_{\text{old}}}} \!\left[\frac{\pi_\theta(o \vert q)}{\pi_{\theta_{\text{old}}}(o \vert q)} R(o)\right]\]

The correct importance sampling ratio for a full sequence factorizes as a product of per-token ratios:

\[\frac{\pi_\theta(o_i \vert q)}{\pi_{\theta_{\text{old}}}(o_i \vert q)} = \prod_{l=1}^{\vert o_i \vert} \rho_i^l\]

GRPO, however, clips each \(\rho_i^l\) independently to \([1-\varepsilon,\, 1+\varepsilon]\). This does not bound the sequence-level ratio — even after clipping, the effective product can range over:

\[\prod_{l=1}^{\vert o_i \vert} \text{clip}(\rho_i^l) \;\in\; \left[(1-\varepsilon)^{\vert o_i \vert},\; (1+\varepsilon)^{\vert o_i \vert}\right]\]

For a typical response of \(\vert o_i \vert = 1024\) tokens and \(\varepsilon = 0.2\), the upper bound is \(1.2^{1024} \approx 10^{81}\). The “trust region” that token-level clipping defines is essentially unbounded at the sequence level, and the resulting noise in the gradient estimate grows with sequence length.
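
A two-line sanity check of these bounds:

```python
import math

eps, length = 0.2, 1024
print(f"upper ~ 1e{length * math.log10(1 + eps):.0f}")  # upper ~ 1e81
print(f"lower ~ 1e{length * math.log10(1 - eps):.0f}")  # lower ~ 1e-99
```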

Why is the sequence-level ratio the theoretically correct one? The core argument starts from the goal of IS correction. We want to convert the expectation from the old policy \(\pi_{\theta_{\text{old}}}\) to the current policy \(\pi_\theta\):

\[\mathbb{E}_{\pi_\theta}\!\left[R(o)\right] = \mathbb{E}_{\pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(o \vert q)}{\pi_{\theta_{\text{old}}}(o \vert q)} R(o)\right]\]

The expectation here is over complete sequences \(o\), and the IS ratio \(\pi_\theta(o \vert q) / \pi_{\theta_{\text{old}}}(o \vert q)\) is a sequence-level quantity. Decomposing it into \(\prod_l \rho_l\) merely exploits the autoregressive factorization — it does not mean each factor \(\rho_l\) is an independent IS correction. When GRPO clips each \(\rho_l\) independently, it constrains the individual pieces of a factorized quantity, but the product of these constraints is not equivalent to constraining the original sequence-level ratio. In other words, token-level clipping constrains artificially separated fragments, not the whole object that IS correction actually needs to bound.

More concretely, GRPO’s advantage \(\hat{A}_i\) is a trajectory-level constant — it depends only on \(r(q, o_i)\) and is the same for every token within the sequence. The gradient contribution at position \(l\) takes the form \(\hat{A}_i \cdot \rho_i^l \cdot \nabla_\theta \ln \pi_\theta(o_i^l \vert q, o_i^{<l})\), where \(\hat{A}_i\) does not depend on any single token’s distribution. Since the advantage itself is sequence-level, the natural granularity of IS correction should also be sequence-level: what we need to correct is “how much more likely is this entire trajectory under \(\pi_\theta\) relative to \(\pi_{\theta_{\text{old}}}\),” not “how much did each token shift individually.” GSPO’s geometric mean \(s_i = \big(\prod_l \rho_i^l\big)^{1/\vert o_i \vert}\) is a monotonic transformation of the sequence-level log-ratio \(\frac{1}{\vert o_i \vert}\ln \frac{\pi_\theta(o_i \vert q)}{\pi_{\theta_{\text{old}}}(o_i \vert q)}\), so clipping it directly constrains the sequence-level distributional shift.

This argument also aligns with the trust-region intuition: PPO’s trust region is fundamentally a constraint on the KL divergence between policies — a sequence-level quantity. Token-level clipping is merely an approximation, one that can become arbitrarily loose on long sequences; sequence-level clipping imposes the constraint at the correct level of abstraction.

Sequence-Level Importance Ratio

GSPO replaces the per-token ratio with a length-normalized sequence-level ratio:

\[s_i(\theta) = \left(\frac{\pi_\theta(o_i \vert q)}{\pi_{\theta_{\text{old}}}(o_i \vert q)}\right)^{1/\vert o_i \vert}\]

This is the geometric mean of per-token ratios across the full response — a single scalar per completion, not per token. The clipped objective becomes:

\[\mathcal{J}_{\text{GSPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \min\!\left(s_i(\theta)\,\hat{A}_i,\; \text{clip}(s_i(\theta), 1\!-\!\varepsilon, 1\!+\!\varepsilon)\,\hat{A}_i\right)\]
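
A sketch of the sequence-level objective (PyTorch; the clipping range here is an illustrative placeholder, since the appropriate \(\varepsilon\) for a length-normalized sequence ratio is a separate tuning choice from the token-level one):

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, eps_clip: float = 0.01):
    """Sequence-level clipped objective in the spirit of GSPO.

    logp_new / logp_old: (G, T) token log-probs under pi_theta and pi_theta_old.
    advantages: (G,) group-normalized advantages.  mask: (G, T) valid-token mask.
    """
    lengths = mask.sum(dim=1).clamp(min=1)
    # Length-normalized sequence log-ratio; its exp is the geometric mean of per-token ratios.
    seq_log_ratio = ((logp_new - logp_old) * mask).sum(dim=1) / lengths
    s = torch.exp(seq_log_ratio)                                     # s_i(theta), one scalar per completion
    clipped = torch.clamp(s, 1.0 - eps_clip, 1.0 + eps_clip)
    objective = torch.minimum(s * advantages, clipped * advantages)  # clipping at the sequence level
    return -objective.mean()                                         # negate the objective to minimize
```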

Why This Matters

Moving to sequence-level ratios has several practical consequences:

  1. Higher effective clipping rate, better efficiency. GSPO’s clipping rate is two orders of magnitude higher than GRPO’s, yet it achieves better training efficiency, which suggests that much of what survives GRPO’s token-level clipping is noise rather than useful signal.

  2. MoE stability. GRPO requires “Routing Replay” — caching and replaying expert routing patterns — to stabilize Mixture-of-Experts training. GSPO eliminates this need because it depends only on sequence-level likelihood, not individual token routing decisions.

  3. Infrastructure simplification. No need for token-level synchronization between inference and training engines; sequence-level log-probabilities suffice.

GSPO was used to train the Qwen3 model family.
