Policy Optimization without a Critic: The GRPO Family

Motivation: Why Not Learn a Critic?

The standard recipe for RLHF — PPO with a learned value function — requires training and maintaining a critic alongside the policy. In the LM setting, this means a second copy of the base model with a scalar value head, doubling memory and introducing its own training challenges (representation drift, readout sensitivity, etc.). For single-turn tasks like math and code generation, a natural question arises: can we skip the critic entirely?

The answer is yes — if we are willing to trade a learned value estimate for a statistical one. Instead of training \(V_\phi(s)\) to predict expected return, we can sample multiple completions from the same prompt and use their empirical reward statistics as a baseline. This is the core idea behind GRPO, and the starting point for DAPO and GSPO.

GRPO: Group Relative Policy Optimization

GRPO (DeepSeek, 2024) replaces the learned critic with group-level sampling statistics. Given a prompt \(q\), sample a group of \(G\) completions \(\{o_1, \ldots, o_G\}\) from the current policy \(\pi_\theta\), score each with a reward function \(r(q, o_i)\), and compute normalized advantages:

\[\hat{A}_i = \frac{r(q, o_i) - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)}\]

The “value function” is the group mean — a Monte Carlo estimate that converges to \(\mathbb{E}_\pi[r \vert q]\) as \(G \to \infty\). No learned parameters, no value head, no critic training loop.

The GRPO loss applies PPO-style clipping with per-token importance ratios:

\[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \left[\min\!\left(\rho_i^l \, \hat{A}_i,\; \text{clip}(\rho_i^l, 1\!-\!\varepsilon, 1\!+\!\varepsilon)\, \hat{A}_i\right) - \beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\right]\]

where:

  • \(\rho_i^l = \frac{\pi_\theta(o_i^l \vert q, o_i^{<l})}{\pi_{\theta_{\text{old}}}(o_i^l \vert q, o_i^{<l})}\) is the per-token importance ratio — the probability of the \(l\)-th token of the \(i\)-th completion under the current policy \(\pi_\theta\) divided by its probability under the rollout policy \(\pi_{\theta_{\text{old}}}\). A ratio \(\rho > 1\) means the current policy assigns higher probability to this token than the rollout policy did; \(\rho < 1\) means lower.
  • \(\hat{A}_i\) is the group-normalized advantage for completion \(o_i\), shared across all tokens \(l\) in that completion. This is the key simplification: GRPO assigns a single scalar credit to the entire response, not per-token.
  • \(\text{clip}(\rho, 1-\varepsilon, 1+\varepsilon)\) clamps the ratio to \([1-\varepsilon, 1+\varepsilon]\) (typically \([0.8, 1.2]\)), preventing any single token’s probability from changing too much in one update — the same mechanism as PPO.
  • \(\beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\) is a KL penalty against a reference policy \(\pi_{\text{ref}}\) (typically the SFT model), preventing the policy from drifting too far from the pretrained distribution.
  • The outer average \(\frac{1}{G} \sum_i \frac{1}{\vert o_i \vert} \sum_l\) first averages over tokens within each completion, then averages over completions in the group.
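
Putting these pieces together, here is a minimal PyTorch-style sketch of the GRPO surrogate for a single prompt's group. It is an illustrative reading of the formula above, not a reference implementation: the tensor names, the `1e-8` stabilizer, and the omission of the KL term are my own simplifications.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, eps_clip=0.2):
    """Clipped GRPO surrogate for one prompt's group (KL penalty omitted).

    logp_new, logp_old: [G, T] per-token log-probs under pi_theta / pi_theta_old
    rewards:            [G]    scalar reward per completion
    mask:               [G, T] float, 1.0 for real tokens, 0.0 for padding
    """
    # Group-normalized advantage: one scalar per completion, shared by all its tokens.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)        # [G]
    adv = adv.unsqueeze(1)                                           # broadcast over tokens

    # Per-token importance ratio rho = pi_theta / pi_theta_old.
    ratio = torch.exp(logp_new - logp_old)                           # [G, T]

    # PPO-style clipping, applied token by token.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
    per_token = torch.min(unclipped, clipped) * mask                 # [G, T]

    # GRPO reduction: mean over tokens within each completion, then over the group.
    per_seq = per_token.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)  # [G]
    return -per_seq.mean()
```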

The \(G \to \infty\) Limit: REINFORCE with a Baseline

What happens when the group size grows? By the law of large numbers, the group mean converges to the true expected reward under the current policy:

\[\frac{1}{G}\sum_{j=1}^{G} r(q, o_j) \xrightarrow{G \to \infty} \mathbb{E}_{o \sim \pi_\theta(\cdot \vert q)}[r(q, o)]\]

Note that this is not a state value function \(V(s)\) in the RL sense — there is no multi-step MDP here. GRPO operates at the task level: one prompt in, one complete response out, one reward. The expectation above is simply the average reward the current policy would get on this prompt if it generated infinitely many responses.

In this limit the GRPO advantage (ignoring the std normalization) becomes:

\[\hat{A}_i \;\to\; r(q, o_i) - \mathbb{E}_{o \sim \pi_\theta}[r(q, o)]\]

This is a constant baseline subtraction — the same idea as REINFORCE with a baseline, but the baseline is prompt-specific and estimated purely by sampling. A learned critic in standard RLHF serves the same role (predicting the expected reward for a given input), but it does so with a neural network trained across the entire dataset. GRPO replaces that network with a sample mean computed from the \(G\) completions at hand.

The difference is practical. A learned critic accumulates information across the entire training run: every prompt it has ever seen refines its predictions. GRPO’s group mean estimates the expected reward from scratch each time, using only the \(G\) completions sampled for that prompt in that batch. With finite \(G\) (DeepSeek uses \(G = 64\), DAPO uses \(G = 16\)), this estimate is noisy — and the noise is the price of not maintaining a critic. Whether the simplicity is worth the noise depends on the task: for math and code with reliable verifiers, the answer has been a clear yes.
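
To put a rough number on that noise, here is a back-of-the-envelope check (illustrative figures, not from either paper): for a prompt with 0/1 rewards and a hypothetical true pass rate of 0.3, the standard error of the group-mean baseline shrinks only as \(1/\sqrt{G}\).

```python
import math

p = 0.3                      # hypothetical true pass rate for this prompt
for G in (16, 64, 256):
    # Standard error of the group-mean baseline for Bernoulli(p) rewards.
    se = math.sqrt(p * (1 - p) / G)
    print(f"G={G:4d}  baseline std error = {se:.3f}")
# G=  16  baseline std error = 0.115
# G=  64  baseline std error = 0.057
# G= 256  baseline std error = 0.029
```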

The Fundamental Limitation: Task-Level Credit

The advantage \(\hat{A}_i\) is task-level: every token in completion \(o_i\) receives the same advantage. GRPO cannot distinguish which part of a response was responsible for success or failure — it only knows that this completion, as a whole, was better or worse than its peers. For single-turn tasks (solve a math problem, write a function) this is acceptable. For multi-step agent tasks, it is not — see the ArCHer post for a detailed comparison.

When Does GRPO Learn Nothing?

GRPO’s learning signal depends entirely on outcome diversity within the group:

  • All completions correct (pass@G = G): all rewards identical → \(\hat{A}_i = 0\) for all \(i\) → zero gradient
  • All completions wrong (pass@G = 0): same situation → zero gradient
  • Mixed outcomes: some succeed, some fail → nonzero advantages → useful gradient
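
The three cases are easy to check directly. A toy sketch (the `1e-8` in the denominator is a common stabilizer for the uniform-reward case, not something the paper specifies):

```python
import torch

def group_advantages(rewards, eps=1e-8):
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_advantages(torch.tensor([1., 1., 1., 1.])))  # all correct -> all zeros, no gradient
print(group_advantages(torch.tensor([0., 0., 0., 0.])))  # all wrong   -> all zeros, no gradient
print(group_advantages(torch.tensor([1., 0., 1., 0.])))  # mixed       -> nonzero +/- advantages
```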

As training progresses and the model improves, more prompts become “too easy” (pass rate → 100%) and stop contributing signal. This is the central failure mode that DAPO addresses.

DAPO: Dynamic Sampling Policy Optimization

DAPO (ByteDance Seed, 2025) identifies entropy collapse as the central failure mode of naive GRPO and introduces four techniques to combat it. A fifth, less obvious change: DAPO completely removes the KL penalty (\(\beta = 0\)). The reasoning is that for long chain-of-thought models, the policy is expected to diverge significantly from the SFT initialization — the KL constraint actively fights against the distribution shift that training is trying to achieve. Instead of relying on KL to prevent collapse, DAPO uses the mechanisms below.

1. Dynamic Sampling

Motivation. As training improves the model, more prompts hit pass rate 0 or \(G\) — uniform outcomes give zero advantage and contribute no gradient. Effective batch size shrinks while rollout compute does not, so signal-to-cost steadily worsens (paper Fig 3b: fraction of all-correct prompts grows over training).

Method. During rollout, keep each prompt only if its correct count is strictly between 0 and \(G\), oversampling until the batch is full:

\[0 < \vert\{o_i : \text{correct}(o_i)\}\vert < G\]

Every surviving prompt has mixed outcomes — nonzero gradient guaranteed. As the model improves, the surviving distribution automatically tracks the model’s learning frontier.
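
A sketch of the filtering loop, assuming a `sample_group(prompt, G)` rollout helper and a binary `correct(prompt, o)` verifier (both hypothetical names; the paper specifies the rule, not this code):

```python
def build_batch(prompts, sample_group, correct, G=16, batch_size=512):
    """Dynamic sampling: keep only prompts whose group has mixed outcomes."""
    kept = []
    prompt_iter = iter(prompts)
    while len(kept) < batch_size:
        prompt = next(prompt_iter)             # oversample: keep drawing fresh prompts
        group = sample_group(prompt, G)        # G completions from the current policy
        n_correct = sum(correct(prompt, o) for o in group)
        if 0 < n_correct < G:                  # drop all-correct and all-wrong groups
            kept.append((prompt, group))
    return kept
```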

Result. AIME 42 → 50 (+8). Largest single delta in the entire DAPO ablation chain. Despite needing extra rollouts to fill each batch, convergence is also faster (paper Fig 6).

Does the result match the motivation? Cleanly yes. The hypothesis is “uniform-outcome prompts are dead weight”; removing them produces the biggest single-component lift, and the speedup despite extra rollouts is exactly what the signal-to-cost story predicts.

2. Clip-Higher (Asymmetric Clipping)

Motivation. Naive GRPO suffers entropy collapse — the policy concentrates and stops exploring. The mechanism is asymmetry hidden in symmetric clipping. Under \([1-\varepsilon, 1+\varepsilon] = [0.8, 1.2]\), a low-probability token at \(\pi_\text{old}=0.01\) can rise to at most \(0.012\) before clipping; a high-probability token at \(0.9\) is allowed to reach \(1.08\). The same \(\varepsilon\) multiplicatively favors already-likely tokens, so “exploration” tokens are structurally hard to amplify.

Where does the asymmetry come from? PPO clips the importance ratio \(r = \pi_\theta / \pi_\text{old}\), not the probability — so the bound is multiplicative. The absolute room to grow per step is \(\varepsilon \cdot \pi_\text{old}\), proportional to where the token already is.

This ceiling is a hard wall, not a soft penalty — enforced by the gradient becoming exactly zero. When \(r > 1+\varepsilon\) on a token with \(\hat{A}>0\), PPO’s \(\min(r\hat{A},\,\text{clip}(r)\hat{A})\) picks the clipped term \((1+\varepsilon)\hat{A}\), which is constant in \(\theta\). The token’s contribution to the loss is frozen; its gradient is 0; nothing pushes its probability further this step. (The \(\min\) design only clips when clipping tightens the objective; in opposite-sign cases like \(\hat{A}<0\) with \(r>1+\varepsilon\), the unclipped term wins and gradient still flows — PPO won’t shield you from pulling down a bad action you over-shot on.) So “the clip bites” specifically means: positive-advantage tokens with \(r\) pegged at \(1+\varepsilon\), contributing zero gradient. Three consequences fall out:

  1. Absolute budget scales with \(\pi_\text{old}\). A 0.01 token can gain at most 0.002 mass per step; a 0.5 token can gain 0.1 — a 50× gap from the same \(\varepsilon\).
  2. The upper clip only restricts low-probability tokens — high-probability tokens are unaffected. Two separate ceilings act on \(\pi_\text{new}\) at every step: the PPO clip \(\pi_\text{new} \le (1+\varepsilon)\,\pi_\text{old}\) and the natural probability cap \(\pi_\text{new} \le 1\). Whichever is tighter is the binding one; the other is inert. When \(\pi_\text{old} > 1/(1+\varepsilon) \approx 0.83\), the clip ceiling \((1+\varepsilon)\pi_\text{old}\) exceeds 1 — unreachable as a probability — so the clip simply never activates, and \(\le 1\) is what’s binding. By contrast, for \(\pi_\text{old} = 0.01\) the clip ceiling is 0.012, which is the binding constraint. So the upper clip taxes only the low-/mid-probability population — exactly the exploration tokens — while already-dominant tokens grow freely.
  3. Compounding makes the gap exponential. When the clip is the binding constraint at every step (saturated clipping), the probability evolves geometrically:

    \[\pi_n = \pi_0 \, (1+\varepsilon)^n\]

    so the time to climb from \(\pi_0\) to any target \(\pi^*\) is logarithmic:

    \[n^* \;=\; \frac{\log(\pi^*/\pi_0)}{\log(1+\varepsilon)}.\]

    Plugging in \(\pi_0 = 0.01\), \(\pi^* = 0.1\): 12.6 steps at \(\varepsilon = 0.20\), 9.3 at \(\varepsilon_\text{high} = 0.28\). Two things sharpen this:

    • The race is doubly asymmetric. A token at 0.01 must climb 10× to even “start mattering” at 0.1, and another 10× to dominate. A token already at 0.5 only needs to double once to take over. Same per-step rate, vastly different distances. And by consequence (1), the low-probability token’s per-step absolute gain is also 50× smaller — it’s slower in the geometric metric and in the literal one.
    • The “small” \(\varepsilon\) change compounds into a large probability gap. After \(n\) saturated steps, \(\varepsilon_\text{high} = 0.28\) leaves the token at \((1.28/1.20)^n\) times where \(\varepsilon = 0.20\) would put it: roughly 3.6× after 20 steps, 25× after 50, over 600× after 100 (long before that, of course, the probability itself saturates at 1). A modest-looking rate change becomes a multiplicative probability difference once the geometric process runs.

    The clip won’t saturate every step in practice — but it saturates precisely when the advantage is large, which is exactly when an exploration token would benefit most from amplification. So this geometric bound describes the worst-case ascent rate, which is the operative one for tokens trying to escape obscurity. Each step the bound binds, the rich-get-richer dynamic locks in a little more probability mass on already-dominant tokens, and entropy drops accordingly.

The lower bound stays symmetric for a reason: loosening it would let the policy abandon previously-good tokens too aggressively, which is a different failure mode (premature mode collapse). The fix is asymmetric because the problem is asymmetric — it’s the upper clip’s multiplicative ceiling on small probabilities that needs lifting.

Method. Decouple the bounds:

\[\varepsilon_{\text{low}} = 0.2, \quad \varepsilon_{\text{high}} = 0.28 \quad \Rightarrow \quad [0.8, 1.28]\]

Same lower bound (still bound how fast probabilities can collapse), looser upper bound (give exploration tokens room to grow).
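
In code, decoupling the bounds is a one-line change to the clipped surrogate. A sketch (tensor layout as in the earlier GRPO sketch; the `eps` defaults follow the values quoted above):

```python
import torch

def clip_higher_surrogate(ratio, adv, eps_low=0.2, eps_high=0.28):
    # Asymmetric trust region: [1 - eps_low, 1 + eps_high] = [0.8, 1.28].
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
    return torch.min(ratio * adv, clipped)
```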

Result. AIME 36 → 38 (+2). Paper Fig 2b: the entropy curve, which was sliding downward, flattens and rises. Fig 3a: tokens that hit the upper clip really do sit at \(\pi_\theta < 0.2\) — the clip is biting the population it was designed for.

Does the result match the motivation? Strongly for the mechanism, weakly for the score. The entropy curve and clip distribution directly support the “low-probability tokens were being suppressed” story. But +2 on AIME is small — entropy collapse is a contributing rather than dominant cause of low scores.

Panel A: the clip ceiling is multiplicative, so the absolute "growth budget" (shaded band) scales with \(\pi_\text{old}\). Past \(\pi_\text{old} \approx 0.83\) the ceiling exceeds 1 and the upper clip stops biting entirely — exploration tokens never reach this regime. Panel B: compounding under saturated clipping; raising \(\varepsilon_\text{high}\) from 0.20 to 0.28 cuts the time for a 0.01 token to climb to 0.1 from 12.6 steps to 9.3.

3. Token-Level Loss Normalization

Motivation. Standard GRPO averages token losses within each sample, then averages across samples — every response gets equal weight regardless of length. Long, low-quality outputs (gibberish, runaway repetition) end up under-penalized per token, while short clean responses are over-weighted per token. The paper observes this manifesting as an “unhealthy increase in entropy and response length.”

Method. Average directly over all tokens in the batch:

\[\mathcal{L} = \frac{1}{\sum_i \vert o_i \vert} \sum_{i=1}^{G} \sum_{l=1}^{\vert o_i \vert} \ell(o_i^l)\]

Each token contributes equally — long sequences now exert proportional influence on the loss.
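
The two normalizations differ only in the final reduction. A sketch contrasting them, reusing the `per_token` / `mask` layout from the earlier GRPO sketch:

```python
import torch

def grpo_sample_level_loss(per_token, mask):
    """GRPO default: average within each completion, then across the group."""
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return -per_seq.mean()

def dapo_token_level_loss(per_token, mask):
    """DAPO: every token in the batch carries equal weight."""
    return -(per_token * mask).sum() / mask.sum()
```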

Result. AIME 41 → 42 (+1) — the smallest single contribution in the chain. But Fig 4: entropy and response-length curves stop drifting upward and stabilize.

Does the result match the motivation? Mixed. The qualitative claim (entropy/length stability) is supported by the curves; the quantitative claim (better AIME) barely is. This change is best read as a stability and collapse-prevention measure, not a score lever.

4. Overlong Reward Shaping

Motivation. When a sample hits the max generation length, it gets truncated and labeled wrong — even if the reasoning was on track and just ran out of tokens. The paper calls this reward noise: the negative reward isn’t telling the model “your reasoning is bad,” it’s telling it “you were too long.” Training on the wrong signal destabilizes the policy.

Method. Two stages, applied successively:

  1. Overlong Filtering (first fix): mask the loss on truncated samples — they contribute no gradient at all.
  2. Soft Overlong Punishment (refinement): a graduated length penalty. With \(L_\text{max} = 16384\) and \(L_\text{cache} = 4096\):
    • \(\vert y \vert \leq L_\text{max} - L_\text{cache}\): no length penalty
    • \(L_\text{max} - L_\text{cache} < \vert y \vert \leq L_\text{max}\): penalty drops linearly from \(0\) to \(-1\)
    • \(\vert y \vert > L_\text{max}\): penalty \(= -1\)

    Generation cap is \(L_\text{max} + L_\text{cache} = 20480\).
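
The piecewise penalty transcribes directly (constants as given above; the penalty is added to the task reward):

```python
def soft_overlong_penalty(length, L_max=16384, L_cache=4096):
    """Graduated length penalty added on top of the task reward."""
    if length <= L_max - L_cache:
        return 0.0                                        # within budget: no penalty
    if length <= L_max:
        return -(length - (L_max - L_cache)) / L_cache    # linear ramp from 0 to -1
    return -1.0                                           # beyond L_max: full penalty

assert soft_overlong_penalty(12288) == 0.0
assert soft_overlong_penalty(14336) == -0.5
assert soft_overlong_penalty(17000) == -1.0
```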

Result. Filtering alone gives one of the largest jumps in the chain: 30 → 36 (+6). Adding soft shaping on top of clip-higher: 38 → 41 (+3). Fig 5: AIME accuracy and entropy curves both stabilize once truncation noise is removed.

Does the result match the motivation? Yes, more directly than any other DAPO component. The motivation is “truncation injects a wrong-direction signal that destabilizes training”; killing that signal both lifts the score by a large margin (combined +9 across the two stages — the bulk of DAPO’s improvement together with dynamic sampling) and visibly stabilizes the curves.

Ablation Results (AIME 2024)

| Configuration | Score | Δ |
|---|---|---|
| Naive GRPO | 30 | |
| + Overlong Filtering | 36 | +6 |
| + Clip-Higher | 38 | +2 |
| + Soft Overlong Punishment | 41 | +3 |
| + Token-Level Loss | 42 | +1 |
| + Dynamic Sampling (full DAPO) | 50 | +8 |
| DeepSeek-R1-Zero (2x training steps) | 47 | |

Two components dominate: Dynamic Sampling (+8) and Overlong Reward Shaping (+9 across its two stages), together accounting for 17 of the 20-point gain. Clip-Higher (+2) and Token-Level Loss (+1) are smaller-magnitude but address training stability — entropy and length collapse modes that would otherwise re-emerge. The headline “DAPO matches DeepSeek-R1-Zero in half the steps” comes mostly from cleaning up two sources of corrupted gradient: truncation-induced wrong rewards and uniform-outcome dead prompts.

GRPO vs GSPO: token-level clipping allows the sequence-level ratio to drift unconstrained, while GSPO's geometric mean provides a direct bound. The drift grows with response length, which is why GSPO clips roughly 100x more often yet still trains better (see the next section).

GSPO: Group Sequence Policy Optimization

GSPO (Alibaba/Qwen, 2025) takes a different angle: rather than fixing GRPO’s sampling strategy, it fixes GRPO’s optimization target. The core claim is that GRPO’s per-token importance ratios are a theoretically unsound application of importance sampling, introducing noise that accumulates with response length.

The Problem with Token-Level Ratios

The policy gradient theorem operates at the sequence level — the quantity we want to optimize is:

\[\nabla_\theta \, \mathbb{E}_{o \sim \pi_{\theta_{\text{old}}}} \!\left[\frac{\pi_\theta(o \vert q)}{\pi_{\theta_{\text{old}}}(o \vert q)} R(o)\right]\]

The correct importance sampling ratio for a full sequence factorizes as a product of per-token ratios:

\[\frac{\pi_\theta(o_i \vert q)}{\pi_{\theta_{\text{old}}}(o_i \vert q)} = \prod_{l=1}^{\vert o_i \vert} \rho_i^l\]

GRPO, however, clips each \(\rho_i^l\) independently to \([1-\varepsilon,\, 1+\varepsilon]\). This does not bound the sequence-level ratio — even after clipping, the effective product can range over:

\[\prod_{l=1}^{\vert o_i \vert} \text{clip}(\rho_i^l) \;\in\; \left[(1-\varepsilon)^{\vert o_i \vert},\; (1+\varepsilon)^{\vert o_i \vert}\right]\]

For a typical response of \(\vert o_i \vert = 1024\) tokens and \(\varepsilon = 0.2\), the upper bound is \(1.2^{1024} \approx 10^{81}\). The “trust region” that token-level clipping defines is essentially unbounded at the sequence level, and the resulting noise in the gradient estimate grows with sequence length.
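
The blow-up is easy to reproduce numerically (illustrative worst case: every clipped per-token ratio sits at the upper boundary):

```python
import math

eps = 0.2
for length in (64, 256, 1024):
    # Worst case: every clipped per-token ratio equals 1 + eps.
    log10_bound = length * math.log10(1 + eps)
    print(f"|o| = {length:4d}: sequence-level ratio can reach 10^{log10_bound:.1f}")
# |o| =   64: sequence-level ratio can reach 10^5.1
# |o| =  256: sequence-level ratio can reach 10^20.3
# |o| = 1024: sequence-level ratio can reach 10^81.1
```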

Why is the sequence-level ratio the theoretically correct one? The core argument starts from the goal of IS correction. We want to convert the expectation from the old policy \(\pi_{\theta_{\text{old}}}\) to the current policy \(\pi_\theta\):

\[\mathbb{E}_{\pi_\theta}\!\left[R(o)\right] = \mathbb{E}_{\pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(o \vert q)}{\pi_{\theta_{\text{old}}}(o \vert q)} R(o)\right]\]

The expectation here is over complete sequences \(o\), and the IS ratio \(\pi_\theta(o \vert q) / \pi_{\theta_{\text{old}}}(o \vert q)\) is a sequence-level quantity. Decomposing it into \(\prod_l \rho_l\) merely exploits the autoregressive factorization — it does not mean each factor \(\rho_l\) is an independent IS correction. When GRPO clips each \(\rho_l\) independently, it constrains the individual pieces of a factorized quantity, but the product of these constraints is not equivalent to constraining the original sequence-level ratio. In other words, token-level clipping constrains artificially separated fragments, not the whole object that IS correction actually needs to bound.

More concretely, GRPO’s advantage \(\hat{A}_i\) is a trajectory-level constant — it depends only on \(r(\tau^i)\) and is the same for every token within a sequence. The gradient contribution at position \(l\) takes the form \(\hat{A}_i \cdot \rho_l \cdot \nabla_\theta \ln \pi_\theta(a_l)\), where \(\hat{A}_i\) does not depend on any single token’s distribution. Since the advantage itself is sequence-level, the natural granularity of IS correction should also be sequence-level: what we need to correct is “how much more likely is this entire trajectory under \(\pi_\theta\) relative to \(\pi_{\theta_{\text{old}}}\),” not “how much did each token shift individually.” GSPO’s geometric mean \(s_i = (\prod_l \rho_l)^{1/\vert o_i \vert}\) is a monotonic transformation of the sequence-level log-ratio \(\frac{1}{\vert o_i \vert}\ln \frac{\pi_\theta(o_i)}{\pi_{\theta_{\text{old}}}(o_i)}\), so clipping it directly constrains the sequence-level distributional shift.

This argument also aligns with the trust-region intuition: PPO’s trust region is fundamentally a constraint on the KL divergence between policies — a sequence-level quantity. Token-level clipping is merely an approximation, one that can become arbitrarily loose on long sequences; sequence-level clipping imposes the constraint at the correct level of abstraction.

Sequence-Level Importance Ratio

GSPO replaces the per-token ratio with a length-normalized sequence-level ratio:

\[s_i(\theta) = \left(\frac{\pi_\theta(o_i \vert q)}{\pi_{\theta_{\text{old}}}(o_i \vert q)}\right)^{1/\vert o_i \vert}\]

This is the geometric mean of per-token ratios across the full response — a single scalar per completion, not per token. The clipped objective becomes:

\[\mathcal{J}_{\text{GSPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \min\!\left(s_i(\theta)\,\hat{A}_i,\; \text{clip}(s_i(\theta), 1\!-\!\varepsilon, 1\!+\!\varepsilon)\,\hat{A}_i\right)\]
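
A minimal sketch of the ratio and objective, computed from per-token log-probs (tensor layout as in the GRPO sketch; the clipping range is left as a parameter since the text above does not fix it, and the function returns the objective to maximize, not a loss):

```python
import torch

def gspo_objective(logp_new, logp_old, rewards, mask, eps_clip):
    """Sequence-level clipped objective; eps_clip is a hyperparameter, value not taken from the paper."""
    # Group-normalized advantage, one scalar per completion.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)                            # [G]

    # Length-normalized sequence ratio: geometric mean of per-token ratios.
    seq_logratio = ((logp_new - logp_old) * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    s = torch.exp(seq_logratio)                                                          # [G]

    clipped = torch.clamp(s, 1 - eps_clip, 1 + eps_clip)
    return torch.min(s * adv, clipped * adv).mean()
```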

Why This Matters

Moving to sequence-level ratios has several practical consequences:

  1. Higher effective clipping rate, better efficiency. GSPO’s clipping rate is two orders of magnitude higher than GRPO’s, yet it achieves better training efficiency. This suggests that much of the token-level gradient GRPO retains is noise rather than useful signal.

  2. MoE stability. GRPO requires “Routing Replay” — caching and replaying expert routing patterns — to stabilize Mixture-of-Experts training. GSPO eliminates this need because it depends only on sequence-level likelihood, not individual token routing decisions.

  3. Infrastructure simplification. No need for token-level synchronization between inference and training engines; sequence-level log-probabilities suffice.

GSPO was used to train the Qwen3 model family.
