Policy Optimization without a Critic: The GRPO Family
Motivation: Why Not Learn a Critic?
The standard recipe for RLHF — PPO with a learned value function — requires training and maintaining a critic alongside the policy. In the LM setting, this means a second copy of the base model with a scalar value head, doubling memory and introducing its own training challenges (representation drift, readout sensitivity, etc.). For single-turn tasks like math and code generation, a natural question arises: can we skip the critic entirely?
The answer is yes — if we are willing to trade a learned value estimate for a statistical one. Instead of training \(V_\phi(s)\) to predict expected return, we can sample multiple completions from the same prompt and use their empirical reward statistics as a baseline. This is the core idea behind GRPO, and the starting point for DAPO and GSPO.
GRPO: Group Relative Policy Optimization
GRPO (DeepSeek, 2024) replaces the learned critic with group-level sampling statistics. Given a prompt \(q\), sample a group of \(G\) completions \(\{o_1, \ldots, o_G\}\) from the current policy \(\pi_\theta\), score each with a reward function \(r(q, o_i)\), and compute normalized advantages:
\[\hat{A}_i = \frac{r(q, o_i) - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)}\]The “value function” is the group mean — a Monte Carlo estimate that converges to \(\mathbb{E}_\pi[r \vert q]\) as \(G \to \infty\). No learned parameters, no value head, no critic training loop.
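To make this concrete, here is a minimal sketch of the group-relative advantage computation, assuming a scalar (e.g. 0/1 verifier) reward per completion; the function name and the small epsilon guard are illustrative, not from the GRPO paper.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: (r - mean) / std over the G completions of one prompt.

    rewards: shape (G,), one scalar reward per completion.
    """
    mean = rewards.mean()
    std = rewards.std()
    # eps guards the degenerate case where every reward is identical (std = 0);
    # there the advantages are all zero and the prompt contributes no gradient.
    return (rewards - mean) / (std + eps)

# Example: G = 4 completions scored by a 0/1 verifier -> two positive, two negative advantages.
print(group_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
```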
The GRPO loss applies PPO-style clipping with per-token importance ratios:
\[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \left[\min\!\left(\rho_i^l \, \hat{A}_i,\; \text{clip}(\rho_i^l, 1\!-\!\varepsilon, 1\!+\!\varepsilon)\, \hat{A}_i\right) - \beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\right]\]where:
- \(\rho_i^l = \frac{\pi_\theta(o_i^l \vert q, o_i^{<l})}{\pi_{\theta_{\text{old}}}(o_i^l \vert q, o_i^{<l})}\) is the per-token importance ratio — the probability of the \(l\)-th token of the \(i\)-th completion under the current policy \(\pi_\theta\) divided by its probability under the rollout policy \(\pi_{\theta_{\text{old}}}\). A ratio \(\rho > 1\) means the current policy assigns higher probability to this token than the rollout policy did; \(\rho < 1\) means lower.
- \(\hat{A}_i\) is the group-normalized advantage for completion \(o_i\), shared across all tokens \(l\) in that completion. This is the key simplification: GRPO assigns a single scalar credit to the entire response, not per-token.
- \(\text{clip}(\rho, 1-\varepsilon, 1+\varepsilon)\) clamps the ratio to \([1-\varepsilon, 1+\varepsilon]\) (typically \([0.8, 1.2]\)), preventing any single token’s probability from changing too much in one update — the same mechanism as PPO.
- \(\beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\) is a KL penalty against a reference policy \(\pi_{\text{ref}}\) (typically the SFT model), preventing the policy from drifting too far from the pretrained distribution.
- The outer average \(\frac{1}{G} \sum_i \frac{1}{\vert o_i \vert} \sum_l\) first averages over tokens within each completion, then averages over completions in the group.
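Putting these pieces together, a minimal sketch of the clipped surrogate for a single group, assuming per-token log-probabilities have already been gathered under \(\pi_\theta\) and \(\pi_{\theta_{\text{old}}}\) and padded to a common length with a response-token mask (tensor names are illustrative; the KL term is omitted for brevity):

```python
import torch

def grpo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Clipped GRPO surrogate for one group of G completions (KL term omitted).

    logp_new, logp_old: (G, T) per-token log-probs under pi_theta and pi_theta_old.
    advantages:         (G,)  one group-normalized advantage per completion.
    mask:               (G, T) 1.0 for response tokens, 0.0 for padding.
    """
    ratio = torch.exp(logp_new - logp_old)                    # rho_i^l, per token
    adv = advantages.unsqueeze(-1)                            # same A_i broadcast to every token
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Average over tokens within each completion, then over completions (the 1/G * 1/|o_i| weighting).
    per_seq = (surrogate * mask).sum(-1) / mask.sum(-1)
    return -per_seq.mean()
```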
The \(G \to \infty\) Limit: REINFORCE with a Baseline
What happens when the group size grows? By the law of large numbers, the group mean converges to the true expected reward under the current policy:
\[\frac{1}{G}\sum_{j=1}^{G} r(q, o_j) \xrightarrow{G \to \infty} \mathbb{E}_{o \sim \pi_\theta(\cdot \vert q)}[r(q, o)]\]Note that this is not a state value function \(V(s)\) in the RL sense — there is no multi-step MDP here. GRPO operates at the task level: one prompt in, one complete response out, one reward. The expectation above is simply the average reward the current policy would get on this prompt if it generated infinitely many responses.
In this limit the GRPO advantage (ignoring the std normalization) becomes:
\[\hat{A}_i \;\to\; r(q, o_i) - \mathbb{E}_{o \sim \pi_\theta}[r(q, o)]\]This is a constant baseline subtraction — the same idea as REINFORCE with a baseline, but the baseline is prompt-specific and estimated purely by sampling. A learned critic in standard RLHF serves the same role (predicting the expected reward for a given input), but it does so with a neural network trained across the entire dataset. GRPO replaces that network with a sample mean computed from the \(G\) completions at hand.
The difference is practical. A learned critic accumulates information across the entire training run: every prompt it has ever seen refines its predictions. GRPO’s group mean estimates the expected reward from scratch each time, using only the \(G\) completions sampled for that prompt in that batch. With finite \(G\) (DeepSeek uses \(G = 64\), DAPO uses \(G = 16\)), this estimate is noisy — and the noise is the price of not maintaining a critic. Whether the simplicity is worth the noise depends on the task: for math and code with reliable verifiers, the answer has been a clear yes.
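The noise is easy to visualize with a small simulation, assuming a verifier-style 0/1 reward and a fixed, hypothetical per-prompt pass rate: the group mean is unbiased at every \(G\), but its spread shrinks only as roughly \(1/\sqrt{G}\).

```python
import torch

p_correct = 0.3                                   # hypothetical pass rate on this prompt
for G in (4, 16, 64, 256):
    rewards = torch.bernoulli(torch.full((10_000, G), p_correct))  # 10k groups of G rollouts
    baselines = rewards.mean(dim=1)               # GRPO's per-group "value estimate"
    print(G, round(baselines.mean().item(), 3), round(baselines.std().item(), 3))
# The mean stays near 0.3 for every G; the std falls roughly as 1/sqrt(G).
```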
The Fundamental Limitation: Task-Level Credit
The advantage \(\hat{A}_i\) is task-level: every token in completion \(o_i\) receives the same advantage. GRPO cannot distinguish which part of a response was responsible for success or failure — it only knows that this completion, as a whole, was better or worse than its peers. For single-turn tasks (solve a math problem, write a function) this is acceptable. For multi-step agent tasks, it is not — see the ArCHer post for a detailed comparison.
When Does GRPO Learn Nothing?
GRPO’s learning signal depends entirely on outcome diversity within the group:
- All completions correct (\(G\) of \(G\)): all rewards identical → \(\hat{A}_i = 0\) for all \(i\) → zero gradient
- All completions wrong (\(0\) of \(G\)): same situation → zero gradient
- Mixed outcomes: some succeed, some fail → nonzero advantages → useful gradient
As training progresses and the model improves, more prompts become “too easy” (pass rate → 100%) and stop contributing signal. This is the central failure mode that DAPO addresses.
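The degenerate case is easy to check directly (a toy illustration, reusing the normalization above with a small epsilon guard):

```python
import torch

# Every completion in the group is correct, so all rewards are identical.
rewards = torch.ones(8)
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(adv)   # all zeros: every term of the surrogate is multiplied by 0
# The prompt contributes no gradient this step, no matter what the token ratios are.
```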
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization
DAPO (ByteDance Seed, 2025) identifies entropy collapse as a key failure mode of naive GRPO and introduces four techniques to combat it and related instabilities. A fifth, less obvious change: DAPO completely removes the KL penalty (\(\beta = 0\)). The reasoning is that for long chain-of-thought models, the policy is expected to diverge significantly from the SFT initialization — the KL constraint actively fights against the distribution shift that training is trying to achieve. Instead of relying on KL to prevent collapse, DAPO uses the mechanisms below.
1. Dynamic Sampling
Motivation. As training improves the model, more prompts hit pass rate 0 or \(G\) — uniform outcomes give zero advantage and contribute no gradient. Effective batch size shrinks while rollout compute does not, so signal-to-cost steadily worsens (paper Fig 3b: fraction of all-correct prompts grows over training).
Method. During rollout, keep each prompt only if its correct count is strictly between 0 and \(G\), oversampling until the batch is full:
\[0 < \vert\{o_i : \text{correct}(o_i)\}\vert < G\]Every surviving prompt has mixed outcomes — nonzero gradient guaranteed. As the model improves, the surviving distribution automatically tracks the model’s learning frontier.
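A sketch of this filtering step, assuming a generate-and-score loop already exists (`sample_group` and `verify` are placeholder callables, not names from the DAPO codebase):

```python
def fill_batch(prompts, sample_group, verify, G=16, batch_size=512):
    """Oversample prompts, keeping only those with mixed outcomes (0 < #correct < G)."""
    kept = []
    for q in prompts:
        completions = sample_group(q, G)                  # G rollouts from the current policy
        n_correct = sum(verify(q, o) for o in completions)
        if 0 < n_correct < G:                             # uniform outcomes carry zero advantage
            kept.append((q, completions))
        if len(kept) == batch_size:                       # stop once the batch is full
            break
    return kept
```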
Result. AIME 42 → 50 (+8). Largest single delta in the entire DAPO ablation chain. Despite needing extra rollouts to fill each batch, convergence is also faster (paper Fig 6).
Does the result match the motivation? Cleanly yes. The hypothesis is “uniform-outcome prompts are dead weight”; removing them produces the biggest single-component lift, and the speedup despite extra rollouts is exactly what the signal-to-cost story predicts.
2. Clip-Higher (Asymmetric Clipping)
Motivation. Naive GRPO suffers entropy collapse — the policy concentrates and stops exploring. The mechanism is asymmetry hidden in symmetric clipping. Under \([1-\varepsilon, 1+\varepsilon] = [0.8, 1.2]\), a low-probability token at \(\pi_\text{old}=0.01\) can rise to at most \(0.012\) before clipping; a high-probability token at \(0.9\) is nominally allowed to reach \(1.08\) (more than a probability can even be). The same \(\varepsilon\) multiplicatively favors already-likely tokens, so “exploration” tokens are structurally hard to amplify.
Where does the asymmetry come from? PPO clips the importance ratio \(r = \pi_\theta / \pi_\text{old}\), not the probability — so the bound is multiplicative. The absolute room to grow per step is \(\varepsilon \cdot \pi_\text{old}\), proportional to where the token already is.
This ceiling is a hard wall, not a soft penalty — enforced by the gradient becoming exactly zero. When \(r > 1+\varepsilon\) on a token with \(\hat{A}>0\), PPO’s \(\min(r\hat{A},\,\text{clip}(r)\hat{A})\) picks the clipped term \((1+\varepsilon)\hat{A}\), which is constant in \(\theta\). The token’s contribution to the loss is frozen; its gradient is 0; nothing pushes its probability further this step. (The \(\min\) design only clips when clipping tightens the objective; in opposite-sign cases like \(\hat{A}<0\) with \(r>1+\varepsilon\), the unclipped term wins and gradient still flows — PPO won’t shield you from pulling down a bad action you over-shot on.) So “the clip bites” specifically means: positive-advantage tokens with \(r\) pegged at \(1+\varepsilon\), contributing zero gradient. Three consequences fall out:
- Absolute budget scales with \(\pi_\text{old}\). A 0.01 token can gain at most 0.002 mass per step; a 0.5 token can gain 0.1 — a 50× gap from the same \(\varepsilon\).
- The upper clip only restricts low-probability tokens — high-probability tokens are unaffected. Two separate ceilings act on \(\pi_\text{new}\) at every step: the PPO clip \(\pi_\text{new} \le (1+\varepsilon)\,\pi_\text{old}\) and the natural probability cap \(\pi_\text{new} \le 1\). Whichever is tighter is the binding one; the other is inert. When \(\pi_\text{old} > 1/(1+\varepsilon) \approx 0.83\), the clip ceiling \((1+\varepsilon)\pi_\text{old}\) exceeds 1 — unreachable as a probability — so the clip simply never activates, and \(\le 1\) is what’s binding. By contrast, for \(\pi_\text{old} = 0.01\) the clip ceiling is 0.012, which is the binding constraint. So the upper clip taxes only the low-/mid-probability population — exactly the exploration tokens — while already-dominant tokens grow freely.
- Compounding makes the gap exponential. When the clip is the binding constraint at every step (saturated clipping), the probability evolves geometrically:
\[\pi_n = \pi_0 \, (1+\varepsilon)^n\]so the time to climb from \(\pi_0\) to any target \(\pi^*\) is logarithmic:
\[n^* \;=\; \frac{\log(\pi^*/\pi_0)}{\log(1+\varepsilon)}.\]Plugging in \(\pi_0 = 0.01\), \(\pi^* = 0.1\): 12.6 steps at \(\varepsilon = 0.20\), 9.3 at \(\varepsilon_\text{high} = 0.28\). Two things sharpen this:
- The race is doubly asymmetric. A token at 0.01 must climb 10× to even “start mattering” at 0.1, and another 10× to dominate. A token already at 0.5 only needs to double once to take over. Same per-step rate, vastly different distances. And by the first consequence above, the low-probability token’s per-step absolute gain is also 50× smaller — it’s slower in the geometric metric and in the literal one.
- The “small” \(\varepsilon\) change compounds into a large probability gap. After \(n\) saturated steps, \(\varepsilon_\text{high} = 0.28\) leaves the token at \((1.28/1.20)^n\) times where \(\varepsilon = 0.20\) would put it: roughly 3.6× after 20 steps, 25× after 50, over 600× after 100. A modest-looking rate change becomes a multiplicative probability difference once the geometric process runs.
The clip won’t saturate every step in practice — but it saturates precisely when the advantage is large, which is exactly when an exploration token would benefit most from amplification. So this geometric bound describes the worst-case ascent rate, which is the operative one for tokens trying to escape obscurity. Each step the bound binds, the rich-get-richer dynamic locks in a little more probability mass on already-dominant tokens, and entropy drops accordingly.
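The step counts and compounded gaps quoted above follow directly from the saturated-clip recursion; a quick arithmetic check, assuming the clip binds on every step:

```python
import math

def steps_to_reach(p0: float, p_target: float, eps: float) -> float:
    """n* = log(p_target / p0) / log(1 + eps) under saturated clipping."""
    return math.log(p_target / p0) / math.log(1 + eps)

print(steps_to_reach(0.01, 0.1, 0.20))            # ~12.6 steps at eps = 0.20
print(steps_to_reach(0.01, 0.1, 0.28))            # ~9.3 steps at eps_high = 0.28
print((1.28 / 1.20) ** 20, (1.28 / 1.20) ** 50)   # ~3.6x and ~25x compounded gap
```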
The lower bound stays symmetric for a reason: loosening it would let the policy abandon previously-good tokens too aggressively, which is a different failure mode (premature mode collapse). The fix is asymmetric because the problem is asymmetric — it’s the upper clip’s multiplicative ceiling on small probabilities that needs lifting.
Method. Decouple the bounds:
\[\varepsilon_{\text{low}} = 0.2, \quad \varepsilon_{\text{high}} = 0.28 \quad \Rightarrow \quad [0.8, 1.28]\]Same lower bound (still bound how fast probabilities can collapse), looser upper bound (give exploration tokens room to grow).
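In code, the change is one line relative to the GRPO sketch above: the lower and upper clip bounds are decoupled (tensor names illustrative).

```python
import torch

def clip_higher_surrogate(ratio, adv, eps_low=0.2, eps_high=0.28):
    """DAPO-style asymmetric clipping: lower bound unchanged, upper bound loosened.

    ratio: (G, T) per-token importance ratios rho_i^l.
    adv:   (G, 1) group-normalized advantages, broadcast over tokens.
    """
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)   # [0.8, 1.28] instead of [0.8, 1.2]
    return torch.min(ratio * adv, clipped * adv)
```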
Result. AIME 36 → 38 (+2). Paper Fig 2b: the entropy curve, which was sliding downward, flattens and rises. Fig 3a: tokens that hit the upper clip really do sit at \(\pi_\theta < 0.2\) — the clip is biting the population it was designed for.
Does the result match the motivation? Strongly for the mechanism, weakly for the score. The entropy curve and clip distribution directly support the “low-probability tokens were being suppressed” story. But +2 on AIME is small — entropy collapse is a contributing rather than dominant cause of low scores.
3. Token-Level Loss Normalization
Motivation. Standard GRPO averages token losses within each sample, then averages across samples — every response gets equal weight regardless of length. Long, low-quality outputs (gibberish, runaway repetition) end up under-penalized per token, while short clean responses are over-weighted per token. The paper observes this manifesting as an “unhealthy increase in entropy and response length.”
Method. Average directly over all tokens in the batch:
\[\mathcal{L} = \frac{1}{\sum_i \vert o_i \vert} \sum_{i=1}^{G} \sum_{l=1}^{\vert o_i \vert} \ell(o_i^l)\]Each token contributes equally — long sequences now exert proportional influence on the loss.
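Relative to the GRPO sketch above, this is again a one-line change in how the per-token surrogate is reduced (illustrative names):

```python
import torch

def token_level_mean(surrogate, mask):
    """Average the surrogate over all tokens in the batch rather than per sample.

    surrogate: (G, T) per-token surrogate values; mask: (G, T) response-token mask.
    """
    # GRPO: ((surrogate * mask).sum(-1) / mask.sum(-1)).mean()  -> every sample weighted equally
    # DAPO: every token weighted equally, so long responses carry proportional weight
    return (surrogate * mask).sum() / mask.sum()
```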
Result. AIME 41 → 42 (+1) — the smallest single contribution in the chain. But Fig 4: entropy and response-length curves stop drifting upward and stabilize.
Does the result match the motivation? Mixed. The qualitative claim (entropy/length stability) is supported by the curves; the quantitative claim (better AIME) barely is. This change is best read as a stability and collapse-prevention measure, not a score lever.
4. Overlong Reward Shaping
Motivation. When a sample hits the max generation length, it gets truncated and labeled wrong — even if the reasoning was on track and just ran out of tokens. The paper calls this reward noise: the negative reward isn’t telling the model “your reasoning is bad,” it’s telling it “you were too long.” Training on the wrong signal destabilizes the policy.
Method. Two stages, applied successively:
- Overlong Filtering (first fix): mask the loss on truncated samples — they contribute no gradient at all.
- Soft Overlong Punishment (refinement): a graduated length penalty. With a maximum generation length \(L_\text{max} = 20480\) and a soft-punishment cache \(L_\text{cache} = 4096\):
- \(\vert y \vert \leq L_\text{max} - L_\text{cache}\): no length penalty
- \(L_\text{max} - L_\text{cache} < \vert y \vert \leq L_\text{max}\): penalty drops linearly from \(0\) to \(-1\)
- \(\vert y \vert > L_\text{max}\): penalty \(= -1\)
With these values, responses are penalty-free up to \(L_\text{max} - L_\text{cache} = 16384\) tokens, and the penalty ramps to \(-1\) over the final \(4096\) tokens before the hard generation cap at \(20480\).
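A sketch of the piecewise penalty under these constants; the penalty is added on top of the task reward (function name illustrative):

```python
def soft_overlong_penalty(length: int, L_max: int = 20_480, L_cache: int = 4_096) -> float:
    """Graduated length penalty: 0 within budget, linear to -1 in the cache window, -1 beyond."""
    if length <= L_max - L_cache:          # up to 16384 tokens: no penalty
        return 0.0
    if length <= L_max:                    # 16384 < |y| <= 20480: ramp linearly from 0 to -1
        return -(length - (L_max - L_cache)) / L_cache
    return -1.0                            # past the hard cap: full penalty
```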
Result. Filtering alone gives one of the biggest jumps in the chain: 30 → 36 (+6), second only to dynamic sampling. Adding soft shaping on top of clip-higher: 38 → 41 (+3). Fig 5: AIME accuracy and entropy curves both stabilize once truncation noise is removed.
Does the result match the motivation? Yes, more directly than any other DAPO component. The motivation is “truncation injects a wrong-direction signal that destabilizes training”; killing that signal both lifts the score by a large margin (combined +9 across the two stages — the bulk of DAPO’s improvement together with dynamic sampling) and visibly stabilizes the curves.
Ablation Results (AIME 2024)
| Configuration | Score | Δ |
|---|---|---|
| Naive GRPO | 30 | — |
| + Overlong Filtering | 36 | +6 |
| + Clip-Higher | 38 | +2 |
| + Soft Overlong Punishment | 41 | +3 |
| + Token-Level Loss | 42 | +1 |
| + Dynamic Sampling (full DAPO) | 50 | +8 |
| DeepSeek-R1-Zero (2x training steps) | 47 | — |
Two components dominate: Dynamic Sampling (+8) and Overlong Reward Shaping (+9 across its two stages), together accounting for 17 of the 20-point gain. Clip-Higher (+2) and Token-Level Loss (+1) are smaller-magnitude but address training stability — entropy and length collapse modes that would otherwise re-emerge. The headline “DAPO matches DeepSeek-R1-Zero in half the steps” comes mostly from cleaning up two sources of corrupted gradient: truncation-induced wrong rewards and uniform-outcome dead prompts.
GSPO: Group Sequence Policy Optimization
GSPO (Alibaba/Qwen, 2025) takes a different angle: rather than fixing GRPO’s sampling strategy, it fixes GRPO’s optimization target. The core claim is that GRPO’s per-token importance ratios are a theoretically unsound application of importance sampling, introducing noise that accumulates with response length.
The Problem with Token-Level Ratios
The policy gradient theorem operates at the sequence level — the quantity we want to optimize is:
\[\nabla_\theta \, \mathbb{E}_{o \sim \pi_{\theta_{\text{old}}}} \!\left[\frac{\pi_\theta(o \vert q)}{\pi_{\theta_{\text{old}}}(o \vert q)} R(o)\right]\]The correct importance sampling ratio for a full sequence factorizes as a product of per-token ratios:
\[\frac{\pi_\theta(o_i \vert q)}{\pi_{\theta_{\text{old}}}(o_i \vert q)} = \prod_{l=1}^{\vert o_i \vert} \rho_i^l\]GRPO, however, clips each \(\rho_i^l\) independently to \([1-\varepsilon,\, 1+\varepsilon]\). This does not bound the sequence-level ratio — even after clipping, the effective product can range over:
\[\prod_{l=1}^{\vert o_i \vert} \text{clip}(\rho_i^l) \;\in\; \left[(1-\varepsilon)^{\vert o_i \vert},\; (1+\varepsilon)^{\vert o_i \vert}\right]\]For a typical response of \(\vert o_i \vert = 1024\) tokens and \(\varepsilon = 0.2\), the upper bound is \(1.2^{1024} \approx 10^{81}\). The “trust region” that token-level clipping defines is essentially unbounded at the sequence level, and the resulting noise in the gradient estimate grows with sequence length.
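The figure is easy to verify with plain arithmetic (no assumptions beyond the values in the text):

```python
import math

eps, seq_len = 0.2, 1024
log10_upper = seq_len * math.log10(1 + eps)                    # log10 of (1 + eps)^|o_i|
print(f"(1 + eps)^{seq_len} is about 10^{log10_upper:.0f}")    # about 10^81
```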
Why is the sequence-level ratio the theoretically correct one? The core argument starts from the goal of IS correction. We want to convert the expectation from the old policy \(\pi_{\theta_{\text{old}}}\) to the current policy \(\pi_\theta\):
\[\mathbb{E}_{\pi_\theta}\!\left[R(o)\right] = \mathbb{E}_{\pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(o \vert q)}{\pi_{\theta_{\text{old}}}(o \vert q)} R(o)\right]\]The expectation here is over complete sequences \(o\), and the IS ratio \(\pi_\theta(o \vert q) / \pi_{\theta_{\text{old}}}(o \vert q)\) is a sequence-level quantity. Decomposing it into \(\prod_l \rho_l\) merely exploits the autoregressive factorization — it does not mean each factor \(\rho_l\) is an independent IS correction. When GRPO clips each \(\rho_l\) independently, it constrains the individual pieces of a factorized quantity, but the product of these constraints is not equivalent to constraining the original sequence-level ratio. In other words, token-level clipping constrains artificially separated fragments, not the whole object that IS correction actually needs to bound.
More concretely, GRPO’s advantage \(\hat{A}_i\) is a trajectory-level constant — it depends only on \(r(\tau^i)\) and is the same for every token within a sequence. The gradient contribution at position \(l\) takes the form \(\hat{A}_i \cdot \rho_l \cdot \nabla_\theta \ln \pi_\theta(a_l)\), where \(\hat{A}_i\) does not depend on any single token’s distribution. Since the advantage itself is sequence-level, the natural granularity of IS correction should also be sequence-level: what we need to correct is “how much more likely is this entire trajectory under \(\pi_\theta\) relative to \(\pi_{\theta_{\text{old}}}\),” not “how much did each token shift individually.” GSPO’s geometric mean \(s_i = (\prod_l \rho_l)^{1/\vert o_i \vert}\) is a monotonic transformation of the sequence-level log-ratio \(\frac{1}{\vert o_i \vert}\ln \frac{\pi_\theta(o_i)}{\pi_{\theta_{\text{old}}}(o_i)}\), so clipping it directly constrains the sequence-level distributional shift.
This argument also aligns with the trust-region intuition: PPO’s trust region is fundamentally a constraint on the KL divergence between policies — a sequence-level quantity. Token-level clipping is merely an approximation, one that can become arbitrarily loose on long sequences; sequence-level clipping imposes the constraint at the correct level of abstraction.
Sequence-Level Importance Ratio
GSPO replaces the per-token ratio with a length-normalized sequence-level ratio:
\[s_i(\theta) \;=\; \left(\frac{\pi_\theta(o_i \vert q)}{\pi_{\theta_{\text{old}}}(o_i \vert q)}\right)^{1/\vert o_i \vert} \;=\; \left(\prod_{l=1}^{\vert o_i \vert} \rho_i^l\right)^{1/\vert o_i \vert}\]This is the geometric mean of per-token ratios across the full response — a single scalar per completion, not per token. The clipped objective becomes:
\[\mathcal{L}_{\text{GSPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \min\!\left(s_i(\theta)\, \hat{A}_i,\; \text{clip}\!\left(s_i(\theta),\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon\right) \hat{A}_i\right)\]
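A sketch of this objective, computing one length-normalized ratio per completion from summed token log-probabilities (tensor names illustrative; sequence-level ratios concentrate near 1 after length normalization, so the clip range is far smaller than token-level PPO's 0.2, and the exact value below is only a placeholder):

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, eps=3e-4):
    """Sequence-level clipped surrogate: one ratio s_i per completion, not per token.

    logp_new, logp_old: (G, T) per-token log-probs; mask: (G, T) response-token mask.
    advantages:         (G,)  group-normalized advantages.
    """
    lengths = mask.sum(-1)
    # Length-normalized sequence log-ratio = log of the geometric mean of per-token ratios.
    seq_log_ratio = ((logp_new - logp_old) * mask).sum(-1) / lengths
    s = torch.exp(seq_log_ratio)                               # s_i(theta)
    clipped = torch.clamp(s, 1 - eps, 1 + eps)
    return -torch.min(s * advantages, clipped * advantages).mean()
```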
Why This Matters
Moving to sequence-level ratios has several practical consequences:
- Higher effective clipping rate, better efficiency. GSPO’s clipping rate is two orders of magnitude higher than GRPO’s, yet it achieves better training efficiency. This suggests that much of what GRPO’s token-level ratios let through is noise rather than useful gradient signal.
- MoE stability. GRPO requires “Routing Replay” — caching and replaying expert routing patterns — to stabilize Mixture-of-Experts training. GSPO eliminates this need because it depends only on sequence-level likelihood, not individual token routing decisions.
- Infrastructure simplification. No need for token-level synchronization between inference and training engines; sequence-level log-probabilities suffice.
GSPO was used to train the Qwen3 model family.