Policy Optimization without a Critic: The GRPO Family
Motivation: Why Not Learn a Critic?
The standard recipe for RLHF — PPO with a learned value function — requires training and maintaining a critic alongside the policy. In the LM setting, this means a second copy of the base model with a scalar value head, doubling memory and introducing its own training challenges (representation drift, readout sensitivity, etc.). For single-turn tasks like math and code generation, a natural question arises: can we skip the critic entirely?
The answer is yes — if we are willing to trade a learned value estimate for a statistical one. Instead of training \(V_\phi(s)\) to predict expected return, we can sample multiple completions from the same prompt and use their empirical reward statistics as a baseline. This is the core idea behind GRPO, and the starting point for DAPO and GSPO.
GRPO: Group Relative Policy Optimization
GRPO (DeepSeek, 2024) replaces the learned critic with group-level sampling statistics. Given a prompt \(q\), sample a group of \(G\) completions \(\{o_1, \ldots, o_G\}\) from the current policy \(\pi_\theta\), score each with a reward function \(r(q, o_i)\), and compute normalized advantages:
\[\hat{A}_i = \frac{r(q, o_i) - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)}\]The “value function” is the group mean — a Monte Carlo estimate that converges to \(\mathbb{E}_\pi[r \vert q]\) as \(G \to \infty\). No learned parameters, no value head, no critic training loop.
The GRPO loss applies PPO-style clipping with per-token importance ratios:
\[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \left[\min\!\left(\rho_i^l \, \hat{A}_i,\; \text{clip}(\rho_i^l, 1\!-\!\varepsilon, 1\!+\!\varepsilon)\, \hat{A}_i\right) - \beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\right]\]where:
- \(\rho_i^l = \frac{\pi_\theta(o_i^l \vert q, o_i^{<l})}{\pi_{\theta_{\text{old}}}(o_i^l \vert q, o_i^{<l})}\) is the per-token importance ratio — the probability of the \(l\)-th token of the \(i\)-th completion under the current policy \(\pi_\theta\) divided by its probability under the rollout policy \(\pi_{\theta_{\text{old}}}\). A ratio \(\rho > 1\) means the current policy assigns higher probability to this token than the rollout policy did; \(\rho < 1\) means lower.
- \(\hat{A}_i\) is the group-normalized advantage for completion \(o_i\), shared across all tokens \(l\) in that completion. This is the key simplification: GRPO assigns a single scalar credit to the entire response, not per-token.
- \(\text{clip}(\rho, 1-\varepsilon, 1+\varepsilon)\) clamps the ratio to \([1-\varepsilon, 1+\varepsilon]\) (typically \([0.8, 1.2]\)), preventing any single token’s probability from changing too much in one update — the same mechanism as PPO.
- \(\beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\) is a KL penalty against a reference policy \(\pi_{\text{ref}}\) (typically the SFT model), preventing the policy from drifting too far from the pretrained distribution.
- The outer average \(\frac{1}{G} \sum_i \frac{1}{\vert o_i \vert} \sum_l\) first averages over tokens within each completion, then averages over completions in the group.
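The loss above can be sketched numerically. A minimal NumPy version, assuming toy binary rewards and precomputed per-token log-probabilities; the KL penalty term is omitted for brevity:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / std over the group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped per-token GRPO objective (KL penalty omitted).

    logp_new, logp_old: lists of 1-D arrays, one per completion
    (variable lengths allowed); rewards: one scalar per completion.
    """
    adv = grpo_advantages(rewards)
    per_completion = []
    for lp_new, lp_old, A in zip(logp_new, logp_old, adv):
        rho = np.exp(lp_new - lp_old)                 # per-token importance ratio
        clipped = np.clip(rho, 1 - clip_eps, 1 + clip_eps)
        token_obj = np.minimum(rho * A, clipped * A)  # pessimistic min, as in PPO
        per_completion.append(token_obj.mean())       # average over tokens
    return -np.mean(per_completion)                   # then average over the group

# Toy group of G=4 completions of different lengths with binary rewards.
logp_old = [np.log(np.full(n, 0.5)) for n in (3, 5, 4, 6)]
logp_new = [lp + 0.02 * (i + 1) for i, lp in enumerate(logp_old)]
loss = grpo_loss(logp_new, logp_old, rewards=[1, 0, 1, 0])
```

Note that every token in a completion shares the same scalar `A`, which is exactly the task-level credit assignment discussed below.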
The \(G \to \infty\) Limit: REINFORCE with a Baseline
What happens when the group size grows? By the law of large numbers, the group mean converges to the true expected reward under the current policy:
\[\frac{1}{G}\sum_{j=1}^{G} r(q, o_j) \xrightarrow{G \to \infty} \mathbb{E}_{o \sim \pi_\theta(\cdot \vert q)}[r(q, o)]\]Note that this is not a state value function \(V(s)\) in the RL sense — there is no multi-step MDP here. GRPO operates at the task level: one prompt in, one complete response out, one reward. The expectation above is simply the average reward the current policy would get on this prompt if it generated infinitely many responses.
In this limit the GRPO advantage (ignoring the std normalization) becomes:
\[\hat{A}_i \;\to\; r(q, o_i) - \mathbb{E}_{o \sim \pi_\theta}[r(q, o)]\]This is a constant baseline subtraction — the same idea as REINFORCE with a baseline, but the baseline is prompt-specific and estimated purely by sampling. A learned critic in standard RLHF serves the same role (predicting the expected reward for a given input), but it does so with a neural network trained across the entire dataset. GRPO replaces that network with a sample mean computed from the \(G\) completions at hand.
The difference is practical. A learned critic accumulates information across the entire training run: every prompt it has ever seen refines its predictions. GRPO’s group mean estimates the expected reward from scratch each time, using only the \(G\) completions sampled for that prompt in that batch. With finite \(G\) (DeepSeek uses \(G = 64\), DAPO uses \(G = 16\)), this estimate is noisy — and the noise is the price of not maintaining a critic. Whether the simplicity is worth the noise depends on the task: for math and code with reliable verifiers, the answer has been a clear yes.
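The noise of the group-mean baseline is easy to quantify. A quick simulation, assuming a prompt whose true success rate under a binary verifier is 0.7, shows the standard error of the baseline shrinking as \(G\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.7   # true expected reward on this prompt (binary verifier)
trials = 2000  # number of simulated groups per setting

for G in (4, 16, 64, 256):
    # Each group mean is a Monte Carlo estimate of p_true.
    means = rng.binomial(G, p_true, size=trials) / G
    # Theory: standard error = sqrt(p(1-p)/G).
    print(f"G={G:4d}  empirical SE={means.std():.3f}  "
          f"theory={np.sqrt(p_true * (1 - p_true) / G):.3f}")
```

The error falls as \(1/\sqrt{G}\), which is why quadrupling the group size only halves the baseline noise.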
The Fundamental Limitation: Task-Level Credit
The advantage \(\hat{A}_i\) is task-level: every token in completion \(o_i\) receives the same advantage. GRPO cannot distinguish which part of a response was responsible for success or failure — it only knows that this completion, as a whole, was better or worse than its peers. For single-turn tasks (solve a math problem, write a function) this is acceptable. For multi-step agent tasks, it is not — see the ArCHer post for a detailed comparison.
When Does GRPO Learn Nothing?
GRPO’s learning signal depends entirely on outcome diversity within the group:
- All \(G\) completions correct (pass rate 100%): all rewards identical → \(\hat{A}_i = 0\) for all \(i\) → zero gradient
- All \(G\) completions wrong (pass rate 0%): same situation → zero gradient
- Mixed outcomes: some succeed, some fail → nonzero advantages → useful gradient
As training progresses and the model improves, more prompts become “too easy” (pass rate → 100%) and stop contributing signal. This is the central failure mode that DAPO addresses.
DAPO: Dynamic Sampling Policy Optimization
DAPO (ByteDance Seed, 2025) targets two failure modes of naive GRPO, entropy collapse and the vanishing gradient signal described above, and introduces four techniques to combat them. A fifth, less obvious change: DAPO completely removes the KL penalty (\(\beta = 0\)). The reasoning is that for long chain-of-thought models, the policy is expected to diverge significantly from the SFT initialization — the KL constraint actively fights against the distribution shift that training is trying to achieve. Instead of relying on KL to prevent collapse, DAPO uses the mechanisms below.
1. Dynamic Sampling
The single most impactful contribution (+8 points on AIME 2024). During rollout, each prompt gets \(G\) sampled completions. If the number of correct completions is 0 or \(G\), the prompt is discarded and a new prompt is sampled to replace it. The filtering criterion is:
\[0 < \vert\{o_i : \text{correct}(o_i)\}\vert < G\]This ensures every prompt in the training batch has mixed outcomes — guaranteeing nonzero gradient signal. As the model improves and more prompts become trivial, dynamic sampling automatically shifts the training distribution toward the model’s learning frontier: problems that are neither too easy nor too hard.
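The filtering loop can be sketched as follows; `sample_group` and `correct` are hypothetical stand-ins for the rollout engine and the verifier (real implementations keep resampling from the data stream until the batch is full):

```python
def dynamic_sample_batch(prompts, sample_group, correct, G=16, batch_size=8):
    """Keep only prompts whose G rollouts have mixed outcomes (0 < n_correct < G)."""
    kept = []
    for q in prompts:
        group = sample_group(q, G)  # G completions for prompt q
        n_correct = sum(correct(q, o) for o in group)
        if 0 < n_correct < G:       # DAPO filtering criterion
            kept.append((q, group))
        if len(kept) == batch_size:
            break
    return kept

# Toy setting: a "completion" is a coin flip whose success rate is the
# prompt's difficulty, so q=0.0 and q=1.0 always produce uniform outcomes.
import random
rng = random.Random(0)
sample_group = lambda q, G: [rng.random() < q for _ in range(G)]
correct = lambda q, o: o
batch = dynamic_sample_batch([0.0, 0.5, 1.0] * 10, sample_group, correct,
                             G=16, batch_size=4)
```

In this toy run only the medium-difficulty prompts survive the filter, which is precisely the "learning frontier" behavior described above.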
2. Clip-Higher (Asymmetric Clipping)
Standard PPO clips the importance ratio symmetrically: \([1-\varepsilon, 1+\varepsilon] = [0.8, 1.2]\). DAPO decouples the bounds:
\[\varepsilon_{\text{low}} = 0.2, \quad \varepsilon_{\text{high}} = 0.28 \quad \Rightarrow \quad [0.8, 1.28]\]The asymmetry lets low-probability tokens increase more freely, counteracting entropy collapse. The policy can explore more while maintaining a stable lower bound.
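A toy illustration of the asymmetry (a sketch, not DAPO's implementation): a rare token whose probability doubles has ratio 2.0, and clip-higher grants it 40% more headroom above 1.0 than the symmetric clip does.

```python
import numpy as np

def clipped_ratio(rho, eps_low=0.2, eps_high=0.28):
    """Clip-Higher: decoupled bounds [1 - eps_low, 1 + eps_high]."""
    return np.clip(rho, 1 - eps_low, 1 + eps_high)

rho = 0.02 / 0.01                 # rare token moving from p=0.01 to p=0.02
print(clipped_ratio(rho))         # asymmetric clip allows up to 1.28
print(np.clip(rho, 0.8, 1.2))     # symmetric PPO clip stops at 1.2
```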
3. Token-Level Loss Normalization
Standard GRPO averages token losses within each sample, then averages across samples — giving short and long responses equal weight. DAPO averages directly across all tokens in the batch:
\[\mathcal{L} = \frac{1}{\sum_i \vert o_i \vert} \sum_{i=1}^{G} \sum_{l=1}^{\vert o_i \vert} \ell(o_i^l)\]This gives longer sequences proportionally more influence, properly penalizing verbose low-quality outputs.
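The difference between the two normalizations is visible with two completions of very different lengths. A minimal comparison, assuming per-token losses are already computed:

```python
import numpy as np

# Two completions: a short one (4 tokens) and a long one (100 tokens),
# each with a constant per-token loss for illustration.
short = np.full(4, 1.0)    # high per-token loss
long_ = np.full(100, 0.1)  # low per-token loss

# Sample-level (GRPO): average within each completion, then across completions.
sample_level = np.mean([short.mean(), long_.mean()])  # (1.0 + 0.1) / 2 = 0.55

# Token-level (DAPO): average over all tokens in the batch.
token_level = np.concatenate([short, long_]).mean()   # 14.0 / 104 ≈ 0.135
```

Under sample-level averaging the 4-token completion carries as much weight as the 100-token one; token-level averaging weights each token equally, so long responses dominate in proportion to their length.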
4. Overlong Reward Shaping
Truncated outputs (hitting max length) receive a soft graduated penalty rather than being discarded or harshly penalized. In the last \(L_{\text{cache}}\) tokens before the max length, the reward linearly decreases toward \(-1\). This discourages unnecessary verbosity without punishing responses that are on the right track but ran out of budget.
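The shaping can be sketched as a piecewise-linear penalty added to the task reward; the `L_max` and `L_cache` values here are illustrative defaults, not the paper's exact configuration:

```python
def overlong_penalty(length, L_max=16384, L_cache=4096):
    """Soft length penalty in the spirit of DAPO's overlong reward shaping.

    0 for lengths up to L_max - L_cache, then linearly decreasing to -1
    at L_max; truncated responses receive the full -1 penalty.
    """
    start = L_max - L_cache
    if length <= start:
        return 0.0
    if length >= L_max:
        return -1.0
    return -(length - start) / L_cache
```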
Ablation Results (AIME 2024)
| Configuration | Score |
|---|---|
| Naive GRPO | 30 |
| + Overlong Filtering | 36 |
| + Clip-Higher | 38 |
| + Token-Level Loss | 42 |
| + Dynamic Sampling (full DAPO) | 50 |
| DeepSeek-R1-Zero (2x training steps) | 47 |
Dynamic sampling alone contributes +8 points — the largest single improvement — validating that the core bottleneck in GRPO training is wasted compute on prompts with uniform outcomes.
GSPO: Group Sequence Policy Optimization
GSPO (Alibaba/Qwen, 2025) takes a different angle: rather than fixing GRPO’s sampling strategy, it fixes GRPO’s optimization target. The core claim is that GRPO’s per-token importance ratios are a theoretically unsound application of importance sampling, introducing noise that accumulates with response length.
The Problem with Token-Level Ratios
The policy gradient theorem operates at the sequence level — the quantity we want to optimize is:
\[\nabla_\theta \, \mathbb{E}_{o \sim \pi_{\theta_{\text{old}}}} \!\left[\frac{\pi_\theta(o \vert q)}{\pi_{\theta_{\text{old}}}(o \vert q)} R(o)\right]\]The correct importance sampling ratio for a full sequence factorizes as a product of per-token ratios:
\[\frac{\pi_\theta(o_i \vert q)}{\pi_{\theta_{\text{old}}}(o_i \vert q)} = \prod_{l=1}^{\vert o_i \vert} \rho_i^l\]GRPO, however, clips each \(\rho_i^l\) independently to \([1-\varepsilon,\, 1+\varepsilon]\). This does not bound the sequence-level ratio — even after clipping, the effective product can range over:
\[\prod_{l=1}^{\vert o_i \vert} \text{clip}(\rho_i^l) \;\in\; \left[(1-\varepsilon)^{\vert o_i \vert},\; (1+\varepsilon)^{\vert o_i \vert}\right]\]For a typical response of \(\vert o_i \vert = 1024\) tokens and \(\varepsilon = 0.2\), the upper bound is \(1.2^{1024} \approx 10^{81}\). The “trust region” that token-level clipping defines is essentially unbounded at the sequence level, and the resulting noise in the gradient estimate grows with sequence length.
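The blow-up is easy to reproduce. In the toy simulation below, every per-token ratio is clipped to \([0.8, 1.2]\), yet a mild systematic drift (assumed mean log-ratio of 0.05 per token) still compounds into an enormous sequence-level ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, length = 0.2, 1024

# Per-token ratios with a slight upward drift (mean log-ratio 0.05),
# as when the policy systematically upweights a sampled response.
log_rho = rng.normal(loc=0.05, scale=0.05, size=length)
rho_clipped = np.clip(np.exp(log_rho), 1 - eps, 1 + eps)

seq_ratio = np.prod(rho_clipped)  # effective sequence-level ratio
print(seq_ratio)                  # huge despite every factor being clipped
print((1 + eps) ** length)        # worst case: 1.2^1024 ≈ 1e81
```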
Why is the sequence-level ratio the theoretically correct one? The core argument starts from the goal of IS correction. We want to convert the expectation from the old policy \(\pi_{\theta_{\text{old}}}\) to the current policy \(\pi_\theta\):
\[\mathbb{E}_{\pi_\theta}\!\left[R(o)\right] = \mathbb{E}_{\pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(o \vert q)}{\pi_{\theta_{\text{old}}}(o \vert q)} R(o)\right]\]The expectation here is over complete sequences \(o\), and the IS ratio \(\pi_\theta(o \vert q) / \pi_{\theta_{\text{old}}}(o \vert q)\) is a sequence-level quantity. Decomposing it into \(\prod_l \rho_l\) merely exploits the autoregressive factorization — it does not mean each factor \(\rho_l\) is an independent IS correction. When GRPO clips each \(\rho_l\) independently, it constrains the individual pieces of a factorized quantity, but the product of these constraints is not equivalent to constraining the original sequence-level ratio. In other words, token-level clipping constrains artificially separated fragments, not the whole object that IS correction actually needs to bound.
More concretely, GRPO’s advantage \(\hat{A}_i\) is a trajectory-level constant — it depends only on \(r(\tau^i)\) and is the same for every token within a sequence. The gradient contribution at position \(l\) takes the form \(\hat{A}_i \cdot \rho_l \cdot \nabla_\theta \ln \pi_\theta(a_l)\), where \(\hat{A}_i\) does not depend on any single token’s distribution. Since the advantage itself is sequence-level, the natural granularity of IS correction should also be sequence-level: what we need to correct is “how much more likely is this entire trajectory under \(\pi_\theta\) relative to \(\pi_{\theta_{\text{old}}}\),” not “how much did each token shift individually.” GSPO’s geometric mean \(s_i = (\prod_l \rho_l)^{1/\vert o_i \vert}\) is a monotonic transformation of the sequence-level log-ratio \(\frac{1}{\vert o_i \vert}\ln \frac{\pi_\theta(o_i)}{\pi_{\theta_{\text{old}}}(o_i)}\), so clipping it directly constrains the sequence-level distributional shift.
This argument also aligns with the trust-region intuition: PPO’s trust region is fundamentally a constraint on the KL divergence between policies — a sequence-level quantity. Token-level clipping is merely an approximation, one that can become arbitrarily loose on long sequences; sequence-level clipping imposes the constraint at the correct level of abstraction.
Sequence-Level Importance Ratio
GSPO replaces the per-token ratio with a length-normalized sequence-level ratio:
\[s_i(\theta) = \left(\frac{\pi_\theta(o_i \vert q)}{\pi_{\theta_{\text{old}}}(o_i \vert q)}\right)^{1/\vert o_i \vert} = \exp\!\left(\frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \ln \rho_i^l\right)\]This is the geometric mean of per-token ratios across the full response — a single scalar per completion, not per token. The clipped objective becomes:
\[\mathcal{L}_{\text{GSPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \min\!\left(s_i(\theta)\, \hat{A}_i,\; \text{clip}\!\left(s_i(\theta), 1\!-\!\varepsilon, 1\!+\!\varepsilon\right) \hat{A}_i\right)\]
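A minimal sketch of the sequence-level ratio and clipped objective, reusing the toy setup from the GRPO example; the `clip_eps` value is illustrative (the GSPO paper uses a much tighter clipping range than PPO's):

```python
import numpy as np

def gspo_seq_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: geometric mean of per-token ratios."""
    lp_new, lp_old = np.asarray(logp_new), np.asarray(logp_old)
    return np.exp((lp_new - lp_old).mean())

def gspo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped sequence-level objective; one scalar ratio per completion."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)  # same group-normalized advantage
    s = np.array([gspo_seq_ratio(n, o) for n, o in zip(logp_new, logp_old)])
    clipped = np.clip(s, 1 - clip_eps, 1 + clip_eps)
    return -np.mean(np.minimum(s * adv, clipped * adv))

# Toy group of G=4 completions of different lengths with binary rewards.
logp_old = [np.log(np.full(n, 0.5)) for n in (3, 5, 4, 6)]
logp_new = [lp + 0.05 for lp in logp_old]
loss = gspo_loss(logp_new, logp_old, rewards=[1, 0, 1, 0])
```

The structural change from the GRPO version is small but decisive: clipping acts once per completion on `s`, not once per token, so the trust region is enforced on the quantity the advantage actually corresponds to.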
Why This Matters
Moving to sequence-level ratios has several practical consequences:
- Higher effective clipping rate, better efficiency. GSPO’s clipping rate is two orders of magnitude higher than GRPO’s, yet it achieves better training efficiency. This suggests that much of GRPO’s token-level gradient is noise rather than useful signal.
- MoE stability. GRPO requires “Routing Replay” — caching and replaying expert routing patterns — to stabilize Mixture-of-Experts training. GSPO eliminates this need because it depends only on sequence-level likelihood, not individual token routing decisions.
- Infrastructure simplification. No need for token-level synchronization between inference and training engines; sequence-level log-probabilities suffice.
GSPO was used to train the Qwen3 model family.