Policy Optimization without a Critic: The GRPO Family

Motivation: Why Not Learn a Critic?

The standard recipe for RLHF — PPO with a learned value function — requires training and maintaining a critic alongside the policy. In the LM setting, this means a second copy of the base model with a scalar value head, doubling memory and introducing its own training challenges (representation drift, readout sensitivity, etc.). For single-turn tasks like math and code generation, a natural question arises: can we skip the critic entirely?

The answer is yes — if we are willing to trade a learned value estimate for a statistical one. Instead of training \(V_\phi(s)\) to predict expected return, we can sample multiple completions from the same prompt and use their empirical reward statistics as a baseline. This is the core idea behind GRPO, and the starting point for DAPO and GSPO.

GRPO: Group Relative Policy Optimization

GRPO (DeepSeek, 2024) replaces the learned critic with group-level sampling statistics. Given a prompt \(q\), sample a group of \(G\) completions \(\{o_1, \ldots, o_G\}\) from the current policy \(\pi_\theta\), score each with a reward function \(r(q, o_i)\), and compute normalized advantages:

\[\hat{A}_i = \frac{r(q, o_i) - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)}\]

The “value function” is the group mean — a Monte Carlo estimate that converges to \(\mathbb{E}_\pi[r \vert q]\) as \(G \to \infty\). No learned parameters, no value head, no critic training loop.
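
The advantage computation is small enough to sketch directly (a minimal illustration, not DeepSeek's implementation; the epsilon in the denominator is an assumption to guard against zero-variance groups):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / std over the G completions.

    `eps` guards against zero-variance groups (all rewards equal), in which
    case every advantage is ~0 and the group contributes no gradient.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# A group of G = 4 completions for one prompt, scored by a binary verifier:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# correct completions get positive advantage, wrong ones negative
```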

The GRPO loss applies PPO-style clipping with per-token importance ratios:

\[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \left[\min\!\left(\rho_i^l \, \hat{A}_i,\; \text{clip}(\rho_i^l, 1\!-\!\varepsilon, 1\!+\!\varepsilon)\, \hat{A}_i\right) - \beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\right]\]

where:

  • \(\rho_i^l = \frac{\pi_\theta(o_i^l \vert q, o_i^{<l})}{\pi_{\theta_{\text{old}}}(o_i^l \vert q, o_i^{<l})}\) is the per-token importance ratio — the probability of the \(l\)-th token of the \(i\)-th completion under the current policy \(\pi_\theta\) divided by its probability under the rollout policy \(\pi_{\theta_{\text{old}}}\). A ratio \(\rho > 1\) means the current policy assigns higher probability to this token than the rollout policy did; \(\rho < 1\) means lower.
  • \(\hat{A}_i\) is the group-normalized advantage for completion \(o_i\), shared across all tokens \(l\) in that completion. This is the key simplification: GRPO assigns a single scalar credit to the entire response, not per-token.
  • \(\text{clip}(\rho, 1-\varepsilon, 1+\varepsilon)\) clamps the ratio to \([1-\varepsilon, 1+\varepsilon]\) (typically \([0.8, 1.2]\)), capping how much any single token's ratio can move the objective in one update — the same mechanism as PPO.
  • \(\beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})\) is a KL penalty against a reference policy \(\pi_{\text{ref}}\) (typically the SFT model), preventing the policy from drifting too far from the pretrained distribution.
  • The outer average \(\frac{1}{G} \sum_i \frac{1}{\vert o_i \vert} \sum_l\) first averages over tokens within each completion, then averages over completions in the group.
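
Putting the pieces together, the clipped surrogate (KL term omitted) can be sketched as follows; the list-of-arrays representation and function names are illustrative assumptions:

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped GRPO surrogate loss for one group (KL penalty omitted).

    logp_new, logp_old: lists of per-token log-prob arrays, one per completion.
    advantages: one scalar per completion, shared by all of its tokens.
    """
    per_completion = []
    for lp_new, lp_old, A in zip(logp_new, logp_old, advantages):
        rho = np.exp(lp_new - lp_old)                  # per-token ratio
        unclipped = rho * A
        clipped = np.clip(rho, 1 - eps, 1 + eps) * A
        # average over tokens within the completion...
        per_completion.append(np.minimum(unclipped, clipped).mean())
    # ...then average over completions in the group; loss = negative objective
    return -np.mean(per_completion)
```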

The \(G \to \infty\) Limit: REINFORCE with a Baseline

What happens when the group size grows? By the law of large numbers, the group mean converges to the true expected reward under the current policy:

\[\frac{1}{G}\sum_{j=1}^{G} r(q, o_j) \xrightarrow{G \to \infty} \mathbb{E}_{o \sim \pi_\theta(\cdot \vert q)}[r(q, o)]\]

Note that this is not a state value function \(V(s)\) in the RL sense — there is no multi-step MDP here. GRPO operates at the task level: one prompt in, one complete response out, one reward. The expectation above is simply the average reward the current policy would get on this prompt if it generated infinitely many responses.

In this limit the GRPO advantage (ignoring the std normalization) becomes:

\[\hat{A}_i \;\to\; r(q, o_i) - \mathbb{E}_{o \sim \pi_\theta}[r(q, o)]\]

This is a constant baseline subtraction — the same idea as REINFORCE with a baseline, but the baseline is prompt-specific and estimated purely by sampling. A learned critic in standard RLHF serves the same role (predicting the expected reward for a given input), but it does so with a neural network trained across the entire dataset. GRPO replaces that network with a sample mean computed from the \(G\) completions at hand.

The difference is practical. A learned critic accumulates information across the entire training run: every prompt it has ever seen refines its predictions. GRPO’s group mean estimates the expected reward from scratch each time, using only the \(G\) completions sampled for that prompt in that batch. With finite \(G\) (DeepSeek uses \(G = 64\), DAPO uses \(G = 16\)), this estimate is noisy — and the noise is the price of not maintaining a critic. Whether the simplicity is worth the noise depends on the task: for math and code with reliable verifiers, the answer has been a clear yes.

The Fundamental Limitation: Task-Level Credit

The advantage \(\hat{A}_i\) is task-level: every token in completion \(o_i\) receives the same advantage. GRPO cannot distinguish which part of a response was responsible for success or failure — it only knows that this completion, as a whole, was better or worse than its peers. For single-turn tasks (solve a math problem, write a function) this is acceptable. For multi-step agent tasks, it is not — see the ArCHer post for a detailed comparison.

When Does GRPO Learn Nothing?

GRPO’s learning signal depends entirely on outcome diversity within the group:

  • All \(G\) completions correct: all rewards identical → \(\hat{A}_i = 0\) for all \(i\) → zero gradient
  • All \(G\) completions wrong: same situation → zero gradient
  • Mixed outcomes: some succeed, some fail → nonzero advantages → useful gradient
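
A quick numerical check of the degenerate cases (a self-contained sketch of the group normalization; the epsilon guarding the zero-variance division is an assumption):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

all_correct = grpo_advantages([1.0, 1.0, 1.0, 1.0])   # uniform outcomes
mixed       = grpo_advantages([1.0, 0.0, 1.0, 1.0])   # mixed outcomes

# uniform rewards -> zero advantages -> zero gradient for the whole group;
# only the mixed group carries any learning signal
```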

As training progresses and the model improves, more prompts become “too easy” (pass rate → 100%) and stop contributing signal. This is the central failure mode that DAPO addresses.

DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization

DAPO (ByteDance Seed, 2025) identifies entropy collapse as the central failure mode of naive GRPO and introduces four techniques to combat it. A fifth, less obvious change: DAPO completely removes the KL penalty (\(\beta = 0\)). The reasoning is that for long chain-of-thought models, the policy is expected to diverge significantly from the SFT initialization — the KL constraint actively fights against the distribution shift that training is trying to achieve. Instead of relying on KL to prevent collapse, DAPO uses the mechanisms below.

1. Dynamic Sampling

The single most impactful contribution (+8 points on AIME 2024). During rollout, each prompt gets \(G\) sampled completions. If the number of correct completions is 0 or \(G\), the prompt is discarded and a new prompt is sampled to replace it. The filtering criterion is:

\[0 < \vert\{o_i : \text{correct}(o_i)\}\vert < G\]

This ensures every prompt in the training batch has mixed outcomes — guaranteeing nonzero gradient signal. As the model improves and more prompts become trivial, dynamic sampling automatically shifts the training distribution toward the model’s learning frontier: problems that are neither too easy nor too hard.
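
A minimal sketch of the filtering loop (the names `sample_prompt`, `rollout`, and `is_correct` are placeholder assumptions, not the paper's API):

```python
def sample_effective_batch(sample_prompt, rollout, is_correct, batch_size, G):
    """Keep sampling prompts until `batch_size` of them have mixed outcomes.

    A prompt survives only if 0 < #correct < G, so every group in the
    effective batch has at least one positive and one negative advantage.
    """
    batch = []
    while len(batch) < batch_size:
        prompt = sample_prompt()
        completions = [rollout(prompt) for _ in range(G)]
        n_correct = sum(is_correct(prompt, o) for o in completions)
        if 0 < n_correct < G:                 # DAPO's filtering criterion
            batch.append((prompt, completions))
    return batch
```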

2. Clip-Higher (Asymmetric Clipping)

Standard PPO clips the importance ratio symmetrically: \([1-\varepsilon, 1+\varepsilon] = [0.8, 1.2]\). DAPO decouples the bounds:

\[\varepsilon_{\text{low}} = 0.2, \quad \varepsilon_{\text{high}} = 0.28 \quad \Rightarrow \quad [0.8, 1.28]\]

The asymmetry lets low-probability tokens increase more freely, counteracting entropy collapse. The policy can explore more while maintaining a stable lower bound.
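
The decoupled bounds amount to a one-line change in the clipping step (a sketch; the default `eps_low`/`eps_high` values follow DAPO's reported settings):

```python
import numpy as np

def clip_higher_surrogate(rho, A, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate term with DAPO's asymmetric clipping range.

    The upper bound 1 + eps_high > 1 + eps_low lets positive-advantage,
    low-probability tokens grow further before their gradient is cut off.
    """
    clipped = np.clip(rho, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(rho * A, clipped * A)

# A positive-advantage token at rho = 1.25 is clipped under the symmetric
# range [0.8, 1.2] but still inside DAPO's [0.8, 1.28]:
symmetric = clip_higher_surrogate(np.array([1.25]), 1.0, eps_high=0.2)
dapo      = clip_higher_surrogate(np.array([1.25]), 1.0)
```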

3. Token-Level Loss Normalization

Standard GRPO averages token losses within each sample, then averages across samples — giving short and long responses equal weight. DAPO averages directly across all tokens in the batch:

\[\mathcal{L} = \frac{1}{\sum_i \vert o_i \vert} \sum_{i=1}^{G} \sum_{l=1}^{\vert o_i \vert} \ell(o_i^l)\]

This gives longer sequences proportionally more influence, properly penalizing verbose low-quality outputs.
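
The difference between the two normalizations is easiest to see on a toy batch with one short and one long completion (the per-token losses here are made-up numbers):

```python
import numpy as np

# per-token losses for a short and a long completion (illustrative values)
short = np.array([1.0, 1.0])               # 2 tokens, high loss each
long_ = np.array([0.1] * 8)                # 8 tokens, low loss each

# GRPO: average within each sample, then across samples -> equal weight
sample_level = np.mean([short.mean(), long_.mean()])   # (1.0 + 0.1) / 2

# DAPO: average over all tokens in the batch -> length-weighted
token_level = np.concatenate([short, long_]).mean()    # 2.8 / 10
```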

4. Overlong Reward Shaping

Truncated outputs (hitting max length) receive a soft graduated penalty rather than being discarded or harshly penalized. In the last \(L_{\text{cache}}\) tokens before the max length, the reward linearly decreases toward \(-1\). This discourages unnecessary verbosity without punishing responses that are on the right track but ran out of budget.
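
The penalty can be sketched as a piecewise-linear function of response length (a sketch of the punish-interval idea; the exact parameterization in the paper may differ):

```python
def overlong_penalty(length, max_len, cache):
    """Soft length penalty: 0 until the last `cache` tokens, then a linear
    ramp down to -1 at `max_len` and beyond."""
    threshold = max_len - cache
    if length <= threshold:
        return 0.0
    # inside the punish interval: linear from 0 down to -1
    return max((threshold - length) / cache, -1.0)

# max length 100, punish interval covering the last 20 tokens:
#   length 70 -> no penalty, length 90 -> halfway down, length 100 -> -1
```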

Ablation Results (AIME 2024)

| Configuration | AIME 2024 score |
| --- | --- |
| Naive GRPO | 30 |
| + Overlong Filtering | 36 |
| + Clip-Higher | 38 |
| + Token-Level Loss | 42 |
| + Dynamic Sampling (full DAPO) | 50 |
| DeepSeek-R1-Zero (2× training steps) | 47 |

Dynamic sampling alone contributes +8 points — the largest single improvement — validating that the core bottleneck in GRPO training is wasted compute on prompts with uniform outcomes.

GRPO vs GSPO (interactive figure in the original post): token-level clipping allows the sequence-level ratio to drift unconstrained, while GSPO's geometric mean provides a direct bound; the noise accumulates with length, and GSPO clips roughly 100x more often yet trains better.

GSPO: Group Sequence Policy Optimization

GSPO (Alibaba/Qwen, 2025) takes a different angle: rather than fixing GRPO’s sampling strategy, it fixes GRPO’s optimization target. The core claim is that GRPO’s per-token importance ratios are a theoretically unsound application of importance sampling, introducing noise that accumulates with response length.

The Problem with Token-Level Ratios

In GRPO, each token has its own importance ratio \(\rho_i^l\) and is clipped independently. But the policy gradient theorem operates at the sequence level — the quantity we want to optimize is the expected reward over full completions, not individual tokens. Token-level clipping does not correspond to any valid importance sampling estimator, and the resulting noise grows with sequence length.

Sequence-Level Importance Ratio

GSPO replaces the per-token ratio with a length-normalized sequence-level ratio:

\[s_i(\theta) = \left(\frac{\pi_\theta(o_i \vert q)}{\pi_{\theta_{\text{old}}}(o_i \vert q)}\right)^{1/\vert o_i \vert}\]

This is the geometric mean of per-token ratios across the full response — a single scalar per completion, not per token. The clipped objective becomes:

\[\mathcal{J}_{\text{GSPO}}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \min\!\left(s_i(\theta)\,\hat{A}_i,\; \text{clip}(s_i(\theta), 1\!-\!\varepsilon, 1\!+\!\varepsilon)\,\hat{A}_i\right)\]
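
Computed in log space for numerical stability, the sequence-level ratio and clipped objective look like this (a sketch; the list-of-arrays representation is an assumption):

```python
import numpy as np

def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """GSPO clipped objective for one group of completions.

    logp_new, logp_old: lists of per-token log-prob arrays, one per completion.
    s_i is the geometric mean of the per-token ratios: one scalar ratio and
    one clipping decision per completion, not per token.
    """
    terms = []
    for lp_new, lp_old, A in zip(logp_new, logp_old, advantages):
        s = np.exp(np.mean(lp_new - lp_old))   # length-normalized seq ratio
        clipped = np.clip(s, 1 - eps, 1 + eps)
        terms.append(min(s * A, clipped * A))
    return np.mean(terms)                      # objective (to be maximized)
```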

Why This Matters

Moving to sequence-level ratios has several practical consequences:

  1. Higher effective clipping rate, better efficiency. GSPO’s clipping rate is two orders of magnitude higher than GRPO’s, yet it achieves better training efficiency — evidence that much of GRPO’s token-level gradient is noise rather than useful signal.

  2. MoE stability. GRPO requires “Routing Replay” — caching and replaying expert routing patterns — to stabilize Mixture-of-Experts training. GSPO eliminates this need because it depends only on sequence-level likelihood, not individual token routing decisions.

  3. Infrastructure simplification. No need for token-level synchronization between inference and training engines; sequence-level log-probabilities suffice.

GSPO was used to train the Qwen3 model family.

Summary: Three Approaches, One Theme

All three methods share a common premise: for tasks with a reliable reward signal (math, code), a learned critic is unnecessary overhead. The group sampling baseline — \(\hat{V}(q) \approx \text{mean}(\{r(q, o_i)\})\) — is simple, requires no additional parameters, and costs nothing beyond the rollouts training already generates.

Where they differ is in what they identify as the bottleneck:

| Method | Core insight | Key fix |
| --- | --- | --- |
| GRPO | Sample mean replaces learned \(V_\phi\) | Group normalization of advantages |
| DAPO | Training signal degrades as the model improves (easy prompts → zero gradient) | Dynamic sampling: filter prompts, maintain the learning frontier |
| GSPO | Token-level importance ratios introduce length-dependent noise | Sequence-level geometric-mean ratio |

The limitations are also shared: all three provide task-level credit only. Every token in a completion receives the same advantage. For multi-step agent tasks where individual steps matter, a turn-level critic remains necessary.