Adaptive Sampling and Curriculum Methods

This post focuses on adaptive compute allocation in group-sampling-based RL for LLMs (GRPO, RAFT, etc.), with the core question: when different prompts have vastly different pass rates, how should we allocate sampling compute across them? Papers covered include Reinforce-Ada (Xiong et al., 2025) and Gradient Variance Minimization.

The Problem: Group Sampling and Signal Recovery

Why group sampling loses signal

Notation. Throughout this post: \(x\) denotes a prompt (the input question), \(a\) denotes a response (a model-generated rollout), \(\pi_\theta(a \vert x)\) is the policy (LLM), and \(r(x, a) \in \{0, 1\}\) is a binary reward (correct or not). For each prompt, we sample a group of \(n\) responses and use their rewards to estimate a learning signal.

RAFT, RLOO, GRPO, and online DPO all rely on this group sampling to estimate learning signals — for each prompt, sample a group of \(n\) responses, then use the reward distribution within the group to construct a gradient. But they all share a blind spot: no learning signal on prompts with uniform rewards.

To see why, let \(A^+\) and \(A^-\) denote the correct and incorrect rollouts within a group. The RAFT gradient only trains on correct rollouts (rejection sampling):

\[g_\theta^{RAFT}(x) = \sum_{a_i \in A^+} \nabla_\theta \log \pi_\theta(a_i \vert x) - 0 \cdot \sum_{a_j \in A^-} \nabla_\theta \log \pi_\theta(a_j \vert x)\]

The GRPO gradient uses the group mean \(\mu_r\) and standard deviation \(\sigma_r\) to normalize rewards into per-response weights:

\[g_\theta^{GRPO}(x) = \sum_{a_i \in A} \frac{r(x, a_i) - \mu_r}{\sigma_r + \epsilon} \cdot \nabla_\theta \log \pi_\theta(a_i \vert x)\]

If all \(n\) responses in the group are correct (\(r = 1\) for all), then \(\mu_r = 1\) and \(\sigma_r = 0\) — every GRPO weight is \(0/0\), and the gradient vanishes. Similarly, if all are wrong, \(\mu_r = 0\) and \(\sigma_r = 0\). For RAFT, if no response is correct, \(A^+\) is empty and there is literally nothing to train on. For DPO, which needs a winner-loser pair: if all responses have the same reward, no pair can be formed.

During GRPO training, the ratio of “effective prompts” (those with mixed rewards in their group) fluctuates wildly and can be as low as 30% — the model masters easy prompts quickly (all-correct groups), while still failing on hard ones (all-wrong groups). In both cases, compute is wasted with zero learning signal.

Figure: effective prompt ratio over GRPO training steps. Models master easy prompts while still struggling with hard ones.

Larger groups help, but uniform scaling is too expensive

A larger group size uncovers more non-trivial reward signals. Pass@K curves show that with \(K = 256\), models can solve 81.3% of problems, and increasing \(K\) from 4 to 32 adds 16.9% coverage, while going from 32 to 256 adds only 1.7%.

Figure: Pass@k curves (accuracy vs. number of attempts \(k\)) for Qwen2.5-Math-1.5B-base and Qwen2.5-Math-1.5B-RL, and training reward vs. steps for GRPO-n4, GRPO-n8, and GRPO-n16.

However, uniformly enlarging the group size is slow: GRPO-n16 reaches the highest final reward as a function of training steps, but it is the slowest in wall-clock time.

Figure: reward vs. wall-clock training time (hours) for GRPO-n4, GRPO-n8, and GRPO-n16.

Different prompts need different budgets

The Pass@K data reveals huge variance across prompts. We can partition prompts into three tiers based on when they first yield a correct answer:

| Tier | Solved by | Fraction of prompts | Samples needed |
|------|-----------|---------------------|----------------|
| Easy | \(K = 4\) | 62.7% | 4 |
| Medium | \(K = 32\) (but not 4) | 16.9% | 32 |
| Hard | \(K = 256\) (but not 32) | 1.7% | 256 |

With an adaptive strategy — start with \(K = 4\), escalate to \(K = 32\) only for prompts that got no correct answer, then to \(K = 256\) for the prompts that remain unsolved — the expected average cost per prompt is:

\[4 \times 0.627 + 32 \times 0.169 + 256 \times 0.017 \approx 2.5 + 5.4 + 4.4 = 12.3 < 13\]

Compare this to uniform \(K = 256\) for all prompts, which costs 256 per prompt — adaptive sampling achieves similar coverage at \(\sim 5\%\) of the cost. The principle: only increase group size when extra samples can reveal new learning signals.
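The back-of-envelope cost comparison above can be reproduced in a few lines. This is a sketch using the tier fractions from the table; like the text, it charges each solved prompt only its final tier's budget:

```python
# Expected per-prompt sampling cost of the three-tier escalation schedule.
# (samples, fraction of prompts) per tier, from the Pass@K table above.
tiers = [(4, 0.627), (32, 0.169), (256, 0.017)]

adaptive_cost = sum(k * frac for k, frac in tiers)
uniform_cost = 256  # every prompt gets K = 256 up front

print(f"adaptive: {adaptive_cost:.1f} samples/prompt")                  # ~12.3
print(f"fraction of uniform cost: {adaptive_cost / uniform_cost:.1%}")  # ~4.8%
```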

Figure: Pass@K on Open-R1 data — accuracy vs. number of attempts \(k\) for Qwen2.5-Math-1.5B-base and Qwen2.5-Math-1.5B-RL.

Reinforce-Ada: Adaptive Sequential Sampling

The algorithm: sample until useful signal appears

Since we don’t know in advance how many samples a prompt requires, the idea is to keep sampling until enough useful learning signal is recovered. Reinforce-Ada-Seq implements this as a sequential process: (1) initialize all prompts as active with a first batch of rollouts; (2) check whether each prompt already has enough signal; (3) keep sampling only the unresolved prompts, stopping at a maximum budget. Different stopping rules realize different implicit allocation schedules.

1. Initialize all prompts as active: every prompt gets a first batch of rollouts.
2. Check whether each prompt is already informative: eliminate the prompts with enough signal.
3. Keep sampling only the unresolved prompts: stop when a maximum budget is reached.
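The loop can be sketched as follows. This is a simulation under assumed interfaces — the `pass_rates` argument and Bernoulli draws stand in for real LLM rollouts — not the paper's implementation; the stopping rule shown is the balance rule discussed next.

```python
import random

def reinforce_ada_seq(pass_rates, batch=4, max_budget=64, stop_k=1, seed=0):
    """Sketch of the sequential sampling loop (assumed interface, not the
    paper's code). Bernoulli(pass_rate) draws stand in for LLM rollouts."""
    rng = random.Random(seed)
    pools = [[] for _ in pass_rates]       # collected binary rewards per prompt
    active = set(range(len(pass_rates)))   # step 1: every prompt starts active
    while active:
        for i in list(active):
            # sample another batch of rollouts for this prompt
            pools[i] += [int(rng.random() < pass_rates[i]) for _ in range(batch)]
            pos = sum(pools[i])
            neg = len(pools[i]) - pos
            # step 2: retire the prompt once it is informative (or over budget)
            if (pos >= stop_k and neg >= stop_k) or len(pools[i]) >= max_budget:
                active.discard(i)          # step 3: keep sampling only the rest
    return pools

pools = reinforce_ada_seq([0.9, 0.5, 0.05])
print([len(p) for p in pools])  # extreme pass rates tend to need more samples
```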

Stopping rules and implicit allocation

Two stopping rules are proposed, each with a different philosophy:

Ada-Seq-Pos: stop after collecting \(k\) positive (correct) samples. Since each sample is correct with probability \(p\), the expected number of samples needed to see \(k\) successes follows a negative binomial distribution:

\[\mathbb{E}[n] = \frac{k}{p}\]

This means easy prompts (high \(p\)) stop quickly, while hard prompts (low \(p\)) get more samples — naturally focusing compute on difficult problems. For example, with \(k = 4\): a prompt with \(p = 0.8\) needs only \(\sim 5\) samples on average, while a prompt with \(p = 0.05\) needs \(\sim 80\).

What is the negative binomial distribution? (Click to expand)

The negative binomial distribution models the number of i.i.d. Bernoulli trials needed to observe exactly \(k\) successes, where each trial succeeds with probability \(p\). If \(n\) is the total number of trials, then:

$$\Pr(n = m) = \binom{m-1}{k-1} p^k (1-p)^{m-k}, \quad m = k, k+1, \ldots$$

The expected value is \(\mathbb{E}[n] = k/p\). Intuitively: each success takes on average \(1/p\) trials, and we need \(k\) of them. The variance is \(\mathrm{Var}(n) = k(1-p)/p^2\), so harder prompts (small \(p\)) have both higher expected cost and higher variance in cost.
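A quick Monte Carlo check of \(\mathbb{E}[n] = k/p\) (pure simulation, no external dependencies):

```python
import random

def samples_until_k_successes(p, k, rng):
    """Count Bernoulli(p) trials until the k-th success (negative binomial)."""
    n = successes = 0
    while successes < k:
        n += 1
        successes += rng.random() < p
    return n

rng = random.Random(0)
k, trials = 4, 20000
for p in (0.8, 0.2, 0.05):
    mean_n = sum(samples_until_k_successes(p, k, rng) for _ in range(trials)) / trials
    print(f"p={p:0.2f}: empirical E[n]={mean_n:6.1f}, theory k/p={k / p:6.1f}")
```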

Ada-Seq-Balance: stop after collecting \(k\) positive and \(k\) negative samples. Here the expected stopping time is:

\[\mathbb{E}[n] \propto \frac{1}{p(1-p)}\]

This allocates the most samples to prompts with extreme pass rates (\(p \approx 0\) or \(p \approx 1\)), where it is hardest to find both positive and negative examples. Prompts near \(p = 0.5\) stop fastest because both outcomes appear frequently. The function \(1/(p(1-p))\) is U-shaped — minimal at \(p = 0.5\) (value = 4) and diverging as \(p \to 0\) or \(p \to 1\).
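The U-shaped allocation can be verified by simulating the balance rule's stopping time (a toy check, not the paper's code):

```python
import random

def balance_stop_time(p, k, rng):
    """Bernoulli(p) trials until we have k successes AND k failures."""
    n = pos = neg = 0
    while pos < k or neg < k:
        n += 1
        if rng.random() < p:
            pos += 1
        else:
            neg += 1
    return n

rng = random.Random(0)
k, trials = 2, 20000
means = {p: sum(balance_stop_time(p, k, rng) for _ in range(trials)) / trials
         for p in (0.05, 0.3, 0.5, 0.7, 0.95)}
for p, m in means.items():
    print(f"p={p:0.2f}: E[n] ~ {m:5.1f}")
# cheapest near p = 0.5, most expensive at the extremes: the U shape
```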

Infer more, train less

Training on all samples from adaptive sampling can be unstable: stochastic batch sizes cause OOM, too many negative samples can hurt or even collapse training, and training is more expensive than inference. The figure below compares two strategies on Qwen2.5-Math-1.5B: the blue curve uses coupled allocation (\(n_{\text{sample}} = 1/\hat{p},\; n_{\text{grad}} = 1/\sqrt{\hat{p}}\), where both sampling and training scale with difficulty), while the red curve decouples them (\(n_{\text{sample}} = 1/\hat{p},\; n_{\text{grad}} = 1\), i.e. adaptive sampling but fixed-size training). The decoupled approach trains significantly better.

Figure: Qwen2.5-Math-1.5B, reward (batch) vs. training steps for \(n_{\text{sample}} = 1/\sqrt{\hat p},\, n_{\text{grad}} = 1/\sqrt{\hat p}\) and \(n_{\text{sample}} = 1/p,\, n_{\text{grad}} = 1\).

The fix: decouple the sampling budget from the training budget. Adaptive sampling may generate many rollouts for a single hard prompt (e.g., 200 samples to find 4 correct ones), but we only need a small balanced subset for training. Specifically, for each prompt, sub-sample a fixed-size training batch of \(n/2\) positive and \(n/2\) negative samples:

\[\text{Sampling budget (variable): } n_i^{\text{sample}} \propto \frac{1}{p_i(1-p_i)} \quad \longrightarrow \quad \text{Training batch (fixed): } n^{\text{train}} = n\]

This “infer more, train less” design has three benefits: (1) fixed batch size avoids OOM from stochastic sizes; (2) balanced positive/negative ratio stabilizes training; (3) inference is much cheaper than training (forward pass only, no backward pass or optimizer step), so extra sampling is cost-effective.
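A minimal sketch of the sub-sampling step (the helper name and interface are hypothetical, not from the paper):

```python
import random

def build_training_batch(rollouts, rewards, k, rng):
    """Sub-sample a fixed-size balanced batch (k positives + k negatives)
    from an adaptively-sized pool. Hypothetical helper, not library code."""
    pos = [a for a, r in zip(rollouts, rewards) if r == 1]
    neg = [a for a, r in zip(rollouts, rewards) if r == 0]
    if len(pos) < k or len(neg) < k:
        return None  # prompt is uninformative; skip it this step
    return rng.sample(pos, k) + rng.sample(neg, k)

rng = random.Random(0)
# a hard prompt: 200 rollouts were needed to find just 4 correct ones
rewards = [1] * 4 + [0] * 196
rollouts = [f"rollout_{i}" for i in range(200)]
batch = build_training_batch(rollouts, rewards, k=2, rng=rng)
print(len(batch))  # 4: fixed training size regardless of the 200-sample pool
```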

Why is decoupling mathematically valid? The concern is: if we sub-sample \(k\) positive and \(k\) negative from a larger pool, does the gradient estimate degrade? The answer is no — the quality of the gradient estimate depends on the training batch composition, not on how many samples were needed to find it.

Concretely, after Ada-Seq-Balance collects enough rollouts for prompt \(x_i\), we sub-sample \(k\) positive responses \(\{a_j^+\}\) and \(k\) negative responses \(\{a_j^-\}\). The per-prompt gradient estimate becomes:

\[\hat{h}_i = \frac{1}{2k}\left[\sum_{j=1}^{k}(1-p_i)\,\nabla_\theta \log \pi(a_j^+ \vert x_i) + \sum_{j=1}^{k}(-p_i)\,\nabla_\theta \log \pi(a_j^- \vert x_i)\right]\]

The variance of this estimate (for a fixed \(k\)) comes from two independent sources:

\[\mathrm{Var}(\hat{h}_i) = \frac{(1-p_i)^2}{k}\,\mathrm{Var}[\nabla \log \pi \vert r=1] + \frac{p_i^2}{k}\,\mathrm{Var}[\nabla \log \pi \vert r=0]\]

This variance depends only on \(k\) (the training batch size) and the prompt’s pass rate \(p_i\) — it does not depend on \(n_i^{\text{sample}}\) (how many total rollouts were needed to find those \(k\) positives and \(k\) negatives). The extra samples collected during the adaptive search were necessary to find the \(k\) positive examples, but they contribute nothing additional to the gradient once found.
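This independence can be checked numerically with a toy surrogate: below, the score \(\nabla_\theta \log \pi\) is replaced by a scalar Gaussian draw per rollout (an illustrative assumption), and the empirical variance of \(\hat{h}\) is compared across very different pool sizes:

```python
import random
import statistics

def h_hat(p, k, pool_size, rng):
    """Toy per-prompt estimate: a scalar Gaussian stands in for the score
    vector. Draw a pool, then sub-sample k positives and k negatives."""
    while True:  # redraw until the pool contains k of each (a toy shortcut)
        pool = [(rng.random() < p, rng.gauss(0, 1)) for _ in range(pool_size)]
        pos = [g for ok, g in pool if ok]
        neg = [g for ok, g in pool if not ok]
        if len(pos) >= k and len(neg) >= k:
            break
    pos, neg = rng.sample(pos, k), rng.sample(neg, k)
    return (sum((1 - p) * g for g in pos) + sum(-p * g for g in neg)) / (2 * k)

rng = random.Random(0)
p, k = 0.5, 4
for pool_size in (20, 500):
    vals = [h_hat(p, k, pool_size, rng) for _ in range(20000)]
    print(f"pool={pool_size:3d}: Var ~ {statistics.variance(vals):.5f}")
# the variance depends on k, not on how large the search pool was
```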

The cost decomposition makes this concrete:

\[\text{Total cost per prompt} = \underbrace{n_i^{\text{sample}} \cdot c_{\text{infer}}}_{\text{adaptive search}} + \underbrace{2k \cdot c_{\text{train}}}_{\text{fixed training}}\]

Since \(c_{\text{train}} \gg c_{\text{infer}}\) (backward pass + optimizer vs. forward pass only), fixing the training batch at \(2k\) saves far more than the extra inference costs.

Theoretical Framework: Nonlinear Objectives and Variance Reduction

Nonlinear objectives induce prompt-dependent weights

Instead of optimizing the standard pass rate \(p_\theta(x)\) directly, consider a general nonlinear objective:

\[J_f(\theta) = \mathbb{E}_{x \sim d}\left[f\!\left(p_\theta(x)\right)\right]\]

where \(p_\theta(x) = \mathbb{E}_{a \sim \pi_\theta(\cdot \vert x)}[r(x, a)]\) is the pass rate of the current policy on prompt \(x\), and \(f: [0,1] \to \mathbb{R}\) is a differentiable transform. The key insight is that by the chain rule, the policy gradient of \(J_f\) introduces a prompt-dependent weight:

\[\nabla_\theta J_f(\theta) = \mathbb{E}_{x \sim d}\left[ f'(p_\theta(x)) \cdot \nabla_\theta p_\theta(x) \right]\]

Expanding \(\nabla_\theta p_\theta(x)\) via the standard policy gradient theorem (\(\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[r] = \mathbb{E}_{a \sim \pi_\theta}[r \cdot \nabla_\theta \log \pi_\theta]\)), we get the full per-sample form:

\[\nabla_\theta \mathbb{E}_{x \sim d} f(p_\theta(x)) = \mathbb{E}_{x \sim d}\left[ f'(p_\theta(x)) \cdot \mathbb{E}_{a \sim \pi_\theta}\left[ r(x, a) \cdot \nabla_\theta \log \pi_\theta(a \vert x) \right] \right]\]

The derivative \(f'(p)\) acts as a per-prompt importance weight — different choices of \(f\) prioritize different difficulty levels. For example:

  • Linear \(f(p) = p\): then \(f'(p) = 1\), all prompts are weighted equally (standard objective).
  • Log \(f(p) = \log p\): then \(f'(p) = 1/p\), hard prompts (low \(p\)) get upweighted. This corresponds to RAFT’s implicit objective.
  • GRPO’s implicit transform \(f(p) = 2\arcsin(\sqrt{p})\): applying the chain rule,
\[f'(p) = 2 \cdot \frac{1}{\sqrt{1-p}} \cdot \frac{1}{2\sqrt{p}} = \frac{1}{\sqrt{p(1-p)}}\]

This matches the U shape of Ada-Seq-Balance's allocation rule \(1/(p(1-p))\) (of which it is the square root)! Prompts with extreme pass rates (near 0 or 1) get upweighted, while \(p \approx 0.5\) prompts get the least weight. This provides a principled framework: different RL objectives implicitly prioritize prompts differently, and adaptive sampling can exploit this structure for variance reduction.
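To make the weights concrete, here are \(f'(p)\) values for the three transforms at a few pass rates (a small table generator, nothing model-specific):

```python
import math

# Per-prompt weights f'(p) induced by the three objectives above.
transforms = {
    "linear": lambda p: 1.0,                           # f(p) = p
    "log":    lambda p: 1.0 / p,                       # f(p) = log p (RAFT)
    "arcsin": lambda p: 1.0 / math.sqrt(p * (1 - p)),  # f(p) = 2 arcsin sqrt(p) (GRPO)
}
print(f"{'p':>6}" + "".join(f"{name:>10}" for name in transforms))
for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"{p:6.2f}" + "".join(f"{fp(p):10.2f}" for fp in transforms.values()))
# log upweights only hard prompts; arcsin upweights both extremes (U-shaped)
```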

Why arcsin? This is not a design choice but a mathematical consequence of GRPO’s per-response normalization. The derivation proceeds in three steps.

Step 1: From GRPO weights to per-prompt gradient contribution. Recall the GRPO gradient weights each response by \((r_i - \mu_r)/\sigma_r\). Since \(\sigma_r\) is constant within a group, we can pull it out of the expectation. For a prompt with pass rate \(p\) (so \(\mu_r = p\)):

\[\mathbb{E}\!\left[\frac{r - \mu_r}{\sigma_r} \nabla_\theta \log \pi\right] = \frac{1}{\sigma_r}\,\mathbb{E}\!\left[(r - p)\,\nabla_\theta \log \pi\right]\]

Now we split the expectation:

\[\mathbb{E}[(r - p)\,\nabla_\theta \log \pi] = \underbrace{\mathbb{E}[r\,\nabla_\theta \log \pi]}_{\text{(A)}} - p\,\underbrace{\mathbb{E}[\nabla_\theta \log \pi]}_{\text{(B)}}\]

For (A), by the REINFORCE log-derivative trick: \(\mathbb{E}_{a \sim \pi}[r(x,a)\,\nabla_\theta \log \pi_\theta(a \vert x)] = \nabla_\theta \mathbb{E}_{a \sim \pi}[r(x,a)] = \nabla_\theta p_\theta(x)\).

For (B), the score function has zero mean: \(\mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a \vert x)] = \nabla_\theta \sum_a \pi_\theta(a \vert x) = \nabla_\theta 1 = 0\). This is the classic result that a constant baseline does not change the expected gradient.

REINFORCE / log-derivative trick (Click to expand)

The identity \(\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[f(a)] = \mathbb{E}_{a \sim \pi_\theta}[f(a)\,\nabla_\theta \log \pi_\theta(a)]\) follows from:

$$\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[f(a)] = \nabla_\theta \sum_a \pi_\theta(a)\,f(a) = \sum_a f(a)\,\nabla_\theta \pi_\theta(a) = \sum_a f(a)\,\pi_\theta(a)\,\frac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)} = \mathbb{E}_{a \sim \pi_\theta}\!\left[f(a)\,\nabla_\theta \log \pi_\theta(a)\right]$$

The key step is multiplying and dividing by \(\pi_\theta(a)\) to turn the sum into an expectation. This allows us to estimate policy gradients via sampling, without knowing the normalizing constant of \(\pi_\theta\). In RL, \(f(a) = r(x, a)\) gives the REINFORCE estimator (Williams, 1992).
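The identity can be sanity-checked numerically on a 3-action softmax, with both sides computed by enumeration and finite differences (nothing here is LLM-specific):

```python
import math

def softmax(theta):
    mx = max(theta)
    exps = [math.exp(t - mx) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_f(theta, f):
    """LHS before differentiation: E_{a ~ pi_theta}[f(a)], enumerated."""
    return sum(p * f(a) for a, p in enumerate(softmax(theta)))

def score_function_grad(theta, f):
    # RHS of the identity, enumerated exactly; for a softmax,
    # d log pi(a) / d theta_k = 1[a == k] - pi_k.
    pi = softmax(theta)
    return [sum(pi[a] * f(a) * ((a == k) - pi[k]) for a in range(len(theta)))
            for k in range(len(theta))]

theta = [0.3, -1.2, 0.8]
f = lambda a: [0.0, 1.0, 1.0][a]  # a binary "reward" over 3 actions
eps = 1e-6
numeric = [(expected_f([t + eps * (i == k) for i, t in enumerate(theta)], f)
            - expected_f([t - eps * (i == k) for i, t in enumerate(theta)], f)) / (2 * eps)
           for k in range(3)]
print([round(g, 6) for g in score_function_grad(theta, f)])
print([round(g, 6) for g in numeric])  # the two gradients agree
```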

Why does a constant baseline not change the gradient? (Click to expand)

For any constant \(b\) (not depending on \(a\)):

$$\mathbb{E}_{a \sim \pi_\theta}[b\,\nabla_\theta \log \pi_\theta(a)] = b\,\nabla_\theta \sum_a \pi_\theta(a) = b\,\nabla_\theta 1 = 0$$

So \(\mathbb{E}[(r - b)\nabla \log \pi] = \mathbb{E}[r\,\nabla \log \pi] - b \cdot 0 = \mathbb{E}[r\,\nabla \log \pi]\). The baseline \(b\) vanishes in expectation but reduces variance because it makes the per-sample weights \(r - b\) smaller in magnitude. The optimal baseline (minimizing variance) is \(b^* = \mathbb{E}[r \lVert\nabla \log \pi\rVert^2] / \mathbb{E}[\lVert\nabla \log \pi\rVert^2]\), but any constant works — here we use \(b = p\).

So the baseline \(p\) vanishes, and with \(\sigma_r = \sqrt{p(1-p)}\) for binary rewards:

\[\frac{1}{\sigma_r}\,\mathbb{E}[(r - p)\,\nabla_\theta \log \pi] = \frac{1}{\sqrt{p(1-p)}} \cdot \nabla_\theta p_\theta(x)\]

Step 2: Read off \(f'(p)\) and integrate. Comparing with the nonlinear framework \(f'(p) \cdot \nabla_\theta p\), we identify \(f'(p) = 1/\sqrt{p(1-p)}\). To recover \(f\), we integrate using the substitution \(p = \sin^2\theta\), so \(dp = 2\sin\theta\cos\theta\,d\theta\) and \(\sqrt{p(1-p)} = \sin\theta\cos\theta\):

\[f(p) = \int \frac{dp}{\sqrt{p(1-p)}} = \int \frac{2\sin\theta\cos\theta\,d\theta}{\sin\theta\cos\theta} = 2\theta = 2\arcsin(\sqrt{p})\]

Step 3: Connection to classical statistics. The function \(2\arcsin(\sqrt{p})\) is exactly the classical arcsine (angular) transformation from statistics (Fisher, 1940s), originally designed for variance stabilization of binomial proportions. Its key property: if \(\hat{p}\) is a sample proportion from \(n\) Bernoulli trials, then \(2\arcsin(\sqrt{\hat{p}})\) has variance \(\approx 1/n\) regardless of the true \(p\). In other words, GRPO’s per-response normalization implicitly applies the unique transformation that equalizes gradient variance across all difficulty levels — which is precisely why uniform sampling is already optimal for GRPO.
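The variance-stabilizing property is easy to verify empirically (a pure-Python simulation; no claim beyond the approximation stated above):

```python
import math
import random
import statistics

# For p_hat the mean of n Bernoulli(p) draws, Var(2 * asin(sqrt(p_hat))) ~ 1/n
# regardless of p (away from 0 and 1), so n * Var should be close to 1.
rng = random.Random(0)
n, reps = 100, 20000
for p in (0.1, 0.3, 0.5, 0.9):
    vals = []
    for _ in range(reps):
        p_hat = sum(rng.random() < p for _ in range(n)) / n
        vals.append(2 * math.asin(math.sqrt(p_hat)))
    print(f"p={p}: n*Var ~ {n * statistics.variance(vals):.3f}")
```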

Optimal allocation via variance reduction

In practice, we need to estimate the gradient from finite samples. The batch gradient estimator is derived in three steps from the population gradient \(\nabla_\theta J_f = \mathbb{E}_{x}[f'(p) \cdot \nabla_\theta p]\):

Step 1. Replace the expectation over prompts \(\mathbb{E}_{x \sim d}\) with a sample average over \(B\) prompts \(\{x_1, \ldots, x_B\}\):

\[\nabla_\theta J_f \approx \frac{1}{B} \sum_{i=1}^{B} f'(p_i) \cdot \nabla_\theta p_\theta(x_i)\]

Step 2. For each prompt \(x_i\), estimate \(\nabla_\theta p_\theta(x_i)\) using the REINFORCE trick with \(n_i\) sampled rollouts \(\{a_{i1}, \ldots, a_{in_i}\}\):

\[\nabla_\theta p_\theta(x_i) = \mathbb{E}_{a \sim \pi_\theta}\!\left[r(x_i, a)\,\nabla_\theta \log \pi_\theta(a \vert x_i)\right] \approx \frac{1}{n_i} \sum_{j=1}^{n_i} r_{ij}\,\nabla_\theta \log \pi_\theta(a_{ij} \vert x_i)\]

Step 3. Subtract the baseline \(p_i\) from each reward. As shown earlier, \(\mathbb{E}[\nabla_\theta \log \pi] = 0\), so \(p_i\) does not change the expected gradient but reduces variance:

\[\nabla_\theta p_\theta(x_i) \approx \frac{1}{n_i} \sum_{j=1}^{n_i} (r_{ij} - p_i)\,\nabla_\theta \log \pi_\theta(a_{ij} \vert x_i)\]

Combining all three steps gives the batch estimator:

\[\hat{g}_{\text{batch}} = \frac{1}{B} \sum_{i=1}^{B} \frac{f'(p_i)}{n_i} \sum_{j=1}^{n_i} \nabla_\theta \log \pi_\theta(a_{ij} \vert x_i) \left(r_{ij} - p_i\right)\]

Each prompt contributes with weight \(f'(p_i)\) (from the nonlinear objective), and within each prompt, \(n_i\) rollouts provide a Monte Carlo estimate of \(\nabla_\theta p\).
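The three steps can be checked end to end on a toy "policy": a softmax over four canned answers with fixed binary rewards, using \(f(p) = p\) so \(f'(p) = 1\) and a single prompt (\(B = 1\)). This is an illustrative sketch, not the paper's setup:

```python
import math
import random

def softmax(theta):
    mx = max(theta)
    exps = [math.exp(t - mx) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

theta = [0.5, -0.5, 1.0, 0.0]
reward = [1, 0, 1, 0]                             # binary reward per answer
pi = softmax(theta)
p = sum(pi_a * r for pi_a, r in zip(pi, reward))  # pass rate on this prompt

# exact gradient of p for a softmax policy: dp/dtheta_k = pi_k * (r_k - p)
exact = [pi[k] * (reward[k] - p) for k in range(4)]

# Steps 2 + 3: Monte Carlo with the baseline p subtracted;
# for a softmax, d log pi(a) / d theta_k = 1[a == k] - pi_k
rng = random.Random(0)
n = 200000
mc = [0.0] * 4
for _ in range(n):
    a = rng.choices(range(4), weights=pi)[0]
    for k in range(4):
        mc[k] += (reward[a] - p) * ((a == k) - pi[k]) / n
print([round(g, 4) for g in exact])
print([round(g, 4) for g in mc])  # matches: the estimator is unbiased
```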

Now we derive the variance of this estimator step by step.

Step 1: Decompose into per-prompt contributions. Write the estimator as \(\hat{g}_{\text{batch}} = \frac{1}{B}\sum_{i=1}^{B} f'(p_i) \cdot \hat{h}_i\), where \(\hat{h}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}(r_{ij} - p_i)\,\nabla_\theta \log \pi_\theta(a_{ij} \vert x_i)\) is the per-prompt gradient estimate. Since prompts are sampled independently, the variance of a sum of independent terms equals the sum of variances:

\[\mathrm{Var}(\hat{g}_{\text{batch}}) = \frac{1}{B^2}\sum_{i=1}^{B} f'(p_i)^2 \cdot \mathrm{Var}(\hat{h}_i)\]

Step 2: Variance of the per-prompt estimate. Within each prompt, the \(n_i\) rollouts are i.i.d. from \(\pi_\theta(\cdot \vert x_i)\). For i.i.d. samples, the variance of a sample mean is \(1/n_i\) times the variance of a single sample:

\[\mathrm{Var}(\hat{h}_i) = \frac{1}{n_i}\,\mathrm{Var}_{a \sim \pi_\theta}\!\left[(r(x_i, a) - p_i)\,\nabla_\theta \log \pi_\theta(a \vert x_i)\right] \;=\; \frac{\sigma_g^2(x_i)}{n_i}\]

where \(\sigma_g^2(x_i)\) denotes the single-sample gradient variance for prompt \(x_i\).

Step 3: Evaluate \(\sigma_g^2(x_i)\) for binary rewards. Expanding the variance:

\[\sigma_g^2(x_i) = \mathbb{E}\!\left[(r - p_i)^2\,\lVert\nabla_\theta \log \pi\rVert^2\right] - \lVert\underbrace{\mathbb{E}[(r - p_i)\,\nabla_\theta \log \pi]}_{= \nabla_\theta p_\theta(x_i)}\rVert^2\]

The second term \(\lVert\nabla_\theta p\rVert^2\) is typically small relative to the first (this is why we need variance reduction in the first place). Dropping it and focusing on the dominant term:

\[\sigma_g^2(x_i) \approx \mathbb{E}\!\left[(r - p_i)^2\right] \cdot \mathbb{E}\!\left[\lVert\nabla_\theta \log \pi\rVert^2\right]\]

For binary \(r \in \{0,1\}\) with \(\Pr(r=1) = p_i\):

\[\mathbb{E}[(r - p_i)^2] = p_i(1-p_i)^2 + (1-p_i)\,p_i^2 = p_i(1-p_i)\]

This is simply the Bernoulli variance. Assuming the gradient norm \(\mathbb{E}[\lVert\nabla_\theta \log \pi\rVert^2]\) is roughly similar across prompts (a standard simplifying assumption), we get \(\sigma_g^2(x_i) \propto p_i(1-p_i)\).

Why is the Bernoulli variance \(p(1-p)\)? (Click to expand)

For a Bernoulli random variable \(r \in \{0, 1\}\) with \(\Pr(r=1) = p\), the variance is computed directly from the definition \(\mathrm{Var}(r) = \mathbb{E}[r^2] - (\mathbb{E}[r])^2\). Since \(r^2 = r\) (because \(0^2=0\) and \(1^2=1\)), we have \(\mathbb{E}[r^2] = \mathbb{E}[r] = p\), so:

$$\mathrm{Var}(r) = p - p^2 = p(1-p)$$

This is maximized at \(p = 0.5\) (where \(\mathrm{Var} = 0.25\)) and zero at \(p = 0\) or \(p = 1\). For the centered version \(r - p\), the same holds: \(\mathbb{E}[(r-p)^2] = p(1-p)^2 + (1-p)p^2 = p(1-p)[(1-p)+p] = p(1-p)\).

Putting it all together. Substituting back (and dropping the constant \(1/B^2\), which does not affect the allocation optimization):

\[V \propto \sum_{i=1}^{B} \frac{f'(p_i)^2 \cdot p_i(1-p_i)}{n_i}\]

Given a total budget of \(N = \sum_i n_i\) samples across \(B\) prompts, we want to minimize this total gradient variance:

\[\min_{n_1, \ldots, n_B} \sum_{i=1}^{B} \frac{f'(p_i)^2 \cdot p_i(1-p_i)}{n_i} \quad \text{subject to} \quad \sum_{i=1}^{B} n_i = N, \quad n_i \geq 1\]

Solving the optimization. By the method of Lagrange multipliers, introduce \(\lambda\) for the budget constraint and set the partial derivative to zero:

\[\frac{\partial}{\partial n_i}\left[\frac{f'(p_i)^2\,p_i(1-p_i)}{n_i} + \lambda\,n_i\right] = -\frac{f'(p_i)^2\,p_i(1-p_i)}{n_i^2} + \lambda = 0\]

Solving: \(n_i^\star = \frac{1}{\sqrt{\lambda}}\,f'(p_i)\,\sqrt{p_i(1-p_i)}\), i.e.,

\[n_i^\star \propto f'(p_i) \cdot \sqrt{p_i(1-p_i)}\]

The principle: allocate more samples to prompts that contribute more variance to the gradient — both because they are heavily weighted (large \(f'(p_i)\)) and because their reward signal is noisy (\(p_i\) near 0.5).

Lagrange multipliers for constrained optimization (Click to expand)

The method of Lagrange multipliers solves problems of the form: minimize \(f(x_1, \ldots, x_n)\) subject to \(g(x_1, \ldots, x_n) = 0\). The idea: at a constrained optimum, the gradient of \(f\) must be parallel to the gradient of \(g\) (otherwise we could improve \(f\) while staying on the constraint surface). So we solve \(\nabla f = \lambda \nabla g\) for some scalar \(\lambda\).

Here, \(f = \sum_i c_i / n_i\) (with \(c_i = f'(p_i)^2 p_i(1-p_i)\)) and \(g = \sum_i n_i - N = 0\). Setting \(\partial f / \partial n_i + \lambda \cdot \partial g / \partial n_i = 0\) gives \(-c_i / n_i^2 + \lambda = 0\), so \(n_i = \sqrt{c_i / \lambda}\). Since all \(n_i\) share the same \(\lambda\), we get \(n_i \propto \sqrt{c_i}\).
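A small brute-force check (with arbitrary illustrative constants \(c_i\)) confirms the \(n_i \propto \sqrt{c_i}\) solution: enumerating every integer allocation of \(N = 24\) samples over three prompts recovers exactly the Lagrange answer.

```python
import itertools
import math

# variance-like objective: V(n) = sum(c_i / n_i) subject to sum(n_i) = N
c = [9.0, 4.0, 1.0]
N = 24

def V(ns):
    return sum(ci / ni for ci, ni in zip(c, ns))

# Lagrange solution: n_i proportional to sqrt(c_i)
total = sum(math.sqrt(ci) for ci in c)
star = [N * math.sqrt(ci) / total for ci in c]  # [12.0, 8.0, 4.0]

# brute force over all feasible integer allocations finds the same optimum
best = min(
    (ns for ns in itertools.product(range(1, N), repeat=3) if sum(ns) == N),
    key=V,
)
assert best == (12, 8, 4)
assert abs(V(best) - V(star)) < 1e-9
```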



Concrete allocation rules for different objectives

Substituting different \(f\) into the optimal allocation formula \(n_i^\star \propto f'(p_i) \cdot \sqrt{p_i(1-p_i)}\) yields concrete allocation rules:

\(f(p) = \log p\) (RAFT’s implicit objective): \(f'(p) = 1/p\), so

\[n_i \propto \frac{1}{p} \cdot \sqrt{p(1-p)} = \sqrt{\frac{1-p}{p}} \approx \sqrt{\frac{1}{p}} \quad \text{(when } p \text{ is small)}\]

Hard prompts (small \(p\)) get many more samples: under the \(\sqrt{1/p}\) approximation, a prompt with \(p = 0.01\) gets \(10\times\) the samples of a prompt with \(p = 1\).

\(f(p) = \log p\) with baseline (policy gradient): subtracting the baseline \(p_i\) (as in the derivation above) replaces the raw second moment \(p\) with the Bernoulli variance \(p(1-p)\), giving \(n_i \propto \sqrt{(1-p)/p}\). This still favors hard prompts, but less aggressively than the no-baseline rule \(n_i \propto \sqrt{1/p}\).

\(f(p) = p^\alpha\) (power objective): \(f'(p) = \alpha p^{\alpha-1}\), so

\[n_i \propto p^{\alpha-1} \cdot \sqrt{p(1-p)}\]

For \(\alpha > 1\), this upweights easy prompts; for \(\alpha < 1\), it upweights hard prompts.

\(f(p) = 2\arcsin(\sqrt{p})\) (GRPO): \(f'(p) = 1/\sqrt{p(1-p)}\), so

\[n_i \propto \frac{1}{\sqrt{p(1-p)}} \cdot \sqrt{p(1-p)} = 1\]

The weight and the Bernoulli variance exactly cancel: uniform allocation is already variance-optimal for GRPO! This is a remarkable result: GRPO’s implicit nonlinear objective is precisely the one for which no adaptive allocation can improve on uniform sampling.
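These rules are easy to check numerically. The sketch below (the helper name `rule` is mine) evaluates the unnormalized allocation on a grid of pass rates and confirms that the RAFT-style rule decreases in \(p\) while the GRPO rule is exactly flat:

```python
import math

def rule(fprime, p):
    """Unnormalized optimal sample count: n(p) ∝ f'(p) * sqrt(p * (1 - p))."""
    return fprime(p) * math.sqrt(p * (1 - p))

grid = [0.1, 0.3, 0.5, 0.7, 0.9]

# f(p) = log p (RAFT-style): f'(p) = 1/p, so hard prompts get more samples
raft = [rule(lambda p: 1 / p, q) for q in grid]
assert raft == sorted(raft, reverse=True)  # strictly decreasing in p

# f(p) = 2 * arcsin(sqrt(p)) (GRPO): f'(p) = 1 / sqrt(p(1-p)),
# which cancels the Bernoulli noise term, so the rule is constant
grpo = [rule(lambda p: 1 / math.sqrt(p * (1 - p)), q) for q in grid]
assert all(abs(g - 1.0) < 1e-12 for g in grpo)
```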



Experiments: Comparing Allocation Strategies

logp-VR vs. Ada-Seq-Balance

Comparing the two approaches, with the per-prompt sampling budget clipped at 256:

  • logp-VR: variance-reduction allocation with \(n_i \propto \sqrt{(1-p)/p}\), average per-prompt budget \(n = 128\). Both sampling and training counts follow the same allocation curve. Average cost = 512 samples.
  • Ada-Seq-Balance: \(n/2\) training batch (fixed), \(2n\) maximal sampling budget. Sampling count follows the U-shaped \(1/(p(1-p))\) curve, but training size stays constant. Average cost = 383 samples.

The key difference is whether sampling and training are coupled or decoupled:

  • logp-VR (coupled): for each prompt, sample \(n_i\) rollouts and train on all \(n_i\) of them. A hard prompt with \(p = 0.01\) might get \(n_i = 200\) samples, and the training step processes all 200. This means both sampling cost and training cost scale together (left plot: the two curves overlap). Total cost per prompt = \(n_i^{\text{sample}} + n_i^{\text{train}} = 2n_i\), so average cost = \(2 \times 256 = 512\).

  • Ada-Seq-Balance (decoupled): for each prompt, sample adaptively until \(k\) positive and \(k\) negative rollouts are found (sampling count varies with difficulty), but then sub-sample a fixed-size training batch of \(n\) rollouts regardless of how many were sampled. A hard prompt might require 200 samples to find 4 correct ones, but training only uses those 4 correct + 4 incorrect = 8 rollouts. This is shown in the right plot: the blue curve (sampling) is U-shaped, but the orange line (training) is flat. Since training (backward pass + optimizer) is much more expensive per sample than inference (forward pass only), the fixed training size saves significant compute. Average cost = 383 (lower because training cost is constant).

In short: Ada-Seq-Balance spends extra inference compute (cheap) to recover signal from hard prompts, but does not spend extra training compute (expensive) on them.
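The decoupled scheme can be sketched as follows. The stopping rule, \(k = 4\), and the cap of 256 follow the description above, but the function and its details are an illustrative reconstruction, not the paper's implementation:

```python
import random

def ada_seq_balance(p, k=4, cap=256, seed=0):
    """Sample rollouts until k correct and k incorrect are found (or the
    budget cap is hit); train on at most k + k of them.
    Returns (num_sampled, num_trained): the decoupled cost pair."""
    rng = random.Random(seed)
    pos = neg = sampled = 0
    while (pos < k or neg < k) and sampled < cap:
        sampled += 1
        if rng.random() < p:  # each rollout is correct with probability p
            pos += 1
        else:
            neg += 1
    trained = min(pos, k) + min(neg, k)
    return sampled, trained

# a hard prompt (p = 0.02) costs many cheap samples but trains on few rollouts
sampled, trained = ada_seq_balance(0.02)
assert trained <= 8
assert sampled >= trained
```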


Ada-Seq-Balance is consistently competitive

Comparing GRPO, Ada-Est (log \(p\) variance-reduction), Ada-Seq-Pos-4, and Ada-Seq-Balance-4 across Qwen2.5-Math-1.5B and 7B, Ada-Seq-Balance is consistently competitive or best. On Qwen2.5-Math-7B, it reaches 54.6% weighted test accuracy versus GRPO-n4’s 53.0%.


| Model | GRPO-n4 | GRPO-n8 | Ada-Seq-Pos | Ada-Seq-Balance | Ada-Est-n8-logp |
|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B | 45.3 | 46.1 | 46.1 | **47.6** | 46.6 |
| Qwen2.5-Math-7B | 53.0 | 53.2 | 54.2 | **54.6** | 53.8 |

Weighted test accuracy (%); the best result per model is bolded.

Ada-Seq-Balance is more compute efficient than GRPO

When matching per-step compute and training for up to 1000 steps, Ada-Seq-Balance-n8 consistently outperforms GRPO-n16 on the accuracy-vs-cost curve, and Ada-Seq-Balance-n16 outperforms GRPO-n32.


Figure: weighted accuracy (pass@1) vs. sampling cost. Left: Ada-Seq-Balance-n8 vs. GRPO-n16; right: Ada-Seq-Balance-n16 vs. GRPO-n32.

The accuracy-entropy frontier also improves: Ada-Seq-Balance achieves higher accuracy at the same entropy level, indicating better exploration-exploitation trade-off.


Figure: weighted accuracy vs. entropy over training (steps 50 to 1000). Left: Ada-Seq-Balance-n8 vs. GRPO-n16; right: Ada-Seq-Balance-n16 vs. GRPO-n32.