Token-Level KL-Regularized Policy Gradient and GRPO
Hao Bai, Tong Zhang
April 02, 2026
We consider the token-wise KL-regularized formulation for RL fine-tuning of language models. For simplicity we slightly modify our notation: starting from a prompt \(x_0\), we consider both LLM tokens \(a_j\) (which may include thinking tokens) followed by environment tokens \(x_j\), where each \(x_j\) could be either empty or encoded as multiple tokens.
This leads to a trajectory of horizon \(H\) (total number of LLM actions):
\[\tau = (x_0, a_1, x_1, a_2, x_2, \ldots, a_H, x_H),\]
where \(\tau_j\) denotes the history up to step \(j\), and \(\tau_{-j}\) denotes the future trajectory after step \(j\).
Token-Level KL-Regularized Policy Gradient
KL-Regularized Objective
The KL-regularized objective is:
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Bigl[\, r(\tau) - \frac{\beta}{H} \ln \frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)} \Bigr]\]
Here \(\pi_\theta\) is the current policy (the LLM being trained), \(\pi_{\mathrm{ref}}\) is the reference policy (typically the SFT checkpoint), \(r(\tau)\) is the trajectory-level reward, and \(\beta > 0\) is the KL regularization coefficient. The term \(\ln \frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)}\) is the log-likelihood ratio between the current and reference policies over the full trajectory, which equals the KL divergence when taken in expectation. Dividing by \(H\) normalizes the penalty per token. Intuitively, this objective asks the policy to maximize reward while paying a per-token cost for deviating from the reference — when \(\beta\) is large the policy stays close to \(\pi_{\mathrm{ref}}\), and when \(\beta \to 0\) it reduces to pure reward maximization.
Deriving the Token-Level Gradient
Step 1: REINFORCE on the full trajectory. Applying the standard policy gradient theorem (REINFORCE), we differentiate through \(\pi_\theta(\tau)\) and obtain:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Bigl[\Bigl(r(\tau) - \frac{\beta}{H} \ln \frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)}\Bigr)\, \nabla_\theta \ln \pi_\theta(\tau)\Bigr]\]
(The extra term \(-\frac{\beta}{H}\,\mathbb{E}[\nabla_\theta \ln \pi_\theta(\tau)]\) from differentiating the log-ratio itself vanishes by the score-function identity proved below.)
Step 2: Decompose \(\ln \pi_\theta(\tau)\) into per-token terms. Since the policy generates tokens autoregressively, the trajectory probability factorizes as \(\pi_\theta(\tau) = \prod_{j=1}^H \pi_\theta(a_j \vert \tau_{j-1})\). Taking the log:
\[\ln \pi_\theta(\tau) = \sum_{j=1}^{H} \ln \pi_\theta(a_j \vert \tau_{j-1})\]
Similarly, the KL term decomposes as \(\ln \frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)} = \sum_{i=1}^{H} \ln \frac{\pi_\theta(a_i \vert \tau_{i-1})}{\pi_{\mathrm{ref}}(a_i \vert \tau_{i-1})}\).
Step 3: Substitute into the gradient. Plugging both decompositions into Step 1:
\[\nabla_\theta J(\theta) = \mathbb{E}\Bigl[\Bigl(r(\tau) - \frac{\beta}{H}\sum_{i=1}^{H} K_i\Bigr)\sum_{j=1}^{H} \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\Bigr], \qquad K_i = \ln \frac{\pi_\theta(a_i \vert \tau_{i-1})}{\pi_{\mathrm{ref}}(a_i \vert \tau_{i-1})}\]
Step 4: Distribute into a double sum. The expression above is a product of two sums: the KL sum (indexed by \(i\)) and the gradient sum (indexed by \(j\)). Expanding this product, we get one term for every \((j, i)\) pair. Focusing on the KL portion:
\[-\frac{\beta}{H}\,\mathbb{E}\Bigl[\sum_{j=1}^{H} \sum_{i=1}^{H} K_i\, \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\Bigr]\]
What does this double sum mean? It is performing credit assignment for the KL penalty: for each gradient direction \(j\) (which direction to update the policy at token \(j\)), it asks how much KL penalty from every token \(i\) should weight that update. Naively, every token’s KL penalty influences every token’s gradient — an \(H \times H\) table of interactions.
But this is wasteful. The action at step \(j\) was chosen after steps \(1, \ldots, j-1\) already happened. Changing what we do at step \(j\) cannot retroactively alter the KL cost incurred at earlier steps. So the past KL terms (\(i < j\)) are irrelevant to the gradient at step \(j\) — they are sunk costs.
Step 5: Apply causality to reduce the sum. Formally, for \(i < j\), the KL term \(K_i = \ln \frac{\pi_\theta(a_i \vert \tau_{i-1})}{\pi_{\mathrm{ref}}(a_i \vert \tau_{i-1})}\) was determined before step \(j\): it depends only on \(\tau_{i-1}\) and \(a_i\), both of which are fixed by the time step \(j\) is reached. So we can condition on everything up to step \(j-1\) and take the inner expectation over \(a_j\). Since \(K_i\) is a constant given \(\tau_{j-1}\), it can be pulled out:
\[\mathbb{E}\bigl[K_i\, \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\bigr] = \mathbb{E}_{\tau_{j-1}}\Bigl[K_i \cdot \mathbb{E}_{a_j \sim \pi_\theta(\cdot \vert \tau_{j-1})}\bigl[\nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\bigr]\Bigr]\]
The remaining expectation is always zero by the REINFORCE identity (score function identity): \(\mathbb{E}_{a_j}[\nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})] = 0\). Therefore \(K_i \cdot 0 = 0\) for all \(i < j\) — past KL terms contribute nothing to the gradient. This is also why any constant baseline can be subtracted from the reward without introducing bias. Only the terms \(i \geq j\) survive.
Proof of the REINFORCE identity
For any normalized distribution \(\pi_\theta(a \vert s)\):
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \vert s)}\bigl[\nabla_\theta \ln \pi_\theta(a \vert s)\bigr] = \sum_a \pi_\theta(a \vert s) \cdot \frac{\nabla_\theta \pi_\theta(a \vert s)}{\pi_\theta(a \vert s)} = \nabla_\theta \sum_a \pi_\theta(a \vert s) = \nabla_\theta 1 = 0$$
\(\pi_\theta\) in the expectation cancels the denominator of \(\nabla_\theta \ln \pi_\theta = \frac{\nabla_\theta \pi_\theta}{\pi_\theta}\), leaving \(\nabla_\theta\) of a sum that equals 1 regardless of \(\theta\).
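The identity is also easy to verify numerically. The following sketch (not from the original post) assumes a simple softmax policy, for which the score has the closed form \(\nabla_{\theta_k} \ln \pi_a = \mathbf{1}\{a{=}k\} - \pi_k\):

```python
import numpy as np

# Numerical check of the score-function identity E_{a~pi}[grad ln pi(a)] = 0
# for a softmax policy over 5 actions. (Illustrative sketch, not from the post.)
rng = np.random.default_rng(0)
theta = rng.normal(size=5)                 # logits
pi = np.exp(theta) / np.exp(theta).sum()   # softmax policy

# For pi = softmax(theta): d ln pi_a / d theta_k = 1{a=k} - pi_k
score = np.eye(5) - pi[None, :]            # score[a, k] = grad_k ln pi(a)

# Expectation over a ~ pi: sum_a pi_a * score[a, :] is the zero vector.
expected_score = pi @ score
assert np.allclose(expected_score, 0.0)
```

The cancellation \(\sum_a \pi_a(\mathbf{1}\{a{=}k\} - \pi_k) = \pi_k - \pi_k = 0\) is exactly the \(\nabla_\theta \sum_a \pi_\theta(a \vert s) = 0\) step of the proof.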
This lets us drop the lower triangle (\(i < j\)) from the double sum, leaving only the upper triangle \(i \geq j\):
\[-\frac{\beta}{H}\,\mathbb{E}\Bigl[\sum_{j=1}^{H} \Bigl(\sum_{i=j}^{H} K_i\Bigr)\, \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\Bigr]\]
Check with \(H{=}2\): the 4-term grid \(K_i g_j\) loses one entry \(K_1 g_2\) (the past-KL term), leaving the upper triangle. \(\checkmark\)
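Written out for \(H = 2\), with \(g_j = \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\), the four-term grid is:

$$\sum_{j=1}^{2}\sum_{i=1}^{2} K_i\, g_j = \underbrace{K_1 g_1 + K_2 g_1}_{j=1:\ i \geq j} \;+\; \underbrace{K_2 g_2}_{j=2:\ i \geq j} \;+\; \underbrace{K_1 g_2}_{\text{past KL } (i<j):\ \mathbb{E}[\,\cdot\,] = 0}$$

so only \(K_1 g_1 + K_2 g_1 + K_2 g_2\) survives, the upper triangle \(i \geq j\).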
Putting everything back together for general \(H\):
\[\nabla_\theta J(\theta) = \mathbb{E}\Bigl[\sum_{j=1}^{H} \Bigl(r(\tau) - \frac{\beta}{H}\sum_{i=j}^{H} K_i\Bigr)\, \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\Bigr]\]
Token-Level Reward and Loss
This is a key result: each token position \(j\) sees the full reward \(r(\tau)\) but only the future KL penalty \(\sum_{i=j}^H\), not the past. We can therefore define the token-level reward \(r_j(\tau)\), which assigns to each position \(j\) the trajectory reward minus the KL penalty accumulated from step \(j\) onward:
\[r_j(\tau) = \mathrm{stopgrad}\Bigl(r(\tau) - \frac{\beta}{H}\sum_{i=j}^{H} K_i\Bigr)\]
where \(\mathrm{stopgrad}(\cdot)\) is the stop-gradient operator, treating its argument as a constant during backpropagation.
In practice, we normalize by \(\frac{1}{H}\) (averaging over tokens) and use the following weighted next-token prediction loss:
\[\mathcal{L}(\theta) = -\frac{1}{H}\sum_{j=1}^{H} r_j(\tau)\, \ln \pi_\theta(a_j \vert \tau_{j-1})\]
The \(\frac{1}{H}\) does not come from the derivation — it is a normalization constant that makes the loss magnitude independent of sequence length (equivalent to scaling the learning rate by \(\frac{1}{H}\)). The gradient direction is unchanged. In essence, we simply replace \(r(\tau)\) by the token-level \(r_j(\tau)\).
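The token-level reward and the weighted loss translate almost line-for-line into code. Below is a minimal numpy sketch; names like `token_level_rewards` are illustrative, and in a real framework `logp_cur` would be a gradient-carrying tensor with \(r_j\) detached (the stop-gradient):

```python
import numpy as np

# Minimal sketch of the token-level reward r_j and the weighted NLL loss.
# logp_cur / logp_ref are per-token log-probs under pi_theta / pi_ref; in a
# real implementation logp_cur carries gradients and r_j is detached (stopgrad).
def token_level_rewards(r_tau, logp_cur, logp_ref, beta):
    """r_j = r(tau) - (beta/H) * sum_{i=j}^{H} K_i, with K_i the per-token log-ratio."""
    H = len(logp_cur)
    K = logp_cur - logp_ref                  # K_i
    future_kl = np.cumsum(K[::-1])[::-1]     # sum_{i=j}^{H} K_i for every position j
    return r_tau - (beta / H) * future_kl

def weighted_nll_loss(r_j, logp_cur):
    """L = -(1/H) * sum_j r_j * ln pi_theta(a_j | tau_{j-1})."""
    return -np.mean(r_j * logp_cur)

logp_cur = np.log(np.array([0.5, 0.4, 0.9]))
logp_ref = np.log(np.array([0.5, 0.5, 0.8]))
r_j = token_level_rewards(1.0, logp_cur, logp_ref, beta=0.1)
loss = weighted_nll_loss(r_j, logp_cur)
```

The reverse cumulative sum is the whole trick: it gives every position its own future-only KL penalty in \(O(H)\) time.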
Plugging into GRPO
Recall the standard GRPO loss. For \(G\) completions sampled per prompt, it has two additive components, a clipped surrogate driven by a reward-only advantage and a separate KL penalty:
\[\mathcal{L}_{\text{GRPO}} = \frac{1}{G}\sum_{g=1}^{G} \frac{1}{H}\sum_{j=1}^{H} \Bigl[\min\bigl(\rho_j^g \hat{A}^g,\ \mathrm{clip}(\rho_j^g)\,\hat{A}^g\bigr) - \beta\, K_j^g\Bigr]\]
Now substitute our token-level reward \(r_j^g\) (which already contains the KL cost) in place of the task-only reward. For each prompt, sample \(G\) trajectories \(\tau^g \sim \pi_\theta\). At each token position \(j\), compute the group mean \(\mu_j = \frac{1}{G}\sum_g r_j^g\) and standard deviation \(\sigma_j\). (The formulation uses a uniform \(H\), implicitly assuming all trajectories in a group have the same length; in practice, trajectories are padded to the longest in the group, with loss and KL terms at padded positions masked to zero, and the statistics \(\mu_j, \sigma_j\) are computed only over trajectories that have a valid token at position \(j\).) The normalized advantage is:
\[\hat{A}_j^g = \frac{r_j^g - \mu_j}{\sigma_j}\]
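The masking convention above can be sketched in a few lines of numpy (function and variable names are my own, not from a specific framework):

```python
import numpy as np

# Sketch of per-position group statistics with variable-length trajectories:
# pad rewards to the longest trajectory in the group, then compute mu_j and
# sigma_j only over trajectories with a valid token at position j.
def group_normalize(r, mask, eps=1e-8):
    """r, mask: (G, H_max) arrays; returns masked advantages of shape (G, H_max)."""
    n = np.maximum(mask.sum(axis=0), 1.0)          # valid count at each position j
    mu = (r * mask).sum(axis=0) / n                # mu_j over valid trajectories only
    var = (((r - mu) ** 2) * mask).sum(axis=0) / n
    adv = (r - mu) / (np.sqrt(var) + eps)
    return adv * mask                              # zero out padded positions

r = np.array([[1.0, 2.0, 3.0],
              [3.0, 4.0, 0.0]])                    # 2nd trajectory has length 2
mask = np.array([[1.0, 1.0, 1.0],
                 [1.0, 1.0, 0.0]])
adv = group_normalize(r, mask)
```

Positions where only one trajectory is valid get zero variance; the `eps` in the denominator keeps the division finite there.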
Because \(r_j^g\) already carries the KL penalty, the separate \(-\beta K_j^g\) term in \(\mathcal{L}_{\text{GRPO}}\) is no longer needed; it would double-count the regularization. The loss simplifies to a pure clipped surrogate with no additive KL term:
\[\mathcal{L}_{\text{ours}} = \frac{1}{G}\sum_{g=1}^{G} \frac{1}{H}\sum_{j=1}^{H} \min\bigl(\rho_j^g \hat{A}_j^g,\ \mathrm{clip}(\rho_j^g)\,\hat{A}_j^g\bigr)\]
where \(\mathrm{clip}(\rho_j^g) = \mathrm{clamp}(\rho_j^g, 1-\epsilon, 1+\epsilon)\) and \(\rho_j\) is a proper off-policy correction term, either at the token level or the sequence level as defined in the next section.
To summarize: standard GRPO = reward-only advantage + separate KL penalty. We replace both with a single KL-aware advantage — the KL migrates from a standalone penalty into the advantage itself, and the \(-\beta K_j^g\) term disappears from the loss.
GAE Implementation
The token-level reward \(r_j(\tau)\) can be implemented directly in mainstream RL frameworks via Generalized Advantage Estimation. A note on notation: \(r_j(\tau)\) as defined above is a reward-to-go (cumulative return from position \(j\)), not a single-step reward. With \(\gamma = \lambda = 1\) and no critic (\(V = 0\)), GAE reduces to the sum of future per-step rewards. Applying the standard KL-in-reward trick — where the environment reward is zero except at the terminal token (the standard RLHF/PPO setup) and each token incurs a KL penalty:
\[r(s_k, a_k) = \underbrace{R^{\mathrm{env}}_k}_{\substack{0 \text{ if } k < H \\ r(\tau) \text{ if } k = H}} - \;\frac{\beta}{H}\, K_k\]
The GAE advantage at position \(j\) becomes exactly our token-level reward:
\[\hat{A}_j^{\mathrm{GAE}(1,1)} = \sum_{k=j}^{H} r(s_k, a_k) = r(\tau) - \frac{\beta}{H}\sum_{k=j}^{H} K_k = r_j(\tau)\]
The task reward \(r(\tau)\) appears exactly once (from the terminal step), so every position sees it with weight 1. Each KL term \(K_k\) contributes \(\frac{\beta}{H}\) to all positions \(j \leq k\). The asymmetry between the two weights (1 vs \(\frac{\beta}{H}\)) arises naturally from the placement: \(r(\tau)\) lives at one step while \(K_k\) lives at every step. The group normalization \(\hat{A}_j^g = (r_j^g - \mu_j)/\sigma_j\) then plays the role of a baseline, replacing the absent critic.
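As a concrete check (a small sketch with made-up numbers, not framework code), the reverse cumulative sum of these per-step rewards reproduces \(r_j(\tau)\) exactly:

```python
import numpy as np

# GAE(1,1) with no critic and the KL-in-reward trick: the per-step reward is
# -(beta/H) * K_k at every token, plus r(tau) at the terminal token; the
# reverse cumulative sum then equals the token-level reward r_j(tau).
def gae_token_rewards(r_tau, K, beta):
    H = len(K)
    step_rewards = -(beta / H) * K
    step_rewards[-1] += r_tau                     # env reward only at k = H
    return np.cumsum(step_rewards[::-1])[::-1]    # A_j = sum_{k=j}^{H} r(s_k, a_k)

K = np.array([0.2, -0.1, 0.3])                    # per-token log-ratios K_k
adv = gae_token_rewards(r_tau=1.0, K=K, beta=0.3)

# Direct formula r_j = r(tau) - (beta/H) * sum_{k=j}^{H} K_k agrees:
direct = 1.0 - (0.3 / 3) * np.cumsum(K[::-1])[::-1]
assert np.allclose(adv, direct)
```

This is why the trick drops into any framework that already supports GAE: only the per-step reward vector changes.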
Derivation: from GAE to undiscounted MC return
GAE with discount \(\gamma\) and trace parameter \(\lambda\) is:
$$\hat{A}^{\mathrm{GAE}}_t = \sum_{t'=t}^{T} (\gamma\lambda)^{t'-t} \delta_{t'}, \qquad \delta_{t'} = r_{t'} + \gamma \hat{V}(s_{t'+1}) - \hat{V}(s_{t'})$$
Setting \(\gamma = \lambda = 1\) gives \(\hat{A}_t = \sum_{t'=t}^{T} \delta_{t'}\). The value-function terms telescope: \(\sum_{t'=t}^{T}[\hat{V}(s_{t'+1}) - \hat{V}(s_{t'})] = \hat{V}(s_{T+1}) - \hat{V}(s_t) = -\hat{V}(s_t)\) (the terminal state has \(\hat{V} = 0\)), leaving:
$$\hat{A}_t^{\mathrm{GAE}(1,1)} = \sum_{t'=t}^{T} r_{t'} - \hat{V}(s_t)$$
GRPO is critic-free (\(\hat{V} = 0\)), so the advantage reduces to the undiscounted Monte Carlo return: \(\hat{A}_j = \sum_{k=j}^{H} r(s_k, a_k)\). Substituting the per-step reward and splitting terminal vs non-terminal:
$$\hat{A}_j = \underbrace{\sum_{k=j}^{H-1}\Bigl(-\frac{\beta}{H}K_k\Bigr)}_{\text{non-terminal: KL only}} + \;\underbrace{\Bigl(r(\tau) - \frac{\beta}{H}K_H\Bigr)}_{\text{terminal: task reward + KL}} = r_j(\tau) \quad \checkmark$$
Why \(\gamma = 1\) is necessary
With \(\gamma < 1\), the GAE return becomes:
$$\hat{A}_j = \gamma^{H-j}\,r(\tau) - \frac{\beta}{H}\sum_{k=j}^{H}\gamma^{k-j}K_k$$
The discount \(\gamma^{H-j}\) would make earlier tokens see a smaller task reward — but the derivation above shows \(r(\tau)\) must have weight 1 at every position. So \(\gamma = 1\) is not a free choice; it is uniquely determined by the weight structure of \(r_j(\tau)\).
Beyond this structural simplification, the advantage \(\hat{A}_j^g\) is now position-dependent, unlike standard GRPO's \(\hat{A}^g\) which is the same for every token. This is because \(r_j^g\) depends on the future KL sum \(\sum_{k=j}^H K_k^g\): earlier tokens bear more future KL cost than later tokens. Concretely:
\[r_j^g = r(\tau^g) - \frac{\beta}{H}\sum_{k=j}^{H} K_k^g, \qquad r_j^g - r_{j+1}^g = -\frac{\beta}{H}\, K_j^g\]
The group normalization \(\hat{A}_j^g = \frac{r_j^g - \mu_j}{\sigma_j}\) therefore computes a different baseline at each position. A trajectory that takes a high-KL detour in the middle will have its advantage reduced at early positions (which “bear” the future KL cost of that detour) but not at late positions (after the detour is over). In \(\mathcal{L}_{\text{GRPO}}\), by contrast, the total KL is a flat penalty \(-\beta K_j^g\) at every position — it cannot make this distinction.
The implications for clipping and IS ratios are discussed in the off-policy comparison.
Why does standard GRPO keep KL separate? This is a deliberate design choice, not a theoretical necessity. The original GRPO formulation (Shao et al., 2024) inherits the KL penalty from PPO's constrained optimization framework, where KL appears as a Lagrangian penalty term independent of the reward. As noted in DeepSeekMath (Shao et al., 2024): "instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence to the loss, avoiding complicating the calculation of \(\hat{A}\)." The practical reason: standard GRPO computes advantage via group normalization over task rewards:
\[\hat{A}^g = \frac{r(\tau^g) - \mu}{\sigma}, \qquad \mu = \frac{1}{G}\sum_{g=1}^{G} r(\tau^g),\quad \sigma = \mathrm{std}\bigl(\{r(\tau^g)\}_{g=1}^{G}\bigr)\]
If KL were folded into the reward before normalization, the group mean \(\mu\) and standard deviation \(\sigma\) would be contaminated by KL costs that vary across trajectories for reasons unrelated to task performance. A trajectory with high KL but identical task reward would shift the baseline, distorting the advantage signal for all other trajectories in the group. Keeping KL separate preserves clean reward statistics: \(\mu\) and \(\sigma\) reflect pure task performance, and the KL penalty is applied uniformly afterward.
Our formulation makes the opposite trade-off: we accept that KL enters the group statistics, because this is precisely what enables position-dependent credit assignment. The group mean \(\mu_j\) and standard deviation \(\sigma_j\) at each position \(j\) now reflect both task performance and KL cost — a trajectory that achieves the same reward but with lower KL deviation will receive higher advantage. This is a feature, not a bug: the advantage signal directly encodes the trade-off between reward and reference-adherence at each token position.
Off-Policy Correction
Importance Sampling Correction
We may also consider off-policy correction:
\[\mathcal{L}(\theta) = -\frac{1}{H}\sum_{j=1}^{H} \rho_j\, r_j(\tau)\, \ln \pi_\theta(a_j \vert \tau_{j-1})\]
where \(\rho_j\) is the importance weight that corrects for the distribution mismatch between the current policy \(\pi_\theta\) and the behavior policy \(\pi_{\mathrm{old}}\) (the policy that generated the trajectory). It can be defined at two levels:
Token-level correction:
\[\rho_j = \frac{\pi_\theta(a_j \vert \tau_{j-1})}{\pi_{\mathrm{old}}(a_j \vert \tau_{j-1})}\]
Sequence-level correction:
\[\rho_j = \Bigl(\prod_{k=j}^{H} \frac{\pi_\theta(a_k \vert \tau_{k-1})}{\pi_{\mathrm{old}}(a_k \vert \tau_{k-1})}\Bigr)^{\frac{1}{H-j+1}}\]
Where does this come from? Recall that the trajectories were sampled from \(\pi_{\mathrm{old}}\), but we want to evaluate the gradient under \(\pi_\theta\). The token-level correction \(\rho_j = \frac{\pi_\theta(a_j)}{\pi_{\mathrm{old}}(a_j)}\) only fixes the mismatch at one token. Why might we need more?
Consider what the loss term at position \(j\) looks like: \(\rho_j \cdot r_j(\tau) \cdot \ln \pi_\theta(a_j \vert \tau_{j-1})\). The reward \(r_j(\tau) = r(\tau) - \frac{\beta}{H}\sum_{i=j}^H K_i\) contains the KL log-ratios at all future steps \(i = j, j{+}1, \ldots, H\). These future \(K_i\) values depend on future actions \(a_{j+1}, \ldots, a_H\), which were sampled from \(\pi_{\mathrm{old}}\). Under \(\pi_\theta\), those future actions would have a different distribution, so \(r_j(\tau)\) would take different values on average. A single-token \(\rho_j\) does not correct for this — it only reweights the probability of \(a_j\) itself, not the distribution of the future actions that determine \(r_j\).
To properly account for this, we need the importance weight over the entire future trajectory from step \(j\) onward: \(r_j\) depends on every future action, while a token-level correction covers only \(a_j\).
The ideal weight is the full likelihood ratio over the future trajectory:
\[\rho_{j:H} = \prod_{k=j}^{H} \frac{\pi_\theta(a_k \vert \tau_{k-1})}{\pi_{\mathrm{old}}(a_k \vert \tau_{k-1})}\]
This is a product of \(H - j + 1\) per-token ratios. The problem is that this product can have exponentially high variance: if each ratio fluctuates by a factor of 2, the product fluctuates by \(2^{H-j+1}\). To tame this, we take the geometric mean (i.e., the \((H-j+1)\)-th root), which keeps the correction on the same scale as a single-token ratio while still capturing the overall distributional shift:
\[\rho_j = \Bigl(\prod_{k=j}^{H} \frac{\pi_\theta(a_k \vert \tau_{k-1})}{\pi_{\mathrm{old}}(a_k \vert \tau_{k-1})}\Bigr)^{\frac{1}{H-j+1}}\]
The variance explosion is easy to see numerically: sample \(H\) per-token ratios \(\rho_k = e^{X_k}\) with \(X_k \sim \mathcal{N}(0, \sigma^2)\), then compare the raw product \(\prod_k \rho_k\), the geometric mean \((\prod_k \rho_k)^{1/H}\), and a single-token ratio \(\rho_1\). As \(H\) or \(\sigma\) grows, the product's distribution spreads explosively while the geometric mean stays tightly concentrated.
This is a bias-variance trade-off: the token-level \(\rho_j\) ignores future mismatch (low variance, potentially biased), the full product captures it exactly (unbiased, high variance), and the geometric mean interpolates between them.
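A minimal Monte Carlo sketch of this comparison (parameters are illustrative):

```python
import numpy as np

# Monte Carlo sketch of the variance comparison: per-token ratios
# rho_k = exp(X_k) with X_k ~ N(0, sigma^2). Compare the raw product,
# its geometric mean, and a single-token ratio.
rng = np.random.default_rng(0)
H, sigma, trials = 50, 0.3, 10_000
X = rng.normal(0.0, sigma, size=(trials, H))

raw_product = np.exp(X.sum(axis=1))   # prod_k rho_k: spread grows like e^{O(H sigma^2)}
geo_mean = np.exp(X.mean(axis=1))     # (prod_k rho_k)^(1/H): concentrated near 1
single = np.exp(X[:, 0])              # rho_1: one-token baseline

assert raw_product.std() > single.std() > geo_mean.std()
```

In log space the product has standard deviation \(\sigma\sqrt{H}\), the single ratio \(\sigma\), and the geometric mean \(\sigma/\sqrt{H}\), which is exactly the ordering the assertion checks.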
Comparison with Standard GRPO: Clipping and IS Ratios
The off-policy correction interacts with KL differently in the two formulations. To see this, compare the gradient contributions at token \(j\):
Standard GRPO — gradient at token \(j\):
\[\underbrace{\nabla_\theta\, \min\bigl(\rho_j \hat{A}^g,\ \mathrm{clip}(\rho_j)\,\hat{A}^g\bigr)}_{\text{reward term}} \;-\; \underbrace{\beta\, \nabla_\theta K_j^g}_{\text{KL term}}\]
Ours — gradient at token \(j\):
\[\nabla_\theta\, \min\bigl(\rho_j \hat{A}_j^g,\ \mathrm{clip}(\rho_j)\,\hat{A}_j^g\bigr), \qquad \hat{A}_j^g = \frac{r_j^g - \mu_j}{\sigma_j}\]
In \(\nabla_\theta \mathcal{L}_{\text{GRPO}}\), the reward term and the KL term are decoupled — the advantage \(\hat{A}^g\) knows nothing about KL, and the KL penalty knows nothing about the reward. The clipping mechanism \(\min(\rho_j \hat{A}^g, \mathrm{clip}(\rho_j) \hat{A}^g)\) only considers the reward-based advantage when deciding whether to trust the update. A token that deviates heavily from the reference but contributes to a high-reward completion receives a large positive \(\hat{A}^g\) (encouraging more deviation) and a large \(\beta \nabla K_j^g\) (discouraging deviation). The two forces fight, but clipping only constrains the first.
In \(\nabla_\theta \mathcal{L}_{\text{ours}}\), there is no separate KL gradient — it has been absorbed into \(\hat{A}_j^g\) through \(r_j\). A token that deviates heavily from the reference has its advantage reduced before clipping decides whether to trust the update. Clipping and KL regularization work together rather than independently.
The IS ratio \(\rho_j\) also differs between the two formulations. Compare what each \(\rho_j\) needs to correct:
Standard GRPO — token-level IS ratio suffices:
\[\rho_j = \frac{\pi_\theta(a_j \vert \tau_{j-1})}{\pi_{\mathrm{old}}(a_j \vert \tau_{j-1})}\]
Ours — sequence-level correction via geometric mean:
\[\rho_j = \Bigl(\prod_{k=j}^{H} \frac{\pi_\theta(a_k \vert \tau_{k-1})}{\pi_{\mathrm{old}}(a_k \vert \tau_{k-1})}\Bigr)^{\frac{1}{H-j+1}}\]
Why the difference? In \(\mathcal{L}_{\text{GRPO}}\), the advantage \(\hat{A}^g\) depends only on the task reward \(r(\tau^g)\), which is a constant for the entire trajectory — no future-dependent quantity needs correction, so a single-token \(\rho_j\) suffices. In \(\mathcal{L}_{\text{ours}}\), the token-level reward \(r_j^g\) contains the future KL sum \(\sum_{k=j}^H K_k^g\), which depends on future actions \(a_{j+1}, \ldots, a_H\) sampled from \(\pi_{\mathrm{old}}\). As derived above, correcting for this future mismatch requires the full trajectory likelihood ratio from step \(j\) onward; the geometric mean is a variance-reduction compromise.
This is a genuine trade-off: \(\mathcal{L}_{\text{ours}}\) provides finer-grained credit assignment and more coherent clipping, but requires a more careful IS correction. The geometric mean introduces bias relative to the exact full-product correction, whereas \(\mathcal{L}_{\text{GRPO}}\)’s simpler IS treatment is exact for its (coarser) advantage definition.
Both formulations optimize the same objective and converge to the same optimum in the limit:
\[\pi^*(\tau) \;\propto\; \pi_{\mathrm{ref}}(\tau)\, \exp\Bigl(\frac{H}{\beta}\, r(\tau)\Bigr)\]
The practical difference is in convergence speed and stability: \(\mathcal{L}_{\text{ours}}\) provides a lower-variance advantage signal (position-dependent credit) and more coherent clipping (reward-minus-KL), at the cost of a more involved IS correction.
| | \(\mathcal{L}_{\text{GRPO}}\) (Standard) | \(\mathcal{L}_{\text{ours}}\) (Token-Level KL-GRPO) |
|---|---|---|
| KL placement | Separate \(-\beta K_j^g\) after clipping | Absorbed into \(r_j^g\) before advantage |
| Advantage | \(\hat{A}^g\): same for all tokens | \(\hat{A}_j^g\): varies by position |
| KL scope per token | \(K_j^g\) (this token only) | \(\sum_{k=j}^H K_k^g\) (future only) |
| Clipping sees KL? | No | Yes |
| IS ratio | Token-level \(\rho_j\) | Geometric mean over future |
| Credit assignment | Uniform | Position-dependent |
A General Family of Off-Policy Correction
Assume we generate trajectories using the behavior policy \(\pi_{\mathrm{old}}\), and that our current policy is \(\pi_\theta\). Let \(\rho_j\) be the sequence-level correction defined above, and consider the following family of algorithms:
\[\mathcal{L}_f = \frac{1}{G}\sum_{g=1}^{G} \frac{1}{H}\sum_{j=1}^{H}\, \min_{1 \leq k \leq m}\, f_k(\rho_j^g)\, \hat{A}_j^g\]
where each \(f_k(\rho)\) is non-decreasing in \(\rho\), and
\[f_k(\rho) \;\in\; \bigl[\min(\rho, 1),\ \max(\rho, 1)\bigr].\]
Why these constraints? The constraint \(f_k(\rho) \in [\min(\rho,1),\, \max(\rho,1)]\) ensures that \(f_k(\rho)\hat{A}\) always has the same sign as \(\rho\hat{A}\). In other words, the direction of the update is never reversed — if the full IS correction says “increase the probability of this action,” the modified version agrees, just with a different magnitude. Combined with monotonicity (\(f_k\) non-decreasing), this ensures the algorithm still moves in a direction of improvement.
The two extremes. The endpoints of this family correspond to two well-known algorithms:
- \(f_k(\rho) = 1\): Policy improvement; the IS weight is dropped entirely. The objective becomes \(\frac{1}{GH}\sum_{g,j} \hat{A}_j^g = \mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}_j]\), the expected advantage under \(\pi_{\mathrm{old}}\)'s trajectory distribution. Note a subtlety: the policy improvement theorem (with \(\pi = \pi_{\mathrm{old}},\, \pi' = \pi_\theta\)) requires \(\mathbb{E}_{\pi_\theta}[A^{\pi_{\mathrm{old}}}(s,a)] \geq 0\), i.e. the expectation under \(\pi_\theta\), not \(\pi_{\mathrm{old}}\). Our \(f = 1\) objective evaluates under \(\pi_{\mathrm{old}}\) instead, so the theorem does not directly apply. The gap is exactly the missing IS correction: \(\mathbb{E}_{\pi_\theta}[A^{\pi_{\mathrm{old}}}] = \mathbb{E}_{\pi_{\mathrm{old}}}[\rho \cdot A^{\pi_{\mathrm{old}}}]\), and setting \(f = 1\) replaces \(\rho\) with \(1\). When \(\pi_\theta \approx \pi_{\mathrm{old}}\) (small updates), \(\rho \approx 1\) and the approximation is tight, so \(f = 1\) still works in practice; this is the regime where trust-region methods (TRPO, PPO) operate.
-
\(f_k(\rho) = \rho\): Policy gradient — full importance sampling correction. The objective becomes \(\frac{1}{GH}\sum_{g,j} \frac{\pi_\theta(a_j^g)}{\pi_{\mathrm{old}}(a_j^g)} \hat{A}_j^g \approx \mathbb{E}_{\pi_\theta}[\hat{A}_j]\), converting the expectation from \(\pi_{\mathrm{old}}\) to \(\pi_\theta\) — unbiased in principle. But the variance can be catastrophic: for a sequence of length \(H\), the product \(\prod_{j=1}^H \rho_j\) fluctuates exponentially. If each per-token ratio has standard deviation \(\sigma\), the variance of the product scales as \(e^{O(H\sigma^2)}\), making learning unstable for long sequences.
-
\(f_k(\rho) = \mathrm{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\): PPO — a practical middle ground. The clipped objective uses \(m=2\) functions in the general formula:
为什么要这些约束? 约束 \(f_k(\rho) \in [\min(\rho,1),\, \max(\rho,1)]\) 确保 \(f_k(\rho)\hat{A}\) 与 \(\rho\hat{A}\) 符号始终相同。换言之,更新的方向永远不会反转——如果完整的 IS 校正说“增加这个动作的概率”,修改后的版本也同意,只是幅度不同。结合单调性(\(f_k\) 非递减),这确保了算法仍然沿着改进的方向移动。
两个极端与一个折中。 这个算法族的两个端点对应两个经典算法,clipping 则给出二者之间的实用折中:
-
\(f_k(\rho) = 1\):策略改进——完全丢弃 IS 权重。目标函数变为 \(\frac{1}{GH}\sum_{g,j} \hat{A}_j^g = \mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}_j]\),即在 \(\pi_{\mathrm{old}}\) 的轨迹分布下评估的期望 advantage。注意一个微妙之处:策略改进定理(令 \(\pi = \pi_{\mathrm{old}},\, \pi' = \pi_\theta\))要求 \(\mathbb{E}_{\pi_\theta}[A^{\pi_{\mathrm{old}}}(s,a)] \geq 0\)——期望在 \(\pi_\theta\) 下取,而非 \(\pi_{\mathrm{old}}\)。我们的 f=1 目标函数在 \(\pi_{\mathrm{old}}\) 下取期望,因此定理并不直接适用。差距恰好是缺失的 IS 校正:\(\mathbb{E}_{\pi_\theta}[A^{\pi_{\mathrm{old}}}] = \mathbb{E}_{\pi_{\mathrm{old}}}[\rho \cdot A^{\pi_{\mathrm{old}}}]\),而 f=1 将 \(\rho\) 替换为 \(1\)。当 \(\pi_\theta \approx \pi_{\mathrm{old}}\)(小更新)时,\(\rho \approx 1\),近似是紧的,所以 f=1 在实践中仍然有效——这正是信赖域方法(TRPO、PPO)运作的范围。
-
\(f_k(\rho) = \rho\):策略梯度——完整的重要性采样校正。目标函数变为 \(\frac{1}{GH}\sum_{g,j} \frac{\pi_\theta(a_j^g)}{\pi_{\mathrm{old}}(a_j^g)} \hat{A}_j^g \approx \mathbb{E}_{\pi_\theta}[\hat{A}_j]\),通过重要性采样将期望从 \(\pi_{\mathrm{old}}\) 转换为 \(\pi_\theta\)——原则上无偏。但方差可能是灾难性的:对于长度为 \(H\) 的序列,乘积 \(\prod_{j=1}^H \rho_j\) 指数级波动。如果每个 per-token ratio 的标准差为 \(\sigma\),乘积的方差量级为 \(e^{O(H\sigma^2)}\),使长序列的学习不稳定。
-
\(f_k(\rho) = \mathrm{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\):PPO——实用的折中方案。Clipped objective 在一般公式中使用 \(m=2\) 个函数:
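To make the family concrete, here is a small sketch (pure Python; function names are ours) of the three choices of \(f\) above, together with a check of the envelope constraint \(f(\rho) \in [\min(\rho,1),\, \max(\rho,1)]\):

```python
def f_improvement(rho):
    """f(rho) = 1: drop the IS weight entirely (policy improvement)."""
    return 1.0

def f_policy_gradient(rho):
    """f(rho) = rho: full importance sampling correction (policy gradient)."""
    return rho

def make_f_clip(eps=0.2):
    """f(rho) = clip(rho, 1-eps, 1+eps): PPO-style middle ground."""
    def f(rho):
        return min(max(rho, 1.0 - eps), 1.0 + eps)
    return f

def satisfies_envelope(f, rhos):
    """Check f(rho) in [min(rho,1), max(rho,1)], so f(rho)*A keeps the sign of rho*A."""
    return all(min(r, 1.0) <= f(r) <= max(r, 1.0) for r in rhos)

# All three members stay inside the envelope on a grid of ratios.
rhos = [0.01, 0.5, 0.9, 1.0, 1.1, 2.0, 10.0]
for f in (f_improvement, f_policy_gradient, make_f_clip(0.2)):
    assert satisfies_envelope(f, rhos)
```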
PPO’s clipped objective combines two of these via \(\min\):
PPO 的 clipped objective 通过 \(\min\) 组合了其中两个:
where \(f_1(\rho) = \rho\) and \(f_2(\rho) = \mathrm{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\). The \(\min\) acts as a pessimistic lower bound: when \(\hat{A} > 0\), it prevents \(\rho > 1{+}\epsilon\) from inflating the credit; when \(\hat{A} < 0\), it prevents \(\rho < 1{-}\epsilon\) from under-penalizing.
The bias-variance tradeoff. The choice of \(f_k\) controls a fundamental tradeoff:
| Choice | Bias | Variance | When to prefer |
|---|---|---|---|
| \(f(\rho) \approx 1\) | High (old distribution) | Low | Early training, large policy shifts |
| \(f(\rho) \approx \rho\) | Low (current distribution) | High | Late training, small policy shifts |
The optimal \(f_k\) likely depends on how far \(\pi_\theta\) has drifted from \(\pi_{\mathrm{old}}\). When the policies are close (\(\rho \approx 1\)), all choices of \(f\) are similar. When they diverge, the choice matters significantly — and an adaptive \(f\) that adjusts based on the observed \(\rho\) distribution could outperform any fixed choice.
Open question. Can we design an \(f_k\) that adapts during training — perhaps starting near \(f=\rho\) (unbiased, appropriate while \(\rho\) stays close to 1) and gradually moving toward \(f=1\) (safe, as \(\rho\) drifts further from 1)? This would be an adaptive off-policy correction that automatically navigates the bias-variance tradeoff.
其中 \(f_1(\rho) = \rho\),\(f_2(\rho) = \mathrm{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\)。\(\min\) 起到悲观下界的作用:当 \(\hat{A} > 0\) 时,防止 \(\rho > 1{+}\epsilon\) 膨胀 credit;当 \(\hat{A} < 0\) 时,防止 \(\rho < 1{-}\epsilon\) 弱化惩罚。
偏差-方差权衡。 \(f_k\) 的选择控制着一个根本性的权衡:
| 选择 | 偏差 | 方差 | 适用场景 |
|---|---|---|---|
| \(f(\rho) \approx 1\) | 高(旧分布) | 低 | 训练早期,策略变化大 |
| \(f(\rho) \approx \rho\) | 低(当前分布) | 高 | 训练后期,策略变化小 |
最优的 \(f_k\) 可能取决于 \(\pi_\theta\) 偏离 \(\pi_{\mathrm{old}}\) 的程度。当策略接近时(\(\rho \approx 1\)),所有 \(f\) 的选择都类似。当策略分歧时,选择的影响显著——一个根据观测到的 \(\rho\) 分布自适应调整的 \(f\) 可能优于任何固定选择。
开放问题。 我们能否设计一个在训练过程中自适应的 \(f_k\)——在 \(\rho\) 接近 1 时从 \(f \approx \rho\)(无偏)开始,随着 \(\rho\) 偏离 1 逐渐趋向 \(f \approx 1\)(安全)?这将是一种自适应离策略校正,自动在偏差-方差权衡中导航。
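The exponential-variance claim for the full product \(\prod_{j=1}^H \rho_j\) (the \(f=\rho\) extreme above) can be checked numerically. A toy sketch under our own modeling assumption: i.i.d. lognormal per-token ratios with \(\mathbb{E}[\rho_j] = 1\) exactly.

```python
import math
import random
import statistics

def product_ratio(H, sigma, rng):
    """Product of H i.i.d. per-token ratios rho_j = exp(N(-sigma^2/2, sigma^2)),
    parameterized so that E[rho_j] = 1 exactly (lognormal toy model)."""
    return math.exp(sum(rng.gauss(-sigma ** 2 / 2, sigma) for _ in range(H)))

rng = random.Random(0)
sigma = 0.3
short = [product_ratio(10, sigma, rng) for _ in range(2000)]
long_ = [product_ratio(200, sigma, rng) for _ in range(2000)]

# Each product has mean 1, but Var = exp(H * sigma^2) - 1 under this model,
# so the long-horizon products are wildly more dispersed.
assert statistics.pvariance(long_) > statistics.pvariance(short)
```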
Some Experiments to Try for Off-Policy Correction
离策略校正的一些实验方向
We now consider the off-policy correction problem from a more experimental perspective. Rather than focusing on the token-level KL formulation above, we return to a simpler setting — the standard GRPO loss without KL — to isolate the effect of different off-policy correction strategies.
We start with the following loss formula for a generalization of GRPO with group size \(G\):
我们现在从更偏实验的角度考虑离策略校正问题。我们不再关注上面的 token-level KL 公式,而是回到更简单的设定——不含 KL 的标准 GRPO 损失——以隔离不同离策略校正策略的效果。
我们从以下推广的 GRPO 损失公式出发,组大小为 \(G\):
where \(\mathrm{sg}(\cdot)\) denotes stop-gradient (the weight \(w_{g,i}\) is treated as a constant during backpropagation), \(H_g\) is the number of tokens in trajectory \(g\), and \(\hat{A}_g\) is the GRPO group advantage:
其中 \(\mathrm{sg}(\cdot)\) 表示 stop-gradient(权重 \(w_{g,i}\) 在反向传播中被视为常数),\(H_g\) 是轨迹 \(g\) 的 token 数量,\(\hat{A}_g\) 是 GRPO 的组 advantage:
This is a REINFORCE-style loss where the gradient flows through \(\ln \pi_\theta\) only, while \(w_{g,i}\) acts as a stop-gradiented weight controlling how strongly each token contributes to the update. The gradient of this loss with respect to \(\theta\) is:
这是一个 REINFORCE 风格的损失,梯度仅通过 \(\ln \pi_\theta\) 传递,而 \(w_{g,i}\) 作为 stop-gradient 权重控制每个 token 对更新的贡献强度。该损失关于 \(\theta\) 的梯度为:
The choice of \(w_{g,i}\) determines the off-policy correction strategy. Different choices lead to different algorithms, all sharing the same REINFORCE structure but differing in how they handle the distribution mismatch between \(\pi_\theta\) (current policy) and \(\pi_{\mathrm{old}}\) (behavior policy that generated the trajectories).
\(w_{g,i}\) 的选择决定了离策略校正策略。不同的选择导致不同的算法,它们共享相同的 REINFORCE 结构,但在处理 \(\pi_\theta\)(当前策略)与 \(\pi_{\mathrm{old}}\)(生成轨迹的行为策略)之间的分布偏差时有所不同。
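A minimal sketch of this loss in pure Python (the per-trajectory \(1/H_g\) normalization and all function names are our assumptions; in a real autograd framework the weights would be detached so that only the log-probabilities carry gradient):

```python
import statistics

def group_advantages(rewards):
    """GRPO group advantage: (r_g - mean) / std over the G rewards of one group."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + 1e-8) for r in rewards]

def grpo_loss(logps, weights, rewards):
    """logps[g][i] = ln pi_theta(a_i) in trajectory g; weights[g][i] = sg(w_{g,i}).

    The weights are plain floats here, mirroring the stop-gradient: they scale
    each token's contribution but would receive no gradient themselves.
    """
    adv = group_advantages(rewards)
    total = 0.0
    for g, (lp, w) in enumerate(zip(logps, weights)):
        H_g = len(lp)
        total += sum(w_i * adv[g] * lp_i for w_i, lp_i in zip(w, lp)) / H_g
    return -total / len(logps)  # minimize the negative of the objective
```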
PPO-Style Clipping as a Choice of \(w_{g,i}\)
PPO 风格 Clipping 作为 \(w_{g,i}\) 的一种选择
For standard GRPO (which uses PPO-style clipping), the weight is:
对于标准 GRPO(使用 PPO 风格 clipping),权重为:
Why does this reproduce PPO? The key is that this formula encodes the gradient behavior of the standard PPO clipped objective
为什么这等价于 PPO? 关键在于这个公式编码了标准 PPO clipped objective 的梯度行为:
into a REINFORCE form. When we differentiate \(L^{\mathrm{PPO}}\) with respect to \(\theta\), using \(\nabla_\theta \rho = \rho\, \nabla_\theta \ln \pi_\theta\):
- When the \(\min\) selects \(\rho\,\hat{A}\) (gradient passes through): \(\nabla_\theta L^{\mathrm{PPO}} = \hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\).
- When the \(\min\) selects \(\mathrm{clip}(\rho)\,\hat{A}\) (clipped, constant w.r.t. \(\theta\)): \(\nabla_\theta L^{\mathrm{PPO}} = 0\).
So the PPO gradient is \(w \cdot \hat{A} \cdot \nabla_\theta \ln \pi_\theta\) where \(w = \rho\) when the gradient passes through, and \(w = 0\) when it is clipped. The indicator function \(\mathbb{I}(\cdot)\) in our formula is precisely this on/off switch. Although the two formulations have different values and different computation graphs, they produce identical gradients with respect to \(\theta\), and therefore identical optimization trajectories. Let us verify by case analysis:
转化为 REINFORCE 形式。当我们对 \(L^{\mathrm{PPO}}\) 关于 \(\theta\) 求导,利用 \(\nabla_\theta \rho = \rho\, \nabla_\theta \ln \pi_\theta\):
- 当 \(\min\) 选择 \(\rho\,\hat{A}\)(梯度通过):\(\nabla_\theta L^{\mathrm{PPO}} = \hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\)。
- 当 \(\min\) 选择 \(\mathrm{clip}(\rho)\,\hat{A}\)(被截断,关于 \(\theta\) 为常数):\(\nabla_\theta L^{\mathrm{PPO}} = 0\)。
因此 PPO 的梯度为 \(w \cdot \hat{A} \cdot \nabla_\theta \ln \pi_\theta\),其中 \(w = \rho\)(梯度通过时)或 \(w = 0\)(被截断时)。我们公式中的指示函数 \(\mathbb{I}(\cdot)\) 恰好就是这个开关。虽然两种形式的值和计算图不同,但它们关于 \(\theta\) 产生完全一样的梯度,因此优化轨迹完全相同。逐情况验证:
| Case | \(\rho - \mathrm{clip}(\rho)\) | Condition \((\cdot)\hat{A} \leq 0\)? | \(w\) | PPO behavior |
|---|---|---|---|---|
| \(\rho \in [1{-}\epsilon,\, 1{+}\epsilon]\) | \(0\) | \(0 \cdot \hat{A} = 0 \leq 0\): yes | \(\rho\) | Gradient passes through ✓ |
| \(\rho > 1{+}\epsilon,\; \hat{A} > 0\) | \(> 0\) | \((+)(+) > 0\): no | \(0\) | Clipped: prevents inflating credit ✓ |
| \(\rho > 1{+}\epsilon,\; \hat{A} < 0\) | \(> 0\) | \((+)(-) < 0\): yes | \(\rho\) | Not clipped: allows penalizing bad actions ✓ |
| \(\rho < 1{-}\epsilon,\; \hat{A} > 0\) | \(< 0\) | \((-)(+) < 0\): yes | \(\rho\) | Not clipped: allows rewarding good actions ✓ |
| \(\rho < 1{-}\epsilon,\; \hat{A} < 0\) | \(< 0\) | \((-)(-) > 0\): no | \(0\) | Clipped: prevents over-penalizing ✓ |
| 情况 | \(\rho - \mathrm{clip}(\rho)\) | 条件 \((\cdot)\hat{A} \leq 0\)? | \(w\) | PPO 行为 |
|---|---|---|---|---|
| \(\rho \in [1{-}\epsilon,\, 1{+}\epsilon]\) | \(0\) | \(0 \cdot \hat{A} = 0 \leq 0\):是 | \(\rho\) | 梯度通过 ✓ |
| \(\rho > 1{+}\epsilon,\; \hat{A} > 0\) | \(> 0\) | \((+)(+) > 0\):否 | \(0\) | 截断:防止膨胀 credit ✓ |
| \(\rho > 1{+}\epsilon,\; \hat{A} < 0\) | \(> 0\) | \((+)(-) < 0\):是 | \(\rho\) | 未截断:允许惩罚坏动作 ✓ |
| \(\rho < 1{-}\epsilon,\; \hat{A} > 0\) | \(< 0\) | \((-)(+) < 0\):是 | \(\rho\) | 未截断:允许奖励好动作 ✓ |
| \(\rho < 1{-}\epsilon,\; \hat{A} < 0\) | \(< 0\) | \((-)(-) > 0\):否 | \(0\) | 截断:防止过度惩罚 ✓ |
In summary, \(w_{g,i} = 0\) (gradient clipped) precisely when:
- \(\hat{A}_g > 0\) and \(\rho_{g,i} > 1 + \epsilon\): the current policy already assigns much more probability to this action than the old policy did, and the advantage is positive. PPO prevents further increasing — the policy has already moved enough in the right direction.
- \(\hat{A}_g < 0\) and \(\rho_{g,i} < 1 - \epsilon\): the current policy already assigns much less probability to this action, and the advantage is negative. PPO prevents further decreasing — the policy has already moved enough away from this bad action.
In all other cases, \(w_{g,i} = \rho_{g,i}\), and the gradient passes through with full importance sampling correction.
总结:\(w_{g,i} = 0\)(梯度被截断)恰好在以下情况:
- \(\hat{A}_g > 0\) 且 \(\rho_{g,i} > 1 + \epsilon\):当前策略已经比旧策略赋予该动作高得多的概率,且 advantage 为正。PPO 阻止继续增加——策略已经在正确方向上移动了足够多。
- \(\hat{A}_g < 0\) 且 \(\rho_{g,i} < 1 - \epsilon\):当前策略已经比旧策略赋予该动作低得多的概率,且 advantage 为负。PPO 阻止继续减少——策略已经远离了这个坏动作足够多。
在所有其他情况下,\(w_{g,i} = \rho_{g,i}\),梯度以完整的重要性采样校正通过。
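The case analysis above can be verified mechanically. A sketch (names ours) comparing the indicator-based weight with the gradient coefficient of the PPO clipped objective over a grid of \((\rho, \hat{A})\):

```python
def clip(rho, eps):
    return min(max(rho, 1 - eps), 1 + eps)

def w_indicator(rho, A, eps):
    """The stop-gradient weight: w = rho * 1[(rho - clip(rho)) * A <= 0]."""
    return rho if (rho - clip(rho, eps)) * A <= 0 else 0.0

def ppo_grad_coeff(rho, A, eps):
    """Gradient coefficient of min(rho*A, clip(rho)*A): rho when the
    pass-through branch is selected (ties included), 0 when the clip is active."""
    return rho if rho * A <= clip(rho, eps) * A else 0.0

# The two formulations agree in every case of the table above.
eps = 0.2
for rho in (0.5, 0.85, 1.0, 1.15, 1.5, 3.0):
    for A in (-1.0, 1.0):
        assert w_indicator(rho, A, eps) == ppo_grad_coeff(rho, A, eps)
```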
The Variance Problem with PPO Clipping
PPO Clipping 的方差问题
While PPO clipping zeroes out the gradient in two “already moved enough” cases, there is a subtle asymmetry that can cause high variance. Consider the case \(\hat{A}_g < 0\) and \(\rho_{g,i} \gg 1\):
- The advantage is negative (bad trajectory), so we want to decrease the probability of these actions.
- But \(\rho_{g,i} \gg 1\) means the current policy assigns much more probability to action \(a_i\) than the old policy did.
- PPO does not clip this case — the indicator gives \(w_{g,i} = \rho_{g,i}\), and the gradient contribution is \(\rho_{g,i}\, \hat{A}_g\, \nabla_\theta \ln \pi_\theta\).
- Since \(\rho_{g,i}\) can be arbitrarily large, this gradient term has unbounded magnitude, leading to high variance.
Why does large \(\rho\) cause high variance? The gradient estimator is an average over samples: \(\hat{g} = \frac{1}{N}\sum_i \rho_i\, \hat{A}_i\, \nabla_\theta \ln \pi_\theta\). Its variance is governed by
The first term \(\mathbb{E}[\rho^2 \hat{A}^2]\) grows unboundedly as \(\pi_\theta\) diverges from \(\pi_{\mathrm{old}}\), while the second term \((\mathbb{E}[\rho \hat{A}])^2\) stays bounded. The reason is that \(\mathbb{E}_{\pi_{\mathrm{old}}}[\rho] = 1\) always holds:
虽然 PPO clipping 在两种“已经移动足够多”的情况下将梯度置零,但存在一个微妙的不对称性,可能导致高方差。考虑 \(\hat{A}_g < 0\) 且 \(\rho_{g,i} \gg 1\) 的情况:
- Advantage 为负(坏轨迹),所以我们想减少这些动作的概率。
- 但 \(\rho_{g,i} \gg 1\) 意味着当前策略赋予动作 \(a_i\) 的概率比旧策略高得多。
- PPO 不会截断这种情况——指示函数给出 \(w_{g,i} = \rho_{g,i}\),梯度贡献为 \(\rho_{g,i}\, \hat{A}_g\, \nabla_\theta \ln \pi_\theta\)。
- 由于 \(\rho_{g,i}\) 可以任意大,这个梯度项的幅度无界,导致高方差。
为什么大的 \(\rho\) 会导致高方差? 梯度估计量是对样本的平均:\(\hat{g} = \frac{1}{N}\sum_i \rho_i\, \hat{A}_i\, \nabla_\theta \ln \pi_\theta\)。其方差由上式控制。第一项 \(\mathbb{E}[\rho^2 \hat{A}^2]\) 随 \(\pi_\theta\) 偏离 \(\pi_{\mathrm{old}}\) 无界增长,而第二项 \((\mathbb{E}[\rho \hat{A}])^2\) 保持有界。原因是 \(\mathbb{E}_{\pi_{\mathrm{old}}}[\rho] = 1\) 恒成立:
So \((\mathbb{E}[\rho\hat{A}])^2\) does not grow with the spread of \(\rho\). In contrast, \(\mathbb{E}[\rho^2]\) grows monotonically as \(\pi_\theta\) diverges from \(\pi_{\mathrm{old}}\) — a single sample with \(\rho = 100\) contributes \(10000\) to \(\mathbb{E}[\rho^2]\), but its contribution to \(\mathbb{E}[\rho]\) is averaged away by the many samples with \(\rho \approx 1\). The result is that a few high-\(\rho\) samples dominate the gradient, and different mini-batches produce wildly different gradient estimates.
因此 \((\mathbb{E}[\rho\hat{A}])^2\) 不随 \(\rho\) 的分散程度增长。相反,\(\mathbb{E}[\rho^2]\) 随 \(\pi_\theta\) 偏离 \(\pi_{\mathrm{old}}\) 单调增大——一个 \(\rho = 100\) 的样本对 \(\mathbb{E}[\rho^2]\) 贡献 \(10000\),但它对 \(\mathbb{E}[\rho]\) 的贡献被众多 \(\rho \approx 1\) 的样本平均掉了。结果是少数高 \(\rho\) 样本主导了梯度,不同 mini-batch 产生差异巨大的梯度估计。
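A quick numeric illustration (a sketch using a lognormal toy model of our choosing) of this point: \(\mathbb{E}[\rho] = 1\) keeps the mean bounded regardless of the policy gap, while the IS-weighted per-sample contributions become far more dispersed.

```python
import math
import random
import statistics

rng = random.Random(1)
sigma = 1.0  # log-std of the per-sample ratio; larger sigma = larger policy gap
# rho ~ LogNormal(-sigma^2/2, sigma^2)  =>  E[rho] = 1, E[rho^2] = exp(sigma^2)
rhos = [math.exp(rng.gauss(-sigma ** 2 / 2, sigma)) for _ in range(20000)]
A = [rng.choice([-1.0, 1.0]) for _ in rhos]  # toy advantages with Var(A) = 1

mean_rho = statistics.mean(rhos)
var_plain = statistics.pvariance(A)                               # w = 1
var_is = statistics.pvariance([r * a for r, a in zip(rhos, A)])   # w = rho

assert 0.9 < mean_rho < 1.1       # E[rho] = 1 no matter how wide the spread
assert var_is > 2 * var_plain     # IS weighting inflates per-sample variance
```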
Why does PPO leave this unclipped? The PPO philosophy is pessimistic: it clips updates that would overestimate improvement. When \(\hat{A} < 0\) and \(\rho \gg 1\), the large \(\rho\) amplifies the penalty — this is conservative (pushes harder against a bad action the policy has drifted toward), so PPO sees no reason to clip it. Similarly, when \(\hat{A} > 0\) and \(\rho \ll 1\), the small \(\rho\) dampens the reward signal, which is also conservative.
But “conservative in expectation” does not mean “low variance.” The unclipped \(\rho\) in these cases can be very large, injecting noise into the gradient. A natural fix is to also clip \(\rho\) from above, bounding the weight even in the unclipped cases:
为什么 PPO 不截断这种情况? PPO 的哲学是悲观的:它截断那些会高估改进的更新。当 \(\hat{A} < 0\) 且 \(\rho \gg 1\) 时,大的 \(\rho\) 放大了惩罚——这是保守的(对策略已经偏向的坏动作施加更大惩罚),所以 PPO 认为没有理由截断。类似地,当 \(\hat{A} > 0\) 且 \(\rho \ll 1\) 时,小的 \(\rho\) 抑制了奖励信号,这也是保守的。
但“期望意义上保守”不等于“低方差”。在这些未截断的情况下,\(\rho\) 可以非常大,向梯度中注入噪声。一个自然的修复方案是同时从上方截断 \(\rho\),即使在未截断的情况下也限制权重的大小:
This caps the IS weight at \(1 + \epsilon\) regardless of the sign of \(\hat{A}\), preventing any single token from dominating the gradient. Note that both PPO clipping and this upper clipping are biased relative to the full IS correction \(w = \rho\) — only \(w = \rho\) is unbiased. PPO introduces bias by zeroing out the gradient in certain cases (\(w = 0\) when clipped); upper clipping introduces bias by capping the weight (\(w = 1 + \epsilon\) when \(\rho > 1 + \epsilon\)). The difference is in where the bias is introduced: PPO’s bias is asymmetric (only clips when the update would be too optimistic, leaving the pessimistic direction unclipped — which is precisely the high-variance case), while upper clipping is symmetric (caps \(\rho\) in all cases, trading more bias for uniformly lower variance).
这将 IS 权重上限设为 \(1 + \epsilon\),无论 \(\hat{A}\) 的符号如何,防止任何单个 token 主导梯度。注意 PPO clipping 和这种上方截断相对于完整 IS 校正 \(w = \rho\) 都是有偏的——只有 \(w = \rho\) 是无偏的。PPO 通过在某些情况下将梯度置零(被截断时 \(w = 0\))引入偏差;上方截断通过限制权重上界(\(\rho > 1 + \epsilon\) 时 \(w = 1 + \epsilon\))引入偏差。区别在于偏差引入的位置:PPO 的偏差是不对称的(只在更新过于乐观时截断,而悲观方向不截断——这恰好就是高方差的情况),而上方截断是对称的(在所有情况下限制 \(\rho\),以更多偏差换取一致更低的方差)。
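One possible reading of this fix as code (a sketch; the function names and \(\epsilon\) value are ours): apply the usual PPO indicator, then cap the surviving weight at \(1 + \epsilon\) regardless of the sign of \(\hat{A}\).

```python
def clip(rho, eps):
    return min(max(rho, 1 - eps), 1 + eps)

def w_ppo(rho, A, eps):
    """Standard PPO-style stop-gradient weight: 0 when clipped, rho otherwise."""
    return rho if (rho - clip(rho, eps)) * A <= 0 else 0.0

def w_upper_clipped(rho, A, eps):
    """Same as w_ppo, but never larger than 1+eps, regardless of sign(A)."""
    return min(w_ppo(rho, A, eps), 1 + eps)

eps = 0.2
# A < 0 and rho >> 1: PPO passes the raw ratio through; upper clipping caps it.
assert w_ppo(10.0, -1.0, eps) == 10.0
assert w_upper_clipped(10.0, -1.0, eps) == 1.2
# Inside the trust region both agree with the plain IS weight.
assert w_upper_clipped(1.1, 1.0, eps) == w_ppo(1.1, 1.0, eps) == 1.1
```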
The Power Family \(w_{g,i} = \rho_{g,i}^q\)
幂次族 \(w_{g,i} = \rho_{g,i}^q\)
Moving beyond PPO-style clipping, we can investigate a continuous family of off-policy corrections parameterized by \(q \in [0, 1]\):
超越 PPO 风格 clipping,我们可以研究一个由 \(q \in [0, 1]\) 参数化的连续离策略校正族:
This family smoothly interpolates between two well-known extremes. To understand the variance difference, consider the gradient estimator from a batch of \(N\) samples drawn from \(\pi_{\mathrm{old}}\):
Each sample’s contribution to the gradient is \(\rho_i^q \cdot \hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\). The variance of \(\hat{g}_q\) across mini-batches depends on how much the per-sample contributions fluctuate:
这个族在两个已知极端之间平滑插值。为了理解方差的差异,考虑从 \(\pi_{\mathrm{old}}\) 抽取 \(N\) 个样本的梯度估计量(见上式)。每个样本对梯度的贡献为 \(\rho_i^q \cdot \hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\)。\(\hat{g}_q\) 在不同 mini-batch 间的方差取决于 per-sample 贡献的波动程度:
-
\(q = 0\): No correction (\(w = 1\)). Each sample contributes \(\hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\) — all samples are weighted equally. The variance comes solely from \(\hat{A}\) and \(\nabla \ln \pi_\theta\), which are inherent to the problem. No additional randomness is injected. This is simply the variance of a standard sample mean.
The estimator computes \(\mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\), but we actually want \(\mathbb{E}_{\pi_\theta}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\). These differ when \(\pi_\theta \neq \pi_{\mathrm{old}}\) — the estimator is biased. The policy improvement theorem guarantees this bias is small when \(\pi_\theta \approx \pi_{\mathrm{old}}\), but it can be significant after many gradient steps.
-
\(q = 1\): Full IS correction (\(w = \rho\)). Each sample contributes \(\rho_i \cdot \hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\). This converts the expectation to \(\mathbb{E}_{\pi_\theta}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\) — unbiased. But \(\rho_i\) is itself a random variable that multiplies every sample. A sample with \(\rho = 10\) contributes 10× more than a sample with \(\rho = 1\), so a few high-\(\rho\) outliers can dominate the entire batch:
-
\(q = 0\):不校正(\(w = 1\))。 每个样本贡献 \(\hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\)——所有样本权重相同。方差仅来自 \(\hat{A}\) 和 \(\nabla \ln \pi_\theta\),这些是问题本身固有的。没有额外的随机性被注入。这就是标准样本均值的方差。
该估计量计算的是 \(\mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\),但我们实际想要的是 \(\mathbb{E}_{\pi_\theta}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\)。当 \(\pi_\theta \neq \pi_{\mathrm{old}}\) 时两者不同——估计量是有偏的。策略改进定理保证在 \(\pi_\theta \approx \pi_{\mathrm{old}}\) 时偏差很小,但经过多次梯度更新后偏差可能显著。
-
\(q = 1\):完全 IS 校正(\(w = \rho\))。 每个样本贡献 \(\rho_i \cdot \hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\)。这将期望转换为 \(\mathbb{E}_{\pi_\theta}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\)——无偏。但 \(\rho_i\) 本身是一个随机变量,乘以每个样本。一个 \(\rho = 10\) 的样本贡献是 \(\rho = 1\) 样本的 10 倍,因此少数高 \(\rho\) 异常值可以主导整个 batch:
where the inequality holds because multiplying by \(\rho\) (a non-constant positive random variable with \(\mathbb{E}[\rho] = 1\)) can only increase or maintain the variance. The gap grows with the spread of \(\rho\): if \(\pi_\theta\) has drifted far from \(\pi_{\mathrm{old}}\), some \(\rho_i\) values will be very large and others very small, making \(\mathrm{Var}(\rho\,\hat{A})\) much larger than \(\mathrm{Var}(\hat{A})\). In the extreme, if a single sample has \(\rho = 100\) while the rest have \(\rho \approx 0\), the entire batch estimate is determined by that one sample — the effective sample size collapses to 1.
- \(q \in (0, 1)\): Partial correction. The weight \(\rho^q\) with \(q < 1\) compresses the IS ratio toward 1: since \(\rho^q\) is closer to 1 than \(\rho\) for any \(\rho > 0\), the per-sample contributions are more uniform. Formally, \(\mathrm{Var}(\rho^q) \leq \mathrm{Var}(\rho)\) for \(q \in [0, 1]\), and the variance interpolates smoothly. For example, \(q = 0.5\) uses \(w = \sqrt{\rho}\): a sample with \(\rho = 100\) contributes \(\sqrt{100} = 10\times\) instead of \(100\times\), significantly taming the outlier effect. The cost is bias — \(\rho^q\) does not yield a valid IS correction for \(q \neq 1\).
不等式成立是因为乘以 \(\rho\)(一个非常数的正随机变量,\(\mathbb{E}[\rho] = 1\))只可能增加或维持方差。差距随 \(\rho\) 的分散程度增大:如果 \(\pi_\theta\) 已经远离 \(\pi_{\mathrm{old}}\),某些 \(\rho_i\) 值会非常大而其他非常小,使得 \(\mathrm{Var}(\rho\,\hat{A})\) 远大于 \(\mathrm{Var}(\hat{A})\)。在极端情况下,如果一个样本 \(\rho = 100\) 而其余 \(\rho \approx 0\),整个 batch 的估计由这一个样本决定——有效样本量退化为 1。
- \(q \in (0, 1)\):部分校正。 权重 \(\rho^q\)(\(q < 1\))将 IS ratio 向 1 压缩:由于对任意 \(\rho > 0\),\(\rho^q\) 比 \(\rho\) 更接近 1,per-sample 贡献更均匀。形式上,\(\mathrm{Var}(\rho^q) \leq \mathrm{Var}(\rho)\) 对 \(q \in [0, 1]\) 成立,方差平滑插值。例如,\(q = 0.5\) 使用 \(w = \sqrt{\rho}\):一个 \(\rho = 100\) 的样本贡献 \(\sqrt{100} = 10\) 倍而非 \(100\) 倍,显著抑制了异常值效应。代价是偏差——\(\rho^q\) 在 \(q \neq 1\) 时不构成有效的 IS 校正。
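A small sketch of the power family (names ours), showing the compression of outliers and the variance interpolation on a toy lognormal \(\rho\) distribution:

```python
import math
import random
import statistics

def w_power(rho, q):
    """Power-family weight w = rho^q for q in [0, 1]."""
    return rho ** q

assert w_power(100.0, 0.0) == 1.0     # q = 0: no correction
assert w_power(100.0, 0.5) == 10.0    # q = 0.5: a 100x outlier becomes 10x
assert w_power(100.0, 1.0) == 100.0   # q = 1: full IS weight

# Var(rho^q) interpolates between Var(1) = 0 and Var(rho).
rng = random.Random(2)
rhos = [math.exp(rng.gauss(-0.5, 1.0)) for _ in range(10000)]
v = {q: statistics.pvariance([r ** q for r in rhos]) for q in (0.0, 0.5, 1.0)}
assert v[0.0] == 0.0 and v[0.0] <= v[0.5] <= v[1.0]
```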
The bias-variance tradeoff. The choice of \(q\) navigates a fundamental tension:
| \(q\) | Bias | Variance | What it estimates |
|---|---|---|---|
| \(0\) | High | Low | \(\mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}\,\nabla\ln\pi_\theta]\) (wrong distribution) |
| \(1\) | None | High | \(\mathbb{E}_{\pi_\theta}[\hat{A}\,\nabla\ln\pi_\theta]\) (correct distribution) |
| \(q \in (0,1)\) | Medium | Medium | Neither \(\pi_{\mathrm{old}}\) nor \(\pi_\theta\) (no clean interpretation) |
The bias matters most when \(\pi_\theta\) has drifted far from \(\pi_{\mathrm{old}}\) (many gradient steps since the last rollout). In this regime, \(q = 0\) optimizes the wrong objective entirely. The variance matters most when the batch size \(N\) is small or the policy drift \(\sigma\) is large — in this regime, \(q = 1\) produces gradient estimates so noisy that training becomes unstable.
The optimal \(q\) depends on the training stage: early in training when updates are large (high \(\sigma\)), lower \(q\) is safer; later when \(\pi_\theta \approx \pi_{\mathrm{old}}\) (low \(\sigma\)), \(q \approx 1\) is fine since \(\rho \approx 1\) makes all choices equivalent. This motivates PPO’s clipping approach, which adaptively sets \(w = \rho\) (full correction) when \(\rho\) is moderate and \(w = 0\) (no update) when \(\rho\) is extreme — a data-dependent \(q\) that avoids committing to a single tradeoff.
偏差-方差权衡。 \(q\) 的选择在一个根本性的张力中导航:
| \(q\) | 偏差 | 方差 | 估计的是什么 |
|---|---|---|---|
| \(0\) | 高 | 低 | \(\mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}\,\nabla\ln\pi_\theta]\)(错误分布) |
| \(1\) | 无 | 高 | \(\mathbb{E}_{\pi_\theta}[\hat{A}\,\nabla\ln\pi_\theta]\)(正确分布) |
| \(q \in (0,1)\) | 中等 | 中等 | 既非 \(\pi_{\mathrm{old}}\) 也非 \(\pi_\theta\)(无简洁解释) |
偏差在 \(\pi_\theta\) 远离 \(\pi_{\mathrm{old}}\) 时最重要(自上次 rollout 以来经过了许多梯度步)。在这种情况下,\(q = 0\) 完全优化了错误的目标。方差在 batch 大小 \(N\) 小或策略漂移 \(\sigma\) 大时最重要——此时 \(q = 1\) 产生的梯度估计太嘈杂,训练变得不稳定。
最优的 \(q\) 取决于训练阶段:训练早期更新幅度大(高 \(\sigma\))时,较低的 \(q\) 更安全;后期 \(\pi_\theta \approx \pi_{\mathrm{old}}\)(低 \(\sigma\))时,\(q \approx 1\) 没问题,因为 \(\rho \approx 1\) 使所有选择等价。这启发了 PPO 的 clipping 方法:当 \(\rho\) 适中时自适应地设 \(w = \rho\)(完全校正),当 \(\rho\) 极端时设 \(w = 0\)(不更新)——一种数据依赖的 \(q\),避免锁定在单一权衡上。
Empirical evidence from RAFT++. According to the RAFT++ paper (A Minimalist Approach to LLM Reasoning), Figure 2:
- \(q = 1\) (full IS correction, RAFT++ without clipping) performs the worst, due to the high variance of unclipped importance weights.
- \(q = 0\) (no correction, corresponding to RAFT) performs better than \(q = 1\), but is still suboptimal.
- The best performance comes from methods that reduce variance while maintaining some correction — which is exactly the role PPO clipping plays.
This suggests that among the power family \(\rho^q\), neither extreme is optimal. But instead of searching for the best \(q\), one can use more sophisticated variance reduction strategies (like PPO clipping or the approaches we discuss next) that adapt to the local magnitude of \(\rho\).
RAFT++ 的实验证据。 根据 RAFT++ 论文(A Minimalist Approach to LLM Reasoning)图 2:
- \(q = 1\)(完全 IS 校正,不带 clipping 的 RAFT++)表现最差,因为未截断的重要性权重方差很高。
- \(q = 0\)(不校正,对应 RAFT)表现优于 \(q = 1\),但仍非最优。
- 最佳性能来自既减少方差又保留一定校正的方法——这正是 PPO clipping 所扮演的角色。
这说明在幂次族 \(\rho^q\) 中,两个极端都不是最优的。但与其搜索最佳 \(q\),不如使用更精细的方差缩减策略(如 PPO clipping 或我们接下来讨论的方法),这些策略可以根据 \(\rho\) 的局部大小自适应调整。
Alternative Definitions of \(\rho_{g,i}\)
\(\rho_{g,i}\) 的替代定义
So far we have used the token-level IS ratio \(\rho_{g,i} = \frac{\pi_\theta(a_i \vert \tau_{i-1})}{\pi_{\mathrm{old}}(a_i \vert \tau_{i-1})}\), which corrects only the distribution of the single token \(a_i\). But we can also consider sequence-level alternatives that capture the distributional shift across the entire trajectory. These can be combined with any choice of \(w(\rho)\) discussed above.
When is the token-level \(\rho_j\) exact? Whether the single-token ratio suffices is determined by what quantities the advantage depends on. As discussed above:
-
Standard GRPO: The advantage \(\hat{A}_g\) depends only on the task reward \(r(\tau^g)\), which is a fixed constant for the entire trajectory. The gradient contribution at token \(j\) is \(\hat{A}_g \nabla_\theta \ln \pi_\theta(a_j)\). The only distribution mismatch is that action \(a_j\) was sampled from \(\pi_{\mathrm{old}}\) instead of \(\pi_\theta\) — and the single-token ratio \(\rho_j\) corrects exactly this. There is no future-action-dependent quantity in \(\hat{A}_g\), so no further correction is needed. The token-level IS correction is exact.
-
Token-level KL formulation (ours): The token-level reward \(r_j = r(\tau) - \frac{\beta}{H}\sum_{k=j}^H K_k\) depends on future actions \(a_{j+1}, \ldots, a_H\) through the KL terms \(K_k\). These future actions were sampled from \(\pi_{\mathrm{old}}\), so the distribution of \(r_j\) itself is wrong under \(\pi_\theta\). Correcting this requires the full future trajectory ratio \(\prod_{k=j}^H \rho_k\), which has exponential variance (as shown above). The geometric mean is a variance-reduced approximation — it is not the exact IS weight, hence biased.
The practical implication: standard GRPO’s simpler IS treatment is exact for its coarser (trajectory-level) advantage, while our formulation’s finer (token-level) advantage demands a more complex IS correction where any tractable approximation introduces bias.
Sequence-level geometric mean (GSPO). Replace the per-token ratio with the geometric mean of all token ratios in the trajectory:
到目前为止我们使用的是 token-level IS ratio \(\rho_{g,i} = \frac{\pi_\theta(a_i \vert \tau_{i-1})}{\pi_{\mathrm{old}}(a_i \vert \tau_{i-1})}\),它只校正单个 token \(a_i\) 的分布。但我们也可以考虑 sequence-level 的替代方案,以捕捉整条轨迹上的分布偏移。这些可以与上面讨论的任何 \(w(\rho)\) 选择组合。
Token-level \(\rho_j\) 何时精确? 单 token ratio 是否足够取决于 advantage 依赖什么。正如上文讨论的:
-
标准 GRPO:Advantage \(\hat{A}_g\) 只依赖 task reward \(r(\tau^g)\),对整条轨迹是固定常数。Token \(j\) 处的梯度贡献为 \(\hat{A}_g \nabla_\theta \ln \pi_\theta(a_j)\)。唯一的分布偏差是动作 \(a_j\) 来自 \(\pi_{\mathrm{old}}\) 而非 \(\pi_\theta\)——单 token ratio \(\rho_j\) 精确校正了这一点。\(\hat{A}_g\) 中没有依赖未来动作的项,因此不需要进一步校正。Token-level IS 校正是精确的。
-
Token-level KL 公式(我们的):Token-level reward \(r_j = r(\tau) - \frac{\beta}{H}\sum_{k=j}^H K_k\) 通过 KL 项 \(K_k\) 依赖于未来动作 \(a_{j+1}, \ldots, a_H\)。这些未来动作来自 \(\pi_{\mathrm{old}}\) 的采样,因此 \(r_j\) 本身的分布在 \(\pi_\theta\) 下就是错的。精确校正需要完整的 future trajectory ratio \(\prod_{k=j}^H \rho_k\),但其方差指数爆炸(如上文所示)。几何平均是方差缩减的近似——它不是精确的 IS 权重,因此有偏。
实际含义:标准 GRPO 更简单的 IS 处理对其更粗的(轨迹级)advantage 是精确的,而我们公式更精细的(token 级)advantage 需要更复杂的 IS 校正,任何可行的近似都会引入偏差。
Sequence-level 几何平均(GSPO)。 将 per-token ratio 替换为轨迹中所有 token ratio 的几何平均:
This makes \(\rho_{g,i}\) the same for all tokens \(i\) within trajectory \(g\) — it measures the average log-probability shift across the entire sequence. This has several implications:
- Variance reduction: By averaging over \(H_g\) token ratios, the geometric mean has \(\sim 1/H_g\) times the log-variance of a single token ratio, preventing any single outlier token from dominating.
- Global view: It captures the overall distributional shift of the trajectory, not just local shifts at individual tokens.
- Position-independence: All tokens in the same trajectory receive the same IS correction, which may lose fine-grained per-token information but simplifies the algorithm.
这使得 \(\rho_{g,i}\) 对轨迹 \(g\) 中所有 token \(i\) 相同——它衡量整个序列上平均的 log-probability 偏移。这有几个含义:
- 方差缩减:通过对 \(H_g\) 个 token ratio 取平均,几何平均的 log-方差约为单个 token ratio 的 \(1/H_g\),防止任何单个异常 token 主导。
- 全局视角:捕捉轨迹整体的分布偏移,而非仅仅是个别 token 的局部偏移。
- 位置无关性:同一轨迹中的所有 token 接收相同的 IS 校正,可能丢失精细的 per-token 信息,但简化了算法。
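A sketch of the sequence-level ratio (our implementation; computed in log space as \(\exp(\frac{1}{H}\sum_j \ln \rho_j)\)):

```python
import math

def geometric_mean_ratio(token_ratios):
    """GSPO-style sequence-level ratio: geometric mean of per-token ratios,
    shared by every token in the trajectory."""
    H = len(token_ratios)
    return math.exp(sum(math.log(r) for r in token_ratios) / H)

ratios = [0.5, 2.0, 1.0, 1.0]
assert abs(geometric_mean_ratio(ratios) - 1.0) < 1e-9  # 0.5 and 2.0 cancel

# A single outlier token moves the product far more than the geometric mean:
outlier = [1.0] * 99 + [100.0]
assert math.prod(outlier) == 100.0
assert abs(geometric_mean_ratio(outlier) - 100 ** (1 / 100)) < 1e-9  # ~1.047
```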
Future-only geometric mean. A position-dependent variant that only considers tokens from position \(i\) onward:
仅未来 token 的几何平均。 一种位置相关的变体,只考虑从位置 \(i\) 开始的 token:
This is motivated by the same reasoning as the sequence-level correction derived above: the loss at position \(i\) depends on future actions \(a_i, a_{i+1}, \ldots, a_{H_g}\), so the IS correction should cover the distributional shift from position \(i\) onward. Compared to the full-sequence geometric mean, this variant:
- Adapts by position: Early tokens (small \(i\)) average over more tokens, getting stronger variance reduction but potentially more bias from including irrelevant past distributional changes. Late tokens (large \(i\)) average over fewer tokens, preserving more local information but with higher variance.
- Connects to the token-level case: At \(i = H_g\) (the last token), the future-only geometric mean reduces to the standard token-level ratio \(\rho_{g,H_g} = \frac{\pi_\theta(a_{H_g})}{\pi_{\mathrm{old}}(a_{H_g})}\).
- Connects to the full-sequence case: At \(i = 1\) (the first token), it equals the full-sequence geometric mean.
这与上文推导的 sequence-level correction 基于相同的直觉:位置 \(i\) 处的损失依赖于未来动作 \(a_i, a_{i+1}, \ldots, a_{H_g}\),因此 IS 校正应覆盖从位置 \(i\) 开始的分布偏移。与 full-sequence 几何平均相比,这个变体:
- 按位置自适应:早期 token(小 \(i\))对更多 token 取平均,获得更强的方差缩减,但可能因包含不相关的过去分布变化而引入更多偏差。晚期 token(大 \(i\))对更少 token 取平均,保留更多局部信息但方差更高。
- 与 token-level 的联系:在 \(i = H_g\)(最后一个 token)时,future-only 几何平均退化为标准的 token-level ratio \(\rho_{g,H_g} = \frac{\pi_\theta(a_{H_g})}{\pi_{\mathrm{old}}(a_{H_g})}\)。
- 与 full-sequence 的联系:在 \(i = 1\)(第一个 token)时,等于 full-sequence 几何平均。
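A sketch of the future-only variant (our implementation), with checks of the two boundary cases noted above:

```python
import math

def future_geo_ratio(token_ratios, i):
    """Geometric mean of the per-token ratios from position i to H
    (i is 1-indexed, matching the notation in the text)."""
    future = token_ratios[i - 1:]
    return math.exp(sum(math.log(r) for r in future) / len(future))

ratios = [0.8, 1.5, 1.0, 2.0]
H = len(ratios)
full_geo = math.exp(sum(math.log(r) for r in ratios) / H)

# i = H: reduces to the single-token ratio of the last token.
assert abs(future_geo_ratio(ratios, H) - ratios[-1]) < 1e-12
# i = 1: equals the full-sequence geometric mean.
assert abs(future_geo_ratio(ratios, 1) - full_geo) < 1e-12
```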
Summary of \(\rho\) choices. The three definitions form a spectrum from local to global correction:
| \(\rho\) definition | Scope | Variance | Bias | Position-dependent? |
|---|---|---|---|---|
| Token-level \(\frac{\pi_\theta(a_i)}{\pi_{\mathrm{old}}(a_i)}\) | Single token | Highest | Lowest (for token \(i\)) | Yes |
| Future-only geometric mean | Tokens \(i\) to \(H_g\) | Medium | Medium | Yes |
| Full-sequence geometric mean (GSPO) | All tokens | Lowest | Highest | No |
These \(\rho\) choices are orthogonal to the \(w(\rho)\) choices (PPO clipping, power family \(\rho^q\), upper clipping, etc.). Any combination is a valid algorithm to try. The experimental question is which combination yields the best bias-variance tradeoff for LLM reasoning tasks.
\(\rho\) 选择总结。 三种定义构成从局部到全局校正的谱系:
| \(\rho\) 定义 | 范围 | 方差 | 偏差 | 位置相关? |
|---|---|---|---|---|
| Token-level \(\frac{\pi_\theta(a_i)}{\pi_{\mathrm{old}}(a_i)}\) | 单个 token | 最高 | 最低(对 token \(i\)) | 是 |
| 仅未来几何平均 | Token \(i\) 到 \(H_g\) | 中等 | 中等 | 是 |
| Full-sequence 几何平均(GSPO) | 所有 token | 最低 | 最高 | 否 |
这些 \(\rho\) 选择与 \(w(\rho)\) 选择(PPO clipping、幂次族 \(\rho^q\)、上方截断等)是正交的。任何组合都是一个有效的算法。实验问题是哪种组合在 LLM 推理任务上产生最佳的偏差-方差权衡。
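The orthogonality can be sketched as a small grid of combinations (all names and example values are ours): any \(\rho\) definition pairs with any weight map \(w(\rho)\) to define a candidate algorithm.

```python
import math

# --- rho definitions (i is a 0-indexed token position here) ---
def rho_token(ratios, i):
    return ratios[i]

def rho_future_geo(ratios, i):
    fut = ratios[i:]
    return math.exp(sum(math.log(r) for r in fut) / len(fut))

def rho_full_geo(ratios, i):
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# --- weight maps w(rho) ---
def w_identity(rho):             # power family with q = 1
    return rho

def w_sqrt(rho):                 # power family with q = 0.5
    return math.sqrt(rho)

def w_upper_clip(rho, eps=0.2):  # cap the weight at 1 + eps
    return min(rho, 1 + eps)

# Every (rho definition, weight map) pair is a valid algorithm to try.
ratios = [0.9, 1.1, 3.0]
combos = {(rd.__name__, wm.__name__): wm(rd(ratios, 2))
          for rd in (rho_token, rho_future_geo, rho_full_geo)
          for wm in (w_identity, w_sqrt, w_upper_clip)}
assert combos[("rho_token", "w_identity")] == 3.0
assert combos[("rho_token", "w_upper_clip")] == 1.2
assert len(combos) == 9  # 3 rho definitions x 3 weight maps
```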