From Estimation to Optimization: KL Regularization in RLHF

This post distills Liu et al. (2025), Rethinking KL Regularization in RLHF. The paper asks a deceptively simple question: Schulman's three KL estimators \(k_1, k_2, k_3\) were ranked by their statistical properties (bias and variance). But in RLHF these formulas are used as optimization losses — does the best estimator also make the best loss? The answer is no, and the reasons are illuminating.

Recap: Schulman's Three Estimators

In Approximating KL Divergence, Schulman defined three Monte Carlo estimators of \(\mathrm{KL}[q,p]\) using the ratio \(r = p(x)/q(x)\):

|   | Formula | Unbiased? | Always \(\geq 0\)? | Variance |
| --- | --- | --- | --- | --- |
| \(k_1\) | \(-\log r\) | Yes | No | High |
| \(k_2\) | \(\frac{1}{2}(\log r)^2\) | No (low bias) | Yes | Low |
| \(k_3\) | \((r-1) - \log r\) | Yes | Yes | Low |

As an estimator, \(k_3\) is the clear winner: unbiased, non-negative, and low variance. But estimation and optimization are different games.
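
Schulman's comparison is easy to reproduce with a quick Monte Carlo check. The sketch below uses toy equal-variance Gaussians \(q = \mathcal{N}(0,1)\), \(p = \mathcal{N}(0.1,1)\) (illustrative values, so the true KL is \(0.1^2/2 = 0.005\)) and estimates all three formulas from samples of \(q\):

```python
import math
import random

# Monte Carlo check of Schulman's three estimators for KL[q, p], with
# q = N(0, 1) and p = N(0.1, 1) (toy values; true KL = 0.1^2 / 2 = 0.005).
random.seed(0)
mu_q, mu_p = 0.0, 0.1
true_kl = (mu_q - mu_p) ** 2 / 2

def log_r(x):
    # log r = log p(x) - log q(x) for unit-variance Gaussians
    return ((x - mu_q) ** 2 - (x - mu_p) ** 2) / 2

lr = [log_r(random.gauss(mu_q, 1.0)) for _ in range(200_000)]
k1 = [-l for l in lr]
k2 = [0.5 * l * l for l in lr]
k3 = [math.exp(l) - 1 - l for l in lr]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

for name, k in [("k1", k1), ("k2", k2), ("k3", k3)]:
    print(f"{name}: mean={mean(k):.5f}  var={var(k):.6f}")
```

All three sample means land near 0.005, while \(k_1\)'s variance is orders of magnitude larger than that of \(k_2\) or \(k_3\) — and \(k_2\), \(k_3\) are non-negative sample by sample.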

Two Different KLs in PPO-RLHF

Before analyzing how \(k_1, k_2, k_3\) behave as losses, we need to untangle a common source of confusion. PPO-based RLHF involves two different KL divergences that serve entirely different purposes and point in different directions. To see both clearly, start from the TRPO-RLHF formulation with the KL constraint written explicitly:

\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}\right] - \beta \cdot \underbrace{D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})}_{\text{reference KL (reverse)}} \quad \text{s.t.}\quad \underbrace{D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)}_{\text{trust-region KL (forward)}} \leq \delta\]

PPO approximates the constraint by replacing it with clipping, yielding the familiar PPO-RLHF objective (as in InstructGPT):

\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A},\; \mathrm{clip}\!\big(\tfrac{\pi_\theta}{\pi_{\mathrm{old}}}, 1\!-\!\epsilon, 1\!+\!\epsilon\big)\hat{A}\Big)\right] - \beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\]

The two KLs are:

1. Trust-region KL (forward): \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta) \leq \delta\)

This is the constraint inherited from TRPO: don’t let the new policy deviate too far from the old policy within a single update step. Since data is sampled from \(\pi_{\mathrm{old}}\), the constraint is measured under \(\pi_{\mathrm{old}}\) — a forward KL (old policy first). It strongly penalizes the case where \(\pi_{\mathrm{old}}\) puts high probability on an action but \(\pi_\theta\) compresses it — exactly the dangerous regime where importance ratios explode and the surrogate approximation breaks. PPO replaces this explicit constraint with clipping.

2. Reference KL (reverse): \(\beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\)

This is the RLHF-specific regularizer that prevents the policy from drifting too far from the pretrained base model across all of training. Expanding it:

\[D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)}\right]\]

Since we care about what the current policy generates, the expectation is naturally under \(\pi_\theta\) — a reverse KL (new policy first). It penalizes the policy for generating outputs that the reference model would find unlikely.

|   | Trust-Region KL | Reference KL |
| --- | --- | --- |
| Purpose | Optimization stability | Regularization to base model |
| Direction | \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)\) (forward) | \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\) (reverse) |
| Sampling distribution | \(\pi_{\mathrm{old}}\) (data you already have) | \(\pi_\theta\) (outputs you will generate) |
| Constrains | Per-step update size | Total drift from reference |
| In the formula | Implicit (clipping) | Explicit (\(\beta\) penalty) |
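
The directional difference is concrete on a small example. The sketch below uses two hypothetical categorical policies over three tokens, where the new policy compresses a token the old policy considers likely — the regime the forward (trust-region) KL is meant to punish:

```python
import math

# Forward vs. reverse KL on two hypothetical 3-token categorical policies.
# pi_new compresses a token that pi_old puts high probability on.
def kl(a, b):
    # D_KL(a || b), expectation taken under a
    return sum(pa * math.log(pa / pb) for pa, pb in zip(a, b) if pa > 0)

pi_old = [0.50, 0.49, 0.01]
pi_new = [0.98, 0.01, 0.01]

forward = kl(pi_old, pi_new)  # trust-region direction: measured under pi_old
reverse = kl(pi_new, pi_old)  # reference-KL direction: measured under pi_new
print(f"forward KL = {forward:.3f}, reverse KL = {reverse:.3f}")
```

The forward KL is more than twice the reverse KL here: compressing a high-probability old action is exactly what the \(\pi_{\mathrm{old}}\)-weighted expectation penalizes most.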

The rest of this post focuses exclusively on the reference KL \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\) — specifically, how the choice of \(k_1, k_2, k_3\) and the choice of “in reward” vs. “as loss” affect its gradient.

Wait — the surrogate uses \(\mathbb{E}_{\pi_{\mathrm{old}}}\) but the reference KL uses \(\mathbb{E}_{\pi_\theta}\). How can they coexist in one loss?

They can't, as written. The formula above is conceptually clean but notationally sloppy — it mixes two different expectations. In practice, InstructGPT resolves this by folding the KL into the reward. The per-token reward becomes:

$$\tilde{r}_t = R(y) \cdot \mathbf{1}_{t=T} - \beta \cdot \big(\log \pi_{\mathrm{old}}(a_t \vert s_t) - \log \pi_{\mathrm{ref}}(a_t \vert s_t)\big)$$

Note the key move: the \(\log \pi_\theta\) in the KL is replaced by \(\log \pi_{\mathrm{old}}\) — the policy that actually generated the rollout. The KL penalty is computed at rollout time and treated as part of the reward, detached from the gradient. The advantage \(\hat{A}\) is then estimated from this modified reward using GAE, and the entire PPO loss has a single expectation under \(\pi_{\mathrm{old}}\):

$$\mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}_{\tilde{r}},\; \mathrm{clip}(\cdots)\hat{A}_{\tilde{r}}\Big)\right]$$

This is precisely the "\(k_1\) in reward" approach that the rest of this post will analyze. The reference KL never appears as a separate loss term with its own expectation — it is absorbed into the advantage.
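
The fold-into-reward step can be sketched in a few lines. All numbers below (\(\beta\), the reward \(R\), the per-token log-probs) are made-up illustrations, not from any real run:

```python
# Sketch of the InstructGPT-style "k1 in reward" fold: the KL penalty is
# computed from detached rollout-time log-probs and added to the per-token
# reward before advantage estimation. All values are illustrative.
beta = 0.1
R = 1.7                         # reward-model score for the full response
logp_old = [-1.2, -0.8, -2.1]   # log pi_old(a_t | s_t) along the rollout
logp_ref = [-1.0, -1.1, -1.9]   # log pi_ref(a_t | s_t) along the rollout
T = len(logp_old)

# r~_t = R * 1{t == T-1} - beta * (log pi_old(a_t|s_t) - log pi_ref(a_t|s_t))
rewards = [
    (R if t == T - 1 else 0.0) - beta * (logp_old[t] - logp_ref[t])
    for t in range(T)
]
print(rewards)
```

Note that nothing here touches \(\pi_\theta\): the penalty is a pure function of rollout-time quantities, which is why it can sit inside the reward without breaking the single \(\mathbb{E}_{\pi_{\mathrm{old}}}\) expectation.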

"In Reward" vs. "As Loss"

In RLHF, \(k_n\) is used to regularize the policy \(\pi_\theta\) toward a reference \(\pi_{\mathrm{ref}}\). Let \(\delta = \pi_{\mathrm{ref}}(y \vert x) / \pi_\theta(y \vert x)\) (the probability ratio — not to be confused with the trust-region radius \(\delta\) above). There are two fundamentally different ways to plug \(k_n\) into the loss:

“\(k_n\) in reward” (combined form): treat \(k_n\) as a detached scalar in the REINFORCE objective — it modulates the policy gradient like a reward signal, but is not differentiated:

\[\mathcal{L} = -\mathbb{E}_{y \sim \pi_\theta}\!\Big[\big(R(y) - \beta \cdot k_n\big) \cdot \log \pi_\theta(y \vert x)\Big].\]

“\(k_n\) as loss” (decoupled form): add \(k_n\) as a separate differentiable loss — the gradient flows through \(k_n\) itself via the chain rule:

\[\mathcal{L} = -\mathbb{E}\!\big[R(y) \cdot \log \pi_\theta(y \vert x)\big] + \beta \cdot \mathbb{E}\!\big[k_n(\pi_\theta, \pi_{\mathrm{ref}})\big].\]

These produce different gradients, even though they use the same formula.
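
The difference is visible on a one-parameter toy model. The sketch below uses a hypothetical Bernoulli policy with logit `theta` (all numbers illustrative) and hand-derived derivatives, comparing the KL part of the gradient under the two forms for \(k_1\):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Toy Bernoulli policy: pi_theta(y=1) = sigmoid(theta); reference at theta=0.
theta, theta_ref, beta = 1.0, 0.0, 0.5

logp = math.log(sigmoid(theta))          # log pi_theta(y=1) for sampled y=1
logp_ref = math.log(sigmoid(theta_ref))  # log pi_ref(y=1)
dlogp = 1 - sigmoid(theta)               # d/dtheta of log sigmoid(theta)
k1 = logp - logp_ref                     # k1 = -log delta for this sample

# "k1 in reward": k1 enters as a detached coefficient on grad log pi_theta.
grad_in_reward = beta * k1 * dlogp
# "k1 as loss": k1 is differentiated; logp_ref is a constant and drops out.
grad_as_loss = beta * dlogp
print(grad_in_reward, grad_as_loss)
```

The in-reward gradient scales with how far the sample has drifted from the reference (through \(k_1\)); the as-loss gradient is the same for every sample — which is exactly the pathology analyzed below.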

The Gradient Analysis

The target is the reverse KL \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\), whose true gradient (under on-policy sampling) is:

\[\nabla_\theta \mathcal{J}_{\mathrm{RKL}} = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)} \cdot \nabla_\theta \log \pi_\theta(y \vert x)\right].\]
Why reverse KL \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\)?

Because this is exactly what appears in the standard RLHF objective:

$$\max_{\pi_\theta} \; \mathbb{E}_{y \sim \pi_\theta}\big[R(y)\big] - \beta \cdot \mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}].$$

The KL is \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\) rather than \(\mathrm{KL}[\pi_{\mathrm{ref}} \Vert \pi_\theta]\) because we sample from the current policy \(\pi_\theta\). The expectation \(\mathbb{E}_{y \sim \pi_\theta}[\log(\pi_\theta / \pi_{\mathrm{ref}})]\) is directly compatible with the policy gradient framework — no importance sampling needed. PPO-RLHF, GRPO, and other on-policy RLHF methods all use this form.

The coefficient \(-\log \delta = \log(\pi_\theta / \pi_{\mathrm{ref}})\) multiplying \(\nabla_\theta \log \pi_\theta\) is the “signal” that pushes the policy back toward the reference. Liu et al. show which implementations recover this gradient:

\(k_1\) in reward produces the correct gradient. Since \(k_1 = -\log \delta\), placing it in the REINFORCE coefficient directly yields \(-\log \delta \cdot \nabla_\theta \log \pi_\theta\) — exactly the RKL gradient. ✓

\(k_2\) as loss is gradient-equivalent to \(k_1\) in reward. Since \(k_2 = \frac{1}{2}(\log \delta)^2\), differentiating directly gives \(\nabla_\theta k_2 = \log \delta \cdot \nabla_\theta \log \delta = -\log \delta \cdot \nabla_\theta \log \pi_\theta\) — the same RKL gradient. ✓

This is the paper’s key equivalence result (Theorem 5.1): “\(k_1\) in reward” \(=\) “\(k_2\) as loss” in terms of gradient.
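
The equivalence can be checked numerically on the same kind of toy Bernoulli policy (illustrative values; the derivative of \(\log\sigma(\theta)\) is derived by hand):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Toy Bernoulli policy with logit theta; reference at theta_ref.
theta, theta_ref, beta = 1.5, 0.0, 0.1
logp = math.log(sigmoid(theta))
logp_ref = math.log(sigmoid(theta_ref))
dlogp = 1 - sigmoid(theta)           # d/dtheta of log pi_theta(y=1)
log_delta = logp_ref - logp          # log(pi_ref / pi_theta)

# "k1 in reward": detached coefficient k1 = -log delta times grad log pi.
grad_k1_in_reward = beta * (-log_delta) * dlogp
# "k2 as loss": d/dtheta of 0.5 * (log delta)^2 = log delta * (-dlogp).
grad_k2_as_loss = beta * log_delta * (-dlogp)
print(grad_k1_in_reward, grad_k2_as_loss)
```

The two gradients agree exactly, for any \(\theta\): both reduce to \(\beta \cdot (-\log\delta) \cdot \nabla_\theta \log\pi_\theta\).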

k₁ as Loss: A Surprising Failure

What if we use \(k_1\) as a direct loss instead of in the reward? Since \(k_1 = -\log \delta = \log \pi_\theta - \log \pi_{\mathrm{ref}}\), differentiating gives:

\[\nabla_\theta k_1 = \nabla_\theta \log \pi_\theta(y \vert x).\]

The reference policy \(\pi_{\mathrm{ref}}\) has completely disappeared from the gradient — it carries no regularization signal at all. Worse, by the score function identity \(\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta] = 0\), this gradient has zero expectation. It contributes nothing but noise.

This is a stark example of how a perfect estimator (\(k_1\) is exactly unbiased for KL) can be a terrible loss function.
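
The zero-expectation claim can be verified exactly by enumerating a small support. The sketch below uses a hypothetical 3-way softmax policy and the standard softmax score \(\partial \log\pi(y)/\partial \ell_j = \mathbf{1}_{j=y} - \pi_j\):

```python
import math

# Score-function identity: E_{y ~ pi}[grad log pi(y)] = 0, shown exactly for
# a toy 3-way softmax policy (logits are arbitrary illustrative values).
logits = [0.3, -1.2, 0.9]
Z = sum(math.exp(l) for l in logits)
probs = [math.exp(l) / Z for l in logits]

# d log pi(y) / d logits[j] = 1{j == y} - probs[j]
expected_grad = [
    sum(p_y * ((1.0 if j == y else 0.0) - probs[j])
        for y, p_y in enumerate(probs))
    for j in range(len(logits))
]
print(expected_grad)  # each component is 0 up to float rounding
```

Every component cancels exactly: the "k₁ as loss" gradient averages to zero over samples, so it can only add variance.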

k₃ as Loss (GRPO): A Biased Approximation

GRPO uses \(k_3 = \delta - 1 - \log \delta\) as a directly differentiated loss (decoupled form). Differentiating:

\[\nabla_\theta k_3 = \nabla_\theta(\delta - \log \delta) = (\delta - 1) \cdot \nabla_\theta \log \delta = (1 - \delta) \cdot \nabla_\theta \log \pi_\theta,\]

using \(\nabla_\theta \delta = \delta \cdot \nabla_\theta \log \delta\) and \(\nabla_\theta \log \delta = -\nabla_\theta \log \pi_\theta\).

Compared to the true RKL gradient coefficient \(-\log \delta\), GRPO uses \(1 - \delta\). These are related by Taylor expansion — \(-\log \delta = (\delta - 1) - \frac{1}{2}(\delta - 1)^2 + \cdots\), so \(1 - \delta\) is only the first-order approximation of \(-\log \delta\). This introduces three problems:

  1. Bias: for all \(\delta \neq 1\), the coefficient \(1 - \delta \neq -\log \delta\), so the gradient direction is biased.

  2. Pathological asymmetry: When the policy deviates away from the reference (\(\delta \to 0\), meaning \(\pi_\theta \gg \pi_{\mathrm{ref}}\)), the true coefficient \(-\log \delta \to +\infty\) provides a strong restoring force, but \(1 - \delta \to 1\) saturates — it cannot push back hard enough. Conversely, when \(\delta \to \infty\) (\(\pi_\theta \ll \pi_{\mathrm{ref}}\)), \(1 - \delta \to -\infty\) explodes much faster than the logarithmic \(-\log \delta\), risking destabilizing updates.

  3. Variance: the variance of \(1 - \delta\) involves \(\mathrm{Var}[\delta] = \chi^2(\pi_{\mathrm{ref}} \Vert \pi_\theta)\), the chi-squared divergence, which is notoriously unstable and can diverge even when KL remains finite.

So paradoxically, \(k_3\) — the “clear winner” as a KL estimator — produces a biased, asymmetric, and potentially unstable gradient when used as a loss in GRPO. The paper recommends \(k_2\) as loss (or equivalently \(k_1\) in reward) as the principled default.
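
The asymmetry in point 2 is visible from a handful of illustrative values of \(\delta\):

```python
import math

# Compare the true RKL gradient coefficient -log(delta) with GRPO's 1 - delta,
# where delta = pi_ref / pi_theta. Values of delta chosen for illustration.
for delta in (0.01, 0.1, 1.0, 2.0, 10.0):
    true_c = -math.log(delta)
    grpo_c = 1.0 - delta
    print(f"delta={delta:6.2f}  -log(delta)={true_c:8.3f}  1-delta={grpo_c:8.3f}")
```

At \(\delta = 0.01\) the true coefficient is about 4.6 while \(1 - \delta\) saturates at 0.99; at \(\delta = 10\) the true coefficient is about \(-2.3\) while \(1 - \delta\) has already reached \(-9\).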

Summary: Estimation ≠ Optimization

The following table contrasts the estimator ranking (Schulman) with the optimization ranking (Liu et al.):

|   | As Estimator | “\(k_n\) in reward” | “\(k_n\) as loss” |
| --- | --- | --- | --- |
| \(k_1 = -\log \delta\) | Unbiased, high variance | ✓ Correct RKL gradient | ✗ Zero-mean noise, no regularization |
| \(k_2 = \frac{1}{2}(\log \delta)^2\) | Biased (low), low variance |  | ✓ Correct RKL gradient |
| \(k_3 = (\delta - 1) - \log \delta\) | Unbiased, low variance |  | ≈ First-order biased approximation |

The irony is complete: \(k_1\), the worst estimator (high variance), produces the correct gradient when placed in the reward. \(k_3\), the best estimator (unbiased + low variance), produces a biased gradient when used as a loss. And \(k_2\), the biased estimator, produces the correct gradient as a loss — making it gradient-equivalent to \(k_1\) in reward.

The reason is that estimation asks “how close is the value of \(k_n\) to the true KL?” while optimization asks “does \(\nabla_\theta k_n\) point in the right direction?” These are fundamentally different questions, and a good answer to one does not imply a good answer to the other.
