From Estimation to Optimization: KL Regularization in RLHF
Recap: Schulman's Three Estimators
In Approximating KL Divergence, Schulman defined three Monte Carlo estimators of \(\mathrm{KL}[q,p]\) using the ratio \(r = p(x)/q(x)\):
| | Formula | Unbiased? | Always \(\geq 0\)? | Variance |
|---|---|---|---|---|
| \(k_1\) | \(-\log r\) | Yes | No | High |
| \(k_2\) | \(\frac{1}{2}(\log r)^2\) | No (low bias) | Yes | Low |
| \(k_3\) | \((r-1) - \log r\) | Yes | Yes | Low |
As an estimator, \(k_3\) is the clear winner: unbiased, non-negative, and low variance. But estimation and optimization are different games.
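These properties are easy to check numerically. Below is a minimal NumPy sketch (the two categorical distributions are made up for illustration) that samples from \(q\) and compares all three estimators against the exact KL:

```python
import numpy as np

# Two made-up categorical distributions for illustration.
q = np.array([0.5, 0.3, 0.2])    # sampling distribution
p = np.array([0.4, 0.35, 0.25])  # target distribution

true_kl = np.sum(q * np.log(q / p))  # exact KL[q, p]

rng = np.random.default_rng(0)
x = rng.choice(len(q), size=200_000, p=q)  # Monte Carlo samples x ~ q
r = p[x] / q[x]                            # ratio r = p(x)/q(x)

k1 = -np.log(r)              # unbiased, can go negative, high variance
k2 = 0.5 * np.log(r) ** 2    # biased (low), always >= 0, low variance
k3 = (r - 1) - np.log(r)     # unbiased, always >= 0, low variance

print(true_kl, k1.mean(), k2.mean(), k3.mean())
print(k1.var(), k3.var())
```

With enough samples, \(k_1\) and \(k_3\) both center on the true KL, but \(k_3\)'s spread is far smaller, which is exactly the ranking in the table.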
Two Different KLs in PPO-RLHF
Before analyzing how \(k_1, k_2, k_3\) behave as losses, we need to untangle a common source of confusion. PPO-based RLHF involves two different KL divergences that serve entirely different purposes and point in different directions. To see both clearly, start from the TRPO-RLHF formulation with the KL constraint written explicitly:
\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}\right] - \beta \cdot \underbrace{D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})}_{\text{reference KL (reverse)}} \quad \text{s.t.}\quad \underbrace{D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)}_{\text{trust-region KL (forward)}} \leq \delta\]
PPO approximates the constraint by replacing it with clipping, yielding the familiar PPO-RLHF objective (as in InstructGPT):
\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A},\; \mathrm{clip}\!\big(\tfrac{\pi_\theta}{\pi_{\mathrm{old}}}, 1\!-\!\epsilon, 1\!+\!\epsilon\big)\hat{A}\Big)\right] - \beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\]
The two KLs are:
1. Trust-region KL (forward): \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta) \leq \delta\)
This is the constraint inherited from TRPO: don’t let the new policy deviate too far from the old policy within a single update step. Since data is sampled from \(\pi_{\mathrm{old}}\), the constraint is measured under \(\pi_{\mathrm{old}}\) — a forward KL (old policy first). It strongly penalizes the case where \(\pi_{\mathrm{old}}\) puts high probability on an action but \(\pi_\theta\) compresses it — exactly the dangerous regime where importance ratios explode and the surrogate approximation breaks. PPO replaces this explicit constraint with clipping.
2. Reference KL (reverse): \(\beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\)
This is the RLHF-specific regularizer that prevents the policy from drifting too far from the pretrained base model across all of training. Expanding it:
\[D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)}\right]\]
Since we care about what the current policy generates, the expectation is naturally under \(\pi_\theta\) — a reverse KL (new policy first). It penalizes the policy for generating outputs that the reference model would find unlikely.
| | Trust-Region KL | Reference KL |
|---|---|---|
| Purpose | Optimization stability | Regularization to base model |
| Direction | \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)\) (forward) | \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\) (reverse) |
| Sampling distribution | \(\pi_{\mathrm{old}}\) (data you already have) | \(\pi_\theta\) (outputs you will generate) |
| Constrains | Per-step update size | Total drift from reference |
| In the formula | Implicit (clipping) | Explicit (\(\beta\) penalty) |
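For concreteness, the "implicit (clipping)" entry can be sketched in a few lines of NumPy (function and argument names are my own, not from any particular codebase):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Per-token PPO clipped surrogate (to be maximized).

    logp_new / logp_old: log-probs of the sampled tokens under the current
    and rollout policies; adv: advantage estimates. All 1-D arrays.
    """
    ratio = np.exp(logp_new - logp_old)               # importance ratio
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv  # clipped surrogate
    return np.minimum(unclipped, clipped).mean()      # pessimistic of the two

# Toy example: one token pushed up, one pushed down.
obj = ppo_clip_objective(np.log([0.5, 0.1]), np.log([0.4, 0.2]),
                         np.array([1.0, -1.0]))
```

The `min` with the clipped term is what enforces the per-step trust region: once the ratio leaves \([1-\epsilon, 1+\epsilon]\), the gradient through it vanishes.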
The rest of this post focuses exclusively on the reference KL \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\) — specifically, how the choice of \(k_1, k_2, k_3\) and the choice of “in reward” vs. “as loss” affect its gradient.
Wait — the surrogate uses \(\mathbb{E}_{\pi_{\mathrm{old}}}\) but the reference KL uses \(\mathbb{E}_{\pi_\theta}\). How can they coexist in one loss?
They can't, as written. The formula above is conceptually clean but notationally sloppy — it mixes two different expectations. In practice, InstructGPT resolves this by folding the KL into the reward. The per-token reward becomes:
$$\tilde{r}_t = R(y) \cdot \mathbf{1}_{t=T} - \beta \cdot \big(\log \pi_{\mathrm{old}}(a_t \vert s_t) - \log \pi_{\mathrm{ref}}(a_t \vert s_t)\big)$$
Note the key move: the \(\log \pi_\theta\) in the KL is replaced by \(\log \pi_{\mathrm{old}}\) — the policy that actually generated the rollout. The KL penalty is computed at rollout time and treated as part of the reward, detached from the gradient. The advantage \(\hat{A}\) is then estimated from this modified reward using GAE, and the entire PPO loss has a single expectation under \(\pi_{\mathrm{old}}\):
$$\mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}_{\tilde{r}},\; \mathrm{clip}(\cdots)\hat{A}_{\tilde{r}}\Big)\right]$$
This is precisely the "\(k_1\) in reward" approach that the rest of this post will analyze. The reference KL never appears as a separate loss term with its own expectation — it is absorbed into the advantage.
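A minimal sketch of that shaping step, assuming 1-D arrays of per-token log-probs for the sampled tokens (names are illustrative, not InstructGPT's actual code):

```python
import numpy as np

def shaped_rewards(final_reward, logp_old, logp_ref, beta=0.1):
    """Fold the k1 KL penalty into per-token rewards (detached: computed
    from rollout-time log-probs, never part of the gradient graph).

    logp_old / logp_ref: log-probs of the sampled tokens under the rollout
    policy and the reference policy (1-D arrays of length T).
    """
    r = -beta * (logp_old - logp_ref)  # -beta * k1 at every token
    r[-1] += final_reward              # sequence reward lands on the last token
    return r

# Toy 4-token rollout with illustrative numbers.
rt = shaped_rewards(1.0,
                    np.array([-1.0, -2.0, -0.5, -1.5]),
                    np.array([-1.2, -1.8, -0.5, -2.0]))
```

GAE then runs over `rt` exactly as it would over any other reward sequence, so the KL penalty reaches the policy only through the advantage.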
"In Reward" vs. "As Loss"
In RLHF, \(k_n\) is used to regularize the policy \(\pi_\theta\) toward a reference \(\pi_{\mathrm{ref}}\). Let \(\delta = \pi_{\mathrm{ref}}(y \vert x) / \pi_\theta(y \vert x)\) (here \(\delta\) denotes this probability ratio, not the trust-region radius from earlier). There are two fundamentally different ways to plug \(k_n\) into the loss:
"\(k_n\) in reward" (combined form): treat \(k_n\) as a detached scalar in the REINFORCE objective — it modulates the policy gradient like a reward signal, but is not differentiated:
\[\mathcal{L} = -\mathbb{E}_{y \sim \pi_\theta}\!\Big[\big(R(y) - \beta \cdot k_n\big) \cdot \log \pi_\theta(y \vert x)\Big].\]
"\(k_n\) as loss" (decoupled form): add \(k_n\) as a separate differentiable loss — the gradient flows through \(k_n\) itself via the chain rule:
\[\mathcal{L} = -\mathbb{E}\!\big[R(y) \cdot \log \pi_\theta(y \vert x)\big] + \beta \cdot \mathbb{E}\!\big[k_n(\pi_\theta, \pi_{\mathrm{ref}})\big].\]
These produce different gradients, even though they use the same formula.
The Gradient Analysis
The target is the reverse KL \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\), whose true gradient (under on-policy sampling) is:
\[\nabla_\theta \mathcal{J}_{\mathrm{RKL}} = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)} \cdot \nabla_\theta \log \pi_\theta(y \vert x)\right].\]
Why reverse KL \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\)?
Because this is exactly what appears in the standard RLHF objective:
$$\max_{\pi_\theta} \; \mathbb{E}_{y \sim \pi_\theta}\big[R(y)\big] - \beta \cdot \mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}].$$
The KL is \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\) rather than \(\mathrm{KL}[\pi_{\mathrm{ref}} \Vert \pi_\theta]\) because we sample from the current policy \(\pi_\theta\). The expectation \(\mathbb{E}_{y \sim \pi_\theta}[\log(\pi_\theta / \pi_{\mathrm{ref}})]\) is directly compatible with the policy gradient framework — no importance sampling needed. PPO-RLHF, GRPO, and other on-policy RLHF methods all use this form.
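This gradient identity can be sanity-checked numerically on a toy categorical policy (all numbers below are illustrative): differentiating the exact reverse KL by finite differences should match the score-weighted expectation above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi_ref = np.array([0.25, 0.25, 0.5])  # toy reference distribution
theta = np.array([0.4, -0.2, 0.1])    # toy policy logits

def rkl(th):
    """Exact reverse KL[pi_theta || pi_ref] for a categorical policy."""
    pi = softmax(th)
    return np.sum(pi * np.log(pi / pi_ref))

# Policy-gradient form: E_{y~pi}[ log(pi/pi_ref) * grad log pi(y) ],
# computed exactly by summing over all outcomes.
pi = softmax(theta)
coeff = np.log(pi / pi_ref)
g_pg = sum(pi[y] * coeff[y] * (np.eye(3)[y] - pi) for y in range(3))

# Finite-difference gradient of the KL itself, for comparison.
eps = 1e-6
g_fd = np.array([
    (rkl(theta + eps * np.eye(3)[i]) - rkl(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
```

The two gradients agree because the extra term from differentiating the sampling distribution, \(\mathbb{E}[\nabla_\theta \pi_\theta]\), sums to zero.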
The coefficient \(-\log \delta = \log(\pi_\theta / \pi_{\mathrm{ref}})\) multiplying \(\nabla_\theta \log \pi_\theta\) is the “signal” that pushes the policy back toward the reference. Liu et al. show which implementations recover this gradient:
\(k_1\) in reward produces the correct gradient. Since \(k_1 = -\log \delta\), placing it in the REINFORCE coefficient directly yields \(-\log \delta \cdot \nabla_\theta \log \pi_\theta\) — exactly the RKL gradient. ✓
\(k_2\) as loss is gradient-equivalent to \(k_1\) in reward. Since \(k_2 = \frac{1}{2}(\log \delta)^2\), differentiating directly gives \(\nabla_\theta k_2 = \log \delta \cdot \nabla_\theta \log \delta = -\log \delta \cdot \nabla_\theta \log \pi_\theta\) — the same RKL gradient. ✓
This is the paper's key equivalence result (Theorem 5.1): "\(k_1\) in reward" \(=\) "\(k_2\) as loss" in terms of gradient.
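The equivalence is easy to verify for a single fixed sample \(y\) under a toy softmax policy (an illustrative sketch, not the paper's code): the finite-difference gradient of the "\(k_2\) as loss" term matches the detached "\(k_1\) in reward" coefficient times the score \(\nabla_\theta \log \pi_\theta(y)\).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.3, -0.7, 1.1])  # toy policy logits
pi_ref = np.array([0.2, 0.5, 0.3])  # toy reference distribution
y = 1                               # a fixed sampled action

def k2_loss(th):
    """k2 = 0.5 * (log delta)^2 with delta = pi_ref(y) / pi_theta(y)."""
    log_delta = np.log(pi_ref[y]) - np.log(softmax(th)[y])
    return 0.5 * log_delta ** 2

# "k2 as loss": differentiate k2 directly (here by central differences).
eps = 1e-6
g_k2 = np.array([
    (k2_loss(theta + eps * np.eye(3)[i]) - k2_loss(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])

# "k1 in reward": detached coefficient -log(delta) times the score function.
pi = softmax(theta)
score = np.eye(3)[y] - pi                  # grad of log pi(y) w.r.t. the logits
coeff = np.log(pi[y]) - np.log(pi_ref[y])  # -log(delta) = log(pi_theta/pi_ref)
g_k1 = coeff * score
```

The match is per-sample, not just in expectation, which is exactly what Theorem 5.1's gradient equivalence asserts.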
k₁ as Loss: A Surprising Failure
What if we use \(k_1\) as a direct loss instead of in the reward? Since \(k_1 = -\log \delta = \log \pi_\theta - \log \pi_{\mathrm{ref}}\), differentiating gives:
\[\nabla_\theta k_1 = \nabla_\theta \log \pi_\theta(y \vert x).\]The reference policy \(\pi_{\mathrm{ref}}\) has completely disappeared from the gradient — it carries no regularization signal at all. Worse, by the score function identity \(\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta] = 0\), this gradient has zero expectation. It contributes nothing but noise.
This is a stark example of how a perfect estimator (\(k_1\) is exactly unbiased for KL) can be a terrible loss function.
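The zero-mean claim can be checked exactly for a small softmax policy (toy logits below): averaging the per-sample \(k_1\)-as-loss gradient over the policy's own distribution gives the zero vector, and \(\pi_{\mathrm{ref}}\) never enters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -1.0, 0.2, 1.3])  # toy logits; note: no pi_ref anywhere
pi = softmax(theta)

# Per-sample gradient of k1-as-loss is just the score, grad log pi(y);
# row y of `scores` is e_y - pi, the gradient w.r.t. the logits.
scores = np.eye(len(pi)) - pi

# Exact expectation under pi: sum_y pi(y) * (e_y - pi) = pi - pi = 0.
expected_grad = pi @ scores
print(expected_grad)  # the zero vector, up to floating point
```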
k₃ as Loss (GRPO): A Biased Approximation
GRPO uses \(k_3 = \delta - 1 - \log \delta\) as a directly differentiated loss (decoupled form). Differentiating:
\[\nabla_\theta k_3 = \nabla_\theta \delta - \nabla_\theta \log \delta = \Big(1 - \frac{1}{\delta}\Big) \nabla_\theta \delta = (1 - \delta) \cdot \nabla_\theta \log \pi_\theta,\]
using \(\nabla_\theta \delta = \delta \cdot \nabla_\theta \log \delta\) and \(\nabla_\theta \log \delta = -\nabla_\theta \log \pi_\theta\). Compared to the true RKL gradient coefficient \(-\log \delta\), GRPO uses \(1 - \delta\). These are related by Taylor expansion — \(-\log \delta = (1 - \delta) + \frac{1}{2}(\delta - 1)^2 - \cdots\) — so \(1 - \delta\) is only the first-order approximation of \(-\log \delta\). This introduces three problems:
1. Bias: for all \(\delta \neq 1\), the coefficient \(1 - \delta \neq -\log \delta\), so the gradient direction is biased.
2. Pathological asymmetry: when the policy deviates away from the reference (\(\delta \to 0\), meaning \(\pi_\theta \gg \pi_{\mathrm{ref}}\)), the true coefficient \(-\log \delta \to +\infty\) provides a strong restoring force, but \(1 - \delta \to 1\) saturates — it cannot push back hard enough. Conversely, when \(\delta \to \infty\) (\(\pi_\theta \ll \pi_{\mathrm{ref}}\)), \(1 - \delta \to -\infty\) explodes much faster than the logarithmic \(-\log \delta\), risking destabilizing updates.
3. Variance: the variance of \(1 - \delta\) involves \(\mathrm{Var}[\delta] = \chi^2(\pi_{\mathrm{ref}} \Vert \pi_\theta)\), the chi-squared divergence, which is notoriously unstable and can diverge even when the KL remains finite.
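Plugging in a few extreme ratios makes the asymmetry concrete (values chosen purely for illustration):

```python
import numpy as np

delta = np.array([0.01, 0.1, 1.0, 10.0, 100.0])  # delta = pi_ref / pi_theta
true_coeff = -np.log(delta)  # RKL gradient coefficient
grpo_coeff = 1 - delta       # k3-as-loss gradient coefficient

for d, t, g in zip(delta, true_coeff, grpo_coeff):
    print(f"delta={d:7.2f}   -log(delta)={t:7.2f}   1-delta={g:8.2f}")
```

At \(\delta = 0.01\) the true coefficient is about \(4.6\) while GRPO's is capped at \(0.99\); at \(\delta = 100\) GRPO's is \(-99\) against a logarithmic \(-4.6\).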
So paradoxically, \(k_3\) — the “clear winner” as a KL estimator — produces a biased, asymmetric, and potentially unstable gradient when used as a loss in GRPO. The paper recommends \(k_2\) as loss (or equivalently \(k_1\) in reward) as the principled default.
Summary: Estimation ≠ Optimization
The following table contrasts the estimator ranking (Schulman) with the optimization ranking (Liu et al.):
| | As Estimator | "\(k_n\) in reward" | "\(k_n\) as loss" |
|---|---|---|---|
| \(k_1 = -\log \delta\) | Unbiased, high variance | ✓ Correct RKL gradient | ✗ Zero-mean noise, no regularization |
| \(k_2 = \frac{1}{2}(\log \delta)^2\) | Biased (low), low variance | — | ✓ Correct RKL gradient |
| \(k_3 = (\delta - 1) - \log \delta\) | Unbiased, low variance | — | ≈ First-order biased approximation |
The irony is complete: \(k_1\), the worst estimator (high variance), produces the correct gradient when placed in the reward. \(k_3\), the best estimator (unbiased + low variance), produces a biased gradient when used as a loss. And \(k_2\), the biased estimator, produces the correct gradient as a loss — making it gradient-equivalent to \(k_1\) in reward.
The reason is that estimation asks “how close is the value \(k_n\) to the true KL?” while optimization asks “does \(\nabla_\theta k_n\) point in the right direction?” These are fundamentally different questions, and a good answer to one does not imply a good answer to the other.