From Estimation to Optimization: KL Regularization in RLHF
Recap: Schulman's Three Estimators
In Approximating KL Divergence, Schulman defined three Monte Carlo estimators of \(\mathrm{KL}[q,p]\) using the ratio \(r = p(x)/q(x)\):
| | Formula | Unbiased? | Always \(\geq 0\)? | Variance |
|---|---|---|---|---|
| \(k_1\) | \(-\log r\) | Yes | No | High |
| \(k_2\) | \(\frac{1}{2}(\log r)^2\) | No (low bias) | Yes | Low |
| \(k_3\) | \((r-1) - \log r\) | Yes | Yes | Low |
As an estimator, \(k_3\) is the clear winner: unbiased, non-negative, and low variance. But estimation and optimization are different games.
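These properties are easy to check numerically. Below is a minimal NumPy sketch (the two categorical distributions are made up for illustration) that samples from \(q\) and compares all three estimators against the exact KL:

```python
import numpy as np

# Two made-up categorical distributions for illustration.
q = np.array([0.5, 0.3, 0.2])    # sampling distribution
p = np.array([0.4, 0.35, 0.25])  # target distribution

true_kl = np.sum(q * np.log(q / p))  # exact KL[q, p]

rng = np.random.default_rng(0)
x = rng.choice(len(q), size=200_000, p=q)  # Monte Carlo samples x ~ q
r = p[x] / q[x]                            # ratio r = p(x)/q(x)

k1 = -np.log(r)              # unbiased, can go negative, high variance
k2 = 0.5 * np.log(r) ** 2    # biased (low), always >= 0, low variance
k3 = (r - 1) - np.log(r)     # unbiased, always >= 0, low variance

print(true_kl, k1.mean(), k2.mean(), k3.mean())
print(k1.var(), k3.var())
```

With enough samples, \(k_1\) and \(k_3\) both center on the true KL, but \(k_3\)'s spread is far smaller, which is exactly the ranking in the table.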
Two Different KLs in PPO-RLHF
Before analyzing how \(k_1, k_2, k_3\) behave as losses, we need to untangle a common source of confusion. PPO-based RLHF involves two different KL divergences that serve entirely different purposes and point in different directions. To see both clearly, start from the TRPO-RLHF formulation with the KL constraint written explicitly:
\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}\right] - \beta \cdot \underbrace{D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})}_{\text{reference KL (reverse)}} \quad \text{s.t.}\quad \underbrace{D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)}_{\text{trust-region KL (forward)}} \leq \delta\]
PPO approximates the constraint by replacing it with clipping, yielding the familiar PPO-RLHF objective (as in InstructGPT):
\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A},\; \mathrm{clip}\!\big(\tfrac{\pi_\theta}{\pi_{\mathrm{old}}}, 1\!-\!\epsilon, 1\!+\!\epsilon\big)\hat{A}\Big)\right] - \beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\]
The two KLs are:
1. Trust-region KL (forward): \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta) \leq \delta\)
This is the constraint inherited from TRPO: don’t let the new policy deviate too far from the old policy within a single update step. Since data is sampled from \(\pi_{\mathrm{old}}\), the constraint is measured under \(\pi_{\mathrm{old}}\) — a forward KL (old policy first). It strongly penalizes the case where \(\pi_{\mathrm{old}}\) puts high probability on an action but \(\pi_\theta\) compresses it — exactly the dangerous regime where importance ratios explode and the surrogate approximation breaks. PPO replaces this explicit constraint with clipping.
2. Reference KL (reverse): \(\beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\)
This is the RLHF-specific regularizer that prevents the policy from drifting too far from the pretrained base model across all of training. Expanding it:
\[D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)}\right]\]
Since we care about what the current policy generates, the expectation is naturally under \(\pi_\theta\) — a reverse KL (new policy first). It penalizes the policy for generating outputs that the reference model would find unlikely.
| | Trust-Region KL | Reference KL |
|---|---|---|
| Purpose | Optimization stability | Regularization to base model |
| Direction | \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)\) (forward) | \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\) (reverse) |
| Sampling distribution | \(\pi_{\mathrm{old}}\) (data you already have) | \(\pi_\theta\) (outputs you will generate) |
| Constrains | Per-step update size | Total drift from reference |
| In the formula | Implicit (clipping) | Explicit (\(\beta\) penalty) |
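For concreteness, the "implicit (clipping)" entry can be sketched in a few lines of NumPy (function and argument names are my own, not from any particular codebase):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Per-token PPO clipped surrogate (to be maximized).

    logp_new / logp_old: log-probs of the sampled tokens under the current
    and rollout policies; adv: advantage estimates. All 1-D arrays.
    """
    ratio = np.exp(logp_new - logp_old)               # importance ratio
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv  # clipped surrogate
    return np.minimum(unclipped, clipped).mean()      # pessimistic of the two

# Toy example: one token pushed up, one pushed down.
obj = ppo_clip_objective(np.log([0.5, 0.1]), np.log([0.4, 0.2]),
                         np.array([1.0, -1.0]))
```

The `min` with the clipped term is what enforces the per-step trust region: once the ratio leaves \([1-\epsilon, 1+\epsilon]\), the gradient through it vanishes.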
The rest of this post focuses exclusively on the reference KL \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\) — specifically, how the choice of \(k_1, k_2, k_3\) and the choice of “in reward” vs. “as loss” affect its gradient.
Wait — the surrogate uses \(\mathbb{E}_{\pi_{\mathrm{old}}}\) but the reference KL uses \(\mathbb{E}_{\pi_\theta}\). How can they coexist in one loss?
They can't, as written. The formula above is conceptually clean but notationally sloppy — it mixes two different expectations. In practice, InstructGPT resolves this by folding the KL into the reward. The per-token reward becomes:
$$\tilde{r}_t = R(y) \cdot \mathbf{1}_{t=T} - \beta \cdot \big(\log \pi_{\mathrm{old}}(a_t \vert s_t) - \log \pi_{\mathrm{ref}}(a_t \vert s_t)\big)$$
Note the key move: the \(\log \pi_\theta\) in the KL is replaced by \(\log \pi_{\mathrm{old}}\) — the policy that actually generated the rollout. The KL penalty is computed at rollout time and treated as part of the reward, detached from the gradient. The advantage \(\hat{A}\) is then estimated from this modified reward using GAE, and the entire PPO loss has a single expectation under \(\pi_{\mathrm{old}}\):
$$\mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}_{\tilde{r}},\; \mathrm{clip}(\cdots)\hat{A}_{\tilde{r}}\Big)\right]$$
This is precisely the "\(k_1\) in reward" approach that the rest of this post will analyze. The reference KL never appears as a separate loss term with its own expectation — it is absorbed into the advantage.
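A minimal sketch of that shaping step, assuming 1-D arrays of per-token log-probs for the sampled tokens (names are illustrative, not InstructGPT's actual code):

```python
import numpy as np

def shaped_rewards(final_reward, logp_old, logp_ref, beta=0.1):
    """Fold the k1 KL penalty into per-token rewards (detached: computed
    from rollout-time log-probs, never part of the gradient graph).

    logp_old / logp_ref: log-probs of the sampled tokens under the rollout
    policy and the reference policy (1-D arrays of length T).
    """
    r = -beta * (logp_old - logp_ref)  # -beta * k1 at every token
    r[-1] += final_reward              # sequence reward lands on the last token
    return r

# Toy 4-token rollout with illustrative numbers.
rt = shaped_rewards(1.0,
                    np.array([-1.0, -2.0, -0.5, -1.5]),
                    np.array([-1.2, -1.8, -0.5, -2.0]))
```

GAE then runs over `rt` exactly as it would over any other reward sequence, so the KL penalty reaches the policy only through the advantage.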
"In Reward" vs. "As Loss"
In RLHF, \(k_n\) is used to regularize the policy \(\pi_\theta\) toward a reference \(\pi_{\mathrm{ref}}\). Let \(\delta = \pi_{\mathrm{ref}}(y \vert x) / \pi_\theta(y \vert x)\) (here \(\delta\) denotes this probability ratio, not the trust-region radius from earlier). There are two fundamentally different ways to plug \(k_n\) into the loss:
"\(k_n\) in reward" (combined form): treat \(k_n\) as a detached scalar in the REINFORCE objective — it modulates the policy gradient like a reward signal, but is not differentiated:
\[\mathcal{L} = -\mathbb{E}_{y \sim \pi_\theta}\!\Big[\big(R(y) - \beta \cdot k_n\big) \cdot \log \pi_\theta(y \vert x)\Big].\]
"\(k_n\) as loss" (decoupled form): add \(k_n\) as a separate differentiable loss — the gradient flows through \(k_n\) itself via the chain rule:
\[\mathcal{L} = -\mathbb{E}\!\big[R(y) \cdot \log \pi_\theta(y \vert x)\big] + \beta \cdot \mathbb{E}\!\big[k_n(\pi_\theta, \pi_{\mathrm{ref}})\big].\]
These produce different gradients, even though they use the same formula.
The Gradient Analysis
The target is the reverse KL \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\), whose true gradient (under on-policy sampling) is:
\[\nabla_\theta \mathcal{J}_{\mathrm{RKL}} = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)} \cdot \nabla_\theta \log \pi_\theta(y \vert x)\right].\]
Why reverse KL \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\)?
Because this is exactly what appears in the standard RLHF objective:
$$\max_{\pi_\theta} \; \mathbb{E}_{y \sim \pi_\theta}\big[R(y)\big] - \beta \cdot \mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}].$$
The KL is \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\) rather than \(\mathrm{KL}[\pi_{\mathrm{ref}} \Vert \pi_\theta]\) because we sample from the current policy \(\pi_\theta\). The expectation \(\mathbb{E}_{y \sim \pi_\theta}[\log(\pi_\theta / \pi_{\mathrm{ref}})]\) is directly compatible with the policy gradient framework — no importance sampling needed. PPO-RLHF, GRPO, and other on-policy RLHF methods all use this form.
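This gradient identity can be sanity-checked numerically on a toy categorical policy (all numbers below are illustrative): differentiating the exact reverse KL by finite differences should match the score-weighted expectation above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi_ref = np.array([0.25, 0.25, 0.5])  # toy reference distribution
theta = np.array([0.4, -0.2, 0.1])    # toy policy logits

def rkl(th):
    """Exact reverse KL[pi_theta || pi_ref] for a categorical policy."""
    pi = softmax(th)
    return np.sum(pi * np.log(pi / pi_ref))

# Policy-gradient form: E_{y~pi}[ log(pi/pi_ref) * grad log pi(y) ],
# computed exactly by summing over all outcomes.
pi = softmax(theta)
coeff = np.log(pi / pi_ref)
g_pg = sum(pi[y] * coeff[y] * (np.eye(3)[y] - pi) for y in range(3))

# Finite-difference gradient of the KL itself, for comparison.
eps = 1e-6
g_fd = np.array([
    (rkl(theta + eps * np.eye(3)[i]) - rkl(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
```

The two gradients agree because the extra term from differentiating the sampling distribution, \(\mathbb{E}[\nabla_\theta \pi_\theta]\), sums to zero.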
The coefficient \(-\log \delta = \log(\pi_\theta / \pi_{\mathrm{ref}})\) multiplying \(\nabla_\theta \log \pi_\theta\) is the “signal” that pushes the policy back toward the reference. Liu et al. show which implementations recover this gradient:
\(k_1\) in reward produces the correct gradient. Since \(k_1 = -\log \delta\), placing it in the REINFORCE coefficient directly yields \(-\log \delta \cdot \nabla_\theta \log \pi_\theta\) — exactly the RKL gradient. ✓
\(k_2\) as loss is gradient-equivalent to \(k_1\) in reward. Since \(k_2 = \frac{1}{2}(\log \delta)^2\), differentiating directly gives \(\nabla_\theta k_2 = \log \delta \cdot \nabla_\theta \log \delta = -\log \delta \cdot \nabla_\theta \log \pi_\theta\) — the same RKL gradient. ✓
This is the paper's key equivalence result (Theorem 5.1): "\(k_1\) in reward" \(=\) "\(k_2\) as loss" in terms of gradient.
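The equivalence is easy to verify for a single fixed sample \(y\) under a toy softmax policy (an illustrative sketch, not the paper's code): the finite-difference gradient of the "\(k_2\) as loss" term matches the detached "\(k_1\) in reward" coefficient times the score \(\nabla_\theta \log \pi_\theta(y)\).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.3, -0.7, 1.1])  # toy policy logits
pi_ref = np.array([0.2, 0.5, 0.3])  # toy reference distribution
y = 1                               # a fixed sampled action

def k2_loss(th):
    """k2 = 0.5 * (log delta)^2 with delta = pi_ref(y) / pi_theta(y)."""
    log_delta = np.log(pi_ref[y]) - np.log(softmax(th)[y])
    return 0.5 * log_delta ** 2

# "k2 as loss": differentiate k2 directly (here by central differences).
eps = 1e-6
g_k2 = np.array([
    (k2_loss(theta + eps * np.eye(3)[i]) - k2_loss(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])

# "k1 in reward": detached coefficient -log(delta) times the score function.
pi = softmax(theta)
score = np.eye(3)[y] - pi                  # grad of log pi(y) w.r.t. the logits
coeff = np.log(pi[y]) - np.log(pi_ref[y])  # -log(delta) = log(pi_theta/pi_ref)
g_k1 = coeff * score
```

The match is per-sample, not just in expectation, which is exactly what Theorem 5.1's gradient equivalence asserts.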
k₁ as Loss: A Surprising Failure
What if we use \(k_1\) as a direct loss instead of in the reward? Since \(k_1 = -\log \delta = \log \pi_\theta - \log \pi_{\mathrm{ref}}\), differentiating gives:
\[\nabla_\theta k_1 = \nabla_\theta \log \pi_\theta(y \vert x).\]The reference policy \(\pi_{\mathrm{ref}}\) has completely disappeared from the gradient — it carries no regularization signal at all. Worse, by the score function identity \(\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta] = 0\), this gradient has zero expectation. It contributes nothing but noise.
This is a stark example of how a perfect estimator (\(k_1\) is exactly unbiased for KL) can be a terrible loss function.
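The zero-mean claim can be checked exactly for a small softmax policy (toy logits below): averaging the per-sample \(k_1\)-as-loss gradient over the policy's own distribution gives the zero vector, and \(\pi_{\mathrm{ref}}\) never enters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -1.0, 0.2, 1.3])  # toy logits; note: no pi_ref anywhere
pi = softmax(theta)

# Per-sample gradient of k1-as-loss is just the score, grad log pi(y);
# row y of `scores` is e_y - pi, the gradient w.r.t. the logits.
scores = np.eye(len(pi)) - pi

# Exact expectation under pi: sum_y pi(y) * (e_y - pi) = pi - pi = 0.
expected_grad = pi @ scores
print(expected_grad)  # the zero vector, up to floating point
```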
k₃ as Loss (GRPO): A Biased Approximation
GRPO uses \(k_3 = \delta - 1 - \log \delta\) as a directly differentiated loss (decoupled form). Differentiating:
\[\nabla_\theta k_3 = \nabla_\theta \delta - \nabla_\theta \log \delta = \Big(1 - \frac{1}{\delta}\Big) \nabla_\theta \delta = (1 - \delta) \cdot \nabla_\theta \log \pi_\theta,\]
using \(\nabla_\theta \delta = \delta \cdot \nabla_\theta \log \delta\) and \(\nabla_\theta \log \delta = -\nabla_\theta \log \pi_\theta\). Compared to the true RKL gradient coefficient \(-\log \delta\), GRPO uses \(1 - \delta\). These are related by Taylor expansion — \(-\log \delta = (1 - \delta) + \frac{1}{2}(\delta - 1)^2 - \cdots\) — so \(1 - \delta\) is only the first-order approximation of \(-\log \delta\). This introduces three problems:
1. Bias: for all \(\delta \neq 1\), the coefficient \(1 - \delta \neq -\log \delta\), so the gradient direction is biased.
2. Pathological asymmetry: when the policy deviates away from the reference (\(\delta \to 0\), meaning \(\pi_\theta \gg \pi_{\mathrm{ref}}\)), the true coefficient \(-\log \delta \to +\infty\) provides a strong restoring force, but \(1 - \delta \to 1\) saturates — it cannot push back hard enough. Conversely, when \(\delta \to \infty\) (\(\pi_\theta \ll \pi_{\mathrm{ref}}\)), \(1 - \delta \to -\infty\) explodes much faster than the logarithmic \(-\log \delta\), risking destabilizing updates.
3. Variance: the variance of \(1 - \delta\) involves \(\mathrm{Var}[\delta] = \chi^2(\pi_{\mathrm{ref}} \Vert \pi_\theta)\), the chi-squared divergence, which is notoriously unstable and can diverge even when the KL remains finite.
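Plugging in a few extreme ratios makes the asymmetry concrete (values chosen purely for illustration):

```python
import numpy as np

delta = np.array([0.01, 0.1, 1.0, 10.0, 100.0])  # delta = pi_ref / pi_theta
true_coeff = -np.log(delta)  # RKL gradient coefficient
grpo_coeff = 1 - delta       # k3-as-loss gradient coefficient

for d, t, g in zip(delta, true_coeff, grpo_coeff):
    print(f"delta={d:7.2f}   -log(delta)={t:7.2f}   1-delta={g:8.2f}")
```

At \(\delta = 0.01\) the true coefficient is about \(4.6\) while GRPO's is capped at \(0.99\); at \(\delta = 100\) GRPO's is \(-99\) against a logarithmic \(-4.6\).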
So paradoxically, \(k_3\) — the “clear winner” as a KL estimator — produces a biased, asymmetric, and potentially unstable gradient when used as a loss in GRPO. The paper recommends \(k_2\) as loss (or equivalently \(k_1\) in reward) as the principled default.
Summary: Estimation ≠ Optimization
The following table contrasts the estimator ranking (Schulman) with the optimization ranking (Liu et al.):
| | As Estimator | "\(k_n\) in reward" | "\(k_n\) as loss" |
|---|---|---|---|
| \(k_1 = -\log \delta\) | Unbiased, high variance | ✓ Correct RKL gradient | ✗ Zero-mean noise, no regularization |
| \(k_2 = \frac{1}{2}(\log \delta)^2\) | Biased (low), low variance | — | ✓ Correct RKL gradient |
| \(k_3 = (\delta - 1) - \log \delta\) | Unbiased, low variance | — | ≈ First-order biased approximation |
The irony is complete: \(k_1\), the worst estimator (high variance), produces the correct gradient when placed in the reward. \(k_3\), the best estimator (unbiased + low variance), produces a biased gradient when used as a loss. And \(k_2\), the biased estimator, produces the correct gradient as a loss — making it gradient-equivalent to \(k_1\) in reward.
The reason is that estimation asks “how close is the value \(k_n\) to the true KL?” while optimization asks “does \(\nabla_\theta k_n\) point in the right direction?” These are fundamentally different questions, and a good answer to one does not imply a good answer to the other.