The Policy Gradient Family: PG, PPO, and AC

The policy gradient sections of this post draw on Nan Jiang's CS 443 lecture notes on policy gradient. The actor-critic sections follow Sergey Levine's lecture on actor-critic methods. The KL approximation section distills John Schulman's blog post on approximating KL divergence. The KL estimation vs. optimization section follows Liu et al. (2025).

The Policy Gradient

Trajectories and the Objective

In reinforcement learning, an agent interacts with an environment by choosing actions according to a policy \(\pi_\theta(a \vert s)\) — a distribution over actions given a state, parameterized by \(\theta\). Each interaction produces a trajectory:

\[\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1})\]

The probability of a trajectory under policy \(\pi_\theta\) is:

\[P^{\pi_\theta}(\tau) = d_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \vert s_t) \prod_{t=0}^{T-2} P(s_{t+1} \vert s_t, a_t)\]

where \(d_0\) is the initial state distribution and \(P(s_{t+1} \vert s_t, a_t)\) is the transition probability. This is an alternating product of policy terms (learnable) and environment terms (fixed).

The objective is \(J(\pi_\theta) = \sum_\tau R(\tau) P^{\pi_\theta}(\tau)\) — a sum over all possible trajectories in the MDP. Each trajectory \(\tau\) has a fixed return \(R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t\); what the policy controls is the probability \(P^{\pi_\theta}(\tau)\) assigned to each one. A better policy concentrates probability on high-return trajectories.

Figure 1: The expected return as a weighted sum over trajectory space. All trajectories fan out from a common start; use the playbar to watch them extend. Click parts of the formula to highlight what each represents: R(τ) colors by return, Pπ(τ) shows probability via thickness. Switch policies to see how the same trajectories get reweighted.

Deriving REINFORCE

To differentiate \(J\) with respect to \(\theta\):

\[\nabla_\theta J = \sum_\tau R(\tau) \nabla_\theta P^{\pi_\theta}(\tau)\]

This is already mathematically correct, but it sums over all possible trajectories — an astronomically large space that cannot be enumerated. We need a form that can be estimated by sampling a handful of trajectories from \(\pi_\theta\).

One might ask: why not just sample a trajectory \(\tau \sim \pi_\theta\), observe \(R(\tau)\), and let autograd backpropagate through \(\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\) directly? The problem is that the computation graph passes through a discrete sampling step — each action \(a_t\) is drawn from a categorical distribution \(\pi_\theta(\cdot \vert s_t)\). The sampled action is a discrete index, not a smooth function of \(\theta\), so the gradient cannot flow through it. Autograd only sees that the reward is some scalar, but has no way to capture the fact that changing \(\theta\) changes which trajectories get sampled. Standard backpropagation handles \(\nabla_\theta f_\theta(x)\) for fixed inputs \(x\), but here \(\theta\) affects both the function and the distribution over inputs. The log-derivative trick is precisely the tool that recovers this missing “distributional” part of the gradient.

Converting a sum \(\sum_\tau f(\tau)\) into an expectation \(\mathbb{E}_{\tau \sim P^{\pi_\theta}}[f(\tau) / P^{\pi_\theta}(\tau)]\) requires that \(P^{\pi_\theta}(\tau)\) is a valid probability distribution — non-negative and summing to 1. It is: \(P^{\pi_\theta}(\tau)\) is a product of the initial state distribution, per-step policy probabilities, and transition probabilities, all of which are valid distributions, so \(\sum_\tau P^{\pi_\theta}(\tau) = 1\) by construction. The log-derivative trick \(\nabla P = P \nabla \log P\) achieves exactly this conversion by factoring out \(P^{\pi_\theta}(\tau)\) as the sampling weight:

\[\nabla_\theta J = \sum_\tau R(\tau) P^{\pi_\theta}(\tau) \nabla_\theta \log P^{\pi_\theta}(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \nabla_\theta \log P^{\pi_\theta}(\tau)\right]\]

There is one more step: the gradient still involves \(\nabla_\theta \log P^{\pi_\theta}(\tau)\) — a trajectory-level quantity. To get something we can compute per action, we exploit the fact that the log turns the product of per-step probabilities into a sum:

\[\log P^{\pi_\theta}(\tau) = \log d_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \vert s_t) + \sum_{t=0}^{T-2} \log P(s_{t+1} \vert s_t, a_t)\]

The initial state distribution \(d_0\) and transition dynamics \(P\) do not depend on \(\theta\), so their gradients vanish. Only the policy terms survive:

\[\nabla_\theta J = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right]\]

This is the trajectory-level REINFORCE estimator. Because the gradient is now an expectation under \(\pi_\theta\), we can estimate it by sampling \(N\) trajectories and averaging:

\[\hat{g} = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \vert s_t^{(i)}), \quad \tau^{(i)} \sim \pi_\theta\]

Each action contributes its \(\nabla_\theta \log \pi_\theta\) — a quantity neural network frameworks compute naturally via backpropagation — weighted by the trajectory return. This is what makes the log-derivative form practical: it decomposes into per-action log-probabilities that fit directly into standard gradient-based training.
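To make the estimator concrete, here is a minimal sketch in PyTorch (an assumed choice of framework, not something the post specifies): it rolls out a few trajectories with a small softmax policy against a Gym-style `env` (a hypothetical interface), weights each trajectory's summed log-probabilities by its return, and lets autograd produce \(\hat{g}\) through the surrogate-loss trick discussed below.

```python
import torch
from torch.distributions import Categorical

# Minimal REINFORCE sketch. Assumptions: a Gym-style `env` with reset()/step(),
# a 4-dim observation and 2 discrete actions; all names are illustrative.
policy = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def run_episode(env):
    logps, rewards = [], []
    state, done = env.reset(), False
    while not done:
        dist = Categorical(logits=policy(torch.as_tensor(state, dtype=torch.float32)))
        action = dist.sample()                    # discrete, non-differentiable step
        logps.append(dist.log_prob(action))       # log pi_theta(a_t | s_t) IS differentiable
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    return torch.stack(logps), rewards

def reinforce_step(env, num_episodes=8):
    per_traj_losses = []
    for _ in range(num_episodes):
        logps, rewards = run_episode(env)
        R = sum((gamma ** t) * r for t, r in enumerate(rewards))   # R(tau)
        per_traj_losses.append(-R * logps.sum())  # R(tau) * sum_t log pi(a_t|s_t), negated
    loss = torch.stack(per_traj_losses).mean()    # average over N sampled trajectories
    optimizer.zero_grad()
    loss.backward()                               # gradient equals the negated estimator
    optimizer.step()
```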

By decomposing the return over time steps, this trajectory-level estimator can be rewritten in a per-step form using the discounted state occupancy \(d^{\pi_\theta}\):

\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\!\left[Q^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]

where \(Q^{\pi_\theta}(s, a)\) is the action-value function.

Variance Reduction: Baselines and the Advantage

A useful property of \(\nabla_\theta \log \pi_\theta\) is that its expectation under the policy is zero: \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \vert s)] = \sum_a \nabla_\theta \pi_\theta(a \vert s) = \nabla_\theta 1 = 0\). This means we can subtract any state-dependent baseline \(b(s)\) from \(Q^{\pi_\theta}\) without introducing bias. The natural choice is the value function \(V^{\pi_\theta}(s)\), giving us the advantage function \(A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)\):

\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\!\left[A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]

The advantage centers the reward signal around zero, substantially reducing variance. This expectation requires sampling from \(\pi_\theta\) itself, meaning we need fresh data after every parameter update.
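The zero-mean property that licenses the baseline is easy to check numerically. The sketch below (illustrative only, using a toy 5-action softmax policy) computes \(\sum_a \pi_\theta(a \vert s)\,\nabla_\theta \log \pi_\theta(a \vert s)\) and confirms it is the zero vector, so adding or subtracting any \(b(s)\) changes nothing in expectation.

```python
import torch

# Verify E_{a ~ pi}[grad log pi(a|s)] = 0 for a toy softmax policy over 5 actions.
torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)        # logits for a single state
probs = torch.softmax(theta, dim=0).detach()

expected_score = torch.zeros_like(theta)
for a in range(5):
    log_pi_a = torch.log_softmax(theta, dim=0)[a]
    (grad_a,) = torch.autograd.grad(log_pi_a, theta, retain_graph=True)
    expected_score += probs[a] * grad_a           # sum_a pi(a) * grad log pi(a)

print(expected_score)  # ~zeros, so a baseline b(s) contributes b(s) * 0 to the gradient
```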

Implementation Notes: The Surrogate Loss

In practice, deep learning frameworks minimize a loss, so we need to translate the policy gradient into something a framework can compute. The standard approach is to define a surrogate loss

\[L_{\mathrm{sur}}(\theta) = -\sum_t A_t \log \pi_\theta(a_t \vert s_t),\]

where \(A_t\) is treated as a stop-gradient constant. Its gradient

\[\nabla_\theta L_{\mathrm{sur}} = -\sum_t A_t \nabla_\theta \log \pi_\theta(a_t \vert s_t)\]

is exactly the negated policy gradient, so minimizing \(L_{\mathrm{sur}}\) performs a policy gradient ascent step. Note that if we were to backpropagate through \(A_t\) (e.g., when \(A_t\) depends on a learned critic), an extra term \((\nabla_\theta A_t) \log \pi_\theta\) would appear, breaking the correspondence — this is why the advantage must always be detached.

A common source of confusion is that \(L_{\mathrm{sur}}\) looks like weighted negative log-likelihood, making REINFORCE appear identical to “weighted SFT.” In the special case of binary rewards where \(A_t = 1\) for successful trajectories and \(A_t = 0\) otherwise, the surrogate loss does reduce to NLL on successful trajectories — i.e., online filtered behavior cloning. But in general, the surrogate loss and the true objective

\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\]

are not the same function: they are merely gradient-equivalent. In supervised learning, \(-\log \pi_\theta(y \vert x)\) is the objective; in policy gradient, \(-A_t \log \pi_\theta(a_t \vert s_t)\) is a tool constructed to reproduce the correct gradient.
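In code the surrogate is a one-liner; the only subtlety is detaching the advantage so that no gradient flows into whatever produced it (e.g. a critic). A minimal sketch, with illustrative names:

```python
import torch

def pg_surrogate_loss(logps, advantages):
    """REINFORCE-style surrogate loss: minimizing it takes a policy-gradient ascent step.

    logps      -- log pi_theta(a_t | s_t) for the sampled actions, shape (T,), requires grad
    advantages -- A_t estimates, shape (T,); detached so they act as fixed weights
    """
    return -(advantages.detach() * logps).mean()
```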

This same idea extends beyond vanilla REINFORCE. PPO’s clipped surrogate

\[L^{\mathrm{PPO}} = \mathbb{E}\!\left[\min\!\Big(r_t A_t, \; \mathrm{clip}(r_t, 1{-}\epsilon, 1{+}\epsilon) A_t\Big)\right]\]

does not explicitly contain \(\log \pi\), but the importance ratio \(r_t = \pi_\theta / \pi_{\theta_\mathrm{old}}\) is computed via log-probabilities in practice. The underlying pattern is the same: first derive what gradient direction the policy should follow, then construct a surrogate objective that produces it.

Proximal Policy Optimization (PPO)

From On-Policy to Off-Policy

The policy gradient derived above requires sampling from the current policy \(\pi_\theta\): after every parameter update, all previously collected data becomes stale. This is wasteful — we would like to take multiple gradient steps on the same batch of data.

The idea is to use importance sampling to correct for the distribution mismatch. If our data was collected under an old policy \(\pi_{\mathrm{old}}\), we can reweight each sample by the probability ratio between the new and old policies. For a detailed treatment of importance sampling and how it applies to RL, see the importance sampling post.

The IS Surrogate Objective

Starting from the policy gradient in advantage form, we can rewrite the expectation over \(\pi_\theta\) as an expectation over \(\pi_{\mathrm{old}}\) by introducing a single-step IS ratio (see derivation):

\[L^{\mathrm{IS}}(\theta) = \mathbb{E}_{s, a \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta(a \vert s)}{\pi_{\mathrm{old}}(a \vert s)} \, A^{\pi_{\mathrm{old}}}(s, a)\right]\]

At \(\theta = \theta_{\mathrm{old}}\), the ratio equals 1 and the gradient of \(L^{\mathrm{IS}}\) reduces to the standard policy gradient. This means we can take gradient steps on \(L^{\mathrm{IS}}\) using data collected once from \(\pi_{\mathrm{old}}\), without recollecting trajectories after each step.

However, this surrogate only corrects the action distribution mismatch — the state distribution is still drawn from \(d^{\pi_{\mathrm{old}}}\), not \(d^{\pi_\theta}\). As \(\theta\) drifts from \(\theta_{\mathrm{old}}\), the two state distributions diverge and the surrogate can overestimate improvement, causing the policy to overshoot and degrade. See the hidden approximation discussion for details.

Clipping the Ratio

Proximal Policy Optimization (PPO) addresses this by clipping the IS ratio to prevent large updates. Define

\[r_t(\theta) = \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\mathrm{old}}(a_t \vert s_t)}.\]

PPO’s clipped surrogate objective is:

\[L^{\mathrm{CLIP}}(\theta) = \mathbb{E}\!\left[\min\!\Big(r_t(\theta)\, \hat{A}_t, \;\operatorname{clip}\!\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t\Big)\right]\]

where \(\epsilon\) is a small constant (typically 0.1–0.2). The \(\min\) takes the more pessimistic estimate:

  • When \(\hat{A}_t > 0\) (good action): the ratio is capped at \(1 + \epsilon\), preventing the policy from moving too aggressively toward this action.
  • When \(\hat{A}_t < 0\) (bad action): the ratio is floored at \(1 - \epsilon\), preventing the policy from moving too aggressively away from this action.

This trades a small amount of bias for much more stable training — rather than hoping the IS ratio stays well-behaved, we simply clip it by force.
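As a reference point, here is a minimal sketch of the clipped objective written as a loss to minimize; it assumes per-token log-probabilities under the current and old policies are already available, and all names are illustrative.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate, negated so a standard optimizer can minimize it."""
    adv = advantages.detach()
    ratio = torch.exp(logp_new - logp_old.detach())            # r_t = pi_theta / pi_old
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.minimum(unclipped, clipped).mean()           # pessimistic min, then negate
```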

Dual clipping (Ye et al., 2020) adds a second clip to handle a subtle failure mode of standard PPO. When \(\hat{A}_t < 0\) (bad action) and the ratio \(r_t\) is very large (\(r_t \gg 1 + \epsilon\)), the clipped branch is the bounded constant \((1\!+\!\epsilon)\hat{A}_t\), while the unclipped branch \(r_t \hat{A}_t\) is far more negative, so the \(\min\) selects the unclipped branch and the objective is unbounded below. Its gradient \(r_t \hat{A}_t \nabla_\theta \log \pi_\theta\) scales with \(r_t\): a single token whose probability has already drifted far above \(\pi_{\mathrm{old}}\) can contribute an arbitrarily large update, precisely in the regime where the single-step importance-sampling estimate is least reliable. In other words, for a bad action to which the current policy already assigns much more probability than the old policy did, standard clipping offers no protection at all.

Dual clip fixes this by introducing a lower bound \(c \hat{A}_t\) (with \(c > 1\), typically \(c = 3\)) when \(\hat{A}_t < 0\):

\[L^{\mathrm{DualCLIP}}(\theta) = \begin{cases} \max\!\Big(\min\!\big(r_t \hat{A}_t,\; \operatorname{clip}(r_t, 1\!-\!\epsilon, 1\!+\!\epsilon)\hat{A}_t\big),\; c\hat{A}_t\Big), & \hat{A}_t < 0 \\[4pt] \min\!\big(r_t \hat{A}_t,\; \operatorname{clip}(r_t, 1\!-\!\epsilon, 1\!+\!\epsilon)\hat{A}_t\big), & \hat{A}_t \ge 0 \end{cases}\]

The outer \(\max\) with \(c\hat{A}_t\) ensures that when the advantage is negative and the unclipped term \(r_t\hat{A}_t\) drops below \(c\hat{A}_t\), the objective is floored at \(c\hat{A}_t\). This creates a flat region with zero gradient, so tokens with very large ratios and negative advantages no longer contribute unbounded, high-variance updates.
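A sketch of dual clipping layered on the same pieces, applying the extra floor only where \(\hat{A}_t < 0\) (with \(c = 3\) as an assumed default):

```python
import torch

def dual_clip_ppo_loss(logp_new, logp_old, advantages, eps=0.2, c=3.0):
    """PPO with dual clipping (Ye et al., 2020): the c*A floor applies only when A < 0."""
    adv = advantages.detach()
    ratio = torch.exp(logp_new - logp_old.detach())
    standard = torch.minimum(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    floored = torch.maximum(standard, c * adv)                 # bound the objective below
    objective = torch.where(adv < 0, floored, standard)        # floor only for bad actions
    return -objective.mean()
```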

Why the Log-Form and Ratio-Form Losses Share the Same Gradient Direction

At first glance, the REINFORCE surrogate

\[L_{\mathrm{PG}}(\theta) = -A_t \log \pi_\theta(a_t \vert s_t)\]

explicitly contains \(\log \pi_\theta\), whereas the PPO-style ratio objective

\[L_{\mathrm{ratio}}(\theta) = -A_t \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\mathrm{old}}(a_t \vert s_t)}\]

does not. It may seem surprising that both lead to essentially the same update direction. The key is the identity \(\nabla_\theta \pi_\theta(a \vert s) = \pi_\theta(a \vert s) \, \nabla_\theta \log \pi_\theta(a \vert s)\), which explains why a loss written in terms of \(\pi_\theta\) can still produce a gradient in terms of \(\nabla_\theta \log \pi_\theta\).

REINFORCE form. The gradient of \(L_{\mathrm{PG}}\) is straightforward:

\[\nabla_\theta L_{\mathrm{PG}}(\theta) = -A_t \nabla_\theta \log \pi_\theta(a_t \vert s_t).\]

The score function \(\nabla_\theta \log \pi_\theta\) appears explicitly.

Ratio form. Since \(\pi_{\mathrm{old}}(a_t \vert s_t)\) is constant with respect to \(\theta\):

\[\nabla_\theta L_{\mathrm{ratio}}(\theta) = -\frac{A_t}{\pi_{\mathrm{old}}(a_t \vert s_t)} \nabla_\theta \pi_\theta(a_t \vert s_t).\]

Applying \(\nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta\):

\[\nabla_\theta L_{\mathrm{ratio}}(\theta) = -A_t \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\mathrm{old}}(a_t \vert s_t)} \nabla_\theta \log \pi_\theta(a_t \vert s_t) = -A_t \, r_t(\theta) \, \nabla_\theta \log \pi_\theta(a_t \vert s_t),\]

where \(r_t(\theta) = \pi_\theta(a_t \vert s_t) / \pi_{\mathrm{old}}(a_t \vert s_t)\).

Core conclusion. Even though the ratio-form loss does not explicitly contain \(\log \pi_\theta\), its gradient still has the same core score-function direction \(\nabla_\theta \log \pi_\theta(a_t \vert s_t)\). The difference is that PPO introduces an additional multiplicative weight \(r_t(\theta)\), and in practice also clipping, to control how aggressively the policy moves relative to the old policy. So:

  • REINFORCE directly optimizes a surrogate linear in \(\log \pi_\theta\);
  • PPO optimizes a surrogate linear in the probability ratio \(r_t(\theta)\);
  • but after differentiation, both are driven by the same score-function direction \(\nabla_\theta \log \pi_\theta\).

In one sentence: a loss does not need to explicitly contain \(\log \pi_\theta\) for its gradient to involve \(\nabla_\theta \log \pi_\theta\), because \(\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta\).
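The equivalence at \(\theta = \theta_{\mathrm{old}}\) can also be checked directly with autograd; away from \(\theta_{\mathrm{old}}\) the two gradients differ only by the factor \(r_t(\theta)\). A small illustrative check:

```python
import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)            # logits for one state
logp_old = torch.log_softmax(theta.detach(), dim=0)   # pi_old frozen at the current theta
action, advantage = 2, 1.7

logp = torch.log_softmax(theta, dim=0)[action]
ratio = torch.exp(logp - logp_old[action])

# log-form (REINFORCE) surrogate vs. ratio-form (unclipped PPO) surrogate
(grad_log,)   = torch.autograd.grad(-advantage * logp,  theta, retain_graph=True)
(grad_ratio,) = torch.autograd.grad(-advantage * ratio, theta)

print(torch.allclose(grad_log, grad_ratio))           # True: identical at theta = theta_old
```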

Why You Cannot Simply Multiply the PPO Objective by \(\log \pi_\theta\)

A natural but incorrect idea: since REINFORCE involves \(\log \pi_\theta(a_t \vert s_t)\), why not define a PPO-style objective as

\[\tilde{L}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[r_t(\theta) \, A_t \, \log \pi_\theta(a_t \vert s_t)\right]?\]

This is not the correct importance-sampled policy gradient objective — it introduces an extra factor in the gradient and changes the optimization problem entirely.

Correct PPO surrogate. The standard surrogate \(L_{\mathrm{PPO}}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}[r_t(\theta) \, A_t]\) has gradient

\[\nabla_\theta L_{\mathrm{PPO}}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \, \nabla_\theta r_t(\theta)\right] = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \, r_t(\theta) \, \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right],\]

which is exactly the desired importance-weighted score-function gradient.

Incorrect objective with extra \(\log \pi_\theta\). The gradient of \(\tilde{L}\) requires the product rule on \(r_t(\theta) \log \pi_\theta(a_t \vert s_t)\):

\[\nabla_\theta \tilde{L}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \Big((\nabla_\theta r_t) \log \pi_\theta + r_t \, \nabla_\theta \log \pi_\theta\Big)\right].\]

Substituting \(\nabla_\theta r_t = r_t \nabla_\theta \log \pi_\theta\):

\[\nabla_\theta \tilde{L}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \, r_t(\theta) \, \bigl(1 + \log \pi_\theta(a_t \vert s_t)\bigr) \, \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right].\]

Compared with the correct gradient \(A_t \, r_t \, \nabla_\theta \log \pi_\theta\), this has an extra multiplicative factor \(1 + \log \pi_\theta(a_t \vert s_t)\). Since \(\log \pi_\theta(a_t \vert s_t) \le 0\), this factor becomes negative whenever \(\pi_\theta(a_t \vert s_t) < e^{-1}\) — meaning the gradient can push the policy in the opposite direction of the intended update, even when \(A_t > 0\).

The importance ratio \(r_t(\theta)\) exists solely to correct for the sampling distribution mismatch. It should multiply the advantage — the quantity whose expectation we want to estimate — and nothing else. Inserting an extra \(\log \pi_\theta\) changes the objective itself and breaks the correspondence with the policy gradient theorem.
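The spurious factor is equally easy to expose numerically: differentiating \(r_t A_t \log \pi_\theta\) and comparing against the correct gradient recovers exactly the \(1 + \log \pi_\theta\) multiplier. A short illustrative check:

```python
import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)
logp_old = torch.log_softmax(theta.detach(), dim=0)
action, advantage = 2, 1.7

logp = torch.log_softmax(theta, dim=0)[action]
ratio = torch.exp(logp - logp_old[action])

(grad_correct,) = torch.autograd.grad(advantage * ratio,        theta, retain_graph=True)
(grad_wrong,)   = torch.autograd.grad(advantage * ratio * logp, theta, retain_graph=True)

# grad_wrong = (1 + log pi(a)) * grad_correct: the scaling flips sign when pi(a) < 1/e
print(torch.allclose(grad_wrong, (1 + logp.detach()) * grad_correct))   # True
```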

Asynchronous PPO: Decoupling the Importance Ratio

All methods above assume synchronous training: the policy that generates rollouts (\(\pi_{\text{old}}\)) is the same policy we clip against. In practice, large-scale RL systems are asynchronous — rollout workers continuously generate data while the trainer updates model weights. By the time a batch reaches the trainer, the generating policy \(\pi_{\text{behav}}\) may be several gradient steps behind the current policy \(\pi_\theta\).

This creates a problem: in standard PPO/GRPO, the importance ratio \(\pi_\theta / \pi_{\text{old}}\) serves two roles simultaneously:

  1. Off-policy correction: reweight samples to account for the distributional mismatch
  2. Trust region: clip the ratio to prevent the policy from changing too much

When \(\pi_{\text{old}} = \pi_{\text{behav}}\) is stale, these two roles conflict.

Vanilla Async PPO

The naive approach simply substitutes \(\pi_{\text{behav}}\) for \(\pi_{\text{old}}\). Writing \(\rho_t = \pi_\theta / \pi_{\text{behav}}\):

\[\boxed{\mathcal{J}_{\text{naive}}(\theta) = \mathbb{E}_{a_t \sim \pi_{\text{behav}}}\!\left[\min\!\left(\rho_t \, \hat{A}_t,\; \text{clip}(\rho_t,\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon) \, \hat{A}_t\right)\right]}\]

Let’s derive the gradient. Outside the clip range \([1\!-\!\varepsilon, 1\!+\!\varepsilon]\), the clipped branch is a constant \((1\!\pm\!\varepsilon)\hat{A}_t\), so the min picks whichever branch is smaller. When \(\rho_t > 1+\varepsilon\) and \(\hat{A}_t > 0\), the unclipped value \(\rho_t \hat{A}_t\) exceeds the clipped value \((1\!+\!\varepsilon)\hat{A}_t\), so the min selects the clipped (flat) branch — gradient zero. Symmetrically for \(\rho_t < 1-\varepsilon\) and \(\hat{A}_t < 0\). In all other cases the unclipped branch wins and the gradient is \(\nabla_\theta \rho_t \cdot \hat{A}_t\). Collecting these into an indicator:

\[\mathbb{1}_{\text{active}}(\rho_t, \hat{A}_t) = 1 - \mathbb{1}[\rho_t > 1\!+\!\varepsilon]\,\mathbb{1}[\hat{A}_t > 0] - \mathbb{1}[\rho_t < 1\!-\!\varepsilon]\,\mathbb{1}[\hat{A}_t < 0]\]

The per-token gradient is \(\nabla_\theta f = \frac{\nabla_\theta \pi_\theta}{\pi_{\text{behav}}} \hat{A}_t \cdot \mathbb{1}_{\text{active}}\). Substituting \(\frac{\nabla_\theta \pi_\theta}{\pi_{\text{behav}}} = \rho_t \nabla_\theta \log \pi_\theta\) and taking the expectation over \(a_t \sim \pi_{\text{behav}}\):

\[\nabla_\theta \mathcal{J} = \mathbb{E}_{a_t \sim \pi_{\text{behav}}}\!\left[\rho_t \, \nabla_\theta \log \pi_\theta \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}\right]\]

By the importance sampling identity \(\mathbb{E}_{\pi_{\text{behav}}}[\rho_t \, g(a)] = \mathbb{E}_{\pi_\theta}[g(a)]\):

\[\boxed{\nabla_\theta \mathcal{J} = \mathbb{E}_{a_t \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}(\rho_t, \hat{A}_t)\right]}\]

This is the standard policy gradient \(\nabla_\theta \log \pi_\theta \, \hat{A}_t\), masked by \(\mathbb{1}_{\text{active}}\): the clip mechanism simply silences tokens where \(\rho_t = \pi_\theta / \pi_{\text{behav}}\) has already overshot in the direction favored by the advantage. Note what determines whether a token is silenced: the ratio \(\pi_\theta / \pi_{\text{behav}}\). The trust region is centered at \(\pi_{\text{behav}}\) — gradients vanish once \(\pi_\theta\) moves more than \(\varepsilon\) away from \(\pi_{\text{behav}}\) (in the advantage-favored direction). In synchronous training \(\pi_{\text{behav}}\) is the current policy, so this is exactly right. But with stale data the picture changes completely.

When \(\pi_{\text{behav}}\) is stale (several gradient steps behind \(\pi_\theta\)), \(\rho_t\) is already far from 1 before any optimization begins. Each gradient step shifts \(\pi_\theta\)’s probability mass — tokens the policy has learned to favor get \(\rho \gg 1\), tokens it has learned to suppress get \(\rho \ll 1\). After \(\eta\) gradient steps, many tokens fall outside \([1-\varepsilon, 1+\varepsilon]\). The clipping then forces \(\pi_\theta\) to stay close to an old, low-quality policy rather than constraining the size of the current update. The trust region is centered in the wrong place.

The figures below show why. The clipped objective is flat outside \([1-\varepsilon, 1+\varepsilon]\). In synchronous training, \(\rho\) starts at 1 (green dot), safely inside the active region. In async training, \(\rho_0\) can land far outside this region (orange cross) — use the slider to explore both \(\rho_0 > 1\) and \(\rho_0 < 1\).

Worse than just zero gradient: the clipping creates an asymmetric force that actively pulls \(\pi_\theta\) back toward \(\pi_{\text{behav}}\). Consider \(\rho_0 > 1 + \varepsilon\): positive-advantage tokens contribute zero gradient (flat region), but negative-advantage tokens are in the unclipped branch and push \(\rho\) down. The net gradient only points toward the stale policy. At \(\rho_0 < 1 - \varepsilon\) the asymmetry flips — negative-advantage tokens are flat, positive-advantage tokens push \(\rho\) up — but the net force still points toward \(\pi_{\text{behav}}\).
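To see the silencing pattern concretely, the sketch below evaluates \(\mathbb{1}_{\text{active}}\) on a toy batch whose ratios have already drifted far from 1, as they would with stale rollouts; all numbers are illustrative.

```python
import torch

def active_mask(ratio, adv, eps=0.2):
    """1 where the clipped objective still passes gradient, 0 where the clip silences the token."""
    silenced = ((ratio > 1 + eps) & (adv > 0)) | ((ratio < 1 - eps) & (adv < 0))
    return (~silenced).float()

# Stale data: pi_theta has already drifted, so rho starts far from 1 before any update.
rho = torch.tensor([1.8, 1.8, 0.4, 0.4])
adv = torch.tensor([+1.0, -1.0, +1.0, -1.0])

print(active_mask(rho, adv))
# tensor([0., 1., 1., 0.]): favored tokens with rho >> 1 are silenced, while disfavored tokens
# with rho >> 1 keep their gradient, so the net force drags pi_theta back toward pi_behav.
```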

Decoupled Clipped Objective

The \(\pi_{\text{old}}\) in PPO’s objective quietly serves two independent purposes (Hilton, Cobbe & Schulman, 2022). This is easiest to see in the KL-penalized form:

\[\mathcal{J}^{\text{KLPEN}}(\theta) = \mathbb{E}\!\left[\frac{\pi_\theta}{\underbrace{\pi_{\text{old}}}_{\text{(i)}}} \hat{A}_t - \beta \, \text{KL}\!\left[\underbrace{\pi_{\text{old}}}_{\text{(ii)}} \,\|\, \pi_\theta\right]\right]\]

Use (i) is importance sampling — it corrects for the fact that actions were drawn from \(\pi_{\text{old}}\), so it must be the behavior policy \(\pi_{\text{behav}}\). Use (ii) is the trust-region anchor — it penalizes \(\pi_\theta\) for moving too far, but this anchor only needs to be some recent policy; call it \(\pi_{\text{prox}}\).

For the clipped objective, \(\pi_{\text{old}}\) appears only once, in the ratio \(r_t = \pi_\theta / \pi_{\text{old}}\), which hides the two roles. Pulling the \(1/\pi_{\text{old}}\) factor out front and absorbing the remaining \(\pi_{\text{old}}\) into the clip bounds exposes them:

\[\mathcal{J}^{\text{clip}}(\theta) = \mathbb{E}\!\left[\frac{1}{\underbrace{\pi_{\text{old}}}_{\text{(i)}}} \min\!\left(\pi_\theta \,\hat{A}_t,\;\; \text{clip}\!\left(\pi_\theta,\; (1\!-\!\varepsilon)\underbrace{\pi_{\text{old}}}_{\text{(ii)}},\; (1\!+\!\varepsilon)\underbrace{\pi_{\text{old}}}_{\text{(ii)}}\right) \hat{A}_t\right)\right]\]

Now the two uses are manifest: the \(1/\pi_{\text{old}}\) prefactor is the importance-sampling denominator (i), while the clip bounds \((1\pm\varepsilon)\pi_{\text{old}}\) define the trust region (ii). Replacing (i) with \(\pi_{\text{behav}}\) and (ii) with \(\pi_{\text{prox}}\):

\[\mathcal{J}_{\text{decoupled}}^{\text{clip}}(\theta) = \mathbb{E}\!\left[\frac{1}{\pi_{\text{behav}}} \min\!\left(\pi_\theta \,\hat{A}_t,\;\; \text{clip}\!\left(\pi_\theta,\; (1\!-\!\varepsilon)\pi_{\text{prox}},\; (1\!+\!\varepsilon)\pi_{\text{prox}}\right) \hat{A}_t\right)\right]\]

Dividing through by \(\pi_{\text{prox}}\) inside the min recovers the ratio form used by AReaL (IIIS Tsinghua, 2025):

\[\mathcal{J}_{\text{decoupled}}^{\text{clip}}(\theta) = \mathbb{E}\!\left[\sum_{t=1}^{H} \min\!\left(\underbrace{\frac{\pi_\theta}{\pi_{\text{behav}}}}_{\text{importance ratio}} \hat{A}_t,\;\; \overbrace{\frac{\pi_{\text{prox}}}{\pi_{\text{behav}}} \, \text{clip}\!\left(\underbrace{\frac{\pi_\theta}{\pi_{\text{prox}}}}_{\text{trust region}},\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon\right)}^{\text{importance ratio}} \hat{A}_t\right)\right]\]

Write \(r_t = \pi_\theta / \pi_{\text{prox}}\) and \(w_t = \pi_{\text{prox}} / \pi_{\text{behav}}\). Since \(w_t > 0\), it factors out of the \(\min\):

\[\boxed{\mathcal{J}_{\text{decoupled}}^{\text{clip}}(\theta) = \mathbb{E}_{a_t \sim \pi_{\text{behav}}}\!\left[\frac{\pi_{\text{prox}}}{\pi_{\text{behav}}} \min\!\left(r_t \, \hat{A}_t,\;\; \text{clip}(r_t,\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon) \, \hat{A}_t\right)\right]}\]

Now derive its gradient step by step.

Step 1. \(w_t = \pi_{\text{prox}} / \pi_{\text{behav}}\) is constant w.r.t. \(\theta\), so it passes through the gradient:

\[\nabla_\theta \mathcal{J} = \mathbb{E}_{a_t \sim \pi_{\text{behav}}}\!\left[w_t \cdot \nabla_\theta \min\!\left(r_t \, \hat{A}_t,\; \text{clip}(r_t) \, \hat{A}_t\right)\right]\]

Step 2. The \(\min(r_t \hat{A}_t,\, \text{clip}(r_t)\hat{A}_t)\) has exactly the same form as the naive PPO objective, but with \(r_t = \pi_\theta / \pi_{\text{prox}}\) in place of \(\rho_t = \pi_\theta / \pi_{\text{behav}}\). We already derived this gradient — \(\nabla_\theta r_t \cdot \hat{A}_t\) when active, zero when clipped:

\[\nabla_\theta \min\!\left(r_t \, \hat{A}_t,\; \text{clip}(r_t) \, \hat{A}_t\right) = \frac{\nabla_\theta \pi_\theta}{\pi_{\text{prox}}} \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}(r_t, \hat{A}_t)\]

where \(\mathbb{1}_{\text{active}}\) is the same mask as before, now evaluated at \(r_t\):

\[\mathbb{1}_{\text{active}}(r_t, \hat{A}_t) = 1 - \mathbb{1}[r_t > 1\!+\!\varepsilon]\,\mathbb{1}[\hat{A}_t > 0] - \mathbb{1}[r_t < 1\!-\!\varepsilon]\,\mathbb{1}[\hat{A}_t < 0]\]

Step 3. Multiply by \(w_t\). The \(\pi_{\text{prox}}\) cancels:

\[w_t \cdot \frac{\nabla_\theta \pi_\theta}{\pi_{\text{prox}}} = \frac{\pi_{\text{prox}}}{\pi_{\text{behav}}} \cdot \frac{\nabla_\theta \pi_\theta}{\pi_{\text{prox}}} = \frac{\nabla_\theta \pi_\theta}{\pi_{\text{behav}}} = \frac{\pi_\theta}{\pi_{\text{behav}}} \nabla_\theta \log \pi_\theta = \rho_t \, \nabla_\theta \log \pi_\theta\]

Step 4. Taking the expectation over \(a_t \sim \pi_{\text{behav}}\) and applying the importance sampling identity \(\mathbb{E}_{\pi_{\text{behav}}}[\rho_t \, g(a)] = \mathbb{E}_{\pi_\theta}[g(a)]\):

\[\boxed{\nabla_\theta \mathcal{J}_{\text{decoupled}} = \mathbb{E}_{a_t \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}\!\left(\frac{\pi_\theta}{\pi_{\text{prox}}},\, \hat{A}_t\right)\right]}\]

Comparing the two gradients side by side:

|  | Naive PPO | Decoupled |
| --- | --- | --- |
| Gradient | \(\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}(\rho_t, \hat{A}_t)\right]\) | \(\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}^{\text{dec}}(u_t, \hat{A}_t)\right]\) |
| Mask argument | \(\rho_t = \pi_\theta / \pi_{\text{behav}}\) | \(u_t = \pi_\theta / \pi_{\text{prox}}\) |
| Trust region center | \(\pi_{\text{behav}}\) | \(\pi_{\text{prox}}\) |

Both are the same policy gradient with the same functional form of binary mask. The only difference is which policy the mask measures distance from. Naive PPO silences tokens when \(\pi_\theta\) is far from \(\pi_{\text{behav}}\); decoupled PPO silences tokens when \(\pi_\theta\) is far from \(\pi_{\text{prox}}\).

This makes the failure mode transparent. In asynchronous training, \(\pi_{\text{behav}}\) lags \(\pi_\theta\) by \(\eta\) gradient steps. Naive PPO’s mask kills tokens based on drift that already happened before the current step — the trust region is centered at the wrong place. Decoupled PPO centers at \(\pi_{\text{prox}}\), which is always the most recent checkpoint, so all tokens start with \(u_t \approx 1\) and the mask only activates if the current update overshoots. Even with a single minibatch (\(\pi_\theta = \pi_{\text{prox}}\) before the update, \(u_t = 1\) exactly), naive PPO already silences many tokens while decoupled PPO silences none.

Reduction to PPO-clip. When \(\pi_{\text{prox}} = \pi_{\text{behav}}\), we have \(u_t = \rho_t\) and the two masks coincide. This holds at the first gradient step of synchronous training.
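In code, the decoupled objective only needs log-probabilities under the three policies. A minimal sketch written as a loss to minimize (illustrative names, not AReaL's actual implementation):

```python
import torch

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, advantages, eps=0.2):
    """Decoupled clip: trust region around pi_prox, importance correction w.r.t. pi_behav."""
    adv = advantages.detach()
    r = torch.exp(logp_new - logp_prox.detach())              # r_t = pi_theta / pi_prox
    w = torch.exp(logp_prox.detach() - logp_behav.detach())   # w_t = pi_prox / pi_behav (constant in theta)
    clipped_term = torch.minimum(r * adv, torch.clamp(r, 1 - eps, 1 + eps) * adv)
    return -(w * clipped_term).mean()
```

Passing `logp_prox = logp_behav` makes \(w_t = 1\) and \(r_t = \rho_t\), recovering the standard PPO-clip loss as in the reduction above.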

Why Decoupling Matters: Ablation Evidence

The AReaL paper ablates naive vs. decoupled PPO across staleness levels \(\eta\) (max age of rollout data in gradient steps). The 1.5B model results on AIME24 (pass@1, avg over 32):

| Staleness \(\eta\) | Naive PPO | Decoupled PPO | Sync oracle |
| --- | --- | --- | --- |
| 0 (sync) | 42.0 | 42.0 | |
| 1 | 41.8 | 42.1 | |
| 2 | 40.0 | 41.8 | |
| 4 | 23.3 | 42.2 | |
| 8 | 35.7 | 41.0 | |
| 16 | 35.8 | 38.7 | |

The collapse at \(\eta = 4\) is dramatic: naive PPO drops from 42.0 to 23.3 (a 45% relative decline), while decoupled PPO reaches 42.2, matching and even slightly exceeding the synchronous result of 42.0. The pattern is consistent across benchmarks: AMC23 drops from 84.4 to 58.5 under naive PPO at \(\eta = 4\), but stays at 85.1 with decoupling.

The practical payoff: asynchronous training with \(\eta \leq 4\) achieves up to 2.77x wall-clock speedup over synchronous baselines with no loss in final performance. Throughput nearly doubles just from \(\eta = 0 \to 1\) (27.1k to 47.8k tokens/s on 8 GPUs), because the trainer no longer waits for rollout workers.

KL for Reference Model

Approximating KL Divergence (Schulman, 2020)

We want to estimate the KL divergence from \(q\) to \(p\):

\[\mathrm{KL}[q, p] = \sum_x q(x) \log \frac{q(x)}{p(x)} = \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right].\]

Our options for computing KL depend on what kind of access we have to \(p\) and \(q\). Here we assume we can evaluate the probabilities (or probability densities) \(p(x)\) and \(q(x)\) for any given \(x\), but we cannot calculate the sum over \(x\) analytically. Why not?

  1. Computation/memory: the state space is too large to enumerate (e.g., all possible token sequences).
  2. No closed form: the distributions don’t belong to a family with a known KL formula.
  3. Code simplicity: we only store the log-prob \(\log \pi_\theta(a \vert s)\), not the full distribution. This is a reasonable design choice when KL is just used as a diagnostic, as is often the case in reinforcement learning (e.g., logging KL between the current policy and a reference policy during PPO training).

In all three cases, we turn to Monte Carlo estimation. Given samples \(x_1, x_2, \ldots \sim q\), how can we construct a good estimate?

A good estimator has two properties:

  • Unbiased: its expected value equals the true KL, i.e. \(\mathbb{E}[\hat{k}] = \mathrm{KL}[q,p]\).
  • Low variance: individual samples don’t fluctuate wildly around the mean.

We’ll define the probability ratio \(r = p(x)/q(x)\), so that \(\log r = \log p(x) - \log q(x)\). All three estimators below are functions of \(r\) (or equivalently, of \(\log r\)). This is convenient because in practice we often already have \(\log p(x)\) and \(\log q(x)\) computed — e.g., the log-probability of an action under two different policies.

The most straightforward unbiased estimator follows directly from the definition of KL:

\[k_1 = -\log r = \log \frac{q(x)}{p(x)}.\]

Since \(\mathbb{E}_{x \sim q}[k_1] = \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right] = \mathrm{KL}[q,p]\), this is exactly unbiased.

However, it has high variance. To see why, note that KL divergence is always non-negative (\(\mathrm{KL}[q,p] \geq 0\)), yet \(k_1\) takes negative values whenever \(r > 1\) (i.e., whenever \(p(x) > q(x)\)). For similar distributions, this happens for roughly half the samples. An estimator that’s negative half the time for a quantity that’s always positive is clearly noisy — we’re relying on cancellation between positive and negative samples to get the right mean.

Why is KL always non-negative? (Click to expand)

By Jensen's inequality applied to the convex function \(-\log\):

$$\mathrm{KL}[q,p] = \mathbb{E}_{x \sim q}\!\left[-\log \frac{p(x)}{q(x)}\right] \geq -\log \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\right] = -\log 1 = 0.$$

This is known as Gibbs' inequality. The same inequality \(\log x \leq x - 1\) that we will use below to construct \(k_3\) provides an alternative proof: \(\mathrm{KL}[q,p] = \mathbb{E}_q[-\log r] \geq \mathbb{E}_q[1 - r] = 1 - 1 = 0\).

The interactive figure below plots \(k_1\) alongside \(k_2\) and \(k_3\) (defined next) for comparison. Notice how \(k_1\) dips below zero for \(r > 1\) — this is where its high variance comes from.

An alternative with lower variance but slight bias:

\[k_2 = \frac{1}{2}(\log r)^2.\]

Intuitively, \(k_2\) seems better because:

  • It is always non-negative (it’s a square).
  • Each sample directly measures how far apart \(p\) and \(q\) are at point \(x\), regardless of which direction the ratio goes.

Empirically, \(k_2\) indeed has much lower variance than \(k_1\), and also has remarkably low bias. But why is the bias small? The answer comes from f-divergences.
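Before turning to that, a quick Monte Carlo experiment makes the claim concrete. The sketch below uses two nearby Gaussians (arbitrary illustrative parameters) and compares the sample mean and standard deviation of \(k_1\) and \(k_2\) against the closed-form KL.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Compare the k1 and k2 estimators of KL[q, p] on two nearby Gaussians.
q, p = Normal(0.0, 1.0), Normal(0.1, 1.0)
true_kl = kl_divergence(q, p)                 # closed form, for reference

x = q.sample((1_000_000,))
logr = p.log_prob(x) - q.log_prob(x)          # log r = log p(x) - log q(x)

k1 = -logr                                    # unbiased, high variance
k2 = 0.5 * logr ** 2                          # slightly biased, much lower variance

for name, k in [("k1", k1), ("k2", k2)]:
    print(name, "mean:", round(k.mean().item(), 5), "std:", round(k.std().item(), 5))
print("true KL:", round(true_kl.item(), 5))
```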

An f-divergence is a general family of divergences defined as:

\[D_f(p, q) = \mathbb{E}_{x \sim q}\!\left[f\!\left(\frac{p(x)}{q(x)}\right)\right] = \mathbb{E}_{x \sim q}[f(r)]\]

for a convex function \(f\) with \(f(1) = 0\). Many well-known divergences are special cases:

  • KL divergence \(\mathrm{KL}[q, p]\): \(f(r) = -\log r\)
  • Reverse KL \(\mathrm{KL}[p, q]\): \(f(r) = r \log r\)
  • Chi-squared divergence: \(f(r) = (r-1)^2\)

The expectation of \(k_2\) is \(\mathbb{E}_q\!\left[\frac{1}{2}(\log r)^2\right]\), which is also an f-divergence with \(f(r) = \frac{1}{2}(\log r)^2\).

Now here is the key non-obvious fact: all f-divergences with differentiable \(f\) look like KL divergence up to second order when \(q\) is close to \(p\). Specifically, for a parameterized distribution \(p_\theta\):

\[D_f(p_0, p_\theta) = \frac{f''(1)}{2}\,\theta^\top F\,\theta + O(\theta^3),\]

where \(F\) is the Fisher information matrix for \(p_\theta\) evaluated at \(p_\theta = p_0\).

Both \(k_2\)’s f-divergence (\(f(r) = \frac{1}{2}(\log r)^2\)) and KL (\(f(r) = -\log r\)) have \(f''(1) = 1\). So both look like the same quadratic distance function \(\frac{1}{2}\theta^\top F\,\theta\) when \(p \approx q\). The bias of \(k_2\) only comes from third-order and higher terms, which explains why it is negligible when \(p\) and \(q\) are close.
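The only number that matters locally is \(f''(1)\), and it can be checked symbolically in a few lines (using sympy here as an assumed tool):

```python
import sympy as sp

# f''(1) for the f-divergences above; KL and k2's divergence both give 1, chi-squared gives 2.
r = sp.symbols("r", positive=True)
for name, f in [("KL[q,p]:        f(r) = -log r", -sp.log(r)),
                ("reverse KL:     f(r) = r log r", r * sp.log(r)),
                ("k2 divergence:  f(r) = (log r)^2 / 2", sp.log(r) ** 2 / 2),
                ("chi-squared:    f(r) = (r-1)^2", (r - 1) ** 2)]:
    print(name, "->", sp.diff(f, r, 2).subs(r, 1))
```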

f-散度是一类通用的散度,定义为:

\[D_f(p, q) = \mathbb{E}_{x \sim q}\!\left[f\!\left(\frac{p(x)}{q(x)}\right)\right] = \mathbb{E}_{x \sim q}[f(r)]\]

其中 \(f\) 是满足 \(f(1) = 0\) 的凸函数。许多常见散度都是特例:

  • KL 散度 \(\mathrm{KL}[q, p]\):\(f(r) = -\log r\)
  • 反向 KL \(\mathrm{KL}[p, q]\):\(f(r) = r \log r\)
  • 卡方散度:\(f(r) = (r-1)^2\)

\(k_2\) 的期望为 \(\mathbb{E}_q\!\left[\frac{1}{2}(\log r)^2\right]\),也是一个 f-散度,对应 \(f(r) = \frac{1}{2}(\log r)^2\)。

关键的非显然事实是:所有具有可微 \(f\) 的 f-散度在 \(q\) 接近 \(p\) 时,二阶展开都与 KL 散度相同。具体来说,对于参数化分布 \(p_\theta\):

\[D_f(p_0, p_\theta) = \frac{f''(1)}{2}\,\theta^\top F\,\theta + O(\theta^3),\]

其中 \(F\) 是 \(p_\theta\) 在 \(p_\theta = p_0\) 处的 Fisher 信息矩阵。

\(k_2\) 的 f-散度(\(f(r) = \frac{1}{2}(\log r)^2\))和 KL(\(f(r) = -\log r\))都满足 \(f''(1) = 1\)。因此当 \(p \approx q\) 时,两者都近似于相同的二次距离函数 \(\frac{1}{2}\theta^\top F\,\theta\)。\(k_2\) 的偏差仅来自三阶及更高阶项,这解释了为什么在 \(p\) 和 \(q\) 接近时偏差可以忽略不计。
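To make the second-order agreement concrete, here is a small numerical check (a sketch of my own, not from the original post): for \(q = \mathcal{N}(0,1)\) and \(p_\theta = \mathcal{N}(\theta,1)\), the Fisher information of the mean is 1, so for small \(\theta\) both \(\mathrm{KL}[q, p_\theta]\) and \(k_2\)'s f-divergence \(\mathbb{E}_q[\frac{1}{2}(\log r)^2]\) should be close to \(\frac{1}{2}\theta^2\).

```python
import numpy as np

rng = np.random.default_rng(0)

def log_ratio(x, theta):
    # log p_theta(x) - log q(x) for q = N(0,1), p_theta = N(theta,1)
    return theta * x - 0.5 * theta**2

for theta in [0.05, 0.1, 0.3]:
    x = rng.normal(size=1_000_000)       # samples from q
    logr = log_ratio(x, theta)
    kl = np.mean(-logr)                  # E_q[-log r], exact value theta^2/2
    k2_div = np.mean(0.5 * logr**2)      # E_q[(1/2)(log r)^2]
    print(f"theta={theta:.2f}  KL={kl:.5f}  k2-divergence={k2_div:.5f}  (1/2)theta^2={0.5*theta**2:.5f}")
```

The gap between the two columns only becomes visible as \(\theta\) grows, which is exactly the third-order-and-higher bias discussed above.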

What is the Fisher information matrix, and why does it appear here? (Click to expand)

The Fisher information matrix \(F\) of a parametric family \(p_\theta\) is defined as:

$$F_{ij} = \mathbb{E}_{x \sim p_\theta}\!\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i}\,\frac{\partial \log p_\theta(x)}{\partial \theta_j}\right] = -\mathbb{E}_{x \sim p_\theta}\!\left[\frac{\partial^2 \log p_\theta(x)}{\partial \theta_i \,\partial \theta_j}\right].$$

Intuitively, \(F\) measures how sensitive the distribution is to small changes in \(\theta\). If changing \(\theta_i\) by a tiny amount causes the log-likelihood to fluctuate a lot (high Fisher information), then the distribution is very "curved" in that direction — a small step in parameter space creates a large change in distribution space.

The interactive figure below makes this concrete. Both panels apply the same perturbation δ to the mean of a Gaussian. On the left, σ is small (high Fisher info F = 1/σ²) — the distributions barely overlap. On the right, σ is large (low Fisher info) — the same δ changes almost nothing. Try dragging the sliders.

Why does \(F\) appear in the f-divergence expansion? Consider \(p_\theta\) near \(p_0\) (i.e., \(\theta\) small). The ratio is:

$$r(\theta) = \frac{p_\theta(x)}{p_0(x)} = \exp\!\big(\log p_\theta(x) - \log p_0(x)\big).$$

Taylor-expanding \(\log p_\theta(x)\) around \(\theta = 0\):

$$\log p_\theta(x) = \log p_0(x) + \theta^\top \nabla_\theta \log p_0(x) + \frac{1}{2}\theta^\top \nabla^2_\theta \log p_0(x)\,\theta + O(\theta^3),$$

so \(\log r \approx \theta^\top s(x) + \frac{1}{2}\theta^\top H(x)\,\theta\), where \(s(x) = \nabla_\theta \log p_0(x)\) is the score function and \(H(x) = \nabla^2_\theta \log p_0(x)\) is its Hessian. Two key facts about the score:

  • \(\mathbb{E}_{p_0}[s(x)] = 0\) (the score has zero mean), and
  • \(\mathbb{E}_{p_0}[s(x)\,s(x)^\top] = F\) (its covariance is the Fisher matrix).

Substituting into \(D_f = \mathbb{E}_{p_0}[f(r)]\) and expanding \(f\) around \(r = 1\): since \(f(1) = 0\) and \(f'(1)\) contributes terms proportional to \(\mathbb{E}[s(x)] = 0\), the leading term is:

$$D_f \approx \frac{f''(1)}{2}\,\mathbb{E}_{p_0}\!\big[(\theta^\top s(x))^2\big] = \frac{f''(1)}{2}\,\theta^\top F\,\theta.$$

This is why all f-divergences share the same local geometry: they all reduce to a quadratic form in \(\theta\) weighted by the Fisher matrix, differing only by the scalar \(f''(1)\). The Fisher matrix is the unique "metric tensor" on the space of distributions (up to scale) — this is the foundation of information geometry.

In RL, the Fisher matrix of the policy \(\pi_\theta\) is exactly what defines the natural policy gradient: the direction \(F^{-1}\nabla_\theta J\) that makes the steepest improvement per unit of KL divergence, rather than per unit of Euclidean distance in parameter space.

什么是 Fisher 信息矩阵,为什么它会出现在这里?(点击展开)

参数族 \(p_\theta\) 的Fisher 信息矩阵 \(F\) 定义为:

$$F_{ij} = \mathbb{E}_{x \sim p_\theta}\!\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i}\,\frac{\partial \log p_\theta(x)}{\partial \theta_j}\right] = -\mathbb{E}_{x \sim p_\theta}\!\left[\frac{\partial^2 \log p_\theta(x)}{\partial \theta_i \,\partial \theta_j}\right].$$

直觉上,\(F\) 度量的是分布对 \(\theta\) 的微小变化有多敏感。如果稍微改变 \(\theta_i\) 就能导致 log 似然大幅波动(高 Fisher 信息),那么分布在该方向上非常"弯曲"——参数空间中的一小步就会在分布空间中产生巨大变化。

下面的交互式图直观展示了这一点。两个面板对高斯分布的均值施加相同的扰动 δ。左侧 σ 小(高 Fisher 信息 F = 1/σ²)——两个分布几乎不重叠。右侧 σ 大(低 Fisher 信息)——同样的 δ 几乎没有改变分布。试试拖动滑块。

为什么 \(F\) 出现在 f-散度的展开中? 考虑 \(p_\theta\) 在 \(p_0\) 附近(即 \(\theta\) 很小)。概率比为:

$$r(\theta) = \frac{p_\theta(x)}{p_0(x)} = \exp\!\big(\log p_\theta(x) - \log p_0(x)\big).$$

将 \(\log p_\theta(x)\) 在 \(\theta = 0\) 处 Taylor 展开:

$$\log p_\theta(x) = \log p_0(x) + \theta^\top \nabla_\theta \log p_0(x) + \frac{1}{2}\theta^\top \nabla^2_\theta \log p_0(x)\,\theta + O(\theta^3),$$

于是 \(\log r \approx \theta^\top s(x) + \frac{1}{2}\theta^\top H(x)\,\theta\),其中 \(s(x) = \nabla_\theta \log p_0(x)\) 是得分函数(score function),\(H(x) = \nabla^2_\theta \log p_0(x)\) 是其 Hessian。关于得分函数的两个关键事实:

  • \(\mathbb{E}_{p_0}[s(x)] = 0\)(得分函数的均值为零),以及
  • \(\mathbb{E}_{p_0}[s(x)\,s(x)^\top] = F\)(其协方差就是 Fisher 矩阵)。

将其代入 \(D_f = \mathbb{E}_{p_0}[f(r)]\) 并将 \(f\) 在 \(r = 1\) 处展开:由于 \(f(1) = 0\) 且 \(f'(1)\) 贡献的项正比于 \(\mathbb{E}[s(x)] = 0\),主导项为:

$$D_f \approx \frac{f''(1)}{2}\,\mathbb{E}_{p_0}\!\big[(\theta^\top s(x))^2\big] = \frac{f''(1)}{2}\,\theta^\top F\,\theta.$$

这就是为什么所有 f-散度共享相同的局部几何:它们都归结为以 Fisher 矩阵加权的 \(\theta\) 的二次型,仅在标量 \(f''(1)\) 上不同。Fisher 矩阵是分布空间上唯一的"度量张量"(差一个尺度因子)——这是信息几何的基础。

在 RL 中,策略 \(\pi_\theta\) 的 Fisher 矩阵正是定义自然策略梯度的关键:方向 \(F^{-1}\nabla_\theta J\) 使得每单位 KL 散度(而非参数空间中的欧氏距离)的改进最大。
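As a quick sanity check of the two score-function facts (again a sketch of my own, using a one-dimensional Gaussian): for \(p_\mu = \mathcal{N}(\mu, \sigma^2)\), the score with respect to \(\mu\) is \(s(x) = (x - \mu)/\sigma^2\); its mean under \(p_\mu\) should be zero and its variance should equal the Fisher information \(1/\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.0, 0.5
x = rng.normal(mu, sigma, size=1_000_000)

score = (x - mu) / sigma**2                   # d/dmu log N(x; mu, sigma^2)
print("mean of score    ~", score.mean())    # close to 0
print("var of score     ~", score.var())     # close to 1/sigma^2
print("Fisher 1/sigma^2 =", 1 / sigma**2)    # = 4.0
```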

The interactive figure below plots \(k_2\) alongside \(k_1\) (dashed). Notice that \(k_2\) is always non-negative — a square can’t be negative. The zoomed inset near \(r = 1\) shows why the bias is small: \(k_2\) and KL agree to second order. Drag the \(\mu_p\) slider to see how the bias grows as the distributions diverge.

下方的交互式图将 \(k_2\) 与 \(k_1\)(虚线)一起绘制。注意 \(k_2\) 始终非负——平方不可能为负。\(r = 1\) 附近的放大插图展示了偏差为何很小:\(k_2\) 和 KL 在二阶展开上一致。拖动 \(\mu_p\) 滑块观察偏差如何随分布差异增大而增长。

Can we get an estimator that is both unbiased (like \(k_1\)) and always non-negative (like \(k_2\))?

The general technique for reducing variance of an unbiased estimator is a control variate: add something with zero expectation that is negatively correlated with the original estimator. The only interesting quantity guaranteed to have zero expectation under \(q\) is:

\[\mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)} - 1\right] = \mathbb{E}_{x \sim q}[r - 1] = \sum_x q(x) \cdot \frac{p(x)}{q(x)} - 1 = 1 - 1 = 0.\]

So for any \(\lambda\), the expression

\[-\log r + \lambda(r - 1)\]

is an unbiased estimator of \(\mathrm{KL}[q,p]\). We could minimize the variance over \(\lambda\), but this yields an expression that depends on \(p\) and \(q\) and is hard to compute analytically.

Instead, we can choose a good \(\lambda\) using a simpler and more elegant argument. Since \(\log\) is concave, we have the fundamental inequality:

\[\log x \leq x - 1 \quad \text{for all } x > 0,\]

with equality only at \(x = 1\). Setting \(\lambda = 1\), the estimator becomes:

\[k_3 = (r - 1) - \log r = \underbrace{-\log r}_{k_1} + \underbrace{(r - 1)}_{\text{control variate}}.\]

By the inequality above, \((r-1) - \log r \geq 0\) for all \(r > 0\), with equality only when \(r = 1\) (i.e., \(p(x) = q(x)\)). So \(k_3\) is:

  • Unbiased (since \(\mathbb{E}[r-1] = 0\), we’re just adding zero in expectation to \(k_1\)).
  • Always non-negative (by the concavity of \(\log\)).
  • Low variance (the control variate cancels much of \(k_1\)’s noise).

能否得到一个既无偏(像 \(k_1\))又始终非负(像 \(k_2\))的估计量?

降低无偏估计量方差的通用技术是控制变量:加上一个期望为零但与原始估计量负相关的项。在 \(q\) 下唯一保证期望为零的有趣量是:

\[\mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)} - 1\right] = \mathbb{E}_{x \sim q}[r - 1] = \sum_x q(x) \cdot \frac{p(x)}{q(x)} - 1 = 1 - 1 = 0.\]

因此对任意 \(\lambda\),

\[-\log r + \lambda(r - 1)\]

都是 \(\mathrm{KL}[q,p]\) 的无偏估计量。我们可以对 \(\lambda\) 最小化方差,但得到的表达式依赖于 \(p\) 和 \(q\),难以解析计算。

我们可以用一个更简洁优雅的论证来选择一个好的 \(\lambda\)。由于 \(\log\) 是凹函数,我们有基本不等式:

\[\log x \leq x - 1 \quad \forall\; x > 0,\]

等号仅在 \(x = 1\) 时成立。令 \(\lambda = 1\),估计量变为:

\[k_3 = (r - 1) - \log r = \underbrace{-\log r}_{k_1} + \underbrace{(r - 1)}_{\text{control variate}}.\]

由上述不等式,\((r-1) - \log r \geq 0\) 对所有 \(r > 0\) 成立,等号仅当 \(r = 1\)(即 \(p(x) = q(x)\))时成立。所以 \(k_3\):

  • 无偏(因为 \(\mathbb{E}[r-1] = 0\),我们只是在期望上给 \(k_1\) 加了零)。
  • 始终非负(由 \(\log\) 的凹性保证)。
  • 低方差(控制变量消除了 \(k_1\) 的大部分噪声)。

There is a beautiful geometric way to see why \(k_3\) is non-negative. Consider the convex function \(\phi(r) = -\log r\). Its tangent line at \(r = 1\) is \(\ell(r) = -(r - 1)\). Then:

\[k_3 = (r-1) - \log r = \phi(r) - \ell(r) = (-\log r) - (-(r-1)).\]

This is the vertical gap between the convex function and its tangent line. Since convex functions always lie above their tangent lines, this gap is always non-negative.

This construction — measuring distance as the gap between a convex function and its tangent plane — is called a Bregman divergence. It appears throughout optimization, information theory, and machine learning, and has many beautiful properties (e.g., the “three-point identity” that generalizes the Pythagorean theorem).

You can see this geometry in the interactive figure below. Drag the slider to see how the gap grows as \(r\) moves away from 1.

有一种优美的几何方式来理解 \(k_3\) 为什么非负。考虑凸函数 \(\phi(r) = -\log r\),其在 \(r = 1\) 处的切线为 \(\ell(r) = -(r - 1)\)。则:

\[k_3 = (r-1) - \log r = \phi(r) - \ell(r) = (-\log r) - (-(r-1)).\]

这是凸函数与其切线之间的垂直距离。由于凸函数总是在其切线之上,这个距离始终非负。

这种构造——用凸函数与其切平面之间的间距来度量距离——称为 Bregman 散度。它出现在优化、信息论和机器学习的各个角落,有许多优美的性质(例如推广了勾股定理的”三点恒等式”)。

你可以在下方的交互式图中看到这一几何关系。拖动滑块观察当 \(r\) 远离 1 时间距如何增长。

To see how these estimators compare in practice, consider Gaussian experiments from Schulman’s post. Let \(q = \mathcal{N}(0, 1)\) and \(p = \mathcal{N}(\mu, 1)\), so the true KL is \(\mu^2/2\). Try the two preset experiments (\(\mu = 0.1\) and \(\mu = 1.0\)), or drag \(\mu\) to any value to see how bias and variance change:

为了看看这些估计量在实践中如何比较,考虑 Schulman 博文中的高斯实验。令 \(q = \mathcal{N}(0, 1)\),\(p = \mathcal{N}(\mu, 1)\),真实 KL 为 \(\mu^2/2\)。试试两个预设实验(\(\mu = 0.1\) 和 \(\mu = 1.0\)),或拖动 \(\mu\) 到任意值观察偏差和方差的变化:

Key observations as you drag \(\mu\):

  • Small \(\mu\) (≈ 0.1): \(k_1\)’s std is ~20× the true KL — you’d need hundreds of samples for a reliable sign. \(k_2\) and \(k_3\) are nearly identical (\(k_2\)’s bias ≈ 0.2%).
  • Large \(\mu\) (≈ 1.0): \(k_2\)’s bias grows to ~25% — no longer negligible. \(k_3\) stays unbiased with low variance. \(k_3\) is strictly better.

拖动 \(\mu\) 时的关键观察:

  • 小 \(\mu\)(≈ 0.1):\(k_1\) 的标准差是真实 KL 的 ~20 倍——需要数百个样本才能可靠判断正负号。\(k_2\) 和 \(k_3\) 几乎相同(\(k_2\) 偏差 ≈ 0.2%)。
  • 大 \(\mu\)(≈ 1.0):\(k_2\) 的偏差增长到 ~25%——不再可忽略。\(k_3\) 保持无偏且低方差。\(k_3\) 是严格更优的估计量。
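A short script in the spirit of those experiments (a sketch; sample size and seed are arbitrary): draw \(x \sim q = \mathcal{N}(0,1)\), form \(r = p(x)/q(x)\) with \(p = \mathcal{N}(\mu, 1)\), and compare the mean, bias, and standard deviation of \(k_1\), \(k_2\), \(k_3\) against the true \(\mathrm{KL} = \mu^2/2\).

```python
import numpy as np

rng = np.random.default_rng(0)

def compare(mu, n=2_000_000):
    true_kl = 0.5 * mu**2
    x = rng.normal(size=n)                        # x ~ q = N(0,1)
    logr = mu * x - 0.5 * mu**2                   # log p(x) - log q(x) for p = N(mu,1)
    r = np.exp(logr)
    estimators = {"k1": -logr, "k2": 0.5 * logr**2, "k3": (r - 1) - logr}
    print(f"mu={mu}: true KL = {true_kl:.4f}")
    for name, k in estimators.items():
        print(f"  {name}: mean={k.mean():.4f}  bias={k.mean() - true_kl:+.4f}  std={k.std():.4f}")

compare(0.1)   # k1's std is roughly 20x the true KL; k2 and k3 nearly identical
compare(1.0)   # k2's bias grows to about 25%; k3 stays unbiased with low std
```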

For samples \(x \sim q\) and ratio \(r = p(x)/q(x)\), the three estimators are:

| Estimator | Unbiased? | Always ≥ 0? | Variance |
|---|---|---|---|
| \(k_1 = -\log r\) | Yes | No | High |
| \(k_2 = \frac{1}{2}(\log r)^2\) | No (low bias when \(p \approx q\)) | Yes | Low |
| \(k_3 = (r-1) - \log r\) | Yes | Yes | Low |

\(k_3\) is the clear winner: unbiased, always non-negative, and low variance. It achieves this by adding the control variate \((r-1)\) to the naive estimator \(k_1\), and its non-negativity follows from the concavity of \(\log\) (equivalently, the Bregman divergence interpretation).

对于样本 \(x \sim q\) 和概率比 \(r = p(x)/q(x)\),三个估计量为:

| 估计量 | 无偏? | 始终 ≥ 0? | 方差 |
|---|---|---|---|
| \(k_1 = -\log r\) | 是 | 否 | 高 |
| \(k_2 = \frac{1}{2}(\log r)^2\) | 否(\(p \approx q\) 时偏差低) | 是 | 低 |
| \(k_3 = (r-1) - \log r\) | 是 | 是 | 低 |

\(k_3\) 是明显的赢家:无偏、始终非负、低方差。它通过将控制变量 \((r-1)\) 加到朴素估计量 \(k_1\) 上来实现这一点,其非负性来自 \(\log\) 的凹性(等价地,来自 Bregman 散度的解释)。

The Bregman divergence trick generalizes elegantly. For any f-divergence \(D_f(p,q) = \mathbb{E}_{x \sim q}[f(r)]\) with convex \(f\), the estimator

\[f(r) - f'(1)(r - 1)\]

is:

  • Unbiased: because \(\mathbb{E}_q[f'(1)(r-1)] = f'(1) \cdot 0 = 0\).
  • Always non-negative: because \(f\) is convex, it lies above its tangent at \(r = 1\), so \(f(r) \geq f(1) + f'(1)(r-1) = f'(1)(r-1)\) (using \(f(1) = 0\)).

This is the Bregman divergence of \(f\) at point \(r\) relative to \(r = 1\).

Bregman 散度技巧可以优雅地推广。对于任意 f-散度 \(D_f(p,q) = \mathbb{E}_{x \sim q}[f(r)]\)(\(f\) 为凸函数),估计量

\[f(r) - f'(1)(r - 1)\]

满足:

  • 无偏:因为 \(\mathbb{E}_q[f'(1)(r-1)] = f'(1) \cdot 0 = 0\)。
  • 始终非负:因为 \(f\) 是凸的,它在 \(r = 1\) 处的切线之上,所以 \(f(r) \geq f(1) + f'(1)(r-1) = f'(1)(r-1)\)(利用 \(f(1) = 0\))。

这就是 \(f\) 在点 \(r\) 相对于 \(r = 1\) 的 Bregman 散度。

The most notable application is to \(\mathrm{KL}[p, q]\) (note \(p\) and \(q\) are swapped). This corresponds to \(f(r) = r \log r\), which has \(f'(1) = 1\). The Bregman-based estimator becomes:

\[r\log r - (r - 1).\]

最重要的应用是 \(\mathrm{KL}[p, q]\)(注意 \(p\) 和 \(q\) 交换了)。对应 \(f(r) = r \log r\),\(f'(1) = 1\)。基于 Bregman 的估计量为:

\[r\log r - (r - 1).\]

Final summary: for samples \(x \sim q\) with \(r = p(x)/q(x)\), the recommended estimators are:

| Divergence | Estimator | Properties |
|---|---|---|
| \(\mathrm{KL}[q, p]\) | \((r - 1) - \log r\) | Unbiased, non-negative, low variance |
| \(\mathrm{KL}[p, q]\) | \(r\log r - (r - 1)\) | Unbiased, non-negative, low variance |

Both are special cases of the general Bregman divergence estimator \(f(r) - f'(1)(r-1)\) for their respective f-divergence generators. In practice, you can drop these into any codebase that computes log-probs — no need to store or compute full distributions.

最终总结:对于样本 \(x \sim q\),\(r = p(x)/q(x)\),推荐的估计量为:

| 散度 | 估计量 | 性质 |
|---|---|---|
| \(\mathrm{KL}[q, p]\) | \((r - 1) - \log r\) | 无偏、非负、低方差 |
| \(\mathrm{KL}[p, q]\) | \(r\log r - (r - 1)\) | 无偏、非负、低方差 |

两者都是通用 Bregman 散度估计量 \(f(r) - f'(1)(r-1)\) 对各自 f-散度生成函数的特例。在实践中,你可以直接将它们加入任何计算 log 概率的代码库——无需存储或计算完整分布。
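In code, both recommended estimators need nothing beyond per-sample log-probabilities. A minimal sketch (the helper names are mine, not from the post): given arrays `logp` and `logq` evaluated at samples drawn from \(q\), average the pointwise Bregman estimates.

```python
import numpy as np

def kl_q_p(logp, logq):
    """Estimate KL[q, p] from samples x ~ q via k3 = (r - 1) - log r."""
    logr = logp - logq
    r = np.exp(logr)
    return np.mean((r - 1.0) - logr)

def kl_p_q(logp, logq):
    """Estimate KL[p, q] from samples x ~ q via r*log r - (r - 1)."""
    logr = logp - logq
    r = np.exp(logr)
    return np.mean(r * logr - (r - 1.0))

# Quick check with q = N(0,1) and p = N(0.5,1): both true KLs equal 0.125.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
logq = -0.5 * x**2                  # log-densities up to the same additive constant
logp = -0.5 * (x - 0.5)**2
print(kl_q_p(logp, logq), kl_p_q(logp, logq))
```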

From Estimation to Optimization (Liu et al., 2025)

Before analyzing how \(k_1, k_2, k_3\) behave as losses, we need to untangle a common source of confusion. PPO-based RLHF involves two different KL divergences that serve entirely different purposes and point in different directions. To see both clearly, start from the TRPO-RLHF formulation with the KL constraint written explicitly:

\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}\right] - \beta \cdot \underbrace{D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})}_{\text{reference KL (reverse)}} \quad \text{s.t.}\quad \underbrace{D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)}_{\text{trust-region KL (forward)}} \leq \delta\]

PPO approximates the constraint by replacing it with clipping, yielding the familiar PPO-RLHF objective (as in InstructGPT):

\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A},\; \mathrm{clip}\!\big(\tfrac{\pi_\theta}{\pi_{\mathrm{old}}}, 1\!-\!\epsilon, 1\!+\!\epsilon\big)\hat{A}\Big)\right] - \beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\]

The two KLs are:

1. Trust-region KL (forward): \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta) \leq \delta\)

This is the constraint inherited from TRPO: don’t let the new policy deviate too far from the old policy within a single update step. Since data is sampled from \(\pi_{\mathrm{old}}\), the constraint is measured under \(\pi_{\mathrm{old}}\) — a forward KL (old policy first). It strongly penalizes the case where \(\pi_{\mathrm{old}}\) puts high probability on an action but \(\pi_\theta\) compresses it — exactly the dangerous regime where importance ratios explode and the surrogate approximation breaks. PPO replaces this explicit constraint with clipping.

2. Reference KL (reverse): \(\beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\)

This is the RLHF-specific regularizer that prevents the policy from drifting too far from the pretrained base model across all of training. Expanding it:

\[D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)}\right]\]

Since we care about what the current policy generates, the expectation is naturally under \(\pi_\theta\) — a reverse KL (new policy first). It penalizes the policy for generating outputs that the reference model would find unlikely.

| | Trust-Region KL | Reference KL |
|---|---|---|
| Purpose | Optimization stability | Regularization to base model |
| Direction | \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)\) (forward) | \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\) (reverse) |
| Sampling distribution | \(\pi_{\mathrm{old}}\) (data you already have) | \(\pi_\theta\) (outputs you will generate) |
| Constrains | Per-step update size | Total drift from reference |
| In the formula | Implicit (clipping) | Explicit (\(\beta\) penalty) |

The rest of this section focuses exclusively on the reference KL \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\) — specifically, how the choice of \(k_1, k_2, k_3\) and the choice of “in reward” vs. “as loss” affect its gradient.

Wait — the surrogate uses \(\mathbb{E}_{\pi_{\mathrm{old}}}\) but the reference KL uses \(\mathbb{E}_{\pi_\theta}\). How can they coexist in one loss? (Click to expand)

They can't, as written. The formula above is conceptually clean but notationally sloppy — it mixes two different expectations. In practice, InstructGPT resolves this by folding the KL into the reward. The per-token reward becomes:

$$\tilde{r}_t = R(y) \cdot \mathbf{1}_{t=T} - \beta \cdot \big(\log \pi_{\mathrm{old}}(a_t \vert s_t) - \log \pi_{\mathrm{ref}}(a_t \vert s_t)\big)$$

Note the key move: the \(\log \pi_\theta\) in the KL is replaced by \(\log \pi_{\mathrm{old}}\) — the policy that actually generated the rollout. The KL penalty is computed at rollout time and treated as part of the reward, detached from the gradient. The advantage \(\hat{A}\) is then estimated from this modified reward using GAE, and the entire PPO loss has a single expectation under \(\pi_{\mathrm{old}}\):

$$\mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}_{\tilde{r}},\; \mathrm{clip}(\cdots)\hat{A}_{\tilde{r}}\Big)\right]$$

This is precisely the "\(k_1\) in reward" approach that the rest of this section will analyze. The reference KL never appears as a separate loss term with its own expectation — it is absorbed into the advantage.

在分析 \(k_1, k_2, k_3\) 作为损失的行为之前,需要厘清一个常见的混淆。基于 PPO 的 RLHF 涉及两个不同的 KL 散度,它们服务于完全不同的目的,且方向相反。为了同时看清两者,从 KL 约束显式写出的 TRPO-RLHF 公式出发:

\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}\right] - \beta \cdot \underbrace{D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})}_{\text{reference KL (reverse)}} \quad \text{s.t.}\quad \underbrace{D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)}_{\text{trust-region KL (forward)}} \leq \delta\]

PPO 用 clipping 近似该约束,得到熟悉的 PPO-RLHF 目标(如 InstructGPT):

\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A},\; \mathrm{clip}\!\big(\tfrac{\pi_\theta}{\pi_{\mathrm{old}}}, 1\!-\!\epsilon, 1\!+\!\epsilon\big)\hat{A}\Big)\right] - \beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\]

两个 KL 分别是:

1. Trust-region KL(前向):\(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta) \leq \delta\)

这是继承自 TRPO 的约束:在单次更新中,不要让新策略偏离旧策略太远。由于数据来自 \(\pi_{\mathrm{old}}\) 的采样,约束在 \(\pi_{\mathrm{old}}\) 下度量——前向 KL(旧策略在前)。它强烈惩罚这种情况:\(\pi_{\mathrm{old}}\) 对某个动作给出高概率,但 \(\pi_\theta\) 却将其压缩得很低——这正是 importance ratio 爆炸、代理近似崩溃的危险区域。PPO 用 clipping 替代了显式约束。

2. Reference KL(反向):\(\beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\)

这是 RLHF 特有的正则化器,防止策略在整个训练过程中偏离预训练基座模型过远。展开:

\[D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)}\right]\]

由于我们关心当前策略生成什么,期望自然在 \(\pi_\theta\) 下取——反向 KL(新策略在前)。它惩罚策略生成参考模型认为不太可能的输出。

| | Trust-Region KL | Reference KL |
|---|---|---|
| 目的 | 优化稳定性 | 正则化到基座模型 |
| 方向 | \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)\)(前向) | \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\)(反向) |
| 采样分布 | \(\pi_{\mathrm{old}}\)(已有的数据) | \(\pi_\theta\)(将要生成的输出) |
| 约束对象 | 单步更新幅度 | 相对参考模型的总漂移 |
| 在公式中 | 隐式(clipping) | 显式(\(\beta\) 惩罚) |

本节其余部分专注于 reference KL \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\)——具体来说,\(k_1, k_2, k_3\) 的选择以及”放入奖励”vs.”作为损失”的选择如何影响其梯度。

等等——surrogate 用的是 \(\mathbb{E}_{\pi_{\mathrm{old}}}\),但 reference KL 用的是 \(\mathbb{E}_{\pi_\theta}\)。它们怎么能共存于一个 loss 中?(点击展开)

确实不能,如上面所写的那样。上面的公式概念上清晰,但记号上是不严谨的——它混合了两个不同的期望。实践中,InstructGPT 通过将 KL 吸收进 reward 来解决这个问题。Per-token reward 变为:

$$\tilde{r}_t = R(y) \cdot \mathbf{1}_{t=T} - \beta \cdot \big(\log \pi_{\mathrm{old}}(a_t \vert s_t) - \log \pi_{\mathrm{ref}}(a_t \vert s_t)\big)$$

注意关键的一步:KL 中的 \(\log \pi_\theta\) 被替换为 \(\log \pi_{\mathrm{old}}\)——即实际生成 rollout 的策略。KL 惩罚在 rollout 时计算,作为 reward 的一部分,不参与梯度计算。然后用 GAE 从这个修改后的 reward 估计 advantage \(\hat{A}\),整个 PPO loss 只有一个在 \(\pi_{\mathrm{old}}\) 下的期望:

$$\mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}_{\tilde{r}},\; \mathrm{clip}(\cdots)\hat{A}_{\tilde{r}}\Big)\right]$$

这正是本节接下来将分析的"\(k_1\) 放入奖励"方法。Reference KL 从不作为单独的、有自己期望的 loss 项出现——它被吸收进了 advantage。
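A sketch of that folding step (variable names are illustrative, not from InstructGPT's code): `logprobs_old` and `logprobs_ref` are per-token log-probabilities of the sampled response under \(\pi_{\mathrm{old}}\) and \(\pi_{\mathrm{ref}}\), and `final_reward` is the scalar \(R(y)\) from the reward model.

```python
import numpy as np

def fold_kl_into_reward(final_reward, logprobs_old, logprobs_ref, beta):
    """Per-token rewards r~_t = R(y)*1[t=T] - beta*(log pi_old - log pi_ref), all detached."""
    kl_per_token = logprobs_old - logprobs_ref   # k1-style per-token KL samples
    rewards = -beta * kl_per_token
    rewards[-1] += final_reward                  # sequence-level reward lands on the last token
    return rewards

# Toy 4-token response
print(fold_kl_into_reward(
    final_reward=1.3,
    logprobs_old=np.array([-1.2, -0.8, -2.0, -0.5]),
    logprobs_ref=np.array([-1.0, -0.9, -1.5, -0.6]),
    beta=0.1,
))
```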

We now focus on the reference KL. Let \(k_n\) denote any of the three estimators introduced above (\(k_1 = -\log \delta\), \(k_2 = \frac{1}{2}(\log \delta)^2\), \(k_3 = (\delta - 1) - \log \delta\)), where \(\delta = \pi_{\mathrm{ref}}(y \vert x) / \pi_\theta(y \vert x)\) is the reference-to-current probability ratio. There are two fundamentally different ways to plug \(k_n\) into the loss:

“\(k_n\) in reward” (combined form): treat \(k_n\) as a detached scalar in the REINFORCE objective; it modulates the policy gradient like a reward signal, but is not differentiated:

\[\mathcal{L} = -\mathbb{E}_{y \sim \pi_\theta}\!\Big[\big(R(y) - \beta \cdot k_n\big) \cdot \log \pi_\theta(y \vert x)\Big].\]

“\(k_n\) as loss” (decoupled form): add \(k_n\) as a separate differentiable loss; the gradient flows through \(k_n\) itself via the chain rule:

\[\mathcal{L} = -\mathbb{E}\!\big[R(y) \cdot \log \pi_\theta(y \vert x)\big] + \beta \cdot \mathbb{E}\!\big[k_n(\pi_\theta, \pi_{\mathrm{ref}})\big].\]

These produce different gradients, even though they use the same formula.

现在聚焦 reference KL。令 \(k_n\) 为上文三个估计量之一(\(k_1 = -\log \delta\),\(k_2 = \frac{1}{2}(\log \delta)^2\),\(k_3 = (\delta - 1) - \log \delta\)),其中 \(\delta = \pi_{\mathrm{ref}}(y \vert x) / \pi_\theta(y \vert x)\) 是参考策略与当前策略的概率比。将 \(k_n\) 嵌入损失有两种根本不同的方式:

”\(k_n\) 放入奖励”(合并形式):将 \(k_n\) 作为 REINFORCE 目标中不参与梯度计算的标量——它像奖励信号一样调节策略梯度,但不被微分

\[\mathcal{L} = -\mathbb{E}_{y \sim \pi_\theta}\!\Big[\big(R(y) - \beta \cdot k_n\big) \cdot \log \pi_\theta(y \vert x)\Big].\]

”\(k_n\) 作为损失”(解耦形式):将 \(k_n\) 作为单独的可微损失添加——梯度通过链式法则流经 \(k_n\) 本身

\[\mathcal{L} = -\mathbb{E}\!\big[R(y) \cdot \log \pi_\theta(y \vert x)\big] + \beta \cdot \mathbb{E}\!\big[k_n(\pi_\theta, \pi_{\mathrm{ref}})\big].\]

尽管使用相同的公式,这两种方式产生不同的梯度
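A minimal PyTorch-style sketch of the two forms for a single sampled response (hypothetical variable names; `logp_theta` requires grad, `logp_ref` does not). The combined form detaches \(k_n\) and uses it only as a coefficient on \(\log \pi_\theta\); the decoupled form differentiates \(k_n\) itself.

```python
import torch

def combined_form_loss(reward, logp_theta, logp_ref, beta):
    """'k_n in reward': k1 enters as a detached coefficient on log pi_theta."""
    k1 = (logp_theta - logp_ref).detach()     # k1 = -log(delta), delta = pi_ref / pi_theta
    return -(reward - beta * k1) * logp_theta

def decoupled_form_loss(reward, logp_theta, logp_ref, beta):
    """'k_n as loss': k2 is added as a separate, differentiable term."""
    k2 = 0.5 * (logp_ref - logp_theta) ** 2   # the gradient flows through k2 via the chain rule
    return -reward * logp_theta + beta * k2

logp_theta = torch.tensor(-1.2, requires_grad=True)   # log pi_theta(y|x) for one sampled response
logp_ref = torch.tensor(-1.5)                          # log pi_ref(y|x), fixed
print(combined_form_loss(2.0, logp_theta, logp_ref, 0.1),
      decoupled_form_loss(2.0, logp_theta, logp_ref, 0.1))
```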

The target is the reverse KL \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\), whose true gradient (under on-policy sampling) is:

\[\nabla_\theta \mathcal{J}_{\mathrm{RKL}} = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)} \cdot \nabla_\theta \log \pi_\theta(y \vert x)\right].\]

The coefficient \(-\log \delta = \log(\pi_\theta / \pi_{\mathrm{ref}})\) multiplying \(\nabla_\theta \log \pi_\theta\) is the “signal” that pushes the policy back toward the reference. Liu et al. show which implementations recover this gradient:

\(k_1\) in reward produces the correct gradient. Since \(k_1 = -\log \delta\), placing it in the REINFORCE coefficient directly yields \(-\log \delta \cdot \nabla_\theta \log \pi_\theta\) — exactly the RKL gradient. ✓

\(k_2\) as loss is gradient-equivalent to \(k_1\) in reward. Since \(k_2 = \frac{1}{2}(\log \delta)^2\), differentiating directly gives \(\nabla_\theta k_2 = \log \delta \cdot \nabla_\theta \log \delta = -\log \delta \cdot \nabla_\theta \log \pi_\theta\) — the same RKL gradient. ✓

This is the paper’s key equivalence result (Theorem 5.1): “\(k_1\) in reward” \(=\) “\(k_2\) as loss” in terms of gradient.

乘以 \(\nabla_\theta \log \pi_\theta\) 的系数 \(-\log \delta = \log(\pi_\theta / \pi_{\mathrm{ref}})\) 是将策略推回参考策略的”信号”。Liu et al. 证明了哪些实现能恢复这个梯度:

\(k_1\) 放入奖励 产生正确的梯度。由于 \(k_1 = -\log \delta\),将其放入 REINFORCE 系数直接得到 \(-\log \delta \cdot \nabla_\theta \log \pi_\theta\)——恰好是 RKL 梯度。✓

\(k_2\) 作为损失 与 \(k_1\) 放入奖励梯度等价。由于 \(k_2 = \frac{1}{2}(\log \delta)^2\),直接微分得 \(\nabla_\theta k_2 = \log \delta \cdot \nabla_\theta \log \delta = -\log \delta \cdot \nabla_\theta \log \pi_\theta\)——相同的 RKL 梯度。✓

这是论文的关键等价性结果(定理 5.1):”\(k_1\) 放入奖励” \(=\) “\(k_2\) 作为损失”(就梯度而言)。
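The equivalence is easy to check with autograd on a toy categorical policy (a sketch of my own; the reward term is omitted so only the KL part of the gradient is compared, and both forms reuse the same on-policy sample batch):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)              # current policy parameters
ref_logits = torch.randn(5)                               # fixed reference policy

with torch.no_grad():
    a = torch.multinomial(torch.softmax(logits, dim=0), 50_000, replacement=True)  # y ~ pi_theta

def kl_grad(form):
    logits.grad = None
    logp = torch.log_softmax(logits, dim=0)
    logp_ref = torch.log_softmax(ref_logits, dim=0)
    log_delta = logp_ref[a] - logp[a]                     # log(pi_ref / pi_theta) at sampled actions
    if form == "k1_in_reward":
        loss = ((-log_delta).detach() * logp[a]).mean()   # detached k1 as REINFORCE coefficient
    else:
        loss = (0.5 * log_delta ** 2).mean()              # differentiable k2
    loss.backward()
    return logits.grad.clone()

print(kl_grad("k1_in_reward"))
print(kl_grad("k2_as_loss"))   # identical on the same sample batch (up to floating point)
```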

What if we use \(k_1\) as a direct loss instead of in the reward? Since \(k_1 = -\log \delta = \log \pi_\theta - \log \pi_{\mathrm{ref}}\), differentiating gives:

\[\nabla_\theta k_1 = \nabla_\theta \log \pi_\theta(y \vert x).\]

The reference policy \(\pi_{\mathrm{ref}}\) has completely disappeared from the gradient — it carries no regularization signal at all. Worse, by the score function identity \(\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta] = 0\), this gradient has zero expectation. It contributes nothing but noise.

This is a stark example of how a perfect estimator (\(k_1\) is exactly unbiased for KL) can be a terrible loss function.

如果我们将 \(k_1\) 作为直接损失而不是放入奖励呢?由于 \(k_1 = -\log \delta = \log \pi_\theta - \log \pi_{\mathrm{ref}}\),微分得:

\[\nabla_\theta k_1 = \nabla_\theta \log \pi_\theta(y \vert x).\]

参考策略 \(\pi_{\mathrm{ref}}\) 已从梯度中完全消失——它不携带任何正则化信号。更糟糕的是,根据得分函数恒等式 \(\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta] = 0\),这个梯度的期望为零。它只贡献噪声。

这是一个鲜明的例子,说明一个完美的估计量(\(k_1\) 对 KL 精确无偏)可以是一个糟糕的损失函数
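The zero-expectation claim is just the score-function identity; here is a quick exact check on a small categorical policy (a sketch of my own):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(8, requires_grad=True)
probs = torch.softmax(logits, dim=0).detach()

# E_{a ~ pi_theta}[grad_theta log pi_theta(a)], computed exactly by summing over actions
expected_grad = torch.zeros_like(logits)
for a in range(8):
    logits.grad = None
    torch.log_softmax(logits, dim=0)[a].backward()
    expected_grad += probs[a] * logits.grad
print(expected_grad)   # ~ zero vector: "k1 as loss" contributes only sampling noise
```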

GRPO uses \(k_3 = \delta - 1 - \log \delta\) as a directly differentiated loss (decoupled form). Differentiating:

\[\nabla_\theta k_3 = \nabla_\theta(\delta - \log \delta) = \nabla_\theta \delta + \nabla_\theta \log \pi_\theta = -\delta \cdot \nabla_\theta \log \pi_\theta + \nabla_\theta \log \pi_\theta = (1 - \delta) \cdot \nabla_\theta \log \pi_\theta,\]

using \(\nabla_\theta \log \delta = -\nabla_\theta \log \pi_\theta\) (\(\pi_{\mathrm{ref}}\) is fixed) and \(\nabla_\theta \delta = \delta \cdot \nabla_\theta \log \delta = -\delta \cdot \nabla_\theta \log \pi_\theta\).

Compared to the true RKL gradient coefficient \(-\log \delta\), GRPO uses \(1 - \delta\). These are related by Taylor expansion: \(-\log \delta = (1 - \delta) + \frac{1}{2}(\delta - 1)^2 - \cdots\), so \(1 - \delta\) is only the first-order approximation of \(-\log \delta\). This introduces three problems:

  1. Bias: for all \(\delta \neq 1\), the coefficient \(1 - \delta \neq -\log \delta\), so the gradient direction is biased.

  2. Pathological asymmetry: When the policy deviates away from the reference (\(\delta \to 0\), meaning \(\pi_\theta \gg \pi_{\mathrm{ref}}\)), the true coefficient \(-\log \delta \to +\infty\) provides a strong restoring force, but \(1 - \delta \to 1\) saturates — it cannot push back hard enough. Conversely, when \(\delta \to \infty\) (\(\pi_\theta \ll \pi_{\mathrm{ref}}\)), \(1 - \delta \to -\infty\) explodes much faster than the logarithmic \(-\log \delta\), risking destabilizing updates.

  3. Variance: the variance of \(1 - \delta\) involves \(\mathrm{Var}[\delta] = \chi^2(\pi_{\mathrm{ref}} \Vert \pi_\theta)\), the chi-squared divergence, which is notoriously unstable and can diverge even when KL remains finite.

So paradoxically, \(k_3\) — the “clear winner” as a KL estimator — produces a biased, asymmetric, and potentially unstable gradient when used as a loss in GRPO. The paper recommends \(k_2\) as loss (or equivalently \(k_1\) in reward) as the principled default.

GRPO 使用 \(k_3 = \delta - 1 - \log \delta\) 作为直接微分的损失(解耦形式)。微分得:

\[\nabla_\theta k_3 = \nabla_\theta(\delta - \log \delta) = \nabla_\theta \delta + \nabla_\theta \log \pi_\theta = -\delta \cdot \nabla_\theta \log \pi_\theta + \nabla_\theta \log \pi_\theta = (1 - \delta) \cdot \nabla_\theta \log \pi_\theta,\]

其中用到 \(\nabla_\theta \log \delta = -\nabla_\theta \log \pi_\theta\)(\(\pi_{\mathrm{ref}}\) 固定)以及 \(\nabla_\theta \delta = \delta \cdot \nabla_\theta \log \delta = -\delta \cdot \nabla_\theta \log \pi_\theta\)。

与真实 RKL 梯度系数 \(-\log \delta\) 相比,GRPO 使用的是 \(1 - \delta\)。两者通过 Taylor 展开相关:\(-\log \delta = (1 - \delta) + \frac{1}{2}(\delta - 1)^2 - \cdots\),所以 \(1 - \delta\) 只是 \(-\log \delta\) 的一阶近似。这带来三个问题:

  1. 偏差:对所有 \(\delta \neq 1\),系数 \(1 - \delta \neq -\log \delta\),因此梯度方向有偏。

  2. 病态不对称性:当策略偏离参考策略时(\(\delta \to 0\),即 \(\pi_\theta \gg \pi_{\mathrm{ref}}\)),真实系数 \(-\log \delta \to +\infty\) 提供强恢复力,但 \(1 - \delta \to 1\) 饱和——无法提供足够的回推力。反之,当 \(\delta \to \infty\)(\(\pi_\theta \ll \pi_{\mathrm{ref}}\))时,\(1 - \delta \to -\infty\) 比对数式的 \(-\log \delta\) 爆炸得更快,可能导致不稳定的更新。

  3. 方差:\(1 - \delta\) 的方差涉及 \(\mathrm{Var}[\delta] = \chi^2(\pi_{\mathrm{ref}} \Vert \pi_\theta)\)(卡方散度),这在数值上是出了名的不稳定,即使 KL 保持有限也可能发散。

所以矛盾的是,\(k_3\)——作为 KL 估计量的”明确赢家”——当作为 GRPO 中的损失使用时,产生的梯度是有偏的、不对称的、且可能不稳定的。论文推荐 \(k_2\) 作为损失(或等价地 \(k_1\) 放入奖励)作为原则性的默认选择。
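The asymmetry is easy to see numerically; a small sketch comparing the two gradient coefficients over a range of ratios:

```python
import numpy as np

for delta in [0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 100.0]:
    true_coef = -np.log(delta)   # coefficient in the true RKL gradient
    grpo_coef = 1.0 - delta      # coefficient produced by differentiating k3
    print(f"delta={delta:7.2f}   -log(delta)={true_coef:8.3f}   1-delta={grpo_coef:9.3f}")
```

At \(\delta = 0.01\) the true coefficient is about 4.6 while \(1 - \delta\) saturates at 0.99; at \(\delta = 100\) the true coefficient is about \(-4.6\) while \(1 - \delta\) has already blown up to \(-99\).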

The following table contrasts the estimator ranking (Schulman) with the optimization ranking (Liu et al.):

| | As Estimator | “\(k_n\) in reward” | “\(k_n\) as loss” |
|---|---|---|---|
| \(k_1 = -\log \delta\) | Unbiased, high variance | ✓ Correct RKL gradient | ✗ Zero-mean noise, no regularization |
| \(k_2 = \frac{1}{2}(\log \delta)^2\) | Biased (low), low variance | | ✓ Correct RKL gradient |
| \(k_3 = (\delta - 1) - \log \delta\) | Unbiased, low variance | | ≈ First-order biased approximation |

The irony is complete: \(k_1\), the worst estimator (high variance), produces the correct gradient when placed in the reward. \(k_3\), the best estimator (unbiased + low variance), produces a biased gradient when used as a loss. And \(k_2\), the biased estimator, produces the correct gradient as a loss — making it gradient-equivalent to \(k_1\) in reward.

The reason is that estimation asks “how close is the value \(k_n\) to the true KL?” while optimization asks “does \(\nabla_\theta k_n\) point in the right direction?” These are fundamentally different questions, and a good answer to one does not imply a good answer to the other.

下表对比了估计量排名(Schulman)和优化排名(Liu et al.):

| | 作为估计量 | “\(k_n\) 放入奖励” | “\(k_n\) 作为损失” |
|---|---|---|---|
| \(k_1 = -\log \delta\) | 无偏,高方差 | ✓ 正确的 RKL 梯度 | ✗ 零均值噪声,无正则化 |
| \(k_2 = \frac{1}{2}(\log \delta)^2\) | 有偏(低),低方差 | | ✓ 正确的 RKL 梯度 |
| \(k_3 = (\delta - 1) - \log \delta\) | 无偏,低方差 | | ≈ 一阶有偏近似 |

反讽至此完整:\(k_1\),最差的估计量(高方差),放入奖励时产生正确的梯度。\(k_3\),最好的估计量(无偏+低方差),作为损失时产生有偏的梯度。而 \(k_2\),那个有偏的估计量,作为损失时反而产生正确的梯度——使其与 \(k_1\) 放入奖励梯度等价。

原因在于,估计问的是”\(k_n\) 的离真实 KL 有多近?”,而优化问的是”\(\nabla_\theta k_n\) 是否指向正确方向?”这是根本不同的问题,一个问题的好答案并不意味着另一个问题的好答案。

MaxEnt RL Methods and Their Connection to KL

Maximum Entropy RL (Ziebart, 2010; Haarnoja et al., SAC, 2018) augments the standard RL objective with an entropy bonus at every timestep. The optimal policy maximizes not just cumulative reward, but also the entropy of its own action distribution:

\[\pi^*_{\mathrm{maxent}} := \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t \Big(r(s_t, a_t) + \alpha \, \mathcal{H}\!\big(\pi(\cdot \vert s_t)\big)\Big)\right]\]

where \(\mathcal{H}(\pi(\cdot \vert s)) = \mathbb{E}_{a \sim \pi}[-\log \pi(a \vert s)]\) is the entropy of the policy at state \(s\). Three properties of this quantity matter for what follows: (i) it is non-negative, with \(\mathcal{H} = 0\) iff \(\pi\) is deterministic; (ii) it is bounded above by \(\log \lvert\mathcal{A}\rvert\), achieved iff \(\pi\) is uniform; (iii) it is concave in \(\pi\). Property (ii) will become important later — the upper bound grows with the action space, so for large \(\lvert\mathcal{A}\rvert\) the entropy term can dominate the reward in the MaxEnt objective.

To build intuition for how entropy depends on the distribution shape, consider a policy that puts probability \(p\) on one action and spreads the remaining \(1 - p\) uniformly over the other \(\lvert\mathcal{A}\rvert - 1\) actions (each getting \(\frac{1-p}{\lvert\mathcal{A}\rvert - 1}\)). Splitting the expectation into these two groups:

\[\mathcal{H}(\pi) = -\sum_{a} \pi(a)\log\pi(a) = \underbrace{-p\log p}_{\text{from the single action}} \;\underbrace{- \;(\lvert\mathcal{A}\rvert - 1)\cdot\frac{1-p}{\lvert\mathcal{A}\rvert - 1}\cdot\log\frac{1-p}{\lvert\mathcal{A}\rvert - 1}}_{\text{from the remaining }\lvert\mathcal{A}\rvert - 1\text{ actions}}\]

The second group simplifies (the \(\lvert\mathcal{A}\rvert - 1\) cancels with the fraction), giving a clean two-term decomposition:

\[\mathcal{H}(\pi) = -p\log p \;-\; (1-p)\log\frac{1-p}{\lvert\mathcal{A}\rvert - 1}\]

When \(p = 1/\lvert\mathcal{A}\rvert\) (uniform), both terms contribute and the total reaches the maximum \(\log\lvert\mathcal{A}\rvert\). When \(p = 1\) (deterministic), both terms vanish. The interactive figure below visualizes this decomposition:
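For readers who prefer code to the figure, a small sketch of the same decomposition (my own helper), verifying the two endpoints:

```python
import numpy as np

def entropy_one_vs_rest(p, A):
    """Entropy of a policy with mass p on one action and (1 - p) spread uniformly over the other A - 1."""
    h = -p * np.log(p) if p > 0 else 0.0
    if p < 1:
        h += -(1 - p) * np.log((1 - p) / (A - 1))
    return h

A = 10
print(entropy_one_vs_rest(1 / A, A), np.log(A))   # uniform: both equal log|A|
print(entropy_one_vs_rest(1.0, A))                 # deterministic: 0.0
```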

Expanding the entropy and absorbing it into the per-step reward:

\[\pi^*_{\mathrm{maxent}} = \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t \Big(r(s_t, a_t) + \alpha \, \mathbb{E}_{a \sim \pi(\cdot \vert s_t)}[-\log \pi(a \vert s_t)]\Big)\right]\]

Since the trajectory expectation already samples \(a_t \sim \pi(\cdot \vert s_t)\), the inner expectation can be folded in, yielding the equivalent form:

\[\pi^*_{\mathrm{maxent}} = \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t \Big(r(s_t, a_t) - \alpha \log \pi(a_t \vert s_t)\Big)\right]\]

This looks like standard RL with a modified per-step reward \(\tilde{r}_t = r(s_t, a_t) - \alpha \log \pi(a_t \vert s_t)\). One might think we can simply add \(-\alpha \log \pi\) as a bonus to policy gradient and call it a day.

Why is this not just entropy-regularized policy gradient? The crucial difference is that \(-\alpha \log \pi(a_t \vert s_t)\) depends on \(\pi\) itself, unlike the environment reward \(r(s,a)\) which is a fixed function. In standard actor-critic, the critic \(Q^\pi(s,a)\) only backs up \(r\) — it evaluates “how good is state \(s'\)” purely in terms of future reward. But here the “return” includes \(-\alpha \log \pi\) at every future step, and this term changes as \(\pi\) updates. A critic that ignores future entropy gives wrong advantage estimates: it uses a baseline that does not account for the entropy component of the return. MaxEnt RL fixes this by backing up the entropy into the value function — the critic itself must track how much entropy the policy will generate in the future.

Entropy backing up and the Soft Bellman Equation. Recall the standard action-value function, which measures the total reward from taking action \(a\) at state \(s\) and following \(\pi\) thereafter:

\[Q^\pi(s,a) := r(s,a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \vert s,a),\, a' \sim \pi(\cdot \vert s')}\!\left[Q^\pi(s', a')\right]\]

The Soft Bellman Equation (Haarnoja et al., ICML 2017) modifies this by including the entropy bonus in the backup target:

\[Q^\pi_{\mathrm{soft}}(s,a) := r(s,a) + \gamma \, \mathbb{E}_{s',\, a' \sim \pi(\cdot \vert s')}\!\left[Q^\pi_{\mathrm{soft}}(s', a') - \alpha \log \pi(a' \vert s')\right]\]

Why this form? Note that the entropy \(-\alpha \log \pi(a' \vert s')\) is attached to the next state \(s'\), not the current state \(s\). We can rearrange by separating the entropy from the recursive \(Q\) term:

\[Q^\pi_{\mathrm{soft}}(s,a) = \underbrace{\Big(r(s,a) - \alpha\gamma \, \mathbb{E}_{s',\, a' \sim \pi(\cdot \vert s')}\!\left[\log \pi(a' \vert s')\right]\Big)}_{\tilde{r}(s,a)} + \gamma \, \mathbb{E}_{s',\, a' \sim \pi(\cdot \vert s')}\!\left[Q^\pi_{\mathrm{soft}}(s', a')\right]\]

This has exactly the form of a standard Bellman equation \(Q = \tilde{r} + \gamma \, \mathbb{E}[Q]\) with an effective reward \(\tilde{r}(s,a) = r(s,a) + \alpha\gamma \, \mathbb{E}_{s' \sim p(\cdot \vert s,a)}\!\big[\mathcal{H}\!\big(\pi(\cdot \vert s')\big)\big]\) that augments the environment reward with the (discounted) entropy of the policy at the next state. Since \(\tilde{r}\) is bounded whenever \(r\) and \(\log \pi\) are bounded, the standard Bellman contraction argument applies directly: the Soft Bellman operator is a \(\gamma\)-contraction in \(\ell_\infty\) norm, guaranteeing a unique fixed point and convergence of value iteration.

While \(Q^\pi\) only accounts for future reward, \(Q^\pi_{\mathrm{soft}}\) also incorporates future entropy bonuses into the backup — it values not just how much reward the agent collects, but also how many options it keeps open at future states.
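A tabular sketch of the backup on a randomly generated toy MDP (my own example, not from the papers): soft policy evaluation iterates the Soft Bellman Equation for a fixed stochastic policy, and the gap to standard policy evaluation is exactly the discounted future entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, alpha = 4, 3, 0.9, 0.1

P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] transition probabilities
r = rng.normal(size=(S, A))                   # rewards r(s, a)
pi = rng.dirichlet(np.ones(A), size=S)        # fixed stochastic policy pi(a|s)

Q_soft = np.zeros((S, A))
for _ in range(500):                          # soft policy evaluation
    V_soft = np.sum(pi * (Q_soft - alpha * np.log(pi)), axis=1)   # E_a'[Q_soft - alpha*log pi]
    Q_soft = r + gamma * P @ V_soft           # back up reward plus future entropy

Q_plain = np.zeros((S, A))
for _ in range(500):                          # standard policy evaluation (reward only)
    Q_plain = r + gamma * P @ np.sum(pi * Q_plain, axis=1)

print(Q_soft - Q_plain)   # positive everywhere: alpha*gamma*E[discounted future entropy]
```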

Unrolled definitions. Unrolling the Soft Bellman recursion, the soft Q-value can be written as the expected discounted sum of rewards and future entropies (Haarnoja et al., ICML 2017, Appendix A). For convenience, the entropy coefficient \(\alpha\) is set to 1 (the general case is recovered by dividing rewards by \(\alpha\)):

\[Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a}) \triangleq r_0 + \mathbb{E}_{\tau \sim \pi,\, \mathbf{s}_0 = \mathbf{s},\, \mathbf{a}_0 = \mathbf{a}}\!\left[\sum_{t=1}^{\infty} \gamma^t \Big(r_t + \mathcal{H}\!\big(\pi(\cdot \vert \mathbf{s}_t)\big)\Big)\right]\]

where \(\tau = (\mathbf{s}_0, \mathbf{a}_0, \mathbf{s}_1, \mathbf{a}_1, \ldots)\) denotes the trajectory originating at \((\mathbf{s}, \mathbf{a})\). The discounted maximum entropy policy objective is then:

\[J(\pi) \triangleq \sum_t \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi}\!\left[Q_{\mathrm{soft}}^{\pi}(\mathbf{s}_t, \mathbf{a}_t) + \alpha\,\mathcal{H}\!\big(\pi(\cdot \vert \mathbf{s}_t)\big)\right]\]

and the corresponding optimal policy is:

\[\pi^*_{\mathrm{MaxEnt}} = \arg\max_\pi \sum_t \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi}\!\left[\sum_{l=t}^{\infty} \gamma^{l-t} \mathbb{E}_{(\mathbf{s}_l, \mathbf{a}_l)}\!\left[r(\mathbf{s}_l, \mathbf{a}_l) + \alpha\,\mathcal{H}\!\big(\pi(\cdot \vert \mathbf{s}_l)\big) \,\middle\vert\, \mathbf{s}_t, \mathbf{a}_t\right]\right]\]

Note that this objective takes into account the entropy of the policy at future states, in contrast to greedy objectives such as Boltzmann exploration.

Wait — the first formula already sums entropy at every timestep. How is the last one different? (Click to expand)

Both formulas include entropy at every timestep, but they organize the sum differently, revealing different structure.

The first formula is a flat sum: \(\sum_t \gamma^t (r_t + \alpha H_t)\). Expanded, this is just \(\gamma^0(r_0+\alpha H_0) + \gamma^1(r_1+\alpha H_1) + \gamma^2(r_2+\alpha H_2) + \cdots\). Each timestep's entropy appears exactly once, on equal footing with the reward — it looks like entropy is just a local, per-step bonus with no notion of "future."

The last formula is a nested sum: for each \((s_t, a_t)\), the inner sum \(\sum_{l=t}^{\infty} \gamma^{l-t} \mathbb{E}[r_l + \alpha H_l \mid s_t, a_t]\) is the full soft return from \(t\) onward:

$$\underbrace{(r_t + \alpha H_t)}_{\text{current}} + \gamma\underbrace{(r_{t+1} + \alpha H_{t+1})}_{\text{future}} + \gamma^2\underbrace{(r_{t+2} + \alpha H_{t+2})}_{\text{further future}} + \cdots$$

and these future terms are a conditional expectation given \((s_t, a_t)\). The future entropies \(H_{t+1}, H_{t+2}, \ldots\) appear explicitly inside the evaluation of the current state-action pair.

Both formulas are mathematically equivalent — they define the same optimal policy. But they are not the same objective function (they weight timesteps differently). The distinction is about algorithmic readability: the first formula's flat structure makes entropy look like an ordinary per-step bonus, tempting one to think it can simply be added to standard policy gradient. The last formula's nested structure makes explicit that evaluating \((s_t, a_t)\) requires knowing \(\mathbb{E}[H_{t+1} + \gamma H_{t+2} + \cdots \mid s_t, a_t]\) — and that inner sum is precisely \(Q_{\mathrm{soft}}^\pi(s_t, a_t)\). This directly reveals why the Soft Bellman equation must back up entropy: the critic must track future entropy, or it will produce wrong advantage estimates.

Proof of equivalent argmax. Write \(f_t = r_t + \alpha \mathcal{H}_t\). Then \(J_1(\pi) = \mathbb{E}_\tau[\sum_t \gamma^t f_t]\) and \(J_2(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}[\sum_{l=t}^\infty \gamma^{l-t} \mathbb{E}[f_l \mid s_t,a_t]]\). Swapping the summation order in \(J_2\) gives each \(f_k\) the weight \(\frac{1-\gamma^{k+1}}{1-\gamma}\) instead of \(\gamma^k\) in \(J_1\), so the two objectives are genuinely different functions of \(\pi\). Nevertheless, they share the same unique maximizer:

  1. Unique fixed point. The soft Bellman operator \(\mathcal{T}Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}[\log\!\int\!\exp Q(s',a')\,da']\) is a \(\gamma\)-contraction in \(\ell_\infty\) norm (the log-sum-exp is 1-Lipschitz). Therefore it has a unique fixed point \(Q^*_{\mathrm{soft}}\), and the corresponding policy \(\pi^*(a|s) \propto \exp(Q^*_{\mathrm{soft}}(s,a)/\alpha)\) is the unique policy satisfying the soft Bellman optimality equation.
  2. Q-dominance. The soft policy improvement theorem (Haarnoja et al., 2017, Theorem 1) shows: for any \(\pi\), define \(\tilde\pi(\cdot|s) \propto \exp(Q^\pi_{\mathrm{soft}}(s,\cdot)/\alpha)\). Then \(Q^{\tilde\pi}_{\mathrm{soft}}(s,a) \geq Q^\pi_{\mathrm{soft}}(s,a)\) for all \((s,a)\), with equality iff \(\pi\) already satisfies the optimality condition. Iterating converges to the unique fixed point \(\pi^*\), giving \(Q^{\pi^*}_{\mathrm{soft}}(s,a) \geq Q^\pi_{\mathrm{soft}}(s,a)\) for all \((s,a)\) and all \(\pi\). This implies pointwise V-dominance: \(V^{\pi^*}_{\mathrm{soft}}(s) \geq V^\pi_{\mathrm{soft}}(s)\) for all \(s\).
  3. \(\pi^*\) maximizes \(J_1\). Since \(J_1(\pi) = \mathbb{E}_{s_0}[V^\pi_{\mathrm{soft}}(s_0)]\) and \(V^{\pi^*}(s_0) \geq V^\pi(s_0)\) for all \(s_0\), we have \(J_1(\pi^*) \geq J_1(\pi)\) for all \(\pi\).
  4. \(\pi^*\) maximizes \(J_2\). The policy iteration \(\pi_i \to \pi_{i+1}\) with \(\pi_{i+1}(\cdot|s) \propto \exp(Q^{\pi_i}_{\mathrm{soft}}(s,\cdot)/\alpha)\) yields \(Q^{\pi_{i+1}}_{\mathrm{soft}} \geq Q^{\pi_i}_{\mathrm{soft}}\) pointwise at each step, and converges to \(\pi^*\) by contraction. Since the only policy where no improvement exists is \(\pi^*\), it is the unique maximizer of \(J_2\). (Here, unlike for \(J_1\), we cannot directly use V-dominance because the state distribution \(d_t^\pi\) in \(J_2\) also changes with \(\pi\). Instead we rely on the monotone convergence of Q-values to the unique fixed point.)

Since both objectives have the same unique maximizer, \(\arg\max_\pi J_1(\pi) = \arg\max_\pi J_2(\pi) = \pi^*\). \(\square\)

Why the MaxEnt objective cannot be optimized by entropy-regularized policy gradient (Click to expand)

Entropy-regularized policy gradient (ERPG) is the approach of taking a standard actor-critic algorithm — whose critic \(Q^\pi\) only backs up environment reward — and adding an entropy bonus \(\alpha H(\pi)\) to the actor's objective. The ERPG optimization target at each state \(s\) is:

$$J_{\mathrm{ERPG}}(\pi; s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\!\big[Q^\pi(s,a)\big] + \alpha\,\mathcal{H}\!\big(\pi(\cdot|s)\big)$$

where \(Q^\pi(s,a) = \mathbb{E}\!\left[\sum_{l=0}^\infty \gamma^l r_{t+l} \mid s_t\!=\!s,\, a_t\!=\!a\right]\) satisfies the standard (non-soft) Bellman equation — it backs up reward only, with no entropy in the bootstrap target. The entropy term \(\alpha H\) is applied only at the current policy improvement step, not propagated into future value estimates. Solving for the optimal policy gives:

$$\pi_{\mathrm{ERPG}}(\cdot|s) \propto \exp\!\big(Q^\pi(s,\cdot)/\alpha\big)$$

Deriving the true gradient of the MaxEnt objective. Write the MaxEnt objective as \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_t \gamma^t(r_t - \alpha\log\pi_\theta(a_t|s_t))]\). We need the gradient of an expectation where both the distribution and the integrand depend on \(\theta\). By the product rule under the integral sign:

$$\nabla_\theta\!\int\! p_\theta(\tau)\,f(\tau,\theta)\,d\tau = \int\!\big[\nabla_\theta p_\theta(\tau)\cdot f(\tau,\theta) + p_\theta(\tau)\cdot\nabla_\theta f(\tau,\theta)\big]\,d\tau = \mathbb{E}_{p_\theta}\!\big[\nabla_\theta\!\log p_\theta(\tau)\cdot f(\tau,\theta) + \nabla_\theta f(\tau,\theta)\big]$$

where the second equality uses \(\nabla_\theta p_\theta = p_\theta\,\nabla_\theta\!\log p_\theta\). Applying this to \(J(\theta)\):

$$\nabla_\theta J = \underbrace{\mathbb{E}_\tau\!\left[\bigg(\sum_{t'} \nabla_\theta\!\log\pi_\theta(a_{t'}|s_{t'})\bigg) \cdot \bigg(\sum_t \gamma^t \tilde r_t\bigg)\right]}_{\text{(I)}} \;+\; \underbrace{\mathbb{E}_\tau\!\left[\sum_t \gamma^t\big(-\alpha\,\nabla_\theta\!\log\pi_\theta(a_t|s_t)\big)\right]}_{(\text{II})}$$

where \(\tilde r_t = r_t - \alpha\log\pi_\theta(a_t|s_t)\). Term (II) vanishes by the score function identity: \(\mathbb{E}_{a\sim\pi}[\nabla\!\log\pi(a|s)] = \nabla\!\sum_a\pi(a|s) = 0\).

Simplifying term (I) via causality. Expanding the product of sums gives cross terms \(\nabla\!\log\pi(a_{t'}|s_{t'}) \cdot \gamma^t \tilde r_t\). For \(t < t'\), the "reward" \(\tilde r_t\) depends only on \((s_t, a_t)\) and is therefore fixed given the trajectory up to time \(t' - 1\). Conditioning on \(s_{t'}\):

$$\mathbb{E}\!\big[\nabla\!\log\pi(a_{t'}|s_{t'}) \cdot \tilde r_t\big] = \mathbb{E}\!\big[\tilde r_t \cdot \underbrace{\mathbb{E}_{a_{t'}\sim\pi}[\nabla\!\log\pi(a_{t'}|s_{t'})]}_{=\,0}\big] = 0 \quad (t < t')$$

So only terms with \(t \geq t'\) survive. Re-indexing with \(l = t - t'\):

$$\text{(I)} = \mathbb{E}_\tau\!\left[\sum_{t'} \nabla\!\log\pi(a_{t'}|s_{t'}) \cdot \sum_{t \geq t'} \gamma^t \tilde r_t\right] = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla\!\log\pi(a_t|s_t) \cdot \underbrace{\sum_{l=0}^{\infty}\gamma^l \tilde r_{t+l}}_{G_t^{\mathrm{soft}}}\right]$$

Replacing \(G_t^{\mathrm{soft}}\) by its conditional expectation \(\mathbb{E}[G_t^{\mathrm{soft}}|s_t,a_t] = Q^\pi_{\mathrm{soft}}(s_t,a_t) - \alpha\log\pi(a_t|s_t)\):

$$\boxed{\nabla_\theta J_{\mathrm{MaxEnt}} = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla_\theta\!\log\pi_\theta(a_t|s_t) \cdot \big(Q^\pi_{\mathrm{soft}}(s_t,a_t) - \alpha\log\pi_\theta(a_t|s_t)\big)\right]}$$

ERPG's gradient. ERPG computes two separate pieces: (1) a standard policy gradient using reward-only \(Q^\pi\), and (2) a separate entropy gradient \(\alpha\nabla_\theta H = -\alpha\,\mathbb{E}[\nabla\!\log\pi\cdot\log\pi]\) (the \(\nabla\!\log\pi\cdot 1\) term vanishes by the score identity). Combining:

$$\boxed{\nabla_\theta J_{\mathrm{ERPG}} = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla_\theta\!\log\pi_\theta(a_t|s_t) \cdot \big(Q^\pi(s_t,a_t) - \alpha\log\pi_\theta(a_t|s_t)\big)\right]}$$

The gradient gap. The \(-\alpha\log\pi\) terms are identical and cancel in the difference:

$$\nabla_\theta J_{\mathrm{MaxEnt}} - \nabla_\theta J_{\mathrm{ERPG}} = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla_\theta\!\log\pi_\theta(a_t|s_t) \cdot \underbrace{\big(Q^\pi_{\mathrm{soft}}(s_t,a_t) - Q^\pi(s_t,a_t)\big)}_{\text{discounted future entropy}}\right]$$

where the gap is exactly the discounted future entropy conditional on the action:

$$Q^\pi_{\mathrm{soft}}(s,a) - Q^\pi(s,a) = \gamma\,\mathbb{E}_{s' \sim p(\cdot|s,a)}\!\big[V^\pi_{\mathrm{soft}}(s') - V^\pi(s')\big] = \alpha\gamma\;\mathbb{E}_{s'}\!\left[\sum_{l=0}^{\infty}\gamma^l H\!\big(\pi(\cdot|s_{l+1}')\big)\right]$$

This depends on \(a\) through the transition \(s' \sim p(\cdot|s,a)\). In any MDP where different actions lead to states with different future entropy, this gap is action-dependent and nonzero, so \(\nabla J_{\mathrm{MaxEnt}} \neq \nabla J_{\mathrm{ERPG}}\). ERPG produces biased gradients for the MaxEnt objective. \(\square\)

Concrete instance: GRPO. The GRPO objective (Shao et al., 2024) uses a PPO-clipped surrogate with a KL penalty against a reference model:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) \;=\; \mathbb{E}\!\left[\frac{1}{G}\sum_{i}\frac{1}{|o_i|}\sum_{t}\Big\{\min\!\big[r_t(\theta)\,\hat A_{i,t},\;\mathrm{clip}(r_t(\theta),\,1\!-\!\varepsilon,\,1\!+\!\varepsilon)\,\hat A_{i,t}\big] - \beta\,D_{\mathrm{KL}}[\pi_\theta\|\pi_{\mathrm{ref}}]\Big\}\right]$$

where \(r_t(\theta) = \pi_\theta(o_{i,t}|q,o_{i,<t})/\pi_{\theta_{\mathrm{old}}}(o_{i,t}|q,o_{i,<t})\) is the importance ratio and \(\hat A_{i,t}\) is the group-relative advantage (computed from reward only, no entropy in the baseline). If we replace \(-\beta\,D_{\mathrm{KL}}[\pi_\theta\|\pi_{\mathrm{ref}}]\) with \(+\alpha\,\mathcal{H}(\pi_\theta)\), the result is exactly ERPG: the entropy bonus is applied per-token at the actor level, but the advantage estimates \(\hat A_{i,t}\) still come from a reward-only signal — entropy is not backed up into the value baseline. By the argument above, this is not equivalent to the MaxEnt RL objective.

The practical consequence is best seen through an example. Consider navigating around an obstacle to reach a goal. Without entropy backing up, the value function only cares about reward, so it finds the shortest path — say, squeezing through a narrow gap on the left side of the obstacle. This path is slightly shorter, but leaves little room for error: a noisy policy would easily collide with the obstacle. With entropy backing up, \(Q_{\mathrm{soft}}\) assigns higher value to states from which many different trajectories can reach the goal. The policy therefore prefers the wider route around the right side of the obstacle — even though it is slightly longer — because from those states, there are more ways to succeed even under stochastic action selection. The key insight is that this preference for “states with many options” emerges automatically from backing up entropy through the Bellman equation, not from any explicit path-planning logic.

Relationship to KL regularization. The per-step KL penalty in RLHF can be decomposed to reveal that it contains the MaxEnt entropy bonus as a component. For a single state \(s\):

\[D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \vert s) \,\big\Vert\, \pi_{\mathrm{ref}}(\cdot \vert s)\big) = \mathbb{E}_{a \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(a \vert s)}{\pi_{\mathrm{ref}}(a \vert s)}\right] = \underbrace{-\mathcal{H}\!\big(\pi_\theta(\cdot \vert s)\big)}_{\text{negative entropy}} + \underbrace{H\!\big(\pi_\theta(\cdot \vert s),\, \pi_{\mathrm{ref}}(\cdot \vert s)\big)}_{\text{cross-entropy}}\]

where \(H(\pi_\theta, \pi_{\mathrm{ref}}) = \mathbb{E}_{a \sim \pi_\theta}[-\log \pi_{\mathrm{ref}}(a \vert s)]\) is the cross-entropy. Therefore the KL-penalized reward decomposes as:

\[r(s,a) - \beta \log \frac{\pi_\theta(a \vert s)}{\pi_{\mathrm{ref}}(a \vert s)} = r(s,a) \underbrace{- \beta \log \pi_\theta(a \vert s)}_{\text{entropy bonus (as in MaxEnt)}} \underbrace{+ \beta \log \pi_{\mathrm{ref}}(a \vert s)}_{\text{anchor to reference}}\]

The first two terms are exactly the MaxEnt reward with \(\alpha = \beta\). The third term, \(+\beta \log \pi_{\mathrm{ref}}(a \vert s)\), acts as a state-action-dependent reward shaping that pulls the policy toward the reference model. When \(\pi_{\mathrm{ref}}\) is uniform, \(\log \pi_{\mathrm{ref}}\) is a constant and drops out — recovering the MaxEnt objective exactly.

So the distinction between MaxEnt RL and KL regularization is not about whether the value function is redefined (both can back up their respective bonuses through the Bellman equation). The distinction is purely about what is being backed up:

| MaxEnt RL | KL regularization |
|---|---|
| Backs up entropy \(\mathcal{H}(\pi)\) only | Backs up entropy \(\mathcal{H}(\pi)\) plus a cross-entropy anchor \(H(\pi, \pi_{\mathrm{ref}})\) |
| Encourages exploration for its own sake | Encourages exploration while staying close to \(\pi_{\mathrm{ref}}\) |
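The decomposition itself is straightforward to verify numerically (a sketch with random categorical distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(6))      # pi_theta(.|s)
ref = rng.dirichlet(np.ones(6))     # pi_ref(.|s)

kl = np.sum(pi * np.log(pi / ref))             # D_KL(pi_theta || pi_ref)
neg_entropy = np.sum(pi * np.log(pi))          # -H(pi_theta)
cross_entropy = -np.sum(pi * np.log(ref))      # H(pi_theta, pi_ref)
print(kl, neg_entropy + cross_entropy)         # identical
```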

Maximum Entropy RL(Ziebart, 2010; Haarnoja et al., SAC, 2018)在标准 RL 目标中每一步都加入 entropy bonus。最优策略不仅最大化累积 reward,还最大化自身动作分布的 entropy:

\[\pi^*_{\mathrm{maxent}} := \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t \Big(r(s_t, a_t) + \alpha \, \mathcal{H}\!\big(\pi(\cdot \vert s_t)\big)\Big)\right]\]

其中 \(\mathcal{H}(\pi(\cdot \vert s)) = \mathbb{E}_{a \sim \pi}[-\log \pi(a \vert s)]\) 是策略在状态 \(s\) 处的 entropy。它有三个重要性质:(i) 非负,\(\mathcal{H} = 0\) 当且仅当 \(\pi\) 是确定性策略;(ii) 上界为 \(\log \lvert\mathcal{A}\rvert\),等号当且仅当 \(\pi\) 是均匀分布;(iii) 关于 \(\pi\) 是凹函数。性质 (ii) 在后文很关键——上界随 action space 增长,因此当 \(\lvert\mathcal{A}\rvert\) 很大时,entropy 项可能主导 MaxEnt 目标中的 reward。

为了直观理解 entropy 如何依赖于分布的形状,考虑一个策略:以概率 \(p\) 选择某个动作,将剩余的 \(1 - p\) 均匀分配给其他 \(\lvert\mathcal{A}\rvert - 1\) 个动作(每个得到 \(\frac{1-p}{\lvert\mathcal{A}\rvert - 1}\))。将期望按这两组拆开:

\[\mathcal{H}(\pi) = -\sum_{a} \pi(a)\log\pi(a) = \underbrace{-p\log p}_{\text{来自单个动作}} \;\underbrace{- \;(\lvert\mathcal{A}\rvert - 1)\cdot\frac{1-p}{\lvert\mathcal{A}\rvert - 1}\cdot\log\frac{1-p}{\lvert\mathcal{A}\rvert - 1}}_{\text{来自其余 }\lvert\mathcal{A}\rvert - 1\text{ 个动作}}\]

第二组中 \(\lvert\mathcal{A}\rvert - 1\) 与分数约去,得到简洁的两项分解:

\[\mathcal{H}(\pi) = -p\log p \;-\; (1-p)\log\frac{1-p}{\lvert\mathcal{A}\rvert - 1}\]

当 \(p = 1/\lvert\mathcal{A}\rvert\)(均匀分布)时,两项都有贡献,总和达到最大值 \(\log\lvert\mathcal{A}\rvert\)。当 \(p = 1\)(确定性策略)时,两项都为零。下面的交互图可视化了这个分解:

展开 entropy 并吸收进每步 reward:

\[\pi^*_{\mathrm{maxent}} = \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t \Big(r(s_t, a_t) + \alpha \, \mathbb{E}_{a \sim \pi(\cdot \vert s_t)}[-\log \pi(a \vert s_t)]\Big)\right]\]

由于轨迹期望已经按 \(a_t \sim \pi(\cdot \vert s_t)\) 采样,内层期望可以合并,得到等价形式:

\[\pi^*_{\mathrm{maxent}} = \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t \Big(r(s_t, a_t) - \alpha \log \pi(a_t \vert s_t)\Big)\right]\]

这看起来就是标准 RL 加上修改后的每步 reward \(\tilde{r}_t = r(s_t, a_t) - \alpha \log \pi(a_t \vert s_t)\)。人们可能会认为只需在 policy gradient 上加一个 \(-\alpha \log \pi\) bonus 就够了。

为什么这不仅仅是 entropy-regularized policy gradient? 关键区别在于 \(-\alpha \log \pi(a_t \vert s_t)\) 依赖于 \(\pi\) 本身,而环境 reward \(r(s,a)\) 是固定函数。在标准 actor-critic 中,critic \(Q^\pi(s,a)\) 只 backup \(r\)——它评估”状态 \(s'\) 有多好”时只看未来的 reward。但这里的 “return” 在未来每一步都包含 \(-\alpha \log \pi\),而这个项会随 \(\pi\) 的更新而变化。一个忽略未来 entropy 的 critic 会给出错误的 advantage 估计:它使用的 baseline 没有考虑 return 中的 entropy 成分。MaxEnt RL 通过将 entropy backup 进 value function 来解决这个问题——critic 本身必须追踪策略在未来会产生多少 entropy。

Entropy backing up 与 Soft Bellman 方程。 回忆标准的动作价值函数,它衡量在状态 \(s\) 执行动作 \(a\) 并此后遵循 \(\pi\) 所获得的总 reward:

\[Q^\pi(s,a) := r(s,a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \vert s,a),\, a' \sim \pi(\cdot \vert s')}\!\left[Q^\pi(s', a')\right]\]

Soft Bellman 方程(Haarnoja et al., ICML 2017)通过在 backup 目标中包含 entropy bonus 来修改它:

\[Q^\pi_{\mathrm{soft}}(s,a) := r(s,a) + \gamma \, \mathbb{E}_{s',\, a' \sim \pi(\cdot \vert s')}\!\left[Q^\pi_{\mathrm{soft}}(s', a') - \alpha \log \pi(a' \vert s')\right]\]

为什么是这个形式? 注意 entropy \(-\alpha \log \pi(a' \vert s')\) 附着在下一个状态 \(s'\) 上,而不是当前状态 \(s\)。我们可以将 entropy 从递归的 \(Q\) 项中分离出来:

\[Q^\pi_{\mathrm{soft}}(s,a) = \underbrace{\Big(r(s,a) - \alpha\gamma \, \mathbb{E}_{s',\, a' \sim \pi(\cdot \vert s')}\!\left[\log \pi(a' \vert s')\right]\Big)}_{\tilde{r}(s,a)} + \gamma \, \mathbb{E}_{s',\, a' \sim \pi(\cdot \vert s')}\!\left[Q^\pi_{\mathrm{soft}}(s', a')\right]\]

这恰好是标准 Bellman 方程 \(Q = \tilde{r} + \gamma \, \mathbb{E}[Q]\) 的形式,其中等效 reward \(\tilde{r}(s,a) = r(s,a) + \alpha\gamma \, \mathbb{E}_{s' \sim p(\cdot \vert s,a)}\!\big[\mathcal{H}\!\big(\pi(\cdot \vert s')\big)\big]\) 将环境 reward 与下一状态处策略的(折扣)entropy 结合在一起。由于当 \(r\) 和 \(\log \pi\) 有界时 \(\tilde{r}\) 也有界,标准 Bellman 压缩论证直接适用:Soft Bellman 算子是 \(\ell_\infty\) 范数下的 \(\gamma\)-压缩映射,保证了唯一不动点的存在和值迭代的收敛。

\(Q^\pi\) 只考虑未来的 reward,而 \(Q^\pi_{\mathrm{soft}}\) 还将未来的 entropy bonus 纳入 backup——它不仅评估 agent 收集了多少 reward,还评估它在未来状态保留了多少选择空间。

展开形式。 将 Soft Bellman 递归展开,soft Q-value 可以写成 reward 与未来 entropy 的折扣求和(Haarnoja et al., ICML 2017, Appendix A)。为方便起见,将 entropy 系数 \(\alpha\) 设为 1(一般情况可通过将 reward 除以 \(\alpha\) 来恢复):

\[Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a}) \triangleq r_0 + \mathbb{E}_{\tau \sim \pi,\, \mathbf{s}_0 = \mathbf{s},\, \mathbf{a}_0 = \mathbf{a}}\!\left[\sum_{t=1}^{\infty} \gamma^t \Big(r_t + \mathcal{H}\!\big(\pi(\cdot \vert \mathbf{s}_t)\big)\Big)\right]\]

其中 \(\tau = (\mathbf{s}_0, \mathbf{a}_0, \mathbf{s}_1, \mathbf{a}_1, \ldots)\) 表示从 \((\mathbf{s}, \mathbf{a})\) 出发的轨迹。折扣 maximum entropy 策略目标为:

\[J(\pi) \triangleq \sum_t \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi}\!\left[Q_{\mathrm{soft}}^{\pi}(\mathbf{s}_t, \mathbf{a}_t) + \alpha\,\mathcal{H}\!\big(\pi(\cdot \vert \mathbf{s}_t)\big)\right]\]

对应的最优策略为:

\[\pi^*_{\mathrm{MaxEnt}} = \arg\max_\pi \sum_t \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi}\!\left[\sum_{l=t}^{\infty} \gamma^{l-t} \mathbb{E}_{(\mathbf{s}_l, \mathbf{a}_l)}\!\left[r(\mathbf{s}_l, \mathbf{a}_l) + \alpha\,\mathcal{H}\!\big(\pi(\cdot \vert \mathbf{s}_l)\big) \,\middle\vert\, \mathbf{s}_t, \mathbf{a}_t\right]\right]\]

注意该目标考虑了策略在未来状态处的 entropy,而非仅限当前步的 greedy objectives(如 Boltzmann exploration)。

等等——第一个式子也在每个时刻都求和了 entropy,最后一个式子有什么不同?(点击展开)

两个公式都包含了每个时刻的 entropy,但求和的组织方式不同,揭示的结构也不同。

第一个式子是平铺求和:\(\sum_t \gamma^t (r_t + \alpha H_t)\)。展开就是 \(\gamma^0(r_0+\alpha H_0) + \gamma^1(r_1+\alpha H_1) + \gamma^2(r_2+\alpha H_2) + \cdots\)。每个时刻的 entropy 恰好出现一次,和 reward 地位完全对称——看起来 entropy 只是一个局部的、逐步的 bonus,不涉及"未来"。

最后一个式子是嵌套求和:对每个 \((s_t, a_t)\),内层求和 \(\sum_{l=t}^{\infty} \gamma^{l-t} \mathbb{E}[r_l + \alpha H_l \mid s_t, a_t]\) 是从 \(t\) 出发的完整 soft return

$$\underbrace{(r_t + \alpha H_t)}_{\text{当前}} + \gamma\underbrace{(r_{t+1} + \alpha H_{t+1})}_{\text{未来}} + \gamma^2\underbrace{(r_{t+2} + \alpha H_{t+2})}_{\text{更远的未来}} + \cdots$$

并且这些未来项是以 \((s_t, a_t)\) 为条件的期望。未来的 entropy \(H_{t+1}, H_{t+2}, \ldots\) 显式地出现在对当前状态-动作对的评估中。

两个式子定义相同的最优策略,但它们不是同一个目标函数(对各时刻的加权不同)。区别在于算法设计上的可读性:第一个式子的平铺形式让 entropy 看起来只是一个普通的 per-step bonus,容易让人误以为直接加进标准 policy gradient 就够了。最后一个式子的嵌套结构则显式写出:评估 \((s_t, a_t)\) 需要知道 \(\mathbb{E}[H_{t+1} + \gamma H_{t+2} + \cdots \mid s_t, a_t]\)——而那个内层求和恰好就是 \(Q_{\mathrm{soft}}^\pi(s_t, a_t)\)。这直接揭示了 Soft Bellman 方程需要 backup entropy 的原因:critic 必须追踪未来的 entropy,否则会给出错误的 advantage 估计。

Argmax 等价性证明。记 \(f_t = r_t + \alpha \mathcal{H}_t\)。则 \(J_1(\pi) = \mathbb{E}_\tau[\sum_t \gamma^t f_t]\),\(J_2(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}[\sum_{l=t}^\infty \gamma^{l-t} \mathbb{E}[f_l \mid s_t,a_t]]\)。交换 \(J_2\) 的求和顺序可知 \(f_k\) 的权重为 \(\frac{1-\gamma^{k+1}}{1-\gamma}\),而非 \(J_1\) 中的 \(\gamma^k\),因此两者确实是不同的目标函数。但它们共享同一个最优策略:

  1. 唯一不动点。Soft Bellman 算子 \(\mathcal{T}Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\!\big[\alpha\log\!\int\!\exp\!\big(Q(s',a')/\alpha\big)\,da'\big]\) 是 \(\ell_\infty\) 范数下的 \(\gamma\)-压缩映射(带温度的 log-sum-exp 在 \(\ell_\infty\) 意义下是 1-Lipschitz 的)。因此它有唯一不动点 \(Q^*_{\mathrm{soft}}\),对应的策略 \(\pi^*(a|s) \propto \exp(Q^*_{\mathrm{soft}}(s,a)/\alpha)\) 是满足 soft Bellman 最优方程的唯一策略。
  2. Q-dominance。Soft policy improvement theorem(Haarnoja et al., 2017, Theorem 1)证明:对任意 \(\pi\),定义 \(\tilde\pi(\cdot|s) \propto \exp(Q^\pi_{\mathrm{soft}}(s,\cdot)/\alpha)\),则 \(Q^{\tilde\pi}_{\mathrm{soft}}(s,a) \geq Q^\pi_{\mathrm{soft}}(s,a)\) 对所有 \((s,a)\) 成立,等号当且仅当 \(\pi\) 已满足最优条件。迭代收敛至唯一不动点 \(\pi^*\),给出 \(Q^{\pi^*}_{\mathrm{soft}}(s,a) \geq Q^\pi_{\mathrm{soft}}(s,a)\) 对所有 \((s,a)\) 和所有 \(\pi\) 成立。由此得到逐点 V-dominance:\(V^{\pi^*}_{\mathrm{soft}}(s) \geq V^\pi_{\mathrm{soft}}(s)\) 对所有 \(s\) 成立。
  3. \(\pi^*\) 最大化 \(J_1\)。由于 \(J_1(\pi) = \mathbb{E}_{s_0}[V^\pi_{\mathrm{soft}}(s_0)]\) 且 \(V^{\pi^*}(s_0) \geq V^\pi(s_0)\) 对所有 \(s_0\) 成立,\(J_1(\pi^*) \geq J_1(\pi)\)。
  4. \(\pi^*\) 最大化 \(J_2\)。策略迭代 \(\pi_i \to \pi_{i+1}\)(其中 \(\pi_{i+1}(\cdot|s) \propto \exp(Q^{\pi_i}_{\mathrm{soft}}(s,\cdot)/\alpha)\))在每步都给出 \(Q^{\pi_{i+1}}_{\mathrm{soft}} \geq Q^{\pi_i}_{\mathrm{soft}}\)(逐点),并由压缩性收敛到 \(\pi^*\)。唯一使得无法继续改进的策略就是 \(\pi^*\),因此它是 \(J_2\) 的唯一最大化者。(注意:与 \(J_1\) 不同,这里不能直接用 V-dominance 论证,因为 \(J_2\) 中的状态分布 \(d_t^\pi\) 也随 \(\pi\) 变化。我们依赖的是 Q 值向唯一不动点的单调收敛。)

两个目标共享同一个最优策略:\(\arg\max_\pi J_1(\pi) = \arg\max_\pi J_2(\pi) = \pi^*\)。\(\square\)
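上面证明开头关于权重的计算可以用几行代码数值核对(仅为演示的 sketch,\(\gamma\) 与截断长度均为任意假设):交换 \(J_2\) 的双重求和后,\(f_k\) 的系数确实是 \(\frac{1-\gamma^{k+1}}{1-\gamma}\) 而非 \(\gamma^k\):

```python
import numpy as np

gamma, K = 0.9, 60
# J_2 = sum_t sum_{l>=t} gamma^{l-t} f_l;固定 l = k,对 t = 0..k 求和得到 f_k 的权重
w_numeric = np.array([sum(gamma ** (k - t) for t in range(k + 1)) for k in range(K)])
w_closed = (1 - gamma ** (np.arange(K) + 1)) / (1 - gamma)

assert np.allclose(w_numeric, w_closed)        # 与 J_1 中的 gamma^k 权重确实不同
```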

为什么 MaxEnt 目标不能用 entropy-regularized policy gradient 来优化(点击展开)

Entropy-regularized policy gradient (ERPG) 是指:在标准 actor-critic 算法(其 critic \(Q^\pi\) 只 backup 环境 reward)的基础上,给 actor 目标加一个 entropy bonus \(\alpha H(\pi)\)。ERPG 在每个状态 \(s\) 处的优化目标为:

$$J_{\mathrm{ERPG}}(\pi; s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\!\big[Q^\pi(s,a)\big] + \alpha\,\mathcal{H}\!\big(\pi(\cdot|s)\big)$$

其中 \(Q^\pi(s,a) = \mathbb{E}\!\left[\sum_{l=0}^\infty \gamma^l r_{t+l} \mid s_t\!=\!s,\, a_t\!=\!a\right]\) 满足标准(非 soft)Bellman 方程——它只 backup reward,bootstrap target 中不包含 entropy。Entropy 项 \(\alpha H\) 仅在当前策略改进步施加,不传播到未来的 value 估计中。求解最优策略得:

$$\pi_{\mathrm{ERPG}}(\cdot|s) \propto \exp\!\big(Q^\pi(s,\cdot)/\alpha\big)$$

推导 MaxEnt 目标的真正梯度。将 MaxEnt 目标写成 \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_t \gamma^t(r_t - \alpha\log\pi_\theta(a_t|s_t))]\)。当分布和被积函数都依赖于 \(\theta\) 时,对积分号下应用乘法法则:

$$\nabla_\theta\!\int\! p_\theta(\tau)\,f(\tau,\theta)\,d\tau = \int\!\big[\nabla_\theta p_\theta(\tau)\cdot f(\tau,\theta) + p_\theta(\tau)\cdot\nabla_\theta f(\tau,\theta)\big]\,d\tau = \mathbb{E}_{p_\theta}\!\big[\nabla_\theta\!\log p_\theta(\tau)\cdot f(\tau,\theta) + \nabla_\theta f(\tau,\theta)\big]$$

其中第二个等号使用了 \(\nabla_\theta p_\theta = p_\theta\,\nabla_\theta\!\log p_\theta\)。将此应用于 \(J(\theta)\):

$$\nabla_\theta J = \underbrace{\mathbb{E}_\tau\!\left[\bigg(\sum_{t'} \nabla_\theta\!\log\pi_\theta(a_{t'}|s_{t'})\bigg) \cdot \bigg(\sum_t \gamma^t \tilde r_t\bigg)\right]}_{\text{(I)}} \;+\; \underbrace{\mathbb{E}_\tau\!\left[\sum_t \gamma^t\big(-\alpha\,\nabla_\theta\!\log\pi_\theta(a_t|s_t)\big)\right]}_{(\text{II})}$$

其中 \(\tilde r_t = r_t - \alpha\log\pi_\theta(a_t|s_t)\)。项 (II) 由 score function 恒等式为零:\(\mathbb{E}_{a\sim\pi}[\nabla\!\log\pi(a|s)] = \nabla\!\sum_a\pi(a|s) = 0\)。

利用因果性化简项 (I)。展开两个求和的乘积得到交叉项 \(\nabla\!\log\pi(a_{t'}|s_{t'}) \cdot \gamma^t \tilde r_t\)。当 \(t < t'\) 时,"reward" \(\tilde r_t\) 仅依赖于 \((s_t, a_t)\),在给定到 \(t'-1\) 时刻的轨迹后是固定的。对 \(s_{t'}\) 取条件期望:

$$\mathbb{E}\!\big[\nabla\!\log\pi(a_{t'}|s_{t'}) \cdot \tilde r_t\big] = \mathbb{E}\!\big[\tilde r_t \cdot \underbrace{\mathbb{E}_{a_{t'}\sim\pi}[\nabla\!\log\pi(a_{t'}|s_{t'})]}_{=\,0}\big] = 0 \quad (t < t')$$

因此只有 \(t \geq t'\) 的项存留。令 \(l = t - t'\) 重新标号:

$$\text{(I)} = \mathbb{E}_\tau\!\left[\sum_{t'} \nabla\!\log\pi(a_{t'}|s_{t'}) \cdot \sum_{t \geq t'} \gamma^t \tilde r_t\right] = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla\!\log\pi(a_t|s_t) \cdot \underbrace{\sum_{l=0}^{\infty}\gamma^l \tilde r_{t+l}}_{G_t^{\mathrm{soft}}}\right]$$

将 \(G_t^{\mathrm{soft}}\) 替换为其条件期望 \(\mathbb{E}[G_t^{\mathrm{soft}}|s_t,a_t] = Q^\pi_{\mathrm{soft}}(s_t,a_t) - \alpha\log\pi(a_t|s_t)\):

$$\boxed{\nabla_\theta J_{\mathrm{MaxEnt}} = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla_\theta\!\log\pi_\theta(a_t|s_t) \cdot \big(Q^\pi_{\mathrm{soft}}(s_t,a_t) - \alpha\log\pi_\theta(a_t|s_t)\big)\right]}$$

ERPG 的梯度。ERPG 分别计算两部分:(1) 使用 reward-only \(Q^\pi\) 的标准 policy gradient,(2) 独立的 entropy 梯度 \(\alpha\nabla_\theta H = -\alpha\,\mathbb{E}[\nabla\!\log\pi\cdot\log\pi]\)(\(\nabla\!\log\pi\cdot 1\) 项由 score 恒等式消去)。合并得:

$$\boxed{\nabla_\theta J_{\mathrm{ERPG}} = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla_\theta\!\log\pi_\theta(a_t|s_t) \cdot \big(Q^\pi(s_t,a_t) - \alpha\log\pi_\theta(a_t|s_t)\big)\right]}$$

梯度差。\(-\alpha\log\pi\) 项相同,在差中消去:

$$\nabla_\theta J_{\mathrm{MaxEnt}} - \nabla_\theta J_{\mathrm{ERPG}} = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla_\theta\!\log\pi_\theta(a_t|s_t) \cdot \underbrace{\big(Q^\pi_{\mathrm{soft}}(s_t,a_t) - Q^\pi(s_t,a_t)\big)}_{\text{折扣未来 entropy}}\right]$$

其中差值恰好是以所选动作为条件的折扣未来 entropy:

$$Q^\pi_{\mathrm{soft}}(s,a) - Q^\pi(s,a) = \gamma\,\mathbb{E}_{s' \sim p(\cdot|s,a)}\!\big[V^\pi_{\mathrm{soft}}(s') - V^\pi(s')\big] = \alpha\gamma\;\mathbb{E}\!\left[\sum_{l=0}^{\infty}\gamma^l H\!\big(\pi(\cdot|s_{l+1})\big) \,\middle\vert\, s_0 = s,\, a_0 = a\right]$$

这通过转移 \(s' \sim p(\cdot|s,a)\) 依赖于 \(a\)。在任何不同动作导向具有不同未来 entropy 的状态的 MDP 中,这个差值都是依赖于动作且非零的,因此 \(\nabla J_{\mathrm{MaxEnt}} \neq \nabla J_{\mathrm{ERPG}}\)。ERPG 对 MaxEnt 目标产生有偏的梯度。\(\square\)
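这个差值同样可以在小 MDP 上数值验证(延续前面的 sketch;所有数值与变量名均为演示用的假设):对同一个固定策略分别做带 entropy 与不带 entropy 的 backup,其差恰好等于折扣未来 entropy:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, alpha = 4, 2, 0.9, 0.3
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a, s']
r = rng.normal(size=(nS, nA))
pi = rng.dirichlet(np.ones(nA), size=nS)           # 固定策略 pi[s, a]

def evaluate(backup_entropy):
    Q = np.zeros((nS, nA))
    for _ in range(3000):
        V = (pi * Q).sum(axis=1)
        if backup_entropy:                         # soft backup 额外加上 alpha * H(pi(.|s'))
            V = V + alpha * (-(pi * np.log(pi)).sum(axis=1))
        Q = r + gamma * P @ V
    return Q

Q_soft, Q_std = evaluate(True), evaluate(False)

# 解析地计算折扣未来 entropy:w(s) = alpha * sum_l gamma^l E[H(pi(.|s_l)) | s_0 = s]
h = -(pi * np.log(pi)).sum(axis=1)                 # 每个状态的 entropy
P_pi = np.einsum('sa,sat->st', pi, P)              # 策略诱导的状态转移核
w = alpha * np.linalg.solve(np.eye(nS) - gamma * P_pi, h)

assert np.allclose(Q_soft - Q_std, gamma * (P @ w), atol=1e-6)   # 差值 = 折扣未来 entropy
```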

具体实例:GRPO。GRPO 目标(Shao et al., 2024)使用 PPO-clipped surrogate 加上对 reference model 的 KL 惩罚:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) \;=\; \mathbb{E}\!\left[\frac{1}{G}\sum_{i}\frac{1}{|o_i|}\sum_{t}\Big\{\min\!\big[r_t(\theta)\,\hat A_{i,t},\;\mathrm{clip}(r_t(\theta),\,1\!-\!\varepsilon,\,1\!+\!\varepsilon)\,\hat A_{i,t}\big] - \beta\,D_{\mathrm{KL}}[\pi_\theta\|\pi_{\mathrm{ref}}]\Big\}\right]$$

其中 \(r_t(\theta) = \pi_\theta(o_{i,t}|q,o_{i,<t})/\pi_{\theta_{\mathrm{old}}}(o_{i,t}|q,o_{i,<t})\) 是重要性比率,\(\hat A_{i,t}\) 是 group-relative advantage(仅由 reward 计算,baseline 中不含 entropy)。如果将 \(-\beta\,D_{\mathrm{KL}}[\pi_\theta\|\pi_{\mathrm{ref}}]\) 替换为 \(+\alpha\,\mathcal{H}(\pi_\theta)\),得到的恰好就是 ERPG:entropy bonus 在 actor 端逐 token 施加,但 advantage 估计 \(\hat A_{i,t}\) 仍来自 reward-only 的信号——entropy 没有被 backup 进 value baseline。由上面的论证,这不等价于 MaxEnt RL 目标。

这个区别的实际后果可以通过一个例子来理解。考虑绕过障碍物到达目标的导航问题。不 backup entropy 时,value function 只关心 reward,因此找到最短路径——比如从障碍物左侧的狭窄缝隙挤过去。这条路径稍短,但容错空间很小:一个带噪声的策略很容易撞上障碍物。Backup entropy 时,\(Q_{\mathrm{soft}}\) 给那些“有很多不同轨迹都能到达目标”的状态赋予更高的价值。因此策略会偏好绕障碍物右侧走更宽阔的路线——即使路径稍长——因为从那些状态出发,即使在随机动作选择下,也有更多的成功方式。关键洞察是:这种对“有更多选项的状态”的偏好,是通过 Bellman 方程 backup entropy 自动产生的,而不是来自任何显式的路径规划逻辑。

与 KL 正则化的关系。 RLHF 中的 per-step KL 惩罚可以分解,揭示它包含 MaxEnt 的 entropy bonus 作为一个组成部分。对于单个状态 \(s\):

\[D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \vert s) \,\big\Vert\, \pi_{\mathrm{ref}}(\cdot \vert s)\big) = \mathbb{E}_{a \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(a \vert s)}{\pi_{\mathrm{ref}}(a \vert s)}\right] = \underbrace{-\mathcal{H}\!\big(\pi_\theta(\cdot \vert s)\big)}_{\text{负 entropy}} + \underbrace{H\!\big(\pi_\theta(\cdot \vert s),\, \pi_{\mathrm{ref}}(\cdot \vert s)\big)}_{\text{交叉熵}}\]

其中 \(H(\pi_\theta, \pi_{\mathrm{ref}}) = \mathbb{E}_{a \sim \pi_\theta}[-\log \pi_{\mathrm{ref}}(a \vert s)]\) 是交叉熵。因此 KL 惩罚的 reward 分解为:

\[r(s,a) - \beta \log \frac{\pi_\theta(a \vert s)}{\pi_{\mathrm{ref}}(a \vert s)} = r(s,a) \underbrace{- \beta \log \pi_\theta(a \vert s)}_{\text{entropy bonus(同 MaxEnt)}} \underbrace{+ \beta \log \pi_{\mathrm{ref}}(a \vert s)}_{\text{锚定到 reference}}\]

前两项恰好是 \(\alpha = \beta\) 时的 MaxEnt reward。第三项 \(+\beta \log \pi_{\mathrm{ref}}(a \vert s)\) 起到状态-动作依赖的 reward shaping 的作用,将策略拉向 reference model。当 \(\pi_{\mathrm{ref}}\) 为均匀分布时,\(\log \pi_{\mathrm{ref}}\) 为常数,可以忽略——精确恢复 MaxEnt 目标。
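这个分解可以用几行代码直接核对(分布的数值为随意假设,仅作演示):

```python
import numpy as np

pi_theta = np.array([0.7, 0.2, 0.1])    # 当前策略在某个状态 s 下的动作分布(假设值)
pi_ref   = np.array([0.4, 0.4, 0.2])    # reference model 的动作分布(假设值)

kl            = np.sum(pi_theta * np.log(pi_theta / pi_ref))
neg_entropy   = np.sum(pi_theta * np.log(pi_theta))      # -H(pi_theta)
cross_entropy = -np.sum(pi_theta * np.log(pi_ref))       # H(pi_theta, pi_ref)

assert np.isclose(kl, neg_entropy + cross_entropy)       # KL = -entropy + 交叉熵
```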

所以 MaxEnt RL 与 KL 正则化的区别不在于是否重新定义 value function(两者都可以通过 Bellman 方程 backup 各自的 bonus)。区别纯粹在于 backup 的内容是什么:

  • MaxEnt RL:仅 backup entropy \(\mathcal{H}(\pi)\)——为探索本身而鼓励探索。
  • KL 正则化:backup entropy \(\mathcal{H}(\pi)\) 加上交叉熵锚 \(H(\pi, \pi_{\mathrm{ref}})\)——在保持接近 \(\pi_{\mathrm{ref}}\) 的同时鼓励探索。

Actor-Critic

The Critic

In the policy gradient, the single-sample reward-to-go \(\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{t'}, a_{t'})\) serves as our estimate of \(Q^{\pi_\theta}(s_t, a_t)\). This is unbiased — in expectation it equals the true action-value — but high-variance, because a single trajectory may encounter lucky or unlucky transitions.

Can we get a better estimate? The idea is to fit a model to predict expected returns, rather than relying on a single sample. Define three value functions:

\[Q^\pi(s, a) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\!\left[r(s_{t'}, a_{t'}) \vert s_t, a_t\right]\] \[V^\pi(s) = \mathbb{E}_{a \sim \pi_\theta(a \vert s)}\!\left[Q^\pi(s, a)\right]\] \[A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\]

The advantage \(A^\pi\) tells us how much better action \(a\) is compared to the average action from state \(s\). Using the advantage in the policy gradient:

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t})\]

The better this estimate, the lower the variance. The key insight is that we only need to fit \(V^\pi(s)\), because \(A^\pi(s, a) \approx r(s, a) + \gamma V^\pi(s') - V^\pi(s)\), which requires only the value function and one observed transition.

How do we train this value function? This is the policy evaluation problem: given a fixed policy \(\pi_\theta\), estimate \(V^\pi(s)\). We fit a neural network \(\hat{V}^\pi_\phi(s)\) with parameters \(\phi\) by supervised regression:

\[\mathcal{L}(\phi) = \frac{1}{2} \sum_i \left\lVert \hat{V}^\pi_\phi(s_i) - y_i \right\rVert^2\]

The question is what target \(y_i\) to use. The Monte Carlo target \(y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\) is unbiased but noisy — the same function must fit many different sampled trajectories from the same state. The bootstrapped (TD) target \(y_{i,t} = r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})\) is lower variance because it replaces all future randomness with the current estimate, but introduces bias — if \(\hat{V}^\pi_\phi\) is wrong (and it always is, at least initially), the target is wrong too. Here the discount factor \(\gamma \in [0, 1]\) keeps values finite when episodes are long or infinite; one interpretation is that \(\gamma\) adds a \((1 - \gamma)\) probability of “death” at each step, making the effective horizon finite.
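As a concrete sketch (illustrative only, not tied to any particular codebase), the following computes both targets for one sampled episode, given the rewards and the critic's current estimates at the successor states:

```python
import numpy as np

def critic_targets(rewards, v_next, gamma=0.99):
    """Monte Carlo vs bootstrapped (TD) regression targets for one episode.

    rewards[t] is r(s_t, a_t); v_next[t] is the current estimate V_phi(s_{t+1}),
    with the last entry set to 0 if the episode terminates.
    """
    T = len(rewards)
    mc = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                 # y_t = r_t + gamma * y_{t+1}
        running = rewards[t] + gamma * running
        mc[t] = running
    td = np.asarray(rewards) + gamma * np.asarray(v_next)   # y_t = r_t + gamma * V_phi(s_{t+1})
    return mc, td

mc, td = critic_targets([1.0, 0.0, 2.0], v_next=[0.8, 1.5, 0.0])
```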

策略梯度中,单样本的 reward-to-go \(\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{t'}, a_{t'})\) 作为 \(Q^{\pi_\theta}(s_t, a_t)\) 的估计。这是无偏的——期望等于真实的动作价值——但方差很高,因为单条轨迹可能遇到幸运或不幸的转移。

能否得到更好的估计?核心想法是拟合一个模型来预测期望回报,而非依赖单个样本。定义三个价值函数:

\[Q^\pi(s, a) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\!\left[r(s_{t'}, a_{t'}) \vert s_t, a_t\right]\] \[V^\pi(s) = \mathbb{E}_{a \sim \pi_\theta(a \vert s)}\!\left[Q^\pi(s, a)\right]\] \[A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\]

优势函数 \(A^\pi\) 告诉我们动作 \(a\) 比状态 \(s\) 下的平均动作好多少。将优势函数用于策略梯度:

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t})\]

估计越好,方差越低。关键洞察是我们只需拟合 \(V^\pi(s)\),因为 \(A^\pi(s, a) \approx r(s, a) + \gamma V^\pi(s') - V^\pi(s)\),这只需要价值函数和一次观测到的转移。

如何训练这个价值函数?这是策略评估问题:给定固定策略 \(\pi_\theta\),估计 \(V^\pi(s)\)。我们通过监督回归拟合一个参数为 \(\phi\) 的神经网络 \(\hat{V}^\pi_\phi(s)\):

\[\mathcal{L}(\phi) = \frac{1}{2} \sum_i \left\lVert \hat{V}^\pi_\phi(s_i) - y_i \right\rVert^2\]

问题是使用什么目标 \(y_i\)。Monte Carlo 目标 \(y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\) 是无偏但有噪声的——同一个函数必须拟合来自同一状态的许多不同采样轨迹。自举(TD)目标 \(y_{i,t} = r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})\) 方差更低,因为它用当前估计替代了所有未来的随机性,但引入了偏差——如果 \(\hat{V}^\pi_\phi\) 不准确(至少在初期总是如此),目标也会不准确。这里的折扣因子 \(\gamma \in [0, 1]\) 在回合很长或无限时保持价值有界;一种理解是 \(\gamma\) 在每一步添加 \((1 - \gamma)\) 的“死亡”概率,使得有效时间范围有限。

The Algorithm

An actor-critic method maintains two components:

  • Actor: the policy \(\pi_\theta(a \vert s)\), updated by policy gradient.
  • Critic: a value function \(\hat{V}^\pi_\phi(s) \approx V^\pi(s)\), trained by regression.

Batch actor-critic:

  1. Sample \(\{s_i, a_i\}\) from \(\pi_\theta(a \vert s)\) (run the policy).
  2. Fit \(\hat{V}^\pi_\phi(s)\) to sampled reward sums.
  3. Evaluate \(\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)\).
  4. \(\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i \vert s_i) \hat{A}^\pi(s_i, a_i)\).
  5. \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\).

Online actor-critic updates after every single transition \((s, a, s', r)\):

  1. Take action \(a \sim \pi_\theta(a \vert s)\), observe \((s, a, s', r)\).
  2. Update \(\hat{V}^\pi_\phi\) using target \(r + \gamma \hat{V}^\pi_\phi(s')\).
  3. Evaluate \(\hat{A}^\pi(s, a) = r(s, a) + \gamma \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)\).
  4. \(\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \vert s) \hat{A}^\pi(s, a)\).
  5. \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\).

The online version works on a single transition — no need to collect full trajectories. In practice, it works best with a batch from parallel workers: multiple agents collect transitions simultaneously, and their gradients are aggregated before each update. This is the idea behind A3C (Mnih et al., 2016), which runs asynchronous parallel actors each contributing gradients to a shared parameter server.
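A minimal sketch of one online update, assuming a discrete action space, an `actor` network mapping a state to action logits, and a `critic` network mapping a state to a scalar value; all names and hyperparameters here are illustrative rather than taken from a specific implementation:

```python
import torch
import torch.nn.functional as F

def online_ac_step(actor, critic, actor_opt, critic_opt,
                   s, a, r, s_next, done, gamma=0.99):
    # Step 2: bootstrapped critic target y = r + gamma * V_phi(s'); `done` masks the
    # bootstrap at terminal states, and no gradient flows through the target.
    with torch.no_grad():
        target = r + gamma * critic(s_next).squeeze() * (1.0 - done)
        # Step 3: TD advantage  A = r + gamma * V_phi(s') - V_phi(s)
        advantage = target - critic(s).squeeze()

    # Regress V_phi(s) toward the target
    critic_loss = F.mse_loss(critic(s).squeeze(), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Steps 4-5: one policy-gradient step on grad log pi(a|s) * A
    log_probs = torch.log_softmax(actor(s), dim=-1)
    actor_loss = -log_probs[a] * advantage
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```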

Figure 2: The actor-critic loop. In batch mode, the policy collects many transitions before updating; in online mode, each transition triggers an immediate update. Toggle between the two to see the difference.

The actor and critic can use two separate networks (simple, stable, but no shared features) or a single network with two heads (shared early layers, separate output heads for \(\pi_\theta\) and \(\hat{V}^\pi_\phi\)). The shared design is more parameter-efficient and can learn common state representations, but couples the two learning problems. In the LLM era, this question takes a new form: can a pretrained language model serve as both actor and critic? The LM-as-critic post explores the surprising difficulties of attaching a value head to a pretrained backbone.

Actor-Critic 方法维护两个组件:

  • Actor(演员):策略 \(\pi_\theta(a \vert s)\),通过策略梯度更新。
  • Critic(评论家):价值函数 \(\hat{V}^\pi_\phi(s) \approx V^\pi(s)\),通过回归训练。

批量 Actor-Critic:

  1. 从 \(\pi_\theta(a \vert s)\) 中采样 \(\{s_i, a_i\}\)(运行策略)。
  2. 将 \(\hat{V}^\pi_\phi(s)\) 拟合到采样的奖励之和。
  3. 计算 \(\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)\)。
  4. \(\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i \vert s_i) \hat{A}^\pi(s_i, a_i)\)。
  5. \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\)。

在线 Actor-Critic 在每次转移 \((s, a, s', r)\) 后更新:

  1. 执行动作 \(a \sim \pi_\theta(a \vert s)\),观测 \((s, a, s', r)\)。
  2. 使用目标 \(r + \gamma \hat{V}^\pi_\phi(s')\) 更新 \(\hat{V}^\pi_\phi\)。
  3. 计算 \(\hat{A}^\pi(s, a) = r(s, a) + \gamma \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)\)。
  4. \(\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \vert s) \hat{A}^\pi(s, a)\)。
  5. \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\)。

在线版本仅需一次转移——无需收集完整轨迹。实践中,配合并行 worker 的批量数据效果最好:多个智能体同时收集转移,在每次更新前汇总梯度。这就是 A3C(Mnih et al., 2016)背后的思想——运行异步并行的 actor,每个都向共享参数服务器贡献梯度。

图 2:Actor-Critic 循环。在批量模式下,策略收集多次转移后再更新;在在线模式下,每次转移触发即时更新。切换两种模式查看差异。

Actor 和 Critic 可以使用两个独立网络(简单、稳定,但无共享特征)或带两个输出头的单一网络(共享前层,\(\pi_\theta\) 和 \(\hat{V}^\pi_\phi\) 分别输出)。共享设计更节省参数且能学习公共的状态表示,但耦合了两个学习问题。在大语言模型时代,这个问题有了新的形态:预训练语言模型能否同时充当 actor 和 critic?LM-as-critic 文章探讨了在预训练骨干网络上附加价值头的意外困难。

Advantage Estimation

The actor-critic gradient uses the TD advantage \(r + \gamma \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)\) — lower variance but biased (since the critic is imperfect). The Monte Carlo policy gradient uses \(\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} - b\) — unbiased but higher variance. One middle ground is to use the Monte Carlo return but subtract the critic as a state-dependent baseline:

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \left(\underbrace{\sum_{t'=t}^{T} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})}_{\substack{\mathrm{Monte\ Carlo\ return} \\ \mathrm{(unbiased,\ high\ variance)}}} - \underbrace{\hat{V}^\pi_\phi(s_{i,t})}_{\substack{\mathrm{critic\ as\ baseline} \\ \mathrm{(state\ dependent)}}}\right)\]

The reward-to-go sum is the same Monte Carlo return from the vanilla policy gradient — no bootstrapping, so no bias. But instead of subtracting a constant baseline \(b\), we subtract the critic’s value estimate \(\hat{V}^\pi_\phi(s_{i,t})\), which depends on the state. Since any state-dependent baseline preserves unbiasedness (shown above), this remains unbiased. And because \(\hat{V}^\pi_\phi(s)\) is close to the expected return from each state, it reduces variance far more effectively than a constant \(b\).

Figure 3: Constant vs state-dependent baselines. Three trajectories from different states have different MC returns. A constant baseline b subtracts the same value from all; a state-dependent baseline V(s) subtracts each state's expected return, centering advantages more tightly. The Compare view shows the variance reduction.

More generally, the one-step TD advantage and the full Monte Carlo return are two extremes. We can interpolate with \(n\)-step returns: use \(n\) steps of actual rewards, then bootstrap:

\[\hat{A}_n^\pi(s_t, a_t) = \underbrace{\sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(s_{t'}, a_{t'})}_{\mathrm{actual\ rewards\ (}n\mathrm{\ steps)}} \underbrace{- \; \hat{V}^\pi_\phi(s_t)}_{\mathrm{baseline}} + \underbrace{\gamma^n \hat{V}^\pi_\phi(s_{t+n})}_{\mathrm{bootstrap\ remainder}}\]

The first term sums \(n\) steps of actual observed rewards, discounted back to time \(t\). The second term subtracts the critic’s estimate at the current state (the baseline). The third term “fills in” the remaining future by bootstrapping from the critic at step \(t+n\) — discounted by \(\gamma^n\) because that state is \(n\) steps away. When \(n = 1\), we get the TD advantage \(r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)\): mostly bootstrapped, low variance but biased. When \(n = T - t\), the bootstrap vanishes and we recover the full Monte Carlo return minus a baseline: unbiased but high variance. Choosing \(n > 1\) often works better than either extreme.
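A small illustrative helper for the \(n\)-step advantage (hypothetical; it assumes `values` holds the critic's estimates for \(s_0, \ldots, s_T\), with the terminal value set to zero):

```python
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """A_n(s_t, a_t): n actual rewards, then bootstrap from the critic.

    rewards[t'] is r(s_t', a_t'); values[t'] is V_phi(s_t'), with values[T] = 0
    at a terminal state so the bootstrap vanishes when n reaches the episode end.
    """
    T = len(rewards)
    n = min(n, T - t)                                        # do not step past the episode end
    g = sum(gamma ** k * rewards[t + k] for k in range(n))   # n steps of observed rewards
    return g + gamma ** n * values[t + n] - values[t]        # bootstrap remainder minus baseline
```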

Generalized advantage estimation (Schulman, Moritz, Levine, Jordan, Abbeel, 2016) takes this further: instead of choosing a single \(n\), take a weighted combination of all \(n\)-step advantages with exponentially decaying weights \(w_n \propto \lambda^{n-1}\):

\[\hat{A}^{\mathrm{GAE}}(s_t, a_t) = \sum_{t'=t}^{\infty} (\gamma \lambda)^{t'-t} \delta_{t'}\]

where \(\delta_{t'} = r(s_{t'}, a_{t'}) + \gamma \hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})\) is the one-step TD residual. The parameter \(\lambda\) controls how far into the future we look before trusting the critic. When \(\lambda = 0\), only the \(t' = t\) term survives:

\[\hat{A}^{\mathrm{GAE}(\gamma,\,0)}(s_t, a_t) = \delta_t = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)\]

This is the one-step TD advantage — low variance but biased (relies entirely on the critic). When \(\lambda = 1\), the geometric decay disappears and all TD residuals are summed with only \(\gamma\) discounting:

\[\hat{A}^{\mathrm{GAE}(\gamma,\,1)}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t} \delta_{t'} = \sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)\]

To see why the second equality holds, expand \(\delta_{t'} = r_{t'} + \gamma \hat{V}(s_{t'+1}) - \hat{V}(s_{t'})\) and split the sum:

\[\sum_{t'=t}^{\infty} \gamma^{t'-t} \delta_{t'} = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} + \sum_{t'=t}^{\infty} \bigl[\gamma^{t'-t+1} \hat{V}(s_{t'+1}) - \gamma^{t'-t} \hat{V}(s_{t'})\bigr]\]

The second sum telescopes: consecutive terms cancel, leaving \(\lim_{N\to\infty} \gamma^{N+1}\hat{V}(s_{t+N+1}) - \hat{V}(s_t)\). The limit vanishes (either \(\gamma < 1\) or the episode terminates with \(\hat{V} = 0\)), so we are left with \(\sum \gamma^{t'-t} r_{t'} - \hat{V}(s_t)\).

This recovers the Monte Carlo return minus a state-dependent baseline — unbiased but high variance. In practice, \(\lambda \approx 0.95{-}0.97\) works well.
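GAE is usually computed with the backward recursion \(\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}\); a compact sketch (hypothetical helper, using the same `values` convention as above):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one episode.

    rewards has length T; values has length T + 1 (V_phi(s_0..s_T), with V_phi(s_T) = 0
    if the episode terminates), so every TD residual delta_t is well defined.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD residual
        running = delta + gamma * lam * running                  # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv

# lam = 0 reproduces the one-step TD advantage; lam = 1 the MC return minus V_phi(s_t)
```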

Figure 4: n-step returns and GAE. Drag the n slider to see how many actual rewards (orange) are used before bootstrapping from the critic (green). Toggle to GAE to see how λ blends all n-step estimates with exponentially decaying weights.

The on-policy requirement — needing fresh data after every update — is a major limitation addressed by PPO above. For a deeper treatment of importance sampling in RL, see the importance sampling post.

Actor-Critic 梯度使用 TD 优势 \(r + \gamma \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)\)——方差较低但有偏(因为 critic 并不完美)。Monte Carlo 策略梯度使用 \(\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} - b\)——无偏但方差较高。一个折中方案是使用 Monte Carlo 回报,但减去 critic 作为状态相关基线:

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \left(\underbrace{\sum_{t'=t}^{T} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})}_{\substack{\mathrm{Monte\ Carlo\ return} \\ \mathrm{(unbiased,\ high\ variance)}}} - \underbrace{\hat{V}^\pi_\phi(s_{i,t})}_{\substack{\mathrm{critic\ as\ baseline} \\ \mathrm{(state\ dependent)}}}\right)\]

reward-to-go 之和与 vanilla 策略梯度中的 Monte Carlo 回报相同——没有自举,因此没有偏差。但不是减去常数基线 \(b\),而是减去 critic 的价值估计 \(\hat{V}^\pi_\phi(s_{i,t})\),这取决于状态。由于任何状态相关基线都能保持无偏性(如上所示),因此仍然是无偏的。而且由于 \(\hat{V}^\pi_\phi(s)\) 接近每个状态的期望回报,它比常数 \(b\) 更有效地降低方差。

图 3:常数基线与状态相关基线的对比。来自不同状态的三条轨迹具有不同的 MC 回报。常数基线 b 从所有轨迹中减去相同的值;状态相关基线 V(s) 减去每个状态的期望回报,使优势更紧密地居中。Compare 视图展示了方差缩减效果。

更一般地,单步 TD 优势和完整 Monte Carlo 回报是两个极端。我们可以用 \(n\) 步回报进行插值:使用 \(n\) 步的实际奖励,然后自举:

\[\hat{A}_n^\pi(s_t, a_t) = \underbrace{\sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(s_{t'}, a_{t'})}_{\mathrm{actual\ rewards\ (}n\mathrm{\ steps)}} \underbrace{- \; \hat{V}^\pi_\phi(s_t)}_{\mathrm{baseline}} + \underbrace{\gamma^n \hat{V}^\pi_\phi(s_{t+n})}_{\mathrm{bootstrap\ remainder}}\]

第一项是 \(n\) 步实际观测奖励的折扣求和。第二项减去 critic 在当前状态的估计(基线)。第三项通过从 \(t+n\) 步的 critic 自举来“补全”剩余的未来——乘以折扣 \(\gamma^n\),因为该状态在 \(n\) 步之后。当 \(n = 1\) 时,我们得到 TD 优势 \(r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)\):大部分依赖自举,方差低但有偏。当 \(n = T - t\) 时,自举项消失,恢复完整的 Monte Carlo 回报减基线:无偏但方差高。选择 \(n > 1\) 通常比任一极端都效果更好。

广义优势估计Schulman, Moritz, Levine, Jordan, Abbeel, 2016)更进一步:不选择单一的 \(n\),而是取所有 \(n\) 步优势的加权组合,权重按指数衰减 \(w_n \propto \lambda^{n-1}\):

\[\hat{A}^{\mathrm{GAE}}(s_t, a_t) = \sum_{t'=t}^{\infty} (\gamma \lambda)^{t'-t} \delta_{t'}\]

其中 \(\delta_{t'} = r(s_{t'}, a_{t'}) + \gamma \hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})\) 是单步 TD 残差。参数 \(\lambda\) 控制我们在信任 critic 之前看多远的未来。当 \(\lambda = 0\) 时,只有 \(t' = t\) 项存活:

\[\hat{A}^{\mathrm{GAE}(\gamma,\,0)}(s_t, a_t) = \delta_t = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)\]

即单步 TD 优势——方差低但有偏(完全依赖 critic)。当 \(\lambda = 1\) 时,几何衰减消失,所有 TD 残差仅以 \(\gamma\) 折扣求和:

\[\hat{A}^{\mathrm{GAE}(\gamma,\,1)}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t} \delta_{t'} = \sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)\]

第二个等号可通过展开 \(\delta_{t'} = r_{t'} + \gamma \hat{V}(s_{t'+1}) - \hat{V}(s_{t'})\) 并拆分求和来验证:

\[\sum_{t'=t}^{\infty} \gamma^{t'-t} \delta_{t'} = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} + \sum_{t'=t}^{\infty} \bigl[\gamma^{t'-t+1} \hat{V}(s_{t'+1}) - \gamma^{t'-t} \hat{V}(s_{t'})\bigr]\]

第二个求和是 telescoping sum:相邻项相消,剩余 \(\lim_{N\to\infty} \gamma^{N+1}\hat{V}(s_{t+N+1}) - \hat{V}(s_t)\)。极限项为零(\(\gamma < 1\) 或 episode 终止时 \(\hat{V} = 0\)),因此结果为 \(\sum \gamma^{t'-t} r_{t'} - \hat{V}(s_t)\)。

这恢复了 Monte Carlo 回报减去状态相关基线的形式——无偏但方差高。实践中,\(\lambda \approx 0.95{-}0.97\) 效果良好。

图 4:n 步回报与 GAE。拖动 n 滑块查看在从 critic(绿色)自举之前使用多少实际奖励(橙色)。切换到 GAE 查看 λ 如何以指数衰减权重混合所有 n 步估计。

On-policy 的要求——每次更新后都需要新数据——是一个重大限制,由上面的 PPO 来解决。关于重要性采样在 RL 中的深入讨论,请参阅重要性采样专题文章。