The Policy Gradient Family: PG, PPO, and AC
The Policy Gradient
Trajectories and the Objective
In reinforcement learning, an agent interacts with an environment by choosing actions according to a policy \(\pi_\theta(a \vert s)\) — a distribution over actions given a state, parameterized by \(\theta\). Each interaction produces a trajectory:
\[\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1})\]The probability of a trajectory under policy \(\pi_\theta\) is:
\[P^{\pi_\theta}(\tau) = d_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \vert s_t) \, P(s_{t+1} \vert s_t, a_t)\]where \(d_0\) is the initial state distribution and \(P(s_{t+1} \vert s_t, a_t)\) is the transition probability. This is an alternating product of policy terms (learnable) and environment terms (fixed).
The objective is \(J(\pi_\theta) = \sum_\tau R(\tau) P^{\pi_\theta}(\tau)\) — a sum over all possible trajectories in the MDP. Each trajectory \(\tau\) has a fixed return \(R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t\); what the policy controls is the probability \(P^{\pi_\theta}(\tau)\) assigned to each one. A better policy concentrates probability on high-return trajectories.
Deriving REINFORCE
To differentiate \(J\) with respect to \(\theta\):
\[\nabla_\theta J = \sum_\tau R(\tau) \nabla_\theta P^{\pi_\theta}(\tau)\]This is already mathematically correct, but it sums over all possible trajectories — an astronomically large space that cannot be enumerated. We need a form that can be estimated by sampling a handful of trajectories from \(\pi_\theta\).
One might ask: why not just sample a trajectory \(\tau \sim \pi_\theta\), observe \(R(\tau)\), and let autograd backpropagate through \(\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\) directly? The problem is that the computation graph passes through a discrete sampling step — each action \(a_t\) is drawn from a categorical distribution \(\pi_\theta(\cdot \vert s_t)\). The sampled action is a discrete index, not a smooth function of \(\theta\), so the gradient cannot flow through it. Autograd only sees that the reward is some scalar, but has no way to capture the fact that changing \(\theta\) changes which trajectories get sampled. Standard backpropagation handles \(\nabla_\theta f_\theta(x)\) for fixed inputs \(x\), but here \(\theta\) affects both the function and the distribution over inputs. The log-derivative trick is precisely the tool that recovers this missing “distributional” part of the gradient.
Converting a sum \(\sum_\tau f(\tau)\) into an expectation \(\mathbb{E}_{\tau \sim P^{\pi_\theta}}[f(\tau) / P^{\pi_\theta}(\tau)]\) requires that \(P^{\pi_\theta}(\tau)\) is a valid probability distribution — non-negative and summing to 1. It is: \(P^{\pi_\theta}(\tau)\) is a product of the initial state distribution, per-step policy probabilities, and transition probabilities, all of which are valid distributions, so \(\sum_\tau P^{\pi_\theta}(\tau) = 1\) by construction. The log-derivative trick \(\nabla P = P \nabla \log P\) achieves exactly this conversion by factoring out \(P^{\pi_\theta}(\tau)\) as the sampling weight:
\[\nabla_\theta J = \sum_\tau R(\tau) P^{\pi_\theta}(\tau) \nabla_\theta \log P^{\pi_\theta}(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \nabla_\theta \log P^{\pi_\theta}(\tau)\right]\]There is one more step: the gradient still involves \(\nabla_\theta \log P^{\pi_\theta}(\tau)\) — a trajectory-level quantity. To get something we can compute per action, we exploit the fact that the log turns the product of per-step probabilities into a sum:
\[\log P^{\pi_\theta}(\tau) = \log d_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \vert s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} \vert s_t, a_t)\]The initial state distribution \(d_0\) and transition dynamics \(P\) do not depend on \(\theta\), so their gradients vanish. Only the policy terms survive:
\[\nabla_\theta J = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right]\]This is the trajectory-level REINFORCE estimator. Because the gradient is now an expectation under \(\pi_\theta\), we can estimate it by sampling \(N\) trajectories and averaging:
\[\hat{g} = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \vert s_t^{(i)}), \quad \tau^{(i)} \sim \pi_\theta\]Each action contributes its \(\nabla_\theta \log \pi_\theta\) — a quantity neural network frameworks compute naturally via backpropagation — weighted by the trajectory return. This is what makes the log-derivative form practical: it decomposes into per-action log-probabilities that fit directly into standard gradient-based training.
By decomposing the rewards over time steps, it can be rewritten in a per-step form using the discounted state occupancy \(d^{\pi_\theta}\):
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\!\left[Q^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]where \(Q^{\pi_\theta}(s, a)\) is the action-value function.
Variance Reduction: Baselines and the Advantage
A useful property of \(\nabla_\theta \log \pi_\theta\) is that its expectation under the policy is zero: \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \vert s)] = \sum_a \nabla_\theta \pi_\theta(a \vert s) = \nabla_\theta 1 = 0\). This means we can subtract any state-dependent baseline \(b(s)\) from \(Q^{\pi_\theta}\) without introducing bias. The natural choice is the value function \(V^{\pi_\theta}(s)\), giving us the advantage function \(A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)\):
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\!\left[A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]The advantage centers the reward signal around zero, substantially reducing variance. This expectation requires sampling from \(\pi_\theta\) itself, meaning we need fresh data after every parameter update.
Implementation Notes: The Surrogate Loss
In practice, deep learning frameworks minimize a loss, so we need to translate the policy gradient into something a framework can compute. The standard approach is to define a surrogate loss
\[L_{\mathrm{sur}}(\theta) = -\sum_t A_t \log \pi_\theta(a_t \vert s_t),\]where \(A_t\) is treated as a stop-gradient constant. Its gradient
\[\nabla_\theta L_{\mathrm{sur}} = -\sum_t A_t \nabla_\theta \log \pi_\theta(a_t \vert s_t)\]is exactly the negated policy gradient, so minimizing \(L_{\mathrm{sur}}\) performs a policy gradient ascent step. Note that if we were to backpropagate through \(A_t\) (e.g., when \(A_t\) depends on a learned critic), an extra term \((\nabla_\theta A_t) \log \pi_\theta\) would appear, breaking the correspondence — this is why the advantage must always be detached.
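As a concrete illustration, here is a minimal PyTorch-style sketch of the surrogate loss (the tensor names and shapes are illustrative, not from any particular codebase):

```python
import torch
import torch.nn.functional as F

def reinforce_surrogate_loss(logits, actions, advantages):
    """Surrogate loss whose gradient is the negated policy gradient.

    logits:     (T, num_actions) policy logits at the visited states
    actions:    (T,) integer actions actually taken
    advantages: (T,) advantage estimates for those actions
    """
    log_probs = F.log_softmax(logits, dim=-1)                     # log pi_theta(. | s_t)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t | s_t)
    # Detaching the advantage is essential: if gradients flowed through A_t
    # (e.g. via a learned critic), an extra (grad A_t) * log pi term would appear.
    return -(advantages.detach() * taken).mean()
```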
A common source of confusion is that \(L_{\mathrm{sur}}\) looks like weighted negative log-likelihood, making REINFORCE appear identical to “weighted SFT.” In the special case of binary rewards where \(A_t = 1\) for successful trajectories and \(A_t = 0\) otherwise, the surrogate loss does reduce to NLL on successful trajectories — i.e., online filtered behavior cloning. But in general, the surrogate loss and the true objective
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\]are not the same function: they are merely gradient-equivalent. In supervised learning, \(-\log \pi_\theta(y \vert x)\) is the objective; in policy gradient, \(-A_t \log \pi_\theta(a_t \vert s_t)\) is a tool constructed to reproduce the correct gradient.
This same idea extends beyond vanilla REINFORCE. PPO’s clipped surrogate
\[L^{\mathrm{PPO}} = \mathbb{E}\!\left[\min\!\Big(r_t A_t, \; \mathrm{clip}(r_t, 1{-}\epsilon, 1{+}\epsilon) A_t\Big)\right]\]does not explicitly contain \(\log \pi\), but the importance ratio \(r_t = \pi_\theta / \pi_{\theta_\mathrm{old}}\) is computed via log-probabilities in practice. The underlying pattern is the same: first derive what gradient direction the policy should follow, then construct a surrogate objective that produces it.
Proximal Policy Optimization (PPO)
From On-Policy to Off-Policy
The policy gradient derived above requires sampling from the current policy \(\pi_\theta\): after every parameter update, all previously collected data becomes stale. This is wasteful — we would like to take multiple gradient steps on the same batch of data.
The idea is to use importance sampling to correct for the distribution mismatch. If our data was collected under an old policy \(\pi_{\mathrm{old}}\), we can reweight each sample by the probability ratio between the new and old policies. For a detailed treatment of importance sampling and how it applies to RL, see the importance sampling post.
The IS Surrogate Objective
Starting from the policy gradient in advantage form, we can rewrite the expectation over \(\pi_\theta\) as an expectation over \(\pi_{\mathrm{old}}\) by introducing a single-step IS ratio (see derivation):
\[L^{\mathrm{IS}}(\theta) = \mathbb{E}_{s, a \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta(a \vert s)}{\pi_{\mathrm{old}}(a \vert s)} \, A^{\pi_{\mathrm{old}}}(s, a)\right]\]At \(\theta = \theta_{\mathrm{old}}\), the ratio equals 1 and the gradient of \(L^{\mathrm{IS}}\) reduces to the standard policy gradient. This means we can take gradient steps on \(L^{\mathrm{IS}}\) using data collected once from \(\pi_{\mathrm{old}}\), without recollecting trajectories after each step.
However, this surrogate only corrects the action distribution mismatch — the state distribution is still drawn from \(d^{\pi_{\mathrm{old}}}\), not \(d^{\pi_\theta}\). As \(\theta\) drifts from \(\theta_{\mathrm{old}}\), the two state distributions diverge and the surrogate can overestimate improvement, causing the policy to overshoot and degrade. See the hidden approximation discussion for details.
Clipping the Ratio
Proximal Policy Optimization (PPO) addresses this by clipping the IS ratio to prevent large updates. Define
\[r_t(\theta) = \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\mathrm{old}}(a_t \vert s_t)}.\]PPO’s clipped surrogate objective is:
\[L^{\mathrm{CLIP}}(\theta) = \mathbb{E}\!\left[\min\!\Big(r_t(\theta)\, \hat{A}_t, \;\operatorname{clip}\!\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t\Big)\right]\]where \(\epsilon\) is a small constant (typically 0.1–0.2). The \(\min\) takes the more pessimistic estimate:
- When \(\hat{A}_t > 0\) (good action): the ratio is capped at \(1 + \epsilon\), preventing the policy from moving too aggressively toward this action.
- When \(\hat{A}_t < 0\) (bad action): the ratio is floored at \(1 - \epsilon\), preventing the policy from moving too aggressively away from this action.
This trades a small amount of bias for much more stable training — rather than hoping the IS ratio stays well-behaved, we simply clip it by force.
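A minimal sketch of the clipped surrogate as a loss to minimize, assuming per-token log-probabilities under the current and rollout policies are already available (function and argument names are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negated PPO clipped surrogate.

    logp_new: log pi_theta(a_t | s_t), differentiable w.r.t. theta
    logp_old: log pi_old(a_t | s_t), recorded at rollout time (constant)
    """
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # pessimistic bound
```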
Dual clipping (Ye et al., 2020) adds a second clip to handle a failure mode of standard PPO on strongly off-policy samples. When \(\hat{A}_t < 0\) (bad action) and the ratio \(r_t\) is very large (\(r_t \gg 1 + \epsilon\)), the clipped branch is the constant \((1+\epsilon)\hat{A}_t\), but the unclipped branch \(r_t \hat{A}_t\) is far more negative, so the \(\min\) selects the unclipped branch. The objective is therefore unbounded below in \(r_t\): a single token with a huge ratio and a negative (possibly noisy) advantage estimate can dominate the batch and produce an arbitrarily large gradient. In other words, for actions the policy already wants to avoid, the standard clip offers no protection once the ratio is large, and the resulting updates can be destabilizing.
Dual clip fixes this by introducing a lower bound \(c \hat{A}_t\) (with \(c > 1\), typically \(c = 3\)) when \(\hat{A}_t < 0\):
\[L^{\mathrm{DualCLIP}}(\theta) = \max\!\Big(\min\!\big(r_t \hat{A}_t,\; \operatorname{clip}(r_t, 1\!-\!\epsilon, 1\!+\!\epsilon)\hat{A}_t\big),\; c\hat{A}_t\Big)\]The outer \(\max\) with \(c\hat{A}_t\) ensures that when the advantage is negative and the standard clipped objective falls below \(c\hat{A}_t\), the objective is floored at \(c\hat{A}_t\). This creates a flat region with zero gradient for very large ratios, bounding the objective from below and preventing a handful of large-ratio, negative-advantage tokens from dominating the update.
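A hedged sketch of the dual-clipped variant, extending the clipped loss above (only the negative-advantage branch changes):

```python
import torch

def dual_clip_loss(logp_new, logp_old, advantages, eps=0.2, c=3.0):
    """PPO surrogate with dual clipping, negated for minimization."""
    ratio = torch.exp(logp_new - logp_old)
    standard = torch.min(ratio * advantages,
                         torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    # For negative advantages, floor the objective at c * A_t so that huge ratios
    # cannot contribute unbounded gradient magnitudes.
    objective = torch.where(advantages < 0, torch.max(standard, c * advantages), standard)
    return -objective.mean()
```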
Why the Log-Form and Ratio-Form Losses Share the Same Gradient Direction
At first glance, the REINFORCE surrogate
\[L_{\mathrm{PG}}(\theta) = -A_t \log \pi_\theta(a_t \vert s_t)\]explicitly contains \(\log \pi_\theta\), whereas the PPO-style ratio objective
\[L_{\mathrm{ratio}}(\theta) = -A_t \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\mathrm{old}}(a_t \vert s_t)}\]does not. It may seem surprising that both lead to essentially the same update direction. The key is the identity \(\nabla_\theta \pi_\theta(a \vert s) = \pi_\theta(a \vert s) \, \nabla_\theta \log \pi_\theta(a \vert s)\), which explains why a loss written in terms of \(\pi_\theta\) can still produce a gradient in terms of \(\nabla_\theta \log \pi_\theta\).
REINFORCE form. The gradient of \(L_{\mathrm{PG}}\) is straightforward:
\[\nabla_\theta L_{\mathrm{PG}}(\theta) = -A_t \nabla_\theta \log \pi_\theta(a_t \vert s_t).\]The score function \(\nabla_\theta \log \pi_\theta\) appears explicitly.
Ratio form. Since \(\pi_{\mathrm{old}}(a_t \vert s_t)\) is constant with respect to \(\theta\):
\[\nabla_\theta L_{\mathrm{ratio}}(\theta) = -\frac{A_t}{\pi_{\mathrm{old}}(a_t \vert s_t)} \nabla_\theta \pi_\theta(a_t \vert s_t).\]Applying \(\nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta\):
\[\nabla_\theta L_{\mathrm{ratio}}(\theta) = -A_t \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\mathrm{old}}(a_t \vert s_t)} \nabla_\theta \log \pi_\theta(a_t \vert s_t) = -A_t \, r_t(\theta) \, \nabla_\theta \log \pi_\theta(a_t \vert s_t),\]where \(r_t(\theta) = \pi_\theta(a_t \vert s_t) / \pi_{\mathrm{old}}(a_t \vert s_t)\).
Core conclusion. Even though the ratio-form loss does not explicitly contain \(\log \pi_\theta\), its gradient still has the same core score-function direction \(\nabla_\theta \log \pi_\theta(a_t \vert s_t)\). The difference is that PPO introduces an additional multiplicative weight \(r_t(\theta)\), and in practice also clipping, to control how aggressively the policy moves relative to the old policy. So:
- REINFORCE directly optimizes a surrogate linear in \(\log \pi_\theta\);
- PPO optimizes a surrogate linear in the probability ratio \(r_t(\theta)\);
- but after differentiation, both are driven by the same score-function direction \(\nabla_\theta \log \pi_\theta\).
In one sentence: a loss does not need to explicitly contain \(\log \pi_\theta\) for its gradient to involve \(\nabla_\theta \log \pi_\theta\), because \(\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta\).
Why You Cannot Simply Multiply the PPO Objective by \(\log \pi_\theta\)
A natural but incorrect idea: since REINFORCE involves \(\log \pi_\theta(a_t \vert s_t)\), why not define a PPO-style objective as
\[\tilde{L}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[r_t(\theta) \, A_t \, \log \pi_\theta(a_t \vert s_t)\right]?\]This is not the correct importance-sampled policy gradient objective — it introduces an extra factor in the gradient and changes the optimization problem entirely.
Correct PPO surrogate. The standard surrogate \(L_{\mathrm{PPO}}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}[r_t(\theta) \, A_t]\) has gradient
\[\nabla_\theta L_{\mathrm{PPO}}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \, \nabla_\theta r_t(\theta)\right] = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \, r_t(\theta) \, \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right],\]which is exactly the desired importance-weighted score-function gradient.
Incorrect objective with extra \(\log \pi_\theta\). The gradient of \(\tilde{L}\) requires the product rule on \(r_t(\theta) \log \pi_\theta(a_t \vert s_t)\):
\[\nabla_\theta \tilde{L}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \Big((\nabla_\theta r_t) \log \pi_\theta + r_t \, \nabla_\theta \log \pi_\theta\Big)\right].\]Substituting \(\nabla_\theta r_t = r_t \nabla_\theta \log \pi_\theta\):
\[\nabla_\theta \tilde{L}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \, r_t(\theta) \, \bigl(1 + \log \pi_\theta(a_t \vert s_t)\bigr) \, \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right].\]Compared with the correct gradient \(A_t \, r_t \, \nabla_\theta \log \pi_\theta\), this has an extra multiplicative factor \(1 + \log \pi_\theta(a_t \vert s_t)\). Since \(\log \pi_\theta(a_t \vert s_t) \le 0\), this factor becomes negative whenever \(\pi_\theta(a_t \vert s_t) < e^{-1}\) — meaning the gradient can push the policy in the opposite direction of the intended update, even when \(A_t > 0\).
The importance ratio \(r_t(\theta)\) exists solely to correct for the sampling distribution mismatch. It should multiply the advantage — the quantity whose expectation we want to estimate — and nothing else. Inserting an extra \(\log \pi_\theta\) changes the objective itself and breaks the correspondence with the policy gradient theorem.
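The extra factor is easy to verify with autograd. The snippet below is a toy, single-state categorical policy (all names and values are illustrative); it checks numerically that the gradient of \(r_t A_t \log \pi_\theta\) equals \((1 + \log \pi_\theta)\) times the gradient of the correct surrogate \(r_t A_t\):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, requires_grad=True)     # toy categorical policy
a = torch.tensor(2)                             # the sampled action
A = 1.5                                         # advantage, a plain constant

logp = torch.log_softmax(logits, dim=-1)[a]
ratio = torch.exp(logp - logp.detach())         # r_t at theta = theta_old (value 1)

# Correct surrogate r_t * A_t: gradient is A_t * r_t * grad log pi_theta.
g_correct = torch.autograd.grad(ratio * A, logits, retain_graph=True)[0]

# Tempting-but-wrong surrogate r_t * A_t * log pi_theta: the product rule adds a factor.
g_wrong = torch.autograd.grad(ratio * A * logp, logits)[0]

print(torch.allclose(g_wrong, (1 + logp.detach()) * g_correct, atol=1e-6))  # True
```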
Asynchronous PPO: Decoupling the Importance Ratio
All methods above assume synchronous training: the policy that generates rollouts (\(\pi_{\text{old}}\)) is the same policy we clip against. In practice, large-scale RL systems are asynchronous — rollout workers continuously generate data while the trainer updates model weights. By the time a batch reaches the trainer, the generating policy \(\pi_{\text{behav}}\) may be several gradient steps behind the current policy \(\pi_\theta\).
This creates a problem: in standard PPO/GRPO, the importance ratio \(\pi_\theta / \pi_{\text{old}}\) serves two roles simultaneously:
- Off-policy correction: reweight samples to account for the distributional mismatch
- Trust region: clip the ratio to prevent the policy from changing too much
When \(\pi_{\text{old}} = \pi_{\text{behav}}\) is stale, these two roles conflict.
Vanilla Async PPO
The naive approach simply substitutes \(\pi_{\text{behav}}\) for \(\pi_{\text{old}}\). Writing \(\rho_t = \pi_\theta / \pi_{\text{behav}}\):
\[\boxed{\mathcal{J}_{\text{naive}}(\theta) = \mathbb{E}_{a_t \sim \pi_{\text{behav}}}\!\left[\min\!\left(\rho_t \, \hat{A}_t,\; \text{clip}(\rho_t,\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon) \, \hat{A}_t\right)\right]}\]Let’s derive the gradient. Outside the clip range \([1\!-\!\varepsilon, 1\!+\!\varepsilon]\), the clipped branch is a constant \((1\!\pm\!\varepsilon)\hat{A}_t\), so the min picks whichever branch is smaller. When \(\rho_t > 1+\varepsilon\) and \(\hat{A}_t > 0\), the unclipped value \(\rho_t \hat{A}_t\) exceeds the clipped value \((1\!+\!\varepsilon)\hat{A}_t\), so the min selects the clipped (flat) branch — gradient zero. Symmetrically for \(\rho_t < 1-\varepsilon\) and \(\hat{A}_t < 0\). In all other cases the unclipped branch wins and the gradient is \(\nabla_\theta \rho_t \cdot \hat{A}_t\). Collecting these into an indicator:
\[\mathbb{1}_{\text{active}}(\rho_t, \hat{A}_t) = 1 - \mathbb{1}[\rho_t > 1\!+\!\varepsilon]\,\mathbb{1}[\hat{A}_t > 0] - \mathbb{1}[\rho_t < 1\!-\!\varepsilon]\,\mathbb{1}[\hat{A}_t < 0]\]The per-token gradient is \(\nabla_\theta f = \frac{\nabla_\theta \pi_\theta}{\pi_{\text{behav}}} \hat{A}_t \cdot \mathbb{1}_{\text{active}}\). Substituting \(\frac{\nabla_\theta \pi_\theta}{\pi_{\text{behav}}} = \rho_t \nabla_\theta \log \pi_\theta\) and taking the expectation over \(a_t \sim \pi_{\text{behav}}\):
\[\nabla_\theta \mathcal{J} = \mathbb{E}_{a_t \sim \pi_{\text{behav}}}\!\left[\rho_t \, \nabla_\theta \log \pi_\theta \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}\right]\]By the importance sampling identity \(\mathbb{E}_{\pi_{\text{behav}}}[\rho_t \, g(a)] = \mathbb{E}_{\pi_\theta}[g(a)]\):
\[\boxed{\nabla_\theta \mathcal{J} = \mathbb{E}_{a_t \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}(\rho_t, \hat{A}_t)\right]}\]This is the standard policy gradient \(\nabla_\theta \log \pi_\theta \, \hat{A}_t\), masked by \(\mathbb{1}_{\text{active}}\): the clip mechanism simply silences tokens where \(\rho_t = \pi_\theta / \pi_{\text{behav}}\) has already overshot in the direction favored by the advantage. Note what determines whether a token is silenced: the ratio \(\pi_\theta / \pi_{\text{behav}}\). The trust region is centered at \(\pi_{\text{behav}}\) — gradients vanish once \(\pi_\theta\) moves more than \(\varepsilon\) away from \(\pi_{\text{behav}}\) (in the advantage-favored direction). In synchronous training \(\pi_{\text{behav}}\) is the current policy, so this is exactly right. But with stale data the picture changes completely.
When \(\pi_{\text{behav}}\) is stale (several gradient steps behind \(\pi_\theta\)), \(\rho_t\) is already far from 1 before any optimization begins. Each gradient step shifts \(\pi_\theta\)’s probability mass — tokens the policy has learned to favor get \(\rho \gg 1\), tokens it has learned to suppress get \(\rho \ll 1\). After \(\eta\) gradient steps, many tokens fall outside \([1-\varepsilon, 1+\varepsilon]\). The clipping then forces \(\pi_\theta\) to stay close to an old, low-quality policy rather than constraining the size of the current update. The trust region is centered in the wrong place.
The figures below show why. The clipped objective is flat outside \([1-\varepsilon, 1+\varepsilon]\). In synchronous training, \(\rho\) starts at 1 (green dot), safely inside the active region. In async training, \(\rho_0\) can land far outside this region (orange cross) — use the slider to explore both \(\rho_0 > 1\) and \(\rho_0 < 1\).
Worse than just zero gradient: the clipping creates an asymmetric force that actively pulls \(\pi_\theta\) back toward \(\pi_{\text{behav}}\). Consider \(\rho_0 > 1 + \varepsilon\): positive-advantage tokens contribute zero gradient (flat region), but negative-advantage tokens are in the unclipped branch and push \(\rho\) down. The net gradient only points toward the stale policy. At \(\rho_0 < 1 - \varepsilon\) the asymmetry flips — negative-advantage tokens are flat, positive-advantage tokens push \(\rho\) up — but the net force still points toward \(\pi_{\text{behav}}\).
Decoupled Clipped Objective
The \(\pi_{\text{old}}\) in PPO’s objective quietly serves two independent purposes (Hilton, Cobbe & Schulman, 2022). This is easiest to see in the KL-penalized form:
\[\mathcal{J}^{\text{KLPEN}}(\theta) = \mathbb{E}\!\left[\frac{\pi_\theta}{\underbrace{\pi_{\text{old}}}_{\text{(i)}}} \hat{A}_t - \beta \, \text{KL}\!\left[\underbrace{\pi_{\text{old}}}_{\text{(ii)}} \,\|\, \pi_\theta\right]\right]\]Use (i) is importance sampling — it corrects for the fact that actions were drawn from \(\pi_{\text{old}}\), so it must be the behavior policy \(\pi_{\text{behav}}\). Use (ii) is the trust-region anchor — it penalizes \(\pi_\theta\) for moving too far, but this anchor only needs to be some recent policy; call it \(\pi_{\text{prox}}\).
For the clipped objective, \(\pi_{\text{old}}\) appears only once in the ratio \(r_t = \pi_\theta / \pi_{\text{old}}\), hiding the two roles. Rewriting by multiplying numerator and denominator into the min exposes them:
\[\mathcal{J}^{\text{clip}}(\theta) = \mathbb{E}\!\left[\frac{1}{\underbrace{\pi_{\text{old}}}_{\text{(i)}}} \min\!\left(\pi_\theta \,\hat{A}_t,\;\; \text{clip}\!\left(\pi_\theta,\; (1\!-\!\varepsilon)\underbrace{\pi_{\text{old}}}_{\text{(ii)}},\; (1\!+\!\varepsilon)\underbrace{\pi_{\text{old}}}_{\text{(ii)}}\right) \hat{A}_t\right)\right]\]Now the two uses are manifest: the \(1/\pi_{\text{old}}\) prefactor is the importance-sampling denominator (i), while the clip bounds \((1\pm\varepsilon)\pi_{\text{old}}\) define the trust region (ii). Replacing (i) with \(\pi_{\text{behav}}\) and (ii) with \(\pi_{\text{prox}}\):
\[\mathcal{J}_{\text{decoupled}}^{\text{clip}}(\theta) = \mathbb{E}\!\left[\frac{1}{\pi_{\text{behav}}} \min\!\left(\pi_\theta \,\hat{A}_t,\;\; \text{clip}\!\left(\pi_\theta,\; (1\!-\!\varepsilon)\pi_{\text{prox}},\; (1\!+\!\varepsilon)\pi_{\text{prox}}\right) \hat{A}_t\right)\right]\]Dividing through by \(\pi_{\text{prox}}\) inside the min recovers the ratio form used by AReaL (IIIS Tsinghua, 2025):
\[\mathcal{J}_{\text{decoupled}}^{\text{clip}}(\theta) = \mathbb{E}\!\left[\sum_{t=1}^{H} \min\!\left(\underbrace{\frac{\pi_\theta}{\pi_{\text{behav}}}}_{\text{importance ratio}} \hat{A}_t,\;\; \overbrace{\frac{\pi_{\text{prox}}}{\pi_{\text{behav}}} \, \text{clip}\!\left(\underbrace{\frac{\pi_\theta}{\pi_{\text{prox}}}}_{\text{trust region}},\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon\right)}^{\text{importance ratio}} \hat{A}_t\right)\right]\]Writing \(r_t = \pi_\theta / \pi_{\text{prox}}\), \(w_t = \pi_{\text{prox}} / \pi_{\text{behav}}\). Since \(w_t > 0\), it factors out of the min:
\[\boxed{\mathcal{J}_{\text{decoupled}}^{\text{clip}}(\theta) = \mathbb{E}_{a_t \sim \pi_{\text{behav}}}\!\left[\frac{\pi_{\text{prox}}}{\pi_{\text{behav}}} \min\!\left(r_t \, \hat{A}_t,\;\; \text{clip}(r_t,\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon) \, \hat{A}_t\right)\right]}\]Now derive its gradient step by step.
Step 1. \(w_t = \pi_{\text{prox}} / \pi_{\text{behav}}\) is constant w.r.t. \(\theta\), so it passes through the gradient:
\[\nabla_\theta \mathcal{J} = \mathbb{E}_{a_t \sim \pi_{\text{behav}}}\!\left[w_t \cdot \nabla_\theta \min\!\left(r_t \, \hat{A}_t,\; \text{clip}(r_t) \, \hat{A}_t\right)\right]\]Step 2. The \(\min(r_t \hat{A}_t,\, \text{clip}(r_t)\hat{A}_t)\) has exactly the same form as the naive PPO objective, but with \(r_t = \pi_\theta / \pi_{\text{prox}}\) in place of \(\rho_t = \pi_\theta / \pi_{\text{behav}}\). We already derived this gradient — \(\nabla_\theta r_t \cdot \hat{A}_t\) when active, zero when clipped:
\[\nabla_\theta \min\!\left(r_t \, \hat{A}_t,\; \text{clip}(r_t) \, \hat{A}_t\right) = \frac{\nabla_\theta \pi_\theta}{\pi_{\text{prox}}} \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}(r_t, \hat{A}_t)\]where \(\mathbb{1}_{\text{active}}\) is the same mask as before, now evaluated at \(r_t\):
\[\mathbb{1}_{\text{active}}(r_t, \hat{A}_t) = 1 - \mathbb{1}[r_t > 1\!+\!\varepsilon]\,\mathbb{1}[\hat{A}_t > 0] - \mathbb{1}[r_t < 1\!-\!\varepsilon]\,\mathbb{1}[\hat{A}_t < 0]\]Step 3. Multiply by \(w_t\). The \(\pi_{\text{prox}}\) cancels:
\[w_t \cdot \frac{\nabla_\theta \pi_\theta}{\pi_{\text{prox}}} = \frac{\pi_{\text{prox}}}{\pi_{\text{behav}}} \cdot \frac{\nabla_\theta \pi_\theta}{\pi_{\text{prox}}} = \frac{\nabla_\theta \pi_\theta}{\pi_{\text{behav}}} = \frac{\pi_\theta}{\pi_{\text{behav}}} \nabla_\theta \log \pi_\theta = \rho_t \, \nabla_\theta \log \pi_\theta\]Step 4. Taking the expectation over \(a_t \sim \pi_{\text{behav}}\) and applying the importance sampling identity \(\mathbb{E}_{\pi_{\text{behav}}}[\rho_t \, g(a)] = \mathbb{E}_{\pi_\theta}[g(a)]\):
\[\boxed{\nabla_\theta \mathcal{J}_{\text{decoupled}} = \mathbb{E}_{a_t \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}\!\left(\frac{\pi_\theta}{\pi_{\text{prox}}},\, \hat{A}_t\right)\right]}\]Comparing the two gradients side by side:
| | Naive PPO | Decoupled |
|---|---|---|
| Gradient | \(\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}(\rho_t, \hat{A}_t)\right]\) | \(\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta \, \hat{A}_t \cdot \mathbb{1}_{\text{active}}^{\text{dec}}(u_t, \hat{A}_t)\right]\) |
| Mask argument | \(\rho_t = \pi_\theta / \pi_{\text{behav}}\) | \(u_t = \pi_\theta / \pi_{\text{prox}}\) |
| Trust region center | \(\pi_{\text{behav}}\) | \(\pi_{\text{prox}}\) |
Both are the same policy gradient with the same functional form of binary mask. The only difference is which policy the mask measures distance from. Naive PPO silences tokens when \(\pi_\theta\) is far from \(\pi_{\text{behav}}\); decoupled PPO silences tokens when \(\pi_\theta\) is far from \(\pi_{\text{prox}}\).
This makes the failure mode transparent. In asynchronous training, \(\pi_{\text{behav}}\) lags \(\pi_\theta\) by \(\eta\) gradient steps. Naive PPO’s mask kills tokens based on drift that already happened before the current step — the trust region is centered at the wrong place. Decoupled PPO centers at \(\pi_{\text{prox}}\), which is always the most recent checkpoint, so all tokens start with \(u_t \approx 1\) and the mask only activates if the current update overshoots. Even with a single minibatch (\(\pi_\theta = \pi_{\text{prox}}\) before the update, \(u_t = 1\) exactly), naive PPO already silences many tokens while decoupled PPO silences none.
Reduction to PPO-clip. When \(\pi_{\text{prox}} = \pi_{\text{behav}}\), we have \(u_t = \rho_t\) and the two masks coincide. This holds at the first gradient step of synchronous training.
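A minimal sketch of the decoupled objective, assuming per-token log-probs under the three policies are available (names are illustrative; passing the behavior log-probs as \(\pi_{\text{prox}}\) recovers naive async PPO):

```python
import torch

def decoupled_ppo_loss(logp_cur, logp_prox, logp_behav, advantages, eps=0.2):
    """Decoupled clipped objective, negated for minimization.

    logp_cur:   log pi_theta(a_t | s_t)  -- differentiable
    logp_prox:  log pi_prox(a_t | s_t)   -- recent checkpoint, constant
    logp_behav: log pi_behav(a_t | s_t)  -- rollout policy, constant
    """
    w = torch.exp(logp_prox - logp_behav)       # pi_prox / pi_behav, fixed IS weight
    r = torch.exp(logp_cur - logp_prox)         # trust region measured from pi_prox
    unclipped = r * advantages
    clipped = torch.clamp(r, 1.0 - eps, 1.0 + eps) * advantages
    return -(w * torch.min(unclipped, clipped)).mean()
```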
Why Decoupling Matters: Ablation Evidence
The AReaL paper ablates naive vs. decoupled PPO across staleness levels \(\eta\) (max age of rollout data in gradient steps). The 1.5B model results on AIME24 (pass@1, avg over 32):
| Staleness \(\eta\) | Naive PPO | Decoupled PPO | Sync oracle |
|---|---|---|---|
| 0 (sync) | 42.0 | — | 42.0 |
| 1 | 41.8 | 42.1 | |
| 2 | 40.0 | 41.8 | |
| 4 | 23.3 | 42.2 | |
| 8 | 35.7 | 41.0 | |
| 16 | 35.8 | 38.7 | |
The collapse at \(\eta = 4\) is dramatic: naive PPO drops from 42.0 to 23.3 (a 45% relative decline), while decoupled PPO actually matches the synchronous oracle at 42.2. The pattern is consistent across benchmarks — AMC23 drops from 84.4 to 58.5 under naive PPO at \(\eta = 4\), but stays at 85.1 with decoupling.
The practical payoff: asynchronous training with \(\eta \leq 4\) achieves up to 2.77x wall-clock speedup over synchronous baselines with no loss in final performance. Throughput nearly doubles just from \(\eta = 0 \to 1\) (27.1k to 47.8k tokens/s on 8 GPUs), because the trainer no longer waits for rollout workers.
KL for Reference Model
Approximating KL Divergence (Schulman, 2020)
We want to estimate the KL divergence from \(q\) to \(p\):
\[\mathrm{KL}[q, p] = \sum_x q(x) \log \frac{q(x)}{p(x)} = \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right]\]
Our options for computing KL depend on what kind of access we have to \(p\) and \(q\). Here we assume we can evaluate the probabilities (or probability densities) \(p(x)\) and \(q(x)\) for any given \(x\), but we cannot calculate the sum over \(x\) analytically. Why not?
- Computation/memory: the state space is too large to enumerate (e.g., all possible token sequences).
- No closed form: the distributions don’t belong to a family with a known KL formula.
- Code simplicity: we only store the log-prob \(\log \pi_\theta(a \vert s)\), not the full distribution. This is a reasonable design choice when KL is just used as a diagnostic, as is often the case in reinforcement learning (e.g., logging KL between the current policy and a reference policy during PPO training).
In all three cases, we turn to Monte Carlo estimation. Given samples \(x_1, x_2, \ldots \sim q\), how can we construct a good estimate?
A good estimator has two properties:
- Unbiased: its expected value equals the true KL, i.e. \(\mathbb{E}[\hat{k}] = \mathrm{KL}[q,p]\).
- Low variance: individual samples don’t fluctuate wildly around the mean.
We’ll define the probability ratio \(r = p(x)/q(x)\), so that \(\log r = \log p(x) - \log q(x)\). All three estimators below are functions of \(r\) (or equivalently, of \(\log r\)). This is convenient because in practice we often already have \(\log p(x)\) and \(\log q(x)\) computed — e.g., the log-probability of an action under two different policies.
The most straightforward unbiased estimator follows directly from the definition of KL:
\[k_1 = -\log r = \log \frac{q(x)}{p(x)}.\]Since \(\mathbb{E}_{x \sim q}[k_1] = \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right] = \mathrm{KL}[q,p]\), this is exactly unbiased.
However, it has high variance. To see why, note that KL divergence is always non-negative (\(\mathrm{KL}[q,p] \geq 0\)), yet \(k_1\) takes negative values whenever \(r > 1\) (i.e., whenever \(p(x) > q(x)\)). For similar distributions, this happens for roughly half the samples. An estimator that’s negative half the time for a quantity that’s always positive is clearly noisy — we’re relying on cancellation between positive and negative samples to get the right mean.
Why is KL always non-negative?
By Jensen's inequality applied to the convex function \(-\log\):
$$\mathrm{KL}[q,p] = \mathbb{E}_{x \sim q}\!\left[-\log \frac{p(x)}{q(x)}\right] \geq -\log \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\right] = -\log 1 = 0.$$
This is known as Gibbs' inequality. The same inequality \(\log x \leq x - 1\) that we will use below to construct \(k_3\) provides an alternative proof: \(\mathrm{KL}[q,p] = \mathbb{E}_q[-\log r] \geq \mathbb{E}_q[1 - r] = 1 - 1 = 0\).
The interactive figure below plots \(k_1\) alongside \(k_2\) and \(k_3\) (defined next) for comparison. Notice how \(k_1\) dips below zero for \(r > 1\) — this is where its high variance comes from.
An alternative with lower variance but slight bias:
\[k_2 = \frac{1}{2}(\log r)^2.\]Intuitively, \(k_2\) seems better because:
- It is always non-negative (it’s a square).
- Each sample directly measures how far apart \(p\) and \(q\) are at point \(x\), regardless of which direction the ratio goes.
Empirically, \(k_2\) indeed has much lower variance than \(k_1\), and also has remarkably low bias. But why is the bias small? The answer comes from f-divergences.
An f-divergence is a general family of divergences defined as:
\[D_f(p, q) = \mathbb{E}_{x \sim q}\!\left[f\!\left(\frac{p(x)}{q(x)}\right)\right] = \mathbb{E}_{x \sim q}[f(r)]\]for a convex function \(f\) with \(f(1) = 0\). Many well-known divergences are special cases:
- KL divergence \(\mathrm{KL}[q, p]\): \(f(r) = -\log r\)
- Reverse KL \(\mathrm{KL}[p, q]\): \(f(r) = r \log r\)
- Chi-squared divergence: \(f(r) = (r-1)^2\)
The expectation of \(k_2\) is \(\mathbb{E}_q\!\left[\frac{1}{2}(\log r)^2\right]\), which is also an f-divergence with \(f(r) = \frac{1}{2}(\log r)^2\).
Now here is the key non-obvious fact: all f-divergences with differentiable \(f\) look like KL divergence up to second order when \(q\) is close to \(p\). Specifically, for a parameterized distribution \(p_\theta\):
\[D_f(p_0, p_\theta) = \frac{f''(1)}{2}\,\theta^\top F\,\theta + O(\theta^3),\]where \(F\) is the Fisher information matrix for \(p_\theta\) evaluated at \(p_\theta = p_0\).
Both \(k_2\)’s f-divergence (\(f(r) = \frac{1}{2}(\log r)^2\)) and KL (\(f(r) = -\log r\)) have \(f''(1) = 1\). So both look like the same quadratic distance function \(\frac{1}{2}\theta^\top F\,\theta\) when \(p \approx q\). The bias of \(k_2\) only comes from third-order and higher terms, which explains why it is negligible when \(p\) and \(q\) are close.
What is the Fisher information matrix, and why does it appear here?
The Fisher information matrix \(F\) of a parametric family \(p_\theta\) is defined as:
$$F_{ij} = \mathbb{E}_{x \sim p_\theta}\!\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i}\,\frac{\partial \log p_\theta(x)}{\partial \theta_j}\right] = -\mathbb{E}_{x \sim p_\theta}\!\left[\frac{\partial^2 \log p_\theta(x)}{\partial \theta_i \,\partial \theta_j}\right].$$
Intuitively, \(F\) measures how sensitive the distribution is to small changes in \(\theta\). If changing \(\theta_i\) by a tiny amount causes the log-likelihood to fluctuate a lot (high Fisher information), then the distribution is very "curved" in that direction — a small step in parameter space creates a large change in distribution space.
The interactive figure below makes this concrete. Both panels apply the same perturbation δ to the mean of a Gaussian. On the left, σ is small (high Fisher info F = 1/σ²) — the distributions barely overlap. On the right, σ is large (low Fisher info) — the same δ changes almost nothing. Try dragging the sliders.
Why does \(F\) appear in the f-divergence expansion? Consider \(p_\theta\) near \(p_0\) (i.e., \(\theta\) small). The ratio is:
$$r(\theta) = \frac{p_\theta(x)}{p_0(x)} = \exp\!\big(\log p_\theta(x) - \log p_0(x)\big).$$
Taylor-expanding \(\log p_\theta(x)\) around \(\theta = 0\):
$$\log p_\theta(x) = \log p_0(x) + \theta^\top \nabla_\theta \log p_0(x) + \frac{1}{2}\theta^\top \nabla^2_\theta \log p_0(x)\,\theta + O(\theta^3),$$
so \(\log r \approx \theta^\top s(x) + \frac{1}{2}\theta^\top H(x)\,\theta\), where \(s(x) = \nabla_\theta \log p_0(x)\) is the score function and \(H(x) = \nabla^2_\theta \log p_0(x)\) is its Hessian. Two key facts about the score:
- \(\mathbb{E}_{p_0}[s(x)] = 0\) (the score has zero mean), and
- \(\mathbb{E}_{p_0}[s(x)\,s(x)^\top] = F\) (its covariance is the Fisher matrix).
Substituting into \(D_f = \mathbb{E}_{p_0}[f(r)]\) and expanding \(f\) around \(r = 1\): since \(f(1) = 0\) and \(f'(1)\) contributes terms proportional to \(\mathbb{E}[s(x)] = 0\), the leading term is:
$$D_f \approx \frac{f''(1)}{2}\,\mathbb{E}_{p_0}\!\big[(\theta^\top s(x))^2\big] = \frac{f''(1)}{2}\,\theta^\top F\,\theta.$$
This is why all f-divergences share the same local geometry: they all reduce to a quadratic form in \(\theta\) weighted by the Fisher matrix, differing only by the scalar \(f''(1)\). The Fisher matrix is the unique "metric tensor" on the space of distributions (up to scale) — this is the foundation of information geometry.
In RL, the Fisher matrix of the policy \(\pi_\theta\) is exactly what defines the natural policy gradient: the direction \(F^{-1}\nabla_\theta J\) that makes the steepest improvement per unit of KL divergence, rather than per unit of Euclidean distance in parameter space.
The interactive figure below plots \(k_2\) alongside \(k_1\) (dashed). Notice that \(k_2\) is always non-negative — a square can’t be negative. The zoomed inset near \(r = 1\) shows why the bias is small: \(k_2\) and KL agree to second order. Drag the \(\mu_p\) slider to see how the bias grows as the distributions diverge.
Can we get an estimator that is both unbiased (like \(k_1\)) and always non-negative (like \(k_2\))?
The general technique for reducing variance of an unbiased estimator is a control variate: add something with zero expectation that is negatively correlated with the original estimator. The only interesting quantity guaranteed to have zero expectation under \(q\) is:
\[\mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)} - 1\right] = \mathbb{E}_{x \sim q}[r - 1] = \sum_x q(x) \cdot \frac{p(x)}{q(x)} - 1 = 1 - 1 = 0.\]So for any \(\lambda\), the expression
\[-\log r + \lambda(r - 1)\]is an unbiased estimator of \(\mathrm{KL}[q,p]\). We could minimize the variance over \(\lambda\), but this yields an expression that depends on \(p\) and \(q\) and is hard to compute analytically.
Instead, we can choose a good \(\lambda\) using a simpler and more elegant argument. Since \(\log\) is concave, we have the fundamental inequality:
\[\log x \leq x - 1 \quad \text{for all } x > 0,\]with equality only at \(x = 1\). Setting \(\lambda = 1\), the estimator becomes:
\[k_3 = (r - 1) - \log r = \underbrace{-\log r}_{k_1} + \underbrace{(r - 1)}_{\text{control variate}}.\]By the inequality above, \((r-1) - \log r \geq 0\) for all \(r > 0\), with equality only when \(r = 1\) (i.e., \(p(x) = q(x)\)). So \(k_3\) is:
- Unbiased (since \(\mathbb{E}[r-1] = 0\), we’re just adding zero in expectation to \(k_1\)).
- Always non-negative (by the concavity of \(\log\)).
- Low variance (the control variate cancels much of \(k_1\)’s noise).
There is a beautiful geometric way to see why \(k_3\) is non-negative. Consider the convex function \(\phi(r) = -\log r\). Its tangent line at \(r = 1\) is \(\ell(r) = -(r - 1)\). Then:
\[k_3 = (r-1) - \log r = \phi(r) - \ell(r) = (-\log r) - (-(r-1)).\]This is the vertical gap between the convex function and its tangent line. Since convex functions always lie above their tangent lines, this gap is always non-negative.
This construction — measuring distance as the gap between a convex function and its tangent plane — is called a Bregman divergence. It appears throughout optimization, information theory, and machine learning, and has many beautiful properties (e.g., the “three-point identity” that generalizes the Pythagorean theorem).
You can see this geometry in the interactive figure below. Drag the slider to see how the gap grows as \(r\) moves away from 1.
To see how these estimators compare in practice, consider Gaussian experiments from Schulman’s post. Let \(q = \mathcal{N}(0, 1)\) and \(p = \mathcal{N}(\mu, 1)\), so the true KL is \(\mu^2/2\). Try the two preset experiments (\(\mu = 0.1\) and \(\mu = 1.0\)), or drag \(\mu\) to any value to see how bias and variance change:
Key observations as you drag \(\mu\):
- Small \(\mu\) (≈ 0.1): \(k_1\)’s std is ~20× the true KL — you’d need hundreds of samples for a reliable sign. \(k_2\) and \(k_3\) are nearly identical (\(k_2\)’s bias ≈ 0.2%).
- Large \(\mu\) (≈ 1.0): \(k_2\)’s bias grows to ~25% — no longer negligible. \(k_3\) stays unbiased with low variance. \(k_3\) is strictly better.
For samples \(x \sim q\) and ratio \(r = p(x)/q(x)\), the three estimators are:
| Estimator | Formula | Unbiased? | Always ≥ 0? | Variance |
|---|---|---|---|---|
| \(k_1\) | \(-\log r\) | Yes | No | High |
| \(k_2\) | \(\frac{1}{2}(\log r)^2\) | No (low bias when \(p \approx q\)) | Yes | Low |
| \(k_3\) | \((r-1) - \log r\) | Yes | Yes | Low |
\(k_3\) is the clear winner: unbiased, always non-negative, and low variance. It achieves this by adding the control variate \((r-1)\) to the naive estimator \(k_1\), and its non-negativity follows from the concavity of \(\log\) (equivalently, the Bregman divergence interpretation).
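The Gaussian experiment is easy to reproduce. A short Monte Carlo sketch (NumPy; the choice \(\mu = 1\) and the sample count are arbitrary) shows \(k_1\)'s large standard deviation, \(k_2\)'s roughly 25% bias at this \(\mu\), and \(k_3\)'s combination of unbiasedness and low variance:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 500_000
x = rng.normal(0.0, 1.0, size=n)             # samples from q = N(0, 1)

log_r = -0.5 * (x - mu) ** 2 + 0.5 * x ** 2  # log p(x) - log q(x) for p = N(mu, 1)
r = np.exp(log_r)

estimators = {"k1": -log_r, "k2": 0.5 * log_r ** 2, "k3": (r - 1.0) - log_r}
true_kl = mu ** 2 / 2                        # KL[q, p] for two unit-variance Gaussians
for name, k in estimators.items():
    print(f"{name}: mean = {k.mean():.4f}, std = {k.std():.4f} (true KL = {true_kl:.4f})")
```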
The Bregman divergence trick generalizes elegantly. For any f-divergence \(D_f(p,q) = \mathbb{E}_{x \sim q}[f(r)]\) with convex \(f\), the estimator
\[f(r) - f'(1)(r - 1)\]is:
- Unbiased: because \(\mathbb{E}_q[f'(1)(r-1)] = f'(1) \cdot 0 = 0\).
- Always non-negative: because \(f\) is convex, it lies above its tangent at \(r = 1\), so \(f(r) \geq f(1) + f'(1)(r-1) = f'(1)(r-1)\) (using \(f(1) = 0\)).
This is the Bregman divergence of \(f\) at point \(r\) relative to \(r = 1\).
The most notable application is to \(\mathrm{KL}[p, q]\) (note \(p\) and \(q\) are swapped). This corresponds to \(f(r) = r \log r\), which has \(f'(1) = 1\). The Bregman-based estimator becomes:
\[r\log r - (r - 1).\]Final summary: for samples \(x \sim q\) with \(r = p(x)/q(x)\), the recommended estimators are:
| Divergence | Estimator | Properties |
|---|---|---|
| \(\mathrm{KL}[q, p]\) | \((r - 1) - \log r\) | Unbiased, non-negative, low variance |
| \(\mathrm{KL}[p, q]\) | \(r\log r - (r - 1)\) | Unbiased, non-negative, low variance |
Both are special cases of the general Bregman divergence estimator \(f(r) - f'(1)(r-1)\) for their respective f-divergence generators. In practice, you can drop these into any codebase that computes log-probs — no need to store or compute full distributions.
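A hedged sketch of that recipe as a drop-in helper, taking only per-sample log-probs (the function name is illustrative):

```python
import numpy as np

def bregman_kl_estimates(logp, logq):
    """Per-sample KL estimates from x ~ q, given arrays of log p(x_i) and log q(x_i)."""
    log_r = logp - logq                    # log of r = p(x) / q(x)
    r = np.exp(log_r)
    kl_q_p = (r - 1.0) - log_r             # f(r) = -log r,   f'(1) = -1
    kl_p_q = r * log_r - (r - 1.0)         # f(r) = r log r,  f'(1) =  1
    return kl_q_p, kl_p_q
```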
From Estimation to Optimization (Liu et al., 2025)
Before analyzing how \(k_1, k_2, k_3\) behave as losses, we need to untangle a common source of confusion. PPO-based RLHF involves two different KL divergences that serve entirely different purposes and point in different directions. To see both clearly, start from the TRPO-RLHF formulation with the KL constraint written explicitly:
\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}\right] - \beta \cdot \underbrace{D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})}_{\text{reference KL (reverse)}} \quad \text{s.t.}\quad \underbrace{D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)}_{\text{trust-region KL (forward)}} \leq \delta\]PPO approximates the constraint by replacing it with clipping, yielding the familiar PPO-RLHF objective (as in InstructGPT):
\[\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A},\; \mathrm{clip}\!\big(\tfrac{\pi_\theta}{\pi_{\mathrm{old}}}, 1\!-\!\epsilon, 1\!+\!\epsilon\big)\hat{A}\Big)\right] - \beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\]The two KLs are:
1. Trust-region KL (forward): \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta) \leq \delta\)
This is the constraint inherited from TRPO: don’t let the new policy deviate too far from the old policy within a single update step. Since data is sampled from \(\pi_{\mathrm{old}}\), the constraint is measured under \(\pi_{\mathrm{old}}\) — a forward KL (old policy first). It strongly penalizes the case where \(\pi_{\mathrm{old}}\) puts high probability on an action but \(\pi_\theta\) compresses it — exactly the dangerous regime where importance ratios explode and the surrogate approximation breaks. PPO replaces this explicit constraint with clipping.
2. Reference KL (reverse): \(\beta \cdot D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\)
This is the RLHF-specific regularizer that prevents the policy from drifting too far from the pretrained base model across all of training. Expanding it:
\[D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)}\right]\]Since we care about what the current policy generates, the expectation is naturally under \(\pi_\theta\) — a reverse KL (new policy first). It penalizes the policy for generating outputs that the reference model would find unlikely.
| | Trust-Region KL | Reference KL |
|---|---|---|
| Purpose | Optimization stability | Regularization to base model |
| Direction | \(D_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)\) (forward) | \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\) (reverse) |
| Sampling distribution | \(\pi_{\mathrm{old}}\) (data you already have) | \(\pi_\theta\) (outputs you will generate) |
| Constrains | Per-step update size | Total drift from reference |
| In the formula | Implicit (clipping) | Explicit (\(\beta\) penalty) |
The rest of this section focuses exclusively on the reference KL \(D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\) — specifically, how the choice of \(k_1, k_2, k_3\) and the choice of “in reward” vs. “as loss” affect its gradient.
Wait — the surrogate uses \(\mathbb{E}_{\pi_{\mathrm{old}}}\) but the reference KL uses \(\mathbb{E}_{\pi_\theta}\). How can they coexist in one loss?
They can't, as written. The formula above is conceptually clean but notationally sloppy — it mixes two different expectations. In practice, InstructGPT resolves this by folding the KL into the reward. The per-token reward becomes:
$$\tilde{r}_t = R(y) \cdot \mathbf{1}_{t=T} - \beta \cdot \big(\log \pi_{\mathrm{old}}(a_t \vert s_t) - \log \pi_{\mathrm{ref}}(a_t \vert s_t)\big)$$
Note the key move: the \(\log \pi_\theta\) in the KL is replaced by \(\log \pi_{\mathrm{old}}\) — the policy that actually generated the rollout. The KL penalty is computed at rollout time and treated as part of the reward, detached from the gradient. The advantage \(\hat{A}\) is then estimated from this modified reward using GAE, and the entire PPO loss has a single expectation under \(\pi_{\mathrm{old}}\):
$$\mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\left[\min\!\Big(\frac{\pi_\theta}{\pi_{\mathrm{old}}} \hat{A}_{\tilde{r}},\; \mathrm{clip}(\cdots)\hat{A}_{\tilde{r}}\Big)\right]$$
This is precisely the "\(k_1\) in reward" approach that the rest of this section will analyze. The reference KL never appears as a separate loss term with its own expectation — it is absorbed into the advantage.
We now focus on the reference KL. Let \(k_n\) denote any of the three estimators above (\(k_1 = -\log \delta\), \(k_2 = \frac{1}{2}(\log \delta)^2\), \(k_3 = (\delta - 1) - \log \delta\)), where \(\delta = \pi_{\mathrm{ref}}(y \vert x) / \pi_\theta(y \vert x)\) is the reference-to-current probability ratio. There are two fundamentally different ways to plug \(k_n\) into the loss:
“\(k_n\) in reward” (combined form): treat \(k_n\) as a detached scalar in the REINFORCE objective — it modulates the policy gradient like a reward signal, but is not differentiated:
\[\mathcal{L} = -\mathbb{E}_{y \sim \pi_\theta}\!\Big[\big(R(y) - \beta \cdot k_n\big) \cdot \log \pi_\theta(y \vert x)\Big].\]“\(k_n\) as loss” (decoupled form): add \(k_n\) as a separate differentiable loss — the gradient flows through \(k_n\) itself via the chain rule:
\[\mathcal{L} = -\mathbb{E}\!\big[R(y) \cdot \log \pi_\theta(y \vert x)\big] + \beta \cdot \mathbb{E}\!\big[k_n(\pi_\theta, \pi_{\mathrm{ref}})\big].\]These produce different gradients, even though they use the same formula.
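A minimal PyTorch-style sketch of the two constructions for \(k_1\) makes the difference concrete (names are illustrative; `logits_ref` would come from a frozen reference model). The only change is whether the KL term is detached inside the REINFORCE coefficient or differentiated directly:

```python
import torch
import torch.nn.functional as F

def kl_penalty_two_ways(logits, logits_ref, actions, rewards, beta=0.1):
    """k1-style reference-KL penalty, attached in the two ways discussed above."""
    logp = F.log_softmax(logits, dim=-1).gather(1, actions[:, None]).squeeze(1)
    logp_ref = F.log_softmax(logits_ref, dim=-1).gather(1, actions[:, None]).squeeze(1)
    k1 = logp - logp_ref.detach()            # k1 = -log delta = log(pi_theta / pi_ref)

    # "k1 in reward": the penalty enters as a detached scalar in the REINFORCE coefficient.
    in_reward = -((rewards - beta * k1.detach()) * logp).mean()

    # "k1 as loss": the penalty is differentiated directly; its gradient is just
    # grad log pi_theta, which has zero mean under pi_theta (pure noise, no regularization).
    as_loss = -(rewards * logp).mean() + beta * k1.mean()
    return in_reward, as_loss
```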
The target is the reverse KL \(\mathrm{KL}[\pi_\theta \Vert \pi_{\mathrm{ref}}]\), whose true gradient (under on-policy sampling) is:
\[\nabla_\theta \mathcal{J}_{\mathrm{RKL}} = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\mathrm{ref}}(y \vert x)} \cdot \nabla_\theta \log \pi_\theta(y \vert x)\right].\]The coefficient \(-\log \delta = \log(\pi_\theta / \pi_{\mathrm{ref}})\) multiplying \(\nabla_\theta \log \pi_\theta\) is the “signal” that pushes the policy back toward the reference. Liu et al. show which implementations recover this gradient:
\(k_1\) in reward produces the correct gradient. Since \(k_1 = -\log \delta\), placing it in the REINFORCE coefficient directly yields \(-\log \delta \cdot \nabla_\theta \log \pi_\theta\) — exactly the RKL gradient. ✓
\(k_2\) as loss is gradient-equivalent to \(k_1\) in reward. Since \(k_2 = \frac{1}{2}(\log \delta)^2\), differentiating directly gives \(\nabla_\theta k_2 = \log \delta \cdot \nabla_\theta \log \delta = -\log \delta \cdot \nabla_\theta \log \pi_\theta\) — the same RKL gradient. ✓
This is the paper’s key equivalence result (Theorem 5.1): “\(k_1\) in reward” \(=\) “\(k_2\) as loss” in terms of gradient.
What if we use \(k_1\) as a direct loss instead of in the reward? Since \(k_1 = -\log \delta = \log \pi_\theta - \log \pi_{\mathrm{ref}}\), differentiating gives:
\[\nabla_\theta k_1 = \nabla_\theta \log \pi_\theta(y \vert x).\]The reference policy \(\pi_{\mathrm{ref}}\) has completely disappeared from the gradient — it carries no regularization signal at all. Worse, by the score function identity \(\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta] = 0\), this gradient has zero expectation. It contributes nothing but noise.
This is a stark example of how a perfect estimator (\(k_1\) is exactly unbiased for KL) can be a terrible loss function.
GRPO uses \(k_3 = \delta - 1 - \log \delta\) as a directly differentiated loss (decoupled form). Differentiating:
\[\nabla_\theta k_3 = \nabla_\theta(\delta - \log \delta) = (\delta - 1) \cdot \nabla_\theta \delta / \delta + \nabla_\theta \log \pi_\theta = (1 - \delta) \cdot \nabla_\theta \log \pi_\theta.\]Compared to the true RKL gradient coefficient \(-\log \delta\), GRPO uses \(1 - \delta\). These are related by Taylor expansion — \(-\log \delta = (\delta - 1) - \frac{1}{2}(\delta - 1)^2 + \cdots\), so \(1 - \delta\) is only the first-order approximation of \(-\log \delta\). This introduces three problems:
- Bias: for all \(\delta \neq 1\), the coefficient \(1 - \delta \neq -\log \delta\), so the gradient direction is biased.
- Pathological asymmetry: When the policy deviates away from the reference (\(\delta \to 0\), meaning \(\pi_\theta \gg \pi_{\mathrm{ref}}\)), the true coefficient \(-\log \delta \to +\infty\) provides a strong restoring force, but \(1 - \delta \to 1\) saturates — it cannot push back hard enough. Conversely, when \(\delta \to \infty\) (\(\pi_\theta \ll \pi_{\mathrm{ref}}\)), \(1 - \delta \to -\infty\) explodes much faster than the logarithmic \(-\log \delta\), risking destabilizing updates.
- Variance: the variance of \(1 - \delta\) involves \(\mathrm{Var}[\delta] = \chi^2(\pi_{\mathrm{ref}} \Vert \pi_\theta)\), the chi-squared divergence, which is notoriously unstable and can diverge even when KL remains finite.
So paradoxically, \(k_3\) — the “clear winner” as a KL estimator — produces a biased, asymmetric, and potentially unstable gradient when used as a loss in GRPO. The paper recommends \(k_2\) as loss (or equivalently \(k_1\) in reward) as the principled default.
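The coefficient claims above are easy to check with autograd on a toy categorical policy (everything below is illustrative): differentiating \(k_2\) reproduces the true RKL coefficient \(-\log \delta\), while differentiating \(k_3\) yields \(1 - \delta\):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)          # current policy (toy categorical)
logits_ref = torch.randn(5)                          # frozen reference policy
a = torch.tensor(1)                                  # a sampled token

logp = torch.log_softmax(logits, dim=-1)[a]
logp_ref = torch.log_softmax(logits_ref, dim=-1)[a]
delta = torch.exp(logp_ref - logp)                   # pi_ref / pi_theta
score = torch.autograd.grad(logp, logits, retain_graph=True)[0]   # grad log pi_theta

k2 = 0.5 * (logp_ref - logp) ** 2
k3 = (delta - 1.0) - (logp_ref - logp)
g_k2 = torch.autograd.grad(k2, logits, retain_graph=True)[0]
g_k3 = torch.autograd.grad(k3, logits)[0]

d = delta.detach()
print(torch.allclose(g_k2, -torch.log(d) * score, atol=1e-6))   # True: coefficient -log(delta)
print(torch.allclose(g_k3, (1.0 - d) * score, atol=1e-6))       # True: coefficient 1 - delta
```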
The following table contrasts the estimator ranking (Schulman) with the optimization ranking (Liu et al.):
| | As Estimator | “\(k_n\) in reward” | “\(k_n\) as loss” |
|---|---|---|---|
| \(k_1 = -\log \delta\) | Unbiased, high variance | ✓ Correct RKL gradient | ✗ Zero-mean noise, no regularization |
| \(k_2 = \frac{1}{2}(\log \delta)^2\) | Biased (low), low variance | — | ✓ Correct RKL gradient |
| \(k_3 = (\delta - 1) - \log \delta\) | Unbiased, low variance | — | ≈ First-order biased approximation |
The irony is complete: \(k_1\), the worst estimator (high variance), produces the correct gradient when placed in the reward. \(k_3\), the best estimator (unbiased + low variance), produces a biased gradient when used as a loss. And \(k_2\), the biased estimator, produces the correct gradient as a loss — making it gradient-equivalent to \(k_1\) in reward.
The reason is that estimation asks “how close is the value \(k_n\) to the true KL?” while optimization asks “does \(\nabla_\theta k_n\) point in the right direction?” These are fundamentally different questions, and a good answer to one does not imply a good answer to the other.
MaxEnt RL Methods and Their Connection to KL
Maximum Entropy RL (Ziebart, 2010; Haarnoja et al., SAC, 2018) augments the standard RL objective with an entropy bonus at every timestep. The optimal policy maximizes not just cumulative reward, but also the entropy of its own action distribution:
\[\pi^*_{\mathrm{maxent}} := \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t \Big(r(s_t, a_t) + \alpha \, \mathcal{H}\!\big(\pi(\cdot \vert s_t)\big)\Big)\right]\]where \(\mathcal{H}(\pi(\cdot \vert s)) = \mathbb{E}_{a \sim \pi}[-\log \pi(a \vert s)]\) is the entropy of the policy at state \(s\). Three properties of this quantity matter for what follows: (i) it is non-negative, with \(\mathcal{H} = 0\) iff \(\pi\) is deterministic; (ii) it is bounded above by \(\log \lvert\mathcal{A}\rvert\), achieved iff \(\pi\) is uniform; (iii) it is concave in \(\pi\). Property (ii) will become important later — the upper bound grows with the action space, so for large \(\lvert\mathcal{A}\rvert\) the entropy term can dominate the reward in the MaxEnt objective.
To build intuition for how entropy depends on the distribution shape, consider a policy that puts probability \(p\) on one action and spreads the remaining \(1 - p\) uniformly over the other \(\lvert\mathcal{A}\rvert - 1\) actions (each getting \(\frac{1-p}{\lvert\mathcal{A}\rvert - 1}\)). Splitting the expectation into these two groups:
\[\mathcal{H}(\pi) = -\sum_{a} \pi(a)\log\pi(a) = \underbrace{-p\log p}_{\text{from the single action}} \;\underbrace{- \;(\lvert\mathcal{A}\rvert - 1)\cdot\frac{1-p}{\lvert\mathcal{A}\rvert - 1}\cdot\log\frac{1-p}{\lvert\mathcal{A}\rvert - 1}}_{\text{from the remaining }\lvert\mathcal{A}\rvert - 1\text{ actions}}\]The second group simplifies (the \(\lvert\mathcal{A}\rvert - 1\) cancels with the fraction), giving a clean two-term decomposition:
\[\mathcal{H}(\pi) = -p\log p \;-\; (1-p)\log\frac{1-p}{\lvert\mathcal{A}\rvert - 1}\]When \(p = 1/\lvert\mathcal{A}\rvert\) (uniform), both terms contribute and the total reaches the maximum \(\log\lvert\mathcal{A}\rvert\). When \(p = 1\) (deterministic), both terms vanish. The interactive figure below visualizes this decomposition:
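A small NumPy sketch of this decomposition (the action-space size is an arbitrary example) confirms the two limiting cases:

```python
import numpy as np

def spread_entropy(p, num_actions):
    """Entropy of a policy with mass p on one action and (1 - p) spread over the rest."""
    h_single = -p * np.log(p) if p > 0 else 0.0
    h_rest = -(1 - p) * np.log((1 - p) / (num_actions - 1)) if p < 1 else 0.0
    return h_single + h_rest

A = 16
for p in [1.0, 0.5, 1.0 / A]:
    print(f"p = {p:.3f}  H = {spread_entropy(p, A):.4f}  (max = log|A| = {np.log(A):.4f})")
```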
Expanding the entropy and absorbing it into the per-step reward:
\[\pi^*_{\mathrm{maxent}} = \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t \Big(r(s_t, a_t) + \alpha \, \mathbb{E}_{a \sim \pi(\cdot \vert s_t)}[-\log \pi(a \vert s_t)]\Big)\right]\]Since the trajectory expectation already samples \(a_t \sim \pi(\cdot \vert s_t)\), the inner expectation can be folded in, yielding the equivalent form:
\[\pi^*_{\mathrm{maxent}} = \arg\max_\pi \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_t \gamma^t \Big(r(s_t, a_t) - \alpha \log \pi(a_t \vert s_t)\Big)\right]\]This looks like standard RL with a modified per-step reward \(\tilde{r}_t = r(s_t, a_t) - \alpha \log \pi(a_t \vert s_t)\). One might think we can simply add \(-\alpha \log \pi\) as a bonus to policy gradient and call it a day.
Why is this not just entropy-regularized policy gradient? The crucial difference is that \(-\alpha \log \pi(a_t \vert s_t)\) depends on \(\pi\) itself, unlike the environment reward \(r(s,a)\) which is a fixed function. In standard actor-critic, the critic \(Q^\pi(s,a)\) only backs up \(r\) — it evaluates “how good is state \(s'\)” purely in terms of future reward. But here the “return” includes \(-\alpha \log \pi\) at every future step, and this term changes as \(\pi\) updates. A critic that ignores future entropy gives wrong advantage estimates: it uses a baseline that does not account for the entropy component of the return. MaxEnt RL fixes this by backing up the entropy into the value function — the critic itself must track how much entropy the policy will generate in the future.
Entropy backing up and the Soft Bellman Equation. Recall the standard action-value function, which measures the total reward from taking action \(a\) at state \(s\) and following \(\pi\) thereafter:
\[Q^\pi(s,a) := r(s,a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \vert s,a),\, a' \sim \pi(\cdot \vert s')}\!\left[Q^\pi(s', a')\right]\]The Soft Bellman Equation (Haarnoja et al., ICML 2017) modifies this by including the entropy bonus in the backup target:
\[Q^\pi_{\mathrm{soft}}(s,a) := r(s,a) + \gamma \, \mathbb{E}_{s',\, a' \sim \pi(\cdot \vert s')}\!\left[Q^\pi_{\mathrm{soft}}(s', a') - \alpha \log \pi(a' \vert s')\right]\]Why this form? Note that the entropy \(-\alpha \log \pi(a' \vert s')\) is attached to the next state \(s'\), not the current state \(s\). We can rearrange by separating the entropy from the recursive \(Q\) term:
\[Q^\pi_{\mathrm{soft}}(s,a) = \underbrace{\Big(r(s,a) - \gamma\,\alpha \, \mathbb{E}_{s',\, a' \sim \pi(\cdot \vert s')}\!\left[\log \pi(a' \vert s')\right]\Big)}_{\tilde{r}(s,a)} + \gamma \, \mathbb{E}_{s',\, a' \sim \pi(\cdot \vert s')}\!\left[Q^\pi_{\mathrm{soft}}(s', a')\right]\]This has exactly the form of a standard Bellman equation \(Q = \tilde{r} + \gamma \, \mathbb{E}[Q]\) with an effective reward \(\tilde{r}(s,a) = r(s,a) + \gamma\,\alpha \, \mathbb{E}_{s'}\!\left[\mathcal{H}\!\big(\pi(\cdot \vert s')\big)\right]\) that augments the environment reward with the (discounted, expected) entropy of the policy at the next state. Since \(\tilde{r}\) is bounded whenever \(r\) and \(\log \pi\) are bounded, the standard Bellman contraction argument applies directly — the Soft Bellman operator is a \(\gamma\)-contraction in \(\ell_\infty\) norm, guaranteeing a unique fixed point and convergence of value iteration.
While \(Q^\pi\) only accounts for future reward, \(Q^\pi_{\mathrm{soft}}\) also incorporates future entropy bonuses into the backup — it values not just how much reward the agent collects, but also how many options it keeps open at future states.
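For a fixed policy, the Soft Bellman backup can be iterated directly in the tabular case. A minimal sketch (the arrays `P`, `R`, and `pi` are illustrative stand-ins for an MDP and a stochastic policy, not anything defined in the text):

```python
import numpy as np

def soft_policy_evaluation(P, R, pi, gamma=0.9, alpha=1.0, iters=500):
    """Iterate the Soft Bellman backup Q <- r + gamma * E_{s', a'}[Q(s', a') - alpha * log pi(a'|s')].
    P: transition probabilities, shape (S, A, S).  R: rewards, shape (S, A).
    pi: fixed stochastic policy, shape (S, A)."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        # soft state value V_soft(s') = E_{a' ~ pi}[Q(s', a') - alpha * log pi(a'|s')]
        V_soft = (pi * (Q - alpha * np.log(pi + 1e-12))).sum(axis=1)   # shape (S,)
        Q = R + gamma * P @ V_soft                                     # shape (S, A)
    return Q
```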
Unrolled definitions. Unrolling the Soft Bellman recursion, the soft Q-value can be written as the expected discounted sum of rewards and future entropies (Haarnoja et al., ICML 2017, Appendix A). For convenience, the entropy coefficient \(\alpha\) is set to 1 (the general case is recovered by dividing rewards by \(\alpha\)):
\[Q_{\mathrm{soft}}^{\pi}(\mathbf{s}, \mathbf{a}) \triangleq r_0 + \mathbb{E}_{\tau \sim \pi,\, \mathbf{s}_0 = \mathbf{s},\, \mathbf{a}_0 = \mathbf{a}}\!\left[\sum_{t=1}^{\infty} \gamma^t \Big(r_t + \mathcal{H}\!\big(\pi(\cdot \vert \mathbf{s}_t)\big)\Big)\right]\]where \(\tau = (\mathbf{s}_0, \mathbf{a}_0, \mathbf{s}_1, \mathbf{a}_1, \ldots)\) denotes the trajectory originating at \((\mathbf{s}, \mathbf{a})\). The discounted maximum entropy policy objective is then:
\[J(\pi) \triangleq \sum_t \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi}\!\left[Q_{\mathrm{soft}}^{\pi}(\mathbf{s}_t, \mathbf{a}_t) + \alpha\,\mathcal{H}\!\big(\pi(\cdot \vert \mathbf{s}_t)\big)\right]\]and the corresponding optimal policy is:
\[\pi^*_{\mathrm{MaxEnt}} = \arg\max_\pi \sum_t \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi}\!\left[\sum_{l=t}^{\infty} \gamma^{l-t} \mathbb{E}_{(\mathbf{s}_l, \mathbf{a}_l)}\!\left[r(\mathbf{s}_l, \mathbf{a}_l) + \alpha\,\mathcal{H}\!\big(\pi(\cdot \vert \mathbf{s}_l)\big) \,\middle\vert\, \mathbf{s}_t, \mathbf{a}_t\right]\right]\]Note that this objective takes into account the entropy of the policy at future states, in contrast to greedy objectives such as Boltzmann exploration.
Wait — the first formula already sums entropy at every timestep. How is the last one different?
Both formulas include entropy at every timestep, but they organize the sum differently, revealing different structure.
The first formula is a flat sum: \(\sum_t \gamma^t (r_t + \alpha H_t)\). Expanded, this is just \(\gamma^0(r_0+\alpha H_0) + \gamma^1(r_1+\alpha H_1) + \gamma^2(r_2+\alpha H_2) + \cdots\). Each timestep's entropy appears exactly once, on equal footing with the reward — it looks like entropy is just a local, per-step bonus with no notion of "future."
The last formula is a nested sum: for each \((s_t, a_t)\), the inner sum \(\sum_{l=t}^{\infty} \gamma^{l-t} \mathbb{E}[r_l + \alpha H_l \mid s_t, a_t]\) is the full soft return from \(t\) onward:
$$\underbrace{(r_t + \alpha H_t)}_{\text{current}} + \gamma\underbrace{(r_{t+1} + \alpha H_{t+1})}_{\text{future}} + \gamma^2\underbrace{(r_{t+2} + \alpha H_{t+2})}_{\text{further future}} + \cdots$$
and these future terms are a conditional expectation given \((s_t, a_t)\). The future entropies \(H_{t+1}, H_{t+2}, \ldots\) appear explicitly inside the evaluation of the current state-action pair.
The two formulas are not the same objective function (they weight timesteps differently), but they share the same maximizer, so they define the same optimal policy. The distinction is about algorithmic readability: the first formula's flat structure makes entropy look like an ordinary per-step bonus, tempting one to think it can simply be added to standard policy gradient. The last formula's nested structure makes explicit that evaluating \((s_t, a_t)\) requires knowing \(\mathbb{E}[H_{t+1} + \gamma H_{t+2} + \cdots \mid s_t, a_t]\) — and that inner sum is precisely \(Q_{\mathrm{soft}}^\pi(s_t, a_t)\). This directly reveals why the Soft Bellman equation must back up entropy: the critic must track future entropy, or it will produce wrong advantage estimates.
Proof of equivalent argmax. Write \(f_t = r_t + \alpha \mathcal{H}_t\). Then \(J_1(\pi) = \mathbb{E}_\tau[\sum_t \gamma^t f_t]\) and \(J_2(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}[\sum_{l=t}^\infty \gamma^{l-t} \mathbb{E}[f_l \mid s_t,a_t]]\). Swapping the summation order in \(J_2\) gives each \(f_k\) the weight \(\frac{1-\gamma^{k+1}}{1-\gamma}\) instead of \(\gamma^k\) in \(J_1\), so the two objectives are genuinely different functions of \(\pi\). Nevertheless, they share the same unique maximizer:
- Unique fixed point. The soft Bellman operator \(\mathcal{T}Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\!\big[\alpha\log\!\int\!\exp\!\big(Q(s',a')/\alpha\big)\,da'\big]\) is a \(\gamma\)-contraction in \(\ell_\infty\) norm (the temperature-\(\alpha\) log-sum-exp is 1-Lipschitz). Therefore it has a unique fixed point \(Q^*_{\mathrm{soft}}\), and the corresponding policy \(\pi^*(a|s) \propto \exp(Q^*_{\mathrm{soft}}(s,a)/\alpha)\) is the unique policy satisfying the soft Bellman optimality equation.
- Q-dominance. The soft policy improvement theorem (Haarnoja et al., 2017, Theorem 1) shows: for any \(\pi\), define \(\tilde\pi(\cdot|s) \propto \exp(Q^\pi_{\mathrm{soft}}(s,\cdot)/\alpha)\). Then \(Q^{\tilde\pi}_{\mathrm{soft}}(s,a) \geq Q^\pi_{\mathrm{soft}}(s,a)\) for all \((s,a)\), with equality iff \(\pi\) already satisfies the optimality condition. Iterating converges to the unique fixed point \(\pi^*\), giving \(Q^{\pi^*}_{\mathrm{soft}}(s,a) \geq Q^\pi_{\mathrm{soft}}(s,a)\) for all \((s,a)\) and all \(\pi\). This implies pointwise V-dominance: \(V^{\pi^*}_{\mathrm{soft}}(s) \geq V^\pi_{\mathrm{soft}}(s)\) for all \(s\).
- \(\pi^*\) maximizes \(J_1\). Since \(J_1(\pi) = \mathbb{E}_{s_0}[V^\pi_{\mathrm{soft}}(s_0)]\) and \(V^{\pi^*}(s_0) \geq V^\pi(s_0)\) for all \(s_0\), we have \(J_1(\pi^*) \geq J_1(\pi)\) for all \(\pi\).
- \(\pi^*\) maximizes \(J_2\). The policy iteration \(\pi_i \to \pi_{i+1}\) with \(\pi_{i+1}(\cdot|s) \propto \exp(Q^{\pi_i}_{\mathrm{soft}}(s,\cdot)/\alpha)\) yields \(Q^{\pi_{i+1}}_{\mathrm{soft}} \geq Q^{\pi_i}_{\mathrm{soft}}\) pointwise at each step, and converges to \(\pi^*\) by contraction. Since the only policy where no improvement exists is \(\pi^*\), it is the unique maximizer of \(J_2\). (Here, unlike for \(J_1\), we cannot directly use V-dominance because the state distribution \(d_t^\pi\) in \(J_2\) also changes with \(\pi\). Instead we rely on the monotone convergence of Q-values to the unique fixed point.)
Since both objectives have the same unique maximizer, \(\arg\max_\pi J_1(\pi) = \arg\max_\pi J_2(\pi) = \pi^*\). \(\square\)
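The contraction argument is easy to check numerically: iterating the soft Bellman operator with the log-sum-exp backup converges geometrically to a unique fixed point regardless of initialization. A minimal sketch on a random toy MDP (all names illustrative, discrete actions, \(\alpha = 1\)):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition probabilities, shape (S, A, S)
R = rng.normal(size=(S, A))                  # rewards, shape (S, A)

def soft_bellman_operator(Q):
    """T Q(s, a) = r(s, a) + gamma * E_{s'}[log sum_{a'} exp Q(s', a')]   (alpha = 1)."""
    V = logsumexp(Q, axis=1)                 # soft maximum over actions, shape (S,)
    return R + gamma * P @ V

Q = np.zeros((S, A))
for _ in range(300):
    Q_next = soft_bellman_operator(Q)
    residual = np.max(np.abs(Q_next - Q))    # sup-norm change shrinks by ~gamma per step
    Q = Q_next
print("final residual:", residual)
# optimal MaxEnt policy: pi*(a|s) proportional to exp(Q*_soft(s, a))
pi_star = np.exp(Q - logsumexp(Q, axis=1, keepdims=True))
```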
Why the MaxEnt objective cannot be optimized by entropy-regularized policy gradient
Entropy-regularized policy gradient (ERPG) is the approach of taking a standard actor-critic algorithm — whose critic \(Q^\pi\) only backs up environment reward — and adding an entropy bonus \(\alpha H(\pi)\) to the actor's objective. The ERPG optimization target at each state \(s\) is:
$$J_{\mathrm{ERPG}}(\pi; s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\!\big[Q^\pi(s,a)\big] + \alpha\,\mathcal{H}\!\big(\pi(\cdot|s)\big)$$
where \(Q^\pi(s,a) = \mathbb{E}\!\left[\sum_{l=0}^\infty \gamma^l r_{t+l} \mid s_t\!=\!s,\, a_t\!=\!a\right]\) satisfies the standard (non-soft) Bellman equation — it backs up reward only, with no entropy in the bootstrap target. The entropy term \(\alpha H\) is applied only at the current policy improvement step, not propagated into future value estimates. Solving for the optimal policy gives:
$$\pi_{\mathrm{ERPG}}(\cdot|s) \propto \exp\!\big(Q^\pi(s,\cdot)/\alpha\big)$$
Deriving the true gradient of the MaxEnt objective. Write the MaxEnt objective as \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\sum_t \gamma^t(r_t - \alpha\log\pi_\theta(a_t|s_t))]\). We need the gradient of an expectation where both the distribution and the integrand depend on \(\theta\). By the product rule under the integral sign:
$$\nabla_\theta\!\int\! p_\theta(\tau)\,f(\tau,\theta)\,d\tau = \int\!\big[\nabla_\theta p_\theta(\tau)\cdot f(\tau,\theta) + p_\theta(\tau)\cdot\nabla_\theta f(\tau,\theta)\big]\,d\tau = \mathbb{E}_{p_\theta}\!\big[\nabla_\theta\!\log p_\theta(\tau)\cdot f(\tau,\theta) + \nabla_\theta f(\tau,\theta)\big]$$
where the second equality uses \(\nabla_\theta p_\theta = p_\theta\,\nabla_\theta\!\log p_\theta\). Applying this to \(J(\theta)\):
$$\nabla_\theta J = \underbrace{\mathbb{E}_\tau\!\left[\bigg(\sum_{t'} \nabla_\theta\!\log\pi_\theta(a_{t'}|s_{t'})\bigg) \cdot \bigg(\sum_t \gamma^t \tilde r_t\bigg)\right]}_{\text{(I)}} \;+\; \underbrace{\mathbb{E}_\tau\!\left[\sum_t \gamma^t\big(-\alpha\,\nabla_\theta\!\log\pi_\theta(a_t|s_t)\big)\right]}_{(\text{II})}$$
where \(\tilde r_t = r_t - \alpha\log\pi_\theta(a_t|s_t)\). Term (II) vanishes by the score function identity: \(\mathbb{E}_{a\sim\pi}[\nabla\!\log\pi(a|s)] = \nabla\!\sum_a\pi(a|s) = 0\).
Simplifying term (I) via causality. Expanding the product of sums gives cross terms \(\nabla\!\log\pi(a_{t'}|s_{t'}) \cdot \gamma^t \tilde r_t\). For \(t < t'\), the "reward" \(\tilde r_t\) depends only on \((s_t, a_t)\) and is therefore fixed given the trajectory up to time \(t' - 1\). Conditioning on \(s_{t'}\):
$$\mathbb{E}\!\big[\nabla\!\log\pi(a_{t'}|s_{t'}) \cdot \tilde r_t\big] = \mathbb{E}\!\big[\tilde r_t \cdot \underbrace{\mathbb{E}_{a_{t'}\sim\pi}[\nabla\!\log\pi(a_{t'}|s_{t'})]}_{=\,0}\big] = 0 \quad (t < t')$$
So only terms with \(t \geq t'\) survive. Re-indexing with \(l = t - t'\):
$$\text{(I)} = \mathbb{E}_\tau\!\left[\sum_{t'} \nabla\!\log\pi(a_{t'}|s_{t'}) \cdot \sum_{t \geq t'} \gamma^t \tilde r_t\right] = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla\!\log\pi(a_t|s_t) \cdot \underbrace{\sum_{l=0}^{\infty}\gamma^l \tilde r_{t+l}}_{G_t^{\mathrm{soft}}}\right]$$
Replacing \(G_t^{\mathrm{soft}}\) by its conditional expectation \(\mathbb{E}[G_t^{\mathrm{soft}}|s_t,a_t] = Q^\pi_{\mathrm{soft}}(s_t,a_t) - \alpha\log\pi(a_t|s_t)\):
$$\boxed{\nabla_\theta J_{\mathrm{MaxEnt}} = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla_\theta\!\log\pi_\theta(a_t|s_t) \cdot \big(Q^\pi_{\mathrm{soft}}(s_t,a_t) - \alpha\log\pi_\theta(a_t|s_t)\big)\right]}$$
ERPG's gradient. ERPG computes two separate pieces: (1) a standard policy gradient using reward-only \(Q^\pi\), and (2) a separate entropy gradient \(\alpha\nabla_\theta H = -\alpha\,\mathbb{E}[\nabla\!\log\pi\cdot\log\pi]\) (the \(\nabla\!\log\pi\cdot 1\) term vanishes by the score identity). Combining:
$$\boxed{\nabla_\theta J_{\mathrm{ERPG}} = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla_\theta\!\log\pi_\theta(a_t|s_t) \cdot \big(Q^\pi(s_t,a_t) - \alpha\log\pi_\theta(a_t|s_t)\big)\right]}$$
The gradient gap. The \(-\alpha\log\pi\) terms are identical and cancel in the difference:
$$\nabla_\theta J_{\mathrm{MaxEnt}} - \nabla_\theta J_{\mathrm{ERPG}} = \mathbb{E}_\tau\!\left[\sum_t \gamma^t \nabla_\theta\!\log\pi_\theta(a_t|s_t) \cdot \underbrace{\big(Q^\pi_{\mathrm{soft}}(s_t,a_t) - Q^\pi(s_t,a_t)\big)}_{\text{discounted future entropy}}\right]$$
where the gap is exactly the discounted future entropy conditional on the action:
$$Q^\pi_{\mathrm{soft}}(s,a) - Q^\pi(s,a) = \gamma\,\mathbb{E}_{s' \sim p(\cdot|s,a)}\!\big[V^\pi_{\mathrm{soft}}(s') - V^\pi(s')\big] = \alpha\gamma\;\mathbb{E}_{s'}\!\left[\sum_{l=0}^{\infty}\gamma^l H\!\big(\pi(\cdot|s_{l+1}')\big)\right]$$where \(s_1' = s'\) and \(s_{l+1}'\) denotes the state reached \(l\) steps after \(s'\) by following \(\pi\).
This depends on \(a\) through the transition \(s' \sim p(\cdot|s,a)\). In any MDP where different actions lead to states with different future entropy, this gap is action-dependent and nonzero, so \(\nabla J_{\mathrm{MaxEnt}} \neq \nabla J_{\mathrm{ERPG}}\). ERPG produces biased gradients for the MaxEnt objective. \(\square\)
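A numerical sanity check of the gap formula: evaluate the same fixed policy with and without entropy in the backup, and compare the difference to the discounted future entropy. A sketch on a random tabular MDP (all names illustrative):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, alpha, iters=2000):
    """Tabular evaluation of the fixed policy pi.
    alpha = 0 gives the standard reward-only critic Q^pi; alpha > 0 gives Q^pi_soft."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = (pi * (Q - alpha * np.log(pi))).sum(axis=1)
        Q = R + gamma * P @ V
    return Q

rng = np.random.default_rng(1)
S, A, gamma, alpha = 4, 3, 0.9, 0.5
P = rng.dirichlet(np.ones(S), size=(S, A))       # transition probabilities (S, A, S)
R = rng.normal(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)           # arbitrary stochastic policy (S, A)

Q_soft = policy_evaluation(P, R, pi, gamma, alpha)
Q_std = policy_evaluation(P, R, pi, gamma, 0.0)

# discounted future entropy F(s) = H(pi(.|s)) + gamma * E_{a ~ pi, s'}[F(s')]
H = -(pi * np.log(pi)).sum(axis=1)
M = (pi[:, :, None] * P).sum(axis=1)             # state-to-state transition matrix under pi
F = np.zeros(S)
for _ in range(2000):
    F = H + gamma * M @ F

# the critic gap is exactly alpha * gamma * E_{s' ~ p(.|s,a)}[F(s')]
assert np.allclose(Q_soft - Q_std, alpha * gamma * (P @ F), atol=1e-5)
```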
Concrete instance: GRPO. The GRPO objective (Shao et al., 2024) uses a PPO-clipped surrogate with a KL penalty against a reference model:
$$\mathcal{J}_{\mathrm{GRPO}}(\theta) \;=\; \mathbb{E}\!\left[\frac{1}{G}\sum_{i}\frac{1}{|o_i|}\sum_{t}\Big\{\min\!\big[r_{i,t}(\theta)\,\hat A_{i,t},\;\mathrm{clip}(r_{i,t}(\theta),\,1\!-\!\varepsilon,\,1\!+\!\varepsilon)\,\hat A_{i,t}\big] - \beta\,D_{\mathrm{KL}}[\pi_\theta\|\pi_{\mathrm{ref}}]\Big\}\right]$$
where \(r_{i,t}(\theta) = \pi_\theta(o_{i,t}|q,o_{i,<t})/\pi_{\theta_{\mathrm{old}}}(o_{i,t}|q,o_{i,<t})\) is the per-token importance ratio and \(\hat A_{i,t}\) is the group-relative advantage (computed from reward only, no entropy in the baseline). If we replace \(-\beta\,D_{\mathrm{KL}}[\pi_\theta\|\pi_{\mathrm{ref}}]\) with \(+\alpha\,\mathcal{H}(\pi_\theta)\), the result is exactly ERPG: the entropy bonus is applied per-token at the actor level, but the advantage estimates \(\hat A_{i,t}\) still come from a reward-only signal — entropy is not backed up into the value baseline. By the argument above, this is not equivalent to the MaxEnt RL objective.
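As a sketch of how the pieces fit together, here is a simplified per-group GRPO-style surrogate in NumPy (no autograd; the group-normalized advantage and the simple per-token log-ratio KL estimate are our simplifications, not necessarily the exact choices in the paper):

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """Per-group GRPO-style surrogate (to be maximized).
    logp_new / logp_old / logp_ref: lists of G arrays of per-token log-probs, one per sampled output.
    rewards: length-G array of scalar rewards for the group."""
    # group-relative advantage: normalize the scalar rewards within the group
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    total = 0.0
    for i in range(len(rewards)):
        ratio = np.exp(logp_new[i] - logp_old[i])              # r_{i,t}(theta)
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        surr = np.minimum(ratio * adv[i], clipped * adv[i])    # PPO clip; same advantage for every token
        kl = logp_new[i] - logp_ref[i]                         # simple per-token KL estimate
        total += np.mean(surr - beta * kl)
    return total / len(rewards)
```

Swapping the `- beta * kl` term for a per-token entropy bonus computed from the current policy reproduces the ERPG setup discussed above: `adv` is still computed from reward alone, so no future entropy is backed up.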
The practical consequence is best seen through an example. Consider navigating around an obstacle to reach a goal. Without entropy backing up, the value function only cares about reward, so it finds the shortest path — say, squeezing through a narrow gap on the left side of the obstacle. This path is slightly shorter, but leaves little room for error: a noisy policy would easily collide with the obstacle. With entropy backing up, \(Q_{\mathrm{soft}}\) assigns higher value to states from which many different trajectories can reach the goal. The policy therefore prefers the wider route around the right side of the obstacle — even though it is slightly longer — because from those states, there are more ways to succeed even under stochastic action selection. The key insight is that this preference for “states with many options” emerges automatically from backing up entropy through the Bellman equation, not from any explicit path-planning logic.
Relationship to KL regularization. The per-step KL penalty in RLHF can be decomposed to reveal that it contains the MaxEnt entropy bonus as a component. For a single state \(s\):
\[D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \vert s) \,\big\Vert\, \pi_{\mathrm{ref}}(\cdot \vert s)\big) = \mathbb{E}_{a \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(a \vert s)}{\pi_{\mathrm{ref}}(a \vert s)}\right] = \underbrace{-\mathcal{H}\!\big(\pi_\theta(\cdot \vert s)\big)}_{\text{negative entropy}} + \underbrace{H\!\big(\pi_\theta(\cdot \vert s),\, \pi_{\mathrm{ref}}(\cdot \vert s)\big)}_{\text{cross-entropy}}\]where \(H(\pi_\theta, \pi_{\mathrm{ref}}) = \mathbb{E}_{a \sim \pi_\theta}[-\log \pi_{\mathrm{ref}}(a \vert s)]\) is the cross-entropy. Therefore the KL-penalized reward decomposes as:
\[r(s,a) - \beta \log \frac{\pi_\theta(a \vert s)}{\pi_{\mathrm{ref}}(a \vert s)} = r(s,a) \underbrace{- \beta \log \pi_\theta(a \vert s)}_{\text{entropy bonus (as in MaxEnt)}} \underbrace{+ \beta \log \pi_{\mathrm{ref}}(a \vert s)}_{\text{anchor to reference}}\]The first two terms are exactly the MaxEnt reward with \(\alpha = \beta\). The third term, \(+\beta \log \pi_{\mathrm{ref}}(a \vert s)\), acts as a state-action-dependent reward shaping that pulls the policy toward the reference model. When \(\pi_{\mathrm{ref}}\) is uniform, \(\log \pi_{\mathrm{ref}}\) is a constant and drops out — recovering the MaxEnt objective exactly.
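The decomposition is easy to verify numerically for a single state with random categorical distributions (a small sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(6))        # pi_theta(.|s)
ref = rng.dirichlet(np.ones(6))       # pi_ref(.|s)

kl = np.sum(pi * np.log(pi / ref))            # D_KL(pi_theta || pi_ref)
neg_entropy = np.sum(pi * np.log(pi))         # -H(pi_theta)
cross_entropy = -np.sum(pi * np.log(ref))     # H(pi_theta, pi_ref)
assert np.isclose(kl, neg_entropy + cross_entropy)
```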
So the distinction between MaxEnt RL and KL regularization is not about whether the value function is redefined (both can back up their respective bonuses through the Bellman equation). The distinction is purely about what is being backed up:
| MaxEnt RL | KL regularization |
|---|---|
| Backs up entropy \(\mathcal{H}(\pi)\) only | Backs up entropy \(\mathcal{H}(\pi)\) plus a cross-entropy anchor \(H(\pi, \pi_{\mathrm{ref}})\) |
| Encourages exploration for its own sake | Encourages exploration while staying close to \(\pi_{\mathrm{ref}}\) |
Actor-Critic
Actor-Critic 方法
The Critic
Critic(评论家)
In the policy gradient, the single-sample reward-to-go \(\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\) serves as our estimate of \(Q^{\pi_\theta}(s_t, a_t)\). This is unbiased — in expectation it equals the true action-value — but high-variance, because a single trajectory may encounter lucky or unlucky transitions.
Can we get a better estimate? The idea is to fit a model to predict expected returns, rather than relying on a single sample. Define three value functions:
\[Q^\pi(s, a) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\!\left[r(s_{t'}, a_{t'}) \vert s_t, a_t\right]\] \[V^\pi(s) = \mathbb{E}_{a \sim \pi_\theta(a \vert s)}\!\left[Q^\pi(s, a)\right]\] \[A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\]The advantage \(A^\pi\) tells us how much better action \(a\) is compared to the average action from state \(s\). Using the advantage in the policy gradient:
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t})\]The better the advantage estimate, the lower the variance of this gradient. The key insight is that we only need to fit \(V^\pi(s)\), because \(A^\pi(s, a) \approx r(s, a) + \gamma V^\pi(s') - V^\pi(s)\), which requires only the value function and one observed transition.
How do we train this value function? This is the policy evaluation problem: given a fixed policy \(\pi_\theta\), estimate \(V^\pi(s)\). We fit a neural network \(\hat{V}^\pi_\phi(s)\) with parameters \(\phi\) by supervised regression:
\[\mathcal{L}(\phi) = \frac{1}{2} \sum_i \left\lVert \hat{V}^\pi_\phi(s_i) - y_i \right\rVert^2\]The question is what target \(y_i\) to use. The Monte Carlo target \(y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\) is unbiased but noisy — the same function must fit many different sampled trajectories from the same state. The bootstrapped (TD) target \(y_{i,t} = r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})\) is lower variance because it replaces all future randomness with the current estimate, but introduces bias — if \(\hat{V}^\pi_\phi\) is wrong (and it always is, at least initially), the target is wrong too. Here the discount factor \(\gamma \in [0, 1]\) keeps values finite when episodes are long or infinite; one interpretation is that \(\gamma\) adds a \((1 - \gamma)\) probability of “death” at each step, making the effective horizon finite.
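As a sketch, here is how the two kinds of targets would be computed from one sampled trajectory (array names are illustrative; `values_next[t]` stands for \(\hat{V}^\pi_\phi(s_{t+1})\)):

```python
import numpy as np

def mc_targets(rewards, gamma=0.99):
    """Monte Carlo targets: full discounted reward-to-go from each step (unbiased, high variance)."""
    y, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        y[t] = running
    return y

def td_targets(rewards, values_next, gamma=0.99):
    """Bootstrapped targets r_t + gamma * V_hat(s_{t+1}) (lower variance, biased if V_hat is off).
    values_next[t] is the critic's current estimate at s_{t+1}, set to 0 at a terminal state."""
    return np.asarray(rewards) + gamma * np.asarray(values_next)
```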
The Algorithm
算法
An actor-critic method maintains two components:
- Actor: the policy \(\pi_\theta(a \vert s)\), updated by policy gradient.
- Critic: a value function \(\hat{V}^\pi_\phi(s) \approx V^\pi(s)\), trained by regression.
Batch actor-critic:
- Sample \(\{s_i, a_i\}\) from \(\pi_\theta(a \vert s)\) (run the policy).
- Fit \(\hat{V}^\pi_\phi(s)\) to sampled reward sums.
- Evaluate \(\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)\).
- \(\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i \vert s_i) \hat{A}^\pi(s_i, a_i)\).
- \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\).
Online actor-critic updates after every single transition \((s, a, s', r)\):
- Take action \(a \sim \pi_\theta(a \vert s)\), observe \((s, a, s', r)\).
- Update \(\hat{V}^\pi_\phi\) using target \(r + \gamma \hat{V}^\pi_\phi(s')\).
- Evaluate \(\hat{A}^\pi(s, a) = r(s, a) + \gamma \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)\).
- \(\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \vert s) \hat{A}^\pi(s, a)\).
- \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\).
The online version works on a single transition — no need to collect full trajectories. In practice, it works best with a batch from parallel workers: multiple agents collect transitions simultaneously, and their gradients are aggregated before each update. This is the idea behind A3C (Mnih et al., 2016), which runs asynchronous parallel actors each contributing gradients to a shared parameter server.
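A self-contained sketch of the online loop on a toy tabular MDP, with a softmax actor and a tabular critic (everything here is illustrative, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))     # toy MDP: transition probabilities (S, A, S)
R = rng.normal(size=(S, A))                    # toy rewards (S, A)

theta = np.zeros((S, A))                       # actor: softmax policy logits
V = np.zeros(S)                                # critic: tabular value estimates
lr_actor, lr_critic = 0.05, 0.1

s = 0
for step in range(20000):
    # act: a ~ pi_theta(.|s), then observe (s, a, s', r)
    probs = np.exp(theta[s] - theta[s].max()); probs /= probs.sum()
    a = rng.choice(A, p=probs)
    s_next = rng.choice(S, p=P[s, a])
    r = R[s, a]
    # critic update toward the bootstrapped target r + gamma * V(s')
    td_error = r + gamma * V[s_next] - V[s]    # also serves as the advantage estimate
    V[s] += lr_critic * td_error
    # actor update: grad log pi(a|s) * advantage  (softmax score = one_hot(a) - probs)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += lr_actor * td_error * grad_log_pi
    s = s_next
```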
The actor and critic can use two separate networks (simple, stable, but no shared features) or a single network with two heads (shared early layers, separate output heads for \(\pi_\theta\) and \(\hat{V}^\pi_\phi\)). The shared design is more parameter-efficient and can learn common state representations, but couples the two learning problems. In the LLM era, this question takes a new form: can a pretrained language model serve as both actor and critic? The LM-as-critic post explores the surprising difficulties of attaching a value head to a pretrained backbone.
Advantage Estimation
优势估计
The actor-critic gradient uses the TD advantage \(r + \gamma \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)\) — lower variance but biased (since the critic is imperfect). The Monte Carlo policy gradient uses \(\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} - b\) — unbiased but higher variance. One middle ground is to use the Monte Carlo return but subtract the critic as a state-dependent baseline:
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \left(\underbrace{\sum_{t'=t}^{T} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})}_{\substack{\mathrm{Monte\ Carlo\ return} \\ \mathrm{(unbiased,\ high\ variance)}}} - \underbrace{\hat{V}^\pi_\phi(s_{i,t})}_{\substack{\mathrm{critic\ as\ baseline} \\ \mathrm{(state\ dependent)}}}\right)\]The reward-to-go sum is the same Monte Carlo return from the vanilla policy gradient — no bootstrapping, so no bias. But instead of subtracting a constant baseline \(b\), we subtract the critic’s value estimate \(\hat{V}^\pi_\phi(s_{i,t})\), which depends on the state. Since any state-dependent baseline preserves unbiasedness (shown above), this remains unbiased. And because \(\hat{V}^\pi_\phi(s)\) is close to the expected return from each state, it reduces variance far more effectively than a constant \(b\).
More generally, the one-step TD advantage and the full Monte Carlo return are two extremes. We can interpolate with \(n\)-step returns: use \(n\) steps of actual rewards, then bootstrap:
\[\hat{A}_n^\pi(s_t, a_t) = \underbrace{\sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(s_{t'}, a_{t'})}_{\mathrm{actual\ rewards\ (}n\mathrm{\ steps)}} \underbrace{- \; \hat{V}^\pi_\phi(s_t)}_{\mathrm{baseline}} + \underbrace{\gamma^n \hat{V}^\pi_\phi(s_{t+n})}_{\mathrm{bootstrap\ remainder}}\]The first term sums \(n\) steps of actual observed rewards, discounted back to time \(t\). The second term subtracts the critic’s estimate at the current state (the baseline). The third term “fills in” the remaining future by bootstrapping from the critic at step \(t+n\) — discounted by \(\gamma^n\) because that state is \(n\) steps away. When \(n = 1\), we get the TD advantage \(r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)\): mostly bootstrapped, low variance but biased. When \(n = T - t\), the bootstrap term \(\gamma^{T-t}\hat{V}(s_T)\) vanishes at the terminal state and we recover the full Monte Carlo return minus a baseline: unbiased but high variance. Choosing \(n > 1\) often works better than either extreme.
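A sketch of the \(n\)-step advantage computed from one sampled trajectory (assumes `values` has \(T+1\) entries, with the terminal value set to zero):

```python
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """A_n(s_t, a_t) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V_hat(s_{t+n}) - V_hat(s_t).
    rewards has length T; values has length T+1, with values[T] = 0 at a terminal state."""
    n = min(n, len(rewards) - t)                       # truncate at the end of the episode
    ret = sum(gamma**k * rewards[t + k] for k in range(n))
    return ret + gamma**n * values[t + n] - values[t]
```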
Generalized advantage estimation (Schulman, Moritz, Levine, Jordan, Abbeel, 2016) takes this further: instead of choosing a single \(n\), take a weighted combination of all \(n\)-step advantages with exponentially decaying weights \(w_n \propto \lambda^{n-1}\):
\[\hat{A}^{\mathrm{GAE}}(s_t, a_t) = \sum_{t'=t}^{\infty} (\gamma \lambda)^{t'-t} \delta_{t'}\]where \(\delta_{t'} = r(s_{t'}, a_{t'}) + \gamma \hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})\) is the one-step TD residual. The parameter \(\lambda\) controls how far into the future we look before trusting the critic. When \(\lambda = 0\), only the \(t' = t\) term survives:
\[\hat{A}^{\mathrm{GAE}(\gamma,\,0)}(s_t, a_t) = \delta_t = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)\]This is the one-step TD advantage — low variance but biased (relies entirely on the critic). When \(\lambda = 1\), the geometric decay disappears and all TD residuals are summed with only \(\gamma\) discounting:
\[\hat{A}^{\mathrm{GAE}(\gamma,\,1)}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t} \delta_{t'} = \sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)\]To see why the second equality holds, expand \(\delta_{t'} = r_{t'} + \gamma \hat{V}(s_{t'+1}) - \hat{V}(s_{t'})\) and split the sum:
\[\sum_{t'=t}^{\infty} \gamma^{t'-t} \delta_{t'} = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} + \sum_{t'=t}^{\infty} \bigl[\gamma^{t'-t+1} \hat{V}(s_{t'+1}) - \gamma^{t'-t} \hat{V}(s_{t'})\bigr]\]The second sum telescopes: consecutive terms cancel, leaving \(\lim_{N\to\infty} \gamma^{N+1}\hat{V}(s_{t+N+1}) - \hat{V}(s_t)\). The limit vanishes (either \(\gamma < 1\) or the episode terminates with \(\hat{V} = 0\)), so we are left with \(\sum \gamma^{t'-t} r_{t'} - \hat{V}(s_t)\).
This recovers the Monte Carlo return minus a state-dependent baseline — unbiased but high variance. In practice, \(\lambda \approx 0.95{-}0.97\) works well.
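In practice GAE is computed in a single backward pass using the recursion \(\hat{A}_t^{\mathrm{GAE}} = \delta_t + \gamma\lambda\,\hat{A}_{t+1}^{\mathrm{GAE}}\), which follows directly from the sum above. A sketch (assumes `values` has one extra entry for the state after the last step, zero if the episode terminated):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one trajectory via the backward recursion A_t = delta_t + gamma * lam * A_{t+1}.
    rewards has length T; values has length T+1 (values[T] = 0 if the episode terminated)."""
    T = len(rewards)
    adv, running = np.zeros(T), 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```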
The on-policy requirement — needing fresh data after every update — is a major limitation addressed by PPO above. For a deeper treatment of importance sampling in RL, see the importance sampling post.