Policy Gradient and Actor-Critic
The Policy Gradient
Trajectories and the Objective
In reinforcement learning, an agent interacts with an environment by choosing actions according to a policy \(\pi_\theta(a \vert s)\) — a distribution over actions given a state, parameterized by \(\theta\). Each interaction produces a trajectory:
\[\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)\]The probability of a trajectory under policy \(\pi_\theta\) is:
\[P^{\pi_\theta}(\tau) = d_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \vert s_t) \, P(s_{t+1} \vert s_t, a_t)\]where \(d_0\) is the initial state distribution and \(P(s_{t+1} \vert s_t, a_t)\) is the transition probability. This is an alternating product of policy terms (learnable) and environment terms (fixed).
The objective is \(J(\pi_\theta) = \sum_\tau R(\tau) P^{\pi_\theta}(\tau)\) — a sum over all possible trajectories in the MDP. Each trajectory \(\tau\) has a fixed return \(R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t\); what the policy controls is the probability \(P^{\pi_\theta}(\tau)\) assigned to each one. A better policy concentrates probability on high-return trajectories.
Deriving REINFORCE
To differentiate \(J\) with respect to \(\theta\):
\[\nabla_\theta J = \sum_\tau R(\tau) \nabla_\theta P^{\pi_\theta}(\tau)\]This is already mathematically correct, but it sums over all possible trajectories — an astronomically large space that cannot be enumerated. We need a form that can be estimated by sampling a handful of trajectories from \(\pi_\theta\).
One might ask: why not just sample a trajectory \(\tau \sim \pi_\theta\), observe \(R(\tau)\), and let autograd backpropagate through \(\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\) directly? The problem is that the computation graph passes through a discrete sampling step — each action \(a_t\) is drawn from a categorical distribution \(\pi_\theta(\cdot \vert s_t)\). The sampled action is a discrete index, not a smooth function of \(\theta\), so the gradient cannot flow through it. Autograd only sees that the reward is some scalar, but has no way to capture the fact that changing \(\theta\) changes which trajectories get sampled. Standard backpropagation handles \(\nabla_\theta f_\theta(x)\) for fixed inputs \(x\), but here \(\theta\) affects both the function and the distribution over inputs. The log-derivative trick is precisely the tool that recovers this missing “distributional” part of the gradient.
Converting a sum \(\sum_\tau f(\tau)\) into an expectation \(\mathbb{E}_{\tau \sim P^{\pi_\theta}}[f(\tau) / P^{\pi_\theta}(\tau)]\) requires that \(P^{\pi_\theta}(\tau)\) is a valid probability distribution — non-negative and summing to 1. It is: \(P^{\pi_\theta}(\tau)\) is a product of the initial state distribution, per-step policy probabilities, and transition probabilities, all of which are valid distributions, so \(\sum_\tau P^{\pi_\theta}(\tau) = 1\) by construction. The log-derivative trick \(\nabla P = P \nabla \log P\) achieves exactly this conversion by factoring out \(P^{\pi_\theta}(\tau)\) as the sampling weight:
\[\nabla_\theta J = \sum_\tau R(\tau) P^{\pi_\theta}(\tau) \nabla_\theta \log P^{\pi_\theta}(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \nabla_\theta \log P^{\pi_\theta}(\tau)\right]\]There is one more step: the gradient still involves \(\nabla_\theta \log P^{\pi_\theta}(\tau)\) — a trajectory-level quantity. To get something we can compute per action, we exploit the fact that the log turns the product of per-step probabilities into a sum:
\[\log P^{\pi_\theta}(\tau) = \log d_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \vert s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} \vert s_t, a_t)\]The initial state distribution \(d_0\) and transition dynamics \(P\) do not depend on \(\theta\), so their gradients vanish. Only the policy terms survive:
\[\nabla_\theta J = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right]\]This is the trajectory-level REINFORCE estimator. Because the gradient is now an expectation under \(\pi_\theta\), we can estimate it by sampling \(N\) trajectories and averaging:
\[\hat{g} = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \vert s_t^{(i)}), \quad \tau^{(i)} \sim \pi_\theta\]Each action contributes its \(\nabla_\theta \log \pi_\theta\) — a quantity neural network frameworks compute naturally via backpropagation — weighted by the trajectory return. This is what makes the log-derivative form practical: it decomposes into per-action log-probabilities that fit directly into standard gradient-based training.
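Because the estimator is just a sample average, it can be checked against the exact gradient whenever the trajectory space is small enough to enumerate. A minimal sketch, assuming a hypothetical one-step MDP (a two-armed bandit) with a softmax policy over logits \(\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step MDP (a two-armed bandit) with a softmax policy over logits theta.
theta = np.array([0.0, 0.0])
returns = np.array([1.0, 0.0])       # R(tau) for each of the two trajectories

pi = np.exp(theta - theta.max())
pi /= pi.sum()

# Score of a softmax policy: grad_theta log pi(a) = onehot(a) - pi.
scores = np.eye(2) - pi              # row a holds grad_theta log pi(a)

# REINFORCE: sample N trajectories (single actions here) and average.
N = 20_000
actions = rng.choice(2, size=N, p=pi)
g_hat = (returns[actions, None] * scores[actions]).mean(axis=0)

# Exact gradient by enumerating the (tiny) trajectory space.
g_true = (pi * returns) @ scores
print(g_hat, g_true)                 # both close to [0.25, -0.25]
```

The sampled estimate converges to the enumerated gradient as \(N\) grows, which is exactly what the derivation promises.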
By decomposing the return over time steps, the policy gradient can be rewritten in a per-step form using the discounted state occupancy \(d^{\pi_\theta}\):
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\!\left[Q^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]where \(Q^{\pi_\theta}(s, a)\) is the action-value function.
Variance Reduction: Baselines and the Advantage
A useful property of \(\nabla_\theta \log \pi_\theta\) is that its expectation under the policy is zero: \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \vert s)] = \sum_a \nabla_\theta \pi_\theta(a \vert s) = \nabla_\theta 1 = 0\). This means we can subtract any state-dependent baseline \(b(s)\) from \(Q^{\pi_\theta}\) without introducing bias. The natural choice is the value function \(V^{\pi_\theta}(s)\), giving us the advantage function \(A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)\):
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\!\left[A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]The advantage centers the reward signal around zero, substantially reducing variance. This expectation requires sampling from \(\pi_\theta\) itself, meaning we need fresh data after every parameter update.
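The zero-mean property of the score function is easy to verify numerically. A small sketch for a hypothetical three-action softmax policy:

```python
import numpy as np

theta = np.array([0.7, -1.2, 0.3])   # hypothetical logits
pi = np.exp(theta - theta.max())
pi /= pi.sum()

# Score of a softmax policy: grad_theta log pi(a) = onehot(a) - pi.
scores = np.eye(3) - pi              # row a holds grad_theta log pi(a)

# E_{a~pi}[grad log pi(a)] = sum_a pi(a) (onehot(a) - pi) = pi - pi = 0.
expected_score = pi @ scores
print(expected_score)                # ~ [0, 0, 0] up to floating point
```

Because this expectation is zero for any \(\theta\), subtracting \(b(s)\) times the score changes nothing in expectation, only the variance.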
Implementation Notes: The Surrogate Loss
In practice, deep learning frameworks minimize a loss, so we need to translate the policy gradient into something a framework can compute. The standard approach is to define a surrogate loss
\[L_{\mathrm{sur}}(\theta) = -\sum_t A_t \log \pi_\theta(a_t \vert s_t),\]where \(A_t\) is treated as a stop-gradient constant. Its gradient
\[\nabla_\theta L_{\mathrm{sur}} = -\sum_t A_t \nabla_\theta \log \pi_\theta(a_t \vert s_t)\]is exactly the negated policy gradient, so minimizing \(L_{\mathrm{sur}}\) performs a policy gradient ascent step. Note that if we were to backpropagate through \(A_t\) (e.g., when \(A_t\) depends on a learned critic), an extra term \((\nabla_\theta A_t) \log \pi_\theta\) would appear, breaking the correspondence — this is why the advantage must always be detached.
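To see that the surrogate really reproduces the policy gradient, one can compare its finite-difference gradient against \(-A_t \nabla_\theta \log \pi_\theta\) computed analytically. A sketch assuming a softmax policy over three hypothetical logits:

```python
import numpy as np

def log_pi(theta, a):
    # Log-softmax: numerically stable log pi(a|s) for logits theta.
    z = theta - theta.max()
    return z[a] - np.log(np.exp(z).sum())

def surrogate(theta, a, A):
    # A is a stop-gradient constant here: a plain float, not a function of theta.
    return -A * log_pi(theta, a)

theta = np.array([0.3, -0.1, 0.5])   # hypothetical logits
a, A = 2, 1.7

# Central finite differences of the surrogate loss ...
eps = 1e-6
fd = np.array([(surrogate(theta + eps * e, a, A) - surrogate(theta - eps * e, a, A)) / (2 * eps)
               for e in np.eye(3)])

# ... match -A * grad log pi(a|s), using grad log pi(a) = onehot(a) - pi.
pi = np.exp(theta - theta.max())
pi /= pi.sum()
analytic = -A * (np.eye(3)[a] - pi)
print(fd, analytic)
```

If \(A\) were instead a differentiable function of \(\theta\), the finite-difference gradient would pick up the extra \((\nabla_\theta A_t)\log\pi_\theta\) term and the two arrays would no longer agree.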
A common source of confusion is that \(L_{\mathrm{sur}}\) looks like weighted negative log-likelihood, making REINFORCE appear identical to “weighted SFT.” In the special case of binary rewards where \(A_t = 1\) for successful trajectories and \(A_t = 0\) otherwise, the surrogate loss does reduce to NLL on successful trajectories — i.e., online filtered behavior cloning. But in general, the surrogate loss and the true objective
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\]are not the same function: they are merely gradient-equivalent. In supervised learning, \(-\log \pi_\theta(y \vert x)\) is the objective; in policy gradient, \(-A_t \log \pi_\theta(a_t \vert s_t)\) is a tool constructed to reproduce the correct gradient.
This same idea extends beyond vanilla REINFORCE. PPO’s clipped surrogate
\[L^{\mathrm{PPO}} = \mathbb{E}\!\left[\min\!\Big(r_t A_t, \; \mathrm{clip}(r_t, 1{-}\epsilon, 1{+}\epsilon) A_t\Big)\right]\]does not explicitly contain \(\log \pi\), but the importance ratio \(r_t = \pi_\theta / \pi_{\theta_\mathrm{old}}\) is computed via log-probabilities in practice. The underlying pattern is the same: first derive what gradient direction the policy should follow, then construct a surrogate objective that produces it.
Proximal Policy Optimization (PPO)
From On-Policy to Off-Policy
The policy gradient derived above requires sampling from the current policy \(\pi_\theta\): after every parameter update, all previously collected data becomes stale. This is wasteful — we would like to take multiple gradient steps on the same batch of data.
The idea is to use importance sampling to correct for the distribution mismatch. If our data was collected under an old policy \(\pi_{\mathrm{old}}\), we can reweight each sample by the probability ratio between the new and old policies. For a detailed treatment of importance sampling and how it applies to RL, see the importance sampling post.
The IS Surrogate Objective
Starting from the policy gradient in advantage form, we can rewrite the expectation over \(\pi_\theta\) as an expectation over \(\pi_{\mathrm{old}}\) by introducing a single-step IS ratio (see derivation):
\[L^{\mathrm{IS}}(\theta) = \mathbb{E}_{s, a \sim \pi_{\mathrm{old}}}\!\left[\frac{\pi_\theta(a \vert s)}{\pi_{\mathrm{old}}(a \vert s)} \, A^{\pi_{\mathrm{old}}}(s, a)\right]\]At \(\theta = \theta_{\mathrm{old}}\), the ratio equals 1 and the gradient of \(L^{\mathrm{IS}}\) reduces to the standard policy gradient. This means we can take gradient steps on \(L^{\mathrm{IS}}\) using data collected once from \(\pi_{\mathrm{old}}\), without recollecting trajectories after each step.
However, this surrogate only corrects the action distribution mismatch — the state distribution is still drawn from \(d^{\pi_{\mathrm{old}}}\), not \(d^{\pi_\theta}\). As \(\theta\) drifts from \(\theta_{\mathrm{old}}\), the two state distributions diverge and the surrogate can overestimate improvement, causing the policy to overshoot and degrade. See the hidden approximation discussion for details.
Clipping the Ratio
Proximal Policy Optimization (PPO) addresses this by clipping the IS ratio to prevent large updates. Define
\[r_t(\theta) = \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\mathrm{old}}(a_t \vert s_t)}.\]PPO’s clipped surrogate objective is:
\[L^{\mathrm{CLIP}}(\theta) = \mathbb{E}\!\left[\min\!\Big(r_t(\theta)\, \hat{A}_t, \;\operatorname{clip}\!\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t\Big)\right]\]where \(\epsilon\) is a small constant (typically 0.1–0.2). The \(\min\) takes the more pessimistic estimate:
- When \(\hat{A}_t > 0\) (good action): the ratio is capped at \(1 + \epsilon\), preventing the policy from moving too aggressively toward this action.
- When \(\hat{A}_t < 0\) (bad action): the ratio is floored at \(1 - \epsilon\), preventing the policy from moving too aggressively away from this action.
This trades a small amount of bias for much more stable training — rather than hoping the IS ratio stays well-behaved, we simply clip it by force.
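The clipping behavior is easiest to see on single numbers. A minimal sketch of the per-sample clipped objective, assuming \(\epsilon = 0.2\):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    # Per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# Good action (A > 0): once the ratio exceeds 1 + eps the objective is capped,
# so there is no incentive to push the policy further toward this action.
assert ppo_clip_objective(1.5, adv=2.0) == 1.2 * 2.0

# Bad action (A < 0): the min picks the clipped (more pessimistic) branch,
# so pushing the ratio below 1 - eps earns no further objective improvement.
assert ppo_clip_objective(0.5, adv=-2.0) == 0.8 * -2.0

# Inside the trust region the ratio passes through unchanged.
assert ppo_clip_objective(1.1, adv=2.0) == 1.1 * 2.0
```

Note that outside the clip region the objective is flat in the ratio, so the gradient with respect to \(\theta\) is simply zero there: the update for that sample is suppressed entirely.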
Why the Log-Form and Ratio-Form Losses Share the Same Gradient Direction
At first glance, the REINFORCE surrogate
\[L_{\mathrm{PG}}(\theta) = -A_t \log \pi_\theta(a_t \vert s_t)\]explicitly contains \(\log \pi_\theta\), whereas the PPO-style ratio objective
\[L_{\mathrm{ratio}}(\theta) = -A_t \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\mathrm{old}}(a_t \vert s_t)}\]does not. It may seem surprising that both lead to essentially the same update direction. The key is the identity \(\nabla_\theta \pi_\theta(a \vert s) = \pi_\theta(a \vert s) \, \nabla_\theta \log \pi_\theta(a \vert s)\), which explains why a loss written in terms of \(\pi_\theta\) can still produce a gradient in terms of \(\nabla_\theta \log \pi_\theta\).
REINFORCE form. The gradient of \(L_{\mathrm{PG}}\) is straightforward:
\[\nabla_\theta L_{\mathrm{PG}}(\theta) = -A_t \nabla_\theta \log \pi_\theta(a_t \vert s_t).\]The score function \(\nabla_\theta \log \pi_\theta\) appears explicitly.
Ratio form. Since \(\pi_{\mathrm{old}}(a_t \vert s_t)\) is constant with respect to \(\theta\):
\[\nabla_\theta L_{\mathrm{ratio}}(\theta) = -\frac{A_t}{\pi_{\mathrm{old}}(a_t \vert s_t)} \nabla_\theta \pi_\theta(a_t \vert s_t).\]Applying \(\nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta\):
\[\nabla_\theta L_{\mathrm{ratio}}(\theta) = -A_t \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\mathrm{old}}(a_t \vert s_t)} \nabla_\theta \log \pi_\theta(a_t \vert s_t) = -A_t \, r_t(\theta) \, \nabla_\theta \log \pi_\theta(a_t \vert s_t),\]where \(r_t(\theta) = \pi_\theta(a_t \vert s_t) / \pi_{\mathrm{old}}(a_t \vert s_t)\).
Core conclusion. Even though the ratio-form loss does not explicitly contain \(\log \pi_\theta\), its gradient still has the same core score-function direction \(\nabla_\theta \log \pi_\theta(a_t \vert s_t)\). The difference is that PPO introduces an additional multiplicative weight \(r_t(\theta)\), and in practice also clipping, to control how aggressively the policy moves relative to the old policy. So:
- REINFORCE directly optimizes a surrogate linear in \(\log \pi_\theta\);
- PPO optimizes a surrogate linear in the probability ratio \(r_t(\theta)\);
- but after differentiation, both are driven by the same score-function direction \(\nabla_\theta \log \pi_\theta\).
In one sentence: a loss does not need to explicitly contain \(\log \pi_\theta\) for its gradient to involve \(\nabla_\theta \log \pi_\theta\), because \(\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta\).
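Both claims can be confirmed numerically with finite differences, again assuming a hypothetical softmax policy: at \(\theta = \theta_{\mathrm{old}}\) the two gradients coincide, and elsewhere they differ exactly by the scalar \(r_t(\theta)\):

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta_old = np.array([0.2, -0.4, 0.1])   # hypothetical old-policy logits
a, A = 0, 1.3
pi_old_a = softmax(theta_old)[a]

def L_pg(theta):     # REINFORCE surrogate: -A log pi(a|s)
    return -A * np.log(softmax(theta)[a])

def L_ratio(theta):  # ratio surrogate: -A pi(a|s) / pi_old(a|s)
    return -A * softmax(theta)[a] / pi_old_a

def fd_grad(f, theta, eps=1e-6):
    return np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
                     for e in np.eye(len(theta))])

# At theta = theta_old the ratio is 1 and the two gradients coincide.
assert np.allclose(fd_grad(L_pg, theta_old), fd_grad(L_ratio, theta_old), atol=1e-6)

# Away from theta_old they differ by the scalar weight r_t(theta).
theta_new = theta_old + np.array([0.3, -0.2, 0.1])
r = softmax(theta_new)[a] / pi_old_a
assert np.allclose(fd_grad(L_ratio, theta_new), r * fd_grad(L_pg, theta_new), atol=1e-6)
```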
Why You Cannot Simply Multiply the PPO Objective by \(\log \pi_\theta\)
A natural but incorrect idea: since REINFORCE involves \(\log \pi_\theta(a_t \vert s_t)\), why not define a PPO-style objective as
\[\tilde{L}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[r_t(\theta) \, A_t \, \log \pi_\theta(a_t \vert s_t)\right]?\]This is not the correct importance-sampled policy gradient objective — it introduces an extra factor in the gradient and changes the optimization problem entirely.
Correct PPO surrogate. The standard surrogate \(L_{\mathrm{PPO}}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}[r_t(\theta) \, A_t]\) has gradient
\[\nabla_\theta L_{\mathrm{PPO}}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \, \nabla_\theta r_t(\theta)\right] = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \, r_t(\theta) \, \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right],\]which is exactly the desired importance-weighted score-function gradient.
Incorrect objective with extra \(\log \pi_\theta\). The gradient of \(\tilde{L}\) requires the product rule on \(r_t(\theta) \log \pi_\theta(a_t \vert s_t)\):
\[\nabla_\theta \tilde{L}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \Big((\nabla_\theta r_t) \log \pi_\theta + r_t \, \nabla_\theta \log \pi_\theta\Big)\right].\]Substituting \(\nabla_\theta r_t = r_t \nabla_\theta \log \pi_\theta\):
\[\nabla_\theta \tilde{L}(\theta) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[A_t \, r_t(\theta) \, \bigl(1 + \log \pi_\theta(a_t \vert s_t)\bigr) \, \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right].\]Compared with the correct gradient \(A_t \, r_t \, \nabla_\theta \log \pi_\theta\), this has an extra multiplicative factor \(1 + \log \pi_\theta(a_t \vert s_t)\). Since \(\log \pi_\theta(a_t \vert s_t) \le 0\), this factor becomes negative whenever \(\pi_\theta(a_t \vert s_t) < e^{-1}\) — meaning the gradient can push the policy in the opposite direction of the intended update, even when \(A_t > 0\).
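This sign flip is easy to exhibit numerically. In the sketch below, hypothetical logits are chosen so that \(\pi_\theta(a) < e^{-1}\); the correct surrogate raises the logit of an advantaged action while the broken one lowers it:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta_old = np.array([2.0, 0.0, 0.0])   # pi(a=1) ~ 0.11, so log pi(a) < -1
a, A = 1, 1.0                            # an advantaged action (A > 0)
pi_old_a = softmax(theta_old)[a]

def L_ppo(theta):   # correct surrogate (to maximize): r_t * A
    return A * softmax(theta)[a] / pi_old_a

def L_bad(theta):   # incorrect: extra log pi factor
    p = softmax(theta)[a]
    return A * (p / pi_old_a) * np.log(p)

def fd_grad(f, theta, eps=1e-6):
    return np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
                     for e in np.eye(3)])

g_ok, g_bad = fd_grad(L_ppo, theta_old), fd_grad(L_bad, theta_old)

# The correct gradient raises the logit of the advantaged action ...
assert g_ok[a] > 0
# ... but the extra (1 + log pi) factor is negative here, so the broken
# objective pushes the same logit down despite A > 0.
assert g_bad[a] < 0
```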
The importance ratio \(r_t(\theta)\) exists solely to correct for the sampling distribution mismatch. It should multiply the advantage — the quantity whose expectation we want to estimate — and nothing else. Inserting an extra \(\log \pi_\theta\) changes the objective itself and breaks the correspondence with the policy gradient theorem.
Actor-Critic
The Critic
In the policy gradient, the single-sample reward-to-go \(\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{t'}, a_{t'})\) serves as our estimate of \(Q^{\pi_\theta}(s_t, a_t)\). This is unbiased — in expectation it equals the true action-value — but high-variance, because a single trajectory may encounter lucky or unlucky transitions.
Can we get a better estimate? The idea is to fit a model to predict expected returns, rather than relying on a single sample. Define three value functions:
\[Q^\pi(s, a) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\!\left[r(s_{t'}, a_{t'}) \vert s_t, a_t\right]\] \[V^\pi(s) = \mathbb{E}_{a \sim \pi_\theta(a \vert s)}\!\left[Q^\pi(s, a)\right]\] \[A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\]The advantage \(A^\pi\) tells us how much better action \(a\) is compared to the average action from state \(s\). Using the advantage in the policy gradient:
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t})\]The better this estimate, the lower the variance. The key insight is that we only need to fit \(V^\pi(s)\), because \(A^\pi(s, a) \approx r(s, a) + \gamma V^\pi(s') - V^\pi(s)\), which requires only the value function and one observed transition.
How do we train this value function? This is the policy evaluation problem: given a fixed policy \(\pi_\theta\), estimate \(V^\pi(s)\). We fit a neural network \(\hat{V}^\pi_\phi(s)\) with parameters \(\phi\) by supervised regression:
\[\mathcal{L}(\phi) = \frac{1}{2} \sum_i \left\lVert \hat{V}^\pi_\phi(s_i) - y_i \right\rVert^2\]The question is what target \(y_i\) to use. The Monte Carlo target \(y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\) is unbiased but noisy — the same function must fit many different sampled trajectories from the same state. The bootstrapped (TD) target \(y_{i,t} = r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})\) is lower variance because it replaces all future randomness with the current estimate, but introduces bias — if \(\hat{V}^\pi_\phi\) is wrong (and it always is, at least initially), the target is wrong too. Here the discount factor \(\gamma \in [0, 1]\) keeps values finite when episodes are long or infinite; one interpretation is that \(\gamma\) adds a \((1 - \gamma)\) probability of “death” at each step, making the effective horizon finite.
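The two targets are cheap to compute side by side. A sketch on a hypothetical four-step trajectory (the rewards and critic values are made-up numbers for illustration):

```python
import numpy as np

gamma = 0.99
# One sampled trajectory: rewards r_0..r_3, plus the critic's current
# estimates V_hat(s_0)..V_hat(s_4) (all numbers hypothetical).
rewards = np.array([1.0, 0.0, 0.5, 2.0])
v_hat   = np.array([2.9, 2.1, 2.3, 1.9, 0.0])   # v_hat[-1] = 0: terminal state

# Monte Carlo targets: discounted reward-to-go, computed right to left.
mc = np.zeros(len(rewards))
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    mc[t] = running

# TD(0) targets: one real reward, then bootstrap from the critic.
td = rewards + gamma * v_hat[1:]

print(mc)   # unbiased but noisy across resampled trajectories
print(td)   # lower variance, biased wherever v_hat is wrong
```

Resampling the trajectory changes every entry of `mc`, but only the single reward term of each `td` entry: that is the variance reduction, bought at the price of trusting `v_hat`.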
The Algorithm
An actor-critic method maintains two components:
- Actor: the policy \(\pi_\theta(a \vert s)\), updated by policy gradient.
- Critic: a value function \(\hat{V}^\pi_\phi(s) \approx V^\pi(s)\), trained by regression.
Batch actor-critic:
- Sample \(\{s_i, a_i\}\) from \(\pi_\theta(a \vert s)\) (run the policy).
- Fit \(\hat{V}^\pi_\phi(s)\) to sampled reward sums.
- Evaluate \(\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)\).
- \(\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i \vert s_i) \hat{A}^\pi(s_i, a_i)\).
- \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\).
Online actor-critic updates after every single transition \((s, a, s', r)\):
- Take action \(a \sim \pi_\theta(a \vert s)\), observe \((s, a, s', r)\).
- Update \(\hat{V}^\pi_\phi\) using target \(r + \gamma \hat{V}^\pi_\phi(s')\).
- Evaluate \(\hat{A}^\pi(s, a) = r(s, a) + \gamma \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)\).
- \(\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \vert s) \hat{A}^\pi(s, a)\).
- \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\).
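The online loop can be sketched end to end on a trivial problem. The example below is a sketch, not a production implementation: it assumes a hypothetical one-state episodic task where action 0 yields reward 1 and action 1 yields 0, with manual gradients for a softmax actor and a scalar critic:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical task: one state, two actions, reward 1 for action 0 and 0
# otherwise; the episode ends after one step, so V(terminal) = 0.
theta = np.zeros(2)            # actor: softmax over logits
v = 0.0                        # critic: scalar estimate of V(s)
alpha_pi, alpha_v = 0.1, 0.1

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                   # take action, observe reward
    r = 1.0 if a == 0 else 0.0
    v += alpha_v * (r - v)                    # critic step toward target r
    adv = r - v                               # TD advantage (terminal next state)
    theta += alpha_pi * adv * (np.eye(2)[a] - pi)   # actor: A * grad log pi

print(softmax(theta))   # probability mass shifts toward the rewarded action
```

After training, the actor concentrates on the rewarded action and the critic's value tracks the expected reward under the improving policy.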
The online version works on a single transition — no need to collect full trajectories. In practice, it works best with a batch from parallel workers: multiple agents collect transitions simultaneously, and their gradients are aggregated before each update. This is the idea behind A3C (Mnih et al., 2016), which runs asynchronous parallel actors each contributing gradients to a shared parameter server.
The actor and critic can use two separate networks (simple, stable, but no shared features) or a single network with two heads (shared early layers, separate output heads for \(\pi_\theta\) and \(\hat{V}^\pi_\phi\)). The shared design is more parameter-efficient and can learn common state representations, but couples the two learning problems. In the LLM era, this question takes a new form: can a pretrained language model serve as both actor and critic? The LM-as-critic post explores the surprising difficulties of attaching a value head to a pretrained backbone.
Advantage Estimation
The actor-critic gradient uses the TD advantage \(r + \gamma \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)\) — lower variance but biased (since the critic is imperfect). The Monte Carlo policy gradient uses \(\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} - b\) — unbiased but higher variance. One middle ground is to use the Monte Carlo return but subtract the critic as a state-dependent baseline:
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \left(\underbrace{\sum_{t'=t}^{T} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})}_{\substack{\mathrm{Monte\ Carlo\ return} \\ \mathrm{(unbiased,\ high\ variance)}}} - \underbrace{\hat{V}^\pi_\phi(s_{i,t})}_{\substack{\mathrm{critic\ as\ baseline} \\ \mathrm{(state\ dependent)}}}\right)\]The reward-to-go sum is the same Monte Carlo return from the vanilla policy gradient — no bootstrapping, so no bias. But instead of subtracting a constant baseline \(b\), we subtract the critic’s value estimate \(\hat{V}^\pi_\phi(s_{i,t})\), which depends on the state. Since any state-dependent baseline preserves unbiasedness (shown above), this remains unbiased. And because \(\hat{V}^\pi_\phi(s)\) is close to the expected return from each state, it reduces variance far more effectively than a constant \(b\).
More generally, the one-step TD advantage and the full Monte Carlo return are two extremes. We can interpolate with \(n\)-step returns: use \(n\) steps of actual rewards, then bootstrap:
\[\hat{A}_n^\pi(s_t, a_t) = \underbrace{\sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(s_{t'}, a_{t'})}_{\mathrm{actual\ rewards\ (}n\mathrm{\ steps)}} \underbrace{- \; \hat{V}^\pi_\phi(s_t)}_{\mathrm{baseline}} + \underbrace{\gamma^n \hat{V}^\pi_\phi(s_{t+n})}_{\mathrm{bootstrap\ remainder}}\]The first term sums \(n\) steps of actual observed rewards, discounted back to time \(t\). The second term subtracts the critic’s estimate at the current state (the baseline). The third term “fills in” the remaining future by bootstrapping from the critic at step \(t+n\) — discounted by \(\gamma^n\) because that state is \(n\) steps away. When \(n = 1\), we get the TD advantage \(r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)\): mostly bootstrapped, low variance but biased. When \(n = T - t\), the bootstrap vanishes and we recover the full Monte Carlo return minus a baseline: unbiased but high variance. Choosing \(n > 1\) often works better than either extreme.
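As a function, the \(n\)-step advantage is a few lines. The sketch below uses hypothetical numbers, with `values` carrying one extra entry so the bootstrap term is always defined:

```python
def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """A_n(s_t, a_t): n steps of real rewards, then bootstrap from the critic.

    `values` holds V_hat(s_0)..V_hat(s_T), one more entry than `rewards`,
    so values[t + n] always exists (0 for a terminal state).
    """
    n = min(n, len(rewards) - t)                  # truncate at episode end
    ret = sum(gamma**k * rewards[t + k] for k in range(n))
    return ret + gamma**n * values[t + n] - values[t]

rewards = [1.0, 0.0, 0.5, 2.0]                    # hypothetical trajectory
values  = [2.9, 2.1, 2.3, 1.9, 0.0]               # hypothetical critic outputs

print(n_step_advantage(rewards, values, t=0, n=1))   # one-step TD advantage
print(n_step_advantage(rewards, values, t=0, n=4))   # full MC return minus baseline
```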
Generalized advantage estimation (Schulman, Moritz, Levine, Jordan, Abbeel, 2016) takes this further: instead of choosing a single \(n\), take a weighted combination of all \(n\)-step advantages with exponentially decaying weights \(w_n \propto \lambda^{n-1}\):
\[\hat{A}^{\mathrm{GAE}}(s_t, a_t) = \sum_{t'=t}^{\infty} (\gamma \lambda)^{t'-t} \delta_{t'}\]where \(\delta_{t'} = r(s_{t'}, a_{t'}) + \gamma \hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})\) is the one-step TD residual. The parameter \(\lambda\) controls how far into the future we look before trusting the critic: \(\lambda = 0\) reduces to the one-step TD advantage; \(\lambda = 1\) recovers the Monte Carlo advantage with a state-dependent baseline. In practice, \(\lambda \approx 0.95{-}0.97\) works well.
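In code, GAE is usually computed with a single reverse pass using the equivalent recursion \(\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}\). A sketch with hypothetical numbers, checking both limiting cases of \(\lambda\):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE via the reverse recursion A_t = delta_t + gamma * lam * A_{t+1}.

    `values` has one more entry than `rewards`: the bootstrap value of the
    final state (0 if the episode terminated there).
    """
    rewards, values = np.asarray(rewards), np.asarray(values)
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD residuals
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = [1.0, 0.0, 0.5, 2.0]               # hypothetical trajectory
values  = [2.9, 2.1, 2.3, 1.9, 0.0]          # hypothetical critic outputs

# lam = 0: reduces to the one-step TD advantage, i.e. just delta_t.
deltas = np.asarray(rewards) + 0.99 * np.asarray(values)[1:] - np.asarray(values)[:-1]
assert np.allclose(gae(rewards, values, lam=0.0), deltas)

# lam = 1: recovers the Monte Carlo return minus the state-dependent baseline.
mc, running = np.zeros(4), 0.0
for t in reversed(range(4)):
    running = rewards[t] + 0.99 * running
    mc[t] = running
assert np.allclose(gae(rewards, values, lam=1.0), mc - np.asarray(values)[:-1])
```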
The on-policy requirement — needing fresh data after every update — is a major limitation addressed by PPO above. For a deeper treatment of importance sampling in RL, see the importance sampling post.