Policy Gradient and Actor-Critic

The Policy Gradient

Trajectories and the Objective

In reinforcement learning, an agent interacts with an environment by choosing actions according to a policy \(\pi_\theta(a \vert s)\) — a distribution over actions given a state, parameterized by \(\theta\). Each interaction produces a trajectory:

\[\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)\]

The probability of a trajectory under policy \(\pi_\theta\) is:

\[P^{\pi_\theta}(\tau) = d_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \vert s_t) \, P(s_{t+1} \vert s_t, a_t)\]

where \(d_0\) is the initial state distribution and \(P(s_{t+1} \vert s_t, a_t)\) is the transition probability. This is an alternating product of policy terms (learnable) and environment terms (fixed).

The objective is \(J(\pi_\theta) = \sum_\tau R(\tau) P^{\pi_\theta}(\tau)\) — a sum over all possible trajectories in the MDP. Each trajectory \(\tau\) has a fixed return \(R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t\); what the policy controls is the probability \(P^{\pi_\theta}(\tau)\) assigned to each one. A better policy concentrates probability on high-return trajectories.
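This weighted sum can be computed exactly for a small MDP by enumerating every trajectory. A minimal sketch, using a toy two-state, two-action MDP whose numbers (rewards, transitions, horizon) are all invented for illustration:

```python
import itertools
import numpy as np

# Toy 2-state, 2-action MDP; all numbers are illustrative assumptions.
d0 = np.array([1.0, 0.0])     # always start in state 0
P = np.array([                # P[s, a, s']: transition probabilities
    [[0.8, 0.2], [0.1, 0.9]], # from state 0
    [[0.5, 0.5], [0.3, 0.7]], # from state 1
])
R = np.array([[0.0, 1.0], [2.0, 0.0]])  # reward r(s, a)
gamma, T = 0.9, 2

def trajectory_return(states, actions):
    """R(tau) = sum_t gamma^t r_t, fixed once the trajectory is fixed."""
    return sum(gamma**t * R[s, a] for t, (s, a) in enumerate(zip(states, actions)))

def expected_return(pi):
    """J(pi) = sum over all trajectories of R(tau) * P^pi(tau)."""
    J = 0.0
    for s0, a0, s1, a1 in itertools.product(range(2), repeat=4):
        # alternating product: d0, policy, transition, policy
        prob = d0[s0] * pi[s0, a0] * P[s0, a0, s1] * pi[s1, a1]
        J += prob * trajectory_return((s0, s1), (a0, a1))
    return J

uniform = np.full((2, 2), 0.5)               # spreads probability evenly
greedy = np.array([[0.0, 1.0], [1.0, 0.0]])  # concentrates on high-reward actions
print(expected_return(uniform), expected_return(greedy))
```

The trajectories and their returns never change between the two calls; only the weights \(P^{\pi_\theta}(\tau)\) do, which is why the greedy policy scores higher.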

Figure 1: The expected return as a weighted sum over trajectory space. All trajectories fan out from a common start; use the playbar to watch them extend. Click parts of the formula to highlight what each represents: R(τ) colors by return, Pπ(τ) shows probability via thickness. Switch policies to see how the same trajectories get reweighted.

Deriving REINFORCE

To differentiate \(J\) with respect to \(\theta\), we use the log-derivative trick \(\nabla P = P \nabla \log P\):

\[\nabla_\theta J = \sum_\tau R(\tau) \nabla_\theta P^{\pi_\theta}(\tau) = \sum_\tau R(\tau) P^{\pi_\theta}(\tau) \nabla_\theta \log P^{\pi_\theta}(\tau)\] \[= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \nabla_\theta \log P^{\pi_\theta}(\tau)\right]\]
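The log-derivative trick is easy to check numerically. A quick sketch, assuming a one-parameter Bernoulli policy with \(P(a{=}1) = \sigma(\theta)\) (an illustrative choice, not anything from the derivation above), comparing both sides of \(\nabla P = P \nabla \log P\) via central finite differences:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta, eps = 0.3, 1e-6
P = sigmoid(theta)  # P(a = 1) under the toy policy

# Finite-difference estimates of dP/dtheta and d(log P)/dtheta
dP = (sigmoid(theta + eps) - sigmoid(theta - eps)) / (2 * eps)
dlogP = (math.log(sigmoid(theta + eps)) - math.log(sigmoid(theta - eps))) / (2 * eps)

print(dP, P * dlogP)  # the two sides agree up to finite-difference error
```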

Now expand the log-probability of a trajectory:

\[\log P^{\pi_\theta}(\tau) = \log d_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \vert s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} \vert s_t, a_t)\]

The initial state distribution \(d_0\) and transition dynamics \(P\) do not depend on \(\theta\), so their gradients vanish. Only the policy terms survive:

\[\nabla_\theta J = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right]\]

This is the trajectory-level REINFORCE estimator. Because rewards earned before time \(t\) cannot depend on the action \(a_t\), each score term needs to be weighted only by the reward-to-go from \(t\) onward; taking expectations turns the reward-to-go into \(Q^{\pi_\theta}\) and yields a per-step form over the discounted state occupancy \(d^{\pi_\theta}\):

\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\!\left[Q^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]

where \(Q^{\pi_\theta}(s, a)\) is the action-value function.
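The estimator is easiest to see on a one-step bandit, where \(R(\tau)\) is just the reward of the single action. A minimal sketch, with invented reward values and a softmax policy (for which \(\nabla_\theta \log \pi_\theta(a) = e_a - \pi\) when \(\theta\) is one logit per action), comparing the Monte Carlo REINFORCE estimate against the exact gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step bandit with a softmax policy; rewards are illustrative.
rewards = np.array([1.0, 3.0, 0.0])
theta = np.zeros(3)  # one logit per action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)

# Exact gradient: sum_a r(a) * grad pi(a), with grad pi(a) = pi(a) (e_a - pi).
exact = sum(rewards[a] * pi[a] * (np.eye(3)[a] - pi) for a in range(3))

# REINFORCE: average R * grad log pi(a) over sampled actions,
# using grad_theta log softmax(a) = e_a - pi.
n = 200_000
actions = rng.choice(3, size=n, p=pi)
grads = np.eye(3)[actions] - pi               # (n, 3) score vectors
estimate = (rewards[actions, None] * grads).mean(axis=0)

print(exact, estimate)  # close, up to Monte Carlo noise
```

Note the estimate pushes probability toward action 1 (the highest reward) and away from the others, exactly what ascending this gradient would do.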

Variance Reduction: Baselines and the Advantage

A useful property of \(\nabla_\theta \log \pi_\theta\) is that its expectation under the policy is zero: \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \vert s)] = \sum_a \nabla_\theta \pi_\theta(a \vert s) = \nabla_\theta 1 = 0\). This means we can subtract any state-dependent baseline \(b(s)\) from \(Q^{\pi_\theta}\) without introducing bias. The natural choice is the value function \(V^{\pi_\theta}(s)\), giving us the advantage function \(A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)\):

\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\!\left[A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]

The advantage centers the reward signal around zero, substantially reducing variance. Because this expectation samples from \(\pi_\theta\) itself, the estimator is on-policy: we need fresh data after every parameter update.
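Both claims can be verified numerically. A small sketch with invented numbers: a softmax policy at a single state, using the rewards as a stand-in for \(Q^{\pi_\theta}(s,\cdot)\), checking that the score function has zero mean under the policy and that subtracting the value baseline shrinks the estimator's variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Softmax policy over 3 actions at one state; all numbers are illustrative.
rewards = np.array([1.0, 3.0, 0.0])   # plays the role of Q(s, a)
theta = np.array([0.5, -0.2, 0.1])
pi = np.exp(theta - theta.max()); pi /= pi.sum()
scores = np.eye(3) - pi               # grad_theta log pi(a), one row per action

# (1) Zero-mean score: E_{a~pi}[grad log pi(a|s)] = 0,
#     so any state-dependent baseline leaves the gradient unbiased.
print(pi @ scores)                    # ~ [0, 0, 0]

# (2) Subtracting the value baseline V = E_{a~pi}[Q] reduces variance.
V = pi @ rewards
actions = rng.choice(3, size=100_000, p=pi)
g_plain = rewards[actions, None] * scores[actions]          # Q * score
g_base = (rewards[actions] - V)[:, None] * scores[actions]  # (Q - V) * score
print(g_plain.var(axis=0).sum(), g_base.var(axis=0).sum())  # baseline is lower
```

The two estimators have the same mean (the policy gradient) but the baselined version has markedly lower total variance, which is the whole point of the advantage.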

Actor-Critic

Coming soon.

The on-policy requirement — needing fresh data after every update — is a major limitation. The importance sampling post shows how IS ratios enable off-policy reuse via the surrogate objective, and how PPO clips these ratios to keep updates stable.


The policy gradient sections of this post draw on Nan Jiang’s lecture notes on importance sampling and policy gradient from CS 443 at UIUC, which cover these topics in much greater depth — including per-step IS, doubly robust estimators, natural policy gradient, and formal analysis of the distribution mismatch problem.