Generalizable Value Functions and Emotions (?)

Critics and Value Functions

In a previous post on policy gradients, we introduced the actor-critic architecture: an actor (the policy \(\pi_\theta\)) that chooses actions, and a critic (a value function \(\hat{V}_\phi\)) that evaluates states. The two components serve fundamentally different roles — the actor answers “what should I do?” while the critic answers “how good is the situation I’m in?”.

A natural question is: why do we need the critic at all? After all, the vanilla policy gradient (REINFORCE) trains the actor without any value function. The actor collects trajectories, observes the rewards, and updates its parameters to make high-reward actions more likely. This works — in principle. In practice, it breaks down for exactly the reason that motivates this post: variance.

The Variance Problem

Consider a pure actor method applied to a 20-step web agent task. The agent receives a single binary reward at the end: 1 for success, 0 for failure. REINFORCE computes the policy gradient as:

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \cdot R_i\]

Every token in a successful trajectory gets reinforced equally. Every token in a failed trajectory gets suppressed equally. The gradient tells the actor “do more of everything you did when you succeeded” — it cannot distinguish the decisive click from a useless scroll. Each 20-step rollout contributes 20 decisions but only 1 bit of feedback. The signal-to-noise ratio is abysmal.

A constant baseline \(b\) (e.g., the mean reward across the batch) helps: it centers the advantages so that average-performing trajectories get near-zero gradient. But it cannot distinguish states. An agent in a clearly good state (already on the right product page) and an agent in a clearly bad state (stuck on an error page) receive advantages that differ only in the trajectory’s final outcome — not in their current prospects.
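As a toy illustration of both points, the trajectory-level weighting and the constant batch-mean baseline can be sketched in a few lines (the batch size, horizon, and rewards below are hypothetical):

```python
import numpy as np

# Hypothetical batch: N rollouts of a 20-step task, binary terminal reward.
N, T = 4, 20
R = np.array([1.0, 0.0, 1.0, 0.0])  # one terminal reward per rollout

# Vanilla REINFORCE: every step in rollout i is weighted by the same R_i.
per_step_weight = np.repeat(R[:, None], T, axis=1)   # shape (N, T)

# Constant baseline b = batch mean: centers the advantages, but every
# step within a rollout still shares one state-blind scalar.
adv = per_step_weight - R.mean()

assert (adv[0] == adv[0, 0]).all()  # no per-step credit assignment
```

Centering removes the common offset across the batch, but the advantage still cannot tell a promising intermediate state from a hopeless one.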

A State-Dependent Baseline Already Helps

In fact, REINFORCE itself can use \(V_\phi(s_t)\) as a baseline. This is perfectly valid — any function of \(s_t\) can serve as a baseline without introducing bias (as shown in the policy gradient post). The advantage becomes:

\[A(s_t, a_t) = G_t - V_\phi(s_t)\]

where \(G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\) is the Monte Carlo return. This is still REINFORCE — the return \(G_t\) comes from the actual trajectory, no bootstrapping involved — but with a much better baseline. Instead of asking “was this trajectory better than average?”, it asks “was this trajectory better than expected from this state?”.

This already provides meaningful per-step signal. If the agent is on the right product page (\(V_\phi(s_t)\) is high) and the trajectory succeeds (\(G_t\) is high), the advantage is small — success was expected. If the agent is on the homepage (\(V_\phi(s_t)\) is moderate) and navigates directly to the right category (leading to eventual success), the advantage is large — the outcome was better than expected from that state.
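A minimal sketch of this computation, using made-up value estimates for a 5-step trajectory with a sparse terminal reward:

```python
import numpy as np

def mc_returns(rewards, gamma=0.99):
    """Discounted Monte Carlo returns G_t, computed backward over a trajectory."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Sparse terminal reward; the value estimates are hypothetical V_phi(s_t).
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
values = np.array([0.3, 0.4, 0.7, 0.8, 0.9])

G = mc_returns(rewards, gamma=1.0)  # undiscounted: every step "led to" success
adv = G - values                    # better-than-expected from each state
```

Note how the same outcome yields larger credit for steps taken from low-value states: `adv` decreases along the trajectory even though every `G[t]` is identical.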

Why \(V(s)\) reduces variance. Without a baseline, success always gives advantage 1, failure always gives 0 — the gradient cannot distinguish easy states from hard ones. With \(V(s)\), advantages are centered per state: succeeding from an easy state (high \(V\)) gets a small advantage, while succeeding from a hard state (low \(V\)) gets a large one.
Can \(Q(s, a)\) be used as a baseline?

A valid baseline must be a function of \(s\) only — not of \(a\). The reason is that the baseline property relies on:

$$\mathbb{E}_{a \sim \pi}\!\left[\nabla_\theta \log \pi(a \vert s) \cdot b(s)\right] = b(s) \cdot \nabla_\theta \sum_a \pi(a \vert s) = b(s) \cdot \nabla_\theta 1 = 0$$

The key step is pulling \(b(s)\) out of the expectation over \(a\) — which is only valid when \(b\) does not depend on \(a\). Since \(Q(s, a)\) depends on \(a\), it cannot be pulled out, and subtracting it would change the expected gradient — introducing bias.
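This identity can be checked exactly for a small softmax policy, summing over all actions rather than sampling (the logits and the baseline value below are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy softmax policy over 4 actions; we compute the expectation exactly.
pi = softmax(np.array([0.5, -1.0, 2.0, 0.0]))
b = 3.7  # any value that depends only on the state

# For a softmax policy, grad_theta log pi(a) = one_hot(a) - pi.
grad_log_pi = np.eye(4) - pi  # row a holds the gradient for action a

# E_{a ~ pi}[grad log pi(a) * b], summed exactly over all actions:
bias = (pi[:, None] * grad_log_pi * b).sum(axis=0)
assert np.allclose(bias, 0.0)  # a state-only baseline adds no bias
```

Replacing the scalar `b` with a vector of action-dependent values breaks the cancellation, which is exactly why \(Q(s,a)\) cannot sit in the baseline slot.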

\(Q(s, a)\) plays a different role: it is the signal itself, not a baseline. The policy gradient theorem states:

$$\nabla_\theta J = \mathbb{E}\!\left[\nabla_\theta \log \pi(a \vert s) \cdot Q^\pi(s, a)\right]$$

The advantage \(A(s, a) = Q(s, a) - V(s)\) uses \(Q\) as the signal and \(V\) as the baseline. Replacing \(V\) with \(Q\) in the baseline slot would collapse the advantage to zero — subtracting the signal from itself.

So Why Actor-Critic?

If REINFORCE + \(V(s)\) baseline already gives per-step credit, what does actor-critic add? The difference is in how the advantage is computed. REINFORCE uses:

\[A^{\text{MC}}(s_t, a_t) = G_t - V_\phi(s_t) \qquad \text{(Monte Carlo return minus baseline)}\]

Actor-critic replaces the MC return with a bootstrapped estimate:

\[A^{\text{TD}}(s_t, a_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \qquad \text{(one-step TD residual)}\]

The MC version is unbiased — \(G_t\) is the true sampled return. But it is still high-variance: \(G_t\) sums rewards over many future steps, each subject to stochastic transitions. In a 20-step web task where the environment is non-stationary (pages load differently, A/B tests shift layouts), this sum carries enormous noise.

The TD version replaces all future randomness with the critic’s estimate \(V_\phi(s_{t+1})\). This is biased — if the critic is wrong, the advantage is wrong — but dramatically lower variance, because the estimate depends only on the immediate transition \((s_t, a_t, s_{t+1})\) rather than the entire future trajectory. In practice, a slightly biased but low-variance gradient is far more useful for learning than an unbiased but noisy one.
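The two estimators can be sketched side by side (a simplified version that assumes the value after the final step is zero):

```python
import numpy as np

def mc_advantage(rewards, values, gamma=0.99):
    """A^MC_t = G_t - V(s_t): unbiased, but sums noise over the whole future."""
    G, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G - values

def td_advantage(rewards, values, gamma=0.99):
    """A^TD_t = r_t + gamma*V(s_{t+1}) - V(s_t): biased by the critic,
    but depends only on the immediate transition (V = 0 past the end)."""
    v_next = np.append(values[1:], 0.0)
    return rewards + gamma * v_next - values
```

With an accurate critic the two agree in expectation; the TD version simply swaps the sampled future for the critic's summary of it, which is where the variance reduction comes from.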

This is the real contribution of the critic in actor-critic: not just providing a baseline (REINFORCE can do that too), but enabling bootstrapping — replacing noisy future returns with learned predictions. The critic becomes load-bearing: its accuracy directly determines the quality of the actor’s gradient. This is why training the critic well matters so much, and why ArCHer trains the critic before the actor.

Where Does Q-Learning Fit?

Both REINFORCE and actor-critic are policy gradient methods: they explicitly parameterize and optimize a policy \(\pi_\theta\). The value function — whether \(V(s)\) as a baseline or \(V(s)\) for bootstrapping — is a helper that improves the policy’s gradient. The policy remains the primary object being learned.

Q-learning inverts this relationship entirely. There is no explicit policy. Instead, the value function is the primary object: learn \(Q^*(s, a)\) directly via the Bellman optimality equation

\[Q^*(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a')\right]\]

and extract the policy implicitly as \(\pi^*(s) = \arg\max_a Q^*(s, a)\). The value function does not serve the policy — the value function replaces it.

This creates a fundamental difference in what the value function needs to capture. In actor-critic, the critic learns \(V^\pi(s)\) — the expected return under the current policy. It only needs to evaluate states, not compare actions. In Q-learning, \(Q^*(s, a)\) must be action-sensitive: it must distinguish the expected return of different actions from the same state, because that distinction is the policy. This is a much harder representational requirement, especially in the LM setting where the action space (all possible token sequences) is combinatorially large. The Q-learning scalability post examines whether this requirement can be met at scale.
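For contrast, a minimal tabular sketch on a made-up two-state MDP shows the value function playing the role of the policy (classic Q-learning, not the LM setting):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step toward the Bellman optimality target."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Toy MDP: action 1 from state 0 reaches state 1, where action 0 earns
# reward 1 and terminates. Replay these two transitions repeatedly.
Q = np.zeros((2, 2))
for _ in range(50):
    q_update(Q, s=0, a=1, r=0.0, s_next=1, done=False)
    q_update(Q, s=1, a=0, r=1.0, s_next=1, done=True)

policy = Q.argmax(axis=1)  # the policy is implicit: argmax_a Q(s, a)
```

Note that nothing resembling \(\pi_\theta\) is ever trained; `policy` is read off the table, which is precisely the representational burden the text describes.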

Critic = Learned Value Function

A common source of confusion: “critic” and “value function” are not two different things. The critic is a value function — specifically, a learned approximation \(\hat{V}_\phi(s) \approx V^\pi(s)\), trained to predict expected returns. The term “critic” emphasizes its role in the actor-critic architecture (evaluating the actor’s behavior); “value function” emphasizes its mathematical definition (expected cumulative reward from a state). They refer to the same object.

In classical RL, the critic is typically a small neural network trained from scratch. In the LM setting, the critic is usually a copy of the base language model with a scalar value head replacing the token prediction head — as explored in the LM-as-critic post. This reuse of the pretrained backbone gives the critic a strong prior over language and visual representations, but introduces its own challenges (representation drift, sensitivity to the readout position, etc.).

The rest of this post examines ArCHer, which applies the actor-critic idea at the turn level rather than the token level — using the critic to assign credit to each step of a multi-turn agent interaction.

What is ArCHer?

ArCHer (Actor-Critic with Hierarchical Turn-Level Credit Assignment) was introduced by Zhou et al. (2024) to address a fundamental limitation of RLHF-style training for multi-step language model agents: trajectory-level rewards provide no signal about which step was responsible for success or failure.

Consider a web agent that completes a 10-step shopping task. Standard PPO assigns the terminal reward to the entire trajectory — all 10 steps receive the same advantage. The agent cannot learn that step 3 (clicking the right product) was the decisive action while step 7 (scrolling past relevant content) was a mistake. Both get reinforced equally.

The original ArCHer paper proposes a hierarchical actor-critic framework whose central idea is: the turn, not the token, is the right granularity for credit assignment in multi-step LM agents. Below we describe the formulation and losses precisely.

Formulation: Multi-Turn MDP

ArCHer models the interaction as a token-level MDP embedded within a turn-level structure. At each turn \(t\), the agent observes a state \(s_t\) (e.g., a web page) and generates a response \(a_t = (a_t^1, a_t^2, \ldots, a_t^L)\) as a sequence of \(L\) tokens. The environment then transitions to a new state \(s_{t+1}\). After \(T\) turns, a scalar reward \(R\) is received.

The key decomposition is between the high-level (turn-level) and low-level (token-level) policies. The high-level policy selects the overall intent at each turn, while the low-level policy autoregressively generates the tokens that implement it. In the simplified ArCHer PPO variant we implement, these are collapsed into a single autoregressive policy — but the turn-level value function is preserved.

The Critic: Turn-Level Value Function

The critic \(V_\phi(s_t)\) predicts the expected discounted return from turn \(t\) onward:

\[V_\phi(s_t) \approx \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{k-t} r_k \mid s_t\right]\]

where \(r_k = 0\) for intermediate turns and \(r_T = R\). This is a state value function — it conditions only on the observation \(s_t\), not on the action \(a_t\). In the LM implementation, the critic is a copy of the base model with the language modeling head replaced by a scalar projection. The value is read at a single token position per turn: the last token of the observation (prompt boundary), where causal masking ensures the hidden state encodes the full observation but none of the response.
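A minimal sketch of this readout, using random stand-ins for the LM hidden states and the scalar head (the dimensions and weights are arbitrary, not from the ArCHer code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
w_value = rng.normal(size=d_model) / np.sqrt(d_model)  # hypothetical value head

def turn_value(hidden_states, obs_len):
    """Read V(s_t) at the last observation token (the prompt boundary).
    Under causal masking this position encodes the full observation but
    none of the response tokens."""
    return float(w_value @ hidden_states[obs_len - 1])

# The readout only touches the boundary position, so wiping everything
# after the prompt leaves the value unchanged:
H = rng.normal(size=(10, d_model))  # stand-in for per-token hidden states
H2 = H.copy()
H2[5:] = 0.0                        # zero out the "response" positions
assert turn_value(H, obs_len=5) == turn_value(H2, obs_len=5)
```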

The critic is trained by regression on Monte Carlo return targets:

\[G_t = \gamma^{T-t} \cdot R\]

with a simple MSE loss:

\[\mathcal{L}_{\text{critic}}(\phi) = \frac{1}{2} \mathbb{E}_t\left[(V_\phi(s_t) - G_t)^2\right]\]

Using MC returns rather than bootstrapped targets \(r_t + \gamma V_\phi(s_{t+1})\) is a deliberate choice: with sparse terminal rewards (\(r_t = 0\) for \(t < T\)), bootstrapped targets reduce to \(\gamma V_\phi(s_{t+1})\), making the critic chase its own predictions. MC returns ground the training in the actual outcome.
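Specializing the MC target to the sparse-reward case (turns indexed 1 through \(T\), reward only at turn \(T\)) gives a very small critic-training sketch:

```python
import numpy as np

def critic_targets(R, T, gamma=0.99):
    """MC targets with a sparse terminal reward: only r_T = R is nonzero,
    so the return from turn t collapses to gamma^(T - t) * R."""
    t = np.arange(1, T + 1)
    return gamma ** (T - t) * R

def critic_loss(values, targets):
    """MSE regression loss for the turn-level critic."""
    return 0.5 * np.mean((values - targets) ** 2)

targets = critic_targets(R=1.0, T=3, gamma=0.5)  # [0.25, 0.5, 1.0]
```

In a real training loop `values` would come from the value head above and the loss would be backpropagated; here it is just the regression objective in numpy.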

The Actor: Step-Level Advantages

Given the critic, the step-level advantage is computed as a TD(0) residual:

\[A(t) = \begin{cases} \gamma V_\phi(s_{t+1}) - V_\phi(s_t) & t < T \\ R - V_\phi(s_T) & t = T \end{cases}\]

This measures whether the agent’s action at turn \(t\) increased or decreased the expected return relative to the critic’s prediction. Positive means the action was better than expected; negative means worse.

The advantage is broadcast to all tokens within the turn: every token \(a_t^l\) in the response at turn \(t\) shares the same advantage \(A(t)\). This is the core ArCHer insight — within a single turn, the agent generates a coherent action (reasoning + command), and assigning different advantages to individual tokens within the same action is noisy and semantically meaningless.
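A sketch of the advantage computation and the broadcast (the value estimates are hypothetical; the discount follows the one-step TD residual, with intermediate rewards equal to zero):

```python
import numpy as np

def step_advantages(values, R, gamma=0.99):
    """Turn-level TD residuals: gamma*V(s_{t+1}) - V(s_t) for t < T,
    and R - V(s_T) at the final turn."""
    A = gamma * np.append(values[1:], 0.0) - values
    A[-1] = R - values[-1]
    return A

def broadcast_to_tokens(A, tokens_per_turn):
    """Every token in turn t shares the turn's advantage A(t)."""
    return np.repeat(A, tokens_per_turn)

A = step_advantages(np.array([0.2, 0.5, 0.8]), R=1.0, gamma=1.0)
token_adv = broadcast_to_tokens(A, [2, 1, 3])  # 6 tokens, 3 distinct values
```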

Trajectory-level PPO assigns the same advantage to every step (red). ArCHer computes per-step TD residuals (blue) — the "scroll" step gets a negative advantage while productive steps get positive signal.

The Actor Loss

The actor is trained with the standard PPO clipped objective, using the step-level advantages:

\[\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{t,l}\left[\min\left(\rho_t^l \, A(t),\; \text{clip}(\rho_t^l, 1-\varepsilon, 1+\varepsilon) \, A(t)\right)\right]\]

where \(\rho_t^l = \frac{\pi_\theta(a_t^l \vert s_t, a_t^{<l})}{\pi_{\theta_{\text{old}}}(a_t^l \vert s_t, a_t^{<l})}\) is the per-token importance ratio. The clipping threshold \(\varepsilon\) (typically 0.2) prevents the policy from moving too far from the rollout policy in a single update.

Note that \(A(t)\) does not depend on \(l\) — it is the same for all tokens in the turn. The per-token ratios \(\rho_t^l\) provide token-level granularity in how much the policy changed, but the direction of the gradient (reinforce or suppress) is determined entirely at the turn level.
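A minimal per-turn version of this loss (the log-probabilities below are illustrative, not from a real model):

```python
import numpy as np

def actor_loss(logp_new, logp_old, turn_adv, eps=0.2):
    """Clipped PPO surrogate over the tokens of one turn: per-token
    ratios, one shared turn-level advantage (negated: lower is better)."""
    rho = np.exp(logp_new - logp_old)
    surrogate = np.minimum(rho * turn_adv,
                           np.clip(rho, 1 - eps, 1 + eps) * turn_adv)
    return -np.mean(surrogate)

# Before any update logp_new == logp_old, so rho = 1 and clipping is inert:
lp = np.log(np.array([0.5, 0.25, 0.25]))
assert np.isclose(actor_loss(lp, lp, turn_adv=0.3), -0.3)
```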

Training Order: Critic First

A subtle but important detail: ArCHer trains the critic before the actor, reversing the standard PPO order. The reason is that the actor’s gradient quality depends entirely on the advantage estimates, which depend on the critic. Training the actor with a stale critic wastes gradient steps on noisy signals. By training the critic first and then recomputing the advantages with the updated \(V_\phi\), the actor always sees the best available value estimates.

In the previous post, we discussed the challenges of using language models as critic functions — the need for action-sensitive representations and the instability of end-to-end TD training. ArCHer sidesteps several of these issues by using MC return targets (grounding the critic in actual outcomes) and by evaluating at a single position per step (the prompt boundary), where causal masking ensures the representation is clean.

What we implement in WebGym is a simplified variant — ArCHer PPO — that drops the hierarchical structure of the original paper and applies the turn-level critic idea directly to a flat PPO training loop for multi-step web agents. There is no high-level goal selector; just a single policy that acts at each turn, with a step-level critic providing per-turn credit assignment. The rest of this post walks through the implementation.

GRPO: Value Estimation by Sampling

GRPO (Group Relative Policy Optimization) takes a radically different approach to the variance problem: instead of learning a value function, estimate it statistically by sampling multiple completions from the same prompt.

The core observation is simple. A learned critic \(V_\phi(s)\) approximates \(\mathbb{E}_{\pi}[G \vert s]\) — the expected return from state \(s\). But there is another way to estimate an expectation: draw samples and take the empirical mean. Given a prompt \(q\), GRPO samples a group of \(G\) completions \(\{o_1, o_2, \ldots, o_G\}\) from the current policy \(\pi_\theta\), scores each with a reward function \(r(q, o_i)\), and uses the group statistics as a baseline:

\[\hat{A}_i = \frac{r(q, o_i) - \text{mean}(\{r(q, o_j)\}_{j=1}^G)}{\text{std}(\{r(q, o_j)\}_{j=1}^G)}\]

This is essentially a Monte Carlo estimate of the advantage — normalized to zero mean and unit variance within each group. No learned parameters, no value head, no critic training loop. The “value function” is replaced by the sample mean of the group.
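The whole estimator fits in a few lines (the small `eps` guarding against a zero-variance group is a common implementation detail, not part of the formula):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score the rewards of G completions
    sampled from the same prompt. No learned critic involved."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # ≈ [1, -1, 1, -1]
```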

The GRPO loss applies the same PPO-style clipping, but at the task level (one advantage per completion) rather than per-token or per-step:

\[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \left[\min\!\left(\rho_i^l \, \hat{A}_i,\; \text{clip}(\rho_i^l, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_i\right) - \beta \, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]\]

where \(\rho_i^l = \frac{\pi_\theta(o_i^l \vert q, o_i^{<l})}{\pi_{\theta_{\text{old}}}(o_i^l \vert q, o_i^{<l})}\) is the per-token importance ratio, and \(\beta\) controls the KL penalty against a reference policy \(\pi_{\text{ref}}\) (typically the SFT model).
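A sketch of the full objective in numpy (the per-token KL estimator \(e^{d} - d - 1\) with \(d = \log \pi_{\text{ref}} - \log \pi_\theta\) is one common implementation choice, not mandated by the formula above):

```python
import numpy as np

def grpo_loss(completions, eps=0.2, beta=0.04):
    """GRPO objective sketch. `completions` holds, per sampled output,
    per-token log-probs under the new, old, and reference policies plus
    its group-normalized advantage A_hat."""
    total = 0.0
    for logp_new, logp_old, logp_ref, A in completions:
        rho = np.exp(logp_new - logp_old)
        surr = np.minimum(rho * A, np.clip(rho, 1 - eps, 1 + eps) * A)
        d = logp_ref - logp_new
        kl = np.exp(d) - d - 1.0  # >= 0, zero iff the policies agree
        total += np.mean(surr - beta * kl)
    return -total / len(completions)
```

With equal log-probs everywhere the KL term vanishes and the loss reduces to the negated mean advantage, as in plain PPO.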

What GRPO Gives Up

The simplicity of GRPO comes with a clear limitation: the advantage \(\hat{A}_i\) is task-level, not step-level. Every token in completion \(o_i\) receives the same advantage — exactly the problem that ArCHer’s turn-level critic was designed to solve. GRPO cannot distinguish which step within a trajectory was responsible for success or failure; it only knows that this completion, as a whole, was better or worse than its peers.

This works well for single-turn tasks (math reasoning, code generation) where the entire output is one coherent response and the reward reflects its overall quality. For multi-step agent tasks — where a 10-step trajectory might have 9 good steps and 1 bad one — the task-level advantage dilutes the signal. The bad step gets reinforced along with the good ones if the trajectory happened to succeed, and the good steps get suppressed if it happened to fail.

The tradeoff is thus between:

  • ArCHer: learns a critic \(V_\phi(s_t)\) to provide per-step credit, but requires training and maintaining a separate value network — with all the challenges of using LMs as critics.
  • GRPO: avoids the critic entirely by brute-force sampling, but is limited to task-level credit assignment. The “value function” is accurate (it converges to the true expectation as \(G \to \infty\)) but only at the granularity of entire trajectories.