Are Value Functions Generalizable?
What is a Realistic Environment?
Most RL training environments for language model agents are simulated: synthetic websites with hand-crafted layouts, deterministic transitions, and a fixed set of pages. The agent learns to navigate these sandboxes, but the skills often fail to transfer to real websites — the visual complexity, layout diversity, and unpredictable behavior of the live web are absent from training.
WebGym (Bai et al., 2025) takes the opposite approach: train on real websites directly. The agent interacts with actual, live web pages through a browser, receiving raw pixel screenshots as observations and issuing click/type/scroll actions. This makes the environment realistic in several concrete ways:
- Visual complexity. Real websites contain logos, advertisements, dynamic content, inconsistent layouts, and decorative elements that synthetic environments omit. The agent’s vision-language model must extract task-relevant information from this noise — there is no clean, semantic abstraction layer.
- Stochastic transitions. The same action on a real website can produce different results depending on server state, network latency, A/B testing variants, or content updates. Unlike simulated environments where \(P(s' \vert s, a)\) is fixed, the transition dynamics are genuinely non-stationary. Some websites block automated access entirely, causing tasks to fail for reasons unrelated to the agent’s skill.
- Diverse task structure. Tasks span shopping, food & cooking, sports, health, entertainment, news, finance, travel, and education — each with different UI conventions, navigation patterns, and success criteria. An agent trained on WebGym cannot overfit to a single website’s layout; it must generalize across radically different interfaces.
- Partial observability across turns. At each turn, the agent sees only the current screenshot and a sliding window of recent history (typically 4 rounds). Older screenshots are replaced with text summaries. The agent cannot revisit past observations — it must decide what information to retain and what to discard, much like a human browsing with limited working memory.
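The sliding-window bookkeeping is simple enough to sketch. Here `summarize` is a hypothetical stand-in for the LM-generated text summary, and the 4-round window matches the default mentioned above; the class names are illustrative, not from WebGym's codebase:

```python
from collections import deque

def summarize(screenshot):
    # Hypothetical stand-in: in practice this would be an LM-generated
    # text summary of the screenshot.
    return f"[summary of {screenshot}]"

class SlidingHistory:
    """Keep the last `window` raw observations; older ones become text summaries."""

    def __init__(self, window=4):
        self.recent = deque(maxlen=window)  # raw screenshots still in the window
        self.summaries = []                 # text stand-ins for evicted screenshots

    def push(self, screenshot):
        if len(self.recent) == self.recent.maxlen:
            # The oldest raw screenshot is about to be evicted:
            # keep only its text summary.
            self.summaries.append(summarize(self.recent[0]))
        self.recent.append(screenshot)

    def context(self):
        # What the agent actually conditions on at this turn.
        return self.summaries + list(self.recent)

h = SlidingHistory(window=4)
for i in range(6):
    h.push(f"screenshot_{i}")
print(h.context())  # 2 summaries followed by 4 raw screenshots
```

The point is that the decision of what to retain is forced by the data structure itself: anything outside the window survives only as whatever the summary captured.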
These properties make WebGym a challenging testbed for RL algorithms. Standard trajectory-level PPO — which works well enough in simulated environments with short horizons and deterministic dynamics — struggles here. The long horizons (10–30 steps), sparse terminal rewards, and high variance from stochastic transitions mean that assigning a single reward to an entire trajectory provides almost no usable learning signal. This is the motivation for a turn-level value function: if we cannot control the environment’s stochasticity, we can at least decompose the credit assignment problem into manageable per-step pieces.
What is ArCHer?
ArCHer (Actor-Critic with Hierarchical Turn-Level Credit Assignment) was introduced by Zhou et al. (2024) to address a fundamental limitation of RLHF-style training for multi-step language model agents: trajectory-level rewards provide no signal about which step was responsible for success or failure.
Consider a web agent that completes a 10-step shopping task. Standard PPO assigns the terminal reward to the entire trajectory — all 10 steps receive the same advantage. The agent cannot learn that step 3 (clicking the right product) was the decisive action while step 7 (scrolling past relevant content) was a mistake. Both get reinforced equally.
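The contrast is easy to see numerically. In the sketch below the per-turn value estimates are invented purely for illustration; the point is that trajectory-level credit is constant across steps, while differences of a per-turn value function single out the decisive action and the mistake:

```python
R, T = 1.0, 10   # terminal reward for a 10-step episode

# Trajectory-level credit: every step receives the same advantage.
traj_advantages = [R] * T   # (baseline subtraction omitted for simplicity)

# Turn-level credit: value differences distinguish individual steps.
# These V estimates are invented for illustration only.
V = [0.5, 0.5, 0.5, 0.5, 0.8, 0.8, 0.8, 0.8, 0.6, 0.6, 1.0]
turn_advantages = [V[t + 1] - V[t] for t in range(T)]

print(traj_advantages)   # identical everywhere: no step is distinguished
print(turn_advantages)   # one clearly positive step, one clearly negative step
```

Under the uniform scheme, the good click and the wasted scroll are reinforced identically; under the turn-level scheme, the value jump rewards the click and the value drop penalizes the scroll.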
The original ArCHer paper proposes a hierarchical actor-critic framework whose central idea is: the turn, not the token, is the right granularity for credit assignment in multi-step LM agents. Below we describe the formulation and losses precisely.
Formulation: Multi-Turn MDP
ArCHer models the interaction as a token-level MDP embedded within a turn-level structure. At each turn \(t\), the agent observes a state \(s_t\) (e.g., a web page) and generates a response \(a_t = (a_t^1, a_t^2, \ldots, a_t^L)\) as a sequence of \(L\) tokens. The environment then transitions to a new state \(s_{t+1}\). After \(T\) turns, a scalar reward \(R\) is received.
The key decomposition is between the high-level (turn-level) and low-level (token-level) policies. The high-level policy selects the overall intent at each turn, while the low-level policy autoregressively generates the tokens that implement it. In the simplified ArCHer PPO variant we implement, these are collapsed into a single autoregressive policy — but the turn-level value function is preserved.
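As a concrete representation of this structure (the container names and fields below are illustrative, not from the paper):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    """One turn: observe s_t, then emit the L-token response a_t."""
    observation: str            # s_t, e.g. a screenshot reference or page text
    response_tokens: List[int]  # a_t = (a_t^1, ..., a_t^L)
    reward: float = 0.0         # r_t: zero everywhere except the terminal turn

@dataclass
class Trajectory:
    """A T-turn episode; the scalar reward R arrives at the final turn."""
    turns: List[Turn] = field(default_factory=list)

    @property
    def terminal_reward(self) -> float:
        return self.turns[-1].reward if self.turns else 0.0

traj = Trajectory(turns=[
    Turn("page_1", [11, 12, 13]),
    Turn("page_2", [14, 15], reward=1.0),  # success on the last turn
])
print(traj.terminal_reward)  # → 1.0
```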
The Critic: Turn-Level Value Function
The critic \(V_\phi(s_t)\) predicts the expected discounted return from turn \(t\) onward:
\[V_\phi(s_t) \approx \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{k-t} r_k \mid s_t\right]\]where \(r_k = 0\) for intermediate turns and \(r_T = R\). This is a state value function — it conditions only on the observation \(s_t\), not on the action \(a_t\). In the LM implementation, the critic is a copy of the base model with the language modeling head replaced by a scalar projection. The value is read at a single token position per turn: the last token of the observation (prompt boundary), where causal masking ensures the hidden state encodes the full observation but none of the response.
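A minimal sketch of that read-out, assuming the value head has already produced one scalar per token; the only real logic is indexing the last observation position:

```python
def value_at_prompt_boundary(per_token_values, obs_lengths):
    """Pick one value per turn: the scalar-head output at the last
    observation token. Under causal masking, that position has attended
    to the whole observation but to none of the response tokens.

    per_token_values: for each turn, the value head's output at every
        token of the (observation + response) sequence.
    obs_lengths: number of observation tokens in each turn.
    """
    return [vals[obs_len - 1]
            for vals, obs_len in zip(per_token_values, obs_lengths)]

# One turn: 3 observation tokens followed by 2 response tokens.
per_token = [[0.1, 0.2, 0.3, 0.9, 0.4]]
print(value_at_prompt_boundary(per_token, [3]))  # → [0.3]
```

Everything after index `obs_len - 1` is response territory and is deliberately ignored: this is a state value function, not an action-value function.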
The critic is trained by regression on Monte Carlo return targets:
\[G_t = \gamma^{T-t} \cdot R\]with a simple MSE loss:
\[\mathcal{L}_{\text{critic}}(\phi) = \frac{1}{2} \mathbb{E}_t\left[(V_\phi(s_t) - G_t)^2\right]\]Using MC returns rather than bootstrapped targets \(r_t + \gamma V_\phi(s_{t+1})\) is a deliberate choice: with sparse terminal rewards (\(r_t = 0\) for \(t < T\)), bootstrapped targets reduce to \(\gamma V_\phi(s_{t+1})\), making the critic chase its own predictions. MC returns ground the training in the actual outcome.
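Putting target and loss together, a minimal sketch (turns indexed 1..T, with the terminal reward discounted back per the return definition above):

```python
def mc_return_targets(num_turns, R, gamma=0.95):
    """Monte Carlo targets G_t = gamma^(T - t) * R for turns t = 1..T."""
    T = num_turns
    return [gamma ** (T - t) * R for t in range(1, T + 1)]

def critic_mse_loss(values, targets):
    """L_critic = 0.5 * mean over turns of (V_phi(s_t) - G_t)^2."""
    n = len(values)
    return 0.5 * sum((v - g) ** 2 for v, g in zip(values, targets)) / n

# A 3-turn episode that ends in success (R = 1.0), gamma = 0.9:
targets = mc_return_targets(num_turns=3, R=1.0, gamma=0.9)
print(targets)  # reward discounted back: roughly [0.81, 0.9, 1.0]
print(critic_mse_loss([0.0, 0.0, 0.0], targets))
```

Note that the targets depend only on the observed outcome `R`, never on the critic's own predictions, which is exactly the grounding property argued for above.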
The Actor: Step-Level Advantages
Given the critic, the step-level advantage is computed as a TD(0) residual:
\[A(t) = \begin{cases} \gamma \, V_\phi(s_{t+1}) - V_\phi(s_t) & t < T \\ R - V_\phi(s_T) & t = T \end{cases}\]This measures whether the agent’s action at turn \(t\) increased or decreased the expected return relative to the critic’s prediction. Positive means the action was better than expected; negative means worse.
The advantage is broadcast to all tokens within the turn: every token \(a_t^l\) in the response at turn \(t\) shares the same advantage \(A(t)\). This is the core ArCHer insight — within a single turn, the agent generates a coherent action (reasoning + command), and assigning different advantages to individual tokens within the same action is noisy and semantically meaningless.
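A sketch of the advantage computation and the per-token broadcast; we include the discount on the next-turn value, matching the TD(0) residual with zero intermediate rewards:

```python
def step_advantages(values, R, gamma=0.95):
    """Turn-level advantages from V estimates for a T-turn episode.

    values: [V(s_1), ..., V(s_T)]. Intermediate rewards are zero, so the
    TD(0) residual is gamma * V(s_{t+1}) - V(s_t); the final turn compares
    the actual outcome R against the critic's last prediction.
    """
    T = len(values)
    return [
        gamma * values[t + 1] - values[t] if t < T - 1 else R - values[t]
        for t in range(T)
    ]

def broadcast_to_tokens(advantages, token_counts):
    """Every token generated in turn t shares the same advantage A(t)."""
    return [[a] * n for a, n in zip(advantages, token_counts)]

adv = step_advantages([0.25, 0.75], R=1.0, gamma=1.0)    # gamma=1 keeps numbers exact
print(adv)                               # → [0.5, 0.25]
print(broadcast_to_tokens(adv, [2, 3]))  # → [[0.5, 0.5], [0.25, 0.25, 0.25]]
```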
The Actor Loss
The actor is trained with the standard PPO clipped objective, using the step-level advantages:
\[\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{t,l}\left[\min\left(\rho_t^l \, A(t),\; \text{clip}(\rho_t^l, 1-\varepsilon, 1+\varepsilon) \, A(t)\right)\right]\]where \(\rho_t^l = \frac{\pi_\theta(a_t^l \vert s_t, a_t^{<l})}{\pi_{\theta_{\text{old}}}(a_t^l \vert s_t, a_t^{<l})}\) is the per-token importance ratio. The clipping threshold \(\varepsilon\) (typically 0.2) prevents the policy from moving too far from the rollout policy in a single update.
Note that \(A(t)\) does not depend on \(l\) — it is the same for all tokens in the turn. The per-token ratios \(\rho_t^l\) provide token-level granularity in how much the policy changed, but the direction of the gradient (reinforce or suppress) is determined entirely at the turn level.
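A self-contained sketch of the per-turn loss; log-probabilities stand in for the actor's outputs, and the turn-level advantage is shared across all tokens:

```python
import math

def clipped_actor_loss(logp_new, logp_old, turn_advantage, eps=0.2):
    """PPO clipped objective over one turn's response tokens.

    logp_new / logp_old: per-token log-probs under the current and rollout
    policies. turn_advantage is A(t), shared by every token in the turn.
    """
    terms = []
    for ln, lo in zip(logp_new, logp_old):
        rho = math.exp(ln - lo)                    # per-token importance ratio
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        terms.append(min(rho * turn_advantage, clipped * turn_advantage))
    return -sum(terms) / len(terms)

# Unchanged policy: all ratios are 1, so the loss is just -A(t).
print(clipped_actor_loss([-1.0, -2.0], [-1.0, -2.0], turn_advantage=0.5))  # → -0.5
# A token whose probability doubled is clipped at 1 + eps = 1.2.
print(clipped_actor_loss([math.log(2.0)], [0.0], turn_advantage=0.5))      # → -0.6
```

The second call shows the clipping at work: the raw ratio of 2.0 would contribute 2.0 × A(t), but the clipped term caps the surrogate at 1.2 × A(t).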
Training Order: Critic First
A subtle but important detail: ArCHer trains the critic before the actor, reversing the standard PPO order. The reason is that the actor’s gradient quality depends entirely on the advantage estimates, which depend on the critic. Training the actor with a stale critic wastes gradient steps on noisy signals. By training the critic first and then recomputing the advantages with the updated \(V_\phi\), the actor always sees the best available value estimates.
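The ordering itself can be made explicit. The callables below are hypothetical stand-ins for the real critic fit, advantage computation, and actor update; only the sequencing is the point:

```python
def archer_update(fit_critic, compute_advantages, update_actor):
    """One update in the critic-first order: fit V_phi on MC targets,
    recompute advantages with the freshly fitted critic, then step the actor."""
    fit_critic()
    advantages = compute_advantages()  # sees the updated critic
    update_actor(advantages)

# Trace the ordering with stub callables.
events = []
archer_update(
    fit_critic=lambda: events.append("fit_critic"),
    compute_advantages=lambda: (events.append("advantages"), [0.1])[-1],
    update_actor=lambda adv: events.append(f"actor({len(adv)} advs)"),
)
print(events)  # → ['fit_critic', 'advantages', 'actor(1 advs)']
```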
In the previous post, we discussed the challenges of using language models as critic functions — the need for action-sensitive representations and the instability of end-to-end TD training. ArCHer sidesteps several of these issues by using MC return targets (grounding the critic in actual outcomes) and by evaluating at a single position per step (the prompt boundary), where causal masking ensures the representation is clean.
What we implement in WebGym is a simplified variant — ArCHer PPO — that drops the hierarchical structure of the original paper and applies the turn-level critic idea directly to a flat PPO training loop for multi-step web agents. There is no high-level goal selector; just a single policy that acts at each turn, with a step-level critic providing per-turn credit assignment. The rest of this post walks through the implementation.