Generalizable Value Functions and Emotions (?)

Critics and Value Functions

In a previous post on policy gradients, we introduced the actor-critic architecture: an actor (the policy \(\pi_\theta\)) that chooses actions, and a critic (a value function \(\hat{V}_\phi\)) that evaluates states. The two components serve fundamentally different roles — the actor answers “what should I do?” while the critic answers “how good is the situation I’m in?”.

A natural question is: why do we need the critic at all? After all, the vanilla policy gradient (REINFORCE) trains the actor without any value function. The actor collects trajectories, observes the rewards, and updates its parameters to make high-reward actions more likely. This works — in principle. In practice, it breaks down for exactly the reason that motivates this post: variance.

The Variance Problem

Consider a pure actor method applied to a 20-step web agent task. The agent receives a single binary reward at the end: 1 for success, 0 for failure. REINFORCE computes the policy gradient as:

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \cdot R_i\]

Every token in a successful trajectory gets reinforced equally. Every token in a failed trajectory gets suppressed equally. The gradient tells the actor “do more of everything you did when you succeeded” — it cannot distinguish the decisive click from a useless scroll. With \(N\) rollouts that each take 20 steps, you have 20 decisions but only 1 bit of feedback. The signal-to-noise ratio is abysmal.

A constant baseline \(b\) (e.g., the mean reward across the batch) helps: it centers the advantages so that average-performing trajectories get near-zero gradient. But it cannot distinguish states. An agent in a clearly good state (already on the right product page) and an agent in a clearly bad state (stuck on an error page) receive advantages that differ only in the trajectory’s final outcome — not in their current prospects.
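To make this concrete, here is a minimal numpy sketch (with made-up terminal rewards and a hypothetical 20-step horizon) of what the constant-baseline weights look like: every step inside a rollout receives the same scalar, so there is no per-step credit.

```python
import numpy as np

# Illustrative sketch: REINFORCE with a constant batch-mean baseline.
# Four rollouts of a 20-step task, each with a single binary terminal reward.
returns = np.array([1.0, 0.0, 0.0, 1.0])      # R_i for each rollout
baseline = returns.mean()                      # constant baseline b
advantages = returns - baseline                # one scalar per rollout

# Every one of the 20 steps inside rollout i gets the same weight (R_i - b):
per_step_weights = np.repeat(advantages[:, None], 20, axis=1)   # shape (4, 20)
print(per_step_weights[0])    # 20 identical entries: no per-step credit
```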

A State-Dependent Baseline Already Helps

In fact, REINFORCE itself can use \(V_\phi(s_t)\) as a baseline. This is perfectly valid — any function of \(s_t\) can serve as a baseline without introducing bias (as shown in the policy gradient post). The advantage becomes:

\[A(s_t, a_t) = G_t - V_\phi(s_t)\]

where \(G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\) is the Monte Carlo return. This is still REINFORCE — the return \(G_t\) comes from the actual trajectory, no bootstrapping involved — but with a much better baseline. Instead of asking “was this trajectory better than average?”, it asks “was this trajectory better than expected from this state?”.

This already provides meaningful per-step signal. If the agent is on the right product page (\(V_\phi(s_t)\) is high) and the trajectory succeeds (\(G_t\) is high), the advantage is small — success was expected. If the agent is on the homepage (\(V_\phi(s_t)\) is moderate) and navigates directly to the right category (leading to eventual success), the advantage is large — the outcome was better than expected from that state.
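As a sketch of the state-dependent baseline, the snippet below computes \(G_t\) with a backward sweep over an illustrative sparse-reward trajectory and subtracts hypothetical critic values; all numbers are invented for illustration.

```python
import numpy as np

# Illustrative sketch: Monte Carlo returns minus a state-dependent baseline.
gamma = 0.99
rewards = np.array([0.0, 0.0, 0.0, 1.0])      # sparse reward, success at the end
values  = np.array([0.5, 0.6, 0.9, 0.95])     # hypothetical V_phi(s_t)

# G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}, computed by a backward sweep
returns_to_go = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns_to_go[t] = running

advantages = returns_to_go - values           # "better than expected from s_t?"
print(returns_to_go, advantages)
```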

Why \(V(s)\) reduces variance. Without a baseline, the per-step weight is just the return: 1 for every step of a success, 0 for every step of a failure, so the gradient cannot distinguish easy states from hard ones. With \(V(s)\), advantages are centered per state: succeeding from an easy state (high \(V\)) gets a small advantage, while succeeding from a hard state (low \(V\)) gets a large one.
Can \(Q(s, a)\) be used as a baseline?

A valid baseline must be a function of \(s\) only — not of \(a\). The reason is that the baseline property relies on:

$$\mathbb{E}_{a \sim \pi}\!\left[\nabla_\theta \log \pi(a \vert s) \cdot b(s)\right] = b(s) \cdot \nabla_\theta \sum_a \pi(a \vert s) = b(s) \cdot \nabla_\theta 1 = 0$$

The key step is pulling \(b(s)\) out of the expectation over \(a\) — which is only valid when \(b\) does not depend on \(a\). Since \(Q(s, a)\) depends on \(a\), it cannot be pulled out, and subtracting it would change the expected gradient — introducing bias.

\(Q(s, a)\) plays a different role: it is the signal itself, not a baseline. The policy gradient theorem states:

$$\nabla_\theta J = \mathbb{E}\!\left[\nabla_\theta \log \pi(a \vert s) \cdot Q^\pi(s, a)\right]$$

The advantage \(A(s, a) = Q(s, a) - V(s)\) uses \(Q\) as the signal and \(V\) as the baseline. Replacing \(V\) with \(Q\) in the baseline slot would collapse the advantage to zero — subtracting the signal from itself.
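The cancellation above is easy to verify numerically. The sketch below builds a small softmax policy in PyTorch and checks that the expected score-function term, multiplied by a constant \(b\), has zero gradient; the logits and \(b\) are arbitrary.

```python
import torch

# Numeric check of the baseline property for a small softmax policy.
logits = torch.tensor([0.2, -1.0, 0.5], requires_grad=True)
pi = torch.softmax(logits, dim=0)
b = 3.7   # any quantity that does not depend on the action

# E_{a~pi}[ grad log pi(a|s) * b ], written out as a sum over actions
expectation = (pi.detach() * torch.log(pi) * b).sum()
expectation.backward()
print(logits.grad)   # ~zero: a state-only baseline adds no bias to the gradient
```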

So Why Actor-Critic?

If REINFORCE + \(V(s)\) baseline already gives per-step credit, what does actor-critic add? The difference is in how the advantage is computed. REINFORCE uses:

\[A^{\text{MC}}(s_t, a_t) = G_t - V_\phi(s_t) \qquad \text{(Monte Carlo return minus baseline)}\]

Actor-critic replaces the MC return with a bootstrapped estimate:

\[A^{\text{TD}}(s_t, a_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \qquad \text{(one-step TD residual)}\]

The MC version is unbiased — \(G_t\) is the true sampled return. But it is still high-variance: \(G_t\) sums rewards over many future steps, each subject to stochastic transitions. In a 20-step web task where the environment is non-stationary (pages load differently, A/B tests shift layouts), this sum carries enormous noise.

The TD version replaces all future randomness with the critic’s estimate \(V_\phi(s_{t+1})\). This is biased — if the critic is wrong, the advantage is wrong — but dramatically lower variance, because the estimate depends only on the immediate transition \((s_t, a_t, s_{t+1})\) rather than the entire future trajectory. In practice, a slightly biased but low-variance gradient is far more useful for learning than an unbiased but noisy one.
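A small sketch of the two estimators side by side, with invented critic values and a sparse terminal reward, makes the contrast concrete:

```python
import numpy as np

# Illustrative comparison of the two advantage estimators on one trajectory.
gamma = 0.99
rewards = np.array([0.0, 0.0, 0.0, 1.0])       # sparse terminal reward
values  = np.array([0.5, 0.6, 0.9, 0.95])      # hypothetical V_phi(s_t)
values_next = np.append(values[1:], 0.0)       # V_phi(s_{t+1}), 0 after terminal

# Monte Carlo: unbiased, but sums noise over the whole remaining trajectory
returns_to_go = np.array([
    rewards[t:] @ gamma ** np.arange(len(rewards) - t)
    for t in range(len(rewards))
])
adv_mc = returns_to_go - values

# One-step TD residual: biased by critic error, noise from a single transition
adv_td = rewards + gamma * values_next - values
print(adv_mc, adv_td)
```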

This is the real contribution of the critic in actor-critic: not just providing a baseline (REINFORCE can do that too), but enabling bootstrapping — replacing noisy future returns with learned predictions. The critic becomes load-bearing: its accuracy directly determines the quality of the actor’s gradient. This is why training the critic well matters so much, and why ArCHer trains the critic before the actor.

Where Does Q-Learning Fit?

Both REINFORCE and actor-critic are policy gradient methods: they explicitly parameterize and optimize a policy \(\pi_\theta\). The value function — whether \(V(s)\) as a baseline or \(V(s)\) for bootstrapping — is a helper that improves the policy’s gradient. The policy remains the primary object being learned.

Q-learning inverts this relationship entirely. There is no explicit policy. Instead, the value function is the primary object: learn \(Q^*(s, a)\) directly via the Bellman optimality equation

\[Q^*(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a')\right]\]

and extract the policy implicitly as \(\pi^*(s) = \arg\max_a Q^*(s, a)\). The value function does not serve the policy — the value function replaces it.

This creates a fundamental difference in what the value function needs to capture. In actor-critic, the critic learns \(V^\pi(s)\) — the expected return under the current policy. It only needs to evaluate states, not compare actions. In Q-learning, \(Q^*(s, a)\) must be action-sensitive: it must distinguish the expected return of different actions from the same state, because that distinction is the policy. This is a much harder representational requirement, especially in the LM setting where the action space (all possible token sequences) is combinatorially large. The Q-learning scalability post examines whether this requirement can be met at scale.
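For contrast, here is a minimal toy sketch of the Q-learning update on a small discrete-action problem (the 4-dim states, 3 actions, and the tiny linear Q-network are illustrative, and the usual target network and replay buffer are omitted for brevity). The point is structural: the policy is nothing more than the argmax over the learned \(Q\).

```python
import torch
import torch.nn as nn

# Toy sketch: Q-learning on a 4-dim state, 3-action problem.
q_net = nn.Linear(4, 3)                        # Q(s, .) for the 3 actions
gamma = 0.99

states      = torch.randn(8, 4)
actions     = torch.randint(0, 3, (8,))
rewards     = torch.rand(8)
next_states = torch.randn(8, 4)
dones       = torch.zeros(8)

with torch.no_grad():                          # r + gamma * max_a' Q(s', a')
    targets = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=-1).values

q_sa = q_net(states).gather(1, actions[:, None]).squeeze(1)   # Q(s, a) taken
loss = nn.functional.mse_loss(q_sa, targets)                  # Bellman regression

greedy_actions = q_net(states).argmax(dim=-1)  # the implicit policy: argmax_a Q
```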

Critic = Learned Value Function

A common source of confusion: “critic” and “value function” are not two different things. The critic is a value function — specifically, a learned approximation \(\hat{V}_\phi(s) \approx V^\pi(s)\), trained to predict expected returns. The term “critic” emphasizes its role in the actor-critic architecture (evaluating the actor’s behavior); “value function” emphasizes its mathematical definition (expected cumulative reward from a state). They refer to the same object.

In classical RL, the critic is typically a small neural network trained from scratch. In the LM setting, the critic is usually a copy of the base language model with a scalar value head replacing the token prediction head — as explored in the LM-as-critic post. This reuse of the pretrained backbone gives the critic a strong prior over language and visual representations, but introduces its own challenges (representation drift, sensitivity to the readout position, etc.).
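A rough sketch of that construction, with a placeholder `backbone` standing in for the pretrained model (the names and shapes are assumptions for this sketch, not the exact implementation from that post):

```python
import torch
import torch.nn as nn

# Rough sketch of an LM critic: a pretrained backbone with the LM head
# swapped for a scalar value head. `backbone` is a placeholder that returns
# hidden states of shape (batch, seq_len, d_model).
class LMCritic(nn.Module):
    def __init__(self, backbone, d_model):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(d_model, 1)   # replaces token prediction

    def forward(self, input_ids, attention_mask, readout_pos):
        hidden = self.backbone(input_ids, attention_mask)      # (B, L, d)
        batch_idx = torch.arange(hidden.size(0))
        h = hidden[batch_idx, readout_pos]                      # one position each
        return self.value_head(h).squeeze(-1)                   # V_phi(s), (B,)
```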

The rest of this post examines ArCHer, which applies the actor-critic idea at the turn level rather than the token level — using the critic to assign credit to each step of a multi-turn agent interaction.

What is ArCHer?

ArCHer (Actor-Critic with Hierarchical Turn-Level Credit Assignment) was introduced by Zhou et al. (2024) to address a fundamental limitation of RLHF-style training for multi-step language model agents: trajectory-level rewards provide no signal about which step was responsible for success or failure.

Consider a web agent that completes a 10-step shopping task. Trajectory-level PPO assigns the terminal reward to the entire trajectory — all 10 steps receive the same advantage. The agent cannot learn that step 3 (clicking the right product) was the decisive action while step 7 (scrolling past relevant content) was a mistake. Both get reinforced equally.

The original ArCHer paper proposes a hierarchical actor-critic framework whose central idea is: the turn, not the token, is the right granularity for credit assignment in multi-step LM agents. Below we describe the formulation and losses precisely.

Formulation: Multi-Turn MDP

ArCHer models the interaction as a token-level MDP embedded within a turn-level structure. At each turn \(t\), the agent observes a state \(s_t\) (e.g., a web page) and generates a response \(a_t = (a_t^1, a_t^2, \ldots, a_t^L)\) as a sequence of \(L\) tokens. The environment then transitions to a new state \(s_{t+1}\). After \(T\) turns, a scalar reward \(R\) is received.

The key decomposition is between the high-level (turn-level) and low-level (token-level) policies. The high-level policy selects the overall intent at each turn, while the low-level policy autoregressively generates the tokens that implement it. In the simplified ArCHer PPO variant we implement, these are collapsed into a single autoregressive policy — but the turn-level value function is preserved.
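A possible data layout for this formulation, with field names chosen for illustration rather than taken from the paper:

```python
from dataclasses import dataclass
from typing import List

# Illustrative layout of one rollout in the multi-turn formulation.
@dataclass
class Turn:
    observation: str              # s_t, e.g. the rendered web page
    response_tokens: List[int]    # a_t = (a_t^1, ..., a_t^L)

@dataclass
class Trajectory:
    turns: List["Turn"]           # T turns of interaction
    reward: float                 # scalar R received after the final turn
```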

The Critic: Turn-Level Value Function

The critic \(V_\phi(s_t)\) predicts the expected discounted return from turn \(t\) onward:

\[V_\phi(s_t) \approx \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{k-t} r_k \mid s_t\right]\]

where \(r_k = 0\) for intermediate turns and \(r_T = R\). This is a state value function — it conditions only on the observation \(s_t\), not on the action \(a_t\). In the LM implementation, the critic is a copy of the base model with the language modeling head replaced by a scalar projection. The value is read at a single token position per turn: the last token of the observation (prompt boundary), where causal masking ensures the hidden state encodes the full observation but none of the response.

The critic is trained by regression on Monte Carlo return targets:

\[G_t = \gamma^{\,T-t} \cdot R\]

with a simple MSE loss:

\[\mathcal{L}_{\text{critic}}(\phi) = \frac{1}{2} \mathbb{E}_t\left[(V_\phi(s_t) - G_t)^2\right]\]

Using MC returns rather than bootstrapped targets \(r_t + \gamma V_\phi(s_{t+1})\) is a deliberate choice: with sparse terminal rewards (\(r_t = 0\) for \(t < T\)), bootstrapped targets reduce to \(\gamma V_\phi(s_{t+1})\), making the critic chase its own predictions. MC returns ground the training in the actual outcome.
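A minimal sketch of the critic objective under these choices, assuming `values[i]` holds \(V_\phi\) at the \((i{+}1)\)-th of \(T\) turns of a single trajectory (the indexing and the discount value are illustrative):

```python
import torch

# Sketch of the critic objective: MC return targets for a sparse terminal
# reward, plain MSE regression.
def critic_loss(values, terminal_reward, gamma=0.95):
    T = values.shape[0]
    # G_t = gamma^(T - t) * R, with r_t = 0 for t < T and r_T = R
    exponents = torch.arange(T - 1, -1, -1, dtype=values.dtype)   # T-1, ..., 0
    targets = (gamma ** exponents) * terminal_reward
    return 0.5 * ((values - targets) ** 2).mean()

values = torch.tensor([0.42, 0.55, 0.38, 0.80])   # hypothetical V_phi(s_1..s_4)
print(critic_loss(values, terminal_reward=1.0))
```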

The Actor: Step-Level Advantages

Given the critic, the step-level advantage is computed as a TD(0) residual:

\[A(t) = \begin{cases} \gamma\, V_\phi(s_{t+1}) - V_\phi(s_t) & t < T \\ R - V_\phi(s_T) & t = T \end{cases}\]

This measures whether the agent’s action at turn \(t\) increased or decreased the expected return relative to the critic’s prediction. Positive means the action was better than expected; negative means worse.

The advantage is broadcast to all tokens within the turn: every token \(a_t^l\) in the response at turn \(t\) shares the same advantage \(A(t)\). This is the core ArCHer insight — within a single turn, the agent generates a coherent action (reasoning + command), and assigning different advantages to individual tokens within the same action is noisy and semantically meaningless.
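A sketch of the advantage computation and the broadcast, with invented critic values and response lengths:

```python
import torch

# Sketch: per-turn TD(0) advantages from the turn-level critic, then broadcast
# to every token of the corresponding response. All numbers are illustrative.
def turn_advantages(values, terminal_reward, gamma=0.95):
    adv = torch.empty_like(values)
    adv[:-1] = gamma * values[1:] - values[:-1]   # gamma * V(s_{t+1}) - V(s_t)
    adv[-1] = terminal_reward - values[-1]        # R - V(s_T) at the final turn
    return adv

values = torch.tensor([0.42, 0.55, 0.38, 0.80])    # V_phi(s_1..s_4)
adv = turn_advantages(values, terminal_reward=1.0)

token_counts = [37, 52, 41, 29]                    # response length per turn
per_token_adv = torch.cat([a.repeat(n) for a, n in zip(adv, token_counts)])
```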

Trajectory-level PPO assigns the same advantage to every step (red). ArCHer computes per-step TD residuals (blue) — the "scroll" step gets a negative advantage while productive steps get positive signal.

The Actor Loss

The actor is trained with the standard PPO clipped objective, using the step-level advantages:

\[\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{t,l}\left[\min\left(\rho_t^l \, A(t),\; \text{clip}(\rho_t^l, 1-\varepsilon, 1+\varepsilon) \, A(t)\right)\right]\]

where \(\rho_t^l = \frac{\pi_\theta(a_t^l \vert s_t, a_t^{<l})}{\pi_{\theta_{\text{old}}}(a_t^l \vert s_t, a_t^{<l})}\) is the per-token importance ratio. The clipping threshold \(\varepsilon\) (typically 0.2) prevents the policy from moving too far from the rollout policy in a single update.

Note that \(A(t)\) does not depend on \(l\) — it is the same for all tokens in the turn. The per-token ratios \(\rho_t^l\) provide token-level granularity in how much the policy changed, but the direction of the gradient (reinforce or suppress) is determined entirely at the turn level.
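A sketch of this objective, assuming flattened per-token log-probabilities under the new and rollout policies, together with the broadcast advantages from the previous sketch:

```python
import torch

# Sketch of the clipped objective over flattened response tokens, with the
# turn-level advantages broadcast per token; log-probs here are placeholders.
def actor_loss(logp_new, logp_old, token_adv, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                # rho_t^l per token
    unclipped = ratio * token_adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * token_adv
    return -torch.min(unclipped, clipped).mean()
```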

Training Order: Critic First

A subtle but important detail: ArCHer trains the critic before the actor, reversing the standard PPO order. The reason is that the actor’s gradient quality depends entirely on the advantage estimates, which depend on the critic. Training the actor with a stale critic wastes gradient steps on noisy signals. By training the critic first and then recomputing the advantages with the updated \(V_\phi\), the actor always sees the best available value estimates.

In the previous post, we discussed the challenges of using language models as critic functions — the need for action-sensitive representations and the instability of end-to-end TD training. ArCHer sidesteps several of these issues by using MC return targets (grounding the critic in actual outcomes) and by evaluating at a single position per step (the prompt boundary), where causal masking ensures the representation is clean.

What we implement in WebGym is a simplified variant — ArCHer PPO — that drops the hierarchical structure of the original paper and applies the turn-level critic idea directly to a flat PPO training loop for multi-step web agents. There is no high-level goal selector; just a single policy that acts at each turn, with a step-level critic providing per-turn credit assignment. The rest of this post walks through the implementation.

GRPO: Value Estimation by Sampling

GRPO (Group Relative Policy Optimization) takes a radically different approach to the variance problem: instead of learning a value function, estimate it statistically by sampling multiple completions from the same prompt.

The core observation is simple. A learned critic \(V_\phi(s)\) approximates \(\mathbb{E}_{\pi}[G_t \vert s]\) — the expected return from state \(s\). But there is another way to estimate an expectation: draw samples and take the empirical mean. Given a prompt \(q\), GRPO samples a group of \(G\) completions \(\{o_1, o_2, \ldots, o_G\}\) from the current policy \(\pi_\theta\), scores each with a reward function \(r(q, o_i)\), and uses the group statistics as a baseline:

\[\hat{A}_i = \frac{r(q, o_i) - \text{mean}(\{r(q, o_j)\}_{j=1}^G)}{\text{std}(\{r(q, o_j)\}_{j=1}^G)}\]

This is essentially a Monte Carlo estimate of the advantage — normalized to zero mean and unit variance within each group. No learned parameters, no value head, no critic training loop. The “value function” is replaced by the sample mean of the group.

The GRPO loss applies the same PPO-style clipping, but at the task level (one advantage per completion) rather than per-token or per-step:

\[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \left[\min\!\left(\rho_i^l \, \hat{A}_i,\; \text{clip}(\rho_i^l, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_i\right) - \beta \, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]\]

where \(\rho_i^l = \frac{\pi_\theta(o_i^l \vert q, o_i^{<l})}{\pi_{\theta_{\text{old}}}(o_i^l \vert q, o_i^{<l})}\) is the per-token importance ratio, and \(\beta\) controls the KL penalty against a reference policy \(\pi_{\text{ref}}\) (typically the SFT model).
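A sketch of the group-relative advantage and the token-level surrogate it feeds into; the rewards and group size are illustrative, the small denominator epsilon is a numerical safeguard not present in the formula above, and the KL penalty term is left out here.

```python
import torch

# Sketch: group-relative advantages for one prompt, plus the clipped surrogate.
def grpo_advantages(group_rewards, eps=1e-8):
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

def grpo_surrogate(logp_new, logp_old, adv_i, eps_clip=0.2):
    ratio = torch.exp(logp_new - logp_old)             # per-token rho_i^l
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
    return torch.min(ratio * adv_i, clipped * adv_i).mean()

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0])   # G = 8
print(grpo_advantages(rewards))   # one scalar advantage per completion
```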

What GRPO Gives Up

The simplicity of GRPO comes with a clear limitation: the advantage \(\hat{A}_i\) is task-level, not step-level. Every token in completion \(o_i\) receives the same advantage — exactly the problem that ArCHer’s turn-level critic was designed to solve. GRPO cannot distinguish which step within a trajectory was responsible for success or failure; it only knows that this completion, as a whole, was better or worse than its peers.

This works well for single-turn tasks (math reasoning, code generation) where the entire output is one coherent response and the reward reflects its overall quality. For multi-step agent tasks — where a 10-step trajectory might have 9 good steps and 1 bad one — the task-level advantage dilutes the signal. The bad step gets reinforced along with the good ones if the trajectory happened to succeed, and the good steps get suppressed if it happened to fail.

The tradeoff is thus between:

  • ArCHer: learns a critic \(V_\phi(s_t)\) to provide per-step credit, but requires training and maintaining a separate value network — with all the challenges of using LMs as critics.
  • GRPO: avoids the critic entirely by brute-force sampling, but is limited to task-level credit assignment. The “value function” is accurate (it converges to the true expectation as \(G \to \infty\)) but only at the granularity of entire trajectories.
