Generalizable Value Functions and Introverted Intuition (Ni)

The Role of Value Functions

In a previous post on policy gradients, we introduced the actor-critic architecture: an actor (the policy \(\pi_\theta\)) that chooses actions, and a critic (a value function \(\hat{V}_\phi\)) that evaluates states. The actor answers “what should I do?”; the critic answers “how good is the situation I’m in?”.

A natural first question is: why bother with the critic at all? Vanilla policy gradient (REINFORCE) trains the actor without any value function; Q-learning, on the other hand, replaces the actor entirely with a value function. The full picture is that value functions show up in two distinct roles:

  • In policy gradient, the critic is a helper. It reduces gradient variance and supplies per-step credit. We use ArCHer as the worked example, then take a dialectical look at why critic-free variants like GRPO can also succeed.
  • In Q-learning, the value function is no longer a helper — it is the policy. There is no separate actor; \(\arg\max_a Q^*(s, a)\) acts directly.

We walk through both before stepping back to ask what these functions are actually doing under the hood.

The Variance Problem in Policy Gradient

Consider a pure actor method applied to a 20-step web agent task. The agent receives a single binary reward at the end: 1 for success, 0 for failure. REINFORCE computes the policy gradient as:

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \cdot R_i\]

Every token in a successful trajectory gets reinforced equally. Every token in a failed trajectory gets suppressed equally. The gradient tells the actor “do more of everything you did when you succeeded” — it cannot distinguish the decisive click from a useless scroll. Each 20-step rollout contains 20 decisions but yields only 1 bit of feedback, so the signal-to-noise ratio is abysmal.

A constant baseline \(b\) (e.g., the mean reward across the batch) helps: it centers the advantages so that average-performing trajectories get near-zero gradient. But it cannot distinguish states. An agent in a clearly good state (already on the right product page) and an agent in a clearly bad state (stuck on an error page) receive advantages that differ only in the trajectory’s final outcome — not in their current prospects.
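
To see the problem concretely, here is a minimal numpy sketch (the rollouts and rewards are made up): with or without a constant batch-mean baseline, every step inside a trajectory receives exactly the same scalar weight.

```python
import numpy as np

# Hypothetical batch: 4 rollouts of a 20-step task, binary terminal reward.
rewards = np.array([1.0, 0.0, 0.0, 1.0])   # one scalar per trajectory
T = 20

# REINFORCE with no baseline: every step's weight is the trajectory return.
adv_no_baseline = np.repeat(rewards[:, None], T, axis=1)          # shape (4, 20)

# Constant (batch-mean) baseline: centers the weights, still one value per trajectory.
b = rewards.mean()
adv_const_baseline = np.repeat((rewards - b)[:, None], T, axis=1)

# Every step within a trajectory gets an identical weight -- no per-step credit.
assert np.allclose(adv_no_baseline.std(axis=1), 0.0)
assert np.allclose(adv_const_baseline.std(axis=1), 0.0)
```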

A State-Dependent Baseline Already Helps

In fact, REINFORCE itself can use \(V_\phi(s_t)\) as a baseline. This is perfectly valid — any function of \(s_t\) can serve as a baseline without introducing bias (as shown in the policy gradient post). The advantage becomes:

\[A(s_t, a_t) = G_t - V_\phi(s_t)\]

where \(G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\) is the Monte Carlo return. This is still REINFORCE — the return \(G_t\) comes from the actual trajectory, no bootstrapping involved — but with a much better baseline. Instead of asking “was this trajectory better than average?”, it asks “was this trajectory better than expected from this state?”.

This already provides meaningful per-step signal. If the agent is on the right product page (\(V_\phi(s_t)\) is high) and the trajectory succeeds (\(G_t\) is high), the advantage is small — success was expected. If the agent is on the homepage (\(V_\phi(s_t)\) is moderate) and navigates directly to the right category (leading to eventual success), the advantage is large — the outcome was better than expected from that state.
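
A small sketch of this advantage computation (numpy; the trajectory, rewards, and critic values are illustrative placeholders):

```python
import numpy as np

def mc_state_baseline_advantages(rewards, values, gamma=1.0):
    """REINFORCE with a state-dependent baseline: A_t = G_t - V(s_t).

    rewards: per-step rewards for one trajectory (here 0 except the last step).
    values:  critic predictions V(s_t) for each visited state.
    """
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):            # G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(values)

# Hypothetical 5-step rollout with a sparse terminal reward of 1.
rewards = [0, 0, 0, 0, 1]
values  = [0.4, 0.5, 0.8, 0.9, 0.95]        # critic thinks prospects improve over time
print(mc_state_baseline_advantages(rewards, values))
# Early steps (low V) get larger positive advantages than late steps (high V).
```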

Why \(V(s)\) reduces variance. Without a baseline, success always gives advantage 1, failure always gives 0 — the gradient cannot distinguish easy states from hard ones. With \(V(s)\), advantages are centered per state: succeeding from an easy state (high \(V\)) gets a small advantage, while succeeding from a hard state (low \(V\)) gets a large one.
Can \(Q(s, a)\) be used as a baseline?

A valid baseline must be a function of \(s\) only — not of \(a\). The reason is that the baseline property relies on:

$$\mathbb{E}_{a \sim \pi}\!\left[\nabla_\theta \log \pi(a \vert s) \cdot b(s)\right] = b(s) \cdot \nabla_\theta \sum_a \pi(a \vert s) = b(s) \cdot \nabla_\theta 1 = 0$$

The key step is pulling \(b(s)\) out of the expectation over \(a\) — which is only valid when \(b\) does not depend on \(a\). Since \(Q(s, a)\) depends on \(a\), it cannot be pulled out, and subtracting it would change the expected gradient — introducing bias.

\(Q(s, a)\) plays a different role: it is the signal itself, not a baseline. The policy gradient theorem states:

$$\nabla_\theta J = \mathbb{E}\!\left[\nabla_\theta \log \pi(a \vert s) \cdot Q^\pi(s, a)\right]$$

The advantage \(A(s, a) = Q(s, a) - V(s)\) uses \(Q\) as the signal and \(V\) as the baseline. Replacing \(V\) with \(Q\) in the baseline slot would collapse the advantage to zero — subtracting the signal from itself.

From Baseline to Bootstrapping

If REINFORCE + \(V(s)\) baseline already gives per-step credit, what does actor-critic add? The difference is in how the advantage is computed. REINFORCE uses:

\[A^{\text{MC}}(s_t, a_t) = G_t - V_\phi(s_t) \qquad \text{(Monte Carlo return minus baseline)}\]

Actor-critic replaces the MC return with a bootstrapped estimate:

\[A^{\text{TD}}(s_t, a_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \qquad \text{(one-step TD residual)}\]

The MC version is unbiased — \(G_t\) is the true sampled return. But it is still high-variance: \(G_t\) sums rewards over many future steps, each subject to stochastic transitions. In a 20-step web task where the environment is non-stationary (pages load differently, A/B tests shift layouts), this sum carries enormous noise.

The TD version replaces all future randomness with the critic’s estimate \(V_\phi(s_{t+1})\). This is biased — if the critic is wrong, the advantage is wrong — but dramatically lower variance, because the estimate depends only on the immediate transition \((s_t, a_t, s_{t+1})\) rather than the entire future trajectory. In practice, a slightly biased but low-variance gradient is far more useful for learning than an unbiased but noisy one.

This is the real contribution of the critic in actor-critic: not just providing a baseline (REINFORCE can do that too), but enabling bootstrapping — replacing noisy future returns with learned predictions. The critic becomes load-bearing: its accuracy directly determines the quality of the actor’s gradient. This is why training the critic well matters so much.
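
The two estimators side by side, as a sketch (numpy; the reward and value arrays are placeholders for one trajectory):

```python
import numpy as np

def advantages(rewards, values, next_values, gamma=0.99):
    """Compare the two advantage estimators discussed above for one trajectory.

    values[t]      = V_phi(s_t)
    next_values[t] = V_phi(s_{t+1}), with 0 for the terminal state.
    """
    rewards, values, next_values = map(np.asarray, (rewards, values, next_values))

    # Monte Carlo: full sampled return minus baseline (unbiased, high variance).
    G = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    a_mc = G - values

    # TD(0): bootstrap the future with the critic (biased by critic error, low variance).
    a_td = rewards + gamma * next_values - values
    return a_mc, a_td
```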

A Worked Example in Policy Gradient: ArCHer

Before going on, a clarification: “critic” and “value function” are not two different things. The critic is a value function — specifically, a learned approximation \(\hat{V}_\phi(s) \approx V^\pi(s)\), trained to predict expected returns. “Critic” emphasizes its role in the actor-critic architecture; “value function” emphasizes its mathematical definition. They refer to the same object. In the LM setting the critic is usually a copy of the base language model with a scalar value head replacing the token prediction head — see the LM-as-critic post for the engineering details.
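
As a rough sketch of that construction (PyTorch; the `backbone` interface is an assumption, and the readout position anticipates the ArCHer convention described below):

```python
import torch
import torch.nn as nn

class LMCritic(nn.Module):
    """A critic built from a copy of the base LM: the token-prediction head is
    replaced by a scalar value head. `backbone` is assumed to return per-token
    hidden states of shape (batch, seq_len, d_model)."""

    def __init__(self, backbone: nn.Module, d_model: int):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(d_model, 1)   # scalar projection instead of vocab logits

    def forward(self, obs_tokens: torch.Tensor, obs_lengths: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(obs_tokens)                       # (B, L, d_model)
        # Read the value at the last observation token (prompt boundary):
        # causal masking means this position has seen the full observation
        # and none of the response.
        idx = (obs_lengths - 1).clamp(min=0)
        boundary = hidden[torch.arange(hidden.size(0)), idx]     # (B, d_model)
        return self.value_head(boundary).squeeze(-1)             # (B,) scalar V_phi(s_t)
```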

ArCHer (Actor-Critic with Hierarchical Turn-Level Credit Assignment) was introduced by Zhou et al. (2024) to address a fundamental limitation of RLHF-style training for multi-step language model agents: trajectory-level rewards provide no signal about which step was responsible for success or failure.

Consider a web agent that completes a 10-step shopping task. Standard PPO assigns the terminal reward to the entire trajectory — all 10 steps receive the same advantage. The agent cannot learn that step 3 (clicking the right product) was the decisive action while step 7 (scrolling past relevant content) was a mistake. Both get reinforced equally.

The original ArCHer paper proposes a hierarchical actor-critic framework whose central idea is: the turn, not the token, is the right granularity for credit assignment in multi-step LM agents. Below we describe the formulation and losses precisely.

Formulation: Multi-Turn MDP

ArCHer models the interaction as a token-level MDP embedded within a turn-level structure. At each turn \(t\), the agent observes a state \(s_t\) (e.g., a web page) and generates a response \(a_t = (a_t^1, a_t^2, \ldots, a_t^L)\) as a sequence of \(L\) tokens. The environment then transitions to a new state \(s_{t+1}\). After \(T\) turns, a scalar reward \(R\) is received.

The key decomposition is between the high-level (turn-level) and low-level (token-level) policies. The high-level policy selects the overall intent at each turn, while the low-level policy autoregressively generates the tokens that implement it. In the simplified ArCHer PPO variant we implement, these are collapsed into a single autoregressive policy — but the turn-level value function is preserved.
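
A sketch of the data layout this formulation implies (plain Python; the field names are illustrative, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    observation: str            # s_t, e.g. the rendered web page
    response_tokens: list[int]  # a_t = (a_t^1, ..., a_t^L), one turn-level action
    reward: float = 0.0         # 0 for intermediate turns in the sparse-reward setting

@dataclass
class Trajectory:
    turns: list[Turn] = field(default_factory=list)
    terminal_reward: float = 0.0   # R, received after the final turn

    def finalize(self) -> None:
        # Attach the terminal reward to the last turn so per-turn returns can be computed.
        if self.turns:
            self.turns[-1].reward = self.terminal_reward
```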

The Critic: Turn-Level Value Function

The critic \(V_\phi(s_t)\) predicts the expected discounted return from turn \(t\) onward:

\[V_\phi(s_t) \approx \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{k-t} r_k \mid s_t\right]\]

where \(r_k = 0\) for intermediate turns and \(r_T = R\). This is a state value function — it conditions only on the observation \(s_t\), not on the action \(a_t\). In the LM implementation, the critic is a copy of the base model with the language modeling head replaced by a scalar projection. The value is read at a single token position per turn: the last token of the observation (prompt boundary), where causal masking ensures the hidden state encodes the full observation but none of the response.

The critic is trained by regression on Monte Carlo return targets:

\[G_t = \gamma^{T-t} \cdot R\]

with a simple MSE loss:

\[\mathcal{L}_{\text{critic}}(\phi) = \frac{1}{2} \mathbb{E}_t\left[(V_\phi(s_t) - G_t)^2\right]\]

Using MC returns rather than bootstrapped targets \(r_t + \gamma V_\phi(s_{t+1})\) is a deliberate choice: with sparse terminal rewards (\(r_t = 0\) for \(t < T\)), bootstrapped targets reduce to \(\gamma V_\phi(s_{t+1})\), making the critic chase its own predictions. MC returns ground the training in the actual outcome.
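
Putting the target and loss together, a minimal sketch of the critic update (PyTorch; `critic` is the value-head module sketched earlier, and the batch field names are assumptions):

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, batch, gamma: float = 0.99) -> torch.Tensor:
    """MSE regression of V_phi(s_t) onto Monte Carlo return targets.

    `batch` is assumed to hold, per turn: obs_tokens, obs_lengths, the turn
    index t, the trajectory length T, and the terminal reward R.
    """
    v_pred = critic(batch["obs_tokens"], batch["obs_lengths"])        # (B,)
    # Sparse terminal reward: G_t = gamma^(T - t) * R (intermediate rewards are zero).
    g_target = (gamma ** (batch["T"] - batch["t"])) * batch["R"]      # (B,)
    return 0.5 * F.mse_loss(v_pred, g_target)
```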

The Actor: Step-Level Advantages

Given the critic, the step-level advantage is computed as a TD(0) residual:

\[A(t) = \begin{cases} \gamma V_\phi(s_{t+1}) - V_\phi(s_t) & t < T \\ R - V_\phi(s_T) & t = T \end{cases}\]

This measures whether the agent’s action at turn \(t\) increased or decreased the expected return relative to the critic’s prediction. Positive means the action was better than expected; negative means worse.

The advantage is broadcast to all tokens within the turn: every token \(a_t^l\) in the response at turn \(t\) shares the same advantage \(A(t)\). This is the core ArCHer insight — within a single turn, the agent generates a coherent action (reasoning + command), and assigning different advantages to individual tokens within the same action is noisy and semantically meaningless.
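
A sketch of the advantage computation and the turn-level broadcast (numpy; the per-turn values and token counts are made up):

```python
import numpy as np

def turn_advantages(values, terminal_reward, gamma=0.99):
    """A(t) = gamma * V(s_{t+1}) - V(s_t) for t < T, and R - V(s_T) at the last turn."""
    values = np.asarray(values, dtype=float)
    adv = np.empty_like(values)
    adv[:-1] = gamma * values[1:] - values[:-1]
    adv[-1] = terminal_reward - values[-1]
    return adv

def broadcast_to_tokens(turn_adv, tokens_per_turn):
    """Every token in turn t shares the same advantage A(t)."""
    return np.repeat(turn_adv, tokens_per_turn)

adv = turn_advantages(values=[0.3, 0.5, 0.4, 0.8], terminal_reward=1.0)
token_adv = broadcast_to_tokens(adv, tokens_per_turn=[12, 9, 15, 7])
```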

Trajectory-level PPO assigns the same advantage to every step (red). ArCHer computes per-step TD residuals (blue) — the "scroll" step gets a negative advantage while productive steps get positive signal.

The Actor Loss

The actor is trained with the standard PPO clipped objective, using the step-level advantages:

\[\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{t,l}\left[\min\left(\rho_t^l \, A(t),\; \text{clip}(\rho_t^l, 1-\varepsilon, 1+\varepsilon) \, A(t)\right)\right]\]

where \(\rho_t^l = \frac{\pi_\theta(a_t^l \vert s_t, a_t^{<l})}{\pi_{\theta_{\text{old}}}(a_t^l \vert s_t, a_t^{<l})}\) is the per-token importance ratio. The clipping threshold \(\varepsilon\) (typically 0.2) prevents the policy from moving too far from the rollout policy in a single update.

Note that \(A(t)\) does not depend on \(l\) — it is the same for all tokens in the turn. The per-token ratios \(\rho_t^l\) provide token-level granularity in how much the policy changed, but the direction of the gradient (reinforce or suppress) is determined entirely at the turn level.
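
A sketch of this objective (PyTorch; the per-token log-probs under the current and rollout policies, and the broadcast advantages, are assumed to be precomputed):

```python
import torch

def archer_actor_loss(logp_new, logp_old, token_adv, clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped objective with turn-level advantages broadcast to each token.

    logp_new, logp_old: per-token log-probs under the current / rollout policy, shape (N_tokens,)
    token_adv:          A(t) repeated over every token of turn t, shape (N_tokens,)
    """
    ratio = torch.exp(logp_new - logp_old)                        # rho_t^l
    unclipped = ratio * token_adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * token_adv
    return -torch.min(unclipped, clipped).mean()
```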

Training Order: Critic First

A subtle but important detail: ArCHer trains the critic before the actor, reversing the standard PPO order. The reason is that the actor’s gradient quality depends entirely on the advantage estimates, which depend on the critic. Training the actor with a stale critic wastes gradient steps on noisy signals. By training the critic first and then recomputing the advantages with the updated \(V_\phi\), the actor always sees the best available value estimates.
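
In loop form, the ordering looks roughly like this (a structural sketch; the helper functions are placeholders passed in as arguments, not real APIs):

```python
def archer_ppo_iteration(actor, critic, env,
                         collect_rollouts, update_critic,
                         compute_turn_advantages, update_actor,
                         critic_steps: int = 4, actor_steps: int = 4):
    # 1. Collect rollouts with the current actor.
    trajectories = collect_rollouts(actor, env)

    # 2. Train the critic first, by regression on Monte Carlo return targets G_t.
    for _ in range(critic_steps):
        update_critic(critic, trajectories)

    # 3. Recompute turn-level advantages with the freshly updated critic.
    advantages = compute_turn_advantages(critic, trajectories)

    # 4. Only then take actor steps with the PPO clipped objective.
    for _ in range(actor_steps):
        update_actor(actor, trajectories, advantages)
```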

In the previous post, we discussed the challenges of using language models as critic functions — the need for action-sensitive representations and the instability of end-to-end TD training. ArCHer sidesteps several of these issues by using MC return targets (grounding the critic in actual outcomes) and by evaluating at a single position per step (the prompt boundary), where causal masking ensures the representation is clean.

What we implement in WebGym is a simplified variant — ArCHer PPO — that drops the hierarchical structure of the original paper and applies the turn-level critic idea directly to a flat PPO training loop for multi-step web agents. There is no high-level goal selector; just a single policy that acts at each turn, with a step-level critic providing per-turn credit assignment.

A Dialectical View: Critic-Free Methods Also Work

ArCHer makes the critic load-bearing. But empirically, an entire family of methods — GRPO and its descendants — has dropped the critic completely and still works very well, especially for math and code reasoning. If the critic is so important, how do these methods get away without one? This section takes a dialectical view: when does the critic actually pay for itself, and when does brute-force sampling do the job equally well?

GRPO: Sampling Replaces the Critic

GRPO (Group Relative Policy Optimization) takes a radically different approach to the variance problem: instead of learning a value function, estimate it statistically by sampling multiple completions from the same prompt.

The core observation is simple. A learned critic \(V_\phi(s)\) approximates \(\mathbb{E}_{\pi}[G \vert s]\) — the expected return from state \(s\). But there is another way to estimate an expectation: draw samples and take the empirical mean. Given a prompt \(q\), GRPO samples a group of \(G\) completions \(\{o_1, o_2, \ldots, o_G\}\) from the current policy \(\pi_\theta\), scores each with a reward function \(r(q, o_i)\), and uses the group statistics as a baseline:

\[\hat{A}_i = \frac{r(q, o_i) - \text{mean}(\{r(q, o_j)\}_{j=1}^G)}{\text{std}(\{r(q, o_j)\}_{j=1}^G)}\]

This is essentially a Monte Carlo estimate of the advantage — normalized to zero mean and unit variance within each group. No learned parameters, no value head, no critic training loop. The “value function” is replaced by the sample mean of the group.

The GRPO loss applies the same PPO-style clipping, but at the task level (one advantage per completion) rather than per-token or per-step:

\[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \left[\min\!\left(\rho_i^l \, \hat{A}_i,\; \text{clip}(\rho_i^l, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_i\right) - \beta \, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]\]

where \(\rho_i^l = \frac{\pi_\theta(o_i^l \vert q, o_i^{<l})}{\pi_{\theta_{\text{old}}}(o_i^l \vert q, o_i^{<l})}\) is the per-token importance ratio, and \(\beta\) controls the KL penalty against a reference policy \(\pi_{\text{ref}}\) (typically the SFT model).
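
A sketch of the group-relative advantage (numpy; the rewards for the \(G\) completions of one prompt are assumed to come from a verifier):

```python
import numpy as np

def grpo_advantages(group_rewards, eps: float = 1e-6):
    """One advantage per completion: the reward standardized within its group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group of G = 8 completions for one prompt, scored 0/1 by a verifier.
print(grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
# Successes get positive advantages, failures negative -- no learned critic involved.
```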

When Does Each Win?

The clean way to read GRPO is that the group mean is a (statistical) value function — just an empirical one rather than a parametric one. With \(G\) samples per prompt, \(\text{mean}(r(q, o_j))\) converges to \(\mathbb{E}_\pi[r \vert q]\) at rate \(O(G^{-1/2})\). The function approximator and its training loop are gone; sampling does the same job.

So the dialectic isn’t critic vs no-critic. It is: at what granularity does your problem need credit, and which estimator is cheaper at that granularity?

  • Single-turn task (math, code, single answer): best fit is the GRPO family. One advantage per completion is enough; the group mean is unbiased, \(G = 8\)–\(64\) samples already gives a workable estimator, and you save a separately trained critic.
  • Short-horizon task with dense reward: either works, often GRPO. Per-step credit doesn’t add much when each step’s contribution is already legible from the reward signal.
  • Long-horizon multi-turn task with sparse terminal reward: best fit is critic-based (ArCHer). A 10-step trajectory has 1 reward bit but 10 decisions; the group mean cannot tell you which of the 10 steps caused the failure, while a per-step \(V_\phi(s_t)\) can.
  • Off-policy / replay-heavy training: best fit is critic-based. The critic generalizes across states it has digested; the group-mean estimator only knows about the prompts you currently sample.

There is no clean winner. The pro-critic position says: a parametric \(V_\phi\) amortizes value estimation across states, generalizes, and gives per-step credit. The pro-critic-free position says: when the granularity you need is task-level, sampling estimates the same thing without the engineering cost (and without the well-known failure modes of LM critics — representation drift, TD instability, value-head collapse). The deeper analysis of the GRPO family — DAPO, GSPO, and friends — and where each variant wins lives in the GRPO-family post.

The takeaway for the rest of this post: whether you build the value function explicitly (ArCHer) or implicitly via sampling (GRPO), you are doing the same underlying thing — estimating an expected return from a state. That estimation is what we’ll examine more closely below.

Value Functions in Q-Learning: When V IS the Policy

Both REINFORCE and actor-critic are policy gradient methods: they explicitly parameterize and optimize a policy \(\pi_\theta\). The value function — whether \(V(s)\) as a baseline or \(V(s)\) for bootstrapping — is a helper that improves the policy’s gradient. The policy remains the primary object being learned.

Q-learning inverts this relationship entirely. There is no explicit policy. Instead, the value function is the primary object: learn \(Q^*(s, a)\) directly via the Bellman optimality equation

\[Q^*(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle\vert\, s, a\right]\]

and extract the policy implicitly as \(\pi^*(s) = \arg\max_a Q^*(s, a)\). The value function does not serve the policy — the value function replaces it.

This creates a fundamental difference in what the value function needs to capture. In actor-critic, the critic learns \(V^\pi(s)\) — the expected return under the current policy. It only needs to evaluate states, not compare actions. In Q-learning, \(Q^*(s, a)\) must be action-sensitive: it must distinguish the expected return of different actions from the same state, because that distinction is the policy. This is a much harder representational requirement, especially in the LM setting where the action space (all possible token sequences) is combinatorially large. The Q-learning scalability post examines whether this requirement can be met at scale.
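
For contrast with the actor-critic sketches above, here is a minimal tabular Q-learning update (a toy discrete MDP; nothing LM-specific):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Bellman-optimality backup: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def greedy_policy(Q, s):
    """The policy is implicit: act by arg-max over the learned Q-values."""
    return int(np.argmax(Q[s]))

Q = np.zeros((10, 4))   # 10 states, 4 actions -- a toy discrete state/action space
```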

What Is the Value Function Really Doing? Value Judgment, Not Reasoning

Step back from the math. In all the cases above — \(V_\phi\) as baseline, \(V_\phi\) as bootstrap target, group mean as empirical \(V\), \(Q^*\) as the policy itself — the same operation is happening:

Given a state, return a scalar that summarizes how good it is.

That is the entire job. Not a derivation. Not a chain of inferences. A single verdict.

The architectural label for this is value judgment, in deliberate contrast with reasoning. Reasoning produces a sequence of intermediate steps where the steps are part of what makes the answer correct — a chain of derivations, each one inspectable. Judgment produces only the answer; whatever computation produced it has been integrated away into the parameters of \(V_\phi\), or into the empirical mean over \(G\) samples, and is no longer recoverable from the output.

This distinction is not philosophical decoration; it has direct architectural consequences for how we build critics.

An Inspiration from Jungian Psychology: Ni

A useful structural label for this kind of function comes from Jungian analytical psychology, which (independently of any computational concern) catalogs eight cognitive functions and divides them along the same axis we just identified. On one side sit judging / reasoning functions — Te, Ti — whose outputs are explicit chains of derivations, designed to be inspected. On the other side sits one perceiving function with a particularly relevant signature: introverted intuition, abbreviated Ni.

Ni is the function that takes many disparate inputs, lets them settle, and outputs a single integrative impression — without surfacing the reasoning behind it. The classic description is “I just know where this is going” with no defendable derivation to back it up; the integration happened, but it happened in a way the system itself cannot replay step by step. (Background and the rest of the eight-function taxonomy live in the MBTI / Jungian functions post.)

Translated into the engineering vocabulary of this post, Ni is the cognitive shape of a value function. The critic ingests many trajectories during training, integrates them into its parameters, and at inference time emits a single scalar per state from its value head — with no rationale attached. The justification (“which past trajectories the prediction depends on, which features mattered most”) is not produced and cannot be produced from the output alone. It lives in the weights.

  • Ti / Te (thinking): output is a chain of derivations; reasoning visible: yes; engineering counterpart: a symbolic solver, a CoT reasoner, a hand-coded heuristic.
  • Ni (introverted intuition): output is a single integrative scalar/gestalt; reasoning visible: no; engineering counterpart: a scalar value head, a learned \(V_\phi(s)\).

Two clarifications. First, this is not the claim that critics have personalities — it is the claim that the architectural shape of an integrative-judgment function and the architectural shape of a stepwise-reasoning function are different, and that any taxonomy that already separates them (Jung’s happens to be a clean one) gives us a useful vocabulary for the distinction. Second, we are taking only the inspiration: the structural label, plus its consequence for design. The rest of psychology stays in its own post.

An Intuitive Critic Should Not Reason

The practical claim that follows is the one this whole section was building toward:

An intuitive critic — a function whose job is integrative value judgment — should not exhibit reasoning in its output.

Concretely, \(V_\phi(s)\) should be a scalar emitted in a single forward pass at a designated readout position, with no token sequence in between. The “thinking” should happen implicitly in the forward pass, not explicitly as emitted tokens. Several recent design trends violate this and are, from this framing, architecturally mismatched to what a critic is for:

  • Generative reward models that emit critique text before a score. The critique tokens are sitting in the architectural slot that integration should occupy in a Ni-shaped function. The score, having been forced to follow the text, becomes a post-hoc rationalization of the emitted critique rather than an integrative verdict. The model has been pushed toward Ti shape (defensible chain) when its job is Ni shape (integrative scalar).
  • Process reward models (PRMs) that score each step using their own chain of thought. The CoT makes the PRM look defensible step by step, but it pushes the function toward a Ti-style audit and away from the integrative shape that the value head is set up to produce.
  • “Thinking critics” that consume reasoning tokens before outputting \(V\). The compute spent on those tokens is compute spent in the wrong cognitive position. If reasoning helps, that reasoning belongs to the actor, which then queries an integrative critic — not to the critic itself.

What is consistent with the framing, and worth keeping:

  • Scalar value heads read off a single token position. ArCHer reads at the prompt boundary; standard reward models read at EOS. Pure judgment, no emitted reasoning. This is the architecturally honest implementation of \(V_\phi\).
  • MC-return supervision rather than verbal-rationale supervision. A Ni-shaped function trains on outcomes — did the trajectory succeed? — not on rationales. This is exactly what ArCHer’s MSE-against-\(G_t\) loss does, and what reward-from-preferences losses do. Training a critic on “explain why this state is good” data is training the wrong function.
  • Critic-first, then bootstrap. ArCHer’s order — develop the integrative function on real returns first, then trust its own predictions as bootstrap targets — is the engineering analogue of grounding intuition in real consequences before letting it predict on its own. A Ni function with no consequential history is hallucination; a critic with no return-supervised pretraining is a noisy randomly-initialized scalar regressor.

The clean division of labor: the actor is allowed — and required — to reason. Chain-of-thought is its native medium; it has to produce the action, and the chain that produced it is the action. The critic is not. It has to deliver an integrative value judgment, fast and silent, and let the actor handle whatever stepwise derivation the action needs. Mixing the two functions into one model that “thinks before scoring” is the architectural mistake this whole framing is here to flag.
