Generalizable Value Functions and Introverted Intuition (Ni)
The Role of Value Functions
In a previous post on policy gradients, we introduced the actor-critic architecture: an actor (the policy \(\pi_\theta\)) that chooses actions, and a critic (a value function \(\hat{V}_\phi\)) that evaluates states. The actor answers “what should I do?”; the critic answers “how good is the situation I’m in?”.
A natural first question is: why bother with the critic at all? Vanilla policy gradient (REINFORCE) trains the actor without any value function; Q-learning, on the other hand, replaces the actor entirely with a value function. The full picture is that value functions show up in two distinct roles:
- In policy gradient, the critic is a helper. It reduces gradient variance and supplies per-step credit. We use ArCHer as the worked example, then take a dialectical look at why critic-free variants like GRPO can also succeed.
- In Q-learning, the value function is no longer a helper — it is the policy. There is no separate actor; \(\arg\max_a Q^*(s, a)\) acts directly.
We walk through both before stepping back to ask what these functions are actually doing under the hood.
The Variance Problem in Policy Gradient
Consider a pure actor method applied to a 20-step web agent task. The agent receives a single binary reward at the end: 1 for success, 0 for failure. REINFORCE computes the policy gradient as:
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \cdot R_i\]Every token in a successful trajectory gets reinforced equally. Every token in a failed trajectory gets suppressed equally. The gradient tells the actor “do more of everything you did when you succeeded” — it cannot distinguish the decisive click from a useless scroll. With \(N\) rollouts of 20 steps each, every rollout carries 20 decisions but only 1 bit of feedback. The signal-to-noise ratio is abysmal.
A constant baseline \(b\) (e.g., the mean reward across the batch) helps: it centers the advantages so that average-performing trajectories get near-zero gradient. But it cannot distinguish states. An agent in a clearly good state (already on the right product page) and an agent in a clearly bad state (stuck on an error page) receive advantages that differ only in the trajectory’s final outcome — not in their current prospects.
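To make the broadcast problem concrete, here is a minimal sketch (NumPy, with made-up rollout rewards): the batch-mean baseline centers the advantages, but every step of a trajectory still receives the identical scalar.

```python
# A minimal sketch with made-up numbers: constant batch-mean baseline.
import numpy as np

T = 20                                 # steps per trajectory
R = np.array([1.0, 0.0, 0.0, 1.0])     # one terminal reward bit per rollout
b = R.mean()                           # constant baseline: batch mean reward

# The trajectory-level advantage is broadcast over all T steps.
advantages = np.repeat((R - b)[:, None], T, axis=1)   # shape (N, T)
print(advantages[0])   # every step of rollout 0 gets the same +0.5
```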
A State-Dependent Baseline Already Helps
In fact, REINFORCE itself can use \(V_\phi(s_t)\) as a baseline. This is perfectly valid — any function of \(s_t\) can serve as a baseline without introducing bias (as shown in the policy gradient post). The advantage becomes:
\[A(s_t, a_t) = G_t - V_\phi(s_t)\]where \(G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\) is the Monte Carlo return. This is still REINFORCE — the return \(G_t\) comes from the actual trajectory, no bootstrapping involved — but with a much better baseline. Instead of asking “was this trajectory better than average?”, it asks “was this trajectory better than expected from this state?”.
This already provides meaningful per-step signal. If the agent is on the right product page (\(V_\phi(s_t)\) is high) and the trajectory succeeds (\(G_t\) is high), the advantage is small — success was expected. If the agent is on the homepage (\(V_\phi(s_t)\) is moderate) and navigates directly to the right category (leading to eventual success), the advantage is large — the outcome was better than expected from that state.
Can \(Q(s, a)\) be used as a baseline?
A valid baseline must be a function of \(s\) only — not of \(a\). The reason is that the baseline property relies on:
$$\mathbb{E}_{a \sim \pi}\!\left[\nabla_\theta \log \pi(a \vert s) \cdot b(s)\right] = b(s) \cdot \nabla_\theta \sum_a \pi(a \vert s) = b(s) \cdot \nabla_\theta 1 = 0$$The key step is pulling \(b(s)\) out of the expectation over \(a\) — which is only valid when \(b\) does not depend on \(a\). Since \(Q(s, a)\) depends on \(a\), it cannot be pulled out, and subtracting it would change the expected gradient — introducing bias.
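A quick numerical sanity check of that step, on a toy 3-action softmax policy at a single state (illustrative only, not part of any training loop): the exact expectation of \(\nabla_\theta \log \pi(a \vert s) \cdot b\) over actions is the zero vector for any constant \(b\).

```python
# Toy check: subtracting a state-only baseline leaves the expected gradient at zero.
import torch

theta = torch.randn(3, requires_grad=True)   # logits of a 3-action policy
pi = torch.softmax(theta, dim=0)
b = 7.3                                      # any constant baseline value

# Exact expectation over actions: sum_a pi(a) * grad_theta log pi(a|s) * b
expected = sum(
    pi[a].detach() * torch.autograd.grad(torch.log(pi[a]), theta, retain_graph=True)[0] * b
    for a in range(3)
)
print(expected)   # ~[0, 0, 0] up to floating-point error
```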
\(Q(s, a)\) plays a different role: it is the signal itself, not a baseline. The policy gradient theorem states:
$$\nabla_\theta J = \mathbb{E}\!\left[\nabla_\theta \log \pi(a \vert s) \cdot Q^\pi(s, a)\right]$$The advantage \(A(s, a) = Q(s, a) - V(s)\) uses \(Q\) as the signal and \(V\) as the baseline. Replacing \(V\) with \(Q\) in the baseline slot would collapse the advantage to zero — subtracting the signal from itself.
From Baseline to Bootstrapping
If REINFORCE + \(V(s)\) baseline already gives per-step credit, what does actor-critic add? The difference is in how the advantage is computed. REINFORCE uses:
\[A^{\text{MC}}(s_t, a_t) = G_t - V_\phi(s_t) \qquad \text{(Monte Carlo return minus baseline)}\]Actor-critic replaces the MC return with a bootstrapped estimate:
\[A^{\text{TD}}(s_t, a_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \qquad \text{(one-step TD residual)}\]The MC version is unbiased — \(G_t\) is the true sampled return. But it is still high-variance: \(G_t\) sums rewards over many future steps, each subject to stochastic transitions. In a 20-step web task where the environment is non-stationary (pages load differently, A/B tests shift layouts), this sum carries enormous noise.
The TD version replaces all future randomness with the critic’s estimate \(V_\phi(s_{t+1})\). This is biased — if the critic is wrong, the advantage is wrong — but dramatically lower variance, because the estimate depends only on the immediate transition \((s_t, a_t, s_{t+1})\) rather than the entire future trajectory. In practice, a slightly biased but low-variance gradient is far more useful for learning than an unbiased but noisy one.
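A side-by-side sketch of the two estimators, assuming a single 20-step trajectory with a sparse terminal reward and made-up critic values:

```python
# MC vs TD advantage estimates on one sparse-reward trajectory (illustrative values).
import numpy as np

gamma = 0.99
r = np.zeros(20); r[-1] = 1.0            # sparse terminal reward
v = np.random.uniform(0.0, 1.0, size=21) # v[t] = V_phi(s_t); v[20] is the state after the last step
v[-1] = 0.0                              # terminal state has zero value

# Monte Carlo: unbiased, but sums stochasticity over the whole remaining trajectory.
G = np.zeros(20); running = 0.0
for t in reversed(range(20)):
    running = r[t] + gamma * running
    G[t] = running
A_mc = G - v[:-1]

# TD(0): biased by the critic, but depends only on the local transition.
A_td = r + gamma * v[1:] - v[:-1]
```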
This is the real contribution of the critic in actor-critic: not just providing a baseline (REINFORCE can do that too), but enabling bootstrapping — replacing noisy future returns with learned predictions. The critic becomes load-bearing: its accuracy directly determines the quality of the actor’s gradient. This is why training the critic well matters so much.
A Worked Example in Policy Gradient: ArCHer
Before going on, a clarification: “critic” and “value function” are not two different things. The critic is a value function — specifically, a learned approximation \(\hat{V}_\phi(s) \approx V^\pi(s)\), trained to predict expected returns. “Critic” emphasizes its role in the actor-critic architecture; “value function” emphasizes its mathematical definition. They refer to the same object. In the LM setting the critic is usually a copy of the base language model with a scalar value head replacing the token prediction head — see the LM-as-critic post for the engineering details.
ArCHer (Actor-Critic with Hierarchical Turn-Level Credit Assignment) was introduced by Zhou et al. (2024) to address a fundamental limitation of RLHF-style training for multi-step language model agents: trajectory-level rewards provide no signal about which step was responsible for success or failure.
Consider a web agent that completes a 10-step shopping task. Standard PPO assigns the terminal reward to the entire trajectory — all 10 steps receive the same advantage. The agent cannot learn that step 3 (clicking the right product) was the decisive action while step 7 (scrolling past relevant content) was a mistake. Both get reinforced equally.
The original ArCHer paper proposes a hierarchical actor-critic framework whose central idea is: the turn, not the token, is the right granularity for credit assignment in multi-step LM agents. Below we describe the formulation and losses precisely.
Formulation: Multi-Turn MDP
ArCHer models the interaction as a token-level MDP embedded within a turn-level structure. At each turn \(t\), the agent observes a state \(s_t\) (e.g., a web page) and generates a response \(a_t = (a_t^1, a_t^2, \ldots, a_t^L)\) as a sequence of \(L\) tokens. The environment then transitions to a new state \(s_{t+1}\). After \(T\) turns, a scalar reward \(R\) is received.
The key decomposition is between the high-level (turn-level) and low-level (token-level) policies. The high-level policy selects the overall intent at each turn, while the low-level policy autoregressively generates the tokens that implement it. In the simplified ArCHer PPO variant we implement, these are collapsed into a single autoregressive policy — but the turn-level value function is preserved.
The Critic: Turn-Level Value Function
The critic \(V_\phi(s_t)\) predicts the expected discounted return from turn \(t\) onward:
\[V_\phi(s_t) \approx \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{k-t} r_k \mid s_t\right]\]where \(r_k = 0\) for intermediate turns and \(r_T = R\). This is a state value function — it conditions only on the observation \(s_t\), not on the action \(a_t\). In the LM implementation, the critic is a copy of the base model with the language modeling head replaced by a scalar projection. The value is read at a single token position per turn: the last token of the observation (prompt boundary), where causal masking ensures the hidden state encodes the full observation but none of the response.
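A minimal sketch of this readout, assuming an HF-style causal LM backbone that exposes per-token hidden states; the names (`backbone`, `obs_end_idx`) are illustrative, not taken from the ArCHer codebase:

```python
import torch
import torch.nn as nn

class TurnLevelCritic(nn.Module):
    """Copy of the base LM with the LM head swapped for a scalar value head."""
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # LM copy (trainable)
        self.value_head = nn.Linear(hidden_size, 1)   # scalar projection

    def forward(self, input_ids, attention_mask, obs_end_idx):
        # obs_end_idx (LongTensor, shape (B,)): index of the last observation
        # token in each sequence — the prompt boundary.
        out = self.backbone(input_ids=input_ids,
                            attention_mask=attention_mask,
                            output_hidden_states=True)
        h = out.hidden_states[-1]                               # (B, L, H)
        idx = obs_end_idx.view(-1, 1, 1).expand(-1, 1, h.size(-1))
        h_obs = h.gather(1, idx).squeeze(1)                     # (B, H)
        return self.value_head(h_obs).squeeze(-1)               # (B,) = V_phi(s_t)
```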
The critic is trained by regression on Monte Carlo return targets:
\[G_t = \gamma^{T-t} \cdot R\]with a simple MSE loss:
\[\mathcal{L}_{\text{critic}}(\phi) = \frac{1}{2} \mathbb{E}_t\left[(V_\phi(s_t) - G_t)^2\right]\]Using MC returns rather than bootstrapped targets \(r_t + \gamma V_\phi(s_{t+1})\) is a deliberate choice: with sparse terminal rewards (\(r_t = 0\) for \(t < T\)), bootstrapped targets reduce to \(\gamma V_\phi(s_{t+1})\), making the critic chase its own predictions. MC returns ground the training in the actual outcome.
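In code, the target and loss are a few lines — a sketch under the indexing conventions above (turns \(1, \ldots, T\), terminal reward only):

```python
import torch
import torch.nn.functional as F

def mc_return_targets(R, T, gamma):
    """G_t = gamma^(T - t) * R for t = 1..T: only the terminal reward contributes."""
    t = torch.arange(1, T + 1, dtype=torch.float32)
    return (gamma ** (T - t)) * R                 # shape (T,)

def critic_loss(values, targets):
    """values: V_phi(s_t) per turn; targets: MC returns G_t."""
    return 0.5 * F.mse_loss(values, targets)
```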
The Actor: Step-Level Advantages
Given the critic, the step-level advantage is computed as a TD(0) residual:
\[A(t) = \begin{cases} \gamma V_\phi(s_{t+1}) - V_\phi(s_t) & t < T \\ R - V_\phi(s_T) & t = T \end{cases}\]This measures whether the agent’s action at turn \(t\) increased or decreased the expected return relative to the critic’s prediction. Positive means the action was better than expected; negative means worse.
The advantage is broadcast to all tokens within the turn: every token \(a_t^l\) in the response at turn \(t\) shares the same advantage \(A(t)\). This is the core ArCHer insight — within a single turn, the agent generates a coherent action (reasoning + command), and assigning different advantages to individual tokens within the same action is noisy and semantically meaningless.
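A sketch of the advantage computation and the broadcast, with illustrative tensor shapes:

```python
import torch

def turn_advantages(values, R, gamma):
    """A(t) = r_t + gamma * V(s_{t+1}) - V(s_t), where r_t = 0 everywhere
    except the final turn (r_T = R) and V beyond the final turn is zero."""
    next_values = torch.cat([values[1:], values.new_zeros(1)])
    rewards = torch.zeros_like(values)
    rewards[-1] = R
    return rewards + gamma * next_values - values      # shape (T,)

def broadcast_to_tokens(adv, turn_ids):
    """Every token inherits its turn's advantage. turn_ids: (num_tokens,) in [0, T)."""
    return adv[turn_ids]
```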
The Actor Loss
The actor is trained with the standard PPO clipped objective, using the step-level advantages:
\[\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{t,l}\left[\min\left(\rho_t^l \, A(t),\; \text{clip}(\rho_t^l, 1-\varepsilon, 1+\varepsilon) \, A(t)\right)\right]\]where \(\rho_t^l = \frac{\pi_\theta(a_t^l \vert s_t, a_t^{<l})}{\pi_{\theta_{\text{old}}}(a_t^l \vert s_t, a_t^{<l})}\) is the per-token importance ratio. The clipping threshold \(\varepsilon\) (typically 0.2) prevents the policy from moving too far from the rollout policy in a single update.
Note that \(A(t)\) does not depend on \(l\) — it is the same for all tokens in the turn. The per-token ratios \(\rho_t^l\) provide token-level granularity in how much the policy changed, but the direction of the gradient (reinforce or suppress) is determined entirely at the turn level.
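Putting the pieces together, a sketch of the actor loss, assuming the per-token log-probs for the sampled tokens have already been gathered:

```python
import torch

def archer_ppo_actor_loss(logp_new, logp_old, token_adv, eps=0.2):
    """logp_new / logp_old: (num_tokens,) log-probs under the current and
    rollout policies; token_adv: A(t) broadcast to each token of its turn."""
    ratio = torch.exp(logp_new - logp_old)               # rho_t^l
    unclipped = ratio * token_adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * token_adv
    return -torch.min(unclipped, clipped).mean()
```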
Training Order: Critic First
A subtle but important detail: ArCHer trains the critic before the actor, reversing the standard PPO order. The reason is that the actor’s gradient quality depends entirely on the advantage estimates, which depend on the critic. Training the actor with a stale critic wastes gradient steps on noisy signals. By training the critic first and then recomputing the advantages with the updated \(V_\phi\), the actor always sees the best available value estimates.
In the previous post, we discussed the challenges of using language models as critic functions — the need for action-sensitive representations and the instability of end-to-end TD training. ArCHer sidesteps several of these issues by using MC return targets (grounding the critic in actual outcomes) and by evaluating at a single position per step (the prompt boundary), where causal masking ensures the representation is clean.
What we implement in WebGym is a simplified variant — ArCHer PPO — that drops the hierarchical structure of the original paper and applies the turn-level critic idea directly to a flat PPO training loop for multi-step web agents. There is no high-level goal selector; just a single policy that acts at each turn, with a step-level critic providing per-turn credit assignment.
A Dialectical View: Critic-Free Methods Also Work
ArCHer makes the critic load-bearing. But empirically, an entire family of methods — GRPO and its descendants — has dropped the critic completely and still works very well, especially for math and code reasoning. If the critic is so important, how do these methods get away without one? This section takes a dialectical view: when does the critic actually pay for itself, and when does brute-force sampling do the job equally well?
GRPO: Sampling Replaces the Critic
GRPO (Group Relative Policy Optimization) takes a radically different approach to the variance problem: instead of learning a value function, estimate it statistically by sampling multiple completions from the same prompt.
The core observation is simple. A learned critic \(V_\phi(s)\) approximates \(\mathbb{E}_{\pi}[G \vert s]\) — the expected return from state \(s\). But there is another way to estimate an expectation: draw samples and take the empirical mean. Given a prompt \(q\), GRPO samples a group of \(G\) completions \(\{o_1, o_2, \ldots, o_G\}\) from the current policy \(\pi_\theta\), scores each with a reward function \(r(q, o_i)\), and uses the group statistics as a baseline:
\[\hat{A}_i = \frac{r(q, o_i) - \text{mean}(\{r(q, o_j)\}_{j=1}^G)}{\text{std}(\{r(q, o_j)\}_{j=1}^G)}\]This is essentially a Monte Carlo estimate of the advantage — normalized to zero mean and unit variance within each group. No learned parameters, no value head, no critic training loop. The “value function” is replaced by the sample mean of the group.
The GRPO loss applies the same PPO-style clipping, but at the task level (one advantage per completion) rather than per-token or per-step:
\[\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\vert o_i \vert}\sum_{l=1}^{\vert o_i \vert} \left[\min\!\left(\rho_i^l \, \hat{A}_i,\; \text{clip}(\rho_i^l, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_i\right) - \beta \, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]\]where \(\rho_i^l = \frac{\pi_\theta(o_i^l \vert q, o_i^{<l})}{\pi_{\theta_{\text{old}}}(o_i^l \vert q, o_i^{<l})}\) is the per-token importance ratio, and \(\beta\) controls the KL penalty against a reference policy \(\pi_{\text{ref}}\) (typically the SFT model).
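A sketch of both pieces — the group-normalized advantage and the per-token clipped loss with a KL penalty. For brevity this version averages over all tokens rather than per completion, the KL term uses a standard low-variance per-token estimator, and the `beta` and `eps` defaults are illustrative.

```python
import torch

def grpo_advantages(rewards):
    """rewards: (G,) scores for one group; returns group-normalized advantages."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new, logp_old, logp_ref, adv_per_token, beta=0.04, eps=0.2):
    """Log-prob tensors are (num_tokens,); adv_per_token broadcasts each
    completion's advantage to all of its tokens."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    pg = torch.min(ratio * adv_per_token, clipped * adv_per_token)
    # Per-token estimate of KL(pi_theta || pi_ref): r - log r - 1, with r = pi_ref / pi_theta
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1.0
    return -(pg - beta * kl).mean()
```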
When Does Each Win?
The clean way to read GRPO is that the group mean is a (statistical) value function — just an empirical one rather than a parametric one. With \(G\) samples per prompt, \(\text{mean}(r(q, o_j))\) converges to \(\mathbb{E}_\pi[r \vert q]\) at rate \(O(G^{-1/2})\). The function approximator and its training loop are gone; sampling does the same job.
So the dialectic isn’t critic vs no-critic. It is: at what granularity does your problem need credit, and which estimator is cheaper at that granularity?
| Setting | Best fit | Reason |
|---|---|---|
| Single-turn task (math, code, single answer) | GRPO-family | One advantage per completion is enough. The group mean is unbiased, a group of \(G = 8\)–\(64\) samples already gives a workable estimator, and you save a separately trained critic. |
| Short-horizon task with dense reward | Either, often GRPO | Per-step credit doesn’t add much when each step’s contribution is already legible from the reward signal. |
| Long-horizon multi-turn task with sparse terminal reward | Critic-based (ArCHer) | A 10-step trajectory has 1 reward bit but 10 decisions. Group mean cannot tell you which of the 10 steps caused the failure; a per-step \(V_\phi(s_t)\) can. |
| Off-policy / replay-heavy training | Critic-based | The critic generalizes across states it has digested; the group-mean estimator only knows about the prompts you currently sample. |
There is no clean winner. The pro-critic position says: a parametric \(V_\phi\) amortizes value estimation across states, generalizes, and gives per-step credit. The pro-critic-free position says: when the granularity you need is task-level, sampling estimates the same thing without the engineering cost (and without the well-known failure modes of LM critics — representation drift, TD instability, value-head collapse). The deeper analysis of the GRPO family — DAPO, GSPO, and friends — and where each variant wins lives in the GRPO-family post.
The takeaway for the rest of this post: whether you build the value function explicitly (ArCHer) or implicitly via sampling (GRPO), you are doing the same underlying thing — estimating an expected return from a state. That estimation is what we’ll examine more closely below.
Value Functions in Q-Learning: When V IS the Policy
Both REINFORCE and actor-critic are policy gradient methods: they explicitly parameterize and optimize a policy \(\pi_\theta\). The value function — whether \(V(s)\) as a baseline or \(V(s)\) for bootstrapping — is a helper that improves the policy’s gradient. The policy remains the primary object being learned.
Q-learning inverts this relationship entirely. There is no explicit policy. Instead, the value function is the primary object: learn \(Q^*(s, a)\) directly via the Bellman optimality equation
\[Q^*(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a')\right]\]and extract the policy implicitly as \(\pi^*(s) = \arg\max_a Q^*(s, a)\). The value function does not serve the policy — the value function replaces it.
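To make “the value function is the policy” concrete, here is a minimal tabular sketch (toy state and action counts, nothing LM-specific): acting is just an argmax over the learned Q-values, with no separate actor anywhere.

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next, done):
    """One Bellman backup toward r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def act(s, epsilon=0.1):
    """Epsilon-greedy over Q — the greedy branch IS the policy."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())
```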
This creates a fundamental difference in what the value function needs to capture. In actor-critic, the critic learns \(V^\pi(s)\) — the expected return under the current policy. It only needs to evaluate states, not compare actions. In Q-learning, \(Q^*(s, a)\) must be action-sensitive: it must distinguish the expected return of different actions from the same state, because that distinction is the policy. This is a much harder representational requirement, especially in the LM setting where the action space (all possible token sequences) is combinatorially large. The Q-learning scalability post examines whether this requirement can be met at scale.
What Is the Value Function Really Doing? Value Judgment, Not Reasoning
Step back from the math. In all the cases above — \(V_\phi\) as baseline, \(V_\phi\) as bootstrap target, group mean as empirical \(V\), \(Q^*\) as the policy itself — the same operation is happening:
Given a state, return a scalar that summarizes how good it is.
That is the entire job. Not a derivation. Not a chain of inferences. A single verdict.
The architectural label for this is value judgment, in deliberate contrast with reasoning. Reasoning produces a sequence of intermediate steps where the steps are part of what makes the answer correct — a chain of derivations, each one inspectable. Judgment produces only the answer; whatever computation produced it has been integrated away into the parameters of \(V_\phi\), or into the empirical mean over \(G\) samples, and is no longer recoverable from the output.
This distinction is not philosophical decoration; it has direct architectural consequences for how we build critics.
An Inspiration from Jungian Psychology: Ni
A useful structural label for this kind of function comes from Jungian analytical psychology, which (independently of any computational concern) catalogs eight cognitive functions and divides them along the same axis we just identified. On one side sit judging / reasoning functions — Te, Ti — whose outputs are explicit chains of derivations, designed to be inspected. On the other side sits one perceiving function with a particularly relevant signature: introverted intuition, abbreviated Ni.
Ni is the function that takes many disparate inputs, lets them settle, and outputs a single integrative impression — without surfacing the reasoning behind it. The classic description is “I just know where this is going” with no defendable derivation to back it up; the integration happened, but it happened in a way the system itself cannot replay step by step. (Background and the rest of the eight-function taxonomy live in the MBTI / Jungian functions post.)
Translated into the engineering vocabulary of this post, Ni is the cognitive shape of a value function. The critic ingests many trajectories during training, integrates them into its parameters, and at inference time emits a single scalar per state from its value head — with no rationale attached. The justification (“which past trajectories the prediction depends on, which features mattered most”) is not produced and cannot be produced from the output alone. It lives in the weights.
| Cognitive function | Output shape | Reasoning visible? | Engineering counterpart |
|---|---|---|---|
| Ti / Te (thinking) | a chain of derivations | yes | symbolic solver, CoT reasoner, hand-coded heuristic |
| Ni (introverted intuition) | a single integrative scalar/gestalt | no | scalar value head, learned \(V_\phi(s)\) |
Two clarifications. First, this is not the claim that critics have personalities — it is the claim that the architectural shape of an integrative-judgment function and the architectural shape of a stepwise-reasoning function are different, and that any taxonomy that already separates them (Jung’s happens to be a clean one) gives us a useful vocabulary for the distinction. Second, we are taking only the inspiration: the structural label, plus its consequence for design. The rest of psychology stays in its own post.
An Intuitive Critic Should Not Reason
The practical claim that follows is the one this whole section was building toward:
An intuitive critic — a function whose job is integrative value judgment — should not exhibit reasoning in its output.
Concretely, \(V_\phi(s)\) should be a scalar emitted in a single forward pass at a designated readout position, with no token sequence in between. The “thinking” should happen implicitly in the forward pass, not explicitly as emitted tokens. Several recent design trends violate this and are, from this framing, architecturally mismatched to what a critic is for:
- Generative reward models that emit critique text before a score. The critique tokens are sitting in the architectural slot that integration should occupy in a Ni-shaped function. The score, having been forced to follow the text, becomes a post-hoc rationalization of the emitted critique rather than an integrative verdict. The model has been pushed toward Ti shape (defensible chain) when its job is Ni shape (integrative scalar).
- Process reward models (PRMs) that score each step using their own chain of thought. The CoT makes the PRM look defensible step by step, but it pushes the function toward a Ti-style audit and away from the integrative shape that the value head is set up to produce.
- “Thinking critics” that consume reasoning tokens before outputting \(V\). The compute spent on those tokens is compute spent in the wrong cognitive position. If reasoning helps, that reasoning belongs to the actor, which then queries an integrative critic — not to the critic itself.
What is consistent with the framing, and worth keeping:
- Scalar value heads read off a single token position. ArCHer reads at the prompt boundary; standard reward models read at EOS. Pure judgment, no emitted reasoning. This is the architecturally honest implementation of \(V_\phi\).
- MC-return supervision rather than verbal-rationale supervision. A Ni-shaped function trains on outcomes — did the trajectory succeed? — not on rationales. This is exactly what ArCHer’s MSE-against-\(G_t\) loss does, and what reward-from-preferences losses do. Training a critic on “explain why this state is good” data is training the wrong function.
- Critic-first, then bootstrap. ArCHer’s order — develop the integrative function on real returns first, then trust its own predictions as bootstrap targets — is the engineering analogue of grounding intuition in real consequences before letting it predict on its own. A Ni function with no consequential history is hallucination; a critic with no return-supervised pretraining is a noisy randomly-initialized scalar regressor.
The clean division of labor: the actor is allowed — and required — to reason. Chain-of-thought is its native medium; it has to produce the action, and the chain that produced it is the action. The critic is not. It has to deliver an integrative value judgment, fast and silent, and let the actor handle whatever stepwise derivation the action needs. Mixing the two functions into one model that “thinks before scoring” is the architectural mistake this whole framing is here to flag.