Can Language Models Be Critic Functions?

A natural idea in the era of foundation models: take a pretrained language model, strip off the token-prediction head, attach a small MLP that outputs a scalar, and use it as a critic (value function or Q-function) for reinforcement learning. The LM has already learned rich representations of language, code, and visual scenes — surely these representations contain enough information to predict future returns?

This idea is simple and appealing. It is also, in its naive form, surprisingly ineffective. This post explores why, what it takes to make it work, and what the resulting design choices tell us about the gap between “understanding a scene” and “predicting the consequences of actions.”

The Architecture

The setup is straightforward. Given a pretrained language model (or vision-language model) with parameters \(\theta_{\text{LM}}\), the standard next-token prediction head maps the last hidden state to a distribution over the vocabulary:

\[p(x_{t+1} \vert x_{\leq t}) = \text{softmax}(W_{\text{head}} \cdot h_t + b)\]

where \(h_t \in \mathbb{R}^d\) is the hidden state at position \(t\) and \(W_{\text{head}} \in \mathbb{R}^{\vert\mathcal{V}\vert \times d}\) projects to vocabulary size. To build a critic, we replace this head with a small MLP that outputs a scalar:

\[Q_{\theta_{\text{MLP}}}(s, a) = \text{MLP}_{\theta_{\text{MLP}}}(h_t)\]

where \(h_t = f_{\theta_{\text{LM}}}(s, a)\) is the LM’s representation of the state-action pair. The MLP is typically 1–3 layers with a few hundred hidden units — trivially small compared to the LM backbone.
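
To make this concrete, here is a minimal PyTorch sketch of the head swap, using `gpt2` as a stand-in backbone (Digi-Q uses a VLM; the model name, MLP width, and pooling choice here are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LMCritic(nn.Module):
    """Q(s, a) = MLP(last hidden state of the LM run on the (s, a) text)."""

    def __init__(self, backbone_name: str = "gpt2", hidden: int = 256):
        super().__init__()
        # AutoModel loads the backbone *without* the token-prediction head.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        d = self.backbone.config.hidden_size
        # Small scalar head: a couple of layers, a few hundred units.
        self.q_head = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Pool the hidden state at the last non-padded position of each row.
        last = attention_mask.sum(dim=1) - 1
        h_t = out.last_hidden_state[torch.arange(input_ids.size(0)), last]
        return self.q_head(h_t).squeeze(-1)  # one scalar Q per (s, a) pair

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
critic = LMCritic()
batch = tokenizer(["<state: home screen> <action: click(0.3, 0.7)>"],
                  return_tensors="pt", padding=True)
q_values = critic(batch["input_ids"], batch["attention_mask"])
```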

The question is: what should we do with \(\theta_{\text{LM}}\)? There are three options:

  1. Freeze entirely. Train only \(\theta_{\text{MLP}}\). The LM is a fixed feature extractor.
  2. Fine-tune end-to-end. Update both \(\theta_{\text{LM}}\) and \(\theta_{\text{MLP}}\) with the TD loss.
  3. Fine-tune, then freeze. First adapt \(\theta_{\text{LM}}\) with an auxiliary objective, then freeze it and train \(\theta_{\text{MLP}}\) with TD learning.

Each choice has consequences that connect directly to the failure modes of Q-learning under function approximation.
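
In code, the three options differ only in which parameter groups receive gradients. A sketch, reusing the hypothetical `LMCritic` above:

```python
import torch

def configure(critic: LMCritic, mode: str) -> list[torch.nn.Parameter]:
    """Return the trainable parameters for each of the three options."""
    if mode == "freeze":  # Option 1: LM as a fixed feature extractor
        for p in critic.backbone.parameters():
            p.requires_grad = False
        return list(critic.q_head.parameters())
    if mode == "end_to_end":  # Option 2: TD gradients through everything
        return list(critic.parameters())
    if mode == "finetune_then_freeze":  # Option 3, Phase 2 configuration
        # (Phase 1 first adapts critic.backbone with an auxiliary loss.)
        for p in critic.backbone.parameters():
            p.requires_grad = False
        return list(critic.q_head.parameters())
    raise ValueError(f"unknown mode: {mode}")

optimizer = torch.optim.AdamW(configure(critic, "freeze"), lr=3e-4)
```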

Why Freezing Alone Fails

The most computationally attractive option — freeze the LM and train only the MLP — often produces a degenerate critic. The problem is that pretrained LM representations are not optimized for predicting action consequences.

Consider a VLM like LLaVA that processes a screenshot of a mobile device. The VLM can answer “What app is open?” or “What text is on the screen?” — it understands the content of the scene. But ask it “Will clicking at coordinates (0.3, 0.7) navigate to a new page?” and it fails. The internal representations encode visual semantics, not the causal structure of how actions transform states.

When we attach an MLP head and train it to predict Q-values on these frozen features, two failure modes emerge:

Failure mode 1: Action-blindness. The frozen features \(f_{\theta_{\text{LM}}}(s, a)\) may not meaningfully distinguish between different actions at the same state. If the LM was never trained to attend to action inputs, the hidden state \(h_t\) will be approximately the same regardless of the action \(a\). The MLP then learns a state-only value function \(V(s)\) instead of a state-action value function \(Q(s, a)\):

\[Q_{\theta_{\text{MLP}}}(s, a_1) \approx Q_{\theta_{\text{MLP}}}(s, a_2) \approx V(s) \quad \forall a_1, a_2\]

This is useless for policy extraction, which requires ranking actions. This is exactly what Digi-Q (Bai et al., 2025) observed when using off-the-shelf VLM representations: the Q-function collapsed to a V-function.
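
One cheap way to catch this collapse before committing to a full RL run is to probe how much the critic's output varies across candidate actions at a fixed state. A diagnostic sketch, continuing from the `LMCritic` above (the threshold is an illustrative assumption):

```python
import torch

@torch.no_grad()
def action_spread(critic, tokenizer, state: str, actions: list[str]) -> float:
    """Standard deviation of Q over candidate actions at a single state.

    A near-zero spread suggests the Q-function has collapsed to V(s).
    """
    texts = [f"{state} {a}" for a in actions]
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    q = critic(batch["input_ids"], batch["attention_mask"])
    return q.std().item()

spread = action_spread(critic, tokenizer, "<state: search results page>",
                       ["click(0.3, 0.7)", "scroll down", "type 'shoes'"])
if spread < 1e-3:  # illustrative threshold
    print("warning: critic may be action-blind")
```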

Failure mode 2: Insufficient coverage of task-relevant features. Even if the features are somewhat action-sensitive, they may not capture the specific aspects of the state that matter for value prediction. A language model trained on internet text knows about restaurant reviews and product descriptions, but its internal features may not encode “how many steps remain until task completion” or “whether the current page is a dead end.” These task-relevant features are critical for accurate value estimation but are absent from the pretraining distribution.

Empirically, Digi-Q found that using off-the-shelf VLM features (without any fine-tuning) achieved only 31.9% on the Web Shopping benchmark — barely better than the 25.0% behavior policy and substantially worse than the 58.0% achieved with representation fine-tuning.

Why End-to-End TD Learning Is Unstable

The opposite extreme — fine-tuning the entire LM backbone with the TD loss — addresses the representation problem but introduces severe optimization instability.

The TD loss for the Q-function is:

\[J_Q(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[(Q_\theta(s, a) - r - \gamma \max_{a'} Q_{\bar{\theta}}(s', a'))^2\right]\]

where \(\bar{\theta}\) is a delayed copy of \(\theta\). When \(\theta\) includes billions of LM parameters, several problems compound:

Moving target amplification. The target \(r + \gamma \max_{a'} Q_{\bar{\theta}}(s', a')\) depends on the same network being trained. With a small MLP head, the target changes slowly because the feature space is fixed. With end-to-end training, a single gradient step can shift the entire representation space, causing the target values to change dramatically for all state-action pairs simultaneously. The delayed target network \(\bar{\theta}\) mitigates this but cannot eliminate it when the representation shifts are large.

Bootstrapping through shared representations. In the Q-learning scalability blog, we discussed how the composed operator \(\Pi\mathcal{T}\) can fail to contract when \(\lVert\Pi\rVert > 1/\gamma\). With end-to-end training, the projection \(\Pi\) (onto the function class representable by the current parameters) changes at every gradient step. This is worse than a fixed projection — it is a moving projection that can amplify errors in unpredictable ways.

Catastrophic forgetting of useful features. The LM backbone was pretrained on a vast corpus and contains general-purpose representations useful for language understanding. Aggressive TD updates can destroy these features, replacing them with representations that minimize the TD loss on the current batch but generalize poorly. This is especially problematic in offline RL, where the dataset is finite and the risk of overfitting is high.

Prior work has documented these instabilities extensively. Kumar et al. (2021) found that value-based RL with large networks exhibits pathological training dynamics. Chebotar et al. (2023) (Q-Transformer) had to employ conservative Q-learning regularization combined with n-step returns to stabilize training — a complex recipe that is difficult to tune.

The Middle Path: Fine-Tune Then Freeze

The approach that works best in practice is a two-phase strategy:

Phase 1: Representation fine-tuning. Adapt the LM backbone with an auxiliary objective that teaches it to encode action-relevant information. This does not use the TD loss — it uses a supervised or self-supervised objective that is stable and well-understood.

Phase 2: TD learning on frozen features. Freeze the fine-tuned LM and train only the MLP head with the TD loss. Because the feature space is now fixed, the projection \(\Pi\) is stable, and the standard analysis of projected Bellman operators applies.
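
A sketch of the Phase 2 update, assuming features from the frozen backbone are precomputed and candidate next actions come from some proposal mechanism (how Digi-Q actually proposes next actions, and its full loss, are not shown here; the Polyak rate and learning rate are standard defaults, not the paper's values):

```python
import copy
import torch

GAMMA = 0.99
q_head = critic.q_head                 # the only trainable module in Phase 2
target_head = copy.deepcopy(q_head)    # the delayed copy, \bar{theta}
opt = torch.optim.AdamW(q_head.parameters(), lr=3e-4)

def td_step(h_sa, reward, h_next, done):
    """One TD update on frozen features.

    h_sa:   (B, d)    features f(s, a) from the frozen backbone
    h_next: (B, A, d) features f(s', a') for A candidate next actions
    """
    with torch.no_grad():
        # Max over candidate next actions, evaluated by the target head.
        q_next = target_head(h_next).squeeze(-1).max(dim=1).values
        target = reward + GAMMA * (1.0 - done) * q_next
    loss = (q_head(h_sa).squeeze(-1) - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    # Slowly track the online head (Polyak averaging, one common choice).
    with torch.no_grad():
        for p, tp in zip(q_head.parameters(), target_head.parameters()):
            tp.mul_(0.995).add_(0.005 * p)
    return loss.item()
```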

The key design question is: what auxiliary objective should Phase 1 use?

Binary action-effect prediction

Digi-Q uses a binary classification objective: given a transition \((s_t, a_t, s_{t+1})\), predict whether the action caused a substantial change in the state. This is operationalized as:

\[y_t = \begin{cases} 1, & d(s_t, s_{t+1}) \geq \epsilon \\ 0, & \text{otherwise} \end{cases}\]

where \(d(s_t, s_{t+1})\) is a distance between consecutive states and \(\epsilon\) is a change threshold. The LM is fine-tuned to output “yes” or “no” given the state-action pair as input. This teaches the representations to distinguish between actions that do something and actions that do nothing — a coarse but crucial signal for value prediction. If the LM cannot tell whether clicking a button will navigate to a new page, it certainly cannot predict the long-term return of that action.
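
A sketch of how these labels might be computed, assuming an embedding distance for \(d\) (the paper's exact distance function and threshold may differ):

```python
import torch.nn.functional as F

def action_effect_label(s_t, s_next, encode, eps: float = 0.1) -> int:
    """y_t = 1 iff the action visibly changed the state.

    `encode` maps a raw state (e.g., a screenshot) to a feature vector;
    cosine distance and eps=0.1 are illustrative choices, not Digi-Q's
    exact operationalization.
    """
    dist = 1.0 - F.cosine_similarity(encode(s_t), encode(s_next), dim=-1)
    return int(dist.item() >= eps)
```

Phase 1 then reduces to ordinary supervised fine-tuning: present the state-action pair to the LM as a prompt and train it, with a cross-entropy loss, to emit “yes” when \(y_t = 1\) and “no” otherwise.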

Why not fine-tune on rewards directly?

A tempting alternative: fine-tune the LM to predict immediate rewards \(r(s, a)\) via supervised regression. This would seem to directly teach the features needed for value estimation.

The problem is that reward prediction is a much harder supervised learning task than action-effect prediction, especially when rewards are sparse (e.g., binary 0/1 only at episode end). Most transitions have reward 0, so reward prediction degenerates into predicting a constant. The action-effect signal, by contrast, is dense: many transitions involve visible state changes, providing a rich training signal for representation learning.
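
To make the degeneracy concrete: if a fraction \(p\) of transitions carry reward 1 and the rest carry 0, the constant predictor \(\hat{r} = p\) already attains the minimum possible MSE of \(p(1-p)\). With \(p = 0.02\) this floor is \(0.0196\), so a regressor can sit near-optimal while encoding nothing about the state or action. Action-effect labels, whose positive rate is typically far from zero, leave no such shortcut.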

Other auxiliary objectives

More broadly, any objective that teaches the LM to model how actions transform states could work. Possibilities include:

  • Next-state prediction: predict features of \(s_{t+1}\) given \((s_t, a_t)\). This is essentially learning a forward dynamics model in representation space.
  • Inverse dynamics: predict \(a_t\) given \((s_t, s_{t+1})\). This forces the features to encode action-distinguishing information.
  • Contrastive objectives: pull together representations of \((s_t, a_t)\) and \(s_{t+1}\) while pushing apart negative pairs.

The common thread is that all of these are stable, supervised objectives that can be trained with standard techniques, unlike the TD loss which involves bootstrapping and moving targets.
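
As one example of the contrastive option, a minimal InfoNCE-style sketch (the batch construction and temperature are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_sa: torch.Tensor, z_next: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE: align f(s_t, a_t) with f(s_{t+1}) within a batch.

    z_sa:   (B, d) embeddings of state-action pairs
    z_next: (B, d) embeddings of the corresponding next states
    The other next states in the batch serve as negative pairs.
    """
    z_sa = F.normalize(z_sa, dim=-1)
    z_next = F.normalize(z_next, dim=-1)
    logits = z_sa @ z_next.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(z_sa.size(0), device=z_sa.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```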

What Makes a Good Critic Representation?

The preceding discussion suggests that a good critic representation must satisfy two properties:

1. Action sensitivity. The representation \(f(s, a)\) must vary meaningfully with the action \(a\). Formally, for states \(s\) where different actions lead to different outcomes, we need:

\[a_1 \neq a_2 \implies f(s, a_1) \neq f(s, a_2)\]

This is not guaranteed by language model pretraining, which optimizes for next-token prediction — a task where the “action” (next token) is the output, not part of the input representation.

2. Transition awareness. The representation must encode information about what happens next. A feature that captures “there is a search bar on the screen” is useful for describing the state but not for predicting whether typing a query will yield relevant results. The representation needs to capture the causal structure: “typing in this search bar will trigger a product search.”

Standard LM pretraining provides neither property reliably. The model learns to predict tokens, not to predict state transitions. The fine-tuning phase bridges this gap by injecting transition-awareness into the frozen features.
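
Property 1 can also be probed directly at the feature level, before any TD training. One possible metric (a hypothetical diagnostic, not taken from the literature) compares how much \(f(s, a)\) varies across actions at a fixed state against how much it varies across states:

```python
import torch

@torch.no_grad()
def action_sensitivity(features_by_state: list[torch.Tensor]) -> float:
    """Ratio of across-action to across-state feature variation.

    features_by_state[i] is an (A_i, d) tensor of f(s_i, a) for A_i >= 2
    candidate actions at state s_i. Values near zero indicate features
    that barely register the action, i.e., an action-blind backbone.
    """
    within = torch.stack([f.std(dim=0).mean() for f in features_by_state])
    centroids = torch.stack([f.mean(dim=0) for f in features_by_state])
    across = centroids.std(dim=0).mean()
    return (within.mean() / (across + 1e-8)).item()
```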

The Compute Trade-Off

This two-phase approach involves a clear compute trade-off:

| Approach | Parameters trained | Training stability | Representation quality |
| --- | --- | --- | --- |
| Freeze only | ~0.1M (MLP) | Stable | Poor (action-blind) |
| End-to-end TD | ~1B+ (full LM) | Unstable | Potentially good but fragile |
| Fine-tune + freeze | ~1B (Phase 1) + ~0.1M (Phase 2) | Stable in both phases | Good |

The fine-tune-then-freeze approach trains the same number of total parameters as end-to-end, but separates the stable (supervised) and unstable (TD) optimization phases. Phase 1 can use standard supervised learning recipes (AdamW, cosine schedule, etc.) without worrying about bootstrapping dynamics. Phase 2 can use standard TD learning tricks (target networks, replay buffers) on a small parameter space where they are well-understood.

In Digi-Q, Phase 2 trains only the MLP head — about 1% of the total parameters. This makes each TD update orders of magnitude cheaper than an end-to-end update, allowing more gradient steps per unit of compute. The paper reports that this efficiency gain is critical: Digi-Q achieves higher success rates than the end-to-end baseline DigiRL even with the same amount of data.
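
The asymmetry is easy to verify by counting trainable parameters (a sketch reusing the hypothetical `LMCritic` and `configure` from above; the ~1% figure for Digi-Q is the paper's, everything else here is illustrative):

```python
import torch

def trainable_fraction(model: torch.nn.Module) -> float:
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

configure(critic, "finetune_then_freeze")  # freeze the backbone
# Each TD step now backpropagates through only the MLP head (~0.1M params)
# rather than the full backbone, so gradient steps are far cheaper.
print(f"{100 * trainable_fraction(critic):.2f}% of parameters trainable")
```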

Open Questions

Several questions remain open:

How much fine-tuning is enough? The binary action-effect objective is coarse. A richer objective (e.g., predicting the full next-state embedding) might produce better features but requires more compute. The optimal trade-off between fine-tuning richness and compute cost is unclear.

Can we avoid fine-tuning entirely with better prompting? Chen et al. (2024) explored using handcrafted prompts to extract VLM representations more suitable for RL, without any fine-tuning. This is cheaper but less effective than fine-tuning. As LMs become more capable, the gap between “prompted features” and “fine-tuned features” may shrink.

Does this extend to text-only domains? Most evidence comes from visual domains (device control, robotics). In text-only RL tasks (dialogue, code generation), the LM’s representations may already be more action-sensitive, since actions are tokens. The fine-tuning phase might be less critical — or entirely unnecessary — in these settings.

Can the critic survive online updates? As discussed in the Q-learning scalability blog, extending offline critics to online self-improvement introduces distribution shift. The frozen features, trained on offline data, may not generalize to states reached by the improved policy. Whether periodic representation re-fine-tuning can address this — and at what cost — is an open engineering and research challenge.

Conclusion

Language models can serve as critic functions, but not by simply swapping the head. The pretrained representations lack action-sensitivity and transition-awareness — properties that next-token prediction never incentivized. Making the critic work requires a deliberate representation fine-tuning phase that injects these properties, followed by TD learning on the resulting frozen features.

This two-phase recipe is not elegant. It adds complexity, requires designing auxiliary objectives, and introduces hyperparameters (what distance threshold \(\epsilon\)? how many fine-tuning steps?). But it is currently the most reliable way to get the best of both worlds: the rich semantic representations of large language models, and the stable optimization dynamics needed for value-based RL.

The deeper lesson is that understanding a scene and predicting the value of acting in it are fundamentally different capabilities. Language models excel at the former. Bridging to the latter requires teaching them, explicitly, about the causal structure of actions — something that no amount of internet text will provide for free.

References