Can Language Models Be Critic Functions?

A natural idea in the era of foundation models: take a pretrained language model, strip off the token-prediction head, attach a small MLP that outputs a scalar, and use it as a critic (value function or Q-function) for reinforcement learning. The LM has already learned rich representations of language, code, and visual scenes — surely these representations contain enough information to predict future returns?

This idea is simple and appealing. It is also, in its naive form, surprisingly ineffective. This post explores why, what it takes to make it work, and what the resulting design choices tell us about the gap between “understanding a scene” and “predicting the consequences of actions.”

The Architecture

The setup is straightforward. Given a pretrained language model (or vision-language model) with parameters \(\theta_{\text{LM}}\), the standard next-token prediction head maps the last hidden state to a distribution over the vocabulary:

\[p(x_{t+1} \vert x_{\leq t}) = \text{softmax}(W_{\text{head}} \cdot h_t + b)\]

where \(h_t \in \mathbb{R}^d\) is the hidden state at position \(t\) and \(W_{\text{head}} \in \mathbb{R}^{\vert\mathcal{V}\vert \times d}\) projects to vocabulary size. To build a critic, we replace this head with a small MLP that outputs a scalar:

\[Q_{\theta_{\text{MLP}}}(s, a) = \text{MLP}_{\theta_{\text{MLP}}}(h_t)\]

where \(h_t = f_{\theta_{\text{LM}}}(s, a)\) is the LM’s representation of the state-action pair. The MLP is typically 1–3 layers with a few hundred hidden units — trivially small compared to the LM backbone.
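
To make this concrete, here is a minimal PyTorch sketch of the head swap, using `gpt2` as a stand-in backbone (Digi-Q uses a VLM; the model name, MLP width, and pooling choice here are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LMCritic(nn.Module):
    """Q(s, a) = MLP(last hidden state of the LM run on the (s, a) text)."""

    def __init__(self, backbone_name: str = "gpt2", hidden: int = 256):
        super().__init__()
        # AutoModel loads the backbone *without* the token-prediction head.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        d = self.backbone.config.hidden_size
        # Small scalar head: a couple of layers, a few hundred units.
        self.q_head = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Pool the hidden state at the last non-padded position of each row.
        last = attention_mask.sum(dim=1) - 1
        h_t = out.last_hidden_state[torch.arange(input_ids.size(0)), last]
        return self.q_head(h_t).squeeze(-1)  # one scalar Q per (s, a) pair

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
critic = LMCritic()
batch = tokenizer(["<state: home screen> <action: click(0.3, 0.7)>"],
                  return_tensors="pt", padding=True)
q_values = critic(batch["input_ids"], batch["attention_mask"])
```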

The question is: what should we do with \(\theta_{\text{LM}}\)? There are three options:

  1. Freeze entirely. Train only \(\theta_{\text{MLP}}\). The LM is a fixed feature extractor.
  2. Fine-tune end-to-end. Update both \(\theta_{\text{LM}}\) and \(\theta_{\text{MLP}}\) with the TD loss.
  3. Fine-tune, then freeze. First adapt \(\theta_{\text{LM}}\) with an auxiliary objective, then freeze it and train \(\theta_{\text{MLP}}\) with TD learning.

Each choice has consequences that connect directly to the failure modes of Q-learning under function approximation.
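
In code, the three options differ only in which parameter groups receive gradients. A sketch, reusing the hypothetical `LMCritic` above:

```python
import torch

def configure(critic: LMCritic, mode: str) -> list[torch.nn.Parameter]:
    """Return the trainable parameters for each of the three options."""
    if mode == "freeze":  # Option 1: LM as a fixed feature extractor
        for p in critic.backbone.parameters():
            p.requires_grad = False
        return list(critic.q_head.parameters())
    if mode == "end_to_end":  # Option 2: TD gradients through everything
        return list(critic.parameters())
    if mode == "finetune_then_freeze":  # Option 3, Phase 2 configuration
        # (Phase 1 first adapts critic.backbone with an auxiliary loss.)
        for p in critic.backbone.parameters():
            p.requires_grad = False
        return list(critic.q_head.parameters())
    raise ValueError(f"unknown mode: {mode}")

optimizer = torch.optim.AdamW(configure(critic, "freeze"), lr=3e-4)
```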

Why Freezing Alone Fails

The most computationally attractive option — freeze the LM and train only the MLP — often produces a degenerate critic. The problem is that pretrained LM representations are not optimized for predicting action consequences.

Consider a VLM like LLaVA that processes a screenshot of a mobile device. The VLM can answer “What app is open?” or “What text is on the screen?” — it understands the content of the scene. But ask it “Will clicking at coordinates (0.3, 0.7) navigate to a new page?” and it fails. The internal representations encode visual semantics, not the causal structure of how actions transform states.

When we attach an MLP head and train it to predict Q-values on these frozen features, two failure modes emerge:

Failure mode 1: Action-blindness. The frozen features \(f_{\theta_{\text{LM}}}(s, a)\) may not meaningfully distinguish between different actions at the same state. If the LM was never trained to attend to action inputs, the hidden state \(h_t\) will be approximately the same regardless of the action \(a\). The MLP then learns a state-only value function \(V(s)\) instead of a state-action value function \(Q(s, a)\):

\[Q_{\theta_{\text{MLP}}}(s, a_1) \approx Q_{\theta_{\text{MLP}}}(s, a_2) \approx V(s) \quad \forall a_1, a_2\]

This is useless for policy extraction, which requires ranking actions. This is exactly what Digi-Q (Bai et al., 2025) observed when using off-the-shelf VLM representations: the Q-function collapsed to a V-function.
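
One cheap way to catch this collapse before committing to a full RL run is to probe how much the critic's output varies across candidate actions at a fixed state. A diagnostic sketch, continuing from the `LMCritic` above (the threshold is an illustrative assumption):

```python
import torch

@torch.no_grad()
def action_spread(critic, tokenizer, state: str, actions: list[str]) -> float:
    """Standard deviation of Q over candidate actions at a single state.

    A near-zero spread suggests the Q-function has collapsed to V(s).
    """
    texts = [f"{state} {a}" for a in actions]
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    q = critic(batch["input_ids"], batch["attention_mask"])
    return q.std().item()

spread = action_spread(critic, tokenizer, "<state: search results page>",
                       ["click(0.3, 0.7)", "scroll down", "type 'shoes'"])
if spread < 1e-3:  # illustrative threshold
    print("warning: critic may be action-blind")
```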

Failure mode 2: Insufficient coverage of task-relevant features. Even if the features are somewhat action-sensitive, they may not capture the specific aspects of the state that matter for value prediction. A language model trained on internet text knows about restaurant reviews and product descriptions, but its internal features may not encode “how many steps remain until task completion” or “whether the current page is a dead end.” These task-relevant features are critical for accurate value estimation but are absent from the pretraining distribution.

Empirically, Digi-Q found that using off-the-shelf VLM features (without any fine-tuning) achieved only 31.9% on the Web Shopping benchmark — barely better than the 25.0% behavior policy and substantially worse than the 58.0% achieved with representation fine-tuning.

Why End-to-End TD Learning Is Unstable

The opposite extreme — fine-tuning the entire LM backbone with the TD loss — addresses the representation problem but introduces severe optimization instability.

The TD loss for the Q-function is:

\[J_Q(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[(Q_\theta(s, a) - r - \gamma \max_{a'} Q_{\bar{\theta}}(s', a'))^2\right]\]

where \(\bar{\theta}\) is a delayed copy of \(\theta\). When \(\theta\) includes billions of LM parameters, several problems compound:

Moving target amplification. The target \(r + \gamma \max_{a'} Q_{\bar{\theta}}(s', a')\) depends on the same network being trained. With a small MLP head, the target changes slowly because the feature space is fixed. With end-to-end training, a single gradient step can shift the entire representation space, causing the target values to change dramatically for all state-action pairs simultaneously. The delayed target network \(\bar{\theta}\) mitigates this but cannot eliminate it when the representation shifts are large.

Bootstrapping through shared representations. In the Q-learning scalability blog, we discussed how the composed operator \(\Pi\mathcal{T}\) can fail to contract when \(\lVert\Pi\rVert > 1/\gamma\). With end-to-end training, the projection \(\Pi\) (onto the function class representable by the current parameters) changes at every gradient step. This is worse than a fixed projection — it is a moving projection that can amplify errors in unpredictable ways.

Catastrophic forgetting of useful features. The LM backbone was pretrained on a vast corpus and contains general-purpose representations useful for language understanding. Aggressive TD updates can destroy these features, replacing them with representations that minimize the TD loss on the current batch but generalize poorly. This is especially problematic in offline RL, where the dataset is finite and the risk of overfitting is high.

Prior work has documented these instabilities extensively. Kumar et al. (2021) found that value-based RL with large networks exhibits pathological training dynamics. Chebotar et al. (2023) (Q-Transformer) had to employ conservative Q-learning regularization combined with n-step returns to stabilize training — a complex recipe that is difficult to tune.

The Middle Path: Fine-Tune Then Freeze

The approach that works best in practice is a two-phase strategy:

Phase 1: Representation fine-tuning. Adapt the LM backbone with an auxiliary objective that teaches it to encode action-relevant information. This does not use the TD loss — it uses a supervised or self-supervised objective that is stable and well-understood.

Phase 2: TD learning on frozen features. Freeze the fine-tuned LM and train only the MLP head with the TD loss. Because the feature space is now fixed, the projection \(\Pi\) is stable, and the standard analysis of projected Bellman operators applies.
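
A sketch of the Phase 2 update, assuming features from the frozen backbone are precomputed and candidate next actions come from some proposal mechanism (how Digi-Q actually proposes next actions, and its full loss, are not shown here; the Polyak rate and learning rate are standard defaults, not the paper's values):

```python
import copy
import torch

GAMMA = 0.99
q_head = critic.q_head                 # the only trainable module in Phase 2
target_head = copy.deepcopy(q_head)    # the delayed copy, \bar{theta}
opt = torch.optim.AdamW(q_head.parameters(), lr=3e-4)

def td_step(h_sa, reward, h_next, done):
    """One TD update on frozen features.

    h_sa:   (B, d)    features f(s, a) from the frozen backbone
    h_next: (B, A, d) features f(s', a') for A candidate next actions
    """
    with torch.no_grad():
        # Max over candidate next actions, evaluated by the target head.
        q_next = target_head(h_next).squeeze(-1).max(dim=1).values
        target = reward + GAMMA * (1.0 - done) * q_next
    loss = (q_head(h_sa).squeeze(-1) - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    # Slowly track the online head (Polyak averaging, one common choice).
    with torch.no_grad():
        for p, tp in zip(q_head.parameters(), target_head.parameters()):
            tp.mul_(0.995).add_(0.005 * p)
    return loss.item()
```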

The key design question is: what auxiliary objective should Phase 1 use?

Binary action-effect prediction

Digi-Q uses a binary classification objective: given a transition \((s_t, a_t, s_{t+1})\), predict whether the action caused a substantial change in the state. This is operationalized as:

\[y_t = \begin{cases} 1, & d(s_t, s_{t+1}) \geq \epsilon \\ 0, & \text{otherwise} \end{cases}\]

where \(d(s_t, s_{t+1})\) is a distance between consecutive states and \(\epsilon\) is a change threshold. The LM is fine-tuned to output “yes” or “no” given the state-action pair as input. This teaches the representations to distinguish between actions that do something and actions that do nothing — a coarse but crucial signal for value prediction. If the LM cannot tell whether clicking a button will navigate to a new page, it certainly cannot predict the long-term return of that action.
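
A sketch of how these labels might be computed, assuming an embedding distance for \(d\) (the paper's exact distance function and threshold may differ):

```python
import torch.nn.functional as F

def action_effect_label(s_t, s_next, encode, eps: float = 0.1) -> int:
    """y_t = 1 iff the action visibly changed the state.

    `encode` maps a raw state (e.g., a screenshot) to a feature vector;
    cosine distance and eps=0.1 are illustrative choices, not Digi-Q's
    exact operationalization.
    """
    dist = 1.0 - F.cosine_similarity(encode(s_t), encode(s_next), dim=-1)
    return int(dist.item() >= eps)
```

Phase 1 then reduces to ordinary supervised fine-tuning: present the state-action pair to the LM as a prompt and train it, with a cross-entropy loss, to emit “yes” when \(y_t = 1\) and “no” otherwise.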

Why not fine-tune on rewards directly?

A tempting alternative: fine-tune the LM to predict immediate rewards \(r(s, a)\) via supervised regression. This would seem to directly teach the features needed for value estimation.

The problem is that reward prediction is a much harder supervised learning task than action-effect prediction, especially when rewards are sparse (e.g., binary 0/1 only at episode end). Most transitions have reward 0, so reward prediction degenerates into predicting a constant. The action-effect signal, by contrast, is dense: many transitions involve visible state changes, providing a rich training signal for representation learning.
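
To make the degeneracy concrete: if a fraction \(p\) of transitions carry reward 1 and the rest carry 0, the constant predictor \(\hat{r} = p\) already attains the minimum possible MSE of \(p(1-p)\). With \(p = 0.02\) this floor is \(0.0196\), so a regressor can sit near-optimal while encoding nothing about the state or action. Action-effect labels, whose positive rate is typically far from zero, leave no such shortcut.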

Other auxiliary objectives

More broadly, any objective that teaches the LM to model how actions transform states could work. Possibilities include:

  • Next-state prediction: predict features of \(s_{t+1}\) given \((s_t, a_t)\). This is essentially learning a forward dynamics model in representation space.
  • Inverse dynamics: predict \(a_t\) given \((s_t, s_{t+1})\). This forces the features to encode action-distinguishing information.
  • Contrastive objectives: pull together representations of \((s_t, a_t)\) and \(s_{t+1}\) while pushing apart negative pairs.

The common thread is that all of these are stable, supervised objectives that can be trained with standard techniques, unlike the TD loss which involves bootstrapping and moving targets.
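
As one example of the contrastive option, a minimal InfoNCE-style sketch (the batch construction and temperature are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_sa: torch.Tensor, z_next: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE: align f(s_t, a_t) with f(s_{t+1}) within a batch.

    z_sa:   (B, d) embeddings of state-action pairs
    z_next: (B, d) embeddings of the corresponding next states
    The other next states in the batch serve as negative pairs.
    """
    z_sa = F.normalize(z_sa, dim=-1)
    z_next = F.normalize(z_next, dim=-1)
    logits = z_sa @ z_next.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(z_sa.size(0), device=z_sa.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```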

What Makes a Good Critic Representation?

The preceding discussion suggests that a good critic representation must satisfy two properties:

1. Action sensitivity. The representation \(f(s, a)\) must vary meaningfully with the action \(a\). Formally, for states \(s\) where different actions lead to different outcomes, we need:

\[a_1 \neq a_2 \implies f(s, a_1) \neq f(s, a_2)\]

This is not guaranteed by language model pretraining, which optimizes for next-token prediction — a task where the “action” (next token) is the output, not part of the input representation.

2. Transition awareness. The representation must encode information about what happens next. A feature that captures “there is a search bar on the screen” is useful for describing the state but not for predicting whether typing a query will yield relevant results. The representation needs to capture the causal structure: “typing in this search bar will trigger a product search.”

Standard LM pretraining provides neither property reliably. The model learns to predict tokens, not to predict state transitions. The fine-tuning phase bridges this gap by injecting transition-awareness into the frozen features.
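
Property 1 can also be probed directly at the feature level, before any TD training. One possible metric (a hypothetical diagnostic, not taken from the literature) compares how much \(f(s, a)\) varies across actions at a fixed state against how much it varies across states:

```python
import torch

@torch.no_grad()
def action_sensitivity(features_by_state: list[torch.Tensor]) -> float:
    """Ratio of across-action to across-state feature variation.

    features_by_state[i] is an (A_i, d) tensor of f(s_i, a) for A_i >= 2
    candidate actions at state s_i. Values near zero indicate features
    that barely register the action, i.e., an action-blind backbone.
    """
    within = torch.stack([f.std(dim=0).mean() for f in features_by_state])
    centroids = torch.stack([f.mean(dim=0) for f in features_by_state])
    across = centroids.std(dim=0).mean()
    return (within.mean() / (across + 1e-8)).item()
```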

The Compute Trade-Off

This two-phase approach involves a clear compute trade-off:

| Approach | Parameters trained | Training stability | Representation quality |
| --- | --- | --- | --- |
| Freeze only | ~0.1M (MLP) | Stable | Poor (action-blind) |
| End-to-end TD | ~1B+ (full LM) | Unstable | Potentially good but fragile |
| Fine-tune + freeze | ~1B (Phase 1) + ~0.1M (Phase 2) | Stable in both phases | Good |

The fine-tune-then-freeze approach trains the same number of total parameters as end-to-end, but separates the stable (supervised) and unstable (TD) optimization phases. Phase 1 can use standard supervised learning recipes (AdamW, cosine schedule, etc.) without worrying about bootstrapping dynamics. Phase 2 can use standard TD learning tricks (target networks, replay buffers) on a small parameter space where they are well-understood.

In Digi-Q, Phase 2 trains only the MLP head — about 1% of the total parameters. This makes each TD update orders of magnitude cheaper than an end-to-end update, allowing more gradient steps per unit of compute. The paper reports that this efficiency gain is critical: Digi-Q achieves higher success rates than the end-to-end baseline DigiRL even with the same amount of data.
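
The asymmetry is easy to verify by counting trainable parameters (a sketch reusing the hypothetical `LMCritic` and `configure` from above; the ~1% figure for Digi-Q is the paper's, everything else here is illustrative):

```python
import torch

def trainable_fraction(model: torch.nn.Module) -> float:
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

configure(critic, "finetune_then_freeze")  # freeze the backbone
# Each TD step now backpropagates through only the MLP head (~0.1M params)
# rather than the full backbone, so gradient steps are far cheaper.
print(f"{100 * trainable_fraction(critic):.2f}% of parameters trainable")
```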

Open Questions

Several questions remain open:

How much fine-tuning is enough? The binary action-effect objective is coarse. A richer objective (e.g., predicting the full next-state embedding) might produce better features but requires more compute. The optimal trade-off between fine-tuning richness and compute cost is unclear.

Can we avoid fine-tuning entirely with better prompting? Chen et al. (2024) explored using handcrafted prompts to extract VLM representations more suitable for RL, without any fine-tuning. This is cheaper but less effective than fine-tuning. As LMs become more capable, the gap between “prompted features” and “fine-tuned features” may shrink.

Does this extend to text-only domains? Most evidence comes from visual domains (device control, robotics). In text-only RL tasks (dialogue, code generation), the LM’s representations may already be more action-sensitive, since actions are tokens. The fine-tuning phase might be less critical — or entirely unnecessary — in these settings.

Can the critic survive online updates? As discussed in the Q-learning scalability blog, extending offline critics to online self-improvement introduces distribution shift. The frozen features, trained on offline data, may not generalize to states reached by the improved policy. Whether periodic representation re-fine-tuning can address this — and at what cost — is an open engineering and research challenge.

Conclusion

Language models can serve as critic functions, but not by simply swapping the head. The pretrained representations lack action-sensitivity and transition-awareness — properties that next-token prediction never incentivized. Making the critic work requires a deliberate representation fine-tuning phase that injects these properties, followed by TD learning on the resulting frozen features.

This two-phase recipe is not elegant. It adds complexity, requires designing auxiliary objectives, and introduces hyperparameters (what distance threshold \(\epsilon\)? how many fine-tuning steps?). But it is currently the most reliable way to get the best of both worlds: the rich semantic representations of large language models, and the stable optimization dynamics needed for value-based RL.

The deeper lesson is that understanding a scene and predicting the value of acting in it are fundamentally different capabilities. Language models excel at the former. Bridging to the latter requires teaching them, explicitly, about the causal structure of actions — something that no amount of internet text will provide for free.

References