Can Language Models Be Critic Functions?

A natural idea in the era of foundation models: take a pretrained language model, strip off the token-prediction head, attach a small MLP that outputs a scalar, and use it as a critic (value function or Q-function) for reinforcement learning. The LM has already learned rich representations of language, code, and visual scenes — surely these representations contain enough information to predict future returns?

This idea is simple and appealing. It is also, in its naive form, surprisingly ineffective. This post explores why, what it takes to make it work, and what the resulting design choices tell us about the gap between “understanding a scene” and “predicting the consequences of actions.”

The Architecture

The setup is straightforward. Given a pretrained language model (or vision-language model) with parameters \(\theta_{\text{LM}}\), the standard next-token prediction head maps the last hidden state to a distribution over the vocabulary:

\[p(x_{t+1} \vert x_{\leq t}) = \text{softmax}(W_{\text{head}} \cdot h_t + b)\]

where \(h_t \in \mathbb{R}^d\) is the hidden state at position \(t\) and \(W_{\text{head}} \in \mathbb{R}^{\vert\mathcal{V}\vert \times d}\) projects to vocabulary size. To build a critic, we replace this head with a small MLP that outputs a scalar:

\[Q_{\theta_{\text{MLP}}}(s, a) = \text{MLP}_{\theta_{\text{MLP}}}(h_t)\]

where \(h_t = f_{\theta_{\text{LM}}}(s, a)\) is the LM’s representation of the state-action pair. The MLP is typically 1–3 layers with a few hundred hidden units — trivially small compared to the LM backbone.
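The head swap can be sketched in a few lines. This is a minimal illustration, not the actual implementation: a fixed random projection stands in for the LM backbone \(f_{\theta_{\text{LM}}}\), and the dimensions are far smaller than a real model's.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID, D_MLP = 16, 64, 32  # illustrative sizes; real LMs use d in the thousands

# Stand-in for the LM backbone f_{theta_LM}: maps a (state, action) encoding
# to a hidden vector h_t. In practice this is the LM's last hidden state.
W_lm = rng.normal(size=(D_HID, D_IN)) / np.sqrt(D_IN)

def lm_features(sa_encoding):
    return np.tanh(W_lm @ sa_encoding)  # h_t in R^{D_HID}

# The critic head that replaces the vocabulary projection: a tiny 2-layer
# ReLU MLP mapping h_t to a single scalar Q-value.
W1 = rng.normal(size=(D_MLP, D_HID)) * 0.1
W2 = rng.normal(size=(1, D_MLP)) * 0.1

def q_value(sa_encoding):
    h = lm_features(sa_encoding)
    return float(W2 @ np.maximum(W1 @ h, 0.0))

q = q_value(rng.normal(size=D_IN))
```

The MLP here has a few thousand parameters against the backbone's billions, which is what makes the freeze-versus-fine-tune decision below so consequential.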

The question is: what should we do with \(\theta_{\text{LM}}\)? There are three options:

  1. Freeze entirely. Train only \(\theta_{\text{MLP}}\). The LM is a fixed feature extractor.
  2. Fine-tune end-to-end. Update both \(\theta_{\text{LM}}\) and \(\theta_{\text{MLP}}\) with the TD loss.
  3. Fine-tune, then freeze. First adapt \(\theta_{\text{LM}}\) with an auxiliary objective, then freeze it and train \(\theta_{\text{MLP}}\) with TD learning.

Each choice has consequences that connect directly to the failure modes of Q-learning under function approximation.

Why Freezing Alone Fails

The most computationally attractive option — freeze the LM and train only the MLP — often produces a degenerate critic. The problem is that pretrained LM representations are not optimized for predicting action consequences.

Consider a VLM like LLaVA that processes a screenshot of a mobile device. The VLM can answer “What app is open?” or “What text is on the screen?” — it understands the content of the scene. But ask it “Will clicking at coordinates (0.3, 0.7) navigate to a new page?” and it fails. The internal representations encode visual semantics, not the causal structure of how actions transform states.

When we attach an MLP head and train it to predict Q-values on these frozen features, two failure modes emerge:

Failure mode 1: Action-blindness. The frozen features \(f_{\theta_{\text{LM}}}(s, a)\) may not meaningfully distinguish between different actions at the same state. If the LM was never trained to attend to action inputs, the hidden state \(h_t\) will be approximately the same regardless of the action \(a\). The MLP then learns a state-only value function \(V(s)\) instead of a state-action value function \(Q(s, a)\):

\[Q_{\theta_{\text{MLP}}}(s, a_1) \approx Q_{\theta_{\text{MLP}}}(s, a_2) \approx V(s) \quad \forall a_1, a_2\]

This is useless for policy extraction, which requires ranking actions. This is exactly what Digi-Q (Bai et al., 2025) observed when using off-the-shelf VLM representations: the Q-function collapsed to a V-function.
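Action-blindness is easy to demonstrate in miniature. In this hypothetical sketch, the frozen feature function silently drops its action argument — mimicking a backbone that never learned to attend to actions — and the resulting critic assigns every action the same value:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical frozen backbone that never attends to the action: h_t depends
# only on the state, as happens when pretraining gave no reason to encode actions.
W = rng.normal(size=(32, 8))

def frozen_features(state, action):
    return np.tanh(W @ state)  # the action argument is silently ignored

w_head = rng.normal(size=32)

def q(state, action):
    return float(w_head @ frozen_features(state, action))

state = rng.normal(size=8)
qs = [q(state, a) for a in range(5)]
spread = max(qs) - min(qs)  # every action gets the same value, so Q == V(s)
```

No amount of training on the head can recover the action information: it was discarded before the MLP ever saw it.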

Failure mode 2: Insufficient coverage of task-relevant features. Even if the features are somewhat action-sensitive, they may not capture the specific aspects of the state that matter for value prediction. A language model trained on internet text knows about restaurant reviews and product descriptions, but its internal features may not encode “how many steps remain until task completion” or “whether the current page is a dead end.” These task-relevant features are critical for accurate value estimation but are absent from the pretraining distribution.

Empirically, Digi-Q found that using off-the-shelf VLM features (without any fine-tuning) achieved only 31.9% on the Web Shopping benchmark — barely better than the 25.0% behavior policy and substantially worse than the 58.0% achieved with representation fine-tuning.

Why End-to-End TD Learning Is Unstable

The opposite extreme — fine-tuning the entire LM backbone with the TD loss — addresses the representation problem but introduces severe optimization instability.

The TD loss for the Q-function is:

\[J_Q(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[(Q_\theta(s, a) - r - \gamma \max_{a'} Q_{\bar{\theta}}(s', a'))^2\right]\]

where \(\bar{\theta}\) is a delayed copy of \(\theta\). When \(\theta\) includes billions of LM parameters, several problems compound:

Moving target amplification. The target \(r + \gamma \max_{a'} Q_{\bar{\theta}}(s', a')\) depends on the same network being trained. With a small MLP head, the target changes slowly because the feature space is fixed. With end-to-end training, a single gradient step can shift the entire representation space, causing the target values to change dramatically for all state-action pairs simultaneously. The delayed target network \(\bar{\theta}\) mitigates this but cannot eliminate it when the representation shifts are large.

Bootstrapping through shared representations. In the Q-learning scalability blog, we discussed how the composed operator \(\Pi\mathcal{T}\) can fail to contract when \(\lVert\Pi\rVert > 1/\gamma\). With end-to-end training, the projection \(\Pi\) (onto the function class representable by the current parameters) changes at every gradient step. This is worse than a fixed projection — it is a moving projection that can amplify errors in unpredictable ways.

Catastrophic forgetting of useful features. The LM backbone was pretrained on a vast corpus and contains general-purpose representations useful for language understanding. Aggressive TD updates can destroy these features, replacing them with representations that minimize the TD loss on the current batch but generalize poorly. This is especially problematic in offline RL, where the dataset is finite and the risk of overfitting is high.

Prior work has documented these instabilities extensively. Kumar et al. (2021) found that value-based RL with large networks exhibits pathological training dynamics. Chebotar et al. (2023) (Q-Transformer) had to employ conservative Q-learning regularization combined with n-step returns to stabilize training — a complex recipe that is difficult to tune.

The Middle Path: Fine-Tune Then Freeze

The approach that works best in practice is a two-phase strategy:

Phase 1: Representation fine-tuning. Adapt the LM backbone with an auxiliary objective that teaches it to encode action-relevant information. This does not use the TD loss — it uses a supervised or self-supervised objective that is stable and well-understood.

Phase 2: TD learning on frozen features. Freeze the fine-tuned LM and train only the MLP head with the TD loss. Because the feature space is now fixed, the projection \(\Pi\) is stable, and the standard analysis of projected Bellman operators applies.
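Phase 2 reduces to a standard TD loop over fixed features. A minimal sketch, assuming a linear head for brevity and a fixed random projection standing in for the fine-tuned backbone \(\phi(s, a)\):

```python
import numpy as np

rng = np.random.default_rng(2)
GAMMA, LR = 0.99, 0.1

# Phase-1 output: a frozen feature map phi(s, a). A fixed random projection
# stands in for the fine-tuned LM backbone here.
W_phi = rng.normal(size=(16, 8)) / np.sqrt(8)

def phi(s, a):
    return np.tanh(W_phi @ np.concatenate([s, a]))

w = np.zeros(16)         # trainable head (linear, for brevity)
w_target = w.copy()      # delayed copy \bar{theta} for the bootstrapped target

def td_step(s, a, r, s_next, candidate_actions):
    """One semi-gradient TD update on the head; the backbone never moves."""
    global w
    target = r + GAMMA * max(
        float(w_target @ phi(s_next, a2)) for a2 in candidate_actions)
    pred = float(w @ phi(s, a))
    w += LR * (target - pred) * phi(s, a)
    return (target - pred) ** 2

s, a = rng.normal(size=4), rng.normal(size=4)
actions = [rng.normal(size=4) for _ in range(3)]
err = td_step(s, a, r=1.0, s_next=rng.normal(size=4), candidate_actions=actions)
```

In practice `w_target` is refreshed periodically (`w_target = w.copy()`), and because only the head moves, each update is cheap and the projection \(\Pi\) stays fixed.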

The key design question is: what auxiliary objective should Phase 1 use?

Binary action-effect prediction

Digi-Q uses a binary classification objective: given a transition \((s_t, a_t, s_{t+1})\), predict whether the action caused a substantial change in the state. This is operationalized as:

\[y_t = \begin{cases} 1, & d(s_t, s_{t+1}) \geq \epsilon \\ 0, & \text{otherwise} \end{cases}\]

where \(d(\cdot, \cdot)\) is a distance between consecutive state observations and \(\epsilon\) is a threshold. The LM is fine-tuned to output “yes” or “no” given the state-action pair as input. This teaches the representations to distinguish between actions that do something and actions that do nothing — a coarse but crucial signal for value prediction. If the LM cannot tell whether clicking a button will navigate to a new page, it certainly cannot predict the long-term return of that action.
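Computing the labels \(y_t\) is mechanical once a distance is chosen. A sketch, where the mean absolute feature difference is one illustrative choice of \(d\) (the paper's exact distance is an implementation detail):

```python
import numpy as np

EPS = 0.05  # the distance threshold epsilon; a tunable hyperparameter

def action_effect_label(s_t, s_next, eps=EPS):
    """Binary label y_t: did the action cause a substantial state change?"""
    d = float(np.mean(np.abs(np.asarray(s_next) - np.asarray(s_t))))
    return int(d >= eps)

s = np.zeros(10)
no_op = action_effect_label(s, s)          # a click that did nothing
changed = action_effect_label(s, s + 1.0)  # a click that changed the screen
```

The labels are free to compute from logged transitions, which is part of the appeal: the auxiliary objective needs no human annotation.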

Why not fine-tune on rewards directly?

A tempting alternative: fine-tune the LM to predict immediate rewards \(r(s, a)\) via supervised regression. This would seem to directly teach the features needed for value estimation.

The problem is that reward prediction is a much harder supervised learning task than action-effect prediction, especially when rewards are sparse (e.g., binary 0/1 only at episode end). Most transitions have reward 0, so reward prediction degenerates into predicting a constant. The action-effect signal, by contrast, is dense: many transitions involve visible state changes, providing a rich training signal for representation learning.

Other auxiliary objectives

More broadly, any objective that teaches the LM to model how actions transform states could work. Possibilities include:

  • Next-state prediction: predict features of \(s_{t+1}\) given \((s_t, a_t)\). This is essentially learning a forward dynamics model in representation space.
  • Inverse dynamics: predict \(a_t\) given \((s_t, s_{t+1})\). This forces the features to encode action-distinguishing information.
  • Contrastive objectives: pull together representations of \((s_t, a_t)\) and \(s_{t+1}\) while pushing apart negative pairs.

The common thread is that all of these are stable, supervised objectives that can be trained with standard techniques, unlike the TD loss which involves bootstrapping and moving targets.
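As one concrete instance, the contrastive option can be sketched with an InfoNCE-style loss. This is an illustrative implementation, not one drawn from any of the cited papers:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temp=0.1):
    """InfoNCE-style loss: pull f(s_t, a_t) toward f(s_{t+1}) and push it
    away from other next-states in the batch (the negatives)."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temp
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -float(np.log(probs[0]))              # positive sits at index 0

rng = np.random.default_rng(3)
z = rng.normal(size=16)
negs = [rng.normal(size=16) for _ in range(8)]
loss_aligned = info_nce(z, z, negs)              # representations already match
loss_random = info_nce(z, rng.normal(size=16), negs)
```

Minimizing this loss forces \(f(s_t, a_t)\) to carry enough information to identify the true next state among distractors — exactly the transition-awareness a critic needs.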

What Makes a Good Critic Representation?

The preceding discussion suggests that a good critic representation must satisfy two properties:

1. Action sensitivity. The representation \(f(s, a)\) must vary meaningfully with the action \(a\). Formally, for states \(s\) where different actions lead to different outcomes, we need:

\[a_1 \neq a_2 \implies f(s, a_1) \neq f(s, a_2)\]

This is not guaranteed by language model pretraining, which optimizes for next-token prediction — a task where the “action” (next token) is the output, not part of the input representation.

2. Transition awareness. The representation must encode information about what happens next. A feature that captures “there is a search bar on the screen” is useful for describing the state but not for predicting whether typing a query will yield relevant results. The representation needs to capture the causal structure: “typing in this search bar will trigger a product search.”

Standard LM pretraining provides neither property reliably. The model learns to predict tokens, not to predict state transitions. The fine-tuning phase bridges this gap by injecting transition-awareness into the frozen features.

The Compute Trade-Off

This two-phase approach involves a clear compute trade-off:

| Approach | Parameters trained | Training stability | Representation quality |
| --- | --- | --- | --- |
| Freeze only | ~0.1M (MLP) | Stable | Poor (action-blind) |
| End-to-end TD | ~1B+ (full LM) | Unstable | Potentially good but fragile |
| Fine-tune + freeze | ~1B (Phase 1) + ~0.1M (Phase 2) | Stable in both phases | Good |

The fine-tune-then-freeze approach trains the same number of total parameters as end-to-end, but separates the stable (supervised) and unstable (TD) optimization phases. Phase 1 can use standard supervised learning recipes (AdamW, cosine schedule, etc.) without worrying about bootstrapping dynamics. Phase 2 can use standard TD learning tricks (target networks, replay buffers) on a small parameter space where they are well-understood.

In Digi-Q, Phase 2 trains only the MLP head — about 1% of the total parameters. This makes each TD update orders of magnitude cheaper than an end-to-end update, allowing more gradient steps per unit of compute. The paper reports that this efficiency gain is critical: Digi-Q achieves higher success rates than the end-to-end baseline DigiRL even with the same amount of data.

Open Questions

Several questions remain open:

How much fine-tuning is enough? The binary action-effect objective is coarse. A richer objective (e.g., predicting the full next-state embedding) might produce better features but requires more compute. The optimal trade-off between fine-tuning richness and compute cost is unclear.

Can we avoid fine-tuning entirely with better prompting? Chen et al. (2024) explored using handcrafted prompts to extract VLM representations more suitable for RL, without any fine-tuning. This is cheaper but less effective than fine-tuning. As LMs become more capable, the gap between “prompted features” and “fine-tuned features” may shrink.

Does this extend to text-only domains? Most evidence comes from visual domains (device control, robotics). In text-only RL tasks (dialogue, code generation), the LM’s representations may already be more action-sensitive, since actions are tokens. The fine-tuning phase might be less critical — or entirely unnecessary — in these settings.

Can the critic survive online updates? As discussed in the Q-learning scalability blog, extending offline critics to online self-improvement introduces distribution shift. The frozen features, trained on offline data, may not generalize to states reached by the improved policy. Whether periodic representation re-fine-tuning can address this — and at what cost — is an open engineering and research challenge.

Conclusion

Language models can serve as critic functions, but not by simply swapping the head. The pretrained representations lack action-sensitivity and transition-awareness — properties that next-token prediction never incentivized. Making the critic work requires a deliberate representation fine-tuning phase that injects these properties, followed by TD learning on the resulting frozen features.

This two-phase recipe is not elegant. It adds complexity, requires designing auxiliary objectives, and introduces hyperparameters (what distance threshold \(\epsilon\)? how many fine-tuning steps?). But it is currently the most reliable way to get the best of both worlds: the rich semantic representations of large language models, and the stable optimization dynamics needed for value-based RL.

The deeper lesson is that understanding a scene and predicting the value of acting in it are fundamentally different capabilities. Language models excel at the former. Bridging to the latter requires teaching them, explicitly, about the causal structure of actions — something that no amount of internet text will provide for free.

References