Challenges in Scaling Q-Learning

Recently, I revisited Seohong Park's blog post, “Q-learning is not yet scalable”, and would like to share some notes. In that post, scaling is defined as whether an algorithm can solve harder problems, which usually require a longer horizon, given more data, compute, and time. Task difficulty can be scaled along two dimensions, depth and width, a framing we also found useful in WebGym when defining the difficulty of a task. Seohong argues that the depth (long-horizon) dimension is both the more important and the harder to solve.

Why Q-Learning? The Off-Policy Advantage

Before diving into the scaling challenges, it is worth understanding what makes Q-learning fundamentally different from policy gradient methods — and why researchers keep trying to scale it despite its difficulties.

Policy gradient methods (REINFORCE, PPO, etc.) are on-policy: they can only learn from trajectories generated by the current policy. If the policy changes, all previously collected data becomes stale and must be discarded or corrected via importance sampling. This is a severe constraint — it means the agent must constantly re-collect data as it improves.

Q-learning is off-policy: it learns a value function \(Q(s, a)\) that estimates the expected return of taking action \(a\) in state \(s\) and then acting optimally thereafter. The crucial property is that this Q-function can be updated from any transition \((s, a, r, s')\), regardless of which policy generated it. As Ilya Sutskever put it: off-policy means “I can learn from anyone trying to achieve any goal.”

This has profound implications. A Q-learning agent can learn from:

  • Its own past experience (replay buffer), even from policies that are now outdated
  • Demonstrations collected by a human or another agent
  • Data generated by a completely different task, if the state-action space overlaps

In principle, this makes Q-learning far more data-efficient. In practice, however, this advantage comes with steep costs — instability, overestimation bias, and the challenges that are the subject of this post. The question is whether these costs can be overcome at scale. (See also our detailed treatment of the policy gradient family for the on-policy perspective.)

Contraction in TD Learning

For long-horizon tasks, the bias introduced by the bootstrapped TD target compounds across the bootstrapping chain. For each state-action pair, the bootstrapped target is

\[y(s, a) = r(s, a) + \gamma \max_{a'} Q_{\bar{\theta}}(s', a')\]

where \(\bar{\theta}\) denotes the parameters prior to the current update. This target is biased relative to the true optimal value, defined as the expected discounted return under the optimal policy:

\[Q^*(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]\]

Tabular Case

In the tabular setting, this bias can be systematically reduced iteration by iteration.

Define the Bellman optimality operator \(\mathcal{T}\) as:

\[(\mathcal{T}Q)(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[r(s, a) + \gamma \max_{a'} Q(s', a')\right]\]

Step 1: \(\mathcal{T}\) is a \(\gamma\)-contraction. For any two Q-functions \(Q\) and \(Q'\):

\[\begin{aligned} |(\mathcal{T}Q)(s,a) - (\mathcal{T}Q')(s,a)| &= \gamma \left|\mathbb{E}_{s'}\left[\max_{a'} Q(s', a') - \max_{a'} Q'(s', a')\right]\right| \\ &\leq \gamma \, \mathbb{E}_{s'}\left[\max_{a'} |Q(s', a') - Q'(s', a')|\right] \\ &\leq \gamma \|Q - Q'\|_\infty \end{aligned}\]

where the first inequality uses \(\vert\max_a f(a) - \max_a g(a)\vert \leq \max_a \vert f(a) - g(a)\vert\). Taking the supremum over \((s, a)\) on the left gives

\[\lVert\mathcal{T}Q - \mathcal{T}Q'\rVert_\infty \leq \gamma \lVert Q - Q'\rVert_\infty\]

By the Banach fixed point theorem, \(\mathcal{T}\) has a unique fixed point \(Q^*\) and repeated application converges:

\[\lVert\mathcal{T}^k Q - Q^*\rVert_\infty \leq \gamma^k \lVert Q - Q^*\rVert_\infty \to 0\]
Interactive value iteration on a branching MDP. Drag the sliders to step through Bellman operator iterations and per-state updates, or press Play to animate.
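To make the contraction tangible outside the widget, here is a minimal numpy sketch (assuming a small random MDP rather than the branching MDP in the figure) that applies \(\mathcal{T}\) repeatedly and prints \(\lVert Q_k - Q^*\rVert_\infty\), which shrinks by at least a factor of \(\gamma\) per iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9

# Random tabular MDP: P[s, a] is a probability distribution over next states.
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
R = rng.uniform(0, 1, size=(S, A))           # reward r(s, a)

def bellman(Q):
    """Apply the Bellman optimality operator T to a Q-table."""
    return R + gamma * P @ Q.max(axis=1)     # (S, A, S) @ (S,) -> (S, A)

# Converge (to numerical precision) to obtain Q*.
Q_star = np.zeros((S, A))
for _ in range(1000):
    Q_star = bellman(Q_star)

# A fresh run: the sup-norm error contracts by at least gamma per iteration.
Q = np.zeros((S, A))
for k in range(10):
    print(f"iter {k}: error = {np.abs(Q - Q_star).max():.6f}")
    Q = bellman(Q)
```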

Step 2: Q-learning as stochastic approximation. Value iteration requires the full transition model \(P(s' \mid s, a)\) to compute the expectation in \(\mathcal{T}\). Q-learning (Watkins, 1989) removes this requirement by learning directly from experience. The agent maintains a Q-table \(Q(s, a)\) for every state-action pair, initialized arbitrarily. At each time step, the agent:

  1. Observes the current state \(s_t\).
  2. Selects an action \(a_t\) using a behavior policy (typically \(\varepsilon\)-greedy: with probability \(\varepsilon\) pick a random action, otherwise pick \(\arg\max_a Q(s_t, a)\)).
  3. Executes \(a_t\), observes reward \(r_t\) and next state \(s_{t+1}\).
  4. Updates the Q-value for the visited pair \((s_t, a_t)\):
\[Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \underbrace{\left[r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)\right]}_{\text{TD error } \delta_t}\]

All other entries remain unchanged: \(Q_{t+1}(s, a) = Q_t(s, a)\) for \((s, a) \neq (s_t, a_t)\). The TD error \(\delta_t\) measures how far the current estimate \(Q_t(s_t, a_t)\) is from the one-step bootstrapped target \(r_t + \gamma \max_{a'} Q_t(s_{t+1}, a')\). The learning rate \(\alpha_t \in (0, 1]\) controls the step size: larger \(\alpha\) accelerates learning but increases variance, while smaller \(\alpha\) stabilizes convergence at the cost of speed.
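The loop above translates almost line for line into code. Below is a minimal sketch; the environment interface (`env.reset()` and `env.step(a)` returning `(s_next, r, done)`) is an assumed convention, not a specific library:

```python
import numpy as np

def q_learning(env, n_states, n_actions, steps=50_000, gamma=0.9, eps=0.3):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))   # per-pair visit counts
    s = env.reset()
    for _ in range(steps):
        # Epsilon-greedy action selection.
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next, r, done = env.step(a)
        # Robbins-Monro step size alpha_t = 1/(t+1), tracked per (s, a) pair.
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]
        # TD error and update on the visited entry only.
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```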

Why does this converge? The key insight is that the sampled TD target is an unbiased estimate of the Bellman operator applied to the current Q-function:

\[\mathbb{E}[r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') \mid s_t = s, a_t = a, Q_t] = (\mathcal{T}Q_t)(s, a)\]

This means each Q-learning update is, in expectation, a step of value iteration on the visited \((s, a)\) pair. The update decomposes into a contraction step plus zero-mean noise:

\[Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t \left[\underbrace{(\mathcal{T}Q_t)(s, a) - Q_t(s, a)}_{\text{signal: contracts toward } Q^*} + \underbrace{\varepsilon_t}_{\text{noise}}\right]\]

where

\[\varepsilon_t = \delta_t - \mathbb{E}[\delta_t \mid \mathcal{F}_t]\]

is a martingale difference sequence satisfying \(\mathbb{E}[\varepsilon_t \mid \mathcal{F}_t] = 0\). Under the Robbins-Monro conditions (\(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\)) and sufficient state-action coverage, the contraction signal dominates the noise and \(Q_t \to Q^*\) almost surely.

Martingale differences and stochastic approximation.

A martingale is a stochastic process whose conditional expectation of the next value, given all past information, equals the current value. Intuitively, it is a "fair game" with no drift. A sequence \(\{\varepsilon_t\}\) is a martingale difference if \(\mathbb{E}[\varepsilon_t \mid \mathcal{F}_t] = 0\), where \(\mathcal{F}_t = \sigma(s_0, a_0, r_0, \ldots, s_t, a_t, Q_t)\) is the filtration (the complete history up to step \(t\)). This means: conditioned on everything the agent has seen so far, the noise has zero expected value, i.e., it is equally likely to push the estimate up or down. Crucially, this does not require the noise to be independent across time steps; it only requires zero conditional mean.

This structure is what makes Q-learning a stochastic approximation algorithm. The general Robbins-Monro framework studies iterations of the form \(\theta_{t+1} = \theta_t + \alpha_t [h(\theta_t) + \varepsilon_t]\), where \(h(\theta^*) = 0\) defines the target fixed point. In Q-learning, \(h(Q) = \mathcal{T}Q - Q\) and \(\theta^* = Q^*\). Convergence requires:

  • \(\sum_t \alpha_t = \infty\): the step sizes must be large enough in total so the iterates can travel from any initialization to \(Q^*\). If \(\sum \alpha_t < \infty\), the iterates may stall before reaching the fixed point.
  • \(\sum_t \alpha_t^2 < \infty\): the step sizes must shrink fast enough so that the accumulated noise variance \(\sum \alpha_t^2 \text{Var}(\varepsilon_t)\) remains bounded. This ensures the noise averages out rather than accumulating.

A common choice satisfying both conditions is \(\alpha_t = 1/(t+1)\).
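A quick simulation makes the role of the step-size conditions visible. The sketch below uses a toy scalar fixed-point problem (our own construction, not Q-learning itself): with \(\alpha_t = 1/(t+1)\) the iterate settles at \(\theta^*\), while a constant step size leaves persistent jitter:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0
h = lambda th: theta_star - th        # drift term with h(theta*) = 0

def run(step_size, steps=10_000):
    th = 10.0
    for t in range(steps):
        noise = rng.normal(0, 1.0)    # martingale-difference noise, zero mean
        th += step_size(t) * (h(th) + noise)
    return th

print("decaying alpha_t = 1/(t+1):", run(lambda t: 1.0 / (t + 1)))
print("constant  alpha   = 0.1   :", run(lambda t: 0.1))
```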

Proof that \(\varepsilon_t\) is a martingale difference. We show \(\mathbb{E}[\varepsilon_t \mid \mathcal{F}_t] = 0\).

Step 1: Compute \(\mathbb{E}[\delta_t \mid \mathcal{F}_t]\). The TD error is:

$$\delta_t = r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$$

Since \(s_t, a_t, Q_t \in \mathcal{F}_t\), the term \(Q_t(s_t, a_t)\) is a known constant and comes out of the expectation. The only randomness is over the transition \(s_{t+1} \sim P(\cdot \mid s_t, a_t)\):

$$\mathbb{E}[\delta_t \mid \mathcal{F}_t] = \underbrace{\mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[r(s_t, a_t) + \gamma \max_{a'} Q_t(s_{t+1}, a')\right]}_{= \, (\mathcal{T}Q_t)(s_t, a_t) \text{ by definition of } \mathcal{T}} - \; Q_t(s_t, a_t)$$

The first term is exactly the Bellman operator applied to \(Q_t\), because \(\mathcal{T}\) is the expectation of the one-step bootstrapped return. Therefore:

$$\mathbb{E}[\delta_t \mid \mathcal{F}_t] = (\mathcal{T}Q_t)(s_t, a_t) - Q_t(s_t, a_t)$$

Step 2: Show \(\mathbb{E}[\varepsilon_t \mid \mathcal{F}_t] = 0\). By definition, \(\varepsilon_t = \delta_t - \mathbb{E}[\delta_t \mid \mathcal{F}_t]\). Taking the conditional expectation:

$$\mathbb{E}[\varepsilon_t \mid \mathcal{F}_t] = \mathbb{E}[\delta_t \mid \mathcal{F}_t] - \mathbb{E}[\delta_t \mid \mathcal{F}_t] = 0 \quad \square$$

Equivalently, \(\varepsilon_t = [r_t + \gamma \max_{a'} Q_t(s_{t+1}, a')] - (\mathcal{T}Q_t)(s_t, a_t)\) is the deviation of the sampled one-step return from its conditional mean, and the expectation of any random variable minus its own conditional expectation is zero.

Remark. This holds regardless of the behavior policy used to select \(a_t\), because by the time we condition on \(\mathcal{F}_t\), both \(s_t\) and \(a_t\) are already determined. The only source of randomness is the environment's transition \(s_{t+1} \sim P(\cdot \mid s_t, a_t)\). This is precisely why Q-learning is off-policy: the convergence proof does not depend on how actions are chosen.

Stochastic approximation: the blue trace shows iterates θ_t converging to the fixed point θ* despite noise. The gray dashed line shows the deterministic path (signal only), and the orange line shows a noise-only random walk (martingale). With decaying step sizes, the signal accumulates while the noise averages out.

Comparison with value iteration. Three important differences: (1) only one \((s, a)\) entry is updated per step, so convergence requires every state-action pair to be visited infinitely often; (2) actions must be selected by a behavior policy (e.g., \(\varepsilon\)-greedy) that balances exploration and exploitation; and (3) because Q-learning updates use \(\max_{a'}\) regardless of the behavior policy, it is an off-policy algorithm: the Q-values converge to \(Q^*\) even when the agent explores suboptimally.

Q-learning on the same MDP. Each step updates a single Q(s, a) from a sampled transition under an ε-greedy policy. Compare the noisy, asynchronous convergence with the smooth contraction above.

Notice that the \(\lVert Q - Q^*\rVert_\infty\) curve in the Q-learning figure resembles a step function. This is because Q-learning updates only one \((s, a)\) entry per step, and the infinity norm is determined by the single worst-case pair. When the updated pair is not the one with the maximum error, the metric stays flat; it only jumps when the bottleneck entry itself gets updated. This staircase pattern is a direct consequence of the sparse, asynchronous nature of Q-learning. Contrast it with value iteration above, where every iteration updates all pairs simultaneously, producing a smooth exponential decay.

Function Approximation

With a parameterized Q-function \(Q_\theta\), the tabular update \(Q(s,a) \leftarrow Q(s,a) + \alpha \delta\) becomes:

\[\theta \leftarrow \theta + \alpha \, \delta_t \, \nabla_\theta Q_\theta(s_t, a_t)\]

where

\[\delta_t = r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t)\]

This is called semi-gradient Q-learning because the gradient is taken only through the prediction \(Q_\theta(s_t, a_t)\), not through the bootstrapped target. In the tabular case, the semi-gradient and full gradient produce equivalent updates because each parameter controls exactly one Q-value independently; under function approximation, parameters are shared across state-action pairs, and this equivalence breaks.
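In code, the semi-gradient is implemented by detaching the target from the computation graph before differentiating. A PyTorch-style sketch (the network width and the 4-dimensional state are arbitrary placeholder choices):

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(q_net.parameters(), lr=5e-4)
gamma = 0.9

def semi_gradient_step(s, a, r, s_next, done):
    """One semi-gradient Q-learning update on a single transition."""
    q_pred = q_net(s)[a]                                # Q_theta(s, a)
    with torch.no_grad():                               # freeze the target:
        target = r + (1 - done) * gamma * q_net(s_next).max()
    loss = 0.5 * (target - q_pred) ** 2                 # gradient flows only
    opt.zero_grad()                                     # through q_pred
    loss.backward()
    opt.step()
```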

Semi-gradient vs. full gradient in the tabular case.

A hypothetical "full gradient" would also differentiate through the target:

$$\theta \leftarrow \theta + \alpha \, \delta_t \, \left[\nabla_\theta Q_\theta(s_t, a_t) - \gamma \nabla_\theta \max_{a'} Q_\theta(s_{t+1}, a')\right]$$

In the tabular case, \(\theta \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}\) is the Q-table. Since \(Q_\theta(s, a) = \theta_{(s,a)}\), the gradient \(\nabla_\theta Q_\theta(s, a)\) is a one-hot vector at entry \((s, a)\):

$$\left[\nabla_\theta Q_\theta(s, a)\right]_{(s',a')} = \mathbf{1}[(s',a') = (s,a)]$$

Substituting, the two updates are, component-wise:

$$\text{Semi: } \theta_{(s',a')} \leftarrow \theta_{(s',a')} + \alpha\,\delta_t \cdot \mathbf{1}[(s',a') = (s_t, a_t)]$$ $$\text{Full: } \theta_{(s',a')} \leftarrow \theta_{(s',a')} + \alpha\,\delta_t \cdot \bigl\{\mathbf{1}[(s',a') = (s_t, a_t)] - \gamma\,\mathbf{1}[(s',a') = (s_{t+1}, a^*)]\bigr\}$$

where \(a^* = \arg\max_{a'} Q_\theta(s_{t+1}, a')\). At the prediction entry \((s_t, a_t)\), the full gradient evaluates to \(\alpha\delta_t \cdot \{1 - \gamma \cdot \mathbf{1}[(s_t, a_t) = (s_{t+1}, a^*)]\} = \alpha\delta_t \cdot \{1 - \gamma \cdot 0\} = +\alpha\delta_t\), identical to the semi-gradient update. The \(-\gamma\) term is zero because \((s_t, a_t) \neq (s_{t+1}, a^*)\) for any non-self-loop transition. The full gradient additionally modifies a different entry \((s_{t+1}, a^*)\) by \(-\alpha\gamma\delta_t\), but since tabular parameters are decoupled, this does not interfere.
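A few lines of numpy confirm the equivalence numerically (a toy 2-state, 2-action table; the sampled transition is arbitrary):

```python
import numpy as np

gamma, alpha = 0.9, 0.1
theta = np.arange(4.0).reshape(2, 2)        # tabular "parameters": the Q-table
s, a, r, s_next = 0, 1, 1.0, 1              # one sampled transition
a_star = int(theta[s_next].argmax())
delta = r + gamma * theta[s_next, a_star] - theta[s, a]

semi = np.zeros((2, 2)); semi[s, a] = alpha * delta
full = semi.copy();      full[s_next, a_star] -= alpha * gamma * delta

# At the prediction entry (s, a), the two updates are identical.
print(semi[s, a] == full[s, a])             # True
# The full gradient additionally touches the independent target entry.
print(full[s_next, a_star])                 # -alpha * gamma * delta
```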

Semi-gradient modifies only the prediction entry (blue). Full gradient additionally modifies the target entry (amber) — a different, independent parameter.

Under function approximation, \(\nabla_\theta Q_\theta(s, a)\) is no longer one-hot: updating \(\theta\) to improve \(Q_\theta(s_t, a_t)\) simultaneously changes \(Q_\theta\) at other state-action pairs. The prediction and target gradients now act on the same shared parameters, so ignoring the target gradient introduces genuine bias.

Left (tabular): one-hot gradient, only one Q-value changes. Right (FA): coupled gradient, updating one (s, a) changes many Q-values (yellow).

Does the semi-gradient cause Q-learning's contraction failure under FA? No — the semi-gradient introduces bias in the descent direction, but this is a separate issue from whether the operator \(\mathcal{T}\) contracts. Consider the Bellman error loss we are implicitly minimizing:

$$L(\theta) = \tfrac{1}{2}\bigl(r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a)\bigr)^2 = \tfrac{1}{2}\delta_t^2$$

The true (full) gradient differentiates through both occurrences of \(\theta\):

$$\nabla_\theta L = -\delta_t \cdot \bigl[\nabla_\theta Q_\theta(s, a) - \gamma \nabla_\theta \max_{a'} Q_\theta(s', a')\bigr]$$

The semi-gradient treats the target \(r + \gamma \max_{a'} Q_\theta(s', a')\) as a constant:

$$\nabla_\theta L \approx -\delta_t \cdot \nabla_\theta Q_\theta(s, a)$$

This is a gradient bias: the semi-gradient is not the true gradient of \(L(\theta)\), because it ignores how \(\theta\) affects the target. It is the gradient of a different objective where the target is frozen. Note that this is distinct from the bootstrapping bias of the TD target itself: the target \(r + \gamma \max_{a'} Q_\theta(s', a')\) uses the current estimate \(Q_\theta\) rather than \(Q^*\), so it is biased regardless of whether we use semi-gradient or full gradient. The gradient bias concerns the optimization direction (how we move \(\theta\)); the bootstrapping bias concerns the target value (what we move \(\theta\) toward).

Both methods compute the same (bootstrapped) TD target and apply the same operator \(\mathcal{T}\); they differ only in the descent direction. The contraction failure under FA comes from the projection \(\Pi\) onto the restricted function class, which both methods face equally.

The key question is whether Q-learning remains contractive under function approximation. It does not. As discussed above, the Bellman operator \(\mathcal{T}\) remains a \(\gamma\)-contraction, but \(\mathcal{T}Q_\theta\) generally falls outside the restricted function class, forcing a projection \(\Pi\) onto the nearest representable function. The effective update becomes

\[Q \leftarrow \Pi \mathcal{T} Q\]

and

\[\lVert\Pi \mathcal{T} Q - \Pi \mathcal{T} Q'\rVert \leq \lVert\Pi\rVert \cdot \gamma \lVert Q - Q'\rVert,\]

where \(\lVert\Pi\rVert\) can exceed \(1/\gamma\), destroying the contraction.
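The classic two-state construction often attributed to Tsitsiklis and Van Roy, written here for value prediction rather than Q-learning, makes this failure concrete: approximate the two state values as \(w\) and \(2w\), with a deterministic zero-reward transition from the first state to the second. Updating only from the first state (an off-policy sampling distribution), the expected update is \(w \leftarrow w + \alpha(2\gamma - 1)w\), which diverges for any \(\gamma > 1/2\). A minimal sketch:

```python
# Two states with linear values V(s1) = w, V(s2) = 2*w; transition s1 -> s2,
# reward 0. Updating only from s1 (off-policy sampling) blows w up when gamma > 0.5.
gamma, alpha, w = 0.9, 0.1, 1.0
for k in range(50):
    td_error = 0.0 + gamma * (2 * w) - w     # bootstrapped target uses V(s2) = 2w
    w += alpha * td_error * 1.0              # feature of s1 is 1 (semi-gradient)
    if k % 10 == 0:
        print(f"step {k}: w = {w:.2f}")      # grows without bound
```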

Semi-gradient Q-learning with cosine FA on the same MDP. Q_θ(s,a) = θ₁ + θ₂·cos(π·depth/3) + θ₃·a (3 parameters for 10 Q-values). The error plateaus above the projection floor (dashed), unlike the tabular case which converges to zero.

This figure illustrates two failure modes. First, the projection error floor \(\lVert\Pi Q^* - Q^*\rVert_\infty > 0\): with only 3 parameters for 10 state-action pairs, \(Q^*\) cannot be represented exactly. Second, the actual error \(\lVert Q_\theta - Q^*\rVert_\infty\) stays above even this floor. Semi-gradient Q-learning follows the TD error gradient, which is biased by bootstrapping. With the deadly triad — off-policy learning, function approximation, and bootstrapping — all present, \(\Pi \mathcal{T}\) may fail to be a contraction, causing the iterates to oscillate rather than converge.

Scaling the function class (e.g., using a language model as \(Q_\theta\)) shrinks the projection error but does not resolve the deadly triad: the composed operator \(\Pi\mathcal{T}\) can still fail to contract regardless of model capacity.

In real-world applications — robotics, game playing, large-scale decision making — the state-action space is so vast that no practical function class can represent \(\mathcal{T}Q\) exactly for all \(Q\). The projection \(\Pi\) is never the identity, and there is no guarantee that \(\Pi\mathcal{T}\) remains contractive. We should therefore not assume that Q-learning with function approximation inherits the convergence guarantees of its tabular counterpart.

Cumulative Bootstrapping Error

Q-learning updates the value of state \(s\) using the estimated value of the next state \(s'\). If that estimate carries error \(\epsilon\), the update target inherits \(\gamma \epsilon\). The next state’s estimate, in turn, was built from its successor — so the error chains backward through the MDP. After \(D\) bootstrapping steps, per-state errors accumulate roughly as \(\sum_{k=0}^{D-1} \gamma^k \epsilon_k\). Longer chains mean more terms in this sum, and higher \(\gamma\) means each term decays more slowly.

For the remaining experiments, we switch to a 1-hidden-layer neural network with 64 ReLU units (257 parameters) — enough capacity to represent \(Q^*\) nearly exactly on these small MDPs. This isolates the bootstrapping dynamics from the representational limitations studied above.

Depth

To see this concretely, consider chain MDPs of increasing depth: \(s_0 \to s_1 \to \cdots \to s_{D-1} \to \text{terminal}\), where only the final transition yields reward \(+1\). The shortest path is exactly \(D\) steps, and the agent must bootstrap \(Q^*\) backward through the entire chain. We fix \(\gamma = 0.9\) and use the same NN (64 ReLU units, 257 parameters) for all chains (depth 2, 8, and 32).

Semi-gradient Q-learning on chain MDPs of depth 2, 8, and 32. Setup: 1-hidden-layer NN (64 ReLU units, 257 parameters, α = 0.0005), ε-greedy (ε = 0.3), γ = 0.9, 10k steps, single seed. Deeper chains require more bootstrapping steps to propagate the reward signal, leading to larger and more volatile errors.

To see why deeper chains are harder, consider what happens step by step. In a chain of depth 5, the only grounded value is at the terminal (\(Q = 0\)). On the first Bellman backup, only \(s_4\) gets a correct target — it bootstraps from the terminal. All other states bootstrap from stale estimates (zeros), so their updates are wrong. On the second backup, \(s_3\) can now bootstrap from the corrected \(s_4\), but \(s_0\)–\(s_2\) still use incorrect values. It takes exactly \(D\) full sweeps for the correct signal to propagate from the terminal all the way back to \(s_0\).

Bellman backup iterations on a depth-5 chain MDP (γ = 0.9). Left: Q-values at each iteration, colored by error magnitude (green = accurate, red = large error). Right: absolute error per state. Correct values propagate backward from the terminal at one state per iteration.
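The one-state-per-sweep propagation is easy to reproduce. Here is a sketch of synchronous Bellman backups on the depth-5 chain (single forward action per state, reward \(+1\) only on the final transition, matching the setup above):

```python
import numpy as np

D, gamma = 5, 0.9
Q = np.zeros(D)                              # Q[i]: value of the forward action at s_i
rewards = np.zeros(D); rewards[-1] = 1.0     # reward only on the final transition
q_star = gamma ** np.arange(D - 1, -1, -1)   # Q*(s_i) = gamma^(D-1-i)

for sweep in range(1, D + 1):
    next_vals = np.append(Q[1:], 0.0)        # successor values; terminal grounded at 0
    Q = rewards + gamma * next_vals
    n_correct = int(np.isclose(Q, q_star).sum())
    print(f"sweep {sweep}: Q = {np.round(Q, 3)}, correct entries = {n_correct}")
```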

Per-state bias

We can also look inside a single chain to see where the error lives. The figure below (from Park, 2024) plots the per-state bias against the state’s distance from the terminal. States near the terminal have small bias because bootstrapping stops there — \(Q\) is grounded at zero. States far from the terminal must propagate \(Q^*\) through many bootstrapping steps, and at each step the error in the next state’s estimate feeds into the current state’s target, compounding along the way.

Accumulated bias in a toy chain MDP: per-state Q-value bias vs. distance from the terminal. Source: Park (2024).

Discount factor

This compounding effect also explains why practitioners almost never use large discount factors. The effective horizon is \(1/(1-\gamma)\): for \(\gamma = 0.99\) this is 100 steps, but for \(\gamma = 0.999\) it is 1,000 and for \(\gamma = 0.9999\) it is 10,000. Since each bootstrapping step multiplies the error by \(\gamma\), higher discount factors mean errors decay more slowly — the contraction rate weakens and the bias accumulates more severely across the chain.

Effect of the discount factor on Q-learning error. Setup: chain-64 MDP, 1-hidden-layer NN (64 ReLU units, 257 parameters, α = 0.0005), ε-greedy (ε = 0.3), 60k steps, 5 seeds. As γ → 1, the effective horizon 1/(1−γ) grows and bootstrapping errors compound more severely, even though the NN has sufficient capacity to represent Q*.

Case Study: Digi-Q

The challenges above — projection error under function approximation, bootstrapping instability, and the deadly triad — are not merely theoretical. They arise directly when scaling Q-learning to real-world agentic tasks. Digi-Q (Bai, Zhou, Li, Levine & Kumar, 2025) provides a concrete case study: it trains a VLM-based Q-function for Android device control entirely from offline data, and every design choice in the paper can be read as a response to one of the failure modes discussed in this blog.

The Setup

The agent controls a mobile device by issuing pixel-level actions (clicks, typed text) given a screenshot. Each episode is a task described in natural language (e.g., “Go to walmart.com, search for alienware area 51”). The state \(s_t\) is the interaction history up to the current screenshot; the action \(a_t\) is a string command such as “click (0.8, 0.2)”. Rewards are binary 0/1 per step, provided by an external evaluator that judges whether the task has been completed.

This is a long-horizon, large-state-space, large-action-space problem — exactly the regime where Q-learning struggles. The state space is the set of all possible screenshot histories (effectively infinite), the action space is continuous (pixel coordinates + free-form text), and episodes can span dozens of steps.

Frozen Representations and the Projection Problem

The blog showed that under function approximation, the composed operator \(\Pi\mathcal{T}\) can fail to contract because \(\lVert\Pi\rVert\) may exceed \(1/\gamma\). When the function approximator is a billion-parameter VLM trained end-to-end with TD learning, this instability becomes severe: prior work (Kumar et al., 2021; Chebotar et al., 2023) found that Bellman backups on large models are “hard and unstable to train at scale.”

Digi-Q’s solution is to freeze the VLM backbone and train only a small MLP head on top of the frozen intermediate-layer representations. The Q-function becomes:

\[Q_{\theta_{\text{MLP}}}(s, a) = \text{MLP}_{\theta_{\text{MLP}}}(f_{\theta_{\text{VLM}}}(s, a))\]

where \(f_{\theta_{\text{VLM}}}\) is fixed after an initial fine-tuning phase. This is a deliberate trade-off: by restricting the learnable parameters to 1% of the model, the projection \(\Pi\) onto the representable function class is more constrained (potentially larger projection error floor), but the operator \(\Pi\mathcal{T}\) behaves much more stably because the feature space does not shift during TD updates.

Digi-Q also maintains a separate state-only value function \(V_\psi\) (also an MLP on frozen VLM features) to stabilize training via double critics. The Q- and V-functions are trained with the following TD losses:

\[J_Q(\theta_{\text{MLP}}) = \mathbb{E}_{s,a,r,s' \sim \mathcal{D}}\left[(Q_{\theta_{\text{MLP}}}(f_{\theta_{\text{VLM}}}(s, a)) - r - \gamma \, V_{\bar{\psi}_{\text{MLP}}}(f_{\bar{\psi}_{\text{VLM}}}(s')))^2\right]\] \[J_V(\psi_{\text{MLP}}) = \mathbb{E}_{s \sim \mathcal{D}}\left[\mathbb{E}_{a' \sim \pi_\phi(\cdot \vert s)}\left[(V_{\psi_{\text{MLP}}}(f_{\psi_{\text{VLM}}}(s)) - Q_{\bar{\theta}_{\text{MLP}}}(f_{\bar{\theta}_{\text{VLM}}}(s, a')))^2\right]\right]\]

where \(\bar{\theta}\) and \(\bar{\psi}\) are delayed target networks (exponential moving averages of \(\theta\) and \(\psi\)), following the standard trick from DQN (Mnih et al., 2013) to reduce moving-target instability. The Q-function is trained to predict the one-step bootstrapped return; the V-function is trained to match the expected Q-value under the current policy. This dual-critic setup decouples the policy evaluation (V) from the action ranking (Q), which is important for stable Best-of-N policy extraction.
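A sketch of how these two losses might look in PyTorch. The helper `vlm_features` (the frozen backbone), the tensor shapes, and the EMA coefficient are assumptions about implementation details, not the paper's actual code:

```python
import torch
import torch.nn.functional as F

def td_losses(batch, q_head, v_head, q_target, v_target, vlm_features, gamma=0.95):
    """Digi-Q-style dual-critic TD losses on frozen VLM features (sketch)."""
    s, a, r, s_next, a_pi = batch            # a_pi ~ current policy pi_phi at s
    with torch.no_grad():
        phi_sa   = vlm_features(s, a)        # frozen backbone: no gradients
        phi_s    = vlm_features(s, None)     # state-only features for V
        phi_next = vlm_features(s_next, None)
        phi_spi  = vlm_features(s, a_pi)
        q_tgt = r + gamma * v_target(phi_next).squeeze(-1)   # J_Q target
        v_tgt = q_target(phi_spi).squeeze(-1)                # J_V target
    loss_q = F.mse_loss(q_head(phi_sa).squeeze(-1), q_tgt)
    loss_v = F.mse_loss(v_head(phi_s).squeeze(-1), v_tgt)
    return loss_q, loss_v

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.005):
    """Polyak-averaged target networks, the DQN-style stabilization trick."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.lerp_(p, tau)                    # p_t <- p_t + tau * (p - p_t)
```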

Representation Fine-Tuning: Shrinking the Projection Floor

Using off-the-shelf VLM features directly leads to a large projection error — the VLM’s internal representations are trained for general visual question answering, not for predicting the consequences of actions. Digi-Q found that without fine-tuning, the Q-function degenerates into a state-only value function \(V(s)\) that ignores the action input entirely, because the frozen features lack actionable information.

To address this, Digi-Q runs a preliminary fine-tuning phase with a binary classification objective: given a transition \((s_t, a_t, s_{t+1})\), predict whether the action caused a substantial visual change in the screen. The label is:

\[y_t = \begin{cases} 0, & d(s_t, s_{t+1}) < \epsilon \\ 1, & \text{otherwise} \end{cases}\]

where \(d\) is the \(\ell_2\) image distance. This teaches the VLM to attend to which actions actually change the environment — a property the original pretraining objective never incentivized. After this phase, the VLM parameters are frozen, and the fine-tuned representations serve as the fixed feature extractor for TD learning.
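The labeling rule itself is a one-liner. A sketch, assuming screenshots arrive as float arrays; the threshold value and the per-pixel normalization are our own illustrative choices:

```python
import numpy as np

def action_effect_label(screen_t, screen_next, eps=1e-2):
    """1 if the action visibly changed the screen, else 0 (L2 image distance)."""
    d = np.linalg.norm(screen_t - screen_next) / screen_t.size  # normalized; our choice
    return int(d >= eps)
```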

In the language of our earlier discussion, this fine-tuning step shrinks the projection error \(\lVert\Pi Q^* - Q^*\rVert_\infty\) by reshaping the feature space to better capture action-relevant structure, without incurring the instability of end-to-end TD learning.

TD Learning vs. Monte Carlo

An ablation in the paper directly tests the value of bootstrapping. Replacing TD learning with Monte Carlo (MC) regression — fitting \(Q(s, a)\) to the observed cumulative return rather than bootstrapping from \(Q(s', a')\) — drops performance from 58.0% to 37.5% on the Web Shopping benchmark.

Why does TD outperform MC here, despite the bootstrapping bias we warned about? The answer lies in variance. MC returns in this domain are extremely noisy: a single trajectory either succeeds (reward 1) or fails (reward 0), and the outcome depends on a long chain of stochastic interactions. The MC estimator is unbiased but high-variance, and this variance contaminates the Q-function when used for ranking actions. TD learning introduces bias through bootstrapping, but each update uses a one-step target with much lower variance. For the purpose of ranking actions (which is all the policy extraction needs), a slightly biased but low-variance Q-function is far more useful than an unbiased but noisy one.

This is a concrete illustration of the bias-variance tradeoff at the heart of TD learning: bootstrapping’s bias is the price of manageable variance, and in practice the tradeoff often favors TD.
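A toy simulation illustrates the gap (Bernoulli task success is an assumption standing in for the actual Digi-Q data): the MC target is the raw 0/1 outcome, while the TD target bootstraps from a value estimate whose residual error is small:

```python
import numpy as np

rng = np.random.default_rng(0)
p_success, gamma, n = 0.5, 0.95, 10_000

# MC target: the binary episode outcome observed from this state.
mc_targets = rng.binomial(1, p_success, size=n).astype(float)

# TD target: one-step reward (0 mid-episode) + gamma * V(s'), where V(s')
# is a learned estimate centered on the truth with small residual error.
v_next = p_success + rng.normal(0, 0.05, size=n)
td_targets = 0.0 + gamma * v_next

print("MC target variance:", mc_targets.var())   # ~0.25
print("TD target variance:", td_targets.var())   # ~0.002
```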

Best-of-N Policy Extraction

Given a trained Q-function, the natural next step is to extract a policy. Standard approaches face their own instabilities: REINFORCE-style policy gradients suffer from a “negative gradient” term (the policy gradient multiplies the log-likelihood by the advantage, which can be negative), and advantage-weighted regression (AWR) is too conservative. Digi-Q instead uses a Best-of-N reranking approach:

  1. Sample \(N\) candidate actions from a behavior-cloned base policy \(\pi_\beta\).
  2. Score each action with the learned Q-function.
  3. Train the policy to imitate only the highest-scoring action (provided it has positive advantage).

Formally, the policy loss is:

\[J_\pi(\phi) = \mathbb{E}_{s_t, \, a_i \sim \pi_\beta(\cdot \vert s_t)} \left[\sum_{i=1}^{N} \delta(a_i) \sum_{h=1}^{L} \log \pi_\phi(a_i^h \vert s_t, a_i^{1:h-1})\right]\]

where \(\delta(a_i) = \mathbb{1}\{a_i = \arg\max_j Q(s_t, a_j)\}\) and the imitation term is applied only if the selected action has positive advantage, \(Q(s_t, a_i) - V(s_t) > 0\). This avoids the instability of policy gradients and the conservatism of AWR, while leveraging the Q-function's ability to evaluate multiple actions at the same state — something that policy-based methods, which only see one action per state in the data, cannot do.
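A sketch of the extraction step; `sample_actions`, `q_value`, `v_value`, and `policy.log_prob` are hypothetical stand-ins for the corresponding Digi-Q components:

```python
import torch

def best_of_n_loss(policy, q_value, v_value, sample_actions, s_t, N=16):
    """Imitate the top-Q candidate if its advantage is positive (sketch)."""
    candidates = sample_actions(s_t, N)          # N candidates from pi_beta
    with torch.no_grad():
        q_vals = torch.stack([q_value(s_t, a) for a in candidates])
        best = int(q_vals.argmax())
        advantage = q_vals[best] - v_value(s_t)
    if advantage <= 0:
        return None                              # no candidate beats V(s): skip
    # Token-level log-likelihood of the winning action, summed over tokens,
    # mirroring the inner sum over h in the loss above.
    return -policy.log_prob(s_t, candidates[best]).sum()
```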

Performance scales monotonically with \(N\): using 16 candidate actions outperforms using 1, 4, or 8, confirming that the Q-function provides useful signal for ranking.

The Online Challenge: Distribution Shift

While Digi-Q achieves strong results in the offline setting, extending it to an online self-improvement loop — where the agent collects new data with its improved policy and retrains the Q-function — exposes a fundamental tension.

In the offline setting, the Q-function is trained on a fixed dataset \(\mathcal{D}\) collected by a behavior policy \(\pi_\beta\). The Best-of-N policy extraction produces a new policy \(\pi\) that is better than \(\pi_\beta\) but still close to it (the KL divergence is moderate, around 3.28). The Q-function only needs to be accurate for states and actions visited by \(\pi_\beta\), which is exactly what the offline data covers.

Once we enter the online loop, however, the improved policy \(\pi\) starts visiting new states that \(\pi_\beta\) never reached. The Q-function, trained only on \(\mathcal{D}\), has no reliable estimates for these out-of-distribution states. Bootstrapping from unreliable Q-values at novel states amplifies errors — precisely the compounding bootstrapping bias discussed earlier — and the critic can become arbitrarily wrong in regions of the state space that the online policy now frequents. This is a manifestation of the distribution shift problem: as the policy improves and drifts away from the data-collection policy, the Q-function’s training distribution no longer matches its evaluation distribution.

In practice, we found that this requires careful system design: periodically re-collecting data with the current policy, mixing online and offline data, and robustifying the critic against out-of-distribution inputs. The experiment iteration speed also becomes a bottleneck — each cycle of “collect data → retrain critic → extract policy” involves running the agent on real Android devices, which is orders of magnitude slower than training. These engineering challenges, rather than the core algorithmic ideas, are currently the main barrier to scaling Digi-Q’s online performance.

Takeaways

Digi-Q illustrates how the theoretical failure modes discussed in this blog manifest in practice and how they can be mitigated with careful engineering:

| Challenge | Theory (this blog) | Digi-Q's Response |
| --- | --- | --- |
| Projection error | \(\lVert\Pi\rVert\) can exceed \(1/\gamma\), so \(\Pi\mathcal{T}\) may not contract | Freeze VLM, train only MLP head |
| Large projection floor | Restricted class cannot represent \(Q^*\) | Representation fine-tuning to encode action-relevant features |
| Bootstrapping bias | Compounds over depth \(D\) | Accept bias; TD's lower variance outperforms unbiased MC |
| Policy extraction instability | n/a | Best-of-N reranking avoids negative gradients |
| Online distribution shift | Q-values unreliable at novel states | Open problem; requires periodic data re-collection and critic robustification |

The result: Digi-Q’s offline Q-learning agent outperforms all prior offline methods by 21.2% on Android-in-the-Wild, and nearly matches DigiRL, which requires costly online interaction. This suggests that Q-learning can scale to complex agentic tasks — not by solving the deadly triad in general, but by carefully controlling each of its components.

References