Notes on Q-Learning Scalability

Recently, I revisited Seohong Park's blog post, "Q-learning is not yet scalable", and would like to share some notes. The post defines scaling as whether an algorithm can solve harder problems, which usually require a longer horizon, given more data, compute, and time. Task difficulty scales along two dimensions, depth and width, a framing we also found useful in WebGym when trying to define the difficulty of a task. Park argues that the depth (long-horizon) dimension is both more important and harder to solve.

Contraction in TD Learning

For long-horizon tasks, the bias introduced by the bootstrapped TD target compounds across the bootstrapping chain. For each state-action pair, the bootstrapped target is

\[y(s, a) = r(s, a) + \gamma \max_{a'} Q_{\bar{\theta}}(s', a')\]

where \(\bar{\theta}\) denotes the parameters prior to the current update. This target is biased from the true optimal value, defined as the expected discounted return under the optimal policy:

\[Q^*(s, a) = \mathbb{E}_{\pi^*}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]\]

Tabular Case

Under a tabular setting, this bias can be systematically reduced iteration by iteration.

Define the Bellman optimality operator \(\mathcal{T}\) as:

\[(\mathcal{T}Q)(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[r(s, a) + \gamma \max_{a'} Q(s', a')\right]\]

Step 1: \(\mathcal{T}\) is a \(\gamma\)-contraction. For any two Q-functions \(Q\) and \(Q'\):

\[\begin{aligned} |(\mathcal{T}Q)(s,a) - (\mathcal{T}Q')(s,a)| &= \gamma \left|\mathbb{E}_{s'}\left[\max_{a'} Q(s', a') - \max_{a'} Q'(s', a')\right]\right| \\ &\leq \gamma \, \mathbb{E}_{s'}\left[\max_{a'} |Q(s', a') - Q'(s', a')|\right] \\ &\leq \gamma \|Q - Q'\|_\infty \end{aligned}\]

where the first inequality uses \(\vert\max_a f(a) - \max_a g(a)\vert \leq \max_a \vert f(a) - g(a)\vert\). Taking the supremum over \((s, a)\) on the left gives

\[\lVert\mathcal{T}Q - \mathcal{T}Q'\rVert_\infty \leq \gamma \lVert Q - Q'\rVert_\infty\]

By the Banach fixed point theorem, \(\mathcal{T}\) has a unique fixed point \(Q^*\) and repeated application converges:

\[\lVert\mathcal{T}^k Q - Q^*\rVert_\infty \leq \gamma^k \lVert Q - Q^*\rVert_\infty \to 0\]
Interactive value iteration on a branching MDP. Drag the sliders to step through Bellman operator iterations and per-state updates, or press Play to animate.
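The contraction bound can be checked numerically. The sketch below runs value iteration on a made-up two-state, two-action MDP (the transition matrix and rewards are arbitrary, chosen only for illustration) and verifies \(\lVert\mathcal{T}^k Q - Q^*\rVert_\infty \leq \gamma^k \lVert Q - Q^*\rVert_\infty\) at every iteration:

```python
import numpy as np

# Toy MDP (made up for illustration): 2 states, 2 actions.
# P[s, a] is a distribution over next states; R[s, a] is the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def bellman(Q):
    # (T Q)(s, a) = R(s, a) + gamma * E_{s'}[max_{a'} Q(s', a')]
    return R + gamma * P @ Q.max(axis=1)

# Find Q* (to numerical precision) by iterating T many times.
Q_star = np.zeros((2, 2))
for _ in range(2000):
    Q_star = bellman(Q_star)

# Verify the geometric bound ||T^k Q - Q*||_inf <= gamma^k ||Q - Q*||_inf.
Q = np.random.default_rng(0).normal(size=(2, 2))
err0 = np.abs(Q - Q_star).max()
for k in range(1, 21):
    Q = bellman(Q)
    assert np.abs(Q - Q_star).max() <= gamma**k * err0 + 1e-9
print("contraction bound holds for 20 iterations")
```

Because the bound is geometric, each iteration is guaranteed to shrink the worst-case error by at least a factor of \(\gamma\), regardless of the initialization.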

Step 2: Q-learning as stochastic approximation. Value iteration requires the full transition model \(P(s' \mid s, a)\) to compute the expectation in \(\mathcal{T}\). Q-learning (Watkins, 1989) removes this requirement by learning directly from experience. The agent maintains a Q-table \(Q(s, a)\) for every state-action pair, initialized arbitrarily. At each time step, the agent:

  1. Observes the current state \(s_t\).
  2. Selects an action \(a_t\) using a behavior policy (typically \(\varepsilon\)-greedy: with probability \(\varepsilon\) pick a random action, otherwise pick \(\arg\max_a Q(s_t, a)\)).
  3. Executes \(a_t\), observes reward \(r_t\) and next state \(s_{t+1}\).
  4. Updates the Q-value for the visited pair \((s_t, a_t)\):
\[Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \underbrace{\left[r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)\right]}_{\text{TD error } \delta_t}\]

All other entries remain unchanged: \(Q_{t+1}(s, a) = Q_t(s, a)\) for \((s, a) \neq (s_t, a_t)\). The TD error \(\delta_t\) measures how far the current estimate \(Q_t(s_t, a_t)\) is from the one-step bootstrapped target \(r_t + \gamma \max_{a'} Q_t(s_{t+1}, a')\). The learning rate \(\alpha_t \in (0, 1]\) controls the step size: larger \(\alpha\) accelerates learning but increases variance, while smaller \(\alpha\) stabilizes convergence at the cost of speed.
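The four-step loop above can be sketched in a few lines of Python. The chain environment here is a hypothetical stand-in for any small MDP: a deterministic depth-\(D\) chain where action 0 moves right and the final transition pays \(+1\), so \(Q^*(s, \text{right}) = \gamma^{D-1-s}\):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_actions, gamma = 5, 2, 0.9   # a depth-5 chain, for illustration

def step(s, a):
    # Action 0 moves right; action 1 stays. Reward +1 on the final transition.
    s_next = min(s + 1, D) if a == 0 else s
    r = 1.0 if (a == 0 and s == D - 1) else 0.0
    done = s_next == D
    return s_next, r, done

Q = np.zeros((D + 1, n_actions))  # Q-table, initialized to zero
eps, alpha = 0.3, 0.1
s = 0
for t in range(20000):
    # epsilon-greedy behavior policy
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next, r, done = step(s, a)
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])   # TD update on the visited pair only
    s = 0 if done else s_next

# Q*(s, right) = gamma^(D-1-s) along the chain
print(np.round(Q[:D, 0], 3))
```

Since the environment is deterministic, the TD error vanishes at the fixed point and the table converges to \(Q^*\) even with a constant step size.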

Why does this converge? The key insight is that the sampled TD target is an unbiased estimate of the Bellman operator applied to the current Q-function:

\[\mathbb{E}[r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') \mid s_t = s, a_t = a, Q_t] = (\mathcal{T}Q_t)(s, a)\]

This means each Q-learning update is, in expectation, a step of value iteration on the visited \((s, a)\) pair. The update decomposes into a contraction step plus zero-mean noise:

\[Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t \left[\underbrace{(\mathcal{T}Q_t)(s, a) - Q_t(s, a)}_{\text{signal: contracts toward } Q^*} + \underbrace{\varepsilon_t}_{\text{noise}}\right]\]

where

\[\varepsilon_t = \delta_t - \mathbb{E}[\delta_t \mid \mathcal{F}_t]\]

is a martingale difference sequence satisfying \(\mathbb{E}[\varepsilon_t \mid \mathcal{F}_t] = 0\). Under the Robbins-Monro conditions (\(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\)) and sufficient state-action coverage, the contraction signal dominates the noise and \(Q_t \to Q^*\) almost surely.

Martingale differences and stochastic approximation.

A martingale is a stochastic process whose conditional expectation of the next value, given all past information, equals the current value. Intuitively, it is a "fair game" with no drift. A sequence \(\{\varepsilon_t\}\) is a martingale difference if \(\mathbb{E}[\varepsilon_t \mid \mathcal{F}_t] = 0\), where \(\mathcal{F}_t = \sigma(s_0, a_0, r_0, \ldots, s_t, a_t, Q_t)\) is the filtration (the complete history up to step \(t\)). This means: conditioned on everything the agent has seen so far, the noise has zero expected value, i.e., it is equally likely to push the estimate up or down. Crucially, this does not require the noise to be independent across time steps; it only requires zero conditional mean.

This structure is what makes Q-learning a stochastic approximation algorithm. The general Robbins-Monro framework studies iterations of the form \(\theta_{t+1} = \theta_t + \alpha_t [h(\theta_t) + \varepsilon_t]\), where \(h(\theta^*) = 0\) defines the target fixed point. In Q-learning, \(h(Q) = \mathcal{T}Q - Q\) and \(\theta^* = Q^*\). Convergence requires:

  • \(\sum_t \alpha_t = \infty\): the step sizes must be large enough in total so the iterates can travel from any initialization to \(Q^*\). If \(\sum \alpha_t < \infty\), the iterates may stall before reaching the fixed point.
  • \(\sum_t \alpha_t^2 < \infty\): the step sizes must shrink fast enough so that the accumulated noise variance \(\sum \alpha_t^2 \text{Var}(\varepsilon_t)\) remains bounded. This ensures the noise averages out rather than accumulating.

A common choice satisfying both conditions is \(\alpha_t = 1/(t+1)\).
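As a toy illustration of these conditions, the Robbins-Monro iteration below estimates the mean of a noisy signal with \(\alpha_t = 1/(t+1)\); here \(h(\theta) = \mu - \theta\), the fixed point is \(\theta^* = \mu\), and the noise distribution is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0                      # the fixed point theta* we want to find
theta = 0.0
for t in range(200000):
    x = mu + rng.normal()     # sample of h(theta) + theta + noise
    alpha = 1.0 / (t + 1)     # sum(alpha) = inf, sum(alpha^2) < inf
    theta += alpha * (x - theta)

print(abs(theta - mu))        # small: the noise has averaged out
```

With this step-size schedule, \(\theta_T\) is exactly the sample mean of the first \(T\) draws, so the error shrinks like \(1/\sqrt{T}\): the signal accumulates while the zero-mean noise cancels.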

Proof that \(\varepsilon_t\) is a martingale difference. We show \(\mathbb{E}[\varepsilon_t \mid \mathcal{F}_t] = 0\).

Step 1: Compute \(\mathbb{E}[\delta_t \mid \mathcal{F}_t]\). The TD error is:

$$\delta_t = r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$$

Since \(s_t, a_t, Q_t \in \mathcal{F}_t\), the term \(Q_t(s_t, a_t)\) is a known constant and comes out of the expectation. The only randomness is over the transition \(s_{t+1} \sim P(\cdot \mid s_t, a_t)\):

$$\mathbb{E}[\delta_t \mid \mathcal{F}_t] = \underbrace{\mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[r(s_t, a_t) + \gamma \max_{a'} Q_t(s_{t+1}, a')\right]}_{= \, (\mathcal{T}Q_t)(s_t, a_t) \text{ by definition of } \mathcal{T}} - \; Q_t(s_t, a_t)$$

The first term is exactly the Bellman operator applied to \(Q_t\), because \(\mathcal{T}\) is the expectation of the one-step bootstrapped return. Therefore:

$$\mathbb{E}[\delta_t \mid \mathcal{F}_t] = (\mathcal{T}Q_t)(s_t, a_t) - Q_t(s_t, a_t)$$

Step 2: Show \(\mathbb{E}[\varepsilon_t \mid \mathcal{F}_t] = 0\). By definition, \(\varepsilon_t = \delta_t - \mathbb{E}[\delta_t \mid \mathcal{F}_t]\). Taking the conditional expectation:

$$\mathbb{E}[\varepsilon_t \mid \mathcal{F}_t] = \mathbb{E}[\delta_t \mid \mathcal{F}_t] - \mathbb{E}[\delta_t \mid \mathcal{F}_t] = 0 \quad \square$$

Equivalently, \(\varepsilon_t = [r_t + \gamma \max_{a'} Q_t(s_{t+1}, a')] - (\mathcal{T}Q_t)(s_t, a_t)\) is the deviation of the sampled one-step return from its conditional mean, and the expectation of any random variable minus its own conditional expectation is zero.

Remark. This holds regardless of the behavior policy used to select \(a_t\), because by the time we condition on \(\mathcal{F}_t\), both \(s_t\) and \(a_t\) are already determined. The only source of randomness is the environment's transition \(s_{t+1} \sim P(\cdot \mid s_t, a_t)\). This is precisely why Q-learning is off-policy: the convergence proof does not depend on how actions are chosen.

Stochastic approximation: the blue trace shows iterates θ_t converging to the fixed point θ* despite noise. The gray dashed line shows the deterministic path (signal only), and the orange line shows a noise-only random walk (martingale). With decaying step sizes, the signal accumulates while the noise averages out.

Comparison with value iteration. Three important differences: (1) only one \((s, a)\) entry is updated per step, so convergence requires every state-action pair to be visited infinitely often; (2) actions must be selected by a behavior policy (e.g., \(\varepsilon\)-greedy) that balances exploration and exploitation; and (3) because the update uses \(\max_{a'}\) regardless of the behavior policy, Q-learning is off-policy: the Q-values converge to \(Q^*\) even when the agent explores suboptimally.

Q-learning on the same MDP. Each step updates a single Q(s, a) from a sampled transition under an ε-greedy policy. Compare the noisy, asynchronous convergence with the smooth contraction above.

Notice that the \(\lVert Q - Q^*\rVert_\infty\) curve in the Q-learning figure resembles a step function. This is because Q-learning updates only one \((s, a)\) entry per step, and the infinity norm is determined by the single worst-case pair. When the updated pair is not the one with the maximum error, the metric stays flat; it only jumps when the bottleneck entry itself gets updated. This staircase pattern is a direct consequence of the sparse, asynchronous nature of Q-learning. Contrast it with value iteration above, where every iteration updates all pairs simultaneously, producing a smooth exponential decay.

Function Approximation

With a parameterized Q-function \(Q_\theta\), the tabular update \(Q(s,a) \leftarrow Q(s,a) + \alpha \delta\) becomes:

\[\theta \leftarrow \theta + \alpha \, \delta_t \, \nabla_\theta Q_\theta(s_t, a_t)\]

where

\[\delta_t = r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t)\]

This is called semi-gradient Q-learning because the gradient is taken only through the prediction \(Q_\theta(s_t, a_t)\), not through the bootstrapped target. In the tabular case, the semi-gradient and full gradient produce equivalent updates because each parameter controls exactly one Q-value independently; under function approximation, parameters are shared across state-action pairs, and this equivalence breaks.
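A minimal sketch of one semi-gradient update with a linear \(Q_\theta(s,a) = \theta^\top \phi(s,a)\); the random features and the transition below are illustrative choices, not part of any particular benchmark. The key line is that the bootstrapped target is computed with the same \(\theta\) but treated as a constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d, gamma, alpha = 6, 2, 4, 0.9, 0.01

# Arbitrary fixed features phi(s, a) in R^d (an illustrative choice).
phi = rng.normal(size=(n_states, n_actions, d))
theta = np.zeros(d)

def Q(s, a):
    return phi[s, a] @ theta

def semi_gradient_update(s, a, r, s_next, done):
    # Bootstrapped target: no gradient flows through it.
    target = r if done else r + gamma * max(Q(s_next, b) for b in range(n_actions))
    delta = target - Q(s, a)                  # TD error
    return theta + alpha * delta * phi[s, a]  # grad_theta Q_theta(s, a) = phi(s, a)

theta = semi_gradient_update(s=0, a=1, r=1.0, s_next=3, done=False)
```

Because \(\phi(s, a)\) is dense rather than one-hot, this single update moves \(Q_\theta\) at every state-action pair sharing those parameters, which is exactly where the tabular equivalence breaks.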

Semi-gradient vs. full gradient in the tabular case.

A hypothetical "full gradient" would also differentiate through the target:

$$\theta \leftarrow \theta + \alpha \, \delta_t \, \left[\nabla_\theta Q_\theta(s_t, a_t) - \gamma \nabla_\theta \max_{a'} Q_\theta(s_{t+1}, a')\right]$$

In the tabular case, \(\theta \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}\) is the Q-table. Since \(Q_\theta(s, a) = \theta_{(s,a)}\), the gradient \(\nabla_\theta Q_\theta(s, a)\) is a one-hot vector at entry \((s, a)\):

$$\left[\nabla_\theta Q_\theta(s, a)\right]_{(s',a')} = \mathbf{1}[(s',a') = (s,a)]$$

Substituting, the two updates are, component-wise:

$$\text{Semi: } \theta_{(s',a')} \leftarrow \theta_{(s',a')} + \alpha\,\delta_t \cdot \mathbf{1}[(s',a') = (s_t, a_t)]$$ $$\text{Full: } \theta_{(s',a')} \leftarrow \theta_{(s',a')} + \alpha\,\delta_t \cdot \bigl\{\mathbf{1}[(s',a') = (s_t, a_t)] - \gamma\,\mathbf{1}[(s',a') = (s_{t+1}, a^*)]\bigr\}$$

where \(a^* = \arg\max_{a'} Q_\theta(s_{t+1}, a')\). At the prediction entry \((s_t, a_t)\), the full gradient evaluates to \(\alpha\delta_t \cdot \{1 - \gamma \cdot \mathbf{1}[(s_t, a_t) = (s_{t+1}, a^*)]\} = \alpha\delta_t \cdot \{1 - \gamma \cdot 0\} = +\alpha\delta_t\), identical to the semi-gradient update. The \(-\gamma\) term is zero because \((s_t, a_t) \neq (s_{t+1}, a^*)\) for any non-self-loop transition. The full gradient additionally modifies a different entry \((s_{t+1}, a^*)\) by \(-\alpha\gamma\delta_t\), but since tabular parameters are decoupled, this does not interfere.
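This equivalence at the prediction entry is easy to check numerically; the Q-table and the transition below are made-up values for illustration:

```python
import numpy as np

n_states, n_actions, gamma, alpha = 4, 2, 0.9, 0.1
theta = np.arange(n_states * n_actions, dtype=float).reshape(n_states, n_actions)

s, a, r, s_next = 0, 1, 1.0, 2                  # an arbitrary non-self-loop transition
a_star = int(theta[s_next].argmax())
delta = r + gamma * theta[s_next, a_star] - theta[s, a]

semi = theta.copy()
semi[s, a] += alpha * delta                     # one-hot update at (s, a)

full = theta.copy()
full[s, a] += alpha * delta                     # identical prediction-entry update...
full[s_next, a_star] -= alpha * gamma * delta   # ...plus a push on the target entry

# Both rules update the prediction entry identically.
assert semi[s, a] == full[s, a]
# They differ only at the decoupled target entry (s_next, a_star).
assert semi[s_next, a_star] != full[s_next, a_star]
```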

Click any (s, a) to compare. Semi-gradient modifies only the prediction entry (blue). Full gradient additionally modifies the target entry (amber) — a different, independent parameter.

Under function approximation, \(\nabla_\theta Q_\theta(s, a)\) is no longer one-hot: updating \(\theta\) to improve \(Q_\theta(s_t, a_t)\) simultaneously changes \(Q_\theta\) at other state-action pairs. The prediction and target gradients now act on the same shared parameters, so ignoring the target gradient introduces genuine bias.

Left (tabular): one-hot gradient, only one Q-value changes. Right (FA): coupled gradient, updating one (s, a) changes many Q-values (yellow).

Does the semi-gradient cause Q-learning's contraction failure under FA? No — the semi-gradient introduces bias in the descent direction, but this is a separate issue from whether the operator \(\mathcal{T}\) contracts. Consider the Bellman error loss we are implicitly minimizing:

$$L(\theta) = \tfrac{1}{2}\bigl(r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a)\bigr)^2 = \tfrac{1}{2}\delta_t^2$$

The true (full) gradient differentiates through both occurrences of \(\theta\):

$$\nabla_\theta L = -\delta_t \cdot \bigl[\nabla_\theta Q_\theta(s, a) - \gamma \nabla_\theta \max_{a'} Q_\theta(s', a')\bigr]$$

The semi-gradient treats the target \(r + \gamma \max_{a'} Q_\theta(s', a')\) as a constant:

$$\nabla_\theta L \approx -\delta_t \cdot \nabla_\theta Q_\theta(s, a)$$

This is a gradient bias: the semi-gradient is not the true gradient of \(L(\theta)\), because it ignores how \(\theta\) affects the target. It is the gradient of a different objective where the target is frozen.

Note that this is distinct from the bootstrapping bias of the TD target itself: the target \(r + \gamma \max_{a'} Q_\theta(s', a')\) uses the current estimate \(Q_\theta\) rather than \(Q^*\), so it is biased regardless of whether we use semi-gradient or full gradient. The gradient bias concerns the optimization direction (how we move \(\theta\)); the bootstrapping bias concerns the target value (what we move \(\theta\) toward). Both methods compute the same (bootstrapped) TD target and apply the same operator \(\mathcal{T}\); they differ only in the descent direction. The contraction failure under FA comes from the projection \(\Pi\) onto the restricted function class, which both methods face equally.


The key question is whether Q-learning remains contractive under function approximation. It does not. As discussed above, the Bellman operator \(\mathcal{T}\) remains a \(\gamma\)-contraction, but \(\mathcal{T}Q_\theta\) generally falls outside the restricted function class, forcing a projection \(\Pi\) onto the nearest representable function. The effective update becomes

\[Q \leftarrow \Pi \mathcal{T} Q\]

and

\[\lVert\Pi \mathcal{T} Q - \Pi \mathcal{T} Q'\rVert \leq \lVert\Pi\rVert \cdot \gamma \lVert Q - Q'\rVert,\]

where \(\lVert\Pi\rVert\) can exceed \(1/\gamma\), destroying the contraction.
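This failure is not hypothetical. The classic two-state counterexample of Tsitsiklis and Van Roy (sketched here for TD value estimation rather than Q-learning, for simplicity) uses a linear \(V_w\) with features 1 and 2, a zero-reward deterministic transition from state 1 to state 2, and an off-policy sampling distribution that only ever updates state 1:

```python
# Classic off-policy divergence example: V_w(1) = w, V_w(2) = 2w,
# deterministic transition 1 -> 2, reward 0, semi-gradient TD updates
# applied only at state 1 (an off-policy state distribution).
gamma, alpha = 0.9, 0.1
w = 1.0
for t in range(100):
    delta = 0.0 + gamma * 2 * w - w   # TD error at state 1: (2*gamma - 1) * w
    w += alpha * delta * 1.0          # feature of state 1 is 1
    # Each step multiplies w by 1 + alpha*(2*gamma - 1) = 1.08 > 1.

print(w)  # grows without bound whenever gamma > 1/2
```

Each update multiplies \(w\) by \(1 + \alpha(2\gamma - 1)\), so for any \(\gamma > 1/2\) the iterates diverge geometrically even though the true values are zero: the projected operator expands instead of contracting.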

Semi-gradient Q-learning with cosine FA on the same MDP. Q_θ(s,a) = θ₁ + θ₂·cos(π·depth/3) + θ₃·a (3 parameters for 10 Q-values). The error plateaus above the projection floor (dashed), unlike the tabular case which converges to zero.

This figure illustrates two failure modes. First, the projection error floor \(\lVert\Pi Q^* - Q^*\rVert_\infty > 0\): with only 3 parameters for 10 state-action pairs, \(Q^*\) cannot be represented exactly. Second, the actual error \(\lVert Q_\theta - Q^*\rVert_\infty\) stays above even this floor. Semi-gradient Q-learning follows the TD error gradient, which is biased by bootstrapping. With the deadly triad — off-policy learning, function approximation, and bootstrapping — all present, \(\Pi \mathcal{T}\) may fail to be a contraction, causing the iterates to oscillate rather than converge.

Scaling the function class (e.g., using a language model as \(Q_\theta\)) shrinks the projection error but does not resolve the deadly triad: the composed operator \(\Pi\mathcal{T}\) can still fail to contract regardless of model capacity.

In real-world applications — robotics, game playing, large-scale decision making — the state-action space is so vast that no practical function class can represent \(\mathcal{T}Q\) exactly for all \(Q\). The projection \(\Pi\) is never the identity, and there is no guarantee that \(\Pi\mathcal{T}\) remains contractive. We should therefore not assume that Q-learning with function approximation inherits the convergence guarantees of its tabular counterpart.

Cumulative Bootstrapping Error

Q-learning updates the value of state \(s\) using the estimated value of the next state \(s'\). If that estimate carries error \(\epsilon\), the update target inherits \(\gamma \epsilon\). The next state’s estimate, in turn, was built from its successor — so the error chains backward through the MDP. After \(D\) bootstrapping steps, per-state errors accumulate roughly as \(\sum_{k=0}^{D-1} \gamma^k \epsilon_k\). Longer chains mean more terms in this sum, and higher \(\gamma\) means each term decays more slowly.

For the remaining experiments, we switch to a 1-hidden-layer neural network with 64 ReLU units (257 parameters) — enough capacity to represent \(Q^*\) nearly exactly on these small MDPs. This isolates the bootstrapping dynamics from the representational limitations studied above.

Depth

To see this concretely, consider chain MDPs of increasing depth: \(s_0 \to s_1 \to \cdots \to s_{D-1} \to \text{terminal}\), where only the final transition yields reward \(+1\). The shortest path is exactly \(D\) steps, and the agent must bootstrap \(Q^*\) backward through the entire chain. We fix \(\gamma = 0.9\) and use the same NN (64 ReLU units, 257 parameters) for all chains (depth 2, 8, and 32).

Semi-gradient Q-learning on chain MDPs of depth 2, 8, and 32. Setup: 1-hidden-layer NN (64 ReLU units, 257 parameters, α = 0.0005), ε-greedy (ε = 0.3), γ = 0.9, 10k steps, single seed. Deeper chains require more bootstrapping steps to propagate the reward signal, leading to larger and more volatile errors.

To see why deeper chains are harder, consider what happens step by step. In a chain of depth 5, the only grounded value is at the terminal (\(Q = 0\)). On the first Bellman backup, only \(s_4\) gets a correct target — it bootstraps from the terminal. All other states bootstrap from stale estimates (zeros), so their updates are wrong. On the second backup, \(s_3\) can now bootstrap from the corrected \(s_4\), but \(s_0\)–\(s_2\) still use incorrect values. It takes exactly \(D\) full sweeps for the correct signal to propagate from the terminal all the way back to \(s_0\).
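This sweep-by-sweep propagation can be verified with synchronous Bellman backups on the deterministic depth-5 chain (a toy setup where each state has a single move-right action, so \(Q\) reduces to \(V\)):

```python
import numpy as np

D, gamma = 5, 0.9
V = np.zeros(D + 1)                     # V[D] = 0 is the grounded terminal value
r = np.zeros(D)
r[D - 1] = 1.0                          # reward +1 only on the final transition

for sweep in range(1, D + 1):
    V[:D] = r + gamma * V[1:]           # one synchronous Bellman backup
    reached = int((V[:D] > 0).sum())    # states the reward signal has reached
    print(f"sweep {sweep}: signal has reached {reached} of {D} states")
# After exactly D sweeps, V[0] = gamma**(D-1) = 0.6561.
```

The signal front advances one state per sweep, so the initial state's value stays at zero until sweep \(D\), exactly as described above.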

Bellman backup iterations on a depth-5 chain MDP (γ = 0.9). Left: Q-values at each iteration, colored by error magnitude (green = accurate, red = large error). Right: absolute error per state. Correct values propagate backward from the terminal at one state per iteration.

Per-state bias

We can also look inside a single chain to see where the error lives. The figure below (from Park, 2024) plots the per-state bias against the state’s distance from the terminal. States near the terminal have small bias because bootstrapping stops there — \(Q\) is grounded at zero. States far from the terminal must propagate \(Q^*\) through many bootstrapping steps, and at each step the error in the next state’s estimate feeds into the current state’s target, compounding along the way.

Accumulated bias in a toy chain MDP
Per-state Q-value bias vs. distance from the terminal in a toy chain MDP. Source: Park (2024).

Discount factor

This compounding effect also explains why practitioners almost never use large discount factors. The effective horizon is \(1/(1-\gamma)\): for \(\gamma = 0.99\) this is 100 steps, but for \(\gamma = 0.999\) it is 1,000 and for \(\gamma = 0.9999\) it is 10,000. Since each bootstrapping step multiplies the error by \(\gamma\), higher discount factors mean errors decay more slowly — the contraction rate weakens and the bias accumulates more severely across the chain.
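To put rough numbers on this, the count of backups needed for a unit error to decay below 1% grows with the effective horizon; a quick computation:

```python
import math

for gamma in (0.9, 0.99, 0.999):
    # smallest k with gamma**k < 0.01
    k = math.ceil(math.log(0.01) / math.log(gamma))
    horizon = 1 / (1 - gamma)
    print(f"gamma={gamma}: effective horizon {horizon:.0f}, "
          f"{k} backups for a unit error to fall below 0.01")
```

Going from \(\gamma = 0.9\) to \(\gamma = 0.999\) inflates the required number of backups by roughly two orders of magnitude (44 versus 4603), mirroring the growth of the effective horizon.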

Effect of the discount factor on Q-learning error. Setup: chain-64 MDP, 1-hidden-layer NN (64 ReLU units, 257 parameters, α = 0.0005), ε-greedy (ε = 0.3), 60k steps, 5 seeds. As γ → 1, the effective horizon 1/(1−γ) grows and bootstrapping errors compound more severely, even though the NN has sufficient capacity to represent Q*.

References