The conventional approach
A standard Q-network takes $(s, a)$ and produces $Q(s,a)$ in a single forward pass.
No way to trade more compute for a better prediction at test time.
Can we scale test-time compute for value functions?
Replace the single pass with numerical integration
Start from noise $z_0 \sim \text{Unif}[l, u]$ and integrate a velocity field $v_\theta$ over $K$ steps.
The final value $z_K$ is the predicted $Q(s, a)$.
More integration steps = more compute = a more accurate Q-value estimate.
Noise $\to$ Q-value through iterative velocity updates
Each step applies the velocity network at a different $(t, z_t)$ — doing different computation each time.
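The procedure above can be sketched as plain Euler integration. This is a minimal sketch, not the source's implementation: the interface `v_theta(t, z, s, a)` and the toy straight-line velocity field (which ignores $(s, a)$ and flows toward a fixed value 0.7) are assumptions for illustration.

```python
import numpy as np

def predict_q(v_theta, s, a, l=-1.0, u=1.0, K=8, rng=None):
    """Predict Q(s, a) by Euler-integrating a velocity field from noise.

    v_theta(t, z, s, a) -> dz/dt is an assumed interface for the network.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    z = rng.uniform(l, u)                  # z_0 ~ Unif[l, u]
    dt = 1.0 / K
    for k in range(K):
        t = k * dt
        z = z + dt * v_theta(t, z, s, a)   # network queried at a new (t, z_t) each step
    return z                               # z_K is the predicted Q(s, a)

# Toy stand-in velocity field: straight-line flow toward the value 0.7,
# ignoring (s, a). A trained v_theta would condition on them.
v_toy = lambda t, z, s, a: (0.7 - z) / (1.0 - t)
print(predict_q(v_toy, s=None, a=None, K=8))  # ≈ 0.7
```

Raising `K` refines the integration at the cost of more forward passes, which is exactly the compute-accuracy dial a single-pass Q-network lacks.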
The flow transforms a distribution, not just a point
As $t$ goes from $0 \to 1$, the distribution narrows from $\text{Unif}[l,u]$ toward $\delta_{Q^\pi(s,a)}$.
Still a scalar prediction — not modeling a high-dimensional distribution!
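The narrowing can be seen numerically by integrating an ensemble of noise samples under a toy straight-line velocity field (an illustrative assumption, not the paper's trained network) and watching the spread collapse toward the target:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
q_star = 0.7                                 # stand-in for Q^pi(s, a)
z = rng.uniform(-1.0, 1.0, size=100_000)     # ensemble of z_0 ~ Unif[l, u]
dt = 1.0 / K
for k in range(K):
    t = k * dt
    z = z + dt * (q_star - z) / (1.0 - t)    # toy straight-line velocity
    print(f"t = {t + dt:.1f}   std = {z.std():.4f}")
# std shrinks linearly in t: the Unif[l, u] spread collapses toward
# a point mass at q_star by t = 1.
```

Every sample lands on the same scalar target, which is the sense in which the "distribution" here degenerates to $\delta_{Q^\pi(s,a)}$ rather than modeling anything high-dimensional.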
Three key differences from conventional generative flows
The flow operates on a scalar $z \in \mathbb{R}$, conditioned on $(s, a)$.
Trained with a TD-learning loss, not maximum likelihood.
Uses $\text{Unif}[l, u]$ instead of $\mathcal{N}(0, I)$ to enable curved paths.
The point is not to model distributions — it's to get iterative test-time compute for Q-values.
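The source names the TD-learning loss but does not spell it out; the sketch below is one plausible way to combine a bootstrapped TD target with conditional flow matching. The function name `td_flow_matching_loss`, the `q_target` bootstrap interface, and the straight-line interpolation path are all my assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def td_flow_matching_loss(v_theta, batch, q_target, gamma=0.99, l=-1.0, u=1.0):
    """Hypothetical TD + flow-matching loss (an assumption, not the source's exact form).

    The velocity network is regressed toward the straight-line displacement
    from noise z_0 to a bootstrapped target y = r + gamma * Q_target(s', a').
    """
    s, a, r, s2, a2 = batch
    y = r + gamma * q_target(s2, a2)          # TD target, treated as the flow endpoint
    z0 = rng.uniform(l, u, size=y.shape)      # noise endpoint z_0 ~ Unif[l, u]
    t = rng.uniform(0.0, 1.0, size=y.shape)   # random time along the path
    zt = (1 - t) * z0 + t * y                 # point on the straight-line path
    v_star = y - z0                           # target velocity along that path
    return np.mean((v_theta(t, zt, s, a) - v_star) ** 2)

# Toy usage with array placeholders; shapes only, no learning.
batch = (np.zeros(4), np.zeros(4), np.ones(4), np.zeros(4), np.zeros(4))
v_hat = lambda t, z, s, a: np.zeros_like(z)   # untrained stand-in network
q_tgt = lambda s, a: np.zeros_like(s)         # stand-in target Q-network
print(td_flow_matching_loss(v_hat, batch, q_tgt))
```

The key difference from maximum-likelihood flow training is that the endpoint $y$ is itself a bootstrapped estimate, so the regression target moves as the target network improves.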