The conventional approach
A standard Q-network takes $(s, a)$ and produces $Q(s,a)$ in a single forward pass.
No way to trade more compute for a better prediction at test time.
Can we scale test-time compute for value functions?
Replace the single pass with numerical integration
Start from noise $z_0 \sim \text{Unif}[l, u]$ and integrate a velocity field $v_\theta$ over $K$ steps.
The final value $z_K$ is the predicted $Q(s, a)$.
More integration steps = more compute = a more accurate Q-value estimate.
Noise $\to$ Q-value through iterative velocity updates
Each step applies the velocity network at a different $(t, z_t)$ — doing different computation each time.
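The procedure above can be sketched as plain Euler integration. This is a minimal sketch, not the source's implementation: the interface `v_theta(t, z, s, a)` and the toy straight-line velocity field (which ignores $(s, a)$ and flows toward a fixed value 0.7) are assumptions for illustration.

```python
import numpy as np

def predict_q(v_theta, s, a, l=-1.0, u=1.0, K=8, rng=None):
    """Predict Q(s, a) by Euler-integrating a velocity field from noise.

    v_theta(t, z, s, a) -> dz/dt is an assumed interface for the network.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    z = rng.uniform(l, u)                  # z_0 ~ Unif[l, u]
    dt = 1.0 / K
    for k in range(K):
        t = k * dt
        z = z + dt * v_theta(t, z, s, a)   # network queried at a new (t, z_t) each step
    return z                               # z_K is the predicted Q(s, a)

# Toy stand-in velocity field: straight-line flow toward the value 0.7,
# ignoring (s, a). A trained v_theta would condition on them.
v_toy = lambda t, z, s, a: (0.7 - z) / (1.0 - t)
print(predict_q(v_toy, s=None, a=None, K=8))  # ≈ 0.7
```

Raising `K` refines the integration at the cost of more forward passes, which is exactly the compute-accuracy dial a single-pass Q-network lacks.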
The flow transforms a distribution, not just a point
As $t$ goes from $0 \to 1$, the distribution narrows from $\text{Unif}[l,u]$ toward $\delta_{Q^\pi(s,a)}$.
Still a scalar prediction — not modeling a high-dimensional distribution!
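The narrowing can be seen numerically by integrating an ensemble of noise samples under a toy straight-line velocity field (an illustrative assumption, not the paper's trained network) and watching the spread collapse toward the target:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
q_star = 0.7                                 # stand-in for Q^pi(s, a)
z = rng.uniform(-1.0, 1.0, size=100_000)     # ensemble of z_0 ~ Unif[l, u]
dt = 1.0 / K
for k in range(K):
    t = k * dt
    z = z + dt * (q_star - z) / (1.0 - t)    # toy straight-line velocity
    print(f"t = {t + dt:.1f}   std = {z.std():.4f}")
# std shrinks linearly in t: the Unif[l, u] spread collapses toward
# a point mass at q_star by t = 1.
```

Every sample lands on the same scalar target, which is the sense in which the "distribution" here degenerates to $\delta_{Q^\pi(s,a)}$ rather than modeling anything high-dimensional.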
Three key differences from conventional generative flows
The flow operates on a scalar $z \in \mathbb{R}$, conditioned on $(s, a)$.
Trained with a TD-learning loss, not maximum likelihood.
Uses $\text{Unif}[l, u]$ instead of $\mathcal{N}(0, I)$ to enable curved paths.
The point is not to model distributions — it's to get iterative test-time compute for Q-values.
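The source names the TD-learning loss but does not spell it out; the sketch below is one plausible way to combine a bootstrapped TD target with conditional flow matching. The function name `td_flow_matching_loss`, the `q_target` bootstrap interface, and the straight-line interpolation path are all my assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def td_flow_matching_loss(v_theta, batch, q_target, gamma=0.99, l=-1.0, u=1.0):
    """Hypothetical TD + flow-matching loss (an assumption, not the source's exact form).

    The velocity network is regressed toward the straight-line displacement
    from noise z_0 to a bootstrapped target y = r + gamma * Q_target(s', a').
    """
    s, a, r, s2, a2 = batch
    y = r + gamma * q_target(s2, a2)          # TD target, treated as the flow endpoint
    z0 = rng.uniform(l, u, size=y.shape)      # noise endpoint z_0 ~ Unif[l, u]
    t = rng.uniform(0.0, 1.0, size=y.shape)   # random time along the path
    zt = (1 - t) * z0 + t * y                 # point on the straight-line path
    v_star = y - z0                           # target velocity along that path
    return np.mean((v_theta(t, zt, s, a) - v_star) ** 2)

# Toy usage with array placeholders; shapes only, no learning.
batch = (np.zeros(4), np.zeros(4), np.ones(4), np.zeros(4), np.zeros(4))
v_hat = lambda t, z, s, a: np.zeros_like(z)   # untrained stand-in network
q_tgt = lambda s, a: np.zeros_like(s)         # stand-in target Q-network
print(td_flow_matching_loss(v_hat, batch, q_tgt))
```

The key difference from maximum-likelihood flow training is that the endpoint $y$ is itself a bootstrapped estimate, so the regression target moves as the target network improves.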