Flow integration as iterative reasoning
Like chain-of-thought in LLMs: each step refines the answer using the same network.
The intermediate $z_t$ values act as "thoughts" — latent states that carry information forward.
The flow is supervised densely at each integration step, not just at the final output, unlike standard Q-networks, which receive only a single loss.
Single loss at the output:
$(Q_\theta(s,a) - y)^2$, giving 1 gradient signal.
Loss at every integration step $t$:
$\|v_\theta(t, z_t) - (y - z_0)\|^2$, where $z_0$ is the flow's initial sample, giving K gradient signals.
Dense supervision means every layer of the unrolled flow gets direct gradient.
This is analogous to SFT supervising at every token, not just the final answer.
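As a hedged toy sketch (not the paper's implementation; `v_theta`, the scalar setup, and the linear interpolant are illustrative assumptions), the dense per-step supervision can be written as:

```python
def per_step_losses(v_theta, y, z0, K):
    """Flow-matching regression along the linear interpolant
    z_t = (1 - t) * z0 + t * y: at each of K steps the velocity
    network is pulled toward the constant target velocity (y - z0),
    yielding K gradient signals instead of the single signal from a
    standard critic's loss (Q(s, a) - y)**2.
    """
    losses = []
    for k in range(K):
        t = k / K
        z_t = (1.0 - t) * z0 + t * y          # point on the interpolant
        losses.append((v_theta(t, z_t) - (y - z0)) ** 2)
    return losses

# Toy scalar example with a hypothetical (constant) velocity net.
v_theta = lambda t, z: 1.8
losses = per_step_losses(v_theta, y=2.0, z0=0.0, K=4)
print(len(losses))  # 4 supervision signals, one per integration step
```

Every entry of `losses` is a separate regression term, which is the "dense supervision" contrast with the single output loss above.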
The key benefit for RL: features that adapt
In TD-learning, targets $y(s,a)$ keep changing as the Q-network improves.
Standard nets: features learned for old targets become stale (loss of plasticity).
floq: the same features support different outputs through different integration paths.
floq features are "plastic" — they can adapt to future targets without retraining.
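A hedged sketch of why drifting TD targets are easy to absorb (names and setup are illustrative assumptions, not the paper's code): the target $y$ enters only through the interpolant and the target velocity, so when $y$ moves, the same machinery generates fresh, consistent supervision at every step rather than forcing the features to be relearned.

```python
def supervision(y, z0, K):
    """(input, target) pairs the velocity net trains on for one TD
    target y: inputs sweep the interpolant z_t = (1 - t)*z0 + t*y,
    targets are the constant velocity (y - z0)."""
    return [((k / K, (1 - k / K) * z0 + (k / K) * y), y - z0)
            for k in range(K)]

old = supervision(y=1.0, z0=0.0, K=4)   # supervision before a target update
new = supervision(y=1.5, z0=0.0, K=4)   # target drifted; same machinery
```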
What does iterative compute enable for RL?
Generalize to unseen state-actions? Yes!
Handle changing targets during training? Yes!
Generalize to more steps at test time? No: K is fixed.
Unlike LLMs where longer CoT generalizes, floq uses a fixed number of steps.
The compute scaling comes from using the right number of steps, not unbounded growth.
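The fixed-K inference can be sketched as plain Euler integration of the learned velocity field (a minimal sketch; `v_theta` and the scalar setup are assumptions, not the paper's code):

```python
def q_value(v_theta, z0, K):
    """Integrate dz/dt = v_theta(t, z) from t = 0 to t = 1 with a
    fixed number of Euler steps K; the endpoint is the Q estimate."""
    z, dt = z0, 1.0 / K
    for k in range(K):
        z = z + dt * v_theta(k * dt, z)
    return z

# With a constant field v(t, z) = 2.0 the integral is exact for any K,
# but a learned, t-dependent field is tuned to the K used in training,
# so unrolling extra steps at test time does not buy generalization.
print(q_value(lambda t, z: 2.0, z0=0.0, K=8))  # 2.0
```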
Multiple forward passes with the same weights, each doing different work
Gradient signal at every integration step, not just final output
Shared features support changing targets through different flow paths
It's not about modeling distributions — it's about better computation for value estimation.
Agrawalla, Nauman, Agrawal, Kumar. floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL. ICLR 2026.
Agrawalla, Nauman, Kumar. What Does Flow-Matching Bring to TD-Learning? arXiv 2026.