Flow integration as iterative reasoning
Like chain-of-thought in LLMs: each step refines the answer using the same network.
The intermediate $z_t$ values act as "thoughts" — latent states that carry information forward.
The flow is supervised densely at each integration step, not just at the final output, unlike standard Q-networks, which receive only a single loss.
Single loss at the output:
$(Q_\theta(s,a) - y)^2$, giving 1 gradient signal.
Loss at every integration step $t$:
$\|v_\theta(t, z_t) - (y - z_0)\|^2$, where $z_0$ is the flow's initial sample, giving K gradient signals.
Dense supervision means every layer of the unrolled flow gets direct gradient.
This is analogous to SFT supervising at every token, not just the final answer.
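As a hedged toy sketch (not the paper's implementation; `v_theta`, the scalar setup, and the linear interpolant are illustrative assumptions), the dense per-step supervision can be written as:

```python
def per_step_losses(v_theta, y, z0, K):
    """Flow-matching regression along the linear interpolant
    z_t = (1 - t) * z0 + t * y: at each of K steps the velocity
    network is pulled toward the constant target velocity (y - z0),
    yielding K gradient signals instead of the single signal from a
    standard critic's loss (Q(s, a) - y)**2.
    """
    losses = []
    for k in range(K):
        t = k / K
        z_t = (1.0 - t) * z0 + t * y          # point on the interpolant
        losses.append((v_theta(t, z_t) - (y - z0)) ** 2)
    return losses

# Toy scalar example with a hypothetical (constant) velocity net.
v_theta = lambda t, z: 1.8
losses = per_step_losses(v_theta, y=2.0, z0=0.0, K=4)
print(len(losses))  # 4 supervision signals, one per integration step
```

Every entry of `losses` is a separate regression term, which is the "dense supervision" contrast with the single output loss above.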
The key benefit for RL: features that adapt
In TD-learning, targets $y(s,a)$ keep changing as the Q-network improves.
Standard nets: features learned for old targets become stale (loss of plasticity).
floq: the same features support different outputs through different integration paths.
floq features are "plastic" — they can adapt to future targets without retraining.
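A hedged sketch of why drifting TD targets are easy to absorb (names and setup are illustrative assumptions, not the paper's code): the target $y$ enters only through the interpolant and the target velocity, so when $y$ moves, the same machinery generates fresh, consistent supervision at every step rather than forcing the features to be relearned.

```python
def supervision(y, z0, K):
    """(input, target) pairs the velocity net trains on for one TD
    target y: inputs sweep the interpolant z_t = (1 - t)*z0 + t*y,
    targets are the constant velocity (y - z0)."""
    return [((k / K, (1 - k / K) * z0 + (k / K) * y), y - z0)
            for k in range(K)]

old = supervision(y=1.0, z0=0.0, K=4)   # supervision before a target update
new = supervision(y=1.5, z0=0.0, K=4)   # target drifted; same machinery
```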
What does iterative compute enable for RL?
Generalize to unseen state-actions? Yes!
Handle changing targets during training? Yes!
Generalize to more steps at test time? No: K is fixed.
Unlike LLMs where longer CoT generalizes, floq uses a fixed number of steps.
The compute scaling comes from using the right number of steps, not unbounded growth.
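The fixed-K inference can be sketched as plain Euler integration of the learned velocity field (a minimal sketch; `v_theta` and the scalar setup are assumptions, not the paper's code):

```python
def q_value(v_theta, z0, K):
    """Integrate dz/dt = v_theta(t, z) from t = 0 to t = 1 with a
    fixed number of Euler steps K; the endpoint is the Q estimate."""
    z, dt = z0, 1.0 / K
    for k in range(K):
        z = z + dt * v_theta(k * dt, z)
    return z

# With a constant field v(t, z) = 2.0 the integral is exact for any K,
# but a learned, t-dependent field is tuned to the K used in training,
# so unrolling extra steps at test time does not buy generalization.
print(q_value(lambda t, z: 2.0, z0=0.0, K=8))  # 2.0
```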
Multiple forward passes with the same weights, each doing different work
Gradient signal at every integration step, not just final output
Shared features support changing targets through different flow paths
It's not about modeling distributions — it's about better computation for value estimation.
Agrawalla, Nauman, Agrawal, Kumar. floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL. ICLR 2026.
Agrawalla, Nauman, Kumar. What Does Flow-Matching Bring to TD-Learning? arXiv 2026.