The Chain-of-Thought Analogy

Flow integration as iterative reasoning

Like chain-of-thought in LLMs: each step refines the answer using the same network.

The intermediate $z_t$ values act as "thoughts" — latent states that carry information forward.

Supervised densely at each token/step — not just at the final output.
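The refinement loop above can be sketched as plain Euler integration of a learned velocity field. Everything below (the toy `v_theta`, the weight matrix `W`, the step count `K`) is an illustrative placeholder, not floq's actual architecture; the point is that one set of weights is reused at every step, with the pair $(t, z_t)$ making each call do different work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the learned velocity field v_theta(t, z; s, a).
# A single weight matrix is reused at every step, mirroring how one
# network performs all K refinement steps.
W = rng.normal(size=(3,)) * 0.1

def v_theta(t, z, sa):
    # Same parameters W at every call; only the input (t, z) changes,
    # so each step is a new "thought" computed by the same network.
    x = np.array([t, z, sa[0]])
    return float(W @ x)

def integrate(sa, K=8):
    # Start from noise z_0 and refine toward the Q-value estimate.
    z = rng.normal()
    for k in range(K):
        t = k / K
        z = z + (1.0 / K) * v_theta(t, z, sa)  # Euler step
    return z  # z at t=1 is read out as the Q-value estimate

q = integrate(np.array([0.5, -0.2, 0.1]))
```

The intermediate `z` values play the role of the latent "thoughts": they carry partial information about the answer from one step to the next.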

Dense Supervision at Every Step

Unlike standard Q-networks, which receive only a single loss signal at the output.

Standard Q-Net

Single loss at output:

$(Q_\theta(s,a) - y)^2$

1 gradient signal

floq

Loss at every timestep $t$:

$\|v_\theta(t, z_t) - (y - z_0)\|^2$, with interpolant $z_t = (1-t)\,z_0 + t\,y$

K gradient signals

Dense supervision means every layer of the unrolled flow gets direct gradient.

This is analogous to SFT supervising at every token, not just the final answer.
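The contrast between the two loss structures can be sketched as follows. This is a minimal illustration, not the paper's training code: the dummy velocity net, the scalar targets, and the linear interpolant schedule are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def flow_matching_loss(v_theta, y, K=8):
    # One squared-error term per timestep t: K gradient signals
    # through the same network, instead of a single one at the output.
    z0 = rng.normal()                      # noise sample z_0
    losses = []
    for k in range(K):
        t = k / K
        zt = (1 - t) * z0 + t * y          # linear interpolant z_t
        target_velocity = y - z0           # constant along this path
        losses.append((v_theta(t, zt) - target_velocity) ** 2)
    return float(np.mean(losses))

def td_loss(q_pred, y):
    # Standard Q-net: a single squared error at the output.
    return float((q_pred - y) ** 2)

# Dummy velocity net standing in for v_theta.
loss = flow_matching_loss(lambda t, z: 0.5 * z, y=1.0)
```

Because every unrolled step contributes its own term, each "layer" of the integration receives direct gradient, analogous to per-token supervision in SFT.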

Plastic Features for Non-Stationary Targets

The key benefit for RL: features that adapt

In TD-learning, targets $y(s,a)$ keep changing as the Q-network improves.

Standard nets: features learned for old targets become stale (loss of plasticity).

floq: the same features support different outputs through different integration paths.

floq features are "plastic" — they can adapt to future targets without retraining.

Axes of Generalization

What does iterative compute enable for RL?

New (s, a) pairs

Generalize to unseen state-actions

Yes!

Non-stationarity

Handle changing targets during training

Yes!

Longer integration

Generalize to more steps at test time

No — fixed K

Unlike LLMs, where longer chains of thought can generalize at test time, floq uses a fixed number of integration steps.

The compute scaling comes from using the right number of steps, not unbounded growth.

Why Flow-Matching Works for RL

Iterative Compute

Multiple forward passes with the same weights, each doing different work

Dense Supervision

Gradient signal at every integration step, not just final output

Plastic Features

Shared features support changing targets through different flow paths

It's not about modeling distributions — it's about better computation for value estimation.

Agrawalla, Nauman, Agrawal, Kumar. floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL. ICLR 2026.

Agrawalla, Nauman, Kumar. What Does Flow-Matching Bring to TD-Learning? arXiv 2026.