What Does Flow-Matching Bring to Deep RL?
Training a good value function is one of the biggest challenges in deep RL. Two problems keep showing up: scalability (does more compute lead to better performance?) and plasticity (does the network learn features that remain useful as targets shift?). A recent line of work proposes an unexpected solution: replace the standard Q-network with a flow-matching Q-function. The surprising twist? It has nothing to do with modeling high-dimensional distributions — the key contribution is iterative test-time compute for value estimation.
This post covers two papers from Agrawalla, Nauman, Agrawal, and Kumar:
- floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL (ICLR 2026)
- What Does Flow-Matching Bring to TD-Learning? (arXiv 2026)
Background: Why Value Functions Are Hard
The standard approach to learning Q-values is temporal difference (TD) learning. Given a dataset of transitions \(\{s_i, a_i, r(s_i, a_i), s_{i+1}\}\), we train a Q-network by minimizing:
\[\min_\theta \; \mathbb{E}_{\text{data}}\left[(Q_\theta(s_i, a_i) - y(s_i, a_i))^2\right]\]where the target is \(y(s_i, a_i) = r(s_i, a_i) + \gamma Q_{\theta^{\text{old}}}(s_{i+1}, a_{i+1})\).
This procedure has two fundamental issues:
- Non-stationary targets: The targets \(y(s_i, a_i)\) depend on the Q-network itself (through \(Q_{\theta^{\text{old}}}\)), so they keep changing as training progresses.
- Imperfect optimization: We only take a few gradient steps on each set of frozen targets before refreshing them, so the loss is never fully minimized.
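Both issues are visible in even the smallest possible sketch of a TD update. The toy linear Q-function below is purely illustrative (the features, learning rate, and transition values are made up, not from any paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a linear Q-function over hand-made state-action
# features. All names (phi, theta, gamma) are illustrative.
def phi(s, a):
    """Toy state-action feature vector."""
    return np.array([s, a, s * a, 1.0])

def Q(params, s, a):
    return params @ phi(s, a)

theta = rng.normal(size=4)   # online parameters
theta_old = theta.copy()     # frozen copy used for targets
gamma = 0.99

# One transition (s, a, r, s', a') with made-up numbers.
s, a, r, s_next, a_next = 0.5, 1.0, 1.0, 0.7, -0.5

# The target uses the *frozen* parameters -- refreshing theta_old from theta
# later is exactly what makes the targets non-stationary.
y = r + gamma * Q(theta_old, s_next, a_next)

# Semi-gradient TD update: y is treated as a constant.
td_error = Q(theta, s, a) - y
theta -= 0.1 * (2 * td_error * phi(s, a))
```

A few such gradient steps shrink the TD error but never fully minimize it before `theta_old` is refreshed and the targets move again.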
These issues make it hard to scale value learning — bigger networks don’t reliably help, and features learned for old targets often become stale (loss of plasticity). Prior work has explored better architectures (layer normalization, separate state/action encoders) and better losses (classification-style losses, feature regularization), but the fundamental tension remains.
Primer: What is Flow Matching?
Before we see how flows enter RL, let’s understand the core idea. Flow matching (Lipman et al., 2023; Liu et al., 2023) is a framework for learning a continuous transformation from a simple noise distribution to a complex target distribution. Compared to diffusion models (which learn a score function and reverse a stochastic noising process), flow matching learns a deterministic velocity field along a straight-line path — simpler to train and faster to sample.
The Velocity Field and Flow ODE
Imagine many particles, each starting at a random position drawn from some simple distribution \(p_0\) (e.g., a Gaussian). We want to move them so that, at the end, they are arranged according to a complex target distribution \(p_1\) (e.g., the data distribution of images). The question is: what velocity should each particle have at each moment?
Formally, we define a velocity field \(v_\theta(t, z)\) parameterized by a neural network. Given a starting point \(z_0 \sim p_0\), the particle’s trajectory is governed by the ODE:
\[\frac{dz}{dt} = v_\theta(t, z)\]If \(v_\theta\) is learned well, integrating from \(t=0\) to \(t=1\) transports the noise distribution \(p_0\) into the data distribution \(p_1\). In practice, we discretize with \(K\) Euler steps:
\[z_{k+1} = z_k + \frac{1}{K}\, v_\theta\!\left(\frac{k}{K},\, z_k\right)\]This is the inference procedure: start from noise, apply the network \(K\) times, get a data sample.
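As a sanity check of this recipe, here is Euler integration of a hand-made velocity field \(v(t, z) = z\) (not a learned network), chosen because the exact solution \(z(1) = z_0 e\) is known, so the discretization error of the \(K\)-step scheme is directly visible:

```python
import numpy as np

# K-step Euler integration of dz/dt = v(t, z), as in the sampling recipe.
def euler_integrate(v, z0, K):
    z = z0
    for k in range(K):
        z = z + (1.0 / K) * v(k / K, z)
    return z

# Hand-made field with a known answer: dz/dt = z gives z(1) = z0 * e.
v = lambda t, z: z
z0 = 1.0
exact = z0 * np.e

err_coarse = abs(euler_integrate(v, z0, 4) - exact)
err_fine = abs(euler_integrate(v, z0, 64) - exact)
# More Euler steps -> smaller integration error.
```

This is the same trade-off floq later exploits: the step count \(K\) is a dial between compute and integration accuracy.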
Conditional Flow Matching: A Tractable Training Loss
How do we learn \(v_\theta\)? Naively, we’d want to minimize:
\[\mathcal{L}_{\text{marginal}}(\theta) = \mathbb{E}_t\!\left[\left\lVert v_\theta(t, z) - u_t(z)\right\rVert^2\right]\]where \(u_t(z)\) is the true marginal velocity field that transports \(p_0\) to \(p_1\). But \(u_t(z)\) is intractable — it depends on the entire data distribution.
The key insight is that we don’t need to reason about all particles at once. Instead, we can construct training data one pair at a time. Pick a noise sample \(z_0 \sim p_0\) and a data sample \(x_1 \sim p_1\) independently — there is no special correspondence between a particular \(z_0\) and a particular \(x_1\), they are just randomly paired. We declare: “this particle starts at \(z_0\) and should end at \(x_1\).” The simplest path connecting them is a straight line. The interpolant places a point along this straight line at time \(t\):
\[z(t) = (1 - t)\, z_0 + t\, x_1\]When \(t = 0\), \(z(0) = z_0\) (pure noise). When \(t = 1\), \(z(1) = x_1\) (pure data). When \(t = 0.5\), \(z(0.5)\) is the midpoint. In other words, the interpolant generates a training input: a synthetic position that the particle should pass through at time \(t\) if it’s traveling in a straight line from \(z_0\) to \(x_1\).
Now we need a training label. If the particle is at \(z(t)\) and should arrive at \(x_1\) by \(t=1\), what velocity should it have? Along a straight-line path, the velocity is constant:
\[u(t \mid z_0, x_1) = x_1 - z_0\]So we have both the input (where the particle is: \((t, z(t))\)) and the label (what velocity it should have: \(x_1 - z_0\)). This gives us the conditional flow-matching loss:
\[\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,1),\; z_0 \sim p_0,\; x_1 \sim p_1}\!\left[\left\lVert v_\theta(t,\, z(t)) - (x_1 - z_0)\right\rVert^2\right]\]Each training step is simple: sample a random \((z_0, x_1, t)\), compute the interpolated position \(z(t)\), feed \((t, z(t))\) into the network, and regress the output toward \((x_1 - z_0)\). No ODE solving needed — just standard supervised learning.
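Here is that training step in miniature, with a linear model standing in for \(v_\theta\). The batch size, distributions, and learning rate are illustrative choices, not taken from any particular implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 2                                        # data dimension
N = 256                                      # batch size

# Sample (z0, x1, t) triples and build inputs/labels exactly per the recipe.
z0 = rng.normal(size=(N, d))                 # noise, z0 ~ p0 = N(0, I)
x1 = rng.normal(size=(N, d)) + 3.0           # "data", x1 ~ p1 = N(3, I)
t = rng.uniform(size=(N, 1))                 # t ~ U(0, 1)
zt = (1 - t) * z0 + t * x1                   # interpolant: the network input
label = x1 - z0                              # constant velocity: the label

# Toy linear "network": v(t, z) = [z, t, 1] @ W
feats = np.hstack([zt, t, np.ones((N, 1))])  # (N, d + 2)
W = rng.normal(size=(d + 2, d)) * 0.1

def cfm_loss(W):
    return np.mean(np.sum((feats @ W - label) ** 2, axis=1))

loss_before = cfm_loss(W)
for _ in range(500):                         # plain gradient descent on the MSE
    grad = 2 * feats.T @ (feats @ W - label) / N
    W -= 0.01 * grad
loss_after = cfm_loss(W)
```

No ODE solver appears anywhere in training; the interpolant and the constant-velocity label make it ordinary regression.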
The remarkable result (Lipman et al., 2023) is that \(\nabla_\theta \mathcal{L}_{\text{CFM}} = \nabla_\theta \mathcal{L}_{\text{marginal}}\) — the gradients are identical. So even though we train on individual straight-line pairs, the network learns the correct global velocity field that transports the entire noise distribution to the data distribution.
Marginal vs. Conditional Paths
An important subtlety: each conditional path (for a specific \((z_0, x_1)\) pair) is a straight line. But the network \(v_\theta\) sees points from all pairs during training. At any location \((t, z)\), many different conditional paths pass through, and \(v_\theta\) learns their conditional expectation. This means the actual marginal flow paths — obtained by integrating the learned \(v_\theta\) from a single \(z_0\) — are generally curved, because the averaged velocity at intermediate points differs from any single conditional velocity.
Summary of the flow-matching recipe:
- Training: sample \((z_0, x_1, t)\), compute interpolant \(z(t)\), regress \(v_\theta(t, z(t))\) toward \(x_1 - z_0\). Simple supervised learning, no ODE.
- Inference: start from \(z_0 \sim p_0\), run \(K\) Euler steps through the learned \(v_\theta\), output \(z_K \approx x_1\).
Standard flow matching is used for generative modeling in high dimensions (images, molecules, etc.). But the iterative computation structure — applying the same network \(v_\theta\) repeatedly with evolving inputs — turns out to be independently valuable for a completely different purpose. The question that floq asks is: can we repurpose this iterative structure, not to generate data, but to produce better value estimates?
The Idea: Flow-Matching Q-Functions
Instead of predicting \(Q(s, a)\) in a single forward pass, floq produces Q-values by integrating a learned velocity field over multiple steps. Start from a noise sample \(z_0 \sim \text{Unif}[l, u]\), and iteratively apply:
\[z_{k+1} = z_k + \frac{1}{K} \, v_\theta\!\left(\frac{k}{K},\, z_k \;\middle\vert\; s, a\right)\]for \(K\) integration steps, \(k = 0, \ldots, K-1\). The final value \(z_K\) is the predicted Q-value. This is still a scalar prediction — not a high-dimensional generative model.
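In code, prediction is just \(K\) applications of the same network. The sketch below uses a randomly initialized two-layer MLP as a stand-in for a trained velocity network; the sizes, interval, and names are my own, not the paper's:

```python
import numpy as np

def mlp(params, x):
    """Tiny stand-in for the velocity network v_theta."""
    h = np.tanh(params["W1"] @ x + params["b1"])
    return float(params["W2"] @ h + params["b2"])

def init_params(in_dim, hidden=32, seed=0):
    r = np.random.default_rng(seed)
    return {
        "W1": r.normal(size=(hidden, in_dim)) * 0.3,
        "b1": np.zeros(hidden),
        "W2": r.normal(size=(1, hidden)) * 0.3,
        "b2": np.zeros(1),
    }

def floq_q_value(params, s, a, K=8, low=-1.0, high=1.0, seed=0):
    r = np.random.default_rng(seed)
    z = r.uniform(low, high)                 # z0 ~ Unif[l, u]
    for k in range(K):
        # Input: current time k/K, current scalar z_k, and the (s, a) pair.
        x = np.concatenate([[k / K, z], s, a])
        z = z + (1.0 / K) * mlp(params, x)   # one Euler step
    return z                                 # z_K is the Q-value estimate

s = np.array([0.1, -0.2, 0.3])
a = np.array([0.5, -0.5])
params = init_params(in_dim=2 + len(s) + len(a))
q = floq_q_value(params, s, a, K=8)
```

Note that the only architectural change versus a plain Q-network is the two extra scalar inputs \((t, z)\); the compute knob is `K`.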
The key insight: this gives us a knob to trade test-time compute for Q-value accuracy. More integration steps mean more forward passes through the velocity network, each operating on a different input \((t, z_t)\) and therefore doing different computation.
Note that this is one of the few cases where “test-time compute scaling” has a clean theoretical justification: more Euler steps mean better ODE integration, which means a more accurate Q-value.
An important clarification: floq does not require a larger network. The velocity network \(v_\theta\) is a standard 4-layer MLP — the same kind of network used as a Q-network in continuous control RL (robotics, maze navigation, etc.), taking raw state and action vectors as input. There is no language model involved; the connection to LLMs discussed later is purely an analogy. The only architectural difference from a standard Q-network is two extra scalar inputs \((t, z_t)\), which adds negligible parameters. The extra compute comes entirely from running this same small network \(K\) times, not from having more parameters. In fact, the floq paper shows that ResNets with matched FLOPs (i.e., one big forward pass using the same total compute) perform significantly worse. The benefit is specifically from iterative computation, not from model size.
Training: The Flow-Matching Loss
Training floq requires two steps:
Step 1: Compute the target value by integrating the flow with the old (frozen) network:
\[y(s_i, a_i) = r(s_i, a_i) + \gamma Q_{\theta^{\text{old}}}(s_{i+1}, a_{i+1})\]where \(Q_{\theta^{\text{old}}}\) is itself obtained by integrating the old velocity field.
Step 2: Train the velocity field against the linear interpolant between noise and target. Just like in the primer, we construct a straight-line path from a noise sample \(z_0\) to the target \(y\):
\[z(t) = (1 - t)\, z_0 + t \cdot y(s_i, a_i)\]The derivative of this path is \(\frac{dz}{dt} = y - z_0\) — the constant velocity along the straight line. This is the label we train the network to predict (exactly the same conditional flow-matching trick as before, with \(x_1\) replaced by the TD target \(y\)). The training loss is:
\[\mathcal{L}(\theta) = \mathbb{E}_{t,\, z_0}\left[\left\lVert v_\theta(t,\, z(t) \mid s, a) - (y - z_0)\right\rVert^2\right]\]A natural question: if the actual flow paths are curved (as discussed in the primer), why do we train against straight-line targets? This is exactly the power of conditional flow matching: the gradient equivalence theorem guarantees that training on individual straight-line (noise, target) pairs produces the same optimal \(v_\theta\) as training on the true (curved) marginal velocity field. The network sees many different \((z_0, y)\) pairs across training, and the averaged gradient steers it toward the correct curved flow — even though each individual supervision signal is a straight line.
A follow-up question: does training on straight lines mean inference must also be straight? No. At inference time, we start from a single \(z_0\) and integrate \(v_\theta\) step by step. At each point \((t, z_t)\), the network outputs the velocity it learned — which is the average over all the straight-line targets that passed through that region during training. Different starting points \(z_0\) lead to different positions \(z_t\) at intermediate times, and the network gives them different velocities. The result is curved paths, even though no single training example was curved. This is precisely why the width of the noise interval matters so much, as we discuss next: a wider interval means more diverse starting points, which forces the network to learn position-dependent velocities, which produces more curvature, which makes each integration step do genuinely different work.
Forward and Backward: What Each Does
Unlike diffusion models, flow matching has no reverse process. The flow only goes one direction: noise \(\to\) value. But there are two distinct phases that are easy to confuse:
Forward (integration): This is the ODE integration from \(t=0\) to \(t=1\), transforming noise into a Q-value:
\[z_0 \sim \text{Unif}[l, u] \;\xrightarrow{v_\theta,\; K \text{ steps}}\; z_K = Q(s, a)\]This is used in two places: (1) at test time, to predict Q-values for action selection, and (2) during training, to compute the TD target \(y = r + \gamma Q_{\theta^{\text{old}}}(s', a')\) by integrating the frozen old network.
Backward (gradient update): This is not reverse-time integration — it is standard backpropagation through the flow-matching loss. For each training step:
1. Run forward integration with the old (frozen) network to get the target \(y\)
2. Sample noise \(z_0\) and a random time \(t \sim \mathcal{U}(0,1)\)
3. Compute the interpolated position \(z(t) = (1-t)\, z_0 + t \cdot y\) — this is the network’s input
4. Feed \((t, z(t), s, a)\) into \(v_\theta\) and compute the MSE loss against \((y - z_0)\) — this is the label
Then backpropagate \(\nabla_\theta \mathcal{L}\) as usual. Steps 1–3 just construct the training example (input, label); step 4 is ordinary supervised regression. There is nothing special about the gradient — no denoising, no reverse SDE. The “backward” in floq is just the loss gradient, not a backward-in-time process.
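Concretely, constructing one such training example looks like this. The function standing in for integration of the frozen network is a placeholder returning a made-up value, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

gamma = 0.99

def floq_q_old(s, a):
    """Placeholder for K-step Euler integration of the frozen network."""
    return 1.5  # pretend the frozen flow returned this Q-value

s, a, r, s_next, a_next = 0.0, 0.0, 1.0, 0.0, 0.0

# Step 1: forward-integrate the frozen network to get the TD target.
y = r + gamma * floq_q_old(s_next, a_next)

# Step 2: sample noise and a random time.
z0 = rng.uniform(-1.0, 1.0)
t = rng.uniform()

# Step 3: interpolated position -- the network's input.
zt = (1 - t) * z0 + t * y

# Step 4: the regression label is the constant straight-line velocity.
label = y - z0
# loss = (v_theta(t, zt | s, a) - label) ** 2, backpropagated as usual.
```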
Design Choices: Why Curved Paths Matter
A critical design choice is the width of the initial noise interval \([l, u]\). This determines whether integration paths are straight or curved — and this matters a lot for compute scaling.
With a narrow noise interval, all starting points are close together, so the optimal paths are nearly straight lines to the target. The velocity field is approximately constant — every integration step does the same thing, and extra steps are wasted.
With a wide noise interval, starting points are spread far apart, and the paths must curve to converge. Now the velocity field at each step depends strongly on the current position \(z_t\), meaning each step does genuinely different computation. This is what makes more integration steps valuable.
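A one-dimensional toy calculation makes the effect concrete: for a fixed target \(y\), the straight-line velocity from a start \(z_0\) is \(y - z_0\), so the spread of velocities the network must represent scales with the interval width. The numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

y = 5.0  # a fixed target value

def velocity_spread(low, high, n=100_000):
    """Std of straight-line velocities y - z0 over z0 ~ Unif[low, high]."""
    z0 = rng.uniform(low, high, size=n)
    return float(np.std(y - z0))

narrow = velocity_spread(4.9, 5.1)   # starts clustered: nearly constant field
wide = velocity_spread(-5.0, 5.0)    # starts spread out: position-dependent field
```

With the narrow interval, the velocity field is nearly constant and each Euler step repeats the same work; with the wide interval, the velocity depends strongly on position, so each step does something different.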
This is essentially the same insight as why deeper networks are more expressive than wider ones: depth enables composition, and composition enables qualitatively different computation at each layer.
Why It Works: Iterative Compute and Plastic Features
The mechanism behind floq’s success has a striking analogy to chain-of-thought reasoning in LLMs:
- The intermediate values \(z_0, z_1, \ldots, z_K\) act like “thoughts” — latent states that carry information forward.
- The same network \(v_\theta\) is applied at each step with different inputs, just like an autoregressive LM generating each token.
- The flow-matching loss provides dense supervision at every timestep \(t\), analogous to SFT supervising at every token rather than only at the final answer.
This dense supervision has a crucial consequence for RL: it produces plastic features. In standard TD-learning, features learned for early (inaccurate) targets become entrenched and hard to update — a well-known phenomenon called loss of plasticity. In floq, the same features can produce different outputs through different integration paths, so they naturally support changing targets without retraining.
The plasticity story is perhaps the most underappreciated contribution. Loss of plasticity is one of the main reasons we can’t just scale up Q-learning like we scale up LLMs. If floq truly solves this, it opens the door to much larger value networks.
Experiments confirm this: when features are frozen partway through training, floq networks maintain high performance while standard Q-networks collapse. floq can also handle much more aggressive update-to-data ratios (UTD up to 128) without degradation.
Results
On simulated offline RL benchmarks (antmaze, humanoidmaze, cube manipulation, puzzles), floq achieves strong improvements over both standard Q-learning and flow-based policy methods:
- ~2x improvement on hard antmaze and cube tasks
- ~3x on puzzle-4x4 (the hardest task requiring precise value estimation)
- Average score of 59 vs. 47 for the best flow policy baseline (FQL with 2M parameters)
The scaling behavior is clear: increasing integration steps from 1 to 8–16 consistently improves performance across tasks. Importantly, this scaling comes from the iterative flow architecture, not just from using bigger networks — ResNets with matched FLOPs perform significantly worse.
Takeaways
- Flow-matching in RL is not about modeling distributions — it’s about getting iterative test-time compute for value estimation, turning Q-value prediction from a one-shot problem into an iterative refinement process.
- Wide noise intervals are essential for curved integration paths that make each step do different work.
- Dense supervision at every integration step produces plastic features that handle non-stationary TD targets gracefully.
- The analogy to chain-of-thought in LLMs is deep: both use iterative computation through a shared network to trade compute for accuracy, with dense supervision enabling generalization.
- Open questions remain: does “exploration in a CoT” have a flow analogue? What does the RL vs. SFT gap in LLMs mean for flow-based value learning?