What Does Flow-Matching Bring to Deep RL?

Training a good value function is one of the biggest challenges in deep RL. Two problems keep showing up: scalability (does more compute lead to better performance?) and plasticity (does the network learn features that remain useful as targets shift?). A recent line of work proposes an unexpected solution: replace the standard Q-network with a flow-matching Q-function. The surprising twist? It has nothing to do with modeling high-dimensional distributions — the key contribution is iterative test-time compute for value estimation.

This post covers two papers from Agrawalla, Nauman, Agrawal, and Kumar:

  • floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL (ICLR 2026)
  • What Does Flow-Matching Bring to TD-Learning? (arXiv 2026)

Background: Why Value Functions Are Hard

The standard approach to learning Q-values is temporal difference (TD) learning. Given a dataset of transitions \(\{s_i, a_i, r(s_i, a_i), s_{i+1}\}\), we train a Q-network by minimizing:

\[\min_\theta \; \mathbb{E}_{\text{data}}\left[(Q_\theta(s_i, a_i) - y(s_i, a_i))^2\right]\]

where the target is \(y(s_i, a_i) = r(s_i, a_i) + \gamma Q_{\theta^{\text{old}}}(s_{i+1}, a_{i+1})\).

This procedure has two fundamental issues:

  1. Non-stationary targets: The targets \(y(s_i, a_i)\) depend on the Q-network itself (through \(Q_{\theta^{\text{old}}}\)), so they keep changing as training progresses.
  2. Imperfect optimization: We only take a few gradient steps on each set of frozen targets before refreshing them, so the loss is never fully minimized.

These issues make it hard to scale value learning — bigger networks don’t reliably help, and features learned for old targets often become stale (loss of plasticity). Prior work has explored better architectures (layer normalization, separate state/action encoders) and better losses (classification-style losses, feature regularization), but the fundamental tension remains.
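For concreteness, one TD regression step on a batch of frozen targets can be sketched in a few lines of NumPy. The transition rewards and Q estimates below are made-up toy numbers, not values from the papers:

```python
import numpy as np

# One batch of transitions (s_i, a_i, r_i, s_{i+1}); toy numbers.
r = np.array([1.0, 0.0, 0.5])           # rewards r(s_i, a_i)
q_old_next = np.array([4.0, 2.0, 3.0])  # Q_old(s_{i+1}, a_{i+1}), frozen
gamma = 0.99

y = r + gamma * q_old_next              # TD targets, treated as constants

q_pred = np.array([4.5, 2.2, 3.1])      # current network's Q(s_i, a_i)
loss = np.mean((q_pred - y) ** 2)       # the regression loss minimized in theta
```

The non-stationarity is visible here: `y` is recomputed from `q_old_next` every time the frozen network is refreshed, so the regression target keeps moving.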

Primer: What is Flow Matching?

Before we see how flows enter RL, let’s understand the core idea. Flow matching (Lipman et al., 2023; Liu et al., 2023) is a framework for learning a continuous transformation from a simple noise distribution to a complex target distribution. Compared to diffusion models (which learn a score function and reverse a stochastic noising process), flow matching learns a deterministic velocity field along a straight-line path — simpler to train and faster to sample.

The Velocity Field and Flow ODE

Imagine many particles, each starting at a random position drawn from some simple distribution \(p_0\) (e.g., a Gaussian). We want to move them so that, at the end, they are arranged according to a complex target distribution \(p_1\) (e.g., the data distribution of images). The question is: what velocity should each particle have at each moment?

Formally, we define a velocity field \(v_\theta(t, z)\) parameterized by a neural network. Given a starting point \(z_0 \sim p_0\), the particle’s trajectory is governed by the ODE:

\[\frac{dz}{dt} = v_\theta(t, z)\]

If \(v_\theta\) is learned well, integrating from \(t=0\) to \(t=1\) transports the noise distribution \(p_0\) into the data distribution \(p_1\). In practice, we discretize with \(K\) Euler steps:

\[z_{k+1} = z_k + \frac{1}{K}\, v_\theta\!\left(\frac{k}{K},\, z_k\right)\]

This is the inference procedure: start from noise, apply the network \(K\) times, get a data sample.
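As a sketch, this inference loop is a few lines of NumPy. The constant-shift field `v_shift` is a hypothetical stand-in for a learned network, chosen so the result is easy to verify by hand:

```python
import numpy as np

def integrate_flow(v, z0, K=8):
    """Euler-integrate dz/dt = v(t, z) from t=0 to t=1 in K steps."""
    z = np.asarray(z0, dtype=float)
    for k in range(K):
        z = z + (1.0 / K) * v(k / K, z)
    return z

# Toy field: constant velocity +1 everywhere. It transports
# p0 = N(0, 1) to p1 = N(1, 1) by shifting every particle right by 1.
v_shift = lambda t, z: np.ones_like(z)

z0 = np.array([-0.5, 0.0, 2.0])         # three "particles"
zK = integrate_flow(v_shift, z0, K=8)   # each particle shifted by +1
```

A trained \(v_\theta\) would replace `v_shift`; the integration loop itself is unchanged.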

Conditional Flow Matching: A Tractable Training Loss

How do we learn \(v_\theta\)? Naively, we’d want to minimize:

\[\mathcal{L}_{\text{marginal}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,1),\; z \sim p_t}\!\left[\left\lVert v_\theta(t, z) - u_t(z)\right\rVert^2\right]\]

where \(p_t\) is the intermediate distribution at time \(t\) and \(u_t(z)\) is the true marginal velocity field that transports \(p_0\) to \(p_1\). But \(u_t(z)\) is intractable — it depends on the entire data distribution.

The key insight is that we don’t need to reason about all particles at once. Instead, we can construct training data one pair at a time. Pick a noise sample \(z_0 \sim p_0\) and a data sample \(x_1 \sim p_1\) independently — there is no special correspondence between a particular \(z_0\) and a particular \(x_1\), they are just randomly paired. We declare: “this particle starts at \(z_0\) and should end at \(x_1\).” The simplest path connecting them is a straight line. The interpolant places a point along this straight line at time \(t\):

\[z(t) = (1 - t)\, z_0 + t\, x_1\]

When \(t = 0\), \(z(0) = z_0\) (pure noise). When \(t = 1\), \(z(1) = x_1\) (pure data). When \(t = 0.5\), \(z(0.5)\) is the midpoint. In other words, the interpolant generates a training input: a synthetic position that the particle should pass through at time \(t\) if it’s traveling in a straight line from \(z_0\) to \(x_1\).

Now we need a training label. If the particle is at \(z(t)\) and should arrive at \(x_1\) by \(t=1\), what velocity should it have? Along a straight-line path, the velocity is constant:

\[u(t \mid z_0, x_1) = x_1 - z_0\]

So we have both the input (where the particle is: \((t, z(t))\)) and the label (what velocity it should have: \(x_1 - z_0\)). This gives us the conditional flow-matching loss:

\[\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,1),\; z_0 \sim p_0,\; x_1 \sim p_1}\!\left[\left\lVert v_\theta(t,\, z(t)) - (x_1 - z_0)\right\rVert^2\right]\]

Each training step is simple: sample a random \((z_0, x_1, t)\), compute the interpolated position \(z(t)\), feed \((t, z(t))\) into the network, and regress the output toward \((x_1 - z_0)\). No ODE solving needed — just standard supervised learning.

The remarkable result (Lipman et al., 2023) is that \(\nabla_\theta \mathcal{L}_{\text{CFM}} = \nabla_\theta \mathcal{L}_{\text{marginal}}\) — the gradients are identical. So even though we train on individual straight-line pairs, the network learns the correct global velocity field that transports the entire noise distribution to the data distribution.
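A single training step can then be sketched as example construction plus a regression target. A minimal NumPy illustration; the data sample `x1` and its dimension are arbitrary toy choices:

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Construct one conditional flow-matching (input, label) pair:
    input is (t, z(t)) on a random straight line, label is x1 - z0."""
    z0 = rng.standard_normal(x1.shape)  # noise sample, z0 ~ p0 = N(0, I)
    t = rng.uniform()                   # random time, t ~ U(0, 1)
    zt = (1.0 - t) * z0 + t * x1        # straight-line interpolant z(t)
    return (t, zt), x1 - z0             # label: constant conditional velocity

rng = np.random.default_rng(0)
x1 = np.array([2.0, -1.0])              # a "data" sample from p1
(t, zt), label = cfm_training_pair(x1, rng)
# training step for a network v: minimize ||v(t, zt) - label||^2
```

No ODE solver appears anywhere in training; the pair `(t, zt)` and `label` feed an ordinary supervised regression.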

Marginal vs. Conditional Paths

An important subtlety: each conditional path (for a specific \((z_0, x_1)\) pair) is a straight line. But the network \(v_\theta\) sees points from all pairs during training. At any location \((t, z)\), many different conditional paths pass through, and \(v_\theta\) learns their conditional expectation. This means the actual marginal flow paths — obtained by integrating the learned \(v_\theta\) from a single \(z_0\) — are generally curved, because the averaged velocity at intermediate points differs from any single conditional velocity.

Flow matching transports noise into data via a learned velocity field. (Interactive figure: the ODE, the conditional FM training loss, why straight-line training produces curved inference paths, and a comparison with floq.)

Summary of the flow-matching recipe:

  1. Training: sample \((z_0, x_1, t)\), compute interpolant \(z(t)\), regress \(v_\theta(t, z(t))\) toward \(x_1 - z_0\). Simple supervised learning, no ODE.
  2. Inference: start from \(z_0 \sim p_0\), run \(K\) Euler steps through the learned \(v_\theta\), output \(z_K \approx x_1\).

Standard flow matching is used for generative modeling in high dimensions (images, molecules, etc.). But the iterative computation structure — applying the same network \(v_\theta\) repeatedly with evolving inputs — turns out to be independently valuable for a completely different purpose. The question that floq asks is: can we repurpose this iterative structure, not to generate data, but to produce better value estimates?

The Idea: Flow-Matching Q-Functions

Instead of predicting \(Q(s, a)\) in a single forward pass, floq produces Q-values by integrating a learned velocity field over multiple steps. Start from a noise sample \(z_0 \sim \text{Unif}[l, u]\) and iteratively apply:

\[z_{k+1} = z_k + \frac{1}{K} \, v_\theta\!\left(\frac{k}{K},\, z_k \;\middle\vert\; s, a\right)\]

for \(k = 0, \ldots, K-1\). The final value \(z_K\) is the predicted Q-value. This is still a scalar prediction — not a high-dimensional generative model.
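A sketch of this inference loop, with a hand-built velocity field standing in for a trained network. The toy field below is the exact conditional velocity when the target is a point mass at \(q^\star = 5\) (state and action are ignored), so the K-step Euler integration lands on 5 from any starting noise:

```python
import numpy as np

def floq_q_value(v, s, a, K=8, l=-1.0, u=1.0, rng=None):
    """Predict a scalar Q(s, a) by Euler-integrating the conditional
    velocity field v(t, z, s, a) from noise z0 ~ Unif[l, u]."""
    rng = rng or np.random.default_rng()
    z = rng.uniform(l, u)               # scalar noise sample
    for k in range(K):
        z = z + (1.0 / K) * v(k / K, z, s, a)
    return z

# Toy field: exact conditional velocity when the "target distribution"
# is a point mass at q_star = 5; the (s, a) conditioning is unused here.
q_star = 5.0
v_toy = lambda t, z, s, a: (q_star - z) / (1.0 - t)

q = floq_q_value(v_toy, s=None, a=None, K=8)
```

In floq proper, `v` is the trained MLP \(v_\theta\) conditioned on \((s, a)\), and different \((s, a)\) inputs steer the integration toward different Q-values.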

Note that this is one of the few cases where “test-time compute scaling” has a clean theoretical justification: more Euler steps means better ODE integration, which means a more accurate Q-value.

The key insight: this gives us a knob to trade test-time compute for Q-value accuracy. More integration steps means more forward passes through the velocity network, each operating on a different input \((t, z_t)\) and therefore doing different computation.

An important clarification: floq does not require a larger network. The velocity network \(v_\theta\) is a standard 4-layer MLP — the same kind of network used as a Q-network in continuous control RL (robotics, maze navigation, etc.), taking raw state and action vectors as input. There is no language model involved; the connection to LLMs discussed later is purely an analogy. The only architectural difference from a standard Q-network is two extra scalar inputs \((t, z_t)\), which adds negligible parameters. The extra compute comes entirely from running this same small network \(K\) times, not from having more parameters. In fact, the floq paper shows that ResNets with matched FLOPs (i.e., one big forward pass using the same total compute) perform significantly worse. The benefit is specifically from iterative computation, not from model size.

From standard Q-networks to flow Q-functions: instead of a single forward pass, floq integrates a velocity field over K steps, trading compute for accuracy. (Interactive figure: the integration process and distribution evolution.)

Training: The Flow-Matching Loss

Training floq requires two steps:

Step 1: Compute the target value by integrating the flow with the old (frozen) network:

\[y(s_i, a_i) = r(s_i, a_i) + \gamma Q_{\theta^{\text{old}}}(s_{i+1}, a_{i+1})\]

where \(Q_{\theta^{\text{old}}}\) is itself obtained by integrating the old velocity field.

Step 2: Train the velocity field against the linear interpolant between noise and target. Just like in the primer, we construct a straight-line path from the noise sample \(\mathbf{z}\) to the target \(y\):

\[\mathbf{z}(t) = (1 - t) \cdot \mathbf{z} + t \cdot y(s_i, a_i)\]

The derivative of this path is \(\frac{d\mathbf{z}}{dt} = y - \mathbf{z}\) — the constant velocity along the straight line. This is the label we train the network to predict (exactly the same conditional flow-matching trick as before, with \(x_1\) replaced by the TD target \(y\)). The training loss is:

\[\mathcal{L}(\theta) = \mathbb{E}_{t, \mathbf{z}}\left[\left\lVert v_\theta(t, \mathbf{z}(t) \mid s, a) - (y - \mathbf{z})\right\rVert^2\right]\]

A natural question: if the actual flow paths are curved (as discussed in the primer), why do we train against straight-line targets? This is exactly the power of conditional flow matching: the gradient equivalence theorem guarantees that training on individual straight-line (noise, target) pairs produces the same optimal \(v_\theta\) as training on the true (curved) marginal velocity field. The network sees many different \((z_0, y)\) pairs across training, and the averaged gradient steers it toward the correct curved flow — even though each individual supervision signal is a straight line.

A follow-up question: does training on straight lines mean inference must also be straight? No. At inference time, we start from a single \(z_0\) and integrate \(v_\theta\) step by step. At each point \((t, z_t)\), the network outputs the velocity it learned — which is the average over all the straight-line targets that passed through that region during training. Different starting points \(z_0\) lead to different positions \(z_t\) at intermediate times, and the network gives them different velocities. The result is curved paths, even though no single training example was curved. This is precisely why the width of the noise interval matters so much, as we discuss next: a wider interval means more diverse starting points, which forces the network to learn position-dependent velocities, which produces more curvature, which makes each integration step do genuinely different work.

Forward and Backward: What Each Does

Unlike diffusion models, flow matching has no reverse process. The flow only goes one direction: noise \(\to\) value. But there are two distinct phases that are easy to confuse:

Forward (integration): This is the ODE integration from \(t=0\) to \(t=1\), transforming noise into a Q-value:

\[z_0 \sim \text{Unif}[l, u] \;\xrightarrow{v_\theta,\; K \text{ steps}}\; z_K = Q(s, a)\]

This is used in two places: (1) at test time, to predict Q-values for action selection, and (2) during training, to compute the TD target \(y = r + \gamma Q_{\theta^{\text{old}}}(s', a')\) by integrating the frozen old network.

Backward (gradient update): This is not reverse-time integration — it is standard backpropagation through the flow-matching loss. For each training step:

  1. Run forward integration with the old (frozen) network to get target \(y\)
  2. Sample noise \(z_0\) and random time \(t \sim \mathcal{U}(0,1)\)
  3. Compute interpolated position \(z(t) = (1-t)\, z_0 + t \cdot y\) — this is the network’s input
  4. Feed \((t, z(t), s, a)\) into \(v_\theta\) and compute the MSE loss against \((y - z_0)\) — this is the label
\[\mathcal{L} = \left\lVert v_\theta(t,\, z(t) \mid s, a) - (y - z_0)\right\rVert^2\]

Then backpropagate \(\nabla_\theta \mathcal{L}\) as usual. Steps 1–3 just construct the training example (input, label); step 4 is ordinary supervised regression. There is nothing special about the gradient — no denoising, no reverse SDE. The “backward” in floq is just the loss gradient, not a backward-in-time process.
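Steps 1–4 amount to constructing one supervised (input, label) pair per transition. A minimal sketch, where `q_old_next` stands in for the step-1 integration of the frozen network and the numeric values are toy choices:

```python
import numpy as np

def floq_training_pair(r, q_old_next, gamma, l, u, rng):
    """Build one (input, label) pair for the floq flow-matching loss.
    q_old_next plays the role of Q_old(s', a'), itself produced by
    integrating the frozen velocity field (step 1 in the text)."""
    y = r + gamma * q_old_next          # step 1: TD target
    z0 = rng.uniform(l, u)              # step 2: noise sample
    t = rng.uniform()                   #         and random time
    zt = (1.0 - t) * z0 + t * y         # step 3: interpolated input
    return (t, zt), y - z0              # step 4: velocity label

rng = np.random.default_rng(0)
(t, zt), label = floq_training_pair(r=1.0, q_old_next=4.0,
                                    gamma=0.99, l=-1.0, u=1.0, rng=rng)
# loss for the velocity net: (v(t, zt, s, a) - label) ** 2
```

Everything after step 1 is plain supervised regression; the only place the flow is actually integrated is in producing the frozen target.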

Design Choices: Why Curved Paths Matter

A critical design choice is the width of the initial noise interval \([l, u]\). This determines whether integration paths are straight or curved — and this matters a lot for compute scaling.

Narrow noise leads to straight paths (redundant compute); wide noise leads to curved paths (useful compute). The curvature ensures each integration step does genuinely different work, enabling test-time compute scaling.

With a narrow noise interval, all starting points are close together, so the optimal paths are nearly straight lines to the target. The velocity field is approximately constant — every integration step does the same thing, and extra steps are wasted.

With a wide noise interval, starting points are spread far apart, and the paths must curve to converge. Now the velocity field at each step depends strongly on the current position \(z_t\), meaning each step does genuinely different computation. This is what makes more integration steps valuable.

This is essentially the same insight as why deeper networks are more expressive than wider ones: depth enables composition, and composition enables qualitatively different computation at each layer.

Why It Works: Iterative Compute and Plastic Features

The mechanism behind floq’s success has a striking analogy to chain-of-thought reasoning in LLMs:

  • The intermediate values \(z_0, z_1, \ldots, z_K\) act like “thoughts” — latent states that carry information forward.
  • The same network \(v_\theta\) is applied at each step with different inputs, just like an autoregressive LM generating each token.
  • The flow-matching loss provides dense supervision at every timestep \(t\), analogous to SFT supervising at every token rather than only at the final answer.

Flow integration as iterative reasoning: each step refines the Q-value estimate using the same network but different inputs. Dense supervision at every step gives rich gradient signal and enables plastic features.

This dense supervision has a crucial consequence for RL: it produces plastic features. In standard TD-learning, features learned for early (inaccurate) targets become entrenched and hard to update — a well-known phenomenon called loss of plasticity. In floq, the same features can produce different outputs through different integration paths, so they naturally support changing targets without retraining.

The plasticity story is perhaps the most underappreciated contribution. Loss of plasticity is one of the main reasons we can’t just scale up Q-learning like we scale up LLMs. If floq truly solves this, it opens the door to much larger value networks.

Experiments confirm this: when features are frozen partway through training, floq networks maintain high performance while standard Q-networks collapse. floq can also handle much more aggressive update-to-data ratios (UTD up to 128) without degradation.

Results

On simulated offline RL benchmarks (antmaze, humanoidmaze, cube manipulation, puzzles), floq achieves strong improvements over both standard Q-learning and flow-based policy methods:

  • ~2x improvement on hard antmaze and cube tasks
  • ~3x on puzzle-4x4 (the hardest task requiring precise value estimation)
  • Average score of 59 vs. 47 for the best flow policy baseline (FQL with 2M parameters)

The scaling behavior is clear: increasing integration steps from 1 to 8–16 consistently improves performance across tasks. Importantly, this scaling comes from the iterative flow architecture, not just from using bigger networks — ResNets with matched FLOPs perform significantly worse.

Takeaways

  • Flow-matching in RL is not about modeling distributions — it’s about getting iterative test-time compute for value estimation, turning Q-value prediction from a one-shot problem into an iterative refinement process.
  • Wide noise intervals are essential for curved integration paths that make each step do different work.
  • Dense supervision at every integration step produces plastic features that handle non-stationary TD targets gracefully.
  • The analogy to chain-of-thought in LLMs is deep: both use iterative computation through a shared network to trade compute for accuracy, with dense supervision enabling generalization.
  • Open questions remain: does “exploration in a CoT” have a flow analogue? What does the RL vs. SFT gap in LLMs mean for flow-based value learning?