Flow matching learns a velocity field that moves particles from a simple distribution to a target distribution
Each particle follows a curved path from its noise position (top) to its data position (bottom).
Paths curve because the learned velocity at each point averages over all conditional paths passing through that region.
A velocity field $v_\theta(t, z)$ tells each particle which direction to move at every moment.
A flow is defined by an ordinary differential equation:

$$\frac{dz}{dt} = v_\theta(t, z), \qquad z(0) = z_0$$
Starting from $z_0 \sim p_{\text{noise}}$, we integrate this ODE from $t=0$ to $t=1$.
In practice, we use Euler discretization with $K$ steps:

$$z_{k+1} = z_k + \frac{1}{K}\, v_\theta(t_k, z_k), \qquad t_k = k/K$$
Each step is just a forward pass through the network $v_\theta$, applied to the current position.
More steps $K$ means a finer approximation of the continuous flow, at the cost of more compute.
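The Euler sampler above can be sketched in a few lines. This is a minimal illustration, assuming a callable `v_theta(t, z)` (an illustrative name, not a specific library API) that returns the velocity at time `t` and position `z`:

```python
import numpy as np

def euler_sample(v_theta, z0, K=10):
    """Integrate dz/dt = v_theta(t, z) from t=0 to t=1 with K Euler steps.

    v_theta: callable (t, z) -> velocity, e.g. a neural network forward pass.
    z0: starting point, drawn from the noise distribution.
    """
    z = np.asarray(z0, dtype=float)
    dt = 1.0 / K
    for k in range(K):
        t = k * dt
        z = z + dt * v_theta(t, z)  # one forward pass per step
    return z
```

With a constant velocity field the Euler steps are exact: flowing from $z_0 = 0$ under $v \equiv 2$ lands at $z_1 = 2$ for any $K$.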
We don't need to solve the ODE during training!
The key insight: instead of learning the marginal velocity field, we learn conditional velocities.
Given a noise sample $z_0$ and a target data point $x_1$, define a straight-line path:

$$z(t) = (1 - t)\, z_0 + t\, x_1$$

The velocity along this path is simply its time derivative:

$$\frac{dz}{dt} = x_1 - z_0$$

Train the network to predict this velocity at the interpolated point:

$$\min_\theta\; \mathbb{E}_{t, z_0, x_1}\!\left[\lVert v_\theta(t, z(t)) - (x_1 - z_0)\rVert^2\right]$$
Note: the straight-line interpolants are the training targets. The actual learned flow paths are curved, because the network sees many different $(z_0, x_1)$ pairs and learns to average — routing all particles simultaneously.
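The conditional flow-matching objective can be computed without any ODE solve. A minimal NumPy sketch of one loss evaluation, assuming a callable `v_theta(t, zt)` standing in for the network:

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(v_theta, x1_batch):
    """Conditional flow-matching loss for one batch of data points x1.

    Samples noise z0 and a random time t per example, forms the
    straight-line interpolant, and regresses the network output
    onto the conditional velocity (x1 - z0).
    """
    n, d = x1_batch.shape
    z0 = rng.standard_normal((n, d))      # noise samples
    t = rng.uniform(size=(n, 1))          # random times in [0, 1]
    zt = (1 - t) * z0 + t * x1_batch      # straight-line interpolant
    target = x1_batch - z0                # conditional velocity
    pred = v_theta(t, zt)
    return np.mean(np.sum((pred - target) ** 2, axis=1))
```

Each loss evaluation costs one forward pass per example; no integration of the flow is ever needed during training.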
A remarkable theoretical result (Lipman et al., 2023)
The conditional flow-matching loss has the same gradients as the intractable marginal loss.
Marginal loss: $\mathbb{E}_{t,\, z \sim p_t}\!\left[\lVert v_\theta(t,z) - u_t(z)\rVert^2\right]$

Requires knowing the true marginal velocity $u_t(z)$, which depends on the entire data distribution.

Conditional loss: $\mathbb{E}_{t, z_0, x_1}\!\left[\lVert v_\theta(t,z(t)) - (x_1 - z_0)\rVert^2\right]$

Only needs pairs $(z_0, x_1)$ and a random $t$. Easy to compute!
Same gradients $\Rightarrow$ same optimal $v_\theta$. We get the full flow for free from pointwise supervision.
This is why flow matching is so popular: simple loss, no ODE solve during training, theoretically sound.
Why does training on straight lines produce curved paths at inference?
At each point $(t, z)$, multiple training pairs pass through with different straight-line velocities.
The network learns their average — which differs from any single training target.
Integrating these averaged velocities step-by-step produces a curved inference path.
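A toy 1D example makes the averaging concrete. Suppose two training pairs share the same noise point $z_0 = 0$ but have targets $x_1 = +1$ and $x_1 = -1$; at $t = 0$ both straight-line paths pass through $z = 0$, so the best the network can do there is predict the mean of the two conditional velocities:

```python
import numpy as np

# Two training pairs through the same point (t=0, z=0):
#   (z0=0, x1=+1) with conditional velocity +1
#   (z0=0, x1=-1) with conditional velocity -1
velocities = np.array([+1.0, -1.0])

# The squared-error-optimal prediction at that point is the mean,
# which differs from either individual training target.
avg_velocity = velocities.mean()
print(avg_velocity)  # 0.0
```

The learned velocity at the shared point is 0, even though every training target there was $\pm 1$; integrating such averaged velocities is what bends the inference paths.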
Same framework, very different application
| | Generative flow matching | floq (Q-value prediction) |
|---|---|---|
| Goal | generate images, molecules, ... | predict Q-values |
| Dimension | high-dimensional $z \in \mathbb{R}^d$ | scalar $z \in \mathbb{R}$ |
| Noise | $z_0 \sim \mathcal{N}(0, I)$ | $z_0 \sim \text{Unif}[l, u]$ (wider) |
| Target | data samples $x_1 \sim p_{\text{data}}$ | TD target $y(s,a) = r + \gamma Q^{\text{old}}(s', a')$ |
| Trained by | conditional flow-matching loss | TD-learning loss |
floq borrows the iterative computation structure of flows, not the generative modeling goal.
The point is test-time compute scaling, not distribution modeling.
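To make the structural borrowing concrete, here is a hedged sketch of floq-style iterative Q-value computation. The function name `q_value` and the conditioned velocity network `v_theta(t, z, s, a)` are illustrative assumptions, not floq's actual API; the point is that a scalar estimate is refined over $K$ Euler steps, so accuracy can be traded for compute at test time:

```python
import numpy as np

def q_value(v_theta, s, a, lower=-10.0, upper=10.0, K=8, rng=None):
    """Estimate Q(s, a) by flowing a scalar from Unif[lower, upper].

    v_theta(t, z, s, a): assumed velocity network conditioned on the
    state-action pair. More steps K = more test-time compute.
    """
    if rng is None:
        rng = np.random.default_rng()
    z = rng.uniform(lower, upper)   # wide uniform noise, as in floq
    dt = 1.0 / K
    for k in range(K):
        z = z + dt * v_theta(k * dt, z, s, a)
    return z
```

Structurally this is identical to the generative sampler; only the dimensionality (scalar), the noise distribution, and the training signal (TD targets rather than data samples) change.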