Autoregressive vs Diffusion Video Models

two ways to factorize \(p(z_{1:T} \mid c)\) over a sequence of frame latents

Autoregressive

\(p(z_{1:T} \mid c) = \prod_{t=1}^{T} p_\theta(z_t \mid z_{<t}, c)\)
causal attention mask
frames generated step-by-step
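
A minimal sketch of the autoregressive panel, assuming frame-level latents and a toy stand-in for \(p_\theta\). The names `causal_mask`, `sample_autoregressive`, and `toy_next_frame` are hypothetical and not taken from any of the models discussed here; the point is only the shape of the loop, which makes one model call per frame, each seeing the prefix alone.

```python
import numpy as np

def causal_mask(T):
    """Boolean mask where entry [t, s] is True if frame t may attend to frame s (s <= t)."""
    return np.tril(np.ones((T, T), dtype=bool))

def sample_autoregressive(next_frame, T, c, rng):
    """Draw z_1..z_T one at a time; each call sees only the already-sampled prefix and c."""
    frames = []
    for _ in range(T):
        # Pass a copy so the sampler cannot modify the history in place.
        frames.append(next_frame(list(frames), c, rng))
    return frames

def toy_next_frame(prefix, c, rng):
    """Hypothetical stand-in for p_theta: a random code index from a 16-entry codebook."""
    return int(rng.integers(0, 16))

rng = np.random.default_rng(0)
video = sample_autoregressive(toy_next_frame, T=5, c=None, rng=rng)
print(causal_mask(3).astype(int))
print(len(video), "frames sampled in", len(video), "passes")
```

The lower-triangular mask is what lets training run in parallel over all frames while keeping each prediction blind to the future; at sampling time the loop enforces the same constraint by construction.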

Diffusion

\(p_\theta(z_{1:T} \mid c)\) via \(\tau\)-step joint denoising
full bidirectional attention
all frames denoised in parallel
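
A minimal sketch of the diffusion panel, under loud assumptions: `toy_denoise_step` is a hypothetical stand-in for a learned denoiser (it simply moves every frame halfway toward a target array passed as `c`), and the step schedule is a plain countdown rather than any real noise schedule. What it illustrates is the control flow: every pass updates all \(T\) frames at once, and the number of passes is set by the denoising step count, not by \(T\).

```python
import numpy as np

def full_attention_mask(T):
    """Bidirectional mask: every frame may attend to every other frame."""
    return np.ones((T, T), dtype=bool)

def sample_joint_denoising(denoise_step, T, dim, num_steps, c, rng):
    """Start all T frame latents from noise and refine them together for num_steps passes."""
    z = rng.standard_normal((T, dim))
    for step in reversed(range(num_steps)):
        z = denoise_step(z, step, c)  # one pass touches every frame
    return z

def toy_denoise_step(z, step, c):
    """Hypothetical denoiser: shrink the gap between z and the target c by half."""
    return 0.5 * z + 0.5 * c

rng = np.random.default_rng(0)
target = np.zeros((6, 4))  # toy conditioning: the clean latents themselves
z = sample_joint_denoising(toy_denoise_step, T=6, dim=4, num_steps=30, c=target, rng=rng)
print(z.shape, float(np.abs(z - target).max()))
```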
| | Autoregressive | Diffusion |
|---|---|---|
| factorization | causal: \(p(z_t \mid z_{<t}, c)\) | joint: denoise all \(z_{1:T}\) together |
| attention | causal mask (lower triangular) | full bidirectional |
| tokens | discrete (VQ-VAE) | continuous (VAE) |
| passes per sample | \(T\) (one per frame) | \(\tau_{\max}\) denoising steps |
| extending length | just sample more frames | re-condition on tail, denoise next chunk |
| examples | VideoGPT, MAGVIT-v2, VideoPoet | Sora, SVD, Cosmos-Predict, Wan |
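
The "extending length" row for diffusion can be sketched as a chunked loop. This is an illustration under assumptions, not how any of the listed models implements it: `extend_by_chunks` and `toy_denoise_chunk` are hypothetical names, and the toy denoiser just repeats the mean of the tail frames instead of running a real conditional denoising pass.

```python
import numpy as np

def extend_by_chunks(denoise_chunk, video, chunk_len, tail_len, num_chunks, rng):
    """Grow `video` (frames x dim) chunk by chunk, conditioning each chunk on the last frames."""
    for _ in range(num_chunks):
        tail = video[-tail_len:]                                   # re-condition on the tail
        noise = rng.standard_normal((chunk_len, video.shape[1]))   # fresh noise for the new chunk
        new_chunk = denoise_chunk(noise, tail)                     # stands in for a full denoising run
        video = np.concatenate([video, new_chunk], axis=0)
    return video

def toy_denoise_chunk(noise, tail):
    """Hypothetical denoiser: ignore the noise and repeat the tail's mean frame."""
    return np.tile(tail.mean(axis=0), (noise.shape[0], 1))

rng = np.random.default_rng(0)
start = rng.standard_normal((8, 4))
longer = extend_by_chunks(toy_denoise_chunk, start, chunk_len=4, tail_len=2, num_chunks=3, rng=rng)
print(start.shape, "->", longer.shape)
```

In a real model each `denoise_chunk` call would itself cost \(\tau_{\max}\) passes, so extending by several chunks multiplies the denoising cost by the number of chunks, whereas the autoregressive column pays one extra pass per added frame.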