| | Autoregressive | Diffusion |
|---|---|---|
| factorization | causal: \(p(z_t \mid z_{<t})\) | joint: denoise all \(z_{1:T}\) together |
| attention | causal mask (lower triangular) | full bidirectional |
| tokens | discrete (VQ-VAE) | continuous (VAE) |
| passes per sample | \(T\) (one per frame) | \(\tau_{\max}\) denoising steps |
| extending length | just sample more frames | re-condition on tail, denoise next chunk |
| examples | VideoGPT, MAGVIT-v2, VideoPoet | Sora, SVD, Cosmos-Predict, Wan |
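
The attention row can be made concrete with a small sketch. The helper names `attention_mask` and `masked_softmax` and the toy score matrix below are illustrative assumptions, not taken from any of the listed models; the code only contrasts a lower-triangular (causal) mask with a full bidirectional one over \(T\) positions.

```python
import math

def attention_mask(T, causal):
    """Return a T x T boolean mask; True where query i may attend to key j.

    causal=True  -> lower triangular (autoregressive column of the table)
    causal=False -> all True (bidirectional, diffusion column of the table)
    """
    return [[(j <= i) or not causal for j in range(T)] for i in range(T)]

def masked_softmax(row_scores, row_mask):
    """Softmax over one row of scores, giving masked-out keys zero weight."""
    kept = [s for s, m in zip(row_scores, row_mask) if m]
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0
            for s, m in zip(row_scores, row_mask)]
    total = sum(exps)
    return [e / total for e in exps]

T = 4
# Toy attention scores (hypothetical values, for illustration only).
scores = [[0.1 * (i + j) for j in range(T)] for i in range(T)]

causal_mask = attention_mask(T, causal=True)
full_mask = attention_mask(T, causal=False)

# Autoregressive: position i places weight only on positions j <= i.
ar_weights = [masked_softmax(s, m) for s, m in zip(scores, causal_mask)]
# Diffusion: every position places weight on every other position.
diff_weights = [masked_softmax(s, m) for s, m in zip(scores, full_mask)]
```

Under the causal mask, every entry above the diagonal of `ar_weights` is exactly zero, which is what allows frame \(t\) to be sampled before later frames exist. Under the full mask every entry of `diff_weights` is positive, so all of \(z_{1:T}\) must be present (as noise or partially denoised latents) at each denoising step.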