Cosmos Policy: Unified Latent Encoding
heterogeneous inputs → identical latent-frame shape → one shared backbone
→ Wan2.1 VAE encoder (8× spatial, 4× temporal compression to a 16-channel latent) → \(H' \times W' \times 16\)
→ normalize + tile (map to \([-1, 1]\), broadcast across all \(H' \times W' \times 16\) positions) → \(H' \times W' \times 16\)
→ normalize + tile (no learned projection, pure broadcast) → \(H' \times W' \times 16\)
→ normalize + tile (single number filled into the entire latent volume) → \(H' \times W' \times 16\)
↓
concatenate into the 11-slot sequence
pad | \(s_p\) | \(s_w\) | \(s_{3,1}\) | \(s_{3,2}\) | \(a\) | \(s'_p\) | \(s'_w\) | \(s'_{3,1}\) | \(s'_{3,2}\) | \(V'\)
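Because every slot has the same latent-frame shape, assembling the 11-slot sequence can be pictured as stacking the frames along a new leading axis. The sketch below is only an illustration of that idea: the ASCII slot names, the zero-filled placeholder frames, the toy \(4 \times 4\) spatial size, and the use of `np.stack` are assumptions made for this example, not details taken from the Cosmos Policy implementation.

```python
import numpy as np

# Slot order copied from the sequence above; the ASCII names are
# labels chosen for this sketch (e.g. "s_p_next" stands for s'_p).
SLOT_ORDER = [
    "pad", "s_p", "s_w", "s_3_1", "s_3_2", "a",
    "s_p_next", "s_w_next", "s_3_1_next", "s_3_2_next", "V_next",
]


def assemble_sequence(frames):
    """Stack per-slot latent frames into one (11, H', W', 16) array."""
    missing = [name for name in SLOT_ORDER if name not in frames]
    if missing:
        raise KeyError(f"missing slots: {missing}")
    shapes = {frames[name].shape for name in SLOT_ORDER}
    if len(shapes) != 1:
        # The shared backbone relies on every slot having the same shape.
        raise ValueError(f"slots disagree on shape: {shapes}")
    return np.stack([frames[name] for name in SLOT_ORDER], axis=0)


# Toy example: H' = W' = 4 (hypothetical), 16 latent channels.
latent_shape = (4, 4, 16)
frames = {name: np.zeros(latent_shape, dtype=np.float32) for name in SLOT_ORDER}
frames["a"] = np.full(latent_shape, 0.5, dtype=np.float32)  # a tiled scalar

sequence = assemble_sequence(frames)
print(sequence.shape)  # (11, 4, 4, 16)
```

The shape check is the point of the sketch: a slot that does not match the common latent-frame shape cannot enter the shared sequence.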
↓
Cosmos-Predict2 Diffusion Transformer
cross-attn ← T5-XXL text \(\ell\)
AdaLN ← noise level \(\sigma\)
↓
decode each slot back to its original space
video slots (\(s_w, s_{3,\cdot}, s'_w, s'_{3,\cdot}\)) → Wan2.1 VAE decoder → RGB pixels
scalar slots (\(s_p, a, s'_p, V'\)) → mean-pool over all \(H' \times W' \times 16\) positions → un-normalize → original scalar
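The scalar path (normalize, tile, then mean-pool and un-normalize) can be sketched as a round trip for the single-number case. This is a minimal illustration under stated assumptions: the value range `[lo, hi]`, the \(28 \times 28\) latent size, the helper names, and the Gaussian perturbation standing in for model error are all hypothetical and do not come from the source.

```python
import numpy as np


def normalize(x, lo, hi):
    """Map x from the assumed range [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0


def unnormalize(z, lo, hi):
    """Inverse of normalize: map z from [-1, 1] back to [lo, hi]."""
    return (z + 1.0) / 2.0 * (hi - lo) + lo


def encode_scalar(x, lo, hi, latent_shape):
    """Normalize + tile: fill the whole latent volume with one number."""
    return np.full(latent_shape, normalize(x, lo, hi), dtype=np.float32)


def decode_scalar(latent, lo, hi):
    """Mean-pool over all positions, then un-normalize."""
    return unnormalize(float(latent.mean()), lo, hi)


# Hypothetical latent size: H' = W' = 28 with 16 channels.
latent_shape = (28, 28, 16)
frame = encode_scalar(0.25, lo=0.0, hi=1.0, latent_shape=latent_shape)

# Exact round trip on a clean frame.
recovered = decode_scalar(frame, lo=0.0, hi=1.0)

# A perturbed frame, standing in for an imperfect model prediction.
rng = np.random.default_rng(0)
noisy = frame + rng.normal(0.0, 0.05, size=latent_shape)
recovered_noisy = decode_scalar(noisy, lo=0.0, hi=1.0)
```

Averaging over every position means independent per-position errors largely cancel, which is one plausible reason to tile a scalar across the full volume instead of storing it in a single position.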