Cosmos Policy: Unified Latent Encoding

heterogeneous inputs → identical latent-frame shape → one shared backbone
raw input | encoder | latent frame
RGB camera, \(T \times H \times W \times 3\) | Wan2.1 VAE encoder: 8× spatial and 4× temporal compression to a 16-channel latent | \(H' \times W' \times 16\)
Action chunk, \(K \times d_a\) | normalize + tile: map to \([-1, 1]\), broadcast across all \(H' \times W' \times 16\) positions | \(H' \times W' \times 16\)
Proprioception, \(D\)-dim vector | normalize + tile: no learned projection, pure broadcast | \(H' \times W' \times 16\)
Value \(V(s)\), scalar | normalize + tile: single number filled into the entire latent volume | \(H' \times W' \times 16\)
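The "normalize + tile" path can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `normalize` and `tile_to_latent`, the min/max bounds, and the latent shape `(28, 28, 16)` are hypothetical, and the cyclic repetition used to fill the volume with a multi-element vector is one plausible reading of "broadcast", not a confirmed detail of the model.

```python
import numpy as np

def normalize(x, lo, hi):
    """Affinely map values from [lo, hi] to [-1, 1] (bounds are assumed known)."""
    x = np.asarray(x, dtype=np.float32)
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def tile_to_latent(values, latent_shape):
    """Fill a latent frame by repeating the flattened values until it is full.

    Assumption: multi-element inputs are repeated cyclically; a scalar
    simply fills every position.
    """
    flat = np.ravel(values)
    n = int(np.prod(latent_shape))
    reps = -(-n // flat.size)  # ceiling division
    return np.tile(flat, reps)[:n].reshape(latent_shape)

latent_shape = (28, 28, 16)  # hypothetical H' x W' x 16

# Value: one number fills the entire latent volume.
value_latent = tile_to_latent(normalize(0.5, 0.0, 1.0), latent_shape)

# Action chunk: K = 16 steps of a d_a = 7 dimensional action (made-up data).
action_chunk = np.linspace(-0.3, 0.3, 16 * 7).reshape(16, 7)
action_latent = tile_to_latent(normalize(action_chunk, -0.3, 0.3), latent_shape)
```

Both results have the same shape as a VAE-encoded image latent, which is the property that lets a single backbone consume every modality.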
concatenate into the 11-slot sequence:
pad | \(s_p\) | \(s_w\) | \(s_{3,1}\) | \(s_{3,2}\) | \(a\) | \(s'_p\) | \(s'_w\) | \(s'_{3,1}\) | \(s'_{3,2}\) | \(V'\)
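Because every slot has the same latent-frame shape, building the sequence reduces to stacking frames in a fixed order. The sketch below assumes the slots are stacked along a leading axis and that the pad slot is a zero frame; the slot keys, `assemble_sequence`, and the small latent shape are hypothetical names chosen for illustration.

```python
import numpy as np

# Slot order as listed above; primes are spelled "_next" in the keys.
SLOT_ORDER = [
    "pad", "s_p", "s_w", "s_3_1", "s_3_2", "a",
    "s_p_next", "s_w_next", "s_3_1_next", "s_3_2_next", "V_next",
]

def assemble_sequence(frames, latent_shape):
    """Stack per-slot latent frames into one (11, H', W', 16) array."""
    sequence = []
    for name in SLOT_ORDER:
        if name == "pad":
            # Assumption: the pad slot carries no information.
            frame = np.zeros(latent_shape, dtype=np.float32)
        else:
            frame = frames[name]
        if frame.shape != tuple(latent_shape):
            raise ValueError(f"slot {name!r} has shape {frame.shape}")
        sequence.append(frame)
    return np.stack(sequence, axis=0)

latent_shape = (4, 4, 16)  # tiny hypothetical H' x W' x 16 for the demo
frames = {
    name: np.full(latent_shape, float(i), dtype=np.float32)
    for i, name in enumerate(SLOT_ORDER) if name != "pad"
}
sequence = assemble_sequence(frames, latent_shape)
```

The shape check is the useful part of the design: any encoder that fails to produce the shared latent-frame shape is caught before the sequence reaches the backbone.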
Cosmos-Predict2 Diffusion Transformer
cross-attn ← T5-XXL text \(\ell\)
AdaLN ← noise level \(\sigma\)
decode each slot back to its original space
video slots (\(s_w, s_{3,\cdot}, s'_w, s'_{3,\cdot}\)) → Wan2.1 VAE decoder → RGB pixels
tiled slots (\(s_p, a, s'_p, V'\)) → mean-pool over the tiled copies across all \(H' \times W' \times 16\) positions → un-normalize → original values
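Decoding a tiled slot can be sketched as the inverse of the encoding: average the copies, then undo the normalization. The code assumes the payload was repeated cyclically through the flattened volume and that the same min/max bounds used for encoding are available; `decode_slot` and `unnormalize` are hypothetical helper names. With `size=1` the averaging is a plain mean over every position.

```python
import numpy as np

def unnormalize(x, lo, hi):
    """Invert the affine map from [lo, hi] to [-1, 1]."""
    return (np.asarray(x, dtype=np.float32) + 1.0) / 2.0 * (hi - lo) + lo

def decode_slot(latent, size, lo, hi):
    """Average the tiled copies of a `size`-element payload, then un-normalize.

    Assumption: the flattened latent holds whole cyclic repeats of the payload.
    """
    flat = latent.reshape(-1)
    if flat.size % size != 0:
        raise ValueError("latent volume is not a whole number of copies")
    pooled = flat.reshape(-1, size).mean(axis=0)
    return unnormalize(pooled, lo, hi)

latent_shape = (4, 4, 16)  # tiny hypothetical H' x W' x 16 for the demo

# Scalar slot: every position holds the normalized value 0.2.
value_latent = np.full(latent_shape, 0.2, dtype=np.float32)
value = decode_slot(value_latent, size=1, lo=0.0, hi=1.0)

# Vector slot: a 4-element payload repeated 64 times through the volume.
payload = np.array([-1.0, 0.0, 1.0, 0.5], dtype=np.float32)
vector_latent = np.tile(payload, 64).reshape(latent_shape)
vector = decode_slot(vector_latent, size=4, lo=-2.0, hi=2.0)
```

Averaging over many copies also suppresses per-position prediction error, which is one reason to pool rather than read a single position.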