Cosmos Policy: Unified Latent Encoding

heterogeneous inputs → identical latent-frame shape → one shared backbone

raw input → encoder → latent frame

RGB camera, \(T \times H \times W \times 3\) → Wan2.1 VAE encoder (8× spatial, 4× temporal compression to a 16-channel latent) → \(H' \times W' \times 16\)

Action chunk, \(K \times d_a\) → normalize + tile (map to \([-1, 1]\), then broadcast across all \(H' \times W' \times 16\) positions) → \(H' \times W' \times 16\)

Proprioception, \(D\)-dim vector → normalize + tile (no learned projection, pure broadcast) → \(H' \times W' \times 16\)

Value \(V(s)\), scalar → normalize + tile (single number filled into the entire latent volume) → \(H' \times W' \times 16\)
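The normalize + tile step for the non-image inputs can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name `encode_to_latent_frame`, the min/max bounds `lo` and `hi`, the row-major fill order, and the example sizes are all assumptions made for this sketch.

```python
import numpy as np

def encode_to_latent_frame(x, lo, hi, latent_shape):
    """Normalize x to [-1, 1] and repeat it until it fills one latent frame.

    lo / hi are scalar bounds or arrays with the same shape as x
    (for example, dataset statistics). The row-major fill order is an
    assumption of this sketch.
    """
    x = np.asarray(x, dtype=np.float32).ravel()  # K x d_a chunk, D-dim vector, or scalar
    lo = np.broadcast_to(np.asarray(lo, dtype=np.float32).ravel(), x.shape)
    hi = np.broadcast_to(np.asarray(hi, dtype=np.float32).ravel(), x.shape)
    x_norm = 2.0 * (x - lo) / (hi - lo) - 1.0    # map [lo, hi] -> [-1, 1]

    n = int(np.prod(latent_shape))               # H' * W' * 16 positions
    if x_norm.size > n:
        raise ValueError("input does not fit in one latent frame")
    reps = -(-n // x_norm.size)                  # ceil(n / size)
    # No learned projection: repeat the values, truncate the last partial copy.
    return np.tile(x_norm, reps)[:n].reshape(latent_shape)

# Hypothetical sizes: K = 16 steps, d_a = 7 action dims, 28 x 28 x 16 latent frame.
action_chunk = np.zeros((16, 7))
action_frame = encode_to_latent_frame(action_chunk, -1.0, 1.0, (28, 28, 16))
value_frame = encode_to_latent_frame(0.5, 0.0, 1.0, (28, 28, 16))
```

Because every input ends up with the same \(H' \times W' \times 16\) shape, the backbone needs no modality-specific input layers.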

↓

concatenate into the 11-slot sequence

[ pad, \(s_p\), \(s_w\), \(s_{3,1}\), \(s_{3,2}\), \(a\), \(s'_p\), \(s'_w\), \(s'_{3,1}\), \(s'_{3,2}\), \(V'\) ]
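Assembling the sequence amounts to stacking eleven equally shaped latent frames in a fixed order. The sketch below assumes hypothetical slot names, a hypothetical `build_sequence` helper, and an all-zeros pad frame; none of these details are confirmed by the diagram.

```python
import numpy as np

# Slot order taken from the diagram; the string names are illustrative only.
SLOTS = [
    "pad",
    "s_p", "s_w", "s_3_1", "s_3_2",                        # current state
    "a",                                                   # action chunk
    "s_p_next", "s_w_next", "s_3_1_next", "s_3_2_next",    # future state
    "V_next",                                              # future value
]

def build_sequence(frames, latent_shape):
    """Stack per-slot latent frames into one (11, H', W', 16) array.

    `frames` maps slot name -> array of shape `latent_shape`.
    The pad slot is filled with zeros here, which is an assumption.
    """
    seq = []
    for name in SLOTS:
        if name == "pad":
            frame = np.zeros(latent_shape, dtype=np.float32)
        else:
            frame = np.asarray(frames[name], dtype=np.float32)  # KeyError if missing
        if frame.shape != tuple(latent_shape):
            raise ValueError(f"slot {name!r} has shape {frame.shape}")
        seq.append(frame)
    return np.stack(seq, axis=0)
```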

↓

Cosmos-Predict2 Diffusion Transformer

cross-attn ← T5-XXL text \(\ell\)

AdaLN ← noise level \(\sigma\)

↓

decode each slot back to its original space

video slots (\(s_w, s_{3,\cdot}, s'_w, s'_{3,\cdot}\)) → Wan2.1 VAE decoder → RGB pixels

low-dimensional slots (\(s_p, a, s'_p, V'\)) → average the tiled copies across the \(H' \times W' \times 16\) volume → un-normalize → original vector or scalar
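Decoding a non-image slot inverts the normalize + tile encoding: average the repeated copies, then undo the \([-1, 1]\) scaling. The following is a sketch under assumptions, with a hypothetical `decode_from_latent_frame` helper, min/max bounds `lo` and `hi`, and a row-major tiling layout.

```python
import numpy as np

def decode_from_latent_frame(frame, dim, lo, hi):
    """Recover a dim-sized vector (dim=1 for a scalar) from a tiled latent frame.

    Assumes the frame was filled by repeating the normalized vector in
    row-major order, with a possibly truncated final copy.
    """
    flat = np.asarray(frame, dtype=np.float32).ravel()
    n_full = flat.size // dim                        # number of complete copies
    copies = flat[: n_full * dim].reshape(n_full, dim)
    x_norm = copies.mean(axis=0)                     # averaging suppresses per-position noise
    lo = np.asarray(lo, dtype=np.float32)
    hi = np.asarray(hi, dtype=np.float32)
    return (x_norm + 1.0) / 2.0 * (hi - lo) + lo     # map [-1, 1] -> [lo, hi]

# A clean tiled frame for the normalized vector [-1, 0, 1], plus small noise
# standing in for imperfect model output.
clean = np.tile(np.array([-1.0, 0.0, 1.0], dtype=np.float32), 86)[:256].reshape(4, 4, 16)
rng = np.random.default_rng(0)
noisy = clean + 0.01 * rng.standard_normal(clean.shape).astype(np.float32)
recovered = decode_from_latent_frame(noisy, dim=3, lo=0.0, hi=10.0)
```

Averaging over many copies is what makes the tiling redundancy useful: small errors at individual latent positions largely cancel out.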