Cosmos Policy: 11 latent slots, three masks, one shared backbone

pad: VAE compression placeholder

\(s_{\mathrm{prop}}\): Current proprioception (joint angles)

\(s_{\mathrm{wrst}}\): Current wrist camera latent

\(s_{\mathrm{3rd},1}\): Current 1st third-person camera

\(s_{\mathrm{3rd},2}\): Current 2nd third-person camera

\(a\): Action chunk (\(H\) steps tiled into latent volume)

\(s'_{\mathrm{prop}}\): Future proprioception

\(s'_{\mathrm{wrst}}\): Future wrist camera

\(s'_{\mathrm{3rd},1}\): Future 1st third-person camera

\(s'_{\mathrm{3rd},2}\): Future 2nd third-person camera

\(V(s')\): Predicted value of future state

current state \(s\)

\(a\)

future state \(s'\)

\(V\)
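The slot layout and the clean/noised split can be sketched as three boolean masks over the 11 slots. This is a minimal illustration, not the model's actual code. The slot and group names follow the figure. The choice of which groups each objective conditions on is an assumption, since the figure does not spell it out: here the policy conditions on \(s\), the world model on \(s\) and \(a\), and the value objective on \(s\), \(a\), and \(s'\). Treating the pad slot as always clean is also an assumption. `SLOTS`, `GROUPS`, `CONDITIONED_GROUPS`, and `build_mask` are hypothetical names.

```python
# Slot names follow the figure; everything else is an illustrative assumption.
SLOTS = [
    "pad",
    "s_prop", "s_wrst", "s_3rd_1", "s_3rd_2",
    "a",
    "s'_prop", "s'_wrst", "s'_3rd_1", "s'_3rd_2",
    "V(s')",
]

# Groups as labelled under the slots in the figure (pad kept as its own group).
GROUPS = {
    "pad": SLOTS[0:1],
    "s": SLOTS[1:5],
    "a": SLOTS[5:6],
    "s'": SLOTS[6:10],
    "V": SLOTS[10:11],
}

# ASSUMPTION: the groups each objective keeps clean (conditions on).
# The figure only says there are three masks, not what they contain.
CONDITIONED_GROUPS = {
    "policy": ["pad", "s"],
    "world_model": ["pad", "s", "a"],
    "value": ["pad", "s", "a", "s'"],
}


def build_mask(objective):
    """Return one bool per slot: True = clean (conditioned), False = noised (target)."""
    clean = {slot for g in CONDITIONED_GROUPS[objective] for slot in GROUPS[g]}
    return [slot in clean for slot in SLOTS]


MASKS = {name: build_mask(name) for name in CONDITIONED_GROUPS}

if __name__ == "__main__":
    # Print one row per objective: C = clean, N = noised.
    for name, mask in MASKS.items():
        row = " ".join("C" if m else "N" for m in mask)
        print(f"{name:12s} {row}")
```

Under these assumptions, each mask is a prefix of the same sequence, so switching objectives only changes which slots receive noise; the sequence itself is unchanged.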

↓

Cosmos-Predict2 Diffusion Transformer

single \(\theta\), no per-modality head

\[
\theta \;\leftarrow\; \theta \,-\, \eta \cdot \big( \nabla_\theta\, \mathcal{L}_{\mathrm{pol}} \,+\, \nabla_\theta\, \mathcal{L}_{\mathrm{wm}} \,+\, \nabla_\theta\, \mathcal{L}_{\mathrm{val}} \big)
\]

three losses, three gradient vectors, one shared parameter set \(\theta\)
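As a toy illustration of this update, the sketch below sums three gradient vectors and applies them to a single parameter list in one step. The quadratic losses and their targets are stand-ins chosen only so the gradients are easy to check by hand; they are not the model's losses. `TARGETS`, `grad`, and `sgd_step` are hypothetical names.

```python
# Toy stand-ins for the three losses: L_k(theta) = 0.5 * ||theta - target_k||^2.
# ASSUMPTION: these targets are arbitrary and only make the arithmetic checkable.
TARGETS = {
    "pol": [1.0, 0.0],
    "wm": [0.0, 1.0],
    "val": [1.0, 1.0],
}


def grad(theta, target):
    """Gradient of 0.5 * ||theta - target||^2 with respect to theta."""
    return [t - c for t, c in zip(theta, target)]


def sgd_step(theta, eta):
    """One update: theta <- theta - eta * (g_pol + g_wm + g_val)."""
    grads = [grad(theta, TARGETS[k]) for k in ("pol", "wm", "val")]
    # Sum the three gradient vectors component-wise.
    total = [sum(parts) for parts in zip(*grads)]
    return [t - eta * g for t, g in zip(theta, total)]


if __name__ == "__main__":
    theta = [0.0, 0.0]
    for _ in range(3):
        theta = sgd_step(theta, eta=0.1)
        print(theta)
```

Because all three gradients act on the same \(\theta\), each step moves the parameters in a direction that trades off the three objectives rather than serving any one of them alone.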

Legend: clean (conditioned), noised (target), shared backbone