Cosmos Policy: 11 latent slots, three masks, one shared backbone

Latent slots, in order:
1. pad: VAE compression placeholder
2. \(s_{\mathrm{prop}}\): current proprioception (joint angles)
3. \(s_{\mathrm{wrst}}\): current wrist camera latent
4. \(s_{\mathrm{3rd},1}\): current 1st third-person camera
5. \(s_{\mathrm{3rd},2}\): current 2nd third-person camera
6. \(a\): action chunk (H steps tiled into the latent volume)
7. \(s'_{\mathrm{prop}}\): future proprioception
8. \(s'_{\mathrm{wrst}}\): future wrist camera
9. \(s'_{\mathrm{3rd},1}\): future 1st third-person camera
10. \(s'_{\mathrm{3rd},2}\): future 2nd third-person camera
11. \(V(s')\): predicted value of the future state

Slot groups: current state \(s\) (slots 2 to 5), \(a\) (slot 6), future state \(s'\) (slots 7 to 10), \(V\) (slot 11).
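The title mentions three masks over the same 11 slots. The sketch below shows one way such masks could be written down. The slot order and groups come from the diagram. The choice of which groups are clean for each objective is an assumption made for illustration (policy conditions on \(s\); world model on \(s, a\); value on \(s, a, s'\)), as is treating the pad slot as always clean. The names `SLOTS`, `GROUPS`, `CLEAN_GROUPS`, and `clean_mask` are hypothetical.

```python
from typing import Dict, List

# Slot order as in the diagram; index 0 here is slot 1 (pad) in the figure.
SLOTS: List[str] = [
    "pad",
    "s_prop", "s_wrst", "s_3rd_1", "s_3rd_2",
    "a",
    "s_next_prop", "s_next_wrst", "s_next_3rd_1", "s_next_3rd_2",
    "V",
]

# Zero-based indices of each slot group from the diagram.
GROUPS: Dict[str, List[int]] = {
    "s": [1, 2, 3, 4],
    "a": [5],
    "s_next": [6, 7, 8, 9],
    "V": [10],
}

# ASSUMPTION (not stated in the diagram text): which groups stay clean
# (conditioning) for each objective; every other group is noised (target).
CLEAN_GROUPS: Dict[str, List[str]] = {
    "policy": ["s"],
    "world_model": ["s", "a"],
    "value": ["s", "a", "s_next"],
}


def clean_mask(objective: str) -> List[bool]:
    """Return one bool per slot: True = clean (conditioned), False = noised (target)."""
    mask = [False] * len(SLOTS)
    mask[0] = True  # ASSUMPTION: the pad slot is never a denoising target
    for group in CLEAN_GROUPS[objective]:
        for idx in GROUPS[group]:
            mask[idx] = True
    return mask


if __name__ == "__main__":
    # Print each mask as a row of C (clean) / N (noised) flags, one per slot.
    for name in CLEAN_GROUPS:
        row = "".join("C" if c else "N" for c in clean_mask(name))
        print(f"{name:12s} {row}")
```

Under these assumptions the three objectives differ only in where the clean/noised boundary falls along the slot sequence, which is what lets a single backbone serve all three.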
Cosmos-Predict2 Diffusion Transformer
single \(\theta\), no per-modality head
\(\theta \;\leftarrow\; \theta \,-\, \eta \cdot \big( \nabla_\theta\, \mathcal{L}_{\mathrm{pol}} + \nabla_\theta\, \mathcal{L}_{\mathrm{wm}} + \nabla_\theta\, \mathcal{L}_{\mathrm{val}} \big)\)
three losses, three gradient vectors, one shared parameter set \(\theta\)
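To make the update rule concrete, here is a toy sketch of a single step in which three gradients are summed into one shared parameter vector. The quadratic losses are placeholders standing in for \(\mathcal{L}_{\mathrm{pol}}\), \(\mathcal{L}_{\mathrm{wm}}\), and \(\mathcal{L}_{\mathrm{val}}\); they are not the actual diffusion losses, and the names `quadratic_grad` and `shared_sgd_step` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def quadratic_grad(theta: Sequence[float], target: Sequence[float]) -> Vector:
    """Gradient of the toy loss 0.5 * ||theta - target||^2 with respect to theta."""
    return [t - g for t, g in zip(theta, target)]


def shared_sgd_step(theta: Sequence[float],
                    targets: Sequence[Sequence[float]],
                    eta: float) -> Vector:
    """theta <- theta - eta * (sum of per-loss gradients), one shared theta."""
    total = [0.0] * len(theta)
    for target in targets:  # one toy loss per objective
        grad = quadratic_grad(theta, target)
        total = [acc + g for acc, g in zip(total, grad)]
    return [t - eta * g for t, g in zip(theta, total)]


if __name__ == "__main__":
    theta0 = [0.0, 0.0]
    # Placeholder targets standing in for the policy, world-model, and value losses.
    toy_targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(shared_sgd_step(theta0, toy_targets, eta=0.1))
```

Because the three gradients are added before the step, no objective owns a separate set of weights; each one only contributes a direction to the same \(\theta\).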
Legend: clean (conditioned), noised (target), shared backbone.