pad: VAE compression placeholder
\(s_{\mathrm{prop}}\): Current proprioception (joint angles)
\(s_{\mathrm{wrst}}\): Current wrist camera latent
\(s_{\mathrm{3rd},1}\): Current 1st third-person camera
\(s_{\mathrm{3rd},2}\): Current 2nd third-person camera
\(a\): Action chunk (H steps tiled into latent volume)
\(s'_{\mathrm{prop}}\): Future proprioception
\(s'_{\mathrm{wrst}}\): Future wrist camera
\(s'_{\mathrm{3rd},1}\): Future 1st third-person camera
\(s'_{\mathrm{3rd},2}\): Future 2nd third-person camera
\(V(s')\): Predicted value of future state
Cosmos-Predict2 Diffusion Transformer
single \(\theta\), no per-modality head
\(\theta \;\leftarrow\; \theta \,-\, \eta \cdot \big( \nabla_\theta\, \mathcal{L}_{\mathrm{pol}} \,+\, \nabla_\theta\, \mathcal{L}_{\mathrm{wm}} \,+\, \nabla_\theta\, \mathcal{L}_{\mathrm{val}} \big)\)
three losses, three gradient vectors, one shared parameter set \(\theta\)
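
The update rule above can be illustrated with a toy sketch in plain Python. The quadratic losses, their targets, and the names `quad_grad` and `joint_step` are hypothetical stand-ins chosen for illustration; in the actual model the three gradients would come from backpropagating the policy, world-model, and value losses through the shared transformer. The sketch only shows the structure of the step: three gradient vectors are summed and applied to a single parameter set.

```python
# Toy sketch: three losses, one shared parameter vector theta.
# The quadratic losses below are hypothetical stand-ins for
# L_pol, L_wm, and L_val; they are not the model's real losses.

def quad_grad(theta, target):
    """Gradient of 0.5 * ||theta - target||^2 with respect to theta."""
    return [t - g for t, g in zip(theta, target)]

def joint_step(theta, targets, lr):
    """One update: theta <- theta - lr * (g_pol + g_wm + g_val)."""
    grads = [quad_grad(theta, targets[name]) for name in ("pol", "wm", "val")]
    # Element-wise sum of the three gradient vectors.
    total = [sum(parts) for parts in zip(*grads)]
    return [t - lr * g for t, g in zip(theta, total)]

theta = [0.0, 0.0]
targets = {"pol": [1.0, 0.0], "wm": [0.0, 1.0], "val": [1.0, 1.0]}
new_theta = joint_step(theta, targets, lr=0.1)
print(new_theta)
```

Because every gradient is taken with respect to the same \(\theta\), summing the gradients is equivalent to taking one gradient of the summed loss. No parameters are reserved for a single objective.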