Inside a Shared-Attention Block

Per-modality projections (pink) feed one shared softmax (blue), then split back into per-modality MLPs. The original diagram reads bottom-to-top; the same data flow, as an ordered list:

1. Inputs: the video stream carries video tokens \(x^v\); the action stream carries action tokens \(x^a\).
2. Per-modality QKV projections: separate weights \(W_Q^v,\, W_K^v,\, W_V^v\) and \(W_Q^a,\, W_K^a,\, W_V^a\) map each stream to \(Q^v, K^v, V^v\) and \(Q^a, K^a, V^a\).
3. Concatenate \(Q\), \(K\), \(V\) along the sequence dimension.
4. Shared attention, one softmax over the combined Q / K / V:
   \(\text{Attn} \;=\; \text{softmax}\!\big([Q^v;\,Q^a]\,[K^v;\,K^a]^\top / \sqrt{d}\big)\;[V^v;\,V^a]\)
   This single shared op is the only place where the two modalities mix.
5. Split the output back into per-modality streams.
6. Per-modality output projections \(W_O^v\) and \(W_O^a\).
7. Per-modality FFN + LayerNorm: \(\text{FFN}^v + \text{LayerNorm}^v\) and \(\text{FFN}^a + \text{LayerNorm}^a\).
8. Outputs \(y^v\) and \(y^a\) feed the next block.

Legend: pink = per-modality (different weights for video / action); blue = shared (one operation, all tokens together).
The "shared" in shared attention is just the softmax. Every other set of parameters in the block (\(W_Q\), \(W_K\), \(W_V\), \(W_O\), the FFN, the LayerNorm) is duplicated per modality. The only operation both modalities pass through is the scaled dot-product softmax in the middle, which sees the concatenated Q / K / V from both streams. That alone is enough for an action token's query to attend to a video token's keys and values (and vice versa), without forcing the streams to share any other weights.
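As a concreteness check, here is a minimal NumPy sketch of the block under simplifying assumptions (single head, no residual connections, unlearned LayerNorm, a one-hidden-layer ReLU FFN); names like `shared_attention_block` and the parameter-dict layout are illustrative, not from any particular codebase:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # LayerNorm without learned scale/shift, for brevity
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def shared_attention_block(x_v, x_a, p_v, p_a):
    """x_v: video tokens, x_a: action tokens; p_v / p_a hold per-modality weights."""
    d = p_v["Wq"].shape[1]
    # per-modality QKV projections (different weights for video / action)
    Qv, Kv, Vv = x_v @ p_v["Wq"], x_v @ p_v["Wk"], x_v @ p_v["Wv"]
    Qa, Ka, Va = x_a @ p_a["Wq"], x_a @ p_a["Wk"], x_a @ p_a["Wv"]
    # concatenate along the sequence dimension
    Q = np.concatenate([Qv, Qa])
    K = np.concatenate([Kv, Ka])
    V = np.concatenate([Vv, Va])
    # the single shared op: one softmax over the combined Q/K/V
    attn = softmax(Q @ K.T / np.sqrt(d)) @ V
    # split the output back into per-modality streams
    out_v, out_a = attn[:len(x_v)], attn[len(x_v):]
    # per-modality output projection, then FFN + LayerNorm
    y_v = layer_norm(np.maximum(out_v @ p_v["Wo"] @ p_v["W1"], 0.0) @ p_v["W2"])
    y_a = layer_norm(np.maximum(out_a @ p_a["Wo"] @ p_a["W1"], 0.0) @ p_a["W2"])
    return y_v, y_a

def init_params(rng, d):
    return {k: rng.normal(size=(d, d)) / np.sqrt(d)
            for k in ("Wq", "Wk", "Wv", "Wo", "W1", "W2")}

rng = np.random.default_rng(0)
d = 8
x_v = rng.normal(size=(5, d))   # 5 video tokens
x_a = rng.normal(size=(3, d))   # 3 action tokens
p_v, p_a = init_params(rng, d), init_params(rng, d)
y_v, y_a = shared_attention_block(x_v, x_a, p_v, p_a)
```

A quick way to see the cross-modal mixing: perturb one video token and the action outputs change too, because the shared softmax lets every action query attend to every video key/value.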