KV Cache Flow: Cosmos Policy vs Fast-WAM

How the K/V for clean conditioning is computed once and reused: the same idea in two architectures

Cosmos Policy

cache inside one backbone: clean slots reused across 5 denoising steps
[Diagram: a single unified DiT (~2B) that reads K/V from its own K/V cache; the cache starts empty]

Fast-WAM

cache across two backbones: video runs once, action iterates 10 steps
[Diagram: video branch (Video DiT, 5B) and action branch (Action DiT, 1B) connected through a K/V cache; the cache starts empty]
Legend: clean conditioning; noised (initial); recompute K/V (every step); cached K/V (read, not recomputed)
The flow at a glance. In Cosmos Policy, the cache lives inside one DiT: across the 5 denoising steps, K/V for the 4 clean state slots is computed once and read at every step, while K/V for the action, future, and value slots is recomputed at every step because their noise level changes. In Fast-WAM, the cache spans two DiTs: the 5B video DiT runs once on the clean first observation and its K/V is stored; the 1B action expert then reads that same K/V at each of its 10 denoising steps. The video branch does not run again until the next action chunk.
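The two cache patterns described above can be sketched with a toy single-layer attention in plain Python. This is only an illustration under loud assumptions: `ToyAttention`, `denoise`, the hidden size `D`, the additive update rule, and the 3 noised slots are all hypothetical and are not taken from either model's code. The sketch also assumes the K/V of the clean tokens depends only on those tokens, which holds trivially for one layer. It shows that reusing the clean K/V gives the same result as recomputing it, with one clean pass instead of one per step.

```python
import math
import random

random.seed(0)
D = 4  # toy hidden size (assumption, not a real model dimension)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def project(tokens, weight):
    """Apply a linear map to every token vector."""
    return [[sum(t[i] * weight[i][j] for i in range(len(weight)))
             for j in range(len(weight[0]))] for t in tokens]

def attend(queries, keys, values):
    """Single-head softmax attention over the given keys and values."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(len(values[0]))])
    return out

class ToyAttention:
    """Hypothetical stand-in for one attention layer of a DiT."""
    def __init__(self):
        self.wq, self.wk, self.wv = (rand_matrix(D, D) for _ in range(3))

    def q(self, tokens):
        return project(tokens, self.wq)

    def kv(self, tokens):
        return project(tokens, self.wk), project(tokens, self.wv)

def denoise(kv_producer, reader, clean_tokens, noised_tokens, steps, use_cache=True):
    """Iterate `steps` toy denoising updates; count clean K/V computations."""
    cache, clean_passes = None, 0
    for _ in range(steps):
        if cache is None or not use_cache:
            cache = kv_producer.kv(clean_tokens)  # clean K/V: once if cached
            clean_passes += 1
        clean_k, clean_v = cache
        noised_k, noised_v = reader.kv(noised_tokens)  # recomputed every step
        update = attend(reader.q(noised_tokens),
                        clean_k + noised_k, clean_v + noised_v)
        # Toy update rule standing in for a real denoising step.
        noised_tokens = [[x + 0.1 * u for x, u in zip(tok, upd)]
                         for tok, upd in zip(noised_tokens, update)]
    return noised_tokens, clean_passes

clean = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(4)]  # 4 clean slots
noised = [[random.gauss(0, 1) for _ in range(D)] for _ in range(3)]    # noised slots

# Cosmos-style: one backbone both produces and reads the cache, 5 steps.
unified = ToyAttention()
out_cached, passes_cached = denoise(unified, unified, clean, noised, steps=5)
out_uncached, passes_uncached = denoise(unified, unified, clean, noised,
                                        steps=5, use_cache=False)

# Fast-WAM-style: the video layer writes K/V once, the action layer reads it 10 times.
video, action = ToyAttention(), ToyAttention()
actions, passes_fastwam = denoise(video, action, clean, noised, steps=10)

print(out_cached == out_uncached)                      # True
print(passes_cached, passes_uncached, passes_fastwam)  # 1 5 1
```

The only difference between the two calls is which layer produces the clean K/V: the same layer that reads it (one backbone) or a separate one (two backbones). In a real multi-layer model the reuse is only exact if the clean tokens' activations do not depend on the noised tokens, which this toy does not model.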