3-Layer MoE Transformer: Token Generation

x → Router selects top-k experts → x sent only to selected experts → weighted sum → y

Green = selected expert (receives x, computes E(x)). Gray = not selected (never evaluated, zero FLOPs). Weight % = that expert's gating coefficient from G(x), used as its weight in the sum.
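
The routing flow above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation behind the figure: the function name `top_k_route`, the linear router, and the choice to apply the softmax over the selected logits only (so the displayed weights sum to 100%) are all assumptions. Some MoE variants instead normalize over all experts before selecting the top k.

```python
import math

def top_k_route(x, router_weights, experts, k=2):
    """Route x to the top-k experts and return (y, gates).

    x              : list[float], token representation
    router_weights : list[list[float]], one row of router weights per expert
                     (hypothetical linear router, assumed for illustration)
    experts        : list of callables, each mapping list[float] -> list[float]
                     of the same length as x
    """
    # Router logits: one score per expert (dot product of x with each row).
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]

    # Indices of the k highest-scoring experts (the "green" ones).
    selected = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Assumption: softmax over the selected logits only, so gates sum to 1.
    m = max(logits[i] for i in selected)
    exp_scores = {i: math.exp(logits[i] - m) for i in selected}
    total = sum(exp_scores.values())
    gates = {i: exp_scores[i] / total for i in selected}

    # Only selected experts are evaluated; unselected ("gray") experts
    # are never called, so they contribute no compute.
    y = [0.0] * len(x)
    for i in selected:
        out = experts[i](x)
        y = [y_j + gates[i] * o_j for y_j, o_j in zip(y, out)]
    return y, gates


# Usage: four toy experts that just scale their input.
experts = [lambda v, s=s: [s * v_j for v_j in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
y, gates = top_k_route([1.0, 2.0], router_weights, experts, k=2)
print(sorted(gates))  # indices of the selected experts
```

Here the router logits are 1, 2, 3 and -3, so experts 2 and 1 are selected, and `gates` holds the two coefficients that the figure displays as percentages.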