Sparse Gating: $G(x) = \mathrm{Softmax}\!\big(\mathrm{TopK}(H(x),\,k)\big)$

How noisy top-k gating routes a token to experts. Grayed experts receive zero weight.

Left: raw logits $H(x) = x \cdot W_g + \text{noise}$. Right: gating weights after top-k selection and softmax. Only the top-k experts are active (colored); all others receive exactly zero weight.
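The two panels can be reproduced with a few lines of code. The following is a minimal sketch for a single token, not a reference implementation. The caption only says "+ noise", so the noise model here is an assumption: standard normal noise scaled per expert by a softplus of a second projection, a common choice for noisy top-k gating. The names `noisy_top_k_gating`, `clean_logits`, and `noise_logits` are hypothetical.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gating(clean_logits, noise_logits, k, rng=None):
    """Return one gating weight per expert for a single token.

    clean_logits: the values x . W_g, one per expert.
    noise_logits: assumed second projection that sets the per-expert
                  noise scale (not specified in the caption).
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable

    # Left panel: H(x)_i = clean_i + N(0, 1) * softplus(noise_i)
    h = [c + rng.gauss(0.0, 1.0) * softplus(n)
         for c, n in zip(clean_logits, noise_logits)]

    # TopK: keep the k largest logits, replace the rest with -inf.
    keep = set(sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k])
    masked = [h[i] if i in keep else -math.inf for i in range(len(h))]

    # Right panel: softmax over the masked logits. exp(-inf) is 0.0,
    # so dropped experts get exactly zero weight.
    m = max(masked)
    exps = [math.exp(v - m) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Four experts, k = 2. Very negative noise logits make the noise negligible,
# so experts 0 and 2 (the two largest clean logits) are selected.
weights = noisy_top_k_gating([2.0, 0.5, 1.5, -1.0], [-20.0] * 4, k=2)
print(weights)
```

Masking with $-\infty$ before the softmax, rather than zeroing weights afterwards, means the surviving weights already sum to 1 without a separate renormalization step.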