Sparse Gating: $G(x) = \mathrm{Softmax}\!\big(\mathrm{TopK}(H(x),\,k)\big)$

How noisy top-k gating routes a token to experts. Grayed experts receive zero weight.
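
The $\mathrm{TopK}$ operator is not defined above. One common definition, assumed here from the standard noisy top-$k$ gating formulation, keeps the $k$ largest entries and sends the rest to $-\infty$, so the softmax gives them exactly zero weight:

$$
\mathrm{TopK}(v, k)_i =
\begin{cases}
v_i & \text{if } v_i \text{ is among the } k \text{ largest entries of } v, \\
-\infty & \text{otherwise.}
\end{cases}
$$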

Controls: experts $n$; top-$k$ = 2; noise $\sigma$ = 0.50

Left: noisy logits $H(x) = x \cdot W_g + \text{noise}$. Right: gating weights after top-$k$ selection and softmax. Only the top-$k$ experts are active (colored).
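
The two panels can be reproduced with a few lines of NumPy. This is a minimal sketch, not the code behind the figure. It assumes the noise is Gaussian with a fixed scale $\sigma$ (some formulations learn a per-expert noise scale instead). The function name `noisy_top_k_gating`, the sizes `n_experts = 8` and `d_model = 16`, and the random `x` and `W_g` are hypothetical and chosen only for illustration.

```python
import numpy as np

def noisy_top_k_gating(x, W_g, k=2, sigma=0.5, rng=None):
    """Return (noisy_logits, gates) for a single token vector x.

    Assumption: fixed-scale Gaussian noise, H(x) = x @ W_g + sigma * N(0, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    clean_logits = x @ W_g                              # shape: (n_experts,)
    h = clean_logits + sigma * rng.standard_normal(clean_logits.shape)

    # TopK: keep the k largest logits, set every other entry to -inf.
    top_idx = np.argpartition(h, -k)[-k:]
    masked = np.full_like(h, -np.inf)
    masked[top_idx] = h[top_idx]

    # Numerically stable softmax; exp(-inf) evaluates to exactly 0.
    e = np.exp(masked - masked[top_idx].max())
    gates = e / e.sum()
    return h, gates

# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
n_experts, d_model = 8, 16
x = rng.standard_normal(d_model)
W_g = rng.standard_normal((d_model, n_experts))

h, gates = noisy_top_k_gating(x, W_g, k=2, sigma=0.5, rng=rng)
print("noisy logits (left panel): ", np.round(h, 3))
print("gating weights (right panel):", np.round(gates, 3))
```

Here `h` corresponds to the left panel and `gates` to the right one: the `k` selected experts share all of the weight, and every other expert gets exactly zero.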