Mixture of Experts Explained
This post is a discussion of the Mixture of Experts Explained blog by HuggingFace (Sanseviero et al., 2023). MoEs are one of the most important architectural ideas in modern LLMs — they let you scale parameter count without proportionally scaling compute. Here I reorganize the key ideas, add some commentary, and provide interactive visualizations.
TL;DR
MoEs:
- Are pretrained much faster vs. dense models
- Have faster inference compared to a model with the same number of parameters
- Require high VRAM as all experts are loaded in memory
- Face many challenges in fine-tuning, but recent work with MoE instruction-tuning is promising
What is a Mixture of Experts?
The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps. MoE enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model.
In the context of transformer models, a MoE consists of two main elements:
- Sparse MoE layers replace dense feed-forward network (FFN) layers. Each MoE layer has a certain number of “experts” (e.g. 8), where each expert is an independent FFN.
- A gate network (router) determines which tokens are sent to which expert. The router is composed of learned parameters and is pretrained at the same time as the rest of the network.
So, to recap: in MoEs we replace FFN layers of the transformer (all of them, or only some, depending on the design) with MoE layers, each made up of a gate network plus a set of experts.
The key trade-offs:
- Training: MoEs enable significantly more compute-efficient pretraining, but they’ve historically struggled to generalize during fine-tuning, leading to overfitting.
- Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high.
For example, Mixtral 8x7B needs VRAM for a dense 47B parameter model (not 8 × 7B = 56B, because only FFN layers are separate experts; the rest is shared). But with top-2 routing, the inference FLOPs are like using a ~12B model.
The central insight of MoEs is that parameter count \(\neq\) compute cost.
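To make the Mixtral arithmetic concrete, here is a rough parameter count for a Mixtral-like model. The configuration values (hidden size 4096, FFN size 14336, 32 layers, 32 query heads and 8 key/value heads of dimension 128, vocabulary 32000, a gated FFN with three weight matrices, untied embeddings) are assumptions based on the publicly released Mixtral 8x7B configuration, not taken from the blog, and normalization weights are ignored. Treat it as a back-of-the-envelope sketch rather than an exact accounting.

```python
def moe_param_counts(hidden=4096, ffn=14336, layers=32, n_experts=8, top_k=2,
                     n_heads=32, n_kv_heads=8, head_dim=128, vocab=32000):
    """Return (total, active_per_token) parameter counts, ignoring norms and biases."""
    # Attention is shared by every token: Q and O projections plus smaller K/V projections.
    attn = 2 * hidden * n_heads * head_dim + 2 * hidden * n_kv_heads * head_dim
    # Each expert is a gated FFN with three weight matrices (assumed SwiGLU-style).
    expert = 3 * hidden * ffn
    router = hidden * n_experts
    # Input embedding plus output head (assumed untied).
    embeddings = 2 * vocab * hidden

    total = layers * (attn + n_experts * expert + router) + embeddings
    active = layers * (attn + top_k * expert + router) + embeddings
    return total, active


total, active = moe_param_counts()
print(f"total  ~ {total / 1e9:.1f}B")
print(f"active ~ {active / 1e9:.1f}B")
```

Under these assumptions the totals land close to the 47B and ~12B figures quoted above: memory scales with all eight experts, while per-token compute scales with only two.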
A Brief History
- 1991: Adaptive Mixture of Local Experts — the original idea. A supervised procedure for a system of separate networks, each handling a different subset of training cases. A gating network determines the weights.
- 2010–2015: Two directions: (a) Eigen, Ranzato, and Sutskever explored MoEs as components of deeper networks (MoE as a layer, not the whole model); (b) Bengio et al. explored conditional computation, dynamically activating or deactivating components based on input.
- 2017: Shazeer et al. (with Hinton and Jeff Dean) scaled the idea to a 137B LSTM by introducing sparsity, keeping fast inference at high scale.
- 2020+: GShard, Switch Transformers, GLaM, ST-MoE, and eventually Mixtral brought MoEs to the Transformer era.
MoEs have allowed training multi-trillion parameter models, such as the open-sourced 1.6T parameter Switch Transformers.
Routing and Load Balancing
Sparsity and Gating
Sparsity uses the idea of conditional computation: while in dense models all parameters are used for all inputs, sparsity allows us to only run some parts of the whole system.
A learned gating network \(G\) decides which experts \(E\) to activate for each input:
\[y = \sum_{i=1}^{n} G(x)_i \, E_i(x)\]
If \(G(x)_i = 0\), we skip expert \(i\) entirely, saving compute. The simplest gating function is a softmax over a linear projection:
\[G_\sigma(x) = \text{Softmax}(x \cdot W_g)\]
But in practice, Shazeer et al. used Noisy Top-K Gating:
Step 1: Add tunable noise:
\[H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{\text{noise}})_i)\]
Step 2: Keep only the top-\(k\) values:
\[\text{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}\]
Step 3: Apply softmax:
\[G(x) = \text{Softmax}(\text{KeepTopK}(H(x), k))\]
By using a low enough \(k\) (e.g. 1 or 2), we can train and run inference much faster than if many experts were activated. The initial conjecture was that routing to more than one expert was needed to have the gate learn how to route, so at least two experts had to be picked. Switch Transformers later revisited this decision.
The noise is crucial for load balancing — without it, the gating network converges to mostly activating the same few experts, a self-reinforcing collapse.
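The three gating steps above can be sketched in plain Python for a single token. The function and variable names (`noisy_top_k_gating`, `w_gate`, `w_noise`) and the random weights are illustrative assumptions; a real implementation would operate on batched tensors with learned weights.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def noisy_top_k_gating(x, w_gate, w_noise, k, rng):
    """x: list of d floats; w_gate and w_noise: d x n matrices as nested lists."""
    d, n = len(x), len(w_gate[0])
    clean = [sum(x[j] * w_gate[j][i] for j in range(d)) for i in range(n)]
    scale = [softplus(sum(x[j] * w_noise[j][i] for j in range(d))) for i in range(n)]
    # Step 1: H(x)_i = (x W_g)_i + StandardNormal() * Softplus((x W_noise)_i)
    h = [clean[i] + rng.gauss(0.0, 1.0) * scale[i] for i in range(n)]
    # Step 2: keep the top-k entries and set the rest to -inf
    top = set(sorted(range(n), key=lambda i: h[i], reverse=True)[:k])
    kept = [h[i] if i in top else -math.inf for i in range(n)]
    # Step 3: softmax, so masked experts receive exactly zero weight
    return softmax(kept)


rng = random.Random(0)
d, n = 4, 8
w_gate = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(d)]
w_noise = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(d)]
gates = noisy_top_k_gating([0.5, -1.0, 0.25, 2.0], w_gate, w_noise, k=2, rng=rng)
print(gates)
```

Only the two experts with nonzero gate values would need to be evaluated for this token, and their weights sum to 1.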
Load Balancing
If all tokens are sent to just a few popular experts, training becomes inefficient. The gating network naturally converges to mostly activate the same few experts — this self-reinforces as favored experts are trained quicker and hence selected more.
To mitigate this, an auxiliary loss is added that encourages the router to give all experts equal importance, so that each expert receives a roughly equal number of training examples.
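One common formulation of such an auxiliary loss is the one used in Switch Transformers: \(\alpha \cdot N \cdot \sum_i f_i P_i\), where \(f_i\) is the fraction of tokens dispatched to expert \(i\) and \(P_i\) is the mean router probability for expert \(i\). The sketch below assumes top-1 dispatch and an illustrative coefficient `alpha=0.01`; the text above does not specify either, so read it as an example rather than the exact loss any particular model uses.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs: one probability vector per token (each sums to 1)."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    # f_i: fraction of tokens whose top-1 choice is expert i
    f = [0.0] * n_experts
    for probs in router_probs:
        f[max(range(n_experts), key=lambda i: probs[i])] += 1.0 / n_tokens
    # P_i: mean router probability assigned to expert i
    p = [sum(probs[i] for probs in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4
print(load_balancing_loss(balanced), load_balancing_loss(collapsed))
```

With perfectly balanced routing the loss equals `alpha`; when every token prefers the same expert it grows, which is the pressure that pushes the router away from collapse.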
This is combined with the concept of expert capacity — a threshold of how many tokens can be processed by a single expert:
\[\text{Expert Capacity} = \left\lfloor\frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity factor}\right\rfloor\]
If an expert is at capacity, additional tokens overflow: they skip the MoE layer via residual connections (or are dropped entirely). The capacity factor (CF) creates a buffer: CF = 1.0 means perfectly even distribution with no slack; CF = 1.25 provides a 25% buffer.
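A small sketch of the capacity formula and of what overflow means in practice. The first-come-first-served `dispatch` policy and the function names are assumptions for illustration; real systems may prioritize tokens differently.

```python
import math


def expert_capacity(tokens_per_batch, n_experts, capacity_factor):
    return math.floor(tokens_per_batch / n_experts * capacity_factor)


def dispatch(assignments, n_experts, capacity):
    """assignments[t] is the expert chosen for token t; returns (kept, overflowed)."""
    load = [0] * n_experts
    kept, overflowed = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            # The token skips the expert; only the residual path carries it forward.
            overflowed.append(token)
    return kept, overflowed


assignments = [0, 0, 0, 1, 2, 3, 0, 1]  # 8 tokens, 4 experts, expert 0 is popular
cap = expert_capacity(len(assignments), n_experts=4, capacity_factor=1.0)
print(cap, dispatch(assignments, 4, cap))
```

Here the capacity is 2, so the third and fourth tokens routed to expert 0 overflow. Raising the capacity factor to 1.5 gives a capacity of 3 and rescues one of them.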
Scaling MoEs in Transformers
GShard
GShard (2020) explored scaling transformers beyond 600B parameters by replacing every other FFN layer with a MoE layer using top-2 gating. Two key innovations:
- Random routing: The top expert is always selected, but the second expert is picked with probability proportional to its weight — adding exploration.
- Expert capacity: All tensor shapes must be statically determined at compilation time, but we can’t know how many tokens will go to each expert ahead of time. So we fix a capacity factor.
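The random routing idea can be sketched as follows. This is a simplified illustration that keeps the second expert with probability equal to its gate weight; the exact scaling used in GShard is not given in the text above, so that choice is an assumption, as is the function name.

```python
import random


def top2_random_routing(gates, rng):
    """gates: softmax router weights for one token. Returns the experts to dispatch to."""
    order = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    first, second = order[0], order[1]
    chosen = [first]  # the top expert is always used
    # Assumption: keep the second expert with probability equal to its gate weight.
    if rng.random() < gates[second]:
        chosen.append(second)
    return chosen


rng = random.Random(0)
picks = [top2_random_routing([0.6, 0.4, 0.0, 0.0], rng) for _ in range(10000)]
second_rate = sum(len(p) == 2 for p in picks) / len(picks)
print(second_rate)
```

A second expert with a small weight is rarely dispatched to, which saves capacity for tokens that rely on it more heavily.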
For large-scale computing, the experts of the MoE layer are sharded across devices while all other layers are replicated. This makes MoE particularly suited to multi-device training.
This is also why Mixtral 8x7B is described as a 47B-parameter model with 8 experts that runs with the compute of a ~12B dense model: the attention layers and other non-expert weights are shared by all tokens (not routed), and each token additionally passes through only 2 of the 8 expert FFNs per layer, so roughly 12B of the 47B parameters are active for any given token.
Switch Transformers
Switch Transformers (2022) deep-dived into training and fine-tuning instabilities. They achieved a 4x pre-train speed-up over T5-XXL and released a 1.6T parameter model with 2048 experts.
The key simplification: route to only one expert (top-1 instead of top-2). This:
- Reduces router computation
- Allows the batch size (capacity) of each expert to be at least halved
- Reduces communication costs
- Preserves quality
They also found that Switch Transformers perform well at low capacity factors (1.0–1.25), and that the properties observed at large scale (hundreds of experts) are consistent at small scale (2, 4, or 8 experts per layer).
Selective precision: Training experts with bfloat16 while using full precision for the routing computation. The router has an exponentiation function (softmax), so higher precision matters there. This doesn’t degrade quality and enables faster training.
Router Z-Loss
The balancing loss can itself cause instability. ST-MoE introduced router z-loss, which penalizes large logits entering the gating network. By encouraging smaller absolute magnitudes, roundoff errors are reduced — particularly important for the exponential function in the gating softmax.
This significantly improves training stability without quality degradation. A simple but elegant solution: don’t let the logits blow up.
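As defined in the ST-MoE paper, the router z-loss is the mean over tokens of the squared log-sum-exp of the router logits, \(L_z = \frac{1}{B}\sum_{i=1}^{B}\left(\log\sum_{j=1}^{N} e^{x_{ij}}\right)^2\). A minimal sketch for a batch of logit vectors is shown below; in training this term would be multiplied by a small coefficient, which is a hyperparameter not covered here.

```python
import math


def router_z_loss(logits):
    """logits: one list of router logits per token."""
    total = 0.0
    for row in logits:
        m = max(row)
        # Stable log-sum-exp: subtract the max before exponentiating.
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse ** 2
    return total / len(logits)


small = router_z_loss([[1.0, 0.0, 0.0, 0.0]])
large = router_z_loss([[10.0, 0.0, 0.0, 0.0]])
print(small, large)
```

Larger logits produce a larger penalty, so minimizing this term nudges the router toward the small-magnitude regime where the softmax is less sensitive to roundoff.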
Training and Fine-Tuning
What Does an Expert Learn?
ST-MoE researchers observed that encoder experts specialize in token groups or shallow concepts: punctuation experts, proper noun experts, etc. Decoder experts have less specialization.
In multilingual training, one might expect each expert to specialize in a language — but the opposite happens. Due to token routing and load balancing, no single expert is specialized in any given language.
This suggests that the load-balancing mechanisms work as intended — they prevent the degenerate solution of language-level partitioning and force experts to learn more general, complementary features.
Fine-Tuning MoEs
Sparse models are more prone to overfitting than dense models. Key findings:
Regularization: We can use higher dropout within experts (e.g. a higher rate for sparse layers than dense layers). Turning off the auxiliary loss during fine-tuning doesn’t significantly impact quality, even when up to 11% of tokens are dropped — token dropping may itself be a form of regularization.
Dense vs. sparse at fixed perplexity: The sparse model does worse on reasoning-heavy downstream tasks (SuperGLUE) but better on knowledge-heavy tasks (TriviaQA). Fewer experts help during fine-tuning.
What to freeze? Freezing all non-expert weights leads to a huge performance drop. Freezing only MoE layers works almost as well as updating everything, and it is faster. This is somewhat counterintuitive since ~80% of parameters are in MoE layers. The hypothesis: since expert layers occur only once in every four layers and each token sees at most two experts per layer, updating MoE parameters affects fewer layers than updating shared parameters.
Hyperparameters: Sparse models tend to benefit from smaller batch sizes and higher learning rates.
The Instruction Tuning Breakthrough
Mixture-of-Experts Meets Instruction Tuning (July 2023) found that:
- Vanilla fine-tuned MoE performs worse than its T5 equivalent
- But Flan (instruction-tuned) MoE performs significantly better than Flan T5
- The improvement of Flan-MoE over MoE is larger than Flan T5 over T5
- MoEs benefit more from instruction tuning than dense models
- The auxiliary loss actually prevents overfitting during instruction tuning (contrary to the earlier suggestion of turning it off)
This is exciting: MoEs may struggle with narrow fine-tuning but excel with diverse, multi-task instruction tuning.
Sparse vs Dense: When to Use Which?
- High throughput, many machines: Use sparse MoE. Given a fixed compute budget, a sparse model will be more optimal for pretraining.
- Low throughput, limited VRAM: Use a dense model.
- Parameter counts cannot be compared directly between sparse and dense models, because they represent fundamentally different things.
Efficiency and Deployment
Parallelism
- Data parallelism: Same weights replicated across all cores, data partitioned across cores.
- Model parallelism: Model partitioned across cores, data replicated.
- Expert parallelism: Experts placed on different workers. For non-MoE layers, behaves like data parallelism. For MoE layers, tokens are sent to workers where the desired experts reside.
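The token movement in expert parallelism can be illustrated with a toy dispatch step that groups tokens by the worker hosting their expert. The contiguous placement rule (`expert // experts_per_worker`) and the function name are assumptions; a real system would perform this exchange with an all-to-all collective.

```python
from collections import defaultdict


def route_to_workers(token_experts, experts_per_worker):
    """Group (token, expert) pairs by the worker that hosts the chosen expert."""
    outbox = defaultdict(list)
    for token, expert in enumerate(token_experts):
        worker = expert // experts_per_worker  # assumed contiguous expert placement
        outbox[worker].append((token, expert))
    return dict(outbox)


# Five tokens, four experts, two experts per worker.
print(route_to_workers([0, 3, 1, 2, 3], experts_per_worker=2))
```

After the experts run, the outputs travel back along the same mapping so each token returns to the worker that holds the rest of its sequence.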
Capacity Factor Trade-offs
Increasing CF increases quality but also communication costs and activation memory. If all-to-all communications are slow, use a smaller CF. A good starting point: top-2 routing, CF = 1.25, one expert per core.
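The trade-off can be made tangible with a small simulation: for a fixed routing decision, a larger CF drops fewer tokens but reserves more buffer slots, which stands in here for activation memory and communication volume. The uniformly random router and the slot count as a cost proxy are simplifying assumptions.

```python
import math
import random


def capacity_tradeoff(assignments, n_experts, capacity_factor):
    """Return (fraction of tokens dropped, total buffer slots reserved)."""
    capacity = math.floor(len(assignments) / n_experts * capacity_factor)
    counts = [0] * n_experts
    for expert in assignments:
        counts[expert] += 1
    dropped = sum(max(0, c - capacity) for c in counts)
    return dropped / len(assignments), n_experts * capacity


rng = random.Random(0)
n_experts = 8
# Assumption: an unbiased router, so imbalance comes only from sampling noise.
assignments = [rng.randrange(n_experts) for _ in range(1024)]
for cf in (1.0, 1.25, 2.0):
    print(cf, capacity_tradeoff(assignments, n_experts, cf))
```

Even a well-balanced router drops some tokens at CF = 1.0, which is why a modest buffer such as 1.25 is a common starting point.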
Serving Techniques
- Distillation: Distill a MoE back to a dense model, retaining 30–40% of sparsity gains. Faster pretraining, smaller model in production.
- Task-level routing: Route entire sentences/tasks to an expert, permitting sub-network extraction.
- Expert aggregation: Merge expert weights to reduce parameter count at inference.
Efficient Training
- FasterMoE (2022): Topology-aware gating that picks experts based on lowest latency, giving a reported 17x speedup.
- MegaBlocks (2022): Express MoE layers as block-sparse operations instead of batched matrix multiplication. Never drops tokens, maps efficiently to modern hardware.
Update: MoE in Practice — Qwen3 (2025)
The Qwen3 Technical Report (May 2025) provides a striking example of how far MoE has come since the early Switch Transformer days. Qwen3 includes two MoE models that embody several ideas discussed above — and push them further.
Architecture
| Model | Layers | Heads (Q/KV) | Total Experts | Activated | Total Params | Active Params |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 48 | 32/4 | 128 | 8 | 30B | 3B |
| Qwen3-235B-A22B | 94 | 64/4 | 128 | 8 | 235B | 22B |
Several design choices are worth noting:
- Fine-grained expert segmentation (from DeepSeekMoE): Instead of 8 large experts with top-2 routing (like Mixtral), Qwen3 uses 128 small experts with top-8 routing. The intuition: more experts with finer granularity allow more flexible combinations, with \(\binom{128}{8} \approx 1.4 \times 10^{12}\) possible expert sets versus \(\binom{8}{2} = 28\) for Mixtral. Each individual expert is small, but the combinatorial diversity is enormous.
- No shared experts: Unlike Qwen2.5-MoE (and DeepSeek-V2/V3, which use shared experts that are always activated), Qwen3 drops shared experts entirely. This is a bold move, since shared experts were introduced to ensure a baseline of common knowledge across all tokens. Qwen3 apparently found that with 128 fine-grained experts and top-8 routing, the need for an explicit shared component disappears.
- Global-batch load balancing loss: Rather than the per-sample auxiliary loss from Switch Transformers, Qwen3 uses a global-batch variant that computes load balance across the entire batch. This encourages expert specialization while avoiding the per-example noise that can destabilize training.
The efficiency story is dramatic: Qwen3-235B has 235B total parameters but only 22B active per token — a 10.7x ratio. Compare with Mixtral 8x7B’s ~3.9x ratio (47B total / 12B active). The trend is clear: modern MoEs are pushing toward higher total-to-active ratios with more fine-grained experts.
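The ratios and the number of possible expert sets per layer can be checked with a few lines of Python, using only the parameter counts and expert configurations quoted in this section. The helper name `sparsity_summary` is made up for illustration.

```python
import math


def sparsity_summary(total_b, active_b, n_experts, top_k):
    """Return (total-to-active ratio, number of distinct expert sets per layer)."""
    return total_b / active_b, math.comb(n_experts, top_k)


models = {
    "Mixtral 8x7B": (47, 12, 8, 2),
    "Qwen3-235B-A22B": (235, 22, 128, 8),
}
for name, config in models.items():
    ratio, combos = sparsity_summary(*config)
    print(f"{name}: total/active = {ratio:.1f}x, expert sets per layer = {combos:.2e}")
```

The count of expert sets grows combinatorially with both the number of experts and the number selected, which is the sense in which fine-grained experts buy flexibility.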
Benchmark Results
Qwen3-235B-A22B outperforms dense models with far more active parameters:
| Benchmark | Qwen2.5-72B (dense) | DeepSeek-V3 (MoE) | Qwen3-235B-A22B (MoE) |
|---|---|---|---|
| MMLU | 86.1 | 87.2 | 87.8 |
| MMLU-Pro | 58.1 | 59.8 | 68.2 |
| MATH | 62.1 | 62.6 | 71.8 |
| EvalPlus (code) | 65.9 | 63.8 | 77.6 |
| BBH | 86.3 | 86.2 | 88.9 |
The smaller Qwen3-30B-A3B is equally impressive — with only 3B active parameters, it matches or exceeds Qwen2.5-14B (a dense model with 4.7x more active compute) on most benchmarks.
These results support the core MoE thesis: for a fixed budget of active parameters (that is, compute per token), sparse models can outperform dense ones, and the gap appears to have widened as MoE techniques have matured.
Open Source and Future Directions
Released models (as of Dec 2023):
- Switch Transformers (Google): T5-based, 8 to 2048 experts, up to 1.6T parameters
- NLLB MoE (Meta): Translation model
- OpenMoE: Community Llama-based MoEs
- Mixtral 8x7B (Mistral): Outperforms Llama 2 70B with much faster inference
Exciting directions:
- Distillation: Distilling sparse MoEs back to dense models with fewer parameters but similar quality
- Quantization: QMoE (Oct 2023) compresses the 1.6T Switch Transformer from 3.2TB to just 160GB by quantizing to less than 1 bit per parameter
- Model merging: Exploring expert aggregation techniques and their impact on inference time
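The QMoE figures quoted above imply the bits-per-parameter numbers directly. A quick check, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes):

```python
def bits_per_parameter(size_bytes, n_params):
    return size_bytes * 8 / n_params


n_params = 1_600_000_000_000  # 1.6T parameters
before = bits_per_parameter(3_200_000_000_000, n_params)  # 3.2 TB checkpoint
after = bits_per_parameter(160_000_000_000, n_params)     # 160 GB compressed
print(before, after, before / after)
```

The uncompressed size corresponds to 16 bits per parameter and the compressed size to under 1 bit per parameter, a roughly 20x reduction.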
References
- Jacobs et al., Adaptive Mixture of Local Experts, 1991
- Eigen et al., Learning Factored Representations in a Deep Mixture of Experts, 2013
- Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017
- Lepikhin et al., GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020
- Du et al., GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, 2021
- Fedus et al., Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2022
- Zoph et al., ST-MoE: Designing Stable and Transferable Sparse Expert Models, 2022
- He et al., FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models, 2022
- Gale et al., MegaBlocks: Efficient Sparse Training with Mixture-of-Experts, 2022
- Shen et al., Mixture-of-Experts Meets Instruction Tuning, 2023
- Sanseviero et al., Mixture of Experts Explained, HuggingFace Blog, 2023
- Dai et al., DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, 2024
- Qwen Team, Qwen3 Technical Report, 2025