Mixture of Experts Explained

This post is a discussion of the Mixture of Experts Explained blog by HuggingFace (Sanseviero et al., 2023). MoEs are one of the most important architectural ideas in modern LLMs — they let you scale parameter count without proportionally scaling compute. Here I reorganize the key ideas, add some commentary, and provide interactive visualizations.

TL;DR

MoEs:

  • Are pretrained much faster than dense models
  • Have faster inference compared to a model with the same number of parameters
  • Require high VRAM as all experts are loaded in memory
  • Face many challenges in fine-tuning, but recent work with MoE instruction-tuning is promising

What is a Mixture of Experts?

The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps. MoE enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model.

In the context of transformer models, a MoE consists of two main elements:

  1. Sparse MoE layers replace dense feed-forward network (FFN) layers. Each MoE layer has a certain number of “experts” (e.g. 8), where each expert is an independent FFN.
  2. A gate network (router) determines which tokens are sent to which expert. The router is composed of learned parameters and is pretrained at the same time as the rest of the network.

So, to recap: in MoEs we replace every FFN layer of the transformer with a MoE layer = gate network + a set of experts.
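This structure is small enough to sketch directly. Below is a single-token MoE layer in plain numpy; the shapes, random seed, and top-2 routing are illustrative choices, not any particular model's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 16, 32, 8, 2

# Each expert is an independent two-layer FFN (weights W1, W2).
experts = [
    (rng.normal(scale=0.1, size=(d_model, d_ff)),
     rng.normal(scale=0.1, size=(d_ff, d_model)))
    for _ in range(n_experts)
]
W_gate = rng.normal(scale=0.1, size=(d_model, n_experts))  # router weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_layer(x):
    """Route one token x (shape: d_model) to its top-k experts."""
    logits = x @ W_gate                    # one logit per expert
    chosen = np.argsort(logits)[-top_k:]   # indices of the k highest logits
    weights = softmax(logits[chosen])      # renormalize over chosen experts only
    y = np.zeros_like(x)
    for w, i in zip(weights, chosen):
        W1, W2 = experts[i]
        y += w * (np.maximum(x @ W1, 0.0) @ W2)  # weighted sum of expert outputs
    return y, chosen

y, chosen = moe_layer(rng.normal(size=d_model))
```

Only `top_k` of the `n_experts` FFNs run for this token; the rest cost zero FLOPs, which is the whole point.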

Interactive: 3-layer MoE Transformer. Click "Generate token" to watch routing and activation states at each layer. Green = active expert, gray = inactive (zero FLOPs).

The key trade-offs:

  • Training: MoEs enable significantly more compute-efficient pretraining, but they’ve historically struggled to generalize during fine-tuning, leading to overfitting.
  • Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high.

For example, Mixtral 8x7B needs as much VRAM as a dense 47B-parameter model (not 8 × 7B = 56B, because only the FFN layers are separate experts; the rest is shared). But with top-2 routing, the inference FLOPs are comparable to those of a ~12B model.

That parameter count \(\neq\) compute cost is the central insight of MoEs.
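The arithmetic behind those numbers can be reconstructed from the two public figures alone (total and active parameters); the split below is a back-of-envelope estimate, not an official breakdown:

```python
# Back-of-envelope for Mixtral 8x7B using its public headline numbers
# (~47B total, ~12.9B active with top-2 routing over 8 experts).
n_experts, top_k = 8, 2
total, active = 47e9, 12.9e9

# total  = shared + n_experts * per_expert
# active = shared + top_k    * per_expert
per_expert = (total - active) / (n_experts - top_k)
shared = total - n_experts * per_expert

print(f"per-expert FFN params: {per_expert/1e9:.1f}B")            # ~5.7B
print(f"shared (attention, embeddings, ...): {shared/1e9:.1f}B")  # ~1.5B
```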

A Brief History

  • 1991: Adaptive Mixture of Local Experts — the original idea. A supervised procedure for a system of separate networks, each handling a different subset of training cases. A gating network determines the weights.
  • 2010–2015: Two directions: (a) Eigen, Ranzato, and Sutskever explored MoEs as components of deeper networks (MoE as a layer, not the whole model); (b) Bengio et al. explored conditional computation, dynamically activating or deactivating components based on input.
  • 2017: Shazeer et al. (with Hinton and Jeff Dean) scaled the idea to a 137B LSTM by introducing sparsity, keeping fast inference at high scale.
  • 2020+: GShard, Switch Transformers, GLaM, ST-MoE, and eventually Mixtral brought MoEs to the Transformer era.

MoEs have allowed training multi-trillion parameter models, such as the open-sourced 1.6T parameter Switch Transformers.

Routing and Load Balancing

Sparsity and Gating

Sparsity uses the idea of conditional computation: while in dense models all parameters are used for all inputs, sparsity allows us to only run some parts of the whole system.

A learned gating network \(G\) decides which experts \(E\) to activate for each input:

\[y = \sum_{i=1}^{n} G(x)_i \, E_i(x)\]

If \(G(x)_i = 0\), we skip expert \(i\) entirely — saving compute. The simplest gating function is a softmax over a linear projection:

\[G_\sigma(x) = \text{Softmax}(x \cdot W_g)\]

But in practice, Shazeer et al. used Noisy Top-K Gating:

Step 1: Add tunable noise:

\[H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{\text{noise}})_i)\]

Step 2: Keep only the top-\(k\) values:

\[\text{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}\]

Step 3: Apply softmax:

\[G(x) = \text{Softmax}(\text{KeepTopK}(H(x), k))\]

By using a low enough \(k\) (e.g. 1 or 2), we can train and run inference much faster than if many experts were activated. The initial conjecture was that routing to more than one expert was needed to have the gate learn how to route, so at least two experts had to be picked. Switch Transformers later revisited this decision.

The noise is crucial for load balancing — without it, the gating network converges to mostly activating the same few experts, a self-reinforcing collapse.
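The three steps map directly onto code. A numpy sketch for a single token (dimensions and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softplus(z):
    return np.log1p(np.exp(z))

def noisy_top_k_gate(x, W_g, W_noise, k):
    """Noisy top-k gating (Shazeer et al., 2017) for one token x."""
    clean = x @ W_g
    noise = rng.standard_normal(clean.shape) * softplus(x @ W_noise)
    h = clean + noise                  # Step 1: add tunable noise
    masked = np.full_like(h, -np.inf)
    top = np.argsort(h)[-k:]
    masked[top] = h[top]               # Step 2: keep top-k, rest -> -inf
    return softmax(masked)             # Step 3: softmax (exp(-inf) = 0)

d_model, n_experts = 16, 8
W_g = rng.normal(size=(d_model, n_experts))
W_noise = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=d_model)

g = noisy_top_k_gate(x, W_g, W_noise, k=2)
```

Exactly two entries of `g` are nonzero, and they sum to one; all other experts are skipped for this token.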

Interactive: Noisy Top-K Gating. Adjust noise and k to see how tokens get routed. Click "Resample" for new logits.

Load Balancing

If all tokens are sent to just a few popular experts, training becomes inefficient. The gating network naturally converges to mostly activate the same few experts — this self-reinforces as favored experts are trained quicker and hence selected more.

To mitigate this, an auxiliary loss is added to encourage giving all experts equal importance. This loss ensures that all experts receive a roughly equal number of training examples.
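The exact form of the auxiliary loss varies between papers; the Switch Transformer variant multiplies the fraction of tokens dispatched to each expert by the router's mean probability for that expert, and is minimized when both are uniform. A toy numpy sketch (shapes and names are illustrative):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index, n_experts):
    """Switch-Transformer-style auxiliary loss (a sketch).

    router_probs: (tokens, n_experts) softmax outputs of the router
    expert_index: (tokens,) the expert each token was dispatched to (top-1)
    Returns n_experts * sum_i f_i * P_i, which is ~1.0 when both the
    dispatch fractions f and mean probabilities P are uniform.
    """
    f = np.bincount(expert_index, minlength=n_experts) / len(expert_index)
    P = router_probs.mean(axis=0)
    return n_experts * np.sum(f * P)

rng = np.random.default_rng(0)
n_tokens, n_experts = 1000, 8
logits = rng.normal(size=(n_tokens, n_experts))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
loss_balanced = load_balancing_loss(probs, probs.argmax(axis=1), n_experts)

# A collapsed router that always picks expert 0 scores much worse (~8x):
collapsed = np.full((n_tokens, n_experts), 1e-6)
collapsed[:, 0] = 1.0
collapsed /= collapsed.sum(axis=1, keepdims=True)
loss_collapsed = load_balancing_loss(collapsed, collapsed.argmax(axis=1), n_experts)
```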

This is combined with the concept of expert capacity — a threshold of how many tokens can be processed by a single expert:

\[\text{Expert Capacity} = \left\lfloor\frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity factor}\right\rfloor\]

If an expert is at capacity, additional tokens overflow — they skip the MoE layer via residual connections (or are dropped entirely). The capacity factor (CF) creates a buffer: CF = 1.0 means perfectly even distribution with no slack; CF = 1.25 provides a 25% buffer.
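Plugging numbers into the formula makes the buffer concrete:

```python
import math

def expert_capacity(tokens_per_batch, n_experts, capacity_factor):
    # Floor of (tokens per expert under perfectly even routing) * CF
    return math.floor(tokens_per_batch / n_experts * capacity_factor)

# With 1024 tokens per batch and 8 experts:
cap_tight = expert_capacity(1024, 8, 1.0)      # 128 tokens/expert, zero slack
cap_buffered = expert_capacity(1024, 8, 1.25)  # 160 tokens/expert, 25% buffer
```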

Interactive: Expert Capacity. Adjust capacity factor to see how overflow changes. Lower CF = more dropped tokens; higher CF = more memory/communication cost.

Scaling MoEs in Transformers

GShard

GShard (2020) explored scaling transformers beyond 600B parameters by replacing every other FFN layer with a MoE layer using top-2 gating. Two key innovations:

  • Random routing: The top expert is always selected, but the second expert is picked with probability proportional to its weight — adding exploration.
  • Expert capacity: All tensor shapes must be statically determined at compilation time, but we can’t know how many tokens will go to each expert ahead of time. So we fix a capacity factor.

For large-scale computing, the MoE layer is shared across devices while all other layers are replicated. This makes MoE particularly suited to multi-device training.

When we talk about Mixtral 8x7B being a “47B model of 8 experts” that runs with the compute of a 12B dense model: the attention layers are shared across all tokens (not routed), so the actual number of active parameters per forward pass is closer to 12B despite having 47B total.

Switch Transformers

Switch Transformers (2022) deep-dived into training and fine-tuning instabilities. They achieved a 4x pre-train speed-up over T5-XXL and released a 1.6T parameter model with 2048 experts.

The key simplification: route to only one expert (top-1 instead of top-2). This:

  • Reduces router computation
  • Halves the effective batch size per expert (at least)
  • Reduces communication costs
  • Preserves quality

They also found that Switch Transformers perform well at low capacity factors (1.0–1.25), and that the properties observed at large scale (hundreds of experts) are consistent at small scale (2, 4, or 8 experts per layer).

Selective precision: Training experts with bfloat16 while using full precision for the routing computation. The router has an exponentiation function (softmax), so higher precision matters there. This doesn’t degrade quality and enables faster training.
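A small demonstration of why precision matters in the router. numpy has no bfloat16, so float16 stands in for the low-precision format here, and the logit values are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two experts with nearly tied logits, two clearly losing ones.
logits = np.array([4.1234, 4.1212, -2.0, -2.5], dtype=np.float64)

# Round-trip through half precision before the softmax vs. staying in full.
probs_low = softmax(logits.astype(np.float16).astype(np.float64))
probs_full = softmax(logits)

# The round-off shifts the routing probabilities; with top-1 routing a
# shift like this can flip which expert is chosen.
max_diff = np.abs(probs_low - probs_full).max()
```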

Router Z-Loss

The balancing loss can itself cause instability. ST-MoE introduced router z-loss, which penalizes large logits entering the gating network. By encouraging smaller absolute magnitudes, roundoff errors are reduced — particularly important for the exponential function in the gating softmax.

This significantly improves training stability without quality degradation. A simple but elegant solution: don’t let the logits blow up.
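The z-loss in ST-MoE is the mean squared log-sum-exp of the router logits. A numpy sketch:

```python
import numpy as np

def router_z_loss(logits):
    """ST-MoE router z-loss: mean over tokens of (logsumexp of logits)^2.

    Penalizing the squared logsumexp pushes logits toward small magnitudes,
    which keeps the subsequent softmax numerically well-behaved.
    """
    lse = np.log(np.sum(np.exp(logits), axis=-1))  # logsumexp per token
    return np.mean(lse ** 2)

small = np.array([[0.1, -0.2, 0.05]])
large = small * 100.0            # same direction, blown-up magnitude

z_small = router_z_loss(small)
z_large = router_z_loss(large)   # much larger penalty
```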

Training and Fine-Tuning

What Does an Expert Learn?

ST-MoE researchers observed that encoder experts specialize in token groups or shallow concepts: punctuation experts, proper noun experts, etc. Decoder experts have less specialization.

In multilingual training, one might expect each expert to specialize in a language — but the opposite happens. Due to token routing and load balancing, no single expert is specialized in any given language.

This suggests that the load-balancing mechanisms work as intended — they prevent the degenerate solution of language-level partitioning and force experts to learn more general, complementary features.

Fine-Tuning MoEs

Sparse models are more prone to overfitting than dense models. Key findings:

Regularization: We can use higher dropout within experts (e.g. a higher rate for sparse layers than dense layers). Turning off the auxiliary loss during fine-tuning doesn’t significantly impact quality, even when up to 11% of tokens are dropped — token dropping may itself be a form of regularization.

Dense vs. sparse at fixed perplexity: The sparse model does worse on reasoning-heavy downstream tasks (SuperGLUE) but better on knowledge-heavy tasks (TriviaQA). Fewer experts help during fine-tuning.

What to freeze? Freezing all non-expert weights leads to a huge performance drop. Freezing only the MoE layers works almost as well as updating everything, and it's faster. This is somewhat counterintuitive since ~80% of parameters are in MoE layers. The hypothesis: expert layers appear only every fourth layer, and each token sees at most two experts per layer, so updating MoE parameters affects fewer layers than updating shared parameters.

Hyperparameters: Sparse models tend to benefit from smaller batch sizes and higher learning rates.

The Instruction Tuning Breakthrough

MoEs Meets Instruction Tuning (July 2023) found that:

  • Vanilla fine-tuned MoE performs worse than its T5 equivalent
  • But Flan (instruction-tuned) MoE performs significantly better than Flan T5
  • The improvement of Flan-MoE over MoE is larger than Flan T5 over T5
  • MoEs benefit more from instruction tuning than dense models
  • The auxiliary loss actually prevents overfitting during instruction tuning (contrary to the earlier suggestion of turning it off)

This is exciting: MoEs may struggle with narrow fine-tuning but excel with diverse, multi-task instruction tuning.

Sparse vs Dense: When to Use Which?

  • High throughput, many machines: Use sparse MoE. Given a fixed compute budget, a sparse model will be more optimal for pretraining.
  • Low throughput, limited VRAM: Use a dense model.
  • Cannot directly compare the number of parameters between sparse and dense models — they represent fundamentally different things.

Efficiency and Deployment

Parallelism

  • Data parallelism: Same weights replicated across all cores, data partitioned across cores.
  • Model parallelism: Model partitioned across cores, data replicated.
  • Expert parallelism: Experts placed on different workers. For non-MoE layers, behaves like data parallelism. For MoE layers, tokens are sent to workers where the desired experts reside.

Capacity Factor Trade-offs

Increasing CF increases quality but also communication costs and activation memory. If all-to-all communications are slow, use a smaller CF. A good starting point: top-2 routing, CF = 1.25, one expert per core.

Serving Techniques

  • Distillation: Distill a MoE back to a dense model, retaining 30–40% of sparsity gains. Faster pretraining, smaller model in production.
  • Task-level routing: Route entire sentences/tasks to an expert, permitting sub-network extraction.
  • Expert aggregation: Merge expert weights to reduce parameter count at inference.

Efficient Training

  • FasterMoE (2022): Topology-aware gating that picks experts based on lowest latency → 17x speedup.
  • MegaBlocks (2022): Express MoE layers as block-sparse operations instead of batched matrix multiplication. Never drops tokens, maps efficiently to modern hardware.

Update: MoE in Practice — Qwen3 (2025)

The Qwen3 Technical Report (May 2025) provides a striking example of how far MoE has come since the early Switch Transformer days. Qwen3 includes two MoE models that embody several ideas discussed above — and push them further.

Architecture

| Model | Layers | Heads (Q/KV) | Total Experts | Activated | Total Params | Active Params |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-30B-A3B | 48 | 32/4 | 128 | 8 | 30B | 3B |
| Qwen3-235B-A22B | 94 | 64/4 | 128 | 8 | 235B | 22B |

Several design choices are worth noting:

  1. Fine-grained expert segmentation (from DeepSeekMoE): Instead of 8 large experts with top-2 routing (like Mixtral), Qwen3 uses 128 small experts with top-8 routing. The intuition: more experts with finer granularity allows more flexible combinations: \(\binom{128}{8} \approx 1.4 \times 10^{12}\) possible expert sets versus \(\binom{8}{2} = 28\) for Mixtral. Each individual expert is small, but the combinatorial diversity is enormous.

  2. No shared experts: Unlike Qwen2.5-MoE (and DeepSeek-V2/V3 which use shared experts that are always activated), Qwen3 drops shared experts entirely. This is a bold move — shared experts were introduced to ensure a baseline of common knowledge across all tokens. Qwen3 apparently found that with 128 fine-grained experts and top-8 routing, the need for an explicit shared component disappears.

  3. Global-batch load balancing loss: Rather than the per-sample auxiliary loss from Switch Transformers, Qwen3 uses a global-batch variant that computes load balance across the entire batch. This encourages expert specialization while avoiding the per-example noise that can destabilize training.

The efficiency story is dramatic: Qwen3-235B has 235B total parameters but only 22B active per token — a 10.7x ratio. Compare with Mixtral 8x7B’s ~3.9x ratio (47B total / 12B active). The trend is clear: modern MoEs are pushing toward higher total-to-active ratios with more fine-grained experts.
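Both comparisons are easy to recompute; note the parameter ratios below use the rounded headline numbers, so they are approximate:

```python
from math import comb

# Routing combinations available to each token:
qwen3_combos = comb(128, 8)    # fine-grained: 128 experts, top-8 (~1.4e12)
mixtral_combos = comb(8, 2)    # coarse: 8 experts, top-2 (28)

# Total-to-active parameter ratios from the headline figures:
qwen3_ratio = 235 / 22         # ~10.7x
mixtral_ratio = 47 / 12        # ~3.9x
```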

Benchmark Results

Qwen3-235B-A22B outperforms dense models with far more active parameters:

| Benchmark | Qwen2.5-72B (dense) | DeepSeek-V3 (MoE) | Qwen3-235B-A22B (MoE) |
| --- | --- | --- | --- |
| MMLU | 86.1 | 87.2 | 87.8 |
| MMLU-Pro | 58.1 | 59.8 | 68.2 |
| MATH | 62.1 | 62.6 | 71.8 |
| EvalPlus (code) | 65.9 | 63.8 | 77.6 |
| BBH | 86.3 | 86.2 | 88.9 |

The smaller Qwen3-30B-A3B is equally impressive — with only 3B active parameters, it matches or exceeds Qwen2.5-14B (a dense model with 4.7x more active compute) on most benchmarks.

These results validate the core MoE thesis: given a fixed inference budget, sparse models consistently outperform dense ones. The gap has only widened as MoE techniques have matured.

Open Source and Future Directions

Training frameworks:

Released models (as of Dec 2023):

Exciting directions:

  • Distillation: Distilling sparse MoEs back to dense models with fewer parameters but similar quality
  • Quantization: QMoE (Oct 2023) compresses the 1.6T Switch Transformer from 3.2TB to just 160GB by quantizing to less than 1 bit per parameter
  • Model merging: Exploring expert aggregation techniques and their impact on inference time

References