Mixture of Experts Explained

This post is a discussion of the Mixture of Experts Explained blog by HuggingFace (Sanseviero et al., 2023). MoEs are one of the most important architectural ideas in modern LLMs — they let you scale parameter count without proportionally scaling compute. Here I reorganize the key ideas, add some commentary, and provide interactive visualizations.

TL;DR

MoEs:

  • Are pretrained much faster than dense models
  • Have faster inference compared to a model with the same number of parameters
  • Require high VRAM as all experts are loaded in memory
  • Face many challenges in fine-tuning, but recent work with MoE instruction-tuning is promising

What is a Mixture of Experts?

The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps. MoE enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model.

In the context of transformer models, a MoE consists of two main elements:

  1. Sparse MoE layers replace dense feed-forward network (FFN) layers. Each MoE layer has a certain number of “experts” (e.g. 8), where each expert is an independent FFN.
  2. A gate network (router) determines which tokens are sent to which expert. The router is composed of learned parameters and is pretrained at the same time as the rest of the network.

So, to recap: in MoEs we replace every FFN layer of the transformer with a MoE layer = gate network + a set of experts.
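This structure is small enough to sketch directly. Below is a single-token MoE layer in plain numpy; the shapes, random seed, and top-2 routing are illustrative choices, not any particular model's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 16, 32, 8, 2

# Each expert is an independent two-layer FFN (weights W1, W2).
experts = [
    (rng.normal(scale=0.1, size=(d_model, d_ff)),
     rng.normal(scale=0.1, size=(d_ff, d_model)))
    for _ in range(n_experts)
]
W_gate = rng.normal(scale=0.1, size=(d_model, n_experts))  # router weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_layer(x):
    """Route one token x (shape: d_model) to its top-k experts."""
    logits = x @ W_gate                    # one logit per expert
    chosen = np.argsort(logits)[-top_k:]   # indices of the k highest logits
    weights = softmax(logits[chosen])      # renormalize over chosen experts only
    y = np.zeros_like(x)
    for w, i in zip(weights, chosen):
        W1, W2 = experts[i]
        y += w * (np.maximum(x @ W1, 0.0) @ W2)  # weighted sum of expert outputs
    return y, chosen

y, chosen = moe_layer(rng.normal(size=d_model))
```

Only `top_k` of the `n_experts` FFNs run for this token; the rest cost zero FLOPs, which is the whole point.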

Interactive: 3-layer MoE Transformer. Click "Generate token" to watch routing and activation states at each layer. Green = active expert, gray = inactive (zero FLOPs).

The key trade-offs:

  • Training: MoEs enable significantly more compute-efficient pretraining, but they’ve historically struggled to generalize during fine-tuning, leading to overfitting.
  • Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high.

For example, Mixtral 8x7B needs as much VRAM as a dense 47B-parameter model (not 8 × 7B = 56B, because only the FFN layers are separate experts; the rest is shared). But with top-2 routing, the inference FLOPs are comparable to those of a ~12B model.

That parameter count \(\neq\) compute cost is the central insight of MoEs.
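The arithmetic behind those numbers can be reconstructed from the two public figures alone (total and active parameters); the split below is a back-of-envelope estimate, not an official breakdown:

```python
# Back-of-envelope for Mixtral 8x7B using its public headline numbers
# (~47B total, ~12.9B active with top-2 routing over 8 experts).
n_experts, top_k = 8, 2
total, active = 47e9, 12.9e9

# total  = shared + n_experts * per_expert
# active = shared + top_k    * per_expert
per_expert = (total - active) / (n_experts - top_k)
shared = total - n_experts * per_expert

print(f"per-expert FFN params: {per_expert/1e9:.1f}B")            # ~5.7B
print(f"shared (attention, embeddings, ...): {shared/1e9:.1f}B")  # ~1.5B
```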

A Brief History

  • 1991: Adaptive Mixture of Local Experts — the original idea. A supervised procedure for a system of separate networks, each handling a different subset of training cases. A gating network determines the weights.
  • 2010–2015: Two directions: (a) Eigen, Ranzato, and Sutskever explored MoEs as components of deeper networks (MoE as a layer, not the whole model); (b) Bengio et al. explored conditional computation, dynamically activating or deactivating components based on input.
  • 2017: Shazeer et al. (with Hinton and Jeff Dean) scaled the idea to a 137B LSTM by introducing sparsity, keeping fast inference at high scale.
  • 2020+: GShard, Switch Transformers, GLaM, ST-MoE, and eventually Mixtral brought MoEs to the Transformer era.

MoEs have allowed training multi-trillion parameter models, such as the open-sourced 1.6T parameter Switch Transformers.

Routing and Load Balancing

Sparsity and Gating

Sparsity uses the idea of conditional computation: while in dense models all parameters are used for all inputs, sparsity allows us to only run some parts of the whole system.

A learned gating network \(G\) decides which experts \(E\) to activate for each input:

\[y = \sum_{i=1}^{n} G(x)_i \, E_i(x)\]

If \(G(x)_i = 0\), we skip expert \(i\) entirely — saving compute. The simplest gating function is a softmax over a linear projection:

\[G_\sigma(x) = \text{Softmax}(x \cdot W_g)\]

But in practice, Shazeer et al. used Noisy Top-K Gating:

Step 1: Add tunable noise:

\[H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{\text{noise}})_i)\]

Step 2: Keep only the top-\(k\) values:

\[\text{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}\]

Step 3: Apply softmax:

\[G(x) = \text{Softmax}(\text{KeepTopK}(H(x), k))\]

By using a low enough \(k\) (e.g. 1 or 2), we can train and run inference much faster than if many experts were activated. The initial conjecture was that routing to more than one expert was needed to have the gate learn how to route, so at least two experts had to be picked. Switch Transformers later revisited this decision.

The noise is crucial for load balancing — without it, the gating network converges to mostly activating the same few experts, a self-reinforcing collapse.
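The three steps map directly onto code. A numpy sketch for a single token (dimensions and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softplus(z):
    return np.log1p(np.exp(z))

def noisy_top_k_gate(x, W_g, W_noise, k):
    """Noisy top-k gating (Shazeer et al., 2017) for one token x."""
    clean = x @ W_g
    noise = rng.standard_normal(clean.shape) * softplus(x @ W_noise)
    h = clean + noise                  # Step 1: add tunable noise
    masked = np.full_like(h, -np.inf)
    top = np.argsort(h)[-k:]
    masked[top] = h[top]               # Step 2: keep top-k, rest -> -inf
    return softmax(masked)             # Step 3: softmax (exp(-inf) = 0)

d_model, n_experts = 16, 8
W_g = rng.normal(size=(d_model, n_experts))
W_noise = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=d_model)

g = noisy_top_k_gate(x, W_g, W_noise, k=2)
```

Exactly two entries of `g` are nonzero, and they sum to one; all other experts are skipped for this token.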

Interactive: Noisy Top-K Gating. Adjust noise and k to see how tokens get routed. Click "Resample" for new logits.

Load Balancing

If all tokens are sent to just a few popular experts, training becomes inefficient. The gating network naturally converges to mostly activate the same few experts — this self-reinforces as favored experts are trained quicker and hence selected more.

To mitigate this, an auxiliary loss is added to encourage giving all experts equal importance. This loss ensures that all experts receive a roughly equal number of training examples.
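The exact form of the auxiliary loss varies between papers; the Switch Transformer variant multiplies the fraction of tokens dispatched to each expert by the router's mean probability for that expert, and is minimized when both are uniform. A toy numpy sketch (shapes and names are illustrative):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index, n_experts):
    """Switch-Transformer-style auxiliary loss (a sketch).

    router_probs: (tokens, n_experts) softmax outputs of the router
    expert_index: (tokens,) the expert each token was dispatched to (top-1)
    Returns n_experts * sum_i f_i * P_i, which is ~1.0 when both the
    dispatch fractions f and mean probabilities P are uniform.
    """
    f = np.bincount(expert_index, minlength=n_experts) / len(expert_index)
    P = router_probs.mean(axis=0)
    return n_experts * np.sum(f * P)

rng = np.random.default_rng(0)
n_tokens, n_experts = 1000, 8
logits = rng.normal(size=(n_tokens, n_experts))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
loss_balanced = load_balancing_loss(probs, probs.argmax(axis=1), n_experts)

# A collapsed router that always picks expert 0 scores much worse (~8x):
collapsed = np.full((n_tokens, n_experts), 1e-6)
collapsed[:, 0] = 1.0
collapsed /= collapsed.sum(axis=1, keepdims=True)
loss_collapsed = load_balancing_loss(collapsed, collapsed.argmax(axis=1), n_experts)
```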

This is combined with the concept of expert capacity — a threshold of how many tokens can be processed by a single expert:

\[\text{Expert Capacity} = \left\lfloor\frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity factor}\right\rfloor\]

If an expert is at capacity, additional tokens overflow — they skip the MoE layer via residual connections (or are dropped entirely). The capacity factor (CF) creates a buffer: CF = 1.0 means perfectly even distribution with no slack; CF = 1.25 provides a 25% buffer.
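Plugging numbers into the formula makes the buffer concrete:

```python
import math

def expert_capacity(tokens_per_batch, n_experts, capacity_factor):
    # Floor of (tokens per expert under perfectly even routing) * CF
    return math.floor(tokens_per_batch / n_experts * capacity_factor)

# With 1024 tokens per batch and 8 experts:
cap_tight = expert_capacity(1024, 8, 1.0)      # 128 tokens/expert, zero slack
cap_buffered = expert_capacity(1024, 8, 1.25)  # 160 tokens/expert, 25% buffer
```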

Interactive: Expert Capacity. Adjust capacity factor to see how overflow changes. Lower CF = more dropped tokens; higher CF = more memory/communication cost.

Scaling MoEs in Transformers

GShard

GShard (2020) explored scaling transformers beyond 600B parameters by replacing every other FFN layer with a MoE layer using top-2 gating. Two key innovations:

  • Random routing: The top expert is always selected, but the second expert is picked with probability proportional to its weight — adding exploration.
  • Expert capacity: All tensor shapes must be statically determined at compilation time, but we can’t know how many tokens will go to each expert ahead of time. So we fix a capacity factor.

For large-scale computing, the MoE layer is shared across devices while all other layers are replicated. This makes MoE particularly suited to multi-device training.

When we talk about Mixtral 8x7B being a “47B model of 8 experts” that runs with the compute of a 12B dense model: the attention layers are shared across all tokens (not routed), so the actual number of active parameters per forward pass is closer to 12B despite having 47B total.

Switch Transformers

Switch Transformers (2022) deep-dived into training and fine-tuning instabilities. They achieved a 4x pre-train speed-up over T5-XXL and released a 1.6T parameter model with 2048 experts.

The key simplification: route to only one expert (top-1 instead of top-2). This:

  • Reduces router computation
  • Halves the effective batch size per expert (at least)
  • Reduces communication costs
  • Preserves quality

They also found that Switch Transformers perform well at low capacity factors (1.0–1.25), and that the properties observed at large scale (hundreds of experts) are consistent at small scale (2, 4, or 8 experts per layer).

Selective precision: Training experts with bfloat16 while using full precision for the routing computation. The router has an exponentiation function (softmax), so higher precision matters there. This doesn’t degrade quality and enables faster training.
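A small demonstration of why precision matters in the router. numpy has no bfloat16, so float16 stands in for the low-precision format here, and the logit values are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two experts with nearly tied logits, two clearly losing ones.
logits = np.array([4.1234, 4.1212, -2.0, -2.5], dtype=np.float64)

# Round-trip through half precision before the softmax vs. staying in full.
probs_low = softmax(logits.astype(np.float16).astype(np.float64))
probs_full = softmax(logits)

# The round-off shifts the routing probabilities; with top-1 routing a
# shift like this can flip which expert is chosen.
max_diff = np.abs(probs_low - probs_full).max()
```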

Router Z-Loss

The balancing loss can itself cause instability. ST-MoE introduced router z-loss, which penalizes large logits entering the gating network. By encouraging smaller absolute magnitudes, roundoff errors are reduced — particularly important for the exponential function in the gating softmax.

This significantly improves training stability without quality degradation. A simple but elegant solution: don’t let the logits blow up.
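The z-loss in ST-MoE is the mean squared log-sum-exp of the router logits. A numpy sketch:

```python
import numpy as np

def router_z_loss(logits):
    """ST-MoE router z-loss: mean over tokens of (logsumexp of logits)^2.

    Penalizing the squared logsumexp pushes logits toward small magnitudes,
    which keeps the subsequent softmax numerically well-behaved.
    """
    lse = np.log(np.sum(np.exp(logits), axis=-1))  # logsumexp per token
    return np.mean(lse ** 2)

small = np.array([[0.1, -0.2, 0.05]])
large = small * 100.0            # same direction, blown-up magnitude

z_small = router_z_loss(small)
z_large = router_z_loss(large)   # much larger penalty
```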

Training and Fine-Tuning

What Does an Expert Learn?

ST-MoE researchers observed that encoder experts specialize in token groups or shallow concepts: punctuation experts, proper noun experts, etc. Decoder experts have less specialization.

In multilingual training, one might expect each expert to specialize in a language — but the opposite happens. Due to token routing and load balancing, no single expert is specialized in any given language.

This suggests that the load-balancing mechanisms work as intended — they prevent the degenerate solution of language-level partitioning and force experts to learn more general, complementary features.

Fine-Tuning MoEs

Sparse models are more prone to overfitting than dense models. Key findings:

Regularization: We can use higher dropout within experts (e.g. a higher rate for sparse layers than dense layers). Turning off the auxiliary loss during fine-tuning doesn’t significantly impact quality, even when up to 11% of tokens are dropped — token dropping may itself be a form of regularization.

Dense vs. sparse at fixed perplexity: The sparse model does worse on reasoning-heavy downstream tasks (SuperGLUE) but better on knowledge-heavy tasks (TriviaQA). Fewer experts help during fine-tuning.

What to freeze? Freezing all non-expert weights leads to a huge performance drop. Freezing only the MoE layers works almost as well as updating everything, and it's faster. This is somewhat counterintuitive since ~80% of parameters are in MoE layers. The hypothesis: expert layers appear only every fourth layer, and each token sees at most two experts per layer, so updating MoE parameters affects fewer layers than updating shared parameters.

Hyperparameters: Sparse models tend to benefit from smaller batch sizes and higher learning rates.

The Instruction Tuning Breakthrough

MoEs Meets Instruction Tuning (July 2023) found that:

  • Vanilla fine-tuned MoE performs worse than its T5 equivalent
  • But Flan (instruction-tuned) MoE performs significantly better than Flan T5
  • The improvement of Flan-MoE over MoE is larger than Flan T5 over T5
  • MoEs benefit more from instruction tuning than dense models
  • The auxiliary loss actually prevents overfitting during instruction tuning (contrary to the earlier suggestion of turning it off)

This is exciting: MoEs may struggle with narrow fine-tuning but excel with diverse, multi-task instruction tuning.

Sparse vs Dense: When to Use Which?

  • High throughput, many machines: Use sparse MoE. Given a fixed compute budget, a sparse model will be more optimal for pretraining.
  • Low throughput, limited VRAM: Use a dense model.
  • Cannot directly compare the number of parameters between sparse and dense models — they represent fundamentally different things.

Efficiency and Deployment

Parallelism

  • Data parallelism: Same weights replicated across all cores, data partitioned across cores.
  • Model parallelism: Model partitioned across cores, data replicated.
  • Expert parallelism: Experts placed on different workers. For non-MoE layers, behaves like data parallelism. For MoE layers, tokens are sent to workers where the desired experts reside.

Capacity Factor Trade-offs

Increasing CF increases quality but also communication costs and activation memory. If all-to-all communications are slow, use a smaller CF. A good starting point: top-2 routing, CF = 1.25, one expert per core.

Serving Techniques

  • Distillation: Distill a MoE back to a dense model, retaining 30–40% of sparsity gains. Faster pretraining, smaller model in production.
  • Task-level routing: Route entire sentences/tasks to an expert, permitting sub-network extraction.
  • Expert aggregation: Merge expert weights to reduce parameter count at inference.

Efficient Training

  • FasterMoE (2022): Topology-aware gating that picks experts based on lowest latency → 17x speedup.
  • MegaBlocks (2022): Express MoE layers as block-sparse operations instead of batched matrix multiplication. Never drops tokens, maps efficiently to modern hardware.

Update: MoE in Practice — Qwen3 (2025)

The Qwen3 Technical Report (May 2025) provides a striking example of how far MoE has come since the early Switch Transformer days. Qwen3 includes two MoE models that embody several ideas discussed above — and push them further.

Architecture

| Model | Layers | Heads (Q/KV) | Total Experts | Activated | Total Params | Active Params |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-30B-A3B | 48 | 32/4 | 128 | 8 | 30B | 3B |
| Qwen3-235B-A22B | 94 | 64/4 | 128 | 8 | 235B | 22B |

Several design choices are worth noting:

  1. Fine-grained expert segmentation (from DeepSeekMoE): Instead of 8 large experts with top-2 routing (like Mixtral), Qwen3 uses 128 small experts with top-8 routing. The intuition: more experts with finer granularity allows more flexible combinations: \(\binom{128}{8} \approx 1.4 \times 10^{12}\) possible expert sets versus \(\binom{8}{2} = 28\) for Mixtral. Each individual expert is small, but the combinatorial diversity is enormous.

  2. No shared experts: Unlike Qwen2.5-MoE (and DeepSeek-V2/V3 which use shared experts that are always activated), Qwen3 drops shared experts entirely. This is a bold move — shared experts were introduced to ensure a baseline of common knowledge across all tokens. Qwen3 apparently found that with 128 fine-grained experts and top-8 routing, the need for an explicit shared component disappears.

  3. Global-batch load balancing loss: Rather than the per-sample auxiliary loss from Switch Transformers, Qwen3 uses a global-batch variant that computes load balance across the entire batch. This encourages expert specialization while avoiding the per-example noise that can destabilize training.

The efficiency story is dramatic: Qwen3-235B has 235B total parameters but only 22B active per token — a 10.7x ratio. Compare with Mixtral 8x7B’s ~3.9x ratio (47B total / 12B active). The trend is clear: modern MoEs are pushing toward higher total-to-active ratios with more fine-grained experts.
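Both comparisons are easy to recompute; note the parameter ratios below use the rounded headline numbers, so they are approximate:

```python
from math import comb

# Routing combinations available to each token:
qwen3_combos = comb(128, 8)    # fine-grained: 128 experts, top-8 (~1.4e12)
mixtral_combos = comb(8, 2)    # coarse: 8 experts, top-2 (28)

# Total-to-active parameter ratios from the headline figures:
qwen3_ratio = 235 / 22         # ~10.7x
mixtral_ratio = 47 / 12        # ~3.9x
```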

Benchmark Results

Qwen3-235B-A22B outperforms dense models with far more active parameters:

| Benchmark | Qwen2.5-72B (dense) | DeepSeek-V3 (MoE) | Qwen3-235B-A22B (MoE) |
| --- | --- | --- | --- |
| MMLU | 86.1 | 87.2 | 87.8 |
| MMLU-Pro | 58.1 | 59.8 | 68.2 |
| MATH | 62.1 | 62.6 | 71.8 |
| EvalPlus (code) | 65.9 | 63.8 | 77.6 |
| BBH | 86.3 | 86.2 | 88.9 |

The smaller Qwen3-30B-A3B is equally impressive — with only 3B active parameters, it matches or exceeds Qwen2.5-14B (a dense model with 4.7x more active compute) on most benchmarks.

These results validate the core MoE thesis: given a fixed inference budget, sparse models consistently outperform dense ones. The gap has only widened as MoE techniques have matured.

Open Source and Future Directions

Training frameworks:

Released models (as of Dec 2023):

Exciting directions:

  • Distillation: Distilling sparse MoEs back to dense models with fewer parameters but similar quality
  • Quantization: QMoE (Oct 2023) compresses the 1.6T Switch Transformer from 3.2TB to just 160GB by quantizing to less than 1 bit per parameter
  • Model merging: Exploring expert aggregation techniques and their impact on inference time

References