Autoregressive Embedding Models: Training, Attention, and Performance

This post surveys recent papers (2024--2026) on autoregressive embedding models — using decoder-only LLMs as text embedding backbones. It addresses three questions: How are state-of-the-art embedding models trained? What are the attention distribution characteristics over different tokens? What performance do they achieve? Papers covered include Qwen3-Embedding, NV-Embed, LLM2Vec, GritLM, E5-Mistral, KaLM-Embedding, Causal2Vec, AutoRegEmbed, MoE-Embedding, HTP, GEM, and DiffEmbed.

I. How Are SOTA Embedding Models Trained?

Multi-Stage Progressive Training

Current SOTA models almost universally adopt multi-stage training, progressively refining representation quality from coarse to fine:

Stage | Purpose | Typical Data Scale | Representative Methods
Stage 0: LLM pretraining | Strong language understanding foundation | Trillions of tokens | Inherited from Qwen3/Mistral/LLaMA
Stage 1: Weakly-supervised contrastive pretraining | Learn general text representations | ~150M–470M pairs | Qwen3-Emb (150M), KaLM (470M)
Stage 2: Supervised fine-tuning | Task-specific refinement | ~2M–19M pairs | All SOTA models
Stage 3: Distillation / merging | Improve generalization | Same as Stage 2 | KaLM-V2.5 (KL distillation), Qwen3 (SLERP merging)

A key finding: Wang et al. (2024) showed that for sufficiently large LLMs, Stage 1 contrastive pretraining is nearly useless — it improves smaller models like XLM-R by +8.2 points but has negligible effect on Mistral-7B. The LLM’s autoregressive pretraining already provides a strong representational foundation. However, for smaller models (e.g., Qwen2-0.5B), Stage 1 still contributes significantly (~3 MMTEB points for Qwen3-Emb-0.6B).


Attention Mechanism: Two Camps

Camp 1: Remove the causal mask. Replace the causal attention mask with fully bidirectional attention during training. Representatives include NV-Embed (Lee et al., 2024), LLM2Vec (BehnamGhader et al., 2024), KaLM (Zhao et al., 2025), and GritLM (Muennighoff et al., 2024) (in embedding mode). The advantage is simplicity and strong embedding quality; the downside is loss of generation capability (except GritLM, which uses bidirectional for embedding and causal for generation).

Camp 2: Keep the causal mask. Bypass the causal attention limitation through clever designs. Representatives include Qwen3-Emb (Zhang et al., 2025), E5-Mistral (Wang et al., 2024), Causal2Vec (Lin et al., 2025), GEM (Zhang et al., 2025), and HTP (Ding et al., 2025). The advantage is preserved generation capability and no pretrain-finetune attention mismatch; the downside is that extra mechanisms are needed to compensate for the information flow deficit.

NV-Embed’s ablation clearly shows the value of bidirectional attention:

Attention | Pooling | MTEB Avg
Causal | EOS | 66.50
Bidirectional | EOS | 67.85 (+1.35)
Causal | Latent-attn | 68.47
Bidirectional | Latent-attn | 69.32 (+0.85)

GritLM’s ablation is even more dramatic: bidirectional vs. causal differs by +4 MTEB points on embedding tasks (64.0 vs 60.0).
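
The difference between the two camps comes down to a single mask. A minimal numpy sketch of the two patterns (illustrative shapes, not any paper's actual implementation):

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Lower-triangular mask: token i may attend to tokens 0..i only."""
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n: int) -> np.ndarray:
    """Full mask: every token attends to every other token."""
    return np.ones((n, n), dtype=bool)

n = 4
c, b = causal_mask(n), bidirectional_mask(n)
# Under the causal mask the first token sees only itself, while the
# last token sees the whole sequence; the full mask removes that asymmetry.
print(c.sum(axis=1))  # [1 2 3 4]: visible-context size per position
print(b.sum(axis=1))  # [4 4 4 4]
```

Camp 1 trains with `bidirectional_mask`; Camp 2 keeps `causal_mask` and compensates elsewhere (prepended context tokens, summary tokens, etc.).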


Pooling Strategies

Strategy | Representatives | Best When
EOS / Last-token pooling | Qwen3-Emb, E5-Mistral | Keeping causal mask — information naturally converges to the final token
Mean pooling | KaLM, LLM2Vec, GritLM, DiffEmbed | After removing causal mask — all token representations are equally contextualized
Latent attention pooling | NV-Embed | Learnable cross-attention, outperforms both alternatives
Contextual + EOS concat | Causal2Vec | Combining global compressed info with end-of-sequence info

The Latent Attention module of NV-Embed (Lee et al., 2024) is a learnable cross-attention layer: decoder outputs serve as Q, a set of 512 learnable latent vectors serves as K=V, and the result passes through an MLP followed by mean pooling. It achieves the best results among all pooling methods (MTEB 72.31 vs. mean 71.71 vs. EOS 71.63).
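
The three main strategies can be sketched as follows. The latent-attention function is a deliberately simplified single-head stand-in for NV-Embed's module (the MLP is omitted, and all shapes are illustrative):

```python
import numpy as np

def last_token_pool(h, mask):
    """h: (n, d) hidden states; mask: (n,) 1 for real tokens, 0 for padding.
    Returns the hidden state of the last non-padding (EOS) token."""
    last = int(mask.nonzero()[0][-1])
    return h[last]

def mean_pool(h, mask):
    """Mask-aware mean over real tokens only."""
    m = mask[:, None]
    return (h * m).sum(axis=0) / m.sum()

def latent_attention_pool(h, latents):
    """Simplified single-head sketch of latent attention: hidden states
    are queries, learnable latents are keys/values, then mean pooling."""
    d = latents.shape[1]
    scores = h @ latents.T / np.sqrt(d)           # (n, num_latents)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over latents
    mixed = attn @ latents                        # (n, d)
    return mixed.mean(axis=0)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))                       # 5 tokens, dim 8
mask = np.array([1, 1, 1, 1, 0])                  # last position is padding
latents = rng.normal(size=(4, 8))                 # 4 learnable latent vectors
for pooled in (last_token_pool(h, mask), mean_pool(h, mask),
               latent_attention_pool(h, latents)):
    assert pooled.shape == (8,)
```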

The following figure illustrates the attention masks and pooling strategies:


Which Layer's Representation?

Nearly all SOTA embedding models extract representations from the last layer of the transformer:

Model | Layer | Pooling
Qwen3-Emb, E5-Mistral | Last layer | EOS token
NV-Embed | Last layer | Latent Attention
LLM2Vec, KaLM, GritLM, DiffEmbed | Last layer | Mean pooling
HTP | Near-last (2nd or 3rd from end) | Mean pooling
MoE-Embedding | All layers (routing weights) | Last token

Two notable exceptions: HTP uses a near-last but not final layer (e.g., layer 29/32 for Mistral-7B), chosen per-model via validation. MoE-Embedding concatenates routing weights from all layers (e.g., 28 layers × 64 experts = 1792-dim vector).
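
Given the stacked per-layer hidden states that transformer libraries typically expose (embedding output first, then one entry per layer), these choices reduce to an index. A schematic numpy sketch; the HTP-style `offset` is a hypothetical parameter chosen per model on validation data:

```python
import numpy as np

def select_layer(hidden_states: np.ndarray, offset: int = 0) -> np.ndarray:
    """hidden_states: (num_layers + 1, seq_len, dim), embedding output first.
    offset=0 -> last layer (most SOTA models);
    offset=2 or 3 -> near-last early exit (HTP-style)."""
    return hidden_states[-1 - offset]

rng = np.random.default_rng(0)
hs = rng.normal(size=(33, 16, 64))       # e.g. a 32-layer model, 16 tokens

last = select_layer(hs)                  # Qwen3-Emb / NV-Embed / LLM2Vec style
near_last = select_layer(hs, offset=3)   # HTP-style, e.g. layer 29/32
assert last.shape == near_last.shape == (16, 64)
assert np.array_equal(near_last, hs[29])
```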


Why not use intermediate layers? The "Layer by Layer" finding

Ren et al. (2025), "Layer by Layer: Uncovering Hidden Representations in Language Models" (ICML 2025) systematically studied layer selection for embeddings and found that intermediate layers outperform the last layer by 2%--16% on average across 32 MTEB tasks in zero-shot settings. The optimal layers cluster around mid-depth of the network.

The reason: the last layer is optimized for next-token prediction and is biased toward the next output token's semantics rather than global sentence semantics. Intermediate layers strike a better balance between information preservation and noise filtering. This effect is architecture-agnostic — observed in decoder-only (Pythia, LLaMA3), encoder (BERT), and state-space models (Mamba).

However, this finding primarily applies to zero-shot / unfine-tuned models. In practice, SOTA embedding models fine-tune with contrastive learning, which reshapes the last layer to encode global rather than local semantics. Combined with good pooling strategies (latent attention, bidirectional mean pooling), the last layer's "next-token bias" is effectively mitigated — which is why most production models still use it.

HTP's near-last-layer early exit can be seen as a compromise: it avoids the very final layer's next-token specialization while staying close enough to preserve the deep representations.


Loss Functions

All models use InfoNCE contrastive loss as the core:

\[\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(q, d^+)/\tau)}{\sum_j \exp(\text{sim}(q, d_j)/\tau)}\]

Several improvements build on top of this:

  • Focal-InfoNCE (KaLM-V2, Zhao et al., 2025): Reweight with \(w_i = (1-p_i)^\gamma\) to focus on hard samples. This is KaLM’s single largest performance contributor.
  • False negative mitigation (Qwen3-Emb, Zhang et al., 2025): mask out in-batch negatives whose similarity score comes within a 0.1 margin of the positive's score, zeroing their gradient contribution.
  • Conditional distribution alignment (AutoRegEmbed, Deng et al., 2025): Instead of cosine similarity, align \(p(\cdot \vert e_q)\) and \(p(\cdot \vert e_d)\) — conditional probability distributions, better matching the LLM’s generative nature.
  • Information compression (AutoRegEmbed, Deng et al., 2025): Use a frozen decoder to reconstruct the target document, forcing compressed tokens to capture global semantics.
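
A minimal numpy sketch of InfoNCE with Qwen3-style false-negative masking (the 0.1 margin follows the bullet above; the vector values and the `info_nce` signature are illustrative):

```python
import numpy as np

def info_nce(q, docs, pos_idx=0, tau=0.05, margin=0.1):
    """q: (d,) query; docs: (k, d) candidates, docs[pos_idx] the positive.
    Vectors are L2-normalized, so dot products are cosine similarities.
    In-batch negatives whose score comes within `margin` of the
    positive's are dropped as suspected false negatives."""
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = docs @ q
    keep = sims < sims[pos_idx] - margin   # drop suspected false negatives
    keep[pos_idx] = True                   # the positive is always kept
    logits = sims[keep] / tau
    logits = logits - logits.max()         # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    pos_in_kept = int(keep[:pos_idx].sum())
    return -np.log(probs[pos_in_kept])

q = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [1.0, 0.0, 0.0],    # positive
    [0.99, 0.1, 0.0],   # near-duplicate: a likely false negative
    [0.0, 1.0, 0.0],    # a true negative
])
masked = info_nce(q, docs)               # near-duplicate is filtered out
unmasked = info_nce(q, docs, margin=0.0)
print(masked < unmasked)                 # True: masking removes the
                                         # near-duplicate's loss share
```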

Details: Synthetic data engineering and other key techniques

Synthetic data is a critical driver across all SOTA models:

  • E5-Mistral: GPT-4 generates 500K synthetic pairs across 93 languages with 150K unique instructions.
  • Qwen3-Emb: Qwen3-32B generates ~150M synthetic pairs, controlling query type / length / difficulty / language diversity.
  • NV-Embed: Mixtral-8x22B generates 120K synthetic samples covering 60K synthetic tasks.
  • KaLM-V2: Qwen2-72B generates 550K persona-based synthetic data.

Instruction-aware embeddings. Nearly all SOTA models prepend "Instruct: {task} Query: {q}" to queries; documents receive no instruction prefix. An ablation in E5-Mistral (Wang et al., 2024) shows that instructions contribute +4.2 points.

Hard negative mining. NV-Embed (Lee et al., 2024) uses a teacher model to mine hard negatives with threshold max_neg_score < pos_score * 0.95 to filter false negatives, contributing +2.30 retrieval points.

Model merging / SLERP. Qwen3-Emb (Zhang et al., 2025) merges multiple SFT checkpoints via spherical linear interpolation, contributing +1.77 MMTEB points.

Two-stage instruction tuning. NV-Embed trains retrieval first (in-batch negatives ON), then multi-task (in-batch negatives OFF — same-class samples become false negatives). Order matters: reversing drops retrieval by 0.74 points.

LoRA efficient fine-tuning. E5-Mistral and Causal2Vec show that LoRA rank 16 suffices — no full-parameter fine-tuning needed.
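
Two of these techniques reduce to a few lines. Below is a sketch of NV-Embed's false-negative filter (the 0.95 ratio is from the text) and of SLERP merging as Qwen3-Emb applies it, shown here on a single weight vector; real checkpoint merging would loop over all parameters:

```python
import numpy as np

def filter_negatives(pos_score, neg_scores, ratio=0.95):
    """Keep only negatives scoring below ratio * positive score
    (NV-Embed's max_neg_score < pos_score * 0.95 rule)."""
    neg_scores = np.asarray(neg_scores)
    return neg_scores[neg_scores < pos_score * ratio]

def slerp(w0, w1, t):
    """Spherical linear interpolation between two checkpoints'
    weight vectors w0, w1 at interpolation factor t in [0, 1]."""
    w0, w1 = np.asarray(w0, float), np.asarray(w1, float)
    n0, n1 = np.linalg.norm(w0), np.linalg.norm(w1)
    cos = np.clip(np.dot(w0 / n0, w1 / n1), -1.0, 1.0)
    theta = np.arccos(cos)
    if theta < 1e-8:                  # nearly parallel: fall back to lerp
        return (1 - t) * w0 + t * w1
    s = np.sin(theta)
    return (np.sin((1 - t) * theta) / s) * w0 + (np.sin(t * theta) / s) * w1

kept = filter_negatives(pos_score=0.8, neg_scores=[0.79, 0.75, 0.3])
print(kept)    # [0.75 0.3]: 0.79 >= 0.8 * 0.95 is treated as a false negative

merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
print(merged)  # ~[0.7071, 0.7071]: stays on the unit sphere, unlike plain lerp
```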


Frontier Directions

Diffusion LM as embedding backbone. DiffEmbed (Zhang et al., 2025) uses Dream-7B (a masked diffusion model based on Qwen2.5-7B) with natively bidirectional attention. It outperforms AR models by ~20% on long-document retrieval and ~8% on reasoning-intensive tasks.

MoE routing weights as embeddings. MoE-Embedding (Li & Zhou, 2024) discovers that router weights in MoE models serve as off-the-shelf embeddings without fine-tuning, complementary to hidden states (AMI only 0.29). Combining them yields +7.94 points.

Hierarchical Token Prepending. HTP (Ding et al., 2025) requires no training — it creates backward information flow in causal models through hierarchical segment summary tokens. It can even improve already fine-tuned NV-Embed-v2.
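
The MoE-Embedding idea can be sketched schematically: take each layer's routing distribution at the last token and concatenate across layers. The softmax router below is a generic stand-in, not any specific model's code, and the shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def routing_weight_embedding(router_logits):
    """router_logits: (num_layers, seq_len, num_experts) raw router scores.
    Returns a (num_layers * num_experts,) embedding built from the
    last token's routing distribution at every layer."""
    probs = softmax(router_logits, axis=-1)   # (L, n, E) routing weights
    per_layer = probs[:, -1, :]               # (L, E) last-token routing
    return per_layer.reshape(-1)              # concatenate across layers

rng = np.random.default_rng(0)
logits = rng.normal(size=(28, 10, 64))        # 28 layers, 10 tokens, 64 experts
emb = routing_weight_embedding(logits)
assert emb.shape == (28 * 64,)                # 1792-dim, as in the text
```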


II. Attention Distribution Characteristics

The Core Problem: Information Flow Bottleneck in Causal Attention

Causal attention in decoder-only models creates two fundamental deficiencies:

Unidirectional information flow. The token at position \(i\) can only see tokens \(1, \ldots, i\), with no access to information from \(i+1, \ldots, n\). Early tokens are severely under-contextualized — the first token’s representation contains zero context from the rest of the sentence. Mean pooling quality suffers because early tokens contribute incomplete representations.

Information over-compression into the final token. All semantic information must “flow forward” to the end of the sequence. The EOS token is the only position that sees the complete sequence. AutoRegEmbed (Deng et al., 2025) further points out that the hidden state at EOS encodes the next token’s probability distribution (local semantics), not the global semantics of the input text.


Over-Squashing: Why Causal Attention Gets Worse with Length

The term “over-squashing” originated in graph neural networks (Alon & Yahav, 2021; Topping et al., 2022): when exponentially growing receptive fields must pass messages through fixed-dimension channels, information from distant nodes gets “squashed” and lost. Barbero et al. (2024) formally bridged this theory to causal transformers by observing that a causal attention pattern defines a directed acyclic graph — a lower-triangular adjacency matrix — so information propagation follows the same path-counting framework.

The mixing matrix. For a causal transformer with \(L\) layers, define the mixing matrix:

\[A = M^{(L-1)} \cdot M^{(L-2)} \cdots M^{(0)}, \quad \text{where } M^{(l)} = \frac{1}{r_l}\left(\frac{\alpha^{(l)}}{\beta_1^{(l)}} + I\right)\]

Here \(\alpha^{(l)}_{j,i}\) are the softmax attention weights at layer \(l\) (zero for \(i > j\) due to causal masking), and the identity \(I\) comes from the residual connection. \(A\) is lower-triangular and row-stochastic. HTP (Ding et al., 2025) proved the following bounds:

Last-token pooling — sensitivity depends on a single entry:

\[\left\lVert \frac{\partial y_n}{\partial v_i^{(0)}} \right\rVert \leq K_L \cdot A_{n,i}\]

Mean pooling — sensitivity depends on the entire column:

\[\left\lVert \frac{\partial \bar{y}}{\partial v_i^{(0)}} \right\rVert \leq \frac{K_L}{n} \sum_{j=1}^{n} A_{j,i}\]

The column sum captures all outgoing influence from token \(i\), not just the single path to position \(n\). This is why mean pooling is structurally more robust to over-squashing than last-token pooling.

Three forces conspire to make \(A_{n,i}\) small for early tokens:

  1. Attention dilution. Softmax distributes weight across all visible tokens. Average weight per token is \(\sim 1/n\), shrinking as context grows.
  2. Multiplicative decay. \(A_{n,i}\) is a sum over products of \(L\) attention weights along causal paths. Products of \(L\) small numbers decay exponentially.
  3. Attention sinks. Xiao et al. (2023) found that in LLaMA-2-7B, attention to the first token exceeds 50% of total attention across most layers — despite carrying no semantic content (replacing it with \n tokens yields comparable perplexity). This wastes attention budget: with >50% consumed by a semantically empty sink, even less remains for informative early tokens, further shrinking their \(A_{n,i}\).
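
A toy numerical sketch of the mixing matrix under the simplest assumption, uniform causal attention with the residual branch omitted (a simplified stand-in for HTP's construction, not the exact matrix):

```python
import numpy as np

def uniform_causal_attention(n):
    """Row-stochastic lower-triangular attention: position j spreads
    its weight uniformly over the visible tokens 0..j."""
    a = np.tril(np.ones((n, n)))
    return a / a.sum(axis=1, keepdims=True)

def mixing_matrix(n, depth):
    """Depth-layer product of the per-layer attention matrices."""
    return np.linalg.matrix_power(uniform_causal_attention(n), depth)

n, depth = 32, 16
A = mixing_matrix(n, depth)
assert np.allclose(A.sum(axis=1), 1.0)    # A stays row-stochastic
assert np.allclose(A, np.tril(A))         # ...and lower-triangular

# Multiplicative decay: the only causal path from the last token to the
# last-position readout uses an attention weight of 1/n at every layer,
# so that mixing-matrix entry is exactly (1/n)^depth -- vanishingly small.
assert np.isclose(A[-1, -1], (1 / n) ** depth)

# The theorem's two bounds for an early token i: last-token readout is
# bounded by the single entry A[n-1, i]; mean readout by the scaled
# column sum, which aggregates all of token i's outgoing influence.
i = 1
print(A[-1, i], A[:, i].sum() / n)
```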

Self-attention as a low-pass filter: the spectral perspective on length degradation

Zhou et al. (2025) (ACL 2025) provided a complementary spectral analysis. They decomposed token representations into frequency components via DFT and showed that any attention matrix A = softmax(P) acts as a low-pass filter:

$$\lim_{t \to \infty} \frac{\lVert \text{HC}[A^t z] \rVert_2}{\lVert \text{DC}[A^t z] \rVert_2} = 0$$

where DC is the mean (zero-frequency) component and HC is all higher-frequency components. Since the attention matrix is row-stochastic, its largest eigenvalue is 1 and corresponds to the DC direction; by Perron-Frobenius, all other eigenvalues have magnitude < 1, so repeated attention application exponentially damps discriminative features.

The filter rate depends on length. Let \(\sigma_a\) be the largest singular value of \(\text{HC}[A]\). Under Gaussian assumptions on the Q, K projections:

$$\sigma_a \leq \sqrt{\frac{n}{2\sqrt{1 + e^{-2\sigma_s^2}} \cdot (n-1)^{3/2} + 1}}$$

\(\sigma_a\) monotonically decreases as sequence length \(n\) increases. Longer sequences → stronger low-pass filtering → faster destruction of high-frequency (discriminative) information. This explains why embeddings of long texts collapse: they converge toward their DC component, and since natural language has relatively consistent mean embeddings, all long-text embeddings crowd into a narrow region with abnormally high pairwise cosine similarity.

Barbero et al. (2024) ("Transformers need glasses!") pushed this further, showing that representational collapse occurs even for distinct sequences: as n → ∞, the last-token representations of two different sequences converge. In practice with bf16 precision, collapse occurs at ~50 tokens for repeated digits, and Gemini 1.5 fails to copy the last element of a sequence at length ~300.

The connection between over-squashing and low-pass filtering is natural: both describe the same information loss from different perspectives — over-squashing through the Jacobian of the mixing matrix (gradient-based), low-pass filtering through eigenvalue decay of the attention matrix (spectral). They predict the same outcome: longer sequences lose more information, and causal models are worse than bidirectional ones because the triangular structure restricts information flow paths.
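
The low-pass effect is easy to reproduce numerically: apply a generic row-stochastic attention matrix repeatedly and watch the HC/DC norm ratio vanish (a toy sketch; the matrix below is random, not a trained model's attention):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dc_hc(z):
    """Split token representations z (n, d) into the zero-frequency DC
    part (every token replaced by the mean token) and the
    high-frequency remainder HC."""
    dc = np.repeat(z.mean(axis=0, keepdims=True), z.shape[0], axis=0)
    return dc, z - dc

n, d = 16, 8
A = softmax(rng.normal(size=(n, n)))  # a generic row-stochastic attention matrix
z = rng.normal(size=(n, d))           # initial token representations

ratios = []
for _ in range(20):
    dc, hc = dc_hc(z)
    ratios.append(np.linalg.norm(hc) / np.linalg.norm(dc))
    z = A @ z                         # one round of attention mixing

# Repeated application damps the discriminative HC component toward zero:
# token representations collapse onto their mean.
assert ratios[-1] < ratios[0] * 1e-2
```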


Empirical Findings

LLM2Vec: Causal vs. bidirectional representation similarity. LLM2Vec (BehnamGhader et al., 2024) analyzed the cosine similarity between per-layer representations in causal and bidirectional modes:

  • LLaMA-2-7B and S-LLaMA-1.3B: similarities are low across nearly all layers, indicating that enabling bidirectional attention drastically changes internal representations — hence MNTP adaptation training is needed.
  • Mistral-7B is the anomaly: similarities remain ~0.9+ throughout all layers. The authors speculate Mistral may have been pretrained with some form of bidirectional attention (e.g., prefix LM). This explains why Mistral is the only model that benefits from bidirectional attention without any training (+4.4 points), while LLaMA-3-8B collapses (-13.4 points).

HTP: Theoretical analysis of over-squashing. HTP (Ding et al., 2025) provides a formal analysis of the information flow bottleneck (Theorem 3.1): for a causal Transformer, the gradient of last-token readout \(\lVert \partial y_n / \partial v_i^{(0)} \rVert\) depends on a single entry \(A_{n,i}\) of the mixing matrix, which decays rapidly with depth. Mean-token readout aggregates the entire column \(\sum_j A_{j,i}\), making it more robust to over-squashing.

Experimental validation: masking the backward attention in Echo Embeddings (second pass → first pass) causes STS to plummet from 68.00 to 54.25; masking forward attention has virtually no effect (68.00 → 67.89). Backward information flow is the key to embedding quality.

DiffEmbed: The most direct attention-direction ablation. DiffEmbed (Zhang et al., 2025) tested the effect of removing backward attention on models trained with bidirectional attention:

Task | Mistral (bidir) | Mistral (causal only) | DiffEmbed (bidir) | DiffEmbed (causal only)
TheoremQA (question) | 33.7 | 9.6 (-24.1) | 48.3 | 0.7 (-47.6)
TheoremQA (theorem) | 32.4 | 4.0 (-28.4) | 38.9 | 1.1 (-37.8)

DiffEmbed (natively bidirectional pretraining) depends on backward attention far more than AR models do. The more complex the task (reasoning, long documents), the more critical bidirectional attention becomes; the difference is small for short-text STS.


More empirical findings: Causal2Vec L2 norms and MoE routing weights

Causal2Vec: EOS vs. Contextual token. Causal2Vec (Lin et al., 2025) analyzed the L2 norms of EOS and Contextual token representations: EOS consistently shows higher L2 norms, indicating greater influence in the concatenated embedding. A single Contextual token suffices — increasing to 2/4/8 tokens actually degrades performance.

MoE-Embedding: Router Weights vs. Hidden States. MoE-Embedding (Li & Zhou, 2024) found that MoE routing weights (RW) and hidden states (HS) encode fundamentally different information (AMI=0.29, Jaccard=0.06). RW is more robust to prompt variation (cross-prompt Spearman correlation 0.63 vs HS's 0.52). RW captures "intermediate reasoning choices" (how the model processes input), HS captures "final prediction output" — the two are complementary.


Summary of Attention Characteristics

Characteristic | Explanation
Information converges to the final token | Under causal attention, EOS is the only position seeing the full sequence — it becomes the natural information sink
Early tokens have poor representations | Lacking subsequent context, their representations are unsuitable for embedding
Backward information flow is the critical missing piece | Both HTP and DiffEmbed prove: backward attention (later→earlier) is essential; forward attention (earlier→later) contributes marginally
Bidirectional conversion requires adaptation training | Except Mistral, directly enabling bidirectional attention destroys AR pretrained representations (LLaMA-3 drops 13.4 points)
More complex tasks need bidirectionality more | Short-text STS shows small differences; long-document retrieval and reasoning-intensive tasks show huge gaps (up to 47.6 points)
Instruction tokens influence but don’t participate in pooling | NV-Embed and GritLM exclude instruction tokens from pooling, but instructions influence other token representations via attention

The Causal vs. Bidirectional Debate: Is There a Winner?

One of the central tensions in this field is whether to remove the causal attention mask inherited from autoregressive pretraining, or to keep it and work around its limitations. Controlled ablations and leaderboard rankings tell different stories — and reconciling them reveals a nuanced picture.


Controlled Ablations: Bidirectional Wins

Under strict ablation (same model, same data, only the attention mask differs), removing the causal mask consistently helps:

Paper | Experiment | Gain
NV-Embed (Lee et al., 2024) | Causal → Bidirectional (EOS pooling) | +1.35 MTEB
NV-Embed | Causal → Bidirectional (Latent-attn) | +0.85 MTEB
GritLM (Muennighoff et al., 2024) | Causal → Bidirectional (embedding mode) | +4.0 MTEB
KaLM (Zhao et al., 2025) | Remove causal mask | +0.39 MTEB

The conclusion from ablations is clear: all else being equal, bidirectional attention produces better embeddings. This is expected — embedding fundamentally requires understanding the whole input, and causal attention prevents early tokens from incorporating future context.


The Leaderboard Paradox: Causal Models Lead

Yet the actual leaderboards tell a different story:

Model | Attention | MTEB Eng v2
Qwen3-Embedding-8B | Causal | 75.22
Qwen3-Embedding-4B | Causal | 74.60
NV-Embed-v2 | Bidirectional | 69.81

Causal2Vec (Lin et al., 2025) also consistently outperforms the bidirectional LLM2Vec by +0.78 to +1.30 MTEB points across all tested base models, while keeping the causal mask intact.

How can causal models dominate despite the ablation evidence?


Reconciling the Evidence

Three factors explain the paradox:

1. Pretrain-finetune attention mismatch. LLMs are pretrained on trillions of tokens with causal attention. Their internal representations are optimized for unidirectional information flow. Switching to bidirectional attention disrupts these learned representations — sometimes catastrophically. LLM2Vec (BehnamGhader et al., 2024) documented this clearly:

Model | Causal baseline | Bidirectional (no adaptation) | Change
Mistral-7B | 42.46 | 46.86 | +4.40 (helps)
LLaMA-3-8B | 43.98 | 30.56 | -13.42 (collapses)

Mistral-7B is the anomaly — its causal and bidirectional representations have cosine similarity ~0.9+ across all layers, suggesting it may have been pretrained with some form of bidirectional attention (e.g., prefix LM). For most models, adaptation training (e.g., LLM2Vec’s MNTP) is essential, but Causal2Vec argues that even with adaptation, the mismatch cannot be fully resolved — the model’s semantic extraction abilities, shaped by causal pretraining, are partially compromised.

2. Scale and data engineering compensate. Qwen3-Embedding uses ~150M synthetic pretraining pairs + ~19M supervised pairs + SLERP checkpoint merging — far exceeding the data scale of bidirectional models like NV-Embed. With sufficient model scale and training data, the EOS token under causal attention can aggregate enough global information to rival bidirectional representations on standard benchmarks.

3. Task complexity determines the gap size. This is the most important nuance. DiffEmbed (Zhang et al., 2025) showed that the advantage of bidirectional attention scales with task complexity:

Task Type Bidirectional Advantage
Short-text STS Nearly zero
Standard MTEB (mixed) Small (+1 to +4 points)
Long-document retrieval (LongEmbed) ~20%
Reasoning-intensive retrieval (TheoremQA) ~8%; removing backward attention: 48.3 → 0.7

For short texts, the EOS token sees the full sequence under causal attention — bidirectionality adds little. For long documents and tasks requiring logical reasoning across the full text, bidirectional attention is qualitatively superior.


Practical Recommendations

There is no definitive winner — the right choice depends on the use case:

Scenario Recommendation Rationale
Need both generation and embedding Keep causal GritLM / GEM approach: switch attention mode per task
General-purpose embedding SOTA Either works Qwen3-Emb (causal) and NV-Embed (bidirectional) both achieve top results; scale and data matter more
Long-document retrieval Strongly prefer bidirectional DiffEmbed: ~20% advantage over AR models
Reasoning-intensive tasks Strongly prefer bidirectional Removing backward attention causes near-total collapse on TheoremQA
Small models (<1B) Remove causal mask Small models cannot compensate via scale; KaLM-V2.5 (0.5B, bidirectional) surpasses 7B causal models
Inference efficiency priority Keep causal + workarounds Causal2Vec: 85% sequence length reduction, 82% inference speedup

The convergence trend: both camps are moving toward each other

The most interesting trend is that the two camps are converging. Methods that keep the causal mask are increasingly finding ways to simulate bidirectional information flow:

  • Causal2Vec injects global context via a BERT-encoded Contextual token prepended to the text — all subsequent tokens can attend to it under causal masking, approximating backward information flow.
  • HTP creates hierarchical segment summary tokens that are rewired to the beginning, providing backward pathways without modifying the attention mask.
  • GEM uses an attention-mask bottleneck with special tokens that force prefix compression, keeping causal attention while creating an information funnel.
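
All three workarounds share one trick: place a globally informed vector at the front of the sequence, where every later position can see it under the causal mask. A toy illustration (the "contextual" vector here is just a mean of the token vectors, standing in for Causal2Vec's BERT-encoded summary; identity projections, untrained weights):

```python
import numpy as np

def causal_attention(x):
    # Standard causal self-attention with identity projections (toy).
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x, w

rng = np.random.default_rng(1)
tokens = rng.normal(size=(5, 8))
# Hypothetical contextual vector summarizing the whole text.
ctx = tokens.mean(axis=0, keepdims=True)

_, w = causal_attention(np.vstack([ctx, tokens]))
# Every position after the prefix puts nonzero attention on it
# (column 0), so information about the full sequence reaches all
# tokens without modifying the causal mask itself.
assert (w[1:, 0] > 0).all()
```

The causal mask is untouched; backward information flow is simulated by routing it through the prefix position.
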

Meanwhile, methods that remove the causal mask are finding ways to preserve generation capability:

  • GritLM proved that a single model can use bidirectional attention for embedding and causal attention for generation with zero performance trade-off on either task.
  • GEM preserves generation capability (MMLU) while adding embedding capability, with only a small generation quality drop.

The future likely lies not in choosing one camp, but in better fusion strategies — models that seamlessly switch between attention modes or architectures that natively support both information flow patterns.


III. Performance of SOTA Embedding Models

Embedding Benchmarks Overview

The primary benchmarks used to evaluate embedding models are:

Benchmark Scope Tasks Reference
MTEB Text embedding (English) 8 task types: Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, Summarization, Bitext Mining. Originally 56 datasets (v1); revised in v2. Muennighoff et al. (2022)
MMTEB Multilingual text embedding Same task types as MTEB, expanded to 250+ datasets across 100+ languages. Enevoldsen et al. (2025)
MMEB Multimodal embedding (image+text) 36 datasets across 20 embedding tasks spanning classification, VQA, retrieval, and visual grounding. The first unified multimodal embedding benchmark analogous to MTEB. Jiang et al. (2024)

MTEB is the de facto standard for text embedding evaluation — nearly all papers in this survey report MTEB scores. The leaderboard is hosted on HuggingFace. MMEB extends this paradigm to vision-language, evaluating models like VLM2Vec and Qwen3-VL-Embedding that map text, images, and video into a unified embedding space.

The overall MTEB score is a simple unweighted average of the main metric across all datasets. Since retrieval has 15 datasets while summarization has only 1, retrieval-heavy models get a disproportionate boost — a known limitation.
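
The averaging effect is easy to quantify. A toy calculation (hypothetical base score and a two-task universe rather than the real dataset mix): a +2-point gain on every retrieval dataset moves the unweighted overall mean 15 times as far as the same gain on the lone summarization dataset.

```python
# Hypothetical dataset counts mirroring MTEB v1's retrieval-heavy mix.
counts = {"Retrieval": 15, "Summarization": 1}
base = 60.0  # illustrative per-dataset base score

def overall(per_task_gain):
    # Unweighted mean over datasets (the MTEB aggregation rule).
    scores = []
    for task, n in counts.items():
        scores += [base + per_task_gain.get(task, 0.0)] * n
    return sum(scores) / len(scores)

gain_retrieval = overall({"Retrieval": 2.0}) - overall({})
gain_summ = overall({"Summarization": 2.0}) - overall({})
assert gain_retrieval == 15 * gain_summ  # 1.875 vs 0.125 points overall
```

This is why models tuned heavily for retrieval tend to climb the aggregate leaderboard faster than their per-task profiles alone would suggest.
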


How MTEB evaluates each task type — with a concrete retrieval example

Each task type uses a different evaluation protocol. The embedding model only produces vectors — a downstream metric measures how well those vectors capture semantics:

Task Type Input Metric Example Dataset
Classification Text → embedding → logistic regression probe Accuracy Banking77: "What currencies is an exchange rate calculated in?" → label exchange_rate
Clustering Set of texts → embeddings → k-means V-measure ArxivClustering: group paper abstracts by field (math, cs, ...)
Pair Classification Two texts → cosine similarity → threshold Average Precision TwitterURL: "The new iPhone has a stunning display" vs "Apple's latest phone features an amazing screen" → paraphrase?
Reranking Query + candidate list → rank by cosine sim MAP AskUbuntu: rerank candidate answers for Ubuntu questions
Retrieval Query → search entire corpus by cosine sim nDCG@10 MSMARCO, NQ, HotpotQA, SciFact, ... (15 datasets from BEIR)
STS Two sentences → cosine sim vs gold score Spearman ρ STSBenchmark: "A man is playing the cello" vs "A man seated is playing the cello" → gold 4.25/5.0
Summarization Machine summary → cosine sim to human summary Spearman ρ SummEval (1 dataset only)
Bitext Mining Sentences in language A → find translation in B F1 Tatoeba: FR "Morales remporte l'élection..." ↔ EN "Morales went on to win..."

Retrieval scoring walkthrough (nDCG@10):

Consider the query "What is the capital of France?" and a corpus of 10,000 documents, 3 of them relevant (doc_42, doc_789, doc_3001). The model embeds everything and ranks by cosine similarity:

Rank Document Relevant? Gain (2^rel − 1) Discount (1/log₂(i+1))
1 doc_42 Yes 1 1.000
2 doc_100 No 0 0.631
3 doc_789 Yes 1 0.500
4 doc_555 No 0 0.431
5 doc_3001 Yes 1 0.387
6–10 ... No 0 ...

DCG@10 = 1×1.000 + 1×0.500 + 1×0.387 = 1.887. Ideal DCG (all 3 relevant at ranks 1–3) = 1.000 + 0.631 + 0.500 = 2.131. nDCG@10 = 1.887/2.131 ≈ 0.885. The logarithmic discount penalizes relevant documents appearing lower — rank 1 counts twice as much as rank 3.
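
The walkthrough above can be reproduced in a few lines. A self-contained sketch with binary relevance (the simplified ideal ranking here assumes all relevant documents appear in the candidate list, which holds in this example):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    def dcg(flags):
        # gain (2^rel - 1) over discount log2(rank + 1); ranks are 1-based
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(flags[:k]))
    gains = [1 if d in relevant_ids else 0 for d in ranked_ids]
    ideal = sorted(gains, reverse=True)  # best possible ordering
    return dcg(gains) / dcg(ideal)

# The ranking from the example: relevant docs at ranks 1, 3, and 5.
ranking = ["doc_42", "doc_100", "doc_789", "doc_555", "doc_3001",
           "doc_6", "doc_7", "doc_8", "doc_9", "doc_10"]
score = ndcg_at_k(ranking, {"doc_42", "doc_789", "doc_3001"})
assert abs(score - 0.8855) < 0.001
```

Swapping doc_789 down to rank 10 would lower the score, while swapping it up to rank 2 would raise it: the metric rewards putting relevant documents as early as possible.
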

Instruction-aware evaluation: Models like E5-Mistral prepend task-specific instructions to queries only (not documents). For example, NQ retrieval uses "Given a question, retrieve Wikipedia passages that answer the question". Instructions contribute +4.2 MTEB points on average.
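
In practice this asymmetric scheme is plain string templating on the query side only. A sketch following the one-line template reported in the E5-Mistral paper (the task description strings are chosen per dataset):

```python
def format_query(task_description: str, query: str) -> str:
    # Instruction is prepended to the query only, so the document
    # index stays task-agnostic and can be reused across tasks.
    return f"Instruct: {task_description}\nQuery: {query}"

def format_document(doc: str) -> str:
    return doc  # no instruction on the document side

q = format_query(
    "Given a question, retrieve Wikipedia passages that answer the question",
    "What is the capital of France?",
)
assert q.startswith("Instruct:")
assert q.endswith("What is the capital of France?")
```

Because documents are embedded without instructions, one corpus embedding pass serves many downstream tasks; only query embeddings change per task.
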


MTEB English Leaderboard

Model Params Attention Pooling MTEB Eng v2 MTEB Eng v1 (56 tasks)
Qwen3-Embedding-8B 8B Causal EOS 75.22 --
Qwen3-Embedding-4B 4B Causal EOS 74.60 --
Gemini Embedding -- -- -- 73.30 --
gte-Qwen2-7B 7B -- -- 70.72 70.24
Qwen3-Embedding-0.6B 0.6B Causal EOS 70.70 --
NV-Embed-v2 7B Bidirectional Latent-attn 69.81 72.31
KaLM-V2.5 0.5B Bidirectional Mean -- 69.33
Causal2Vec-Mistral (ICL) 7B+110M Causal C+EOS -- 66.85
GritLM-7B 7B Bi/Causal Mean -- 66.8
E5-Mistral-7B 7B Causal EOS -- 66.6
LLM2Vec-LLaMA3-8B 8B Bidirectional Mean -- 65.01
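
The Pooling column corresponds to simple reductions over the final-layer hidden states. A sketch of the two most common choices (NV-Embed's learned latent-attention pooling is omitted; the tensors here are random stand-ins for real model outputs):

```python
import numpy as np

def eos_pool(hidden, lengths):
    # hidden: (batch, seq, d) final-layer states; lengths: true token
    # counts before padding. Take the last real token's state -- under
    # causal attention this (EOS) position has seen the whole input.
    return np.stack([h[l - 1] for h, l in zip(hidden, lengths)])

def mean_pool(hidden, lengths):
    # Average over real tokens only, ignoring padding positions.
    return np.stack([h[:l].mean(axis=0) for h, l in zip(hidden, lengths)])

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 6, 4))  # batch of 2, seq 6, dim 4
lengths = [6, 3]                     # second sequence is padded

e = eos_pool(hidden, lengths)
m = mean_pool(hidden, lengths)
assert e.shape == m.shape == (2, 4)
assert np.allclose(e[1], hidden[1, 2])              # last real token
assert np.allclose(m[1], hidden[1, :3].mean(axis=0))  # padding excluded
```

EOS pooling is the natural fit for causal models (only the last position sees everything), while mean pooling suits bidirectional models where every position is fully contextualized.
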
More benchmarks: Multilingual, Code, Long-Document, and Reasoning-Intensive Retrieval

Multilingual MTEB (MMTEB)

Model Params MMTEB Avg
Qwen3-Embedding-8B 8B 70.58
Qwen3-Embedding-4B 4B 69.45
Gemini Embedding -- 68.37
Qwen3-Embedding-0.6B 0.6B 64.33
multilingual-e5-large 0.6B 63.22
gte-Qwen2-7B 7B 62.51

Code Retrieval (MTEB Code)

Model Score
Qwen3-Embedding-8B 80.68
Qwen3-Embedding-4B 80.06
Qwen3-Embedding-0.6B 75.41
Gemini Embedding 74.66

Long-Document Retrieval (LongEmbed)

DiffEmbed (Zhang et al., 2025) has the largest advantage in this scenario:

Model Long-Doc Avg Passkey (≤4K)
DiffEmbed (Dream-7B) 62.2% 100%
Mistral+LLM2Vec 58.6% 98.8%
LLaMA3+LLM2Vec 42.0% 59.6%

Reasoning-Intensive Retrieval (Bright / TheoremQA)

Model TheoremQA (Q) TheoremQA (T) Bright Avg
DiffEmbed 48.3 38.9 33.2
Qwen2.5 (best AR) 40.2 34.7 30.6
LLaMA3+LLM2Vec 33.8 28.3 24.6

Long-Text Embedding

Long-text embedding is where the limitations of autoregressive models become most acute. Models vary widely in supported context length:

Model Max Context Training Length Position Encoding
Qwen3-Embedding 32K 32K RoPE (YaRN extendable to 128K)
NV-Embed-v2 32K 512 RoPE
E5-Mistral 4K → 32K 512 RoPE (NTK interpolation)
LLM2Vec 8K (Mistral) 512 RoPE
GritLM Arbitrary (sliding window) 2K RoPE
KaLM-V2.5 512 512 RoPE

A critical observation: most models are trained on short sequences (512 tokens) even when the base LLM supports much longer contexts. This creates a train-test length mismatch that degrades performance.

LongEmbed benchmark (Zhu et al., 2024) evaluates retrieval across lengths from 256 to 32K tokens on 6 datasets (2 synthetic: Needle-in-a-Haystack, Passkey Retrieval; 4 real-world: NarrativeQA, QMSum, 2WikiMQA, SummScreenFD). The best baseline achieves only 64.4 average — indicating large room for improvement.

Key results on LongEmbed:

Model Long-Doc Avg Passkey (≤4K) Key Technique
DiffEmbed 62.2% 100% Natively bidirectional (diffusion LM)
Mistral+LLM2Vec 58.6% 98.8% Bidirectional conversion
LLaMA3+LLM2Vec 42.0% 59.6% Bidirectional conversion
E5-Mistral + NTK ext. 75.3 -- RoPE NTK interpolation (+10.9 pts)

Why do embeddings degrade with length? Length-Induced Embedding Collapse

Zhou et al. (ACL 2025) identified a phenomenon called Length-Induced Embedding Collapse: self-attention acts as a low-pass filter, and longer sequences increase the attenuation rate of high-frequency components. This causes token representations to retain only the DC (Direct-Current) component — embeddings of longer texts collapse into a narrow region of embedding space with abnormally high pairwise cosine similarity.

Concrete degradation numbers on BGE:

Input Length Classification Accuracy
0–100 tokens 75.6%
100–200 tokens 72.1%
200–300 tokens 66.8%
300–400 tokens 63.2%
400–500 tokens 59.0% (−16.6 pts)

This is consistent across ANCE, GTR, GIST, BGE, and E5 models, and also observed in LLM-based models.

Solution — TempScale: divide the attention logits by a temperature τ ∈ (0,1] before the softmax: softmax(QKᵀ / (τ√d)). This sharpens the attention distribution and preserves high-frequency information for longer texts, yielding +0.94% MTEB average and +1.10% LongEmbed improvement.
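
The change amounts to one line in the attention computation. A minimal sketch assuming standard single-head dot-product attention (τ = 1 recovers the usual softmax; τ < 1 sharpens each attention row, counteracting the low-pass smoothing on long inputs):

```python
import numpy as np

def tempscale_attention(q, k, v, tau=1.0):
    # softmax(Q K^T / (tau * sqrt(d))) V -- dividing by tau < 1 scales
    # the logits up, concentrating attention mass and preserving
    # high-frequency components that otherwise wash out with length.
    d = q.shape[-1]
    scores = q @ k.T / (tau * np.sqrt(d))
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(8, 16))
_, w_sharp = tempscale_attention(q, k, v, tau=0.5)
_, w_base = tempscale_attention(q, k, v, tau=1.0)

# Lower temperature gives lower average row entropy (sharper attention).
ent = lambda w: -(w * np.log(w + 1e-12)).sum(axis=-1).mean()
assert ent(w_sharp) < ent(w_base)
```
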

Other long-context techniques:

  • RoPE extension (NTK interpolation, YaRN, SelfExtend): Rescale the position encoding frequency basis to support longer sequences. E5-Mistral gains +10.9 pts on LongEmbed via NTK.
  • Late Chunking (Jina AI, 2024): Run the full transformer over the entire long document first, then chunk the token embeddings before mean pooling. Preserves cross-chunk context. +3.63% relative improvement over naive chunking.
  • HTP: Hierarchical segment summaries create backward information flow — particularly effective for long documents where over-squashing is severe.
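
Of these, Late Chunking is the simplest to sketch: contextualize the whole document first, then pool per chunk, instead of splitting the text before encoding. The encoder below is a hypothetical stand-in that just maps token ids to vectors; in a real setup it would be a transformer run over the entire document so every token state carries global context:

```python
import numpy as np

def fake_encoder(token_ids):
    # Stand-in for a transformer returning one vector per token.
    rng = np.random.default_rng(42)
    table = rng.normal(size=(1000, 8))  # toy embedding table
    return table[np.asarray(token_ids)]

def late_chunk_embeddings(token_ids, chunk_size):
    # Encode the WHOLE document in one pass...
    states = fake_encoder(token_ids)
    # ...then mean-pool contiguous chunks of the token states.
    return [states[i:i + chunk_size].mean(axis=0)
            for i in range(0, len(token_ids), chunk_size)]

doc = list(range(10))  # toy token ids
chunks = late_chunk_embeddings(doc, chunk_size=4)
assert len(chunks) == 3
assert chunks[0].shape == (8,)
```

The design choice is the order of operations: naive chunking encodes each chunk in isolation and loses cross-chunk references, while late chunking pays one full-length forward pass to keep them.
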

Efficiency-Performance Trade-offs

Model Params Performance Highlight
KaLM-V2.5 0.5B MTEB 69.33 0.5B surpasses 7B E5-Mistral (66.6)
Qwen3-Emb-0.6B 0.6B MMTEB 64.33 0.6B surpasses all open-source 7B models
Causal2Vec 7B+110M MTEB 66.85 82% inference time reduction, 85% sequence length reduction (vs bge-en-icl)
AutoRegEmbed 7B STS 83.81 Only ~66K training samples needed (NV-Embed needs 1M+)

Generation + Embedding: Unified Models

The core finding of GritLM (Muennighoff et al., 2024) is zero-loss unification:

Model MTEB MMLU Notes
GritLM-7B 66.8 57.6% Matches both single-task specialists
GritLM-8x7B 65.7 66.7% Generation close to Mixtral-8x7B-Instruct
GEM-LLaMA-1B 54.35 28.36% Only 1/10 the training data of GritLM

Training Mistral-7B for embedding-only yields 66.8 MTEB + 7.6 generation (generation destroyed); training generative-only yields 41.2 MTEB + 55.2 generation. GritLM’s unified training achieves 66.8 MTEB + 55.5 generation — matching both individual specialists with zero trade-off.
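
GritLM's recipe is a single model trained on a weighted sum of two objectives: an in-batch contrastive (InfoNCE) loss on pooled representations for embedding, plus the usual next-token cross-entropy for generation. A toy sketch of combining the two losses (the mixing weight and the random tensors are illustrative, not the paper's values):

```python
import numpy as np

def info_nce(q, d, temperature=0.05):
    # In-batch contrastive loss: the i-th query should match the
    # i-th document against all other documents in the batch.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

def next_token_loss(logits, targets):
    # Standard cross-entropy over the vocabulary at each position.
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(logp[np.arange(len(targets)), targets])

rng = np.random.default_rng(0)
q_emb, d_emb = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))  # pooled reps
lm_logits = rng.normal(size=(5, 100))                            # (positions, vocab)
targets = rng.integers(0, 100, size=5)

lam = 1.0  # illustrative mixing weight between the two objectives
loss = info_nce(q_emb, d_emb) + lam * next_token_loss(lm_logits, targets)
assert loss > 0
```

In the actual model the two terms come from two forward modes of the same weights: bidirectional attention with mean pooling feeds the contrastive term, causal attention with the LM head feeds the generative term.
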


References

  • Yanzhao Zhang et al. “Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.” arXiv:2506.05176, 2025.
  • Chankyu Lee et al. “NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.” ICLR 2025 (Spotlight). arXiv:2405.17428, 2024.
  • Parishad BehnamGhader et al. “LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders.” COLM 2024. arXiv:2404.05961, 2024.
  • Niklas Muennighoff et al. “Generative Representational Instruction Tuning.” arXiv:2402.09906, 2024.
  • Liang Wang et al. “Improving Text Embeddings with Large Language Models.” ACL 2024. arXiv:2401.00368, 2024.
  • Xinping Zhao et al. “KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model.” arXiv:2506.20923, 2025.
  • Ailiang Lin et al. “Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models.” arXiv:2507.23386, 2025.
  • Jingcheng Deng et al. “Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment.” EMNLP 2025. arXiv:2502.11401, 2025.
  • Ziyue Li, Tianyi Zhou. “Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free.” ICLR 2025 (Oral). arXiv:2410.10814, 2024.
  • Xueying Ding et al. “Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings.” arXiv:2511.14868, 2025.
  • Caojin Zhang et al. “GEM: Empowering LLM for both Embedding Generation and Language Understanding.” arXiv:2506.04344, 2025.
  • Siyue Zhang et al. “Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective.” EMNLP 2025. arXiv:2505.15045, 2025.