Autoregressive Embedding Models: Training, Attention, and Performance
I. How Are SOTA Embedding Models Trained?
Multi-Stage Progressive Training
Current SOTA models almost universally adopt multi-stage training, progressively refining representation quality from coarse to fine:
| Stage | Purpose | Typical Data Scale | Representative Methods |
|---|---|---|---|
| Stage 0: LLM pretraining | Strong language understanding foundation | Trillions of tokens | Inherited from Qwen3/Mistral/LLaMA |
| Stage 1: Weakly-supervised contrastive pretraining | Learn general text representations | ~150M–470M pairs | Qwen3-Emb (150M), KaLM (470M) |
| Stage 2: Supervised fine-tuning | Task-specific refinement | ~2M–19M pairs | All SOTA models |
| Stage 3: Distillation / merging | Improve generalization | Same as Stage 2 | KaLM-V2.5 (KL distillation), Qwen3 (SLERP merging) |
A key finding: Wang et al. (2024) showed that for sufficiently large LLMs, Stage 1 contrastive pretraining is nearly useless — it improves smaller models like XLM-R by +8.2 points but has negligible effect on Mistral-7B. The LLM’s autoregressive pretraining already provides a strong representational foundation. However, for smaller models (e.g., Qwen2-0.5B), Stage 1 still contributes significantly (~3 MMTEB points for Qwen3-Emb-0.6B).
Attention Mechanism: Two Camps
Camp 1: Remove the causal mask. Replace the causal attention mask with fully bidirectional attention during training. Representatives include NV-Embed (Lee et al., 2024), LLM2Vec (BehnamGhader et al., 2024), KaLM (Zhao et al., 2025), and GritLM (Muennighoff et al., 2024) (in embedding mode). The advantage is simplicity and strong embedding quality; the downside is loss of generation capability (except GritLM, which uses bidirectional for embedding and causal for generation).
Camp 2: Keep the causal mask. Bypass the causal attention limitation through clever designs. Representatives include Qwen3-Emb (Zhang et al., 2025), E5-Mistral (Wang et al., 2024), Causal2Vec (Lin et al., 2025), GEM (Zhang et al., 2025), and HTP (Ding et al., 2025). The advantage is preserved generation capability and no pretrain-finetune attention mismatch; the downside is that extra mechanisms are needed to compensate for the information flow deficit.
NV-Embed’s ablation clearly shows the value of bidirectional attention:
| Attention | Pooling | MTEB Avg |
|---|---|---|
| Causal | EOS | 66.50 |
| Bidirectional | EOS | 67.85 (+1.35) |
| Causal | Latent-attn | 68.47 |
| Bidirectional | Latent-attn | 69.32 (+0.85) |
GritLM’s ablation is even more dramatic: bidirectional vs. causal differs by +4 MTEB points on embedding tasks (64.0 vs 60.0).
Pooling Strategies
| Strategy | Representatives | Best When |
|---|---|---|
| EOS / Last-token pooling | Qwen3-Emb, E5-Mistral | Keeping causal mask — information naturally converges to the final token |
| Mean pooling | KaLM, LLM2Vec, GritLM, DiffEmbed | After removing causal mask — all token representations are equally contextualized |
| Latent attention pooling | NV-Embed | Learnable cross-attention, outperforms both alternatives |
| Contextual + EOS concat | Causal2Vec | Combining global compressed info with end-of-sequence info |
NV-Embed (Lee et al., 2024)’s Latent Attention is a learnable cross-attention layer: decoder outputs serve as Q, a set of 512 learnable latent vectors serve as K=V, followed by MLP + mean pooling. It achieves the best results among all pooling methods (MTEB 72.31 vs mean 71.71 vs EOS 71.63).
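A minimal numpy sketch of this pooling path (single-head, no MLP block; all shapes, names, and values are illustrative, not NV-Embed's actual implementation):

```python
import numpy as np

def latent_attention_pool(H, latents):
    """Latent-attention pooling in the style described above (sketch).

    H:       (seq_len, d) decoder hidden states, used as queries Q.
    latents: (num_latents, d) learnable latent array, used as K = V.
    Returns a single (d,) embedding: cross-attention over the latents,
    then mean pooling over the sequence. The MLP block and multi-head
    details of the real model are omitted.
    """
    d = H.shape[-1]
    scores = H @ latents.T / np.sqrt(d)              # (seq_len, num_latents)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over latents
    O = w @ latents                                  # (seq_len, d) attended outputs
    return O.mean(axis=0)                            # mean pool over sequence

rng = np.random.default_rng(0)
H = rng.normal(size=(128, 64))        # toy decoder outputs
latents = rng.normal(size=(512, 64))  # 512 learnable latents, as in the paper
emb = latent_attention_pool(H, latents)
```

Because the latents are learned, the model can allocate them to different semantic roles during contrastive training, which is one plausible reason this outperforms fixed pooling rules.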
The following figure illustrates the attention masks and pooling strategies:
Which Layer's Representation?
Nearly all SOTA embedding models extract representations from the last layer of the transformer:
| Model | Layer | Pooling |
|---|---|---|
| Qwen3-Emb, E5-Mistral | Last layer | EOS token |
| NV-Embed | Last layer | Latent Attention |
| LLM2Vec, KaLM, GritLM, DiffEmbed | Last layer | Mean pooling |
| HTP | Near-last (2nd or 3rd from end) | Mean pooling |
| MoE-Embedding | All layers (routing weights) | Last token |
Two notable exceptions: HTP uses a near-last but not final layer (e.g., layer 29/32 for Mistral-7B), chosen per-model via validation. MoE-Embedding concatenates routing weights from all layers (e.g., 28 layers × 64 experts = 1792-dim vector).
Why not use intermediate layers? The "Layer by Layer" finding
Ren et al. (2025), "Layer by Layer: Uncovering Hidden Representations in Language Models" (ICML 2025) systematically studied layer selection for embeddings and found that intermediate layers outperform the last layer by 2%–16% on average across 32 MTEB tasks in zero-shot settings. The optimal layers cluster around mid-depth of the network.
The reason: the last layer is optimized for next-token prediction and is biased toward the next output token's semantics rather than global sentence semantics. Intermediate layers strike a better balance between information preservation and noise filtering. This effect is architecture-agnostic — observed in decoder-only (Pythia, LLaMA3), encoder (BERT), and state-space models (Mamba).
However, this finding primarily applies to zero-shot / unfine-tuned models. In practice, SOTA embedding models fine-tune with contrastive learning, which reshapes the last layer to encode global rather than local semantics. Combined with good pooling strategies (latent attention, bidirectional mean pooling), the last layer's "next-token bias" is effectively mitigated — which is why most production models still use it.
HTP's near-last-layer early exit can be seen as a compromise: it avoids the very final layer's next-token specialization while staying close enough to preserve the deep representations.
Loss Functions
All models use InfoNCE contrastive loss as the core:
\[\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(q, d^+)/\tau)}{\sum_j \exp(\text{sim}(q, d_j)/\tau)}\]
Several improvements build on top of this:
- Focal-InfoNCE (KaLM-V2, Zhao et al., 2025): Reweight with \(w_i = (1-p_i)^\gamma\) to focus on hard samples. This is KaLM’s single largest performance contributor.
- False negative mitigation (Qwen3-Emb, Zhang et al., 2025): Zero out gradient contributions from in-batch negatives whose similarity to the positive exceeds a margin of 0.1.
- Conditional distribution alignment (AutoRegEmbed, Deng et al., 2025): Instead of cosine similarity, align \(p(\cdot \vert e_q)\) and \(p(\cdot \vert e_d)\) — conditional probability distributions, better matching the LLM’s generative nature.
- Information compression (AutoRegEmbed, Deng et al., 2025): Use a frozen decoder to reconstruct the target document, forcing compressed tokens to capture global semantics.
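A minimal numpy sketch of the core loss for a single query, with the Qwen3-style false-negative mask bolted on (`tau`, `margin`, and the similarity values are illustrative; production implementations operate on full in-batch similarity matrices):

```python
import numpy as np

def info_nce(sim, tau=0.05, margin=0.1):
    """InfoNCE for one query (sketch). sim[0] is the positive d+.

    Following the false-negative mitigation described above, in-batch
    negatives whose similarity exceeds sim(q, d+) + margin are treated
    as likely false negatives and excluded from the denominator.
    """
    keep = np.ones_like(sim, dtype=bool)
    keep[1:] = sim[1:] <= sim[0] + margin
    logits = np.where(keep, sim / tau, -np.inf)
    m = logits.max()
    log_denom = m + np.log(np.exp(logits - m).sum())  # stable log-sum-exp
    return -(logits[0] - log_denom)                   # -log p(d+ | q)

sim = np.array([0.82, 0.95, 0.40, 0.35])  # sim[1] looks like a false negative
loss = info_nce(sim)                      # small: the 0.95 negative is masked
```

Without the mask, the 0.95 "negative" would dominate the denominator and push the query away from a document that is probably relevant; masking zeroes out its gradient contribution.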
Details: Synthetic data engineering and other key techniques
Synthetic data is a critical driver across all SOTA models:
- E5-Mistral: GPT-4 generates 500K synthetic pairs across 93 languages with 150K unique instructions.
- Qwen3-Emb: Qwen3-32B generates ~150M synthetic pairs, controlling query type / length / difficulty / language diversity.
- NV-Embed: Mixtral-8x22B generates 120K synthetic samples covering 60K synthetic tasks.
- KaLM-V2: Qwen2-72B generates 550K persona-based synthetic data.
Instruction-aware embeddings. Nearly all SOTA models prepend "Instruct: {task} Query: {q}" to queries; documents receive no instruction prefix. E5-Mistral (Wang et al., 2024)'s ablation shows instructions contribute +4.2 points.
Hard negative mining. NV-Embed (Lee et al., 2024) uses a teacher model to mine hard negatives with threshold max_neg_score < pos_score * 0.95 to filter false negatives, contributing +2.30 retrieval points.
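The mining rule reduces to a one-line filter. A sketch with an illustrative helper name and toy teacher scores:

```python
def filter_hard_negatives(pos_score, neg_scores, ratio=0.95):
    """Keep a mined negative only if teacher_score < pos_score * ratio (sketch).

    Negatives scoring too close to the positive are treated as likely
    false negatives and dropped, per the mining rule described above.
    """
    return [s for s in neg_scores if s < pos_score * ratio]

# Threshold = 0.9 * 0.95 = 0.855: 0.88 and 0.86 are dropped as suspect.
kept = filter_hard_negatives(0.9, [0.88, 0.7, 0.86, 0.5])
```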
Model merging / SLERP. Qwen3-Emb (Zhang et al., 2025) merges multiple SFT checkpoints via spherical linear interpolation, contributing +1.77 MMTEB points.
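SLERP itself is straightforward. A sketch on flattened weight vectors (helper name illustrative; actual checkpoint merging applies this tensor-by-tensor across multiple SFT checkpoints):

```python
import numpy as np

def slerp(w0, w1, t):
    """Spherical linear interpolation between two flattened weight vectors (sketch).

    Interpolates along the great circle between the directions of w0 and w1;
    falls back to plain linear interpolation when they are nearly parallel.
    """
    u0 = w0 / np.linalg.norm(w0)
    u1 = w1 / np.linalg.norm(w1)
    theta = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if theta < 1e-6:                       # nearly parallel: lerp is fine
        return (1 - t) * w0 + t * w1
    return (np.sin((1 - t) * theta) * w0 + np.sin(t * theta) * w1) / np.sin(theta)

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)   # lies on the unit circle halfway between a and b
```

Unlike plain averaging, SLERP preserves the norm structure of the interpolated weights, which is the usual motivation for using it in checkpoint merging.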
Two-stage instruction tuning. NV-Embed trains retrieval first (in-batch negatives ON), then multi-task (in-batch negatives OFF — same-class samples become false negatives). Order matters: reversing drops retrieval by 0.74 points.
LoRA efficient fine-tuning. E5-Mistral and Causal2Vec show that LoRA rank 16 suffices — no full-parameter fine-tuning needed.
Frontier Directions
Diffusion LM as embedding backbone. DiffEmbed (Zhang et al., 2025) uses Dream-7B (a masked diffusion model based on Qwen2.5-7B) with natively bidirectional attention. It outperforms AR models by ~20% on long-document retrieval and ~8% on reasoning-intensive tasks.
MoE routing weights as embeddings. MoE-Embedding (Li & Zhou, 2024) discovers that router weights in MoE models serve as off-the-shelf embeddings without fine-tuning, complementary to hidden states (AMI only 0.29). Combining them yields +7.94 points.
Hierarchical Token Prepending. HTP (Ding et al., 2025) requires no training — it creates backward information flow in causal models through hierarchical segment summary tokens. It can even improve already fine-tuned NV-Embed-v2.
II. Attention Distribution Characteristics
The Core Problem: Information Flow Bottleneck in Causal Attention
Causal attention in decoder-only models creates two fundamental deficiencies:
Unidirectional information flow. The token at position \(i\) can only see tokens \(1, \ldots, i\), with no access to information from \(i+1, \ldots, n\). Early tokens are severely under-contextualized — the first token’s representation contains zero context from the rest of the sentence. Mean pooling quality suffers because early tokens contribute incomplete representations.
Information over-compression into the final token. All semantic information must “flow forward” to the end of the sequence. The EOS token is the only position that sees the complete sequence. AutoRegEmbed (Deng et al., 2025) further points out that the hidden state at EOS encodes the next token’s probability distribution (local semantics), not the global semantics of the input text.
Over-Squashing: Why Causal Attention Gets Worse with Length
The term “over-squashing” originated in graph neural networks (Alon & Yahav, 2021; Topping et al., 2022): when exponentially growing receptive fields must pass messages through fixed-dimension channels, information from distant nodes gets “squashed” and lost. Barbero et al. (2024) formally bridged this theory to causal transformers by observing that a causal attention pattern defines a directed acyclic graph — a lower-triangular adjacency matrix — so information propagation follows the same path-counting framework.
The mixing matrix. For a causal transformer with \(L\) layers, define the mixing matrix:
\[A = M^{(L-1)} \cdot M^{(L-2)} \cdots M^{(0)}, \quad \text{where } M^{(l)} = \frac{1}{r_l}\left(\frac{\alpha^{(l)}}{\beta_1^{(l)}} + I\right)\]
Here \(\alpha^{(l)}_{j,i}\) are the softmax attention weights at layer \(l\) (zero for \(i > j\) due to causal masking), and the identity \(I\) comes from the residual connection. \(A\) is lower-triangular and row-stochastic. HTP (Ding et al., 2025) proved the following bounds:
Last-token pooling — sensitivity depends on a single entry:
\[\left\lVert \frac{\partial y_n}{\partial v_i^{(0)}} \right\rVert \leq K_L \cdot A_{n,i}\]
Mean pooling — sensitivity depends on the entire column:
\[\left\lVert \frac{\partial \bar{y}}{\partial v_i^{(0)}} \right\rVert \leq \frac{K_L}{n} \sum_{j=1}^{n} A_{j,i}\]
The column sum captures all outgoing influence from token \(i\), not just the single path to position \(n\). This is why mean pooling is structurally more robust to over-squashing than last-token pooling.
Three forces conspire to make \(A_{n,i}\) small for early tokens:
- Attention dilution. Softmax distributes weight across all visible tokens. Average weight per token is \(\sim 1/n\), shrinking as context grows.
- Multiplicative decay. \(A_{n,i}\) is a sum over products of \(L\) attention weights along causal paths. Products of \(L\) small numbers decay exponentially.
- Attention sinks. Xiao et al. (2023) found that in LLaMA-2-7B, attention to the first token exceeds 50% of total attention across most layers — despite carrying no semantic content (replacing it with "\n" tokens yields comparable perplexity). This wastes attention budget: with >50% consumed by a semantically empty sink, even less remains for informative early tokens, further shrinking their \(A_{n,i}\).
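The structure of the mixing matrix can be checked numerically. Below is a toy sketch (the `mixing_matrix` helper and all constants are illustrative): each layer mixes a uniform causal attention matrix 50/50 with the residual identity, one concrete instance of the \(M^{(l)}\) form above. The resulting \(A\) is lower-triangular and row-stochastic, and for any token \(i\) the full column sum \(\sum_j A_{j,i}\) strictly exceeds the single entry \(A_{n,i}\):

```python
import numpy as np

def mixing_matrix(n, L):
    """A = M^(L-1) ... M^(0) for a toy L-layer causal stack (sketch).

    Every layer uses the same uniform causal attention, averaged with the
    residual identity, so A inherits the lower-triangular row-stochastic
    structure described above.
    """
    attn = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]
    M = 0.5 * (attn + np.eye(n))
    return np.linalg.matrix_power(M, L)

n, L = 64, 12
A = mixing_matrix(n, L)

i = 40                             # an early-middle token
single_path = A[-1, i]             # governs the last-token pooling bound
column_mass = A[:, i].sum() / n    # governs the mean pooling bound
```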
Self-attention as a low-pass filter: the spectral perspective on length degradation
Zhou et al. (2025) (ACL 2025) provided a complementary spectral analysis. They decomposed token representations into frequency components via the DFT and showed that any attention matrix \(A = \text{softmax}(P)\) acts as a low-pass filter:
\[\lim_{t \to \infty} \frac{\lVert \text{HC}[A^t z] \rVert_2}{\lVert \text{DC}[A^t z] \rVert_2} = 0\]
where DC is the mean (zero-frequency) component and HC is all higher-frequency components. By Perron–Frobenius, the attention matrix is row-stochastic with largest eigenvalue 1 (the DC direction), and all other eigenvalues have magnitude < 1 — so repeated attention application exponentially damps discriminative features.
The filter rate depends on length. Let \(\sigma_a\) be the largest singular value of \(\text{HC}[A]\). Under Gaussian assumptions on the Q, K projections:
\[\sigma_a \leq \sqrt{\frac{n}{2\sqrt{1 + e^{-2\sigma_s^2}} \cdot (n-1)^{3/2} + 1}}\]
\(\sigma_a\) monotonically decreases as sequence length \(n\) increases. Longer sequences → stronger low-pass filtering → faster destruction of high-frequency (discriminative) information. This explains why embeddings of long texts collapse: they converge toward their DC component, and since natural language has relatively consistent mean embeddings, all long-text embeddings crowd into a narrow region with abnormally high pairwise cosine similarity.
Barbero et al. (2024) ("Transformers need glasses!") pushed this further, showing that representational collapse occurs even for distinct sequences: as n → ∞, the last-token representations of two different sequences converge. In practice with bf16 precision, collapse occurs at ~50 tokens for repeated digits, and Gemini 1.5 fails to copy the last element of a sequence at length ~300.
The connection between over-squashing and low-pass filtering is natural: both describe the same information loss from different perspectives — over-squashing through the Jacobian of the mixing matrix (gradient-based), low-pass filtering through eigenvalue decay of the attention matrix (spectral). They predict the same outcome: longer sequences lose more information, and causal models are worse than bidirectional ones because the triangular structure restricts information flow paths.
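The spectral claim is easy to simulate. With a generic row-stochastic matrix standing in for attention (all names and constants here are illustrative), repeatedly applying it drives the HC/DC ratio toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
# A generic row-stochastic "attention" matrix: row-wise softmax of random logits.
logits = 0.5 * rng.normal(size=(n, n))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def hc_dc_ratio(v):
    """Norm of the high-frequency part over the zero-frequency (mean) part."""
    dc = np.full_like(v, v.mean())
    return np.linalg.norm(v - dc) / np.linalg.norm(dc)

z = rng.normal(size=n)
r0 = hc_dc_ratio(z)                                  # before any attention
r16 = hc_dc_ratio(np.linalg.matrix_power(A, 16) @ z) # after 16 applications
# r16 should fall far below r0: HC contracts by the subdominant eigenvalue
# magnitude per step, while the DC component survives.
```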
Empirical Findings
LLM2Vec: Causal vs. bidirectional representation similarity. LLM2Vec (BehnamGhader et al., 2024) analyzed the cosine similarity between per-layer representations in causal and bidirectional modes:
- LLaMA-2-7B and S-LLaMA-1.3B: similarities are low across nearly all layers, indicating that enabling bidirectional attention drastically changes internal representations — hence MNTP adaptation training is needed.
- Mistral-7B is the anomaly: similarities remain ~0.9+ throughout all layers. The authors speculate Mistral may have been pretrained with some form of bidirectional attention (e.g., prefix LM). This explains why Mistral is the only model that benefits from bidirectional attention without any training (+4.4 points), while LLaMA-3-8B collapses (-13.4 points).
HTP: Theoretical analysis of over-squashing. HTP (Ding et al., 2025) provides a formal analysis of the information flow bottleneck (Theorem 3.1): for a causal Transformer, the gradient of last-token readout \(\lVert \partial y_n / \partial v_i^{(0)} \rVert\) depends on a single entry \(A_{n,i}\) of the mixing matrix, which decays rapidly with depth. Mean-token readout aggregates the entire column \(\sum_j A_{j,i}\), making it more robust to over-squashing.
Experimental validation: masking the backward attention in Echo Embeddings (second pass → first pass) causes STS to plummet from 68.00 to 54.25; masking forward attention has virtually no effect (68.00 → 67.89). Backward information flow is the key to embedding quality.
DiffEmbed: The most direct attention-direction ablation. DiffEmbed (Zhang et al., 2025) tested the effect of removing backward attention on models trained with bidirectional attention:
| Task | Mistral (bidir) | Mistral (causal only) | DiffEmbed (bidir) | DiffEmbed (causal only) |
|---|---|---|---|---|
| TheoremQA (question) | 33.7 | 9.6 (-24.1) | 48.3 | 0.7 (-47.6) |
| TheoremQA (theorem) | 32.4 | 4.0 (-28.4) | 38.9 | 1.1 (-37.8) |
DiffEmbed (natively bidirectional pretraining) depends on backward attention far more than AR models do. The more complex the task (reasoning, long documents), the more critical bidirectional attention becomes; the difference is small for short-text STS.
More empirical findings: Causal2Vec L2 norms and MoE routing weights
Causal2Vec: EOS vs. Contextual token. Causal2Vec (Lin et al., 2025) analyzed the L2 norms of EOS and Contextual token representations: EOS consistently shows higher L2 norms, indicating greater influence in the concatenated embedding. A single Contextual token suffices — increasing to 2/4/8 tokens actually degrades performance.
MoE-Embedding: Router Weights vs. Hidden States. MoE-Embedding (Li & Zhou, 2024) found that MoE routing weights (RW) and hidden states (HS) encode fundamentally different information (AMI=0.29, Jaccard=0.06). RW is more robust to prompt variation (cross-prompt Spearman correlation 0.63 vs HS's 0.52). RW captures "intermediate reasoning choices" (how the model processes input), HS captures "final prediction output" — the two are complementary.
Summary of Attention Characteristics
| Characteristic | Explanation |
|---|---|
| Information converges to the final token | Under causal attention, EOS is the only position seeing the full sequence — it becomes the natural information sink |
| Early tokens have poor representations | Lacking subsequent context, their representations are unsuitable for embedding |
| Backward information flow is the critical missing piece | Both HTP and DiffEmbed prove: backward attention (later→earlier) is essential; forward attention (earlier→later) contributes marginally |
| Bidirectional conversion requires adaptation training | Except Mistral, directly enabling bidirectional attention destroys AR pretrained representations (LLaMA-3 drops 13.4 points) |
| More complex tasks need bidirectionality more | Short-text STS shows small differences; long-document retrieval and reasoning-intensive tasks show huge gaps (up to 47.6 points) |
| Instruction tokens influence but don’t participate in pooling | NV-Embed and GritLM exclude instruction tokens from pooling, but instructions influence other token representations via attention |
The Causal vs. Bidirectional Debate: Is There a Winner?
One of the central tensions in this field is whether to remove the causal attention mask inherited from autoregressive pretraining, or to keep it and work around its limitations. Controlled ablations and leaderboard rankings tell different stories — and reconciling them reveals a nuanced picture.
Controlled Ablations: Bidirectional Wins
Under strict ablation (same model, same data, only the attention mask differs), removing the causal mask consistently helps:
| Paper | Experiment | Gain |
|---|---|---|
| NV-Embed (Lee et al., 2024) | Causal → Bidirectional (EOS pooling) | +1.35 MTEB |
| NV-Embed | Causal → Bidirectional (Latent-attn) | +0.85 MTEB |
| GritLM (Muennighoff et al., 2024) | Causal → Bidirectional (embedding mode) | +4.0 MTEB |
| KaLM (Zhao et al., 2025) | Remove causal mask | +0.39 MTEB |
The conclusion from ablations is clear: all else being equal, bidirectional attention produces better embeddings. This is expected — embedding fundamentally requires understanding the whole input, and causal attention prevents early tokens from incorporating future context.
The Leaderboard Paradox: Causal Models Lead
Yet the actual leaderboards tell a different story:
| Model | Attention | MTEB Eng v2 |
|---|---|---|
| Qwen3-Embedding-8B | Causal | 75.22 |
| Qwen3-Embedding-4B | Causal | 74.60 |
| NV-Embed-v2 | Bidirectional | 69.81 |
Causal2Vec (Lin et al., 2025) also consistently outperforms the bidirectional LLM2Vec by +0.78 to +1.30 MTEB points across all tested base models, while keeping the causal mask intact.
How can causal models dominate despite the ablation evidence?
Reconciling the Evidence
Three factors explain the paradox:
1. Pretrain-finetune attention mismatch. LLMs are pretrained on trillions of tokens with causal attention. Their internal representations are optimized for unidirectional information flow. Switching to bidirectional attention disrupts these learned representations — sometimes catastrophically. LLM2Vec (BehnamGhader et al., 2024) documented this clearly:
| Model | Causal baseline | Bidirectional (no adaptation) | Change |
|---|---|---|---|
| Mistral-7B | 42.46 | 46.86 | +4.40 (helps) |
| LLaMA-3-8B | 43.98 | 30.56 | -13.42 (collapses) |
Mistral-7B is the anomaly — its causal and bidirectional representations have cosine similarity ~0.9+ across all layers, suggesting it may have been pretrained with some form of bidirectional attention (e.g., prefix LM). For most models, adaptation training (e.g., LLM2Vec’s MNTP) is essential, but Causal2Vec argues that even with adaptation, the mismatch cannot be fully resolved — the model’s semantic extraction abilities, shaped by causal pretraining, are partially compromised.
2. Scale and data engineering compensate. Qwen3-Embedding uses ~150M synthetic pretraining pairs + ~19M supervised pairs + SLERP checkpoint merging — far exceeding the data scale of bidirectional models like NV-Embed. With sufficient model scale and training data, the EOS token under causal attention can aggregate enough global information to rival bidirectional representations on standard benchmarks.
3. Task complexity determines the gap size. This is the most important nuance. DiffEmbed (Zhang et al., 2025) showed that the advantage of bidirectional attention scales with task complexity:
| Task Type | Bidirectional Advantage |
|---|---|
| Short-text STS | Nearly zero |
| Standard MTEB (mixed) | Small (+1 to +4 points) |
| Long-document retrieval (LongEmbed) | ~20% |
| Reasoning-intensive retrieval (TheoremQA) | ~8%; removing backward attention: 48.3 → 0.7 |
For short texts, the EOS token sees the full sequence under causal attention — bidirectionality adds little. For long documents and tasks requiring logical reasoning across the full text, bidirectional attention is qualitatively superior.
Practical Recommendations
There is no definitive winner — the right choice depends on the use case:
| Scenario | Recommendation | Rationale |
|---|---|---|
| Need both generation and embedding | Keep causal | GritLM / GEM approach: switch attention mode per task |
| General-purpose embedding SOTA | Either works | Qwen3-Emb (causal) and NV-Embed (bidirectional) both achieve top results; scale and data matter more |
| Long-document retrieval | Strongly prefer bidirectional | DiffEmbed: ~20% advantage over AR models |
| Reasoning-intensive tasks | Strongly prefer bidirectional | Removing backward attention causes near-total collapse on TheoremQA |
| Small models (<1B) | Remove causal mask | Small models cannot compensate via scale; KaLM-V2.5 (0.5B, bidirectional) surpasses 7B causal models |
| Inference efficiency priority | Keep causal + workarounds | Causal2Vec: 85% sequence length reduction, 82% inference speedup |
The convergence trend: both camps are moving toward each other
The most interesting trend is that the two camps are converging. Methods that keep the causal mask are increasingly finding ways to simulate bidirectional information flow:
- Causal2Vec injects global context via a BERT-encoded Contextual token prepended before the text — all subsequent tokens can attend to it under causal masking, approximating backward information flow.
- HTP creates hierarchical segment summary tokens that are rewired to the beginning, providing backward pathways without modifying the attention mask.
- GEM uses an attention-mask bottleneck with special tokens that force prefix compression, keeping causal attention while creating an information funnel.
Meanwhile, methods that remove the causal mask are finding ways to preserve generation capability:
- GritLM proved that a single model can use bidirectional attention for embedding and causal attention for generation with zero performance trade-off on either task.
- GEM preserves generation capability (MMLU) while adding embedding capability, with only a small generation quality drop.
The future likely lies not in choosing one camp, but in better fusion strategies — models that seamlessly switch between attention modes or architectures that natively support both information flow patterns.
III. Performance of SOTA Embedding Models
Embedding Benchmarks Overview
The primary benchmarks used to evaluate embedding models are:
| Benchmark | Scope | Tasks | Reference |
|---|---|---|---|
| MTEB | Text embedding (English) | 8 task types: Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, Summarization, Bitext Mining. Originally 56 datasets (v1), expanded in v2. | Muennighoff et al. (2022) |
| MMTEB | Multilingual text embedding | Same task types as MTEB, expanded to 250+ datasets across 100+ languages. | Enevoldsen et al. (2025) |
| MMEB | Multimodal embedding (image+text) | 36 datasets across 20 embedding tasks spanning classification, VQA, retrieval, and visual grounding. The first unified multimodal embedding benchmark analogous to MTEB. | Jiang et al. (2024) |
MTEB is the de facto standard for text embedding evaluation — nearly all papers in this survey report MTEB scores. The leaderboard is hosted on HuggingFace. MMEB extends this paradigm to vision-language, evaluating models like VLM2Vec and Qwen3-VL-Embedding that map text, images, and video into a unified embedding space.
The overall MTEB score is a simple unweighted average of the main metric across all datasets. Since retrieval has 15 datasets while summarization has only 1, retrieval-heavy models get a disproportionate boost — a known limitation.
How MTEB evaluates each task type — with a concrete retrieval example
Each task type uses a different evaluation protocol. The embedding model only produces vectors — a downstream metric measures how well those vectors capture semantics:
| Task Type | Input | Metric | Example Dataset |
|---|---|---|---|
| Classification | Text → embedding → logistic regression probe | Accuracy | Banking77: "What currencies is an exchange rate calculated in?" → label exchange_rate |
| Clustering | Set of texts → embeddings → k-means | V-measure | ArxivClustering: group paper abstracts by field (math, cs, ...) |
| Pair Classification | Two texts → cosine similarity → threshold | Average Precision | TwitterURL: "The new iPhone has a stunning display" vs "Apple's latest phone features an amazing screen" → paraphrase? |
| Reranking | Query + candidate list → rank by cosine sim | MAP | AskUbuntu: rerank candidate answers for Ubuntu questions |
| Retrieval | Query → search entire corpus by cosine sim | nDCG@10 | MSMARCO, NQ, HotpotQA, SciFact, ... (15 datasets from BEIR) |
| STS | Two sentences → cosine sim vs gold score | Spearman ρ | STSBenchmark: "A man is playing the cello" vs "A man seated is playing the cello" → gold 4.25/5.0 |
| Summarization | Machine summary → cosine sim to human summary | Spearman ρ | SummEval (1 dataset only) |
| Bitext Mining | Sentences in language A → find translation in B | F1 | Tatoeba: FR "Morales remporte l'élection..." ↔ EN "Morales went on to win..." |
Retrieval scoring walkthrough (nDCG@10):
Given query "What is the capital of France?" and a corpus of 10,000 documents with 3 relevant ones (doc_42, doc_789, doc_3001). The model embeds everything and ranks by cosine similarity:
| Rank | Document | Relevant? | Gain (2^rel − 1) | Discount (1/log₂(i+1)) |
|---|---|---|---|---|
| 1 | doc_42 | Yes | 1 | 1.000 |
| 2 | doc_100 | No | 0 | 0.631 |
| 3 | doc_789 | Yes | 1 | 0.500 |
| 4 | doc_555 | No | 0 | 0.431 |
| 5 | doc_3001 | Yes | 1 | 0.387 |
| 6–10 | ... | No | 0 | ... |
DCG@10 = 1×1.0 + 1×0.5 + 1×0.387 = 1.887. Ideal DCG (all 3 relevant at ranks 1-3) = 1.0 + 0.631 + 0.5 = 2.131. nDCG@10 = 1.887/2.131 ≈ 0.885. The logarithmic discount penalizes relevant documents appearing lower — rank 1 counts twice as much as rank 3.
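The same computation as a self-contained sketch (for illustration only, not the official evaluator; the function name is ours):

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k from a ranked list of graded relevances, gain = 2^rel - 1."""
    def dcg(rels):
        # Rank i (1-indexed) gets discount 1 / log2(i + 1).
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The walkthrough above: relevant documents land at ranks 1, 3, and 5.
ranked = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
score = ndcg_at_k(ranked)   # ~0.885
```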
Instruction-aware evaluation: Models like E5-Mistral prepend task-specific instructions to queries only (not documents). For example, NQ retrieval uses "Given a question, retrieve Wikipedia passages that answer the question". Instructions contribute +4.2 MTEB points on average.
MTEB English Leaderboard
| Model | Params | Attention | Pooling | MTEB Eng v2 | MTEB Eng v1 (56 tasks) |
|---|---|---|---|---|---|
| Qwen3-Embedding-8B | 8B | Causal | EOS | 75.22 | – |
| Qwen3-Embedding-4B | 4B | Causal | EOS | 74.60 | – |
| Gemini Embedding | – | – | – | 73.30 | – |
| gte-Qwen2-7B | 7B | – | – | 70.72 | 70.24 |
| Qwen3-Embedding-0.6B | 0.6B | Causal | EOS | 70.70 | – |
| NV-Embed-v2 | 7B | Bidirectional | Latent-attn | 69.81 | 72.31 |
| KaLM-V2.5 | 0.5B | Bidirectional | Mean | – | 69.33 |
| GritLM-7B | 7B | Bi/Causal | Mean | – | 66.8 |
| Causal2Vec-Mistral (ICL) | 7B+110M | Causal | C+EOS | – | 66.85 |
| E5-Mistral-7B | 7B | Causal | EOS | – | 66.6 |
| LLM2Vec-LLaMA3-8B | 8B | Bidirectional | Mean | – | 65.01 |
More benchmarks: Multilingual, Code, Long-Document, and Reasoning-Intensive Retrieval
Multilingual MTEB (MMTEB)
| Model | Params | MMTEB Avg |
|---|---|---|
| Qwen3-Embedding-8B | 8B | 70.58 |
| Qwen3-Embedding-4B | 4B | 69.45 |
| Gemini Embedding | – | 68.37 |
| Qwen3-Embedding-0.6B | 0.6B | 64.33 |
| multilingual-e5-large | 0.6B | 63.22 |
| gte-Qwen2-7B | 7B | 62.51 |
Code Retrieval (MTEB Code)
| Model | Score |
|---|---|
| Qwen3-Embedding-8B | 80.68 |
| Qwen3-Embedding-4B | 80.06 |
| Qwen3-Embedding-0.6B | 75.41 |
| Gemini Embedding | 74.66 |
Long-Document Retrieval (LongEmbed)
DiffEmbed (Zhang et al., 2025) has the largest advantage in this scenario:
| Model | Long-Doc Avg | Passkey (≤4K) |
|---|---|---|
| DiffEmbed (Dream-7B) | 62.2% | 100% |
| Mistral+LLM2Vec | 58.6% | 98.8% |
| LLaMA3+LLM2Vec | 42.0% | 59.6% |
Reasoning-Intensive Retrieval (Bright / TheoremQA)
| Model | TheoremQA (Q) | TheoremQA (T) | Bright Avg |
|---|---|---|---|
| DiffEmbed | 48.3 | 38.9 | 33.2 |
| Qwen2.5 (best AR) | 40.2 | 34.7 | 30.6 |
| LLaMA3+LLM2Vec | 33.8 | 28.3 | 24.6 |
Long-Text Embedding
Long-text embedding is where the limitations of autoregressive models become most acute. Models vary widely in supported context length:
| Model | Max Context | Training Length | Position Encoding |
|---|---|---|---|
| Qwen3-Embedding | 32K | 32K | RoPE (YaRN extendable to 128K) |
| NV-Embed-v2 | 32K | 512 | RoPE |
| E5-Mistral | 4K → 32K | 512 | RoPE (NTK interpolation) |
| LLM2Vec | 8K (Mistral) | 512 | RoPE |
| GritLM | Arbitrary (sliding window) | 2K | RoPE |
| KaLM-V2.5 | 512 | 512 | RoPE |
A critical observation: most models are trained on short sequences (512 tokens) even when the base LLM supports much longer contexts. This creates a train-test length mismatch that degrades performance.
LongEmbed benchmark (Zhu et al., 2024) evaluates retrieval across lengths from 256 to 32K tokens on 6 datasets (2 synthetic: Needle-in-a-Haystack, Passkey Retrieval; 4 real-world: NarrativeQA, QMSum, 2WikiMQA, SummScreenFD). The best baseline achieves only 64.4 average — indicating large room for improvement.
Key results on LongEmbed:
| Model | Long-Doc Avg | Passkey (≤4K) | Key Technique |
|---|---|---|---|
| DiffEmbed | 62.2% | 100% | Natively bidirectional (diffusion LM) |
| Mistral+LLM2Vec | 58.6% | 98.8% | Bidirectional conversion |
| LLaMA3+LLM2Vec | 42.0% | 59.6% | Bidirectional conversion |
| E5-Mistral + NTK ext. | 75.3 | – | RoPE NTK interpolation (+10.9 pts) |
Why do embeddings degrade with length? Length-Induced Embedding Collapse
Zhou et al. (2025) (ACL 2025) identified a phenomenon called Length-Induced Embedding Collapse: self-attention acts as a low-pass filter, and longer sequences increase the attenuation rate of high-frequency components. This causes token representations to retain only the DC (Direct-Current) component — embeddings of longer texts collapse into a narrow region of embedding space with abnormally high pairwise cosine similarity.
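The low-pass-filter intuition can be reproduced with a toy numpy experiment. The matrix below is a random attention-like row-stochastic matrix, not a trained model; repeated application (a stand-in for depth and longer-range mixing) smooths token representations toward a single point:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 32, 8

tokens = rng.normal(size=(n_tokens, dim))
logits = rng.normal(size=(n_tokens, n_tokens))
weights = np.exp(logits)
weights /= weights.sum(axis=1, keepdims=True)  # positive, rows sum to 1

def spread(x):
    """Per-dimension range (max - min), averaged over dimensions."""
    return float((x.max(axis=0) - x.min(axis=0)).mean())

smoothed = tokens.copy()
for _ in range(20):                # repeated attention-style averaging
    smoothed = weights @ smoothed

# After repeated smoothing, token-to-token (high-frequency) variation is
# almost entirely gone; only the mean ("DC") component survives.
```

Because the averaging matrix is strictly positive and row-stochastic, each application provably shrinks the spread of the token representations, which is the collapse mechanism in miniature.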
Concrete degradation numbers on BGE:
| Input Length | Classification Accuracy |
|---|---|
| 0–100 tokens | 75.6% |
| 100–200 tokens | 72.1% |
| 200–300 tokens | 66.8% |
| 300–400 tokens | 63.2% |
| 400–500 tokens | 59.0% (−16.6 pts) |
This is consistent across ANCE, GTR, GIST, BGE, and E5 models, and also observed in LLM-based models.
Solution (TempScale): divide the attention logits by a temperature τ ∈ (0, 1] before the softmax, i.e. softmax(QKᵀ / (τ√d)). This preserves high-frequency information for longer texts, yielding +0.94% MTEB average and +1.10% LongEmbed improvement.
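A minimal numpy sketch of the TempScale logit scaling (the τ value and tensor shapes are illustrative, not the paper's settings):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(q, k, tau=1.0):
    """Attention weights with TempScale: logits are divided by tau * sqrt(d)
    instead of the usual sqrt(d); tau in (0, 1] sharpens the distribution."""
    d = q.shape[-1]
    return softmax(q @ k.T / (tau * np.sqrt(d)))

def entropy(w):
    """Mean per-row entropy of the attention distribution."""
    return float(-(w * np.log(w + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 64))  # 16 query tokens, head dim 64
k = rng.normal(size=(16, 64))

w_plain = attention_weights(q, k, tau=1.0)
w_sharp = attention_weights(q, k, tau=0.5)  # TempScale with tau < 1
```

Lower-entropy (sharper) attention does less averaging per layer, which is exactly how TempScale counteracts the low-pass-filter effect on long inputs.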
Other long-context techniques:
- RoPE extension (NTK interpolation, YaRN, SelfExtend): Rescale the position encoding frequency basis to support longer sequences. E5-Mistral gains +10.9 pts on LongEmbed via NTK.
- Late Chunking (Jina AI, 2024): Run the full transformer over the entire long document first, then chunk the token embeddings before mean pooling. Preserves cross-chunk context. +3.63% relative improvement over naive chunking.
- HTP: Hierarchical segment summaries create backward information flow — particularly effective for long documents where over-squashing is severe.
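Of the techniques above, Late Chunking is simple enough to sketch directly. In this numpy sketch the contextualized token matrix stands in for a real transformer's last hidden states over the full document, and the chunk boundaries are illustrative:

```python
import numpy as np

def late_chunking(token_embeddings, chunk_bounds):
    """Pool per chunk AFTER the full-document forward pass, so every chunk
    embedding is conditioned on the whole document's context."""
    return np.stack([token_embeddings[s:e].mean(axis=0)
                     for s, e in chunk_bounds])

# Stand-in for a transformer's last hidden states over the full document;
# in practice this comes from one long-context forward pass.
rng = np.random.default_rng(0)
contextual = rng.normal(size=(300, 32))      # 300 tokens, hidden dim 32
bounds = [(0, 100), (100, 200), (200, 300)]  # chunk boundaries (in tokens)

chunk_embs = late_chunking(contextual, bounds)  # one vector per chunk
```

The contrast with naive chunking is where you encode: naive chunking runs the encoder on each chunk in isolation and then pools, so no chunk ever sees its neighbors' context.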
Efficiency-Performance Trade-offs
效率-性能权衡
| Model | Params | Performance | Highlight |
|---|---|---|---|
| KaLM-V2.5 | 0.5B | MTEB 69.33 | 0.5B surpasses 7B E5-Mistral (66.6) |
| Qwen3-Emb-0.6B | 0.6B | MMTEB 64.33 | 0.6B surpasses all open-source 7B models |
| Causal2Vec | 7B+110M | MTEB 66.85 | 82% inference time reduction, 85% sequence length reduction (vs bge-en-icl) |
| AutoRegEmbed | 7B | STS 83.81 | Only ~66K training samples needed (NV-Embed needs 1M+) |
Generation + Embedding: Unified Models
Generation + Embedding 双能力
The core finding of GritLM (Muennighoff et al., 2024) is zero-loss unification:
| Model | MTEB | MMLU | Notes |
|---|---|---|---|
| GritLM-7B | 66.8 | 57.6% | Matches both single-task specialists |
| GritLM-8x7B | 65.7 | 66.7% | Generation close to Mixtral-8x7B-Instruct |
| GEM-LLaMA-1B | 54.35 | 28.36% | Only 1/10 the training data of GritLM |
Training Mistral-7B for embedding only yields 66.8 MTEB but just 7.6 on generation (generation capability is destroyed); training for generation only yields 41.2 MTEB and 55.2 generation. GritLM's unified training achieves 66.8 MTEB and 55.5 generation, matching both single-task specialists with no trade-off.
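GritLM-style unified training optimizes a contrastive loss on pooled representations plus the usual next-token loss. This numpy sketch shows the combined objective on toy tensors; λ, the temperature, and all shapes here are illustrative, not the paper's settings:

```python
import numpy as np

def info_nce(queries, docs, temperature=0.05):
    """Contrastive loss: the i-th query should match the i-th document."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = q @ d.T / temperature
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def next_token_ce(logits, targets):
    """Standard language-modeling cross-entropy over the vocabulary."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

rng = np.random.default_rng(0)
q, d = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))  # pooled reps
lm_logits = rng.normal(size=(10, 100))                     # 10 positions
targets = rng.integers(0, 100, size=10)                    # vocab of 100

lam = 1.0  # weighting between the two objectives (illustrative)
loss = info_nce(q, d) + lam * next_token_ce(lm_logits, targets)
```

The key design point is that both terms share the same backbone: the contrastive branch pools (bidirectional) hidden states, while the generative branch keeps causal next-token prediction, so neither capability is sacrificed.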
References
参考文献
- Yanzhao Zhang et al. “Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.” arXiv:2506.05176, 2025.
- Chankyu Lee et al. “NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.” ICLR 2025 (Spotlight). arXiv:2405.17428, 2024.
- Parishad BehnamGhader et al. “LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders.” COLM 2024. arXiv:2404.05961, 2024.
- Niklas Muennighoff et al. “Generative Representational Instruction Tuning.” arXiv:2402.09906, 2024.
- Liang Wang et al. “Improving Text Embeddings with Large Language Models.” ACL 2024. arXiv:2401.00368, 2024.
- Xinping Zhao et al. “KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model.” arXiv:2506.20923, 2025.
- Ailiang Lin et al. “Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models.” arXiv:2507.23386, 2025.
- Jingcheng Deng et al. “Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment.” EMNLP 2025. arXiv:2502.11401, 2025.
- Ziyue Li, Tianyi Zhou. “Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free.” ICLR 2025 (Oral). arXiv:2410.10814, 2024.
- Xueying Ding et al. “Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings.” arXiv:2511.14868, 2025.
- Caojin Zhang et al. “GEM: Empowering LLM for both Embedding Generation and Language Understanding.” arXiv:2506.04344, 2025.
- Siyue Zhang et al. “Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective.” EMNLP 2025. arXiv:2505.15045, 2025.