Autoregressive Embedding Models: Training, Attention, and Performance
I. How Are SOTA Embedding Models Trained?
Multi-Stage Progressive Training
Current SOTA models almost universally adopt multi-stage training, progressively refining representation quality from coarse to fine:
| Stage | Purpose | Typical Data Scale | Representative Methods |
|---|---|---|---|
| Stage 0: LLM pretraining | Strong language understanding foundation | Trillions of tokens | Inherited from Qwen3/Mistral/LLaMA |
| Stage 1: Weakly-supervised contrastive pretraining | Learn general text representations | ~150M–470M pairs | Qwen3-Emb (150M), KaLM (470M) |
| Stage 2: Supervised fine-tuning | Task-specific refinement | ~2M–19M pairs | All SOTA models |
| Stage 3: Distillation / merging | Improve generalization | Same as Stage 2 | KaLM-V2.5 (KL distillation), Qwen3 (SLERP merging) |
A key finding: Wang et al. (2024) showed that for sufficiently large LLMs, Stage 1 contrastive pretraining is nearly useless — it improves smaller models like XLM-R by +8.2 points but has negligible effect on Mistral-7B. The LLM’s autoregressive pretraining already provides a strong representational foundation. However, for smaller models (e.g., Qwen2-0.5B), Stage 1 still contributes significantly (~3 MMTEB points for Qwen3-Emb-0.6B).
Attention Mechanism: Two Camps
Camp 1: Remove the causal mask. Replace the causal attention mask with fully bidirectional attention during training. Representatives include NV-Embed (Lee et al., 2024), LLM2Vec (BehnamGhader et al., 2024), KaLM (Zhao et al., 2025), and GritLM (Muennighoff et al., 2024) (in embedding mode). The advantage is simplicity and strong embedding quality; the downside is loss of generation capability (except GritLM, which uses bidirectional for embedding and causal for generation).
Camp 2: Keep the causal mask. Bypass the causal attention limitation through clever designs. Representatives include Qwen3-Emb (Zhang et al., 2025), E5-Mistral (Wang et al., 2024), Causal2Vec (Lin et al., 2025), GEM (Zhang et al., 2025), and HTP (Ding et al., 2025). The advantage is preserved generation capability and no pretrain-finetune attention mismatch; the downside is that extra mechanisms are needed to compensate for the information flow deficit.
NV-Embed’s ablation clearly shows the value of bidirectional attention:
| Attention | Pooling | MTEB Avg |
|---|---|---|
| Causal | EOS | 66.50 |
| Bidirectional | EOS | 67.85 (+1.35) |
| Causal | Latent-attn | 68.47 |
| Bidirectional | Latent-attn | 69.32 (+0.85) |
GritLM’s ablation is even more dramatic: bidirectional vs. causal differs by +4 MTEB points on embedding tasks (64.0 vs 60.0).
Pooling Strategies
| Strategy | Representatives | Best When |
|---|---|---|
| EOS / Last-token pooling | Qwen3-Emb, E5-Mistral | Keeping causal mask — information naturally converges to the final token |
| Mean pooling | KaLM, LLM2Vec, GritLM, DiffEmbed | After removing causal mask — all token representations are equally contextualized |
| Latent attention pooling | NV-Embed | Learnable cross-attention, outperforms both alternatives |
| Contextual + EOS concat | Causal2Vec | Combining global compressed info with end-of-sequence info |
NV-Embed (Lee et al., 2024)’s Latent Attention is a learnable cross-attention layer: decoder outputs serve as Q, a set of 512 learnable latent vectors serve as K=V, followed by MLP + mean pooling. It achieves the best results among all pooling methods (MTEB 72.31 vs mean 71.71 vs EOS 71.63).
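A minimal numpy sketch of this pooling path (single-head, no MLP block; all shapes, names, and values are illustrative, not NV-Embed's actual implementation):

```python
import numpy as np

def latent_attention_pool(H, latents):
    """Latent-attention pooling in the style described above (sketch).

    H:       (seq_len, d) decoder hidden states, used as queries Q.
    latents: (num_latents, d) learnable latent array, used as K = V.
    Returns a single (d,) embedding: cross-attention over the latents,
    then mean pooling over the sequence. The MLP block and multi-head
    details of the real model are omitted.
    """
    d = H.shape[-1]
    scores = H @ latents.T / np.sqrt(d)              # (seq_len, num_latents)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over latents
    O = w @ latents                                  # (seq_len, d) attended outputs
    return O.mean(axis=0)                            # mean pool over sequence

rng = np.random.default_rng(0)
H = rng.normal(size=(128, 64))        # toy decoder outputs
latents = rng.normal(size=(512, 64))  # 512 learnable latents, as in the paper
emb = latent_attention_pool(H, latents)
```

Because the latents are learned, the model can allocate them to different semantic roles during contrastive training, which is one plausible reason this outperforms fixed pooling rules.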
The following figure illustrates the attention masks and pooling strategies:
Which Layer's Representation?
Nearly all SOTA embedding models extract representations from the last layer of the transformer:
| Model | Layer | Pooling |
|---|---|---|
| Qwen3-Emb, E5-Mistral | Last layer | EOS token |
| NV-Embed | Last layer | Latent Attention |
| LLM2Vec, KaLM, GritLM, DiffEmbed | Last layer | Mean pooling |
| HTP | Near-last (2nd or 3rd from end) | Mean pooling |
| MoE-Embedding | All layers (routing weights) | Last token |
Two notable exceptions: HTP uses a near-last but not final layer (e.g., layer 29/32 for Mistral-7B), chosen per-model via validation. MoE-Embedding concatenates routing weights from all layers (e.g., 28 layers × 64 experts = 1792-dim vector).
Why not use intermediate layers? The "Layer by Layer" finding
Ren et al. (2025), "Layer by Layer: Uncovering Hidden Representations in Language Models" (ICML 2025) systematically studied layer selection for embeddings and found that intermediate layers outperform the last layer by 2%–16% on average across 32 MTEB tasks in zero-shot settings. The optimal layers cluster around mid-depth of the network.
The reason: the last layer is optimized for next-token prediction and is biased toward the next output token's semantics rather than global sentence semantics. Intermediate layers strike a better balance between information preservation and noise filtering. This effect is architecture-agnostic — observed in decoder-only (Pythia, LLaMA3), encoder (BERT), and state-space models (Mamba).
However, this finding primarily applies to zero-shot / unfine-tuned models. In practice, SOTA embedding models fine-tune with contrastive learning, which reshapes the last layer to encode global rather than local semantics. Combined with good pooling strategies (latent attention, bidirectional mean pooling), the last layer's "next-token bias" is effectively mitigated — which is why most production models still use it.
HTP's near-last-layer early exit can be seen as a compromise: it avoids the very final layer's next-token specialization while staying close enough to preserve the deep representations.
Loss Functions
All models use InfoNCE contrastive loss as the core:
\[\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(q, d^+)/\tau)}{\sum_j \exp(\text{sim}(q, d_j)/\tau)}\]
Several improvements build on top of this:
- Focal-InfoNCE (KaLM-V2, Zhao et al., 2025): Reweight with \(w_i = (1-p_i)^\gamma\) to focus on hard samples. This is KaLM’s single largest performance contributor.
- False negative mitigation (Qwen3-Emb, Zhang et al., 2025): Zero out gradient contributions from in-batch negatives whose similarity to the positive exceeds a margin of 0.1.
- Conditional distribution alignment (AutoRegEmbed, Deng et al., 2025): Instead of cosine similarity, align \(p(\cdot \vert e_q)\) and \(p(\cdot \vert e_d)\) — conditional probability distributions, better matching the LLM’s generative nature.
- Information compression (AutoRegEmbed, Deng et al., 2025): Use a frozen decoder to reconstruct the target document, forcing compressed tokens to capture global semantics.
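A minimal numpy sketch of the core loss for a single query, with the Qwen3-style false-negative mask bolted on (`tau`, `margin`, and the similarity values are illustrative; production implementations operate on full in-batch similarity matrices):

```python
import numpy as np

def info_nce(sim, tau=0.05, margin=0.1):
    """InfoNCE for one query (sketch). sim[0] is the positive d+.

    Following the false-negative mitigation described above, in-batch
    negatives whose similarity exceeds sim(q, d+) + margin are treated
    as likely false negatives and excluded from the denominator.
    """
    keep = np.ones_like(sim, dtype=bool)
    keep[1:] = sim[1:] <= sim[0] + margin
    logits = np.where(keep, sim / tau, -np.inf)
    m = logits.max()
    log_denom = m + np.log(np.exp(logits - m).sum())  # stable log-sum-exp
    return -(logits[0] - log_denom)                   # -log p(d+ | q)

sim = np.array([0.82, 0.95, 0.40, 0.35])  # sim[1] looks like a false negative
loss = info_nce(sim)                      # small: the 0.95 negative is masked
```

Without the mask, the 0.95 "negative" would dominate the denominator and push the query away from a document that is probably relevant; masking zeroes out its gradient contribution.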
Details: Synthetic data engineering and other key techniques
Synthetic data is a critical driver across all SOTA models:
- E5-Mistral: GPT-4 generates 500K synthetic pairs across 93 languages with 150K unique instructions.
- Qwen3-Emb: Qwen3-32B generates ~150M synthetic pairs, controlling query type / length / difficulty / language diversity.
- NV-Embed: Mixtral-8x22B generates 120K synthetic samples covering 60K synthetic tasks.
- KaLM-V2: Qwen2-72B generates 550K persona-based synthetic data.
Instruction-aware embeddings. Nearly all SOTA models prepend "Instruct: {task} Query: {q}" to queries; documents receive no instruction prefix. E5-Mistral (Wang et al., 2024)'s ablation shows instructions contribute +4.2 points.
Hard negative mining. NV-Embed (Lee et al., 2024) uses a teacher model to mine hard negatives with threshold max_neg_score < pos_score * 0.95 to filter false negatives, contributing +2.30 retrieval points.
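The mining rule reduces to a one-line filter. A sketch with an illustrative helper name and toy teacher scores:

```python
def filter_hard_negatives(pos_score, neg_scores, ratio=0.95):
    """Keep a mined negative only if teacher_score < pos_score * ratio (sketch).

    Negatives scoring too close to the positive are treated as likely
    false negatives and dropped, per the mining rule described above.
    """
    return [s for s in neg_scores if s < pos_score * ratio]

# Threshold = 0.9 * 0.95 = 0.855: 0.88 and 0.86 are dropped as suspect.
kept = filter_hard_negatives(0.9, [0.88, 0.7, 0.86, 0.5])
```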
Model merging / SLERP. Qwen3-Emb (Zhang et al., 2025) merges multiple SFT checkpoints via spherical linear interpolation, contributing +1.77 MMTEB points.
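SLERP itself is straightforward. A sketch on flattened weight vectors (helper name illustrative; actual checkpoint merging applies this tensor-by-tensor across multiple SFT checkpoints):

```python
import numpy as np

def slerp(w0, w1, t):
    """Spherical linear interpolation between two flattened weight vectors (sketch).

    Interpolates along the great circle between the directions of w0 and w1;
    falls back to plain linear interpolation when they are nearly parallel.
    """
    u0 = w0 / np.linalg.norm(w0)
    u1 = w1 / np.linalg.norm(w1)
    theta = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if theta < 1e-6:                       # nearly parallel: lerp is fine
        return (1 - t) * w0 + t * w1
    return (np.sin((1 - t) * theta) * w0 + np.sin(t * theta) * w1) / np.sin(theta)

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)   # lies on the unit circle halfway between a and b
```

Unlike plain averaging, SLERP preserves the norm structure of the interpolated weights, which is the usual motivation for using it in checkpoint merging.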
Two-stage instruction tuning. NV-Embed trains retrieval first (in-batch negatives ON), then multi-task (in-batch negatives OFF — same-class samples become false negatives). Order matters: reversing drops retrieval by 0.74 points.
LoRA efficient fine-tuning. E5-Mistral and Causal2Vec show that LoRA rank 16 suffices — no full-parameter fine-tuning needed.
Frontier Directions
Diffusion LM as embedding backbone. DiffEmbed (Zhang et al., 2025) uses Dream-7B (a masked diffusion model based on Qwen2.5-7B) with natively bidirectional attention. It outperforms AR models by ~20% on long-document retrieval and ~8% on reasoning-intensive tasks.
MoE routing weights as embeddings. MoE-Embedding (Li & Zhou, 2024) discovers that router weights in MoE models serve as off-the-shelf embeddings without fine-tuning, complementary to hidden states (AMI only 0.29). Combining them yields +7.94 points.
Hierarchical Token Prepending. HTP (Ding et al., 2025) requires no training — it creates backward information flow in causal models through hierarchical segment summary tokens. It can even improve already fine-tuned NV-Embed-v2.
II. Attention Distribution Characteristics
The Core Problem: Information Flow Bottleneck in Causal Attention
Causal attention in decoder-only models creates two fundamental deficiencies:
Unidirectional information flow. The token at position \(i\) can only see tokens \(1, \ldots, i\), with no access to information from \(i+1, \ldots, n\). Early tokens are severely under-contextualized — the first token’s representation contains zero context from the rest of the sentence. Mean pooling quality suffers because early tokens contribute incomplete representations.
Information over-compression into the final token. All semantic information must “flow forward” to the end of the sequence. The EOS token is the only position that sees the complete sequence. AutoRegEmbed (Deng et al., 2025) further points out that the hidden state at EOS encodes the next token’s probability distribution (local semantics), not the global semantics of the input text.
Over-Squashing: Why Causal Attention Gets Worse with Length
The term “over-squashing” originated in graph neural networks (Alon & Yahav, 2021; Topping et al., 2022): when exponentially growing receptive fields must pass messages through fixed-dimension channels, information from distant nodes gets “squashed” and lost. Barbero et al. (2024) formally bridged this theory to causal transformers by observing that a causal attention pattern defines a directed acyclic graph — a lower-triangular adjacency matrix — so information propagation follows the same path-counting framework.
The mixing matrix. For a causal transformer with \(L\) layers, define the mixing matrix:
\[A = M^{(L-1)} \cdot M^{(L-2)} \cdots M^{(0)}, \quad \text{where } M^{(l)} = \frac{1}{r_l}\left(\frac{\alpha^{(l)}}{\beta_1^{(l)}} + I\right)\]
Here \(\alpha^{(l)}_{j,i}\) are the softmax attention weights at layer \(l\) (zero for \(i > j\) due to causal masking), and the identity \(I\) comes from the residual connection. \(A\) is lower-triangular and row-stochastic. HTP (Ding et al., 2025) proved the following bounds:
Last-token pooling — sensitivity depends on a single entry:
\[\left\lVert \frac{\partial y_n}{\partial v_i^{(0)}} \right\rVert \leq K_L \cdot A_{n,i}\]
Mean pooling — sensitivity depends on the entire column:
\[\left\lVert \frac{\partial \bar{y}}{\partial v_i^{(0)}} \right\rVert \leq \frac{K_L}{n} \sum_{j=1}^{n} A_{j,i}\]
The column sum captures all outgoing influence from token \(i\), not just the single path to position \(n\). This is why mean pooling is structurally more robust to over-squashing than last-token pooling.
Three forces conspire to make \(A_{n,i}\) small for early tokens:
- Attention dilution. Softmax distributes weight across all visible tokens. Average weight per token is \(\sim 1/n\), shrinking as context grows.
- Multiplicative decay. \(A_{n,i}\) is a sum over products of \(L\) attention weights along causal paths. Products of \(L\) small numbers decay exponentially.
- Attention sinks. Xiao et al. (2023) found that in LLaMA-2-7B, attention to the first token exceeds 50% of total attention across most layers — despite carrying no semantic content (replacing it with "\n" tokens yields comparable perplexity). This wastes attention budget: with >50% consumed by a semantically empty sink, even less remains for informative early tokens, further shrinking their \(A_{n,i}\).
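The structure of the mixing matrix can be checked numerically. Below is a toy sketch (the `mixing_matrix` helper and all constants are illustrative): each layer mixes a uniform causal attention matrix 50/50 with the residual identity, one concrete instance of the \(M^{(l)}\) form above. The resulting \(A\) is lower-triangular and row-stochastic, and for any token \(i\) the full column sum \(\sum_j A_{j,i}\) strictly exceeds the single entry \(A_{n,i}\):

```python
import numpy as np

def mixing_matrix(n, L):
    """A = M^(L-1) ... M^(0) for a toy L-layer causal stack (sketch).

    Every layer uses the same uniform causal attention, averaged with the
    residual identity, so A inherits the lower-triangular row-stochastic
    structure described above.
    """
    attn = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]
    M = 0.5 * (attn + np.eye(n))
    return np.linalg.matrix_power(M, L)

n, L = 64, 12
A = mixing_matrix(n, L)

i = 40                             # an early-middle token
single_path = A[-1, i]             # governs the last-token pooling bound
column_mass = A[:, i].sum() / n    # governs the mean pooling bound
```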
Self-attention as a low-pass filter: the spectral perspective on length degradation
Zhou et al. (2025) (ACL 2025) provided a complementary spectral analysis. They decomposed token representations into frequency components via the DFT and showed that any attention matrix \(A = \text{softmax}(P)\) acts as a low-pass filter:
\[\lim_{t \to \infty} \frac{\lVert \text{HC}[A^t z] \rVert_2}{\lVert \text{DC}[A^t z] \rVert_2} = 0\]
where DC is the mean (zero-frequency) component and HC is all higher-frequency components. By Perron–Frobenius, the attention matrix is row-stochastic with largest eigenvalue 1 (the DC direction), and all other eigenvalues have magnitude < 1 — so repeated attention application exponentially damps discriminative features.
The filter rate depends on length. Let \(\sigma_a\) be the largest singular value of \(\text{HC}[A]\). Under Gaussian assumptions on the Q, K projections:
\[\sigma_a \leq \sqrt{\frac{n}{2\sqrt{1 + e^{-2\sigma_s^2}} \cdot (n-1)^{3/2} + 1}}\]
\(\sigma_a\) monotonically decreases as sequence length \(n\) increases. Longer sequences → stronger low-pass filtering → faster destruction of high-frequency (discriminative) information. This explains why embeddings of long texts collapse: they converge toward their DC component, and since natural language has relatively consistent mean embeddings, all long-text embeddings crowd into a narrow region with abnormally high pairwise cosine similarity.
Barbero et al. (2024) ("Transformers need glasses!") pushed this further, showing that representational collapse occurs even for distinct sequences: as n → ∞, the last-token representations of two different sequences converge. In practice with bf16 precision, collapse occurs at ~50 tokens for repeated digits, and Gemini 1.5 fails to copy the last element of a sequence at length ~300.
The connection between over-squashing and low-pass filtering is natural: both describe the same information loss from different perspectives — over-squashing through the Jacobian of the mixing matrix (gradient-based), low-pass filtering through eigenvalue decay of the attention matrix (spectral). They predict the same outcome: longer sequences lose more information, and causal models are worse than bidirectional ones because the triangular structure restricts information flow paths.
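The spectral claim is easy to simulate. With a generic row-stochastic matrix standing in for attention (all names and constants here are illustrative), repeatedly applying it drives the HC/DC ratio toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
# A generic row-stochastic "attention" matrix: row-wise softmax of random logits.
logits = 0.5 * rng.normal(size=(n, n))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def hc_dc_ratio(v):
    """Norm of the high-frequency part over the zero-frequency (mean) part."""
    dc = np.full_like(v, v.mean())
    return np.linalg.norm(v - dc) / np.linalg.norm(dc)

z = rng.normal(size=n)
r0 = hc_dc_ratio(z)                                  # before any attention
r16 = hc_dc_ratio(np.linalg.matrix_power(A, 16) @ z) # after 16 applications
# r16 should fall far below r0: HC contracts by the subdominant eigenvalue
# magnitude per step, while the DC component survives.
```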
Empirical Findings
LLM2Vec: Causal vs. bidirectional representation similarity. LLM2Vec (BehnamGhader et al., 2024) analyzed the cosine similarity between per-layer representations in causal and bidirectional modes:
- LLaMA-2-7B and S-LLaMA-1.3B: similarities are low across nearly all layers, indicating that enabling bidirectional attention drastically changes internal representations — hence MNTP adaptation training is needed.
- Mistral-7B is the anomaly: similarities remain ~0.9+ throughout all layers. The authors speculate Mistral may have been pretrained with some form of bidirectional attention (e.g., prefix LM). This explains why Mistral is the only model that benefits from bidirectional attention without any training (+4.4 points), while LLaMA-3-8B collapses (-13.4 points).
HTP: Theoretical analysis of over-squashing. HTP (Ding et al., 2025) provides a formal analysis of the information flow bottleneck (Theorem 3.1): for a causal Transformer, the gradient of last-token readout \(\lVert \partial y_n / \partial v_i^{(0)} \rVert\) depends on a single entry \(A_{n,i}\) of the mixing matrix, which decays rapidly with depth. Mean-token readout aggregates the entire column \(\sum_j A_{j,i}\), making it more robust to over-squashing.
Experimental validation: masking the backward attention in Echo Embeddings (second pass → first pass) causes STS to plummet from 68.00 to 54.25; masking forward attention has virtually no effect (68.00 → 67.89). Backward information flow is the key to embedding quality.
DiffEmbed: The most direct attention-direction ablation. DiffEmbed (Zhang et al., 2025) tested the effect of removing backward attention on models trained with bidirectional attention:
| Task | Mistral (bidir) | Mistral (causal only) | DiffEmbed (bidir) | DiffEmbed (causal only) |
|---|---|---|---|---|
| TheoremQA (question) | 33.7 | 9.6 (-24.1) | 48.3 | 0.7 (-47.6) |
| TheoremQA (theorem) | 32.4 | 4.0 (-28.4) | 38.9 | 1.1 (-37.8) |
DiffEmbed (natively bidirectional pretraining) depends on backward attention far more than AR models do. The more complex the task (reasoning, long documents), the more critical bidirectional attention becomes; the difference is small for short-text STS.
More empirical findings: Causal2Vec L2 norms and MoE routing weights
Causal2Vec: EOS vs. Contextual token. Causal2Vec (Lin et al., 2025) analyzed the L2 norms of EOS and Contextual token representations: EOS consistently shows higher L2 norms, indicating greater influence in the concatenated embedding. A single Contextual token suffices — increasing to 2/4/8 tokens actually degrades performance.
MoE-Embedding: Router Weights vs. Hidden States. MoE-Embedding (Li & Zhou, 2024) found that MoE routing weights (RW) and hidden states (HS) encode fundamentally different information (AMI=0.29, Jaccard=0.06). RW is more robust to prompt variation (cross-prompt Spearman correlation 0.63 vs HS's 0.52). RW captures "intermediate reasoning choices" (how the model processes input), HS captures "final prediction output" — the two are complementary.
Summary of Attention Characteristics
| Characteristic | Explanation |
|---|---|
| Information converges to the final token | Under causal attention, EOS is the only position seeing the full sequence — it becomes the natural information sink |
| Early tokens have poor representations | Lacking subsequent context, their representations are unsuitable for embedding |
| Backward information flow is the critical missing piece | Both HTP and DiffEmbed prove: backward attention (later→earlier) is essential; forward attention (earlier→later) contributes marginally |
| Bidirectional conversion requires adaptation training | Except Mistral, directly enabling bidirectional attention destroys AR pretrained representations (LLaMA-3 drops 13.4 points) |
| More complex tasks need bidirectionality more | Short-text STS shows small differences; long-document retrieval and reasoning-intensive tasks show huge gaps (up to 47.6 points) |
| Instruction tokens influence but don’t participate in pooling | NV-Embed and GritLM exclude instruction tokens from pooling, but instructions influence other token representations via attention |
The Causal vs. Bidirectional Debate: Is There a Winner?
One of the central tensions in this field is whether to remove the causal attention mask inherited from autoregressive pretraining, or to keep it and work around its limitations. Controlled ablations and leaderboard rankings tell different stories — and reconciling them reveals a nuanced picture.
Controlled Ablations: Bidirectional Wins
Under strict ablation (same model, same data, only the attention mask differs), removing the causal mask consistently helps:
| Paper | Experiment | Gain |
|---|---|---|
| NV-Embed (Lee et al., 2024) | Causal → Bidirectional (EOS pooling) | +1.35 MTEB |
| NV-Embed | Causal → Bidirectional (Latent-attn) | +0.85 MTEB |
| GritLM (Muennighoff et al., 2024) | Causal → Bidirectional (embedding mode) | +4.0 MTEB |
| KaLM (Zhao et al., 2025) | Remove causal mask | +0.39 MTEB |
The conclusion from ablations is clear: all else being equal, bidirectional attention produces better embeddings. This is expected — embedding fundamentally requires understanding the whole input, and causal attention prevents early tokens from incorporating future context.
The Leaderboard Paradox: Causal Models Lead
Yet the actual leaderboards tell a different story:
| Model | Attention | MTEB Eng v2 |
|---|---|---|
| Qwen3-Embedding-8B | Causal | 75.22 |
| Qwen3-Embedding-4B | Causal | 74.60 |
| NV-Embed-v2 | Bidirectional | 69.81 |
Causal2Vec (Lin et al., 2025) also consistently outperforms the bidirectional LLM2Vec by +0.78 to +1.30 MTEB points across all tested base models, while keeping the causal mask intact.
How can causal models dominate despite the ablation evidence?
Reconciling the Evidence
Three factors explain the paradox:
1. Pretrain-finetune attention mismatch. LLMs are pretrained on trillions of tokens with causal attention. Their internal representations are optimized for unidirectional information flow. Switching to bidirectional attention disrupts these learned representations — sometimes catastrophically. LLM2Vec (BehnamGhader et al., 2024) documented this clearly:
| Model | Causal baseline | Bidirectional (no adaptation) | Change |
|---|---|---|---|
| Mistral-7B | 42.46 | 46.86 | +4.40 (helps) |
| LLaMA-3-8B | 43.98 | 30.56 | -13.42 (collapses) |
Mistral-7B is the anomaly — its causal and bidirectional representations have cosine similarity ~0.9+ across all layers, suggesting it may have been pretrained with some form of bidirectional attention (e.g., prefix LM). For most models, adaptation training (e.g., LLM2Vec’s MNTP) is essential, but Causal2Vec argues that even with adaptation, the mismatch cannot be fully resolved — the model’s semantic extraction abilities, shaped by causal pretraining, are partially compromised.
2. Scale and data engineering compensate. Qwen3-Embedding uses ~150M synthetic pretraining pairs + ~19M supervised pairs + SLERP checkpoint merging — far exceeding the data scale of bidirectional models like NV-Embed. With sufficient model scale and training data, the EOS token under causal attention can aggregate enough global information to rival bidirectional representations on standard benchmarks.
3. Task complexity determines the gap size. This is the most important nuance. DiffEmbed (Zhang et al., 2025) showed that the advantage of bidirectional attention scales with task complexity:
| Task Type | Bidirectional Advantage |
|---|---|
| Short-text STS | Nearly zero |
| Standard MTEB (mixed) | Small (+1 to +4 points) |
| Long-document retrieval (LongEmbed) | ~20% |
| Reasoning-intensive retrieval (TheoremQA) | ~8%; removing backward attention: 48.3 → 0.7 |
For short texts, the EOS token sees the full sequence under causal attention — bidirectionality adds little. For long documents and tasks requiring logical reasoning across the full text, bidirectional attention is qualitatively superior.
Practical Recommendations
There is no definitive winner — the right choice depends on the use case:
| Scenario | Recommendation | Rationale |
|---|---|---|
| Need both generation and embedding | Keep causal | GritLM / GEM approach: switch attention mode per task |
| General-purpose embedding SOTA | Either works | Qwen3-Emb (causal) and NV-Embed (bidirectional) both achieve top results; scale and data matter more |
| Long-document retrieval | Strongly prefer bidirectional | DiffEmbed: ~20% advantage over AR models |
| Reasoning-intensive tasks | Strongly prefer bidirectional | Removing backward attention causes near-total collapse on TheoremQA |
| Small models (<1B) | Remove causal mask | Small models cannot compensate via scale; KaLM-V2.5 (0.5B, bidirectional) surpasses 7B causal models |
| Inference efficiency priority | Keep causal + workarounds | Causal2Vec: 85% sequence length reduction, 82% inference speedup |
The convergence trend: both camps are moving toward each other
The most interesting trend is that the two camps are converging. Methods that keep the causal mask are increasingly finding ways to simulate bidirectional information flow:
- Causal2Vec injects global context via a BERT-encoded Contextual token prepended before the text — all subsequent tokens can attend to it under causal masking, approximating backward information flow.
- HTP creates hierarchical segment summary tokens that are rewired to the beginning, providing backward pathways without modifying the attention mask.
- GEM uses an attention-mask bottleneck with special tokens that force prefix compression, keeping causal attention while creating an information funnel.
Meanwhile, methods that remove the causal mask are finding ways to preserve generation capability:
- GritLM proved that a single model can use bidirectional attention for embedding and causal attention for generation with zero performance trade-off on either task.
- GEM preserves generation capability (MMLU) while adding embedding capability, with only a small generation quality drop.
The future likely lies not in choosing one camp, but in better fusion strategies — models that seamlessly switch between attention modes or architectures that natively support both information flow patterns.
III. Performance of SOTA Embedding Models
Embedding Benchmarks Overview
The primary benchmarks used to evaluate embedding models are:
| Benchmark | Scope | Tasks | Reference |
|---|---|---|---|
| MTEB | Text embedding (English) | 8 task types: Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, Summarization, Bitext Mining. Originally 56 datasets (v1), expanded in v2. | Muennighoff et al. (2022) |
| MMTEB | Multilingual text embedding | Same task types as MTEB, expanded to 250+ datasets across 100+ languages. | Enevoldsen et al. (2025) |
| MMEB | Multimodal embedding (image+text) | 36 datasets across 20 embedding tasks spanning classification, VQA, retrieval, and visual grounding. The first unified multimodal embedding benchmark analogous to MTEB. | Jiang et al. (2024) |
MTEB is the de facto standard for text embedding evaluation — nearly all papers in this survey report MTEB scores. The leaderboard is hosted on HuggingFace. MMEB extends this paradigm to vision-language, evaluating models like VLM2Vec and Qwen3-VL-Embedding that map text, images, and video into a unified embedding space.
The overall MTEB score is a simple unweighted average of the main metric across all datasets. Since retrieval has 15 datasets while summarization has only 1, retrieval-heavy models get a disproportionate boost — a known limitation.
How MTEB evaluates each task type — with a concrete retrieval example
Each task type uses a different evaluation protocol. The embedding model only produces vectors — a downstream metric measures how well those vectors capture semantics:
| Task Type | Input | Metric | Example Dataset |
|---|---|---|---|
| Classification | Text → embedding → logistic regression probe | Accuracy | Banking77: "What currencies is an exchange rate calculated in?" → label exchange_rate |
| Clustering | Set of texts → embeddings → k-means | V-measure | ArxivClustering: group paper abstracts by field (math, cs, ...) |
| Pair Classification | Two texts → cosine similarity → threshold | Average Precision | TwitterURL: "The new iPhone has a stunning display" vs "Apple's latest phone features an amazing screen" → paraphrase? |
| Reranking | Query + candidate list → rank by cosine sim | MAP | AskUbuntu: rerank candidate answers for Ubuntu questions |
| Retrieval | Query → search entire corpus by cosine sim | nDCG@10 | MSMARCO, NQ, HotpotQA, SciFact, ... (15 datasets from BEIR) |
| STS | Two sentences → cosine sim vs gold score | Spearman ρ | STSBenchmark: "A man is playing the cello" vs "A man seated is playing the cello" → gold 4.25/5.0 |
| Summarization | Machine summary → cosine sim to human summary | Spearman ρ | SummEval (1 dataset only) |
| Bitext Mining | Sentences in language A → find translation in B | F1 | Tatoeba: FR "Morales remporte l'élection..." ↔ EN "Morales went on to win..." |
Retrieval scoring walkthrough (nDCG@10):
Given query "What is the capital of France?" and a corpus of 10,000 documents with 3 relevant ones (doc_42, doc_789, doc_3001). The model embeds everything and ranks by cosine similarity:
| Rank | Document | Relevant? | Gain (2^rel − 1) | Discount (1/log₂(i+1)) |
|---|---|---|---|---|
| 1 | doc_42 | Yes | 1 | 1.000 |
| 2 | doc_100 | No | 0 | 0.631 |
| 3 | doc_789 | Yes | 1 | 0.500 |
| 4 | doc_555 | No | 0 | 0.431 |
| 5 | doc_3001 | Yes | 1 | 0.387 |
| 6–10 | ... | No | 0 | ... |
DCG@10 = 1×1.0 + 1×0.5 + 1×0.387 = 1.887. Ideal DCG (all 3 relevant at ranks 1-3) = 1.0 + 0.631 + 0.5 = 2.131. nDCG@10 = 1.887/2.131 ≈ 0.885. The logarithmic discount penalizes relevant documents appearing lower — rank 1 counts twice as much as rank 3.
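The same computation as a self-contained sketch (for illustration only, not the official evaluator; the function name is ours):

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k from a ranked list of graded relevances, gain = 2^rel - 1."""
    def dcg(rels):
        # Rank i (1-indexed) gets discount 1 / log2(i + 1).
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The walkthrough above: relevant documents land at ranks 1, 3, and 5.
ranked = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
score = ndcg_at_k(ranked)   # ~0.885
```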
Instruction-aware evaluation: Models like E5-Mistral prepend task-specific instructions to queries only (not documents). For example, NQ retrieval uses "Given a question, retrieve Wikipedia passages that answer the question". Instructions contribute +4.2 MTEB points on average.
MTEB English Leaderboard
| Model | Params | Attention | Pooling | MTEB Eng v2 | MTEB Eng v1 (56 tasks) |
|---|---|---|---|---|---|
| Qwen3-Embedding-8B | 8B | Causal | EOS | 75.22 | – |
| Qwen3-Embedding-4B | 4B | Causal | EOS | 74.60 | – |
| Gemini Embedding | – | – | – | 73.30 | – |
| gte-Qwen2-7B | 7B | – | – | 70.72 | 70.24 |
| Qwen3-Embedding-0.6B | 0.6B | Causal | EOS | 70.70 | – |
| NV-Embed-v2 | 7B | Bidirectional | Latent-attn | 69.81 | 72.31 |
| KaLM-V2.5 | 0.5B | Bidirectional | Mean | – | 69.33 |
| GritLM-7B | 7B | Bi/Causal | Mean | – | 66.8 |
| Causal2Vec-Mistral (ICL) | 7B+110M | Causal | C+EOS | – | 66.85 |
| E5-Mistral-7B | 7B | Causal | EOS | – | 66.6 |
| LLM2Vec-LLaMA3-8B | 8B | Bidirectional | Mean | – | 65.01 |
More benchmarks: Multilingual, Code, Long-Document, and Reasoning-Intensive Retrieval
Multilingual MTEB (MMTEB)
| Model | Params | MMTEB Avg |
|---|---|---|
| Qwen3-Embedding-8B | 8B | 70.58 |
| Qwen3-Embedding-4B | 4B | 69.45 |
| Gemini Embedding | – | 68.37 |
| Qwen3-Embedding-0.6B | 0.6B | 64.33 |
| multilingual-e5-large | 0.6B | 63.22 |
| gte-Qwen2-7B | 7B | 62.51 |
Code Retrieval (MTEB Code)
| Model | Score |
|---|---|
| Qwen3-Embedding-8B | 80.68 |
| Qwen3-Embedding-4B | 80.06 |
| Qwen3-Embedding-0.6B | 75.41 |
| Gemini Embedding | 74.66 |
Long-Document Retrieval (LongEmbed)
DiffEmbed (Zhang et al., 2025) has the largest advantage in this scenario:
| Model | Long-Doc Avg | Passkey (≤4K) |
|---|---|---|
| DiffEmbed (Dream-7B) | 62.2% | 100% |
| Mistral+LLM2Vec | 58.6% | 98.8% |
| LLaMA3+LLM2Vec | 42.0% | 59.6% |
Reasoning-Intensive Retrieval (Bright / TheoremQA)
| Model | TheoremQA (Q) | TheoremQA (T) | Bright Avg |
|---|---|---|---|
| DiffEmbed | 48.3 | 38.9 | 33.2 |
| Qwen2.5 (best AR) | 40.2 | 34.7 | 30.6 |
| LLaMA3+LLM2Vec | 33.8 | 28.3 | 24.6 |
Long-Text Embedding
Long-text embedding is where the limitations of autoregressive models become most acute. Models vary widely in supported context length:
| Model | Max Context | Training Length | Position Encoding |
|---|---|---|---|
| Qwen3-Embedding | 32K | 32K | RoPE (YaRN extendable to 128K) |
| NV-Embed-v2 | 32K | 512 | RoPE |
| E5-Mistral | 4K → 32K | 512 | RoPE (NTK interpolation) |
| LLM2Vec | 8K (Mistral) | 512 | RoPE |
| GritLM | Arbitrary (sliding window) | 2K | RoPE |
| KaLM-V2.5 | 512 | 512 | RoPE |
A critical observation: most models are trained on short sequences (512 tokens) even when the base LLM supports much longer contexts. This creates a train-test length mismatch that degrades performance.
LongEmbed benchmark (Zhu et al., 2024) evaluates retrieval across lengths from 256 to 32K tokens on 6 datasets (2 synthetic: Needle-in-a-Haystack, Passkey Retrieval; 4 real-world: NarrativeQA, QMSum, 2WikiMQA, SummScreenFD). The best baseline achieves only 64.4 average — indicating large room for improvement.
Key results on LongEmbed:
| Model | Long-Doc Avg | Passkey (≤4K) | Key Technique |
|---|---|---|---|
| DiffEmbed | 62.2% | 100% | Natively bidirectional (diffusion LM) |
| Mistral+LLM2Vec | 58.6% | 98.8% | Bidirectional conversion |
| LLaMA3+LLM2Vec | 42.0% | 59.6% | Bidirectional conversion |
| E5-Mistral + NTK ext. | 75.3 | – | RoPE NTK interpolation (+10.9 pts) |
Why do embeddings degrade with length? Length-Induced Embedding Collapse
Zhou et al. (2025) (ACL 2025) identified a phenomenon called Length-Induced Embedding Collapse: self-attention acts as a low-pass filter, and longer sequences increase the attenuation rate of high-frequency components. This causes token representations to retain only the DC (Direct-Current) component — embeddings of longer texts collapse into a narrow region of embedding space with abnormally high pairwise cosine similarity.
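The low-pass-filter intuition can be reproduced with a toy numpy experiment. The matrix below is a random attention-like row-stochastic matrix, not a trained model; repeated application (a stand-in for depth and longer-range mixing) smooths token representations toward a single point:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 32, 8

tokens = rng.normal(size=(n_tokens, dim))
logits = rng.normal(size=(n_tokens, n_tokens))
weights = np.exp(logits)
weights /= weights.sum(axis=1, keepdims=True)  # positive, rows sum to 1

def spread(x):
    """Per-dimension range (max - min), averaged over dimensions."""
    return float((x.max(axis=0) - x.min(axis=0)).mean())

smoothed = tokens.copy()
for _ in range(20):                # repeated attention-style averaging
    smoothed = weights @ smoothed

# After repeated smoothing, token-to-token (high-frequency) variation is
# almost entirely gone; only the mean ("DC") component survives.
```

Because the averaging matrix is strictly positive and row-stochastic, each application provably shrinks the spread of the token representations, which is the collapse mechanism in miniature.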
Concrete degradation numbers on BGE:
| Input Length | Classification Accuracy |
|---|---|
| 0–100 tokens | 75.6% |
| 100–200 tokens | 72.1% |
| 200–300 tokens | 66.8% |
| 300–400 tokens | 63.2% |
| 400–500 tokens | 59.0% (−16.6 pts) |
This is consistent across ANCE, GTR, GIST, BGE, and E5 models, and also observed in LLM-based models.
Solution (TempScale): divide the attention logits by a temperature τ ∈ (0, 1] before the softmax, i.e. softmax(QKᵀ / (τ√d)). This preserves high-frequency information for longer texts, yielding +0.94% MTEB average and +1.10% LongEmbed improvement.
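A minimal numpy sketch of the TempScale logit scaling (the τ value and tensor shapes are illustrative, not the paper's settings):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(q, k, tau=1.0):
    """Attention weights with TempScale: logits are divided by tau * sqrt(d)
    instead of the usual sqrt(d); tau in (0, 1] sharpens the distribution."""
    d = q.shape[-1]
    return softmax(q @ k.T / (tau * np.sqrt(d)))

def entropy(w):
    """Mean per-row entropy of the attention distribution."""
    return float(-(w * np.log(w + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 64))  # 16 query tokens, head dim 64
k = rng.normal(size=(16, 64))

w_plain = attention_weights(q, k, tau=1.0)
w_sharp = attention_weights(q, k, tau=0.5)  # TempScale with tau < 1
```

Lower-entropy (sharper) attention does less averaging per layer, which is exactly how TempScale counteracts the low-pass-filter effect on long inputs.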
Other long-context techniques:
- RoPE extension (NTK interpolation, YaRN, SelfExtend): Rescale the position encoding frequency basis to support longer sequences. E5-Mistral gains +10.9 pts on LongEmbed via NTK.
- Late Chunking (Jina AI, 2024): Run the full transformer over the entire long document first, then chunk the token embeddings before mean pooling. Preserves cross-chunk context. +3.63% relative improvement over naive chunking.
- HTP: Hierarchical segment summaries create backward information flow — particularly effective for long documents where over-squashing is severe.
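Of the techniques above, Late Chunking is simple enough to sketch directly. In this numpy sketch the contextualized token matrix stands in for a real transformer's last hidden states over the full document, and the chunk boundaries are illustrative:

```python
import numpy as np

def late_chunking(token_embeddings, chunk_bounds):
    """Pool per chunk AFTER the full-document forward pass, so every chunk
    embedding is conditioned on the whole document's context."""
    return np.stack([token_embeddings[s:e].mean(axis=0)
                     for s, e in chunk_bounds])

# Stand-in for a transformer's last hidden states over the full document;
# in practice this comes from one long-context forward pass.
rng = np.random.default_rng(0)
contextual = rng.normal(size=(300, 32))      # 300 tokens, hidden dim 32
bounds = [(0, 100), (100, 200), (200, 300)]  # chunk boundaries (in tokens)

chunk_embs = late_chunking(contextual, bounds)  # one vector per chunk
```

The contrast with naive chunking is where you encode: naive chunking runs the encoder on each chunk in isolation and then pools, so no chunk ever sees its neighbors' context.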
Efficiency-Performance Trade-offs
效率-性能权衡
| Model | Params | Performance | Highlight |
|---|---|---|---|
| KaLM-V2.5 | 0.5B | MTEB 69.33 | 0.5B surpasses 7B E5-Mistral (66.6) |
| Qwen3-Emb-0.6B | 0.6B | MMTEB 64.33 | 0.6B surpasses all open-source 7B models |
| Causal2Vec | 7B+110M | MTEB 66.85 | 82% inference time reduction, 85% sequence length reduction (vs bge-en-icl) |
| AutoRegEmbed | 7B | STS 83.81 | Only ~66K training samples needed (NV-Embed needs 1M+) |
Generation + Embedding: Unified Models
Generation + Embedding 双能力
The core finding of GritLM (Muennighoff et al., 2024) is zero-loss unification:
| Model | MTEB | MMLU | Notes |
|---|---|---|---|
| GritLM-7B | 66.8 | 57.6% | Matches both single-task specialists |
| GritLM-8x7B | 65.7 | 66.7% | Generation close to Mixtral-8x7B-Instruct |
| GEM-LLaMA-1B | 54.35 | 28.36% | Only 1/10 the training data of GritLM |
Training Mistral-7B for embedding only yields 66.8 MTEB but just 7.6 on generation (generation capability is destroyed); training for generation only yields 41.2 MTEB and 55.2 generation. GritLM's unified training achieves 66.8 MTEB and 55.5 generation, matching both single-task specialists with no trade-off.
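GritLM-style unified training optimizes a contrastive loss on pooled representations plus the usual next-token loss. This numpy sketch shows the combined objective on toy tensors; λ, the temperature, and all shapes here are illustrative, not the paper's settings:

```python
import numpy as np

def info_nce(queries, docs, temperature=0.05):
    """Contrastive loss: the i-th query should match the i-th document."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = q @ d.T / temperature
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def next_token_ce(logits, targets):
    """Standard language-modeling cross-entropy over the vocabulary."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

rng = np.random.default_rng(0)
q, d = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))  # pooled reps
lm_logits = rng.normal(size=(10, 100))                     # 10 positions
targets = rng.integers(0, 100, size=10)                    # vocab of 100

lam = 1.0  # weighting between the two objectives (illustrative)
loss = info_nce(q, d) + lam * next_token_ce(lm_logits, targets)
```

The key design point is that both terms share the same backbone: the contrastive branch pools (bidirectional) hidden states, while the generative branch keeps causal next-token prediction, so neither capability is sacrificed.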
References
参考文献
- Yanzhao Zhang et al. “Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.” arXiv:2506.05176, 2025.
- Chankyu Lee et al. “NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.” ICLR 2025 (Spotlight). arXiv:2405.17428, 2024.
- Parishad BehnamGhader et al. “LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders.” COLM 2024. arXiv:2404.05961, 2024.
- Niklas Muennighoff et al. “Generative Representational Instruction Tuning.” arXiv:2402.09906, 2024.
- Liang Wang et al. “Improving Text Embeddings with Large Language Models.” ACL 2024. arXiv:2401.00368, 2024.
- Xinping Zhao et al. “KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model.” arXiv:2506.20923, 2025.
- Ailiang Lin et al. “Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models.” arXiv:2507.23386, 2025.
- Jingcheng Deng et al. “Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment.” EMNLP 2025. arXiv:2502.11401, 2025.
- Ziyue Li, Tianyi Zhou. “Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free.” ICLR 2025 (Oral). arXiv:2410.10814, 2024.
- Xueying Ding et al. “Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings.” arXiv:2511.14868, 2025.
- Caojin Zhang et al. “GEM: Empowering LLM for both Embedding Generation and Language Understanding.” arXiv:2506.04344, 2025.
- Siyue Zhang et al. “Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective.” EMNLP 2025. arXiv:2505.15045, 2025.