RoPE and M-RoPE: Rotation, Decay, and Multimodal Axes

This post derives Rotary Position Embedding (RoPE) from scratch, proves its long-distance decay property, surveys the techniques that extend it past training length (Position Interpolation, NTK-aware scaling, YaRN), and unpacks how Qwen2-VL generalizes it to three axes — temporal, height, width — for M-RoPE. The derivations follow Su et al. (2021); the extensions follow Chen et al. (2023), the original NTK-aware blog by bloc97, and Peng et al. (2023); the multimodal section follows the Qwen2-VL paper and its HuggingFace implementation.

The Problem: Encoding Position into Attention

Attention Is a Set Operation

Self-attention is a sum over a set, not a sequence. For queries \(q_m\) at position \(m\) and keys \(k_n\) at position \(n\), the softmax weights depend only on \(\langle q_m, k_n \rangle\) — permuting the tokens permutes the rows of \(Q\) and \(K\) identically, leaving the output identical up to that permutation. Without an extra signal, attention cannot tell first from last.

This permutation equivariance is a feature, not a bug. It is what lets a Transformer compute all \(T^2\) token-to-token interactions in parallel — the structural reason it eats data faster than an RNN can. But it also means every signal about where a token sits must enter the model from somewhere outside the attention dot product itself. Position encoding is that bridge. The design space splits cleanly by where the signal is injected: into the embedding, into the logit, or into the query/key vectors themselves.
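
The equivariance claim is easy to check numerically. The sketch below is a minimal pure-Python single-head attention with no position signal at all; the toy vectors and the permutation are made up for illustration. Reordering the inputs should reorder the outputs in exactly the same way and change nothing else.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head softmax attention on lists of vectors, with no position signal."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        w = softmax(logits)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

X = [[1.0, 0.0], [0.5, -1.0], [-0.3, 2.0]]  # three toy token vectors (illustrative)
perm = [2, 0, 1]                             # an arbitrary reordering of the tokens
X_perm = [X[p] for p in perm]

Y = attention(X, X, X)
Y_perm = attention(X_perm, X_perm, X_perm)

# Row i of the permuted output should equal row perm[i] of the original output.
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(Y_perm[i], Y[p])
)
print(max_diff)
```

The difference is at floating-point noise level: the layer by itself cannot tell which ordering it was given.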

Three Families of Position Encoding

| Family | Where the signal lives | Translation-invariant? | Examples |
| --- | --- | --- | --- |
| Absolute additive | Added to token embeddings before layer 1 | No | BERT learned, Sinusoidal (Vaswani 2017) |
| Relative bias | Added to logits inside softmax | Yes | T5 bias, ALiBi |
| Rotary | Multiplied into \(q, k\) before dot product | Yes (by construction) | RoPE |

Absolute additive. The original Transformer (Vaswani et al., 2017) uses sinusoidal encoding: at position \(p\), channels \(2i\) and \(2i+1\) get

\[\mathrm{PE}(p, 2i) = \sin(p\, \omega_i), \qquad \mathrm{PE}(p, 2i+1) = \cos(p\, \omega_i), \qquad \omega_i = 10000^{-2i/d},\]

added to the input embedding before layer 1. Note that \(\omega_i\) here is the very same frequency choice RoPE will later use — both are rooted in the idea that geometrically-spaced frequencies give good time-frequency coverage with \(d/2\) basis functions. BERT-style models replace this with a learned table of position embeddings, simpler but capped at the training length.

Relative bias. T5 (Raffel et al., 2020) adds a learned scalar bias \(b_{n-m}\) to the attention logit before softmax, with \(n - m\) binned into log-spaced buckets. ALiBi (Press et al., 2022) takes the bias idea to its minimal extreme:

\[\mathrm{logit}(m, n) = q_m^\top k_n - \lambda_h \cdot |n - m|,\]

where \(\lambda_h\) is a per-head fixed slope (no learnable position parameters at all). Zero parameters, linear decay built in, surprisingly strong length extrapolation — but the bias is independent of \(q\) and \(k\), so it cannot express position-dependent content interactions.

Rotary. RoPE (Su et al., 2021) — the subject of this post — does neither. It rotates \(q\) and \(k\) in 2D subspaces by an angle proportional to position, multiplying position into the vectors before the dot product. The dot product then automatically depends only on the relative offset, with no learnable position parameters and no extra bias term.
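
To make the relative-bias family concrete, here is a minimal sketch of the ALiBi bias term from the formula above. The slope schedule \(2^{-8h/H}\) for head \(h = 1, \dots, H\) is the one given in the ALiBi paper for power-of-two head counts; it is not stated in this post, so treat it as an assumption of the sketch. The function names are illustrative.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8h/H), h = 1..H (ALiBi paper schedule, power-of-two H assumed)."""
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]

def alibi_bias(T, slope):
    """Bias matrix added to the logits: entry (m, n) is -slope * |n - m|."""
    return [[-slope * abs(n - m) for n in range(T)] for m in range(T)]

slopes = alibi_slopes(8)
bias = alibi_bias(6, slopes[0])

# The bias depends on positions only through |n - m|, so shifting both indices changes nothing.
shift_invariant = all(
    bias[m][n] == bias[m + 1][n + 1] for m in range(5) for n in range(5)
)
print(slopes[0], slopes[-1], shift_invariant)
```

Note that the bias never looks at \(q\) or \(k\): it is the same matrix for every input, which is the limitation RoPE avoids.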

Why Translation Invariance Matters

A position encoding is translation-invariant if shifting the entire sequence by \(k\) positions leaves every attention dot product unchanged:

\[\langle q_{m+k},\, k_{n+k} \rangle \;=\; \langle q_m,\, k_n \rangle \quad \forall\, m, n, k.\]

Absolute additive encodings fail this by construction — they inject position \(p_m\) into the embedding, so moving \(x_m\) to slot \(m + k\) replaces \(p_m\) with \(p_{m+k}\) and the dot product changes. RoPE satisfies it exactly:

\[\langle R_{m+k} q,\, R_{n+k} k \rangle \;=\; q^\top R_{(n+k) - (m+k)} k \;=\; q^\top R_{n - m} k,\]

since only the difference \(n - m\) enters the answer. The relationship between two tokens is the same whether they sit at positions \((3, 4)\) or \((1003, 1004)\); the common shift \(k\) cancels exactly.

Why we want this for language. Three reasons stack:

  1. Meaning is compositional and local. The syntactic relation between cat and sat in “The cat sat on the mat” is identical whether the sentence opens a chapter or sits a million tokens deep. Binding semantic relations to absolute coordinates would force the model to relearn the same grammar at every position, throwing away cross-position induction.

  2. Weight sharing buys data efficiency. This is the same argument that justifies convolution in vision: a translation-invariant attention layer learns “subject–verb agreement at distance 1” as one pattern that applies everywhere, instead of a separate pattern per position that each needs its own statistical support.

  3. Length extrapolation. If every relation is parameterized by relative offset, training on \(L = 2048\) tokens teaches the model all offsets \(1, 2, \dots, 2047\). At inference, extending context to \(32{,}768\) reuses those same offsets on many more pairs of tokens; only the offsets beyond \(2047\) are new, and those are what the extension methods have to handle. This is the root reason RoPE plus NTK/YaRN can stretch to 128K while pure absolute encodings crumble past their training length.

Caveat. Language is not perfectly translation-invariant — document openings introduce premises, endings summarize, “once upon a time” sets a different stage than what follows. But those are content-level signals carried by tokens (a [CLS] token, the literal phrase “In summary”), not coordinate-level signals. Letting the position encoding stay purely relative, and pushing document structure into the token stream where it belongs, is a cleaner division of labor.

Figure 1: Slide the sequence right by k. The highlighted pair (C, E) is two tokens apart at every k. Under an additive absolute encoding, the dot product wobbles with k (different absolute positions); under RoPE, it stays exactly constant, because only the difference n − m enters the answer.

To summarize, what we want from a position encoding is: (i) translation invariance: the dot product \(\langle q_m, k_n \rangle\) depends on \(m, n\) only through \(n - m\); (ii) bounded long-distance behavior: the magnitude of that dependence does not blow up with \(\lvert n - m \rvert\); (iii) cheap to compute and friendly with KV-caching at inference. RoPE achieves all three by rotating \(q\) and \(k\) in fixed 2D subspaces, with frequencies that drop geometrically across feature pairs.

RoPE: Rotation in 2D Subspaces

The Two-Dimensional Derivation

Start with \(d = 2\). We seek functions \(f_q(x, m)\) and \(f_k(x, n)\) such that

\[\langle f_q(q, m),\, f_k(k, n) \rangle \;=\; g(q, k, n - m)\]

for some function \(g\) depending on positions only through the offset.

Unpacking that condition. The right-hand side \(g(q, k, n - m)\) has only three arguments — \(q\), \(k\), and the single position quantity \(n - m\). The individual positions \(m, n\) never appear separately; they enter the answer only as a difference. If we found such an \(f\) with \(g(q, k, 2) = 0.73\), then \(\langle f(q, 5), f(k, 7) \rangle\) and \(\langle f(q, 1003), f(k, 1005) \rangle\) would both equal \(0.73\) as well. Contrast absolute additive encoding: \(\langle q + p_m,\, k + p_n \rangle = q^\top k + q^\top p_n + p_m^\top k + p_m^\top p_n\) — three of the four terms involve \(m\) or \(n\) on its own, not just through the difference. The whole point of the RoPE ansatz that follows is to find an \(f\) that makes those absolute-position terms cancel.

Identifying \(\mathbb{R}^2\) with \(\mathbb{C}\) — write \(q = q_1 + i q_2\) — try the ansatz

\[f_q(q, m) = q \cdot e^{i m \theta}, \qquad f_k(k, n) = k \cdot e^{i n \theta}.\]

The (Hermitian) inner product satisfies

\[\langle f_q(q, m), f_k(k, n) \rangle \;=\; \mathrm{Re}\!\left( q \bar{k} \, e^{i(m - n)\theta} \right),\]

which depends on positions only through \(m - n\). Translated back to \(\mathbb{R}^2\), multiplication by \(e^{i m \theta}\) is multiplication by the 2D rotation matrix

\[R_m \;=\; \begin{bmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \phantom{-}\cos m\theta \end{bmatrix}, \qquad f_q(q, m) = R_m q,\; f_k(k, n) = R_n k.\]

The relative-position property is now a one-line linear-algebra fact:

\[(R_m q)^\top (R_n k) \;=\; q^\top R_m^\top R_n k \;=\; q^\top R_{n - m} k,\]

since rotation matrices form a one-parameter group: \(R_m^\top = R_{-m}\) and \(R_{-m} R_n = R_{n - m}\). The complex-number view makes this transparent — phases simply subtract.

Figure 2: Move the m and n sliders. The dashed grey vectors are the original q (red origin) and k (blue origin); the solid arrows are R_m q and R_n k. The right panel shows ⟨R_m q, R_n k⟩ alongside qᵀ R_{n−m} k — they agree exactly. Try sliding m and n together while keeping n − m fixed: the dot product never moves.
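
The same experiment can be run without the sliders. This is a minimal sketch of the 2D case using Python's built-in complex numbers; the vectors `q`, `k` and the angle `theta` are arbitrary illustrative values, and `rope_2d` is a hypothetical helper name.

```python
import cmath

def rope_2d(x, pos, theta):
    """Rotate the 2D vector x = (x1, x2) by the angle pos * theta via complex multiplication."""
    z = complex(x[0], x[1]) * cmath.exp(1j * pos * theta)
    return (z.real, z.imag)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k, theta = (0.8, -0.3), (0.2, 1.1), 0.5

# Same offset n - m = 2 at two very different absolute positions.
near = dot(rope_2d(q, 5, theta), rope_2d(k, 7, theta))
far = dot(rope_2d(q, 1003, theta), rope_2d(k, 1005, theta))

# Closed form from the derivation: Re(q * conj(k) * exp(i (m - n) theta)).
closed_form = (complex(*q) * complex(*k).conjugate() * cmath.exp(1j * (5 - 7) * theta)).real

print(near, far, closed_form)
```

All three numbers agree up to rounding, which is the relative-position property in its smallest possible setting.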

Extending to d Dimensions

Split the \(d\)-dimensional vector into \(d/2\) adjacent pairs \((x_{2i}, x_{2i+1})\) and rotate each pair by its own angle \(m \theta_i\):

\[R_m^{(d)} \;=\; \mathrm{blkdiag}\!\Bigl(R_m^{(\theta_0)},\, R_m^{(\theta_1)},\, \dots,\, R_m^{(\theta_{d/2-1})}\Bigr), \qquad \theta_i \;=\; b^{-2i/d}.\]

Following Vaswani’s sinusoidal encoding, the original RoFormer paper takes \(b = 10000\). Then \(f_q(q, m) = R_m^{(d)} q\), \(f_k(k, n) = R_n^{(d)} k\), and the same group property gives

\[\langle f_q(q, m), f_k(k, n) \rangle \;=\; q^\top R_{n - m}^{(d)} k\]

— relative position, exactly.

Why geometric frequencies? The choice \(\theta_i = b^{-2i/d}\), identical to Vaswani’s sinusoidal encoding, is not an accident. Geometrically spaced frequencies cover the range of scales uniformly on a log axis, which no equally spaced set can do: each successive pair rotates a constant factor \(b^{-2/d}\) slower than the previous one. The fastest pair (\(\theta_0 = 1\)) completes a full rotation every \(2\pi \approx 6.3\) tokens; the slowest (\(\theta_{d/2 - 1} = b^{-(d-2)/d}\)) takes \(2\pi\, b^{(d-2)/d}\) tokens, just under \(2\pi b\). This logarithmic coverage is what gives the encoding both fine-grained adjacency resolution (high freqs) and coarse long-range positional context (low freqs) within a single fixed budget of \(d/2\) pairs.

| Pair index \(i\) | \(\theta_i\) at \(b = 10000, d = 128\) | Wavelength \(2\pi/\theta_i\) (tokens) |
| --- | --- | --- |
| 0 | \(1\) | \(6.3\) |
| 16 | \(0.1\) | \(\approx 63\) |
| 32 | \(0.01\) | \(\approx 628\) |
| 48 | \(0.001\) | \(\approx 6{,}283\) |
| 63 | \(0.000115\) | \(\approx 54{,}410\) |
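
The frequencies and wavelengths follow directly from \(\theta_i = b^{-2i/d}\). A minimal sketch that recomputes them for \(b = 10000\), \(d = 128\) (the helper name `rope_frequencies` is illustrative):

```python
import math

def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2i/d) for the d/2 rotation pairs."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

thetas = rope_frequencies(128)
wavelengths = [2.0 * math.pi / t for t in thetas]  # tokens per full rotation

for i in (0, 16, 32, 48, 63):
    print(f"pair {i:2d}: theta = {thetas[i]:.3e}, wavelength = {wavelengths[i]:,.1f} tokens")
```

Because \(2i/d\) hits \(0.25, 0.5, 0.75\) at \(i = 16, 32, 48\), those pairs land on exact powers of ten of the base, which makes them convenient sanity checks.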

From Math to PyTorch

The block-diagonal matrix is never materialized. Instead, precompute two tensors of shape \((T, d)\):

\[\cos[m,\,2i] = \cos[m,\,2i+1] = \cos(m \theta_i), \qquad \sin[m,\,2i] = \sin[m,\,2i+1] = \sin(m \theta_i),\]

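and apply RoPE elementwise. That indexing matches the adjacent-pair layout \((x_{2i}, x_{2i+1})\) used so far. Many modern implementations (the HuggingFace ports of Llama, Qwen, and GPT-NeoX among them) instead adopt the “split halves” layout, where the first \(d/2\) channels store the “real” parts and the second \(d/2\) store the “imaginary” parts. Under that convention, pair \(i\) is \((x_i,\, x_{i + d/2})\), the tables are indexed as \(\cos[m,\,i] = \cos[m,\,i + d/2] = \cos(m\theta_i)\) (and likewise for \(\sin\)), and the canonical PyTorch form is: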

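```python
import torch

def rotate_half(x):
    # Split the last dim into halves (a, b) and return (-b, a):
    # a 90-degree rotation of every pair (x_i, x_{i + D/2}).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: (B, H, T, D);  cos, sin: (1, 1, T, D), broadcast over batch and heads.
    # Per pair: (a, b) -> (a*cos - b*sin, b*cos + a*sin), a rotation by m * theta_i.
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```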

Two layout conventions, same math. The original RoFormer paper uses adjacent pairs — \((x_0, x_1), (x_2, x_3), \dots\) — and a different rotate_half formula. The split-halves layout above is equivalent under a permutation: both produce identical inner products \(\langle R_m q, R_n k \rangle\) for the same \(\theta_i\). Always check which convention a checkpoint uses before mixing weights across codebases — silent permutation mismatches show up as garbled outputs, not crashes.
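
The equivalence of the two layouts can be checked directly. The sketch below is pure Python rather than PyTorch, with illustrative function names and toy vectors; it applies RoPE in both layouts and confirms that they differ only by a fixed channel permutation, so the rotated dot products match.

```python
import math

def rope_interleaved(x, pos, thetas):
    """Adjacent-pair layout: pair i is (x[2i], x[2i+1])."""
    out = list(x)
    for i, t in enumerate(thetas):
        c, s = math.cos(pos * t), math.sin(pos * t)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out

def rope_split_halves(x, pos, thetas):
    """Split-halves layout: pair i is (x[i], x[i + d/2])."""
    h = len(x) // 2
    out = list(x)
    for i, t in enumerate(thetas):
        c, s = math.cos(pos * t), math.sin(pos * t)
        a, b = x[i], x[i + h]
        out[i], out[i + h] = a * c - b * s, a * s + b * c
    return out

def to_split_halves(x):
    """Channel permutation from interleaved order to split-halves order."""
    return x[0::2] + x[1::2]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

d = 8
thetas = [10000.0 ** (-2.0 * i / d) for i in range(d // 2)]
q = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, 1.4, -0.8]   # toy vectors, interleaved order
k = [1.1, 0.4, -0.6, 0.2, 0.8, -0.3, 0.5, 1.0]

score_interleaved = dot(rope_interleaved(q, 5, thetas), rope_interleaved(k, 9, thetas))
score_split = dot(
    rope_split_halves(to_split_halves(q), 5, thetas),
    rope_split_halves(to_split_halves(k), 9, thetas),
)
print(score_interleaved, score_split)
```

The permutation has to be applied consistently to both \(q\) and \(k\) (in practice, to the projection weights). Applying one layout's rotation to weights stored in the other layout is the silent mismatch described above.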

Four practical remarks:

  1. Q, K only — never V. Relative position should bias who attends to whom, not the content that gets aggregated. Rotating \(V\) would distort the values themselves.

  2. Shared cos/sin tables. The same table is shared across heads and layers — only \(\theta_i\) depends on the model. Memory is \(O(T \cdot d)\), computed once at module init.

  3. KV-cache trivial. Because the rotation is applied position-by-position, cached keys carry their rotation from when they were first written; new queries rotate at their new position; no recomputation is needed when the sequence grows. Compare with absolute encodings, which would require re-running layer 0 if positions shift.

  4. Linear-attention compatible. Rotation is orthogonal, so it preserves norms and keeps the inner product dependent only on the relative offset. This lets RoPE combine with kernelized / linear-attention variants such as Performer and RetNet: the RoFormer paper applies the rotation to the feature-mapped queries and keys in the numerator and leaves the normalizing denominator unrotated.

Why It Works: Long-Distance Decay

The Decay Bound via Abel Summation

The relative-position property alone tells us nothing about magnitudes. In principle, the rotation could leave \(\langle q_m, k_n \rangle\) oscillating at full amplitude no matter how far apart \(m\) and \(n\) are. The actual RoPE design has a stronger property: with \(\theta_i = 10000^{-2i/d}\), the inner product decays (on average) as \(\lvert n - m \rvert\) grows. This is the closest thing RoPE has to ALiBi-style recency bias, but it emerges from the math rather than from a hand-tuned slope.

Group the rotated dot product per pair. Writing

\[h_i \;=\; q_{[2i]} k_{[2i]} + q_{[2i+1]} k_{[2i+1]} + i\,(q_{[2i+1]} k_{[2i]} - q_{[2i]} k_{[2i+1]})\]

as a complex number capturing the \(i\)-th pair’s contribution, the relative dot product becomes

\[\langle R_m q, R_n k \rangle \;=\; \mathrm{Re}\sum_{i=0}^{d/2 - 1} h_i\, e^{i(m - n)\theta_i}.\]

Define the partial phase sum \(S_j = \sum_{i=0}^{j-1} e^{i(m - n)\theta_i}\) with \(S_0 = 0\). Abel summation (summation by parts) gives

\[\sum_{i=0}^{d/2-1} h_i\, e^{i(m-n)\theta_i} \;=\; h_{d/2-1}\, S_{d/2} \;-\; \sum_{i=0}^{d/2-2} S_{i+1}\,(h_{i+1} - h_i),\]

so that, adopting the convention \(h_{d/2} = 0\) (which folds the boundary term \(h_{d/2-1} S_{d/2}\) into the sum as the \(i = d/2 - 1\) difference),

\[\bigl|\langle R_m q, R_n k \rangle\bigr| \;\le\; \Bigl(\max_i |h_{i+1} - h_i|\Bigr)\, \sum_{i=1}^{d/2} |S_i|.\]

The first factor is content-dependent; the second is purely geometric. The RoFormer paper plots \(\frac{1}{d/2}\sum_i \lvert S_i \rvert\) against \(\lvert n - m \rvert\) for \(\theta_i = 10000^{-2i/d}\) and shows it falls off: fast at first, then more slowly, with the residual oscillation expected from a finite sum of incommensurate frequencies.
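
Both the summation-by-parts identity and the resulting bound can be checked numerically. In this sketch the content terms `h` are made-up complex numbers (any values work), `delta` stands for \(m - n\), and the helper names are illustrative.

```python
import cmath
import math

def phase_partial_sums(delta, thetas):
    """Return [S_1, ..., S_{d/2}] with S_j = sum_{i<j} exp(1j * delta * theta_i)."""
    sums, running = [], 0j
    for t in thetas:
        running += cmath.exp(1j * delta * t)
        sums.append(running)
    return sums

def mean_abs_partial_sum(delta, thetas):
    """(1 / (d/2)) * sum_j |S_j|, the geometric factor in the bound."""
    return sum(abs(s) for s in phase_partial_sums(delta, thetas)) / len(thetas)

d = 128
thetas = [10000.0 ** (-2.0 * i / d) for i in range(d // 2)]

delta = 17
h = [complex(math.sin(i), math.cos(2 * i)) for i in range(d // 2)]  # arbitrary content terms
S = phase_partial_sums(delta, thetas)  # S[j - 1] holds S_j

lhs = sum(h_i * cmath.exp(1j * delta * t) for h_i, t in zip(h, thetas))
rhs = h[-1] * S[-1] - sum(S[i] * (h[i + 1] - h[i]) for i in range(len(h) - 1))

# Bound with the convention h_{d/2} = 0, so the last difference is |0 - h_{d/2-1}|.
max_diff = max([abs(h[i + 1] - h[i]) for i in range(len(h) - 1)] + [abs(h[-1])])
bound = max_diff * sum(abs(s) for s in S)

curve = {dist: mean_abs_partial_sum(dist, thetas) for dist in (0, 1, 10, 100, 1000)}
print(abs(lhs - rhs), abs(lhs.real) <= bound, curve)
```

At distance 0 every phase is 1, so \(S_j = j\) and the mean is \((d/2 + 1)/2 = 32.5\); every nonzero distance gives a strictly smaller value.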

Geometric Frequencies and Phase Decorrelation

The bound has a clean physical interpretation. The high-frequency pairs (small \(i\), \(\theta_i \approx 1\)) rotate by a different amount per step than the low-frequency pairs (large \(i\), \(\theta_i \to 0\)). When you sum the contributions, the phases at different frequencies generically don’t reinforce — they decorrelate. Only when \(m = n\) are all phases zero and the sum is exactly \(\sum_i h_i\) (constructive interference). Move \(n\) away from \(m\) and the phases drift apart at different rates, producing destructive interference.

The geometric spacing of \(\theta_i\) is what spreads the rotation rates evenly on log-scale, so that no two pairs stay in lockstep over typical distances. Compare with three alternatives:

  • All identical frequencies (\(\theta_i = \theta\) for all \(i\)). All pairs rotate in phase. The sum has the same magnitude as a single pair — no decay; behaves like a single complex exponential.
  • Linear spacing (\(\theta_i = (1 + i)\theta_0\)). Pairs decorrelate, but on linear timescales. Decay happens but the spectrum lacks log-scale coverage — fewer slow frequencies, so worse long-range behavior.
  • Geometric spacing (\(\theta_i = b^{-2i/d}\)). Pairs decorrelate evenly on log-scale. This is the RoPE / sinusoidal choice. Decay is graceful and the spectrum covers many orders of magnitude with only \(d/2\) basis functions.
Figure 3: Normalized magnitude (1/(d/2))·|Σᵢ exp(i · Δ · θᵢ)| as a function of distance Δ = |n − m|. Starts at 1 (perfect alignment) and decays through destructive interference of the geometric frequencies. Drag the base slider: a larger b makes the slowest frequencies even slower, flattening the tail — the same lever Llama 3.1 pulls (b = 500000) to preserve resolution at long range. Drag d: bigger head dim means more frequencies, averaging into a smoother curve.
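
The three spacing choices can be compared with the same normalized magnitude the figure plots. This is a minimal sketch; the specific values used for the identical and linear spectra (0.1 and 0.01) are arbitrary illustrative choices.

```python
import cmath

def normalized_magnitude(delta, thetas):
    """(1 / (d/2)) * |sum_i exp(1j * delta * theta_i)| for a given distance delta."""
    return abs(sum(cmath.exp(1j * delta * t) for t in thetas)) / len(thetas)

n_pairs = 64
identical = [0.1] * n_pairs                                       # every pair at the same rate
linear = [(1 + i) * 0.01 for i in range(n_pairs)]                 # theta_i = (1 + i) * theta_0
geometric = [10000.0 ** (-i / n_pairs) for i in range(n_pairs)]   # b^(-2i/d) with d = 128

for delta in (0, 5, 50, 500):
    print(
        delta,
        round(normalized_magnitude(delta, identical), 3),
        round(normalized_magnitude(delta, linear), 3),
        round(normalized_magnitude(delta, geometric), 3),
    )
```

The identical spectrum stays pinned at 1 for every distance, since all phases move together. The other two drop below 1 as soon as the distance is nonzero; how they differ at long range is what the figure is meant to show.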

Practical Implications: The Role of the Base b

Picking the base \(b\) trades off near-field resolution against far-field coverage. The highest frequency \(\theta_0 = 1\) rotates by one radian per step regardless of \(b\); this is what gives RoPE its fine-grained ability to distinguish adjacent tokens. The lowest frequency \(\theta_{d/2-1} = b^{-(d-2)/d}\) shrinks almost in proportion to \(1/b\): at \(b = 10000, d = 128\), \(\theta_{d/2-1} \approx 1.15 \times 10^{-4}\), giving a wavelength of \(\sim 54{,}000\) tokens.

| Base \(b\) | Lowest \(\theta\) (\(d = 128\)) | Longest wavelength | Used by |
| --- | --- | --- | --- |
| 10000 | \(1.2 \times 10^{-4}\) | \(\sim 54\)K tokens | RoFormer, Llama 2, Mistral 7B |
| 100000 | \(1.2 \times 10^{-5}\) | \(\sim 525\)K tokens | (shown for comparison) |
| 500000 | \(2.5 \times 10^{-6}\) | \(\sim 2.6\)M tokens | Llama 3 / 3.1 |
| 1000000 | \(1.2 \times 10^{-6}\) | \(\sim 5.1\)M tokens | Code Llama, Qwen2.5 (some variants) |
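
The slowest pair's frequency and wavelength for any base follow from \(\theta_{d/2-1} = b^{-(d-2)/d}\). A minimal sketch, assuming \(d = 128\) throughout (the head dimension is an assumption; the helper name is illustrative):

```python
import math

def slowest_pair(base, d=128):
    """Return (theta, wavelength in tokens) for the lowest-frequency pair i = d/2 - 1."""
    theta = base ** (-(d - 2) / d)
    return theta, 2.0 * math.pi / theta

for base in (1e4, 1e5, 5e5, 1e6):
    theta, wavelength = slowest_pair(base)
    print(f"b = {base:>9,.0f}: theta = {theta:.2e}, wavelength = {wavelength:,.0f} tokens")
```

Changing `d` shifts these numbers slightly, because the exponent \((d-2)/d\) approaches 1 as the head dimension grows.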

The flip side: very slow frequencies don’t help at short range and still occupy head channels, so pushing \(b\) too high wastes capacity on distances the model rarely sees. With \(b = 500{,}000\), Llama 3.1’s slowest pair has a wavelength more than ten times its announced 128K context, which leaves comfortable headroom while the fast pairs still resolve short distances. The same lever is the cleanest entry point for the length-extension methods covered in the next section.

Beyond Training Length: PI, NTK, YaRN

Why Naive Extrapolation Fails

A model trained on \(L = 2048\) tokens almost always degrades catastrophically when fed sequences much longer than \(L\). The reason is sharper than “the model hasn’t seen those positions”. Pick any frequency pair \(i\). During training, the rotation angles that pair sees are

\[\{\, m \theta_i \;:\; m \in [0, L) \,\}\]

— a set covering \(L \theta_i\) radians. For high-frequency pairs (\(\theta_i\) near 1), \(L\theta_i \gg 2\pi\) — the pair wraps the unit circle many times, sees every phase, and any extrapolation to \(m > L\) produces phases the pair has already practiced. For low-frequency pairs (\(\theta_i\) small), \(L\theta_i\) may be less than \(2\pi\) — the pair has not even completed a full cycle. Asking it to handle \(m = 2L\) pushes its phase into truly novel territory.

In one line. Length extrapolation fails because the slow frequencies never wrapped around during training; the fast ones did. So fixes must target the slow end of the spectrum without disturbing the fast end.

What “wrapped around” buys you. RoPE’s output \((\cos m\theta_i, \sin m\theta_i)\) lives on the unit circle, periodic with period \(2\pi/\theta_i\). Wrapped around means \(L\theta_i > 2\pi\) — the angle completed at least one full revolution during training, so the model has been shown the entire \((\cos, \sin)\) output range for that pair. The downstream weights treat \((\cos, \sin)\) as ordinary features — they are not themselves periodic; they only “know” what they were trained on. Once a frequency’s outputs cover the full circle during training, any future angle \(m\theta_i \bmod 2\pi\) lands on a point the model has already mapped. Concretely at \(b = 10000,\, d = 128,\, L = 2048\): pair 0 (\(\theta_0 = 1\)) sees \(2048\) radians of training angle — about 326 full cycles — every \((\cos, \sin)\) value many times over. Pair 63 (\(\theta_{63} \approx 1.15 \times 10^{-4}\)) sees only \(0.24\) radians — about 4% of one cycle — and the \(\cos\) output never leaves \([0.97, 1]\). Extrapolate to \(m = 16384\) and the slow pair’s \(\cos\) drops to \(-0.31\), feeding the downstream weights a value they have never been trained on. That is why the fast end needs no surgery and the slow end does.
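
The numbers in this example can be reproduced in a few lines. The sketch below assumes the same setting (\(b = 10000\), \(d = 128\), \(L = 2048\)) and counts how many pairs complete at least one full revolution during training; `training_cycles` is an illustrative helper name.

```python
import math

def training_cycles(theta, L):
    """Approximate number of full revolutions a pair completes over positions 0..L-1."""
    return L * theta / (2.0 * math.pi)

d, base, L = 128, 10000.0, 2048
thetas = [base ** (-2.0 * i / d) for i in range(d // 2)]

cycles = [training_cycles(t, L) for t in thetas]
wrapped = [i for i, c in enumerate(cycles) if c >= 1.0]  # pairs that saw the whole circle

slow = thetas[-1]
cos_end_of_training = math.cos((L - 1) * slow)  # smallest cos the slow pair ever produced
cos_extrapolated = math.cos(16384 * slow)       # value it is asked to produce at m = 16384

print(f"pair 0 cycles:  {cycles[0]:.1f}")
print(f"pair 63 cycles: {cycles[-1]:.3f}")
print(f"wrapped pairs:  {len(wrapped)} of {len(thetas)}")
print(f"slow-pair cos:  {cos_end_of_training:.3f} in training, {cos_extrapolated:.3f} at m = 16384")
```

Under these settings, pairs 0 through 40 wrap at least once and pairs 41 through 63 do not, so roughly a third of the spectrum is extrapolating whenever the context grows.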

This single observation organizes the entire length-extension literature. Position Interpolation rescales every frequency uniformly (touches the fast end too — hence needs fine-tuning to recover). NTK-aware leaves the fast end alone and stretches only the slow end. YaRN refines NTK with per-frequency control. All three are post-hoc surgeries on \(\theta_i\).

The Three Major Methods

| Method | Where it modifies \(\theta_i\) | Formula | Need fine-tuning? |
| --- | --- | --- | --- |
| Position Interpolation (PI) | All frequencies, uniformly | \(\theta_i' = \theta_i \cdot L / L'\) | Yes (~1k steps) |
| NTK-aware | Base only | \(b' = b \cdot s^{d/(d-2)},\; s = L'/L\) | Often zero-shot |
| YaRN | Per-frequency ramp + temperature | NTK-by-parts + \(\sqrt{1/t} = 0.1 \ln s + 1\) | Short fine-tune |

Position Interpolation (Chen et al., 2023) takes the bluntest route: pretend position \(m \in [0, L')\) is actually position \(m \cdot L / L' \in [0, L)\). Equivalently, multiply every \(\theta_i\) by the scaling factor \(L/L'\). This keeps RoPE values inside their training range, but it compresses all frequencies equally — the high-frequency pairs that used to distinguish adjacent tokens now barely move per step, which hurts local detail. PI works, but it requires a thousand-or-so fine-tuning steps to recover.
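
As a minimal sketch of the frequency-side view of PI (the function name is illustrative, not taken from any library), scaling every \(\theta_i\) by \(L/L'\) keeps the largest angle any pair sees at the new length inside the range it saw during training:

```python
def pi_frequencies(thetas, L_train, L_target):
    """Position Interpolation: multiply every theta_i by L / L'."""
    scale = L_train / L_target
    return [t * scale for t in thetas]

d, base = 128, 10000.0
L_train, L_target = 2048, 8192
thetas = [base ** (-2.0 * i / d) for i in range(d // 2)]
pi_thetas = pi_frequencies(thetas, L_train, L_target)

# Largest angle per pair: at the end of training vs. at the end of the extended context.
max_angle_train = [(L_train - 1) * t for t in thetas]
max_angle_pi = [(L_target - 1) * t for t in pi_thetas]
stays_in_range = all(a_pi <= L_train * t for a_pi, t in zip(max_angle_pi, thetas))

print(thetas[0], pi_thetas[0], stays_in_range)
```

The cost is visible in the first entry: the fastest pair now moves a quarter of a radian per token instead of one, which is the loss of local resolution that fine-tuning has to repair.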

NTK-aware scaling (originally a LocalLLaMA post, then picked up by everyone) attacks the same problem from the opposite direction: leave the high frequencies alone — they generalize fine — and stretch only the low frequencies. The cleanest implementation just changes the base:

\[b' \;=\; b \cdot s^{d/(d-2)}, \qquad s = L'/L.\]

The exponent \(d/(d-2)\) is chosen so that the lowest frequency \(\theta_{d/2-1} = b'^{-(d-2)/d}\) becomes exactly \(\theta_{d/2-1} \cdot 1/s\) (i.e., gets PI’d by factor \(s\)), while the highest frequency \(\theta_0 = b'^0 = 1\) is unchanged. Intermediate frequencies smoothly interpolate. The big practical win: NTK often extends context without any fine-tuning.
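The two endpoint claims are easy to check numerically. The sketch below is plain Python under assumed values (\(d = 128\), \(b = 10000\), \(s = 8\)); thetas is a helper name introduced here for illustration. It computes the per-pair ratio between the NTK-scaled and original frequencies.

```python
import math

def thetas(d, base):
    """All d/2 RoPE frequencies theta_i = base^(-2i/d)."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

d, b = 128, 10000.0
L, L_new = 2048, 16384
s = L_new / L                              # context extension factor
b_ntk = b * s ** (d / (d - 2))             # NTK-aware base change

orig = thetas(d, b)
ntk = thetas(d, b_ntk)

# Per-pair scaling actually applied; algebraically this is s^(-2i/(d-2)).
ratios = [n / o for n, o in zip(ntk, orig)]

assert math.isclose(ratios[0], 1.0)                       # fastest pair untouched
assert math.isclose(ratios[-1], 1.0 / s, rel_tol=1e-9)    # slowest pair fully interpolated
assert all(a > b_ for a, b_ in zip(ratios, ratios[1:]))   # smooth monotone transition
```

Every pair in between is scaled by an intermediate amount, so even fairly fast pairs are compressed a little. That residual compression is what the per-frequency methods try to remove.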

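YaRN (Peng et al., 2023) refines NTK with two ideas. First, NTK-by-parts: instead of the smooth NTK base change, define a piecewise ramp \(\gamma\) over the ratio \(r_i = L / \lambda_i\) of the original context length to the wavelength \(\lambda_i = 2\pi/\theta_i\) (the number of full rotations pair \(i\) completes within the training context) and interpolate between PI and no-op,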

\[h(\theta_i) \;=\; (1 - \gamma(r_i)) \cdot \frac{\theta_i}{s} \;+\; \gamma(r_i) \cdot \theta_i, \qquad \gamma(r) = \begin{cases} 0 & r < \alpha \\ \frac{r - \alpha}{\beta - \alpha} & \alpha \le r \le \beta \\ 1 & r > \beta \end{cases}\]

with \(\alpha = 1, \beta = 32\) recommended for LLaMA. Frequencies whose wavelength is much shorter than the original context (\(r > \beta\)) are untouched; very-long-wavelength frequencies (\(r < \alpha\)) get full PI scaling. Second, attention temperature: scale both \(q\) and \(k\) by \(\sqrt{1/t}\) with

\[\sqrt{1/t} \;=\; 0.1 \ln(s) + 1,\]

fitted empirically across LLaMA 7B/13B/33B/65B. The temperature compensates for the average entropy increase of attention logits as the context grows, restoring the perplexity curve at long range.
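A minimal sketch of both pieces in plain Python, under assumed values (\(d = 128\), \(b = 10000\), \(L = 2048\), \(s = 8\)). It takes \(r\) to be the number of full rotations a pair completes over the original context, \(r = L\theta_i / 2\pi\), which is the reading consistent with the two limiting cases described above. The function names are my own, not the YaRN reference code.

```python
import math

def yarn_ramp(r, alpha=1.0, beta=32.0):
    """Piecewise-linear ramp gamma(r): 0 below alpha, 1 above beta."""
    if r < alpha:
        return 0.0
    if r > beta:
        return 1.0
    return (r - alpha) / (beta - alpha)

def yarn_thetas(d=128, base=10000.0, L=2048, s=8.0, alpha=1.0, beta=32.0):
    """NTK-by-parts: blend PI (theta / s) and no-op (theta) per frequency pair."""
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        r = L * theta / (2.0 * math.pi)    # rotations completed within the context
        g = yarn_ramp(r, alpha, beta)
        out.append((1.0 - g) * theta / s + g * theta)
    return out

def yarn_qk_scale(s):
    """Factor sqrt(1/t) applied to both q and k."""
    return 0.1 * math.log(s) + 1.0

scaled = yarn_thetas()
```

With these defaults the fastest pair keeps its original frequency, the slowest pair is divided by \(s\), and yarn_qk_scale(1.0) returns 1.0, so the scheme is a no-op when no extension is requested.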

Figure 4: Wavelength 2π / θᵢ across the d/2 = 64 frequency pairs (head_dim = 128, b = 10000), log-scaled. Vanilla RoPE (grey) climbs from 2π at i=0 to ~54k at i=63. PI (orange) shifts the whole curve up by a constant factor s, so every frequency is stretched the same way. NTK-aware (green) leaves the leftmost (high-frequency) pairs essentially untouched and stretches only the right tail. YaRN (blue) is the piecewise NTK-by-parts ramp: PI'd at very-long-wavelength frequencies (r < α), untouched at short-wavelength ones (r > β). The red dashed line marks the training context L = 2048: pairs whose wavelength sits below it generalize naturally; those above it need help.
| 方法 | 修改 \(\theta_i\) 的方式 | 公式 | 是否需要微调 |
| --- | --- | --- | --- |
| Position Interpolation (PI) | 所有频率,统一缩放 | \(\theta_i' = \theta_i \cdot L / L'\) | 需要(~1k 步) |
| NTK-aware | 仅基 | \(b' = b \cdot s^{d/(d-2)},\; s = L'/L\) | 通常零样本即可 |
| YaRN | 分频率 ramp + 温度 | NTK-by-parts + \(\sqrt{1/t} = 0.1 \ln s + 1\) | 短微调 |

Position Interpolation (Chen et al., 2023) 走最简单粗暴的路线:把位置 \(m \in [0, L')\) 假装成 \(m \cdot L / L' \in [0, L)\)。等价地,把每个 \(\theta_i\) 乘以缩放因子 \(L/L'\)。这能把 RoPE 值保持在训练范围内,但代价是所有频率都被等比例压缩——原本用来区分相邻 token 的高频对每步几乎不动了,局部细节受损。PI 有效,但通常需要一千左右的微调步数才能恢复。

NTK-aware scaling(最初是一篇 LocalLLaMA 帖子,之后被广泛采纳)从相反方向解决同一问题:高频泛化良好,不要动;只拉伸低频。最干净的实现是修改基:

\[b' \;=\; b \cdot s^{d/(d-2)}, \qquad s = L'/L.\]

指数 \(d/(d-2)\) 的选择保证最低频率 \(\theta_{d/2-1} = b'^{-(d-2)/d}\) 恰好变为 \(\theta_{d/2-1} \cdot 1/s\)(即被 PI 缩放因子 \(s\)),而最高频率 \(\theta_0 = b'^0 = 1\) 不变。中间频率平滑过渡。最大的实用收益:NTK 通常能无需微调就扩展上下文。

YaRN (Peng et al., 2023) 用两个想法精化 NTK。其一,NTK-by-parts:不再用平滑的 NTK 改基,而是按原上下文长度与波长之比 \(r_i = L / \lambda_i\)(\(\lambda_i = 2\pi/\theta_i\),即第 \(i\) 对在训练上下文内转过的整圈数)定义分段 ramp \(\gamma\),在 PI 与不变之间插值,

\[h(\theta_i) \;=\; (1 - \gamma(r_i)) \cdot \frac{\theta_i}{s} \;+\; \gamma(r_i) \cdot \theta_i, \qquad \gamma(r) = \begin{cases} 0 & r < \alpha \\ \frac{r - \alpha}{\beta - \alpha} & \alpha \le r \le \beta \\ 1 & r > \beta \end{cases}\]

LLaMA 推荐 \(\alpha = 1, \beta = 32\)。波长远短于原上下文的频率(\(r > \beta\))原封不动;超长波长频率(\(r < \alpha\))全用 PI 缩放。其二,注意力温度:把 \(q\) 和 \(k\) 同时乘以 \(\sqrt{1/t}\),

\[\sqrt{1/t} \;=\; 0.1 \ln(s) + 1,\]

跨 LLaMA 7B/13B/33B/65B 经验拟合。温度补偿上下文增长导致的 attention logits 平均熵增,恢复长程的 perplexity 曲线。

图 4:d/2 = 64 个频率对的波长 2π / θᵢ(head_dim = 128, b = 10000),对数纵轴。原始 RoPE(灰)从 i=0 处的 2π 升至 i=63 处约 54k。PI(橙)把整条曲线整体抬升因子 s,所有频率被同样拉伸。NTK-aware(绿)几乎不动最左端(高频),只拉伸右端长尾。YaRN(蓝)是分段 NTK-by-parts ramp:长波长频率(r < α)按 PI 缩放,短波长频率(r > β)原封不动。红虚线标注训练上下文 L = 2048:波长在其下方的频率自然泛化,在其上方的需要帮助。

Other Variants and Architectural Choices

The PI/NTK/YaRN trio is the most-cited core, but the surrounding literature has accumulated a number of useful refinements and alternatives:

| Variant | Idea | Used by |
| --- | --- | --- |
| Dynamic NTK | Apply NTK scaling only when actual context exceeds \(L\); degrade gracefully | HuggingFace transformers (“dynamic” rope_scaling option); Mistral long-context |
| LongRoPE (Ding et al., 2024) | Per-frequency scaling factors found by evolutionary search; needs no smooth functional form | Phi-3 long-context, internal extensions of Llama |
| ABF (adjusted base frequency) | Change the base \(b\) at fine-tune time without other scaling; equivalent to NTK with the right exponent | Code Llama (b = 1e6), Mistral long-context fine-tunes |
| Llama 3.1: trained big-base + extra scaling | \(b = 500{,}000\) from scratch, then YaRN-style “llama3” scaling for 128K context | Llama 3.1 / 3.3 |
| Llama 4: iRoPE (interleaved) | Alternate RoPE layers with NoPE (no positional encoding) layers; NoPE layers act as global, RoPE layers as local | Llama 4 |

Dynamic NTK is the most user-visible: it leaves the model alone for sequences within \(L\), then progressively applies NTK only when needed. This avoids the “you paid an accuracy tax even on short prompts” failure mode of static rescaling.
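A simplified sketch of the idea in plain Python: the effective scale is 1 while the sequence fits in \(L\) and grows with the actual length afterwards. The function name and the exact growth rule \(s = \max(1, \text{seq\_len}/L)\) are assumptions made for illustration; real libraries may use a different rule (for example one with an extra scaling-factor hyperparameter).

```python
def dynamic_ntk_base(seq_len, d=128, base=10000.0, L=2048):
    """Return the RoPE base to use for a sequence of the given length.

    Within the training length the base is unchanged; beyond it, the
    NTK-aware base change is applied with s tied to the current length.
    """
    s = max(1.0, seq_len / L)
    return base * s ** (d / (d - 2))

# Short prompts see exactly the frequencies the model was trained with.
short_base = dynamic_ntk_base(1024)
# Longer sequences get a progressively larger base.
bases = [dynamic_ntk_base(n) for n in (2048, 4096, 8192, 16384)]
```

Because the base depends on the current length, cached keys computed under a smaller base are not consistent with later queries unless they are recomputed. That is a practical cost of the dynamic variant worth keeping in mind.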

LongRoPE abandons the closed-form ramp entirely and treats per-frequency scaling factors as an optimization problem, solved by evolutionary search on a held-out perplexity target. The resulting curves often look messy compared to YaRN’s clean ramp, but they outperform it on the long tail of context-length benchmarks, which suggests that the best scaling shape need not be smooth.

Llama 4’s iRoPE is a structural rather than algebraic move: rather than tinker with \(\theta_i\), alternate the layers — some get RoPE, some get NoPE. The NoPE layers (no position encoding at all) become naturally translation-invariant and unbounded, capturing very-long-range dependencies; the RoPE layers handle local structure. This is closer in spirit to ALiBi than to NTK: the long-context behavior is built into the architecture, not retrofitted onto the embedding.

Bigger picture. Every method here ultimately makes the same trade: high frequencies for local detail, low frequencies for long-range coverage, with the slow-end frequencies stretched somehow when context grows past training length. PI, NTK, YaRN, LongRoPE differ on how to stretch; bigger-base and iRoPE differ on when to commit. Llama 3.1’s choice to train at large base from scratch is a bet that the right place to spend the design budget is before training, not after.

PI/NTK/YaRN 三件套是被引最广的核心,但周边文献已经积累了不少有用的精化与替代:

| 变体 | 思想 | 使用方 |
| --- | --- | --- |
| Dynamic NTK | 只在实际上下文超出 \(L\) 时才应用 NTK 缩放;优雅降级 | HuggingFace transformers(“dynamic” rope_scaling 选项);Mistral 长上下文 |
| LongRoPE (Ding et al., 2024) | 通过演化搜索得到逐频率缩放因子;无需平滑函数形式 | Phi-3 长上下文、Llama 内部扩展 |
| ABF(adjusted base frequency) | 微调时直接改基 \(b\),无其他缩放;指数对了就等价于 NTK | Code Llama (b = 1e6)、Mistral 长上下文微调 |
| Llama 3.1:大基预训练 + 额外缩放 | 从头用 \(b = 500{,}000\),再叠 YaRN 风格的 “llama3” 缩放支持 128K | Llama 3.1 / 3.3 |
| Llama 4:iRoPE(交替) | RoPE 层与 NoPE(无位置编码)层交替;NoPE 层作为全局,RoPE 层作为局部 | Llama 4 |

Dynamic NTK 对用户最直观:序列在 \(L\) 之内时模型保持原样,仅在需要时逐步应用 NTK。这避免了静态缩放“短 prompt 也付精度税”的失效模式。

LongRoPE 完全放弃闭式 ramp,把逐频率缩放因子当成优化问题,用演化搜索在留出的 perplexity 目标上求解。得到的曲线常常比 YaRN 的干净 ramp 看起来更杂乱,但在长上下文 benchmark 的长尾上优于 YaRN,提示最佳的缩放形态未必平滑。

Llama 4 的 iRoPE 是结构性而非代数性的改动:不再折腾 \(\theta_i\),而是交替——部分层用 RoPE,部分用 NoPE。NoPE 层(完全无位置编码)天然平移不变无界,捕捉超长程依赖;RoPE 层处理局部结构。精神上更接近 ALiBi 而非 NTK:长上下文行为内嵌于架构而非事后补丁。

大图景。 这里所有方法本质上做同一笔交易:高频负责局部细节,低频负责长程覆盖,上下文超出训练长度时把慢端频率以某种方式拉伸。PI、NTK、YaRN、LongRoPE 区别在如何拉伸;大基与 iRoPE 区别在何时下决心。Llama 3.1 选择从头训练在大基上,是在赌设计预算最该花在训练前而非训练后。

M-RoPE: Position in Three Axes

Why 1D RoPE Falls Short for Vision and Video

Multimodal sequences carry position information that 1D RoPE cannot express. An image patch has a row and a column; a video frame adds time. Take a 224×224 image patched at 14×14 — that’s 256 patches arranged in a 16×16 grid. Flatten to 1D in row-major raster order:

| Pair of patches | 2D positions | 1D distance |
| --- | --- | --- |
| Horizontal neighbors | \((0, 0)\) and \((0, 1)\) | 1 |
| Vertical neighbors | \((0, 0)\) and \((1, 0)\) | 16 |
| Diagonal neighbors | \((0, 0)\) and \((1, 1)\) | 17 |

A row-major flatten makes vertical neighbors look 16× farther apart than horizontal ones. The attention pattern that emerges has no way to know these distances mean the same physical thing — every spatial relation becomes a function of the flatten order rather than the underlying geometry.
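The distances in the table follow directly from the raster index. A small plain-Python check, with helper names chosen here for illustration:

```python
def raster_index(row, col, width=16):
    """Position of patch (row, col) after a row-major flatten."""
    return row * width + col

def raster_distance(a, b, width=16):
    """1D distance between two patches given as (row, col) tuples."""
    return abs(raster_index(*a, width) - raster_index(*b, width))

def axis_distance(a, b):
    """Per-axis distance (|d_row|, |d_col|), which a 2D scheme can see."""
    return (abs(a[0] - b[0]), abs(a[1] - b[1]))

grid = 224 // 14                            # 16 patches per side
horizontal = raster_distance((0, 0), (0, 1), grid)
vertical = raster_distance((0, 0), (1, 0), grid)
diagonal = raster_distance((0, 0), (1, 1), grid)
```

In per-axis terms the horizontal and vertical neighbors are both exactly one step away, (0, 1) and (1, 0). That symmetry is the information a 1D index throws away.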

Several VL designs pre-Qwen2-VL papered over this with different tricks:

  • NaViT / Pixtral: use a 2D RoPE — each spatial axis gets its own rotation (essentially M-RoPE without the temporal axis), giving height/width neighbors equal proximity.
  • Idefics2 / Idefics3: learn 2D positional embeddings over the patch grid.
  • CLIP-style towers feeding an LLM: rely on the visual encoder to bake 2D structure into patch embeddings before they reach the LLM, so the LLM’s 1D RoPE only has to handle the “sequence of patches” without caring about the grid.

Qwen2-VL’s M-RoPE is more ambitious: it extends the trick to three axes (temporal, height, width) and unifies image, video, and text under one position scheme that gracefully reduces to 1D RoPE when the input is pure text. The same checkpoint serves all three modalities without architecture surgery per modality.

多模态序列携带 1D RoPE 表达不了的位置信息。图像 patch 有行有列;视频帧再加时间。以 14×14 patch 化的 224×224 图像为例——共 256 个 patch 排成 16×16 grid。按行主序压成 1D:

| patch 对 | 2D 位置 | 1D 距离 |
| --- | --- | --- |
| 水平相邻 | \((0, 0)\) 与 \((0, 1)\) | 1 |
| 垂直相邻 | \((0, 0)\) 与 \((1, 0)\) | 16 |
| 对角相邻 | \((0, 0)\) 与 \((1, 1)\) | 17 |

行主序压平让垂直相邻 patch 看起来比水平相邻远 16 倍。涌现的 attention pattern 无从知道这些距离指代同样的物理关系——每个空间关系都变成了压平顺序的函数,而非底层几何。

Qwen2-VL 之前的多个 VL 设计用不同的 trick 糊弄过去:

  • NaViT / Pixtral:用 2D RoPE——每个空间轴各自旋转(本质上就是没有时间轴的 M-RoPE),让高/宽相邻 patch 距离相等。
  • Idefics2 / Idefics3:在 patch grid 上学习 2D 位置嵌入。
  • CLIP 风格塔接 LLM:依赖视觉编码器把 2D 结构烙进 patch embedding,到达 LLM 时 1D RoPE 只需处理“patch 序列”,无需关心 grid。

Qwen2-VL 的 M-RoPE 更激进:把这个 trick 扩展到三个轴(时间、高度、宽度),并在一个位置方案下统一图像、视频和文本,纯文本输入时优雅地退化为 1D RoPE。同一份 checkpoint 服务三种模态,无需逐模态架构改造。

The (t, h, w) Decomposition in Qwen2-VL

The dimension split. For Qwen2-VL’s standard head dimension \(d = 128\), the \(d/2 = 64\) frequency pairs are partitioned as

\[\texttt{mrope\_section} \;=\; [16,\,24,\,24] \quad \text{(temporal,\,height,\,width)}.\]

This is the actual rope_scaling.mrope_section field in the Qwen2-VL HuggingFace config, and it is easy to misread: the split sums to 64, not 128, because it applies to the frequency-pair index, not the full feature dimension. Per head, temporal gets \(2 \times 16 = 32\) channels, height \(2 \times 24 = 48\), width \(2 \times 24 = 48\). The HF implementation repeats the list (mrope_section * 2 gives [16, 24, 24, 16, 24, 24]) before slicing, so each axis owns the same pair indices in both halves of the head dimension.

Concretely, for frequency-pair index \(i \in \{0, 1, \dots, 63\}\), the position used in the rotation \(R_{m_i}^{(\theta_i)}\) is

\[m_i \;=\; \begin{cases} t & i \in [0,\,16) \\ h & i \in [16,\,40) \\ w & i \in [40,\,64) \end{cases}\]

so each pair “listens to” a single axis, but the model can mix axes freely across pairs in subsequent linear layers. Note the asymmetry: temporal gets fewer frequencies than height or width. The paper doesn’t justify the exact ratio, but the choice gives spatial axes more frequency resolution — appropriate for images where local spatial structure dominates.
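The case split above is just a lookup over cumulative section widths. A minimal plain-Python sketch, with names introduced here for illustration:

```python
MROPE_SECTION = [16, 24, 24]                # widths over the frequency-pair index
AXES = ("t", "h", "w")

def axis_of_pair(i, section=MROPE_SECTION):
    """Return which axis ('t', 'h' or 'w') frequency pair i listens to."""
    start = 0
    for axis, width in zip(AXES, section):
        if start <= i < start + width:
            return axis
        start += width
    raise IndexError(f"pair index {i} outside 0..{sum(section) - 1}")

def pair_position(i, t, h, w):
    """Position m_i fed into the rotation of pair i."""
    return {"t": t, "h": h, "w": w}[axis_of_pair(i)]

# Share of the 64 pairs owned by each axis.
counts = {a: sum(axis_of_pair(i) == a for i in range(64)) for a in AXES}
```

Under this split the temporal axis owns a quarter of the pairs, and because it sits at the low indices it also owns the highest-frequency ones.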

Figure 5: Toggle between text, image, and video to see how (t, h, w) is assigned. For text, all three axes carry the same sequence position — so M-RoPE collapses to 1D RoPE and the partition is functionally invisible. For an image, t is pinned to a constant and the two spatial axes carry row and column. For video, t increments per frame on top of the same image-style spatial encoding. The colored bar shows the [16, 24, 24] frequency partition over the 64 pairs.

Position IDs by modality. With three position coordinates instead of one, the assignment of \((t, h, w)\) varies:

| Modality | \(t\) | \(h\) | \(w\) |
| --- | --- | --- | --- |
| Text token at sequence position \(p\) | \(p\) | \(p\) | \(p\) |
| Image patch at row \(r\), col \(c\) in a single frame | constant \(K\) | \(r\) | \(c\) |
| Video patch at frame \(f\), row \(r\), col \(c\) | \(f\) | \(r\) | \(c\) |

The text-as-1D fallback is load-bearing. Setting \(t = h = w = p\) for text makes all three axes receive the same rotation; the per-pair allocation becomes invisible, since every pair, regardless of which axis it “belongs” to, rotates by the same angle \(p\theta_i\) that 1D RoPE would apply. This is precisely the design choice that lets a pretrained 1D-RoPE LLM be drop-in upgraded to M-RoPE without retraining the text-only behavior. Qwen2-VL initializes its language model from the corresponding Qwen2 checkpoint (Qwen2-7B for the 7B model) and inherits its text capability through this clean reduction.

A word on “1D” here — don’t confuse it with the 2D rotation. Two different dimensions are in play. Feature dimension: every RoPE variant, including M-RoPE, rotates in 2D subspaces of the \(d\)-dimensional feature vector — that’s the operation’s geometry, never up for debate. Position dimension: how many separate position coordinates label each token. 1D RoPE attaches a single number \(m\) per token; M-RoPE attaches a triple \((t, h, w)\). “M-RoPE reduces to 1D RoPE for text” refers to the second meaning only: text uses \((p, p, p)\), so the three coordinates collapse to one effective value. The 2D rotation in each frequency subspace is unchanged — but every pair, no matter which axis mrope_section assigned it to, now rotates by the same angle \(p\theta_i\), making the partition functionally invisible. The partition only “lights up” when the three coordinates actually carry distinct values, as for image patches \((K, r, c)\) or video patches \((f, r, c)\).
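The collapse can be checked directly. The sketch below is plain Python with illustrative helper names; it ignores cross-modal offsets and uses \(K = 0\) for the image temporal ID as an arbitrary choice. It builds \((t, h, w)\) triples per modality and compares the resulting per-pair angles with 1D RoPE.

```python
def text_ids(start, n):
    """Text: all three axes carry the sequence position."""
    return [(p, p, p) for p in range(start, start + n)]

def image_ids(rows, cols, t_const=0):
    """Image: t pinned to a constant, h and w carry row and column."""
    return [(t_const, r, c) for r in range(rows) for c in range(cols)]

def video_ids(frames, rows, cols):
    """Video: t is the frame index on top of the image-style layout."""
    return [(f, r, c) for f in range(frames)
            for r in range(rows) for c in range(cols)]

def mrope_angles(pos, d=128, base=10000.0, section=(16, 24, 24)):
    """Rotation angle of each of the d/2 pairs for one (t, h, w) triple."""
    t, h, w = pos
    per_pair = [t] * section[0] + [h] * section[1] + [w] * section[2]
    return [m * base ** (-2.0 * i / d) for i, m in enumerate(per_pair)]

def rope_1d_angles(p, d=128, base=10000.0):
    """Rotation angles of ordinary 1D RoPE at position p."""
    return [p * base ** (-2.0 * i / d) for i in range(d // 2)]

# Text: the partition is invisible, the angles match 1D RoPE exactly.
assert mrope_angles((5, 5, 5)) == rope_1d_angles(5)
# Image patch: the partition lights up, no single p reproduces these angles.
patch = mrope_angles((0, 2, 3))
```

For the image patch, the first 16 angles are zero (constant \(t = 0\)), the next 24 follow the row, and the last 24 follow the column.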

维度划分。 Qwen2-VL 标准 head 维度 \(d = 128\) 下,\(d/2 = 64\) 个频率对按

\[\texttt{mrope\_section} \;=\; [16,\,24,\,24] \quad \text{(temporal,\,height,\,width)}\]

划分。这是 Qwen2-VL HuggingFace config 中的实际 rope_scaling.mrope_section 字段,很容易误读:分割之和是 64 而非 128,因为它针对的是频率对编号,不是完整的特征维度。每个 head 中,时间获得 \(2 \times 16 = 32\) 个通道,高度 \(2 \times 24 = 48\),宽度 \(2 \times 24 = 48\)。HF 实现先把列表重复一遍(mrope_section * 2 得到 [16, 24, 24, 16, 24, 24])再切片,使每个轴在 head 维度的前后两半中占据相同的频率对编号。

具体地,对频率对编号 \(i \in \{0, 1, \dots, 63\}\),用于旋转 \(R_{m_i}^{(\theta_i)}\) 的位置是

\[m_i \;=\; \begin{cases} t & i \in [0,\,16) \\ h & i \in [16,\,40) \\ w & i \in [40,\,64) \end{cases}\]

即每个频率对只“听一个轴”,但模型可以在后续线性层中跨频率对任意混合各轴。注意非对称:时间分到的频率比高/宽少。论文没有解释具体比例的依据,但这个选择给空间轴更高的频率分辨率,对于以局部空间结构为主的图像是合适的。

图 5:切换文本/图像/视频,观察 (t, h, w) 的赋值。文本时三个轴携带同一序列位置——M-RoPE 退化为 1D RoPE,分割在功能上不可见。图像时 t 固定为常数,两个空间轴携带行与列。视频在图像式空间编码之上让 t 按帧递增。彩色条显示 [16, 24, 24] 在 64 对频率上的划分。

各模态的位置 ID。 在三个位置坐标下,\((t, h, w)\) 的赋值因模态而异:

| 模态 | \(t\) | \(h\) | \(w\) |
| --- | --- | --- | --- |
| 序列位置 \(p\) 处的 text token | \(p\) | \(p\) | \(p\) |
| 单帧中行 \(r\)、列 \(c\) 处的图像 patch | 常数 \(K\) | \(r\) | \(c\) |
| 视频中帧 \(f\)、行 \(r\)、列 \(c\) 处的 patch | \(f\) | \(r\) | \(c\) |

“文本作为 1D” 这个退化是关键。 文本设 \(t = h = w = p\) 让三轴接收同一旋转,逐对的轴分配不可见:任何对,无论“属于”哪个轴,都按 1D RoPE 同样的角度 \(p\theta_i\) 旋转。这正是让预训练好的 1D-RoPE LLM 直接平滑升级到 M-RoPE 而无需重新训练纯文本行为的关键设计。Qwen2-VL 的语言模型从对应的 Qwen2 checkpoint 初始化(7B 模型对应 Qwen2-7B),正是通过这个干净的退化继承了其文本能力。

关于这里的 “1D”:别与 2D 旋转混淆。 两个不同的维度在同时出现。特征维度:所有 RoPE 变体,包括 M-RoPE,都在 \(d\) 维特征向量的 2D 子空间中旋转,这是操作本身的几何,从未改变。位置维度:每个 token 用几个独立位置坐标来标注。1D RoPE 给每 token 一个数 \(m\);M-RoPE 给一个三元组 \((t, h, w)\)。“M-RoPE 对文本退化为 1D RoPE” 指的只是第二层含义:文本用 \((p, p, p)\),三个坐标塌缩为同一个有效值。每个频率子空间的 2D 旋转不变,但每个对,无论 mrope_section 把它分给哪个轴,都按同一角度 \(p\theta_i\) 旋转,让分割在功能上隐形。只有当三个坐标真正取不同值时(图像 patch \((K, r, c)\)、视频 patch \((f, r, c)\)),分割才“点亮”。

Cross-Modal Continuity, Ablation, and Implementation

Cross-modal continuity. When the input mixes modalities (say, a video clip followed by a text response), the position IDs must continue smoothly across the boundary. The Qwen2-VL paper specifies: “position numbering for each modality is initialized by incrementing the maximum position ID of the preceding modality by one.” So if a video chunk ends with \(\max(t, h, w) = M\) (attained by the last patch of the last frame, which carries the largest value on every axis), the subsequent text token starts at \(t = h = w = M + 1\) and increments from there. Here \(M\) is a running maximum, distinct from the constant \(K\) used as the image temporal ID in the table above. The reduction of text to identical \((t, h, w)\) makes this seamless: text after image continues the count on all three axes simultaneously.
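The continuation rule can be sketched as a loop over segments in plain Python. The segment encoding and the function name are hypothetical. One loud assumption: each visual segment offsets all three axes by the same start value (previous maximum plus one), which is my reading of the quoted rule rather than something the text above spells out. Segments are assumed non-empty.

```python
def build_position_ids(segments):
    """Assign (t, h, w) IDs to a mixed sequence of modalities.

    segments: list of ("text", n), ("image", rows, cols)
              or ("video", frames, rows, cols).
    """
    ids = []
    start = 0                               # first ID available to the next segment
    for seg in segments:
        kind = seg[0]
        if kind == "text":
            block = [(start + p,) * 3 for p in range(seg[1])]
        elif kind == "image":
            _, rows, cols = seg
            block = [(start, start + r, start + c)
                     for r in range(rows) for c in range(cols)]
        elif kind == "video":
            _, frames, rows, cols = seg
            block = [(start + f, start + r, start + c)
                     for f in range(frames)
                     for r in range(rows) for c in range(cols)]
        else:
            raise ValueError(f"unknown segment kind: {kind}")
        ids.extend(block)
        # Next modality starts one past the largest ID used so far.
        start = max(max(p) for p in block) + 1
    return ids

ids = build_position_ids([("text", 3), ("image", 2, 4), ("text", 2)])
```

In this example the image occupies IDs 3 through 6 on its widest axis, so the trailing text resumes at 7 on all three axes even though the image contributed 8 tokens. Position IDs after a visual segment can therefore grow more slowly than the token count.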

Ablation: does M-RoPE actually help? Qwen2-VL’s Table 8 holds everything else constant and swaps M-RoPE for 1D-RoPE. M-RoPE wins or ties on 9 of the 11 benchmarks, with the largest gaps on spatial and video tasks:

| Benchmark | 1D-RoPE | M-RoPE | \(\Delta\) |
| --- | --- | --- | --- |
| MathVista (geometry-heavy) | 39.2 | 43.4 | +4.2 |
| STAR (video) | 55.5 | 57.9 | +2.4 |
| NextQA (video) | 43.9 | 46.0 | +2.1 |
| MMBench | 58.6 | 60.6 | +2.0 |
| PerceptionTest | 46.6 | 47.4 | +0.8 |
| TextVQA (OCR) | 71.3 | 71.8 | +0.5 |
| ChartQA (OCR) | 68.0 | 68.4 | +0.4 |
| DocVQA (OCR) | 82.5 | 82.8 | +0.3 |
| MMStar | 36.7 | 36.7 | 0.0 |
| RealWorldQA | 54.5 | 53.7 | −0.8 |
| InfoVQA | 50.8 | 50.3 | −0.5 |

Spatial reasoning (MathVista) and video-temporal benchmarks (NextQA, STAR) gain the most, exactly where 2D/3D structure matters. OCR-heavy benchmarks where rasterized 1D order is already informative (DocVQA, ChartQA, TextVQA) gain little; those tasks can effectively recover the row/col structure from the visual encoder. The flat MMStar result and the slight regressions on RealWorldQA and InfoVQA suggest the partition isn’t free: fewer frequencies per axis may mean a slightly less expressive encoding for tasks that don’t benefit from 2D structure.

Implementation. The HF kernel is one line of slice-and-stitch, expressing the per-pair axis assignment as a tensor split:

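import torch

def rotate_half(x):
    # Map (x1, x2) -> (-x2, x1), where x1 / x2 are the first / second halves of the last dim
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section):
    # cos, sin: (3, B, H, T, D), the three position axes (t, h, w) stacked on dim 0
    # mrope_section: e.g. [16, 24, 24], widths over the frequency-pair index
    # List repetition, not elementwise doubling: [16, 24, 24, 16, 24, 24]
    mrope_section = mrope_section * 2
    # Chunk i of the last dim is read from axis i % 3, giving the order t, h, w, t, h, w
    cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1)
    sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot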

Read it as: precompute three full \((B, H, T, D)\) cos/sin tables, one per axis, using \(t, h, w\) position IDs respectively, then weave them together. Because mrope_section * 2 is list repetition, the split widths are [16, 24, 24, 16, 24, 24] and the i % 3 cycle assigns them to t, h, w, t, h, w. For \(D = 128\), the temporal table contributes channels \([0, 16)\) and \([64, 80)\), height contributes \([16, 40)\) and \([80, 104)\), and width contributes \([40, 64)\) and \([104, 128)\). The repetition is needed because RoPE’s “first half / second half” layout pairs channel \(i\) with channel \(i + D/2\), so both members of a frequency pair must read the same axis.
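To see exactly which channels each axis owns, the same split-and-cycle logic can be replayed on plain Python lists, with no tensors involved. The function name is introduced here for illustration.

```python
def channel_axes(section, head_dim):
    """Label every channel of one head with the axis whose table it reads."""
    widths = section * 2                    # list repetition: t, h, w, t, h, w widths
    if sum(widths) != head_dim:
        raise ValueError("2 * sum(section) must equal head_dim")
    axes = "thw"
    labels = []
    for i, width in enumerate(widths):
        # Chunk i takes its values from axis i % 3, as in the kernel above.
        labels.extend(axes[i % 3] * width)
    return labels

labels = channel_axes([16, 24, 24], 128)

# Channel i and channel i + D/2 form one frequency pair and share an axis.
half = len(labels) // 2
assert labels[:half] == labels[half:]
```

The check at the end is the reason for repeating the section list: without it, the two channels of a pair would be rotated with angles from different axes.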

What changes in Qwen2.5-VL. The follow-up Qwen2.5-VL refines this further with a time-aligned M-RoPE: instead of \(t\) being a raw frame index, it is set proportional to the elapsed time in seconds (with a fixed FPS reference). This decouples temporal position from sampling rate — a 30-FPS clip and a 1-FPS down-sampled version of the same scene now encode the same elapsed-time signal. Empirically this helps long-video understanding where the FPS budget can’t capture every frame, and is the kind of refinement that becomes natural once the temporal axis is a named coordinate rather than a hidden one.
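A toy illustration of the decoupling in plain Python. The parameter ids_per_second and both function names are hypothetical, chosen only to show the idea; the actual Qwen2.5-VL configuration and rounding may differ. The example samples the same 4-second scene at 2 FPS and at 1 FPS.

```python
def frame_index_ids(num_frames):
    """Temporal IDs as raw frame indices: depends on how densely we sampled."""
    return list(range(num_frames))

def time_aligned_ids(frame_times_s, ids_per_second=2.0):
    """Temporal IDs proportional to elapsed seconds (hypothetical fixed rate)."""
    return [int(round(ts * ids_per_second)) for ts in frame_times_s]

dense_times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]   # 2 FPS sampling
sparse_times = [0.0, 1.0, 2.0, 3.0]                      # 1 FPS sampling

# Raw indices: the frame at 2.0 s gets ID 4 in one clip and ID 2 in the other.
raw_dense = frame_index_ids(len(dense_times))
raw_sparse = frame_index_ids(len(sparse_times))

# Time-aligned: the frame at 2.0 s gets the same ID in both clips.
aligned_dense = time_aligned_ids(dense_times)
aligned_sparse = time_aligned_ids(sparse_times)
```

With time alignment the sparse clip simply skips IDs, so the gap between two temporal IDs reflects elapsed time rather than the number of sampled frames.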

Closing thought. M-RoPE is conservative by design: zero inference cost, exact 1D-RoPE behavior on text, long-distance decay preserved per axis. The price is the partition — spatial axes take ~75% of the head’s frequencies, leaving temporal with the rest. Whether \([16, 24, 24]\) is the right split is open — the paper doesn’t ablate it, and follow-up VL models tend to inherit the same numbers without revisiting them. The next interesting round of ablations probably lives there, or in whether even the three-axis decomposition itself is the right unit (audio? depth? higher temporal resolution?). The recipe — one position scheme to rule them all, with text as a clean special case — feels durable; the specific numbers, less so.

跨模态连续性。 输入混合模态时(比如视频片段后跟一段文本回复),位置 ID 必须在边界上平滑续接。Qwen2-VL 论文规定:“每个模态的位置编号通过将前一模态的最大位置 ID 加一来初始化。” 所以视频块以 \(\max(t, h, w) = M\) 结尾(由最后一帧最后一个 patch 给出,它在每个轴上都取到最大值),其后文本从 \(t = h = w = M + 1\) 开始递增。这里的 \(M\) 是累计最大值,与上表中图像时间 ID 所用的常数 \(K\) 不同。文本到统一 \((t, h, w)\) 的回退使得这一切无缝:图像后的文本同时在三个轴上续算。

消融:M-RoPE 真的有用吗? Qwen2-VL Table 8 控制其余条件不变,把 M-RoPE 换成 1D-RoPE。11 个 benchmark 中 M-RoPE 在 9 个上领先或持平,空间与视频任务增益最大:

| Benchmark | 1D-RoPE | M-RoPE | \(\Delta\) |
| --- | --- | --- | --- |
| MathVista(几何题多) | 39.2 | 43.4 | +4.2 |
| STAR(视频) | 55.5 | 57.9 | +2.4 |
| NextQA(视频) | 43.9 | 46.0 | +2.1 |
| MMBench | 58.6 | 60.6 | +2.0 |
| PerceptionTest | 46.6 | 47.4 | +0.8 |
| TextVQA (OCR) | 71.3 | 71.8 | +0.5 |
| ChartQA (OCR) | 68.0 | 68.4 | +0.4 |
| DocVQA (OCR) | 82.5 | 82.8 | +0.3 |
| MMStar | 36.7 | 36.7 | 0.0 |
| RealWorldQA | 54.5 | 53.7 | −0.8 |
| InfoVQA | 50.8 | 50.3 | −0.5 |

空间推理(MathVista)与视频时序 benchmark(NextQA, STAR)增益最大,正是 2D/3D 结构起作用的地方。OCR 密集型 benchmark(DocVQA、ChartQA、TextVQA)增益小,因为光栅化的 1D 顺序对它们本就足够,这些任务实际上可以从视觉编码器重新发现行/列结构。MMStar 持平、RealWorldQA 与 InfoVQA 轻微回退,说明分割不是免费的:每个轴更少的频率可能意味着对于不依赖 2D 结构的任务,位置编码的表达力略有下降。

实现。 HF kernel 是一行切片拼接,把逐对的轴分配写成 tensor split:

def apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section):
    # cos, sin: (3, B, H, T, D) —— 三个位置轴堆叠
    # mrope_section: 如 [16, 24, 24],针对频率对编号
    mrope_section = mrope_section * 2  # 列表重复,而非逐元素翻倍 -> [16, 24, 24, 16, 24, 24]
    cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1)
    sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

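读法:预计算三张完整的 \((B, H, T, D)\) cos/sin 表,每个轴一份,分别用 \(t, h, w\) 位置 ID,然后按规则编织。mrope_section * 2 是列表重复,切分宽度为 [16, 24, 24, 16, 24, 24],i % 3 循环依次把它们分给 t, h, w, t, h, w。\(D = 128\) 时,时间表贡献 \([0, 16)\) 和 \([64, 80)\) 通道,高度贡献 \([16, 40)\) 和 \([80, 104)\),宽度贡献 \([40, 64)\) 和 \([104, 128)\)。之所以要重复,是因为 RoPE 的“前半/后半”布局把通道 \(i\) 与通道 \(i + D/2\) 配成一对,同一频率对的两个通道必须读取同一个轴。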

Qwen2.5-VL 的变化。 后续的 Qwen2.5-VL 进一步精化为时间对齐 M-RoPE:\(t\) 不再是原始帧编号,而是与流逝秒数成比例(以固定 FPS 为参考)。这把时间位置与采样率解耦——30 FPS 片段与同一场景 1 FPS 降采样版本现在编码同一流逝时间信号。经验上这帮助长视频理解中 FPS 预算无法捕捉每一帧的情况,是时间轴成为具名坐标而非隐藏坐标后自然涌现的精化。

收尾。 M-RoPE 在设计上是保守的:推理零开销、文本精确保留 1D-RoPE 行为、每个轴独立保留长程衰减。代价在分割——空间轴吃掉 head 中约 75% 的频率,剩下的留给时间。\([16, 24, 24]\) 是不是那个最优分割尚是开放问题——论文没消融,后续 VL 模型也沿用这套数字而未重新审视。下一轮有意思的消融大概率落在这里,或者三轴分解本身是否就是合适的单元(音频?深度?更高时间分辨率?)。配方——一个位置方案统御一切,文本作为干净特例——感觉是耐久的;具体数字则未必。