RoPE and M-RoPE: Rotation, Decay, and Multimodal Axes
The Problem: Encoding Position into Attention
Attention Is a Set Operation
Self-attention is a sum over a set, not a sequence. For queries \(q_m\) at position \(m\) and keys \(k_n\) at position \(n\), the softmax weights depend only on \(\langle q_m, k_n \rangle\) — permuting the tokens permutes the rows of \(Q\) and \(K\) identically, leaving the output identical up to that permutation. Without an extra signal, attention cannot tell first from last.
This permutation equivariance is a feature, not a bug. It is what lets a Transformer compute all \(T^2\) token-to-token interactions in parallel — the structural reason it eats data faster than an RNN can. But it also means every signal about where a token sits must enter the model from somewhere outside the attention dot product itself. Position encoding is that bridge. The design space splits cleanly by where the signal is injected: into the embedding, into the logit, or into the query/key vectors themselves.
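The equivariance claim can be checked numerically. The following is a minimal sketch, assuming a toy single-head attention with identity projections (the `attention` helper and the sample vectors are illustrative, not taken from any library): shuffling the input rows shuffles the output rows in exactly the same way.

```python
import math

def attention(X):
    """Toy single-head self-attention: identity Q/K/V projections, no position signal."""
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) for k in X]
        peak = max(scores)                              # subtract max for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) / total
                    for j in range(len(q))])
    return out

tokens = [[1.0, 0.0], [0.5, 2.0], [-1.0, 0.3]]
perm = [2, 0, 1]                                        # an arbitrary reordering
shuffled = [tokens[p] for p in perm]

out = attention(tokens)
out_shuffled = attention(shuffled)

# Row i of the shuffled output should match row perm[i] of the original output.
max_diff = max(abs(a - b)
               for i, p in enumerate(perm)
               for a, b in zip(out_shuffled[i], out[p]))
print(max_diff)  # tiny floating-point noise only
```

Nothing in the computation refers to a row index, which is why the order is invisible to the layer.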
Three Families of Position Encoding
| Family | Where the signal lives | Translation-invariant? | Examples |
|---|---|---|---|
| Absolute additive | Added to token embeddings before layer 1 | No | BERT learned, Sinusoidal (Vaswani 2017) |
| Relative bias | Added to logits inside softmax | Yes | T5 bias, ALiBi |
| Rotary | Multiplied into \(q, k\) before dot product | Yes (by construction) | RoPE |
Absolute additive. The original Transformer (Vaswani et al., 2017) uses sinusoidal encoding: at position \(p\), channels \(2i\) and \(2i+1\) get
\[\mathrm{PE}(p, 2i) = \sin(p\, \omega_i), \qquad \mathrm{PE}(p, 2i+1) = \cos(p\, \omega_i), \qquad \omega_i = 10000^{-2i/d},\]added to the input embedding before layer 1. Note that \(\omega_i\) here is the very same frequency choice RoPE will later use — both are rooted in the idea that geometrically-spaced frequencies give good time-frequency coverage with \(d/2\) basis functions. BERT-style models replace this with a learned table of position embeddings, simpler but capped at the training length.
Relative bias. T5 (Raffel et al., 2020) adds a learned scalar bias \(b_{n-m}\) to the attention logit before softmax, with \(n - m\) binned into log-spaced buckets. ALiBi (Press et al., 2022) takes the bias idea to its minimal extreme:
\[\mathrm{logit}(m, n) = q_m^\top k_n - \lambda_h \cdot |n - m|,\]where \(\lambda_h\) is a per-head fixed slope (no learnable position parameters at all). Zero parameters, linear decay built in, surprisingly strong length extrapolation — but the bias is independent of \(q\) and \(k\), so it cannot express position-dependent content interactions.
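As a small illustration of the ALiBi logit, here is a sketch in plain Python. The slope schedule is an assumption taken from the ALiBi paper's recipe for power-of-two head counts (a geometric sequence starting at \(2^{-8/H}\)); the function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/H), 2^(-16/H), ...; assumes num_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_logit(score, m, n, slope):
    """Content score q_m . k_n minus a linear distance penalty."""
    return score - slope * abs(n - m)

slopes = alibi_slopes(8)
near = alibi_logit(1.0, 3, 5, slopes[0])        # offset 2, early in the sequence
far = alibi_logit(1.0, 103, 105, slopes[0])     # same offset, 100 tokens later
print(slopes[0], slopes[-1], near, far)
```

The penalty depends only on \(|n - m|\), so `near` and `far` coincide, and it never looks at \(q\) or \(k\), which is the limitation noted above.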
Rotary. RoPE (Su et al., 2021) — the subject of this post — does neither. It rotates \(q\) and \(k\) in 2D subspaces by an angle proportional to position, multiplying position into the vectors before the dot product. The dot product then automatically depends only on the relative offset, with no learnable position parameters and no extra bias term.
Why Translation Invariance Matters
A position encoding is translation-invariant if shifting the entire sequence by \(c\) positions leaves every attention dot product unchanged (we write the shift as \(c\) to keep it distinct from the key vector \(k\)):
\[\langle q_{m+c},\, k_{n+c} \rangle \;=\; \langle q_m,\, k_n \rangle \quad \forall\, m, n, c.\]
Absolute additive encodings fail this by construction: they inject position \(p_m\) into the embedding, so moving \(x_m\) to slot \(m + c\) replaces \(p_m\) with \(p_{m+c}\) and the dot product changes. RoPE satisfies it exactly:
\[\langle R_{m+c}\, q,\, R_{n+c}\, k \rangle \;=\; q^\top R_{(n+c) - (m+c)}\, k \;=\; q^\top R_{n - m}\, k,\]
since only the difference enters the answer. The relationship between 我 ("I") and 吃 ("eat") is the same whether those tokens sit at positions \((3, 4)\) or \((1003, 1004)\); the shift cancels out exactly.
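The contrast can be made concrete with a two-dimensional toy. This is a sketch under stated assumptions: `pos_vec` is a hypothetical sinusoid-style absolute position vector, and the values of `theta`, `q`, and `k` are arbitrary.

```python
import math

def rotate(v, angle):
    """Apply a 2D rotation by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.3
q, k = (0.8, -0.5), (0.2, 1.1)

def rope_score(m, n):
    return dot(rotate(q, m * theta), rotate(k, n * theta))

def pos_vec(p):
    # Hypothetical absolute position vector, added to the content vector.
    return (math.sin(p * theta), math.cos(p * theta))

def additive_score(m, n):
    qm = (q[0] + pos_vec(m)[0], q[1] + pos_vec(m)[1])
    kn = (k[0] + pos_vec(n)[0], k[1] + pos_vec(n)[1])
    return dot(qm, kn)

shift = 1000
rope_gap = abs(rope_score(3, 4) - rope_score(3 + shift, 4 + shift))
additive_gap = abs(additive_score(3, 4) - additive_score(3 + shift, 4 + shift))
print(rope_gap, additive_gap)
```

The rotary score is unchanged up to rounding, while the additive score moves because its cross terms depend on each position separately.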
Why we want this for language. Three reasons stack:
- Meaning is compositional and local. The syntactic relation between cat and sat in “The cat sat on the mat” is identical whether the sentence opens a chapter or sits a million tokens deep. Binding semantic relations to absolute coordinates would force the model to relearn the same grammar at every position, throwing away cross-position induction.
- Weight sharing buys data efficiency. This is the same argument that justifies convolution in vision: a translation-invariant attention layer learns “subject–verb agreement at distance 1” as one pattern that applies everywhere, instead of a separate pattern per position that each needs its own statistical support.
- Length extrapolation. If every relation is parameterized by relative offset, training on \(L = 2048\) tokens teaches the model all offsets \(1, 2, \dots, 2047\). At inference, extending context to \(32{,}768\) reuses those same offsets on many more pairs of tokens; only the offsets beyond \(2047\) are genuinely new, and those are what length-extension methods have to handle. This is the root reason RoPE plus NTK/YaRN can stretch to 128K while pure absolute encodings crumble past their training length.
Caveat. Language is not perfectly translation-invariant: document openings introduce premises, endings summarize, “once upon a time” sets a different stage than what follows. But those are content-level signals carried by tokens (a `[CLS]` token, the literal phrase “In summary”), not coordinate-level signals. Letting the position encoding stay purely relative, and pushing document structure into the token stream where it belongs, is a cleaner division of labor.
To summarize, what we want from a position encoding is: (i) translation invariance, meaning the dot product \(\langle q_m, k_n \rangle\) depends on \(m, n\) only through \(n - m\); (ii) bounded long-distance behavior, meaning the magnitude of that dependence does not blow up with \(|n - m|\); (iii) cheap to compute and friendly with KV-caching at inference. RoPE achieves all three by rotating \(q\) and \(k\) in fixed 2D subspaces, with frequencies that drop geometrically across feature pairs.
RoPE: Rotation in 2D Subspaces
The Two-Dimensional Derivation
Start with \(d = 2\). We seek functions \(f_q(x, m)\) and \(f_k(x, n)\) such that
\[\langle f_q(q, m),\, f_k(k, n) \rangle \;=\; g(q, k, n - m)\]for some function \(g\) depending on positions only through the offset.
Unpacking that condition. The right-hand side \(g(q, k, n - m)\) has only three arguments — \(q\), \(k\), and the single position quantity \(n - m\). The individual positions \(m, n\) never appear separately; they enter the answer only as a difference. If we found such an \(f\) with \(g(q, k, 2) = 0.73\), then \(\langle f(q, 5), f(k, 7) \rangle\) and \(\langle f(q, 1003), f(k, 1005) \rangle\) would both equal \(0.73\) as well. Contrast absolute additive encoding: \(\langle q + p_m,\, k + p_n \rangle = q^\top k + q^\top p_n + p_m^\top k + p_m^\top p_n\) — three of the four terms involve \(m\) or \(n\) on its own, not just through the difference. The whole point of the RoPE ansatz that follows is to find an \(f\) that makes those absolute-position terms cancel.
Identifying \(\mathbb{R}^2\) with \(\mathbb{C}\) — write \(q = q_1 + i q_2\) — try the ansatz
\[f_q(q, m) = q \cdot e^{i m \theta}, \qquad f_k(k, n) = k \cdot e^{i n \theta}.\]The (Hermitian) inner product satisfies
\[\langle f_q(q, m), f_k(k, n) \rangle \;=\; \mathrm{Re}\!\left( q \bar{k} \, e^{i(m - n)\theta} \right),\]which depends on positions only through \(m - n\). Translated back to \(\mathbb{R}^2\), multiplication by \(e^{i m \theta}\) is multiplication by the 2D rotation matrix
\[R_m \;=\; \begin{bmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \phantom{-}\cos m\theta \end{bmatrix}, \qquad f_q(q, m) = R_m q,\; f_k(k, n) = R_n k.\]The relative-position property is now a one-line linear-algebra fact:
\[(R_m q)^\top (R_n k) \;=\; q^\top R_m^\top R_n k \;=\; q^\top R_{n - m} k,\]since rotation matrices form a one-parameter group: \(R_m^\top = R_{-m}\) and \(R_{-m} R_n = R_{n - m}\). The complex-number view makes this transparent — phases simply subtract.
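The three views in this derivation (complex multiplication, the closed form with subtracted phases, and the rotation matrix) can be compared directly. A minimal sketch using only the standard library; the sample values of `theta`, `q`, and `k` are arbitrary.

```python
import cmath
import math

theta = 0.25
q = complex(0.8, -0.5)     # q_1 + i q_2
k = complex(0.2, 1.1)      # k_1 + i k_2

def score_complex(m, n):
    """Re( f_q(q, m) * conj(f_k(k, n)) ) with f(x, p) = x * exp(i p theta)."""
    fq = q * cmath.exp(1j * m * theta)
    fk = k * cmath.exp(1j * n * theta)
    return (fq * fk.conjugate()).real

def score_closed_form(m, n):
    """Re( q * conj(k) * exp(i (m - n) theta) ): only the offset appears."""
    return (q * k.conjugate() * cmath.exp(1j * (m - n) * theta)).real

def score_matrix(m, n):
    """(R_m q)^T (R_n k) with explicit 2x2 rotations."""
    def rot(z, angle):
        c, s = math.cos(angle), math.sin(angle)
        return (c * z.real - s * z.imag, s * z.real + c * z.imag)
    a, b = rot(q, m * theta), rot(k, n * theta)
    return a[0] * b[0] + a[1] * b[1]

print(score_complex(5, 7), score_closed_form(5, 7), score_matrix(5, 7))
print(score_matrix(1003, 1005))   # same offset, far later in the sequence
```

All four numbers agree up to rounding, which is the group property \(R_m^\top R_n = R_{n-m}\) seen from three angles.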
Extending to d Dimensions
Split the \(d\)-dimensional vector into \(d/2\) adjacent pairs \((x_{2i}, x_{2i+1})\) and rotate each pair by its own angle \(m \theta_i\):
\[R_m^{(d)} \;=\; \mathrm{blkdiag}\!\Bigl(R_m^{(\theta_0)},\, R_m^{(\theta_1)},\, \dots,\, R_m^{(\theta_{d/2-1})}\Bigr), \qquad \theta_i \;=\; b^{-2i/d}.\]Following Vaswani’s sinusoidal encoding, the original RoFormer paper takes \(b = 10000\). Then \(f_q(q, m) = R_m^{(d)} q\), \(f_k(k, n) = R_n^{(d)} k\), and the same group property gives
\[\langle f_q(q, m), f_k(k, n) \rangle \;=\; q^\top R_{n - m}^{(d)} k\]— relative position, exactly.
Why geometric frequencies? The choice \(\theta_i = b^{-2i/d}\), identical to Vaswani’s sinusoidal frequencies, is not an accident. Geometrically spaced frequencies cover the spectrum uniformly on a log scale: each successive pair rotates slower than the previous one by the constant factor \(b^{-2/d}\). The fastest pair (\(\theta_0 = 1\)) completes a full rotation every \(2\pi \approx 6.3\) tokens; the slowest (\(\theta_{d/2 - 1} = b^{-(d-2)/d}\)) takes \(2\pi\, b^{(d-2)/d}\) tokens, just under \(2\pi b\). This logarithmic coverage is what gives the encoding both fine-grained adjacency resolution (high frequencies) and coarse long-range positional context (low frequencies) within a single fixed budget of \(d/2\) pairs.
| Pair index \(i\) | \(\theta_i\) at \(b = 10000, d = 128\) | Wavelength \(2\pi/\theta_i\) (tokens) |
|---|---|---|
| 0 | \(1\) | \(6.3\) |
| 16 | \(0.1\) | \(\approx 63\) |
| 32 | \(0.01\) | \(\approx 628\) |
| 48 | \(0.001\) | \(\approx 6{,}283\) |
| 63 | \(0.000115\) | \(\approx 54{,}410\) |
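
The per-pair frequencies and wavelengths follow directly from \(\theta_i = b^{-2i/d}\). A short sketch for recomputing them for any base or head dimension (the helper name `rope_frequencies` is illustrative):

```python
import math

def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2i/d) for pair indices i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

thetas = rope_frequencies(128)
wavelengths = [2 * math.pi / th for th in thetas]

for i in (0, 16, 32, 48, 63):
    print(f"pair {i:2d}: theta = {thetas[i]:.6g}, wavelength = {wavelengths[i]:,.1f} tokens")
```

Changing `d` or `base` in the call is enough to see how the spectrum shifts for other model configurations.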
From Math to PyTorch
The block-diagonal matrix is never materialized. Instead, precompute two tensors of shape \((T, d)\):
\[\cos[m,\,2i] = \cos[m,\,2i+1] = \cos(m \theta_i), \qquad \sin[m,\,2i] = \sin[m,\,2i+1] = \sin(m \theta_i),\]
and apply RoPE elementwise. That indexing matches the adjacent-pair layout of the derivation. Most modern implementations (Llama, Qwen, GPT-NeoX, HuggingFace) adopt the “split halves” layout where the first \(d/2\) channels store the “real” parts and the second \(d/2\) store the “imaginary” parts. Under that convention, pair \(i\) is \((x_i,\, x_{i + d/2})\), the tables are indexed to match (\(\cos[m,\,i] = \cos[m,\,i + d/2] = \cos(m \theta_i)\), and likewise for \(\sin\)), and the canonical PyTorch form is:
```python
import torch

def rotate_half(x):
    # Split the last dim into two halves and map (a, b) -> (-b, a).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: (B, H, T, D); cos, sin: (1, 1, T, D), broadcast over batch and heads
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```
Two layout conventions, same math. The original RoFormer paper uses adjacent pairs, \((x_0, x_1), (x_2, x_3), \dots\), and a different `rotate_half` formula. The split-halves layout above is equivalent under a permutation: both produce identical inner products \(\langle R_m q, R_n k \rangle\) for the same \(\theta_i\). Always check which convention a checkpoint uses before mixing weights across codebases, because silent permutation mismatches show up as garbled outputs, not crashes.
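The equivalence of the two layouts can be verified on a small vector. This is a pure-Python sketch with hypothetical helper names (`rope_adjacent`, `rope_split`, `to_split`); it assumes \(d = 8\) and the standard base for illustration.

```python
import math

def rope_adjacent(x, pos, thetas):
    """Rotate pairs (x[2i], x[2i+1]) by pos * theta_i."""
    out = list(x)
    for i, th in enumerate(thetas):
        c, s = math.cos(pos * th), math.sin(pos * th)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out

def rope_split(x, pos, thetas):
    """Rotate pairs (x[i], x[i + d/2]) by pos * theta_i (split-halves layout)."""
    half = len(x) // 2
    out = list(x)
    for i, th in enumerate(thetas):
        c, s = math.cos(pos * th), math.sin(pos * th)
        a, b = x[i], x[i + half]
        out[i], out[i + half] = a * c - b * s, a * s + b * c
    return out

def to_split(x):
    """Permute an adjacent-pair vector into split-halves order."""
    return x[0::2] + x[1::2]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

d = 8
thetas = [10000.0 ** (-2.0 * i / d) for i in range(d // 2)]
q_adj = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k_adj = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.1, 0.9]

score_adj = dot(rope_adjacent(q_adj, 5, thetas), rope_adjacent(k_adj, 9, thetas))
score_split = dot(rope_split(to_split(q_adj), 5, thetas),
                  rope_split(to_split(k_adj), 9, thetas))
print(score_adj, score_split)
```

The scores match only because the same permutation is applied to both \(q\) and \(k\) before the split-halves rotation; feeding adjacent-layout weights to a split-halves kernel skips that step.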
Four practical remarks:
- Q, K only, never V. Relative position should bias who attends to whom, not the content that gets aggregated. Rotating \(V\) would distort the values themselves.
- Shared cos/sin tables. The same table is shared across heads and layers; only \(\theta_i\) depends on the model. Memory is \(O(T \cdot d)\), computed once at module init.
- KV-cache trivial. Because the rotation is applied position-by-position, cached keys carry their rotation from when they were first written; new queries rotate at their new position; no recomputation is needed when the sequence grows. Compare with absolute encodings, which would require re-running layer 0 if positions shift.
- Linear-attention compatible. Rotation is unitary, so it preserves inner-product structure. RoPE composes cleanly with kernelized / linear-attention variants (Performer, RetNet, linear-attention Mamba branches): the rotation moves through the kernel approximation as a unitary transformation, leaving the kernel identity intact.
Why It Works: Long-Distance Decay
The Decay Bound via Abel Summation
The relative-position property alone tells us nothing about magnitudes. In principle, the rotation could leave \(\langle q_m, k_n \rangle\) oscillating at full amplitude no matter how far apart \(m\) and \(n\) are. The actual RoPE design has a stronger property: with \(\theta_i = 10000^{-2i/d}\), the inner product decays (on average) as \(|n - m|\) grows. This is the closest thing RoPE has to ALiBi-style recency bias, but it emerges from the math rather than from a hand-tuned slope.
Group the rotated dot product per pair. Writing
\[h_i \;=\; q_{[2i]} k_{[2i]} + q_{[2i+1]} k_{[2i+1]} + i\,(q_{[2i+1]} k_{[2i]} - q_{[2i]} k_{[2i+1]})\]as a complex number capturing the \(i\)-th pair’s contribution, the relative dot product becomes
\[\langle R_m q, R_n k \rangle \;=\; \mathrm{Re}\sum_{i=0}^{d/2 - 1} h_i\, e^{i(m - n)\theta_i}.\]Define the partial phase sum \(S_j = \sum_{i=0}^{j-1} e^{i(m - n)\theta_i}\) with \(S_0 = 0\). Abel summation (summation by parts) gives
\[\sum_{i=0}^{d/2-1} h_i\, e^{i(m-n)\theta_i} \;=\; h_{d/2-1}\, S_{d/2} \;-\; \sum_{i=0}^{d/2-2} S_{i+1}\,(h_{i+1} - h_i),\]so that
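\[\bigl|\langle R_m q, R_n k \rangle\bigr| \;\le\; \Bigl(\max_i |h_{i+1} - h_i|\Bigr)\, \sum_{i=1}^{d/2} |S_i|,\]
where the boundary term \(h_{d/2-1} S_{d/2}\) is absorbed by adopting the convention \(h_{d/2} = 0\) (as in the RoFormer derivation), so that it reads \(-S_{d/2}\,(h_{d/2} - h_{d/2-1})\) and the maximum runs over \(i = 0, \dots, d/2 - 1\). The first factor is content-dependent; the second is purely geometric. The RoFormer paper plots \(\frac{1}{d/2}\sum_i |S_i|\) against \(|n - m|\) for \(\theta_i = 10000^{-2i/d}\) and shows it falls off: fast at first, then more slowly, with the residual oscillation expected from a finite sum of incommensurate frequencies.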
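The geometric factor is easy to evaluate numerically. Below is a minimal sketch of the quantity \(\frac{1}{d/2}\sum_{j=1}^{d/2} |S_j|\) as a function of the relative distance; the function name and the sampled distances are illustrative choices, not from the paper.

```python
import cmath

def mean_partial_sum_magnitude(rel, d=128, base=10000.0):
    """(2/d) * sum_{j=1}^{d/2} |S_j|, where S_j = sum_{i<j} exp(i * rel * theta_i)."""
    half = d // 2
    partial = 0j
    total = 0.0
    for i in range(half):
        partial += cmath.exp(1j * rel * base ** (-2.0 * i / d))
        total += abs(partial)          # partial now equals S_{i+1}
    return total / half

for rel in (0, 1, 10, 50, 200, 1000):
    print(rel, round(mean_partial_sum_magnitude(rel), 3))
```

At distance zero every phase is 1, so \(S_j = j\) and the mean is \((d/2 + 1)/2\); any nonzero distance gives a smaller value, though the curve is not guaranteed to be monotone.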
Geometric Frequencies and Phase Decorrelation
The bound has a clean physical interpretation. The high-frequency pairs (small \(i\), \(\theta_i \approx 1\)) rotate by a different amount per step than the low-frequency pairs (large \(i\), \(\theta_i \to 0\)). When you sum the contributions, the phases at different frequencies generically don’t reinforce — they decorrelate. Only when \(m = n\) are all phases zero and the sum is exactly \(\sum_i h_i\) (constructive interference). Move \(n\) away from \(m\) and the phases drift apart at different rates, producing destructive interference.
The geometric spacing of \(\theta_i\) is what spreads the rotation rates evenly on log-scale, so that no two pairs stay in lockstep over typical distances. Compare with three alternatives:
- All identical frequencies (\(\theta_i = \theta\) for all \(i\)). All pairs rotate in phase. The sum has the same magnitude as a single pair — no decay; behaves like a single complex exponential.
- Linear spacing (\(\theta_i = (1 + i)\theta_0\)). Pairs decorrelate, but on linear timescales. Decay happens but the spectrum lacks log-scale coverage — fewer slow frequencies, so worse long-range behavior.
- Geometric spacing (\(\theta_i = b^{-2i/d}\)). Pairs decorrelate evenly on log-scale. This is the RoPE / sinusoidal choice. Decay is graceful and the spectrum covers many orders of magnitude with only \(d/2\) basis functions.
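A quick way to see the difference between the first and last options is to measure how coherent the phases \(e^{i r \theta_i}\) remain at relative distance \(r\). This sketch assumes \(d = 128\) and \(b = 10000\); `phase_coherence` is a hypothetical helper.

```python
import cmath

def phase_coherence(thetas, rel):
    """|mean_i exp(i * rel * theta_i)|: 1.0 means every pair is still in phase."""
    return abs(sum(cmath.exp(1j * rel * th) for th in thetas)) / len(thetas)

d = 128
identical = [1.0] * (d // 2)
geometric = [10000.0 ** (-2.0 * i / d) for i in range(d // 2)]

for rel in (0, 1, 10, 100, 1000):
    print(rel, round(phase_coherence(identical, rel), 3),
          round(phase_coherence(geometric, rel), 3))
```

With identical frequencies the coherence stays at 1 for every distance, so there is nothing to decay; with geometric spacing it drops below 1 as soon as the pairs drift apart.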
Practical Implications: The Role of the Base b
Picking the base \(b\) trades off near-field resolution against far-field coverage. The highest frequency \(\theta_0 = 1\) rotates by a full radian per step regardless of \(b\); this is what gives RoPE its fine-grained ability to distinguish adjacent tokens. The lowest frequency \(\theta_{d/2-1} = b^{-(d-2)/d}\) scales with \(b\): at \(b = 10000, d = 128\), \(\theta_{d/2-1} \approx 1.15 \times 10^{-4}\), giving a wavelength of \(\sim 54{,}000\) tokens.
| Base \(b\) | Lowest \(\theta\) (at \(d = 128\)) | Longest wavelength | Used by |
|---|---|---|---|
| 10000 | \(1.15 \times 10^{-4}\) | \(\sim 54\)K tokens | RoFormer, Llama 2, Mistral 7B |
| 100000 | \(1.2 \times 10^{-5}\) | \(\sim 525\)K tokens | (intermediate value, shown for scale) |
| 500000 | \(2.5 \times 10^{-6}\) | \(\sim 2.6\)M tokens | Llama 3 / 3.1 |
| 1000000 | \(1.2 \times 10^{-6}\) | \(\sim 5.1\)M tokens | Code Llama, Qwen2.5 (some variants) |
The flip side: very-slow frequencies don’t help at short range and consume head channels — pushing \(b\) too high wastes capacity on positions the model rarely sees. Llama 3.1’s choice of \(b = 500{,}000\) is calibrated against its announced 128K context, leaving comfortable headroom while still providing useful gradient at short distances. The same lever is the cleanest entry point for the length-extension methods covered in the next section.
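To get a feel for how much headroom a base leaves, the longest wavelength can be compared against a target context length. A small sketch, assuming \(d = 128\) and treating 128K as \(131{,}072\) tokens; `longest_wavelength` is an illustrative helper.

```python
import math

def longest_wavelength(base, d=128):
    """Wavelength 2*pi / theta of the slowest pair, theta = base^(-(d-2)/d)."""
    lowest_theta = base ** (-(d - 2) / d)
    return 2 * math.pi / lowest_theta

target_context = 131072   # 128K tokens
for base in (1e4, 1e5, 5e5, 1e6):
    wl = longest_wavelength(base)
    print(f"b = {int(base):>7}: longest wavelength ~ {wl:,.0f} tokens, "
          f"{wl / target_context:.1f}x the target context")
```

A ratio well above 1 means the slowest pair never completes a cycle inside the context window; a ratio below 1 means positions near the end start to alias on that pair.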
Beyond Training Length: PI, NTK, YaRN
Why Naive Extrapolation Fails
A model trained on \(L = 2048\) tokens almost always degrades catastrophically when fed sequences much longer than \(L\). The reason is sharper than “the model hasn’t seen those positions”. Pick any frequency pair \(i\). During training, the rotation angles that pair sees are
\[\{\, m \theta_i \;:\; m \in [0, L) \,\}\]— a set covering \(L \theta_i\) radians. For high-frequency pairs (\(\theta_i\) near 1), \(L\theta_i \gg 2\pi\) — the pair wraps the unit circle many times, sees every phase, and any extrapolation to \(m > L\) produces phases the pair has already practiced. For low-frequency pairs (\(\theta_i\) small), \(L\theta_i\) may be less than \(2\pi\) — the pair has not even completed a full cycle. Asking it to handle \(m = 2L\) pushes its phase into truly novel territory.
In one line. Length extrapolation fails because the slow frequencies never wrapped around during training; the fast ones did. So fixes must target the slow end of the spectrum without disturbing the fast end.
What “wrapped around” buys you. RoPE’s output \((\cos m\theta_i, \sin m\theta_i)\) lives on the unit circle, periodic with period \(2\pi/\theta_i\). Wrapped around means \(L\theta_i > 2\pi\) — the angle completed at least one full revolution during training, so the model has been shown the entire \((\cos, \sin)\) output range for that pair. The downstream weights treat \((\cos, \sin)\) as ordinary features — they are not themselves periodic; they only “know” what they were trained on. Once a frequency’s outputs cover the full circle during training, any future angle \(m\theta_i \bmod 2\pi\) lands on a point the model has already mapped. Concretely at \(b = 10000,\, d = 128,\, L = 2048\): pair 0 (\(\theta_0 = 1\)) sees \(2048\) radians of training angle — about 326 full cycles — every \((\cos, \sin)\) value many times over. Pair 63 (\(\theta_{63} \approx 1.15 \times 10^{-4}\)) sees only \(0.24\) radians — about 4% of one cycle — and the \(\cos\) output never leaves \([0.97, 1]\). Extrapolate to \(m = 16384\) and the slow pair’s \(\cos\) drops to \(-0.31\), feeding the downstream weights a value they have never been trained on. That is why the fast end needs no surgery and the slow end does.
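The wrapped-versus-unwrapped split can be tabulated per pair. The sketch below uses the same setting as the worked numbers (\(b = 10000\), \(d = 128\), \(L = 2048\)); the helper name is illustrative.

```python
import math

def training_phase_coverage(train_len, d=128, base=10000.0):
    """Radians swept by each pair over the training window, about train_len * theta_i."""
    return [train_len * base ** (-2.0 * i / d) for i in range(d // 2)]

coverage = training_phase_coverage(2048)
wrapped = [i for i, c in enumerate(coverage) if c > 2 * math.pi]

print("pairs that completed at least one full cycle:", len(wrapped))
print("pair 0 cycles:", coverage[0] / (2 * math.pi))
print("pair 63 radians:", coverage[63])

# The slowest pair at an unseen position: its cosine leaves the training range.
theta_63 = 10000.0 ** (-2.0 * 63 / 128)
print("cos at m = 2047: ", math.cos(2047 * theta_63))
print("cos at m = 16384:", math.cos(16384 * theta_63))
```

The pairs outside `wrapped` are the ones whose \((\cos, \sin)\) outputs covered only part of the circle during training, and they are the natural targets for rescaling.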
This single observation organizes the entire length-extension literature. Position Interpolation rescales every frequency uniformly (touches the fast end too — hence needs fine-tuning to recover). NTK-aware leaves the fast end alone and stretches only the slow end. YaRN refines NTK with per-frequency control. All three are post-hoc surgeries on \(\theta_i\).
The Three Major Methods
| Method | Where it modifies \(\theta_i\) | Formula | Need fine-tuning? |
|---|---|---|---|
| Position Interpolation (PI) | All frequencies, uniformly | \(\theta_i' = \theta_i \cdot L / L'\) | Yes (~1k steps) |
| NTK-aware | Base only | \(b' = b \cdot s^{d/(d-2)},\; s = L'/L\) | Often zero-shot |
| YaRN | Per-frequency ramp + temperature | NTK-by-parts + \(\sqrt{1/t} = 0.1 \ln s + 1\) | Short fine-tune |
Position Interpolation (Chen et al., 2023) takes the bluntest route: pretend position \(m \in [0, L')\) is actually position \(m \cdot L / L' \in [0, L)\). Equivalently, multiply every \(\theta_i\) by the scaling factor \(L/L'\). This keeps RoPE values inside their training range, but it compresses all frequencies equally — the high-frequency pairs that used to distinguish adjacent tokens now barely move per step, which hurts local detail. PI works, but it requires a thousand-or-so fine-tuning steps to recover.
NTK-aware scaling (originally a LocalLLaMA post, then picked up by everyone) attacks the same problem from the opposite direction: leave the high frequencies alone — they generalize fine — and stretch only the low frequencies. The cleanest implementation just changes the base:
\[b' \;=\; b \cdot s^{d/(d-2)}, \qquad s = L'/L.\]The exponent \(d/(d-2)\) is chosen so that the lowest frequency \(\theta_{d/2-1} = b'^{-(d-2)/d}\) becomes exactly \(\theta_{d/2-1} \cdot 1/s\) (i.e., gets PI’d by factor \(s\)), while the highest frequency \(\theta_0 = b'^0 = 1\) is unchanged. Intermediate frequencies smoothly interpolate. The big practical win: NTK often extends context without any fine-tuning.
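PI and NTK-aware scaling are both one-line transformations of the frequency list, which makes them easy to compare. A minimal sketch with illustrative helper names, assuming \(d = 128\) and a scale factor \(s = 8\):

```python
def rope_thetas(d, base=10000.0):
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def pi_thetas(d, s, base=10000.0):
    """Position Interpolation: every frequency is divided by s = L'/L."""
    return [th / s for th in rope_thetas(d, base)]

def ntk_thetas(d, s, base=10000.0):
    """NTK-aware: keep the formula, enlarge the base to b * s^(d/(d-2))."""
    return rope_thetas(d, base * s ** (d / (d - 2)))

d, s = 128, 8.0
orig, pi, ntk = rope_thetas(d), pi_thetas(d, s), ntk_thetas(d, s)

for i in (0, 32, 63):
    print(i, orig[i], pi[i], ntk[i])
```

At pair 0 the NTK frequency equals the original, at the last pair it equals the PI value, and in between it sits strictly between the two.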
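YaRN (Peng et al., 2023) refines NTK with two ideas. First, NTK-by-parts: instead of the smooth NTK base change, define a piecewise ramp \(\gamma\) over the context-to-wavelength ratio \(r_i = L / \lambda_i\) with \(\lambda_i = 2\pi/\theta_i\) (the number of full rotations pair \(i\) completes within the original context \(L\)), and interpolate between PI and no-op,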
\[h(\theta_i) \;=\; (1 - \gamma(r_i)) \cdot \frac{\theta_i}{s} \;+\; \gamma(r_i) \cdot \theta_i, \qquad \gamma(r) = \begin{cases} 0 & r < \alpha \\ \frac{r - \alpha}{\beta - \alpha} & \alpha \le r \le \beta \\ 1 & r > \beta \end{cases}\]with \(\alpha = 1, \beta = 32\) recommended for LLaMA. Frequencies whose wavelength is much shorter than the original context (\(r > \beta\)) are untouched; very-long-wavelength frequencies (\(r < \alpha\)) get full PI scaling. Second, attention temperature: scale both \(q\) and \(k\) by \(\sqrt{1/t}\) with
\[\sqrt{1/t} \;=\; 0.1 \ln(s) + 1,\]fitted empirically across LLaMA 7B/13B/33B/65B. The temperature compensates for the average entropy increase of attention logits as the context grows, restoring the perplexity curve at long range.
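Both YaRN ingredients fit in a few lines. This is a sketch rather than a reference implementation: it assumes \(r_i = L\theta_i / 2\pi\) (rotations completed within the original context, which is the reading under which \(r > \beta\) means a short wavelength), and the function names and the \(L = 4096\), \(s = 8\) example are illustrative.

```python
import math

def yarn_thetas(d, s, orig_len, base=10000.0, alpha=1.0, beta=32.0):
    """NTK-by-parts: blend theta/s (PI) and theta (untouched) with the ramp gamma(r)."""
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        r = orig_len * theta / (2 * math.pi)   # rotations within the original context
        if r < alpha:
            gamma = 0.0                         # long wavelength: full interpolation
        elif r > beta:
            gamma = 1.0                         # short wavelength: leave alone
        else:
            gamma = (r - alpha) / (beta - alpha)
        out.append((1 - gamma) * theta / s + gamma * theta)
    return out

def yarn_qk_scale(s):
    """sqrt(1/t) = 0.1 * ln(s) + 1, the factor applied to both q and k."""
    return 0.1 * math.log(s) + 1.0

scaled = yarn_thetas(d=128, s=8.0, orig_len=4096)
print(scaled[0], scaled[63], yarn_qk_scale(8.0))
```

With these settings the fastest pair is untouched and the slowest is divided by \(s\), with a linear blend for the pairs whose rotation count falls between \(\alpha\) and \(\beta\).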
Other Variants and Architectural Choices
The PI/NTK/YaRN trio is the most-cited core, but the surrounding literature has accumulated a number of useful refinements and alternatives:
| Variant | Idea | Used by |
|---|---|---|
| Dynamic NTK | Apply NTK scaling only when actual context exceeds \(L\); degrade gracefully | HuggingFace transformers (`rope_scaling` type `dynamic`); Mistral long-context |
| LongRoPE (Ding et al., 2024) | Per-frequency scaling factors found by evolutionary search; needs no smooth functional form | Phi-3 long-context, internal extensions of Llama |
| ABF (adjusted base frequency) | Change the base \(b\) at fine-tune time without other scaling — equivalent to NTK with the right exponent | Code Llama (b = 1e6), Mistral long-context fine-tunes |
| Llama 3.1: trained big-base + extra scaling | \(b = 500{,}000\) from scratch, then YaRN-style “llama3” scaling for 128K context | Llama 3.1 / 3.3 |
| Llama 4: iRoPE (interleaved) | Alternate RoPE layers with NoPE (no positional encoding) layers — NoPE layers act as global; RoPE layers as local | Llama 4 |
Dynamic NTK is the most user-visible: it leaves the model alone for sequences within \(L\), then progressively applies NTK only when needed. This avoids the “you paid an accuracy tax even on short prompts” failure mode of static rescaling.
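The idea reduces to making the scale factor a function of the current sequence length. The sketch below is a simplified illustration, not the exact formula any particular library uses: it assumes \(s = \max(1, T/L)\) plugged into the NTK base change, and `dynamic_ntk_base` is a hypothetical name.

```python
def dynamic_ntk_base(seq_len, train_len, d, base=10000.0):
    """Return the RoPE base to use for the current sequence length (simplified)."""
    s = max(1.0, seq_len / train_len)       # no scaling while inside the trained window
    return base * s ** (d / (d - 2))

for seq_len in (512, 2048, 4096, 16384):
    print(seq_len, dynamic_ntk_base(seq_len, train_len=2048, d=128))
```

Short prompts see the original base unchanged, and the base grows smoothly once the sequence exceeds the training length.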
LongRoPE abandons the closed-form ramp entirely and treats per-frequency scaling factors as an optimization problem solved by evolutionary search against a perplexity target. The resulting curves often look messy compared to YaRN’s clean ramp, but they outperform it on the long tail of context-length benchmarks, suggesting that the right scaling shape is not actually smooth.
Llama 4’s iRoPE is a structural rather than algebraic move: rather than tinker with \(\theta_i\), alternate the layers — some get RoPE, some get NoPE. The NoPE layers (no position encoding at all) become naturally translation-invariant and unbounded, capturing very-long-range dependencies; the RoPE layers handle local structure. This is closer in spirit to ALiBi than to NTK: the long-context behavior is built into the architecture, not retrofitted onto the embedding.
Bigger picture. Every method here ultimately makes the same trade: high frequencies for local detail, low frequencies for long-range coverage, with the slow-end frequencies stretched somehow when context grows past training length. PI, NTK, YaRN, LongRoPE differ on how to stretch; bigger-base and iRoPE differ on when to commit. Llama 3.1’s choice to train at large base from scratch is a bet that the right place to spend the design budget is before training, not after.
M-RoPE: Position in Three Axes
Why 1D RoPE Falls Short for Vision and Video
Multimodal sequences carry position information that 1D RoPE cannot express. An image patch has a row and a column; a video frame adds time. Take a 224×224 image patched at 14×14 — that’s 256 patches arranged in a 16×16 grid. Flatten to 1D in row-major raster order:
| Pair of patches | 2D positions | 1D distance |
|---|---|---|
| Horizontal neighbors | \((0, 0)\) and \((0, 1)\) | 1 |
| Vertical neighbors | \((0, 0)\) and \((1, 0)\) | 16 |
| Diagonal neighbors | \((0, 0)\) and \((1, 1)\) | 17 |
A row-major flatten makes vertical neighbors look 16× farther apart than horizontal ones. The attention pattern that emerges has no way to know these distances mean the same physical thing — every spatial relation becomes a function of the flatten order rather than the underlying geometry.
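The distortion is just index arithmetic. A tiny sketch (helper names are illustrative) that reproduces the three distances for a \(16 \times 16\) grid:

```python
def flat_index(row, col, width=16):
    """Row-major raster position of patch (row, col)."""
    return row * width + col

def flat_distance(a, b, width=16):
    return abs(flat_index(a[0], a[1], width) - flat_index(b[0], b[1], width))

grid_side = 224 // 14
print(grid_side, grid_side * grid_side)          # grid side and patch count
print(flat_distance((0, 0), (0, 1)))             # horizontal neighbors
print(flat_distance((0, 0), (1, 0)))             # vertical neighbors
print(flat_distance((0, 0), (1, 1)))             # diagonal neighbors
```

The vertical distance equals the grid width, so it changes with image resolution even though the patches are physically adjacent in every case.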
Several VL designs pre-Qwen2-VL papered over this with different tricks:
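- Pixtral / NaViT: Pixtral uses a 2D RoPE, where each spatial axis gets its own rotation (essentially M-RoPE without the temporal axis), giving height/width neighbors equal proximity. NaViT reaches a similar effect with factorized per-axis position embeddings rather than rotations.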
- Idefics2 / Idefics3: learn 2D positional embeddings over the patch grid.
- CLIP-style towers feeding an LLM: rely on the visual encoder to bake 2D structure into patch embeddings before they reach the LLM, so the LLM’s 1D RoPE only has to handle the “sequence of patches” without caring about the grid.
Qwen2-VL’s M-RoPE is more ambitious: it extends the trick to three axes (temporal, height, width) and unifies image, video, and text under one position scheme that gracefully reduces to 1D RoPE when the input is pure text. The same checkpoint serves all three modalities without architecture surgery per modality.
The (t, h, w) Decomposition in Qwen2-VL
The dimension split. For Qwen2-VL’s standard head dimension \(d = 128\), the \(d/2 = 64\) frequency pairs are partitioned as
\[\texttt{mrope\_section} \;=\; [16,\,24,\,24] \quad \text{(temporal,\,height,\,width)}.\]
This is the actual `rope_scaling.mrope_section` field in the Qwen2-VL HuggingFace config, and it’s a key thing to read carefully: the split sums to 64, not 128, because it applies to the frequency-pair index, not the full feature dimension. Per head, temporal gets \(2 \times 16 = 32\) channels, height \(2 \times 24 = 48\), width \(2 \times 24 = 48\). The HF implementation repeats the list before slicing (`mrope_section * 2` is Python list repetition and yields `[16, 24, 24, 16, 24, 24]`), so each axis owns the same pair indices in both halves of the split-halves layout.
Concretely, for frequency-pair index \(i \in \{0, 1, \dots, 63\}\), the position used in the rotation \(R_{m_i}^{(\theta_i)}\) is
\[m_i \;=\; \begin{cases} t & i \in [0,\,16) \\ h & i \in [16,\,40) \\ w & i \in [40,\,64) \end{cases}\]so each pair “listens to” a single axis, but the model can mix axes freely across pairs in subsequent linear layers. Note the asymmetry: temporal gets fewer frequencies than height or width. The paper doesn’t justify the exact ratio, but the choice gives spatial axes more frequency resolution — appropriate for images where local spatial structure dominates.
Position IDs by modality. With three position coordinates instead of one, the assignment of \((t, h, w)\) varies:
| Modality | \(t\) | \(h\) | \(w\) |
|---|---|---|---|
| Text token at sequence position \(p\) | \(p\) | \(p\) | \(p\) |
| Image patch at row \(r\), col \(c\) in a single frame | constant \(K\) | \(r\) | \(c\) |
| Video patch at frame \(f\), row \(r\), col \(c\) | \(f\) | \(r\) | \(c\) |
The text-as-1D fallback is load-bearing. Setting \(t = h = w = p\) for text makes all three axes receive the same rotation; the per-pair allocation becomes invisible — every pair, regardless of which axis it “belongs” to, rotates by the same angle. This is precisely the design choice that lets a pretrained 1D-RoPE LLM be drop-in upgraded to M-RoPE without retraining the text-only behavior. Qwen2-VL initializes from Qwen2-7B and inherits its text capability through this clean reduction.
A word on “1D” here: don’t confuse it with the 2D rotation. Two different dimensions are in play. Feature dimension: every RoPE variant, including M-RoPE, rotates in 2D subspaces of the \(d\)-dimensional feature vector; that’s the operation’s geometry, never up for debate. Position dimension: how many separate position coordinates label each token. 1D RoPE attaches a single number \(m\) per token; M-RoPE attaches a triple \((t, h, w)\). “M-RoPE reduces to 1D RoPE for text” refers to the second meaning only: text uses \((p, p, p)\), so the three coordinates collapse to one effective value. The 2D rotation in each frequency subspace is unchanged, but every pair, no matter which axis `mrope_section` assigned it to, now rotates by the same angle \(p\theta_i\), making the partition functionally invisible. The partition only “lights up” when the three coordinates actually carry distinct values, as for image patches \((K, r, c)\) or video patches \((f, r, c)\).
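The reduction to 1D RoPE can be checked by computing the rotation angle of every frequency pair. A minimal sketch, assuming \(d = 128\), \(b = 10000\), and the \([16, 24, 24]\) split; the helper names are illustrative.

```python
MROPE_SECTION = [16, 24, 24]       # frequency pairs assigned to (t, h, w)

def axis_of_pair(i, section=MROPE_SECTION):
    """Return 0, 1, or 2 (t, h, w) for frequency-pair index i."""
    bound = 0
    for axis, width in enumerate(section):
        bound += width
        if i < bound:
            return axis
    raise IndexError("pair index out of range")

def mrope_angles(pos, d=128, base=10000.0):
    """pos is a (t, h, w) triple; returns one rotation angle per frequency pair."""
    return [pos[axis_of_pair(i)] * base ** (-2.0 * i / d) for i in range(d // 2)]

def rope_angles(p, d=128, base=10000.0):
    return [p * base ** (-2.0 * i / d) for i in range(d // 2)]

text_angles = mrope_angles((7, 7, 7))      # text token at position 7
patch_angles = mrope_angles((0, 3, 5))     # image patch: t = 0, row 3, col 5

print(text_angles == rope_angles(7))
print(patch_angles[0], patch_angles[16], patch_angles[40])
```

For the text triple the partition disappears entirely; for the patch, pairs 0 to 15 see \(t\), pairs 16 to 39 see the row, and pairs 40 to 63 see the column.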
Cross-Modal Continuity, Ablation, and Implementation
Cross-modal continuity. When the input mixes modalities — say, a video clip followed by a text response — the position IDs must continue smoothly across the boundary. The Qwen2-VL paper specifies: “position numbering for each modality is initialized by incrementing the maximum position ID of the preceding modality by one.” So if a video chunk ends with \(\max(t, h, w) = K\) (achieved by the last patch of the last frame), the subsequent text token starts at \(t = h = w = K + 1\) and increments from there. The reduction of text to identical \((t, h, w)\) makes this seamless: text after image continues the count on all three axes simultaneously.
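The numbering rule can be sketched as a small builder over a list of segments. This is an illustration under assumptions, not the library's routine: the segment tuples and the name `build_position_ids` are hypothetical, and offsetting all three axes of a visual segment by the running start is one reading of the quoted rule.

```python
def build_position_ids(segments):
    """segments: ("text", n) | ("image", rows, cols) | ("video", frames, rows, cols).

    Returns one (t, h, w) triple per token.
    """
    ids, start = [], 0
    for kind, *shape in segments:
        if kind == "text":
            (n_tokens,) = shape
            block = [(start + p,) * 3 for p in range(n_tokens)]
        else:
            frames, rows, cols = (1, *shape) if kind == "image" else shape
            block = [(start + f, start + r, start + c)
                     for f in range(frames)
                     for r in range(rows)
                     for c in range(cols)]
        ids.extend(block)
        if block:
            # The next modality starts one past the largest ID used so far.
            start = max(max(triple) for triple in block) + 1
    return ids

ids = build_position_ids([("text", 3), ("image", 2, 2), ("text", 2)])
for triple in ids:
    print(triple)
```

In this example the image patches share \(t = 3\), their rows and columns span 3 to 4, and the trailing text resumes at 5 on all three axes.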
Ablation: does M-RoPE actually help? Qwen2-VL’s Table 8 holds everything else constant and swaps M-RoPE for 1D-RoPE. The pattern is mostly positive, with the largest gaps on spatial and video tasks and small regressions on two benchmarks:
| Benchmark | 1D-RoPE | M-RoPE | \(\Delta\) |
|---|---|---|---|
| MathVista (geometry-heavy) | 39.2 | 43.4 | +4.2 |
| STAR (video) | 55.5 | 57.9 | +2.4 |
| NextQA (video) | 43.9 | 46.0 | +2.1 |
| MMBench | 58.6 | 60.6 | +2.0 |
| PerceptionTest | 46.6 | 47.4 | +0.8 |
| TextVQA (OCR) | 71.3 | 71.8 | +0.5 |
| ChartQA (OCR) | 68.0 | 68.4 | +0.4 |
| DocVQA (OCR) | 82.5 | 82.8 | +0.3 |
| MMStar | 36.7 | 36.7 | 0.0 |
| RealWorldQA | 54.5 | 53.7 | −0.8 |
| InfoVQA | 50.8 | 50.3 | −0.5 |
Spatial reasoning (MathVista) and video-temporal benchmarks (NextQA, STAR) gain the most, which is exactly where 2D/3D structure matters. OCR-heavy benchmarks where rasterized 1D order is already informative (DocVQA, ChartQA, TextVQA) gain little; those tasks effectively re-discover the row/col structure from the visual encoder. The flat result on MMStar and the slight regressions on RealWorldQA and InfoVQA suggest the partition isn’t free: fewer frequencies per axis means slightly less expressive encoding for tasks that don’t benefit from 2D structure.
Implementation. The HF kernel is one line of slice-and-stitch, expressing the per-pair axis assignment as a tensor split:
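```python
import torch

def apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section):
    # q, k: (B, H, T, D); rotate_half is the same helper used for 1D RoPE
    # cos, sin: (3, B, H, T, D), the three position axes (t, h, w) stacked on dim 0
    # mrope_section: e.g. [16, 24, 24], counted in frequency pairs
    # List repetition, not arithmetic: [16, 24, 24] * 2 -> [16, 24, 24, 16, 24, 24],
    # one (t, h, w) run for each half of D in the split-halves layout.
    mrope_section = mrope_section * 2
    # Chunk i of the last dim is taken from axis i % 3: t, h, w, t, h, w.
    cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1)
    sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```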
Read it as: precompute three full \((B, H, T, D)\) cos/sin tables, one per axis, using \(t, h, w\) position IDs respectively, then weave them together so the temporal table contributes to channels \([0, 16)\) and \([D/2, D/2 + 16)\), height to \([16, 40)\) and \([D/2 + 16, D/2 + 40)\), and width to \([40, 64)\) and \([D/2 + 40, D)\) for \(D = 128\). That is \(16 + 16\), \(24 + 24\), and \(24 + 24\) channels respectively. The `* 2` is Python list repetition rather than arithmetic doubling, and together with the `i % 3` cycle it reflects that RoPE’s “first half / second half” layout interleaves with the three-axis partition.
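The slice-and-stitch pattern is easier to see without tensors. The following sketch mirrors the same list arithmetic in plain Python and reports which axis feeds each channel; `channel_axes` is an illustrative helper, not part of the HF code.

```python
def channel_axes(mrope_section=(16, 24, 24)):
    """Axis index (0 = t, 1 = h, 2 = w) that supplies cos/sin for each of the D channels."""
    sections = list(mrope_section) * 2        # list repetition: [16, 24, 24, 16, 24, 24]
    axes = []
    for i, width in enumerate(sections):
        axes.extend([i % 3] * width)          # chunk i is read from axis i % 3
    return axes

axes = channel_axes()
print(len(axes))
print([axes.count(a) for a in range(3)])      # channels owned by t, h, w
print(axes[0], axes[16], axes[40], axes[64], axes[80], axes[104])
```

Channel \(c\) and channel \(c + D/2\) always land on the same axis, which is what keeps each rotated pair \((x_i, x_{i + D/2})\) tied to a single coordinate.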
What changes in Qwen2.5-VL. The follow-up Qwen2.5-VL refines this further with a time-aligned M-RoPE: instead of \(t\) being a raw frame index, it is set proportional to the elapsed time in seconds (with a fixed FPS reference). This decouples temporal position from sampling rate — a 30-FPS clip and a 1-FPS down-sampled version of the same scene now encode the same elapsed-time signal. Empirically this helps long-video understanding where the FPS budget can’t capture every frame, and is the kind of refinement that becomes natural once the temporal axis is a named coordinate rather than a hidden one.
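A rough sketch of the time-aligned idea, with loud caveats: the parameter `ticks_per_second` and the rounding rule are assumptions made for illustration, not the values or code used by Qwen2.5-VL. The point is only that the temporal ID is derived from elapsed seconds instead of the frame index.

```python
def temporal_ids(num_frames, fps, ticks_per_second=2.0):
    """Temporal position per sampled frame, proportional to elapsed time (hypothetical scale)."""
    return [int(frame / fps * ticks_per_second) for frame in range(num_frames)]

dense = temporal_ids(num_frames=61, fps=30)    # two seconds sampled at 30 FPS
sparse = temporal_ids(num_frames=3, fps=1)     # the same two seconds sampled at 1 FPS

print(dense[30], sparse[1])    # both frames sit at t = 1 s
print(dense[60], sparse[2])    # both frames sit at t = 2 s
```

Frames captured at the same wall-clock instant receive the same temporal ID regardless of sampling rate, whereas a raw frame index would give 30 versus 1 for the first pair.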
Closing thought. M-RoPE is conservative by design: zero inference cost, exact 1D-RoPE behavior on text, long-distance decay preserved per axis. The price is the partition — spatial axes take ~75% of the head’s frequencies, leaving temporal with the rest. Whether \([16, 24, 24]\) is the right split is open — the paper doesn’t ablate it, and follow-up VL models tend to inherit the same numbers without revisiting them. The next interesting round of ablations probably lives there, or in whether even the three-axis decomposition itself is the right unit (audio? depth? higher temporal resolution?). The recipe — one position scheme to rule them all, with text as a clean special case — feels durable; the specific numbers, less so.