RoPE and M-RoPE: Rotation, Decay, and Multimodal Axes

This post derives Rotary Position Embedding (RoPE) from scratch, proves its long-distance decay property, surveys the techniques that extend it past training length (Position Interpolation, NTK-aware scaling, YaRN), and unpacks how Qwen2-VL generalizes it to three axes — temporal, height, width — for M-RoPE. The derivations follow Su et al. (2021); the extensions follow Chen et al. (2023), the original NTK-aware blog by bloc97, and Peng et al. (2023); the multimodal section follows the Qwen2-VL paper and its HuggingFace implementation.

The Problem: Encoding Position into Attention

Attention Is a Set Operation

Self-attention is a sum over a set, not a sequence. For queries $q_m$ at position $m$ and keys $k_n$ at position $n$, the softmax weights depend only on $\langle q_m, k_n \rangle$ — permuting the tokens permutes the rows of $Q$ and $K$ identically, leaving the output identical up to that permutation. Without an extra signal, attention cannot tell first from last.

This permutation equivariance is a feature, not a bug. It is what lets a Transformer compute all $T^2$ token-to-token interactions in parallel — the structural reason it eats data faster than an RNN can. But it also means every signal about where a token sits must enter the model from somewhere outside the attention dot product itself. Position encoding is that bridge. The design space splits cleanly by where the signal is injected: into the embedding, into the logit, or into the query/key vectors themselves.

Three Families of Position Encoding

Family	Where the signal lives	Translation-invariant?	Examples
Absolute additive	Added to token embeddings before layer 1	No	BERT learned, Sinusoidal (Vaswani 2017)
Relative bias	Added to logits inside softmax	Yes	T5 bias, ALiBi
Rotary	Multiplied into $q, k$ before dot product	Yes (by construction)	RoPE

Absolute additive. The original Transformer (Vaswani et al., 2017) uses sinusoidal encoding: at position $p$, channels $2i$ and $2i+1$ get

\[\mathrm{PE}(p, 2i) = \sin(p\, \omega_i), \qquad \mathrm{PE}(p, 2i+1) = \cos(p\, \omega_i), \qquad \omega_i = 10000^{-2i/d},\]

added to the input embedding before layer 1. Note that $\omega_i$ here is the very same frequency choice RoPE will later use — both are rooted in the idea that geometrically-spaced frequencies give good time-frequency coverage with $d/2$ basis functions. BERT-style models replace this with a learned table of position embeddings, simpler but capped at the training length.

Relative bias. T5 (Raffel et al., 2020) adds a learned scalar bias $b_{n-m}$ to the attention logit before softmax, with $n - m$ binned into log-spaced buckets. ALiBi (Press et al., 2022) takes the bias idea to its minimal extreme:

\[\mathrm{logit}(m, n) = q_m^\top k_n - \lambda_h \cdot |n - m|,\]

where $\lambda_h$ is a per-head fixed slope (no learnable position parameters at all). Zero parameters, linear decay built in, surprisingly strong length extrapolation — but the bias is independent of $q$ and $k$, so it cannot express position-dependent content interactions.

Rotary. RoPE (Su et al., 2021) — the subject of this post — does neither. It rotates $q$ and $k$ in 2D subspaces by an angle proportional to position, multiplying position into the vectors before the dot product. The dot product then automatically depends only on the relative offset, with no learnable position parameters and no extra bias term.

Why Translation Invariance Matters

A position encoding is translation-invariant if shifting the entire sequence by $k$ positions leaves every attention dot product unchanged:

\[\langle q_{m+k},\, k_{n+k} \rangle \;=\; \langle q_m,\, k_n \rangle \quad \forall\, m, n, k.\]

Absolute additive encodings fail this by construction — they inject position $p_m$ into the embedding, so moving $x_m$ to slot $m + k$ replaces $p_m$ with $p_{m+k}$ and the dot product changes. RoPE satisfies it exactly:

\[\langle R_{m+k} q,\, R_{n+k} k \rangle \;=\; q^\top R_{(n+k) - (m+k)} k \;=\; q^\top R_{n - m} k,\]

since only the difference enters the answer. The relationship between the tokens "I" and "eat" is the same whether they sit at positions $(3, 4)$ or $(1003, 1004)$; the common shift cancels out exactly.

Why we want this for language. Three reasons stack:

Meaning is compositional and local. The syntactic relation between cat and sat in “The cat sat on the mat” is identical whether the sentence opens a chapter or sits a million tokens deep. Binding semantic relations to absolute coordinates would force the model to relearn the same grammar at every position, throwing away cross-position induction.
Weight sharing buys data efficiency. This is the same argument that justifies convolution in vision: a translation-invariant attention layer learns “subject–verb agreement at distance 1” as one pattern that applies everywhere, instead of a separate pattern per position that each needs its own statistical support.
Length extrapolation. If every relation is parameterized by relative offset, training on $L = 2048$ tokens teaches the model all offsets $1, 2, \dots, 2047$. At inference, extending context to $32{,}768$ reuses those same offsets on many more pairs of tokens; only offsets beyond $2047$ are new, and those are what the scaling methods below have to handle. This is the root reason RoPE plus NTK/YaRN can stretch to 128K while pure absolute encodings crumble past their training length.

Caveat. Language is not perfectly translation-invariant — document openings introduce premises, endings summarize, “once upon a time” sets a different stage than what follows. But those are content-level signals carried by tokens (a [CLS] token, the literal phrase “In summary”), not coordinate-level signals. Letting the position encoding stay purely relative, and pushing document structure into the token stream where it belongs, is a cleaner division of labor.

Figure 1: Slide the sequence right by k. The highlighted pair (C, E) is two tokens apart at every k. Under an additive absolute encoding, the dot product wobbles with k (different absolute positions); under RoPE, it stays exactly constant — because only the difference n − m enters the answer.

To summarize, what we want from a position encoding is: (i) translation invariance: the dot product $\langle q_m, k_n \rangle$ depends on $m, n$ only through $n - m$; (ii) bounded long-distance behavior: the magnitude of that dependence does not blow up with $|n - m|$; (iii) cheap to compute and friendly with KV-caching at inference. RoPE achieves all three by rotating $q$ and $k$ in fixed 2D subspaces, with frequencies that drop geometrically across feature pairs.

RoPE: Rotation in 2D Subspaces

The Two-Dimensional Derivation

Start with $d = 2$. We seek functions $f_q(x, m)$ and $f_k(x, n)$ such that

\[\langle f_q(q, m),\, f_k(k, n) \rangle \;=\; g(q, k, n - m)\]

for some function $g$ depending on positions only through the offset.

Unpacking that condition. The right-hand side $g(q, k, n - m)$ has only three arguments — $q$, $k$, and the single position quantity $n - m$. The individual positions $m, n$ never appear separately; they enter the answer only as a difference. If we found such an $f$ with $g(q, k, 2) = 0.73$, then $\langle f(q, 5), f(k, 7) \rangle$ and $\langle f(q, 1003), f(k, 1005) \rangle$ would both equal $0.73$ as well. Contrast absolute additive encoding: $\langle q + p_m,\, k + p_n \rangle = q^\top k + q^\top p_n + p_m^\top k + p_m^\top p_n$ — three of the four terms involve $m$ or $n$ on its own, not just through the difference. The whole point of the RoPE ansatz that follows is to find an $f$ that makes those absolute-position terms cancel.

Identifying $\mathbb{R}^2$ with $\mathbb{C}$ — write $q = q_1 + i q_2$ — try the ansatz

\[f_q(q, m) = q \cdot e^{i m \theta}, \qquad f_k(k, n) = k \cdot e^{i n \theta}.\]

The (Hermitian) inner product satisfies

\[\langle f_q(q, m), f_k(k, n) \rangle \;=\; \mathrm{Re}\!\left( q \bar{k} \, e^{i(m - n)\theta} \right),\]

which depends on positions only through $m - n$. Translated back to $\mathbb{R}^2$, multiplication by $e^{i m \theta}$ is multiplication by the 2D rotation matrix

\[R_m \;=\; \begin{bmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \phantom{-}\cos m\theta \end{bmatrix}, \qquad f_q(q, m) = R_m q,\; f_k(k, n) = R_n k.\]

The relative-position property is now a one-line linear-algebra fact:

\[(R_m q)^\top (R_n k) \;=\; q^\top R_m^\top R_n k \;=\; q^\top R_{n - m} k,\]

since rotation matrices form a one-parameter group: $R_m^\top = R_{-m}$ and $R_{-m} R_n = R_{n - m}$. The complex-number view makes this transparent — phases simply subtract.
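The agreement between the complex-phase view and the rotation-matrix view is easy to check numerically. The snippet below is a minimal sketch: the helper names (`rotate2d`, `dot2`, `rope_score`, `complex_score`) and the values picked for `theta`, `q`, and `k` are illustrative assumptions, not taken from any RoPE codebase.

```python
import cmath
import math

def rotate2d(v, angle):
    """Apply the 2D rotation matrix R(angle) to the pair v = (x, y)."""
    x, y = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def dot2(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.3                       # arbitrary illustrative frequency
q, k = (0.8, -0.5), (0.2, 1.1)    # arbitrary illustrative vectors

def rope_score(m, n):
    """<R_m q, R_n k> computed with explicit rotation matrices."""
    return dot2(rotate2d(q, m * theta), rotate2d(k, n * theta))

def complex_score(m, n):
    """Re(q * conj(k) * exp(i (m - n) theta)) with q, k read as complex numbers."""
    qc, kc = complex(*q), complex(*k)
    return (qc * kc.conjugate() * cmath.exp(1j * (m - n) * theta)).real

# Same offset n - m = 2 at very different absolute positions.
near = rope_score(5, 7)
far = rope_score(1003, 1005)
```

Up to floating-point rounding, `near`, `far`, and `complex_score(5, 7)` coincide, and all three equal `dot2(q, rotate2d(k, 2 * theta))`, which is the $q^\top R_{n-m} k$ form.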

Figure 2: Move the m and n sliders. The dashed grey vectors are the original q (red origin) and k (blue origin); the solid arrows are R_m q and R_n k. The right panel shows ⟨R_m q, R_n k⟩ alongside qᵀ R_{n−m} k — they agree exactly. Try sliding m and n together while keeping n − m fixed: the dot product never moves.

Extending to d Dimensions

Split the $d$-dimensional vector into $d/2$ adjacent pairs $(x_{2i}, x_{2i+1})$ and rotate each pair by its own angle $m \theta_i$:

\[R_m^{(d)} \;=\; \mathrm{blkdiag}\!\Bigl(R_m^{(\theta_0)},\, R_m^{(\theta_1)},\, \dots,\, R_m^{(\theta_{d/2-1})}\Bigr), \qquad \theta_i \;=\; b^{-2i/d}.\]

Following Vaswani’s sinusoidal encoding, the original RoFormer paper takes $b = 10000$. Then $f_q(q, m) = R_m^{(d)} q$, $f_k(k, n) = R_n^{(d)} k$, and the same group property gives

\[\langle f_q(q, m), f_k(k, n) \rangle \;=\; q^\top R_{n - m}^{(d)} k\]

— relative position, exactly.
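As a rough illustration of the $d$-dimensional case, the sketch below rotates adjacent pairs one at a time instead of building the block-diagonal matrix. The function names (`rope_angles`, `apply_rope_pairs`) and the small head dimension `d = 16` are assumptions chosen to keep the example readable.

```python
import math
import random

def rope_angles(d, base=10000.0):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def apply_rope_pairs(x, pos, thetas):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by the angle pos * theta_i."""
    out = list(x)
    for i, theta in enumerate(thetas):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = c * a - s * b
        out[2 * i + 1] = s * a + c * b
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

rng = random.Random(0)
d = 16
thetas = rope_angles(d)
q = [rng.gauss(0, 1) for _ in range(d)]
k = [rng.gauss(0, 1) for _ in range(d)]

# Offset n - m = 4 at two different absolute positions.
score_a = dot(apply_rope_pairs(q, 3, thetas), apply_rope_pairs(k, 7, thetas))
score_b = dot(apply_rope_pairs(q, 503, thetas), apply_rope_pairs(k, 507, thetas))
# Equivalent form q^T R_{n-m} k: leave q alone, rotate k by the offset only.
score_rel = dot(q, apply_rope_pairs(k, 4, thetas))
```

The three scores agree up to rounding, which is the relative-position property stated above.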

Why geometric frequencies? The choice $\theta_i = b^{-2i/d}$, identical to Vaswani’s sinusoidal encoding, is not an accident. Geometrically-spaced frequencies cover the range of scales uniformly on a log axis, which no equally-spaced (linear) grid does: each successive pair rotates a constant factor slower than the previous one. The fastest pair ($\theta_0 = 1$) completes a full rotation every $2\pi \approx 6.3$ tokens; the slowest ($\theta_{d/2 - 1} = b^{-(d-2)/d}$) takes $2\pi\, b^{(d-2)/d}$ tokens, just under $2\pi b$. This logarithmic coverage is what gives the encoding both fine-grained adjacency resolution (high frequencies) and coarse long-range positional context (low frequencies) within a single fixed budget of $d/2$ pairs.

Pair index $i$	$\theta_i$ at $b = 10000, d = 128$	Wavelength $2\pi/\theta_i$ (tokens)
0	$1$	$6.3$
16	$0.1$	$\approx 63$
32	$0.01$	$\approx 628$
48	$0.001$	$\approx 6{,}283$
63	$0.000115$	$\approx 54{,}410$
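The per-pair frequencies and wavelengths follow directly from $\theta_i = b^{-2i/d}$, so they are cheap to recompute for any head dimension or base. A small sketch, with `theta` and `wavelength` as illustrative helper names:

```python
import math

def theta(i, d=128, base=10000.0):
    """Rotation frequency of pair i: base^(-2i/d)."""
    return base ** (-2.0 * i / d)

def wavelength(i, d=128, base=10000.0):
    """Tokens needed for pair i to complete one full rotation."""
    return 2.0 * math.pi / theta(i, d, base)

rows = [(i, theta(i), wavelength(i)) for i in (0, 16, 32, 48, 63)]
for i, th, wl in rows:
    print(f"pair {i:2d}: theta = {th:.3e}, wavelength = {wl:,.0f} tokens")
```

Because $10000^{-2i/128} = 10^{-i/16}$, every 16 pairs lower the frequency by exactly one decade.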

From Math to PyTorch

The block-diagonal matrix is never materialized. Instead, precompute two tensors of shape $(T, d)$ and apply RoPE elementwise. Most modern implementations (Llama, Qwen, GPT-NeoX, HuggingFace) adopt the “split halves” layout where the first $d/2$ channels store the “real” parts and the second $d/2$ store the “imaginary” parts. Under that convention, pair $i$ is $(x_i,\, x_{i + d/2})$, so the tables repeat each angle across the two halves:

\[\cos[m,\,i] = \cos[m,\,i + d/2] = \cos(m \theta_i), \qquad \sin[m,\,i] = \sin[m,\,i + d/2] = \sin(m \theta_i), \qquad 0 \le i < d/2,\]

and the canonical PyTorch form is:

import torch

def rotate_half(x):
    # Split-halves layout: pair i is (x[..., i], x[..., i + D/2]).
    x1, x2 = x.chunk(2, dim=-1)          # split into two halves
    return torch.cat((-x2, x1), dim=-1)  # (a, b) -> (-b, a)

def apply_rope(q, k, cos, sin):
    # q, k: (B, H, T, D);  cos, sin: (1, 1, T, D), broadcast over batch and heads
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

Two layout conventions, same math. The original RoFormer paper uses adjacent pairs — $(x_0, x_1), (x_2, x_3), \dots$ — and a different rotate_half formula. The split-halves layout above is equivalent under a permutation: both produce identical inner products $\langle R_m q, R_n k \rangle$ for the same $\theta_i$. Always check which convention a checkpoint uses before mixing weights across codebases — silent permutation mismatches show up as garbled outputs, not crashes.
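To make the "equivalent under a permutation" claim concrete, here is a small pure-Python sketch that implements both layouts and compares the attention scores. All function names are illustrative assumptions; real checkpoints bake the permutation into the $W_q$, $W_k$ projection weights, whereas this sketch permutes the vectors directly.

```python
import math
import random

def rope_interleaved(x, pos, thetas):
    """Adjacent-pair layout: pair i is (x[2i], x[2i+1])."""
    out = list(x)
    for i, th in enumerate(thetas):
        c, s = math.cos(pos * th), math.sin(pos * th)
        out[2 * i] = c * x[2 * i] - s * x[2 * i + 1]
        out[2 * i + 1] = s * x[2 * i] + c * x[2 * i + 1]
    return out

def rope_split_halves(x, pos, thetas):
    """Split-halves layout: pair i is (x[i], x[i + d/2])."""
    half = len(x) // 2
    out = list(x)
    for i, th in enumerate(thetas):
        c, s = math.cos(pos * th), math.sin(pos * th)
        out[i] = c * x[i] - s * x[i + half]
        out[i + half] = s * x[i] + c * x[i + half]
    return out

def to_split_halves(x):
    """Permute interleaved channels (a0, b0, a1, b1, ...) to (a0, a1, ..., b0, b1, ...)."""
    return x[0::2] + x[1::2]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

rng = random.Random(1)
d = 8
thetas = [10000.0 ** (-2.0 * i / d) for i in range(d // 2)]
q = [rng.gauss(0, 1) for _ in range(d)]
k = [rng.gauss(0, 1) for _ in range(d)]

score_interleaved = dot(rope_interleaved(q, 5, thetas), rope_interleaved(k, 9, thetas))
score_split = dot(rope_split_halves(to_split_halves(q), 5, thetas),
                  rope_split_halves(to_split_halves(k), 9, thetas))
# Mixing conventions: interleaved-layout vectors fed to a split-halves kernel.
score_mismatch = dot(rope_split_halves(q, 5, thetas), rope_split_halves(k, 9, thetas))
```

`score_interleaved` and `score_split` match up to rounding. `score_mismatch` runs without error but generally gives a different number, which is the silent failure mode described above.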

Four practical remarks:

Q, K only — never V. Relative position should bias who attends to whom, not the content that gets aggregated. Rotating $V$ would distort the values themselves.
Shared cos/sin tables. The same table is shared across heads and layers — only $\theta_i$ depends on the model. Memory is $O(T \cdot d)$, computed once at module init.
KV-cache trivial. Because the rotation is applied position-by-position, cached keys carry their rotation from when they were first written; new queries rotate at their new position; no recomputation is needed when the sequence grows. Compare with absolute encodings, which would require re-running layer 0 if positions shift.
Linear-attention compatible. Rotation is unitary, so it preserves inner-product structure. RoPE composes cleanly with kernelized / linear-attention variants (Performer, RetNet, linear-attention Mamba branches): the rotation moves through the kernel approximation as a unitary transformation, leaving the kernel identity intact.

Why It Works: Long-Distance Decay

The Decay Bound via Abel Summation

The relative-position property alone tells us nothing about magnitudes. In principle, the rotation could leave $\langle q_m, k_n \rangle$ oscillating at full amplitude no matter how far apart $m$ and $n$ are. The actual RoPE design has a stronger property: with $\theta_i = 10000^{-2i/d}$, the inner product decays (on average) as $|n - m|$ grows. This is the closest thing RoPE has to ALiBi-style recency bias, but it emerges from the math rather than from a hand-tuned slope.

Group the rotated dot product per pair. Writing

\[h_i \;=\; q_{[2i]} k_{[2i]} + q_{[2i+1]} k_{[2i+1]} + i\,(q_{[2i+1]} k_{[2i]} - q_{[2i]} k_{[2i+1]})\]

as a complex number capturing the $i$-th pair’s contribution (the $i$ multiplying the parenthesis is the imaginary unit; the subscript $i$ is the pair index), the relative dot product becomes

\[\langle R_m q, R_n k \rangle \;=\; \mathrm{Re}\sum_{i=0}^{d/2 - 1} h_i\, e^{i(m - n)\theta_i}.\]

Define the partial phase sum $S_j = \sum_{i=0}^{j-1} e^{i(m - n)\theta_i}$ with $S_0 = 0$. Abel summation (summation by parts) gives

\[\sum_{i=0}^{d/2-1} h_i\, e^{i(m-n)\theta_i} \;=\; h_{d/2-1}\, S_{d/2} \;-\; \sum_{i=0}^{d/2-2} S_{i+1}\,(h_{i+1} - h_i),\]

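so that, with the convention $h_{d/2} = 0$ (which rewrites the boundary term as $-S_{d/2}\,(h_{d/2} - h_{d/2-1})$) and the triangle inequality,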

\[\bigl|\langle R_m q, R_n k \rangle\bigr| \;\le\; \Bigl(\max_i |h_{i+1} - h_i|\Bigr)\, \sum_{i=1}^{d/2} |S_i|.\]

The first factor is content-dependent; the second is purely geometric. The RoFormer paper plots $\frac{1}{d/2}\sum_i |S_i|$ against $|n - m|$ for $\theta_i = 10000^{-2i/d}$ and shows it falls off: fast at first, then more slowly, with the residual oscillation expected from a finite sum of incommensurate frequencies.
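The Abel-summation identity and the resulting bound can be sanity-checked on random data. The sketch below assumes the convention $h_{d/2} = 0$ so that the boundary term fits the same pattern as the others; the head size, the offset `delta`, and the random $h_i$ are arbitrary illustrative choices.

```python
import cmath
import random

rng = random.Random(0)
half = 32                                   # d/2 frequency pairs (d = 64)
d = 2 * half
thetas = [10000.0 ** (-2.0 * i / d) for i in range(half)]
delta = 17                                  # illustrative offset m - n

# Random complex per-pair contents h_0 .. h_{d/2-1}, plus the convention h_{d/2} = 0.
h = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(half)] + [0j]
phase = [cmath.exp(1j * delta * th) for th in thetas]

# Partial sums S_0 = 0, S_j = sum_{i<j} phase_i.
S = [0j]
for p in phase:
    S.append(S[-1] + p)

direct = sum(h[i] * phase[i] for i in range(half))
# Abel form: h_{d/2-1} S_{d/2} - sum_{i=0}^{d/2-2} S_{i+1} (h_{i+1} - h_i)
abel = h[half - 1] * S[half] - sum(S[i + 1] * (h[i + 1] - h[i]) for i in range(half - 1))

# Bound: max_i |h_{i+1} - h_i| * sum_{i=1}^{d/2} |S_i|, with i running up to d/2 - 1.
max_step = max(abs(h[i + 1] - h[i]) for i in range(half))
bound = max_step * sum(abs(S[i]) for i in range(1, half + 1))
```

`direct` and `abel` agree up to rounding, and `abs(direct.real)` (the magnitude of the actual dot product) never exceeds `bound`.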

Geometric Frequencies and Phase Decorrelation

The bound has a clean physical interpretation. The high-frequency pairs (small $i$, $\theta_i \approx 1$) rotate by a different amount per step than the low-frequency pairs (large $i$, $\theta_i \to 0$). When you sum the contributions, the phases at different frequencies generically don’t reinforce — they decorrelate. Only when $m = n$ are all phases zero and the sum is exactly $\sum_i h_i$ (constructive interference). Move $n$ away from $m$ and the phases drift apart at different rates, producing destructive interference.

The geometric spacing of $\theta_i$ is what spreads the rotation rates evenly on log-scale, so that no two pairs stay in lockstep over typical distances. Compare with three alternatives:

All identical frequencies ($\theta_i = \theta$ for all $i$). All pairs rotate in phase. The sum has the same magnitude as a single pair — no decay; behaves like a single complex exponential.
Linear spacing ($\theta_i = (1 + i)\theta_0$). Pairs decorrelate, but on linear timescales. Decay happens but the spectrum lacks log-scale coverage — fewer slow frequencies, so worse long-range behavior.
Geometric spacing ($\theta_i = b^{-2i/d}$). Pairs decorrelate evenly on log-scale. This is the RoPE / sinusoidal choice. Decay is graceful and the spectrum covers many orders of magnitude with only $d/2$ basis functions.
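The three spacings can be compared with a few lines of code. The sketch below evaluates the geometric factor $\frac{1}{d/2}\sum_j |S_j|$ for each choice; the specific constants (`0.1` for the identical case, $\theta_0 = 1/(d/2)$ for the linear case) are assumptions made only so the three spectra are comparable.

```python
import cmath

def mean_partial_sum(thetas, delta):
    """(1/(d/2)) * sum_{j=1}^{d/2} |S_j|, with S_j = sum_{i<j} exp(1j * delta * theta_i)."""
    total, running = 0.0, 0j
    for th in thetas:
        running += cmath.exp(1j * delta * th)
        total += abs(running)
    return total / len(thetas)

half = 64  # d/2 for d = 128
spacings = {
    "identical": [0.1] * half,                                   # theta_i = theta
    "linear": [(1 + i) / half for i in range(half)],             # theta_i = (1 + i) * theta_0
    "geometric": [10000.0 ** (-i / half) for i in range(half)],  # theta_i = b^(-2i/d)
}

for delta in (0, 1, 10, 100, 1000):
    row = ", ".join(f"{name}={mean_partial_sum(th, delta):6.2f}"
                    for name, th in spacings.items())
    print(f"delta={delta:5d}: {row}")
```

With identical frequencies every partial sum has magnitude exactly $j$, so the value stays at $(d/2 + 1)/2 = 32.5$ for every distance; the other two spacings start at the same value for `delta = 0` and drop below it once the phases spread out.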

Figure 3: Normalized magnitude (1/(d/2))·|Σᵢ exp(i · Δ · θᵢ)| as a function of distance Δ = |n − m|. Starts at 1 (perfect alignment) and decays through destructive interference of the geometric frequencies. Drag the base slider: a larger b makes the slowest frequencies even slower, flattening the tail — the same lever Llama 3.1 pulls (b = 500000) to preserve resolution at long range. Drag d: bigger head dim means more frequencies, averaging into a smoother curve.

Practical Implications: The Role of the Base b

Picking the base $b$ trades off near-field resolution against far-field coverage. The highest frequency $\theta_0 = 1$ rotates by a full radian per step regardless of $b$; this is what gives RoPE its fine-grained ability to distinguish adjacent tokens. The lowest frequency $\theta_{d/2-1} = b^{-(d-2)/d}$ scales with $b$: at $b = 10000, d = 128$, $\theta_{d/2-1} \approx 1.15 \times 10^{-4}$, giving a wavelength of $\sim 54{,}000$ tokens.

Base $b$	Lowest $\theta$ ($d = 128$)	Longest wavelength	Used by
10000	$1.15 \times 10^{-4}$	$\sim 54$K tokens	RoFormer, Llama 2, Mistral 7B
100000	$1.2 \times 10^{-5}$	$\sim 525$K tokens	(intermediate value, shown for comparison)
500000	$2.5 \times 10^{-6}$	$\sim 2.6$M tokens	Llama 3 / 3.1
1000000	$1.2 \times 10^{-6}$	$\sim 5.1$M tokens	Code Llama, Qwen2.5 (some variants)

The flip side: very-slow frequencies don’t help at short range and consume head channels — pushing $b$ too high wastes capacity on positions the model rarely sees. Llama 3.1’s choice of $b = 500{,}000$ is calibrated against its announced 128K context, leaving comfortable headroom while still providing useful gradient at short distances. The same lever is the cleanest entry point for the length-extension methods covered in the next section.
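The effect of the base on the slow end of the spectrum is a one-line formula, so it is easy to tabulate for any candidate $b$. A minimal sketch, assuming $d = 128$ and using `slowest_pair` as an illustrative helper name:

```python
import math

def slowest_pair(base, d=128):
    """Return (theta, wavelength) of the lowest-frequency pair i = d/2 - 1."""
    theta = base ** (-(d - 2) / d)
    return theta, 2.0 * math.pi / theta

for base in (1e4, 1e5, 5e5, 1e6):
    theta, wl = slowest_pair(base)
    print(f"b = {base:>9,.0f}: theta_min = {theta:.2e}, longest wavelength = {wl:,.0f} tokens")
```

Since the longest wavelength is $2\pi\, b^{(d-2)/d}$, it grows almost linearly in $b$: multiplying the base by 50 stretches the slowest pair by a little under 50×.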

Beyond Training Length: PI, NTK, YaRN

Why Naive Extrapolation Fails

A model trained on $L = 2048$ tokens almost always degrades catastrophically when fed sequences much longer than $L$. The reason is sharper than “the model hasn’t seen those positions”. Pick any frequency pair $i$. During training, the rotation angles that pair sees are

\[\{\, m \theta_i \;:\; m \in [0, L) \,\}\]

— a set covering $L \theta_i$ radians. For high-frequency pairs ($\theta_i$ near 1), $L\theta_i \gg 2\pi$ — the pair wraps the unit circle many times, sees every phase, and any extrapolation to $m > L$ produces phases the pair has already practiced. For low-frequency pairs ($\theta_i$ small), $L\theta_i$ may be less than $2\pi$ — the pair has not even completed a full cycle. Asking it to handle $m = 2L$ pushes its phase into truly novel territory.

In one line. Length extrapolation fails because the slow frequencies never wrapped around during training; the fast ones did. So fixes must target the slow end of the spectrum without disturbing the fast end.

What “wrapped around” buys you. RoPE’s output $(\cos m\theta_i, \sin m\theta_i)$ lives on the unit circle, periodic with period $2\pi/\theta_i$. Wrapped around means $L\theta_i > 2\pi$ — the angle completed at least one full revolution during training, so the model has been shown the entire $(\cos, \sin)$ output range for that pair. The downstream weights treat $(\cos, \sin)$ as ordinary features — they are not themselves periodic; they only “know” what they were trained on. Once a frequency’s outputs cover the full circle during training, any future angle $m\theta_i \bmod 2\pi$ lands on a point the model has already mapped. Concretely at $b = 10000,\, d = 128,\, L = 2048$: pair 0 ($\theta_0 = 1$) sees $2048$ radians of training angle — about 326 full cycles — every $(\cos, \sin)$ value many times over. Pair 63 ($\theta_{63} \approx 1.15 \times 10^{-4}$) sees only $0.24$ radians — about 4% of one cycle — and the $\cos$ output never leaves $[0.97, 1]$. Extrapolate to $m = 16384$ and the slow pair’s $\cos$ drops to $-0.31$, feeding the downstream weights a value they have never been trained on. That is why the fast end needs no surgery and the slow end does.

This single observation organizes the entire length-extension literature. Position Interpolation rescales every frequency uniformly (touches the fast end too — hence needs fine-tuning to recover). NTK-aware leaves the fast end alone and stretches only the slow end. YaRN refines NTK with per-frequency control. All three are post-hoc surgeries on $\theta_i$.
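The wrap-around argument can be reproduced with a short calculation. The sketch below counts how many of the 64 pairs complete at least one full cycle within the training context, under the same assumptions as the worked example ($b = 10000$, $d = 128$, $L = 2048$); `cycles_seen` is an illustrative helper name.

```python
import math

def cycles_seen(i, train_len, d=128, base=10000.0):
    """Number of full rotations pair i completes over positions [0, train_len)."""
    theta = base ** (-2.0 * i / d)
    return train_len * theta / (2.0 * math.pi)

L = 2048
wrapped = [i for i in range(64) if cycles_seen(i, L) >= 1.0]
unwrapped = [i for i in range(64) if cycles_seen(i, L) < 1.0]

theta_63 = 10000.0 ** (-126 / 128)
cos_in_training = math.cos((L - 1) * theta_63)   # smallest cos the slowest pair ever produced
cos_at_16384 = math.cos(16384 * theta_63)        # value fed to the model when extrapolating
```

Under these settings pairs 0 through 40 wrap at least once and pairs 41 through 63 do not, so roughly a third of the spectrum is what the length-extension methods need to repair.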

The Three Major Methods

Method	Where it modifies $\theta_i$	Formula	Need fine-tuning?
Position Interpolation (PI)	All frequencies, uniformly	$\theta_i' = \theta_i \cdot L / L'$	Yes (~1k steps)
NTK-aware	Base only	$b' = b \cdot s^{d/(d-2)},\; s = L'/L$	Often zero-shot
YaRN	Per-frequency ramp + temperature	NTK-by-parts + $\sqrt{1/t} = 0.1 \ln s + 1$	Short fine-tune

Position Interpolation (Chen et al., 2023) takes the bluntest route: pretend position $m \in [0, L')$ is actually position $m \cdot L / L' \in [0, L)$. Equivalently, multiply every $\theta_i$ by the scaling factor $L/L'$. This keeps RoPE values inside their training range, but it compresses all frequencies equally — the high-frequency pairs that used to distinguish adjacent tokens now barely move per step, which hurts local detail. PI works, but it requires a thousand-or-so fine-tuning steps to recover.

NTK-aware scaling (originally a LocalLLaMA post, then picked up by everyone) attacks the same problem from the opposite direction: leave the high frequencies alone — they generalize fine — and stretch only the low frequencies. The cleanest implementation just changes the base:

\[b' \;=\; b \cdot s^{d/(d-2)}, \qquad s = L'/L.\]

The exponent $d/(d-2)$ is chosen so that the new lowest frequency $\theta'_{d/2-1} = b'^{-(d-2)/d}$ equals exactly $\theta_{d/2-1} / s$ (i.e., it gets PI’d by factor $s$), while the highest frequency $\theta'_0 = b'^0 = 1$ is unchanged. Intermediate frequencies interpolate smoothly between the two: pair $i$ is slowed by $s^{2i/(d-2)}$. The big practical win: NTK often extends context without any fine-tuning.
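A side-by-side sketch of the two rescalings makes the difference visible per pair. The function names (`rope_thetas`, `pi_thetas`, `ntk_thetas`) and the example scale `s = 8` are assumptions for illustration.

```python
def rope_thetas(d=128, base=10000.0):
    """Vanilla RoPE frequencies theta_i = base^(-2i/d)."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def pi_thetas(d, base, s):
    """Position Interpolation: every frequency divided by s = L'/L."""
    return [th / s for th in rope_thetas(d, base)]

def ntk_thetas(d, base, s):
    """NTK-aware: keep the formula, change the base to b' = b * s^(d/(d-2))."""
    return rope_thetas(d, base * s ** (d / (d - 2)))

d, base, s = 128, 10000.0, 8.0     # e.g. extending 2048 -> 16384
orig = rope_thetas(d, base)
pi = pi_thetas(d, base, s)
ntk = ntk_thetas(d, base, s)

# Per-pair slowdown factor theta_i / theta_i' for a few pairs.
for i in (0, 16, 32, 48, 63):
    print(f"pair {i:2d}: PI x{orig[i] / pi[i]:.2f}, NTK x{orig[i] / ntk[i]:.2f}")
```

PI slows every pair by the same factor $s$, while the NTK slowdown runs from 1 at pair 0 up to exactly $s$ at the last pair.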

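YaRN (Peng et al., 2023) refines NTK with two ideas. First, NTK-by-parts: instead of the smooth NTK base change, define a piecewise ramp $\gamma$ over the context-to-wavelength ratio $r_i = L/\lambda_i$, with $\lambda_i = 2\pi/\theta_i$ (the number of full rotations pair $i$ completes within the training context), and interpolate between PI and no-op,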

\[h(\theta_i) \;=\; (1 - \gamma(r_i)) \cdot \frac{\theta_i}{s} \;+\; \gamma(r_i) \cdot \theta_i, \qquad \gamma(r) = \begin{cases} 0 & r < \alpha \\ \frac{r - \alpha}{\beta - \alpha} & \alpha \le r \le \beta \\ 1 & r > \beta \end{cases}\]

with $\alpha = 1, \beta = 32$ recommended for LLaMA. Frequencies whose wavelength is much shorter than the original context ($r > \beta$) are untouched; very-long-wavelength frequencies ($r < \alpha$) get full PI scaling. Second, attention temperature: scale both $q$ and $k$ by $\sqrt{1/t}$ with

\[\sqrt{1/t} \;=\; 0.1 \ln(s) + 1,\]

fitted empirically across LLaMA 7B/13B/33B/65B. The temperature compensates for the average entropy increase of attention logits as the context grows, restoring the perplexity curve at long range.
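The two YaRN ingredients translate almost line by line into code. The following is a minimal sketch of the formulas above, not the reference implementation: the function names and default arguments are assumptions, and $r_i$ is taken to be $L\theta_i / 2\pi$.

```python
import math

def yarn_thetas(d=128, base=10000.0, train_len=2048, s=8.0, alpha=1.0, beta=32.0):
    """NTK-by-parts: blend PI (theta / s) and unchanged theta using the ramp gamma(r)."""
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        r = train_len * theta / (2.0 * math.pi)      # rotations completed within the training context
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        out.append((1.0 - gamma) * theta / s + gamma * theta)
    return out

def yarn_qk_scale(s):
    """sqrt(1/t) = 0.1 * ln(s) + 1, applied to both q and k."""
    return 0.1 * math.log(s) + 1.0

new = yarn_thetas()
old = [10000.0 ** (-2.0 * i / 128) for i in range(64)]
slowdown = [o / n for o, n in zip(old, new)]   # 1.0 = untouched, s = full PI
```

`slowdown` is 1 for the fast pairs, $s$ for the pairs that never complete a cycle in training, and ramps between the two in the middle.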

Figure 4: Wavelength 2π / θᵢ across the d/2 = 64 frequency pairs (head_dim = 128, b = 10000), log-scaled. Vanilla RoPE (grey) climbs from 2π at i=0 to ~54k at i=63. PI (orange) shifts the whole curve up by a constant factor s, stretching every frequency the same way. NTK-aware (green) leaves the leftmost (high-frequency) pairs essentially untouched and stretches mainly the right tail. YaRN (blue) is the piecewise NTK-by-parts ramp: PI'd at very-long-wavelength frequencies (r < α), untouched at short-wavelength ones (r > β). The red dashed line marks the training context L = 2048; pairs whose wavelength sits below it generalize naturally, while those above it need help.

Other Variants and Architectural Choices

The PI/NTK/YaRN trio is the most-cited core, but the surrounding literature has accumulated a number of useful refinements and alternatives:

Variant	Idea	Used by
Dynamic NTK	Apply NTK scaling only when actual context exceeds $L$; degrade gracefully	HuggingFace `transformers` (`dynamic` rope-scaling type); Mistral long-context
LongRoPE (Ding et al., 2024)	Per-frequency scaling factors found by evolutionary search; needs no smooth functional form	Phi-3 long-context, internal extensions of Llama
ABF (adjusted base frequency)	Change the base $b$ at fine-tune time without other scaling — equivalent to NTK with the right exponent	Code Llama (`b = 1e6`), Mistral long-context fine-tunes
Llama 3.1: trained big-base + extra scaling	$b = 500{,}000$ from scratch, then YaRN-style “llama3” scaling for 128K context	Llama 3.1 / 3.3
Llama 4: iRoPE (interleaved)	Alternate RoPE layers with NoPE (no positional encoding) layers — NoPE layers act as global; RoPE layers as local	Llama 4

Dynamic NTK is the most user-visible: it leaves the model alone for sequences within $L$, then progressively applies NTK only when needed. This avoids the “you paid an accuracy tax even on short prompts” failure mode of static rescaling.
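In its simplest form the idea is just a scale factor that stays at 1 until the sequence outgrows the training length. The sketch below uses a deliberately simplified rule, $s = \max(1, \text{seq\_len}/L)$ followed by the NTK base change; library implementations may use a different expression for the scale, so treat `dynamic_ntk_base` as a hypothetical helper rather than any library's API.

```python
def dynamic_ntk_base(seq_len, train_len=2048, d=128, base=10000.0):
    """Return the RoPE base to use for the current sequence length.

    Simplified rule (an assumption of this sketch): s = max(1, seq_len / train_len),
    then the NTK-aware base change b' = b * s^(d/(d-2)).
    """
    s = max(1.0, seq_len / train_len)
    return base * s ** (d / (d - 2))

for n in (512, 2048, 4096, 16384):
    print(f"seq_len={n:6d} -> base={dynamic_ntk_base(n):,.0f}")
```

Prompts up to `train_len` see the original base unchanged; beyond that, the base grows with the sequence.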

LongRoPE abandons the closed-form ramp entirely and treats per-frequency scaling factors as an optimization problem solved by differential evolution on a held-out perplexity target. The resulting curves often look messy compared to YaRN’s clean ramp — but they outperform it on the long tail of context-length benchmarks, suggesting that the right scaling shape is not actually smooth.

Llama 4’s iRoPE is a structural rather than algebraic move: rather than tinker with $\theta_i$, alternate the layers — some get RoPE, some get NoPE. The NoPE layers (no position encoding at all) become naturally translation-invariant and unbounded, capturing very-long-range dependencies; the RoPE layers handle local structure. This is closer in spirit to ALiBi than to NTK: the long-context behavior is built into the architecture, not retrofitted onto the embedding.

Bigger picture. Every method here ultimately makes the same trade: high frequencies for local detail, low frequencies for long-range coverage, with the slow-end frequencies stretched somehow when context grows past training length. PI, NTK, YaRN, LongRoPE differ on how to stretch; bigger-base and iRoPE differ on when to commit. Llama 3.1’s choice to train at large base from scratch is a bet that the right place to spend the design budget is before training, not after.

M-RoPE: Position in Three Axes

Why 1D RoPE Falls Short for Vision and Video

Multimodal sequences carry position information that 1D RoPE cannot express. An image patch has a row and a column; a video frame adds time. Take a 224×224 image patched at 14×14 — that’s 256 patches arranged in a 16×16 grid. Flatten to 1D in row-major raster order:

Pair of patches	2D positions	1D distance
Horizontal neighbors	$(0, 0)$ and $(0, 1)$	1
Vertical neighbors	$(0, 0)$ and $(1, 0)$	16
Diagonal neighbors	$(0, 0)$ and $(1, 1)$	17

A row-major flatten makes vertical neighbors look 16× farther apart than horizontal ones. The attention pattern that emerges has no way to know these distances mean the same physical thing — every spatial relation becomes a function of the flatten order rather than the underlying geometry.
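The distortion is easy to reproduce. The sketch below compares the 1D distance after a row-major flatten with the distance on the grid itself; the helper names and the use of Chebyshev distance as the "grid" notion of neighbor are illustrative choices.

```python
def flat_index(row, col, width=16):
    """Row-major raster index of patch (row, col) in a grid with `width` columns."""
    return row * width + col

def flat_distance(a, b, width=16):
    """Distance between two patches after flattening to 1D."""
    return abs(flat_index(*a, width) - flat_index(*b, width))

def grid_distance(a, b):
    """Chebyshev distance on the grid: 1 for horizontal, vertical, and diagonal neighbors."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

origin = (0, 0)
neighbors = {"horizontal": (0, 1), "vertical": (1, 0), "diagonal": (1, 1)}
distances = {name: (flat_distance(origin, p), grid_distance(origin, p))
             for name, p in neighbors.items()}
# {'horizontal': (1, 1), 'vertical': (16, 1), 'diagonal': (17, 1)}
```

All three neighbors are one step away on the grid, but the flattened distances are 1, 16, and 17.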

Several VL designs pre-Qwen2-VL papered over this with different tricks:

NaViT / Pixtral: give each spatial axis its own position signal (factorized per-axis position embeddings in NaViT, a 2D RoPE in Pixtral, which is essentially M-RoPE without the temporal axis), so that height and width neighbors are equally close.
Idefics2 / Idefics3: learn 2D positional embeddings over the patch grid.
CLIP-style towers feeding an LLM: rely on the visual encoder to bake 2D structure into patch embeddings before they reach the LLM, so the LLM’s 1D RoPE only has to handle the “sequence of patches” without caring about the grid.

Qwen2-VL’s M-RoPE is more ambitious: it extends the trick to three axes (temporal, height, width) and unifies image, video, and text under one position scheme that gracefully reduces to 1D RoPE when the input is pure text. The same checkpoint serves all three modalities without architecture surgery per modality.

The (t, h, w) Decomposition in Qwen2-VL

The dimension split. For Qwen2-VL’s standard head dimension $d = 128$, the $d/2 = 64$ frequency pairs are partitioned as

\[\texttt{mrope\_section} \;=\; [16,\,24,\,24] \quad \text{(temporal,\,height,\,width)}.\]

This is the actual rope_scaling.mrope_section field in the Qwen2-VL HuggingFace config, and it’s a key thing to read carefully: the split sums to 64, not 128 — it applies to the frequency-pair index, not the full feature dimension. Per head, temporal gets $2 \times 16 = 32$ channels, height $2 \times 24 = 48$, width $2 \times 24 = 48$. The HF implementation doubles mrope_section before slicing to recover the per-channel partition.

Concretely, for frequency-pair index $i \in \{0, 1, \dots, 63\}$, the position used in the rotation $R_{m_i}^{(\theta_i)}$ is

\[m_i \;=\; \begin{cases} t & i \in [0,\,16) \\ h & i \in [16,\,40) \\ w & i \in [40,\,64) \end{cases}\]

so each pair “listens to” a single axis, but the model can mix axes freely across pairs in subsequent linear layers. Note the asymmetry: temporal gets fewer frequencies than height or width. The paper doesn’t justify the exact ratio, but the choice gives spatial axes more frequency resolution — appropriate for images where local spatial structure dominates.
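The per-pair axis assignment can be written out directly on the frequency-pair index. This is a sketch under simplifying assumptions: it works on pair indices and raw angles rather than on the cos/sin tensors the HuggingFace code slices, and `axis_for_pair` and `mrope_angles` are hypothetical helper names.

```python
def axis_for_pair(i, mrope_section=(16, 24, 24)):
    """Return 0 (temporal), 1 (height), or 2 (width) for frequency-pair index i."""
    start = 0
    for axis, width in enumerate(mrope_section):
        if start <= i < start + width:
            return axis
        start += width
    raise IndexError(f"pair index {i} outside sum(mrope_section)")

def mrope_angles(pos_thw, d=128, base=10000.0, mrope_section=(16, 24, 24)):
    """Rotation angle of every pair for a token with position triple (t, h, w)."""
    return [pos_thw[axis_for_pair(i, mrope_section)] * base ** (-2.0 * i / d)
            for i in range(d // 2)]

text_angles = mrope_angles((7, 7, 7))            # text token at sequence position 7
rope_1d_angles = [7 * 10000.0 ** (-2.0 * i / 128) for i in range(64)]
patch_angles = mrope_angles((0, 3, 5))           # image patch at row 3, col 5, t pinned to 0
```

For the text token the angles are identical to plain 1D RoPE at position 7; for the image patch, pairs 0 to 15 do not rotate at all, pairs 16 to 39 follow the row, and pairs 40 to 63 follow the column.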

Figure 5: Toggle between text, image, and video to see how (t, h, w) is assigned. For text, all three axes carry the same sequence position — so M-RoPE collapses to 1D RoPE and the partition is functionally invisible. For an image, t is pinned to a constant and the two spatial axes carry row and column. For video, t increments per frame on top of the same image-style spatial encoding. The colored bar shows the [16, 24, 24] frequency partition over the 64 pairs.

Position IDs by modality. With three position coordinates instead of one, the assignment of $(t, h, w)$ varies:

Modality	$t$	$h$	$w$
Text token at sequence position $p$	$p$	$p$	$p$
Image patch at row $r$, col $c$ in a single frame	constant $K$	$r$	$c$
Video patch at frame $f$, row $r$, col $c$	$f$	$r$	$c$

The text-as-1D fallback is load-bearing. Setting $t = h = w = p$ for text makes all three axes receive the same rotation; the per-pair allocation becomes invisible, since every pair, regardless of which axis it “belongs” to, rotates by exactly the angle $p\theta_i$ that 1D RoPE would use. This is precisely the design choice that lets a pretrained 1D-RoPE LLM be drop-in upgraded to M-RoPE without retraining the text-only behavior. Qwen2-VL initializes its language model from Qwen2 and inherits its text capability through this clean reduction.

A word on “1D” here — don’t confuse it with the 2D rotation. Two different dimensions are in play. Feature dimension: every RoPE variant, including M-RoPE, rotates in 2D subspaces of the $d$-dimensional feature vector — that’s the operation’s geometry, never up for debate. Position dimension: how many separate position coordinates label each token. 1D RoPE attaches a single number $m$ per token; M-RoPE attaches a triple $(t, h, w)$. “M-RoPE reduces to 1D RoPE for text” refers to the second meaning only: text uses $(p, p, p)$, so the three coordinates collapse to one effective value. The 2D rotation in each frequency subspace is unchanged — but every pair, no matter which axis mrope_section assigned it to, now rotates by the same angle $p\theta_i$, making the partition functionally invisible. The partition only “lights up” when the three coordinates actually carry distinct values, as for image patches $(K, r, c)$ or video patches $(f, r, c)$.

Cross-Modal Continuity, Ablation, and Implementation

Cross-modal continuity. When the input mixes modalities — say, a video clip followed by a text response — the position IDs must continue smoothly across the boundary. The Qwen2-VL paper specifies: “position numbering for each modality is initialized by incrementing the maximum position ID of the preceding modality by one.” So if a video chunk ends with $\max(t, h, w) = K$ (achieved by the last patch of the last frame), the subsequent text token starts at $t = h = w = K + 1$ and increments from there. The reduction of text to identical $(t, h, w)$ makes this seamless: text after image continues the count on all three axes simultaneously.
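The "max plus one" rule can be sketched for a mixed sequence as follows. The segment format and the function name are hypothetical, and the sketch assumes that a vision segment's grid coordinates are offset by the running start position, which is one reading of the rule rather than a copy of the reference implementation.

```python
def build_position_ids(segments):
    """Assign (t, h, w) to every token of a mixed-modality sequence.

    segments: list of ("text", n_tokens) or ("vision", frames, rows, cols).
    Each segment starts at 1 + the largest position ID used so far.
    """
    ids, next_start = [], 0
    for seg in segments:
        if seg[0] == "text":
            chunk = [(p, p, p) for p in range(next_start, next_start + seg[1])]
        else:
            _, frames, rows, cols = seg
            chunk = [(next_start + f, next_start + r, next_start + c)
                     for f in range(frames)
                     for r in range(rows)
                     for c in range(cols)]
        ids.extend(chunk)
        if chunk:  # empty segments leave the counter untouched
            next_start = max(max(triple) for triple in chunk) + 1
    return ids


# 3 text tokens, one 2x3 image, then 2 more text tokens.
ids = build_position_ids([("text", 3), ("vision", 1, 2, 3), ("text", 2)])
```

Here the image occupies $t = 3$, $h \in \{3, 4\}$, $w \in \{3, 4, 5\}$, so the trailing text resumes at 6 on all three axes even though only six visual tokens were consumed.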

Ablation: does M-RoPE actually help? Qwen2-VL’s Table 8 holds everything else constant and swaps M-RoPE for 1D-RoPE. Eight of the eleven benchmarks improve, with the largest gains on spatial and video tasks:

Benchmark	1D-RoPE	M-RoPE	$\Delta$
MathVista (geometry-heavy)	39.2	43.4	+4.2
STAR (video)	55.5	57.9	+2.4
NextQA (video)	43.9	46.0	+2.1
MMBench	58.6	60.6	+2.0
PerceptionTest	46.6	47.4	+0.8
TextVQA (OCR)	71.3	71.8	+0.5
ChartQA (OCR)	68.0	68.4	+0.4
DocVQA (OCR)	82.5	82.8	+0.3
MMStar	36.7	36.7	0.0
InfoVQA	50.8	50.3	−0.5
RealWorldQA	54.5	53.7	−0.8

Spatial reasoning (MathVista) and video-temporal benchmarks (STAR, NextQA) gain the most, exactly where 2D/3D structure matters. OCR-heavy benchmarks where rasterized 1D order is already informative (DocVQA, ChartQA, TextVQA) gain little: those tasks effectively recover the row/column structure from the visual encoder. MMStar is flat, and the slight regressions on RealWorldQA and InfoVQA suggest the partition isn’t free: fewer frequencies per axis means a slightly less expressive encoding for tasks that don’t benefit from 2D structure.

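Implementation. The core of the HF kernel is a single slice-and-stitch expression, applied once to cos and once to sin, that expresses the per-pair axis assignment as a tensor split:

import torch

def apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section):
    # cos, sin: (3, B, H, T, D), one table per position axis (t, h, w), stacked on dim 0
    # mrope_section: frequency pairs per axis, e.g. [16, 24, 24] (sums to D/2 = 64)
    # List repetition, not arithmetic: [16, 24, 24] * 2 -> [16, 24, 24, 16, 24, 24],
    # so the same three-way split covers both halves of D.
    mrope_section = mrope_section * 2
    # Chunk i of the last dim is taken from axis i % 3, giving t, h, w, t, h, w.
    cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1)
    sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1)
    # rotate_half maps the two halves (x1, x2) of the last dim to (-x2, x1).
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

Read it as: precompute three full $(B, H, T, D)$ cos/sin tables, one per axis, using the $t$, $h$, $w$ position IDs respectively, then weave them together. Because mrope_section is a Python list, * 2 repeats it to [16, 24, 24, 16, 24, 24] rather than doubling each entry, so the split yields six chunks and i % 3 assigns them to $t, h, w, t, h, w$. With $D = 128$, the temporal table supplies channels $[0, 16)$ and $[64, 80)$, height supplies $[16, 40)$ and $[80, 104)$, and width supplies $[40, 64)$ and $[104, 128)$. The repetition is needed because RoPE’s “first half / second half” layout stores the two components of frequency pair $i$ at channels $i$ and $i + D/2$, so the three-axis partition has to be applied to each half.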
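To see which axis drives each channel, the split can be replayed without tensors. The sketch below is a plain-Python illustration (the function name is hypothetical) that mirrors the list repetition and the `i % 3` cycle for a 128-channel head.

```python
def axis_of_channel(mrope_section=(16, 24, 24)):
    """Return, for each of the D channels, which axis ('t', 'h', 'w') supplies cos/sin."""
    sections = list(mrope_section) * 2  # list repetition: six chunk widths
    owner = []
    for i, width in enumerate(sections):
        owner.extend(["thw"[i % 3]] * width)
    return owner


owner = axis_of_channel()
# Collapse into (axis, start, stop) runs for readability.
runs, start = [], 0
for ch in range(1, len(owner) + 1):
    if ch == len(owner) or owner[ch] != owner[start]:
        runs.append((owner[start], start, ch))
        start = ch
```

For the default sections, `runs` lists six half-open ranges: temporal on $[0, 16)$ and $[64, 80)$, height on $[16, 40)$ and $[80, 104)$, width on $[40, 64)$ and $[104, 128)$. The second half of the head repeats the pattern of the first.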

What changes in Qwen2.5-VL. The follow-up Qwen2.5-VL refines this further with a time-aligned M-RoPE: instead of $t$ being a raw frame index, it is set proportional to the elapsed time in seconds (with a fixed FPS reference). This decouples temporal position from sampling rate — a 30-FPS clip and a 1-FPS down-sampled version of the same scene now encode the same elapsed-time signal. Empirically this helps long-video understanding where the FPS budget can’t capture every frame, and is the kind of refinement that becomes natural once the temporal axis is a named coordinate rather than a hidden one.
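The idea of tying $t$ to elapsed time can be illustrated with a toy mapping. This is a hypothetical sketch, not Qwen2.5-VL's code: the function name and the `ticks_per_second` parameter are assumptions made only to show how two sampling rates can land on the same temporal IDs.

```python
def time_aligned_t(num_frames, fps, ticks_per_second=2):
    """Temporal ID of each sampled frame, proportional to its timestamp in seconds.

    fps: sampling rate of the frames actually fed to the model.
    ticks_per_second: assumed number of temporal position steps per second.
    """
    return [int(f / fps * ticks_per_second) for f in range(num_frames)]


dense = time_aligned_t(num_frames=4, fps=2.0)   # frames at 0.0, 0.5, 1.0, 1.5 s
sparse = time_aligned_t(num_frames=2, fps=1.0)  # frames at 0.0, 1.0 s
```

Frames taken at the same timestamp receive the same $t$ in both clips, whereas a raw frame index would give the second frame of each clip the same ID even though they are half a second apart.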

Closing thought. M-RoPE is conservative by design: no added inference cost over 1D RoPE, exact 1D-RoPE behavior on text, long-distance decay preserved per axis. The price is the partition: the spatial axes take 48 of the head’s 64 frequency pairs (75%), leaving 16 for the temporal axis. Whether $[16, 24, 24]$ is the right split is open; the paper doesn’t ablate it, and follow-up VL models tend to inherit the same numbers without revisiting them. The next interesting round of ablations probably lives there, or in whether the three-axis decomposition itself is the right unit (audio? depth? higher temporal resolution?). The recipe, one position scheme to rule them all with text as a clean special case, feels durable; the specific numbers, less so.
