The Self-Attention Layer and the Transformer Architecture

This article explains the Transformer architecture thoroughly, moving from RNNs to self-attention and then to the full Transformer. It is not merely a popular-science overview but a hands-on tutorial. Each section opens with a description of why that component is needed, and readers are encouraged to ponder the questions raised throughout the article. By actively engaging with the material, you will not only gain the ability to construct a Transformer from scratch, but also develop a deep understanding of its intricate details and the rationale behind its design.

Sequential Dependency in RNN and LSTM

To understand why self-attention was needed, we first examine the fundamental bottleneck of recurrent architectures: sequential processing that prevents parallelization and loses information over long sequences.

Before the Transformer, sequence modeling relied on recurrent architectures. An RNN processes tokens one by one, updating a single hidden state at each step:

\[a^{\langle t \rangle} = \sigma(W_{aa} \, a^{\langle t-1 \rangle} + W_{ax} \, x^{\langle t \rangle} + b_a), \quad y^{\langle t \rangle} = W_{ya} \, a^{\langle t \rangle} + b_y\]
Figure 1: RNN architecture. The same MLP (blue box) is reused at every time step, with the hidden state as the only channel carrying information forward.

The fundamental problem: with only one hidden state tensor, earlier tokens are inevitably forgotten as the sequence grows — information across the input is treated unevenly, similar to catastrophic forgetting in online learning.

LSTMs attempt to fix this by adding a cell state $c_t$ as a separate memory channel, controlled by three gates:

\[f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\] \[c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c), \quad h_t = o_t \odot \tanh(c_t)\]
Figure 2: LSTM cell. The cell state (top highway) carries long-term memory, controlled by forget, input, and output gates.

LSTMs mitigate the long-range dependency problem but don’t solve it — earlier tokens still get compressed. Both architectures also share a critical scalability bottleneck: the hidden state at time $t$ depends on time $t-1$, making parallelization on GPUs fundamentally difficult. The solution? Completely abandon the sequentially dependent hidden state.
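To make the sequential bottleneck concrete, here is a minimal sketch of the RNN recurrence in PyTorch (tanh stands in for the activation $\sigma$; all names are illustrative rather than taken from any library). The loop over time steps cannot be parallelized, because computing $a^{\langle t \rangle}$ requires $a^{\langle t-1 \rangle}$.

import torch

def rnn_forward(x, W_aa, W_ax, b_a):
    # x: (T, d_in) input sequence; returns hidden states of shape (T, d_hidden).
    a = torch.zeros(W_aa.shape[0])            # initial hidden state a<0>
    states = []
    for t in range(x.shape[0]):               # inherently sequential: step t needs step t-1
        a = torch.tanh(W_aa @ a + W_ax @ x[t] + b_a)
        states.append(a)
    return torch.stack(states)

T, d_in, d_hidden = 5, 3, 4
states = rnn_forward(torch.randn(T, d_in), torch.randn(d_hidden, d_hidden),
                     torch.randn(d_hidden, d_in), torch.zeros(d_hidden))
print(states.shape)  # torch.Size([5, 4])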

Self-Attention

RNNs cannot parallelize and forget early tokens. Self-attention solves both problems by letting every token directly attend to every other token in constant depth, with no sequential dependency.

It’s important to note that attention and self-attention are not synonymous. Attention is a broader concept encompassing self-attention, cross-attention, bi-attention, and more. This article focuses primarily on the Transformer and therefore delves into self-attention and cross-attention.

Self-attention mechanisms had been applied to various models prior to the advent of the Transformer, but their effectiveness was limited. Self-attention itself has several variations, such as single-layer attention, multi-layer attention, and multi-head attention. Fundamentally, they all operate on the same principle, differing only in the number of layers or branches used. Let’s begin by examining single-layer attention.

Single-Layer Attention

We start with the simplest form of self-attention to build intuition: one set of Q, K, V matrices that computes a weighted sum over all tokens.

The diagram below illustrates single-layer self-attention. Given an input sequence $x_{1,2,3}$, an embedding layer produces a corresponding embedding $a_{1,2,3}$ for each token. We then define three parameter matrices, $W^Q, W^K, W^V$. For the token embedding $a_1$, multiplying it by $W^Q$ and $W^K$ yields the vectors $q_1$ and $k_1$, respectively. The dot product of these two vectors gives an initial attention score, $at_{11}$ (often denoted as $\alpha$). It’s crucial to understand that $at_{11}$ is a scalar value, not a vector. Applying softmax across all attention scores produces normalized scores, denoted as $st_{11}$. Simultaneously, $a_1$ is multiplied by $W^V$ to yield a value vector $v_1$. Multiplying the normalized score $st_{11}$ with the value vector $v_1$ gives a weighted value, $wt_{11}$. Repeating this with $q_1$ and the keys and values derived from the second and third tokens in the input sequence gives $wt_{12}$ and $wt_{13}$. Summing these three vectors yields $b_1$. Repeating the whole procedure for $q_2$ and $q_3$ yields $b_2$ and $b_3$, respectively. This constitutes the core algorithm behind self-attention.

Figure 3: Single-layer self-attention. All three tokens enter simultaneously — no sequential dependency. The output \(b_1\) is a weighted sum of all value vectors.

Here are some noteworthy points about the diagram:

  1. The only parameters in a self-attention layer are the three matrices: ${W}^Q,{W}^K,{W}^V$.
  2. The output corresponding to each input token is essentially a weighted sum of the value vectors of all tokens (including itself), with weights determined by its own query and all keys.
  3. $q, k, v$ are essentially semantic representations of the corresponding token $x$ in the latent space. Token $x$ enters the latent space as $q, k, v$ through ${W}^Q$, $W^K, W^V$.
  4. The output token does not depend on the hidden state of any previous time step.

The fourth point might raise questions. Unlike RNNs, self-attention doesn’t require any hidden state that’s sequentially passed along with the tokens. This is because it employs a positional encoding method to directly modify the token embeddings, enabling the model to perceive the relative positions of tokens within the sequence. This will be explained in detail in the Positional Encoding section.
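To make these steps concrete, here is a deliberately unvectorized sketch that computes $b_1, b_2, b_3$ token by token. The variable names (`W_q`, `at`, `st`, `b`) mirror the notation above rather than any library API, and the dimensions are arbitrary.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = d_k = 8
a = torch.randn(3, d_model)                  # embeddings a1, a2, a3
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))   # the only parameters

q, k, v = a @ W_q, a @ W_k, a @ W_v          # per-token q_i, k_i, v_i vectors

outputs = []
for i in range(3):                           # compute b_i for each query token
    at = torch.stack([q[i] @ k[j] for j in range(3)])    # scalar scores at_i1..at_i3
    st = F.softmax(at, dim=0)                             # normalized scores st_i1..st_i3
    outputs.append(sum(st[j] * v[j] for j in range(3)))  # weighted sum of value vectors
b = torch.stack(outputs)                     # rows are b1, b2, b3
print(b.shape)  # torch.Size([3, 8])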

Matrix Form

The token-by-token computation above can be expressed as a single matrix multiplication, which is what makes self-attention GPU-friendly and parallelizable.

Based on the fourth point, this algorithm can be parallelized using matrices. By stacking the three embedding vectors $a_{1,2,3}$ into a matrix, we get the following diagram. Since our input sequence has 3 tokens, the Inputs matrix on the left has $n = 3$ rows. This input matrix is multiplied by the three parameter matrices to obtain $Q, K, V$. $Q$ and $K$ can then be multiplied as ${QK}^T$ to obtain the attention score matrix. In the earlier diagram, self-attention appeared to be applied to each of the three tokens ($x_1, x_2, x_3$) separately; in practice, their embedding vectors are stacked and processed in parallel. Likewise, the scores $at_{11}, at_{12}, at_{13}$ form one row of the attention score matrix (think about why it is a row, not a column). Applying a row-wise softmax to this attention score matrix yields the normalized attention score matrix $A$, where each row sums to 1. The shape of matrix $A$ is $n \times n$; we will delve deeper into its mathematical significance later. Finally, multiplying matrix $A$ with $V$ produces the final matrix $Z$, with a shape of $n \times d_v$. This signifies that we have $n$ tokens, each now carrying a value vector of length $d_v$.

Figure 4: Matrix form of self-attention. The input matrix is projected into Q, K, V, and the output \(Z = \text{softmax}(QK^T)V\) has shape \(n \times d_v\). Image credit: Sebastian Raschka.

Now, let’s examine the official formula for self-attention:

\[\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Here, $Q, K, V$ represent the hidden state matrices resulting from multiplying the input matrix $X$ with the three parameter matrices, respectively. $d_k$ is the number of columns of $W^K$, i.e., the dimensionality of the key vectors.
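As a minimal sketch of that formula (a single head, no mask), assuming random parameter matrices and the shapes described above: `X` is $n \times d_{model}$ and the output $Z$ is $n \times d_v$.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (n, n) attention scores
    A = F.softmax(scores, dim=-1)                   # row-wise softmax: each row sums to 1
    return A @ V                                    # (n, d_v)

n, d_model, d_k, d_v = 3, 8, 4, 4
X = torch.randn(n, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d) for d in (d_k, d_k, d_v))
Z = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(Z.shape)  # torch.Size([3, 4])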

Let’s address a small detail: why divide by $\sqrt{d_k}$? The scaling is there to control the variance of the attention scores. Take any row $q_i$ of matrix $Q$ and any row $k_j$ of matrix $K$, and assume each element of $q_i$ and $k_j$ is an independently and identically distributed random variable with mean 0 and variance 1. Then each product $q_{i,m} k_{j,m}$ (one term of the attention score of token $i$ attending to token $j$) also has mean 0 and variance 1. This is derived from the formula:

\[Var(x_1 \cdot x_2) = Var(x_1) \cdot Var(x_2) + Var(x_1) \cdot E(x_2)^2 + Var(x_2) \cdot E(x_1)^2\]

Since the elements of $q_i$ and $k_j$ are independent and identically distributed, the attention score $q_i \cdot k_j = \sum_{m=1}^{d_k} q_{i,m} k_{j,m}$ has expectation

\[E\Big[\sum_{m=1}^{d_k} q_{i,m} k_{j,m}\Big] = \sum_{m=1}^{d_k} E[q_{i,m}]\,E[k_{j,m}] = 0\]

and variance

\[Var\Big(\sum_{m=1}^{d_k} q_{i,m} k_{j,m}\Big) = \sum_{m=1}^{d_k} Var(q_{i,m} k_{j,m}) = \sum_{m=1}^{d_k} 1 = d_k.\]

This means every such query-key pair produces a score with expected value 0 and variance $d_k$, representing the attention score of token $i$ attending to token $j$. When $d_k$ is large, the variance of these scores is large as well; extreme values are pushed into the saturated ends of the softmax, resulting in very small backpropagated gradients. This is where dividing by $\sqrt{d_k}$ comes in, using $Var(kx) = k^2\,Var(x)$ to reduce the variance from $d_k$ back to 1.
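A quick, purely illustrative check of this argument with standard normal entries: the variance of the raw dot products grows like $d_k$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1.

import torch

torch.manual_seed(0)
for d_k in (16, 64, 256):
    q = torch.randn(10_000, d_k)              # i.i.d. entries, mean 0, variance 1
    k = torch.randn(10_000, d_k)
    scores = (q * k).sum(dim=-1)               # raw dot products
    print(d_k, scores.var().item(), (scores / d_k ** 0.5).var().item())
# the raw variance is roughly d_k; after scaling it is roughly 1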

Essence of Self-Attention

Stripping away the learned projections reveals that self-attention is fundamentally computing pairwise token similarity and using it to produce context-aware embeddings.

Let’s delve into the essence of the official self-attention formula. Firstly, it’s essential to recognize that the three matrices $Q,K,V$ are simply linear transformations of the input matrix $X$, representing $X$ semantically in the latent space. In other words, it is possible to train the model without the matrices ${W}^Q,{W}^K,{W}^V$, but the model would lack capacity, hurting its performance. For clarity, we’ll use $X$ itself in place of these three matrices. Additionally, $\frac{1}{\sqrt{d_k}}$ is only a scaling factor and doesn’t affect the essence, so we’ll omit it. This simplifies the official formula to:

\[\text{Attention}(Q,K,V)=\text{softmax}(XX^T)X\]

Consider the sentence “Welcome to Starbucks.” If the embedding layer employs simple 2-hot encoding (e.g., “Welcome” is encoded as 1010), we can represent the input matrix $X$ as shown on the left side of the diagram below. Multiplying this matrix with its transpose yields a matrix that’s essentially an attention matrix, as depicted on the right side of the diagram.

Figure 5: Computing \(XX^T\) with 2-hot encodings. The resulting matrix captures pairwise token similarity.

What does this attention matrix represent? Examining the first row, we see that this row essentially calculates the similarity between the token “Welcome” and all other tokens in the sentence. The essence of similarity between word vectors is attention. If token A and token B frequently co-occur, their similarity tends to be high. For instance, in the diagram, “Welcome” exhibits high similarity with itself and “Starbucks,” indicating that these two tokens should receive higher attention when inferring the token “Welcome.”

Normalizing this result using softmax gives us the normalized attention matrix shown on the right side of the diagram below. After normalization, this attention matrix becomes a coefficient matrix, ready to be multiplied with the original matrix.

Figure 6: Row-wise softmax normalizes the attention matrix so each row sums to 1.

The final step involves right-multiplying the normalized attention matrix $\alpha$ with the input matrix $X$, resulting in the matrix $\hat X$, as illustrated below. What does this step essentially achieve? The highlighted first row of the left matrix will be multiplied and summed with each column of the input matrix $X$ to compute each value in the first row of the output matrix. Since the first row of the $\alpha$ matrix represents the attention values of the token “Welcome” towards all tokens, the first row of the output matrix $\hat X$ becomes the attention-weighted embedding of the token “Welcome.”

Figure 7: The final multiplication \(\hat{X} = \alpha X\). Each row of \(\hat{X}\) is the attention-weighted embedding of the corresponding token.

In summary, given an input matrix $\mathsf{X}$, self-attention outputs a matrix $\hat X$, which is the attention-weighted semantic representation matrix of the input matrix.
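The toy computation above can be reproduced in a few lines. Only the encoding of “Welcome” (1010) is given, so the other two rows below are assumptions made for illustration; the point is simply that $\text{softmax}(XX^T)X$ turns raw embeddings into attention-weighted ones.

import torch
import torch.nn.functional as F

# Assumed 2-hot encodings for "Welcome to Starbucks": only "Welcome" = 1010
# is specified above; the other two rows are made up for illustration.
X = torch.tensor([[1., 0., 1., 0.],   # Welcome
                  [0., 1., 0., 1.],   # to
                  [1., 0., 0., 1.]])  # Starbucks

scores = X @ X.T                      # pairwise similarity: the raw attention matrix
alpha = F.softmax(scores, dim=-1)     # row-wise normalization
X_hat = alpha @ X                     # attention-weighted embeddings
print(alpha)
print(X_hat)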

Query, Key, Value

Attention works without separate Q, K, V projections, but adding them gives the model enough capacity to learn richer semantic relationships.

While the matrices $W^Q, W^K, W^V$ aren’t strictly necessary, and we know that attention weighting can be achieved using $X$ alone, the performance would be suboptimal. It’s natural to wonder how the QKV concept came about and its underlying rationale.

Why choose the names QKV? Q stands for Query, K for Key, and V for Value. In the context of databases, we aim to retrieve a corresponding Value V from the database using a Query Q. However, directly searching for V in the database using Query often yields unsatisfactory results. We desire each V to have an associated Key K that facilitates its retrieval. This Key K captures the essential characteristics of V, making it easier to locate. It’s important to note that one key corresponds to one value, implying that the number of Ks and Vs is identical.

Refer to the diagram below. How are the features extracted by K obtained? Based on the attention mechanism, we need to first examine all items before accurately defining a specific item. Therefore, for each Query Q, we retrieve all Keys K (the first row of $\alpha$) as coefficients, multiply them with their corresponding Vs (the first column of V), and sum the weighted values to obtain the desired result 0.67 for this query. Through backpropagation, K gradually learns the features of V.

Figure 8: The QKV retrieval analogy. Each query retrieves all keys as coefficients, multiplies with corresponding values, and sums to produce the output.

Therefore, the QKV concept in self-attention essentially functions as a database capable of integrating global semantic information. The resulting matrix Y represents the semantic matrix of the input matrix X in the latent space, with each row representing a token, having considered the contextual information.

Positional Encoding

Self-attention treats input as a set, not a sequence — it has no notion of token order. Positional encoding injects ordering information so the model knows which token comes first.

Upon careful consideration, you’ll realize that self-attention alone, relying solely on three matrices storing semantic mappings, cannot capture positional information about tokens. The input sequence remains unaware of the order of its elements. While it’s possible to append an MLP after the output matrix Y for classification tasks, the lack of positional information hinders performance. This begs the question, how would you design an algorithm to address this issue?

The first challenge is how to integrate positional encoding into the self-attention algorithm. We could directly modify the input matrix $X$ or alter the self-attention algorithm itself. Modifying the input matrix is more intuitive. By ensuring that the positional embedding generated by positional encoding has the same length as the token embedding, we can directly add them to obtain a new input matrix of the same size, without requiring any modifications to the self-attention algorithm. Let’s proceed with this approach and consider how to perform the encoding.

A straightforward solution is to encode the token’s position in the sequence on a scale from 0 to 1 and incorporate it into the self-attention layer. However, this method presents a significant problem: the distance between adjacent tokens differs for sentences of varying lengths. For example, “Welcome to Starbucks” is encoded as [0, 0.5, 1] with an interval of 0.5, while “Today is a very clear day” is encoded as [0, 0.2, 0.4, 0.6, 0.8, 1] with an interval of 0.2. Such length-dependent intervals can degrade model performance, because the parameters in the self-attention layer are trained in a sequence-length-agnostic manner; introducing any sequence-length-dependent signal shifts what those parameters are being trained to represent.

Another idea is to fix the interval at 1 and simply count upwards. For instance, “Welcome to Starbucks” is encoded as [0, 1, 2], and “Today is a very clear day” is encoded as [0, 1, 2, 3, 4, 5]. While this addresses the issue of sequence-length-dependent intervals, it produces very large encoding values for long sequences, which can lead to vanishing gradients.

The approach employed by the Transformer is a widely used positional encoding method from applied mathematics: sinusoidal positional encoding. For a token at position $t$ in the sequence and embedding length $d$, the sinusoidal positional encoding $p_t \in \mathbb{R}^d$ is defined element-wise as:

\[p_t^{(2k)} = \sin(w_k t), \qquad p_t^{(2k+1)} = \cos(w_k t), \qquad k = 0, 1, \ldots, \tfrac{d}{2}-1\]

Here, $w_k=\frac{1}{10000^{2k/d}}$, $k$ indexes the sine/cosine pair within the embedding (i.e., the embedding dimension index divided by 2), and $d$ is the embedding dimension. Notice that the length of $p_t$ is the same as the embedding length. Therefore, $p_t$ is also referred to as a positional embedding and can be directly added to the semantic embedding. We can evaluate $p_t$ for every position in the sequence to obtain a matrix. For example, given the input sentence “I am a Robot” with 4 tokens and embedding size $d = 4$, we obtain the following $4 \times 4$ positional encoding matrix. Each row represents a $p_t$ from the formula above.

Figure 9: Sinusoidal positional encoding matrix for "I am a Robot" (4 tokens, \(d = 4\)). Image credit: Machine Learning Mastery.

The first row represents the positional embedding of the token “I.” Since its position in the sequence is 0, we have $t = 0$. Therefore, the first element of $p_0$ is $\sin(w_0 t) = \sin\!\left(\frac{1}{10000^{2 \cdot 0 / 4}} \cdot 0\right) = 0$, and the second element is $\cos(w_0 t) = \cos\!\left(\frac{1}{10000^{2 \cdot 0 / 4}} \cdot 0\right) = 1$. Following this logic, we can obtain the positional embedding for each token.
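The following sketch builds the sinusoidal positional encoding matrix for 4 tokens with $d = 4$ directly from the formula above, so its first row should come out as $[0, 1, 0, 1]$.

import math
import torch

def sinusoidal_positional_encoding(seq_len, d):
    # Build the (seq_len, d) matrix whose row t is the positional embedding p_t.
    pe = torch.zeros(seq_len, d)
    for t in range(seq_len):
        for k in range(d // 2):
            w_k = 1.0 / (10000 ** (2 * k / d))   # frequency of the k-th sin/cos pair
            pe[t, 2 * k] = math.sin(w_k * t)
            pe[t, 2 * k + 1] = math.cos(w_k * t)
    return pe

print(sinusoidal_positional_encoding(4, 4))  # first row: [0., 1., 0., 1.]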

How should we interpret this method? It leverages the periodicity of the sine and cosine functions. In fact, as long as we can find a family of functions with different periods but otherwise similar shape, they can in principle be used for positional encoding. For instance, we could employ binary positional encoding instead of sinusoidal encoding. As shown in the diagram below, assuming a sequence with 16 tokens and an embedding size of 4, we can obtain the positional embedding for each token using binary encoding:

Figure 10: Binary positional encoding for 16 tokens. Each bit has a different period, preventing duplicate encodings. Image credit: Amirhossein Kazemnejad.

Clearly, the least significant bit (red) changes very rapidly (period of 2), while the most significant bit (yellow) changes the slowest (period of 16). This means that each positional embedding bit has a different period, so no two positions receive the same encoding. Experiments have shown that this method also works, but not as well as sinusoidal positional encoding.
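For comparison, a few lines reproduce the binary scheme from the figure: bit $b$ of the position index repeats with period $2^{b+1}$, so no two of the 16 positions share the same 4-bit code.

for pos in range(16):
    bits = [(pos >> b) & 1 for b in range(4)]   # bit 0 changes fastest, bit 3 slowest
    print(pos, bits)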

The Transformer Architecture

Self-attention with positional encoding handles classification, but sequence-to-sequence tasks require an encoder-decoder structure. The Transformer assembles self-attention, cross-attention, masking, and feed-forward layers into a complete architecture.

So far, we’ve established that self-attention can handle classification tasks but not seq2seq tasks. For classification, we simply add the positional embedding to $x$ to obtain the new embedding and append an MLP head after self-attention. However, seq2seq tasks often necessitate an encoder-decoder architecture. This raises the question of how to implement an encoder-decoder architecture using self-attention layers. To answer this, let’s examine how RNNs achieve this.

The diagram below illustrates the encoder-decoder architecture of an RNN. Notice that in the encoder section, the RNN performs the same operations as in classification tasks. However, at the last token, instead of connecting an MLP, the RNN passes the context vector to the decoder. The decoder is another RNN (with different parameters) that takes the context vector and the hidden vector from the previous time step as input. This RNN outputs a hidden vector and an output vector. The output vector goes through a softmax layer to obtain the probability of each token, and the token with the highest probability is selected. The hidden vector is fed back into the RNN, advancing it to the next time step. When the RNN outputs the <eos> token, the decoding process terminates, and we obtain a generated sequence, which might not necessarily have the same length as the input sequence.

Figure 11: RNN encoder-decoder for seq2seq. The encoder compresses the input into a context vector, which the decoder uses to generate output tokens one at a time.

In essence, the RNN’s seq2seq solution involves obtaining a representation of the entire sentence, the context vector, in the encoder stage. This representation is then passed to the decoder, which refers to it while generating each token. Moreover, due to RNN’s sequential nature, hidden vectors remain indispensable for carrying state information.

How can we draw inspiration from RNNs to design an encoder-decoder architecture using self-attention to accomplish seq2seq tasks? Firstly, the encoder must output a representation of the processed sentence. Secondly, each token position in the decoder needs to calculate a probability to select the most likely token. However, a challenge arises: how do we pass the context vector outputted by the encoder to the decoder? In RNNs, the context vector is directly concatenated with the input embedding and fed into the RNN. If we use self-attention, can we also concatenate the context vector with the embedding and pass it to self-attention? Or should we employ an additional network for information fusion? If we use self-attention, given its three inputs, to which input should we pass the context vector?

Let’s examine the official Transformer architecture:

Figure 12: The Transformer architecture. Left: encoder with self-attention. Right: decoder with masked self-attention and cross-attention. Image credit: Vaswani et al. (2017).

This diagram should be familiar to anyone who’s explored Transformer. The left side represents the encoder, while the right side represents the decoder. The multi-head attention layer is an enhancement over single-layer attention. The bottom left corner shows multi-head self-attention, the top right corner shows multi-head cross-attention, and the bottom right corner shows masked multi-head attention. The blue layers are FFNs (Feed Forward Layers), essentially MLPs consisting of several fully connected layers and activation functions like ReLU. Both the encoder and decoder can be stacked with multiple layers, as indicated by the $N \times$ notation in the diagram. Let’s analyze each component in detail.

Cross-Attention

The decoder needs to condition its output on the encoder's representation. Cross-attention lets decoder queries attend to encoder keys and values, bridging the two halves of the model.

The term “cross-attention” might seem intimidating and difficult to grasp. However, if you’ve understood self-attention, cross-attention becomes quite straightforward. Why add “self” to self-attention? Looking at the Transformer architecture diagram, you’ll notice that the data input to the attention layers in the bottom left and bottom right corners both come from the same matrix, hence “self-attention.” Conversely, the data input to the attention layer in the top right corner originates from two different matrices, hence “cross-attention.”

As shown in the diagram below, we now have two input matrices, $X_1$ and $X_2$. $X_1$ provides the linear transformation $Q$, while $X_2$ provides the linear transformations $K$ and $V$. The difference between cross-attention and self-attention is marked with “new” in the diagram.

Figure 13: Cross-attention. \(Q\) comes from \(X_1\) (decoder), while \(K\) and \(V\) come from \(X_2\) (encoder). Image credit: Sebastian Raschka.

Carefully observe the resulting matrix $Z$ and you’ll see that its number of rows is the same as matrix $X_1$. Now, consider this: if you were to design the cross-attention module in the top right corner of the Transformer architecture diagram, would you use the encoder’s context vector as $X_1$ or $X_2$? Remember that the encoder’s context vector is essentially a database (V) aggregating information from the input sequence, while each input token in the decoder is essentially a query (Q), responsible for querying the database for the most similar (and therefore most important) tokens. In this light, each row in the matrix $QK^T$ in the diagram represents the attention of a decoder input token towards all tokens in the context vector. This attention matrix is called the cross-attention matrix.
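A minimal sketch of cross-attention under the assignment discussed above: queries come from the decoder-side matrix $X_1$, keys and values from the encoder output $X_2$, and the result has as many rows as $X_1$. The parameter matrices here are random placeholders.

import torch
import torch.nn.functional as F

def cross_attention(X1, X2, W_Q, W_K, W_V):
    # X1: (n1, d_model) decoder-side input; X2: (n2, d_model) encoder output
    Q = X1 @ W_Q                                   # queries come from the decoder
    K, V = X2 @ W_K, X2 @ W_V                      # keys and values come from the encoder
    scores = Q @ K.T / K.shape[-1] ** 0.5          # (n1, n2) cross-attention matrix
    return F.softmax(scores, dim=-1) @ V           # (n1, d_v): one row per decoder token

d_model = d_k = 8
X1, X2 = torch.randn(2, d_model), torch.randn(5, d_model)   # 2 decoder tokens, 5 encoder tokens
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))
print(cross_attention(X1, X2, W_Q, W_K, W_V).shape)  # torch.Size([2, 8])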

Decoder Training and Prediction

The decoder must generate tokens one at a time during inference, but teacher forcing makes training parallelizable by feeding ground-truth tokens instead of model predictions.

In seq2seq tasks, the Transformer encoder and decoder must be trained jointly because it’s not a classification problem, and the encoder cannot be directly connected to an MLP for training. While the flow of tokens through the encoder is relatively simple, handled by self-attention, the decoder presents some challenges. Let’s first consider this question: if you were to design how to use this decoder, could you make it output all the tokens in one go? The answer is no. Generating a sequence requires an end-of-sequence token (e.g., [EOS]) to signal termination. The generation of this end token must depend on the previously generated tokens. The decision to end a sentence relies on the sentence having fulfilled its purpose. This means the generation of the last token must be conditioned on the preceding tokens. By extension, every preceding token must be conditioned on its predecessors, all the way back to the start-of-sequence token (e.g., [SOS]). This logic mirrors human speech. While we might conceive of an entire sentence in our minds, we articulate it one word at a time, with subsequent words influenced by those preceding them. This inherent logic governs how we speak. Therefore, the decoder must still predict tokens sequentially during prediction, rather than outputting them all at once. This process of sequential token output is illustrated in the diagram below. In other words, the decoder’s prediction algorithm cannot be parallelized.

Figure 14: Auto-regressive decoding. The decoder generates tokens one at a time, feeding each output back as input for the next step. Image credit: CSDN.

However, the decoder employs a clever training method called teacher forcing, where it learns under the guidance of a “teacher.” What does teacher forcing entail? Let’s illustrate with an example of teacher forcing in an RNN decoder. As shown in the diagram below, assume a seq2seq model receives the input “What do you see” and the correct output (label) is “Two people running.” The training process on the left is called free-running, while the one on the right is called teacher forcing. If you observe the bottom right corner of the Transformer architecture diagram, you’ll notice a “Shifted Right” operation applied to the Outputs. This involves shifting all input tokens one position to the right, corresponding to the teacher forcing method shown on the right side of the diagram below. In teacher forcing, all label tokens are shifted one position to the right, and a start-of-sequence token (e.g., <Start>) is placed at the beginning. The free-running decoder RNN receives its own output from the previous time step as input, which might be incorrect. In contrast, the teacher forcing decoder RNN receives the previous label token as input, which is guaranteed to be correct. This training method prevents the accumulation of errors, thereby improving training effectiveness.

Figure 15: Free-running (left) vs. teacher forcing (right). Teacher forcing feeds ground-truth tokens as input, preventing error accumulation during training. Image credit: cnblogs.

Another advantage of teacher forcing is parallelization. The diagram above depicts an RNN. Due to the presence of a hidden vector that needs to be passed sequentially, the decoder cannot be parallelized during training. However, self-attention and cross-attention do not rely on hidden vectors for state passing. Instead, they directly encode positional information through positional embeddings during the input stage. Moreover, during training, with teacher forcing and an attention mask, we can feed the entire input and label sequences directly into the decoder (consisting of masked self-attention and cross-attention) for parallel computation and training. Therefore, by employing teacher forcing, the decoder’s training algorithm becomes parallelizable.

It’s crucial to note a subtle difference between the decoder’s prediction and training phases. During prediction, suppose we have already predicted “Welcome to.” To predict “Starbucks,” the decoder needs to see all previous tokens, “Welcome to,” not just the last token “to.” Many implementations overlook this detail and fail to stack previous tokens, resulting in poor sentence generation and a decoder that never outputs the end token. During training, due to teacher forcing and the presence of an attention mask, shifting the entire sentence to the right and feeding it into the decoder achieves the goal of attending to all previous tokens.
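The prediction-time behavior can be sketched as a greedy decoding loop. Here `model`, `src`, `sos_id`, and `eos_id` are placeholders for whatever seq2seq Transformer and vocabulary are in use; the key detail is that the whole prefix, not just the last token, is fed back in at every step.

import torch

@torch.no_grad()
def greedy_decode(model, src, sos_id, eos_id, max_len=50):
    # Assumes model(src, tgt) returns logits of shape (tgt_len, vocab_size)
    # for the given source sequence and target prefix.
    tgt = torch.tensor([sos_id])                      # start-of-sequence token
    for _ in range(max_len):
        logits = model(src, tgt)                      # decoder sees the FULL prefix
        next_token = logits[-1].argmax()              # most probable next token
        tgt = torch.cat([tgt, next_token.reshape(1)]) # append and feed back in
        if next_token.item() == eos_id:               # stop at end-of-sequence
            break
    return tgt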

Masked Self-Attention

Teacher forcing feeds the entire target sequence at once, but the decoder must not peek at future tokens. A causal mask zeroes out attention to future positions, preventing data leakage.

If you think about it, simply using “Shifted Right” doesn’t fully implement teacher forcing. This is because if the attention module in the bottom right corner were unmasked self-attention, it would lead to data leakage. Let’s revisit the example from the Query, Key, Value section to see where the problem lies.

During decoding, the decoder should only have access to the tokens it has generated so far (during prediction) or the teacher tokens provided in the label up to the current time step (during training). In short, when attending to a particular token, the decoder should not be aware of any tokens appearing after it. If the decoder could attend to subsequent tokens, it would essentially be “peeking ahead” at the answers, hindering the training process.

Since self-attention relies on matrix operations, we need to employ masking to prevent this undesirable behavior. The masked attention is computed as:

\[\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T + M}{\sqrt{d_k}}\right)V, \quad M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}\]

As shown in the diagram below, we mask the upper-triangular part of the attention matrix with negative infinity ($-\infty$). Setting the attention scores of tokens the model should not attend to as $-\infty$ drives their softmax weights to 0, so no gradient flows through these positions, effectively eliminating the issue of peeking ahead.

Figure 16: Causal mask. The upper triangle is set to \(-\infty\) before softmax, preventing the decoder from attending to future tokens.
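In code, the mask is simply an upper-triangular matrix of $-\infty$ added to the scores before the softmax; a minimal sketch:

import torch
import torch.nn.functional as F

def masked_attention(Q, K, V):
    # Causal self-attention: token i may only attend to tokens j <= i.
    n, d_k = Q.shape[0], Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5
    mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)  # -inf above the diagonal
    A = F.softmax(scores + mask, dim=-1)      # masked positions receive weight 0
    return A @ V

Q = K = V = torch.randn(4, 8)
print(masked_attention(Q, K, V).shape)        # torch.Size([4, 8])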

Do we need to add an attention mask to the cross-attention layer after masked self-attention? No, because the encoder has already processed the entire input sequence and possesses all the information. Therefore, the query Q from the decoder can attend to all Ks and Vs in the context vector. In other words, we allow the decoder to see all the information in the input during both training and prediction.

Multi-Head Attention

A single attention head has limited capacity. Multiple heads let the model jointly attend to information from different representation subspaces — e.g., syntax, context, and rare words.

The parameters learned by a single-layer attention mechanism are essentially the three matrices $W^Q, W^K, W^V$. The number of these parameters is often quite small. While this might suffice for representing basic semantics, it can become a bottleneck as semantic complexity increases. Multi-head attention addresses this limitation. Let’s see how multi-head attention in the Transformer architecture diagram differs from the single-layer attention we discussed earlier.

The diagram below illustrates the structure of multi-head attention. Formally, multi-head attention is defined as:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O, \quad \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)\]

where \(W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}\), and \(W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}\). This mechanism splits the three matrices $W^Q, W^K, W^V$ into multiple smaller matrices. For instance, in a two-head attention setup, $W^Q$ is split into two smaller matrices, $W^{q_1}$ and $W^{q_2}$. Consequently, the $q$ matrix generated from $a_1$ is also split into two smaller matrices, $q_{11}$ and $q_{12}$, which we call attention heads. After obtaining multiple heads, the corresponding q, k, v heads perform single-layer attention separately, resulting in multiple outputs; with two heads, these are $b_{11}$ and $b_{12}$. The outputs from the different heads are then concatenated into a single output vector, $b_1$. Notice a detail in the diagram: $q_{21}$ is not used when calculating $b_1$. Think about why: since $b_1$ represents the latent space representation of the query $a_1$, it cannot involve the query $a_2$ (refer back to the Query, Key, Value section if this isn’t clear). Once all heads have completed their calculations, an affine matrix $W^O$ is applied to aggregate information from all heads. The shapes of the matrices are indicated in the diagram. Assuming the maximum sentence length is 256 tokens and the embedding size is 1024, the shape of $a_1$ would be (256, 1024). The shapes of other matrices are also shown in the diagram.

Figure 17: Multi-head attention. The Q, K, V matrices are split into multiple heads, each performing attention independently before concatenation.
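A compact sketch of the split-attend-concatenate procedure described above (dimensions are illustrative; PyTorch’s built-in `nn.MultiheadAttention`, used later in this article, implements the same idea):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    # Split Q, K, V into h heads, attend per head, concatenate, then mix with W^O.
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)    # aggregates information across heads

    def forward(self, x):                          # x: (n, d_model)
        n = x.shape[0]
        # reshape to (n_heads, n, d_head) so each head attends independently
        q, k, v = (proj(x).view(n, self.n_heads, self.d_head).transpose(0, 1)
                   for proj in (self.W_q, self.W_k, self.W_v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = F.softmax(scores, dim=-1) @ v        # (n_heads, n, d_head)
        out = out.transpose(0, 1).reshape(n, -1)   # concatenate heads back to (n, d_model)
        return self.W_o(out)

mha = MultiHeadSelfAttention(d_model=512, n_heads=8)
print(mha(torch.randn(10, 512)).shape)             # torch.Size([10, 512])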

Now that we understand the general idea, let’s explore the role of multi-head attention in more depth. Is a larger number of heads always better? Let’s examine the ablation study performed in the Transformer paper, where $h$ represents the number of heads:

Figure 18: Ablation study on the number of attention heads from the original Transformer paper. Best performance at \(h = 8\). Image credit: Vaswani et al. (2017).

Clearly, the best performance is achieved with $h = 8$. Increasing $h$ further (to 16 or 32) doesn’t significantly improve performance, while decreasing $h$ (to 1 or 4) tends to degrade it.

Is the sole purpose of adding heads merely to increase the number of parameters? If so, we could simply enlarge the hidden size of $W^Q, W^K, W^V$. Why achieve this by adding heads?

The following diagram visualizes how different heads influence the attention matrix. This research focuses on the encoder, with 4 encoder layers (0-3), each containing 6 heads (0-5). Each row in the diagram represents an encoder layer, and each column represents a head.

Figure 19: Attention patterns across layers (rows) and heads (columns). Different heads learn to attend to different linguistic features. Image credit: Michel et al. (2019).

We observe that within the same layer, some heads tend to focus on similar features, while others exhibit more distinct preferences. Why do different heads select different features?

During the training of the multi-head attention mechanism, differences in parameter initialization mean that $q_{11} \neq q_{12}$, and similarly $st_{111} \neq st_{121}$ and $b_{11} \neq b_{12}$. Because $b_{11}$ and $b_{12}$ are simply concatenated, the gradient flow during backpropagation is structurally symmetric for the two paths; it is the difference in initialization that leads the heads to learn different feature selection capabilities.

Research has also analyzed the specific information different heads focus on. The paper “Adaptively Sparse Transformers” suggests that heads primarily focus on three aspects: grammar, context, and rare words. Grammar-focused heads effectively suppress the output of grammatically incorrect words. Context-focused heads are responsible for sentence comprehension and tend to pay more attention to nearby words. Rare word-focused heads aim to capture important keywords in the sentence. For instance, “Starbucks” is rarer than “to” and likely carries more information. Inspired by this, some research has conducted detailed ablation studies on initialization methods, demonstrating that modifying initialization can reduce layer variance and improve training effectiveness.

Let’s ponder another question: what are the significant drawbacks of using multi-layer attention (stacking multiple single-layer attention layers) instead of multi-head attention? There are substantial parallelization limitations. Multi-head attention can be easily parallelized because different heads receive the same input and perform the same computations. In contrast, due to the stacked structure of multi-layer attention, upper layers must wait for computations in lower layers to complete before proceeding, hindering parallelization. The time complexity increases linearly with the number of layers. Therefore, from a parallelization standpoint, multi-head attention is often preferred.

In summary, multi-head attention increases the parameter capacity of the attention layer, enhances the differentiation of feature extractors, and effectively improves attention performance. While more heads aren’t always better, multiple heads generally outperform a single head. Compared to multi-layer attention, multi-head attention is more conducive to parallelization.

Feed Forward Layer

Attention is a linear re-weighting of values. The feed-forward network adds non-linearity and massively increases model capacity — it accounts for over half of Transformer's parameters.

When encountering Transformer for the first time, you might be unfamiliar with the term “Feed Forward Layer” (FFN). It’s a relatively old-fashioned term. An FFN is essentially an MLP, comprising several matrices and activation functions. In the Transformer architecture diagram, FFNs appear somewhat insignificant, represented by a thin layer. However, if you carefully analyze the parameter distribution in Transformer, you’ll realize that FFNs account for more than half of the total parameters.

The FFN used in Transformer can be expressed with the following formula, where $W_1, b_1$ are the parameters of the first fully connected layer, and $W_2, b_2$ are the parameters of the second fully connected layer. A more intuitive illustration of the FFN structure is provided in the diagram below.

\[W_2(\text{relu}(W_1x+b_1))+b_2\]
Figure 20: FFN structure: two linear layers with ReLU in between. Image credit: Towards Data Science.

This formula allows for straightforward implementation of the FFN. Here’s a PyTorch implementation, where d_model is the embedding size (512 in Transformer) and d_ff is the hidden size of the FFN (2048 in Transformer).

import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    # Position-wise feed-forward network: Linear -> ReLU -> Linear.
    def __init__(self, d_model, d_ff):
        super(FFN, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.w_2 = nn.Linear(d_ff, d_model)   # project back: d_ff -> d_model

    def forward(self, x):
        return self.w_2(F.relu(self.w_1(x)))

Now, let’s consider the role of this FFN layer. The embedding size dimension of the input tensor (512) is mapped to a larger hidden size dimension (2048), and then mapped back to the original embedding size (512) in the next layer. It’s evident that FFNs can introduce non-linearity into the model due to the ReLU activation function in between. Moreover, FFNs significantly increase model capacity by substantially increasing the number of parameters. Calculating the number of parameters in this FFN yields a surprisingly large number: 2 × 512 × 2048 = 2,097,152 (ignoring bias). In comparison, the article “How to Estimate the Number of Parameters in Transformer Models” states that the 8-head attention network proposed in the Transformer paper requires 1,050,624 parameters, calculated as follows:

Figure 21: Parameter count breakdown for multi-head attention: \(4(d_{\text{model}}^2 + d_{\text{model}})\).

You can verify this yourself using PyTorch:

import torch.nn as nn

def count_parameters(module):
    return sum(p.numel() for p in module.parameters())  # total trainable parameters

d_model = 512
n_heads = 8
multi_head_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
print(count_parameters(multi_head_attention))  # 1050624
print(4 * (d_model * d_model + d_model))  # 1050624

This means that two multi-head attention layers have roughly the same number of parameters as one FFN layer. Since the Transformer architecture includes three multi-head attention layers and two FFN layers, FFNs account for over half of the total parameters.

Residual Connections and LayerNorm

Deep networks suffer from vanishing gradients and unstable training. Residual connections preserve information flow, and layer normalization stabilizes activations — together they make stacking many layers feasible.

Observing the Transformer architecture diagram, you’ll notice numerous occurrences of “Add & Norm” modules, as shown below. In fact, after every computational layer, Transformer applies an “Add & Norm” module. “Add” refers to residual addition, where the input before a module is added to the output of that module to obtain a new vector. Mathematically, it’s expressed as:

\[y = x + f(x)\]

where \(f\) represents the function of the computational layer.

Figure 22: Add & Norm module. Residual connection followed by layer normalization.

By the time Transformer was proposed, residual connections had become a prevalent technique for mitigating vanishing gradients, first introduced in ResNet. Whether you’re familiar with computer vision or natural language processing, you’ve likely encountered ResNet. For instance, in FFNs, the ReLU function can cause roughly half of the signals to become 0 during backpropagation, leading to significant information loss. Residual connections preserve the vector information before ReLU, effectively alleviating this issue.

In “Add & Norm,” layer normalization follows residual addition. Layer normalization, which normalizes vectors, was already widely adopted when Transformer emerged. Let’s briefly compare layer normalization with batch normalization.

As shown in the diagram below, given a three-dimensional tensor (embedding size, token number, batch size), batch normalization normalizes across a batch, while layer normalization normalizes across a sequence. For example, with a batch size of 2, suppose we input two sequences: “hello” and “machine learning.” Assume the embedding of “hello” is $[4,6,0,0]$ and the embedding of “machine learning” is $[1,2,3,3]$. Batch normalization normalizes each corresponding dimension across the batch. In other words, it normalizes each token and embedding across the two sequences. This is not ideal because tokens in the two sequences don’t necessarily correspond to each other. If we use sum-to-1 normalization, “hello” would have a normalized embedding of $[0.8,0.75,0,0]$, while “machine learning” would have a normalized embedding of $[0.2,0.25,1,1]$. This disrupts the original semantic information within the embeddings. For instance, the embedding of “hello” originally had $4<6$, but after batch normalization, it becomes $0.8>0.75$, altering the semantics.

Layer normalization, on the other hand, normalizes within each sequence, where elements naturally correspond to each other. For a vector \(x \in \mathbb{R}^d\):

\[\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \quad \mu = \frac{1}{d}\sum_{i=1}^d x_i, \quad \sigma^2 = \frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2\]

where \(\gamma, \beta \in \mathbb{R}^d\) are learnable scale and shift parameters. In the example above, using layer normalization, “hello” would have a normalized embedding of $[0.4,0.6,0,0]$, while “machine learning” would have a normalized embedding of $[\frac{1}{9},\frac{2}{9},\frac{1}{3},\frac{1}{3}]$. This approach preserves the original semantics while effectively preventing gradient issues caused by excessively large values.

Figure 23: Batch normalization vs. layer normalization. Layer norm normalizes within each sequence, preserving semantic relationships. Image credit: ifwind.
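A minimal sketch of the Add & Norm wrapper, using PyTorch’s `nn.LayerNorm` for the normalization step and treating the sublayer (attention or FFN) as a callable:

import torch
import torch.nn as nn

class AddNorm(nn.Module):
    # y = LayerNorm(x + f(x)): residual addition followed by layer normalization.
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # learnable gamma and beta of size d_model

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))   # "Add" then "Norm"

d_model = 512
add_norm = AddNorm(d_model)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
print(add_norm(torch.randn(10, d_model), ffn).shape)   # torch.Size([10, 512])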
