LLM Optimization Basics: Memory

This post is inspired by EleutherAI's Transformer Math 101 and Jiayi Pan's VRAM Estimation notes. Originally drafted after discussion with Jiayi Pan and revised in March 2026 to improve the visuals.
本文受 EleutherAI 的 Transformer Math 101 与 Jiayi Pan 的 VRAM Estimation 笔记启发。最初在与 Jiayi Pan 讨论后起草,2026 年 3 月修订以改善可视化效果。

Training a large language model means fitting everything the GPU needs — weights, optimizer buffers, gradients, and intermediate computations — into a fixed amount of VRAM. Understanding what each component costs is a prerequisite for reasoning about OOM errors. This section walks through the four main consumers of GPU memory during a training step.

训练大语言模型意味着将 GPU 所需的一切——权重、优化器缓冲区、梯度和中间计算——塞入固定容量的显存中。理解每个组件的开销是分析 OOM 错误的前提。本节梳理训练步骤中 GPU 内存的四大消耗来源。

What Lives on the GPU

Figure 1. Adjust model size, optimizer, and precision to see how GPU memory is consumed. Green = fits, red = OOM. The four cards break down each component's cost.

图 1. 调整模型大小、优化器和精度,观察 GPU 内存的消耗情况。绿色 = 可容纳,红色 = OOM。四张卡片分别展示各组件的开销。

Let \(P\) denote the number of model parameters. During mixed-precision training (the standard practice for LLMs), the GPU holds four categories of data:

设 \(P\) 为模型参数量。在混合精度训练(LLM 的标准做法)中,GPU 上存放四类数据:

Model Parameters

The weights used in the forward pass. In mixed-precision training, the forward and backward passes run in half precision (fp16 or bf16), so the live weights consume:

\[\text{Model memory} = 2P \text{ bytes}\]

(2 bytes per parameter for bf16/fp16.)

前向传播中使用的权重。在混合精度训练中,前向和反向传播以半精度(fp16 或 bf16)运行,因此活跃权重占用:

\[\text{Model memory} = 2P \text{ bytes}\]

(bf16/fp16 每参数 2 字节。)

Optimizer States

The optimizer maintains its own buffers that persist across training steps. For AdamW (the standard choice for LLM training), the update rule at each step \(t\) is:

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t\] \[v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2\] \[\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \, \theta_{t-1} \right)\]

where \(\hat{m}_t, \hat{v}_t\) are bias-corrected estimates. Each of these quantities must be stored per parameter:

  • fp32 master copy of \(\theta\) — 4 bytes/param. The optimizer updates weights in fp32 for numerical stability, then casts back to bf16 for the next forward pass.
  • First moment \(m_t\) — 4 bytes/param. The exponential moving average of gradients (momentum).
  • Second moment \(v_t\) — 4 bytes/param. The exponential moving average of squared gradients (variance).
\[\text{Optimizer memory} = \underbrace{4P}_{\theta^{\text{fp32}}} + \underbrace{4P}_{m} + \underbrace{4P}_{v} = 12P \text{ bytes}\]

This is typically the single largest memory consumer for model-related storage. For a 7B parameter model: \(12 \times 7 \times 10^9 = 84\) GB just for optimizer states.

Other optimizers use less: SGD with momentum costs \(8P\) bytes (fp32 copy + momentum), and 8-bit optimizers like those in bitsandbytes reduce the moments to 1 byte each, costing \(6P\) bytes.

优化器维护自己的缓冲区,跨训练步持久存在。对于 AdamW(LLM 训练的标准选择),每步 \(t\) 的更新规则为:

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t\] \[v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2\] \[\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \, \theta_{t-1} \right)\]

其中 \(\hat{m}_t, \hat{v}_t\) 为偏差校正后的估计。以下各量均需逐参数存储:

  • \(\theta\) 的 fp32 主副本 — 每参数 4 字节。优化器以 fp32 更新权重以保证数值稳定性,然后转换回 bf16 用于下一次前向传播。
  • 一阶矩 \(m_t\) — 每参数 4 字节。梯度的指数移动平均(动量)。
  • 二阶矩 \(v_t\) — 每参数 4 字节。梯度平方的指数移动平均(方差)。
\[\text{Optimizer memory} = \underbrace{4P}_{\theta^{\text{fp32}}} + \underbrace{4P}_{m} + \underbrace{4P}_{v} = 12P \text{ bytes}\]

这通常是模型相关存储中单项最大的内存消耗者。对于 7B 参数的模型:\(12 \times 7 \times 10^9 = 84\) GB,仅优化器状态就需要这么多。

其他优化器消耗更少:带动量的 SGD 需要 \(8P\) 字节(fp32 副本 + 动量),bitsandbytes 中的 8-bit 优化器将两个矩各压缩至 1 字节,仅需 \(6P\) 字节。
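
To make these buffers concrete, here is a minimal single-tensor sketch of the AdamW update above in PyTorch. The function name and hyperparameter defaults are illustrative; a real optimizer such as `torch.optim.AdamW` fuses and batches this work, but the per-parameter storage is the same: an fp32 master weight, an fp32 \(m\), and an fp32 \(v\).

```python
import torch

def adamw_step(master_w, m, v, grad_bf16, step, lr=1e-4,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single tensor (illustrative sketch).

    master_w, m, v are fp32 and persist across steps: 4 + 4 + 4 = 12 bytes/param.
    grad_bf16 is the bf16 gradient produced by the backward pass (2 bytes/param).
    """
    g = grad_bf16.float()                           # work in fp32 for stability
    m.mul_(beta1).add_(g, alpha=1 - beta1)          # first moment (momentum)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)   # second moment (variance)
    m_hat = m / (1 - beta1 ** step)                 # bias correction
    v_hat = v / (1 - beta2 ** step)
    master_w.add_(m_hat / (v_hat.sqrt() + eps) + weight_decay * master_w,
                  alpha=-lr)                        # decoupled weight decay
    return master_w.to(torch.bfloat16)              # refresh the live bf16 copy
```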

Gradients

The gradient tensor for each parameter, computed during the backward pass. In mixed precision:

\[\text{Gradient memory} = 2P \text{ bytes}\]

(Gradients are stored in the same precision as the model — bf16/fp16.)

每个参数的梯度张量,在反向传播期间计算。在混合精度下:

\[\text{Gradient memory} = 2P \text{ bytes}\]

(梯度以与模型相同的精度存储——bf16/fp16。)

Activations

The intermediate outputs of each layer, saved during the forward pass so they can be reused in the backward pass. Unlike the previous three, activation memory scales with the data being processed — specifically with batch size, sequence length, and model depth.

For a transformer with \(n_{\text{layers}}\) layers, hidden size \(d\), attention heads \(a\), sequence length \(s\), and microbatch size \(b\) (per GPU), the activation memory without recomputation is approximately:

\[\text{Activation memory} \approx s \cdot b \cdot d \cdot n_{\text{layers}} \cdot \left(10 + \frac{24}{t} + \frac{5as}{dt}\right) \text{ bytes}\]

where \(t\) is the tensor parallelism degree (1 if no tensor parallelism). The three terms inside the parentheses correspond to three groups of saved tensors per layer, factored by whether they are split across \(t\) GPUs:

The \(10\) term — activations that are not split by tensor parallelism (full \(d\)-dimensional tensors):

| Saved tensor | Shape | Bytes (bf16) |
|---|---|---|
| Layer norm input (before self-attention) | \([s, b, d]\) | \(2sbd\) |
| Layer norm input (before MLP) | \([s, b, d]\) | \(2sbd\) |
| Self-attention output (before residual add) | \([s, b, d]\) | \(2sbd\) |
| MLP output (before residual add) | \([s, b, d]\) | \(2sbd\) |
| Dropout mask (after self-attention) | \([s, b, d]\) | \(sbd\) |
| Dropout mask (after MLP) | \([s, b, d]\) | \(sbd\) |
| Subtotal | | \(10 \cdot sbd\) |

The \(24/t\) term — activations split across \(t\) tensor-parallel GPUs (each GPU stores \(d/t\)):

| Saved tensor | Shape per GPU | Bytes (bf16) |
|---|---|---|
| Q projection output | \([s, b, d/t]\) | \(2sbd/t\) |
| K projection output | \([s, b, d/t]\) | \(2sbd/t\) |
| V projection output | \([s, b, d/t]\) | \(2sbd/t\) |
| Attention value output | \([s, b, d/t]\) | \(2sbd/t\) |
| MLP first linear output (input to GeLU) | \([s, b, 4d/t]\) | \(8sbd/t\) |
| GeLU output | \([s, b, 4d/t]\) | \(8sbd/t\) |
| Subtotal | | \(24 \cdot sbd/t\) |

The \(5as/(dt)\) term — attention matrices that scale as \(O(s^2)\):

| Saved tensor | Shape per GPU | Bytes |
|---|---|---|
| Attention scores (pre-softmax) | \([b, a/t, s, s]\) | \(2bas^2/t\) |
| Attention probabilities (post-softmax) | \([b, a/t, s, s]\) | \(2bas^2/t\) |
| Attention dropout mask | \([b, a/t, s, s]\) | \(bas^2/t\) |
| Subtotal | | \(5bas^2/t = sbd \cdot \frac{5as}{dt}\) |

This last group is why long sequences are expensive: at \(s = 8192\) with \(d = 4096\), the \(5as/(dt)\) term can exceed the other two combined.

With full activation recomputation (gradient checkpointing), only the input to each layer is saved — everything else is recomputed during the backward pass. This reduces activation memory to \(2 \cdot s \cdot b \cdot d \cdot n_{\text{layers}}\) bytes, at the cost of roughly doubling the forward-pass compute.

每层的中间输出在前向传播期间保存,以便在反向传播中复用。与前三项不同,激活内存随处理数据量变化——具体来说随批量大小、序列长度和模型深度变化。

对于具有 \(n_{\text{layers}}\) 层、隐藏大小 \(d\)、注意力头数 \(a\)、序列长度 \(s\) 和微批量大小 \(b\)(每 GPU)的 Transformer,不做重计算时的激活内存约为:

\[\text{Activation memory} \approx s \cdot b \cdot d \cdot n_{\text{layers}} \cdot \left(10 + \frac{24}{t} + \frac{5as}{dt}\right) \text{ bytes}\]

其中 \(t\) 为张量并行度(无张量并行时为 1)。括号内的三项对应每层三组保存的张量,按是否在 \(t\) 个 GPU 间切分区分:

\(10\) 项 — 未被张量并行切分的激活(完整 \(d\) 维张量):

| 保存的张量 | 形状 | 字节数 (bf16) |
|---|---|---|
| Layer norm 输入(自注意力前) | \([s, b, d]\) | \(2sbd\) |
| Layer norm 输入(MLP 前) | \([s, b, d]\) | \(2sbd\) |
| 自注意力输出(残差相加前) | \([s, b, d]\) | \(2sbd\) |
| MLP 输出(残差相加前) | \([s, b, d]\) | \(2sbd\) |
| Dropout 掩码(自注意力后) | \([s, b, d]\) | \(sbd\) |
| Dropout 掩码(MLP 后) | \([s, b, d]\) | \(sbd\) |
| 小计 | | \(10 \cdot sbd\) |

\(24/t\) 项 — 在 \(t\) 个张量并行 GPU 间切分的激活(每个 GPU 存储 \(d/t\)):

| 保存的张量 | 每 GPU 形状 | 字节数 (bf16) |
|---|---|---|
| Q 投影输出 | \([s, b, d/t]\) | \(2sbd/t\) |
| K 投影输出 | \([s, b, d/t]\) | \(2sbd/t\) |
| V 投影输出 | \([s, b, d/t]\) | \(2sbd/t\) |
| 注意力值输出 | \([s, b, d/t]\) | \(2sbd/t\) |
| MLP 第一层线性输出(GeLU 输入) | \([s, b, 4d/t]\) | \(8sbd/t\) |
| GeLU 输出 | \([s, b, 4d/t]\) | \(8sbd/t\) |
| 小计 | | \(24 \cdot sbd/t\) |

\(5as/(dt)\) 项 — 按 \(O(s^2)\) 增长的注意力矩阵:

| 保存的张量 | 每 GPU 形状 | 字节数 |
|---|---|---|
| 注意力分数(softmax 前) | \([b, a/t, s, s]\) | \(2bas^2/t\) |
| 注意力概率(softmax 后) | \([b, a/t, s, s]\) | \(2bas^2/t\) |
| 注意力 dropout 掩码 | \([b, a/t, s, s]\) | \(bas^2/t\) |
| 小计 | | \(5bas^2/t = sbd \cdot \frac{5as}{dt}\) |

最后一组是长序列昂贵的原因:当 \(s = 8192\)、\(d = 4096\) 时,\(5as/(dt)\) 项可能超过前两项之和。

使用完全激活重计算(梯度检查点),仅保存每层的输入——其余在反向传播期间重新计算。这将激活内存降至 \(2 \cdot s \cdot b \cdot d \cdot n_{\text{layers}}\) 字节,代价是前向传播计算量大约翻倍。
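
The activation formula translates directly into a small estimator. The sketch below (the function name and the example configuration are mine, not from the post's interactive figure) computes per-GPU activation memory with and without full recomputation.

```python
def activation_memory_bytes(s, b, d, n_layers, a, t=1, full_recompute=False):
    """Per-GPU activation memory for a standard transformer block, in bytes.

    s: sequence length, b: microbatch size, d: hidden size,
    a: attention heads, t: tensor-parallel degree.
    """
    if full_recompute:
        # Only each layer's input is kept; everything else is recomputed.
        return 2 * s * b * d * n_layers
    per_layer = s * b * d * (10 + 24 / t + 5 * a * s / (d * t))
    return per_layer * n_layers

# Example: an illustrative 7B-class config (32 layers, d=4096, 32 heads), s=8192, b=1.
full = activation_memory_bytes(8192, 1, 4096, 32, 32)
ckpt = activation_memory_bytes(8192, 1, 4096, 32, 32, full_recompute=True)
print(f"{full / 1e9:.1f} GB without recomputation, {ckpt / 1e9:.1f} GB with")
```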

Putting It Together

For a single GPU with no parallelism, the total training memory is:

\[\text{Total} = \underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{AdamW}} + \underbrace{2P}_{\text{grads}} + \underbrace{\text{Activations}(b, s)}_{\text{scales with data}}\]

The first three terms sum to \(16P\) bytes — fixed regardless of batch size. For a 7B model, that’s ~112 GB before any data is processed. The fourth term is the only one you can control at runtime by changing how much data you feed per forward pass.

This is where memory optimization techniques come in.

对于单 GPU 且无并行的情况,总训练内存为:

\[\text{Total} = \underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{AdamW}} + \underbrace{2P}_{\text{grads}} + \underbrace{\text{Activations}(b, s)}_{\text{scales with data}}\]

前三项之和为 \(16P\) 字节——与批量大小无关的固定开销。对于 7B 模型,在处理任何数据之前就需要约 112 GB。第四项是你在运行时唯一可以通过调整每次前向传播的数据量来控制的。

这就是内存优化技术发挥作用的地方。
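
As a quick sanity check, a few lines of Python reproduce the fixed \(16P\) figure for a 7B model and add an illustrative activation estimate; the model configuration here is assumed for illustration, not taken from any particular checkpoint.

```python
P = 7e9                        # parameters
fixed = (2 + 12 + 2) * P       # bf16 params + AdamW states + bf16 grads = 16P bytes
# Illustrative activation estimate: 32 layers, d=4096, 32 heads, s=4096, b=1, no TP,
# using the per-layer formula from the previous section (no recomputation).
s, b, d, n_layers, a = 4096, 1, 4096, 32, 32
act = s * b * d * n_layers * (10 + 24 + 5 * a * s / d)
print(f"fixed model state: {fixed / 1e9:.0f} GB, activations: {act / 1e9:.0f} GB")
```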

Optimizing GPU Memory Usage

Every memory optimization technique trades something for memory savings — there is no free lunch. The table below summarizes the main approaches, what they save, and what they cost.

每种内存优化技术都以某些代价换取内存节省——天下没有免费的午餐。下表总结了主要方法、它们节省了什么以及代价是什么。

Mixed Precision Training

What it does. Store different GPU memory residents at different precisions. The key insight is that the four memory components from the previous section have different numerical sensitivity — activations and gradients are stable in half precision, but optimizer updates are not. The standard recipe:

| Stored on GPU | Precision | Memory | Why |
|---|---|---|---|
| Model parameters | bf16 (live) + fp32 (master) | \(2P + 4P\) | Forward/backward use the bf16 copy; optimizer updates the fp32 master |
| Activations | bf16/fp16 | \(2 \cdot sbd\) per tensor | Half precision is sufficient for intermediate computations |
| Gradients | bf16/fp16 | \(2P\) | Same precision as the forward pass |
| Optimizer states (\(m_t, v_t\)) | fp32 | \(8P\) | Momentum and variance need fp32 for numerical stability |

The flow each training step: cast fp32 master weights to bf16, run forward/backward in bf16 (producing bf16 activations and gradients), pass gradients to the optimizer which updates the fp32 master weights, then refresh the bf16 copy.

The tradeoff. fp16 has a narrow dynamic range (max \(\sim 6.5 \times 10^4\)) and can cause loss spikes or divergence without careful loss scaling. bf16 (available on Ampere+ GPUs) has the same exponent range as fp32, making it more stable but slightly less precise in the mantissa. Either way, the optimizer states remain in fp32 — so mixed precision saves on the live parameters and gradients (\(4P\) bytes saved) but not on the dominant \(12P\) optimizer cost.

\(\text{Mixed precision savings: } 4P \text{ bytes (params + grads halved)}\)

功能。 以不同精度存储 GPU 上的不同数据。关键洞察是上节的四个内存组件对数值精度的敏感度不同——激活和梯度在半精度下稳定,但优化器更新则不然。标准方案:

| GPU 上存储的内容 | 精度 | 内存 | 原因 |
|---|---|---|---|
| 模型参数 | bf16(活跃)+ fp32(主副本) | \(2P + 4P\) | 前向/反向使用 bf16 副本;优化器更新 fp32 主副本 |
| 激活值 | bf16/fp16 | 每张量 \(2 \cdot sbd\) | 半精度足以应对中间计算 |
| 梯度 | bf16/fp16 | \(2P\) | 与前向传播精度一致 |
| 优化器状态 (\(m_t, v_t\)) | fp32 | \(8P\) | 动量和方差需要 fp32 以保证数值稳定性 |

每个训练步的流程:将 fp32 主权重转换为 bf16,以 bf16 运行前向/反向(产生 bf16 激活和梯度),将梯度传给优化器更新 fp32 主权重,然后刷新 bf16 副本。

代价。 fp16 动态范围窄(最大约 \(6.5 \times 10^4\)),不仔细做损失缩放就可能引起 loss 尖峰或发散。bf16(Ampere+ GPU 可用)与 fp32 具有相同的指数范围,更稳定但尾数精度略低。无论哪种,优化器状态仍为 fp32——因此混合精度节省了活跃参数和梯度(节省 \(4P\) 字节),但无法触及占主导地位的 \(12P\) 优化器开销。

\(\text{Mixed precision savings: } 4P \text{ bytes (params + grads halved)}\)
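
Below is a minimal hand-written sketch of the bf16 flow described above, using a single linear layer so the bookkeeping is visible. In practice a framework (torch.amp, Megatron-LM, DeepSpeed) manages the master copy and the casts for you; the structure here is only illustrative.

```python
import torch

# Live bf16 weights used by forward/backward; fp32 master copy owned by the optimizer.
model = torch.nn.Linear(4096, 4096).to(torch.bfloat16)
master = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]
opt = torch.optim.AdamW(master, lr=1e-4)          # master, m, v all live in fp32

def step(x_bf16, y_bf16):
    loss = torch.nn.functional.mse_loss(model(x_bf16), y_bf16)  # forward in bf16
    loss.backward()                                             # bf16 gradients
    for mp, p in zip(master, model.parameters()):
        mp.grad = p.grad.float()           # hand bf16 grads to the fp32 master
    opt.step()                             # update happens entirely in fp32
    opt.zero_grad(); model.zero_grad(set_to_none=True)
    with torch.no_grad():
        for mp, p in zip(master, model.parameters()):
            p.copy_(mp.to(torch.bfloat16))  # refresh the live bf16 copy
    return loss.item()
```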

Gradient Checkpointing (Activation Recomputation)

What it does. Instead of saving all intermediate activations during the forward pass, save only the input to each transformer layer. During the backward pass, recompute each layer’s intermediates from the saved input before computing gradients.

The tradeoff. Activation memory drops from \(sbd \cdot n_{\text{layers}} \cdot (10 + 24/t + 5as/(dt))\) to just \(2sbd \cdot n_{\text{layers}}\) — typically a 5-10x reduction. The cost is roughly doubling the forward-pass compute, since every layer’s forward pass runs twice (once during forward, once during backward). Wall-clock training time increases by ~30-40% in practice (backward is more expensive than forward, so the extra forward pass is a smaller fraction of total time).

功能。 在前向传播中不保存所有中间激活,而仅保存每个 Transformer 层的输入。在反向传播中,从保存的输入重新计算每层的中间值,然后再计算梯度。

代价。 激活内存从 \(sbd \cdot n_{\text{layers}} \cdot (10 + 24/t + 5as/(dt))\) 降至仅 \(2sbd \cdot n_{\text{layers}}\)——通常减少 5-10 倍。代价是前向计算量大约翻倍,因为每层的前向传播运行两次(前向阶段一次,反向阶段一次)。实际壁钟训练时间增加约 30-40%(反向比前向更耗时,所以额外的前向传播在总时间中占比较小)。
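
A sketch of what this looks like in PyTorch, using the standard `torch.utils.checkpoint.checkpoint` API on a toy residual block (the model itself is illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d))
    def forward(self, x):
        return x + self.ff(x)

class Model(torch.nn.Module):
    def __init__(self, d=1024, n_layers=8, use_checkpointing=True):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(d) for _ in range(n_layers))
        self.use_checkpointing = use_checkpointing
    def forward(self, x):
        for blk in self.blocks:
            if self.use_checkpointing and self.training:
                # Only the block input is saved; the block's intermediates are
                # recomputed during the backward pass.
                x = checkpoint(blk, x, use_reentrant=False)
            else:
                x = blk(x)
        return x
```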

Figure 2. Adjust layers, sequence length, batch size, and hidden dim to see per-layer activation breakdown. Toggle gradient checkpointing to compare memory savings. Note how attention scores scale as O(s²).

图 2. 调整层数、序列长度、批量大小和隐藏维度,查看每层激活分布。切换梯度检查点以比较内存节省。注意注意力分数如何按 O(s²) 增长。

Microbatching and Gradient Accumulation

Consider a training step where one GPU receives \(N\) sequences of varying lengths \(\{l_1, l_2, \ldots, l_N\}\), totaling \(L = \sum_i l_i\) tokens. The GPU must compute forward and backward passes over all \(N\) sequences before a single optimizer step.

The naive approach processes everything at once: pack all \(N\) sequences into a single tensor, forward pass, backward pass, optimizer step. But this requires holding all activations for \(N\) sequences simultaneously — which can easily exceed GPU memory even when the fixed costs (parameters + optimizer + gradients) fit comfortably.

Gradient accumulation solves this: split the \(N\) sequences into \(K\) microbatches, process each sequentially, accumulate gradients, then step.

\[\nabla_\theta \mathcal{L} = \sum_{k=1}^{K} \nabla_\theta \mathcal{L}_k\]

The final gradient is identical to the single-batch case. The optimizer sees the same update. But memory usage is dramatically different: activations are created and destroyed per microbatch. Only one microbatch’s activations exist at a time. So peak memory is determined by the largest microbatch, not the total batch.

考虑一个训练步:一个 GPU 接收 \(N\) 条变长序列 \(\{l_1, l_2, \ldots, l_N\}\),总计 \(L = \sum_i l_i\) 个 token。GPU 必须对所有 \(N\) 条序列完成前向和反向传播,才能执行一次优化器步。

朴素方法一次处理所有内容:将 \(N\) 条序列打包为一个张量,前向传播,反向传播,优化器步。但这要求同时持有 \(N\) 条序列的所有激活——即使固定开销(参数 + 优化器 + 梯度)能舒适地放入显存,这也很容易超标。

梯度累积解决了这个问题:将 \(N\) 条序列分成 \(K\) 个微批量,依次处理,累积梯度,然后更新。

\[\nabla_\theta \mathcal{L} = \sum_{k=1}^{K} \nabla_\theta \mathcal{L}_k\]

最终梯度与单批次情况完全一致。优化器看到的更新相同。但内存使用截然不同:激活在每个微批量中创建和销毁。任一时刻只有一个微批量的激活存在。因此峰值内存由最大的微批量决定,而非总批量。
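
A minimal accumulation loop might look like the following sketch; `model`, `optimizer`, and `loss_fn` are assumed to exist, and each microbatch loss is scaled by \(1/K\) so the accumulated gradient matches the mean-reduced full-batch gradient.

```python
def train_step(model, optimizer, loss_fn, microbatches):
    """One optimizer step over K microbatches (gradient accumulation sketch)."""
    optimizer.zero_grad(set_to_none=True)
    K = len(microbatches)
    total = 0.0
    for inputs, targets in microbatches:
        loss = loss_fn(model(inputs), targets) / K
        loss.backward()          # gradients accumulate in each parameter's .grad
        total += loss.item()     # this microbatch's activations are freed here
    optimizer.step()             # a single update for the whole batch
    return total
```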

Figure 3. Animation of gradient accumulation. Watch how activations appear during each forward pass and are consumed during backward, while gradients accumulate across microbatches. Only one microbatch's activations exist at a time.

图 3. 梯度累积动画。观察激活如何在每次前向传播中出现、在反向传播中消耗,而梯度在各微批量间累积。任一时刻只有一个微批量的激活存在。

A bin-packing algorithm (First-Fit Decreasing) groups the \(N\) sequences into microbatches, where each microbatch’s total token count must not exceed max_tokens_per_mb.

Example: 84 sequences averaging 5000 tokens each (420K total tokens) on one GPU.

| max_tokens_per_mb | Microbatches | Seqs/MB | Peak activation memory |
|---|---|---|---|
| 65536 | ~7 | ~12 | High: 12 sequences x padded length |
| 16384 | ~26 | ~3 | Low: 3 sequences x padded length |
| 8192 | ~52 | ~1-2 | Minimal: 1-2 sequences x padded length |

With max_tokens_per_mb=65536, each microbatch holds ~12 sequences. The forward pass must store activations for all 12 simultaneously. With 16384, only ~3 sequences are processed at once — 4x less activation memory.

The padding tax. There is a subtlety that makes larger microbatches even worse: within a microbatch, sequences are padded to the length of the longest sequence in that group, because GPU tensor operations need rectangular shapes. Consider a microbatch with 3 sequences of lengths [3000, 5000, 8000]. All three are padded to 8000:

\[\text{Effective tokens} = 3 \times 8000 = 24000 \quad \text{(vs actual 16000)}\]

That's 8000 padding tokens, a 50% overhead on top of the 16000 real tokens, spent on compute and memory that do no useful work. Larger microbatches are more likely to contain outlier-length sequences, making the padding problem worse. Smaller max_tokens_per_mb means fewer sequences per group, so the longest sequence in each group is closer to the average: less padding waste, lower peak memory.

The tradeoff. Smaller microbatches reduce peak activation memory and padding waste, but increase the number of forward/backward passes per step — adding kernel launch overhead and reducing GPU utilization. Extremely small microbatches (1-2 sequences) also underutilize the GPU’s parallel compute units.

装箱算法(首次适应递减法)将 \(N\) 条序列分组为微批量,每个微批量的总 token 数不超过 max_tokens_per_mb

示例:一个 GPU 上 84 条序列,平均每条 5000 个 token(共 420K token)。

| max_tokens_per_mb | 微批量数 | 每微批量序列数 | 峰值激活内存 |
|---|---|---|---|
| 65536 | ~7 | ~12 | 高:12 条序列 x 填充长度 |
| 16384 | ~26 | ~3 | 低:3 条序列 x 填充长度 |
| 8192 | ~52 | ~1-2 | 最小:1-2 条序列 x 填充长度 |

max_tokens_per_mb=65536 时,每个微批量容纳约 12 条序列。前向传播必须同时存储 12 条序列的激活。用 16384 时,一次仅处理约 3 条序列——激活内存减少 4 倍

填充税。 一个微妙之处使大微批量更为不利:微批量内的序列会填充到该组中最长序列的长度,因为 GPU 张量运算需要矩形形状。考虑一个包含 3 条序列、长度为 [3000, 5000, 8000] 的微批量。三条均填充到 8000:

\[\text{Effective tokens} = 3 \times 8000 = 24000 \quad \text{(vs actual 16000)}\]

50% 的计算和内存浪费在填充 token 上。更大的微批量更可能包含离群长度的序列,使填充问题恶化。更小的 max_tokens_per_mb 意味着每组序列更少,最长序列更接近平均值——填充浪费更少,峰值内存更低。

代价。 更小的微批量减少了峰值激活内存和填充浪费,但增加了每步的前向/反向传播次数——增加内核启动开销并降低 GPU 利用率。极小的微批量(1-2 条序列)也无法充分利用 GPU 的并行计算单元。
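
The sketch below implements the packing rule described above (sort sequences by length, place each into the first microbatch whose total token count stays under budget) and measures the resulting padding waste; the sequence lengths are randomly generated for illustration.

```python
import random

def pack_microbatches(seq_lens, max_tokens_per_mb):
    """First-Fit Decreasing: each microbatch's total token count must not
    exceed max_tokens_per_mb (the rule described above)."""
    bins = []  # each bin is a list of sequence lengths
    for length in sorted(seq_lens, reverse=True):
        for mb in bins:
            if sum(mb) + length <= max_tokens_per_mb:
                mb.append(length)
                break
        else:
            bins.append([length])   # no existing microbatch fits: open a new one
    return bins

# Roughly matching the table above: 84 sequences of a few thousand tokens each.
random.seed(0)
lens = [random.randint(3000, 8000) for _ in range(84)]
for budget in (65536, 16384, 8192):
    mbs = pack_microbatches(lens, budget)
    # Padding waste: every sequence is padded to the longest one in its microbatch.
    waste = sum(len(mb) * max(mb) - sum(mb) for mb in mbs)
    print(f"budget={budget:6d}  microbatches={len(mbs):3d}  "
          f"padding waste={waste / sum(lens):.0%} of real tokens")
```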

Figure 4. Drag the slider to change max_tokens_per_mb and see how sequences are packed into microbatches. Purple = actual tokens, pink = padding waste. The red-outlined microbatch determines peak activation memory.

图 4. 拖动滑块改变 max_tokens_per_mb,观察序列如何被打包为微批量。紫色 = 实际 token,粉色 = 填充浪费。红框微批量决定峰值激活内存。

Optimizer State Compression

What it does. Replace the fp32 optimizer states with lower-precision versions. 8-bit Adam (e.g., bitsandbytes) quantizes the first and second moments to int8, reducing optimizer memory from \(12P\) to \(6P\) bytes. Adafactor replaces the full second moment with factored row and column statistics, reaching roughly \(6P\) bytes through a different approximation.

The tradeoff. 8-bit Adam introduces quantization noise into the moment estimates. In practice, dynamic quantization with block-wise scaling preserves most of the convergence properties of full-precision Adam. Adafactor can diverge on some tasks and requires more careful hyperparameter tuning. SGD with momentum costs \(8P\) bytes but converges more slowly and with less stability on large-scale language model training.

功能。 用低精度版本替代 fp32 优化器状态。8-bit Adam(如 bitsandbytes)将一阶和二阶矩量化为 int8,将优化器内存从 \(12P\) 降至 \(6P\) 字节。Adafactor 用分解出的行、列统计量近似替代完整的二阶矩,以另一种近似方式降至约 \(6P\) 字节。

代价。 8-bit Adam 在矩估计中引入量化噪声。实践中,带分块缩放的动态量化保留了全精度 Adam 的大部分收敛特性。Adafactor 在某些任务上可能发散,需要更谨慎的超参调节。带动量的 SGD 需要 \(8P\) 字节,但在大规模语言模型训练中收敛更慢且稳定性更差。
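
If you want to try the 8-bit route, the swap is typically a one-liner; this sketch assumes the bitsandbytes package with its `AdamW8bit` optimizer class and a CUDA device.

```python
import torch
import bitsandbytes as bnb   # assumes bitsandbytes is installed

model = torch.nn.Linear(4096, 4096).cuda()

# Drop-in replacement for torch.optim.AdamW: m and v are kept in int8 with
# block-wise scaling, cutting optimizer memory from 12P to roughly 6P bytes.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4, weight_decay=0.01)
```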

Parameter-Efficient Fine-Tuning (LoRA / QLoRA)

What it does. Freeze the base model weights and train only small low-rank adapter matrices. LoRA adds rank-\(r\) matrices \(A \in \mathbb{R}^{d \times r}\) and \(B \in \mathbb{R}^{r \times d}\) to each target layer, with \(r \ll d\). Trainable parameters drop to \(\sim 1\text{-}2\%\) of the original. QLoRA goes further: quantize the frozen base model to 4-bit (0.5 bytes/param), reducing parameter memory from \(2P\) to \(0.5P\).

The tradeoff. Optimizer states, gradients, and activations now scale with the adapter size, not the full model — massive memory savings. The cost is reduced expressiveness: the adapter can only learn changes within the low-rank subspace. For many fine-tuning tasks this is sufficient, but for pretraining or large distribution shifts, full-rank updates are necessary.

\(\text{QLoRA memory} \approx \underbrace{0.5P}_{\text{4-bit base}} + \underbrace{12 \cdot 0.02P}_{\text{adapter optimizer}} + \underbrace{2 \cdot 0.02P}_{\text{adapter grads}} \approx 0.78P \text{ bytes}\)

功能。 冻结基座模型权重,仅训练小型低秩适配器矩阵。LoRA 在每个目标层添加秩-\(r\) 矩阵 \(A \in \mathbb{R}^{d \times r}\) 和 \(B \in \mathbb{R}^{r \times d}\),其中 \(r \ll d\)。可训练参数降至原始的约 \(1\text{-}2\%\)。QLoRA 更进一步:将冻结的基座模型量化为 4-bit(每参数 0.5 字节),将参数内存从 \(2P\) 降至 \(0.5P\)。

代价。 优化器状态、梯度和激活现在随适配器大小而非完整模型缩放——内存大幅节省。代价是表达能力降低:适配器只能学习低秩子空间内的变化。对于许多微调任务这已足够,但对于预训练或大的分布偏移,需要全秩更新。

\(\text{QLoRA memory} \approx \underbrace{0.5P}_{\text{4-bit base}} + \underbrace{12 \cdot 0.02P}_{\text{adapter optimizer}} + \underbrace{2 \cdot 0.02P}_{\text{adapter grads}} \approx 0.78P \text{ bytes}\)
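
A minimal LoRA adapter is easy to write by hand; the sketch below wraps a frozen `nn.Linear` and follows the common initialization (Gaussian \(A\), zero \(B\)). The class name and defaults are mine, not from any particular library.

```python
import torch

class LoRALinear(torch.nn.Module):
    """Minimal LoRA adapter around a frozen linear layer (illustrative sketch).

    Only A and B are trainable, so gradients and optimizer states scale with
    2*d*r parameters instead of the full d*d weight.
    """
    def __init__(self, base: torch.nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the base weights
        d_in, d_out = base.in_features, base.out_features
        self.A = torch.nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(r, d_out))  # zero init: delta starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```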

Offloading

What it does. Move optimizer states or parameters to CPU RAM (or even NVMe storage) and swap them to GPU only when needed. DeepSpeed ZeRO-Offload and ZeRO-Infinity implement this transparently.

The tradeoff. PCIe bandwidth between CPU and GPU is 1-2 orders of magnitude slower than GPU memory bandwidth. Offloading trades training speed for the ability to train models that would otherwise not fit at all. It is a last resort, useful for training very large models on limited GPU hardware.

功能。 将优化器状态或参数移至 CPU 内存(甚至 NVMe 存储),仅在需要时交换到 GPU。DeepSpeed ZeRO-Offload 和 ZeRO-Infinity 透明地实现了这一点。

代价。 CPU 与 GPU 之间的 PCIe 带宽比 GPU 内存带宽慢 1-2 个数量级。卸载以训练速度换取训练原本无法放入显存的模型的能力。这是最后手段,适用于在有限 GPU 硬件上训练超大模型。
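
For reference, a ZeRO-Offload setup is usually expressed as a DeepSpeed config; the sketch below shows the general shape of such a config with placeholder batch sizes. The exact keys should be checked against the DeepSpeed documentation for your version.

```python
# Sketch of a DeepSpeed ZeRO-Offload configuration: ZeRO-2 sharding with the
# optimizer states (and their updates) pushed to CPU memory. Values are
# placeholders; consult the DeepSpeed docs before use.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```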

Summary

| Technique | Memory saved | Cost |
|---|---|---|
| Mixed precision (bf16) | \(4P\) bytes | Slight precision loss |
| Gradient checkpointing | ~5-10x activation reduction | ~30-40% more compute |
| 8-bit optimizer | \(6P\) bytes | Quantization noise |
| QLoRA | ~\(15P\) bytes vs full fine-tuning | Reduced expressiveness |
| CPU offloading | Moves optimizer to RAM | Major speed reduction |
| Microbatching | Controls peak activation memory | More kernel launches, lower GPU utilization |

None of these are free — and all of the above are single-GPU techniques. When the model still doesn’t fit, or when you need to scale to dozens or hundreds of GPUs, parallelism strategies distribute the memory (and compute) across devices. The next section covers these.

| 技术 | 节省的内存 | 代价 |
|---|---|---|
| 混合精度 (bf16) | \(4P\) 字节 | 轻微精度损失 |
| 梯度检查点 | 激活减少约 5-10 倍 | 计算量增加约 30-40% |
| 8-bit 优化器 | \(6P\) 字节 | 量化噪声 |
| QLoRA | 相比全参微调节省约 \(15P\) 字节 | 表达能力降低 |
| CPU 卸载 | 将优化器移至内存 | 速度大幅下降 |
| 微批量 | 控制峰值激活内存 | 内核启动次数增多、GPU 利用率降低 |

这些方法都不是免费的——而且以上全部是单 GPU 技术。当模型仍然放不下,或者需要扩展到数十乃至数百个 GPU 时,并行策略将内存(和计算)分布到多个设备上。下一节介绍这些策略。

Parallelism Strategies

The single-GPU optimizations above can only go so far. A 70B model requires \(16 \times 70 \times 10^9 = 1120\) GB for parameters + optimizer + gradients alone — no single GPU comes close. Parallelism distributes memory and compute across multiple devices. The four main strategies are complementary and are typically combined in practice.

Recall from What Lives on the GPU that total per-GPU memory is:

\[\underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{optimizer}} + \underbrace{2P}_{\text{grads}} + \underbrace{\text{Act}(b, s)}_{\text{activations}} = 16P + \text{Act}\]

Each parallelism strategy targets different terms in this equation.

上述单 GPU 优化能做的有限。70B 模型仅参数 + 优化器 + 梯度就需要 \(16 \times 70 \times 10^9 = 1120\) GB——没有任何单卡能接近这一需求。并行化将内存和计算分布到多个设备上。四种主要策略互补,实践中通常组合使用。

回顾GPU 上存放了什么,每 GPU 总内存为:

\[\underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{optimizer}} + \underbrace{2P}_{\text{grads}} + \underbrace{\text{Act}(b, s)}_{\text{activations}} = 16P + \text{Act}\]

每种并行策略针对这个等式中的不同项。

Data Parallelism (DP)

Idea. Replicate the entire model on each of \(N\) GPUs. Each GPU processes a different data shard, computes gradients locally, then all-reduces gradients before the optimizer step. The result is mathematically identical to single-GPU training with \(N \times\) the batch size.

Per-GPU memory:

\[\underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{optimizer}} + \underbrace{2P}_{\text{grads}} + \text{Act}(b/N, s)\]

DP does not reduce model-related memory — every GPU still holds the full \(16P\) bytes. It only reduces activations by shrinking each GPU’s local batch from \(b\) to \(b/N\).

Communication. One all-reduce of gradients (\(2P\) bytes) per step, which can be overlapped with backward computation using bucketed gradient all-reduce (as in PyTorch DDP).

When to use. When the model fits on a single GPU but you want higher throughput. DP is the simplest and most efficient form of parallelism — always the first thing to try.

思路。 在 \(N\) 个 GPU 上各复制一份完整模型。每个 GPU 处理不同的数据分片,在本地计算梯度,然后在优化器步之前 all-reduce 梯度。结果在数学上等价于批量大小为 \(N\) 倍的单 GPU 训练。

每 GPU 内存:

\[\underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{optimizer}} + \underbrace{2P}_{\text{grads}} + \text{Act}(b/N, s)\]

DP 不会减少模型相关内存——每个 GPU 仍持有完整的 \(16P\) 字节。它仅通过将每个 GPU 的本地批量从 \(b\) 缩小到 \(b/N\) 来减少激活。

通信。 每步一次梯度 all-reduce(\(2P\) 字节),可使用分桶梯度 all-reduce(如 PyTorch DDP)与反向计算重叠。

适用场景。 当模型能放入单 GPU 但需要更高吞吐时。DP 是最简单、最高效的并行形式——总是首先尝试的方案。
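
A minimal DDP setup sketch (assuming a `torchrun` launch that sets the usual environment variables; the model is a placeholder):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with torchrun so RANK / LOCAL_RANK / WORLD_SIZE are set.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()
# DDP replicates the weights on every rank and bucket-all-reduces gradients
# during backward, overlapping communication with compute.
model = DDP(model, device_ids=[local_rank])
# Pair with a DistributedSampler so each rank sees a disjoint data shard.
```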

ZeRO / FSDP (Sharded Data Parallelism)

The key insight behind ZeRO (Zero Redundancy Optimizer) is that vanilla DP is wasteful: every GPU stores an identical copy of optimizer states, gradients, and parameters. ZeRO shards these across \(N\) data-parallel GPUs in three progressive stages:

| Stage | What is sharded | Per-GPU memory | Communication per step |
|---|---|---|---|
| ZeRO-1 | Optimizer states | \(2P + 2P + 12P/N + \text{Act}\) | Same as DP (gradient all-reduce) |
| ZeRO-2 | Optimizer states + gradients | \(2P + 2P/N + 12P/N + \text{Act}\) | Reduce-scatter gradients (similar cost to all-reduce) |
| ZeRO-3 | Optimizer states + gradients + parameters | \(2P/N + 2P/N + 12P/N + \text{Act}\) | All-gather params before each forward/backward layer |

ZeRO-1 is nearly free: sharding optimizer states across \(N\) GPUs reduces the dominant \(12P\) term to \(12P/N\), with no extra communication beyond the standard gradient all-reduce. For a 7B model on 8 GPUs, optimizer memory drops from 84 GB to ~10.5 GB per GPU.

ZeRO-2 additionally shards gradients. Each GPU only stores gradients for its shard’s parameters, then reduce-scatters (instead of all-reducing) so each rank accumulates only the gradients it needs. Communication cost is similar to all-reduce.

ZeRO-3 (equivalent to PyTorch FSDP) shards everything — the full \(16P\) becomes \(16P/N\) per GPU. The cost: parameters must be all-gathered before each layer’s forward and backward pass, and freed immediately after. This turns every layer into a communication event.

\[\text{ZeRO-3 per-GPU model memory} = \frac{16P}{N} \text{ bytes}\]

Practical notes. EleutherAI reports that ZeRO-3 is “too communication-heavy at large scales” and prefers ZeRO-1 combined with tensor and pipeline parallelism. ZeRO-1 is the default for most training runs because it targets the largest memory consumer (optimizer states) with minimal overhead. ZeRO-3/FSDP shines when GPU count is moderate and interconnect is fast (e.g., 8 GPUs within a single node on NVLink).

ZeRO(零冗余优化器)的核心洞察是:普通 DP 是浪费的——每个 GPU 存储了一份完全相同的优化器状态、梯度和参数副本。ZeRO 将这些在 \(N\) 个数据并行 GPU 间分片,分三个递进阶段:

| 阶段 | 分片内容 | 每 GPU 内存 | 每步通信 |
|---|---|---|---|
| ZeRO-1 | 优化器状态 | \(2P + 2P + 12P/N + \text{Act}\) | 与 DP 相同(梯度 all-reduce) |
| ZeRO-2 | 优化器状态 + 梯度 | \(2P + 2P/N + 12P/N + \text{Act}\) | Reduce-scatter 梯度(开销与 all-reduce 相近) |
| ZeRO-3 | 优化器状态 + 梯度 + 参数 | \(2P/N + 2P/N + 12P/N + \text{Act}\) | 每层前向/反向前 all-gather 参数 |

ZeRO-1 几乎免费:跨 \(N\) 个 GPU 分片优化器状态将主导项 \(12P\) 降至 \(12P/N\),无需标准梯度 all-reduce 之外的额外通信。对于 8 GPU 上的 7B 模型,优化器内存从 84 GB 降至每 GPU 约 10.5 GB。

ZeRO-2 额外分片梯度。每个 GPU 仅存储其分片参数的梯度,然后 reduce-scatter(而非 all-reduce),使每个 rank 仅累积自己需要的梯度。通信开销与 all-reduce 相近。

ZeRO-3(等价于 PyTorch FSDP)分片一切——完整的 \(16P\) 变为每 GPU \(16P/N\)。代价是:参数必须在每层的前向和反向传播前 all-gather,之后立即释放。这使每层都成为一次通信事件。

\[\text{ZeRO-3 per-GPU model memory} = \frac{16P}{N} \text{ bytes}\]

实践说明。 EleutherAI 指出 ZeRO-3 在大规模下"通信开销过重",更倾向于 ZeRO-1 结合张量并行和流水线并行。ZeRO-1 是大多数训练的默认选择,因为它以最小开销瞄准最大内存消耗者(优化器状态)。ZeRO-3/FSDP 在 GPU 数量适中且互联快速时表现最佳(如单节点内 8 个 NVLink GPU)。
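
Wrapping a model in FSDP is the PyTorch-native way to get ZeRO-3-style sharding; the sketch below shows the basic pattern, with `MyTransformer` as a placeholder module and the per-block auto-wrap policy omitted for brevity.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Parameters, gradients, and optimizer states are sharded across the
# data-parallel group; each layer's weights are all-gathered just before its
# forward/backward pass and freed immediately afterwards.
dist.init_process_group(backend="nccl")
model = MyTransformer().cuda()     # MyTransformer is a placeholder module
model = FSDP(model)                # in practice, pass an auto-wrap policy per block
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```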

Tensor Parallelism (TP)

Idea. Split individual weight matrices across \(t\) GPUs so that each GPU computes a slice of every layer. For a linear layer \(Y = XW\), the weight \(W\) is column- or row-split, each GPU computes its portion, and the results are combined via all-reduce or all-gather.

Per-GPU memory:

\[\frac{2P}{t} + \frac{12P}{t} + \frac{2P}{t} + \text{Act}(b, s, t) = \frac{16P}{t} + \text{Act}\]

where activations are also partially reduced — the \(24/t\) and \(5as/(dt)\) terms in the activation formula reflect this splitting.

Communication. Two all-reduce operations per transformer layer (one in the attention block, one in the MLP), each communicating \(O(bsd)\) activation tensors. This happens on the critical path — computation cannot proceed until the all-reduce completes. This is why TP requires NVLink (~900 GB/s) rather than PCIe (~64 GB/s) or network interconnects.

Typical TP degree. TP is usually set to the number of GPUs within a single node (e.g., \(t = 8\) for an 8-GPU node with NVLink). Going beyond a node boundary is impractical because the inter-node bandwidth is too low for the frequent all-reduces.

思路。 将各权重矩阵切分到 \(t\) 个 GPU 上,使每个 GPU 计算每层的一个切片。对于线性层 \(Y = XW\),权重 \(W\) 按列或行切分,每个 GPU 计算其部分,结果通过 all-reduce 或 all-gather 合并。

每 GPU 内存:

\[\frac{2P}{t} + \frac{12P}{t} + \frac{2P}{t} + \text{Act}(b, s, t) = \frac{16P}{t} + \text{Act}\]

激活也被部分切分——激活公式中的 \(24/t\) 和 \(5as/(dt)\) 项反映了这一切分。

通信。 每个 Transformer 层两次 all-reduce(注意力块一次,MLP 一次),每次传输 \(O(bsd)\) 的激活张量。这发生在关键路径上——计算在 all-reduce 完成前无法继续。这就是 TP 需要 NVLink(约 900 GB/s)而非 PCIe(约 64 GB/s)或网络互联的原因。

典型 TP 度数。 TP 通常设为单节点内的 GPU 数量(如 8-GPU NVLink 节点上 \(t = 8\))。跨节点不切实际,因为节点间带宽对于频繁的 all-reduce 而言太低。
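
To illustrate the mechanics, here is a forward-only sketch of a column-split linear layer; real implementations (e.g., Megatron-LM) wrap the collectives in custom autograd functions and pair column- and row-parallel layers so only one all-reduce is needed per block. The class and argument names are mine.

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Illustrative column-split linear layer: Y = X W with W split by columns
    across t tensor-parallel ranks. Forward-only sketch; gradients through the
    collective need an autograd-aware wrapper in a real implementation.
    """
    def __init__(self, d_in, d_out, tp_group):
        super().__init__()
        self.tp_group = tp_group
        t = dist.get_world_size(tp_group)
        self.weight = torch.nn.Parameter(torch.empty(d_in, d_out // t))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        y_local = x @ self.weight                     # [s, b, d_out / t] on this rank
        parts = [torch.empty_like(y_local)
                 for _ in range(dist.get_world_size(self.tp_group))]
        dist.all_gather(parts, y_local, group=self.tp_group)
        return torch.cat(parts, dim=-1)               # [s, b, d_out]
```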

Pipeline Parallelism (PP)

Idea. Partition the model's \(n_{\text{layers}}\) layers into \(p\) stages, assigning consecutive layers to different GPUs. GPU 1 runs layers \(1\) through \(n_{\text{layers}}/p\), GPU 2 runs layers \(n_{\text{layers}}/p + 1\) through \(2n_{\text{layers}}/p\), and so on.

Per-GPU memory:

\[\frac{2P}{p} + \frac{12P}{p} + \frac{2P}{p} + \text{Act} = \frac{16P}{p} + \text{Act}\]

Each GPU only stores parameters, optimizer states, and gradients for its assigned layers. Activation memory depends on the schedule (see below).

Communication. Point-to-point sends of activation tensors (\(O(bsd)\)) between adjacent stages — much less volume than TP’s all-reduces, and tolerant of lower-bandwidth interconnects.

The bubble problem. With naive sequential execution, only one stage is active at a time — the other \(p - 1\) GPUs are idle. The pipeline bubble is the fraction of time wasted:

\[\text{Bubble fraction} = \frac{p - 1}{m + p - 1}\]

where \(m\) is the number of microbatches. To keep the bubble small (say, < 5%), you need \(m \gg p\). GPipe and PipeDream use different strategies:

  • GPipe: Runs all \(m\) microbatch forward passes, then all \(m\) backward passes. Simple, but all activations for all microbatches must be held simultaneously, increasing memory by a factor of \(m\).
  • 1F1B (PipeDream): Interleaves forward and backward passes so that each GPU holds activations for at most \(p\) microbatches (instead of \(m\)). This significantly reduces activation memory at the cost of more complex scheduling.

思路。 将模型的 \(n_{\text{layers}}\) 层划分为 \(p\) 个阶段,将连续层分配到不同 GPU。GPU 1 运行第 \(1\) 至 \(n_{\text{layers}}/p\) 层,GPU 2 运行第 \(n_{\text{layers}}/p + 1\) 至 \(2n_{\text{layers}}/p\) 层,依此类推。

每 GPU 内存:

\[\frac{2P}{p} + \frac{12P}{p} + \frac{2P}{p} + \text{Act} = \frac{16P}{p} + \text{Act}\]

每个 GPU 仅存储其分配层的参数、优化器状态和梯度。激活内存取决于调度方式(见下文)。

通信。 相邻阶段间点对点发送激活张量(\(O(bsd)\))——数据量远小于 TP 的 all-reduce,且可容忍较低带宽的互联。

气泡问题。 在朴素顺序执行下,同一时刻仅一个阶段活跃——其余 \(p - 1\) 个 GPU 空闲。流水线气泡是浪费的时间比例:

\[\text{Bubble fraction} = \frac{p - 1}{m + p - 1}\]

其中 \(m\) 为微批量数。要使气泡足够小(如 < 5%),需要 \(m \gg p\)。GPipe 和 PipeDream 采用不同策略:

  • GPipe:先运行所有 \(m\) 个微批量的前向传播,再运行所有 \(m\) 个反向传播。简单,但所有微批量的激活必须同时保留,内存增加 \(m\) 倍。
  • 1F1B (PipeDream):交替前向和反向传播,使每个 GPU 最多同时持有 \(p\) 个微批量(而非 \(m\) 个)的激活。以更复杂的调度为代价,显著减少激活内存。
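
The bubble formula is worth playing with; the short sketch below prints the bubble fraction for a fixed number of stages as the microbatch count grows, along with the number of in-flight microbatches each schedule must hold activations for.

```python
def bubble_fraction(p, m):
    """Fraction of pipeline time wasted in the bubble: (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)

# With p = 8 stages, the bubble shrinks as microbatches increase; activation
# memory per stage scales with m for GPipe but is capped near p for 1F1B.
p = 8
for m in (8, 32, 128):
    print(f"m={m:4d}  bubble={bubble_fraction(p, m):.1%}  "
          f"in-flight microbatches: GPipe={m}, 1F1B<={min(m, p)}")
```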

Combining Strategies: 3D Parallelism

In practice, large-scale training combines all three: TP within a node, PP across nodes, and DP (with ZeRO-1) across pipeline-parallel replicas. For \(N\) total GPUs with TP degree \(t\), PP degree \(p\), and DP degree \(d = N / (t \cdot p)\):

\[\text{Per-GPU model memory} \approx \frac{16P}{t \cdot p} - \frac{12P}{t \cdot p}\left(1 - \frac{1}{d}\right) = \frac{4P}{t \cdot p} + \frac{12P}{t \cdot p \cdot d}\]

More concretely with ZeRO-1:

\[\text{Per-GPU memory} = \frac{2P}{t \cdot p} + \frac{12P}{t \cdot p \cdot d} + \frac{2P}{t \cdot p} + \text{Act}(b_{\text{local}}, s, t)\]

Example: 70B model on 64 GPUs (8 nodes x 8 GPUs/node):

  • \(t = 8\) (TP within each node), \(p = 4\) (PP across 4 nodes), \(d = 2\) (2 DP replicas)
  • Parameters per GPU: \(2 \times 70\text{B} / (8 \times 4) = 4.4\) GB
  • Optimizer per GPU: \(12 \times 70\text{B} / (8 \times 4 \times 2) = 13.1\) GB
  • Gradients per GPU: \(2 \times 70\text{B} / (8 \times 4) = 4.4\) GB
  • Model-related total: ~21.9 GB per GPU — comfortably fits on an 80 GB A100

| Strategy | Splits | Communication | Interconnect | Reduces |
|---|---|---|---|---|
| DP | Data | All-reduce gradients | Any | Activation memory (via smaller local batch) |
| ZeRO-1 | Optimizer states | Same as DP | Any | Optimizer memory |
| ZeRO-3 / FSDP | Everything | All-gather per layer | NVLink preferred | All model memory |
| TP | Weight matrices | All-reduce per layer | NVLink required | All model memory + activations |
| PP | Layers | Point-to-point activations | Network OK | All model memory |

实践中,大规模训练组合三者:节点内 TP,节点间 PP,流水线并行副本间 DP(配合 ZeRO-1)。对于总共 \(N\) 个 GPU,TP 度 \(t\),PP 度 \(p\),DP 度 \(d = N / (t \cdot p)\):

\[\text{Per-GPU model memory} \approx \frac{16P}{t \cdot p} - \frac{12P}{t \cdot p}\left(1 - \frac{1}{d}\right) = \frac{4P}{t \cdot p} + \frac{12P}{t \cdot p \cdot d}\]

更具体地,配合 ZeRO-1:

\[\text{Per-GPU memory} = \frac{2P}{t \cdot p} + \frac{12P}{t \cdot p \cdot d} + \frac{2P}{t \cdot p} + \text{Act}(b_{\text{local}}, s, t)\]

示例:64 GPU 上的 70B 模型(8 节点 x 每节点 8 GPU):

  • \(t = 8\)(节点内 TP),\(p = 4\)(跨 4 节点 PP),\(d = 2\)(2 个 DP 副本)
  • 每 GPU 参数:\(2 \times 70\text{B} / (8 \times 4) = 4.4\) GB
  • 每 GPU 优化器:\(12 \times 70\text{B} / (8 \times 4 \times 2) = 13.1\) GB
  • 每 GPU 梯度:\(2 \times 70\text{B} / (8 \times 4) = 4.4\) GB
  • 模型相关总计:每 GPU 约 21.9 GB——在 80 GB A100 上轻松放下

| 策略 | 切分对象 | 通信 | 互联需求 | 减少的内存 |
|---|---|---|---|---|
| DP | 数据 | All-reduce 梯度 | 任意 | 激活内存(通过缩小本地批量) |
| ZeRO-1 | 优化器状态 | 与 DP 相同 | 任意 | 优化器内存 |
| ZeRO-3 / FSDP | 所有 | 每层 all-gather | 首选 NVLink | 所有模型内存 |
| TP | 权重矩阵 | 每层 all-reduce | 需要 NVLink | 所有模型内存 + 激活 |
| PP | 层 | 点对点激活 | 网络即可 | 所有模型内存 |
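
The arithmetic in the example above is easy to generalize; the sketch below (function name and argument conventions are mine) computes per-GPU model-state memory for any \((t, p, d)\) split, with or without ZeRO-1.

```python
def per_gpu_model_memory_gb(P, t, p, d, zero1=True):
    """Per-GPU params + optimizer + gradients under TP x PP x DP, in GB.

    ZeRO-1 additionally shards the 12P optimizer term across the DP degree d.
    Activations are excluded; they depend on the microbatch and schedule.
    """
    params = 2 * P / (t * p)
    grads = 2 * P / (t * p)
    opt = 12 * P / (t * p * (d if zero1 else 1))
    return (params + grads + opt) / 1e9

# The 70B example above: 64 GPUs split as t=8, p=4, d=2.
print(f"{per_gpu_model_memory_gb(70e9, t=8, p=4, d=2):.1f} GB per GPU")
```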

Checkpointing to Disk

GPU memory is volatile — if a node crashes, a job gets preempted, or you simply want to pause and resume later, everything on the GPU is lost. Checkpointing saves the training state to persistent storage so that a run can be restarted from where it left off without retraining from scratch.

A complete checkpoint must contain everything needed to produce bit-identical training dynamics from the point of save. This means more than just the model weights:

| Component | What it contains | Size (7B model, AdamW, bf16) | Why it's needed |
|---|---|---|---|
| Model parameters (fp32 master) | The fp32 master copy of all weights | \(4P = 28\) GB | The authoritative weights; bf16 copies are derived from these |
| Optimizer states | First moment \(m_t\), second moment \(v_t\), step count | \(8P = 56\) GB | Without these, the optimizer restarts with zero momentum/variance, causing a loss spike and effectively wasting the warmup |
| Learning rate scheduler state | Current step, warmup progress, decay schedule | Negligible | Ensures the learning rate continues from the correct position |
| RNG states | Random seeds for all GPUs, dropout masks, data shuffling | Negligible | Required for exact reproducibility; without these, the resumed run diverges from the original |
| Data loader state | Current epoch, sample index, shuffle order | Small | Prevents re-training on already-seen data or skipping unseen data |
| Gradient scaler state (fp16 only) | Current loss scale, backoff count | Negligible | fp16 training uses dynamic loss scaling; resetting it causes unnecessary scale search |

GPU 内存是易失的——节点崩溃、任务被抢占或仅仅想暂停后恢复,GPU 上的一切都会丢失。检查点将训练状态保存到持久存储,使训练可以从中断处重启而无需从头训练。

完整的检查点必须包含从保存点起产生比特一致训练动态所需的一切。这不仅仅是模型权重:

| 组件 | 内容 | 大小(7B 模型,AdamW,bf16) | 为何需要 |
|---|---|---|---|
| 模型参数(fp32 主副本) | 所有权重的 fp32 主副本 | \(4P = 28\) GB | 权威权重;bf16 副本由此派生 |
| 优化器状态 | 一阶矩 \(m_t\)、二阶矩 \(v_t\)、步数 | \(8P = 56\) GB | 缺失则优化器以零动量/方差重启,导致 loss 尖峰,实际上浪费了 warmup |
| 学习率调度器状态 | 当前步数、warmup 进度、衰减计划 | 可忽略 | 确保学习率从正确位置继续 |
| RNG 状态 | 所有 GPU 的随机种子、dropout 掩码、数据打乱顺序 | 可忽略 | 精确可复现性所需;缺失则恢复的运行偏离原始轨迹 |
| 数据加载器状态 | 当前 epoch、样本索引、打乱顺序 | 较小 | 防止重复训练已见数据或跳过未见数据 |
| 梯度缩放器状态(仅 fp16) | 当前损失缩放值、退避计数 | 可忽略 | fp16 训练使用动态损失缩放;重置会导致不必要的缩放搜索 |
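
In code, a complete checkpoint is just these pieces gathered into one state dict; the sketch below shows the single-process case with illustrative names, whereas a distributed run would save per-rank shards via `torch.distributed.checkpoint`.

```python
import torch

def save_checkpoint(path, step, model, optimizer, scheduler, scaler=None):
    """Sketch of a complete training checkpoint (single-process case)."""
    state = {
        "step": step,
        "model": model.state_dict(),            # fp32 master weights
        "optimizer": optimizer.state_dict(),    # m, v, step counts
        "scheduler": scheduler.state_dict(),    # LR schedule position
        "rng": {
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all(),
        },
        # "dataloader": sampler.state_dict(),   # if the sampler supports resumption
    }
    if scaler is not None:                      # fp16 only: dynamic loss scale
        state["scaler"] = scaler.state_dict()
    torch.save(state, path)
```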

What You Can Skip

Gradients do not need to be saved. They are recomputed from scratch at the start of each training step. Saving mid-step (between microbatches during gradient accumulation) would require saving the partially accumulated gradients, but this is rarely done — it’s simpler to checkpoint only at step boundaries.

Activations do not need to be saved. They are transient, created during the forward pass and consumed during the backward pass within a single step.

The bf16 model copy does not need to be saved. It is deterministically derived from the fp32 master weights by casting.

梯度无需保存。它们在每个训练步开始时从头计算。在步中(梯度累积的微批量之间)保存需要存储部分累积的梯度,但这很少做到——仅在步边界检查点更简单。

激活值无需保存。它们是瞬态的,在前向传播中创建,在同一步的反向传播中消耗。

bf16 模型副本无需保存。它通过对 fp32 主权重做类型转换确定性地派生。

Checkpoint Size

The dominant cost is model parameters + optimizer states. For AdamW in mixed precision:

\[\text{Checkpoint size} \approx 4P + 8P = 12P \text{ bytes}\]

For a 7B model, that’s ~84 GB per checkpoint. A 70B model produces ~840 GB checkpoints. At typical save frequencies (every few hundred steps), this accumulates quickly — a 70B training run saving every 500 steps for 100K steps produces ~168 TB of checkpoints if none are pruned.

主要开销是模型参数 + 优化器状态。对于混合精度下的 AdamW:

\[\text{Checkpoint size} \approx 4P + 8P = 12P \text{ bytes}\]

对于 7B 模型,每个检查点约 84 GB。70B 模型产生约 840 GB 的检查点。按典型保存频率(每几百步),这会迅速积累——70B 训练每 500 步保存、共 100K 步,若不裁剪将产生约 168 TB 检查点。

Strategies to Reduce Checkpoint Cost

Async checkpointing. Saving 84 GB to networked storage (e.g., NFS, S3) can take minutes. Synchronous saves stall training. Modern frameworks (PyTorch’s torch.distributed.checkpoint, DeepSpeed) copy the state to CPU memory asynchronously and write to disk in a background thread, overlapping I/O with the next training step.

Sharded checkpointing. With data or model parallelism, each GPU saves only its own shard of the state. This parallelizes the I/O across all nodes and avoids gathering the full state onto a single machine. The downside is that loading requires the same parallelism configuration — resharding is needed if you change the number of GPUs.

Save only what changed. Some systems support incremental or delta checkpoints, saving only the difference from the previous checkpoint. This is most useful when checkpoints are frequent and the model changes slowly between saves.

Pruning old checkpoints. Keep the last \(k\) checkpoints and delete older ones. Optionally keep “milestone” checkpoints at longer intervals (e.g., every 10K steps) for evaluation or fallback.

异步检查点。 将 84 GB 保存到网络存储(如 NFS、S3)可能需要数分钟。同步保存会阻塞训练。现代框架(PyTorch 的 torch.distributed.checkpoint、DeepSpeed)异步地将状态复制到 CPU 内存,并在后台线程中写入磁盘,将 I/O 与下一训练步重叠。

分片检查点。 在数据或模型并行下,每个 GPU 仅保存自己的状态分片。这将 I/O 并行化到所有节点,避免将完整状态聚集到单台机器。缺点是加载需要相同的并行配置——更改 GPU 数量需要重新分片。

仅保存变化部分。 一些系统支持增量或差分检查点,仅保存与上一检查点的差异。当检查点频繁且模型在保存间变化缓慢时最为有用。

裁剪旧检查点。 保留最近 \(k\) 个检查点,删除更早的。可选择以更长间隔(如每 10K 步)保留"里程碑"检查点用于评估或回退。
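
A minimal sketch of the async-save-plus-prune idea, assuming a flat dict of tensors and zero-padded checkpoint filenames; production systems (torch.distributed.checkpoint, DeepSpeed) handle nested state dicts, shards, and failure cases far more carefully.

```python
import os
import threading
import torch

def async_save(state, path, keep_last=3):
    """Copy state to CPU, then write and prune in a background thread.

    Assumes `state` is a flat dict of tensors and that checkpoint filenames
    sort chronologically (e.g., zero-padded step numbers ending in .pt).
    """
    cpu_state = {k: v.cpu() if torch.is_tensor(v) else v for k, v in state.items()}
    ckpt_dir = os.path.dirname(path) or "."

    def _write():
        torch.save(cpu_state, path)                       # I/O off the critical path
        ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".pt"))
        for old in ckpts[:-keep_last]:                    # keep only the newest k
            os.remove(os.path.join(ckpt_dir, old))

    threading.Thread(target=_write, daemon=True).start()  # training continues meanwhile
```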