LLM Optimization Basics: Memory

This post is inspired by EleutherAI's Transformer Math 101 and Jiayi Pan's VRAM Estimation notes. Originally drafted after discussion with Jiayi Pan and revised in March 2026 to improve the visuals.
本文受 EleutherAI 的 Transformer Math 101 与 Jiayi Pan 的 VRAM Estimation 笔记启发。最初在与 Jiayi Pan 讨论后起草,2026 年 3 月修订以改善可视化效果。

Training a large language model means fitting everything the GPU needs — weights, optimizer buffers, gradients, and intermediate computations — into a fixed amount of VRAM. Understanding what each component costs is a prerequisite for reasoning about OOM errors. This section walks through the four main consumers of GPU memory during a training step.

训练大语言模型意味着将 GPU 所需的一切——权重、优化器缓冲区、梯度和中间计算——塞入固定容量的显存中。理解每个组件的开销是分析 OOM 错误的前提。本节梳理训练步骤中 GPU 内存的四大消耗来源。

What Lives on the GPU

Figure 1. Adjust model size, optimizer, and precision to see how GPU memory is consumed. Green = fits, red = OOM. The four cards break down each component's cost.

图 1. 调整模型大小、优化器和精度,观察 GPU 内存的消耗情况。绿色 = 可容纳,红色 = OOM。四张卡片分别展示各组件的开销。

Let \(P\) denote the number of model parameters. During mixed-precision training (the standard practice for LLMs), the GPU holds four categories of data:

设 \(P\) 为模型参数量。在混合精度训练(LLM 的标准做法)中,GPU 上存放四类数据:

Model Parameters

The weights used in the forward pass. In mixed-precision training, the forward and backward passes run in half precision (fp16 or bf16), so the live weights consume:

\[\text{Model memory} = 2P \text{ bytes}\]

(2 bytes per parameter for bf16/fp16.)

前向传播中使用的权重。在混合精度训练中,前向和反向传播以半精度(fp16 或 bf16)运行,因此活跃权重占用:

\[\text{Model memory} = 2P \text{ bytes}\]

(bf16/fp16 每参数 2 字节。)

Optimizer States

The optimizer maintains its own buffers that persist across training steps. For AdamW (the standard choice for LLM training), the update rule at each step \(t\) is:

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t\] \[v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2\] \[\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \, \theta_{t-1} \right)\]

where \(\hat{m}_t, \hat{v}_t\) are bias-corrected estimates. Each of these quantities must be stored per parameter:

  • fp32 master copy of \(\theta\) — 4 bytes/param. The optimizer updates weights in fp32 for numerical stability, then casts back to bf16 for the next forward pass.
  • First moment \(m_t\) — 4 bytes/param. The exponential moving average of gradients (momentum).
  • Second moment \(v_t\) — 4 bytes/param. The exponential moving average of squared gradients (variance).
\[\text{Optimizer memory} = \underbrace{4P}_{\theta^{\text{fp32}}} + \underbrace{4P}_{m} + \underbrace{4P}_{v} = 12P \text{ bytes}\]

This is typically the single largest memory consumer for model-related storage. For a 7B parameter model: \(12 \times 7 \times 10^9 = 84\) GB just for optimizer states.

Other optimizers use less: SGD with momentum costs \(8P\) bytes (fp32 copy + momentum), and 8-bit optimizers like those in bitsandbytes reduce the moments to 1 byte each, costing \(6P\) bytes.

优化器维护自己的缓冲区,跨训练步持久存在。对于 AdamW(LLM 训练的标准选择),每步 \(t\) 的更新规则为:

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t\] \[v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2\] \[\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \, \theta_{t-1} \right)\]

其中 \(\hat{m}_t, \hat{v}_t\) 为偏差校正后的估计。以下各量均需逐参数存储:

  • \(\theta\) 的 fp32 主副本 — 每参数 4 字节。优化器以 fp32 更新权重以保证数值稳定性,然后转换回 bf16 用于下一次前向传播。
  • 一阶矩 \(m_t\) — 每参数 4 字节。梯度的指数移动平均(动量)。
  • 二阶矩 \(v_t\) — 每参数 4 字节。梯度平方的指数移动平均(方差)。
\[\text{Optimizer memory} = \underbrace{4P}_{\theta^{\text{fp32}}} + \underbrace{4P}_{m} + \underbrace{4P}_{v} = 12P \text{ bytes}\]

这通常是模型相关存储中单项最大的内存消耗者。对于 7B 参数的模型:\(12 \times 7 \times 10^9 = 84\) GB,仅优化器状态就需要这么多。

其他优化器消耗更少:带动量的 SGD 需要 \(8P\) 字节(fp32 副本 + 动量),bitsandbytes 中的 8-bit 优化器将两个矩各压缩至 1 字节,仅需 \(6P\) 字节。
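
To make these buffers concrete, here is a minimal single-tensor sketch of the AdamW update above in PyTorch. The function name and hyperparameter defaults are illustrative; a real optimizer such as `torch.optim.AdamW` fuses and batches this work, but the per-parameter storage is the same: an fp32 master weight, an fp32 \(m\), and an fp32 \(v\).

```python
import torch

def adamw_step(master_w, m, v, grad_bf16, step, lr=1e-4,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single tensor (illustrative sketch).

    master_w, m, v are fp32 and persist across steps: 4 + 4 + 4 = 12 bytes/param.
    grad_bf16 is the bf16 gradient produced by the backward pass (2 bytes/param).
    """
    g = grad_bf16.float()                           # work in fp32 for stability
    m.mul_(beta1).add_(g, alpha=1 - beta1)          # first moment (momentum)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)   # second moment (variance)
    m_hat = m / (1 - beta1 ** step)                 # bias correction
    v_hat = v / (1 - beta2 ** step)
    master_w.add_(m_hat / (v_hat.sqrt() + eps) + weight_decay * master_w,
                  alpha=-lr)                        # decoupled weight decay
    return master_w.to(torch.bfloat16)              # refresh the live bf16 copy
```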

Gradients

The gradient tensor for each parameter, computed during the backward pass. In mixed precision:

\[\text{Gradient memory} = 2P \text{ bytes}\]

(Gradients are stored in the same precision as the model — bf16/fp16.)

每个参数的梯度张量,在反向传播期间计算。在混合精度下:

\[\text{Gradient memory} = 2P \text{ bytes}\]

(梯度以与模型相同的精度存储——bf16/fp16。)

Activations

The intermediate outputs of each layer, saved during the forward pass so they can be reused in the backward pass. Unlike the previous three, activation memory scales with the data being processed — specifically with batch size, sequence length, and model depth.

For a transformer with \(n_{\text{layers}}\) layers, hidden size \(d\), attention heads \(a\), sequence length \(s\), and microbatch size \(b\) (per GPU), the activation memory without recomputation is approximately:

\[\text{Activation memory} \approx s \cdot b \cdot d \cdot n_{\text{layers}} \cdot \left(10 + \frac{24}{t} + \frac{5as}{dt}\right) \text{ bytes}\]

where \(t\) is the tensor parallelism degree (1 if no tensor parallelism). The three terms inside the parentheses correspond to three groups of saved tensors per layer, factored by whether they are split across \(t\) GPUs:

The \(10\) term — activations that are not split by tensor parallelism (full \(d\)-dimensional tensors):

| Saved tensor | Shape | Bytes (bf16) |
|---|---|---|
| Layer norm input (before self-attention) | \([s, b, d]\) | \(2sbd\) |
| Layer norm input (before MLP) | \([s, b, d]\) | \(2sbd\) |
| Self-attention output (before residual add) | \([s, b, d]\) | \(2sbd\) |
| MLP output (before residual add) | \([s, b, d]\) | \(2sbd\) |
| Dropout mask (after self-attention) | \([s, b, d]\) | \(sbd\) |
| Dropout mask (after MLP) | \([s, b, d]\) | \(sbd\) |
| Subtotal | | \(10 \cdot sbd\) |

The \(24/t\) term — activations split across \(t\) tensor-parallel GPUs (each GPU stores \(d/t\)):

| Saved tensor | Shape per GPU | Bytes (bf16) |
|---|---|---|
| Q projection output | \([s, b, d/t]\) | \(2sbd/t\) |
| K projection output | \([s, b, d/t]\) | \(2sbd/t\) |
| V projection output | \([s, b, d/t]\) | \(2sbd/t\) |
| Attention value output | \([s, b, d/t]\) | \(2sbd/t\) |
| MLP first linear output (input to GeLU) | \([s, b, 4d/t]\) | \(8sbd/t\) |
| GeLU output | \([s, b, 4d/t]\) | \(8sbd/t\) |
| Subtotal | | \(24 \cdot sbd/t\) |

The \(5as/(dt)\) term — attention matrices that scale as \(O(s^2)\):

| Saved tensor | Shape per GPU | Bytes |
|---|---|---|
| Attention scores (pre-softmax) | \([b, a/t, s, s]\) | \(2bas^2/t\) |
| Attention probabilities (post-softmax) | \([b, a/t, s, s]\) | \(2bas^2/t\) |
| Attention dropout mask | \([b, a/t, s, s]\) | \(bas^2/t\) |
| Subtotal | | \(5bas^2/t = sbd \cdot \frac{5as}{dt}\) |

This last group is why long sequences are expensive: at \(s = 8192\) with \(d = 4096\), the \(5as/(dt)\) term can exceed the other two combined.

With full activation recomputation (gradient checkpointing), only the input to each layer is saved — everything else is recomputed during the backward pass. This reduces activation memory to \(2 \cdot s \cdot b \cdot d \cdot n_{\text{layers}}\) bytes, at the cost of roughly doubling the forward-pass compute.

每层的中间输出在前向传播期间保存,以便在反向传播中复用。与前三项不同,激活内存随处理数据量变化——具体来说随批量大小、序列长度和模型深度变化。

对于具有 \(n_{\text{layers}}\) 层、隐藏大小 \(d\)、注意力头数 \(a\)、序列长度 \(s\) 和微批量大小 \(b\)(每 GPU)的 Transformer,不做重计算时的激活内存约为:

\[\text{Activation memory} \approx s \cdot b \cdot d \cdot n_{\text{layers}} \cdot \left(10 + \frac{24}{t} + \frac{5as}{dt}\right) \text{ bytes}\]

其中 \(t\) 为张量并行度(无张量并行时为 1)。括号内的三项对应每层三组保存的张量,按是否在 \(t\) 个 GPU 间切分区分:

\(10\) 项 — 未被张量并行切分的激活(完整 \(d\) 维张量):

| 保存的张量 | 形状 | 字节数 (bf16) |
|---|---|---|
| Layer norm 输入(自注意力前) | \([s, b, d]\) | \(2sbd\) |
| Layer norm 输入(MLP 前) | \([s, b, d]\) | \(2sbd\) |
| 自注意力输出(残差相加前) | \([s, b, d]\) | \(2sbd\) |
| MLP 输出(残差相加前) | \([s, b, d]\) | \(2sbd\) |
| Dropout 掩码(自注意力后) | \([s, b, d]\) | \(sbd\) |
| Dropout 掩码(MLP 后) | \([s, b, d]\) | \(sbd\) |
| 小计 | | \(10 \cdot sbd\) |

\(24/t\) 项 — 在 \(t\) 个张量并行 GPU 间切分的激活(每个 GPU 存储 \(d/t\)):

| 保存的张量 | 每 GPU 形状 | 字节数 (bf16) |
|---|---|---|
| Q 投影输出 | \([s, b, d/t]\) | \(2sbd/t\) |
| K 投影输出 | \([s, b, d/t]\) | \(2sbd/t\) |
| V 投影输出 | \([s, b, d/t]\) | \(2sbd/t\) |
| 注意力值输出 | \([s, b, d/t]\) | \(2sbd/t\) |
| MLP 第一层线性输出(GeLU 输入) | \([s, b, 4d/t]\) | \(8sbd/t\) |
| GeLU 输出 | \([s, b, 4d/t]\) | \(8sbd/t\) |
| 小计 | | \(24 \cdot sbd/t\) |

\(5as/(dt)\) 项 — 按 \(O(s^2)\) 增长的注意力矩阵:

| 保存的张量 | 每 GPU 形状 | 字节数 |
|---|---|---|
| 注意力分数(softmax 前) | \([b, a/t, s, s]\) | \(2bas^2/t\) |
| 注意力概率(softmax 后) | \([b, a/t, s, s]\) | \(2bas^2/t\) |
| 注意力 dropout 掩码 | \([b, a/t, s, s]\) | \(bas^2/t\) |
| 小计 | | \(5bas^2/t = sbd \cdot \frac{5as}{dt}\) |

最后一组是长序列昂贵的原因:当 \(s = 8192\)、\(d = 4096\) 时,\(5as/(dt)\) 项可能超过前两项之和。

使用完全激活重计算(梯度检查点),仅保存每层的输入——其余在反向传播期间重新计算。这将激活内存降至 \(2 \cdot s \cdot b \cdot d \cdot n_{\text{layers}}\) 字节,代价是前向传播计算量大约翻倍。
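
The activation formula translates directly into a small estimator. The sketch below (the function name and the example configuration are mine, not from the post's interactive figure) computes per-GPU activation memory with and without full recomputation.

```python
def activation_memory_bytes(s, b, d, n_layers, a, t=1, full_recompute=False):
    """Per-GPU activation memory for a standard transformer block, in bytes.

    s: sequence length, b: microbatch size, d: hidden size,
    a: attention heads, t: tensor-parallel degree.
    """
    if full_recompute:
        # Only each layer's input is kept; everything else is recomputed.
        return 2 * s * b * d * n_layers
    per_layer = s * b * d * (10 + 24 / t + 5 * a * s / (d * t))
    return per_layer * n_layers

# Example: an illustrative 7B-class config (32 layers, d=4096, 32 heads), s=8192, b=1.
full = activation_memory_bytes(8192, 1, 4096, 32, 32)
ckpt = activation_memory_bytes(8192, 1, 4096, 32, 32, full_recompute=True)
print(f"{full / 1e9:.1f} GB without recomputation, {ckpt / 1e9:.1f} GB with")
```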

Putting It Together

For a single GPU with no parallelism, the total training memory is:

\[\text{Total} = \underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{AdamW}} + \underbrace{2P}_{\text{grads}} + \underbrace{\text{Activations}(b, s)}_{\text{scales with data}}\]

The first three terms sum to \(16P\) bytes — fixed regardless of batch size. For a 7B model, that’s ~112 GB before any data is processed. The fourth term is the only one you can control at runtime by changing how much data you feed per forward pass.

This is where memory optimization techniques come in.

对于单 GPU 且无并行的情况,总训练内存为:

\[\text{Total} = \underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{AdamW}} + \underbrace{2P}_{\text{grads}} + \underbrace{\text{Activations}(b, s)}_{\text{scales with data}}\]

前三项之和为 \(16P\) 字节——与批量大小无关的固定开销。对于 7B 模型,在处理任何数据之前就需要约 112 GB。第四项是你在运行时唯一可以通过调整每次前向传播的数据量来控制的。

这就是内存优化技术发挥作用的地方。
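
As a quick sanity check, a few lines of Python reproduce the fixed \(16P\) figure for a 7B model and add an illustrative activation estimate; the model configuration here is assumed for illustration, not taken from any particular checkpoint.

```python
P = 7e9                        # parameters
fixed = (2 + 12 + 2) * P       # bf16 params + AdamW states + bf16 grads = 16P bytes
# Illustrative activation estimate: 32 layers, d=4096, 32 heads, s=4096, b=1, no TP,
# using the per-layer formula from the previous section (no recomputation).
s, b, d, n_layers, a = 4096, 1, 4096, 32, 32
act = s * b * d * n_layers * (10 + 24 + 5 * a * s / d)
print(f"fixed model state: {fixed / 1e9:.0f} GB, activations: {act / 1e9:.0f} GB")
```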

Optimizing GPU Memory Usage

Every memory optimization technique trades something for memory savings — there is no free lunch. The table below summarizes the main approaches, what they save, and what they cost.

每种内存优化技术都以某些代价换取内存节省——天下没有免费的午餐。下表总结了主要方法、它们节省了什么以及代价是什么。

Mixed Precision Training

What it does. Store different GPU memory residents at different precisions. The key insight is that the four memory components from the previous section have different numerical sensitivity — activations and gradients are stable in half precision, but optimizer updates are not. The standard recipe:

| Stored on GPU | Precision | Memory | Why |
|---|---|---|---|
| Model parameters | bf16 (live) + fp32 (master) | \(2P + 4P\) | Forward/backward use the bf16 copy; optimizer updates the fp32 master |
| Activations | bf16/fp16 | \(2 \cdot sbd\) per tensor | Half precision is sufficient for intermediate computations |
| Gradients | bf16/fp16 | \(2P\) | Same precision as the forward pass |
| Optimizer states (\(m_t, v_t\)) | fp32 | \(8P\) | Momentum and variance need fp32 for numerical stability |

The flow each training step: cast fp32 master weights to bf16, run forward/backward in bf16 (producing bf16 activations and gradients), pass gradients to the optimizer which updates the fp32 master weights, then refresh the bf16 copy.

The tradeoff. fp16 has a narrow dynamic range (max \(\sim 6.5 \times 10^4\)) and can cause loss spikes or divergence without careful loss scaling. bf16 (available on Ampere+ GPUs) has the same exponent range as fp32, making it more stable but slightly less precise in the mantissa. Either way, the optimizer states remain in fp32 — so mixed precision saves on the live parameters and gradients (\(4P\) bytes saved) but not on the dominant \(12P\) optimizer cost.

\(\text{Mixed precision savings: } 4P \text{ bytes (params + grads halved)}\)

功能。 以不同精度存储 GPU 上的不同数据。关键洞察是上节的四个内存组件对数值精度的敏感度不同——激活和梯度在半精度下稳定,但优化器更新则不然。标准方案:

| GPU 上存储的内容 | 精度 | 内存 | 原因 |
|---|---|---|---|
| 模型参数 | bf16(活跃)+ fp32(主副本) | \(2P + 4P\) | 前向/反向使用 bf16 副本;优化器更新 fp32 主副本 |
| 激活值 | bf16/fp16 | 每张量 \(2 \cdot sbd\) | 半精度足以应对中间计算 |
| 梯度 | bf16/fp16 | \(2P\) | 与前向传播精度一致 |
| 优化器状态 (\(m_t, v_t\)) | fp32 | \(8P\) | 动量和方差需要 fp32 以保证数值稳定性 |

每个训练步的流程:将 fp32 主权重转换为 bf16,以 bf16 运行前向/反向(产生 bf16 激活和梯度),将梯度传给优化器更新 fp32 主权重,然后刷新 bf16 副本。

代价。 fp16 动态范围窄(最大约 \(6.5 \times 10^4\)),不仔细做损失缩放就可能引起 loss 尖峰或发散。bf16(Ampere+ GPU 可用)与 fp32 具有相同的指数范围,更稳定但尾数精度略低。无论哪种,优化器状态仍为 fp32——因此混合精度节省了活跃参数和梯度(节省 \(4P\) 字节),但无法触及占主导地位的 \(12P\) 优化器开销。

\(\text{Mixed precision savings: } 4P \text{ bytes (params + grads halved)}\)
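
Below is a minimal hand-written sketch of the bf16 flow described above, using a single linear layer so the bookkeeping is visible. In practice a framework (torch.amp, Megatron-LM, DeepSpeed) manages the master copy and the casts for you; the structure here is only illustrative.

```python
import torch

# Live bf16 weights used by forward/backward; fp32 master copy owned by the optimizer.
model = torch.nn.Linear(4096, 4096).to(torch.bfloat16)
master = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]
opt = torch.optim.AdamW(master, lr=1e-4)          # master, m, v all live in fp32

def step(x_bf16, y_bf16):
    loss = torch.nn.functional.mse_loss(model(x_bf16), y_bf16)  # forward in bf16
    loss.backward()                                             # bf16 gradients
    for mp, p in zip(master, model.parameters()):
        mp.grad = p.grad.float()           # hand bf16 grads to the fp32 master
    opt.step()                             # update happens entirely in fp32
    opt.zero_grad(); model.zero_grad(set_to_none=True)
    with torch.no_grad():
        for mp, p in zip(master, model.parameters()):
            p.copy_(mp.to(torch.bfloat16))  # refresh the live bf16 copy
    return loss.item()
```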

Gradient Checkpointing (Activation Recomputation)

What it does. Instead of saving all intermediate activations during the forward pass, save only the input to each transformer layer. During the backward pass, recompute each layer’s intermediates from the saved input before computing gradients.

The tradeoff. Activation memory drops from \(sbd \cdot n_{\text{layers}} \cdot (10 + 24/t + 5as/(dt))\) to just \(2sbd \cdot n_{\text{layers}}\) — typically a 5-10x reduction. The cost is roughly doubling the forward-pass compute, since every layer’s forward pass runs twice (once during forward, once during backward). Wall-clock training time increases by ~30-40% in practice (backward is more expensive than forward, so the extra forward pass is a smaller fraction of total time).

功能。 在前向传播中不保存所有中间激活,而仅保存每个 Transformer 层的输入。在反向传播中,从保存的输入重新计算每层的中间值,然后再计算梯度。

代价。 激活内存从 \(sbd \cdot n_{\text{layers}} \cdot (10 + 24/t + 5as/(dt))\) 降至仅 \(2sbd \cdot n_{\text{layers}}\)——通常减少 5-10 倍。代价是前向计算量大约翻倍,因为每层的前向传播运行两次(前向阶段一次,反向阶段一次)。实际壁钟训练时间增加约 30-40%(反向比前向更耗时,所以额外的前向传播在总时间中占比较小)。
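
A sketch of what this looks like in PyTorch, using the standard `torch.utils.checkpoint.checkpoint` API on a toy residual block (the model itself is illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d))
    def forward(self, x):
        return x + self.ff(x)

class Model(torch.nn.Module):
    def __init__(self, d=1024, n_layers=8, use_checkpointing=True):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(d) for _ in range(n_layers))
        self.use_checkpointing = use_checkpointing
    def forward(self, x):
        for blk in self.blocks:
            if self.use_checkpointing and self.training:
                # Only the block input is saved; the block's intermediates are
                # recomputed during the backward pass.
                x = checkpoint(blk, x, use_reentrant=False)
            else:
                x = blk(x)
        return x
```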

Figure 2. Adjust layers, sequence length, batch size, and hidden dim to see per-layer activation breakdown. Toggle gradient checkpointing to compare memory savings. Note how attention scores scale as O(s²).

图 2. 调整层数、序列长度、批量大小和隐藏维度,查看每层激活分布。切换梯度检查点以比较内存节省。注意注意力分数如何按 O(s²) 增长。

Microbatching and Gradient Accumulation

Consider a training step where one GPU receives \(N\) sequences of varying lengths \(\{l_1, l_2, \ldots, l_N\}\), totaling \(L = \sum_i l_i\) tokens. The GPU must compute forward and backward passes over all \(N\) sequences before a single optimizer step.

The naive approach processes everything at once: pack all \(N\) sequences into a single tensor, forward pass, backward pass, optimizer step. But this requires holding all activations for \(N\) sequences simultaneously — which can easily exceed GPU memory even when the fixed costs (parameters + optimizer + gradients) fit comfortably.

Gradient accumulation solves this: split the \(N\) sequences into \(K\) microbatches, process each sequentially, accumulate gradients, then step.

\[\nabla_\theta \mathcal{L} = \sum_{k=1}^{K} \nabla_\theta \mathcal{L}_k\]

The final gradient is identical to the single-batch case. The optimizer sees the same update. But memory usage is dramatically different: activations are created and destroyed per microbatch. Only one microbatch’s activations exist at a time. So peak memory is determined by the largest microbatch, not the total batch.

考虑一个训练步:一个 GPU 接收 \(N\) 条变长序列 \(\{l_1, l_2, \ldots, l_N\}\),总计 \(L = \sum_i l_i\) 个 token。GPU 必须对所有 \(N\) 条序列完成前向和反向传播,才能执行一次优化器步。

朴素方法一次处理所有内容:将 \(N\) 条序列打包为一个张量,前向传播,反向传播,优化器步。但这要求同时持有 \(N\) 条序列的所有激活——即使固定开销(参数 + 优化器 + 梯度)能舒适地放入显存,这也很容易超标。

梯度累积解决了这个问题:将 \(N\) 条序列分成 \(K\) 个微批量,依次处理,累积梯度,然后更新。

\[\nabla_\theta \mathcal{L} = \sum_{k=1}^{K} \nabla_\theta \mathcal{L}_k\]

最终梯度与单批次情况完全一致。优化器看到的更新相同。但内存使用截然不同:激活在每个微批量中创建和销毁。任一时刻只有一个微批量的激活存在。因此峰值内存由最大的微批量决定,而非总批量。
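
A minimal accumulation loop might look like the following sketch; `model`, `optimizer`, and `loss_fn` are assumed to exist, and each microbatch loss is scaled by \(1/K\) so the accumulated gradient matches the mean-reduced full-batch gradient.

```python
def train_step(model, optimizer, loss_fn, microbatches):
    """One optimizer step over K microbatches (gradient accumulation sketch)."""
    optimizer.zero_grad(set_to_none=True)
    K = len(microbatches)
    total = 0.0
    for inputs, targets in microbatches:
        loss = loss_fn(model(inputs), targets) / K
        loss.backward()          # gradients accumulate in each parameter's .grad
        total += loss.item()     # this microbatch's activations are freed here
    optimizer.step()             # a single update for the whole batch
    return total
```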

Figure 3. Animation of gradient accumulation. Watch how activations appear during each forward pass and are consumed during backward, while gradients accumulate across microbatches. Only one microbatch's activations exist at a time.

图 3. 梯度累积动画。观察激活如何在每次前向传播中出现、在反向传播中消耗,而梯度在各微批量间累积。任一时刻只有一个微批量的激活存在。

A bin-packing algorithm (First-Fit Decreasing) groups the \(N\) sequences into microbatches, where each microbatch’s total token count must not exceed max_tokens_per_mb.

Example: 84 sequences averaging 5000 tokens each (420K total tokens) on one GPU.

| max_tokens_per_mb | Microbatches | Seqs/MB | Peak activation memory |
|---|---|---|---|
| 65536 | ~7 | ~12 | High: 12 sequences x padded length |
| 16384 | ~26 | ~3 | Low: 3 sequences x padded length |
| 8192 | ~52 | ~1-2 | Minimal: 1-2 sequences x padded length |

With max_tokens_per_mb=65536, each microbatch holds ~12 sequences. The forward pass must store activations for all 12 simultaneously. With 16384, only ~3 sequences are processed at once — 4x less activation memory.

The padding tax. There is a subtlety that makes larger microbatches even worse: within a microbatch, sequences are padded to the length of the longest sequence in that group, because GPU tensor operations need rectangular shapes. Consider a microbatch with 3 sequences of lengths [3000, 5000, 8000]. All three are padded to 8000:

\[\text{Effective tokens} = 3 \times 8000 = 24000 \quad \text{(vs actual 16000)}\]

That's 8000 padding tokens, a 50% overhead on top of the 16000 real tokens, spent on compute and memory that do no useful work. Larger microbatches are more likely to contain outlier-length sequences, making the padding problem worse. Smaller max_tokens_per_mb means fewer sequences per group, so the longest sequence in each group is closer to the average: less padding waste, lower peak memory.

The tradeoff. Smaller microbatches reduce peak activation memory and padding waste, but increase the number of forward/backward passes per step — adding kernel launch overhead and reducing GPU utilization. Extremely small microbatches (1-2 sequences) also underutilize the GPU’s parallel compute units.

装箱算法(首次适应递减法)将 \(N\) 条序列分组为微批量,每个微批量的总 token 数不超过 max_tokens_per_mb

示例:一个 GPU 上 84 条序列,平均每条 5000 个 token(共 420K token)。

| max_tokens_per_mb | 微批量数 | 每微批量序列数 | 峰值激活内存 |
|---|---|---|---|
| 65536 | ~7 | ~12 | 高:12 条序列 x 填充长度 |
| 16384 | ~26 | ~3 | 低:3 条序列 x 填充长度 |
| 8192 | ~52 | ~1-2 | 最小:1-2 条序列 x 填充长度 |

max_tokens_per_mb=65536 时,每个微批量容纳约 12 条序列。前向传播必须同时存储 12 条序列的激活。用 16384 时,一次仅处理约 3 条序列——激活内存减少 4 倍

填充税。 一个微妙之处使大微批量更为不利:微批量内的序列会填充到该组中最长序列的长度,因为 GPU 张量运算需要矩形形状。考虑一个包含 3 条序列、长度为 [3000, 5000, 8000] 的微批量。三条均填充到 8000:

\[\text{Effective tokens} = 3 \times 8000 = 24000 \quad \text{(vs actual 16000)}\]

50% 的计算和内存浪费在填充 token 上。更大的微批量更可能包含离群长度的序列,使填充问题恶化。更小的 max_tokens_per_mb 意味着每组序列更少,最长序列更接近平均值——填充浪费更少,峰值内存更低。

代价。 更小的微批量减少了峰值激活内存和填充浪费,但增加了每步的前向/反向传播次数——增加内核启动开销并降低 GPU 利用率。极小的微批量(1-2 条序列)也无法充分利用 GPU 的并行计算单元。
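
The sketch below implements the packing rule described above (sort sequences by length, place each into the first microbatch whose total token count stays under budget) and measures the resulting padding waste; the sequence lengths are randomly generated for illustration.

```python
import random

def pack_microbatches(seq_lens, max_tokens_per_mb):
    """First-Fit Decreasing: each microbatch's total token count must not
    exceed max_tokens_per_mb (the rule described above)."""
    bins = []  # each bin is a list of sequence lengths
    for length in sorted(seq_lens, reverse=True):
        for mb in bins:
            if sum(mb) + length <= max_tokens_per_mb:
                mb.append(length)
                break
        else:
            bins.append([length])   # no existing microbatch fits: open a new one
    return bins

# Roughly matching the table above: 84 sequences of a few thousand tokens each.
random.seed(0)
lens = [random.randint(3000, 8000) for _ in range(84)]
for budget in (65536, 16384, 8192):
    mbs = pack_microbatches(lens, budget)
    # Padding waste: every sequence is padded to the longest one in its microbatch.
    waste = sum(len(mb) * max(mb) - sum(mb) for mb in mbs)
    print(f"budget={budget:6d}  microbatches={len(mbs):3d}  "
          f"padding waste={waste / sum(lens):.0%} of real tokens")
```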

Figure 4. Drag the slider to change max_tokens_per_mb and see how sequences are packed into microbatches. Purple = actual tokens, pink = padding waste. The red-outlined microbatch determines peak activation memory.

图 4. 拖动滑块改变 max_tokens_per_mb,观察序列如何被打包为微批量。紫色 = 实际 token,粉色 = 填充浪费。红框微批量决定峰值激活内存。

Optimizer State Compression

What it does. Replace the fp32 optimizer states with lower-precision versions. 8-bit Adam (e.g., bitsandbytes) quantizes the first and second moments to int8, reducing optimizer memory from \(12P\) to \(6P\) bytes. Adafactor replaces the full second moment with factored row and column statistics, reaching roughly \(6P\) bytes through a different approximation.

The tradeoff. 8-bit Adam introduces quantization noise into the moment estimates. In practice, dynamic quantization with block-wise scaling preserves most of the convergence properties of full-precision Adam. Adafactor can diverge on some tasks and requires more careful hyperparameter tuning. SGD with momentum costs \(8P\) bytes but converges more slowly and with less stability on large-scale language model training.

功能。 用低精度版本替代 fp32 优化器状态。8-bit Adam(如 bitsandbytes)将一阶和二阶矩量化为 int8,将优化器内存从 \(12P\) 降至 \(6P\) 字节。Adafactor 用分解出的行、列统计量近似替代完整的二阶矩,以另一种近似方式降至约 \(6P\) 字节。

代价。 8-bit Adam 在矩估计中引入量化噪声。实践中,带分块缩放的动态量化保留了全精度 Adam 的大部分收敛特性。Adafactor 在某些任务上可能发散,需要更谨慎的超参调节。带动量的 SGD 需要 \(8P\) 字节,但在大规模语言模型训练中收敛更慢且稳定性更差。
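
If you want to try the 8-bit route, the swap is typically a one-liner; this sketch assumes the bitsandbytes package with its `AdamW8bit` optimizer class and a CUDA device.

```python
import torch
import bitsandbytes as bnb   # assumes bitsandbytes is installed

model = torch.nn.Linear(4096, 4096).cuda()

# Drop-in replacement for torch.optim.AdamW: m and v are kept in int8 with
# block-wise scaling, cutting optimizer memory from 12P to roughly 6P bytes.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4, weight_decay=0.01)
```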

Parameter-Efficient Fine-Tuning (LoRA / QLoRA)

What it does. Freeze the base model weights and train only small low-rank adapter matrices. LoRA adds rank-\(r\) matrices \(A \in \mathbb{R}^{d \times r}\) and \(B \in \mathbb{R}^{r \times d}\) to each target layer, with \(r \ll d\). Trainable parameters drop to \(\sim 1\text{-}2\%\) of the original. QLoRA goes further: quantize the frozen base model to 4-bit (0.5 bytes/param), reducing parameter memory from \(2P\) to \(0.5P\).

The tradeoff. Optimizer states, gradients, and activations now scale with the adapter size, not the full model — massive memory savings. The cost is reduced expressiveness: the adapter can only learn changes within the low-rank subspace. For many fine-tuning tasks this is sufficient, but for pretraining or large distribution shifts, full-rank updates are necessary.

\(\text{QLoRA memory} \approx \underbrace{0.5P}_{\text{4-bit base}} + \underbrace{12 \cdot 0.02P}_{\text{adapter optimizer}} + \underbrace{2 \cdot 0.02P}_{\text{adapter grads}} \approx 0.78P \text{ bytes}\)

功能。 冻结基座模型权重,仅训练小型低秩适配器矩阵。LoRA 在每个目标层添加秩-\(r\) 矩阵 \(A \in \mathbb{R}^{d \times r}\) 和 \(B \in \mathbb{R}^{r \times d}\),其中 \(r \ll d\)。可训练参数降至原始的约 \(1\text{-}2\%\)。QLoRA 更进一步:将冻结的基座模型量化为 4-bit(每参数 0.5 字节),将参数内存从 \(2P\) 降至 \(0.5P\)。

代价。 优化器状态、梯度和激活现在随适配器大小而非完整模型缩放——内存大幅节省。代价是表达能力降低:适配器只能学习低秩子空间内的变化。对于许多微调任务这已足够,但对于预训练或大的分布偏移,需要全秩更新。

\(\text{QLoRA memory} \approx \underbrace{0.5P}_{\text{4-bit base}} + \underbrace{12 \cdot 0.02P}_{\text{adapter optimizer}} + \underbrace{2 \cdot 0.02P}_{\text{adapter grads}} \approx 0.78P \text{ bytes}\)
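
A minimal LoRA adapter is easy to write by hand; the sketch below wraps a frozen `nn.Linear` and follows the common initialization (Gaussian \(A\), zero \(B\)). The class name and defaults are mine, not from any particular library.

```python
import torch

class LoRALinear(torch.nn.Module):
    """Minimal LoRA adapter around a frozen linear layer (illustrative sketch).

    Only A and B are trainable, so gradients and optimizer states scale with
    2*d*r parameters instead of the full d*d weight.
    """
    def __init__(self, base: torch.nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the base weights
        d_in, d_out = base.in_features, base.out_features
        self.A = torch.nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(r, d_out))  # zero init: delta starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```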

Offloading

What it does. Move optimizer states or parameters to CPU RAM (or even NVMe storage) and swap them to GPU only when needed. DeepSpeed ZeRO-Offload and ZeRO-Infinity implement this transparently.

The tradeoff. PCIe bandwidth between CPU and GPU is 1-2 orders of magnitude slower than GPU memory bandwidth. Offloading trades training speed for the ability to train models that would otherwise not fit at all. It is a last resort, useful for training very large models on limited GPU hardware.

功能。 将优化器状态或参数移至 CPU 内存(甚至 NVMe 存储),仅在需要时交换到 GPU。DeepSpeed ZeRO-Offload 和 ZeRO-Infinity 透明地实现了这一点。

代价。 CPU 与 GPU 之间的 PCIe 带宽比 GPU 内存带宽慢 1-2 个数量级。卸载以训练速度换取训练原本无法放入显存的模型的能力。这是最后手段,适用于在有限 GPU 硬件上训练超大模型。
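
For reference, a ZeRO-Offload setup is usually expressed as a DeepSpeed config; the sketch below shows the general shape of such a config with placeholder batch sizes. The exact keys should be checked against the DeepSpeed documentation for your version.

```python
# Sketch of a DeepSpeed ZeRO-Offload configuration: ZeRO-2 sharding with the
# optimizer states (and their updates) pushed to CPU memory. Values are
# placeholders; consult the DeepSpeed docs before use.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```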

Summary

| Technique | Memory saved | Cost |
|---|---|---|
| Mixed precision (bf16) | \(4P\) bytes | Slight precision loss |
| Gradient checkpointing | ~5-10x activation reduction | ~30-40% more compute |
| 8-bit optimizer | \(6P\) bytes | Quantization noise |
| QLoRA | ~\(15P\) bytes vs full fine-tuning | Reduced expressiveness |
| CPU offloading | Moves optimizer to RAM | Major speed reduction |
| Microbatching | Controls peak activation memory | More kernel launches, lower GPU utilization |

None of these are free — and all of the above are single-GPU techniques. When the model still doesn’t fit, or when you need to scale to dozens or hundreds of GPUs, parallelism strategies distribute the memory (and compute) across devices. The next section covers these.

| 技术 | 节省的内存 | 代价 |
|---|---|---|
| 混合精度 (bf16) | \(4P\) 字节 | 轻微精度损失 |
| 梯度检查点 | 激活减少约 5-10 倍 | 计算量增加约 30-40% |
| 8-bit 优化器 | \(6P\) 字节 | 量化噪声 |
| QLoRA | 相比全参微调节省约 \(15P\) 字节 | 表达能力降低 |
| CPU 卸载 | 将优化器移至内存 | 速度大幅下降 |
| 微批量 | 控制峰值激活内存 | 内核启动次数增多、GPU 利用率降低 |

这些方法都不是免费的——而且以上全部是单 GPU 技术。当模型仍然放不下,或者需要扩展到数十乃至数百个 GPU 时,并行策略将内存(和计算)分布到多个设备上。下一节介绍这些策略。

Parallelism Strategies

The single-GPU optimizations above can only go so far. A 70B model requires \(16 \times 70 \times 10^9 = 1120\) GB for parameters + optimizer + gradients alone — no single GPU comes close. Parallelism distributes memory and compute across multiple devices. The four main strategies are complementary and are typically combined in practice.

Recall from What Lives on the GPU that total per-GPU memory is:

\[\underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{optimizer}} + \underbrace{2P}_{\text{grads}} + \underbrace{\text{Act}(b, s)}_{\text{activations}} = 16P + \text{Act}\]

Each parallelism strategy targets different terms in this equation.

上述单 GPU 优化能做的有限。70B 模型仅参数 + 优化器 + 梯度就需要 \(16 \times 70 \times 10^9 = 1120\) GB——没有任何单卡能接近这一需求。并行化将内存和计算分布到多个设备上。四种主要策略互补,实践中通常组合使用。

回顾GPU 上存放了什么,每 GPU 总内存为:

\[\underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{optimizer}} + \underbrace{2P}_{\text{grads}} + \underbrace{\text{Act}(b, s)}_{\text{activations}} = 16P + \text{Act}\]

每种并行策略针对这个等式中的不同项。

Data Parallelism (DP)

Idea. Replicate the entire model on each of \(N\) GPUs. Each GPU processes a different data shard, computes gradients locally, then all-reduces gradients before the optimizer step. The result is mathematically identical to single-GPU training with \(N \times\) the batch size.

Per-GPU memory:

\[\underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{optimizer}} + \underbrace{2P}_{\text{grads}} + \text{Act}(b/N, s)\]

DP does not reduce model-related memory — every GPU still holds the full \(16P\) bytes. It only reduces activations by shrinking each GPU’s local batch from \(b\) to \(b/N\).

Communication. One all-reduce of gradients (\(2P\) bytes) per step, which can be overlapped with backward computation using bucketed gradient all-reduce (as in PyTorch DDP).

When to use. When the model fits on a single GPU but you want higher throughput. DP is the simplest and most efficient form of parallelism — always the first thing to try.

思路。 在 \(N\) 个 GPU 上各复制一份完整模型。每个 GPU 处理不同的数据分片,在本地计算梯度,然后在优化器步之前 all-reduce 梯度。结果在数学上等价于批量大小为 \(N\) 倍的单 GPU 训练。

每 GPU 内存:

\[\underbrace{2P}_{\text{params}} + \underbrace{12P}_{\text{optimizer}} + \underbrace{2P}_{\text{grads}} + \text{Act}(b/N, s)\]

DP 不会减少模型相关内存——每个 GPU 仍持有完整的 \(16P\) 字节。它仅通过将每个 GPU 的本地批量从 \(b\) 缩小到 \(b/N\) 来减少激活。

通信。 每步一次梯度 all-reduce(\(2P\) 字节),可使用分桶梯度 all-reduce(如 PyTorch DDP)与反向计算重叠。

适用场景。 当模型能放入单 GPU 但需要更高吞吐时。DP 是最简单、最高效的并行形式——总是首先尝试的方案。
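
A minimal DDP setup sketch (assuming a `torchrun` launch that sets the usual environment variables; the model is a placeholder):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with torchrun so RANK / LOCAL_RANK / WORLD_SIZE are set.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()
# DDP replicates the weights on every rank and bucket-all-reduces gradients
# during backward, overlapping communication with compute.
model = DDP(model, device_ids=[local_rank])
# Pair with a DistributedSampler so each rank sees a disjoint data shard.
```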

ZeRO / FSDP (Sharded Data Parallelism)

The key insight behind ZeRO (Zero Redundancy Optimizer) is that vanilla DP is wasteful: every GPU stores an identical copy of optimizer states, gradients, and parameters. ZeRO shards these across \(N\) data-parallel GPUs in three progressive stages:

| Stage | What is sharded | Per-GPU memory | Communication per step |
|---|---|---|---|
| ZeRO-1 | Optimizer states | \(2P + 2P + 12P/N + \text{Act}\) | Same as DP (gradient all-reduce) |
| ZeRO-2 | Optimizer states + gradients | \(2P + 2P/N + 12P/N + \text{Act}\) | Reduce-scatter gradients (similar cost to all-reduce) |
| ZeRO-3 | Optimizer states + gradients + parameters | \(2P/N + 2P/N + 12P/N + \text{Act}\) | All-gather params before each forward/backward layer |

ZeRO-1 is nearly free: sharding optimizer states across \(N\) GPUs reduces the dominant \(12P\) term to \(12P/N\), with no extra communication beyond the standard gradient all-reduce. For a 7B model on 8 GPUs, optimizer memory drops from 84 GB to ~10.5 GB per GPU.

ZeRO-2 additionally shards gradients. Each GPU only stores gradients for its shard’s parameters, then reduce-scatters (instead of all-reducing) so each rank accumulates only the gradients it needs. Communication cost is similar to all-reduce.

ZeRO-3 (equivalent to PyTorch FSDP) shards everything — the full \(16P\) becomes \(16P/N\) per GPU. The cost: parameters must be all-gathered before each layer’s forward and backward pass, and freed immediately after. This turns every layer into a communication event.

\[\text{ZeRO-3 per-GPU model memory} = \frac{16P}{N} \text{ bytes}\]

Practical notes. EleutherAI reports that ZeRO-3 is “too communication-heavy at large scales” and prefers ZeRO-1 combined with tensor and pipeline parallelism. ZeRO-1 is the default for most training runs because it targets the largest memory consumer (optimizer states) with minimal overhead. ZeRO-3/FSDP shines when GPU count is moderate and interconnect is fast (e.g., 8 GPUs within a single node on NVLink).

ZeRO(零冗余优化器)的核心洞察是:普通 DP 是浪费的——每个 GPU 存储了一份完全相同的优化器状态、梯度和参数副本。ZeRO 将这些在 \(N\) 个数据并行 GPU 间分片,分三个递进阶段:

| 阶段 | 分片内容 | 每 GPU 内存 | 每步通信 |
|---|---|---|---|
| ZeRO-1 | 优化器状态 | \(2P + 2P + 12P/N + \text{Act}\) | 与 DP 相同(梯度 all-reduce) |
| ZeRO-2 | 优化器状态 + 梯度 | \(2P + 2P/N + 12P/N + \text{Act}\) | Reduce-scatter 梯度(开销与 all-reduce 相近) |
| ZeRO-3 | 优化器状态 + 梯度 + 参数 | \(2P/N + 2P/N + 12P/N + \text{Act}\) | 每层前向/反向前 all-gather 参数 |

ZeRO-1 几乎免费:跨 \(N\) 个 GPU 分片优化器状态将主导项 \(12P\) 降至 \(12P/N\),无需标准梯度 all-reduce 之外的额外通信。对于 8 GPU 上的 7B 模型,优化器内存从 84 GB 降至每 GPU 约 10.5 GB。

ZeRO-2 额外分片梯度。每个 GPU 仅存储其分片参数的梯度,然后 reduce-scatter(而非 all-reduce),使每个 rank 仅累积自己需要的梯度。通信开销与 all-reduce 相近。

ZeRO-3(等价于 PyTorch FSDP)分片一切——完整的 \(16P\) 变为每 GPU \(16P/N\)。代价是:参数必须在每层的前向和反向传播前 all-gather,之后立即释放。这使每层都成为一次通信事件。

\[\text{ZeRO-3 per-GPU model memory} = \frac{16P}{N} \text{ bytes}\]

实践说明。 EleutherAI 指出 ZeRO-3 在大规模下"通信开销过重",更倾向于 ZeRO-1 结合张量并行和流水线并行。ZeRO-1 是大多数训练的默认选择,因为它以最小开销瞄准最大内存消耗者(优化器状态)。ZeRO-3/FSDP 在 GPU 数量适中且互联快速时表现最佳(如单节点内 8 个 NVLink GPU)。
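
Wrapping a model in FSDP is the PyTorch-native way to get ZeRO-3-style sharding; the sketch below shows the basic pattern, with `MyTransformer` as a placeholder module and the per-block auto-wrap policy omitted for brevity.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Parameters, gradients, and optimizer states are sharded across the
# data-parallel group; each layer's weights are all-gathered just before its
# forward/backward pass and freed immediately afterwards.
dist.init_process_group(backend="nccl")
model = MyTransformer().cuda()     # MyTransformer is a placeholder module
model = FSDP(model)                # in practice, pass an auto-wrap policy per block
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```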

Tensor Parallelism (TP)

Idea. Split individual weight matrices across \(t\) GPUs so that each GPU computes a slice of every layer. For a linear layer \(Y = XW\), the weight \(W\) is column- or row-split, each GPU computes its portion, and the results are combined via all-reduce or all-gather.

Per-GPU memory:

\[\frac{2P}{t} + \frac{12P}{t} + \frac{2P}{t} + \text{Act}(b, s, t) = \frac{16P}{t} + \text{Act}\]

where activations are also partially reduced — the \(24/t\) and \(5as/(dt)\) terms in the activation formula reflect this splitting.

Communication. Two all-reduce operations per transformer layer (one in the attention block, one in the MLP), each communicating \(O(bsd)\) activation tensors. This happens on the critical path — computation cannot proceed until the all-reduce completes. This is why TP requires NVLink (~900 GB/s) rather than PCIe (~64 GB/s) or network interconnects.

Typical TP degree. TP is usually set to the number of GPUs within a single node (e.g., \(t = 8\) for an 8-GPU node with NVLink). Going beyond a node boundary is impractical because the inter-node bandwidth is too low for the frequent all-reduces.

思路。 将各权重矩阵切分到 \(t\) 个 GPU 上,使每个 GPU 计算每层的一个切片。对于线性层 \(Y = XW\),权重 \(W\) 按列或行切分,每个 GPU 计算其部分,结果通过 all-reduce 或 all-gather 合并。

每 GPU 内存:

\[\frac{2P}{t} + \frac{12P}{t} + \frac{2P}{t} + \text{Act}(b, s, t) = \frac{16P}{t} + \text{Act}\]

激活也被部分切分——激活公式中的 \(24/t\) 和 \(5as/(dt)\) 项反映了这一切分。

通信。 每个 Transformer 层两次 all-reduce(注意力块一次,MLP 一次),每次传输 \(O(bsd)\) 的激活张量。这发生在关键路径上——计算在 all-reduce 完成前无法继续。这就是 TP 需要 NVLink(约 900 GB/s)而非 PCIe(约 64 GB/s)或网络互联的原因。

典型 TP 度数。 TP 通常设为单节点内的 GPU 数量(如 8-GPU NVLink 节点上 \(t = 8\))。跨节点不切实际,因为节点间带宽对于频繁的 all-reduce 而言太低。
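
To illustrate the mechanics, here is a forward-only sketch of a column-split linear layer; real implementations (e.g., Megatron-LM) wrap the collectives in custom autograd functions and pair column- and row-parallel layers so only one all-reduce is needed per block. The class and argument names are mine.

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Illustrative column-split linear layer: Y = X W with W split by columns
    across t tensor-parallel ranks. Forward-only sketch; gradients through the
    collective need an autograd-aware wrapper in a real implementation.
    """
    def __init__(self, d_in, d_out, tp_group):
        super().__init__()
        self.tp_group = tp_group
        t = dist.get_world_size(tp_group)
        self.weight = torch.nn.Parameter(torch.empty(d_in, d_out // t))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        y_local = x @ self.weight                     # [s, b, d_out / t] on this rank
        parts = [torch.empty_like(y_local)
                 for _ in range(dist.get_world_size(self.tp_group))]
        dist.all_gather(parts, y_local, group=self.tp_group)
        return torch.cat(parts, dim=-1)               # [s, b, d_out]
```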

Pipeline Parallelism (PP)

Idea. Partition the model's \(n_{\text{layers}}\) layers into \(p\) stages, assigning consecutive layers to different GPUs. GPU 1 runs layers \(1\) through \(n_{\text{layers}}/p\), GPU 2 runs layers \(n_{\text{layers}}/p + 1\) through \(2n_{\text{layers}}/p\), and so on.

Per-GPU memory:

\[\frac{2P}{p} + \frac{12P}{p} + \frac{2P}{p} + \text{Act} = \frac{16P}{p} + \text{Act}\]

Each GPU only stores parameters, optimizer states, and gradients for its assigned layers. Activation memory depends on the schedule (see below).

Communication. Point-to-point sends of activation tensors (\(O(bsd)\)) between adjacent stages — much less volume than TP’s all-reduces, and tolerant of lower-bandwidth interconnects.

The bubble problem. With naive sequential execution, only one stage is active at a time — the other \(p - 1\) GPUs are idle. The pipeline bubble is the fraction of time wasted:

\[\text{Bubble fraction} = \frac{p - 1}{m + p - 1}\]

where \(m\) is the number of microbatches. To keep the bubble small (say, < 5%), you need \(m \gg p\). GPipe and PipeDream use different strategies:

  • GPipe: Runs all \(m\) microbatch forward passes, then all \(m\) backward passes. Simple, but all activations for all microbatches must be held simultaneously, increasing memory by a factor of \(m\).
  • 1F1B (PipeDream): Interleaves forward and backward passes so that each GPU holds activations for at most \(p\) microbatches (instead of \(m\)). This significantly reduces activation memory at the cost of more complex scheduling.

思路。 将模型的 \(n_{\text{layers}}\) 层划分为 \(p\) 个阶段,将连续层分配到不同 GPU。GPU 1 运行第 \(1\) 至 \(n_{\text{layers}}/p\) 层,GPU 2 运行第 \(n_{\text{layers}}/p + 1\) 至 \(2n_{\text{layers}}/p\) 层,依此类推。

每 GPU 内存:

\[\frac{2P}{p} + \frac{12P}{p} + \frac{2P}{p} + \text{Act} = \frac{16P}{p} + \text{Act}\]

每个 GPU 仅存储其分配层的参数、优化器状态和梯度。激活内存取决于调度方式(见下文)。

通信。 相邻阶段间点对点发送激活张量(\(O(bsd)\))——数据量远小于 TP 的 all-reduce,且可容忍较低带宽的互联。

气泡问题。 在朴素顺序执行下,同一时刻仅一个阶段活跃——其余 \(p - 1\) 个 GPU 空闲。流水线气泡是浪费的时间比例:

\[\text{Bubble fraction} = \frac{p - 1}{m + p - 1}\]

其中 \(m\) 为微批量数。要使气泡足够小(如 < 5%),需要 \(m \gg p\)。GPipe 和 PipeDream 采用不同策略:

  • GPipe:先运行所有 \(m\) 个微批量的前向传播,再运行所有 \(m\) 个反向传播。简单,但所有微批量的激活必须同时保留,内存增加 \(m\) 倍。
  • 1F1B (PipeDream):交替前向和反向传播,使每个 GPU 最多同时持有 \(p\) 个微批量(而非 \(m\) 个)的激活。以更复杂的调度为代价,显著减少激活内存。
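
The bubble formula is worth playing with; the short sketch below prints the bubble fraction for a fixed number of stages as the microbatch count grows, along with the number of in-flight microbatches each schedule must hold activations for.

```python
def bubble_fraction(p, m):
    """Fraction of pipeline time wasted in the bubble: (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)

# With p = 8 stages, the bubble shrinks as microbatches increase; activation
# memory per stage scales with m for GPipe but is capped near p for 1F1B.
p = 8
for m in (8, 32, 128):
    print(f"m={m:4d}  bubble={bubble_fraction(p, m):.1%}  "
          f"in-flight microbatches: GPipe={m}, 1F1B<={min(m, p)}")
```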

Combining Strategies: 3D Parallelism

In practice, large-scale training combines all three: TP within a node, PP across nodes, and DP (with ZeRO-1) across pipeline-parallel replicas. For \(N\) total GPUs with TP degree \(t\), PP degree \(p\), and DP degree \(d = N / (t \cdot p)\):

\[\text{Per-GPU model memory} \approx \frac{16P}{t \cdot p} - \frac{12P}{t \cdot p}\left(1 - \frac{1}{d}\right) = \frac{4P}{t \cdot p} + \frac{12P}{t \cdot p \cdot d}\]

More concretely with ZeRO-1:

\[\text{Per-GPU memory} = \frac{2P}{t \cdot p} + \frac{12P}{t \cdot p \cdot d} + \frac{2P}{t \cdot p} + \text{Act}(b_{\text{local}}, s, t)\]

Example: 70B model on 64 GPUs (8 nodes x 8 GPUs/node):

  • \(t = 8\) (TP within each node), \(p = 4\) (PP across 4 nodes), \(d = 2\) (2 DP replicas)
  • Parameters per GPU: \(2 \times 70\text{B} / (8 \times 4) = 4.4\) GB
  • Optimizer per GPU: \(12 \times 70\text{B} / (8 \times 4 \times 2) = 13.1\) GB
  • Gradients per GPU: \(2 \times 70\text{B} / (8 \times 4) = 4.4\) GB
  • Model-related total: ~21.9 GB per GPU — comfortably fits on an 80 GB A100

| Strategy | Splits | Communication | Interconnect | Reduces |
|---|---|---|---|---|
| DP | Data | All-reduce gradients | Any | Activation memory (via smaller local batch) |
| ZeRO-1 | Optimizer states | Same as DP | Any | Optimizer memory |
| ZeRO-3 / FSDP | Everything | All-gather per layer | NVLink preferred | All model memory |
| TP | Weight matrices | All-reduce per layer | NVLink required | All model memory + activations |
| PP | Layers | Point-to-point activations | Network OK | All model memory |

实践中,大规模训练组合三者:节点内 TP,节点间 PP,流水线并行副本间 DP(配合 ZeRO-1)。对于总共 \(N\) 个 GPU,TP 度 \(t\),PP 度 \(p\),DP 度 \(d = N / (t \cdot p)\):

\[\text{Per-GPU model memory} \approx \frac{16P}{t \cdot p} - \frac{12P}{t \cdot p}\left(1 - \frac{1}{d}\right) = \frac{4P}{t \cdot p} + \frac{12P}{t \cdot p \cdot d}\]

更具体地,配合 ZeRO-1:

\[\text{Per-GPU memory} = \frac{2P}{t \cdot p} + \frac{12P}{t \cdot p \cdot d} + \frac{2P}{t \cdot p} + \text{Act}(b_{\text{local}}, s, t)\]

示例:64 GPU 上的 70B 模型(8 节点 x 每节点 8 GPU):

  • \(t = 8\)(节点内 TP),\(p = 4\)(跨 4 节点 PP),\(d = 2\)(2 个 DP 副本)
  • 每 GPU 参数:\(2 \times 70\text{B} / (8 \times 4) = 4.4\) GB
  • 每 GPU 优化器:\(12 \times 70\text{B} / (8 \times 4 \times 2) = 13.1\) GB
  • 每 GPU 梯度:\(2 \times 70\text{B} / (8 \times 4) = 4.4\) GB
  • 模型相关总计:每 GPU 约 21.9 GB——在 80 GB A100 上轻松放下

| 策略 | 切分对象 | 通信 | 互联需求 | 减少的内存 |
|---|---|---|---|---|
| DP | 数据 | All-reduce 梯度 | 任意 | 激活内存(通过缩小本地批量) |
| ZeRO-1 | 优化器状态 | 与 DP 相同 | 任意 | 优化器内存 |
| ZeRO-3 / FSDP | 所有 | 每层 all-gather | 首选 NVLink | 所有模型内存 |
| TP | 权重矩阵 | 每层 all-reduce | 需要 NVLink | 所有模型内存 + 激活 |
| PP | 层 | 点对点激活 | 网络即可 | 所有模型内存 |
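
The arithmetic in the example above is easy to generalize; the sketch below (function name and argument conventions are mine) computes per-GPU model-state memory for any \((t, p, d)\) split, with or without ZeRO-1.

```python
def per_gpu_model_memory_gb(P, t, p, d, zero1=True):
    """Per-GPU params + optimizer + gradients under TP x PP x DP, in GB.

    ZeRO-1 additionally shards the 12P optimizer term across the DP degree d.
    Activations are excluded; they depend on the microbatch and schedule.
    """
    params = 2 * P / (t * p)
    grads = 2 * P / (t * p)
    opt = 12 * P / (t * p * (d if zero1 else 1))
    return (params + grads + opt) / 1e9

# The 70B example above: 64 GPUs split as t=8, p=4, d=2.
print(f"{per_gpu_model_memory_gb(70e9, t=8, p=4, d=2):.1f} GB per GPU")
```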

Checkpointing to Disk

GPU memory is volatile — if a node crashes, a job gets preempted, or you simply want to pause and resume later, everything on the GPU is lost. Checkpointing saves the training state to persistent storage so that a run can be restarted from where it left off without retraining from scratch.

A complete checkpoint must contain everything needed to produce bit-identical training dynamics from the point of save. This means more than just the model weights:

| Component | What it contains | Size (7B model, AdamW, bf16) | Why it's needed |
|---|---|---|---|
| Model parameters (fp32 master) | The fp32 master copy of all weights | \(4P = 28\) GB | The authoritative weights; bf16 copies are derived from these |
| Optimizer states | First moment \(m_t\), second moment \(v_t\), step count | \(8P = 56\) GB | Without these, the optimizer restarts with zero momentum/variance, causing a loss spike and effectively wasting the warmup |
| Learning rate scheduler state | Current step, warmup progress, decay schedule | Negligible | Ensures the learning rate continues from the correct position |
| RNG states | Random seeds for all GPUs, dropout masks, data shuffling | Negligible | Required for exact reproducibility; without these, the resumed run diverges from the original |
| Data loader state | Current epoch, sample index, shuffle order | Small | Prevents re-training on already-seen data or skipping unseen data |
| Gradient scaler state (fp16 only) | Current loss scale, backoff count | Negligible | fp16 training uses dynamic loss scaling; resetting it causes unnecessary scale search |

GPU 内存是易失的——节点崩溃、任务被抢占或仅仅想暂停后恢复,GPU 上的一切都会丢失。检查点将训练状态保存到持久存储,使训练可以从中断处重启而无需从头训练。

完整的检查点必须包含从保存点起产生比特一致训练动态所需的一切。这不仅仅是模型权重:

| 组件 | 内容 | 大小(7B 模型,AdamW,bf16) | 为何需要 |
|---|---|---|---|
| 模型参数(fp32 主副本) | 所有权重的 fp32 主副本 | \(4P = 28\) GB | 权威权重;bf16 副本由此派生 |
| 优化器状态 | 一阶矩 \(m_t\)、二阶矩 \(v_t\)、步数 | \(8P = 56\) GB | 缺失则优化器以零动量/方差重启,导致 loss 尖峰,实际上浪费了 warmup |
| 学习率调度器状态 | 当前步数、warmup 进度、衰减计划 | 可忽略 | 确保学习率从正确位置继续 |
| RNG 状态 | 所有 GPU 的随机种子、dropout 掩码、数据打乱顺序 | 可忽略 | 精确可复现性所需;缺失则恢复的运行偏离原始轨迹 |
| 数据加载器状态 | 当前 epoch、样本索引、打乱顺序 | 较小 | 防止重复训练已见数据或跳过未见数据 |
| 梯度缩放器状态(仅 fp16) | 当前损失缩放值、退避计数 | 可忽略 | fp16 训练使用动态损失缩放;重置会导致不必要的缩放搜索 |
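
In code, a complete checkpoint is just these pieces gathered into one state dict; the sketch below shows the single-process case with illustrative names, whereas a distributed run would save per-rank shards via `torch.distributed.checkpoint`.

```python
import torch

def save_checkpoint(path, step, model, optimizer, scheduler, scaler=None):
    """Sketch of a complete training checkpoint (single-process case)."""
    state = {
        "step": step,
        "model": model.state_dict(),            # fp32 master weights
        "optimizer": optimizer.state_dict(),    # m, v, step counts
        "scheduler": scheduler.state_dict(),    # LR schedule position
        "rng": {
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all(),
        },
        # "dataloader": sampler.state_dict(),   # if the sampler supports resumption
    }
    if scaler is not None:                      # fp16 only: dynamic loss scale
        state["scaler"] = scaler.state_dict()
    torch.save(state, path)
```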

What You Can Skip

Gradients do not need to be saved. They are recomputed from scratch at the start of each training step. Saving mid-step (between microbatches during gradient accumulation) would require saving the partially accumulated gradients, but this is rarely done — it’s simpler to checkpoint only at step boundaries.

Activations do not need to be saved. They are transient, created during the forward pass and consumed during the backward pass within a single step.

The bf16 model copy does not need to be saved. It is deterministically derived from the fp32 master weights by casting.

梯度无需保存。它们在每个训练步开始时从头计算。在步中(梯度累积的微批量之间)保存需要存储部分累积的梯度,但这很少做到——仅在步边界检查点更简单。

激活值无需保存。它们是瞬态的,在前向传播中创建,在同一步的反向传播中消耗。

bf16 模型副本无需保存。它通过对 fp32 主权重做类型转换确定性地派生。

Checkpoint Size

The dominant cost is model parameters + optimizer states. For AdamW in mixed precision:

\[\text{Checkpoint size} \approx 4P + 8P = 12P \text{ bytes}\]

For a 7B model, that’s ~84 GB per checkpoint. A 70B model produces ~840 GB checkpoints. At typical save frequencies (every few hundred steps), this accumulates quickly — a 70B training run saving every 500 steps for 100K steps produces ~168 TB of checkpoints if none are pruned.

主要开销是模型参数 + 优化器状态。对于混合精度下的 AdamW:

\[\text{Checkpoint size} \approx 4P + 8P = 12P \text{ bytes}\]

对于 7B 模型,每个检查点约 84 GB。70B 模型产生约 840 GB 的检查点。按典型保存频率(每几百步),这会迅速积累——70B 训练每 500 步保存、共 100K 步,若不裁剪将产生约 168 TB 检查点。

Strategies to Reduce Checkpoint Cost

Async checkpointing. Saving 84 GB to networked storage (e.g., NFS, S3) can take minutes. Synchronous saves stall training. Modern frameworks (PyTorch’s torch.distributed.checkpoint, DeepSpeed) copy the state to CPU memory asynchronously and write to disk in a background thread, overlapping I/O with the next training step.

Sharded checkpointing. With data or model parallelism, each GPU saves only its own shard of the state. This parallelizes the I/O across all nodes and avoids gathering the full state onto a single machine. The downside is that loading requires the same parallelism configuration — resharding is needed if you change the number of GPUs.

Save only what changed. Some systems support incremental or delta checkpoints, saving only the difference from the previous checkpoint. This is most useful when checkpoints are frequent and the model changes slowly between saves.

Pruning old checkpoints. Keep the last \(k\) checkpoints and delete older ones. Optionally keep “milestone” checkpoints at longer intervals (e.g., every 10K steps) for evaluation or fallback.

异步检查点。 将 84 GB 保存到网络存储(如 NFS、S3)可能需要数分钟。同步保存会阻塞训练。现代框架(PyTorch 的 torch.distributed.checkpoint、DeepSpeed)异步地将状态复制到 CPU 内存,并在后台线程中写入磁盘,将 I/O 与下一训练步重叠。

分片检查点。 在数据或模型并行下,每个 GPU 仅保存自己的状态分片。这将 I/O 并行化到所有节点,避免将完整状态聚集到单台机器。缺点是加载需要相同的并行配置——更改 GPU 数量需要重新分片。

仅保存变化部分。 一些系统支持增量或差分检查点,仅保存与上一检查点的差异。当检查点频繁且模型在保存间变化缓慢时最为有用。

裁剪旧检查点。 保留最近 \(k\) 个检查点,删除更早的。可选择以更长间隔(如每 10K 步)保留"里程碑"检查点用于评估或回退。
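
A minimal sketch of the async-save-plus-prune idea, assuming a flat dict of tensors and zero-padded checkpoint filenames; production systems (torch.distributed.checkpoint, DeepSpeed) handle nested state dicts, shards, and failure cases far more carefully.

```python
import os
import threading
import torch

def async_save(state, path, keep_last=3):
    """Copy state to CPU, then write and prune in a background thread.

    Assumes `state` is a flat dict of tensors and that checkpoint filenames
    sort chronologically (e.g., zero-padded step numbers ending in .pt).
    """
    cpu_state = {k: v.cpu() if torch.is_tensor(v) else v for k, v in state.items()}
    ckpt_dir = os.path.dirname(path) or "."

    def _write():
        torch.save(cpu_state, path)                       # I/O off the critical path
        ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".pt"))
        for old in ckpts[:-keep_last]:                    # keep only the newest k
            os.remove(os.path.join(ckpt_dir, old))

    threading.Thread(target=_write, daemon=True).start()  # training continues meanwhile
```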