Backpropagation
What Is Backpropagation?
Backpropagation is the algorithm that computes the gradient of a loss function with respect to model parameters. It is the foundation of training neural networks. The core idea is simple: apply the chain rule systematically through a computation graph.
Computation Graphs
Any differentiable program can be represented as a computation graph: a directed acyclic graph (DAG) in which each node is a variable and each edge records that one variable is a direct input to the operation producing another.
Consider a simple example: \(L = (wx + b - y)^2\), where \(w, b\) are parameters and \(x, y\) are data. The computation graph chains four operations: a multiply (\(z_1 = wx\)), an add (\(z_2 = z_1 + b\)), a subtract (\(z_3 = z_2 - y\)), and a square (\(L = z_3^2\)).
Each node stores both its value (computed in the forward pass) and a recipe for computing derivatives (used in the backward pass).
The interactive figure above lets you explore the computation graph for \(L = (wx + b - y)^2\). Switch between Graph (structure), Forward Pass (compute values), and Backward Pass (propagate gradients). Adjust the sliders to see how different inputs change the values and gradients.
Forward Pass
The forward pass evaluates the computation graph from inputs to output, computing the value of each intermediate variable in topological order:
- \[z_1 = wx\]
- \[z_2 = z_1 + b\]
- \[z_3 = z_2 - y\]
- \[L = z_3^2\]
This is simply “running the program.” Each operation stores its inputs for later use in the backward pass.
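The four steps above can be sketched in plain Python. This is only a minimal illustration: the function name `forward` and the `cache` dictionary are assumptions made for this example, not part of any library.

```python
def forward(w, x, b, y):
    """Evaluate L = (w*x + b - y)**2 one node at a time, in topological order.

    Returns the loss together with every intermediate value, because the
    backward pass will need them later.
    """
    z1 = w * x        # multiply node
    z2 = z1 + b       # add node
    z3 = z2 - y       # subtract node (the residual)
    L = z3 ** 2       # square node (the loss)
    cache = {"w": w, "x": x, "b": b, "y": y, "z1": z1, "z2": z2, "z3": z3}
    return L, cache


L, cache = forward(w=2.0, x=3.0, b=1.0, y=5.0)
print(L, cache)
```

The `cache` plays the role of the stored inputs mentioned above: nothing in it is recomputed during the backward pass.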
Backward Pass (Backpropagation)
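The backward pass computes \(\frac{\partial L}{\partial \theta}\) for every parameter \(\theta\) by applying the chain rule in reverse topological order. Starting from \(\frac{\partial L}{\partial L} = 1\), we propagate gradients backward through each node. For the running example \(L = (wx + b - y)^2\):
- \[\frac{\partial L}{\partial z_3} = 2 z_3\]
- \[\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial z_3} \cdot 1, \qquad \frac{\partial L}{\partial y} = \frac{\partial L}{\partial z_3} \cdot (-1)\]
- \[\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_2} \cdot 1, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z_2} \cdot 1\]
- \[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z_1} \cdot x, \qquad \frac{\partial L}{\partial x} = \frac{\partial L}{\partial z_1} \cdot w\]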
The key principle: at each node, multiply the incoming gradient by the local derivative. If node \(z\) computes \(z = f(u)\), and we already know \(\frac{\partial L}{\partial z}\) (the incoming gradient), then:
\[\frac{\partial L}{\partial u} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial u} = \frac{\partial L}{\partial z}\, f'(u)\]
This is the chain rule. The efficiency of backpropagation comes from the fact that each intermediate gradient \(\frac{\partial L}{\partial z}\) is computed once and then reused by every node that feeds into \(z\), so no work is repeated.
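As a rough sketch of this rule applied to the running example \(L = (wx + b - y)^2\), the code below writes out the backward pass by hand and compares one result against a central finite difference. The helper names `forward_backward` and `numeric_grad_w` are hypothetical, chosen only for this illustration.

```python
def forward_backward(w, x, b, y):
    """Hand-written forward and backward pass for L = (w*x + b - y)**2."""
    # Forward pass: compute and keep every intermediate value.
    z1 = w * x
    z2 = z1 + b
    z3 = z2 - y
    L = z3 ** 2

    # Backward pass: start from dL/dL = 1 and walk the graph in reverse,
    # multiplying each incoming gradient by the local derivative.
    dL_dL = 1.0
    dL_dz3 = dL_dL * 2.0 * z3     # L = z3**2
    dL_dz2 = dL_dz3 * 1.0         # z3 = z2 - y
    dL_dy = dL_dz3 * -1.0
    dL_dz1 = dL_dz2 * 1.0         # z2 = z1 + b
    dL_db = dL_dz2 * 1.0
    dL_dw = dL_dz1 * x            # z1 = w * x
    dL_dx = dL_dz1 * w
    return L, {"w": dL_dw, "x": dL_dx, "b": dL_db, "y": dL_dy}


def numeric_grad_w(w, x, b, y, h=1e-6):
    """Central finite-difference estimate of dL/dw, used only as a check."""
    L_plus, _ = forward_backward(w + h, x, b, y)
    L_minus, _ = forward_backward(w - h, x, b, y)
    return (L_plus - L_minus) / (2.0 * h)


L, grads = forward_backward(w=2.0, x=3.0, b=1.0, y=5.0)
print(grads, numeric_grad_w(2.0, 3.0, 1.0, 5.0))
```

Note that `dL_dz3` is computed once and then feeds every later line, which is the reuse described above.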
Multivariate Chain Rule and Fan-Out
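When a variable \(u\) feeds into multiple downstream nodes \(z_1 = f_1(u), z_2 = f_2(u), \ldots\), the gradients add up:
\[\frac{\partial L}{\partial u} = \sum_i \frac{\partial L}{\partial z_i} \cdot \frac{\partial z_i}{\partial u}\]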
This is the multivariate chain rule. In a neural network, the parameters \(\theta\) of a shared layer may influence the loss through many paths — backpropagation correctly accumulates all contributions.
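One way to see the accumulation is a tiny scalar reverse-mode autodiff sketch. The `Var` class below is a hypothetical teaching aid written for this example (it is not the API of any framework); the point is the `+=` in `backward`, which sums contributions from every path.

```python
class Var:
    """Scalar node in a tiny reverse-mode autodiff graph (illustrative only)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_node, local derivative of self w.r.t. parent).
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Build a topological order with a depth-first search.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0  # seed: dL/dL = 1
        for node in reversed(order):
            for parent, local in node.parents:
                # Fan-out: contributions from different paths are summed.
                parent.grad += node.grad * local


u = Var(3.0)
c = Var(5.0)
z1 = u * u        # first path:  z1 = u**2
z2 = u * c        # second path: z2 = 5 * u
L = z1 + z2
L.backward()
print(u.grad)     # analytically, dL/du = 2*u + 5
```

Replacing `+=` with `=` would keep only the last path visited and silently drop the others, which is the bug the multivariate chain rule guards against.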
This figure animates three key concepts: the Chain Rule (gradient flows backward, multiplying local derivatives), Fan-Out (gradients sum when a variable feeds multiple paths), and Stop-Gradient (blocking gradient flow through selected variables).
Stop-Gradient (Detach)
In practice, we sometimes want to block gradient flow through certain variables. The stop-gradient operation \(\mathrm{sg}(u)\) acts as the identity in the forward pass (\(\mathrm{sg}(u) = u\)) but has zero derivative in the backward pass (\(\frac{\partial\, \mathrm{sg}(u)}{\partial u} = 0\)). This means any gradient flowing backward is killed at this node.
This is useful when a computed quantity should be treated as a fixed constant during optimization, even though it was derived from the parameters. For example, in reinforcement learning, the advantage estimate or the importance weight may depend on \(\theta\), but we do not want to differentiate through them — only through the log-probability \(\ln \pi_\theta\).
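A small numerical sketch can make "treated as a fixed constant" concrete. It assumes the toy loss \(L = \mathrm{sg}(u)\cdot u\) (an example chosen here, not taken from the text) and emulates stop-gradient by freezing a copy of \(u\) before differentiating with a central finite difference.

```python
def central_diff(f, u, h=1e-6):
    """Central finite-difference estimate of df/du."""
    return (f(u + h) - f(u - h)) / (2.0 * h)


def grad_with_stop_gradient(u):
    """Gradient of L = sg(u) * u at the point u."""
    frozen = u  # sg(u): same value as u, but held constant while differentiating

    def loss(v):
        return frozen * v

    return central_diff(loss, u)


def grad_without_stop_gradient(u):
    """Gradient of L = u * u at the point u, with both factors differentiated."""
    def loss(v):
        return v * v

    return central_diff(loss, u)


u = 3.0
g_sg = grad_with_stop_gradient(u)
g_full = grad_without_stop_gradient(u)
print(g_sg, g_full)
```

Both losses have the same forward value \(u^2\) at the evaluation point, but the gradient with stop-gradient is \(u\) rather than \(2u\): the path through the frozen factor contributes nothing.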
Does the Gradient Depend on \(L\)'s Value?
Looking at the backward pass in the computation graph above, notice that none of the gradient arrows reference \(L\) itself. The local derivative at the loss node is \(\partial L / \partial z_3 = 2 z_3\) — it depends on \(z_3\), not on \(L\). Every downstream gradient (\(\partial L / \partial w\), \(\partial L / \partial x\), etc.) is derived from this via the chain rule, so none of them contain \(L\) either.
More generally, the backward pass never uses \(L\)’s numerical value. What it uses is:
- Intermediate activations saved from the forward pass (\(z_1, z_2, z_3\) and the inputs \(w, x, b, y\));
- Local derivatives at each node, which are functions of that node’s inputs.
The scalar \(L\) only appears as the seed of the chain: we initialize \(\partial L / \partial L = 1\) and propagate. Its magnitude never enters any multiplication along the way.
Two practical consequences:
- Adding a constant is free. \(L' = L + C\) produces identical gradients for any constant \(C\), because the local derivative at “\(+ C\)” is \(1\) and nothing else changes.
- \(L\) and its gradient are not fully independent in practice. For \(L = z_3^2\), a large \(L\) does imply a large \(\vert z_3 \vert\) and hence a large gradient. But the mechanism goes through \(z_3\), not through \(L\) directly.
This decoupling is why PyTorch’s .backward() never needs to know the loss’s numerical value — it only needs the saved activations and the local derivatives.
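The sketch below illustrates the same point in plain Python for \(L = (wx + b - y)^2 + C\). The `forward`/`backward` split and the `cache` dictionary are assumptions of this example; the only thing it is meant to show is that `backward` receives saved activations and never the loss value.

```python
def forward(w, x, b, y, C=0.0):
    """Forward pass for L = (w*x + b - y)**2 + C; returns loss and saved activations."""
    z3 = w * x + b - y
    L = z3 ** 2 + C
    cache = {"w": w, "x": x, "z3": z3}   # note: L itself is not saved
    return L, cache


def backward(cache):
    """Backward pass that sees only the cache, never the loss value."""
    dL_dz3 = 1.0 * 2.0 * cache["z3"]     # seed 1 times the local derivative 2*z3
    return {
        "w": dL_dz3 * cache["x"],
        "b": dL_dz3,
        "x": dL_dz3 * cache["w"],
    }


L_a, cache_a = forward(2.0, 3.0, 1.0, 5.0)
L_b, cache_b = forward(2.0, 3.0, 1.0, 5.0, C=100.0)
print(L_a, L_b)
print(backward(cache_a) == backward(cache_b))
```

The two losses differ by 100, yet the caches, and therefore the gradients, are the same.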
If the Loss Value Doesn't Matter, Why Do We Want It to Decrease?
The two statements sound contradictory, but they concern different things:
- “\(L\)’s value doesn’t enter the gradient computation” is a claim about the mechanics of backprop. Numerically, \(L\) never multiplies into any backward-flowing quantity.
- “We want \(L\) to go down” is a claim about the objective of optimization. We deliberately chose this \(L\) as our proxy for model quality (MSE encodes prediction error, cross-entropy encodes distribution mismatch, etc.), so reducing \(L\) is literally what “training” means.
The gradient is the tool; the loss is the target. A decreasing loss is our evidence that the tool is working — that gradient descent is actually pushing the target quantity down. When the loss doesn’t decrease, it doesn’t mean the gradient formula is broken; it means the optimization process has problems: step size too large, noise dominating the signal, bad initialization, or a non-convex landscape with a nearby saddle point or plateau.
A subtler point: what’s informative is the trajectory of \(L\), not its absolute value. \(L\) and \(L + 100\) decrease by the same amount, and \(10L\) by ten times as much — all three carry exactly the same information about progress. This is why comparing training curves across papers or loss functions should focus on shape and relative change, not absolute numbers.
Finally, a decreasing training loss is necessary but not sufficient for a better model:
- Overfitting: training loss drops while test loss rises.
- Objective mismatch: the loss may not align with the metric you care about — cross-entropy vs. accuracy, log-likelihood vs. sample quality.
- Reward hacking (in RL): the agent finds a trivial way to drive \(L\) down without solving the task.
So: the gradient formula doesn’t care about \(L\)’s value, but we as practitioners care a lot — the value is our signal that the optimization is doing what we asked.
What Happens to Gradients That Flow to the Inputs?
The computation graph produces gradients w.r.t. every leaf, not just parameters. For \(L = (wx + b - y)^2\) we get \(\partial L/\partial w\) and \(\partial L/\partial b\) (fed to the optimizer) but also \(\partial L/\partial x\) and \(\partial L/\partial y\) — what happens to those?
First, a distinction worth making. “Input gradient” can mean two different things:
- Gradient at the input of an intermediate node (e.g. \(\partial L/\partial z_2\)). This is never discarded — it is the messenger that carries the chain rule upstream. Without it you could not compute \(\partial L/\partial w\) at all. Every intermediate gradient is first used (as input to the next backward step) and only then released.
- Gradient at a non-parameter leaf, typically the data \(x\) or the label \(y\). This is the case the question above is really about.
For the leaf case, in standard supervised training the answer is: discarded, and usually never computed in the first place. PyTorch, for example, runs backprop only toward tensors with requires_grad=True; data and labels default to requires_grad=False, so the backward pass short-circuits past them. This is a pure efficiency optimization that saves a few matmuls per step.
But in many regimes the input gradient is the whole point:
- End-to-end training of composed models. If what looks like an “input” \(x\) to one module is really the output of an upstream module, \(\partial L/\partial x\) is exactly the signal that lets the upstream module be trained. Encoder–decoder, vision–language, and actor–critic architectures all rely on this.
- Adversarial examples and adversarial training. FGSM perturbs the input along \(\mathrm{sign}(\partial L/\partial x)\) to find inputs that fool the model (attack) or to augment training data (defense).
- Feature visualization / activation maximization / DeepDream. Freeze \(\theta\), optimize \(x\) via gradient ascent on a chosen logit or layer activation.
- Gradient-based saliency. \(\vert \partial L/\partial x\vert\) ranks which pixels or tokens mattered most — the starting point for Integrated Gradients, SmoothGrad, and related attribution methods.
- Physics-informed neural networks. Part of the loss is a PDE residual that requires derivatives of the network output w.r.t. spatial/temporal inputs.
So input gradients are “wasted” only in the narrow regime of vanilla supervised learning on fixed data. The moment you want to do anything more — joint training of stacked modules, robustness, interpretability, generative modeling — they stop being waste and become the primary signal.
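As a toy illustration of the adversarial-example case, the sketch below applies one FGSM-style step to the scalar model \(L = (wx + b - y)^2\). The function names and the step size `eps` are assumptions made for this example; real attacks operate on high-dimensional inputs and obtain \(\partial L/\partial x\) from an autodiff framework rather than a hand-derived formula.

```python
def loss(w, x, b, y):
    return (w * x + b - y) ** 2


def input_grad(w, x, b, y):
    """dL/dx for L = (w*x + b - y)**2, derived by the chain rule."""
    return 2.0 * (w * x + b - y) * w


def sign(v):
    return (v > 0) - (v < 0)


def fgsm_step(w, x, b, y, eps):
    """Move the input by eps in the direction that increases the loss."""
    return x + eps * sign(input_grad(w, x, b, y))


w, x, b, y = 2.0, 3.0, 1.0, 5.0
x_adv = fgsm_step(w, x, b, y, eps=0.1)
print(loss(w, x, b, y), loss(w, x_adv, b, y))
```

The parameters \(w, b\) are untouched; only the input moves, and the loss on the perturbed input is larger than on the original.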
Loss Functions with the Same Gradient
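A fundamental consequence of gradient-based optimization is that the loss value itself does not matter; only its gradient does. If two loss functions \(L_1(\theta)\) and \(L_2(\theta)\) satisfy
\[\nabla_\theta L_1(\theta) = \nabla_\theta L_2(\theta) \quad \text{for all } \theta,\]
then gradient descent produces identical parameter trajectories from the same initialization (with learning rate \(\eta\)):
\[\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t)\]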
This update rule depends only on \(\nabla_\theta L\), not on \(L\) itself. Two functions with the same gradient everywhere differ by at most a constant: \(L_1(\theta) = L_2(\theta) + C\). The constant \(C\) does not affect any gradient-based optimizer (SGD, Adam, etc.).
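A quick numerical check of this claim, under assumptions chosen for the example: a one-dimensional quadratic loss, a shifted copy with \(C = 100\), and a finite-difference gradient so that the constant really does pass through the computation. The trajectories should agree up to floating-point rounding.

```python
def numeric_grad(loss, theta, h=1e-4):
    """Central finite-difference estimate of d loss / d theta."""
    return (loss(theta + h) - loss(theta - h)) / (2.0 * h)


def gradient_descent(loss, theta0, lr=0.1, steps=50):
    """Plain gradient descent; returns the full parameter trajectory."""
    theta = theta0
    path = [theta]
    for _ in range(steps):
        theta = theta - lr * numeric_grad(loss, theta)
        path.append(theta)
    return path


def loss_1(theta):
    return (theta - 3.0) ** 2


def loss_2(theta):
    return (theta - 3.0) ** 2 + 100.0   # same gradient, shifted value


path_1 = gradient_descent(loss_1, theta0=0.0)
path_2 = gradient_descent(loss_2, theta0=0.0)
max_gap = max(abs(a - b) for a, b in zip(path_1, path_2))
print(max_gap)
```

The loss curves of the two runs sit 100 apart at every step, while the parameter paths coincide.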
Same Gradient, Different Computation Graph
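A more subtle case arises when two loss functions produce the same gradient under backpropagation but have different computation graphs. Their values need not differ by a constant, and may not agree for any \(\theta\), yet they yield identical optimization trajectories. This does not contradict the statement above: once a graph contains a stop-gradient, the gradient that backpropagation returns is no longer the derivative of the loss value viewed as a plain function of \(\theta\), so the "differ by a constant" argument does not apply.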
Example: PPO vs. REINFORCE formulation. Consider the PPO clipped objective and its REINFORCE reformulation, both written per sample as losses to be minimized. Let \(\rho = \pi_\theta / \pi_{\mathrm{old}}\), let \(\hat{A}\) be the advantage estimate, and let \(\mathrm{clip}(\rho)\) denote \(\rho\) clipped to \([1 - \epsilon,\, 1 + \epsilon]\), where \(\epsilon\) is the clipping range. The PPO form is:
\[L_{\mathrm{PPO}} = -\min\bigl(\rho\hat{A},\ \mathrm{clip}(\rho)\hat{A}\bigr)\]
The REINFORCE form is:
\[L_{\mathrm{RF}} = -\mathrm{sg}(w)\,\hat{A}\,\ln \pi_\theta\]
where \(w = \rho\) when the unclipped term is the active branch of the \(\min\) and \(w = 0\) when the clipped term is active. The weight \(w\) is detached, so it is a constant in the backward pass.
These two losses have different values and different computation graphs:
| | \(L_{\mathrm{PPO}}\) | \(L_{\mathrm{RF}}\) |
|---|---|---|
| Value | \(-\min(\rho\hat{A},\, \mathrm{clip}(\rho)\hat{A})\) | \(-w\hat{A}\ln\pi_\theta\) |
| Gradient path | Through \(\rho = \pi_\theta / \pi_{\mathrm{old}}\) | Through \(\ln \pi_\theta\) (\(w\) is detached) |
| Intermediate variables | \(\rho, \mathrm{clip}(\rho), \min(\cdot)\) | \(w\) (stop-gradient), \(\ln \pi_\theta\) |
Yet their gradients with respect to \(\theta\) are identical. To see why, note that \(\nabla_\theta \rho = \rho\, \nabla_\theta \ln \pi_\theta\), so:
- PPO: when the gradient passes through (not clipped), \(\nabla_\theta L_{\mathrm{PPO}} = -\hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\). When clipped, \(\nabla_\theta L_{\mathrm{PPO}} = 0\).
- REINFORCE: \(\nabla_\theta L_{\mathrm{RF}} = -w\, \hat{A}\, \nabla_\theta \ln \pi_\theta\), where \(w = \rho\) (not clipped) or \(w = 0\) (clipped).
Both give \(-\hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\) or \(0\), in exactly the same cases. Same gradient, different computation graph, identical optimization.
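The equality can be checked numerically on a toy problem. The sketch below makes several assumptions that are not in the text: a one-parameter sigmoid policy, both expressions written as losses to be minimized (so the PPO term carries a minus sign), a clipping range of 0.2, and central finite differences in place of autodiff. The stop-gradient is emulated by computing \(w\) once at the current \(\theta\) and holding it fixed.

```python
import math


def log_pi(theta):
    """Log-probability of the sampled action under a sigmoid policy (toy assumption)."""
    return -math.log1p(math.exp(-theta))


def ratio_and_clip(theta, theta_old, eps):
    rho = math.exp(log_pi(theta) - log_pi(theta_old))
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    return rho, clipped


def ppo_loss(theta, theta_old, adv, eps=0.2):
    rho, clipped = ratio_and_clip(theta, theta_old, eps)
    return -min(rho * adv, clipped * adv)


def reinforce_weight(theta, theta_old, adv, eps=0.2):
    """w = rho where the unclipped branch of the min is active, else 0."""
    rho, clipped = ratio_and_clip(theta, theta_old, eps)
    return rho if rho * adv <= clipped * adv else 0.0


def reinforce_loss(theta, w, adv):
    # w is passed in as a plain number, i.e. it is already detached.
    return -w * adv * log_pi(theta)


def central_diff(f, theta, h=1e-6):
    return (f(theta + h) - f(theta - h)) / (2.0 * h)


def compare(theta, theta_old, adv):
    """Return (PPO gradient, REINFORCE gradient) at theta."""
    g_ppo = central_diff(lambda t: ppo_loss(t, theta_old, adv), theta)
    w = reinforce_weight(theta, theta_old, adv)   # frozen at the current theta
    g_rf = central_diff(lambda t: reinforce_loss(t, w, adv), theta)
    return g_ppo, g_rf


cases = [
    (0.3, 0.0, 1.0),    # ratio inside the clip range
    (1.0, 0.0, 1.0),    # ratio above the range, positive advantage: clipped
    (1.0, 0.0, -1.0),   # ratio above the range, negative advantage: not clipped
]
for theta, theta_old, adv in cases:
    print(compare(theta, theta_old, adv))
```

The loss values themselves differ in every case; only the gradients agree. The comparison is not meaningful exactly at a clip boundary, where the PPO loss has a kink and the finite difference straddles two branches.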
When Does the Loss Value Matter?
Although the loss value does not affect gradient updates, it can matter for:
- Monitoring: we often track the loss curve to diagnose training. Two equivalent losses may show different curves, which can be confusing if you compare across implementations.
- Learning rate scheduling: some schedulers (e.g., ReduceLROnPlateau) adjust the learning rate based on the loss value. Different-valued but same-gradient losses would trigger different scheduling decisions.
- Early stopping: if you stop training when the loss falls below a threshold, the threshold depends on which formulation you use.
- Mixed-precision training: numerical stability can differ between formulations even if the exact gradients are identical, because intermediate values differ and floating-point arithmetic is not associative.
For pure gradient descent with a fixed learning rate, the loss value is irrelevant. In practice, be aware of these secondary effects when switching between equivalent formulations.