Backpropagation and Computation Graphs
What Is Backpropagation?
Backpropagation is the algorithm that computes the gradient of a loss function with respect to model parameters. It is the foundation of training neural networks. The core idea is simple: apply the chain rule systematically through a computation graph.
Computation Graphs
Any differentiable program can be represented as a computation graph: a directed acyclic graph (DAG) whose nodes are variables and whose edges record the operation that produces each variable from its inputs.
Consider a simple example: \(L = (wx + b - y)^2\), where \(w, b\) are parameters and \(x, y\) are data. The computation graph is:
Each node stores both its value (computed in the forward pass) and a recipe for computing derivatives (used in the backward pass).
The interactive figure above lets you explore the computation graph for \(L = (wx + b - y)^2\). Switch between Graph (structure), Forward Pass (compute values), and Backward Pass (propagate gradients). Adjust the sliders to see how different inputs change the values and gradients.
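The idea that each node stores a value plus a recipe for its derivatives can be made concrete in a few lines of Python. The sketch below is a toy reverse-mode autograd in the spirit of micrograd — the `Node` class and its methods are illustrative assumptions of this note, not a real library API:

```python
class Node:
    """Toy autograd node: stores a value, an accumulated gradient,
    and a closure that pushes gradients to its parents."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents
        self._backward = lambda: None  # leaves have nothing to propagate

    def __add__(self, other):
        out = Node(self.value + other.value, (self, other))
        def bwd():
            self.grad += out.grad      # d(a+b)/da = 1
            other.grad += out.grad     # d(a+b)/db = 1
        out._backward = bwd
        return out

    def __mul__(self, other):
        out = Node(self.value * other.value, (self, other))
        def bwd():
            self.grad += other.value * out.grad  # d(a*b)/da = b
            other.grad += self.value * out.grad  # d(a*b)/db = a
        out._backward = bwd
        return out

    def backward(self):
        # Visit nodes in reverse topological order, applying the chain rule.
        order, seen = [], set()
        def topo(n):
            if id(n) not in seen:
                seen.add(id(n))
                for p in n.parents:
                    topo(p)
                order.append(n)
        topo(self)
        self.grad = 1.0                # dL/dL = 1
        for n in reversed(order):
            n._backward()

# L = (wx + b - y)^2 with w=2, x=3, b=1, y=5 (subtraction written as + (-y))
w, x, b, neg_y = Node(2.0), Node(3.0), Node(1.0), Node(-5.0)
z3 = w * x + b + neg_y
L = z3 * z3
L.backward()
# w.grad == 12.0 and b.grad == 4.0, matching 2(wx+b-y)x and 2(wx+b-y)
```

Note that the gradients are accumulated with `+=` rather than assigned; this is what makes fan-out (discussed below in the multivariate chain rule) come out correctly for free.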
Forward Pass
The forward pass evaluates the computation graph from inputs to output, computing the value of each intermediate variable in topological order:
- \[z_1 = wx\]
- \[z_2 = z_1 + b\]
- \[z_3 = z_2 - y\]
- \[L = z_3^2\]
This is simply “running the program.” Each operation stores its inputs for later use in the backward pass.
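As a concrete sketch (variable names follow the equations above; the input numbers are arbitrary):

```python
# Forward pass for L = (wx + b - y)^2, keeping every intermediate value
# so the backward pass can reuse them later.
def forward(w, x, b, y):
    z1 = w * x       # z1 = w x
    z2 = z1 + b      # z2 = z1 + b
    z3 = z2 - y      # z3 = z2 - y
    L = z3 ** 2      # L = z3^2
    return {"z1": z1, "z2": z2, "z3": z3, "L": L}

vals = forward(w=2.0, x=3.0, b=1.0, y=5.0)
# z1 = 6.0, z2 = 7.0, z3 = 2.0, L = 4.0
```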
Backward Pass (Backpropagation)
The backward pass computes \(\frac{\partial L}{\partial \theta}\) for every parameter \(\theta\) by applying the chain rule in reverse topological order. Starting from \(\frac{\partial L}{\partial L} = 1\), we propagate gradients backward through each node:
- \[\frac{\partial L}{\partial z_3} = 2 z_3\]
- \[\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial z_3}\]
- \[\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_2}, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z_2}\]
- \[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z_1}\, x\]
The key principle: at each node, multiply the incoming gradient by the local derivative. If node \(z\) computes \(z = f(u)\), and we already know \(\frac{\partial L}{\partial z}\) (the incoming gradient), then:
\[\frac{\partial L}{\partial u} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial u} = \frac{\partial L}{\partial z}\, f'(u).\]
This is the chain rule. The efficiency of backpropagation comes from the fact that each intermediate gradient \(\frac{\partial L}{\partial z}\) is computed once and then reused when propagating to every node that feeds into \(z\): no redundant work.
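A hand-rolled backward pass for the same graph makes the "incoming gradient times local derivative" rule explicit (a sketch; names mirror the forward-pass equations):

```python
def backward(w, x, b, y):
    # Forward pass, caching intermediates.
    z1 = w * x
    z2 = z1 + b
    z3 = z2 - y
    # Backward pass in reverse order: incoming gradient * local derivative.
    dL = 1.0              # dL/dL = 1
    dz3 = dL * 2 * z3     # L = z3^2      -> local derivative 2*z3
    dz2 = dz3 * 1.0       # z3 = z2 - y   -> local derivative 1
    dz1 = dz2 * 1.0       # z2 = z1 + b   -> local derivative 1
    db = dz2 * 1.0        # dz2/db = 1
    dw = dz1 * x          # z1 = w * x    -> dz1/dw = x
    return {"dw": dw, "db": db}

grads = backward(w=2.0, x=3.0, b=1.0, y=5.0)
# dw = 12.0, db = 4.0, matching the analytic 2(wx + b - y)x and 2(wx + b - y)
```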
Multivariate Chain Rule and Fan-Out
When a variable \(u\) feeds into multiple downstream nodes \(z_1 = f_1(u), z_2 = f_2(u), \ldots\), the gradients add up:
\[\frac{\partial L}{\partial u} = \sum_i \frac{\partial L}{\partial z_i}\, \frac{\partial z_i}{\partial u} = \sum_i \frac{\partial L}{\partial z_i}\, f_i'(u).\]
This is the multivariate chain rule. In a neural network, the parameters \(\theta\) of a shared layer may influence the loss through many paths — backpropagation correctly accumulates all contributions.
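A tiny numeric check of the accumulation, using hypothetical branches \(z_1 = u^2\) and \(z_2 = 3u\) with \(L = z_1 + z_2\) (chosen purely for illustration):

```python
# Fan-out: u feeds both z1 = u**2 and z2 = 3*u, and L = z1 + z2.
# The two backward contributions to dL/du are summed, not overwritten.
def grad_u(u):
    dL_dz1 = 1.0               # L = z1 + z2 -> dL/dz1 = 1
    dL_dz2 = 1.0               #             -> dL/dz2 = 1
    via_z1 = dL_dz1 * 2 * u    # dz1/du = 2u
    via_z2 = dL_dz2 * 3        # dz2/du = 3
    return via_z1 + via_z2     # multivariate chain rule: contributions add

# grad_u(2.0) -> 7.0, matching d/du (u**2 + 3*u) = 2u + 3 at u = 2
```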
This figure animates three key concepts: the Chain Rule (gradient flows backward, multiplying local derivatives), Fan-Out (gradients sum when a variable feeds multiple paths), and Stop-Gradient (blocking gradient flow through selected variables).
Stop-Gradient (Detach)
In practice, we sometimes want to block gradient flow through certain variables. The stop-gradient operation \(\mathrm{sg}(u)\) acts as the identity in the forward pass (\(\mathrm{sg}(u) = u\)) but has zero derivative in the backward pass (\(\frac{\partial\, \mathrm{sg}(u)}{\partial u} = 0\)). This means any gradient flowing backward is killed at this node.
This is useful when a computed quantity should be treated as a fixed constant during optimization, even though it was derived from the parameters. For example, in reinforcement learning, the advantage estimate or the importance weight may depend on \(\theta\), but we do not want to differentiate through them — only through the log-probability \(\ln \pi_\theta\).
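In frameworks this operation is `tensor.detach()` (PyTorch), `jax.lax.stop_gradient` (JAX), or `tf.stop_gradient` (TensorFlow); its effect can be sketched by hand on a scalar example (an illustrative calculation, not library code):

```python
# Stop-gradient by hand: L = sg(u) * w with u = w**2.
# Forward: sg is the identity, so L = w**3 numerically.
# Backward: the path through u is blocked, so only the direct
# dependence on w contributes: dL/dw = sg(u) * 1 = w**2,
# not the full derivative 3*w**2.
def loss_and_grad(w):
    u = w ** 2
    sg_u = u               # identity in the forward pass
    L = sg_u * w
    dL_dw = sg_u * 1.0     # gradient through sg_u is zero by definition
    return L, dL_dw

# loss_and_grad(2.0) -> (8.0, 4.0); the full derivative would be 12.0
```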
Loss Functions with the Same Gradient
A fundamental consequence of gradient-based optimization is that the loss value itself does not matter — only its gradient does. If two loss functions \(L_1(\theta)\) and \(L_2(\theta)\) satisfy
\[\nabla_\theta L_1(\theta) = \nabla_\theta L_2(\theta) \quad \text{for all } \theta,\]
then gradient descent produces identical parameter trajectories from the same initialization:
\[\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t).\]
This update rule depends only on \(\nabla_\theta L\), not on \(L\) itself. Two functions with the same gradient everywhere differ by at most a constant: \(L_1(\theta) = L_2(\theta) + C\). The constant \(C\) does not affect any gradient-based optimizer (SGD, Adam, etc.).
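A quick numeric illustration with toy losses \(L_1(\theta) = \theta^2\) and \(L_2(\theta) = \theta^2 + 5\) (chosen here for the example): since gradient descent consumes only the gradient \(2\theta\), both produce the same trajectory.

```python
def sgd_trajectory(grad, theta0, lr=0.1, steps=5):
    # Plain gradient descent: theta <- theta - lr * grad(theta).
    # Note the loss *value* is never consulted, only its gradient.
    theta, traj = theta0, [theta0]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        traj.append(theta)
    return traj

# L1 = theta**2 and L2 = theta**2 + 5 share the gradient 2*theta,
# so both yield the same trajectory: each step multiplies theta by
# (1 - lr * 2) = 0.8, regardless of the constant offset in the loss.
grad = lambda theta: 2 * theta
traj = sgd_trajectory(grad, theta0=1.0)
```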
Same Gradient, Different Computation Graph
A more subtle case arises when two loss functions produce the same gradient but have different computation graphs. They may not differ by a constant — they may not even have the same value for any \(\theta\) — yet they yield identical optimization trajectories.
Example: PPO vs. REINFORCE formulation. Consider the PPO clipped objective and its REINFORCE reformulation. The PPO form is:
\[L_{\mathrm{PPO}} = -\min\!\big(\rho \hat{A},\; \mathrm{clip}(\rho,\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}\big), \qquad \rho = \frac{\pi_\theta}{\pi_{\mathrm{old}}}.\]
The REINFORCE form is:
\[L_{\mathrm{RF}} = -\,\mathrm{sg}(w)\, \hat{A}\, \ln \pi_\theta, \qquad w = \begin{cases} \rho & \text{if the PPO gradient is not clipped}, \\ 0 & \text{if it is clipped}. \end{cases}\]
These two losses have different values and different computation graphs:
| | \(L_{\mathrm{PPO}}\) | \(L_{\mathrm{RF}}\) |
|---|---|---|
| Value | \(-\min(\rho\hat{A},\, \mathrm{clip}(\rho)\hat{A})\) | \(-w\hat{A}\ln\pi_\theta\) |
| Gradient path | Through \(\rho = \pi_\theta / \pi_{\mathrm{old}}\) | Through \(\ln \pi_\theta\) (\(w\) is detached) |
| Intermediate variables | \(\rho\), \(\mathrm{clip}(\rho)\), \(\min(\cdot)\) | \(w\) (stop-gradient), \(\ln \pi_\theta\) |
Yet their gradients with respect to \(\theta\) are identical. To see why, note that \(\nabla_\theta \rho = \rho\, \nabla_\theta \ln \pi_\theta\), so:
- PPO: when the gradient passes through (not clipped), \(\nabla_\theta L_{\mathrm{PPO}} = -\hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\). When clipped, \(\nabla_\theta L_{\mathrm{PPO}} = 0\).
- REINFORCE: \(\nabla_\theta L_{\mathrm{RF}} = -w\, \hat{A}\, \nabla_\theta \ln \pi_\theta\), where \(w = \rho\) (not clipped) or \(w = 0\) (clipped).
Both give \(-\hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\) or \(0\), in exactly the same cases. Same gradient, different computation graph, identical optimization.
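This equivalence can be checked numerically for a single scalar log-probability. In the sketch below, \(p = \ln \pi_\theta\) stands in for the parameters, `eps` is the clip range, and the PPO gradient is taken by finite differences (all names are assumptions of this note):

```python
import math

def ppo_loss(p, p_old, A, eps=0.2):
    # L_PPO = -min(rho*A, clip(rho, 1-eps, 1+eps)*A), rho = pi / pi_old
    rho = math.exp(p - p_old)
    clipped_rho = max(min(rho, 1 + eps), 1 - eps)
    return -min(rho * A, clipped_rho * A)

def ppo_grad(p, p_old, A, eps=0.2, h=1e-6):
    # Central finite-difference gradient of the PPO loss w.r.t. p.
    return (ppo_loss(p + h, p_old, A, eps) - ppo_loss(p - h, p_old, A, eps)) / (2 * h)

def rf_grad(p, p_old, A, eps=0.2):
    # REINFORCE form: L_RF = -sg(w) * A * p, so dL_RF/dp = -w * A,
    # with w = rho when the PPO min selects the unclipped term, else 0.
    rho = math.exp(p - p_old)
    clipped = (A > 0 and rho > 1 + eps) or (A < 0 and rho < 1 - eps)
    w = 0.0 if clipped else rho
    return -w * A

# Unclipped case: ppo_grad(0.1, 0.0, 1.0) ≈ rf_grad(0.1, 0.0, 1.0) ≈ -1.105
# Clipped case:   ppo_grad(0.5, 0.0, 1.0) ≈ rf_grad(0.5, 0.0, 1.0) == 0.0
```

Away from the clip boundary the two gradients agree to within finite-difference error, for both signs of the advantage.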
When Does the Loss Value Matter?
Although the loss value does not affect gradient updates, it can matter for:
- Monitoring: we often track the loss curve to diagnose training. Two equivalent losses may show different curves, which can be confusing if you compare across implementations.
- Learning rate scheduling: some schedulers (e.g., ReduceLROnPlateau) adjust the learning rate based on the loss value. Different-valued but same-gradient losses would trigger different scheduling decisions.
- Early stopping: if you stop training when the loss falls below a threshold, the threshold depends on which formulation you use.
- Mixed-precision training: numerical stability can differ between formulations even if the exact gradients are identical, because intermediate values differ and floating-point arithmetic is not associative.
For pure gradient descent with a fixed learning rate, the loss value is irrelevant. In practice, be aware of these secondary effects when switching between equivalent formulations.