Backpropagation and Computation Graphs

What Is Backpropagation?

Backpropagation is the algorithm that computes the gradient of a loss function with respect to model parameters. It is the foundation of training neural networks. The core idea is simple: apply the chain rule systematically through a computation graph.


Computation Graphs

Any differentiable program can be represented as a computation graph: a directed acyclic graph (DAG) in which each node is a variable and each edge records that one variable is an input to the operation producing another.

Consider a simple example: \(L = (wx + b - y)^2\), where \(w, b\) are parameters and \(x, y\) are data. The computation graph is:


\[w, x \;\xrightarrow{\times}\; z_1 = wx \;\xrightarrow{+b}\; z_2 = z_1 + b \;\xrightarrow{-y}\; z_3 = z_2 - y \;\xrightarrow{(\cdot)^2}\; L = z_3^2\]

Each node stores both its value (computed in the forward pass) and a recipe for computing derivatives (used in the backward pass).

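The bookkeeping described above can be sketched as a minimal reverse-mode autodiff node in plain Python. The class name `Node` and its API are illustrative, not from the text; frameworks such as PyTorch do the same bookkeeping internally.

```python
class Node:
    """A variable in the computation graph, holding its forward value
    and the local derivatives needed by the backward pass."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # computed in the forward pass
        self.parents = parents          # input nodes
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0                 # filled in by the backward pass

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Node(self.value - other.value, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    (self, other), (other.value, self.value))

    def square(self):
        return Node(self.value ** 2, (self,), (2.0 * self.value,))

    def backward(self):
        # Seed dL/dL = 1, then propagate in reverse topological order.
        self.grad = 1.0
        order, seen = [], set()

        def topo(n):
            if id(n) not in seen:
                seen.add(id(n))
                for p in n.parents:
                    topo(p)
                order.append(n)

        topo(self)
        for n in reversed(order):
            for p, g in zip(n.parents, n.local_grads):
                p.grad += n.grad * g  # chain rule, accumulated on fan-out


# The running example L = (w*x + b - y)**2 with illustrative inputs:
w, x, b, y = Node(2.0), Node(3.0), Node(1.0), Node(5.0)
L = ((w * x + b) - y).square()  # forward pass: L.value == 4.0
L.backward()                    # backward pass: w.grad == 12.0, b.grad == 4.0
```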

The interactive figure above lets you explore the computation graph for \(L = (wx + b - y)^2\). Switch between Graph (structure), Forward Pass (compute values), and Backward Pass (propagate gradients). Adjust the sliders to see how different inputs change the values and gradients.


Forward Pass

The forward pass evaluates the computation graph from inputs to output, computing the value of each intermediate variable in topological order:

  1. \[z_1 = wx\]
  2. \[z_2 = z_1 + b\]
  3. \[z_3 = z_2 - y\]
  4. \[L = z_3^2\]

This is simply “running the program.” Each operation stores its inputs for later use in the backward pass.

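As a concrete sketch, the four steps above can be run as plain Python with illustrative input values (the numbers are not specified in the text), keeping every intermediate for later reuse:

```python
# Forward pass for L = (w*x + b - y)**2 with illustrative inputs;
# each intermediate is kept so the backward pass can reuse it.
w, x, b, y = 2.0, 3.0, 1.0, 5.0
z1 = w * x    # step 1: 6.0
z2 = z1 + b   # step 2: 7.0
z3 = z2 - y   # step 3: 2.0
L = z3 ** 2   # step 4: 4.0
```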

Backward Pass (Backpropagation)

The backward pass computes \(\frac{\partial L}{\partial \theta}\) for every parameter \(\theta\) by applying the chain rule in reverse topological order. Starting from \(\frac{\partial L}{\partial L} = 1\), we propagate gradients backward through each node:


\[\frac{\partial L}{\partial z_3} = 2z_3, \quad \frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial z_3} \cdot 1, \quad \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_2} \cdot 1, \quad \frac{\partial L}{\partial w} = \frac{\partial L}{\partial z_1} \cdot x, \quad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z_2} \cdot 1.\]

The key principle: at each node, multiply the incoming gradient by the local derivative. If node \(z\) computes \(z = f(u)\), and we already know \(\frac{\partial L}{\partial z}\) (the incoming gradient), then:


\[\frac{\partial L}{\partial u} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial u} = \frac{\partial L}{\partial z} \cdot f'(u).\]

This is the chain rule. The efficiency of backpropagation comes from the fact that each intermediate gradient \(\frac{\partial L}{\partial z}\) is computed once and reused by all downstream nodes — no redundant work.

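Continuing with the same illustrative numbers, the backward pass can be written out by hand, one incoming-gradient-times-local-derivative step per node; note that each intermediate gradient is computed once and then reused:

```python
# Forward pass (values are illustrative)
w, x, b, y = 2.0, 3.0, 1.0, 5.0
z1 = w * x
z2 = z1 + b
z3 = z2 - y
L = z3 ** 2

# Backward pass: incoming gradient times local derivative at each node
dL_dz3 = 2 * z3        # d(z3**2)/dz3
dL_dz2 = dL_dz3 * 1.0  # subtraction: local derivative w.r.t. z2 is 1
dL_dz1 = dL_dz2 * 1.0  # addition: local derivative w.r.t. z1 is 1
dL_dw = dL_dz1 * x     # multiplication: local derivative is the other factor
dL_db = dL_dz2 * 1.0   # b enters through z2 with local derivative 1
```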

Multivariate Chain Rule and Fan-Out

When a variable \(u\) feeds into multiple downstream nodes \(z_1 = f_1(u), z_2 = f_2(u), \ldots\), the gradients add up:


\[\frac{\partial L}{\partial u} = \sum_k \frac{\partial L}{\partial z_k} \cdot \frac{\partial z_k}{\partial u}.\]

This is the multivariate chain rule. In a neural network, the parameters \(\theta\) of a shared layer may influence the loss through many paths — backpropagation correctly accumulates all contributions.

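A minimal numeric sketch of fan-out (the functions and values are illustrative, not from the text): a variable u feeds two paths, and the gradient contributions from both paths are summed.

```python
# u feeds two downstream nodes: z1 = 3*u and z2 = u**2, with L = z1 + z2.
u = 2.0
z1 = 3 * u
z2 = u ** 2
L = z1 + z2

# Each path contributes dL/dz_k * dz_k/du; the contributions add up.
dL_du = 1.0 * 3 + 1.0 * (2 * u)  # 3 (linear path) + 4 (square path) = 7
```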

This figure animates three key concepts: the Chain Rule (gradient flows backward, multiplying local derivatives), Fan-Out (gradients sum when a variable feeds multiple paths), and Stop-Gradient (blocking gradient flow through selected variables).


Stop-Gradient (Detach)

In practice, we sometimes want to block gradient flow through certain variables. The stop-gradient operation \(\mathrm{sg}(u)\) acts as the identity in the forward pass (\(\mathrm{sg}(u) = u\)) but has zero derivative in the backward pass (\(\frac{\partial\, \mathrm{sg}(u)}{\partial u} = 0\)). This means any gradient flowing backward is killed at this node.

This is useful when a computed quantity should be treated as a fixed constant during optimization, even though it was derived from the parameters. For example, in reinforcement learning, the advantage estimate or the importance weight may depend on \(\theta\), but we do not want to differentiate through them — only through the log-probability \(\ln \pi_\theta\).

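A plain-Python sketch of the effect, with the gradient worked out by hand (in real frameworks this operation is `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX):

```python
# L = sg(w) * w: the stop-gradient is the identity in the forward pass,
# but its factor is treated as a constant when differentiating.
w = 3.0
c = w            # c = sg(w): same value, but a constant to the backward pass
L = c * w        # forward value: 9.0, identical to w**2

# Backward: d(sg(w))/dw = 0, so gradient flows only through the second factor.
dL_dw = c * 1.0  # 3.0, whereas d(w**2)/dw = 2*w = 6.0 without the stop-gradient
```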

Loss Functions with the Same Gradient

A fundamental consequence of gradient-based optimization is that the loss value itself does not matter — only its gradient does. If two loss functions \(L_1(\theta)\) and \(L_2(\theta)\) satisfy


\[\nabla_\theta L_1(\theta) = \nabla_\theta L_2(\theta) \quad \forall\, \theta,\]

then gradient descent produces identical parameter trajectories from the same initialization:


\[\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t).\]

This update rule depends only on \(\nabla_\theta L\), not on \(L\) itself. Two functions with the same gradient everywhere differ by at most a constant: \(L_1(\theta) = L_2(\theta) + C\). The constant \(C\) does not affect any gradient-based optimizer (SGD, Adam, etc.).

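A quick check with two illustrative quadratic losses that differ by a constant (my own example, not from the text): their gradient-descent trajectories coincide exactly, step for step.

```python
def grad_L1(theta):
    # gradient of L1(theta) = (theta - 3)**2
    return 2 * (theta - 3)

def grad_L2(theta):
    # gradient of L2(theta) = (theta - 3)**2 + 10: the constant drops out
    return 2 * (theta - 3)

def descend(grad, theta0=0.0, lr=0.1, steps=20):
    """Plain gradient descent; records the whole parameter trajectory."""
    theta, traj = theta0, [theta0]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        traj.append(theta)
    return traj

# descend(grad_L1) and descend(grad_L2) are identical, and both approach 3.
```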

Same Gradient, Different Computation Graph

A more subtle case arises when two loss functions produce the same gradient but have different computation graphs. They may not differ by a constant — they may not even have the same value for any \(\theta\) — yet they yield identical optimization trajectories.

Example: PPO vs. REINFORCE formulation. Consider the PPO clipped objective and its REINFORCE reformulation. The PPO form is:


\[L_{\mathrm{PPO}} = -\min\!\big(\rho\,\hat{A},\; \mathrm{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\,\hat{A}\big), \quad \rho = \frac{\pi_\theta(a)}{\pi_{\mathrm{old}}(a)}.\]

(The minus sign converts the PPO objective, which is maximized, into a loss to be minimized.)

The REINFORCE form is:

\[L_{\mathrm{RF}} = -\mathrm{sg}(w)\,\hat{A}\,\ln \pi_\theta(a), \quad w = \mathbb{I}\!\left((\rho - \mathrm{clip}(\rho))\,\hat{A} \leq 0\right) \rho.\]

These two losses have different values and different computation graphs:

| | \(L_{\mathrm{PPO}}\) | \(L_{\mathrm{RF}}\) |
| --- | --- | --- |
| Value | \(-\min(\rho\hat{A},\, \mathrm{clip}(\rho)\hat{A})\) | \(-w\hat{A}\ln\pi_\theta\) |
| Gradient path | through \(\rho = \pi_\theta / \pi_{\mathrm{old}}\) | through \(\ln \pi_\theta\) (\(w\) is detached) |
| Intermediate variables | \(\rho, \mathrm{clip}(\rho), \min(\cdot)\) | \(w\) (stop-gradient), \(\ln \pi_\theta\) |

Yet their gradients with respect to \(\theta\) are identical. To see why, note that \(\nabla_\theta \rho = \rho\, \nabla_\theta \ln \pi_\theta\), so:

  • PPO: when the unclipped term attains the minimum (not clipped), \(\nabla_\theta L_{\mathrm{PPO}} = -\hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\). When clipped, \(\nabla_\theta L_{\mathrm{PPO}} = 0\).
  • REINFORCE: \(\nabla_\theta L_{\mathrm{RF}} = -w\, \hat{A}\, \nabla_\theta \ln \pi_\theta\), where \(w = \rho\) (not clipped) or \(w = 0\) (clipped).

Both give \(-\hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\) or \(0\), in exactly the same cases. Same gradient, different computation graph, identical optimization.

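The case analysis above can be checked numerically for a single scalar action. The helper names and input values below are illustrative, and both losses are written as quantities to minimize (the negative of the PPO objective). We differentiate with respect to the log-probability, using \(\partial \rho / \partial \ln \pi_\theta = \rho\).

```python
import math

def grad_ppo(logp, logp_old, adv, eps=0.2):
    # Gradient of L_PPO = -min(rho*adv, clip(rho, 1-eps, 1+eps)*adv) w.r.t. logp.
    rho = math.exp(logp - logp_old)
    clipped = min(max(rho, 1 - eps), 1 + eps)
    if rho * adv <= clipped * adv:  # unclipped term attains the min: grad flows via rho
        return -adv * rho
    return 0.0                      # clipped term attains the min: constant w.r.t. theta

def grad_rf(logp, logp_old, adv, eps=0.2):
    # Gradient of L_RF = -sg(w)*adv*logp w.r.t. logp; w is detached.
    rho = math.exp(logp - logp_old)
    clipped = min(max(rho, 1 - eps), 1 + eps)
    w = rho if (rho - clipped) * adv <= 0 else 0.0
    return -w * adv

# The two gradients agree in every clipping regime:
for lp, adv in [(0.1, 1.0), (0.5, 1.0), (0.5, -1.0), (-0.5, 1.0), (-0.5, -1.0)]:
    assert grad_ppo(lp, 0.0, adv) == grad_rf(lp, 0.0, adv)
```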

When Does the Loss Value Matter?

Although the loss value does not affect gradient updates, it can matter for:

  • Monitoring: we often track the loss curve to diagnose training. Two equivalent losses may show different curves, which can be confusing if you compare across implementations.
  • Learning rate scheduling: some schedulers (e.g., ReduceLROnPlateau) adjust the learning rate based on the loss value. Different-valued but same-gradient losses would trigger different scheduling decisions.
  • Early stopping: if you stop training when the loss falls below a threshold, the threshold depends on which formulation you use.
  • Mixed-precision training: numerical stability can differ between formulations even if the exact gradients are identical, because intermediate values differ and floating-point arithmetic is not associative.

For pure gradient descent with a fixed learning rate, the loss value is irrelevant. In practice, be aware of these secondary effects when switching between equivalent formulations.
