Backpropagation

What Is Backpropagation?

Backpropagation is the algorithm that computes the gradient of a loss function with respect to model parameters. It is the foundation of training neural networks. The core idea is simple: apply the chain rule systematically through a computation graph.

Computation Graphs

Any differentiable program can be represented as a computation graph: a directed acyclic graph (DAG) where each node is a variable and each edge connects an input variable to the variable computed from it.

Consider a simple example: \(L = (wx + b - y)^2\), where \(w, b\) are parameters and \(x, y\) are data. The computation graph is:

\[w, x \;\xrightarrow{\times}\; z_1 = wx \;\xrightarrow{+b}\; z_2 = z_1 + b \;\xrightarrow{-y}\; z_3 = z_2 - y \;\xrightarrow{(\cdot)^2}\; L = z_3^2\]

Each node stores both its value (computed in the forward pass) and a recipe for computing derivatives (used in the backward pass).

[Interactive figure: the computation graph for \(L = (wx + b - y)^2\), with three views: Graph (structure), Forward Pass (compute values), and Backward Pass (propagate gradients). Sliders set the inputs and show how the values and gradients change.]

Forward Pass

The forward pass evaluates the computation graph from inputs to output, computing the value of each intermediate variable in topological order:

  1. \[z_1 = wx\]
  2. \[z_2 = z_1 + b\]
  3. \[z_3 = z_2 - y\]
  4. \[L = z_3^2\]

This is simply “running the program.” Each operation stores its inputs for later use in the backward pass.
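
The four steps above can be sketched in plain Python. This is a minimal illustration rather than code from the text: the function name `forward`, the `cache` dictionary, and the input values are assumptions chosen for the example.

```python
def forward(w, x, b, y):
    """Evaluate L = (w*x + b - y)**2 in topological order."""
    z1 = w * x      # step 1
    z2 = z1 + b     # step 2
    z3 = z2 - y     # step 3
    L = z3 ** 2     # step 4
    # Keep the values whose local derivatives need them in the backward pass.
    cache = {"w": w, "x": x, "z3": z3}
    return L, cache


L, cache = forward(w=2.0, x=3.0, b=1.0, y=5.0)
print(L, cache)
```

Only `w`, `x`, and `z3` are cached here because the additions and subtractions have a local derivative of 1 and need nothing saved.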

Backward Pass (Backpropagation)

The backward pass computes \(\frac{\partial L}{\partial \theta}\) for every parameter \(\theta\) by applying the chain rule in reverse topological order. Starting from \(\frac{\partial L}{\partial L} = 1\), we propagate gradients backward through each node:

\[\frac{\partial L}{\partial z_3} = 2z_3, \quad \frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial z_3} \cdot 1, \quad \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_2} \cdot 1, \quad \frac{\partial L}{\partial w} = \frac{\partial L}{\partial z_1} \cdot x, \quad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial z_2} \cdot 1.\]

The key principle: at each node, multiply the incoming gradient by the local derivative. If node \(z\) computes \(z = f(u)\), and we already know \(\frac{\partial L}{\partial z}\) (the incoming gradient), then:

\[\frac{\partial L}{\partial u} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial u} = \frac{\partial L}{\partial z} \cdot f'(u).\]

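This is the chain rule. The efficiency of backpropagation comes from the fact that each intermediate gradient \(\frac{\partial L}{\partial z}\) is computed once and then reused for every input of \(z\). In the example, \(\frac{\partial L}{\partial z_2}\) is computed once and feeds both \(\frac{\partial L}{\partial z_1}\) and \(\frac{\partial L}{\partial b}\), so no work is repeated.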
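
The backward pass for the running example can be written out by hand and checked against a finite-difference estimate. This is a sketch under assumed names (`forward`, `backward`, `numeric_grad`) and assumed input values, none of which come from the text.

```python
def forward(w, x, b, y):
    z1 = w * x
    z2 = z1 + b
    z3 = z2 - y
    L = z3 ** 2
    return L, (w, x, z3)


def backward(cache):
    """Apply the chain rule in reverse topological order."""
    w, x, z3 = cache
    dL = 1.0                # seed: dL/dL = 1
    dz3 = dL * 2.0 * z3     # local derivative of z3**2
    dz2 = dz3 * 1.0         # local derivative of z2 - y w.r.t. z2
    dz1 = dz2 * 1.0         # local derivative of z1 + b w.r.t. z1
    dw = dz1 * x            # local derivative of w*x w.r.t. w
    db = dz2 * 1.0          # dz2 is reused here, not recomputed
    return {"w": dw, "b": db}


def numeric_grad(args, index, h=1e-6):
    """Central finite difference of L with respect to args[index]."""
    plus, minus = list(args), list(args)
    plus[index] += h
    minus[index] -= h
    return (forward(*plus)[0] - forward(*minus)[0]) / (2 * h)


args = (2.0, 3.0, 1.0, 5.0)           # w, x, b, y
grads = backward(forward(*args)[1])
print(grads, numeric_grad(args, 0), numeric_grad(args, 2))
```

The finite-difference check is only a sanity test; it costs two forward passes per parameter, whereas the backward pass produces every gradient in one sweep.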

Multivariate Chain Rule and Fan-Out

When a variable \(u\) feeds into multiple downstream nodes \(z_1 = f_1(u), z_2 = f_2(u), \ldots\), the gradients add up:

\[\frac{\partial L}{\partial u} = \sum_k \frac{\partial L}{\partial z_k} \cdot \frac{\partial z_k}{\partial u}.\]

This is the multivariate chain rule. In a neural network, the parameters \(\theta\) of a shared layer may influence the loss through many paths — backpropagation correctly accumulates all contributions.
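
A small numeric case makes the summation concrete. In this sketch the functions \(z_1 = u^2\), \(z_2 = 3u\), and \(L = z_1 z_2\) are assumptions picked for illustration, so that \(L = 3u^3\) and the expected derivative is \(9u^2\).

```python
def forward(u):
    z1 = u ** 2        # first consumer of u
    z2 = 3.0 * u       # second consumer of u
    L = z1 * z2        # equals 3 * u**3
    return L, (u, z1, z2)


def backward(cache):
    u, z1, z2 = cache
    dL_dz1 = z2                    # derivative of z1*z2 w.r.t. z1
    dL_dz2 = z1                    # derivative of z1*z2 w.r.t. z2
    du = 0.0
    du += dL_dz1 * (2.0 * u)       # contribution of the path through z1
    du += dL_dz2 * 3.0             # contribution of the path through z2
    return du


L, cache = forward(2.0)
du = backward(cache)
print(L, du, 9.0 * 2.0 ** 2)
```

Accumulating with `+=` mirrors what autodiff frameworks typically do when several consumers send gradients back to the same variable.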

[Animated figure: three key concepts. Chain Rule (the gradient flows backward, multiplying local derivatives), Fan-Out (gradients sum when a variable feeds multiple paths), and Stop-Gradient (blocking gradient flow through selected variables).]

Stop-Gradient (Detach)

In practice, we sometimes want to block gradient flow through certain variables. The stop-gradient operation \(\mathrm{sg}(u)\) acts as the identity in the forward pass (\(\mathrm{sg}(u) = u\)) but has zero derivative in the backward pass (\(\frac{\partial\, \mathrm{sg}(u)}{\partial u} = 0\)). This means any gradient flowing backward is killed at this node.

This is useful when a computed quantity should be treated as a fixed constant during optimization, even though it was derived from the parameters. For example, in reinforcement learning, the advantage estimate or the importance weight may depend on \(\theta\), but we do not want to differentiate through them — only through the log-probability \(\ln \pi_\theta\).
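
The effect of stop-gradient can be seen with a toy scalar autodiff class. Everything here is hypothetical scaffolding for illustration: the class `Var`, its `detach` method, and the path-summing `backward` are not any library's API, and the recursion revisits shared nodes, so it is only suitable for tiny graphs.

```python
class Var:
    """A scalar that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents     # tuples of (parent Var, local derivative)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def detach(self):
        # Same forward value, but no parents: nothing flows back past here.
        return Var(self.value)

    def backward(self, seed=1.0):
        # Accumulate the incoming gradient, then pass it on to each parent
        # scaled by the local derivative.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)


u = Var(3.0)
plain = u * u               # gradient flows through both factors
plain.backward()

v = Var(3.0)
stopped = v.detach() * v    # first factor is treated as a constant
stopped.backward()

print(plain.value, u.grad)      # same forward value ...
print(stopped.value, v.grad)    # ... but half the gradient
```

Both expressions evaluate to \(u^2\) in the forward pass, yet the detached version backpropagates \(u\) instead of \(2u\), because one of the two paths to \(u\) has been cut.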

Does the Gradient Depend on \(L\)'s Value?

Looking at the backward pass in the computation graph above, notice that none of the gradient arrows reference \(L\) itself. The local derivative at the loss node is \(\partial L / \partial z_3 = 2 z_3\): it depends on \(z_3\), not on \(L\). Every other gradient (\(\partial L / \partial w\), \(\partial L / \partial x\), etc.) is derived from this one via the chain rule, so none of them contain \(L\) either.

More generally, the backward pass never uses \(L\)’s numerical value. What it uses is:

  • Intermediate activations saved from the forward pass (\(z_1, z_2, z_3\) and the inputs \(w, x, b, y\));
  • Local derivatives at each node, which are functions of that node’s inputs.

The scalar \(L\) only appears as the seed of the chain: we initialize \(\partial L / \partial L = 1\) and propagate. Its magnitude never enters any multiplication along the way.

Two practical consequences:

  • Adding a constant is free. \(L' = L + C\) produces identical gradients for any constant \(C\), because the local derivative at “\(+ C\)” is \(1\) and nothing else changes.
  • \(L\) and its gradient are not fully independent in practice. For \(L = z_3^2\), a large \(L\) does imply a large \(\vert z_3 \vert\) and hence a large gradient. But the mechanism goes through \(z_3\), not through \(L\) directly.

This decoupling is why PyTorch’s .backward() never needs to know the loss’s numerical value — it only needs the saved activations and the local derivatives.
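
The "adding a constant is free" claim is easy to verify numerically. The sketch below uses an assumed helper `loss_and_grads` with an assumed offset parameter `C`; the gradient code never reads `L`.

```python
def loss_and_grads(w, x, b, y, C=0.0):
    """Return L' = (w*x + b - y)**2 + C and its parameter gradients."""
    z3 = w * x + b - y
    L = z3 ** 2 + C
    # Backward pass: the "+ C" node has local derivative 1, so the seed
    # passes through unchanged. Note that L itself is never used below.
    dz3 = 1.0 * 2.0 * z3
    return L, {"w": dz3 * x, "b": dz3}


L0, g0 = loss_and_grads(2.0, 3.0, 1.0, 5.0)
L1, g1 = loss_and_grads(2.0, 3.0, 1.0, 5.0, C=100.0)
print(L0, L1)   # the loss values differ by C
print(g0, g1)   # the gradients are the same
```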

If the Loss Value Doesn't Matter, Why Do We Want It to Decrease?

The two statements sound contradictory, but they concern different things:

  • “\(L\)’s value doesn’t enter the gradient computation” is a claim about the mechanics of backprop. Numerically, \(L\) never multiplies into any backward-flowing quantity.
  • “We want \(L\) to go down” is a claim about the objective of optimization. We deliberately chose this \(L\) as our proxy for model quality (MSE encodes prediction error, cross-entropy encodes distribution mismatch, etc.), so reducing \(L\) is literally what “training” means.

The gradient is the tool; the loss is the target. A decreasing loss is our evidence that the tool is working — that gradient descent is actually pushing the target quantity down. When the loss doesn’t decrease, it doesn’t mean the gradient formula is broken; it means the optimization process has problems: step size too large, noise dominating the signal, bad initialization, or a non-convex landscape with a nearby saddle point or plateau.

A subtler point: what’s informative is the trajectory of \(L\), not its absolute value. \(L\) and \(L + 100\) decrease by the same amount, and \(10L\) by ten times as much — all three carry exactly the same information about progress. This is why comparing training curves across papers or loss functions should focus on shape and relative change, not absolute numbers.

Finally, a decreasing training loss is necessary but not sufficient for a better model:

  • Overfitting: training loss drops while test loss rises.
  • Objective mismatch: the loss may not align with the metric you care about — cross-entropy vs. accuracy, log-likelihood vs. sample quality.
  • Reward hacking (in RL): the agent finds a trivial way to drive \(L\) down without solving the task.

So: the gradient formula doesn’t care about \(L\)’s value, but we as practitioners care a lot — the value is our signal that the optimization is doing what we asked.

What Happens to Gradients That Flow to the Inputs?

The computation graph produces gradients w.r.t. every leaf, not just parameters. For \(L = (wx + b - y)^2\) we get \(\partial L/\partial w\) and \(\partial L/\partial b\) (fed to the optimizer) but also \(\partial L/\partial x\) and \(\partial L/\partial y\) — what happens to those?

First, a distinction worth making. “Input gradient” can mean two different things:

  1. Gradient at the input of an intermediate node (e.g. \(\partial L/\partial z_2\)). This is never discarded — it is the messenger that carries the chain rule upstream. Without it you could not compute \(\partial L/\partial w\) at all. Every intermediate gradient is first used (as input to the next backward step) and only then released.
  2. Gradient at a non-parameter leaf — typically the data \(x\) or the label \(y\). This is what your question is really about.

For the leaf case, in standard supervised training the answer is: not needed, and usually never computed. PyTorch's autograd computes gradients only for tensors with requires_grad=True; data and labels default to False, so the backward pass skips the work that would produce their gradients. This is a pure efficiency optimization that saves a few matmuls per step.

But in many regimes the input gradient is the whole point:

  • End-to-end training of composed models. If what looks like an “input” \(x\) to one module is really the output of an upstream module, \(\partial L/\partial x\) is exactly the signal that lets the upstream module be trained. Encoder–decoder, vision–language, and actor–critic architectures all rely on this.
  • Adversarial examples and adversarial training. FGSM perturbs the input along \(\mathrm{sign}(\partial L/\partial x)\) to find inputs that fool the model (attack) or to augment training data (defense).
  • Feature visualization / activation maximization / DeepDream. Freeze \(\theta\), optimize \(x\) via gradient ascent on a chosen logit or layer activation.
  • Gradient-based saliency. \(\vert \partial L/\partial x\vert\) ranks which pixels or tokens mattered most — the starting point for Integrated Gradients, SmoothGrad, and related attribution methods.
  • Physics-informed neural networks. Part of the loss is a PDE residual that requires derivatives of the network output w.r.t. spatial/temporal inputs.

So input gradients are “wasted” only in the narrow regime of vanilla supervised learning on fixed data. The moment you want to do anything more — joint training of stacked modules, robustness, interpretability, generative modeling — they stop being waste and become the primary signal.
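
As one concrete use of \(\partial L/\partial x\), here is an FGSM-style step on the scalar running example. It is a toy sketch: the helper names, the step size `eps`, and the input values are assumptions, and real attacks operate on high-dimensional inputs.

```python
def loss(w, x, b, y):
    return (w * x + b - y) ** 2


def grad_x(w, x, b, y):
    # Chain rule: dL/dz3 = 2*z3, dz3/dz2 = 1, dz2/dz1 = 1, dz1/dx = w.
    return 2.0 * (w * x + b - y) * w


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, x, b, y, eps):
    """Move the input by eps in the direction that increases the loss."""
    return x + eps * sign(grad_x(w, x, b, y))


w, x, b, y = 2.0, 3.0, 1.0, 5.0
x_adv = fgsm(w, x, b, y, eps=0.1)
print(x, loss(w, x, b, y))
print(x_adv, loss(w, x_adv, b, y))
```

The parameters stay fixed throughout; only the input moves, which is why this gradient has to be requested explicitly in frameworks that skip it by default.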

Loss Functions with the Same Gradient

A fundamental consequence of gradient-based optimization is that the loss value itself does not matter — only its gradient does. If two loss functions \(L_1(\theta)\) and \(L_2(\theta)\) satisfy

\[\nabla_\theta L_1(\theta) = \nabla_\theta L_2(\theta) \quad \forall\, \theta,\]

then gradient descent produces identical parameter trajectories from the same initialization:

\[\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t).\]

This update rule depends only on \(\nabla_\theta L\), not on \(L\) itself. Two functions with the same gradient everywhere differ by at most a constant: \(L_1(\theta) = L_2(\theta) + C\). The constant \(C\) does not affect any gradient-based optimizer (SGD, Adam, etc.).
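
A short experiment illustrates the identical trajectories. The two losses below, \(L_1 = (\theta - 3)^2\) and \(L_2 = \theta^2 - 6\theta\), are assumptions chosen so that they differ by the constant 9; the function names and hyperparameters are likewise illustrative.

```python
import math


def L1(theta):
    return (theta - 3.0) ** 2


def L2(theta):
    return theta ** 2 - 6.0 * theta      # equals L1(theta) - 9


def grad_L1(theta):
    return 2.0 * (theta - 3.0)


def grad_L2(theta):
    return 2.0 * theta - 6.0


def descend(grad, theta, lr=0.1, steps=20):
    """Plain gradient descent; returns the whole parameter trajectory."""
    path = [theta]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        path.append(theta)
    return path


path1 = descend(grad_L1, theta=0.0)
path2 = descend(grad_L2, theta=0.0)
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(path1, path2))
print(same, path1[-1], L1(path1[-1]), L2(path2[-1]))
```

The parameters move in lockstep while the reported loss values stay 9 apart for the whole run.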

Same Gradient, Different Computation Graph

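A more subtle case arises when two loss functions produce the same gradient under backpropagation but have different computation graphs. A stop-gradient changes what the backward pass computes without changing the forward value, so the backpropagated gradient need not equal the mathematical gradient of the loss value. Such losses therefore need not differ by a constant, and their values may disagree at every \(\theta\), yet they yield identical optimization trajectories.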

Example: PPO vs. REINFORCE formulation. Consider the PPO clipped objective and its REINFORCE reformulation. The PPO form is:

\[L_{\mathrm{PPO}} = -\min\!\big(\rho\,\hat{A},\; \mathrm{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\,\hat{A}\big), \quad \rho = \frac{\pi_\theta(a)}{\pi_{\mathrm{old}}(a)}.\]

The leading minus sign turns the clipped surrogate, which PPO maximizes, into a loss to be minimized, so that both forms follow the same sign convention. The REINFORCE form is:

\[L_{\mathrm{RF}} = -\mathrm{sg}(w)\,\hat{A}\,\ln \pi_\theta(a), \quad w = \mathbb{I}\!\left((\rho - \mathrm{clip}(\rho))\,\hat{A} \leq 0\right) \rho.\]

These two losses have different values and different computation graphs:

| | \(L_{\mathrm{PPO}}\) | \(L_{\mathrm{RF}}\) |
|---|---|---|
| Value | \(-\min(\rho\hat{A},\, \mathrm{clip}(\rho)\hat{A})\) | \(-w\hat{A}\ln\pi_\theta\) |
| Gradient path | Through \(\rho = \pi_\theta / \pi_{\mathrm{old}}\) | Through \(\ln \pi_\theta\) (\(w\) is detached) |
| Intermediate variables | \(\rho, \mathrm{clip}(\rho), \min(\cdot)\) | \(w\) (stop-gradient), \(\ln \pi_\theta\) |

Yet their gradients with respect to \(\theta\) are identical. To see why, note that \(\nabla_\theta \rho = \rho\, \nabla_\theta \ln \pi_\theta\), so:

  • PPO: when the gradient passes through (not clipped), \(\nabla_\theta L_{\mathrm{PPO}} = -\hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\). When clipped, \(\nabla_\theta L_{\mathrm{PPO}} = 0\).
  • REINFORCE: \(\nabla_\theta L_{\mathrm{RF}} = -w\, \hat{A}\, \nabla_\theta \ln \pi_\theta\), where \(w = \rho\) (not clipped) or \(w = 0\) (clipped).

Both give \(-\hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\) or \(0\), in exactly the same cases. Same gradient, different computation graph, identical optimization.
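
The equivalence can be checked numerically on a one-parameter toy policy. This sketch makes several assumptions that are not in the text: the policy is \(\pi_\theta(a) = \mathrm{sigmoid}(\theta)\), both expressions are written as losses to be minimized (so the PPO surrogate carries a leading minus sign), gradients are estimated by central finite differences, and stop-gradient is emulated by computing \(w\) once and holding it fixed.

```python
import math


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def pi(theta):
    return 1.0 / (1.0 + math.exp(-theta))    # toy policy probability


def ppo_loss(theta, pi_old, adv, eps):
    rho = pi(theta) / pi_old
    return -min(rho * adv, clip(rho, 1 - eps, 1 + eps) * adv)


def rf_loss(theta, w_fixed, adv):
    # w_fixed plays the role of sg(w): it does not vary with theta here.
    return -w_fixed * adv * math.log(pi(theta))


def weight(theta, pi_old, adv, eps):
    rho = pi(theta) / pi_old
    passes = (rho - clip(rho, 1 - eps, 1 + eps)) * adv <= 0
    return rho if passes else 0.0


def num_grad(f, theta, h=1e-6):
    return (f(theta + h) - f(theta - h)) / (2 * h)


def compare(theta, pi_old, adv, eps=0.2):
    w = weight(theta, pi_old, adv, eps)      # evaluated once, then frozen
    g_ppo = num_grad(lambda t: ppo_loss(t, pi_old, adv, eps), theta)
    g_rf = num_grad(lambda t: rf_loss(t, w, adv), theta)
    return g_ppo, g_rf


print(compare(0.0, 0.5, 1.0))     # rho = 1, inside the clip range
print(compare(0.0, 0.25, 1.0))    # rho = 2, clipped, positive advantage
print(compare(0.0, 0.25, -1.0))   # rho = 2, negative advantage, not clipped
```

The loss values of the two forms differ in every case, which is why the comparison is made on gradients only.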

When Does the Loss Value Matter?

Although the loss value does not affect gradient updates, it can matter for:

  • Monitoring: we often track the loss curve to diagnose training. Two equivalent losses may show different curves, which can be confusing if you compare across implementations.
  • Learning rate scheduling: some schedulers (e.g., ReduceLROnPlateau) adjust the learning rate based on the loss value. Different-valued but same-gradient losses would trigger different scheduling decisions.
  • Early stopping: if you stop training when the loss falls below a threshold, the threshold depends on which formulation you use.
  • Mixed-precision training: numerical stability can differ between formulations even if the exact gradients are identical, because intermediate values differ and floating-point arithmetic is not associative.

For pure gradient descent with a fixed learning rate, the loss value is irrelevant. In practice, be aware of these secondary effects when switching between equivalent formulations.
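
The early-stopping effect is simple to reproduce. In this sketch the losses \(\theta^2\) and \(\theta^2 + 100\), the threshold, and the helper `run` are all assumptions made for illustration.

```python
def run(loss, grad, theta, lr=0.1, threshold=1e-2, max_steps=100):
    """Gradient descent that stops once the loss drops below a threshold."""
    for step in range(max_steps):
        if loss(theta) < threshold:
            return step, theta
        theta = theta - lr * grad(theta)
    return None, theta                        # threshold never reached


def grad(theta):
    return 2.0 * theta                        # shared by both losses


stop_a, theta_a = run(lambda t: t ** 2, grad, theta=1.0)
stop_b, theta_b = run(lambda t: t ** 2 + 100.0, grad, theta=1.0)
print(stop_a, theta_a)
print(stop_b, theta_b)
```

Both runs follow the same parameter updates, but only the first one ever satisfies the stopping rule, so a threshold tuned for one formulation cannot be carried over to the other.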
