Bellman Operator Identities

This post collects a handful of identities involving the Bellman expectation operator \(T^\pi\) and the Bellman optimality operator \(T^\star\). They are the workhorses behind policy evaluation, value iteration, and most convergence proofs in RL. The exposition follows Nan Jiang's CS 443 notes.

Setup

We work with a discounted MDP \((\mathcal{S}, \mathcal{A}, P, r, \gamma)\) with \(\gamma \in [0,1)\). A value function is any \(V: \mathcal{S} \to \mathbb{R}\), viewed as a vector in \(\mathbb{R}^{\vert\mathcal{S}\vert}\).

For a stationary policy \(\pi\), let \(P^\pi \in \mathbb{R}^{\vert\mathcal{S}\vert \times \vert\mathcal{S}\vert}\) be the induced transition matrix and \(r^\pi \in \mathbb{R}^{\vert\mathcal{S}\vert}\) the induced reward vector:

\[P^\pi(s, s') = \sum_a \pi(a \vert s)\, P(s' \vert s, a), \qquad r^\pi(s) = \sum_a \pi(a \vert s)\, r(s, a).\]

Bellman expectation operator. For a fixed \(\pi\), define \(T^\pi: \mathbb{R}^{\vert\mathcal{S}\vert} \to \mathbb{R}^{\vert\mathcal{S}\vert}\) by

\[(T^\pi V)(s) = \sum_a \pi(a \vert s) \left[ r(s,a) + \gamma \sum_{s'} P(s' \vert s, a)\, V(s') \right] = r^\pi(s) + \gamma (P^\pi V)(s).\]

Bellman optimality operator. Define \(T^\star\) by

\[(T^\star V)(s) = \max_a \left[ r(s,a) + \gamma \sum_{s'} P(s' \vert s, a)\, V(s') \right].\]

The two operators act on value functions; they are the primary objects we will manipulate.
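As a concrete sketch, here is a direct implementation of both operators on a hypothetical 2-state, 2-action MDP. All numbers are invented for illustration, and `backup`, `T_pi`, and `T_star` are names chosen here, not from any library:

```python
# Hypothetical 2-state, 2-action MDP; all numbers are invented for illustration.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]                # R[s][a]
P = [[[0.8, 0.2], [0.1, 0.9]],              # P[s][a][s']
     [[0.5, 0.5], [0.3, 0.7]]]
N_S, N_A = 2, 2

def backup(V, s, a):
    """One-step backup: r(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')."""
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(N_S))

def T_pi(V, pi):
    """Bellman expectation operator: average the backup over a ~ pi(.|s)."""
    return [sum(pi[s][a] * backup(V, s, a) for a in range(N_A)) for s in range(N_S)]

def T_star(V):
    """Bellman optimality operator: maximize the backup over actions."""
    return [max(backup(V, s, a) for a in range(N_A)) for s in range(N_S)]

uniform = [[0.5, 0.5], [0.5, 0.5]]
print(T_pi([0.0, 0.0], uniform))   # with V = 0, this is just the induced reward r^pi
print(T_star([0.0, 0.0]))          # and this is max_a r(s, a)
```

Applying either operator to the zero vector strips away the \(\gamma\)-term, which is a handy sanity check when implementing them.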

Linearity of \(T^\pi\)

\(T^\pi\) is an affine map (linear plus a constant): in matrix form,

\[T^\pi V = r^\pi + \gamma P^\pi V.\]

Consequently, for any two value functions \(V, V'\) and scalar \(\alpha\),

\[T^\pi (\alpha V + (1-\alpha) V') = \alpha\, T^\pi V + (1-\alpha)\, T^\pi V',\] \[T^\pi V - T^\pi V' = \gamma P^\pi (V - V').\]

The second identity is what makes \(T^\pi\) a contraction — the “error” \(V - V'\) gets multiplied by \(\gamma P^\pi\), a sub-stochastic linear map with spectral radius at most \(\gamma\).

Note that \(T^\star\) is not linear (the \(\max_a\) spoils linearity), but it is still a contraction; see below.
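Both linearity identities can be checked numerically; a minimal sketch on a hypothetical 2-state, 2-action MDP (numbers invented for illustration):

```python
# Numeric check of the two linearity identities for T^pi.
# Hypothetical 2-state, 2-action MDP; numbers invented for illustration.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def T_pi(V):
    return [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2)))
                for a in range(2)) for s in range(2)]

# Induced transition matrix P^pi(s, s') = sum_a pi(a|s) P(s'|s, a)
P_pi = [[sum(pi[s][a] * P[s][a][t] for a in range(2)) for t in range(2)] for s in range(2)]

V, Vp, alpha = [1.0, -2.0], [0.5, 3.0], 0.3

# Identity 1: T^pi preserves affine combinations.
mix = [alpha * V[s] + (1 - alpha) * Vp[s] for s in range(2)]
lhs1 = T_pi(mix)
rhs1 = [alpha * T_pi(V)[s] + (1 - alpha) * T_pi(Vp)[s] for s in range(2)]
assert all(abs(x - y) < 1e-12 for x, y in zip(lhs1, rhs1))

# Identity 2: T^pi V - T^pi V' = gamma * P^pi (V - V').
lhs2 = [T_pi(V)[s] - T_pi(Vp)[s] for s in range(2)]
rhs2 = [GAMMA * sum(P_pi[s][t] * (V[t] - Vp[t]) for t in range(2)) for s in range(2)]
assert all(abs(x - y) < 1e-12 for x, y in zip(lhs2, rhs2))
```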

Monotonicity

If \(V \le V'\) pointwise (i.e. \(V(s) \le V'(s)\) for all \(s\)), then

\[T^\pi V \le T^\pi V', \qquad T^\star V \le T^\star V'.\]

This is immediate since \(P^\pi, P\) have non-negative entries and \(\max\) preserves order. Monotonicity is the backbone of the policy improvement theorem: if \(V \le T^\star V\), then iterating \(T^\star\) gives a monotone increasing sequence bounded above by \(V^\star\).
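A quick numeric check of monotonicity, on a hypothetical 2-state, 2-action MDP with invented numbers:

```python
# Monotonicity check: V <= V' pointwise implies T V <= T V' pointwise.
# Hypothetical 2-state, 2-action MDP; numbers invented for illustration.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def backup(V, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2))

def T_pi(V):
    return [sum(pi[s][a] * backup(V, s, a) for a in range(2)) for s in range(2)]

def T_star(V):
    return [max(backup(V, s, a) for a in range(2)) for s in range(2)]

V, Vp = [0.0, 1.0], [0.5, 1.5]     # V <= Vp pointwise
assert all(x <= y for x, y in zip(T_pi(V), T_pi(Vp)))
assert all(x <= y for x, y in zip(T_star(V), T_star(Vp)))
```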

Constant shift

Let \(\mathbf{1}\) be the all-ones vector and \(c \in \mathbb{R}\). Then

\[T^\pi (V + c\,\mathbf{1}) = T^\pi V + \gamma c\,\mathbf{1}, \qquad T^\star(V + c\,\mathbf{1}) = T^\star V + \gamma c\,\mathbf{1}.\]

Because \(P^\pi \mathbf{1} = \mathbf{1}\) (rows sum to one), a uniform shift of \(V\) turns into a shift of \(T V\) scaled by \(\gamma\). This is the cleanest way to see \(\gamma\)-contraction in sup-norm.
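The shift identity is also easy to verify numerically; a sketch on a hypothetical 2-state, 2-action MDP with invented numbers:

```python
# Constant-shift check: T(V + c*1) = T V + gamma*c*1 for both operators.
# Hypothetical 2-state, 2-action MDP; numbers invented for illustration.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def backup(V, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2))

def T_pi(V):
    return [sum(pi[s][a] * backup(V, s, a) for a in range(2)) for s in range(2)]

def T_star(V):
    return [max(backup(V, s, a) for a in range(2)) for s in range(2)]

V, c = [1.0, -2.0], 5.0
shifted = [v + c for v in V]
# A uniform shift of V comes out the other side scaled by gamma.
assert all(abs(x - (y + GAMMA * c)) < 1e-12 for x, y in zip(T_pi(shifted), T_pi(V)))
assert all(abs(x - (y + GAMMA * c)) < 1e-12 for x, y in zip(T_star(shifted), T_star(V)))
```

Note that the identity holds exactly for \(T^\star\) too: every backup shifts by the same \(\gamma c\), so the maximizing action is unchanged.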

\(\gamma\)-contraction in sup-norm

For any \(V, V'\):

\[\|T^\pi V - T^\pi V'\|_\infty \le \gamma \|V - V'\|_\infty, \qquad \|T^\star V - T^\star V'\|_\infty \le \gamma \|V - V'\|_\infty.\]

For \(T^\pi\): let \(c = \|V - V'\|_\infty\). Then \(V' - c\mathbf{1} \le V \le V' + c\mathbf{1}\). Apply monotonicity and constant-shift to get \(T^\pi V' - \gamma c\mathbf{1} \le T^\pi V \le T^\pi V' + \gamma c \mathbf{1}\), i.e. \(\|T^\pi V - T^\pi V'\|_\infty \le \gamma c\).

For \(T^\star\): for any \(s\),

\[\vert (T^\star V)(s) - (T^\star V')(s) \vert \le \max_a \gamma \sum_{s'} P(s' \vert s,a)\, \vert V(s') - V'(s')\vert \le \gamma \|V - V'\|_\infty,\]

using \(\vert \max_a f(a) - \max_a g(a)\vert \le \max_a \vert f(a) - g(a)\vert\).

By Banach’s fixed-point theorem, each operator has a unique fixed point in \(\mathbb{R}^{\vert\mathcal{S}\vert}\), and iteration converges geometrically at rate \(\gamma\).
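The contraction can be watched directly: the sup-norm distance between two iterates shrinks by at least a factor \(\gamma\) per application. A sketch on a hypothetical 2-state, 2-action MDP with invented numbers:

```python
# Watching gamma-contraction: sup-norm distance shrinks by a factor <= gamma
# per application of T*. Hypothetical 2-state, 2-action MDP, invented numbers.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]

def T_star(V):
    return [max(R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2))
                for a in range(2)) for s in range(2)]

def sup_dist(V, Vp):
    return max(abs(x - y) for x, y in zip(V, Vp))

V, Vp = [10.0, -3.0], [0.0, 0.0]
for _ in range(10):
    TV, TVp = T_star(V), T_star(Vp)
    # each application contracts the gap by at least gamma
    assert sup_dist(TV, TVp) <= GAMMA * sup_dist(V, Vp) + 1e-12
    V, Vp = TV, TVp
```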

Fixed points

The fixed points of \(T^\pi\) and \(T^\star\) are exactly the value functions we care about:

\[T^\pi V^\pi = V^\pi, \qquad T^\star V^\star = V^\star.\]

Since \(T^\pi\) is affine, solving \(V = r^\pi + \gamma P^\pi V\) gives the closed form

\[V^\pi = (I - \gamma P^\pi)^{-1} r^\pi = \sum_{t=0}^{\infty} \gamma^t (P^\pi)^t r^\pi.\]

The series converges because \(\|\gamma P^\pi\|_\infty = \gamma < 1\). The Neumann-series form \(\sum_t \gamma^t (P^\pi)^t r^\pi\) is exactly the expected discounted return expanded step-by-step.

For \(T^\star\), there is no closed form — but the fixed point still exists and is unique by contraction.
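The closed form and the iterative view agree, which is easy to confirm with a hand-rolled 2x2 solve. A sketch on a hypothetical MDP with invented numbers:

```python
# Closed-form V^pi from a hand-rolled 2x2 solve of (I - gamma P^pi) V = r^pi,
# compared against iterating T^pi. Hypothetical MDP, invented numbers.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def T_pi(V):
    return [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2)))
                for a in range(2)) for s in range(2)]

P_pi = [[sum(pi[s][a] * P[s][a][t] for a in range(2)) for t in range(2)] for s in range(2)]
r_pi = [sum(pi[s][a] * R[s][a] for a in range(2)) for s in range(2)]

# M = I - gamma P^pi, inverted with the 2x2 cofactor formula.
M = [[(1.0 if s == t else 0.0) - GAMMA * P_pi[s][t] for t in range(2)] for s in range(2)]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
V_closed = [(M[1][1] * r_pi[0] - M[0][1] * r_pi[1]) / det,
            (M[0][0] * r_pi[1] - M[1][0] * r_pi[0]) / det]

# V_closed is a fixed point of T^pi ...
assert all(abs(x - y) < 1e-9 for x, y in zip(T_pi(V_closed), V_closed))
# ... and iterating T^pi from zero converges to it geometrically.
V = [0.0, 0.0]
for _ in range(300):
    V = T_pi(V)
assert all(abs(x - y) < 1e-6 for x, y in zip(V, V_closed))
```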

Bellman residual bound

A key practical identity: for any value function \(V\),

\[\|V - V^\pi\|_\infty \le \frac{\|V - T^\pi V\|_\infty}{1 - \gamma}, \qquad \|V - V^\star\|_\infty \le \frac{\|V - T^\star V\|_\infty}{1 - \gamma}.\]

Proof sketch (for \(T^\pi\)). By the triangle inequality and contraction,

\[\|V - V^\pi\|_\infty \le \|V - T^\pi V\|_\infty + \|T^\pi V - T^\pi V^\pi\|_\infty \le \|V - T^\pi V\|_\infty + \gamma \|V - V^\pi\|_\infty.\]

Rearranging gives the bound. The right-hand side \(\|V - T^\pi V\|_\infty\) is the Bellman residual — it is computable from \(V\) alone, no access to \(V^\pi\) needed.
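The bound can be sanity-checked numerically: compute the residual from a guess \(V\), estimate \(V^\pi\) by long iteration, and compare. A sketch with invented numbers:

```python
# Checking ||V - V^pi||_inf <= ||V - T^pi V||_inf / (1 - gamma).
# Hypothetical 2-state, 2-action MDP, invented numbers.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def T_pi(V):
    return [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2)))
                for a in range(2)) for s in range(2)]

V = [3.0, -1.0]                                          # an arbitrary guess
residual = max(abs(x - y) for x, y in zip(V, T_pi(V)))   # computable from V alone

V_pi = [0.0, 0.0]                                        # ground truth via long iteration
for _ in range(500):
    V_pi = T_pi(V_pi)

true_error = max(abs(x - y) for x, y in zip(V, V_pi))
assert true_error <= residual / (1 - GAMMA) + 1e-9
```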

Optimality operator as max over policies

The optimality operator is a pointwise maximum of expectation operators:

\[(T^\star V)(s) = \max_\pi (T^\pi V)(s).\]

Equivalently, \(T^\star V = T^{\pi_V} V\) where \(\pi_V(s) \in \arg\max_a\, [r(s,a) + \gamma \sum_{s'} P(s' \vert s, a) V(s')]\) is the greedy policy w.r.t. \(V\). This links the two operators: one step of value iteration is the same as Bellman-evaluating the current greedy policy for one step.
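The greedy equivalence \(T^\star V = T^{\pi_V} V\) is checkable in a few lines; a sketch on a hypothetical 2-state, 2-action MDP with invented numbers (`greedy` is a name chosen here):

```python
# Checking T* V = T^{pi_V} V, where pi_V is the greedy policy w.r.t. V.
# Hypothetical 2-state, 2-action MDP, invented numbers.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]

def backup(V, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2))

def T_pi(V, pi):
    return [sum(pi[s][a] * backup(V, s, a) for a in range(2)) for s in range(2)]

def T_star(V):
    return [max(backup(V, s, a) for a in range(2)) for s in range(2)]

def greedy(V):
    """The greedy policy w.r.t. V, as a one-hot action distribution per state."""
    pol = []
    for s in range(2):
        best = max(range(2), key=lambda a: backup(V, s, a))
        pol.append([1.0 if a == best else 0.0 for a in range(2)])
    return pol

V = [2.0, -1.0]
assert all(abs(x - y) < 1e-12 for x, y in zip(T_star(V), T_pi(V, greedy(V))))
```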

Telescoping / performance difference

Iterating the linear identity \(T^\pi V - T^\pi V' = \gamma P^\pi (V - V')\) gives, for any \(V\) and any \(k\),

\[(T^\pi)^k V - V^\pi = (\gamma P^\pi)^k (V - V^\pi) \;\xrightarrow{k \to \infty}\; 0.\]

Telescoping the partial sums recovers a useful identity:

\[V^\pi - V = \sum_{t=0}^{\infty} (\gamma P^\pi)^t \big(T^\pi V - V\big).\]

Read it as: the gap between the truth \(V^\pi\) and any guess \(V\) is the discounted sum of Bellman residuals along the \(\pi\)-induced Markov chain. Taking sup-norm gives the Bellman residual bound above; taking an inner product with a state distribution gives the performance difference lemma:

\[V^\pi(s_0) - V(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi_{s_0}} \big[(T^\pi V)(s) - V(s)\big],\]

where \(d^\pi_{s_0}\) is the discounted state-visitation distribution of \(\pi\) starting at \(s_0\). The same identity with \(V = V^{\pi'}\) recovers Kakade and Langford’s classical performance-difference lemma:

\[V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi_{s_0},\, a \sim \pi(\cdot\vert s)}\!\left[ A^{\pi'}(s, a) \right],\]

where \(A^{\pi'}(s, a) = Q^{\pi'}(s, a) - V^{\pi'}(s)\) is the advantage function of \(\pi'\). This identity is the starting point for TRPO, CPI, and a lot of modern policy-optimization theory.
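The finite version of the telescoping identity, \((T^\pi)^K V - V = \sum_{t=0}^{K-1} (\gamma P^\pi)^t (T^\pi V - V)\), holds exactly for every \(K\) and can be checked directly; a sketch with invented numbers:

```python
# Finite telescoping check: (T^pi)^K V - V = sum_{t<K} (gamma P^pi)^t (T^pi V - V).
# Hypothetical 2-state, 2-action MDP, invented numbers.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def T_pi(V):
    return [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2)))
                for a in range(2)) for s in range(2)]

P_pi = [[sum(pi[s][a] * P[s][a][t] for a in range(2)) for t in range(2)] for s in range(2)]

def matvec(A, x):
    return [sum(A[s][t] * x[t] for t in range(2)) for s in range(2)]

V, K = [1.0, 4.0], 6

# Left side: apply T^pi K times, then subtract V.
lhs = list(V)
for _ in range(K):
    lhs = T_pi(lhs)
lhs = [x - v for x, v in zip(lhs, V)]

# Right side: discounted Bellman residuals propagated through P^pi.
term = [x - v for x, v in zip(T_pi(V), V)]    # T^pi V - V
rhs = [0.0, 0.0]
for _ in range(K):
    rhs = [r + t for r, t in zip(rhs, term)]
    term = [GAMMA * x for x in matvec(P_pi, term)]

assert all(abs(x - y) < 1e-9 for x, y in zip(lhs, rhs))
```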

Mixing \(\pi\) with a better tail

A clean application of the linearity identity \(T^\pi V - T^\pi V' = \gamma P^\pi (V - V')\).

Question. Suppose \(\pi'\) is strictly better than \(\pi\) at every state, i.e. \(V^{\pi'}(s) > V^\pi(s)\) for all \(s\). Define a policy \(\sigma\) that takes its first action from \(\pi\) and then follows \(\pi'\) forever. Is \(\sigma\) guaranteed to be better than \(\pi\)? Is it guaranteed to be better than \(\pi'\)?

Answer. Strictly better than \(\pi\), but not necessarily better than \(\pi'\).

The value of \(\sigma\) is, by construction,

\[V^\sigma(s) = \mathbb{E}_{a \sim \pi(\cdot \vert s)}\!\left[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \vert s,a)}\!\left[ V^{\pi'}(s') \right] \right] = (T^\pi V^{\pi'})(s).\]

That is, \(V^\sigma = T^\pi V^{\pi'}\) — one application of the Bellman expectation operator for \(\pi\), starting from \(V^{\pi'}\).

Strictly better than \(\pi\). Use linearity together with the fixed-point identity \(V^\pi = T^\pi V^\pi\):

\[V^\sigma - V^\pi = T^\pi V^{\pi'} - T^\pi V^\pi = \gamma P^\pi (V^{\pi'} - V^\pi) > 0,\]

strictly positive at every state, since \(V^{\pi'} - V^\pi > 0\) pointwise and \(P^\pi\) has non-negative rows that sum to one. So \(\sigma\) strictly improves on \(\pi\).

Not necessarily better than \(\pi'\). Using the fixed-point identity \(V^{\pi'} = T^{\pi'} V^{\pi'}\):

\[(V^\sigma - V^{\pi'})(s) = (T^\pi V^{\pi'})(s) - (T^{\pi'} V^{\pi'})(s) = \sum_a \big(\pi(a \vert s) - \pi'(a \vert s)\big)\, Q^{\pi'}(s, a),\]

whose sign is not determined. Counterexample. One state, two actions, reward \(r(\cdot, a_1) = 1\), \(r(\cdot, a_2) = 0\), discount \(\gamma = 0.9\). Let \(\pi\) always pick \(a_2\) and \(\pi'\) always pick \(a_1\). Then \(V^\pi = 0\), \(V^{\pi'} = 10\) (so \(\pi'\) is strictly better than \(\pi\)). The mixed policy \(\sigma\) takes \(a_2\) once (reward \(0\)) then rolls out \(\pi'\), giving

\[V^\sigma = 0 + \gamma \cdot V^{\pi'} = 0.9 \cdot 10 = 9 \;<\; 10 = V^{\pi'}.\]

So \(\sigma\) is strictly worse than \(\pi'\) despite being built from \(\pi'\)’s value function.
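The arithmetic of the counterexample takes only a few lines to confirm (a sketch; the closed forms follow from the geometric series):

```python
# Verifying the one-state, two-action counterexample.
gamma = 0.9
r_a1, r_a2 = 1.0, 0.0

V_pi_prime = r_a1 / (1 - gamma)        # always a1: 1 + gamma + gamma^2 + ... = 10
V_pi = r_a2 / (1 - gamma)              # always a2: 0
V_sigma = r_a2 + gamma * V_pi_prime    # one step of a2, then follow pi' forever

assert abs(V_pi - 0.0) < 1e-9
assert abs(V_pi_prime - 10.0) < 1e-9
assert abs(V_sigma - 9.0) < 1e-9
assert V_pi < V_sigma < V_pi_prime     # better than pi, worse than pi'
```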

Why this matters. In policy iteration the improvement step is

\[\pi_{\text{new}}(s) = \arg\max_a\, Q^{\pi_{\text{old}}}(s, a),\]

not “sample \(a \sim \pi_{\text{old}}\) then roll out with something better”. Evaluating \(\pi\)’s actions against a better tail \(\pi'\) lifts the value above \(\pi\) — that is essentially the first-step contribution of \(V^{\pi'} - V^\pi\). But any step that still uses \(\pi\)’s action distribution keeps leaking the gap between \(\pi\) and \(\pi'\); the \(\arg\max\) is what actually closes that gap.

The same identity also tells us what would beat \(\pi'\): replace the first action with \(\arg\max_a Q^{\pi'}(s, a)\) (the greedy policy w.r.t. \(V^{\pi'}\)). That is exactly one step of policy iteration on \(\pi'\), and by the policy improvement theorem its value is \(\ge V^{\pi'}\).

Takeaways

All the Bellman operator identities fall out of three primitive properties:

  1. Affine / \(\max\)-affine structure — gives linearity of \(T^\pi\) and the \(\max\)-over-policies form of \(T^\star\).
  2. Row-stochasticity of \(P^\pi\) — gives monotonicity and constant-shift invariance.
  3. Discount \(\gamma < 1\) — promotes constant-shift into \(\gamma\)-contraction, which delivers unique fixed points, geometric convergence, the Bellman residual bound, and the telescoping / performance-difference identity.

If you remember those three, you can re-derive the rest on a napkin.
