Bellman Operator Identities
Setup
We work with a discounted MDP \((\mathcal{S}, \mathcal{A}, P, r, \gamma)\) with \(\gamma \in [0,1)\). A value function is any \(V: \mathcal{S} \to \mathbb{R}\), viewed as a vector in \(\mathbb{R}^{\vert\mathcal{S}\vert}\).
For a stationary policy \(\pi\), let \(P^\pi \in \mathbb{R}^{\vert\mathcal{S}\vert \times \vert\mathcal{S}\vert}\) be the induced transition matrix and \(r^\pi \in \mathbb{R}^{\vert\mathcal{S}\vert}\) the induced reward vector:
\[
(P^\pi)_{s,s'} = \sum_a \pi(a \vert s)\, P(s' \vert s, a), \qquad r^\pi(s) = \sum_a \pi(a \vert s)\, r(s, a).
\]
Bellman expectation operator. For a fixed \(\pi\), define \(T^\pi: \mathbb{R}^{\vert\mathcal{S}\vert} \to \mathbb{R}^{\vert\mathcal{S}\vert}\) by
\[
(T^\pi V)(s) = r^\pi(s) + \gamma \sum_{s'} (P^\pi)_{s,s'}\, V(s').
\]
Bellman optimality operator. Define \(T^\star\) by
\[
(T^\star V)(s) = \max_a \Big[ r(s,a) + \gamma \sum_{s'} P(s' \vert s, a)\, V(s') \Big].
\]
The two operators act on value functions; they are the primary objects we will manipulate.
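The two operators are easy to hold in your hands numerically. A quick NumPy sketch, on a tiny 2-state, 2-action MDP with made-up numbers (purely illustrative):

```python
import numpy as np

# A tiny 2-state, 2-action MDP with made-up numbers (purely illustrative).
gamma = 0.9
P = np.array([[[0.8, 0.2],    # P[a, s, s']: transition probabilities
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
r = np.array([[1.0, 0.0],     # r[a, s]: reward for taking action a in state s
              [0.5, 2.0]])

def T_pi(V, pi):
    """Bellman expectation operator for a deterministic policy pi[s] -> action."""
    s = np.arange(len(V))
    P_pi, r_pi = P[pi, s], r[pi, s]   # induced transition matrix and reward vector
    return r_pi + gamma * P_pi @ V

def T_star(V):
    """Bellman optimality operator: max over actions of the one-step backup."""
    return np.max(r + gamma * P @ V, axis=0)

V1 = T_star(np.zeros(2))   # one backup from V = 0: the best one-step rewards
```

The `P[a, s, s']` / `r[a, s]` layout is just one convenient convention; any indexing that lets you form \(P^\pi\) and \(r^\pi\) works.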
Linearity of \(T^\pi\)
\(T^\pi\) is an affine map (linear plus a constant): in matrix form,
\[
T^\pi V = r^\pi + \gamma P^\pi V.
\]
Consequently, for any two value functions \(V, V'\) and scalar \(\alpha\),
\[
T^\pi\big(\alpha V + (1-\alpha) V'\big) = \alpha\, T^\pi V + (1-\alpha)\, T^\pi V', \qquad T^\pi V - T^\pi V' = \gamma P^\pi (V - V').
\]
The second identity is what makes \(T^\pi\) a contraction — the “error” \(V - V'\) gets multiplied by \(\gamma P^\pi\), a sub-stochastic linear map with spectral radius at most \(\gamma\).
Note that \(T^\star\) is not linear — the \(\max_a\) spoils linearity — but it is still a contraction, see below.
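The difference identity is worth checking numerically once: the reward term cancels and only the error propagates. A sketch on a made-up induced chain (`P_pi`, `r_pi` are arbitrary illustrative numbers):

```python
import numpy as np

# Numeric check of T^pi V - T^pi V' = gamma P^pi (V - V')
# on a made-up induced chain (P_pi, r_pi are illustrative numbers).
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
r_pi = np.array([1.0, 2.0])
T = lambda V: r_pi + gamma * P_pi @ V

rng = np.random.default_rng(0)
V, W = rng.normal(size=2), rng.normal(size=2)
lhs = T(V) - T(W)
rhs = gamma * P_pi @ (V - W)   # reward cancels; only the error is propagated
assert np.allclose(lhs, rhs)
```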
Monotonicity
If \(V \le V'\) pointwise (i.e. \(V(s) \le V'(s)\) for all \(s\)), then
\[
T^\pi V \le T^\pi V', \qquad T^\star V \le T^\star V'.
\]
This is immediate since \(P^\pi, P\) have non-negative entries and \(\max\) preserves order. Monotonicity is the backbone of the policy improvement theorem: if \(V \le T^\star V\), then iterating \(T^\star\) gives a monotone increasing sequence bounded above by \(V^\star\).
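A one-line numerical sanity check of monotonicity, on the same kind of toy MDP (made-up numbers): raise \(V\) at one state and the backup can only go up.

```python
import numpy as np

# Monotonicity check: raise V at one state; T* V can only go up pointwise.
# (MDP numbers below are made up for illustration.)
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[a, s, s']
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # r[a, s]
T_star = lambda V: np.max(r + gamma * P @ V, axis=0)

V = np.array([0.0, 1.0])
W = V + np.array([0.5, 0.0])               # V <= W pointwise
assert np.all(T_star(V) <= T_star(W))
```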
Constant shift
Let \(\mathbf{1}\) be the all-ones vector and \(c \in \mathbb{R}\). Then
\[
T^\pi(V + c\mathbf{1}) = T^\pi V + \gamma c\, \mathbf{1}, \qquad T^\star(V + c\mathbf{1}) = T^\star V + \gamma c\, \mathbf{1}.
\]
Because \(P^\pi \mathbf{1} = \mathbf{1}\) (rows sum to one), a uniform shift of \(V\) turns into a shift of \(T V\) scaled by \(\gamma\). This is the cleanest way to see \(\gamma\)-contraction in sup-norm.
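The shift identity is also a two-liner to verify numerically (toy MDP with made-up numbers; the scalar `c` broadcasts as \(c\mathbf{1}\)):

```python
import numpy as np

# Constant-shift check: T*(V + c*1) = T* V + gamma*c*1.
# (MDP numbers are made up for illustration.)
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])
T_star = lambda V: np.max(r + gamma * P @ V, axis=0)

V, c = np.array([0.3, -1.2]), 5.0
shifted = T_star(V + c)                    # c broadcasts as c * ones
assert np.allclose(shifted, T_star(V) + gamma * c)
```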
\(\gamma\)-contraction in sup-norm
For any \(V, V'\):
\[
\|T^\pi V - T^\pi V'\|_\infty \le \gamma \|V - V'\|_\infty, \qquad \|T^\star V - T^\star V'\|_\infty \le \gamma \|V - V'\|_\infty.
\]
For \(T^\pi\): let \(c = \|V - V'\|_\infty\). Then \(V' - c\mathbf{1} \le V \le V' + c\mathbf{1}\). Apply monotonicity and constant-shift to get \(T^\pi V' - \gamma c\mathbf{1} \le T^\pi V \le T^\pi V' + \gamma c \mathbf{1}\), i.e. \(\|T^\pi V - T^\pi V'\|_\infty \le \gamma c\).
For \(T^\star\): for any \(s\),
\[
\big\vert (T^\star V)(s) - (T^\star V')(s) \big\vert \le \max_a\, \gamma \sum_{s'} P(s' \vert s, a)\, \big\vert V(s') - V'(s') \big\vert \le \gamma \|V - V'\|_\infty,
\]
using \(\vert \max_a f(a) - \max_a g(a)\vert \le \max_a \vert f(a) - g(a)\vert\).
By Banach’s fixed-point theorem, each operator has a unique fixed point in \(\mathbb{R}^{\vert\mathcal{S}\vert}\), and iteration converges geometrically at rate \(\gamma\).
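You can watch the contraction happen: iterate \(T^\star\) from two different starting points and the sup-norm gap shrinks by a factor of at least \(\gamma\) per step. A sketch (toy MDP with made-up numbers):

```python
import numpy as np

# Iterate T* from two starting points: the sup-norm gap contracts
# by at least a factor gamma each step. (MDP numbers are made up.)
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])
T_star = lambda V: np.max(r + gamma * P @ V, axis=0)

V, W = np.zeros(2), 10.0 * np.ones(2)
gaps = []
for _ in range(6):
    V, W = T_star(V), T_star(W)
    gaps.append(np.max(np.abs(V - W)))
assert all(g2 <= gamma * g1 + 1e-12 for g1, g2 in zip(gaps, gaps[1:]))
```

Since the two starts here differ by a constant vector, the gap in fact contracts by exactly \(\gamma\) each step, per the constant-shift identity.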
Fixed points
The fixed points of \(T^\pi\) and \(T^\star\) are exactly the value functions we care about:
\[
T^\pi V^\pi = V^\pi, \qquad T^\star V^\star = V^\star.
\]
Since \(T^\pi\) is affine, solving \(V = r^\pi + \gamma P^\pi V\) gives the closed form
\[
V^\pi = (I - \gamma P^\pi)^{-1} r^\pi = \sum_{t=0}^{\infty} \gamma^t (P^\pi)^t\, r^\pi.
\]
The series converges because \(\|\gamma P^\pi\|_\infty = \gamma < 1\). The Neumann-series form \(\sum_t \gamma^t (P^\pi)^t r^\pi\) is exactly the expected discounted return expanded step-by-step.
For \(T^\star\), there is no closed form — but the fixed point still exists and is unique by contraction.
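The closed form for \(V^\pi\) is a single linear solve, and the Neumann series can be checked against it by truncation. A sketch on a made-up induced chain:

```python
import numpy as np

# Solve the affine fixed-point equation V = r^pi + gamma P^pi V directly,
# and check it against a truncated Neumann series. (Numbers are made up.)
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
r_pi = np.array([1.0, 2.0])

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
assert np.allclose(r_pi + gamma * P_pi @ V_pi, V_pi)      # fixed point of T^pi

neumann = sum(gamma**t * np.linalg.matrix_power(P_pi, t) @ r_pi
              for t in range(300))
assert np.allclose(neumann, V_pi, atol=1e-10)             # series converges to it
```

In practice one uses the linear solve (or iteration) rather than the series; the truncated sum is only there to make the step-by-step-return reading concrete.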
Bellman residual bound
A key practical identity: for any value function \(V\),
\[
\|V - V^\pi\|_\infty \le \frac{\|V - T^\pi V\|_\infty}{1 - \gamma}, \qquad \|V - V^\star\|_\infty \le \frac{\|V - T^\star V\|_\infty}{1 - \gamma}.
\]
Proof sketch (for \(T^\pi\)). By the triangle inequality and contraction,
\[
\|V - V^\pi\|_\infty \le \|V - T^\pi V\|_\infty + \|T^\pi V - T^\pi V^\pi\|_\infty \le \|V - T^\pi V\|_\infty + \gamma \|V - V^\pi\|_\infty.
\]
Rearranging gives the bound. The right-hand side \(\|V - T^\pi V\|_\infty\) is the Bellman residual — it is computable from \(V\) alone, no access to \(V^\pi\) needed.
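A quick numeric check that the bound holds for an arbitrary guess (induced-chain numbers are made up):

```python
import numpy as np

# Check ||V - V^pi||_inf <= ||V - T^pi V||_inf / (1 - gamma) for a guess V.
# (Induced chain numbers are made up.)
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
r_pi = np.array([1.0, 2.0])
T = lambda V: r_pi + gamma * P_pi @ V
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)    # true value

V = np.array([3.0, -1.0])                                 # an arbitrary guess
true_err = np.max(np.abs(V - V_pi))                       # needs V^pi to compute
residual_bound = np.max(np.abs(V - T(V))) / (1 - gamma)   # computable from V alone
assert true_err <= residual_bound + 1e-12
```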
Optimality operator as max over policies
The optimality operator is a pointwise maximum of expectation operators:
\[
T^\star V = \max_\pi\, T^\pi V \quad \text{(pointwise, over all stationary policies } \pi\text{)}.
\]
Equivalently, \(T^\star V = T^{\pi_V} V\) where \(\pi_V(s) \in \arg\max_a\, [r(s,a) + \gamma \sum_{s'} P(s' \vert s, a) V(s')]\) is the greedy policy w.r.t. \(V\). This links the two operators: one step of value iteration is the same as Bellman-evaluating the current greedy policy for one step.
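The greedy-policy reading of \(T^\star\) is one `argmax` in code (toy MDP with made-up numbers):

```python
import numpy as np

# T* V equals T^{pi_V} V where pi_V is greedy w.r.t. V. (MDP numbers made up.)
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])

V = np.array([1.5, -0.5])
Q = r + gamma * P @ V                      # Q[a, s]: one-step backup per action
pi_V = np.argmax(Q, axis=0)                # greedy policy w.r.t. V
s = np.arange(2)
assert np.allclose(np.max(Q, axis=0), Q[pi_V, s])   # T* V == T^{pi_V} V
```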
Telescoping / performance difference
Iterating the linear identity \(T^\pi V - T^\pi V' = \gamma P^\pi (V - V')\) gives, for any \(V\) and any \(k\),
\[
(T^\pi)^k V - V^\pi = \gamma^k (P^\pi)^k \big( V - V^\pi \big).
\]
Telescoping the partial sums recovers a useful identity:
\[
V^\pi - V = \sum_{t=0}^{\infty} \gamma^t (P^\pi)^t \big( T^\pi V - V \big).
\]
Read it as: the gap between the truth \(V^\pi\) and any guess \(V\) is the discounted sum of Bellman residuals along the \(\pi\)-induced Markov chain. Taking sup-norm gives the Bellman residual bound above; taking an inner product with a state distribution gives the performance difference lemma:
\[
V^\pi(s_0) - V(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi_{s_0}} \big[ (T^\pi V)(s) - V(s) \big],
\]
where \(d^\pi_{s_0}\) is the discounted state-visitation distribution of \(\pi\) starting at \(s_0\). The same identity with \(V = V^{\pi'}\) recovers Kakade and Langford’s classical performance-difference lemma:
\[
V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi_{s_0},\, a \sim \pi(\cdot \vert s)} \big[ A^{\pi'}(s, a) \big],
\]
which is the starting point for TRPO, CPI, and a lot of modern policy-optimization theory.
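The telescoping identity itself is easy to verify with a truncated sum (induced-chain numbers are made up):

```python
import numpy as np

# Check V^pi - V = sum_t gamma^t (P^pi)^t (T^pi V - V), truncated.
# (Induced chain numbers are made up.)
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
r_pi = np.array([1.0, 2.0])
T = lambda V: r_pi + gamma * P_pi @ V
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

V = np.array([0.7, -2.0])                  # arbitrary guess
residual = T(V) - V                        # Bellman residual of the guess
series = sum(gamma**t * np.linalg.matrix_power(P_pi, t) @ residual
             for t in range(300))
assert np.allclose(V_pi - V, series, atol=1e-9)
```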
Mixing \(\pi\) with a better tail
A clean application of the linearity identity \(T^\pi V - T^\pi V' = \gamma P^\pi (V - V')\).
Question. Suppose \(\pi'\) is strictly better than \(\pi\) at every state, i.e. \(V^{\pi'}(s) > V^\pi(s)\) for all \(s\). Define a policy \(\sigma\) that takes its first action from \(\pi\) and then follows \(\pi'\) forever. Is \(\sigma\) guaranteed to be better than \(\pi\)? Is it guaranteed to be better than \(\pi'\)?
Answer. Strictly better than \(\pi\), but not necessarily better than \(\pi'\).
The value of \(\sigma\) is, by construction,
\[
V^\sigma(s) = \sum_a \pi(a \vert s) \Big[ r(s,a) + \gamma \sum_{s'} P(s' \vert s, a)\, V^{\pi'}(s') \Big].
\]
That is, \(V^\sigma = T^\pi V^{\pi'}\) — one application of the Bellman expectation operator for \(\pi\), starting from \(V^{\pi'}\).
Strictly better than \(\pi\). Use linearity together with the fixed-point identity \(V^\pi = T^\pi V^\pi\):
\[
V^\sigma - V^\pi = T^\pi V^{\pi'} - T^\pi V^\pi = \gamma P^\pi \big( V^{\pi'} - V^\pi \big),
\]
strictly positive at every state, since \(V^{\pi'} - V^\pi > 0\) pointwise and \(P^\pi\) has non-negative rows that sum to one. So \(\sigma\) strictly improves on \(\pi\).
Not necessarily better than \(\pi'\). Using the fixed-point identity \(V^{\pi'} = T^{\pi'} V^{\pi'}\):
\[
V^\sigma - V^{\pi'} = T^\pi V^{\pi'} - T^{\pi'} V^{\pi'},
\]
whose sign is not determined. Counterexample. One state, two actions, reward \(r(\cdot, a_1) = 1\), \(r(\cdot, a_2) = 0\), discount \(\gamma = 0.9\). Let \(\pi\) always pick \(a_2\) and \(\pi'\) always pick \(a_1\). Then \(V^\pi = 0\), \(V^{\pi'} = 10\) (so \(\pi'\) is strictly better than \(\pi\)). The mixed policy \(\sigma\) takes \(a_2\) once (reward \(0\)) then rolls out \(\pi'\), giving
\[
V^\sigma = 0 + \gamma V^{\pi'} = 0.9 \times 10 = 9 < 10 = V^{\pi'}.
\]
So \(\sigma\) is strictly worse than \(\pi'\) despite being built from \(\pi'\)’s value function.
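The counterexample is small enough to work out in a few lines of plain Python (the one-state geometric-series values from above):

```python
# The one-state counterexample, worked numerically: pi always takes a2
# (reward 0), pi' always takes a1 (reward 1), gamma = 0.9.
gamma = 0.9
V_pi = 0.0 / (1 - gamma)              # always a2: return 0
V_pip = 1.0 / (1 - gamma)             # always a1: return 10
V_sigma = 0.0 + gamma * V_pip         # one pi-step, then pi' forever

assert V_pi < V_sigma < V_pip         # 0 < 9 < 10: beats pi, loses to pi'
```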
Why this matters. In policy iteration the improvement step is
\[
\pi_{\text{new}}(s) \in \arg\max_a \Big[ r(s,a) + \gamma \sum_{s'} P(s' \vert s, a)\, V^{\pi_{\text{old}}}(s') \Big],
\]
not “sample \(a \sim \pi_{\text{old}}\) then roll out with something better”. Evaluating \(\pi\)’s actions against a better tail \(\pi'\) lifts the value above \(\pi\) — that is essentially the first-step contribution of \(V^{\pi'} - V^\pi\). But any step that still uses \(\pi\)’s action distribution keeps leaking the gap between \(\pi\) and \(\pi'\); the \(\arg\max\) is what actually closes that gap.
The same identity also tells us what would beat \(\pi'\): replace the first action with \(\arg\max_a Q^{\pi'}(s, a)\) (the greedy policy w.r.t. \(V^{\pi'}\)). That is exactly one step of policy iteration on \(\pi'\), and by the policy improvement theorem its value is \(\ge V^{\pi'}\).
Takeaways
All the Bellman operator identities fall out of three primitive properties:
- Affine / \(\max\)-affine structure — gives linearity of \(T^\pi\) and the \(\max\)-over-policies form of \(T^\star\).
- Row-stochasticity of \(P^\pi\) — gives monotonicity and constant-shift invariance.
- Discount \(\gamma < 1\) — promotes constant-shift into \(\gamma\)-contraction, which delivers unique fixed points, geometric convergence, the Bellman residual bound, and the telescoping / performance-difference identity.
If you remember those three, you can re-derive the rest on a napkin.