Bellman Operator Identities

This post collects a handful of identities involving the Bellman expectation operator \(T^\pi\) and the Bellman optimality operator \(T^\star\). They are the workhorses behind policy evaluation, value iteration, and most convergence proofs in RL. The exposition follows Nan Jiang's CS 443 notes.

Setup

We work with a discounted MDP \((\mathcal{S}, \mathcal{A}, P, r, \gamma)\) with \(\gamma \in [0,1)\). A value function is any \(V: \mathcal{S} \to \mathbb{R}\), viewed as a vector in \(\mathbb{R}^{\vert\mathcal{S}\vert}\).

For a stationary policy \(\pi\), let \(P^\pi \in \mathbb{R}^{\vert\mathcal{S}\vert \times \vert\mathcal{S}\vert}\) be the induced transition matrix and \(r^\pi \in \mathbb{R}^{\vert\mathcal{S}\vert}\) the induced reward vector:

\[P^\pi(s, s') = \sum_a \pi(a \vert s)\, P(s' \vert s, a), \qquad r^\pi(s) = \sum_a \pi(a \vert s)\, r(s, a).\]

Bellman expectation operator. For a fixed \(\pi\), define \(T^\pi: \mathbb{R}^{\vert\mathcal{S}\vert} \to \mathbb{R}^{\vert\mathcal{S}\vert}\) by

\[(T^\pi V)(s) = \sum_a \pi(a \vert s) \left[ r(s,a) + \gamma \sum_{s'} P(s' \vert s, a)\, V(s') \right] = r^\pi(s) + \gamma (P^\pi V)(s).\]

Bellman optimality operator. Define \(T^\star\) by

\[(T^\star V)(s) = \max_a \left[ r(s,a) + \gamma \sum_{s'} P(s' \vert s, a)\, V(s') \right].\]

The two operators act on value functions; they are the primary objects we will manipulate.
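As a concrete sketch, here is a direct implementation of both operators on a hypothetical 2-state, 2-action MDP. All numbers are invented for illustration, and `backup`, `T_pi`, and `T_star` are names chosen here, not from any library:

```python
# Hypothetical 2-state, 2-action MDP; all numbers are invented for illustration.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]                # R[s][a]
P = [[[0.8, 0.2], [0.1, 0.9]],              # P[s][a][s']
     [[0.5, 0.5], [0.3, 0.7]]]
N_S, N_A = 2, 2

def backup(V, s, a):
    """One-step backup: r(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')."""
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(N_S))

def T_pi(V, pi):
    """Bellman expectation operator: average the backup over a ~ pi(.|s)."""
    return [sum(pi[s][a] * backup(V, s, a) for a in range(N_A)) for s in range(N_S)]

def T_star(V):
    """Bellman optimality operator: maximize the backup over actions."""
    return [max(backup(V, s, a) for a in range(N_A)) for s in range(N_S)]

uniform = [[0.5, 0.5], [0.5, 0.5]]
print(T_pi([0.0, 0.0], uniform))   # with V = 0, this is just the induced reward r^pi
print(T_star([0.0, 0.0]))          # and this is max_a r(s, a)
```

Applying either operator to the zero vector strips away the \(\gamma\)-term, which is a handy sanity check when implementing them.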

Linearity of \(T^\pi\)

\(T^\pi\) is an affine map (linear plus a constant): in matrix form,

\[T^\pi V = r^\pi + \gamma P^\pi V.\]

Consequently, for any two value functions \(V, V'\) and scalar \(\alpha\),

\[T^\pi (\alpha V + (1-\alpha) V') = \alpha\, T^\pi V + (1-\alpha)\, T^\pi V',\] \[T^\pi V - T^\pi V' = \gamma P^\pi (V - V').\]

The second identity is what makes \(T^\pi\) a contraction — the “error” \(V - V'\) gets multiplied by \(\gamma P^\pi\), a sub-stochastic linear map with spectral radius at most \(\gamma\).

Note that \(T^\star\) is not linear (the \(\max_a\) spoils linearity), but it is still a contraction; see below.
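Both linearity identities can be checked numerically; a minimal sketch on a hypothetical 2-state, 2-action MDP (numbers invented for illustration):

```python
# Numeric check of the two linearity identities for T^pi.
# Hypothetical 2-state, 2-action MDP; numbers invented for illustration.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def T_pi(V):
    return [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2)))
                for a in range(2)) for s in range(2)]

# Induced transition matrix P^pi(s, s') = sum_a pi(a|s) P(s'|s, a)
P_pi = [[sum(pi[s][a] * P[s][a][t] for a in range(2)) for t in range(2)] for s in range(2)]

V, Vp, alpha = [1.0, -2.0], [0.5, 3.0], 0.3

# Identity 1: T^pi preserves affine combinations.
mix = [alpha * V[s] + (1 - alpha) * Vp[s] for s in range(2)]
lhs1 = T_pi(mix)
rhs1 = [alpha * T_pi(V)[s] + (1 - alpha) * T_pi(Vp)[s] for s in range(2)]
assert all(abs(x - y) < 1e-12 for x, y in zip(lhs1, rhs1))

# Identity 2: T^pi V - T^pi V' = gamma * P^pi (V - V').
lhs2 = [T_pi(V)[s] - T_pi(Vp)[s] for s in range(2)]
rhs2 = [GAMMA * sum(P_pi[s][t] * (V[t] - Vp[t]) for t in range(2)) for s in range(2)]
assert all(abs(x - y) < 1e-12 for x, y in zip(lhs2, rhs2))
```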

Monotonicity

If \(V \le V'\) pointwise (i.e. \(V(s) \le V'(s)\) for all \(s\)), then

\[T^\pi V \le T^\pi V', \qquad T^\star V \le T^\star V'.\]

This is immediate since \(P^\pi, P\) have non-negative entries and \(\max\) preserves order. Monotonicity is the backbone of the policy improvement theorem: if \(V \le T^\star V\), then iterating \(T^\star\) gives a monotone increasing sequence bounded above by \(V^\star\).
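A quick numeric check of monotonicity, on a hypothetical 2-state, 2-action MDP with invented numbers:

```python
# Monotonicity check: V <= V' pointwise implies T V <= T V' pointwise.
# Hypothetical 2-state, 2-action MDP; numbers invented for illustration.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def backup(V, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2))

def T_pi(V):
    return [sum(pi[s][a] * backup(V, s, a) for a in range(2)) for s in range(2)]

def T_star(V):
    return [max(backup(V, s, a) for a in range(2)) for s in range(2)]

V, Vp = [0.0, 1.0], [0.5, 1.5]     # V <= Vp pointwise
assert all(x <= y for x, y in zip(T_pi(V), T_pi(Vp)))
assert all(x <= y for x, y in zip(T_star(V), T_star(Vp)))
```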

Constant shift

Let \(\mathbf{1}\) be the all-ones vector and \(c \in \mathbb{R}\). Then

\[T^\pi (V + c\,\mathbf{1}) = T^\pi V + \gamma c\,\mathbf{1}, \qquad T^\star(V + c\,\mathbf{1}) = T^\star V + \gamma c\,\mathbf{1}.\]

Because \(P^\pi \mathbf{1} = \mathbf{1}\) (rows sum to one), a uniform shift of \(V\) turns into a shift of \(T V\) scaled by \(\gamma\). This is the cleanest way to see \(\gamma\)-contraction in sup-norm.
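The shift identity is also easy to verify numerically; a sketch on a hypothetical 2-state, 2-action MDP with invented numbers:

```python
# Constant-shift check: T(V + c*1) = T V + gamma*c*1 for both operators.
# Hypothetical 2-state, 2-action MDP; numbers invented for illustration.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def backup(V, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2))

def T_pi(V):
    return [sum(pi[s][a] * backup(V, s, a) for a in range(2)) for s in range(2)]

def T_star(V):
    return [max(backup(V, s, a) for a in range(2)) for s in range(2)]

V, c = [1.0, -2.0], 5.0
shifted = [v + c for v in V]
# A uniform shift of V comes out the other side scaled by gamma.
assert all(abs(x - (y + GAMMA * c)) < 1e-12 for x, y in zip(T_pi(shifted), T_pi(V)))
assert all(abs(x - (y + GAMMA * c)) < 1e-12 for x, y in zip(T_star(shifted), T_star(V)))
```

Note that the identity holds exactly for \(T^\star\) too: every backup shifts by the same \(\gamma c\), so the maximizing action is unchanged.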

\(\gamma\)-contraction in sup-norm

For any \(V, V'\):

\[\|T^\pi V - T^\pi V'\|_\infty \le \gamma \|V - V'\|_\infty, \qquad \|T^\star V - T^\star V'\|_\infty \le \gamma \|V - V'\|_\infty.\]

For \(T^\pi\): let \(c = \|V - V'\|_\infty\). Then \(V' - c\mathbf{1} \le V \le V' + c\mathbf{1}\). Apply monotonicity and constant-shift to get \(T^\pi V' - \gamma c\mathbf{1} \le T^\pi V \le T^\pi V' + \gamma c \mathbf{1}\), i.e. \(\|T^\pi V - T^\pi V'\|_\infty \le \gamma c\).

For \(T^\star\): for any \(s\),

\[\vert (T^\star V)(s) - (T^\star V')(s) \vert \le \max_a \gamma \sum_{s'} P(s' \vert s,a)\, \vert V(s') - V'(s')\vert \le \gamma \|V - V'\|_\infty,\]

using \(\vert \max_a f(a) - \max_a g(a)\vert \le \max_a \vert f(a) - g(a)\vert\).

By Banach’s fixed-point theorem, each operator has a unique fixed point in \(\mathbb{R}^{\vert\mathcal{S}\vert}\), and iteration converges geometrically at rate \(\gamma\).
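The contraction can be watched directly: the sup-norm distance between two iterates shrinks by at least a factor \(\gamma\) per application. A sketch on a hypothetical 2-state, 2-action MDP with invented numbers:

```python
# Watching gamma-contraction: sup-norm distance shrinks by a factor <= gamma
# per application of T*. Hypothetical 2-state, 2-action MDP, invented numbers.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]

def T_star(V):
    return [max(R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2))
                for a in range(2)) for s in range(2)]

def sup_dist(V, Vp):
    return max(abs(x - y) for x, y in zip(V, Vp))

V, Vp = [10.0, -3.0], [0.0, 0.0]
for _ in range(10):
    TV, TVp = T_star(V), T_star(Vp)
    # each application contracts the gap by at least gamma
    assert sup_dist(TV, TVp) <= GAMMA * sup_dist(V, Vp) + 1e-12
    V, Vp = TV, TVp
```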

Fixed points

The fixed points of \(T^\pi\) and \(T^\star\) are exactly the value functions we care about:

\[T^\pi V^\pi = V^\pi, \qquad T^\star V^\star = V^\star.\]

Since \(T^\pi\) is affine, solving \(V = r^\pi + \gamma P^\pi V\) gives the closed form

\[V^\pi = (I - \gamma P^\pi)^{-1} r^\pi = \sum_{t=0}^{\infty} \gamma^t (P^\pi)^t r^\pi.\]

The series converges because \(\|\gamma P^\pi\|_\infty = \gamma < 1\). The Neumann-series form \(\sum_t \gamma^t (P^\pi)^t r^\pi\) is exactly the expected discounted return expanded step-by-step.

For \(T^\star\), there is no closed form — but the fixed point still exists and is unique by contraction.
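The closed form and the iterative view agree, which is easy to confirm with a hand-rolled 2x2 solve. A sketch on a hypothetical MDP with invented numbers:

```python
# Closed-form V^pi from a hand-rolled 2x2 solve of (I - gamma P^pi) V = r^pi,
# compared against iterating T^pi. Hypothetical MDP, invented numbers.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def T_pi(V):
    return [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2)))
                for a in range(2)) for s in range(2)]

P_pi = [[sum(pi[s][a] * P[s][a][t] for a in range(2)) for t in range(2)] for s in range(2)]
r_pi = [sum(pi[s][a] * R[s][a] for a in range(2)) for s in range(2)]

# M = I - gamma P^pi, inverted with the 2x2 cofactor formula.
M = [[(1.0 if s == t else 0.0) - GAMMA * P_pi[s][t] for t in range(2)] for s in range(2)]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
V_closed = [(M[1][1] * r_pi[0] - M[0][1] * r_pi[1]) / det,
            (M[0][0] * r_pi[1] - M[1][0] * r_pi[0]) / det]

# V_closed is a fixed point of T^pi ...
assert all(abs(x - y) < 1e-9 for x, y in zip(T_pi(V_closed), V_closed))
# ... and iterating T^pi from zero converges to it geometrically.
V = [0.0, 0.0]
for _ in range(300):
    V = T_pi(V)
assert all(abs(x - y) < 1e-6 for x, y in zip(V, V_closed))
```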

Bellman residual bound

A key practical identity: for any value function \(V\),

\[\|V - V^\pi\|_\infty \le \frac{\|V - T^\pi V\|_\infty}{1 - \gamma}, \qquad \|V - V^\star\|_\infty \le \frac{\|V - T^\star V\|_\infty}{1 - \gamma}.\]

Proof sketch (for \(T^\pi\)). By the triangle inequality and contraction,

\[\|V - V^\pi\|_\infty \le \|V - T^\pi V\|_\infty + \|T^\pi V - T^\pi V^\pi\|_\infty \le \|V - T^\pi V\|_\infty + \gamma \|V - V^\pi\|_\infty.\]

Rearranging gives the bound. The right-hand side \(\|V - T^\pi V\|_\infty\) is the Bellman residual — it is computable from \(V\) alone, no access to \(V^\pi\) needed.
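The bound can be sanity-checked numerically: compute the residual from a guess \(V\), estimate \(V^\pi\) by long iteration, and compare. A sketch with invented numbers:

```python
# Checking ||V - V^pi||_inf <= ||V - T^pi V||_inf / (1 - gamma).
# Hypothetical 2-state, 2-action MDP, invented numbers.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def T_pi(V):
    return [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2)))
                for a in range(2)) for s in range(2)]

V = [3.0, -1.0]                                          # an arbitrary guess
residual = max(abs(x - y) for x, y in zip(V, T_pi(V)))   # computable from V alone

V_pi = [0.0, 0.0]                                        # ground truth via long iteration
for _ in range(500):
    V_pi = T_pi(V_pi)

true_error = max(abs(x - y) for x, y in zip(V, V_pi))
assert true_error <= residual / (1 - GAMMA) + 1e-9
```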

Optimality operator as max over policies

The optimality operator is a pointwise maximum of expectation operators:

\[(T^\star V)(s) = \max_\pi (T^\pi V)(s).\]

Equivalently, \(T^\star V = T^{\pi_V} V\) where \(\pi_V(s) \in \arg\max_a\, [r(s,a) + \gamma \sum_{s'} P(s' \vert s, a) V(s')]\) is the greedy policy w.r.t. \(V\). This links the two operators: one step of value iteration is the same as Bellman-evaluating the current greedy policy for one step.
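The greedy equivalence \(T^\star V = T^{\pi_V} V\) is checkable in a few lines; a sketch on a hypothetical 2-state, 2-action MDP with invented numbers (`greedy` is a name chosen here):

```python
# Checking T* V = T^{pi_V} V, where pi_V is the greedy policy w.r.t. V.
# Hypothetical 2-state, 2-action MDP, invented numbers.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]

def backup(V, s, a):
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2))

def T_pi(V, pi):
    return [sum(pi[s][a] * backup(V, s, a) for a in range(2)) for s in range(2)]

def T_star(V):
    return [max(backup(V, s, a) for a in range(2)) for s in range(2)]

def greedy(V):
    """The greedy policy w.r.t. V, as a one-hot action distribution per state."""
    pol = []
    for s in range(2):
        best = max(range(2), key=lambda a: backup(V, s, a))
        pol.append([1.0 if a == best else 0.0 for a in range(2)])
    return pol

V = [2.0, -1.0]
assert all(abs(x - y) < 1e-12 for x, y in zip(T_star(V), T_pi(V, greedy(V))))
```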

Telescoping / performance difference

Iterating the linear identity \(T^\pi V - T^\pi V' = \gamma P^\pi (V - V')\) gives, for any \(V\) and any \(k\),

\[(T^\pi)^k V - V^\pi = (\gamma P^\pi)^k (V - V^\pi) \;\xrightarrow{k \to \infty}\; 0.\]

Telescoping the partial sums recovers a useful identity:

\[V^\pi - V = \sum_{t=0}^{\infty} (\gamma P^\pi)^t \big(T^\pi V - V\big).\]

Read it as: the gap between the truth \(V^\pi\) and any guess \(V\) is the discounted sum of Bellman residuals along the \(\pi\)-induced Markov chain. Taking sup-norm gives the Bellman residual bound above; taking an inner product with a state distribution gives the performance difference lemma:

\[V^\pi(s_0) - V(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi_{s_0}} \big[(T^\pi V)(s) - V(s)\big],\]

where \(d^\pi_{s_0}\) is the discounted state-visitation distribution of \(\pi\) starting at \(s_0\). The same identity with \(V = V^{\pi'}\) recovers Kakade and Langford’s classical performance-difference lemma:

\[V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi_{s_0},\, a \sim \pi(\cdot\vert s)}\!\left[ A^{\pi'}(s, a) \right],\]

where \(A^{\pi'}(s, a) = Q^{\pi'}(s, a) - V^{\pi'}(s)\) is the advantage function of \(\pi'\). This identity is the starting point for TRPO, CPI, and a lot of modern policy-optimization theory.
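The finite version of the telescoping identity, \((T^\pi)^K V - V = \sum_{t=0}^{K-1} (\gamma P^\pi)^t (T^\pi V - V)\), holds exactly for every \(K\) and can be checked directly; a sketch with invented numbers:

```python
# Finite telescoping check: (T^pi)^K V - V = sum_{t<K} (gamma P^pi)^t (T^pi V - V).
# Hypothetical 2-state, 2-action MDP, invented numbers.
GAMMA = 0.9
R = [[1.0, 0.0], [0.5, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.2, 0.8]]

def T_pi(V):
    return [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(2)))
                for a in range(2)) for s in range(2)]

P_pi = [[sum(pi[s][a] * P[s][a][t] for a in range(2)) for t in range(2)] for s in range(2)]

def matvec(A, x):
    return [sum(A[s][t] * x[t] for t in range(2)) for s in range(2)]

V, K = [1.0, 4.0], 6

# Left side: apply T^pi K times, then subtract V.
lhs = list(V)
for _ in range(K):
    lhs = T_pi(lhs)
lhs = [x - v for x, v in zip(lhs, V)]

# Right side: discounted Bellman residuals propagated through P^pi.
term = [x - v for x, v in zip(T_pi(V), V)]    # T^pi V - V
rhs = [0.0, 0.0]
for _ in range(K):
    rhs = [r + t for r, t in zip(rhs, term)]
    term = [GAMMA * x for x in matvec(P_pi, term)]

assert all(abs(x - y) < 1e-9 for x, y in zip(lhs, rhs))
```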

Mixing \(\pi\) with a better tail

A clean application of the linearity identity \(T^\pi V - T^\pi V' = \gamma P^\pi (V - V')\).

Question. Suppose \(\pi'\) is strictly better than \(\pi\) at every state, i.e. \(V^{\pi'}(s) > V^\pi(s)\) for all \(s\). Define a policy \(\sigma\) that takes its first action from \(\pi\) and then follows \(\pi'\) forever. Is \(\sigma\) guaranteed to be better than \(\pi\)? Is it guaranteed to be better than \(\pi'\)?

Answer. Strictly better than \(\pi\), but not necessarily better than \(\pi'\).

The value of \(\sigma\) is, by construction,

\[V^\sigma(s) = \mathbb{E}_{a \sim \pi(\cdot \vert s)}\!\left[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \vert s,a)}\!\left[ V^{\pi'}(s') \right] \right] = (T^\pi V^{\pi'})(s).\]

That is, \(V^\sigma = T^\pi V^{\pi'}\) — one application of the Bellman expectation operator for \(\pi\), starting from \(V^{\pi'}\).

Strictly better than \(\pi\). Use linearity together with the fixed-point identity \(V^\pi = T^\pi V^\pi\):

\[V^\sigma - V^\pi = T^\pi V^{\pi'} - T^\pi V^\pi = \gamma P^\pi (V^{\pi'} - V^\pi) > 0,\]

strictly positive at every state, since \(V^{\pi'} - V^\pi > 0\) pointwise and \(P^\pi\) has non-negative rows that sum to one. So \(\sigma\) strictly improves on \(\pi\).

Not necessarily better than \(\pi'\). Using the fixed-point identity \(V^{\pi'} = T^{\pi'} V^{\pi'}\):

\[(V^\sigma - V^{\pi'})(s) = (T^\pi V^{\pi'})(s) - (T^{\pi'} V^{\pi'})(s) = \sum_a \big(\pi(a \vert s) - \pi'(a \vert s)\big)\, Q^{\pi'}(s, a),\]

whose sign is not determined. Counterexample. One state, two actions, reward \(r(\cdot, a_1) = 1\), \(r(\cdot, a_2) = 0\), discount \(\gamma = 0.9\). Let \(\pi\) always pick \(a_2\) and \(\pi'\) always pick \(a_1\). Then \(V^\pi = 0\), \(V^{\pi'} = 10\) (so \(\pi'\) is strictly better than \(\pi\)). The mixed policy \(\sigma\) takes \(a_2\) once (reward \(0\)) then rolls out \(\pi'\), giving

\[V^\sigma = 0 + \gamma \cdot V^{\pi'} = 0.9 \cdot 10 = 9 \;<\; 10 = V^{\pi'}.\]

So \(\sigma\) is strictly worse than \(\pi'\) despite being built from \(\pi'\)’s value function.
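The arithmetic of the counterexample takes only a few lines to confirm (a sketch; the closed forms follow from the geometric series):

```python
# Verifying the one-state, two-action counterexample.
gamma = 0.9
r_a1, r_a2 = 1.0, 0.0

V_pi_prime = r_a1 / (1 - gamma)        # always a1: 1 + gamma + gamma^2 + ... = 10
V_pi = r_a2 / (1 - gamma)              # always a2: 0
V_sigma = r_a2 + gamma * V_pi_prime    # one step of a2, then follow pi' forever

assert abs(V_pi - 0.0) < 1e-9
assert abs(V_pi_prime - 10.0) < 1e-9
assert abs(V_sigma - 9.0) < 1e-9
assert V_pi < V_sigma < V_pi_prime     # better than pi, worse than pi'
```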

Why this matters. In policy iteration the improvement step is

\[\pi_{\text{new}}(s) = \arg\max_a\, Q^{\pi_{\text{old}}}(s, a),\]

not “sample \(a \sim \pi_{\text{old}}\) then roll out with something better”. Evaluating \(\pi\)’s actions against a better tail \(\pi'\) lifts the value above \(\pi\) — that is essentially the first-step contribution of \(V^{\pi'} - V^\pi\). But any step that still uses \(\pi\)’s action distribution keeps leaking the gap between \(\pi\) and \(\pi'\); the \(\arg\max\) is what actually closes that gap.

The same identity also tells us what would beat \(\pi'\): replace the first action with \(\arg\max_a Q^{\pi'}(s, a)\) (the greedy policy w.r.t. \(V^{\pi'}\)). That is exactly one step of policy iteration on \(\pi'\), and by the policy improvement theorem its value is \(\ge V^{\pi'}\).

Takeaways

All the Bellman operator identities fall out of three primitive properties:

  1. Affine / \(\max\)-affine structure — gives linearity of \(T^\pi\) and the \(\max\)-over-policies form of \(T^\star\).
  2. Row-stochasticity of \(P^\pi\) — gives monotonicity and constant-shift invariance.
  3. Discount \(\gamma < 1\) — promotes constant-shift into \(\gamma\)-contraction, which delivers unique fixed points, geometric convergence, the Bellman residual bound, and the telescoping / performance-difference identity.

If you remember those three, you can re-derive the rest on a napkin.
