$R=1$ (success) or $R=0$ (failure)
$p(s_0) \approx 0.25 \qquad p(s_2) \approx 0.80$
States closer to Goal succeed more often.
$A = R \quad\Longrightarrow\quad \text{Success}\to 1,\;\;\text{Failure}\to 0$
With $A = R$, a success contributes the same signal to $\nabla_\theta J$ whether it came from $s_0$ or $s_2$, so the gradient cannot tell skill from luck.
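For any random variable $X$ and conditioning variable $S$ (law of total variance):
$\operatorname{Var}(X) = \mathbb{E}\big[\operatorname{Var}(X \mid S)\big] + \operatorname{Var}\big(\mathbb{E}[X \mid S]\big)$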
The first term: on average, how spread out is $X$ inside each group $S\!=\!s$?
Even if you know which state you're in, outcomes still fluctuate. This is irreducible noise.
The second term: how much do the group means themselves vary?
If different states have very different expected advantages, that alone inflates total variance.
A per-group baseline (like $V(s)$) can zero out the second term by centering each group at $0$.
It cannot touch the first term — that's intrinsic to the distribution of $X \mid S$.
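The decomposition can be checked numerically. The sketch below is illustrative only: the helper `variance_decomposition`, the three group labels, and their means are hypothetical choices, not part of the original example. It uses population variances so that the identity holds exactly on the sample, then subtracts each group's own mean to mimic a per-group baseline.

```python
import random
from statistics import fmean, pvariance

def variance_decomposition(samples_by_group):
    """Return (within, between, total) population variances for grouped samples.

    samples_by_group maps a group label to a list of observations of X.
    """
    all_values = [x for xs in samples_by_group.values() for x in xs]
    n = len(all_values)
    grand_mean = fmean(all_values)
    within = 0.0
    between = 0.0
    for xs in samples_by_group.values():
        weight = len(xs) / n                               # empirical P(S = s)
        within += weight * pvariance(xs)                   # E[Var(X | S)]
        between += weight * (fmean(xs) - grand_mean) ** 2  # Var(E[X | S])
    return within, between, pvariance(all_values)

rng = random.Random(0)
# Hypothetical data: three groups with different means and unit noise.
group_means = {"a": -1.0, "b": 0.5, "c": 3.0}
data = {label: [rng.gauss(mu, 1.0) for _ in range(2000)]
        for label, mu in group_means.items()}

within, between, total = variance_decomposition(data)

# Per-group baseline: subtract each group's own sample mean.
centered = {}
for label, xs in data.items():
    m = fmean(xs)
    centered[label] = [x - m for x in xs]
within_c, between_c, total_c = variance_decomposition(centered)

print(f"raw:      within={within:.3f} between={between:.3f} total={total:.3f}")
print(f"centered: within={within_c:.3f} between={between_c:.3f} total={total_c:.3f}")
```

Under these assumptions the centered data keeps the same within-group term while the between-group term drops to zero up to floating-point error.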
Let $p_A = p(s_0) = 0.25$ (far) and $p_B = p(s_2) = 0.80$ (near). $R \mid s \sim \text{Bernoulli}(p_s)$.
$\operatorname{Var}(R \mid s) = p_s(1-p_s)$ is the Bernoulli variance. $\bar p = \tfrac{1}{2}(p_A+p_B)$ is the batch mean.
$\bar A = \tfrac{1}{2}(p_A + p_B) = \tfrac{1}{2}(0.25+0.80) = 0.525$
Total $\approx 0.249$ (within-state $0.174$ + between-state $0.076$, up to rounding)
The between-state term (0.076) inflates the variance because $s_0$ and $s_2$ have very different $p$.
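Spelling out the arithmetic for the two-state example may help. This assumes both states are weighted equally, as the $\tfrac{1}{2}$ factors in $\bar p$ suggest:

$\mathbb{E}_s\big[\operatorname{Var}(R \mid s)\big] = \tfrac{1}{2}\big(0.25 \cdot 0.75 + 0.80 \cdot 0.20\big) = \tfrac{1}{2}(0.1875 + 0.16) = 0.17375 \approx 0.174$
$\operatorname{Var}_s\big(\mathbb{E}[R \mid s]\big) = \tfrac{1}{2}\big((0.25 - 0.525)^2 + (0.80 - 0.525)^2\big) = (0.275)^2 = 0.075625 \approx 0.076$
$\operatorname{Var}(R) = \bar p\,(1 - \bar p) = 0.525 \cdot 0.475 = 0.249375 = 0.17375 + 0.075625$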
$\mathbb{E}[A \mid s] = p_s - V(s) \approx 0$ for every state
Total $\approx 0.174$: between-state term eliminated.
$V(s)$ centers $\mathbb{E}[A \mid s] \approx 0$ for all $s$, zeroing the between-state component.
$\operatorname{Var}(R \mid s) = p_s(1-p_s)$ is intrinsic — no baseline removes it.
$\big(\mathbb{E}[A \mid s] - \bar A\big)^2$ vanishes when $V(s)$ centers each state at zero.
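To see which baselines actually remove the between-state term, the following sketch compares three choices in closed form. It assumes $R \mid s \sim \text{Bernoulli}(p_s)$, uniformly sampled states, and a perfectly fitted critic $V(s) = p_s$. The function name `variance_terms` and the constant-baseline case are illustrative additions, not part of the original example.

```python
def variance_terms(p, baseline):
    """Within- and between-state variance of A = R - baseline[s].

    Assumes R | s ~ Bernoulli(p[s]) and that states are sampled uniformly.
    """
    n = len(p)
    mean_adv = [ps - b for ps, b in zip(p, baseline)]      # E[A | s]
    grand = sum(mean_adv) / n                              # A-bar
    within = sum(ps * (1.0 - ps) for ps in p) / n          # unchanged by any baseline
    between = sum((m - grand) ** 2 for m in mean_adv) / n  # spread of the state means
    return within, between

p = [0.25, 0.80]                 # p(s_0), p(s_2)
p_bar = sum(p) / len(p)

no_baseline = variance_terms(p, [0.0, 0.0])   # A = R
constant = variance_terms(p, [p_bar, p_bar])  # one scalar baseline for all states
state_value = variance_terms(p, p)            # V(s) = p_s (idealized critic)

for name, (within, between) in [("A = R", no_baseline),
                                ("A = R - p_bar", constant),
                                ("A = R - V(s)", state_value)]:
    print(f"{name:14s} within={within:.4f} between={between:.4f}")
```

In this setup a single scalar baseline shifts every state by the same amount, so the state means stay just as far apart; only the state-dependent $V(s)$ brings the between-state term to zero.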
The more diverse the states $\{p_s\}$, the larger the between-state term —
and the greater the variance reduction from $V(s)$.
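A rough way to quantify this is the share of total variance that sits in the between-state term, which is the share an ideal $V(s)$ could remove. The sketch assumes two equally likely states with Bernoulli rewards. Apart from $(0.25, 0.80)$, the probability pairs are hypothetical, and `between_share` is an illustrative helper.

```python
def between_share(p):
    """Fraction of Var(R) due to differences between state success rates.

    Assumes uniformly sampled states and R | s ~ Bernoulli(p[s]).
    """
    n = len(p)
    p_bar = sum(p) / n
    between = sum((ps - p_bar) ** 2 for ps in p) / n
    total = p_bar * (1.0 - p_bar)  # R is marginally Bernoulli(p_bar)
    return between / total

# Pairs ordered from identical states to very different ones.
pairs = [(0.50, 0.50), (0.40, 0.60), (0.25, 0.80), (0.05, 0.95)]
shares = [between_share(pair) for pair in pairs]

for pair, share in zip(pairs, shares):
    print(f"p = {pair}: removable share of variance = {share:.1%}")
```

For these pairs the removable share rises as the success rates move apart, and it is zero when the states are identical.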
Six trajectories per state. Dark: $R=1$; light: $R=0$. Letter = state (F = far, $s_0$; N = near, $s_2$).
With $A = R$, every dot sits at 0 or 1, so F and N are indistinguishable. High between-state variance.
$V(s_0)=0.25$
$V(s_2)=0.80$
$A(s,a) = R - V(s)$
$s_2$ + success: $A = 1 - 0.80 = +0.20$ (small, expected)
$s_0$ + success: $A = 1 - 0.25 = +0.75$ (large, surprising)
Same trajectories, shifted by $V(s)$
Four clusters, with each state's mean at zero. Between-state variance gone. Cleaner $\nabla_\theta J$.
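The shift can be reproduced with a small simulation. This is a sketch under stated assumptions: rewards are drawn as Bernoulli samples with the success rates above, the critic is taken to be exact ($V(s) = p_s$), and the sample size is far larger than the six trajectories per state in the figure so that the empirical variances are stable. The labels `F` and `N` and all variable names are illustrative.

```python
import random
from statistics import pvariance

rng = random.Random(0)
p = {"F": 0.25, "N": 0.80}  # F = far (s_0), N = near (s_2)
V = dict(p)                 # assumption: perfectly fitted critic, V(s) = p_s
n_per_state = 20000         # much larger than the 6 per state shown in the figure

raw, shifted = [], []
for state, p_s in p.items():
    for _ in range(n_per_state):
        r = 1.0 if rng.random() < p_s else 0.0  # R | s ~ Bernoulli(p_s)
        raw.append(r)                           # A = R
        shifted.append(r - V[state])            # A = R - V(s)

# Distinct advantage values after the shift (rounded to hide float noise).
clusters = sorted({round(a, 2) for a in shifted})

print("clusters:", clusters)
print(f"Var(R)        = {pvariance(raw):.3f}")
print(f"Var(R - V(s)) = {pvariance(shifted):.3f}")
```

The empirical variances should land close to the closed-form values of roughly $0.249$ and $0.174$, with the shifted advantages falling on the four values $-0.80$, $-0.25$, $+0.20$, and $+0.75$.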