An agent navigates from $s_0$ toward the Goal.

$R=1$ (success)  or  $R=0$ (failure)

Chain: $s_0 \to s_1 \to s_2 \to \text{Goal}$

$p(s_0) \approx 0.25 \qquad p(s_2) \approx 0.80$

States closer to Goal succeed more often.

Without baseline: same signal from every state

$s_0\ \text{(far)}:\; p = 0.25 \qquad s_2\ \text{(near)}:\; p = 0.80$

$A = R \quad\Longrightarrow\quad \text{Success}\to 1,\;\;\text{Failure}\to 0$

$\nabla_\theta J$ receives identical signal from $s_0$ and $s_2$. It cannot tell skill from luck.
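This per-sample indistinguishability is easy to see in a quick simulation (a sketch: the state labels and sample count are illustrative; the probabilities are from the example above):

```python
import random

random.seed(0)
p = {"s0 (far)": 0.25, "s2 (near)": 0.80}  # success probabilities from the example

# With A = R, every sampled weight on grad log pi is 0 or 1 in *every*
# state; only the frequency of 1s differs. A single sample carries no
# information about which state it came from.
distinct, means = {}, {}
for state, ps in p.items():
    ws = [1.0 if random.random() < ps else 0.0 for _ in range(10_000)]
    distinct[state] = sorted(set(ws))
    means[state] = sum(ws) / len(ws)
    print(state, "weights seen:", distinct[state], "mean:", round(means[state], 2))
```

Both states emit exactly the weights $\{0, 1\}$; only their long-run averages differ, which is precisely what a baseline exploits.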

Law of Total Variance (Eve's Law)

For any random variable $X$ and conditioning variable $S$:

$$\operatorname{Var}(X) \;=\; \underbrace{\mathbb{E}_S\!\big[\operatorname{Var}(X \mid S)\big]}_{\text{expected variance within groups}} \;+\; \underbrace{\operatorname{Var}_S\!\big(\mathbb{E}[X \mid S]\big)}_{\text{variance of group means}}$$

The first term: on average, how spread out is $X$ inside each group $S\!=\!s$?

Even if you know which state you're in, outcomes still fluctuate. This is irreducible noise.

The second term: how much do the group means themselves vary?

If different states have very different expected advantages, that alone inflates total variance.

A per-group baseline (like $V(s)$) can zero out the second term by centering each group at $0$.

It cannot touch the first term — that's intrinsic to the distribution of $X \mid S$.

Applied to our two states

Let $p_A = p(s_0) = 0.25$ (far) and $p_B = p(s_2) = 0.80$ (near).  $R \mid s \sim \text{Bernoulli}(p_s)$.

$$\operatorname{Var}_{\!\text{batch}} = \underbrace{\tfrac{1}{2}\!\big[p_A(1\!-\!p_A)+p_B(1\!-\!p_B)\big]}_{\text{within-state}} + \underbrace{\tfrac{1}{2}\!\big[(p_A-\bar p)^2+(p_B-\bar p)^2\big]}_{\text{between-state}}$$

$\operatorname{Var}(R \mid s) = p_s(1-p_s)$ is the Bernoulli variance.  $\bar p = \tfrac{1}{2}(p_A+p_B)$ is the batch mean.

Without baseline: $\;A_i = R_i$

$\bar A = \tfrac{1}{2}(p_A + p_B) = \tfrac{1}{2}(0.25+0.80) = 0.525$

$$\underbrace{\tfrac{1}{2}\big[0.25\!\times\!0.75 + 0.80\!\times\!0.20\big]}_{=\,0.174} + \underbrace{\tfrac{1}{2}\big[(0.25-0.525)^2 + (0.80-0.525)^2\big]}_{=\,0.076}$$

Total $\approx 0.250$

The between-state term (0.076) inflates the variance because $s_0$ and $s_2$ have very different $p$.
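These numbers can be reproduced exactly (a sketch; variable names mirror the text, and the displayed $0.250$ is the sum of the two rounded terms):

```python
# Exact decomposition for the two-state batch: states drawn uniformly,
# R | s ~ Bernoulli(p_s).
p_A, p_B = 0.25, 0.80
p_bar = (p_A + p_B) / 2                                   # 0.525

within  = (p_A * (1 - p_A) + p_B * (1 - p_B)) / 2         # 0.17375
between = ((p_A - p_bar) ** 2 + (p_B - p_bar) ** 2) / 2   # 0.075625
total   = within + between                                # 0.249375

# Cross-check: marginally, R is Bernoulli(p_bar).
assert abs(total - p_bar * (1 - p_bar)) < 1e-12
print(round(within, 3), round(between, 3), round(total, 3))
```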

With $V(s)$: $\;A_i = R_i - V(s_i)$

$\mathbb{E}[A \mid s] = p_s - V(s) \approx 0$ for every state

$$\underbrace{\tfrac{1}{2}\big[0.25\!\times\!0.75 + 0.80\!\times\!0.20\big]}_{=\,0.174\;\text{(unchanged)}} + \cancel{\tfrac{1}{2}\big[0^2 + 0^2\big]}$$

Total $\approx 0.174$: the between-state term is eliminated.

$V(s)$ centers $\mathbb{E}[A \mid s] \approx 0$ for all $s$, zeroing the between-state component.
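A Monte Carlo version of the same comparison (a sketch: the ideal baseline $V(s) = p_s$ is assumed known rather than learned):

```python
import random

random.seed(0)
p = {"s0": 0.25, "s2": 0.80}   # success probabilities from the example
V = dict(p)                    # ideal baseline: V(s) = E[R | s] = p_s
n = 100_000

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

adv_raw, adv_base = [], []
for _ in range(n):
    s = random.choice(list(p))                 # state drawn uniformly
    r = 1.0 if random.random() < p[s] else 0.0
    adv_raw.append(r)                          # A = R
    adv_base.append(r - V[s])                  # A = R - V(s)

print(round(var(adv_raw), 2))    # ≈ 0.25 (within + between)
print(round(var(adv_base), 2))   # ≈ 0.17 (within only)
```

The baseline shifts each state's samples so their conditional mean is zero; the within-state scatter is untouched.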

What $V(s)$ can and cannot do

$\operatorname{Var}(R \mid s) = p_s(1-p_s)$ is intrinsic — no baseline removes it.

$\big(\mathbb{E}[A \mid s] - \bar A\big)^2$ vanishes when $V(s)$ centers each state at zero.

The more diverse the state values $\{p_s\}$, the larger the between-state term, and the greater the variance reduction from $V(s)$.
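A small sweep makes this concrete (the $\delta$ values are illustrative; $\delta = 0.275$ recovers the example's $p_A = 0.25$, $p_B = 0.80$):

```python
# Spread the two states' success probabilities apart while keeping the
# batch mean p_bar fixed. The total p_bar*(1 - p_bar) stays constant,
# so a wider spread moves variance from "within" to "between",
# which is exactly the part V(s) can remove.
p_bar = 0.525
for delta in (0.0, 0.1, 0.2, 0.275):
    p_A, p_B = p_bar - delta, p_bar + delta
    within  = (p_A * (1 - p_A) + p_B * (1 - p_B)) / 2
    between = ((p_A - p_bar) ** 2 + (p_B - p_bar) ** 2) / 2   # = delta**2
    print(f"delta={delta:.3f}  within={within:.4f}  between={between:.4f}")
```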

$A = R$: batch dot plot

6 trajectories per state. Dark $=R\!=\!1$, light $=R\!=\!0$. Letter marks the state (F $= s_0$ far, N $= s_2$ near).

All dots sit at 0 or 1, so F and N are indistinguishable: high between-state variance.

$V(s)$: expected return from each state

$s_0\ \text{(far)}:\; V(s_0) = 0.25 \qquad s_2\ \text{(near)}:\; V(s_2) = 0.80$

$A(s,a) = R - V(s)$

$s_2$ + success: $1 - 0.80 = +0.20$ — small, expected.

$s_0$ + success: $1 - 0.25 = +0.75$ — large, surprising.
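All four possible advantage values in this example can be tabulated in a few lines (a sketch; the state keys are just labels):

```python
# A = R - V(s) for every (state, outcome) pair,
# with V(s0) = 0.25 and V(s2) = 0.80 from the text.
V = {"s0": 0.25, "s2": 0.80}
A = {(s, r): r - v for s, v in V.items() for r in (1.0, 0.0)}
for (s, r), a in sorted(A.items()):
    print(f"{s}, R={r:.0f}:  A = {a:+.2f}")
```

These four values are the four clusters in the dot plot below: $+0.75$, $-0.25$, $+0.20$, $-0.80$.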

$A = R - V(s)$: batch dot plot

Same trajectories, shifted by $V(s)$

4 clusters near zero. Between-state variance gone. Cleaner $\nabla_\theta J$.

Summary

Without baseline: $A = R$

With baseline: $A = R - V(s)$

$$\operatorname{Var}_{\!\text{batch}} = \underbrace{\text{within}}_{\text{unchanged}} + \underbrace{\text{between}}_{\to\,0\;\text{with }V(s)}$$