Substituting per-token decompositions into the REINFORCE gradient
Row $j$ = the score term $\nabla_\theta \log \pi_\theta(a_j \mid a_{<j})$, column $i$ = the KL term at step $i$. Every cell is their product.
Which cells actually contribute to the gradient?
$i < j$: the KL term at step $i$ is fixed by $a_{\leq i}$, before $a_j$ is sampled, and $\mathbb{E}_{a_j}\!\left[\nabla_\theta \log \pi_\theta(a_j \mid a_{<j})\right] = 0$, so the product vanishes in expectation.
Only the upper triangle ($i \geq j$) survives.
$\sum_{j=1}^H\sum_{i=1}^H \;\longrightarrow\; \sum_{j=1}^H\sum_{i=j}^H$
Each row $j$ sums KL from $i = j$ to $H$ only.
Token $j$ pays the KL cost for its own and all future deviations — not the past.
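The truncation can be checked numerically. A minimal sketch, assuming a hypothetical two-step tabular setup (categorical policies, arbitrary per-step KL costs; all names are illustrative): enumerating every trajectory exactly, the $j = 2$ row's gradient is identical whether or not the $i = 1$ KL term is included, because the past KL term multiplies a zero-mean score.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K = 3  # actions per step; horizon H = 2

# Hypothetical tabular policy: step-1 logits, step-2 logits conditioned on a1.
th1 = rng.normal(size=K)
th2 = rng.normal(size=(K, K))

# Arbitrary per-step KL costs: kl1 depends on a1 only, kl2 on (a1, a2).
kl1 = rng.normal(size=K)
kl2 = rng.normal(size=(K, K))

p1 = softmax(th1)
full = np.zeros((K, K))   # E[score_2 * (kl1 + kl2)]  -- full inner sum, i = 1..H
trunc = np.zeros((K, K))  # E[score_2 * kl2]          -- truncated sum, i >= j = 2

# Exact expectation over all trajectories (a1, a2).
for a1 in range(K):
    p2 = softmax(th2[a1])
    for a2 in range(K):
        w = p1[a1] * p2[a2]                # trajectory probability
        score2 = np.eye(K)[a2] - p2        # d log p2[a2] / d th2[a1]
        full[a1] += w * score2 * (kl1[a1] + kl2[a1, a2])
        trunc[a1] += w * score2 * kl2[a1, a2]

print(np.abs(full - trunc).max())  # ~0 (machine precision): dropping i < j changes nothing
```

The cross term is $\sum_{a_2} p_2(a_2)\,(e_{a_2} - p_2)\,\mathrm{kl}_1(a_1)$, and $\sum_{a_2} p_2(a_2)(e_{a_2} - p_2) = 0$, so the two gradients agree exactly. The $j = 1$ row, by contrast, keeps both KL terms, matching the "own and all future deviations" reading above.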