Substituting per-token decompositions into the REINFORCE gradient
Row $j$ = the score term $\nabla_\theta \log \pi_\theta(a_j \mid a_{<j})$, column $i$ = the KL term at step $i$. Every cell is their product.
Which cells actually contribute to the gradient?
$i < j$: the KL term at step $i$ is fixed by $a_{\leq i}$, before $a_j$ is sampled, and $\mathbb{E}_{a_j}\!\left[\nabla_\theta \log \pi_\theta(a_j \mid a_{<j})\right] = 0$, so the product vanishes in expectation.
Only the upper triangle ($i \geq j$) survives.
$\sum_{j=1}^H\sum_{i=1}^H \;\longrightarrow\; \sum_{j=1}^H\sum_{i=j}^H$
Each row $j$ sums KL from $i = j$ to $H$ only.
Token $j$ pays the KL cost for its own and all future deviations — not the past.
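The truncation can be checked numerically. A minimal sketch, assuming a hypothetical two-step tabular setup (categorical policies, arbitrary per-step KL costs; all names are illustrative): enumerating every trajectory exactly, the $j = 2$ row's gradient is identical whether or not the $i = 1$ KL term is included, because the past KL term multiplies a zero-mean score.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K = 3  # actions per step; horizon H = 2

# Hypothetical tabular policy: step-1 logits, step-2 logits conditioned on a1.
th1 = rng.normal(size=K)
th2 = rng.normal(size=(K, K))

# Arbitrary per-step KL costs: kl1 depends on a1 only, kl2 on (a1, a2).
kl1 = rng.normal(size=K)
kl2 = rng.normal(size=(K, K))

p1 = softmax(th1)
full = np.zeros((K, K))   # E[score_2 * (kl1 + kl2)]  -- full inner sum, i = 1..H
trunc = np.zeros((K, K))  # E[score_2 * kl2]          -- truncated sum, i >= j = 2

# Exact expectation over all trajectories (a1, a2).
for a1 in range(K):
    p2 = softmax(th2[a1])
    for a2 in range(K):
        w = p1[a1] * p2[a2]                # trajectory probability
        score2 = np.eye(K)[a2] - p2        # d log p2[a2] / d th2[a1]
        full[a1] += w * score2 * (kl1[a1] + kl2[a1, a2])
        trunc[a1] += w * score2 * kl2[a1, a2]

print(np.abs(full - trunc).max())  # ~0 (machine precision): dropping i < j changes nothing
```

The cross term is $\sum_{a_2} p_2(a_2)\,(e_{a_2} - p_2)\,\mathrm{kl}_1(a_1)$, and $\sum_{a_2} p_2(a_2)(e_{a_2} - p_2) = 0$, so the two gradients agree exactly. The $j = 1$ row, by contrast, keeps both KL terms, matching the "own and all future deviations" reading above.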