Token-Level KL-Regularized Policy Gradient and GRPO
Hao Bai, Tong Zhang
April 02, 2026
We consider the token-wise KL-regularized formulation for RL fine-tuning of language models. For simplicity we slightly modify our notation: starting from a prompt \(x_0\), we consider both LLM tokens \(a_j\) (which may include thinking tokens) followed by environment tokens \(x_j\), where each \(x_j\) could be either empty or encoded as multiple tokens.
This leads to a trajectory of horizon \(H\) (total number of LLM actions):
\[\tau = (x_0, a_1, x_1, a_2, x_2, \ldots, a_H, x_H),\]
where \(\tau_j\) denotes the history up to step \(j\), and \(\tau_{-j}\) denotes the future trajectory after step \(j\).
Token-Level KL-Regularized Policy Gradient
KL-Regularized Objective
The KL-regularized objective is:
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Bigl[\, r(\tau) - \frac{\beta}{H} \ln \frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)} \Bigr]\]
Here \(\pi_\theta\) is the current policy (the LLM being trained), \(\pi_{\mathrm{ref}}\) is the reference policy (typically the SFT checkpoint), \(r(\tau)\) is the trajectory-level reward, and \(\beta > 0\) is the KL regularization coefficient. The term \(\ln \frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)}\) is the log-likelihood ratio between the current and reference policies over the full trajectory, which equals the KL divergence when taken in expectation. Dividing by \(H\) normalizes the penalty per token. Intuitively, this objective asks the policy to maximize reward while paying a per-token cost for deviating from the reference — when \(\beta\) is large the policy stays close to \(\pi_{\mathrm{ref}}\), and when \(\beta \to 0\) it reduces to pure reward maximization.
Deriving the Token-Level Gradient
Step 1: REINFORCE on the full trajectory. Applying the standard policy gradient theorem (REINFORCE), we differentiate through \(\pi_\theta(\tau)\) and obtain:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Bigl[\Bigl(r(\tau) - \frac{\beta}{H} \ln \frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)}\Bigr)\, \nabla_\theta \ln \pi_\theta(\tau)\Bigr]\]
(The extra term \(-\frac{\beta}{H}\,\mathbb{E}[\nabla_\theta \ln \pi_\theta(\tau)]\) from differentiating the log-ratio itself vanishes by the score-function identity proved below.)
Step 2: Decompose \(\ln \pi_\theta(\tau)\) into per-token terms. Since the policy generates tokens autoregressively, the trajectory probability factorizes as \(\pi_\theta(\tau) = \prod_{j=1}^H \pi_\theta(a_j \vert \tau_{j-1})\). Taking the log:
\[\ln \pi_\theta(\tau) = \sum_{j=1}^{H} \ln \pi_\theta(a_j \vert \tau_{j-1})\]
Similarly, the KL term decomposes as \(\ln \frac{\pi_\theta(\tau)}{\pi_{\mathrm{ref}}(\tau)} = \sum_{i=1}^{H} \ln \frac{\pi_\theta(a_i \vert \tau_{i-1})}{\pi_{\mathrm{ref}}(a_i \vert \tau_{i-1})}\).
Step 3: Substitute into the gradient. Plugging both decompositions into Step 1:
\[\nabla_\theta J(\theta) = \mathbb{E}\Bigl[\Bigl(r(\tau) - \frac{\beta}{H}\sum_{i=1}^{H} K_i\Bigr)\sum_{j=1}^{H} \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\Bigr], \qquad K_i = \ln \frac{\pi_\theta(a_i \vert \tau_{i-1})}{\pi_{\mathrm{ref}}(a_i \vert \tau_{i-1})}\]
Step 4: Distribute into a double sum. The expression above is a product of two sums: the KL sum (indexed by \(i\)) and the gradient sum (indexed by \(j\)). Expanding this product, we get one term for every \((j, i)\) pair. Focusing on the KL portion:
\[-\frac{\beta}{H}\,\mathbb{E}\Bigl[\sum_{j=1}^{H} \sum_{i=1}^{H} K_i\, \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\Bigr]\]
What does this double sum mean? It is performing credit assignment for the KL penalty: for each gradient direction \(j\) (which direction to update the policy at token \(j\)), it asks how much KL penalty from every token \(i\) should weight that update. Naively, every token’s KL penalty influences every token’s gradient — an \(H \times H\) table of interactions.
But this is wasteful. The action at step \(j\) was chosen after steps \(1, \ldots, j-1\) already happened. Changing what we do at step \(j\) cannot retroactively alter the KL cost incurred at earlier steps. So the past KL terms (\(i < j\)) are irrelevant to the gradient at step \(j\) — they are sunk costs.
Step 5: Apply causality to reduce the sum. Formally, for \(i < j\), the KL term \(K_i = \ln \frac{\pi_\theta(a_i \vert \tau_{i-1})}{\pi_{\mathrm{ref}}(a_i \vert \tau_{i-1})}\) was determined before step \(j\): it depends only on \(\tau_{i-1}\) and \(a_i\), both of which are fixed by the time step \(j\) is reached. So we can condition on everything up to step \(j-1\) and take the inner expectation over \(a_j\). Since \(K_i\) is a constant given \(\tau_{j-1}\), it can be pulled out:
\[\mathbb{E}\bigl[K_i\, \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\bigr] = \mathbb{E}_{\tau_{j-1}}\Bigl[K_i \cdot \mathbb{E}_{a_j \sim \pi_\theta(\cdot \vert \tau_{j-1})}\bigl[\nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\bigr]\Bigr]\]
The remaining expectation is always zero by the REINFORCE identity (score function identity): \(\mathbb{E}_{a_j}[\nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})] = 0\). Therefore \(K_i \cdot 0 = 0\) for all \(i < j\) — past KL terms contribute nothing to the gradient. This is also why any constant baseline can be subtracted from the reward without introducing bias. Only the terms \(i \geq j\) survive.
Proof of the REINFORCE identity
For any normalized distribution \(\pi_\theta(a \vert s)\):
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \vert s)}\bigl[\nabla_\theta \ln \pi_\theta(a \vert s)\bigr] = \sum_a \pi_\theta(a \vert s) \cdot \frac{\nabla_\theta \pi_\theta(a \vert s)}{\pi_\theta(a \vert s)} = \nabla_\theta \sum_a \pi_\theta(a \vert s) = \nabla_\theta 1 = 0$$
\(\pi_\theta\) in the expectation cancels the denominator of \(\nabla_\theta \ln \pi_\theta = \frac{\nabla_\theta \pi_\theta}{\pi_\theta}\), leaving \(\nabla_\theta\) of a sum that equals 1 regardless of \(\theta\).
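The identity is also easy to verify numerically. The following sketch (not from the original post) assumes a simple softmax policy, for which the score has the closed form \(\nabla_{\theta_k} \ln \pi_a = \mathbf{1}\{a{=}k\} - \pi_k\):

```python
import numpy as np

# Numerical check of the score-function identity E_{a~pi}[grad ln pi(a)] = 0
# for a softmax policy over 5 actions. (Illustrative sketch, not from the post.)
rng = np.random.default_rng(0)
theta = rng.normal(size=5)                 # logits
pi = np.exp(theta) / np.exp(theta).sum()   # softmax policy

# For pi = softmax(theta): d ln pi_a / d theta_k = 1{a=k} - pi_k
score = np.eye(5) - pi[None, :]            # score[a, k] = grad_k ln pi(a)

# Expectation over a ~ pi: sum_a pi_a * score[a, :] is the zero vector.
expected_score = pi @ score
assert np.allclose(expected_score, 0.0)
```

The cancellation \(\sum_a \pi_a(\mathbf{1}\{a{=}k\} - \pi_k) = \pi_k - \pi_k = 0\) is exactly the \(\nabla_\theta \sum_a \pi_\theta(a \vert s) = 0\) step of the proof.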
This lets us drop the lower triangle (\(i < j\)) from the double sum, leaving only the upper triangle \(i \geq j\):
\[-\frac{\beta}{H}\,\mathbb{E}\Bigl[\sum_{j=1}^{H} \Bigl(\sum_{i=j}^{H} K_i\Bigr)\, \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\Bigr]\]
Check with \(H{=}2\): the 4-term grid \(K_i g_j\) loses one entry \(K_1 g_2\) (the past-KL term), leaving the upper triangle. \(\checkmark\)
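Written out for \(H = 2\), with \(g_j = \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\), the four-term grid is:

$$\sum_{j=1}^{2}\sum_{i=1}^{2} K_i\, g_j = \underbrace{K_1 g_1 + K_2 g_1}_{j=1:\ i \geq j} \;+\; \underbrace{K_2 g_2}_{j=2:\ i \geq j} \;+\; \underbrace{K_1 g_2}_{\text{past KL } (i<j):\ \mathbb{E}[\,\cdot\,] = 0}$$

so only \(K_1 g_1 + K_2 g_1 + K_2 g_2\) survives, the upper triangle \(i \geq j\).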
Putting everything back together for general \(H\):
\[\nabla_\theta J(\theta) = \mathbb{E}\Bigl[\sum_{j=1}^{H} \Bigl(r(\tau) - \frac{\beta}{H}\sum_{i=j}^{H} K_i\Bigr)\, \nabla_\theta \ln \pi_\theta(a_j \vert \tau_{j-1})\Bigr]\]
Token-Level Reward and Loss
This is a key result: each token position \(j\) sees the full reward \(r(\tau)\) but only the future KL penalty \(\sum_{i=j}^H\), not the past. We can therefore define the token-level reward \(r_j(\tau)\), which assigns to each position \(j\) the trajectory reward minus the KL penalty accumulated from step \(j\) onward:
\[r_j(\tau) = \mathrm{stopgrad}\Bigl(r(\tau) - \frac{\beta}{H}\sum_{i=j}^{H} K_i\Bigr)\]
where \(\mathrm{stopgrad}(\cdot)\) is the stop-gradient operator, treating its argument as a constant during backpropagation.
In practice, we normalize by \(\frac{1}{H}\) (averaging over tokens) and use the following weighted next-token prediction loss:
\[\mathcal{L}(\theta) = -\frac{1}{H}\sum_{j=1}^{H} r_j(\tau)\, \ln \pi_\theta(a_j \vert \tau_{j-1})\]
The \(\frac{1}{H}\) does not come from the derivation — it is a normalization constant that makes the loss magnitude independent of sequence length (equivalent to scaling the learning rate by \(\frac{1}{H}\)). The gradient direction is unchanged. In essence, we simply replace \(r(\tau)\) by the token-level \(r_j(\tau)\).
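The token-level reward and the weighted loss translate almost line-for-line into code. Below is a minimal numpy sketch; names like `token_level_rewards` are illustrative, and in a real framework `logp_cur` would be a gradient-carrying tensor with \(r_j\) detached (the stop-gradient):

```python
import numpy as np

# Minimal sketch of the token-level reward r_j and the weighted NLL loss.
# logp_cur / logp_ref are per-token log-probs under pi_theta / pi_ref; in a
# real implementation logp_cur carries gradients and r_j is detached (stopgrad).
def token_level_rewards(r_tau, logp_cur, logp_ref, beta):
    """r_j = r(tau) - (beta/H) * sum_{i=j}^{H} K_i, with K_i the per-token log-ratio."""
    H = len(logp_cur)
    K = logp_cur - logp_ref                  # K_i
    future_kl = np.cumsum(K[::-1])[::-1]     # sum_{i=j}^{H} K_i for every position j
    return r_tau - (beta / H) * future_kl

def weighted_nll_loss(r_j, logp_cur):
    """L = -(1/H) * sum_j r_j * ln pi_theta(a_j | tau_{j-1})."""
    return -np.mean(r_j * logp_cur)

logp_cur = np.log(np.array([0.5, 0.4, 0.9]))
logp_ref = np.log(np.array([0.5, 0.5, 0.8]))
r_j = token_level_rewards(1.0, logp_cur, logp_ref, beta=0.1)
loss = weighted_nll_loss(r_j, logp_cur)
```

The reverse cumulative sum is the whole trick: it gives every position its own future-only KL penalty in \(O(H)\) time.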
Plugging into GRPO
Recall the standard GRPO loss. For \(G\) completions sampled per prompt, it has two additive components, a clipped surrogate driven by a reward-only advantage and a separate KL penalty:
\[\mathcal{L}_{\text{GRPO}} = \frac{1}{G}\sum_{g=1}^{G} \frac{1}{H}\sum_{j=1}^{H} \Bigl[\min\bigl(\rho_j^g \hat{A}^g,\ \mathrm{clip}(\rho_j^g)\,\hat{A}^g\bigr) - \beta\, K_j^g\Bigr]\]
Now substitute our token-level reward \(r_j^g\) (which already contains the KL cost) in place of the task-only reward. For each prompt, sample \(G\) trajectories \(\tau^g \sim \pi_\theta\). At each token position \(j\), compute the group mean \(\mu_j = \frac{1}{G}\sum_g r_j^g\) and standard deviation \(\sigma_j\). (The formulation uses a uniform \(H\), implicitly assuming all trajectories in a group have the same length; in practice, trajectories are padded to the longest in the group, with loss and KL terms at padded positions masked to zero, and the statistics \(\mu_j, \sigma_j\) are computed only over trajectories that have a valid token at position \(j\).) The normalized advantage is:
\[\hat{A}_j^g = \frac{r_j^g - \mu_j}{\sigma_j}\]
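The masking convention above can be sketched in a few lines of numpy (function and variable names are my own, not from a specific framework):

```python
import numpy as np

# Sketch of per-position group statistics with variable-length trajectories:
# pad rewards to the longest trajectory in the group, then compute mu_j and
# sigma_j only over trajectories with a valid token at position j.
def group_normalize(r, mask, eps=1e-8):
    """r, mask: (G, H_max) arrays; returns masked advantages of shape (G, H_max)."""
    n = np.maximum(mask.sum(axis=0), 1.0)          # valid count at each position j
    mu = (r * mask).sum(axis=0) / n                # mu_j over valid trajectories only
    var = (((r - mu) ** 2) * mask).sum(axis=0) / n
    adv = (r - mu) / (np.sqrt(var) + eps)
    return adv * mask                              # zero out padded positions

r = np.array([[1.0, 2.0, 3.0],
              [3.0, 4.0, 0.0]])                    # 2nd trajectory has length 2
mask = np.array([[1.0, 1.0, 1.0],
                 [1.0, 1.0, 0.0]])
adv = group_normalize(r, mask)
```

Positions where only one trajectory is valid get zero variance; the `eps` in the denominator keeps the division finite there.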
Because \(r_j^g\) already carries the KL penalty, the separate \(-\beta K_j^g\) term in \(\mathcal{L}_{\text{GRPO}}\) is no longer needed; it would double-count the regularization. The loss simplifies to a pure clipped surrogate with no additive KL term:
\[\mathcal{L}_{\text{ours}} = \frac{1}{G}\sum_{g=1}^{G} \frac{1}{H}\sum_{j=1}^{H} \min\bigl(\rho_j^g \hat{A}_j^g,\ \mathrm{clip}(\rho_j^g)\,\hat{A}_j^g\bigr)\]
where \(\mathrm{clip}(\rho_j^g) = \mathrm{clamp}(\rho_j^g, 1-\epsilon, 1+\epsilon)\) and \(\rho_j\) is a proper off-policy correction term, either at the token level or the sequence level as defined in the next section.
To summarize: standard GRPO = reward-only advantage + separate KL penalty. We replace both with a single KL-aware advantage — the KL migrates from a standalone penalty into the advantage itself, and the \(-\beta K_j^g\) term disappears from the loss.
GAE Implementation
The token-level reward \(r_j(\tau)\) can be implemented directly in mainstream RL frameworks via Generalized Advantage Estimation. A note on notation: \(r_j(\tau)\) as defined above is a reward-to-go (cumulative return from position \(j\)), not a single-step reward. With \(\gamma = \lambda = 1\) and no critic (\(V = 0\)), GAE reduces to the sum of future per-step rewards. Applying the standard KL-in-reward trick — where the environment reward is zero except at the terminal token (the standard RLHF/PPO setup) and each token incurs a KL penalty:
\[r(s_k, a_k) = \underbrace{R^{\mathrm{env}}_k}_{\substack{0 \text{ if } k < H \\ r(\tau) \text{ if } k = H}} - \;\frac{\beta}{H}\, K_k\]
The GAE advantage at position \(j\) becomes exactly our token-level reward:
\[\hat{A}_j^{\mathrm{GAE}(1,1)} = \sum_{k=j}^{H} r(s_k, a_k) = r(\tau) - \frac{\beta}{H}\sum_{k=j}^{H} K_k = r_j(\tau)\]
The task reward \(r(\tau)\) appears exactly once (from the terminal step), so every position sees it with weight 1. Each KL term \(K_k\) contributes \(\frac{\beta}{H}\) to all positions \(j \leq k\). The asymmetry between the two weights (1 vs \(\frac{\beta}{H}\)) arises naturally from the placement: \(r(\tau)\) lives at one step while \(K_k\) lives at every step. The group normalization \(\hat{A}_j^g = (r_j^g - \mu_j)/\sigma_j\) then plays the role of a baseline, replacing the absent critic.
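As a concrete check (a small sketch with made-up numbers, not framework code), the reverse cumulative sum of these per-step rewards reproduces \(r_j(\tau)\) exactly:

```python
import numpy as np

# GAE(1,1) with no critic and the KL-in-reward trick: the per-step reward is
# -(beta/H) * K_k at every token, plus r(tau) at the terminal token; the
# reverse cumulative sum then equals the token-level reward r_j(tau).
def gae_token_rewards(r_tau, K, beta):
    H = len(K)
    step_rewards = -(beta / H) * K
    step_rewards[-1] += r_tau                     # env reward only at k = H
    return np.cumsum(step_rewards[::-1])[::-1]    # A_j = sum_{k=j}^{H} r(s_k, a_k)

K = np.array([0.2, -0.1, 0.3])                    # per-token log-ratios K_k
adv = gae_token_rewards(r_tau=1.0, K=K, beta=0.3)

# Direct formula r_j = r(tau) - (beta/H) * sum_{k=j}^{H} K_k agrees:
direct = 1.0 - (0.3 / 3) * np.cumsum(K[::-1])[::-1]
assert np.allclose(adv, direct)
```

This is why the trick drops into any framework that already supports GAE: only the per-step reward vector changes.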
Derivation: from GAE to undiscounted MC return
GAE with discount \(\gamma\) and trace parameter \(\lambda\) is:
$$\hat{A}^{\mathrm{GAE}}_t = \sum_{t'=t}^{T} (\gamma\lambda)^{t'-t} \delta_{t'}, \qquad \delta_{t'} = r_{t'} + \gamma \hat{V}(s_{t'+1}) - \hat{V}(s_{t'})$$
Setting \(\gamma = \lambda = 1\) gives \(\hat{A}_t = \sum_{t'=t}^{T} \delta_{t'}\). The value-function terms telescope: \(\sum_{t'=t}^{T}[\hat{V}(s_{t'+1}) - \hat{V}(s_{t'})] = \hat{V}(s_{T+1}) - \hat{V}(s_t) = -\hat{V}(s_t)\) (the terminal state has \(\hat{V} = 0\)), leaving:
$$\hat{A}_t^{\mathrm{GAE}(1,1)} = \sum_{t'=t}^{T} r_{t'} - \hat{V}(s_t)$$
GRPO is critic-free (\(\hat{V} = 0\)), so the advantage reduces to the undiscounted Monte Carlo return: \(\hat{A}_j = \sum_{k=j}^{H} r(s_k, a_k)\). Substituting the per-step reward and splitting terminal vs non-terminal:
$$\hat{A}_j = \underbrace{\sum_{k=j}^{H-1}\Bigl(-\frac{\beta}{H}K_k\Bigr)}_{\text{non-terminal: KL only}} + \;\underbrace{\Bigl(r(\tau) - \frac{\beta}{H}K_H\Bigr)}_{\text{terminal: task reward + KL}} = r_j(\tau) \quad \checkmark$$
Why \(\gamma = 1\) is necessary
With \(\gamma < 1\), the GAE return becomes:
$$\hat{A}_j = \gamma^{H-j}\,r(\tau) - \frac{\beta}{H}\sum_{k=j}^{H}\gamma^{k-j}K_k$$
The discount \(\gamma^{H-j}\) would make earlier tokens see a smaller task reward — but the derivation above shows \(r(\tau)\) must have weight 1 at every position. So \(\gamma = 1\) is not a free choice; it is uniquely determined by the weight structure of \(r_j(\tau)\).
Beyond this structural simplification, the advantage \(\hat{A}_j^g\) is now position-dependent, unlike standard GRPO's \(\hat{A}^g\) which is the same for every token. This is because \(r_j^g\) depends on the future KL sum \(\sum_{k=j}^H K_k^g\): earlier tokens bear more future KL cost than later tokens. Concretely:
\[r_j^g = r(\tau^g) - \frac{\beta}{H}\sum_{k=j}^{H} K_k^g, \qquad r_j^g - r_{j+1}^g = -\frac{\beta}{H}\, K_j^g\]
The group normalization \(\hat{A}_j^g = \frac{r_j^g - \mu_j}{\sigma_j}\) therefore computes a different baseline at each position. A trajectory that takes a high-KL detour in the middle will have its advantage reduced at early positions (which “bear” the future KL cost of that detour) but not at late positions (after the detour is over). In \(\mathcal{L}_{\text{GRPO}}\), by contrast, the total KL is a flat penalty \(-\beta K_j^g\) at every position — it cannot make this distinction.
The implications for clipping and IS ratios are discussed in the off-policy comparison.
Why does standard GRPO keep KL separate? This is a deliberate design choice, not a theoretical necessity. The original GRPO formulation (Shao et al., 2024) inherits the KL penalty from PPO's constrained optimization framework, where KL appears as a Lagrangian penalty term independent of the reward. As noted in DeepSeekMath (Shao et al., 2024): "instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence to the loss, avoiding complicating the calculation of \(\hat{A}\)." The practical reason: standard GRPO computes advantage via group normalization over task rewards:
\[\hat{A}^g = \frac{r(\tau^g) - \mu}{\sigma}, \qquad \mu = \frac{1}{G}\sum_{g=1}^{G} r(\tau^g),\quad \sigma = \mathrm{std}\bigl(\{r(\tau^g)\}_{g=1}^{G}\bigr)\]
If KL were folded into the reward before normalization, the group mean \(\mu\) and standard deviation \(\sigma\) would be contaminated by KL costs that vary across trajectories for reasons unrelated to task performance. A trajectory with high KL but identical task reward would shift the baseline, distorting the advantage signal for all other trajectories in the group. Keeping KL separate preserves clean reward statistics: \(\mu\) and \(\sigma\) reflect pure task performance, and the KL penalty is applied uniformly afterward.
Our formulation makes the opposite trade-off: we accept that KL enters the group statistics, because this is precisely what enables position-dependent credit assignment. The group mean \(\mu_j\) and standard deviation \(\sigma_j\) at each position \(j\) now reflect both task performance and KL cost — a trajectory that achieves the same reward but with lower KL deviation will receive higher advantage. This is a feature, not a bug: the advantage signal directly encodes the trade-off between reward and reference-adherence at each token position.
Off-Policy Correction
Importance Sampling Correction
We may also consider off-policy correction:
\[\mathcal{L}(\theta) = -\frac{1}{H}\sum_{j=1}^{H} \rho_j\, r_j(\tau)\, \ln \pi_\theta(a_j \vert \tau_{j-1})\]
where \(\rho_j\) is the importance weight that corrects for the distribution mismatch between the current policy \(\pi_\theta\) and the behavior policy \(\pi_{\mathrm{old}}\) (the policy that generated the trajectory). It can be defined at two levels:
Token-level correction:
\[\rho_j = \frac{\pi_\theta(a_j \vert \tau_{j-1})}{\pi_{\mathrm{old}}(a_j \vert \tau_{j-1})}\]
Sequence-level correction:
\[\rho_j = \Bigl(\prod_{k=j}^{H} \frac{\pi_\theta(a_k \vert \tau_{k-1})}{\pi_{\mathrm{old}}(a_k \vert \tau_{k-1})}\Bigr)^{\frac{1}{H-j+1}}\]
Where does this come from? Recall that the trajectories were sampled from \(\pi_{\mathrm{old}}\), but we want to evaluate the gradient under \(\pi_\theta\). The token-level correction \(\rho_j = \frac{\pi_\theta(a_j)}{\pi_{\mathrm{old}}(a_j)}\) only fixes the mismatch at one token. Why might we need more?
Consider what the loss term at position \(j\) looks like: \(\rho_j \cdot r_j(\tau) \cdot \ln \pi_\theta(a_j \vert \tau_{j-1})\). The reward \(r_j(\tau) = r(\tau) - \frac{\beta}{H}\sum_{i=j}^H K_i\) contains the KL log-ratios at all future steps \(i = j, j{+}1, \ldots, H\). These future \(K_i\) values depend on future actions \(a_{j+1}, \ldots, a_H\), which were sampled from \(\pi_{\mathrm{old}}\). Under \(\pi_\theta\), those future actions would have a different distribution, so \(r_j(\tau)\) would take different values on average. A single-token \(\rho_j\) does not correct for this — it only reweights the probability of \(a_j\) itself, not the distribution of the future actions that determine \(r_j\).
To properly account for this, we need the importance weight over the entire future trajectory from step \(j\) onward: \(r_j\) depends on every future action, while a token-level correction covers only \(a_j\).
The ideal weight is the full likelihood ratio over the future trajectory:
\[\rho_{j:H} = \prod_{k=j}^{H} \frac{\pi_\theta(a_k \vert \tau_{k-1})}{\pi_{\mathrm{old}}(a_k \vert \tau_{k-1})}\]
This is a product of \(H - j + 1\) per-token ratios. The problem is that this product can have exponentially high variance: if each ratio fluctuates by a factor of 2, the product fluctuates by \(2^{H-j+1}\). To tame this, we take the geometric mean (i.e., the \((H-j+1)\)-th root), which keeps the correction on the same scale as a single-token ratio while still capturing the overall distributional shift:
\[\rho_j = \Bigl(\prod_{k=j}^{H} \frac{\pi_\theta(a_k \vert \tau_{k-1})}{\pi_{\mathrm{old}}(a_k \vert \tau_{k-1})}\Bigr)^{\frac{1}{H-j+1}}\]
The variance explosion is easy to see numerically: sample \(H\) per-token ratios \(\rho_k = e^{X_k}\) with \(X_k \sim \mathcal{N}(0, \sigma^2)\), then compare the raw product \(\prod_k \rho_k\), the geometric mean \((\prod_k \rho_k)^{1/H}\), and a single-token ratio \(\rho_1\). As \(H\) or \(\sigma\) grows, the product's distribution spreads explosively while the geometric mean stays tightly concentrated.
This is a bias-variance trade-off: the token-level \(\rho_j\) ignores future mismatch (low variance, potentially biased), the full product captures it exactly (unbiased, high variance), and the geometric mean interpolates between them.
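A minimal Monte Carlo sketch of this comparison (parameters are illustrative):

```python
import numpy as np

# Monte Carlo sketch of the variance comparison: per-token ratios
# rho_k = exp(X_k) with X_k ~ N(0, sigma^2). Compare the raw product,
# its geometric mean, and a single-token ratio.
rng = np.random.default_rng(0)
H, sigma, trials = 50, 0.3, 10_000
X = rng.normal(0.0, sigma, size=(trials, H))

raw_product = np.exp(X.sum(axis=1))   # prod_k rho_k: spread grows like e^{O(H sigma^2)}
geo_mean = np.exp(X.mean(axis=1))     # (prod_k rho_k)^(1/H): concentrated near 1
single = np.exp(X[:, 0])              # rho_1: one-token baseline

assert raw_product.std() > single.std() > geo_mean.std()
```

In log space the product has standard deviation \(\sigma\sqrt{H}\), the single ratio \(\sigma\), and the geometric mean \(\sigma/\sqrt{H}\), which is exactly the ordering the assertion checks.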
Comparison with Standard GRPO: Clipping and IS Ratios
The off-policy correction interacts with KL differently in the two formulations. To see this, compare the gradient contributions at token \(j\):
Standard GRPO — gradient at token \(j\):
\[\underbrace{\nabla_\theta\, \min\bigl(\rho_j \hat{A}^g,\ \mathrm{clip}(\rho_j)\,\hat{A}^g\bigr)}_{\text{reward term}} \;-\; \underbrace{\beta\, \nabla_\theta K_j^g}_{\text{KL term}}\]
Ours — gradient at token \(j\):
\[\nabla_\theta\, \min\bigl(\rho_j \hat{A}_j^g,\ \mathrm{clip}(\rho_j)\,\hat{A}_j^g\bigr), \qquad \hat{A}_j^g = \frac{r_j^g - \mu_j}{\sigma_j}\]
In \(\nabla_\theta \mathcal{L}_{\text{GRPO}}\), the reward term and the KL term are decoupled — the advantage \(\hat{A}^g\) knows nothing about KL, and the KL penalty knows nothing about the reward. The clipping mechanism \(\min(\rho_j \hat{A}^g, \mathrm{clip}(\rho_j) \hat{A}^g)\) only considers the reward-based advantage when deciding whether to trust the update. A token that deviates heavily from the reference but contributes to a high-reward completion receives a large positive \(\hat{A}^g\) (encouraging more deviation) and a large \(\beta \nabla K_j^g\) (discouraging deviation). The two forces fight, but clipping only constrains the first.
In \(\nabla_\theta \mathcal{L}_{\text{ours}}\), there is no separate KL gradient — it has been absorbed into \(\hat{A}_j^g\) through \(r_j\). A token that deviates heavily from the reference has its advantage reduced before clipping decides whether to trust the update. Clipping and KL regularization work together rather than independently.
The IS ratio \(\rho_j\) also differs between the two formulations. Compare what each \(\rho_j\) needs to correct:
Standard GRPO — token-level IS ratio suffices:
\[\rho_j = \frac{\pi_\theta(a_j \vert \tau_{j-1})}{\pi_{\mathrm{old}}(a_j \vert \tau_{j-1})}\]
Ours — sequence-level correction via geometric mean:
\[\rho_j = \Bigl(\prod_{k=j}^{H} \frac{\pi_\theta(a_k \vert \tau_{k-1})}{\pi_{\mathrm{old}}(a_k \vert \tau_{k-1})}\Bigr)^{\frac{1}{H-j+1}}\]
Why the difference? In \(\mathcal{L}_{\text{GRPO}}\), the advantage \(\hat{A}^g\) depends only on the task reward \(r(\tau^g)\), which is a constant for the entire trajectory — no future-dependent quantity needs correction, so a single-token \(\rho_j\) suffices. In \(\mathcal{L}_{\text{ours}}\), the token-level reward \(r_j^g\) contains the future KL sum \(\sum_{k=j}^H K_k^g\), which depends on future actions \(a_{j+1}, \ldots, a_H\) sampled from \(\pi_{\mathrm{old}}\). As derived above, correcting for this future mismatch requires the full trajectory likelihood ratio from step \(j\) onward; the geometric mean is a variance-reduction compromise.
This is a genuine trade-off: \(\mathcal{L}_{\text{ours}}\) provides finer-grained credit assignment and more coherent clipping, but requires a more careful IS correction. The geometric mean introduces bias relative to the exact full-product correction, whereas \(\mathcal{L}_{\text{GRPO}}\)’s simpler IS treatment is exact for its (coarser) advantage definition.
Both formulations optimize the same objective and converge to the same optimum in the limit:
\[\pi^*(\tau) \;\propto\; \pi_{\mathrm{ref}}(\tau)\, \exp\Bigl(\frac{H}{\beta}\, r(\tau)\Bigr)\]
The practical difference is in convergence speed and stability: \(\mathcal{L}_{\text{ours}}\) provides a lower-variance advantage signal (position-dependent credit) and more coherent clipping (reward-minus-KL), at the cost of a more involved IS correction.
| | \(\mathcal{L}_{\text{GRPO}}\) (Standard) | \(\mathcal{L}_{\text{ours}}\) (Token-Level KL-GRPO) |
|---|---|---|
| KL placement | Separate \(-\beta K_j^g\) after clipping | Absorbed into \(r_j^g\) before advantage |
| Advantage | \(\hat{A}^g\): same for all tokens | \(\hat{A}_j^g\): varies by position |
| KL scope per token | \(K_j^g\) (this token only) | \(\sum_{k=j}^H K_k^g\) (future only) |
| Clipping sees KL? | No | Yes |
| IS ratio | Token-level \(\rho_j\) | Geometric mean over future |
| Credit assignment | Uniform | Position-dependent |
A General Family of Off-Policy Correction
Assume we generate trajectories using the behavior policy \(\pi_{\mathrm{old}}\), and that our current policy is \(\pi_\theta\). Let \(\rho_j\) be the sequence-level correction defined above, and consider the following family of algorithms:
\[\mathcal{L}_f = \frac{1}{G}\sum_{g=1}^{G} \frac{1}{H}\sum_{j=1}^{H}\, \min_{1 \leq k \leq m}\, f_k(\rho_j^g)\, \hat{A}_j^g\]
where each \(f_k(\rho)\) is non-decreasing in \(\rho\), and
\[f_k(\rho) \;\in\; \bigl[\min(\rho, 1),\ \max(\rho, 1)\bigr].\]
Why these constraints? The constraint \(f_k(\rho) \in [\min(\rho,1),\, \max(\rho,1)]\) ensures that \(f_k(\rho)\hat{A}\) always has the same sign as \(\rho\hat{A}\). In other words, the direction of the update is never reversed — if the full IS correction says “increase the probability of this action,” the modified version agrees, just with a different magnitude. Combined with monotonicity (\(f_k\) non-decreasing), this ensures the algorithm still moves in a direction of improvement.
The two extremes. The endpoints of this family correspond to two well-known algorithms:
- \(f_k(\rho) = 1\): Policy improvement; the IS weight is dropped entirely. The objective becomes \(\frac{1}{GH}\sum_{g,j} \hat{A}_j^g = \mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}_j]\), the expected advantage under \(\pi_{\mathrm{old}}\)'s trajectory distribution. Note a subtlety: the policy improvement theorem (with \(\pi = \pi_{\mathrm{old}},\, \pi' = \pi_\theta\)) requires \(\mathbb{E}_{\pi_\theta}[A^{\pi_{\mathrm{old}}}(s,a)] \geq 0\), i.e. the expectation under \(\pi_\theta\), not \(\pi_{\mathrm{old}}\). Our \(f = 1\) objective evaluates under \(\pi_{\mathrm{old}}\) instead, so the theorem does not directly apply. The gap is exactly the missing IS correction: \(\mathbb{E}_{\pi_\theta}[A^{\pi_{\mathrm{old}}}] = \mathbb{E}_{\pi_{\mathrm{old}}}[\rho \cdot A^{\pi_{\mathrm{old}}}]\), and setting \(f = 1\) replaces \(\rho\) with \(1\). When \(\pi_\theta \approx \pi_{\mathrm{old}}\) (small updates), \(\rho \approx 1\) and the approximation is tight, so \(f = 1\) still works in practice; this is the regime where trust-region methods (TRPO, PPO) operate.
-
\(f_k(\rho) = \rho\): Policy gradient — full importance sampling correction. The objective becomes \(\frac{1}{GH}\sum_{g,j} \frac{\pi_\theta(a_j^g)}{\pi_{\mathrm{old}}(a_j^g)} \hat{A}_j^g \approx \mathbb{E}_{\pi_\theta}[\hat{A}_j]\), converting the expectation from \(\pi_{\mathrm{old}}\) to \(\pi_\theta\) — unbiased in principle. But the variance can be catastrophic: for a sequence of length \(H\), the product \(\prod_{j=1}^H \rho_j\) fluctuates exponentially. If each per-token ratio has standard deviation \(\sigma\), the variance of the product scales as \(e^{O(H\sigma^2)}\), making learning unstable for long sequences.
-
\(f_k(\rho) = \mathrm{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\): PPO — a practical middle ground. The clipped objective uses \(m=2\) functions in the general formula:
为什么要这些约束? 约束 \(f_k(\rho) \in [\min(\rho,1),\, \max(\rho,1)]\) 确保 \(f_k(\rho)\hat{A}\) 与 \(\rho\hat{A}\) 符号始终相同。换言之,更新的方向永远不会反转——如果完整的 IS 校正说“增加这个动作的概率”,修改后的版本也同意,只是幅度不同。结合单调性(\(f_k\) 非递减),这确保了算法仍然沿着改进的方向移动。
两个极端与一个折中。 这个算法族的两个端点对应两个经典算法,clipping 则给出二者之间的实用折中:
-
\(f_k(\rho) = 1\):策略改进——完全丢弃 IS 权重。目标函数变为 \(\frac{1}{GH}\sum_{g,j} \hat{A}_j^g = \mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}_j]\),即在 \(\pi_{\mathrm{old}}\) 的轨迹分布下评估的期望 advantage。注意一个微妙之处:策略改进定理(令 \(\pi = \pi_{\mathrm{old}},\, \pi' = \pi_\theta\))要求 \(\mathbb{E}_{\pi_\theta}[A^{\pi_{\mathrm{old}}}(s,a)] \geq 0\)——期望在 \(\pi_\theta\) 下取,而非 \(\pi_{\mathrm{old}}\)。我们的 f=1 目标函数在 \(\pi_{\mathrm{old}}\) 下取期望,因此定理并不直接适用。差距恰好是缺失的 IS 校正:\(\mathbb{E}_{\pi_\theta}[A^{\pi_{\mathrm{old}}}] = \mathbb{E}_{\pi_{\mathrm{old}}}[\rho \cdot A^{\pi_{\mathrm{old}}}]\),而 f=1 将 \(\rho\) 替换为 \(1\)。当 \(\pi_\theta \approx \pi_{\mathrm{old}}\)(小更新)时,\(\rho \approx 1\),近似是紧的,所以 f=1 在实践中仍然有效——这正是信赖域方法(TRPO、PPO)运作的范围。
-
\(f_k(\rho) = \rho\):策略梯度——完整的重要性采样校正。目标函数变为 \(\frac{1}{GH}\sum_{g,j} \frac{\pi_\theta(a_j^g)}{\pi_{\mathrm{old}}(a_j^g)} \hat{A}_j^g \approx \mathbb{E}_{\pi_\theta}[\hat{A}_j]\),通过重要性采样将期望从 \(\pi_{\mathrm{old}}\) 转换为 \(\pi_\theta\)——原则上无偏。但方差可能是灾难性的:对于长度为 \(H\) 的序列,乘积 \(\prod_{j=1}^H \rho_j\) 指数级波动。如果每个 per-token ratio 的标准差为 \(\sigma\),乘积的方差量级为 \(e^{O(H\sigma^2)}\),使长序列的学习不稳定。
-
\(f_k(\rho) = \mathrm{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\):PPO——实用的折中方案。Clipped objective 在一般公式中使用 \(m=2\) 个函数:
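To make the family concrete, here is a small sketch (pure Python; function names are ours) of the three choices of \(f\) above, together with a check of the envelope constraint \(f(\rho) \in [\min(\rho,1),\, \max(\rho,1)]\):

```python
def f_improvement(rho):
    """f(rho) = 1: drop the IS weight entirely (policy improvement)."""
    return 1.0

def f_policy_gradient(rho):
    """f(rho) = rho: full importance sampling correction (policy gradient)."""
    return rho

def make_f_clip(eps=0.2):
    """f(rho) = clip(rho, 1-eps, 1+eps): PPO-style middle ground."""
    def f(rho):
        return min(max(rho, 1.0 - eps), 1.0 + eps)
    return f

def satisfies_envelope(f, rhos):
    """Check f(rho) in [min(rho,1), max(rho,1)], so f(rho)*A keeps the sign of rho*A."""
    return all(min(r, 1.0) <= f(r) <= max(r, 1.0) for r in rhos)

# All three members stay inside the envelope on a grid of ratios.
rhos = [0.01, 0.5, 0.9, 1.0, 1.1, 2.0, 10.0]
for f in (f_improvement, f_policy_gradient, make_f_clip(0.2)):
    assert satisfies_envelope(f, rhos)
```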
PPO’s clipped objective combines two of these via \(\min\):
PPO 的 clipped objective 通过 \(\min\) 组合了其中两个:
where \(f_1(\rho) = \rho\) and \(f_2(\rho) = \mathrm{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\). The \(\min\) acts as a pessimistic lower bound: when \(\hat{A} > 0\), it prevents \(\rho > 1{+}\epsilon\) from inflating the credit; when \(\hat{A} < 0\), it prevents \(\rho < 1{-}\epsilon\) from under-penalizing.
The bias-variance tradeoff. The choice of \(f_k\) controls a fundamental tradeoff:
| Choice | Bias | Variance | When to prefer |
|---|---|---|---|
| \(f(\rho) \approx 1\) | High (old distribution) | Low | Early training, large policy shifts |
| \(f(\rho) \approx \rho\) | Low (current distribution) | High | Late training, small policy shifts |
The optimal \(f_k\) likely depends on how far \(\pi_\theta\) has drifted from \(\pi_{\mathrm{old}}\). When the policies are close (\(\rho \approx 1\)), all choices of \(f\) are similar. When they diverge, the choice matters significantly — and an adaptive \(f\) that adjusts based on the observed \(\rho\) distribution could outperform any fixed choice.
Open question. Can we design an \(f_k\) that adapts during training — perhaps starting near \(f=\rho\) (unbiased, appropriate while \(\rho\) stays close to 1) and gradually moving toward \(f=1\) (safe, as \(\rho\) drifts further from 1)? This would be an adaptive off-policy correction that automatically navigates the bias-variance tradeoff.
其中 \(f_1(\rho) = \rho\),\(f_2(\rho) = \mathrm{clip}(\rho, 1{-}\epsilon, 1{+}\epsilon)\)。\(\min\) 起到悲观下界的作用:当 \(\hat{A} > 0\) 时,防止 \(\rho > 1{+}\epsilon\) 膨胀 credit;当 \(\hat{A} < 0\) 时,防止 \(\rho < 1{-}\epsilon\) 弱化惩罚。
偏差-方差权衡。 \(f_k\) 的选择控制着一个根本性的权衡:
| 选择 | 偏差 | 方差 | 适用场景 |
|---|---|---|---|
| \(f(\rho) \approx 1\) | 高(旧分布) | 低 | 训练早期,策略变化大 |
| \(f(\rho) \approx \rho\) | 低(当前分布) | 高 | 训练后期,策略变化小 |
最优的 \(f_k\) 可能取决于 \(\pi_\theta\) 偏离 \(\pi_{\mathrm{old}}\) 的程度。当策略接近时(\(\rho \approx 1\)),所有 \(f\) 的选择都类似。当策略分歧时,选择的影响显著——一个根据观测到的 \(\rho\) 分布自适应调整的 \(f\) 可能优于任何固定选择。
开放问题。 我们能否设计一个在训练过程中自适应的 \(f_k\)——在 \(\rho\) 接近 1 时从 \(f \approx \rho\)(无偏)开始,随着 \(\rho\) 偏离 1 逐渐趋向 \(f \approx 1\)(安全)?这将是一种自适应离策略校正,自动在偏差-方差权衡中导航。
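The exponential-variance claim for the full product \(\prod_{j=1}^H \rho_j\) (the \(f=\rho\) extreme above) can be checked numerically. A toy sketch under our own modeling assumption: i.i.d. lognormal per-token ratios with \(\mathbb{E}[\rho_j] = 1\) exactly.

```python
import math
import random
import statistics

def product_ratio(H, sigma, rng):
    """Product of H i.i.d. per-token ratios rho_j = exp(N(-sigma^2/2, sigma^2)),
    parameterized so that E[rho_j] = 1 exactly (lognormal toy model)."""
    return math.exp(sum(rng.gauss(-sigma ** 2 / 2, sigma) for _ in range(H)))

rng = random.Random(0)
sigma = 0.3
short = [product_ratio(10, sigma, rng) for _ in range(2000)]
long_ = [product_ratio(200, sigma, rng) for _ in range(2000)]

# Each product has mean 1, but Var = exp(H * sigma^2) - 1 under this model,
# so the long-horizon products are wildly more dispersed.
assert statistics.pvariance(long_) > statistics.pvariance(short)
```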
Some Experiments to Try for Off-Policy Correction
离策略校正的一些实验方向
We now consider the off-policy correction problem from a more experimental perspective. Rather than focusing on the token-level KL formulation above, we return to a simpler setting — the standard GRPO loss without KL — to isolate the effect of different off-policy correction strategies.
We start with the following loss formula for a generalization of GRPO with group size \(G\):
我们现在从更偏实验的角度考虑离策略校正问题。我们不再关注上面的 token-level KL 公式,而是回到更简单的设定——不含 KL 的标准 GRPO 损失——以隔离不同离策略校正策略的效果。
我们从以下推广的 GRPO 损失公式出发,组大小为 \(G\):
where \(\mathrm{sg}(\cdot)\) denotes stop-gradient (the weight \(w_{g,i}\) is treated as a constant during backpropagation), \(H_g\) is the number of tokens in trajectory \(g\), and \(\hat{A}_g\) is the GRPO group advantage:
其中 \(\mathrm{sg}(\cdot)\) 表示 stop-gradient(权重 \(w_{g,i}\) 在反向传播中被视为常数),\(H_g\) 是轨迹 \(g\) 的 token 数量,\(\hat{A}_g\) 是 GRPO 的组 advantage:
This is a REINFORCE-style loss where the gradient flows through \(\ln \pi_\theta\) only, while \(w_{g,i}\) acts as a stop-gradiented weight controlling how strongly each token contributes to the update. The gradient of this loss with respect to \(\theta\) is:
这是一个 REINFORCE 风格的损失,梯度仅通过 \(\ln \pi_\theta\) 传递,而 \(w_{g,i}\) 作为 stop-gradient 权重控制每个 token 对更新的贡献强度。该损失关于 \(\theta\) 的梯度为:
The choice of \(w_{g,i}\) determines the off-policy correction strategy. Different choices lead to different algorithms, all sharing the same REINFORCE structure but differing in how they handle the distribution mismatch between \(\pi_\theta\) (current policy) and \(\pi_{\mathrm{old}}\) (behavior policy that generated the trajectories).
\(w_{g,i}\) 的选择决定了离策略校正策略。不同的选择导致不同的算法,它们共享相同的 REINFORCE 结构,但在处理 \(\pi_\theta\)(当前策略)与 \(\pi_{\mathrm{old}}\)(生成轨迹的行为策略)之间的分布偏差时有所不同。
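A minimal sketch of this loss in pure Python (the per-trajectory \(1/H_g\) normalization and all function names are our assumptions; in a real autograd framework the weights would be detached so that only the log-probabilities carry gradient):

```python
import statistics

def group_advantages(rewards):
    """GRPO group advantage: (r_g - mean) / std over the G rewards of one group."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + 1e-8) for r in rewards]

def grpo_loss(logps, weights, rewards):
    """logps[g][i] = ln pi_theta(a_i) in trajectory g; weights[g][i] = sg(w_{g,i}).

    The weights are plain floats here, mirroring the stop-gradient: they scale
    each token's contribution but would receive no gradient themselves.
    """
    adv = group_advantages(rewards)
    total = 0.0
    for g, (lp, w) in enumerate(zip(logps, weights)):
        H_g = len(lp)
        total += sum(w_i * adv[g] * lp_i for w_i, lp_i in zip(w, lp)) / H_g
    return -total / len(logps)  # minimize the negative of the objective
```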
PPO-Style Clipping as a Choice of \(w_{g,i}\)
PPO 风格 Clipping 作为 \(w_{g,i}\) 的一种选择
For standard GRPO (which uses PPO-style clipping), the weight is:
对于标准 GRPO(使用 PPO 风格 clipping),权重为:
Why does this reproduce PPO? The key is that this formula encodes the gradient behavior of the standard PPO clipped objective
为什么这等价于 PPO? 关键在于这个公式编码了标准 PPO clipped objective 的梯度行为:
into a REINFORCE form. When we differentiate \(L^{\mathrm{PPO}}\) with respect to \(\theta\), using \(\nabla_\theta \rho = \rho\, \nabla_\theta \ln \pi_\theta\):
- When the \(\min\) selects \(\rho\,\hat{A}\) (gradient passes through): \(\nabla_\theta L^{\mathrm{PPO}} = \hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\).
- When the \(\min\) selects \(\mathrm{clip}(\rho)\,\hat{A}\) (clipped, constant w.r.t. \(\theta\)): \(\nabla_\theta L^{\mathrm{PPO}} = 0\).
So the PPO gradient is \(w \cdot \hat{A} \cdot \nabla_\theta \ln \pi_\theta\) where \(w = \rho\) when the gradient passes through, and \(w = 0\) when it is clipped. The indicator function \(\mathbb{I}(\cdot)\) in our formula is precisely this on/off switch. Although the two formulations have different values and different computation graphs, they produce identical gradients with respect to \(\theta\), and therefore identical optimization trajectories. Let us verify by case analysis:
转化为 REINFORCE 形式。当我们对 \(L^{\mathrm{PPO}}\) 关于 \(\theta\) 求导,利用 \(\nabla_\theta \rho = \rho\, \nabla_\theta \ln \pi_\theta\):
- 当 \(\min\) 选择 \(\rho\,\hat{A}\)(梯度通过):\(\nabla_\theta L^{\mathrm{PPO}} = \hat{A}\, \rho\, \nabla_\theta \ln \pi_\theta\)。
- 当 \(\min\) 选择 \(\mathrm{clip}(\rho)\,\hat{A}\)(被截断,关于 \(\theta\) 为常数):\(\nabla_\theta L^{\mathrm{PPO}} = 0\)。
因此 PPO 的梯度为 \(w \cdot \hat{A} \cdot \nabla_\theta \ln \pi_\theta\),其中 \(w = \rho\)(梯度通过时)或 \(w = 0\)(被截断时)。我们公式中的指示函数 \(\mathbb{I}(\cdot)\) 恰好就是这个开关。虽然两种形式的值和计算图不同,但它们关于 \(\theta\) 产生完全一样的梯度,因此优化轨迹完全相同。逐情况验证:
| Case | \(\rho - \mathrm{clip}(\rho)\) | Condition \((\cdot)\hat{A} \leq 0\)? | \(w\) | PPO behavior |
|---|---|---|---|---|
| \(\rho \in [1{-}\epsilon,\, 1{+}\epsilon]\) | \(0\) | \(0 \cdot \hat{A} = 0 \leq 0\): yes | \(\rho\) | Gradient passes through ✓ |
| \(\rho > 1{+}\epsilon,\; \hat{A} > 0\) | \(> 0\) | \((+)(+) > 0\): no | \(0\) | Clipped: prevents inflating credit ✓ |
| \(\rho > 1{+}\epsilon,\; \hat{A} < 0\) | \(> 0\) | \((+)(-) < 0\): yes | \(\rho\) | Not clipped: allows penalizing bad actions ✓ |
| \(\rho < 1{-}\epsilon,\; \hat{A} > 0\) | \(< 0\) | \((-)(+) < 0\): yes | \(\rho\) | Not clipped: allows rewarding good actions ✓ |
| \(\rho < 1{-}\epsilon,\; \hat{A} < 0\) | \(< 0\) | \((-)(-) > 0\): no | \(0\) | Clipped: prevents over-penalizing ✓ |
| 情况 | \(\rho - \mathrm{clip}(\rho)\) | 条件 \((\cdot)\hat{A} \leq 0\)? | \(w\) | PPO 行为 |
|---|---|---|---|---|
| \(\rho \in [1{-}\epsilon,\, 1{+}\epsilon]\) | \(0\) | \(0 \cdot \hat{A} = 0 \leq 0\):是 | \(\rho\) | 梯度通过 ✓ |
| \(\rho > 1{+}\epsilon,\; \hat{A} > 0\) | \(> 0\) | \((+)(+) > 0\):否 | \(0\) | 截断:防止膨胀 credit ✓ |
| \(\rho > 1{+}\epsilon,\; \hat{A} < 0\) | \(> 0\) | \((+)(-) < 0\):是 | \(\rho\) | 未截断:允许惩罚坏动作 ✓ |
| \(\rho < 1{-}\epsilon,\; \hat{A} > 0\) | \(< 0\) | \((-)(+) < 0\):是 | \(\rho\) | 未截断:允许奖励好动作 ✓ |
| \(\rho < 1{-}\epsilon,\; \hat{A} < 0\) | \(< 0\) | \((-)(-) > 0\):否 | \(0\) | 截断:防止过度惩罚 ✓ |
In summary, \(w_{g,i} = 0\) (gradient clipped) precisely when:
- \(\hat{A}_g > 0\) and \(\rho_{g,i} > 1 + \epsilon\): the current policy already assigns much more probability to this action than the old policy did, and the advantage is positive. PPO prevents further increasing — the policy has already moved enough in the right direction.
- \(\hat{A}_g < 0\) and \(\rho_{g,i} < 1 - \epsilon\): the current policy already assigns much less probability to this action, and the advantage is negative. PPO prevents further decreasing — the policy has already moved enough away from this bad action.
In all other cases, \(w_{g,i} = \rho_{g,i}\), and the gradient passes through with full importance sampling correction.
总结:\(w_{g,i} = 0\)(梯度被截断)恰好在以下情况:
- \(\hat{A}_g > 0\) 且 \(\rho_{g,i} > 1 + \epsilon\):当前策略已经比旧策略赋予该动作高得多的概率,且 advantage 为正。PPO 阻止继续增加——策略已经在正确方向上移动了足够多。
- \(\hat{A}_g < 0\) 且 \(\rho_{g,i} < 1 - \epsilon\):当前策略已经比旧策略赋予该动作低得多的概率,且 advantage 为负。PPO 阻止继续减少——策略已经远离了这个坏动作足够多。
在所有其他情况下,\(w_{g,i} = \rho_{g,i}\),梯度以完整的重要性采样校正通过。
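The case analysis above can be verified mechanically. A sketch (names ours) comparing the indicator-based weight with the gradient coefficient of the PPO clipped objective over a grid of \((\rho, \hat{A})\):

```python
def clip(rho, eps):
    return min(max(rho, 1 - eps), 1 + eps)

def w_indicator(rho, A, eps):
    """The stop-gradient weight: w = rho * 1[(rho - clip(rho)) * A <= 0]."""
    return rho if (rho - clip(rho, eps)) * A <= 0 else 0.0

def ppo_grad_coeff(rho, A, eps):
    """Gradient coefficient of min(rho*A, clip(rho)*A): rho when the
    pass-through branch is selected (ties included), 0 when the clip is active."""
    return rho if rho * A <= clip(rho, eps) * A else 0.0

# The two formulations agree in every case of the table above.
eps = 0.2
for rho in (0.5, 0.85, 1.0, 1.15, 1.5, 3.0):
    for A in (-1.0, 1.0):
        assert w_indicator(rho, A, eps) == ppo_grad_coeff(rho, A, eps)
```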
The Variance Problem with PPO Clipping
PPO Clipping 的方差问题
While PPO clipping zeroes out the gradient in two “already moved enough” cases, there is a subtle asymmetry that can cause high variance. Consider the case \(\hat{A}_g < 0\) and \(\rho_{g,i} \gg 1\):
- The advantage is negative (bad trajectory), so we want to decrease the probability of these actions.
- But \(\rho_{g,i} \gg 1\) means the current policy assigns much more probability to action \(a_i\) than the old policy did.
- PPO does not clip this case — the indicator gives \(w_{g,i} = \rho_{g,i}\), and the gradient contribution is \(\rho_{g,i}\, \hat{A}_g\, \nabla_\theta \ln \pi_\theta\).
- Since \(\rho_{g,i}\) can be arbitrarily large, this gradient term has unbounded magnitude, leading to high variance.
Why does large \(\rho\) cause high variance? The gradient estimator is an average over samples: \(\hat{g} = \frac{1}{N}\sum_i \rho_i\, \hat{A}_i\, \nabla_\theta \ln \pi_\theta\). Its variance is governed by
The first term \(\mathbb{E}[\rho^2 \hat{A}^2]\) grows unboundedly as \(\pi_\theta\) diverges from \(\pi_{\mathrm{old}}\), while the second term \((\mathbb{E}[\rho \hat{A}])^2\) stays bounded. The reason is that \(\mathbb{E}_{\pi_{\mathrm{old}}}[\rho] = 1\) always holds:
虽然 PPO clipping 在两种“已经移动足够多”的情况下将梯度置零,但存在一个微妙的不对称性,可能导致高方差。考虑 \(\hat{A}_g < 0\) 且 \(\rho_{g,i} \gg 1\) 的情况:
- Advantage 为负(坏轨迹),所以我们想减少这些动作的概率。
- 但 \(\rho_{g,i} \gg 1\) 意味着当前策略赋予动作 \(a_i\) 的概率比旧策略高得多。
- PPO 不会截断这种情况——指示函数给出 \(w_{g,i} = \rho_{g,i}\),梯度贡献为 \(\rho_{g,i}\, \hat{A}_g\, \nabla_\theta \ln \pi_\theta\)。
- 由于 \(\rho_{g,i}\) 可以任意大,这个梯度项的幅度无界,导致高方差。
为什么大的 \(\rho\) 会导致高方差? 梯度估计量是对样本的平均:\(\hat{g} = \frac{1}{N}\sum_i \rho_i\, \hat{A}_i\, \nabla_\theta \ln \pi_\theta\)。其方差由上式控制。第一项 \(\mathbb{E}[\rho^2 \hat{A}^2]\) 随 \(\pi_\theta\) 偏离 \(\pi_{\mathrm{old}}\) 无界增长,而第二项 \((\mathbb{E}[\rho \hat{A}])^2\) 保持有界。原因是 \(\mathbb{E}_{\pi_{\mathrm{old}}}[\rho] = 1\) 恒成立:
So \((\mathbb{E}[\rho\hat{A}])^2\) does not grow with the spread of \(\rho\). In contrast, \(\mathbb{E}[\rho^2]\) grows monotonically as \(\pi_\theta\) diverges from \(\pi_{\mathrm{old}}\) — a single sample with \(\rho = 100\) contributes \(10000\) to \(\mathbb{E}[\rho^2]\), but its contribution to \(\mathbb{E}[\rho]\) is averaged away by the many samples with \(\rho \approx 1\). The result is that a few high-\(\rho\) samples dominate the gradient, and different mini-batches produce wildly different gradient estimates.
因此 \((\mathbb{E}[\rho\hat{A}])^2\) 不随 \(\rho\) 的分散程度增长。相反,\(\mathbb{E}[\rho^2]\) 随 \(\pi_\theta\) 偏离 \(\pi_{\mathrm{old}}\) 单调增大——一个 \(\rho = 100\) 的样本对 \(\mathbb{E}[\rho^2]\) 贡献 \(10000\),但它对 \(\mathbb{E}[\rho]\) 的贡献被众多 \(\rho \approx 1\) 的样本平均掉了。结果是少数高 \(\rho\) 样本主导了梯度,不同 mini-batch 产生差异巨大的梯度估计。
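A quick numeric illustration (a sketch using a lognormal toy model of our choosing) of this point: \(\mathbb{E}[\rho] = 1\) keeps the mean bounded regardless of the policy gap, while the IS-weighted per-sample contributions become far more dispersed.

```python
import math
import random
import statistics

rng = random.Random(1)
sigma = 1.0  # log-std of the per-sample ratio; larger sigma = larger policy gap
# rho ~ LogNormal(-sigma^2/2, sigma^2)  =>  E[rho] = 1, E[rho^2] = exp(sigma^2)
rhos = [math.exp(rng.gauss(-sigma ** 2 / 2, sigma)) for _ in range(20000)]
A = [rng.choice([-1.0, 1.0]) for _ in rhos]  # toy advantages with Var(A) = 1

mean_rho = statistics.mean(rhos)
var_plain = statistics.pvariance(A)                               # w = 1
var_is = statistics.pvariance([r * a for r, a in zip(rhos, A)])   # w = rho

assert 0.9 < mean_rho < 1.1       # E[rho] = 1 no matter how wide the spread
assert var_is > 2 * var_plain     # IS weighting inflates per-sample variance
```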
Why does PPO leave this unclipped? The PPO philosophy is pessimistic: it clips updates that would overestimate improvement. When \(\hat{A} < 0\) and \(\rho \gg 1\), the large \(\rho\) amplifies the penalty — this is conservative (pushes harder against a bad action the policy has drifted toward), so PPO sees no reason to clip it. Similarly, when \(\hat{A} > 0\) and \(\rho \ll 1\), the small \(\rho\) dampens the reward signal, which is also conservative.
But “conservative in expectation” does not mean “low variance.” The unclipped \(\rho\) in these cases can be very large, injecting noise into the gradient. A natural fix is to also clip \(\rho\) from above, bounding the weight even in the unclipped cases:
为什么 PPO 不截断这种情况? PPO 的哲学是悲观的:它截断那些会高估改进的更新。当 \(\hat{A} < 0\) 且 \(\rho \gg 1\) 时,大的 \(\rho\) 放大了惩罚——这是保守的(对策略已经偏向的坏动作施加更大惩罚),所以 PPO 认为没有理由截断。类似地,当 \(\hat{A} > 0\) 且 \(\rho \ll 1\) 时,小的 \(\rho\) 抑制了奖励信号,这也是保守的。
但“期望意义上保守”不等于“低方差”。在这些未截断的情况下,\(\rho\) 可以非常大,向梯度中注入噪声。一个自然的修复方案是同时从上方截断 \(\rho\),即使在未截断的情况下也限制权重的大小:
This caps the IS weight at \(1 + \epsilon\) regardless of the sign of \(\hat{A}\), preventing any single token from dominating the gradient. Note that both PPO clipping and this upper clipping are biased relative to the full IS correction \(w = \rho\) — only \(w = \rho\) is unbiased. PPO introduces bias by zeroing out the gradient in certain cases (\(w = 0\) when clipped); upper clipping introduces bias by capping the weight (\(w = 1 + \epsilon\) when \(\rho > 1 + \epsilon\)). The difference is in where the bias is introduced: PPO’s bias is asymmetric (only clips when the update would be too optimistic, leaving the pessimistic direction unclipped — which is precisely the high-variance case), while upper clipping is symmetric (caps \(\rho\) in all cases, trading more bias for uniformly lower variance).
这将 IS 权重上限设为 \(1 + \epsilon\),无论 \(\hat{A}\) 的符号如何,防止任何单个 token 主导梯度。注意 PPO clipping 和这种上方截断相对于完整 IS 校正 \(w = \rho\) 都是有偏的——只有 \(w = \rho\) 是无偏的。PPO 通过在某些情况下将梯度置零(被截断时 \(w = 0\))引入偏差;上方截断通过限制权重上界(\(\rho > 1 + \epsilon\) 时 \(w = 1 + \epsilon\))引入偏差。区别在于偏差引入的位置:PPO 的偏差是不对称的(只在更新过于乐观时截断,而悲观方向不截断——这恰好就是高方差的情况),而上方截断是对称的(在所有情况下限制 \(\rho\),以更多偏差换取一致更低的方差)。
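One possible reading of this fix as code (a sketch; the function names and \(\epsilon\) value are ours): apply the usual PPO indicator, then cap the surviving weight at \(1 + \epsilon\) regardless of the sign of \(\hat{A}\).

```python
def clip(rho, eps):
    return min(max(rho, 1 - eps), 1 + eps)

def w_ppo(rho, A, eps):
    """Standard PPO-style stop-gradient weight: 0 when clipped, rho otherwise."""
    return rho if (rho - clip(rho, eps)) * A <= 0 else 0.0

def w_upper_clipped(rho, A, eps):
    """Same as w_ppo, but never larger than 1+eps, regardless of sign(A)."""
    return min(w_ppo(rho, A, eps), 1 + eps)

eps = 0.2
# A < 0 and rho >> 1: PPO passes the raw ratio through; upper clipping caps it.
assert w_ppo(10.0, -1.0, eps) == 10.0
assert w_upper_clipped(10.0, -1.0, eps) == 1.2
# Inside the trust region both agree with the plain IS weight.
assert w_upper_clipped(1.1, 1.0, eps) == w_ppo(1.1, 1.0, eps) == 1.1
```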
The Power Family \(w_{g,i} = \rho_{g,i}^q\)
幂次族 \(w_{g,i} = \rho_{g,i}^q\)
Moving beyond PPO-style clipping, we can investigate a continuous family of off-policy corrections parameterized by \(q \in [0, 1]\):
超越 PPO 风格 clipping,我们可以研究一个由 \(q \in [0, 1]\) 参数化的连续离策略校正族:
This family smoothly interpolates between two well-known extremes. To understand the variance difference, consider the gradient estimator from a batch of \(N\) samples drawn from \(\pi_{\mathrm{old}}\):
Each sample’s contribution to the gradient is \(\rho_i^q \cdot \hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\). The variance of \(\hat{g}_q\) across mini-batches depends on how much the per-sample contributions fluctuate:
这个族在两个已知极端之间平滑插值。为了理解方差的差异,考虑从 \(\pi_{\mathrm{old}}\) 抽取 \(N\) 个样本的梯度估计量(见上式)。每个样本对梯度的贡献为 \(\rho_i^q \cdot \hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\)。\(\hat{g}_q\) 在不同 mini-batch 间的方差取决于 per-sample 贡献的波动程度:
-
\(q = 0\): No correction (\(w = 1\)). Each sample contributes \(\hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\) — all samples are weighted equally. The variance comes solely from \(\hat{A}\) and \(\nabla \ln \pi_\theta\), which are inherent to the problem. No additional randomness is injected. This is simply the variance of a standard sample mean.
The estimator computes \(\mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\), but we actually want \(\mathbb{E}_{\pi_\theta}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\). These differ when \(\pi_\theta \neq \pi_{\mathrm{old}}\) — the estimator is biased. The policy improvement theorem guarantees this bias is small when \(\pi_\theta \approx \pi_{\mathrm{old}}\), but it can be significant after many gradient steps.
-
\(q = 1\): Full IS correction (\(w = \rho\)). Each sample contributes \(\rho_i \cdot \hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\). This converts the expectation to \(\mathbb{E}_{\pi_\theta}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\) — unbiased. But \(\rho_i\) is itself a random variable that multiplies every sample. A sample with \(\rho = 10\) contributes 10× more than a sample with \(\rho = 1\), so a few high-\(\rho\) outliers can dominate the entire batch:
-
\(q = 0\):不校正(\(w = 1\))。 每个样本贡献 \(\hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\)——所有样本权重相同。方差仅来自 \(\hat{A}\) 和 \(\nabla \ln \pi_\theta\),这些是问题本身固有的。没有额外的随机性被注入。这就是标准样本均值的方差。
该估计量计算的是 \(\mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\),但我们实际想要的是 \(\mathbb{E}_{\pi_\theta}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\)。当 \(\pi_\theta \neq \pi_{\mathrm{old}}\) 时两者不同——估计量是有偏的。策略改进定理保证在 \(\pi_\theta \approx \pi_{\mathrm{old}}\) 时偏差很小,但经过多次梯度更新后偏差可能显著。
-
\(q = 1\):完全 IS 校正(\(w = \rho\))。 每个样本贡献 \(\rho_i \cdot \hat{A}_i \cdot \nabla_\theta \ln \pi_\theta\)。这将期望转换为 \(\mathbb{E}_{\pi_\theta}[\hat{A}\, \nabla_\theta \ln \pi_\theta]\)——无偏。但 \(\rho_i\) 本身是一个随机变量,乘以每个样本。一个 \(\rho = 10\) 的样本贡献是 \(\rho = 1\) 样本的 10 倍,因此少数高 \(\rho\) 异常值可以主导整个 batch:
where the inequality holds because multiplying by \(\rho\) (a non-constant positive random variable with \(\mathbb{E}[\rho] = 1\)) can only increase or maintain the variance. The gap grows with the spread of \(\rho\): if \(\pi_\theta\) has drifted far from \(\pi_{\mathrm{old}}\), some \(\rho_i\) values will be very large and others very small, making \(\mathrm{Var}(\rho\,\hat{A})\) much larger than \(\mathrm{Var}(\hat{A})\). In the extreme, if a single sample has \(\rho = 100\) while the rest have \(\rho \approx 0\), the entire batch estimate is determined by that one sample — the effective sample size collapses to 1.
- \(q \in (0, 1)\): Partial correction. The weight \(\rho^q\) with \(q < 1\) compresses the IS ratio toward 1: since \(\rho^q\) is closer to 1 than \(\rho\) for any \(\rho > 0\), the per-sample contributions are more uniform. Formally, \(\mathrm{Var}(\rho^q) \leq \mathrm{Var}(\rho)\) for \(q \in [0, 1]\), and the variance interpolates smoothly. For example, \(q = 0.5\) uses \(w = \sqrt{\rho}\): a sample with \(\rho = 100\) contributes \(\sqrt{100} = 10\times\) instead of \(100\times\), significantly taming the outlier effect. The cost is bias — \(\rho^q\) does not yield a valid IS correction for \(q \neq 1\).
不等式成立是因为乘以 \(\rho\)(一个非常数的正随机变量,\(\mathbb{E}[\rho] = 1\))只可能增加或维持方差。差距随 \(\rho\) 的分散程度增大:如果 \(\pi_\theta\) 已经远离 \(\pi_{\mathrm{old}}\),某些 \(\rho_i\) 值会非常大而其他非常小,使得 \(\mathrm{Var}(\rho\,\hat{A})\) 远大于 \(\mathrm{Var}(\hat{A})\)。在极端情况下,如果一个样本 \(\rho = 100\) 而其余 \(\rho \approx 0\),整个 batch 的估计由这一个样本决定——有效样本量退化为 1。
- \(q \in (0, 1)\):部分校正。 权重 \(\rho^q\)(\(q < 1\))将 IS ratio 向 1 压缩:由于对任意 \(\rho > 0\),\(\rho^q\) 比 \(\rho\) 更接近 1,per-sample 贡献更均匀。形式上,\(\mathrm{Var}(\rho^q) \leq \mathrm{Var}(\rho)\) 对 \(q \in [0, 1]\) 成立,方差平滑插值。例如,\(q = 0.5\) 使用 \(w = \sqrt{\rho}\):一个 \(\rho = 100\) 的样本贡献 \(\sqrt{100} = 10\) 倍而非 \(100\) 倍,显著抑制了异常值效应。代价是偏差——\(\rho^q\) 在 \(q \neq 1\) 时不构成有效的 IS 校正。
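A small sketch of the power family (names ours), showing the compression of outliers and the variance interpolation on a toy lognormal \(\rho\) distribution:

```python
import math
import random
import statistics

def w_power(rho, q):
    """Power-family weight w = rho^q for q in [0, 1]."""
    return rho ** q

assert w_power(100.0, 0.0) == 1.0     # q = 0: no correction
assert w_power(100.0, 0.5) == 10.0    # q = 0.5: a 100x outlier becomes 10x
assert w_power(100.0, 1.0) == 100.0   # q = 1: full IS weight

# Var(rho^q) interpolates between Var(1) = 0 and Var(rho).
rng = random.Random(2)
rhos = [math.exp(rng.gauss(-0.5, 1.0)) for _ in range(10000)]
v = {q: statistics.pvariance([r ** q for r in rhos]) for q in (0.0, 0.5, 1.0)}
assert v[0.0] == 0.0 and v[0.0] <= v[0.5] <= v[1.0]
```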
The bias-variance tradeoff. The choice of \(q\) navigates a fundamental tension:
| \(q\) | Bias | Variance | What it estimates |
|---|---|---|---|
| \(0\) | High | Low | \(\mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}\,\nabla\ln\pi_\theta]\) (wrong distribution) |
| \(1\) | None | High | \(\mathbb{E}_{\pi_\theta}[\hat{A}\,\nabla\ln\pi_\theta]\) (correct distribution) |
| \(q \in (0,1)\) | Medium | Medium | Neither \(\pi_{\mathrm{old}}\) nor \(\pi_\theta\) (no clean interpretation) |
The bias matters most when \(\pi_\theta\) has drifted far from \(\pi_{\mathrm{old}}\) (many gradient steps since the last rollout). In this regime, \(q = 0\) optimizes the wrong objective entirely. The variance matters most when the batch size \(N\) is small or the policy drift \(\sigma\) is large — in this regime, \(q = 1\) produces gradient estimates so noisy that training becomes unstable.
The optimal \(q\) depends on the training stage: early in training when updates are large (high \(\sigma\)), lower \(q\) is safer; later when \(\pi_\theta \approx \pi_{\mathrm{old}}\) (low \(\sigma\)), \(q \approx 1\) is fine since \(\rho \approx 1\) makes all choices equivalent. This motivates PPO’s clipping approach, which adaptively sets \(w = \rho\) (full correction) when \(\rho\) is moderate and \(w = 0\) (no update) when \(\rho\) is extreme — a data-dependent \(q\) that avoids committing to a single tradeoff.
偏差-方差权衡。 \(q\) 的选择在一个根本性的张力中导航:
| \(q\) | 偏差 | 方差 | 估计的是什么 |
|---|---|---|---|
| \(0\) | 高 | 低 | \(\mathbb{E}_{\pi_{\mathrm{old}}}[\hat{A}\,\nabla\ln\pi_\theta]\)(错误分布) |
| \(1\) | 无 | 高 | \(\mathbb{E}_{\pi_\theta}[\hat{A}\,\nabla\ln\pi_\theta]\)(正确分布) |
| \(q \in (0,1)\) | 中等 | 中等 | 既非 \(\pi_{\mathrm{old}}\) 也非 \(\pi_\theta\)(无简洁解释) |
偏差在 \(\pi_\theta\) 远离 \(\pi_{\mathrm{old}}\) 时最重要(自上次 rollout 以来经过了许多梯度步)。在这种情况下,\(q = 0\) 完全优化了错误的目标。方差在 batch 大小 \(N\) 小或策略漂移 \(\sigma\) 大时最重要——此时 \(q = 1\) 产生的梯度估计太嘈杂,训练变得不稳定。
最优的 \(q\) 取决于训练阶段:训练早期更新幅度大(高 \(\sigma\))时,较低的 \(q\) 更安全;后期 \(\pi_\theta \approx \pi_{\mathrm{old}}\)(低 \(\sigma\))时,\(q \approx 1\) 没问题,因为 \(\rho \approx 1\) 使所有选择等价。这启发了 PPO 的 clipping 方法:当 \(\rho\) 适中时自适应地设 \(w = \rho\)(完全校正),当 \(\rho\) 极端时设 \(w = 0\)(不更新)——一种数据依赖的 \(q\),避免锁定在单一权衡上。
Empirical evidence from RAFT++. According to the RAFT++ paper (A Minimalist Approach to LLM Reasoning), Figure 2:
- \(q = 1\) (full IS correction, RAFT++ without clipping) performs the worst, due to the high variance of unclipped importance weights.
- \(q = 0\) (no correction, corresponding to RAFT) performs better than \(q = 1\), but is still suboptimal.
- The best performance comes from methods that reduce variance while maintaining some correction — which is exactly the role PPO clipping plays.
This suggests that among the power family \(\rho^q\), neither extreme is optimal. But instead of searching for the best \(q\), one can use more sophisticated variance reduction strategies (like PPO clipping or the approaches we discuss next) that adapt to the local magnitude of \(\rho\).
RAFT++ 的实验证据。 根据 RAFT++ 论文(A Minimalist Approach to LLM Reasoning)图 2:
- \(q = 1\)(完全 IS 校正,不带 clipping 的 RAFT++)表现最差,因为未截断的重要性权重方差很高。
- \(q = 0\)(不校正,对应 RAFT)表现优于 \(q = 1\),但仍非最优。
- 最佳性能来自既减少方差又保留一定校正的方法——这正是 PPO clipping 所扮演的角色。
这说明在幂次族 \(\rho^q\) 中,两个极端都不是最优的。但与其搜索最佳 \(q\),不如使用更精细的方差缩减策略(如 PPO clipping 或我们接下来讨论的方法),这些策略可以根据 \(\rho\) 的局部大小自适应调整。
Alternative Definitions of \(\rho_{g,i}\)
\(\rho_{g,i}\) 的替代定义
So far we have used the token-level IS ratio \(\rho_{g,i} = \frac{\pi_\theta(a_i \vert \tau_{i-1})}{\pi_{\mathrm{old}}(a_i \vert \tau_{i-1})}\), which corrects only the distribution of the single token \(a_i\). But we can also consider sequence-level alternatives that capture the distributional shift across the entire trajectory. These can be combined with any choice of \(w(\rho)\) discussed above.
When is the token-level \(\rho_j\) exact? Whether the single-token ratio suffices is determined by what quantities the advantage depends on. As discussed above:
-
Standard GRPO: The advantage \(\hat{A}_g\) depends only on the task reward \(r(\tau^g)\), which is a fixed constant for the entire trajectory. The gradient contribution at token \(j\) is \(\hat{A}_g \nabla_\theta \ln \pi_\theta(a_j)\). The only distribution mismatch is that action \(a_j\) was sampled from \(\pi_{\mathrm{old}}\) instead of \(\pi_\theta\) — and the single-token ratio \(\rho_j\) corrects exactly this. There is no future-action-dependent quantity in \(\hat{A}_g\), so no further correction is needed. The token-level IS correction is exact.
-
Token-level KL formulation (ours): The token-level reward \(r_j = r(\tau) - \frac{\beta}{H}\sum_{k=j}^H K_k\) depends on future actions \(a_{j+1}, \ldots, a_H\) through the KL terms \(K_k\). These future actions were sampled from \(\pi_{\mathrm{old}}\), so the distribution of \(r_j\) itself is wrong under \(\pi_\theta\). Correcting this requires the full future trajectory ratio \(\prod_{k=j}^H \rho_k\), which has exponential variance (as shown above). The geometric mean is a variance-reduced approximation — it is not the exact IS weight, hence biased.
The practical implication: standard GRPO’s simpler IS treatment is exact for its coarser (trajectory-level) advantage, while our formulation’s finer (token-level) advantage demands a more complex IS correction where any tractable approximation introduces bias.
Sequence-level geometric mean (GSPO). Replace the per-token ratio with the geometric mean of all token ratios in the trajectory:
到目前为止我们使用的是 token-level IS ratio \(\rho_{g,i} = \frac{\pi_\theta(a_i \vert \tau_{i-1})}{\pi_{\mathrm{old}}(a_i \vert \tau_{i-1})}\),它只校正单个 token \(a_i\) 的分布。但我们也可以考虑 sequence-level 的替代方案,以捕捉整条轨迹上的分布偏移。这些可以与上面讨论的任何 \(w(\rho)\) 选择组合。
Token-level \(\rho_j\) 何时精确? 单 token ratio 是否足够取决于 advantage 依赖什么。正如上文讨论的:
-
标准 GRPO:Advantage \(\hat{A}_g\) 只依赖 task reward \(r(\tau^g)\),对整条轨迹是固定常数。Token \(j\) 处的梯度贡献为 \(\hat{A}_g \nabla_\theta \ln \pi_\theta(a_j)\)。唯一的分布偏差是动作 \(a_j\) 来自 \(\pi_{\mathrm{old}}\) 而非 \(\pi_\theta\)——单 token ratio \(\rho_j\) 精确校正了这一点。\(\hat{A}_g\) 中没有依赖未来动作的项,因此不需要进一步校正。Token-level IS 校正是精确的。
-
Token-level KL 公式(我们的):Token-level reward \(r_j = r(\tau) - \frac{\beta}{H}\sum_{k=j}^H K_k\) 通过 KL 项 \(K_k\) 依赖于未来动作 \(a_{j+1}, \ldots, a_H\)。这些未来动作来自 \(\pi_{\mathrm{old}}\) 的采样,因此 \(r_j\) 本身的分布在 \(\pi_\theta\) 下就是错的。精确校正需要完整的 future trajectory ratio \(\prod_{k=j}^H \rho_k\),但其方差指数爆炸(如上文所示)。几何平均是方差缩减的近似——它不是精确的 IS 权重,因此有偏。
实际含义:标准 GRPO 更简单的 IS 处理对其更粗的(轨迹级)advantage 是精确的,而我们公式更精细的(token 级)advantage 需要更复杂的 IS 校正,任何可行的近似都会引入偏差。
Sequence-level 几何平均(GSPO)。 将 per-token ratio 替换为轨迹中所有 token ratio 的几何平均:
This makes \(\rho_{g,i}\) the same for all tokens \(i\) within trajectory \(g\) — it measures the average log-probability shift across the entire sequence. This has several implications:
- Variance reduction: By averaging over \(H_g\) token ratios, the geometric mean has \(\sim 1/H_g\) times the log-variance of a single token ratio, preventing any single outlier token from dominating.
- Global view: It captures the overall distributional shift of the trajectory, not just local shifts at individual tokens.
- Position-independence: All tokens in the same trajectory receive the same IS correction, which may lose fine-grained per-token information but simplifies the algorithm.
这使得 \(\rho_{g,i}\) 对轨迹 \(g\) 中所有 token \(i\) 相同——它衡量整个序列上平均的 log-probability 偏移。这有几个含义:
- 方差缩减:通过对 \(H_g\) 个 token ratio 取平均,几何平均的 log-方差约为单个 token ratio 的 \(1/H_g\),防止任何单个异常 token 主导。
- 全局视角:捕捉轨迹整体的分布偏移,而非仅仅是个别 token 的局部偏移。
- 位置无关性:同一轨迹中的所有 token 接收相同的 IS 校正,可能丢失精细的 per-token 信息,但简化了算法。
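A sketch of the sequence-level ratio (our implementation; computed in log space as \(\exp(\frac{1}{H}\sum_j \ln \rho_j)\)):

```python
import math

def geometric_mean_ratio(token_ratios):
    """GSPO-style sequence-level ratio: geometric mean of per-token ratios,
    shared by every token in the trajectory."""
    H = len(token_ratios)
    return math.exp(sum(math.log(r) for r in token_ratios) / H)

ratios = [0.5, 2.0, 1.0, 1.0]
assert abs(geometric_mean_ratio(ratios) - 1.0) < 1e-9  # 0.5 and 2.0 cancel

# A single outlier token moves the product far more than the geometric mean:
outlier = [1.0] * 99 + [100.0]
assert math.prod(outlier) == 100.0
assert abs(geometric_mean_ratio(outlier) - 100 ** (1 / 100)) < 1e-9  # ~1.047
```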
Future-only geometric mean. A position-dependent variant that only considers tokens from position \(i\) onward:
仅未来 token 的几何平均。 一种位置相关的变体,只考虑从位置 \(i\) 开始的 token:
This is motivated by the same reasoning as the sequence-level correction derived above: the loss at position \(i\) depends on future actions \(a_i, a_{i+1}, \ldots, a_{H_g}\), so the IS correction should cover the distributional shift from position \(i\) onward. Compared to the full-sequence geometric mean, this variant:
- Adapts by position: Early tokens (small \(i\)) average over more tokens, getting stronger variance reduction but potentially more bias from including irrelevant past distributional changes. Late tokens (large \(i\)) average over fewer tokens, preserving more local information but with higher variance.
- Connects to the token-level case: At \(i = H_g\) (the last token), the future-only geometric mean reduces to the standard token-level ratio \(\rho_{g,H_g} = \frac{\pi_\theta(a_{H_g})}{\pi_{\mathrm{old}}(a_{H_g})}\).
- Connects to the full-sequence case: At \(i = 1\) (the first token), it equals the full-sequence geometric mean.
这与上文推导的 sequence-level correction 基于相同的直觉:位置 \(i\) 处的损失依赖于未来动作 \(a_i, a_{i+1}, \ldots, a_{H_g}\),因此 IS 校正应覆盖从位置 \(i\) 开始的分布偏移。与 full-sequence 几何平均相比,这个变体:
- 按位置自适应:早期 token(小 \(i\))对更多 token 取平均,获得更强的方差缩减,但可能因包含不相关的过去分布变化而引入更多偏差。晚期 token(大 \(i\))对更少 token 取平均,保留更多局部信息但方差更高。
- 与 token-level 的联系:在 \(i = H_g\)(最后一个 token)时,future-only 几何平均退化为标准的 token-level ratio \(\rho_{g,H_g} = \frac{\pi_\theta(a_{H_g})}{\pi_{\mathrm{old}}(a_{H_g})}\)。
- 与 full-sequence 的联系:在 \(i = 1\)(第一个 token)时,等于 full-sequence 几何平均。
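A sketch of the future-only variant (our implementation), with checks of the two boundary cases noted above:

```python
import math

def future_geo_ratio(token_ratios, i):
    """Geometric mean of the per-token ratios from position i to H
    (i is 1-indexed, matching the notation in the text)."""
    future = token_ratios[i - 1:]
    return math.exp(sum(math.log(r) for r in future) / len(future))

ratios = [0.8, 1.5, 1.0, 2.0]
H = len(ratios)
full_geo = math.exp(sum(math.log(r) for r in ratios) / H)

# i = H: reduces to the single-token ratio of the last token.
assert abs(future_geo_ratio(ratios, H) - ratios[-1]) < 1e-12
# i = 1: equals the full-sequence geometric mean.
assert abs(future_geo_ratio(ratios, 1) - full_geo) < 1e-12
```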
Summary of \(\rho\) choices. The three definitions form a spectrum from local to global correction:
| \(\rho\) definition | Scope | Variance | Bias | Position-dependent? |
|---|---|---|---|---|
| Token-level \(\frac{\pi_\theta(a_i)}{\pi_{\mathrm{old}}(a_i)}\) | Single token | Highest | Lowest (for token \(i\)) | Yes |
| Future-only geometric mean | Tokens \(i\) to \(H_g\) | Medium | Medium | Yes |
| Full-sequence geometric mean (GSPO) | All tokens | Lowest | Highest | No |
These \(\rho\) choices are orthogonal to the \(w(\rho)\) choices (PPO clipping, power family \(\rho^q\), upper clipping, etc.). Any combination is a valid algorithm to try. The experimental question is which combination yields the best bias-variance tradeoff for LLM reasoning tasks.
\(\rho\) 选择总结。 三种定义构成从局部到全局校正的谱系:
| \(\rho\) 定义 | 范围 | 方差 | 偏差 | 位置相关? |
|---|---|---|---|---|
| Token-level \(\frac{\pi_\theta(a_i)}{\pi_{\mathrm{old}}(a_i)}\) | 单个 token | 最高 | 最低(对 token \(i\)) | 是 |
| 仅未来几何平均 | Token \(i\) 到 \(H_g\) | 中等 | 中等 | 是 |
| Full-sequence 几何平均(GSPO) | 所有 token | 最低 | 最高 | 否 |
这些 \(\rho\) 选择与 \(w(\rho)\) 选择(PPO clipping、幂次族 \(\rho^q\)、上方截断等)是正交的。任何组合都是一个有效的算法。实验问题是哪种组合在 LLM 推理任务上产生最佳的偏差-方差权衡。
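The orthogonality can be sketched as a small grid of combinations (all names and example values are ours): any \(\rho\) definition pairs with any weight map \(w(\rho)\) to define a candidate algorithm.

```python
import math

# --- rho definitions (i is a 0-indexed token position here) ---
def rho_token(ratios, i):
    return ratios[i]

def rho_future_geo(ratios, i):
    fut = ratios[i:]
    return math.exp(sum(math.log(r) for r in fut) / len(fut))

def rho_full_geo(ratios, i):
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# --- weight maps w(rho) ---
def w_identity(rho):             # power family with q = 1
    return rho

def w_sqrt(rho):                 # power family with q = 0.5
    return math.sqrt(rho)

def w_upper_clip(rho, eps=0.2):  # cap the weight at 1 + eps
    return min(rho, 1 + eps)

# Every (rho definition, weight map) pair is a valid algorithm to try.
ratios = [0.9, 1.1, 3.0]
combos = {(rd.__name__, wm.__name__): wm(rd(ratios, 2))
          for rd in (rho_token, rho_future_geo, rho_full_geo)
          for wm in (w_identity, w_sqrt, w_upper_clip)}
assert combos[("rho_token", "w_identity")] == 3.0
assert combos[("rho_token", "w_upper_clip")] == 1.2
assert len(combos) == 9  # 3 rho definitions x 3 weight maps
```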