RL on Language in Single-Step Settings

The importance sampling story changes in interesting ways when the “actions” are natural language sequences. In classical RL, an action is a discrete choice (move left, move right) or a continuous vector (torque on a joint). In language model RL, an action is an entire token sequence — a sentence, a paragraph, or a multi-step chain of thought. This shift in the structure of the action space has deep consequences for how IS ratios behave.

Modeling Language Generation with RL

What is a Contextual Bandit?

Before examining how IS ratios behave in language model RL, it helps to understand the contextual bandit — the framework that most current language model RL implicitly operates in. A contextual bandit is a tuple \((\mathcal{X}, P, \mathcal{A}, r)\):

  • A context space \(\mathcal{X}\) is the set of all possible situations the agent might face. Each context \(x \in \mathcal{X}\) is a complete description of the current situation — it carries all the information the agent needs to make a decision. Contexts are drawn i.i.d. from a fixed distribution \(P(x)\) that the agent cannot influence.
  • An action space \(\mathcal{A}\) is the set of choices available to the agent. Given a context \(x\), the agent selects one action \(a \in \mathcal{A}\). The action space is the same for every context (though a policy may assign zero probability to certain actions in certain contexts).
  • A reward function \(r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}\) maps each context-action pair to a scalar reward. The reward may be deterministic or stochastic — in the stochastic case, \(r(x, a)\) denotes the expected reward, and the agent observes a noisy realization.
  • A policy \(\pi(a \vert x)\) is a conditional distribution over actions given a context. It encodes the agent’s strategy: for each context \(x\), \(\pi(\cdot \vert x)\) is a probability distribution over \(\mathcal{A}\).

At each round, nature draws a context \(x \sim P(x)\), the agent observes \(x\), selects an action \(a \sim \pi(a \vert x)\), and receives reward \(r(x, a)\). Then the round ends. The next context \(x'\) is drawn fresh from \(P(x)\) — it does not depend on the previous action or context. The agent’s goal is to find a policy that maximizes expected reward:

\[\max_\pi \; \mathbb{E}_{x \sim P, \, a \sim \pi(\cdot \vert x)}\!\left[r(x, a)\right]\]

The key structural property is that there are no state transitions. The context distribution \(P(x)\) is fixed and does not depend on the policy. This is what distinguishes a contextual bandit from a full MDP: in an MDP, the agent’s actions influence what states it sees next, creating a feedback loop between the policy and the state distribution. In a bandit, each round is statistically independent — the agent’s choice of action today has no effect on what context it will see tomorrow.
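
To make this loop concrete, here is a minimal sketch in Python. The context set, the reward function, and the uniform policy are toy stand-ins invented for illustration, not part of any particular library.

```python
import random

random.seed(0)
CONTEXTS = ["ctx_a", "ctx_b", "ctx_c"]  # toy context space X
ACTIONS = [0, 1, 2]                     # action space A, shared across contexts

def reward(x, a):
    # Toy deterministic reward: each context prefers one particular action.
    return 1.0 if ACTIONS[hash(x) % len(ACTIONS)] == a else 0.0

def policy(x):
    # A uniform policy pi(. | x); a learned policy would actually use x.
    return random.choice(ACTIONS)

total = 0.0
for _ in range(1000):
    x = random.choice(CONTEXTS)  # nature draws x ~ P(x), independent of the policy
    a = policy(x)                # agent draws a ~ pi(a | x)
    total += reward(x, a)        # the round ends; nothing carries over
print("average reward:", total / 1000)
```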

Figure 1: A contextual bandit. Each round: environment presents context x, agent picks action a, receives reward r(x, a). No state carries over between rounds.

This independence has a profound consequence for importance sampling. Recall from the post on importance sampling that the surrogate objective in policy gradient methods silently substitutes the old policy’s state distribution \(d^{\pi_{\text{old}}}\) for the current policy’s \(d^{\pi_\theta}\), introducing an approximation error controlled by the distribution mismatch coefficient. In a contextual bandit, this problem vanishes entirely: the context distribution \(P(x)\) is the same regardless of which policy is being used, so there is no state distribution mismatch to worry about. The IS ratio \(\frac{\pi_\theta(a \vert x)}{\pi_{\text{old}}(a \vert x)}\) corrects for the action mismatch, and that is the only correction needed.

Language Generation as a Bandit

Most current language model RL — RLHF for chat models, reward-based fine-tuning for math and code — fits naturally into the contextual bandit framework. The mapping is:

| Contextual Bandit | Language Model RL |
|---|---|
| Context \(x\) | Prompt |
| Action \(a\) | Full response \(y\) |
| Policy \(\pi(a \vert x)\) | Language model \(\pi_\theta(y \vert x)\) |
| Reward \(r(x, a)\) | Reward model \(r(x, y)\) |
| Context distribution \(P(x)\) | Prompt dataset \(\mathcal{D}\) |

The model generates a complete response in one shot, receives a scalar reward, and the episode ends. The next prompt is drawn independently from the dataset — it does not depend on what the model generated previously.
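
Under this mapping, a rollout loop is just a sequence of independent draws. A schematic sketch, where `generate` and `reward_model` are hypothetical placeholders for a real language model and scorer:

```python
import random

prompts = ["What is 2+2?", "Factor x^2 - 1.", "Sum 1..100?"]  # prompt dataset D

def generate(prompt):
    # Placeholder for sampling y ~ pi_theta(. | x) from a real language model.
    return f"<response to: {prompt}>"

def reward_model(prompt, response):
    # Placeholder for r(x, y): a learned scorer or an exact-match verifier.
    return random.random()

batch = []
for _ in range(8):
    x = random.choice(prompts)  # each prompt is drawn fresh from D; it does not
    y = generate(x)             # depend on anything generated earlier
    r = reward_model(x, y)
    batch.append((x, y, r))     # one (context, action, reward) triple per episode
```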

Figure 2: Language generation as a contextual bandit. The prompt is the context, the full response is the action, and a reward model scores the output.

This is why PPO and related methods work as well as they do for single-turn tasks like math problem solving or code generation. The problem has the structure of a bandit: given a math problem (prompt), produce a solution (response), get a reward (correct or not). The IS ratio \(\frac{\pi_\theta(y \vert x)}{\pi_{\text{old}}(y \vert x)}\) corrects for the action mismatch, and there is no state mismatch to worry about. The surrogate objective is exact up to the action-level IS correction — no hidden approximation, no distribution mismatch coefficient.

The picture changes for multi-turn tasks — dialogue, tool use, agentic workflows — where the model’s output at step \(t\) affects what input it sees at step \(t+1\). In those settings, the full sequential RL formulation returns, and with it all the challenges of state distribution mismatch and trajectory-level IS ratios. But the majority of current language model RL operates in the bandit regime, which is one reason it has been so successful despite using relatively simple algorithms.

Language Generation as a Token-Level MDP

There is an alternative way to model language generation: treat each token as an action in a sequential decision process. Instead of viewing the entire response as a single monolithic action, we model generation as a multi-step MDP where the agent makes one decision per time step:

| MDP Component | Token-Level Language Generation |
|---|---|
| State \(s_t\) | Prompt \(x\) concatenated with tokens generated so far: \(s_t = (x, y_1, \ldots, y_{t-1})\) |
| Action \(a_t\) | Next token \(y_t \in \mathcal{V}\) |
| Transition \(T(s_{t+1} \vert s_t, a_t)\) | Deterministic: append \(y_t\) to get \(s_{t+1} = (x, y_1, \ldots, y_t)\) |
| Policy \(\pi(a_t \vert s_t)\) | Next-token distribution \(\pi_\theta(y_t \vert x, y_{<t})\) |
| Reward \(r(s_t, a_t)\) | Zero at all intermediate steps; \(r(x, y)\) at the final token |

Figure 3: Token-level MDP view of autoregressive generation. Each token is an action, the prefix is the state, and the sequence probability decomposes as a product of per-token probabilities. Reward is sparse — assigned only at the terminal state.

Under this formulation, the state at time \(t\) is the entire prefix — the prompt plus all tokens generated so far. The action is a single token drawn from the vocabulary \(\mathcal{V}\). The transition function is deterministic and trivial: the next state is just the current state with the new token appended. The reward is sparse: the agent receives nothing until the response is complete, at which point a reward model scores the full sequence.

This is a legitimate MDP — each action (token) changes the state (prefix), and the state determines what actions are available and how future states evolve. But it is an unusual one. The transition function is deterministic, so all stochasticity comes from the policy. The state space grows with each step, and the horizon \(T\) (response length) varies across episodes.
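
A minimal sketch of a token-level rollout, with a toy vocabulary and stand-in policy and reward functions, makes this sparse-reward structure explicit:

```python
import random

random.seed(0)
VOCAB = ["The", "answer", "is", "4", ".", "<eos>"]  # toy vocabulary V

def next_token(state):
    # Stand-in for the next-token distribution pi_theta(y_t | x, y_<t).
    return random.choice(VOCAB)

def terminal_reward(state):
    # Stand-in for the sequence-level reward model r(x, y).
    return 1.0 if "4" in state else 0.0

def rollout(prompt, max_len=20):
    state = [prompt]                      # s_1 = (x,): the prompt alone
    rewards = []
    for t in range(max_len):
        token = next_token(state)         # action a_t: a single token
        state = state + [token]           # deterministic transition: append a_t
        done = token == "<eos>" or t == max_len - 1
        rewards.append(terminal_reward(state) if done else 0.0)  # sparse reward
        if done:
            break
    return state, rewards

trajectory, rewards = rollout("What is 2+2?")
```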

Bandit vs Token-Level MDP: What Changes?

The two formulations — bandit and token-level MDP — are mathematically equivalent. The bandit’s single “action” \(y\) is the MDP’s entire trajectory \((y_1, \ldots, y_T)\). The bandit’s policy \(\pi_\theta(y \vert x)\) equals the MDP’s trajectory probability \(\prod_{t=1}^{T} \pi_\theta(y_t \vert x, y_{<t})\). The bandit’s reward \(r(x, y)\) is the MDP’s cumulative return (which is just the terminal reward, since intermediate rewards are zero). Any optimization algorithm that works on one formulation can be translated to the other. But the two views lead to very different algorithmic choices.

IS ratios. Under the bandit view, there is a single IS ratio per response: \(\frac{\pi_\theta(y \vert x)}{\pi_{\text{old}}(y \vert x)}\). Under the MDP view, this ratio decomposes into a product of \(T\) per-token ratios: \(\prod_{t=1}^{T} \frac{\pi_\theta(y_t \vert x, y_{<t})}{\pi_{\text{old}}(y_t \vert x, y_{<t})}\). The bandit ratio is a single number that can be computed and clipped directly. The MDP product is a chain of \(T\) factors that can compound — even if each factor is close to 1, their product can explode or collapse for long sequences. This is the compounding problem from the post on importance sampling, reappearing in the token setting.
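
The equivalence of the two views, and the drift of the product, is easy to check numerically. A small sketch with made-up per-token log-probabilities:

```python
import math
import random

random.seed(0)
T = 500
# Made-up per-token log-probs of the same sampled tokens under two policies.
logp_new = [random.gauss(-2.0, 0.05) for _ in range(T)]
logp_old = [random.gauss(-2.0, 0.05) for _ in range(T)]

# Bandit view: one sequence-level ratio, computed stably in log space.
seq_ratio = math.exp(sum(logp_new) - sum(logp_old))

# MDP view: the same number as a product of T per-token ratios.
per_token = [math.exp(a - b) for a, b in zip(logp_new, logp_old)]
print(seq_ratio, math.prod(per_token))   # identical up to float error
print(min(per_token), max(per_token))    # every factor is close to 1...
# ...yet the 500-factor product can sit orders of magnitude away from 1.
```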

Credit assignment. The bandit formulation assigns the same scalar reward \(r(x, y)\) to the entire response. Every token receives equal credit, whether it is a crucial reasoning step or a filler word. The MDP formulation opens the door to token-level credit assignment: we can define a value function \(V(s_t)\) at each prefix \(s_t = (x, y_{<t})\) and an advantage function

\[A(s_t, y_t) = Q(s_t, y_t) - V(s_t)\]

that measures how much better token \(y_t\) is compared to the expected continuation from state \(s_t\). This allows the policy gradient to upweight tokens that contributed positively to the reward and downweight tokens that did not — a much finer-grained signal than applying the same reward to all tokens.

State distribution mismatch. Under the bandit view, the context distribution \(P(x)\) is fixed and policy-independent, so there is no state distribution mismatch. Under the MDP view, the state at time \(t\) is \(s_t = (x, y_1, \ldots, y_{t-1})\), which depends on the policy that generated the prefix. If we use data collected under \(\pi_{\text{old}}\) to update \(\pi_\theta\), the distribution over prefixes will differ — the hidden approximation from the importance sampling post returns. However, because the transitions are deterministic, this mismatch is entirely driven by the policy: two policies see different states only because they generate different token sequences. This is milder than the mismatch in a stochastic-transition MDP (like robotics), but it is not zero.

KL divergence. Under the bandit view, the KL between \(\pi_\theta\) and \(\pi_{\text{ref}}\) is a single number per prompt — the divergence between two distributions over full responses. Under the MDP view, this KL decomposes as a sum of per-token KLs:

\[\text{KL}\!\left[\pi_\theta(y \vert x) \,\|\, \pi_{\text{ref}}(y \vert x)\right] = \mathbb{E}_{y \sim \pi_\theta}\!\left[\sum_{t=1}^{T} \log \frac{\pi_\theta(y_t \vert x, y_{<t})}{\pi_{\text{ref}}(y_t \vert x, y_{<t})}\right]\]

This additive structure means the sequence-level KL grows linearly with response length \(T\): a per-token KL of \(\epsilon\) accumulates to \(T\epsilon\). The MDP view makes this length dependence explicit, which is important for understanding why KL regularization strength may need to be adjusted for tasks that elicit longer responses.
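
A short sketch of this decomposition: compute one exact per-token KL between two invented next-token distributions, then scale by length to see the linear growth:

```python
import math

def kl_categorical(p, q):
    # Exact KL between two next-token distributions over a small vocabulary.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented next-token distributions over a 3-token vocabulary.
per_token_kl = kl_categorical([0.70, 0.20, 0.10], [0.60, 0.25, 0.15])

# If every position had roughly this per-token KL, the sequence-level KL
# would be per_token_kl * T: the additive structure is linear in length.
for T in (10, 100, 500):
    print(T, per_token_kl * T)
```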

Practical algorithms. Most current methods — PPO-based RLHF, GRPO, REINFORCE-style approaches — implicitly use a hybrid. They collect data at the response level (bandit), but compute per-token log-probabilities and apply token-level KL penalties (MDP). The policy gradient is typically computed with a single advantage per response (bandit-style), though some methods like token-level PPO estimate per-token advantages using a learned value function. The choice of which view to adopt at each stage of the algorithm is a design decision with real consequences for variance, credit assignment, and computational cost.
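
A sketch of this hybrid loss, assuming the common PPO-style clipped surrogate with per-token ratios and a single response-level advantage; the log-probabilities are placeholders:

```python
import math

def clipped_token_loss(logp_new, logp_old, advantage, eps=0.2):
    """Hybrid surrogate: per-token IS ratios (MDP side) weighted by one
    response-level advantage (bandit side), with PPO-style clipping."""
    losses = []
    for lp_new, lp_old in zip(logp_new, logp_old):
        ratio = math.exp(lp_new - lp_old)            # per-token IS ratio
        clipped = max(min(ratio, 1 + eps), 1 - eps)  # clip the ratio
        losses.append(-min(ratio * advantage, clipped * advantage))
    return sum(losses) / len(losses)                 # mean over tokens

# Toy usage: a three-token response with a positive response-level advantage.
loss = clipped_token_loss([-1.9, -2.1, -2.0], [-2.0, -2.0, -2.0], advantage=0.5)
```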

RLHF

IS Ratios in Language Model Alignment

The IS ratio from the post on importance sampling appears directly in Reinforcement Learning from Human Feedback (RLHF), where a language model \(\pi_\theta\) is fine-tuned to maximize a learned reward model \(r(x, y)\) while staying close to a reference model \(\pi_{\text{ref}}\). The standard RLHF objective is:

\[\max_\theta \; \mathbb{E}_{x \sim \mathcal{D}, \, y \sim \pi_\theta(\cdot \vert x)}\!\left[r(x, y)\right] - \beta \, \text{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\]

The KL divergence can be written as an expectation of the log IS ratio:

\[\text{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)}\right]\]

The ratio \(\frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)}\) is exactly an importance sampling ratio — it measures how much the fine-tuned model’s distribution has shifted from the reference. The KL penalty pushes this ratio toward 1, keeping the fine-tuned model close to the reference. This is directly analogous to keeping the proposal \(q\) close to the target \(p\) in importance sampling: when the ratio deviates too far, the estimate (or in this case, the policy update) becomes unreliable.

In practice, RLHF implementations (e.g., PPO-based RLHF) use the same clipped IS ratio from PPO for the policy optimization step, combined with the KL penalty to prevent the model from drifting too far from \(\pi_{\text{ref}}\). Some recent methods like DPO (Direct Preference Optimization) eliminate the explicit RL loop entirely by reparameterizing the reward in terms of the log ratio \(\log \frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)}\), making the IS ratio the central object of the optimization itself.
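
A minimal sketch of the DPO loss under its usual formulation, with hypothetical sequence log-probabilities; `beta` plays a role analogous to the KL coefficient:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward of each response is beta * log(pi_theta / pi_ref);
    # the loss is -log sigmoid of the reward margin between the pair.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical sequence log-probs for a preferred / rejected response pair.
loss = dpo_loss(logp_chosen=-20.0, logp_rejected=-22.0,
                ref_chosen=-21.0, ref_rejected=-21.0)
```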

Token-Level Decomposition

A language model generates a response \(y = (y_1, y_2, \ldots, y_T)\) autoregressively, so the probability of the full sequence factorizes as:

\[\pi_\theta(y \vert x) = \prod_{t=1}^{T} \pi_\theta(y_t \vert x, y_{<t})\]

The IS ratio between the current policy \(\pi_\theta\) and the reference \(\pi_{\text{ref}}\) therefore decomposes into a product of per-token ratios:

\[\frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)} = \prod_{t=1}^{T} \frac{\pi_\theta(y_t \vert x, y_{<t})}{\pi_{\text{ref}}(y_t \vert x, y_{<t})}\]

This is precisely the trajectory-level product from the importance sampling post, reappearing in a new guise. Each token generation is a “time step”, and the full response is a “trajectory”. The compounding problem applies directly: as responses grow longer, the product of per-token ratios can explode or collapse, making long-response IS estimates unreliable. A 500-token response involves a product of 500 ratios — even if each individual ratio is close to 1, their product can deviate enormously.
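
A quick numerical check of this claim: if each of 500 per-token ratios sits at a fixed small deviation from 1, the sequence-level ratio lands orders of magnitude away from 1.

```python
T = 500
for r in (0.99, 1.00, 1.01, 1.02):
    # A fixed per-token ratio r compounds to r**T over the response.
    print(r, r ** T)   # 0.99 -> ~0.007, 1.01 -> ~145, 1.02 -> ~20000
```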

Why KL Regularization is Not Optional

This compounding structure explains why the KL penalty in RLHF is not merely a stylistic choice but a statistical necessity. Without it, the policy \(\pi_\theta\) can drift far from \(\pi_{\text{ref}}\) in sequence space even when each per-token distribution shifts modestly. A per-token KL of \(\epsilon\) accumulates to \(T\epsilon\) over a full response, so the sequence-level KL grows linearly with response length. The KL penalty in the RLHF objective:

\[\max_\theta \; \mathbb{E}_{x, y \sim \pi_\theta}\!\left[r(x, y)\right] - \beta \, \text{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\]

directly controls this accumulation. By penalizing the sum of per-token log ratios, it prevents any individual token distribution from shifting too far and prevents the aggregate shift from compounding out of control. The hyperparameter \(\beta\) trades off reward maximization against distributional stability — too small and the policy drifts into regions where the IS ratios (and therefore the policy gradient estimates) become unreliable; too large and the policy cannot learn anything beyond the reference.
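
In PPO-based RLHF codebases the penalty is commonly folded into the per-token reward stream. A sketch of that shaping, assuming the simple sample-based log-ratio estimator of the per-token KL (conventions vary across implementations):

```python
def shaped_rewards(logp_theta, logp_ref, terminal_reward, beta=0.05):
    """Fold a per-token KL penalty into the sparse reward stream.
    Inputs are per-token log-probs of the sampled tokens under each model."""
    rewards = []
    T = len(logp_theta)
    for t in range(T):
        kl_hat = logp_theta[t] - logp_ref[t]  # sample-based per-token log ratio
        r_t = -beta * kl_hat                  # penalize drift at every position
        if t == T - 1:
            r_t += terminal_reward            # sequence-level reward at the end
        rewards.append(r_t)
    return rewards

rs = shaped_rewards([-1.8, -2.2, -2.0], [-2.0, -2.0, -2.0], terminal_reward=1.0)
```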

Response-Level vs Token-Level Credit Assignment

A deeper challenge is credit assignment. In standard RLHF, the reward \(r(x, y)\) is assigned to the entire response \(y\). This is a single scalar for a sequence that may be hundreds of tokens long. From an IS perspective, we are using a product of \(T\) per-token ratios to reweight a single reward signal — the worst case of the compounding problem. Every token in the response receives the same reward, even though some tokens may be crucial (the key reasoning step) while others are incidental (filler phrases, formatting).

This is analogous to the trajectory-level vs per-step IS distinction from the importance sampling post. There, we showed that per-step IS reduces variance by breaking the trajectory reward into per-step rewards and applying only the relevant IS ratios. The same idea applies to language: if we could assign credit at the token level — identifying which tokens contributed to the reward — we could avoid the full product of ratios and dramatically reduce variance.

Recent work explores exactly this direction. Process reward models assign rewards to intermediate reasoning steps rather than only to the final answer, enabling step-level credit assignment analogous to per-step IS. Token-level KL penalties regularize each token’s distribution individually rather than penalizing only the aggregate. These approaches are, at their core, applications of the per-step IS principle to the language setting: decompose the monolithic sequence-level problem into manageable token-level or step-level subproblems, and apply IS corrections only where they are needed.

The Action Space Explosion

There is one final way in which language RL differs from classical RL. In a typical control task, the action space at each step might have \(\vert\mathcal{A}\vert = 4\) (four directions) or be a low-dimensional continuous space. In language, each token is drawn from a vocabulary of tens of thousands of tokens, and the response is a sequence of these choices. The effective action space is \(\vert\mathcal{V}\vert^T\) — astronomically large. This means that two language model policies, even if they are “similar” by most measures, will assign probability mass to mostly non-overlapping sets of full responses. The overlap between \(\pi_\theta\) and \(\pi_{\text{ref}}\) in sequence space can be vanishingly small even when the per-token distributions are close.

This makes off-policy evaluation in language RL fundamentally harder than in classical RL. In a grid world, the old policy and new policy might both assign significant probability to the same trajectories, making IS reweighting effective. In language, a small change in the policy can redirect probability mass to entirely different responses, leaving the old policy’s samples uninformative about the new policy’s behavior. This is why most successful language model RL methods — PPO-based RLHF, GRPO, and their variants — operate in a nearly on-policy regime, collecting fresh samples from the current policy at each step rather than attempting to reuse old data. The IS ratios in these methods serve primarily as a local correction (keeping \(\theta\) close to \(\theta_{\text{old}}\) within a single update) rather than as a tool for genuine off-policy learning across many updates.

Signal Loss and Adaptive Sampling

The bandit formulation of language model RL — sample \(n\) responses per prompt, compute advantages, update — has a subtle failure mode. When the model’s pass rate \(p_i\) on a prompt \(x_i\) is very low, a small sample group is likely to contain all incorrect responses. When all responses have the same reward, the advantage

\[A^{\text{GRPO}}(x, y_j) = \frac{r_j - \bar{r}}{\sigma_r + \epsilon}\]

collapses to zero because \(\sigma_r = 0\). The gradient vanishes — the model receives no learning signal from that prompt. Symmetrically, when \(p_i\) is very high, all responses may be correct, and the advantage again collapses. With pass rate \(p = 0.1\) and group size \(n = 8\), the probability that all samples are incorrect is \(0.9^8 \approx 43\%\). Nearly half the time, the model learns nothing from difficult prompts.

This is signal loss: uniform sampling wastes inference budget on prompts that are either too easy (the model already solves them reliably) or too hard (the small sample size fails to capture any successes). The learning signal concentrates on prompts of intermediate difficulty where the group happens to contain a mix of correct and incorrect responses. Two recent papers address this problem from different theoretical perspectives.
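
The arithmetic behind this failure mode fits in a few lines. A sketch computing the probability that a group carries no signal, meaning it is all-correct or all-incorrect:

```python
def p_no_signal(p, n):
    # Probability that a group of n i.i.d. samples is all-incorrect or
    # all-correct, so sigma_r = 0 and the GRPO advantage (gradient) vanishes.
    return (1 - p) ** n + p ** n

for p in (0.05, 0.1, 0.5, 0.9):
    print(p, round(p_no_signal(p, n=8), 3))   # p=0.1 gives ~0.43
```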

Non-linear Objectives and Adaptive Budget Allocation

Xiong et al. (2025) argue that signal loss is an artifact of optimizing the wrong objective. The standard RL objective for reasoning is:

\[J(\theta) = \mathbb{E}_{x}[p_\theta(x)]\]

where \(p_\theta(x)\) is the pass rate on prompt \(x\). Under this linear objective, all prompts contribute equally to the gradient regardless of difficulty. They propose optimizing a non-linear objective instead:

\[J_f(\theta) = \mathbb{E}_{x}[f(p_\theta(x))]\]

where \(f\) is a concave function. Taking \(f = \log\) and differentiating:

\[\nabla J_f(\theta) = \mathbb{E}_{x}\!\left[\frac{1}{p_\theta(x)} \cdot \nabla p_\theta(x)\right]\]

The weight \(w(x) = 1/p_\theta(x)\) naturally upweights difficult prompts. A prompt with pass rate \(0.01\) receives 100 times the weight of a prompt with pass rate \(1.0\). The key insight is that this reweighting can be realized through adaptive sampling: instead of applying the weight \(1/p_i\) to the gradient (which is unstable when \(\hat{p}_i = 0\)), allocate more inference budget to harder prompts so that the sampling itself implements the reweighting.

Their algorithm, REINFORCE-ADA, has two realizations. The first (Ada-Est) estimates per-prompt pass rates using a value network or an exponential moving average, then allocates budgets \(n_i \propto 1/\sqrt{\hat{p}_i}\). The second (Ada-Seq) avoids explicit estimation entirely: it samples sequentially for each prompt until it collects \(K\) correct responses, then stops. This sequential stopping naturally achieves \(\mathbb{E}[N_i] = K/p_i\) — harder prompts automatically receive more samples. A balanced variant (Ada-Seq-balance) requires both \(K\) correct and \(K\) incorrect responses before stopping, preventing signal loss at both ends of the difficulty spectrum.
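
A sketch of the sequential-stopping idea for the balanced variant, where `pass_rate` stands in for actually sampling and verifying a response:

```python
import random

def ada_seq_balance(pass_rate, K=2, max_draws=256):
    """Sequential stopping sketch (balanced variant): keep sampling until
    both K correct and K incorrect responses have been collected."""
    correct = incorrect = draws = 0
    while (correct < K or incorrect < K) and draws < max_draws:
        draws += 1
        if random.random() < pass_rate:  # stand-in for "sample and verify"
            correct += 1
        else:
            incorrect += 1
    return draws

random.seed(0)
for p in (0.05, 0.2, 0.8):
    avg = sum(ada_seq_balance(p) for _ in range(2000)) / 2000
    print(p, round(avg, 1))  # both very hard and very easy prompts draw more
```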

The variance analysis connects back to IS. For the log-objective under REINFORCE with baseline, the optimal budget allocation that minimizes gradient variance is:

\[n_i^* \propto \sqrt{\frac{1-p_i}{p_i}}\]

This is a variance-reduction result: allocating more samples to hard prompts reduces the variance of the gradient estimator under a fixed total budget \(N\).

Gradient Variance Minimization in the EM Framework

Yao et al. (2025) approach the same problem from the EM (Expectation-Maximization) perspective. Chain-of-thought reasoning is modeled as a latent variable problem: given prompt \(x\), the model generates a latent rationale \(y\) and a predicted answer \(z\). The training objective is the negative log-likelihood:

\[\mathcal{L}(\theta) = -\mathbb{E}_{x}\!\left[\ln \sum_{y \in \mathcal{Y}} \mathbb{P}(y \vert x, \theta) \mathbb{P}(z \vert y, \theta)\right]\]

RAFT (Reward-Ranked Fine-Tuning) implements the EM algorithm: the E-step uses rejection sampling (generate \(n\) responses, keep those with correct answers) to approximate the posterior \(Q_i(y)\), and the M-step fine-tunes the model on the accepted responses. Under this framework, the true gradient of the ELBO is:

\[\nabla \mathcal{J}_{Q^t}(\theta) = -\sum_{i=1}^{m} \mathbb{E}_{y \sim Q_i^t} \nabla \ln \mathbb{P}(y, z_i \vert x_i, \theta)\]

which is approximated via rejection sampling. The unbiased estimator (Lemma 1 in the paper) is:

\[-\sum_{i=1}^{m} \frac{1}{n_i p_i} \sum_{y_j \in \mathcal{D}_i} \nabla \ln \mathbb{P}(y_j, z_i \vert x_i, \theta)\]

where \(p_i\) is the acceptance rate (essentially the pass rate) and \(\mathcal{D}_i\) is the set of accepted samples for prompt \(x_i\). The crucial observation is the \(1/(n_i p_i)\) factor: for difficult prompts with low \(p_i\), both the number of accepted samples and the acceptance rate are small, so the estimator has extremely high variance.

They bound the total gradient variance and minimize it subject to a budget constraint \(\sum n_i = N\), obtaining the allocation:

\[n_i \propto \frac{G_i}{\sqrt{p_i + \alpha / (p_i)^{\beta - 1}}}\]

where \(G_i = \mathbb{E}_{y \sim Q_i} \|\nabla \ln \mathbb{P}(y, z_i \vert x_i, \theta)\|\) is the Lipschitz coefficient (expected gradient norm) for prompt \(x_i\), and \((\alpha, \beta)\) are regularization parameters that prevent excessive sampling on near-impossible prompts.

This formula reveals something REINFORCE-ADA’s allocation does not capture: the gradient norm \(G_i\) matters, not just the pass rate. Two prompts with the same pass rate \(p_i\) but different gradient norms should receive different budgets — the one whose accepted samples produce larger gradients should be sampled more, because it contributes more variance to the overall estimator. In practice, \(G_i\) is estimated via a pre-sampling stage: generate \(N'\) responses per prompt, compute the gradient norms of accepted responses, and use these to set the budget for the main sampling stage.
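
A sketch of the resulting allocation rule, normalized to a fixed total budget; the pass rates, gradient norms, and \((\alpha, \beta)\) values here are hypothetical:

```python
import math

def gvm_allocate(p, G, N, alpha=0.1, beta=2.0):
    """Normalize n_i proportional to G_i / sqrt(p_i + alpha / p_i**(beta-1))
    to a fixed total budget N; p and G come from a pre-sampling stage."""
    raw = [g / math.sqrt(pi + alpha / pi ** (beta - 1)) for pi, g in zip(p, G)]
    scale = N / sum(raw)
    return [max(1, round(r * scale)) for r in raw]

# Equal pass rates but different gradient norms get different budgets,
# and the (alpha, beta) term damps the near-impossible last prompt.
print(gvm_allocate(p=[0.5, 0.5, 0.05, 0.001], G=[1.0, 3.0, 1.0, 1.0], N=64))
```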

Comparing the Two Approaches

Both papers solve the same core problem — non-uniform prompt difficulty makes uniform sampling inefficient — but their theoretical starting points lead to different allocation formulas and different algorithmic structures.

| | REINFORCE-ADA | GVM-RAFT |
|---|---|---|
| Framework | Non-linear RL objective \(\mathbb{E}[f(p_\theta(x))]\) | EM / ELBO gradient variance minimization |
| Allocation | \(n_i \propto 1/\sqrt{p_i}\) (pass rate only) | \(n_i \propto G_i / \sqrt{p_i}\) (pass rate \(\times\) gradient norm) |
| Extra factor | None | Lipschitz coefficient \(G_i\) |
| Algorithm | Online (replaces GRPO sampling directly) | Two-stage: estimate \(p_i, G_i\), then allocate |
| Overhead | Ada-Seq: implicit, no extra forward passes | Requires \(N'\) pre-sampling forward passes |
| Regularization | Ada-Seq-balance: require both correct and incorrect | \((\alpha, \beta)\) penalty on very low \(p_i\) |
| Convergence | Empirical (up to 2x speedup) | Theorems 1-2: decreasing rate under smoothness |

The connection between them is clearest in the variance-optimal allocation. REINFORCE-ADA derives \(n_i^* \propto \sqrt{(1-p_i)/p_i}\) for the log-objective. GVM-RAFT derives \(n_i^* \propto G_i/\sqrt{p_i}\) for the ELBO objective. When \(G_i\) is constant across prompts, the two formulas have the same qualitative shape — both allocate more budget to harder prompts. The divergence appears when \(G_i\) varies: GVM-RAFT argues that gradient norms are not uniform and should be estimated, while REINFORCE-ADA’s approach implicitly assumes they are absorbed into the objective’s weighting.

From a practical standpoint, both achieve significant gains over uniform sampling. REINFORCE-ADA’s Ada-Seq variant is simpler — it requires no explicit estimation and naturally adapts through sequential stopping. GVM-RAFT is more principled in its variance accounting but requires an estimation stage. Both generalize beyond their original settings: REINFORCE-ADA extends from GRPO to other RL algorithms, and GVM-RAFT extends from RAFT to GRPO (referred to as GVM-GRPO), achieving comparable improvements in both cases.
