RL on Language in Single-Step Settings

The importance sampling story changes in interesting ways when the “actions” are natural language sequences. In classical RL, an action is a discrete choice (move left, move right) or a continuous vector (torque on a joint). In language model RL, an action is an entire token sequence — a sentence, a paragraph, or a multi-step chain of thought. This shift in the structure of the action space has deep consequences for how IS ratios behave.

Modeling Language Generation with RL

What is a Contextual Bandit?

Before examining how IS ratios behave in language model RL, it helps to understand the contextual bandit — the framework that most current language model RL implicitly operates in. A contextual bandit is a tuple \((\mathcal{X}, P, \mathcal{A}, r)\):

  • A context space \(\mathcal{X}\) is the set of all possible situations the agent might face. Each context \(x \in \mathcal{X}\) is a complete description of the current situation — it carries all the information the agent needs to make a decision. Contexts are drawn i.i.d. from a fixed distribution \(P(x)\) that the agent cannot influence.
  • An action space \(\mathcal{A}\) is the set of choices available to the agent. Given a context \(x\), the agent selects one action \(a \in \mathcal{A}\). The action space is the same for every context (though a policy may assign zero probability to certain actions in certain contexts).
  • A reward function \(r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}\) maps each context-action pair to a scalar reward. The reward may be deterministic or stochastic — in the stochastic case, \(r(x, a)\) denotes the expected reward, and the agent observes a noisy realization.
  • A policy \(\pi(a \vert x)\) is a conditional distribution over actions given a context. It encodes the agent’s strategy: for each context \(x\), \(\pi(\cdot \vert x)\) is a probability distribution over \(\mathcal{A}\).

At each round, nature draws a context \(x \sim P(x)\), the agent observes \(x\), selects an action \(a \sim \pi(a \vert x)\), and receives reward \(r(x, a)\). Then the round ends. The next context \(x'\) is drawn fresh from \(P(x)\) — it does not depend on the previous action or context. The agent’s goal is to find a policy that maximizes expected reward:

\[\max_\pi \; \mathbb{E}_{x \sim P, \, a \sim \pi(\cdot \vert x)}\!\left[r(x, a)\right]\]

The key structural property is that there are no state transitions. The context distribution \(P(x)\) is fixed and does not depend on the policy. This is what distinguishes a contextual bandit from a full MDP: in an MDP, the agent’s actions influence what states it sees next, creating a feedback loop between the policy and the state distribution. In a bandit, each round is statistically independent — the agent’s choice of action today has no effect on what context it will see tomorrow.
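
To make this loop concrete, here is a minimal sketch in Python. The context set, the reward function, and the uniform policy are toy stand-ins invented for illustration, not part of any particular library.

```python
import random

random.seed(0)
CONTEXTS = ["ctx_a", "ctx_b", "ctx_c"]  # toy context space X
ACTIONS = [0, 1, 2]                     # action space A, shared across contexts

def reward(x, a):
    # Toy deterministic reward: each context prefers one particular action.
    return 1.0 if ACTIONS[hash(x) % len(ACTIONS)] == a else 0.0

def policy(x):
    # A uniform policy pi(. | x); a learned policy would actually use x.
    return random.choice(ACTIONS)

total = 0.0
for _ in range(1000):
    x = random.choice(CONTEXTS)  # nature draws x ~ P(x), independent of the policy
    a = policy(x)                # agent draws a ~ pi(a | x)
    total += reward(x, a)        # the round ends; nothing carries over
print("average reward:", total / 1000)
```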

Figure 1: A contextual bandit. Each round: environment presents context x, agent picks action a, receives reward r(x, a). No state carries over between rounds.

This independence has a profound consequence for importance sampling. Recall from the post on importance sampling that the surrogate objective in policy gradient methods silently substitutes the old policy’s state distribution \(d^{\pi_{\text{old}}}\) for the current policy’s \(d^{\pi_\theta}\), introducing an approximation error controlled by the distribution mismatch coefficient. In a contextual bandit, this problem vanishes entirely: the context distribution \(P(x)\) is the same regardless of which policy is being used, so there is no state distribution mismatch to worry about. The IS ratio \(\frac{\pi_\theta(a \vert x)}{\pi_{\text{old}}(a \vert x)}\) corrects for the action mismatch, and that is the only correction needed.

Language Generation as a Bandit

Most current language model RL — RLHF for chat models, reward-based fine-tuning for math and code — fits naturally into the contextual bandit framework. The mapping is:

| Contextual Bandit | Language Model RL |
|---|---|
| Context \(x\) | Prompt |
| Action \(a\) | Full response \(y\) |
| Policy \(\pi(a \vert x)\) | Language model \(\pi_\theta(y \vert x)\) |
| Reward \(r(x, a)\) | Reward model \(r(x, y)\) |
| Context distribution \(P(x)\) | Prompt dataset \(\mathcal{D}\) |

The model generates a complete response in one shot, receives a scalar reward, and the episode ends. The next prompt is drawn independently from the dataset — it does not depend on what the model generated previously.
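
Under this mapping, a rollout loop is just a sequence of independent draws. A schematic sketch, where `generate` and `reward_model` are hypothetical placeholders for a real language model and scorer:

```python
import random

prompts = ["What is 2+2?", "Factor x^2 - 1.", "Sum 1..100?"]  # prompt dataset D

def generate(prompt):
    # Placeholder for sampling y ~ pi_theta(. | x) from a real language model.
    return f"<response to: {prompt}>"

def reward_model(prompt, response):
    # Placeholder for r(x, y): a learned scorer or an exact-match verifier.
    return random.random()

batch = []
for _ in range(8):
    x = random.choice(prompts)  # each prompt is drawn fresh from D; it does not
    y = generate(x)             # depend on anything generated earlier
    r = reward_model(x, y)
    batch.append((x, y, r))     # one (context, action, reward) triple per episode
```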

Figure 2: Language generation as a contextual bandit. The prompt is the context, the full response is the action, and a reward model scores the output.

This is why PPO and related methods work as well as they do for single-turn tasks like math problem solving or code generation. The problem has the structure of a bandit: given a math problem (prompt), produce a solution (response), get a reward (correct or not). The IS ratio \(\frac{\pi_\theta(y \vert x)}{\pi_{\text{old}}(y \vert x)}\) corrects for the action mismatch, and there is no state mismatch to worry about. The surrogate objective is exact up to the action-level IS correction — no hidden approximation, no distribution mismatch coefficient.

The picture changes for multi-turn tasks — dialogue, tool use, agentic workflows — where the model’s output at step \(t\) affects what input it sees at step \(t+1\). In those settings, the full sequential RL formulation returns, and with it all the challenges of state distribution mismatch and trajectory-level IS ratios. But the majority of current language model RL operates in the bandit regime, which is one reason it has been so successful despite using relatively simple algorithms.

Language Generation as a Token-Level MDP

There is an alternative way to model language generation: treat each token as an action in a sequential decision process. Instead of viewing the entire response as a single monolithic action, we model generation as a multi-step MDP where the agent makes one decision per time step:

| MDP Component | Token-Level Language Generation |
|---|---|
| State \(s_t\) | Prompt \(x\) concatenated with tokens generated so far: \(s_t = (x, y_1, \ldots, y_{t-1})\) |
| Action \(a_t\) | Next token \(y_t \in \mathcal{V}\) |
| Transition \(T(s_{t+1} \vert s_t, a_t)\) | Deterministic: append \(y_t\) to get \(s_{t+1} = (x, y_1, \ldots, y_t)\) |
| Policy \(\pi(a_t \vert s_t)\) | Next-token distribution \(\pi_\theta(y_t \vert x, y_{<t})\) |
| Reward \(r(s_t, a_t)\) | Zero at all intermediate steps; \(r(x, y)\) at the final token |

Figure 3: Token-level MDP view of autoregressive generation. Each token is an action, the prefix is the state, and the sequence probability decomposes as a product of per-token probabilities. Reward is sparse — assigned only at the terminal state.

Under this formulation, the state at time \(t\) is the entire prefix — the prompt plus all tokens generated so far. The action is a single token drawn from the vocabulary \(\mathcal{V}\). The transition function is deterministic and trivial: the next state is just the current state with the new token appended. The reward is sparse: the agent receives nothing until the response is complete, at which point a reward model scores the full sequence.

This is a legitimate MDP — each action (token) changes the state (prefix), and the state determines what actions are available and how future states evolve. But it is an unusual one. The transition function is deterministic, so all stochasticity comes from the policy. The state space grows with each step, and the horizon \(T\) (response length) varies across episodes.
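
A minimal sketch of a token-level rollout, with a toy vocabulary and stand-in policy and reward functions, makes this sparse-reward structure explicit:

```python
import random

random.seed(0)
VOCAB = ["The", "answer", "is", "4", ".", "<eos>"]  # toy vocabulary V

def next_token(state):
    # Stand-in for the next-token distribution pi_theta(y_t | x, y_<t).
    return random.choice(VOCAB)

def terminal_reward(state):
    # Stand-in for the sequence-level reward model r(x, y).
    return 1.0 if "4" in state else 0.0

def rollout(prompt, max_len=20):
    state = [prompt]                      # s_1 = (x,): the prompt alone
    rewards = []
    for t in range(max_len):
        token = next_token(state)         # action a_t: a single token
        state = state + [token]           # deterministic transition: append a_t
        done = token == "<eos>" or t == max_len - 1
        rewards.append(terminal_reward(state) if done else 0.0)  # sparse reward
        if done:
            break
    return state, rewards

trajectory, rewards = rollout("What is 2+2?")
```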

Bandit vs Token-Level MDP: What Changes?

The two formulations — bandit and token-level MDP — are mathematically equivalent. The bandit’s single “action” \(y\) is the MDP’s entire trajectory \((y_1, \ldots, y_T)\). The bandit’s policy \(\pi_\theta(y \vert x)\) equals the MDP’s trajectory probability \(\prod_{t=1}^{T} \pi_\theta(y_t \vert x, y_{<t})\). The bandit’s reward \(r(x, y)\) is the MDP’s cumulative return (which is just the terminal reward, since intermediate rewards are zero). Any optimization algorithm that works on one formulation can be translated to the other. But the two views lead to very different algorithmic choices.

IS ratios. Under the bandit view, there is a single IS ratio per response: \(\frac{\pi_\theta(y \vert x)}{\pi_{\text{old}}(y \vert x)}\). Under the MDP view, this ratio decomposes into a product of \(T\) per-token ratios: \(\prod_{t=1}^{T} \frac{\pi_\theta(y_t \vert x, y_{<t})}{\pi_{\text{old}}(y_t \vert x, y_{<t})}\). The bandit ratio is a single number that can be computed and clipped directly. The MDP product is a chain of \(T\) factors that can compound — even if each factor is close to 1, their product can explode or collapse for long sequences. This is the compounding problem from the post on importance sampling, reappearing in the token setting.
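
The equivalence of the two views, and the drift of the product, is easy to check numerically. A small sketch with made-up per-token log-probabilities:

```python
import math
import random

random.seed(0)
T = 500
# Made-up per-token log-probs of the same sampled tokens under two policies.
logp_new = [random.gauss(-2.0, 0.05) for _ in range(T)]
logp_old = [random.gauss(-2.0, 0.05) for _ in range(T)]

# Bandit view: one sequence-level ratio, computed stably in log space.
seq_ratio = math.exp(sum(logp_new) - sum(logp_old))

# MDP view: the same number as a product of T per-token ratios.
per_token = [math.exp(a - b) for a, b in zip(logp_new, logp_old)]
print(seq_ratio, math.prod(per_token))   # identical up to float error
print(min(per_token), max(per_token))    # every factor is close to 1...
# ...yet the 500-factor product can sit orders of magnitude away from 1.
```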

Credit assignment. The bandit formulation assigns the same scalar reward \(r(x, y)\) to the entire response. Every token receives equal credit, whether it is a crucial reasoning step or a filler word. The MDP formulation opens the door to token-level credit assignment: we can define a value function \(V(s_t)\) at each prefix \(s_t = (x, y_{<t})\) and an advantage function

\[A(s_t, y_t) = Q(s_t, y_t) - V(s_t)\]

that measures how much better token \(y_t\) is compared to the expected continuation from state \(s_t\). This allows the policy gradient to upweight tokens that contributed positively to the reward and downweight tokens that did not — a much finer-grained signal than applying the same reward to all tokens.

State distribution mismatch. Under the bandit view, the context distribution \(P(x)\) is fixed and policy-independent, so there is no state distribution mismatch. Under the MDP view, the state at time \(t\) is \(s_t = (x, y_1, \ldots, y_{t-1})\), which depends on the policy that generated the prefix. If we use data collected under \(\pi_{\text{old}}\) to update \(\pi_\theta\), the distribution over prefixes will differ — the hidden approximation from the importance sampling post returns. However, because the transitions are deterministic, this mismatch is entirely driven by the policy: two policies see different states only because they generate different token sequences. This is milder than the mismatch in a stochastic-transition MDP (like robotics), but it is not zero.

KL divergence. Under the bandit view, the KL between \(\pi_\theta\) and \(\pi_{\text{ref}}\) is a single number per prompt — the divergence between two distributions over full responses. Under the MDP view, this KL decomposes as a sum of per-token KLs:

\[\text{KL}\!\left[\pi_\theta(y \vert x) \,\|\, \pi_{\text{ref}}(y \vert x)\right] = \mathbb{E}_{y \sim \pi_\theta}\!\left[\sum_{t=1}^{T} \log \frac{\pi_\theta(y_t \vert x, y_{<t})}{\pi_{\text{ref}}(y_t \vert x, y_{<t})}\right]\]

This additive structure means the sequence-level KL grows linearly with response length \(T\): a per-token KL of \(\epsilon\) accumulates to \(T\epsilon\). The MDP view makes this length dependence explicit, which is important for understanding why KL regularization strength may need to be adjusted for tasks that elicit longer responses.
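
A short sketch of this decomposition: compute one exact per-token KL between two invented next-token distributions, then scale by length to see the linear growth:

```python
import math

def kl_categorical(p, q):
    # Exact KL between two next-token distributions over a small vocabulary.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented next-token distributions over a 3-token vocabulary.
per_token_kl = kl_categorical([0.70, 0.20, 0.10], [0.60, 0.25, 0.15])

# If every position had roughly this per-token KL, the sequence-level KL
# would be per_token_kl * T: the additive structure is linear in length.
for T in (10, 100, 500):
    print(T, per_token_kl * T)
```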

Practical algorithms. Most current methods — PPO-based RLHF, GRPO, REINFORCE-style approaches — implicitly use a hybrid. They collect data at the response level (bandit), but compute per-token log-probabilities and apply token-level KL penalties (MDP). The policy gradient is typically computed with a single advantage per response (bandit-style), though some methods like token-level PPO estimate per-token advantages using a learned value function. The choice of which view to adopt at each stage of the algorithm is a design decision with real consequences for variance, credit assignment, and computational cost.
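
A sketch of this hybrid loss, assuming the common PPO-style clipped surrogate with per-token ratios and a single response-level advantage; the log-probabilities are placeholders:

```python
import math

def clipped_token_loss(logp_new, logp_old, advantage, eps=0.2):
    """Hybrid surrogate: per-token IS ratios (MDP side) weighted by one
    response-level advantage (bandit side), with PPO-style clipping."""
    losses = []
    for lp_new, lp_old in zip(logp_new, logp_old):
        ratio = math.exp(lp_new - lp_old)            # per-token IS ratio
        clipped = max(min(ratio, 1 + eps), 1 - eps)  # clip the ratio
        losses.append(-min(ratio * advantage, clipped * advantage))
    return sum(losses) / len(losses)                 # mean over tokens

# Toy usage: a three-token response with a positive response-level advantage.
loss = clipped_token_loss([-1.9, -2.1, -2.0], [-2.0, -2.0, -2.0], advantage=0.5)
```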

RLHF

IS Ratios in Language Model Alignment

The IS ratio from the post on importance sampling appears directly in Reinforcement Learning from Human Feedback (RLHF), where a language model \(\pi_\theta\) is fine-tuned to maximize a learned reward model \(r(x, y)\) while staying close to a reference model \(\pi_{\text{ref}}\). The standard RLHF objective is:

\[\max_\theta \; \mathbb{E}_{x \sim \mathcal{D}, \, y \sim \pi_\theta(\cdot \vert x)}\!\left[r(x, y)\right] - \beta \, \text{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\]

The KL divergence can be written as an expectation of the log IS ratio:

\[\text{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)}\right]\]

The ratio \(\frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)}\) is exactly an importance sampling ratio — it measures how much the fine-tuned model’s distribution has shifted from the reference. The KL penalty pushes this ratio toward 1, keeping the fine-tuned model close to the reference. This is directly analogous to keeping the proposal \(q\) close to the target \(p\) in importance sampling: when the ratio deviates too far, the estimate (or in this case, the policy update) becomes unreliable.

In practice, RLHF implementations (e.g., PPO-based RLHF) use the same clipped IS ratio from PPO for the policy optimization step, combined with the KL penalty to prevent the model from drifting too far from \(\pi_{\text{ref}}\). Some recent methods like DPO (Direct Preference Optimization) eliminate the explicit RL loop entirely by reparameterizing the reward in terms of the log ratio \(\log \frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)}\), making the IS ratio the central object of the optimization itself.
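
A minimal sketch of the DPO loss under its usual formulation, with hypothetical sequence log-probabilities; `beta` plays a role analogous to the KL coefficient:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward of each response is beta * log(pi_theta / pi_ref);
    # the loss is -log sigmoid of the reward margin between the pair.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical sequence log-probs for a preferred / rejected response pair.
loss = dpo_loss(logp_chosen=-20.0, logp_rejected=-22.0,
                ref_chosen=-21.0, ref_rejected=-21.0)
```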

Token-Level Decomposition

A language model generates a response \(y = (y_1, y_2, \ldots, y_T)\) autoregressively, so the probability of the full sequence factorizes as:

\[\pi_\theta(y \vert x) = \prod_{t=1}^{T} \pi_\theta(y_t \vert x, y_{<t})\]

The IS ratio between the current policy \(\pi_\theta\) and the reference \(\pi_{\text{ref}}\) therefore decomposes into a product of per-token ratios:

\[\frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)} = \prod_{t=1}^{T} \frac{\pi_\theta(y_t \vert x, y_{<t})}{\pi_{\text{ref}}(y_t \vert x, y_{<t})}\]

This is precisely the trajectory-level product from the importance sampling post, reappearing in a new guise. Each token generation is a “time step”, and the full response is a “trajectory”. The compounding problem applies directly: as responses grow longer, the product of per-token ratios can explode or collapse, making long-response IS estimates unreliable. A 500-token response involves a product of 500 ratios — even if each individual ratio is close to 1, their product can deviate enormously.
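
A quick numerical check of this claim: if each of 500 per-token ratios sits at a fixed small deviation from 1, the sequence-level ratio lands orders of magnitude away from 1.

```python
T = 500
for r in (0.99, 1.00, 1.01, 1.02):
    # A fixed per-token ratio r compounds to r**T over the response.
    print(r, r ** T)   # 0.99 -> ~0.007, 1.01 -> ~145, 1.02 -> ~20000
```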

Why KL Regularization is Not Optional

This compounding structure explains why the KL penalty in RLHF is not merely a stylistic choice but a statistical necessity. Without it, the policy \(\pi_\theta\) can drift far from \(\pi_{\text{ref}}\) in sequence space even when each per-token distribution shifts modestly. A per-token KL of \(\epsilon\) accumulates to \(T\epsilon\) over a full response, so the sequence-level KL grows linearly with response length. The KL penalty in the RLHF objective:

\[\max_\theta \; \mathbb{E}_{x, y \sim \pi_\theta}\!\left[r(x, y)\right] - \beta \, \text{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\]

directly controls this accumulation. By penalizing the sum of per-token log ratios, it prevents any individual token distribution from shifting too far and prevents the aggregate shift from compounding out of control. The hyperparameter \(\beta\) trades off reward maximization against distributional stability — too small and the policy drifts into regions where the IS ratios (and therefore the policy gradient estimates) become unreliable; too large and the policy cannot learn anything beyond the reference.
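
In PPO-based RLHF codebases the penalty is commonly folded into the per-token reward stream. A sketch of that shaping, assuming the simple sample-based log-ratio estimator of the per-token KL (conventions vary across implementations):

```python
def shaped_rewards(logp_theta, logp_ref, terminal_reward, beta=0.05):
    """Fold a per-token KL penalty into the sparse reward stream.
    Inputs are per-token log-probs of the sampled tokens under each model."""
    rewards = []
    T = len(logp_theta)
    for t in range(T):
        kl_hat = logp_theta[t] - logp_ref[t]  # sample-based per-token log ratio
        r_t = -beta * kl_hat                  # penalize drift at every position
        if t == T - 1:
            r_t += terminal_reward            # sequence-level reward at the end
        rewards.append(r_t)
    return rewards

rs = shaped_rewards([-1.8, -2.2, -2.0], [-2.0, -2.0, -2.0], terminal_reward=1.0)
```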

Response-Level vs Token-Level Credit Assignment

A deeper challenge is credit assignment. In standard RLHF, the reward \(r(x, y)\) is assigned to the entire response \(y\). This is a single scalar for a sequence that may be hundreds of tokens long. From an IS perspective, we are using a product of \(T\) per-token ratios to reweight a single reward signal — the worst case of the compounding problem. Every token in the response receives the same reward, even though some tokens may be crucial (the key reasoning step) while others are incidental (filler phrases, formatting).

This is analogous to the trajectory-level vs per-step IS distinction from the importance sampling post. There, we showed that per-step IS reduces variance by breaking the trajectory reward into per-step rewards and applying only the relevant IS ratios. The same idea applies to language: if we could assign credit at the token level — identifying which tokens contributed to the reward — we could avoid the full product of ratios and dramatically reduce variance.

Recent work explores exactly this direction. Process reward models assign rewards to intermediate reasoning steps rather than only to the final answer, enabling step-level credit assignment analogous to per-step IS. Token-level KL penalties regularize each token’s distribution individually rather than penalizing only the aggregate. These approaches are, at their core, applications of the per-step IS principle to the language setting: decompose the monolithic sequence-level problem into manageable token-level or step-level subproblems, and apply IS corrections only where they are needed.

The Action Space Explosion

There is one final way in which language RL differs from classical RL. In a typical control task, the action space at each step might have \(\vert\mathcal{A}\vert = 4\) (four directions) or be a low-dimensional continuous space. In language, each token is drawn from a vocabulary of tens of thousands of tokens, and the response is a sequence of these choices. The effective action space is \(\vert\mathcal{V}\vert^T\) — astronomically large. This means that two language model policies, even if they are “similar” by most measures, will assign probability mass to mostly non-overlapping sets of full responses. The overlap between \(\pi_\theta\) and \(\pi_{\text{ref}}\) in sequence space can be vanishingly small even when the per-token distributions are close.

This makes off-policy evaluation in language RL fundamentally harder than in classical RL. In a grid world, the old policy and new policy might both assign significant probability to the same trajectories, making IS reweighting effective. In language, a small change in the policy can redirect probability mass to entirely different responses, leaving the old policy’s samples uninformative about the new policy’s behavior. This is why most successful language model RL methods — PPO-based RLHF, GRPO, and their variants — operate in a nearly on-policy regime, collecting fresh samples from the current policy at each step rather than attempting to reuse old data. The IS ratios in these methods serve primarily as a local correction (keeping \(\theta\) close to \(\theta_{\text{old}}\) within a single update) rather than as a tool for genuine off-policy learning across many updates.

Signal Loss and Adaptive Sampling

The bandit formulation of language model RL — sample \(n\) responses per prompt, compute advantages, update — has a subtle failure mode. When the model’s pass rate \(p_i\) on a prompt \(x_i\) is very low, a small sample group is likely to contain all incorrect responses. When all responses have the same reward, the advantage

\[A^{\text{GRPO}}(x, y_j) = \frac{r_j - \bar{r}}{\sigma_r + \epsilon}\]

collapses to zero because \(\sigma_r = 0\). The gradient vanishes — the model receives no learning signal from that prompt. Symmetrically, when \(p_i\) is very high, all responses may be correct, and the advantage again collapses. With pass rate \(p = 0.1\) and group size \(n = 8\), the probability that all samples are incorrect is \(0.9^8 \approx 43\%\). Nearly half the time, the model learns nothing from difficult prompts.

This is signal loss: uniform sampling wastes inference budget on prompts that are either too easy (the model already solves them reliably) or too hard (the small sample size fails to capture any successes). The learning signal concentrates on prompts of intermediate difficulty where the group happens to contain a mix of correct and incorrect responses. Two recent papers address this problem from different theoretical perspectives.
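
The arithmetic behind this failure mode fits in a few lines. A sketch computing the probability that a group carries no signal, meaning it is all-correct or all-incorrect:

```python
def p_no_signal(p, n):
    # Probability that a group of n i.i.d. samples is all-incorrect or
    # all-correct, so sigma_r = 0 and the GRPO advantage (gradient) vanishes.
    return (1 - p) ** n + p ** n

for p in (0.05, 0.1, 0.5, 0.9):
    print(p, round(p_no_signal(p, n=8), 3))   # p=0.1 gives ~0.43
```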

Non-linear Objectives and Adaptive Budget Allocation

Xiong et al. (2025) argue that signal loss is an artifact of optimizing the wrong objective. The standard RL objective for reasoning is:

\[J(\theta) = \mathbb{E}_{x}[p_\theta(x)]\]

where \(p_\theta(x)\) is the pass rate on prompt \(x\). Under this linear objective, all prompts contribute equally to the gradient regardless of difficulty. They propose optimizing a non-linear objective instead:

\[J_f(\theta) = \mathbb{E}_{x}[f(p_\theta(x))]\]

where \(f\) is a concave function. Taking \(f = \log\) and differentiating:

\[\nabla J_f(\theta) = \mathbb{E}_{x}\!\left[\frac{1}{p_\theta(x)} \cdot \nabla p_\theta(x)\right]\]

The weight \(w(x) = 1/p_\theta(x)\) naturally upweights difficult prompts. A prompt with pass rate \(0.01\) receives 100 times the weight of a prompt with pass rate \(1.0\). The key insight is that this reweighting can be realized through adaptive sampling: instead of applying the weight \(1/p_i\) to the gradient (which is unstable when \(\hat{p}_i = 0\)), allocate more inference budget to harder prompts so that the sampling itself implements the reweighting.

Their algorithm, REINFORCE-ADA, has two realizations. The first (Ada-Est) estimates per-prompt pass rates using a value network or an exponential moving average, then allocates budgets \(n_i \propto 1/\sqrt{\hat{p}_i}\). The second (Ada-Seq) avoids explicit estimation entirely: it samples sequentially for each prompt until it collects \(K\) correct responses, then stops. This sequential stopping naturally achieves \(\mathbb{E}[N_i] = K/p_i\) — harder prompts automatically receive more samples. A balanced variant (Ada-Seq-balance) requires both \(K\) correct and \(K\) incorrect responses before stopping, preventing signal loss at both ends of the difficulty spectrum.
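
A sketch of the sequential-stopping idea for the balanced variant, where `pass_rate` stands in for actually sampling and verifying a response:

```python
import random

def ada_seq_balance(pass_rate, K=2, max_draws=256):
    """Sequential stopping sketch (balanced variant): keep sampling until
    both K correct and K incorrect responses have been collected."""
    correct = incorrect = draws = 0
    while (correct < K or incorrect < K) and draws < max_draws:
        draws += 1
        if random.random() < pass_rate:  # stand-in for "sample and verify"
            correct += 1
        else:
            incorrect += 1
    return draws

random.seed(0)
for p in (0.05, 0.2, 0.8):
    avg = sum(ada_seq_balance(p) for _ in range(2000)) / 2000
    print(p, round(avg, 1))  # both very hard and very easy prompts draw more
```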

The variance analysis connects back to IS. For the log-objective under REINFORCE with baseline, the optimal budget allocation that minimizes gradient variance is:

\[n_i^* \propto \sqrt{\frac{1-p_i}{p_i}}\]

This is a variance-reduction result: allocating more samples to hard prompts reduces the variance of the gradient estimator under a fixed total budget \(N\).

Gradient Variance Minimization in the EM Framework

Yao et al. (2025) approach the same problem from the EM (Expectation-Maximization) perspective. Chain-of-thought reasoning is modeled as a latent variable problem: given prompt \(x\), the model generates a latent rationale \(y\) and a predicted answer \(z\). The training objective is the negative log-likelihood:

\[\mathcal{L}(\theta) = -\mathbb{E}_{x}\!\left[\ln \sum_{y \in \mathcal{Y}} \mathbb{P}(y \vert x, \theta) \mathbb{P}(z \vert y, \theta)\right]\]

RAFT (Reward-Ranked Fine-Tuning) implements the EM algorithm: the E-step uses rejection sampling (generate \(n\) responses, keep those with correct answers) to approximate the posterior \(Q_i(y)\), and the M-step fine-tunes the model on the accepted responses. Under this framework, the true gradient of the ELBO is:

\[\nabla \mathcal{J}_{Q^t}(\theta) = -\sum_{i=1}^{m} \mathbb{E}_{y \sim Q_i^t} \nabla \ln \mathbb{P}(y, z_i \vert x_i, \theta)\]

which is approximated via rejection sampling. The unbiased estimator (Lemma 1 in the paper) is:

\[-\sum_{i=1}^{m} \frac{1}{n_i p_i} \sum_{y_j \in \mathcal{D}_i} \nabla \ln \mathbb{P}(y_j, z_i \vert x_i, \theta)\]

where \(p_i\) is the acceptance rate (essentially the pass rate) and \(\mathcal{D}_i\) is the set of accepted samples for prompt \(x_i\). The crucial observation is the \(1/(n_i p_i)\) factor: for difficult prompts with low \(p_i\), both the number of accepted samples and the acceptance rate are small, so the estimator has extremely high variance.

They bound the total gradient variance and minimize it subject to a budget constraint \(\sum n_i = N\), obtaining the allocation:

\[n_i \propto \frac{G_i}{\sqrt{p_i + \alpha / (p_i)^{\beta - 1}}}\]

where \(G_i = \mathbb{E}_{y \sim Q_i} \|\nabla \ln \mathbb{P}(y, z_i \vert x_i, \theta)\|\) is the Lipschitz coefficient (expected gradient norm) for prompt \(x_i\), and \((\alpha, \beta)\) are regularization parameters that prevent excessive sampling on near-impossible prompts.

This formula reveals something REINFORCE-ADA’s allocation does not capture: the gradient norm \(G_i\) matters, not just the pass rate. Two prompts with the same pass rate \(p_i\) but different gradient norms should receive different budgets — the one whose accepted samples produce larger gradients should be sampled more, because it contributes more variance to the overall estimator. In practice, \(G_i\) is estimated via a pre-sampling stage: generate \(N'\) responses per prompt, compute the gradient norms of accepted responses, and use these to set the budget for the main sampling stage.
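
A sketch of the resulting allocation rule, normalized to a fixed total budget; the pass rates, gradient norms, and \((\alpha, \beta)\) values here are hypothetical:

```python
import math

def gvm_allocate(p, G, N, alpha=0.1, beta=2.0):
    """Normalize n_i proportional to G_i / sqrt(p_i + alpha / p_i**(beta-1))
    to a fixed total budget N; p and G come from a pre-sampling stage."""
    raw = [g / math.sqrt(pi + alpha / pi ** (beta - 1)) for pi, g in zip(p, G)]
    scale = N / sum(raw)
    return [max(1, round(r * scale)) for r in raw]

# Equal pass rates but different gradient norms get different budgets,
# and the (alpha, beta) term damps the near-impossible last prompt.
print(gvm_allocate(p=[0.5, 0.5, 0.05, 0.001], G=[1.0, 3.0, 1.0, 1.0], N=64))
```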

Comparing the Two Approaches

Both papers solve the same core problem — non-uniform prompt difficulty makes uniform sampling inefficient — but their theoretical starting points lead to different allocation formulas and different algorithmic structures.

| | REINFORCE-ADA | GVM-RAFT |
|---|---|---|
| Framework | Non-linear RL objective \(\mathbb{E}[f(p_\theta(x))]\) | EM / ELBO gradient variance minimization |
| Allocation | \(n_i \propto 1/\sqrt{p_i}\) (pass rate only) | \(n_i \propto G_i / \sqrt{p_i}\) (pass rate \(\times\) gradient norm) |
| Extra factor | None | Lipschitz coefficient \(G_i\) |
| Algorithm | Online (replaces GRPO sampling directly) | Two-stage: estimate \(p_i, G_i\), then allocate |
| Overhead | Ada-Seq: implicit, no extra forward passes | Requires \(N'\) pre-sampling forward passes |
| Regularization | Ada-Seq-balance: require both correct and incorrect | \((\alpha, \beta)\) penalty on very low \(p_i\) |
| Convergence | Empirical (up to 2x speedup) | Theorems 1-2: decreasing rate under smoothness |

The connection between them is clearest in the variance-optimal allocation. REINFORCE-ADA derives \(n_i^* \propto \sqrt{(1-p_i)/p_i}\) for the log-objective. GVM-RAFT derives \(n_i^* \propto G_i/\sqrt{p_i}\) for the ELBO objective. When \(G_i\) is constant across prompts, the two formulas have the same qualitative shape — both allocate more budget to harder prompts. The divergence appears when \(G_i\) varies: GVM-RAFT argues that gradient norms are not uniform and should be estimated, while REINFORCE-ADA’s approach implicitly assumes they are absorbed into the objective’s weighting.

From a practical standpoint, both achieve significant gains over uniform sampling. REINFORCE-ADA’s Ada-Seq variant is simpler — it requires no explicit estimation and naturally adapts through sequential stopping. GVM-RAFT is more principled in its variance accounting but requires an estimation stage. Both generalize beyond their original settings: REINFORCE-ADA extends from GRPO to other RL algorithms, and GVM-RAFT extends from RAFT to GRPO (referred to as GVM-GRPO), achieving comparable improvements in both cases.
