RL on Language under Single-step Settings
The importance sampling story changes in interesting ways when the “actions” are natural language sequences. In classical RL, an action is a discrete choice (move left, move right) or a continuous vector (torque on a joint). In language model RL, an action is an entire token sequence — a sentence, a paragraph, or a multi-step chain of thought. This shift in the structure of the action space has deep consequences for how IS ratios behave.
Modeling Language Generation with RL
What is a Contextual Bandit?
Before examining how IS ratios behave in language model RL, it helps to understand the contextual bandit — the framework that most current language model RL implicitly operates in. A contextual bandit is a tuple \((\mathcal{X}, P, \mathcal{A}, r)\):
- A context space \(\mathcal{X}\) is the set of all possible situations the agent might face. Each context \(x \in \mathcal{X}\) is a complete description of the current situation — it carries all the information the agent needs to make a decision. Contexts are drawn i.i.d. from a fixed distribution \(P(x)\) that the agent cannot influence.
- An action space \(\mathcal{A}\) is the set of choices available to the agent. Given a context \(x\), the agent selects one action \(a \in \mathcal{A}\). The action space is the same for every context (though a policy may assign zero probability to certain actions in certain contexts).
- A reward function \(r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}\) maps each context-action pair to a scalar reward. The reward may be deterministic or stochastic — in the stochastic case, \(r(x, a)\) denotes the expected reward, and the agent observes a noisy realization.
- A policy \(\pi(a \vert x)\) is a conditional distribution over actions given a context. It encodes the agent’s strategy: for each context \(x\), \(\pi(\cdot \vert x)\) is a probability distribution over \(\mathcal{A}\).
At each round, nature draws a context \(x \sim P(x)\), the agent observes \(x\), selects an action \(a \sim \pi(a \vert x)\), and receives reward \(r(x, a)\). Then the round ends. The next context \(x'\) is drawn fresh from \(P(x)\) — it does not depend on the previous action or context. The agent’s goal is to find a policy that maximizes expected reward:
\[\max_\pi \; \mathbb{E}_{x \sim P, \, a \sim \pi(\cdot \vert x)}\!\left[r(x, a)\right]\]
The key structural property is that there are no state transitions. The context distribution \(P(x)\) is fixed and does not depend on the policy. This is what distinguishes a contextual bandit from a full MDP: in an MDP, the agent’s actions influence what states it sees next, creating a feedback loop between the policy and the state distribution. In a bandit, each round is statistically independent — the agent’s choice of action today has no effect on what context it will see tomorrow.
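The round structure is easy to make concrete. Below is a minimal sketch of the interaction loop with a made-up two-context, two-action bandit; the context set, actions, reward function, and policy probabilities are all illustrative, not taken from any real task:

```python
import random

CONTEXTS = ["easy", "hard"]      # context space X (illustrative)
ACTIONS = ["careful", "fast"]    # action space A (illustrative)

def reward(x, a):
    # Deterministic reward r(x, a): "careful" pays off on hard contexts,
    # "fast" pays off on easy ones.
    if x == "hard":
        return 1.0 if a == "careful" else 0.0
    return 1.0 if a == "fast" else 0.5

def policy(x):
    # A fixed stochastic policy pi(a | x); this toy one happens not to
    # depend on x, preferring "careful" with probability 0.6.
    return random.choices(ACTIONS, weights=[0.6, 0.4])[0]

def run_rounds(n_rounds, seed=0):
    random.seed(seed)
    total = 0.0
    for _ in range(n_rounds):
        x = random.choice(CONTEXTS)  # x ~ P(x), independent of the policy
        a = policy(x)                # a ~ pi(a | x)
        total += reward(x, a)        # round ends; no state transition
    return total / n_rounds

avg = run_rounds(10_000)  # estimates E[r(x, a)], ~0.65 for this toy setup
```

Note that the next context is drawn without any reference to the previous round — the loop body is the entire episode.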
This independence has a profound consequence for importance sampling. Recall from the previous post that the surrogate objective in policy gradient methods silently substitutes the old policy’s state distribution \(d^{\pi_{\text{old}}}\) for the current policy’s \(d^{\pi_\theta}\), introducing an approximation error controlled by the distribution mismatch coefficient. In a contextual bandit, this problem vanishes entirely: the context distribution \(P(x)\) is the same regardless of which policy is being used, so there is no state distribution mismatch to worry about. The IS ratio \(\frac{\pi_\theta(a \vert x)}{\pi_{\text{old}}(a \vert x)}\) corrects for the action mismatch, and that is the only correction needed.
Language Generation as a Bandit
Most current language model RL — RLHF for chat models, reward-based fine-tuning for math and code — fits naturally into the contextual bandit framework. The mapping is:
| Contextual Bandit | Language Model RL |
|---|---|
| Context \(x\) | Prompt |
| Action \(a\) | Full response \(y\) |
| Policy \(\pi(a \vert x)\) | Language model \(\pi_\theta(y \vert x)\) |
| Reward \(r(x, a)\) | Reward model \(r(x, y)\) |
| Context distribution \(P(x)\) | Prompt dataset \(\mathcal{D}\) |
The model generates a complete response in one shot, receives a scalar reward, and the episode ends. The next prompt is drawn independently from the dataset — it does not depend on what the model generated previously.
This is why PPO and related methods work as well as they do for single-turn tasks like math problem solving or code generation. The problem has the structure of a bandit: given a math problem (prompt), produce a solution (response), get a reward (correct or not). The IS ratio \(\frac{\pi_\theta(y \vert x)}{\pi_{\text{old}}(y \vert x)}\) corrects for the action mismatch, and there is no state mismatch to worry about. The surrogate objective is exact up to the action-level IS correction — no hidden approximation, no distribution mismatch coefficient.
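Under the bandit view, a PPO-style update therefore reduces to a single clipped ratio per response, computed from summed per-token log-probabilities. A minimal sketch — the per-token log-probs below are hypothetical numbers standing in for model outputs:

```python
import math

def sequence_ratio(logp_new_tokens, logp_old_tokens):
    # Bandit view: the whole response is one action, so the IS ratio is
    # a single number, exp of the difference of summed token log-probs.
    return math.exp(sum(logp_new_tokens) - sum(logp_old_tokens))

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    # PPO-style clipped objective applied to the sequence-level ratio.
    ratio = sequence_ratio(logp_new, logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

# Hypothetical per-token log-probs for one sampled response.
logp_old = [-1.2, -0.8, -2.0]
logp_new = [-1.1, -0.7, -1.9]
obj = clipped_surrogate(logp_new, logp_old, advantage=1.0)
# The raw ratio is exp(0.3) ~ 1.35, so clipping caps the objective at 1.2.
```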
The picture changes for multi-turn tasks — dialogue, tool use, agentic workflows — where the model’s output at step \(t\) affects what input it sees at step \(t+1\). In those settings, the full sequential RL formulation returns, and with it all the challenges of state distribution mismatch and trajectory-level IS ratios. But the majority of current language model RL operates in the bandit regime, which is one reason it has been so successful despite using relatively simple algorithms.
Language Generation as a Token-Level MDP
There is an alternative way to model language generation: treat each token as an action in a sequential decision process. Instead of viewing the entire response as a single monolithic action, we model generation as a multi-step MDP where the agent makes one decision per time step:
| MDP Component | Token-Level Language Generation |
|---|---|
| State \(s_t\) | Prompt \(x\) concatenated with tokens generated so far: \(s_t = (x, y_1, \ldots, y_{t-1})\) |
| Action \(a_t\) | Next token \(y_t \in \mathcal{V}\) |
| Transition \(T(s_{t+1} \vert s_t, a_t)\) | Deterministic: append \(y_t\) to get \(s_{t+1} = (x, y_1, \ldots, y_t)\) |
| Policy \(\pi(a_t \vert s_t)\) | Next-token distribution \(\pi_\theta(y_t \vert x, y_{<t})\) |
| Reward \(r(s_t, a_t)\) | Zero at all intermediate steps; \(r(x, y)\) at the final token |
Under this formulation, the state at time \(t\) is the entire prefix — the prompt plus all tokens generated so far. The action is a single token drawn from the vocabulary \(\mathcal{V}\). The transition function is deterministic and trivial: the next state is just the current state with the new token appended. The reward is sparse: the agent receives nothing until the response is complete, at which point a reward model scores the full sequence.
This is a legitimate MDP — each action (token) changes the state (prefix), and the state determines what actions are available and how future states evolve. But it is an unusual one. The transition function is deterministic, so all stochasticity comes from the policy. The state space grows with each step, and the horizon \(T\) (response length) varies across episodes.
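A minimal sketch of this formulation, with a toy uniform next-token distribution over a 4-token vocabulary standing in for a real language model:

```python
import math

VOCAB_SIZE = 4  # toy vocabulary; real models have tens of thousands of tokens

def step(state, token):
    # T(s_{t+1} | s_t, a_t): deterministically append the token to the prefix.
    return state + (token,)

def uniform_logp(state, token):
    # Toy log pi(y_t | x, y_<t): uniform over the vocabulary.
    return -math.log(VOCAB_SIZE)

def trajectory_logprob(prompt, tokens, logprob_fn):
    # Roll through the MDP: the state is the prefix, and the sequence
    # log-prob accumulates one per-token term per step.
    state = (prompt,)
    total = 0.0
    for tok in tokens:
        total += logprob_fn(state, tok)
        state = step(state, tok)
    return total

lp = trajectory_logprob("x", ["a", "b", "c"], uniform_logp)
# For the uniform toy policy, log pi(y | x) = -T * log |V|.
```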
Bandit vs Token-Level MDP: What Changes?
The two formulations — bandit and token-level MDP — are mathematically equivalent. The bandit’s single “action” \(y\) is the MDP’s entire trajectory \((y_1, \ldots, y_T)\). The bandit’s policy \(\pi_\theta(y \vert x)\) equals the MDP’s trajectory probability \(\prod_{t=1}^{T} \pi_\theta(y_t \vert x, y_{<t})\). The bandit’s reward \(r(x, y)\) is the MDP’s cumulative return (which is just the terminal reward, since intermediate rewards are zero). Any optimization algorithm that works on one formulation can be translated to the other. But the two views lead to very different algorithmic choices.
IS ratios. Under the bandit view, there is a single IS ratio per response: \(\frac{\pi_\theta(y \vert x)}{\pi_{\text{old}}(y \vert x)}\). Under the MDP view, this ratio decomposes into a product of \(T\) per-token ratios: \(\prod_{t=1}^{T} \frac{\pi_\theta(y_t \vert x, y_{<t})}{\pi_{\text{old}}(y_t \vert x, y_{<t})}\). The bandit ratio is a single number that can be computed and clipped directly. The MDP product is a chain of \(T\) factors that can compound — even if each factor is close to 1, their product can explode or collapse for long sequences. This is the compounding problem from the previous post, reappearing in the token setting.
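The compounding effect is easy to see numerically. In the deterministic sketch below, the per-token ratios and the length are illustrative, not measured from any model:

```python
# Per-token ratios within 2% of 1 still multiply to an extreme
# sequence-level ratio over a modest response length.
T = 200
explode = 1.02 ** T   # ratio slightly above 1 at every token
collapse = 0.98 ** T  # ratio slightly below 1 at every token
# explode is ~52 and collapse is ~0.018 — both far outside any
# reasonable clipping range like [0.8, 1.2].
```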
Credit assignment. The bandit formulation assigns the same scalar reward \(r(x, y)\) to the entire response. Every token receives equal credit, whether it is a crucial reasoning step or a filler word. The MDP formulation opens the door to token-level credit assignment: we can define a value function \(V(s_t)\) at each prefix \(s_t = (x, y_{<t})\) and an advantage function
\[A(s_t, y_t) = Q(s_t, y_t) - V(s_t)\]
that measures how much better token \(y_t\) is compared to the expected continuation from state \(s_t\). This allows the policy gradient to upweight tokens that contributed positively to the reward and downweight tokens that did not — a much finer-grained signal than applying the same reward to all tokens.
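Because all intermediate rewards are zero, the sampled return from every prefix equals the terminal reward \(R\), so a simple Monte Carlo estimate of the advantage reduces to \(R - V(s_t)\). A sketch, with hypothetical value estimates standing in for a learned value function:

```python
def token_advantages(terminal_reward, values):
    # values[t] approximates V(s_t) for each prefix s_t. With sparse
    # terminal-only reward, the sampled return from every prefix is the
    # terminal reward, so the Monte Carlo advantage is R - V(s_t).
    return [terminal_reward - v for v in values]

adv = token_advantages(terminal_reward=1.0, values=[0.2, 0.5, 0.9])
# Tokens generated from already-promising prefixes receive less credit.
```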
State distribution mismatch. Under the bandit view, the context distribution \(P(x)\) is fixed and policy-independent, so there is no state distribution mismatch. Under the MDP view, the state at time \(t\) is \(s_t = (x, y_1, \ldots, y_{t-1})\), which depends on the policy that generated the prefix. If we use data collected under \(\pi_{\text{old}}\) to update \(\pi_\theta\), the distribution over prefixes will differ — the hidden approximation from the previous post returns. However, because the transitions are deterministic, this mismatch is entirely driven by the policy: two policies see different states only because they generate different token sequences. This is milder than the mismatch in a stochastic-transition MDP (like robotics), but it is not zero.
KL divergence. Under the bandit view, the KL between \(\pi_\theta\) and \(\pi_{\text{ref}}\) is a single number per prompt — the divergence between two distributions over full responses. Under the MDP view, this KL decomposes as a sum of per-token KLs:
\[\text{KL}\!\left[\pi_\theta(y \vert x) \,\|\, \pi_{\text{ref}}(y \vert x)\right] = \mathbb{E}_{y \sim \pi_\theta}\!\left[\sum_{t=1}^{T} \log \frac{\pi_\theta(y_t \vert x, y_{<t})}{\pi_{\text{ref}}(y_t \vert x, y_{<t})}\right]\]
This additive structure means the sequence-level KL grows linearly with response length \(T\): a per-token KL of \(\epsilon\) accumulates to \(T\epsilon\). The MDP view makes this length dependence explicit, which is important for understanding why KL regularization strength may need to be adjusted for tasks that elicit longer responses.
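A quick numerical sketch of the linear growth: summing the per-token log-ratios of a sampled response gives a Monte Carlo estimate of the sequence KL, and a constant per-token shift (0.01 nats here, purely illustrative) accumulates linearly in length:

```python
def sequence_kl_estimate(logp_theta_tokens, logp_ref_tokens):
    # Single-sample estimate of the sequence-level KL: the sum of
    # per-token log-ratios log pi_theta - log pi_ref along one response.
    return sum(lt - lr for lt, lr in zip(logp_theta_tokens, logp_ref_tokens))

kl_100 = sequence_kl_estimate([-1.0] * 100, [-1.01] * 100)  # ~1.0 nat
kl_500 = sequence_kl_estimate([-1.0] * 500, [-1.01] * 500)  # ~5.0 nats
# The same per-token shift costs 5x more at 5x the length.
```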
Practical algorithms. Most current methods — PPO-based RLHF, GRPO, REINFORCE-style approaches — implicitly use a hybrid. They collect data at the response level (bandit), but compute per-token log-probabilities and apply token-level KL penalties (MDP). The policy gradient is typically computed with a single advantage per response (bandit-style), though some methods like token-level PPO estimate per-token advantages using a learned value function. The choice of which view to adopt at each stage of the algorithm is a design decision with real consequences for variance, credit assignment, and computational cost.
RLHF
IS Ratios in Language Model Alignment
The IS ratio from the previous post appears directly in Reinforcement Learning from Human Feedback (RLHF), where a language model \(\pi_\theta\) is fine-tuned to maximize a learned reward model \(r(x, y)\) while staying close to a reference model \(\pi_{\text{ref}}\). The standard RLHF objective is:
\[\max_\theta \; \mathbb{E}_{x \sim \mathcal{D}, \, y \sim \pi_\theta(\cdot \vert x)}\!\left[r(x, y)\right] - \beta \, \text{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\]
The KL divergence can be written as an expectation of the log IS ratio:
\[\text{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)}\right]\]
The ratio \(\frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)}\) is exactly an importance sampling ratio — it measures how much the fine-tuned model’s distribution has shifted from the reference. The KL penalty pushes this ratio toward 1, keeping the fine-tuned model close to the reference. This is directly analogous to keeping the proposal \(q\) close to the target \(p\) in importance sampling: when the ratio deviates too far, the estimate (or in this case, the policy update) becomes unreliable.
In practice, RLHF implementations (e.g., PPO-based RLHF) use the same clipped IS ratio from PPO for the policy optimization step, combined with the KL penalty to prevent the model from drifting too far from \(\pi_{\text{ref}}\). Some recent methods like DPO (Direct Preference Optimization) eliminate the explicit RL loop entirely by reparameterizing the reward in terms of the log ratio \(\log \frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)}\), making the IS ratio the central object of the optimization itself.
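One common way the KL term is implemented in PPO-based pipelines is as per-token reward shaping: each token pays \(-\beta\) times its log ratio against the reference, and the reward model score lands on the final token. A sketch under that assumption — the per-token log-probs and reward-model score below are hypothetical:

```python
def shaped_rewards(rm_score, logp_theta, logp_ref, beta=0.1):
    # Per-token KL-shaped reward: -beta * log(pi_theta / pi_ref) at each
    # token, plus the reward-model score at the terminal token.
    rewards = [-beta * (lt - lr) for lt, lr in zip(logp_theta, logp_ref)]
    rewards[-1] += rm_score  # terminal reward from the reward model
    return rewards

r = shaped_rewards(rm_score=1.0,
                   logp_theta=[-1.0, -0.5, -2.0],
                   logp_ref=[-1.1, -0.5, -1.8])
# Token 1 raised its probability relative to the reference and pays a
# small penalty; token 2 is unchanged and pays nothing; token 3 lowered
# its probability and gets a small bonus on top of the terminal reward.
```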
Token-Level Decomposition
A language model generates a response \(y = (y_1, y_2, \ldots, y_T)\) autoregressively, so the probability of the full sequence factorizes as:
\[\pi_\theta(y \vert x) = \prod_{t=1}^{T} \pi_\theta(y_t \vert x, y_{<t})\]
The IS ratio between the current policy \(\pi_\theta\) and the reference \(\pi_{\text{ref}}\) therefore decomposes into a product of per-token ratios:
\[\frac{\pi_\theta(y \vert x)}{\pi_{\text{ref}}(y \vert x)} = \prod_{t=1}^{T} \frac{\pi_\theta(y_t \vert x, y_{<t})}{\pi_{\text{ref}}(y_t \vert x, y_{<t})}\]
This is precisely the trajectory-level product from the previous post, reappearing in a new guise. Each token generation is a “time step”, and the full response is a “trajectory”. The compounding problem applies directly: as responses grow longer, the product of per-token ratios can explode or collapse, making long-response IS estimates unreliable. A 500-token response involves a product of 500 ratios — even if each individual ratio is close to 1, their product can deviate enormously.
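A small simulation makes the 500-token point concrete: even per-token log-ratios with zero mean make the 500-factor product swing over orders of magnitude. The noise scale is illustrative, not measured from any model:

```python
import math
import random

random.seed(0)

def sequence_ratio(T, sigma=0.05):
    # Product of T per-token ratios, each the exp of a small zero-mean
    # log-ratio; equivalently exp of the sum of the log-ratios.
    return math.exp(sum(random.gauss(0.0, sigma) for _ in range(T)))

samples = [sequence_ratio(500) for _ in range(1000)]
spread = max(samples) / min(samples)
# Across 1000 sampled "responses", the sequence-level ratio spans
# orders of magnitude even though every per-token factor is near 1.
```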
Why KL Regularization is Not Optional
This compounding structure explains why the KL penalty in RLHF is not merely a stylistic choice but a statistical necessity. Without it, the policy \(\pi_\theta\) can drift far from \(\pi_{\text{ref}}\) in sequence space even when each per-token distribution shifts modestly. A per-token KL of \(\epsilon\) accumulates to \(T\epsilon\) over a full response, so the sequence-level KL grows linearly with response length. The KL penalty in the RLHF objective:
\[\max_\theta \; \mathbb{E}_{x, y \sim \pi_\theta}\!\left[r(x, y)\right] - \beta \, \text{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]\]
directly controls this accumulation. By penalizing the sum of per-token log ratios, it prevents any individual token distribution from shifting too far and prevents the aggregate shift from compounding out of control. The hyperparameter \(\beta\) trades off reward maximization against distributional stability — too small and the policy drifts into regions where the IS ratios (and therefore the policy gradient estimates) become unreliable; too large and the policy cannot learn anything beyond the reference.
Response-Level vs Token-Level Credit Assignment
A deeper challenge is credit assignment. In standard RLHF, the reward \(r(x, y)\) is assigned to the entire response \(y\). This is a single scalar for a sequence that may be hundreds of tokens long. From an IS perspective, we are using a product of \(T\) per-token ratios to reweight a single reward signal — the worst case of the compounding problem. Every token in the response receives the same reward, even though some tokens may be crucial (the key reasoning step) while others are incidental (filler phrases, formatting).
This is analogous to the trajectory-level vs per-step IS distinction from the previous post. There, we showed that per-step IS reduces variance by breaking the trajectory reward into per-step rewards and applying only the relevant IS ratios. The same idea applies to language: if we could assign credit at the token level — identifying which tokens contributed to the reward — we could avoid the full product of ratios and dramatically reduce variance.
Recent work explores exactly this direction. Process reward models assign rewards to intermediate reasoning steps rather than only to the final answer, enabling step-level credit assignment analogous to per-step IS. Token-level KL penalties regularize each token’s distribution individually rather than penalizing only the aggregate. These approaches are, at their core, applications of the per-step IS principle to the language setting: decompose the monolithic sequence-level problem into manageable token-level or step-level subproblems, and apply IS corrections only where they are needed.
The Action Space Explosion
There is one final way in which language RL differs from classical RL. In a typical control task, the action space at each step might have \(\vert\mathcal{A}\vert = 4\) (four directions) or be a low-dimensional continuous space. In language, each token is drawn from a vocabulary of tens of thousands of tokens, and the response is a sequence of these choices. The effective action space is \(\vert\mathcal{V}\vert^T\) — astronomically large. This means that two language model policies, even if they are “similar” by most measures, will assign probability mass to mostly non-overlapping sets of full responses. The overlap between \(\pi_\theta\) and \(\pi_{\text{ref}}\) in sequence space can be vanishingly small even when the per-token distributions are close.
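A back-of-the-envelope sketch of this overlap collapse: if two policies pick the same token with probability \(1 - \epsilon\) at each step, their agreement on a full \(T\)-token response decays geometrically. The \(\epsilon\) value is illustrative:

```python
def sequence_overlap(per_token_agreement, T):
    # If the policies agree on each token independently with probability
    # per_token_agreement, they agree on the whole T-token response with
    # probability per_token_agreement ** T.
    return per_token_agreement ** T

o10 = sequence_overlap(0.99, 10)    # short responses: ~0.90, overlap survives
o500 = sequence_overlap(0.99, 500)  # long responses: <0.01, overlap collapses
```

Even a 1% per-token disagreement is enough to make two policies nearly disjoint over 500-token responses, which is why old samples become uninformative so quickly.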
This makes off-policy evaluation in language RL fundamentally harder than in classical RL. In a grid world, the old policy and new policy might both assign significant probability to the same trajectories, making IS reweighting effective. In language, a small change in the policy can redirect probability mass to entirely different responses, leaving the old policy’s samples uninformative about the new policy’s behavior. This is why most successful language model RL methods — PPO-based RLHF, GRPO, and their variants — operate in a nearly on-policy regime, collecting fresh samples from the current policy at each step rather than attempting to reuse old data. The IS ratios in these methods serve primarily as a local correction (keeping \(\theta\) close to \(\theta_{\text{old}}\) within a single update) rather than as a tool for genuine off-policy learning across many updates.