Importance Sampling: Why and How

Importance Sampling Basics

Suppose we want to compute the expected value of a function \(f(x)\) under a distribution \(p(x)\):

\[\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx\]

When \(p(x)\) is complex or high-dimensional, this integral is often intractable. A natural approach is Monte Carlo estimation — draw \(N\) samples from \(p\) and average:

\[\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \quad x_i \sim p\]

This works well when we can sample from \(p\). But what if sampling from \(p\) is expensive, impossible, or inefficient? Note the distinction: evaluating \(p(x)\) at a given point (computing the density value) is usually easy — just plug \(x\) into the formula. Sampling from \(p(x)\) — generating random \(x\) values that follow \(p\)’s distribution — is the hard part.

Figure 1: Evaluating p(x) (left) is just arithmetic — plug in any x. Sampling from p(x) (right) means generating x values that cluster where p is high — which requires computing the very integral we want to avoid.

To see why sampling is hard, consider a distribution built by combining simpler pieces — say, a mixture of many Gaussians, or the product of a likelihood and a prior. Given any specific \(x\), you can plug it into the formula and compute \(p(x)\). But generating random \(x\) values that follow \(p\)’s shape is a different problem entirely. The standard approach is the inverse CDF method: draw \(U \sim \text{Uniform}(0,1)\) and compute \(x = F^{-1}(U)\), where

\[F(x) = \int_{-\infty}^{x} p(t)\, dt\]

is the cumulative distribution function. But computing \(F\) requires solving an integral of \(p\) — the very type of computation we are trying to avoid. This is a circular dependency: to sample from \(p\), we need the CDF, but the CDF is an integral over \(p\).
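To make the contrast concrete, here is a minimal Python sketch (assuming only numpy) of inverse-CDF sampling for a case where it does work: the exponential distribution, whose CDF inverts in closed form. For a general \(p\), it is exactly this \(F^{-1}\) step that we cannot compute.

```python
import numpy as np

# Inverse-CDF sampling works when F^{-1} is available in closed form.
# For Exponential(rate): F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -ln(1 - u) / rate.
def sample_exponential(rate: float, n: int, rng: np.random.Generator) -> np.ndarray:
    u = rng.uniform(0.0, 1.0, size=n)   # U ~ Uniform(0, 1)
    return -np.log1p(-u) / rate         # x = F^{-1}(U)

rng = np.random.default_rng(0)
samples = sample_exponential(rate=2.0, n=100_000, rng=rng)
print(samples.mean())  # ~0.5, the mean of Exponential(rate=2)
```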

So we are stuck: the Monte Carlo estimator above avoids intractable integrals by using samples, but generating those samples from \(p\) requires the CDF, which is itself an integral over \(p\) (as shown above). What about computing the integral directly? In one dimension, we could lay down a grid of \(n\) points and sum \(f(x_i)\, p(x_i)\, \Delta x\). But in \(d\) dimensions, each dimension needs its own set of \(n\) grid points, and we must evaluate every combination — so the total number of grid points is \(n \times n \times \cdots \times n = n^d\). Even a modest \(n = 10\) with \(d = 100\) requires \(10^{100}\) evaluations, far more than atoms in the universe. Importance sampling breaks this deadlock. It is still a Monte Carlo method — it still uses random samples to estimate the integral — but instead of sampling from the hard distribution \(p\), it samples from an easy distribution \(q\) and corrects for the mismatch.

Figure 2: Why not just integrate numerically? Grid methods explode exponentially in high dimensions. Monte Carlo avoids this but needs samples from p. Importance sampling solves both problems.

The key idea is simple: multiply and divide by a proposal distribution \(q(x)\) that we can sample from:

\[\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx\] \[= \mathbb{E}_{x \sim q}\!\left[f(x)\, \frac{p(x)}{q(x)}\right]\]

The ratio \(w(x) = \frac{p(x)}{q(x)}\) is the importance weight (or importance sampling ratio). It corrects for the mismatch between the distribution we sample from (\(q\)) and the distribution we care about (\(p\)).

The Monte Carlo estimator becomes:

\[\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q\]

This estimator is unbiased — if we repeated the entire procedure (draw \(N\) samples, compute the weighted average) many times, the average of these estimates would converge to the true expectation under \(p\). Any single run with finite samples will differ from the true value due to variance, but there is no systematic over- or under-estimation. This holds for any choice of \(q\), as long as \(q(x) > 0\) wherever \(p(x) f(x) \neq 0\).

Figure 3: Importance sampling in action. Samples drawn from the proposal q(x) are reweighted by p(x)/q(x) to estimate expectations under the target p(x).

We can also watch this convergence directly by increasing \(N\): as we draw more samples from \(q\) and update the running weighted average, the IS estimate approaches the true value.

Figure 4: As the number of samples N grows, the IS estimate (red) converges toward the true expectation (green dashed). The bottom plot shows p(x) and q(x) with accumulated samples on the x-axis.
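Here is a minimal numpy sketch of the whole recipe: sample from \(q\), weight by \(p/q\), average. The particular \(p\), \(q\), and \(f\) below are illustrative choices, not anything fixed by the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative choices: target p = N(2, 0.7^2), proposal q = N(1, 1.5^2), f(x) = x^2.
mu_p, sig_p = 2.0, 0.7
mu_q, sig_q = 1.0, 1.5
f = lambda x: x ** 2

N = 100_000
x = rng.normal(mu_q, sig_q, size=N)                           # x_i ~ q
w = normal_pdf(x, mu_p, sig_p) / normal_pdf(x, mu_q, sig_q)   # importance weights p(x_i)/q(x_i)
is_estimate = np.mean(f(x) * w)                               # (1/N) * sum f(x_i) w(x_i)

print(is_estimate, mu_p**2 + sig_p**2)   # E_p[x^2] = mu^2 + sigma^2; the two should be close
```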

Why Does This Work?

The intuition is straightforward:

  • If \(q\) oversamples a region relative to \(p\), the ratio \(p/q < 1\) downweights those samples.
  • If \(q\) undersamples a region relative to \(p\), the ratio \(p/q > 1\) upweights those samples.

The reweighting exactly compensates for the distributional mismatch.

Variance of the Estimator

While the IS estimator is unbiased for any valid \(q\), the choice of \(q\) dramatically affects variance. The variance of the IS estimator is:

\[\text{Var}_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right] = \mathbb{E}_{x \sim q}\!\left[\left(f(x)\,\frac{p(x)}{q(x)}\right)^2\right] - \left(\mathbb{E}_{x \sim p}[f(x)]\right)^2\]

The first term can explode when \(p(x)/q(x)\) is large — i.e., when \(q\) assigns low probability to regions where \(p\) assigns high probability.

The optimal proposal that minimizes variance is:

\[q^*(x) \propto \vert f(x) \vert \, p(x)\]

This assigns more probability mass where the integrand \(\vert f(x)\vert p(x)\) is large. In practice, \(q^*\) is rarely available (computing it requires knowing the integral we are trying to estimate), but it tells us what a good proposal looks like: match the shape of the integrand.
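A small numpy experiment makes the point: with \(f(x) = x\) and a Gaussian target (illustrative choices), a proposal shaped roughly like \(\vert f(x)\vert\, p(x)\) gives a much smaller single-sample variance than one centered elsewhere.

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p = N(2, 0.7^2) and f(x) = x, so |f| p is concentrated near x = 2.
mu_p, sig_p = 2.0, 0.7
f = lambda x: x

def is_variance(mu_q, sig_q, n=200_000):
    """Empirical variance of a single weighted sample f(x) * p(x)/q(x), with x ~ q."""
    x = rng.normal(mu_q, sig_q, size=n)
    w = normal_pdf(x, mu_p, sig_p) / normal_pdf(x, mu_q, sig_q)
    return np.var(f(x) * w)

print(is_variance(mu_q=2.0, sig_q=0.8))   # proposal roughly matching |f| p: small variance
print(is_variance(mu_q=0.0, sig_q=1.3))   # poorly matched proposal: much larger variance
```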

When Things Go Wrong

Importance sampling has a well-known failure mode: high variance due to weight degeneracy. If \(p\) and \(q\) are poorly matched, a few samples can dominate the estimate with enormous weights while most samples contribute almost nothing.

A useful diagnostic is the effective sample size (ESS):

\[N_{\text{eff}} = \frac{\left(\sum_{i=1}^{N} w_i\right)^2}{\sum_{i=1}^{N} w_i^2}\]

When all weights are equal, \(N_{\text{eff}} = N\). When one weight dominates, \(N_{\text{eff}} \approx 1\). A low ESS signals that the proposal \(q\) is a poor match for \(p\) and the estimate is unreliable.

To see this in action, compare the well-matched proposal in Figure 4 with the poorly-matched one below. Here \(q = \mathcal{N}(0, 1.3)\) is centered far from \(p = \mathcal{N}(2, 0.7)\): most samples land near \(x = 0\) where \(p(x) \approx 0\), so their weights are nearly zero. The few samples that reach \(p\)’s peak carry enormous weights, causing the estimate to jump erratically and converge slowly.

Figure 5: Weight degeneracy with a poorly matched proposal. The IS estimate (red) wanders far from the true value (green dashed) even at N = 500. The ESS badge shows that only a small fraction of samples carry meaningful weight.
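As a diagnostic, ESS is a one-liner. The sketch below (numpy; parameters chosen to mirror the figures, otherwise illustrative) computes it for a well-matched and a poorly-matched proposal.

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def effective_sample_size(w):
    """N_eff = (sum w)^2 / sum w^2: equals N for uniform weights, ~1 under degeneracy."""
    return np.sum(w) ** 2 / np.sum(w ** 2)

mu_p, sig_p = 2.0, 0.7   # target p = N(2, 0.7), as in the figures

def ess_for_proposal(mu_q, sig_q, n=500):
    x = rng.normal(mu_q, sig_q, size=n)
    w = normal_pdf(x, mu_p, sig_p) / normal_pdf(x, mu_q, sig_q)
    return effective_sample_size(w)

print(ess_for_proposal(mu_q=2.0, sig_q=0.8))   # well matched: ESS close to N = 500
print(ess_for_proposal(mu_q=0.0, sig_q=1.3))   # poorly matched: ESS is a small fraction of N
```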

This variance problem gets worse in high dimensions, where even small distributional mismatches compound across dimensions — a phenomenon sometimes called the curse of dimensionality for importance sampling.

Importance Sampling in RL

The machinery from Section 1 — reweighting samples from one distribution to estimate expectations under another — turns out to be exactly what we need in reinforcement learning (RL). The core problem: we have data collected by one policy, but want to learn about a different policy.

The Off-Policy Problem

In RL, an agent interacts with an environment by choosing actions according to a policy \(\pi(a \vert s)\) — a distribution over actions given a state. Each interaction produces a trajectory:

\[\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1})\]

The goal is to find a policy that maximizes expected cumulative reward. Policy optimization algorithms do this iteratively: collect data with the current policy \(\pi_{\text{old}}\), use the data to compute a better policy \(\pi_{\text{new}}\), then repeat. But here is the problem: as soon as we update the policy, all the data we just collected came from the wrong distribution. The trajectories were generated by \(\pi_{\text{old}}\), but we need to evaluate expectations under \(\pi_{\text{new}}\). This is exactly the importance sampling setup from Section 1, with \(\pi_{\text{old}}\) playing the role of the proposal \(q\) and \(\pi_{\text{new}}\) playing the role of the target \(p\).

Trajectory Reweighting

Suppose we want to estimate the expected return \(J(\pi_\theta) := \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\) of a target policy \(\pi_\theta\) using trajectories collected by a behavior policy \(\pi_\beta\), where \(R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t\) is the discounted return. The full probability of a trajectory under any policy involves both the policy’s action probabilities and the environment’s transition dynamics:

\[P_{\pi}(\tau) = d_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \vert s_t) \, P(s_{t+1} \vert s_t, a_t)\]

Figure 6: Anatomy of a trajectory. The probability is an alternating product of policy terms (red) and environment terms (green). Toggle to see how environment terms vanish in the gradient or cancel in the IS ratio — only policy terms survive.

where \(d_0\) is the initial state distribution and \(P(s_{t+1} \vert s_t, a_t)\) is the transition probability. The key observation is that \(d_0\) and \(P\) do not change when we deploy different policies — both policies start from the same initial distribution and interact with the same environment dynamics. So in the IS ratio, they cancel:

A note on \(d_0\) vs \(d^\pi\).

In RL, the letter \(d\) appears in two distinct roles:

  • \(d_0\) = the initial state distribution, defined as part of the MDP \(M = (S, A, P, R, \gamma, d_0)\). It determines the starting state \(s_0 \sim d_0\) before the policy takes any action. Since the policy has not acted yet, \(d_0\) is the same regardless of which policy we deploy.
  • \(d^\pi\) (or \(d^\pi_t\)) = the state visitation distribution induced by running policy \(\pi\). At time \(t > 0\), the distribution over states depends on which actions the policy took, so \(d^\pi_t\) is a joint property of the policy and the environment dynamics. The discounted occupancy \(d^\pi = (1-\gamma) \sum_t \gamma^t d^\pi_t\) aggregates this across time.

In the trajectory probability above, \(d_0(s_0)\) is policy-independent and cancels in the IS ratio. Later, in the policy gradient section, \(d^{\pi_\theta}\) appears and does not cancel — this is the source of the distribution mismatch problem.

\[\frac{P_{\pi_\theta}(\tau)}{P_{\pi_\beta}(\tau)} = \frac{d_0(s_0) \prod_t \pi_\theta(a_t \vert s_t) \, P(s_{t+1} \vert s_t, a_t)}{d_0(s_0) \prod_t \pi_\beta(a_t \vert s_t) \, P(s_{t+1} \vert s_t, a_t)}\] \[= \prod_{t=0}^{T-1} \frac{\pi_\theta(a_t \vert s_t)}{\pi_\beta(a_t \vert s_t)}\]

Applying IS, we can estimate the return of \(\pi_\theta\) using data from \(\pi_\beta\):

\[J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\beta}\!\left[\prod_{t=0}^{T-1} \frac{\pi_\theta(a_t \vert s_t)}{\pi_\beta(a_t \vert s_t)} \; R(\tau)\right]\]

This is the per-trajectory IS estimator. We only need to know the action probabilities under both policies — no knowledge of the environment dynamics is required.
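In code, the per-trajectory estimator needs nothing beyond the logged returns and the two policies' action log-probabilities along each trajectory. A minimal numpy sketch, where the array shapes are assumptions about how the logged data is stored:

```python
import numpy as np

def per_trajectory_is(returns, logp_theta, logp_beta):
    """Per-trajectory IS estimate of J(pi_theta) from trajectories logged under pi_beta.

    returns:     (num_traj,) array of discounted returns R(tau)
    logp_theta:  (num_traj, T) array of log pi_theta(a_t | s_t) along each trajectory
    logp_beta:   (num_traj, T) array of log pi_beta(a_t | s_t) along each trajectory
    """
    # Products of per-step ratios are computed in log space for numerical stability.
    log_w = np.sum(logp_theta - logp_beta, axis=1)   # log prod_t pi_theta / pi_beta
    return np.mean(np.exp(log_w) * returns)
```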

The Compounding Problem

The product of per-step ratios is where things get dangerous. Even if each individual ratio \(\frac{\pi_\theta(a_t \vert s_t)}{\pi_\beta(a_t \vert s_t)}\) is well-behaved, their product over \(T\) steps can explode or collapse:

\[\prod_{t=0}^{T-1} \frac{\pi_\theta(a_t \vert s_t)}{\pi_\beta(a_t \vert s_t)}\]

If \(\pi_\theta\) and \(\pi_\beta\) differ even slightly — say each ratio averages 1.1 — after 100 steps the product is \(1.1^{100} \approx 13{,}781\). Conversely, if each ratio averages 0.9, the product is \(0.9^{100} \approx 0.00003\). This is the same weight degeneracy from Section 1, but now compounded exponentially across time steps.

Figure 7: Compounding IS ratios. Each line tracks the cumulative product of per-step IS ratios for one trajectory. Even with moderate per-step divergence (σ = 0.20), the products fan out exponentially — a few trajectories carry nearly all the weight while the rest contribute almost nothing. Use the divergence slider to see how even small increases in per-step variance dramatically worsen the effect.
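A tiny simulation in the spirit of Figure 7, using an illustrative noise model (i.i.d. Gaussian per-step log-ratios, each ratio with mean 1) rather than a real MDP:

```python
import numpy as np

rng = np.random.default_rng(3)

T, num_traj, sigma = 100, 1000, 0.20   # horizon, number of trajectories, per-step log-ratio spread
# Illustrative model: log-ratios ~ N(-sigma^2/2, sigma^2), so each per-step ratio has mean 1.
log_ratios = rng.normal(-0.5 * sigma**2, sigma, size=(num_traj, T))
weights = np.exp(log_ratios.sum(axis=1))     # cumulative product of ratios over T steps

print(weights.mean())                        # ~1: the product is still mean-one ...
print(np.median(weights), weights.max())     # ... but the median collapses and the max explodes
```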

Per-Step IS

Can we do better than the per-trajectory estimator? The answer is yes — if we look carefully at what the per-trajectory estimator is actually doing to each reward.

The return is a sum of per-step rewards: \(R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t\). Writing \(\rho_{j:k} = \prod_{t=j}^{k} \frac{\pi_\theta(a_t \vert s_t)}{\pi_\beta(a_t \vert s_t)}\) for a product of per-step ratios, the per-trajectory estimator multiplies this entire sum by the full product \(\rho_{0:T-1}\):

\[\rho_{0:T-1} \, R(\tau) = \sum_{t=0}^{T-1} \gamma^t \, \rho_{0:T-1} \, r_t\]

Now consider a single term \(\rho_{0:T-1} \, r_t\). We can split the product at time \(t\):

\[\rho_{0:T-1} \, r_t = \rho_{0:t} \cdot \rho_{t+1:T-1} \cdot r_t\]

Here is the key observation: \(r_t\) depends only on \((s_0, a_0, \ldots, s_t, a_t)\) — it is determined by the history up to time \(t\). The future ratios \(\rho_{t+1:T-1}\) depend on actions \(a_{t+1}, \ldots, a_{T-1}\), which have no causal effect on \(r_t\). Under the behavior policy, the conditional expectation of each future ratio is 1:

\[\mathbb{E}_{a_k \sim \pi_\beta}\!\left[\frac{\pi_\theta(a_k \vert s_k)}{\pi_\beta(a_k \vert s_k)} \;\Big\vert\; s_k\right] = \sum_a \pi_\beta(a \vert s_k) \frac{\pi_\theta(a \vert s_k)}{\pi_\beta(a \vert s_k)}\] \[= \sum_a \pi_\theta(a \vert s_k) = 1\]

So \(\mathbb{E}[\rho_{t+1:T-1} \mid s_0, a_0, \ldots, s_t, a_t] = 1\). Dropping \(\rho_{t+1:T-1}\) does not change the expectation of each term but removes a source of multiplicative noise. This gives the per-step IS estimator:

\[\hat{J}_{\text{per-step}} = \sum_{t=0}^{T-1} \gamma^t \, \rho_{0:t} \, r_t, \quad \text{where } \rho_{0:t} = \prod_{k=0}^{t} \frac{\pi_\theta(a_k \vert s_k)}{\pi_\beta(a_k \vert s_k)}\]

This is still unbiased but has strictly lower variance — each reward \(r_t\) is weighted only by the ratios that causally precede it. In the worst case — uniform behavior policy, deterministic target, \(K\) actions — per-trajectory IS has variance proportional to \(K^T\), while per-step IS reduces this since the term for \(r_t\) only compounds \(t\) ratios instead of \(T\).
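A numpy sketch of the per-step estimator, under the same assumed data layout as the per-trajectory sketch above (rewards and per-step log-probabilities stored as (num_traj, T) arrays):

```python
import numpy as np

def per_step_is(rewards, logp_theta, logp_beta, gamma=0.99):
    """Per-step IS estimate of J(pi_theta) from trajectories logged under pi_beta.

    rewards:     (num_traj, T) array of per-step rewards r_t
    logp_theta:  (num_traj, T) array of log pi_theta(a_t | s_t)
    logp_beta:   (num_traj, T) array of log pi_beta(a_t | s_t)
    """
    T = rewards.shape[1]
    # rho_{0:t} for every t, via a cumulative sum of log-ratios along the time axis.
    rho_0_to_t = np.exp(np.cumsum(logp_theta - logp_beta, axis=1))
    discounts = gamma ** np.arange(T)
    # Each reward r_t is weighted only by the ratios that causally precede it.
    per_traj_estimates = np.sum(discounts * rho_0_to_t * rewards, axis=1)
    return np.mean(per_traj_estimates)
```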

Two structural properties of MDPs make this possible:

  1. Additive rewards: the return decomposes as \(R(\tau) = \sum_t \gamma^t r_t\), so we can treat each reward separately.
  2. Causal structure: \(r_t\) does not depend on future actions \(a_{t+1:T-1}\), so the future IS ratios are pure noise with conditional expectation 1.

If the return were a non-decomposable function of the entire trajectory (e.g., a product of all rewards, or some function depending on the final state only), this decomposition would not be possible, and we would be stuck with per-trajectory IS.

Figure 8: Per-trajectory IS vs per-step IS. Top: weight structure — per-trajectory applies the full product ρ0:T−1 to every reward, while per-step applies only ρ0:t to rt. Middle: 500 sample estimates under each method (green dashed = true value). Bottom: variance ratio grows with horizon — the longer the trajectory, the more per-step IS helps.

The per-step estimator also has an elegant recursive interpretation. Define \(v_0 = 0\) and

\[v_{T-t} = \rho_t\!\left(r_t + \gamma \, v_{T-t-1}\right)\]

Then \(v_T\) equals the per-step estimator. At each step, this applies single-step (bandit) IS with ratio \(\rho_t\) to the “reward” \(r_t + \gamma \, v_{T-t-1}\), recursing backward through time.
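The equivalence between this recursion and the explicit sum is easy to check numerically; a self-contained sketch with randomly generated per-step ratios and rewards (purely illustrative values):

```python
import numpy as np

rng = np.random.default_rng(4)
T, gamma = 10, 0.99
rho = rng.uniform(0.5, 1.5, size=T)   # illustrative per-step ratios rho_t
r = rng.normal(size=T)                # illustrative per-step rewards r_t

# Explicit per-step IS sum: sum_t gamma^t * (rho_0 * ... * rho_t) * r_t
explicit = sum(gamma**t * np.prod(rho[: t + 1]) * r[t] for t in range(T))

# Backward recursion: v <- rho_t * (r_t + gamma * v), starting from v = 0 at the end.
v = 0.0
for t in reversed(range(T)):
    v = rho[t] * (r[t] + gamma * v)

print(np.isclose(explicit, v))   # True: the two forms agree
```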

IS in Policy Gradient Methods

The IS ratios we have studied — per-trajectory and per-step — appear directly in policy gradient methods. For a derivation of REINFORCE and the advantage function, see the companion post on policy gradient and actor-critic. Here we focus on how IS enables off-policy policy optimization.

The Surrogate Objective

The policy gradient (derived in the companion post) is:

\[\nabla_\theta J = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\!\left[A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]

This requires sampling from \(\pi_\theta\) itself — fresh data after every update. To reuse data from an old policy \(\pi_{\text{old}}\), we apply IS. For a fixed state \(s\), write the expectation over actions as an explicit sum:

\[\nabla_\theta J = \sum_a \pi_\theta(a \vert s) \, A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\]

Multiply and divide by \(\pi_{\text{old}}(a \vert s)\):

\[= \sum_a \pi_{\text{old}}(a \vert s) \, \frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)} \, A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\]

Since \(\pi_{\text{old}}(a \vert s)\) is now the sampling distribution, this is an expectation under \(\pi_{\text{old}}\):

\[= \mathbb{E}_{s \sim d^{\pi_{\text{old}}}, \, a \sim \pi_{\text{old}}}\!\left[\underbrace{\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)}}_{\text{IS ratio}} \, \underbrace{A^{\pi_\theta}(s, a)}_{\substack{\text{how good} \\ \text{is action } a}} \, \underbrace{\nabla_\theta \log \pi_\theta(a \vert s)}_{\substack{\text{direction to make} \\ a \text{ more likely}}}\right]\]

The IS ratio \(\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)}\) corrects for the fact that \(a\) was sampled from \(\pi_{\text{old}}\), not \(\pi_\theta\). Crucially, this is a single-step ratio — no product over time — so the compounding problem does not apply.

Why only single-step? Because the policy gradient is an expectation over \((s, a)\) pairs, not over trajectories. The IS correction here only changes the action sampling distribution at a given state from \(\pi_\theta\) to \(\pi_{\text{old}}\) — a single ratio suffices for that. The state distribution \(d^{\pi_\theta}\) is silently replaced by \(d^{\pi_{\text{old}}}\) without any IS correction. If we also corrected for the state distribution mismatch, we would need to importance-weight the state visitation probabilities, which would reintroduce trajectory-level products and the compounding problem. The surrogate objective avoids this by simply ignoring the state distribution shift — an approximation that is accurate when \(\theta \approx \theta_{\text{old}}\) but degrades as the policies diverge.

Rather than work with the gradient directly, we define the surrogate objective whose gradient gives us the policy gradient:

\[L^{\text{IS}}(\theta) = \mathbb{E}_{s, a \sim \pi_{\text{old}}}\!\left[\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)} \, A^{\pi_\theta}(s, a)\right]\]

To see why the gradient of \(L^{\text{IS}}\) recovers the true policy gradient, we differentiate. The advantage is treated as a fixed quantity (in practice it is an estimate computed from \(\pi_{\text{old}}\)'s data), so the only term that depends on \(\theta\) is the IS ratio:

\[\nabla_\theta L^{\text{IS}}(\theta) = \mathbb{E}_{s, a \sim \pi_{\text{old}}}\!\left[\frac{\nabla_\theta \pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)} \, A^{\pi_\theta}(s, a)\right]\]

Now evaluate at \(\theta = \theta_{\text{old}}\). We use the log-derivative trick: \(\nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta\). Substituting and setting \(\theta = \theta_{\text{old}}\), the \(\pi_{\text{old}}\) in the numerator cancels with the denominator:

\[\nabla_\theta L^{\text{IS}}\big\vert_{\theta = \theta_{\text{old}}} = \mathbb{E}_{s, a \sim \pi_{\text{old}}}\!\left[\frac{\pi_{\text{old}}(a \vert s)}{\pi_{\text{old}}(a \vert s)} \, A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\] \[= \mathbb{E}_{s, a \sim \pi_{\text{old}}}\!\left[A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]

This is exactly the REINFORCE gradient. So we can take gradient steps on \(L^{\text{IS}}\) using data collected once from \(\pi_{\text{old}}\), without recollecting trajectories after each update.
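This identity can be checked numerically in a single-state, bandit-style setting with a softmax policy and fixed advantage estimates; everything below (action count, parameter values) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
K = 4                              # number of actions in one illustrative state
theta_old = rng.normal(size=K)     # old policy logits
A = rng.normal(size=K)             # fixed advantage estimates, one per action

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

pi_old = softmax(theta_old)

def surrogate(theta):
    # L_IS(theta) = E_{a ~ pi_old}[ (pi_theta(a) / pi_old(a)) * A(a) ] = sum_a pi_theta(a) * A(a)
    return np.sum(pi_old * (softmax(theta) / pi_old) * A)

# Gradient of the surrogate at theta = theta_old, via central finite differences.
eps = 1e-6
eye = np.eye(K)
grad_surrogate = np.array([
    (surrogate(theta_old + eps * eye[i]) - surrogate(theta_old - eps * eye[i])) / (2 * eps)
    for i in range(K)
])

# REINFORCE-style gradient at theta_old: sum_a pi_old(a) * A(a) * grad_theta log pi_theta(a).
# For a softmax policy, grad_theta log pi(a) = e_a - pi, which gives the closed form below.
grad_reinforce = pi_old * A - np.dot(pi_old, A) * pi_old

print(np.allclose(grad_surrogate, grad_reinforce, atol=1e-5))   # True
```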

The Hidden Approximation

However, this convenience comes with a hidden approximation. The IS correction above fixes the action distribution mismatch — we sample \(a \sim \pi_{\text{old}}\) but reweight to estimate expectations under \(\pi_\theta\) — but it does nothing about the state distribution. The surrogate objective evaluates states drawn from \(d^{\pi_{\text{old}}}\), the state visitation distribution of \(\pi_{\text{old}}\), while the true objective requires states from \(d^{\pi_\theta}\). These two distributions differ because the policy determines which states the agent visits: a policy that turns left at an intersection will visit an entirely different set of future states than one that turns right, so changing the policy changes not just action probabilities but the entire trajectory of states the agent encounters.

Formally, the gap between the surrogate and the true objective is bounded by the distribution mismatch coefficient \(\lVert d^{\pi_\theta} / d^{\pi_{\text{old}}} \rVert_\infty\), which is the worst-case ratio of state visitation probabilities between the two policies. Intuitively, if there exists some state \(s\) that \(\pi_\theta\) visits 100 times more often than \(\pi_{\text{old}}\), then the surrogate’s estimate of the objective at \(s\) is based on very few samples (or none at all), and the overall estimate can be wildly off. When \(\theta \approx \theta_{\text{old}}\), the two policies make nearly identical decisions, so they visit similar states, this ratio stays near 1, and the surrogate closely tracks the true objective.

But as \(\theta\) drifts from \(\theta_{\text{old}}\) over multiple gradient steps, the state distributions can diverge substantially. Consider a concrete example: suppose the old policy mostly goes right at a fork, collecting rewards along the right branch. After several updates, the new policy starts preferring left. The surrogate objective still evaluates the new policy using states from the right branch — states the new policy would rarely visit — and has almost no information about the left branch where the new policy actually operates. The surrogate might report that the policy is improving (because the IS-reweighted actions look good on the old states), while the true performance on the states the policy actually visits could be deteriorating.

In that regime, the surrogate overestimates the improvement, causing the policy update to overshoot and potentially degrade performance. This is precisely why methods like PPO constrain how far \(\theta\) can move from \(\theta_{\text{old}}\) in a single update.

PPO: Clipping the IS Ratio

Proximal Policy Optimization (PPO) addresses this directly by clipping the IS ratio. Define:

\[r_t(\theta) = \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\text{old}}(a_t \vert s_t)}\]

PPO’s clipped surrogate objective is:

\[L^{\text{CLIP}}(\theta) = \mathbb{E}\!\left[\min\!\Big(r_t(\theta)\, \hat{A}_t, \;\operatorname{clip}\!\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t\Big)\right]\]

where \(\epsilon\) is a small constant (typically 0.1–0.2). The \(\operatorname{clip}\) function restricts the ratio to \([1-\epsilon, 1+\epsilon]\), preventing any single state-action pair from having outsized influence on the update. The \(\min\) ensures we take the more pessimistic (conservative) estimate:

  • When \(\hat{A}_t > 0\) (good action): the ratio is capped at \(1 + \epsilon\), preventing the policy from moving too aggressively toward this action.
  • When \(\hat{A}_t < 0\) (bad action): the ratio is floored at \(1 - \epsilon\), preventing the policy from moving too aggressively away from this action.

This is the IS weight degeneracy problem, solved by brute force: rather than hoping for a well-matched proposal, we simply clip the weights to prevent them from ever becoming too large. The cost is some bias (we no longer have an unbiased estimator), but the variance reduction makes training far more stable.
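A minimal numpy sketch of the clipped surrogate itself; in practice this sits inside an autodiff framework, and the function and argument names below are just assumptions about how the logged data is laid out:

```python
import numpy as np

def ppo_clip_objective(logp_theta, logp_old, advantages, epsilon=0.2):
    """PPO clipped surrogate (to be maximized), averaged over logged (s_t, a_t) pairs.

    logp_theta:  log pi_theta(a_t | s_t) for each logged pair
    logp_old:    log pi_old(a_t | s_t) for the same pairs
    advantages:  advantage estimates A_hat_t
    """
    ratio = np.exp(logp_theta - logp_old)                    # r_t(theta)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)   # clip(r_t, 1 - eps, 1 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```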


The off-policy evaluation material in this post draws on Nan Jiang’s lecture notes on importance sampling and policy gradient from CS 443 at UIUC, which cover per-step IS, doubly robust estimators, natural policy gradient, and formal analysis of the distribution mismatch problem in much greater depth. For how these IS ideas play out in the language model setting, see RL on Language under Single-step Settings.
