Importance Sampling: Why and How

Importance Sampling Basics

Suppose we want to compute the expected value of a function \(f(x)\) under a distribution \(p(x)\):

\[\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx\]

When \(p(x)\) is complex or high-dimensional, this integral is often intractable. A natural approach is Monte Carlo estimation — draw \(N\) samples from \(p\) and average:

\[\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \quad x_i \sim p\]

This works well when we can sample from \(p\). But what if sampling from \(p\) is expensive, impossible, or inefficient? Note the distinction: evaluating \(p(x)\) at a given point (computing the density value) is usually easy — just plug \(x\) into the formula. Sampling from \(p(x)\) — generating random \(x\) values that follow \(p\)’s distribution — is the hard part.

Figure 1: Evaluating p(x) (left) is just arithmetic — plug in any x. Sampling from p(x) (right) means generating x values that cluster where p is high — which requires computing the very integral we want to avoid.

To see why sampling is hard, consider a distribution built by combining simpler pieces — say, a mixture of many Gaussians, or the product of a likelihood and a prior. Given any specific \(x\), you can plug it into the formula and compute \(p(x)\). But generating random \(x\) values that follow \(p\)’s shape is a different problem entirely. The standard approach is the inverse CDF method: draw \(U \sim \text{Uniform}(0,1)\) and compute \(x = F^{-1}(U)\), where

\[F(x) = \int_{-\infty}^{x} p(t)\, dt\]

is the cumulative distribution function. But computing \(F\) requires solving an integral of \(p\) — the very type of computation we are trying to avoid. This is a circular dependency: to sample from \(p\), we need the CDF, but the CDF is an integral over \(p\).
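To make the contrast concrete, here is a minimal Python sketch (assuming only numpy) of inverse-CDF sampling for a case where it does work: the exponential distribution, whose CDF inverts in closed form. For a general \(p\), it is exactly this \(F^{-1}\) step that we cannot compute.

```python
import numpy as np

# Inverse-CDF sampling works when F^{-1} is available in closed form.
# For Exponential(rate): F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -ln(1 - u) / rate.
def sample_exponential(rate: float, n: int, rng: np.random.Generator) -> np.ndarray:
    u = rng.uniform(0.0, 1.0, size=n)   # U ~ Uniform(0, 1)
    return -np.log1p(-u) / rate         # x = F^{-1}(U)

rng = np.random.default_rng(0)
samples = sample_exponential(rate=2.0, n=100_000, rng=rng)
print(samples.mean())  # ~0.5, the mean of Exponential(rate=2)
```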

So we are stuck: the Monte Carlo estimator above avoids intractable integrals by using samples, but generating those samples from \(p\) requires the CDF, which is itself an integral over \(p\) (as shown above). What about computing the integral directly? In one dimension, we could lay down a grid of \(n\) points and sum \(f(x_i)\, p(x_i)\, \Delta x\). But in \(d\) dimensions, each dimension needs its own set of \(n\) grid points, and we must evaluate every combination — so the total number of grid points is \(n \times n \times \cdots \times n = n^d\). Even a modest \(n = 10\) with \(d = 100\) requires \(10^{100}\) evaluations, far more than atoms in the universe. Importance sampling breaks this deadlock. It is still a Monte Carlo method — it still uses random samples to estimate the integral — but instead of sampling from the hard distribution \(p\), it samples from an easy distribution \(q\) and corrects for the mismatch.

Figure 2: Why not just integrate numerically? Grid methods explode exponentially in high dimensions. Monte Carlo avoids this but needs samples from p. Importance sampling solves both problems.

The key idea is simple: multiply and divide by a proposal distribution \(q(x)\) that we can sample from:

\[\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx\] \[= \mathbb{E}_{x \sim q}\!\left[f(x)\, \frac{p(x)}{q(x)}\right]\]

The ratio \(w(x) = \frac{p(x)}{q(x)}\) is the importance weight (or importance sampling ratio). It corrects for the mismatch between the distribution we sample from (\(q\)) and the distribution we care about (\(p\)).

The Monte Carlo estimator becomes:

\[\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q\]

This estimator is unbiased — if we repeated the entire procedure (draw \(N\) samples, compute the weighted average) many times, the average of these estimates would converge to the true expectation under \(p\). Any single run with finite samples will differ from the true value due to variance, but there is no systematic over- or under-estimation. This holds for any choice of \(q\), as long as \(q(x) > 0\) wherever \(p(x) f(x) \neq 0\).

Figure 3: Importance sampling in action. Samples drawn from the proposal q(x) are reweighted by p(x)/q(x) to estimate expectations under the target p(x).

We can also watch this convergence directly by increasing \(N\): as we draw more samples from \(q\) and update the running weighted average, the IS estimate approaches the true value.

Figure 4: As the number of samples N grows, the IS estimate (red) converges toward the true expectation (green dashed). The bottom plot shows p(x) and q(x) with accumulated samples on the x-axis.
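Here is a minimal numpy sketch of the whole recipe: sample from \(q\), weight by \(p/q\), average. The particular \(p\), \(q\), and \(f\) below are illustrative choices, not anything fixed by the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative choices: target p = N(2, 0.7^2), proposal q = N(1, 1.5^2), f(x) = x^2.
mu_p, sig_p = 2.0, 0.7
mu_q, sig_q = 1.0, 1.5
f = lambda x: x ** 2

N = 100_000
x = rng.normal(mu_q, sig_q, size=N)                           # x_i ~ q
w = normal_pdf(x, mu_p, sig_p) / normal_pdf(x, mu_q, sig_q)   # importance weights p(x_i)/q(x_i)
is_estimate = np.mean(f(x) * w)                               # (1/N) * sum f(x_i) w(x_i)

print(is_estimate, mu_p**2 + sig_p**2)   # E_p[x^2] = mu^2 + sigma^2; the two should be close
```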

Why Does This Work?

The intuition is straightforward:

  • If \(q\) oversamples a region relative to \(p\), the ratio \(p/q < 1\) downweights those samples.
  • If \(q\) undersamples a region relative to \(p\), the ratio \(p/q > 1\) upweights those samples.

The reweighting exactly compensates for the distributional mismatch.

Variance of the Estimator

While the IS estimator is unbiased for any valid \(q\), the choice of \(q\) dramatically affects variance. The variance of the IS estimator is:

\[\text{Var}_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right] = \mathbb{E}_{x \sim q}\!\left[\left(f(x)\,\frac{p(x)}{q(x)}\right)^2\right] - \left(\mathbb{E}_{x \sim p}[f(x)]\right)^2\]

The first term can explode when \(p(x)/q(x)\) is large — i.e., when \(q\) assigns low probability to regions where \(p\) assigns high probability.

The optimal proposal that minimizes variance is:

\[q^*(x) \propto \vert f(x) \vert \, p(x)\]

This assigns more probability mass where the integrand \(\vert f(x)\vert p(x)\) is large. In practice, \(q^*\) is rarely available (computing it requires knowing the integral we are trying to estimate), but it tells us what a good proposal looks like: match the shape of the integrand.
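A small numpy experiment makes the point: with \(f(x) = x\) and a Gaussian target (illustrative choices), a proposal shaped roughly like \(\vert f(x)\vert\, p(x)\) gives a much smaller single-sample variance than one centered elsewhere.

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p = N(2, 0.7^2) and f(x) = x, so |f| p is concentrated near x = 2.
mu_p, sig_p = 2.0, 0.7
f = lambda x: x

def is_variance(mu_q, sig_q, n=200_000):
    """Empirical variance of a single weighted sample f(x) * p(x)/q(x), with x ~ q."""
    x = rng.normal(mu_q, sig_q, size=n)
    w = normal_pdf(x, mu_p, sig_p) / normal_pdf(x, mu_q, sig_q)
    return np.var(f(x) * w)

print(is_variance(mu_q=2.0, sig_q=0.8))   # proposal roughly matching |f| p: small variance
print(is_variance(mu_q=0.0, sig_q=1.3))   # poorly matched proposal: much larger variance
```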

When Things Go Wrong

Importance sampling has a well-known failure mode: high variance due to weight degeneracy. If \(p\) and \(q\) are poorly matched, a few samples can dominate the estimate with enormous weights while most samples contribute almost nothing.

A useful diagnostic is the effective sample size (ESS):

\[N_{\text{eff}} = \frac{\left(\sum_{i=1}^{N} w_i\right)^2}{\sum_{i=1}^{N} w_i^2}\]

When all weights are equal, \(N_{\text{eff}} = N\). When one weight dominates, \(N_{\text{eff}} \approx 1\). A low ESS signals that the proposal \(q\) is a poor match for \(p\) and the estimate is unreliable.

To see this in action, compare the well-matched proposal in Figure 4 with the poorly-matched one below. Here \(q = \mathcal{N}(0, 1.3)\) is centered far from \(p = \mathcal{N}(2, 0.7)\): most samples land near \(x = 0\) where \(p(x) \approx 0\), so their weights are nearly zero. The few samples that reach \(p\)’s peak carry enormous weights, causing the estimate to jump erratically and converge slowly.

Figure 5: Weight degeneracy with a poorly matched proposal. The IS estimate (red) wanders far from the true value (green dashed) even at N = 500. The ESS badge shows that only a small fraction of samples carry meaningful weight.
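As a diagnostic, ESS is a one-liner. The sketch below (numpy; parameters chosen to mirror the figures, otherwise illustrative) computes it for a well-matched and a poorly-matched proposal.

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def effective_sample_size(w):
    """N_eff = (sum w)^2 / sum w^2: equals N for uniform weights, ~1 under degeneracy."""
    return np.sum(w) ** 2 / np.sum(w ** 2)

mu_p, sig_p = 2.0, 0.7   # target p = N(2, 0.7), as in the figures

def ess_for_proposal(mu_q, sig_q, n=500):
    x = rng.normal(mu_q, sig_q, size=n)
    w = normal_pdf(x, mu_p, sig_p) / normal_pdf(x, mu_q, sig_q)
    return effective_sample_size(w)

print(ess_for_proposal(mu_q=2.0, sig_q=0.8))   # well matched: ESS close to N = 500
print(ess_for_proposal(mu_q=0.0, sig_q=1.3))   # poorly matched: ESS is a small fraction of N
```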

This variance problem gets worse in high dimensions, where even small distributional mismatches compound across dimensions — a phenomenon sometimes called the curse of dimensionality for importance sampling.

Importance Sampling in RL

The machinery from Section 1 — reweighting samples from one distribution to estimate expectations under another — turns out to be exactly what we need in reinforcement learning (RL). The core problem: we have data collected by one policy, but want to learn about a different policy.

The Off-Policy Problem

In RL, an agent interacts with an environment by choosing actions according to a policy \(\pi(a \vert s)\) — a distribution over actions given a state. Each interaction produces a trajectory:

\[\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1})\]

The goal is to find a policy that maximizes expected cumulative reward. Policy optimization algorithms do this iteratively: collect data with the current policy \(\pi_{\text{old}}\), use the data to compute a better policy \(\pi_{\text{new}}\), then repeat. But here is the problem: as soon as we update the policy, all the data we just collected came from the wrong distribution. The trajectories were generated by \(\pi_{\text{old}}\), but we need to evaluate expectations under \(\pi_{\text{new}}\). This is exactly the importance sampling setup from Section 1, with \(\pi_{\text{old}}\) playing the role of the proposal \(q\) and \(\pi_{\text{new}}\) playing the role of the target \(p\).

Trajectory Reweighting

Suppose we want to estimate the expected return \(J(\pi_\theta) := \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\) of a target policy \(\pi_\theta\) using trajectories collected by a behavior policy \(\pi_\beta\), where \(R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t\) is the discounted return. The full probability of a trajectory under any policy involves both the policy’s action probabilities and the environment’s transition dynamics:

\[P_{\pi}(\tau) = d_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \vert s_t) \, P(s_{t+1} \vert s_t, a_t)\]

Figure 6: Anatomy of a trajectory. The probability is an alternating product of policy terms (red) and environment terms (green). Toggle to see how environment terms vanish in the gradient or cancel in the IS ratio — only policy terms survive.

where \(d_0\) is the initial state distribution and \(P(s_{t+1} \vert s_t, a_t)\) is the transition probability. The key observation is that \(d_0\) and \(P\) do not change when we deploy different policies — both policies start from the same initial distribution and interact with the same environment dynamics. So in the IS ratio, they cancel:

A note on \(d_0\) vs \(d^\pi\).

In RL, the letter \(d\) appears in two distinct roles:

  • \(d_0\) = the initial state distribution, defined as part of the MDP \(M = (S, A, P, R, \gamma, d_0)\). It determines the starting state \(s_0 \sim d_0\) before the policy takes any action. Since the policy has not acted yet, \(d_0\) is the same regardless of which policy we deploy.
  • \(d^\pi\) (or \(d^\pi_t\)) = the state visitation distribution induced by running policy \(\pi\). At time \(t > 0\), the distribution over states depends on which actions the policy took, so \(d^\pi_t\) is a joint property of the policy and the environment dynamics. The discounted occupancy \(d^\pi = (1-\gamma) \sum_t \gamma^t d^\pi_t\) aggregates this across time.

In the trajectory probability above, \(d_0(s_0)\) is policy-independent and cancels in the IS ratio. Later, in the policy gradient section, \(d^{\pi_\theta}\) appears and does not cancel — this is the source of the distribution mismatch problem.

\[\frac{P_{\pi_\theta}(\tau)}{P_{\pi_\beta}(\tau)} = \frac{d_0(s_0) \prod_t \pi_\theta(a_t \vert s_t) \, P(s_{t+1} \vert s_t, a_t)}{d_0(s_0) \prod_t \pi_\beta(a_t \vert s_t) \, P(s_{t+1} \vert s_t, a_t)}\] \[= \prod_{t=0}^{T-1} \frac{\pi_\theta(a_t \vert s_t)}{\pi_\beta(a_t \vert s_t)}\]

Applying IS, we can estimate the return of \(\pi_\theta\) using data from \(\pi_\beta\):

\[J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\beta}\!\left[\prod_{t=0}^{T-1} \frac{\pi_\theta(a_t \vert s_t)}{\pi_\beta(a_t \vert s_t)} \; R(\tau)\right]\]

This is the per-trajectory IS estimator. We only need to know the action probabilities under both policies — no knowledge of the environment dynamics is required.
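In code, the per-trajectory estimator needs nothing beyond the logged returns and the two policies' action log-probabilities along each trajectory. A minimal numpy sketch, where the array shapes are assumptions about how the logged data is stored:

```python
import numpy as np

def per_trajectory_is(returns, logp_theta, logp_beta):
    """Per-trajectory IS estimate of J(pi_theta) from trajectories logged under pi_beta.

    returns:     (num_traj,) array of discounted returns R(tau)
    logp_theta:  (num_traj, T) array of log pi_theta(a_t | s_t) along each trajectory
    logp_beta:   (num_traj, T) array of log pi_beta(a_t | s_t) along each trajectory
    """
    # Products of per-step ratios are computed in log space for numerical stability.
    log_w = np.sum(logp_theta - logp_beta, axis=1)   # log prod_t pi_theta / pi_beta
    return np.mean(np.exp(log_w) * returns)
```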

The Compounding Problem

The product of per-step ratios is where things get dangerous. Even if each individual ratio \(\frac{\pi_\theta(a_t \vert s_t)}{\pi_\beta(a_t \vert s_t)}\) is well-behaved, their product over \(T\) steps can explode or collapse:

\[\prod_{t=0}^{T-1} \frac{\pi_\theta(a_t \vert s_t)}{\pi_\beta(a_t \vert s_t)}\]

If \(\pi_\theta\) and \(\pi_\beta\) differ even slightly — say each ratio averages 1.1 — after 100 steps the product is \(1.1^{100} \approx 13{,}781\). Conversely, if each ratio averages 0.9, the product is \(0.9^{100} \approx 0.00003\). This is the same weight degeneracy from Section 1, but now compounded exponentially across time steps.

Figure 7: Compounding IS ratios. Each line tracks the cumulative product of per-step IS ratios for one trajectory. Even with moderate per-step divergence (σ = 0.20), the products fan out exponentially — a few trajectories carry nearly all the weight while the rest contribute almost nothing. Use the divergence slider to see how even small increases in per-step variance dramatically worsen the effect.
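A tiny simulation in the spirit of Figure 7, using an illustrative noise model (i.i.d. Gaussian per-step log-ratios, each ratio with mean 1) rather than a real MDP:

```python
import numpy as np

rng = np.random.default_rng(3)

T, num_traj, sigma = 100, 1000, 0.20   # horizon, number of trajectories, per-step log-ratio spread
# Illustrative model: log-ratios ~ N(-sigma^2/2, sigma^2), so each per-step ratio has mean 1.
log_ratios = rng.normal(-0.5 * sigma**2, sigma, size=(num_traj, T))
weights = np.exp(log_ratios.sum(axis=1))     # cumulative product of ratios over T steps

print(weights.mean())                        # ~1: the product is still mean-one ...
print(np.median(weights), weights.max())     # ... but the median collapses and the max explodes
```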

Per-Step IS

Can we do better than the per-trajectory estimator? The answer is yes — if we look carefully at what the per-trajectory estimator is actually doing to each reward.

The return is a sum of per-step rewards: \(R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t\). Writing \(\rho_{j:k} = \prod_{t=j}^{k} \frac{\pi_\theta(a_t \vert s_t)}{\pi_\beta(a_t \vert s_t)}\) for a product of per-step ratios, the per-trajectory estimator multiplies this entire sum by the full product \(\rho_{0:T-1}\):

\[\rho_{0:T-1} \, R(\tau) = \sum_{t=0}^{T-1} \gamma^t \, \rho_{0:T-1} \, r_t\]

Now consider a single term \(\rho_{0:T-1} \, r_t\). We can split the product at time \(t\):

\[\rho_{0:T-1} \, r_t = \rho_{0:t} \cdot \rho_{t+1:T-1} \cdot r_t\]

Here is the key observation: \(r_t\) depends only on \((s_0, a_0, \ldots, s_t, a_t)\) — it is determined by the history up to time \(t\). The future ratios \(\rho_{t+1:T-1}\) depend on actions \(a_{t+1}, \ldots, a_{T-1}\), which have no causal effect on \(r_t\). Under the behavior policy, the conditional expectation of each future ratio is 1:

\[\mathbb{E}_{a_k \sim \pi_\beta}\!\left[\frac{\pi_\theta(a_k \vert s_k)}{\pi_\beta(a_k \vert s_k)} \;\Big\vert\; s_k\right] = \sum_a \pi_\beta(a \vert s_k) \frac{\pi_\theta(a \vert s_k)}{\pi_\beta(a \vert s_k)}\] \[= \sum_a \pi_\theta(a \vert s_k) = 1\]

So \(\mathbb{E}[\rho_{t+1:T-1} \mid s_0, a_0, \ldots, s_t, a_t] = 1\). Dropping \(\rho_{t+1:T-1}\) does not change the expectation of each term but removes a source of multiplicative noise. This gives the per-step IS estimator:

\[\hat{J}_{\text{per-step}} = \sum_{t=0}^{T-1} \gamma^t \, \rho_{0:t} \, r_t, \quad \text{where } \rho_{0:t} = \prod_{k=0}^{t} \frac{\pi_\theta(a_k \vert s_k)}{\pi_\beta(a_k \vert s_k)}\]

This is still unbiased but has strictly lower variance — each reward \(r_t\) is weighted only by the ratios that causally precede it. In the worst case — uniform behavior policy, deterministic target, \(K\) actions — per-trajectory IS has variance proportional to \(K^T\), while per-step IS reduces this since the term for \(r_t\) only compounds \(t\) ratios instead of \(T\).
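A numpy sketch of the per-step estimator, under the same assumed data layout as the per-trajectory sketch above (rewards and per-step log-probabilities stored as (num_traj, T) arrays):

```python
import numpy as np

def per_step_is(rewards, logp_theta, logp_beta, gamma=0.99):
    """Per-step IS estimate of J(pi_theta) from trajectories logged under pi_beta.

    rewards:     (num_traj, T) array of per-step rewards r_t
    logp_theta:  (num_traj, T) array of log pi_theta(a_t | s_t)
    logp_beta:   (num_traj, T) array of log pi_beta(a_t | s_t)
    """
    T = rewards.shape[1]
    # rho_{0:t} for every t, via a cumulative sum of log-ratios along the time axis.
    rho_0_to_t = np.exp(np.cumsum(logp_theta - logp_beta, axis=1))
    discounts = gamma ** np.arange(T)
    # Each reward r_t is weighted only by the ratios that causally precede it.
    per_traj_estimates = np.sum(discounts * rho_0_to_t * rewards, axis=1)
    return np.mean(per_traj_estimates)
```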

Two structural properties of MDPs make this possible:

  1. Additive rewards: the return decomposes as \(R(\tau) = \sum_t \gamma^t r_t\), so we can treat each reward separately.
  2. Causal structure: \(r_t\) does not depend on future actions \(a_{t+1:T-1}\), so the future IS ratios are pure noise with conditional expectation 1.

If the return were a non-decomposable function of the entire trajectory (e.g., a product of all rewards, or some function depending on the final state only), this decomposition would not be possible, and we would be stuck with per-trajectory IS.

Figure 8: Per-trajectory IS vs per-step IS. Top: weight structure — per-trajectory applies the full product ρ0:T−1 to every reward, while per-step applies only ρ0:t to rt. Middle: 500 sample estimates under each method (green dashed = true value). Bottom: variance ratio grows with horizon — the longer the trajectory, the more per-step IS helps.

The per-step estimator also has an elegant recursive interpretation. Define \(v_0 = 0\) and

\[v_{T-t} = \rho_t\!\left(r_t + \gamma \, v_{T-t-1}\right)\]

Then \(v_T\) equals the per-step estimator. At each step, this applies single-step (bandit) IS with ratio \(\rho_t\) to the “reward” \(r_t + \gamma \, v_{T-t-1}\), recursing backward through time.
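The equivalence between this recursion and the explicit sum is easy to check numerically; a self-contained sketch with randomly generated per-step ratios and rewards (purely illustrative values):

```python
import numpy as np

rng = np.random.default_rng(4)
T, gamma = 10, 0.99
rho = rng.uniform(0.5, 1.5, size=T)   # illustrative per-step ratios rho_t
r = rng.normal(size=T)                # illustrative per-step rewards r_t

# Explicit per-step IS sum: sum_t gamma^t * (rho_0 * ... * rho_t) * r_t
explicit = sum(gamma**t * np.prod(rho[: t + 1]) * r[t] for t in range(T))

# Backward recursion: v <- rho_t * (r_t + gamma * v), starting from v = 0 at the end.
v = 0.0
for t in reversed(range(T)):
    v = rho[t] * (r[t] + gamma * v)

print(np.isclose(explicit, v))   # True: the two forms agree
```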

IS in Policy Gradient Methods

The IS ratios we have studied — per-trajectory and per-step — appear directly in policy gradient methods. For a derivation of REINFORCE and the advantage function, see the companion post on policy gradient and actor-critic. Here we focus on how IS enables off-policy policy optimization.

The Surrogate Objective

The policy gradient (derived in the companion post) is:

\[\nabla_\theta J = \mathbb{E}_{s \sim d^{\pi_\theta}, \, a \sim \pi_\theta}\!\left[A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]

This requires sampling from \(\pi_\theta\) itself — fresh data after every update. To reuse data from an old policy \(\pi_{\text{old}}\), we apply IS. For a fixed state \(s\), write the expectation over actions as an explicit sum:

\[\nabla_\theta J = \sum_a \pi_\theta(a \vert s) \, A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\]

Multiply and divide by \(\pi_{\text{old}}(a \vert s)\):

\[= \sum_a \pi_{\text{old}}(a \vert s) \, \frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)} \, A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\]

Since \(\pi_{\text{old}}(a \vert s)\) is now the sampling distribution, this is an expectation under \(\pi_{\text{old}}\):

\[= \mathbb{E}_{s \sim d^{\pi_{\text{old}}}, \, a \sim \pi_{\text{old}}}\!\left[\underbrace{\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)}}_{\text{IS ratio}} \, \underbrace{A^{\pi_\theta}(s, a)}_{\substack{\text{how good} \\ \text{is action } a}} \, \underbrace{\nabla_\theta \log \pi_\theta(a \vert s)}_{\substack{\text{direction to make} \\ a \text{ more likely}}}\right]\]

The IS ratio \(\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)}\) corrects for the fact that \(a\) was sampled from \(\pi_{\text{old}}\), not \(\pi_\theta\). Crucially, this is a single-step ratio — no product over time — so the compounding problem does not apply.

Why only single-step? Because the policy gradient is an expectation over \((s, a)\) pairs, not over trajectories. The IS correction here only changes the action sampling distribution at a given state from \(\pi_\theta\) to \(\pi_{\text{old}}\) — a single ratio suffices for that. The state distribution \(d^{\pi_\theta}\) is silently replaced by \(d^{\pi_{\text{old}}}\) without any IS correction. If we also corrected for the state distribution mismatch, we would need to importance-weight the state visitation probabilities, which would reintroduce trajectory-level products and the compounding problem. The surrogate objective avoids this by simply ignoring the state distribution shift — an approximation that is accurate when \(\theta \approx \theta_{\text{old}}\) but degrades as the policies diverge.

Rather than work with the gradient directly, we define the surrogate objective whose gradient gives us the policy gradient:

\[L^{\text{IS}}(\theta) = \mathbb{E}_{s, a \sim \pi_{\text{old}}}\!\left[\frac{\pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)} \, A^{\pi_\theta}(s, a)\right]\]

To see why the gradient of \(L^{\text{IS}}\) recovers the true policy gradient, we differentiate. The advantage is treated as a fixed quantity (in practice it is an estimate computed from \(\pi_{\text{old}}\)'s data), so the only term that depends on \(\theta\) is the IS ratio:

\[\nabla_\theta L^{\text{IS}}(\theta) = \mathbb{E}_{s, a \sim \pi_{\text{old}}}\!\left[\frac{\nabla_\theta \pi_\theta(a \vert s)}{\pi_{\text{old}}(a \vert s)} \, A^{\pi_\theta}(s, a)\right]\]

Now evaluate at \(\theta = \theta_{\text{old}}\). We use the log-derivative trick: \(\nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta\). Substituting and setting \(\theta = \theta_{\text{old}}\), the \(\pi_{\text{old}}\) in the numerator cancels with the denominator:

\[\nabla_\theta L^{\text{IS}}\big\vert_{\theta = \theta_{\text{old}}} = \mathbb{E}_{s, a \sim \pi_{\text{old}}}\!\left[\frac{\pi_{\text{old}}(a \vert s)}{\pi_{\text{old}}(a \vert s)} \, A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\] \[= \mathbb{E}_{s, a \sim \pi_{\text{old}}}\!\left[A^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(a \vert s)\right]\]

This is exactly the REINFORCE gradient. So we can take gradient steps on \(L^{\text{IS}}\) using data collected once from \(\pi_{\text{old}}\), without recollecting trajectories after each update.
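This identity can be checked numerically in a single-state, bandit-style setting with a softmax policy and fixed advantage estimates; everything below (action count, parameter values) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
K = 4                              # number of actions in one illustrative state
theta_old = rng.normal(size=K)     # old policy logits
A = rng.normal(size=K)             # fixed advantage estimates, one per action

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

pi_old = softmax(theta_old)

def surrogate(theta):
    # L_IS(theta) = E_{a ~ pi_old}[ (pi_theta(a) / pi_old(a)) * A(a) ] = sum_a pi_theta(a) * A(a)
    return np.sum(pi_old * (softmax(theta) / pi_old) * A)

# Gradient of the surrogate at theta = theta_old, via central finite differences.
eps = 1e-6
eye = np.eye(K)
grad_surrogate = np.array([
    (surrogate(theta_old + eps * eye[i]) - surrogate(theta_old - eps * eye[i])) / (2 * eps)
    for i in range(K)
])

# REINFORCE-style gradient at theta_old: sum_a pi_old(a) * A(a) * grad_theta log pi_theta(a).
# For a softmax policy, grad_theta log pi(a) = e_a - pi, which gives the closed form below.
grad_reinforce = pi_old * A - np.dot(pi_old, A) * pi_old

print(np.allclose(grad_surrogate, grad_reinforce, atol=1e-5))   # True
```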

The Hidden Approximation

However, this convenience comes with a hidden approximation. The IS correction above fixes the action distribution mismatch — we sample \(a \sim \pi_{\text{old}}\) but reweight to estimate expectations under \(\pi_\theta\) — but it does nothing about the state distribution. The surrogate objective evaluates states drawn from \(d^{\pi_{\text{old}}}\), the state visitation distribution of \(\pi_{\text{old}}\), while the true objective requires states from \(d^{\pi_\theta}\). These two distributions differ because the policy determines which states the agent visits: a policy that turns left at an intersection will visit an entirely different set of future states than one that turns right, so changing the policy changes not just action probabilities but the entire trajectory of states the agent encounters.

Formally, the gap between the surrogate and the true objective is bounded by the distribution mismatch coefficient \(\lVert d^{\pi_\theta} / d^{\pi_{\text{old}}} \rVert_\infty\), which is the worst-case ratio of state visitation probabilities between the two policies. Intuitively, if there exists some state \(s\) that \(\pi_\theta\) visits 100 times more often than \(\pi_{\text{old}}\), then the surrogate’s estimate of the objective at \(s\) is based on very few samples (or none at all), and the overall estimate can be wildly off. When \(\theta \approx \theta_{\text{old}}\), the two policies make nearly identical decisions, so they visit similar states, this ratio stays near 1, and the surrogate closely tracks the true objective.

But as \(\theta\) drifts from \(\theta_{\text{old}}\) over multiple gradient steps, the state distributions can diverge substantially. Consider a concrete example: suppose the old policy mostly goes right at a fork, collecting rewards along the right branch. After several updates, the new policy starts preferring left. The surrogate objective still evaluates the new policy using states from the right branch — states the new policy would rarely visit — and has almost no information about the left branch where the new policy actually operates. The surrogate might report that the policy is improving (because the IS-reweighted actions look good on the old states), while the true performance on the states the policy actually visits could be deteriorating.

In that regime, the surrogate overestimates the improvement, causing the policy update to overshoot and potentially degrade performance. This is precisely why methods like PPO constrain how far \(\theta\) can move from \(\theta_{\text{old}}\) in a single update.

PPO: Clipping the IS Ratio

Proximal Policy Optimization (PPO) addresses this directly by clipping the IS ratio. Define:

\[r_t(\theta) = \frac{\pi_\theta(a_t \vert s_t)}{\pi_{\text{old}}(a_t \vert s_t)}\]

PPO’s clipped surrogate objective is:

\[L^{\text{CLIP}}(\theta) = \mathbb{E}\!\left[\min\!\Big(r_t(\theta)\, \hat{A}_t, \;\operatorname{clip}\!\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t\Big)\right]\]

where \(\epsilon\) is a small constant (typically 0.1–0.2). The \(\operatorname{clip}\) function restricts the ratio to \([1-\epsilon, 1+\epsilon]\), preventing any single state-action pair from having outsized influence on the update. The \(\min\) ensures we take the more pessimistic (conservative) estimate:

  • When \(\hat{A}_t > 0\) (good action): the ratio is capped at \(1 + \epsilon\), preventing the policy from moving too aggressively toward this action.
  • When \(\hat{A}_t < 0\) (bad action): the ratio is floored at \(1 - \epsilon\), preventing the policy from moving too aggressively away from this action.

This is the IS weight degeneracy problem, solved by brute force: rather than hoping for a well-matched proposal, we simply clip the weights to prevent them from ever becoming too large. The cost is some bias (we no longer have an unbiased estimator), but the variance reduction makes training far more stable.
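A minimal numpy sketch of the clipped surrogate itself; in practice this sits inside an autodiff framework, and the function and argument names below are just assumptions about how the logged data is laid out:

```python
import numpy as np

def ppo_clip_objective(logp_theta, logp_old, advantages, epsilon=0.2):
    """PPO clipped surrogate (to be maximized), averaged over logged (s_t, a_t) pairs.

    logp_theta:  log pi_theta(a_t | s_t) for each logged pair
    logp_old:    log pi_old(a_t | s_t) for the same pairs
    advantages:  advantage estimates A_hat_t
    """
    ratio = np.exp(logp_theta - logp_old)                    # r_t(theta)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)   # clip(r_t, 1 - eps, 1 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```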


The off-policy evaluation material in this post draws on Nan Jiang’s lecture notes on importance sampling and policy gradient from CS 443 at UIUC, which cover per-step IS, doubly robust estimators, natural policy gradient, and formal analysis of the distribution mismatch problem in much greater depth. For how these IS ideas play out in the language model setting, see RL on Language under Single-step Settings.
