Importance sampling is a technique for reducing the variance of a Monte Carlo estimate of an integral.
Let’s first see how the traditional Monte Carlo method estimates an integral [2]. To estimate $latex \int_a^b f(x) dx$, one can think of reshaping the area under the curve into a rectangle whose width is $latex b-a$ and whose height is the expected value $latex \mathbb{E}_{\mathcal{U}[a,b]}[f(X)]$, a virtual, expected height.
Therefore, to use Monte Carlo estimation, we can draw $latex N$ numbers $latex x_1, \dots, x_N$ from the uniform distribution $latex \mathcal{U}[a,b]$ and use the following quantity as an approximation to $latex \int_a^b f(x) dx$:
$latex \frac{(b-a)}{N}\sum\limits_{i=1}^N f(x_i) &s=2$
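As a minimal sketch of this estimator (the integrand $latex x^2$ on $latex [0,3]$ is only an illustrative choice, with exact integral 9; the function name is hypothetical):

# Minimal sketch: plain Monte Carlo estimate of an integral by uniform sampling.
import numpy

def mc_integrate(f, a, b, n=100000):
    x = numpy.random.uniform(low=a, high=b, size=n)  # x_i ~ U[a, b]
    return (b - a) * numpy.mean(f(x))                # width * average height

print(mc_integrate(lambda x: x**2, 0., 3.))  # should be close to 9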
Of course, this estimate varies from run to run (each run is an estimate based on $latex N$ samples), so the Monte Carlo estimator of the integral has variance. To reduce this variance, we use importance sampling [1]. The idea is to rewrite the integral as an expectation under an auxiliary, known distribution with density $latex h(x)$:
$latex \int_a^b f(x) dx = \int_a^b f(x) \frac{h(x)}{h(x)} dx = \int_a^b \frac{f(x)}{h(x)} h(x) dx = \mathbb{E}_h \big(\frac{f(X)}{h(X)} \big) &s=2$
Using the Monte Carlo method to estimate this expectation, we have:
$latex \mathbb{E}_h \big(\frac{f(X)}{h(X)} \big) \approx \frac{1}{N} \sum\limits_{i=1}^N \frac{f(X_i)}{h(X_i)}, \quad X_i \sim h(x) &s=2$
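To make the formula concrete, here is a minimal sketch of this estimator; `importance_sampling_estimate`, `sample_h`, and `pdf_h` are hypothetical names, and the integrand and proposal in the example are chosen only for illustration:

# Minimal sketch of the importance-sampling estimator above.
import numpy
from scipy.stats import norm

def importance_sampling_estimate(f, sample_h, pdf_h, n=100000):
    x = sample_h(n)                     # X_i ~ h(x)
    return numpy.mean(f(x) / pdf_h(x))  # average of f(X_i) / h(X_i)

# Illustration: integrate the standard normal pdf over the real line (exact value: 1),
# using a wider normal N(0, 2^2) as h(x).
est = importance_sampling_estimate(
    norm.pdf,
    lambda n: numpy.random.normal(loc=0., scale=2., size=n),
    lambda x: norm.pdf(x, loc=0., scale=2.))
print(est)  # should be close to 1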
When $latex \alpha h(x) = f(x)$ (assuming $latex f(x)>0$), the importance sampling estimator has zero variance, because $latex \frac{f(X_i)}{h(X_i)} = \alpha$ is constant regardless of the sampled $latex X_i$. Though a nice property, zero variance is impossible to achieve in practice: if you knew a density $latex h(x)$ exactly proportional to $latex f(x)$, the proportionality constant $latex \alpha$ would already be the value of the integral (since $latex h(x)$ integrates to 1, $latex \int_a^b f(x) dx = \alpha \int_a^b h(x) dx = \alpha$), so you would not need Monte Carlo in the first place. The common solution is to pick a known distribution $latex h(x)$ whose shape is close to that of $latex f(x)$; the closer $latex h(x)$ is to $latex f(x)$, the smaller the variance. If $latex h(x)$ is the uniform distribution on $latex (a,b)$, importance sampling reduces to the traditional Monte Carlo method introduced at the beginning of this article.
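A quick numerical check of the zero-variance case, using a toy example where $latex f(x) = 6x^2$ on $latex (0,1)$ (so the integral is $latex \alpha = 2$) and $latex h(x) = 3x^2$:

# Zero-variance check: f(x) = 6*x**2 on (0, 1), h(x) = 3*x**2, so f/h = 2 = alpha
# for every sample and the estimator has zero variance.
import numpy

u = numpy.random.uniform(size=10000)
x = u ** (1. / 3.)                   # inverse-CDF sampling from h: H(x) = x**3
ratios = (6. * x**2) / (3. * x**2)   # f(x) / h(x)
print(ratios.mean(), ratios.std())   # prints 2.0 and 0.0 (up to floating point)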
Here is the code to verify that importance sampling really reduces the variance when estimating the integral.
# estimate the integral of a normal density (mean=0, std=1) over the range (a, b)
from scipy.stats import norm
import numpy
from scipy import stats

T = 200  # num of rounds
N = 10   # num of samples per round
a, b = -2., 2.  # lower bound, upper bound
fx_loc, fx_scale = 0, 1  # f(x) is a normal density (mean=0, std=1)

# h1(x): triangle density peaked at 0, defined on (a, b)
class hx1_dist_gen(stats.rv_continuous):
    def _pdf(self, x):
        return 1. / b - 1. / (b**2) * numpy.abs(x)

# h2(x): density peaked at the endpoints, defined on (a, b)
class hx2_dist_gen(stats.rv_continuous):
    def _pdf(self, x):
        return 1. / (b**2) * numpy.abs(x)

hx1_dist = hx1_dist_gen(a=a, b=b)
hx2_dist = hx2_dist_gen(a=a, b=b)

# traditional Monte Carlo estimation
mc_res = []
for t in range(T):
    print("MC round %d" % t)
    x = numpy.random.uniform(low=a, high=b, size=N)
    fxs = norm.pdf(x, loc=fx_loc, scale=fx_scale)
    mc_res.append((b - a) * numpy.mean(fxs))

# importance sampling estimation using h1(x)
is1_res = []
for t in range(T):
    print("Importance Sampling h1(x) round %d" % t)
    x = hx1_dist.rvs(size=N)  # draw x in the range (a, b) from h1(x)
    fx = norm.pdf(x, loc=fx_loc, scale=fx_scale)
    hx = hx1_dist.pdf(x)
    fx_div_hxs = fx / hx
    is1_res.append(numpy.mean(fx_div_hxs))

# importance sampling estimation using h2(x)
is2_res = []
for t in range(T):
    print("Importance Sampling h2(x) round %d" % t)
    x = hx2_dist.rvs(size=N)  # draw x in the range (a, b) from h2(x)
    fx = norm.pdf(x, loc=fx_loc, scale=fx_scale)
    hx = hx2_dist.pdf(x)
    fx_div_hxs = fx / hx
    is2_res.append(numpy.mean(fx_div_hxs))

print("exact: {0}".format(norm.cdf(b) - norm.cdf(a)))
print("MC estimation (mean/std): {0}/{1}".format(numpy.mean(mc_res), numpy.std(mc_res)))
print("Importance Sampling h1 estimation (mean/std): {0}/{1}".format(numpy.mean(is1_res), numpy.std(is1_res)))
print("Importance Sampling h2 estimation (mean/std): {0}/{1}".format(numpy.mean(is2_res), numpy.std(is2_res)))
Output on my machine:
exact: 0.9544997361036416
MC estimation (mean/std): 0.958336124778972/0.15458260669929705
Importance Sampling h1 estimation (mean/std): 0.9515730240850715/0.07416369320580925
Importance Sampling h2 estimation (mean/std): 1.0496523604436687/0.7489580081170375
I set $latex f(x)$ to be the density of a normal distribution with mean=0 and std=1. I use the traditional Monte Carlo method as well as two distributions, $latex h_1(x)$ and $latex h_2(x)$ (the custom densities defined in the code above: $latex h_1(x)$ is a triangle peaked at 0, and $latex h_2(x)$ peaks at the endpoints), to do importance sampling. All distributions can be represented in the following chart:
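A minimal matplotlib sketch (matplotlib is assumed to be available) that draws these three densities:

# Plot f(x) and the two proposal densities defined in the code above.
import numpy
import matplotlib.pyplot as plt
from scipy.stats import norm

a, b = -2., 2.
xs = numpy.linspace(a, b, 400)
plt.plot(xs, norm.pdf(xs), label="f(x): N(0, 1)")
plt.plot(xs, 1. / b - numpy.abs(xs) / b**2, label="h1(x): triangle peaked at 0")
plt.plot(xs, numpy.abs(xs) / b**2, label="h2(x): peaked at the endpoints")
plt.legend()
plt.show()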
From the output we can see that the estimate using importance sampling with $latex h_1(x)$ has the smallest std. This is because $latex h_1(x)$ (the blue triangle) is the closest in shape to $latex f(x)$. $latex h_2(x)$ (the red shape formed by two triangles) has the largest std because it is the least close to $latex f(x)$. The traditional MC method (equivalently, importance sampling with a uniform distribution) has a medium std.
References
[2] http://ta.twi.tudelft.nl/mf/users/oosterle/oosterlee/lec8-hit-2009.pdf