Control Variate

In this post, let’s talk about another variance reduction method besides importance sampling [1]: the “control variate” method.

We start from a simple example: we want to estimate \mathbb{E}_{p(x)}[f(x)], the expectation of a function f applied to a random variable x sampled from p(x). Analytically, \mathbb{E}_{p(x)}[f(x)] = \int p(x)f(x)dx. You can also think of \mathbb{E}_{p(x)}[f(x)] = \int p(x)f(x)dx as \mathbb{E}_{p(x)}[f(x)] = \int q(x')x'dx'=\mathbb{E}_{q(x')}[x'], the expectation of another random variable x'=f(x) under its induced distribution q(x'), without any further function applied to it. Since I like the form \mathbb{E}_{p(x)}[f(x)] = \int p(x)f(x)dx better, I’ll keep using it throughout the post. In our setting, we can evaluate f(x) for any given x but do not know the form of f(x) in advance. Using Monte Carlo naively, we can draw N samples from p(x) and take the average of their function values: \mathbb{E}_{p(x)}[f(x)]\approx \frac{1}{N}\sum\limits_{i=1}^Nf(x_i)
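As a quick illustration, here is a minimal sketch of this naive Monte Carlo estimator in numpy; the uniform p(x) and the quadratic f(x) are just assumptions for the example:

import numpy

# illustrative assumptions: p(x) uniform on [0, 10], f(x) = x^2
def f(x):
    return x ** 2

N = 10000
x = numpy.random.uniform(low=0., high=10., size=N)  # draw N samples from p(x)
print(numpy.mean(f(x)))  # average of f(x_i) approximates E_{p(x)}[f(x)]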

We can use a control variate to reduce the variance of our estimate of \mathbb{E}_{p(x)}[f(x)]. In the control variate method, we introduce another function h(x), which we can evaluate at any x and whose expectation \theta=\mathbb{E}_{p(x)}[h(x)] we know exactly (note: we don’t know that for f(x)).

The control variate method says that the following estimator z(x), which can itself be seen as another random variable, is an unbiased estimator of \mathbb{E}_{p(x)}[f(x)]:

z(x) = f(x) - h(x) + \theta

It is easy to see that z(x) is an unbiased estimator:

\mathbb{E}_{p(x)}[z(x)]\newline=\mathbb{E}_{p(x)}[f(x)]-\mathbb{E}_{p(x)}[h(x)] + \theta\newline=\mathbb{E}_{p(x)}[f(x)]-\theta + \theta\newline=\mathbb{E}_{p(x)}[f(x)]

Now let’s look at the variance of z(x):

Var_{p(x)}[z(x)] \newline = Var_{p(x)}\left[f(x) - h(x)+\theta\right] \newline = Var_{p(x)}\left[f(x)-h(x)\right]  \quad \quad \text{a constant doesn't contribute to variance}\newline=\mathbb{E}_{p(x)}\left[\left(f(x)-h(x)-\mathbb{E}_{p(x)}\left[f(x)-h(x)\right] \right)^2\right] \quad\quad Var(x)=\mathbb{E}[(x-\mathbb{E}(x))^2] \newline=\mathbb{E}_{p(x)}\left[\left( f(x)-\mathbb{E}_{p(x)}[f(x)] - \left(h(x)-\mathbb{E}_{p(x)}[h(x)]\right) \right)^2\right]\newline=\mathbb{E}_{p(x)}\left[\left(f(x)-\mathbb{E}_{p(x)}[f(x)]\right)^2\right] + \mathbb{E}_{p(x)}\left[\left(h(x)-\mathbb{E}_{p(x)}[h(x)]\right)^2\right] \newline - 2 \cdot \mathbb{E}_{p(x)}\left[\left(f(x)-\mathbb{E}_{p(x)}[f(x)]\right)\left(h(x)-\mathbb{E}_{p(x)}[h(x)]\right)\right]\newline=Var_{p(x)}\left[f(x)\right]+Var_{p(x)}\left[h(x)\right] - 2 \cdot Cov_{p(x)}\left[f(x), h(x)\right]

Thus, as long as we choose h(x) such that Var_{p(x)}\left[h(x)\right] - 2 \cdot Cov_{p(x)}\left[f(x), h(x)\right]< 0, we reduce the variance of our estimate of \mathbb{E}_{p(x)}[f(x)].
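As a quick sanity check of this variance identity, here is a minimal sketch that compares the empirical variance of z(x) against Var[f(x)] + Var[h(x)] - 2 Cov[f(x), h(x)]; the uniform p(x), f(x)=x^2, and h(x)=x are just illustrative choices (the same ones used in the experiment later in the post):

import numpy

# illustrative setup: p(x) uniform on [0, 10], f(x) = x^2, h(x) = x, theta = E[h(x)] = 5
x = numpy.random.uniform(low=0., high=10., size=1000000)
fx, hx, theta = x ** 2, x, 5.

z = fx - hx + theta
lhs = numpy.var(z)  # Var[z(x)], estimated directly
rhs = numpy.var(fx) + numpy.var(hx) - 2 * numpy.cov(fx, hx)[0, 1]  # right-hand side of the identity
print(lhs, rhs)  # the two numbers should be close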

A more general method, sometimes called regression sampling [2], introduces a tunable parameter c into z(x):

z(x) = f(x) + c\cdot (h(x) - \theta)

In the basic control variate method above, c=-1. When c is tunable, the variance of z(x) becomes:

Var_{p(x)}[z(x)] \newline=Var_{p(x)}\left[f(x)\right]+c^2 \cdot Var_{p(x)}\left[h(x)\right] + 2c \cdot Cov_{p(x)}\left[f(x), h(x)\right],

which is a quadratic function of c, minimized where its derivative 2c\cdot Var_{p(x)}[h(x)] + 2\cdot Cov_{p(x)}[f(x),h(x)] vanishes. Therefore, the optimal coefficient is:

c^*=-\frac{Cov_{p(x)}[f(x),h(x)]}{Var_{p(x)}[h(x)]}

With c^*, the minimum variance of z(x) is:

Var^*_{p(x)}[z(x)] \newline=Var_{p(x)}\left[f(x)\right] - \frac{\left(Cov_{p(x)}\left[f(x), h(x)\right] \right)^2}{Var_{p(x)}[h(x)]}\newline=\left(1-\left(Corr_{p(x)}\left[f(x),h(x)\right]\right)^2\right)Var_{p(x)}\left[f(x)\right] \quad\quad Corr(x,y)=\frac{Cov(x,y)}{stdev(x)\,stdev(y)}

This shows that the more correlated f(x) and h(x) are, the more variance we can potentially reduce.
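To make this concrete, here is a minimal sketch of how one might estimate c^* from samples (sample covariance over sample variance of h(x)) and plug it back into z(x); the uniform p(x), f(x)=x^2, and h(x)=x are again just assumptions for illustration:

import numpy

# illustrative setup: p(x) uniform on [0, 10], f(x) = x^2, h(x) = x, theta = E[h(x)] = 5
x = numpy.random.uniform(low=0., high=10., size=100000)
fx, hx, theta = x ** 2, x, 5.

# c* = -Cov[f, h] / Var[h], estimated from the samples
c_star = -numpy.cov(fx, hx)[0, 1] / numpy.var(hx)

z = fx + c_star * (hx - theta)
print("estimated c*:", c_star)  # for this setup, roughly -(250/3)/(25/3) = -10
print("Var[f]:", numpy.var(fx), "Var[z]:", numpy.var(z))

In practice one would typically estimate c^* from a separate pilot batch of samples, since reusing the same samples to estimate c^* introduces a small bias into z(x).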

Now, let’s do an experiment to verify that the control variate method indeed reduces variance. We let p(x) be the uniform distribution on [0, 10], f(x)=x^2, and h(x)=x. The relevant statistical quantities are:

\mathbb{E}_{p(x)}[f(x)]=\frac{100}{3}
Var_{p(x)}[f(x)]=\frac{8000}{9}
\mathbb{E}_{p(x)}[h(x)]=5
Var_{p(x)}[h(x)]=\frac{25}{3}
Cov_{p(x)}\left[f(x), h(x)\right]=\frac{250}{3}
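For reference, these all follow from the raw moments of the uniform distribution, \mathbb{E}_{p(x)}[x^k]=\int_0^{10}\frac{x^k}{10}dx=\frac{10^k}{k+1}:

Var_{p(x)}[f(x)]=\mathbb{E}_{p(x)}[x^4]-\left(\mathbb{E}_{p(x)}[x^2]\right)^2=2000-\frac{10000}{9}=\frac{8000}{9}
Var_{p(x)}[h(x)]=\mathbb{E}_{p(x)}[x^2]-\left(\mathbb{E}_{p(x)}[x]\right)^2=\frac{100}{3}-25=\frac{25}{3}
Cov_{p(x)}\left[f(x), h(x)\right]=\mathbb{E}_{p(x)}[x^3]-\mathbb{E}_{p(x)}[x^2]\cdot\mathbb{E}_{p(x)}[x]=250-\frac{500}{3}=\frac{250}{3}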

Therefore, according to the derived theory, we have:

Var_{p(x)}[z(x)] \newline =Var_{p(x)}\left[f(x)\right]+Var_{p(x)}\left[h(x)\right] - 2 \cdot Cov_{p(x)}\left[f(x), h(x)\right]\newline =\frac{8000}{9} + \frac{25}{3}-\frac{2\cdot 250}{3}\newline=\frac{6575}{9}<Var_{p(x)}[f(x)].

In other words, we expect to retain around 82.2% of the original variance (6575/8000 ≈ 0.822).

Here is the code:

import numpy

T = 20000  # num of rounds
N = 10   # num of samples per round
a, b = 0., 10.   # lower bound, upper bound
expct_hx = 5.   # theta = E_{p(x)}[h(x)] for h(x) = x

# monte carlo estimation
mc_res = []
for t in range(T):
    x = numpy.random.uniform(low=a, high=b, size=N)
    mc_res.append(numpy.mean(x**2))

# control variate estimation using h(x)
cv_res = []
for t in range(T):
    x = numpy.random.uniform(low=a, high=b, size=N)
    cv_res.append(numpy.mean(x ** 2 - x + expct_hx))  # z(x) = f(x) - h(x) + theta

print("MC estimation (mean/std): {0}/{1}".format(numpy.mean(mc_res), numpy.std(mc_res)))
print("Control variate estimation (mean/std): {0}/{1}".format(numpy.mean(cv_res), numpy.std(cv_res)))
print(f"Variance ratio (control variate / MC): {(numpy.std(cv_res)/numpy.std(mc_res))**2}")

Here’s the result; indeed, with the control variate we see roughly 82% of the original variance (83.5% in this run, versus the theoretical 82.2%):

MC estimation (mean/std): 33.292873225286186/9.426090731496881
Control variate estimation (mean/std): 33.37154726826849/8.61508822016471
Variance ratio (control variate / MC): 0.8353264371911822

In the end, I want to mention that we have been analyzing Var_{p(x)}[z(x)] under the assumption that z(x) is computed from a single sample. If we always collect N samples and use their mean as the estimate of \mathbb{E}_{p(x)}[f(x)], then Var_{p(x)}[\bar{z}(x)]=Var_{p(x)}\left[\frac{\sum\limits_{i=1}^N z(x_i)}{N}\right]=N \cdot Var_{p(x)}\left[\frac{z(x)}{N}\right]=\frac{Var_{p(x)}\left[z(x)\right]}{N}. This matches our intuition that the more samples you average over, the less variance you have. However, the relative amount of variance you can save by introducing h(x) stays the same no matter how large N is: with the optimal c^*, the remaining fraction of the original variance is 1-\left(Corr_{p(x)}\left[f(x),h(x)\right]\right)^2.
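Here is a minimal sketch checking this 1/N behavior empirically, reusing the same setup as the experiment above (the particular values of N are arbitrary):

import numpy

T = 20000
for N in [10, 40]:
    res = []
    for t in range(T):
        x = numpy.random.uniform(low=0., high=10., size=N)
        res.append(numpy.mean(x ** 2 - x + 5.))  # N-sample mean of z(x)
    # variance of the N-sample mean; should shrink roughly as 1/N (about (6575/9)/N here)
    print(N, numpy.var(res))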

I also want to mention that there are other variance reduction techniques used in other scenarios, such as importance sampling and the reparameterization trick. People have shown empirically that the reparameterization trick can also reduce variance [4], but so far I haven’t found a theoretical proof.

References

[1] https://czxttkl.com/2017/03/30/importance-sampling/

[2] https://en.wikipedia.org/wiki/Control_variates

[3] https://statweb.stanford.edu/~owen/mc/Ch-var-basic.pdf

[4] https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb
