We talked about control variates in [4]: when evaluating $\mathbb{E}[f(x)]$ by Monte Carlo samples, we can instead evaluate $\mathbb{E}[f(x) - h(x)] + \mathbb{E}[h(x)]$ (with the first expectation estimated from samples and the second computed analytically) in order to reduce variance. The requirements for a control variate to work are that $h(x)$ is correlated with $f(x)$ and that the mean $\mathbb{E}[h(x)]$ is known.
In this post we will walk through a classic example of using a control variate, in which $h(x)$ is picked as the Taylor expansion of $f(x)$. We know that $f(x)$'s Taylor expansion around a point $a$ can be expressed as $f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots$. Therefore, if we pick the second-order Taylor expansion for $h(x)$ and set the expansion point $a$ to the mean $\mu = \mathbb{E}[x]$, we get $h(x) = f(\mu) + f'(\mu)(x-\mu) + \frac{f''(\mu)}{2}(x-\mu)^2$.
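As a quick sketch of the mechanics (generic $f$, $h$, and a known mean of $h$ are assumed as inputs; this is not tied to any particular example), the plain and control-variate Monte Carlo estimators can be written in a few lines of Python:

```python
import numpy as np

def mc_estimate(f, x):
    """Plain Monte Carlo estimate of E[f(x)] from an array of samples x."""
    return np.mean(f(x))

def cv_estimate(f, h, h_mean, x):
    """Control-variate estimate: average f(x) - h(x), then add back the known E[h(x)]."""
    return np.mean(f(x) - h(x)) + h_mean
```

Both estimators are unbiased; the control-variate one has lower variance whenever $\mathrm{Var}[f(x) - h(x)] < \mathrm{Var}[f(x)]$, i.e., whenever $h$ is sufficiently correlated with $f$.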
Let's see an example from [6]. Suppose we have i.i.d. samples $x_1, \dots, x_N$ from some distribution and we want to evaluate the integral $\mu = \mathbb{E}[f(x)]$. If we use the plain Monte Carlo estimator to estimate $\mu$ from the samples, we have:
$$\hat{\mu}_{MC} = \frac{1}{N}\sum_{i=1}^{N} f(x_i)$$
The control variate $h(x)$ is taken as the second-order Taylor expansion of $f(x)$ (we use the multi-variable Taylor expansion here [7]). We first compute its mean $\mathbb{E}[h(x)]$, which is available analytically because $h(x)$ is a quadratic polynomial and its expectation only requires the first two moments of the sampling distribution. Thus, our control variate Monte Carlo estimator is:
$$\hat{\mu}_{CV} = \frac{1}{N}\sum_{i=1}^{N}\left[f(x_i) - h(x_i)\right] + \mathbb{E}[h(x)]$$
And from the results in [6], the sampling variance of $\hat{\mu}_{CV}$ is smaller than that of $\hat{\mu}_{MC}$.
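Since I don't reproduce the exact multi-variable example from [6] here, below is a minimal single-variable sketch of the same recipe under an assumed toy setup: $x \sim \mathcal{N}(0,1)$, $f(x) = e^x$, and $h(x)$ its second-order Taylor expansion around the mean $0$, whose expectation under the standard normal is $1 + 0 + \tfrac{1}{2} = 1.5$:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.exp                         # f(x) = e^x, true mean under N(0, 1) is exp(0.5)

def h(x):
    """Second-order Taylor expansion of e^x around a = 0."""
    return 1.0 + x + 0.5 * x**2

h_mean = 1.5                       # E[h(x)] under x ~ N(0, 1): 1 + 0 + 0.5

mc_runs, cv_runs = [], []
for _ in range(2000):              # repeat to measure the sampling variance of each estimator
    x = rng.standard_normal(100)
    mc_runs.append(np.mean(f(x)))                    # plain Monte Carlo estimate
    cv_runs.append(np.mean(f(x) - h(x)) + h_mean)    # control-variate estimate

print("true mean    :", np.exp(0.5))
print("MC estimator : mean %.4f  variance %.6f" % (np.mean(mc_runs), np.var(mc_runs)))
print("CV estimator : mean %.4f  variance %.6f" % (np.mean(cv_runs), np.var(cv_runs)))
```

Both estimators concentrate around $e^{0.5} \approx 1.6487$, but the control-variate runs have a visibly smaller spread because $h$ absorbs most of $f$'s variation near the mean.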
Now what [2] proposes is to apply a similar Taylor expansion-based control variate to the REINFORCE policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\rho_\pi, \pi}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{Q}(s_t, a_t)\right]$$
If we use a Monte Carlo estimator, then we essentially use sampled returns $\hat{Q}(s_t, a_t)$ to approximate $Q^\pi(s_t, a_t)$, which causes variance. The control variate $\bar{Q}_w(s_t, a_t)$ is the first-order Taylor expansion of a critic $Q_w$ around some action $\bar{a}_t$: $\bar{Q}_w(s_t, a_t) = Q_w(s_t, \bar{a}_t) + \nabla_a Q_w(s_t, a)\big|_{a=\bar{a}_t}(a_t - \bar{a}_t)$. Here $\bar{a}_t$ can be chosen at any value, but a sensible choice is $\bar{a}_t = \mu_\theta(s_t)$, a deterministic version of the stochastic policy $\pi_\theta$. Then the control variate version of the policy gradient can be written as (Eqn. 7 in [2]):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\rho_\pi, \pi}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\left(\hat{Q}(s_t, a_t) - \bar{Q}_w(s_t, a_t)\right)\right] + \mathbb{E}_{\rho_\pi}\left[\nabla_a Q_w(s_t, a)\big|_{a=\mu_\theta(s_t)}\, \nabla_\theta \mu_\theta(s_t)\right]$$
As you can see, the second term can be computed analytically when $Q_w$ is a critic with a known analytic expression, and it actually does not depend on the sampled actions $a_t$ (it is an expectation over states only).
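To make the structure of Eqn. 7 concrete, here is a minimal sketch I put together under assumed toy settings (a one-dimensional Gaussian policy whose mean is the single parameter $\theta$, a known quadratic critic $Q_w$, and a noisy return standing in for $\hat{Q}$); it is not the actual Q-Prop implementation from [2]:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3                    # policy mean mu_theta (the only policy parameter)
sigma = 1.0                    # fixed policy standard deviation
w0, w1 = -1.0, 2.0             # quadratic critic Q_w(a) = w0 * a^2 + w1 * a

def Q_w(a):
    return w0 * a**2 + w1 * a

def dQ_w(a):
    return 2.0 * w0 * a + w1                         # gradient of the critic wrt the action

def grad_log_pi(a):
    return (a - theta) / sigma**2                    # d log pi(a) / d theta for a Gaussian

a = rng.normal(theta, sigma, size=5000)              # sampled actions
q_hat = Q_w(a) + rng.normal(0.0, 3.0, size=a.shape)  # noisy Monte Carlo return

# Control variate: first-order Taylor expansion of the critic around a_bar = mu_theta.
a_bar = theta
q_bar = Q_w(a_bar) + dQ_w(a_bar) * (a - a_bar)

# First term: Monte Carlo average with the control variate subtracted from the return.
term1 = np.mean(grad_log_pi(a) * (q_hat - q_bar))
# Second term: analytic, independent of the sampled actions (d mu_theta / d theta = 1 here).
term2 = dQ_w(a_bar) * 1.0

print("control-variate gradient:", term1 + term2)
print("plain REINFORCE gradient:", np.mean(grad_log_pi(a) * q_hat))
print("true gradient           :", 2.0 * w0 * theta + w1)
```

Because the assumed critic here is exact and quadratic, the analytic second term already recovers the true gradient and the Monte Carlo first term is just a zero-mean correction; with an imperfect critic the first term repairs the bias while typically having lower variance than plain REINFORCE.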
[3] uses the law of total variance (some good examples and intuitions in [8, 9]) to decompose the variance of the control-variate policy gradient into three parts: variance caused by sampling states, $\Sigma_s$; variance caused by sampling actions, $\Sigma_a$; and variance caused by sampling the rest of each trajectory, $\Sigma_\tau$. What [3] highlights is that the magnitude of $\Sigma_a$, the only term an action-dependent control variate can reduce, is usually much smaller than the other two, so the benefit of using a control variate-based policy gradient is very limited.
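To make the law of total variance concrete, here is a small numerical check under a toy model I'm assuming (not the experiments in [3]): the total variance of a quantity equals the expected conditional variance plus the variance of the conditional expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Law of total variance: Var(Y) = E[Var(Y | X)] + Var(E[Y | X]).
# Toy model: X ~ N(0, 1) plays the role of a "state", and Y | X ~ N(2 * X, 0.5^2)
# plays the role of a per-state gradient sample.
n_x, n_y = 2000, 2000
x = rng.standard_normal(n_x)
y = 2.0 * x[:, None] + 0.5 * rng.standard_normal((n_x, n_y))

cond_mean = y.mean(axis=1)      # E[Y | X] for each sampled x
cond_var = y.var(axis=1)        # Var(Y | X) for each sampled x

print("total variance Var(Y)     :", y.var())
print("E[Var(Y|X)] + Var(E[Y|X]) :", cond_var.mean() + cond_mean.var())
# Both numbers should be close to 2^2 * 1 + 0.5^2 = 4.25.
```

[3] applies this decomposition repeatedly (over states, then actions given states, then the rest of the trajectory) to obtain the three terms above.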
References
[1] MuProp: Unbiased Backpropagation for Stochastic Neural Networks
[2] Q-Prop: Sample-Efficient Policy Gradient with an Off-Policy Critic
[3] The Mirage of Action-Dependent Baselines in Reinforcement Learning
[4] https://czxttkl.com/2020/03/31/control-variate/
[5] https://czxttkl.com/2017/03/30/importance-sampling/
[6] https://www.value-at-risk.net/variance-reduction-with-control-variates-monte-carlo-simulation/
[7] https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables
[8] http://home.cc.umanitoba.ca/~farhadi/ASPER/Law%20of%20Total%20Variance.pdf