Control variate using Taylor expansion

We talked about the control variate technique in [4]: when evaluating \mathbb{E}_{p(x)}[f(x)] by Monte Carlo samples, we can instead evaluate \mathbb{E}_{p(x)}[f(x)-h(x)+\theta] with \theta=\mathbb{E}_{p(x)}[h(x)] in order to reduce variance. For a control variate to work, h(x) should be correlated with f(x) and its mean \theta must be known.
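To see why this can reduce variance: the new estimand has the same mean because \mathbb{E}_{p(x)}[f(x)-h(x)+\theta]=\mathbb{E}_{p(x)}[f(x)], while its variance is

Var[f(x)-h(x)+\theta] = Var[f(x)] + Var[h(x)] - 2Cov[f(x), h(x)],

which is smaller than Var[f(x)] whenever 2Cov[f(x), h(x)] > Var[h(x)], i.e., whenever h(x) tracks f(x) closely enough.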

In this post we will walk through a classic example of using control variate, in which h(x) is picked as a Taylor expansion of f(x). Recall that the Taylor expansion of f(x) around a can be expressed as f(x)=f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots. Therefore, if we pick the second-order Taylor expansion of f(x) around a=0, we get h(x)=f(0)+f'(0)x + \frac{f''(0)}{2!}x^2.

Let’s see an example from [6]. Suppose f(u_1, u_2)=\exp[(u_1^2 + u_2^2)/2] and we want to evaluate the integral \Psi=\int^1_0 \int^1_0 f(u_1, u_2) du_1 du_2. If we use a Monte Carlo estimator to estimate \Psi with samples U^1, \cdots, U^m drawn uniformly from the unit square, we have:
\frac{\text{area of the sampling region}}{m}\sum\limits_{k=1}^m f(U^k)=\frac{1}{m}\sum\limits_{k=1}^m f(U^k)

The control variate is the second-order Taylor expansion of f around (0, 0): h(u_1, u_2)=1 + (u_1^2 + u_2^2)/2 (we use the multi-variable Taylor expansion here [7]). We first compute its mean: \theta=\mathbb{E}[h(u_1, u_2)] = \int^1_0\int^1_0 \left(1 +(u_1^2+u_2^2)/2 \right) du_1 du_2 = 4/3. Thus, our control variate Monte Carlo estimator is:
\frac{1}{m}\sum\limits_{k=1}^{m} z(U^k)\newline=\frac{1}{m}\sum\limits_{k=1}^{m} \left[f(U^k) - h(U^k) + \theta\right]\newline=\frac{1}{m}\sum\limits_{k=1}^{m} \left[\exp[((U_1^k)^2 + (U_2^k)^2)/2] - 1 - ((U_1^k)^2 + (U_2^k)^2)/2 + 4/3 \right],

where U^k=(U_1^k, U_2^k).

And from the results of [6], the sampling variance of z(u_1, u_2) is smaller than that of f(u_1, u_2).
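To make this concrete, here is a minimal numpy sketch of the comparison (my own illustration, not code from [6]; the sample size m and the random seed are arbitrary choices). It draws m uniform samples from the unit square, then compares the plain Monte Carlo samples f with the control variate samples z.

import numpy as np

rng = np.random.default_rng(0)
m = 100000

# Draw m samples (u_1, u_2) uniformly from the unit square.
u1 = rng.uniform(0.0, 1.0, m)
u2 = rng.uniform(0.0, 1.0, m)

# f(u_1, u_2) = exp[(u_1^2 + u_2^2)/2]
f = np.exp((u1**2 + u2**2) / 2.0)

# h(u_1, u_2) = 1 + (u_1^2 + u_2^2)/2, with known mean theta = 4/3
h = 1.0 + (u1**2 + u2**2) / 2.0
theta = 4.0 / 3.0

# Control variate samples: z = f - h + theta
z = f - h + theta

print("plain MC:           mean", f.mean(), "sample variance", f.var())
print("control variate MC: mean", z.mean(), "sample variance", z.var())

Both means are unbiased estimates of \Psi, but the sample variance of z comes out substantially smaller than that of f.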

Now, what [2] proposes is to apply a similar Taylor expansion-based control variate to the REINFORCE policy gradient:

\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho_{\pi}(\cdot), a_t \sim \pi(\cdot | s_t)} \left[ \nabla_\theta \log \pi_\theta (a_t | s_t) R_t\right]

If we use a Monte Carlo estimator, we essentially use samples \nabla_\theta \log \pi_\theta (a_t | s_t) R_t to approximate \nabla_\theta J(\theta), and these samples have high variance. [2] builds the control variate from the first-order Taylor expansion of an off-policy critic Q_w around \bar{a}_t: h(s_t, a_t)=Q_w(s_t, \bar{a}_t) + \nabla_a Q_w(s_t, a)|_{a=\bar{a}_t} (a_t - \bar{a}_t). \bar{a}_t can be chosen to be any value, but a sensible choice is \mu_\theta(s_t), the deterministic version of \pi_\theta(a_t | s_t). The control variate itself is \nabla_\theta \log \pi_\theta(a_t | s_t) h(s_t, a_t), whose expectation has an analytic form, so the control variate version of the policy gradient can be written as (Eqn. 7 in [2]):
\nabla_\theta \log \pi_\theta(a_t | s_t) R_t - \nabla_\theta \log \pi_\theta(a_t | s_t) h(s_t, a_t) + \mathbb{E}_{s\sim \rho_\pi, a \sim \pi}\left[ \nabla_\theta \log \pi_\theta(a_t | s_t) h(s_t, a_t)\right] \newline=\nabla_\theta \log \pi_\theta(a_t | s_t) \left(R_t - h(s_t, a_t)\right) + \mathbb{E}_{s \sim \rho_\pi}\left[ \nabla_a Q_w(s_t, a)|_{a=\mu_\theta(s_t)} \nabla_\theta \mu_\theta(s_t)\right]

As you can see, the second term can be computed analytically when Q_w is a critic with a known analytic expression, and it depends only on \mu_\theta(s_t), not on the sampled actions a_t, so it contributes no action-sampling variance.
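To see the mechanics, here is a toy 1-D numpy sketch (my own illustration, not the actual Q-Prop implementation; the Gaussian policy \pi_\theta(a|s)=\mathcal{N}(\theta s, \sigma^2) with \mu_\theta(s)=\theta s, the quadratic critic Q_w(s,a)=-(a-ws)^2, and the noisy return R_t are all assumed toy choices). It compares the plain REINFORCE samples with the control variate samples from the equation above.

import numpy as np

rng = np.random.default_rng(1)
m = 100000

theta, w, sigma = 0.5, 0.7, 0.3          # policy parameter, critic parameter, policy std

s = rng.uniform(-1.0, 1.0, m)            # states s_t ~ rho_pi (assumed uniform here)
mu = theta * s                           # mu_theta(s_t); also used as a_bar_t
a = mu + sigma * rng.normal(size=m)      # actions a_t ~ pi_theta(.|s_t)

# Noisy return samples R_t (here: the critic's value at the sampled action plus noise)
R = -(a - w * s) ** 2 + rng.normal(scale=0.5, size=m)

score = (a - mu) * s / sigma**2            # grad_theta log pi_theta(a_t|s_t)
dQ_da = -2.0 * (mu - w * s)                # grad_a Q_w(s_t, a) evaluated at a = mu_theta(s_t)
h = -(mu - w * s) ** 2 + dQ_da * (a - mu)  # first-order Taylor expansion of Q_w at a_bar_t = mu

g_reinforce = score * R                    # plain REINFORCE gradient samples
g_qprop = score * (R - h) + dQ_da * s      # control variate samples; grad_theta mu_theta(s) = s

print("REINFORCE:       mean", g_reinforce.mean(), "variance", g_reinforce.var())
print("control variate: mean", g_qprop.mean(), "variance", g_qprop.var())

Both estimators have the same expectation (for a Gaussian policy the analytic correction term dQ_da * s equals \mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a|s) h(s, a)]), but the control variate samples have a much smaller variance because the part of R_t that is predictable from a_t has been subtracted out.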

[3] uses the law of total variance (see [8, 9] for good examples and intuition) to decompose the variance of the policy gradient estimator into three parts: variance due to sampling states \Sigma_s, variance due to sampling actions \Sigma_a, and variance due to the remainder of the trajectory \Sigma_\tau. What [3] highlights is that the magnitude of \Sigma_a is usually much smaller than the other two, so the benefit of using a control variate-based policy gradient is very limited.
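Schematically (this is my paraphrase of the decomposition in [3], with \hat{g} denoting the single-sample gradient estimator), applying the law of total variance twice gives:

Var(\hat{g}) = \underbrace{\mathbb{E}_{s_t, a_t}\left[ Var_{\tau}\left(\hat{g} \mid s_t, a_t\right)\right]}_{\Sigma_\tau} + \underbrace{\mathbb{E}_{s_t}\left[ Var_{a_t}\left(\mathbb{E}_{\tau}[\hat{g} \mid s_t, a_t]\right)\right]}_{\Sigma_a} + \underbrace{Var_{s_t}\left(\mathbb{E}_{a_t, \tau}[\hat{g} \mid s_t]\right)}_{\Sigma_s}

A control variate (or any action-dependent baseline) only changes the term inside Var_{a_t}(\cdot), so it can only shrink \Sigma_a; when \Sigma_\tau and \Sigma_s dominate, the overall variance barely moves.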

 

References

[1] MuProp: Unbiased Backpropagation for Stochastic Neural Networks

[2] Q-Prop: Sample-Efficient Policy Gradient with an Off-Policy Critic

[3] The Mirage of Action-Dependent Baselines in Reinforcement Learning

[4] https://czxttkl.com/2020/03/31/control-variate/

[5] https://czxttkl.com/2017/03/30/importance-sampling/

[6] https://www.value-at-risk.net/variance-reduction-with-control-variates-monte-carlo-simulation/

[7] https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables

[8] http://home.cc.umanitoba.ca/~farhadi/ASPER/Law%20of%20Total%20Variance.pdf

[9] https://math.stackexchange.com/a/3377007/235140
