Stochastic Variational Inference

Introduction

In this post, we introduce stochastic variational inference, a machine learning technique widely used to estimate the posterior distribution of Bayesian models. Suppose that in a Bayesian model the model parameters are denoted as a vector z and the observations are denoted as x. According to Bayes' theorem, the posterior distribution of z can be computed as:

p(z|x)=\frac{p(z,x)}{p(x)}

p(x) is the marginal probability of the observations, integrated over all possible model parameters:

p(x)=\int p(z,x)dz 

p(x) isn't easy to compute and is most of the time intractable because of its integral form. If we cannot compute p(x), then we cannot compute p(z|x), which is what we want to know. Therefore, we need a way to approximate p(z|x). We denote the approximate posterior as q(z); q(z) is also called the variational distribution, hence the name variational inference.

Stochastic variational inference (SVI) is one such method to approximate p(z|x). From the ICML 2018 tutorial [2], we can see the niche where SVI lies: among all possible ways to approximate p(z|x), there is a group of algorithms that use optimization to minimize the difference between q^*(\cdot) and p(\cdot|x). Those that measure the difference between the two distributions with the Kullback-Leibler divergence are called variational inference. If we further categorize by the family of q^*, there is one particular family, the mean-field variational family, to which variational inference is especially easy to apply. After all these levels of categorization, we arrive at an objective function that we seek to optimize. SVI is one optimization method for that objective, pushing q^* toward minimizing the KL-divergence with p(z|x).

Objective function

By definition, the KL divergence between two continuous distributions with density functions h(x) and g(x) is [4]:

KL\left(h(x)||g(x)\right)\newline=\int^\infty_{-\infty}h(x)log\frac{h(x)}{g(x)}dx \newline=\mathbb{E}_h[log\;h(x)]-\mathbb{E}_h[log\;g(x)]

If we are trying to find the best approximated distribution q^*(z) using variational Bayes, we define the following objective function:

q^*(z)=argmin_q KL\left(q(z) || p(z|x) \right )

where KL(q(z)||p(z|x))=\mathbb{E}_q [log\;q(z)] - \mathbb{E}_q [log\;p(z,x)]+log\;p(x) (all expectations are taken with respect to q(z)). Note that since we optimize with respect to q(\cdot), log\;p(x) can be treated as a constant. Thus, minimizing the KL-divergence is equivalent to maximizing:

ELBO(q)=\mathbb{E}_q[log\;p(z,x)] - \mathbb{E}_q[log\;q(z)]

ELBO(q) is the lower bound of log\;p(x) because of the non-negativity of KL-divergence:

KL(q(z) || p(z|x)) = log p(x) - ELBO(q) \geq 0

Update 2020.4:

The derivation above is also illustrated in [14]:

There are several other ways to understand ELBO.

  1. Based on Jensen’s inequality [13]: for a convex function f and a random variable X, f(\mathbb{E}[X]) \leq \mathbb{E}\left[f(X)\right]; for a concave function g, g(\mathbb{E}[X]) \geq \mathbb{E}\left[g(X)\right]. Therefore, we have:

log \; p(x)\newline=log \int_z p(z,x)\newline=log \int_z p(z,x)\frac{q(z)}{q(z)}\newline=log \int_z q(z)\frac{p(z,x)}{q(z)}\newline=log \left(\mathbb{E}_{q(z)}\left[\frac{p(z,x)}{q(z)}\right]\right) \newline \geq \mathbb{E}_{q(z)}\left[log \frac{p(z,x)}{q(z)}\right] \quad\quad \text{by Jensen's inequality} \newline =ELBO(q)

Therefore, ELBO(q) is a lower bound on log \; p(x).

2. By rearranging ELBO(q), we have:

ELBO(q)\newline=\mathbb{E}_q[log\;p(z,x)] - \mathbb{E}_q[log\;q(z)]\newline=\mathbb{E}_q[log\;p(x|z)] + \mathbb{E}_q[log\;p(z)] - \mathbb{E}_q[log\;q(z)] \quad\quad p(z) \text{ is the prior of } z \newline= \mathbb{E}_q[log\;p(x|z)]  - \mathbb{E}_q [log \frac{q(z)}{p(z)}]\newline=\mathbb{E}_q[log\;p(x|z)] - KL\left(q(z) || p(z)\right)

Therefore, the first part of ELBO(q) can be thought of as the so-called “reconstruction error”, which encourages q(z) to put more probability mass on the area with high log\;p(x|z). The second part encourages q(z) to stay close to the parameter prior p(z). ELBO(q) is the common objective used in Variational Autoencoder models.

How to optimize?

Recall that our objective function is q^*(z)=argmin_q KL(q(z) || p(z|x) ). In practice, minimizing with respect to q translates to parameterizing q(z) and then optimizing the objective function with respect to the parameters. One big assumption we can make to facilitate computation is that all latent variables are independent, such that q(z) factorizes into a product of distributions of individual latent variables. We call such a q(z) the mean-field variational family:

q(z)=\prod\limits_{j=1}^{|z|} q_j(z_j|\theta_j)  

From the factorization, you can see that each individual latent variable’s distribution is governed by its own parameter \theta_j. Hence, the objective function to approximate p(z|x) changes from q^*(z)=argmin_q KL(q(z) || p(z|x) ) to:

\theta^* = argmin_{\theta_1, \cdots, \theta_{|z|}} KL(\prod\limits_{j=1}^{|z|} q_j(z_j|\theta_j) || p(z|x) )

One simple algorithm to optimize this is called coordinate ascent mean-field variational inference (CAVI). Each time, the algorithm optimizes one variational distribution parameter while holding all the others fixed. The algorithm works as follows:

“Set q_j(z_j) \propto exp\{\mathbb{E}_{-j}[log p(z_j|z_{-j}, x)]\}” may seem hard to understand. It means setting the variational parameter \theta_j such that q_j(z_j|\theta_j) is the distribution equal to exp\{\mathbb{E}_{-j}[log p(z_j|z_{-j}, x)]\} up to a normalizing constant. \mathbb{E}_{-j} means that the expectation is taken with respect to the distribution \prod_{\ell \neq j} q_\ell(z_\ell|\theta_\ell).
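To make the CAVI updates concrete, here is a minimal sketch for the classic toy model x_i \sim \mathcal{N}(\mu, \tau^{-1}) with priors \mu|\tau \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1}) and \tau \sim Gamma(a_0, b_0). This example and its closed-form coordinate updates follow the standard textbook treatment (e.g., Bishop's PRML Section 10.1.3), not the references of this post:

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """CAVI for x_i ~ N(mu, 1/tau), mu|tau ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0).
    Returns the parameters of q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n)."""
    n, xbar = len(x), np.mean(x)
    e_tau = a0 / b0                      # initial guess for E_q[tau]
    a_n = a0 + (n + 1) / 2.0             # this coordinate update never changes
    for _ in range(iters):
        # update q(mu) while holding q(tau) fixed
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # update q(tau) while holding q(mu) fixed
        e_mu, e_mu2 = mu_n, mu_n ** 2 + 1.0 / lam_n
        e_sq_err = np.sum(x ** 2) - 2 * e_mu * np.sum(x) + n * e_mu2
        b_n = b0 + 0.5 * (e_sq_err + lam0 * (e_mu2 - 2 * e_mu * mu0 + mu0 ** 2))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

mu_n, lam_n, a_n, b_n = cavi_gaussian(np.random.normal(2.0, 0.5, size=500))
print(mu_n, a_n / b_n)   # approx. posterior mean of mu (~2) and E_q[tau] (~1/0.25)
```

Each pass of the loop is exactly one round of “optimize one variational parameter while holding all the others fixed”.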

What to do after knowing q(z)=\prod_j q(z_j|\theta_j)?

After the optimization (using CAVI for example), we get the variational distribution q(z)=\prod_j q(z_j|\theta_j). We can use the estimated \theta_j to analytically derive the mean of z_j or sample z_j from q(z_j|\theta_j). One thing to note is that there is no restriction on the parametric form of the individual variational distribution. For example, you may define q(z_j|\theta_j) to be an exponential distribution: q(z_j|\theta_j)=\theta_j e^{-\theta_j z_j}. Then, the mean of z_j is 1/\theta_j. If q(z_j|\theta_j) is a normal distribution, then \theta_j actually contains two parameters, the normal distribution’s mean and variance. Thus the mean of z_j is simply the mean parameter. 

Stochastic Variational Inference

One big disadvantage of CAVI is its scalability: each update of \theta_j requires a full sweep over the data. Stochastic variational inference (SVI) kicks in here because SVI's updates of \theta_j only require sub-samples of the data. The simple idea is to take the gradient \nabla_\theta ELBO and use it to update \theta. But there are a few more details:

  1. The update formulas become very succinct if we assume the complete conditionals are in the exponential family: p(z_j|z_{-j}, x)=h(z_j) exp\{\eta_j(z_{-j},x)^Tz_j - a(\eta_j(z_{-j}, x))\}, where z_j is its own sufficient statistic and h(\cdot), a(\cdot), and \eta(\cdot) are defined according to the definition of the exponential family [10].
  2. We also categorize latent variables into local variables and global variables.
  3. The gradient is not simply taken in the Euclidean space of the parameters but in the distribution space [11]. In other words, the gradient is transformed in a sensible way such that it points in the steepest descent direction of the KL-divergence. Also see [12].

All in all, SVI works as follows [8]: 
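As a structural sketch of that procedure (my own paraphrase, not the pseudocode from [8]), the SVI loop for a conditionally conjugate model can be written as follows; local_step and intermediate_global are hypothetical placeholders for the model-specific exponential-family computations from point 1 above:

```python
import numpy as np

def svi(data, lam, local_step, intermediate_global, n_steps=10000, tau=1.0, kappa=0.7):
    """Sketch of stochastic variational inference for a global variational parameter lam."""
    N = len(data)
    for t in range(1, n_steps + 1):
        i = np.random.randint(N)                           # sample one data point
        phi_i = local_step(data[i], lam)                   # optimal local variational parameters given lam
        lam_hat = intermediate_global(data[i], phi_i, N)   # global update as if data[i] were replicated N times
        rho = (t + tau) ** (-kappa)                        # Robbins-Monro step size
        lam = (1 - rho) * lam + rho * lam_hat              # noisy natural-gradient step on the global parameter
    return lam
```

The step sizes \rho_t=(t+\tau)^{-\kappa} satisfy the Robbins-Monro conditions, which is what lets the noisy, sub-sampled gradient steps converge.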

Reference

[1] https://www.quora.com/Why-and-when-does-mean-field-variational-Bayes-underestimate-variance

[2] ICML 2018 Tutorial Session: Variational Bayes and Beyond: Bayesian Inference for Big Data: https://www.youtube.com/watch?time_continue=2081&v=DYRK0-_K2UU

[3] Variational Inference: A Review for Statisticians

[4] https://towardsdatascience.com/demystifying-kl-divergence-7ebe4317ee68

[5] https://www.zhihu.com/question/31032863

[6] https://blog.csdn.net/aws3217150/article/details/57072827

[7] http://krasserm.github.io/2018/04/03/variational-inference/

[8] NIPS 2016 Tutorial Variational Inference: Foundations and Modern Methods: https://www.youtube.com/watch?v=ogdv_6dbvVQ

[9] https://towardsdatascience.com/making-your-neural-network-say-i-dont-know-bayesian-nns-using-pyro-and-pytorch-b1c24e6ab8cd

[10] https://jwmi.github.io/BMS/chapter3-expfams-and-conjugacy.pdf

[11] https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/

[12] https://czxttkl.com/2019/05/09/gradient-and-natural-gradient-fisher-information-matrix-and-hessian/

[13] https://en.wikipedia.org/wiki/Jensen%27s_inequality

[14] https://github.com/cpark321/uncertainty-deep-learning/blob/master/05.%20Variational%20Inference%20(toy%20example).ipynb

Bayesian linear regression

Ordinary least squares (OLS) linear regression gives a point estimate of the weight vector w that solves \arg\min_w \left\Vert Y - Xw\right\Vert^2 for the model Y = Xw + \epsilon. If we assume normality of the errors, \epsilon \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I}), with a fixed point estimate of \sigma^2, we can also analyze confidence intervals and future predictions (see the discussion at the end of [2]). Instead of point estimates, Bayesian linear regression assumes w and \sigma^2 are random variables and learns the posterior distribution of w and \sigma^2 from data. In my view, Bayesian linear regression is a more flexible method because it supports incorporating prior knowledge about the parameters, and the posterior distributions it provides enable more uncertainty analysis and facilitate other tasks [3], for example Thompson sampling in contextual bandit problems, which we will cover in the future. However, it is not a panacea: it does not generally improve prediction accuracy if no informative prior is provided.

The foundation of Bayesian methods is Bayes' theorem, which states:

p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)} \propto p(D|\theta)p(\theta)

Specifically, in Bayesian linear regression, D represents observed data Y and X and \theta refers to w and \sigma^2. So the formula above can be rewritten as:

p(w, \sigma^2|Y, X) \propto p(Y|X, w, \sigma^2) p(w, \sigma^2)

The procedure for adopting Bayesian methods is to (1) define the likelihood and proper prior distributions of the parameters of interest; (2) calculate the posterior distribution according to Bayes' theorem; (3) use the posterior distribution for other tasks, such as predicting Y on new X.

The likelihood function of linear regression is:

p(Y|X, w, \sigma^2)=(2\pi\sigma^2)^{-n/2} exp\big\{-\frac{(Y-Xw)^T(Y-Xw)}{2\sigma^2}\big\}

which can be read as the probability of observing the data if we assume Y is normally distributed with mean Xw and covariance \sigma^2\boldsymbol{I}.

To get an analytical expression for the posterior distribution p(w,\sigma^2|Y, X), we usually require that the prior be conjugate to the likelihood. Here we can treat the likelihood p(Y|X, w, \sigma^2) as a two-dimensional exponential family (the concept is illustrated in chapter 4 of [6]), i.e., a distribution with respect to w and \sigma^2. Therefore, the conjugate prior p(w, \sigma^2) can be modeled as a Normal-inverse-Gamma (NIG) distribution:

p(w, \sigma^2) =p(w|\sigma^2)p(\sigma^2)=\mathcal{N}(w_0, \sigma^2 V_0) \cdot IG(\alpha_0, \beta_0)

The inverse gamma IG(\cdot) is also called the scaled inverse chi-squared distribution; the two only differ in parameterization [9]. We can denote p(w, \sigma^2) as NIG(w_0, V_0, \alpha_0, \beta_0). Note that we write p(w | \sigma^2), meaning that w and \sigma^2 are not independent. For one reason, if we modeled p(w, \sigma^2)=p(w) \cdot p(\sigma^2), we would not get a conjugate prior. Second, if you think of (X, Y) as generated from some process governed by w and \sigma^2, then w and \sigma^2 are dependent conditioned on (X, Y) (see section 3 in [8]).

Now, given that the prior is conjugate to the likelihood, the posterior distribution is also an NIG distribution:

p(w, \sigma^2|Y, X) = NIG(w_1, V_1, \alpha_1, \beta_1),

where:

w_1 = (V_0^{-1}+X^TX)^{-1}(V_0^{-1}w_0+X^T Y)

V_1 = (V_0^{-1}+X^TX)^{-1}

\alpha_1 = \alpha_0 + \frac{1}{2}n

\beta_1 = \beta_0 + \frac{1}{2}(w_0^T V_0^{-1}w_0 + Y^T Y - w_1^TV_1^{-1}w_1)
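The update equations above map directly to code. Here is a minimal numpy sketch of my own, following the four formulas just listed (the function name and signature are made up for illustration):

```python
import numpy as np

def nig_posterior(X, Y, w0, V0, alpha0, beta0):
    """Posterior NIG(w1, V1, alpha1, beta1) given prior NIG(w0, V0, alpha0, beta0)
    and data X (n x d), Y (n,)."""
    n = X.shape[0]
    V0_inv = np.linalg.inv(V0)
    V1_inv = V0_inv + X.T @ X
    V1 = np.linalg.inv(V1_inv)
    w1 = V1 @ (V0_inv @ w0 + X.T @ Y)
    alpha1 = alpha0 + 0.5 * n
    beta1 = beta0 + 0.5 * (w0 @ V0_inv @ w0 + Y @ Y - w1 @ V1_inv @ w1)
    return w1, V1, alpha1, beta1
```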

The posterior predictive distribution (for predicting new data) is:

p(Y_{new}|X_{new}, Y, X) = \int \int p(Y_{new}|X_{new}, w, \sigma^2) p(w, \sigma^2|Y, X) dwd\sigma^2

The result should be a Student-t distribution, but the derivation is quite involved; see section 6 in [10] and section 3 in [8]. I know from other online posts [11, 12, 13] that practical libraries don't compute the analytical form of the posterior distribution but rely on sampling techniques like MCMC. However, even when they have the posterior distribution, I don't know how they would implement the posterior predictive distribution. This is worth further investigation in the future.
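One pragmatic option, even without the analytical Student-t form, is a Monte Carlo approximation of the predictive integral: draw (w, \sigma^2) from the NIG posterior and then draw Y_{new} from the likelihood. A minimal sketch continuing the nig_posterior function above (again my own illustration, not a particular library's API):

```python
import numpy as np

def sample_posterior_predictive(X_new, w1, V1, alpha1, beta1, n_samples=2000, rng=None):
    """Monte Carlo draws from p(Y_new | X_new, Y, X) under the NIG posterior."""
    rng = rng if rng is not None else np.random.default_rng()
    draws = []
    for _ in range(n_samples):
        sigma2 = 1.0 / rng.gamma(alpha1, 1.0 / beta1)   # sigma^2 ~ IG(alpha1, beta1)
        w = rng.multivariate_normal(w1, sigma2 * V1)    # w | sigma^2 ~ N(w1, sigma^2 * V1)
        draws.append(rng.normal(X_new @ w, np.sqrt(sigma2)))
    return np.array(draws)   # use its mean/percentiles for the predictive mean/intervals
```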

Side notes

We touched upon Bayesian linear regression when introducing Bayesian Optimization [1]. [4] is also a good resource on Bayesian linear regression. However, [1] and [4] only assume w is a random variable while \sigma^2 is still a fixed point estimate. This post actually goes fully Bayesian by assuming both w and \sigma^2 are random variables whose joint distribution follows the so-called Normal-inverse-Gamma distribution. There aren't too many resources in the same vein though; what I've found so far are [5] and section 2 in [7].

Reference

[1] https://czxttkl.com/?p=3212

[2] https://czxttkl.com/2015/03/22/logistic-regression%e6%98%af%e5%a6%82%e4%bd%95%e5%bd%a2%e6%88%90%e7%9a%84%ef%bc%9f/logistic-regression%e6%98%af%e5%a6%82%e4%bd%95%e5%bd%a2%e6%88%90%e7%9a%84%ef%bc%9f

[3] https://wso2.com/blog/research/part-two-linear-regression

[4] http://fourier.eng.hmc.edu/e176/lectures/ch7/node16.html

[5] A Guide to Bayesian Inference for Regression Problems: https://www.ptb.de/emrp/fileadmin/documents/nmasatue/NEW04/Papers/BPGWP1.pdf

[6] Bolstad, W. M. (2010). Understanding computational Bayesian statistics (Vol. 644). John Wiley & Sons.

[7] Denison, D. G., Holmes, C. C., Mallick, B. K., & Smith, A. F. (2002). Bayesian methods for nonlinear classification and regression (Vol. 386). John Wiley & Sons.

[8] https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf

[9] https://en.wikipedia.org/wiki/Scaled_inverse_chi-squared_distribution

[10] Conjugate bayesian analysis of the gaussian distribution

[11] https://wso2.com/blog/research/part-two-linear-regression

[12] https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7

[13] https://towardsdatascience.com/bayesian-linear-regression-in-python-using-machine-learning-to-predict-student-grades-part-2-b72059a8ac7e

Counterfactual Policy Evaluation

Evaluating trained RL policies offline is extremely important in real-world production: a trained policy with unexpected behaviors or unsuccessful learning would cause online regressions, so the safe thing to do is to evaluate its performance on offline training data and decide whether to deploy based on that evaluation. Evaluating policies offline is an ongoing research topic called “counterfactual policy evaluation” (CPE): how would a policy perform when we only have trajectories that were generated by some other policy?

CPE for contextual bandit

We first look at CPE for contextual bandit problems. The contextual bandit problem is to take an action a at each state s, where s is drawn from a distribution \mathcal{D}. We can only observe the reward associated with that specific pair of s and a: r(s, a). Rewards associated with actions not taken, r(s,a'), a' \neq a, can't be observed. The next state, at which you would take the next action, is not affected by the current state or action. Essentially, contextual bandit problems are Markov Decision Problems whose horizon (the number of steps) equals one. Suppose we have a dataset \mathcal{S} with N samples, each sample being a tuple (s_i, a_i, r_i), i=1,2,\cdots,N. a_i is sampled from a behavior policy \pi_0(a|s). r_i is calculated based on an underlying but non-observable reward function, r_i=r(s_i, a_i). For now, we ignore any noise that could exist in reward collection. To simplify our calculation, we assume that the decisions of \pi_0 are independent across samples: this assumption bypasses the possibility that \pi_0 maintains an internal memory h, the history of observations used to facilitate its future decisions. We also assume that during data collection we can access the true probability \pi_0(\cdot|s), the true action propensity of the behavior policy. This is not difficult to achieve in practice because we usually have direct access to the current policy's model. We will call the policy that we want to evaluate the target policy, denoted \pi_1.

Given all these notations defined, the value function of \pi_1 is:

V^{\pi_1} = \mathbb{E}_{s \sim \mathcal{D}, a \sim \pi_1(\cdot|s)}[r(s,a)]

A straightforward way to estimate V^{\pi_1}, called the Direct Method (DM), is to train an approximate reward function \hat{r} and plug it into V^{\pi_1}:

V^{\pi_1}_{DM}=\frac{1}{N}\sum\limits_{s \in \mathcal{S}}\sum\limits_a \hat{r}(s,a)\pi_1(a|s)

The bias of \hat{r} directly determines the bias of V^{\pi_1}_{DM}. The problem is that if \hat{r} is only trained on \mathcal{S} generated by \pi_0, it might be possible that these samples do not sufficiently cover state/action pairs relevant to V^{\pi_1}, thus \hat{r} could be very biased and inaccurate.

The second method is called inverse propensity score (IPS). Its formulation is basically importance sampling on rewards:

V^{\pi_1}_{IPS} = \frac{1}{N} \sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}} \frac{r_i \pi_1(a_i|s_i)}{\pi_0(a_i|s_i)}

We can prove that V^{\pi_1}_{IPS} is an unbiased estimator of V^{\pi_1}. I get some ideas from [4]:

\mathbb{E}_{p_{\pi_0}(\mathcal{S})}[V^{\pi_1}_{IPS}] \newline (p_{\pi_0}(\mathcal{S}) \text{ is the joint distribution of all samples generated by }\pi_0) \newline = \frac{1}{N}\int p_{\pi_0}(\mathcal{S})\sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}} \frac{r_i \pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} d\mathcal{S} \newline \text{(since each sample is collected independently, we have:)}\newline=\frac{1}{N}\sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}}\int p_{\pi_0}(s_i, a_i)\frac{r(s_i, a_i)\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)}ds_i da_i \newline=\int p_{\pi_0}(s, a)\frac{r(s, a)\pi_1(a|s)}{\pi_0(a|s)}ds da \newline=\int p_\mathcal{D}(s) \pi_0(a|s)\frac{r(s, a)\pi_1(a|s)}{\pi_0(a|s)}dsda \newline=\int p_{\pi_1}(s,a)r(s,a) dsda\newline=V^{\pi_1}

There is one condition that must be satisfied for the proof to hold: if \pi_0(a|s)=0 then \pi_1(a|s)=0. First, we don't need to worry about \pi_0(a|s)=0 appearing in the denominator of V^{\pi_1}_{IPS}, because such samples would never be collected in \mathcal{S} in the first place. However, if \pi_0(a|s)=0 while \pi_1(a|s)>0 for some (s,a), there would never be data to evaluate the outcome of taking action a at state s, which means V^{\pi_1}_{IPS} becomes a biased estimator of V^{\pi_1}. (Part of these ideas comes from [7].)

The intuition behind V^{\pi_1}_{IPS} is that:

  1. If r_i is a large (good) reward, and if \pi_1(a_i|s_i) > \pi_0(a_i|s_i), then we guess that \pi_1 is probably better than \pi_0 because \pi_1 may give actions leading to good rewards larger probabilities than \pi_0.
  2. If r_i is a small (bad) reward, and if \pi_1(a_i|s_i) > \pi_0(a_i|s_i), then we guess that \pi_1 is probably worse than \pi_0 because \pi_1 may give actions leading to bad rewards larger probabilities than \pi_0.
  3. We can reverse the relationship such that if \pi_1(a_i|s_i) < \pi_0(a_i|s_i), we can guess \pi_1 is better than \pi_0 if r_i itself is a bad reward. \pi_1 probably is a better policy trying to avoid actions leading to bad rewards.
  4. If \pi_1(a_i|s_i) < \pi_0(a_i|s_i) and r_i is a good reward, \pi_0 is probably a better policy trying to take actions leading to good rewards.

We can also relate the intuition of V^{\pi_1}_{IPS} to Inverse Probability of Treatment Weighting (IPTW) in observational studies [5]. In observational studies, researchers want to determine the effect of certain treatment while the treatment and control selection of subjects is not totally randomized. IPTW is one kind of adjustment to observational data for measuring the treatment effect correctly. One intuitive explanation of IPTW is to weight the subject with how much information the subject could reveal [2]. For example, if a subject who has a low probability to receive treatment and has a high probability to stay in the control group happened to receive treatment, his treatment result should be very informative because he represents the characteristics mostly associated with the control group. In a similar way, we can consider the importance sampling ratio \frac{\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} for each r_i as an indicator of how much information that sample reveals. When the behavior policy \pi_0 neglects some action a_i that is instead emphasized by \pi_1 (i.e., \pi_0(a_i|s_i) < \pi_1(a_i|s_i)), the sample (s_i, a_i, r_i) has some valuable counter-factual information hence we need to “amplify” r_i.

The third method to estimate V^{\pi_1}, called Doubly Robust (DR) [1], combines the direct method and the IPS method:

V^{\pi_1}_{DR}=\frac{1}{N}\sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}} \big[\frac{(r_i - \hat{r}(s_i, a_i))\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} +\int\hat{r}(s_i, a')\pi_1(a'|s_i) da'\big]

Given our assumption that we know the true action propensity \pi_0(\cdot|s), V^{\pi_1}_{DR} is still an unbiased estimator:

\mathbb{E}_{p_{\pi_0}(\mathcal{S})}[V^{\pi_1}_{DR}] \newline=\int p_{\pi_0}(s, a) \cdot \big[\frac{(r(s,a) - \hat{r}(s, a))\pi_1(a|s)}{\pi_0(a|s)} +\int \hat{r}(s, a')\pi_1(a'|s) da' \big]ds da \newline=\int p_{\pi_0}(s,a) \cdot\big[\frac{(r(s,a) - \hat{r}(s, a))\pi_1(a|s)}{\pi_0(a|s)} +\int(\hat{r}(s, a') - r(s,a'))\pi_1(a'|s)da' +\int r(s, a')\pi_1(a'|s)da' \big] dsda\newline=\int p_{\pi_1}(s,a) \cdot (-1+1)dsda + \int p_{\pi_1}(s,a) r(s,a) dsda\newline=V^{\pi_1}

However, in the original DR paper [1], the authors don't assume that the true action propensity can be accurately obtained. This means V^{\pi_1}_{DR} can be a biased estimator; however, if V^{\pi_1}_{DM} or V^{\pi_1}_{IPS} have biases within certain bounds, V^{\pi_1}_{DR} would have far lower bias than either of the other two.

Under the same assumption that the true action propensity is accessible, we could get the variances of DM, DR, and IPS CPE (Var[V^{\pi_1}_{DM}], Var[V^{\pi_1}_{DR}], and Var[V^{\pi_1}_{IPS}]) based on the formulas given in Theorem 2 in [1] while setting \delta=0 because we assume we know the true action propensity:


Under some reasonable assumptions according to [1], Var[V^{\pi_1}_{DR}] should be between Var[V^{\pi_1}_{DM}] and Var[V^{\pi_1}_{IPS}]. Therefore, V^{\pi_1}_{DR} should be overall favored because of its zero bias and intermediate variance.
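To make the three bandit estimators concrete, here is a minimal sketch of DM, IPS, and DR on logged data (my own illustration; samples is a list of (s, a, r) tuples logged under \pi_0, pi0 and pi1 are hypothetical callables returning action propensities, and r_hat is a hypothetical learned reward model):

```python
def bandit_cpe(samples, pi0, pi1, r_hat, actions):
    """Return the DM, IPS, and DR estimates of the target policy's value."""
    dm_total, ips_total, dr_total = 0.0, 0.0, 0.0
    for s, a, r in samples:
        dm_term = sum(r_hat(s, a_prime) * pi1(a_prime, s) for a_prime in actions)
        w = pi1(a, s) / pi0(a, s)                      # importance weight
        dm_total += dm_term
        ips_total += w * r
        dr_total += w * (r - r_hat(s, a)) + dm_term    # IPS on the residual + DM baseline
    n = len(samples)
    return dm_total / n, ips_total / n, dr_total / n
```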

CPE for MDP

We have finished introducing several CPE methods for contextual bandit problems. Now, we move our focus to CPE for Markov Decision Problems whose horizon is larger than 1. The value function of a policy is:

V^{\pi} = \mathbb{E}_{s_0 \sim \mathcal{D}, a_t \sim \pi(\cdot|s_t), s_{t+1}\sim P(\cdot|a_t, s_t)}[\sum\limits_{t=1}^H\gamma^{t-1}r_t],

where H is the number of steps in one trajectory. Simplifying the notation of this formula, we get:

V^{\pi} = \mathbb{E}_{\tau \sim (\pi, \mu)}[R(\tau)],

where \tau is a single trajectory, \mu denotes the transition dynamics, and R(\tau)=\sum\limits_{t=1}^H\gamma^{t-1}r_t is the cumulative discounted reward of the trajectory \tau.

Following the idea of IPS, we can have importance-sampling based CPE of \pi_1 based on trajectories generated by \pi_0:

V^{\pi_1}_{trajectory-IS}(\tau) = R(\tau) \prod\limits_{t=1}^H\frac{\pi_1(a_t|s_t)}{\pi_0(a_t|s_t)}

V^{\pi_1}_{per-decision-IS}(\tau)\newline=\sum\limits_{t=1}^H \gamma^{t-1} r_t \prod\limits_{j=1}^t \frac{\pi_1(a_j|s_j)}{\pi_0(a_j|s_j)}\newline=\sum\limits_{t=1}^H \gamma^{t-1} r_t \rho_{1:t}

V^{\pi_1}_{trajectory-IS} is an unbiased estimator of V^{\pi_1} given the same condition required for V^{\pi_1}_{IPS}: if \pi_0(a|s)=0 then \pi_1(a|s)=0. V^{\pi_1}_{per-decision-IS} is also an unbiased estimator of V^{\pi_1}, as proved in [6]. However, V^{\pi_1}_{per-decision-IS} has a lower upper bound than V^{\pi_1}_{trajectory-IS}; therefore, in practice V^{\pi_1}_{per-decision-IS} usually enjoys lower variance than V^{\pi_1}_{trajectory-IS} (compare chapters 3.5 and 3.6 in [9]).
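Here is a minimal sketch of the per-decision IS estimate for a single trajectory (my own illustration; pi0 and pi1 are hypothetical callables returning action propensities):

```python
def per_decision_is(trajectory, pi0, pi1, gamma=0.99):
    """trajectory: list of (s_t, a_t, r_t) logged under pi0.
    Returns sum_t gamma^{t-1} * r_t * rho_{1:t}."""
    rho, value = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        rho *= pi1(a, s) / pi0(a, s)       # running product rho_{1:t}
        value += (gamma ** t) * r * rho    # gamma^t here equals gamma^{t-1} in the text, where t starts at 1
    return value
```

Averaging this quantity over all logged trajectories gives the per-decision IS estimate of V^{\pi_1}; using only the full-trajectory ratio instead of the per-step running product recovers trajectory-IS.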

Following the idea of DR, [3] proposed an unbiased DR-based CPE for finite horizons, and later [8] extended it to the infinite-horizon case:

V^{\pi_1}_{DR}(\tau)=\sum\limits_{t=1}^\infty \gamma^{t-1}r_t\rho_{1:t} -\sum\limits_{t=1}^\infty \gamma^{t-1} \big( \rho_{1:t}\hat{q}^{\pi_1}(s_t, a_t)-\rho_{1:t-1}\hat{v}^{\pi_1}(s_t) \big),

where \hat{q}^{\pi_1} and \hat{v}^{\pi_1} are the estimated q-value function and state-value function of the target policy \pi_1. \rho_{1:t} = \prod\limits_{i=1}^t\frac{\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} and we define the corner case \rho_{1:0} = 1.

Based on Theorem 1 in [3], V^{\pi_1}_{DR} should have lower variance than V^{\pi_1}_{per-decision-IS}, after realizing that per-decision-IS CPE is a special case of DR CPE (with \hat{q} and \hat{v} set to zero).

Another CPE method also proposed by [6] is called Weighted Doubly Robust (WDR). It normalizes \rho_{1:t} in V^{\pi_1}_{DR}:

V^{\pi_1}_{WDR}(\tau)=\sum\limits_{t=1}^\infty \gamma^{t-1}r_t w_{1:t} -\sum\limits_{t=1}^\infty \gamma^{t-1} \big( w_{1:t}\hat{q}^{\pi_1}(s_t, a_t)-w_{1:t-1}\hat{v}^{\pi_1}(s_t) \big),

where w_{1:t} = \frac{\rho_{1:t}}{\sum\limits_\tau \rho_{1:t}}.
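The normalization itself is a one-liner. A small numpy sketch of my own (ratios is assumed to be an array of per-step propensity ratios \pi_1/\pi_0 for a batch of equal-length trajectories):

```python
import numpy as np

def wdr_weights(ratios):
    """ratios: array of shape [num_trajectories, H] with entries pi1(a_t|s_t) / pi0(a_t|s_t).
    Returns w_{1:t} = rho_{1:t} / (sum over trajectories of rho_{1:t})."""
    rho = np.cumprod(ratios, axis=1)               # rho_{1:t} for every trajectory and step
    return rho / rho.sum(axis=0, keepdims=True)    # normalize across trajectories at each step t
```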

V^{\pi_1}_{WDR} is a biased estimator, but since it normalizes the weights of V^{\pi_1}_{DR} into the range [0,1], WDR can have lower variance than DR when fewer samples are available. The reduction in variance produces a larger reduction in expected squared error than the additional error incurred by the bias (see chapter 3.8 of [9]). Furthermore, under reasonable assumptions WDR is a consistent estimator, so the bias becomes less of an issue as the sample size increases. (As a side note, an estimator can be any combination of biased/unbiased and inconsistent/consistent [10].)

Direct Method CPE on trajectories learns to reconstruct the MDP. This means DM needs to learn the transition function and reward function for simulating episodes. The biggest concern is whether the collected samples suffice to train a low-bias model. If they do, then DM can actually achieve very good performance. Section 6 of [8] lists a situation where DM can outperform DR and WDR: while DM has low variance and low bias (even no bias), stochasticity in rewards and state transitions produces high variance in DR and WDR.

The last CPE method is called Model and Guided Importance Sampling Combining (MAGIC) [8]. It is considered the best CPE method because it combines the advantages of both WDR and DM. The motivation given by [8] is that CPE methods involving importance sampling are generally consistent (i.e., as the number of samples increases, the estimate moves close to the true value) but still suffer from high variance; the model-based method (DM) has low variance but is biased. MAGIC combines the ideas of DM and WDR such that it can track the performance of whichever of the two is better. The MAGIC method is heavily math-involved; without a deep dive into it yet, we only show its pseudocode for now:


Below is a table concluding this post, showing whether each CPE method (for evaluating trajectories) is biased or unbiased, consistent or inconsistent, and how its variance ranks.

Method          | Biased/Unbiased | Consistent/Inconsistent | Variance rank
DM              | biased          | inconsistent            | 1
trajectory-IS   | unbiased        | consistent              | 5
per-decision-IS | unbiased        | consistent              | 4
DR              | unbiased        | consistent              | 3
WDR             | biased          | consistent              | 2
MAGIC           | biased          | consistent              | not sure, possibly between 1 and 2

(Table note: (1) variance rank 1 means the lowest variance, 5 the highest; (2) when generating this table, we use the reasonable assumptions mentioned above and used in the cited papers.)

[9] Thomas, P. S. (2015). Safe reinforcement learning (Doctoral dissertation, University of Massachusetts Libraries).

[10] https://stats.stackexchange.com/questions/31036/what-is-the-difference-between-a-consistent-estimator-and-an-unbiased-estimator

Resources about Attention is all you need

There are several online posts [1][2] that illustrate the idea of Transformer, the model introduced in the paper “attention is all you need” [4].

Based on [1] and [2], I am sharing a short tutorial for implementing Transformer [3]. In this tutorial, the task is “copy-paste”, i.e., to let a Transformer learn to output the same sequence as the input sequence. We denote the symbols used in the code as:

Name | Meaning
src_vocab_size=12 | The vocabulary size of the input sequence.
tgt_vocab_size=12 | The vocabulary size of the output sequence. Since our task is to copy the input sequence, tgt_vocab_size = src_vocab_size. We reserve symbol 0 for padding and symbol 1 for the decoder start symbol, and we assume possible real symbols are from 2 to 11. Therefore, the total vocabulary size is 12.
dim_model=512 | The dimension of the embeddings, of the self-attention output, and of the input/output of the position-wise feed-forward model.
dim_feedforward=256 | The size of the hidden layer in the position-wise feed-forward layer.
d_k=64 | The dimension of each attention head.
num_heads=8 | The number of attention heads. For the sake of computational convenience, d_k * num_heads should equal dim_model.
seq_len=7 | The length of input sequences.
batch_size=30 | Batch size, the number of input sequences per batch.
num_stacked_layers=6 | The number of layers stacked in both the Encoder and the Decoder.

Let's first look at the input/output data, which is generated by data_gen. Each Batch instance contains several essential pieces of information to be used in training:

  • src (batch_size x seq_len): the input sequence to the encoder
  • src_mask (batch_size x 1 x seq_len): the mask of the input sequence, so that each input symbol can control which other symbols it attends to
  • trg_y (batch_size x seq_len): the output sequence as the label, which is the same as src given that our task is simply “copy-paste”
  • trg (batch_size x seq_len): the input sequence to the decoder. The first symbol of each sequence is symbol 1, which kicks off the decoder; the last symbol of each sequence is the second-to-last symbol of the trg_y sequence.
  • trg_mask (batch_size x seq_len x seq_len): the mask of the output sequence, so that at each decoding step, a symbol being decoded can only attend to already decoded symbols. trg_mask[i, j, k], a binary value, indicates whether in the i-th output sequence the j-th symbol can attend to the k-th symbol. Since a symbol can attend to itself and all its preceding symbols, trg_mask[i,j,k]=1 for all k <= j (a sketch of this causal mask is given right after this list).
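A minimal sketch of the causal part of trg_mask (my own illustration, not the tutorial's exact code; the real trg_mask may also account for padding):

```python
import torch

def causal_mask(seq_len):
    """Lower-triangular mask: position j may attend to positions k <= j."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```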

Now, let's look at the EncoderLayer class. There are num_stacked_layers=6 encoder layers in one Encoder. The bottom layer takes the symbol embeddings as input, while each of the upper layers takes the previous layer's output. Both the symbol embeddings and each layer's output have the same dimension `batch_size x seq_len x dim_model`, so in theory you can stack as many layers as needed because their inputs/outputs are all compatible. The symbol embedding converts input symbols into embeddings, plus a positional encoding which helps the model learn positional information.

One EncoderLayer does the following things:

  • It has three matrices stored in MultiHeadedAttention.linears[:3], which we call query transformation matrix, key transformation matrix, and value transformation matrix. They transform the input from the last layer into three matrices which the self-attention module will use. Each of the output matrices, which we call query, key, and value, has the shape batch_size x num_heads x seq_len x d_k.
  • The transformed matrices are used as query, key, and value in self-attention (the attention function). The attention is calculated based on Eqn. 1 in [4] (a minimal sketch of this function is given right after this list):

The attention function also needs a mask as input. The mask tells it, for each symbol, which other symbols it can attend to (due to padding, or other reasons). In our case, the input sequences are all of equal length and we allow every symbol to attend to all other symbols, so the encoder mask is always filled with ones.

  • The attention outputs of the num_heads heads are concatenated and passed through a linear projection layer, which results in an output of shape `batch_size x seq_len x dim_model`
  • The output is finally processed by a residual connection and a layer normalization (LayerNorm).
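For reference, here is a minimal sketch of the scaled dot-product attention of Eqn. 1 in [4], i.e., softmax(QK^T/\sqrt{d_k})V (my own paraphrase of the standard formulation, not the tutorial's exact code):

```python
import math
import torch

def attention(query, key, value, mask=None):
    """query/key/value: (batch_size, num_heads, seq_len, d_k). mask: broadcastable to the score shape."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)   # (batch, heads, seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)           # disallowed positions get ~zero attention weight
    p_attn = scores.softmax(dim=-1)
    return p_attn @ value, p_attn
```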

So far, the encoder part can be illustrated in the following diagram [1] (suppose our input sequence has length 2, containing the symbols 2 and 3):

One DecoderLayer does the following things:

  • It first does self-attention on the target sequence (batch.trg, not the trg_y sequence), which generates an attention output of shape (batch_size, seq_len, dim_model). The first symbol of a target sequence is 1, a symbol that tells the decoder to start decoding.
  • The target sequence then performs attention over the encoder, i.e., it uses the target sequence's self-attention output as the query and the final encoder layer's output as the key and value.
  • The encoder-decoder attention's output is passed through a feed-forward layer, with layer norm and residual connections. The output of one DecoderLayer has the shape `batch_size x seq_len x dim_model`. Again, this shape allows you to stack as many layers as you want.

The decoder part is shown in the following diagram:

The last part is label fitting. The last DecoderLayer has an output of shape `batch_size x seq_len x dim_model`, i.e., each target symbol's attention output. We pass it through a linear projection and then a softmax such that each target position is associated with a probability distribution over the next symbol to decode. We fit this probability distribution against batch.trg_y by minimizing the KL divergence. Here is one example of label fitting:

In practice, we also apply label smoothing such that the target distribution to fit has a slight probability mass even on the non-target symbols. For example, the target distribution of position #1 (refer to the diagram on the left) is no longer (0, 0, 1, 0, 0); rather, it becomes (0.1, 0.1, 0.6, 0.1, 0.1) if you use 0.1 label smoothing. The intuition is that even our labels may not be 100% accurate, and we should not let the model become overly confident in the training data. In other words, label smoothing helps prevent overfitting and hopefully improves generalization on unseen data.
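A minimal sketch of building such a smoothed target distribution (my own illustration matching the (0.1, 0.1, 0.6, 0.1, 0.1) example above; the tutorial's LabelSmoothing module may use a slightly different convention):

```python
import torch

def smoothed_targets(labels, vocab_size, smoothing=0.1):
    """Put probability `smoothing` on every non-target symbol and the remaining mass on the target symbol."""
    dist = torch.full((labels.size(0), vocab_size), smoothing)
    dist.scatter_(1, labels.unsqueeze(1), 1.0 - smoothing * (vocab_size - 1))
    return dist

print(smoothed_targets(torch.tensor([2]), 5))   # tensor([[0.1000, 0.1000, 0.6000, 0.1000, 0.1000]])
```

This smoothed distribution is then fitted with a KL-divergence loss against the model's log-softmax output.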

[1] http://jalammar.github.io/illustrated-transformer/

[2] http://nlp.seas.harvard.edu/2018/04/03/attention.html

[3] https://github.com/czxttkl/Tutorials/blob/master/experiments/transformer/try_transformer.py

[4] https://arxiv.org/abs/1706.03762

Implementation notes for world model

I've recently been implementing the world model [1], which seems to be a promising algorithm for effectively learning controls after first learning the environment. Here I share some implementation notes.

Loss of Gaussian Mixture Model

The memory model of the world model is a Mixture-Density-Network Recurrent Neural Network (MDN-RNN). It takes the current state and action as inputs, and outputs predictions of the next state, the terminal flag, and the reward. Since the state transition might be stochastic, the authors choose to model the next-state output as a Gaussian mixture model: the next state can come from one of several multi-dimensional Gaussian components with certain probability assignments. The probability of the next state coming from component $latex i$ is $latex \pi_i$, and that component itself is a Gaussian distribution with mean $latex \mu_i \in \mathbb{R}^f$ and covariance $latex \Sigma_i \in \mathbb{R}^{f \times f}$, where $latex f$ is the feature size of a state.

To learn the parameters of the MDN-RNN that predict $latex \pi_i$, $latex \mu_i$, and $latex \Sigma_i$, we need a loss function that measures how well the observed next state fits the predicted Gaussian mixture model. The loss function is the negative log likelihood. Since we usually fit data in batches, and one batch contains sequences of consecutive states from the same episodes, the loss function is defined over batch_size and seq_len:

$latex - \sum\limits_{z=0}^{batch-size} \sum\limits_{s=0}^{seq-len} log (\sum\limits_i \pi_i \mathcal{N}(x^{z,s} | \mu^{z,s}_{i}, \Sigma^{z,s}_{i})) &s=2$

 

If we ignore the negative sign for now and only look at one state for a particular batch $latex z$ and sequence element $latex s$, we get:

$latex log(\sum\limits_i \pi_i \mathcal{N}(x | \mu_{i}, \Sigma_{i}))&s=2$

 

In practice, the MDN-RNN usually predicts $latex log(\pi_{i})$, and we can get $latex log(\mathcal{N}(x | \mu_{i}, \Sigma_{i}))$ directly from PyTorch's Normal.log_prob [3], so the formula above can be re-written as:

$latex log \sum\limits_i e^{log(\pi_i \mathcal{N}(x | \mu_{i}, \Sigma_{i}))} \newline =log\sum\limits_i e^{log(\pi_i) + log(\mathcal{N}(x | \mu_{i}, \Sigma_{i}))} &s=2$

 

According to the log-sum-exp trick [2], for numerical stability,

$latex log(\sum\limits_{i=1}^n e^{x_i}) = log(\sum\limits_{i=1}^n e^{x_i - c}) + c &s=2$

where $latex c=max_i x_i$ is usually picked.
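Putting the pieces together, here is a minimal PyTorch sketch of the GMM negative log likelihood using torch.logsumexp for stability (my own sketch, assuming diagonal covariances; the tensor shapes are assumptions for illustration):

```python
import torch
from torch.distributions import Normal

def gmm_nll(log_pi, mu, sigma, x):
    """log_pi: (batch, seq, K) mixture log-weights log(pi_i);
    mu, sigma: (batch, seq, K, f) per-component means and standard deviations;
    x: (batch, seq, f) observed next states."""
    # log N(x | mu_i, Sigma_i) per component, summing over the f feature dims (diagonal covariance)
    log_prob = Normal(mu, sigma).log_prob(x.unsqueeze(-2)).sum(dim=-1)   # (batch, seq, K)
    # stable log sum_i exp(log pi_i + log N(x | mu_i, Sigma_i))
    log_mix = torch.logsumexp(log_pi + log_prob, dim=-1)                 # (batch, seq)
    return -log_mix.sum()   # the double sum over batch and sequence in the loss above
```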

The takeaway is that we walked through how the GMM loss function is defined and how the log-sum-exp trick is adopted when calculating it.

 

 

References

[1] https://worldmodels.github.io/

[2] https://blog.feedly.com/tricks-of-the-trade-logsumexp/

[3] https://pytorch.org/docs/stable/distributions.html#torch.distributions.normal.Normal.log_prob

My understanding in 401K

Here is my reasoning about 401K.

First, I'll start with two definitions: (1) taxable income, meaning the gross income you receive on which your tax will be calculated; (2) tax deduction, meaning any deduction from your taxable income. Tax deductions lower your taxable income and thus generally lower your tax.

401K has three categories:

Pre-tax: contribute your taxable income directly to the 401K account, without paying tax at the time of contribution. Hence, pre-tax 401K is a form of tax deduction. You can only withdraw after age 59.5, and every dollar you withdraw (i.e., the money you contributed plus its growth) will be taxed based on the tax bracket you are in at the time of withdrawal (i.e., when you are older and probably don't have a high income).

Roth: contribute your after-tax income to the 401K account [1]. You can only withdraw after age 59.5, but you don't pay any tax at the time of withdrawal (including on the part you earned).

After-tax: contribute your after-tax income to a 401K account separate from Roth or pre-tax. However, you still need to pay tax on the earnings when withdrawing: “the earnings on after-tax contributions, just like all distributions from pre-tax contributions, are taxed as ordinary income” [4]. What's more, there is a 10% early withdrawal penalty on the earnings if you withdraw before age 59.5. The early withdrawal penalty and the tax on the earnings are collected proportionally [5]: if your total after-tax contribution has accumulated x% earnings, then you have to pay tax and penalty on x% of whatever amount of money you withdraw before age 59.5.

There is another age rule on 401K accounts: you can defer withdrawing from your 401K if you are still working at an older age. But once you are over 70, you need to start withdrawing at least the Required Minimum Distribution (RMD) amount each year [7].

The annual combined limit for 401K pre-tax + Roth is $19,000 as of 2019. However, there is another limit [3] that restricts you from contributing more than $56,000 per year across all kinds of retirement plans. That means, if you fully contribute $19,000 to pre-tax + Roth 401K, then your after-tax 401K contribution plus the employer match should not exceed $56,000 - $19,000 = $37,000. There is another limit, not for 401K but for IRA (individual retirement account), which is $6,000.

The benefit of after-tax 401K contributions isn't apparent. One underlying purpose of saving in the after-tax account is to bypass the $19,000 limit for pre-tax + Roth. You can roll over your money from after-tax 401K to Roth 401K and then to Roth IRA, another type of retirement account which is very similar to Roth 401K in the sense that anything going into it will be tax-free. At the time of the rollover from after-tax 401K to Roth 401K, any earnings in the after-tax account are treated as ordinary income and thus become taxable; but after the rollover, the whole amount of money as well as all future earnings will be tax-free.

In many situations, you need some triggering event (such as retirement or changing companies) to be able to roll over from after-tax 401K to Roth IRA. Only a few companies offer the opportunity to roll over from after-tax to Roth IRA directly without any triggering event. My company (FB) supports rolling over from after-tax 401K to Roth 401K immediately after any contribution to after-tax 401K, meaning that I don't need to pay any tax on the rollover because no earnings have been generated before the rollover happens. And our company's employees can roll over from Roth 401K to Roth IRA at any time.

The withdrawal rules of Roth IRA are summarized as follows. Assume you only contribute to Roth IRA by converting from after-tax 401K. You can withdraw the converted money (the principal, not the earnings) without penalty after it has stayed in the Roth IRA for more than 5 years [8]. You can withdraw the earnings without penalty if you are over age 59.5, die, become disabled, or withdraw up to $10,000 for a first-home purchase, provided the first Roth contribution was made more than 5 years ago; otherwise you pay a 10% penalty on the earnings [6]. As discussed in [8], the earnings that the penalty applies to are the pre-rollover earnings (earned before you roll over from after-tax 401K to Roth 401K). So since my company allows rolling over from after-tax 401K to Roth 401K immediately, I don't need to pay any penalty at all even if I withdraw from my Roth IRA within 5 years of my Roth IRA contribution!

So it seems that if you have any money invested in after-tax 401K, it is better to roll it over to Roth IRA (if your company allows that during your employment), because (1) Roth IRA doesn't incur any tax, while after-tax 401K still incurs tax on earnings even after age 59.5, and (2) Roth IRA is more flexible for withdrawing the converted part of the money (you can withdraw even before age 59.5).


Now let’s talk about the appropriate distribution of 401K money. Here is a chart showing the suggested 401K amount you should save at each age group [11]:

In essence, a Vanguard target date fund comprises several funds that cover domestic/international stock markets and domestic/international bonds. For example, below is the decomposition of Vanguard Target Date 2060 as of 01/31/2020 (https://investor.vanguard.com/mutual-funds/profile/VTTSX):

From historical statistics, bonds have lower yields but also lower risk than stocks [10]. Also, bonds distribute dividends that are treated as ordinary income. So the three-fund theory [12] tells us that if we really have a lot of money to invest, we should put bonds in tax-advantaged accounts such as 401K, and put stock index funds in taxable accounts (because when holding them for a long time we only need to pay long-term capital gains tax rather than ordinary income tax).

 

Updated 2019-05-13

Another example on how to convert after-tax to roth: Roth In-Plan Brochure

 

Updated 2020-05-01

This list shows historical returns for Vanguard funds, which are the main funds in my 401K.

https://retirementplans.vanguard.com/VGApp/pe/faces/PubFundChart?site=nyu/6370

 

Updated 2020-09-05

Below is the expected growth of retirement funds under a realistic estimate for a middle-class income.

Retirement Planning Calculator (US Personal Finance @ FB)

References

[1] https://www.irs.gov/retirement-plans/roth-comparison-chart

[2] https://www.bogleheads.org/wiki/After-tax_401(k)

[3] https://www.goodfinancialcents.com/401k-contribution-limits/

[4] https://www.doughroller.net/retirement-planning/pros-and-cons-of-after-tax-401k-contributions/

[5] https://about401k.com/withdrawal/after-tax/

[6] https://www.schwab.com/public/schwab/investing/retirement_and_planning/understanding_iras/roth_ira/withdrawal_rules

[7] https://www.sensiblemoney.com/learn/at-what-age-should-i-start-making-401k-withdrawals/

[8] https://thefinancebuff.com/rollover-after-tax-401k-403b-access-before-59-1-2.html

[9] My company’s After-tax 401K to Roth 401K FAQ: PlanConversionFAQ

[10] https://personal.vanguard.com/us/insights/saving-investing/model-portfolio-allocations

[11] https://www.cnbc.com/2019/05/23/how-much-money-americans-have-in-their-401ks-at-every-age.html

[12] https://www.bogleheads.org/wiki/Three-fund_portfolio

DPG and DDPG

In this post, I am sharing my understanding of the Deterministic Policy Gradient algorithm (DPG) [1] and its deep-learning version, DDPG [2].

We introduced the policy gradient theorem in [3, 4]. Here, we briefly recap. The objective function of policy gradient methods is:

$latex J(\theta)=\sum\limits_{s \in S} d^\pi(s) V^\pi(s)=\sum\limits_{s \in S} d^\pi(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a), &s=2$

where $latex \pi$ represents $latex \pi_\theta$, $latex d^\pi(s)$ is the stationary distribution of Markov chain for $latex \pi$, $latex V^\pi(s)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s]$, and $latex Q^{\pi}(s,a)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s, A_t=a]$. $latex G_t$ is accumulated rewards since time step $latex t$: $latex G_t=\sum^\infty_{k=0}\gamma^k R_{t+k}$.

The policy gradient theorem proves that the gradient of $latex J(\theta)$ with respect to the policy parameters is:

$latex \nabla_\theta J(\theta) = \mathbb{E}_{\pi}[Q^\pi(s,a)\nabla_\theta ln \pi_\theta(a|s)] &s=2$

More specifically, the policy gradient theorem we are talking about is the stochastic policy gradient theorem, because at each state the policy outputs a stochastic action distribution $latex \pi(a|s)$.

However, the policy gradient theorem implies that the policy gradient must be calculated on-policy, as $latex \mathbb{E}_{\pi}:=\mathbb{E}_{s\sim d^{\pi}, a\sim \pi}$ means the state and action distributions are generated by following $latex \pi_\theta$.

If the training samples are generated by some other behavior policy $latex \beta$, then the objective function becomes:

$latex J(\theta)=\sum\limits_{s \in S} d^\beta(s) V^\pi(s)=\sum\limits_{s \in S} d^\beta(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a), &s=2$

and we must rely on importance sampling to calculate the policy gradient [7,8]:

$latex \nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^\beta}[\sum\limits_{a\in A}\nabla_\theta \pi(a|s) Q^\pi(s,a) + \pi(a|s)\nabla_\theta Q^\pi(s,a)] \newline \approx \mathbb{E}_{s\sim d^\beta}[\sum\limits_{a\in A}\nabla_\theta \pi(a|s) Q^\pi(s,a)] \newline =\mathbb{E}_{\beta}[\frac{\pi_\theta(a|s)}{\beta(a|s)}Q^\pi(s,a)\nabla_\theta ln \pi_\theta(a|s)] &s=2$

where $latex \mathbb{E}_{\beta}:=\mathbb{E}_{s\sim d^\beta, a\sim\beta}$

However, using the importance sampling ratio $latex \frac{\pi_\theta(a|s)}{\beta(a|s)}$ could cause some learning difficulties [5]:


What DPG proposes is to design a policy that outputs actions deterministically: $latex a = \mu_\theta(s)$. Then $latex J(\theta)$ only involves an integration (or sum) over the state distribution and gets rid of importance sampling, which potentially makes learning easier:

$latex J(\theta)=\sum\limits_{s \in S} d^\beta(s) V^\mu(s)=\sum\limits_{s \in S} d^\beta(s) Q^\mu(s, \mu_\theta(s)) &s=2$

Writing the above formula as an integral (rather than a discrete sum), we get:

$latex J(\theta)=\int\limits_S d^\beta(s) V^\mu(s) ds=\int\limits_{S} d^\beta(s) Q^\mu(s, \mu_\theta(s)) ds &s=2$

And the gradient of $latex J(\theta)$ is:

$latex \nabla_\theta J(\theta) = \int\limits_S d^\beta(s) \nabla_\theta Q^\mu(s, \mu_\theta(s)) ds \approx \mathbb{E}_\beta [\nabla_\theta \mu_\theta(s) \nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}] &s=2$

This immediately implies that $latex \theta$ can be updated off-policy: since we only need to sample $latex s \sim d^\beta(s)$, $latex \nabla_\theta J(\theta)$ can be calculated without knowing what action the behavior policy $latex \beta$ took (we only need to know what the trained policy would take, i.e., $latex a=\mu_\theta(s)$). We should also note that there is an approximation in the formula of $latex \nabla_\theta J(\theta)$, because the chain rule cannot be applied exactly to $latex \nabla_\theta Q^\mu(s,\mu_\theta(s))$: even if you replace $latex \mu_\theta(s)$ with $latex a$, the $latex Q$ function itself still depends on $latex \theta$.

Updating $latex \theta$ is one part of DPG. The other part is fitting a Q-network with parameters $latex \phi$ to the Q-function of the policy governed by $latex \theta$. This equates to updating $latex \phi$ according to the mean-squared Bellman error (MSBE) [10], the same as in Q-learning.

I haven't read the DDPG paper [2] thoroughly, but based on my rough understanding, it adds several tricks for dealing with large inputs (such as images). According to the summary in [9], DDPG introduces three tricks:

  1. add batch normalization to normalize “every dimension across samples in one minibatch”
  2. add noise in the output of DPG such that the algorithm can explore
  3. soft update target network
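Putting the two DPG updates and the target-network trick together, here is a minimal PyTorch-style sketch of one update step (my own sketch, assuming actor, critic, and their target copies are nn.Module-like callables; this is not the code from [1] or [2]):

```python
import torch
import torch.nn.functional as F

def ddpg_losses(batch, actor, critic, target_actor, target_critic, gamma=0.99):
    """Return (actor_loss, critic_loss) for one minibatch of off-policy transitions."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # MSBE target: r + gamma * Q'(s', mu'(s')) on non-terminal transitions
        target_q = r + gamma * (1 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)   # fit Q_phi by the mean-squared Bellman error
    actor_loss = -critic(s, actor(s)).mean()           # ascend grad_theta Q(s, mu_theta(s))
    return actor_loss, critic_loss
```

After each update step, the target networks would be softly updated (trick 3), e.g., \theta' \leftarrow \tau\theta + (1-\tau)\theta'.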

References

[1] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014, June). Deterministic policy gradient algorithms. In ICML.

[2] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., … & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

[3] Notes on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”: https://czxttkl.com/?p=3497

[4] Policy Gradient: https://czxttkl.com/?p=2812

[5] https://repository.tudelft.nl/islandora/object/uuid:682a56ed-8e21-4b70-af11-0e8e9e298fa2?collection=education

[6] https://stackoverflow.com/questions/42763293/what-is-the-advantage-of-deterministic-policy-gradient-over-stochastic-policy-gr

[7] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#off-policy-policy-gradient

[8] Degris, T., White, M., & Sutton, R. S. (2012). Off-policy actor-critic. arXiv preprint arXiv:1205.4839.

[9] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#ddpg

[10] https://spinningup.openai.com/en/latest/algorithms/ddpg.html#the-q-learning-side-of-ddpg