Stochastic Variational Inference

Introduction

In this post, we introduce stochastic variational inference, a machine learning technique widely used to estimate the posterior distribution of Bayesian models. Suppose that in a Bayesian model the model parameters are denoted as a vector z and the observations are denoted as x. According to Bayes' theorem, the posterior distribution of z can be computed as:

p(z|x)=\frac{p(z,x)}{p(x)}

p(x) is the marginal probability of the observations, integrated over all possible model parameters:

p(x)=\int p(z,x)dz 

p(x) isn't easy to compute and is most of the time intractable because of its integral form. If we cannot compute p(x), then we cannot compute p(z|x), which is what we want to know. Therefore, we need a way to approximate p(z|x). We denote the approximate posterior as q(z); q(z) is also called the variational distribution, hence the name variational inference.

Stochastic variational inference (SVI) is one such method to approximate p(z|x). From the ICML 2018 tutorial [2], we can see the niche where SVI lies: among all possible ways to approximate p(z|x), there is a group of algorithms that use optimization to minimize the difference between q^*(\cdot) and p(\cdot|x). Those that measure the difference between the two distributions with the Kullback-Leibler divergence are called variational inference. If we further categorize by the family of q^*, there is one particular family, the mean-field variational family, to which variational inference is especially easy to apply. After all these levels of categorization, we arrive at an objective function that we seek to optimize. SVI is one optimization method for that objective, pushing q^* toward minimizing the KL-divergence with p(z|x).

Objective function

By definition, the KL divergence between two continuous distributions with density functions h(x) and g(x) is [4]:

KL\left(h(x)||g(x)\right)\newline=\int^\infty_{-\infty}h(x)log\frac{h(x)}{g(x)}dx \newline=\mathbb{E}_h[log\;h(x)]-\mathbb{E}_h[log\;g(x)]

If we are trying to find the best approximated distribution q^*(z) using variational Bayes, we define the following objective function:

q^*(z)=argmin_q KL\left(q(z) || p(z|x) \right )

where KL(q(z)||p(z|x))=\mathbb{E}_q [log\;q(z)] - \mathbb{E}_q [log\;p(z,x)]+log\;p(x) (all expectations are taken with respect to q(z)). Note that since we optimize with respect to q(\cdot), log\;p(x) can be treated as a constant. Thus, minimizing the KL-divergence is equivalent to maximizing:

ELBO(q)=\mathbb{E}_q[log\;p(z,x)] - \mathbb{E}_q[log\;q(z)]

ELBO(q) is the lower bound of log\;p(x) because of the non-negativity of KL-divergence:

KL(q(z) || p(z|x)) = log p(x) - ELBO(q) \geq 0

Update 2020.4:

The derivation above is also illustrated in [14]:

There are several other ways to understand ELBO.

  1. Based on Jensen’s inequality [13]: for a convex function f and a random variable X, f(\mathbb{E}[X]) \leq \mathbb{E}\left[f(X)\right]; for a concave function g, g(\mathbb{E}[X]) \geq \mathbb{E}\left[g(X)\right]. Therefore, we have:

log \; p(x)\newline=log \int_z p(z,x)\newline=log \int_z p(z,x)\frac{q(z)}{q(z)}\newline=log \int_z q(z)\frac{p(z,x)}{q(z)}\newline=log \left(\mathbb{E}_{q(z)}\left[\frac{p(z,x)}{q(z)}\right]\right) \newline \geq \mathbb{E}_{q(z)}\left[log \frac{p(z,x)}{q(z)}\right] \quad\quad \text{by Jensen's inequality} \newline =ELBO(q)

Therefore, ELBO(q) is a lower bound on log \; p(x).

2. By rearranging ELBO(q), we have:

ELBO(q)\newline=\mathbb{E}_q[log\;p(z,x)] - \mathbb{E}_q[log\;q(z)]\newline=\mathbb{E}_q[log\;p(x|z)] + \mathbb{E}_q[log\;p(z)] - \mathbb{E}_q[log\;q(z)] \quad\quad p(z) \text{ is the prior of } z \newline= \mathbb{E}_q[log\;p(x|z)]  - \mathbb{E}_q [log \frac{q(z)}{p(z)}]\newline=\mathbb{E}_q[log\;p(x|z)] - KL\left(q(z) || p(z)\right)

Therefore, the first part of ELBO(q) can be thought of as the so-called “reconstruction error”, which encourages q(z) to put more probability mass on the area with high log\;p(x|z). The second part encourages q(z) to stay close to the parameter prior p(z). ELBO(q) is the common objective used in Variational Autoencoder models.

How to optimize?

Recall that our objective function is q^*(z)=argmin_q KL(q(z) || p(z|x) ). In practice, minimizing with respect to q translates to parameterizing q(z) and then optimizing the objective function with respect to the parameters. One big assumption we can make to facilitate computation is that all latent variables are independent, such that q(z) factorizes into a product of distributions of individual latent variables. We call such a q(z) the mean-field variational family:

q(z)=\prod\limits_{j=1}^{|z|} q_j(z_j|\theta_j)  

From the factorization, you can see that each individual latent variable’s distribution is governed by its own parameter \theta_j. Hence, the objective function to approximate p(z|x) changes from q^*(z)=argmin_q KL(q(z) || p(z|x) ) to:

\theta^* = argmin_{\theta_1, \cdots, \theta_{|z|}} KL(\prod\limits_{j=1}^{|z|} q_j(z_j|\theta_j) || p(z|x) )

One simple algorithm to optimize this is called coordinate ascent mean-field variational inference (CAVI). Each time, the algorithm optimizes one variational distribution parameter while holding all the others fixed. The algorithm works as follows:

“Set q_j(z_j) \propto exp\{\mathbb{E}_{-j}[log p(z_j|z_{-j}, x)]\}” may seem hard to understand. It means setting the variational parameter \theta_j such that q_j(z_j|\theta_j) is the distribution equal to exp\{\mathbb{E}_{-j}[log p(z_j|z_{-j}, x)]\} up to a normalizing constant. \mathbb{E}_{-j} means that the expectation is taken with respect to the distribution \prod_{\ell \neq j} q_\ell(z_\ell|\theta_\ell).
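To make the CAVI updates concrete, here is a minimal sketch for the classic toy model x_i \sim \mathcal{N}(\mu, \tau^{-1}) with priors \mu|\tau \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1}) and \tau \sim Gamma(a_0, b_0). This example and its closed-form coordinate updates follow the standard textbook treatment (e.g., Bishop's PRML Section 10.1.3), not the references of this post:

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """CAVI for x_i ~ N(mu, 1/tau), mu|tau ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0).
    Returns the parameters of q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n)."""
    n, xbar = len(x), np.mean(x)
    e_tau = a0 / b0                      # initial guess for E_q[tau]
    a_n = a0 + (n + 1) / 2.0             # this coordinate update never changes
    for _ in range(iters):
        # update q(mu) while holding q(tau) fixed
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # update q(tau) while holding q(mu) fixed
        e_mu, e_mu2 = mu_n, mu_n ** 2 + 1.0 / lam_n
        e_sq_err = np.sum(x ** 2) - 2 * e_mu * np.sum(x) + n * e_mu2
        b_n = b0 + 0.5 * (e_sq_err + lam0 * (e_mu2 - 2 * e_mu * mu0 + mu0 ** 2))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

mu_n, lam_n, a_n, b_n = cavi_gaussian(np.random.normal(2.0, 0.5, size=500))
print(mu_n, a_n / b_n)   # approx. posterior mean of mu (~2) and E_q[tau] (~1/0.25)
```

Each pass of the loop is exactly one round of “optimize one variational parameter while holding all the others fixed”.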

What to do after knowing q(z)=\prod_j q(z_j|\theta_j)?

After the optimization (using CAVI for example), we get the variational distribution q(z)=\prod_j q(z_j|\theta_j). We can use the estimated \theta_j to analytically derive the mean of z_j or sample z_j from q(z_j|\theta_j). One thing to note is that there is no restriction on the parametric form of the individual variational distribution. For example, you may define q(z_j|\theta_j) to be an exponential distribution: q(z_j|\theta_j)=\theta_j e^{-\theta_j z_j}. Then, the mean of z_j is 1/\theta_j. If q(z_j|\theta_j) is a normal distribution, then \theta_j actually contains two parameters, the normal distribution’s mean and variance. Thus the mean of z_j is simply the mean parameter. 

Stochastic Variational Inference

One big disadvantage of CAVI is its scalability: each update of \theta_j requires a full sweep over the data. Stochastic variational inference (SVI) kicks in here because SVI's updates of \theta_j only require sub-samples of the data. The simple idea is to take the gradient \nabla_\theta ELBO and use it to update \theta. But there are a few more details:

  1. The update formulas become very succinct if we assume the complete conditionals are in the exponential family: p(z_j|z_{-j}, x)=h(z_j) exp\{\eta_j(z_{-j},x)^Tz_j - a(\eta_j(z_{-j}, x))\}, where z_j is its own sufficient statistic and h(\cdot), a(\cdot), and \eta(\cdot) are defined according to the definition of the exponential family [10].
  2. We also categorize latent variables into local variables and global variables.
  3. The gradient is not simply taken in the Euclidean space of the parameters but in the distribution space [11]. In other words, the gradient is transformed in a sensible way such that it points in the steepest descent direction of the KL-divergence. Also see [12].

All in all, SVI works as follows [8]: 
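As a structural sketch of that procedure (my own paraphrase, not the pseudocode from [8]), the SVI loop for a conditionally conjugate model can be written as follows; local_step and intermediate_global are hypothetical placeholders for the model-specific exponential-family computations from point 1 above:

```python
import numpy as np

def svi(data, lam, local_step, intermediate_global, n_steps=10000, tau=1.0, kappa=0.7):
    """Sketch of stochastic variational inference for a global variational parameter lam."""
    N = len(data)
    for t in range(1, n_steps + 1):
        i = np.random.randint(N)                           # sample one data point
        phi_i = local_step(data[i], lam)                   # optimal local variational parameters given lam
        lam_hat = intermediate_global(data[i], phi_i, N)   # global update as if data[i] were replicated N times
        rho = (t + tau) ** (-kappa)                        # Robbins-Monro step size
        lam = (1 - rho) * lam + rho * lam_hat              # noisy natural-gradient step on the global parameter
    return lam
```

The step sizes \rho_t=(t+\tau)^{-\kappa} satisfy the Robbins-Monro conditions, which is what lets the noisy, sub-sampled gradient steps converge.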

Reference

[1] https://www.quora.com/Why-and-when-does-mean-field-variational-Bayes-underestimate-variance

[2] ICML 2018 Tutorial Session: Variational Bayes and Beyond: Bayesian Inference for Big Data: https://www.youtube.com/watch?time_continue=2081&v=DYRK0-_K2UU

[3] Variational Inference: A Review for Statisticians

[4] https://towardsdatascience.com/demystifying-kl-divergence-7ebe4317ee68

[5] https://www.zhihu.com/question/31032863

[6] https://blog.csdn.net/aws3217150/article/details/57072827

[7] http://krasserm.github.io/2018/04/03/variational-inference/

[8] NIPS 2016 Tutorial Variational Inference: Foundations and Modern Methods: https://www.youtube.com/watch?v=ogdv_6dbvVQ

[9] https://towardsdatascience.com/making-your-neural-network-say-i-dont-know-bayesian-nns-using-pyro-and-pytorch-b1c24e6ab8cd

[10] https://jwmi.github.io/BMS/chapter3-expfams-and-conjugacy.pdf

[11] https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/

[12] https://czxttkl.com/2019/05/09/gradient-and-natural-gradient-fisher-information-matrix-and-hessian/

[13] https://en.wikipedia.org/wiki/Jensen%27s_inequality

[14] https://github.com/cpark321/uncertainty-deep-learning/blob/master/05.%20Variational%20Inference%20(toy%20example).ipynb

Bayesian linear regression

Ordinary least squares (OLS) linear regression gives a point estimate of the weight vector w that solves \arg\min_w \left\Vert Y - Xw\right\Vert^2 for the model Y = Xw + \epsilon. If we assume normality of the errors, \epsilon \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I}), with a fixed point estimate of \sigma^2, we can also analyze confidence intervals and future predictions (see the discussion at the end of [2]). Instead of point estimates, Bayesian linear regression assumes w and \sigma^2 are random variables and learns the posterior distribution of w and \sigma^2 from data. In my view, Bayesian linear regression is a more flexible method because it supports incorporating prior knowledge about the parameters, and the posterior distributions it provides enable more uncertainty analysis and facilitate other tasks [3], for example Thompson sampling in contextual bandit problems, which we will cover in the future. However, it is not a panacea: it does not generally improve prediction accuracy if no informative prior is provided.

The foundation of Bayesian methods is Bayes' theorem, which states:

p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)} \propto p(D|\theta)p(\theta)

Specifically, in Bayesian linear regression, D represents observed data Y and X and \theta refers to w and \sigma^2. So the formula above can be rewritten as:

p(w, \sigma^2|Y, X) \propto p(Y|X, w, \sigma^2) p(w, \sigma^2)

The procedure for adopting Bayesian methods is to (1) define the likelihood and proper prior distributions of the parameters of interest; (2) calculate the posterior distribution according to Bayes' theorem; (3) use the posterior distribution for other tasks, such as predicting Y on new X.

The likelihood function of linear regression is:

p(Y|X, w, \sigma^2)=(2\pi\sigma^2)^{-n/2} exp\big\{-\frac{(Y-Xw)^T(Y-Xw)}{2\sigma^2}\big\}

which can be read as the probability of observing the data if we assume Y is normally distributed with mean Xw and covariance \sigma^2\boldsymbol{I}.

To get an analytical expression for the posterior distribution p(w,\sigma^2|Y, X), we usually require that the prior be conjugate to the likelihood. Here we can treat the likelihood p(Y|X, w, \sigma^2) as a two-dimensional exponential family (the concept is illustrated in chapter 4 of [6]), i.e., a distribution with respect to w and \sigma^2. Therefore, the conjugate prior p(w, \sigma^2) can be modeled as a Normal-inverse-Gamma (NIG) distribution:

p(w, \sigma^2) =p(w|\sigma^2)p(\sigma^2)=\mathcal{N}(w_0, \sigma^2 V_0) \cdot IG(\alpha_0, \beta_0)

The inverse gamma IG(\cdot) is also called the scaled inverse chi-squared distribution; the two only differ in parameterization [9]. We can denote p(w, \sigma^2) as NIG(w_0, V_0, \alpha_0, \beta_0). Note that we write p(w | \sigma^2), meaning that w and \sigma^2 are not independent. For one reason, if we modeled p(w, \sigma^2)=p(w) \cdot p(\sigma^2), we would not get a conjugate prior. Second, if you think of (X, Y) as generated from some process governed by w and \sigma^2, then w and \sigma^2 are dependent conditioned on (X, Y) (see section 3 in [8]).

Now, given that the prior is conjugate to the likelihood, the posterior distribution is also an NIG distribution:

p(w, \sigma^2|Y, X) = NIG(w_1, V_1, \alpha_1, \beta_1),

where:

w_1 = (V_0^{-1}+X^TX)^{-1}(V_0^{-1}w_0+X^T Y)

V_1 = (V_0^{-1}+X^TX)^{-1}

\alpha_1 = \alpha_0 + \frac{1}{2}n

\beta_1 = \beta_0 + \frac{1}{2}(w_0^T V_0^{-1}w_0 + Y^T Y - w_1^TV_1^{-1}w_1)
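The update equations above map directly to code. Here is a minimal numpy sketch of my own, following the four formulas just listed (the function name and signature are made up for illustration):

```python
import numpy as np

def nig_posterior(X, Y, w0, V0, alpha0, beta0):
    """Posterior NIG(w1, V1, alpha1, beta1) given prior NIG(w0, V0, alpha0, beta0)
    and data X (n x d), Y (n,)."""
    n = X.shape[0]
    V0_inv = np.linalg.inv(V0)
    V1_inv = V0_inv + X.T @ X
    V1 = np.linalg.inv(V1_inv)
    w1 = V1 @ (V0_inv @ w0 + X.T @ Y)
    alpha1 = alpha0 + 0.5 * n
    beta1 = beta0 + 0.5 * (w0 @ V0_inv @ w0 + Y @ Y - w1 @ V1_inv @ w1)
    return w1, V1, alpha1, beta1
```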

The posterior predictive distribution (for predicting new data) is:

p(Y_{new}|X_{new}, Y, X) = \int \int p(Y_{new}|X_{new}, w, \sigma^2) p(w, \sigma^2|Y, X) dwd\sigma^2

The result should be a Student-t distribution, but the derivation is quite involved; see section 6 in [10] and section 3 in [8]. I know from other online posts [11, 12, 13] that practical libraries don't compute the analytical form of the posterior distribution but rely on sampling techniques like MCMC. However, even when they have the posterior distribution, I don't know how they would implement the posterior predictive distribution. This is worth further investigation in the future.
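One pragmatic option, even without the analytical Student-t form, is a Monte Carlo approximation of the predictive integral: draw (w, \sigma^2) from the NIG posterior and then draw Y_{new} from the likelihood. A minimal sketch continuing the nig_posterior function above (again my own illustration, not a particular library's API):

```python
import numpy as np

def sample_posterior_predictive(X_new, w1, V1, alpha1, beta1, n_samples=2000, rng=None):
    """Monte Carlo draws from p(Y_new | X_new, Y, X) under the NIG posterior."""
    rng = rng if rng is not None else np.random.default_rng()
    draws = []
    for _ in range(n_samples):
        sigma2 = 1.0 / rng.gamma(alpha1, 1.0 / beta1)   # sigma^2 ~ IG(alpha1, beta1)
        w = rng.multivariate_normal(w1, sigma2 * V1)    # w | sigma^2 ~ N(w1, sigma^2 * V1)
        draws.append(rng.normal(X_new @ w, np.sqrt(sigma2)))
    return np.array(draws)   # use its mean/percentiles for the predictive mean/intervals
```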

Side notes

We touched upon Bayesian linear regression when introducing Bayesian Optimization [1]. [4] is also a good resource on Bayesian linear regression. However, [1] and [4] only assume w is a random variable while \sigma^2 is still a fixed point estimate. This post actually goes fully Bayesian by assuming both w and \sigma^2 are random variables whose joint distribution follows the so-called Normal-inverse-Gamma distribution. There aren't too many resources in the same vein though; what I've found so far are [5] and section 2 in [7].

Reference

[1] https://czxttkl.com/?p=3212

[2] https://czxttkl.com/2015/03/22/logistic-regression%e6%98%af%e5%a6%82%e4%bd%95%e5%bd%a2%e6%88%90%e7%9a%84%ef%bc%9f/logistic-regression%e6%98%af%e5%a6%82%e4%bd%95%e5%bd%a2%e6%88%90%e7%9a%84%ef%bc%9f

[3] https://wso2.com/blog/research/part-two-linear-regression

[4] http://fourier.eng.hmc.edu/e176/lectures/ch7/node16.html

[5] A Guide to Bayesian Inference for Regression Problems: https://www.ptb.de/emrp/fileadmin/documents/nmasatue/NEW04/Papers/BPGWP1.pdf

[6] Bolstad, W. M. (2010). Understanding computational Bayesian statistics (Vol. 644). John Wiley & Sons.

[7] Denison, D. G., Holmes, C. C., Mallick, B. K., & Smith, A. F. (2002). Bayesian methods for nonlinear classification and regression (Vol. 386). John Wiley & Sons.

[8] https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf

[9] https://en.wikipedia.org/wiki/Scaled_inverse_chi-squared_distribution

[10] Conjugate bayesian analysis of the gaussian distribution

[11] https://wso2.com/blog/research/part-two-linear-regression

[12] https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7

[13] https://towardsdatascience.com/bayesian-linear-regression-in-python-using-machine-learning-to-predict-student-grades-part-2-b72059a8ac7e

Counterfactual Policy Evaluation

Evaluating trained RL policies offline is extremely important in real-world production: a trained policy with unexpected behaviors or unsuccessful learning would cause online regressions, so the safe thing to do is to evaluate its performance on offline training data and decide whether to deploy based on that evaluation. Evaluating policies offline is an ongoing research topic called “counterfactual policy evaluation” (CPE): how would a policy perform when we only have trajectories that were generated by some other policy?

CPE for contextual bandit

We first look at CPE for contextual bandit problems. The contextual bandit problem is to take an action a at each state s, where s is drawn from a distribution \mathcal{D}. We can only observe the reward associated with that specific pair of s and a: r(s, a). Rewards associated with actions not taken, r(s,a'), a' \neq a, can't be observed. The next state, at which you would take the next action, is not affected by the current state or action. Essentially, contextual bandit problems are Markov Decision Problems whose horizon (the number of steps) equals one. Suppose we have a dataset \mathcal{S} with N samples, each sample being a tuple (s_i, a_i, r_i), i=1,2,\cdots,N. a_i is sampled from a behavior policy \pi_0(a|s). r_i is calculated based on an underlying but non-observable reward function, r_i=r(s_i, a_i). For now, we ignore any noise that could exist in reward collection. To simplify our calculation, we assume that the decisions of \pi_0 are independent across samples: this assumption bypasses the possibility that \pi_0 maintains an internal memory h, the history of observations used to facilitate its future decisions. We also assume that during data collection we can access the true probability \pi_0(\cdot|s), the true action propensity of the behavior policy. This is not difficult to achieve in practice because we usually have direct access to the current policy's model. We will call the policy that we want to evaluate the target policy, denoted \pi_1.

Given all these notations defined, the value function of \pi_1 is:

V^{\pi_1} = \mathbb{E}_{s \sim \mathcal{D}, a \sim \pi_1(\cdot|s)}[r(s,a)]

A straightforward way to estimate V^{\pi_1}, called the Direct Method (DM), is to train an approximate reward function \hat{r} and plug it into V^{\pi_1}:

V^{\pi_1}_{DM}=\frac{1}{N}\sum\limits_{s \in \mathcal{S}}\sum\limits_a \hat{r}(s,a)\pi_1(a|s)

The bias of \hat{r} directly determines the bias of V^{\pi_1}_{DM}. The problem is that if \hat{r} is only trained on \mathcal{S} generated by \pi_0, it might be possible that these samples do not sufficiently cover state/action pairs relevant to V^{\pi_1}, thus \hat{r} could be very biased and inaccurate.

The second method is called inverse propensity score (IPS). Its formulation is basically importance sampling on rewards:

V^{\pi_1}_{IPS} = \frac{1}{N} \sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}} \frac{r_i \pi_1(a_i|s_i)}{\pi_0(a_i|s_i)}

We can prove that V^{\pi_1}_{IPS} is an unbiased estimator of V^{\pi_1}. I get some ideas from [4]:

\mathbb{E}_{p_{\pi_0}(\mathcal{S})}[V^{\pi_1}_{IPS}] \newline (p_{\pi_0}(\mathcal{S}) \text{ is the joint distribution of all samples generated by }\pi_0) \newline = \frac{1}{N}\int p_{\pi_0}(\mathcal{S})\sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}} \frac{r_i \pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} d\mathcal{S} \newline \text{(since each sample is collected independently, we have:)}\newline=\frac{1}{N}\sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}}\int p_{\pi_0}(s_i, a_i)\frac{r(s_i, a_i)\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)}ds_i da_i \newline=\int p_{\pi_0}(s, a)\frac{r(s, a)\pi_1(a|s)}{\pi_0(a|s)}ds da \newline=\int p_\mathcal{D}(s) \pi_0(a|s)\frac{r(s, a)\pi_1(a|s)}{\pi_0(a|s)}dsda \newline=\int p_{\pi_1}(s,a)r(s,a) dsda\newline=V^{\pi_1}

There is one condition that must be satisfied for the proof to hold: if \pi_0(a|s)=0 then \pi_1(a|s)=0. First, we don't need to worry about \pi_0(a|s)=0 appearing in the denominator of V^{\pi_1}_{IPS}, because such samples would never be collected in \mathcal{S} in the first place. However, if \pi_0(a|s)=0 while \pi_1(a|s)>0 for some (s,a), there would never be data to evaluate the outcome of taking action a at state s, which means V^{\pi_1}_{IPS} becomes a biased estimator of V^{\pi_1}. (Part of these ideas comes from [7].)

The intuition behind V^{\pi_1}_{IPS} is that:

  1. If r_i is a large (good) reward, and if \pi_1(a_i|s_i) > \pi_0(a_i|s_i), then we guess that \pi_1 is probably better than \pi_0 because \pi_1 may give actions leading to good rewards larger probabilities than \pi_0.
  2. If r_i is a small (bad) reward, and if \pi_1(a_i|s_i) > \pi_0(a_i|s_i), then we guess that \pi_1 is probably worse than \pi_0 because \pi_1 may give actions leading to bad rewards larger probabilities than \pi_0.
  3. We can reverse the relationship such that if \pi_1(a_i|s_i) < \pi_0(a_i|s_i), we can guess \pi_1 is better than \pi_0 if r_i itself is a bad reward. \pi_1 probably is a better policy trying to avoid actions leading to bad rewards.
  4. If \pi_1(a_i|s_i) < \pi_0(a_i|s_i) and r_i is a good reward, \pi_0 is probably a better policy trying to take actions leading to good rewards.

We can also relate the intuition of V^{\pi_1}_{IPS} to Inverse Probability of Treatment Weighting (IPTW) in observational studies [5]. In observational studies, researchers want to determine the effect of certain treatment while the treatment and control selection of subjects is not totally randomized. IPTW is one kind of adjustment to observational data for measuring the treatment effect correctly. One intuitive explanation of IPTW is to weight the subject with how much information the subject could reveal [2]. For example, if a subject who has a low probability to receive treatment and has a high probability to stay in the control group happened to receive treatment, his treatment result should be very informative because he represents the characteristics mostly associated with the control group. In a similar way, we can consider the importance sampling ratio \frac{\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} for each r_i as an indicator of how much information that sample reveals. When the behavior policy \pi_0 neglects some action a_i that is instead emphasized by \pi_1 (i.e., \pi_0(a_i|s_i) < \pi_1(a_i|s_i)), the sample (s_i, a_i, r_i) has some valuable counter-factual information hence we need to “amplify” r_i.

The third method to estimate V^{\pi_1}, called Doubly Robust (DR) [1], combines the direct method and the IPS method:

V^{\pi_1}_{DR}=\frac{1}{N}\sum\limits_{(s_i, a_i, r_i) \in \mathcal{S}} \big[\frac{(r_i - \hat{r}(s_i, a_i))\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} +\int\hat{r}(s_i, a')\pi_1(a'|s_i) da'\big]

Given our assumption that we know the true action propensity \pi_0(\cdot|s), V^{\pi_1}_{DR} is still an unbiased estimator:

\mathbb{E}_{p_{\pi_0}(\mathcal{S})}[V^{\pi_1}_{DR}] \newline=\int p_{\pi_0}(s, a) \cdot \big[\frac{(r(s,a) - \hat{r}(s, a))\pi_1(a|s)}{\pi_0(a|s)} +\int \hat{r}(s, a')\pi_1(a'|s) da' \big]ds da \newline=\int p_{\pi_0}(s,a) \cdot\big[\frac{(r(s,a) - \hat{r}(s, a))\pi_1(a|s)}{\pi_0(a|s)} +\int(\hat{r}(s, a') - r(s,a'))\pi_1(a'|s)da' +\int r(s, a')\pi_1(a'|s)da' \big] dsda\newline=\int p_{\pi_1}(s,a) \cdot (-1+1)dsda + \int p_{\pi_1}(s,a) r(s,a) dsda\newline=V^{\pi_1}

However, in the original DR paper [1], the authors don't assume that the true action propensity can be accurately obtained. This means V^{\pi_1}_{DR} can be a biased estimator; however, if V^{\pi_1}_{DM} or V^{\pi_1}_{IPS} have biases within certain bounds, V^{\pi_1}_{DR} would have far lower bias than either of the other two.

Under the same assumption that the true action propensity is accessible, we could get the variances of DM, DR, and IPS CPE (Var[V^{\pi_1}_{DM}], Var[V^{\pi_1}_{DR}], and Var[V^{\pi_1}_{IPS}]) based on the formulas given in Theorem 2 in [1] while setting \delta=0 because we assume we know the true action propensity:


Under some reasonable assumptions according to [1], Var[V^{\pi_1}_{DR}] should be between Var[V^{\pi_1}_{DM}] and Var[V^{\pi_1}_{IPS}]. Therefore, V^{\pi_1}_{DR} should be overall favored because of its zero bias and intermediate variance.
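To make the three bandit estimators concrete, here is a minimal sketch of DM, IPS, and DR on logged data (my own illustration; samples is a list of (s, a, r) tuples logged under \pi_0, pi0 and pi1 are hypothetical callables returning action propensities, and r_hat is a hypothetical learned reward model):

```python
def bandit_cpe(samples, pi0, pi1, r_hat, actions):
    """Return the DM, IPS, and DR estimates of the target policy's value."""
    dm_total, ips_total, dr_total = 0.0, 0.0, 0.0
    for s, a, r in samples:
        dm_term = sum(r_hat(s, a_prime) * pi1(a_prime, s) for a_prime in actions)
        w = pi1(a, s) / pi0(a, s)                      # importance weight
        dm_total += dm_term
        ips_total += w * r
        dr_total += w * (r - r_hat(s, a)) + dm_term    # IPS on the residual + DM baseline
    n = len(samples)
    return dm_total / n, ips_total / n, dr_total / n
```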

CPE for MDP

We have finished introducing several CPE methods for contextual bandit problems. Now, we move our focus to CPE for Markov Decision Problems whose horizon is larger than 1. The value function of a policy is:

V^{\pi} = \mathbb{E}_{s_0 \sim \mathcal{D}, a_t \sim \pi(\cdot|s_t), s_{t+1}\sim P(\cdot|a_t, s_t)}[\sum\limits_{t=1}^H\gamma^{t-1}r_t],

where H is the number of steps in one trajectory. Simplifying the notation of this formula, we get:

V^{\pi} = \mathbb{E}_{\tau \sim (\pi, \mu)}[R(\tau)],

where \tau is a single trajectory, \mu denotes the transition dynamics, and R(\tau)=\sum\limits_{t=1}^H\gamma^{t-1}r_t is the cumulative discounted reward of the trajectory \tau.

Following the idea of IPS, we can have importance-sampling based CPE of \pi_1 based on trajectories generated by \pi_0:

V^{\pi_1}_{trajectory-IS}(\tau) = R(\tau) \prod\limits_{t=1}^H\frac{\pi_1(a_t|s_t)}{\pi_0(a_t|s_t)}

V^{\pi_1}_{per-decision-IS}(\tau)\newline=\sum\limits_{t=1}^H \gamma^{t-1} r_t \prod\limits_{j=1}^t \frac{\pi_1(a_j|s_j)}{\pi_0(a_j|s_j)}\newline=\sum\limits_{t=1}^H \gamma^{t-1} r_t \rho_{1:t}

V^{\pi_1}_{trajectory-IS} is an unbiased estimator of V^{\pi_1} given the same condition required for V^{\pi_1}_{IPS}: if \pi_0(a|s)=0 then \pi_1(a|s)=0. V^{\pi_1}_{per-decision-IS} is also an unbiased estimator of V^{\pi_1}, as proved in [6]. However, V^{\pi_1}_{per-decision-IS} has a lower upper bound than V^{\pi_1}_{trajectory-IS}; therefore, in practice V^{\pi_1}_{per-decision-IS} usually enjoys lower variance than V^{\pi_1}_{trajectory-IS} (compare chapters 3.5 and 3.6 in [9]).
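Here is a minimal sketch of the per-decision IS estimate for a single trajectory (my own illustration; pi0 and pi1 are hypothetical callables returning action propensities):

```python
def per_decision_is(trajectory, pi0, pi1, gamma=0.99):
    """trajectory: list of (s_t, a_t, r_t) logged under pi0.
    Returns sum_t gamma^{t-1} * r_t * rho_{1:t}."""
    rho, value = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        rho *= pi1(a, s) / pi0(a, s)       # running product rho_{1:t}
        value += (gamma ** t) * r * rho    # gamma^t here equals gamma^{t-1} in the text, where t starts at 1
    return value
```

Averaging this quantity over all logged trajectories gives the per-decision IS estimate of V^{\pi_1}; using only the full-trajectory ratio instead of the per-step running product recovers trajectory-IS.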

Following the idea of DR, [3] proposed an unbiased DR-based CPE for finite horizons, and later [8] extended it to the infinite-horizon case:

V^{\pi_1}_{DR}(\tau)=\sum\limits_{t=1}^\infty \gamma^{t-1}r_t\rho_{1:t} -\sum\limits_{t=1}^\infty \gamma^{t-1} \big( \rho_{1:t}\hat{q}^{\pi_1}(s_t, a_t)-\rho_{1:t-1}\hat{v}^{\pi_1}(s_t) \big),

where \hat{q}^{\pi_1} and \hat{v}^{\pi_1} are the estimated q-value function and state-value function of the target policy \pi_1. \rho_{1:t} = \prod\limits_{i=1}^t\frac{\pi_1(a_i|s_i)}{\pi_0(a_i|s_i)} and we define the corner case \rho_{1:0} = 1.

Based on Theorem 1 in [3], V^{\pi_1}_{DR} should have lower variance than V^{\pi_1}_{per-decision-IS}, after realizing that per-decision-IS CPE is a special case of DR CPE (with \hat{q} and \hat{v} set to zero).

Another CPE method also proposed by [6] is called Weighted Doubly Robust (WDR). It normalizes \rho_{1:t} in V^{\pi_1}_{DR}:

V^{\pi_1}_{WDR}(\tau)=\sum\limits_{t=1}^\infty \gamma^{t-1}r_t w_{1:t} -\sum\limits_{t=1}^\infty \gamma^{t-1} \big( w_{1:t}\hat{q}^{\pi_1}(s_t, a_t)-w_{1:t-1}\hat{v}^{\pi_1}(s_t) \big),

where w_{1:t} = \frac{\rho_{1:t}}{\sum\limits_\tau \rho_{1:t}}.
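The normalization itself is a one-liner. A small numpy sketch of my own (ratios is assumed to be an array of per-step propensity ratios \pi_1/\pi_0 for a batch of equal-length trajectories):

```python
import numpy as np

def wdr_weights(ratios):
    """ratios: array of shape [num_trajectories, H] with entries pi1(a_t|s_t) / pi0(a_t|s_t).
    Returns w_{1:t} = rho_{1:t} / (sum over trajectories of rho_{1:t})."""
    rho = np.cumprod(ratios, axis=1)               # rho_{1:t} for every trajectory and step
    return rho / rho.sum(axis=0, keepdims=True)    # normalize across trajectories at each step t
```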

V^{\pi_1}_{WDR} is a biased estimator, but since it normalizes the weights of V^{\pi_1}_{DR} into the range [0,1], WDR can have lower variance than DR when fewer samples are available. The reduction in variance produces a larger reduction in expected squared error than the additional error incurred by the bias (see chapter 3.8 of [9]). Furthermore, under reasonable assumptions WDR is a consistent estimator, so the bias becomes less of an issue as the sample size increases. (As a side note, an estimator can be any combination of biased/unbiased and inconsistent/consistent [10].)

Direct Method CPE on trajectories learns to reconstruct the MDP. This means DM needs to learn the transition function and reward function for simulating episodes. The biggest concern is whether the collected samples suffice to train a low-bias model. If they do, then DM can actually achieve very good performance. Section 6 of [8] lists a situation where DM can outperform DR and WDR: while DM has low variance and low bias (even no bias), stochasticity in rewards and state transitions produces high variance in DR and WDR.

The last CPE method is called Model and Guided Importance Sampling Combining (MAGIC) [8]. It is considered the best CPE method because it combines the advantages of both WDR and DM. The motivation given by [8] is that CPE methods involving importance sampling are generally consistent (i.e., as the number of samples increases, the estimate moves close to the true value) but still suffer from high variance; the model-based method (DM) has low variance but is biased. MAGIC combines the ideas of DM and WDR such that it can track the performance of whichever of the two is better. The MAGIC method is heavily math-involved; without a deep dive into it yet, we only show its pseudocode for now:


Below is a table concluding this post, showing whether each CPE method (for evaluating trajectories) is biased or unbiased, consistent or inconsistent, and how its variance ranks.

Method          | Biased/Unbiased | Consistent/Inconsistent | Variance rank
DM              | biased          | inconsistent            | 1
trajectory-IS   | unbiased        | consistent              | 5
per-decision-IS | unbiased        | consistent              | 4
DR              | unbiased        | consistent              | 3
WDR             | biased          | consistent              | 2
MAGIC           | biased          | consistent              | not sure, possibly between 1 and 2

(Table note: (1) variance rank 1 means the lowest variance, 5 the highest; (2) when generating this table, we use the reasonable assumptions mentioned above and used in the cited papers.)

[9] Thomas, P. S. (2015). Safe reinforcement learning (Doctoral dissertation, University of Massachusetts Libraries).

[10] https://stats.stackexchange.com/questions/31036/what-is-the-difference-between-a-consistent-estimator-and-an-unbiased-estimator

Resources about Attention is all you need

There are several online posts [1][2] that illustrate the idea of Transformer, the model introduced in the paper “attention is all you need” [4].

Based on [1] and [2], I am sharing a short tutorial for implementing Transformer [3]. In this tutorial, the task is “copy-paste”, i.e., to let a Transformer learn to output the same sequence as the input sequence. We denote the symbols used in the code as:

Name | Meaning
src_vocab_size=12 | The vocabulary size of the input sequence.
tgt_vocab_size=12 | The vocabulary size of the output sequence. Since our task is to copy the input sequence, tgt_vocab_size = src_vocab_size. We reserve symbol 0 for padding and symbol 1 for the decoder start symbol, and we assume possible real symbols are from 2 to 11. Therefore, the total vocabulary size is 12.
dim_model=512 | The dimension of the embeddings, of the self-attention output, and of the input/output of the position-wise feed-forward model.
dim_feedforward=256 | The size of the hidden layer in the position-wise feed-forward layer.
d_k=64 | The dimension of each attention head.
num_heads=8 | The number of attention heads. For the sake of computational convenience, d_k * num_heads should equal dim_model.
seq_len=7 | The length of input sequences.
batch_size=30 | Batch size, the number of input sequences per batch.
num_stacked_layers=6 | The number of layers stacked in both the Encoder and the Decoder.

Let's first look at the input/output data, which is generated by data_gen. Each Batch instance contains several essential pieces of information to be used in training:

  • src (batch_size x seq_len): the input sequence to the encoder
  • src_mask (batch_size x 1 x seq_len): the mask of the input sequence, so that each input symbol can control which other symbols it attends to
  • trg_y (batch_size x seq_len): the output sequence as the label, which is the same as src given that our task is simply “copy-paste”
  • trg (batch_size x seq_len): the input sequence to the decoder. The first symbol of each sequence is symbol 1, which kicks off the decoder; the last symbol of each sequence is the second-to-last symbol of the trg_y sequence.
  • trg_mask (batch_size x seq_len x seq_len): the mask of the output sequence, so that at each decoding step, a symbol being decoded can only attend to already decoded symbols. trg_mask[i, j, k], a binary value, indicates whether in the i-th output sequence the j-th symbol can attend to the k-th symbol. Since a symbol can attend to itself and all its preceding symbols, trg_mask[i,j,k]=1 for all k <= j (a sketch of this causal mask is given right after this list).
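A minimal sketch of the causal part of trg_mask (my own illustration, not the tutorial's exact code; the real trg_mask may also account for padding):

```python
import torch

def causal_mask(seq_len):
    """Lower-triangular mask: position j may attend to positions k <= j."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```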

Now, let's look at the EncoderLayer class. There are num_stacked_layers=6 encoder layers in one Encoder. The bottom layer takes the symbol embeddings as input, while each of the upper layers takes the previous layer's output. Both the symbol embeddings and each layer's output have the same dimension `batch_size x seq_len x dim_model`, so in theory you can stack as many layers as needed because their inputs/outputs are all compatible. The symbol embedding converts input symbols into embeddings, plus a positional encoding which helps the model learn positional information.

One EncoderLayer does the following things:

  • It has three matrices stored in MultiHeadedAttention.linears[:3], which we call query transformation matrix, key transformation matrix, and value transformation matrix. They transform the input from the last layer into three matrices which the self-attention module will use. Each of the output matrices, which we call query, key, and value, has the shape batch_size x num_heads x seq_len x d_k.
  • The transformed matrices are used as query, key, and value in self-attention (the attention function). The attention is calculated based on Eqn. 1 in [4] (a minimal sketch of this function is given right after this list):

The attention function also needs a mask as input. The mask tells it, for each symbol, which other symbols it can attend to (due to padding, or other reasons). In our case, the input sequences are all of equal length and we allow every symbol to attend to all other symbols, so the encoder mask is always filled with ones.

  • The attention outputs of the num_heads heads are concatenated and passed through a linear projection layer, which results in an output of shape `batch_size x seq_len x dim_model`
  • The output is finally processed by a residual connection and a layer normalization (LayerNorm).
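For reference, here is a minimal sketch of the scaled dot-product attention of Eqn. 1 in [4], i.e., softmax(QK^T/\sqrt{d_k})V (my own paraphrase of the standard formulation, not the tutorial's exact code):

```python
import math
import torch

def attention(query, key, value, mask=None):
    """query/key/value: (batch_size, num_heads, seq_len, d_k). mask: broadcastable to the score shape."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)   # (batch, heads, seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)           # disallowed positions get ~zero attention weight
    p_attn = scores.softmax(dim=-1)
    return p_attn @ value, p_attn
```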

So far, the encoder part can be illustrated in the following diagram [1] (suppose our input sequence has length 2, containing the symbols 2 and 3):

One DecoderLayer does the following things:

  • It first does self-attention on the target sequence (batch.trg, not the trg_y sequence), which generates an attention output of shape (batch_size, seq_len, dim_model). The first symbol of a target sequence is 1, a symbol that tells the decoder to start decoding.
  • The target sequence then performs attention over the encoder, i.e., it uses the target sequence's self-attention output as the query and the final encoder layer's output as the key and value.
  • The encoder-decoder attention's output is passed through a feed-forward layer, with layer norm and residual connections. The output of one DecoderLayer has the shape `batch_size x seq_len x dim_model`. Again, this shape allows you to stack as many layers as you want.

The decoder part is shown in the following diagram:

The last part is label fitting. The last DecoderLayer has an output of shape `batch_size x seq_len x dim_model`, i.e., each target symbol's attention output. We pass it through a linear projection and then a softmax such that each target position is associated with a probability distribution over the next symbol to decode. We fit this probability distribution against batch.trg_y by minimizing the KL divergence. Here is one example of label fitting:

In practice, we also apply label smoothing such that the target distribution to fit has a slight probability mass even on the non-target symbols. For example, the target distribution of position #1 (refer to the diagram on the left) is no longer (0, 0, 1, 0, 0); rather, it becomes (0.1, 0.1, 0.6, 0.1, 0.1) if you use 0.1 label smoothing. The intuition is that even our labels may not be 100% accurate, and we should not let the model become overly confident in the training data. In other words, label smoothing helps prevent overfitting and hopefully improves generalization on unseen data.
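A minimal sketch of building such a smoothed target distribution (my own illustration matching the (0.1, 0.1, 0.6, 0.1, 0.1) example above; the tutorial's LabelSmoothing module may use a slightly different convention):

```python
import torch

def smoothed_targets(labels, vocab_size, smoothing=0.1):
    """Put probability `smoothing` on every non-target symbol and the remaining mass on the target symbol."""
    dist = torch.full((labels.size(0), vocab_size), smoothing)
    dist.scatter_(1, labels.unsqueeze(1), 1.0 - smoothing * (vocab_size - 1))
    return dist

print(smoothed_targets(torch.tensor([2]), 5))   # tensor([[0.1000, 0.1000, 0.6000, 0.1000, 0.1000]])
```

This smoothed distribution is then fitted with a KL-divergence loss against the model's log-softmax output.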

[1] http://jalammar.github.io/illustrated-transformer/

[2] http://nlp.seas.harvard.edu/2018/04/03/attention.html

[3] https://github.com/czxttkl/Tutorials/blob/master/experiments/transformer/try_transformer.py

[4] https://arxiv.org/abs/1706.03762

Implementation notes for world model

I've recently been implementing the world model [1], which seems to be a promising algorithm for effectively learning controls after first learning the environment. Here I share some implementation notes.

Loss of Gaussian Mixture Model

The memory model of the world model is a Mixture-Density-Network Recurrent Neural Network (MDN-RNN). It takes the current state and action as inputs, and outputs predictions of the next state, the terminal flag, and the reward. Since the state transition might be stochastic, the authors choose to model the next-state output as a Gaussian mixture model: the next state can come from one of several multi-dimensional Gaussian components with certain probability assignments. The probability of the next state coming from component $latex i$ is $latex \pi_i$, and that component itself is a Gaussian distribution with mean $latex \mu_i \in \mathbb{R}^f$ and covariance $latex \Sigma_i \in \mathbb{R}^{f \times f}$, where $latex f$ is the feature size of a state.

To learn the parameters of the MDN-RNN that predict $latex \pi_i$, $latex \mu_i$, and $latex \Sigma_i$, we need a loss function that measures how well the observed next state fits the predicted Gaussian mixture model. The loss function is the negative log likelihood. Since we usually fit data in batches, and one batch contains sequences of consecutive states from the same episodes, the loss function is defined over batch_size and seq_len:

$latex - \sum\limits_{z=0}^{batch-size} \sum\limits_{s=0}^{seq-len} log (\sum\limits_i \pi_i \mathcal{N}(x^{z,s} | \mu^{z,s}_{i}, \Sigma^{z,s}_{i})) &s=2$

 

If we ignore the negative sign for now and only look at one state for a particular batch $latex z$ and sequence element $latex s$, we get:

$latex log(\sum\limits_i \pi_i \mathcal{N}(x | \mu_{i}, \Sigma_{i}))&s=2$

 

In practice, the MDN-RNN usually predicts $latex log(\pi_{i})$, and we can get $latex log(\mathcal{N}(x | \mu_{i}, \Sigma_{i}))$ directly from PyTorch's Normal.log_prob [3], so the formula above can be re-written as:

$latex log \sum\limits_i e^{log(\pi_i \mathcal{N}(x | \mu_{i}, \Sigma_{i}))} \newline =log\sum\limits_i e^{log(\pi_i) + log(\mathcal{N}(x | \mu_{i}, \Sigma_{i}))} &s=2$

 

According to the log-sum-exp trick [2], for numerical stability,

$latex log(\sum\limits_{i=1}^n e^{x_i}) = log(\sum\limits_{i=1}^n e^{x_i - c}) + c &s=2$

where $latex c=max_i x_i$ is usually picked.
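Putting the pieces together, here is a minimal PyTorch sketch of the GMM negative log likelihood using torch.logsumexp for stability (my own sketch, assuming diagonal covariances; the tensor shapes are assumptions for illustration):

```python
import torch
from torch.distributions import Normal

def gmm_nll(log_pi, mu, sigma, x):
    """log_pi: (batch, seq, K) mixture log-weights log(pi_i);
    mu, sigma: (batch, seq, K, f) per-component means and standard deviations;
    x: (batch, seq, f) observed next states."""
    # log N(x | mu_i, Sigma_i) per component, summing over the f feature dims (diagonal covariance)
    log_prob = Normal(mu, sigma).log_prob(x.unsqueeze(-2)).sum(dim=-1)   # (batch, seq, K)
    # stable log sum_i exp(log pi_i + log N(x | mu_i, Sigma_i))
    log_mix = torch.logsumexp(log_pi + log_prob, dim=-1)                 # (batch, seq)
    return -log_mix.sum()   # the double sum over batch and sequence in the loss above
```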

The takeaway is that we walked through how the GMM loss function is defined and how the log-sum-exp trick is adopted when calculating it.

 

 

References

[1] https://worldmodels.github.io/

[2] https://blog.feedly.com/tricks-of-the-trade-logsumexp/

[3] https://pytorch.org/docs/stable/distributions.html#torch.distributions.normal.Normal.log_prob

My understanding in 401K

Here is my reasoning about 401K.

First, I'll start with two definitions: (1) taxable income, meaning the gross income you receive on which your tax will be calculated; (2) tax deduction, meaning any deduction from your taxable income. Tax deductions lower your taxable income and thus generally lower your tax.

401K has three categories:

Pre-tax: contribute your taxable income directly to the 401K account, without paying tax at the time of contribution. Hence, pre-tax 401K is a form of tax deduction. You can only withdraw after age 59.5, and every dollar you withdraw (i.e., the money you contributed plus its growth) will be taxed based on the tax bracket you are in at the time of withdrawal (i.e., when you are older and probably don't have a high income).

Roth: contribute your after-tax income to the 401K account [1]. You can only withdraw after age 59.5, but you don't pay any tax at the time of withdrawal (including on the part you earned).

After-tax: contribute your after-tax income to a 401K account separate from Roth or pre-tax. However, you still need to pay tax on the earnings when withdrawing: “the earnings on after-tax contributions, just like all distributions from pre-tax contributions, are taxed as ordinary income” [4]. What's more, there is a 10% early withdrawal penalty on the earnings if you withdraw before age 59.5. The early withdrawal penalty and the tax on the earnings are collected proportionally [5]: if your total after-tax contribution has accumulated x% earnings, then you have to pay tax and penalty on x% of whatever amount of money you withdraw before age 59.5.

There is another age rule on 401K accounts: you can defer withdrawing from your 401K if you are still working at an older age. But once you are over 70, you need to start withdrawing at least the Required Minimum Distribution (RMD) amount each year [7].

The annual combined limit for 401K pre-tax + Roth is $19,000 as of 2019. However, there is another limit [3] that restricts you from contributing more than $56,000 per year across all kinds of retirement plans. That means, if you fully contribute $19,000 to pre-tax + Roth 401K, then your after-tax 401K contribution plus the employer match should not exceed $56,000 - $19,000 = $37,000. There is another limit, not for 401K but for IRA (individual retirement account), which is $6,000.

The benefit of after-tax 401K contributions isn't apparent. One underlying purpose of saving in the after-tax account is to bypass the $19,000 limit for pre-tax + Roth. You can roll over your money from after-tax 401K to Roth 401K and then to Roth IRA, another type of retirement account which is very similar to Roth 401K in the sense that anything going into it will be tax-free. At the time of the rollover from after-tax 401K to Roth 401K, any earnings in the after-tax account are treated as ordinary income and thus become taxable; but after the rollover, the whole amount of money as well as all future earnings will be tax-free.

In many situations, you need some triggering event (such as retirement or changing companies) to be able to roll over from after-tax 401K to Roth IRA. Only a few companies offer the opportunity to roll over from after-tax to Roth IRA directly without any triggering event. My company (FB) supports rolling over from after-tax 401K to Roth 401K immediately after any contribution to after-tax 401K, meaning that I don't need to pay any tax on the rollover because no earnings have been generated before the rollover happens. And our company's employees can roll over from Roth 401K to Roth IRA at any time.

The withdrawal rules of Roth IRA are summarized as follows. Assume you only contribute to Roth IRA by converting from after-tax 401K. You can withdraw the converted money (the principal, not the earnings) without penalty after it has stayed in the Roth IRA for more than 5 years [8]. You can withdraw the earnings without penalty if you are over age 59.5, die, become disabled, or withdraw up to $10,000 for a first-home purchase, provided the first Roth contribution was made more than 5 years ago; otherwise you pay a 10% penalty on the earnings [6]. As discussed in [8], the earnings that the penalty applies to are the pre-rollover earnings (earned before you roll over from after-tax 401K to Roth 401K). So since my company allows rolling over from after-tax 401K to Roth 401K immediately, I don't need to pay any penalty at all even if I withdraw from my Roth IRA within 5 years of my Roth IRA contribution!

So it seems that if you have any money invested in after-tax 401K, it is better to roll it over to Roth IRA (if your company allows that during your employment), because (1) Roth IRA doesn't incur any tax, while after-tax 401K still incurs tax on earnings even after age 59.5, and (2) Roth IRA is more flexible for withdrawing the converted part of the money (you can withdraw even before age 59.5).


Now let’s talk about the appropriate distribution of 401K money. Here is a chart showing the suggested 401K amount you should save at each age group [11]:

In essence, a Vanguard target date fund comprises several funds that cover domestic/international stock markets and domestic/international bonds. For example, below is the decomposition of Vanguard Target Date 2060 as of 01/31/2020 (https://investor.vanguard.com/mutual-funds/profile/VTTSX):

From historical statistics, bonds have lower yields but also lower risk than stocks [10]. Also, bonds distribute dividends that are treated as ordinary income. So the three-fund theory [12] tells us that if we really have a lot of money to invest, we should put bonds in tax-advantaged accounts such as 401K, and put stock index funds in taxable accounts (because when holding them for a long time we only need to pay long-term capital gains tax rather than ordinary income tax).

 

Updated 2019-05-13

Another example on how to convert after-tax to roth: Roth In-Plan Brochure

 

Updated 2020-05-01

This list shows historical returns for Vanguard funds, which are the main funds in my 401K.

https://retirementplans.vanguard.com/VGApp/pe/faces/PubFundChart?site=nyu/6370

 

Updated 2020-09-05

Below is the expected growth of retirement funds under a realistic estimate for a middle-class income.

Retirement Planning Calculator (US Personal Finance @ FB)

References

[1] https://www.irs.gov/retirement-plans/roth-comparison-chart

[2] https://www.bogleheads.org/wiki/After-tax_401(k)

[3] https://www.goodfinancialcents.com/401k-contribution-limits/

[4] https://www.doughroller.net/retirement-planning/pros-and-cons-of-after-tax-401k-contributions/

[5] https://about401k.com/withdrawal/after-tax/

[6] https://www.schwab.com/public/schwab/investing/retirement_and_planning/understanding_iras/roth_ira/withdrawal_rules

[7] https://www.sensiblemoney.com/learn/at-what-age-should-i-start-making-401k-withdrawals/

[8] https://thefinancebuff.com/rollover-after-tax-401k-403b-access-before-59-1-2.html

[9] My company’s After-tax 401K to Roth 401K FAQ: PlanConversionFAQ

[10] https://personal.vanguard.com/us/insights/saving-investing/model-portfolio-allocations

[11] https://www.cnbc.com/2019/05/23/how-much-money-americans-have-in-their-401ks-at-every-age.html

[12] https://www.bogleheads.org/wiki/Three-fund_portfolio

DPG and DDPG

In this post, I am sharing my understanding of the Deterministic Policy Gradient algorithm (DPG) [1] and its deep-learning version, DDPG [2].

We introduced the policy gradient theorem in [3, 4]. Here, we briefly recap. The objective function of policy gradient methods is:

$latex J(\theta)=\sum\limits_{s \in S} d^\pi(s) V^\pi(s)=\sum\limits_{s \in S} d^\pi(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a), &s=2$

where $latex \pi$ represents $latex \pi_\theta$, $latex d^\pi(s)$ is the stationary distribution of Markov chain for $latex \pi$, $latex V^\pi(s)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s]$, and $latex Q^{\pi}(s,a)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s, A_t=a]$. $latex G_t$ is accumulated rewards since time step $latex t$: $latex G_t=\sum^\infty_{k=0}\gamma^k R_{t+k}$.

The policy gradient theorem proves that the gradient of $latex J(\theta)$ with respect to the policy parameters is:

$latex \nabla_\theta J(\theta) = \mathbb{E}_{\pi}[Q^\pi(s,a)\nabla_\theta ln \pi_\theta(a|s)] &s=2$

More specifically, the policy gradient theorem we are talking about is the stochastic policy gradient theorem, because at each state the policy outputs a stochastic action distribution $latex \pi(a|s)$.

However, the policy gradient theorem implies that the policy gradient must be calculated on-policy, as $latex \mathbb{E}_{\pi}:=\mathbb{E}_{s\sim d^{\pi}, a\sim \pi}$ means the state and action distributions are generated by following $latex \pi_\theta$.

If the training samples are generated by some other behavior policy $latex \beta$, then the objective function becomes:

$latex J(\theta)=\sum\limits_{s \in S} d^\beta(s) V^\pi(s)=\sum\limits_{s \in S} d^\beta(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a), &s=2$

and we must rely on importance sampling to calculate the policy gradient [7,8]:

$latex \nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^\beta}[\sum\limits_{a\in A}\nabla_\theta \pi(a|s) Q^\pi(s,a) + \pi(a|s)\nabla_\theta Q^\pi(s,a)] \newline \approx \mathbb{E}_{s\sim d^\beta}[\sum\limits_{a\in A}\nabla_\theta \pi(a|s) Q^\pi(s,a)] \newline =\mathbb{E}_{\beta}[\frac{\pi_\theta(a|s)}{\beta(a|s)}Q^\pi(s,a)\nabla_\theta ln \pi_\theta(a|s)] &s=2$

where $latex \mathbb{E}_{\beta}:=\mathbb{E}_{s\sim d^\beta, a\sim\beta}$

However, using the importance sampling ratio $latex \frac{\pi_\theta(a|s)}{\beta(a|s)}$ could cause some learning difficulties [5]:


What DPG proposes is to design a policy that outputs actions deterministically: $latex a = \mu_\theta(s)$. Then $latex J(\theta)$ only involves an integration (or sum) over the state distribution and gets rid of importance sampling, which potentially makes learning easier:

$latex J(\theta)=\sum\limits_{s \in S} d^\beta(s) V^\mu(s)=\sum\limits_{s \in S} d^\beta(s) Q^\mu(s, \mu_\theta(s)) &s=2$

Writing the above formula as an integral (rather than a discrete sum), we get:

$latex J(\theta)=\int\limits_S d^\beta(s) V^\mu(s) ds=\int\limits_{S} d^\beta(s) Q^\mu(s, \mu_\theta(s)) ds &s=2$

And the gradient of $latex J(\theta)$ is:

$latex \nabla_\theta J(\theta) = \int\limits_S d^\beta(s) \nabla_\theta Q^\mu(s, \mu_\theta(s)) ds \approx \mathbb{E}_\beta [\nabla_\theta \mu_\theta(s) \nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}] &s=2$

This immediately implies that $latex \theta$ can be updated off-policy: since we only need to sample $latex s \sim d^\beta(s)$, $latex \nabla_\theta J(\theta)$ can be calculated without knowing what action the behavior policy $latex \beta$ took (we only need to know what the trained policy would take, i.e., $latex a=\mu_\theta(s)$). We should also note that there is an approximation in the formula of $latex \nabla_\theta J(\theta)$, because the chain rule cannot be applied exactly to $latex \nabla_\theta Q^\mu(s,\mu_\theta(s))$: even if you replace $latex \mu_\theta(s)$ with $latex a$, the $latex Q$ function itself still depends on $latex \theta$.

Updating $latex \theta$ is one part of DPG. The other part is fitting a Q-network with parameters $latex \phi$ to the Q-function of the policy governed by $latex \theta$. This equates to updating $latex \phi$ according to the mean-squared Bellman error (MSBE) [10], the same as in Q-learning.

I haven't read the DDPG paper [2] thoroughly, but based on my rough understanding, it adds several tricks for dealing with large inputs (such as images). According to the summary in [9], DDPG introduces three tricks:

  1. add batch normalization to normalize “every dimension across samples in one minibatch”
  2. add noise in the output of DPG such that the algorithm can explore
  3. soft update target network
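Putting the two DPG updates and the target-network trick together, here is a minimal PyTorch-style sketch of one update step (my own sketch, assuming actor, critic, and their target copies are nn.Module-like callables; this is not the code from [1] or [2]):

```python
import torch
import torch.nn.functional as F

def ddpg_losses(batch, actor, critic, target_actor, target_critic, gamma=0.99):
    """Return (actor_loss, critic_loss) for one minibatch of off-policy transitions."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # MSBE target: r + gamma * Q'(s', mu'(s')) on non-terminal transitions
        target_q = r + gamma * (1 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)   # fit Q_phi by the mean-squared Bellman error
    actor_loss = -critic(s, actor(s)).mean()           # ascend grad_theta Q(s, mu_theta(s))
    return actor_loss, critic_loss
```

After each update step, the target networks would be softly updated (trick 3), e.g., \theta' \leftarrow \tau\theta + (1-\tau)\theta'.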

References

[1] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014, June). Deterministic policy gradient algorithms. In ICML.

[2] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., … & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

[3] Notes on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”: https://czxttkl.com/?p=3497

[4] Policy Gradient: https://czxttkl.com/?p=2812

[5] https://repository.tudelft.nl/islandora/object/uuid:682a56ed-8e21-4b70-af11-0e8e9e298fa2?collection=education

[6] https://stackoverflow.com/questions/42763293/what-is-the-advantage-of-deterministic-policy-gradient-over-stochastic-policy-gr

[7] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#off-policy-policy-gradient

[8] Degris, T., White, M., & Sutton, R. S. (2012). Off-policy actor-critic. arXiv preprint arXiv:1205.4839.

[9] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#ddpg

[10] https://spinningup.openai.com/en/latest/algorithms/ddpg.html#the-q-learning-side-of-ddpg