shadowsocks + SwitchyOmega

We’ve introduced one way to proxy internet traffic: https://czxttkl.com/?p=1265

Now we introduce another way to set up a proxy, using shadowsocks + SwitchyOmega (a Chrome extension).

Ubuntu:

  1. in a terminal: 
    sudo add-apt-repository ppa:hzwhuang/ss-qt5
    sudo apt-get update
    sudo apt-get install shadowsocks-qt5
  2. open the installed shadowsocks client and configure a new connection [screenshot]
  3. install chrome extension SwitchyOmega: https://www.dropbox.com/s/i5xmrh4wv1fivg7/SwitchyOmega_Chromium.crx?dl=0
  4. configure SwitchyOmega to use the local proxy [screenshot]

 

Android

  1. install Shadowsocks apk: https://www.dropbox.com/s/cijsnynpvamizvs/com.github.shadowsocks-4.3.3-varies-sdk19-vc199-APK4Fun.com.apk?dl=0
  2. configure shadowsocks similarly to the desktop setup above

Reinforcement Learning in Web Products

Reinforcement learning (RL) is an area of machine learning concerned with optimizing a notion of cumulative reward. Although it has been applied in video game AI, robotics, and control optimization for years, we have seen less of its presence in web products. In this post, I am going to introduce some works that apply RL in web products. In a more general sense, this post will show how web products can benefit from a view of sequential decision making. Indeed, the order of news articles read, songs listened to, or videos watched all affects the online user experience. Thus, how to model and optimize sequential user experience becomes more and more crucial.

RL / Monte Carlo Tree Search on Song List Recommendation

In music services, a song list is a sequence of songs curated for listeners. This is a case in point where sequential decision making can make a difference. The song-list recommendation work spans two papers [3, 4]. [3] details how to extract features of single songs and song transitions, and how to learn a parameterized linear reward function. Here, the observed reward is a human listener’s like/dislike, fitted as a linear combination of the reward function’s weights and the song and transition features. After learning the reward function, [3] uses heuristic-based planning to recommend song lists. [3] shows in simulation and real-world evaluations that its recommended song lists can result in higher long-term rewards for listeners (either synthetic or real humans).

What I don’t understand well in [3] is Algorithm 3, in which the reward function’s weights are updated according to observed rewards. Note that this is different from function-approximation-based TD learning, which I know better, where weights are updated to estimate state or state-action values. Algorithm 3 looks like inverse reinforcement learning, but the authors do not mention that. It may just be a simple linear regression fitting procedure based on gradient descent.

[4] improves the planning with a more advanced search algorithm, Monte Carlo Tree Search (MCTS). This planning takes place after the reward function is fully known, i.e., the weights have been learned already. The goal of planning is to generate a song list achieving the highest accumulated rewards. [4] uses a specific version of MCTS called $latex MaxMCTS(\lambda)$, which employs a more sophisticated Q-value backpropagation strategy. Using MCTS is pretty straightforward here, since one cannot exhaust all possible song lists in a limited time.
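
To make the planning step concrete, here is a toy sketch (my own code, not from [4]) of a flat, one-level MCTS that builds a playlist slot by slot: each candidate next song is scored by UCB over random rollouts of the remaining slots, under a known reward function. $latex MaxMCTS(\lambda)$’s smarter Q-value backpropagation is omitted.

```python
import math
import random

def mcts_playlist(songs, reward, horizon, n_sim=500, c=1.4):
    """Greedily build a playlist of `horizon` songs, choosing each slot by a
    one-level UCT search. `reward(prev, song)` is the (assumed known) learned
    reward of playing `song` right after `prev` (`prev` is None at the start)."""
    playlist, prev = [], None
    for slot in range(horizon):
        stats = {s: [0, 0.0] for s in songs}  # song -> [visits, total return]
        for _ in range(n_sim):
            total = sum(n for n, _ in stats.values()) + 1
            def ucb(s):
                n, w = stats[s]
                return float("inf") if n == 0 else w / n + c * math.sqrt(math.log(total) / n)
            s0 = max(songs, key=ucb)              # select the next song to simulate
            ret, p = reward(prev, s0), s0
            for _ in range(horizon - slot - 1):   # random rollout of remaining slots
                nxt = random.choice(songs)
                ret += reward(p, nxt)
                p = nxt
            stats[s0][0] += 1
            stats[s0][1] += ret
        best = max(songs, key=lambda s: stats[s][0])  # commit to most-visited song
        playlist.append(best)
        prev = best
    return playlist
```

With thousands of candidate songs per slot, this kind of sampled search is what makes planning feasible when exhausting all possible song lists is not.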

Q-Learning on Recommendation

Applying Q-learning to recommendation is sometimes straightforward. You define user features as states, things to recommend as actions, and metrics you want to improve as rewards. [8] is one example in this vein, where actions are news articles and rewards are clicks + user activeness. It uses Dueling Bandit Gradient Descent for exploration, so that exploration of news articles can focus more on relevant ones. It also proposes to divide the Q-value function into a value function $latex V(s)$ and an advantage function $latex A(s,a)$, much like dueling network architectures.
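
To make the dueling decomposition concrete, here is a tiny numeric sketch (my own toy code with made-up numbers, not [8]’s actual network): the state-value and advantage streams are combined into Q-values, with the mean advantage subtracted so the decomposition stays identifiable.

```python
import numpy as np

def dueling_q(v, a):
    # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a'); subtracting the mean keeps
    # the V/A split identifiable (shifting A by a constant no longer changes Q).
    return v + a - a.mean()

# made-up state value and advantages over three candidate articles
q = dueling_q(2.0, np.array([0.5, -0.5, 0.0]))
```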

Multi-Agent RL on Recommendation

[11] works on a holistic way to do recommendation across the multiple stages of a user’s interaction with a website. For example, the website needs to recommend diverse items on the entrance page. Once the user clicks on an item of interest, they enter an item detail page, and the website needs to recommend more related items. After the user buys an item, the website needs to recommend more items post-purchase to incentivize future purchases. Traditionally, each stage has one dedicated model optimizing a stage-wise reward. But you can utilize information across stages and make a decision at each stage such that it benefits all future stages. That’s the motivation behind [11], though to be honest I am not sure how much room for improvement there is in real applications compared to how much complexity is introduced.

Here are some technical details about [11]. Its high-level model is DDPG, an actor-critic architecture. The actor is a policy network that outputs the action based on the state. The state is a low-dimensional, dense vector that embeds the user history using a memory-based network such as an RNN. The critic is a value network that outputs state-action values, with the state embedded the same way as in the actor. Note that there are multiple actors, one for each stage of recommendation, but only one critic, which outputs a global state-action value. Training the critic requires defining a target value, usually of the form $latex y=R+\gamma Q(s’,\pi(s’))$. The authors argue that fitting this target directly would need a huge sample size, and that decomposing $latex y$ into several parts may make the fitting easier. (Personally, I am not sure whether this is true.) They propose to decompose $latex y$ into the expectation of state-action values over the probability distribution of each action, where the probability of each action can be estimated by a supervised learning model.
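
As a numeric sketch of the decomposed target (made-up numbers; in [11] both the critic and the action-probability model are learned networks):

```python
import numpy as np

# One transition (s, a, r, s') with three candidate actions at s':
r, gamma = 1.0, 0.9
q_next = np.array([3.0, 1.0, 2.0])   # critic's Q(s', a') for each candidate action
p_next = np.array([0.5, 0.2, 0.3])   # P(a'|s') from a supervised model

# Decomposed target: an expectation of Q(s', .) under the action distribution,
# instead of evaluating Q only at the single action picked by the actor.
y = r + gamma * np.dot(p_next, q_next)
```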

Ads

There are two classic papers on applying RL to ads, particularly bidding. The state space is real-time auction/campaign information, while the action is the next bid price to set. [14], published earlier, defines the full MDP (transition and reward) so that the optimal policy can be learned by dynamic programming. [13] uses DQN to learn the Q-function, where the action space is a parameter controlling which bidding model to use rather than the direct price to bid. I really admire the care put into designing the MDP in [13]. For example, in Figure 2 of [13], the authors found that bidding patterns are similar day by day, indicating that they should treat one day as one episode and collect <state, action> pairs per hour.

Ranking

There has been increasing interest in using reinforcement learning for list-wise ranking. The idea is that traditional ranking usually acts like a greedy policy and doesn’t consider the interaction between ranked items. For example, if a user likes soccer, all soccer-related posts are predicted to have high click-through rates and thus placed higher, even though the user could get fatigued by seeing posts of the same category consecutively. It is likely that showing posts of interweaving categories, such as soccer and entertainment, could foster a healthier cadence of reading and make the user more satisfied.

[7] proposes an algorithm for the following setting: “a user is displayed a page of k items and she provides feedback by clicking on one or none of these items, then the system recommends a new page of k items”. The first part of the paper proposes using a Generative Adversarial Network (GAN) to train a user model that learns which item a user would click given a page and what their satisfaction upon clicking would be. In my opinion, this part could be done using other standard ML models. The user model is then used as a simulator to provide data for RL training. In the second part, [7] proposes cascade Q-learning (CDQN). Suppose the total number of items is $latex \mathcal{I}$. There are $latex P(\mathcal{I}, k)$ permutations that can form a page of k items (the order of the k items within a page also matters), yet CDQN only needs $latex \mathcal{O}(k\mathcal{I})$ argmax calls. My understanding of CDQN is that it creates $latex k$ intermediate sub-states between $latex s_t$, the state when the user is about to receive a slate, and $latex s_{t+1}$, the state when the user is about to receive the next slate. You can then learn to estimate Q-values at each of the k sub-states, under the assumption that the optimal Q-value at each sub-state equals the optimal Q-value at the last sub-state. I think the $latex \mathcal{O}(k\mathcal{I})$ complexity is still too computationally expensive for real products, and I’d like to see an $latex \mathcal{O}(k)$ or $latex \mathcal{O}(l)$ algorithm.

Another work, SlateQ [9], works in the same setting but uses a different way to learn Q-values and construct slates. The paper makes two reasonable assumptions: (1) SC: a user consumes one or zero items from each slate; (2) the reward and transition functions only depend on the consumed item. With these two assumptions, $latex Q^*(s, A) = \sum_{i\in A} P(i|s,A)Q^*(s,i)$, meaning that the optimal Q-value of a state $latex s$ and a slate $latex A$ equals the expected optimal “decomposed” Q-value of $latex s$ and an item $latex i$, where the expectation is taken over the user choice probability. $latex Q^*(s,i)$ can be learned through schemes very similar to standard SARSA and Q-learning (see Eqn. (19) and (20)), and $latex P(i|s,A)$ is usually available from the CTR prediction models that most commercial recommendation systems already have. Using decomposed Q-values simplifies the learning a lot, but one thing remains to be resolved: in Q-learning, you need to compute the $latex max_a$ operator (the best of all possible next slates); in the policy improvement phase of SARSA, you need to come up with a better slate. The brute-force method of enumerating all possible slates is a combinatorial problem. But since we know $latex P(i|s,A)$ and $latex Q^*(s,i)$, the authors propose Linear Programming for exact optimization, or Greedy/Top-K for approximate optimization. SlateQ has been tested on YouTube and showed a 1% improvement in engagement.
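
A minimal sketch of the decomposition and a Top-K-style slate heuristic (my simplification; the paper’s exact optimization uses an LP, and its choice probabilities are slate-dependent):

```python
import numpy as np

def slate_value(p, q):
    # SlateQ's decomposition: Q*(s, A) = sum_i P(i|s, A) * Q*(s, i)
    return np.dot(p, q)

def topk_slate(choice_scores, q, k):
    # Approximate slate construction: rank items by choice score times
    # item-wise Q-value and keep the top k (a Greedy/Top-K-style heuristic).
    return np.argsort(-choice_scores * q)[:k]
```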

Rather than using value-based models, [10] uses policy-gradient methods for slate recommendation. Its application context is very similar to CDQN and SlateQ: you need to recommend k items in a slate each time. The question is how to model the policy probability $latex \alpha(a|s)$, the probability of showing the observed clicked video $latex a$ in a slate given user state $latex s$. In Section 4.3, the authors assume that a slate is generated by sampling $latex K$ times from the softmax distribution over video values ($latex \pi(a|s)$ from Eqn. 5). Under this assumption, $latex \alpha(a|s)=1-(1-\pi(a|s))^K$. A policy-gradient method would ordinarily require the probability $latex \Pi(A|s)$ of the whole slate $latex A$; but since the probability of generating any slate that contains the observed clicked video $latex a$ differs from $latex \Pi(A|s)$ only by a scaling constant, we can just use $latex \alpha(a|s)$ in the policy gradient formula (Eqn. 6). They use REINFORCE to learn the policy. Since REINFORCE is an on-policy algorithm but their training data is off-policy, they introduce importance sampling into the REINFORCE update. Importance sampling itself brings more technical considerations: how to estimate the behavior policy’s propensities (Section 4.2), and how to reduce the variance of importance sampling (Section 4.4)? They also show that Boltzmann exploration can enable wiser exploration.
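
The $latex \alpha(a|s)$ formula and the resulting correction factor are simple enough to sketch (the multiplier $latex K(1-\pi)^{K-1}$ is $latex \partial\alpha/\partial\pi$, which is what rescales the per-item REINFORCE term in the paper’s top-K update):

```python
def alpha_topk(pi_a, K):
    # Probability that item a appears at least once when a slate is built by
    # sampling K times from the softmax policy: alpha(a|s) = 1 - (1 - pi(a|s))^K
    return 1.0 - (1.0 - pi_a) ** K

def topk_multiplier(pi_a, K):
    # d(alpha)/d(pi) = K * (1 - pi)^(K-1): the factor that turns the standard
    # REINFORCE gradient term into its top-K-corrected version.
    return K * (1.0 - pi_a) ** (K - 1)
```

Note that for K=1 the multiplier is 1, recovering vanilla REINFORCE, and it shrinks toward 0 as an item’s probability grows, damping updates for items already likely to be shown.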

Simulator

An appealing idea is to train a simulator on real data and then learn an RL model purely through interaction with the simulator. If the simulator truly reflects the characteristics of real data, RL models can be trained on virtual data alone. [12] is one example of this idea. It models two MDPs in a customer-search environment: the search engine makes sequential decisions following one MDP, and the customer follows another. The paper first uses a Generative Adversarial Network and Multi-agent Adversarial Imitation Learning to learn a reliable simulator of customer features and the inherent policy of customers. Then it uses TRPO to learn the search engine’s policy from simulated data. The paper claims that the learned policy resulted in real-world improvement.

 

Reference

[1] Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees

[2] A contextual-Bandit Approach to Personalized News Article Recommendation

[3] DJ-MC: A Reinforcement-Learning Agent for Music Playlist Recommendation

[4] Designing Better Playlists with Monte Carlo Tree Search

[5] http://info.ividence.com/deep-reinforcement-learning-advertising/

[6] https://www.youtube.com/watch?v=XiN9Hx3Y6TA

[7] Generative Adversarial User Model for Reinforcement Learning Based Recommendation System: https://arxiv.org/abs/1812.10613

[8] DRN: A Deep Reinforcement Learning Framework for News Recommendation

[9] Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology

[10] Top-K Off-Policy Correction for a REINFORCE Recommender System

[11] Model-Based Reinforcement Learning for Whole-Chain Recommendations

[12] Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning

[13] Deep Reinforcement Learning for Sponsored Search Real-time Bidding

[14] Real-Time Bidding by Reinforcement Learning in Display Advertising

LambdaMART

LambdaMART [7] is a learning-to-rank algorithm. It emphasizes fitting the correct order of a list, which contains all documents returned for a query, each marked with a relevance grade. It is a derivation/combination of RankNet, LambdaRank, and MART (Multiple Additive Regression Trees).

For me, the most helpful order in which to approach LambdaMART is: 1. MART; 2. RankNet; 3. LambdaRank; 4. LambdaMART. [4] introduces LambdaMART in the same order. (Sidenote: another good resource for learning MART is [8].)

1. MART

[Figure: a single regression tree partitioning the feature space]

Above is an illustration plot for one regression tree. If data of different labels actually come from disjoint feature spaces, then one regression tree would be enough. However, there are many situations where data do come from interleaved feature spaces. Hence, it is natural to propose using multiple regression trees to differentiate irregular feature subspaces and make predictions.

Now, suppose MART’s prediction function is F(x)=\sum\limits_{m=0}^{M} h(x;a_m), where each h(x;a_m) is one regression tree’s output and a_m are the parameters of that tree. Generally speaking, each regression tree has K terminal nodes. Hence a_m contains \{R_{km}\}_1^K, the K feature subspaces that partition the data, and \{\gamma_{km}\}_1^K, the predicted outputs for data belonging to \{R_{km}\}_1^K. h(x;a_m) can be represented as: h(x;a_m) = \sum\limits_{k=1}^K \gamma_{km} \mathbb{I}(x \in R_{km})

\{a_m\}_0^M are iteratively learned to fit to the training data:

  1. Set F_0(x) to an initial guess
  2. for each m=1,2,\cdots, M, learn the optimal feature space division \{R_{km}\}_1^K
  3. based on the learned \{R_{km}\}_1^K, the prediction value of each R_{km} (leaf) is deterministically fixed as: \gamma_{km}=argmin_{\gamma} \sum\limits_{x_i \in R_{km}} L(y_i, F_{m-1}(x_i) + \gamma)

Although [4] doesn’t provide details for learning \{R_{km}\}_1^K, I guess the procedure should be very similar to what is adopted in normal decision trees. One possible criterion is to minimize the total loss: \{R_{km}\}_1^K = argmin_{\{R_{km}\}_1^K} \sum\limits_{k=1}^K \sum\limits_{x_i \in R_{km}} L(y_i, F_{m-1}(x_i) + \gamma_{km}). (If using this criterion, \{R_{km}\}_1^K and \{\gamma_{km}\}_1^K are actually learned simultaneously.)

[4] provides an example to illustrate the iterative learning process. The dataset contains data points in 2 dimensions and labels \{1, 2, 3, 4\}. The loss function L(\cdot) is least square loss. Each regression tree is only a decision stump.

[Figure: the example dataset and its labels]

F_0(x)=h(x;a_0) is the simplest predictor that predicts a constant value minimizing the error for all training data. 

[Figure: F_0(x), the constant initial predictor]

After determining h(x;a_0), we start to build h(x;a_1). The cut v_1 is determined by some criterion not specified in [4]. Assuming that v_1 has already been obtained, we can then determine each leaf’s prediction output, i.e., \gamma_{k1}, k=1,2:

[Figure: the decision stump h(x;a_1) with cut v_1 and leaf outputs \gamma_{k1}]

After determining the first and second trees, F_1(x)=h(x;a_0)+h(x;a_1). In other words, for x<v_1, the predicted value will be 1.444; for x \geq v_1, the predicted value will be 3.625. 

[Figure: predictions of F_1(x)=h(x;a_0)+h(x;a_1)]

Next, determining h(x;a_2) is similar: find a cut v_2, then determine \gamma_{k2}, k=1,2 as in step 3.

[Figure: the decision stump h(x;a_2) with cut v_2]

Interestingly, when we fit h(x;a_2), it looks as if we fit a tree with each data point’s label updated to y_i - F_1(x_i); that is, we fit the residual that the previous trees were unable to predict. Below is the plot showing each data point’s original label and its updated label before fitting h(x;a_2).

[Figure: original labels vs. updated residual labels before fitting h(x;a_2)]

In fact, the fitting process described above is based on the assumption that L is a square loss function. When fitting the m-th tree, we are fitting the tree to the residual y - F_{m-1}(x). Interestingly, y - F_{m-1}(x) = - \frac{\partial L(y, F_{m-1}(x))}{\partial F_{m-1}(x)}. That’s why MART is also called Gradient Boosted Decision Trees: each tree fits the (negative) gradient of the loss.

To be more general and mathematical, when we learn \{R_{km}\}_1^K and \{\gamma_{km}\}_1^K in the m-th iteration, our objective function is:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m))

If we take the Taylor series of the objective function up to order 2 around 0, we get [10]:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m)) \newline =\sum\limits_i L(y_i,F_{m-1}(x_i)) + L'(y_i,F_{m-1}(x_i)) h(x_i;a_m) +\frac{1}{2} L''(y_i,F_{m-1}(x_i)) h(x_i;a_m)^2 

Each time, the new regression tree (a_m = \{R_{km}\}_1^K and \{\gamma_{km}\}_1^K) should minimize this objective function, which is a quadratic function with regard to h(x_i;a_m). As long as we know \frac{\partial L}{\partial F_{m-1}} and \frac{\partial^2 L}{\partial F_{m-1}^2}, we can learn the best a_m to minimize the objective function.

Now, back to the previous example, if L is a square loss function L(y, \hat{y})=\frac{1}{2}(y - \hat{y})^2, then L'(y_i,F_{m-1}(x_i))=-(y_i-F_{m-1}(x_i)) and L''(y_i,F_{m-1}(x_i)) = 1. Thus, the objective function becomes:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m)) \newline = \sum\limits_i -(y_i-F_{m-1}(x_i)) h(x_i;a_m) + \frac{1}{2} h(x_i;a_m)^2 + constant

This is no different from min \;\; \sum\limits_i L(y_i - F_{m-1}(x_i), h(x_i;a_m)). That’s why, in the fitting process above, it looks like we fit each new tree to the residual between the real label and the previous trees’ output. From an equivalent perspective, we fit each new tree to the gradient - \frac{\partial L(y, F_{m-1}(x))}{\partial F_{m-1}(x)}.
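
The whole residual-fitting loop can be sketched in a few lines (a toy implementation with decision stumps and square loss on 1-D data, in the spirit of the example above; not a production GBDT):

```python
import numpy as np

def fit_stump(x, r):
    # Find the cut v and leaf outputs (gamma_1, gamma_2) minimizing squared
    # error against the residuals r.
    best = None
    xs = np.unique(x)
    for i in range(len(xs) - 1):
        v = (xs[i] + xs[i + 1]) / 2.0
        gl, gr = r[x < v].mean(), r[x >= v].mean()
        sse = ((r[x < v] - gl) ** 2).sum() + ((r[x >= v] - gr) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, v, gl, gr)
    _, v, gl, gr = best
    return lambda z: np.where(z < v, gl, gr)

def mart(x, y, M=300):
    f0 = y.mean()                       # F_0: the constant minimizing square loss
    F, trees = np.full_like(y, f0), []
    for _ in range(M):
        r = y - F                       # residual = negative gradient of square loss
        h = fit_stump(x, r)             # fit the next tree to the residual
        trees.append(h)
        F = F + h(x)
    return lambda z: f0 + sum(h(z) for h in trees)
```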

The loss function L can take many forms, making MART a general supervised learning model. When MART is applied to classification problems, the loss function for each data point (x_i, y_i) is L(y_i, p_i)=- [y_i \log (p_i) + (1-y_i) \log (1-p_i)], where p_i is the predicted probability p_i=\Pr(y_i=1|x_i). The regression trees’ output is used to fit the logit of p_i, i.e.,

p_i=\frac{1}{1+\exp{(-F_{m-1}(x_i)-h(x_i;a_m))}}

If we express everything in terms of P_i=logit(p_i), then we have:

P_i = logit(p_i) = \log (\frac{p_i}{1-p_i}) = F_{m-1}(x_i)+ h(x_i;a_m) (See [11])

L(y_i, P_i) = - y_i P_i + \log (1 + \exp(P_i)) (See [12])

After this formulation, it is easy to calculate L'(y_i,F_{m-1}(x_i)) = \frac{\partial L}{\partial P_i} \cdot \frac{\partial P_i}{\partial F_{m-1}(x_i)} = \frac{\partial L}{\partial P_i} and L''(y_i,F_{m-1}(x_i)) = \frac{\partial^2 L}{\partial P_i^2}, since \frac{\partial P_i}{\partial F_{m-1}(x_i)} = 1.
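
In code, the two derivatives used to fit each new tree are one-liners (a sketch; y is the 0/1 label and F the current raw score):

```python
import numpy as np

def logloss_grad_hess(y, F):
    # L(y, P) = -y*P + log(1 + exp(P)), with P = F since dP/dF = 1
    p = 1.0 / (1.0 + np.exp(-F))   # predicted probability
    grad = p - y                   # dL/dP
    hess = p * (1.0 - p)           # d^2 L / dP^2
    return grad, hess
```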

2. RankNet

We have finished introducing MART. Now let’s talk about RankNet and LambdaMART. From now on, we follow the notation of [7]:

[Table: notation used in [7]]

(Table from [9])

After reading the first three chapters of [7], you will probably have arrived at this plot:

[Figure: |\lambda_i| assigned by RankNet (black arrows) vs. the desired gradients for top-weighted metrics (red arrows)]

The plot illustrates the motivation for the invention of LambdaRank on top of RankNet. In RankNet, the cost function is an easily defined, differentiable function:

C=\frac{1}{2}(1-S_{ij})\sigma(s_i-s_j)+\log(1+e^{-\sigma(s_i - s_j)})

If S_{ij}=1, C=\log(1+e^{-\sigma(s_i - s_j)}); if S_{ij}=-1, C=\log(1+e^{-\sigma(s_j-s_i)}). The intuition behind this cost function is that the more consistent the relative order of s_i and s_j is with S_{ij}, the smaller C will be.

Since the scores s_i and s_j come from a score function with parameters w, the cost function C is a function of w too. Our goal is to adjust w to minimize C. Therefore, we arrive at the gradient of C with regard to w (using just one parameter w_k as an example):

\frac{\partial C}{\partial w_k}=\lambda_{ij}(\frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k}), where \lambda_{ij} \equiv \sigma (\frac{1}{2}(1-S_{ij})-\frac{1}{1+e^{\sigma(s_i-s_j)}}). (This is from Eqn. 3 in [7].)

\lambda_{ij} is derived from a pair of URLs U_i and U_j. For each document, we can aggregate the pairwise terms into \lambda_i by taking into account all its interactions with other documents:

 \lambda_i = \sum\limits_{j:\{i,j\}\in I} \lambda_{ij} - \sum\limits_{j:\{j,i\}\in I} \lambda_{ij}

Based on the derivation in Section 2.1 of [7], the update rule for the score function’s parameters w can eventually be expressed in terms of \lambda_i:

w_k \rightarrow w_k - \eta \frac{\partial C}{\partial w_k} = w_k - \eta\sum\limits_i \lambda_i \frac{\partial s_i}{\partial w_k} 

\lambda_i can be seen as a force (from document U_i) driving w_k up or down. Here is how one can understand the effect of \lambda_i through a simple but wordy example. Suppose document U_i has only one pair U_i \triangleright U_j (S_{ij}=1). Assume that in one particular update step we have \frac{\partial s_i}{\partial w_k}>0, i.e., an increase of w_k causes s_i to increase when everything else is fixed. Then \lambda_i=\lambda_{ij}=-\frac{1}{1+e^{(s_i-s_j)}}<0, assuming \sigma=1. From the update rule w_k \rightarrow w_k - \eta\sum\limits_i \lambda_i \frac{\partial s_i}{\partial w_k}, we immediately see that w_k will increase further so as to increase s_i. Nevertheless, the magnitude of \lambda_i depends on whether s_i > s_j or s_j > s_i. If s_i > s_j, i.e., the currently calculated scores have a relative order consistent with the label, then -0.5 < \lambda_i < 0. On the other hand, if s_i < s_j, i.e., the current scores are inconsistent with the label, |\lambda_i| becomes relatively large (\approx 1).
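
The aggregation of pairwise \lambda_{ij} into per-document \lambda_i can be sketched as follows (my toy code; `pairs` lists (i, j) with U_i \triangleright U_j, i.e., S_{ij}=1):

```python
import numpy as np

def ranknet_lambdas(scores, pairs, sigma=1.0):
    # For S_ij = 1: lambda_ij = -sigma / (1 + exp(sigma * (s_i - s_j))).
    # lambda_i sums lambda_ij over pairs where i is the more relevant document
    # and subtracts it over pairs where i is the less relevant one.
    lam = np.zeros(len(scores))
    for i, j in pairs:
        l_ij = -sigma / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
        lam[i] += l_ij   # pushes s_i up (negative lambda)
        lam[j] -= l_ij   # pushes s_j down (positive lambda)
    return lam
```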

3. LambdaMART

The problem with RankNet is that |\lambda_i| does not always correlate with the force we want to use to drag a document to its ideal place. On the right of the example plot, the black arrows denote the |\lambda_i| assigned to the blue documents. However, if we want metrics that emphasize the top few results (e.g., NDCG/ERR), we want |\lambda_i| to correlate with the red arrows. Unfortunately, for many metrics like NDCG and ERR it is not easy to come up with a well-formed cost function C. Recall RankNet’s cost function: as long as s_i and s_j are consistent with S_{ij}, C will be low regardless of the absolute positions at which U_i and U_j appear. The fortunate part is that we do not need to know C to train the model: all we need is \lambda_{ij}, and we can design our own \lambda_{ij} for those ill-formed cost functions. Thus, the core idea of LambdaRank is to define \lambda_{ij} as:

\lambda_{ij} = \frac{\partial C(s_i- s_j)}{\partial s_i} = \frac{-\sigma}{1+e^{\sigma(s_i - s_j)}}|\Delta Z_{ij}|

where \Delta Z_{ij} is the change in whichever ranking measure you use (e.g., NDCG or ERR) after swapping U_i and U_j. This swap takes place in the list of all documents sorted by their current scores, with all other documents’ positions kept fixed.
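
As an example of |\Delta Z_{ij}| for (unnormalized) DCG, here is a sketch that computes the swap delta and the resulting LambdaRank \lambda_{ij} (the NDCG normalizer is omitted for brevity; ranks are 1-based):

```python
import numpy as np

def delta_dcg(labels, ranks, i, j):
    # |change in DCG| if documents i and j swap positions, all else fixed
    gain = lambda l: 2.0 ** l - 1.0
    disc = lambda r: 1.0 / np.log2(r + 1.0)
    before = gain(labels[i]) * disc(ranks[i]) + gain(labels[j]) * disc(ranks[j])
    after = gain(labels[i]) * disc(ranks[j]) + gain(labels[j]) * disc(ranks[i])
    return abs(after - before)

def lambdarank_lambda_ij(s_i, s_j, dz, sigma=1.0):
    # the RankNet lambda scaled by the metric swap delta |dz|
    return -sigma / (1.0 + np.exp(sigma * (s_i - s_j))) * dz
```

Because the discount shrinks with rank, a swap involving a top position produces a large |\Delta Z_{ij}|, which is exactly how the red-arrow behavior is obtained.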

Now, if the model we are training has parameters that form a differentiable score function, we can apply the update rule to those parameters as in <2. RankNet>: w_k \rightarrow w_k - \eta \frac{\partial C}{\partial w_k} = w_k - \eta \sum\limits_i \lambda_i \frac{\partial s_i}{\partial w_k}

However, if we are going to train MART, then we need to train it according to its iterative process, which only requires knowing \lambda_{ij}. This finally leads to the LambdaMART algorithm (see the algorithm listing in [7]).

Note that in LambdaMART we usually define \lambda_i in such a way that fitting it is equivalent to maximizing some utility function C. Therefore, the Newton step used to calculate \gamma_{lk} is actually doing gradient ascent. Recall from the <1. MART> section that the objective each MART tree optimizes is:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m)) \newline=\sum\limits_i L(y_i,F_{m-1}(x_i)) + L'(y_i,F_{m-1}(x_i)) h(x_i;a_m) +\frac{1}{2} L''(y_i,F_{m-1}(x_i)) h(x_i;a_m)^2

In the LambdaMART algorithm, y_i corresponds to L'(y_i,F_{m-1}(x_i)) and w_i corresponds to L''(y_i,F_{m-1}(x_i)).

 

Reference

[1] http://www.cnblogs.com/wowarsenal/p/3900359.html

[2] https://www.quora.com/search?q=lambdamart

[3] https://liam0205.me/2016/07/10/a-not-so-simple-introduction-to-lambdamart/

[4]  http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/materials/Schamoni_boosteddecisiontrees.pdf

[5] https://www.slideshare.net/Julian.Qian/learning-to-rank-an-introduction-to-lambdamart

[6] https://en.wikipedia.org/wiki/Newton%27s_method

[7] https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/

[8] https://www.youtube.com/watch?v=IXZKgIsZRm0

[9] https://www.slideshare.net/Julian.Qian/learning-to-rank-an-introduction-to-lambdamart

[10] http://xgboost.readthedocs.io/en/latest/model.html

[11] https://stats.stackexchange.com/questions/157870/scikit-binomial-deviance-loss-function

[12] https://stats.stackexchange.com/questions/204154/classification-with-gradient-boosting-how-to-keep-the-prediction-in-0-1

[13] http://dasonmo.cn/2019/02/08/from-ranknet-to-lambda-mart/

Update 2020/02/22:

This reference contains an example of computing \lambda_i using NDCG metric:

[14] http://dasonmo.cn/2019/02/08/from-ranknet-to-lambda-mart/

Questions on Guided Policy Search

I’ve been reading Prof. Sergey Levine’s paper on Guided Policy Search (GPS) [2]. I do not yet understand it well, but I want to keep a record of my questions so that I can look back and resolve them in the future.

Based on my understanding, traditional policy search (e.g., REINFORCE) maximizes expected reward via likelihood-ratio gradient estimates. This is usually achieved by collecting some samples, taking derivatives of the likelihood-ratio objective w.r.t. the policy parameters, updating the policy parameters, and then collecting new samples again. The shortcomings of this procedure are: (1) it is sample-inefficient, because samples become off-policy and are discarded as soon as the policy parameters are updated; (2) when the real situation is complex, it is hard to navigate to the globally optimal parameters in the parameter space. Local optima are often reached, and poor parameters can guide robots into risky trajectories.
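
For reference, the likelihood-ratio trick this paragraph describes can be sketched on a toy problem (my own example: a two-armed sigmoid policy, with the immediate reward standing in for a trajectory return):

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(theta, n=5000):
    # Policy: pi(1) = sigmoid(theta), pi(0) = 1 - pi(1); reward is 1 for arm 1.
    # Likelihood-ratio estimate: grad E[R] ~ mean of R(a) * d/dtheta log pi(a).
    p1 = 1.0 / (1.0 + np.exp(-theta))
    grads = []
    for _ in range(n):
        a = rng.random() < p1
        r = 1.0 if a else 0.0
        glog = (1.0 - p1) if a else -p1   # d/dtheta log pi(a|theta)
        grads.append(r * glog)
    return float(np.mean(grads))
```

At theta = 0 the true gradient is p1(1-p1) = 0.25; the estimate matches it in expectation, but only using samples drawn from the current policy, which is exactly the sample-inefficiency issue above.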

Therefore, it would be ideal if we could use demonstration trajectories to initialize the policy. Moreover, the trajectories we have already tried could be kept, instead of discarded, to help improve the policy parameters. These are the fundamental motivations of GPS.

What I am not sure about is how exactly iLQR works. Also, regarding Algorithm 1, line 1 from [2]: why are there many DDP solutions ($latex \pi_{\mathcal{G}_1}, \cdots,\pi_{\mathcal{G}_n}$) generated? Does that mean iLQR has many different results when initialized differently? Is iLQR only used in the first line?

It seems that GPS only deals with known dynamics and reward functions. When the dynamics are not known, we should look at [3] or Continuous Deep Q-Learning [4, 5].

Reference

[1] http://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf (Importance sampling tutorial)

[2] https://graphics.stanford.edu/projects/gpspaper/gps_full.pdf

[3] https://people.eecs.berkeley.edu/~svlevine/papers/mfcgps.pdf

[4] https://zhuanlan.zhihu.com/p/21609472

[5] https://arxiv.org/abs/1603.00748

[6] http://blog.csdn.net/sunbibei/article/details/51485661 (Notes on GPS written in Chinese)

[7] https://www.youtube.com/watch?v=eKaYnXQUb2g (Levine’s video. In the first half hour he talked about GPS)

 

Relationships between DP, RL, Prioritized Sweeping, Prioritized Experience Replay, etc

Over the last weekend, I struggled with many concepts in Reinforcement Learning (RL) and Dynamic Programming (DP). In this post, I collect some of my thoughts about DP, RL, Prioritized Sweeping, and Prioritized Experience Replay. Please also refer to a previous post written when I first learned RL.

Let’s first introduce a Markov Decision Process $latex \mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R)$, in which $latex \mathcal{S}$ is a set of states, $latex \mathcal{A}$ is a set of actions, $latex T(s, a, s’)$ is the transition function, and $latex R(s,a,s’)$ is the reward function. The value function of a state under a specific policy $latex \pi$, $latex V^\pi(s)$, is defined as the accumulated reward from now on: $latex V^\pi(s)=\mathbb{E}(\sum\limits_{i=0}^{\infty}\gamma^i r_i | s_0 = s, \pi) &s=2$, where $latex 0 < \gamma < 1$ is a reward discount factor. $latex V^\pi(s)$ satisfies the Bellman equation: $latex V^\pi(s)=\mathbb{E}_{s’ \sim T(s,a,\cdot)}\{R(s,a,s’)+\gamma V^\pi(s’)\} &s=2$, where $latex a=\pi(s)$. The optimal value function is defined as $latex V^*(s) = max_\pi V^\pi(s)$, and it also satisfies a Bellman equation: $latex V^*(s) = max_a \mathbb{E}_{s’ \sim T(s,a,\cdot)}\{R(s,a,s’) + \gamma V^*(s’)\} &s=2$. Another type of function is the Q-function, defined as the accumulated reward for state-action pairs (rather than for states themselves). The formulas of $latex Q^\pi(s,a)$ and $latex Q^*(s,a)$ are omitted here but can be found in many RL materials, such as [2].

DP refers to a collection of algorithms for finding the optimal policy $latex \pi^*: \pi^*(s) = argmax_a \sum_{s’}T(s,a,s’) [R(s,a,s’) + \gamma V^*(s’)] &s=2$ for $latex \forall s \in \mathcal{S} &s=2$ when the transition function $latex T(s,a,s’)$ and reward function $latex R(s,a,s’)$ are known. Therefore, DP algorithms are also called model-based algorithms: you assume you know, or have a model to predict, $latex T(s,a,s’)$ and $latex R(s,a,s’)$, and during policy search you use the two functions explicitly.

The simplest DP algorithm is policy iteration, a synchronous DP method.

[Figure: policy iteration pseudo-code]

Here, we first evaluate the value function under the current policy (policy evaluation). Then we update the current policy based on the estimated value function (policy improvement). The two processes alternate until the policy becomes stable. In synchronous DP methods, states are treated equally and swept one by one.
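
The alternation just described can be sketched on a tabular MDP (my toy code; T[s, a, s'] and R[s, a, s'] are assumed known, matching the model-based setting):

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9, tol=1e-8):
    # T[s, a, s']: transition probabilities; R[s, a, s']: rewards.
    n_s, n_a, _ = T.shape
    pi = np.zeros(n_s, dtype=int)
    while True:
        # policy evaluation: iterate the Bellman expectation backup to V^pi
        V = np.zeros(n_s)
        while True:
            V_new = np.array([np.sum(T[s, pi[s]] * (R[s, pi[s]] + gamma * V))
                              for s in range(n_s)])
            done = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if done:
                break
        # policy improvement: act greedily with respect to the evaluated V
        Q = np.einsum('san,san->sa', T, R + gamma * V[None, None, :])
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```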

The drawback of policy iteration is that you need to evaluate the value function over all states for each intermediate policy, which imposes a large computational burden when the state space is large. A more efficient variant is value iteration, which interleaves a single sweep of policy evaluation with policy improvement and repeats until the optimal policy is found: 

Screenshot from 2017-08-14 14-34-46
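To make the procedure concrete, here is a minimal value-iteration sketch on a hypothetical two-state, two-action MDP (the transition tensor `T` and reward tensor `R` below are made up purely for illustration):

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP (states 0, 1; actions 0, 1).
# T[s, a, s'] = transition probability, R[s, a, s'] = reward.
T = np.zeros((2, 2, 2))
T[0, 0] = [0.9, 0.1]   # in state 0, action 0 mostly stays
T[0, 1] = [0.1, 0.9]   # action 1 mostly moves to state 1
T[1, 0] = [0.8, 0.2]
T[1, 1] = [0.0, 1.0]
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0       # landing in state 1 yields reward 1

gamma = 0.9
V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(s) = max_a sum_s' T(s,a,s')[R(s,a,s') + gamma V(s')]
    Q = (T * (R + gamma * V)).sum(axis=2)   # shape (|S|, |A|)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

pi = Q.argmax(axis=1)   # greedy policy w.r.t. the (near-)optimal V
```

Since both actions in this toy MDP push toward the rewarding state 1, the greedy policy ends up choosing action 1 everywhere, and $latex V^*(1) = 1/(1-\gamma) = 10$.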

Even in value iteration, you still need to sweep over the whole space of states. To further accelerate that, people propose asynchronous DP: “These algorithms back up the values of states in any order whatsoever, using whatever values of other states happen to be available. The values of some states may be backed up several times before the values of others are backed up once. … Some states may not need their values backed up as often as others. We might even try to skip backing up some states entirely if they are not relevant to optimal behavior. ” (from [1])

Prioritized sweeping is one asynchronous DP method [3]:

Screenshot from 2017-08-14 14-49-19

The goal of prioritized sweeping is to learn $latex V^*(s)$. Its basic idea is to make some updates more urgent than others. The urgent updates are those with a large Bellman error between the current estimates, $latex err(s)=|V(s) - \max_a Q(s,a)|$. The pseudo-code, as given in the original paper [4]:

Screenshot from 2017-08-14 15-18-28

Something I am not sure about is what policy is used during learning. I think the policy is always the greedy policy $latex \pi^*: \pi^*(s) = argmax_a \sum_{s'}T(s,a,s') [R(s,a,s') + \gamma V^*(s')] &s=2$, with some exploration stimulus (see the original paper [4] for the parameters $latex r_{opt}$ and $latex T_{bored}$).

[1] gives a slightly different prioritized sweeping that focuses on the Bellman error of the Q-function rather than the V-function. (Note that this version of prioritized sweeping is still a DP, model-based algorithm, because you need a transition function to know which $latex \bar{S}$, $latex \bar{A}$ lead to $latex S$, and a reward function to predict the reward $latex \bar{R}$ for $latex \bar{S}, \bar{A}, S$.)

Screenshot from 2017-08-14 15-07-20 

Now, all the methods above are DP methods, which assume you know $latex T(s,a,s')$ and $latex R(s,a,s')$ or have a model to approximate the two functions.

RL methods are often called model-free methods because they don't require you to know $latex T(s,a,s')$ and $latex R(s,a,s')$. $latex TD(0)$ (or more generally $latex TD(\lambda)$) is a family of algorithms that evaluate the value function of a specific policy in an online fashion:

Screenshot from 2017-08-14 15-25-31

Note that in the iterative update formula $latex V(S) \leftarrow V(S)+\alpha [R + \gamma V(S') - V(S)]$, neither the reward function nor the transition function appears, in contrast to DP-based methods.
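A minimal sketch of tabular $latex TD(0)$ evaluating a fixed "always move right" policy on an invented toy chain (the environment is simulated so we can generate transitions, but the update itself touches no model):

```python
import numpy as np

gamma, alpha = 0.9, 0.1
n_states = 5                       # chain: states 0..4 are non-terminal, state 5 is terminal
V = np.zeros(n_states + 1)         # V[5] is the terminal state and stays 0

for episode in range(2000):
    s = 0
    while s < n_states:
        s_next = s + 1                             # the fixed "always move right" policy
        r = 1.0 if s_next == n_states else 0.0     # reward 1 only on reaching the terminal state
        # TD(0): V(S) <- V(S) + alpha [R + gamma V(S') - V(S)]; only sampled
        # transitions are used, never T(s,a,s') or R(s,a,s') themselves
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
```

For this deterministic chain the estimates converge to the true discounted values $latex V(s) = \gamma^{4-s}$, e.g. $latex V(0) = 0.9^4 \approx 0.656$.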

$latex TD(0)$ (or, more generally, $latex TD(\lambda)$) only solves the prediction problem, i.e., estimating the value function of a specific policy $latex \pi$. Therefore, in theory you can't use it alone to derive a better policy, i.e., to solve a control problem. However, people still use it to find a good policy, provided that a near-optimal policy is used when evaluating value functions and that you additionally know the reward/transition functions. This is a common scenario for games with predictable outcomes for each move, such as board games. During learning, one applies a near-optimal policy (e.g., $latex \epsilon$-greedy) to evaluate the values of game states. In real games after learning, the player evaluates the resulting state of each available move and picks the move with the highest sum of immediate reward and resulting state value: $latex \pi(s) = argmax_a \sum_{s'}T(s,a,s') [R(s,a,s') + \gamma V(s')] &s=2$. Of course, you are then using the value function of the near-optimal policy to select actions for the greedy policy. The value functions of two different policies may not totally agree with each other, but in practice you often end up with a good outcome. (I am not 100% confident about this paragraph, but please also refer to [5] and a student report [9].)

The counterparts of policy iteration and value iteration among model-free RL algorithms are SARSA and Q-learning [2]; I am omitting their details. Prioritized Experience Replay [8] bears a similar idea to prioritized sweeping but applies in model-free contexts. 

Screenshot from 2017-08-14 15-59-51

You can see that neither a reward function nor a transition function is needed in Prioritized Experience Replay.
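For completeness, here is a tabular Q-learning sketch on a hypothetical chain world; the backup uses only sampled transitions, and, being off-policy, it learns the greedy policy even while behaving uniformly at random:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.5
n = 5                                # states 0..4; reaching state 4 ends the episode
Q = np.zeros((n, 2))                 # action 0 = left, action 1 = right

def step(s, a):
    # hypothetical toy dynamics: right moves +1, left moves -1 (floor at 0);
    # reward 1 only when the terminal state 4 is reached
    s2 = s + 1 if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n - 1 else 0.0), s2 == n - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        a = int(rng.integers(2))     # uniformly random behavior policy (off-policy learning)
        s2, r, done = step(s, a)
        # Q-learning backup: max over next actions; no T or R model is needed
        target = r if done else r + gamma * Q[s2].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2

greedy_policy = Q.argmax(axis=1)     # "always move right" in states 0..3
```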

 

Reference

[1] Richard S. Sutton’s book: Reinforcement Learning: An Introduction

[2] Reinforcement learning and dynamic programming using function approximators

[3] http://www.teach.cs.toronto.edu/~csc2542h/fall/material/csc2542f16_MDP2.pdf

[4] Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time

[5] Self-Play and Using an Expert to Learn to Play backgammon with temporal difference learning

[6] https://stackoverflow.com/questions/34181056/q-learning-vs-temporal-difference-vs-model-based-reinforced-learning

[7] https://stackoverflow.com/questions/45231492/how-to-choose-action-in-td0-learning

[8] Prioritized Experience Replay

[9] Reinforcement Learning in Tic-Tac-Toe Game and Its Similar Variations

Time Series Walk Through

In this post, I am going to give a practical walkthrough of time series analysis. All related code is available in a python notebook.

The data we use is International airline passengers: monthly totals in thousands, which can be downloaded here as a csv file (on the left panel, select Export->csv(,)). It is a univariate time series.

Screenshot from 2017-07-26 20:49:08

Intuitively, we can see patterns here: (1) there exists an upward trend; (2) there exists seasonal fluctuation at regular intervals. However, this time series is not stationary.  

[3] gives an excellent explanation for the meaning of a stationary time series:

Screenshot from 2017-07-26 20:46:35

To put it mathematically, a stationary time series is:

  1. $latex E(Y_1) = E(Y_2) = E(Y_3) = \cdots = E(Y_t) = \text{constant}$
  2. $latex Var(Y_1) = Var(Y_2) = Var(Y_3) = \cdots = Var(Y_t)= \text{constant}$
  3. $latex Cov(Y_1, Y_{1+k}) = Cov(Y_2, Y_{2+k}) = Cov(Y_3, Y_{3+k}) = \cdots = \gamma_k$ (depending only on lag $latex k$)
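These conditions can be illustrated numerically: below, a synthetic white-noise series satisfies them while a random walk built from it violates the constant-variance condition (the series are simulated for illustration, not the airline data):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, 2000)    # white noise: stationary by construction
walk = np.cumsum(noise)               # random walk: Var(Y_t) grows with t, so not stationary

# crude illustration of condition 2: compare the variance of the two halves
noise_halves = np.split(noise, 2)
noise_ratio = noise_halves[1].var() / noise_halves[0].var()   # close to 1 for white noise
walk_spread = walk.var() / noise.var()   # the walk's overall variance dwarfs the noise's
```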

We can also compare a stationary time series with white noise:

Let $latex \{\epsilon_t\}$ denote such a series; then it has zero mean $latex E(\epsilon_t)=0$, constant variance $latex Var(\epsilon_t)=\sigma^2$ for $latex \forall t$, and $latex E(\epsilon_t \epsilon_s)=0$ for $latex \forall t \neq s$. 

The typical steps to perform time series analysis are as follows:

  1. check whether it is a white noise time series. If it is, just stop, because there is no way that you can predict noise. (In practice, however, it is hard to determine whether a time series is white noise just by observing with eyes. So, my suggestion is to go ahead assuming this is a valid time series, and use cross validation to check if there exists any model that can successfully predict the time series.)
  2. if this is a valid but non-stationary time series, make it stationary
  3. build predictive models on the stationary time series

Here is my understanding of why we need to make a time series stationary before training a model. Suppose your time series varies much more in a specific time range. If we build a linear regression model based on mean squared error loss, the errors from that time range will dominate the loss and bias the weight adjustment toward that range. However, we want a predictive model that predicts the time series equally well at any time point. 

We can easily detect whether a time series is stationary using a helper function `test_stationarity` from [3] and [4], which wraps the Dickey-Fuller test. The rule of thumb is that if the 'Test Statistic' is smaller (more negative) than the 'Critical Value', or the 'p-value' is well below a convincing threshold (say 0.05), then the time series is stationary. 

Based on the raw time series, we use `test_stationarity` and find it is not stationary:

Screenshot from 2017-07-26 21:43:24

Techniques you can use to make a time series stationary include: (1) logarithmic transformation; (2) first differencing; (3) seasonal differencing. See [4] and [5] for more techniques to make a time series stationary.
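As a sketch of these transforms in numpy (using, for illustration, the first two years of the airline series typed in by hand):

```python
import numpy as np

# first 24 monthly values of the airline passenger series (1949-1950)
y = np.array([112., 118., 132., 129., 121., 135., 148., 148., 136., 119., 104., 118.,
              115., 126., 141., 135., 125., 149., 170., 170., 158., 133., 114., 140.])

log_y = np.log(y)                            # (1) logarithm: tames the growing variance
first_diff = np.diff(log_y)                  # (2) first difference: removes the trend
seasonal_diff = log_y[12:] - log_y[:-12]     # (3) seasonal difference at lag 12
```

Note that each difference shortens the series (by 1 and 12 points respectively), and predictions made on the differenced scale must later be inverted (cumulative sum, then `np.exp`) to get back to passenger counts.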

ARIMA

The autoregressive integrated moving average (ARIMA) model assumes each time step $latex Y_t$ is a linear function of previous time steps $latex Y_{t-1}, Y_{t-2},\cdots$ and error terms $latex \epsilon_{t}, \epsilon_{t-1}, \epsilon_{t-2}, \cdots$. An ARIMA(p, d, q) model first applies a $latex d$-th order difference to the series to obtain $latex \{Y_{t}\}$ and then assumes:

$latex Y_t =\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p} + \epsilon_t+ \Phi_{1} \epsilon_{t-1} + \Phi_{2} \epsilon_{t-2} + \cdots + \Phi_{q} \epsilon_{t-q} &s=2$

Here, the error terms $latex \{\epsilon_t\}$ are not residuals from a separately fitted autoregressive (AR) model. Instead, $latex \{\epsilon_t\}$ are estimated jointly from the data when fitting the full model [11].
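As a toy illustration of the AR part, we can simulate an AR(1) process and recover its coefficient by least squares (a hypothetical simulation, not the airline data):

```python
import numpy as np

rng = np.random.default_rng(0)
phi, n = 0.7, 5000
eps = rng.normal(0.0, 1.0, n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]     # AR(1): Y_t = 0.7 Y_{t-1} + eps_t

# least-squares estimate of phi from the (Y_{t-1}, Y_t) pairs
phi_hat = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
```

With 5000 points the estimate lands very close to the true $latex \phi = 0.7$; real packages fit the full ARIMA likelihood instead, but the intuition is the same.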

 

To determine the moving-average lag parameter $latex q$, we can use autocorrelation_plot from pandas.plotting, which is the same thing as the Autocorrelation Function (ACF) from statsmodels:

Autocorrelation plots are often used for checking randomness in a time series, by computing autocorrelations of data values at varying time lags. If the time series is random, the autocorrelations should be near zero at all lags; if it is non-random, one or more autocorrelations will be significantly non-zero. Therefore, the autocorrelation plot is a quick check of whether your time series is suitable for forecasting with machine learning models; after all, if your time series is just white noise, forecasting it is meaningless. The horizontal lines displayed in the plot correspond to the 95% and 99% confidence bands; the dashed line is the 99% band.

Note that the ACF from statsmodels provides a parameter `unbiased`; if True, the autocovariance estimates are adjusted for the number of available lagged pairs at each lag. [14]

To determine the autoregressive lag parameter $latex p$, we can use the Partial Autocorrelation Function (PACF) in statsmodels.

The rules to derive right $latex p$ and $latex q$ can be found from [12].

Screenshot from 2017-07-27 12:53:54

[4] points out that $latex p$ should be the value where the PACF chart crosses the upper confidence interval for the first time, and $latex q$ should be the value where the ACF chart crosses the upper confidence interval for the first time. (Also see [15].)

In my opinion, deriving $latex p$ and $latex q$ by eyeballing ACF and PACF plots is an ad-hoc, vague method. I'd prefer to grid search over the parameters as in [13] and then pick the best model according to AIC or a similar metric. 

From [13]: the AIC (Akaike Information Criterion) is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the model's overall complexity. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that achieves the same goodness-of-fit with fewer features. 
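A sketch of such a grid search in plain numpy: fit AR(p) models by OLS on synthetic AR(2) data and score each with $latex AIC = n\ln(RSS/n) + 2k$ (a simplified stand-in for the statsmodels-based search in [13]; AIC may occasionally prefer a slightly over-parameterized model, but it reliably rejects under-parameterized ones):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3000
e = rng.normal(0.0, 1.0, n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + e[t]   # synthetic data; the true model is AR(2)

def ar_aic(y, p):
    # OLS fit of an AR(p) model; column k of X holds the lag-(k+1) values
    X = np.column_stack([y[p - 1 - k: len(y) - 1 - k] for k in range(p)])
    target = y[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = float(np.sum((target - X @ beta) ** 2))
    m = len(target)
    return m * np.log(rss / m) + 2 * (p + 1)   # +1 counts the noise variance

aics = {p: ar_aic(y, p) for p in range(1, 6)}
best_p = min(aics, key=aics.get)
```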

 

A seasonal decomposition plot allows you to separate the trend, seasonality, and residual of a time series. According to [10], a time series consists of four components:

  1. level, the average value in the series
  2. trend, the increasing or decreasing value in the series
  3. seasonality, the repeating short-term cycle in the series
  4. noise, the random variation in the series

Below is the seasonality plot based on our data:

Screenshot from 2017-07-27 14:29:19

The residual is thus the combination of level and noise. One thing I am not sure about is how to use the seasonality plot. One possible way is to fit an ARIMA model on the residual, because that is the part that has been de-trended and de-seasonalized (see a comment asking this in [8]). However, since the residual is already the level (seen as a constant?) plus noise, is there any point in fitting an ARIMA? Also, even if we can fit an ARIMA on the residual, when we use it to predict the future, how can we transform the predicted values back to the raw scale? [15] actually supports this direction and mentions the same concern: “Now that you have stationarized your time series, you could go on and model residuals (fit lines between values in the plot). However, as the patterns for the trend and seasonality information extracted from the series that are plotted after decomposition are still not consistent and cannot be scaled back to the original values, you cannot use this approach to create reliable forecasts”. Another direction is to use exogenous variables to fit the residuals, treating them as something not predictable by the time series itself [3]. However, we then face the same issue of transforming the predicted residuals back to the raw scale.

There are also other resources introducing seasonality decomposition, such as [9].

Based on what I’ve read so far, I’d suggest using the seasonality plot as an exploration tool, but not relying on it to train a forecasting model. After all, modern python ARIMA APIs fit trend and seasonality at the same time, so you don’t need to separate them out manually [7].

 

Reference:

[1] http://machinelearningmastery.com/time-series-prediction-with-deep-learning-in-python-with-keras/

[2] http://pandasplotting.blogspot.com/2012/06/autocorrelation-plot.html

[3] http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/

[4] https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/

[5] http://people.duke.edu/~rnau/whatuse.htm

[6] http://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_model.ARIMA.html

[7] http://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html

[8] https://stats.stackexchange.com/questions/223117/forecasting-residuals-from-seasonal-decomposition-appears-to-be-highly-auto-cor

[9] http://www.cbcity.de/timeseries-decomposition-in-python-with-statsmodels-and-pandas

[10] http://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/

[11] https://stats.stackexchange.com/questions/26024/moving-average-model-error-terms/74826#74826

[12] https://www.youtube.com/watch?v=Aw77aMLj9uM

[13] https://www.digitalocean.com/community/tutorials/a-guide-to-time-series-forecasting-with-arima-in-python-3

[14] https://stats.stackexchange.com/questions/164526/is-an-auto-correlation-plot-suitable-for-determining-at-what-point-time-series-d

[15] https://datascience.ibm.com/exchange/public/entry/view/815137c868b916821dec777bdc23013c

 

tricks in deep learning neural network

In this post, I am going to talk about my understanding of tricks in training deep neural networks.

ResNet [1]

Why does ResNet network work? https://www.quora.com/How-does-deep-residual-learning-work

Here is my answer:

It is hard to know the desired depth of a deep network. If layers are too deep, errors are hard to propagate back correctly; if layers are too shallow, we may not gain enough representation power.

However, with a deep residual network, it is safe to train very deep layers to obtain enough learning power without worrying too much about the degradation problem, because in the worst case, blocks in those “unnecessary layers” can learn to be identity mappings and do no harm to performance. This is achieved by the solver driving the weights of ReLU layers close to zero, so that only the shortcut connection is active and acts as an identity mapping. Though not proved theoretically, driving weights close to zero might be an easier task for the solver than adjusting weights into an effective representation all at once.

The authors observe empirically (Fig. 7) that ResNet has smaller magnitude of layer responses on average than plain networks, suggesting that many blocks are just learning little, incremental information.

To conclude, the core idea of ResNet is the shortcut connection between layers, which makes it safe to train a very deep network to gain maximal representation power without worrying about the degradation problem, i.e., the learning difficulties introduced by deep layers.
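This "worst case reduces to the identity" argument can be seen in a schematic numpy residual block (the block shape and weights below are made up; batch norm is omitted for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    # schematic residual block: output = x + F(x), where F is a small
    # two-layer transform on the residual branch
    return x + W2 @ relu(W1 @ x)

x = rng.normal(size=4)
out = residual_block(x, np.zeros((4, 4)), np.zeros((4, 4)))
# with the branch weights driven to zero, only the shortcut remains,
# so the block reduces exactly to the identity mapping
W = rng.normal(size=(4, 4))
out_nonzero = residual_block(x, W, W)   # nonzero weights give a genuine transform
```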

All my answer is based on empirical observation and intuition. I’d like to know more theories behind ResNet.

 

Now according to [2], all the following techniques are regularization methods.

Data Augmentation

Augment the training set via domain-specific transformations. For image data, commonly used transformations include random cropping and random perturbation of brightness, saturation, hue, and contrast.

Early Stopping

Early stopping was shown to implicitly regularize on some convex learning problems (Yao et al., 2007; Lin et al., 2016)

Dropout 

Mask out each element of a layer output randomly with a given dropout probability. Interestingly, [1] says “We adopt batch normalization (BN) right after each convolution and before activation” but “We do not use dropout”, following the practices of its own references.

 

Weight Decay 

Why is weight decay equivalent to an $latex \ell_2$ norm regularizer on the weights? It is also equivalent to a hard constraint keeping the weights inside a Euclidean ball, with the radius decided by the amount of weight decay.
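A sketch of the standard argument (for plain SGD): add the penalty $latex \frac{\lambda}{2}\|w\|_2^2$ to the loss $latex L(w)$; one gradient step with learning rate $latex \eta$ becomes

$latex w \leftarrow w - \eta\nabla L(w) - \eta\lambda w = (1 - \eta\lambda)w - \eta\nabla L(w) &s=2$

so each step multiplicatively shrinks, i.e. “decays”, the weights by $latex (1 - \eta\lambda)$ before the usual gradient update. The hard-constraint view follows from Lagrangian duality: minimizing $latex L(w)$ subject to $latex \|w\|_2 \leq r$ matches the penalized problem for some $latex \lambda$ depending on $latex r$. (For adaptive optimizers such as Adam, decoupled weight decay is no longer exactly equivalent to the $latex \ell_2$ penalty.)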

 

Batch Norm

https://theneuralperspective.com/2016/10/27/gradient-topics/

http://leix.me/2017/03/02/normalization-in-deep-learning/
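A minimal numpy sketch of the batch norm forward pass at training time (running statistics and the backward pass are omitted; shapes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, gamma, beta, eps=1e-5):
    # normalize each feature over the batch, then scale by gamma and shift by beta
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))   # a batch of 64 samples, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# each output feature now has (approximately) zero mean and unit variance
```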

 

ReLU, Leaky ReLU & MaxOut

Comparison between different activation functions

https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions
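A quick numpy sketch of the three activations for side-by-side comparison (the maxout here is the elementwise form with scalar pieces, a simplification of the general affine version):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    # a small negative slope keeps gradients flowing for x < 0
    return np.where(x > 0, x, alpha * x)

def maxout(x, ws, bs):
    # elementwise max over k affine pieces w_i * x + b_i;
    # ReLU is the special case of maxout over the pieces 0*x and 1*x
    return np.max([w * x + b for w, b in zip(ws, bs)], axis=0)

x = np.array([-2.0, -0.5, 0.0, 1.5])
```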

 

Reference

[1] Deep Residual Learning for Image Recognition

[2] Understand Deep Learning Requires Rethinking Generalization

[3] Note for [2]: https://danieltakeshi.github.io/2017/05/19/understanding-deep-learning-requires-rethinking-generalization-my-thoughts-and-notes

[4] https://datascience.stackexchange.com/questions/14349/difference-of-activation-functions-in-neural-networks-in-general

A3C code walkthrough

In this post, I am doing a brief walkthrough of the code written in https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2

The code implements the A3C algorithm (Asynchronous Methods for Deep Reinforcement Learning). It follows the pseudocode given in the supplementary part of the paper:

Screenshot from 2017-06-13 14-09-56

The structure of this model is:

Untitled Diagram (2)

For LSTM structure detail, refer to http://colah.github.io/posts/2015-08-Understanding-LSTMs/. I am using the same notation in the model structure diagram as in the link.
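One piece worth previewing: the `discount` helper in the code uses `scipy.signal.lfilter` to compute discounted returns; an equivalent plain-numpy version makes the recursion $latex G_t = r_t + \gamma G_{t+1}$ explicit (the reward sequence below is made up for illustration):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    # plain-numpy equivalent of the lfilter trick used in the A3C code:
    # G_t = r_t + gamma * G_{t+1}, computed by scanning the rewards backwards
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

r = [1.0, 0.0, 0.0, 10.0]
g = discounted_returns(r, 0.9)   # [8.29, 8.1, 9.0, 10.0]
```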

Below is the code: 

"""
Asynchronous Advantage Actor-Critic algorithm

See:
https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2
https://arxiv.org/pdf/1602.01783.pdf
"""

import copy
import threading
import multiprocessing
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.contrib.slim as slim
import scipy.misc    # needed by process_frame's imresize call below
import scipy.signal
from helper import *
from vizdoom import *

from random import choice
from time import sleep
from time import time


# Copies one set of variables to another.
# Used to set worker network parameters to those of global network.
def update_target_graph(from_scope,to_scope):
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, from_scope)
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, to_scope)

    op_holder = []
    for from_var,to_var in zip(from_vars,to_vars):
        op_holder.append(to_var.assign(from_var))
    return op_holder

# Processes Doom screen image to produce cropped and resized image.
def process_frame(frame):
    s = frame[10:-10,30:-30]
    s = scipy.misc.imresize(s,[84,84])
    s = np.reshape(s,[np.prod(s.shape)]) / 255.0
    return s

# Discounting function used to calculate discounted returns.
def discount(x, gamma):
    return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]

#Used to initialize weights for policy and value output layers
def normalized_columns_initializer(std=1.0):
    def _initializer(shape, dtype=None, partition_info=None):
        out = np.random.randn(*shape).astype(np.float32)
        out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
        return tf.constant(out)
    return _initializer


class AC_Network():
    def __init__(self, s_size, a_size, scope, trainer):
        with tf.variable_scope(scope):
            # Input and visual encoding layers
            self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
            self.imageIn = tf.reshape(self.inputs, shape=[-1, 84, 84, 1])
            self.conv1 = slim.conv2d(activation_fn=tf.nn.elu,
                                     inputs=self.imageIn, num_outputs=16,
                                     kernel_size=[8, 8], stride=[4, 4], padding='VALID')
            self.conv2 = slim.conv2d(activation_fn=tf.nn.elu,
                                     inputs=self.conv1, num_outputs=32,
                                     kernel_size=[4, 4], stride=[2, 2], padding='VALID')
            hidden = slim.fully_connected(slim.flatten(self.conv2), 256, activation_fn=tf.nn.elu)

            # Recurrent network for temporal dependencies
            lstm_cell = tf.contrib.rnn.BasicLSTMCell(256, state_is_tuple=True)
            self.lstm_cell_c_size = lstm_cell.state_size.c
            self.lstm_cell_h_size = lstm_cell.state_size.h
            c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
            h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
            self.state_init = [c_init, h_init]

            c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
            h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
            self.state_in = (c_in, h_in)
            rnn_in = tf.expand_dims(hidden, [0])
            step_size = tf.shape(self.imageIn)[:1]
            state_in = tf.contrib.rnn.LSTMStateTuple(c_in, h_in)
            lstm_outputs, lstm_state = tf.nn.dynamic_rnn(
                lstm_cell, rnn_in, initial_state=state_in, sequence_length=step_size,
                time_major=False)
            lstm_c, lstm_h = lstm_state
            self.state_out = (lstm_c[:1, :], lstm_h[:1, :])
            rnn_out = tf.reshape(lstm_outputs, [-1, 256])

            # Output layers for policy and value estimations
            self.policy = slim.fully_connected(rnn_out, a_size,
                                               activation_fn=tf.nn.softmax,
                                               weights_initializer=normalized_columns_initializer(0.01),
                                               biases_initializer=None)
            self.value = slim.fully_connected(rnn_out, 1,
                                              activation_fn=None,
                                              weights_initializer=normalized_columns_initializer(1.0),
                                              biases_initializer=None)

            # Only the worker network need ops for loss functions and gradient updating.
            if scope != 'global':
                self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
                self.actions_onehot = tf.one_hot(self.actions, a_size, dtype=tf.float32)
                self.target_v = tf.placeholder(shape=[None], dtype=tf.float32)
                self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)

                self.responsible_outputs = tf.reduce_sum(self.policy * self.actions_onehot, [1])

                # Loss functions
                self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value, [-1])))
                self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy))
                self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs) * self.advantages)
                self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01

                # Get gradients from local network using local losses
                local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
                self.gradients = tf.gradients(self.loss, local_vars)
                self.var_norms = tf.global_norm(local_vars)
                grads, self.grad_norms = tf.clip_by_global_norm(self.gradients, 40.0)

                # Apply local gradients to global network
                global_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'global')
                self.apply_grads = trainer.apply_gradients(zip(grads, global_vars))

    def state_in_init(self):
        c_init = np.zeros((1, self.lstm_cell_c_size), np.float32)
        h_init = np.zeros((1, self.lstm_cell_h_size), np.float32)
        self.state_init = [c_init, h_init]


class Worker():
    def __init__(self, game, name, s_size, a_size, trainer, model_path, global_episodes):
        self.name = "worker_" + str(name)
        self.number = name
        self.model_path = model_path
        self.trainer = trainer
        self.global_episodes = global_episodes
        self.increment = self.global_episodes.assign_add(1)
        self.episode_rewards = []
        self.episode_lengths = []
        self.episode_mean_values = []
        self.summary_writer = tf.summary.FileWriter("train_" + str(self.number))

        # Create the local copy of the network and the tensorflow op to copy global paramters to local network
        self.local_AC = AC_Network(s_size, a_size, self.name, trainer)
        self.update_local_ops = update_target_graph('global', self.name)

        # The Below code is related to setting up the Doom environment
        game.set_doom_scenario_path("basic.wad")  # This corresponds to the simple task we will pose our agent
        game.set_doom_map("map01")
        game.set_screen_resolution(ScreenResolution.RES_160X120)
        game.set_screen_format(ScreenFormat.GRAY8)
        game.set_render_hud(False)
        game.set_render_crosshair(False)
        game.set_render_weapon(True)
        game.set_render_decals(False)
        game.set_render_particles(False)
        game.add_available_button(Button.MOVE_LEFT)
        game.add_available_button(Button.MOVE_RIGHT)
        game.add_available_button(Button.ATTACK)
        game.add_available_game_variable(GameVariable.AMMO2)
        game.add_available_game_variable(GameVariable.POSITION_X)
        game.add_available_game_variable(GameVariable.POSITION_Y)
        game.set_episode_timeout(300)
        game.set_episode_start_time(10)
        game.set_window_visible(False)
        game.set_sound_enabled(False)
        game.set_living_reward(-1)
        game.set_mode(Mode.PLAYER)
        game.init()
        self.actions = np.identity(a_size, dtype=bool).tolist()
        # End Doom set-up
        self.env = game

    def train(self, rollout, sess, gamma, bootstrap_value):
        rollout = np.array(rollout)
        observations = rollout[:, 0]
        actions = rollout[:, 1]
        rewards = rollout[:, 2]
        next_observations = rollout[:, 3]
        values = rollout[:, 5]

        # Here we take the rewards and values from the rollout, and use them to
        # generate the advantage and discounted returns.
        # The advantage function uses "Generalized Advantage Estimation"
        self.rewards_plus = np.asarray(rewards.tolist() + [bootstrap_value])
        discounted_rewards = discount(self.rewards_plus, gamma)[:-1]
        self.value_plus = np.asarray(values.tolist() + [bootstrap_value])
        advantages = rewards + gamma * self.value_plus[1:] - self.value_plus[:-1]
        advantages = discount(advantages, gamma)

        # Update the global network using gradients from loss
        # Generate network statistics to periodically save
        rnn_state = self.local_AC.state_init
        feed_dict = {self.local_AC.target_v: discounted_rewards,
                     self.local_AC.inputs: np.vstack(observations),
                     self.local_AC.actions: actions,
                     self.local_AC.advantages: advantages,
                     self.local_AC.state_in[0]: rnn_state[0],
                     self.local_AC.state_in[1]: rnn_state[1]}
        v_l, p_l, e_l, g_n, v_n, _ = sess.run([self.local_AC.value_loss,
                                               self.local_AC.policy_loss,
                                               self.local_AC.entropy,
                                               self.local_AC.grad_norms,
                                               self.local_AC.var_norms,
                                               self.local_AC.apply_grads],
                                              feed_dict=feed_dict)
        return v_l / len(rollout), p_l / len(rollout), e_l / len(rollout), g_n, v_n

    def work(self, max_episode_length, gamma, sess, coord, saver):
        episode_count = sess.run(self.global_episodes)
        total_steps = 0
        print("Starting worker " + str(self.number))
        with sess.as_default(), sess.graph.as_default():
            while not coord.should_stop():
                # copy global network parameter to self parameter
                sess.run(self.update_local_ops)
                # when a new episode starts, C_0 & h_0 of LSTM are reset to zero
                self.local_AC.state_in_init()
                episode_buffer = []
                episode_values = []
                episode_frames = []
                episode_reward = 0
                episode_step_count = 0
                d = False

                self.env.new_episode()
                s = self.env.get_state().screen_buffer
                episode_frames.append(s)
                s = process_frame(s)
                rnn_state = self.local_AC.state_init

                while self.env.is_episode_finished() == False:
                    # Take an action using probabilities from policy network output.
                    # after this step, a_dist shape (1,3), v shape (1,1),
                    # rnn_state: first array (C): (1, 256), second array (h): (1, 256)
                    a_dist, v, rnn_state = sess.run(
                        [self.local_AC.policy, self.local_AC.value, self.local_AC.state_out],
                        feed_dict={self.local_AC.inputs: [s],
                                   self.local_AC.state_in[0]: rnn_state[0],
                                   self.local_AC.state_in[1]: rnn_state[1]})
                    a = np.random.choice(a_dist[0], p=a_dist[0])
                    a = np.argmax(a_dist == a)     # recover the index of the sampled action

                    # see http://vizdoom.cs.put.edu.pl/tutorial and
                    # https://github.com/mwydmuch/ViZDoom/blob/3bdb16935668aa42bb14cc38ac397b8954999cca/doc/DoomGame.md
                    # for reward description
                    # the agent gets rewards for each action which is -1 for each time tick,
                    # -6 if he shots but misses, and 100 if he kills the monster
                    r = self.env.make_action(self.actions[a]) / 100.0     # make_action returns reward

                    # In this example, the episode finishes after 300 tics or when the monster gets killed
                    # ref: http://www.cs.put.poznan.pl/visualdoomai/tutorial.html#basic - Game Runtime
                    d = self.env.is_episode_finished()
                    if d == False:
                        s1 = self.env.get_state().screen_buffer
                        episode_frames.append(s1)
                        s1 = process_frame(s1)
                    else:
                        s1 = s

                    episode_buffer.append([s, a, r, s1, d, v[0, 0]])
                    episode_values.append(v[0, 0])

                    episode_reward += r
                    s = s1
                    total_steps += 1
                    episode_step_count += 1

                    # If the episode hasn't ended, but the experience buffer is full, then we
                    # make an update step using that experience rollout.
                    if len(episode_buffer) == 30 and d != True and episode_step_count != max_episode_length - 1:
                        # Since we don't know what the true final return is, we "bootstrap" from our current
                        # value estimation.
                        v1 = sess.run(self.local_AC.value,
                                      feed_dict={self.local_AC.inputs: [s],
                                                 self.local_AC.state_in[0]: rnn_state[0],
                                                 self.local_AC.state_in[1]: rnn_state[1]})
                        v1 = v1[0, 0]
                        v_l, p_l, e_l, g_n, v_n = self.train(episode_buffer, sess, gamma, v1)
                        episode_buffer = []
                        sess.run(self.update_local_ops)
                        # original code does not update state_init:
                        # in train function, rnn_state is always set to self.state_init which are two zero numpy arrays
                        self.local_AC.state_init = copy.deepcopy(rnn_state)
                    if d == True:
                        break

                self.episode_rewards.append(episode_reward)
                self.episode_lengths.append(episode_step_count)
                self.episode_mean_values.append(np.mean(episode_values))

                # Update the network using the experience buffer at the end of the episode.
                if len(episode_buffer) != 0:
                    v_l, p_l, e_l, g_n, v_n = self.train(episode_buffer, sess, gamma, 0.0)

                # Periodically save gifs of episodes, model parameters, and summary statistics.
                if episode_count % 5 == 0 and episode_count != 0:
                    if self.name == 'worker_0' and episode_count % 25 == 0:
                        time_per_step = 0.05
                        images = np.array(episode_frames)
                        make_gif(images, './frames/image' + str(episode_count) + '.gif',
                                 duration=len(images) * time_per_step, true_image=True, salience=False)
                    if episode_count % 250 == 0 and self.name == 'worker_0':
                        saver.save(sess, self.model_path + '/model-' + str(episode_count) + '.cptk')
                        print("Saved Model")

                    mean_reward = np.mean(self.episode_rewards[-5:])
                    mean_length = np.mean(self.episode_lengths[-5:])
                    mean_value = np.mean(self.episode_mean_values[-5:])
                    summary = tf.Summary()
                    summary.value.add(tag='Perf/Reward', simple_value=float(mean_reward))
                    summary.value.add(tag='Perf/Length', simple_value=float(mean_length))
                    summary.value.add(tag='Perf/Value', simple_value=float(mean_value))
                    summary.value.add(tag='Losses/Value Loss', simple_value=float(v_l))
                    summary.value.add(tag='Losses/Policy Loss', simple_value=float(p_l))
                    summary.value.add(tag='Losses/Entropy', simple_value=float(e_l))
                    summary.value.add(tag='Losses/Grad Norm', simple_value=float(g_n))
                    summary.value.add(tag='Losses/Var Norm', simple_value=float(v_n))
                    self.summary_writer.add_summary(summary, episode_count)

                    self.summary_writer.flush()
                if self.name == 'worker_0':
                    sess.run(self.increment)
                episode_count += 1




max_episode_length = 300
gamma = .99 # discount rate for advantage estimation and reward discounting
s_size = 7056 # Observations are greyscale frames of 84 * 84 * 1
a_size = 3 # Agent can move Left, Right, or Fire
load_model = False
model_path = './model'

tf.reset_default_graph()

if not os.path.exists(model_path):
    os.makedirs(model_path)

# Create a directory to save episode playback gifs to
if not os.path.exists('./frames'):
    os.makedirs('./frames')

with tf.device("/cpu:0"):
    global_episodes = tf.Variable(0, dtype=tf.int32, name='global_episodes', trainable=False)
    trainer = tf.train.AdamOptimizer(learning_rate=1e-4)
    master_network = AC_Network(s_size, a_size, 'global', None)  # Generate global network
    num_workers = multiprocessing.cpu_count()  # Set workers to number of available CPU threads
    workers = []
    # Create worker classes
    for i in range(num_workers):
        workers.append(Worker(DoomGame(), i, s_size, a_size, trainer, model_path, global_episodes))
    saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    if load_model == True:
        print('Loading Model...')
        ckpt = tf.train.get_checkpoint_state(model_path)
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        sess.run(tf.global_variables_initializer())

    # This is where the asynchronous magic happens.
    # Start the "work" process for each worker in a separate thread.
    worker_threads = []
    for worker in workers:
        worker_work = lambda: worker.work(max_episode_length, gamma, sess, coord, saver)
        t = threading.Thread(target=(worker_work))
        t.start()
        sleep(0.5)
        worker_threads.append(t)
    coord.join(worker_threads)

Note that Worker.train implements the loop over $latex i \in \{t-1, \cdots, t_{start}\}$ in the pseudocode.
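Concretely, that loop walks the rollout backward to compute bootstrapped n-step returns. A minimal numpy sketch of this step (the function and variable names are mine, not from the tutorial code):

```python
import numpy as np

def discounted_returns(rewards, gamma, bootstrap_value):
    """Bootstrapped n-step returns: R_i = r_i + gamma * R_{i+1},
    seeded with the value estimate of the state right after the rollout
    (the v1 computed above). This mirrors the backward loop over
    i in {t-1, ..., t_start} in the pseudocode."""
    R = bootstrap_value
    returns = np.zeros(len(rewards))
    for i in reversed(range(len(rewards))):  # i = t-1, ..., t_start
        R = rewards[i] + gamma * R
        returns[i] = R
    return returns

# e.g. three time ticks of -1/100 reward, bootstrapped from v1 = 0.5
print(discounted_returns([-0.01, -0.01, -0.01], 0.99, 0.5))
```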

Policy Gradient

Reinforcement learning algorithms can be divided into many families. In model-free temporal-difference methods like Q-learning/SARSA, we try to learn the action value Q(s,a) for every state-action pair, either by recording (“memorizing”) exact values in a table or by learning a function to approximate them. Under \epsilon-greedy, the action selected at a state is therefore argmax_a Q(s,a), with a small constant probability \epsilon of picking a non-optimal action instead.
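For reference, \epsilon-greedy selection over a tabular Q can be sketched as follows (a toy illustration; the table and names are made up):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon take a uniformly random action;
    otherwise take argmax_a Q[s, a]."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

Q = np.array([[1.0, 5.0, 2.0]])   # one state, three actions
rng = np.random.default_rng(0)
print(epsilon_greedy(Q, 0, 0.1, rng))
```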

Another family is called policy gradient methods which directly map states to actions. To select actions, they do not need to consult a value table or a value function. Instead, each action can be selected with a probability determined by a parameterized policy function \pi(a|s,\theta), where \theta is the policy function’s parameters.

The advantages of policy gradient methods over Q-learning/SARSA with \epsilon-greedy are mainly twofold:

  1. In some situations the optimal approximate policy must be stochastic. An example from [1]: in card games with imperfect information, the optimal play is often to do two different things with specific probabilities, such as when bluffing in Poker. Action-value methods have no natural way of finding stochastic optimal policies.
  2. Problems vary in the complexity of their policies and action-value functions. In some problems the policy is a much simpler function to approximate than the action values, so it will be faster to learn.

The general update form of policy gradient methods is \theta_{t+1} = \theta_t + \alpha \nabla \eta(\theta_t), where \eta(\theta_t) is performance measure with respect to the policy weights.

Now, the policy gradient theorem can be briefly described as [5]:

Screenshot from 2017-05-29 11-43-06

In episodic case, the policy gradient theorem derives as follows (according to [1]):

Screenshot from 2017-05-29 20-53-45

The last expression is an exact formula of the gradient of \eta(\theta). It can also be seen as an expectation of \gamma^t \sum\limits_{a_t} \nabla_\theta \pi(a_t|s_t) q(s_t, a_t) over the probability distribution of landing on s_t in t steps. Please note for any fixed t, \sum\limits_{s_t} \Pr(s_0 \rightarrow s_t, t, \pi)=1. Therefore, we can rewrite the above last expression as an expectation form:

Screenshot from 2017-05-31 00-09-15

where \qquad G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3} + \cdots. In the last two steps, G_t replaces q(s_t, a_t) as an unbiased estimator, since we do not have a direct estimate of q(s_t, a_t). We use \mathbb{E}_{\pi} to denote \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi) \atop a_t \sim \pi(a_t | s_t)}, meaning that the sequence S_0, A_0, S_1, A_1, \cdots is generated by following the policy \pi(a_t|s_t) and the transition probability p(s_{t+1}|s_t, a_t). Sometimes we also write \mathbb{E}_{\pi_\theta} because the policy \pi is parameterized by \theta, or equivalently \mathbb{E}_{s_{0:T}, a_{0:T}}. In other words, \nabla \eta(\theta) = \mathbb{E}_{s_{0:T}, a_{0:T}}[\sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t|s_t)]

What we do in reality is to use these collected state, actions and rewards as samples to approximate the exact expectation of \nabla \eta(\theta):

Screenshot from 2017-05-28 19-56-19
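The sample-based estimate above can be sketched in numpy for a single trajectory under a linear-softmax policy (a toy sketch; the parameterization and names are my choice, not from the text):

```python
import numpy as np

def reinforce_grad(theta, states, actions, rewards, gamma):
    """Single-trajectory estimate of
    grad eta(theta) = sum_t gamma^t * G_t * grad log pi(a_t | s_t)
    for a softmax policy pi(a|s) proportional to exp(theta[a] . s)."""
    T = len(rewards)
    G = np.zeros(T)                       # returns G_t
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        G[t] = running
    grad = np.zeros_like(theta)
    for t in range(T):
        logits = theta @ states[t]
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
        # grad_theta log pi(a_t|s_t) for softmax: (1{a=a_t} - pi) outer s_t
        dlog = -np.outer(pi, states[t])
        dlog[actions[t]] += states[t]
        grad += gamma ** t * G[t] * dlog
    return grad
```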

This kind of policy gradient method is called REINFORCE, due to Ronald J. Williams from Northeastern University. The original paper [2] is hard to read in my opinion. It directly states the update rule of \theta by construction, and then shows that the expected update aligns with the direction of the performance gradient. What I wish it told me is how the update rule of \theta was derived in the first place.

(Updated 09/18/2017: The same derivative procedure of REINFORCE is illustrated more clearly in [10])

(Updated 01/31/2020: The derivative procedure of REINFORCE in continuous state/action space is illustrated in [15])

One important extension of REINFORCE is to offset G_t by a baseline b(s_t), a function of s_t. Intuitively, what matters is not how good an action is in absolute terms, but how much better it is than the average action. For example, if G_t is uniformly positive for both good and bad actions, \theta will be updated to encourage both kinds. We need to calibrate G_t so that it better differentiates good actions from bad ones. Mathematically, subtracting a baseline can reduce the variance of \nabla \eta(\theta) while keeping it an unbiased estimator.

First, let’s look at why offsetting G_t by b(s_t) still makes \nabla \eta(\theta) an unbiased estimator (reference from [8]):

Screenshot from 2017-05-31 00-10-24

To ensure \mathbb{E}_{s_{0:T}, a_{0:T}} [\sum\limits_{t=0}^T \gamma^t (G_t - b(s_t)) \nabla_\theta \log \pi (a_t|s_t)] is an unbiased estimate of \nabla \eta(\theta), b(s_t) should only be a function of s_t, but not a_t. Otherwise \mathbb{E}_{s_{0:T}, a_{0:T}}[\sum\limits_{t=0}^T \gamma^t b(s_t) \nabla_\theta \log \pi (a_t | s_t)] is not zero.

It is less obvious that subtracting b(s_t) can reduce the variance Var[ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] .

Screenshot from 2017-06-08 21-13-30

From here, we can see that if \sum\limits_{t=0}^T \gamma^t b(s_t) \nabla_\theta \log \pi(a_t | s_t) has large enough covariance with \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t) to outweigh its own variance, then the overall variance is reduced. In the unrealistic extreme where b(s_t) = G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3} + \cdots, the variance would be exactly zero; this is impossible, though, because b(s_t) is a function of s_t alone and cannot perfectly predict G_t.
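This variance reduction is easy to see numerically. Below is a toy Monte Carlo check (entirely made-up numbers) with a one-parameter Bernoulli policy, \pi(a=1)=p, whose score \nabla_\theta \log \pi(a) is 1-p for a=1 and -p for a=0: both estimators have the same mean, but the baselined one has far smaller variance when returns sit far above zero.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.7                           # pi(a=1)
G = np.array([9.8, 10.2])         # returns for a=0, a=1: both "good"
b = (1 - p) * G[0] + p * G[1]     # baseline = expected return

acts = (rng.random(100_000) < p).astype(int)
score = np.where(acts == 1, 1 - p, -p)       # d log pi / d theta
est_raw = G[acts] * score                    # G_t * grad log pi, at t = 0
est_base = (G[acts] - b) * score             # same with baseline subtracted

print(est_raw.mean(), est_base.mean())       # nearly identical (both unbiased)
print(est_raw.var(), est_base.var())         # variance collapses with the baseline
```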

(sidenote: I cannot follow [8]’s deduction on variance reduction part.)

One way is to train a predictor \hat{v}(S_t, w) with parameters w as the baseline:

Screenshot from 2017-05-28 21-59-16

From [1]: Note that REINFORCE uses the complete return from time t (G_t), which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed.

When we derived \nabla \eta(\theta), we use the property that \mathbb{E}[G_t|S_t, A_t]=q_\pi(s_t, a_t). However, G_t could have high variance because it involves returns from step t to T, where each reward can be seen as a random variable [13]. An alternative estimator of q_\pi(s_t, a_t) which has lower variance but higher bias is to use “bootstrapping”, i.e., use a parameterized value function \hat{v}_w plus the next immediate reward to approximate G_t \approx R + \gamma \hat{v}(S', w). The one-step actor-critic algorithm is described as follows [1]:

Screen Shot 2018-11-11 at 1.44.33 PM

REINFORCE is an on-policy algorithm because \delta=G_t - \hat{v}(S_t,w) in the gradient update depends on G_t, the returns generated by following the current policy \pi_\theta. The specific one-step actor-critic algorithm we just described is also an on-policy algorithm because \delta=R+\gamma \hat{v}(S', w) - \hat{v}(S, w) depends on the next state S' which is the result of applying \pi_\theta at the current state S.  There also exists off-policy actor-critics, see an overview of on-policy and off-policy policy gradient methods at [14].
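A minimal sketch of one such one-step actor-critic update with linear function approximation (my own function and variable names; [1] gives the general algorithm):

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, done, gamma,
                      alpha_theta, alpha_w):
    """One-step actor-critic: the TD error
    delta = R + gamma * v(S', w) - v(S, w)
    drives both the critic w and the (softmax) actor theta."""
    v_s = w @ s
    v_next = 0.0 if done else w @ s_next
    delta = r + gamma * v_next - v_s
    w = w + alpha_w * delta * s                  # critic: semi-gradient TD(0)
    logits = theta @ s                           # actor: softmax over actions
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    dlog = -np.outer(pi, s)                      # grad_theta log pi(a|s)
    dlog[a] += s
    theta = theta + alpha_theta * delta * dlog
    return theta, w, delta
```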

A more recent advance in baselines is Generalized Advantage Estimation (GAE) [6]. The authors introduce two parameters, \gamma and \lambda, in an un-discounted reward problem to help estimate g:=\nabla_\theta \mathbb{E}[\sum_{t=0}^\infty r_t] with little introduced bias and reduced variance. (Note how discounted reward problems can be transformed into un-discounted ones: “But the discounted problem (maximizing \sum_{t=0}^\infty \gamma^t r_t) can be handled as an instance of the undiscounted problem in which we absorb the discount factor into the reward function, making it time-dependent.”)

They introduce their notations:

Screenshot from 2017-06-09 17:48:05

Note that g^\gamma is a biased estimator of g but as they claim previous works have studied to “reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias”.

The paper’s goal is to find a good estimator of A^{\pi, \gamma} which is called \hat{A}_t.

Screenshot from 2017-06-10 18-07-03

In other words, if \hat{A}_t is \gamma-just, then it helps construct an unbiased estimator of g^\gamma. Equation (8) just uses the property that the expectation of a sum equals the sum of expectations.

Now, what other property does \hat{A}_t have? If we know any property of \hat{A}_t, it will help us find \hat{A}_t more easily. The paper proposes one property:

Screenshot from 2017-06-10 18-19-33

Sketch proof of proposition 1:

First of all, to keep the notations clear, think of \hat{A}_t(s_{0:\infty}, a_{0:\infty}) and Q_t(s_{0:\infty}, a_{0:\infty}) as functions taking the whole trajectory s_0, a_0, s_1, a_1, \cdots as input. Similarly, think of b_t(s_{0:t}, a_{0:t-1}) as a function taking the front part of the trajectory s_0, a_0, s_1, a_1, \cdots, s_t as input.

Now, suppose \hat{A}_t satisfies \hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{0:\infty}, a_{0:\infty}) - b_t(s_{0:t}, a_{0:t-1}) where for all (s_t, a_t), \; \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}|s_t,a_t} [Q_t(s_{0:\infty}, a_{0:\infty})] = Q^{\pi,\gamma}(s_t, a_t). Then:

Screenshot from 2017-06-10 19-18-26

Next, they find a good candidate for \hat{A}_t: \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t). \delta_t^V is \gamma-just only when we know V = V^{\pi, \gamma}; otherwise it yields a biased estimator of g^\gamma. However, by averaging different variations of \delta (equations 11, 12, 13, 14, 15, and 16), we can obtain a low-bias, low-variance estimator, called GAE(\gamma, \lambda).

Screenshot from 2017-06-10 20-58-57

\hat{A}_t:=GAE(\gamma, 1) is \gamma-just regardless of the accuracy of V (again, this is because E_{s_{0:\infty}, a_{0:\infty}}[V(s_t) \nabla_\theta \log \pi_\theta (a_t | s_t) ] = 0). However, GAE(\gamma, 1) is believed (I don't know how to prove this) to have high variance due to the long sum of rewards. At the other extreme, GAE(\gamma, 0) has low variance, but since we are estimating the value function V, GAE(\gamma, 0) must be a biased estimator of g^\gamma. GAE(\gamma, \lambda) with 0<\lambda<1 makes a trade-off between variance and bias.
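In practice GAE(\gamma, \lambda) is computed with a single backward recursion over the TD residuals, \hat{A}_t = \delta_t^V + \gamma\lambda \hat{A}_{t+1}. A sketch (my own function name; `values` carries one extra bootstrap entry V(s_T)):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized Advantage Estimation:
    A_t = sum_l (gamma * lam)^l * delta_{t+l}, with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` must have len(rewards) + 1 entries (bootstrap at the end)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

# lam = 0 recovers the one-step TD residual; lam = 1 the Monte Carlo
# return minus V(s_t)
```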

Update 2018-11-08

Policy gradient is better illustrated in several recent posts: [11] and [12]

Reference

[1] Reinforcement learning: An introduction

[2] Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

[3] https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2

[4] Asynchronous Methods for Deep Reinforcement Learning

[5] http://www.breloff.com/DeepRL-OnlineGAE/

[6] High-Dimensional Continuous Control Using Generalized Advantage Estimation

[7] Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

[8] https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/

[9] https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/

[10] https://www.youtube.com/watch?v=kUiR0RLmGCo

[11] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#soft-actor-critic

[12] https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#expected-grad-log-prob-lemma

[13] Supplement material of DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills

[14] Unifying On-Policy and Off-Policy Learning

[15] http://web.stanford.edu/class/cme241/lecture_slides/PolicyGradient.pdf

latex for policy gradient theorem:

    \begin{align*} \nabla \eta(\theta) &= \nabla_\theta v_{\pi} (s_0) \quad \quad \text{performance measure is the value of starting state} \\ &= \nabla_\theta \big[ \sum\limits_{a_0} \pi(a_0|s_0) q(s_0,a_0) \big] \\ &=\sum\limits_{a_0} \big[ \nabla_\theta \pi(a_0|s_0) q(s_0, a_0) + \pi(a_0|s_0) \nabla_\theta q(s_0, a_0) \big]  \quad \quad \text{derivative product rule} \\  &= \sum\limits_{a_0} \Big[ \nabla_\theta \pi(a_0|s_0) q(s_0, a_0) + \pi(a_0|s_0) \nabla_\theta \big[ \sum\limits_{s_1,r_0} p(s_1, r_0 |s_0,a_0)(r_0 + \gamma v_\pi(s_1)) \big] \Big] \\ &= \sum\limits_{a_0} \big[ \nabla_\theta \pi (a_0 | s_0) q(s_0, a_0) + \pi(a_0 | s_0) \sum\limits_{s_1} \gamma p(s_1| s_0, a_0) \nabla_\theta v_{\pi}(s_1) \big] \qquad r \text{ has nothing to do with regard to } \theta \\  & \qquad \text{up till now, we have a recursion:} \\ & \qquad \nabla_\theta v_\pi(s_t)= \sum\limits_{a_t} \Big[ \nabla_\theta \pi(a_t|s_t) q(s_t, a_t) + \pi(a_t|s_t) \big[ \sum\limits_{s_{t+1}} \gamma p(s_{t+1}|s_t,a_t) \nabla_\theta v_\pi(s_{t+1}) \big] \Big]  \\ &=\sum\limits_{a_0} \Big[ \nabla_\theta \pi (a_0 | s_0) q(s_0, a_0) + \pi(a_0 | s_0) \sum\limits_{s_1} \gamma p(s_1| s_0, a_0) \\ & \qquad \qquad  \sum\limits_{a_1} \big[ \nabla_\theta \pi(a_1 | s_1)q(s_1, a_1) + \pi(a_1 | s_1)\sum\limits_{s_2} \gamma p(s_2|s_1, a_1) \nabla_\theta v_{\pi} (s_2) \big] \Big]  \\ &=\sum\limits_{a_0} \nabla_\theta \pi (a_0 | s_0) q(s_0, a_0) \\ & \qquad +  \gamma \sum\limits_{s_1} \sum\limits_{a_0} p(s_1| s_0, a_0) \pi(a_0 | s_0)  \sum\limits_{a_1}  \nabla_\theta \pi(a_1 | s_1)q(s_1, a_1) \\ & \qquad + \gamma^2 \sum\limits_{s_2} \sum\limits_{a_1} \sum\limits_{s_1} \sum\limits_{a_0} p(s_2|s_1, a_1) \pi(a_1 | s_1) p(s_1| s_0, a_0) \pi(a_0 | s_0) \nabla_\theta v_{\pi} (s_2)  \\ &= \cdots \qquad \text{(keep unrolling using the recursion)}\\ &= \sum\limits_{t=0}^\infty \sum\limits_{s_t} \gamma^t \Pr(s_0 \rightarrow s_t, t, \pi) \sum\limits_{a_t} \nabla_\theta \pi(a_t | s_t) q(s_t, a_t)  \qquad  \Pr(s_0 
\rightarrow s_t, t, \pi) \text{ is the prob. of } s_0 \text{ to } s_t \text{ in } t \text{ steps}  \end{align*}

latex for expectation form rewritten:

    \begin{align*} \nabla \eta(\theta) &= \sum\limits_{t=0}^\infty \sum\limits_{s_t} \gamma^t \Pr(s_0 \rightarrow s_t, t, \pi) \sum\limits_{a_t} \nabla_\theta \pi(a_t | s_t) q(s_t, a_t) \\ &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi)}[\sum\limits_{a_t} \gamma^t \nabla_\theta \pi(a_t | s_t) q(s_t, a_t) ] \\  &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi)} [\sum\limits_{a_t} \gamma^t \pi(a_t | s_t) q(s_t, a_t) \frac{\nabla_\theta \pi(a_t | s_t)}{\pi(a_t | s_t)} ] \\ &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi) \atop a_t \sim \pi(a_t | s_t) \quad}[ \gamma^t q(s_t, a_t) \frac{\nabla_\theta \pi(a_t | s_t)}{\pi(a_t | s_t)}] \\ &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi) \atop a_t \sim \pi(a_t | s_t) \quad}[ \gamma^t q(s_t, a_t) \nabla_\theta \log \pi(a_t | s_t)]  \qquad \nabla \log(x) = \frac{\nabla x}{x} \\  &=\mathbb{E}_{\pi}[ \sum\limits_{t=0}^\infty \gamma^t q(s_t, a_t) \nabla_\theta \log \pi(a_t | s_t)] \\  &=\mathbb{E}_{\pi}[ \sum\limits_{t=0}^\infty \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] \end{align*}

latex for adding baseline is still unbiased estimator:

    \begin{align*} &\mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t (G_t - b(s_t)) \nabla_\theta \log \pi(a_t | s_t)] \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \mathbb{E}_{s_{0:T}, a_{0:T}}[ \sum\limits_{t=0}^T \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t) ] \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \mathbb{E}_{s_{0:T}, a_{0:T}}[ \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t) ]  \qquad \text{exp. of sum equals to sum of exp.}\\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \mathbb{E}_{s_{t}, a_{t}}[ \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t) ] \qquad \text{remove irrelevant variables in each exp.}\\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \sum\limits_{a_{t}} p(s_t, a_t)  \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t)  \qquad \text{expectation form} \rightarrow \text{discrete sum} \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \sum\limits_{a_{t}} p(s_t) \pi(a_t|s_t)  \gamma^t  b(s_t) \frac{\nabla_\theta \pi (a_t | s_t)}{\pi(a_t | s_t) }  \qquad \text{rule of probability} \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \gamma^t  b(s_t) p(s_t) \sum\limits_{a_{t}} \nabla_\theta \pi (a_t | s_t)  \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \gamma^t  b(s_t) p(s_t) \nabla_\theta \sum\limits_{a_{t}} \pi (a_t | s_t)  \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log 
\pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \gamma^t  b(s_t) p(s_t) \nabla_\theta 1  \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] \\ =& \nabla \eta(\theta)  \end{align*}

latex for sketch proof of proposition 1:

    \begin{align*} &\mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} [\hat{A}_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t | s_t) ] \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} [(Q_t(s_{0:\infty}, a_{0:\infty}) - b_t(s_{0:t}, a_{0:t-1})) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[Q_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t | s_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &\qquad \text{we will first work on the former part} \downarrow \\ &= \mathbb{E}_{s_{0:t} \atop a_{0:t}}[\nabla_\theta \log \pi_\theta(a_t | s_t) \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}} [Q_t(s_{0:\infty}, a_{0:\infty})] ] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &= \mathbb{E}_{s_{0:t} \atop a_{0:t}}[\nabla_\theta \log \pi_\theta(a_t | s_t) Q^{\pi, \gamma}(s_t, a_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[\nabla_\theta \log \pi_\theta(a_t | s_t) Q^{\pi, \gamma}(s_t, a_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &\qquad \text{since } Q^{\pi, \gamma}(s_t, a_t) \text{ is a function of input only } s_t \text{ and } a_t \text{, we can change } \mathbb{E}_{s_{0:t} \atop a_{0:t}} \text{ to } \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[\nabla_\theta \log \pi_\theta(a_t | s_t) Q^{\pi, \gamma}(s_t, a_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[V^{\pi, \gamma}(s_t) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &\qquad V^{\pi, \gamma}(s_t) \text{ is an instance of } b_t(s_{0:t}, a_{0:t-1})  \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} [A^{\pi, \gamma}(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t | s_t) ] 
\end{align*}