Resources about Attention is all you need

There are several online posts [1][2] that illustrate the idea of the Transformer, the model introduced in the paper “Attention Is All You Need” [4].

Based on [1] and [2], I am sharing a short tutorial for implementing a Transformer [3]. In this tutorial, the task is “copy-paste”, i.e., to let the Transformer learn to output the same sequence as the input sequence. The symbols used in the code are:

  • src_vocab_size=12: the vocabulary size of the input sequence.
  • tgt_vocab_size=12: the vocabulary size of the output sequence. Since our task is to copy the input sequence, tgt_vocab_size = src_vocab_size. We reserve symbol 0 for padding and symbol 1 for the decoder start symbol, and we assume the possible real symbols are 2 to 11. Therefore, the total vocabulary size is 12.
  • dim_model=512: the dimension of the embedding, of the self-attention output, and of the input/output of the position-wise feed-forward model.
  • dim_feedforward=256: the size of the hidden layer in the position-wise feed-forward layer.
  • d_k=64: the dimension of each attention head.
  • num_heads=8: the number of attention heads. For computational convenience, d_k * num_heads must equal dim_model.
  • seq_len=7: the length of input sequences.
  • batch_size=30: the number of input sequences per batch.
  • num_stacked_layers=6: the number of layers stacked in both the Encoder and the Decoder.

Let’s first look at what the input/output data looks like, which is generated by data_gen. Each Batch instance contains several essential pieces of information used in training:

  • src (batch_size x seq_len): the input sequence to the encoder
  • src_mask (batch_size x 1 x seq_len): the mask of the input sequence, which controls, for each input symbol, which other symbols it may attend to
  • trg_y (batch_size x seq_len): the output sequence as the label, which is the same as src given that our task is simply “copy-paste”
  • trg (batch_size x seq_len): the input sequence to the decoder. The first symbol of each sequence is symbol 1, which kicks off the decoder; the rest is trg_y shifted right, i.e., the last symbol of trg equals the second-to-last symbol of trg_y.
  • trg_mask (batch_size x seq_len x seq_len): the mask of the output sequence, so that at each decoding step a symbol being decoded can attend only to already decoded symbols. trg_mask[i, j, k], a binary value, indicates whether in the i-th output sequence the j-th symbol can attend to the k-th symbol. Since a symbol can attend to all its preceding symbols, trg_mask[i,j,k]=1 for all k <= j.
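As a concrete illustration, the two masks can be sketched in NumPy as below; the helper names are mine, not those of the tutorial code [3]:

```python
import numpy as np

def make_src_mask(src, pad=0):
    # (batch_size, 1, seq_len): 1 where the symbol is not padding
    return (src != pad).astype(np.uint8)[:, np.newaxis, :]

def make_trg_mask(trg, pad=0):
    # Padding mask: (batch_size, 1, seq_len)
    pad_mask = (trg != pad).astype(np.uint8)[:, np.newaxis, :]
    # Subsequent mask: lower-triangular (1, seq_len, seq_len), so that
    # position j can only attend to positions k <= j
    seq_len = trg.shape[1]
    subsequent = np.tril(np.ones((1, seq_len, seq_len), dtype=np.uint8))
    # Broadcast-AND the two masks -> (batch_size, seq_len, seq_len)
    return pad_mask & subsequent

src = np.array([[2, 5, 7]])   # one sequence of real symbols
trg = np.array([[1, 2, 5]])   # decoder input: starter symbol 1 + shifted sequence
print(make_trg_mask(trg)[0])  # lower-triangular: row j has ones up to column j
```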

Now, let’s look at the EncoderLayer class. There are num_stacked_layers=6 encoder layers in one Encoder. The bottom layer takes the symbol embeddings as input, while each upper layer takes the previous layer’s output. Both the symbol embeddings and each layer’s output have the same shape `batch_size x seq_len x dim_model`, so in theory you can stack as many layers as needed because their inputs/outputs are all compatible. The symbol embedding converts input symbols into embeddings and adds positional encoding, which helps the model learn positional information.

One EncoderLayer does the following things:

  • It has three matrices stored in MultiHeadedAttention.linears[:3], which we call the query, key, and value transformation matrices. They transform the input from the last layer into the three matrices that the self-attention module will use. Each of the output matrices, which we call query, key, and value, has the shape batch_size x num_heads x seq_len x d_k.
  • The transformed matrices are used as query, key, and value in self-attention (the attention function). The attention is calculated based on Eqn. 1 in [4]: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

The attention function also takes a mask as input. The mask tells it, for each symbol, which other symbols it can attend to (restricted due to padding or other reasons). In our case, the input sequences are all of equal length and we allow every symbol to attend to every other symbol, so the mask is always filled with ones.

  • The attention outputs of the num_heads heads are concatenated and passed through a linear projection layer, which results in an output of shape `batch_size x seq_len x dim_model`
  • The output finally goes through a residual connection and layer normalization.
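The per-head computation above can be sketched as a single-head attention function in NumPy (a simplified illustration of Eqn. 1 of [4], not the actual code in [3]):

```python
import numpy as np

def attention(query, key, value, mask=None):
    # query, key, value: (seq_len, d_k); computes softmax(Q K^T / sqrt(d_k)) V
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)           # (seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # masked positions get ~zero weight
    # Row-wise softmax, shifted by the row max for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ value

q = k = v = np.eye(3)              # toy 3x3 inputs
causal = np.tril(np.ones((3, 3)))  # lower-triangular mask
out = attention(q, k, v, mask=causal)
print(out.shape)  # (3, 3)
```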

So far, the encoder part can be illustrated in the following diagram [1] (suppose our input sequence has length 2, containing symbols 2 and 3):

One DecoderLayer does the following things:

  • It first does self-attention on the target sequence (batch.trg, not the trg_y sequence), which generates attention of shape (batch_size, seq_len, dim_model). The first symbol of a target sequence is 1, the symbol that tells the decoder to start decoding.
  • It then performs attention over the encoder, i.e., using the target sequence’s self-attention output as the query, and the final encoder layer’s output as the key and value.
  • The encoder-decoder attention’s output is passed through a feed-forward layer, with layer norm and residual connection. The output of one DecoderLayer has the shape `batch_size x seq_len x dim_model`. Again, this shape allows you to stack as many layers as you want.

The decoder part is shown in the following diagram:

The last part is label fitting. The last DecoderLayer has output of shape `batch_size x seq_len x dim_model`, i.e., each target symbol’s attention. We pass it through a linear projection and then a softmax, so that each target symbol is associated with a probability distribution over the next symbol to decode. We fit the probability distribution against batch.trg_y by minimizing the KL divergence. Here is one example of label fitting:

In practice, we also apply label smoothing, so that the target distribution to fit has slight probability mass even on the non-target symbols. For example, the target distribution of position #1 (refer to the diagram on the left) is no longer (0, 0, 1, 0, 0); rather, it should be (0.1, 0.1, 0.6, 0.1, 0.1) with label smoothing of 0.1. The intuition is that even our labels may not be 100% accurate, and we should not let the model become overly confident in the training data. In other words, label smoothing helps prevent overfitting and hopefully improves generalization to unseen data.
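A minimal sketch of building such a smoothed target distribution (plain NumPy; the function name is mine, and the smoothing scheme follows the 5-symbol example above, where each non-target symbol receives the smoothing mass directly):

```python
import numpy as np

def smooth_label(target_idx, vocab_size, eps=0.1):
    # Each non-target symbol gets probability eps;
    # the target symbol keeps the remaining mass.
    dist = np.full(vocab_size, eps)
    dist[target_idx] = 1.0 - eps * (vocab_size - 1)
    return dist

print(smooth_label(2, 5))  # [0.1 0.1 0.6 0.1 0.1]
```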

[1] http://jalammar.github.io/illustrated-transformer/

[2] http://nlp.seas.harvard.edu/2018/04/03/attention.html

[3] https://github.com/czxttkl/Tutorials/blob/master/experiments/transformer/try_transformer.py

[4] https://arxiv.org/abs/1706.03762

Implementation notes for world model

I’ve recently been implementing the world model [1], which seems to be a promising algorithm for effectively learning controls after first learning the environment. Here I share some implementation notes.

Loss of Gaussian Mixture Model

The memory model of the world model is a Mixture-Density-Network Recurrent Neural Network (MDN-RNN). It takes the current state and action as inputs, and outputs predictions of the next state, terminal, and reward. Since state transitions might be stochastic, the authors choose to model the next state as a Gaussian mixture: the next state can come from one of several multi-dimensional Gaussian components with certain probability assignments. The probability of the next state coming from component $latex i$ is $latex \pi_i$, and that component itself is a Gaussian distribution with mean $latex \mu_i \in \mathbb{R}^f$ and covariance $latex \Sigma_i \in \mathbb{R}^{f \times f}$, where $latex f$ is the feature size of a state.

To learn the parameters of the MDN-RNN that predict $latex \pi_i$, $latex \mu_i$ and $latex \Sigma_i$, we need a loss function that measures how closely the observed next state fits the predicted Gaussian mixture model. The loss function is the negative log likelihood. Since we usually fit data in batches, and one batch is a sequence of consecutive states from the same episode, the loss function is defined over batch_size and seq_len:

$latex -\sum\limits_{z=0}^{batch\_size} \sum\limits_{s=0}^{seq\_len} log (\sum\limits_i \pi^{z,s}_i \mathcal{N}(x^{z,s} | \mu^{z,s}_{i}, \Sigma^{z,s}_{i})) &s=2$

 

If we ignore the negative sign for now and only look at one state for a particular batch $latex z$ and sequence element $latex s$, we get:

$latex log(\sum\limits_i \pi_i \mathcal{N}(x | \mu_{i}, \Sigma_{i}))&s=2$

 

In practice, the MDN-RNN usually predicts $latex log(\pi_{i})$, and we can get $latex log(\mathcal{N}(x | \mu_{i}, \Sigma_{i}))$ from the PyTorch Normal.log_prob endpoint [3], so the formula above can be re-written as:

$latex log \sum\limits_i e^{log(\pi_i \mathcal{N}(x | \mu_{i}, \Sigma_{i}))} \newline =log\sum\limits_i e^{log(\pi_i) + log(\mathcal{N}(x | \mu_{i}, \Sigma_{i}))} &s=2$

 

According to log-sum-exp trick [2], to get numerical stability, 

$latex log(\sum\limits_{i=1}^n e^{x_i}) = log(\sum\limits_{i=1}^n e^{x_i - c}) + c &s=2$

where $latex c=\max\limits_i x_i$ is usually picked.

The takeaway: we walked through how the loss function for the GMM is defined, and adopted the log-sum-exp trick when calculating it.
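A minimal PyTorch sketch of this loss for a single batch of states, assuming the network outputs log mixture weights and per-component means and (diagonal) standard deviations; the tensor names and shapes are illustrative, not from the world-model code:

```python
import torch
from torch.distributions import Normal

def gmm_nll(x, log_pi, mu, sigma):
    # x: (batch, f); log_pi: (batch, K); mu, sigma: (batch, K, f)
    # log N(x | mu_i, Sigma_i) with diagonal covariance: sum log-probs over features
    log_prob = Normal(mu, sigma).log_prob(x.unsqueeze(1)).sum(dim=-1)  # (batch, K)
    # log sum_i exp(log_pi_i + log N_i), computed stably via logsumexp
    log_mix = torch.logsumexp(log_pi + log_prob, dim=-1)               # (batch,)
    return -log_mix.mean()

x = torch.zeros(4, 3)                        # 4 observed next states, f=3 features
log_pi = torch.log(torch.full((4, 2), 0.5))  # K=2 equally weighted components
mu = torch.zeros(4, 2, 3)
sigma = torch.ones(4, 2, 3)
loss = gmm_nll(x, log_pi, mu, sigma)
```

`torch.logsumexp` applies exactly the shift-by-max trick above, so no manual `c` is needed.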

 

 

References

[1] https://worldmodels.github.io/

[2] https://blog.feedly.com/tricks-of-the-trade-logsumexp/

[3] https://pytorch.org/docs/stable/distributions.html#torch.distributions.normal.Normal.log_prob

My understanding in 401K

Here is my reasoning about 401K.

First, I’ll start with two definitions: (1) taxable income, the gross income you receive on which your tax will be calculated; (2) tax deduction, any deduction from your taxable income. A tax deduction lowers your taxable income and thus generally lowers your tax.

401K has three categories:

Pre-tax: contribute your taxable income directly to the 401K account, without paying tax at the time of contribution. Hence, pre-tax 401K is a form of tax deduction. You can only withdraw after age 59.5, and every dollar you withdraw (i.e., the money you contributed plus its growth) will be taxed based on the tax bracket you are in at the time of withdrawal (i.e., when you are old and probably don’t have high income).

Roth: contribute your after-tax income to the 401K account [1]. You can only withdraw after age 59.5, but you don’t pay any tax at the time of withdrawal (including on the earnings).

After-tax: contribute your after-tax income to the 401K account, separately from Roth or pre-tax. However, you still need to pay tax on the earnings when withdrawing: “the earnings on after-tax contributions, just like all distributions from pre-tax contributions, are taxed as ordinary income” [4]. What’s more, there is a 10% early-withdrawal penalty on the earnings if you withdraw before age 59.5. The early-withdrawal penalty and the tax on the earnings are collected proportionally [5]: if your total after-tax contributions have accumulated x% earnings, then you have to pay tax and penalty on x% of whatever amount you withdraw before 59.5.

There is another age rule on 401K accounts: you can defer withdrawing from your 401K if you still work at an older age. But once you are over 70, you need to start withdrawing at least the Required Minimum Distribution (RMD) each year [7].

The annual combined limit of 401K pre-tax + Roth is $19,000 as of 2019. However, there is another limit [3] that restricts you from contributing more than $56,000 across all kinds of retirement plans per year. That means, if you fully contribute $19,000 in pre-tax + Roth 401K, then your after-tax 401K contribution plus employer match should not exceed $56,000 - $19,000 = $37,000. There is yet another limit, not for 401K but for an IRA (individual retirement account), which is $6,000.
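As a sanity check on the arithmetic (illustrative only, using the 2019 numbers quoted above):

```python
# 2019 limits from the paragraph above
pretax_plus_roth_limit = 19_000  # employee pre-tax + Roth 401K limit
overall_limit = 56_000           # all 401K sources, employee + employer

# If you max out pre-tax + Roth, the room left for
# after-tax contributions plus employer match is:
after_tax_plus_match_room = overall_limit - pretax_plus_roth_limit
print(after_tax_plus_match_room)  # 37000
```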

The benefit of after-tax 401K contributions isn’t apparent at first. One underlying purpose of saving in an after-tax account is to bypass the $19,000 limit for pre-tax + Roth. You can roll over your after-tax amount from after-tax 401K to Roth 401K and then to a Roth IRA, another type of retirement account very similar to Roth 401K in the sense that anything going into it will be tax-free. At the time of the rollover from after-tax 401K to Roth 401K, any earnings in the after-tax account are treated as ordinary income and thus become taxable; but after the rollover, the whole amount of money, as well as all future earnings, is tax-free.

In many situations, you need a triggering event (such as retirement or changing companies) to be able to roll over from after-tax 401K to Roth IRA. Only a few companies offer the opportunity to roll over from after-tax to Roth IRA directly without any triggering event. My company (FB) supports rolling over from after-tax 401K to Roth 401K immediately after any contribution to after-tax 401K, meaning that I don’t need to pay any tax on the rollover because no earnings have been generated before the rollover happens. And our company’s employees can roll over from Roth 401K to Roth IRA at any time.

The withdrawal rules of Roth IRA are summarized as follows. Assume you only contribute to Roth IRA by converting from after-tax 401K. You can withdraw the converted money (principal, not earnings) without penalty after it has stayed in the Roth IRA for more than 5 years [8]. You can withdraw the earnings without penalty if you are over age 59.5, die, become disabled, or withdraw up to $10,000 for a first-home purchase, provided the first Roth contribution was made more than 5 years ago; otherwise you pay a 10% penalty on the earnings [6]. As discussed in [8], the earnings the penalty applies to are the pre-rollover earnings (before you roll over from after-tax 401K to Roth 401K). So since my company allows rolling over from after-tax 401K to Roth 401K immediately, I don’t need to pay any penalty at all even if I withdraw from the Roth IRA within 5 years of my Roth IRA contribution!

So it seems that if you have any money invested in after-tax 401K, it is better to roll it over to a Roth IRA (if your company allows that during your employment), because (1) a Roth IRA doesn’t incur any tax, while after-tax 401K still incurs tax on earnings even after age 59.5, and (2) a Roth IRA is more flexible in withdrawing the converted money (you can withdraw it even before age 59.5).


Now let’s talk about the appropriate distribution of 401K money. Here is a chart showing the suggested 401K amount you should save at each age group [11]:

In essence, a Vanguard target date fund comprises several funds that cover domestic/international stock markets and domestic/international bonds. For example, below is the decomposition of Vanguard Target Date 2060 as of 01/31/2020 (https://investor.vanguard.com/mutual-funds/profile/VTTSX):

Historically, bonds have lower yields but also lower risk than stocks [10]. Bonds also distribute dividends that are treated as ordinary income. So the three-fund theory [12] tells us that if we have a lot of money to invest, we should put bonds in tax-advantaged accounts such as 401K, and move stock index funds into taxable accounts (because when holding them for a long time we only need to pay long-term capital gains tax, rather than ordinary income tax).

 

Updated 2019-05-13

Another example on how to convert after-tax to roth: Roth In-Plan Brochure

 

Updated 2020-05-01

This list shows historical returns for Vanguard funds, which are a main source of 401K investments.

https://retirementplans.vanguard.com/VGApp/pe/faces/PubFundChart?site=nyu/6370

 

Updated 2020-09-05

Below is the expected growth of retirement funds under a realistic estimate of a middle-class income.

Retirement Planning Calculator (US Personal Finance @ FB)

References

[1] https://www.irs.gov/retirement-plans/roth-comparison-chart

[2] https://www.bogleheads.org/wiki/After-tax_401(k)

[3] https://www.goodfinancialcents.com/401k-contribution-limits/

[4] https://www.doughroller.net/retirement-planning/pros-and-cons-of-after-tax-401k-contributions/

[5] https://about401k.com/withdrawal/after-tax/

[6] https://www.schwab.com/public/schwab/investing/retirement_and_planning/understanding_iras/roth_ira/withdrawal_rules

[7] https://www.sensiblemoney.com/learn/at-what-age-should-i-start-making-401k-withdrawals/

[8] https://thefinancebuff.com/rollover-after-tax-401k-403b-access-before-59-1-2.html

[9] My company’s After-tax 401K to Roth 401K FAQ: PlanConversionFAQ

[10] https://personal.vanguard.com/us/insights/saving-investing/model-portfolio-allocations

[11] https://www.cnbc.com/2019/05/23/how-much-money-americans-have-in-their-401ks-at-every-age.html

[12] https://www.bogleheads.org/wiki/Three-fund_portfolio

DPG and DDPG

In this post, I am sharing my understanding regarding Deterministic Policy Gradient Algorithm (DPG) [1] and its deep-learning version (DDPG) [2].

We have introduced policy gradient theorem in [3, 4]. Here, we briefly recap. The objective function of policy gradient methods is:

J(\theta)=\sum\limits_{s \in S} d^\pi(s) V^\pi(s)=\sum\limits_{s \in S} d^\pi(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a),

where \pi represents \pi_\theta, d^\pi(s) is the stationary distribution of Markov chain for \pi, V^\pi(s)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s], and Q^{\pi}(s,a)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s, A_t=a]. G_t is accumulated rewards since time step t: G_t=\sum^\infty_{k=0}\gamma^k R_{t+k}.

Policy gradient theorem proves that the gradient of policy parameters with regard to J(\theta) is:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi}[Q^\pi(s,a)\nabla_\theta ln \pi_\theta(a|s)]

More specifically, the policy gradient theorem we are talking about is the stochastic policy gradient theorem, because at each state the policy outputs a stochastic action distribution \pi(a|s).

However, the policy gradient theorem implies that policy gradient must be calculated on-policy, as \mathbb{E}_{\pi}:=\mathbb{E}_{s\sim d^{\pi}, a\sim \pi} means the state and action distribution is generated by following \pi_\theta.

If the training samples are generated by some other behavior policy \beta, then the objective function becomes:

J(\theta)=\sum\limits_{s \in S} d^\beta(s) V^\pi(s)=\sum\limits_{s \in S} d^\beta(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a),

and we must rely on importance sampling to calculate the policy gradient [7,8]:

\nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^\beta}[\sum\limits_{a\in A}\nabla_\theta \pi(a|s) Q^\pi(s,a) + \pi(a|s)\nabla_\theta Q^\pi(s,a)] \newline \approx \mathbb{E}_{s\sim d^\beta}[\sum\limits_{a\in A}\nabla_\theta \pi(a|s) Q^\pi(s,a)] \newline =\mathbb{E}_{\beta}[\frac{\pi_\theta(a|s)}{\beta(a|s)}Q^\pi(s,a)\nabla_\theta ln \pi_\theta(a|s)]

where \mathbb{E}_{\beta}:=\mathbb{E}_{s\sim d^\beta, a\sim\beta}

However, using the importance sampling ratio \frac{\pi_\theta(a|s)}{\beta(a|s)} could incur learning difficulties [5].


What DPG proposes is to design a policy that outputs actions deterministically: a = \mu_\theta(s). Then J(\theta) only has an integration (or sum) over the state distribution and gets rid of importance sampling, which makes learning potentially easier:

J(\theta)=\sum\limits_{s \in S} d^\beta(s) V^\mu(s)=\sum\limits_{s \in S} d^\beta(s) Q^\mu(s, \mu_\theta(s))

Writing the above formula in integration (rather than discrete sum), we get:

J(\theta)=\int\limits_S d^\beta(s) V^\mu(s) ds=\int\limits_{S} d^\beta(s) Q^\mu(s, \mu_\theta(s)) ds

And the gradient of J(\theta) is:

\nabla_\theta J(\theta) = \int\limits_S d^\beta(s) \nabla_\theta Q^\mu(s, \mu_\theta(s)) ds \approx \mathbb{E}_\beta [\nabla_\theta \mu_\theta(s) \nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}]

This immediately implies that \theta can be updated off-policy, because we only need to sample s \sim d^\beta(s); \nabla_\theta J(\theta) can then be calculated without knowing what action the behavior policy \beta took (we only need to know what the training policy would take, i.e., a=\mu_\theta(s)). We should also note that there is an approximation in the formula of \nabla_\theta J(\theta): the chain rule cannot easily be applied to \nabla_\theta Q^\mu(s,\mu_\theta(s)), because even if you replace \mu_\theta(s) with a, the Q function itself still depends on \theta.

Updating \theta is one part of DPG. The other part is fitting a Q-network with parameters \phi to the Q-function of the policy governed by \theta. This equates to updating \phi according to the mean-squared Bellman error (MSBE) [10], the same as in Q-learning.

I haven’t read the DDPG paper [2] thoroughly, but based on my rough understanding, it adds several tricks for dealing with large inputs (such as images). According to [9]’s summary, DDPG introduces three tricks:

  1. add batch normalization to normalize “every dimension across samples in one minibatch”
  2. add noise in the output of DPG such that the algorithm can explore
  3. soft update target network
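Putting the DPG actor update and trick 3 together, here is a minimal PyTorch sketch (my own simplification with small made-up MLPs, not the paper's reference implementation):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, 1))
target_actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, action_dim))
target_actor.load_state_dict(actor.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Deterministic policy gradient step: maximize Q(s, mu(s)) over states drawn
# from the behavior policy; autograd composes grad_a Q with grad_theta mu for us.
s = torch.randn(8, state_dim)  # batch of off-policy states
a = actor(s)                   # a = mu_theta(s)
actor_loss = -critic(torch.cat([s, a], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# Trick 3: soft (Polyak) update of the target network
tau = 0.005
with torch.no_grad():
    for p, p_targ in zip(actor.parameters(), target_actor.parameters()):
        p_targ.mul_(1 - tau).add_(tau * p)
```

The critic would be fitted separately by minimizing the MSBE against targets computed with the target networks, as in Q-learning.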

References

[1] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014, June). Deterministic policy gradient algorithms. In ICML.

[2] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., … & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

[3] Notes on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”: https://czxttkl.com/?p=3497

[4] Policy Gradient: https://czxttkl.com/?p=2812

[5] https://repository.tudelft.nl/islandora/object/uuid:682a56ed-8e21-4b70-af11-0e8e9e298fa2?collection=education

[6] https://stackoverflow.com/questions/42763293/what-is-the-advantage-of-deterministic-policy-gradient-over-stochastic-policy-gr

[7] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#off-policy-policy-gradient

[8] Degris, T., White, M., & Sutton, R. S. (2012). Off-policy actor-critic. arXiv preprint arXiv:1205.4839.

[9] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#ddpg

[10] https://spinningup.openai.com/en/latest/algorithms/ddpg.html#the-q-learning-side-of-ddpg

LSTM + DQN

Sequential decision problems can usually be formatted as Markov Decision Processes (MDPs), where you define states, actions, rewards and transitions. In some practical problems, states can be described just by action histories. For example, we’d like to decide notification delivery sequences for a group of similar users to maximize their accumulated clicks. We define two actions: whether or not to send a notification to a user. We can define each user’s state as his or her history of notification receiving. For example, user A might receive a notification on day 1 and day 2 but stop receiving any on day 3, while user B might receive notifications in a row on all three days. It’s a natural sequential decision problem in which future notification-sending decisions depend on different user states. User A might receive a notification on day 4 because he or she hasn’t got one on day 3, while user B will need a notification break on day 4. Sure, one could define the state with a user’s personal information. But as I just mentioned, these are similar users with similar backgrounds, preferences, and cultures, and we want to keep things as simple as possible. You get my idea.

You can manually define state feature vectors by encoding action histories. Back to the notification example: you can define features like how many notifications one has received in the last 3 days, last week, or last month. However, this manual feature engineering may be tedious. Alternatively, you can generate some sort of user embedding from action histories, and then policy learning (RL algorithms) would just work on an MDP with user embeddings as states.

That’s why I would like to test LSTM + DQN (Long Short-Term Memory + Deep Q-Network). The idea is that, at a state, the LSTM takes as input the user’s action history so far and outputs an embedding. That embedding is used as the state fed into DQN, which tries to approximate Q(s,a). The weights of the LSTM and the DQN are jointly learned when fitting Q(s,a).
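A minimal PyTorch sketch of this joint architecture (an illustration of the idea, not the exact code in the repository):

```python
import torch
import torch.nn as nn

class LSTMDQN(nn.Module):
    def __init__(self, num_actions, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(num_actions, embed_dim)  # one embedding per action symbol
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, num_actions)   # Q(s, a) for each action

    def forward(self, action_history):
        # action_history: (batch, history_len) of action indices
        x = self.embed(action_history)
        _, (h, _) = self.lstm(x)          # h: (1, batch, hidden_dim), the state embedding
        return self.q_head(h.squeeze(0))  # (batch, num_actions)

model = LSTMDQN(num_actions=4)
q_values = model(torch.tensor([[0, 2, 1, 3]]))  # one history of 4 actions
```

Since the Q-head sits on top of the LSTM's final hidden state, the TD loss back-propagates through both modules, which is exactly the joint learning described above.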

I design two gridworlds to test this idea. Both share the same structure: an agent starts at (0,0), and some cells are walls (unwalkable). The gold state is at the bottom-left corner. Whenever the agent enters the gold state, the episode terminates and the agent gets reward 10. Since the reward discount factor is set to less than 1, the agent needs not only to reach the gold state, but to reach it as soon as possible. The difference between the two gridworlds is that the second one adds an extra reward on top of the gold reward: at the gold state, the agent receives an additional reward depending on its action history (the path). Some paths have higher rewards than others.

So the first gridworld (called “finite”) looks like below, with the red line indicating the optimal route.

Untitled Diagram (1)

The second gridworld (called “rnn”) looks like below, with the red line indicating the optimal route. It is not simply the shortest path, because going a little zigzag obtains higher rewards.

Untitled Diagram (3)

 

I test LSTM + DQN and pure DQN. In pure DQN, the state is just a one-hot encoding of the cell the agent is at. Therefore you can imagine that pure DQN should fall short in the “rnn” gridworld, because the state has no information about past paths. In LSTM + DQN, the state is the LSTM’s hidden-layer output after the action history passes through the LSTM. My test script can be found at Tutorials/action_state_generation/tests/test_online_train.py in this repository. (The file structure may change in the future.)

On finite gridworld

Pure DQN:

reward_plot_dqn_finite_tt5_eps2001

LSTM+DQN

reward_plot_lstm_finite_tt5_eps2001

The x-axis is learning epochs, while the y-axis is the final reward. Both algorithms reach the optimal reward 10. I also checked to confirm that they reach reward 10 using just 4 steps. However, you can observe that LSTM+DQN takes longer to converge (in other words, LSTM+DQN is less data efficient). This is expected, because the LSTM needs more iterations to understand the sequences and their consequences. Recall that pure DQN encodes grid cells as states directly, while the state in LSTM+DQN is just the action history. So the LSTM needs many iterations to learn how action histories map to exact cells in the grid.

On rnn gridworld

Pure DQN

reward_plot_dqn_rnn_tt5_eps5001

LSTM+DQN

reward_plot_lstm_rnn_tt5_eps5001

As expected, pure DQN does not reach high rewards in the end because its states contain no historical information. LSTM+DQN, however, continuously improves its rewards, simply because the state is the LSTM’s hidden state, which encodes historical information from the action history.

DQN + Double Q-Learning + OpenAI Gym

Here I am providing a script to quickly experiment with the OpenAI Gym environment: https://github.com/czxttkl/Tutorials/tree/master/experiments/lunarlander. The script has the features of both Deep Q-Learning and Double Q-Learning.

I ran my script to benchmark one OpenAI Gym environment, LunarLander-v2. The most stable version of the algorithm has the following hyperparameters: no double Q-learning (just one Q-network), gamma=0.99, batch size=64, learning rate=0.001 (for the Adam optimizer). It solves the environment in around 400 episodes:

Environment solved in 406 episodes!	Average Score: 200.06
Environment solved in 406 episodes!	Average Score: 200.06
Environment solved in 545 episodes!	Average Score: 200.62
Environment solved in 413 episodes!	Average Score: 200.35
Environment solved in 406 episodes!	Average Score: 200.06

 

Several insights:

1. gamma, the discount factor, can influence results. For example, in LunarLander, if I set gamma to 1 instead of 0.99, the agent can never successfully learn, even though LunarLander-v2 limits episodes to 1000 steps.

 

2. Double Q-learning does not bring a clear benefit, at least in LunarLander-v2. Output with double Q-learning turned on, which is similar to the most stable version of the algorithm without it:

lunarlander + mse + double q + gamma 0.99
Environment solved in 435 episodes!	Average Score: 205.85
Environment solved in 435 episodes!	Average Score: 205.85
Environment solved in 491 episodes!	Average Score: 200.97
Environment solved in 406 episodes!	Average Score: 200.06
Environment solved in 413 episodes!	Average Score: 200.35
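For reference, the target computation that the double Q-learning flag switches between can be sketched as follows (assumed tensor shapes, not the script's exact code): the online network selects the argmax action while the target network evaluates it, which reduces the overestimation bias of the plain max.

```python
import torch

def dqn_target(q_next_target, rewards, dones, gamma=0.99):
    # Vanilla DQN: max over the target network's own Q-values
    return rewards + gamma * q_next_target.max(dim=1).values * (1 - dones)

def double_dqn_target(q_next_online, q_next_target, rewards, dones, gamma=0.99):
    # Double DQN: online net picks the action, target net evaluates it
    best_actions = q_next_online.argmax(dim=1, keepdim=True)
    q_eval = q_next_target.gather(1, best_actions).squeeze(1)
    return rewards + gamma * q_eval * (1 - dones)

q_online = torch.tensor([[1.0, 2.0]])  # online Q(s', .)
q_target = torch.tensor([[0.5, 0.1]])  # target Q(s', .)
r, d = torch.tensor([1.0]), torch.tensor([0.0])
# double DQN evaluates the online argmax (action 1) with the target net:
# 1 + 0.99 * 0.1, versus vanilla DQN's 1 + 0.99 * 0.5
```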

 

3. Huber loss seems to even hurt. I changed the loss function used in fitting Q-learning to Huber loss, but the results are pretty bad: the agent never succeeds, or only succeeds after 1000 episodes.

 

References

[1] https://github.com/RMiftakhov/LunarLander-v2-drlnd/blob/master/dqn_agent.py

Notes on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”

I am reading this paper (https://arxiv.org/abs/1801.01290) and wanted to take down some notes about it.

Introduction

Soft Actor-Critic is a special version of actor-critic algorithms, and actor-critic algorithms are one kind of policy gradient method. Policy gradient methods are different from value-based methods (like Q-learning), where you learn Q-values and then infer the best action at each state s using \arg\max_a Q(s,a). Policy gradient methods do not depend on value functions to infer the best policy; instead, they directly learn a probability function \pi(a|s). However, policy gradient methods may still learn value functions for the purpose of better guiding the learning of the policy probability function \pi(a|s).

We have introduced several kinds of policy gradient methods in [2, 18]. Here, we briefly recap. The objective function of policy gradient methods is:

J(\theta)=\mathbb{E}_{s,a\sim\pi}Q^{\pi}(s,a),
where \theta is the parameter of the policy \pi.

In problems with a discrete action space, we can rewrite the expectation in summation:

J(\theta)=\mathbb{E}_{s,a\sim\pi}Q^{\pi}(s,a)=\sum\limits_{s \in S} d^\pi(s) V^\pi(s)=\sum\limits_{s \in S} d^\pi(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a),

where d^\pi(s) is the stationary distribution of Markov chain for \pi, V^\pi(s)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s], and Q^{\pi}(s,a)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s, A_t=a]. G_t is accumulated rewards since time step t: G_t=\sum^\infty_{k=0}\gamma^k R_{t+k}.

Policy gradient methods strive to learn the values of \theta, which is achieved through gradient ascent w.r.t. J(\theta). The gradient \nabla_\theta J(\theta) is obtained by the policy gradient theorem (proved in [4]):

\nabla_\theta J(\theta) = \mathbb{E}_{s,a\sim\pi}[Q^\pi(s,a)\nabla_\theta log \pi_\theta(a|s)]

Different policy gradient methods use different ways to estimate Q^\pi(s,a): REINFORCE uses a Monte Carlo method in which the empirical accumulated reward G_t unbiasedly approximates Q^\pi(s,a); state-value actor-critic learns a value function V_w(s) and uses R+\gamma V_w(s') to approximate Q^\pi(s,a); action-value actor-critic learns a Q-value function Q_w(s,a) and uses R+\gamma Q_w(s',a') to approximate Q^\pi(s,a). Note that since \nabla_\theta J(\theta) is in an expectation form \mathbb{E}_{s,a \sim \pi} [\cdot], we need to collect on-policy samples. That’s why the Soft Actor-Critic paper mentions in its introduction:

some of the most commonly used deep RL algorithms, such as TRPO, PPO, A3C, require new samples to be collected for each gradient step.

If we have only off-policy samples instead, which are collected by \pi_b, J(\theta) becomes “the value function of the target policy, averaged over the state distribution of the behavior policy” (from DPG paper [19]):

J(\theta)=\mathbb{E}_{s\sim \pi_b} \left[ \mathbb{E}_{a\sim\pi}Q^{\pi}(s,a) \right]

The critical problem presented by this formula is that when you take the derivative of J(\theta) w.r.t. \theta, you need to be able to write the gradient without \mathbb{E}_{a\sim\pi}[\cdot], because any such gradient can’t be computed using mini-batches collected from \pi_b. We have several choices for learning \theta given this off-policy objective J(\theta):

  1. Using importance sampling, rewrite \mathbb{E}_{a\sim\pi}\left[ Q^{\pi}(s,a)\right] as \mathbb{E}_{a\sim\pi_b}\left[\frac{\pi(a|s)}{\pi_b(a|s)}Q^{\pi_b}(s,a) \right]. This results in the gradient update \mathbb{E}_{s,a \sim \pi_b} [\frac{\pi(a|s)}{\pi_b(a|s)} Q^{\pi_b}(s,a) \nabla_\theta log \pi(a|s)]. See [18]. However, importance sampling can cause high variance empirically.
  2. Making the policy deterministic would remove the expectation \mathbb{E}_{a\sim\pi}[\cdot]. J(\theta) then becomes J(\theta)=\mathbb{E}_{s\sim\pi_b}\left[Q^{\pi}(s,\pi(s))\right]. The gradient update can be computed as \mathbb{E}_{s\sim \pi_b}\left[ \nabla_\theta \pi(s) \nabla_a Q^\pi(s,a)|_{a=\pi(s)}\right]. This is how DPG works. See [20].
  3. Using the reparameterization trick (covered more below), which writes \pi(a|s) = \pi^{deterministic}(s)+\epsilon, a deterministic function plus independent noise, we would have J(\theta)=\mathbb{E}_{s\sim\pi_b, \epsilon \sim p_\epsilon}\left[Q^{\pi}(s,\pi(s))\right]. The gradient update would be: \mathbb{E}_{s\sim\pi_b, \epsilon \sim p_\epsilon}\left[\nabla_a Q^{\pi}(s,a)|_{a=\pi(s)} \nabla_\theta \pi(s) \right] \newline=\mathbb{E}_{s\sim\pi_b, \epsilon \sim p_\epsilon}\left[\nabla_a Q^{\pi}(s,a)|_{a=\pi(s) } \nabla_\theta \pi^{deterministic}(s) \right]
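To make the third choice concrete, here is a minimal numpy sketch (a toy illustration, not SAC itself) that estimates \nabla_\mu \mathbb{E}[Q(a)] for a one-dimensional Gaussian policy a = \mu + \sigma\epsilon, with a made-up critic Q(a) = -(a-2)^2:

```python
import numpy as np

# Toy reparameterization-trick gradient estimator. The critic Q is a
# made-up one-dimensional function: Q(a) = -(a - 2)^2.
rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0              # Gaussian policy parameters
eps = rng.standard_normal(200_000)
a = mu + sigma * eps              # deterministic function of mu plus noise

# grad_mu E[Q(a)] = E[ dQ/da * da/dmu ] = E[ -2 (a - 2) * 1 ]
grad_estimate = np.mean(-2.0 * (a - 2.0))

# Analytic check: E[Q] = -((mu - 2)^2 + sigma^2), so grad_mu = -2 (mu - 2) = 4
print(grad_estimate)
```

Because a is a deterministic function of \mu plus independent noise, the gradient passes directly through the sampled action; no score-function term is needed.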

The third method is used in Soft Actor-Critic, although SAC adds some more ingredients.  Let’s now dive into it.

Soft Actor-Critic

Soft Actor-Critic (SAC) is designed as an off-policy policy-gradient method. What makes it special is that it strives to maximize long-term reward as well as action randomness, because more random actions mean better exploration and robustness. In the maximum entropy framework, the objective function is therefore formulated as:

    \[ J(\pi)=\sum\limits^T_{t=0}\mathbb{E}_{(s_t, a_t)\sim\rho_\pi} \left[r(s_t, a_t) + \alpha \mathcal{H} \left( \pi\left(\cdot | s_t \right) \right) \right], \]

where \mathcal{H}(\pi( \cdot | s_t ) ) is the entropy of the action distribution \pi(\cdot|s_t), defined as \mathbb{E}_{a \sim \pi(\cdot|s_t)}[-\log \pi(a|s_t)], and \alpha is the coefficient controlling the influence of the entropy term.

We can always treat \alpha=1 by subsuming it into the reward function. Therefore, with \alpha omitted, we have the following objective function, soft Q-function update operator, and soft value function:

(1)   \[ J(\pi)=\sum\limits^T_{t=0}\mathbb{E}_{(s_t, a_t)\sim\rho_\pi} \left[r(s_t, a_t) + \mathcal{H} \left( \pi\left(\cdot | s_t \right) \right) \right]  \]

(2)   \[ \mathcal{T}^\pi Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1}\sim p}\left[V(s_{t+1})\right]  \]

(3)   \[ V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t, a_t) - \log \pi(a_t|s_t) \right] \]
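Eqn. (3) is just an expectation over the policy's actions, so it can be estimated by Monte Carlo. Below is a toy numpy sketch with a made-up Gaussian policy \pi(\cdot|s)=\mathcal{N}(0,1) and a made-up critic Q(s,a)=-a^2 (both are illustrative assumptions, not the paper's networks):

```python
import numpy as np

# Monte Carlo estimate of the soft value function in Eqn. (3):
#   V(s) = E_{a ~ pi}[ Q(s, a) - log pi(a | s) ]
rng = np.random.default_rng(0)

sigma = 1.0
a = rng.normal(0.0, sigma, size=500_000)    # a ~ pi(.|s) = N(0, 1)
log_pi = -0.5 * np.log(2 * np.pi * sigma**2) - a**2 / (2 * sigma**2)
q = -a**2                                   # made-up Q(s, a)

v_estimate = np.mean(q - log_pi)

# Analytic check: E[Q] = -sigma^2 = -1 and the Gaussian entropy is
# 0.5 * log(2 * pi * e * sigma^2) ~= 1.4189, so V ~= 0.4189.
print(v_estimate)
```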

Derivation

The paper uses Lemma 1, Lemma 2 and Theorem 1 to prove that the learning pattern results in the optimal policy maximizing the objective in Eqn. (1). The tl;dr is that Lemma 1 proves that the soft Q-values of any policy converge rather than blow up; Lemma 2 proves that, based on a policy’s soft Q-values, you can always find a policy that is no worse.

Lemma 1 proof

Screen Shot 2018-11-09 at 12.05.04 PM

“The standard convergence results for policy evaluation” means:

(1) there exists a fixed point Q^\pi because of the way we define Eqn. 15: for each state s and each action a, we get one instance of Eqn. 15. We have |S| \cdot |A| equations in total, and Q^\pi has |S| \cdot |A| entries. Therefore, there exists a unique fixed point Q^\pi=r+\gamma P^\pi Q^\pi (the linear system (I-\gamma P^\pi)Q^\pi = r has a unique solution when \gamma < 1), where P^\pi is the operator for calculating the expectation \mathbb{E}[\cdot].

(2) prove that T^\pi is a \gamma-contraction under the infinity norm, i.e., \Vert T^\pi Q^{k+1} - T^\pi Q^k \Vert_\infty \leq \gamma \Vert Q^{k+1} - Q^{k}\Vert_\infty. In plain English, T^\pi being a \gamma-contraction means that each application of T^\pi brings the function Q closer to the fixed point Q^\pi at rate \gamma.

The proof is found from [6]:

\Vert T^\pi Q^{k+1} - T^\pi Q^k \Vert_\infty \newline =\Vert r +\gamma P^\pi Q^{k+1} - r - \gamma P^\pi Q^{k} \Vert_\infty \newline =\gamma \Vert P^\pi(Q^{k+1}-Q^k) \Vert_\infty \newline \leq \gamma \Vert Q^{k+1}-Q^k \Vert_\infty

The last two steps are derived using \Vert Tx\Vert_\infty \leq \Vert T \Vert_\infty \Vert x \Vert_\infty from [7] and the fact that \Vert P^\pi \Vert_\infty = 1 (because P^\pi is the operator for \mathbb{E}[\cdot], so each row sums to one, though I have not verified this fact rigorously).

(3) therefore, by contraction theory (Theorem 5.1-2 in [10]), applying T^\pi will eventually lead to Q^\pi. [9]

Convergence proofs for policy evaluation/value evaluation usually follow a similar pattern, such as the one for Q-learning [8].
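The contraction can be checked numerically on a made-up toy problem. The sketch below applies the plain (non-soft) evaluation operator T^\pi Q = r + \gamma P^\pi Q; the soft operator behaves the same way once the entropy term is folded into the reward:

```python
import numpy as np

# Numeric illustration of the gamma-contraction on a random toy problem.
rng = np.random.default_rng(0)

n = 4                              # number of (s, a) pairs in the toy problem
gamma = 0.9
r = rng.normal(size=n)             # arbitrary rewards
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)  # each row of P^pi is a distribution

Q = np.zeros(n)
diffs = []                         # sup-norm distance between iterates
for _ in range(200):
    Q_next = r + gamma * P @ Q     # apply T^pi
    diffs.append(np.max(np.abs(Q_next - Q)))
    Q = Q_next

# The fixed point solves Q* = r + gamma P Q*, i.e. (I - gamma P) Q* = r
Q_star = np.linalg.solve(np.eye(n) - gamma * P, r)
print(np.max(np.abs(Q - Q_star)))  # tiny: the iteration converged to Q*
```

Each step shrinks the distance between successive iterates by at least \gamma, and the iterates converge to the unique fixed point.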

Lemma 2 proof

Screen Shot 2018-11-09 at 3.39.57 PM

Proving Lemma 2 is basically just expanding the KL-divergence formula D_{KL}(P \Vert Q)=\int^{\infty}_{-\infty}p(x)\log\frac{p(x)}{q(x)}dx. From Eqn. 16, we expand the KL-divergence formula as:

J_{\pi_{old}}(\pi'(\cdot|s_t)) \newline \triangleq D_{KL} (\pi'(\cdot | s_t) \; \Vert \; \exp(Q^{\pi_{old}}(s_t, \cdot) - \log Z^{\pi_{old}}(s_t)) ) \newline = - \int \pi'(a_t | s_t) \log \frac{\exp(Q^{\pi_{old}}(s_t, a_t) - \log Z^{\pi_{old}}(s_t))}{\pi'(a_t|s_t)} d a_t \newline = \int \pi'(a_t | s_t) \big(\log \pi'(a_t|s_t) + \log Z^{\pi_{old}}(s_t) - Q^{\pi_{old}}(s_t, a_t) \big) d a_t \newline = \mathbb{E}_{a_t \sim \pi'}\left[\log \pi'(a_t|s_t) + \log Z^{\pi_{old}}(s_t) - Q^{\pi_{old}}(s_t, a_t)\right]

Note that the authors define a new objective function J_{\pi_{old}}(\pi'(\cdot|s_t))\triangleq D_{KL} (\pi'(\cdot | s_t) \; \Vert \; \exp(Q^{\pi_{old}}(s_t, \cdot) - \log Z^{\pi_{old}}(s_t)) ), which is different from Eqn. 1. The purpose of defining J_{\pi_{old}}(\pi'(\cdot|s_t)) is to help us get to Eqn. 18, which is then used to prove Q^{\pi_{old}}(s_t, a_t) \leq Q^{\pi_{new}}(s_t, a_t) in Eqn. 19.
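The expansion can be sanity-checked exactly in a discrete toy setting, where the integral becomes a sum and Z^{\pi_{old}}(s_t) = \sum_a \exp(Q^{\pi_{old}}(s_t,a)). All numbers below are arbitrary:

```python
import numpy as np

# Exact discrete check of the expansion: with a finite action set,
#   D_KL(pi' || exp(Q - log Z)) = E_{a ~ pi'}[ log pi'(a) + log Z - Q(a) ].
rng = np.random.default_rng(0)

Q = rng.normal(size=5)                  # Q^{pi_old}(s, .) for 5 actions
log_Z = np.log(np.sum(np.exp(Q)))
target = np.exp(Q - log_Z)              # the Boltzmann target distribution

pi_new = rng.random(5)
pi_new /= pi_new.sum()                  # some candidate policy pi'

kl = np.sum(pi_new * (np.log(pi_new) - np.log(target)))
expanded = np.sum(pi_new * (np.log(pi_new) + log_Z - Q))

print(kl, expanded)                     # the two agree
```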

Implementation

Since we are looking at continuous actions, the authors use a network that outputs the mean and standard deviation of a (multi-dimensional) Gaussian. In other words, this network has one output vector for the action means \mu and another output vector for the action standard deviations \sigma. The action to be taken is sampled from \mathcal{N}(\mu, \sigma^2).

export

Since D_{KL}(p\Vert q) = \mathbb{E}_{x \sim p(x)}[\log p(x) - \log q(x)], J_\pi(\phi) can be rewritten as:

J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}, a_t \sim \pi_\phi}[\log \pi_\phi(a_t | s_t) - Q_\theta(s_t, a_t)]

However, actions are sampled from the policy network’s output, i.e., they form a stochastic node in the computation graph. Gradients cannot be backpropagated through it unless we use the “reparameterization trick” to make the action node deterministic. In other words, actions are output by a deterministic function f_\phi(\epsilon_t; s_t) with a stochastic input \epsilon_t. This trick is also widely used in variational autoencoders and has been discussed in [11, 13, 14, 15, 16]. The reparameterization trick is illustrated in the diagram below [11]:

Screen Shot 2018-11-11 at 7.17.57 PM

After applying the reparameterization trick, we have the following stochastic gradient for a training tuple (s_t, a_t):

\hat{\nabla}_\phi J_\pi(\phi) \newline = \nabla_\phi [\log \pi_\phi(a_t | s_t) - Q_\theta(s_t, a_t)] |_{a_t = f_\phi(\epsilon_t ; s_t)} \newline = (\nabla_{a_t} \log \pi_\phi(a_t | s_t) - \nabla_{a_t} Q_\theta (s_t, a_t)) \nabla_\phi f_\phi(\epsilon_t; s_t)

I think what I derived above is slightly different from Eqn. (13) in the paper, which has an extra \nabla_\phi \log \pi_\phi(a_t|s_t) term. Right now I am not sure why, but I will keep figuring it out.

In applications where output actions need to be squashed to fit into some range, \pi_\phi(\cdot|s_t) also needs to account for the squashing. Basically, if squashed_action = squash_function(raw_action), then the probability density of the squashed action differs from the probability density of the raw action [17]. The original code reflects this in https://github.com/haarnoja/sac/blob/fba571f81f961028f598b0385d80cb286d918776/sac/policies/gaussian_policy.py#L70-L75.
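As a sketch of that correction (my own one-dimensional toy version, not the repo's code): if a = \tanh(u) with u \sim \mathcal{N}(\mu, \sigma^2), the change-of-variable rule gives \log\pi(a) = \log\mathcal{N}(u;\mu,\sigma) - \log(1-\tanh^2(u)):

```python
import numpy as np

def squashed_log_prob(u, mu, sigma, eps=1e-6):
    """Log-density of a = tanh(u) when u ~ N(mu, sigma^2).

    eps keeps the log finite when tanh(u) saturates, mirroring the
    numerical-stability trick used in the linked SAC code.
    """
    log_gauss = (-0.5 * np.log(2 * np.pi * sigma**2)
                 - (u - mu) ** 2 / (2 * sigma**2))
    correction = np.log(1.0 - np.tanh(u) ** 2 + eps)
    return log_gauss - correction

# Sanity check: the corrected density should integrate to ~1 over a.
a_grid = np.linspace(-0.999, 0.999, 20001)
u_grid = np.arctanh(a_grid)
density = np.exp(squashed_log_prob(u_grid, 0.0, 1.0, eps=0.0))
total = np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(a_grid))
print(total)   # close to 1.0
```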

SAC’s learning pattern is summarized as follows:

  1. It evaluates the soft Q-value function and the soft value function. Both functions are associated with the train policy but can still be learned from data collected by the behavior policy.

To estimate the soft value function, we can draw states from experience replay generated by the behavior policy, and use the action generated by the train policy.

Screen Shot 2019-01-31 at 3.34.46 PM

To estimate the Q-value function, we can just use states and actions both drawn from the experience replay:

Screen Shot 2019-01-31 at 3.38.09 PM

2. It updates the policy parameters, with the goal of minimizing the expected KL-divergence between the current policy and the exponential of the soft Q-value function.

3. Loop back to step 1

The pseudocode is as follows:

Screen Shot 2019-01-31 at 2.15.35 PM

Lastly, I end this post with the slides for the soft actor-critic paper [12].

References

[1] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., … & Kavukcuoglu, K. (2016, June). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928-1937). https://arxiv.org/pdf/1602.01783.pdf

[2] Policy gradient: https://czxttkl.com/?p=2812

[3] Differences between on-policy and off-policy: https://czxttkl.com/?p=1850

[4] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#soft-actor-critic

[5] Off-Policy Actor-Critic: https://arxiv.org/abs/1205.4839

[6] http://mlg.eng.cam.ac.uk/teaching/mlsalt7/1516/lect0304.pdf

[7] https://czxttkl.com/?p=2616

[8] http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf

[9] https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture3_mdp_planning.pdf

[10] Kreyszig, E. (1978). Introductory functional analysis with applications (Vol. 1). New York: wiley.

[11] Tutorial on Variational Autoencoders

[12] slides for soft actor-critic

[13] https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important

[14] Reparameterization Trick Notebook

[15] The Reparameterization “Trick” As Simple as Possible in TensorFlow

[16] arxiv insight: variational autoencoders

[17] Change of variable in probability density

[18] https://czxttkl.com/2020/02/18/practical-considerations-of-off-policy-policy-gradient/

[19] http://proceedings.mlr.press/v32/silver14.pdf

[20] https://czxttkl.com/2019/01/10/dpg-and-ddpg/

Notes on Glicko paper

This weekend I read the Glicko skill rating paper [1] again, but I found some parts of it not very clear. I’d like to make some notes, some based on my guesses. I hope to sort them out completely in the future.

First, Glicko models game outcomes with the Bradley-Terry model, meaning that the win probability of a player equals his skill divided by the sum of both players’ skills (assuming a one-vs-one game).

Here, a player’s skill is an exponential form of a strength parameter $latex \theta$, which follows a normal distribution:

Screenshot from 2018-10-21 10-10-40
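Concretely, the Bradley-Terry win probability fits in a tiny function. Here I take the skill to be 10^{\theta/400}, the classical rating scale; this particular base and divisor are my own illustrative choice, not necessarily the paper's exact parameterization:

```python
# Bradley-Terry win probability for a one-vs-one game: each player's skill
# is an exponential of the strength theta (here 10^(theta/400), an
# illustrative choice), and P(1 beats 2) = skill_1 / (skill_1 + skill_2).
def win_prob(theta_1: float, theta_2: float) -> float:
    g1 = 10 ** (theta_1 / 400)
    g2 = 10 ** (theta_2 / 400)
    return g1 / (g1 + g2)

print(win_prob(1500, 1500))   # 0.5 for equal strengths
print(win_prob(1700, 1500))   # ~0.76 for a 200-point edge
```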

Glicko updates a player’s skill over a rating period, during which the opponents’ skills are assumed constant. Thus, the posterior distribution of the player’s skill after observing the game outcomes in the rating period is:

Screenshot from 2018-10-21 10-18-01 

The confusing part to me is the notation $latex L(\theta, \theta_1, \cdots, \theta_m|\textbf{s})$. In my opinion, $latex L(\theta, \theta_1, \cdots, \theta_m|\textbf{s}) = f(\textbf{s}|\theta, \theta_1, \cdots, \theta_m)$. This interpretation reconciles with what is described in [2]:

Screenshot from 2018-10-21 10-26-27

 

References

[1] Parameter estimation in large dynamic paired comparison experiments

[2] A Bayesian Approximation Method for Online Ranking

Euler’s Formula and Fourier Transform

Euler’s formula states that e^{ix} =\cos{x}+ i \sin{x}. When x = \pi, the formula becomes e^{i\pi} = -1, known as Euler’s identity.

An easy derivation of Euler’s formula is given in [3] and [5]. According to the Maclaurin series (a special case of the Taylor expansion f(x)=f(a)+f'(a)(x-a)+\frac{f''(a)}{2!}(x-a)^2+\cdots with a=0),

e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\frac{x^4}{4!}+\cdots

Therefore, replacing x with ix, we have

e^{ix}=1+ix-\frac{x^2}{2!}-\frac{ix^3}{3!}+\frac{x^4}{4!}+\frac{ix^5}{5!}-\cdots

 

By Maclaurin series, we also have

\cos{x}=1-\frac{x^2}{2!}+\frac{x^4}{4!}-\frac{x^6}{6!} + \cdots \newline \sin{x}=x -\frac{x^3}{3!}+\frac{x^5}{5!}-\cdots

 

Therefore, we can rewrite e^{ix} as e^{ix}=\cos{x}+i\sin{x}.
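Both the formula and its series derivation are easy to verify numerically:

```python
import cmath
import math

x = 1.3                                   # any real angle
lhs = cmath.exp(1j * x)                   # e^{ix}
rhs = complex(math.cos(x), math.sin(x))   # cos x + i sin x
print(abs(lhs - rhs))                     # ~0: the two sides agree

# Truncated Maclaurin series 1 + ix + (ix)^2/2! + ... also converges to e^{ix}
series = sum((1j * x) ** n / math.factorial(n) for n in range(20))
print(abs(series - lhs))                  # ~0
```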

 

Intuitive understanding of e^{ix}=\cos{x}+i\sin{x} is illustrated in [1] together with its own video [4], as well as in 3Blue1Brown’s two videos [6][7].

First, from a conventional view of the coordinate system, e^{ix}=\cos{x}+i\sin{x} means a point with x coordinate \cos{x} and y coordinate \sin{x} on a unit circle (centered at the origin with radius 1) in the complex plane.

Another view is that e^{ix} describes the point that moves distance x from (1,0) along the circumference of the unit circle.

The last view, which is similar to the second view, is that e^{ix} specifies the point such that the angle between the x axis and the line connecting that point to the origin is x radians.

For example, e^{i \cdot 1} viewed in x&y coordinates:

grid

or viewed from the moving perspective:

move

or viewed from the radian perspective:

radiant

As pointed out by [1], we can think of normal exponentiation as stretching an axis such that the number 1 is stretched to a point denoting the exponential result. For example, 2^3 means 1 is stretched to 8, or 1 is stretched to 2 first, then that point is stretched to 4, then finally to 8. However, the complex exponential e^{ix} means rotating from the point (1,0) by angle x at a constant rotation speed. Therefore, Euler’s identity can be interpreted as: starting from the point (1,0), a point moves half of the circle (\pi radians) and ends up at (-1, 0), therefore e^{i\pi}=-1. A good diagram summarizing this intuition is below:

complex_growth

Why does understanding Euler’s formula as a point rotating on a unit circle help? One example in which Euler’s formula is useful is the Fourier transform. The goal of the Fourier transform is to turn signals measured in the time domain into an analyzable, frequency-based spectrum. The Fourier transform is intuitively illustrated in BetterExplained [8], 3Blue1Brown’s video [9], and Math StackExchange [16]. Below are notes taken based on [9].

A concrete example for applying the Fourier transform is to analyze a signal which mixes two different signals, each of a constant frequency. In the example, suppose we can only measure the blended signal (green), while we want to tell after the fact that it is actually composed of the pink signal (for example, 2Hz) and the yellow signal (for example, 3Hz).

Screenshot from 2018-10-08 11-40-25

The magic idea of the Fourier transform is that if you transform the signal measured in the Pressure vs. Time space into a rotating motion in a 2D plane, it will look like the diagram below:

simplescreenrecorder-2018-10-08_11.52.01 (2)

Depending on the frequency of the rotating motion, the diagram on the 2D plane may be different.

ezgif.com-resize

An important observation is that if you track the center of mass of the diagram for each rotating frequency, you will see that the center of mass is farthest from the origin exactly when the frequency of the rotating motion matches the frequency of either individual signal (2Hz or 3Hz).

ezgif.com-resize (1)

Therefore, to recover the original signals, we try every possible frequency of rotating motion and obtain the location of the center of mass of the diagram for each; the frequencies at which the center of mass is far from the origin are the frequencies of the original individual signals.

The center of mass can be obtained by sampling and averaging the data points on the diagram, with the diagram mapped onto the 2D complex plane:

Screenshot from 2018-10-09 19-04-08

When the number of samples goes to infinity, this becomes an integral.

Screenshot from 2018-10-09 19-09-06

The Fourier transform, following this idea, differs only in that the integral is from negative infinity to positive infinity, and there is no final average over time (the \frac{1}{t_2-t_1} factor is removed).

Screenshot from 2018-10-09 19-11-20
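The whole winding-and-center-of-mass idea fits in a few lines of numpy. Using the earlier example of a blended 2 Hz + 3 Hz signal, the center of mass \frac{1}{T}\int g(t) e^{-2\pi i f t} dt is far from the origin only at the two true frequencies:

```python
import numpy as np

# Wrap a mixed 2 Hz + 3 Hz signal around the origin at various test
# frequencies and track how far the center of mass is from the origin.
t = np.linspace(0, 4, 4000, endpoint=False)      # 4 seconds of signal
g = np.cos(2 * np.pi * 2 * t) + np.cos(2 * np.pi * 3 * t)

freqs = np.arange(0.5, 5.01, 0.25)
center_of_mass = np.array([np.mean(g * np.exp(-2j * np.pi * f * t))
                           for f in freqs])
strength = np.abs(center_of_mass)

# The center of mass is far from the origin exactly at 2 Hz and 3 Hz.
for f, s in zip(freqs, strength):
    print(f"{f:4.2f} Hz -> {s:.3f}")
```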

Here is an example [15] where the Fourier transform is applied to f(t)=\cos(2\pi st). The result is two Dirac delta functions in the frequency domain.

Screenshot from 2018-10-09 19-32-30

The transform process is based on trigonometry. In the second-to-last equation, when u=\pm s, \int_{-\infty}^{\infty} \cos{(2\pi st)} \cos{(2\pi ut)} dt =\int_{-\infty}^{\infty} \cos^2{(2\pi st)} dt = \int_{-\infty}^{\infty} \frac{1}{2} dt + \frac{1}{2} \int_{-\infty}^{\infty} \cos{(4\pi st)} dt = \frac{1}{2} \cdot \infty + \frac{1}{2} \int_{-\infty}^{\infty} \cos{(4\pi st)} dt

According to [10], \int_{-\infty}^{\infty} \cos{(4\pi st)} dt does not converge, yet I have seen many places [15, 18] treating \int_{-\infty}^{\infty}\cos{(4\pi st)} dt=0. I don’t fully understand this, but I can think of it in an intuitive way: although \int_{-\infty}^{\infty}\cos{(4\pi st)} dt does not converge, its value is confined to a bounded range (\leq 2). Compared to the infinity we obtain from \int_{-\infty}^{\infty} \frac{1}{2} dt, the value of \int_{-\infty}^{\infty}\cos{(4\pi st)} dt is negligibly close to zero.

Another idea is to solve the integral based on Euler’s formula [17][20]. From Euler’s formula, we get \cos{(x)}=\frac{1}{2}(e^{ix} + e^{-ix}). Also, the Dirac delta function can be written as (its proof, which I don’t fully understand, can be found in [11, 12, 13]):

\delta(x)= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-jxt}dt

Or equivalently, \delta(x) =\int_{-\infty}^{\infty} e^{-j2\pi xt}dt

Therefore, we can finally get (based on [17][20], with a little difference because they transform \cos(\omega_0 t) rather than \cos(2\pi st)):

Screenshot from 2018-10-09 21-04-53:

 

According to the linearity of the Fourier transform [19], if a blended signal is the sum of several signals (for example, \cos(2\pi s_1 t)+\cos(2\pi s_2 t)), then the resulting Fourier transform is the sum of several Dirac delta functions. That’s how the Fourier transform extracts individual signals from the mixed one! And the connection between Euler’s formula and the Fourier transform is that the integral required by the Fourier transform to obtain the center of mass of the rotation diagram can be calculated very easily once connected to Euler’s formula.

[14] is another illustrative video showing that any signal we can measure can actually be seen as a combination of an infinite number of sine waves. For repeating waves in the time domain, the Fourier transform may show up as a discrete frequency spectrum (like several Dirac deltas), while non-repeating waves in the time domain may result in a continuous frequency spectrum. This is interesting to know, but I will not explore it further in this article.

Screenshot from 2018-10-09 00-09-22

 

References

[1] https://betterexplained.com/articles/intuitive-understanding-of-eulers-formula/

[2] https://www.khanacademy.org/science/electrical-engineering/ee-circuit-analysis-topic/ee-ac-analysis/v/ee-complex-numbers

[3] https://www.khanacademy.org/math/ap-calculus-bc/bc-series-new/bc-10-14/v/euler-s-formula-and-euler-s-identity

[4] https://www.youtube.com/watch?v=qpOj98VNJi4

[5] https://en.wikipedia.org/wiki/Euler%27s_formula#Using_power_series

[6] https://www.youtube.com/watch?v=F_0yfvm0UoU

[7] https://www.youtube.com/watch?v=mvmuCPvRoWQ

[8] https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/

[9] https://www.youtube.com/watch?v=spUNpyF58BY

[10] https://www.wolframalpha.com/input/?i=integrate+cos(x)+from+-infinity+to+infinity

[11] http://fourier.eng.hmc.edu/e102/lectures/ExponentialDelta.pdf

[12] https://math.stackexchange.com/questions/1343859/why-does-integrating-a-complex-exponential-give-the-delta-function

[13] https://www.quora.com/Why-are-integrals-of-complex-exponentials-delta-functions

[14] https://www.youtube.com/watch?v=r18Gi8lSkfM

[15] https://www.astro.umd.edu/~lgm/ASTR410/ft_ref2.pdf

[16] https://math.stackexchange.com/questions/1002/fourier-transform-for-dummies

[17] https://www.youtube.com/watch?v=jPM76k-uNnA

[18] https://blog.mide.com/fourier-transform-basics

[19] https://en.wikipedia.org/wiki/Fourier_transform#Properties_of_the_Fourier_transform

[20] https://web.stanford.edu/class/ee102/lectures/fourtran