Resources about Attention is all you need

There are several online posts [1][2] that illustrate the idea of the Transformer, the model introduced in the paper “Attention Is All You Need” [4].

Based on [1] and [2], I am sharing a short tutorial for implementing a Transformer [3]. In this tutorial, the task is “copy-paste”, i.e., to let the Transformer learn to output the same sequence as the input sequence. The symbols used in the code are:

  • src_vocab_size=12: the vocabulary size of the input sequence.
  • tgt_vocab_size=12: the vocabulary size of the output sequence. Since our task is to copy the input sequence, tgt_vocab_size = src_vocab_size. We reserve symbol 0 for padding and symbol 1 for the decoder start symbol, and we assume the possible real symbols are 2 to 11. Therefore, the total vocabulary size is 12.
  • dim_model=512: the dimension of the embedding, of the self-attention output, and of the input/output of the position-wise feed-forward model.
  • dim_feedforward=256: the size of the hidden layer in the position-wise feed-forward layer.
  • d_k=64: the dimension of each attention head.
  • num_heads=8: the number of attention heads. For computational convenience, d_k * num_heads must equal dim_model.
  • seq_len=7: the length of input sequences.
  • batch_size=30: the number of input sequences per batch.
  • num_stacked_layers=6: the number of layers stacked in both the Encoder and the Decoder.

Let’s first look at what the input/output data looks like, which is generated by data_gen. Each Batch instance contains several essential pieces of information used in training:

  • src (batch_size x seq_len): the input sequence to the encoder
  • src_mask (batch_size x 1 x seq_len): the mask of the input sequence, which controls, for each input symbol, which other symbols it may attend to
  • trg_y (batch_size x seq_len): the output sequence as the label, which is the same as src given that our task is simply “copy-paste”
  • trg (batch_size x seq_len): the input sequence to the decoder. The first symbol of each sequence is symbol 1, which kicks off the decoder; the rest is trg_y shifted right, i.e., the last symbol of trg equals the second-to-last symbol of trg_y.
  • trg_mask (batch_size x seq_len x seq_len): the mask of the output sequence, so that at each decoding step a symbol being decoded can attend only to already decoded symbols. trg_mask[i, j, k], a binary value, indicates whether in the i-th output sequence the j-th symbol can attend to the k-th symbol. Since a symbol can attend to all its preceding symbols, trg_mask[i,j,k]=1 for all k <= j.
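As a concrete illustration, the two masks can be sketched in NumPy as below; the helper names are mine, not those of the tutorial code [3]:

```python
import numpy as np

def make_src_mask(src, pad=0):
    # (batch_size, 1, seq_len): 1 where the symbol is not padding
    return (src != pad).astype(np.uint8)[:, np.newaxis, :]

def make_trg_mask(trg, pad=0):
    # Padding mask: (batch_size, 1, seq_len)
    pad_mask = (trg != pad).astype(np.uint8)[:, np.newaxis, :]
    # Subsequent mask: lower-triangular (1, seq_len, seq_len), so that
    # position j can only attend to positions k <= j
    seq_len = trg.shape[1]
    subsequent = np.tril(np.ones((1, seq_len, seq_len), dtype=np.uint8))
    # Broadcast-AND the two masks -> (batch_size, seq_len, seq_len)
    return pad_mask & subsequent

src = np.array([[2, 5, 7]])   # one sequence of real symbols
trg = np.array([[1, 2, 5]])   # decoder input: starter symbol 1 + shifted sequence
print(make_trg_mask(trg)[0])  # lower-triangular: row j has ones up to column j
```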

Now, let’s look at the EncoderLayer class. There are num_stacked_layers=6 encoder layers in one Encoder. The bottom layer takes the symbol embeddings as input, while each upper layer takes the previous layer’s output. Both the symbol embeddings and each layer’s output have the same shape `batch_size x seq_len x dim_model`, so in theory you can stack as many layers as needed because their inputs/outputs are all compatible. The symbol embedding converts input symbols into embeddings and adds positional encoding, which helps the model learn positional information.

One EncoderLayer does the following things:

  • It has three matrices stored in MultiHeadedAttention.linears[:3], which we call the query, key, and value transformation matrices. They transform the input from the last layer into the three matrices that the self-attention module will use. Each of the output matrices, which we call query, key, and value, has the shape batch_size x num_heads x seq_len x d_k.
  • The transformed matrices are used as query, key, and value in self-attention (the attention function). The attention is calculated based on Eqn. 1 in [4]: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

The attention function also takes a mask as input. The mask tells it, for each symbol, which other symbols it can attend to (restricted due to padding or other reasons). In our case, the input sequences are all of equal length and we allow every symbol to attend to every other symbol, so the mask is always filled with ones.

  • The attention outputs of the num_heads heads are concatenated and passed through a linear projection layer, which results in an output of shape `batch_size x seq_len x dim_model`
  • The output finally goes through a residual connection and layer normalization.
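The per-head computation above can be sketched as a single-head attention function in NumPy (a simplified illustration of Eqn. 1 of [4], not the actual code in [3]):

```python
import numpy as np

def attention(query, key, value, mask=None):
    # query, key, value: (seq_len, d_k); computes softmax(Q K^T / sqrt(d_k)) V
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)           # (seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # masked positions get ~zero weight
    # Row-wise softmax, shifted by the row max for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ value

q = k = v = np.eye(3)              # toy 3x3 inputs
causal = np.tril(np.ones((3, 3)))  # lower-triangular mask
out = attention(q, k, v, mask=causal)
print(out.shape)  # (3, 3)
```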

So far, the encoder part can be illustrated in the following diagram [1] (suppose our input sequence has length 2, containing symbols 2 and 3):

One DecoderLayer does the following things:

  • It first does self-attention on the target sequence (batch.trg, not the trg_y sequence), which generates attention of shape (batch_size, seq_len, dim_model). The first symbol of a target sequence is 1, the symbol that tells the decoder to start decoding.
  • It then performs attention over the encoder, i.e., using the target sequence’s self-attention output as the query, and the final encoder layer’s output as the key and value.
  • The encoder-decoder attention’s output is passed through a feed-forward layer, with layer norm and residual connection. The output of one DecoderLayer has the shape `batch_size x seq_len x dim_model`. Again, this shape allows you to stack as many layers as you want.

The decoder part is shown in the following diagram:

The last part is label fitting. The last DecoderLayer has output of shape `batch_size x seq_len x dim_model`, i.e., each target symbol’s attention. We pass it through a linear projection and then a softmax, so that each target symbol is associated with a probability distribution over the next symbol to decode. We fit the probability distribution against batch.trg_y by minimizing the KL divergence. Here is one example of label fitting:

In practice, we also apply label smoothing, so that the target distribution to fit has slight probability mass even on the non-target symbols. For example, the target distribution of position #1 (refer to the diagram on the left) is no longer (0, 0, 1, 0, 0); rather, it should be (0.1, 0.1, 0.6, 0.1, 0.1) with label smoothing of 0.1. The intuition is that even our labels may not be 100% accurate, and we should not let the model become overly confident in the training data. In other words, label smoothing helps prevent overfitting and hopefully improves generalization to unseen data.
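A minimal sketch of building such a smoothed target distribution (plain NumPy; the function name is mine, and the smoothing scheme follows the 5-symbol example above, where each non-target symbol receives the smoothing mass directly):

```python
import numpy as np

def smooth_label(target_idx, vocab_size, eps=0.1):
    # Each non-target symbol gets probability eps;
    # the target symbol keeps the remaining mass.
    dist = np.full(vocab_size, eps)
    dist[target_idx] = 1.0 - eps * (vocab_size - 1)
    return dist

print(smooth_label(2, 5))  # [0.1 0.1 0.6 0.1 0.1]
```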

[1] http://jalammar.github.io/illustrated-transformer/

[2] http://nlp.seas.harvard.edu/2018/04/03/attention.html

[3] https://github.com/czxttkl/Tutorials/blob/master/experiments/transformer/try_transformer.py

[4] https://arxiv.org/abs/1706.03762

Implementation notes for world model

I’ve recently been implementing the world model [1], which seems to be a promising algorithm for effectively learning controls after first learning the environment. Here I share some implementation notes.

Loss of Gaussian Mixture Model

The memory model of the world model is a Mixture-Density-Network Recurrent Neural Network (MDN-RNN). It takes the current state and action as inputs, and outputs predictions of the next state, terminal, and reward. Since state transitions might be stochastic, the authors choose to model the next state as a Gaussian mixture: the next state can come from one of several multi-dimensional Gaussian components with certain probability assignments. The probability of the next state coming from component $latex i$ is $latex \pi_i$, and that component itself is a Gaussian distribution with mean $latex \mu_i \in \mathbb{R}^f$ and covariance $latex \Sigma_i \in \mathbb{R}^{f \times f}$, where $latex f$ is the feature size of a state.

To learn the parameters of the MDN-RNN that predict $latex \pi_i$, $latex \mu_i$ and $latex \Sigma_i$, we need a loss function that measures how closely the observed next state fits the predicted Gaussian mixture model. The loss function is the negative log likelihood. Since we usually fit data in batches, and one batch is a sequence of consecutive states from the same episode, the loss function is defined over batch_size and seq_len:

$latex -\sum\limits_{z=0}^{batch\_size} \sum\limits_{s=0}^{seq\_len} log (\sum\limits_i \pi^{z,s}_i \mathcal{N}(x^{z,s} | \mu^{z,s}_{i}, \Sigma^{z,s}_{i})) &s=2$

 

If we ignore the negative sign for now and only look at one state for a particular batch $latex z$ and sequence element $latex s$, we get:

$latex log(\sum\limits_i \pi_i \mathcal{N}(x | \mu_{i}, \Sigma_{i}))&s=2$

 

In practice, the MDN-RNN usually predicts $latex log(\pi_{i})$, and we can get $latex log(\mathcal{N}(x | \mu_{i}, \Sigma_{i}))$ from the PyTorch Normal.log_prob endpoint [3], so the formula above can be re-written as:

$latex log \sum\limits_i e^{log(\pi_i \mathcal{N}(x | \mu_{i}, \Sigma_{i}))} \newline =log\sum\limits_i e^{log(\pi_i) + log(\mathcal{N}(x | \mu_{i}, \Sigma_{i}))} &s=2$

 

According to log-sum-exp trick [2], to get numerical stability, 

$latex log(\sum\limits_{i=1}^n e^{x_i}) = log(\sum\limits_{i=1}^n e^{x_i - c}) + c &s=2$

where $latex c=\max\limits_i x_i$ is usually picked.

The takeaway: we walked through how the loss function for the GMM is defined, and adopted the log-sum-exp trick when calculating it.
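A minimal PyTorch sketch of this loss for a single batch of states, assuming the network outputs log mixture weights and per-component means and (diagonal) standard deviations; the tensor names and shapes are illustrative, not from the world-model code:

```python
import torch
from torch.distributions import Normal

def gmm_nll(x, log_pi, mu, sigma):
    # x: (batch, f); log_pi: (batch, K); mu, sigma: (batch, K, f)
    # log N(x | mu_i, Sigma_i) with diagonal covariance: sum log-probs over features
    log_prob = Normal(mu, sigma).log_prob(x.unsqueeze(1)).sum(dim=-1)  # (batch, K)
    # log sum_i exp(log_pi_i + log N_i), computed stably via logsumexp
    log_mix = torch.logsumexp(log_pi + log_prob, dim=-1)               # (batch,)
    return -log_mix.mean()

x = torch.zeros(4, 3)                        # 4 observed next states, f=3 features
log_pi = torch.log(torch.full((4, 2), 0.5))  # K=2 equally weighted components
mu = torch.zeros(4, 2, 3)
sigma = torch.ones(4, 2, 3)
loss = gmm_nll(x, log_pi, mu, sigma)
```

`torch.logsumexp` applies exactly the shift-by-max trick above, so no manual `c` is needed.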

 

 

References

[1] https://worldmodels.github.io/

[2] https://blog.feedly.com/tricks-of-the-trade-logsumexp/

[3] https://pytorch.org/docs/stable/distributions.html#torch.distributions.normal.Normal.log_prob

My understanding in 401K

Here is my reasoning about 401K.

First, I’ll start with two definitions: (1) taxable income, the gross income you receive on which your tax will be calculated; (2) tax deduction, any deduction from your taxable income. A tax deduction lowers your taxable income and thus generally lowers your tax.

401K has three categories:

Pre-tax: contribute your taxable income directly to the 401K account, without paying tax at the time of contribution. Hence, pre-tax 401K is a form of tax deduction. You can only withdraw after age 59.5, and every dollar you withdraw (i.e., the money you contributed plus its growth) will be taxed based on the tax bracket you are in at the time of withdrawal (i.e., when you are old and probably don’t have high income).

Roth: contribute your after-tax income to the 401K account [1]. You can only withdraw after age 59.5, but you don’t pay any tax at the time of withdrawal (including on the earnings).

After-tax: contribute your after-tax income to the 401K account, separately from Roth or pre-tax. However, you still need to pay tax on the earnings when withdrawing: “the earnings on after-tax contributions, just like all distributions from pre-tax contributions, are taxed as ordinary income” [4]. What’s more, there is a 10% early-withdrawal penalty on the earnings if you withdraw before age 59.5. The early-withdrawal penalty and the tax on the earnings are collected proportionally [5]: if your total after-tax contributions have accumulated x% earnings, then you have to pay tax and penalty on x% of whatever amount you withdraw before 59.5.

There is another age rule on 401K accounts: you can defer withdrawing from your 401K if you still work at an older age. But once you are over 70, you need to start withdrawing at least the Required Minimum Distribution (RMD) each year [7].

The annual combined limit of 401K pre-tax + Roth is $19,000 as of 2019. However, there is another limit [3] that restricts you from contributing more than $56,000 across all kinds of retirement plans per year. That means, if you fully contribute $19,000 in pre-tax + Roth 401K, then your after-tax 401K contribution plus employer match should not exceed $56,000 - $19,000 = $37,000. There is yet another limit, not for 401K but for an IRA (individual retirement account), which is $6,000.
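As a sanity check on the arithmetic (illustrative only, using the 2019 numbers quoted above):

```python
# 2019 limits from the paragraph above
pretax_plus_roth_limit = 19_000  # employee pre-tax + Roth 401K limit
overall_limit = 56_000           # all 401K sources, employee + employer

# If you max out pre-tax + Roth, the room left for
# after-tax contributions plus employer match is:
after_tax_plus_match_room = overall_limit - pretax_plus_roth_limit
print(after_tax_plus_match_room)  # 37000
```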

The benefit of after-tax 401K contributions isn’t apparent at first. One underlying purpose of saving in an after-tax account is to bypass the $19,000 limit for pre-tax + Roth. You can roll over your after-tax amount from after-tax 401K to Roth 401K and then to a Roth IRA, another type of retirement account very similar to Roth 401K in the sense that anything going into it will be tax-free. At the time of the rollover from after-tax 401K to Roth 401K, any earnings in the after-tax account are treated as ordinary income and thus become taxable; but after the rollover, the whole amount of money, as well as all future earnings, is tax-free.

In many situations, you need a triggering event (such as retirement or changing companies) to be able to roll over from after-tax 401K to Roth IRA. Only a few companies offer the opportunity to roll over from after-tax to Roth IRA directly without any triggering event. My company (FB) supports rolling over from after-tax 401K to Roth 401K immediately after any contribution to after-tax 401K, meaning that I don’t need to pay any tax on the rollover because no earnings have been generated before the rollover happens. And our company’s employees can roll over from Roth 401K to Roth IRA at any time.

The withdrawal rules of Roth IRA are summarized as follows. Assume you only contribute to Roth IRA by converting from after-tax 401K. You can withdraw the converted money (principal, not earnings) without penalty after it has stayed in the Roth IRA for more than 5 years [8]. You can withdraw the earnings without penalty if you are over age 59.5, die, become disabled, or withdraw up to $10,000 for a first-home purchase, provided the first Roth contribution was made more than 5 years ago; otherwise you pay a 10% penalty on the earnings [6]. As discussed in [8], the earnings the penalty applies to are the pre-rollover earnings (before you roll over from after-tax 401K to Roth 401K). So since my company allows rolling over from after-tax 401K to Roth 401K immediately, I don’t need to pay any penalty at all even if I withdraw from the Roth IRA within 5 years of my Roth IRA contribution!

So it seems that if you have any money invested in after-tax 401K, it is better to roll it over to a Roth IRA (if your company allows that during your employment), because (1) a Roth IRA doesn’t incur any tax, while after-tax 401K still incurs tax on earnings even after age 59.5, and (2) a Roth IRA is more flexible in withdrawing the converted money (you can withdraw it even before age 59.5).


Now let’s talk about the appropriate distribution of 401K money. Here is a chart showing the suggested 401K amount you should save at each age group [11]:

In essence, a Vanguard target date fund comprises several funds that cover domestic/international stock markets and domestic/international bonds. For example, below is the decomposition of Vanguard Target Date 2060 as of 01/31/2020 (https://investor.vanguard.com/mutual-funds/profile/VTTSX):

Historically, bonds have lower yields but also lower risk than stocks [10]. Bonds also distribute dividends that are treated as ordinary income. So the three-fund theory [12] tells us that if we have a lot of money to invest, we should put bonds in tax-advantaged accounts such as 401K, and move stock index funds into taxable accounts (because when holding them for a long time we only need to pay long-term capital gains tax, rather than ordinary income tax).

 

Updated 2019-05-13

Another example on how to convert after-tax to roth: Roth In-Plan Brochure

 

Updated 2020-05-01

This list shows historical returns for Vanguard funds, which are a main source of 401K investments.

https://retirementplans.vanguard.com/VGApp/pe/faces/PubFundChart?site=nyu/6370

 

Updated 2020-09-05

Below is the expected growth of retirement funds under a realistic estimate of a middle-class income.

Retirement Planning Calculator (US Personal Finance @ FB)

References

[1] https://www.irs.gov/retirement-plans/roth-comparison-chart

[2] https://www.bogleheads.org/wiki/After-tax_401(k)

[3] https://www.goodfinancialcents.com/401k-contribution-limits/

[4] https://www.doughroller.net/retirement-planning/pros-and-cons-of-after-tax-401k-contributions/

[5] https://about401k.com/withdrawal/after-tax/

[6] https://www.schwab.com/public/schwab/investing/retirement_and_planning/understanding_iras/roth_ira/withdrawal_rules

[7] https://www.sensiblemoney.com/learn/at-what-age-should-i-start-making-401k-withdrawals/

[8] https://thefinancebuff.com/rollover-after-tax-401k-403b-access-before-59-1-2.html

[9] My company’s After-tax 401K to Roth 401K FAQ: PlanConversionFAQ

[10] https://personal.vanguard.com/us/insights/saving-investing/model-portfolio-allocations

[11] https://www.cnbc.com/2019/05/23/how-much-money-americans-have-in-their-401ks-at-every-age.html

[12] https://www.bogleheads.org/wiki/Three-fund_portfolio

DPG and DDPG

In this post, I am sharing my understanding regarding Deterministic Policy Gradient Algorithm (DPG) [1] and its deep-learning version (DDPG) [2].

We have introduced policy gradient theorem in [3, 4]. Here, we briefly recap. The objective function of policy gradient methods is:

J(\theta)=\sum\limits_{s \in S} d^\pi(s) V^\pi(s)=\sum\limits_{s \in S} d^\pi(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a),

where \pi represents \pi_\theta, d^\pi(s) is the stationary distribution of Markov chain for \pi, V^\pi(s)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s], and Q^{\pi}(s,a)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s, A_t=a]. G_t is accumulated rewards since time step t: G_t=\sum^\infty_{k=0}\gamma^k R_{t+k}.

Policy gradient theorem proves that the gradient of policy parameters with regard to J(\theta) is:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi}[Q^\pi(s,a)\nabla_\theta ln \pi_\theta(a|s)]

More specifically, the policy gradient theorem we are talking about is the stochastic policy gradient theorem, because at each state the policy outputs a stochastic action distribution \pi(a|s).

However, the policy gradient theorem implies that policy gradient must be calculated on-policy, as \mathbb{E}_{\pi}:=\mathbb{E}_{s\sim d^{\pi}, a\sim \pi} means the state and action distribution is generated by following \pi_\theta.

If the training samples are generated by some other behavior policy \beta, then the objective function becomes:

J(\theta)=\sum\limits_{s \in S} d^\beta(s) V^\pi(s)=\sum\limits_{s \in S} d^\beta(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a),

and we must rely on importance sampling to calculate the policy gradient [7,8]:

\nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^\beta}[\sum\limits_{a\in A}\nabla_\theta \pi(a|s) Q^\pi(s,a) + \pi(a|s)\nabla_\theta Q^\pi(s,a)] \newline \approx \mathbb{E}_{s\sim d^\beta}[\sum\limits_{a\in A}\nabla_\theta \pi(a|s) Q^\pi(s,a)] \newline =\mathbb{E}_{\beta}[\frac{\pi_\theta(a|s)}{\beta(a|s)}Q^\pi(s,a)\nabla_\theta ln \pi_\theta(a|s)]

where \mathbb{E}_{\beta}:=\mathbb{E}_{s\sim d^\beta, a\sim\beta}

However, using the importance sampling ratio \frac{\pi_\theta(a|s)}{\beta(a|s)} could incur learning difficulties [5].


What DPG proposes is to design a policy that outputs actions deterministically: a = \mu_\theta(s). Then J(\theta) only has an integration (or sum) over the state distribution and gets rid of importance sampling, which makes learning potentially easier:

J(\theta)=\sum\limits_{s \in S} d^\beta(s) V^\mu(s)=\sum\limits_{s \in S} d^\beta(s) Q^\mu(s, \mu_\theta(s))

Writing the above formula in integration (rather than discrete sum), we get:

J(\theta)=\int\limits_S d^\beta(s) V^\mu(s) ds=\int\limits_{S} d^\beta(s) Q^\mu(s, \mu_\theta(s)) ds

And the gradient of J(\theta) is:

\nabla_\theta J(\theta) = \int\limits_S d^\beta(s) \nabla_\theta Q^\mu(s, \mu_\theta(s)) ds \approx \mathbb{E}_\beta [\nabla_\theta \mu_\theta(s) \nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}]

This immediately implies that \theta can be updated off-policy, because we only need to sample s \sim d^\beta(s); \nabla_\theta J(\theta) can then be calculated without knowing what action the behavior policy \beta took (we only need to know what the training policy would take, i.e., a=\mu_\theta(s)). We should also note that there is an approximation in the formula of \nabla_\theta J(\theta): the chain rule cannot easily be applied to \nabla_\theta Q^\mu(s,\mu_\theta(s)), because even if you replace \mu_\theta(s) with a, the Q function itself still depends on \theta.

Updating \theta is one part of DPG. The other part is fitting a Q-network with parameters \phi to the Q-function of the policy governed by \theta. This equates to updating \phi according to the mean-squared Bellman error (MSBE) [10], the same as in Q-learning.

I haven’t read the DDPG paper [2] thoroughly, but based on my rough understanding, it adds several tricks for dealing with large inputs (such as images). According to [9]’s summary, DDPG introduces three tricks:

  1. add batch normalization to normalize “every dimension across samples in one minibatch”
  2. add noise in the output of DPG such that the algorithm can explore
  3. soft update target network
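Putting the DPG actor update and trick 3 together, here is a minimal PyTorch sketch (my own simplification with small made-up MLPs, not the paper's reference implementation):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, 1))
target_actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, action_dim))
target_actor.load_state_dict(actor.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Deterministic policy gradient step: maximize Q(s, mu(s)) over states drawn
# from the behavior policy; autograd composes grad_a Q with grad_theta mu for us.
s = torch.randn(8, state_dim)  # batch of off-policy states
a = actor(s)                   # a = mu_theta(s)
actor_loss = -critic(torch.cat([s, a], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# Trick 3: soft (Polyak) update of the target network
tau = 0.005
with torch.no_grad():
    for p, p_targ in zip(actor.parameters(), target_actor.parameters()):
        p_targ.mul_(1 - tau).add_(tau * p)
```

The critic would be fitted separately by minimizing the MSBE against targets computed with the target networks, as in Q-learning.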

References

[1] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014, June). Deterministic policy gradient algorithms. In ICML.

[2] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., … & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

[3] Notes on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”: https://czxttkl.com/?p=3497

[4] Policy Gradient: https://czxttkl.com/?p=2812

[5] https://repository.tudelft.nl/islandora/object/uuid:682a56ed-8e21-4b70-af11-0e8e9e298fa2?collection=education

[6] https://stackoverflow.com/questions/42763293/what-is-the-advantage-of-deterministic-policy-gradient-over-stochastic-policy-gr

[7] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#off-policy-policy-gradient

[8] Degris, T., White, M., & Sutton, R. S. (2012). Off-policy actor-critic. arXiv preprint arXiv:1205.4839.

[9] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#ddpg

[10] https://spinningup.openai.com/en/latest/algorithms/ddpg.html#the-q-learning-side-of-ddpg

LSTM + DQN

Sequential decision problems can usually be formatted as Markov Decision Processes (MDPs), where you define states, actions, rewards and transitions. In some practical problems, states can be described just by action histories. For example, we’d like to decide notification delivery sequences for a group of similar users to maximize their accumulated clicks. We define two actions: whether or not to send a notification to a user. We can define each user’s state as his or her history of notification receiving. For example, user A might receive a notification on day 1 and day 2 but stop receiving any on day 3, while user B might receive notifications in a row on all three days. It’s a natural sequential decision problem in which future notification-sending decisions depend on different user states. User A might receive a notification on day 4 because he or she hasn’t got one on day 3, while user B will need a notification break on day 4. Sure, one could define the state with a user’s personal information. But as I just mentioned, these are similar users with similar backgrounds, preferences, and cultures, and we want to keep things as simple as possible. You get my idea.

You can manually define state feature vectors by encoding action histories. Back to the notification example: you can define features like how many notifications one has received in the last 3 days, last week, or last month. However, this manual feature engineering may be tedious. Alternatively, you can generate some sort of user embedding from action histories, and then policy learning (RL algorithms) would just work on an MDP with user embeddings as states.

That’s why I would like to test LSTM + DQN (Long Short-Term Memory + Deep Q-Network). The idea is that, at a state, the LSTM takes as input the user’s action history so far and outputs an embedding. That embedding is used as the state fed into DQN, which tries to approximate Q(s,a). The weights of the LSTM and the DQN are jointly learned when fitting Q(s,a).
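A minimal PyTorch sketch of this joint architecture (an illustration of the idea, not the exact code in the repository):

```python
import torch
import torch.nn as nn

class LSTMDQN(nn.Module):
    def __init__(self, num_actions, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(num_actions, embed_dim)  # one embedding per action symbol
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, num_actions)   # Q(s, a) for each action

    def forward(self, action_history):
        # action_history: (batch, history_len) of action indices
        x = self.embed(action_history)
        _, (h, _) = self.lstm(x)          # h: (1, batch, hidden_dim), the state embedding
        return self.q_head(h.squeeze(0))  # (batch, num_actions)

model = LSTMDQN(num_actions=4)
q_values = model(torch.tensor([[0, 2, 1, 3]]))  # one history of 4 actions
```

Since the Q-head sits on top of the LSTM's final hidden state, the TD loss back-propagates through both modules, which is exactly the joint learning described above.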

I design two gridworlds to test this idea. Both share the same structure: an agent starts at (0,0), and some cells are walls (unwalkable). The gold state is at the bottom-left corner. Whenever the agent enters the gold state, the episode terminates and the agent gets reward 10. Since the reward discount factor is set to less than 1, the agent needs not only to reach the gold state, but to reach it as soon as possible. The difference between the two gridworlds is that the second one adds an extra reward on top of the gold reward: at the gold state, the agent receives an additional reward depending on its action history (the path). Some paths have higher rewards than others.

So the first gridworld (called “finite”) looks like below, with the red line indicating the optimal route.

Untitled Diagram (1)

The second gridworld (called “rnn”) looks like below, with the red line indicating the optimal route. It is not simply the shortest path, because going a little zigzag obtains higher rewards.

Untitled Diagram (3)

 

I test LSTM + DQN and pure DQN. In pure DQN, the state is just a one-hot encoding of the cell the agent is at. Therefore you can imagine that pure DQN should fall short in the “rnn” gridworld, because the state has no information about past paths. In LSTM + DQN, the state is the LSTM’s hidden-layer output after the action history passes through the LSTM. My test script can be found at Tutorials/action_state_generation/tests/test_online_train.py in this repository. (The file structure may change in the future.)

On finite gridworld

Pure DQN:

reward_plot_dqn_finite_tt5_eps2001

LSTM+DQN

reward_plot_lstm_finite_tt5_eps2001

The x-axis is learning epochs, while the y-axis is the final reward. Both algorithms reach the optimal reward 10. I also checked to confirm that they reach reward 10 using just 4 steps. However, you can observe that LSTM+DQN takes longer to converge (in other words, LSTM+DQN is less data efficient). This is expected, because the LSTM needs more iterations to understand the sequences and their consequences. Recall that pure DQN encodes grid cells as states directly, while the state in LSTM+DQN is just the action history. So the LSTM needs many iterations to learn how action histories map to exact cells in the grid.

On rnn gridworld

Pure DQN

reward_plot_dqn_rnn_tt5_eps5001

LSTM+DQN

reward_plot_lstm_rnn_tt5_eps5001

As expected, pure DQN does not reach high rewards in the end because its states contain no historical information. LSTM+DQN, however, continuously improves its rewards, simply because the state is the LSTM’s hidden state, which encodes historical information from the action history.

DQN + Double Q-Learning + OpenAI Gym

Here I am providing a script to quickly experiment with the OpenAI Gym environment: https://github.com/czxttkl/Tutorials/tree/master/experiments/lunarlander. The script has the features of both Deep Q-Learning and Double Q-Learning.

I ran my script to benchmark one OpenAI Gym environment, LunarLander-v2. The most stable version of the algorithm has the following hyperparameters: no double Q-learning (just one Q-network), gamma=0.99, batch size=64, learning rate=0.001 (for the Adam optimizer). It solves the environment in around 400 episodes:

Environment solved in 406 episodes!	Average Score: 200.06
Environment solved in 406 episodes!	Average Score: 200.06
Environment solved in 545 episodes!	Average Score: 200.62
Environment solved in 413 episodes!	Average Score: 200.35
Environment solved in 406 episodes!	Average Score: 200.06

 

Several insights:

1. gamma, the discount factor, can influence results. For example, in LunarLander, if I set gamma to 1 instead of 0.99, the agent can never successfully learn, even though LunarLander-v2 limits episodes to 1000 steps.

 

2. Double Q-learning does not bring a clear benefit, at least in LunarLander-v2. Output with double Q-learning turned on, which is similar to the most stable version of the algorithm without it:

lunarlander + mse + double q + gamma 0.99
Environment solved in 435 episodes!	Average Score: 205.85
Environment solved in 435 episodes!	Average Score: 205.85
Environment solved in 491 episodes!	Average Score: 200.97
Environment solved in 406 episodes!	Average Score: 200.06
Environment solved in 413 episodes!	Average Score: 200.35
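For reference, the target computation that the double Q-learning flag switches between can be sketched as follows (assumed tensor shapes, not the script's exact code): the online network selects the argmax action while the target network evaluates it, which reduces the overestimation bias of the plain max.

```python
import torch

def dqn_target(q_next_target, rewards, dones, gamma=0.99):
    # Vanilla DQN: max over the target network's own Q-values
    return rewards + gamma * q_next_target.max(dim=1).values * (1 - dones)

def double_dqn_target(q_next_online, q_next_target, rewards, dones, gamma=0.99):
    # Double DQN: online net picks the action, target net evaluates it
    best_actions = q_next_online.argmax(dim=1, keepdim=True)
    q_eval = q_next_target.gather(1, best_actions).squeeze(1)
    return rewards + gamma * q_eval * (1 - dones)

q_online = torch.tensor([[1.0, 2.0]])  # online Q(s', .)
q_target = torch.tensor([[0.5, 0.1]])  # target Q(s', .)
r, d = torch.tensor([1.0]), torch.tensor([0.0])
# double DQN evaluates the online argmax (action 1) with the target net:
# 1 + 0.99 * 0.1, versus vanilla DQN's 1 + 0.99 * 0.5
```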

 

3. Huber loss seems to even hurt. I changed the loss function used in fitting Q-learning to Huber loss, but the results are pretty bad: the agent never succeeds, or only succeeds after 1000 episodes.

 

References

[1] https://github.com/RMiftakhov/LunarLander-v2-drlnd/blob/master/dqn_agent.py

Notes on “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”

I am reading this paper (https://arxiv.org/abs/1801.01290) and wanted to take down some notes about it.

Introduction

Soft Actor-Critic is a special version of actor-critic algorithms, and actor-critic algorithms are one kind of policy gradient method. Policy gradient methods are different from value-based methods (like Q-learning), where you learn Q-values and then infer the best action at each state s using \arg\max_a Q(s,a). Policy gradient methods do not depend on value functions to infer the best policy; instead, they directly learn a probability function \pi(a|s). However, policy gradient methods may still learn value functions for the purpose of better guiding the learning of the policy probability function \pi(a|s).

We have introduced several kinds of policy gradient methods in [2, 18]. Here, we briefly recap. The objective function of policy gradient methods is:

J(\theta)=\mathbb{E}_{s,a\sim\pi}Q^{\pi}(s,a),
where \theta is the parameter of the policy \pi.

In problems with a discrete action space, we can rewrite the expectation in summation:

J(\theta)=\mathbb{E}_{s,a\sim\pi}Q^{\pi}(s,a)=\sum\limits_{s \in S} d^\pi(s) V^\pi(s)=\sum\limits_{s \in S} d^\pi(s) \sum\limits_{a \in A} \pi(a|s) Q^\pi(s,a),

where d^\pi(s) is the stationary distribution of Markov chain for \pi, V^\pi(s)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s], and Q^{\pi}(s,a)=\mathbb{E}_{a \sim \pi}[G_t|S_t=s, A_t=a]. G_t is accumulated rewards since time step t: G_t=\sum^\infty_{k=0}\gamma^k R_{t+k}.

Policy gradient methods strive to learn the values of \theta, which is achieved through gradient ascent w.r.t. J(\theta). The gradient \nabla_\theta J(\theta) is obtained by the policy gradient theorem (proved in [4]):

\nabla_\theta J(\theta) = \mathbb{E}_{s,a\sim\pi}[Q^\pi(s,a)\nabla_\theta log \pi_\theta(a|s)]

Different policy gradient methods use different ways to estimate Q^\pi(s,a): REINFORCE uses a Monte Carlo method in which the empirical accumulated reward G_t unbiasedly approximates Q^\pi(s,a); state-value actor-critic learns a value function V_w(s) and uses R+\gamma V_w(s') to approximate Q^\pi(s,a); action-value actor-critic learns a Q-value function Q_w(s,a) and uses R+\gamma Q_w(s',a') to approximate Q^\pi(s,a). Note that since \nabla_\theta J(\theta) is in an expectation form \mathbb{E}_{s,a \sim \pi} [\cdot], we need to collect on-policy samples. That’s why the Soft Actor-Critic paper mentions in its introduction:

some of the most commonly used deep RL algorithms, such as TRPO, PPO, A3C, require new samples to be collected for each gradient step.

If we have only off-policy samples instead, which are collected by \pi_b, J(\theta) becomes “the value function of the target policy, averaged over the state distribution of the behavior policy” (from DPG paper [19]):

J(\theta)=\mathbb{E}_{s\sim \pi_b} \left[ \mathbb{E}_{a\sim\pi}Q^{\pi}(s,a) \right]

The critical problem presented by this formula is that when you take the derivative of J(\theta) w.r.t. \theta, you need to be able to write the gradient without \mathbb{E}_{a\sim\pi}[\cdot], because any such gradient can’t be computed using mini-batches collected from \pi_b. We have several choices for learning \theta given this off-policy objective J(\theta):

  1. Using importance sampling, rewrite \mathbb{E}_{a\sim\pi}\left[ Q^{\pi}(s,a)\right] as \mathbb{E}_{a\sim\pi_b}\left[\frac{\pi(a|s)}{\pi_b(a|s)}Q^{\pi_b}(s,a) \right]. This results in the gradient update \mathbb{E}_{s,a \sim \pi_b} [\frac{\pi(a|s)}{\pi_b(a|s)} Q^{\pi_b}(s,a) \nabla_\theta log \pi(a|s)]. See [18]. However, importance sampling can cause high variance empirically.
  2. Making the policy deterministic would remove the expectation \mathbb{E}_{a\sim\pi}[\cdot]. J(\theta) then becomes J(\theta)=\mathbb{E}_{s\sim\pi_b}\left[Q^{\pi}(s,\pi(s))\right]. The gradient update can be computed as \mathbb{E}_{s\sim \pi_b}\left[ \nabla_\theta \pi(s) \nabla_a Q^\pi(s,a)|_{a=\pi(s)}\right]. This is how DPG works. See [20].
  3. Using the reparameterization trick (covered more below), which writes \pi(a|s) = \pi^{deterministic}(s)+\epsilon, a deterministic function plus independent noise, we would have J(\theta)=\mathbb{E}_{s\sim\pi_b, \epsilon \sim p_\epsilon}\left[Q^{\pi}(s,\pi(s))\right]. The gradient update would be: \mathbb{E}_{s\sim\pi_b, \epsilon \sim p_\epsilon}\left[\nabla_a Q^{\pi}(s,a)|_{a=\pi(s)} \nabla_\theta \pi(s) \right] \newline=\mathbb{E}_{s\sim\pi_b, \epsilon \sim p_\epsilon}\left[\nabla_a Q^{\pi}(s,a)|_{a=\pi(s) } \nabla_\theta \pi^{deterministic}(s) \right]
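To make the third choice concrete, here is a minimal numpy sketch (a toy illustration, not SAC itself) that estimates \nabla_\mu \mathbb{E}[Q(a)] for a one-dimensional Gaussian policy a = \mu + \sigma\epsilon, with a made-up critic Q(a) = -(a-2)^2:

```python
import numpy as np

# Toy reparameterization-trick gradient estimator. The critic Q is a
# made-up one-dimensional function: Q(a) = -(a - 2)^2.
rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0              # Gaussian policy parameters
eps = rng.standard_normal(200_000)
a = mu + sigma * eps              # deterministic function of mu plus noise

# grad_mu E[Q(a)] = E[ dQ/da * da/dmu ] = E[ -2 (a - 2) * 1 ]
grad_estimate = np.mean(-2.0 * (a - 2.0))

# Analytic check: E[Q] = -((mu - 2)^2 + sigma^2), so grad_mu = -2 (mu - 2) = 4
print(grad_estimate)
```

Because a is a deterministic function of \mu plus independent noise, the gradient passes directly through the sampled action; no score-function term is needed.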

The third method is used in Soft Actor-Critic, although SAC adds some more ingredients.  Let’s now dive into it.

Soft Actor-Critic

Soft Actor-Critic (SAC) is designed as an off-policy policy-gradient method. What makes it special is that it strives to maximize long-term reward as well as action randomness, because more random actions mean better exploration and robustness. In the maximum entropy framework, the objective function is therefore formulated as:

    \[ J(\pi)=\sum\limits^T_{t=0}\mathbb{E}_{(s_t, a_t)\sim\rho_\pi} \left[r(s_t, a_t) + \alpha \mathcal{H} \left( \pi\left(\cdot | s_t \right) \right) \right], \]

where \mathcal{H}(\pi( \cdot | s_t ) ) is the entropy of the action distribution \pi(\cdot|s_t), defined as \mathbb{E}_{a \sim \pi(\cdot|s_t)}[-\log \pi(a|s_t)], and \alpha is the coefficient controlling the influence of the entropy term.

We can always treat \alpha=1 by subsuming it into the reward function. Therefore, with \alpha omitted, we have the following objective function, soft Q-function update operator, and soft value function:

(1)   \[ J(\pi)=\sum\limits^T_{t=0}\mathbb{E}_{(s_t, a_t)\sim\rho_\pi} \left[r(s_t, a_t) + \mathcal{H} \left( \pi\left(\cdot | s_t \right) \right) \right]  \]

(2)   \[ \mathcal{T}^\pi Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1}\sim p}\left[V(s_{t+1})\right]  \]

(3)   \[ V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t, a_t) - \log \pi(a_t|s_t) \right] \]
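Eqn. (3) is just an expectation over the policy's actions, so it can be estimated by Monte Carlo. Below is a toy numpy sketch with a made-up Gaussian policy \pi(\cdot|s)=\mathcal{N}(0,1) and a made-up critic Q(s,a)=-a^2 (both are illustrative assumptions, not the paper's networks):

```python
import numpy as np

# Monte Carlo estimate of the soft value function in Eqn. (3):
#   V(s) = E_{a ~ pi}[ Q(s, a) - log pi(a | s) ]
rng = np.random.default_rng(0)

sigma = 1.0
a = rng.normal(0.0, sigma, size=500_000)    # a ~ pi(.|s) = N(0, 1)
log_pi = -0.5 * np.log(2 * np.pi * sigma**2) - a**2 / (2 * sigma**2)
q = -a**2                                   # made-up Q(s, a)

v_estimate = np.mean(q - log_pi)

# Analytic check: E[Q] = -sigma^2 = -1 and the Gaussian entropy is
# 0.5 * log(2 * pi * e * sigma^2) ~= 1.4189, so V ~= 0.4189.
print(v_estimate)
```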

Derivation

The paper uses Lemma 1, Lemma 2 and Theorem 1 to prove that the learning pattern results in the optimal policy maximizing the objective in Eqn. (1). The tl;dr is that Lemma 1 proves that the soft Q-values of any policy converge rather than blow up; Lemma 2 proves that, based on a policy’s soft Q-values, you can always find a policy that is no worse.

Lemma 1 proof

Screen Shot 2018-11-09 at 12.05.04 PM

“The standard convergence results for policy evaluation” means:

(1) there exists a fixed point Q^\pi because of the way we define Eqn. 15: for each state s and each action a, we get one instance of Eqn. 15. We have |S| \cdot |A| equations in total, and Q^\pi has |S| \cdot |A| entries. Therefore, there exists a unique fixed point Q^\pi=r+\gamma P^\pi Q^\pi (the linear system (I-\gamma P^\pi)Q^\pi = r has a unique solution when \gamma < 1), where P^\pi is the operator for calculating the expectation \mathbb{E}[\cdot].

(2) prove that T^\pi is a \gamma-contraction under the infinity norm, i.e., \Vert T^\pi Q^{k+1} - T^\pi Q^k \Vert_\infty \leq \gamma \Vert Q^{k+1} - Q^{k}\Vert_\infty. In plain English, T^\pi being a \gamma-contraction means that each application of T^\pi brings the function Q closer to the fixed point Q^\pi at rate \gamma.

The proof is found from [6]:

\Vert T^\pi Q^{k+1} - T^\pi Q^k \Vert_\infty \newline =\Vert r +\gamma P^\pi Q^{k+1} - r - \gamma P^\pi Q^{k} \Vert_\infty \newline =\gamma \Vert P^\pi(Q^{k+1}-Q^k) \Vert_\infty \newline \leq \gamma \Vert Q^{k+1}-Q^k \Vert_\infty

The last two steps are derived using \Vert Tx\Vert_\infty \leq \Vert T \Vert_\infty \Vert x \Vert_\infty from [7] and the fact that \Vert P^\pi \Vert_\infty = 1 (because P^\pi is the operator for \mathbb{E}[\cdot], so each row sums to one, though I have not verified this fact rigorously).

(3) therefore, by contraction theory (Theorem 5.1-2 in [10]), applying T^\pi will eventually lead to Q^\pi. [9]

Convergence proofs for policy evaluation/value evaluation usually follow a similar pattern, such as the one for Q-learning [8].
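The contraction can be checked numerically on a made-up toy problem. The sketch below applies the plain (non-soft) evaluation operator T^\pi Q = r + \gamma P^\pi Q; the soft operator behaves the same way once the entropy term is folded into the reward:

```python
import numpy as np

# Numeric illustration of the gamma-contraction on a random toy problem.
rng = np.random.default_rng(0)

n = 4                              # number of (s, a) pairs in the toy problem
gamma = 0.9
r = rng.normal(size=n)             # arbitrary rewards
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)  # each row of P^pi is a distribution

Q = np.zeros(n)
diffs = []                         # sup-norm distance between iterates
for _ in range(200):
    Q_next = r + gamma * P @ Q     # apply T^pi
    diffs.append(np.max(np.abs(Q_next - Q)))
    Q = Q_next

# The fixed point solves Q* = r + gamma P Q*, i.e. (I - gamma P) Q* = r
Q_star = np.linalg.solve(np.eye(n) - gamma * P, r)
print(np.max(np.abs(Q - Q_star)))  # tiny: the iteration converged to Q*
```

Each step shrinks the distance between successive iterates by at least \gamma, and the iterates converge to the unique fixed point.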

Lemma 2 proof

Screen Shot 2018-11-09 at 3.39.57 PM

Proving Lemma 2 is basically just expanding the KL-divergence formula D_{KL}(P \Vert Q)=\int^{\infty}_{-\infty}p(x)\log\frac{p(x)}{q(x)}dx. From Eqn. 16, we expand the KL-divergence formula as:

J_{\pi_{old}}(\pi'(\cdot|s_t)) \newline \triangleq D_{KL} (\pi'(\cdot | s_t) \; \Vert \; \exp(Q^{\pi_{old}}(s_t, \cdot) - \log Z^{\pi_{old}}(s_t)) ) \newline = - \int \pi'(a_t | s_t) \log \frac{\exp(Q^{\pi_{old}}(s_t, a_t) - \log Z^{\pi_{old}}(s_t))}{\pi'(a_t|s_t)} d a_t \newline = \int \pi'(a_t | s_t) \big(\log \pi'(a_t|s_t) + \log Z^{\pi_{old}}(s_t) - Q^{\pi_{old}}(s_t, a_t) \big) d a_t \newline = \mathbb{E}_{a_t \sim \pi'}\left[\log \pi'(a_t|s_t) + \log Z^{\pi_{old}}(s_t) - Q^{\pi_{old}}(s_t, a_t)\right]

Note that the authors define a new objective function J_{\pi_{old}}(\pi'(\cdot|s_t))\triangleq D_{KL} (\pi'(\cdot | s_t) \; \Vert \; \exp(Q^{\pi_{old}}(s_t, \cdot) - \log Z^{\pi_{old}}(s_t)) ), which is different from Eqn. 1. The purpose of defining J_{\pi_{old}}(\pi'(\cdot|s_t)) is to help us get to Eqn. 18, which is then used to prove Q^{\pi_{old}}(s_t, a_t) \leq Q^{\pi_{new}}(s_t, a_t) in Eqn. 19.
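The expansion can be sanity-checked exactly in a discrete toy setting, where the integral becomes a sum and Z^{\pi_{old}}(s_t) = \sum_a \exp(Q^{\pi_{old}}(s_t,a)). All numbers below are arbitrary:

```python
import numpy as np

# Exact discrete check of the expansion: with a finite action set,
#   D_KL(pi' || exp(Q - log Z)) = E_{a ~ pi'}[ log pi'(a) + log Z - Q(a) ].
rng = np.random.default_rng(0)

Q = rng.normal(size=5)                  # Q^{pi_old}(s, .) for 5 actions
log_Z = np.log(np.sum(np.exp(Q)))
target = np.exp(Q - log_Z)              # the Boltzmann target distribution

pi_new = rng.random(5)
pi_new /= pi_new.sum()                  # some candidate policy pi'

kl = np.sum(pi_new * (np.log(pi_new) - np.log(target)))
expanded = np.sum(pi_new * (np.log(pi_new) + log_Z - Q))

print(kl, expanded)                     # the two agree
```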

Implementation

Since we are looking at continuous actions, the authors use a network that outputs the mean and standard deviation of a (multi-dimensional) Gaussian. In other words, this network has one output vector for the action means \mu and another output vector for the action standard deviations \sigma. The action to be taken is sampled from \mathcal{N}(\mu, \sigma^2).

export

Since D_{KL}(p\Vert q) = \mathbb{E}_{x \sim p(x)}[\log p(x) - \log q(x)], J_\pi(\phi) can be rewritten as:

J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}, a_t \sim \pi_\phi}[\log \pi_\phi(a_t | s_t) - Q_\theta(s_t, a_t)]

However, actions are sampled from the policy network’s output, i.e., they form a stochastic node in the computation graph. Gradients cannot be backpropagated through it unless we use the “reparameterization trick” to make the action node deterministic. In other words, actions are output by a deterministic function f_\phi(\epsilon_t; s_t) with a stochastic input \epsilon_t. This trick is also widely used in variational autoencoders and has been discussed in [11, 13, 14, 15, 16]. The reparameterization trick is illustrated in the diagram below [11]:

Screen Shot 2018-11-11 at 7.17.57 PM

After applying the reparameterization trick, we have the following stochastic gradient for a training tuple (s_t, a_t):

\hat{\nabla}_\phi J_\pi(\phi) \newline = \nabla_\phi [\log \pi_\phi(a_t | s_t) - Q_\theta(s_t, a_t)] |_{a_t = f_\phi(\epsilon_t ; s_t)} \newline = (\nabla_{a_t} \log \pi_\phi(a_t | s_t) - \nabla_{a_t} Q_\theta (s_t, a_t)) \nabla_\phi f_\phi(\epsilon_t; s_t)

I think what I derived above is slightly different from Eqn. (13) in the paper, which has an extra \nabla_\phi \log \pi_\phi(a_t|s_t) term. Right now I am not sure why, but I will keep figuring it out.

In applications where output actions need to be squashed to fit into some range, \pi_\phi(\cdot|s_t) also needs to account for the squashing. Basically, if squashed_action = squash_function(raw_action), then the probability density of the squashed action differs from the probability density of the raw action [17]. The original code reflects this in https://github.com/haarnoja/sac/blob/fba571f81f961028f598b0385d80cb286d918776/sac/policies/gaussian_policy.py#L70-L75.
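As a sketch of that correction (my own one-dimensional toy version, not the repo's code): if a = \tanh(u) with u \sim \mathcal{N}(\mu, \sigma^2), the change-of-variable rule gives \log\pi(a) = \log\mathcal{N}(u;\mu,\sigma) - \log(1-\tanh^2(u)):

```python
import numpy as np

def squashed_log_prob(u, mu, sigma, eps=1e-6):
    """Log-density of a = tanh(u) when u ~ N(mu, sigma^2).

    eps keeps the log finite when tanh(u) saturates, mirroring the
    numerical-stability trick used in the linked SAC code.
    """
    log_gauss = (-0.5 * np.log(2 * np.pi * sigma**2)
                 - (u - mu) ** 2 / (2 * sigma**2))
    correction = np.log(1.0 - np.tanh(u) ** 2 + eps)
    return log_gauss - correction

# Sanity check: the corrected density should integrate to ~1 over a.
a_grid = np.linspace(-0.999, 0.999, 20001)
u_grid = np.arctanh(a_grid)
density = np.exp(squashed_log_prob(u_grid, 0.0, 1.0, eps=0.0))
total = np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(a_grid))
print(total)   # close to 1.0
```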

SAC’s learning pattern is summarized as follows:

  1. It evaluates the soft Q-value function and the soft value function. Both functions are associated with the train policy but can still be learned from data collected by the behavior policy.

To estimate the soft value function, we can draw states from experience replay generated by the behavior policy, and use the action generated by the train policy.

Screen Shot 2019-01-31 at 3.34.46 PM

To estimate the Q-value function, we can just use states and actions both drawn from the experience replay:

Screen Shot 2019-01-31 at 3.38.09 PM

2. It updates the policy parameters, with the goal of minimizing the expected KL-divergence between the current policy and the exponential of the soft Q-value function.

3. Loop back to step 1

The pseudocode is as follows:

Screen Shot 2019-01-31 at 2.15.35 PM

Lastly, I end this post with the slides for the soft actor-critic paper [12].

References

[1] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., … & Kavukcuoglu, K. (2016, June). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928-1937). https://arxiv.org/pdf/1602.01783.pdf

[2] Policy gradient: https://czxttkl.com/?p=2812

[3] Differences between on-policy and off-policy: https://czxttkl.com/?p=1850

[4] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#soft-actor-critic

[5] Off-Policy Actor-Critic: https://arxiv.org/abs/1205.4839

[6] http://mlg.eng.cam.ac.uk/teaching/mlsalt7/1516/lect0304.pdf

[7] https://czxttkl.com/?p=2616

[8] http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf

[9] https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture3_mdp_planning.pdf

[10] Kreyszig, E. (1978). Introductory functional analysis with applications (Vol. 1). New York: wiley.

[11] Tutorial on Variational Autoencoders

[12] slides for soft actor-critic

[13] https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important

[14] Reparameterization Trick Notebook

[15] The Reparameterization “Trick” As Simple as Possible in TensorFlow

[16] arxiv insight: variational autoencoders

[17] Change of variable in probability density

[18] https://czxttkl.com/2020/02/18/practical-considerations-of-off-policy-policy-gradient/

[19] http://proceedings.mlr.press/v32/silver14.pdf

[20] https://czxttkl.com/2019/01/10/dpg-and-ddpg/

Notes on Glicko paper

This weekend I read the Glicko skill rating paper [1] again, but I found some parts of it not very clear. I’d like to make some notes, some based on my guesses. I hope to sort them out completely in the future.

First, Glicko models game outcomes with the Bradley-Terry model, meaning that the win probability of a player equals his skill divided by the sum of both players’ skills (assuming a one-vs-one game).

Here, a player’s skill is an exponential form of a strength parameter $latex \theta$, which follows a normal distribution:

Screenshot from 2018-10-21 10-10-40
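Concretely, the Bradley-Terry win probability fits in a tiny function. Here I take the skill to be 10^{\theta/400}, the classical rating scale; this particular base and divisor are my own illustrative choice, not necessarily the paper's exact parameterization:

```python
# Bradley-Terry win probability for a one-vs-one game: each player's skill
# is an exponential of the strength theta (here 10^(theta/400), an
# illustrative choice), and P(1 beats 2) = skill_1 / (skill_1 + skill_2).
def win_prob(theta_1: float, theta_2: float) -> float:
    g1 = 10 ** (theta_1 / 400)
    g2 = 10 ** (theta_2 / 400)
    return g1 / (g1 + g2)

print(win_prob(1500, 1500))   # 0.5 for equal strengths
print(win_prob(1700, 1500))   # ~0.76 for a 200-point edge
```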

Glicko updates a player’s skill over a rating period, during which the opponents’ skills are assumed constant. Thus, the posterior distribution of the player’s skill after observing the game outcomes in the rating period is:

Screenshot from 2018-10-21 10-18-01 

The confusing part to me is the notation $latex L(\theta, \theta_1, \cdots, \theta_m|\textbf{s})$. In my opinion, $latex L(\theta, \theta_1, \cdots, \theta_m|\textbf{s}) = f(\textbf{s}|\theta, \theta_1, \cdots, \theta_m)$. This interpretation reconciles with what is described in [2]:

Screenshot from 2018-10-21 10-26-27

 

References

[1] Parameter estimation in large dynamic paired comparison experiments

[2] A Bayesian Approximation Method for Online Ranking

Euler’s Formula and Fourier Transform

Euler’s formula states that e^{ix} =\cos{x}+ i \sin{x}. When x = \pi, the formula becomes e^{i\pi} = -1, known as Euler’s identity.

An easy derivation of Euler’s formula is given in [3] and [5]. According to the Maclaurin series (a special case of the Taylor expansion f(x)=f(a)+f'(a)(x-a)+\frac{f''(a)}{2!}(x-a)^2+\cdots with a=0),

e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\frac{x^4}{4!}+\cdots

Therefore, replacing x with ix, we have

e^{ix}=1+ix-\frac{x^2}{2!}-\frac{ix^3}{3!}+\frac{x^4}{4!}+\frac{ix^5}{5!}-\cdots

 

By Maclaurin series, we also have

\cos{x}=1-\frac{x^2}{2!}+\frac{x^4}{4!}-\frac{x^6}{6!} + \cdots \newline \sin{x}=x -\frac{x^3}{3!}+\frac{x^5}{5!}-\cdots

 

Therefore, we can rewrite e^{ix} as e^{ix}=\cos{x}+i\sin{x}.
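Both the formula and its series derivation are easy to verify numerically:

```python
import cmath
import math

x = 1.3                                   # any real angle
lhs = cmath.exp(1j * x)                   # e^{ix}
rhs = complex(math.cos(x), math.sin(x))   # cos x + i sin x
print(abs(lhs - rhs))                     # ~0: the two sides agree

# Truncated Maclaurin series 1 + ix + (ix)^2/2! + ... also converges to e^{ix}
series = sum((1j * x) ** n / math.factorial(n) for n in range(20))
print(abs(series - lhs))                  # ~0
```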

 

Intuitive understanding of e^{ix}=\cos{x}+i\sin{x} is illustrated in [1] together with its own video [4], as well as in 3Blue1Brown’s two videos [6][7].

First, from a conventional view of the coordinate system, e^{ix}=\cos{x}+i\sin{x} means a point with x coordinate \cos{x} and y coordinate \sin{x} on a unit circle (centered at the origin with radius 1) in the complex plane.

Another view is that e^{ix} describes the point that moves distance x from (1,0) along the circumference of the unit circle.

The last view, which is similar to the second view, is that e^{ix} specifies the point such that the angle between the x axis and the line connecting that point to the origin is x radians.

For example, e^{i \cdot 1} viewed in x&y coordinates:

grid

or viewed from the moving perspective:

move

or viewed from the radian perspective:

radiant

As pointed out by [1], we can think of normal exponentiation as stretching an axis such that the number 1 is stretched to a point denoting the exponential result. For example, 2^3 means 1 is stretched to 8, or 1 is stretched to 2 first, then that point is stretched to 4, then finally to 8. However, the complex exponential e^{ix} means rotating from the point (1,0) by angle x at a constant rotation speed. Therefore, Euler’s identity can be interpreted as: starting from the point (1,0), a point moves half of the circle (\pi radians) and ends up at (-1, 0), therefore e^{i\pi}=-1. A good diagram summarizing this intuition is below:

complex_growth

Why does understanding Euler’s formula as a point rotating on a unit circle help? One example in which Euler’s formula is useful is the Fourier transform. The goal of the Fourier transform is to turn signals measured in the time domain into an analyzable, frequency-based spectrum. The Fourier transform is intuitively illustrated in BetterExplained [8], 3Blue1Brown’s video [9], and Math StackExchange [16]. Below are notes taken based on [9].

A concrete example for applying the Fourier transform is to analyze a signal which mixes two different signals, each of a constant frequency. In the example, suppose we can only measure the blended signal (green), while we want to tell after the fact that it is actually composed of the pink signal (for example, 2Hz) and the yellow signal (for example, 3Hz).

Screenshot from 2018-10-08 11-40-25

The magic idea of the Fourier transform is that if you transform the signal measured in the Pressure vs. Time space into a rotating motion in a 2D plane, it will look like the diagram below:

simplescreenrecorder-2018-10-08_11.52.01 (2)

Depending on the frequency of the rotating motion, the diagram on the 2D plane may be different.

ezgif.com-resize

An important observation is that if you track the center of mass of the diagram for each rotating frequency, you will see that the center of mass is farthest from the origin exactly when the frequency of the rotating motion matches the frequency of either individual signal (2Hz or 3Hz).

ezgif.com-resize (1)

Therefore, to recover the original signals, we try every possible frequency of rotating motion and obtain the location of the center of mass of the diagram for each; the frequencies at which the center of mass is far from the origin are the frequencies of the original individual signals.

The center of mass can be obtained by sampling and averaging the data points on the diagram, with the diagram mapped onto the 2D complex plane:

Screenshot from 2018-10-09 19-04-08

When the number of samples goes to infinity, this becomes an integral.

Screenshot from 2018-10-09 19-09-06

The Fourier transform, following this idea, differs only in that the integral is from negative infinity to positive infinity, and there is no final average over time (the \frac{1}{t_2-t_1} factor is removed).

Screenshot from 2018-10-09 19-11-20
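The whole winding-and-center-of-mass idea fits in a few lines of numpy. Using the earlier example of a blended 2 Hz + 3 Hz signal, the center of mass \frac{1}{T}\int g(t) e^{-2\pi i f t} dt is far from the origin only at the two true frequencies:

```python
import numpy as np

# Wrap a mixed 2 Hz + 3 Hz signal around the origin at various test
# frequencies and track how far the center of mass is from the origin.
t = np.linspace(0, 4, 4000, endpoint=False)      # 4 seconds of signal
g = np.cos(2 * np.pi * 2 * t) + np.cos(2 * np.pi * 3 * t)

freqs = np.arange(0.5, 5.01, 0.25)
center_of_mass = np.array([np.mean(g * np.exp(-2j * np.pi * f * t))
                           for f in freqs])
strength = np.abs(center_of_mass)

# The center of mass is far from the origin exactly at 2 Hz and 3 Hz.
for f, s in zip(freqs, strength):
    print(f"{f:4.2f} Hz -> {s:.3f}")
```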

Here is an example [15] where the Fourier transform is applied to f(t)=\cos(2\pi st). The result is two Dirac delta functions in the frequency domain.

Screenshot from 2018-10-09 19-32-30

The transform process is based on trigonometry. In the second-to-last equation, when u=\pm s, \int_{-\infty}^{\infty} \cos{(2\pi st)} \cos{(2\pi ut)} dt =\int_{-\infty}^{\infty} \cos^2{(2\pi st)} dt = \int_{-\infty}^{\infty} \frac{1}{2} dt + \frac{1}{2} \int_{-\infty}^{\infty} \cos{(4\pi st)} dt = \frac{1}{2} \cdot \infty + \frac{1}{2} \int_{-\infty}^{\infty} \cos{(4\pi st)} dt

According to [10], \int_{-\infty}^{\infty} \cos{(4\pi st)} dt does not converge, yet I have seen many places [15, 18] treating \int_{-\infty}^{\infty}\cos{(4\pi st)} dt=0. I don’t fully understand this, but I can think of it in an intuitive way: although \int_{-\infty}^{\infty}\cos{(4\pi st)} dt does not converge, its value is confined to a bounded range (\leq 2). Compared to the infinity we obtain from \int_{-\infty}^{\infty} \frac{1}{2} dt, the value of \int_{-\infty}^{\infty}\cos{(4\pi st)} dt is negligibly close to zero.

Another idea is to solve the integral based on Euler’s formula [17][20]. From Euler’s formula, we get \cos{(x)}=\frac{1}{2}(e^{ix} + e^{-ix}). Also, the Dirac delta function can be written as (its proof, which I don’t fully understand, can be found in [11, 12, 13]):

\delta(x)= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-jxt}dt

Or equivalently, \delta(x) =\int_{-\infty}^{\infty} e^{-j2\pi xt}dt

Therefore, we can finally get (based on [17][20], with a little difference because they transform \cos(\omega_0 t) rather than \cos(2\pi st)):

Screenshot from 2018-10-09 21-04-53:

 

According to the linearity of the Fourier transform [19], if a blended signal is the sum of several signals (for example, \cos(2\pi s_1 t)+\cos(2\pi s_2 t)), then the resulting Fourier transform is the sum of several Dirac delta functions. That’s how the Fourier transform extracts individual signals from the mixed one! And the connection between Euler’s formula and the Fourier transform is that the integral required by the Fourier transform to obtain the center of mass of the rotation diagram can be calculated very easily once connected to Euler’s formula.

[14] is another illustrative video showing that any signal we can measure can actually be seen as a combination of an infinite number of sine waves. For repeating waves in the time domain, the Fourier transform may show up as a discrete frequency spectrum (like several Dirac deltas), while non-repeating waves in the time domain may result in a continuous frequency spectrum. This is interesting to know, but I will not explore it further in this article.

Screenshot from 2018-10-09 00-09-22

 

References

[1] https://betterexplained.com/articles/intuitive-understanding-of-eulers-formula/

[2] https://www.khanacademy.org/science/electrical-engineering/ee-circuit-analysis-topic/ee-ac-analysis/v/ee-complex-numbers

[3] https://www.khanacademy.org/math/ap-calculus-bc/bc-series-new/bc-10-14/v/euler-s-formula-and-euler-s-identity

[4] https://www.youtube.com/watch?v=qpOj98VNJi4

[5] https://en.wikipedia.org/wiki/Euler%27s_formula#Using_power_series

[6] https://www.youtube.com/watch?v=F_0yfvm0UoU

[7] https://www.youtube.com/watch?v=mvmuCPvRoWQ

[8] https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/

[9] https://www.youtube.com/watch?v=spUNpyF58BY

[10] https://www.wolframalpha.com/input/?i=integrate+cos(x)+from+-infinity+to+infinity

[11] http://fourier.eng.hmc.edu/e102/lectures/ExponentialDelta.pdf

[12] https://math.stackexchange.com/questions/1343859/why-does-integrating-a-complex-exponential-give-the-delta-function

[13] https://www.quora.com/Why-are-integrals-of-complex-exponentials-delta-functions

[14] https://www.youtube.com/watch?v=r18Gi8lSkfM

[15] https://www.astro.umd.edu/~lgm/ASTR410/ft_ref2.pdf

[16] https://math.stackexchange.com/questions/1002/fourier-transform-for-dummies

[17] https://www.youtube.com/watch?v=jPM76k-uNnA

[18] https://blog.mide.com/fourier-transform-basics

[19] https://en.wikipedia.org/wiki/Fourier_transform#Properties_of_the_Fourier_transform

[20] https://web.stanford.edu/class/ee102/lectures/fourtran