Time Series Walk Through

In this post, I am going to give a practical walkthrough of time series analysis. All related code is available in a Python notebook.

The data we use is "International airline passengers: monthly totals in thousands", which can be downloaded here as a CSV file (on the left panel, select Export -> csv(,)). It is a univariate time series.

Screenshot from 2017-07-26 20:49:08

From intuition, we can see that there are patterns here: (1) there exists an upward trend; (2) there exists a seasonal fluctuation that repeats at a regular interval. Because of these patterns, this time series is not stationary.

[3] gives an excellent explanation for the meaning of a stationary time series:

Screenshot from 2017-07-26 20:46:35

To put it mathematically, a stationary time series satisfies:

  1. $latex E(Y_1) = E(Y_2) = E(Y_3) = \cdots = E(Y_t) = \text{constant}$
  2. $latex Var(Y_1) = Var(Y_2) = Var(Y_3) = \cdots = Var(Y_t)= \text{constant}$
  3. $latex Cov(Y_1, Y_{1+k}) = Cov(Y_2, Y_{2+k}) = Cov(Y_3, Y_{3+k}) = \cdots = \gamma_k$ (depending only on lag $latex k$)

We can also compare a stationary time series with white noise:

Let $latex \{\epsilon_t\}$ denote such a series; then it has zero mean $latex E(\epsilon_t)=0$, constant variance $latex Var(\epsilon_t)=\sigma^2$ for all $latex t$, and $latex E(\epsilon_t \epsilon_s)=0$ for all $latex t \neq s$.

The typical steps to perform time series analysis are as follows:

  1. check whether it is a white noise time series. If it is, just stop, because there is no way to predict noise. (In practice, however, it is hard to determine whether a time series is white noise just by visual inspection. So my suggestion is to go ahead assuming it is a valid time series, and use cross validation to check whether any model can successfully predict it.)
  2. if this is a valid but non-stationary time series, make it stationary
  3. build predictive models on the stationary time series

Here is my understanding of why we need to make a time series stationary before training a model. Suppose your time series varies much more within a specific time range. If we build a linear regression model based on the mean squared error loss, the errors from that time range will dominate the loss and bias the weight updates toward that range. However, we want a predictive model that predicts the time series equally well at any time point.

We can easily check whether a time series is stationary using the helper function `test_stationarity` from [3] and [4], which is based on the augmented Dickey-Fuller test. To read its output: if the ‘Test Statistic’ is smaller (more negative) than the ‘Critical Value’, or the ‘p-value’ is well below a convincing threshold (say 0.05), then the time series is stationary.
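For reference, here is a minimal sketch of such a helper built on the augmented Dickey-Fuller test in statsmodels (the rolling-statistics plots from [3] and [4] are omitted):

import pandas as pd
from statsmodels.tsa.stattools import adfuller

def test_stationarity(ts):
    # ts: a pandas Series indexed by time
    result = adfuller(ts.dropna(), autolag='AIC')
    output = pd.Series(result[0:4], index=['Test Statistic', 'p-value',
                                           '#Lags Used', 'Number of Observations Used'])
    for key, value in result[4].items():
        output['Critical Value (%s)' % key] = value
    print(output)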

Based on the raw time series, we run `test_stationarity` and find that it is not stationary:

Screenshot from 2017-07-26 21:43:24

Techniques you can use to make a time series stationary include: (1) a logarithmic transformation; (2) first differencing; (3) seasonal differencing. See [4] and [5] for more techniques to make a time series stationary.
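As a minimal sketch (assuming the data is loaded into a pandas Series `ts` with a monthly DatetimeIndex), the three transformations look like this:

import numpy as np

ts_log = np.log(ts)                         # (1) logarithmic transform: damps the growing seasonal amplitude
ts_log_diff = ts_log.diff().dropna()        # (2) first difference: removes the trend
ts_log_sdiff = ts_log.diff(12).dropna()     # (3) seasonal difference: removes the lag-12 yearly pattern

test_stationarity(ts_log_diff)              # re-check stationarity after each transformation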

ARIMA

The autoregressive integrated moving average (ARIMA) model assumes that each value $latex Y_t$ is a linear function of the previous values $latex Y_{t-1}, Y_{t-2},\cdots$ and error terms $latex \epsilon_{t}, \epsilon_{t-1}, \epsilon_{t-2}, \cdots$. An ARIMA(p, d, q) model first applies $latex d$-th order differencing to the raw series to obtain $latex \{Y_{t}\}$ and then assumes:

$latex Y_t =\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p} + \epsilon_t+ \Phi_{1} \epsilon_{t-1} + \Phi_{2} \epsilon_{t-2} + \cdots + \Phi_{q} \epsilon_{t-q} &s=2$

Here, the error terms $latex \{\epsilon_t\}$ are not the residuals of the autoregressive (AR) part. Instead, $latex \{\epsilon_t\}$ are estimated purely from the data [11].

 

To determine the lag parameter $latex q$ of the MA part, we can use `autocorrelation_plot` from pandas, which plots the same autocorrelation function (ACF) that statsmodels provides:

Autocorrelation plots are often used for checking randomness in a time series. This is done by computing autocorrelations for data values at varying time lags. If the time series is random, such autocorrelations should be near zero for any and all time-lag separations. If the time series is non-random, then one or more of the autocorrelations will be significantly non-zero. Therefore, autocorrelation can be used as a tool to quickly check whether your time series is suitable for forecasting with machine learning models; after all, if your time series is just white noise, it is meaningless to do time series forecasting. The horizontal lines displayed in the plot correspond to the 95% and 99% confidence bands; the dashed line is the 99% confidence band.

Note that the ACF in statsmodels provides an `unbiased` parameter; if it is True, the autocovariances are computed with denominators adjusted for the series length ($latex n-k$ instead of $latex n$). [14]

To determine the lag parameter $latex p$ of the AR part, we can use the partial autocorrelation function (PACF) in statsmodels.

The rules for deriving the right $latex p$ and $latex q$ can be found in [12]:

Screenshot from 2017-07-27 12:53:54

More simply, [4] points out that $latex p$ should be the lag value where the PACF chart crosses the upper confidence interval for the first time, and $latex q$ should be the lag value where the ACF chart crosses the upper confidence interval for the first time. (Also see [15].)
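As a minimal sketch (assuming `ts_log_diff` is the stationarized series from above), both plots can be drawn with statsmodels:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_log_diff, lags=40, ax=axes[0])    # read q off the ACF: the first lag that falls inside the confidence band
plot_pacf(ts_log_diff, lags=40, ax=axes[1])   # read p off the PACF in the same way
plt.show()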

In my opinion, deriving $latex p$ and $latex q$ by inspecting ACF and PACF plots is an ad hoc, vague method. I’d prefer to do a grid search over the parameters as in [13] and then pick the best model according to AIC or a similar metric.

From [13]: the AIC (Akaike Information Criterion) is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness of fit.
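A minimal sketch of such a grid search, modeled loosely after [13] (the parameter ranges are arbitrary, and `ts_log` is the log-transformed series from above):

import itertools
import statsmodels.api as sm

p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(P, D, Q, 12) for (P, D, Q) in pdq]   # 12 because the data is monthly

best_aic, best_params = float('inf'), None
for order in pdq:
    for seasonal_order in seasonal_pdq:
        try:
            results = sm.tsa.statespace.SARIMAX(ts_log, order=order,
                                                seasonal_order=seasonal_order,
                                                enforce_stationarity=False,
                                                enforce_invertibility=False).fit(disp=False)
        except Exception:
            continue                                  # some combinations fail to converge
        if results.aic < best_aic:
            best_aic, best_params = results.aic, (order, seasonal_order)

print(best_params, best_aic)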

 

Seasonal decomposition allows you to separate the trend, seasonality, and residual of a time series. According to [10], a time series consists of four components:

  1. level, the average value in the series
  2. trend, the increasing or decreasing value in the series
  3. seasonality, the repeating short-term cycle in the series
  4. noise, the random variation in the series

Below is the seasonal decomposition of our data:

Screenshot from 2017-07-27 14:29:19
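A decomposition like the one above can be produced with statsmodels (a minimal sketch; `ts_log` is the log-transformed series as before):

from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(ts_log)   # the period is inferred from the monthly DatetimeIndex
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
decomposition.plot()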

The residual is thus the combination of level and noise. One thing I am not sure about is how to use the seasonal decomposition. One possible way is to fit an ARIMA model on the residual, because that is the de-trended and de-seasonalized part (see the question asked in a comment of [8]). However, since the residual is already the level (seen as a constant?) plus noise, is there any point in fitting ARIMA? Also, even if we can fit an ARIMA model on the residual, when we use it to predict the future, how can we transform the predicted values back to the raw scale? [15] actually supports this direction and also mentions the same concern: “Now that you have stationarized your time series, you could go on and model residuals (fit lines between values in the plot). However, as the patterns for the trend and seasonality information extracted from the series that are plotted after decomposition are still not consistent and cannot be scaled back to the original values, you cannot use this approach to create reliable forecasts.” Another direction is to use exogenous variables to fit the residuals, treating them as something not predictable from the time series itself [3]. However, we then face the same issue of how to transform the predicted residuals back to the raw scale.

There are also other resources introducing seasonal decomposition, such as [9].

Based on what I’ve read so far, I’d use seasonal decomposition as an exploration tool, but not rely on it to train a forecasting model. After all, modern Python ARIMA APIs fit trend and seasonality at the same time, so you don’t need to separate them out manually [7].
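For example, a seasonal ARIMA model (SARIMAX, [7]) can be fitted on the log-transformed series directly. A minimal sketch (the orders here are placeholders; in practice use the best combination found by the grid search above):

import numpy as np
import statsmodels.api as sm

model = sm.tsa.statespace.SARIMAX(ts_log, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit(disp=False)
print(results.summary())

forecast_log = results.get_forecast(steps=24).predicted_mean   # forecast the next 24 months
forecast = np.exp(forecast_log)                                # undo the log to get back to the raw scale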

 

Reference:

[1] http://machinelearningmastery.com/time-series-prediction-with-deep-learning-in-python-with-keras/

[2] http://pandasplotting.blogspot.com/2012/06/autocorrelation-plot.html

[3] http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/

[4] https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/

[5] http://people.duke.edu/~rnau/whatuse.htm

[6] http://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_model.ARIMA.html

[7] http://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html

[8] https://stats.stackexchange.com/questions/223117/forecasting-residuals-from-seasonal-decomposition-appears-to-be-highly-auto-cor

[9] http://www.cbcity.de/timeseries-decomposition-in-python-with-statsmodels-and-pandas

[10] http://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/

[11] https://stats.stackexchange.com/questions/26024/moving-average-model-error-terms/74826#74826

[12] https://www.youtube.com/watch?v=Aw77aMLj9uM

[13] https://www.digitalocean.com/community/tutorials/a-guide-to-time-series-forecasting-with-arima-in-python-3

[14] https://stats.stackexchange.com/questions/164526/is-an-auto-correlation-plot-suitable-for-determining-at-what-point-time-series-d

[15] https://datascience.ibm.com/exchange/public/entry/view/815137c868b916821dec777bdc23013c

 

Tricks in training deep neural networks

In this post, I am going to talk about my understanding of tricks used in training deep neural networks.

ResNet [1]

Why does ResNet work? https://www.quora.com/How-does-deep-residual-learning-work

Here is my answer:

It is hard to know the desired depth of a deep network. If the network is too deep, errors are hard to propagate back correctly; if it is too shallow, it may not have enough representation power.

However, with a deep residual network, it is safe to train very deep layers to get enough learning power without worrying too much about the degradation problem, because in the worst case, blocks in those “unnecessary” layers can learn to be an identity mapping and do no harm to performance. This is achieved by the solver driving the weights of the ReLU layers close to zero, so that only the shortcut connection is active and acts as an identity mapping. Though not proved theoretically, driving weights close to zero might be an easier task for the solver than learning an effective representation all at once.

The authors observe empirically (Fig. 7) that ResNet has a smaller magnitude of layer responses on average than plain networks, suggesting that many blocks learn only little, incremental information.

To conclude, the core idea of ResNet is to provide shortcut connections between layers, which makes it safe to train a very deep network to gain maximal representation power without worrying about the degradation problem, i.e., the learning difficulties introduced by deep layers.

My answer is based on empirical observation and intuition. I’d like to know more about the theory behind ResNet.

 

Now, according to [2], all of the following techniques are regularization methods.

Data Augmentation

Augment the training set via domain-specific transformations. For image data, commonly used transformations include random cropping and random perturbation of brightness, saturation, hue, and contrast.
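As a toy illustration, a framework-agnostic numpy sketch of random cropping plus brightness jitter might look like this (sizes and ranges are arbitrary):

import numpy as np

def augment(image, crop_size=28, max_brightness_delta=0.2, rng=np.random):
    # image: H x W x C float array with values in [0, 1]
    h, w, _ = image.shape
    top = rng.randint(0, h - crop_size + 1)            # random crop: pick a random top-left corner
    left = rng.randint(0, w - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size, :]
    patch = patch + rng.uniform(-max_brightness_delta, max_brightness_delta)   # random brightness shift
    return np.clip(patch, 0.0, 1.0)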

Early Stopping

Early stopping was shown to implicitly regularize some convex learning problems (Yao et al., 2007; Lin et al., 2016).
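In practice it is usually implemented as a patience loop on a validation metric; a minimal sketch (`train_one_epoch` and `evaluate` are placeholders for your own training and validation code):

def fit_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=5):
    best_val_loss, epochs_without_improvement = float('inf'), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = evaluate()
        if val_loss < best_val_loss:
            best_val_loss, epochs_without_improvement = val_loss, 0
            # in practice, also checkpoint the model parameters here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break        # validation loss has not improved for `patience` epochs
    return best_val_loss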

Dropout 

Mask out each element of a layer output randomly with a given dropout probability. Note that [1] says “We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]”, but “We do not use dropout [14], following the practice in [16]” (the bracketed numbers there are the ResNet paper’s own citations, not this post’s references).
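A numpy sketch of (inverted) dropout on a layer output, just to make the masking operation concrete:

import numpy as np

def dropout(layer_output, drop_prob=0.5, training=True, rng=np.random):
    if not training or drop_prob == 0.0:
        return layer_output
    keep_prob = 1.0 - drop_prob
    # zero out each element independently with probability drop_prob,
    # and scale the survivors by 1/keep_prob so the expected activation is unchanged
    mask = (rng.uniform(size=layer_output.shape) < keep_prob).astype(layer_output.dtype)
    return layer_output * mask / keep_prob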

 

Weight Decay 

Why is weight decay equivalent to an $latex \ell_2$ norm regularizer on the weights? And why is it also equivalent to a hard constraint that keeps the weights inside a Euclidean ball, with the radius determined by the amount of weight decay?
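A quick sanity check of the first equivalence (for vanilla SGD; it does not hold exactly for adaptive optimizers like Adam): minimizing $latex L(w) + \frac{\lambda}{2}\|w\|_2^2$ by gradient descent gives the update

$latex w \leftarrow w - \eta \nabla L(w) - \eta \lambda w = (1-\eta\lambda)w - \eta \nabla L(w) &s=2$

i.e., every step first shrinks (“decays”) the weights by the factor $latex (1-\eta\lambda)$ and then applies the usual gradient step. The hard-constraint view comes from Lagrangian duality: penalizing $latex \|w\|_2^2$ with multiplier $latex \lambda$ corresponds to constraining $latex \|w\|_2^2$ to lie within a ball whose radius shrinks as $latex \lambda$ grows.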

 

Batch Norm

https://theneuralperspective.com/2016/10/27/gradient-topics/

http://leix.me/2017/03/02/normalization-in-deep-learning/
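For concreteness, a numpy sketch of the batch norm forward pass at training time (per-feature statistics over the mini-batch; gamma and beta are the learned scale and shift):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features); gamma, beta: (num_features,)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize each feature to zero mean, unit variance
    return gamma * x_hat + beta               # then let the network re-scale and re-shift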

 

ReLU, Leaky ReLU & MaxOut

Comparison between different activation functions

https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions
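For quick reference, numpy versions of the three activations (maxout is shown with two linear pieces, its simplest form):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def maxout(x, w1, b1, w2, b2):
    # element-wise max over several affine transformations of the input
    return np.maximum(x @ w1 + b1, x @ w2 + b2)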

 

Reference

[1] Deep Residual Learning for Image Recognition

[2] Understand Deep Learning Requires Rethinking Generalization

[3] Note for [2]: https://danieltakeshi.github.io/2017/05/19/understanding-deep-learning-requires-rethinking-generalization-my-thoughts-and-notes

[4] https://datascience.stackexchange.com/questions/14349/difference-of-activation-functions-in-neural-networks-in-general

A3C code walkthrough

In this post, I am doing a brief walkthrough of the code from https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2

The code implements the A3C algorithm (Asynchronous Methods for Deep Reinforcement Learning). It follows the pseudocode given in the supplementary part of the paper:

Screenshot from 2017-06-13 14-09-56

The structure of this model is:

Untitled Diagram (2)

For LSTM structure detail, refer to http://colah.github.io/posts/2015-08-Understanding-LSTMs/. I am using the same notation in the model structure diagram as in the link.

Below is the code: 

"""
Asynchronous Advantage Actor-Critic algorithm

See:
https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2
https://arxiv.org/pdf/1602.01783.pdf
"""

import os            # needed below for os.path.exists / os.makedirs
import copy
import threading
import multiprocessing
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.contrib.slim as slim
import scipy.misc    # needed in process_frame for imresize
import scipy.signal
from helper import *
from vizdoom import *

from random import choice
from time import sleep
from time import time


# Copies one set of variables to another.
# Used to set worker network parameters to those of global network.
def update_target_graph(from_scope,to_scope):
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, from_scope)
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, to_scope)

    op_holder = []
    for from_var,to_var in zip(from_vars,to_vars):
        op_holder.append(to_var.assign(from_var))
    return op_holder

# Processes Doom screen image to produce cropped and resized image.
def process_frame(frame):
    s = frame[10:-10,30:-30]
    s = scipy.misc.imresize(s,[84,84])
    s = np.reshape(s,[np.prod(s.shape)]) / 255.0
    return s

# Discounting function used to calculate discounted returns.
def discount(x, gamma):
    return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]

#Used to initialize weights for policy and value output layers
def normalized_columns_initializer(std=1.0):
    def _initializer(shape, dtype=None, partition_info=None):
        out = np.random.randn(*shape).astype(np.float32)
        out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
        return tf.constant(out)
    return _initializer


class AC_Network():
    def __init__(self, s_size, a_size, scope, trainer):
        with tf.variable_scope(scope):
            # Input and visual encoding layers
            self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
            self.imageIn = tf.reshape(self.inputs, shape=[-1, 84, 84, 1])
            self.conv1 = slim.conv2d(activation_fn=tf.nn.elu,
                                     inputs=self.imageIn, num_outputs=16,
                                     kernel_size=[8, 8], stride=[4, 4], padding='VALID')
            self.conv2 = slim.conv2d(activation_fn=tf.nn.elu,
                                     inputs=self.conv1, num_outputs=32,
                                     kernel_size=[4, 4], stride=[2, 2], padding='VALID')
            hidden = slim.fully_connected(slim.flatten(self.conv2), 256, activation_fn=tf.nn.elu)

            # Recurrent network for temporal dependencies
            lstm_cell = tf.contrib.rnn.BasicLSTMCell(256, state_is_tuple=True)
            self.lstm_cell_c_size = lstm_cell.state_size.c
            self.lstm_cell_h_size = lstm_cell.state_size.h
            c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
            h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
            self.state_init = [c_init, h_init]

            c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
            h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
            self.state_in = (c_in, h_in)
            rnn_in = tf.expand_dims(hidden, [0])
            step_size = tf.shape(self.imageIn)[:1]
            state_in = tf.contrib.rnn.LSTMStateTuple(c_in, h_in)
            lstm_outputs, lstm_state = tf.nn.dynamic_rnn(
                lstm_cell, rnn_in, initial_state=state_in, sequence_length=step_size,
                time_major=False)
            lstm_c, lstm_h = lstm_state
            self.state_out = (lstm_c[:1, :], lstm_h[:1, :])
            rnn_out = tf.reshape(lstm_outputs, [-1, 256])

            # Output layers for policy and value estimations
            self.policy = slim.fully_connected(rnn_out, a_size,
                                               activation_fn=tf.nn.softmax,
                                               weights_initializer=normalized_columns_initializer(0.01),
                                               biases_initializer=None)
            self.value = slim.fully_connected(rnn_out, 1,
                                              activation_fn=None,
                                              weights_initializer=normalized_columns_initializer(1.0),
                                              biases_initializer=None)

            # Only the worker network need ops for loss functions and gradient updating.
            if scope != 'global':
                self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
                self.actions_onehot = tf.one_hot(self.actions, a_size, dtype=tf.float32)
                self.target_v = tf.placeholder(shape=[None], dtype=tf.float32)
                self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)

                self.responsible_outputs = tf.reduce_sum(self.policy * self.actions_onehot, [1])

                # Loss functions
                self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value, [-1])))
                self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy))
                self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs) * self.advantages)
                self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01

                # Get gradients from local network using local losses
                local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
                self.gradients = tf.gradients(self.loss, local_vars)
                self.var_norms = tf.global_norm(local_vars)
                grads, self.grad_norms = tf.clip_by_global_norm(self.gradients, 40.0)

                # Apply local gradients to global network
                global_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'global')
                self.apply_grads = trainer.apply_gradients(zip(grads, global_vars))

    def state_in_init(self):
        c_init = np.zeros((1, self.lstm_cell_c_size), np.float32)
        h_init = np.zeros((1, self.lstm_cell_h_size), np.float32)
        self.state_init = [c_init, h_init]


class Worker():
    def __init__(self, game, name, s_size, a_size, trainer, model_path, global_episodes):
        self.name = "worker_" + str(name)
        self.number = name
        self.model_path = model_path
        self.trainer = trainer
        self.global_episodes = global_episodes
        self.increment = self.global_episodes.assign_add(1)
        self.episode_rewards = []
        self.episode_lengths = []
        self.episode_mean_values = []
        self.summary_writer = tf.summary.FileWriter("train_" + str(self.number))

        # Create the local copy of the network and the tensorflow op to copy global paramters to local network
        self.local_AC = AC_Network(s_size, a_size, self.name, trainer)
        self.update_local_ops = update_target_graph('global', self.name)

        # The Below code is related to setting up the Doom environment
        game.set_doom_scenario_path("basic.wad")  # This corresponds to the simple task we will pose our agent
        game.set_doom_map("map01")
        game.set_screen_resolution(ScreenResolution.RES_160X120)
        game.set_screen_format(ScreenFormat.GRAY8)
        game.set_render_hud(False)
        game.set_render_crosshair(False)
        game.set_render_weapon(True)
        game.set_render_decals(False)
        game.set_render_particles(False)
        game.add_available_button(Button.MOVE_LEFT)
        game.add_available_button(Button.MOVE_RIGHT)
        game.add_available_button(Button.ATTACK)
        game.add_available_game_variable(GameVariable.AMMO2)
        game.add_available_game_variable(GameVariable.POSITION_X)
        game.add_available_game_variable(GameVariable.POSITION_Y)
        game.set_episode_timeout(300)
        game.set_episode_start_time(10)
        game.set_window_visible(False)
        game.set_sound_enabled(False)
        game.set_living_reward(-1)
        game.set_mode(Mode.PLAYER)
        game.init()
        self.actions = np.identity(a_size, dtype=bool).tolist()
        # End Doom set-up
        self.env = game

    def train(self, rollout, sess, gamma, bootstrap_value):
        rollout = np.array(rollout)
        observations = rollout[:, 0]
        actions = rollout[:, 1]
        rewards = rollout[:, 2]
        next_observations = rollout[:, 3]
        values = rollout[:, 5]

        # Here we take the rewards and values from the rollout, and use them to
        # generate the advantage and discounted returns.
        # The advantage function uses "Generalized Advantage Estimation"
        self.rewards_plus = np.asarray(rewards.tolist() + [bootstrap_value])
        discounted_rewards = discount(self.rewards_plus, gamma)[:-1]
        self.value_plus = np.asarray(values.tolist() + [bootstrap_value])
        advantages = rewards + gamma * self.value_plus[1:] - self.value_plus[:-1]
        advantages = discount(advantages, gamma)

        # Update the global network using gradients from loss
        # Generate network statistics to periodically save
        rnn_state = self.local_AC.state_init
        feed_dict = {self.local_AC.target_v: discounted_rewards,
                     self.local_AC.inputs: np.vstack(observations),
                     self.local_AC.actions: actions,
                     self.local_AC.advantages: advantages,
                     self.local_AC.state_in[0]: rnn_state[0],
                     self.local_AC.state_in[1]: rnn_state[1]}
        v_l, p_l, e_l, g_n, v_n, _ = sess.run([self.local_AC.value_loss,
                                               self.local_AC.policy_loss,
                                               self.local_AC.entropy,
                                               self.local_AC.grad_norms,
                                               self.local_AC.var_norms,
                                               self.local_AC.apply_grads],
                                              feed_dict=feed_dict)
        return v_l / len(rollout), p_l / len(rollout), e_l / len(rollout), g_n, v_n

    def work(self, max_episode_length, gamma, sess, coord, saver):
        episode_count = sess.run(self.global_episodes)
        total_steps = 0
        print("Starting worker " + str(self.number))
        with sess.as_default(), sess.graph.as_default():
            while not coord.should_stop():
                # copy global network parameter to self parameter
                sess.run(self.update_local_ops)
                # when a new episode starts, C_0 & h_0 of LSTM are reset to zero
                self.local_AC.state_in_init()
                episode_buffer = []
                episode_values = []
                episode_frames = []
                episode_reward = 0
                episode_step_count = 0
                d = False

                self.env.new_episode()
                s = self.env.get_state().screen_buffer
                episode_frames.append(s)
                s = process_frame(s)
                rnn_state = self.local_AC.state_init

                while self.env.is_episode_finished() == False:
                    # Take an action using probabilities from policy network output.
                    # after this step, a_dist shape (1,3), v shape (1,1),
                    # rnn_state: first array (C): (1, 256), second array (h): (1, 256)
                    a_dist, v, rnn_state = sess.run(
                        [self.local_AC.policy, self.local_AC.value, self.local_AC.state_out],
                        feed_dict={self.local_AC.inputs: [s],
                                   self.local_AC.state_in[0]: rnn_state[0],
                                   self.local_AC.state_in[1]: rnn_state[1]})
                    a = np.random.choice(a_dist[0], p=a_dist[0])
                    a = np.argmax(a_dist == a)     # return the best action index

                    # see http://vizdoom.cs.put.edu.pl/tutorial and
                    # https://github.com/mwydmuch/ViZDoom/blob/3bdb16935668aa42bb14cc38ac397b8954999cca/doc/DoomGame.md
                    # for reward description
                    # the agent gets rewards for each action which is -1 for each time tick,
                    # -6 if he shots but misses, and 100 if he kills the monster
                    r = self.env.make_action(self.actions[a]) / 100.0     # make_action returns reward

                    # In this example, the episode finishes after 300 tics or when the monster gets killed
                    # ref: http://www.cs.put.poznan.pl/visualdoomai/tutorial.html#basic - Game Runtime
                    d = self.env.is_episode_finished()
                    if d == False:
                        s1 = self.env.get_state().screen_buffer
                        episode_frames.append(s1)
                        s1 = process_frame(s1)
                    else:
                        s1 = s

                    episode_buffer.append([s, a, r, s1, d, v[0, 0]])
                    episode_values.append(v[0, 0])

                    episode_reward += r
                    s = s1
                    total_steps += 1
                    episode_step_count += 1

                    # If the episode hasn't ended, but the experience buffer is full, then we
                    # make an update step using that experience rollout.
                    if len(episode_buffer) == 30 and d != True and episode_step_count != max_episode_length - 1:
                        # Since we don't know what the true final return is, we "bootstrap" from our current
                        # value estimation.
                        v1 = sess.run(self.local_AC.value,
                                      feed_dict={self.local_AC.inputs: [s],
                                                 self.local_AC.state_in[0]: rnn_state[0],
                                                 self.local_AC.state_in[1]: rnn_state[1]})
                        v1 = v1[0, 0]
                        v_l, p_l, e_l, g_n, v_n = self.train(episode_buffer, sess, gamma, v1)
                        episode_buffer = []
                        sess.run(self.update_local_ops)
                        # original code does not update state_init:
                        # in train function, rnn_state is always set to self.state_init which are two zero numpy arrays
                        self.local_AC.state_init = copy.deepcopy(rnn_state)
                    if d == True:
                        break

                self.episode_rewards.append(episode_reward)
                self.episode_lengths.append(episode_step_count)
                self.episode_mean_values.append(np.mean(episode_values))

                # Update the network using the experience buffer at the end of the episode.
                if len(episode_buffer) != 0:
                    v_l, p_l, e_l, g_n, v_n = self.train(episode_buffer, sess, gamma, 0.0)

                # Periodically save gifs of episodes, model parameters, and summary statistics.
                if episode_count % 5 == 0 and episode_count != 0:
                    if self.name == 'worker_0' and episode_count % 25 == 0:
                        time_per_step = 0.05
                        images = np.array(episode_frames)
                        make_gif(images, './frames/image' + str(episode_count) + '.gif',
                                 duration=len(images) * time_per_step, true_image=True, salience=False)
                    if episode_count % 250 == 0 and self.name == 'worker_0':
                        saver.save(sess, self.model_path + '/model-' + str(episode_count) + '.cptk')
                        print("Saved Model")

                    mean_reward = np.mean(self.episode_rewards[-5:])
                    mean_length = np.mean(self.episode_lengths[-5:])
                    mean_value = np.mean(self.episode_mean_values[-5:])
                    summary = tf.Summary()
                    summary.value.add(tag='Perf/Reward', simple_value=float(mean_reward))
                    summary.value.add(tag='Perf/Length', simple_value=float(mean_length))
                    summary.value.add(tag='Perf/Value', simple_value=float(mean_value))
                    summary.value.add(tag='Losses/Value Loss', simple_value=float(v_l))
                    summary.value.add(tag='Losses/Policy Loss', simple_value=float(p_l))
                    summary.value.add(tag='Losses/Entropy', simple_value=float(e_l))
                    summary.value.add(tag='Losses/Grad Norm', simple_value=float(g_n))
                    summary.value.add(tag='Losses/Var Norm', simple_value=float(v_n))
                    self.summary_writer.add_summary(summary, episode_count)

                    self.summary_writer.flush()
                if self.name == 'worker_0':
                    sess.run(self.increment)
                episode_count += 1




max_episode_length = 300
gamma = .99 # discount rate for advantage estimation and reward discounting
s_size = 7056 # Observations are greyscale frames of 84 * 84 * 1
a_size = 3 # Agent can move Left, Right, or Fire
load_model = False
model_path = './model'

tf.reset_default_graph()

if not os.path.exists(model_path):
    os.makedirs(model_path)

# Create a directory to save episode playback gifs to
if not os.path.exists('./frames'):
    os.makedirs('./frames')

with tf.device("/cpu:0"):
    global_episodes = tf.Variable(0, dtype=tf.int32, name='global_episodes', trainable=False)
    trainer = tf.train.AdamOptimizer(learning_rate=1e-4)
    master_network = AC_Network(s_size, a_size, 'global', None)  # Generate global network
    num_workers = multiprocessing.cpu_count()  # Set workers to number of available CPU threads
    workers = []
    # Create worker classes
    for i in range(num_workers):
        workers.append(Worker(DoomGame(), i, s_size, a_size, trainer, model_path, global_episodes))
    saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    if load_model == True:
        print('Loading Model...')
        ckpt = tf.train.get_checkpoint_state(model_path)
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        sess.run(tf.global_variables_initializer())

    # This is where the asynchronous magic happens.
    # Start the "work" process for each worker in a separate thread.
    worker_threads = []
    for worker in workers:
        worker_work = lambda: worker.work(max_episode_length, gamma, sess, coord, saver)
        t = threading.Thread(target=(worker_work))
        t.start()
        sleep(0.5)
        worker_threads.append(t)
    coord.join(worker_threads)

Note that Worker.train implements the loop $latex \text{for } i \in \{t-1, \cdots, t_{start}\}$ in the pseudocode.

 

 

Policy Gradient

Reinforcement learning algorithms can be divided into many families. In model-free temporal difference methods like Q-learning/SARSA, we try to learn the action value Q(s,a) for every state-action pair, either by recording (“memorizing”) exact values in a table or by learning a function to approximate them. Under \epsilon-greedy, the action selected at a state is therefore argmax_a Q(s,a), but there is also a small constant probability \epsilon of selecting a non-optimal action.
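A tiny numpy sketch of \epsilon-greedy action selection from a learned Q table (the names here are my own):

import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1, rng=np.random):
    # Q: array of shape (num_states, num_actions) holding the learned action values
    if rng.uniform() < epsilon:
        return rng.randint(Q.shape[1])     # explore: a random action with probability epsilon
    return int(np.argmax(Q[state]))        # exploit: the greedy action argmax_a Q(s, a)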

Another family, called policy gradient methods, directly maps states to actions. To select actions, these methods do not need to consult a value table or a value function. Instead, each action is selected with a probability determined by a parameterized policy function \pi(a|s,\theta), where \theta denotes the policy function’s parameters.

The advantages of policy gradient methods over Q-learning/SARSA with \epsilon-greedy are mainly twofold:

  1. in some situations the optimal approximate policy must be stochastic.  An example from [1]: in card games with imperfect information the optimal play is often to do two different things with specific probabilities, such as when bluffing in Poker. Action-value methods have no natural way of finding stochastic optimal policies.
  2. problems vary in the complexity of their policies and action-value functions. In some problems, the policy is a much simpler function to approximate than action-values so it will be faster to learn

The general update form of policy gradient methods is \theta_{t+1} = \theta_t + \alpha \nabla \eta(\theta_t), where \eta(\theta_t) is a performance measure with respect to the policy weights.

Now, the policy gradient theorem can be briefly described as [5]:

Screenshot from 2017-05-29 11-43-06

In the episodic case, the policy gradient theorem is derived as follows (according to [1]):

Screenshot from 2017-05-29 20-53-45

The last expression is an exact formula for the gradient of \eta(\theta). It can also be seen as an expectation of \gamma^t \sum\limits_{a_t} \nabla_\theta \pi(a_t|s_t) q(s_t, a_t) over the probability distribution of landing on s_t after t steps. Please note that for any fixed t, \sum\limits_{s_t} \Pr(s_0 \rightarrow s_t, t, \pi)=1. Therefore, we can rewrite the last expression above in expectation form:

Screenshot from 2017-05-31 00-09-15

where G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3} + \cdots. G_t is an unbiased estimator of q(s_t, a_t) and is used in the last two steps because we do not have an estimate of q(s_t, a_t). We use \mathbb{E}_{\pi} to replace \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi) \atop a_t \sim \pi(a_t | s_t)}, meaning that the sequence S_0, A_0, S_1, A_1, \cdots is generated by following the policy \pi(a_t|s_t) and the transition probability p(s_{t+1}|s_t, a_t). Sometimes we also write \mathbb{E}_{\pi_\theta} because the policy \pi is parameterized by \theta. We can also write \mathbb{E}_{\pi} as \mathbb{E}_{s_{0:T}, a_{0:T}}. In other words, \nabla \eta(\theta) = \mathbb{E}_{s_{0:T}, a_{0:T}}[\sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t|s_t)]

What we do in reality is to use the collected states, actions, and rewards as samples to approximate the exact expectation of \nabla \eta(\theta):

Screenshot from 2017-05-28 19-56-19

This kind of policy gradient method is called REINFORCE, proposed by Ronald J. Williams from Northeastern University. The original paper [2] is hard to read, in my opinion. It directly tells you the update rule of \theta by construction, and then shows that the expected update aligns in the same direction as the performance gradient. What I wish to be told is how he derived the update rule of \theta in the first place.

(Updated 09/18/2017: The same derivative procedure of REINFORCE is illustrated more clearly in [10])

(Updated 01/31/2020: The derivative procedure of REINFORCE in continuous state/action space is illustrated in [15])

One important extension of REINFORCE is to offset G_t by a baseline b(s_t), a function of s_t. Intuitively, what we need is not how good an action is in absolute terms, but how much better it is than the average action. For example, if G_t is uniformly larger than zero for both good and bad actions, \theta will always be updated to encourage both kinds of actions. We need to calibrate G_t so that it differentiates good and bad actions better. Mathematically, adding a baseline can reduce the variance of the estimator of \nabla \eta(\theta) while keeping it unbiased.
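Before going through the math, here is a numpy sketch of the REINFORCE update (with an optional baseline) for a tabular softmax policy; the state/action encoding and names are my own illustration, not code from any of the referenced implementations:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, gamma=0.99, alpha=0.01, baseline=None):
    # theta: (num_states, num_actions) parameters of a tabular softmax policy pi(a|s) = softmax(theta[s])
    # episode: list of (state, action, reward) tuples generated by following pi
    # baseline: optional (num_states,) array, e.g. a learned state-value estimate
    returns, G = [], 0.0
    for (_, _, r) in reversed(episode):            # G_t = R_{t+1} + gamma*R_{t+2} + ... computed backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for t, (s, a, _) in enumerate(episode):
        pi = softmax(theta[s])
        advantage = returns[t] - (baseline[s] if baseline is not None else 0.0)
        grad_log_pi = -pi                          # gradient of log softmax w.r.t. theta[s]
        grad_log_pi[a] += 1.0
        theta[s] += alpha * (gamma ** t) * advantage * grad_log_pi
    return theta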

First, let’s look at why offsetting G_t by b(s_t) still keeps the estimator of \nabla \eta(\theta) unbiased (reference: [8]):

Screenshot from 2017-05-31 00-10-24

To ensure \mathbb{E}_{s_{0:T}, a_{0:T}} [\sum\limits_{t=0}^T \gamma^t (G_t - b(s_t)) \nabla_\theta \log \pi (a_t|s_t)] is an unbiased estimate of \nabla \eta(\theta), b(s_t) should only be a function of s_t, but not a_t. Otherwise \mathbb{E}_{s_{0:T}, a_{0:T}}[\sum\limits_{t=0}^T \gamma^t b(s_t) \nabla_\theta \log \pi (a_t | s_t)] is not zero.

It is less obvious that adding b(s_t) can reduce the variance Var[ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)].

Screenshot from 2017-06-08 21-13-30

From here, we can see that if \sum\limits_{t=0}^T \gamma^t b(s_t) \nabla_\theta \log \pi(a_t | s_t) has a large enough covariance with \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t) to outweigh its own variance, then the overall variance is reduced. Unrealistically, if b(s_t) = G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3} + \cdots, the variance would be zero, although this is impossible because b(s_t) is only a function of s_t and has no magical ability to perfectly predict G_t.

(Side note: I cannot follow [8]’s derivation of the variance-reduction part.)

One way is to train a predictor \hat{v}(S_t, w), with parameters w, as the baseline:

Screenshot from 2017-05-28 21-59-16

From [1]: Note that REINFORCE uses the complete return from time t (G_t), which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed.

When we derived \nabla \eta(\theta), we used the property that \mathbb{E}[G_t|S_t, A_t]=q_\pi(s_t, a_t). However, G_t can have high variance because it involves returns from step t to T, where each reward can be seen as a random variable [13]. An alternative estimator of q_\pi(s_t, a_t), with lower variance but higher bias, is to use “bootstrapping”, i.e., use a parameterized value function \hat{v}_w plus the next immediate reward to approximate the return: G_t \approx R + \gamma \hat{v}(S', w). The one-step actor-critic algorithm is described as follows [1]:

Screen Shot 2018-11-11 at 1.44.33 PM
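A tabular numpy sketch of one step of this update (the softmax parameterization and the names are my own illustration):

import numpy as np

def one_step_actor_critic_update(theta, w, s, a, r, s_next, done,
                                 gamma=0.99, alpha_theta=0.01, alpha_w=0.1, I=1.0):
    # w: (num_states,) state-value table; theta: (num_states, num_actions) table of a softmax policy
    target = r if done else r + gamma * w[s_next]
    delta = target - w[s]                          # TD error, used in place of G_t - v(S_t)
    w[s] += alpha_w * delta                        # critic update
    pi = np.exp(theta[s] - theta[s].max())
    pi /= pi.sum()
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * I * delta * grad_log_pi   # actor update (I carries the gamma^t factor)
    return delta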

REINFORCE is an on-policy algorithm because \delta=G_t - \hat{v}(S_t,w) in the gradient update depends on G_t, the return generated by following the current policy \pi_\theta. The specific one-step actor-critic algorithm we just described is also an on-policy algorithm because \delta=R+\gamma \hat{v}(S', w) - \hat{v}(S, w) depends on the next state S', which is the result of applying \pi_\theta at the current state S. There also exist off-policy actor-critics; see an overview of on-policy and off-policy policy gradient methods in [14].

A more recent advance in baselines is Generalized Advantage Estimation (GAE) [6]. The authors introduce two parameters, \gamma and \lambda, in an undiscounted reward problem to help estimate g:=\nabla_\theta \mathbb{E}[\sum_{t=0}^\infty r_t] with little introduced bias and reduced variance. (Note how discounted reward problems can be transformed into an undiscounted problem: “But the discounted problem (maximizing \sum_{t=0}^\infty \gamma^t r_t) can be handled as an instance of the undiscounted problem in which we absorb the discount factor into the reward function, making it time-dependent.”)

They introduce their notations:

Screenshot from 2017-06-09 17:48:05

Note that g^\gamma is a biased estimator of g, but, as they claim, previous works have studied how to “reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias”.

The paper’s goal is to find a good estimator of A^{\pi, \gamma} which is called \hat{A}_t.

Screenshot from 2017-06-10 18-07-03

In other words, if \hat{A}_t is \gamma-just, then it helps construct an unbiased estimator of g^\gamma. Equation (8) just uses the property that the expectation of a sum equals the sum of the expectations.

Now, what other property does \hat{A}_t have? If we know any property of \hat{A}_t, it will help us find \hat{A}_t more easily. The paper proposes one property:

Screenshot from 2017-06-10 18-19-33

Sketch proof of proposition 1:

First of all, to understand the notation clearly, think of \hat{A}_t(s_{0:\infty}, a_{0:\infty}) and Q_t(s_{0:\infty}, a_{0:\infty}) as functions whose input is the whole trajectory s_0, a_0, s_1, a_1, \cdots. Similarly, think of b_t(s_{0:t}, a_{0:t-1}) as a function whose input is the former part of the trajectory, s_0, a_0, s_1, a_1, \cdots, s_t.

Now, suppose we have a \hat{A}_t satisfying \hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{0:\infty}, a_{0:\infty}) - b_t(s_{0:t}, a_{0:t-1}), where for all (s_t, a_t), \; \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}|s_t,a_t}[Q_t(s_{0:\infty}, a_{0:\infty})] = Q^{\pi,\gamma}(s_t, a_t); then:

Screenshot from 2017-06-10 19-18-26

Next, they find a good candidate function for \hat{A}_t, which is \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t). Only when V = V^{\pi, \gamma} is \delta_t^V \gamma-just; otherwise it yields a biased estimator of g^\gamma. However, if we take an average over different variations of \delta (equations 11, 12, 13, 14, 15, and 16), we might get a low-bias, low-variance estimator, which is called GAE(\gamma, \lambda).

Screenshot from 2017-06-10 20-58-57

\hat{A}_t:=GAE(\gamma, 1) is \gamma-just regardless of the accuracy of V (again, this is because E_{s_{0:\infty}, a_{0:\infty}}[V(s_t) \nabla_\theta \log \pi_\theta (a_t | s_t) ] = 0). However, GAE(\gamma, 1) is believed (I don’t know how to prove this) to have high variance due to the long sum of rewards. At the other extreme, GAE(\gamma, 0) has low variance, but since we are only estimating the value function V, GAE(\gamma, 0) must be a biased estimator of g^\gamma. GAE(\gamma, \lambda) with 0<\lambda<1 trades off between variance and bias.
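In practice, GAE(\gamma, \lambda) is computed as an exponentially weighted sum of the TD residuals, accumulated backwards over a trajectory. A numpy sketch for a single trajectory (with the bootstrap value appended at the end of `values`; episode-boundary handling is simplified away):

import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: [r_0, ..., r_{T-1}];  values: [V(s_0), ..., V(s_T)] (one extra bootstrap value at the end)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # the TD residual delta_t^V
        gae = delta + gamma * lam * gae                          # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = gae
    return advantages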

Update 2018-11-08

Policy gradient is better illustrated in several recent posts: [11] and [12]

Reference

[1] Reinforcement learning: An introduction

[2] Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

[3] https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2

[4] Asynchronous Methods for Deep Reinforcement Learning

[5] http://www.breloff.com/DeepRL-OnlineGAE/

[6] HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION

[7] Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

[8] https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/

[9] https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/

[10] https://www.youtube.com/watch?v=kUiR0RLmGCo

[11] https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#soft-actor-critic

[12] https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#expected-grad-log-prob-lemma

[13] Supplement material of DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills

[14] Unifying On-Policy and Off-Policy Learning

[15] http://web.stanford.edu/class/cme241/lecture_slides/PolicyGradient.pdf

latex for policy gradient theorem:

    \begin{align*} \nabla \eta(\theta) &= \nabla_\theta v_{\pi} (s_0) \quad \quad \text{performance measure is the value of starting state} \\ &= \nabla_\theta \big[ \sum\limits_{a_0} \pi(a_0|s_0) q(s_0,a_0) \big] \\ &=\sum\limits_{a_0} \big[ \nabla_\theta \pi(a_0|s_0) q(s_0, a_0) + \pi(a_0|s_0) \nabla_\theta q(s_0, a_0) \big]  \quad \quad \text{derivative product rule} \\  &= \sum\limits_{a_0} \Big[ \nabla_\theta \pi(a_0|s_0) q(s_0, a_0) + \pi(a_0|s_0) \nabla_\theta \big[ \sum\limits_{s_1,r_0} p(s_1, r_0 |s_0,a_0)(r_0 + \gamma v_\pi(s_1)) \big] \Big] \\ &= \sum\limits_{a_0} \big[ \nabla_\theta \pi (a_0 | s_0) q(s_0, a_0) + \pi(a_0 | s_0) \sum\limits_{s_1} \gamma p(s_1| s_0, a_0) \nabla_\theta v_{\pi}(s_1) \big] \qquad r \text{ has nothing to do with regard to } \theta \\  & \qquad \text{up till now, we have a recursion:} \\ & \qquad \nabla_\theta v_\pi(s_t)= \sum\limits_{a_t} \Big[ \nabla_\theta \pi(a_t|s_t) q(s_t, a_t) + \pi(a_t|s_t) \big[ \sum\limits_{s_{t+1}} \gamma p(s_{t+1}|s_t,a_t) \nabla_\theta v_\pi(s_{t+1}) \big] \Big]  \\ &=\sum\limits_{a_0} \Big[ \nabla_\theta \pi (a_0 | s_0) q(s_0, a_0) + \pi(a_0 | s_0) \sum\limits_{s_1} \gamma p(s_1| s_0, a_0) \\ & \qquad \qquad  \sum\limits_{a_1} \big[ \nabla_\theta \pi(a_1 | s_1)q(s_1, a_1) + \pi(a_1 | s_1)\sum\limits_{s_2} \gamma p(s_2|s_1, a_1) \nabla_\theta v_{\pi} (s_2) \big] \Big]  \\ &=\sum\limits_{a_0} \nabla_\theta \pi (a_0 | s_0) q(s_0, a_0) \\ & \qquad +  \gamma \sum\limits_{s_1} \sum\limits_{a_0} p(s_1| s_0, a_0) \pi(a_0 | s_0)  \sum\limits_{a_1}  \nabla_\theta \pi(a_1 | s_1)q(s_1, a_1) \\ & \qquad + \gamma^2 \sum\limits_{s_2} \sum\limits_{a_1} \sum\limits_{s_1} \sum\limits_{a_0} p(s_2|s_1, a_1) \pi(a_1 | s_1) p(s_1| s_0, a_0) \pi(a_0 | s_0) \nabla_\theta v_{\pi} (s_2)  \\ &= \cdots \qquad \text{(keep unrolling using the recursion)}\\ &= \sum\limits_{t=0}^\infty \sum\limits_{s_t} \gamma^t \Pr(s_0 \rightarrow s_t, t, \pi) \sum\limits_{a_t} \nabla_\theta \pi(a_t | s_t) q(s_t, a_t)  \qquad  \Pr(s_0 \rightarrow s_t, t, \pi) \text{ is the prob. of } s_0 \text{ to } s_t \text{ in } t \text{ steps}  \end{align*}

latex for expectation form rewritten:

    \begin{align*} \nabla \eta(\theta) &= \sum\limits_{t=0}^\infty \sum\limits_{s_t} \gamma^t \Pr(s_0 \rightarrow s_t, t, \pi) \sum\limits_{a_t} \nabla_\theta \pi(a_t | s_t) q(s_t, a_t) \\ &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi)}[\sum\limits_{a_t} \gamma^t \nabla_\theta \pi(a_t | s_t) q(s_t, a_t) ] \\  &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi)} [\sum\limits_{a_t} \gamma^t \pi(a_t | s_t) q(s_t, a_t) \frac{\nabla_\theta \pi(a_t | s_t)}{\pi(a_t | s_t)} ] \\ &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi) \atop a_t \sim \pi(a_t | s_t) \quad}[ \gamma^t q(s_t, a_t) \frac{\nabla_\theta \pi(a_t | s_t)}{\pi(a_t | s_t)}] \\ &=\sum\limits_{t=0}^\infty \mathbb{E}_{s_t \sim \Pr(s_0 \rightarrow s_t, t, \pi) \atop a_t \sim \pi(a_t | s_t) \quad}[ \gamma^t q(s_t, a_t) \nabla_\theta \log \pi(a_t | s_t)]  \qquad \nabla \log(x) = \frac{\nabla x}{x} \\  &=\mathbb{E}_{\pi}[ \sum\limits_{t=0}^\infty \gamma^t q(s_t, a_t) \nabla_\theta \log \pi(a_t | s_t)] \\  &=\mathbb{E}_{\pi}[ \sum\limits_{t=0}^\infty \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] \end{align*}

latex for adding baseline is still unbiased estimator:

    \begin{align*} &\mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t (G_t - b(s_t)) \nabla_\theta \log \pi(a_t | s_t)] \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \mathbb{E}_{s_{0:T}, a_{0:T}}[ \sum\limits_{t=0}^T \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t) ] \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \mathbb{E}_{s_{0:T}, a_{0:T}}[ \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t) ]  \qquad \text{exp. of sum equals to sum of exp.}\\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \mathbb{E}_{s_{t}, a_{t}}[ \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t) ] \qquad \text{remove irrelevant variables in each exp.}\\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \sum\limits_{a_{t}} p(s_t, a_t)  \gamma^t  b(s_t) \nabla_\theta \log \pi (a_t | s_t)  \qquad \text{expectation form} \rightarrow \text{discrete sum} \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \sum\limits_{a_{t}} p(s_t) \pi(a_t|s_t)  \gamma^t  b(s_t) \frac{\nabla_\theta \pi (a_t | s_t)}{\pi(a_t | s_t) }  \qquad \text{rule of probability} \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \gamma^t  b(s_t) p(s_t) \sum\limits_{a_{t}} \nabla_\theta \pi (a_t | s_t)  \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \gamma^t  b(s_t) p(s_t) \nabla_\theta \sum\limits_{a_{t}} \pi (a_t | s_t)  \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] - \sum\limits_{t=0}^T \sum\limits_{s_{t}} \gamma^t  b(s_t) p(s_t) \nabla_\theta 1  \\ =& \mathbb{E}_{s_{0:T}, a_{0:T}} [ \sum\limits_{t=0}^T \gamma^t G_t \nabla_\theta \log \pi(a_t | s_t)] \\ =& \nabla \eta(\theta)  \end{align*}

latex for sketch proof of proposition 1:

    \begin{align*} &\mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} [\hat{A}_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t | s_t) ] \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} [(Q_t(s_{0:\infty}, a_{0:\infty}) - b_t(s_{0:t}, a_{0:t-1})) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[Q_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t | s_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &\qquad \text{we will first work on the former part} \downarrow \\ &= \mathbb{E}_{s_{0:t} \atop a_{0:t}}[\nabla_\theta \log \pi_\theta(a_t | s_t) \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}} [Q_t(s_{0:\infty}, a_{0:\infty})] ] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &= \mathbb{E}_{s_{0:t} \atop a_{0:t}}[\nabla_\theta \log \pi_\theta(a_t | s_t) Q^{\pi, \gamma}(s_t, a_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[\nabla_\theta \log \pi_\theta(a_t | s_t) Q^{\pi, \gamma}(s_t, a_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[b_t(s_{0:t}, a_{0:t-1}) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &\qquad \text{since } Q^{\pi, \gamma}(s_t, a_t) \text{ is a function of input only } s_t \text{ and } a_t \text{, we can change } \mathbb{E}_{s_{0:t} \atop a_{0:t}} \text{ to } \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[\nabla_\theta \log \pi_\theta(a_t | s_t) Q^{\pi, \gamma}(s_t, a_t)] - \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}}[V^{\pi, \gamma}(s_t) \nabla_\theta \log \pi_\theta(a_t | s_t)] \\ &\qquad V^{\pi, \gamma}(s_t) \text{ is an instance of } b_t(s_{0:t}, a_{0:t-1})  \\ &= \mathbb{E}_{s_{0:\infty} \atop a_{0:\infty}} [A^{\pi, \gamma}(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t | s_t) ] \end{align*}

Upgrade Cuda from 7.x to 8.0 on Ubuntu

1.  remove cuda 7.x version (x depends on what you installed.)

sudo rm -rf /usr/local/cuda-7.x

2. make sure PATH and LD_LIBRARY_PATH no longer contain “/usr/local/cuda-7.x”. Possible places to look at are /etc/environment, ~/.profile, /etc/bash.bashrc, /etc/profile, ~/.bashrc

If you really don’t know where cuda path is added to PATH or LD_LIBRARY_PATH, try to check here: https://unix.stackexchange.com/questions/813/how-to-determine-where-an-environment-variable-came-from

3. cuda 8.0 only supports Ubuntu 14.04 and 16.04. Therefore, do system upgrade if necessary. Ref: https://askubuntu.com/questions/760347/how-to-upgrade-from-14-04-lts-or-15-10-to-16-04-from-terminal

4. install cuda-8.0 toolkit. Go to here: https://developer.nvidia.com/cuda-downloads and download some file type you prefer. Perhaps .deb file can lead you to install via Software Center, which is not a bad idea.

5. to verify cuda 8.0 has been installed:

cd /usr/local/cuda-8.0/samples/
make
cd /usr/local/cuda-8.0/samples/1_Utilities/deviceQuery
./deviceQuery

You should see:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 640M"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1999 MBytes (2096300032 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            709 MHz (0.71 GHz)
  Memory Clock rate:                             2000 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GT 640M
Result = PASS

Also, you can go to `/usr/local/cuda-8.0/samples/bin` and run any generated test program you want.

(if make has “cannot find -lnvcuvid” error, follow as here: https://askubuntu.com/questions/889218/testing-cuda-in-ubuntu-16-04-usr-bin-ld-cannot-find-lnvcuvid)

ref: http://xcat-docs.readthedocs.io/en/stable/advanced/gpu/nvidia/verify_cuda_install.html

 

6. follow https://nb4799.neu.edu/wordpress/?p=2572 to set up LD_LIBRARY_PATH and CUDA_HOME

 

English Grammars

“A” or “an” before an acronym or abbreviation? e.g., a FAQ or an FAQ?

https://english.stackexchange.com/questions/1016/do-you-use-a-or-an-before-acronyms

 

When should I add “the” before what kind of noun?

http://www.englishteachermelanie.com/grammar-when-not-to-use-the-definite-article/

 

Whether to repeat “the” in “noun and noun” phrases?

http://english.stackexchange.com/questions/9487/is-it-necessary-to-use-the-multiple-times

 

“noun and noun” phrase: the following verb is plural or single?

http://www.mhhe.com/mayfieldpub/tsw/nounsagr.htm

 

adj before “noun and noun” phrase

http://theeditorsblog.net/2015/08/08/one-adjective-paired-with-multiple-nouns-a-readers-question/

 

when to use articles?

http://www.quickanddirtytips.com/education/grammar/when-to-use-articles-before-nouns

 

whether to add comma before “and”?

https://english.stackexchange.com/questions/30516/should-i-use-a-comma-before-and-or-or

 

Which words in a title should be capitalized?

https://english.stackexchange.com/questions/14/which-words-in-a-title-should-be-capitalized

http://grammar.yourdictionary.com/capitalization/rules-for-capitalization-in-titles.html

 

do not italicize abbreviations like e.g., etc., …

Screenshot from 2017-04-19 15-05-09

 

italicize technical or key terms at first appearance

http://blog.apastyle.org/apastyle/2015/04/using-italics-for-technical-or-key-terms.html

 

abbreviation, initialism and acronym

http://blog.apastyle.org/apastyle/abbreviations/

https://www.proofreadnow.com/blog/bid/65823/Acronyms-Initialisms-and-Abbreviations (this also says do not italicize acronyms)

 

hyphen for multiple words if they are used as adjective

https://english.stackexchange.com/questions/220967/hyphenate-cutting-edge-state-of-the-art-following-to-be-verbs

https://english.stackexchange.com/questions/2908/should-i-use-related-or-related

 

Describing numbers in English words or in numerals?

http://www.aje.com/en/arc/editing-tip-using-numbers-scientific-manuscripts/

 

“None” can be followed by either a singular or a plural verb

http://www.onlinegrammar.com.au/top-10-grammar-myths-none-always-takes-a-singular-verb/

 

Should I add “the” in front of a superlative adjective?

Not required in many cases: https://ell.stackexchange.com/questions/46923/is-there-always-a-the-before-a-superlative-adjective

 

“half the time” or “half of the time”

https://english.stackexchange.com/questions/217600/should-i-use-half-the-time-or-half-of-the-time

Using “half the time” is fine.

 

“Be it” usage

Be he good or evil, … = Whether he is good or evil, …

Be it so or not, … = Whether it is so or not, …

https://english.stackexchange.com/questions/256311/be-it-or-grammar

 

during vs. while

during + noun;  while + clause

http://www.differencebetween.net/language/difference-between-during-and-while/

 

When to use “among” and “between”?

between specific things, among a group of things

The negotiations between Brazil, Argentina, and Chile are going well.

The negotiations among the countries of South America are going well.

https://www.espressoenglish.net/whats-the-real-difference-between-between-and-among/

 

“that” vs “which”

“that” is used for restrictive clauses

“which” is used for non-restrictive clauses

https://www.grammarly.com/blog/which-vs-that/

 

https://afterdeadline.blogs.nytimes.com/2010/04/13/faqs-on-style/

Importance sampling

Importance sampling is a way to reduce the variance of a Monte Carlo estimate of an integral over a region.

Let’s first see how the traditional Monte Carlo method estimates an integral [2]. To estimate $latex \int_a^b f(x) dx$, one can think of reshaping the area to be integrated into a rectangle whose width is $latex b-a$ and whose height is $latex E_{\mathcal{U}[a,b]}(f(x))$, a virtual, expected height.

Therefore, to use Monte Carlo estimation, we can randomly draw $latex N$ numbers from the uniform distribution $latex \mathcal{U}[a,b]$, and use the following result as the approximation to $latex \int_a^b f(x) dx$:

$latex \frac{(b-a)}{N}\sum\limits_{i=1}^N f(x_i) &s=2$

 

Of course, this estimate varies from run to run (each run is an estimate based on $latex N$ samples), so Monte Carlo estimation of an integral has variance. To reduce this variance, we use importance sampling [1]. The idea is to convert the integral into an expectation under an auxiliary, known distribution $latex h(x)$:

$latex \int_a^b f(x) dx = \int_a^b f(x) \frac{h(x)}{h(x)} dx = \int_a^b \frac{f(x)}{h(x)} h(x) dx = \mathbb{E}_h \big(\frac{f(X)}{h(X)} \big) &s=2$

 

Using the Monte Carlo method to estimate this expectation, we have:

$latex \mathbb{E}_h \big(\frac{f(X)}{h(X)} \big) \approx \frac{1}{n} \sum\limits_{i=1}^n \frac{f(X_i)}{h(X_i)} ,\quad X_i \sim h(x) &s=2$ 

 

When $latex \alpha h(x) = f(x)$ (assuming $latex f(x)>0$), the importance sampling variance becomes zero, because every sample ratio $latex \frac{f(x_i)}{h(x_i)}$ equals the constant $latex \alpha$ regardless of $latex x_i$, so $latex \mathbb{E}_h \big(\frac{f(X)}{h(X)}\big) = \alpha$ exactly. Though a nice property, zero variance is impossible to achieve in practice: if you knew a distribution $latex h(x)$ proportional to $latex f(x)$, you could recover $latex f(x)$ simply by multiplying $latex h(x)$ by $latex \alpha$, so you would not need Monte Carlo to compute the integral in the first place. The common solution is to pick a known distribution $latex h(x)$ whose shape is close to $latex f(x)$; the closer $latex h(x)$ is to $latex f(x)$, the smaller the variance. If $latex h(x)$ is the uniform distribution on $latex (a,b)$, importance sampling reduces to the traditional Monte Carlo method introduced at the beginning of this section.
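
As a quick sanity check of this zero-variance property, here is a minimal sketch (a toy example of my own, separate from the experiment below), where the integrand f(x) = 2x on (0, 1) and the proposal h(x) = 2x are exactly proportional with $latex \alpha = 1$:

# minimal sketch: when h(x) is proportional to f(x), every importance weight
# f(x)/h(x) is identical, so the importance sampling estimator has zero variance
import numpy

rng = numpy.random.default_rng(0)

f = lambda x: 2. * x        # integrand on (0, 1); exact integral = 1
h_pdf = lambda x: 2. * x    # proposal density proportional to f (alpha = 1)

u = rng.uniform(size=1000)
x = numpy.sqrt(u)           # inverse-CDF sampling from h: its CDF is x**2
weights = f(x) / h_pdf(x)   # every weight equals exactly 1

print(weights.mean(), weights.std())   # prints 1.0 and 0.0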

 

Here is the code to verify that importance sampling really reduces the variance of the integral estimate.

# estimate integration over a normal distribution (mean=0, std=1) in the range of (a, b)
from scipy.stats import norm
import numpy
from scipy import stats

T = 200  # num of rounds
N = 10   # num of samples per round
a, b = -2., 2.   # lower bound, upper bound
fx_loc, fx_scale = 0, 1   # f(x) is a normal distribution (mean=0, std=1)

# h1(x): triangular density peaked at x=0, decreasing linearly to 0 at x=±b
class hx1_dist_gen(stats.rv_continuous):
    def _pdf(self, x):
        return 1. / b - 1. / (b**2) * numpy.abs(x)

# h2(x): V-shaped density, zero at x=0 and largest at x=±b
class hx2_dist_gen(stats.rv_continuous):
    def _pdf(self, x):
        return 1. / (b**2) * numpy.abs(x)

hx1_dist = hx1_dist_gen(a=a, b=b)
hx2_dist = hx2_dist_gen(a=a, b=b)

# monte carlo estimation
mc_res = []
for t in range(T):
    print("MC round %d" % t)
    x = numpy.random.uniform(low=a, high=b, size=N)
    fxs = norm.pdf(x, loc=fx_loc, scale=fx_scale)
    mc_res.append((b - a) * numpy.mean(fxs))

# importance sampling estimation using h1(x)
is1_res = []
for t in range(T):
    print("Importance Sampling h1(x) round %d" % t)
    # draw N samples from h1(x), which has support (a, b)
    x = hx1_dist.rvs(size=N)
    fx = norm.pdf(x, loc=fx_loc, scale=fx_scale)
    hx = hx1_dist.pdf(x)
    fx_div_hxs = fx / hx
    is1_res.append(numpy.mean(fx_div_hxs))

# importance sampling estimation using h2(x)
is2_res = []
for t in range(T):
    print("Importance Sampling h2(x) round %d" % t)
    # draw N samples from h2(x), which has support (a, b)
    x = hx2_dist.rvs(size=N)
    fx = norm.pdf(x, loc=fx_loc, scale=fx_scale)
    hx = hx2_dist.pdf(x)
    fx_div_hxs = fx / hx
    is2_res.append(numpy.mean(fx_div_hxs))

print("exact: {0}".format(norm.cdf(b) - norm.cdf(a)))
print("MC estimation (mean/std): {0}/{1}".format(numpy.mean(mc_res), numpy.std(mc_res)))
print("Importance Sampling h1 estimation (mean/std): {0}/{1}".format(numpy.mean(is1_res), numpy.std(is1_res)))
print("Importance Sampling h2 estimation (mean/std): {0}/{1}".format(numpy.mean(is2_res), numpy.std(is2_res)))

 

Output on my machine:

exact: 0.9544997361036416
MC estimation (mean/std): 0.958336124778972/0.15458260669929705
Importance Sampling h1 estimation (mean/std): 0.9515730240850715/0.07416369320580925
Importance Sampling h2 estimation (mean/std): 1.0496523604436687/0.7489580081170375

 

I set $latex f(x)$ to be a normal density with mean=0 and std=1. I use the traditional Monte Carlo method as well as importance sampling with two distributions, $latex h_1(x)$ and $latex h_2(x)$. All the distributions are sketched in the following chart:

Untitled Diagram

 

From the output we can see that the estimate using importance sampling with $latex h_1(x)$ has the smallest std. This is because $latex h_1(x)$ (the blue triangle) is the closest in shape to $latex f(x)$. $latex h_2(x)$ (the red shape formed by two triangles) has the largest std because its shape is the least similar to $latex f(x)$. The traditional MC method (equivalently, importance sampling with a uniform distribution) has a medium std.

 

Reference

[1] https://www.dropbox.com/s/75vb40b2vfnnlcm/Monte%20Carlo%20Methods%20and%20Importance%20Sampling%5Bai%20classicML%5D.pdf?dl=0

[2] http://ta.twi.tudelft.nl/mf/users/oosterle/oosterlee/lec8-hit-2009.pdf

Inverse Reinforcement Learning

In my rough understanding, inverse reinforcement learning is a branch of RL research in which one tries to learn, from given tutor sequences, how to produce state-action sequences that resemble them.

There are two famous works on inverse reinforcement learning. One is Apprenticeship Learning via Inverse Reinforcement Learning [1], and the other is Maximum Margin Planning [2].

Maximum Margin Planning

In MMP, we learn a cost function that estimates the cost of each state-action pair. The desired cost function should assign low costs to state-action pairs appearing in “tutor” sequences (examples) and high costs to pairs deviating from the tutor sequences. If we manage to recover such a cost function, we can assign a cost to any state-action pair and use path-finding algorithms (e.g., A*, Dijkstra) on a new environment to find the sequence with the minimum accumulated cost, that is, a sequence deemed to resemble the given tutor sequences.

The basic idea is given in [4]:

Screenshot from 2017-03-29 18:36:37

As you can see, in line 6 you need to adjust the cost function so that it increases on the current “not-perfect” minimum-cost path and decreases on the example path. However, the algorithm doesn’t know by how much it should increase the cost of the minimum-cost path. If one path deviates from the example path only a little while another deviates far more, how should the adjustment of the cost function reflect that?

So the author proposes an improvement called “loss augmentation”. Suppose the original cost at a state-action pair is c^{sa}; then its augmented cost is \tilde{c}^{sa} = c^{sa}-l^{sa}, where l^{sa} is a loss function with large values for deviated state-action pairs and close-to-zero values for pairs resembling the tutor sequences. Why do we want the augmented cost to be even smaller for those deviated state-action pairs? Because this makes the deviated pairs even harder to distinguish from the desired pairs, forcing the algorithm to keep updating the cost function until it can separate the augmented costs of different state-action pairs. In other words, the algorithm learns a cost function that separates any desired state-action pair (s,a) from a less desired state-action pair (s',a') by at least a margin l^{s'a'}.

Here is a visualization showing how deviated paths get a high reward (+1) and desired paths get a low reward (-1):

Screenshot from 2017-03-29 21:42:24

And here is the revised algorithm based on Algorithm 1:

Screenshot from 2017-03-29 21:42:30

Now, let’s look at how to learn the cost function using a framework called maximum margin structured classification.

Before that, we introduce some preliminaries. In this paper we are given N examples; each example is an MDP plus a tutor sequence. Different examples may come from different MDPs, so the model can learn a cost function that generalizes to MDPs never seen in the training examples. For simplicity, however, we assume all training examples are different tutor sequences from the same MDP with state space \mathcal{S} and action space \mathcal{A}. Also for simplicity, an important assumption made throughout [4] is that a tutor sequence is acyclic: any state-action pair is visited at most once in a tutor sequence. It is then easy to represent a tutor sequence by a state-action frequency count vector \mu \in \{0,1\}^{\mathcal{S}\mathcal{A}}. Without the acyclic assumption, \mu would instead lie in \mathbb{R}_{+}^{\mathcal{S}\mathcal{A}} and satisfy certain flow constraints.

Now remember that the idea is to approximate everything linearly, such as the loss function and the cost function. Each training example has a loss function that quantifies the difference between an arbitrary sequence and the i-th tutor sequence \mu_i. Since we want the loss to be a linear function of \mu, we let \mathcal{L}_i(\mu)=l_i^T \mu. The more \mu deviates from \mu_i, the higher \mathcal{L}_i(\mu) should be. A straightforward choice is l_i \in \{0,1\}^{\mathcal{S}\mathcal{A}}, with 0 at state-action pairs visited by the tutor sequence \mu_i and 1 elsewhere, so that l_i^T \mu counts the state-action pairs where \mu deviates from \mu_i. More advanced loss functions can of course be used, but the bottom line is that \mathcal{L}_i(\mu) \geq 0.

The cost function should also factor over states and actions: the cost of a sequence is the sum of the costs of the state-action pairs it visits. Suppose we have a feature matrix F \in \mathbb{R}^{d \times \mathcal{S}\mathcal{A}}, whose columns are the feature vectors of the state-action pairs, and suppose the cost function has a linear weight w \in \mathbb{R}^{d}. Then the cost of one state-action pair is w^T F_j, where F_j is the column corresponding to that pair, and the cost of the whole sequence is w^T F \mu. You can think of F \mu as the sum of feature vectors over the visited state-action pairs.

Since we assume all training examples come from the same MDP, we can write the data set as \mathcal{D}=\{(F, \mu_i, l_i)\}^N_{i=1}.
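
To make these definitions concrete, here is a minimal sketch (a toy example of my own, not code from [2] or [4]) that builds the frequency-count vector \mu, a simple 0/1 loss vector l_i, and the sequence cost w^T F \mu for a tiny, hypothetical MDP:

# toy sketch of the MMP quantities: mu (frequency counts), l_i (0/1 loss), cost w^T F mu
import numpy

n_sa = 6   # number of state-action pairs in a tiny MDP
d = 3      # feature dimension

rng = numpy.random.default_rng(0)
F = rng.normal(size=(d, n_sa))   # feature matrix: one column per state-action pair
w = rng.normal(size=d)           # cost weights; the cost of pair j is w @ F[:, j]

def to_mu(path, n_sa):
    # frequency-count vector of an acyclic path (a list of state-action indices)
    mu = numpy.zeros(n_sa)
    mu[list(path)] = 1.0
    return mu

mu_i = to_mu([0, 2, 4], n_sa)    # tutor sequence mu_i
mu = to_mu([0, 3, 5], n_sa)      # an arbitrary candidate sequence mu

l_i = 1.0 - mu_i                 # 1 at pairs the tutor never visits, 0 elsewhere

print("cost of tutor path:     ", w @ F @ mu_i)
print("cost of candidate path: ", w @ F @ mu)
print("loss L_i(mu) = l_i^T mu:", l_i @ mu)   # counts candidate pairs deviating from the tutor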

Now, the problem is how to obtain w. Recall that we want w to separate the desired paths (\mu_i from \mathcal{D}) from tentative paths (arbitrary \mu \in \mathcal{G}, where \mathcal{G} is the set of all possible sequences in the MDP) by a margin \mathcal{L}_i(\mu)=l_i^T \mu. Therefore, the constraints we would like to impose are: for each \mu_i and each \mu \in \mathcal{G}, w^T F \mu_i \leq w^T F \mu - l_i^T \mu. This introduces a huge number of constraints when \mathcal{G} is large, so we reduce them by rewriting: for each \mu_i, w^T F \mu_i \leq \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\}. Intuitively, if w can distinguish \mu_i from the loss-augmented cost of the hardest tentative path in \mathcal{G}, it can distinguish \mu_i from any other tentative path. We also regularize the norm of w: without norm regularization, w could always be scaled arbitrarily large to overcome the margin constraints, defeating the purpose of adding margins. Moreover, we introduce a slack term \xi_i \geq 0 for each training example, which allows some margin constraints to be violated, because on real data there may be no w that satisfies all of them perfectly.

Putting all of the above together, we arrive at the optimization objective:

$latex \min\limits_{w \in \mathcal{W}, \xi_i \in \mathbb{R}_+} \frac{1}{N} \sum\limits_{i=1}^N \xi_i + \frac{\lambda}{2} \left\Vert w \right\Vert^2 \newline \text{s.t. } \forall i, \; w^T F \mu_i \leq \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\} + \xi_i &s=2$

We note that \xi_i appears in both the objective and the constraint, and we want it to be as small as possible. Therefore, at the minimizer the slack variable exactly equals the constraint violation: \xi_i = w^T F \mu_i - \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\}. This lets us rewrite the constrained problem as an unconstrained objective:

$latex R(w) = \frac{1}{N} \sum\limits^N_{i=1} \big( w^T F \mu_i - \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\} \big) + \frac{\lambda}{2} \left\Vert w \right\Vert^2 &s=2$

Now, R(w) is a non-differentiable (because of the \min) but convex objective. The author of [4] proposes to optimize it with the subgradient method.

In my mind, the subgradient method is really similar to the ordinary gradient descent method. The gradient \nabla f(x) of a differentiable convex function f(x) at a point x satisfies f(y) \geq f(x) + \nabla f(x)^T (y-x) for all y \in \text{dom } f. This means the linear function f(x) + \nabla f(x)^T (y-x), anchored at x, always stays at or below f(y) [7]:

Screenshot from 2017-03-30 09:24:17

The subgradient equals the gradient wherever f is differentiable. At a non-differentiable point x, a subgradient g can be any vector satisfying g^T(y-x) \leq f(y) - f(x) for all y \in \text{dom } f:

Screenshot from 2017-03-30 09:41:50
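
A quick concrete example (my own, not from [7]): for $latex f(x) = |x|$, the subgradient at any $latex x \neq 0$ is just the derivative $latex \text{sign}(x)$, while at $latex x = 0$ every $latex g \in [-1, 1]$ satisfies $latex g \cdot (y - 0) \leq |y| - |0|$ for all $latex y$, so the entire interval $latex [-1, 1]$ is a valid set of subgradients at 0.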

Now let’s look at the part -\min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\} in R(w). It can be rewritten as \max\limits_{\mu \in \mathcal{G}} \{-w^T F \mu + l_i^T \mu\}, i.e., a max over the affine functions \{-w^T F \mu + l_i^T \mu\} for every \mu \in \mathcal{G}, each shown as a dashed line in the figure below:

Screenshot from 2017-03-30 09:46:16

The bold line is \max\limits_{\mu \in \mathcal{G}} \{-w^T F \mu + l_i^T \mu\}. The red line is the subgradient at the specific value of w marked by the vertical dashed line. Therefore, a subgradient of R(w) is:

$latex \nabla R(w) = \frac{1}{N} \sum\limits_{i=1}^N F(\mu_i - \mu_i^*) + \lambda w &s=2$

where \mu_i^* = argmin_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu \}.

Now, w can be updated online, i.e., updated after seeing the examples one by one:

$latex w_{t+1} = \mathcal{P}_{\mathcal{W}} \big[ w_t - \alpha \big(F(\mu_i - \mu_i^*) + \lambda w_t\big)\big] &s=2$

where \mathcal{P}_{\mathcal{W}}(\cdot) is a projection operator that maps w back onto the feasible region \mathcal{W}; for example, sometimes we even require all costs to be positive. How to do the projection is something I don't fully understand in [4]. I hope to illustrate it once I am crystal clear about it.
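
To make the update concrete, here is a minimal sketch of the online subgradient step under my own simplifying assumptions: the loss-augmented minimization is done by brute force over a small list of candidate paths (standing in for a planner such as A* or Dijkstra), and \mathcal{P}_{\mathcal{W}} is illustrated by projection onto the non-negative orthant. This is not the implementation from [2] or [4]; the names solve_loss_augmented_path and project_nonnegative are hypothetical.

# toy sketch of MMP's online subgradient update (not the paper's implementation)
import numpy

def solve_loss_augmented_path(w, F, l_i, candidate_paths):
    # stand-in for the planner: return the mu minimizing w^T F mu - l_i^T mu
    costs = [w @ F @ mu - l_i @ mu for mu in candidate_paths]
    return candidate_paths[int(numpy.argmin(costs))]

def project_nonnegative(w):
    # one example of P_W: clip the weights to the non-negative orthant
    return numpy.maximum(w, 0.0)

def mmp_online_update(w, F, mu_i, l_i, candidate_paths, alpha=0.1, lam=0.01):
    mu_star = solve_loss_augmented_path(w, F, l_i, candidate_paths)
    subgrad = F @ (mu_i - mu_star) + lam * w    # subgradient contribution of example i
    return project_nonnegative(w - alpha * subgrad)

In practice \mu_i^* comes from running a path-finding algorithm on the loss-augmented costs rather than from enumerating candidate paths, and the right choice of \mathcal{P}_{\mathcal{W}} depends on which constraints one actually imposes on w.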

Reference

[1] Apprenticeship Learning via Inverse Reinforcement Learning

[2] Maximum Margin Planning

[3] Boosting Structured Prediction for Imitation Learning

[4] Learning to Search: Structured Prediction Techniques for Imitation Learning (Ratliff’s thesis)

[5] Max Margin Learning (Youtube video)

[6] Max Margin Planning Slides

[7] Subgradient slides

2018.4.8

This seems to be a good tutorial (https://thinkingwires.com/posts/2018-02-13-irl-tutorial-1.html) for explaining [8].

[8] Algorithms for Inverse Reinforcement Learning