My understanding of Bayesian Optimization

In this post, I am going to share my understanding of Bayesian Optimization. Bayesian Optimization is closely related to Gaussian Processes, a probabilistic model explained in detail in the textbook "Gaussian Processes for Machine Learning" [1]. In the following, "p. " refers to the specific page in [1] where the content comes from.

First, let’s dive into the detail of Gaussian Process. 

We need a new mental model: think of any function $latex f(x)$ as a very long vector, each entry of the vector specifying the function value $latex f(x)$ at a particular input $latex x$ (p. 2). Of course, in theory, this vector will be infinitely long because we can measure $latex f(x)$ at arbitrarily fine precision. Let's make it clear that $latex x$ is the input of $latex f(x)$ and it has $latex d$ dimensions: $latex x \in \mathbb{R}^d$. Usually, we let $latex x$ be a column vector.

One step further, we need to think of the function vector as a vector of random variables. Each entry of the vector is a random variable, meaning that there is a probability distribution over $latex f(x)$ at every specific $latex x$. Since they are all random variables, any finite number of the entries can be selected to form a joint probability distribution.

Let $latex X \in \mathbb{R}^{n \times d}$ be the matrix aggregating the inputs $latex x$ corresponding to the picked entries. Let $latex f=\{f(X_{i})\}_{i=1}^n$ be the vector of function values at the different $latex x$ in $latex X$. If we assume $latex f(x)$ follows a Gaussian Process (p. 13), then the joint probability distribution of $latex f$ (induced by any $latex X$) will be a Gaussian distribution:

$latex f \sim \mathcal{N}(m(X), K(X, X)), &s=2$

where $latex m(X) \in \mathbb{R}^n$ is the mean function vector, each entry of which is defined as $latex m(x)=\mathbb{E}[f(x)]$; and $latex K(X, X) \in \mathbb{R}^{n \times n}$ is the covariance matrix of the Gaussian distribution. The entry in the $latex p$-th row and $latex q$-th column, $latex K_{pq}(X, X)$, is defined by a covariance function between $latex x_p$ and $latex x_q$: $latex k(x_p, x_q)=\mathbb{E}[(f(x_p)-m(x_p))(f(x_q)-m(x_q))]$.

For such an $latex f(x)$ following a Gaussian Process, we write $latex f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, read as "$latex f(x)$ follows a Gaussian Process governed by a mean function $latex m(x)$ and a covariance function $latex k(x,x')$".

Let's now look at a Bayesian linear regression model $latex f(x)=\phi(x)^T w$, with $latex \phi(x)$ a feature transformation that maps $latex x$ into a high-dimensional space and a prior distribution $latex w \sim \mathcal{N}(0, \Sigma_p)$.

For any $latex x$, $latex \mathbb{E}[f(x)] = \phi(x)^T \mathbb{E}[w] = 0$. For any $latex x$ and $latex x'$, $latex \mathbb{E}[(f(x)-m(x))(f(x')-m(x'))] = \phi(x)^T \mathbb{E}[w w^T]\phi(x') = \phi(x)^T \Sigma_p \phi(x')$ (p. 14). In fact, for any finite number of inputs $latex x_1, x_2, \cdots, x_n$, the function values $latex f(x_1), f(x_2), \cdots, f(x_n)$ form a joint Gaussian distribution with zero mean and covariance given by $latex k(x, x')=\phi(x)^T \Sigma_p \phi(x')$: $latex f \sim \mathcal{N}(0, K(X, X))$. Therefore, we can say that $latex f(x)$ in the Bayesian linear regression model follows a Gaussian Process.

If we set $latex k(x, x')=\phi(x)^T \Sigma_p \phi(x') = \exp(-\frac{1}{2}\|x - x'\|^2)$, then this corresponds to a $latex \phi(x)$ that maps $latex x$ into an infinite-dimensional feature space. We call this covariance function the squared exponential (SE) or radial basis function (RBF); please review [2] for the kernel methods introduced with SVMs. There of course exist many other covariance functions. Different covariance functions govern the shape of the joint Gaussian distribution of $latex f$ (and also the conditional distribution of new $latex f$ given observations, which we will discuss next). From now on, we will assume we use SE as the covariance function.

Since we have $latex f \sim \mathcal{N}(0, K(X, X))$ as the joint distribution for any finite number of $latex x$, we can draw multivariate samples from the joint distribution. Figure 2.2 (a) (p. 15) shows three drawn samples: first, we prepare a series of equidistant $latex x$, say, -5.0, -4.9, -4.8, …, 4.8, 4.9, 5.0, which can be aggregated into a matrix $latex X$; then, we draw $latex f(x)$ according to $latex f \sim \mathcal{N}(0, K(X, X))$. By connecting the generated points $latex (x, f(x))$ we get three functions (the red line, the green line, and the dotted blue line). The gray area represents the pointwise mean plus and minus two times the standard deviation at each input value. Because we know $latex m(x)=0$ and $latex k(x,x)=1$ for all $latex x$, the gray area is actually a rectangle between 2 and -2.

[Figure 2.2 (a) of [1]: three functions drawn at random from the GP prior; the gray area is the pointwise mean plus and minus two standard deviations]
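
To make this concrete, below is a minimal sketch of drawing functions from the GP prior with the SE covariance (my own illustration with numpy, not code from [1]; the grid, jitter value, and number of samples are arbitrary choices):

import numpy as np

def se_kernel(X1, X2):
    """Squared exponential covariance k(x, x') = exp(-0.5 * ||x - x'||^2)."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2)

# Equidistant inputs from -5 to 5, aggregated into a matrix X, as in Figure 2.2 (a).
X = np.linspace(-5, 5, 101).reshape(-1, 1)
K = se_kernel(X, X) + 1e-8 * np.eye(len(X))   # small jitter for numerical stability

# Draw three sample functions f ~ N(0, K(X, X)) and connect the points (x, f(x)).
samples = np.random.multivariate_normal(mean=np.zeros(len(X)), cov=K, size=3)
print(samples.shape)   # (3, 101): three functions evaluated on the grid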

So far we have only been dealing with a Gaussian Process prior, because we have not observed any data and therefore no additional information can be utilized. The most interesting part comes when we have some (for now, assumed noiseless) observations $latex \{(x_i, f_i)|i=1,\cdots,n\}$ and we want to use them to make more accurate guesses about the function values at new inputs $latex x$. We will use $latex f$ to denote the observed outputs, $latex f_*$ for the test outputs, and $latex X_*$ for the aggregated matrix of test inputs. Because $latex f(x)$ follows a Gaussian Process, $latex [f, f_*]$ still forms a joint Gaussian distribution (p. 15):

$latex \begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$

The most important part is that we want to infer future $latex f_*$ based on observed $latex f$. Therefore, we shouldn’t merely look at the joint distribution but instead should focus on the conditional distribution $latex f_*|X_*, X, f$, which can also be called the posterior distribution. It turns out that the posterior distribution is still a Gaussian distribution (p. 16):

$latex f_*|X_*, X, f \sim \mathcal{N}\big(K(X_*, X)K(X, X)^{-1}f, \; K(X_*, X_*) - K(X_*, X)K(X, X)^{-1}K(X, X_*)\big) &s=2$

Figure 2.2 (b) (p. 15) shows three randomly drawn functions obtained in a similar manner as in Figure 2.2 (a), but now based on the posterior distribution: first, we prepare a series of equidistant $latex x_*$, say, -5.0, -4.9, -4.8, …, 4.8, 4.9, 5.0, which can be aggregated into a matrix $latex X_*$; then, we draw $latex f_*$ according to $latex f_*|X_*,X,f$. By connecting the generated points $latex (x_*, f_*)$ we get three functions (the red line, the green line, and the blue line). Note that the gray area is no longer of constant width: close to the observations the pointwise standard deviation is small, while in areas with fewer observations it is larger, indicating less confidence:

[Figure 2.2 (b) of [1]: three functions drawn at random from the GP posterior conditioned on the observations; the gray area is the pointwise mean plus and minus two standard deviations]

[1] then continues to the situation where our observations are noisy, i.e., we observe $latex y = f(x) + \epsilon$ with $latex \epsilon \sim \mathcal{N}(0, \sigma_n^2)$. In this situation, the posterior distribution becomes (p. 16):

$latex f_*|X, y, X_* \sim \mathcal{N}(\bar{f}_*, \text{cov}(f_*)), \quad \bar{f}_* = K(X_*, X)[K(X, X) + \sigma_n^2 I]^{-1}y, \quad \text{cov}(f_*) = K(X_*, X_*) - K(X_*, X)[K(X, X) + \sigma_n^2 I]^{-1}K(X, X_*) &s=2$

In the case that there is only one test point $latex x_*$, we write $latex k(x_*)=k_*$ to denote the vector of covariances between the test point and the $latex n$ training points and we have:

$latex \bar{f}_* = k_*^T(K + \sigma_n^2 I)^{-1}y, \quad \mathbb{V}[f_*] = k(x_*, x_*) - k_*^T(K + \sigma_n^2 I)^{-1}k_*, &s=2$

where $latex K = K(X, X)$.
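
As a minimal sketch of these two formulas (my own numpy illustration, not code from [1]; the training data and noise level are made up), the posterior mean and variance at a single test point can be computed as:

import numpy as np

def se_kernel(X1, X2):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2)

# Made-up noisy observations y = f(x) + eps.
X = np.array([[-4.0], [-2.0], [0.0], [1.5], [3.0]])
y = np.sin(X).ravel() + 0.1 * np.random.randn(len(X))
sigma_n = 0.1

K = se_kernel(X, X)
x_star = np.array([[0.5]])
k_star = se_kernel(X, x_star).ravel()   # covariances between x_* and the n training points

A_inv = np.linalg.inv(K + sigma_n**2 * np.eye(len(X)))
mean_star = k_star @ A_inv @ y                                        # Eqn. 2.25
var_star = se_kernel(x_star, x_star)[0, 0] - k_star @ A_inv @ k_star  # Eqn. 2.26
print(mean_star, var_star)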

So far we've derived the posterior distribution from a function-space view. What is interesting is that the same posterior distribution can be obtained in a purely probabilistic way (i.e., the weight-space view), which we are more familiar with, without thinking of functions in a function space.

It starts from Bayes' theorem:

$latex p(w|y,X) = \frac{p(y|X,w)p(w)}{p(y|X)} \propto p(y|X,w) p(w) &s=2$

 

In a Bayesian linear model, we assume $latex y=f(x) + \epsilon$, $latex \epsilon \sim \mathcal{N}(0, \sigma_n^2)$ and $latex w \sim \mathcal{N}(0, \Sigma_p)$. Therefore,

[Equations from [1]: the Gaussian likelihood $latex p(y|X, w)$ and the resulting Gaussian posterior over the weights, $latex p(w|X, y)$]

The posterior distribution given observation $latex X$, $latex y$ and new $latex x_*$ becomes (p. 11):

[Equation from [1] (p. 11): the predictive distribution $latex p(f_*|x_*, X, y)$ of the Bayesian linear model, a Gaussian obtained by averaging the model over the weight posterior]

To empower a linear model, we can use a feature transformation $latex \phi(x)$ to transform $latex x$ into a high-dimensional space and apply the linear model in that space: $latex f(x) = \phi(x)^T w$. The posterior distribution changes accordingly (p. 12):

[Eqn. 2.12 from [1]: the predictive distribution $latex f_*|x_*, X, y$ for the Bayesian linear model in the transformed feature space; its mean and covariance involve the features only through $latex \phi_*^T\Sigma_p\phi$, $latex \phi_*^T\Sigma_p\phi_*$ and $latex \phi^T\Sigma_p\phi$]

In Eqn. 2.12, the transformed features only enter in the form of $latex \phi_*^T\Sigma_p\phi$, $latex \phi_*^T\Sigma_p\phi_*$ and $latex \phi^T\Sigma_p\phi$. This can again be thought of as a kernel: $latex k(x,x')=\phi(x)^T\Sigma_p\phi(x')$. Therefore, we arrive at the same posterior distribution as in Eqn. 2.22~2.24.

This finishes our introduction to Gaussian Processes. Now we introduce Bayesian Optimization, which is mostly used for black-box optimization:

$latex \max_x f(x), &s=2$

where the objective function $latex f(x)$ is assumed to have no known analytic expression and to be expensive to evaluate. The idea is to iteratively guess the next promising point to evaluate, based on the existing observations, until the (approximate) optimum is found [3]. The guess is made with a constructed acquisition function, which has a closed form and is much cheaper to evaluate than the original objective function. People have proposed many forms of acquisition function; one thing for sure is that they all rely on the posterior distribution $latex f_*|X_*, X, f$. The link between Gaussian Processes and Bayesian Optimization is that Bayesian Optimization assumes the black-box function to be maximized follows a Gaussian Process, so the posterior distribution $latex f_*|X_*, X, f$ is a Gaussian distribution that we know how to express analytically.

[Figure: an illustration of Bayesian Optimization, showing the GP posterior over the objective together with an acquisition function that suggests where to sample next]

Intuitively, a good acquisition function should not only favor the area where the posterior mean is high (so that it can exploit the learned information), but also favor the area with much uncertainty (so that further exploration can be performed). One popular such acquisition function is called Expected Improvement (EI):

$latex EI(x) = \mathbb{E}[\max\{0, f_{t+1}(x) - f(x^+)\}], &s=2$

where $latex x^+=\arg\max_{x_i \in x_{1:t}} f(x_i)$. Interpreted in English, this expression says that we want to find the next $latex x$ that has the largest expected improvement over the best point $latex x^+$ found so far. If we are unsure about $latex f(x)$ at some $latex x$, $latex EI(x)$ will also be large there.

If we denote the pointwise mean and standard deviation of the posterior distribution of $latex f_{t+1}(x)$ as $latex \mu(x)=\bar{f}_*$ and $latex \sigma^2(x)= \mathbb{V}[f_*]$ [see Eqn. 2.25 and 2.26 (p. 17)], and define the improvement random variable $latex I=\max\{0, f_{t+1}(x)-f(x^+)\}$ so that $latex EI(x)=\mathbb{E}[I]$, we will have the closed form:

$latex EI(x) = \begin{cases} (\mu(x) - f(x^+))\Phi(Z) + \sigma(x)\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases}, \qquad Z = \frac{\mu(x) - f(x^+)}{\sigma(x)},$

where $latex \Phi$ and $latex \phi$ here denote the CDF and PDF of the standard normal distribution.
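
As a minimal sketch of this closed form (my own illustration with numpy/scipy, not code from [3]; the candidate grid and the posterior statistics are made up and would come from the GP posterior in practice), computing EI and picking the next point to evaluate could look like:

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for maximization, given posterior mean/std and the best value so far."""
    zero = sigma <= 0
    safe_sigma = np.where(zero, 1.0, sigma)
    z = (mu - f_best) / safe_sigma
    ei = (mu - f_best) * norm.cdf(z) + safe_sigma * norm.pdf(z)
    return np.where(zero, 0.0, ei)

# Example usage with made-up posterior statistics over a candidate grid.
candidates = np.linspace(-5, 5, 1001)
mu = np.sin(candidates)                  # pretend GP posterior mean at the candidates
sigma = 0.5 * np.ones_like(candidates)   # pretend GP posterior standard deviation
f_best = 0.8                             # best observed f(x^+) so far
next_x = candidates[np.argmax(expected_improvement(mu, sigma, f_best))]
print(next_x)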

We now know how to compute the acquisition function $latex EI(x)$. The only remaining question is how to pick the next $latex x$, i.e., how to maximize the acquisition function in practice. This remains a practical problem that I will investigate in the future. But the whole idea of Bayesian Optimization should be clear by now.

 

Reference

[1] http://www.gaussianprocess.org/gpml/

[2] https://czxttkl.com/?p=3114

[3] A Tutorial on Bayesian Optimization of Expensive Cost Functions, with application to Active User Modeling and Hierarchical Reinforcement Learning

[4] Bayesian Optimization with Inequality Constraints

Two informal posts:

[] https://arimo.com/data-science/2016/bayesian-optimization-hyperparameter-tuning/

[] https://thuijskens.github.io/2016/12/29/bayesian-optimisation/

Combinatorial Optimization using Pointer Network (Code Walkthrough)

In this post, I am going to walk through an online piece of code [7] which implements the idea of [1]: using a pointer network [2] to solve the travelling salesman problem. Pointer networks, in my understanding, are neural network architectures for problems whose output sequences are permutations of the input sequences.

Some background posts you may want to read first:

[3]: policy gradient such as REINFORCE

[4]: LSTM code walk through

[5, 6]: travelling salesman problem

The code is based on [7] but modified in several small places to work with Python 3.6, torch 0.3.0 and CUDA 8.0.

import math
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader
from IPython.display import clear_output
from tqdm import tqdm
import matplotlib.pyplot as plt

USE_CUDA = True


class TSPDataset(Dataset):
    def __init__(self, num_nodes, num_samples, random_seed=111):
        super(TSPDataset, self).__init__()
        torch.manual_seed(random_seed)

        self.data_set = []
        for l in tqdm(range(num_samples)):
            x = torch.FloatTensor(2, num_nodes).uniform_(0, 1)
            self.data_set.append(x)

        self.size = len(self.data_set)

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return self.data_set[idx]


def reward(sample_solution):
    """
    Args:
        sample_solution: a list of length seq_len, each element of size [batch_size x 2]
    """
    batch_size = sample_solution[0].size(0)
    n = len(sample_solution)
    tour_len = Variable(torch.zeros([batch_size]))

    if USE_CUDA:
        tour_len = tour_len.cuda()

    for i in range(n - 1):
        tour_len += torch.norm(sample_solution[i] - sample_solution[i + 1], dim=1)

    tour_len += torch.norm(sample_solution[n - 1] - sample_solution[0], dim=1)

    return tour_len


class Attention(nn.Module):
    def __init__(self, hidden_size, use_tanh=False, C=10, name='Bahdanau'):
        super(Attention, self).__init__()

        self.use_tanh = use_tanh
        self.C = C
        self.name = name

        if name == 'Bahdanau':
            self.W_query = nn.Linear(hidden_size, hidden_size)
            self.W_ref = nn.Conv1d(hidden_size, hidden_size, 1, 1)

            V = torch.FloatTensor(hidden_size)
            if USE_CUDA:
                V = V.cuda()
            self.V = nn.Parameter(V)
            self.V.data.uniform_(-(1. / math.sqrt(hidden_size)), 1. / math.sqrt(hidden_size))

    def forward(self, query, ref):
        """
        Args:
            query: [batch_size x hidden_size]
            ref:   [batch_size x seq_len x hidden_size]
        """

        batch_size = ref.size(0)
        seq_len = ref.size(1)

        if self.name == 'Bahdanau':
            ref = ref.permute(0, 2, 1)
            query = self.W_query(query).unsqueeze(2)  # [batch_size x hidden_size x 1]
            ref = self.W_ref(ref)  # [batch_size x hidden_size x seq_len]
            expanded_query = query.repeat(1, 1, seq_len)  # [batch_size x hidden_size x seq_len]
            V = self.V.unsqueeze(0).unsqueeze(0).repeat(batch_size, 1, 1)  # [batch_size x 1 x hidden_size]
            logits = torch.bmm(V, F.tanh(expanded_query + ref)).squeeze(1)

        elif self.name == 'Dot':
            query = query.unsqueeze(2)
            logits = torch.bmm(ref, query).squeeze(2)  # [batch_size x seq_len]
            ref = ref.permute(0, 2, 1)

        else:
            raise NotImplementedError

        if self.use_tanh:
            logits = self.C * F.tanh(logits)
        else:
            logits = logits
        return ref, logits


class GraphEmbedding(nn.Module):
    def __init__(self, input_size, embedding_size):
        super(GraphEmbedding, self).__init__()
        self.embedding_size = embedding_size
        self.embedding = torch.FloatTensor(input_size, embedding_size)
        if USE_CUDA:
            self.embedding = self.embedding.cuda()
        self.embedding = nn.Parameter(self.embedding)
        self.embedding.data.uniform_(-(1. / math.sqrt(embedding_size)), 1. / math.sqrt(embedding_size))

    def forward(self, inputs):
        batch_size = inputs.size(0)
        seq_len = inputs.size(2)
        embedding = self.embedding.repeat(batch_size, 1, 1)
        embedded = []
        inputs = inputs.unsqueeze(1)
        for i in range(seq_len):
            a = torch.bmm(inputs[:, :, :, i].float(), embedding)
            embedded.append(a)
        embedded = torch.cat(embedded, 1)
        return embedded


class PointerNet(nn.Module):
    def __init__(self,
                 embedding_size,
                 hidden_size,
                 seq_len,
                 n_glimpses,
                 tanh_exploration,
                 use_tanh,
                 attention):
        super(PointerNet, self).__init__()

        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.n_glimpses = n_glimpses
        self.seq_len = seq_len

        self.embedding = GraphEmbedding(2, embedding_size)
        self.encoder = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.pointer = Attention(hidden_size, use_tanh=use_tanh, C=tanh_exploration, name=attention)
        self.glimpse = Attention(hidden_size, use_tanh=False, name=attention)

        self.decoder_start_input = nn.Parameter(torch.FloatTensor(embedding_size))
        self.decoder_start_input.data.uniform_(-(1. / math.sqrt(embedding_size)), 1. / math.sqrt(embedding_size))

    def apply_mask_to_logits(self, logits, mask, idxs):
        batch_size = logits.size(0)
        clone_mask = mask.clone()

        if idxs is not None:
            clone_mask[[i for i in range(batch_size)], idxs.data] = 1
            logits[clone_mask] = -np.inf
        return logits, clone_mask

    def forward(self, inputs):
        """
        Args:
            inputs: [batch_size x 2 x sourceL]
        """
        batch_size = inputs.size(0)
        seq_len = inputs.size(2)
        assert seq_len == self.seq_len

        embedded = self.embedding(inputs)
        encoder_outputs, (hidden, context) = self.encoder(embedded)

        prev_probs = []
        prev_idxs = []
        mask = torch.zeros(batch_size, seq_len).byte()
        if USE_CUDA:
            mask = mask.cuda()

        idxs = None

        decoder_input = self.decoder_start_input.unsqueeze(0).repeat(batch_size, 1)

        for i in range(seq_len):

            _, (hidden, context) = self.decoder(decoder_input.unsqueeze(1), (hidden, context))

            query = hidden.squeeze(0)
            for i in range(self.n_glimpses):
                ref, logits = self.glimpse(query, encoder_outputs)
                logits, mask = self.apply_mask_to_logits(logits, mask, idxs)
                query = torch.bmm(ref, F.softmax(logits).unsqueeze(2)).squeeze(2)

            _, logits = self.pointer(query, encoder_outputs)
            logits, mask = self.apply_mask_to_logits(logits, mask, idxs)
            probs = F.softmax(logits)

            idxs = probs.multinomial().squeeze(1)
            for old_idxs in prev_idxs:
                if old_idxs.eq(idxs).data.any():
                    print(seq_len)
                    print(' RESAMPLE!')
                    idxs = probs.multinomial().squeeze(1)
                    break
            decoder_input = embedded[[i for i in range(batch_size)], idxs.data, :]

            prev_probs.append(probs)
            prev_idxs.append(idxs)

        return prev_probs, prev_idxs


class CombinatorialRL(nn.Module):
    def __init__(self,
                 embedding_size,
                 hidden_size,
                 seq_len,
                 n_glimpses,
                 tanh_exploration,
                 use_tanh,
                 reward,
                 attention):
        super(CombinatorialRL, self).__init__()
        self.reward = reward

        self.actor = PointerNet(
            embedding_size,
            hidden_size,
            seq_len,
            n_glimpses,
            tanh_exploration,
            use_tanh,
            attention)

        if USE_CUDA:
            self.actor = self.actor.cuda()

    def forward(self, inputs):
        """
        Args:
            inputs: [batch_size, input_size, seq_len]
        """
        batch_size = inputs.size(0)
        input_size = inputs.size(1)
        seq_len = inputs.size(2)

        probs, action_idxs = self.actor(inputs)

        actions = []
        inputs = inputs.transpose(1, 2)
        for action_id in action_idxs:
            actions.append(inputs[[x for x in range(batch_size)], action_id.data, :])

        action_probs = []
        for prob, action_id in zip(probs, action_idxs):
            action_probs.append(prob[[x for x in range(batch_size)], action_id.data])

        R = self.reward(actions)

        return R, action_probs, actions, action_idxs


class TrainModel:
    def __init__(self, model, train_dataset, val_dataset, batch_size=128, threshold=None, max_grad_norm=2.):
        self.model = model
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.batch_size = batch_size
        self.threshold = threshold

        self.train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=1)
        self.val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, num_workers=1)

        self.actor_optim = optim.Adam(model.actor.parameters(), lr=1e-4)
        self.max_grad_norm = max_grad_norm

        self.train_tour = []
        self.val_tour = []

        self.epochs = 0

    def train_and_validate(self, n_epochs):
        critic_exp_mvg_avg = torch.zeros(1)
        if USE_CUDA:
            critic_exp_mvg_avg = critic_exp_mvg_avg.cuda()

        for epoch in range(n_epochs):
            for batch_id, sample_batch in enumerate(self.train_loader):
                self.model.train()

                inputs = Variable(sample_batch)
                if USE_CUDA:
                    inputs = inputs.cuda()

                R, probs, actions, actions_idxs = self.model(inputs)

                if batch_id == 0:
                    critic_exp_mvg_avg = R.mean()
                else:
                    critic_exp_mvg_avg = (critic_exp_mvg_avg * beta) + ((1. - beta) * R.mean())

                advantage = R - critic_exp_mvg_avg

                logprobs = 0
                for prob in probs:
                    logprob = torch.log(prob)
                    logprobs += logprob
                logprobs[(logprobs < -1000).data] = 0.

                reinforce = advantage * logprobs
                actor_loss = reinforce.mean()

                self.actor_optim.zero_grad()
                actor_loss.backward()
                torch.nn.utils.clip_grad_norm(self.model.actor.parameters(),
                                              float(self.max_grad_norm), norm_type=2)

                self.actor_optim.step()

                # critic_exp_mvg_avg = critic_exp_mvg_avg.detach()

                self.train_tour.append(R.mean().data[0])

                if batch_id % 10 == 0:
                    self.plot(self.epochs)

                if batch_id % 100 == 0:

                    self.model.eval()
                    for val_batch in self.val_loader:
                        inputs = Variable(val_batch)
                        if USE_CUDA:
                            inputs = inputs.cuda()

                        R, probs, actions, actions_idxs = self.model(inputs)
                        self.val_tour.append(R.mean().data[0])

            if self.threshold and self.train_tour[-1] < self.threshold:
                print("EARLY STOPPAGE!")
                break

            self.epochs += 1

    def plot(self, epoch):
        clear_output(True)
        plt.figure(figsize=(20, 5))
        plt.subplot(131)
        plt.title('train tour length: epoch %s reward %s' % (
        epoch, self.train_tour[-1] if len(self.train_tour) else 'collecting'))
        plt.plot(self.train_tour)
        plt.grid()
        plt.subplot(132)
        plt.title(
            'val tour length: epoch %s reward %s' % (epoch, self.val_tour[-1] if len(self.val_tour) else 'collecting'))
        plt.plot(self.val_tour)
        plt.grid()
        # plt.show()
        plt.savefig('output.png')
        print('train tour length: epoch %s reward %s' % (
            epoch, self.train_tour[-1] if len(self.train_tour) else 'collecting'))
        print(
            'val tour length: epoch %s reward %s' % (epoch, self.val_tour[-1] if len(self.val_tour) else 'collecting'))
        print('\n')


if __name__ == '__main__':
    train_size = 1000000
    val_size = 10000
    seq_len = 20

    train_20_dataset = TSPDataset(seq_len, train_size)
    val_20_dataset = TSPDataset(seq_len, val_size)

    embedding_size = 128
    hidden_size = 128
    batch_size = 64
    n_glimpses = 1
    tanh_exploration = 10
    use_tanh = True

    beta = 0.9
    max_grad_norm = 2.

    tsp_20_model = CombinatorialRL(
        embedding_size,
        hidden_size,
        seq_len,
        n_glimpses,
        tanh_exploration,
        use_tanh,
        reward,
        attention="Dot")

    tsp_20_train = TrainModel(tsp_20_model,
                              train_20_dataset,
                              val_20_dataset,
                              threshold=3.99,
                              batch_size=batch_size)

    tsp_20_train.train_and_validate(5)

 

Basically, we should start from line 368, where the main function starts (line numbers here refer to the code listing above, counting import math as line 1). In this code, we focus on travelling salesman problems of size 20 on a 2D plane. That means each data point of the training/validation dataset is a sequence of 20 2D points. Our goal is to search for an ordering of the given points such that the round trip through them has the shortest total distance. Lines 373 and 374 initialize the described datasets. In the following lines (375-383), we set up the model hyperparameters.

1. embedding_size is the dimension of the transformed data from the original 2D data points. I think the idea here is to project the 2D data into a higher-dimensional space so that richer information can be represented, possibly like devising a kernel in SVM [8] or like word embeddings in NLP [9]. In line 173 (embedded = self.embedding(inputs)), you can see that inputs to the pointer network must go through the embedding transformation first. (inputs size: batch_size x 2 x seq_len, embedded size: batch_size x seq_len x embedding_size)

2. n_glimpses is the number of glimpses to be performed in each decoder step (explained more below). However, since the author mentions in A.1 of [1] that "We observed empirically that glimpsing more than once with the same parameters made the model less likely to learn and barely improved the results", n_glimpses is simply set to 1.

3. tanh_exploration and use_tanh correspond to Eqn. 16 in A.2 of [1], which is claimed to "help with exploration and yield marginal performance gains".

We list the mapping between the paper's parameters and the code's parameters:

| paper parameter | note | code parameter |
|---|---|---|
| $latex x$ | input sequence of $latex n$ cities | inputs |
| $latex \pi$ | output sequence, a permutation of $latex x$ | actions |
| $latex n$ | sequence length, the number of cities | seq_len |
| $latex d$ | the size of the hidden units | hidden_size |
| $latex B$ | the size of the Monte Carlo sample for the policy gradient | batch_size |
| $latex k$ | the number of encoder vectors referenced in the attention/glimpse model; usually the same as $latex n$ | |

 

After the last line, which calls tsp_20_train.train_and_validate(5), we come to the real meat and potatoes (line 284 to line 344):

a. at lines 293~295, we get a batch of inputs (inputs size: batch_size x 2 x seq_len)

b. at line 297 (R, probs, actions, actions_idxs = self.model(inputs)), we pass inputs through the pointer network. We will talk about how the pointer network works later. For now, we just care about the workflow of TrainModel, which is essentially training a policy-gradient model in an actor-critic style.

Policy gradient is a family of methods whose objective is to directly optimize the rewards generated by a parameterized policy. More details can be found in [3]. The paper gives the formulation under the REINFORCE algorithm [10]:

[Equations from [1]: the expected tour length objective $latex J(\theta|s)=\mathbb{E}_{\pi \sim p_\theta(\cdot|s)}[L(\pi|s)]$ and its REINFORCE gradient with a baseline $latex b(s)$ (Eqn. 3 and 4)]

And the gradient of the policy (Eqn.4) can be approximated by the Monte Carlo method:

$latex \nabla_\theta J(\theta) \approx \frac{1}{B}\sum\limits_{i=1}^B (L(\pi_i|s_i) - b(s_i)) \nabla_\theta \log p_\theta(\pi_i|s_i) \;\; (5)&s=2$

We can rewrite Eqn. 5 to be more verbose, but more aligned with the code:

$latex \nabla_\theta J(\theta) \approx \frac{1}{B}\sum\limits_{i=1}^B \nabla_\theta \sum\limits_{j=1}^n (L(\pi_i|s_i) - b(s_i)) \log p_\theta(\pi_i(j)|s_i) \;\; (5^*)&s=2$

Essentially, we expand $latex \log p_\theta(\pi_i | s_i)$ into $latex \sum\limits_{j=1}^n \log p_\theta(\pi_i(j) | s_i)$.

Since a learning framework like PyTorch computes the gradient automatically for us, we only need to compute $latex \frac{1}{B}\sum\limits_{i=1}^B \sum\limits_{j=1}^n (L(\pi_i|s_i) - b(s_i)) \log p_\theta(\pi_i(j)|s_i)&s=2$ using R and probs (lines 306-320).

| parameter | note | size |
|---|---|---|
| R | rewards, $latex L(\pi_i|s_i)$ | batch_size |
| actions | output data points | seq_len x batch_size x 2 |
| action_idxs | the indices of the input data points in output order | seq_len x batch_size |
| probs | the probability of each $latex p_\theta(\pi_i(j)|s_i)$ | seq_len x batch_size |

 

Now, let's transition to understanding how the pointer network works (lines 164-212). In line 174, self.encoder is an LSTM initialized at line 147. Since it is initialized with batch_first=True, embedded's size is batch_size x seq_len x embedding_size, with batch_size as the first dimension. The output, encoder_outputs, has size batch_size x seq_len x hidden_size (the same shape here, since embedding_size equals hidden_size). (See [12] for PyTorch's LSTM I/O format.)
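
As a quick shape check (a standalone sketch using a recent PyTorch version with made-up sizes; in torch 0.3.0 the tensors would need to be wrapped in Variable):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
embedded = torch.randn(4, 20, 128)            # batch_size x seq_len x embedding_size
encoder_outputs, (hidden, context) = lstm(embedded)
print(encoder_outputs.shape)                  # [4, 20, 128]: batch_size x seq_len x hidden_size
print(hidden.shape, context.shape)            # [1, 4, 128] each: num_layers x batch_size x hidden_size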

We are going to use the plot from the paper to further illustrate how the pointer network works. First, let’s get clear about where encoder_inputs (embedded) and encoder_outputs are.

[Figure from the paper: the encoder LSTM, showing where encoder_inputs (embedded) are fed in and where encoder_outputs are produced]

Each blue block is the LSTM unit, which repeatedly takes one input per step. Since in line 174 self.encoder is called without hidden and context arguments, the hidden and context states for the first input are treated as zero vectors [12]. Besides encoder_outputs, the LSTM also outputs the final hidden and context states:

[Figure from the paper: the encoder also produces the final hidden and context states in addition to encoder_outputs]

After the last input passes through the encoder, the decoder starts. The first input to the decoder, denoted by <g>, is a vector of the same embedding size as the inputs, and it is treated as a trainable parameter. The first hidden and context vectors fed to the decoder are the last hidden and context vectors of the encoder. The decoder outputs hidden and context vectors at each step.

[Figure from the paper: the decoder starts from the trainable input <g>, is initialized with the encoder's final hidden/context states, and outputs hidden and context vectors at each step]

The hidden vector of the decoder, together with encoder_outputs, is used by an attention model to select an output (city) at each step. Before hidden is sent to the attention model, the paper suggests processing it with a technique called "glimpse". The basic idea is to not use hidden directly in the attention model, but to use query=glimpse(hidden, encoder_outputs) $latex \in \mathbb{R}^d$. The author says that "The glimpse function G essentially computes a linear combination of the reference vectors weighted by the attention probabilities". Note two things:

  1. the attention model used to select output cities and the glimpse model are both attention models. The difference between the two is that the glimpse model outputs a vector of $latex d$ dimensions to replace the hidden vector of the decoder, while the attention model used to select output cities outputs a vector of $latex k$ dimensions, which represents the probability distribution over the $latex k$ reference vectors.
  2. Due to the nature of the problem, there cannot be duplicate cities in the output sequence. Therefore we have to manually mask out the cities that have already been output, using the function apply_mask_to_logits.

Finally, it comes to the attention model, which takes as input the query vector and encoder_outputs. As we said, the attention model outputs a probability distribution of length $latex k$, the number of reference vectors. $latex k$ is usually equal to $latex n$ because we reference all the input cities.

[Figure from the paper: the attention model takes the query vector and encoder_outputs and outputs a probability distribution over the $latex k$ reference vectors]

Lines 200~206 finally select the output city according to this probability distribution.

In line 207, we prepare the next input to the decoder, which is the embedding of the city just outputted.

The following table lists the sizes of all the tensors that have appeared:

| parameter | size |
|---|---|
| embedded | batch_size x seq_len x embedding_size |
| encoder_outputs | batch_size x seq_len x hidden_size |
| hidden | 1 x batch_size x hidden_size |
| context | 1 x batch_size x hidden_size |
| query | batch_size x hidden_size |
| logits | batch_size x seq_len |

 

That's pretty much it.

 

Reference:

[1] Neural Combinatorial Optimization With Reinforcement Learning: https://arxiv.org/abs/1611.09940

[2] Pointer Networks: https://arxiv.org/abs/1506.03134

[3] https://czxttkl.com/?p=2812

[4] https://czxttkl.com/?p=1819

[5] https://czxttkl.com/?p=1047

[6] https://czxttkl.com/?p=3109

[7] https://github.com/higgsfield/np-hard-deep-reinforcement-learning

[8] https://czxttkl.com/?p=3114

[9] https://czxttkl.com/?p=2530

[10] Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning: http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf

[12] http://pytorch.org/docs/master/nn.html#torch.nn.LSTM

GLIBC_2.xx not found while installing tensorflow

The error “GLIBC_2.xx not found” is usually seen when someone installs tensorflow on a relatively outdated machine.

Solution

Install Conda.

Details

1. download MiniConda: https://conda.io/miniconda.html

2. install it: 

chmod +x Miniconda3-latest-Linux-x86_64.sh 
./Miniconda3-latest-Linux-x86_64.sh 

3. create a new environment:

conda create -n myenv python=3.6

4.  activate it

source activate myenv

5. install tensorflow using conda (within the environment myenv)

conda install tensorflow

6. list what packages you have installed by: 

conda list

7. done. run any python program using tensorflow within the environment

python xxx.py

Reference

Comments of KrnTneja from https://github.com/tensorflow/tensorflow/issues/527

 

 

Revisit Support Vector Machine

This post reviews the key knowledge about support vector machines (SVM). The post is pretty much based on a series of lectures [1].

Suppose we have data \mathcal{D}, where each point x_i \in \mathcal{D} is a d-dimensional data point, x_i \in \mathbb{R}^d, \forall i=1,\cdots,N, associated with a label y_i \in \{1, -1\}, \;\;\forall i=1, \cdots, N.

The very initial SVM is to come up with a plane w^T x + b = 0 such that the data points can be perfectly linearly divided according to their labels. For a linearly separable dataset, there can exist infinitely many such planes. However, the "best" one, intuitively, should have the largest margin to its closest points. Here, "margin" means the distance from a point to the plane. In the example below, the best separating plane is supposed to be the rightmost one.

[Image: several separating planes for the same data; the rightmost one has the largest margin]

For a data point x_n, the way to calculate its margin (i.e., its distance to the plane w^T x + b=0) is to pick a point x on the plane and then project x_n - x onto w:

[Image: the geometry of the margin, projecting x_n - x onto the direction of w]

Since any point x on the plane satisfies that w^T x + b=0, we can simplify the distance formula:

distance=\frac{1}{\|w\|}|w^T x_n + b|

(From an intuitive perspective, the vector w is perpendicular to the plane w^Tx+b=0. Therefore the distance from x_n to the hyperplane is to find the projection of x_n - x onto the direction w.)
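As a quick numeric check (a made-up example, not from the lectures): take w=(3,4)^T, b=-5 and x_n=(3,3)^T. Then \|w\|=\sqrt{3^2+4^2}=5 and the distance is \frac{1}{5}|3\cdot 3+4\cdot 3-5|=\frac{16}{5}=3.2.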

This is the margin for any data point x_n. However, remember that the ultimate goal is to maximize the margin of the data point that is closest to the separating plane. One can write the objective function like this:

\max_{w, b} \; \min_{x_n}\frac{1}{\|w\|}|w^T x_n + b|

A problem arises here. Even for the same plane, there can be infinitely many pairs (w,b) corresponding to it, all with different scales. For example, the plane w_1^T x + b_1=0 with w_1=(2,1), b_1=3 is the same plane as w_2^T x + b_2 = 0 with w_2=(4,2), b_2=6. Therefore, we need to pre-define a scale.

We propose a normalization standard: normalize w and b such that the closest data point to the plane, x_n, has |w^T x_n + b|=1. Then the distance from the closest data point to the plane becomes \frac{1}{\|w\|}. After this, the objective function becomes easier:

\max_{w, b} \frac{1}{\|w\|} \quad \text{subject to} \quad \min\limits_{n=1,2,\cdots,N} |w^T x_n + b| = 1

Note that for any correctly classified data point x_n, |w^T x_n + b| = y_n (w^T x_n + b). The constraint \min\limits_{n=1,2,\cdots,N} |w^T x_n+b|=1 can be re-written as y_n(w^T x_n + b) \geq 1, \; \forall n=1,2, \cdots, N. Moreover, instead of maximizing \frac{1}{\|w\|}, we can equivalently minimize \frac{1}{2}w^T w. Therefore, the objective function now becomes:

\min_{w, b} \frac{1}{2} w^T w \quad \text{subject to} \quad y_n(w^T x_n + b) \geq 1, \; \forall n=1,2,\cdots,N

We are sure that among all the constraints, at least one must hold with equality (i.e., for some data point x_n, y_n(w^T x_n + b)=1); that is the data point closest to the plane. If all the constraints held only with strict inequality (i.e., y_n(w^T x_n + b)>1, \; \forall n=1,2,\cdots,N), \frac{1}{2}w^T w would not be minimized: we could always scale down w a little bit so that \frac{1}{2} w^T w decreases, until some constraint hits equality.

The objective function is a constrained optimization problem with affine inequality constraints. An overview of using Lagrange/KKT to do optimization can be found in my previous post [2]. Basically, the way to optimize this objective function is to use Lagrange multiplier and KKT condition.

From [2], we know that the KKT conditions are usually necessary conditions in a constrained optimization problem. However, when the optimization problem is convex (in our case, \frac{1}{2}w^T w) and Slater's condition holds (in our case, we only have linear constraints y_n(w^Tx_n+b) \geq 1), the KKT conditions become sufficient and necessary for finding the optimal solution w^*. w^* will be the saddle point of the Lagrange-multiplier-augmented loss function. Regarding why we can apply Lagrange multipliers + KKT conditions to optimize the objective function, please also refer to other materials, since this is the part I am less confident about: (Section 3.2 in [4], the answer in [5], the slides in [6], Section 5.2.3 in [7]).

The optimization process is as follows. We first write down the Lagrangian and take its derivatives w.r.t. w and b. Setting the derivatives to 0, we can find the relationship between \alpha (the Lagrange multipliers), y_n, and x_n (the data).

\mathcal{L}(w, b, \alpha) = \frac{1}{2}w^T w - \sum\limits_{n=1}^N \alpha_n \big(y_n(w^T x_n + b) - 1\big), \quad \alpha_n \geq 0

\frac{\partial \mathcal{L}}{\partial w} = 0 \Rightarrow w = \sum\limits_{n=1}^N \alpha_n y_n x_n, \qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow \sum\limits_{n=1}^N \alpha_n y_n = 0

Substituting these back into \mathcal{L} gives the dual: maximize \mathcal{L}(\alpha)=\sum\limits_{n=1}^N \alpha_n - \frac{1}{2}\sum\limits_{n=1}^N \sum\limits_{m=1}^N y_n y_m \alpha_n \alpha_m x_n^T x_m subject to \alpha_n \geq 0, \; \forall n and \sum\limits_{n=1}^N \alpha_n y_n = 0.

We can eventually formulate the problem as a quadratic programming problem (w.r.t. \alpha) and then hand it to a quadratic programming package to solve. We expect to receive the \alpha vector as the result, which we will then use to obtain w.

[Image: the dual written in standard quadratic programming form, and the recovery of w = \sum_n \alpha_n y_n x_n from the returned \alpha]

Then, the parameter b can be obtained by picking a support vector x_n and solving y_n (w^T x_n + b)-1 = 0.
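
Below is a minimal sketch (my own illustration assuming scikit-learn is available, not part of the lectures) that trains an approximately hard-margin linear SVM on toy data and recovers w and b from the support vectors and the dual coefficients returned by the solver:

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

# dual_coef_ stores alpha_n * y_n for the support vectors, so
# w = sum_n alpha_n y_n x_n over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_
print("support vectors:\n", clf.support_vectors_)
print("w:", w, "b:", b)
print("w from sklearn:", clf.coef_)   # should match the w recovered above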

Until this point, a new problem arises: we have been only dealing with perfectly linearly separable problems. How can we apply SVM in non-linear problems?

The first solution is the kernel trick. The objective function we send to quadratic programming is \mathcal{L}(\alpha)=\sum\limits_{n=1}^N \alpha_n - \frac{1}{2}\sum\limits_{n=1}^N \sum\limits_{m=1}^N y_n y_m \alpha_n \alpha_m x_n^T x_m, which only requires knowing the inner product between any x_n and x_m. Suppose the raw data points x_1, x_2, \cdots, x_n are low-dimensional and not linearly separable. However, if they are mapped to a higher dimension as z_1, z_2, \cdots, z_n, they can become separable. In that high-dimensional space, we can apply the linearly separable version of SVM to the data points z_1, z_2, \cdots, z_n to find a hyperplane and the support vectors. That only requires knowing the inner product of z_n and z_m.

[Image: mapping the raw data points x_n to higher-dimensional points z_n, in which space they become linearly separable]

We don't even need to know the exact formula of the mapping function x_n \rightarrow z_n: even a black-box mapping function is fine, as long as we can calculate the inner product of any z_n and z_m corresponding to x_n and x_m. This "inner product" function is called a kernel:

[Image: the kernel function K(x_n, x_m) = z_n^T z_m, the inner product in the transformed space]

Kernel functions can have many forms, for example the polynomial kernel K(x, x')=(ax^Tx'+b)^Q and the Radial Basis Function (RBF) kernel K(x, x')=\exp(-\gamma\|x-x'\|^2). There is a mathematical property called "Mercer's condition" that can be used to verify whether a function is a valid kernel function, but this is out of the scope of this post.
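
As a small illustration (my own sketch with numpy, not from the lectures; the gamma, a, b, and Q values are arbitrary), the kernel (Gram) matrix needed by the dual can be computed directly from the raw data:

import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * d2)

def poly_kernel(X1, X2, a=1.0, b=1.0, Q=3):
    """K(x, x') = (a * x^T x' + b)^Q."""
    return (a * X1 @ X2.T + b) ** Q

X = np.random.randn(5, 2)        # five 2-D points
print(rbf_kernel(X, X).shape)    # (5, 5) kernel (Gram) matrix
print(poly_kernel(X, X).shape)   # (5, 5)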

After we use a kernel, the optimization of SVM stays the same except that the coefficient matrix of the quadratic program now becomes:

[Image: the quadratic-programming coefficient matrix with entries y_n y_m K(x_n, x_m)]

And to finally obtain w and b, we need to utilize the kernel function too:

[Image: expressing the final hypothesis and b through the kernel, via sums of the form \sum_n \alpha_n y_n K(x_n, x)]

The second solution for non-linearly separable problems is to use a soft margin. The intuition is to allow some violation of the margin constraint for some data points.

[Image: the soft-margin formulation, which minimizes \frac{1}{2}w^T w plus a penalty C\sum_n \xi_n on the slack variables, subject to y_n(w^T x_n + b) \geq 1 - \xi_n, \; \xi_n \geq 0]

Comparing the two solutions: the kernel solution is often used when the data are highly non-linearly separable, while the soft-margin SVM is often used when the data are only slightly non-linearly separable (most data are still linearly separable).

[Image: a comparison of the kernel solution and the soft-margin solution]

About practical issues of applying SVM, please refer to [3].

Reference

[1] lectures from Caltech online ML course

https://www.youtube.com/watch?v=eHsErlPJWUU

and

https://www.youtube.com/watch?v=XUj5JbQihlU

[2] https://czxttkl.com/?p=1449

[3] https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

[4] http://www.cmap.polytechnique.fr/~mallat/papiers/svmtutorial.pdf

[5] https://stats.stackexchange.com/questions/239912/lagrange-multipler-kkt-condition

[6] https://www.cs.cmu.edu/~epxing/Class/10701-08s/recitation/svm.pdf

[7] convex optimization textbook: https://www.dropbox.com/s/fk992yb6qja3xhv/convex%20optimization.pdf?dl=0

Some Chinese references:

[a] http://www.cnblogs.com/jerrylead/archive/2011/03/13/1982639.html

[b] http://blog.csdn.net/v_july_v/article/details/7624837

Revisit Traveling Salesman Problem

Travelling salesman problem (TSP) goes as follows [1]:

Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?

This problem statement is actually the TSP-OPT problem, where "OPT" stands for optimization. This is not a decision problem, since the answer to it is not simply "yes"/"no". The decision version of TSP is:

Given a list of cities and the distances between each pair of cities, does there exist a  route that visits each city with the total distance less than or equal to k?

According to the section "How to prove NP-completeness in practice" in [2] and the top answer in [3], in order to prove that a decision problem is NP-complete, we need to prove that (1) it is in NP, and (2) another known NP-complete problem can be polynomially reduced to it.

Alternative ways to prove (2) are:

a. argue that if you could solve your problem in polynomial time, then you could solve a known NP-complete problem in polynomial time

b. construct an algorithm that solves the known NP-complete problem by calling an oracle for your problem a polynomial number of times, plus a polynomial amount of extra work outside the oracle, such that the algorithm returns yes iff the known problem's answer is yes.

c. similar to b, construct an algorithm that solves the known NP-complete problem by calling an oracle for your problem a polynomial number of times, plus a polynomial amount of extra work outside the oracle, such that (i) the algorithm returns yes if the oracle returns yes and (ii) the algorithm returns no if the oracle returns no.

NP is the class of all decision problems for which "yes" answers can be verified efficiently [7, 10th slide]. It is a property of decision problems only. Therefore, we can show that TSP-DEC is NP-complete; however, we can never say TSP-OPT is NP-complete [10].

The proof that TSP-DEC is NP-complete can be found in [6], [8], [9]. The basic idea is to reduce from the Hamiltonian-cycle problem, which is known to be NP-complete.

Similarly, there are many tutorials on proving that the independent-set problem is NP-complete [4, 5].

Although we can't say TSP-OPT is NP-complete (because it is not even a decision problem), we can say TSP-OPT is an NP-hard problem, and TSP-OPT is self-reducible to TSP-DEC. From a reply in [10]:

[Screenshot of a reply quoted from [10]]

This means TSP-OPT can be solved in time that is only polynomially larger than the time needed to solve TSP-DEC.

A few more words about self-reducibility. Good tutorials on self-reducibility can be found in [11, 12, 13]. It seems (I haven't verified this) that "Every NP-Complete problem/language L is self-reducible" (from Theorem 24.3.4 in [12]). That would mean all NP-complete optimization problems can be polynomially reduced to their decision problems.

 

Reference

[1] https://en.wikipedia.org/wiki/Travelling_salesman_problem

[2] http://www.ics.uci.edu/~eppstein/161/960312.html

[3] https://stackoverflow.com/questions/4294270/how-to-prove-that-a-problem-is-np-complete

[4] http://www.cs.cornell.edu/courses/cs482/2004su/handouts/NPComplete.pdf

[5] https://perso.esiee.fr/~mustafan/TechnicalWritings/Complexityslides/lec6.pdf

[6] https://www.quora.com/Why-is-the-traveling-salesman-problem-NP-complete

[7] https://web.stanford.edu/class/archive/cs/cs161/cs161.1138/lectures/19/Small19.pdf

[8] https://www.researchgate.net/file.PostFileLoader.html?id=58a153df40485494272392f9&assetKey=AS%3A461190431285249%401486967775832

[9] https://math.stackexchange.com/questions/1077436/the-traveling-salesman-problem-is-np-complete-reduction

[10] http://www.nomachetejuggling.com/2012/09/14/traveling-salesman-the-most-misunderstood-problem/

[11] http://www.cs.toronto.edu/~fpitt/20109/CSC363/lectures/LN11.txt

[12] https://courses.engr.illinois.edu/cs473/fa2011/lec/24_notes.pdf

[13] https://cseweb.ucsd.edu/~mihir/cse200/decision-search.pdf

shadowsocks + SwitchyOmega

We've introduced one way to proxy internet traffic: https://czxttkl.com/?p=1265

Now we introduce another way to create a proxy, which uses shadowsocks + SwitchyOmega (a Chrome extension).

Ubuntu:

  1. in a terminal: 
    sudo add-apt-repository ppa:hzwhuang/ss-qt5
    sudo apt-get update
    sudo apt-get install shadowsocks-qt5
  2. open the installed shadowsocks and configure a new connection: [screenshot]
  3. install chrome extension SwitchyOmega: https://www.dropbox.com/s/i5xmrh4wv1fivg7/SwitchyOmega_Chromium.crx?dl=0
  4. configure SwitchyOmega: [screenshot]

 

Android

  1. install Shadowsocks apk: https://www.dropbox.com/s/cijsnynpvamizvs/com.github.shadowsocks-4.3.3-varies-sdk19-vc199-APK4Fun.com.apk?dl=0
  2. configure shadowsocks similarly to how we configured it on the PC

Reinforcement Learning in Web Products

Reinforcement learning (RL) is an area of machine learning concerned with optimizing a notion of cumulative reward. Although it has been applied in video game AI, robotics and control optimization for years, we have seen less of its presence in web products. In this post, I am going to introduce some works that apply RL in web products. In a more general sense, this post will show how web products can benefit from a view of sequential decision making. Indeed, the order of news read, songs listened to, or videos watched all affects the user's experience online. Thus, how to model and optimize sequential user experience becomes more and more crucial.

RL / Monte Carlo Tree Search on Song list Recommendation

In music services, a song list is a sequence of songs curated for listeners. This is a case in point where sequential decision making may make a difference. The song-list recommendation work consists of two papers [3, 4]. [3] details how to extract features of single songs and of song transitions, and how to learn a parameterized linear reward function. Here, the observed reward is a human listener's like/dislike, fitted by a linear combination of the reward function's weights and the song and transition features. After learning the reward function, [3] uses heuristic-based planning to recommend song lists. [3] shows in simulated and real-world evaluations that their recommended song lists can lead to higher long-term rewards for listeners (either synthetic or real humans).

What I don't understand well in [3] is Algorithm 3, in which the reward function weights are updated according to observed rewards. Note that this is different from function-approximation-based TD-learning, which I know better, where weights are updated for estimating state or state-action values. Algorithm 3 looks like inverse reinforcement learning, but the authors did not mention that. It may just be a simple linear regression fitting procedure based on gradient descent.

[4] improves the planning with a more advanced search algorithm, Monte Carlo Tree Search (MCTS). This planning takes place after the reward function is fully known, i.e., the weights have been learned already. The goal of planning is to generate a song list achieving the highest accumulated rewards. [4] uses a specific version of MCTS called $latex MaxMCTS(\lambda)$, which employs a more sophisticated Q-value backpropagation strategy. Using MCTS is pretty straightforward here, since one cannot exhaust all possible song lists in a limited time.

Q-Learning on Recommendation

Applying Q-learning to recommendation is sometimes straightforward: you define user features as states, things to recommend as actions, and the metrics you want to improve as rewards. [8] is one example in this vein, where actions are news articles and rewards are clicks + user activeness. They claim to use Dueling Bandit Gradient Descent for exploration so that exploration of news articles can focus more on relevant ones. They also propose to divide the Q-value function into a value function $latex V(s)$ and an advantage function $latex A(s,a)$, much like the dueling network architecture.
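
As a sketch of the dueling idea they borrow (my own minimal PyTorch illustration of a generic dueling head, not the architecture from [8]; all sizes are made up):

import torch
import torch.nn as nn

class DuelingQHead(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                 # V(s)
        self.advantage = nn.Linear(hidden, num_actions)   # A(s, a)

    def forward(self, s):
        h = self.shared(s)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)

q = DuelingQHead(state_dim=10, num_actions=5)
print(q(torch.randn(3, 10)).shape)   # [3, 5]: Q-values for 5 candidate actions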

Multi-Agent RL on Recommendation

[11] works on a holistic way to do recommendation across the multiple stages of a user's interaction with a website. For example, the website needs to recommend diverse items on the entrance page. Once the user clicks on an interesting item, the user enters an item detail page and the website needs to recommend more related items. And after the user decides to buy some item, the website needs to recommend more items post-purchase to incentivize future purchases. Traditionally, at each stage you can have one dedicated model that optimizes the stage-wise reward. But you can utilize more information across different stages and make a decision at each stage such that it also benefits all future stages. That's the motivation behind [11], but to be honest I am not sure how much room for improvement there will be in a real application, given how much complexity is introduced.

Here are some technical details about [11]. Its high-level model is DDPG, an actor-critic architecture. The actor is a policy network that outputs the action based on the state. The state is a low-dimensional, dense vector which embeds the user history using a memory-based network like an RNN. The critic is a value network that outputs state-action values, in which the state is embedded in the same way as in the actor. Note that there are multiple actors, one for each stage of recommendation, but there is only one critic, which outputs a global state-action value. Training the critic requires defining a target state-value function, usually of the form $latex y=R+\gamma Q(s',\pi(s'))$. The authors think that fitting the target state-value function would need a huge sample size, and that decomposing $latex y$ into several parts may make the fitting easier. (Personally, I am not sure if that is true.) They propose to decompose $latex y$ into the expectation of state-action values over the probability distribution of each action, where the probability of each action can be estimated by a supervised learning model.

Ads

There are two classic algorithms on applying RL to ads, bidding in particular. The state space is the real-time auction/campaign information, while the action is the next bid price to set. [14] was published earlier; it defined the full MDP (transition and reward) such that the optimal policy can be learned by dynamic programming. [13] used DQN to learn the Q-function, where the action space is a parameter that controls which bidding model to use, rather than the bid price directly. I really admire the details of designing the MDP in [13]. For example, in Figure 2 of [13], they found that bidding patterns are similar day by day, indicating that they should treat one day as one episode and collect <state, action> pairs per hour.

Ranking

There has been increasing interest in using reinforcement learning for list-wise ranking. The idea is that traditional ranking usually acts like a greedy policy and doesn't consider the interaction between ranked items. For example, if a user likes soccer, all soccer-related posts are predicted to have a high click-through rate and are thus placed higher, even though the user could get fatigued by seeing posts of the same category consecutively. It is likely that showing posts of interweaving categories, such as soccer and entertainment, could foster a healthier cadence of reading and make the user more satisfied.

[7] proposes an algorithm for the following setting: "a user is displayed a page of k items and she provides feedback by clicking on one or none of these items, then the system recommends a new page of k items". The first part of the paper proposes to use a Generative Adversarial Network (GAN) to train a user model that learns which item a user would click given a page and what her satisfaction upon clicking would be. In my opinion, this part could also be done with other standard ML models. The user model is used as a simulator to provide data for RL training. In the second part, [7] proposes cascade Q-learning (CDQN). Suppose the total number of items is $latex \mathcal{I}$. There are $latex P(\mathcal{I}, k)$ permutations of k items to form a page (the order of the k items in a page also matters), yet CDQN only needs to make $latex \mathcal{O}(k\mathcal{I})$ argmax calls. My understanding of CDQN is that it creates $latex k$ artificial sub-states between $latex s_t$, the state when the user is about to receive a slate, and $latex s_{t+1}$, when the user is about to receive the next slate. You then learn to estimate Q-values at each of the k sub-states, with the assumption that the optimal Q-value at each sub-state should be the same as the optimal Q-value at the last sub-state. I think the $latex \mathcal{O}(k\mathcal{I})$ time complexity is still too computationally expensive for real products, and I'd like to see an $latex \mathcal{O}(k)$ or $latex \mathcal{O}(l)$ algorithm.

Another work, SlateQ [9], operates in the same setting but uses a different way to learn Q-values and construct slates. The paper makes two reasonable assumptions: (1) SC: a user consumes one or zero items from each slate; (2) the reward and transition functions only depend on the consumed item. With these two assumptions, $latex Q^*(s, A) = \sum_{i\in A} P(i|s,A)Q^*(s,i)$, meaning that the optimal Q-value of a state $latex s$ and a slate $latex A$ is the expected optimal "decomposed" Q-value of $latex s$ and an item $latex i$, where the expectation is taken over the user choice probability. $latex Q^*(s,i)$ can be learned through learning schemes very similar to normal SARSA and Q-learning (see Eqn. (19) and (20)), and $latex P(i|s,A)$ is usually available from the CTR prediction models that most commercial recommendation systems already have. Using decomposed Q-values simplifies the learning process a lot, but we still have one thing to resolve: in Q-learning, you need to compute the $latex \max_a$ operator (the best of all possible next slates); in the policy improvement phase of SARSA, you need to come up with a better slate. The brute-force method is to enumerate all possible slates, which is a combinatorial problem. But since we know $latex P(i|s,A)$ and $latex Q^*(s,i)$, the authors propose to use linear programming for exact optimization, or Greedy/Top-K for approximate optimization. SlateQ has been tested on YouTube and showed about a 1% improvement in engagement.
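
To make the decomposition concrete, here is a toy sketch (my own illustration, not code from [9]; the choice model and the per-item Q-values are made up) that scores a slate with $latex Q(s, A) = \sum_{i\in A} P(i|s,A)Q(s,i)$ and builds a slate greedily:

import numpy as np

def choice_probs(scores):
    """A toy conditional choice model: softmax over the items placed in the slate."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def slate_value(item_q, item_scores, slate):
    """Q(s, A) = sum_{i in A} P(i|s, A) * Q(s, i) under the toy choice model."""
    p = choice_probs(item_scores[slate])
    return float(p @ item_q[slate])

# Made-up decomposed Q-values Q(s, i) and choice-model scores for one state s.
item_q = np.array([1.0, 0.2, 0.8, 0.5, 0.9])
item_scores = np.array([0.1, 2.0, 0.3, 1.5, 0.2])

# Greedy slate construction: repeatedly add the item that most increases the slate value.
k, slate, remaining = 2, [], list(range(len(item_q)))
for _ in range(k):
    best = max(remaining, key=lambda i: slate_value(item_q, item_scores, slate + [i]))
    slate.append(best)
    remaining.remove(best)
print(slate, slate_value(item_q, item_scores, slate))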

Rather than using value-based models, [10] uses policy-gradient methods to do slate recommendation. Its application context is very similar to CDQN and SlateQ, where you need to recommend k items in a slate each time. The question is how to model the policy probability $latex \alpha(a|s)$, the probability of showing the observed clicked video $latex a$ in a slate given user state $latex s$. In Section 4.3, the authors assume that a slate is generated by sampling $latex K$ times from the softmax distribution of video values ($latex \pi(a|s)$ from Eqn. 5). Following this assumption, we have $latex \alpha(a|s)=1-(1-\pi(a|s))^K$. The policy gradient method would in principle require the policy probability $latex \Pi(A|s)$, where $latex A$ is the whole slate; but since generating any slate that contains the observed clicked video $latex a$ differs from $latex \Pi(A|s)$ only by a scaling constant, we can just use $latex \alpha(a|s)$ in the policy gradient formula, Eqn. (6). They propose to use REINFORCE to learn the policy. Since REINFORCE is an on-policy learning algorithm but their training data is off-policy, they introduce importance sampling into the REINFORCE update formula. Importance sampling itself brings more technical considerations: how to estimate the behavior policy's propensities (Section 4.2), and how to reduce the variance of importance sampling (Section 4.4). They also show that using Boltzmann exploration could help make wiser exploration decisions.

Simulator

An appealing idea is to train a simulator on real data and then learn an RL model purely through interaction with the simulator. If the simulator truly reflects the characteristics of real data, RL models can be trained entirely on virtual data. [12] shows one example of this idea. It models two MDPs in a customer-search environment: the search engine makes sequential decisions following one MDP, and the customer follows another. The paper first uses a Generative Adversarial Network and Multi-agent Adversarial Imitation Learning to learn a reliable simulator of customer features and the customers’ inherent policy. Then it uses TRPO to learn the search engine’s policy from simulated data. It claims that the learned policy resulted in real-world improvement.

 

Reference

[1] Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees

[2] A contextual-Bandit Approach to Personalized News Article Recommendation

[3] DJ-MC: A Reinforcement-Learning Agent for Music Playlist Recommendation

[4] Designing Better Playlists with Monte Carlo Tree Search

[5] http://info.ividence.com/deep-reinforcement-learning-advertising/

[6] https://www.youtube.com/watch?v=XiN9Hx3Y6TA

[7] Generative Adversarial User Model for Reinforcement Learning Based Recommendation System: https://arxiv.org/abs/1812.10613

[8] DRN: A Deep Reinforcement Learning Framework for News Recommendation

[9] Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology

[10] Top-K Off-Policy Correction for a REINFORCE Recommender System

[11] Model-Based Reinforcement Learning for Whole-Chain Recommendations

[12] Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning

[13] Deep Reinforcement Learning for Sponsored Search Real-time Bidding

[14] Real-Time Bidding by Reinforcement Learning in Display Advertising

LambdaMART

LambdaMART [7] is one of the Learning to Rank algorithms. It emphasizes fitting the correct order of a list, which contains all documents returned for a query, each labeled with a relevance grade. It is a derivation/combination of RankNet, LambdaRank, and MART (Multiple Additive Regression Trees).

For me, the most helpful order in which to approach LambdaMART is: 1. MART; 2. RankNet; 3. LambdaRank; 4. LambdaMART. [4] actually introduces LambdaMART in the same order. (Side note: another good resource for learning MART is [8].)

1. MART

[Figure: illustration of a single regression tree]

Above is an illustration of a single regression tree. If data with different labels came from disjoint feature subspaces, one regression tree would be enough. However, in many situations the data come from interleaved feature subspaces. Hence, it is natural to use multiple regression trees to carve out irregular feature subspaces and make predictions.

Now, suppose MART’s prediction function is F(x)=\sum\limits_{m=0}^{M} h(x;a_m), where each h(x;a_m) is one regression tree’s output and a_m are the parameters of that tree. Generally speaking, each regression tree has K terminal nodes. Hence a_m contains \{R_{km}\}_1^K, the K feature subspaces that partition the data, and \{\gamma_{km}\}_1^K, the predicted outputs for data belonging to \{R_{km}\}_1^K. h(x;a_m) can be written as: h(x;a_m) = \sum\limits_{k=1}^K \gamma_{km} \mathbb{I}(x \in R_{km})

\{a_m\}_0^M are learned iteratively to fit the training data:

  1. Set F_0(x) to an initial guess.
  2. For each m=1,2,\cdots, M, learn the optimal feature space partition \{R_{km}\}_1^K.
  3. Based on the learned \{R_{km}\}_1^K, the prediction value of each R_{km} (leaf) is fixed deterministically as: \gamma_{km}=argmin_{\gamma} \sum\limits_{x_i \in R_{km}} L(y_i, F_{m-1}(x_i) + \gamma)

Although [4] doesn’t provide details on learning \{R_{km}\}_1^K, I guess the procedure is very similar to what is used in ordinary decision trees. One possible criterion is to minimize the total loss: \{R_{km}\}_1^K = argmin_{\{R_{km}\}_1^K} \sum\limits_{k=1}^K \sum\limits_{x_i \in R_{km}} L(y_i, F_{m-1}(x_i) + \gamma_{km}). (With this criterion, \{R_{km}\}_1^K and \{\gamma_{km}\}_1^K are actually learned simultaneously.)

[4] provides an example to illustrate the iterative learning process. The dataset contains data points in two dimensions with labels \{1, 2, 3, 4\}. The loss function L(\cdot) is the squared loss. Each regression tree is only a decision stump.

[Figure: the example dataset with labels \{1, 2, 3, 4\}]

F_0(x)=h(x;a_0) is the simplest predictor that predicts a constant value minimizing the error for all training data. 

[Figure: F_0(x), the constant initial predictor]

After determining h(x;a_0), we start to build h(x;a_1). The cut v_1 is determined by some criterion not specified in [4]. But assuming that v_1 has already been obtained, we can then determine each leaf’s prediction output, i.e., \gamma_{k1}, k=1,2:

[Figure: the cut v_1 and the leaf outputs \gamma_{k1}, k=1,2]

After determining the first and second trees, F_1(x)=h(x;a_0)+h(x;a_1). In other words, for x<v_1, the predicted value will be 1.444; for x \geq v_1, the predicted value will be 3.625. 

[Figure: F_1(x) = h(x;a_0) + h(x;a_1)]

Next, determining h(x;a_2) is similar: find a cut v_2 and then determine \gamma_{k2}, k=1,2, as in step 3.

[Figure: the cut v_2 and the leaf outputs \gamma_{k2}, k=1,2]

Interestingly, when we fit h(x;a_2), it looks as if we fit a tree with each data point’s label replaced by y_i - F_1(x_i); that is, we are fitting the residual that the previous trees failed to predict. Below is a plot showing each data point’s original label and its updated label before fitting h(x;a_2).

[Figure: each data point’s original label and its residual label y_i - F_1(x_i)]

In fact, the fitting process described above is based on the assumption that L is a square loss function. When fitting the m-th tree, we are fitting the tree to the residual y - F_{m-1}(x). Interestingly, y - F_{m-1}(x) = - \frac{\partial L(y, F_{m-1}(x))}{\partial F_{m-1}(x)}. That’s why MART is also called Gradient Boosted Decision Trees (GBDT): each tree fits the (negative) gradient of the loss.

To be more general and mathematical,  when we learn \{R_{km}\}_1^K and \{\gamma_{km}\}_1^K in the m-th iteration, our objective function is:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m))

If we take the second-order Taylor expansion of the objective function around h = 0, we get [10]:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m)) \newline =\sum\limits_i L(y_i,F_{m-1}(x_i)) + L'(y_i,F_{m-1}(x_i)) h(x_i;a_m) +\frac{1}{2} L''(y_i,F_{m-1}(x_i)) h(x_i;a_m)^2 

Each time, the new regression tree (a_m = \{R_{km}\}_1^K and \{\gamma_{km}\}_1^K) should minimize this objective function, which is quadratic with respect to h(x_i;a_m). As long as we know \frac{\partial L}{\partial F_{m-1}} and \frac{\partial^2 L}{\partial F_{m-1}^2}, we can learn the best a_m that minimizes the objective function.
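To make the last sentence concrete: within one leaf R_{km}, h(x;a_m) equals a single constant \gamma_{km}, so the quadratic objective restricted to that leaf can be minimized in closed form. Writing g_i = L'(y_i, F_{m-1}(x_i)) and h_i = L''(y_i, F_{m-1}(x_i)), setting the derivative with respect to \gamma_{km} to zero gives

\gamma_{km} = -\frac{\sum\limits_{x_i \in R_{km}} g_i}{\sum\limits_{x_i \in R_{km}} h_i}

This Newton-style leaf value reappears later in the LambdaMART algorithm, and (with a regularization term added to the denominator) it is also the leaf weight formula used by XGBoost [10].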

Now, back to the previous example, if L is a square loss function L(y, \hat{y})=\frac{1}{2}(y - \hat{y})^2, then L'(y_i,F_{m-1}(x_i))=-(y_i-F_{m-1}(x_i)) and L''(y_i,F_{m-1}(x_i)) = 1. Thus, the objective function becomes:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m)) \newline = \sum\limits_i -(y_i-F_{m-1}(x_i)) h(x_i;a_m) + \frac{1}{2} h(x_i;a_m)^2 + constant

This is no different from min \;\; \sum\limits_i L(y_i - F_{m-1}(x_i), h(x_i;a_m)). That’s why, in the fitting process above, it looks as if we are fitting each new tree to the residual between the true label and the previous trees’ output. From an equivalent perspective, we are fitting each new tree to the negative gradient - \frac{\partial L(y, F_{m-1}(x))}{\partial F_{m-1}(x)}.
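Here is a minimal sketch of this residual/gradient-fitting view under square loss, using decision stumps as in the example above. scikit-learn’s DecisionTreeRegressor stands in for the tree learner, and the toy 1-D data is my own illustration, not the exact setup in [4].

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_mart_square_loss(X, y, M=3):
    # F_0: the constant prediction minimizing square loss is the mean label.
    F = np.full(len(y), y.mean())
    trees = []
    for m in range(M):
        residual = y - F                    # equals -dL/dF for L = 0.5 * (y - F)^2
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, residual)              # each new stump fits the residual
        F = F + stump.predict(X)
        trees.append(stump)
    return trees, F

# Toy 1-D version of the example: labels 1..4 along a single feature.
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)
trees, F = fit_mart_square_loss(X, y)
print(np.round(F, 3))   # predictions move toward the labels as M grows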

The loss function L can take many forms, making MART a general supervised learning model. When MART is applied to binary classification problems, the loss function for each data point (x_i, y_i) is L(y_i, p_i)=- [y_i \log (p_i) + (1-y_i) \log (1-p_i)], where p_i is the predicted probability p_i=\Pr(y_i=1|x_i). The regression trees’ output is used to fit the logit of p_i, i.e.,

p_i=\frac{1}{1+\exp{(-F_{m-1}(x_i)-h(x_i;a_m))}}

If we formulate everything in terms of P_i=logit(p_i), then we have:

P_i = logit(p_i) = \log (\frac{p_i}{1-p_i}) = F_{m-1}(x_i)+ h(x_i;a_m) (see [11])

L(y_i, P_i) = - y_i P_i + \log (1 + \exp(P_i)) (See [12])

After such a formulation, it is easy to calculate L'(y_i,F_{m-1}(x_i)) = \frac{\partial L}{\partial P_i} \cdot \frac{\partial P_i}{\partial F_{m-1}(x_i)} = \frac{\partial L}{\partial P_i} (since \frac{\partial P_i}{\partial F_{m-1}(x_i)} = 1) and, likewise, L''(y_i,F_{m-1}(x_i)) = \frac{\partial^2 L}{\partial P_i^2}.
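Carrying out that calculation with p_i = \frac{1}{1+\exp(-P_i)} gives

\frac{\partial L}{\partial P_i} = -y_i + \frac{\exp(P_i)}{1+\exp(P_i)} = p_i - y_i, \qquad \frac{\partial^2 L}{\partial P_i^2} = p_i(1-p_i)

so each new tree is fitted to the residual y_i - p_i (the negative gradient), and the second derivative p_i(1-p_i) is what enters the Newton step for the leaf values.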

2. RankNet

We have finished introducing MART. Now let’s talk about RankNet and LambdaMART. From now on, we follow the notation of [7]:

[Table: notation summary]

(Table from [9])

After reading the first three sections of [7], you will probably reach this plot:

[Figure: motivating example from [7], comparing the pairwise gradients (black arrows) with the desired, top-weighted gradients (red arrows)]

The plot illustrates the motivation for inventing LambdaRank on top of RankNet. In RankNet, the cost function is an easily defined, differentiable function:

C=\frac{1}{2}(1-S_{ij})\sigma(s_i-s_j)+\log(1+e^{-\sigma(s_i - s_j)})

If S_{ij}=1, C=\log(1+e^{-\sigma(s_i - s_j)}); if S_{ij}=-1, C=\log(1+e^{-\sigma(s_j-s_i)}). The intuition of this cost function is that the more consistent the relative order of s_i and s_j is with S_{ij}, the smaller C will be.

Since the scores s_i and s_j come from a score function with parameters w, the cost function C is a function of w too. Our goal is to adjust w to minimize C. Therefore, we arrive at the gradient of C with respect to w (using a single parameter w_k as an example):

\frac{\partial C}{\partial w_k}=\lambda_{ij}(\frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k}), where \lambda_{ij} \equiv \sigma (\frac{1}{2}(1-S_{ij})-\frac{1}{1+e^{\sigma(s_i-s_j)}}). (This is from Eqn. 3 in [7].)

\lambda_{ij} is derived from a pair of URLs U_i and U_j. For each document, we can aggregate into \lambda_i by taking into account all of its interactions with other documents:

 \lambda_i = \sum\limits_{j:\{i,j\}\in I} \lambda_{ij} - \sum\limits_{j:\{j,i\}\in I} \lambda_{ij}

Based on the derivation in Section 2.1 of [7], the update rule for the score function parameter w can eventually be expressed in terms of \lambda_i:

w_k \rightarrow w_k - \eta \frac{\partial C}{\partial w_k} = w_k - \eta\sum\limits_i \lambda_i \frac{\partial s_i}{\partial w_k} 

\lambda_i can be seen as the force with which document U_i drives w_k up or down. Here is how one can understand the effect of \lambda_i through a simple but wordy example. Suppose a document U_i appears in only one pair, U_i \triangleright U_j (S_{ij}=1). Assume that in one particular update step we have \frac{\partial s_i}{\partial w_k}>0, i.e., an increase of w_k will cause s_i to increase if everything else is fixed. Therefore, \lambda_i=\lambda_{ij}=-\frac{1}{1+e^{(s_i-s_j)}}<0, assuming \sigma=1. According to the update rule w_k \rightarrow w_k - \eta\sum\limits_i \lambda_i \frac{\partial s_i}{\partial w_k}, we immediately know that w_k is going to increase, which further increases s_i. Nevertheless, depending on whether s_i > s_j or s_j > s_i, the magnitude of \lambda_i differs. If s_i > s_j, that is, the currently calculated scores have a relative order consistent with the label, then -0.5 < \lambda_i < 0. On the other hand, if s_i < s_j, that is, the currently calculated scores are inconsistent with the label, |\lambda_i| will be relatively large (approaching 1).
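To make the bookkeeping concrete, here is a minimal sketch of computing \lambda_i for one query. The names are mine; pairs holds the ordered pairs (i, j) with U_i \triangleright U_j (S_{ij}=1), as in the set I above.

import numpy as np

def ranknet_lambdas(scores, pairs, sigma=1.0):
    # For each pair (i, j) with S_ij = 1:
    #   lambda_ij = -sigma / (1 + exp(sigma * (s_i - s_j)))
    # and lambda_i accumulates +lambda_ij while lambda_j accumulates -lambda_ij.
    lam = np.zeros(len(scores))
    for i, j in pairs:
        lam_ij = -sigma / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
        lam[i] += lam_ij
        lam[j] -= lam_ij
    return lam

scores = np.array([0.2, 1.5, -0.3])
pairs = [(1, 0), (1, 2), (0, 2)]   # doc 1 most relevant, then doc 0, then doc 2
print(ranknet_lambdas(scores, pairs))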

3. LambdaRank and LambdaMART

The problem with RankNet is that |\lambda_i| does not always correlate with the force with which we want to pull a document toward its ideal position. On the right of the example plot, the black arrows denote the |\lambda_i| assigned to the blue documents. However, if we care about metrics that emphasize the top few results (e.g., NDCG/ERR), we want |\lambda_i| to correlate with the red arrows. Unfortunately, for metrics like NDCG and ERR it is hard to come up with a well-formed cost function C. Recall the cost function of RankNet: as long as s_i and s_j are consistent with S_{ij}, C will be low regardless of the absolute positions at which U_i and U_j appear. The fortunate part is that we do not need to know C to train the model: all we need is \lambda_{ij}, and we can design our own \lambda_{ij} for those ill-formed cost functions. Thus, the core idea of LambdaRank is to define \lambda_{ij} as:

\lambda_{ij} = \frac{\partial C(s_i- s_j)}{\partial s_i} = \frac{-\sigma}{1+e^{\sigma(s_i - s_j)}}|\triangle Z_{ij}|

where \triangle Z_{ij} is the change in whatever measure you use caused by swapping U_i and U_j. The swap takes place in the list of all documents sorted by their current scores, with all other documents’ positions kept fixed.
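For example, if the measure is NDCG, |\triangle Z_{ij}| can be computed in closed form because only the two swapped positions change their contribution to DCG. A minimal sketch, assuming the documents are listed in their current ranked order (0-based positions) and the usual gain 2^{label}-1 and discount 1/\log_2(pos+2) conventions:

import numpy as np

def delta_ndcg(labels, i, j):
    # |delta NDCG| when swapping the documents at ranked positions i and j,
    # everything else fixed.
    gains = 2.0 ** np.asarray(labels, dtype=float) - 1.0
    discounts = 1.0 / np.log2(np.arange(len(labels)) + 2.0)
    ideal = np.sort(gains)[::-1] @ discounts          # ideal DCG for normalization
    # Only the two swapped positions change their contribution to DCG.
    change = (gains[i] - gains[j]) * (discounts[i] - discounts[j])
    return abs(change) / ideal

labels = [3, 2, 0, 1]            # relevance grades in current ranked order
print(delta_ndcg(labels, 0, 2))  # swapping a highly relevant doc downward costs a lot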

Now, if the model we are training contains parameters that form differentiable score functions, then we can directly apply the update rule to those parameters as discussed in <2. RankNet>: w_k \rightarrow w_k - \eta \frac{\partial C}{\partial w_k} = w_k - \eta \sum\limits_i \lambda_i \frac{\partial s_i}{\partial w_k}

However, if we are training MART, then we need to follow its iterative process, which only requires knowing \lambda_{ij} (and its derivative) rather than C itself. This finally leads to the LambdaMART algorithm.

Note that in LambdaMART we usually define \lambda_i such that fitting it is equivalent to maximizing some utility function C. Therefore, the Newton step that calculates \gamma_{lk} is actually doing gradient ascent. Recall from the <1. MART> section that the objective each MART tree optimizes is:

min \;\; \sum\limits_i L(y_i, F_{m-1}(x_i) + h(x_i;a_m)) \newline=\sum\limits_i L(y_i,F_{m-1}(x_i)) + L'(y_i,F_{m-1}(x_i)) h(x_i;a_m) +\frac{1}{2} L''(y_i,F_{m-1}(x_i)) h(x_i;a_m)^2

In the LambdaMART algorithm, y_i corresponds to L'(y_i,F_{m-1}(x_i)) and w_i corresponds to L''(y_i,F_{m-1}(x_i)).

 

Reference

[1] http://www.cnblogs.com/wowarsenal/p/3900359.html

[2] https://www.quora.com/search?q=lambdamart

[3] https://liam0205.me/2016/07/10/a-not-so-simple-introduction-to-lambdamart/

[4]  http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/materials/Schamoni_boosteddecisiontrees.pdf

[5] https://www.slideshare.net/Julian.Qian/learning-to-rank-an-introduction-to-lambdamart

[6] https://en.wikipedia.org/wiki/Newton%27s_method

[7] https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/

[8] https://www.youtube.com/watch?v=IXZKgIsZRm0

[9] https://www.slideshare.net/Julian.Qian/learning-to-rank-an-introduction-to-lambdamart

[10] http://xgboost.readthedocs.io/en/latest/model.html

[11] https://stats.stackexchange.com/questions/157870/scikit-binomial-deviance-loss-function

[12] https://stats.stackexchange.com/questions/204154/classification-with-gradient-boosting-how-to-keep-the-prediction-in-0-1

[13] http://dasonmo.cn/2019/02/08/from-ranknet-to-lambda-mart/

Update 2020/02/22:

This reference contains an example of computing \lambda_i using NDCG metric:

[14] http://dasonmo.cn/2019/02/08/from-ranknet-to-lambda-mart/

Questions on Guided Policy Search

I’ve been reading Prof. Sergey Levine‘s paper on Guided Policy Search (GPS) [2]. However, I do not fully understand it yet, so I want to record my questions here so that I can come back to them and resolve them in the future.

Based on my understanding, traditional policy search (e.g., REINFORCE) maximizes the expected reward via the likelihood-ratio gradient. This is usually achieved by first collecting some samples, taking derivatives of the likelihood-ratio objective w.r.t. the policy parameters, updating the policy parameters, and then collecting new samples again. The shortcomings of such a procedure are: (1) it is sample inefficient, because samples become off-policy and are discarded every time the policy parameters are updated; (2) when the real situation is complex, it is hard to navigate to the globally optimal parameters in the parameter space. Local optima are often reached, and danger arises for robots when poor parameters guide them into risky trajectories.

Therefore, it would be more ideal if we could use demonstration trajectories to initialize the policy. Moreover, whatever trajectories we have already tried could be kept, rather than discarded, to help improve the policy parameters. These should be the fundamental motivations of GPS.

What I am not sure about is how exactly iLQR works. Also, regarding Algorithm 1, line 1 of [2]: why are there many DDP solutions ($latex \pi_{\mathcal{G}_1}, \cdots,\pi_{\mathcal{G}_n}$) generated? Does that mean iLQR gives many different results when initialized differently? Is iLQR only used in the first line?

It seems that GPS only deals with known dynamics and reward functions. When the dynamics are unknown, we should instead look at [3] or Continuous Deep Q-Learning [4, 5].

Reference

[1] http://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf (Importance sampling tutorial)

[2] https://graphics.stanford.edu/projects/gpspaper/gps_full.pdf

[3] https://people.eecs.berkeley.edu/~svlevine/papers/mfcgps.pdf

[4] https://zhuanlan.zhihu.com/p/21609472

[5] https://arxiv.org/abs/1603.00748

[6] http://blog.csdn.net/sunbibei/article/details/51485661 (Notes on GPS written in Chinese)

[7] https://www.youtube.com/watch?v=eKaYnXQUb2g (Levine’s video. In the first half hour he talked about GPS)

 

Relationships between DP, RL, Prioritized Sweeping, Prioritized Experience Replay, etc

Over the last weekend, I struggled with many concepts in Reinforcement Learning (RL) and Dynamic Programming (DP). In this post, I am collecting some of my thoughts on DP, RL, Prioritized Sweeping, and Prioritized Experience Replay. Please also refer to a previous post written when I first learned RL.

Let’s first introduce a Markov Decision Process $latex \mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R)$, in which $latex \mathcal{S}$ is a set of states, $latex \mathcal{A}$ is a set of actions, $latex T(s, a, s’)$ is the transition function, and $latex R(s,a,s’)$ is the reward function. The value function of a state under a specific policy $latex \pi$, $latex V^\pi(s)$, is defined as the expected accumulated reward from now on: $latex V^\pi(s)=\mathbb{E}(\sum\limits_{i=0}\gamma^i r_i | s_0 = s, \pi) &s=2$, where $latex 0 < \gamma < 1$ is a reward discount factor. $latex V^\pi(s)$ satisfies the Bellman equation: $latex V^\pi(s)=\mathbb{E}_{s’ \sim T(s,a,\cdot)}\{R(s,a,s’)+\gamma V^\pi(s’)\} &s=2$, where $latex a=\pi(s)$. The optimal value function is defined as $latex V^*(s) = max_\pi V^\pi(s)$. $latex V^*(s)$ also satisfies the Bellman equation: $latex V^*(s) = max_a \mathbb{E}_{s’ \sim T(s,a,\cdot)}\{R(s,a,s’) + \gamma V^*(s’)\} &s=2$. Another type of function is the Q-function, defined as the expected accumulated reward for state-action pairs (rather than for states themselves). The formulas for $latex Q^\pi(s,a)$ and $latex Q^*(s,a)$ are omitted here but can be found in many RL materials, such as [2].

DP refers to a collection of algorithms that find the optimal policy $latex \pi^*: \pi^*(s) = argmax_a \sum_{s’}T(s,a,s’) [R(s,a,s’) + \gamma V^*(s’)] &s=2$ for $latex \forall s \in \mathcal{S} &s=2$ when the transition function $latex T(s,a,s’)$ and the reward function $latex R(s,a,s’)$ are known. Therefore, DP algorithms are also called model-based algorithms. (You assume you know, or have a model to predict, $latex T(s,a,s’)$ and $latex R(s,a,s’)$, and during policy search you use these two functions explicitly.)

The simplest DP algorithm is policy iteration, a synchronous DP method.

[Figure: policy iteration pseudo-code]

Here, we first evaluate the value function under the current policy (policy evaluation). Then, we update the current policy based on the estimated value function (policy improvement). The two processes alternate until the policy becomes stable. In synchronous DP methods, states are treated equally and swept one by one.

The drawback of policy iteration is that you need to fully evaluate the value function over all states for each intermediate policy, which imposes a large computational cost when the state space is large. Later, people found a more efficient method called value iteration, which truncates policy evaluation to a single sweep over the states per improvement step and repeats such sweeps until the optimal value function (and hence the optimal policy) is found:

[Figure: value iteration pseudo-code]
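As a concrete (if toy) illustration of the synchronous sweep, here is a minimal value-iteration sketch for a small tabular MDP, written against hypothetical dense arrays T[s, a, s'] and R[s, a, s'] rather than any particular implementation from [1]:

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    # T[s, a, s'] = transition probability, R[s, a, s'] = reward.
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Back up every state: V(s) = max_a sum_s' T[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
        Q = np.einsum('sap,sap->sa', T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # converged values and a greedy policy
        V = V_new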

Even in value iteration, you still need to sweep over the whole space of states. To further accelerate it, people proposed asynchronous DP: “These algorithms back up the values of states in any order whatsoever, using whatever values of other states happen to be available. The values of some states may be backed up several times before the values of others are backed up once. … Some states may not need their values backed up as often as others. We might even try to skip backing up some states entirely if they are not relevant to optimal behavior.” (from [1])

Prioritized sweeping is one asynchronous DP method [3]:

[Figure: prioritized sweeping illustration]

The goal of prioritized sweeping is to learn $latex V^*(s)$. The basic idea is to make some updates more urgent than others: the urgent updates are those with a large Bellman error, $latex err(s)=|V^*(s) - max_a Q^*(s,a)|$ (evaluated with the current estimates). Here is the pseudo-code from the original paper [4]:

[Figure: prioritized sweeping pseudo-code from [4]]

Something I am not sure about is what policy is used during learning. I think the policy is always the greedy policy $latex \pi^*: \pi^*(s) = argmax_a \sum_{s’}T(s,a,s’) [R(s,a,s’) + \gamma V^*(s’)] &s=2$ with some exploration stimulus (see the original paper [4] for the parameters $latex r_{opt}$ and $latex T_{bored}$).

[1] gives a slightly different version of prioritized sweeping that focuses on the Bellman error of the Q-function rather than the V-function (note that this version is still a DP, model-based algorithm, because you need a transition function to find the $latex \bar{S}, \bar{A}$ that lead to $latex S$, and a reward function to predict the reward $latex \bar{R}$ for $latex \bar{S}, \bar{A}, S$):

[Figure: prioritized sweeping for Q-values, pseudo-code from [1]]
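To show just the priority-queue mechanics (not the exact pseudo-code from [4] or [1]), here is a minimal sketch. The model, predecessors, and other names are my own assumptions: model maps (s, a) to a deterministic (reward, next state), predecessors(s) returns the (s, a) pairs known to lead to s, and the exploration bonus (r_{opt}, T_{bored}) from [4] is omitted.

import heapq

def prioritized_sweeping(Q, model, predecessors, gamma=0.95, alpha=0.5,
                         theta=1e-3, n_updates=1000, start=None):
    # Q: dict state -> dict action -> value.
    def bellman_error(s, a):
        r, s_next = model[(s, a)]
        target = r + gamma * max(Q[s_next].values())
        return abs(target - Q[s][a]), target

    pq = []  # max-heap emulated with negated priorities
    if start is not None:
        err, _ = bellman_error(*start)
        heapq.heappush(pq, (-err, start))

    for _ in range(n_updates):
        if not pq:
            break
        _, (s, a) = heapq.heappop(pq)
        _, target = bellman_error(s, a)        # recompute with current values
        Q[s][a] += alpha * (target - Q[s][a])
        # A change at s can make backups at its known predecessors urgent.
        for (s_prev, a_prev) in predecessors(s):
            err, _ = bellman_error(s_prev, a_prev)
            if err > theta:
                heapq.heappush(pq, (-err, (s_prev, a_prev)))
    return Q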

Now, all of the above are DP methods, which assume you know $latex T(s,a,s’)$ and $latex R(s,a,s’)$ or have a model that approximates them.

RL methods are often called model-free methods because they don’t require you to know $latex T(s,a,s’)$ and $latex R(s,a,s’)$. $latex TD(0)$ and $latex TD(\lambda)$ are algorithms for evaluating the value function of a specific policy in an online fashion:

[Figure: TD(0) pseudo-code]

Note that in the iterative update formula $latex V(S) \leftarrow V(S)+\alpha [R + \gamma V(S’) - V(S)]$, neither the reward function nor the transition function appears, unlike in DP-based methods.
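That model-free character is easy to see in code. A minimal sketch of one TD(0) episode, written against an assumed env/policy interface (not taken from [1]):

def td0_episode(V, env, policy, alpha=0.1, gamma=0.99):
    # One episode of TD(0): V(S) <- V(S) + alpha * (R + gamma * V(S') - V(S)).
    # Only sampled transitions are used; no transition or reward function appears.
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)   # assumed interface returning (S', R, done)
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])
        s = s_next
    return V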

$latex TD(0)$ (or, more generally, $latex TD(\lambda)$) only solves the prediction problem, i.e., estimating the value function for a specific policy $latex \pi$. Therefore, in theory you can’t use it directly to derive a better policy, i.e., to solve a control problem. However, people still use it to find a good policy when a near-optimal policy is used while evaluating value functions and, in addition, the reward/transition functions are known. This is a common scenario for games in which the outcome of each move is predictable, such as board games. During learning, one can apply a near-optimal policy (e.g., $latex \epsilon$-greedy) to evaluate the values of game states. In real games after learning, the player evaluates the resulting state of each available move and picks the move with the highest sum of immediate reward and resulting state value: $latex \pi(s) = argmax_a \sum_{s’}T(s,a,s’) [R(s,a,s’) + \gamma V(s’)] &s=2$. Of course, you are then using the value function learned under the near-optimal policy to select actions greedily. The value functions under the two different policies might not agree completely, but in practice you often end up with a good outcome. (I am not 100% confident about this paragraph; please also refer to [5] and a student report [9].)

The model-free RL counterparts of policy iteration and value iteration are SARSA and Q-learning [2]; I am omitting their details. Prioritized Experience Replay [8] carries a similar idea to Prioritized Sweeping but applies it in the model-free setting.

[Figure: Prioritized Experience Replay pseudo-code from [8]]

You can see that neither the reward function nor the transition function is needed in Prioritized Experience Replay.
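As a rough sketch of the proportional-priority variant (a flat array instead of the sum-tree used in [8]; the parameters follow the paper’s \alpha and \beta, but the class itself is my own simplification):

import numpy as np

class ProportionalReplay:
    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current max priority so they are replayed at least once.
        p = max(self.priorities, default=1.0)
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size, beta=0.4):
        prios = np.array(self.priorities) ** self.alpha
        probs = prios / prios.sum()                       # P(i) ∝ priority^alpha
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights /= weights.max()                          # importance-sampling correction
        return idx, [self.buffer[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        # Priority is the magnitude of the TD error (plus a small epsilon).
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps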

 

Reference

[1] Richard S. Sutton’s book: Reinforcement Learning: An Introduction

[2] Reinforcement learning and dynamic programming using function approximators

[3] http://www.teach.cs.toronto.edu/~csc2542h/fall/material/csc2542f16_MDP2.pdf

[4] Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time

[5] Self-Play and Using an Expert to Learn to Play backgammon with temporal difference learning

[6] https://stackoverflow.com/questions/34181056/q-learning-vs-temporal-difference-vs-model-based-reinforced-learning

[7] https://stackoverflow.com/questions/45231492/how-to-choose-action-in-td0-learning

[8] Prioritized Experience Replay

[9] Reinforcement Learning in Tic-Tac-Toe Game and Its Similar Variations