Upgrade CUDA from 7.x to 8.0 on Ubuntu

1. Remove the CUDA 7.x installation (x depends on the version you installed):

sudo rm -rf /usr/local/cuda-7.x

2. Make sure PATH and LD_LIBRARY_PATH no longer contain “/usr/local/cuda-7.x”. Possible places to look are /etc/environment, ~/.profile, /etc/bash.bashrc, /etc/profile, and ~/.bashrc.

If you cannot find where the CUDA path is added to PATH or LD_LIBRARY_PATH, check here: https://unix.stackexchange.com/questions/813/how-to-determine-where-an-environment-variable-came-from
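
As a quick check, this short Python snippet (my own addition, not part of the original instructions) prints any entries of the two variables that still point at CUDA 7.x:

import os

# Print any PATH/LD_LIBRARY_PATH entries that still reference CUDA 7.x.
for var in ("PATH", "LD_LIBRARY_PATH"):
    entries = os.environ.get(var, "").split(os.pathsep)
    stale = [e for e in entries if "cuda-7" in e]
    print("%s: %s" % (var, stale if stale else "clean"))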

3. CUDA 8.0 only supports Ubuntu 14.04 and 16.04, so do a system upgrade if necessary. Ref: https://askubuntu.com/questions/760347/how-to-upgrade-from-14-04-lts-or-15-10-to-16-04-from-terminal

4. Install the CUDA 8.0 toolkit. Go to https://developer.nvidia.com/cuda-downloads and download whichever file type you prefer. The .deb file lets you install via the Software Center, which is not a bad idea.

5. To verify that CUDA 8.0 has been installed:

cd /usr/local/cuda-8.0/samples/
make
cd /usr/local/cuda-8.0/samples/1_Utilities/deviceQuery
./deviceQuery

You should see:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 640M"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1999 MBytes (2096300032 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            709 MHz (0.71 GHz)
  Memory Clock rate:                             2000 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GT 640M
Result = PASS

Also, you can go to `/usr/local/cuda-8.0/samples/bin` and run any generated test program you want.

(If make fails with a “cannot find -lnvcuvid” error, follow the fix here: https://askubuntu.com/questions/889218/testing-cuda-in-ubuntu-16-04-usr-bin-ld-cannot-find-lnvcuvid)

ref: http://xcat-docs.readthedocs.io/en/stable/advanced/gpu/nvidia/verify_cuda_install.html

 

6. Follow https://nb4799.neu.edu/wordpress/?p=2572 to set up LD_LIBRARY_PATH and CUDA_HOME.

 

English Grammar

“A” or “an” before an acronym or abbreviation? e.g., a FAQ or an FAQ?

https://english.stackexchange.com/questions/1016/do-you-use-a-or-an-before-acronyms

 

When should I add “the” before what kind of noun?

http://www.englishteachermelanie.com/grammar-when-not-to-use-the-definite-article/

 

Whether to repeat “the” in “noun and noun” phrases?

http://english.stackexchange.com/questions/9487/is-it-necessary-to-use-the-multiple-times

 

“noun and noun” phrase: is the following verb plural or singular?

http://www.mhhe.com/mayfieldpub/tsw/nounsagr.htm

 

adj before “noun and noun” phrase

http://theeditorsblog.net/2015/08/08/one-adjective-paired-with-multiple-nouns-a-readers-question/

 

when to use articles?

http://www.quickanddirtytips.com/education/grammar/when-to-use-articles-before-nouns

 

whether to add comma before “and”?

https://english.stackexchange.com/questions/30516/should-i-use-a-comma-before-and-or-or

 

Which words in a title should be capitalized?

https://english.stackexchange.com/questions/14/which-words-in-a-title-should-be-capitalized

http://grammar.yourdictionary.com/capitalization/rules-for-capitalization-in-titles.html

 

do not italicize abbreviations like e.g., etc., …

Screenshot from 2017-04-19 15-05-09

 

italicize technical or key terms at first appearance

http://blog.apastyle.org/apastyle/2015/04/using-italics-for-technical-or-key-terms.html

 

abbreviation, initialism and acronym

http://blog.apastyle.org/apastyle/abbreviations/

https://www.proofreadnow.com/blog/bid/65823/Acronyms-Initialisms-and-Abbreviations (this also says do not italicize acronyms)

 

Hyphenate multiple words when they are used as an adjective

https://english.stackexchange.com/questions/220967/hyphenate-cutting-edge-state-of-the-art-following-to-be-verbs

https://english.stackexchange.com/questions/2908/should-i-use-related-or-related

 

Describing numbers in English words or in numerals?

http://www.aje.com/en/arc/editing-tip-using-numbers-scientific-manuscripts/

 

“None” can be followed by either a singular or a plural verb

http://www.onlinegrammar.com.au/top-10-grammar-myths-none-always-takes-a-singular-verb/

 

Should I add “the” in front of a superlative adjective?

Not required in many cases: https://ell.stackexchange.com/questions/46923/is-there-always-a-the-before-a-superlative-adjective

 

“half the time” or “half of the time”

https://english.stackexchange.com/questions/217600/should-i-use-half-the-time-or-half-of-the-time

Using “half the time” is fine.

 

“Be it” usage

Be he good or evil, … = Whether he is good or evil, …

Be it so or not, … = Whether it is so or not, …

https://english.stackexchange.com/questions/256311/be-it-or-grammar

 

during vs. while

during + noun;  while + clause

http://www.differencebetween.net/language/difference-between-during-and-while/

 

When to use “among” and “between”?

between specific things, among a group of things

The negotiations between Brazil, Argentina, and Chile are going well.

The negotiations among the countries of South America are going well.

https://www.espressoenglish.net/whats-the-real-difference-between-between-and-among/

 

“that” vs “which”

“that” introduces a restrictive clause

“which” introduces a non-restrictive clause

https://www.grammarly.com/blog/which-vs-that/

 

https://afterdeadline.blogs.nytimes.com/2010/04/13/faqs-on-style/

Importance sampling

Importance sampling is a way to reduce the variance of a Monte Carlo estimate of an integral over a region.

Let’s first see how the traditional Monte Carlo method estimates an integral [2]. To estimate $latex \int_a^b f(x) dx$, one can think of reshaping the area to be integrated into a rectangle whose width is $latex b-a$ and whose height is $latex \mathbb{E}_{\mathcal{U}[a,b]}(f(x))$, a virtual, expected height.

Therefore, to use Monte Carlo estimation, we can randomly draw $latex N$ numbers from the uniform distribution $latex \mathcal{U}[a,b]$, and use the following result as the approximation to $latex \int_a^b f(x) dx$:

$latex \frac{(b-a)}{N}\sum\limits_{i=1}^N f(x_i) &s=2$

 

Of course, this estimation will vary from run to run (every run is an estimation based on $latex N$ samples). Therefore, Monte Carlo estimation for integration has variance. To reduce this variance, we use importance sampling [1]. The idea is to convert the integration formula into an expectation formula with an auxiliary, known distribution $latex h(x)$:

$latex \int_a^b f(x) dx = \int_a^b f(x) \frac{h(x)}{h(x)} dx = \int_a^b \frac{f(x)}{h(x)} h(x) dx = \mathbb{E}_h \big(\frac{f(X)}{h(X)} \big) &s=2$

 

Using the Monte Carlo method to estimate this expectation, we have:

$latex \mathbb{E}_h \big(\frac{f(X)}{h(X)} \big) \approx \frac{1}{n} \sum\limits_{i=1}^n \frac{f(X_i)}{h(X_i)} ,\quad X_i \sim h(x) &s=2$ 

 

When $latex \alpha h(x) = f(x)$ (assuming $latex f(x)>0$), the importance sampling variance becomes zero, because $latex \mathbb{E}_h \big(\frac{f(X)}{h(X)}\big) = \alpha$ is a constant regardless of $latex x_i$. Though a nice property, zero variance is impossible to achieve in reality: if you knew $latex h(x)$ were proportional to $latex f(x)$, then multiplying the density of $latex h(x)$ by $latex \alpha$ would give you $latex f(x)$ exactly, so you would not need the Monte Carlo method in the first place to calculate the integral. The common solution is to pick a known distribution $latex h(x)$ that appears close, in terms of shape, to $latex f(x)$. The closer $latex h(x)$ is to $latex f(x)$, the smaller the variance. If $latex h(x)$ is the uniform distribution on $latex (a,b)$, importance sampling is equivalent to the traditional Monte Carlo method introduced at the beginning of this article.

 

Here is the code to verify that importance sampling really reduces the variance when estimating an integral.

# estimate integration over a normal distribution (mean=0, std=1) in the range of (a, b)
from scipy.stats import norm
import numpy
from scipy import stats

T = 200  # num of rounds
N = 10   # num of samples per round
a, b = -2., 2.   # lower bound, upper bound
fx_loc, fx_scale = 0, 1   # f(x) is a normal distribution (mean=0, std=1)

class hx1_dist_gen(stats.rv_continuous):
    def _pdf(self, x):
        return 1. / b - 1. / (b**2) * numpy.abs(x)

class hx2_dist_gen(stats.rv_continuous):
    def _pdf(self, x):
        return 1. / (b**2) * numpy.abs(x)

hx1_dist = hx1_dist_gen(a=a, b=b)
hx2_dist = hx2_dist_gen(a=a, b=b)

# monte carlo estimation
mc_res = []
for t in range(T):
    print("MC round %d" % t)
    x = numpy.random.uniform(low=a, high=b, size=N)
    fxs = norm.pdf(x, loc=fx_loc, scale=fx_scale)
    mc_res.append((b - a) * numpy.mean(fxs))

# importance sampling estimation using h1(x)
is1_res = []
for t in range(T):
    print("Importance Sampling h1(x) round %d" % t)
    # draw x in the range (a, b) from h1(x)
    x = hx1_dist.rvs(size=N)
    fx = norm.pdf(x, loc=fx_loc, scale=fx_scale)
    hx = hx1_dist.pdf(x)
    fx_div_hxs = fx / hx
    is1_res.append(numpy.mean(fx_div_hxs))

# importance sampling estimation using h2(x)
is2_res = []
for t in range(T):
    print("Importance Sampling h2(x) round %d" % t)
    # draw x in the range (a, b) from h2(x)
    x = hx2_dist.rvs(size=N)
    fx = norm.pdf(x, loc=fx_loc, scale=fx_scale)
    hx = hx2_dist.pdf(x)
    fx_div_hxs = fx / hx
    is2_res.append(numpy.mean(fx_div_hxs))

print("exact: {0}".format(norm.cdf(b) - norm.cdf(a)))
print("MC estimation (mean/std): {0}/{1}".format(numpy.mean(mc_res), numpy.std(mc_res)))
print("Importance Sampling h1 estimation (mean/std): {0}/{1}".format(numpy.mean(is1_res), numpy.std(is1_res)))
print("Importance Sampling h2 estimation (mean/std): {0}/{1}".format(numpy.mean(is2_res), numpy.std(is2_res)))

 

Output on my machine:

exact: 0.9544997361036416
MC estimation (mean/std): 0.958336124778972/0.15458260669929705
Importance Sampling h1 estimation (mean/std): 0.9515730240850715/0.07416369320580925
Importance Sampling h2 estimation (mean/std): 1.0496523604436687/0.7489580081170375

 

I set $latex f(x)$ as a normal distribution with mean=0 and std=1. I use the traditional Monte Carlo method as well as two kinds of distributions, $latex h_1(x)$ and $latex h_2(x)$, to do importance sampling. All distributions can be represented in the following chart:

Untitled Diagram

 

From the output we can see that the estimation using importance sampling with $latex h_1(x)$ has the smallest std. This is because $latex h_1(x)$ (the blue triangle) is the closest in shape to $latex f(x)$. $latex h_2(x)$ (the red shape formed by two triangles) has the largest std because it is the least close to $latex f(x)$. The traditional MC method (equivalently, importance sampling with a uniform distribution) has a medium std.

 

Reference

[1] https://www.dropbox.com/s/75vb40b2vfnnlcm/Monte%20Carlo%20Methods%20and%20Importance%20Sampling%5Bai%20classicML%5D.pdf?dl=0

[2] http://ta.twi.tudelft.nl/mf/users/oosterle/oosterlee/lec8-hit-2009.pdf

Inverse Reinforcement Learning

In my rough understanding, inverse reinforcement learning is a branch of RL research in which an agent learns to produce state-action sequences that resemble given tutor sequences.

There are two famous works on inverse reinforcement learning. One is Apprenticeship Learning via Inverse Reinforcement Learning [1], and the other is Maximum Margin Planning [2].

Maximum Margin Planning

In MMP, we learn a cost function estimating the cost of each state-action pair. The desired cost function should assign low costs to the state-action pairs appearing in “tutor” sequences (examples) and high costs to those deviating from tutor sequences. If we manage to recover such a cost function, we can assign a cost to any state-action pair and use path-finding algorithms (e.g., A*, Dijkstra, etc.) to find, in a new environment, the sequence with the minimum accumulated cost, that is, a sequence deemed to resemble the given tutor sequences.

The basic idea is given in [4]:

Screenshot from 2017-03-29 18:36:37

As you can see, in line 6 you adjust the cost function to increase the cost of the current “not-perfect” minimum-cost path and decrease the cost of the example path. However, the algorithm doesn’t say by how much the cost of the minimum-cost path should increase. If one path deviates from the example path just a little and another path deviates far more, how should the adjustment of the cost function reflect that?

So the author proposes an improvement called “loss augmentation”. Suppose the original cost at a state-action pair is c^{sa}; then its augmented cost is \tilde{c}^{sa} = c^{sa}-l^{sa}, where l^{sa} is a loss function taking large values on deviated state-action pairs and close-to-zero values on pairs resembling the tutor sequences. Why do we want the augmented cost to be even smaller for deviated state-action pairs? Because this makes the deviated pairs even harder to distinguish from the desired pairs, which forces the algorithm to keep updating the cost function until it can separate the augmented costs of different state-action pairs. In other words, the algorithm learns a cost function that separates any desired state-action pair (s,a) from a less desired state-action pair (s',a') by at least a margin l^{s'a'}.

Here is the visualization showing how deviated paths get high rewards (+1) and desired paths get low rewards (-1).

Screenshot from 2017-03-29 21:42:24

And here is the revised algorithm based on Algorithm 1:

Screenshot from 2017-03-29 21:42:30

Now, let’s look at how to learn the cost function using a framework called maximum margin structured classification.

Before that, we introduce some preliminaries. In this paper, we are given N examples; each example is an MDP and a tutor sequence. Different examples may come from different MDPs, so the model can learn a cost function that generalizes to new MDPs never seen in the training examples. However, for simplicity, we just assume all training examples are different tutor sequences from the same MDP with state space \mathcal{S} and action space \mathcal{A}. Also for simplicity, an important assumption made throughout [4] is that a tutor sequence is acyclic. This means any state-action pair is visited at most once in a tutor sequence. Then it is easy to represent a tutor sequence by \mu \in \{0,1\}^{\mathcal{S}\mathcal{A}}, a state-action frequency count vector. If the assumption that tutor sequences are acyclic were not made, we would have \mu \in \mathbb{R}_{+}^{\mathcal{S}\mathcal{A}} satisfying certain flow constraints.

Now remember that the idea is to approximate everything linearly, such as the loss function and the cost function. Each training example has a loss function, which quantifies the difference between an arbitrary sequence and the i-th tutor sequence \mu_i. Since we want the loss function to be a linear function of \mu, we let \mathcal{L}_i(\mu)=l_i^T \mu. The more \mu deviates from \mu_i, the higher \mathcal{L}_i(\mu) should be. A straightforward way to define l_i is to make l_i \in \{0,1\}^{\mathcal{S}\mathcal{A}}, with 0 denoting a state-action pair visited by both or neither of \mu and \mu_i, and 1 otherwise. Of course, more advanced loss functions can be used, but the bottom line is that \mathcal{L}_i(\mu) \geq 0. As for the cost function, it should also factor over states and actions. This means the cost of a sequence is the sum of the costs of the state-action pairs visited in the sequence. Suppose we have a feature matrix F \in \mathbb{R}^{d \times \mathcal{S}\mathcal{A}}, whose columns are the feature vectors of every state-action pair. We also suppose the cost function has a linear weight w \in \mathbb{R}^{d}. Then the cost of one state-action pair is just w^T F_j, where F_j is the column corresponding to that state-action pair. Finally, the cost of the whole sequence is w^T F \mu. You can think of F \mu as the accumulated sum of feature vectors over the visited state-action pairs.

Since we assume all training examples come from the same MDP, we can write the data set as \mathcal{D}=\{(F, \mu_i, l_i)\}^N_{i=1}.

Now the problem is how to obtain w. Recall that we want w to be able to differentiate the desired paths (\mu_i from \mathcal{D}) from tentative paths (arbitrary \mu \in \mathcal{G}, where \mathcal{G} is the set of all possible sequences in the MDP) by a margin \mathcal{L}_i(\mu)=l_i^T \mu. Therefore, the constraints we would like to impose are: for each \mu_i and each \mu \in \mathcal{G}, w^T F \mu_i \leq w^T F \mu - l_i^T \mu. This introduces a large number of constraints when the space \mathcal{G} is large. Therefore, we reduce the number of constraints by rewriting them as: for each \mu_i, w^T F \mu_i \leq \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\}. Intuitively, this means that if w can successfully distinguish \mu_i from the loss-augmented cost of the hardest tentative path \mu \in \mathcal{G}, then it can distinguish \mu_i from any other tentative path. Also, we would like to regularize the norm of w, because without norm regularization w could always be scaled arbitrarily large to overcome the margin constraints, voiding the original purpose of adding margins. Moreover, we introduce a slack term \xi_i \geq 0 for each training example. This allows some margin constraints to be violated, because with real data there may be no w that satisfies all margin constraints perfectly.

Putting together all we have described above, we arrive at the optimization objective:

$latex \min\limits_{w \in \mathcal{W}, \xi_i \in \mathbb{R}_+} \frac{1}{N} \sum\limits_{i=1}^N \xi_i + \frac{\lambda}{2} \left\Vert w \right\Vert^2 \newline \forall i, \; w^T F \mu_i \leq \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\} + \xi_i &s=2$

We note that \xi_i resides in both the objective function and the constraints. We want \xi_i to be as small as possible. Therefore, at the minimizer the slack variables always exactly equal the constraint violation: \xi_i = w^T F \mu_i - \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\}. Hence, we can rewrite our constrained objective function as an unconstrained one:

$latex R(w) = \frac{1}{N} \sum\limits^N_{i=1} \big( w^T F \mu_i - \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\} \big) + \frac{\lambda}{2} \left\Vert w \right\Vert^2 &s=2$

Now, R(w) is a non-differentiable (because of the \min), convex objective. The author in [4] proposes to optimize it using the subgradient method.

In my mind, the subgradient method is really similar to the normal gradient descent method. The gradient \nabla f(x) of a differentiable convex function f(x) at a point x satisfies: f(y) \geq f(x) + \nabla f(x)^T (y-x), \forall y \in \text{dom }f. This means the linear function f(x) + \nabla f(x)^T (y-x) anchored at x always grows no faster than f(y) [7]:

Screenshot from 2017-03-30 09:24:17

The subgradient equals the gradient wherever the function is differentiable. At a non-differentiable point x, a subgradient g can be any value such that g^T(y-x) \leq f(y) - f(x), \forall y \in \text{dom }f:

Screenshot from 2017-03-30 09:41:50

Now let’s look at the part -\min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\} in R(w). This part can be rewritten as \max\limits_{\mu \in \mathcal{G}} \{-w^T F \mu + l_i^T \mu\}. This is a max over the affine functions \{-w^T F \mu + l_i^T \mu\}, \forall \mu \in \mathcal{G}, each shown as a dashed line in the figure below:

Screenshot from 2017-03-30 09:46:16

The bold line is \max\limits_{\mu \in \mathcal{G}} \{-w^T F \mu + l_i^T \mu\}. The red line is the subgradient at the specific value of w marked by the vertical dashed line. Therefore, the subgradient of R(w) is:

$latex \nabla R(w) = \frac{1}{N} \sum\limits_{i=1}^N F(\mu_i - \mu_i^*) + \lambda w &s=2$

where \mu_i^* = argmin_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu \}.

Now, w can be updated online, i.e., updated after seeing the examples one by one:

$latex w_{t+1} = \mathcal{P}_{\mathcal{W}} [ w_t - \alpha (F(\mu_i - \mu_i^*) + \lambda w_t)] &s=2$

where \mathcal{P}_{\mathcal{W}}(\cdot) is a projection operator that maps w back to the feasible region \mathcal{W}. For example, sometimes we even require that all costs be positive. How to do the projection is something I don’t fully understand in [4]. I hope to illustrate it once I am crystal clear about it.
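
To make the online update concrete, here is a minimal sketch of one subgradient step. The planner min_cost_path is a hypothetical helper (e.g., Dijkstra or A* run on the loss-augmented costs), not something defined in [4]:

import numpy as np

def mmp_subgradient_step(w, F, mu_i, l_i, min_cost_path, alpha=0.01, lam=0.1):
    """One online subgradient step of Maximum Margin Planning.
    F: d x (|S||A|) feature matrix; mu_i, l_i: frequency and loss vectors
    of example i; min_cost_path: hypothetical planner that, given a cost
    per state-action pair, returns the min-cost path's frequency vector."""
    # loss-augmented costs per state-action pair: w^T F - l_i
    aug_costs = w.dot(F) - l_i
    # mu_i^* = argmin_mu { w^T F mu - l_i^T mu }, found by the planner
    mu_star = min_cost_path(aug_costs)
    # subgradient of this example's term of R(w), plus regularization
    g = F.dot(mu_i - mu_star) + lam * w
    # gradient step; a real implementation would also project back onto W
    return w - alpha * g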

Reference

[1] Apprenticeship Learning via Inverse Reinforcement Learning

[2] Maximum Margin Planning

[3] Boosting Structured Prediction for Imitation Learning

[4] Learning to Search: Structured Prediction Techniques for Imitation Learning (Ratliff’s thesis)

[5] Max Margin Learning (Youtube video)

[6] Max Margin Planning Slides

[7] Subgradient slides

2018.4.8

This seems like a good tutorial (https://thinkingwires.com/posts/2018-02-13-irl-tutorial-1.html) for explaining [8].

[8] Algorithms for Inverse Reinforcement Learning

Reinforcement learning overview

Here are some materials I found useful to learn Reinforcement Learning (RL).

Let’s first look at the Markov Decision Process (MDP), in which you know a transition function $latex T(s,a,s')$ and a reward function $latex R(s,a,s')$. In the diagram below, the green state is called a “q state”.

Screenshot from 2017-03-05 22:45:24

Some notations that need to be clarified:

Dynamic programming refers to a collection of algorithms that compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP). Until we introduce reinforcement learning, all the algorithms below are within the dynamic programming family, such as value iteration and policy evaluation/iteration.

$latex V^*(s)=$ expected utility starting in $latex s$ and acting optimally

$latex Q^*(s,a)=$ expected utility starting out having taken action $latex a$ from state $latex s$ and thereafter acting optimally

$latex \pi^*(s)=$ optimal action from state $latex s$

By definition:

$latex V^*(s)=\max_a Q^*(s,a)=\max_a \sum\limits_{s'}T(s,a,s')[R(s,a,s')+\gamma V^*(s')] \newline Q^*(s,a)=\sum\limits_{s'}T(s,a,s')[R(s,a,s')+\gamma V^*(s')]$

 

The second equation is called the Bellman Equation. $latex \gamma$ is the discounting factor, meaning that the true value of a state depends not only on the accumulated absolute value of rewards but also on how early we get the reward. The earlier we can get reward, the better. Also, without $latex \gamma$, the value of a state could be infinitely large. If you have $latex N$ states and known $latex T(s,a,s')$ and $latex R(s,a,s')$, you can use the $latex N$ equations of $latex V^*(s)$ and a non-linear solver to get $latex V^*(s)$. However, this may be too complex in practice. Instead, you can learn $latex V^*(s)$ in an iterative way, which is called value iteration:

$latex V_{k+1}^{*}(s) \leftarrow \max_a \sum\limits_{s'}T(s,a,s')[R(s,a,s')+\gamma V^{*}_k(s')] $

However, note that each iteration takes $latex O(S^2A)$ time, where $latex S$ and $latex A$ are the numbers of states and actions, respectively.
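
To make this concrete, here is a minimal value-iteration sketch for a tabular MDP; representing T and R as (S, A, S)-shaped arrays is my own convention, not from the lectures:

import numpy as np

def value_iteration(T, R, gamma=0.9, n_iters=100):
    """T and R are (S, A, S)-shaped arrays holding T(s,a,s') and R(s,a,s')."""
    S, A, _ = T.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        # Q[s,a] = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))
        Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
        V = Q.max(axis=1)  # V*(s) = max_a Q*(s,a)
    return V, Q.argmax(axis=1)  # optimal values and a greedy policy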

For a given policy $latex \pi$, you can also evaluate the values of states under that policy. In general, such a process is called policy evaluation:

$latex V^{\pi}(s) \leftarrow \sum\limits_{s'} T(s, \pi(s), s')[R(s,\pi(s), s')+\gamma V^{\pi} (s')] \newline V^{\pi}_{k+1}(s) \leftarrow \sum\limits_{s'} T(s,\pi(s), s')[R(s,\pi(s), s')+\gamma V_k^{\pi} (s')]$

Note that in policy evaluation of $latex V^{\pi}(s)$ there is no max function, while in value iteration for $latex V^{*}(s)$ there is. Because there is no max in the equation, for policy evaluation you can use a linear solver to get $latex V^{\pi}(s)$ (first equation). Or you can use iterative updates (second equation). With a small $latex \gamma$, the iterative update might be fast, because a small $latex \gamma$ usually leads to quick convergence.

Starting from any given policy $latex \pi$, you can also iterate to find an optimal policy and obtain $latex V^*(s)$ along the way. Such a process is called policy iteration, which alternates two steps (a code sketch follows the list):

  1. fix the current policy $latex \pi_i$, iterate until values converge: $latex V^{\pi_i}_{k+1}(s) \leftarrow \sum\limits_{s'} T(s, \pi_i(s), s') [R(s, \pi_i(s), s') + \gamma V^{\pi_i}_k(s') ]$
  2. fix the current values, get a better policy by one-step lookahead: $latex \pi_{i+1}(s)=\text{argmax}_a \sum\limits_{s'}T(s,a,s')[R(s,a,s')+\gamma V^{\pi_i}_{k+1}(s')]$.
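
Here is the corresponding minimal sketch, under the same (S, A, S)-array convention as the value-iteration snippet above:

import numpy as np

def policy_iteration(T, R, gamma=0.9, eval_iters=50):
    """T and R are (S, A, S)-shaped arrays with entries T(s,a,s') and R(s,a,s')."""
    S, A, _ = T.shape
    pi = np.zeros(S, dtype=int)
    while True:
        # Step 1: policy evaluation, iterating values under the fixed policy pi
        V = np.zeros(S)
        for _ in range(eval_iters):
            V = np.array([T[s, pi[s]].dot(R[s, pi[s]] + gamma * V)
                          for s in range(S)])
        # Step 2: policy improvement by one-step lookahead
        Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi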

Value iteration and policy iteration suffer from speed issues in different ways. Remember that value iteration on $latex V^{*}(s)$ is slow because each iteration takes $latex O(S^2A)$ time. In comparison, step 1 of policy iteration only considers one action $latex \pi_i(s)$ and the related next states $latex s'$. However, you need to wait for all values to converge in step 1 of policy iteration, which also takes time.

All the discussions above happen in the context of a known MDP. That means $latex T(s,a,s')$ and $latex R(s,a,s')$ are known to you. All the methods we introduced above belong to dynamic programming, also called model-based methods. Also note that all these methods update estimates on the basis of other estimates. We call this general idea bootstrapping.

In the reinforcement learning methods we will talk about next, you may not know $latex T(s,a,s')$ or $latex R(s,a,s')$, so they are sometimes called model-free methods. Many reinforcement learning algorithms also use bootstrapping.

Let’s first talk about a passive reinforcement learning scheme called temporal difference (TD) learning. It only estimates $latex V^{\pi}(s)$ for each state given a specific policy $latex \pi$. It is called “passive” reinforcement learning because you are given a fixed policy; you don’t need to learn the optimal policy, only to evaluate the given one. However, merely knowing $latex V^{\pi}(s)$ does not directly suggest the optimal action under the policy $latex \pi$. Thus, there are two kinds of problems that all variants of reinforcement learning solve: the prediction problem, estimating the value function of a given policy; and the control problem, finding an optimal policy, i.e., the policy maximizing accumulated rewards when traversing through states.
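
The TD(0) update for the prediction problem is tiny; a minimal sketch (the function signature is my own):

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update of V^pi after observing the transition
    (s, r, s_next) while following the fixed policy pi."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V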

In reinforcement learning, the model-free counterparts of value iteration and policy iteration that use q-functions are Q-learning and SARSA, respectively [6].

Screenshot from 2017-08-13 20-26-57 Screenshot from 2017-08-13 20-26-23

 

Here is an example using TD learning. Each cell is a state, and the arrows in the cells denote a fixed, given policy $latex \pi$. Reward is only given at the top-right corner (+1) and the cell below it (-1). We can see that after 100 iterations, each cell (state) has learned $latex V^{\pi}(s)$. Now, why is the cell on the left of the +1 cell 0.85 and not 1.00? This is due to two reasons: 1. the discounting factor $latex \gamma$ makes $latex V^{\pi}(s)$ approach but never reach 1.00; 2. the transition is stochastic, so the transition probability $latex p(s'|s, a)$ is not exactly 1 and several next states $latex s'$ are possible. For example, in the 0.85 cell, even though the policy suggests going right, there is some chance of moving to the cell below, which might lead to the undesirable state. Therefore, this state’s value cannot be exactly +1.00.

1

 

We can compare to the result of Q-learning, which has the update rule: $latex Q(s,a) \leftarrow (1-\alpha)Q(s,a)+\alpha \cdot [R(s,a,s') + \gamma \max_{a'}Q(s',a')]$.

Screenshot from 2017-03-05 23:54:52

First, note that the same cell on the left of the +1 cell has $latex q(s, RIGHT)=0.85$, equal to the corresponding cell in the TD learning example. This is because the policy provided in TD learning also suggests taking RIGHT there. Second, why does the cell on the left of the -1 cell have $latex q(s, RIGHT)=-0.6$ instead of $latex -0.85$? Because if an agent stands in that cell and takes RIGHT but does not land in the -1 cell, it becomes less likely to get into the -1 cell again. However, imagine another agent standing on the left of the +1 cell that takes RIGHT: if it fails to get into the +1 cell, it might still stay in the same cell, because going UP isn’t really an option, leaving it ready to get into the +1 cell again.

Q-learning is called off-policy learning because the exploration policy does not matter, as long as it makes you experience all state-action pairs and you set a proper learning rate $latex \alpha$. For example, even if you use a very poor exploration policy and, starting from $latex s'$, take a very poor action $latex a'$, $latex \max_{a'}Q(s', a')$ might still be large enough to improve $latex Q(s,a)$. The difference between Q-learning and SARSA, and generally between off-policy and on-policy, can be seen in [4].
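
For concreteness, here is a minimal tabular Q-learning sketch with epsilon-greedy exploration; the environment interface (reset/step) is a hypothetical stand-in, not from the lectures:

import numpy as np

def q_learning_episode(Q, env, alpha=0.1, gamma=0.9, eps=0.1):
    """One episode of tabular Q-learning. Q is an (S, A) array; env is a
    hypothetical environment whose reset() returns a state and whose
    step(a) returns (next_state, reward, done)."""
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy: any exploration policy works, since Q-learning is off-policy
        if np.random.rand() < eps:
            a = np.random.randint(Q.shape[1])
        else:
            a = int(Q[s].argmax())
        s_next, r, done = env.step(a)
        # Q(s,a) <- (1-alpha) Q(s,a) + alpha [R + gamma max_a' Q(s',a')]
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
        s = s_next
    return Q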

 

An improved Q-learning technique is called Dyna Q-learning, which achieves model learning, direct reinforcement learning, and planning (indirect reinforcement learning) in the same algorithm. Model learning means learning a model that predicts the next state and next reward given the current state and an action. In Dyna Q-learning, the model is trivial: a table recording the last-observed $latex R_{t+1}, S_{t+1}$ for each $latex S_t, A_t$ (Step e). Direct reinforcement learning means the direct update of Q-values just as in normal Q-learning (Step d). Planning means additionally updating Q-values based on simulated experience from the model (Step f).

Screenshot from 2017-03-08 19:59:41

 

Without planning part, Dyna Q-learning downgrades to a normal Q-learning. The (state-space) planning can be depicted as:

Screenshot from 2017-03-08 20:06:04

Because Dyna Q-learning uses additional (simulated) experience to update Q-values, it converges to the optimal policy in fewer iterations. Here is an example from Chapter 8.2 in [5].

Screenshot from 2017-03-08 20:08:56

Currently, I only know Dyna Q-learning as a tabular method, meaning $latex S$ and $latex A$ need to be finite and discrete so that we can have an entry for each $latex (S, A)$ pair.
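
As a rough sketch of steps (d), (e), (f) in the algorithm above (the dict-based table layout is my own choice):

import random

def dyna_q_update(Q, model, s, a, r, s_next, n_actions,
                  alpha=0.1, gamma=0.9, n_plan=10):
    """One Dyna-Q step: direct RL (Step d), model learning (Step e), and
    planning on simulated experience (Step f). Q and model are dicts keyed
    by (state, action); a sketch of the tabular algorithm in [5]."""
    # Step d: direct reinforcement learning (normal Q-learning update)
    best_next = max(Q.get((s_next, b), 0.0) for b in range(n_actions))
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
    # Step e: model learning -- remember the last observed outcome of (s, a)
    model[(s, a)] = (r, s_next)
    # Step f: planning -- replay n_plan randomly chosen remembered pairs
    for _ in range(n_plan):
        (ps, pa), (pr, pnext) = random.choice(list(model.items()))
        best = max(Q.get((pnext, b), 0.0) for b in range(n_actions))
        Q[(ps, pa)] = Q.get((ps, pa), 0.0) + alpha * (pr + gamma * best - Q.get((ps, pa), 0.0))
    return Q, model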

 

 

Now, here is a big roadmap of RL and MDPs:

Screenshot from 2017-03-06 23:26:03

When the MDP is known, we can use value iteration and policy iteration to compute $latex V^*$, $latex Q^*$ and $latex \pi^*$. When given a specific policy $latex \pi$, we can evaluate $latex V^{\pi}(s)$.

When the MDP is unknown, we need to use RL. For model-based learning, we build models to estimate $latex T(s,a,s')$ and $latex R(s,a,s')$ as an approximated MDP. After that, we can apply value iteration/policy iteration just as on a normal MDP. Similarly, we can also do policy evaluation on the approximated MDP.

For model-free learning, we do not even need to know or model $latex T(s,a,s')$ and $latex R(s,a,s')$. Nevertheless, we can still learn $latex V^*$, $latex Q^*$ and $latex \pi^*$.

 

The biggest challenge of naive Q-learning is that the number of state-action pairs $latex Q(s,a)$ grows quickly. Therefore, we can generalize across state-action pairs by using weights on features describing q states to approximate Q-learning.

Screenshot from 2017-03-07 00:30:44

Note that in approximate Q-learning you no longer maintain a table recording $latex Q(s,a)$; the only thing you need to do is update the weights.
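
A minimal sketch of that weight update, where features(s, a) is a hypothetical feature function returning a numpy vector:

import numpy as np

def approx_q_update(w, features, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    """One approximate Q-learning update, with Q(s,a) = w . features(s,a)."""
    q_sa = w.dot(features(s, a))
    # the target bootstraps from the best next action under the current weights
    q_next = max(w.dot(features(s_next, b)) for b in actions)
    td_error = (r + gamma * q_next) - q_sa
    # each weight moves in proportion to its feature's contribution
    return w + alpha * td_error * features(s, a)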

 

 

[1] https://www.youtube.com/watch?v=ggqnxyjaKe4

[2] https://www.youtube.com/watch?v=Csiiv6WGzKM (lecture 8 ~ 11)

[3] http://web.engr.oregonstate.edu/~xfern/classes/cs434/slides/RL-1.pdf

[4] https://czxttkl.com/?p=1850

[5] https://webdocs.cs.ualberta.ca/~sutton/book/bookdraft2016sep.pdf

[6] Reinforcement learning and dynamic programming using function approximators: http://orbi.ulg.ac.be/bitstream/2268/27963/1/book-FA-RL-DP.pdf

Abstract Algebra

I am introducing some basic definitions of abstract algebra: structures like monoids, groups, rings, fields and vector spaces, as well as homomorphism/isomorphism.

I found clear definitions of the structures in [1]:

Screenshot from 2017-02-22 20:48:07

Also, the tables below show clear comparisons between several structures [2,3]:

Screenshot from 2017-02-22 20:50:49

Screenshot from 2017-02-22 21:07:59

 

All these structures are defined with both a set and operation(s). Based on [4], a monoid is an abstraction of addition. A group is an abstraction of addition and subtraction. A ring adds multiplication, but not necessarily division. A field adds division. Therefore, ultimately, a field is the “biggest” structure (compared to monoid, group and ring), on which addition, subtraction, multiplication and division are all defined. A similar idea is also illustrated in [6].

As also pointed out by [6], since we are dealing with abstract algebra, calling the operations “addition, subtraction, multiplication, division” is just a convention. The operations are applied to sets that might be very different from real numbers or integers. They are analogous, but not necessarily identical, to what we understand as addition, subtraction, multiplication and division on real numbers.

We can safely say that every ring is a group, and every field is a ring [5]:

Screenshot from 2017-02-22 21:39:24

A vector space over a field $latex F$ is a set $latex V$ with two operations, vector addition and scalar multiplication, satisfying the ten axioms listed below [10, 11, 22]:

PDcix

To me, Euclidean space is the most familiar example of a vector space.

 

Now, there are several subcategories of vector spaces. One kind is the normed vector space [12], meaning that the norm of vectors is defined on the vector space. Another kind is the inner product vector space [13], which is a vector space with a mapping called an “inner product” that maps any pair of vectors (over a field F) to a scalar in F. (Note that I used to think “inner product” and “dot product” were interchangeable. Now I know that “inner product” is just the name of an operator that maps two vectors into a field; “dot product” is the name of the inner product in Euclidean space.) More specifically, an operator can be called an “inner product” if it satisfies [26]:

Screenshot from 2017-02-25 21:47:47

The inner product between a vector and itself naturally induces a norm. Hence, an inner product vector space must be a normed vector space. BTW, while searching about inner product vector spaces, I found a good textbook [14]. Another kind of vector space is the metric vector space, in which the distance $latex d(x,y)$ between any two vectors $latex x,y \in V$ is defined. A normed vector space must be a metric vector space because a metric can be induced from the norm (page 59 in [25]). Bear in mind that a metric space may not be a normed space, and a normed space may not be an inner product space. You can get a basic intuition of how these spaces, from metric space to normed space to inner product space, become more and more powerful [27].
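
Concretely, the induced definitions are the standard ones:

$latex \lVert x \rVert = \sqrt{\langle x, x \rangle} \;\; \text{(norm induced by the inner product)}, \qquad d(x,y) = \lVert x - y \rVert \;\; \text{(metric induced by the norm)} &s=2$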

 

A Cauchy sequence is a sequence whose elements become arbitrarily close to each other as the sequence progresses [15]. The definition [Appendix in 16] is that in a vector space $latex V$, for a Cauchy sequence of vectors $latex v_1, v_2, \cdots$ and any $latex \epsilon > 0$, there is a corresponding $latex k$ such that $latex \lVert v_i - v_j \rVert < \epsilon$ for all $latex i,j \geq k$. A normed vector space $latex V$ is complete if every Cauchy sequence in $latex V$ is convergent in $latex V$, i.e., the limit point of every Cauchy sequence is also in $latex V$. A complete normed vector space is called a Banach space [24]. A complete inner product vector space is a Hilbert space [17]. An incomplete inner product vector space is called a pre-Hilbert space. Every Euclidean space is finite-dimensional and a Hilbert space (because the dot product is defined on Euclidean space and Euclidean space is complete). However, a Hilbert space is not necessarily a Euclidean space, since its dimension can be infinite. A famous example of a Hilbert space with infinitely many dimensions is $latex \ell^2$:

Screenshot from 2017-02-25 20:07:18

Screenshot from 2017-02-25 20:07:37

 

Although we just said that the sequence space $latex \ell^2$ has infinitely many dimensions, we have not yet defined the dimension of a vector space [25]:

Screenshot from 2017-02-25 22:16:22

Screenshot from 2017-02-25 22:16:39

 

 

Now let’s look at homomorphism:

Screenshot from 2017-02-23 08:47:24

Note that the map is between two structures of the same type. That means the map preserves some properties of the two structures, hence the name: homo (same) + morphism (shape, form).

By convention, linear transformation of vector spaces is called vector space homomorphism [18]:

Screenshot from 2017-02-23 21:31:50

 

For most algebraic structures, an isomorphism is a homomorphism whose mapping is bijective (invertible, with one-to-one correspondence) [7]. That means, after the mapping or its inverse, an isomorphism preserves the relationships between elements of the two sets. For example [8]:

Screenshot from 2017-02-23 10:00:50

Knowing that two structures are isomorphic has several advantages, the most prominent being that one can understand a more complicated set by studying a simpler set isomorphic to it [8, 9].

 

 

 

There are several types of mappings in abstract algebra.

injective:

Screenshot from 2017-02-24 11:42:52

surjective:

Screenshot from 2017-02-24 11:43:01

and bijective:

Screenshot from 2017-02-24 11:43:07

 

A linear operator is [25]:  

 1

2

A linear functional $latex f$ is a linear operator with its domain as a vector space over a field $latex F$ and its range as scalars of $latex F$ [20]:

Screenshot from 2017-02-24 18:43:30

The set of all linear functionals that map from $latex V$ to $latex F$ forms a vector space $latex V^*$ called the dual space, again with two operations, addition and scalar multiplication [19]:

Screenshot from 2017-02-24 18:38:37

There is also a double dual space, $latex V^{**}$, which is the dual space of $latex V^*$. This concept is quite obscure. My understanding is that $latex V^*$ contains a set of linear functionals, and $latex V^{**}$ contains a set of linear functionals of linear functionals. If we suppose there is a linear functional $latex f \in V^*$ and a linear functional $latex \Phi \in V^{**}$, then $latex \Phi(f)$ should return a scalar. In a generative way, you can think that every $latex \Phi \in V^{**}$ is created by first selecting a $latex v \in V$ and then letting $latex \Phi(f)=f(v)$. In “$latex f(v)$”, treat $latex f$ as a variable and $latex \cdot(v)$ as the linear functional $latex \Phi$. More details can be found in [23].
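
In symbols, this generative view is the canonical map into the double dual:

$latex \Lambda: V \to V^{**}, \qquad \Lambda(v) = \Phi_v \;\; \text{with} \;\; \Phi_v(f) = f(v), \;\; \forall f \in V^* &s=2$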

I was surprised to learn that linear operators (and linear functionals) actually have norms. The definition of the norm of a linear operator is [25]:

 Screenshot from 2017-02-25 19:36:14

Screenshot from 2017-02-25 19:36:45

 

 

A non-degenerate bilinear mapping is [21]:

Screenshot from 2017-02-24 22:48:26

 

Reference

[1] groups_rings_fields_vector_spaces from: http://people.csail.mit.edu/madhu/ST12/scribe/lect03.pdf

[2] https://en.wikipedia.org/wiki/Group_(mathematics)#Generalizations

[3] https://en.wikipedia.org/wiki/Field_(mathematics)#Related_algebraic_structures

[4] http://math.stackexchange.com/a/730768/235140

[5] http://math.stackexchange.com/a/80/235140

[6] http://math.stackexchange.com/a/1425663/235140

[7] http://aleph0.clarku.edu/~ma130/isomorphism.pdf

[8] https://www.britannica.com/topic/isomorphism-mathematics

[9] http://math.stackexchange.com/questions/441758/what-does-isomorphic-mean-in-linear-algebra

[10] http://mathworld.wolfram.com/VectorSpace.html

[11] https://en.wikipedia.org/wiki/Algebra_over_a_field

[12] https://en.wikipedia.org/wiki/Normed_vector_space

[13] https://en.wikipedia.org/wiki/Inner_product_space

[14] http://www.linear.axler.net/InnerProduct.pdf

[15] https://en.wikipedia.org/wiki/Cauchy_sequence

[16] http://www.cs.columbia.edu/~risi/notes/tutorial6772.pdf

[17] https://en.wikipedia.org/wiki/Hilbert_space

[18] http://math.stackexchange.com/a/29962/235140

[19] https://en.wikipedia.org/wiki/Dual_space

[20] https://en.wikipedia.org/wiki/Linear_form

[21] https://en.wikipedia.org/wiki/Degenerate_bilinear_form

[22] http://math.stackexchange.com/questions/49733/prove-in-full-detail-that-the-set-is-a-vector-space

[23] http://math.stackexchange.com/questions/170481/motivating-to-understand-double-dual-space

[24] https://en.wikipedia.org/wiki/Banach_space

[25] https://www.dropbox.com/s/3tvpl5iiq50pu6h/Kreyszig%20-%20Introductory%20Functional%20Analysis%20with%20Applications.pdf?dl=0

[26] http://people.math.gatech.edu/~heil/handouts/appendixa.pdf

[27] http://math.stackexchange.com/questions/617202/metric-space-normed-space-and-inner-product-space-hierarcy

When the A* algorithm returns an optimal solution

Dijkstra’s algorithm is a well-known algorithm for finding the exact distance from a source to a destination. To improve path-finding speed, the A* algorithm combines heuristics and known distances to find the heuristically best path towards a goal. A common A* implementation maintains an open set for discovered but not-yet-evaluated nodes and a closed set for already evaluated nodes.

Here is the pseudo-code of A* (copied from Wikipedia):

function A*(start, goal)
    // The set of nodes already evaluated.
    closedSet := {}
    // The set of currently discovered nodes that are not evaluated yet.
    // Initially, only the start node is known.
    openSet := {start}
    // For each node, which node it can most efficiently be reached from.
    // If a node can be reached from many nodes, cameFrom will eventually contain the
    // most efficient previous step.
    cameFrom := the empty map

    // For each node, the cost of getting from the start node to that node.
    gScore := map with default value of Infinity
    // The cost of going from start to start is zero.
    gScore[start] := 0 
    // For each node, the total cost of getting from the start node to the goal
    // by passing by that node. That value is partly known, partly heuristic.
    fScore := map with default value of Infinity
    // For the first node, that value is completely heuristic.
    fScore[start] := heuristic_cost_estimate(start, goal)

    while openSet is not empty
        current := the node in openSet having the lowest fScore[] value
        if current = goal
            return reconstruct_path(cameFrom, current)

        openSet.Remove(current)
        closedSet.Add(current)
        for each neighbor of current
            if neighbor in closedSet
                continue		// Ignore the neighbor which is already evaluated.
            // The distance from start to a neighbor
            tentative_gScore := gScore[current] + dist_between(current, neighbor)
            if neighbor not in openSet	// Discover a new node
                openSet.Add(neighbor)
            else if tentative_gScore >= gScore[neighbor]
                continue		// This is not a better path.

            // This path is the best until now. Record it!
            cameFrom[neighbor] := current
            gScore[neighbor] := tentative_gScore
            fScore[neighbor] := gScore[neighbor] + heuristic_cost_estimate(neighbor, goal)

    return failure

function reconstruct_path(cameFrom, current)
    total_path := [current]
    while current in cameFrom.Keys:
        current := cameFrom[current]
        total_path.append(current)
    return total_path
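
For reference, here is a compact, runnable Python translation of the closed-set version; it is a sketch assuming the caller supplies neighbors, dist and h functions and that nodes are comparable:

import heapq

def a_star(start, goal, neighbors, dist, h):
    """A* with a closed set. neighbors(n) yields adjacent nodes,
    dist(u, v) is the edge length, h(n) is the heuristic to the goal."""
    open_heap = [(h(start), start)]  # entries are (fScore, node)
    came_from = {}
    g = {start: 0}
    closed = set()
    while open_heap:
        f, current = heapq.heappop(open_heap)
        if current == goal:
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return list(reversed(path))
        if current in closed:
            continue  # skip stale heap entries
        closed.add(current)
        for nb in neighbors(current):
            if nb in closed:
                continue
            tentative_g = g[current] + dist(current, nb)
            if tentative_g < g.get(nb, float("inf")):
                came_from[nb] = current
                g[nb] = tentative_g
                heapq.heappush(open_heap, (tentative_g + h(nb), nb))
    return None  # no path found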

 

In the code, the closed set prevents evaluated nodes from being evaluated again. One may think about not using the closed set at all, because it is not a big deal if we allow a node to be added to the open set, or its $latex f$ value to be updated, multiple times. Anyway, we only pick the node with the least $latex f$ value, right? Wrong. We can only safely drop the closed set if “new nodes are added to the open set only if they have a lower $latex f$ value than at any previous iteration” (from Wikipedia). Without such a mechanism, A* will never terminate in some extreme cases. Here is the version with the closed set dropped:

function A*(start, goal)

    // The set of currently discovered nodes that are not evaluated yet.
    // Initially, only the start node is known.
    openSet := {start}
    // For each node, which node it can most efficiently be reached from.
    // If a node can be reached from many nodes, cameFrom will eventually contain the
    // most efficient previous step.
    cameFrom := the empty map

    // For each node, the cost of getting from the start node to that node.
    gScore := map with default value of Infinity
    // The cost of going from start to start is zero.
    gScore[start] := 0 
    // For each node, the total cost of getting from the start node to the goal
    // by passing by that node. That value is partly known, partly heuristic.
    fScore := map with default value of Infinity
    // For the first node, that value is completely heuristic.
    fScore[start] := heuristic_cost_estimate(start, goal)

    while openSet is not empty
        current := the node in openSet having the lowest fScore[] value
        if current = goal
            return reconstruct_path(cameFrom, current)

        openSet.Remove(current)

        for each neighbor of current
            // The distance from start to a neighbor
            tentative_gScore := gScore[current] + dist_between(current, neighbor)
            if neighbor not in openSet	// Discover a new node
                openSet.Add(neighbor)
            else if tentative_gScore >= gScore[neighbor]
                continue		// This is not a better path.

            // This path is the best until now. Record it!
            cameFrom[neighbor] := current
            gScore[neighbor] := tentative_gScore
            fScore[neighbor] := gScore[neighbor] + heuristic_cost_estimate(neighbor, goal)

    return failure

function reconstruct_path(cameFrom, current)
    total_path := [current]
    while current in cameFrom.Keys:
        current := cameFrom[current]
        total_path.append(current)
    return total_path

 

And here is an example where using neither the closed set nor the memory of previous $latex f$ values causes A* to never terminate:

astarloop (2)

The shortest path is supposed to be A->B->C->D->G. Starting from A:

Pick A from the open set. Update B’s $latex f$ value as $latex f(B)=g(A) + d(A,B) + h(B) = 0 + 1 + 1 = 2$. Update B’s $latex g$ value as $latex g(B)=g(A)+d(A,B)=0+1=1$. Add B to the open set.

Pick B from the open set. Update C’s $latex f$ value as $latex f(C)=g(B)+d(B,C)+h(C)=1+1+1=3$. Update C’s $latex g$ value as $latex g(C)=g(B)+d(B,C)=1+1=2$. Add C to the open set.

Pick C from the open set. Update D’s $latex f$ value as $latex f(D)=g(C)+d(C,D)+h(D)=2+1+100=103$. Add D to the open set. Since there is no closed set, the algorithm is ignorant of the fact that A has been checked. Therefore, the algorithm continues to update $latex f(A)=g(C)+d(C,A)+h(A)=2+1+1=4$. Because there is no memory recording what $latex A$’s previous $latex f$ values are, the algorithm stupidly assigns $latex f(A)=4$, a larger $latex f$ value than before, to A. A is then added to the open set. $latex g(A)$ and $latex g(D)$ will be updated similarly as before.

Now, pick the next node from the open set. A will be picked because it has the least $latex f$ value ($latex f(A)<f(D)$). Then the loop goes on from A to B to C and then to A again. Note that every time you reach C, $latex f(D)$ and $latex f(A)$ increase by $latex g(C) - g(C)_{old}$. A will always be favored over D because $latex h(A) < h(D)$. As a result, A* will never return.

 

Now, suppose both versions of A* (with or without the closed set) work thanks to proper implementation. Let’s see when A* returns the optimal distance.

For the version without the closed set, we claim A* is optimal when the heuristic function is admissible, i.e., $latex h(n)$ is never overestimated: $latex h(n) \leq h^*(n)$ for all nodes, where $latex h^*(n)$ is the shortest distance from n to the goal.

We can prove it as follows. Say the goal node is G and the start node is A. The optimal path from A to G has distance $latex d^*$. If a non-optimal path propagates to G with $latex f(G)=d_{fake}$, then we must have $latex d^* < d_{fake}$. Since no heuristic value is overestimated, any unexpanded node on the optimal path has an $latex f$ value of at most $latex d^*$, which is smaller than $latex d_{fake}$, so such nodes get picked from the open set before G. Until the optimal path propagates its distance to G, G will never be picked from the open set with its non-optimal $latex f$ value.

For the version with the closed set, we claim A* is optimal when the heuristic function is both admissible and monotonic. A monotonic heuristic function means that for any pair of adjacent nodes $latex x$ and $latex y$, where $latex d(x,y)$ denotes the length of the edge between them, we must have $latex h(x) \leq d(x,y) + h(y)$. I don’t have a good proof of why the heuristic function must also be monotonic, but I can give an example where A* does not return the optimal distance when monotonicity is not satisfied.

negexample

 

Since $latex h(A) \leq h^*(A)$, $latex h(B) \leq h^*(B)$, $latex h(C) \leq h^*(C)$ and $latex h(D) \leq h^*(D)$, admissibility is satisfied. However, $latex h(B) > d(B,C)+h(C)$; therefore monotonicity is not satisfied. Now let’s see how the break of monotonicity makes A* return a non-optimal distance.

We first pick A as it is the start point. Add B and C to the open set, with $latex f(B)=g(B)+h(B)=1+100=101$ and  $latex f(C)=g(C)+h(C)=1+30=31$.

Then, we pick C from the open set because $latex f(C)<f(B)$, evaluate C, and finally add C to the closed set. During the evaluation, D is added to the open set, with $latex f(D)=g(C)+d(C,D)+h(D)=1+5+90=96$ and $latex g(D)=g(C)+d(C,D)=1+5=6$. At this moment, B is not in the closed set. Since $latex f'(B)=g(C)+d(C,B)+h(B)=1+1+100=102$, which is greater than the current $latex f(B)=101$, we do not update $latex f(B)$.

Then, since $latex f(D)<f(B)$, we pick D from the open set and evaluate it. G and B are D’s neighbors. $latex f(G)=g(D)+d(D,G)=6+96=102$ and $latex f'(B)=g(D)+d(D,B)+h(B)=6+4+100=110$. Since $latex f'(B) > f(B)$, we do not update $latex f(B)$. We add D to the closed set.

Next, since $latex f(B) < f(G)$, we process B. However, B’s neighbors C and D are both in the closed set, so we do nothing.

Finally, we pick G, which turns out to be the goal. So we return its $latex f$ value 102, which is not optimal. The optimal distance is 101, achieved by the path A->B->D->G.

 

Actually, monotonicity implies admissibility when $latex h(G)=0$. Suppose for a node $latex v_0$, the shortest distance $latex h^*(v_0)$ from $latex v_0$ to $latex G$ is achieved by the path $latex v_0, v_1, v_2, \cdots, v_n, G$. Then we can show that $latex h(v_0) \leq h^*(v_0)$:

$latex h(v_0) \leq d(v_0, v_1) + h(v_1) \leq d(v_0, v_1) + d(v_1, v_2) + h(v_2) \leq \cdots \leq \sum\limits_{i=0}^{n}d(v_i, v_{i+1}) + h(G) = h^*(v_0)$, where we write $latex v_{n+1}=G$.

 

Reference:

http://cs.stackexchange.com/questions/16065/how-does-an-admissible-heuristic-ensure-an-optimal-solution

http://cs.stackexchange.com/questions/30778/why-is-the-a-search-heuristic-optimal-even-if-it-underestimates-costs?rq=1

http://www.algostructure.com/search/a-star.php

 

Install TensorFlow 0.12 with GPU support on an AWS p2 instance

# for connection and file transfer

ssh -i ~/Dropbox/research/aws_noisemodel_keypair.pem ubuntu@ec2-54-164-130-227.compute-1.amazonaws.com

rsync --progress --delete -rave "ssh -i /home/czxttkl/Dropbox/research/aws_noisemodel_keypair.pem" /home/czxttkl/workspace/mymachinelearning/Python/LoLSynergyCounter ubuntu@ec2-54-164-130-227.compute-1.amazonaws.com:~/

sudo apt-get install python-pip python-dev
pip install tensorflow-gpu

 

Download and transfer the CUDA toolkit, then install:

sudo dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda

 

Download and transfer cuDNN, then install:

tar xvzf cudnn-<your-version>.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

reference: https://github.com/tensorflow/tensorflow/issues/5591

 

Other scientific modules possibly needed:

sudo pip install gensim numpy scipy scikit-learn pandas seaborn
sudo apt-get install python-tk

 

Append the following to ~/.bashrc and run source ~/.bashrc:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda

reference: https://www.tensorflow.org/versions/r0.11/get_started/os_setup#optional_linux_enable_gpu_support
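
To confirm that TensorFlow actually sees the GPU, here is a quick sanity check using the TF 0.12-era session API; with a working setup, the session log should mention a device like /gpu:0:

import tensorflow as tf

# log_device_placement prints where each op runs; look for "/gpu:0"
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
a = tf.constant([1.0, 2.0], name="a")
b = tf.constant([3.0, 4.0], name="b")
print(sess.run(a + b))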

 

You can also apply some p2-instance optimizations specific to GPU computation:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/accelerated-computing-instances.html#optimize_gpu