czxttkl – Page 12

Inverse Reinforcement Learning

In my rough understanding, inverse reinforcement learning is a branch of RL research in which people try to perform state-action sequences resembling given tutor sequences.

There are two famous works on inverse reinforcement learning. One is Apprenticeship Learning via Inverse Reinforcement Learning [1], and the other is Maximum Margin Planning [2].

Maximum Margin Planning

In MMP learn a cost function for estimating the cost at each state-action pair. The desired cost function should emit low costs to those state-action pairs appearing in “tutor” sequences (examples) and return high costs to those deviating tutor sequences. If we manage to recover a desired cost function, we can assign the cost for any state-action pair and use path-finding algorithms (e.g., A*, Dijkstra, etc.) to find a sequence on a new environment which has the minimum accumulative sum of costs, that is, a sequence deemed resembling given tutor sequences.

The baisc idea is given in [4]:

Screenshot from 2017-03-29 18:36:37

As you see, in line 6, you need to adjust cost function to increase on the current “not-perfect” minimum cost path and decrease the example path. However, you can see that the algorithm doesn’t know how much it should increase the cost for minimum cost path. If one path just deviates from the example path a little bit and another path deviates far more abnormally, how should the adjustment of the cost function reflect that?

So the author proposes an improvement called “loss augmentation”. Suppose the original cost at a state-action is $c^{sa}$ , then we say its augmented cost $\tilde{c}^{sa} = c^{sa}-l^{sa}$ , where $l^{sa}$ is a loss function with large values for those deviated state-action pairs and with close-to-zero values for those resembling tutor sequences. Why do we want the augmented cost to be even smaller for those defiant state-action pairs? Because in this way, we make those deviated pairs even harder to be distinguished with the desired pairs. Therefore, we force the algorithm to try hard to update the cost function until it can differentiate augmented costs from different state-action pairs. In other words, the algorithm will learn the cost function such that it differentiates any a desired state-action pair $(s,a)$ with a less desired state-action pair $(s',a')$ by at least a margin $l^{s'a'}$ .

Here is the visualization showing how deviated paths get high rewards (+1) and desired paths get low rewards (-1).

Screenshot from 2017-03-29 21:42:24

And here is the revised algorithm based on Algorithm 1:

Screenshot from 2017-03-29 21:42:30

Now, let’s look at how to learn the cost function using a framework called maximum margin structured classification.

Before that, we introduce preliminaries. In this paper, we are given $N$ examples, each example is a MDP and a tutor sequence. Different examples may come from different MDPs so this model can learn the cost function generalized to new MDPs which are never seen in the training examples. However, for simplicity, we just assume all training examples are different tutor sequences from the same MDP with state $\mathcal{S}$ and $\mathcal{A}$ . For simplicity purpose also, an important assumption made throughout [4] is that a tutor sequence is acyclic. This means, any state-action pair is visited at most once in a tutor sequence. Then, it is easy to represent a tutor sequence by $\mu \in \{0,1\}^{\mathcal{S}\mathcal{A}}$ , a state-action frequency count vector. If the assumption about tutor sequence being acyclic were not made, $\mu \in \mathbb{R}_{+}^{\mathcal{S}\mathcal{A}}$ which satisfies certain flow constraints.

Now remember that the idea is to approximate everything linearly, such as the loss function or the cost function. Each training example has a loss function, which quantifies the difference between an arbitrary sequence and $i$ -th example tutor sequence $\mu_i$ . Since we want loss function to be a linear function of $\mu$ , we let $\mathcal{L}_i(\mu)=l_i^T \mu$ . The more $\mu$ deviates from $\mu_i$ , the higher $\mathcal{L}_i(\mu)$ should be. A straightforward way to define $l_i$ is to make $l_i \in \{0,1\}^{\mathcal{S}\mathcal{A}}$ , with 0 denoting a particular state-action is visited by both or neither of $\mu$ and $\mu_i$ and 1 otherwise. Of course more advanced loss function can be used but the bottom line is to assume $\mathcal{L}_i(\mu) \geq 0$ . Now as for cost function, cost function should also factor over states and actions. This means, the cost of a sequence should be the sum of the costs at each state-action pairs visited in the sequence. Suppose we have a feature matrix $F \in \mathbb{R}^{d \times \mathcal{S} \times \mathcal{A}}$ , whose columns are feature vectors at every state-action pair. We also suppose the cost function has a linear weight $w \in \mathbb{R}^{d}$ . Then, the cost at one state-action pair is just $w^T F_j$ where $F_j$ is the column corresponding to the state-action pair. Finally, the cost of the whole sequence is $w^T F \mu$ . You can think $F \mu$ as the accumulated sum of feature vectors on visited state-action pairs.

Since we assume all training examples come from the same MDP, we can write the data set as $\mathcal{D}=\{(F, \mu_i, l_i)\}^N_{i=1}$ .

Now, the problem is how to obtain $w$ . Recall that we want $w$ to be able to differentiate desired paths ( $\mu_i$ from $\mathcal{D}$ ) from tentative paths (arbitrary $\mu \in \mathcal{G}$ , where $\mathcal{G}$ is all possible sequence in the MDP) by a margin $\mathcal{L}_i(\mu)=l_i^T \mu$ . Therefore, the constraints we would like to impose are: for each $\mu_i$ , for each $\mu \in \mathcal{G}$ , $w^T F \mu_i \leq w^T F \mu - l_i ^T \mu$ . This actually introduces a large number of constraints when $\mathcal{G}$ space is large. Therefore, we reduce the number of constraints by rewriting as: for each $\mu_i$ , $w^T F \mu_i \leq \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\}$ . Intuitively, this means if $w$ can successfully distinguish $\mu_i$ from the loss-augmented cost of the hardest tentative path $\mu \in \mathcal{G}$ , then it can distinguish from any other tentative paths. Also, we would like to regularize on the norm of $w$ because if $w$ were without norm regularization, it can always scale arbitrarily large to overcome margin difference constraints, voiding the original purpose of adding margins. Moreover, we want to introduce a slack term $\xi_i \geq 0$ for each training example. This allows some margin difference constraints to not be satisfied because in real data you may not find a $w$ which perfectly satisfies all margin difference constraints.

Putting all we describe above, we come to the optimization objective:

$\min\limits_{w \in \mathcal{W}, \xi_i \in \mathbb{R}_+} \frac{1}{N} \sum\limits_{i=1}^N \xi_i + \frac{\lambda}{2} \left\Vert w \right\Vert^2 \newline \forall i, w^T F \mu \leq \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\} + \xi_i &s=2$

We note that $\xi_i$ resides in both the objective function and the constraints. We want $\xi_i$ as small as possible. Therefore, at the minimizer the slack variables will always exactly equal the constraint violation: $\xi_i = w^T F \mu_i - \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\}$ . Therefore, we can rewrite our constrained objective function into another non-constrained objective function:

$R(w) = \frac{1}{N} \sum\limits^N_{i=1} \big( w^T F \mu_i - \min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\} \big) + \frac{\lambda}{2} \left\Vert w \right\Vert^2 &s=2$

Now, $R(w)$ is a non-differentiable (because of $\min$ ), convex objective. The author in [4] proposes to optimize using subgradient method.

In my mind, subgradient method is really similar to normal gradient descend method. The gradient $\nabla f(x)$ of a differentiable convex function $f(x)$ at point $x$ satisfies: $f(y) \geq f(x) + \nabla f(x)^T (y-x), \forall y \in \text{dom }f$ . This means the growth of line $\nabla f(x)^T (y-x)$ from point $x$ is always equal or slower than $f(y)$ [7]:

Screenshot from 2017-03-30 09:24:17

Subgradient equal to gradient at the part being differentiable. At the non-differentiable part, subgradient $g$ can be any values such that $\{g|g^T(y-x) \leq f(y) - f(x) \}, \forall y \in \text{dom }f$ :

Screenshot from 2017-03-30 09:41:50

Now let’s look at the part $-\min\limits_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu\}$ in $R(w)$ . This part can be rewritten as $\max\limits_{\mu \in \mathcal{G}} \{-w^T F \mu + l_i^T \mu\}$ . This can be seen as a max over affine functions $\{-w^T F \mu + l_i^T \mu\}, \forall \mu \in \mathcal{G}$ , each shown as dashed line the figure below: Screenshot from 2017-03-30 09:46:16

The bold line is $\max\limits_{\mu \in \mathcal{G}} \{-w^T F \mu + l_i^T \mu\}$ . The red line is the subgradient at specific value of $w$ marked by the vertical dashed line. Therefore, the subgradient of $R(w)$ is:

$\nabla R(w) = \frac{1}{N} \sum\limits_{i=1}^N F(\mu_i - \mu_i^*) + \lambda w &s=2$

where $\mu_i^* = argmin_{\mu \in \mathcal{G}} \{w^T F \mu - l_i^T \mu \}$ .

Now, $w$ can be updated online, i.e., being updated after seeing example one by one:

$w_{t+1} = \mathcal{P}_{\mathcal{W}} [ w_t - \alpha (F(\mu_i - \mu_i^*) + \lambda w_t)] &s=2$

where $\mathcal{P}_{\mathcal{W}} ( \cdot)$ is projection operator that maps $w$ to feasible region. For example, sometimes we even require all costs should be positive. How to do projection is something I don’t fully understand in [4]. I hope I will illustrate once I am crystal clear about that.

Reference

[1] Apprenticeship Learning via Inverse Reinforcement Learning

[2] Maximum Margin Planning

[3] Boosting Structured Prediction for Imitation Learning

[4] Learning to Search: Structured Prediction Techniques for Imitation Learning (Ratliff’s thesis)

[5] Max Margin Learning (Youtube video)

[6] Max Margin Planning Slides

[7] Subgradient slides

2018.4.8

Seems like this is a good tutorial (https://thinkingwires.com/posts/2018-02-13-irl-tutorial-1.html) for explaining [8]

[8] Algorithms for Inverse Reinforcement Learning

Reinforcement learning overview

Here are some materials I found useful to learn Reinforcement Learning (RL).

Let’s first look at Markov Decision Process (MDP), in which you know a transition function $latex T(s,a,s’)$ and a reward function $latex R(s,a,s’)$. In the diagram below, the green state is called “q state”.

Screenshot from 2017-03-05 22:45:24

Some notations that need to be clarified:

Dynamic programming refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP). Before we introduce reinforcement learning, all the algorithms are within the dynamic programming family, such as value iteration, policy evaluation/iteration.

$latex V^*(s)=$expected utility starting in $latex s$ and acting optimally

$latex Q^*(s,a)=$expected utility starting out having taken action $latex a$ from state $latex s$ and thereafter acting optimally

$latex \pi^*(s)=$optimal action from state $latex s$

By definition:

$latex V^*(s)=max_a Q^*(s,a)=max_a \sum\limits_{s’}T(s,a,s’)[R(s,a,s’)+\gamma V^*(s’)] \newline Q^*(s,a)=\sum\limits_{s’}T(s,a,s’)[R(s,a,s’)+\gamma V^*(s’)]$

The second equation is called Bellman Equation. $latex \gamma$ is the discounting factor, meaning that the true value of a state not only depends on the accumulated absolute value of rewards but also on how early we get reward. The earlier we can get reward, the better. Also, if there is no $latex \gamma$, we may have an infinitely large value of a state. If you have $latex N$ states and known $latex T(s,a,s’)$ and $latex R(s,a,s’)$, you can use $latex N$ equations of $latex V^*(s)$ and a non-linear solver to get $latex V^*(s)$. However this may be too complex in practice. Instead, you can learn $latex V^*(s)$ in an iterative way, which is called value iteration:

$latex V_{k+1}^{*}(s) \leftarrow max_a \sum\limits_{s’}T(s,a,s’)[R(s,a,s’)+\gamma V^{*}_k(s’)] $

However note that each iteration takes $latex O(S^2A)$ time, where $latex S$ and $latex A$ are the number of states and actions respectively.

For a given policy $latex \pi$, you can also evaluate the values of states under that policy. In general such process is called policy evaluation:

$latex V^{\pi}(s) \leftarrow \sum\limits_{s’} T(s, \pi(s), s’)[R(s,\pi(s), s’)+\gamma V^{\pi} (s’)] \newline V^{\pi}_{k+1}(s) \leftarrow \sum\limits_{s’} T(s,\pi(s), s’)[R(s,\pi(s), s’)+\gamma V_k^{\pi} (s’)]$

Note that, in policy evaluation of $latex V^{\pi}(s)$, there is no max function in it. In value iteration for $latex V^{*}(s)$, there is max function. For policy evaluation, you can use a linear solver to get $latex V^{\pi}(s)$ (first equation) because there is no max function in the equation. Or you can use value iteration (second equation). If using a small $latex \gamma$, iterative update might be fast because a small $latex \gamma$ usually leads to a quick convergence.

For any given policy $latex \pi$, you can iterate it to find an optimal policy as well as getting $latex V^*(s)$ too. Such process is called policy iteration, which follows two steps:

fix the current policy $latex \pi_i$, iterate until values converge: $latex V^{\pi_i}_{k+1}(s) \leftarrow \sum\limits_{s’} T(s, \pi_i(s), s’) [R(s, \pi_i(s), s’) + \gamma V^{\pi_i}_k(s’) ]$
fix the current values, get a better policy by one-step looking ahead: $latex \pi_{i+1}(s)=argmax_a \sum\limits_{s’}T(s,a,s’)[R(s,a,s’)+\gamma V^{\pi_i}_{k+1}(s’)]$.

Value iteration and policy iteration suffer some speed issues in different perspectives. Remember that value iteration on $latex V^{*}(s)$ is slow because each iteration takes $latex O(S^2A)$ time. As comparison, step 1 of policy iteration only considers one action $latex \pi_i(s)$ and related next states $latex s’$. However, you need to wait for all values to converge in step 1 in policy iteration, which also takes time.

All the discussions above happen in the context of MDP. That means, $latex T(s,a,s’)$ and $latex R(s,a,s’)$ are known to you. All the methods we introduced above belong to Dynamic Programming, or called model based methods. Also note that in all these methods, they update estimates on the basis of other estimates. We call this general idea bootstrapping.

In the reinforcement learning we will talk about next, you may not know $latex T(s,a,s’)$ or $latex R(s,a,s’)$. So sometimes they are called model free methods. Many reinforcement learning algorithms also use bootstrapping.

Let’s first talk about a passive reinforcement learning scheme called temporal difference learning. It only estimates $latex V^{\pi}(s)$ for each state given a specific policy $latex \pi$. It is called “passive” reinforcement learning because you are given a fixed policy and you don’t need to learn the optimal policy and only need to evaluate the policy. However, merely knowing $latex V^{\pi}(s)$ does not directly suggest the optimal action for you under the policy $latex \pi$. Thus, there are two kinds of problems all variants of reinforcement learning solve: prediction problem, estimating the value function for a given policy; control problem, finding an optimal policy, the policy maximizing accumulated rewards when traversing through states.

In reinforcement learning, model-free value-iteration and model-free policy-iteration using q-functions are Q-learning and SARSA, respectively [6].

Screenshot from 2017-08-13 20-26-57

Here is an example using Q learning. Each cell is a state and arrows of all the cells denote a fixed, given policy $latex \pi$. Reward is only given at the top right corner (+1) and the cell below it (-1). We can see after 100 iterations, each cell (state) learns $latex v^{\pi}(s)$. Now, why the cell on the left of +1 cell is 0.85, not 1.00? This is due to the two reasons: 1. the discounting factor $latex \gamma$ makes the $latex V^{\pi}(s)$ approach but never be 1.00; 2. the transition is stochastic. So the transition probability $latex p(s’|s, a)$ is not perfect 1 but could have many possible $latex s’$. For example, at the cell 0.85, even the policy suggests to take right, it has certain chance to go to the cell below, which might lead to the undesirable state. Therefore, this state should not be perfect +1.00.

We can compare to the result of Q-learning, which has the update rule: $latex Q(s,a) \leftarrow (1-\alpha)Q(s,a)+\alpha \cdot [R(s,a,s’) + \gamma max_{a’}Q(s’,a’)]$.

Screenshot from 2017-03-05 23:54:52

First, note that on the same cell that is on the left of +1 cell, $latex q(s, RIGHT)=0.85$, which is equal to the same cell in the TD learning. This is because the provided policy in TD learning also suggest to take RIGHT. Second, why the cell on the left of -1 cell has $latex q(s, RIGHT)=-0.6$ instead of $latex -0.85$? This is because if an agent stands in the cell, takes RIGHT, but does not land on -1 cell, it will be less likely to get into -1 cell again. However, imagine another agent which stands on the left of +1 cell and takes RIGHT, if he fails to get into +1 cell, he might still stay in the same cell because going UP isn’t really an option, making him ready to get into +1 cell again.

Q-learning is called off-policy learning because what exploration policy does not matter as long as the exploration policy makes you experience all state-action pairs and you set a proper learning rate $latex \alpha$. For example, you are using a very poor exploration policy, and starting from $latex s’$, you take very poor action $latex a’$. However, $latex max_{a’}Q(s’, a’)$ might still be large enough to improve $latex Q(s,a)$. The difference between Q-learning and SARSA, and generally between off-policy and on-policy, can be seen in [4].

An improved Q-learning technique is called Dyna Q-learning, which achieves model learning, direct reinforcement learning and planning (indirect reinforcement learning) in the same algorithm. Model learning means learning a model that produces a prediction of next state and next reward given the current state and an action. In Dyna Q-learning, the model is so trivial – a table recording the last-observed $latex R_{t+1}, S_{t+1}$ for each $latex S_t, A_t$ (Step e). Direct reinforcement learning means the direct update of Q values just as in a normal Q-learning (Step d). Planning means additionally updating Q-values based on simulated experience on the model (Step f).

Screenshot from 2017-03-08 19:59:41

Without planning part, Dyna Q-learning downgrades to a normal Q-learning. The (state-space) planning can be depicted as:

Screenshot from 2017-03-08 20:06:04 Because Dyna Q-learning uses additional (simulated) experience to update Q-values, it converges the optimal policy in fewer iterations. Here is an example from Chapter 8.2 in [5].

Screenshot from 2017-03-08 20:08:56

Currently, I only know Dyna Q-learning is a tabular method, meaning $latex S$, $latex A$ need to be finite, discrete so that we can have an entry for each $latex S, A$ pair.

Now, here is a big roadmap of RL and MDPs:

Screenshot from 2017-03-06 23:26:03

When MDP is known, we can use value iteration and policy iteration to compute $latex V^*$, $latex Q^*$ and $latex \pi^*$. When given a specific policy $latex \pi$, we can evaluate $latex V^{pi}(s)$.

When MDP is unknown, we need to use RL. For model-based learning, we need to build models to estimate $latex T(s,a,s’)$ and $latex R(s,a,s’)$ as an approximated MDP. After that, we can apply value iteration/policy iteration just as on a normal MDP. Similarly we can also do policy evaluation on the approximated MDP.

For model-free learning, we do not need to even know or model $latex T(s,a,s’)$ and $latex R(s,a,s’)$. Nevertheless, we still learn $latex V^*$, $latex Q^*$ and $latex \pi^*$.

The biggest challenge of a naive Q-learning that the size of state action pairs Q(s,a) increases quickly. Therefore, we can generalize state-action by using weights and features describing q states to approximate the learning.

Screenshot from 2017-03-07 00:30:44 Note in approximate Q-learning, you no longer maintain a table for recording $latex Q(s,a)$. So the only thing you need to do is to update weights.

[1] https://www.youtube.com/watch?v=ggqnxyjaKe4

[2] https://www.youtube.com/watch?v=Csiiv6WGzKM (lecture 8 ~ 11)

[3] http://web.engr.oregonstate.edu/~xfern/classes/cs434/slides/RL-1.pdf

[4] https://czxttkl.com/?p=1850

[5] https://webdocs.cs.ualberta.ca/~sutton/book/bookdraft2016sep.pdf

[6] Reinforcement learning and dynamic programming using function approximators: http://orbi.ulg.ac.be/bitstream/2268/27963/1/book-FA-RL-DP.pdf

Abstract Algebra

I am introducing some basic definitions of abstract algebra, structures like monoid, groups, rings, fields and vector spaces and homomorphism/isomorphism.

I find the clear definitions of structures from [1]:

Screenshot from 2017-02-22 20:48:07

Also, the tables below show a clear comparisons between several structures [2,3]:

Screenshot from 2017-02-22 20:50:49

Screenshot from 2017-02-22 21:07:59

All these structures are defined with both a set and operation(s). Based on [4], a monoid is an abstraction of addition. A group is an abstraction of addition and subtraction. A ring adds multiplication but not necessarily division. A field adds division. Therefore, ultimately, a field is the “biggest” set (compared to monoid, group and ring), on which are defined addition, subtraction, multiplication and division. Similar idea is also illustrated in [6].

As also pointed out by [6], since we are dealing with abstract algebra, calling operations with the names “addition, subtraction, multiplication, division” is just a convention. The operations are applied on sets that might be very different with real numbers or integers. They are analogous but not necessarily the same as what we understand for the addition, subtraction, multiplication and division on real numbers.

We can safely say that every ring is a group, and every field is a ring [5]:

Screenshot from 2017-02-22 21:39:24

A vector space over a field $latex F$ is a set $latex V$ with two operations, vector addition and scalar multiplication, satisfying the ten axioms listed below [10, 11, 22]:

PDcix

To me, Euclidean vector space is the most familiar vector space as an example.

Now, there are several subcategories of vector spaces. One kind is normed vector space [12], meaning that the norm of vectors is defined in the vector space. Another kind of vector space is called inner product vector space [13], which is a vector space with a mapping function called “inner product” that maps any pair of vectors (over field F) to a scalar in F. (Note that I used to think “inner product” and “dot product” are exchangeable in their meanings. Now I know that “inner product” is just a name of an operator that maps two vectors into a field. “dot product” is the name of “inner product” in Euclidean space.) More specially, an operator can be called “inner product” if it satisfies [26]:

Screenshot from 2017-02-25 21:47:47

The inner product between a vector and itself naturally induces the norm of it. Hence, an inner product vector space must be a normed vector space. BTW, along the search of inner product vector space, I found a good textbook [14]. Another kind vector space is called metric vector space, which means the distance between any two vectors $latex d(x,y), x,y \in V$ is defined. A normed vector space must be a metric vector space because the definition of metric can be induced from the definition of norm. (book page 59 in [25]). Bear in mind that a metric space may not be a normed space, and a normed space may not be an inner product space. You can have a basic intuition of how spaces from metric space, to normed space, to inner product space, become more and more powerful [27].

A Cauchy sequence is a sequence whose elements become arbitrarily close to each other as the sequence progresses [15]. The definition [Appendix in 16] is that in a vector space $latex V$, you have a Cauchy sequence of vectors $latex v_1, v_2, \cdots$ Then, for any $latex \epsilon > 0$ there is a corresponding $latex k$ such that $latex \lVert v_i – v_j \rVert < \epsilon$ for all $latex i,j \geq k$. A normed vector space $latex V$ is complete if every Cauchy sequence in $latex V$ is convergent in $latex V$, i.e., the limit point of every Cauchy sequence is also in $latex V$. A complete normed vector space is called Banach space [24]. A complete inner product vector space is Hilbert space [17]. An incomplete inner product vector space is called pre-Hilbert space. Every Euclidean space is finite and a Hilbert space (because dot product is defined in Euclidean space and Euclidean space is complete). However, a Hilbert space is not necessarily a Euclidean space since its dimension can be infinite. A famous example of Hilbert space with infinite dimensions is $latex \mathcal{l}^2$:

Screenshot from 2017-02-25 20:07:18

Screenshot from 2017-02-25 20:07:37

Although we just talked about sequence space $latex \ell^2$ has infinite dimensions, we have not defined what is dimension of a vector space [25]:

Screenshot from 2017-02-25 22:16:22

Now let’s look at homomorphism:

Screenshot from 2017-02-23 08:47:24

Note that the map is between two structures of the same type. That means, the map preserves some properties of the two structures, hence the name: homo (same) + morphism (shape,form).

By convention, linear transformation of vector spaces is called vector space homomorphism [18]:

Screenshot from 2017-02-23 21:31:50

For most algebra structures, isomorphism is a homomorphism equipped with a mapping that is bijective (has inverse and one-to-one correspondence) [7]. That means. after a mapping or the reverse of it, isomorphism preserves the relationships between elements of the two sets. For example [8]: Screenshot from 2017-02-23 10:00:50

Knowing two structures are isomorphic has several advantages, with the most prominent one being to establish the knowledge of a more complicated set by studying a simpler set which is isomorphic [8, 9].

There are several types of mappings in abstract algebra.

injective:

Screenshot from 2017-02-24 11:42:52

surjective:

Screenshot from 2017-02-24 11:43:01

and bijective:

Screenshot from 2017-02-24 11:43:07

A linear operator is [25]:

A linear functional $latex f$ is a linear operator with its domain as a vector space over a field $latex F$ and its range as scalars of $latex F$ [20]:

Screenshot from 2017-02-24 18:43:30

The set of all linear functionals that map from $latex V$ to $latex F$ form a vector space $latex V^*$ called dual space, with also two operations, addition and scalar multiplication [19]:

Screenshot from 2017-02-24 18:38:37

There is also a double dual space, $latex V^{**}$, which is the dual space of $latex V^*$. It is very obscure to understand this concept. My understanding is that $latex V^*$ contains a set of linear functionals, and $latex V^{**}$ contains a set of linear functionals of linear functionals. If we suppose there is a linear functional $latex f \in V^*$ and a linear functional $latex \Phi \in V^{**}$, then we want that $latex \Phi(f)$ should return a scalar. In a generative way, you can think that every $latex \Phi \in V^**$ is created by first selecting a $latex v \in V$, and let $latex \Phi(f)=f(v)$. In “$latex f(v)$”, treat $latex f$ as a variable and $latex \cdot(v)$ as the linear functional $latex \Phi$. More details can be found in [23].

I was surprised to learn that linear operators (and linear functionals) actually have norms. The definition of norms of linear operators is [25]:

Screenshot from 2017-02-25 19:36:14

Screenshot from 2017-02-25 19:36:45

Non degenerate bilinear mapping is [21]:

Screenshot from 2017-02-24 22:48:26

Reference

[1] groups_rings_fields_vector_spaces from: http://people.csail.mit.edu/madhu/ST12/scribe/lect03.pdf

[2] https://en.wikipedia.org/wiki/Group_(mathematics)#Generalizations

[3] https://en.wikipedia.org/wiki/Field_(mathematics)#Related_algebraic_structures

[4] http://math.stackexchange.com/a/730768/235140

[5] http://math.stackexchange.com/a/80/235140

[6] http://math.stackexchange.com/a/1425663/235140

[7] http://aleph0.clarku.edu/~ma130/isomorphism.pdf

[8] https://www.britannica.com/topic/isomorphism-mathematics

[9] http://math.stackexchange.com/questions/441758/what-does-isomorphic-mean-in-linear-algebra

[10] http://mathworld.wolfram.com/VectorSpace.html

[11] https://en.wikipedia.org/wiki/Algebra_over_a_field

[12] https://en.wikipedia.org/wiki/Normed_vector_space

[13] https://en.wikipedia.org/wiki/Inner_product_space

[14] http://www.linear.axler.net/InnerProduct.pdf

[15] https://en.wikipedia.org/wiki/Cauchy_sequence

[16] http://www.cs.columbia.edu/~risi/notes/tutorial6772.pdf

[17] https://en.wikipedia.org/wiki/Hilbert_space

[18] http://math.stackexchange.com/a/29962/235140

[19] https://en.wikipedia.org/wiki/Dual_space

[20] https://en.wikipedia.org/wiki/Linear_form

[21] https://en.wikipedia.org/wiki/Degenerate_bilinear_form

[22] http://math.stackexchange.com/questions/49733/prove-in-full-detail-that-the-set-is-a-vector-space

[23] http://math.stackexchange.com/questions/170481/motivating-to-understand-double-dual-space

[24] https://en.wikipedia.org/wiki/Banach_space

[25] https://www.dropbox.com/s/3tvpl5iiq50pu6h/Kreyszig%20-%20Introductory%20Functional%20Analysis%20with%20Applications.pdf?dl=0

[26] http://people.math.gatech.edu/~heil/handouts/appendixa.pdf

[27] http://math.stackexchange.com/questions/617202/metric-space-normed-space-and-inner-product-space-hierarcy

My Pycharm keymap setting backup

Just to back up my keymap setting because I think they are super convenient to my usage style.

https://www.dropbox.com/s/m14ozngs59jcd18/pycharm_keymap_settings.jar?dl=0

When A* algorithm returns optimal solution

Dijkstra algorithm is a well known algorithm for finding exact distance from a source to a destination. In order to improve the path finding speed, A* algorithm combines heuristics and known distances to find the heuristically best path towards a goal. A common A* implementation maintains an open set for discovered yet not evaluated nodes and a closed set for already evaluated nodes.

Here is the pseudo-code of A* (copied from Wikipedia):

function A*(start, goal)
    // The set of nodes already evaluated.
    closedSet := {}
    // The set of currently discovered nodes that are not evaluated yet.
    // Initially, only the start node is known.
    openSet := {start}
    // For each node, which node it can most efficiently be reached from.
    // If a node can be reached from many nodes, cameFrom will eventually contain the
    // most efficient previous step.
    cameFrom := the empty map

    // For each node, the cost of getting from the start node to that node.
    gScore := map with default value of Infinity
    // The cost of going from start to start is zero.
    gScore[start] := 0 
    // For each node, the total cost of getting from the start node to the goal
    // by passing by that node. That value is partly known, partly heuristic.
    fScore := map with default value of Infinity
    // For the first node, that value is completely heuristic.
    fScore[start] := heuristic_cost_estimate(start, goal)

    while openSet is not empty
        current := the node in openSet having the lowest fScore[] value
        if current = goal
            return reconstruct_path(cameFrom, current)

        openSet.Remove(current)
        closedSet.Add(current)
        for each neighbor of current
            if neighbor in closedSet
                continue		// Ignore the neighbor which is already evaluated.
            // The distance from start to a neighbor
            tentative_gScore := gScore[current] + dist_between(current, neighbor)
            if neighbor not in openSet	// Discover a new node
                openSet.Add(neighbor)
            else if tentative_gScore >= gScore[neighbor]
                continue		// This is not a better path.

            // This path is the best until now. Record it!
            cameFrom[neighbor] := current
            gScore[neighbor] := tentative_gScore
            fScore[neighbor] := gScore[neighbor] + heuristic_cost_estimate(neighbor, goal)

    return failure

function reconstruct_path(cameFrom, current)
    total_path := [current]
    while current in cameFrom.Keys:
        current := cameFrom[current]
        total_path.append(current)
    return total_path

In the code, the closed set prevents evaluated nodes from being evaluated again. One may think about not using closed set at all because it is not a big deal if we allow a node being added to the open set or its $latex f$ value being updated multiple times. Anyway, we only pick the node with the least $latex f$ value, right? Wrong. We can only safely drop using the closed set if “new nodes are added to the open set only if they have a lower $latex f$ value than at any previous iteration” (from Wikipedia). Without such mechanism, A* will never terminate in some extreme cases. Here is the version when the closed set is dropped out:

function A*(start, goal)

    // The set of currently discovered nodes that are not evaluated yet.
    // Initially, only the start node is known.
    openSet := {start}
    // For each node, which node it can most efficiently be reached from.
    // If a node can be reached from many nodes, cameFrom will eventually contain the
    // most efficient previous step.
    cameFrom := the empty map

    // For each node, the cost of getting from the start node to that node.
    gScore := map with default value of Infinity
    // The cost of going from start to start is zero.
    gScore[start] := 0 
    // For each node, the total cost of getting from the start node to the goal
    // by passing by that node. That value is partly known, partly heuristic.
    fScore := map with default value of Infinity
    // For the first node, that value is completely heuristic.
    fScore[start] := heuristic_cost_estimate(start, goal)

    while openSet is not empty
        current := the node in openSet having the lowest fScore[] value
        if current = goal
            return reconstruct_path(cameFrom, current)

        openSet.Remove(current)

        for each neighbor of current
            // The distance from start to a neighbor
            tentative_gScore := gScore[current] + dist_between(current, neighbor)
            if neighbor not in openSet	// Discover a new node
                openSet.Add(neighbor)
            else if tentative_gScore >= gScore[neighbor]
                continue		// This is not a better path.

            // This path is the best until now. Record it!
            cameFrom[neighbor] := current
            gScore[neighbor] := tentative_gScore
            fScore[neighbor] := gScore[neighbor] + heuristic_cost_estimate(neighbor, goal)

    return failure

function reconstruct_path(cameFrom, current)
    total_path := [current]
    while current in cameFrom.Keys:
        current := cameFrom[current]
        total_path.append(current)
    return total_path

And here is the example when neither using the closed set nor the memory of previous iteration will cause A* never terminate:

astarloop (2)

The shortest path is supposed to be A->B->C->D->G. Starting from A:

Pick A from the open set. Update B’s $latex f$ value as $latex f(B)=g(A) + d(A,B) + h(B) = 0 + 1 + 1 = 2$. Update B’s $latex g$ value as $latex g(B)=g(A)+d(A,B)=0+1=1$. Add B to the open set.

Pick B from the open set. Update C’s $latex f$ value as $latex f(C)=g(B)+d(B,C)+h(C)=1+1+1=3$. Update C’s $latex g$ value as $latex g(C)=g(B)+d(B,C)=1+1=2$. Add C to the open set.

Pick C from the open set. Update D’s $latex f$ value as $latex f(D)=g(C)+d(C,D)+h(D)=2+1+100=103$. Add D to the open set. Since there is no closed set, the algorithm is ignorant of the fact that A has been checked. Therefore, the algorithm continues to update $latex f(A)=g(C)+d(C,A)+h(A)=2+1+1=4$. Because there is no memory recording what $latex A$’s previous $latex f$ values are, the algorithm stupidly assigns $latex f(A)=4$, a larger $latex f$ value than before, to A. A is then added to the open set. $latex g(A)$ and $latex g(D)$ will be updated similarly as before.

Now, picking the next node from the open set. A will be picked because it has the least $latex f$ value ($latex f(A)<f(D)$. Then, the loop will go on from A to B to C then to A again. Note that every time when you reach C, $latex f(D)$ and $latex f(A)$ will increase by $latex g(C) – g(C)_{old}$. A will always be favored over D because $latex h(A) < h(D)$. As a result, A* will never return.

Now, suppose both versions of A* (with or without the closed set) work thanks to proper implementation. Let’s see when A* returns the optimal distance.

For the version without the closed set, we claim A* is optimal when the heuristic function is admissible, i.e., $latex h(n)$ is never overestimated, i.e., $latex h(n) \leq h^*(n)$ for all nodes, where $latex h^*(n)$ is the shortest distance from n to the goal.

We can prove as follows: say the goal node is G, the start node is A. The optimal path from A to G has the distance $latex d^*$. If a non-optimal path propagates to G with $latex f(G)=d_{fake}$, then we must know $latex d^* < d_{fake}$. Since all heuristic values are not overestimated, before we pick G as the next node from the open set, there must be other nodes on the optimal path with a smaller f value than $latex d^*$ that get picked. Until the optimal path propagates its distance to G, G will never be picked from the open set with its non-optimal $latex f$ value.

For the version with the closed set, we claim A* is optimal when the heuristic function is both admissible and monotonic. A monotonic heuristic function means for any pair of adjacent nodes $latex x$ and $latex y$, where $latex d(x,y)$ denotes the length of the edge between them, we must have: $latex h(x) \leq d(x,y) + h(y)$. I don’t have a good proof why we require the heuristic function must also be monotonic but I can give an example that A* does not return the optimal distance when monotonicity is not satisfied.

negexample

Since $latex h(A) \leq h^*(A)$, $latex h(B) \leq h^*(B)$, $latex h(C) \leq h^*(C)$ and $latex h(D) \leq h^*(D)$, admissibility is satisfied. However, $latex h(B) > d(B,C)+h(C)$, therefore monotonicity is not satisfied. Now let’s see how the break of monotonicity makes A* not return the optimal distance.

We first pick A as it is the start point. Add B and C to the open set, with $latex f(B)=g(B)+h(B)=1+100=101$ and $latex f(C)=g(C)+h(C)=1+30=31$.

Then, we pick C from the open set because of $latex f(C)<f(B)$, evaluate C, and finally add C to the close set. During the evaluation, D will be added to the open set, with $latex f(D)=g(C)+d(C,D)+h(D)=1+5+90=96$ and $latex g(D)=g(C)+d(C,D)=1+5=6$. At this moment, B is not in the closed set. Since $latex f'(B)=g(C)+d(C,B)+h(B)=1+1+100=102$, which is less than current $latex f(B)$, we do not update $latex f(B)$.

Then, since $latex f(B)<f(D)$, we pick D from the open set and evaluate D. G and B are D’s neighbors. $latex f(G)=g(D)+d(D,G)=6+96=102$ and $latex f'(B)=g(D)+d(D,B)+h(B)=6+4+100=110$. Since $latex f'(B) > f(B)$, we do not update $latex f(B)$. We add D to the closed set.

Next, since $latex f(B) < f(G)$, we process B. However, the B’s neighbors C and D are all in the closed set. Therefore, we do nothing.

Finally, we pick G and turns out it is the goal. So we return with its $latex f$ value 102, which is not optimal. The optimal distance should be 101, which is achieved by the path A->B->D->G.

Actually, monotonicity implies admissibility when $latex h(G)=0$. Suppose for a node $latex v_0$, the shortest distance $latex h^*(v_0)$ from $latex v_0$ to $latex G$ is achieved by the path $latex v_0,v_1, v_2, v_3, \cdots, v_n, G$. Then, we can show that $latex h(v_0) \leq h^*(v_0)$:

$latex h(v_0) \leq d(v_0, v_1) + h(v_1) \leq d(f, v_1) + d(v_1, v_2) + h(v_2) \leq \cdots \leq \sum\limits_{i=0}d(v_i, v_{i+1}) + h(G) =h^*(v_0)$

Reference:

http://cs.stackexchange.com/questions/16065/how-does-an-admissible-heuristic-ensure-an-optimal-solution

http://cs.stackexchange.com/questions/30778/why-is-the-a-search-heuristic-optimal-even-if-it-underestimates-costs?rq=1

http://www.algostructure.com/search/a-star.php

Install Tensorflow 0.12 with GPU support on AWS p2 instance

# for connection and file transfer

ssh -i ~/Dropbox/research/aws_noisemodel_keypair.pem ubuntu@ec2-54-164-130-227.compute-1.amazonaws.com

rsync –progress –delete -rave “ssh -i /home/czxttkl/Dropbox/research/aws_noisemodel_keypair.pem” /home/czxttkl/workspace/mymachinelearning/Python/LoLSynergyCounter ubuntu@ec2-54-164-130-227.compute-1.amazonaws.com:~/

sudo apt-get install python-pip python-dev
pip install tensorflow-gpu

download and transfer cuda toolkit, then install

sudo dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda

download and transfer cudnn, then install:

tar xvzf cudnn-<your-version>.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

reference: https://github.com/tensorflow/tensorflow/issues/5591

Other possibly used scientific modules

sudo pip install gensim numpy scipy scikit-learn pandas seaborn
sudo apt-get install python-tk

Append to ~/.bash_rc and run source ~/.bashrc

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda

reference: https://www.tensorflow.org/versions/r0.11/get_started/os_setup#optional_linux_enable_gpu_support

You can also run some p2 instance optimization specific for GPU computation:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/accelerated-computing-instances.html#optimize_gpu

gmail filter OR condition

In order to achieve a filter rule with OR operation, you must specify conditions connected by the keyword “OR” (must be in upper cases).

For example:

Screenshot from 2017-01-29 17:48:01

Install Python Package for User

If you do not have root privilege and want to install a python module, you can try the following approach:

python setup.py install --user

This will install packages into subdirectories of site.USER_BASE. To check what is the value of site.USER_BASE, use:

import site
print site.USER_BASE

reference: https://docs.python.org/2/install/

Update 2018/01/06:

using pip to install a python package:

pip install --user package_name

Embedding and Heterogeneous Network Papers

Embedding methods have been widely used in graph, network, NLP and recommendation system. In short, embedding methods vectorize entities under study by mapping them into a shared latent space. Once vectorized representation of entities are learned (through either supervised or unsupervised fashion), a lot of knowledge discovery work can be done: clustering based on entity vectors, analyze each dimension of the latent space, make recommendation, so on so forth.

When you learn embedding by supervised learning, you define a objective function that forges interaction of entity vectors and then learn vectors through maximizing/minimizing the objective function. When you learn embedding through unsupervised learning, you define how occurrences of data should conform to a conjured, underlying probability function in terms of entity vectors. This will be elaborated more in the following.

Recommendation System

A widely known embedding method is matrix factorization in recommendation system. Affinity of a user and an item is often modeled as the dot product of the two corresponding entity vectors. When you are given a huge, sparse user-item matrix, you try to learn all user/item vectors such that their dot products can maximally approximate the entries in the sparse matrix. For more information, see [2]

NLP

Embedding in NLP has been rapidly developed recently. Let’s see a few classic embedding NLP models.

[3] published 2003. It tries to predict t-th word given t-1 words previously seen in word sequences. The model extends a neural network model:

input is horizontal concatenation of input word vectors: $x=(C(w_{t-1}), C(w_{t-2}), \cdots, C(w_{t-n+1}))$
hidden layer is just a non-linear tanh layer: $y=b+Wx+U \text{tanh} (d+Hx)$
output layer is a softmax layer. For word $w_t$ that indeed appears after $w_{t-1} \cdots w_{t-n+1}$ : $P(w_t|w_{t-1}, \cdots, w_{t-n+1})=\frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$
objective function is to maximize the log-likelihood: $L=\frac{1}{T}\sum\limits_t \log f(w_t, w_{t-1}, \cdots, w_{t-n+1};\theta) + R(\theta)$

The diagram of the model structure (from the paper [3]):

Screenshot from 2017-01-02 11:44:17

Equivalently, the model structure from [1]:

nn_language_model-1

One problem is that in the normalization term in the softmax is expensive when vocabulary size is usually large. In [3], it proposes several parallelism implementation to accelerate learning. The other alternative is using hierarchical softmax [4, 5], which only needs to evaluate $\log(|Vocabulary|)$ for each softmax calculation. There are several other more efficient algorithms that are proposed to sidestep the calculation of softmax.

C&W model [17, 18]

C&W model has the following structure:

Screenshot from 2017-01-14 10:28:47

First, a word has multiple features (feature 1 ~ feature K), each has a $d^k$ dimension embedding. Then a convolution layer is added to summarize surrounding embedding of each word. Then a max pooling is added so that most salient features will be kept. The last few layers are classic NN layers for supervised learning.

One contribution of the papers is to propose multi-task learning under the same, shared embedding table. The multi-task learning procedure is as follows:

Screenshot from 2017-01-14 10:35:36

Furthermore, the tasks to be learned can be either supervised or unsupervised. In section 5 of [17], it introduces how to train a language model “to discriminate a two-class classification task”. Note the special ranking-type cost it uses:

Screenshot from 2017-01-14 10:39:41

Continuous bag-of-words [5]

This model is easy to understand. Projection layer is no longer the horizontal concatenation of word vectors [3]. Instead we just sum over all input word vectors as the input before the hidden layer.

Skip-gram [6]

Note to differentiate this with [7]. Instead of using a whole sequence to predict the right next word, Skip-gram uses one word to predict its every surrounding word. Specifically, the objective function is:

$\frac{1}{T}\sum\limits_{t=1}^T \sum\limits_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j}|w_t)$

If in every pair of $<w_{t+j}, w_t>$ , we denote $w_{t+j}$ as $w_O$ and $w_t$ as $w_I$ , then we have

Screenshot from 2017-01-02 16:44:10

However, this objective function suffers the problem we described in [3]: the normalization term in softmax is expensive to calculate because vocabulary is often large. Therefore, they propose to use Negative sampling, which replaces $\log P(w_O|w_I)$ in the objective function above with:

Screenshot from 2017-01-02 16:46:41

Since we are maximizing the objective function, we can understand it in such intuition: we want $\log \sigma({v'_{w_O}}^T v_{w_I})$ as large as possible. Therefore, $v'_{w_O}$ and $v_{w_I}$ should be as similar as possible. On the other hand, the original normalization term in softmax can be seen to be replaced by $\sum\limits_{i=1}^k \mathbb{E}_{w_i \sim P_n(w)}[\log\sigma({-v'_{w_i}}^T v_{w_I})]$ , which contains $k$ log-sigmoid function smoothing the negative dot products with $k$ noisy word vectors. Of course, to maximize it, $v'_{w_i}$ and $v_{w_I}$ should be as dissimilar as possible.

Finally, the authors claim that using a modified unigram distribution $P_n(w)$ is good in practice. I guess the unigram distribution means each word has the probability mass correlated to the frequencies it appears in the data.

Miscellaneous NLP paper

The following are some papers with less attention so far.

[8] proposes that for each word, its word embedding should be unique across different supervised labels.

[9] proposes that adding more non-linear hidden layers could result in better differentiating ability in bag-of-word model.

For more NLP embedding overview, please see [1].

(Heterogeneous) Network

A heterogeneous network is a graph with multiple types of nodes and links can connect two different types of nodes. Meta-path define a series of links in a heterogeneous network [10, 11]. Heterogeneous network has richer information to harvest compared to single typed network.

[13] first learns embedding of documents, words, and labels. Then, classification .

[14] learns entity embedding (author, paper, keyword, etc) in bibliographical network for the author identification task. The objective function contains two parts: 1. task-specific embedding. author vector should be similar to paper vector which is obtained by averaging vectors per node type. 2. path-augmented heterogeneous network embedding. Two entity vectors should be similar if they are connected by certain meta-path.

[15] models normal events’ happening probabilities as interactions between entity embeddings.

[16] models people’s behavior on twitter to infer their political opinion embeddings. The behavior on twitter is converted to a heterogeneous network where there exist 3 types of links: follow, mention and retweet.

Visualization

T-sne [12] is a dimensionality reduction technology with its goal to “preserve as much of the significant structure of the high dimensional data as possible in the low-dimensional map”. It focus on capturing the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters. To achieve that, t-sne converts pairwise similarity measurement to conditional probability and the objective function is the Kullback-Leibler divergence between conditional probability distribution under original high dimension and transformed low dimension space. It has been widely used for data interpretation in embedding papers.

Reference

[1] http://sebastianruder.com/word-embeddings-1/index.html

[2] https://czxttkl.com/?p=1214

[3] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. journal of machine learning research, 3(Feb), 1137-1155.

[4] Morin, F., & Bengio, Y. (2005, January). Hierarchical Probabilistic Neural Network Language Model. In Aistats (Vol. 5, pp. 246-252).

[5] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[6] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

[7] Li, C., N., Wang, B., Pavlu, V., & Aslam, J. Conditional Bernoulli Mixtures for Multi-Label Classification.

[8] Jin, P., Zhang, Y., Chen, X., & Xia, Y. (2016). Bag-of-Embeddings for Text Classification. In International Joint Conference on Artificial Intelligence (No. 25, pp. 2824-2830).

[9] Iyyer, M., Manjunatha, V., Boyd-Graber, J., & Daumé III, H. (2015). Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics.

[10] Kong, X., Yu, P. S., Ding, Y., & Wild, D. J. (2012, October). Meta path-based collective classification in heterogeneous information networks. In Proceedings of the 21st ACM international conference on Information and knowledge management (pp. 1567-1571). ACM.

[11] Sun, Y., Han, J., Yan, X., Yu, P. S., & Wu, T. (2011). Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11), 992-1003.

[12] Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.

[13] Tang, J., Qu, M., & Mei, Q. (2015, August). Pte: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1165-1174). ACM.

[14] Chen, T., & Sun, Y. (2016). Task-Guided and Path-Augmented Heterogeneous Network Embedding for Author Identification. arXiv preprint arXiv:1612.02814.

[15] Chen, T., Tang, L. A., Sun, Y., Chen, Z., & Zhang, K. Entity Embedding-based Anomaly Detection for Heterogeneous Categorical Events.

[16] Gu, Y., Chen, T., Sun, Y., & Wang, B. (2016). Ideology Detection for Twitter Users with Heterogeneous Types of Links. arXiv preprint arXiv:1612.08207.

[17] Collobert, R., & Weston, J. (2008, July). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning (pp. 160-167). ACM.

[18] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493-2537.

The expected times of tosses until you see first HTH or HTT

The problem comes from a very famous Ted Talk:

You are flipping a fair coin. What is the expected times of tosses you need to see the first “HTH” appears? What is that for the first “HTT” appears?

Suppose $latex N_1$ is the random variable which counts the number of flips till we get first “HTH”. $latex N_2$ is the random variable which counts the number of flips until we get first “HTT”. Intuitively, $latex E[N_1] > E[N_2]$. Here is why:

Suppose you want to get sequence “HTH”. When you already see “HT” sequence, you get excited because the next flip is important. If the next flip is “H”, you stop and return the count. However, if the next flip is “T”, your excitement has to cease and start over. You will only start to get excited again when you see next “H” or “HT”.

Now, suppose you want to get sequence “HTT”. Again, when you already see “HT” sequence, you become excited. If the following flip is “T”, you stop and return the count. If the next flip is “H”, your excitement also goes down but not totally ceases. Because at least you have an “H”, the first letter of the sequence you want to see. So your excitement does not need to start over completely.

Since you are more easily to get close to obtain “HTT” than “HTH”, $latex E[N_1]>E[N_2]$.

We can get exact solutions by many methods. Let me introduce first a probability method, which can be referred from here:

Let $latex X$ be the random variable which counts the number of flips till we get a “T” (tail).

$latex E[X] = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot (1 + E[X]) \\ \Rightarrow E[X]=2 &s=2$

In the RHS of the first line formula: the first term is for when we get a “T” on the first flip; the second term is if we get a “H” on the first flip, which means we are effectively starting over with another extra flip.

Another way to calculate $latex X$ is from the fact that $latex X$ follows a geometric distribution or a negative binomial distribution with mean equal to 2. See here for derivation.

Now, assume $latex Y$ the random variable which counts the number of flips until we get a “HT” sequence.

$latex E[Y] = \frac{1}{2} \cdot (1+E[Y]) + \frac{1}{2} \cdot (1+E[X]) \\ \Rightarrow E[Y]=4 &s=2$

In the RHS of the first line formula: the first term is if we get a “T” on our first flip, which means we are effectively starting over with 1 extra flip. The second term is if we get a “H” on our first flip, then all we want is a “T”. What we want to count is 1+expected number of flips to get a “T”, the latter is exactly $latex E[X]$.

In the same spirit, if we suppose $latex Z$ is the random variable which counts the number of flips until we get a “TT” sequence, then:

$latex E[Z]=\frac{1}{2} \cdot (E[X]+1) + \frac{1}{2} \cdot (E[T]+1+E[TT]) \\ \Rightarrow E[Z]=6 &s=2$

Now for the sequence “HTH”, we have:

$latex E[N_1] = \frac{1}{2} \cdot (E[Y]+1) + \frac{1}{2} \cdot (E[Y] + 1 + E[N_1]) \\ \Rightarrow E[N_1]=10 &s=2$

In the RHS of the first line formula: the first term is if we get a “HT” on our first two flips followed by a “H”; the second term is if we get a “HT” followed by a “T”, then we have to start over.

For the sequence “HTT”, we have:

$latex E[N_2] = \frac{1}{2} \cdot (E[Y]+1) + \frac{1}{2} \cdot (E[Y]+1+E[Z]) \\ \Rightarrow E[N_2] = 8 &s=2$

In the RHS of the first line formula: the first term is if we get a “HT” on our first two flips followed by a “T”; the second term is if we get a “HT” followed by a “H”, then we have to count the flips until next “TT” happens.

We can also use Markov Chain to solve this problem. I am just showing you the solution for “HTT”. We construct 4 states as below:

mmk

$latex E[N_2]$ is actually called the mean first passage time of state “HTT” from state “start”. We can get the mean first passage time using pykov package:

import pykov
d = {('start', 'H'): 0.5, ('start', 'start'): 0.5, ('H', 'H'): 0.5, ('H', 'HT'): 0.5, ('HT', 'H'): 0.5, ('HT', 'HTT'): 0.5, ('HTT', 'HTT'): 1}
T = pykov.Chain(d)
T.mfpt_to('HTT')

# result
# Vector([('H', 6.0), ('HT', 4.0), ('start', 8.0)])

Another way to solve this problem is called finite state automata: http://stats.stackexchange.com/questions/12174/time-taken-to-hit-a-pattern-of-heads-and-tails-in-a-series-of-coin-tosses/12196#12196