How to write a multi-line subscript in LaTeX?

In LaTeX, what is the best way to write a multi-line formula in a subscript? Use `\atop`!

For example,

\begin{align*}
P(M_n=1) =\sigma(&w_s(\sum\limits_{j_r \in T_{nr}}\vec{U_{j_r}} \cdot \vec{C_{j_r}}  - \sum\limits_{j_b \in T_{nb}}\vec{U_{j_b}} \cdot \vec{C_{j_b}} ) + \\ &w_b(\sum\limits_{j_{r_1}, j_{r_2} \in T_{nr} \atop j_{r_1} \neq j_{r_2} }\left\Vert\vec{C_{j_{r_1}}} - \vec{C_{j_{r_2}}}\right\Vert^2 - \sum\limits_{j_{b_1}, j_{b_2} \in T_{nb} \atop j_{b_1} \neq j_{b_2} }\left\Vert\vec{C_{j_{b_1}}} - \vec{C_{j_{b_2}}} \right\Vert^2) + b)
\end{align*}

will result in:

Screenshot from 2015-08-23 13:53:08
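A side note: `\atop` is an old plain-TeX primitive. If the amsmath package is loaded (an assumption; the snippet below is a sketch of one of the sums above using it), `\substack` produces the same multi-line subscript with cleaner syntax:

\begin{align*}
\sum\limits_{\substack{j_{r_1}, j_{r_2} \in T_{nr} \\ j_{r_1} \neq j_{r_2}}} \left\Vert\vec{C_{j_{r_1}}} - \vec{C_{j_{r_2}}}\right\Vert^2
\end{align*}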

How to write a multi-line formula?

In LaTeX, what is the best way to write a multi-line formula? The `\begin{align*}/\end{align*}` block supports line breaks via `\\`.

For example,

\begin{align*}
P(M_n=1) =\sigma(&w_s(\sum\limits_{j_r \in T_{nr}}\vec{U_{j_r}} \cdot \vec{C_{j_r}}  - \sum\limits_{j_b \in T_{nb}}\vec{U_{j_b}} \cdot \vec{C_{j_b}} ) + \\ &w_b(\sum\limits_{j_{r_1}, j_{r_2} \in T_{nr} \atop j_{r_1} \neq j_{r_2} }\left\Vert\vec{C_{j_{r_1}}} - \vec{C_{j_{r_2}}}\right\Vert^2 - \sum\limits_{j_{b_1}, j_{b_2} \in T_{nb} \atop j_{b_1} \neq j_{b_2} }\left\Vert\vec{C_{j_{b_1}}} - \vec{C_{j_{b_2}}} \right\Vert^2) + b)
\end{align*}

will result in:

Screenshot from 2015-08-23 13:53:08

Minimum Description Length (MDL): a scoring function to learn Bayesian Network Structure

Bayesian Network Augmented Naive-Bayes (BAN) is an improved version of Naive-Bayes, in which every attribute X_i may have at most one other attribute as a parent besides the class variable C. For example (figure from [1]):

Screenshot from 2015-08-03 22:43:28

Though BAN provides a more flexible structure than Naive Bayes, the structure (i.e., the dependence/independence relationships between attributes) needs to be learned. [2] proposed a structure-learning method using a scoring function called Minimum Description Length (MDL). Basically, given a training data set D, you start from a random network structure g_1 and compute MDL(g_1|D), setting the current best structure g^* = g_1. Then for any other candidate structure g_i, g^* is updated to g_i if MDL(g_i|D) <= MDL(g^*|D). In other words, we want to find the network structure that minimizes MDL(g|D).
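The selection loop can be sketched in a few lines of Python. This is a toy sketch, not the algorithm of [2]: `mdl_score`, `best_structure`, and the candidate numbers are all made up for illustration, and a real learner would compute |g| and the log-likelihood from the data D.

```python
import math

def mdl_score(num_params, log_likelihood, n):
    # MDL(g|D) = (log N / 2) * |g| - LogLikelihood(g|D)
    return (math.log(n) / 2.0) * num_params - log_likelihood

def best_structure(candidates, n):
    # Return the structure name with the minimal MDL score.
    # `candidates` maps name -> (num_params, log_likelihood).
    return min(candidates, key=lambda g: mdl_score(*candidates[g], n=n))

# Hypothetical numbers: the richer structure fits slightly better but
# pays a larger complexity penalty, so the simpler one wins here.
candidates = {
    "naive_bayes": (10, -520.0),
    "ban":         (48, -510.0),
}
print(best_structure(candidates, n=1000))  # naive_bayes
```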

Now let’s look at the exact formula of MDL(g|D):

MDL(g|D) = \frac{\log N}{2}|g| - LogLikelihood(g|D)

The second term of MDL(g|D), the negative LogLikelihood of g given data D, is easy to understand. Given the data D, we would like the likelihood of the structure g to be as large as possible; since we try to minimize MDL(g|D), we write it as a negative log likelihood. However, merely using the negative LogLikelihood(g|D) as a scoring function is not enough, since it strongly favors overly sophisticated structures (a.k.a. overfitting). That is why the first term \frac{\log N}{2}|g| comes into play to regularize the overall structure:

|g| is the number of parameters in the network, N is the size of the dataset, and \frac{\log N}{2} is the number of bits needed to encode each parameter. So \frac{\log N}{2}|g| penalizes networks with many parameters. Taking for example the network in the figure above, suppose attribute x_1 has two possible values, i.e., x_1 = \{a_1, a_2\}. Similarly, x_2 = \{b_1, b_2, b_3\}, x_3 = \{c_1,c_2\}, x_4 = \{d_1, d_2, d_3, d_4\}, and C=\{o_1, o_2\}. Then attributes x_1, x_2, x_3, x_4 can have 2\times 3 \times 2 \times 4=48 combinations of values. Further, we can use one more parameter to represent the joint probability P(x_1, x_2, x_3, x_4, C=o_1), one minus which is the joint probability P(x_1, x_2, x_3, x_4, C=o_2). Hence |g| = (|C|-1) \cdot (|x_1| |x_2| |x_3| |x_4|) = 48. Why can we use \frac{\log N}{2} bits to encode each parameter? Please refer to section 2 in [3] and my reading note on it at the end of this post.
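The parameter count in this example can be checked with a few lines of Python (the cardinalities below are exactly the ones assumed in the paragraph above):

```python
from functools import reduce

cardinalities = [2, 3, 2, 4]   # |x1|, |x2|, |x3|, |x4|
num_classes = 2                # |C|

# Number of joint value combinations of the attributes: 2*3*2*4
combinations = reduce(lambda a, b: a * b, cardinalities, 1)
# |g| = (|C| - 1) * |x1||x2||x3||x4|
g = (num_classes - 1) * combinations
print(combinations, g)  # 48 48
```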

Back to [2]: you can find in the paper that the negative log likelihood function in MDL is replaced by a mutual-information form. This transformation is equivalent and was proved in [4].

My reading note on Section 2 in [3]

Until equation (3), the author is just describing the context; it is not too hard to understand. Equation (3) basically says there are k(g) parameters in the network, which is exactly what I explained two paragraphs earlier.

Then, the author gave equation (4) (5) (6):

Screenshot from 2015-08-04 00:24:25

Here, \theta is a vector of length k(g). Each component of \theta is denoted \theta[q,s,g], which refers to the conditional probability of observing q \in A_{N+1} and s \in S(g) given a model g. For a given network g, \sum_{q}\sum_{s}\theta[q,s,g] = 1.

Then, w(\theta) is the prior distribution of the vector \theta. The author supposes it is a Dirichlet distribution, so w(\theta) equals:

Screenshot from 2015-08-04 00:33:47

For any possible value of \theta, the posterior probability is w(\theta)P_\theta(y^n|x^n). The true marginal probability P(y^n|x^n) is obtained by integrating over all possible \theta values:

Screenshot from 2015-08-04 00:37:51

Then the author says the description length of y^n|x^n is simply its negative log, as in equation (4). I don’t quite get this point. What I can think of is that the negative log likelihood is attractive for its monotonically decreasing property: the higher the probability P(y^n|x^n), the more desirable the structure. Since we want to minimize MDL anyway, why not just define MDL in such a negative-log form?

Then the author claimed that:

Screenshot from 2015-08-04 00:50:34

This is why we can use \frac{\log N}{2} bits to encode each parameter.

Reference

[1] Cheng, J., & Greiner, R. (1999, July). Comparing Bayesian network classifiers. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence (pp. 101-108). Morgan Kaufmann Publishers Inc.

[2] Hamine, V., & Helman, P. (2005). Learning optimal augmented Bayes networks. arXiv preprint cs/0509055.

[3] Suzuki, J. (1999). Learning Bayesian belief networks based on the minimum description length principle: basic properties. IEICE transactions on fundamentals of electronics, communications and computer sciences, 82(10), 2237-2245.

[4] Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine learning, 29(2-3), 131-163.

Other references, though not used in the post, are useful in my opinion:

[5] De Campos, L. M. (2006). A scoring function for learning bayesian networks based on mutual information and conditional independence tests. The Journal of Machine Learning Research, 7, 2149-2187.

[6] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5), 465-471.

[7] Liu, Z., Malone, B., & Yuan, C. (2012). Empirical evaluation of scoring functions for Bayesian network model selection. BMC bioinformatics, 13(Suppl 15), S14.

A viewer for Python Dict

Just found a beautiful viewer library that helps you view Python dict objects.

Here is the code:

"""
Created on Fri Feb 22 12:52:28 2013

@author: kranzth
"""
from traits.api \
    import HasTraits, Instance, Str, on_trait_change

from traitsui.api \
    import View, VGroup, Item, ValueEditor, TextEditor

from copy import deepcopy

class DictEditor(HasTraits):
    SearchTerm = Str()
    Object = Instance( object )

    def __init__(self, obj, **traits):
        super(DictEditor, self).__init__(**traits)
        self._original_object = obj
        self.Object = self._filter(obj)

    def trait_view(self, name=None, view_elements=None):
        return View(
          VGroup(
            Item( 'SearchTerm',
                  label      = 'Search:',
                  id         = 'search',
                  editor     = TextEditor(),
                  #style      = 'custom',
                  dock       = 'horizontal',
                  show_label = True
            ),
            Item( 'Object',
                  label      = 'Debug',
                  id         = 'debug',
                  editor     = ValueEditor(),
                  style      = 'custom',
                  dock       = 'horizontal',
                  show_label = False
            ),
          ),
          title     = 'Dictionary Editor',
          width     = 800,
          height    = 600,
          resizable = True,
        )

    @on_trait_change("SearchTerm")
    def search(self):
        self.Object = self._filter(self._original_object, self.SearchTerm)

    def _filter(self, object_, search_term=None):
        def has_matching_leaf(obj):
            # Recurse into containers; a leaf matches if its string
            # representation equals the search term.
            if isinstance(obj, list):
                return any(map(has_matching_leaf, obj))
            if isinstance(obj, dict):
                return any(map(has_matching_leaf, obj.values()))
            try:
                return str(obj) == search_term
            except ValueError:
                return False

        obj = deepcopy(object_)
        if search_term is None:
            return obj

        if isinstance(obj, dict):
            # Iterate over a snapshot of the keys: deleting while
            # iterating over obj.keys() raises RuntimeError in Python 3.
            for k in list(obj.keys()):
                if not has_matching_leaf(obj[k]):
                    del obj[k]

            for k in list(obj.keys()):
                if isinstance(obj[k], dict):
                    obj[k] = self._filter(obj[k], search_term)
                elif isinstance(obj[k], list):
                    # keep only the matching list elements
                    obj[k] = list(filter(has_matching_leaf, obj[k]))

        return obj



def build_sample_data():
    def make_one_level_dict():
        # list(...) wrappers keep this working on Python 3, where
        # range() and map() are lazy and cannot be concatenated with +
        return dict(zip(range(100),
                        list(range(100, 150)) + list(map(str, range(150, 200)))))

    my_data = make_one_level_dict()
    my_data[11] = make_one_level_dict()
    my_data[11][11] = make_one_level_dict()
    return my_data

# Test
if __name__ == '__main__':
    my_data = {"map":"dafda", "23423":"da324"}
    b = DictEditor(my_data)
    b.configure_traits()

 

You need to install `traitsui` in advance.

sudo apt-get install python-traitsui

 

The outcome sample (not bad, huh?):

Screenshot from 2015-07-27 12:15:22

How to run a PPTP VPN server?

Setting up a VPN using the PPTP protocol can be vulnerable security-wise, yet it is an easy and quick way to set up a virtual private network (VPN).

The procedure to set up a PPTP VPN server can be found here: https://help.ubuntu.com/community/PPTPServer

The way to connect to a PPTP server on a client machine can be found here: https://my.hostvpn.com/knowledgebase/19/Setup-a-PPTP-VPN-Connection-in-Linux-Ubuntu.html

Change mongodb data path to portable hard drive

Background

I have a 1TB portable USB hard drive on which I want to host MongoDB data. The OS is Ubuntu 15.04.

 

Procedure

stop your mongodb service

sudo service mongod stop

 

edit `/etc/fstab` so that the computer automatically mounts your portable drive (to be used by the “mongodb” user) every time the OS boots:

UUID=018483db-1b1c-4dd2-a373-58a6aa5a10be /               ext4    errors=remount-ro 0       1
# swap was on /dev/sdb5 during installation
UUID=ade6aa02-8f10-4251-9804-fe81850451a7 none            swap    sw              0       0

# This line is added by you
# You must mount the portable drive under /media/mongodb
UUID=324fa516-3dba-4537-8e82-2a74ea20c4c6 /media/mongodb ext4  defaults  0   0

 

copy the old data directory to `/media/mongodb` using `cp …`

 

make sure the new data path has `mongodb:mongodb` as owner and group:

chown -R mongodb:mongodb /media/mongodb/your_data_dir

 

change the mongodb conf file with `vim /etc/mongod.conf` and set the data path to the new path.

 

restart mongodb

sudo service mongod restart

 

check that everything works by either:

sudo service mongod status

or:

tail /var/log/mongodb/mongodb.log (mongodb log path)

 

Reference

http://stackoverflow.com/questions/24781972/changing-mongod-dbpath-raise-permission-error

Create mongodb replica sets on different machines

Important prerequisites:

  1. Confirm connectivity between the machines you want to set up as a replica set.
  2. Confirm that the same version of MongoDB is installed on each machine. I tried one machine with mongo 2.6 and another with mongo 3.0 and failed completely; then I installed mongo 3.0 on every machine.
  3. Don’t set up authentication on any of the machines until you have finished configuring the replica set. I spent a lot of time boggled by authentication issues.
  4. Suppose we have machine A, which has a complete copy of the data. All other machines, say B, C, D…, need to sync data from A. Make sure all machines besides A have an empty mongodb database before starting.

 

Now let’s go through an example where machine A has a complete copy of the data and machine B is empty and needs to sync from A.

By default, mongod reads `/etc/mongod.conf` every time it starts. There are two formats for the config file: one is the “setting = value” format (http://docs.mongodb.org/v2.4/reference/configuration-options/) and the other is YAML (http://docs.mongodb.org/manual/reference/configuration-options/). The former was used before Mongo 2.6 but remains compatible with later versions.

I use the same config file (of YAML format) on A and B:

storage:
  dbPath: "/media/mongodb/mongodb"
  directoryPerDB: false
  journal:
    enabled: true
systemLog:
  destination: file
  path: "/var/log/mongodb/mongodb.log"
  logAppend: true
  timeStampFormat: iso8601-utc
replication:
  oplogSizeMB: 10240
  replSetName: "rs0"
net:
  port: <your port>

(Getting the format of the config file right is very important. Sometimes a typo in the config file causes a mongodb instance to fail to start while leaving no exception in the logs.)

 

For the config file to take effect, you need to restart mongodb on both machines:

sudo service mongod restart

 

Now go to A’s mongo shell, initiate its replica state:

rsconfig = {_id:"rs0",members:[{_id:0, host:"machine_B_host:port",priority:1},{_id:1, host:"machine_A_host:port",priority:2}]}
rs.initiate(rsconfig) 

Remember to replace machine A’s and B’s address:port. If you want A to be PRIMARY as much as possible, set its priority higher than B’s. The command should return OK. You can also use `rs.reconfig(rsconfig)` later if you want to modify the configuration.

 

The first time you start mongod on B, you will find you can’t query anything. For example:

show dbs
2015-07-09T22:52:00.208-0400 E QUERY    Error: listDatabases failed:{ "note" : "from execCommand", "ok" : 0, "errmsg" : "not master" }

That’s fine, since B is a secondary server. To allow querying on B, you need to run `rs.slaveOk()` on B first.

 

 

Till now, we have set up two machines connected as a replica set, but they can be easily hacked. For security, we need to use a keyfile as an authentication medium on both machines.

 

We first stop mongod instance:

sudo service mongod stop

 

Generate a keyfile following http://docs.mongodb.org/manual/tutorial/generate-key-file/, then push this file to both machines. I recommend putting the file in the data directory, with `mongodb:mongodb` as owner and group, and 600 permissions.

 

Now add the security specification in `/etc/mongod.conf`:

storage:
  dbPath: "/media/mongodb/mongodb"
  directoryPerDB: false
  journal:
    enabled: true
systemLog:
  destination: file
  path: "/var/log/mongodb/mongodb.log"
  logAppend: true
  timeStampFormat: iso8601-utc
replication:
  oplogSizeMB: 10240
  replSetName: "rs0"
net:
  port: <your port>
security:
  keyFile: "/media/mongodb/mongodb/mongodb-keyfile"

 

As suggested here, specifying a keyFile path automatically enables authentication. So before you start the mongod instance with a keyFile, you should add a user and password to your mongo databases. For example:

use products
db.createUser({
    user: "accountUser",
    pwd: "password",
    roles: [ "readWrite"]
})

 

Now start mongodb again:

sudo service mongod start

 

Reference

http://blog.51yip.com/nosql/1580.html

Microsoft Word Tips

Tips again. This time is for Microsoft Word.

 

1. Ctrl+Shift+8

Show non-printing characters. You can also toggle it on manually in the menu.

Screenshot from 2015-07-03 09:26:11

 

2. Pay attention to the small expansion arrow at the corner of each ribbon group. It contains more options.

Screenshot from 2015-07-03 09:30:09

Or you can see paragraph settings in Page Layout in ribbon:

Screenshot from 2015-07-07 12:19:16

 

3. Sometimes you will find that some paragraphs don’t follow tightly after the previous ones. Select these paragraphs, right click, go to Paragraph, then Line and Page Breaks, and check “Keep with next”.

Screenshot from 2015-07-03 09:31:48

 

4. The best way to insert figures along with captions is to insert them together as one text box. As you can see, when I insert a table, I always put it with its caption (if needed) in a text box whose border is filled with no color:

Screenshot from 2015-07-07 12:22:15

 

5. Export Word to PDF without markup. First, switch to Simple Markup mode. Second, go to tracking options, then “Advanced Options”, and set “Changed lines” to none; this step makes sure markup on the borders of pages will not show up in the PDF. Third, go to the Export menu; in the pop-up window where you choose the location to save the PDF, go to “Options” and select “Document” under “Publish what”.

Screenshot from 2015-07-07 12:18:40Screenshot from 2015-07-07 12:25:07Screenshot from 2015-07-07 12:26:14 Screenshot from 2015-07-07 12:26:26Screenshot from 2015-07-07 12:27:53Screenshot from 2015-07-07 12:29:07

 

6. Sometimes you will find that tables or figures in the document change position in “All Markup” mode. This is because their positioning is incorrect. Go to the object’s layout options, then make sure either the horizontal or the vertical positioning is relative to Margin instead of the other options.

Screenshot from 2015-07-07 12:36:29 Screenshot from 2015-07-07 12:36:40


My previous tip posts:

Regex: http://maider.blog.sohu.com/304850962.html

Eclipse: http://maider.blog.sohu.com/281903297.html

Git: https://czxttkl.com/?p=56

Install Latex distribution, editor and packages on Ubuntu

1. Install TexLive. This is the latex environment.

https://www.tug.org/texlive/quickinstall.html

We recommend using the `install-tl` script to install TexLive. TexLive is usually installed under “/usr/local/texlive/year/bin/x86_64-linux/….”

Although you can install vanilla TexLive via `sudo apt-get install texlive`, it often misses advanced accessories such as fonts.

 

2. Install Texmaker

http://www.xm1math.net/texmaker/

*** Note: On Ubuntu, Texmaker can be installed via apt-get install texmaker***

3. Open Texmaker, Options -> Configure Texmaker:

Set the pdflatex location based on where TexLive is installed. For example, mine is “/usr/local/texlive/2015/bin/x86_64-linux/pdflatex” -interaction=nonstopmode %.tex

4. Use `F1` to compile your tex file. Done.

5. In some versions of Ubuntu, certain shortcuts of Texmaker don’t work. To address that, you can do three steps:

a. sudo apt-get remove appmenu-qt5

b. reinstall Texmaker with Qt4-supported version: http://www.xm1math.net/texmaker/download.html

c. Keep Texmaker closed while going to ~/.config/xm1/ and edit texmaker.ini such that no same key is mapped to two different actions. (e.g. QuickBuild and PdfLatex are often mapped to the same key, F1. So please eliminate such repeating cases.)

6. To support pdf-tex synchronization (features like jumping from the pdf to the corresponding line in the tex file, or showing a red rectangle at the current editing position), you need to add “-synctex=1” to the pdflatex command:

"/usr/local/texlive/2016/bin/x86_64-linux/pdflatex" -interaction=nonstopmode -synctex=1 %.tex

 

To install CTAN packages:

Download the package files, which usually contain a .ins and a .dtx file.

Run `latex ***.ins`. This will generate a ***.sty file in the same folder.

Copy the .sty file into texmf/tex/latex/local.

 

LDA Recap

It’s been half a year since I last studied the LDA (Latent Dirichlet Allocation) model. I find that learning something without writing it down can be frustrating when you start to review it after a while; sometimes it is a bit embarrassing to feel you didn’t learn it at all.

Back to the topic: this post records what I learned about LDA. It should be a starting point for everyone who wants to learn graphical models.

 

The generative process of LDA

LDA differs from the usual discriminative algorithms in that it is a generative model. In a generative model, you conjure up some assumption of how the data was generated. Specifically in LDA, a collection of $latex |D|$ documents with $latex |K|$ known topics and a vocabulary of size $latex |V|$ is generated in the following way. (To be clear, the vocabulary is the set of words that appear at least once in the documents, so the total word count of the $latex |D|$ documents, $latex |W|$, is equal to or larger than $latex |V|$.)

1. For $latex k = 1 \cdots K$ topics:    $latex \qquad \phi^{(k)} \sim Dirichlet(\beta) &s=2$

$latex \beta$ is a vector of length $latex |V|$ which affects how the word distribution of topic $latex k$, $latex \phi^{(k)}$, is generated. This step essentially says that, since you know you have $latex k$ topics, you should expect $latex k$ different word distributions. But how do you determine these word distributions? You draw from a Dirichlet distribution $latex Dirichlet(\beta)$ $latex k$ times to form the $latex k$ word distributions.

2. Now you have already had $latex k$ word distributions. Next you want to generate documents.

For each document $latex d \in D$:

(a) $latex \theta_d \sim Dirichlet(\alpha) &s=2$

You first need to determine its topics. Actually, document $latex d$ has a topic distribution $latex \theta_d$. This matches our common sense: even if a document has a strong single theme, it more or less covers many different topics. How do you determine the topic distribution $latex \theta_d$ for document $latex d$? You draw $latex \theta_d$ from another Dirichlet distribution $latex Dirichlet(\alpha)$.

(b) For each word $latex w_i \in d$:

i. $latex z_i \sim Discrete(\theta_d) &s=2$

ii. $latex w_i \sim Discrete(\phi^{(z_i)}) &s=2$

This step says that, before each word in document $latex d$ is written, the word is first assigned to a single topic $latex z_i$. The topic assignment for each word $latex w_i$ follows the topic distribution $latex \theta_d$ you got in (a). Once you know which topic the word belongs to, you finally draw the word from the word distribution $latex \phi^{(z_i)}$.

Two more notes:

1. $latex \alpha$ and $latex \beta$ are normally called hyperparameters.

2. That’s why Dirichlet distribution is often called distribution of distribution. In LDA, word distribution for a specific topic, or topic distribution for a specific document, is drawn from predefined Dirichlet distribution. It is from the drawn word distribution that words are finally generated but not the Dirichlet distribution. (Or it is from the drawn topic distribution that each word in the document is finally assigned to a sampled topic. )
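The generative process above can be sketched in plain Python. This is a toy sketch for illustration only: `sample_dirichlet`, `generate_corpus`, the symmetric hyperparameters, and all the dimensions are made up, and a real implementation would use a numerical library.

```python
import random

def sample_dirichlet(alpha):
    # A Dirichlet draw is a vector of Gamma draws, normalized to sum to 1.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, doc_len, K, V, alpha=0.1, beta=0.1):
    # Step 1: one word distribution phi^(k) per topic, drawn from Dirichlet(beta)
    phis = [sample_dirichlet([beta] * V) for _ in range(K)]
    docs, zs = [], []
    for _ in range(num_docs):
        # Step 2(a): topic distribution of this document, theta_d ~ Dirichlet(alpha)
        theta = sample_dirichlet([alpha] * K)
        words, topics = [], []
        for _ in range(doc_len):
            z = random.choices(range(K), weights=theta)[0]    # 2(b)i: z_i ~ Discrete(theta_d)
            w = random.choices(range(V), weights=phis[z])[0]  # 2(b)ii: w_i ~ Discrete(phi^(z_i))
            topics.append(z)
            words.append(w)
        docs.append(words)
        zs.append(topics)
    return docs, zs

docs, zs = generate_corpus(num_docs=3, doc_len=20, K=4, V=50)
```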

 

What do we want after having this generative process?

What we observe or we know in advance are $latex \alpha$, $latex \beta$ and $latex \vec{w}$. $latex \vec{w}$ is a vector of word counts for all documents. What we don’t know and thus we want to get are $latex \theta$, $latex \phi$ and $latex \vec{z}$. $latex \theta$ is a matrix containing topic distributions of all documents. $latex \phi$ is a matrix containing word distributions of all topics. $latex \vec{z}$ is a vector of length $latex |W|$ representing each word’s topic. So ideally, we want to know:

$latex p(\theta, \phi, \vec{z} | \vec{w}, \alpha, \beta) = \frac{p(\theta, \phi, \vec{z}, \vec{w} | \alpha, \beta)}{p(\vec{w} | \alpha, \beta)} = \frac{p(\phi | \beta) p(\theta | \alpha) p(\vec{z}|\theta) p(\vec{w} | \phi_z)}{\iiint p(\phi | \beta) p(\theta | \alpha) p(\vec{z}|\theta) p(\vec{w} | \phi_z) d\theta \, d\phi \,d\vec{z}} &s=4$

Among which $latex p(\vec{w} | \phi_z)$ is the conditional probability of generating $latex \vec{w}$ given each word’s topic’s corresponding word distribution.

The crux of the difficulty is the intractability of the integral $latex \iiint p(\phi | \beta) p(\theta | \alpha) p(\vec{z}|\theta) p(\vec{w} | \phi_z) d\theta \, d\phi \,d\vec{z}$. There are a number of approximate inference techniques that help estimate $latex \vec{z}$, $latex \theta$ and $latex \phi$, including variational inference and Gibbs Sampling. Following http://u.cs.biu.ac.il/~89-680/darling-lda.pdf, we are going to talk about how to use Gibbs Sampling to estimate $latex \vec{z}$, $latex \theta$ and $latex \phi$.

 

Collapsed Gibbs Sampling for LDA

The idea of Gibbs Sampling is that we can find a $latex \vec{z}$ close enough to a sample from the actual but intractable joint distribution $latex p(\theta, \phi, \vec{z} | \vec{w}, \alpha, \beta)$ by repeatedly sampling from the conditional distribution $latex p(z_i|\vec{z}^{(-i)})$. For example,

Loop:

1st word $latex w_1$ in $latex \vec{w}$, $latex z_1 \sim p(z_1 | z_2, z_3, \cdots, z_{|W|}) &s=2$

2nd word $latex w_2$ in $latex \vec{w}$, $latex z_2 \sim p(z_2 | z_1, z_3, \cdots, z_{|W|}) &s=2$

 …

The last word $latex w_{|W|}$ in $latex \vec{w}$, $latex z_{|W|} \sim p(z_{|W|} | z_1, z_2, \cdots, z_{|W|-1}) &s=2$

After we get $latex \vec{z}$, we can infer $latex \theta$ and $latex \phi$ according to the Bayesian posterior formula $latex p(A|B) = \frac{p(B|A)p(A)}{p(B)}$ (p(A), as a probability distribution, gets updated after observing event B). For example, suppose $latex X_d$ denotes the event of having observed the topics $latex \vec{z}_d$ of the words $latex \vec{w}_d$ in document $latex d$. As we know from above, $latex p(\theta_d) = Dirichlet(\theta_d|\alpha)$. So $latex \theta_d$ can be estimated after observing $latex X_d$ (such estimation is also called MAP: Maximum A Posteriori):

$latex p(\theta_d | X_d) = \frac{p(\theta_d) p(X_d|\theta_d)}{p(X_d)} = \frac{p(\theta_d) p(X_d|\theta_d)}{\int p(\theta_d) p(X_d|\theta_d) d\theta_d} =Dirichlet(\theta_d|\alpha + \vec{z_d})&s=2$

$latex \phi$ can be similarly inferred using the Bayesian posterior formula. So it turns out that Gibbs Sampling only samples $latex \vec{z}$, yet it ends up also giving us $latex \theta$ and $latex \phi$. This technique is called Collapsed Gibbs Sampling: rather than sampling all unknown parameters, we sample only the single parameter upon which the estimation of the other parameters relies.
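Concretely, once the topic counts of a document are in hand, the posterior $latex Dirichlet(\theta_d|\alpha + \vec{z_d})$ yields a simple point estimate via its mean. A minimal sketch (the counts and $latex \alpha$ below are made up for illustration):

```python
def estimate_theta(topic_counts, alpha):
    # Posterior mean of Dirichlet(alpha + counts): the estimated topic
    # distribution of one document given its sampled topic assignments.
    posterior = [a + n for a, n in zip(alpha, topic_counts)]
    total = sum(posterior)
    return [p / total for p in posterior]

# A document of 20 words: 12 assigned to topic 0, 5 to topic 1, 3 to topic 2
theta_d = estimate_theta([12, 5, 3], alpha=[0.1, 0.1, 0.1])
```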

 

How to get the conditional probability $latex p(z_i | \vec{z}^{(-i)})$?

$latex p(z_i | \vec{z}^{(-i)})$ is the conditional probability of word $latex w_i$ belonging to topic $latex z_i$ given the topics of all other words in the documents. Since $latex \vec{w}$ is observable, and $latex \alpha$ and $latex \beta$ are hyperparameters, $latex p(z_i | \vec{z}^{(-i)}) = p(z_i | \vec{z}^{(-i)}, \vec{w}, \alpha, \beta)$.

Hence we have:

Screenshot from 2015-06-29 11:55:28

Note:

1. From (3) to (4), we move $latex \vec{w}$ from behind the conditioning bar to the front because it is easier to write the exact form of $latex p(\vec{z}, \vec{w}|\alpha, \beta)$:

$latex p(\vec{z}, \vec{w}|\alpha, \beta)=\iint p(\phi|\beta)p(\theta|\alpha)p(\vec{z}|\theta)p(\vec{w}|\phi_z) d\theta d\phi &s=2$.

2. From (4) to (5), we break down $latex p(\vec{z}^{(-i)}, \vec{w} | \alpha, \beta)$ into $latex p(\vec{z}^{(-i)}, \vec{w}^{(-i)}| \alpha, \beta) p(w_i|\alpha, \beta)$ because word $latex w_i$ is generated without relying on $latex \vec{z}^{(-i)}$ and $latex \vec{w}^{(-i)}$:

$latex p(w_i) = \iiint p(w_i | \phi_{z_i}) p(z_i | \theta) p(\theta | \alpha) p(\phi | \beta) d\theta d\phi dz_i&s=2$

Actually, $latex p(w_i)$ can be seen as a constant that the Gibbs Sampler can ignore. That is why the equality symbol `=` in (5) changes to the proportionality symbol $latex \propto$ in (6).

 

Before we proceed, we list a table of variables we will use:

$latex n_{d,k}$: the number of times words in document $latex d$ are assigned to topic $latex k$
$latex n_{k,w}$: the number of times word $latex w$ is assigned to topic $latex k$
$latex n_{d,\cdot}$: the number of words in document $latex d$
$latex n_{k,\cdot}$: the number of words assigned to topic $latex k$

 

Now let’s see how to calculate $latex p(\vec{z}, \vec{w}| \alpha, \beta) &s=2$:

Screenshot from 2015-06-29 15:49:40

After that, we can have:Screenshot from 2015-06-29 16:20:16

 

Among the equations above we applied the following properties:

1. For $latex \vec{\alpha} = (\alpha_1, \alpha_2, \dotsc, \alpha_K)$, $latex \vec{t}=(t_1, t_2, \dotsc, t_K)$ and $latex \sum^K_{i=1} t_i = 1$:

$latex B(\vec{\alpha})=\frac{\prod^K_{i=1}\Gamma(\alpha_i)}{\Gamma(\sum^K_{i=1}\alpha_i)}=\int \prod^K_{i=1} t_i^{\alpha_i-1} \, d\vec{t}$

2. $latex \Gamma(n) = (n-1)!$

3. From (9) to (10): since we are sampling $latex z_i$ among the $latex K$ topics, no matter which topic $latex z_i$ takes, $latex \frac{\Gamma(\sum^K_{k=1}(n_{d,k}^{(-i)}+ \alpha_k))}{\Gamma(\sum^K_{k=1}(n_{d,k} + \alpha_k))}$ is always $latex \frac{1}{\sum_{k=1}^K (n_{d,k}^{(-i)} + \alpha_k)}$, which the Gibbs Sampler can ignore.

Till now, the Gibbs Sampler knows everything it needs to calculate the conditional probability $latex p(z_i | \vec{z}^{(-i)})$: it counts $latex n_{d,k}$, $latex n_{k,w}$, $latex n_{d,\cdot}$ and $latex n_{k,\cdot}$, and knows $latex \alpha$ and $latex \beta$.
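One sweep of the resulting sampler can be sketched as follows. This is a toy sketch with symmetric $latex \alpha$ and $latex \beta$; the bookkeeping follows the count table above, and the unnormalized conditional used is the standard collapsed-Gibbs form $latex (n_{d,k}^{(-i)}+\alpha)(n_{k,w_i}^{(-i)}+\beta)/(n_{k,\cdot}^{(-i)}+V\beta)$. All the names and sizes below are made up for illustration.

```python
import random

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, K, V, alpha, beta):
    # One pass of collapsed Gibbs sampling over every word position.
    # docs[d] is a list of word ids; z[d][i] is the current topic of the
    # i-th word of document d; n_dk, n_kw, n_k are the count tables.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove word i from the counts: the (-i) superscript
            n_dk[d][k_old] -= 1
            n_kw[k_old][w] -= 1
            n_k[k_old] -= 1
            # Unnormalized conditional p(z_i = k | z^(-i), w)
            weights = [(n_dk[d][k] + alpha) *
                       (n_kw[k][w] + beta) / (n_k[k] + V * beta)
                       for k in range(K)]
            k_new = random.choices(range(K), weights=weights)[0]
            # Add the word back under its newly sampled topic
            z[d][i] = k_new
            n_dk[d][k_new] += 1
            n_kw[k_new][w] += 1
            n_k[k_new] += 1

# Tiny corpus: 2 documents, V=4 word types, K=2 topics
docs = [[0, 1, 1, 2], [2, 3, 3, 0]]
K, V, alpha, beta = 2, 4, 0.1, 0.01
z = [[random.randrange(K) for _ in doc] for doc in docs]
n_dk = [[zd.count(k) for k in range(K)] for zd in z]
n_kw = [[0] * V for _ in range(K)]
n_k = [0] * K
for d, doc in enumerate(docs):
    for w, k in zip(doc, z[d]):
        n_kw[k][w] += 1
        n_k[k] += 1
for _ in range(100):
    gibbs_sweep(docs, z, n_dk, n_kw, n_k, K, V, alpha, beta)
```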

 

 

Pseudo-code

Screenshot from 2015-06-29 17:02:26

 

References

http://u.cs.biu.ac.il/~89-680/darling-lda.pdf

https://gist.github.com/mblondel/542786

http://blog.csdn.net/v_july_v/article/details/41209515

http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/