Read SAS output tables

The following tables were generated right after a multiple linear regression with three independent variables was fit in SAS:

[Screenshot: SAS regression output; the first table is the analysis of variance, with Model, Error, and Corrected Total rows]

The linear regression is Gallons_sold ~ price + line_ad + display.

I will mainly illustrate how to read the first table. As background, the number of samples is $latex n=406$ and the number of independent variables is 3. We set $latex p=4$, the number of independent variables plus one for the intercept term.

The first row of the first table:

The model sum of squares (SSM) is defined as: $latex SSM=\sum\limits_{i=1}^{n} (\hat{y}_i - \bar{y})^2$, where $latex \bar{y}$ is the mean of the observed $latex y_i$.

Degrees of freedom of SSM: $latex p-1$. A not-very-intuitive explanation: given $latex n$ data samples $latex (\vec{x}_1, y_1), \cdots, (\vec{x}_n, y_n)$ and a target SSM, you can freely perturb up to $latex p-1$ data samples and then adjust the remaining $latex n-p+1$ samples so that the regression fitted on all $latex n$ samples achieves the same SSM.


The second row of the first table:

The error sum of squares (SSE) is defined as: $latex SSE=\sum\limits_{i=1}^{n} (\hat{y}_i - y_i)^2$

Degrees of freedom of SSE: $latex n-p$


The third row of the first table:

The corrected total sum of squares (SST) is defined as: $latex SST=\sum\limits_{i=1}^{n} (y_i - \bar{y})^2$

Degrees of freedom of SST: $latex n-1$. Note that $latex SST = SSM + SSE$ and, correspondingly, $latex (n-1) = (p-1) + (n-p)$.


The F-value in the first row is the result of the F-test, which tests whether the fitted regression explains significantly more variance than an intercept-only model. More details about the F-test can be found in the references.

$latex F = \frac{MSM}{MSE}=\frac{SSM/DF_{model}}{SSE/DF_{error}} &s=2$

The p-value of the F-value is obtained by looking it up in the F distribution, which takes three parameters: the F-value, the numerator degrees of freedom ($latex p-1$), and the denominator degrees of freedom ($latex n-p$).
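As a concrete illustration (a minimal sketch, not SAS output; it assumes numpy and scipy are available), the following Python computes the three sums of squares, the F-value, and its p-value for an arbitrary design matrix:

import numpy as np
from scipy import stats

def f_test_summary(X, y):
    """ANOVA-style summary for a linear regression of y on the columns of X."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])  # add the intercept column
    p = Xd.shape[1]                        # number of coefficients incl. intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    y_hat = Xd @ beta

    ssm = np.sum((y_hat - y.mean()) ** 2)  # model sum of squares
    sse = np.sum((y_hat - y) ** 2)         # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)      # corrected total; sst == ssm + sse

    f_value = (ssm / (p - 1)) / (sse / (n - p))
    p_value = stats.f.sf(f_value, p - 1, n - p)  # upper tail of F(p-1, n-p)
    return ssm, sse, sst, f_value, p_value

With the numbers in this post, the lookup would use $latex p-1=3$ and $latex n-p=402$ degrees of freedom.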


Reference:

[1] http://facweb.cs.depaul.edu/sjost/csc423/documents/f-test-reg.htm

[2] http://blog.minitab.com/blog/adventures-in-statistics/what-is-the-f-test-of-overall-significance-in-regression-analysis

Difference between SARSA and Q-learning

State-Action-Reward-State-Action (SARSA) and Q-learning are two forms of reinforcement learning. The differences between the two methods are discussed in:

  1. https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/
  2. http://stackoverflow.com/questions/6848828/reinforcement-learning-differences-between-qlearning-and-sarsatd
  3. http://stats.stackexchange.com/questions/184657/difference-between-off-policy-and-on-policy-learning

Let's explain why Q-learning is called off-policy learning and SARSA is called on-policy learning. Suppose at state $latex s_t$, a method takes action $latex a_t$, which lands it in a new state $latex s_{t+1}$ with observed reward $latex r_{t+1}$. At state $latex s_{t+1}$, the next action $latex a_{t+1}$ is taken, which results in $latex s_{t+2}$, and so on.

For Q-learning, its update rule is:  $latex Q(s_t, a_t)=(1-\alpha) \cdot Q(s_t, a_t) + \alpha \cdot (r_{t+1} + \gamma \cdot \max_a Q(s_{t+1}, a))$

For SARSA, its update rule is: $latex Q(s_t, a_t)=(1-\alpha) \cdot Q(s_t, a_t) + \alpha \cdot (r_{t+1} + \gamma \cdot Q(s_{t+1}, a_{t+1})) $
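To make the two update rules concrete, here is a minimal Python sketch (my own illustration; `Q` is assumed to be a tabular value function, e.g. a `collections.defaultdict(float)` keyed by (state, action) pairs):

import random

def epsilon_greedy(Q, state, actions, eps=0.1):
    # Behavior policy used by both methods.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # Off-policy: bootstraps from the greedy value max_a' Q(s', a'),
    # no matter which action the behavior policy takes next.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    # On-policy: bootstraps from Q(s', a') for the action a' that the
    # behavior policy actually chose (e.g., via epsilon_greedy).
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

The only difference is the bootstrap target: the greedy maximum versus the value of the action actually taken.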

I think the most intuitive way to differentiate on-policy and off-policy is to check whether the update formula depends on $latex a_{t+1}$. Q-learning is called off-policy because the $latex \max_a Q(s_{t+1}, a)$ term does not depend on the action $latex a_{t+1}$ that is actually taken: $latex Q(s_t, a_t)$ gets updated regardless of the future policy starting from $latex s_{t+1}$. On the other hand, SARSA is called on-policy because the term $latex Q(s_{t+1}, a_{t+1})$ in its update depends on the next actual action $latex a_{t+1}$.

Another way to understand the difference between on-policy and off-policy is to compare their estimates of $latex Q(s,a)$. Even when both SARSA and Q-learning use an $latex \epsilon$-greedy policy to balance exploration and exploitation, they arrive at different estimates: Q-learning's are usually more aggressive, while SARSA's are usually more conservative.

Below is a 3×3 grid showing the different paths learned by Q-learning and SARSA when:

  1. both methods adopt $latex \epsilon$-greedy policy.
  2. when a tie happens, the action of going right is preferred.
  3. some discount factor $latex \gamma < 1$ is used.

As you can see, Q-learning learns the faster path to the exit without regard to the risk of falling off the cliff, i.e., the state-action pairs along the faster path have higher $latex Q(s,a)$ estimates than those along the longer, safer path. This is attributed to the max operator in Q-learning's update formula: the max operator only reports the best cumulative discounted reward you could get in the future, regardless of danger. SARSA behaves like a cautious agent that avoids the "dangerous zone" close to the cliff. This is because during training, the $latex Q(s,a)$ values on the faster path are sometimes dragged down by the negative rewards incurred when the $latex \epsilon$-greedy policy occasionally steps off the cliff. Therefore, the $latex Q(s,a)$ values on the longer, safer path, though discounted over more actions, end up higher and more appealing.

[Figure: the 3×3 cliff grid, with Q-learning's faster path running along the cliff and SARSA's longer, safer path keeping its distance]

Imagine that this environment has many more rows and columns, with the first row still set as a cliff except for the two ends serving as entrance and exit. The path learned by SARSA may keep a different distance from the first row depending on the value of $latex \epsilon$, which governs the trade-off between exploration and exploitation, and the value of $latex \gamma$, which controls the discounting level. In other words, the learned policy is affected by the training configuration. This is another reason why SARSA is called on-policy and Q-learning off-policy.


Why the greedy algorithm of maximum weighted matching is a 2-approximation?

This post explains my understanding of a greedy algorithm proposed for the maximum weighted matching problem.

The greedy algorithm goes as follows (listed in the Introduction section of this paper):

[Screenshot of the algorithm: repeatedly pick the heaviest remaining edge, add it to the matching, and delete all edges sharing an endpoint with it, until no edge remains]

It is claimed that the greedy algorithm is a 2-approximation, i.e., greedy result >= 1/2 * optimal result.
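For reference, here is a minimal Python sketch of this greedy procedure as I understand it (sorting the edges once up front, which is equivalent to repeatedly extracting the heaviest remaining edge):

def greedy_matching(edges):
    # edges: iterable of (weight, u, v) tuples; returns the greedy matching.
    matched = set()   # vertices already covered by the matching
    matching = []
    for w, u, v in sorted(edges, reverse=True):  # heaviest edge first
        if u not in matched and v not in matched:
            matching.append((w, u, v))
            matched.update((u, v))
    return matching

For example, greedy_matching([(3, 'a', 'b'), (2, 'b', 'c'), (2, 'c', 'd')]) keeps the weight-3 edge and ('c', 'd'), and rejects ('b', 'c') because 'c' is already matched.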

The document where the greedy algorithm was proposed contains errors, so its proof of the 2-approximation is hard to follow. Fortunately, someone else explained it in his own words: http://math.stackexchange.com/questions/1146224/proof-for-why-maximum-weight-matching-using-greedy-guarantees-at-least-1-2-the-w. However, I think further explanation is still needed to facilitate understanding. Here are my thoughts:

If an edge $latex e$ belongs to the optimal matching $latex M^*$ but not the greedy matching $latex M$, it must be because at least one of its two vertices is incident on another edge $latex e'$ that is included in the greedy matching $latex M$. By the greedy rule, $latex w(e') \geq w(e)$; otherwise the greedy algorithm would have selected $latex e$ first. Moreover, $latex e' \in M \setminus M^*$: if $latex e' \in M^*$, then since $latex e \in M^*$, the matching $latex M^*$ would contain two edges incident on the same vertex, a contradiction.

So each $latex e \in M^* \setminus M$ can be charged to an edge $latex e' \in M \setminus M^*$ with $latex w(e') \geq w(e)$. The same edge $latex e'$ can be charged at most twice when traversing all such $latex e$ (one endpoint of $latex e'$ may touch an edge $latex e_1 \in M^* \setminus M$, and the other endpoint may touch another edge $latex e_2 \in M^* \setminus M$). Therefore, $latex \sum\limits_{e \in M^* \setminus M} w(e) \leq 2 \sum\limits_{e' \in M \setminus M^*} w(e')$. Trivially, under the assumption that all weights are positive, $latex \sum\limits_{e \in M \cap M^*} w(e) \leq 2 \sum\limits_{e \in M \cap M^*} w(e)$. Adding the two inequalities together, we get $latex \sum\limits_{e \in M^*} w(e) \leq 2 \sum\limits_{e' \in M} w(e')$.

The greedy algorithm is not directly applicable to maximum weighted perfect matching because it is not guaranteed to find a perfect matching. (A perfect matching is one in which every vertex is incident on exactly one edge in the matching.) Provided that we have the following inequalities:

optimal result for maximum weighted matching >= optimal result for maximum weighted perfect matching

1/2 * optimal result for maximum weighted matching <= greedy result for maximum weighted matching

we conclude that the greedy result for maximum weighted matching can loosely be used in place of maximum weighted perfect matching, if a perfect matching is not strictly required. This is most appealing on a complete graph: in a complete graph with an even number of vertices, the greedy method does return a perfect matching.


The greedy algorithm takes $latex O(m\log m)$ time to sort the edge weights ($latex m=|E|$). In a complete graph, where $latex m=\Theta(n^2)$ ($latex n=|V|$), the greedy algorithm therefore takes $latex O(n^2 \log n)$, a little better than existing $latex O(n^3)$ methods (such as this) for maximum weighted perfect matching. There is also a $latex (1-\epsilon)$-approximation method running in $latex O(m \epsilon^{-1} \log \epsilon^{-1})$ time. However, that method is much more complicated.

Theano LSTM Code Walk Through

In this post, I am going to explain (as much as I can) the code from the Theano LSTM tutorial: http://deeplearning.net/tutorial/lstm.html

You need to understand LSTM first. A recommended online resource is http://colah.github.io/posts/2015-08-Understanding-LSTMs/, which provides many beautiful figures illustrating LSTM step by step.

The tutorial aims to predict positive/negative sentiment from movie review texts. Essentially, it uses an RNN with LSTM units to accomplish a supervised learning task. The basic idea is as follows:

  1. Use vectors to represent words and convert each review into a `num_words` $latex \times$ `word_emb_dim` matrix. Representing words as vectors is a technique called word embedding. You can control model complexity by setting the value of `word_emb_dim`; in the Theano tutorial, the default `word_emb_dim` is 128.
  2. It uses mini-batches for learning. Therefore, you need to assemble multiple review matrices from step 1 into a 3-dimensional tensor of shape `num_samples` $latex \times$ `max_num_words` $latex \times$ `word_emb_dim`, where `max_num_words` is the maximum number of words among the samples in the mini-batch.
  3. It feeds the assembled mini-batch tensor into the RNN model. In the RNN, the number of hidden units is by default 128, the same as the default `word_emb_dim`. I am not sure whether setting the number of hidden units equal to the word embedding dimension is a recommended practice.
  4. After the tensor passes through the RNN, the model produces two tensors: the stacked hidden unit vectors of shape `num_samples` $latex \times$ `max_num_words` $latex \times$ `num_hidden_units`, and the stacked memory cell outputs of the same shape. Only the stacked hidden unit vectors are needed for the next step.
  5. It applies mean pooling to the hidden unit outputs along the word axis. Specifically, it averages each component of the hidden unit vectors emitted for all the words in one review. Mean pooling yields a matrix of shape `num_samples` $latex \times$ `num_hidden_units`, one dimension fewer than the stacked hidden unit vectors from step 4.
  6. Finally, the matrix from step 5 is fed into a logistic regression model. The number of LR weights matches `num_hidden_units`. The cost (objective function) is the usual negative log-likelihood of logistic regression. (A shape-level sketch of this whole pipeline follows the list.)
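To make the shapes concrete, here is a small numpy sketch of steps 1-6 (my own illustration with random weights, not the tutorial's code; the recurrent LSTM is replaced by a placeholder that merely produces hidden vectors of the right shape):

import numpy as np

num_samples, max_num_words, word_emb_dim, num_hidden_units = 16, 100, 128, 128

# Steps 1-2: a mini-batch of embedded reviews.
batch = np.random.randn(num_samples, max_num_words, word_emb_dim)

# Steps 3-4: stand-in for the RNN/LSTM; the real model computes these recurrently.
hidden = np.tanh(batch @ np.random.randn(word_emb_dim, num_hidden_units))
# hidden.shape == (num_samples, max_num_words, num_hidden_units)

# Step 5: mean pooling along the word axis.
pooled = hidden.mean(axis=1)            # (num_samples, num_hidden_units)

# Step 6: logistic regression over the pooled features (2 classes).
U = np.random.randn(num_hidden_units, 2)
logits = pooled @ U
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.shape)                      # (16, 2)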

Now, let's look at the actual code from the Theano tutorial. In most cases I keep the default parameters as provided.

The main function as entrance:

if __name__ == '__main__':
    # See the train_lstm function for all possible parameters and their definitions.
    train_lstm(
        max_epochs=100,
        test_size=500,
    )

The core part of training `train_lstm`:

def train_lstm(
    dim_proj=128,  # word embedding dimension and number of LSTM hidden units.
    patience=10,  # Number of epochs to wait before early stopping if no progress
    max_epochs=5000,  # The maximum number of epochs to run
    dispFreq=10,  # Display the training progress to stdout every N updates
    decay_c=0.,  # Weight decay for the classifier applied to the U weights.
    lrate=0.0001,  # Learning rate for sgd (not used for adadelta and rmsprop)
    n_words=10000,  # Vocabulary size
    optimizer=adadelta,  # sgd, adadelta and rmsprop available; sgd is very hard to use, not recommended (probably needs momentum and a decaying learning rate).
    encoder='lstm',  # TODO: can be removed, must be lstm.
    saveto='lstm_model.npz',  # The best model will be saved there
    validFreq=370,  # Compute the validation error after this number of updates.
    saveFreq=1110,  # Save the parameters after every saveFreq updates
    maxlen=100,  # Sequences longer than this get ignored
    batch_size=16,  # The batch size during training.
    valid_batch_size=64,  # The batch size used for the validation/test set.
    dataset='imdb',

    # Parameters for extra options
    noise_std=0.,
    use_dropout=True,  # if False, slightly faster, but worse test error
                       # This frequently needs a bigger model.
    reload_model=None,  # Path to a saved model we want to start from.
    test_size=-1,  # If >0, we keep only this number of test examples.
):

    # Model options
    model_options = locals().copy()
    print("model options", model_options)

    load_data, prepare_data = get_dataset(dataset)

    print('Loading data')
    train, valid, test = load_data(n_words=n_words, valid_portion=0.05,
                                   maxlen=maxlen)
    if test_size > 0:
        # The test set is sorted by size, but we want to keep randomly
        # sized examples.  So we must select a random subset of the
        # examples.
        idx = numpy.arange(len(test[0]))
        numpy.random.shuffle(idx)
        idx = idx[:test_size]
        test = ([test[0][n] for n in idx], [test[1][n] for n in idx])

    ydim = numpy.max(train[1]) + 1

    model_options['ydim'] = ydim

    print('Building model')
    # This creates the initial parameters as numpy ndarrays.
    # Dict: name (string) -> numpy ndarray
    params = init_params(model_options)

    if reload_model:
        load_params('lstm_model.npz', params)

    # This creates Theano shared variables from the parameters.
    # Dict: name (string) -> Theano tensor shared variable
    # params and tparams hold different copies of the weights.
    tparams = init_tparams(params)

    # use_noise is for dropout
    (use_noise, x, mask,
     y, f_pred_prob, f_pred, cost) = build_model(tparams, model_options)

    if decay_c > 0.:
        decay_c = theano.shared(numpy_floatX(decay_c), name='decay_c')
        weight_decay = 0.
        weight_decay += (tparams['U'] ** 2).sum()
        weight_decay *= decay_c
        cost += weight_decay

    f_cost = theano.function([x, mask, y], cost, name='f_cost')

    grads = tensor.grad(cost, wrt=list(tparams.values()))
    f_grad = theano.function([x, mask, y], grads, name='f_grad')

    lr = tensor.scalar(name='lr')
    f_grad_shared, f_update = optimizer(lr, tparams, grads,
                                        x, mask, y, cost)

    print('Optimization')

    kf_valid = get_minibatches_idx(len(valid[0]), valid_batch_size)
    kf_test = get_minibatches_idx(len(test[0]), valid_batch_size)

    print("%d train examples" % len(train[0]))
    print("%d valid examples" % len(valid[0]))
    print("%d test examples" % len(test[0]))

    history_errs = []
    best_p = None
    bad_counter = 0

    if validFreq == -1:
        validFreq = len(train[0]) // batch_size
    if saveFreq == -1:
        saveFreq = len(train[0]) // batch_size

    uidx = 0  # the number of updates done
    estop = False  # early stop
    start_time = time.time()
    try:
        for eidx in range(max_epochs):
            n_samples = 0

            # Get new shuffled index for the training set.
            kf = get_minibatches_idx(len(train[0]), batch_size, shuffle=True)

            for _, train_index in kf:
                uidx += 1
                use_noise.set_value(1.)

                # Select the random examples for this minibatch
                y = [train[1][t] for t in train_index]
                x = [train[0][t] for t in train_index]

                # Get the data in numpy.ndarray format.
                # This swaps the axes!
                # Returns arrays of shape (minibatch maxlen, n samples).
                x, mask, y = prepare_data(x, y)
                n_samples += x.shape[1]

                cost = f_grad_shared(x, mask, y)
                f_update(lrate)

                if numpy.isnan(cost) or numpy.isinf(cost):
                    print('bad cost detected: ', cost)
                    return 1., 1., 1.

                if numpy.mod(uidx, dispFreq) == 0:
                    print('Epoch ', eidx, 'Update ', uidx, 'Cost ', cost)

                if saveto and numpy.mod(uidx, saveFreq) == 0:
                    print('Saving...')

                    if best_p is not None:
                        params = best_p
                    else:
                        params = unzip(tparams)
                    numpy.savez(saveto, history_errs=history_errs, **params)
                    pickle.dump(model_options, open('%s.pkl' % saveto, 'wb'), -1)
                    print('Done')

                if numpy.mod(uidx, validFreq) == 0:
                    use_noise.set_value(0.)
                    train_err = pred_error(f_pred, prepare_data, train, kf)
                    valid_err = pred_error(f_pred, prepare_data, valid,
                                           kf_valid)
                    test_err = pred_error(f_pred, prepare_data, test, kf_test)

                    history_errs.append([valid_err, test_err])

                    if (best_p is None or
                        valid_err <= numpy.array(history_errs)[:, 0].min()):

                        best_p = unzip(tparams)
                        bad_counter = 0

                    print( ('Train ', train_err, 'Valid ', valid_err,
                           'Test ', test_err) )

                    if (len(history_errs) > patience and
                        valid_err >= numpy.array(history_errs)[:-patience, 0].min()):
                        bad_counter += 1
                        if bad_counter > patience:
                            print('Early Stop!')
                            estop = True
                            break

            print('Seen %d samples' % n_samples)

            if estop:
                break

    except KeyboardInterrupt:
        print("Training interupted")

    end_time = time.time()
    if best_p is not None:
        zipp(best_p, tparams)
    else:
        best_p = unzip(tparams)

    use_noise.set_value(0.)
    kf_train_sorted = get_minibatches_idx(len(train[0]), batch_size)
    train_err = pred_error(f_pred, prepare_data, train, kf_train_sorted)
    valid_err = pred_error(f_pred, prepare_data, valid, kf_valid)
    test_err = pred_error(f_pred, prepare_data, test, kf_test)

    print( 'Train ', train_err, 'Valid ', valid_err, 'Test ', test_err )
    if saveto:
        numpy.savez(saveto, train_err=train_err,
                    valid_err=valid_err, test_err=test_err,
                    history_errs=history_errs, **best_p)
    print('The code run for %d epochs, with %f sec/epochs' % (
        (eidx + 1), (end_time - start_time) / (1. * (eidx + 1))))
    print( ('Training took %.1fs' %
            (end_time - start_time)), file=sys.stderr)
    return train_err, valid_err, test_err

line 61: initialize all the parameters to be learned as Theano shared variables. I print out the shape of each parameter (under the default settings):

| Parameter name | Note | Shape |
|----------------|------|-------|
| Wemb | Word embedding; every word is represented by a vector stored as a row. | (10000, 128) |
| lstm_W | Stacked LSTM weights that connect the input $latex X$: $latex W_i, W_f, W_o, W_c$. Stacking them allows efficient learning, as suggested in the tutorial; each weight is a 128×128 square matrix. | (128, 512) |
| lstm_U | Stacked LSTM weights that connect the previous hidden units: $latex U_i, U_f, U_o, U_c$. Each weight is a 128×128 square matrix. | (128, 512) |
| lstm_b | The biases inside the LSTM unit. | (512,) |
| U | The weights of the logistic regression. It contains two sets of weights, one for predicting positive and the other for negative. | (128, 2) |
| b | The biases of the logistic regression. | (2,) |
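The stacking means one matrix multiplication computes the pre-activations of all four gates at once; the `_slice` helper shown later then cuts the result into four 128-wide chunks. A small numpy sketch of the idea (my own illustration, with hypothetical per-gate weights):

import numpy as np

dim = 128
# Hypothetical per-gate weights, stacked column-wise as in lstm_W / lstm_U.
W_i, W_f, W_o, W_c = (np.random.randn(dim, dim) for _ in range(4))
lstm_W = np.concatenate([W_i, W_f, W_o, W_c], axis=1)   # (128, 512)

x = np.random.randn(16, dim)         # a batch of 16 input vectors
preact = x @ lstm_W                  # one multiply for all gates: (16, 512)

def _slice(m, n, dim):
    return m[:, n * dim:(n + 1) * dim]

i_pre = _slice(preact, 0, dim)       # equals x @ W_i
assert np.allclose(i_pre, x @ W_i)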


line 65: `build_model` is the crux of building the LSTM using Theano.

def build_model(tparams, options):
    trng = RandomStreams(SEED)

    # Used for dropout.
    use_noise = theano.shared(numpy_floatX(0.))

    x = tensor.matrix('x', dtype='int64')
    mask = tensor.matrix('mask', dtype=config.floatX)
    y = tensor.vector('y', dtype='int64')

    n_timesteps = x.shape[0]
    n_samples = x.shape[1]

    emb = tparams['Wemb'][x.flatten()].reshape([n_timesteps,
                                                n_samples,
                                                options['dim_proj']])
    # proj = get_layer(options['encoder'])[1](tparams, emb, options,
    proj = lstm_layer(tparams, emb, options, mask=mask)
    # if options['encoder'] == 'lstm':
    proj = (proj * mask[:, :, None]).sum(axis=0)
    proj = proj / mask.sum(axis=0)[:, None]
        
    if options['use_dropout']:
        proj = dropout_layer(proj, use_noise, trng)

    pred = tensor.nnet.softmax(tensor.dot(proj, tparams['U']) + tparams['b'])

    f_pred_prob = theano.function([x, mask], pred, name='f_pred_prob')
    f_pred = theano.function([x, mask], pred.argmax(axis=1), name='f_pred')

    off = 1e-8
    if pred.dtype == 'float16':
        off = 1e-6

    cost = -tensor.log(pred[tensor.arange(n_samples), y] + off).mean()

    return use_noise, x, mask, y, f_pred_prob, f_pred, cost

line 14-16: assemble the word embedding tensor of shape `n_timesteps` $latex \times$ `n_samples` $latex \times$ `dim_proj`. `n_timesteps` corresponds to `max_num_words` in step 2, and `dim_proj` corresponds to both `word_emb_dim` and `num_hidden_units`. This name mapping persists from here on.

line 17-21: I commented out some lines from the original code and rewrote them as more succinct expressions.

line 18: `proj` is the stacked hidden unit vectors, of shape `n_timesteps` $latex \times$ `n_samples` $latex \times$ `dim_proj`. (The order of dimensions differs from the one introduced in step 4.)

line 20-21: the mean pooling from step 5. `mask` is a binary matrix of shape `n_timesteps` $latex \times$ `n_samples`, in which 1 denotes that a word exists at that position; it lets reviews of different lengths share one tensor.
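Here is a tiny numpy sketch (my own, not from the tutorial) of what those two lines compute for a toy batch, so the broadcasting is visible:

import numpy as np

n_timesteps, n_samples, dim_proj = 4, 2, 3
proj = np.random.randn(n_timesteps, n_samples, dim_proj)
# Sample 0 has 4 words, sample 1 has only 2 words.
mask = np.array([[1., 1.],
                 [1., 1.],
                 [1., 0.],
                 [1., 0.]])

pooled = (proj * mask[:, :, None]).sum(axis=0)   # zero out padding, sum over words
pooled = pooled / mask.sum(axis=0)[:, None]      # divide by each sample's true length
print(pooled.shape)                              # (2, 3): n_samples x dim_proj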

Now let's look at line 18, `proj = lstm_layer(tparams, emb, options, mask=mask)`:

def lstm_layer(tparams, state_below, options, mask=None):
    nsteps = state_below.shape[0]
    dim_proj = options['dim_proj']
    n_samples = state_below.shape[1]

    # if state_below.ndim == 3:
    # else:
    #    n_samples = 1

    assert mask is not None

    def _slice(_x, n, dim):
        # if _x.ndim == 3:
        #    return _x[:, :, n * dim:(n + 1) * dim]
        return _x[:, n * dim:(n + 1) * dim]

    def _step(m_, x_, h_, c_):
        preact = tensor.dot(h_, tparams['lstm_U'])
        preact += x_

        i = tensor.nnet.sigmoid(_slice(preact, 0, dim_proj))
        f = tensor.nnet.sigmoid(_slice(preact, 1, dim_proj))
        o = tensor.nnet.sigmoid(_slice(preact, 2, dim_proj))
        c = tensor.tanh(_slice(preact, 3, dim_proj))

        c = f * c_ + i * c
        c = m_[:, None] * c + (1. - m_)[:, None] * c_

        h = o * tensor.tanh(c)
        h = m_[:, None] * h + (1. - m_)[:, None] * h_

        return h, c

    state_below = (tensor.dot(state_below, tparams['lstm_W']) +
                   tparams['lstm_b'])

    rval, updates = theano.scan(_step,
                                sequences=[mask, state_below],
                                outputs_info=[tensor.alloc(numpy_floatX(0.),
                                                           n_samples,
                                                           dim_proj),
                                              tensor.alloc(numpy_floatX(0.),
                                                           n_samples,
                                                           dim_proj)],
                                name='lstm_layers',
                                n_steps=nsteps)

    return rval[0]  # hidden units outputs h: n_timesteps * n_samples * dim

`lstm_layer` uses `theano.scan` to loop through the input sequence. It returns only `rval[0]` because the upper-level logistic regression needs only the hidden units' outputs `h`, not the memory cells `c`.
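If `theano.scan` is unfamiliar: it is Theano's symbolic loop. A minimal cumulative-sum example (my own, unrelated to the LSTM, assuming Theano is installed) shows the same pattern of `sequences` and `outputs_info` used above:

import numpy
import theano
import theano.tensor as tensor

x = tensor.vector('x')

# At each step, scan feeds one element of x plus the previous step's output.
results, updates = theano.scan(
    fn=lambda x_t, acc: acc + x_t,
    sequences=x,
    outputs_info=tensor.as_tensor_variable(numpy.asarray(0., dtype='float64')))

cumsum = theano.function([x], results)
print(cumsum(numpy.array([1., 2., 3.])))  # [ 1.  3.  6.]

In `lstm_layer`, `sequences=[mask, state_below]` supplies `m_` and `x_` at each step, while the two `tensor.alloc` zero tensors in `outputs_info` initialize the recurrent arguments `h_` and `c_`.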

NLP datasets

  1. Twitter Sentiment Analysis: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
  2. Topic classification for news (including Reuters, NewsGroup): http://disi.unitn.it/moschitti/corpora.htm
  3. Movie reviews: http://www.cs.cornell.edu/People/pabo/movie-review-data/
  4. Other reviews: http://www.text-analytics101.com/2011/07/user-review-datasets_20.html
  5. Twitter Evaluation dataset: http://tweenator.com/index.php?page_id=13
  6. Amazon review: https://snap.stanford.edu/data/web-Amazon.html
  7. Amazon review (upon request): https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
  8. opinmind: https://inclass.kaggle.com/c/si650winter11/data
  9. Large movie reviews: http://ai.stanford.edu/~amaas/data/sentiment/

Overview for Sequential Data Learning

Hidden Markov Model

You should bear clearly in mind the three questions people usually ask about Hidden Markov Models:

1. What is the probability of an observed sequence? (answered by the forward algorithm)

2. What is the most likely sequence of hidden states given a specific observed sequence? (the Viterbi algorithm)

3. Given a set of observation sequences, what are the values of the state transition probabilities and the output emission probabilities? (the Baum-Welch/EM algorithm)
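As an example of question 1, here is a minimal numpy sketch (my own, not from the references) of the forward algorithm, which computes the probability of an observed sequence:

import numpy as np

def forward(obs, pi, A, B):
    """P(obs) for an HMM with initial distribution pi (K,), transition
    matrix A (K, K), and emission matrix B (K, num_symbols);
    obs is a list of observed symbol indices."""
    alpha = pi * B[:, obs[0]]            # alpha_1(k) = pi_k * B_k(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # alpha_t = (alpha_{t-1} A) * B(o_t)
    return alpha.sum()

# Toy 2-state, 2-symbol HMM.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(forward([0, 1, 0], pi, A, B))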

References:

http://cs229.stanford.edu/section/cs229-hmm.pdf

http://mlg.eng.cam.ac.uk/zoubin/papers/ijprai.pdf


Right way to put test codes in a Python project

I have struggled for a long time with where to put test files in a Python project. Ideally, it is cleanest to create a folder called "test" with all test files in it. However, test files nested in the test folder need to import modules from the parent folder, and importing a Python module from a parent directory is troublesome: you can't just write `import ../module`. Here are two correct ways to keep test files in a test folder nested in a Python project:

  1. Don't create __init__.py in the root directory, and run `nosetests` from the root directory: http://stackoverflow.com/questions/3073259/python-nose-import-error
  2. Use a helper file that adds the parent directory to the Python path (a sketch follows): http://stackoverflow.com/a/23386287/1758727
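For option 2, the helper can be as small as the following sketch (the file and module names here are hypothetical; each test file then does `from context import mymodule`):

# test/context.py -- makes the project root importable from the test folder.
import os
import sys

sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

import mymodule  # hypothetical module living in the project root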

Jupyter Parallelism Tutorial

In this post, I am going to introduce my favorite way to make cells in a Jupyter notebook run in parallel.

1. Initialize a cluster from the command line, or use Python's `Popen` (in the example below, I create a cluster with 2 workers):

from subprocess import Popen
p = Popen(['ipcluster', 'start', '-n', '2'])

2. Then, programmatically set up the client. I usually make each worker non-blocking so that even while workers are running heavy tasks I can still experiment in other cells:

from ipyparallel import Client
rc = Client()
dview = rc[:]
print "%d workers are running" % len(rc)

e0 = rc[0]
e0.block = False
e0.activate('0')
e1 = rc[1]
e1.block = False
e1.activate('1')

3. Then, to run a cell on a specific worker, add the magic command `%%px0` as the first line of the cell. Here `0` means you designate the cell to run on the first worker; you can replace `0` with any number as long as it is within the number of workers (these suffixed magics exist because of the `activate('0')` and `activate('1')` calls above). Here is one example:

%%px1
# An example of asynchronous parallelism.
# df is assumed to have been pushed to this worker beforehand.
print(len(df))
import time
c = 0
print(time.time())
while c < 3:
    time.sleep(3)
    c += 1
print(time.time())

4. You can think of each cell starting with `%%px[num]` as an independent workspace. Therefore, you need to explicitly import any modules or push any data objects you want to use within the parallel cells. For example, in the cell above, I must write `import time` in order to use the `time` module. The alternative is to import modules and push data programmatically:

# Push data to all workers; the objects must be passed as a dict.
dview.push({'churn':churn, 'data_end_date':data_end_date, 'CHURN_INT':CHURN_INT})

# import any modules you want
with dview.sync_imports():
    import sys

5. Finally, to get the results, use `%pxresult0`. (Similarly, replace `0` with the number of the worker you want.)

%pxresult0

Note that `%pxresult0` blocks if the result is not ready yet. If you want to experiment in other cells, don't run `%pxresult0` too early.


Reference:

http://minrk.github.io/drop/nbconverttest.html

https://ipython.org/ipython-doc/3/parallel/magics.html (for old ipython notebook)

http://ipyparallel.readthedocs.io/en/latest/ (for jupyter)