Collection of statistical hypothesis tests

This post collects several hypothesis-test methodologies. The full catalogue is listed here: http://www.biostathandbook.com/testchoice.html. I only go over the tests that are relevant to my research.

 

One-way ANOVA: http://www.biostathandbook.com/onewayanova.html

If you have one measurement variable and one nominal variable, and the nominal variable separates subjects into multiple groups, one-way ANOVA tests whether the means of the measurement variable are the same across the groups. The measurement variable is sometimes called the dependent variable, and the nominal variable the independent variable.
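As a quick illustration, a one-way ANOVA can be run with SciPy's `f_oneway`; the three groups below are made-up numbers, not data from the handbook:

```python
# One-way ANOVA: one measurement variable observed in three groups
# defined by a nominal variable (made-up numbers for illustration).
from scipy import stats

group_a = [6.1, 5.8, 6.4, 6.0, 5.9]
group_b = [6.9, 7.2, 6.8, 7.0, 7.1]
group_c = [6.0, 6.2, 5.9, 6.1, 6.3]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # a small p-value suggests the group means differ
```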

 

Two-way ANOVA: http://www.biostathandbook.com/twowayanova.html

When you have one measurement variable but two nominal variables, you need two-way ANOVA. It tests whether the means of the measurement variable are equal across the values of the first nominal variable, and across the values of the second nominal variable. Additionally, it tests whether there is an interaction between the two nominal variables, i.e., whether the effect of one nominal variable on the means depends on the value of the other.
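As far as I know SciPy has no built-in two-way ANOVA (statsmodels' `anova_lm` is the usual tool), but for a balanced design the sums of squares can be computed directly. The following is my own hand-rolled numpy/scipy sketch on made-up data, where factor A shifts the mean and factor B does nothing:

```python
# A sketch of balanced two-way ANOVA with interaction, computed by hand.
# Y has shape (a, b, r): a levels of factor A, b of factor B, r replicates.
import numpy as np
from scipy import stats

def two_way_anova(Y):
    a, b, r = Y.shape
    grand = Y.mean()
    mA = Y.mean(axis=(1, 2))            # means per level of A
    mB = Y.mean(axis=(0, 2))            # means per level of B
    mAB = Y.mean(axis=2)                # cell means
    ss_A = b * r * ((mA - grand) ** 2).sum()
    ss_B = a * r * ((mB - grand) ** 2).sum()
    ss_AB = r * ((mAB - mA[:, None] - mB[None, :] + grand) ** 2).sum()
    ss_E = ((Y - mAB[:, :, None]) ** 2).sum()
    df_A, df_B, df_AB, df_E = a - 1, b - 1, (a - 1) * (b - 1), a * b * (r - 1)
    ms_E = ss_E / df_E
    out = {}
    for name, ss, df in [("A", ss_A, df_A), ("B", ss_B, df_B), ("AB", ss_AB, df_AB)]:
        F = (ss / df) / ms_E
        out[name] = (F, stats.f.sf(F, df, df_E))   # F statistic and p-value
    return out

# Toy data: factor A clearly shifts the mean, factor B does nothing.
i, j, k = np.meshgrid(np.arange(3), np.arange(2), np.arange(4), indexing="ij")
Y = 2.0 * i + 0.1 * k
res = two_way_anova(Y)
print(res)   # A: significant; B and the interaction: not
```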

 

MANOVA: https://www.researchgate.net/post/What_is_the_difference_between_ANOVA_MANOVA

When you have multiple measurement variables, you can use MANOVA to test whether the measurement variables are influenced by one or more nominal variables simultaneously.

 

Paired t-test (dependent t-test): http://www.biostathandbook.com/pairedttest.html

It tests whether the means of two paired populations differ significantly. (What are paired and unpaired populations? See the examples in the handbook link above.) It assumes the pairwise differences are normally distributed.
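A minimal SciPy sketch with made-up before/after measurements on the same eight subjects:

```python
# Paired t-test: the same eight subjects measured before and after a
# treatment (made-up numbers). It tests whether the mean difference is 0.
from scipy import stats

before = [72, 68, 75, 71, 69, 74, 70, 73]
after  = [70, 65, 72, 70, 66, 71, 69, 70]

t_stat, p_value = stats.ttest_rel(before, after)
print(t_stat, p_value)  # small p-value: the paired means differ
```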

 

Wilcoxon signed-rank test: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test

It is a non-parametric test with the same goal as the paired t-test: compare two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ, in other words, whether the rankings of the two populations differ significantly. Unlike the paired t-test, it does not assume that the pairwise differences are normally distributed; it works with the ranks of the absolute differences.
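The same kind of paired comparison, non-parametrically, with SciPy (again made-up data):

```python
# Wilcoxon signed-rank test: non-parametric counterpart of the paired
# t-test; it ranks the absolute differences instead of assuming normality.
from scipy import stats

before = [10.5, 12.1, 13.2, 14.8, 15.9, 17.3, 18.4, 19.6]
after  = [ 9.4, 10.2, 10.9, 12.1, 12.8, 13.9, 14.7, 15.5]

w_stat, p_value = stats.wilcoxon(before, after)
print(w_stat, p_value)  # every difference is positive, so p is small
```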

 

Sign test: https://en.wikipedia.org/wiki/Sign_test

Use it when only the signs (directions) of the paired differences matter and the magnitudes of the differences don't.
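Since the statistic is just the number of positive differences compared against a Binomial(n, 1/2), the sign test is easy to sketch with the standard library alone (my own toy implementation, not a library API):

```python
# The sign test, sketched from scratch: count positive paired differences
# and compare against Binomial(n, 0.5). Only the signs matter, not the
# magnitudes of the differences.
from math import comb

def sign_test(pairs):
    """Two-sided sign test; ties (zero differences) are dropped."""
    diffs = [a - b for a, b in pairs if a != b]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)   # number of positive signs
    # one tail: P(X <= min(k, n-k)) for X ~ Binomial(n, 1/2); double it
    m = min(k, n - k)
    p = sum(comb(n, i) for i in range(0, m + 1)) / 2 ** n
    return min(1.0, 2 * p)

pairs = list(zip([72, 68, 75, 71, 69, 74, 70, 73],
                 [70, 65, 72, 70, 66, 71, 69, 70]))
print(sign_test(pairs))  # all 8 differences are positive: p = 2/256
```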

 

Kendall rank correlation coefficient (Kendall $latex \tau$): https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient

It tests the rank correlation (association) between two measured quantities. See one usage in “Deep Neural Networks for Optimal Team Composition”.

 

Spearman’s rank correlation coefficient: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

Similar to the Kendall rank correlation coefficient.
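Both rank correlations are one call in SciPy; a small sketch on made-up scores:

```python
# Kendall's tau and Spearman's rho: two rank-correlation coefficients
# between two measured quantities (made-up scores).
from scipy import stats

x = [1.2, 2.4, 3.1, 4.8, 5.5, 6.0]
y = [2.0, 3.5, 3.0, 6.1, 7.2, 8.0]

tau, p_tau = stats.kendalltau(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(tau, rho)  # both close to 1: the rankings agree strongly
```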

 

One-way chi-square test: http://www.okstate.edu/ag/agedcm4h/academic/aged5980a/5980/newpage28.htm

It tests whether the observed counts falling into each category match the expected frequencies.
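A SciPy sketch on a made-up example: are 120 rolls of a die consistent with fairness?

```python
# One-way chi-square (goodness-of-fit): do observed counts per category
# match the expected frequencies? Here: 120 made-up rolls of a die.
from scipy import stats

observed = [18, 22, 16, 25, 20, 19]   # counts of each face
expected = [20, 20, 20, 20, 20, 20]   # a fair die: 120 / 6 per face

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(chi2, p_value)  # large p-value: no evidence the die is unfair
```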

 

More non-parametric statistical methods: https://en.wikipedia.org/wiki/Nonparametric_statistics#Methods

 

Factor Analysis and PCA

I’ve been scratching my head for a day (06/24/2015), but I am still baffled by the concept of Factor Analysis. So first of all, let me list what I’ve learned so far today.

PCA and Factor Analysis can be deemed similar in that both are dimensionality-reduction techniques. I’ve talked about PCA in an old post. But it cannot be overemphasized how often people misunderstand and misuse PCA. Several good questions about PCA have been raised: apply PCA on the covariance matrix or the correlation matrix? More on that here and here.

 

By far the most sensible writing about Factor Analysis is on Wikipedia: https://en.wikipedia.org/wiki/Factor_analysis. Remember the following model behind Factor Analysis:

$latex Z = LF + \epsilon$, where $latex E(F) = 0$, $latex Cov(F) = I$, and $latex Cov(\epsilon) = \Psi$ is diagonal.

 

There has been a long debate over the differences between PCA and Factor Analysis. My own point is that the objective functions they try to optimize are different. Now let’s discuss an example within a specific context. Suppose we have a normalized data set $latex Z_{p \times n}$. ($latex n$ is the number of data points; $latex p$ is the number of features.) The loading and latent factor scores of Factor Analysis on $latex Z_{p \times n}$ are $latex L_{p \times k}$ and $latex F_{k \times n}$. ($latex k$ is the number of latent factors.) The error term is $latex \epsilon_{p \times n}$. $latex Cov(\epsilon) = Cov(\epsilon_1, \epsilon_2, \cdots, \epsilon_p)=Diag(\Psi_1, \Psi_2, \cdots, \Psi_p) = \Psi$. ($latex \epsilon_i$ is a row vector of $latex \epsilon_{p \times n}$; the error of each feature is independently distributed.) Hence we have: $latex Z=LF+\epsilon$

Then,

$latex Cov(Z) = Cov(LF + \epsilon) = Cov(LF) + Cov(\epsilon) = L\,Cov(F)\,L^T + \Psi = LL^T + \Psi$. Here we use the property $latex Cov(F)=I$, because our assumption in Factor Analysis is that the latent factor scores are uncorrelated and normalized.

 

Now we apply PCA on the same data set $latex Z_{p \times n}$. PCA aims to find a linear transformation $latex P_{k \times p}$ so that $latex PZ=Y$, where $latex Cov(Y)$ should be a diagonal matrix. But there is no requirement that $latex Cov(Y)=I$. Let’s say there exists a matrix $latex Q$ s.t. $latex QP=I$. Then we have:

$latex Cov(Z) = Cov(QY) = Q\,Cov(Y)\,Q^T$. But don’t forget that in PCA we impose $latex PP^T=I$. So actually $latex Q=P^T$. In other words, $latex Cov(Z)=P^T Cov(Y) P$.

 

From the comparison above, we can see that only when $latex Cov(Y) \approx I$ in PCA and $latex \Psi \approx 0$ in FA can the loadings of PCA and FA be similar. Therefore I don’t agree with the post here: http://stats.stackexchange.com/questions/123063/is-there-any-good-reason-to-use-pca-instead-of-efa/123136#123136

 

 

P.S. The solution for $latex L$ and $latex F$ in either FA or PCA is not unique. Take FA for example: if you have already found such $latex L$ and $latex F$, and you have an orthogonal matrix $latex Q$ s.t. $latex QQ^T = I$, then $latex Z=LF + \epsilon = LQQ^TF + \epsilon = (LQ)(Q^TF) + \epsilon = L’F’ + \epsilon$. Or you can always set $latex L’ = -L$ and $latex F’ = -F$ so that $latex Z=LF+\epsilon = (-L)\cdot(-F)+\epsilon$. Intuitively, this says we can always flip the factors and negate the loadings to depict the same data.
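The PCA half of this comparison is easy to verify numerically. Below is a small numpy sketch (my own illustration, not an FA fit) in which the rows of $latex P$ are the unit eigenvectors of $latex Cov(Z)$, so that $latex Cov(Y)$ comes out diagonal and $latex Cov(Z)=P^T Cov(Y) P$:

```python
# A numerical check of the PCA identity Cov(Z) = P^T Cov(Y) P, where the
# rows of P are the unit eigenvectors of Cov(Z) (so P P^T = I).
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 500))          # p=4 features, n=500 points
Z = Z - Z.mean(axis=1, keepdims=True)      # center each feature

cov_Z = np.cov(Z)                          # p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov_Z)
P = eigvecs.T                              # rows = principal directions

Y = P @ Z                                  # transformed data
cov_Y = np.cov(Y)                          # should be (nearly) diagonal

off_diag = cov_Y - np.diag(np.diag(cov_Y))
print(np.abs(off_diag).max())              # close to zero
print(np.allclose(cov_Z, P.T @ cov_Y @ P)) # Cov(Z) = P^T Cov(Y) P
```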

Unclean uninstallation of fcitx causes login loop on Ubuntu

Whether you realize it or not, fcitx, a popular input method, can cause the notorious login loop issue on Ubuntu if it is not properly uninstalled (i.e., every time you type your password and press Enter, you are asked for your password again on the same login screen). Here is a post on how to fully uninstall fcitx, as well as Sogou Pinyin, on Ubuntu.

http://jingyan.baidu.com/article/9faa723154c3dc473d28cb41.html

 

You may also encounter the “fcitx-skin-sogou” problem afterwards. Here is a solution:

http://forum.ubuntu.org.cn/viewtopic.php?t=416810&p=3000862

 

You may then face the login loop issue, which in this case can be solved by reinstalling ubuntu-session and ubuntu-desktop.

http://askubuntu.com/questions/283985/unable-to-load-session-ubuntu-in-12-04

 

And after many glitches I ended up using IBus. Shiiiiiit.

alsamixer automatically mutes headphones and speakers

http://askubuntu.com/questions/280566/my-headphones-mute-alsamixer-when-i-plug-them-in-hp-dv6-12-04

http://askubuntu.com/questions/541847/is-there-any-way-to-save-alsamixer-settings-other-than-alsactl-store

The two links above are popular resources for dealing with alsamixer’s auto-mute problem. However, neither helped in my situation.

 

My solution

1. Type the following command:

sudo alsamixer

2. Use the `RightArrow` key to navigate right until you see “Auto-Mute”

3. Use the `UpArrow` or `DownArrow` key to set it to “Disabled”

(screenshot: alsamixer with Auto-Mute set to Disabled)

4. Press `Esc` to quit. Hopefully the audio won’t be muted after your next reboot!

 

Update 06/08/2017:

The solution is totally different if the problem is that, when you plug in your headphones, both the headphones and the speakers output sound.

  1. In the terminal, open alsamixer.
  2. Use `RightArrow` to navigate to the rightmost control until you see the Auto-Mute option, then use `UpArrow` to select Line Out.
  3. Also make sure the other volume controls look like below.

(screenshot: the alsamixer volume controls after the fix)

 

reference: https://askubuntu.com/questions/150887/sound-from-both-headphones-and-speakers

 

Learn LSTM in RNN

Long Short-Term Memory (LSTM) is claimed to be capable of predicting time series when there are long time lags of unknown size between important events. However, as of June 2015, not many clear tutorials can be found on the Internet. I am going to list a collection of materials I came across. Probably I will write a tutorial myself soon.

Wikipedia: https://en.wikipedia.org/wiki/Long_short_term_memory

Hochreiter & Schmidhuber, 1997. Long Short-Term Memory. http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf. This seems to be the very first LSTM paper in the RNN context. I can’t understand it well, however.

Felix Gers’s PhD thesis. http://www.felixgers.de/papers/phd.pdf. Not very clear though.

The clearest entry-level tutorial for me: http://www.willamette.edu/~gorr/classes/cs449/lstm.html. It illustrates why LSTM is called LSTM.

Alex Graves. 2014. Generating Sequences With Recurrent Neural Networks. http://arxiv.org/pdf/1308.0850v5.pdf. This paper shows how an RNN with LSTM can be used to generate sequences.
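Until I write that tutorial, here is a minimal numpy sketch of a single LSTM cell step, following the standard gate equations; the weight layout and sizes are my own illustrative choices, not code from any of the papers above:

```python
# One LSTM cell step, following the standard gate equations:
#   i = sigmoid(W_i [x; h]); f = sigmoid(W_f [x; h]); o = sigmoid(W_o [x; h])
#   g = tanh(W_g [x; h]);    c' = f * c + i * g;      h' = o * tanh(c')
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One time step. W: (4*H, X+H), b: (4*H,), gates stacked as [i, f, o, g]."""
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c_new = f * c + i * g           # cell state: the "long-term" memory
    h_new = o * np.tanh(c_new)      # hidden state: the exposed output
    return h_new, c_new

rng = np.random.default_rng(0)
X_dim, H_dim = 3, 2
W = rng.standard_normal((4 * H_dim, X_dim + H_dim)) * 0.1
b = np.zeros(4 * H_dim)
h = c = np.zeros(H_dim)
for t in range(5):                  # run a short input sequence through the cell
    h, c = lstm_step(rng.standard_normal(X_dim), h, c, W, b)
print(h.shape, c.shape)
```

The cell state `c` is the additive path that lets gradients survive long lags; the gates decide what to write, keep, and expose.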
 

The relationship between PCA and SVD

Update 07/15/2015:

First, we should know how to calculate covariance matrix and correlation matrix.

According to the Wikipedia page, each entry of the (sample) covariance matrix is calculated as:

$latex cov(X_i, X_j) = \frac{1}{n-1}\sum_{k=1}^{n}(x_{ki}-\bar{x}_i)(x_{kj}-\bar{x}_j)$

And each entry of the correlation matrix is the covariance normalized by the two standard deviations:

$latex cor(X_i, X_j) = \frac{cov(X_i, X_j)}{sd(X_i)\,sd(X_j)}$

 

I am going to use an R example to illustrate the calculation procedures. Let’s say we have an `m1` matrix composed of six columns.

v1 <- c(1,1,1,1,1,1,1,1,1,1,3,3,3,3,3,4,5,6)
v2 <- c(1,2,1,1,1,1,2,1,2,1,3,4,3,3,3,4,6,5)
v3 <- c(3,3,3,3,3,1,1,1,1,1,1,1,1,1,1,5,4,6)
v4 <- c(3,3,4,3,3,1,1,2,1,1,1,1,2,1,1,5,6,4)
v5 <- c(1,1,1,1,1,3,3,3,3,3,1,1,1,1,1,6,4,5)
v6 <- c(1,1,1,2,1,3,3,3,4,3,1,1,1,2,1,6,5,4)
m1 <- cbind(v1,v2,v3,v4,v5,v6)

The covariance matrix of `m1` is:

> cov(m1)
         v1        v2       v3        v4       v5       v6
v1 2.535948 2.3071895 1.300654 1.0849673 1.183007 1.026144
v2 2.307190 2.3790850 1.013072 0.9934641 1.071895 1.052288
v3 1.300654 1.0130719 2.535948 2.2026144 1.300654 1.084967
v4 1.084967 0.9934641 2.202614 2.4869281 1.084967 1.075163
v5 1.183007 1.0718954 1.300654 1.0849673 2.535948 2.379085
v6 1.026144 1.0522876 1.084967 1.0751634 2.379085 2.486928

So, let’s look at `cov(m1)[1,1] = 2.535948` and `cov(m1)[1,2] = 2.3071895`:

# nrow(m1) is the number of data points. 
> (v1-mean(v1)) %*% (v1 - mean(v1) )/ (nrow(m1) - 1)
         [,1]
[1,] 2.535948
> (v1-mean(v1)) %*% (v2 - mean(v2) )/ (nrow(m1) - 1)
        [,1]
[1,] 2.30719

The correlation matrix of `m1` is:

> cor(m1)
          v1        v2        v3        v4        v5        v6
v1 1.0000000 0.9393083 0.5128866 0.4320310 0.4664948 0.4086076
v2 0.9393083 1.0000000 0.4124441 0.4084281 0.4363925 0.4326113
v3 0.5128866 0.4124441 1.0000000 0.8770750 0.5128866 0.4320310
v4 0.4320310 0.4084281 0.8770750 1.0000000 0.4320310 0.4323259
v5 0.4664948 0.4363925 0.5128866 0.4320310 1.0000000 0.9473451
v6 0.4086076 0.4326113 0.4320310 0.4323259 0.9473451 1.0000000

`cor(m1)[1,1]=1.000` and `cor(m1)[1,2]=0.9393083`. Let’s see how they are calculated:

> (v1-mean(v1)) %*% (v1 - mean(v1) )/ (nrow(m1) - 1)/ (sd(v1) * sd(v1))
     [,1]
[1,]    1
> (v1-mean(v1)) %*% (v2 - mean(v2) )/ (nrow(m1) - 1)/ (sd(v1) * sd(v2))
          [,1]
[1,] 0.9393083

Update 07/15/2015. end.

————————————————————————————————————————————————-

 

 

Now to the main topic of this post: what is PCA, and what is the relationship between PCA and SVD?

[gview file=”https://czxttkl.com/wp-content/uploads/2015/06/PCA-SVD-NNMF.pdf” height=”1000px” width=”700px”]
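The core of the PCA–SVD relationship can be checked in a few lines of numpy: for a centered data matrix $latex X_{n \times p} = USV^T$, the columns of $latex V$ are the principal directions and $latex s_i^2/(n-1)$ are the eigenvalues of the covariance matrix. A numerical sketch under that setup:

```python
# PCA via SVD: for centered X (n x p), X = U S Vt, the squared singular
# values divided by (n - 1) equal the eigenvalues of cov(X).
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 3))
X = X - X.mean(axis=0)                       # center the columns

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals_svd = s**2 / (X.shape[0] - 1)        # descending order

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
eigvals = eigvals[::-1]                      # eigh returns ascending order

print(np.allclose(eigvals_svd, eigvals))     # same spectra
```

In practice PCA implementations prefer the SVD route because it avoids explicitly forming the covariance matrix.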

 

 

Two examples of visualizing PCA

The figure below is a sketch of an SVD decomposition. Each matrix represents a transformation of the space, so the transformation represented by M can be decomposed into:

1. first the rotation V

2. then the scaling $latex \Sigma$

3. then the rotation U

In the figure, the starting shape is a circle of unit radius.

(figure: a unit circle transformed step by step by the rotation V, the scaling $latex \Sigma$, and the rotation U)

PCA, as visualized here, tries to minimize the error between the data points and the principal component’s direction:

(figure: data points and their orthogonal projections onto the principal component’s direction)

 

 

Update on 06/25/2015:

The following is an R script I wrote myself:

(figure: the plot produced by the script below)

# A test file related to my post: https://czxttkl.com/?p=339 
# and https://czxttkl.com/?p=397 
remove(list=ls())
data("iris")
log.ir <- log(iris[,1:2])

# center but not scale by myself
log.ir[,1] <- log.ir[,1] - mean(log.ir[,1])
log.ir[,2] <- log.ir[,2] - mean(log.ir[,2])

ir.pca <- prcomp(log.ir, center = FALSE, scale. = FALSE)

# corresponds to p, x, y in my post.
p <- t(ir.pca$rotation)
x <- t(log.ir)
y <- p %*% x

# cov() calculates covariance matrix of the input matrix columns
# cov_y should have non-zero diagonal elements and close-to-zero non-diagonal elements
cov_y <- cov(t(y), t(y))

# Check if my calculation of covariance of y is consistent with cov_y
y_col2 <- t(y)[,2]
cov_y_2_2 <- (y_col2 %*% y_col2) / (length(y_col2) - 1)

# The red lines are the directions of the principal component vectors. They should be orthogonal.
plot(log.ir, xlim=c(-1,1), ylim=c(-1,1))
points(t(y), col="blue")
abline(b=ir.pca$rotation[2,1]/ir.pca$rotation[1,1], a=0, col="red")
abline(b=ir.pca$rotation[2,2]/ir.pca$rotation[1,2], a=0, col="red")

 

 

 

 

Install Theano 7.0 on Windows 8.1 64 bit (Update 2015.6)

Installing Theano is a pain in the ass on Windows 8.1. Here is the breakdown of how I installed it on my Windows 8.1 64-bit machine:

1. Install Python 2.7.x: https://www.python.org/downloads/

2. Install MinGW64: http://sourceforge.net/projects/mingw-w64/.

After the installation of MinGW64, add `install_path_to_mingw64/bin` to your `PATH` variable. A caveat: put the path at the beginning of `PATH` so that Python will not mistakenly pick up other toolkits in `PATH`. Also, go to `PYTHONPATH/Lib/distutils` and create a file named `distutils.cfg` (see the Stack Overflow reference below) with the following content, to make Python use gcc from MinGW64 to compile C code:

[build]

compiler=mingw32

(Ref: http://stackoverflow.com/questions/3297254/how-to-use-mingws-gcc-compiler-when-installing-python-package-using-pip)

3. Install CUDA Toolkit 5.5 or 6.5. Since my machine already had it installed, I can’t go into details about it. Hope you can manage to install it.

4. Install `Numpy`, `Scipy` and `libpython` using `pip install …` or go to http://www.lfd.uci.edu/~gohlke/pythonlibs/ to download installation files and install manually.

5. Install `PyCuda` using `pip install PyCuda`

There could be several common errors you will encounter:

gcc: error: unrecognized command line option ‘-mno-cygwin’: edit `PYTHONPATH/Lib/distutils/cygwinccompiler.py` to remove all instances of `-mno-cygwin`. (Ref: http://stackoverflow.com/questions/6034390/compiling-with-cython-and-mingw-produces-gcc-error-unrecognized-command-line-o)

collect2.exe: error: ld returned 1 exit status: install `libpython` (see step 4). (Ref: https://github.com/Theano/Theano/issues/2087)

g++: error: unrecognized command line option ‘–output-lib’: edit `PYTHONPATH/Lib/distutils/cygwinccompiler.py` to comment out `extra_preargs.extend([“–output-lib”, lib_file])`

6. Install `Theano` using `pip install theano`.

7. Run the test code from Theano. At least you should be able to run it through on the CPU.

 

The post below was the most helpful for mine:

http://pavel.surmenok.com/2014/05/31/installing-theano-with-gpu-on-windows-64-bit/

Replicate the vanishing and exploding gradient problems in Recurrent Neural Network

I’ve talked about the vanishing gradient problem in normal multi-layer neural networks in an old post. Pascanu et al. (the first reference below) particularly discussed the vanishing gradient problem, as well as another kind of gradient instability, the exploding gradient problem, in the scope of recurrent neural networks.

Let’s recap the basic idea of an RNN. Here is a sketch of a simple RNN with three inputs, two hidden units and one output. At any given time step, the network structure is the same: the three input units connect to the two hidden units through $latex W_{ih}$ (green lines in the pic), and the two hidden units connect to the output through $latex W_{ho}$ (red lines). Between two adjacent time steps, the two hidden units of the earlier network also connect to the two in the later network through $latex W_{hh}$ (blue lines). The hidden units apply an activation function, in our case the sigmoid function. The outputs, however, are just linear combinations of the hidden unit values.

(figure: the RNN sketch described above)

Pascanu et al. pointed out that if the largest singular value of $latex W_{hh}$ is smaller than the reciprocal of the largest possible value of the derivative of the hidden units’ activation function (the sigmoid’s derivative is at most 0.25, so the threshold is $latex 1/0.25=4$), then the RNN suffers from the vanishing gradient problem (i.e., much earlier inputs tend to barely have influence on later outputs). Conversely, the largest singular value of $latex W_{hh}$ being larger than this threshold is a necessary condition of the exploding gradient problem: the error of the output with respect to an earlier input may grow to an exploding level so that training never converges.
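To get a feel for why the singular values of $latex W_{hh}$ matter, note that each backward step multiplies the gradient by a factor $latex diag(\sigma'(\cdot))\,W_{hh}$. The sketch below (my own illustration, not Pascanu's code) evaluates the sigmoid derivative at its maximum 0.25 and takes the product of such factors over 50 steps, for a small and a large largest singular value:

```python
# Each backward step through time multiplies the gradient by
# diag(sigmoid') @ W_hh, and sigmoid' <= 0.25. With the derivative fixed
# at 0.25, the 50-step product is (0.25 * W_hh)^50: it vanishes when
# 0.25 * sigma_max < 1 and explodes when 0.25 * sigma_max > 1.
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))   # random orthogonal basis

def jacobian_product_norm(sigma_max, T=50):
    # symmetric W_hh with chosen largest singular value
    W = Q @ np.diag([sigma_max, 0.5 * sigma_max]) @ Q.T
    J = np.linalg.matrix_power(0.25 * W, T)
    return np.linalg.norm(J, 2)                    # spectral norm

print(jacobian_product_norm(0.2))   # (0.25*0.2)^50 = 0.05^50: vanishes
print(jacobian_product_norm(8.0))   # (0.25*8)^50 = 2^50: explodes
```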

Purely out of curiosity, I did some experiments trying to replicate the vanishing gradient problem and the exploding gradient problem. Here are my steps:

  1. Initialize an RNN with three inputs, two hidden units and one output unit. The weights connecting them are crafted so that $latex W_{hh}$ has singular values tending to cause either the vanishing gradient problem or the exploding gradient problem. I record the values of $latex W_{ih}$, $latex W_{hh}$ and $latex W_{ho}$.
  2. Randomly generate input sequences, feed them into the RNN from step 1, and record the outputs. The inputs are generated in several sets, each with 500 input sequences but a different sequence length. This lets me examine how the vanishing (exploding) gradient problem forms with regard to the length of the sequences.
  3. Now train a new RNN on the inputs and outputs from step 2.
  4. Compare the weights of the two RNNs to see whether the training was successful (i.e., whether the weights are close).

Bear in mind that the largest singular value of $latex W_{hh}$ exceeding the threshold is just a necessary condition of the exploding gradient problem. But I still kept going to see what would happen. The tables below contain the costs after certain numbers of epochs, given certain sequence lengths.

When the vanishing condition holds:

epoch \ seq length        2          10         100
500                    1.2e-5     1.44e-5     1.35e-5
1500                   3.21e-9    5.11e-9     2.7e-9

 

When the exploding condition holds:

epoch \ seq length        2          10         100
500                    0.00144    0.00358     0.0045
1500                   2.39e-5    0.00352     0.0045

From the results above, I find the vanishing gradient problem hard to replicate (all the costs in the first table converged to small values), as opposed to the exploding gradient problem, which emerged for sequences of length 100 (as you can see, the costs got stuck at 0.0045).

Code

rnn_generator.py: used to generate simulated input and output data passed through fixed weights

import numpy as np
import theano
import theano.tensor as tt

class RNNGenerator(object):
    '''
    Generate a simulated RNN with a fixed set of weights, 
    based on a fixed seed: np.random.seed(19910715)
    Also check out RNNGenerator1, which generates simulated 
    RNN sequences of output with a fixed delay w.r.t input.
    '''
    
    def __init__(self, input_num, output_num, hidden_num):
        np.random.seed(19910715)
        self.input_num = input_num
        self.output_num = output_num
        self.hidden_num = hidden_num
        
        # Initialize weights
        self.W_hh = np.random.randn(hidden_num, hidden_num)
        u, s, v = np.linalg.svd(self.W_hh, full_matrices=True, compute_uv=True)
        s = np.diag(s)
        print u, "\n\n", s, "\n\n", v, "\n"
        print "Close?", np.allclose(self.W_hh, np.dot(np.dot(u, s), v)), "\n"
        
        # Manually craft the largest singular value of W_hh as you wish
        # s[0,0] = 10
        # self.W_hh = np.dot(np.dot(u, s), v)

        self.b_hh = np.random.randn(hidden_num)
        self.W_hi = np.random.randn(hidden_num, input_num)
        
        self.W_yh = np.random.randn(output_num, hidden_num)
        self.b_yh = np.random.randn(output_num)
        
        self.u, self.s, self.v = u, s, v
        
        # Initialize output function
        # Create symbols
        self.W_hh_sym = theano.shared(self.W_hh)
        self.W_hi_sym = theano.shared(self.W_hi)  # hidden_num x input_num
        self.b_hh_sym = theano.shared(self.b_hh)
        inputs_sym = tt.matrix("inputs_sym")  # data_num x input_num
        
        # fn: a lambda expression to define a recurrent process
        # sequences (if any), prior result(s) (if needed), non-sequences (if any)
        outputs_sym, _ = theano.scan(fn=lambda inputs, prior_hidden, W_hh, W_hi, b_hh:
                                     tt.nnet.sigmoid(tt.dot(W_hh, prior_hidden) + tt.dot(W_hi, inputs) + b_hh),
                                     sequences=inputs_sym,
                                     non_sequences=[self.W_hh_sym, self.W_hi_sym, self.b_hh_sym],
                                     # outputs_info is the initial state of prior_hidden
                                     outputs_info=tt.zeros_like(self.b_hh_sym))        
 
        # Doesn't need to update any shared variables, so set updates to None
        self.output_func = theano.function(inputs=[inputs_sym], outputs=outputs_sym, updates=None)
    
    def weights(self):
        return self.W_hh, self.W_hi, self.W_yh, self.b_hh, self.b_yh    
    
    def inputs_outputs(self, data_num, seq_len):
        m_inputs = np.random.randn(data_num, seq_len, self.input_num)
        m_final_outputs = np.zeros((data_num, seq_len, self.output_num))
        for j in xrange(data_num):
            m_hidden = self.output_func(m_inputs[j, :, :])
            m_outputs = np.zeros((seq_len, self.output_num))
            for i in xrange(m_hidden.shape[0]):
                m_outputs[i] = np.dot(self.W_yh, m_hidden[i]) + self.b_yh 
            m_final_outputs[j, :, :] = m_outputs
        
        return m_inputs, m_final_outputs
    
if __name__ == '__main__':
   input_num = 3
   output_num = 1
   hidden_num = 2
   # How delayed outputs will rely on inputs
   seq_len = 4
   data_num = 3
   
   rnn_generator = RNNGenerator(input_num, output_num, hidden_num)
   W_hh, W_hi, W_yh, b_hh, b_yh = rnn_generator.weights()
   print "Generated W_hh\n", W_hh, "\n"
   print "Generated W_hi\n", W_hi, "\n"
   print "Generated W_yh\n", W_yh, "\n"
   print "Generated b_hh\n", b_hh, "\n"
   print "Generated b_yh\n", b_yh, "\n"
   
   m_inputs, m_outputs = rnn_generator.inputs_outputs(data_num, seq_len)
   
   print "m_inputs\n", m_inputs, "\n"
   print "m_outputs\n", m_outputs, "\n"
   

my_rnn.py: the main class of a simple rnn taking as input the simulated data

import theano
import theano.tensor as T
import numpy as np
import cPickle
import random
from rnn_generator import RNNGenerator
import time

np.random.seed(19910715)

class RNN(object):

    def __init__(self, n_in, n_hidden, n_out, n_timestep):
        rng = np.random.RandomState(1234)

        self.activ = T.nnet.sigmoid
        lr, momentum, input, target = T.scalar(), T.scalar(), T.matrix(), T.matrix()
                
        self.W_uh = theano.shared(np.asarray(rng.normal(size=(n_in, n_hidden), scale=.01, loc=.0),
                                             dtype=theano.config.floatX), 'W_uh')
        self.W_hh = theano.shared(np.asarray(rng.normal(size=(n_hidden, n_hidden), scale=.01, loc=.0), 
                                             dtype=theano.config.floatX), 'W_hh')
        self.W_hy = theano.shared(np.asarray(rng.normal(size=(n_hidden, n_out), scale=.01, loc=0.0), 
                                             dtype=theano.config.floatX), 'W_hy')
        self.b_hh = theano.shared(np.zeros((n_hidden,), dtype=theano.config.floatX), 'b_hh')
        self.b_hy = theano.shared(np.zeros((n_out,), dtype=theano.config.floatX), 'b_hy')

        # Initialize the hidden unit state
        h0_tm1 = theano.shared(np.zeros(n_hidden, dtype=theano.config.floatX))

        # The parameter sequence fed into recurrent_fn:
        # sequences (if any), outputs_info(s) (if needed), non-sequences (if any)
        h, _ = theano.scan(self.recurrent_fn, sequences=input,
                       outputs_info=[h0_tm1],
                       non_sequences=[self.W_hh, self.W_uh, self.b_hh])
        
        # y_pred is the predict value by the current model  
        y_pred = T.zeros((n_timestep, n_out))
        
        # The cost is averaged over the number of output and the number of timestep
        for i in xrange(n_timestep):
            y_pred = T.set_subtensor(y_pred[i, :], T.dot(h[i], self.W_hy) + self.b_hy)
        # You can determine whether to only compare the last timestep output
        # or all outputs along all the timesteps
        # cost = T.mean((target - y_pred)**2)
        cost = T.mean((target[-1, :] - y_pred[-1, :])**2)
       
        # This is the single output cost function from the original file
        # cost = ((target - y_pred) ** 2).mean(axis=0).sum()
        
        # Store previous gradients. Used for momentum calculation.
        self.gW_uh = theano.shared(np.zeros((n_in, n_hidden)), 'gW_uh')
        self.gW_hh = theano.shared(np.zeros((n_hidden, n_hidden)), 'gW_hh')
        self.gW_hy = theano.shared(np.zeros((n_hidden, n_out)), 'gW_hy')
        self.gb_hh = theano.shared(np.zeros((n_hidden)), 'gb_hh')
        self.gb_hy = theano.shared(np.zeros((n_out)), 'gb_hy')
        
        gW_uh, gW_hh, gW_hy, gb_hh, gb_hy = T.grad(
               cost, [self.W_uh, self.W_hh, self.W_hy, self.b_hh, self.b_hy])

        self.train_step = theano.function([input, target, lr, momentum], [cost, y_pred],
                            on_unused_input='warn',
                            updates=[(self.gW_hh, momentum * self.gW_hh - lr * gW_hh),
                                     (self.gW_uh, momentum * self.gW_uh - lr * gW_uh),
                                     (self.gW_hy, momentum * self.gW_hy - lr * gW_hy),
                                     (self.gb_hh, momentum * self.gb_hh - lr * gb_hh),
                                     (self.gb_hy, momentum * self.gb_hy - lr * gb_hy),
                                     (self.W_hh, self.W_hh + self.gW_hh),
                                     (self.W_uh, self.W_uh + self.gW_uh),
                                     (self.W_hy, self.W_hy + self.gW_hy),
                                     (self.b_hh, self.b_hh + self.gb_hh),
                                     (self.b_hy, self.b_hy + self.gb_hy)],
                            allow_input_downcast=True)
        
        # This part is for debugging.
        # Create a function that takes the fixed weights from RNNGenerator as input and calculate output
        W_uh, W_hh, W_hy, b_hh, b_hy = T.matrix("W_uh"), T.matrix("W_hh"),\
                                         T.matrix('W_hy'), T.vector('b_hh'), T.vector('b_hy')
        
        h_validated, _ = theano.scan(self.recurrent_fn, sequences=input, outputs_info=[h0_tm1],
                            non_sequences=[W_hh, W_uh, b_hh])
        y_validated = T.zeros((n_timestep, n_out))
        for i in xrange(n_timestep):
            y_validated = T.set_subtensor(y_validated[i], T.dot(h_validated[i], W_hy) + b_hy)
        self.output_validated_fn = theano.function([input, W_hh, W_uh, W_hy, b_hh, b_hy], y_validated, 
                                                updates=None, allow_input_downcast=True)

    def recurrent_fn(self, u_t, h_tm1, W_hh, W_uh, b_hh):
        h_t = self.activ(T.dot(h_tm1, W_hh) + T.dot(u_t, W_uh) + b_hh)
        return h_t
    
def test_fixed_weights():
    epoch = 1500
    final_momentum = 0.9
    initial_momentum = 0.5
    momentum_switchover = 5
    
    input_num = 3
    output_num = 1
    hidden_num = 2
    
    seq_len = 2        # You can try different values of seq_len to check the cost
    data_num = 500
    
    rnn = RNN(input_num, hidden_num, output_num, seq_len)
    lr = 0.001
    
    rnn_generator = RNNGenerator(input_num, output_num, hidden_num)
    W_hh, W_hi, W_yh, b_hh, b_yh = rnn_generator.weights()
    m_inputs, m_outputs = rnn_generator.inputs_outputs(data_num, seq_len)
    
    for e in xrange(epoch):
        e_vals = []
        for i in xrange(data_num):
            u = m_inputs[i, :, :]          # seq_len x input_num
            t = m_outputs[i, :, :]         # seq_len x output_num
            mom = final_momentum if e > momentum_switchover \
                                 else initial_momentum
            cost, y_pred = rnn.train_step(u, t, lr, mom)
            e_vals.append(cost)
            
            # This part is for debugging
            # Validate that using the fixed weights the output passing RNN
            # is consistent with the generated output
            # t_validated = rnn.output_validated_fn(u, W_hh.transpose(), W_hi.transpose(), W_yh.transpose(), b_hh, b_yh)
        
        print "epoch {0}, average cost: {1}\n".format(e, np.average(e_vals))
        
        
    print_result(rnn_generator, rnn, "my_rnn.log")
    print "average error:{0}\n".format(np.average(e_vals))

def print_result(rnn_generator = None, rnn = None, filepath=None):
    output_str = ""
    if rnn_generator is not None:
        W_hh, W_hi, W_yh, b_hh, b_yh = rnn_generator.weights()
        output_str = output_str + "Generated W_hh\n" + str(W_hh) + "\n"
        output_str = output_str + "Generated W_hi\n" + str(W_hi) + "\n"
        output_str = output_str + "Generated W_yh\n" + str(W_yh) + "\n"
        output_str = output_str + "Generated b_hh\n" + str(b_hh) + "\n"
        output_str = output_str + "Generated b_yh\n" + str(b_yh) + "\n"

    if rnn is not None:
        output_str = output_str + "learnt W_hh\n" + str(rnn.W_hh.get_value()) + "\n"
        output_str = output_str + "learnt W_uh\n" + str(rnn.W_uh.get_value()) + "\n"
        output_str = output_str + "learnt W_hy\n" + str(rnn.W_hy.get_value()) + "\n"
        output_str = output_str + "learnt b_hh\n" + str(rnn.b_hh.get_value()) + "\n"
        output_str = output_str + "learnt b_hy\n" + str(rnn.b_hy.get_value()) + "\n"
    
    print output_str
    
    if filepath is not None:
        with open(filepath, "w") as text_file:
            text_file.write(output_str)

if __name__ == '__main__':
    start_time = time.time()
    test_fixed_weights()
    end_time = time.time()
    print "It took {0} seconds.".format((end_time - start_time))

References

Pascanu et al. paper (On the difficulty of training recurrent neural networks):

http://jmlr.org/proceedings/papers/v28/pascanu13.pdf

theano tutorial: http://deeplearning.net/tutorial/lstm.html

Recurrent neural network (The format is not very crystal clear to understand): https://www.pdx.edu/sites/www.pdx.edu.sysc/files/Jaeger_TrainingRNNsTutorial.2005.pdf

RNN as a generative model: http://karpathy.github.io/2015/05/21/rnn-effectiveness/?utm_content=bufferab271&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

A simple version of RNN I: http://www.nehalemlabs.net/prototype/blog/2013/10/10/implementing-a-recurrent-neural-network-in-python/

A simple version of RNN II: https://github.com/gwtaylor/theano-rnn