Math Derivation for the Bayesian Probabilistic Matrix Factorization Model

In this post I try to derive the mathematical formulas that appear in the paper "Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo". We will use exactly the same notation as the paper. The part I am interested in is the inference part (Section 3.3).

Sample $latex U_i$

In Gibbs sampling, to sample one parameter you need its distribution conditioned on all the other parameters. To sample $latex U_i$, we need to determine the distribution $latex p(U_i|R, V, \Theta_U, \Theta_V, \alpha)$. Since $latex U_i$ is independent of $latex \Theta_V$, $latex p(U_i|R, V, \Theta_U, \Theta_V, \alpha) = p(U_i | R, V, \Theta_U, \alpha)$.
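Before diving into the math, here is a rough sketch (my own outline, not code from the paper) of how one full Gibbs sweep for BPMF is structured; `sample_hyperparams`, `sample_user_factor` and `sample_item_factor` are hypothetical placeholders for the conditional sampling steps derived in the rest of this post.

```python
# A rough outline of one Gibbs sweep for BPMF (my own sketch, not code from the paper).
# sample_hyperparams, sample_user_factor and sample_item_factor are hypothetical
# placeholders for the conditional sampling steps derived later in this post.

def gibbs_sweep(R, I, U, V, alpha, prior, rng):
    # 1. Sample hyperparameters Theta_U = (mu_U, Lambda_U) and Theta_V = (mu_V, Lambda_V)
    #    from their conditionals given the current factor matrices.
    mu_U, Lambda_U = sample_hyperparams(U, prior, rng)
    mu_V, Lambda_V = sample_hyperparams(V, prior, rng)

    # 2. Sample each user factor U_i given R, V and Theta_U.
    for i in range(U.shape[0]):
        U[i] = sample_user_factor(R[i], I[i], V, mu_U, Lambda_U, alpha, rng)

    # 3. Sample each item factor V_j given R, U and Theta_V (symmetric to step 2).
    for j in range(V.shape[0]):
        V[j] = sample_item_factor(R[:, j], I[:, j], U, mu_V, Lambda_V, alpha, rng)

    return U, V
```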

According to Bayes' rule, $latex p(U_i | R, V, \Theta_U, \alpha) \cdot p(R | V, \Theta_U, \alpha) = p(R | U_i, V, \Theta_U, \alpha) \cdot p(U_i | V, \Theta_U, \alpha) $. When we sample from $latex p(U_i | R, V, \Theta_U, \alpha)$, $latex R, V, \Theta_U$ and $latex \alpha$ are fixed, so $latex p(R | V, \Theta_U, \alpha)$ can be treated as a constant. Also, in $latex p(U_i|V, \Theta_U, \alpha)$, $latex U_i$ and $latex V$ are independent, so $latex p(U_i|V, \Theta_U, \alpha)=p(U_i|\Theta_U, \alpha)$.

Therefore, $latex p(U_i | R, V, \Theta_U, \alpha) \propto p(R | U_i, V, \Theta_U, \alpha) \cdot p(U_i |\Theta_U, \alpha)$. Before we go into further details, note that $latex p(R|U_i, V, \Theta_U, \alpha) = \prod\limits_{j=1}^M [\mathcal{N}(R_{ij}|U_i^TV_j, \alpha^{-1}) ]^{I_{ij}}$, a product of $latex M$ normal densities, and $latex p(U_i | \Theta_U, \alpha) = \mathcal{N}(U_i | \mu_U, \Lambda_U^{-1})$, another normal density. Therefore $latex p(U_i|R, V, \Theta_U, \alpha)$ must also be a normal distribution, because a product of normal densities is again a normal density up to a scale factor (here the scale factor is $latex p(R|V, \Theta_U, \alpha)$).
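As a quick sanity check of that claim (my own, not from the paper), the snippet below multiplies two univariate normal densities on a grid and verifies that the product is proportional to a normal whose precision is the sum of the two precisions and whose mean is the precision-weighted average of the two means.

```python
import numpy as np
from scipy.stats import norm

# Two arbitrary univariate normals.
mu1, s1 = 1.0, 0.8
mu2, s2 = -0.5, 1.5

x = np.linspace(-4, 4, 801)
product = norm.pdf(x, mu1, s1) * norm.pdf(x, mu2, s2)

# Combined precision and mean predicted by the "product of Gaussians" rule.
lam = 1 / s1**2 + 1 / s2**2
mu = (mu1 / s1**2 + mu2 / s2**2) / lam
predicted = norm.pdf(x, mu, np.sqrt(1 / lam))

# The product is proportional to the predicted normal: the ratio is a constant.
ratio = product / predicted
print(np.allclose(ratio, ratio[0]))   # True (up to floating-point error)
```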

To make things easier, let me introduce the canonical form of univariate and multivariate normal distributions. $latex \mathcal{N}(x | \mu, \sigma^2) = \exp[\xi + \eta x - \frac{1}{2} \lambda^2 x^2]$ where $latex \lambda=\frac{1}{\sigma}$, $latex \eta=\lambda^2 \mu$ and $latex \xi = -\frac{1}{2}(\log 2\pi - \log \lambda^2 + \eta^2 \sigma^2 )$. $latex \mathcal{N}(\boldsymbol{x} | \boldsymbol{\mu}, V) = \exp[\xi + \boldsymbol{\eta}^T \boldsymbol{x} - \frac{1}{2}\boldsymbol{x}^T \Lambda \boldsymbol{x}]$, where $latex \Lambda=V^{-1}$, $latex \boldsymbol{\eta} = V^{-1}\boldsymbol{\mu}$ and $latex \xi=-\frac{1}{2}(d\log 2\pi-\log|\Lambda|+\boldsymbol{\eta}^T \Lambda^{-1} \boldsymbol{\eta})$. (Arranged this way, you can see the pattern of how the univariate case generalizes to the multivariate one.)
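To convince myself that the multivariate canonical form above is correct, here is a small numerical check (again my own, not from the paper) against scipy's multivariate normal density.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
V = A @ A.T + d * np.eye(d)          # a random positive-definite covariance

# Canonical-form parameters from above: Lambda = V^{-1}, eta = V^{-1} mu,
# xi = -1/2 (d*log(2*pi) - log|Lambda| + eta^T Lambda^{-1} eta).
Lam = np.linalg.inv(V)
eta = Lam @ mu
xi = -0.5 * (d * np.log(2 * np.pi) - np.log(np.linalg.det(Lam)) + eta @ V @ eta)

x = rng.normal(size=d)
canonical = np.exp(xi + eta @ x - 0.5 * x @ Lam @ x)
reference = multivariate_normal.pdf(x, mean=mu, cov=V)
print(np.allclose(canonical, reference))   # True
```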

Now we expand $latex p(R | U_i, V, \Theta_U, \alpha)$ and $latex p(U_i | \Theta_U, \alpha)$ using this canonical form.

$latex p(R|U_i, V, \Theta_U, \alpha) \\= \prod\limits_{j=1}^M [\mathcal{N}(R_{ij}|U_i^TV_j, \alpha^{-1}) ]^{I_{ij}} \\= \prod\limits_{j=1}^{M} \exp[-\frac{1}{2}I_{ij}(\log 2\pi - \log \alpha + \alpha U_i^T V_j V_j^T U_i) + \alpha I_{ij} R_{ij}V_j^T U_i - \frac{1}{2} \alpha I_{ij} R_{ij}^2] \\=\prod\limits_{j=1}^M \exp[-\frac{1}{2} \alpha I_{ij} U_i^T V_j V_j^T U_i +\alpha I_{ij} R_{ij}V_j^T U_i + C] \qquad\qquad (C \text{ is everything not relevant to } U_i) \\=\exp[-\frac{1}{2} U_i^T (\alpha \sum\limits_{j=1}^M I_{ij} V_j V_j^T) U_i + \alpha\sum\limits_{j=1}^M I_{ij}R_{ij}V_j^T U_i + C]$

 

$latex p(U_i | \Theta_U, \alpha) \\=\mathcal{N}(U_i | \mu_U, \Lambda_U^{-1}) \\= \exp[-\frac{1}{2} U_i^T \Lambda_U U_i + (\Lambda_U \mu_U)^T U_i + C]$

 

$latex p(U_i | R, V, \Theta_U, \alpha) \\ \propto p(R | U_i, V, \Theta_U, \alpha) \cdot p(U_i |\Theta_U, \alpha) \\= \exp[-\frac{1}{2} U_i^T ( \Lambda_U + \alpha \sum\limits_{j=1}^M I_{ij} V_j V_j^T ) U_i + (\alpha\sum\limits_{j=1}^M I_{ij}R_{ij}V_j + \Lambda_U \mu_U)^T U_i + C]$

 

Comparing this exponent with the canonical form of a multivariate normal distribution, you can read off the posterior precision and mean: $latex \Lambda_i^* = \Lambda_U + \alpha \sum\limits_{j=1}^M I_{ij} V_j V_j^T$ and $latex \mu_i^* = [\Lambda_i^*]^{-1} (\alpha\sum\limits_{j=1}^M I_{ij}R_{ij}V_j + \Lambda_U \mu_U)$. This is how formulas (12) and (13) in the paper are derived.
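For concreteness, here is a minimal numpy sketch (my own, assuming $latex R$ is stored as a dense matrix and $latex I$ is a 0/1 indicator matrix) of the resulting sampling step for a single $latex U_i$; the function name and arguments are my own, not the paper's.

```python
import numpy as np

def sample_user_factor(R_i, I_i, V, mu_U, Lambda_U, alpha, rng):
    """Draw U_i from its conditional p(U_i | R, V, Theta_U, alpha).

    R_i:  ratings of user i over all M items (shape (M,))
    I_i:  0/1 indicator of which ratings are observed (shape (M,))
    V:    item factors, one per row (shape (M, D))
    """
    obs = I_i.astype(bool)
    V_obs = V[obs]                                    # V_j for observed ratings only
    # Posterior precision: Lambda_U + alpha * sum_j I_ij V_j V_j^T
    precision = Lambda_U + alpha * V_obs.T @ V_obs
    # Linear term: alpha * sum_j I_ij R_ij V_j + Lambda_U mu_U
    linear = alpha * V_obs.T @ R_i[obs] + Lambda_U @ mu_U
    cov = np.linalg.inv(precision)
    mean = cov @ linear
    return rng.multivariate_normal(mean, cov)
```

In practice one would factor the precision matrix (e.g. via Cholesky) rather than invert it, but the direct inversion above mirrors formulas (12) and (13) most closely.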

 

Sample $latex \mu_U$ and $latex \Lambda_U$

According to Bayes' rule, $latex p(\Theta_U | U, \Theta_0) \cdot p(U | \Theta_0 ) = p(U|\Theta_U, \Theta_0) \cdot p(\Theta_U | \Theta_0)$.

Again, $latex p(U|\Theta_0)$ does not involve $latex \Theta_U$, so it can be treated as a constant and ignored. Therefore,

$latex p(\Theta_U | U , \Theta_0) \\ \propto p(U|\Theta_U, \Theta_0) \cdot p(\Theta_U | \Theta_0) \\=p(U|\Theta_U, \Theta_0) \cdot p(\mu_U | \Lambda_U, \mu_0, \beta_0) \cdot p(\Lambda_U|W_0, \nu_0) \\=\prod\limits_{i=1}^N \mathcal{N}(U_i | \mu_U, \Lambda_U^{-1}) \cdot \mathcal{N}(\mu_U|\mu_0 , (\beta_0\Lambda_U)^{-1}) \cdot \mathcal{W}(\Lambda_U | W_0, \nu_0) \qquad\qquad \text{(according to formulas (5) and (7) in the paper)}$

 

Comparing with formulas 216~226 in Reference [5] and formulas 1~3 on page 178 of Reference [6], you can obtain all the updates in formula (14) of the paper. However, the formulas for $latex S$ seem to differ between the paper and the references. In the paper, $latex S=\frac{1}{N}\sum\limits_{i=1}^{N}U_i U_i^T$. In the references, the scatter matrix is centered: $latex S=\frac{1}{N}\sum\limits_{i=1}^{N}(U_i - \bar{U}) (U_i - \bar{U})^T$. This is the only place that still confuses me, and I will not have an answer until I work through the full proofs in the references.
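For completeness, here is a hedged numpy/scipy sketch (my own) of the corresponding sampling step for $latex \mu_U$ and $latex \Lambda_U$. It follows the standard Normal-Wishart posterior update from Reference [5], i.e. with the centered scatter matrix, precisely because of the ambiguity discussed above; the function name and arguments are my own, not the paper's.

```python
import numpy as np
from scipy.stats import wishart

def sample_user_hyperparams(U, mu_0, beta_0, W_0, nu_0, rng):
    """Draw (mu_U, Lambda_U) from their conditional Normal-Wishart posterior.

    Uses the standard update with the centered scatter matrix, as in Reference [5];
    see the note above about how this relates to formula (14) in the paper.
    """
    N, D = U.shape
    U_bar = U.mean(axis=0)
    S = (U - U_bar).T @ (U - U_bar)                   # centered scatter matrix

    beta_star = beta_0 + N
    nu_star = nu_0 + N
    mu_star = (beta_0 * mu_0 + N * U_bar) / beta_star
    diff = (U_bar - mu_0).reshape(-1, 1)
    W_star_inv = np.linalg.inv(W_0) + S + (beta_0 * N / beta_star) * (diff @ diff.T)
    W_star = np.linalg.inv(W_star_inv)

    # Sample Lambda_U ~ Wishart(W_star, nu_star), then mu_U | Lambda_U ~ Normal.
    Lambda_U = wishart.rvs(df=nu_star, scale=W_star, random_state=rng)
    mu_U = rng.multivariate_normal(mu_star, np.linalg.inv(beta_star * Lambda_U))
    return mu_U, Lambda_U
```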

 

References

[1] https://en.wikipedia.org/wiki/Multivariate_normal_distribution

[2] https://en.wikipedia.org/wiki/Normal_distribution

[3] http://www.tina-vision.net/docs/memos/2003-003.pdf

[4] http://blog.csdn.net/shenxiaolu1984/article/details/50405659

[5] https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf

[6] https://www.dropbox.com/s/k9nvt3goky541pq/Optimal%20Statistical%20Decisions%20%282004%20DeGroot%29.pdf?dl=0

 
