Leetcode 242: valid anagram

https://leetcode.com/problems/valid-anagram/

Given two strings s and t, write a function to determine if t is an anagram of s.

For example,
s = “anagram”, t = “nagaram”, return true.
s = “rat”, t = “car”, return false.

Note:
You may assume the strings contain only lowercase letters.

Follow up:
What if the inputs contain Unicode characters? How would you adapt your solution to such a case?

 

 

Code

class Solution(object):
    def isAnagram(self, s, t):
        """
        :type s: str
        :type t: str
        :rtype: bool
        """
        if s is None and t is not None: return False
        if t is None and s is not None: return False
        if s is None and t is None: return True
        if len(s) != len(t): return False
        
        l = [0] * 26
        for c in s:
            l[ord(c) - ord('a')] += 1
        for c in t:
            l[ord(c) - ord('a')] -= 1
            
        for v in l:
            if v: return False
            
        return True

 

Idea

Count the occurrences of the 26 letters (the code uses a length-26 array; a dictionary also works): increment for each character of `s`, decrement for each character of `t`, then check that every count is zero. This takes O(N) time and O(1) extra space, since the counter size is bounded by the alphabet size.

You can also sort both strings and compare them: O(N log N) time, and O(1) extra space if the sort is done in place.
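A rough sketch of both alternatives (the class and method names below are mine, not from the post); the `Counter` version also covers the Unicode follow-up, since it hashes characters instead of indexing a fixed 26-slot array:

from collections import Counter

class SolutionAlt(object):
    def isAnagramSort(self, s, t):
        # Sort-and-compare: O(N log N) time. Python's sorted() copies the
        # characters into new lists, so this particular version still uses
        # O(N) auxiliary space; an in-place sort on mutable sequences avoids that.
        return sorted(s) == sorted(t)

    def isAnagramCounter(self, s, t):
        # Counting with a hash map generalizes beyond lowercase a-z,
        # so the same code handles arbitrary Unicode characters.
        return Counter(s) == Counter(t)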

 

Reference

https://leetcode.com/discuss/50580/python-solutions-sort-and-dictionary

Leetcode 115: Distinct Subsequences

https://leetcode.com/problems/distinct-subsequences/

Given a string S and a string T, count the number of distinct subsequences of T in S.

A subsequence of a string is a new string which is formed from the original string by deleting some (can be none) of the characters without disturbing the relative positions of the remaining characters. (ie, "ACE" is a subsequence of "ABCDE" while "AEC" is not).

Here is an example:
S = "rabbbit", T = "rabbit"

Return 3.

 

The question is a bit ambiguous. Please see the following link for clarification.

https://leetcode.com/discuss/599/task-clarification

 

Code

class Solution(object):
    def numDistinct(self, s, t):
        """
        :type s: str
        :type t: str
        :rtype: int
        """
        if s is None or t is None:
            return 0
        if len(t) > len(s):
            return 0
            
        num = [[1] + [0] * (len(t)) for i in range(len(s)+1)]   # a (len(s)+1) x (len(t)+1) matrix

        for i in xrange(1, len(s)+1):
            for j in xrange(1, min(i, len(t))+1):
                num[i][j] = num[i-1][j] 
                if s[i-1] == t[j-1]: num[i][j] += num[i-1][j-1]

        return num[len(s)][len(t)] 
        

 

Idea

Use dynamic programming to tackle this problem. Create an array `num` in which `num[i][j]` denotes the number of distinct subsequences equal to `t[:j]` in `s[:i]`. `num[i][j] = 0` for `i < j`, because you cannot find a subsequence longer than the full sequence. `num[i][0]` is always 1, because the empty string is a subsequence of every string (in exactly one way).

Therefore, `num` is initialized with `num[i][0] = 1` for every row and all other entries set to 0.

Since we are using dynamic programming, we need a formula to update `num[i][j]`. We propose `num[i][j] = num[i-1][j] + (s[i-1]==t[j-1] ? num[i-1][j-1] : 0)`. In other words, the distinct subsequences equal to `t[:j]` in `s[:i]` come from two parts:

  1. the distinct subsequences equal to `t[:j]` in `s[:i-1]`. This is equivalent to discarding `s[i-1]` and counting the distinct subsequences in `s[:i-1]`.
  2. if `s[i-1] == t[j-1]`, the distinct subsequences equal to `t[:j-1]` in `s[:i-1]`. This is equivalent to discarding both `s[i-1]` and `t[j-1]` and comparing the two remaining strings.

The update rule fills the table row by row: each cell copies the value directly above it and, when `s[i-1] == t[j-1]`, also adds in the value diagonally above and to the left.

Overall, this algorithm takes O(N^2) time to fill the full `num` array and O(N^2) space to store it (assuming N is the length of both `s` and `t`; more precisely, O(|s|·|t|)).
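A quick sanity check against the example from the problem statement, using the class defined above:

assert Solution().numDistinct("rabbbit", "rabbit") == 3   # "rabbit" appears 3 times in "rabbbit"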

 

More

2D space in dynamic programming can often be reduced to 1D. Our code fills `num` row by row, and to compute the current row we only need the previous row. Note that `num[i][j]` depends on `num[i-1][j-1]` and `num[i-1][j]`, so to avoid overwriting values we still need, we must update the row vector backwards (from right to left).

class Solution(object):
    def numDistinct(self, s, t):
        """
        :type s: str
        :type t: str
        :rtype: int
        """
        if s is None or t is None:
            return 0
        if len(t) > len(s):
            return 0
            
        num = [1] + [0] * (len(t)) 

        for i in xrange(1, len(s)+1):
            for j in xrange(min(i, len(t)), 0, -1):
                if s[i-1] == t[j-1]: num[j] += num[j-1]

        return num[len(t)] 
        

Now, the algorithm takes O(N^2) time and only O(N) space.

 

 

Create 2D array in Python

I used to create a 2D zero array in the following way, for example:

arr = [[0] * 3] * 4

However, `arr` actually references the same list `[0, 0, 0]` four times. If you set `arr[1][1] = 5`, for example, you will find that all the "other" rows of `arr` then contain 5 as well.

>>> arr[1][1] = 5
>>> arr
[[0, 5, 0], [0, 5, 0], [0, 5, 0], [0, 5, 0]]

 

Therefore, the best way to create a 2D array is either using `numpy` or:

arr = [[0]*3 for i in range(4)]
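For completeness, a minimal `numpy` version of the same 4x3 zero array (assuming `numpy` is installed):

import numpy as np

arr = np.zeros((4, 3), dtype=int)   # 4 rows, 3 columns, all zeros
arr[1][1] = 5                       # only the single element at (1, 1) changes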

 

Reference

http://stackoverflow.com/a/6667529/1758727

Leetcode 65: valid number

https://leetcode.com/problems/valid-number/

65. Valid Number

Difficulty: Hard

Validate if a given string is numeric.

Some examples:
"0" => true
" 0.1 " => true
"abc" => false
"1 a" => false
"2e10" => true

Note: It is intended for the problem statement to be ambiguous. You should gather all requirements up front before implementing one.


 

Code

class Solution(object):
    def isNumber(self, s):
        s = s.strip()
        return self._isNumber(s, ["+", "-", "."], True, True)

    def _isNumber(self, s, valid_first_chars = [], check_dot=True, check_e=True):
        if not s: return False
        if s.isdigit(): return True

        if (not s[0] in map(str, range(10)) + valid_first_chars):
            return False

        if s[0] in valid_first_chars:
            if (s[0] == "+" or s[0] == "-") and self._isNumber(s[1:], ["."], check_dot, check_e):
                return True
            if s[0] == "." and self._isNumber(s[1:], [], False, check_e):
                return True
        
        if check_e:
            es = s.split("e")
            if len(es) > 2: return False
            if len(es)==2 and self._isNumber(es[0], ["+", "-", "."], True, False) \
                and self._isNumber(es[1], ["+", "-"], False, False):
                return True            

        if check_dot:
            dots = s.split(".")
            if len(dots) > 2: return False
            if len(dots)==2 and self._isNumber(dots[0], ["+", "-"], False, False) \
                and (self._isNumber(dots[1], [], False, False) or dots[1] == ''):
                return True 

        return False

        

 

Idea

My approach (like most people's) is to handle the different parsing cases directly; a few sanity checks against the problem's examples follow the list.

  1. The most trivial case: an empty string or a null value cannot be a valid number (line 7).
  2. Use `string.isdigit()` to check whether the string consists purely of digits, i.e., contains no '+', '-', '.' or 'e' (line 8).
  3. A valid number must start with a digit 0-9, '+', '-' or '.' (lines 10-11).
  4. If `s` starts with '+' or '-', then `s[1:]` must be a valid number with no further sign prefix; it may, however, start with '.', e.g., "+.43" is a valid number (lines 14-15).
  5. If `s` starts with '.', then `s[1:]` must be a plain valid number that cannot start with '+', '-' or another '.' (lines 16-17).
  6. To check whether `s` is in scientific notation, the part before 'e' must be a valid number with at most one prefix ('+', '-' or '.'); the part after 'e' cannot start with '.' but may start with '+' or '-' (lines 22-24).
  7. To check whether `s` is in float notation, the part before '.' must be a valid number containing no dot, and the part after '.' must contain neither another dot nor a sign (lines 29-31).
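The sanity checks mentioned above, based on the examples in the problem statement (they run against the Python 2 code listed earlier):

sol = Solution()
cases = [("0", True), (" 0.1 ", True), ("abc", False),
         ("1 a", False), ("2e10", True)]
for s, expected in cases:
    assert sol.isNumber(s) == expected, (s, expected)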

 

Reference

An exhaustive list of test cases:

Screenshot from 2015-12-30 13:54:40

 

Use SSH tunnel and Firefox for proxy

1. Use a specified port (e.g., 12345) for the SSH SOCKS server:

ssh -D 12345 user@host.domain

2. Go to the settings menu in Firefox, then Advanced -> Network -> Connection -> Settings.

Screenshot from 2015-12-12 01:06:48

3. Check "Manual proxy configuration", then fill in the SOCKS host (127.0.0.1) and port (your specified port, 12345 in our example).

4. Also, type in “about:config” in the address bar in Firefox, change “network.proxy.socks_remote_dns” to true.

 

Reference

https://www.linode.com/docs/networking/ssh/setting-up-an-ssh-tunnel-with-your-linode-for-safe-browsing

BFGS and L-BFGS materials

A good explanation of BFGS and L-BFGS can be found in this textbook: https://www.dropbox.com/s/qavnl2hr170njbd/NumericalOptimization2ndedJNocedalSWright%282006%29.pdf?dl=0

In my own words:

Gradient descent and its variants have proven able to learn parameters well in practical unconstrained optimization problems. However, they may converge too slowly or too erratically depending on how you set the learning rate.

Remember that an optimum of an unconstrained problem has zero first derivative. The Newton-Raphson method uses the tangent of the first derivative (i.e., the second derivative) to find a root of the first derivative (i.e., a point where the first derivative is zero). It usually converges at a quadratic rate (if the classic assumptions of Newton's method are satisfied). However, every iteration of Newton's method on a multivariate unconstrained problem requires computing not only the Hessian matrix but also its inverse, which prevents Newton's method from being applied to high-dimensional problems.

BFGS, as a member of the quasi-Newton family, uses cheaper calculations to approximate the inverse of the Hessian. It is still expensive in terms of memory usage, though. Later, L-BFGS was developed to achieve precision comparable to BFGS using only a limited amount of memory.

 

My notes in implementing L-BFGS:

We use the same notation as in the textbook by Nocedal and Wright:

Screenshot from 2016-01-03 14:37:18

You need a line search to determine $latex \alpha_k$. A simple backtracking line search is not enough here because $latex \alpha_k$ should satisfy both Wolfe conditions.

Screenshot from 2016-01-03 14:39:12

Screenshot from 2016-01-03 14:40:19
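In the book's notation, the two conditions shown above are the sufficient decrease condition (3.6a) and the curvature condition (3.6b):

$latex f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k \nabla f_k^T p_k$   (3.6a)

$latex \nabla f(x_k + \alpha_k p_k)^T p_k \geq c_2 \nabla f_k^T p_k$   (3.6b)

with $latex 0 < c_1 < c_2 < 1$.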

People have designed different line searches that satisfy both Wolfe conditions (see: https://www.math.lsu.edu/~hozhang/papers/nonmonotone.pdf and https://github.com/JuliaOpt/Optim.jl/issues/26). I am using a simple version (sketched in code after the list):

  1. if 3.6a is not satisfied, reduce $latex \alpha_k$ by half in every line search iteration.
  2. when 3.6a is satisfied, double $latex \alpha_k$ until 3.6b is satisfied.
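A minimal Python sketch of this halving/doubling scheme (the Wolfe constants `c1`, `c2` and the iteration cap are my own choices, not from the notes above):

import numpy as np

def simple_wolfe_line_search(f, grad, x, p, alpha=1.0, c1=1e-4, c2=0.9, max_iter=50):
    """f, grad: objective and gradient; x: current point; p: descent direction."""
    fx, slope = f(x), grad(x).dot(p)   # slope should be negative for a descent direction
    # Step 1: halve alpha until the sufficient decrease condition (3.6a) holds.
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + c1 * alpha * slope:
            break
        alpha *= 0.5
    # Step 2: double alpha until the curvature condition (3.6b) holds as well.
    # (The notes above do not say what to do if doubling breaks (3.6a) again.)
    for _ in range(max_iter):
        if grad(x + alpha * p).dot(p) >= c2 * slope:
            break
        alpha *= 2.0
    return alpha

# Example: minimize f(v) = v.v from x = [3.] along p = -grad(x) = [-6.]; returns 0.5.
print(simple_wolfe_line_search(lambda v: v.dot(v), lambda v: 2 * v,
                               np.array([3.0]), np.array([-6.0])))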

 

When you determine $latex p_k$, you can also check that $latex \nabla f_k^T H_k \nabla f_k > 0$, or equivalently that $latex \nabla f_k^T p_k = -\nabla f_k^T H_k \nabla f_k < 0$, i.e., that $latex p_k = -H_k \nabla f_k$ is a descent direction. $latex H_k$ in every BFGS iteration should be positive definite, because:

the Hessian approximation must be positive definite for the local quadratic model of $latex f$ to attain its minimum at the resulting step (the second derivative test).

https://en.wikipedia.org/wiki/Hessian_matrix#Second_derivative_test

 

Reference

http://www.alglib.net/optimization/lbfgsandcg.php

http://aria42.com/blog/2014/12/understanding-lbfgs/

http://andrew.gibiansky.com/blog/machine-learning/hessian-free-optimization/

http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf

Recommendation System Review

Traditional Collaborative Filtering

  1. Each user is represented by an O(N)-length vector, where N is the number of items.
  2. The user-item matrix therefore has size O(MN), where M is the number of users.
  3. For a user we want to recommend items to, we scan through the user-item matrix and find the k most similar users.
  4. Based on the item vectors of those k most similar users, we recommend items.

cons: computationally expensive, since the worst case needs O(MN) time to find the k most similar users.

Cluster Models

  1. Similarly, construct a user-item matrix of size O(MN).
  2. Cluster the users into k groups.
  3. For a user we want to recommend items to, first identify which cluster he belongs to.
  4. Among the users in that cluster, find items to recommend.

pros: only O(k) time is needed to locate the target user's cluster, which reduces the time complexity.

cons: the recommendation quality is not very good, since the users in the cluster are not necessarily the most similar users to the target user.

Search-Based Method

Search for items with the same author, artist, category, etc. as the item the target user just bought or rated, then recommend those.

cons: the recommendation quality is still not good, because the items a user wants next are not necessarily items with the same author, artist, or category.

Item-to-Item Collaborative Filtering

  1. Each item is represented by an O(M) vector, which records which users have bought the item.
  2. Construct an O(N^2) item-item matrix in which M_{ij} is the similarity between item i and item j. This step runs offline; its worst-case time complexity is O(N^2 M), but since the matrix is very sparse and the computation happens offline, the cost is acceptable.
  3. When the target user buys an item, you can immediately look up its most similar items and recommend them (a small sketch follows this list).
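A minimal sketch of the offline item-item step with cosine similarity (the tiny binary matrix `R` below is made up; `R[u][i] = 1` means user `u` bought item `i`):

import numpy as np

R = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1]], dtype=float)   # 3 users x 4 items

# Offline step: cosine similarity between item columns.
norms = np.linalg.norm(R, axis=0)
sim = R.T.dot(R) / (np.outer(norms, norms) + 1e-12)
np.fill_diagonal(sim, 0.0)                  # an item is not its own neighbour

def similar_items(item, k=2):
    # Online step: the k items most similar to the one just bought.
    return np.argsort(sim[item])[::-1][:k]

print(similar_items(0))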

Collaborative filtering in practice often outperforms content-based filtering. However, collaborative filtering, whether item-based or user-based, has a drawback: the cold-start problem. A new product or new user will not have a well-represented vector until a fair amount of time has passed, so the quality of CF recommendations for new products and users degrades; this is where content-based filtering has its advantage. Another minor drawback of CF is its inability to recommend to users with unique tastes.

Matrix Factorization Model

Each item and each user is represented by a vector describing its innate properties. The dot product of a user vector \vec{p_u} \in \mathbb{R}^F and an item vector \vec{q_i} \in \mathbb{R}^F reflects how closely the two relate. The dimension F is determined by how finely you want to model users and items; in the Netflix Prize competition, people reported using 50-1500 dimensions, and reportedly the higher the dimension, the better the recommendation quality.

Traditional matrix factorization techniques such as SVD cannot be applied directly to the user-item matrix because most of its entries are missing; you cannot simply treat a missing entry <i,j> as zero.

There are many versions of matrix factorization model.

  1. In scenarios where explicit feedback is available, we would like the dot product of \vec{q_i} and \vec{p_u} to approximate the explicit rating from user u on item i. As given in this paper, the objective function to minimize is the following (Screenshot from 2015-11-07 20:47:50), where r_{ui} is the real rating from user u on item i.
  2. In many other scenarios, only implicit feedback is available. For example, each item is a TV show and r_{ui} denotes how long (in minutes) user u has watched the show. In this case, a low r_{ui} indicates that user u watched only a small portion of the show, but this can be due to many reasons that are hard to capture: lack of interest, an inconvenient time for watching, a dying laptop battery, etc. On the other hand, a high r_{ui} indicates that user u showed potentially strong interest in the show. Naturally, we want the objective function to approximate strong r_{ui} accurately while relaxing its rigor on weak r_{ui}. Therefore, the approximation error (r_{ui}-q_i^Tp_u)^2 is weighted by a confidence parameter c_{ui} that gives higher r_{ui} larger importance. c_{ui} can take various forms, such as c_{ui}= 1+\alpha r_{ui} or c_{ui} = 1+\alpha \log (1+r_{ui}/\epsilon).
  3. In the implicit feedback scenario, there are two subcategories, ranking based and prediction based, each with its own assumption. Ranking-based learning algorithms assume that the difference in implicit feedback, i.e., r_{ui}-r_{uj}, reflects the difference in the user's interest between items i and j. Prediction-based algorithms assume there is a function p(r_{ui}) that binarizes the (more or less noisy and unpredictable) r_{ui}; the approximation error then becomes (p(r_{ui}) - q_i^T p_u)^2.

prediction based paper: http://yifanhu.net/PUB/cf.pdf

ranking based paper: http://wanlab.poly.edu/recsys12/recsys/p83.pdf

Learning for Matrix Factorization Model

Stochastic gradient descent (SGD) or alternating least squares (ALS) can be used to learn the parameters of the objective function. SGD is considered easier and faster, but ALS can be parallelized on MapReduce or Spark. Both SGD's and ALS's time complexity is linear in the number of users plus items. This paper shows different parallelization strategies for learning matrix factorization models; a toy SGD sketch follows.
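A toy SGD sketch for the explicit-feedback objective mentioned above, i.e., minimizing the squared error (r_{ui} - q_i^T p_u)^2 over observed ratings plus an L2 penalty on p_u and q_i (the ratings, dimensions and hyper-parameters below are made up for illustration):

import numpy as np

np.random.seed(0)
n_users, n_items, F = 5, 7, 3
ratings = [(0, 1, 4.0), (0, 3, 2.0), (2, 1, 5.0), (4, 6, 1.0)]   # (u, i, r_ui)

P = 0.1 * np.random.randn(n_users, F)   # user vectors p_u
Q = 0.1 * np.random.randn(n_items, F)   # item vectors q_i
lr, lam = 0.02, 0.05                    # learning rate, regularization weight

for epoch in range(1000):
    for u, i, r in ratings:
        pu, qi = P[u].copy(), Q[i].copy()
        err = r - qi.dot(pu)
        # One SGD step on p_u and q_i for this single observed rating.
        P[u] += lr * (err * qi - lam * pu)
        Q[i] += lr * (err * pu - lam * qi)

print([(u, i, round(Q[i].dot(P[u]), 2)) for u, i, r in ratings])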

Bayesian Personalized Ranking Framework

A paper proposes a general Bayesian ranking framework that can be used with various recommendation algorithms. The idea is to model the probability of observing every pair of items that a user prefers differently (through implicit feedback), and to maximize the joint probability of observing all such pairs with respect to the model-specific parameters. Reading Section 4.3 of the paper and its listed examples greatly helps in understanding the whole idea.

Incorporate temporal dynamics in Collaborative Filtering/Matrix Factorization

A more sophisticated model integrates temporal shifts of user preferences and item properties into the recommendation. The temporal aspects modeled include:

  1. user rating bias changes over time
  2. item rating bias (people tend to give higher/lower rating on certain items over others) changes over time
  3. For matrix factorization models, each component of the user vector (representing the user’s taste in a dimension) changes over time
  4. For collaborative filtering models, similarity between two items will be affected by time: if ratings for two items (i and j) from the same user happen within short time, their similarity should be high.

Finding candidate items to recommend

Once you have a trained model, the next step is to give users recommendations. For collaborative filtering, you can often immediately get the most similar items to the ones a user just bought, depending on how you structure your data. For matrix factorization models, you usually need more effort to find a small set of candidate items, rather than scoring the whole item set, and then pick the recommendations within that candidate subset. An engineer at Etsy reported using Locality-Sensitive Hashing (LSH) to find candidates: the item space is divided by k random hyperplanes into up to 2^k subspaces, and items falling into the same subspace form a candidate subset (see the sketch below).
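A small sketch of that hyperplane idea (the dimension F, the number of planes k, and the random item vectors are arbitrary here): k random hyperplanes give every item a k-bit signature, and items sharing a signature land in the same candidate bucket.

import numpy as np

np.random.seed(0)
F, k, n_items = 50, 8, 1000
item_vecs = np.random.randn(n_items, F)   # e.g. learned item factor vectors q_i
planes = np.random.randn(k, F)            # k random hyperplanes through the origin

signatures = item_vecs.dot(planes.T) > 0  # n_items x k boolean signatures

buckets = {}
for idx in range(n_items):
    buckets.setdefault(signatures[idx].tobytes(), []).append(idx)

def candidate_items(item_idx):
    # Items falling in the same subspace (bucket) as item_idx.
    return buckets[signatures[item_idx].tobytes()]

print(len(candidate_items(0)))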

Restricted Boltzmann Machines for Collaborative Filtering

It originates from a paper co-authored by Geoffrey Hinton (which I found hard to read). A good tutorial about RBMs can be found here: http://blog.echen.me/2011/07/18/introduction-to-restricted-boltzmann-machines/

The basic idea of RBM:

  1. The input layer is a binary vector of length N (the total number of items), one per user. The number of hidden units is a parameter and can be tuned.
  2. Based on the current weights and biases, compute the hidden unit values.
  3. Based on the computed hidden unit values, recompute the input layer values.
  4. The difference between the recomputed input layer values and the observed input values gives the objective function to minimize (a rough sketch of one such update follows this list).
  5. All users share the same weights and biases.
  6. After training, for a new user with his own item vector, we can see which hidden units are activated and hence determine his taste. Then binarize his taste at the hidden units, recompute the input layer vector, and recommend the items whose input units are activated but which the user has not bought yet.
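A rough sketch of one such training update, written as a single mean-field contrastive-divergence (CD-1) step; the sizes, learning rate and the CD-1 shortcut are my assumptions for illustration, not details taken from the paper:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(0)
n_items, n_hidden, lr = 6, 3, 0.1
W = 0.01 * np.random.randn(n_items, n_hidden)
b_v, b_h = np.zeros(n_items), np.zeros(n_hidden)

v0 = np.array([1, 0, 1, 0, 0, 1], dtype=float)   # the user's binary item vector (step 1)
h0 = sigmoid(v0.dot(W) + b_h)                    # step 2: hidden unit activations
v1 = sigmoid(W.dot(h0) + b_v)                    # step 3: reconstructed input layer
h1 = sigmoid(v1.dot(W) + b_h)

# Step 4: nudge weights and biases so the reconstruction moves toward the data.
W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
b_v += lr * (v0 - v1)
b_h += lr * (h0 - h1)

# Step 6 (after training): large entries of the reconstructed v1 for items the
# user has not bought are the candidates to recommend.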

Deep learning

So far, I have only seen deep learning used in content-based recommendation. One intern at Spotify reported using a Convolutional Neural Network (CNN) on audio signals to extract features that feed into Spotify's recommendation system. Another person who worked at Spotify used a Recurrent Neural Network (RNN) to predict which track a user will play next; the input to the RNN is the user's track-play history.

LDA in Recommendation System

fLDA is a model that borrows ideas from the classic LDA model. It is suited to cases where item metadata is rich. It proposes that:

Screenshot from 2015-11-09 22:40:35

where x_{ij} is the user-item interaction vector, b is the weight vector for x_{ij}, \alpha_i is user i's bias variable, \beta_j is item j's bias variable, s_i is the user's topic vector over k topics (representing the user's interest in the k topics), and \bar{z_j} is the item's soft topic-clustering label. \alpha_i, \beta_j and s_i are all random variables controlled by some form of prior. \bar{z_j} is actually the mean of all words' topic distributions in item j's metadata. The topic of a word in an item is generated in the same way as in the classic LDA model:

Screenshot from 2015-11-09 22:51:06

Unlike the classic LDA model, where z_{jn} is obtained purely by collapsed Gibbs sampling on the text data, here z_{jn} also interacts with the user topic vector s_i and hence helps determine the rating \mu_{ij}. Therefore, when sampling z_{jn} in the training process (the E step), all ratings for the item that the word belongs to are taken into account:

Screenshot from 2015-11-10 00:03:21

That is why in the paper the authors mentioned that:

Screenshot from 2015-11-09 23:53:19

The training process is a bit intimidating as presented in the paper. Note that the authors claim that training takes O(# of ratings + # of words) time.

Other Tutorial Materials

A very complete recommendation-system review tutorial can be found here; it was given by Xavier Amatriain, former research director at Netflix and now at Quora. Another tutorial is from Edwin Chen. An overview of content-based recommendation approaches is given here. An overview of collaborative filtering methods can be found here. A good slide deck can be found here.

My Lempel-Ziv Compressor

The Lempel-Ziv algorithm is a widely known compression algorithm. Its compression rate provably approaches the per-symbol entropy of the source as the sequence to be compressed grows long, i.e., \frac{\text{compressed bits}}{\text{length of symbols } X_1 \cdots X_n } \rightarrow H(X), where H(X) is the per-symbol entropy of the sequence X_1 \cdots X_n. If, for example, you have a sequence of n letters and each letter is encoded with 8-bit ASCII, then the best compressed sequence, also stored in ASCII, will have length about \frac{H(X) \cdot n}{8}.

More details can be found in the classic information theory book by Cover & Thomas.

I implemented my own Lempel-Ziv compressor/decompressor below. I intentionally challenged myself to write them as short as possible (perhaps at the expense of readability).

Code

compress.py

import sys
from math import *

line = sys.stdin.readline()
i, b, d, res = 0, 0, {}, ''  # current pointer, block number,
                             # lookup table, return string
while i < len(line):
    for j in xrange(i+1, len(line)+1):
        if d.get(line[i:j]) is None or j == len(line):
            b += 1
            res += '{0:0{1}b}'.format(b-d.get(line[i:j-1], b), 
                int(ceil(log(b, 2)))) + line[j-1]
            d[line[i:j]] = b
            i, j = j, len(line)  # stop for-loop for j if cond1

sys.stdout.write(res[1:])    # strip prefix for first block

decompress.py

import sys
from math import *
 
line = sys.stdin.readline()
i, b, d, res = 0, 1, {}, ''     # current pointer, block number,
                                # lookup table, return string
while i < len(line):
	r = int(ceil(log(b, 2)))    # prefix range   
	code = d.get(b-int('0'+line[i:i+r], 2), '') + line[i+r]
	d[b] = code
	res += code
	i = i + r + 1
	b += 1

sys.stdout.write(res)

Usage (helper perl scripts and other related files can be downloaded here.):

# file.in contains "jaa"
# file.out contains the compressed code of "jaa"
# file.out1 contains the decompressed string of file.out, which is exactly "jaa".
# 0010100111000100101001111000110011000001
perl preCompress.pl < file.in | python compress.py | perl postCompress.pl > file.out
perl preDecompress.pl < file.out | python decompress.py | perl postDecompress.pl > file.out1

# Encode and decode my own name
echo -n 'zhc' | perl preCompress.pl | python compress.py | python decompress.py | perl postDecompress.pl

# Encode and decode random string
echo -n 'zasdfasdfasdlfasdlhc' | perl preCompress.pl | python compress.py | python decompress.py | perl postDecompress.pl

# Encode and decode files
perl preCompress.pl < alice.txt | python compress.py | perl postCompress.pl > alice.out
perl preDecompress.pl < alice.out | python decompress.py | perl postDecompress.pl > alice.out1

# Decode "CS6800"
echo -n 00110001101100011011100100111101110001010010011001100100011100100000000 | python decompress.py | perl postDecompress.pl

# Generate long sequence according to a markov chain (only three symbols)
python gen_mc.py > mc.in
# Compress long sequence generated according to markov chain
perl preCompress.pl < mc.in | python compress.py | perl postCompress.pl > mc_compressed.out
perl preDecompress.pl < mc_compressed.out | python decompress.py | perl postDecompress.pl > mc_decompressed.out1

Some observations

  1. The higher the per-symbol entropy of the sequence, the worse the compression. I compressed a sequence containing only three symbols to around 30% of its original size (its per-symbol entropy is 1.459 bits/symbol). I also compressed a text (containing the 26 English letters) from Alice in Wonderland, but only to 84% of its original size. As we know, the per-letter entropy of English text is much higher than 1.459 bits/symbol.
  2. I also used a real compressor, `gzip`, to compress my three-symbol sequence; its compression ratio is around 23%. As we know from the proved theory, the ratio cannot go below H(X)/8 = 1.459/8 = 18.2\%.

Graph Model Recap

I wrote a post about the sum-product algorithm a while ago: http://maider.blog.sohu.com/307377602.html

At that time, I had a fresh memory of how graphical models work. However, reading the post again today, I got lost. Therefore, I want to recap the basic motivations of graphical models; this post can serve as an intro to the old one. I think remembering the motivation for an algorithm matters more than the details.

What does a Graph Model represent?

A graphical model represents a joint distribution over all the variables appearing in the graph. There are three types of graphical models: directed graphs, undirected graphs, and factor graphs. I find the directed model the most intuitive because you can read causal relationships from it. (See the directed graph of the sex-gender-disease example I made in the sum-product algorithm post.) I am less familiar with undirected graphical models, though. A factor graph is usually converted from a directed or undirected model for inference and learning purposes, so no matter what your original model is (directed or undirected), the learning algorithms all operate on the corresponding factor graph.

A summary from Bishop’s book:

Screenshot from 2015-10-28 13:32:03

Another summary can be found in his paper.

Factor Graph

A factor graph consists of multiple factors, each corresponding to a clique over a subset of the variables. Say you want to calculate the marginal distribution $latex p(x_i)$ from the joint distribution $latex p(x_1, x_2, \cdots, x_N)$. The naive way to calculate it is:

  1. enumerate all possible configurations of $latex (x_1, x_2, \cdots, x_{i-1}, x_{i+1}, \cdots, x_N)$.
  2. for each configuration $latex \hat{x}$, calculate the joint probability $latex p(x_i, \hat{x})$.
  3. sum the joint probabilities over all configurations, which gives $latex p(x_i)$.

If $latex x_1$ can take $latex |x_1|$ different values, $latex x_2$ can take $latex |x_2|$ different values, …, we will need to sum up $latex \prod\limits_{j \neq i} |x_j|$ joint probabilities.

Factor graphs help eliminate unnecessary computation. For example, if we have only four variables and the joint distribution is $latex p(x_1, x_2, x_3, x_4) = p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_3) &s=2$, then $latex p(x_4)=\Big[\sum\limits_{x_3}p(x_4|x_3)\big[\sum\limits_{x_2}p(x_3|x_2)[\sum\limits_{x_1}p(x_1)p(x_2|x_1)]\big]\Big] &s=2$. Now you only need roughly $latex |x_1||x_2| + |x_2||x_3| + |x_3|$ addition operations (a small numerical check follows).
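A small numerical check of the four-variable chain above, comparing the naive marginalization with the factored (pushed-in) sums; the numbers of states are arbitrary:

import itertools
import numpy as np

np.random.seed(0)

def cond(rows, cols):
    # Random conditional table t[b, a] = p(child=b | parent=a); each column sums to 1.
    t = np.random.rand(rows, cols)
    return t / t.sum(axis=0)

n1, n2, n3, n4 = 3, 4, 2, 5
p1 = np.random.rand(n1); p1 /= p1.sum()                   # p(x1)
p21, p32, p43 = cond(n2, n1), cond(n3, n2), cond(n4, n3)  # p(x2|x1), p(x3|x2), p(x4|x3)

# Naive: sum the full joint over every configuration of (x1, x2, x3).
naive = np.zeros(n4)
for a, b, c, d in itertools.product(range(n1), range(n2), range(n3), range(n4)):
    naive[d] += p1[a] * p21[b, a] * p32[c, b] * p43[d, c]

# Factored: push the sums inward, exactly as in the nested expression above.
m2 = p21.dot(p1)        # sum over x1
m3 = p32.dot(m2)        # sum over x2
factored = p43.dot(m3)  # sum over x3 -> p(x4)

assert np.allclose(naive, factored)
print(factored)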

What does Sum-Product algorithm do?

The sum-product algorithm can efficiently calculate the marginal probability of any variable in the model. It is not, however, an inference (parameter-learning) algorithm; i.e., it is not used to learn the parameters of a graphical model.

 

What do inference algorithms do?

Inference algorithms learn the parameters of graphical models. After that, you can use the sum-product algorithm to calculate the marginal probability of any variable in the model.