Recently, I have been studying a basic contextual bandit algorithm called the Logistic Regression Bandit, which is also known as Bayesian Logistic Regression. It all starts from the paper “An Empirical Evaluation of Thompson Sampling” [1]. In Algorithm 3 of the paper, the authors give the pseudo-code for how to train the Logistic Regression Bandit, assuming its weights come from a Gaussian distribution with a diagonal covariance matrix; a rough sketch of that update step is given below.
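Here is a minimal Python sketch of how I understand that update step from the paper: fit the MAP weights of a regularized logistic regression, then refresh the per-coordinate means and precisions. The function and variable names (`laplace_update`, `m`, `q`) are my own, and the choice of optimizer is arbitrary; treat this as an illustration, not the paper's reference implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def laplace_update(m, q, X, y):
    """One batch update of the diagonal-Gaussian weight posterior.

    m: (d,) current means; q: (d,) current precisions;
    X: (n, d) feature rows; y: (n,) labels in {-1, +1}.
    Returns the updated (m, q).
    """
    def neg_log_posterior(w):
        # 0.5 * sum_i q_i (w_i - m_i)^2 + sum_j log(1 + exp(-y_j w^T x_j))
        margins = y * (X @ w)
        return 0.5 * np.sum(q * (w - m) ** 2) + np.sum(np.logaddexp(0.0, -margins))

    def grad(w):
        margins = y * (X @ w)
        return q * (w - m) - X.T @ (y * expit(-margins))

    # Find the MAP weights (the objective is convex in w, so this is a global optimum).
    w_map = minimize(neg_log_posterior, x0=m, jac=grad, method="L-BFGS-B").x

    # Laplace step: new mean = MAP weights; new precision = diagonal of the
    # negative Hessian of the log posterior at the MAP.
    p = expit(X @ w_map)
    q_new = q + (X ** 2).T @ (p * (1.0 - p))
    return w_map, q_new
```

`np.logaddexp(0, -margins)` is just a numerically stable way to compute $\log(1 + \exp(-y_j w^\top x_j))$.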
When I first read it, my biggest confusion was the Laplace approximation: why is it used in training a Bayesian logistic regression? Let's dig into how the Laplace approximation comes into play.
Let’s first review the setup of a Bayesian logistic regression. Suppose we have a feature space $\mathbb{R}^d$. We have a dataset $D$ which contains tuples $(x_j, y_j)$ for $j = 1, \dots, n$, where $x_j \in \mathbb{R}^d$ and $y_j \in \{-1, 1\}$. These are the observed features and labels. If we apply a Bayesian logistic regression on such data, the logistic regression will have a weight vector of the same dimension as the features: $w \in \mathbb{R}^d$. The prediction function is a sigmoid function: $p(y = 1 \mid x, w) = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)}$.
Because it is a Bayesian logistic regression, we need to estimate a posterior probability distribution for $w$ given the observations $D$: $p(w \mid D)$. We assume that $w$ comes from a multi-dimensional Gaussian distribution with a diagonal covariance matrix. This is the same as what Algorithm 3’s notation implies: each weight has its own mean and precision, $w_i \sim \mathcal{N}(m_i, q_i^{-1})$.
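Written out, with $q_i$ as a precision (inverse variance), this diagonal-covariance Gaussian is simply a product of independent one-dimensional Gaussians:

$$p(w) = \prod_{i=1}^{d} \mathcal{N}\!\left(w_i \mid m_i, q_i^{-1}\right) = \prod_{i=1}^{d} \sqrt{\frac{q_i}{2\pi}}\, \exp\!\left(-\frac{q_i}{2}\,(w_i - m_i)^2\right).$$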
We can also write down the likelihood function:

$$p(D \mid w) = \prod_{j=1}^{n} p(y_j \mid x_j, w) = \prod_{j=1}^{n} \frac{1}{1 + \exp(-y_j\, w^\top x_j)}.$$
Using Bayes’ theorem, we can derive (with the notation of [2], slides 9~12, adapted to the notation in this post):

$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)} \propto p(D \mid w)\, p(w).$$
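Plugging in the Gaussian prior and the logistic likelihood above and taking the negative logarithm gives, up to an additive constant, exactly the objective that Algorithm 3 minimizes to fit the weights:

$$-\log p(w \mid D) = \frac{1}{2}\sum_{i=1}^{d} q_i\,(w_i - m_i)^2 + \sum_{j=1}^{n} \log\!\left(1 + \exp(-y_j\, w^\top x_j)\right) + \text{const}.$$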
Now, here is how the Laplace approximation [3] comes into play. The goal of the Laplace approximation is to learn a Gaussian distribution to approximate $p(w \mid D)$, which is not a Gaussian distribution in its exact expression. In other words, if $p(w \mid D) \approx \mathcal{N}(\mu, \Sigma)$, what should $\mu$ and $\Sigma$ be?
First, we take the logarithm on both sides, leading to:

$$f(w) \triangleq \log p(w \mid D) = \log p(D \mid w) + \log p(w) - \log p(D).$$
Suppose $w_{\text{MAP}}$ achieves the maximum of $f(w)$; it can be found using gradient ascent (there is no closed-form solution for logistic regression, but $f(w)$ is concave because both the logistic log-likelihood and the Gaussian log-prior are concave in $w$, so gradient ascent reaches the global maximum). Based on the 2nd-order Taylor expansion at the point $w_{\text{MAP}}$, we have:

$$f(w) \approx f(w_{\text{MAP}}) + \nabla f(w_{\text{MAP}})^\top (w - w_{\text{MAP}}) + \frac{1}{2} (w - w_{\text{MAP}})^\top H\, (w - w_{\text{MAP}}),$$

where $H = \nabla^2 f(w_{\text{MAP}})$ is the Hessian at $w_{\text{MAP}}$. Because $\nabla f(w_{\text{MAP}}) = 0$ at the maximum, we can further simplify this to

$$f(w) \approx f(w_{\text{MAP}}) + \frac{1}{2} (w - w_{\text{MAP}})^\top H\, (w - w_{\text{MAP}}).$$
The right-hand side, $f(w_{\text{MAP}}) + \frac{1}{2}(w - w_{\text{MAP}})^\top H (w - w_{\text{MAP}})$, can actually be seen as the logarithm of another Gaussian distribution $\mathcal{N}(w_{\text{MAP}}, -H^{-1})$, whose mean is $w_{\text{MAP}}$ and whose covariance is $-H^{-1}$; the term $f(w_{\text{MAP}})$ absorbs any remaining normalization constants. This is why Algorithm 3 updates $m_i \leftarrow w_i$ and $q_i \leftarrow q_i + \sum_{j=1}^{n} x_{ij}^2\, p_j (1 - p_j)$: the new mean is the $w$ that maximizes $\log p(w \mid D)$, and with the covariance restricted to be diagonal, $q_i$ is the $i$-th diagonal entry of the negative Hessian $-H$, which is essentially the precision of $w_i$.
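To see where the $x_{ij}^2\, p_j(1-p_j)$ term comes from, differentiate $f(w)$ twice. With the Gaussian prior $w_i \sim \mathcal{N}(m_i, q_i^{-1})$ and the logistic likelihood above, the negative Hessian at $w_{\text{MAP}}$ is

$$-H = \mathrm{diag}(q) + \sum_{j=1}^{n} p_j (1 - p_j)\, x_j x_j^\top, \qquad p_j = \frac{1}{1 + \exp(-w_{\text{MAP}}^\top x_j)},$$

and keeping only its diagonal (to stay within the diagonal-covariance family) gives exactly the update $q_i \leftarrow q_i + \sum_{j=1}^{n} x_{ij}^2\, p_j(1 - p_j)$.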
To summarize, the Laplace approximation finds a new Gaussian distribution $\mathcal{N}(w_{\text{MAP}}, -H^{-1})$ to approximate $p(w \mid D)$!
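For completeness, here is a minimal sketch of how this approximate posterior is then used for exploration via Thompson sampling, as in the bandit setting of [1]. The function name `select_arm` and the arm-feature matrix layout are my own assumptions, not notation from the paper.

```python
import numpy as np

def select_arm(m, q, arm_features, rng=None):
    """Draw one weight sample from N(m, diag(1/q)) and pick the arm whose
    features give the highest predicted probability of a positive label."""
    rng = np.random.default_rng() if rng is None else rng
    w_sample = rng.normal(m, 1.0 / np.sqrt(q))                 # one posterior draw
    scores = 1.0 / (1.0 + np.exp(-(arm_features @ w_sample)))  # sigmoid score per arm
    return int(np.argmax(scores))
```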
References
[1] O. Chapelle and L. Li, “An Empirical Evaluation of Thompson Sampling”: https://papers.nips.cc/paper/2011/hash/e53a0a2978c28872a4505bdb51db06dc-Abstract.html
[2] M. Dundar, CSCI Machine Learning, Lecture 12 slides: https://cs.iupui.edu/~mdundar/CSCIMachineLearning/Lecture12.pdf
[3] R. D. Peng, Advanced Statistical Computing, “Laplace Approximation”: https://bookdown.org/rdpeng/advstatcomp/laplace-approximation.html