Laplace Approximation and Bayesian Logistic Regression

Recently, I have been studying a basic contextual bandit algorithm called the Logistic Regression Bandit, which is also known as Bayesian Logistic Regression. It all starts from the paper “An Empirical Evaluation of Thompson Sampling” [1]. In Algorithm 3 of the paper, the authors give the pseudo-code for training the Logistic Regression Bandit, assuming its weights come from a Gaussian distribution with a diagonal covariance matrix.

When I first read it, my biggest confusion was the Laplace approximation – why is it used in training a Bayesian logistic regression? Let’s dig into how the Laplace approximation comes into play.

Let’s first review the setup of a Bayesian logistic regression. Suppose we have features \mathbf{x} \in \mathbb{R}^d and a dataset \mathcal{D} that contains tuples (\mathbf{x}_i, y_i) for i=1, \cdots, N, the observed features and labels. A Bayesian logistic regression on such data has a weight vector of the same dimension as the features, \mathbf{w} \in \mathbb{R}^d, and its prediction function is the sigmoid: p(\mathbf{x}) = \left( 1+\exp(-\mathbf{w}^T\mathbf{x})\right)^{-1}.
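To make this concrete, here is a minimal NumPy sketch of the prediction function (the names predict_proba, w and x are mine, not from [1]):

import numpy as np

def predict_proba(w, x):
    # Sigmoid prediction: p(x) = 1 / (1 + exp(-w^T x))
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

# Example with d = 3 features
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.2])
print(predict_proba(w, x))  # probability that y = 1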

Because it is a Bayesian logistic regression, we need to estimate the posterior distribution of \mathbf{w} given the observations \mathcal{D}, p(\mathbf{w}|\mathcal{D}). We assume that \mathbf{w} comes from a multivariate Gaussian distribution with a diagonal covariance matrix, which is what Algorithm 3’s notation implies: w_i \sim \mathcal{N}(m_i, q_i^{-1})

We also have the likelihood function p(\mathcal{D}|\mathbf{w})=\prod\limits_{j=1}^N p(\mathbf{x}_j)^{y_j} \left(1-p(\mathbf{x}_j)\right)^{1-y_j}, where the labels are y_j \in \{0, 1\}.
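In code we would evaluate the log-likelihood rather than the product itself, which underflows quickly as N grows. A small sketch, assuming labels y_j \in \{0, 1\} and a feature matrix X whose j-th row is \mathbf{x}_j (the function name is mine):

import numpy as np

def log_likelihood(w, X, y):
    # log p(D|w) = sum_j [ y_j * log p(x_j) + (1 - y_j) * log(1 - p(x_j)) ]
    p = 1.0 / (1.0 + np.exp(-X @ w))      # p(x_j) for every row of X
    p = np.clip(p, 1e-12, 1.0 - 1e-12)    # avoid log(0)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))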

Using Bayes’ theorem, we can derive (with the notation of [2] (slides pages 9~12) adapted to the notation in this post):

p(\mathbf{w}|\mathcal{D}) \propto p(\mathbf{w})p(\mathcal{D}|\mathbf{w})\newline\quad\qquad=\prod\limits_{i=1}^d \mathcal{N}(w_i; m_i, q_i^{-1}) \prod\limits_{j=1}^N p(\mathbf{x}_j)^{y_j} \left(1-p(\mathbf{x}_j)\right)^{1-y_j}

Now, here is how the Laplace approximation [3] comes into play. The goal of the Laplace approximation is to find a Gaussian distribution that approximates p(\mathbf{w}|\mathcal{D}), which is not a Gaussian distribution in its exact expression. In other words, if each weight’s posterior is approximated as p(w_i|\mathcal{D}) \approx \mathcal{N}(m_i', q_i'^{-1}), what should m_i' and q_i' be?

First, we take the logarithm of both sides. Writing the likelihood term with labels mapped to y_j \in \{-1, +1\} (so that the two Bernoulli cases collapse into a single logistic-loss term), this leads to:

\log p(\mathbf{w}|\mathcal{D}) = -\left( \frac{1}{2}\sum\limits_{i=1}^d q_i (w_i - m_i)^2 + \sum\limits_{j=1}^N \log\left(1 + \exp\left( -y_j \mathbf{w}^T \mathbf{x}_j\right)\right) \right) + const \newline= f(\mathbf{w}) + const
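As a sanity check, f(\mathbf{w}) is easy to write down in NumPy. A sketch, assuming labels have already been mapped to y_j \in \{-1, +1\} and with m and q holding the prior means and precisions (all names are mine):

import numpy as np

def f(w, X, y_pm, m, q):
    # Unnormalized log-posterior, up to an additive constant.
    # X: (N, d) features, y_pm: (N,) labels in {-1, +1},
    # m: (d,) prior means, q: (d,) prior precisions.
    prior_term = 0.5 * np.sum(q * (w - m) ** 2)
    data_term = np.sum(np.log1p(np.exp(-y_pm * (X @ w))))
    return -(prior_term + data_term)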

Suppose f(\mathbf{w}) achieves its maximum at \mathbf{w}_0. f(\mathbf{w}) is concave (it is the negative of a convex quadratic plus a sum of convex logistic losses), so \mathbf{w}_0 can be found with gradient ascent or Newton’s method; there is no closed-form solution in general. Based on the second-order Taylor expansion at \mathbf{w}_0, we have: f(\mathbf{w}) \approx f(\mathbf{w}_0) + \nabla f(\mathbf{w}_0)^T(\mathbf{w}-\mathbf{w}_0)+\frac{1}{2}(\mathbf{w} - \mathbf{w}_0)^T H(\mathbf{w}_0)(\mathbf{w} - \mathbf{w}_0), where H(\mathbf{w}_0) is the Hessian of f at \mathbf{w}_0. Because \nabla f(\mathbf{w}_0)=0 at the maximum, this simplifies to f(\mathbf{w}) \approx f(\mathbf{w}_0) +\frac{1}{2}(\mathbf{w} - \mathbf{w}_0)^T H(\mathbf{w}_0)(\mathbf{w} - \mathbf{w}_0).
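Because f is concave, any off-the-shelf optimizer will find the unique maximizer. A sketch that minimizes -f with scipy.optimize.minimize, reusing the hypothetical f from above:

import numpy as np
from scipy.optimize import minimize

def find_w0(X, y_pm, m, q):
    # w_0 = argmax f(w), found by minimizing -f(w) starting from the prior mean.
    result = minimize(lambda w: -f(w, X, y_pm, m, q),
                      x0=np.asarray(m, dtype=float),
                      method="L-BFGS-B")
    return result.x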

f(\mathbf{w}_0) +\frac{1}{2}(\mathbf{w} - \mathbf{w}_0)^T H(\mathbf{w}_0)(\mathbf{w} - \mathbf{w}_0) can actually be seen as the logarithm of another (unnormalized) Gaussian distribution whose mean is \mathbf{w}_0 and whose precision matrix (inverse covariance) is -H(\mathbf{w}_0); f(\mathbf{w}_0) gets absorbed into the normalizing constant. Keeping only the diagonal of -H(\mathbf{w}_0) gives exactly the diagonal Gaussian \mathcal{N}(m_i', q_i'^{-1}) we were looking for. This is why Algorithm 3 updates m_i = w_i, because \mathbf{w}_0 is the \mathbf{w} that maximizes f(\mathbf{w}), and q_i = q_i + \sum\limits_{j=1}^N x^2_{ij}p_j(1-p_j), because the right-hand side is the i-th diagonal entry of -H(\mathbf{w}_0), with p_j = \left(1+\exp(-\mathbf{w}_0^T\mathbf{x}_j)\right)^{-1}.
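Putting the pieces together, here is a sketch of the full diagonal Laplace update described above, reusing the hypothetical f and find_w0 sketches; the names are mine and this is not the paper’s exact pseudo-code:

import numpy as np

def laplace_update(X, y_pm, m, q):
    # One batch update: returns the new means and precisions (m', q').
    w0 = find_w0(X, y_pm, m, q)            # posterior mode (MAP estimate)
    p = 1.0 / (1.0 + np.exp(-X @ w0))      # p_j = sigmoid(w0^T x_j)
    # q_i <- q_i + sum_j x_ij^2 * p_j * (1 - p_j), the diagonal of -H(w0)
    q_new = q + np.sum((X ** 2) * (p * (1.0 - p))[:, None], axis=0)
    return w0, q_new

Starting from a fresh prior (for example m = 0 and q equal to some regularization strength in every coordinate) and calling laplace_update on each incoming batch of observations gives the training loop for the bandit’s reward model.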

To summarize, Laplace Approximation finds a new Gaussian distribution to approximate p(\mathbf{w}|\mathcal{D})!

 

References

[1] An Empirical Evaluation of Thompson Sampling: https://papers.nips.cc/paper/2011/hash/e53a0a2978c28872a4505bdb51db06dc-Abstract.html

[2] https://cs.iupui.edu/~mdundar/CSCIMachineLearning/Lecture12.pdf

[3] https://bookdown.org/rdpeng/advstatcomp/laplace-approximation.html

[4] https://ufal.mff.cuni.cz/~jurcicek/NPFL108-BI-2014LS/04-approximate-inference-laplace-approximation.pdf
