Bayesian linear regression

Ordinary least squares (OLS) linear regression produces a point estimate of the weight vector w by solving \arg\min_w \left\Vert Y - Xw\right\Vert^2 under the model Y = Xw + \epsilon. If we further assume normality of the errors, \epsilon \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I}), with a fixed point estimate of \sigma^2, we can also derive confidence intervals and prediction intervals (see the discussion at the end of [2]). Instead of point estimates, Bayesian linear regression treats w and \sigma^2 as random variables and learns their posterior distribution from data. In my view, Bayesian linear regression is a more flexible method because it supports incorporating prior knowledge about the parameters, and the posterior distributions it provides enable richer uncertainty analysis and facilitate other tasks [3], for example Thompson sampling in contextual bandit problems, which we will cover in the future. However, it is not a panacea: it does not generally improve prediction accuracy unless an informative prior is provided.
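For comparison with the Bayesian treatment below, here is a minimal numpy sketch of the OLS point estimate; the toy data and all variable names are my own illustrative choices:

```python
import numpy as np

# Toy data: n observations, d features (illustrative values only).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + rng.normal(scale=0.3, size=n)  # Y = Xw + noise

# OLS point estimate: w_hat = argmin_w ||Y - Xw||^2 = (X^T X)^{-1} X^T Y.
# lstsq solves the same problem in a numerically stable way.
w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```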

The foundation of Bayesian methods is Bayes' theorem, which states:

p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)} \propto p(D|\theta)p(\theta)

Specifically, in Bayesian linear regression, D represents the observed data Y and X, and \theta refers to w and \sigma^2. So the formula above can be rewritten as:

p(w, \sigma^2|Y, X) \propto p(Y|X, w, \sigma^2) p(w, \sigma^2)

The procedure for adopting Bayesian methods is to (1) define the likelihood and a proper prior distribution over the parameters of interest; (2) compute the posterior distribution according to Bayes' theorem; (3) use the posterior distribution for downstream tasks, such as predicting Y for new X.

The likelihood function of linear regression is:

p(Y|X, w, \sigma^2)=(2\pi\sigma^2)^{-n/2} \exp\big\{-\frac{(Y-Xw)^T(Y-Xw)}{2\sigma^2}\big\}

which can be read as the probability of observing the data under the assumption that Y is normally distributed with mean Xw and covariance \sigma^2\boldsymbol{I}.
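As a minimal sketch (the function name and arguments are my own, illustrative choices), the log of this likelihood can be computed as:

```python
import numpy as np

def log_likelihood(Y, X, w, sigma2):
    """Log of p(Y | X, w, sigma^2) under Y ~ N(Xw, sigma^2 * I)."""
    n = len(Y)
    resid = Y - X @ w
    # -n/2 * log(2*pi*sigma^2) - (Y-Xw)^T (Y-Xw) / (2*sigma^2)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)
```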

To get an analytical expression for the posterior distribution p(w,\sigma^2|Y, X), we usually require the prior to be conjugate to the likelihood, i.e., in the same probability distribution family. Here we can treat the likelihood p(Y|X, w, \sigma^2) as a member of a two-parameter exponential family in w and \sigma^2 (the concept is illustrated in Chapter 4 of [6]). Therefore, the prior p(w, \sigma^2) can be modeled as a Normal-inverse-Gamma (NIG) distribution:

p(w, \sigma^2) = p(w|\sigma^2)\, p(\sigma^2) = \mathcal{N}(w_0, \sigma^2 V_0) \cdot IG(\alpha_0, \beta_0)

The inverse gamma IG(\cdot) is equivalent to the scaled inverse chi-squared distribution; they differ only in parameterization [9]. We can denote p(w, \sigma^2) as NIG(w_0, V_0, \alpha_0, \beta_0). Note that we write p(w | \sigma^2), meaning that w and \sigma^2 are not independent. For one thing, if we modeled p(w, \sigma^2)=p(w) \cdot p(\sigma^2), we would not get a conjugate prior. For another, if you think of (X, Y) as generated by some process governed by w and \sigma^2, then w and \sigma^2 are dependent conditioned on (X, Y) (see Section 3 of [8]).
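To make the prior concrete, here is a hedged sketch of drawing one sample from NIG(w_0, V_0, \alpha_0, \beta_0) with scipy: first draw \sigma^2 from the inverse gamma, then w | \sigma^2 from the conditional normal. The helper name is mine, not from any particular library:

```python
import numpy as np
from scipy import stats

def sample_nig(w0, V0, alpha0, beta0, rng):
    """Illustrative helper: draw one (w, sigma^2) from NIG(w0, V0, alpha0, beta0)."""
    # sigma^2 ~ IG(alpha0, beta0); scipy's invgamma takes shape a and scale.
    sigma2 = stats.invgamma.rvs(a=alpha0, scale=beta0, random_state=rng)
    # w | sigma^2 ~ N(w0, sigma^2 * V0).
    w = rng.multivariate_normal(w0, sigma2 * V0)
    return w, sigma2
```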

Now, given that the likelihood and the prior are in the same probability distribution family, the posterior distribution is also a NIG distribution:

p(w, \sigma^2|Y, X) = NIG(w_1, V_1, \alpha_1, \beta_1),

where:

w_1 = (V_0^{-1}+X^TX)^{-1}(V_0^{-1}w_0+X^T Y)

V_1 = (V_0^{-1}+X^TX)^{-1}

\alpha_1 = \alpha_0 + \frac{1}{2}n

\beta_1 = \beta_0 + \frac{1}{2}(w_0^T V_0^{-1}w_0 + Y^T Y - w_1^TV_1^{-1}w_1)
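A minimal numpy sketch of these four update equations (function and variable names are my own):

```python
import numpy as np

def nig_posterior(X, Y, w0, V0, alpha0, beta0):
    """Conjugate update: NIG(w0, V0, alpha0, beta0) prior -> NIG(w1, V1, alpha1, beta1)."""
    n = len(Y)
    V0_inv = np.linalg.inv(V0)
    V1 = np.linalg.inv(V0_inv + X.T @ X)
    w1 = V1 @ (V0_inv @ w0 + X.T @ Y)
    alpha1 = alpha0 + 0.5 * n
    # Use V1^{-1} = V0^{-1} + X^T X to avoid inverting V1 again.
    beta1 = beta0 + 0.5 * (w0 @ V0_inv @ w0 + Y @ Y - w1 @ (V0_inv + X.T @ X) @ w1)
    return w1, V1, alpha1, beta1
```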

The posterior predictive distribution (for predicting new data) integrates out w and \sigma^2; note that Y_{new} is conditionally independent of (Y, X) given w and \sigma^2:

p(Y_{new}|X_{new}, Y, X) = \int \int p(Y_{new}|X_{new}, w, \sigma^2)\, p(w, \sigma^2|Y, X)\, dw\, d\sigma^2

The result is a Student-t distribution, but the derivation is quite involved; see Section 6 of [10] and Section 3 of [8] for details. I know from other online posts [11, 12, 13] that practical libraries often do not compute the analytical form of the posterior distribution but instead rely on sampling techniques such as MCMC. However, even with posterior samples in hand, I don't know exactly how they implement the posterior predictive distribution; one plausible Monte Carlo approach is sketched below. This is worth further investigation in the future.
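Here is that sketch, under the conjugate model above and reusing the illustrative sample_nig helper from earlier (again, names are mine and this is one plausible approach, not any library's actual implementation): draw (w, \sigma^2) from the posterior, then draw Y_{new} \sim \mathcal{N}(X_{new} w, \sigma^2 \boldsymbol{I}) for each posterior draw. The empirical distribution of these draws approximates the Student-t predictive.

```python
import numpy as np

def predictive_samples(X_new, w1, V1, alpha1, beta1, n_draws=5000, seed=0):
    """Monte Carlo approximation of p(Y_new | X_new, Y, X)."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        # (w, sigma^2) ~ posterior NIG(w1, V1, alpha1, beta1); sample_nig sketched earlier.
        w, sigma2 = sample_nig(w1, V1, alpha1, beta1, rng)
        # Y_new | w, sigma^2 ~ N(X_new w, sigma^2 * I).
        draws.append(rng.normal(X_new @ w, np.sqrt(sigma2)))
    # Empirical predictive: its mean and quantiles give point and interval estimates.
    return np.array(draws)
```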

Side notes

We touched upon Bayesian linear regression when introducing Bayesian optimization [1]. [4] is also a good resource on Bayesian linear regression. However, [1] and [4] only assume w is a random variable while \sigma^2 remains a fixed point estimate. This post goes fully Bayesian by assuming both w and \sigma^2 are random variables whose joint distribution follows the so-called Normal-inverse-Gamma distribution. There aren't too many resources in the same vein, though. What I've found so far are [5] and Section 2 of [7].

References

[1] https://czxttkl.com/?p=3212

[2] https://czxttkl.com/2015/03/22/logistic-regression%e6%98%af%e5%a6%82%e4%bd%95%e5%bd%a2%e6%88%90%e7%9a%84%ef%bc%9f/logistic-regression%e6%98%af%e5%a6%82%e4%bd%95%e5%bd%a2%e6%88%90%e7%9a%84%ef%bc%9f

[3] https://wso2.com/blog/research/part-two-linear-regression

[4] http://fourier.eng.hmc.edu/e176/lectures/ch7/node16.html

[5] A Guide to Bayesian Inference for Regression Problems: https://www.ptb.de/emrp/fileadmin/documents/nmasatue/NEW04/Papers/BPGWP1.pdf

[6] Bolstad, W. M. (2010). Understanding computational Bayesian statistics (Vol. 644). John Wiley & Sons.

[7] Denison, D. G., Holmes, C. C., Mallick, B. K., & Smith, A. F. (2002). Bayesian methods for nonlinear classification and regression (Vol. 386). John Wiley & Sons.

[8] https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf

[9] https://en.wikipedia.org/wiki/Scaled_inverse_chi-squared_distribution

[10] Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution.

[11] https://wso2.com/blog/research/part-two-linear-regression

[12] https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7

[13] https://towardsdatascience.com/bayesian-linear-regression-in-python-using-machine-learning-to-predict-student-grades-part-2-b72059a8ac7e
