
Notes on Bayesian Linear Regression

CS 6957: Probabilistic Modeling

February 11, 2013

Linear Regression Model

We are considering a random variable y as a function of a (typically non-random) vector-valued variable x ∈ R^k. This is modeled as a linear relationship, with coefficients β_j, plus i.i.d. Gaussian random noise:

y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ik}\beta_k + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2).

In matrix form, this looks like

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.

Or, simply

y = Xβ + ε.

It is common to set the first column of X to a constant column of 1's, so that β_1 is an intercept term.
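As a concrete illustration, here is a minimal simulation sketch in R (the course's code examples are in R); the sample size, true coefficients, and noise level are arbitrary values chosen only for illustration.

    # Simulate data from the linear model y = X beta + epsilon.
    set.seed(1)
    n <- 50                                # number of observations
    x <- runif(n, -2, 2)                   # a single scalar covariate
    X <- cbind(1, x)                       # first column of 1's gives the intercept beta_1
    beta_true <- c(0.5, 2.0)               # (intercept, slope), arbitrary "true" values
    sigma <- 0.3                           # noise standard deviation
    y <- drop(X %*% beta_true) + rnorm(n, 0, sigma)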

Ordinary Least-Squares (OLS)

Our goal is to estimate the unknown parameters in β. The maximum-likelihood estimate (MLE) of β is based on the Gaussian likelihood:

p(y \mid X, \beta; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} \|y - X\beta\|^2 \right).

Keep in mind that this is a product of likelihoods for each of the individual components of y = (y_1, . . . , y_n). Taking the log of this likelihood,

\ln p(y \mid X, \beta; \sigma^2) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|y - X\beta\|^2,

and then taking the derivative w.r.t. β, we get

\nabla_\beta \ln p(y \mid X, \beta; \sigma^2) = \frac{1}{\sigma^2} X^T (y - X\beta).

Setting this derivative equal to zero and solving for β gives the MLE or OLS estimate,

\hat{\beta} = (X^T X)^{-1} X^T y,

where the inverse here is the Moore-Penrose pseudoinverse (same as the inverse when it exists). The MLE for β is Gaussian distributed and unbiased:

\hat{\beta} \sim N\left(\beta, \sigma^2 (X^T X)^{-1}\right).
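Continuing the simulated X, y, and sigma from the sketch above, the OLS estimate and its sampling covariance can be computed directly from these formulas; the lm() fit at the end is included only as a sanity check.

    XtX <- t(X) %*% X
    beta_hat <- solve(XtX, t(X) %*% y)     # (X^T X)^{-1} X^T y, without forming the inverse explicitly
    cov_beta_hat <- sigma^2 * solve(XtX)   # sampling covariance sigma^2 (X^T X)^{-1}
    coef(lm(y ~ x))                        # R's built-in least-squares fit; should match beta_hat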


Bayesian Linear Regression

As seen in the polynomial regression code example (BayesianLinearRegression.r), the entries of β can be overinflated for higher-order coefficients, as the model tries to overfit the data with a "wiggly" curve. To counteract this, we may inject our prior belief that these coefficients should not be so large. So, we introduce a conjugate Gaussian prior, β ∼ N(0, Λ^{-1}). Here we are parameterizing the Gaussian using the inverse covariance, or precision matrix, Λ, which will make computations easier. A common choice is Λ = λI, for a positive scalar parameter λ.

Now, the posterior for β is

p(\beta \mid y; X, \sigma^2) \propto \exp\left( -\frac{1}{2\sigma^2} \|y - X\beta\|^2 - \frac{1}{2} \beta^T \Lambda \beta \right).

Just as we worked out in the univariate case, the conjugate prior for β results in the posterior also being a multivariate Gaussian. Completing the square inside the exponent, we see that the posterior for β has the following distribution:

\beta \sim N(\mu_n, \Sigma_n),

where

\mu_n = \left(X^T X + \sigma^2 \Lambda\right)^{-1} X^T y, \qquad (1)

\Sigma_n = \sigma^2 \left(X^T X + \sigma^2 \Lambda\right)^{-1}. \qquad (2)

Compare this to the MLE, and note what happens when Λ → 0 (and think of how this would be the noninformative Jeffreys prior on β in the limit).
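In particular, as Λ → 0,

\mu_n \to (X^T X)^{-1} X^T y = \hat{\beta}, \qquad \Sigma_n \to \sigma^2 (X^T X)^{-1},

which are exactly the MLE and its sampling covariance.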

Exercise: Complete the square in the posterior to derive the formulas for μ_n and Σ_n.
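A minimal R sketch of equations (1) and (2), continuing the simulated X, y, and sigma from above and using the common choice Λ = λI; the value of λ here is an arbitrary illustrative setting.

    lambda <- 1.0                          # prior precision scale (arbitrary illustrative value)
    Lambda <- lambda * diag(ncol(X))       # the common choice Lambda = lambda * I
    A <- t(X) %*% X + sigma^2 * Lambda     # the matrix inverted in (1) and (2)
    mu_n <- solve(A, t(X) %*% y)           # posterior mean, equation (1)
    Sigma_n <- sigma^2 * solve(A)          # posterior covariance, equation (2)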

Posterior Predictive Distribution

Now, say we are given a new independent data point x, and we would like to predict the corresponding unseen dependent value, y. Remember, the posterior predictive distribution of y is given by,

p(y \mid y; x, X, \sigma^2, \Lambda) = \int p(y \mid \beta; x, \sigma^2)\, p(\beta \mid y; X, \sigma^2, \Lambda)\, d\beta.

This is now a univariate Gaussian:

y \mid y \sim N\left(x^T \mu_n, \sigma_n^2(x)\right),

where

\sigma_n^2(x) = \sigma^2 + x^T \Sigma_n x.

The first term on the right is due to the noise (the additive ε), and the second term is due to the posterior variance of β, which represents our uncertainty in the parameters.

Exercise: Derive the formulas above for the mean and variance of p(y | y). You might find this Wikipedia topic useful: http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions
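Continuing the same R sketch, the predictive mean and variance at a new input follow directly from the formulas above; the test point below is an arbitrary choice (note the leading 1 for the intercept column).

    x_new <- c(1, 1.5)                                            # intercept term plus a new covariate value
    pred_mean <- drop(t(x_new) %*% mu_n)                          # x^T mu_n
    pred_var  <- sigma^2 + drop(t(x_new) %*% Sigma_n %*% x_new)   # sigma^2 + x^T Sigma_n x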
