The Bayesian Linear Model

Sudipto Banerjee

[email protected]

University of Minnesota


Linear Model Basics

The linear model is the most fundamental of all serious statistical models, encompassing ANOVA, regression, ANCOVA, random and mixed-effects modelling, etc.

Ingredients of a linear model include an $n \times 1$ response vector $y = (y_1, \ldots, y_n)$ and an $n \times p$ design matrix (e.g. including regressors) $X = [x_1, \ldots, x_p]$, assumed to have been observed without error. The linear model:

$$y = X\beta + \epsilon; \qquad \epsilon \sim N(0, \sigma^2 I).$$

Recall from standard statistical analysis, using calculus and linear algebra, that the classical unbiased estimates of the regression parameter $\beta$ and of $\sigma^2$ are

$$\hat\beta = (X^TX)^{-1}X^Ty; \qquad \hat\sigma^2 = \frac{1}{n-p}\,(y - X\hat\beta)^T(y - X\hat\beta).$$

The above estimate of $\beta$ is also the least-squares estimate. The predicted value of $y$ is given by

$$\hat y = X\hat\beta = P_X y, \quad \text{where } P_X = X(X^TX)^{-1}X^T.$$

$P_X$ is called the projector matrix of $X$. It is an operator that projects any vector onto the space spanned by the columns of $X$. Note: $(y - X\hat\beta)^T(y - X\hat\beta) = y^T(I - P_X)y$ is the residual sum of squares.
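As a quick numerical illustration (a minimal sketch assuming NumPy and a small synthetic data set; the variable names and simulated values are purely illustrative), the classical estimates and the projector matrix can be computed directly from the formulas above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative synthetic data: n observations, p regressors (first column is an intercept)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# Classical (least-squares) estimates
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                     # (X^T X)^{-1} X^T y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                     # unbiased estimate of sigma^2

# Projector (hat) matrix, fitted values, and the residual sum of squares identity
P_X = X @ XtX_inv @ X.T
y_hat = P_X @ y                                  # equals X @ beta_hat
rss = y @ (np.eye(n) - P_X) @ y                  # equals resid @ resid
```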

Non-informative priors

For the Bayesian analysis, we will need to specify priors for the unknown regression parameters $\beta$ and the variance $\sigma^2$.

What are the “non-informative” priors that would make this Bayesian analysis equivalent to the classical distribution theory?

We need to consider absolutely flat priors on $\beta$ and $\log\sigma^2$. We have

$$P(\beta) \propto 1; \qquad P(\sigma^2) \propto \frac{1}{\sigma^2}, \qquad \text{or equivalently} \qquad P(\beta, \sigma^2) \propto \frac{1}{\sigma^2}.$$
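The flat prior on $\log\sigma^2$ and the $1/\sigma^2$ prior on $\sigma^2$ are the same statement; a one-line change-of-variables sketch:

$$P(\log\sigma^2) \propto 1 \;\Longrightarrow\; P(\sigma^2) = P(\log\sigma^2)\left|\frac{d\log\sigma^2}{d\sigma^2}\right| \propto \frac{1}{\sigma^2}.$$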

Note that neither of the above two “distributions” is a valid probability distribution (they do not integrate to any finite number, let alone 1). So why is it that we are even discussing them? It turns out that even if the priors are improper (that’s what we call them), as long as the resulting posterior distributions are proper we can still conduct legitimate statistical inference with them.

With a flat prior on $\beta$ we obtain, after some algebra, the conditional posterior distribution:

$$P(\beta \mid \sigma^2, y) = N\!\left((X^TX)^{-1}X^Ty,\; (X^TX)^{-1}\sigma^2\right).$$


contd.

The conditional posterior distribution of $\beta$ would have been the desired posterior distribution had $\sigma^2$ been known. Since that is not the case, we need to obtain the marginal posterior distribution by integrating out $\sigma^2$ as:

$$P(\beta \mid y) = \int P(\beta \mid \sigma^2, y)\, P(\sigma^2 \mid y)\, d\sigma^2.$$

So, we need to find the marginal posterior distribution of $\sigma^2$, namely $P(\sigma^2 \mid y)$. With the choice of the flat prior we obtain:

$$P(\sigma^2 \mid y) \propto \frac{1}{(\sigma^2)^{(n-p)/2+1}} \exp\!\left(-\frac{(n-p)s^2}{2\sigma^2}\right) \;\sim\; IG\!\left(\frac{n-p}{2},\; \frac{(n-p)s^2}{2}\right),$$

where $s^2 = \hat\sigma^2 = \frac{1}{n-p}\, y^T(I - P_X)y$. This is a scaled inverse-chi-square distribution, which is the same as an inverted Gamma distribution $IG((n-p)/2, (n-p)s^2/2)$.

A striking similarity with the classical result: there, the distribution of $\hat\sigma^2$ is also characterized by $(n-p)s^2/\sigma^2$ following a $\chi^2_{n-p}$ distribution.


contd.

Returning to the marginal posterior distribution $P(\beta \mid y)$, we can perform further algebra (using the form of the Inverted Gamma density) to show that $[\beta \mid y]$ is a non-central multivariate $t_{n-p}$ distribution with $n - p$ degrees of freedom and non-centrality parameter $\hat\beta$.

The above distribution is quite complicated, but we rarely need to work with it directly. In order to carry out a non-informative Bayesian analysis, we use a simpler sampling-based mechanism. For each $i = 1, \ldots, M$, first draw $\sigma^{2(i)} \sim [\sigma^2 \mid y]$ (which is Inverse-Gamma), followed by $\beta^{(i)} \sim N\!\left((X^TX)^{-1}X^Ty,\; (X^TX)^{-1}\sigma^{2(i)}\right)$.

The resulting samples $\left(\beta^{(i)}, \sigma^{2(i)}\right)_{i=1}^{M}$ are precisely samples from the joint posterior distribution $P(\beta, \sigma^2 \mid y)$. Automatically, the samples $\left(\beta^{(i)}\right)_{i=1}^{M}$ are samples from the marginal posterior distribution $P(\beta \mid y)$, while the samples $\left(\sigma^{2(i)}\right)_{i=1}^{M}$ are from the marginal posterior $P(\sigma^2 \mid y)$.
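A minimal sketch of this sampling mechanism (assuming NumPy/SciPy and the same kind of synthetic data as in the earlier sketch; note that scipy.stats.invgamma with shape $a$ and scale $b$ matches the $IG(a, b)$ parameterization used above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative synthetic data, as in the earlier sketch
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)

# Composition sampling from P(beta, sigma^2 | y) under the non-informative prior
M = 10_000
shape, scale = (n - p) / 2, (n - p) * s2 / 2          # IG((n-p)/2, (n-p)s^2/2)
sigma2_draws = stats.invgamma.rvs(shape, scale=scale, size=M, random_state=rng)

beta_draws = np.empty((M, p))
for i, s2_i in enumerate(sigma2_draws):
    # beta | sigma^2, y ~ N((X^T X)^{-1} X^T y, (X^T X)^{-1} sigma^2)
    beta_draws[i] = rng.multivariate_normal(beta_hat, XtX_inv * s2_i)

# beta_draws are draws from P(beta | y); sigma2_draws are draws from P(sigma^2 | y)
```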


contd.

A few important theoretical consequences are noted below:

The multivariate $t$ density is given by:

$$P(\beta \mid y) = \frac{\Gamma(n/2)}{(\pi(n-p))^{p/2}\,\Gamma((n-p)/2)\,\left|s^2(X^TX)^{-1}\right|^{1/2}} \left[1 + \frac{(\beta - \hat\beta)^T(X^TX)(\beta - \hat\beta)}{(n-p)s^2}\right]^{-n/2}.$$

The marginal distribution of each individual regression parameter $\beta_j$ is a non-central univariate $t_{n-p}$ distribution. In fact,

$$\frac{\beta_j - \hat\beta_j}{s\sqrt{(X^TX)^{-1}_{jj}}} \sim t_{n-p}.$$

In particular, with a simple univariate $N(\mu, \sigma^2)$ likelihood, we have

$$\frac{\mu - \bar y}{s/\sqrt{n}} \sim t_{n-1}.$$

The promised reproduction of the classical case is now complete.
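As a small illustration of the marginal $t$ result (again a sketch on illustrative synthetic data; the choice $j = 1$ and the 95% level are arbitrary), the posterior credible interval for a single coefficient is available in closed form and coincides numerically with the classical confidence interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative data and least-squares quantities, as in the earlier sketches
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.5, size=n)
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)

# 95% posterior credible interval for beta_j, using beta_j | y centred at beta_hat_j
# with scale s * sqrt((X^T X)^{-1}_jj) and t_{n-p} tails
j = 1
se_j = np.sqrt(s2 * XtX_inv[j, j])
t_crit = stats.t.ppf(0.975, df=n - p)
lower, upper = beta_hat[j] - t_crit * se_j, beta_hat[j] + t_crit * se_j
```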


Predicting from Linear Models

Now we want to apply our regression analysis to a new set of data: we have observed a new covariate matrix $\tilde X$, and we wish to predict the corresponding outcome $\tilde y$.

If $\beta$ and $\sigma^2$ were known, then $\tilde y$ would have a $N(\tilde X\beta, \sigma^2 I)$ distribution.

When the parameters are unknown, all predictions for the new data must follow from the posterior predictive distribution:

$$p(\tilde y \mid y) = \int p(\tilde y \mid \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2.$$

For each posterior draw $(\beta^{(i)}, \sigma^{2(i)})$, $i = 1, \ldots, M$, draw $\tilde y^{(i)}$ from $N(\tilde X\beta^{(i)}, \sigma^{2(i)} I)$. The resulting sample $\left(\tilde y^{(i)}\right)_{i=1}^{M}$ represents the posterior predictive distribution.

Theoretical mean and variance (conditional upon $\sigma^2$):

$$E(\tilde y \mid \sigma^2, y) = \tilde X\hat\beta; \qquad \mathrm{var}(\tilde y \mid \sigma^2, y) = \left(I + \tilde X(X^TX)^{-1}\tilde X^T\right)\sigma^2.$$

The theoretical unconditional predictive distribution, $p(\tilde y \mid y)$, is a multivariate $t$ distribution, $t_{n-p}\!\left(\tilde X\hat\beta,\; s^2\left(I + \tilde X(X^TX)^{-1}\tilde X^T\right)\right)$.
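A minimal sketch of the predictive sampling step (the function name posterior_predictive_draws is illustrative; it assumes beta_draws and sigma2_draws as produced by the composition sampler sketched earlier, and a new covariate matrix X_new):

```python
import numpy as np

def posterior_predictive_draws(X_new, beta_draws, sigma2_draws, rng=None):
    """For each posterior draw, sample y_new^(i) ~ N(X_new beta^(i), sigma2^(i) I)."""
    rng = np.random.default_rng() if rng is None else rng
    M = beta_draws.shape[0]
    y_new = np.empty((M, X_new.shape[0]))
    for i in range(M):
        y_new[i] = rng.normal(loc=X_new @ beta_draws[i],
                              scale=np.sqrt(sigma2_draws[i]))
    return y_new

# Usage (beta_draws, sigma2_draws from the composition sampler sketch):
#   y_pred = posterior_predictive_draws(X_new, beta_draws, sigma2_draws)
#   np.percentile(y_pred, [2.5, 97.5], axis=0)   # pointwise 95% predictive intervals
```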

Lindley and Smith (1972)

Hierarchical Linear Models in their full general form:

$$y \mid \beta_1, \Sigma_1 \sim N(X_1\beta_1, \Sigma_1);$$
$$\beta_1 \mid \beta_2, \Sigma_2 \sim N(X_2\beta_2, \Sigma_2).$$

It is assumed that $X_1$, $X_2$, $\Sigma_1$ and $\Sigma_2$ are known matrices, with the latter two being positive definite covariance matrices.

The marginal distribution of $y$ is given by:

$$y \sim N\!\left(X_1X_2\beta_2,\; \Sigma_1 + X_1\Sigma_2X_1^T\right).$$

The conditional distribution of $\beta_1$ given $y$ is:

$$\beta_1 \mid y \sim N(\Sigma^*\mu, \Sigma^*), \quad \text{where}$$
$$\Sigma^* = \left(X_1^T\Sigma_1^{-1}X_1 + \Sigma_2^{-1}\right)^{-1} \quad \text{and} \quad \mu = X_1^T\Sigma_1^{-1}y + \Sigma_2^{-1}X_2\beta_2.$$
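A small numerical sketch of these formulas (all matrices and dimensions below are illustrative placeholders; NumPy assumed), computing the marginal moments of $y$ and the conditional posterior of $\beta_1$:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative known quantities for the two-stage hierarchy
n, p1, p2 = 20, 4, 2
X1 = rng.normal(size=(n, p1))
X2 = rng.normal(size=(p1, p2))
beta2 = rng.normal(size=p2)
Sigma1 = np.eye(n)                        # first-stage covariance
Sigma2 = 0.5 * np.eye(p1)                 # second-stage covariance

# Marginal distribution of y: N(X1 X2 beta2, Sigma1 + X1 Sigma2 X1^T)
marg_mean = X1 @ X2 @ beta2
marg_cov = Sigma1 + X1 @ Sigma2 @ X1.T
y = rng.multivariate_normal(marg_mean, marg_cov)   # one realisation from the hierarchy

# Conditional distribution beta1 | y: N(Sigma_star mu, Sigma_star)
Sigma1_inv = np.linalg.inv(Sigma1)
Sigma2_inv = np.linalg.inv(Sigma2)
Sigma_star = np.linalg.inv(X1.T @ Sigma1_inv @ X1 + Sigma2_inv)
mu = X1.T @ Sigma1_inv @ y + Sigma2_inv @ X2 @ beta2
cond_mean, cond_cov = Sigma_star @ mu, Sigma_star
```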


The Proof: a sketch

To derive the marginal distribution of $y$, we first rewrite the system:

$$y = X_1\beta_1 + u, \quad \text{where } u \sim N(0, \Sigma_1);$$
$$\beta_1 = X_2\beta_2 + v, \quad \text{where } v \sim N(0, \Sigma_2).$$

Also, $u$ and $v$ are independent of each other. It then follows that:

$$y = X_1X_2\beta_2 + X_1v + u.$$

This shows that the marginal distribution is normal. The final result now follows by simple calculations of the mean and variance of $y$.
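Spelled out, these mean and variance calculations (using $E(u) = E(v) = 0$ and the independence of $u$ and $v$) are:

$$E(y) = X_1X_2\beta_2 + X_1E(v) + E(u) = X_1X_2\beta_2; \qquad \mathrm{var}(y) = X_1\,\mathrm{var}(v)\,X_1^T + \mathrm{var}(u) = X_1\Sigma_2X_1^T + \Sigma_1.$$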

For the conditional distribution $\beta_1 \mid y$, we see from Bayes' Theorem that $p(\beta_1 \mid y) \propto p(y \mid \beta_1)\,p(\beta_1)$. By substituting expressions for the respective normal densities in the product on the right-hand side and by completing the square in the quadratic form in the exponent, we find $p(\beta_1 \mid y) \propto \exp(-Q/2)$, where

$$Q = (\beta_1 - \Sigma^*\mu)^T\,\Sigma^{*-1}\,(\beta_1 - \Sigma^*\mu) + \left[\,y^T\Sigma_1^{-1}y + \beta_2^TX_2^T\Sigma_2^{-1}X_2\beta_2 - \mu^T\Sigma^*\mu\,\right].$$
