Source: reich/ABA/notes/BLR.pdf (2018-03-12)

# (7) Bayesian linear regression

ST440/540: Applied Bayesian Statistics

Spring, 2018

## Bayesian linear regression

- Linear regression is by far the most common statistical model
- It includes as special cases the t-test and ANOVA
- The multiple linear regression model is

  $$Y_i \sim \text{Normal}(\beta_0 + X_{i1}\beta_1 + \dots + X_{ip}\beta_p, \sigma^2)$$

  independently across the $i = 1, \dots, n$ observations
- As we'll see, Bayesian and classical linear regression are similar if $n \gg p$ and the priors are uninformative
- However, the results can differ for challenging problems, and the interpretation is different in all cases


## Review of least squares

- The least squares estimate of $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ is

  $$\hat\beta_{OLS} = \arg\min_\beta \sum_{i=1}^n (Y_i - \mu_i)^2$$

  where $\mu_i = \beta_0 + X_{i1}\beta_1 + \dots + X_{ip}\beta_p$
- $\hat\beta_{OLS}$ is unbiased even if the errors are non-Gaussian
- If the errors are Gaussian, then the likelihood is proportional to

  $$\prod_{i=1}^n \exp\left[-\frac{(Y_i - \mu_i)^2}{2\sigma^2}\right] = \exp\left[-\frac{\sum_{i=1}^n (Y_i - \mu_i)^2}{2\sigma^2}\right]$$

- Therefore, if the errors are Gaussian, $\hat\beta_{OLS}$ is also the MLE


## Review of least squares

- Linear regression is often simpler to describe using linear algebra notation
- Let $Y = (Y_1, \dots, Y_n)^T$ be the response vector and $X$ be the $n \times (p+1)$ matrix of covariates
- Then the mean of $Y$ is $X\beta$ and the least squares solution is

  $$\hat\beta_{OLS} = \arg\min_\beta (Y - X\beta)^T(Y - X\beta) = (X^T X)^{-1} X^T Y$$

- If the errors are Gaussian, then the sampling distribution is

  $$\hat\beta_{OLS} \sim \text{Normal}\left[\beta, \sigma^2 (X^T X)^{-1}\right]$$

- If the variance $\sigma^2$ is estimated using the mean squared residual error, then the sampling distribution is multivariate t
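The normal-equations solution above is easy to check numerically. A minimal numpy sketch, using simulated data (the data and seed here are hypothetical; the course's own examples use R/JAGS):

```python
import numpy as np

# Simulate a small regression problem (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p+1); first column is the intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(size=n)

# Least squares solution: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Estimated sampling covariance: sigma2_hat * (X^T X)^{-1}
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)
cov_hat = sigma2_hat * np.linalg.inv(X.T @ X)
```

Solving the linear system directly avoids forming the inverse just to compute the estimate; the inverse is only needed for the covariance.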


## Bayesian regression

- The likelihood remains

  $$Y_i \sim \text{Normal}(\beta_0 + X_{i1}\beta_1 + \dots + X_{ip}\beta_p, \sigma^2)$$

  independently for $i = 1, \dots, n$ observations
- As with a least squares analysis, it is crucial to verify this is appropriate using qq-plots, added variable plots, etc.
- A Bayesian analysis also requires priors for $\beta$ and $\sigma$
- We will focus on prior specification since this piece is uniquely Bayesian


## Priors

- For the purpose of setting priors, it is helpful to standardize both the response and each covariate to have mean zero and variance one
- Many priors for $\beta$ have been considered:
  1. Improper priors
  2. Gaussian priors
  3. Double exponential priors
  4. Many, many more...
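The standardization step above is a one-liner; a sketch with hypothetical toy data:

```python
import numpy as np

# Hypothetical raw response (each covariate column is treated the same way)
y_raw = np.array([2.0, 4.0, 6.0, 8.0])

# Standardize to mean zero and variance one (ddof=1 uses the sample variance)
y_std = (y_raw - y_raw.mean()) / y_raw.std(ddof=1)
```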


## Improper priors

- The Jeffreys prior is flat, $p(\beta) \propto 1$
- This is improper, but the posterior is proper under the same conditions required by least squares
- If $\sigma$ is known, then

  $$\beta \mid Y \sim \text{Normal}\left[\hat\beta_{OLS}, \sigma^2 (X^T X)^{-1}\right]$$

- See "Post beta" in http://www4.stat.ncsu.edu/~reich/ABA/Derivations7.pdf
- Therefore, the results should be similar to least squares
- How are they different?



## Improper priors

- Of course we rarely know $\sigma$
- Typically the error variance follows an InvGamma(a, b) prior with $a$ and $b$ set to be small, say $a = b = 0.01$
- In this case the posterior of $\beta$ follows a multivariate t centered on $\hat\beta_{OLS}$
- Again, the results are similar to OLS
- The objective Bayes Jeffreys prior for $(\beta, \sigma^2)$ is

  $$p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}$$

  which is the limit as $a, b \to 0$
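Under the Jeffreys prior the posterior can be sampled directly, with no MCMC: $\sigma^2 \mid Y$ is inverse gamma and $\beta \mid \sigma^2, Y$ is Gaussian centered on $\hat\beta_{OLS}$, so marginally $\beta \mid Y$ is the multivariate t above. A numpy sketch on simulated (hypothetical) data:

```python
import numpy as np

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(1)
n, k = 50, 3                        # k = number of columns of X, including the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
sse = np.sum((Y - X @ beta_hat) ** 2)
L = np.linalg.cholesky(np.linalg.inv(X.T @ X))

S = 5000
# sigma^2 | Y ~ InvGamma((n - k)/2, SSE/2): draw a gamma and invert it
sigma2 = 1.0 / rng.gamma(shape=(n - k) / 2, scale=2.0 / sse, size=S)
# beta | sigma^2, Y ~ Normal(beta_hat, sigma^2 (X^T X)^{-1}), via the Cholesky factor
beta = beta_hat + np.sqrt(sigma2)[:, None] * (rng.normal(size=(S, k)) @ L.T)
```

Averaging the draws recovers $\hat\beta_{OLS}$ up to Monte Carlo error, consistent with "the results are similar to OLS."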


## Multivariate normal prior

- Another common prior for $\beta$ is Zellner's g-prior

  $$\beta \sim \text{Normal}\left[0, \frac{\sigma^2}{g}(X^T X)^{-1}\right]$$

- This prior is proper assuming $X$ is full rank
- The posterior mean is

  $$\frac{1}{1+g}\,\hat\beta_{OLS}$$

- This shrinks the least squares estimate towards zero
- $g$ controls the amount of shrinkage
- $g = 1/n$ is common, and called the unit information prior
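The shrinkage factor follows because the prior precision $g(X^TX)/\sigma^2$ and likelihood precision $(X^TX)/\sigma^2$ add, so the posterior mean is $[(1+g)X^TX]^{-1}X^TY = \hat\beta_{OLS}/(1+g)$. A quick numerical check on simulated (hypothetical) data:

```python
import numpy as np

# Simulated data (hypothetical)
rng = np.random.default_rng(2)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

g = 1.0 / n  # unit information prior
# The g-prior posterior mean shrinks beta_ols toward zero by the factor 1/(1+g)
beta_post = beta_ols / (1.0 + g)
```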


## Univariate Gaussian priors

- If there are many covariates or the covariates are collinear, then OLS is unstable
- Independent priors can counteract collinearity

  $$\beta_j \sim \text{Normal}(0, \sigma^2/g)$$

  independent over $j$
- The posterior mode is

  $$\arg\min_\beta \sum_{i=1}^n (Y_i - \mu_i)^2 + g\sum_{j=1}^p \beta_j^2$$

- In classical statistics, this is known as the ridge regression solution and is used to stabilize the least squares solution
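The ridge objective above has a closed-form minimizer. A sketch assuming standardized covariates with no intercept, so every coefficient is penalized (the slide's objective leaves $\beta_0$ unpenalized; with standardized data the intercept is zero anyway):

```python
import numpy as np

# Simulated, then standardized, data (hypothetical)
rng = np.random.default_rng(3)
n, p = 40, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized covariates, no intercept
Y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)
Y = (Y - Y.mean()) / Y.std(ddof=1)

g = 2.0
# Ridge / posterior-mode solution: (X^T X + g I)^{-1} X^T Y
beta_ridge = np.linalg.solve(X.T @ X + g * np.eye(p), X.T @ Y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
```

Adding $gI$ to $X^TX$ improves its conditioning, which is exactly the stabilization the slide describes, and shrinks the estimate: $\|\hat\beta_{ridge}\| < \|\hat\beta_{OLS}\|$ for any $g > 0$.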


## BLASSO

- An increasingly-popular prior is the double exponential or Bayesian LASSO prior
- The prior is $\beta_j \sim \text{DE}(\tau)$ which has PDF

  $$f(\beta) \propto \exp\left(-\frac{|\beta|}{\tau}\right)$$

- The square in the Gaussian prior is replaced with an absolute value
- The shape of the PDF is thus more peaked at zero (next slide)
- The BLASSO prior favors settings where there are many $\beta_j$ near zero and a few large $\beta_j$
- That is, $p$ is large but most of the covariates are noise

ST440/540: Applied Bayesian Statistics (7) Bayesian linear regression

## BLASSO

[Figure: prior density curves for the Gaussian and BLASSO (double exponential) priors; the BLASSO density is more peaked at zero]


## BLASSO

- The posterior mode is

  $$\arg\min_\beta \sum_{i=1}^n (Y_i - \mu_i)^2 + g\sum_{j=1}^p |\beta_j|$$

- In classical statistics, this is known as the LASSO solution
- It is popular because it adds stability by shrinking estimates towards zero, and also sets some coefficients exactly to zero
- Covariates with coefficients set to zero can be removed
- Therefore, LASSO performs variable selection and estimation simultaneously
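Why the absolute-value penalty zeroes out coefficients can be seen in the special case of an orthonormal design ($X^TX = I$, a simplifying assumption not used elsewhere in these notes): there the LASSO objective separates, and the minimizer is the soft-threshold of $z = X^TY$:

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# With X^T X = I, minimizing sum (z_j - beta_j)^2 + g |beta_j| gives
# beta_j = soft_threshold(z_j, g/2); small z_j are set exactly to zero.
z = np.array([3.0, -0.4, 1.2, 0.1])    # hypothetical values of X^T Y
g = 1.0
beta_lasso = soft_threshold(z, g / 2)  # -> [2.5, 0.0, 0.7, 0.0]
```

Coefficients whose least squares values fall below the threshold are removed exactly, which is the simultaneous selection-and-estimation property described above.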


## Computing

- With flat or Gaussian (with fixed prior variance) priors, the posterior is available in closed form and Monte Carlo sampling is not needed
- With normal priors all full conditionals are Gaussian or inverse gamma, and so Gibbs sampling is simple and fast
- JAGS works well, but there are R (and SAS and other) packages dedicated just to Bayesian linear regression that are preferred for big/hard problems
- BLR is probably the most common
- http://www4.stat.ncsu.edu/~reich/ABA/code/regJAGS
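The Gibbs sampler described above can be sketched in plain numpy. This is an illustrative sketch, not the BLR package's implementation: it assumes simulated data, a fixed $N(0, \tau^2)$ prior on each $\beta_j$, and an InvGamma(a, b) prior on $\sigma^2$, so both full conditionals are available in closed form:

```python
import numpy as np

# Simulated data (hypothetical; the course examples use R/JAGS)
rng = np.random.default_rng(4)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

a = b = 0.01     # InvGamma(a, b) prior for sigma^2
tau2 = 100.0     # fixed N(0, tau2) prior variance for each beta_j (assumed value)
S = 2000

beta, sigma2 = np.zeros(p), 1.0
keep = np.empty((S, p))
for s in range(S):
    # Full conditional of beta is Gaussian: precision = X^T X / sigma2 + I / tau2
    V = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)
    m = V @ (X.T @ Y) / sigma2
    beta = rng.multivariate_normal(m, V)
    # Full conditional of sigma^2 is InvGamma(a + n/2, b + SSE/2)
    sse = np.sum((Y - X @ beta) ** 2)
    sigma2 = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + sse / 2))
    keep[s] = beta
```

With a vague prior the posterior mean of the kept draws should sit very close to $\hat\beta_{OLS}$, as the earlier slides promise for $n \gg p$.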



## Computing for the BLASSO

- For the BLASSO prior the full conditionals are more complicated
- There is a trick to make all full conditionals conjugate so that Gibbs sampling can be used
- Metropolis sampling works fine too
- BLR works well for the BLASSO and is super fast
- JAGS can handle this as well:
- http://www4.stat.ncsu.edu/~reich/ABA/code/BLASSO



## Summarizing the results

- The standard summary is a table with marginal means and 95% intervals for each $\beta_j$
- This becomes unwieldy for large $p$
- Picking a subset of covariates is a crucial step in a linear regression analysis
- We will discuss this later in the course
- Common methods include cross-validation, information criteria, and stochastic search


## Logistic regression

- Other forms of regression follow naturally from linear regression
- For example, for binary responses $Y_i \in \{0, 1\}$ we might use logistic regression

  $$\text{logit}[\text{Prob}(Y_i = 1)] = \mu_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip}$$

- The logit link is the log odds, $\text{logit}(x) = \log[x/(1-x)]$
- Then $\beta_j$ represents the increase in the log odds of an event corresponding to a one-unit increase in covariate $j$
- The expit transformation $\text{expit}(x) = \exp(x)/[1 + \exp(x)]$ is the inverse, and

  $$\text{Prob}(Y_i = 1) = \text{expit}(\mu_i) \in [0, 1]$$
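The logit/expit pair above is two lines of code, and checking that they invert each other is a useful sanity test:

```python
import numpy as np

def logit(x):
    """Log odds: log(x / (1 - x)), for x in (0, 1)."""
    return np.log(x / (1 - x))

def expit(x):
    """Inverse of logit: exp(x) / (1 + exp(x)), always in (0, 1)."""
    return np.exp(x) / (1 + np.exp(x))
```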


## Logistic regression

- Bayesian logistic regression requires a prior for $\beta$
- All of the priors we have discussed for linear regression (Zellner, BLASSO, etc.) apply
- Computationally, the full conditional distributions are no longer conjugate and so we must use Metropolis sampling
- The R function MCMClogit does this efficiently
- It is fast in JAGS too, for example http://www4.stat.ncsu.edu/~reich/ABA/code/GLM
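The Metropolis sampling mentioned above can be sketched with a plain random walk. This is a minimal, untuned sketch assuming simulated data and independent $N(0, \tau^2)$ priors, not what MCMClogit or JAGS do internally:

```python
import numpy as np

# Simulated binary data (hypothetical)
rng = np.random.default_rng(5)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

def log_post(beta, tau2=100.0):
    """Bernoulli log-likelihood plus independent N(0, tau2) log-prior (assumed prior)."""
    eta = X @ beta
    return np.sum(Y * eta - np.log1p(np.exp(eta))) - 0.5 * np.sum(beta ** 2) / tau2

S, step = 5000, 0.2                              # untuned random-walk step size
beta, cur = np.zeros(p), log_post(np.zeros(p))
keep = np.empty((S, p))
accept = 0
for s in range(S):
    cand = beta + step * rng.normal(size=p)      # random-walk proposal
    cand_lp = log_post(cand)
    if np.log(rng.uniform()) < cand_lp - cur:    # Metropolis accept/reject
        beta, cur, accept = cand, cand_lp, accept + 1
    keep[s] = beta                               # keep current state either way
```

The step size would normally be tuned so the acceptance rate is moderate; dedicated implementations handle this automatically.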



## Predictions

- Say we have a new covariate vector $X_{new}$ and we would like to predict the corresponding response $Y_{new}$
- A plug-in approach would fix $\beta$ and $\sigma$ at their posterior means $\hat\beta$ and $\hat\sigma$ to make predictions

  $$Y_{new} \mid \hat\beta, \hat\sigma \sim \text{Normal}(X_{new}\hat\beta, \hat\sigma^2)$$

- However this plug-in approach suppresses uncertainty about $\beta$ and $\sigma$
- Therefore these prediction intervals will be slightly too narrow, leading to undercoverage


## Posterior predictive distribution (PPD)

- We should really account for all uncertainty when making predictions, including our uncertainty about $\beta$ and $\sigma$
- We really want the PPD

  $$p(Y_{new} \mid Y) = \int\!\!\int f(Y_{new}, \beta, \sigma \mid Y)\, d\beta\, d\sigma = \int\!\!\int f(Y_{new} \mid \beta, \sigma)\, f(\beta, \sigma \mid Y)\, d\beta\, d\sigma$$

- Marginalizing over the model parameters accounts for their uncertainty
- The concept of the PPD applies generally (e.g., logistic regression) and means the distribution of the predicted value marginally over model parameters


## Posterior predictive distribution (PPD)

- MCMC naturally gives draws from $Y_{new}$'s PPD
- For MCMC iteration $t$ we have $\beta^{(t)}$ and $\sigma^{(t)}$
- For MCMC iteration $t$ we sample

  $$Y_{new}^{(t)} \sim \text{Normal}\left(X_{new}\beta^{(t)}, \left(\sigma^{(t)}\right)^2\right)$$

- $Y_{new}^{(1)}, \dots, Y_{new}^{(S)}$ are samples from the PPD
- This is an example of the claim that Bayesian methods naturally quantify uncertainty
- http://www4.stat.ncsu.edu/~reich/ABA/code/Predict
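The PPD recipe above can be sketched end-to-end. Here exact posterior draws under the flat prior stand in for MCMC output; the data, seed, and $x_{new}$ are hypothetical:

```python
import numpy as np

# Simulated data (hypothetical)
rng = np.random.default_rng(6)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# Exact posterior draws under the flat prior p(beta, sigma^2) proportional to 1/sigma^2
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
sse = np.sum((Y - X @ beta_hat) ** 2)
L = np.linalg.cholesky(np.linalg.inv(X.T @ X))

S = 10000
sigma2 = 1.0 / rng.gamma((n - k) / 2, 2.0 / sse, size=S)
beta = beta_hat + np.sqrt(sigma2)[:, None] * (rng.normal(size=(S, k)) @ L.T)

# One Y_new draw per posterior draw gives a sample from the PPD
x_new = np.array([1.0, 0.5])
y_new = beta @ x_new + np.sqrt(sigma2) * rng.normal(size=S)

# 95% prediction interval from the PPD sample
lo, hi = np.quantile(y_new, [0.025, 0.975])
```

Because each draw of $Y_{new}$ uses a different $(\beta, \sigma)$, the interval reflects parameter uncertainty and is slightly wider than the plug-in interval, avoiding the undercoverage described on the previous slide.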

