
  • (7) Bayesian linear regression

    ST440/540: Applied Bayesian Statistics

    Spring, 2018


  • Bayesian linear regression

    I Linear regression is by far the most common statistical model

    I It includes as special cases the t-test and ANOVA

    I The multiple linear regression model is

    $Y_i \sim \text{Normal}(\beta_0 + X_{i1}\beta_1 + \dots + X_{ip}\beta_p,\ \sigma^2)$,

    independently across the $i = 1, \dots, n$ observations

    I As we'll see, Bayesian and classical linear regression are similar if $n \gg p$ and the priors are uninformative.

    I However, the results can be different for challenging problems, and the interpretation is different in all cases


  • Review of least squares

    I The least squares estimate of $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ is

    $\hat{\beta}_{OLS} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - \mu_i)^2$

    where $\mu_i = \beta_0 + X_{i1}\beta_1 + \dots + X_{ip}\beta_p$

    I $\hat{\beta}_{OLS}$ is unbiased even if the errors are non-Gaussian

    I If the errors are Gaussian then the likelihood is proportional to

    $\prod_{i=1}^{n} \exp\left[-\frac{(Y_i - \mu_i)^2}{2\sigma^2}\right] = \exp\left[-\frac{\sum_{i=1}^{n}(Y_i - \mu_i)^2}{2\sigma^2}\right]$

    I Therefore, if the errors are Gaussian, $\hat{\beta}_{OLS}$ is also the MLE


  • Review of least squares

    I Linear regression is often simpler to describe using linear algebra notation

    I Let $Y = (Y_1, \dots, Y_n)^T$ be the response vector and $X$ be the $n \times (p + 1)$ matrix of covariates

    I Then the mean of $Y$ is $X\beta$ and the least squares solution is

    $\hat{\beta}_{OLS} = \arg\min_{\beta}\ (Y - X\beta)^T (Y - X\beta) = (X^T X)^{-1} X^T Y$

    I If the errors are Gaussian then the sampling distribution is

    $\hat{\beta}_{OLS} \sim \text{Normal}\left[\beta,\ \sigma^2 (X^T X)^{-1}\right]$

    I If the variance $\sigma^2$ is estimated using the mean squared residual error then the sampling distribution is multivariate t
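
    I A minimal sketch (using simulated data that is not from the notes) of the matrix form of the least squares solution, checked against R's lm():

      # Simulate a small regression data set (hypothetical example)
      set.seed(1)
      n <- 100; p <- 3
      X    <- cbind(1, matrix(rnorm(n * p), n, p))     # n x (p+1) design matrix with intercept
      beta <- c(1, 0.5, -0.5, 0)
      Y    <- drop(X %*% beta + rnorm(n, sd = 1))

      # Least squares solution (X'X)^{-1} X'Y
      beta_ols <- solve(t(X) %*% X, t(X) %*% Y)

      # Same fit via lm(); the coefficients should agree
      fit <- lm(Y ~ X - 1)
      cbind(beta_ols, coef(fit))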


  • Bayesian regression

    I The likelihood remains

    $Y_i \sim \text{Normal}(\beta_0 + X_{i1}\beta_1 + \dots + X_{ip}\beta_p,\ \sigma^2)$,

    independently for the $i = 1, \dots, n$ observations

    I As with a least squares analysis, it is crucial to verify this is appropriate using qq-plots, added variable plots, etc.

    I A Bayesian analysis also requires priors for $\beta$ and $\sigma$

    I We will focus on prior specification since this piece is uniquely Bayesian.
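
    I A minimal sketch (reusing the fit object from the earlier lm() example) of the usual residual checks:

      # Normal QQ-plot of the residuals to check the Gaussian error assumption
      qqnorm(resid(fit)); qqline(resid(fit))

      # Standard lm diagnostics (residuals vs fitted, QQ, scale-location, leverage)
      par(mfrow = c(2, 2)); plot(fit)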


  • Priors

    I For the purpose of setting priors, it is helpful to standardize both the response and each covariate to have mean zero and variance one (see the sketch after this list).

    I Many priors for $\beta$ have been considered:

    1. Improper priors

    2. Gaussian priors

    3. Double exponential priors

    4. Many, many more...
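
    I A minimal sketch (reusing the simulated X and Y from above) of the standardization step:

      # Center and scale the response and each covariate (drop the intercept column first)
      Y_std <- drop(scale(Y))
      X_std <- scale(X[, -1])     # each column now has mean zero and variance one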


  • Improper priors

    I The Jeffreys prior is flat, $p(\beta) = 1$

    I This is improper, but the posterior is proper under the same conditions required by least squares

    I If $\sigma$ is known then

    $\beta \mid Y \sim \text{Normal}\left[\hat{\beta}_{OLS},\ \sigma^2 (X^T X)^{-1}\right]$

    I See "Post beta" in http://www4.stat.ncsu.edu/~reich/ABA/Derivations7.pdf

    I Therefore, the results should be similar to least squares

    I How are they different?
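
    I A minimal sketch (continuing the simulated example, with $\sigma$ treated as known) of sampling from this closed-form posterior:

      # Posterior under the flat prior with sigma known:
      # beta | Y ~ Normal(beta_ols, sigma^2 (X'X)^{-1})
      sigma   <- 1                                     # assumed known here
      XtX_inv <- solve(t(X) %*% X)
      # install.packages("mvtnorm") if needed
      draws <- mvtnorm::rmvnorm(5000, mean = drop(beta_ols), sigma = sigma^2 * XtX_inv)

      # Posterior medians and 95% intervals; these track the least squares fit closely
      t(apply(draws, 2, quantile, probs = c(0.025, 0.5, 0.975)))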


  • Improper priors

    I Of course we rarely know $\sigma$

    I Typically the error variance $\sigma^2$ follows an InvGamma(a, b) prior with a and b set to be small, say a = b = 0.01.

    I In this case the posterior of $\beta$ follows a multivariate t centered on $\hat{\beta}_{OLS}$

    I Again, the results are similar to OLS

    I The objective Bayes Jeffreys prior for $\theta = (\beta, \sigma)$ is

    $p(\beta, \sigma^2) = \frac{1}{\sigma^2}$,

    which is the limit as $a, b \to 0$


  • Multivariate normal prior

    I Another common prior for $\beta$ is Zellner's g-prior

    $\beta \sim \text{Normal}\left[0,\ \frac{\sigma^2}{g} (X^T X)^{-1}\right]$

    I This prior is proper assuming X is full rank

    I The posterior mean is

    $\frac{1}{1 + g}\,\hat{\beta}_{OLS}$

    I This shrinks the least squares estimate towards zero

    I g controls the amount of shrinkage

    I g = 1/n is common, and called the unit information prior
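
    I A minimal sketch (continuing the simulated example) of the g-prior posterior mean as a shrunken least squares estimate:

      # Unit information prior: g = 1/n, so the posterior mean is (1/(1+g)) * beta_ols
      g <- 1 / n
      beta_gprior <- drop(beta_ols) / (1 + g)
      cbind(ols = drop(beta_ols), gprior = beta_gprior)   # mild shrinkage towards zero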


  • Univariate Gaussian priors

    I If there are many covariates or the covariates are collinear, then $\hat{\beta}_{OLS}$ is unstable

    I Independent priors can counteract collinearity

    $\beta_j \sim \text{Normal}(0,\ \sigma^2 / g)$,

    independent over $j$

    I The posterior mode is

    $\arg\min_{\beta}\ \sum_{i=1}^{n} (Y_i - \mu_i)^2 + g \sum_{j=1}^{p} \beta_j^2$

    I In classical statistics, this is known as the ridge regression solution and is used to stabilize the least squares solution
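
    I A minimal sketch (continuing the simulated example, with the intercept left unpenalized, an assumption not stated on the slide) of the ridge / posterior-mode solution:

      # Ridge solution: (X'X + g D)^{-1} X'Y, where D penalizes the slopes but not the intercept
      g <- 5
      D <- diag(c(0, rep(1, p)))                       # no penalty on the intercept column
      beta_ridge <- solve(t(X) %*% X + g * D, t(X) %*% Y)
      cbind(ols = drop(beta_ols), ridge = drop(beta_ridge))   # slopes shrink towards zero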


  • BLASSO

    I An increasingly popular prior is the double exponential or Bayesian LASSO prior

    I The prior is $\beta_j \sim \text{DE}(\tau)$ which has PDF

    $f(\beta) \propto \exp\left(-\frac{|\beta|}{\tau}\right)$

    I The square in the Gaussian prior is replaced with an absolute value

    I The shape of the PDF is thus more peaked at zero (next slide)

    I The BLASSO prior favors settings where there are many $\beta_j$ near zero and a few large $\beta_j$

    I That is, p is large but most of the covariates are noise


  • BLASSO

    [Figure: Gaussian and BLASSO (double exponential) prior densities, plotted from −3 to 3; the BLASSO prior is more sharply peaked at zero]
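
    I A minimal sketch reproducing a comparison of the two prior densities (both scaled to variance one, an assumption not stated on the slide):

      # Gaussian vs double exponential (Laplace) densities with the same variance
      x <- seq(-3, 3, length.out = 500)
      gauss  <- dnorm(x, 0, 1)                         # Normal(0, 1)
      b      <- 1 / sqrt(2)                            # Laplace scale b gives variance 2*b^2 = 1
      blasso <- exp(-abs(x) / b) / (2 * b)             # double exponential density

      plot(x, blasso, type = "l", lty = 2, xlab = "", ylab = "Prior")
      lines(x, gauss, lty = 1)
      legend("topright", legend = c("Gaussian", "BLASSO"), lty = c(1, 2))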


  • BLASSO

    I The posterior mode is

    $\arg\min_{\beta}\ \sum_{i=1}^{n} (Y_i - \mu_i)^2 + g \sum_{j=1}^{p} |\beta_j|$

    I In classical statistics, this is known as the LASSO solution

    I It is popular because it adds stability by shrinking estimates towards zero, and also sets some coefficients to zero

    I Covariates with coefficients set to zero can be removed

    I Therefore, LASSO performs variable selection and estimation simultaneously
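
    I A minimal sketch of the classical LASSO solution via the glmnet package (glmnet is not referenced in the notes; it is used here only to illustrate the posterior mode):

      # Penalized least squares with an L1 penalty on the slopes
      # install.packages("glmnet") if needed
      library(glmnet)
      fit_lasso <- glmnet(X[, -1], Y, alpha = 1)       # alpha = 1 gives the LASSO penalty
      coef(fit_lasso, s = 0.1)                         # some coefficients are set exactly to zero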


  • Computing

    I With flat or Gaussian (with fixed prior variance) priors the posterior is available in closed form and Monte Carlo sampling is not needed

    I With normal priors all full conditionals are Gaussian or inverse gamma, and so Gibbs sampling is simple and fast

    I JAGS works well, but there are R (and SAS and others) packages dedicated just to Bayesian linear regression that are preferred for big/hard problems

    I BLR is probably the most common

    I http://www4.stat.ncsu.edu/~reich/ABA/code/regJAGS
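
    I A minimal sketch (not the code at the link above) of fitting the regression in JAGS via rjags, with vague normal priors on the coefficients and an InvGamma(0.01, 0.01) prior on the variance:

      library(rjags)

      model_string <- "model{
        for(i in 1:n){
          Y[i] ~ dnorm(inprod(X[i, ], beta[]), tau)    # likelihood
        }
        for(j in 1:P){
          beta[j] ~ dnorm(0, 0.001)                    # vague Gaussian priors
        }
        tau ~ dgamma(0.01, 0.01)                       # error precision; variance is 1/tau
        sigma <- 1 / sqrt(tau)
      }"

      data  <- list(Y = Y, X = X, n = n, P = ncol(X))
      model <- jags.model(textConnection(model_string), data = data, n.chains = 2)
      update(model, 1000)                              # burn-in
      samples <- coda.samples(model, c("beta", "sigma"), n.iter = 5000)
      summary(samples)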


  • Computing for the BLASSO

    I For the BLASSO prior the full conditionals are more complicated

    I There is a trick to make all full conditionals conjugate so that Gibbs sampling can be used

    I Metropolis sampling works fine too

    I BLR works well for BLASSO and is super fast

    I JAGS can handle this as well:

    I http://www4.stat.ncsu.edu/~reich/ABA/code/BLASSO
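
    I A minimal sketch (again, not the code at the link) of the BLASSO prior in JAGS; only the prior block changes relative to the model above, using the double exponential distribution ddexp:

      blasso_string <- "model{
        for(i in 1:n){
          Y[i] ~ dnorm(inprod(X[i, ], beta[]), tau)    # same likelihood as before
        }
        for(j in 1:P){
          beta[j] ~ ddexp(0, tau_b)                    # double exponential (BLASSO) prior
        }
        tau_b ~ dgamma(0.01, 0.01)
        tau   ~ dgamma(0.01, 0.01)
      }"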


  • Summarizing the results

    I The standard summary is a table with marginal means and 95% intervals for each $\beta_j$

    I This becomes unwieldy for large p

    I Picking a subset of covariates is a crucial step in a linear regression analysis.

    I We will discuss this later in the course.

    I Common methods include cross-validation, information criteria, and stochastic search.


  • Logistic regression

    I Other forms of regression follow naturally from linear regression

    I For example, for binary responses $Y_i \in \{0, 1\}$ we might use logistic regression

    $\text{logit}[\text{Prob}(Y_i = 1)] = \mu_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip}$

    I The logit link is the log-odds, $\text{logit}(x) = \log[x/(1 - x)]$

    I Then $\beta_j$ represents the increase in the log odds of an event corresponding to a one-unit increase in covariate $j$

    I The expit transformation $\text{expit}(x) = \exp(x)/[1 + \exp(x)]$ is the inverse, and

    $\text{Prob}(Y_i = 1) = \text{expit}(\mu_i) \in [0, 1]$
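
    I A minimal sketch of the logit and expit maps in R (the built-in qlogis and plogis compute the same transformations):

      logit <- function(x) log(x / (1 - x))            # maps (0, 1) to the real line
      expit <- function(x) exp(x) / (1 + exp(x))       # inverse map, back to (0, 1)

      expit(logit(0.8))                                # recovers 0.8
      expit(0)                                         # a linear predictor of 0 gives probability 0.5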


  • Logistic regression

    I Bayesian logistic regression requires a prior for $\beta$

    I All of the priors we have discussed for linear regression (Zellner, BLASSO, etc.) apply

    I Computationally the full conditional distributions are no longer conjugate and so we must use Metropolis sampling

    I The R function MCMClogit does this efficiently

    I It is fast in JAGS too, for example http://www4.stat.ncsu.edu/~reich/ABA/code/GLM
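
    I A minimal sketch (with simulated binary data, not from the notes) of Bayesian logistic regression via MCMCpack's MCMClogit:

      # Simulate binary responses from a logistic model (hypothetical example)
      set.seed(2)
      z  <- drop(X %*% c(-0.5, 1, -1, 0))
      Yb <- rbinom(n, 1, 1 / (1 + exp(-z)))

      # install.packages("MCMCpack") if needed
      library(MCMCpack)
      dat <- data.frame(Yb = Yb, X[, -1])
      fit_logit <- MCMClogit(Yb ~ ., data = dat, mcmc = 10000)
      summary(fit_logit)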


  • Predictions

    I Say we have a new covariate vector $X_{new}$ and we would like to predict the corresponding response $Y_{new}$

    I A plug-in approach would fix $\beta$ and $\sigma$ at their posterior means $\hat{\beta}$ and $\hat{\sigma}$ to make predictions

    $Y_{new} \mid \hat{\beta}, \hat{\sigma} \sim \text{Normal}(X_{new}\hat{\beta},\ \hat{\sigma}^2)$

    I However this plug-in approach suppresses uncertainty about $\beta$ and $\sigma$

    I Therefore these prediction intervals will be slightly too narrow, leading to undercoverage


  • Posterior predictive distribution (PPD)

    I We should really account for all uncertainty when making predictions, including our uncertainty about $\beta$ and $\sigma$

    I We really want the PPD

    $p(Y_{new} \mid Y) = \int f(Y_{new}, \beta, \sigma \mid Y)\, d\beta\, d\sigma = \int f(Y_{new} \mid \beta, \sigma)\, f(\beta, \sigma \mid Y)\, d\beta\, d\sigma$

    I Marginalizing over the model parameters accounts for their uncertainty

    I The concept of the PPD applies generally (e.g., logistic regression) and means the distribution of the predicted value marginally over model parameters


  • Posterior predictive distribution (PPD)

    I MCMC naturally gives draws from $Y_{new}$'s PPD

    I For MCMC iteration $t$ we have $\beta^{(t)}$ and $\sigma^{(t)}$

    I For MCMC iteration t we sample

    $Y_{new}^{(t)} \sim \text{Normal}\left(X_{new}\beta^{(t)},\ \sigma^{(t)2}\right)$

    I $Y_{new}^{(1)}, \dots, Y_{new}^{(S)}$ are samples from the PPD

    I This is an example of the claim that Bayesian methods naturally quantify uncertainty

    I http://www4.stat.ncsu.edu/~reich/ABA/code/Predict
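
    I A minimal sketch (not the code at the link; it reuses the coda samples from the rjags sketch above and a hypothetical new covariate vector) of drawing from the PPD:

      # One row per MCMC iteration: columns beta[1], ..., beta[P] and sigma
      post <- as.matrix(samples)
      B    <- post[, grep("^beta", colnames(post))]
      sig  <- post[, "sigma"]

      X_new <- c(1, 0.5, -1, 2)                        # hypothetical new covariate vector (with intercept)
      Y_new <- rnorm(nrow(post), mean = B %*% X_new, sd = sig)

      quantile(Y_new, c(0.025, 0.5, 0.975))            # PPD median and 95% prediction interval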
