Page 1:

Bayes Factors, posterior predictives, short intro to RJMCMC

© Dave Campbell 2016

Thermodynamic Integration

Page 2:

Bayesian Statistical Inference

P(\theta \mid Y) \propto P(Y \mid \theta)\,\pi(\theta)

Once you have posterior samples you can compute the predictive distribution of future observations:

P(Y_{new} \mid \theta, Y_{old})

Page 3:

To do this you sample a \theta^* from P(\theta \mid Y) (sample one value from your collection of posterior samples).

Generate simulated data from the likelihood: P(Y_{new} \mid \theta^*).

Repeat for a large sample of \theta^* from P(\theta \mid Y) to get the posterior predictive distribution.
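A minimal R sketch of this loop, assuming a normal likelihood; `posterior_draws` stands in for real MCMC output, and all names and values here are illustrative rather than from the slides:

```r
# Stand-in for real MCMC output: 2000 posterior draws of (mu, sigma).
set.seed(1)
posterior_draws <- cbind(mu = rnorm(2000, 20, 0.5),
                         sigma = abs(rnorm(2000, 2, 0.2)))

n_new <- 50                                 # size of each simulated data set
y_new <- matrix(NA_real_, nrow = nrow(posterior_draws), ncol = n_new)
for (i in seq_len(nrow(posterior_draws))) {
  theta_star <- posterior_draws[i, ]        # one draw theta* from P(theta | Y)
  # Simulated data from the likelihood, P(Y_new | theta*):
  y_new[i, ] <- rnorm(n_new, mean = theta_star["mu"], sd = theta_star["sigma"])
}
# Pooled over rows, y_new approximates the posterior predictive distribution.
```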

Page 4:

Posterior predictive distribution:

No need to use asymptotic normal assumptions or a single point-and-variance estimate for \theta^*.

Any shape of posterior P(\theta \mid Y) naturally feeds its entire distribution through to the data-generating process!

Page 5:

Obtaining P(Y_{new} \mid \theta, Y_{old}) is related to obtaining a set of fake data samples for a parametric bootstrap, except that the distribution assumption on the parameters doesn't require asymptotic arguments.

Page 6:

Uses:

Another diagnostic tool: obtain a sample from P(Y_{new} \mid \theta, Y_{old}) and see if it is similar to the data.

Use the posterior predictive distribution for sequential experimental design: choose the new covariate points that optimize some criterion.
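As a sketch of the diagnostic use, one can compare a summary statistic of the observed data against its distribution over the simulated data sets from the previous block; `y_obs`, the choice of statistic, and the stand-in data are all illustrative assumptions:

```r
# Posterior predictive check: is the observed statistic typical of the
# statistic computed on data simulated from the fitted model?
y_obs <- rnorm(50, 20, 2)                 # stand-in for the observed data
T_obs <- max(y_obs)                       # any summary statistic will do
T_rep <- apply(y_new, 1, max)             # same statistic on each simulated set
mean(T_rep >= T_obs)                      # posterior predictive p-value;
                                          # values near 0 or 1 signal misfit
```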

Page 7:

Hypothesis testing; model comparison

Page 8:

Ultimately we want inference on P(M \mid Y). But computing the marginal likelihood is difficult.

P(M \mid Y) = \frac{\int_\Theta P(Y \mid \theta_M)\,\pi(\theta_M)\,\pi(M)\, d\theta}{P(Y)}

Page 9:

Usually Bayesians make model decisions through Bayes Factors:

B_{12}(y) = \frac{w_1(y)}{w_2(y)}, \qquad w_m(y) = \int_\Theta \pi_m(\theta)\, f_m(y \mid \theta)\, d\theta
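In practice the two marginal likelihoods are estimated on the log scale (for example by the thermodynamic integration estimator later in these notes), and the Bayes Factor comes from exponentiating the difference; the numbers below are made up purely for illustration:

```r
# Bayes factor from two estimated log marginal likelihoods.
log_w1 <- -228.3                 # log w_1(y), model 1 (illustrative value)
log_w2 <- -231.5                 # log w_2(y), model 2 (illustrative value)
B12 <- exp(log_w1 - log_w2)      # subtract on the log scale to avoid underflow
B12                              # ~24.5: evidence favouring model 1
```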

Page 10:

Bayes Factor interpretation

B_{12}(y) = \frac{w_1(y)}{w_2(y)}, \qquad w_m(y) = \int_\Theta \pi_m(\theta)\, f_m(y \mid \theta)\, d\theta

(On Jeffreys' conventional scale, for example, B_{12} > 10 is "strong" and B_{12} > 100 is "decisive" evidence for model 1.)

Page 11:

The odds ratio for two models:

posterior odds = Bayes Factor × prior odds

Uniform prior odds across models implies that

posterior odds = Bayes Factor

Page 12:

posterior odds = Bayes Factor × prior odds

So the Bayes Factor is the amount of evidence for one model compared to another.

BF = the change in odds when moving from the prior to the posterior. For example, with uniform prior odds of 1, a Bayes Factor of 20 leaves posterior odds of 20 to 1 in favour of model 1.

Page 13:

Recall:

P(\theta \mid Y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)}, \qquad P(Y) = \int P(y \mid \theta)\, P(\theta)\, d\theta

Page 14:

Newton & Raftery (1994)

P(\theta \mid Y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)}

\Rightarrow \quad P(Y)\, \frac{P(\theta \mid Y)}{P(y \mid \theta)} = P(\theta)

\Rightarrow \quad \int P(Y)\, \frac{P(\theta \mid Y)}{P(y \mid \theta)}\, d\theta = \int P(\theta)\, d\theta = 1

Page 15:

Newton & Raftery (1994)

\int P(Y)\, \frac{P(\theta \mid Y)}{P(y \mid \theta)}\, d\theta = 1
\quad\Rightarrow\quad
\mathbb{E}_{P(\theta \mid Y)}\!\left[ \frac{1}{P(y \mid \theta)} \right] = \frac{1}{P(Y)}

And they estimated P(Y) by

\hat{P}(Y) = \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{P(y \mid \theta_i)} \right]^{-1}

Page 16:

Newton & Raftery (1994)

Compute this by calculating the likelihood for each value of \theta_i that was obtained from the posterior sampling step:

\hat{P}(Y) = \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{P(y \mid \theta_i)} \right]^{-1}
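A hedged R sketch of the estimator, computed on the log scale with the log-sum-exp trick so that 1/P(y | θ_i) doesn't overflow; the `log_lik` values stand in for per-draw log-likelihoods from a real sampler:

```r
# Harmonic mean estimator of P(Y) from per-draw log-likelihoods.
set.seed(1)
log_lik <- rnorm(5000, mean = -120, sd = 3)   # stand-in for log P(y | theta_i)

# P_hat(Y) = [ (1/N) * sum_i 1/P(y | theta_i) ]^(-1), on the log scale:
N <- length(log_lik)
m <- max(-log_lik)                            # log-sum-exp stabilisation
log_p_hat <- log(N) - (m + log(sum(exp(-log_lik - m))))
log_p_hat                                     # estimate of log P(Y)
```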

Page 17:

Newton & Raftery (1994)

The harmonic mean estimator is very, very sensitive to outliers with extremely small values of P(y \mid \theta).

But it is asymptotically unbiased.

Estimate P(Y) by

\hat{P}(Y) = \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{P(y \mid \theta_i)} \right]^{-1}

Page 18:

Calderhead and Girolami (2009) showed that the harmonic mean estimator can be massively biased for finite samples.

Page 19:

Thermodynamic Integration

Friel, N. and Pettitt, A. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B, 70(3).

Calderhead, B. and Girolami, M. (2009). Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis, 53.

Page 20:

In Parallel Tempering we sample from

P_m(\theta \mid Y) = \frac{P(y \mid \theta)^{\beta_m}\, P(\theta)}{P_m(y)}

But we can get the marginal likelihood via:

\log p(Y) = \int_0^1 \int_\Theta \log\!\left[ p(Y \mid \theta) \right] P_\beta(\theta \mid Y)\, d\theta\, d\beta
          = \int_0^1 \mathbb{E}_{P_\beta(\theta \mid Y)}\!\left\{ \log p(Y \mid \theta) \right\} d\beta
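Why the identity holds: a short, standard derivation (filled in here rather than taken from the slides), writing z(β) for the normalizing constant of the power posterior and differentiating its log.

```latex
z(\beta) = \int_\Theta P(Y \mid \theta)^{\beta}\, P(\theta)\, d\theta ,
  \qquad z(0) = 1, \quad z(1) = P(Y)

\frac{d}{d\beta} \log z(\beta)
  = \frac{1}{z(\beta)} \int_\Theta \log P(Y \mid \theta)\,
      P(Y \mid \theta)^{\beta}\, P(\theta)\, d\theta
  = \mathbb{E}_{P_\beta(\theta \mid Y)}\!\left[ \log P(Y \mid \theta) \right]

\log P(Y) = \log z(1) - \log z(0)
  = \int_0^1 \mathbb{E}_{P_\beta(\theta \mid Y)}\!\left[ \log P(Y \mid \theta) \right] d\beta
```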

Page 21:

Compute via 1-dimensional quadrature over the temperature!

\log p(Y) = \int_0^1 \mathbb{E}_{P_\beta(\theta \mid Y)}\!\left\{ \log p(Y \mid \theta) \right\} d\beta
          \approx \sum_m \frac{1}{2}\left( \beta_m - \beta_{m-1} \right)\left[ E_m + E_{m-1} \right]

where

E_m = \mathbb{E}_{P_m(\theta \mid Y)}\!\left\{ \log p(Y \mid \theta) \right\}

Page 22:

To compute log marginal likelihoods, all we need is to define a good grid of temperatures:

\log p(Y) \approx \sum_m \frac{1}{2}\left( \beta_m - \beta_{m-1} \right)\left[ E_m + E_{m-1} \right],
\qquad E_m = \mathbb{E}_{P_m(\theta \mid Y)}\!\left\{ \log p(Y \mid \theta) \right\}

Calderhead and Girolami (2009) suggest

\beta = \left( \mathrm{seq}(\mathrm{from} = 1, \mathrm{to} = N)/N \right)^5
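Putting the grid and the trapezoid rule together in R. Here `E` stands in for the per-temperature Monte Carlo averages of log P(Y | θ) that the parallel-tempering chains would produce, and the grid includes the β = 0 endpoint so the quadrature covers the whole interval; both are assumptions of this sketch:

```r
# Thermodynamic integration: power-of-5 temperature grid plus trapezoid rule.
N <- 30                                   # number of parallel chains
beta <- (seq(from = 0, to = N) / N)^5     # dense near 0, where E_m changes fast
# Placeholder for E_m = E[ log P(Y | theta) ] under each tempered posterior;
# in a real run these come from averaging over each chain's samples.
E <- -120 - 50 / (1 + 100 * beta)
# log p(Y) ~= sum_m (1/2) * (beta_m - beta_{m-1}) * (E_m + E_{m-1})
log_pY <- sum(0.5 * diff(beta) * (E[-1] + E[-length(E)]))
log_pY
```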

Page 23:

Parallel Tempering To the Extreme!

RStudio plots for the Galaxy data set (3 groups): density of one of the mean parameters vs temperature.

Page 24:

Parallel Tempering densities

That dip just before temperature β = 1 is real. It is caused by the introduction of new modes.

Page 25:

Compare the 3-group Galaxy model to the 6-group Galaxy model.

Show plots of mean density vs temperature.

25,000 iterations with 30 parallel chains.

B_{12}(y) = \frac{w_1(y)}{w_2(y)}

Page 26:

Now, back to RStudio to compare the Galaxy data with k=3 groups vs k=6 groups.

(The result: there is decisive evidence that the k = 3 groups model is better.)

Page 27:

Alternative to Bayes Factors: RJMCMC

Page 28:

MODEL POSTERIOR PROBABILITY

Likelihood: P(Y \mid \theta_j, M_j)

Parameter prior: P(\theta_j \mid M_j)

Model prior: P(M_j \mid \Omega) for M_j \in \Omega

The marginal posterior probability of a model is helpful when the answer is not clear:

P(M_j \mid Y, \Omega) = \frac{\int P(Y \mid \theta_j, M_j, \Omega)\, P(\theta_j \mid M_j, \Omega)\, P(M_j \mid \Omega)\, d\theta_j}{P(Y)} = \int P(\theta_j, M_j \mid Y, \Omega)\, d\theta_j

Page 29:

Our goal is to get

P(M_j \mid Y, \Omega) = \int P(\theta_j, M_j \mid Y, \Omega)\, d\theta_j

in a single MCMC chain, even if \Omega contains a lot of models.

We need simulation methods that sample across models.

Page 30:

REVERSIBLE JUMP MCMC

We can avoid extensive MCMC for each model and instead sample from P(M_j \mid Y, \Omega) directly!

We just adjust MCMC so at each iteration we:

1. Sample j, i.e. choose a model M_{new}

2. Then propose a \theta_{new} from M_{new}

3. Keep M_{new} and \theta_{new} with probability

\alpha = \min\!\left( \frac{P(Y \mid \theta_{new}, M_{new})\, P(\theta_{new}, M_{new})\, P_{new}(v_{new})}{P(Y \mid \theta_{old}, M_{old})\, P(\theta_{old}, M_{old})\, P_{old}(v_{old})} \left| J_{old,new} \right| ,\; 1 \right)

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4), 711-732.

Page 31:

We use auxiliary variables v to augment the dimension space so that dim(M_{old}) = dim(M_{new}):

\alpha = \min\!\left( \frac{P(Y \mid \theta_{new}, M_{new})\, P(\theta_{new}, M_{new})\, P_{new}(v_{new})}{P(Y \mid \theta_{old}, M_{old})\, P(\theta_{old}, M_{old})\, P_{old}(v_{old})} \left| J_{old,new} \right| ,\; 1 \right)

Page 32:

JACOBIAN

We need the Jacobian for the transformation to \theta_{new}. And the proposed values need to allow the possibility of being accepted.

J = \begin{bmatrix}
\frac{\partial \theta_{old,1}}{\partial \theta_{new,1}} & \cdots & \frac{\partial \theta_{old,1}}{\partial \theta_{new,p_{new}}} & \cdots & \frac{\partial \theta_{old,1}}{\partial v_{new}} \\
\vdots & \ddots & \vdots & & \vdots \\
\frac{\partial \theta_{old,p_{old}}}{\partial \theta_{new,1}} & \cdots & \frac{\partial \theta_{old,p_{old}}}{\partial \theta_{new,p_{new}}} & \cdots & \frac{\partial \theta_{old,p_{old}}}{\partial v_{new}} \\
\vdots & & \vdots & \ddots & \vdots \\
\frac{\partial v_{old}}{\partial \theta_{new,1}} & \cdots & \cdots & \cdots & \frac{\partial v_{old}}{\partial v_{new}}
\end{bmatrix}

Page 33:

POTENTIAL PROBLEMS

M1 and M2 have different parameter dimensions.

Often model parameters don't have an obvious transformation allowing an intuitive transition.

The last accepted value might be from a different model and may require a large jump in the parameter space.

Page 34:

M1: Y \sim N(\beta_{1,0} + \beta_{1,1} X,\; \sigma_1^2)

M2: Y \sim N(\beta_{2,0} + \beta_{2,1} X + \beta_{2,2} X^2,\; \sigma_2^2)

Moving from M1 to M2 will require moving \beta_{1,0} quite far to get to a reasonable location for \beta_{2,0}.
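A hedged R sketch of one dimension-matching choice for exactly this pair of models: keep (β_{1,0}, β_{1,1}), draw an auxiliary v for the quadratic coefficient, so the map (β_old, v) → β_new is the identity and |J| = 1. The priors, the proposal scale, the fixed σ = 1, and the assumption of equal model priors and symmetric model-proposal probabilities (so they cancel in α) are all illustrative choices, not the slides' method:

```r
# One possible M1 -> M2 (linear -> quadratic) reversible jump move.
log_post <- function(beta, X, Y) {
  # log P(Y | beta) + log P(beta) for a polynomial of degree length(beta) - 1
  mu <- cbind(1, X, X^2)[, seq_along(beta), drop = FALSE] %*% beta
  sum(dnorm(Y, mu, 1, log = TRUE)) + sum(dnorm(beta, 0, 10, log = TRUE))
}

rj_up_move <- function(beta_old, X, Y, sd_v = 0.5) {
  v <- rnorm(1, 0, sd_v)                  # auxiliary variable -> beta_{2,2}
  beta_new <- c(beta_old, v)              # identity mapping, so |J| = 1
  log_alpha <- log_post(beta_new, X, Y) - log_post(beta_old, X, Y) -
    dnorm(v, 0, sd_v, log = TRUE)         # divide by the auxiliary density q(v)
  if (log(runif(1)) < log_alpha) list(beta = beta_new, model = 2)
  else list(beta = beta_old, model = 1)
}
```

In this construction the reverse (M2 → M1) move simply drops \beta_{2,2}, treating it as the auxiliary variable for the down move, and uses the reciprocal of the same ratio.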

Page 35:

M1: Galaxy with 3 Gaussians

M2: Galaxy with 4 Gaussians

Moving from M1 to M2 can be done by splitting one of the current Gaussians. Moving from M2 to M1 can be done by merging two components.
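One common way to make the merge concrete (a sketch in the spirit of Richardson & Green's split/merge moves for mixtures, not necessarily the construction used in the lecture) is to match the zeroth, first, and second moments of the merged component:

```latex
w^{*} = w_1 + w_2, \qquad
w^{*}\mu^{*} = w_1 \mu_1 + w_2 \mu_2, \qquad
w^{*}\!\left( \mu^{*2} + \sigma^{*2} \right)
  = w_1\!\left( \mu_1^2 + \sigma_1^2 \right) + w_2\!\left( \mu_2^2 + \sigma_2^2 \right)
```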

Page 36:

RJMCMC: Beautiful in principle, nasty in practice

Needs: a transition function between parameters in multiple model spaces.

Efficiency depends completely on this functional choice and on the distribution for the auxiliary variables.

Works well when we can use a birth/death process (e.g., change-point analysis).