06-Aug-2020

• Bayes Factors, posterior predictives, short intro to RJMCMC, Thermodynamic Integration

• Bayesian Statistical Inference

P(θ∣Y) ∝ P(Y∣θ)π(θ)

Once you have posterior samples you can compute the predictive distribution of future observations:

P(Ynew∣θ,Yold)

• To do this you sample a θ* from P(θ∣Y)

(Sample 1 value from your collection of posterior samples)

Generate simulated data from the likelihood: P(Ynew∣θ*)

Repeat for a large sample of θ* from P(θ∣Y) to get at the posterior predictive distribution
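The two resampling steps above can be sketched in Python. This is an illustrative example only: it assumes a normal likelihood with known sigma, and `theta_post` is a stand-in for posterior samples you would actually obtain from MCMC.

```python
import random

random.seed(1)

# Stand-in posterior samples for a mean parameter theta (in practice these
# come from your MCMC run); a normal likelihood with known sigma is assumed.
theta_post = [random.gauss(2.0, 0.3) for _ in range(5000)]
sigma = 1.0

def predictive_draws(theta_post, sigma, n_draws=5000):
    """Draw from the posterior predictive by resampling theta* and simulating."""
    draws = []
    for _ in range(n_draws):
        theta_star = random.choice(theta_post)         # sample theta* from P(theta|Y)
        draws.append(random.gauss(theta_star, sigma))  # simulate Ynew from P(Ynew|theta*)
    return draws

ynew = predictive_draws(theta_post, sigma)
```

Note that no normal approximation of the posterior is used anywhere: whatever shape `theta_post` has is passed straight through to the simulated data.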

• Posterior predictive distribution:

No need to use asymptotic normal assumptions or a single point and variance estimate for θ*.

Any shaped posterior distribution P(θ∣Y) naturally feeds its entire distribution through to the data generating process!

• Obtaining P(Ynew∣θ,Yold) is related to obtaining a set of fake data samples for the parametric bootstrap, except that the distribution assumption on the parameters doesn’t require asymptotic arguments.

• Uses:

Another diagnostic tool: obtain a sample from P(Ynew∣θ,Yold) and see if it is similar to the data.

Use the posterior predictive distribution for sequential experimental design: choose the new covariate points that optimize some criterion.

• Hypothesis testing; model comparison

• Ultimately we want inference on P(M∣Y). But computing the marginal likelihood is difficult.

P(M∣Y) = π(M) ∫Θ P(Y∣θM) π(θM) dθM / P(Y)

• Usually Bayesians make model decisions through Bayes Factors:

B12(y) = w1(y) / w2(y), where w(y) = ∫Θ π(θ) f(y∣θ) dθ

• Bayes Factor interpretation

B12(y) = w1(y) / w2(y), w(y) = ∫Θ π(θ) f(y∣θ) dθ

• The odds ratio for two models:

posterior odds = Bayes Factor × prior odds

Uniform prior odds across models implies that

posterior odds = Bayes Factor

• posterior odds = Bayes Factor × prior odds

So the Bayes factor is the amount of evidence for one model compared to another.

BF = the change in odds when moving from the prior to the posterior
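As a concrete illustration (not from the slides): compare two models for k heads in n coin flips, where M1 fixes p = 0.5 and M2 puts a uniform prior on p. Both marginal likelihoods w(y) have closed forms here, so the Bayes factor and posterior odds take a few lines:

```python
from math import comb

def bayes_factor_coin(k, n):
    """B12 for M1: p = 0.5 fixed vs M2: p ~ Uniform(0, 1)."""
    w1 = comb(n, k) * 0.5 ** n   # binomial likelihood at p = 0.5
    w2 = 1.0 / (n + 1)           # integral of the binomial likelihood over uniform p
    return w1 / w2

def posterior_odds(bf, prior_odds=1.0):
    # posterior odds = Bayes Factor x prior odds
    return bf * prior_odds

bf = bayes_factor_coin(k=5, n=10)  # about 2.71: mild evidence for the point null M1
```

With uniform prior odds, `posterior_odds(bf)` just returns the Bayes factor, matching the slide.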

• Recall: P(θ∣Y) = P(Y∣θ)P(θ) / P(Y)

P(Y) = ∫ P(Y∣θ)P(θ) dθ

• Newton & Raftery (1994)

P(θ∣Y) = P(Y∣θ)P(θ) / P(Y)

⇒ P(Y) P(θ∣Y) / P(Y∣θ) = P(θ)

⇒ ∫ P(Y) P(θ∣Y) / P(Y∣θ) dθ = ∫ P(θ) dθ = 1

• Newton & Raftery (1994)

∫ P(Y) P(θ∣Y) / P(Y∣θ) dθ = 1

⇒ E[ 1/P(Y∣θ) ]_{P(θ∣Y)} = 1/P(Y)

And estimate P(Y) by

P̂(Y) = [ (1/N) Σ_{i=1}^{N} 1/P(Y∣θi) ]^{-1}

• Newton & Raftery (1994)

Compute this by calculating the likelihood for each value θi that was obtained from the posterior sampling step:

P̂(Y) = [ (1/N) Σ_{i=1}^{N} 1/P(Y∣θi) ]^{-1}

• Newton & Raftery (1994)

Estimate P(Y) by

P̂(Y) = [ (1/N) Σ_{i=1}^{N} 1/P(Y∣θi) ]^{-1}

The harmonic mean estimator is asymptotically unbiased, but it is extremely sensitive to outliers with very small values of P(Y∣θ).
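A minimal sketch of the harmonic mean estimator, computed in log space for numerical stability (`loglik` is assumed to hold log P(Y∣θi) evaluated at posterior draws θi):

```python
import math

def log_harmonic_mean(loglik):
    """log P-hat(Y) = log N - logsumexp(-loglik): the harmonic mean estimator."""
    n = len(loglik)
    m = max(-l for l in loglik)  # logsumexp shift for numerical stability
    lse = m + math.log(sum(math.exp(-l - m) for l in loglik))
    return math.log(n) - lse
```

The sensitivity the slide warns about is easy to see: a single draw with an extremely small likelihood dominates the sum of 1/P(Y∣θi) and drags the whole estimate down.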

• Calderhead and Girolami (2009) showed that the harmonic mean estimator can be massively biased for finite samples

• Thermodynamic Integration

Friel, N. and Pettitt, A. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B, 70(3).

Calderhead, B. and Girolami, M. (2009). Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis, 53.

• In Parallel Tempering we sample from

Pm(θ∣Y) = P(Y∣θ)^{βm} P(θ) / Pm(Y)

But we can get the marginal likelihood via:

log p(Y) = ∫_0^1 ∫ log[p(Y∣θ)] Pm(θ∣Y) dθ dβ = ∫_0^1 E[ log p(Y∣θ) ]_{Pm(θ∣Y)} dβ

• Compute via 1-dimensional quadrature over the temperature!

log p(Y) = ∫_0^1 E[ log p(Y∣θ) ]_{Pm(θ∣Y)} dβ, with Em = E[ log p(Y∣θ) ]_{Pm(θ∣Y)}

log p(Y) ≈ Σ_m (1/2)(βm − βm−1)(Em + Em−1)

• To compute log marginal likelihoods, all we need is to define a good grid of temperatures:

log p(Y) ≈ Σ_m (1/2)(βm − βm−1)(Em + Em−1), Em = E[ log p(Y∣θ) ]_{Pm(θ∣Y)}

β = (seq(from = 1, to = N)/N)^5
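The quadrature step can be sketched as follows, assuming you already have a temperature grid βm and Monte Carlo estimates Em from each tempered chain (here in Python rather than the slides' R):

```python
def power_grid(n):
    """Fifth-power temperature grid on [0, 1], concentrating points near beta = 0."""
    return [(i / n) ** 5 for i in range(n + 1)]

def log_marginal_trapezoid(betas, Es):
    """Trapezoidal rule: sum_m 0.5 * (beta_m - beta_{m-1}) * (E_m + E_{m-1})."""
    return sum(0.5 * (betas[m] - betas[m - 1]) * (Es[m] + Es[m - 1])
               for m in range(1, len(betas)))
```

The fifth-power spacing places most grid points near β = 0, where the integrand E[log p(Y∣θ)] typically changes fastest.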

• Parallel Tempering To the Extreme!

RStudio plots for the Galaxy data set (3 groups): density of one of the mean parameters vs. temperature

• Parallel Tempering densities

That dip just before temperature β = 1 is real. It is caused by the introduction of new modes.

• Compare the 3 group Galaxy to the 6 group Galaxy.

Show plots of mean density vs. temperature.

25,000 iterations with 30 parallel chains

B12(y) = w1(y) / w2(y)

• Now, back to RStudio to compare the Galaxy data with k=3 groups vs k=6 groups.

(the result: there is decisive evidence that the k=3 groups model is better)

• Alternative to Bayes Factors: RJMCMC

• MODEL POSTERIOR PROBABILITY

Likelihood: P(Y∣θj, Mj)

Parameter Prior: P(θj∣Mj)

Model Prior: P(Mj∣Ω) for Mj ∈ Ω

The marginal posterior probability of a model is helpful when the answer is not clear:

P(Mj∣Y, Ω) = ∫ P(Y∣θj, Mj, Ω) P(θj∣Mj, Ω) P(Mj∣Ω) dθj / P(Y)

P(Mj∣Y, Ω) = ∫ P(θj, Mj∣Y, Ω) dθj

• Our goal is to get P(Mj∣Y, Ω) = ∫ P(θj, Mj∣Y, Ω) dθj in a single MCMC chain, even if Ω contains a lot of models.

We need simulation methods that sample across models.

• REVERSIBLE JUMP MCMC

We can avoid extensive MCMC for each model and instead sample from P(Mj∣Y, Ω) directly!

We just adjust MCMC so at each iteration we:

1. Sample j, i.e. choose a model Mnew

2. Then propose a θnew from Mnew

3. Keep Mnew and θnew with probability

α = min( [P(Y∣θnew, Mnew) P(θnew, Mnew) Pnew(vnew)] / [P(Y∣θold, Mold) P(θold, Mold) Pold(vold)] × |Jold,new| , 1 )

(Green, Biometrika (1995), 82(4), pp. 711–732)
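The acceptance step can be written generically in log space. This is only a sketch of the bookkeeping: each argument is one log term from the ratio above, and computing those terms for a specific move (likelihoods, priors, auxiliary-variable densities, Jacobian) is the hard, model-specific part.

```python
import math
import random

def rj_log_accept(loglik_new, logprior_new, logv_new,
                  loglik_old, logprior_old, logv_old, log_abs_jac):
    """log alpha = min(log ratio, 0) for the reversible-jump acceptance ratio."""
    log_ratio = ((loglik_new + logprior_new + logv_new)
                 - (loglik_old + logprior_old + logv_old)
                 + log_abs_jac)
    return min(log_ratio, 0.0)

def rj_accept(*log_terms, u=None):
    """Accept the proposed (theta_new, M_new) with probability exp(log alpha)."""
    u = random.random() if u is None else u
    return math.log(u) < rj_log_accept(*log_terms)
```

Working in log space avoids underflow when the likelihood terms are tiny, which is routine for large data sets.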

• AUXILIARY VARIABLES v

We use auxiliary variables v to augment the dimension space so that dim(Mold) = dim(Mnew):

α = min( [P(Y∣θnew, Mnew) P(θnew, Mnew) Pnew(vnew)] / [P(Y∣θold, Mold) P(θold, Mold) Pold(vold)] × |Jold,new| , 1 )

• JACOBIAN

We need the Jacobian for the transformation, and the proposed values θnew need to allow the possibility of being accepted.

J = ∂(θold, vold) / ∂(θnew, vnew), i.e. the matrix with rows indexed by (θold,1, …, θold,pold, vold) and columns by (θnew,1, …, θnew,pnew, vnew), whose entries are the partial derivatives ∂θold,i/∂θnew,j, ∂θold,i/∂vnew, ∂vold/∂θnew,j and ∂vold/∂vnew.

• POTENTIAL PROBLEMS

M1 and M2 have different parameter dimensions.

Often model parameters don’t have an obvious transformation allowing an intuitive transition.

The last accepted value might be from a different model and may require a large jump in the parameter space.

• M1: Y ∼ N(β1,0 + β1,1 X, σ1²)

M2: Y ∼ N(β2,0 + β2,1 X + β2,2 X², σ2²)

Moving from M1 to M2 will require moving β1,0 quite far to get to a reasonable location for β2,0.

Y ⇠ N(�1,0 + �1,1X,�21)

• M1: Galaxy with 3 Gaussians

M2: Galaxy with 4 Gaussians

Moving from M1 to M2 can be done by splitting one of the current Gaussians; moving from M2 to M1 can be done by merging 2 components.

• RJMCMC: Beautiful in principle, nasty in practice

Needs: a transition function between parameters in multiple model spaces.

Efficiency depends completely on this functional choice and on the distribution for the auxiliary variables.

Works well when we can use a birth/death process (e.g. change-point analysis).