
  • Bayes Factors, posterior predictives, short intro to RJMCMC, and Thermodynamic Integration

    © Dave Campbell 2016

  • Bayesian Statistical Inference

    P(θ∣Y) ∝ P(Y∣θ)π(θ)

    Once you have posterior samples you can compute the predictive distribution of future observations:

    P(Ynew∣Yold) = ∫ P(Ynew∣θ) P(θ∣Yold) dθ

  • To do this you sample a θ* from P(θ∣Y) (sample 1 value from your collection of posterior samples).

    Generate simulated data from the likelihood: P(Ynew∣θ*)

    Repeat for a large sample of θ* from P(θ∣Y) to get at the posterior predictive distribution (see the R sketch below).
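    A minimal R sketch of this loop, assuming (purely for illustration) a normal likelihood and a data frame posterior_draws of MCMC output with columns mu and sigma; none of these names come from the slides.

    # Hypothetical posterior draws: one row per MCMC sample of (mu, sigma).
    # In practice these come from your own sampler.
    posterior_draws <- data.frame(mu    = rnorm(5000, 10, 0.2),
                                  sigma = abs(rnorm(5000, 2, 0.1)))

    n_rep <- 5000
    y_new <- numeric(n_rep)
    for (i in 1:n_rep) {
      j <- sample(nrow(posterior_draws), 1)              # draw one theta* from P(theta | Y)
      y_new[i] <- rnorm(1, posterior_draws$mu[j],        # simulate Ynew from the likelihood
                           posterior_draws$sigma[j])     # evaluated at theta*
    }
    # y_new now holds draws from the posterior predictive distribution P(Ynew | Yold)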

  • Posterior predictive distribution:

    No need to use asymptotic normal assumptions or a single point and variance estimate for θ*.

    Any shaped distribution on P(θ∣Y) naturally feeds its entire distribution through to the data generating process!

  • Obtaining P(Ynew∣Yold) is related to obtaining a set of fake data samples for the parametric bootstrap, except that the distribution assumption on the parameters doesn’t require asymptotic arguments.

  • Uses:

    Another diagnostic tool: obtain a sample from P(Ynew∣Yold) and see if it is similar to the observed data (see the sketch below).

    Use the posterior predictive distribution for sequential experimental design: choose the new covariate points that optimize some criterion.
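    One hypothetical way to do this check in R, reusing posterior_draws from the sketch above; y_obs is a placeholder for the observed data.

    y_obs <- rnorm(50, 10, 2)                      # placeholder for the observed data vector
    obs_stat <- max(y_obs)                         # any test statistic of interest
    rep_stat <- replicate(1000, {                  # same statistic on replicated data sets
      j <- sample(nrow(posterior_draws), 1)
      max(rnorm(length(y_obs), posterior_draws$mu[j], posterior_draws$sigma[j]))
    })
    mean(rep_stat >= obs_stat)                     # a posterior predictive p-value near 0 or 1 flags misfit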

  • Hypothesis testing; model comparison

  • Ultimately we want inference on P(M∣Y), but computing the marginal likelihood is difficult:

    P(M∣Y) = [ ∫Θ P(Y∣θM) π(θ) π(M) dθ ] / P(Y)

  • Usually Bayesians make model decisions through Bayes Factors:

    B12(y) = w1(y) / w2(y)

    w(y) = ∫Θ π(θ) f(y∣θ) dθ

  • Bayes Factor interpretation

    B12(y) = w1(y) / w2(y), where w(y) = ∫Θ π(θ) f(y∣θ) dθ (a naive Monte Carlo illustration of w(y) is sketched below).
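    For a concrete sense of w(y), here is a simple Monte Carlo sketch that averages the likelihood over prior draws; the model (y ~ N(θ, 1), θ ~ N(0, 5²)) is purely hypothetical, and in practice this estimator has far too much variance, which is why the estimators later in these slides exist.

    y <- rnorm(40, 1, 1)                               # placeholder data
    theta_prior <- rnorm(1e5, 0, 5)                    # draws from the prior pi(theta)
    lik <- sapply(theta_prior,
                  function(th) exp(sum(dnorm(y, th, 1, log = TRUE))))
    w_hat <- mean(lik)                                 # w(y) ~ (1/N) * sum_i f(y | theta_i)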

  • The odds ratio for two models:

    posterior odds = Bayes Factor × prior odds

    Uniform prior odds across models implies that posterior odds = Bayes Factor.

  • posterior odds = Bayes Factor × prior odds

    So the Bayes factor is the amount of evidence for one model compared to another.

    BF = the change in odds when moving from the prior to the posterior.
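    For example (hypothetical numbers): if B12(y) = 15 but the prior odds favour model 2 at 1:3, the posterior odds are 15 × (1/3) = 5, i.e. the data move the odds from 1:3 against model 1 to 5:1 in its favour.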

  • Recall: P(θ∣Y) = P(y∣θ)P(θ) / P(Y)

    P(Y) = ∫ P(y∣θ)P(θ) dθ

  • Newton & Raftery (1994)

    P(θ∣Y) = P(y∣θ)P(θ) / P(Y)

    ⇒ P(Y) P(θ∣Y) / P(y∣θ) = P(θ)

    ⇒ ∫ P(Y) P(θ∣Y) / P(y∣θ) dθ = ∫ P(θ) dθ = 1

  • Newton & Raftery (1994)

    ∫ P(Y) P(θ∣Y) / P(y∣θ) dθ = 1

    ⇒ E[ 1 / P(y∣θ) ]P(θ∣Y) = 1 / P(Y)

    and estimate P(Y) by the harmonic mean of the likelihood values:

    P̂(Y) = [ (1/N) Σi=1..N 1 / P(y∣θi) ]⁻¹

  • Newton & Raftery (1994)

    Compute this by calculating the likelihood for each value of θi that was obtained from the posterior sampling step (see the R sketch below):

    P̂(Y) = [ (1/N) Σi=1..N 1 / P(y∣θi) ]⁻¹
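    A minimal R sketch of the harmonic mean estimator, assuming a vector loglik of log P(y | θi) values evaluated at the posterior draws (the log-sum-exp rearrangement is just for numerical stability and is not part of the slides):

    loglik <- rnorm(10000, -150, 3)   # placeholder: log P(y | theta_i) from your own sampler
    N <- length(loglik)

    # Harmonic mean estimator on the log scale:
    # log P_hat(Y) = log N - log( sum_i exp(-loglik_i) )
    log_sum_exp <- function(x) max(x) + log(sum(exp(x - max(x))))
    log_marglik <- log(N) - log_sum_exp(-loglik)
    log_marglik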

  • Newton & Raftery (1994)

    Estimate P(Y) by

    P̂(Y) = [ (1/N) Σi=1..N 1 / P(y∣θi) ]⁻¹

    The harmonic mean estimator is extremely sensitive to outliers: posterior draws with very small values of P(y∣θ) dominate the sum. But it is asymptotically unbiased.

  • Calderhead and Girolami (2009) showed that the harmonic mean estimator can be massively biased for finite samples.

  • Thermodynamic Integration

    Friel, N. and Pettitt, A. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B 70(3).

    Calderhead, B. and Girolami, M. (2009). Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis 53.

  • In Parallel Tempering we sample from the power posteriors

    Pm(θ∣Y) = P(Y∣θ)^βm P(θ) / Pm(Y)

    But we can get the marginal likelihood via:

    log p(Y) = ∫₀¹ [ ∫ log p(Y∣θ) Pm(θ∣Y) dθ ] dβ = ∫₀¹ E{ log p(Y∣θ) }Pm(θ∣Y) dβ

  • Compute via 1-dimensional quadrature over the temperature!

    log p(Y) = ∫₀¹ E{ log p(Y∣θ) }Pm(θ∣Y) dβ

    Em = E{ log p(Y∣θ) }Pm(θ∣Y)

    log p(Y) ≈ (1/2) Σm (βm − βm−1)(Em + Em−1)

  • To compute log(marginal likelihoods) all we need is to define a good grid for temperatures. Calderhead and Girolami (2009) suggest

    β = (seq(from = 1, to = N) / N)^5

    and then apply the trapezoidal rule (see the R sketch below):

    log p(Y) ≈ (1/2) Σm (βm − βm−1)(Em + Em−1),  Em = E{ log p(Y∣θ) }Pm(θ∣Y)
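    A minimal R sketch of the quadrature step, assuming a parallel tempering run has already produced, for each temperature, the chain average of log p(Y | θ); the vector E_m below is a placeholder for those averages.

    N_temps <- 30
    beta <- (seq(from = 1, to = N_temps) / N_temps)^5      # Calderhead & Girolami temperature grid
    E_m  <- -200 + 50 * beta                               # placeholder: E[log p(Y | theta)] at each beta

    # Trapezoidal rule over temperature:
    # log p(Y) ~ sum_m 0.5 * (beta_m - beta_{m-1}) * (E_m + E_{m-1})
    m <- 2:N_temps
    log_marglik <- sum(0.5 * (beta[m] - beta[m - 1]) * (E_m[m] + E_m[m - 1]))
    log_marglik

    The log Bayes factor for two models is then just the difference of their two estimated log marginal likelihoods.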

  • Parallel Tempering to the extreme!

    RStudio plots for the Galaxy data set (3 groups): density of one of the mean parameters vs temperature.

  • Parallel Tempering densities: the dip just before temperature β = 1 is real. It is caused by the introduction of new modes.

  • Compare the 3 group Galaxy to the 6 group Galaxy.

    Show plots of mean density vs temperature.

    25,000 iterations with 30 parallel chains.

    B12(y) = w1(y) / w2(y)

  • Now, back to RStudio to compare the Galaxy data with k=3 groups vs k=6 groups.

    (the result: there is decisive evidence that the k=3 groups model is better)

  • Alternative to Bayes Factors: RJMCMC

  • MODEL POSTERIOR PROBABILITY

    Likelihood: P(Y∣θj, Mj)

    Parameter prior: P(θj∣Mj)

    Model prior: P(Mj∣Ω) for Mj ∈ Ω

    The marginal posterior probability of a model is helpful when the answer is not clear:

    P(Mj∣Y, Ω) = ∫ P(Y∣θj, Mj, Ω) P(θj∣Mj, Ω) P(Mj∣Ω) dθj / P(Y)

    P(Mj∣Y, Ω) = ∫ P(θj, Mj∣Y, Ω) dθj

  • Our goal is to get P(Mj∣Y, Ω) = ∫ P(θj, Mj∣Y, Ω) dθj in a single MCMC chain, even if Ω contains a lot of models.

    We need simulation methods that sample across models.

  • REVERSIBLE JUMP MCMC

    We can avoid extensive MCMC for each model and instead sample from P(Mj∣Y, Ω) directly!

    We just adjust MCMC so at each iteration we:

    1. Sample j, i.e. choose a model Mnew

    2. Then propose a θnew from Mnew

    3. Keep Mnew and θnew with probability

    α = min( [ P(Y∣θnew, Mnew) P(θnew, Mnew) Pnew(vnew) ] / [ P(Y∣θold, Mold) P(θold, Mold) Pold(vold) ] × |Jold,new| , 1 )

    (A toy worked example is sketched below.)

    Green, Biometrika (1995), 82(4), pp. 711-732.
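    A toy illustration of these three steps, not from the slides: a hypothetical RJMCMC sampler that jumps between M1: Y ~ N(0, 1) (no free parameters) and M2: Y ~ N(μ, 1) with prior μ ~ N(0, 1). The auxiliary proposal is v ~ N(0, τ²), the mapping μnew = v has Jacobian 1, between-model moves are proposed with the same probability from either side, and the equal model priors cancel in α.

    set.seed(1)
    y   <- rnorm(40, mean = 0.8, sd = 1)       # toy data
    tau <- 1                                    # sd of the auxiliary proposal q(v)

    loglik1 <- sum(dnorm(y, 0, 1, log = TRUE))                  # M1: Y ~ N(0, 1)
    loglik2 <- function(mu) sum(dnorm(y, mu, 1, log = TRUE))    # M2: Y ~ N(mu, 1)

    n_iter <- 20000
    model  <- integer(n_iter); mu_out <- numeric(n_iter)
    cur_model <- 1; cur_mu <- NA

    for (t in 1:n_iter) {
      if (runif(1) < 0.5) {                    # between-model move (same proposal prob both ways)
        if (cur_model == 1) {
          v <- rnorm(1, 0, tau)                # auxiliary variable; mu_new = v, |J| = 1
          log_alpha <- loglik2(v) + dnorm(v, 0, 1, log = TRUE) -
                       loglik1   - dnorm(v, 0, tau, log = TRUE)
          if (log(runif(1)) < log_alpha) { cur_model <- 2; cur_mu <- v }
        } else {
          log_alpha <- loglik1         + dnorm(cur_mu, 0, tau, log = TRUE) -
                       loglik2(cur_mu) - dnorm(cur_mu, 0, 1, log = TRUE)
          if (log(runif(1)) < log_alpha) { cur_model <- 1; cur_mu <- NA }
        }
      } else if (cur_model == 2) {             # within-model random walk update of mu
        prop <- cur_mu + rnorm(1, 0, 0.2)
        log_alpha <- loglik2(prop)   + dnorm(prop, 0, 1, log = TRUE) -
                     loglik2(cur_mu) - dnorm(cur_mu, 0, 1, log = TRUE)
        if (log(runif(1)) < log_alpha) cur_mu <- prop
      }
      model[t] <- cur_model; mu_out[t] <- cur_mu
    }
    mean(model == 2)   # estimates P(M2 | Y)

    With uniform prior model odds, the proportion of iterations spent in each model estimates P(Mj∣Y), and the ratio of those proportions estimates the Bayes factor.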

  • We use auxiliary variables v to augment the dimension space so that dim(Mold) = dim(Mnew):

    α = min( [ P(Y∣θnew, Mnew) P(θnew, Mnew) Pnew(vnew) ] / [ P(Y∣θold, Mold) P(θold, Mold) Pold(vold) ] × |Jold,new| , 1 )

  • JACOBIAN

    We need the Jacobian for the transformation from (θold, vold) to (θnew, vnew), and the proposed values need to allow the possibility of being accepted.

    J = ∂(θold, vold) / ∂(θnew, vnew), the matrix with entries ∂θold,i/∂θnew,j, ∂θold,i/∂vnew, ∂vold/∂θnew,j, and ∂vold/∂vnew (see the worked example below).
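    A hypothetical worked example (not from the slides): a split move that turns one mean θold into two means via θnew,1 = θold − vold and θnew,2 = θold + vold. The inverse map is θold = (θnew,1 + θnew,2)/2 and vold = (θnew,2 − θnew,1)/2, so

    J = ∂(θold, vold)/∂(θnew,1, θnew,2) = [ 1/2, 1/2 ; −1/2, 1/2 ],  |det J| = 1/2,

    which is the factor that enters α for this move.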

  • POTENTIAL PROBLEMS

    M1 and M2 have different parameter dimensions.

    Often model parameters don’t have an obvious transformation allowing an intuitive transition.

    The last accepted value might be from a different model and may require a large jump in the parameter space.

  • M1: Y ~ N(β1,0 + β1,1 X, σ1²)

    M2: Y ~ N(β2,0 + β2,1 X + β2,2 X², σ2²)

    Moving from M1 to M2 will require moving β1,0 quite far to get to a reasonable location for β2,0.

  • M1: Galaxy with 3 Gaussians

    M2: Galaxy with 4 Gaussians

    Moving from M1 to M2 can be done by dividing one of the current Gaussians; moving from M2 to M1 can be done by merging 2 components.

  • RJMCMC: Beautiful in principle, nasty in practice

    Needs: a transition function between parameters in multiple model spaces.

    Efficiency depends completely on this functional choice and the distribution for the auxiliary variables.

    Works well when we can use a birth/death process (e.g. change-point analysis).