Chapter 2 Revision of Bayesian inference

Transcript of lecture slides: nag48/teaching/MAS8951/bayesrevSlides.pdf (32 pages).

Page 1

Chapter 2

Revision of Bayesian inference

Page 2

Bayes Theorem

Notation

Parameter vector: θ = (θ1, . . . , θp)^T.

Data: x = (x1, x2, . . . , xn)^T.

Bayes Theorem for events

For two events E and F, the conditional probability of E given F is:

Pr(E|F) = Pr(E ∩ F) / Pr(F).

This gives us the simplest version of Bayes theorem:

Pr(E|F) = Pr(F|E) Pr(E) / Pr(F).

Page 3

Bayes Theorem

If E1, E2, . . . , En is a partition, i.e. the events Ei are mutually exclusive with Pr(E1 ∪ . . . ∪ En) = 1, then

Pr(F) = Pr((F ∩ E1) ∪ (F ∩ E2) ∪ . . . ∪ (F ∩ En)) = Σ_{i=1}^n Pr(F ∩ Ei)
      = Σ_{i=1}^n Pr(F|Ei) Pr(Ei).

This gives us the more commonly used version of Bayes theorem:

Pr(Ei|F) = Pr(F|Ei) Pr(Ei) / Σ_{i=1}^n Pr(F|Ei) Pr(Ei).
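As a computational aside (not from the slides), the partition form of Bayes theorem can be written as a short Python function; the name posterior_over_partition and its inputs are illustrative.

import numpy as np

def posterior_over_partition(priors, likelihoods):
    """Return Pr(Ei|F) for each i, given priors Pr(Ei) and likelihoods Pr(F|Ei)."""
    priors = np.asarray(priors, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    joint = likelihoods * priors     # Pr(F|Ei) Pr(Ei) for each i
    return joint / joint.sum()       # divide by Pr(F) = sum_i Pr(F|Ei) Pr(Ei)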

Page 4

Comments

Bayes theorem is useful because it tells us how to turn probabilities around.

Often we are able to understand the probability of outcomes F conditional on various hypotheses Ei.

We can then compute probabilities of the form Pr(F|Ei).

However, when we actually observe some outcome, we are interested in which hypothesis is most likely to be true, or in other words, we are interested in the probabilities of the hypotheses conditional on the outcome Pr(Ei|F).

Bayes theorem tells us how to compute these posterior probabilities, but we need to use our prior belief Pr(Ei) for each hypothesis.

In this way, Bayes theorem gives us a coherent way to update our prior beliefs into posterior probabilities following an outcome F.


Page 8

Inference for discrete parameters and data

Suppose our data and parameters have discrete distributions.

The hypotheses Ei are values for our parameter(s) Θ, while the outcome F is a set of data X.

Bayes theorem is then

Pr(Θ = θ|X = x) = Pr(X = x|Θ = θ) Pr(Θ = θ) / Σ_t Pr(X = x|Θ = t) Pr(Θ = t).

Page 9

Example 2.1 (Drug testing in athletes)

Due to inaccuracies in drug testing procedures (e.g., false positives and false negatives), in the medical field the results of a drug test represent only one factor in a physician's diagnosis.

Yet when Olympic athletes are tested for illegal drug use (i.e. doping), the results of a single test are used to ban the athlete from competition.

In Chance (Spring 2004), University of Texas biostatisticians D. A. Berry and L. Chastain demonstrated the application of Bayes's Rule for making inferences about testosterone abuse among Olympic athletes.

They used the following example.

Page 10

Example 2.1 (Drug testing in athletes)

In a population of 1000 athletes, suppose 100 are illegally using testosterone. Of the users, suppose 50 would test positive for testosterone. Of the nonusers, suppose 9 would test positive.

If an athlete tests positive for testosterone, use Bayes theorem to find the probability that the athlete is really doping.

Let θ = 1 be the event that the athlete is a user, and θ = 0 be a nonuser; and let x = 1 be the event of a positive test result. We need to consider the following questions.

What is the prior information for θ?

What is the observation?

What is the posterior distribution of θ after we get the data?
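The slide leaves these as questions; as a hedged numerical sketch, plugging the stated counts (100 users in 1000; 50 of the 100 users and 9 of the 900 nonusers testing positive) into Bayes theorem gives the following.

prior_user = 100 / 1000        # Pr(theta = 1), the prior probability of being a user
p_pos_user = 50 / 100          # Pr(x = 1 | theta = 1)
p_pos_nonuser = 9 / 900        # Pr(x = 1 | theta = 0)

numerator = p_pos_user * prior_user
denominator = numerator + p_pos_nonuser * (1 - prior_user)
posterior_user = numerator / denominator
print(posterior_user)          # roughly 0.847: most, but not all, positive tests come from users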


Page 12

Inference for continuous parameters and data

Consider a continuous parameter Θ, and observed data X. (Notice that both data and parameters are random variables.)

Our prior beliefs about the parameters define the prior distribution for Θ, specified by a density π(θ).

We formulate a model that defines the distribution of X given the parameters, i.e. we specify a density fX|Θ(x|θ).

This can be regarded as a function of θ when we have got some fixed observed data x, called the likelihood:

L(θ|x) = fX|Θ(x|θ).
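As an illustrative sketch (not from the slides; the data values and the N(θ, 1) model are assumptions), the likelihood can be coded as a function of the parameter with the observed data held fixed.

import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.7, 2.1, 1.5])     # some fixed observed data (illustrative)

def likelihood(theta, data=x):
    """L(theta|x) = f(x|theta), viewed as a function of theta for fixed x."""
    return np.prod(norm.pdf(data, loc=theta, scale=1.0))   # independent N(theta, 1) observations

likelihood(1.0)                         # evaluates the likelihood at theta = 1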

Page 13

Inference for continuous parameters and data

The prior and likelihood determine the full joint density over data and parameters:

fΘ,X(θ, x) = π(θ) L(θ|x).

Given the joint density we are then able to compute its marginals and conditionals:

fX(x) = ∫Θ fΘ,X(θ, x) dθ = ∫Θ π(θ) L(θ|x) dθ

and

fΘ|X(θ|x) = fΘ,X(θ, x) / fX(x) = π(θ) L(θ|x) / ∫Θ π(θ) L(θ|x) dθ.

fΘ|X(θ|x) is known as the posterior density, and is usually denoted π(θ|x).


Page 15

Inference for continuous parameters and data

This leads to the continuous version of Bayes theorem:

π(θ|x) = π(θ) L(θ|x) / ∫Θ π(θ) L(θ|x) dθ.

Now, the denominator is not a function of θ, so we can in fact write this as

π(θ|x) ∝ π(θ) L(θ|x)

where the constant of proportionality is chosen to ensure that the density integrates to one. Hence, the posterior is proportional to the prior times the likelihood.
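A one-parameter numerical sketch (illustrative prior, model and data, not from the slides): evaluate prior times likelihood on a grid and normalise so the posterior integrates to one.

import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.7, 2.1, 1.5])                    # fixed observed data (illustrative)
theta = np.linspace(-4.0, 6.0, 2001)                  # grid of parameter values
dt = theta[1] - theta[0]

prior = norm.pdf(theta, loc=0.0, scale=2.0)           # pi(theta): an assumed N(0, 2^2) prior
lik = np.array([np.prod(norm.pdf(x, loc=t, scale=1.0)) for t in theta])   # L(theta|x)

kernel = prior * lik                                  # pi(theta) L(theta|x), the unnormalised posterior
posterior = kernel / (kernel.sum() * dt)              # divide by a numerical estimate of the integral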


Page 17

Bayesian computation

That should be everything!

However, typically only the kernel can be written in closed form, that is

π(θ) L(θ|x).

The denominator has the form

∫Θ π(θ) L(θ|x) dθ

and this integral is often intractable.

We can try to evaluate this numerically, but

the integral may be very high dimensional, i.e. there might be many parameters, and

the support of Θ may be very complicated.

Similar problems arise when we want to marginalize or work out an expectation.
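One simple numerical option (a sketch under an assumed prior, model and data) is a Monte Carlo estimate of the denominator, using the fact that ∫Θ π(θ) L(θ|x) dθ is the expectation of L(θ|x) when θ is drawn from the prior.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.array([1.2, 0.7, 2.1, 1.5])                          # fixed data (illustrative)

def likelihood(theta, data=x):
    return np.prod(norm.pdf(data, loc=theta, scale=1.0))    # L(theta|x) for a N(theta, 1) model

theta_draws = rng.normal(loc=0.0, scale=2.0, size=10_000)   # theta_i drawn from an assumed N(0, 2^2) prior
evidence_hat = np.mean([likelihood(t) for t in theta_draws])  # Monte Carlo estimate of the denominator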


Page 19

Example 2.2 (Normal with unknown mean and variance)

Model: Xi|µ, τ ∼ N(µ, 1/τ).

Likelihood for a single observation:

L(µ, τ|xi) = f(xi|µ, τ) = √(τ/(2π)) exp{−(τ/2)(xi − µ)²}.

So for n independent observations, x = (x1, . . . , xn)^T,

L(µ, τ|x) = f(x|µ, τ) = ∏_{i=1}^n √(τ/(2π)) exp{−(τ/2)(xi − µ)²}.

Page 20

Example 2.2 (Normal with unknown mean and variance)

L(µ, τ|x) = (τ/(2π))^{n/2} exp{−(τ/2) Σ_{i=1}^n (xi − x̄ + x̄ − µ)²}
          ∝ τ^{n/2} exp{−(nτ/2) [s² + (x̄ − µ)²]}

where

x̄ = (1/n) Σ_{i=1}^n xi  and  s² = (1/n) Σ_{i=1}^n (xi − x̄)².
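The step above uses the identity Σ(xi − µ)² = n[s² + (x̄ − µ)²] (with s² defined with divisor n); a quick numerical check on made-up data confirms it.

import numpy as np

x = np.array([1.2, 0.7, 2.1, 1.5])     # illustrative data
mu = 0.3                                # any fixed value of mu
n, xbar = len(x), x.mean()
s2 = np.mean((x - xbar) ** 2)           # s^2 with divisor n, as defined on the slide

lhs = np.sum((x - mu) ** 2)
rhs = n * (s2 + (xbar - mu) ** 2)
print(np.isclose(lhs, rhs))             # True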

Page 21

Example 2.2 (Normal with unknown mean and variance)

Conjugate prior specification possible:

τ ∼ Gamma(g, h)

µ|τ ∼ N(b, 1/(cτ)).

Undesirable if our prior beliefs for µ and τ separate into independent specifications.

Take independent priors for the parameters:

τ ∼ Gamma(g, h)

µ ∼ N(b, 1/c).

Specification no longer conjugate, making analytic analysis intractable. Let us see why!


Page 25

Example 2.2 (Normal with unknown mean and variance)

π(µ) = √(c/(2π)) exp{−(c/2)(µ − b)²} ∝ exp{−(c/2)(µ − b)²}

π(τ) = (h^g / Γ(g)) τ^{g−1} exp{−hτ} ∝ τ^{g−1} exp{−hτ}

So

π(µ, τ) ∝ τ^{g−1} exp{−(c/2)(µ − b)² − hτ}

and we obtain

π(µ, τ|x) ∝ τ^{g−1} exp{−(c/2)(µ − b)² − hτ} × τ^{n/2} exp{−(nτ/2)[s² + (x̄ − µ)²]}
          = τ^{g+n/2−1} exp{−(nτ/2)[s² + (x̄ − µ)²] − (c/2)(µ − b)² − hτ}.


Page 27

Example 2.2 (Normal with unknown mean and variance)

We have the kernel of the posterior for µ and τ, but it is not in a standard form.

We can gain some idea of the likely values of (µ, τ) by plotting the bivariate surface (the integration constant isn't necessary for that; a sketch follows after these comments).

We cannot work out the posterior mean or variance, or the forms of the marginal posterior distributions for µ or τ, since we cannot integrate out the other variable.

There is nothing particularly special about the fact that the density represents a Bayesian posterior.

Given any complex non-standard probability distribution, we need ways to understand it, to calculate its moments, to compute its conditional and marginal distributions and their moments.

Stochastic simulation is one possible solution.
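As a hedged sketch (illustrative data and assumed hyperparameter values for g, h, b, c), the log of the kernel derived on page 25 can be evaluated over a grid of (µ, τ) values and contoured; the normalising constant is not needed for this.

import numpy as np

x = np.array([1.2, 0.7, 2.1, 1.5])               # illustrative data
n, xbar = len(x), x.mean()
s2 = np.mean((x - xbar) ** 2)
g, h, b, c = 2.0, 1.0, 0.0, 0.1                  # assumed prior hyperparameters

def log_kernel(mu, tau):
    """log pi(mu, tau | x) up to an additive constant."""
    return ((g + n / 2 - 1) * np.log(tau)
            - n * tau / 2 * (s2 + (xbar - mu) ** 2)
            - c / 2 * (mu - b) ** 2
            - h * tau)

mu_grid = np.linspace(-1.0, 3.0, 200)
tau_grid = np.linspace(0.01, 5.0, 200)
M, T = np.meshgrid(mu_grid, tau_grid)
Z = log_kernel(M, T)                             # surface to contour, e.g. with matplotlib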


Page 32

Tutorial questions

1. Describe a Monte Carlo algorithm to calculate the normalising constant in a Bayesian analysis.

2. Describe a weighted resampling algorithm for generating draws from a posterior π(θ|x) (known up to proportionality) using a proposal with density π(θ), i.e. the prior. What are the weights? How can the weights be used to estimate the unknown normalising constant?
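A possible sketch of question 2 (the model, prior and data below are illustrative assumptions): draw θ from the prior, weight each draw by its likelihood, normalise the weights and resample with those probabilities; the mean of the unnormalised weights also gives a Monte Carlo estimate of the normalising constant, which connects to question 1.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.array([1.2, 0.7, 2.1, 1.5])                  # illustrative data, N(theta, 1) model assumed
N = 10_000

theta = rng.normal(loc=0.0, scale=2.0, size=N)      # proposal draws theta_i from the prior, here N(0, 2^2)
logw = np.array([norm.logpdf(x, loc=t, scale=1.0).sum() for t in theta])   # log L(theta_i|x)

w = np.exp(logw)                                    # unnormalised weights w_i = L(theta_i|x)
probs = w / w.sum()                                 # normalised weights
draws = rng.choice(theta, size=N, replace=True, p=probs)   # approximate draws from pi(theta|x)

evidence_hat = w.mean()                             # estimates the normalising constant, the integral of pi(theta) L(theta|x)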