Chapter 5. Bayesian Statistics (II)


Transcript of Chapter 5. Bayesian Statistics (II)

Page 1: Chapter 5. Bayesian Statistics (II)

Chapter 5. Bayesian Statistics (II)

Page 2: Chapter 5. Bayesian Statistics (II)

Bayesian inference for multi-parameter models

The principle remains the same. The (joint) posterior distribution given data y is once again

p(θ|y) ∝ π(θ) · p(y|θ)

where θ = (θ1, . . . , θd) are the parameters of interest.

Page 3: Chapter 5. Bayesian Statistics (II)

For illustration, consider the special case of θ = (θ1, θ2).

1. The joint posterior distribution

p(θ1, θ2|y) ∝ π(θ1, θ2) · p(y|θ1, θ2)

2. The marginal posterior distribution of θ2

p(θ2|y) = ∫ p(θ1, θ2|y) dθ1 ∝ ∫ π(θ1, θ2) · p(y|θ1, θ2) dθ1

3. The conditional posterior distribution of θ1 given θ2 is

p(θ1|θ2, y) = p(θ1, θ2|y) / p(θ2|y) ∝ π(θ1, θ2) · p(y|θ1, θ2)

Note the difference from the joint posterior distribution: here θ2 is regarded as fixed and known.

Remark: The following relation is useful for simulating from the posterior distribution:

p(θ1, θ2|y) = p(θ1|θ2, y) · p(θ2|y)
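In code, the decomposition says: draw θ2 from p(θ2|y), then θ1 from p(θ1|θ2, y). Below is a minimal Python sketch; the bivariate normal posterior used here is a toy stand-in of our own choosing, not a model from these notes.

```python
import numpy as np

# Toy stand-in (illustrative, not from the notes): take the joint posterior of
# (theta1, theta2) to be bivariate normal with correlation rho, so that
#   theta2 | y         ~ N(0, 1)
#   theta1 | theta2, y ~ N(rho * theta2, 1 - rho^2)
rng = np.random.default_rng(0)
rho = 0.6

theta2 = rng.normal(0.0, 1.0, size=10_000)              # draws from p(theta2|y)
theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho**2))  # draws from p(theta1|theta2, y)

print(np.corrcoef(theta1, theta2)[0, 1])  # close to rho, as it should be
```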

Page 4: Chapter 5. Bayesian Statistics (II)

Examples

Normal model. Suppose that y = {y1, . . . , yn} are iid samples from N(θ, σ2) and that (θ, log(σ2)) has a flat prior, i.e.

π(θ, σ2) ∝ 1/σ2.

The joint posterior distribution p(θ, σ2|y).

p(θ, σ2|y) ∝ (σ2)^(−1−n/2) · exp{ −[n(θ − ȳ)2 + (n − 1)s2] / (2σ2) }

where ȳ is the sample mean and s2 is the sample variance

s2 = (1/(n − 1)) Σ_{i=1}^n (yi − ȳ)2.

Page 5: Chapter 5. Bayesian Statistics (II)

The marginal posterior distribution p(σ2|y).

p(σ2|y) = ∫ p(θ, σ2|y) dθ

∝ ∫ (σ2)^(−1−n/2) exp{ −[n(θ − ȳ)2 + (n − 1)s2] / (2σ2) } dθ

= (σ2)^(−1−n/2) exp{ −(n − 1)s2 / (2σ2) } · √(2πσ2/n)

∝ (σ2)^(−(n+1)/2) exp{ −(n − 1)s2 / (2σ2) }

where the θ-integral is a Gaussian kernel: ∫ exp{ −n(θ − ȳ)2/(2σ2) } dθ = √(2πσ2/n). It follows that

((n − 1)s2/σ2 | y) = χ2(n − 1)

Page 6: Chapter 5. Bayesian Statistics (II)

The marginal posterior distribution p(θ|y).

p(θ|y) = ∫ p(θ, σ2|y) dσ2

∝ ∫ (σ2)^(−1−n/2) exp{ −[n(θ − ȳ)2 + (n − 1)s2] / (2σ2) } dσ2

∝ [n(θ − ȳ)2 + (n − 1)s2]^(−n/2)

∝ [1 + ((θ − ȳ)/(s/√n))2 · 1/(n − 1)]^(−n/2)

where the σ2-integral is an inverse-gamma kernel, which contributes the bracket raised to the power −n/2. The last expression is, up to a constant, a t(n − 1) density in (θ − ȳ)/(s/√n). It follows that

((θ − ȳ)/(s/√n) | y) = t(n − 1)

Page 7: Chapter 5. Bayesian Statistics (II)

The conditional posterior distribution p(θ|σ2, y).

p(θ|σ2, y) = N(ȳ, σ2/n)

The conditional posterior distribution p(σ2|θ, y).

([(n − 1)s2 + n(ȳ − θ)2] / σ2 | θ, y) = χ2(n)

Remark: To simulate from the posterior distribution p(θ, σ2|y), one can first simulate σ2 from the marginal posterior distribution p(σ2|y), then simulate θ from the conditional posterior distribution p(θ|σ2, y).

Page 8: Chapter 5. Bayesian Statistics (II)

Example. Suppose a stock’s daily return Y was recorded for n = 22 consecutive business days, with ȳ = 5% and s = 4%. Assume that the daily return Y follows N(θ, σ2) with prior π(θ, σ2) ∝ 1/σ2. Find the 95% posterior interval for θ. Also use simulation to approximate E[θ/σ|y].

Solution: Since ((θ − ȳ)/(s/√n) | y) = t(n − 1), the 95% posterior interval is (in %)

ȳ ± t0.025(n − 1) · s/√n = 5 ± 2.080 · 4/√22 = [3.2, 6.8]

Below is the histogram of 1000 draws of θ/σ. For each draw, we (1) draw a sample of σ: draw a sample, say u, from χ2(n − 1), then let σ = √((n − 1)s2/u); (2) given σ, draw a sample θ from N(ȳ, σ2/n); (3) record θ/σ as a data point. The sample average of θ/σ is 1.23.
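The three-step recipe above takes only a few lines of code. Here is a minimal Python sketch (the seed and the choice of numpy are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, ybar, s = 22, 5.0, 4.0   # data summaries, in %
draws = 1000

u = rng.chisquare(n - 1, size=draws)           # u ~ chi^2(n-1)
sigma = np.sqrt((n - 1) * s**2 / u)            # sigma^2 | y from its marginal posterior
theta = rng.normal(ybar, sigma / np.sqrt(n))   # theta | sigma^2, y ~ N(ybar, sigma^2/n)

print((theta / sigma).mean())             # approximates E[theta/sigma | y]; the notes report 1.23
print(np.percentile(theta, [2.5, 97.5]))  # close to the theoretical interval [3.2, 6.8]
```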

Page 9: Chapter 5. Bayesian Statistics (II)

[Figure: histogram of the 1000 draws of θ/σ]

Page 10: Chapter 5. Bayesian Statistics (II)

Multinomial model. Let Y = (Y1, . . . , Yd) be multinomial with parameters (n; θ1, . . . , θd), where

θ1 + · · · + θd = 1.

Consider the prior distribution (a Dirichlet distribution)

π(θ) ∝ ∏_{i=1}^d θi^(αi−1)

restricted to non-negative θi’s with θ1 + · · · + θd = 1.

Page 11: Chapter 5. Bayesian Statistics (II)

The joint posterior distribution p(θ|y).

p(θ|y) ∝ π(θ) · p(y|θ) ∝ ∏_{i=1}^d θi^(αi−1) · ∏_{i=1}^d θi^(yi) = ∏_{i=1}^d θi^(αi+yi−1)

That is, p(θ|y) is a Dirichlet distribution with parameter (α1 + y1, . . . , αd + yd).

The marginal posterior distribution p(θ1|y).

p(θ1|y) ∝ ∫_{θ2+···+θd = 1−θ1} θ1^(α1+y1−1) ∏_{i=2}^d θi^(αi+yi−1) dθ2 · · · dθd−1

It follows that p(θ1|y) is Beta(α1 + y1, Σ_{i=2}^d [αi + yi]).

Page 12: Chapter 5. Bayesian Statistics (II)

The conditional posterior distribution p(θ2, . . . , θd|θ1, y).

p(θ2, . . . , θd|θ1, y) ∝ θ1^(α1+y1−1) ∏_{i=2}^d θi^(αi+yi−1)

restricted to {θ2 + · · · + θd = 1 − θ1}. It follows that

(θ2/(1 − θ1), . . . , θd/(1 − θ1) | θ1, y) = Dirichlet(α2 + y2, . . . , αd + yd).

Remark on simulation: One way to simulate (θ1, . . . , θd) from the posterior distribution is to simulate sequentially: θ1 from p(θ1|y), then θ2 from p(θ2|θ1, y), . . . , then θd−1 from p(θd−1|θ1, . . . , θd−2, y), and finally set θd = 1 − (θ1 + · · · + θd−1). Note that all of these conditional distributions are Beta distributions [after appropriate rescaling]. Another way to simulate (θ1, . . . , θd) from the posterior Dirichlet distribution is to simulate xi from Gamma(αi + yi, 1/2) for each i = 1, . . . , d and let θi = xi/(x1 + · · · + xd).
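In code, the Gamma recipe is a one-liner; below is a minimal Python sketch with an arbitrary illustrative parameter vector. numpy parameterizes the Gamma by shape and scale, so rate 1/2 means scale 2; any common scale cancels in the normalization. numpy's built-in Dirichlet sampler is included as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(2)
a = np.array([3.0, 5.0, 2.0])   # illustrative Dirichlet parameters (alpha_i + y_i)

# Gamma method: x_i ~ Gamma(shape = a_i, rate = 1/2), theta_i = x_i / sum_j x_j.
x = rng.gamma(shape=a, scale=2.0, size=(10_000, 3))
theta = x / x.sum(axis=1, keepdims=True)

print(theta.mean(axis=0))                     # close to a / a.sum()
print(rng.dirichlet(a, 10_000).mean(axis=0))  # numpy's built-in sampler agrees
```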

Page 13: Chapter 5. Bayesian Statistics (II)

Example. In late October 1988, a pre-election poll was conducted by CBS News of 1447 adults in the US to find out their preferences in the upcoming Presidential election. Out of the 1447 persons, y1 = 727 supported George Bush, y2 = 583 supported Michael Dukakis, and y3 = 137 supported other candidates or expressed no opinion. Assume that the samples are randomly selected from the population; then the data follow a multinomial distribution with parameters (θ1, θ2, θ3). The quantity of interest is θ1 − θ2.

Solution: Assume a non-informative prior with α1 = α2 = α3 = 1. The posterior distribution for (θ1, θ2, θ3) is Dirichlet(728, 584, 138). We will draw 1000 samples of (θ1, θ2, θ3) from the posterior Dirichlet distribution and compute θ1 − θ2 for each sample. We will simulate using two equivalent approaches.

• Using the conditional distribution decomposition. Simulate θ1 from Beta(728, 584 + 138). Given θ1, simulate u from Beta(584, 138) and let θ2 = (1 − θ1)u. Let θ3 = 1 − θ1 − θ2. Record θ1 − θ2.

Page 14: Chapter 5. Bayesian Statistics (II)

• Using the Gamma distribution. Simulate independent x1, x2, x3 from, respectively, Gamma(728, 1/2) = χ2(2 · 728), Gamma(584, 1/2) = χ2(2 · 584), and Gamma(138, 1/2) = χ2(2 · 138). Let θi = xi/(x1 + x2 + x3). Record θ1 − θ2.

The histograms are attached below; the sample means are 0.099 and 0.100, respectively. None of the sample points of θ1 − θ2 falls below zero.
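For reference, here is a minimal Python sketch of the decomposition approach, using the poll's posterior Dirichlet(728, 584, 138); the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
draws = 1000

theta1 = rng.beta(728, 584 + 138, size=draws)  # theta1 | y ~ Beta(728, 722)
u = rng.beta(584, 138, size=draws)             # theta2/(1-theta1) | theta1, y ~ Beta(584, 138)
theta2 = (1 - theta1) * u

diff = theta1 - theta2
print(diff.mean())        # about 0.10, in line with the sample means reported above
print((diff < 0).mean())  # essentially 0: no draws favor Dukakis
```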

Page 15: Chapter 5. Bayesian Statistics (II)

[Figure: histograms of the 1000 draws of θ1 − θ2; left panel: using decomposition, right panel: using the Gamma distribution]

Page 16: Chapter 5. Bayesian Statistics (II)

Comparison of two populations

Comparison of two proportions. Suppose Y1 has distribution B(n1; θ1), Y2 has distribution B(n2; θ2), and Y1 and Y2 are independent. We are interested in θ1 − θ2, given the data Y1 = y1 and Y2 = y2.

Assume a non-informative prior π(θ1, θ2) ∝ 1 on [0, 1]2. The joint posterior distribution p(θ1, θ2|y) is

p(θ1, θ2|y) ∝ θ1^(y1) (1 − θ1)^(n1−y1) · θ2^(y2) (1 − θ2)^(n2−y2)

Thus the posterior distributions of θ1 and θ2 are independent and

p(θ1|y) = Beta(y1 + 1, n1 − y1 + 1)

p(θ2|y) = Beta(y2 + 1, n2 − y2 + 1)

One can use simulation to draw samples of θ1 − θ2, or use a normal approximation (when n1 and n2 are large) for θ1 − θ2.

Page 17: Chapter 5. Bayesian Statistics (II)

Comparison of two normal means. Suppose x = (x1, . . . , xn1) are iid samples from N(θ1, σ2), y = (y1, . . . , yn2) are iid samples from N(θ2, σ2), and that the two samples are independent. We are interested in θ1 − θ2. All the parameters (θ1, θ2, σ) are unknown.

Assume a non-informative prior π(θ1, θ2, σ2) ∝ 1/σ2. The posterior is

p(θ1, θ2, σ|x, y) ∝ (σ2)^(−1−n/2) exp{ −[n1(x̄ − θ1)2 + n2(ȳ − θ2)2 + (n − 2)s2p] / (2σ2) }

where

n = n1 + n2, s2p = [(n1 − 1)s2x + (n2 − 1)s2y] / [(n1 − 1) + (n2 − 1)]

Analogously, one has the marginal posterior distribution

p(σ2|x, y) ∝ (σ2)^(−n/2) exp{ −(n − 2)s2p / (2σ2) }

Page 18: Chapter 5. Bayesian Statistics (II)

or

((n − 2)s2p / σ2 | x, y) = χ2(n − 2).

The conditional posterior distributions of θ1 and θ2 given σ are independent, and

p(θ1|σ, x, y) = N(x̄, σ2/n1), p(θ2|σ, x, y) = N(ȳ, σ2/n2).

Remark on simulation. To draw samples of (θ1, θ2, σ), one can draw u from χ2(n − 2) and let σ2 = (n − 2)s2p/u, then draw θ1 and θ2 independently from N(x̄, σ2/n1) and N(ȳ, σ2/n2), respectively. If one is interested in θ1 − θ2, compute θ1 − θ2 for each sample point of (θ1, θ2, σ); if one is interested in θ1θ2, compute θ1θ2 for each sample point; and so forth.

The theoretical posterior distribution of θ1 − θ2 can be obtained as follows. Note that the conditional posterior distribution of θ1 − θ2

Page 19: Chapter 5. Bayesian Statistics (II)

given σ is

p(θ1 − θ2|σ, x, y) = N(x̄ − ȳ, σ2[1/n1 + 1/n2]).

Therefore

p(θ1 − θ2, σ2|x, y) = p(θ1 − θ2|σ2, x, y) · p(σ2|x, y)

∝ (σ2)^(−(n+1)/2) exp{ −[(1/n1 + 1/n2)^(−1) ((θ1 − θ2) − (x̄ − ȳ))2 + (n − 2)s2p] / (2σ2) }

Integrating out σ2, we similarly have

( [(θ1 − θ2) − (x̄ − ȳ)] / [sp · √(1/n1 + 1/n2)] | x, y ) = t(n − 2)

Page 20: Chapter 5. Bayesian Statistics (II)

Example. Who is a better hitter, Ted Williams (Boston Red Sox) or Joe DiMaggio (NY Yankees)? Their major league career statistics are given below.

Player   At-bats   Hits   Batting Average   Home Runs   Home Run Average
T.W.     7706      2654   .3444             521         .0676
J.D.     6821      2214   .3246             361         .0529

Find the posterior probability that Ted Williams is a better hitter than Joe DiMaggio.

Solution: We consider the hits, and leave the home runs as an exercise. Let θ1 be the hit proportion for T.W. and θ2 that of J.D. Assume a non-informative prior π(θ1, θ2) ∝ 1. Then the posterior is

p(θ1, θ2|y) ∝ θ1^2654 (1 − θ1)^5052 · θ2^2214 (1 − θ2)^4607

We are interested in P(θ1 − θ2 > 0|y). We simulate 1000 draws of θ1 − θ2 [we simulate θ1 and θ2 independently from Beta(2655, 5053) and Beta(2215, 4608), respectively, and compute θ1 − θ2 for each (θ1, θ2)].
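A minimal Python sketch of this simulation (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
draws = 1000

theta1 = rng.beta(2654 + 1, 5052 + 1, size=draws)  # T.W.: posterior Beta(2655, 5053)
theta2 = rng.beta(2214 + 1, 4607 + 1, size=draws)  # J.D.: posterior Beta(2215, 4608)

# approximates P(theta1 - theta2 > 0 | y); about 0.99
print(((theta1 - theta2) > 0).mean())
```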

Page 21: Chapter 5. Bayesian Statistics (II)

Below is the histogram of θ1 − θ2. Among the 1000 draws, 995 are positive. Therefore the posterior probability is P(θ1 − θ2 > 0|y) ≈ 0.995.

[Figure: histogram of the 1000 draws of θ1 − θ2 (T.W. − J.D.)]

If we use a normal approximation, θ1 − θ2 is approximately distributed as

N( 2654/(2654 + 5052) − 2214/(2214 + 4607),
2654 · 5052/[(2654 + 5052)^2 (2654 + 5052 + 1)] + 2214 · 4607/[(2214 + 4607)^2 (2214 + 4607 + 1)] )

= N(0.0198, 0.0078^2).

Its density is superimposed on the histogram.

Page 22: Chapter 5. Bayesian Statistics (II)

Example. Does birth weight increase when a mother quits smoking? Below is a data set.

Smokes                      Quit
4.5  6.1  6.9  7.5  9.9     5.4  7.2
5.4  6.4  6.9  7.6          6.6  7.3
5.6  6.6  7.1  7.6          6.8  7.4
5.9  6.6  7.1  7.8          6.8
6.0  6.6  7.2  8.0          6.9

Assume the birth weight of a baby whose mother smokes is N(θ1, σ2) and the birth weight of a baby whose mother once smoked but quit is N(θ2, σ2). Find the posterior probability that θ1 − θ2 > 0, and give a 95% posterior interval for θ1 − θ2.

Solution: The data are n1 = 21, n2 = 8, and (for smokes) x̄ = 6.824, sx = 1.093,

Page 23: Chapter 5. Bayesian Statistics (II)

(for quit) ȳ = 6.800, sy = 0.589. The pooled estimate is

s2p = [(n1 − 1)s2x + (n2 − 1)s2y] / (n1 + n2 − 2) = 0.9749, sp = 0.987

To simulate θ1 − θ2, we first draw u from χ2(n − 2) and let σ2 = (n − 2)s2p/u, and then simulate θ1 and θ2 independently from N(x̄, σ2/n1) and N(ȳ, σ2/n2). The histogram of 1000 draws is below. The 95% posterior interval from simulation is [−0.807, 0.863]. Out of these 1000 draws of θ1 − θ2, 499 are positive, so the posterior probability that θ1 − θ2 > 0 is approximately 0.499.
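A minimal Python sketch of this simulation, with the theoretical t(n − 2) answers (derived on page 19) as a cross-check; the seed is arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

n1, n2 = 21, 8
xbar, ybar, sp = 6.824, 6.800, 0.987
n = n1 + n2
draws = 1000

u = rng.chisquare(n - 2, size=draws)
sigma2 = (n - 2) * sp**2 / u                     # sigma^2 | x, y
theta1 = rng.normal(xbar, np.sqrt(sigma2 / n1))  # theta1 | sigma^2, x, y
theta2 = rng.normal(ybar, np.sqrt(sigma2 / n2))  # theta2 | sigma^2, x, y
diff = theta1 - theta2

print(np.percentile(diff, [2.5, 97.5]))  # simulated 95% posterior interval
print((diff > 0).mean())                 # simulated P(theta1 - theta2 > 0 | x, y)

# Theoretical answers from the t(n-2) posterior of the standardized difference:
se = sp * np.sqrt(1 / n1 + 1 / n2)
t975 = stats.t.ppf(0.975, n - 2)
print([xbar - ybar - t975 * se, xbar - ybar + t975 * se])  # about [-0.82, 0.87]
print(stats.t.sf(-(xbar - ybar) / se, n - 2))              # about 0.52
```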

Note that, theoretically,

( [(θ1 − θ2) − (x̄ − ȳ)] / [sp √(1/n1 + 1/n2)] | x, y ) = t(n − 2).

Therefore the theoretical 95% posterior interval is

(x̄ − ȳ) ± t0.025(n − 2) · sp √(1/n1 + 1/n2) = [−0.818, 0.866]

Page 24: Chapter 5. Bayesian Statistics (II)

and

P(θ1 − θ2 > 0|x, y) = P[ t(n − 2) ≥ −(x̄ − ȳ) / (sp √(1/n1 + 1/n2)) ] = 0.523.

[Figure: histogram of the 1000 draws of θ1 − θ2 (Smokes − Quit)]

Page 25: Chapter 5. Bayesian Statistics (II)

An example of a generalized linear model

It is rare that multiparameter models allow simple calculation of the posterior distribution. Simulation is often the only available tool for data analysis. In this section we discuss in detail a two-parameter generalized linear model for a bioassay experiment.

The problem and the data. In the development of drugs, acute toxicity tests, or bioassays, are commonly performed on animals. The animal responses are typically dichotomous: alive or dead, tumor or no tumor, and so on. The experiments are often administered by injecting various dose levels of the compound into batches of animals, which generates data of the form (xi, ni, yi), where xi is the dose level (often measured on a logarithmic scale), ni is the size of the batch of animals receiving dose xi, and yi is the number of animals with a positive response. The specific real data set is shown below.

Page 26: Chapter 5. Bayesian Statistics (II)

Dose xi (log g/ml)   Size of batch ni   Number of deaths yi
−0.86                5                  0
−0.30                5                  1
−0.05                5                  3
 0.73                5                  5

Statistical model. Assume that yi is Binomial(ni, θi), with θi the population death rate for animals receiving dose xi. We would like θi to depend on xi, and by definition θi ∈ [0, 1]. The following logistic regression model is adopted:

logit(θi) = α + βxi

where logit(θ) := log(θ/(1 − θ)). The inverse function of logit(·) is

logit−1(u) = e^u/(1 + e^u).

Note that in this model the xi’s are explanatory variables and are regarded as fixed.

Prior and likelihood. We use a flat prior π(α, β) ∝ 1 and the likelihood

p(yi|α, β) ∝ [logit−1(α + βxi)]^(yi) · [1 − logit−1(α + βxi)]^(ni−yi).
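In code, the link and the likelihood might look as follows; this is a minimal sketch, and the function names are ours, not from the notes:

```python
import numpy as np

def inv_logit(u):
    # logit^{-1}(u) = e^u / (1 + e^u)
    return np.exp(u) / (1.0 + np.exp(u))

def log_likelihood(alpha, beta, x, n, y):
    # log of prod_i theta_i^{y_i} (1 - theta_i)^{n_i - y_i}
    # with theta_i = inv_logit(alpha + beta * x_i)
    theta = inv_logit(alpha + beta * x)
    return np.sum(y * np.log(theta) + (n - y) * np.log(1.0 - theta))

x = np.array([-0.86, -0.30, -0.05, 0.73])  # dose levels
n = np.array([5, 5, 5, 5])                 # batch sizes
y = np.array([0, 1, 3, 5])                 # deaths

# log posterior (up to a constant, under the flat prior) at an arbitrary point
print(log_likelihood(1.0, 10.0, x, n, y))
```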

Page 27: Chapter 5. Bayesian Statistics (II)

The posterior p(α, β|y). We have

p(α, β|y) ∝ π(α, β) · ∏_{i=1}^4 p(yi|α, β) ∝ ∏_{i=1}^4 p(yi|α, β)

Discretization of the posterior distribution. There is no analytical expression for the posterior distribution, and we will use simulation to obtain numerical summaries. Since the problem is only two-dimensional, it is reasonable to expect that simulating from a discretized approximation of the continuous posterior distribution will do a good job. We restrict the region to (α, β) ∈ [−2, 6] × [−5, 30]. The contour plot is shown below.

The discretization is done on a uniform 400 × 700 grid. For each grid point, we compute the unnormalized posterior density. Afterwards we normalize these quantities so that their sum over all grid points equals one. In other words, we now have a discrete approximation of the posterior distribution.

Remark. A very popular methodology for simulating from the posterior distribution is the so-called Markov chain Monte Carlo (MCMC) method. It is very different from the discretization method used in this example. When the dimension

Page 28: Chapter 5. Bayesian Statistics (II)

gets higher, discretization obviously becomes much more difficult.

[Figure 1: contour plot of the posterior distribution, with alpha on the horizontal axis and beta on the vertical axis]

Page 29: Chapter 5. Bayesian Statistics (II)

Simulating from the discrete approximation of the posterior distribution.

1. Draw α from its discrete marginal distribution p(α|y).

2. Given α, draw β from the discrete conditional distribution p(β|α, y).

3. Jitter the sampled α and β by adding a uniform random perturbation centered at zero with a width equal to the spacing of the sampling grid.

4. Repeat these three steps 1000 times to obtain 1000 samples of (α, β).

The histogram is attached below

The quantities of interest. The sign of β is important: for all 1000 samples we have β > 0, which indicates that the compound is harmful. Another quantity of interest is LD50, the dose level at which the probability of death is 50%, i.e.

α + β · LD50 = logit(0.5) = 0 ⇒ LD50 = −α/β.

The histogram of LD50 is attached.
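Putting the pieces together, here is a minimal Python sketch of the whole pipeline: grid discretization, sampling with jitter, and LD50. The grid and sample sizes follow the text; everything else (seed, implementation details) is one plausible implementation rather than the original code.

```python
import numpy as np

rng = np.random.default_rng(6)

# Bioassay data: dose (log g/ml), batch size, deaths.
x = np.array([-0.86, -0.30, -0.05, 0.73])
n = np.array([5, 5, 5, 5])
y = np.array([0, 1, 3, 5])

# Uniform 400 x 700 grid over (alpha, beta) in [-2, 6] x [-5, 30].
alpha = np.linspace(-2, 6, 400)
beta = np.linspace(-5, 30, 700)
da, db = alpha[1] - alpha[0], beta[1] - beta[0]

# Unnormalized log posterior on the grid (flat prior + binomial-logit likelihood).
eta = alpha[:, None, None] + beta[None, :, None] * x      # shape (400, 700, 4)
logp = (y * eta - n * np.log1p(np.exp(eta))).sum(axis=2)
post = np.exp(logp - logp.max())
post /= post.sum()   # discrete approximation of p(alpha, beta | y)

# Sample alpha from its marginal, beta from the conditional given alpha, then jitter.
draws = 1000
p_alpha = post.sum(axis=1)
p_alpha /= p_alpha.sum()
ia = rng.choice(alpha.size, size=draws, p=p_alpha)
samples = np.empty((draws, 2))
for k, i in enumerate(ia):
    jb = rng.choice(beta.size, p=post[i] / post[i].sum())
    samples[k] = (alpha[i] + rng.uniform(-da / 2, da / 2),
                  beta[jb] + rng.uniform(-db / 2, db / 2))

a, b = samples[:, 0], samples[:, 1]
print((b > 0).mean())                          # fraction of draws with beta > 0 (expected: 1.0)
print(np.percentile(-a / b, [2.5, 50, 97.5]))  # posterior summary of LD50 = -alpha/beta
```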

Page 30: Chapter 5. Bayesian Statistics (II)

[Figure: scatter plot of the 1000 posterior draws of (alpha, beta), and histogram of the corresponding draws of LD50]