
  • Chapter 5. Bayesian Statistics (II)

  • Bayesian for multi-parameter models

    The principle remains the same. The (joint) posterior distribution given data y is once again

    p(θ|y) ∝ π(θ) · p(y|θ)

    where θ = (θ1, . . . , θd) are the parameters of interest.

  • For illustration, consider the special case of θ = (θ1, θ2).

    1. The joint posterior distribution

    p(θ1, θ2|y) ∝ π(θ1, θ2) · p(y|θ1, θ2)

    2. The marginal posterior distribution of θ2

    p(θ2|y) = ∫ p(θ1, θ2|y) dθ1 ∝ ∫ π(θ1, θ2) · p(y|θ1, θ2) dθ1

    3. The conditional posterior distribution of θ1 given θ2 is

    p(θ1|θ2, y) = p(θ1, θ2|y) / p(θ2|y) ∝ π(θ1, θ2) · p(y|θ1, θ2)

    Note the difference with the joint posterior distribution: here θ2 is regarded as fixed and known.

    Remark: The following relation is useful for the simulation of the posterior distribution:

    p(θ1, θ2|y) = p(θ1|θ2, y) · p(θ2|y)
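    To make the remark concrete, here is a minimal Python sketch of the two-stage sampling it suggests. The bivariate-normal target and all variable names are illustrative assumptions, not from the notes; the point is only that drawing θ2 from its marginal and then θ1 from the conditional yields draws from the joint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy posterior: (theta1, theta2) standard bivariate normal with correlation rho.
# Then theta2 ~ N(0, 1) marginally, and theta1 | theta2 ~ N(rho*theta2, 1 - rho**2).
rho = 0.6
n_draws = 1000

theta2 = rng.normal(0.0, 1.0, size=n_draws)             # draw from p(theta2 | y)
theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho**2))  # draw from p(theta1 | theta2, y)

# (theta1, theta2) are now joint draws from p(theta1, theta2 | y)
print(np.corrcoef(theta1, theta2)[0, 1])                # should be close to rho
```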

  • Examples

    Normal model. Suppose that y = {y1, . . . , yn} are iid samples from N(θ, σ²) such that (θ, log(σ²)) has a flat prior, or equivalently

    π(θ, σ²) ∝ 1/σ².

    The joint posterior distribution p(θ, σ²|y):

    p(θ, σ²|y) ∝ (σ²)^(−1−n/2) · exp(−[n(θ − ȳ)² + (n − 1)s²] / (2σ²))

    where s² is the sample variance

    s² = (1/(n − 1)) · Σ_{i=1}^n (yi − ȳ)².

  • The marginal posterior distribution p(σ²|y).

    p(σ²|y) = ∫ p(θ, σ²|y) dθ
            ∝ ∫ (σ²)^(−1−n/2) · exp(−[n(θ − ȳ)² + (n − 1)s²] / (2σ²)) dθ
            = (σ²)^(−1−n/2) · exp(−(n − 1)s² / (2σ²)) · √(2πσ²/n)
            ∝ (σ²)^(−(n+1)/2) · exp(−(n − 1)s² / (2σ²))

    It follows that the posterior distribution of ((n − 1)s²/σ² | y) is χ²(n − 1).

  • The marginal posterior distribution p(θ|y).

    p(θ|y) = ∫ p(θ, σ²|y) dσ²
           ∝ ∫ (σ²)^(−1−n/2) · exp(−[n(θ − ȳ)² + (n − 1)s²] / (2σ²)) dσ²
           ∝ [n(θ − ȳ)² + (n − 1)s²]^(−n/2)
           ∝ [1 + ((θ − ȳ)/(s/√n))² · 1/(n − 1)]^(−n/2)

    It follows that the posterior distribution of ((θ − ȳ)/(s/√n) | y) is t(n − 1).

  • The conditional posterior distribution p(θ|σ², y).

    p(θ|σ², y) = N(ȳ, σ²/n)

    The conditional posterior distribution p(σ²|θ, y):

    (([(n − 1)s² + n(ȳ − θ)²] / σ²) | θ, y) is χ²(n)

    Remark: To simulate from the posterior distribution p(θ, σ²|y), one can first simulate σ² from the marginal posterior distribution p(σ²|y), then simulate θ from the conditional posterior distribution p(θ|σ², y).

  • Example. Suppose a stock’s daily return Y was recorded for n = 22 consecutive business days, with ȳ = 5% and s = 4%. Assume that the daily return Y follows N(θ, σ²) with prior π(θ, σ²) ∝ 1/σ². Find the 95% posterior interval for θ. Also use simulation to approximate E[θ/σ|y].

    Solution: Since ((θ − ȳ)/(s/√n) | y) is t(n − 1), the 95% posterior interval is (in %)

    ȳ ± t0.025(n − 1) · s/√n = 5 ± 2.080 · 4/√22 = [3.2, 6.8]

    For each draw of θ/σ, we (1) draw a sample of σ: draw a sample, say u, from χ²(n − 1), then let σ = √((n − 1)s²/u); (2) given σ, draw a sample θ from N(ȳ, σ²/n); (3) record θ/σ as a data point. The histogram of 1000 such draws is shown below; the sample average of θ/σ is 1.23.
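    A minimal numpy/scipy sketch of this recipe, using the summaries n = 22, ȳ = 5, s = 4 from the example (variable names are mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, ybar, s = 22, 5.0, 4.0          # data summaries from the example (in %)
n_draws = 1000

# (1) draw sigma: u ~ chi^2(n-1), then sigma^2 = (n-1) s^2 / u
u = rng.chisquare(n - 1, size=n_draws)
sigma = np.sqrt((n - 1) * s**2 / u)

# (2) given sigma, draw theta ~ N(ybar, sigma^2 / n)
theta = rng.normal(ybar, sigma / np.sqrt(n))

# (3) record theta / sigma
print("E[theta/sigma | y] ~", (theta / sigma).mean())   # about 1.2

# 95% posterior interval for theta from the t(n-1) result
half = stats.t.ppf(0.975, n - 1) * s / np.sqrt(n)
print("95% interval:", (ybar - half, ybar + half))      # about (3.2, 6.8)
```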

    [Figure: histogram of the 1000 draws of θ/σ.]

  • Multinomial model. Let Y = (Y1, . . . , Yd) be multinomial with parameter (n; θ1, . . . , θd) where

    θ1 + · · · + θd = 1.

    Consider the prior distribution (Dirichlet distribution)

    π(θ) ∝ ∏_{i=1}^d θi^(αi−1)

    restricted to non-negative θi’s with θ1 + · · · + θd = 1.

  • The joint posterior distribution p(θ|y).

    p(θ|y) ∝ π(θ) · p(y|θ) ∝ ∏_{i=1}^d θi^(αi−1) · ∏_{i=1}^d θi^(yi) = ∏_{i=1}^d θi^(αi+yi−1)

    That is, p(θ|y) is a Dirichlet distribution with parameter (α1 + y1, . . . , αd + yd).

    The marginal posterior distribution p(θ1|y).

    p(θ1|y) ∝ ∫_{Σ_{i=2}^d θi = 1−θ1} θ1^(α1+y1−1) · ∏_{i=2}^d θi^(αi+yi−1) dθ2 · · · dθ_{d−1}

    It follows that p(θ1|y) is Beta(α1 + y1, Σ_{i=2}^d [αi + yi]).

  • The conditional posterior distribution p(θ2, . . . , θd|θ1, y).

    p(θ2, . . . , θd|θ1, y) ∝ θ1^(α1+y1−1) · ∏_{i=2}^d θi^(αi+yi−1)

    restricted to {θ2 + · · · + θd = 1 − θ1}. It follows that

    ((θ2/(1 − θ1), . . . , θd/(1 − θ1)) | θ1, y) is Dirichlet(α2 + y2, . . . , αd + yd).

    Remark on simulation: One way to simulate (θ1, . . . , θd) from the posterior distribution is to simulate sequentially: θ1 from p(θ1|y), then θ2 from p(θ2|θ1, y), . . . , then θ_{d−1} from p(θ_{d−1}|θ1, . . . , θ_{d−2}, y), and finally set θd = 1 − (θ1 + · · · + θ_{d−1}). Note that all these conditional distributions are Beta distributions [up to a multiplicative constant]. Another way to simulate (θ1, . . . , θd) from the posterior Dirichlet distribution is to simulate xi from Gamma(αi + yi, 1/2) for each i = 1, . . . , d and let θi = xi/(x1 + · · · + xd).

  • Example. In late October 1988, a pre-election poll was conducted by CBS News of 1447 adults in the US to find out their preferences in the upcoming Presidential election. Out of the 1447 persons, y1 = 727 supported George Bush, y2 = 583 supported Michael Dukakis, and y3 = 137 supported other candidates or expressed no opinion. Assuming that the samples are randomly selected from the population, the data follow a multinomial distribution with parameters (n = 1447; θ1, θ2, θ3). The quantity of interest is θ1 − θ2.

    Solution: Assume a non-informative prior with α1 = α2 = α3 = 1. The posterior distribution for (θ1, θ2, θ3) is then Dirichlet(728, 584, 138). We will draw 1000 samples of (θ1, θ2, θ3) from the posterior Dirichlet distribution and compute θ1 − θ2 for each sample. We will simulate using two equivalent approaches (a code sketch follows below).

    • Using the conditional distribution decomposition. Simulate θ1 from Beta(728, 584 + 138). Given θ1, simulate u from Beta(584, 138) and let θ2 = (1 − θ1)u. Let θ3 = 1 − θ1 − θ2. Record θ1 − θ2.

    • Using the Gamma distribution. Simulate independent x1, x2, x3 from, respectively, Gamma(728, 1/2) = χ²(728 · 2), Gamma(584, 1/2) = χ²(584 · 2), and Gamma(138, 1/2) = χ²(138 · 2). Let θi = xi/(x1 + x2 + x3). Record θ1 − θ2.

    The histograms are attached below; the sample means are 0.099 and 0.100, respectively. None of the sample points of θ1 − θ2 are below zero.
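    A minimal numpy sketch of both approaches (variable names are mine; numpy's built-in rng.dirichlet would give the same result directly):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 1000

# Method 1: conditional (Beta) decomposition
theta1 = rng.beta(728, 584 + 138, size=n_draws)
u = rng.beta(584, 138, size=n_draws)
theta2 = (1 - theta1) * u
theta3 = 1 - theta1 - theta2     # not needed for theta1 - theta2; mirrors the text
diff_decomp = theta1 - theta2

# Method 2: normalized Gamma draws; the common scale cancels in the
# normalization, so scale 2 (rate 1/2, the chi-square form) is arbitrary
x = rng.gamma(shape=np.array([728, 584, 138]), scale=2.0, size=(n_draws, 3))
theta = x / x.sum(axis=1, keepdims=True)
diff_gamma = theta[:, 0] - theta[:, 1]

print(diff_decomp.mean(), diff_gamma.mean())                # both near 0.10
print((diff_decomp < 0).mean(), (diff_gamma < 0).mean())    # essentially 0
```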

    [Figure: histograms of θ1 − θ2 from the two methods, titled "Use decomposition" and "Use Gamma distribution".]

  • Comparison of two populations

    Comparison of two proportions. Suppose Y1 has distribution B(n1; θ1), Y2 has distribution B(n2; θ2), and Y1 and Y2 are independent. We are interested in θ1 − θ2, given the data Y1 = y1 and Y2 = y2.

    Assume a non-informative prior π(θ1, θ2) ∝ 1 on [0, 1]². The joint posterior distribution p(θ1, θ2|y) is

    p(θ1, θ2|y) ∝ θ1^(y1) (1 − θ1)^(n1−y1) · θ2^(y2) (1 − θ2)^(n2−y2)

    Thus the posterior distributions of θ1 and θ2 are independent, and

    p(θ1|y) = Beta(y1 + 1, n1 − y1 + 1), p(θ2|y) = Beta(y2 + 1, n2 − y2 + 1)

    One can use simulation to draw samples of θ1 − θ2, or use normal approximations (when n1 and n2 are large) of θ1 − θ2.

  • Comparison of two normal means. Suppose x = (x1, . . . , x_{n1}) are iid samples from N(θ1, σ²), y = (y1, . . . , y_{n2}) are iid samples from N(θ2, σ²), and that the two samples are independent. We are interested in θ1 − θ2. All the parameters (θ1, θ2, σ) are unknown.

    Assume a non-informative prior π(θ1, θ2, σ²) ∝ 1/σ². The posterior is

    p(θ1, θ2, σ|x, y) ∝ (σ²)^(−1−n/2) · exp(−[n1(x̄ − θ1)² + n2(ȳ − θ2)² + (n − 2)sp²] / (2σ²))

    where

    n = n1 + n2, sp² = [(n1 − 1)sx² + (n2 − 1)sy²] / [(n1 − 1) + (n2 − 1)]

    Analogously, one has the marginal posterior distribution

    p(σ²|x, y) ∝ (σ²)^(−n/2) · exp(−(n − 2)sp² / (2σ²))

    or, equivalently,

    (((n − 2)sp²/σ²) | x, y) is χ²(n − 2).

    The conditional posterior distributions of θ1, θ2 given σ are independent, and

    p(θ1|σ, x, y) = N(x̄, σ²/n1), p(θ2|σ, x, y) = N(ȳ, σ²/n2).

    Remark on simulation: To draw samples of (θ1, θ2, σ), one can draw u from χ²(n − 2) and let σ² = (n − 2)sp²/u, then draw θ1, θ2 independently from N(x̄, σ²/n1) and N(ȳ, σ²/n2), respectively. If one is interested in θ1 − θ2, for each sample point of (θ1, θ2, σ) compute θ1 − θ2. If one is interested in θ1θ2, for each sample point compute θ1θ2. And so on and so forth.

    The theoretical posterior distribution of θ1 − θ2 can be obtained as follows. Note that the conditional posterior distribution of θ1 − θ2 given σ is

    p(θ1 − θ2|σ, x, y) = N(x̄ − ȳ, σ²[1/n1 + 1/n2]).

    Therefore

    p(θ1 − θ2, σ²|x, y) = p(θ1 − θ2|σ², x, y) · p(σ²|x, y)
    ∝ (σ²)^(−(n+1)/2) · exp(−[(1/n1 + 1/n2)^(−1) ((θ1 − θ2) − (x̄ − ȳ))² + (n − 2)sp²] / (2σ²))

    Integrating out σ², we have similarly

    (((θ1 − θ2) − (x̄ − ȳ)) / (sp · √(1/n1 + 1/n2)) | x, y) is t(n − 2)

  • Example. Who is a better hitter, Ted Williams (Boston Red Sox) or Joe DiMaggio (NY Yankees)? Their major league career statistics are given below.

    Player   At-bats   Hits   Batting Average   Home Runs   Home Run Average
    T.W.     7706      2654   .3444             521         .0676
    J.D.     6821      2214   .3246             361         .0529

    Find the posterior probability that Ted Williams is a better hitter than Joe DiMaggio.

    Solution: We consider the hits, and leave the home runs as an exercise. Let θ1 be the hit proportion for T.W. and θ2 that for J.D. Assume a non-informative prior π(θ1, θ2) ∝ 1. Then the posterior is

    p(θ1, θ2|y) ∝ θ1^2654 (1 − θ1)^5052 · θ2^2214 (1 − θ2)^4607

    We are interested in P(θ1 − θ2 > 0|y). We simulate 1000 draws of θ1 − θ2 [we simulate θ1 and θ2 independently from Beta(2655, 5053) and Beta(2215, 4608), respectively, and compute θ1 − θ2 for each (θ1, θ2)].

  • Below is the histogram of θ1 − θ2. Among the 1000 draws, 995 are positive. Therefore the posterior probability P(θ1 − θ2 > 0|y) ≈ 0.995.

    [Figure: histogram of the 1000 draws of θ1 − θ2 (T.W. − J.D.).]

    If we use the normal approximation, θ1 − θ2 is approximately distributed as

    N(2654/(2654 + 5052) − 2214/(2214 + 4607),
      2654 · 5052/[(2654 + 5052)²(2654 + 5052 + 1)] + 2214 · 4607/[(2214 + 4607)²(2214 + 4607 + 1)])
    = N(0.0198, 0.0078²).

    Its density is super-imposed on the histogram.
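    A minimal numpy/scipy sketch of this example (variable names mine); it reproduces the simulation estimate and the normal-approximation parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_draws = 1000

# Independent Beta posteriors from the flat prior
theta1 = rng.beta(2654 + 1, 7706 - 2654 + 1, size=n_draws)    # T.W.
theta2 = rng.beta(2214 + 1, 6821 - 2214 + 1, size=n_draws)    # J.D.
diff = theta1 - theta2

print("P(theta1 > theta2 | y) ~", (diff > 0).mean())          # about 0.995

# Normal approximation N(mean, var) for theta1 - theta2
m = 2654 / 7706 - 2214 / 6821
v = 2654 * 5052 / (7706**2 * 7707) + 2214 * 4607 / (6821**2 * 6822)
print("normal approx: mean %.4f, sd %.4f" % (m, np.sqrt(v)))  # 0.0198, 0.0078
print("P via normal approx:", stats.norm.cdf(m / np.sqrt(v))) # about 0.994
```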

  • Example. Does birth weight increase when a mother quits smoking? Below is a data set.

    Smokes                 Quit
    4.5 6.1 6.9 7.5 9.9    5.4 7.2
    5.4 6.4 6.9 7.6        6.6 7.3
    5.6 6.6 7.1 7.6        6.8 7.4
    5.9 6.6 7.1 7.8        6.8
    6.0 6.6 7.2 8.0        6.9

    Assume the birth weight of a baby whose mother smokes is N(θ1, σ²) and the birth weight of a baby whose mother once smoked but quit is N(θ2, σ²). Find the posterior probability of θ1 − θ2 > 0, and give a 95% posterior interval for θ1 − θ2.

    Solution: The data give n1 = 21, n2 = 8, and (for Smokes) x̄ = 6.824, sx = 1.093, (for Quit) ȳ = 6.800, sy = 0.589. The pooled estimate is

    sp² = [(n1 − 1)sx² + (n2 − 1)sy²] / (n1 + n2 − 2) = 0.9749, sp = 0.987

    To simulate θ1 − θ2, we first draw u from χ²(n − 2) and let σ² = (n − 2)sp²/u, and then simulate θ1 and θ2 independently from N(x̄, σ²/n1) and N(ȳ, σ²/n2). The histogram of 1000 draws is below. The 95% posterior interval from simulation is [−0.807, 0.863]. Out of these 1000 draws of θ1 − θ2, 499 are positive, so the posterior probability of θ1 − θ2 > 0 is approximately 0.499.

    Note that theoretically

    (((θ1 − θ2) − (x̄ − ȳ)) / (sp√(1/n1 + 1/n2)) | x, y) is t(n − 2).

    Therefore the theoretical 95% posterior interval is

    (x̄ − ȳ) ± t0.025(n − 2) · sp√(1/n1 + 1/n2) = [−0.818, 0.866]

    and

    P(θ1 − θ2 > 0|x, y) = P[t(n − 2) ≥ −(x̄ − ȳ)/(sp√(1/n1 + 1/n2))] = 0.523.
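    A minimal numpy/scipy sketch of both the simulation and the theoretical answers, using the summary statistics above (variable names mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n1, n2 = 21, 8
xbar, ybar, sp = 6.824, 6.800, 0.987
n = n1 + n2
n_draws = 1000

# Simulation: sigma^2 from its marginal posterior, then theta1, theta2 given sigma
u = rng.chisquare(n - 2, size=n_draws)
sigma2 = (n - 2) * sp**2 / u
theta1 = rng.normal(xbar, np.sqrt(sigma2 / n1))
theta2 = rng.normal(ybar, np.sqrt(sigma2 / n2))
diff = theta1 - theta2

print("P(theta1 - theta2 > 0) ~", (diff > 0).mean())               # about 0.5
print("95% interval (simulation):", np.percentile(diff, [2.5, 97.5]))

# Theoretical t(n-2) answers
se = sp * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, n - 2)
print("95% interval (theory):", (xbar - ybar - t_crit * se, xbar - ybar + t_crit * se))
print("P (theory):", 1 - stats.t.cdf(-(xbar - ybar) / se, n - 2))  # about 0.523
```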

    [Figure: histogram of the 1000 draws of θ1 − θ2 (Smokes − Quit).]

  • An example of a generalized linear model

    It is rare that multiparameter models allow simple calculation of the posterior distribution. Simulation is often the only available tool for data analysis. In this section we discuss in detail a two-parameter generalized linear model for a bioassay experiment.

    The problem and the data. In the development of drugs, acute toxicity tests or bioassays are commonly performed on animals. The animal responses are typically dichotomous: alive or dead, tumor or no tumor, and so on. The experiments are often administered by injecting various dose levels of the compound into batches of animals, which generates data of the form (xi, ni, yi), where xi is the dose level (often measured on a logarithmic scale), ni is the size of the batch of animals receiving dose xi, and yi is the number of animals with positive response. The specific real data set is shown below.

  • Dose xi (log g/ml)   Size of batch ni   Number of deaths yi
    −0.86                5                  0
    −0.30                5                  1
    −0.05                5                  3
     0.73                5                  5

    Statistical model. Assume that yi is Binomial(ni, θi), with θi the population death rate for animals receiving dose xi. We would like θi to depend on xi, and by definition θi ∈ [0, 1]. The following logistic regression model is adopted:

    logit(θi) = α + βxi

    where logit(θ) := log(θ/(1 − θ)). The inverse function of logit(·) is

    logit⁻¹(u) = e^u/(1 + e^u).

    Note that in this model the xi’s are explanatory variables and regarded as fixed.

    Prior and likelihood. We use a flat prior π(α, β) ∝ 1 and the likelihood

    p(yi|α, β) ∝ [logit⁻¹(α + βxi)]^(yi) · [1 − logit⁻¹(α + βxi)]^(ni−yi).

  • The posterior p(α, β|y). We have

    p(α, β|y) ∝ π(α, β) · ∏_{i=1}^4 p(yi|α, β) ∝ ∏_{i=1}^4 p(yi|α, β)

    Discretization of the posterior distribution. There is no analytical expression for the posterior distribution, and we will use simulation to obtain numerical summaries. Since the problem is only two-dimensional, it is reasonable to expect that simulating from a discretized approximation of the continuous posterior distribution will do a good job. We restrict the region to (α, β) ∈ [−2, 6] × [−5, 30]. The contour plot is shown below.

    The discretization is done on a uniform 400 × 700 grid. For each grid point, we compute the unnormalized posterior density. Afterwards we normalize these quantities so that their sum over all the grid points becomes one. In other words, we now have a discrete approximation of the posterior distribution.

    Remark. A very popular methodology for simulating from the posterior distribution is the so-called Markov chain Monte Carlo (MCMC) method. It is very different from the discretization method we used in this example. When the dimension gets higher, discretization obviously becomes much more difficult.

    [Figure 1: contour plot of the posterior distribution p(α, β|y) over (α, β) ∈ [−2, 6] × [−5, 30].]

  • Simulating from the discrete approximation of the posterior distribution.

    1. Draw α from its discrete marginal distribution p(α|y).
    2. Given α, draw β from the discrete conditional distribution p(β|α, y).
    3. Jitter the sampled α and β by adding a uniform random perturbation centered at zero with width equal to the spacing of the sampling grid.
    4. Repeat these three steps 1000 times to obtain 1000 samples of (α, β).

    The histogram is attached below.

    The quantities of interest. The sign of β is important. For all 1000 samples we have β > 0, which indicates the compound is harmful. Another quantity of interest is LD50 – the dose level at which the probability of death is 50%, i.e.

    α + β · LD50 = logit(0.5) = 0 ⇒ LD50 = −α/β.

    The histogram of LD50 is attached; a code sketch of the whole procedure follows below.
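    A minimal numpy sketch of the discretize-and-sample procedure (variable names mine; the data, the region, and the 400 × 700 grid are taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bioassay data
x = np.array([-0.86, -0.30, -0.05, 0.73])   # dose (log g/ml)
n = np.array([5, 5, 5, 5])                  # batch sizes
y = np.array([0, 1, 3, 5])                  # deaths

# Uniform 400 x 700 grid over [-2, 6] x [-5, 30]
alpha = np.linspace(-2.0, 6.0, 400)
beta = np.linspace(-5.0, 30.0, 700)
A, B = np.meshgrid(alpha, beta, indexing="ij")

# Unnormalized log posterior under the flat prior:
# sum_i [ y_i*eta_i - n_i*log(1 + exp(eta_i)) ], with eta_i = alpha + beta*x_i
eta = A[..., None] + B[..., None] * x                 # shape (400, 700, 4)
logp = (y * eta - n * np.log1p(np.exp(eta))).sum(-1)
post = np.exp(logp - logp.max())
post /= post.sum()                                    # discrete approximation

# Sample alpha from its marginal, beta from the conditional, then jitter
n_draws = 1000
ia = rng.choice(len(alpha), size=n_draws, p=post.sum(axis=1))
ib = np.array([rng.choice(len(beta), p=post[i] / post[i].sum()) for i in ia])
da, db = alpha[1] - alpha[0], beta[1] - beta[0]
a_s = alpha[ia] + rng.uniform(-da / 2, da / 2, n_draws)
b_s = beta[ib] + rng.uniform(-db / 2, db / 2, n_draws)

print("P(beta > 0) ~", (b_s > 0).mean())              # essentially 1
ld50 = -a_s / b_s                                     # meaningful since beta > 0
print("posterior mean of LD50 ~", ld50.mean())
```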

    [Figures: the 1000 posterior draws of (α, β) plotted over the contour plot; histogram of LD50.]