
Module 4: Introduction to the Normal Gamma Model

Rebecca C. Steorts and Lei Qian

Agenda

- NormalGamma distribution
- posterior derivation
- application to IQ scores

NormalGamma distribution

Let $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \lambda^{-1})$ and assume both

- the mean $\mu$ and
- the precision $\lambda = 1/\sigma^2$

are unknown.

The NormalGamma$(m, c, a, b)$ distribution, with $m \in \mathbb{R}$ and $c, a, b > 0$, is a joint distribution on $(\mu, \lambda)$ obtained by letting

$$\mu \mid \lambda \sim \mathcal{N}(m, (c\lambda)^{-1}), \qquad \lambda \sim \mathrm{Gamma}(a, b).$$

In other words, the joint p.d.f. is

$$p(\mu, \lambda) = p(\mu \mid \lambda)\, p(\lambda) = \mathcal{N}(\mu \mid m, (c\lambda)^{-1})\, \mathrm{Gamma}(\lambda \mid a, b),$$

which we will denote by NormalGamma$(\mu, \lambda \mid m, c, a, b)$.

NormalGamma distribution (continued)

It turns out that this provides a conjugate prior for $(\mu, \lambda)$.

One can show the posterior is

$$\mu, \lambda \mid x_{1:n} \sim \text{NormalGamma}(M, C, A, B), \tag{1}$$

i.e., $p(\mu, \lambda \mid x_{1:n}) = \text{NormalGamma}(\mu, \lambda \mid M, C, A, B)$, where

$$M = \frac{cm + \sum_{i=1}^n x_i}{c + n}, \qquad C = c + n, \qquad A = a + n/2,$$
$$B = b + \tfrac{1}{2}\Big(cm^2 - CM^2 + \sum_{i=1}^n x_i^2\Big).$$
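As a minimal sketch of this update in R (the function name normalGammaUpdate and its list-based return format are our own choices, not from the slides):

normalGammaUpdate <- function(x, m, c, a, b) {
  # Conjugate update: data x and prior hyperparameters (m, c, a, b)
  # give the posterior hyperparameters (M, C, A, B) of Equation 1.
  n <- length(x)
  C <- c + n
  M <- (c * m + sum(x)) / C
  A <- a + n / 2
  B <- b + (c * m^2 - C * M^2 + sum(x^2)) / 2
  list(M = M, C = C, A = A, B = B)
}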

NormalGamma distribution (continued)

For interpretation, $B$ can also be written (by rearranging terms) as

$$B = b + \tfrac{1}{2} \sum_{i=1}^n (x_i - \bar{x})^2 + \tfrac{1}{2} \frac{cn}{c + n} (\bar{x} - m)^2. \tag{2}$$
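To see why Equation 2 agrees with the earlier expression for $B$, substitute $CM = cm + n\bar{x}$ and expand (a step the slides leave implicit):

$$\begin{aligned}
cm^2 - CM^2 + \sum_{i=1}^n x_i^2
&= cm^2 - \frac{(cm + n\bar{x})^2}{c+n} + \sum_{i=1}^n x_i^2 \\
&= \frac{cn}{c+n}(\bar{x} - m)^2 - n\bar{x}^2 + \sum_{i=1}^n x_i^2 \\
&= \frac{cn}{c+n}(\bar{x} - m)^2 + \sum_{i=1}^n (x_i - \bar{x})^2,
\end{aligned}$$

and multiplying by $\tfrac{1}{2}$ and adding $b$ gives Equation 2.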

NormalGamma distribution (continued)

- $M$: Posterior mean for $\mu$. It is a weighted average (convex combination) of the prior mean and the sample mean:
  $$M = \frac{c}{c+n}\, m + \frac{n}{c+n}\, \bar{x}.$$
- $C$: "Sample size" for estimating $\mu$. (The standard deviation of $\mu \mid \lambda$ is $\lambda^{-1/2}/\sqrt{C}$.)
- $A$: Shape for posterior on $\lambda$. Grows linearly with sample size.
- $B$: Rate (1/scale) for posterior on $\lambda$. Equation 2 decomposes $B$ into the prior variation, observed variation (sample variance), and variation between the prior mean and sample mean:
  $$B = (\text{prior variation}) + \tfrac{1}{2} n\, (\text{observed variation}) + \tfrac{1}{2} \frac{cn}{c+n}\, (\text{variation between means}).$$

Posterior derivation

First, consider the NormalGamma density. Dropping constants of proportionality, multiplying out $(\mu - m)^2 = \mu^2 - 2\mu m + m^2$, and collecting terms, we have

$$\begin{aligned}
\text{NormalGamma}(\mu, \lambda \mid m, c, a, b)
&= \mathcal{N}(\mu \mid m, (c\lambda)^{-1})\, \mathrm{Gamma}(\lambda \mid a, b) \tag{3} \\
&= \sqrt{\frac{c\lambda}{2\pi}} \exp\!\Big(-\tfrac{1}{2} c\lambda (\mu - m)^2\Big)\, \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} \exp(-b\lambda) \\
&\underset{\mu,\lambda}{\propto} \lambda^{a - 1/2} \exp\!\Big(-\tfrac{1}{2}\lambda\big(c\mu^2 - 2cm\mu + cm^2 + 2b\big)\Big). \tag{4}
\end{aligned}$$

Posterior derivation (continued)

Similarly, for any $x$,

$$\mathcal{N}(x \mid \mu, \lambda^{-1}) = \sqrt{\frac{\lambda}{2\pi}} \exp\!\Big(-\tfrac{1}{2}\lambda (x - \mu)^2\Big) \underset{\mu,\lambda}{\propto} \lambda^{1/2} \exp\!\Big(-\tfrac{1}{2}\lambda\big(\mu^2 - 2x\mu + x^2\big)\Big). \tag{5}$$

Posterior derivation (continued)

$$\begin{aligned}
p(\mu, \lambda \mid x_{1:n})
&\underset{\mu,\lambda}{\propto} p(\mu, \lambda)\, p(x_{1:n} \mid \mu, \lambda) \\
&\underset{\mu,\lambda}{\propto} \lambda^{a-1/2} \exp\!\Big(-\tfrac{1}{2}\lambda\big(c\mu^2 - 2cm\mu + cm^2 + 2b\big)\Big) \\
&\qquad \times \lambda^{n/2} \exp\!\Big(-\tfrac{1}{2}\lambda\big(n\mu^2 - 2(\textstyle\sum x_i)\mu + \sum x_i^2\big)\Big) \\
&= \lambda^{a + n/2 - 1/2} \exp\!\Big(-\tfrac{1}{2}\lambda\big((c+n)\mu^2 - 2(cm + \textstyle\sum x_i)\mu + cm^2 + 2b + \sum x_i^2\big)\Big) \\
&\overset{(a)}{=} \lambda^{A - 1/2} \exp\!\Big(-\tfrac{1}{2}\lambda\big(C\mu^2 - 2CM\mu + CM^2 + 2B\big)\Big) \\
&\overset{(b)}{\propto} \text{NormalGamma}(\mu, \lambda \mid M, C, A, B)
\end{aligned}$$

where step (b) is by Equation 4, and step (a) holds if $A = a + n/2$, $C = c + n$, $CM = cm + \sum x_i$, and $CM^2 + 2B = cm^2 + 2b + \sum x_i^2$.

Posterior derivation (continued)

This choice of $A$ and $C$ matches the claimed form of the posterior, and solving for $M$ and $B$, we get

$$M = \frac{cm + \sum x_i}{c + n} \qquad \text{and} \qquad B = b + \tfrac{1}{2}\Big(cm^2 - CM^2 + \sum x_i^2\Big),$$

as claimed.

An alternative derivation is to complete the square all the way (longer and much more tedious).

Do a teacher’s expectations influence student achievement?

Do a teacher’s expectations influence student achievement? In a famous study, Rosenthal and Jacobson (1968) performed an experiment in a California elementary school to try to answer this question. At the beginning of the year, all students were given an IQ test. For each class, the researchers randomly selected around 20% of the students, and told the teacher that these students were “spurters” that could be expected to perform particularly well that year. (This was not based on the test; the spurters were randomly chosen.) At the end of the year, all students were given another IQ test. The change in IQ score for the first-grade students was:¹

¹The original data is not available. This data is from the ex1321 dataset of the R package Sleuth3, which was constructed to match the summary statistics and conclusions of the original study.


Spurters/Control Data

# spurters
x <- c(18, 40, 15, 17, 20, 44, 38)
# controls
y <- c(-4, 0, -19, 24, 19, 10, 5, 10,
       29, 13, -9, -8, 20, -1, 12, 21,
       -7, 14, 13, 20, 11, 16, 15, 27,
       23, 36, -33, 34, 13, 11, -19, 21,
       6, 25, 30, 22, -28, 15, 26, -1, -2,
       43, 23, 22, 25, 16, 10, 29)

iqData <- data.frame(
  Treatment = c(rep("Spurters", length(x)), rep("Controls", length(y))),
  Gain = c(x, y)
)
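As a quick sanity check before plotting (our own addition, not from the slides), the per-group sample sizes, means, and standard deviations can be tabulated:

# Per-group summary statistics: n, mean, sd of the IQ-score changes
aggregate(Gain ~ Treatment, data = iqData,
          FUN = function(g) c(n = length(g), mean = mean(g), sd = sd(g)))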

An initial exploratory analysis

Plot the number of students versus the change in IQ score for the two groups. How strongly does this data support the hypothesis that the teachers’ expectations caused the spurters to perform better than their classmates?

Histogram of Change in IQ Scores

library(ggplot2)  # needed for the plotting code below

xLimits = seq(min(iqData$Gain) - (min(iqData$Gain) %% 5),
              max(iqData$Gain) + (max(iqData$Gain) %% 5),
              by = 5)

ggplot(data = iqData, aes(x = Gain,
                          fill = Treatment,
                          colour = I("black"))) +
  geom_histogram(position = "dodge", alpha = 0.5,
                 breaks = xLimits, closed = "left") +
  scale_x_continuous(breaks = xLimits, expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), breaks = seq(0, 10, by = 1)) +
  ggtitle("Histogram of Change in IQ Scores") +
  labs(x = "Change in IQ Score", fill = "Group") +
  theme(plot.title = element_text(hjust = 0.5))

[Figure: Histogram of Change in IQ Scores. x axis: Change in IQ Score (−35 to 45 in steps of 5); y axis: count (0 to 10); groups: Controls, Spurters.]


IQ Tests and Modeling

IQ tests are purposefully calibrated to make the scores normally distributed, so it makes sense to use a normal model here:

$$\text{spurters: } X_1, \ldots, X_{n_S} \overset{\text{iid}}{\sim} \mathcal{N}(\mu_S, \lambda_S^{-1})$$
$$\text{controls: } Y_1, \ldots, Y_{n_C} \overset{\text{iid}}{\sim} \mathcal{N}(\mu_C, \lambda_C^{-1}).$$

- We are interested in the difference between the means; in particular, is $\mu_S > \mu_C$?
- We don’t know the standard deviations $\sigma_S = \lambda_S^{-1/2}$ and $\sigma_C = \lambda_C^{-1/2}$, and the sample seems too small to estimate them very well.

IQ Tests and Modeling

On the other hand, it is easy using a Bayesian approach: we just need to compute the posterior probability that $\mu_S > \mu_C$:

$$P(\mu_S > \mu_C \mid x_{1:n_S}, y_{1:n_C}).$$

Let’s use independent NormalGamma priors:

$$\text{spurters: } (\mu_S, \lambda_S) \sim \text{NormalGamma}(m, c, a, b)$$
$$\text{controls: } (\mu_C, \lambda_C) \sim \text{NormalGamma}(m, c, a, b)$$

Hyperparameter settings

- $m = 0$ (Don’t know whether students will improve or not, on average.)
- $c = 1$ (Unsure about how big the mean change will be; prior certainty in our choice of $m$ assessed to be equivalent to one datapoint.)
- $a = 1/2$ (Unsure about how big the standard deviation of the changes will be.)
- $b = 10^2 a$ (Standard deviation of the changes expected to be around $10 = \sqrt{b/a} = \mathbb{E}(\lambda)^{-1/2}$.)
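In R, these settings might be encoded as follows, a sketch using the hypothetical normalGammaUpdate helper from earlier; c0 is our renaming of the hyperparameter $c$ to avoid masking R’s c() function:

# Hyperparameters from the slide
m  <- 0
c0 <- 1
a  <- 1/2
b  <- 10^2 * a   # = 50, so sqrt(b/a) = 10

# Posterior hyperparameters for each group
postS <- normalGammaUpdate(x, m, c0, a, b)  # spurters
postC <- normalGammaUpdate(y, m, c0, a, b)  # controls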

Prior Samples

[Figure: scatterplot of prior samples; x axis: μ (Mean Change in IQ Score), roughly −50 to 50; y axis: λ^(−1/2) (Std. Dev. of Change), 0 to 40.]
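A figure like this can be reproduced with a short sketch; the sample size and base-graphics styling below are our guesses, not the slides’ original code:

# Sample from the prior: lambda ~ Gamma(a, b), then mu | lambda ~ N(m, 1/(c0 * lambda))
nSim   <- 5000
lambda <- rgamma(nSim, shape = a, rate = b)
mu     <- rnorm(nSim, mean = m, sd = 1 / sqrt(c0 * lambda))
plot(mu, 1 / sqrt(lambda),
     xlab = "mu (Mean Change in IQ Score)",
     ylab = "Std. Dev. of Change",
     main = "Prior Samples")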

Original question

“What is the posterior probability that $\mu_S > \mu_C$?”

- The easiest way to do this is to take a bunch of samples from each of the posteriors, and see what fraction of times we have $\mu_S > \mu_C$.
- This is an example of a Monte Carlo approximation (much more to come on this in the future).
- To do this, we draw $N = 10^6$ samples from each posterior:

$$(\mu_S^{(1)}, \lambda_S^{(1)}), \ldots, (\mu_S^{(N)}, \lambda_S^{(N)}) \sim \text{NormalGamma}(A_1, B_1, C_1, D_1)$$
$$(\mu_C^{(1)}, \lambda_C^{(1)}), \ldots, (\mu_C^{(N)}, \lambda_C^{(N)}) \sim \text{NormalGamma}(A_2, B_2, C_2, D_2)$$

Original question (continued)

What are the updated posterior values?

Having drawn these samples, we obtain the approximation

$$P(\mu_S > \mu_C \mid x_{1:n_S}, y_{1:n_C}) \approx \frac{1}{N} \sum_{i=1}^N \mathbb{1}\big(\mu_S^{(i)} > \mu_C^{(i)}\big) = \,??.$$

In lab this week, you will explore this data set further, and you will analyze it in your homework assignment.
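As a hedged sketch of how this Monte Carlo step might look in R (the drawMu helper is our own, and the numerical answer is left for the lab), one can sample $\lambda$ from the posterior Gamma and then $\mu \mid \lambda$ from the posterior normal:

# Draw posterior samples of mu: lambda ~ Gamma(A, B), mu | lambda ~ N(M, 1/(C * lambda))
drawMu <- function(post, N) {
  lambda <- rgamma(N, shape = post$A, rate = post$B)
  rnorm(N, mean = post$M, sd = 1 / sqrt(post$C * lambda))
}
N   <- 1e6
muS <- drawMu(postS, N)
muC <- drawMu(postC, N)
mean(muS > muC)  # Monte Carlo estimate of P(mu_S > mu_C | data)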