A 250-Year Argument - Stanford University

A 250-Year Argument: Belief, Behavior, and the Bootstrap. Bradley Efron, Stanford University


A 250-Year Argument

Belief, Behavior, and the Bootstrap

Bradley Efron

Stanford University


The Greater World of Mathematics and Science

[Diagram with regions labeled Mathematics, Statistics, A.I., Applied Sciences, and Terra Incognita]


The Physicist’s Twins

• Sonogram: “Twin boys on the way!”

• Physicist: “What’s the probability my twins will be identical rather than fraternal?”

• Doctor: “One third of twins are identical.”


Bayes Rule for the Twins

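• Prior odds: $\frac{\Pr\{\text{identical}\}}{\Pr\{\text{fraternal}\}} = \frac{1/3}{2/3} = \frac{1}{2}$ (past experience)

• Likelihood ratio: $\frac{\Pr\{\text{same sex} \mid \text{identical}\}}{\Pr\{\text{same sex} \mid \text{fraternal}\}} = \frac{1}{1/2} = 2$ (current evidence)

• Posterior odds: $\frac{\Pr\{\text{identical} \mid \text{same sex}\}}{\Pr\{\text{fraternal} \mid \text{same sex}\}} = \;?$ (updated beliefs)

• Bayes rule: Posterior odds = Prior odds · Likelihood ratio = $\frac{1}{2} \cdot 2 = 1$

• My answer: “50/50”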
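The odds calculation on this slide can be checked mechanically. The following is a minimal sketch using exact fractions; the function names are illustrative and not part of the talk.

```python
from fractions import Fraction


def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio


def odds_to_probability(odds):
    """Convert odds o = p / (1 - p) back to the probability p."""
    return odds / (1 + odds)


# Prior odds of identical vs fraternal: (1/3) / (2/3) = 1/2 (the Doctor's figure).
prior = Fraction(1, 3) / Fraction(2, 3)
# Likelihood ratio for the sonogram result "same sex": 1 / (1/2) = 2.
lr = Fraction(1) / Fraction(1, 2)

odds = posterior_odds(prior, lr)
prob_identical = odds_to_probability(odds)
print(odds, prob_identical)  # 1 1/2
```

Posterior odds of 1 correspond to a probability of 1/2, which is the "50/50" answer.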


If All Twins Were Sonogrammed:

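                          Sonogram shows:
                          Same sex     Different     Doctor
Twins are:   Identical    a = 1/3      b = 0         1/3
             Fraternal    c = 1/3      d = 1/3       2/3
                          Physicist

(The Doctor's figures are the row totals; the Physicist's sonogram falls in the Same sex column.)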


Belief and Inference

• θ: unknown state of nature (identical or fraternal?)

• π(θ): prior beliefs for θ (1/3, 2/3)

• x: current evidence (sonogram)

• $f_\theta(x)$: probability model for x given θ

• Question: What is π(θ|x)? (posterior beliefs given x)


Bayes Rule (1763)

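• $\pi(\theta \mid x) = c\,\pi(\theta) \cdot f_\theta(x)$

  (posterior beliefs = c · prior beliefs · likelihood function)

• “c” makes π(θ|x) sum to 1

• Likelihood function $f_\theta(x)$ with x fixed, θ varying, e.g.,

  $f_\theta(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(\theta - x)^2}$

[Figure: the likelihood $f_\theta(x)$ plotted against θ, a bell curve centered at θ = x; horizontal axis 2 to 8, vertical axis 0.0 to 0.4]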


Bayes Inference without Prior Experience

“Objective Bayes”

• “p”: population proportion of identical twins [Doctor: p = 1/3]

• Principle of insufficient reason (Laplace, Bernoulli): “In the absence of prior experience, assume p equally likely to have any value between 0 and 1.” [opposed: Venn, Keynes, Fisher]

• Invariant prior (Harold Jeffreys, 1930s):

  $\pi(p) = c\,p^{-1/2}(1 - p)^{-1/2}$


[Figure: Possible prior densities π(p) for p, the population proportion of identical twins, and the corresponding predictions for the Physicist. Doctor (point mass at p = 1/3): Prob = .5; Jeffreys: Prob = .67; Laplace: Prob = .67. Horizontal axis: p from 0.0 to 1.0; vertical axis: prior density from 0.0 to 3.0]
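One reading that reproduces the three predictions listed for this figure (Doctor .5, Jeffreys .67, Laplace .67) is to average over p under each prior. Given p, Pr{identical and same sex} = p and Pr{same sex} = p + (1 − p)/2, so the prediction depends on the prior only through its mean E[p]. The sketch below assumes that reading; the function names are illustrative, and the priors are written as Beta(a, b) densities (Laplace's uniform prior is Beta(1, 1), Jeffreys' prior is Beta(1/2, 1/2)).

```python
from fractions import Fraction


def prob_identical_given_same_sex(prior_mean_p):
    """Pr{identical | same sex} for a prior on p with mean E[p].

    Pr{identical, same sex} = E[p]
    Pr{same sex}            = E[p + (1 - p)/2] = (1 + E[p]) / 2
    """
    m = Fraction(prior_mean_p)
    return m / ((1 + m) / 2)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution: a / (a + b)."""
    return Fraction(a) / (Fraction(a) + Fraction(b))


# Doctor: all prior mass at p = 1/3.
doctor = prob_identical_given_same_sex(Fraction(1, 3))
# Laplace: uniform prior on [0, 1], i.e. Beta(1, 1).
laplace = prob_identical_given_same_sex(beta_mean(1, 1))
# Jeffreys: pi(p) = c * p^(-1/2) * (1 - p)^(-1/2), i.e. Beta(1/2, 1/2).
jeffreys = prob_identical_given_same_sex(beta_mean(Fraction(1, 2), Fraction(1, 2)))

print(doctor, laplace, jeffreys)  # 1/2 2/3 2/3
```

Under this reading any prior symmetric about p = 1/2 gives 2/3, which is why the Laplace and Jeffreys priors agree here even though their shapes differ.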


Frequentist Statistics (Behaviorism)

• θ = unknown parameter, x = observed data, $f_\theta(x)$ probability model (but no prior beliefs π(θ))

• “t(x)”: some statistical procedure (test, estimate, confidence interval, . . . )

• Inference based on behavior of t(x) in repeated use

• Optimality: find best t(x) (R.A. Fisher, 1920s; J. Neyman, 1930s)


[Figure: Scores of 22 students on two tests, 'mechanics' and 'vectors'; sample correlation coefficient is .498 ± ??. Horizontal axis: mechanics score (0 to 60); vertical axis: vectors score (30 to 70). Means: mechanics 38.9, vectors 50.6]


Student Score Data

• n = 22 students’ scores on two tests: mechanics, vectors

• Data $y = (y_1, y_2, \ldots, y_{22})$ with $y_i = (\text{mec}_i, \text{vec}_i)$

• Parameter of interest θ = correlation (mec, vec)

• Sample correlation coefficient $\hat\theta = 0.498 \pm\ ??$


R. A. Fisher

• 1915: probability density $f_\theta(\hat\theta)$ (hypergeometric series)

• 1922–30: $\hat\theta$ is maximum likelihood estimate (MLE)

• Frequentist optimality of MLE: minimizes expected squared error $E\{(\hat\theta - \theta)^2\}$

• Bivariate normal models


Jerzy Neyman (1930s)

• Optimal frequentist tests and confidence intervals

• 90% confidence interval for θ: θ ∈ [0.164, 0.717]

• Neyman’s construction covers true θ 90% of the time, in repeated use
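The "covers true θ 90% of the time, in repeated use" property can be illustrated by simulation. The sketch below does not reproduce Neyman's exact construction from Fisher's density; as a stand-in it uses the standard Fisher z-transform approximation, so its interval for the student data comes out close to, but not identical to, [0.164, 0.717]. The true correlation of 0.5 used in the simulation and all function names are assumptions for illustration.

```python
import math
import random


def fisher_z_interval(r, n, z_crit=1.645):
    """Approximate central 90% interval for a correlation (Fisher z-transform)."""
    z = math.atanh(r)
    half = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half), math.tanh(z + half)


def sample_correlation(pairs):
    """Pearson sample correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def draw_pairs(theta, n, rng):
    """n draws from a standard bivariate normal with correlation theta."""
    out = []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((u, theta * u + math.sqrt(1 - theta ** 2) * v))
    return out


def coverage(theta, n, reps, seed=0):
    """Fraction of simulated data sets whose interval contains the true theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        r = sample_correlation(draw_pairs(theta, n, rng))
        lo, hi = fisher_z_interval(r, n)
        hits += lo <= theta <= hi
    return hits / reps


# Interval for the observed correlation of the 22 students.
lo, hi = fisher_z_interval(0.498, 22)
# Repeated-use behavior at a hypothetical true correlation of 0.5.
cov = coverage(theta=0.5, n=22, reps=2000)
print(round(lo, 3), round(hi, 3), cov)
```

The coverage figure is a property of the procedure over many data sets, not a statement about any single interval, which is the frequentist reading of "90%".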


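[Figure: Neyman's 90% confidence interval for the student score correlation, .164 < θ < .717. Fisher's density f(θ̂* | θ) is plotted against θ̂* for θ = .164 and θ = .717, with areas of .05 marked on the curves and the observed value .498 marked on the axis. Horizontal axis: θ̂* from −0.5 to 1.0; vertical axis: density from 0 to 4]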


Jeffreys’ Invariant Prior

• Jeffreys’ objective (or “uninformative”) prior for correlation:

  $\pi(\theta) = 1/(1 - \theta^2)$

• General formula: one over the square root of Fisher’s information bound for the variance of the MLE (transforms correctly under change of variables)


[Figure: Bayes posterior density π(θ | θ̂) for the 22 students; 90% credible limits = [.164, .718]; Neyman limits [.164, .717]. Horizontal axis: θ from −0.2 to 0.8; vertical axis: posterior density from 0.0 to 2.0. The posterior has 5% in each tail and 90% between the credible limits]


More Students

n      $\hat\theta$
22     .498
44     .663
66     .621
88     .553
∞      [.415, .662]


[Figure: Galton's 1886 distribution of child's height vs parents'; ellipses are contours of the best-fit bivariate normal density; red dot at the bivariate average (68.3, 68.1). Horizontal axis: parents' height (64 to 74); vertical axis: child's height (60 to 75)]


Bivariate Normal Distribution

• “$y \sim N_2(\mu, \Sigma)$” ($y, \mu \in \mathbb{R}^2$, Σ 2 × 2 pos def):

  $f_{\mu,\Sigma}(y) = \frac{1}{2\pi}\,|\Sigma|^{-1/2}\, e^{-\frac{1}{2}(y-\mu)^t \Sigma^{-1} (y-\mu)}$

• µ: center of the ellipses; Σ: their shape

• 5 parameters: 2 means, 2 variances, 1 correlation


A More Difficult Problem

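• θ = “eigenratio” = $\frac{\lambda_1}{\lambda_1 + \lambda_2}$ ($\lambda_1 > \lambda_2$ eigenvalues of Σ)

• Student score data y (22 × 2) gives MLEs $\hat\mu$, $\hat\Sigma$, and $\hat\theta = 0.793 \pm\ ?$

• Not true: $f_{\mu,\Sigma}(\hat\theta)$ depends only on θ

• There are 4 “nuisance parameters”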


[Figure: Posterior density of the eigenratio under the Jeffreys prior for the bivariate normal model; 90% credible limits [.68, .89]; bootstrap CI [.63, .88]. Red dots are the bootstrap 90% confidence limits. Horizontal axis: eigenratio (0.5 to 1.0); vertical axis: posterior density (0 to 6)]


Bootstrap Methods (Automatic Frequentist Inference)

• Original data $y_i \sim N_2(\mu, \Sigma)$, i = 1, 2, . . . , 22

  – gives MLEs $\hat\mu$, $\hat\Sigma$, and $\hat\theta = 0.793$

• Bootstrap data $y_i^* \sim N_2(\hat\mu, \hat\Sigma)$, i = 1, 2, . . . , 22

  – gives $\hat\theta^*$ = bootstrap eigenratio

• 10,000 $\hat\theta^*$s

• 58% exceed $\hat\theta$ (upward bias)

• Reweighting formula puts bigger weights on smaller $\hat\theta^*$s

• Confidence limits are the weighted bootstrap percentiles
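The resampling loop described on this slide can be sketched in a few lines. The covariance matrix below is a made-up stand-in (the student score data are not reproduced in these slides), chosen only so that its eigenratio is near 0.79. The reweighting formula is not given here, so the sketch stops at plain, unweighted percentiles; those are not the talk's confidence limits. All function names are illustrative.

```python
import math
import random


def eigenratio(s11, s12, s22):
    """lambda1 / (lambda1 + lambda2) for the symmetric matrix [[s11, s12], [s12, s22]]."""
    trace = s11 + s22                                   # lambda1 + lambda2
    gap = math.sqrt((s11 - s22) ** 2 + 4 * s12 ** 2)    # lambda1 - lambda2
    return (trace + gap) / (2 * trace)


def mle_covariance(ys):
    """Normal-model MLE of the covariance (divides by n, not n - 1)."""
    n = len(ys)
    m1 = sum(y[0] for y in ys) / n
    m2 = sum(y[1] for y in ys) / n
    s11 = sum((y[0] - m1) ** 2 for y in ys) / n
    s12 = sum((y[0] - m1) * (y[1] - m2) for y in ys) / n
    s22 = sum((y[1] - m2) ** 2 for y in ys) / n
    return s11, s12, s22


def draw_bivariate_normal(mu, sigma, n, rng):
    """n draws from N2(mu, Sigma) using a hand-written 2x2 Cholesky factor."""
    s11, s12, s22 = sigma
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2))
    return out


def parametric_bootstrap(mu_hat, sigma_hat, n, reps, seed=0):
    """Eigenratios of `reps` data sets drawn from the fitted normal model."""
    rng = random.Random(seed)
    return [
        eigenratio(*mle_covariance(draw_bivariate_normal(mu_hat, sigma_hat, n, rng)))
        for _ in range(reps)
    ]


mu_hat = (38.9, 50.6)              # test means as reported on the slides
sigma_hat = (170.0, 70.0, 100.0)   # HYPOTHETICAL (s11, s12, s22), not the real MLE
theta_hat = eigenratio(*sigma_hat)

boot = sorted(parametric_bootstrap(mu_hat, sigma_hat, n=22, reps=2000))
frac_above = sum(t > theta_hat for t in boot) / len(boot)
# Plain 5th and 95th percentiles, with no reweighting applied.
lower, upper = boot[int(0.05 * len(boot))], boot[int(0.95 * len(boot)) - 1]
print(round(theta_hat, 3), frac_above, round(lower, 3), round(upper, 3))
```

A value of `frac_above` noticeably over one half is the kind of upward bias the slide reports (58% for the real data); correcting for it is the job of the reweighting step that this sketch omits.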


[Figure: Histogram of 10000 bootstrap eigenratio values from the student score data (bivariate normal model); red line shows the confidence weights. 58% of the bootstrap values exceed the MLE, .793. Horizontal axis: bootstrap eigenratios (0.5 to 1.0); vertical axis: frequency (0 to 600)]


Gibbs Sampling (Automatic Bayes Inference)

• Given: prior π(θ), data x, model $f_\theta(x)$

• Approximates: π(θ|x) by Markov chain random walk

• “MCMC”, “Metropolis-Hastings”, . . . (A-Bomb?)

• Most often used with convenient “uninformative” priors
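To make "approximates π(θ|x) by Markov chain random walk" concrete, here is a minimal random-walk Metropolis sampler, one member of the MCMC family named on this slide (it is not a Gibbs sampler in the strict sense). It assumes a flat prior and the normal model $f_\theta(x)$ with variance 1 used earlier as the likelihood example; the observation x = 5.0, the step size, and the burn-in length are hypothetical choices.

```python
import math
import random


def log_posterior(theta, x):
    """Unnormalized log posterior: flat prior times the N(theta, 1) likelihood of x."""
    return -0.5 * (theta - x) ** 2


def random_walk_metropolis(x, steps, step_size=1.0, seed=0):
    """Draw a correlated sample from pi(theta | x) with Gaussian random-walk proposals."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, x)
    draws = []
    for _ in range(steps):
        proposal = theta + rng.gauss(0, step_size)
        candidate = log_posterior(proposal, x)
        # Accept with probability min(1, posterior ratio); the constant c cancels.
        # 1.0 - rng.random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        draws.append(theta)
    return draws


draws = random_walk_metropolis(x=5.0, steps=20000)
kept = draws[2000:]  # discard an initial burn-in stretch
post_mean = sum(kept) / len(kept)
post_var = sum((t - post_mean) ** 2 for t in kept) / len(kept)
print(round(post_mean, 2), round(post_var, 2))
```

With a flat prior the exact posterior here is N(x, 1), so the sample mean and variance should land near 5 and 1; the point of the method is that the same loop works when no closed form is available.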


Prostate Cancer Study

(Singh et al 2002)

• 102 men: 52 prostate cancer, 50 healthy controls

• Each man assessed for activity of 6033 genes

• Statistic $x_i$ measures differences in activity, patients minus controls, for gene i, i = 1, 2, . . . , 6033.

• Probability model

  $x_i \sim N(\delta_i, 1)$ (normal, mean $\delta_i$, variance 1)

  $\delta_i$: the true difference or effect size


[Figure: Histogram of the Prostate Study (Singh et al 2002) difference estimates x[i] comparing cancer patients with normal controls, 6033 genes; hash marks show the 10 largest x values, including gene 610 at x = 5.29; a reference curve is labeled "if all x[i]=0". Horizontal axis: difference estimates x[i] (−4 to 4); vertical axis: frequency (0 to 400)]


Bayesian Analysis (for one gene)

• Assume δ has prior density π(δ)

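• Prob model $f_\delta(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\delta)^2}$

• Marginal density $m(x) = \int_{-\infty}^{\infty} f_\delta(x)\,\pi(\delta)\, d\delta$

  (overall density of x taking account of randomness in δ)

• Bayes posterior expectation (“Tweedie’s formula”)

  $E\{\delta \mid x\} = x + \frac{d}{dx} \log m(x)$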
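Tweedie's formula can be sanity-checked in a case where the answer is known in closed form. The sketch below assumes a hypothetical normal prior δ ~ N(0, A); then x is marginally N(0, A + 1), and the familiar conjugate posterior mean is x·A/(A + 1). The prior, the value A = 2, and the function names are assumptions for illustration only.

```python
import math


def log_marginal(x, prior_var):
    """log m(x) when delta ~ N(0, prior_var) and x | delta ~ N(delta, 1)."""
    v = prior_var + 1.0  # marginal variance of x
    return -0.5 * math.log(2 * math.pi * v) - 0.5 * x * x / v


def tweedie(x, prior_var, h=1e-5):
    """E{delta | x} = x + d/dx log m(x), with the derivative by central difference."""
    slope = (log_marginal(x + h, prior_var) - log_marginal(x - h, prior_var)) / (2 * h)
    return x + slope


def conjugate_posterior_mean(x, prior_var):
    """Closed-form normal-normal posterior mean, for comparison."""
    return x * prior_var / (prior_var + 1.0)


x, A = 5.29, 2.0  # x borrowed from gene 610; A is a hypothetical prior variance
via_tweedie = tweedie(x, A)
via_conjugate = conjugate_posterior_mean(x, A)
print(round(via_tweedie, 4), round(via_conjugate, 4))
```

The two routes agree, and the formula only ever touches the marginal density m(x), never the prior itself. That is what makes it usable when the prior is unknown.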


Empirical Bayes Analysis

• We don’t know prior π(δ), but histogram provides a smooth estimate $\hat m(x)$ for m(x)

• Empirical Bayes estimate:

  $\hat E\{\delta_i \mid x_i\} = x_i + \left.\frac{d}{dx} \log \hat m(x)\right|_{x_i}$

• Frequentist estimation of a Bayesian inference
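The plug-in idea can be tried on simulated data. This sketch swaps in a Gaussian kernel density estimate for the histogram smoothing mentioned on the slide, because a kernel estimate has a simple closed-form derivative; it is not the smoother used in the talk. The simulated effects δ ~ N(0, 1), the bandwidth h = 0.5, and the function names are all assumptions.

```python
import math
import random


def kde_and_slope(x, data, h):
    """Gaussian kernel estimate of m(x) and of its derivative m'(x)."""
    m = 0.0
    dm = 0.0
    for xi in data:
        u = (x - xi) / h
        k = math.exp(-0.5 * u * u)
        m += k
        dm += -u / h * k  # derivative of the kernel with respect to x
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return m / norm, dm / norm


def empirical_bayes_estimate(x, data, h=0.5):
    """Plug-in Tweedie estimate: x + d/dx log m_hat(x) = x + m_hat'(x) / m_hat(x)."""
    m, dm = kde_and_slope(x, data, h)
    return x + dm / m


# Simulated study: true effects delta_i ~ N(0, 1), observations x_i ~ N(delta_i, 1).
rng = random.Random(0)
deltas = [rng.gauss(0, 1) for _ in range(5000)]
xs = [d + rng.gauss(0, 1) for d in deltas]

est_at_0 = empirical_bayes_estimate(0.0, xs)
est_at_1 = empirical_bayes_estimate(1.0, xs)
print(round(est_at_0, 2), round(est_at_1, 2))
```

In this simulation the true posterior mean is x/2, so the estimate at x = 1 should fall near 0.5. The kernel bandwidth inflates the apparent spread of the x values, which makes the estimate shrink slightly less than it should.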


[Figure: Empirical Bayes estimates of E{δ|x}, the expected true difference δ[i] given the observed difference x[i]. Estimates are near 0 for the 93% of genes in [−2, 2]; for gene 610, x[610] = 5.29 and the estimate is 4.07. Horizontal axis: difference value x[i] (−4 to 6); vertical axis: E{δ[i] | x[i]} (−2 to 4)]


Score Sheet

Bayes                         Frequentist
1. Belief (prior)             1. Behavior (method)
2. Principled                 2. Opportunistic
3. One distribution           3. Many distributions (bootstrap?)
4. Dynamic                    4. Static
5. Individual (subjective)    5. Community (objective)
6. Aggressive                 6. Defensive
