Bootstrap Methods and the Accuracy of Large-Scale Estimators

Bradley Efron, Stanford University


Source: statistics.rutgers.edu/joomlatools-files/docman-files/Efron.pdf



Correlation and Accuracy

• Modern scientific studies: N cases (genes, SNPs, pixels, ...), each with its own summary statistic $z_i$, $i = 1, 2, \dots, N$
• N ∼ 10,000
• Estimate of interest: $\hat\theta = s(z)$ [e.g., $\hat\theta = \#\{z_i > 3\}/N$]
• Question: How accurate is $\hat\theta$?
• Easy answer if the $z_i$'s are independent (but usually they are not!)
• Troubles for the bootstrap
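The "easy answer" under independence is the binomial standard deviation of a proportion. The sketch below computes $\hat\theta = \#\{z_i > 3\}/N$ and that independence-based sd. The simulated data and the function names `tail_fraction` and `independent_sd` are illustrative assumptions, not part of the talk.

```python
import math
import random

def tail_fraction(z, threshold=3.0):
    """theta_hat = #{z_i > threshold} / N."""
    return sum(1 for zi in z if zi > threshold) / len(z)

def independent_sd(theta_hat, n_cases):
    """Binomial standard deviation; valid only if the z_i are independent."""
    return math.sqrt(theta_hat * (1.0 - theta_hat) / n_cases)

# Hypothetical data: N = 10,000 independent N(0, 1) draws (not from the talk).
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
theta_hat = tail_fraction(z)
sd_indep = independent_sd(theta_hat, len(z))
print(theta_hat, sd_indep)
```

With correlated $z_i$ this sd can be far too small, which is the problem the rest of the talk addresses.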


Leukemia Microarray Study

(Golub et al., 1999)

• 72 leukemia patients: $n_1 = 47$ "ALL", $n_2 = 25$ "AML"
• N = 7128 genes
• Data matrix X: 7128 × 72
• X has independent columns but correlated rows; rms correlation α = .11
• $t_i$ = two-sample t-statistic, AML vs. ALL, for gene i
• $z_i = \Phi^{-1}(F_{70}(t_i))$ [$\Phi$, $F_{70}$: cdfs of N(0, 1) and $t_{70}$]
• $H_0: z_i \sim N(0, 1)$ ("theoretical null")
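A minimal sketch of the transformation $z = \Phi^{-1}(F_{70}(t))$, using only the Python standard library. The t cdf is obtained here by Simpson's rule, which is a stand-in for a proper library routine and an assumption of this sketch. `NormalDist.inv_cdf` rejects probabilities of exactly 0 or 1, so very extreme t-values are not handled.

```python
import math
from statistics import NormalDist

def t_pdf(u, df):
    """Density of the central Student-t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

def t_cdf(t, df, steps=2000):
    """Central-t cdf via Simpson's rule on [0, |t|] (illustrative accuracy only)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3.0
    return 0.5 + half if t >= 0 else 0.5 - half

def z_value(t, df=70):
    """z = Phi^{-1}(F_df(t)); distributed N(0, 1) under the theoretical null."""
    return NormalDist().inv_cdf(t_cdf(t, df))

print(z_value(2.0))   # slightly below 2, since t_70 has heavier tails than N(0, 1)
```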


[Figure: Histogram of the leukemia data, N = 7128 z-values, 47 ALL versus 25 AML patients; RMS correlation = .11; empirical null ~ N(.10, 1.68^2). A Poisson spline fit (df = 5) is overlaid. Horizontal axis: z values (−5 to 5); vertical axis: frequency (0 to 350).]


[Figure: Leukemia z-value histogram and the average of 100 bootstrap z* histograms (two-sample nonparametric bootstrap: resample columns of X). Curves: boot average, Poisson spline fit. Horizontal axis: z values (−5 to 5); vertical axis: frequency (0 to 350).]


Bootstrap Dilation

• $x_i$ = ith row of X (n = 72 = 47 + 25)
• $x_i \to z_i$
• $x_i^* \to z_i^* \sim z_i + N(0, \sigma_i^2)$
• Bootstrap histogram has extra component of variance:

$$E^*\left\{\sum_{i=1}^{N} z_i^{*2}/N\right\} = \sum_{i=1}^{N} z_i^2/N + \sum_{i=1}^{N} \sigma_i^2/N$$

• Next: Boot stdev estimates for $\hat F(x) = \#\{z_i \ge x\}/N$
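The extra variance term follows in one step, assuming (as the dilation bullet states) that $z_i^*$ has bootstrap mean $z_i$ and bootstrap variance $\sigma_i^2$:

```latex
E^*\{z_i^{*2}\} = \left(E^* z_i^*\right)^2 + \mathrm{var}^*(z_i^*) = z_i^2 + \sigma_i^2
\quad\Longrightarrow\quad
E^*\Big\{\sum_{i=1}^{N} z_i^{*2}/N\Big\}
  = \sum_{i=1}^{N} z_i^2/N + \sum_{i=1}^{N} \sigma_i^2/N .
```

So the bootstrap histogram is wider than the original by the average of the $\sigma_i^2$.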


[Figure: Bootstrap stdev for the empirical cdf of the leukemia z-values, compared with Formula X. Curves: Formula X, Bootstrap. Horizontal axis: x value (−10 to 10); vertical axis: sd estimates (0 to 0.025).]


[Figure: Permutation and jackknife estimates of sd{empirical cdf}, compared with Formula X. Series: perm, Formula X, jackknife. Horizontal axis: x value (−10 to 10); vertical axis: sd estimates (0 to 0.025).]


Formula X

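$$\mathrm{Var}\{\hat F(x)\} \doteq \underbrace{\left\{\frac{\hat F(x)\,(1-\hat F(x))}{N}\right\}}_{\text{independence}} + \underbrace{\left\{\frac{\sigma_0^2\,\alpha\,\hat f^{(1)}(x)}{\sqrt{2}}\right\}^2}_{\text{correlation penalty}}$$

• $\sigma_0 = 1.68$ from empirical null
• $\alpha = .11$ estimated RMS correlation
• $\hat f^{(1)}(x)$: first derivative of estimate $\hat f(x)$
• Depends on normality: $z_i \sim N(\mu_i, \sigma_i^2)$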


Formula X for Leukemia Data

x:     1     2     3     4     5
F(x)  .29   .13   .057  .025  .010
sd    .017  .022  .010  .004  .002
sd0   .005  .004  .003  .002  .001

[sd: standard deviation from Formula X; sd0: independence term only, sqrt(F(x)(1 − F(x))/N)]
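A minimal sketch of Formula X as a function. With the correlation penalty switched off (α = 0) it reproduces the sd0 row of the table from the F(x) row and N = 7128. The derivative values $\hat f^{(1)}(x)$ for the leukemia data are not given in the slides, so the sd row is not reproduced here; `f1` is a user-supplied input and `formula_x_sd` is a hypothetical name.

```python
import math

def formula_x_sd(F, N, f1=0.0, sigma0=1.0, alpha=0.0):
    """sd{Fhat(x)} from Formula X: independence term plus correlation penalty.

    f1 is the estimated density derivative at x (must be supplied by the caller).
    """
    independence = F * (1.0 - F) / N
    penalty = (sigma0 ** 2 * alpha * f1 / math.sqrt(2.0)) ** 2
    return math.sqrt(independence + penalty)

N = 7128
F_values = [0.29, 0.13, 0.057, 0.025, 0.010]   # F(x) row of the table, x = 1..5
sd0 = [round(formula_x_sd(F, N), 3) for F in F_values]
print(sd0)   # [0.005, 0.004, 0.003, 0.002, 0.001]
```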


[Figure: Simulation of sd{Fhat(x)} from Formula X; N = 6000, n = 20 + 20, alpha = .10. Solid curve and bars are the mean and stdev of the sdhat values over 100 simulations; dashed curve is the actual sd. Horizontal axis: −4 to 4; vertical axis: standard deviation estimates (0 to 0.020).]


Multi-Class Normal Model

• Suppose the $z_i$'s are in "classes" $C_1, C_2, \dots, C_C$, with $z_i \sim N(\mu_c, \sigma_c^2)$ for $z_i \in C_c$
• $N_c = \#\{C_c\}$, $p_c = N_c/N$ [so $\sum_c N_c = N$, $\sum_c p_c = 1$]
• Correlation distribution: $g_{cd}(\rho)$ = empirical density of the $N_c \cdot N_d$ correlations between members of $C_c$ and $C_d$
• Assume the $g_{cd}$ are all equal to $g(\rho)$, where $g(\rho)$ = empirical density of all $\binom{N}{2}$ correlations


Digression: The Non-Null Distribution of z-Values

• z-value is a test statistic ∼ N(0, 1) under $H_0$
• Theorem: Under reasonable conditions the non-null distribution of z is

$$z \sim N(\mu, \sigma^2) + O_p(1/n) \quad\text{where}\quad \sigma^2 = 1 + O(1/n^{1/2})$$

• Normality degrades more slowly than unit standard deviation
• Helps justify model $z_i \sim N(\mu_i, \sigma_i^2)$


Student-t z-Values

• $t \sim t_\nu(\delta)$ [noncentral t, noncentrality δ, df = ν]
• $H_0: \delta = 0$
• $z = \Phi^{-1}F_\nu(t)$ [$F_\nu$: central t cdf, df = ν], so under $H_0$, $z \sim N(0, 1)$
• What if $\delta \ne 0$?


[Figure: Densities for z = Phiinv(Fnu(t)), t ~ t(del, nu = 20), for del = 0, 1, 2, 3, 4, 5; dotted-dashed lines are matching N(M, SD). Horizontal axis: z value (−4 to 6); vertical axis: density (0 to 0.6).]


The Count Vector y

• Partition the range $\mathcal{Z}$ of z into K bins: $\mathcal{Z} = \bigcup_{k=1}^{K} \mathcal{Z}_k$
• Each bin of width "Δ"
• Bin centers "$x_k$", $k = 1, 2, \dots, K$ (leukemia histogram: $\mathcal{Z} = [-7.9, 7.9]$, Δ = .2, K = 79)
• Counts $y_k = \#\{z_i \in \mathcal{Z}_k\}$
• $y = (y_1, y_2, \dots, y_K)'$
• Count vector y is a discretized order statistic of z (most statistics of interest are of the form $\hat\theta = m(y)$)
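The binning can be sketched as follows, with the leukemia histogram's range and bin width as defaults. The function name `count_vector`, the handling of the right endpoint, and the small input list are assumptions made for illustration.

```python
def count_vector(z, lo=-7.9, hi=7.9, width=0.2):
    """Return (bin centers x_k, counts y_k) for equal-width bins on [lo, hi]."""
    K = round((hi - lo) / width)                 # 79 bins for the defaults
    centers = [lo + (k + 0.5) * width for k in range(K)]
    counts = [0] * K
    for zi in z:
        if lo <= zi <= hi:                       # values outside the range are dropped
            k = min(int((zi - lo) / width), K - 1)   # put zi == hi in the last bin
            counts[k] += 1
    return centers, counts

# Hypothetical z-values; 9.0 falls outside [-7.9, 7.9] and is not counted.
centers, y = count_vector([-7.9, -0.05, 0.0, 0.05, 7.9, 9.0])
print(len(y), sum(y))
```

Any statistic of the form $m(y)$, such as a tail proportion or a fitted density, can then be computed from `y` alone.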


Mehler’s Identity (Lancaster, 1958)

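• $\varphi_\rho(u, v)$ = standard normal bivariate density
• Mehler:

$$\lambda_\rho(u, v) = \frac{\varphi_\rho(u, v)}{\varphi(u)\varphi(v)} - 1 = \sum_{j \ge 1} \frac{\rho^j}{j!}\, h_j(u)\, h_j(v)$$

where $h_j$ is the jth Hermite polynomial
• Crucial quantity:

$$\Lambda(u, v) = \int_{-1}^{1} \lambda_\rho(u, v)\, g(\rho)\, d\rho = \sum_{j \ge 1} \frac{\alpha_j}{j!}\, h_j(u)\, h_j(v) \quad\text{where}\quad \alpha_j = \int_{-1}^{1} \rho^j g(\rho)\, d\rho$$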
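Mehler's identity can be checked numerically. The sketch below assumes the probabilists' Hermite polynomials ($h_1(x) = x$, $h_2(x) = x^2 - 1$), the convention under which the series matches the bivariate normal density ratio. The function names and the test point are illustrative.

```python
import math

def hermite(j_max, x):
    """Probabilists' Hermite polynomials h_0..h_{j_max} evaluated at x."""
    h = [1.0, x]
    for j in range(1, j_max):
        h.append(x * h[j] - j * h[j - 1])   # h_{j+1} = x h_j - j h_{j-1}
    return h[: j_max + 1]

def lambda_closed_form(rho, u, v):
    """phi_rho(u, v) / (phi(u) phi(v)) - 1 for the standard bivariate normal."""
    q = (2.0 * rho * u * v - rho ** 2 * (u * u + v * v)) / (2.0 * (1.0 - rho ** 2))
    return math.exp(q) / math.sqrt(1.0 - rho ** 2) - 1.0

def lambda_series(rho, u, v, j_max=60):
    """Mehler series sum_{j>=1} rho^j / j! * h_j(u) h_j(v), truncated at j_max."""
    hu, hv = hermite(j_max, u), hermite(j_max, v)
    return sum(rho ** j / math.factorial(j) * hu[j] * hv[j]
               for j in range(1, j_max + 1))

print(lambda_closed_form(0.4, 0.5, -0.3), lambda_series(0.4, 0.5, -0.3))
```

The two printed values should agree to many decimal places; the series converges quickly when |ρ| is well below 1.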


Exact Covariance of y

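• $z_i \sim N(\mu_c, \sigma_c^2)$ for $z_i \in C_c$
• $N_c = \#C_c$, $p_c = N_c/N$

Theorem: $\mathrm{cov}(y) = \mathrm{cov}_0 + \mathrm{cov}_1$,

$$\mathrm{cov}_0 = N \sum_c p_c \left\{\mathrm{diag}(\pi_c) - \pi_c \pi_c'\right\} \quad [\text{independence}]$$

where $\pi_{ck} = \Pr_c\{z_i \in \text{bin}_k\}$, $\pi_c = (\cdots \pi_{ck} \cdots)'$,

$$\mathrm{cov}_1 = N^2 \sum_c \sum_d p_c p_d B_{cd} - N \sum_c p_c B_{cc} \quad [\text{corr penalty}]$$

and

$$B_{cd}(k, l) = \pi_{ck}\, \pi_{dl}\, \Lambda\!\left(\frac{x_k - \mu_c}{\sigma_c}, \frac{x_l - \mu_d}{\sigma_d}\right).$$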


Four Simplifications of cov1

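• Drop N term
• Microarray standardization methods make $\alpha_1 \doteq 0$
• Mehler expansion: $\alpha_2 = \int_{-1}^{1} \rho^2 g(\rho)\, d\rho$ is the lead term
• Higher terms ignorable if $\alpha_2$ small

Simplified Formula (almost Formula X):

Letting $\alpha = \alpha_2^{1/2}$ and $\phi^{(2)}_k = \sum_c p_c\, \varphi^{(2)}\!\left(\frac{x_k - \mu_c}{\sigma_c}\right) / \sigma_c$,

$$\mathrm{cov}_1 \doteq (N \Delta \alpha)^2\, \phi^{(2)} \phi^{(2)\prime} / 2 \quad [\text{rms approximation}]$$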


Numerical Comparison

• N = 6000, α = .1
• Two classes: $(p_c, \mu_c, \sigma_c)$ = (.95, 0, 1) and (.05, 2.5, 1)
• Next figure compares standard deviations (square roots of diagonal elements) of exact cov(y) and rms approximation


[Figure: sd{y[k]} from the exact formula (solid) compared with the rms approximation (dashed); N = 6000, alpha = .1, (p0, mu0, sig0) = (.95, 0, 1) and (.05, 2.5, 1). Dashes show bin centers x[k]. Curves: sd{y[k]} exact, without corr penalty, rms approximation. Horizontal axis: z value (−4 to 4); vertical axis: standard deviation (0 to 40).]


[Figure: Same numerical example, now sd{Fhat[k]}, where Fhat[k] = sum(y[l] for l >= k)/N. Curves: exact, without corr penalty, rms approx. Horizontal axis: z value (−4 to 4); vertical axis: sd{Fhat}.]


Estimation of RMS Correlation α

• $\hat\rho_{ii'}$ = empirical correlation between rows i and i′ of X, the N × n expression matrix
• $\{\hat\rho_{ii'}\}$ has mean and variance (m, v) [leukemia = (.00, $.19^2$)]

$$\hat\alpha^2 = \frac{n}{n-1}\left(v - \frac{1}{n-1}\right)$$

      ALL    AML    Both
α:   .121   .109   .114
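The estimator is a one-line computation. In the sketch below the inputs v = .19² and n = 47 (the ALL group size) are chosen for illustration; the slide does not say which sample size accompanies the quoted (m, v). The clamp at zero is an added safeguard, not part of the slide's formula.

```python
import math

def rms_correlation(v, n):
    """alpha_hat from the variance v of the empirical row correlations (n columns)."""
    alpha_sq = n / (n - 1) * (v - 1.0 / (n - 1))   # removes sampling noise 1/(n-1)
    return math.sqrt(max(alpha_sq, 0.0))           # clamp at 0 (added safeguard)

print(round(rms_correlation(0.19 ** 2, 47), 3))    # 0.121
```

The subtraction of 1/(n − 1) matters: with few columns, most of the observed spread in empirical correlations is sampling noise rather than true correlation.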


More General Accuracy Estimates

• "Q": q-dimensional statistic of interest, $Q = Q(y)$
• Influence function: $dQ = D\, dy$ [$D_{jk} = \partial Q_j / \partial y_k$], so

$$\mathrm{cov}(Q) = D\, \mathrm{cov}(y)\, D'$$
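As a worked instance, take Q to be the right-tail empirical cdf on the bin grid, $\hat F_k = \sum_{l \ge k} y_l / N$. Since Q is linear in y, the influence matrix is exact rather than a first-order approximation:

```latex
D_{kl} = \frac{\partial \hat F_k}{\partial y_l} = \frac{1}{N}\,\mathbf{1}\{l \ge k\},
\qquad
\mathrm{var}\{\hat F_k\} = \left[D\,\mathrm{cov}(y)\,D'\right]_{kk}
  = \frac{1}{N^2} \sum_{l \ge k} \sum_{m \ge k} \mathrm{cov}(y)_{lm} .
```

The off-diagonal entries of cov(y) enter the double sum, which is why correlation between bins changes the accuracy of $\hat F$.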


Example: Accuracy of $\hat f$

• $z \to y \to \hat f$ by Poisson GLM of counts $y_k$ on polynomial($x_k$)
• $Q = \log(\hat f) = (\dots \log \hat f(x_k) \dots)'$
• $D = M\left[M'\,\mathrm{diag}(\hat f)\,M\right]^{-1} M' / N\Delta$, with M the GLM structure matrix


Local False Discovery Rate

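$p_0$ = prior Pr{null}, with $z \sim f_0(z)$
$p_1$ = prior Pr{non-null}, with $z \sim f_1(z)$

• Mixture: $f(z) = p_0 f_0(z) + p_1 f_1(z)$
• Estimated local false discovery rate: $\widehat{\mathrm{fdr}}(z) = \widehat{\Pr}\{\text{null} \mid z\} = p_0 f_0(z) / \hat f(z)$
• $\mathrm{cov}\{\log \widehat{\mathrm{fdr}}\} \doteq \mathrm{cov}\{\log \hat f\}$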
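A minimal sketch of the local false discovery rate for a known two-component mixture, using the two-class configuration from the numerical comparison (p0 = .95, null N(0, 1), non-null N(2.5, 1)). It evaluates the true fdr from known densities; the estimated version in the talk replaces f with a fitted $\hat f$. The function name `local_fdr` is illustrative.

```python
from statistics import NormalDist

def local_fdr(z, p0, f0, f1):
    """fdr(z) = p0 * f0(z) / f(z), with mixture f = p0 * f0 + (1 - p0) * f1."""
    null_part = p0 * f0(z)
    return null_part / (null_part + (1.0 - p0) * f1(z))

f0 = NormalDist(0.0, 1.0).pdf    # null density
f1 = NormalDist(2.5, 1.0).pdf    # non-null density
for z in (2.0, 2.5, 3.0, 3.5):
    print(z, round(local_fdr(z, 0.95, f0, f1), 3))
```

Because $p_0 f_0$ is treated as known, the sampling variability of log fdr comes from the estimate of $f$, consistent with the last bullet above.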


[Figure: sd{log fdrhat(z)}; N = 6000, alpha = 0, .1, and .2, (p0, mu, sig) = (.95, 0, 1) and (.05, 2.5, 1). Stars are sd's for N = 1500, alpha = .1. Numbers along the axis are fdrhat[z]: 0.69, 0.58, 0.44, 0.25, 0.09, 0.03. Horizontal axis: z value (2.0 to 3.5); vertical axis: sd (0 to 0.25).]


[Figure: Comparison of sd's for log{fdrhat} and log{Fdrhat}, alpha = .1. Numbers along the axis are Fdr[z]: 0.34, 0.26, 0.18, 0.1, 0.04, 0.01. Curves: sdlogfdr, sdlogFdr, sdlogFdrnon. Horizontal axis: z value (2.0 to 3.5); vertical axis: sd (0 to 0.25).]


Poisson Bootstrap

• Null case: all $z_i \sim N(0, 1)$
• Let $A \sim N(0, \alpha^2)$ and $w = N \pi_0 h_2 / \sqrt{2}$
• Hierarchical Poisson resampling: $u = N\pi_0 + A w$ and $y_k \overset{\text{ind}}{\sim} \mathrm{Poi}(u_k)$, $k = 1, 2, \dots, K$

• i.i.d. case if α = 0

• like rms formula if α = 0
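A short calculation, sketched under the hierarchical model exactly as stated (and ignoring that $u_k$ can turn negative for large $|A|$), shows the covariance this resampling scheme implies. Here $\pi_0 h_2$ denotes the elementwise product of the null bin probabilities with $h_2$ evaluated at the bin centers. Conditioning on A:

```latex
\mathrm{cov}(y) = E\{\mathrm{diag}(u)\} + \mathrm{cov}(u)
  = \mathrm{diag}(N\pi_0) + \alpha^2\, w w'
  = \mathrm{diag}(N\pi_0) + \frac{(N\alpha)^2}{2}\,(\pi_0 h_2)(\pi_0 h_2)' .
```

With $\pi_{0k} \approx \Delta\,\varphi(x_k)$ and $\varphi^{(2)}(x) = h_2(x)\varphi(x)$, the second term is approximately $(N\Delta\alpha)^2 \phi^{(2)}\phi^{(2)\prime}/2$, the rms form of the correlation penalty. At α = 0 only the Poisson term remains.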


References

Efron, B. (2007a). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102: 93–103.

Efron, B. (2007b). Size, power and false discovery rates. Ann. Statist. 35: 1351–1377.

Efron, B. (2009). Correlated z-values and the accuracy of large-scale statistical estimates. J. Amer. Statist. Assoc. To appear (http://stat.stanford.edu/~brad/papers).

Golub, T. R., Slonim, D. K. and Tamayo, P. et al. (1999). Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286: 531–537 [the leukemia data].


Lancaster, H. O. (1958). The structure of bivariate distributions. Ann. Math. Statist. 29: 719–736.

Owen, A. B. (2005). Variance of the number of false discoveries. J. R. Stat. Soc. Ser. B Stat. Methodol. 67: 411–426.
