Transcript of Section V

Page 1: Section V

Section V

Sample Size and Power

Multiple hypothesis testing

Page 2: Section V

Sample size (n) based on estimation precision (CI width)

Can plan sample size so that standard errors (SEs) and the corresponding confidence intervals are sufficiently small. (How small? How small do you need it to be?)

Page 3: Section V

Sample size for precision/CIs

π = proportion with TB in the population (prevalence)
p = proportion with TB in a sample of size n, SE(p) = √[π(1−π)/n]
95% confidence interval for π: p ± 1.96(SE)

Precision: want to estimate the true prevalence (π) within ±6%.

Solve for n: 1.96(SE) = 1.96√[π(1−π)/n] = 0.06, so n = 1.96² π(1−π)/(0.06)² = 3.84 π(1−π)/0.0036

Can estimate π using the observed p, or use the maximum at π = 0.5.
If π = 0.15, n = 3.84(0.15)(0.85)/0.0036 = 136
At π = 0.50, n = 3.84(0.50)(0.50)/0.0036 = 267 (worst case)

Rule of thumb: for a 95% CI for π with precision w, a conservative n is n = 1/w².
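A minimal sketch of this precision-based calculation, assuming only the formula above (the 6% half-width and the π values come from the slide; the function name is illustrative):

```python
import math

def n_for_proportion_ci(half_width, pi=0.5, z=1.96):
    """Sample size so the 95% CI for a proportion is pi +/- half_width.
    pi=0.5 gives the conservative (worst-case) answer."""
    return math.ceil(z**2 * pi * (1 - pi) / half_width**2)

print(n_for_proportion_ci(0.06, pi=0.15))  # 137 (the slide rounds 3.84*0.1275/0.0036 to 136)
print(n_for_proportion_ci(0.06, pi=0.50))  # 267 (worst case)
print(math.ceil(1 / 0.06**2))              # 278, the n = 1/w^2 rule of thumb
```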

Page 4: Section V

Sample size based on hypothesis testing

Page 5: Section V

Hypothesis test decision table

                                 No difference in population    Actual difference
                                 (null is true)                 (null is false)
Test: Do not reject null         1 − α (correct)                β (Type II error)
Test: Reject null hypothesis     α (Type I error)               1 − β = power (correct)

Page 6: Section V

Determinants of power

Power (1−β) depends on:
δ = delta = true difference
σ = sigma = true SD or true variation
α = alpha = significance criterion
n = sample size
(Or, n depends on δ, σ, α, 1−β)

Page 7: Section V

Alpha versus Power

[Figure: two sampling distributions of the test statistic.]

The top distribution shows the sampling distribution of a test statistic under the assumption that delta (δ) is zero (the null hypothesis is true); its tail area beyond the critical value is α.

The bottom distribution shows the true population distribution (unknown at the time of testing), with a true population δ = 2.5; its area beyond the critical value is the power, 1 − β.

Page 8: Section V

Power calculation

Z_power = Z_obs − Z_(1−α/2) = (δ/SE) − Z_(1−α/2)

Treatment     n    Mean HbA1c change    SD     SE
Liraglutide   5    −1.24                0.99   0.44
Sitagliptin   4    −0.90                0.98   0.49
Difference         0.34 = δ                    √[0.44² + 0.49²] = 0.66

Z_obs = 0.34/0.66 = 0.516 (< 1.96, so not statistically significant, p = 0.622)

Z_power = Z_obs − Z_(1−α/2) = 0.516 − 1.96 = −1.44. From the Gaussian table, Z_power = −1.44 yields power of about 7%.
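A short sketch of the same calculation under the normal approximation (scipy is used only for the Gaussian CDF; the function name is illustrative):

```python
from scipy.stats import norm

def post_hoc_power(delta, se, alpha=0.05):
    """Power = Phi(|delta|/SE - z_{1-alpha/2}), the slide's Z_power approach."""
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(delta) / se - z_crit)

se_diff = (0.44**2 + 0.49**2) ** 0.5            # 0.66, SE of the difference
print(round(se_diff, 2))
print(round(post_hoc_power(0.34, se_diff), 3))  # about 0.07, i.e. ~7% power
```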

Page 9: Section V

Interpretation of power

If the test is "statistically" significant (p < α), we have a "positive" or "significant" outcome (and accept the false positive probability of α).

If the test is not statistically significant (p > α), either there is no relationship ("negative" outcome) or the sample size is inadequate (inconclusive outcome).

If power is low for a given δ, results are inconclusive, not negative.

If power is high, results are affirmatively negative. (But it is better to quote the confidence interval after the study is published.)

Page 10: Section V

Sample size to test the difference between 2 means (this is NOT a universal formula)

Two independent groups, each with sample size
n = 2(Z_power + Z_(1−α/2))² (σ/δ)²

Z_0.975 = 1.96 and Z_power = 0.842 (for power of 80%), so

n = 2(0.842 + 1.96)² (σ/δ)² = 15.7 (σ/δ)²

or n per group ≈ (range/δ)²

(since 15.7 ≈ 16, 16(σ/δ)² = (4σ/δ)², and the range ≈ 4σ)
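A minimal sketch of this per-group formula (scipy supplies the normal quantiles; the function name is illustrative and, as the slide warns, the formula is not universal):

```python
from math import ceil
from scipy.stats import norm

def n_per_group_two_means(delta, sigma, alpha=0.05, power=0.80):
    """n = 2 * (z_power + z_{1-alpha/2})^2 * (sigma/delta)^2 per group."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_power = norm.ppf(power)           # 0.842 for 80% power
    return ceil(2 * (z_power + z_alpha) ** 2 * (sigma / delta) ** 2)

print(n_per_group_two_means(delta=0.5, sigma=1.0))
# 63, matching the 80%-power, delta/sigma = 0.50 entry in the table on Page 14
```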

Page 11: Section V

Power for increasing delta

[Figure: sampling distributions for δ = 0, δ = 2.5, and δ = 3.5 with a vertical line at the critical value.]

Areas under the curves to the right of the vertical line are α for the black curve (δ = 0) and power for the other curves. The power is larger for the red curve than for the blue.

Page 12: Section V

Power Summary

Power increases as:
• True difference (δ) increases
• Sample size (n) increases
• α increases (less strict significance criterion)
• Patient heterogeneity (σ) decreases

Generally, we set α = 0.05 and power = 1 − β = 0.80. To determine n, we need to estimate δ and σ; often we use values of δ/σ for the calculation.

For time-to-event outcomes (survival), n also depends on follow-up time, since "n" is the number of events. The sample size for comparing two survival curves is often computed based on comparing the corresponding two hazards.

Page 13: Section V

Sample Size Checklist

Effect size (δ) = smallest clinically important difference (n increases as δ decreases)

Variability = patient heterogeneity = group SD (n increases as variability increases)

Power = probability of detecting the effect, set at 80% or higher (n increases with power)

α level = probability of rejecting when δ = 0, usually set at 0.05, two-sided (n decreases with larger α)

*** for time-to-event (survival) outcomes ***
Time of comparison = how long it takes to achieve the effect (n decreases with time)
Follow-up time = time each patient is followed (n reduced if patients are followed longer). In survival, "n" is the number who have the outcome.

Also consider:
percentage who will agree to participate
patient accrual rate, dropout / loss rate

Page 14: Section V

Sample size for selected δ/σ – 2 means
(difference between 2 means = δ, SD = σ, two-sided α = 0.05)

δ/σ     70% power   80% power   90% power
0.10    1,234       1,570       2,102
0.15    549         698         934
0.20    309         392         525
0.25    198         251         336
0.50    49          63          84
0.75    22          28          37
1.00    12          16          21
1.25    8           10          13
1.50    5           7           9

Page 15: Section V

Sample size for comparing two proportions
(80% power, α = 0.05)

                        Difference |P1 − P2| = δ
Smaller of P1 & P2      0.05    0.10    0.15    0.20
0.05                    434     140     71      45
0.10                    685     199     99      62
0.15                    904     250     120     72
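A hedged sketch of a standard normal-approximation formula for comparing two proportions; it reproduces the first entries of the table closely, though other entries may differ by a few subjects depending on the exact approximation used to build the table (the function name is illustrative):

```python
from math import sqrt
from scipy.stats import norm

def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for detecting p1 vs p2."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = z_a * sqrt(2 * p_bar * (1 - p_bar)) + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return (num / (p1 - p2)) ** 2

print(round(n_per_group_two_proportions(0.05, 0.10)))  # ~434, the first cell of the table
print(round(n_per_group_two_proportions(0.10, 0.15)))  # ~686 (table shows 685)
```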

Page 16: Section V

Hypothesis testing limitations

Page 17: Section V

Pseudo replication

Most variation is between persons, not within person. Two blood samples on n = 10 persons is not a sample size of 20.

Observed value = true population mean + between-person variation (σp) + within-person variation (σe)

Example: To estimate the mean
1. Compute a mean for each person using her "m" observations per person.
2. Compute the group mean from the "n" person means.

SEM = √[σp²/n + σe²/(nm)], usually σe < σp
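A small sketch of that SEM formula, assuming the variance components are known (the numbers below are made up for illustration):

```python
from math import sqrt

def sem_repeated_measures(sigma_p, sigma_e, n, m):
    """SEM for n persons with m observations each:
    sqrt(sigma_p^2 / n + sigma_e^2 / (n * m))."""
    return sqrt(sigma_p**2 / n + sigma_e**2 / (n * m))

# Doubling observations per person (m) barely helps when sigma_p dominates:
print(round(sem_repeated_measures(sigma_p=2.0, sigma_e=1.0, n=10, m=1), 3))  # 0.707
print(round(sem_repeated_measures(sigma_p=2.0, sigma_e=1.0, n=10, m=2), 3))  # 0.671
print(round(sem_repeated_measures(sigma_p=2.0, sigma_e=1.0, n=20, m=1), 3))  # 0.500 (doubling n helps more)
```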

Page 18: Section V

Statistical vs Medical "significance"

Average drop in weight (kg) after 3 months

Diet   Mean Drop   p        95% CI
I      0.50        <0.001   (0.45, 0.55)
II     10.0        0.16     (−5.0, 25.0)

("A difference, in order to be a difference, must make a difference" – Gertrude Stein?)

Page 19: Section V

p value limitations (ASA)

1. p values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone, ignoring the model.
2. Conclusions should not be based only on whether a p value passes a specific threshold.
3. Proper inference requires full reporting and transparency.
4. A p value does not measure the size of an effect or the importance of a result.
5. A p value alone does not provide a good measure of evidence regarding a model or hypothesis.

Page 20: Section V

R. A. Fisher on p values
Statistical Methods and Scientific Inference, Hafner, New York, ed. 1, 1956

"The concept that the scientific worker can regard himself as an inert item in a vast co-operative concern working according to accepted rules, is encouraged by directing attention away from his duty to form correct scientific conclusions, … and by stressing his supposed duty to mechanically make a succession of automatic 'decisions'. … The idea that this responsibility can be delegated to a giant computer programmed with Decision Functions belongs to a phantasy of circles, rather remote from scientific research." [pp. 104–105]

Page 21: Section V

Multiple Hypothesis testing

Multiple efficacy endpoints/outcomes

Multiple safety endpoints/outcomes

Multiple treatment arms and/or doses

Multiple interim analyses

Multiple patient subgroups

Multiple analyses

Page 22: Section V
Page 23: Section V

Exploratory vs confirmatory: protein example

Protein name                                               Atrial fib   Atherosclerosis   p value
RAS guanyl-releasing protein 2                             33.3%        0.0%              0.0000
Glutathione S-transferase P                                38.9%        100.0%            0.0000
Selenium-binding protein 1                                 22.2%        0.0%              0.0000
Nucleosome assembly protein 1-like 4                       16.7%        0.0%              0.0000
Integrin beta; Integrin beta-2                             11.1%        50.0%             0.0000
Spectrin alpha chain, non-erythrocytic 1                   11.1%        0.0%              0.0000
Pituitary tumor-transforming gene 1 protein-interacting    11.1%        0.0%              0.0000
WW domain-binding protein 2                                16.7%        50.0%             0.0000
Syntaxin-4                                                 5.6%         0.0%              0.0006
CD9 antigen                                                27.8%        50.0%             0.0013
ATP synthase-coupling factor 6, mitochondrial              27.8%        50.0%             0.0013
Flotillin-1                                                77.8%        100.0%            0.0037
Aconitate hydratase, mitochondrial                         38.9%        50.0%             0.1142
Fructose-bisphosphate aldolase C                           94.4%        100.0%            0.4402
Alpha-adducin                                              50.0%        50.0%             1.0000
40S ribosomal protein SA                                   1.0%         1.0%              1.0000
Abl interactor 1                                           1.0%         1.0%              1.0000
Bone marrow proteoglycan; Eosinophil granule major basic   1.0%         1.0%              1.0000
Tubulin alpha-4A chain                                     100.0%       100.0%            1.0000
… (750 proteins total)

750 proteins are compared between two groups – 12 are significant at p < 0.05

Page 24: Section V

Exploratory vs confirmatory: Who killed Tweety Bird?

Page 25: Section V

Did Sylvester do it?

Page 26: Section V

Motivation (class discussion)

Tweety Bird is murdered by a cat who left a DNA sample. The particular DNA profile found in the sample is known to occur in one of every one million cats. There is also about a 0.01% false positive rate for this test.

Is the level of evidence (guilt) equal in these two scenarios?

1. Only Sylvester is tested, and he is a match.
2. A DNA database of 100,000 cats (but not all cats), including Sylvester, is searched, and Sylvester is a match, although not necessarily the only match. There is no prior belief that Sylvester is guilty.

Page 27: Section V

Motivation (class discussion)

The "disease score" ranges from 2 (good) to 12 (worst).

Scenario A: Due to prior suspicion (prior information), only patients 19 and 47 are measured, and both have scores of 12. We report that they are "significantly" ill.

Scenario B: The score is measured on 72 patients. Only patients 19 and 47 have scores of 12. We report that they are "significantly" ill.

Page 28: Section V

Is the amount of "evidence" or "belief" that patients 19 and 47 "really" are very ill (have a "true" score of 12) the same in both scenarios? The data for patients 19 and 47 are the same in both scenarios.

Most would agree that, if both patients were retested (confirmation step) and came out with lower scores, this would decrease the belief that their "true" score is 12. If they came out with 12 again, this would increase the belief that the true score is 12.

Page 29: Section V

Multiple testing

"If you torture the data long enough, it will eventually confess."

Two different situations for a new arthritis treatment compared to aspirin:

A. Only pain (0-10) and swelling (0-10) are measured. Both are significantly better at p < 0.05 on the new treatment compared to aspirin.

B. Ten different outcomes are measured: pain, swelling, activities of daily living, quality of life, sleep, walking, bending, lifting, grinding, climbing. Only the two that are significant are reported after all 10 are evaluated. (Fraud?)

Confirmatory studies specify outcomes in advance. It is misleading to report only statistically significant results.

Page 30: Section V

How to really lie with stats for fun and profit

1. Bet on the horse after you know who won (movie: The Sting)

2. Send financial advice after you know how the market did (example in class)

Page 31: Section V

Multiple Testing

Assume all results are reported, not just the significant ones. Out of m independent tests at α = 0.05, the probability that at least one is significant by chance alone, even when all null hypotheses are true (assumes independence), is 1 − (1 − 0.05)^m:

# tests = m    Probability of rejecting at least one
1              0.0500
2              0.0975
3              0.1426
4              0.1855
5              0.2262
10             0.4013
20             0.6415
25             0.7226
50             0.9231
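A one-liner reproducing this table under the independence assumption (illustrative only):

```python
alpha = 0.05
for m in (1, 2, 3, 4, 5, 10, 20, 25, 50):
    # P(at least one false positive) = 1 - (1 - alpha)^m when all m nulls are true
    print(m, round(1 - (1 - alpha) ** m, 4))
```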

Page 32: Section V

Multiple testing - What to do?

Option 1: Use the nominal alpha level for significance. Creates too many false positives.

Option 2: Use the Bonferroni criterion - declare significance if p < α/m when "m" tests are made. Has too many false negatives.

Option 3: Use the Holm/Hochberg criterion - a compromise.

Page 33: Section V

Holm/Hochberg criterion

Rule for m (not necessarily independent) significance tests. Keeps the overall false positive rate at α.

1) Sort the "m" p values from lowest to highest.
2) Declare the ith ordered p value significant if it is less than α/(m+1−i). If p > α/(m+1−i), this and all larger p values are declared non-significant.

Overall type I error rate (FWER) is ≤ α. FWER = family-wise error rate.
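A small sketch of the step-down rule exactly as stated above (the function name is illustrative; libraries such as statsmodels also provide these corrections):

```python
def holm_significant(p_values, alpha=0.05):
    """Step-down rule from the slide: compare the ith smallest p value to
    alpha/(m+1-i); stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    significant = [False] * m
    for i, k in enumerate(order, start=1):
        if p_values[k] < alpha / (m + 1 - i):
            significant[k] = True
        else:
            break  # this and all larger p values are declared non-significant
    return significant

print(holm_significant([0.007, 0.011, 0.014, 0.044, 0.049]))
# [True, True, True, False, False] under the step-down rule,
# using the p values from the worked example a few slides ahead
```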

Page 34: Section V

Holm/Hochberg example for m = 5, α = 0.05

i   p value         α/(6−i)   0.05/(6−i)
1   p1 (smallest)   α/5       0.0100
2   p2              α/4       0.0125
3   p3              α/3       0.0167
4   p4              α/2       0.0250
5   p5 (largest)    α         0.0500

Page 35: Section V

No adjustment vs Hochberg vs Bonferroni

[Figure: significance criterion plotted against i for the no-adjustment, Bonferroni, and Hochberg rules, with the p values overlaid; m = 5, α = 0.05; the significance criterion axis runs from 0 to 0.06.]

Page 36: Section V

m = 5, α = 0.05

i   No adjustment   Bonferroni   Hochberg   p value
1   0.05            0.01         0.0100     0.007
2   0.05            0.01         0.0125     0.011
3   0.05            0.01         0.0167     0.014
4   0.05            0.01         0.0250     0.044
5   0.05            0.01         0.0500     0.049

Page 37: Section V

FWER vs FDR

If a "family" of "m" hypothesis tests is carried out, the family-wise error rate (FWER) is the chance of any "false positive" (type I error), assuming that the null is true for all m tests.

Rather than control the FWER, it may be preferable to control the proportion of "positive" tests (not all tests) that are false positives. This is called controlling the false discovery rate (FDR), a less stringent criterion.

For FDR, the ith ordered p value must be less than (i/m)α, which is larger than α/(m+1−i) for FWER.
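A brief sketch of the FDR (Benjamini-Hochberg) criterion described above: find the largest i with p_(i) ≤ (i/m)α and declare that p value and all smaller ones significant (the function name is illustrative):

```python
def fdr_significant(p_values, alpha=0.05):
    """Benjamini-Hochberg: find the largest i with p_(i) <= (i/m)*alpha and
    declare that p value and all smaller p values significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    cutoff = 0
    for i, k in enumerate(order, start=1):
        if p_values[k] <= i * alpha / m:
            cutoff = i
    significant = [False] * m
    for i, k in enumerate(order, start=1):
        if i <= cutoff:
            significant[k] = True
    return significant

print(fdr_significant([0.007, 0.011, 0.014, 0.044, 0.049]))
# all five significant here, since the largest p (0.049) is below (5/5)*0.05
```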

Page 38: Section V

FDR vs FWER
Errors committed when testing "m" null hypotheses

                     Declared non-sig   Declared sig   Total
Truth: Null true     U                  V              m0
Truth: Null false    T                  S              m − m0
Total                m − R              R              m

FWER = P(V ≥ 1); FDR = E[V/R] (the average of V/R). FDR is more liberal.

Page 39: Section V

FWER vs FDR significance criteria
m = 5 hypotheses, 5 p values, α = 0.05

p value          FDR criterion     FWER criterion
p1 (smallest)    (1/5)α = 0.01     α/5 = 0.01
p2               (2/5)α = 0.02     α/4 = 0.0125
p3               (3/5)α = 0.03     α/3 = 0.0167
p4               (4/5)α = 0.04     α/2 = 0.025
p5 (largest)     α = 0.05          α = 0.05

Page 40: Section V

Multiple testing & primary outcomes

As "m", the number of outcomes, increases, the individual αi for each outcome must be smaller, so n must be larger if the overall α is to stay constant (i.e., at α = 0.05).

But not all outcomes are equally important. Designate the important outcomes "primary" and the rest secondary, so m is only the number of primary outcomes. This assumes less concern if there is a false positive finding among the secondary outcomes.

Primary vs secondary outcomes must be designated in advance, before the study results are known. It is not fair to declare which outcomes are primary and which are secondary based on their p values.

Page 41: Section V

Statistical Analysis Plan

Statistical models and methods to answer the study questions.
Conclusions = data + models (assumptions). Each specific aim needs a statistical analysis section.
Sample size and power follow the analysis plan.

Outline:
• Outcomes: denote primary & secondary
• Primary predictors or comparison groups
• Covariates/confounders/effect modifiers
• Methods for missing data, dropouts
• Interim analyses (for efficacy, for safety)

Page 42: Section V

Common Methods

Univariate analysis
Continuous outcome: Means, SDs, medians
Time to event: Survival curves
Discrete: Proportions

Multivariate analysis
Continuous outcome: Linear regression, correlation
Positive integers: Poisson regression
Binary (yes/no): Logistic regression
Time to event: Proportional hazards regression

ANOVA and the t-test are special cases of linear regression.
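A hedged sketch of how this outcome-type-to-model mapping might look in code, using simulated data (the variable names are invented for illustration; a proportional hazards fit would use statsmodels' PHReg or a survival package and is omitted here):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "group": rng.integers(0, 2, n),   # primary predictor (0/1)
    "age": rng.normal(60, 10, n),     # covariate
})
df["hba1c_chg"] = -0.5 * df["group"] + rng.normal(0, 1, n)  # continuous outcome
df["event"] = rng.binomial(1, 0.3, n)                       # binary outcome
df["visits"] = rng.poisson(2, n)                            # count outcome

# Continuous outcome -> linear regression (t-test and ANOVA are special cases)
print(smf.ols("hba1c_chg ~ group + age", data=df).fit().params)

# Binary outcome -> logistic regression
print(smf.logit("event ~ group + age", data=df).fit(disp=0).params)

# Count outcome -> Poisson regression
print(smf.poisson("visits ~ group + age", data=df).fit(disp=0).params)
```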