# Notes on Maxwell & Delaney PSYCH 710 Multiple Comparisons


• Multiple comparisons inflate Type I error rate

• Generally want to control family-wise Type I error rate by adjusting the per-comparison Type I error rate

• if C = 100 (orthogonal) comparisons and αFW = .05, then αPC ≈ .000513

Bennett, PJ PSY710 Chapter 5

Notes on Maxwell & Delaney

5 Chapter 5 - Multiple Comparisons of Means

5.1 Inflation of Type I Error Rate

When conducting a statistical test, we typically set α = .05 or α = .01 so that the probability of making a Type I error is .05 or .01. Suppose, however, we conduct 100 such tests. Further suppose that the null hypothesis for each test is, in fact, true. Although α for each individual test may be .05, the probability of making at least one Type I error across the entire set of 100 tests is much greater than .05. Why? Because the more tests we do, the greater the chance of making an error. More precisely, the probability of making at least one Type I error is

P(at least one Type I error) = αFW = 1 − (1 − αPC)^C    (1)

where C is the number of tests performed¹. αPC is the per-comparison Type I error rate; it represents the probability of making a Type I error for each test. αFW, on the other hand, is the familywise Type I error rate, and it represents the probability of making a Type I error across the entire family, or set, of tests. For the one-way designs we are considering, αFW also equals the α level for the entire experiment, or the experimentwise error rate (αEW). For the current example, α = .05, C = 100, and so αFW = 0.994079, which means that it is very likely that we would make at least one Type I error in our set of 100 tests. Here, in a nutshell, is the problem of conducting multiple tests of group means: the probability of making a Type I error increases with the number of tests. If the number of tests, C, is large, then it becomes very likely that we will make a Type I error.

When we are conducting multiple tests on group means, we generally want to minimize Type I errors across the entire experiment, and so we need some way of maintaining αEW at some reasonably low level (e.g., .05). One obvious way of controlling αEW is to rearrange Equation 1 to calculate the αPC that is required for a given αFW and C:

αPC = 1 − (1 − αFW)^(1/C)    (2)

According to Equation 2, when C = 100 and we want αFW = .05, we must set αPC to .0005128.
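The arithmetic in Equations 1 and 2 is easy to verify numerically. A quick sketch (Python is used here for illustration; the notes themselves use R) reproduces both of the values quoted above:

```python
# Equation 1: familywise error rate across C independent tests.
# Equation 2: per-comparison alpha needed to achieve a target alpha_FW.

def alpha_fw(alpha_pc, C):
    """P(at least one Type I error) across C independent tests."""
    return 1 - (1 - alpha_pc) ** C

def alpha_pc(alpha_fw_target, C):
    """Per-comparison alpha that yields the target familywise alpha."""
    return 1 - (1 - alpha_fw_target) ** (1 / C)

print(round(alpha_fw(0.05, 100), 6))   # 0.994079
print(round(alpha_pc(0.05, 100), 7))   # 0.0005128
```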

5.2 Planned vs. Post-hoc Comparisons

Suppose I conduct an experiment that compares the scores of subjects randomly assigned to eight different groups. After inspecting the data, shown in Figure 1, I decide to compare the means of groups 4 and 7 because the difference between those groups looks fairly large. I can do the test two ways. First, I can simply compare the groups using a t test assuming equal group variances. In the following commands, notice how I use the subset command to extract the data from the two groups, and then use R’s function t.test. I will set α = .05. The null hypothesis is that the group means are equal; the alternative is that they differ.

> levels(g)

[1] "g1" "g2" "g3" "g4" "g5" "g6" "g7" "g8"

¹Technically, Equation 1 is correct only if the tests form an orthogonal set and the sample sizes for each group are large.


• False Discovery Rate: Q = (# of false H0 rejections) / (total # of H0 rejections)

• When all H0 are true, controlling αFW and FDR are equivalent

• When some H0 are false, FDR-based methods are more powerful


• Holm’s Sequential Bonferroni Test

• Benjamini & Hochberg’s (1995) Linear Step-Up Procedure (FDR)


- per-comparison alpha αPC = αFW/C

- Bonferroni method guarantees αFW ≤ C x αPC

• the inequality holds for any set of comparisons, but is tight only for orthogonal comparisons/contrasts

• when contrasts are not orthogonal, the true familywise alpha will be less than the nominal value
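A quick numerical check of the Bonferroni bound (a Python sketch for illustration; the notes use R): setting αPC = αFW/C keeps the true familywise rate at or below the nominal level, and for independent tests the exact rate falls slightly below it:

```python
# Bonferroni: with alpha_pc = alpha_fw / C, the familywise rate is bounded
# by C * alpha_pc; for independent tests the exact rate is 1-(1-alpha_pc)^C.
C, alpha_fw_nominal = 100, 0.05
alpha_pc = alpha_fw_nominal / C          # 0.0005
exact = 1 - (1 - alpha_pc) ** C          # true rate for independent tests
print(round(exact, 4))                   # 0.0488, just under the nominal 0.05
assert exact <= C * alpha_pc             # the Bonferroni bound holds
```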


• C comparisons yielding C statistics (t’s, F’s, etc.)

- rank order absolute value of statistics (high to low)

- evaluate 1st statistic with α = αFW/C

- if significant, evaluate next statistic with α = αFW/(C-1)

- repeat until test is not significant

• true familywise alpha is less than or equal to the nominal αFW

- however, generally more powerful than Bonferroni method
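The step-down procedure above can be sketched directly (Python is used for illustration; in R one would simply call p.adjust(p, method='holm')):

```python
# Holm step-down: test ordered p-values against alpha_fw/C, alpha_fw/(C-1), ...
# and stop at the first non-significant result.
def holm_reject(p_values, alpha_fw=0.05):
    """Return the number of hypotheses rejected by Holm's procedure."""
    p_sorted = sorted(p_values)
    C = len(p_sorted)
    rejected = 0
    for k, p in enumerate(p_sorted):       # k = 0, 1, ..., C-1
        if p < alpha_fw / (C - k):         # alpha_fw/C, then alpha_fw/(C-1), ...
            rejected += 1
        else:
            break                          # stop at the first failure
    return rejected

# Same p-values used in the p.adjust example later in these notes:
p = [.127, .08, .03, .032, .02, .001, .01, .005, .025]
print(holm_reject(p))   # 2 (only .001 and .005 survive)
```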


• C comparisons yield C p-values

• set False Discovery Rate (FDR) (often FDR = .05)

• rank order p-values from smallest (k=1) to largest (k=C)

• compare p-values to alpha = (k/C) x FDR

- find the largest k for which p(k) ≤ (k/C) x FDR, and reject H0 for comparisons 1 through k

• maintains the FDR at the nominal value (guaranteed for independent or positively dependent tests)

• generally more powerful than other methods
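Benjamini & Hochberg’s linear step-up rule can be sketched the same way (Python for illustration; R’s p.adjust with method='fdr' is demonstrated in the next code block of these notes):

```python
# BH linear step-up: find the largest k with p_(k) <= (k/C)*q and reject
# hypotheses 1..k.
def bh_reject(p_values, q=0.05):
    """Return the number of hypotheses rejected at false discovery rate q."""
    p_sorted = sorted(p_values)
    C = len(p_sorted)
    k_max = 0
    for k, p in enumerate(p_sorted, start=1):
        if p <= (k / C) * q:
            k_max = k                # remember the largest k passing the test
    return k_max

# Same p-values used in the p.adjust example below:
p = [.127, .08, .03, .032, .02, .001, .01, .005, .025]
print(bh_reject(p))   # 7
```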


> my.p.values <- c(.127,.08,.03,.032,.02,.001,.01,.005,.025)

> sort(my.p.values)

[1] 0.001 0.005 0.010 0.020 0.025 0.030 0.032 0.080 0.127

> p.adjust(sort(my.p.values),method='bonferroni')

[1] 0.009 0.045 0.090 0.180 0.225 0.270 0.288 0.720 1.000

> p.adjust(sort(my.p.values),method='holm')

[1] 0.009 0.040 0.070 0.120 0.125 0.125 0.125 0.160 0.160

> p.adjust(sort(my.p.values),method='fdr')

[1] 0.00900 0.02250 0.03000 0.04114 0.04114 0.04114 0.04114 0.09000 0.12700

With α (or FDR) set to .05, two tests remain significant after the Bonferroni and Holm adjustments, whereas seven remain significant after the FDR adjustment. N.B. Sorting the p-values beforehand is not required.


• Generally, αFW and FDR are set to 0.01 or 0.05

• a larger αFW may be justified for a small number of orthogonal comparisons

- Bonferroni & Holm tests reduce power too much

- perhaps set αPC to 0.05

- Type I errors will increase but Type II errors will decrease

- Note: we do this with factorial ANOVA already


• a 95% CI contains the true parameter (mean) 95% of the time

- close connection between CIs and p-values

- how do we extend this connection to the case of multiple comparisons?

• simultaneous 95% confidence intervals:

- all CIs in the set will contain the true parameters 95% of the time

- 2 or more simultaneous CIs are always wider than individual CIs



By a set of simultaneous 95% confidence intervals, we mean that all of the intervals in the set will contain the population mean 95% of the time. Sets of two or more simultaneous confidence intervals will always be wider than individual confidence intervals.

The R routine that I have written for this class, linear.comparison, computes adjusted confidence intervals. As an example, we’ll conduct four contrasts on the blood pressure data from Table 5.4 in your textbook.

> bp<-read.csv(url("http://psycserv.mcmaster.ca/bennett/psy710/datasets/maxwell_tab54.csv"))

> names(bp)

> bp.results <- linear.comparison(bp$bloodPressure,bp$group,blood.contrasts,var.equal=TRUE)

[1] "computing linear comparisons assuming equal variances among groups"

[1] "C 1: F=1.562, t=1.250, p=0.226, psi=5.833, CI=(-6.106,17.773), adj.CI= (-6.976,18.643)"

[1] "C 2: F=0.510, t=0.714, p=0.483, psi=3.333, CI=(-5.185,11.852), adj.CI= (-9.476,16.143)"

[1] "C 3: F=0.287, t=-0.536, p=0.598, psi=-2.500, CI=(-13.521,8.521), adj.CI= (-15.310,10.310)"

[1] "C 4: F=9.823, t=3.134, p=0.005, psi=11.944, CI=(5.620,18.269), adj.CI= (1.485,22.403)"

Notice that I divided the weights of the last comparison by 3; this division will be important later. Now let’s compare the confidence intervals and adjusted confidence intervals for each comparison. These are all 95% confidence intervals because α = .05. The confidence intervals in the variable confinterval are for the individual comparison. The confidence intervals in the variable adj.confint are adjusted, or simultaneous, confidence intervals. Note how the adjusted confidence intervals are always larger than the non-adjusted intervals. Also note that the adjusted confidence intervals for the first three comparisons include zero. What does this mean for the way we evaluate the null hypothesis for each comparison? If we evaluate the null hypothesis for each comparison using the unadjusted confidence interval, what is αFW? What is αFW if we use the adjusted confidence intervals?

The confidence intervals are for ψ, which is defined as

ψ = Σ aj μj = (1/3)(μ1 + μ2 + μ3) − (1)μ4    (3)

which tests the null hypothesis that the mean of group 4 is equal to the mean of the other group means. Now, a contrast like c = (1, 1, 1, −3) also compares group 4 to the other 3 groups, but the value of ψ is changed to

ψ = (1)(μ1 + μ2 + μ3) − (3)μ4    (4)

which is three times the previous value. It turns out that the F and t tests are not affected by this manipulation, but the value of ψ and the confidence intervals for ψ are changed:

> blood.contrasts <- list(c(1,-1,0,0), c(1,0,-1,0), c(0,1,-1,0), c(1,1,1,-3) );

> bp.results <- linear.comparison(bp$bloodPressure,bp$group,blood.contrasts,var.equal=TRUE)

[1] "computing linear comparisons assuming equal variances among groups"

[1] "C 1: F=1.562, t=1.250, p=0.226, psi=5.833, CI=(-6.106,17.773), adj.CI= (-6.976,18.643)"

[1] "C 2: F=0.510, t=0.714, p=0.483, psi=3.333, CI=(-5.185,11.852), adj.CI= (-9.476,16.143)"

[1] "C 3: F=0.287, t=-0.536, p=0.598, psi=-2.500, CI=(-13.521,8.521), adj.CI= (-15.310,10.310)"

[1] "C 4: F=9.823, t=3.134, p=0.005, psi=35.833, CI=(16.860,54.806), adj.CI= (4.456,67.210)"

> bp.results[[4]]$contrast

Comparisons of blood pressure data in Table 5.4 of textbook
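The effect of rescaling the contrast weights, visible in the two output listings (ψ for C 4 goes from 11.944 to 35.833 while t = 3.134 is unchanged), can be checked with a small sketch. Python is used here for illustration, and the group means, MSE, and n below are made-up values, not the blood-pressure data:

```python
import math

# psi = sum(a_j * mean_j); SE(psi) = sqrt(MSE * sum(a_j^2 / n_j)).
# Multiplying all weights by a constant k scales psi and SE by k,
# so t = psi / SE is unchanged.
def contrast_t(weights, means, mse, n):
    psi = sum(a * m for a, m in zip(weights, means))
    se = math.sqrt(mse * sum(a * a / n for a in weights))
    return psi, psi / se

means = [92.0, 95.0, 96.0, 82.0]     # hypothetical group means
mse, n = 120.0, 10                   # hypothetical MSE and per-group n

psi1, t1 = contrast_t([1/3, 1/3, 1/3, -1], means, mse, n)
psi2, t2 = contrast_t([1, 1, 1, -3], means, mse, n)
print(round(psi2 / psi1, 6))         # 3.0: psi is tripled
print(abs(t1 - t2) < 1e-12)          # True: t is unchanged
```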


All pairwise tests (Tukey HSD)

• Evaluate all pairwise differences between groups

• More powerful than Bonferroni method (for between-subj designs)

• Tukey HSD

- NOT necessary to evaluate omnibus F prior to Tukey test

- assumes equal n per group & equal variances

- Tukey-Kramer is valid when sample sizes are unequal

- Dunnett’s T3 test is better with unequal n & unequal variances

- see Kirk (1995, pp. 146-150) for more details


> bp.results[[4]]$alpha

> bp.results[[4]]$confinterval

[1] 16.86 54.81

> bp.results[[4]]$adj.confint

[1] 4.456 67.210

In this case, the confidence interval is for a value that corresponds to three times the difference between µ4 and (1/3)(µ1 + µ2 + µ3). So, make sure that the confidence interval is for the value that you really care about.

One more thing. The p-values returned by linear.comparison are not adjusted for multiple comparisons. However, it is rather simple to compare them to an adjusted criterion. If we want to set αFW = .05, we can use the Bonferroni adjustment to calculate the correct value of α for each comparison: αPC = αFW/C = .05/3 = .0167. By default, the p-values are listed in the output of linear.comparison. Manually listing the unadjusted p-values for each comparison is easy:

> bp.results[[1]]$p.2tailed

5.4 All Pairwise Contrasts

Sometimes you will be interested only in pairwise comparisons between group means. If you want to test just a few pairs of means, it often is sufficient to do multiple t-tests (or to use linear contrasts to compare pairs of groups) and to use the Bonferroni adjustment (or Holm’s sequential method) to control αFW. However, when you want to do many pairwise tests, or you decide to do multiple pairwise tests after inspecting the data (i.e., post-hoc pairwise tests), then you should use TukeyHSD. (N.B. Tukey HSD is the same as Tukey WSD.) Tukey’s HSD (for Honestly Significant Difference) is the optimal procedure for doing all pairwise comparisons. In R, the Tukey test is invoked by calling TukeyHSD(my.aov), where my.aov is an object created by a call to aov. Here is an example, again using the data in Table 5.4:

> bp.aov<-aov(bloodPressure~group,data=bp)

> TukeyHSD(bp.aov)

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = bloodPressure ~ group, data = bp)

$group

diff lwr upr p adj

b-a -5.833 -18.90 7.231 0.6038

c-a -3.333 -16.40 9.731 0.8903

d-a -15.000 -28.06 -1.936 0.0209

c-b 2.500 -10.56 15.564 0.9493

d-b -9.167 -22.23 3.898 0.2345

d-c -11.667 -24.73 1.398 0.0906

The output is easier to read by calling TukeyHSD slightly differently:

> TukeyHSD(bp.aov,ordered=TRUE)

95% family-wise confidence level

Fit: aov(formula = bloodPressure ~ group, data = bp)

$group

diff lwr upr p adj

b-d 9.167 -3.898 22.23 0.2345

c-d 11.667 -1.398 24.73 0.0906

a-d 15.000 1.936 28.06 0.0209

c-b 2.500 -10.564 15.56 0.9493

a-b 5.833 -7.231 18.90 0.6038

a-c 3.333 -9.731 16.40 0.8903

The output is a list of pairwise comparisons. For each comparison, the output shows the difference, the confidence interval of the difference (95% by default), and a p-value that can be used to evaluate the null hypothesis that the group difference is zero. The confidence intervals and p-values are adjusted to take into account the multiple comparisons. Using TukeyHSD ensures that the familywise Type I error rate is controlled (.05, by default). The Tukey test assumes equal sample size in every group, and homogeneity of variance.

It is important to note that it is not necessary to obtain a significant omnibus F test before using the Tukey HSD procedure. In fact, requiring a significant omnibus test means that the actual α for the Tukey HSD test will be substantially lower than the nominal value. If you want to make all pairwise comparisons, then it is perfectly reasonable to skip the regular ANOVA and use the Tukey HSD procedure (Wilcox, 1987).

5.4.1 Modifications of Tukey HSD

The Tukey HSD procedure assumes equal sample sizes and constant variance across groups. If those assumptions are not valid, then the p values and confidence intervals calculated with the Tukey HSD procedure will be incorrect. When the constant variance assumption is valid but sample sizes are unequal, the Tukey-Kramer test is recommended. When the variances are heterogeneous and sample sizes are unequal, Dunnett’s T3 test is recommended. The following code shows how to use the DTK package in R to perform these tests on the blood pressure data used in the previous section:


> # install.packages("DTK") # download package and install on computer

> library("DTK") # load package into workspace

> TK.test(x=bp$bloodPressure,f=bp$group,a=0.05) # Tukey-Kramer test

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = x ~ f)

> DTK.test(x=bp$bloodPressure,f=bp$group,a=0.05) # Dunnett's T3 test

[[1]]

b-a -5.833 -26.95 15.2867

c-a -3.333 -18.40 11.7357

d-a -15.000 -29.60 -0.4001

c-b 2.500 -17.00 21.9961

d-b -9.167 -28.30 9.9692

d-c -11.667 -23.80 0.4659

> # following commands store result and then plot simultaneous confidence intervals:

> # tmp <- DTK.test(x=bp$bloodPressure,f=bp$group,a=0.05)

> # DTK.plot(tmp)

These tests, plus several others, are described in detail by Kirk (1995, pages 146-150).

5.5 Post-Hoc Comparisons

The final situation we will consider is the case where you want to perform post-hoc linear contrasts, some of which are not pairwise comparisons. In this situation, Tukey’s HSD procedure is not appropriate because not all of the comparisons are pairwise. Neither is the Bonferroni procedure, because the contrasts are post-hoc, not planned. Instead, we should use Scheffé’s method.

Scheffé’s method allows us to perform multiple, complex linear contrasts after looking at the data, while maintaining control of the Type I error rate. The method of computing the linear contrasts is exactly the same as the one used for planned linear contrasts. The only difference is that the observed Fcontrast is compared to FScheffé:

FScheffé = (a − 1) × FαFW(df1 = a − 1, df2 = N − a)    (5)

where a is the number of groups and N is the total number of subjects. Suppose we have 40 subjects (N = 40) divided among 4 groups (a = 4) and we want αFW to equal .05. The value of FScheffé is 3 × F.05(3, 36) ≈ 3 × 2.87 ≈ 8.60.
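The computation can be sketched as follows (Python for illustration; the critical value F.05(3, 36) ≈ 2.866 is taken from an F table as an assumption, since Python’s standard library has no F quantile function — in R it is qf(.95, 3, 36)):

```python
# Scheffe critical value: F_Scheffe = (a - 1) * F_alphaFW(a - 1, N - a).
a, N = 4, 40
F_crit = 2.866                      # F_.05(3, 36), from an F table (assumed)
F_scheffe = (a - 1) * F_crit
print(round(F_scheffe, 2))          # 8.6
```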


Performing a Single Comparison

• After plotting the data I decide to compare the means of groups 4 & 7 using a t-test:

[Figure 1: boxplots of scores for groups g1–g8; the y-axis spans roughly 60 to 140.]

> y.4<-subset(y,g=="g4") # get scores for group 4

> y.7<-subset(y,g=="g7") # get scores for group 7

> t.test(y.4,y.7,var.equal=TRUE) # do t-test assuming equal variances

Two Sample t-test

t = 4.165, df = 18, p-value = 0.0005813

alternative hypothesis: true difference in means is not equal to 0

sample estimates:

mean of x mean of y

111.33 83.41

The results are significant (t = 4.16, df = 18, p = 0.00058), so I reject the null hypothesis of no difference between groups 4 and 7.

In the second method, I will compare the groups using a linear contrast, again assuming equal variances across groups. One advantage of this method is that it uses all of the groups to derive an estimate of the population error variance, whereas t.test only uses data from the two groups being compared. Not only is the estimated error variance likely to be more accurate, but the test will have many more degrees of freedom in the denominator and therefore be more powerful. As before, α = .05 and the null hypothesis is that the groups are equal. (Note the double brackets that I use to read the results stored in c.4vs7.)

> lc.source<-url("http://psycserv.mcmaster.ca/bennett/psy710/Rscripts/linear_contrast_v2.R")

> source(lc.source)

[1] "computing linear comparisons assuming equal variances among groups"

[1] "C 1: F=9.915, t=3.149, p=0.002, psi=27.924, CI=(14.560,41.287), adj.CI= (10.245,45.602)"

> c.4vs7[[1]]$p.2tailed

[1] 0.002387

Again, the comparison between the two groups is significant (t = 3.1487, df = 72, p = .002387). Finally, for completeness, I will do the comparison using the lm() command:

> newG <- g; # copy grouping factor

> contrasts(newG) <- c(0,0,0,1,0,0,-1,0) # link contrast weights with newG

> newG.lm.01 <- lm(y~newG)
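The degrees-of-freedom advantage of the contrast over the two-group t test (df = 72 vs. df = 18 in the outputs above) follows directly from the design. A quick sketch in Python, assuming 10 subjects per group (consistent with the df values shown):

```python
# Pooled two-group t-test vs. linear contrast with a pooled error term:
# the contrast borrows its error estimate from all groups.
a, n = 8, 10                  # groups and (assumed) subjects per group
df_ttest = 2 * n - 2          # error df using only the two groups compared
df_contrast = a * n - a       # error df = N - a, using all groups
print(df_ttest, df_contrast)  # 18 72
```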


What was wrong with the preceding analyses? Answer: I performed the analyses after inspecting the data, choosing to compare groups 4 & 7 because they looked different. That practice, obviously, inflates the Type I error rate.


• Comparisons after looking at data are post-hoc

• Scheffe method is preferred for post-hoc linear contrasts

- compute contrast with normal procedures

- evaluate observed F with new critical value:

FScheffe = (a-1) x FαFW (df1= a-1; df2 = N-a)

a = number of groups; N = total number of subjects

- FScheffe is the “normal” critical omnibus F multiplied by (a−1)

• Scheffe method and omnibus F test are mutually consistent


The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time

Andrew Gelman† and Eric Loken‡

14 Nov 2013

“I thought of a labyrinth of labyrinths, of one sinuous spreading labyrinth that would encompass the past and the future . . . I felt myself to be, for an unknown period of time, an abstract perceiver of the world.” — Borges (1941)

Abstract

Researcher degrees of freedom can lead to a multiple comparisons problem, even in settings where researchers perform only a single analysis on their data. The problem is that there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p-values. We discuss this in the context of several examples of published papers where data-analysis decisions were theoretically motivated based on previous literature, but where the details of data selection and analysis were not pre-specified and, as a result, were contingent on data.

1. Multiple comparisons doesn’t have to feel like fishing

1.1. Background

There is a growing realization that statistically significant claims in scientific publications are routinely mistaken. A dataset can be analyzed in so many different ways (with the choices being not just what statistical test to perform but also decisions on what data to exclude or include, what measures to study, what interactions to consider, etc.), that very little information is provided by the statement that a study came up with a p < .05 result. The short version is that it’s easy to find a p < .05 comparison even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough and subsequently coming up with good stories (plausible even to themselves, as well as to their colleagues and peer reviewers) to back up any statistically-significant comparisons they happen to come up with.

This problem is sometimes called “p-hacking” or “researcher degrees of freedom” (Simmons, Nelson, and Simonsohn, 2011). In a recent article, we spoke of “fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant” (Gelman, 2013a).

But we are starting to feel that the term “fishing” was unfortunate, in that it invokes an image of a researcher trying out comparison after comparison, throwing the line into the lake repeatedly until a fish is snagged. We have no reason to think that researchers regularly do that. We think the real story is that researchers can perform a reasonable analysis given their assumptions and their data, but had the data turned out differently, they could have done other analyses that were just as reasonable in those circumstances.

We thank Ed Vul, Howard Wainer, Macartan Humphreys, and E. J. Wagenmakers for helpful comments and the National Science Foundation for partial support of this work.

† Department of Statistics, Columbia University, New York

‡ Department of Human Development and Family Studies, Penn State University

The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research

hypothesis was posited ahead of time

Andrew Gelman† and Eric Loken‡

14 Nov 2013


The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time

Andrew Gelman† and Eric Loken‡

14 Nov 2013

“I thought of a labyrinth of labyrinths, of one sinuous spreading labyrinth that would encompass the past and the future . . . I felt myself to be, for an unknown period of time, an abstract perceiver of the world.” — Borges (1941)

Abstract

Researcher degrees of freedom can lead to a multiple comparisons problem, even in settings where researchers perform only a single analysis on their data. The problem is that there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p-values. We discuss this in the context of several examples of published papers where data-analysis decisions were theoretically motivated based on previous literature, but where the details of data selection and analysis were not pre-specified and, as a result, were contingent on data.

1. Multiple comparisons doesn’t have to feel like fishing

1.1. Background

There is a growing realization that statistically significant claims in scientific publications are routinely mistaken. A dataset can be analyzed in so many different ways (with the choices being not just what statistical test to perform but also decisions on what data to include or exclude, what measures to study, what interactions to consider, etc.), that very little information is provided by the statement that a study came up with a p < .05 result. The short version is that it’s easy to find a p < .05 comparison even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough and subsequently coming up with good stories (plausible even to themselves, as well as to their colleagues and peer reviewers) to back up any statistically-significant comparisons they happen to come up with.
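The "look hard enough" arithmetic is just the familywise error formula from the notes above, αFW = 1 − (1 − αPC)^C. A quick simulation sketch (plain Python, using a z test and illustrative sample sizes of our own choosing, not anything from the paper) confirms it:

```python
import math
import random

random.seed(1)

def null_comparison_p(n=20):
    """Two-sided p value for a z test comparing two samples drawn from the same null."""
    m1 = sum(random.gauss(0, 1) for _ in range(n)) / n
    m2 = sum(random.gauss(0, 1) for _ in range(n)) / n
    z = (m1 - m2) / math.sqrt(2 / n)          # sd of a difference of two means is sqrt(2/n)
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided normal-tail p value

# 100 independent comparisons per "study", all null hypotheses true
n_sims, n_tests = 500, 100
hits = sum(min(null_comparison_p() for _ in range(n_tests)) < 0.05
           for _ in range(n_sims))
rate = hits / n_sims
print(rate)   # close to 1 - (1 - 0.05)**100 ≈ 0.994
```

With 100 honest tests of true nulls, almost every simulated study yields at least one p < .05 "finding."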

This problem is sometimes called “p-hacking” or “researcher degrees of freedom” (Simmons, Nelson, and Simonsohn, 2011). In a recent article, we spoke of “fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant” (Gelman, 2013a).

But we are starting to feel that the term “fishing” was unfortunate, in that it invokes an image of a researcher trying out comparison after comparison, throwing the line into the lake repeatedly until a fish is snagged. We have no reason to think that researchers regularly do that. We think the real story is that researchers can perform a reasonable analysis given their assumptions and their data, but had the data turned out differently, they could have done other analyses that were just as reasonable in those circumstances.

We thank Ed Vul, Howard Wainer, Macartan Humphreys, and E. J. Wagenmakers for helpful comments and the National Science Foundation for partial support of this work.
† Department of Statistics, Columbia University, New York
‡ Department of Human Development and Family Studies, Penn State University


With regard to some of the papers we discuss here, the researchers never respond that they had chosen all the details of their data processing and data analysis ahead of time; rather, they claim that they picked only one analysis for the particular data they saw. Intuitive as this defense may seem, it does not address the fundamental frequentist concern of multiple comparisons.

We illustrate with a very simple hypothetical example. A researcher is interested in differences between Democrats and Republicans in how they perform in a short mathematics test when it is expressed in two different contexts, either involving health care or the military. The research hypothesis is that context matters, and one would expect Democrats to do better in the health-care context and Republicans in the military context. Party identification is measured on a standard 7-point scale, and various demographic information is also available. At this point there is a huge number of possible comparisons that can be performed—all consistent with the data. For example, the pattern could be found (with statistical significance) among men and not among women—explicable under the theory that men are more ideological than women. Or the pattern could be found among women but not among men—explicable under the theory that women are more sensitive to context, compared to men. Or the pattern could be statistically significant for neither group, but the difference could be significant (still fitting the theory, as described above). Or the effect might only appear among men who are being asked the questions by female interviewers. We might see a difference between sexes in the health-care context but not the military context; this would make sense given that health care is currently a highly politically salient issue and the military is not. There are degrees of freedom in the classification of respondents into Democrats and Republicans from a 7-point scale. And how are independents and nonpartisans handled? They could be excluded entirely. Or perhaps the key pattern is between partisans and nonpartisans? And so on. In our notation above, a single overarching research hypothesis—in this case, the idea that issue context interacts with political partisanship to affect mathematical problem-solving skills—corresponds to many different possible choices of the decision variable φ.

At one level, these multiplicities are obvious. And it would take a highly unscrupulous researcher to perform test after test in a search for statistical significance (which could almost certainly be found at the 0.05 or even the 0.01 level, given all the options above and the many more that would be possible in a real study). We are not suggesting that researchers generally do such a search. What we are suggesting is that, given a particular data set, it is not so difficult to look at the data and construct completely reasonable rules for data exclusion, coding, and data analysis that can lead to statistical significance—thus, the researcher needs only perform one test, but that test is conditional on the data; hence, T(y; φ(y)) as described in item #4 above. As Humphreys, Sanchez, and Windt (2013) write, a researcher when faced with multiple reasonable measures can reason (perhaps correctly) that the one that produces a significant result is more likely to be the least noisy measure, but then decide (incorrectly) to draw inferences based on that one only.
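The forking-paths mechanism can be sketched in a few lines of code. In this hypothetical simulation (our own illustrative setup, not the authors'), the analyst runs exactly one test per dataset, but chooses which subgroup to test only after seeing the data; the Type I error rate comes out roughly double the nominal 5%:

```python
import math
import random

random.seed(2)

def z_p(mean_diff, n):
    """Two-sided p for a difference of two means of n standard-normal samples."""
    z = mean_diff / math.sqrt(2 / n)
    return math.erfc(abs(z) / math.sqrt(2))

def one_study(n=25):
    # Null data: no real effect in either subgroup ("men" / "women")
    diffs = {}
    for subgroup in ("men", "women"):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        diffs[subgroup] = sum(a) / n - sum(b) / n
    # Data-contingent choice: test only the subgroup showing the larger raw difference
    chosen = max(diffs, key=lambda g: abs(diffs[g]))
    return z_p(diffs[chosen], n)   # exactly one test is performed

sims = 2000
rate = sum(one_study() < 0.05 for _ in range(sims)) / sims
print(rate)   # noticeably above the nominal 0.05
```

Only one p-value is ever computed per dataset, yet because the choice of test depends on the data, the effective error rate is that of the implicit family of tests (here, 1 − 0.95² ≈ 0.0975).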

This is all happening in a context of small effect sizes, small sample sizes, large measurement errors, and high variation (which combine to give low power, hence less reliable results even when they happen to be statistically significant, as discussed by Button et al., 2013). Multiplicity would not be a huge problem in a setting of large real differences, large samples, small measurement errors, and low variation. This is the familiar Bayesian argument: any data-based claim is more plausible to the extent it is a priori more likely and less plausible to the extent that it is estimated with more error. That is the context; in the present paper, though, we focus on the very specific reasons that published p-values cannot be taken at face value, even if no explicit fishing has been done.
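The "less reliable even when significant" point can be made concrete with a small simulation (illustrative numbers of our own, not taken from Button et al.): with a small true effect and a small sample, the estimates that happen to cross p < .05 systematically overestimate the effect.

```python
import math
import random

random.seed(3)

true_effect, n, sims = 0.1, 20, 4000   # small effect + small sample = low power
sig_estimates = []
for _ in range(sims):
    treat = [random.gauss(true_effect, 1) for _ in range(n)]
    ctrl = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(treat) / n - sum(ctrl) / n
    z = diff / math.sqrt(2 / n)
    if math.erfc(abs(z) / math.sqrt(2)) < 0.05:   # "statistically significant"
        sig_estimates.append(diff)

# How much do the published (significant) estimates exaggerate the truth?
exaggeration = (sum(sig_estimates) / len(sig_estimates)) / true_effect
print(round(exaggeration, 1))   # significant results overestimate the true effect severalfold
```

Conditioning on significance filters for lucky noise: only estimates several standard errors from zero survive, so the average reported effect is a large multiple of the true 0.1.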


“Researcher degrees of freedom”:
• Should more data be collected? [sequential analysis]
• Should some observations be excluded?
• Which conditions should be combined and which ones compared?
• Which control variables should be considered?
• Should we combine or transform our independent variables?

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant

Joseph P. Simmons1, Leif D. Nelson2, and Uri Simonsohn1
1The Wharton School, University of Pennsylvania, and 2Haas School of Business, University of California, Berkeley

Psychological Science 22(11) 1359–1366 © The Author(s) 2011. Reprints and permission: sagepub.com/journalsPermissions.nav. DOI: 10.1177/0956797611417632. http://pss.sagepub.com

General Article

Abstract
In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.

Keywords: methodology, motivated reasoning, publication, disclosure

Received 3/17/11; Revision accepted 5/23/11

Our job as scientists is to discover truths about the world. We generate hypotheses, collect data, and examine whether or not the data are consistent with those hypotheses. Although we aspire to always be accurate, errors are inevitable.

Perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis. First, once they appear in the literature, false positives are particularly persistent. Because null results have many possible causes, failures to replicate previous findings are never conclusive. Furthermore, because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them. Second, false positives waste resources: They inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a field known for publishing false positives risks losing its credibility.

In this article, we show that despite the nominal endorsement of a maximum false-positive rate of 5% (i.e., p ≤ .05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis.

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.

This exploratory behavior is not the by-product of malicious intent, but rather the result of two factors: (a) ambiguity in how best to make these decisions and (b) the researcher’s desire to find a statistically significant result. A large literature documents that people are self-serving in their interpretation

Corresponding Authors:
Joseph P. Simmons, The Wharton School, University of Pennsylvania, 551 Jon M. Huntsman Hall, 3730 Walnut St., Philadelphia, PA 19104. E-mail: [email protected]
Leif D. Nelson, Haas School of Business, University of California, Berkeley, Berkeley, CA 94720-1900. E-mail: [email protected]
Uri Simonsohn, The Wharton School, University of Pennsylvania, 548 Jon M. Huntsman Hall, 3730 Walnut St., Philadelphia, PA 19104. E-mail: [email protected]
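The kind of flexibility Simmons et al. describe can be quantified with a short simulation. This sketch is our own illustrative setup (z tests, n = 20 per condition, two dependent variables correlated at ρ = .5, a null effect throughout): the researcher may report either DV or their average, whichever "works."

```python
import math
import random

random.seed(4)

def p_from_diff(diff, sd, n):
    """Two-sided p for a difference of two group means, each based on n observations."""
    z = diff / (sd * math.sqrt(2 / n))
    return math.erfc(abs(z) / math.sqrt(2))

def one_study(n=20, rho=0.5):
    def group():
        rows = []
        for _ in range(n):
            shared = random.gauss(0, 1)   # source of the DV1-DV2 correlation
            dv1 = math.sqrt(rho) * shared + math.sqrt(1 - rho) * random.gauss(0, 1)
            dv2 = math.sqrt(rho) * shared + math.sqrt(1 - rho) * random.gauss(0, 1)
            rows.append((dv1, dv2))
        return rows
    a, b = group(), group()               # two conditions, no true effect
    ps = []
    for dv in (0, 1):                     # analyses 1 and 2: each DV on its own
        diff = sum(r[dv] for r in a) / n - sum(r[dv] for r in b) / n
        ps.append(p_from_diff(diff, 1.0, n))
    diff_avg = (sum((r[0] + r[1]) / 2 for r in a)
                - sum((r[0] + r[1]) / 2 for r in b)) / n
    ps.append(p_from_diff(diff_avg, math.sqrt((1 + rho) / 2), n))  # analysis 3: the average
    return min(ps) < 0.05                 # report whichever analysis "worked"

sims = 2000
rate = sum(one_study() for _ in range(sims)) / sims
print(rate)   # noticeably above 0.05
```

Even this single degree of freedom nearly doubles the false-positive rate (cf. the ~9.5% Simmons et al. report for flexibility across two DVs); combining several such degrees of freedom compounds the inflation.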



Contradicting this intuition, Figure 1 shows the false-positive rates from additional simulations for a researcher who has already collected either 10 or 20 observations within each of two conditions, and then tests for significance every 1, 5, 10, or 20 per-condition observations after that. The researcher stops collecting data either once statistical significance is obtained or when the number of observations in each condition reaches 50.

Figure 1 shows that a researcher who starts with 10 observations per condition and then tests for significance after every new per-condition observation finds a significant effect 22% of the time. Figure 2 depicts an illustrative example continuing sampling until the number of per-condition observations reaches 70. It plots p values from t tests conducted after each pair of observations. The example shown in Figure 2 contradicts the often-espoused yet erroneous intuition that if an effect is significant with a small sample size then it would necessarily be significant with a larger one.
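This optional-stopping result is easy to reproduce qualitatively. The following sketch (using a z test with known variance rather than the t test in the article, so the numbers are only approximate) starts with 10 observations per condition, retests after every added pair, and stops at significance or at 50 per condition:

```python
import math
import random

random.seed(5)

def p_two_sample(a, b):
    """Two-sided z-test p value for two equal-sized samples with known sd = 1."""
    n = len(a)
    diff = sum(a) / n - sum(b) / n
    z = diff / math.sqrt(2 / n)
    return math.erfc(abs(z) / math.sqrt(2))

def peeking_study(start=10, cap=50):
    a = [random.gauss(0, 1) for _ in range(start)]   # both conditions are pure noise
    b = [random.gauss(0, 1) for _ in range(start)]
    while True:
        if p_two_sample(a, b) < 0.05:
            return True                   # stop collecting and declare significance
        if len(a) >= cap:
            return False                  # give up at the sample-size cap
        a.append(random.gauss(0, 1))      # add one observation per condition, retest
        b.append(random.gauss(0, 1))

sims = 2000
rate = sum(peeking_study() for _ in range(sims)) / sims
print(rate)   # far above 0.05, in the ballpark of the 22.1% in Figure 1
```

Each individual test is valid at the 5% level; it is the stop-when-significant rule that inflates the overall false-positive rate.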

Solution

As a solution to the flexibility-ambiguity problem, we offer six requirements for authors and four guidelines for reviewers (see Table 2). This solution substantially mitigates the problem but imposes only a minimal burden on authors, reviewers, and readers. Our solution leaves the right and responsibility of identifying the most appropriate way to conduct research in the hands of researchers, requiring only that authors provide appropriately transparent descriptions of their methods so that reviewers and readers can make informed decisions regarding the credibility of their findings. We assume that the vast majority of researchers strive for honesty; this solution will not help in the unusual case of willful deception.

Requirements for authors

We propose the following six requirements for authors.

1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article. Following this requirement may mean reporting the outcome of power calculations or disclosing arbitrary rules, such as “we decided to collect 100 observations” or “we decided to collect as many observations as we could before the end of the semester.” The rule itself is secondary, but it must be determined ex ante and be reported.

[Fig. 1. Likelihood of obtaining a false-positive result when data collection ends upon obtaining significance (p ≤ .05, highlighted by the dotted line). The figure depicts likelihoods for two minimum sample sizes (n = 10 and n = 20), as a function of the number of additional per-condition observations collected before performing another t test. The highest rate shown is 22.1%.]

[Fig. 2. Illustrative simulation of p values obtained by a researcher who continuously adds an observation to each of two conditions (sample sizes 11 to 70 per condition), conducting a t test after each addition. The dotted line highlights the conventional significance criterion of p ≤ .05.]

Table 2. Simple Solution to the Problem of False-Positive Publications

Requirements for authors
1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.
3. Authors must list all variables collected in a study.
4. Authors must report all experimental conditions, including failed manipulations.
5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.
6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

Guidelines for reviewers
1. Reviewers should ensure that authors follow the requirements.
2. Reviewers should be more tolerant of imperfections in results.
3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.

Effect of adding observations on false positives.

26

Psychological Science 22(11) 1359 –1366 © The Author(s) 2011 Reprints and permission: sagepub.com/journalsPermissions.nav DOI: 10.1177/0956797611417632 http://pss.sagepub.com

Our job as scientists is to discover truths about the world. We generate hypotheses, collect data, and examine whether or not the data are consistent with those hypotheses. Although we aspire to always be accurate, errors are inevitable.

Perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis. First, once they appear in the literature, false positives are particularly persistent. Because null results have many possible causes, failures to replicate previous findings are never conclusive. Furthermore, because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them. Second, false positives waste resources: They inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a field known for publishing false positives risks losing its credibility.

In this article, we show that despite the nominal endorsement of a maximum false-positive rate of 5% (i.e., p < .05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis.

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared?

Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.

This exploratory behavior is not the by-product of malicious intent, but rather the result of two factors: (a) ambiguity in how best to make these decisions and (b) the researcher’s desire to find a statistically significant result. A large literature documents that people are self-serving in their interpretation

Corresponding Authors: Joseph P. Simmons, The Wharton School, University of Pennsylvania, 551 Jon M. Huntsman Hall, 3730 Walnut St., Philadelphia, PA 19104 E-mail: [email protected]

Leif D. Nelson, Haas School of Business, University of California, Berkeley, Berkeley, CA 94720-1900 E-mail: [email protected]

Uri Simonsohn, The Wharton School, University of Pennsylvania, 548 Jon M. Huntsman Hall, 3730 Walnut St., Philadelphia, PA 19104 E-mail: [email protected]

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant

Joseph P. Simmons1, Leif D. Nelson2, and Uri Simonsohn1 1The Wharton School, University of Pennsylvania, and 2Haas School of Business, University of California, Berkeley

Abstract

In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.

Keywords methodology, motivated reasoning, publication, disclosure

Received 3/17/11; Revision accepted 5/23/11

General Article

1362 Simmons et al.

Contradicting this intuition, Figure 1 shows the false-positive rates from additional simulations for a researcher who has already collected either 10 or 20 observations within each of two conditions, and then tests for significance every 1, 5, 10, or 20 per-condition observations after that. The researcher stops collecting data either once statistical significance is obtained or when the number of observations in each condition reaches 50.

Figure 1 shows that a researcher who starts with 10 observations per condition and then tests for significance after every new per-condition observation finds a significant effect 22% of the time. Figure 2 depicts an illustrative example continuing sampling until the number of per-condition observations reaches 70. It plots p values from t tests conducted after each pair of observations. The example shown in Figure 2 contradicts the often-espoused yet erroneous intuition that if an effect is significant with a small sample size then it would necessarily be significant with a larger one.

Solution

As a solution to the flexibility-ambiguity problem, we offer six requirements for authors and four guidelines for reviewers (see Table 2). This solution substantially mitigates the problem but imposes only a minimal burden on authors, reviewers, and readers. Our solution leaves the right and responsibility of identifying the most appropriate way to conduct research in the hands of researchers, requiring only that authors provide appropriately transparent descriptions of their methods so that reviewers and readers can make informed decisions regarding the credibility of their findings. We assume that the vast majority of researchers strive for honesty; this solution will not help in the unusual case of willful deception.

Requirements for authors

We propose the following six requirements for authors.

1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article. Following this requirement may mean reporting the outcome of power calculations or disclosing arbitrary rules, such as “we decided to collect 100 observations” or “we decided to collect as many observations as we could before the end of the semester.” The rule itself is secondary, but it must be determined ex ante and be reported.

Fig. 1. Likelihood of obtaining a false-positive result when data collection ends upon obtaining significance (p < .05, highlighted by the dotted line). The figure depicts likelihoods for two minimum sample sizes (n = 10 and n = 20), as a function of the number of additional per-condition observations collected before performing another t test; the largest false-positive rate shown is 22.1%.

Fig. 2. Illustrative simulation of p values obtained by a researcher who continuously adds an observation to each of two conditions, conducting a t test after each addition. The x-axis is sample size (number of observations in each of two conditions, from 11 to 70); the y-axis is the p value (.00 to .70). The dotted line highlights the conventional significance criterion of p < .05.

Table 2. Simple Solution to the Problem of False-Positive Publications

Requirements for authors
1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.
3. Authors must list all variables collected in a study.
4. Authors must report all experimental conditions, including failed manipulations.
5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.
6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

Guidelines for reviewers
1. Reviewers should ensure that authors follow the requirements.
2. Reviewers should be more tolerant of imperfections in results.
3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.

Effect of stopping too soon when statistical significance is achieved.

27


28

“Statistical significance is perhaps the least important attribute of a good experiment; it is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.”

- DT Lykken, Statistical Significance in Psychological Research, Psychological Bulletin, 1968

“it’s easy to find a p < .05 comparison even if nothing is going on, if you look hard enough…”

- Gelman & Loken, The garden of forking paths, 2013

29

• Multiple comparisons inflate Type I error rate

• Generally want to control family-wise Type I error rate by adjusting the per-comparison Type I error rate

• if C = 100 (orthogonal) comparisons and αFW = .05, then αPC=.00051

Bennett, PJ PSY710 Chapter 5

Notes on Maxwell & Delaney

5 Chapter 5 - Multiple Comparisons of Means

5.1 Inflation of Type I Error Rate

When conducting a statistical test, we typically set α = .05 or α = .01 so that the probability of making a Type I error is .05 or .01. Suppose, however, we conduct 100 such tests. Further suppose that the null hypothesis for each test is, in fact, true. Although α for each individual test may be .05, the probability of making at least one Type I error across the entire set of 100 tests is much greater than .05. Why? Because the more tests we do, the greater the chances of making an error. More precisely, the probability of making at least one Type I error is

P(at least one Type I error) = αFW = 1 − (1 − αPC)^C    (1)

where C is the number of tests performed¹. αPC is the per comparison Type I error rate; it represents the probability of making a Type I error for each test. αFW, on the other hand, is the familywise Type I error rate, and it represents the probability of making a Type I error across the entire family, or set, of tests. For the one-way designs we are considering, αFW also equals the α level for the entire experiment, or the experimentwise error rate (αEW). For the current example, α = .05, C = 100, and so αFW = 0.994079, which means that it is very likely that we would make at least one Type I error in our set of 100 tests. Here, in a nutshell, is the problem of conducting multiple tests of group means: the probability of making a Type I error increases with the number of tests. If the number of tests, C, is large, then it becomes very likely that we will make a Type I error.

When we are conducting multiple tests on group means, we generally want to minimize Type I errors across the entire experiment, and so we need some way of maintaining αEW at some reasonably low level (e.g., .05). One obvious way of controlling αEW is to rearrange Equation 1 to calculate the αPC that is required for a given αFW and C:

αPC = 1 − (1 − αFW)^(1/C)    (2)

According to Equation 2, when C = 100 and we want αFW = .05, we must set αPC to .0005128.
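The arithmetic in Equations 1 and 2 is easy to check numerically. These notes use R throughout; the sketch below is in Python purely as a language-neutral check, reproducing the αFW = 0.994079 and αPC = .0005128 values quoted above, along with the cruder Bonferroni value αPC = αFW/C that appears later in the slides.

```python
# Check Equations 1 and 2 from the notes.
# Eq. 1: alpha_FW = 1 - (1 - alpha_PC)^C  (C orthogonal tests, all H0 true)
# Eq. 2: alpha_PC = 1 - (1 - alpha_FW)^(1/C)  (rearrangement of Eq. 1)

def alpha_fw(alpha_pc_level, C):
    """Familywise Type I error rate for C orthogonal tests (Eq. 1)."""
    return 1 - (1 - alpha_pc_level) ** C

def alpha_pc(alpha_fw_target, C):
    """Per-comparison alpha needed to hit a target familywise alpha (Eq. 2)."""
    return 1 - (1 - alpha_fw_target) ** (1 / C)

print(round(alpha_fw(0.05, 100), 6))   # 0.994079, as quoted in Section 5.1
print(round(alpha_pc(0.05, 100), 7))   # 0.0005128, as quoted after Eq. 2
print(round(0.05 / 100, 7))            # 0.0005, the simpler Bonferroni value
```

Note how close the exact value (.0005128) is to the Bonferroni approximation (.0005): for small αFW the two criteria are nearly interchangeable.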

5.2 Planned vs. Post-hoc Comparisons

I will compare the group means with a linear contrast that assumes equal variance across all groups. Suppose I conduct an experiment that compares the scores of subjects randomly assigned to eight different groups. After inspecting the data, shown in Figure 1, I decide to compare the means of groups 4 and 7 because the difference between those groups looks fairly large. I can do the test two ways. First I can simply compare the groups using a t test assuming equal group variances. In the following commands, notice how I use the subset command to extract the data from the two groups, and then use R’s function t.test. I will set α = .05. The null hypothesis is that the groups are equal; the alternative is that they differ.

> levels(g)

[1] "g1" "g2" "g3" "g4" "g5" "g6" "g7" "g8"

¹Technically, Equation 1 is correct only if the tests form an orthogonal set and the sample sizes for each group are large.

1


2

- Q = (# of false H0 rejections) / (total # H0 rejections)

• When all H0 are true, controlling αFW and FDR are equivalent

• When some H0 are false, FDR-based methods are more powerful

3

• Holm’s Sequential Bonferroni Test

• Benjamini & Hochberg’s (1995) Linear Step-Up Procedure (FDR)

4

- per-comparison alpha αPC = αFW/C

- Bonferroni method guarantees αFW ≤ C x αPC

• Inequalities true only for orthogonal comparisons/contrasts

• when contrasts are not orthogonal, familywise alpha will be less than nominal value

5

• C comparisons yielding C statistics (t’s, F’s, etc.)

- rank order absolute value of statistics (high to low)

- evaluate 1st statistic with α = αFW/C

- if significant, evaluate next statistic with α = αFW/(C-1)

- repeat until test is not significant

• true familywise alpha is less-than-or-equal to nominal αFW

- however, generally more powerful than Bonferroni method
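The step-down recipe above can be sketched directly. In R this is what p.adjust(method='holm') implements; the Python sketch below is only an illustration of the logic, using the same nine p-values that appear in the p.adjust example later in these notes. Note that evaluating the k-th largest statistic at αFW/(C−k+1) is equivalent to evaluating the k-th smallest p-value at that level.

```python
# Sketch of Holm's sequential (step-down) Bonferroni procedure.
# Test the smallest p-value at alpha_FW/C, the next at alpha_FW/(C-1),
# and so on; stop at the first non-significant test.

def holm_reject(p_values, alpha_fw=0.05):
    """Return the set of indices (into p_values) rejected by Holm's test."""
    C = len(p_values)
    order = sorted(range(C), key=lambda i: p_values[i])  # smallest p first
    rejected = set()
    for k, i in enumerate(order):                        # k = 0, 1, ...
        if p_values[i] <= alpha_fw / (C - k):            # alpha/C, alpha/(C-1), ...
            rejected.add(i)
        else:
            break                                        # stop at first failure
    return rejected

ps = [0.127, 0.08, 0.03, 0.032, 0.02, 0.001, 0.01, 0.005, 0.025]
print(sorted(holm_reject(ps)))  # -> [5, 7]
```

Only the hypotheses with p = .001 and p = .005 are rejected, which matches the Holm-adjusted p-values that fall below .05 in the p.adjust output shown on a later slide.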

6

• C comparisons yield C p-values

• set False Discovery Rate (FDR) (often FDR = .05)

• rank order p-values from smallest (k=1) to largest (k=C)

• compare p-values to alpha = (k/C) x FDR

- reject H0 for all p < (k/C) x FDR

• guaranteed to maintain FDR at nominal value

• generally more powerful than other methods
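One subtlety in the step-up recipe above: Benjamini & Hochberg reject every hypothesis whose p-value rank is at or below the largest k satisfying p(k) ≤ (k/C) × FDR, even when an intermediate p-value exceeds its own threshold. In R this is p.adjust(method='fdr'); the Python sketch below is only an illustration of the logic, again using the nine p-values from the p.adjust example in these notes.

```python
# Sketch of the Benjamini & Hochberg (1995) linear step-up procedure.
# Find the LARGEST rank k with p(k) <= (k/C)*FDR, then reject the k
# hypotheses with the smallest p-values.

def bh_reject(p_values, fdr=0.05):
    """Return the set of indices (into p_values) rejected at the given FDR."""
    C = len(p_values)
    order = sorted(range(C), key=lambda i: p_values[i])   # ascending p
    k_max = 0
    for k, i in enumerate(order, start=1):                # k = 1..C
        if p_values[i] <= (k / C) * fdr:
            k_max = k                                     # keep the largest such k
    return set(order[:k_max])

ps = [0.127, 0.08, 0.03, 0.032, 0.02, 0.001, 0.01, 0.005, 0.025]
print(sorted(bh_reject(ps)))  # -> [2, 3, 4, 5, 6, 7, 8]
```

Seven of the nine hypotheses are rejected, versus only two with Holm's method, illustrating the power advantage of FDR control claimed in the bullet above.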

7

> my.p.values <- c(.127,.08,.03,.032,.02,.001,.01,.005,.025)
> sort(my.p.values)
[1] 0.001 0.005 0.010 0.020 0.025 0.030 0.032 0.080 0.127
> p.adjust(sort(my.p.values),method='bonferroni')
[1] 0.009 0.045 0.090 0.180 0.225 0.270 0.288 0.720 1.000
> p.adjust(sort(my.p.values),method='holm')
[1] 0.009 0.040 0.070 0.120 0.125 0.125 0.125 0.160 0.160
> p.adjust(sort(my.p.values),method='fdr')
[1] 0.009 0.0225 0.0300 0.04114 0.04114 0.04114 0.04114 0.090 0.127

Significant tests (alpha/FDR = .05) are highlighted in orange font. N.B. Sorting p-values is not required.
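The adjusted p-values printed by p.adjust above can be re-derived by hand. The sketch below (Python, purely as a language-neutral cross-check of the slide's numbers) reproduces the Bonferroni, Holm, and FDR rows to the precision shown.

```python
# Re-derive the adjusted p-values shown in the R p.adjust() example.
# Input p-values are assumed sorted ascending, as on the slide.
p = [0.001, 0.005, 0.01, 0.02, 0.025, 0.03, 0.032, 0.08, 0.127]
n = len(p)

bonf = [min(1.0, pi * n) for pi in p]            # Bonferroni: p*C, capped at 1

holm, running_max = [], 0.0                      # Holm: running max of p*(C-k+1)
for k, pi in enumerate(p):
    running_max = max(running_max, pi * (n - k))
    holm.append(min(1.0, running_max))

fdr, running_min = [0.0] * n, 1.0                # BH: running min (from the top) of p*C/k
for k in range(n - 1, -1, -1):
    running_min = min(running_min, p[k] * n / (k + 1))
    fdr[k] = running_min

print([round(x, 3) for x in bonf])  # matches the 'bonferroni' row above
print([round(x, 3) for x in holm])  # matches the 'holm' row above
print([round(x, 4) for x in fdr])   # matches the 'fdr' row (to 4 decimals)
```

Each adjusted p-value can then be compared directly to αFW (or the FDR) of .05, which is how the significant tests on the slide were identified.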

8

• Generally, αFW and FDR are set to 0.01 or 0.05

• larger αFW may be justified for small number of orthogonal comparisons

- Bonferroni & Holm tests reduce power too much

- perhaps set αPC to 0.05

- Type I error rate will increase, but Type II error rate will decrease

- Note: we do this with factorial ANOVA already

9

• 95% CI contains true parameter (mean) 95% of the time

- close connection between CI and p-values

- how to extend connection to case of multiple comparisons?

• simultaneous 95% confidence intervals:

- all CIs will contain true parameter 95% of the time

- 2 or more simultaneous CIs always larger than individual CIs

10

Bennett, PJ PSY710 Chapter 5

confidence intervals, we mean that all of the intervals in the set will contain the population mean 95% of the time. Sets of two or more simultaneous confidence intervals will always be larger than individual confidence intervals.

The R routine that I have written for this class, linear.comparison, computes adjusted confidence intervals. As an example, we’ll conduct three contrasts on the blood pressure data from Table 5.4 in your textbook.

> bp<-read.csv(url("http://psycserv.mcmaster.ca/bennett/psy710/datasets/maxwell_tab54.csv"))

> names(bp)

> bp.results <- linear.comparison(bp$bloodPressure,bp$group,blood.contrasts,var.equal=TRUE)

[1] "computing linear comparisons assuming equal variances among groups"

[1] "C 1: F=1.562, t=1.250, p=0.226, psi=5.833, CI=(-6.106,17.773), adj.CI= (-6.976,18.643)"

[1] "C 2: F=0.510, t=0.714, p=0.483, psi=3.333, CI=(-5.185,11.852), adj.CI= (-9.476,16.143)"

[1] "C 3: F=0.287, t=-0.536, p=0.598, psi=-2.500, CI=(-13.521,8.521), adj.CI= (-15.310,10.310)"

[1] "C 4: F=9.823, t=3.134, p=0.005, psi=11.944, CI=(5.620,18.269), adj.CI= (1.485,22.403)"

Notice that I divided the weights of the last comparison by 3; this division will be important later. Now let’s compare the confidence intervals and adjusted confidence intervals for each comparison. These are all 95% confidence intervals because α = .05. The confidence intervals in the variable confinterval are for the individual comparison. The confidence intervals in the variable adj.confint are adjusted, or simultaneous, confidence intervals. Note how the adjusted confidence intervals are always larger than the non-adjusted intervals. Also note that all of the adjusted confidence intervals include zero. What does this mean for the way we evaluate the null hypothesis for each comparison? If we evaluate the null hypothesis for each comparison using the unadjusted confidence interval, what is αFW? What is αFW if we use the adjusted confidence intervals?

The confidence intervals are for ψ, which is defined as

ψ = Σj aj µj = (1/3)(µ1 + µ2 + µ3) − (1)µ4

which tests the null hypothesis that the mean of group 4 is equal to the mean of the other group means. Now, a contrast like c = (1, 1, 1, −3) also compares group 4 to the other 3 groups, but the value of ψ is changed to

ψ = (1)(µ1 + µ2 + µ3) − (3)µ4

which is three times the previous value. It turns out that the F and t tests are not affected by this manipulation, but the value of ψ and the confidence intervals for ψ are changed:

> blood.contrasts <- list(c(1,-1,0,0), c(1,0,-1,0), c(0,1,-1,0), c(1,1,1,-3) );

> bp.results <- linear.comparison(bp$bloodPressure,bp$group,blood.contrasts,var.equal=TRUE)

[1] "computing linear comparisons assuming equal variances among groups"

[1] "C 1: F=1.562, t=1.250, p=0.226, psi=5.833, CI=(-6.106,17.773), adj.CI= (-6.976,18.643)"

[1] "C 2: F=0.510, t=0.714, p=0.483, psi=3.333, CI=(-5.185,11.852), adj.CI= (-9.476,16.143)"

[1] "C 3: F=0.287, t=-0.536, p=0.598, psi=-2.500, CI=(-13.521,8.521), adj.CI= (-15.310,10.310)"

[1] "C 4: F=9.823, t=3.134, p=0.005, psi=35.833, CI=(16.860,54.806), adj.CI= (4.456,67.210)"

> bp.results[[4]]$contrast

Comparisons of blood pressure data in Table 5.4 of textbook

11

All pairwise tests (Tukey HSD)

• Evaluate all pairwise differences between groups

• More powerful than Bonferroni method (for between-subj designs)

• Tukey HSD

- NOT necessary to evaluate omnibus F prior to Tukey test

- assumes equal n per group & equal variances

- Tukey-Kramer is valid when sample sizes are unequal

- Dunnett’s T3 test is better with unequal n & unequal variances

- see Kirk (1995, pp. 146-150) for more details

12

> bp.results[[4]]$alpha

> bp.results[[4]]$confinterval

[1] 16.86 54.81

> bp.results[[4]]$adj.confint

[1] 4.456 67.210

In this case, the confidence interval is for a value that corresponds to three times the difference between µ4 and (1/3)(µ1 + µ2 + µ3). So, make sure that the confidence interval is for the value that you really care about.

One more thing. The p-values returned by linear.comparison are not adjusted for multiple comparisons. However, it is rather simple to compare them to an adjusted criterion. If we want to set αFW = .05, we can use the Bonferroni adjustment to calculate the correct value of α for each comparison: αPC = αFW/C = .05/3 = .0167. By default, the p-values are listed in the output of linear.comparison. Manually listing the unadjusted p-values for each comparison is easy:

> bp.results[[1]]$p.2tailed

5.4 All Pairwise Contrasts

Sometimes you will be interested only in pairwise comparisons between group means. If you want to test just a few pairs of means, it often is sufficient to do multiple t-tests (or to use linear contrasts to compare pairs of groups) and to use the Bonferroni adjustment (or Holm’s sequential method) to control αFW. However, when you want to do many pairwise tests, or you decide to do multiple pairwise tests after inspecting the data (i.e., post-hoc pairwise tests), then you should use TukeyHSD. (N.B. Tukey HSD is the same as Tukey WSD). Tukey’s HSD (for Honestly Significant Difference) is the optimal procedure for doing all pairwise comparisons. In R, the Tukey test is invoked by calling TukeyHSD(my.aov), where my.aov is an object created by a call to aov. Here is an example, again using the data in Table 5.4:

> bp.aov <- aov(bloodPressure~group,data=bp)
> TukeyHSD(bp.aov)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = bloodPressure ~ group, data = bp)

$group

diff lwr upr p adj

b-a -5.833 -18.90 7.231 0.6038

c-a -3.333 -16.40 9.731 0.8903

d-a -15.000 -28.06 -1.936 0.0209

c-b 2.500 -10.56 15.564 0.9493

d-b -9.167 -22.23 3.898 0.2345

d-c -11.667 -24.73 1.398 0.0906

The output is easier to read by calling TukeyHSD slightly differently:

> TukeyHSD(bp.aov,ordered=TRUE)

95% family-wise confidence level

Fit: aov(formula = bloodPressure ~ group, data = bp)

$group

diff lwr upr p adj

b-d 9.167 -3.898 22.23 0.2345

c-d 11.667 -1.398 24.73 0.0906

a-d 15.000 1.936 28.06 0.0209

c-b 2.500 -10.564 15.56 0.9493

a-b 5.833 -7.231 18.90 0.6038

a-c 3.333 -9.731 16.40 0.8903

The output is a list of pairwise comparisons. For each comparison, the output shows the difference, the confidence interval of the difference (95% by default), and a p-value that can be used to evaluate the null hypothesis that the group difference is zero. The confidence intervals and p-values are adjusted to take into account the multiple comparisons. Using TukeyHSD ensures that the familywise Type I error rate is controlled (.05, by default). The Tukey test assumes equal sample size in every group, and homogeneity of variance.

It is important to note that it is not necessary to obtain a significant omnibus F test before using the Tukey HSD procedure. In fact, requiring a significant omnibus test means that the actual α for the Tukey HSD test will be lower than the nominal value. If you want to make all pairwise comparisons, then it is perfectly reasonable to skip the regular ANOVA and use the Tukey HSD procedure (Wilcox, 1987).

5.4.1 Modifications of Tukey HSD

The Tukey HSD procedure assumes equal sample sizes and constant variance across groups. If those assumptions are not valid, then the p-values and confidence intervals calculated with the Tukey HSD procedure will be incorrect. When the constant-variance assumption is valid but sample sizes are unequal, the Tukey-Kramer test is recommended. When the variances are heterogeneous and sample sizes are unequal, Dunnett’s T3 test is recommended. The following code shows how to use the DTK package in R to perform these tests on the blood pressure data used in the previous section:


> # install.packages("DTK") # download package and install on computer

> library("DTK") # load package into workspace

> TK.test(x=bp$bloodPressure,f=bp$group,a=0.05) # Tukey-Kramer test

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = x ~ f)

> DTK.test(x=bp$bloodPressure,f=bp$group,a=0.05) # Dunnett's T3 test

[[1]]
b-a  -5.833 -26.95 15.2867
c-a  -3.333 -18.40 11.7357
d-a -15.000 -29.60 -0.4001
c-b   2.500 -17.00 21.9961
d-b  -9.167 -28.30  9.9692
d-c -11.667 -23.80  0.4659

> # following commands store result and then plot simultaneous confidence intervals:

> # tmp <- DTK.test(x=bp$bloodPressure,f=bp$group,a=0.05)

> # DTK.plot(tmp)

These tests, plus several others, are described in detail by Kirk (1995, pages 146-150).
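When both the variances and the sample sizes are unequal, a pragmatic alternative to a dedicated package (my own sketch, not a procedure from the notes) is to run Welch two-sample t-tests on each pair and then control αFW with Holm’s sequential Bonferroni adjustment, mentioned earlier. The data below are made up for illustration:

```python
# Sketch (not from the notes): pairwise Welch t-tests with Holm's sequential
# Bonferroni adjustment, a pragmatic fallback when variances and n's are unequal.
# The data below are made up for illustration.
from itertools import combinations
from scipy.stats import ttest_ind

groups = {
    "a": [120, 125, 118, 130, 127, 122],
    "b": [112, 119, 117, 115, 121],          # unequal n is fine for Welch
    "c": [105, 100, 108, 111, 103, 99, 107],
}

pairs = list(combinations(groups, 2))
raw_p = [ttest_ind(groups[x], groups[y], equal_var=False).pvalue for x, y in pairs]

# Holm: sort the p-values; the k-th smallest is multiplied by (m - k + 1),
# enforcing monotonicity of the adjusted values along the way
m = len(raw_p)
order = sorted(range(m), key=lambda i: raw_p[i])
adj = [0.0] * m
running_max = 0.0
for rank, i in enumerate(order):
    running_max = max(running_max, (m - rank) * raw_p[i])
    adj[i] = min(1.0, running_max)

for (x, y), p, ap in zip(pairs, raw_p, adj):
    print(f"{x}-{y}: raw p = {p:.4f}, Holm-adjusted p = {ap:.4f}")
```

Rejecting whenever an adjusted p-value falls below .05 keeps the familywise error rate at or below .05, at the cost of some power relative to procedures built specifically for all-pairs comparisons.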

5.5 Post-Hoc Comparisons

The final situation we will consider is the case where you want to perform post-hoc linear contrasts, some of which are not pairwise comparisons. In this situation, Tukey’s HSD procedure is not appropriate because not all of the comparisons are pairwise. Neither is the Bonferroni procedure, because the contrasts are post-hoc, not planned. Instead, we should use Scheffé’s method.

Scheffé’s method allows us to perform multiple, complex linear contrasts after looking at the data, while maintaining control of the Type I error rate. The method of computing the linear contrasts is exactly the same as the one used for planned linear contrasts. The only difference is that the observed Fcontrast is compared to FScheffe:

FScheffe = (a − 1) × FαFW (df1 = a − 1; df2 = N − a)   (5)

where a is the number of groups and N is the total number of subjects. Suppose we have 40 subjects (N = 40) divided among 4 groups (a = 4) and we want αFW to equal .05. The value of FScheffe is (4 − 1) × F.05(3, 36) = 3 × 2.866 ≈ 8.60.
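This critical value is easy to compute numerically; a quick check in Python (SciPy’s F quantile function, with the same a, N, and αFW as the worked example above):

```python
# Sketch: the Scheffe critical value F_Scheffe = (a-1) * F_alphaFW(a-1, N-a),
# using the worked example in the text (a = 4 groups, N = 40, alpha_FW = .05).
from scipy.stats import f

a, N, alpha_fw = 4, 40, 0.05
f_crit = f.ppf(1 - alpha_fw, dfn=a - 1, dfd=N - a)  # ordinary critical F(3, 36)
f_scheffe = (a - 1) * f_crit
print(round(f_crit, 3), round(f_scheffe, 2))        # roughly 2.866 and 8.6
```

In R the same quantile is qf(0.95, 3, 36); any observed Fcontrast must exceed the Scheffé value, not the ordinary critical F, to be declared significant.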



Performing a Single Comparison

• After plotting the data, I decide to compare the means of groups 4 and 7 using a t-test:

[Figure: boxplots of scores (y-axis roughly 60–140) for groups g1–g8]

> y.4 <- subset(y, g=="g4")          # get scores for group 4
> y.7 <- subset(y, g=="g7")          # get scores for group 7
> t.test(y.4, y.7, var.equal=TRUE)   # do t-test assuming equal variances

        Two Sample t-test

t = 4.165, df = 18, p-value = 0.0005813
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 111.33 83.41

The results are significant (t = 4.16, df = 18, p = 0.00058), so I reject the null hypothesis of no difference between groups 4 and 7.

In the second method, I will compare the groups using a linear contrast, again assuming equal variances across groups. One advantage of this method is that it uses all of the groups to derive an estimate of the population error variance, whereas t.test only uses data from the two groups being compared. Not only is the estimated error variance likely to be more accurate, but the test will have many more degrees of freedom in the denominator and therefore be more powerful. As before, α = .05 and the null hypothesis is that the group means are equal. (Note the double brackets that I use to read the results stored in c.4vs7.)

> lc.source <- url("http://psycserv.mcmaster.ca/bennett/psy710/Rscripts/linear_contrast_v2.R")
> source(lc.source)
[1] "computing linear comparisons assuming equal variances among groups"
[1] "C 1: F=9.915, t=3.149, p=0.002, psi=27.924, CI=(14.560,41.287), adj.CI= (10.245,45.602)"

> c.4vs7[[1]]$p
[1] 0.002387

Again, the comparison between the two groups is significant (t = 3.1487, df = 72, p = .002387). Finally, for completeness, I will do the comparison using the lm() command:

> newG <- g                                # copy grouping factor
> contrasts(newG) <- c(0,0,0,1,0,0,-1,0)   # link contrast weights with newG
> newG.lm.01 <- lm(y ~ newG)
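The contrast arithmetic itself is easy to reproduce directly. Here is a sketch with synthetic data (not the course dataset): a group-4-vs-group-7 contrast tested against the pooled within-group variance from all eight groups, so the error df are N − a rather than the n4 + n7 − 2 used by t.test:

```python
# Sketch (synthetic data, not the course dataset): a linear contrast
# psi = mean4 - mean7 tested against the pooled error variance from all
# groups, giving error df = N - a (72 here) instead of 18.
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(7)
a, n = 8, 10                                   # 8 groups, 10 subjects each
data = [rng.normal(100, 15, size=n) for _ in range(a)]
data[3] = data[3] + 25                         # make group 4 differ from group 7

weights = np.array([0, 0, 0, 1, 0, 0, -1, 0])  # contrast: group 4 vs group 7
means = np.array([g.mean() for g in data])
N = a * n

# pooled within-group variance (MS_within), estimated from every group
ms_within = sum(((g - g.mean()) ** 2).sum() for g in data) / (N - a)

psi = weights @ means
se_psi = np.sqrt(ms_within * np.sum(weights ** 2 / n))
t_obs = psi / se_psi
df = N - a
p = 2 * t_dist.sf(abs(t_obs), df)
print(f"psi = {psi:.2f}, t({df}) = {t_obs:.3f}, p = {p:.4g}")
```

Squaring t_obs gives the equivalent Fcontrast with (1, N − a) degrees of freedom, which is the statistic printed by the course’s linear_contrast_v2.R script.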


Performing a Single Comparison

• Next I use a linear contrast, which uses all groups to derive an estimate of the error variance:



What was wrong with the preceding analyses? Answer: I performed the analyses after inspecting the data, choosing to compare groups 4 & 7 because they looked different, which, obviously, inflates the Type I error rate.


• Comparisons made after looking at the data are post-hoc

• Scheffé’s method is preferred for post-hoc linear contrasts

- compute the contrast with the normal procedures

- evaluate the observed F against a new critical value:

FScheffe = (a − 1) × FαFW (df1 = a − 1; df2 = N − a)

a = number of groups; N = total number of subjects

- FScheffe is the “normal” omnibus critical F × (a − 1)

• Scheffé’s method and the omnibus F test are mutually consistent


The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time

Andrew Gelman† and Eric Loken‡

14 Nov 2013

“I thought of a labyrinth of labyrinths, of one sinuous spreading labyrinth that would encompass the past and the future . . . I felt myself to be, for an unknown period of time, an abstract perceiver of the world.” — Borges (1941)

Abstract

Researcher degrees of freedom can lead to a multiple comparisons problem, even in settings where researchers perform only a single analysis on their data. The problem is there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p-values. We discuss in the context of several examples of published papers where data-analysis decisions were theoretically-motivated based on previous literature, but where the details of data selection and analysis were not pre-specified and, as a result, were contingent on data.

1. Multiple comparisons doesn’t have to feel like fishing

1.1. Background

There is a growing realization that statistically significant claims in scientific publications are routinely mistaken. A dataset can be analyzed in so many different ways (with the choices being not just what statistical test to perform but also decisions on what data to exclude or include, what measures to study, what interactions to consider, etc.), that very little information is provided by the statement that a study came up with a p < .05 result. The short version is that it’s easy to find a p < .05 comparison even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough and subsequently coming up with good stories (plausible even to themselves, as well as to their colleagues and peer reviewers) to back up any statistically-significant comparisons they happen to come up with.

This problem is sometimes called “p-hacking” or “researcher degrees of freedom” (Simmons, Nelson, and Simonsohn, 2011). In a recent article, we spoke of “fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant” (Gelman, 2013a).

But we are starting to feel that the term “fishing” was unfortunate, in that it invokes an image of a researcher trying out comparison after comparison, throwing the line into the lake repeatedly until a fish is snagged. We have no reason to think that researchers regularly do that. We think the real story is that researchers can perform a reasonable analysis given their assumptions and their data, but had the data turned out differently, they could have done other analyses that were just as reasonable in those circumstances.

We thank Ed Vul, Howard Wainer, Macartan Humphreys, and E. J. Wagenmakers for helpful comments and the National Science Foundation for partial support of this work.

† Department of Statistics, Columbia University, New York
‡ Department of Human Development and Family Studies, Penn State University


regards to some of the papers we discuss here, the researchers never respond that they had chosen all the details of their data processing and data analysis ahead of time; rather, they claim that they picked only one analysis for the particular data they saw. Intuitive as this defense may seem, it does not address the fundamental frequentist concern of multiple comparisons.

We illustrate with a very simple hypothetical example. A researcher is interested in differences between Democrats and Republicans in how they perform in a short mathematics test when it is expressed in two different contexts, either involving health care or the military. The research hypothesis is that context matters, and one would expect Democrats to do better in the health-care context and Republicans in the military context. Party identification is measured on a standard 7-point scale, and various demographic information is also available. At this point there is a huge number of possible comparisons that can be performed, all consistent with the data. For example, the pattern could be found (with statistical significance) among men and not among women, explicable under the theory that men are more ideological than women. Or the pattern could be found among women but not among men, explicable under the theory that women are more sensitive to context, compared to men. Or the pattern could be statistically significant for neither group, but the difference could be significant (still fitting the theory, as described above). Or the effect might only appear among men who are being asked the questions by female interviewers. We might see a difference between sexes in the health-care context but not the military context; this would make sense given that health care is currently a highly politically salient issue and the military is not. There are degrees of freedom in the classification of respondents into Democrats and Republicans from a 7-point scale. And how are independents and nonpartisans handled? They could be excluded entirely. Or perhaps the key pattern is between partisans and nonpartisans? And so on. In our notation above, a single overarching research hypothesis, in this case the idea that issue context interacts with political partisanship to affect mathematical problem-solving skills, corresponds to many different possible choices of the decision variable φ.

At one level, these multiplicities are obvious. And it would take a highly unscrupulous researcher to perform test after test in a search for statistical significance (which could almost certainly be found at the 0.05 or even the 0.01 level, given all the options above and the many more that would be possible in a real study). We are not suggesting that researchers generally do such a search. What we are suggesting is that, given a particular data set, it is not so difficult to look at the data and construct completely reasonable rules for data exclusion, coding, and data analysis that can lead to statistical significance. Thus, the researcher needs only perform one test, but that test is conditional on the data; hence, T(y; φ(y)) as described in item #4 above. As Humphreys, Sanchez, and Windt (2013) write, a researcher faced with multiple reasonable measures can reason (perhaps correctly) that the one that produces a significant result is more likely to be the least noisy measure, but then decide (incorrectly) to draw inferences based on that one only.

This is all happening in a context of small effect sizes, small sample sizes, large measurement errors, and high variation (which combine to give low power, hence less reliable results even when they happen to be statistically significant, as discussed by Button et al., 2013). Multiplicity would not be a huge problem in a setting of large real differences, large samples, small measurement errors, and low variation. This is the familiar Bayesian argument: any data-based claim is more plausible to the extent it is a priori more likely and less plausible to the extent that it is estimated with more error. That is the context; in the present paper, though, we focus on the very specific reasons that published p-values cannot be taken at face value, even if no explicit fishing has been done.
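The garden-of-forking-paths point can be made concrete with a small Monte Carlo sketch (hypothetical, not from the paper; the function name and parameters are illustrative). Even if the researcher ultimately reports only one test, choosing which of several plausible subgroup comparisons to report after seeing the data behaves statistically like running all of them:

```python
import random

def forking_paths_rate(n_experiments=10000, k_comparisons=5, z_crit=1.96, seed=1):
    """Estimate P(at least one 'significant' comparison) when the null is true.

    Each experiment yields k_comparisons independent test statistics drawn
    from a standard normal (i.e., the null holds for every comparison). The
    researcher reports whichever comparison crosses the two-sided z criterion.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        if any(abs(rng.gauss(0.0, 1.0)) > z_crit
               for _ in range(k_comparisons)):
            hits += 1
    return hits / n_experiments

rate = forking_paths_rate()
print(rate)  # close to 1 - 0.95**5 ≈ 0.226, far above the nominal .05
```

With only five data-dependent analysis paths, the effective Type I error rate is already more than four times the nominal 5%, even though just one test is "run" in the final write-up.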


24

“Researcher degrees of freedom”
• Should more data be collected? [sequential analysis]
• Should some observations be excluded?
• Which conditions should be combined and which ones compared?
• Which control variables should be considered?
• Should we combine or transform our independent variables?

General Article

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant

Joseph P. Simmons1, Leif D. Nelson2, and Uri Simonsohn1
1The Wharton School, University of Pennsylvania, and 2Haas School of Business, University of California, Berkeley

Psychological Science, 22(11), 1359–1366. © The Author(s) 2011. DOI: 10.1177/0956797611417632

Abstract
In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.

Keywords: methodology, motivated reasoning, publication, disclosure

Received 3/17/11; Revision accepted 5/23/11

Corresponding Authors: Joseph P. Simmons, The Wharton School, University of Pennsylvania, 551 Jon M. Huntsman Hall, 3730 Walnut St., Philadelphia, PA 19104. E-mail: [email protected]
Leif D. Nelson, Haas School of Business, University of California, Berkeley, Berkeley, CA 94720-1900. E-mail: [email protected]
Uri Simonsohn, The Wharton School, University of Pennsylvania, 548 Jon M. Huntsman Hall, 3730 Walnut St., Philadelphia, PA 19104. E-mail: [email protected]

Our job as scientists is to discover truths about the world. We generate hypotheses, collect data, and examine whether or not the data are consistent with those hypotheses. Although we aspire to always be accurate, errors are inevitable.

Perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis. First, once they appear in the literature, false positives are particularly persistent. Because null results have many possible causes, failures to replicate previous findings are never conclusive. Furthermore, because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them. Second, false positives waste resources: They inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a field known for publishing false positives risks losing its credibility.

In this article, we show that despite the nominal endorsement of a maximum false-positive rate of 5% (i.e., p ≤ .05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis.

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.

This exploratory behavior is not the by-product of malicious intent, but rather the result of two factors: (a) ambiguity in how best to make these decisions and (b) the researcher’s desire to find a statistically significant result. A large literature documents that people are self-serving in their interpretation […]
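The claim that the chance of at least one of many analyses producing a false positive "is necessarily greater than 5%" is just the familywise error formula from the notes above, αFW = 1 − (1 − αPC)^C for C independent tests. A quick numerical check (illustrative sketch; the function name is ours):

```python
def familywise_rate(alpha, c):
    """P(at least one Type I error) across c independent tests at level alpha."""
    return 1 - (1 - alpha) ** c

# Error inflation grows quickly with the number of tests.
for c in (1, 5, 20, 100):
    print(c, round(familywise_rate(0.05, c), 4))
# c = 100 gives ≈ .9941, matching the value computed in the notes.
```

Even 20 analytic alternatives push the familywise rate above 64%, which is why "trying things until something works" almost guarantees a significant result under the null.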


1362 Simmons et al.

Contradicting this intuition, Figure 1 shows the false-positive rates from additional simulations for a researcher who has already collected either 10 or 20 observations within each of two conditions, and then tests for significance every 1, 5, 10, or 20 per-condition observations after that. The researcher stops collecting data either once statistical significance is obtained or when the number of observations in each condition reaches 50.

Figure 1 shows that a researcher who starts with 10 observations per condition and then tests for significance after every new per-condition observation finds a significant effect 22% of the time. Figure 2 depicts an illustrative example continuing sampling until the number of per-condition observations reaches 70. It plots p values from t tests conducted after each pair of observations. The example shown in Figure 2 contradicts the often-espoused yet erroneous intuition that if an effect is significant with a small sample size then it would necessarily be significant with a larger one.
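The optional-stopping result is easy to reproduce in spirit. The sketch below is our own, not the paper's code: it uses a two-sided z test on known-variance normal data rather than the paper's t test, so the exact rate will differ somewhat from the reported 22.1%, but the inflation is the same phenomenon.

```python
import random

def optional_stopping_rate(n_min=10, n_max=50, step=1, n_experiments=2000,
                           z_crit=1.96, seed=7):
    """Estimate the false-positive rate for a researcher who starts with n_min
    observations per condition, runs a two-sided z test, and keeps adding
    `step` observations per condition (testing each time) until significance
    is reached or each condition has n_max observations. Both conditions are
    sampled from the same N(0, 1), so every 'significant' result is a false
    positive."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_min)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_min)]
        while True:
            n = len(a)
            # z statistic for a difference in means with known unit variance:
            # Var(mean_a - mean_b) = 1/n + 1/n = 2/n
            z = (sum(a) / n - sum(b) / n) / (2.0 / n) ** 0.5
            if abs(z) > z_crit:
                false_positives += 1
                break
            if n >= n_max:
                break
            a.extend(rng.gauss(0.0, 1.0) for _ in range(step))
            b.extend(rng.gauss(0.0, 1.0) for _ in range(step))
    return false_positives / n_experiments

rate = optional_stopping_rate()
print(rate)  # well above the nominal .05 of a single fixed-n test
```

Setting `n_min = n_max` recovers the honest single-test case, where the rate falls back to roughly .05; it is the repeated peeking, not the final sample size, that inflates the error rate.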

Solution
As a solution to the flexibility-ambiguity problem, we offer six requirements for authors and four guidelines for reviewers (see Table 2). This solution substantially mitigates the problem but imposes only a minimal burden on authors, reviewers, and readers. Our solution leaves the right and responsibility of identifying the most appropriate way to conduct research in the hands of researchers, requiring only that authors provide appropriately transparent descriptions of their methods so that reviewers and readers can make informed decisions regarding the credibility of their findings. We assume that the vast majority of researchers strive for honesty; this solution will not help in the unusual case of willful deception.

Requirements for authors
We propose the following six requirements for authors.

1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article. Following this requirement may mean reporting the outcome of power calculations or disclosing arbitrary rules, such as “we decided to collect 100 observations” or “we decided to collect as many observations as we could before the end of the semester.” The rule itself is secondary, but it must be determined ex ante and be reported.

[Fig. 1. Likelihood of obtaining a false-positive result when data collection ends upon obtaining significance (p < .05, highlighted by the dotted line). The y-axis is the percentage of false-positive results (peaking at 22.1%); the x-axis is the number of additional per-condition observations before performing another t test; likelihoods are shown for two minimum sample sizes, n = 10 and n = 20.]

[Fig. 2. Illustrative simulation of p values obtained by a researcher who continuously adds an observation to each of two conditions, conducting a t test after each addition. The y-axis is the p value (.00 to .70); the x-axis is sample size per condition (11 to 66). The dotted line highlights the conventional significance criterion of p < .05.]

Table 2. Simple Solution to the Problem of False-Positive Publications

Requirements for authors
1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.
3. Authors must list all variables collected in a study.
4. Authors must report all experimental conditions, including failed manipulations.
5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.
6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

Guidelines for reviewers
1. Reviewers should ensure that authors follow the requirements.
2. Reviewers should be more tolerant of imperfections in results.
3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.

Effect of adding observations on false positives.

26


Effect of stopping too soon when statistical significance is achieved.

27


28

“Statistical significance is perhaps the least important attribute of a good experiment; it is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.”

- DT Lykken, Statistical Significance in Psychological Research, Psychological Bulletin, 1968

“it’s easy to find a p < .05 comparison even if nothing is going on, if you look hard enough…”

- Gelman & Loken, The garden of forking paths, 2013

29