Notes on Maxwell & Delaney PSYCH 710 Multiple Comparisons

PSYCH 710, Week 7, Prof. Patrick Bennett

Multiple Comparisons of Group Means
• Multiple comparisons inflate Type I error rate
• Generally want to control family-wise Type I error rate by adjusting the per-comparison Type I error rate
• if C = 100 (orthogonal) comparisons and αFW = .05, then αPC=.00051
Notes on Maxwell & Delaney
Chapter 5 - Multiple Comparisons of Means
5.1 Inflation of Type I Error Rate
When conducting a statistical test, we typically set α = .05 or α = .01 so that the probability of making a Type I error is .05 or .01. Suppose, however, we conduct 100 such tests. Further suppose that the null hypothesis for each test is, in fact, true. Although α for each individual test may be .05, the probability of making at least one Type I error across the entire set of 100 tests is much greater than .05. Why? Because the more tests we do, the greater the chances of making an error. More precisely, the probability of making at least one Type I error is

P(at least one Type I error) = αFW = 1 - (1 - αPC)^C   (1)

where C is the number of tests performed¹. αPC is the per-comparison Type I error rate; it represents the probability of making a Type I error for each test. αFW, on the other hand, is the familywise Type I error rate, and it represents the probability of making a Type I error across the entire family, or set, of tests. For the one-way designs we are considering, αFW also equals the α level for the entire experiment, or the experimentwise error rate (αEW). For the current example, α = .05, C = 100, and so αFW = 0.994079, which means that it is very likely that we would make at least one Type I error in our set of 100 tests. Here, in a nutshell, is the problem of conducting multiple tests of group means: the probability of making a Type I error increases with the number of tests. If the number of tests, C, is large, then it becomes very likely that we will make a Type I error.

When we are conducting multiple tests on group means, we generally want to minimize Type I errors across the entire experiment, and so we need some way of maintaining αEW at some reasonably low level (e.g., .05). One obvious way of controlling αEW is to rearrange Equation 1 to calculate the αPC that is required for a given αFW and C:

αPC = 1 - (1 - αFW)^(1/C)   (2)

According to Equation 2, when C = 100 and we want αFW = .05, we must set αPC to .0005128.
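These relations are easy to verify numerically. A minimal R sketch, using the values assumed in the text (C = 100, αPC = .05 for Equation 1, and αFW = .05 for Equation 2):

> C <- 100
> 1 - (1 - 0.05)^C       # Equation 1 with alpha.PC = .05; approximately 0.994
> 1 - (1 - 0.05)^(1/C)   # Equation 2 with alpha.FW = .05; approximately 0.000513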
5.2 Planned vs. Post-hoc Comparisons
Suppose I conduct an experiment that compares the scores of subjects randomly assigned to eight different groups. After inspecting the data, shown in Figure 1, I decide to compare the means of groups 4 and 7 because the difference between those groups looks fairly large. I can do the test two ways: with an ordinary two-sample t test, or with a linear contrast that assumes equal variance across all groups. First, I simply compare the two groups using a t test assuming equal group variances. In the following commands, notice how I use the subset command to extract the data from the two groups, and then use R's t.test function. I will set α = .05. The null hypothesis is that the group means are equal; the alternative is that they differ.
> levels(g)
[1] "g1" "g2" "g3" "g4" "g5" "g6" "g7" "g8"
¹Technically, Equation 1 is correct only if the tests form an orthogonal set and the sample sizes for each group are large.
2
Controlling False Discovery Rate
• Instead of controlling αFW, control the False Discovery Rate (FDR):
- Q = (# of false H0 rejections) / (total # of H0 rejections)
• When all H0 are true, controlling αFW and controlling the FDR are equivalent
• When some H0 are false, FDR-based procedures are more powerful
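The quantity Q is defined for a single experiment; the FDR is its long-run average. A minimal simulation sketch (the mix of 90 true and 10 false null hypotheses, and the effect size of 3, are arbitrary illustrative choices):

> set.seed(1)
> p.true  <- runif(90)                                       # p-values under 90 true null hypotheses
> p.false <- pnorm(rnorm(10, mean = 3), lower.tail = FALSE)  # p-values under 10 false null hypotheses
> n.false.rej <- sum(p.true  < .05)                          # false rejections (Type I errors)
> n.true.rej  <- sum(p.false < .05)                          # correct rejections
> n.false.rej / (n.false.rej + n.true.rej)                   # Q for this simulated experiment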
3
Corrections for Multiple Comparisons
• Bonferroni Adjustment (aka Dunn's Procedure)
• Holm's Sequential Bonferroni Test
• Benjamini & Hochberg's (1995) Linear Step-Up Procedure (FDR)
4
- per-comparison alpha: αPC = αFW/C
- the Bonferroni method guarantees αFW ≤ C × αPC
• These relations are exact only for orthogonal comparisons/contrasts
• When contrasts are not orthogonal, the familywise alpha will be less than the nominal value
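For comparison with Equation 2: the Bonferroni per-comparison alpha αFW/C is slightly smaller (i.e., slightly conservative) relative to the exact value for orthogonal tests. A minimal sketch for αFW = .05 and C = 100:

> alpha.fw <- 0.05; C <- 100
> alpha.fw / C               # Bonferroni per-comparison alpha: 0.0005
> 1 - (1 - alpha.fw)^(1/C)   # exact value from Equation 2: approximately 0.000513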
5
• C comparisons yield C statistics (t's, F's, etc.)
- rank order the absolute values of the statistics (high to low)
- evaluate the 1st statistic with α = αFW/C
- if significant, evaluate the next statistic with α = αFW/(C−1)
- repeat until a test is not significant
• the true familywise alpha is less than or equal to the nominal αFW
- however, Holm's procedure is generally more powerful than the Bonferroni method
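The same step-down logic is usually applied to the ordered p-values. A minimal sketch with hypothetical p-values (p.adjust(..., method='holm'), shown below, automates this):

> p <- sort(c(.001, .005, .04))     # hypothetical p-values, smallest first
> alpha.fw <- 0.05
> crit <- alpha.fw / (length(p):1)  # alpha/C, alpha/(C-1), ..., alpha/1
> p <= crit                         # test in order and stop at the first FALSE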
6
• C comparisons yield C p-values
• set the False Discovery Rate (often FDR = .05)
• rank order the p-values from smallest (k=1) to largest (k=C)
• compare each p-value to α = (k/C) × FDR
- find the largest k for which p ≤ (k/C) × FDR, and reject H0 for that test and for all tests with smaller p-values
• guaranteed to maintain the FDR at the nominal value
• generally more powerful than the other methods
7
> my.p.values <- c(.127,.08,.03,.032,.02,.001,.01,.005,.025)
> sort(my.p.values)
[1] 0.001 0.005 0.010 0.020 0.025 0.030 0.032 0.080 0.127
> p.adjust(sort(my.p.values),method='bonferroni')
[1] 0.009 0.045 0.090 0.180 0.225 0.270 0.288 0.720 1.000
> p.adjust(sort(my.p.values),method='holm')
[1] 0.009 0.040 0.070 0.120 0.125 0.125 0.125 0.160 0.160
> p.adjust(sort(my.p.values),method='fdr')
[1] 0.009 0.0225 0.0300 0.04114 0.04114 0.04114 0.04114 0.090 0.127
With α/FDR = .05, the Bonferroni and Holm adjustments leave the two smallest p-values significant, whereas the FDR adjustment leaves the seven smallest significant. N.B. Sorting the p-values is not required.
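To connect the p.adjust output to the step-up rule on the previous slide, here is a minimal sketch that applies the (k/C) × FDR cutoffs directly to the sorted p-values:

> p <- sort(my.p.values)
> C <- length(p); fdr <- 0.05
> crit <- (1:C / C) * fdr         # (k/C) x FDR for k = 1, ..., C
> k.max <- max(which(p <= crit))  # largest k whose p-value falls below its cutoff
> p[1:k.max]                      # reject H0 for these seven tests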
8
• Generally, αFW and FDR are set to 0.01 or 0.05
• a larger αFW may be justified for a small number of orthogonal comparisons
- Bonferroni & Holm tests reduce power too much
- perhaps set αPC to 0.05
- Type I error rate will increase but Type II error rate will decrease
- Note: we already do this with factorial ANOVA
9
• a 95% CI contains the true parameter (mean) 95% of the time
- there is a close connection between CIs and p-values
- how do we extend this connection to the case of multiple comparisons?
• simultaneous 95% confidence intervals:
- all of the CIs in the set will contain the true parameter 95% of the time
- 2 or more simultaneous CIs are always larger than the corresponding individual CIs
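One way to see why simultaneous intervals must be wider: if the adjustment is done with a Bonferroni correction, each of C intervals uses α/C, so the t multiplier grows. A minimal sketch assuming 20 error degrees of freedom and C = 4 intervals (the linear.comparison routine used below may use a different adjustment, so its adjusted intervals need not match these multipliers):

> qt(1 - .05/2, df = 20)       # multiplier for a single 95% CI, approximately 2.09
> qt(1 - .05/(2*4), df = 20)   # Bonferroni multiplier for 4 simultaneous 95% CIs (larger)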
10
When we construct a set of simultaneous 95% confidence intervals, we mean that all of the intervals in the set will contain the population mean 95% of the time. Sets of two or more simultaneous confidence intervals will always be larger than individual confidence intervals.
The R routine that I have written for this class, linear.comparison, computes adjusted confidence intervals. As an example, we’ll conduct three contrasts on the blood pressure data from Table 5.4 in your textbook.
> bp<-read.csv(url("http://psycserv.mcmaster.ca/bennett/psy710/datasets/maxwell_tab54.csv"))
> names(bp)
> blood.contrasts <- list(c(1,-1,0,0), c(1,0,-1,0), c(0,1,-1,0), c(1/3,1/3,1/3,-1)) # weights of last contrast divided by 3
> bp.results <- linear.comparison(bp$bloodPressure,bp$group,blood.contrasts,var.equal=TRUE)
[1] "computing linear comparisons assuming equal variances among groups"
[1] "C 1: F=1.562, t=1.250, p=0.226, psi=5.833, CI=(-6.106,17.773), adj.CI= (-6.976,18.643)"
[1] "C 2: F=0.510, t=0.714, p=0.483, psi=3.333, CI=(-5.185,11.852), adj.CI= (-9.476,16.143)"
[1] "C 3: F=0.287, t=-0.536, p=0.598, psi=-2.500, CI=(-13.521,8.521), adj.CI= (-15.310,10.310)"
[1] "C 4: F=9.823, t=3.134, p=0.005, psi=11.944, CI=(5.620,18.269), adj.CI= (1.485,22.403)"
Notice that I divided the weights of the last comparison by 3; this division will be important later. Now let's compare the confidence intervals and adjusted confidence intervals for each comparison. These are all 95% confidence intervals because α = .05. The confidence intervals in the variable confinterval are for the individual comparison. The confidence intervals in the variable adj.confint are adjusted, or simultaneous, confidence intervals. Note how the adjusted confidence intervals are always larger than the non-adjusted intervals. Also note that the adjusted confidence intervals for the first three comparisons include zero. What does this mean for the way we evaluate the null hypothesis for each comparison? If we evaluate the null hypothesis for each comparison using the unadjusted confidence interval, what is αFW? What is αFW if we use the adjusted confidence intervals?
The confidence intervals are for ψ, which is defined as ψ = Σ aj µj. For the last comparison,

ψ = (1/3)(µ1 + µ2 + µ3) - (1)µ4

which tests the null hypothesis that the mean of group 4 is equal to the average of the other group means. Now, a contrast like c = (1, 1, 1, -3) also compares group 4 to the other 3 groups, but the value of ψ is changed to

ψ = (1)(µ1 + µ2 + µ3) - (3)µ4

which is three times the previous value. It turns out that the F and t tests are not affected by this manipulation, but the value of ψ and the confidence intervals for ψ are changed:
> blood.contrasts <- list(c(1,-1,0,0), c(1,0,-1,0), c(0,1,-1,0), c(1,1,1,-3) );
> bp.results <- linear.comparison(bp$bloodPressure,bp$group,blood.contrasts,var.equal=TRUE)
[1] "computing linear comparisons assuming equal variances among groups"
[1] "C 1: F=1.562, t=1.250, p=0.226, psi=5.833, CI=(-6.106,17.773), adj.CI= (-6.976,18.643)"
[1] "C 2: F=0.510, t=0.714, p=0.483, psi=3.333, CI=(-5.185,11.852), adj.CI= (-9.476,16.143)"
[1] "C 3: F=0.287, t=-0.536, p=0.598, psi=-2.500, CI=(-13.521,8.521), adj.CI= (-15.310,10.310)"
[1] "C 4: F=9.823, t=3.134, p=0.005, psi=35.833, CI=(16.860,54.806), adj.CI= (4.456,67.210)"
> bp.results[[4]]$contrast
[1]  1  1  1 -3

Comparisons of blood pressure data in Table 5.4 of the textbook
11
All pairwise tests (Tukey HSD)
• Evaluate all pairwise differences between groups
• More powerful than the Bonferroni method (for between-subjects designs)
• Tukey HSD
- it is NOT necessary to evaluate the omnibus F prior to the Tukey test
- assumes equal n per group & equal variances
- the Tukey-Kramer test is valid when sample sizes are unequal
- Dunnett's T3 test is better with unequal n & unequal variances
- see Kirk (1995, pp. 146-150) for more details
12
> bp.results[[4]]$alpha
> bp.results[[4]]$confinterval
[1] 16.86 54.81
> bp.results[[4]]$adj.confint
[1] 4.456 67.210
In this case, the confidence interval is for a value that corresponds to three times the difference between µ4 and (1/3)(µ1 + µ2 + µ3). So, make sure that the confidence interval is for the value that you really care about.
One more thing. The p-values returned by linear.comparison are not adjusted for multiple comparisons. However, it is rather simple to compare them to an adjusted criterion. If we want to set αFW = .05, we can use the Bonferroni adjustment to calculate the correct value of α for each comparison: αPC = αFW/C = .05/3 = .0167. By default, the p-values are listed in the output of linear.comparison. Manually listing the unadjusted p-values for each comparison is easy:
> bp.results[[1]]$p.2tailed
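The whole set of p-values can also be collected and compared to the Bonferroni criterion. A minimal sketch (it assumes bp.results behaves like an ordinary R list with one element per contrast, which the [[ ]] indexing above suggests):

> p.unadj <- sapply(bp.results, function(r) r$p.2tailed)  # 0.226, 0.483, 0.598, 0.005 from the output above
> p.unadj < 0.05/3                                        # compare each p-value to the Bonferroni criterion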
5.4 All Pairwise Contrasts
Sometimes you will be interested only in pairwise comparisons between group means. If you want to test just a few pairs of means, it often is sufficient to do multiple t-tests (or to use linear contrasts to compare pairs of groups) and to use the Bonferroni adjustment (or Holm's sequential method) to control αFW. However, when you want to do many pairwise tests, or you decide to do multiple pairwise tests after inspecting the data (i.e., post-hoc pairwise tests), then you should use TukeyHSD. (N.B. Tukey HSD is the same as Tukey WSD). Tukey's HSD (for Honestly Significant Difference) is the optimal procedure for doing all pairwise comparisons. In R, the Tukey test is invoked by calling TukeyHSD(my.aov), where my.aov is an object created by a call to aov. Here is an example, again using the data in Table 5.4:
> bp.aov<-aov(bloodPressure~group,data=bp)
> TukeyHSD(bp.aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = bloodPressure ~ group, data = bp)
$group
diff lwr upr p adj
b-a -5.833 -18.90 7.231 0.6038
c-a -3.333 -16.40 9.731 0.8903
d-a -15.000 -28.06 -1.936 0.0209
c-b 2.500 -10.56 15.564 0.9493
d-b -9.167 -22.23 3.898 0.2345
d-c -11.667 -24.73 1.398 0.0906
The output is easier to read by calling TukeyHSD slightly differently:
> TukeyHSD(bp.aov,ordered=TRUE)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = bloodPressure ~ group, data = bp)
$group
diff lwr upr p adj
b-d 9.167 -3.898 22.23 0.2345
c-d 11.667 -1.398 24.73 0.0906
a-d 15.000 1.936 28.06 0.0209
c-b 2.500 -10.564 15.56 0.9493
a-b 5.833 -7.231 18.90 0.6038
a-c 3.333 -9.731 16.40 0.8903
The output is a list of pairwise comparisons. For each comparison, the output shows the difference, the confidence interval of the difference (95% by default), and a p-value that can be used to evaluate the null hypothesis that the group difference is zero. The confidence intervals and p-values are adjusted to take into account the multiple comparisons. Using TukeyHSD ensures that the familywise Type I error rate is controlled (.05, by default). The Tukey test assumes equal sample size in every group, and homogeneity of variance.
It is important to note that it is not necessary to obtain a significant omnibus F test before using the Tukey HSD procedure. In fact, requiring a significant omnibus test means that the actual α for the Tukey HSD test will be substantially lower than the nominal value. If you want to make all pairwise comparisons, then it is perfectly reasonable to skip the regular ANOVA and use the Tukey HSD procedure (Wilcox, 1987).
5.4.1 Modifications of Tukey HSD
The Tukey HSD procedure assumes equal sample sizes and constant variance across groups. If those assumptions are not valid, then the p values and confidence intervals calculated with the Tukey HSD procedure will be incorrect. When the constant variance assumption is valid but sample sizes are unequal, the Tukey-Kramer test is recommended. When the variances are heterogeneous and sample sizes are unequal, Dunnett's T3 test is recommended. The following code shows how to use the DTK package in R to perform these tests on the blood pressure data used in the previous section:
13
> # install.packages("DTK") # download package and install on computer
> library("DTK") # load package into workspace
> TK.test(x=bp$bloodPressure,f=bp$group,a=0.05) # Tukey-Kramer test
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = x ~ f)
> DTK.test(x=bp$bloodPressure,f=bp$group,a=0.05) # Dunnett's T3 test
[[1]]
b-a -5.833 -26.95 15.2867
c-a -3.333 -18.40 11.7357
d-a -15.000 -29.60 -0.4001
c-b 2.500 -17.00 21.9961
d-b -9.167 -28.30 9.9692
d-c -11.667 -23.80 0.4659
> # following commands store result and then plot simultaneous confidence intervals:
> # tmp <- DTK.test(x=bp$bloodPressure,f=bp$group,a=0.05)
> # DTK.plot(tmp)
These tests, plus several others, are described in detail by Kirk (1995, pages 146-150).
5.5 Post-Hoc Comparisons
The final situation we will consider is the case where you want to perform post-hoc linear contrasts, some of which are not pairwise comparisons. In this situation, Tukey's HSD procedure is not appropriate because not all of the comparisons are pairwise. Neither is the Bonferroni procedure, because the contrasts are post-hoc, not planned. Instead, we should use Scheffé's method.
Scheffé's method allows us to perform multiple, complex linear contrasts after looking at the data, while maintaining control of the Type I error rate. The method of computing the linear contrasts is exactly the same as the one used for planned linear contrasts. The only difference is that the observed Fcontrast is compared to FScheffé:

FScheffé = (a - 1) × FαFW(df1 = a - 1, df2 = N - a)   (5)

where a is the number of groups and N is the total number of subjects. Suppose we have 40 subjects (N = 40) divided among 4 groups (a = 4) and we want αFW to equal .05. The value of FScheffé can then be computed from Equation 5.
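A minimal sketch of that computation in R (qf gives the ordinary critical F for αFW = .05 with df1 = 3 and df2 = 36):

> a <- 4; N <- 40; alpha.fw <- 0.05
> F.crit <- qf(1 - alpha.fw, df1 = a - 1, df2 = N - a)  # ordinary critical F, approximately 2.87
> (a - 1) * F.crit                                      # F.Scheffe, approximately 8.6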
Performing a Single Comparison
• After plotting the data, I decide to compare the means of groups 4 & 7 using a t-test:

[Figure: boxplots of the scores for groups g1 through g8; the y-axis (score) runs from roughly 60 to 140]
> y.4<-subset(y,g=="g4") # get scores for group 4
> y.7<-subset(y,g=="g7") # get scores for group 7
> t.test(y.4,y.7,var.equal=TRUE) # do t-test assuming equal variances
Two Sample t-test
t = 4.165, df = 18, p-value = 0.0005813
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: [limits not shown]
sample estimates: mean of x (group 4) = 111.33, mean of y (group 7) = 83.41
The results are significant (t = 4.16, df = 18, p = 0.00058), so I reject the null hypothesis of no difference between groups 4 and 7.
In the second method, I will compare the groups using a linear contrast, again assuming equal variances across groups. One advantage of this method is that it uses all of the groups to derive an estimate of the population error variance, whereas t.test only uses data from the two groups being compared. Not only is the estimated error variance likely to be more accurate, but the test will have many more degrees of freedom in the denominator and therefore be more powerful. As before, α = .05 and the null hypothesis is that the groups are equal. (Note the double brackets that I use to read the results stored in c.4vs7.)
> lc.source<-url("http://psycserv.mcmaster.ca/bennett/psy710/Rscripts/linear_contrast_v2.R")
> source(lc.source)
> c.4vs7 <- linear.comparison(y, g, list(c(0,0,0,1,0,0,-1,0)), var.equal=TRUE) # compare groups 4 & 7
[1] "computing linear comparisons assuming equal variances among groups"
[1] "C 1: F=9.915, t=3.149, p=0.002, psi=27.924, CI=(14.560,41.287), adj.CI= (10.245,45.602)"
> c.4vs7[[1]]$p.2tailed
[1] 0.002387
Again, the comparison between the two groups is significant (t = 3.1487, df = 72, p = .002387). Finally, for completeness, I will do the comparison using the lm() command:
> newG <- g; # copy grouping factor
> contrasts(newG) <- c(0,0,0,1,0,0,-1,0) # link contrast weights with newG
> newG.lm.01 <- lm(y~newG)
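For intuition, here is a minimal sketch of the same pooled-error test done by hand; it assumes 10 scores per group, which is what df = 18 for the two-group t test and df = 72 for the eight-group contrast imply:

> ms.error <- anova(lm(y ~ g))["Residuals", "Mean Sq"]  # pooled error variance, 72 df
> psi.hat  <- mean(y[g == "g4"]) - mean(y[g == "g7"])   # contrast estimate, approximately 27.9
> se.psi   <- sqrt(ms.error * (1/10 + 1/10))            # standard error based on the pooled variance
> psi.hat / se.psi                                      # t with 72 df, approximately 3.15 (as above)
> 2 * pt(-abs(psi.hat / se.psi), df = 72)               # two-tailed p, approximately .0024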
16
Performing a Single Comparison
• Next I use a linear contrast, which uses all of the groups to derive an estimate of the error variance:
18
What was wrong with the preceding analyses? Answer: I performed the analyses after inspecting the data, choosing to compare groups 4 & 7 because they looked different, which, obviously, inflates the Type I error rate.
19
• Comparisons chosen after looking at the data are post-hoc
• The Scheffé method is preferred for post-hoc linear contrasts
- compute the contrast with the normal procedures
- evaluate the observed F against a new critical value:
FScheffé = (a−1) × FαFW (df1 = a−1, df2 = N−a)
a = number of groups, N = total number of subjects
- FScheffé is the "normal" omnibus critical F × (a−1)
• The Scheffé method and the omnibus F test are mutually consistent
20
The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time
Andrew Gelman† and Eric Loken‡
14 Nov 2013
"I thought of a labyrinth of labyrinths, of one sinuous spreading labyrinth that would encompass the past and the future . . . I felt myself to be, for an unknown period of time, an abstract perceiver of the world." — Borges (1941)

Abstract
Researcher degrees of freedom can lead to a multiple comparisons problem, even in settings where researchers perform only a single analysis on their data. The problem is there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p-values. We discuss in the context of several examples of published papers where data-analysis decisions were theoretically-motivated based on previous literature, but where the details of data selection and analysis were not pre-specified and, as a result, were contingent on data.

1. Multiple comparisons doesn't have to feel like fishing
1.1. Background
There is a growing realization that statistically significant claims in scientific publications are routinely mistaken. A dataset can be analyzed in so many different ways (with the choices being not just what statistical test to perform but also decisions on what data to exclude or include, what measures to study, what interactions to consider, etc.), that very little information is provided by the statement that a study came up with a p < .05 result. The short version is that it's easy to find a p < .05 comparison even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough and subsequently coming up with good stories (plausible even to themselves, as well as to their colleagues and peer reviewers) to back up any statistically-significant comparisons they happen to come up with.

This problem is sometimes called "p-hacking" or "researcher degrees of freedom" (Simmons, Nelson, and Simonsohn, 2011). In a recent article, we spoke of "fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant" (Gelman, 2013a).

But we are starting to feel that the term "fishing" was unfortunate, in that it invokes an image of a researcher trying out comparison after comparison, throwing the line into the lake repeatedly until a fish is snagged. We have no reason to think that researchers regularly do that. We think the real story is that researchers can perform a reasonable analysis given their assumptions and their data, but had the data turned out differently, they could have done other analyses that were just as reasonable in those circumstances.

We thank Ed Vul, Howard Wainer, Macartan Humphreys, and E. J. Wagenmakers for helpful comments and the National Science Foundation for partial support of this work.
† Department of Statistics, Columbia University, New York
‡ Department of Human Development and Family Studies, Penn State University
23
… regards to some of the papers we discuss here, the researchers never respond that they had chosen all the details of their data processing and data analysis ahead of time; rather, they claim that they picked only one analysis for the particular data they saw. Intuitive as this defense may seem, it does not address the fundamental frequentist concern of multiple comparisons.

We illustrate with a very simple hypothetical example. A researcher is interested in differences between Democrats and Republicans in how they perform in a short mathematics test when it is expressed in two different contexts, either involving health care or the military. The research hypothesis is that context matters, and one would expect Democrats to do better in the health-care context and Republicans in the military context. Party identification is measured on a standard 7-point scale and various demographic information is also available. At this point there is a huge number of possible comparisons that can be performed—all consistent with the data. For example, the pattern could be found (with statistical significance) among men and not among women—explicable under the theory that men are more ideological than women. Or the pattern could be found among women but not among men—explicable under the theory that women are more sensitive to context, compared to men. Or the pattern could be statistically significant for neither group, but the difference could be significant (still fitting the theory, as described above). Or the effect might only appear among men who are being asked the questions by female interviewers. We might see a difference between sexes in the health-care context but not the military context; this would make sense given that health care is currently a highly politically salient issue and the military is not. There are degrees of freedom in the classification of respondents into Democrats and Republicans from a 7-point scale. And how are independents and nonpartisans handled? They could be excluded entirely. Or perhaps the key pattern is between partisans and nonpartisans? And so on. In our notation above, a single overarching research hypothesis—in this case, the idea that issue context interacts with political partisanship to affect mathematical problem-solving skills—corresponds to many different possible choices of the decision variable.

At one level, these multiplicities are obvious. And it would take a highly unscrupulous researcher to perform test after test in a search for statistical significance (which could almost certainly be found at the 0.05 or even the 0.01 level, given all the options above and the many more that would be possible in a real study). We are not suggesting that researchers generally do such a search. What we are suggesting is that, given a particular data set, it is not so difficult to look at the data and construct completely reasonable rules for data exclusion, coding, and data analysis that can lead to statistical significance—thus, the researcher needs only perform one test, but that test is conditional on the data; hence, T (y;(y)) as described in item #4 above. As Humphreys, Sanchez, and Windt (2013) write, a researcher when faced with multiple reasonable measures can reason (perhaps correctly) that the one that produces a significant result is more likely to be the least noisy measure, but then decide (incorrectly) to draw inferences based on that one only.

This is all happening in a context of small effect sizes, small sample sizes, large measurement errors, and high variation (which combine to give low power, hence less reliable results even when they happen to be statistically significant, as discussed by Button et al., 2013). Multiplicity would not be a huge problem in a setting of large real differences, large samples, small measurement errors, and low variation. This is the familiar Bayesian argument: any data-based claim is more plausible to the extent it is a priori more likely and less plausible to the extent that it is estimated with more error. That is the context; in the present paper, though, we focus on the very specific reasons that published p-values cannot be taken at face value, even if no explicit fishing has been done.
24
"Researcher degrees of freedom"
• Should more data be collected? [sequential analysis]
• Should some observations be excluded?
• Which conditions should be combined and which ones compared?
• Which control variables should be considered?
• Should we combine or transform our independent variables?
False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant
Joseph P. Simmons¹, Leif D. Nelson², and Uri Simonsohn¹
¹The Wharton School, University of Pennsylvania; ²Haas School of Business, University of California, Berkeley
Psychological Science 22(11), 1359-1366, © The Author(s) 2011. DOI: 10.1177/0956797611417632

Abstract
In this article, we accomplish two things. First, we show that despite empirical psychologists' nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.

Keywords: methodology, motivated reasoning, publication, disclosure

Our job as scientists is to discover truths about the world. We generate hypotheses, collect data, and examine whether or not the data are consistent with those hypotheses. Although we aspire to always be accurate, errors are inevitable.

Perhaps the most costly error is a false positive, the incorrect rejection of a null hypothesis. First, once they appear in the literature, false positives are particularly persistent. Because null results have many possible causes, failures to replicate previous findings are never conclusive. Furthermore, because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them. Second, false positives waste resources: They inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a field known for publishing false positives risks losing its credibility.

In this article, we show that despite the nominal endorsement of a maximum false-positive rate of 5% (i.e., p ≤ .05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish "statistically significant" evidence consistent with any hypothesis.

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields "statistical significance," and to then report only what "worked." The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.

This exploratory behavior is not the by-product of malicious intent, but rather the result of two factors: (a) ambiguity in how best to make these decisions and (b) the researcher's desire to find a statistically significant result. A large literature documents that people are self-serving in their interpretation
Psychological Science 22(11) 1359 –1366 © The Author(s) 2011 Reprints and permission: sagepub.com/journalsPermissions.nav DOI: 10.1177/0956797611417632 http://pss.sagepub.com
Our job as scientists is to discover truths about the world. We generate hypotheses, collect data, and examine whether or not the data are consistent with those hypotheses. Although we aspire to always be accurate, errors are inevitable.
Perhaps the most costly error is a false positive, the incor- rect rejection of a null hypothesis. First, once they appear in the literature, false positives are particularly persistent. Because null results have many possible causes, failures to replicate previous findings are never conclusive. Furthermore, because it is uncommon for prestigious journals to publish null findings or exact replications, researchers have little incentive to even attempt them. Second, false positives waste resources: They inspire investment in fruitless research programs and can lead to ineffective policy changes. Finally, a field known for publishing false positives risks losing its credibility.
In this article, we show that despite the nominal endorse- ment of a maximum false-positive rate of 5% (i.e., p ! .05), current standards for disclosing details of data collection and analyses make false positives vastly more likely. In fact, it is unacceptably easy to publish “statistically significant” evi- dence consistent with any hypothesis.
The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared?
Which control variables should be considered? Should specific measures be combined or transformed or both?
It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.
This exploratory behavior is not the by-product of malicious intent, but rather the result of two factors: (a) ambiguity in how best to make these decisions and (b) the researcher’s desire to find a statistically significant result. A large literature documents that people are self-serving in their interpretation …
Contradicting this intuition, Figure 1 shows the false-positive rates from additional simulations for a researcher who has already collected either 10 or 20 observations within each of two conditions, and then tests for significance every 1, 5, 10, or 20 per-condition observations after that. The researcher stops collecting data either once statistical significance is obtained or when the number of observations in each condition reaches 50.
Figure 1 shows that a researcher who starts with 10 observations per condition and then tests for significance after every new per-condition observation finds a significant effect 22% of the time. Figure 2 depicts an illustrative example in which sampling continues until the number of per-condition observations reaches 70; it plots p values from t tests conducted after each pair of observations. The example shown in Figure 2 contradicts the often-espoused yet erroneous intuition that if an effect is significant with a small sample size then it would necessarily be significant with a larger one.
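The simulation summarized in Figure 1 is easy to sketch in R. The code below is a minimal illustration rather than the authors’ actual simulation code: it assumes both conditions are drawn from the same standard normal distribution (so the null hypothesis is true and every rejection is a false positive), starts with 10 observations per condition, reruns a pooled-variance t test after each added observation, and stops either at p ≤ .05 or at 50 observations per condition.

# Minimal sketch (not the authors' code) of the optional-stopping simulation
# summarized in Figure 1. Both conditions come from the same distribution,
# so any "significant" result is a false positive.
sim_false_positive <- function(n_start = 10, n_max = 50, test_every = 1, alpha = .05) {
  x <- rnorm(n_start)   # condition 1
  y <- rnorm(n_start)   # condition 2 (null is true: same distribution)
  repeat {
    if (t.test(x, y, var.equal = TRUE)$p.value <= alpha) return(TRUE)
    if (length(x) >= n_max) return(FALSE)
    x <- c(x, rnorm(test_every))   # add observations to each condition
    y <- c(y, rnorm(test_every))   # and test again
  }
}
set.seed(1)
# Proportion of 10,000 simulated experiments that ever reach p <= .05:
mean(replicate(10000, sim_false_positive(n_start = 10, test_every = 1)))
# should land near the 22% reported above (the exact value varies with the seed)

Changing n_start to 20, or test_every to 5, 10, or 20, reproduces the other conditions plotted in Figure 1.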
Solution

As a solution to the flexibility-ambiguity problem, we offer six requirements for authors and four guidelines for reviewers (see Table 2). This solution substantially mitigates the problem but imposes only a minimal burden on authors, reviewers, and readers. Our solution leaves the right and responsibility of identifying the most appropriate way to conduct research in the hands of researchers, requiring only that authors provide appropriately transparent descriptions of their methods so that reviewers and readers can make informed decisions regarding the credibility of their findings. We assume that the vast majority of researchers strive for honesty; this solution will not help in the unusual case of willful deception.

Requirements for authors

We propose the following six requirements for authors.
1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article. Following this requirement may mean reporting the outcome of power calculations or disclosing arbitrary rules, such as “we decided to collect 100 observations” or “we decided to collect as many observations as we could before the end of the semester.” The rule itself is secondary, but it must be determined ex ante and be reported.
Fig. 1. Likelihood of obtaining a false-positive result when data collection ends upon obtaining significance (p ≤ .05, highlighted by the dotted line). The figure depicts likelihoods for two minimum sample sizes, as a function of the frequency with which significance tests are performed. [Axes: y = percentage of false-positive results (peak label 22.1%); x = number of additional per-condition observations before performing another t test; separate lines for minimum sample sizes of n = 10 and n = 20.]
Fig. 2. Illustrative simulation of p values obtained by a researcher who continuously adds an observation to each of two conditions, conducting a t test after each addition. The dotted line highlights the conventional significance criterion of p ≤ .05. [Axes: y = p value, .00 to .70; x = sample size (number of observations in each of two conditions), 11 to 70.]
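A single run of this kind, analogous to Figure 2, can be generated with a few more lines of R; the seed and the choice to begin with 10 observations per condition are illustrative assumptions, not values taken from the article.

# Minimal sketch of a Figure 2-style trajectory: one observation is added to
# each of two identical conditions at a time, and the t test is rerun after
# every addition.
set.seed(2)
x <- rnorm(10); y <- rnorm(10)                 # start with 10 per condition
sizes  <- 11:70                                # grow to 70 per condition
p_vals <- sapply(sizes, function(n) {
  x <<- c(x, rnorm(1)); y <<- c(y, rnorm(1))   # add one observation to each
  t.test(x, y, var.equal = TRUE)$p.value
})
plot(sizes, p_vals, type = "l", ylim = c(0, .7),
     xlab = "Sample size (per condition)", ylab = "p value")
abline(h = .05, lty = 2)                       # conventional p <= .05 criterion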
Table 2. Simple Solution to the Problem of False-Positive Publications

Requirements for authors
1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.
3. Authors must list all variables collected in a study.
4. Authors must report all experimental conditions, including failed manipulations.
5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.
6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

Guidelines for reviewers
1. Reviewers should ensure that authors follow the requirements.
2. Reviewers should be more tolerant of imperfections in results.
3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.
Effect of adding observations on false positives.
26
Effect of stopping data collection as soon as statistical significance is achieved.
27
28
“Statistical significance is perhaps the least important attribute of a good experiment; it is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.”
- DT Lykken, Statistical Significance in Psychological Research, Psychological Bulletin, 1968
“it’s easy to find a p < .05 comparison even if nothing is going on, if you look hard enough…”
- Gelman & Loken, The garden of forking paths, 2013
29