# The misuse of asterisks in hypothesis testing


Psychology Science, Volume 46, 2004 (2), p. 227-242

The misuse of asterisks in hypothesis testing

DIETER RASCH1, KLAUS D. KUBINGER2, JÖRG SCHMIDTKE1, JOACHIM HÄUSLER2

Abstract

This paper serves to demonstrate that the practice of using one, two, or three asterisks (according to a type-I-risk α of either 0.05, 0.01, or 0.001) in significance testing, as found particularly in empirical research in psychology, is in no way in accordance with the Neyman-Pearson theory of statistical hypothesis testing. Claiming a posteriori that even a low type-I-risk α leads to significance merely discloses a researcher’s self-deception. Furthermore, it will be emphasised that by using sequential sampling procedures instead of fixed sample sizes the “practice of asterisks” would not arise. Besides this, a simulation study will show that sequential sampling procedures are not only efficient in requiring a smaller sample size but are also robust and nevertheless powerful in the case of non-normal distributions.

Key words: type-I-risk, Neyman-Pearson theory, sequential sampling, robustness

1 BioMath Company for Applied Mathematical Statistics in Biology and Medicine, Ltd.
2 Department of Psychology, University of Vienna, Division for Assessment and Applied Psychometrics


1. Introduction

In published results of statistical tests in psychological and medical research, we still often find the observed statistics accompanied by one of the symbols *, ** and ***. In agricultural research, after long discussions, this practice is now very rare, although it was quite common in the past. The symbols mean that the corresponding test statistic for a one-sided alternative hypothesis exceeds the 95% (*), 99% (**) or 99.9% quantile (***), respectively. For a two-sided alternative hypothesis, the corresponding quantiles are the 97.5%, 99.5% and 99.95% quantiles.
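The mapping from asterisks to quantile levels described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the critical values use the standard normal distribution as a large-sample stand-in for the t-quantiles, which is an assumption of this sketch and not part of the paper.

```python
from statistics import NormalDist

# Type-I-risks conventionally tied to one, two, and three asterisks.
ASTERISK_ALPHAS = {"*": 0.05, "**": 0.01, "***": 0.001}


def quantile_level(alpha: float, two_sided: bool) -> float:
    """Quantile level the test statistic must exceed for a given alpha."""
    return 1 - alpha / 2 if two_sided else 1 - alpha


def large_sample_critical_value(alpha: float, two_sided: bool) -> float:
    """Standard-normal quantile at that level.

    Assumption of this sketch: the normal quantile replaces the t-quantile,
    which is only reasonable for large degrees of freedom.
    """
    return NormalDist().inv_cdf(quantile_level(alpha, two_sided))


for stars, alpha in ASTERISK_ALPHAS.items():
    for two_sided in (False, True):
        side = "two-sided" if two_sided else "one-sided"
        level = quantile_level(alpha, two_sided)
        crit = large_sample_critical_value(alpha, two_sided)
        print(f"{stars:<3} {side}: {100 * level:.2f}% quantile, z = {crit:.3f}")
```

For small samples the t-quantiles are larger than these normal values, so the sketch should not be used for actual testing.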

However, if a researcher makes use of this “asterisk convention”, it merely discloses that he/she did not really design his/her experiment or survey in advance, or that he/she might not even understand his/her design. This will be explained in the following by the simple case of comparing two means of normally distributed variables that are sampled independently, assuming equal variances in both of the underlying populations; that is, the case in which Student’s two-sample t-test applies. We will also refer to its sequential counterpart.

In the following we will base our objections to the “practice of asterisks” on the Neyman-Pearson theory of statistical hypothesis testing. We will also introduce a sequential two-sample (t-based) testing procedure which terminates at the latest after a fixed maximum number of observations. Above all, such a procedure requires the probabilities of the type-I-error and the type-II-error of the test to be fixed in advance; therefore no room is left for the use of asterisks.

2. Basics of statistical hypothesis testing

As indicated, we will restrict the problem to testing the hypothesized equality of two population means µ1 and µ2. For instance, the variable considered is the score on a certain psychological test, and the two populations are men and women. All the subsequent definitions will be given for this specific situation only, in order to keep the presentation as simple and comprehensible as possible. Bear in mind that all the following considerations apply accordingly to other parameters, for instance the correlation coefficients of two variables.

The Neyman-Pearson theory of statistical hypothesis testing implies, of course, that even if there is only a very small difference between the means of any two populations, we can be sure of detecting it simply by collecting a sufficiently large number of observations. In other words, significance is just a question of, so to say, “being busy enough”! Consequently, an analysis is not really worth our while if the researcher merely looks for any kind of difference, regardless of whether this difference is of any practical magnitude. As a matter of fact, not even the means of screws produced on different days are exactly the same, an example commonly used in introductory statistics textbooks.
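The point that significance is a matter of “being busy enough” can be made concrete with a rough sample-size calculation. The sketch below uses the textbook normal approximation n ≈ 2(z₁₋α + z₁₋β)² / δ*² per group for a one-sided two-sample comparison, where δ* is the difference between the means in units of the common standard deviation. The formula, the function name, and the default risks are assumptions chosen for illustration, not values or methods taken from the paper.

```python
from math import ceil
from statistics import NormalDist


def approx_n_per_group(delta_std: float, alpha: float = 0.05, beta: float = 0.20) -> int:
    """Approximate per-group sample size for a one-sided two-sample test.

    delta_std is the standardised difference (mu1 - mu2) / sigma.
    Normal approximation only (an assumption of this sketch); planning
    based on the t-distribution gives slightly larger values.
    """
    if delta_std <= 0:
        raise ValueError("delta_std must be positive")
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha) + z(1 - beta)) ** 2 / delta_std ** 2)


# The smaller the difference one is willing to call an effect,
# the more observations are needed to declare it significant.
for delta_std in (1.0, 0.5, 0.1, 0.01):
    print(f"delta* = {delta_std}: about {approx_n_per_group(delta_std)} per group")
```

Because the required n grows with 1/δ*², any non-zero difference becomes detectable once the sample is large enough, which is why the relevant difference should be fixed before sampling.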

From a didactic point of view, a very often cited study by Belmont and Marolla (1973) serves as an impressive example of the artificial use of significance-based interpretation of empirical results. This study deals with differences in intelligence between testees without siblings and testees with up to eight siblings. The authors sampled intelligence test data from n = 386 114 Dutch recruits altogether, which established a continuously descending IQ: apart from recruits without any siblings, who were similar to recruits with two siblings, the IQ becomes smaller and smaller starting with recruits with exactly one sibling. Zajonc (1976) speculated on a certain socio-economic model as an explanation of this phenomenon; this model would have severe consequences if taken seriously by society. In fact, he established his model with reference to several more studies that were also based on very large samples: 800 000 17-year-old scholarship candidates in the USA, 120 000 6- to 14-year-old testees in France, and 70 000 11-year-old testees in Scotland. However, neither he nor many readers of his papers realized that all the significant differences are of almost no relevance. Take, for example, recruits with none, one, two, or three siblings. Then the 386 114 recruits have to be reduced to at least 3/4, which is about 290 000 and still a very large sample; yet the largest difference in means is then hardly larger than a tenth of the standard deviation (cf. Kubinger & Gittler, 1983). That is, for an IQ with σ = 15, a mean difference of 1.5 IQ points results, which no serious psychologist would ever interpret as worthwhile: pertinent intelligence tests with reliabilities up to 0.95 would lead to a 95% confidence interval with a length of at least twice 6.6 IQ points! This serves to complete our argument that significance does not qualify a study per se; rather, its content relevance does.
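The arithmetic behind the 6.6 IQ points can be retraced as follows. The sketch assumes the classical-test-theory standard error of measurement, σ·√(1 − reliability), and a normal quantile for the 95% interval; the paper does not spell out its formula, so this is a plausible reconstruction and not the authors' stated computation.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(sd: float, reliability: float, confidence: float = 0.95) -> float:
    """Half-width of a confidence interval for an individual test score.

    Assumption of this sketch: standard error of measurement from
    classical test theory, SEM = sd * sqrt(1 - reliability).
    """
    sem = sd * sqrt(1 - reliability)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sem


sigma = 15.0
half_width = ci_half_width(sigma, reliability=0.95)
mean_difference = 0.1 * sigma  # a tenth of the standard deviation

print(f"half-width of the 95% interval: {half_width:.2f} IQ points")
print(f"full length of the interval:    {2 * half_width:.2f} IQ points")
print(f"largest mean difference:        {mean_difference:.1f} IQ points")
```

Under these assumptions the half-width rounds to the 6.6 IQ points quoted in the text, several times the mean difference of 1.5 IQ points.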

Therefore we should not ask for any difference at all, but rather for a certain relevant difference, say δ = µ1 − µ2, and this should be fixed beforehand. A researcher should not start any kind of data sampling and analysis just for interest’s sake, but only on the basis of deliberate considerations: what extent of difference δ would have practical or theoretical consequences once it has been empirically established?

3. The power function of statistical hypothesis testing

In the case of a one-sided test, a well-known graphical representation of the two error probabilities for a fixed sample size is shown in Figure 1. As usual, we call the probability of a type-I-error (rejecting a null-hypothesis which is true) the type-I-risk α and, correspondingly, the probability of a type-II-error (accepting a null-hypothesis which is false) the type-II-risk β. If the difference of practical interest δ = µ1 − µ2 is standardised as

$$\delta^* = \frac{\mu_1 - \mu_2}{\sigma}$$

(this can also be called the non-centrality parameter), and δ* equals 3, the two risks can be found in Figure 1 under the corresponding density curve. That means α is shown below the left symmetric curve of the central t-distribution (if the null-hypothesis is true) and β is shown below the density curve of the non-central t-distribution (if the alternative hypothesis δ* = 3 is true). If for a fixed sample size we make α smaller (shift the quantile line t(df, 1 − α) to the right), β becomes larger. To make both risks smaller, we have to increase the sample size, because then both curves become steeper and steeper.
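The trade-off between the two risks can be illustrated numerically. The sketch below replaces the central and non-central t-distributions by normal distributions (a simplifying assumption, so the numbers are only approximate for small samples) and computes β for a one-sided test. The function name and the chosen values of δ* and n are hypothetical and serve only as an example.

```python
from math import sqrt
from statistics import NormalDist


def approx_beta(delta_std: float, n_per_group: int, alpha: float) -> float:
    """Approximate type-II-risk of a one-sided two-sample test.

    Normal approximation (an assumption of this sketch): under the
    alternative, the test statistic is treated as normal with mean
    delta_std * sqrt(n_per_group / 2) and unit variance.
    """
    nd = NormalDist()
    critical = nd.inv_cdf(1 - alpha)          # quantile line for the chosen alpha
    shift = delta_std * sqrt(n_per_group / 2)  # location under the alternative
    return nd.cdf(critical - shift)


# Fixed sample size: a smaller alpha moves the quantile line to the right
# and increases beta. A larger sample size reduces beta for every alpha.
for n in (10, 40):
    for alpha in (0.05, 0.01, 0.001):
        beta = approx_beta(delta_std=1.0, n_per_group=n, alpha=alpha)
        print(f"n = {n:>2}, alpha = {alpha:<5}: beta is about {beta:.3f}")
```

Read row by row, the output shows the asterisk levels as three different designs with three different type-II-risks, not as grades of one and the same result.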

The power function is defined as the probability of rejecting the null-hypothesis (in our case, the equality of two means) as a function of the difference between the two means. If the two population means under discussion are actually equal, this probability is the type-I-risk α, that is, the risk of rejecting the null-hypothesis although it is true. On the other hand, if the two means differ from each other, the probability of rejecting the null-hypothesis with the t-test increases with the difference between these means (cf. Fig. 2, the given standardised difference $\delta^* = \frac{\mu_1 - \mu_2}{\sigma} > 0$ and the resulting probability 1 − β for rejecting the null-hypothesis).

Figure 1: The density function f(t) of the test statistic (here of the t-test) under the null-hypothesis (left curve) and, for simplicity, under a simple one-sided alternative hypothesis µ1 − µ2 = δ > 0 (right curve, for δ* = δ/σ = 3)

Figure 2: The power function of the t-test for a two-sided alternative hypothesis µ1 ≠ µ2 and a type-I-risk of α = 0.05, for n = 5 and n = 20
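The qualitative shape of such a power function can be reproduced with a short sketch. It uses a normal approximation to the two-sided two-sample test and is not the exact non-central t computation that underlies Figure 2, so it overstates the power noticeably for n = 5. The function name and the grid of standardised differences are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta_std: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample test.

    Normal approximation (an assumption of this sketch): the exact
    power function of the t-test requires the non-central t-distribution.
    """
    nd = NormalDist()
    critical = nd.inv_cdf(1 - alpha / 2)
    shift = delta_std * sqrt(n_per_group / 2)
    # Probability of landing in the upper or the lower rejection region.
    return nd.cdf(shift - critical) + nd.cdf(-shift - critical)


# At delta* = 0 the power equals alpha; it rises with |delta*| and with n.
for delta_std in (0.0, 0.5, 1.0, 1.5, 2.0):
    p5 = approx_power(delta_std, n_per_group=5)
    p20 = approx_power(delta_std, n_per_group=20)
    print(f"delta* = {delta_std:.1f}: power(n=5) = {p5:.3f}, power(n=20) = {p20:.3f}")
```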


Of course, any researcher aims to get a relatively high probability for 1–β , given the mean difference is