
Psychology Science, Volume 46, 2004 (2), p. 227-242

The misuse of asterisks in hypothesis testing

DIETER RASCH1, KLAUS D. KUBINGER2, JÖRG SCHMIDTKE1, JOACHIM HÄUSLER2

Abstract

This paper serves to demonstrate that the practice of using one, two, or three asterisks (according to a type-I-risk α of either 0.05, 0.01, or 0.001) in significance testing, as given particularly with regard to empirical research in psychology, is in no way in accordance with the Neyman-Pearson theory of statistical hypothesis testing. Claiming a-posteriori that even a low type-I-risk α leads to significance merely discloses a researcher's self-deception. Furthermore, it will be emphasised that by using sequential sampling procedures instead of fixed sample sizes the „practice of asterisks“ would not arise. Besides this, a simulation study will show that sequential sampling procedures are not only efficient concerning a lower sample size, but are also robust and nevertheless powerful in the case of non-normal distributions.

Key words: type-I-risk, Neyman-Pearson theory, sequential sampling, robustness

1 BioMath Company for Applied Mathematical Statistics in Biology and Medicine, Ltd.
2 Department of Psychology, University of Vienna, Division for Assessment and Applied Psychometrics


1. Introduction

In various publications on the results of using statistical tests in psychological and medical research work, we often still find the observed statistics accompanied by one of the symbols *, ** and ***. In agricultural research this practice was quite common in the past but, after long discussions, is now applied very seldom. The meaning of the symbols is that the corresponding test statistic for a one-sided alternative hypothesis exceeds the 95%- (*), 99%- (**) or the 99.9%-quantile (***), respectively. For a two-sided alternative hypothesis, the corresponding numbers are the 97.5%-, 99.5%- and 99.95%-quantiles.
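As a minimal sketch of this convention (our own illustration, not from the paper), the mapping from a two-sided p-value to the asterisk symbols is:

```python
def asterisks(p: float) -> str:
    """Map a p-value to the asterisk convention criticized in this paper:
    * for p < 0.05, ** for p < 0.01, *** for p < 0.001."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""  # "not significant"
```

The point of the paper is precisely that this after-the-fact grading of p-values has no justification in the Neyman-Pearson framework.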

However, if a researcher makes use of this „asterisks practice convention“, it merely discloses that he/she did not really design his/her experiment or survey in advance, or that he/she might not even understand his/her design. This will be explained in the following by the simple case of comparing two means of normally distributed variables sampled independently, assuming equal variances in both underlying populations; that is, the pertinent case of using Student's two-sample t-test – and we will refer to its sequential counterpart.

In the following we will establish our objections to the „practice of asterisks“ with regard to the Neyman-Pearson theory of statistical hypothesis testing. We will also introduce a sequential two-sample (t-based) testing procedure which terminates after at most a fixed maximum number of observations. Above all, such a procedure establishes the need to fix the probabilities of the type-I-error and the type-II-error of the test in advance; therefore no room is left for the use of asterisks.

2. Basics of statistical hypothesis testing

As indicated, we will restrict the problem to testing the hypothesized equality of two population means µ1 and µ2. For instance, the considered variable is the score of a certain psychological test, and the two populations are men and women. All the subsequent definitions will be given for this specific situation only, in order to keep matters as simple and understandable as possible. Bear in mind that all the following considerations also apply accordingly to other parameters, for instance the correlation coefficient of two variables.

The Neyman-Pearson theory of statistical hypothesis testing involves, of course, the fact that even if there is only a very small difference between the means of any two populations, we can be sure of detecting it simply by obtaining a sufficiently large number of observations. In other words, significance is just a question of, so to speak, „being busy enough“! Consequently, such an analysis is not really worth our while, since the researcher looks only for differences of any kind, regardless of whether these differences are of any practical magnitude. As a matter of fact, not even the means of screws produced on different days are exactly the same – a commonly used example in introductory books on statistics.

From the point of view of didactics, the very often cited study of Belmont and Marolla (1973) serves as an impressive example of the artificial use of significance-based interpretation of empirical results. This study deals with differences concerning the intelligence of testees without and with siblings, up to the number of eight siblings. The authors sampled intelligence test data from altogether n = 386 114 Dutch recruits, which established a continuously descending IQ: apart from recruits without any siblings – these being similar to recruits with two siblings – the IQ becomes smaller and smaller starting with recruits with exactly one sibling. Zajonc (1976) speculated on a certain socio-economic model as an explanation of this phenomenon; this model would have severe consequences if taken seriously by society. In actual fact, he established his model with reference to several more studies, also based on very large samples: 800 000 17-year-old scholarship candidates in the USA, 120 000 6- to 14-year-old testees in France, and 70 000 11-year-old testees in Scotland. However, neither he nor many readers of his papers realized that all the significant differences are of almost no relevance. Take, for example, recruits with either none, one, two, or three siblings. Then the 386 114 recruits have to be reduced to at least 3/4, which is about 290 000 – still a very large sample; however, the largest difference in means is then hardly larger than a tenth of the standard deviation (cf. Kubinger & Gittler, 1983). That is, for an IQ with σ = 15, a mean difference of 1.5 IQ-points results, which no serious psychologist would ever interpret as worthwhile: pertinent intelligence tests with reliabilities up to 0.95 would lead to a 95% confidence interval with a length of at least (twice) 6.6 IQ-points! This serves to complete our argument that significance does not qualify a study per se; it is rather the relevance of the content that does.
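The arithmetic behind these numbers can be checked in a few lines (a sketch using the values quoted in the text: σ = 15, reliability 0.95, and the usual normal-approximation factor 1.96):

```python
import math

sigma = 15.0          # IQ standard deviation
reliability = 0.95    # test reliability quoted in the text

# Largest group-mean difference: about a tenth of a standard deviation.
mean_difference = 0.1 * sigma                    # 1.5 IQ-points

# Standard error of measurement of a single IQ score.
sem = sigma * math.sqrt(1.0 - reliability)       # about 3.35 IQ-points

# Half-width of an approximate 95% confidence interval for a true score;
# the interval's total length is twice this value.
half_width = 1.96 * sem                          # about 6.6 IQ-points
```

The "significant" group difference of 1.5 IQ-points is thus far smaller than the measurement uncertainty of a single score.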

Therefore we should not ask for just any difference at all, but rather for a certain relevant difference, say δ = µ1 - µ2, and this should be fixed beforehand. A researcher should not start with any kind of data sampling and analysis just for interest's sake, but rather based only on deliberate considerations: what extent of difference δ would cause practical or theoretical consequences, given that the results have been empirically established.

3. The power function of statistical hypothesis testing

In the case of a one-sided test, a well-known graphical representation of the two probabilities of errors for a fixed sample size is shown in Figure 1. As usual, we call the probability of a type-I-error (rejecting a null-hypothesis which is true) the type-I-risk α, and correspondingly the probability of a type-II-error (accepting a null-hypothesis which is wrong) the type-II-risk β. If the difference of practical interest δ = µ1 - µ2 is standardised as

δ* = (µ1 − µ2)/σ

(this can also be called the non-centrality parameter), and δ* equals 3, the two risks can be found in Figure 1 under the corresponding density curves. That means α is shown below the left symmetric curve of the central t-distribution (if the null-hypothesis is true), and β is shown below the density curve of the non-central t-distribution (if the alternative hypothesis δ* = 3 is true). If, for a fixed sample size, we make α smaller (shift the quantile line t(df, 1 − α) to the right), β becomes larger. To make both risks smaller, we have to increase the sample size, because then the shape of both curves becomes steeper and steeper.

The power function is defined as the probability of rejecting the null-hypothesis – in our case this is the equality of two means – as a function of the difference between the two means. If the two population means under discussion are actually equal, this probability quantifies the type-I-error, that being the risk of rejecting the null-hypothesis although it is true. On the other hand, if the two means differ from each other, the probability of rejecting the null-hypothesis for the t-test increases with the difference of these means (cf. Fig. 2: the given standardised difference δ* = (µ1 − µ2)/σ > 0 and the resulting probability 1 − β for rejecting the null-hypothesis).

Figure 1: The density function of the test statistic (here of the t-test; f(t)) under the null-hypothesis (left curve) and – just for simplification – for a simple one-sided alternative hypothesis µ1 - µ2 = δ > 0 (right curve, for δ* = δ/σ = 3)

Figure 2: The power function of the t-test for a two-sided alternative hypothesis µ1 ≠ µ2 and a type-I-risk of α = 0.05, for n = 5 and n = 20


Of course, any researcher aims at a relatively high probability 1 − β, given the mean difference is at least the δ which corresponds to relevance. Now, as denoted in Figure 2, the power function amounts to 1 − β at the standardised difference δ* = δ/σ, as β is the probability of a type-II-error (that being the risk of retaining the null-hypothesis although it is wrong). Consequently, the probability of rejecting the null-hypothesis becomes larger than 1 − β for any δ0 > δ, and therefore the factual type-II-risk β0 is smaller than β (the factual power correspondingly larger).
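The trade-off between the two risks, and the two power curves of Figure 2, can be checked with a small Monte Carlo sketch (stdlib only; the critical values 2.306 and 2.024 are ordinary two-sided 5% t-table values for df = 8 and df = 38; all function names are ours):

```python
import random
import statistics

random.seed(1)

def rejection_rate(n, delta_star, t_crit, reps=4000):
    """Monte Carlo rejection rate of the two-sided two-sample t-test
    (equal variances, n observations per group) when the true
    standardised difference is delta_star."""
    hits = 0
    for _ in range(reps):
        x = [random.gauss(0.0, 1.0) for _ in range(n)]
        y = [random.gauss(delta_star, 1.0) for _ in range(n)]
        sp2 = ((n - 1) * statistics.variance(x)
               + (n - 1) * statistics.variance(y)) / (2 * n - 2)
        t = (statistics.mean(y) - statistics.mean(x)) / (2 * sp2 / n) ** 0.5
        if abs(t) > t_crit:
            hits += 1
    return hits / reps

alpha_hat = rejection_rate(5, 0.0, 2.306)    # near the nominal 0.05
power_n5 = rejection_rate(5, 1.0, 2.306)     # modest power at delta* = 1
power_n20 = rejection_rate(20, 1.0, 2.024)   # clearly higher power
```

With δ* = 1, raising n from 5 to 20 per group roughly triples the power while the type-I-risk stays at its nominal level, mirroring the two curves in Figure 2.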

4. Statistical hypothesis testing – designing experiments and surveys

It has already been indicated that sampling data and analysing it without pre-determining the type-I-risk, the type-II-risk, and the extent of the relevant effect (here the relevant difference δ of two means) would lead to results as one pleases, but without any scientifically-based merit. In such cases the results are results by chance, and could, with the same justification, just as well be obtained by tossing a coin. Whether the researcher gains „important“ results – important because they are significant – depends merely on the sample size he/she was able to recruit or decided on for convenience's sake (see the two curves in Fig. 2, which disclose the tendency of the power function as n becomes larger).

As already stated, designing an empirical study – an experiment or survey – in order to investigate whether a given relevant (minimum) difference δ between two populations exists requires fixing the type-I-risk α and the type-II-risk β beforehand. Either the sample size should be calculated appropriately, or a sequential testing procedure should be carried out. The latter is especially advisable if the elements can be sampled sequentially quite easily, as is often the case with patients in clinical research work or the like.

In doing so – either calculating the adequate sample size or testing sequentially – it is self-evident that one should use that particular α that was determined at the very beginning for making the decision on „significant vs. non-significant“. It would therefore be totally senseless to use any other α at the very end of the analysis. Either the sample size n has been calculated for a certain fixed α, which must not be altered afterwards, because then every conclusion would be based on an altered value of β, too; or sequential testing has terminated because of significance at level α – further effort would not pay off just to try to prove significance at a level α0 < α. In summary, there is really no place for asterisks at all.

However, this argument does not seem to bother many researchers. By using the „practice of asterisks“ they obviously think that their results would be of greater importance if they are not only significant at, for instance, α = 0.05, but also significant at α = 0.01; they use the resulting p-value – that being the probability that, given the null-hypothesis, the observed data or data contradicting the null-hypothesis even more will occur – as some kind of proof of quality: the smaller the p-value, the „better“ the results, „better“ very often connoted with a greater effect. Bear in mind, however, that the estimated effect always amounts to the same d = δ̂ = ȳ1 − ȳ2, regardless of the level of α; and, for any a-posteriori effect d that is of relevance, a p-value that is too low indicates an uneconomic approach to designing the experiment or survey – because n is too large – rather than indicating any kind of high quality of the study: the researcher must be blamed, because an even smaller sample size would have led to significance!

Also bear in mind that using the „practice of asterisks“ always implies the highest α of all α-levels that one would ever accept: if a researcher decides, according to the result, which level of α he/she applies (in order to get a significant result, even if just at α = .05), then the general α-level is the one that would suffice even in the worst case. Claiming a-posteriori that a lower α applies merely discloses the researcher's self-deception.

It is evident that the outcome of significance at level α = 0.05, or α = 0.01, or α = 0.001 is also a matter of chance, given a certain minimal sample size n and a certain (minimal) difference δ: even if n and δ are both sufficiently large, according to Figure 1 significance either does not result or results at one of the three discussed levels due to the random sampling effect. Every point (or better: negligible interval) of the test statistic t on the abscissa is possible given that the difference δ applies; in some cases a certain t is very likely, but most of the time a certain t is actually not too likely. If we are lucky – or rather: if it so happens by chance – our actually observed t will be large enough to reach, for instance, even the area of the α-level 0.001 and not only the area of α = 0.05. However, this does not change the true though unknown value δ in any way: „the two populations differ with respect to their means“ – if the sample size has indeed been calculated beforehand as described above, we can then add: „if significantly, then at least to an extent of δ“. The only restriction is that this conclusion risks a type-I-error of probability α. In other words, if we are satisfied by a significant result at α-level 0.05, an a-posteriori stated significance at α = 0.01 or α = 0.001 only pretends that there is an even more „important“ difference – but, of course, there is not!
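This randomness of the attained „asterisk level“ can be made visible by simulation (our own sketch, using the design of the numerical example in Section 7: n = 17 per group and a true δ* = 1; the critical values are two-sided t-table quantiles for df = 32):

```python
import random
import statistics

random.seed(7)

# Two-sided critical values for df = 32 from a t-table:
# 97.5%, 99.5% and 99.95% quantiles (alpha = 0.05, 0.01, 0.001).
CRIT = [("***", 3.622), ("**", 2.738), ("*", 2.037)]

def t_statistic(n=17, delta_star=1.0):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = [random.gauss(delta_star, 1.0) for _ in range(n)]
    sp2 = ((n - 1) * statistics.variance(x)
           + (n - 1) * statistics.variance(y)) / (2 * n - 2)
    return abs(statistics.mean(y) - statistics.mean(x)) / (2 * sp2 / n) ** 0.5

counts = {"***": 0, "**": 0, "*": 0, "ns": 0}
for _ in range(5000):
    t = t_statistic()
    for label, crit in CRIT:
        if t > crit:
            counts[label] += 1
            break
    else:
        counts["ns"] += 1
# The true difference is identical in every run, yet all four
# outcomes (ns, *, **, ***) occur with substantial frequency.
```

Which number of asterisks a single experiment „earns“ thus says nothing beyond the random sampling effect.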

One particular problem within psychological research work is that in correlation analyses the null-hypothesis is not always considered as critically as necessary: the null-hypothesis states that the correlation between two variables equals zero within the population. Of course, there is almost no researcher who would be dissatisfied with an experiment or survey if the null-hypothesis has to be rejected because of a significant result, that being a so-called „significant“ correlation. Yet this merely indicates to the researcher that the correlation does not actually equal zero, but might just be very small, for instance ρ = 0.1. Whether a certain correlation is significant or not again depends on the sample size only. Hence, particularly with correlation analyses, the use of the „practice of asterisks“ can be seen all too often, because of the researcher's intention of giving the impression of an established effect, that being the extent of association between the two variables. However, it would have been much simpler and of greater relevance if a certain relevant extent of the association had been fixed beforehand. Nearly every introductory book on statistics offers the pertinent test statistic for testing the null-hypothesis ρ = ρ0 > 0. In this case, a very important result would be found if the null-hypothesis is not rejected, because then the extent of the association is (at least) ρ0 – given a certain type-I-risk. This approach is hardly implemented anywhere in psychological research work, probably because only few researchers dare to fix the effect that is of relevance before they know the results of the analysis.
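One standard way to test H0: ρ = ρ0 is Fisher's z-transformation, whose statistic is approximately standard normal (a sketch, not spelled out in the paper; the function name and example numbers are ours):

```python
import math

def fisher_z_statistic(r, n, rho0):
    """Approximate standard-normal statistic for H0: rho = rho0,
    based on Fisher's z-transformation z = atanh(r) with
    standard error 1/sqrt(n - 3)."""
    return (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)

# With a huge sample, even r = 0.1 is highly "significant" against rho0 = 0 ...
z_against_zero = fisher_z_statistic(0.1, 1000, 0.0)   # about 3.17
# ... but against a pre-fixed relevant association rho0 = 0.3 the statistic is
# strongly negative: the data contradict an association as large as 0.3.
z_against_rel = fisher_z_statistic(0.1, 1000, 0.3)
```

Fixing a relevant ρ0 beforehand thus turns a mere „significant correlation“ into a statement about the extent of the association.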

Another problem within psychological research work is that the use of *, **, or *** is supported by statistical software packages, in particular SPSS and SAS. However, the presentation of the respective p-value is only a matter of traditional technical simplification; in other words, it is more convenient to program p-values for the output than to ask the researcher at the very beginning for a fixed α-level in order to compare the resulting p-value


with the critical p-value at α at the end. Thus, statistically unsure researchers are misled into interpreting a lower p-value as more important, that is, as though caused by a greater effect.

5. Formalization of statistical hypothesis testing – the conventional approach

In order to prove our argument more formally, we start with the following. We have two populations 1 and 2 with the normally distributed variables y1 and y2, which supposedly have the means µ1, µ2 and the variances σ1², σ2². We consider two independent random samples (y11, ..., y1n1) and (y21, ..., y2n2) of sizes n1 and n2 from the two populations under consideration in order to test the null-hypothesis

H0 : µ1 = µ2

against one of the following one- or two-sided alternative hypotheses:

a) HA+ : µ1 > µ2, b) HA− : µ1 < µ2³, or c) HA : µ1 ≠ µ2.

The sample sizes n1 and n2 should be determined in such a way that for a given type-I-risk α, the type-II-risk β0, as a function of δ = µ1 - µ2, does not exceed a predetermined upper bound β as long as, for the respective alternative hypothesis, we have either

a) µ1 - µ2 > δ, b) µ1 - µ2 < -δ, or c) |µ1 - µ2| ≥ δ.

Once more, the minimal difference of practical interest, i.e. the (relevant) effect δ between µ1 and µ2, has to be fixed by the researcher at the very beginning of the design of his/her experiment or survey.

As presupposed in our case with equal variances, the hypotheses above are best tested either with the one-sided Student's t-test – cases a) and b) – or the two-sided Student's t-test – case c). As is well known, the test is based on the pooled estimator of the common variance σ² given by

sp² = [ Σ(i=1..n1) (y1i − ȳ1.)² + Σ(i=1..n2) (y2i − ȳ2.)² ] / (n1 + n2 − 2).   (1)

³ The correct formulation of the null-hypothesis is then H0 : µ1 ≥ µ2.


The test statistic is

t = [ (ȳ1. − ȳ2.) / sp ] · √( n1·n2 / (n1 + n2) ),   (2)

and H0 is rejected

in case a) if t > t(n1+n2–2; 1–α),
in case b) if t < –t(n1+n2–2; 1–α), and
in case c) if |t| > t(n1+n2–2; 1–α/2);

otherwise it is accepted.
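Formulas (1) and (2) translate directly into a few lines of code (a sketch; the toy data and function name are our own):

```python
import statistics

def pooled_t(sample1, sample2):
    """Two-sample t statistic with pooled variance, formulas (1) and (2)."""
    n1, n2 = len(sample1), len(sample2)
    sp2 = ((n1 - 1) * statistics.variance(sample1)
           + (n2 - 1) * statistics.variance(sample2)) / (n1 + n2 - 2)   # (1)
    return ((statistics.mean(sample1) - statistics.mean(sample2))
            / sp2 ** 0.5 * (n1 * n2 / (n1 + n2)) ** 0.5)                # (2)

t = pooled_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# The decision then only requires comparing t with the t-quantile fixed
# in advance, e.g. |t| > t(n1+n2-2; 1-alpha/2) for case c).
```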

As the variances are supposed to be equal, this test leads to the smallest possible sample sizes for a certain α and β. Given the total number of observations n1 + n2, the variance of all possible differences of the means of both populations under consideration is minimised with n1 = n2 = n, i.e. when the two sample sizes are equal. The value of n can be determined iteratively from (cf. Rasch, Verdooren & Gowers, 1999, p. 83):

n = CEIL( (2σ²/δ²) · [ t(2n−2; P) + t(2n−2; 1−β) ]² ),   (3)

where P = 1–α for a one-sided alternative hypothesis and P = 1–α/2 for the two-sided alternative hypothesis – here CEIL is an abbreviation of ceiling and means: „take the argument as it is if it is an integer, otherwise take the next larger integer“.
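Formula (3) must be iterated because n appears on both sides. A common shortcut replaces the t-quantiles by normal quantiles (this is also what is done for the remark to Figure 3 below); a sketch with the standard library, our own function name:

```python
import math
from statistics import NormalDist

def n_per_group(alpha, beta, delta, sigma, two_sided=True):
    """Normal-quantile approximation to formula (3): per-group sample size
    for the two-sample t-test. The exact, iterated t-quantile version
    gives a slightly larger n."""
    p = 1 - alpha / 2 if two_sided else 1 - alpha
    z = NormalDist()
    return math.ceil(2 * (sigma / delta) ** 2
                     * (z.inv_cdf(p) + z.inv_cdf(1 - beta)) ** 2)

n = n_per_group(0.05, 0.2, 5.0, 5.0)
# With delta = sigma this gives 16 per group (N = 32); the exact formula (3)
# gives 17 per group (N = 34), as in the numerical example of Section 7.
```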

If no a-priori information is available concerning the common variance σ², we can use an algorithm already described by Rasch (2003). However, it is very easy to calculate the sample size with the module „MEANS“ of the CADEMO software (Rasch, Herrendörfer, Bock, Victor & Guiard, 1996, 1998; or Rasch, Verdooren & Gowers, 1999).

6. Formalization of statistical hypothesis testing – the approach of sequential testing

Sequential testing means a multi-staged, sequential approach to data sampling and analysis. If, for reasons of generality, we consider in the following the symbol θ for some parameter in question – in our problem above θ = δ – then we may have to test

H0 : θ = θ0 against H1 : θ = θ1, with θ0 ≠ θ1.   (4)


Contrary to a conventional test with only two possible decisions (accepting or rejecting H0), a sequential test leads, after each step of data sampling, to one of the following three (!) decisions: Rejecting H0 or rejecting H1 or continuing with data sampling.

The random „walk“ during a sequential test can be illustrated graphically by a sequential path in a coordinate system: the sample size, or any monotonic function of it, is shown on the abscissa, and the values of the test statistic calculated from the cumulating data are shown on the ordinate. In this coordinate system a „continuation region“ is defined, indicating that data sampling should not be stopped. As long as the sequential path is inside this region, the procedure continues with collecting the next observation. If one of the two boundaries is reached or passed by this path, one of the two decisions – „rejecting H1“ or „rejecting H0“ – is established.

The first sequential test was the sequential probability ratio test of Wald (1946). However, this test and other sequential probability ratio tests based on Wald's test (such as Hajnal's test described in Rasch & Guiard, 2003; we call these tests Wald-type tests) have a serious disadvantage: they end with probability 1 after a finite number of steps, but nobody knows the upper bound for the number of steps. These tests are therefore called „open“ tests. In the case of testing a mean against some hypothesized constant, the continuation region of the Wald test is given by two parallel lines. As long as the path is between the two lines, the procedure continues. If the upper line is passed, then the alternative hypothesis is to be rejected, that is to say, the null-hypothesis has stood the test (or, to put it sloppily: the null-hypothesis is to be „accepted“); if the lower line is passed, then the null-hypothesis is to be rejected. Among all sequential tests with the same risks, Wald's test requires the smallest expected sample size.

On the other hand, so-called closed tests have been developed, which do end after a maximum number of steps. As a consequence, there is a small increase in the expected sample size in comparison to that of the Wald test, but this expected sample size is often still much smaller than the sample size of a conventional test with a sample size fixed in advance. One of these closed tests is the „triangular sequential test“, the name of which was chosen because the continuation region is a triangle.

The triangular sequential test was described by Whitehead (1983) and has been slightly improved by Schneider (1992). We consider an observed sample (y1, y2, …, yn) of n independently and identically distributed random variables yi with probability density f(yi, θ) for a given but unknown parameter θ. For fixed yi we call this function the likelihood function. The log-likelihood of a single observation yi is

l(yi, θ) = ln f(yi, θ).   (5)

Assuming appropriate continuity conditions for f with respect to θ, a Taylor series expansion with approximately negligible third and higher order terms gives

l(yi, θ) ≈ const. + θ·Zi − ½·θ²·Vi,   (6)

where

Zi = lθ(yi, 0) and Vi = −lθθ(yi, 0)   (7)

– the suffixes θ and θθ indicate first and second order derivatives with respect to θ. The Z-values are called the „efficient score statistic“ and the V-values „Fisher's information statistic“.

We consider the sample statistic Z = Σ(i=1..n) Zi and correspondingly the sample statistic V.

It was shown by Wald (1947) that the random variable Z as a function of a random sample is asymptotically normally distributed with expectation θV and variance V. Hence, the continuation region of the triangular sequential test for α = β is defined by the inequality

– a + 3cV < Z < a + cV   (8)

with

a = 2·ln(1/(2α)) / θ1 and c = θ1/4.   (9)

The test terminates at the latest at

Vmax = a/c = 8·ln(1/(2α)) / θ1² and Zmax = 2a = 4·ln(1/(2α)) / θ1.   (10)
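The design constants of the triangular test for α = β, namely a = 2·ln(1/(2α))/θ1 and c = θ1/4, with the test terminating by V = a/c at the latest, can be sketched as follows (our own helper names; the numbers for α = 0.05, θ1 = 1 are an illustration):

```python
import math

def triangular_design(alpha, theta1):
    """Boundary constants of the triangular test for alpha = beta:
    continue sampling while -a + 3cV < Z < a + cV."""
    a = 2.0 * math.log(1.0 / (2.0 * alpha)) / theta1
    c = theta1 / 4.0
    v_max = a / c        # latest possible stopping point on the V axis
    z_max = 2.0 * a      # corresponding Z value (apex of the triangle)
    return a, c, v_max, z_max

a, c, v_max, z_max = triangular_design(0.05, 1.0)

def keep_sampling(z, v, a, c):
    """True while the path (V, Z) is still inside the continuation region."""
    return -a + 3.0 * c * v < z < a + c * v
```

The two boundary lines meet at V = a/c, which is what makes the test „closed“: the sample size can never exceed the corresponding maximum.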

How to proceed if α ≠ β is shown by Schneider (1992). The application of the triangular sequential test can be carried out very easily with the programs PEST or TRIQ of CADEMO, even in cases where α ≠ β.

Note that sequential testing must be carried out on the data in exactly the order in which they are observed. This is automatically guaranteed if an analysis is done after each observation. That is to say, the data must not be rearranged in order to suit the theory of sequential testing. The same holds for continuing sequential testing after a final decision (rejection of one of the hypotheses) has already been made: this is not permitted either.

7. A numerical example

Aiming for a conventional Student's t-test, we wish to determine the minimal sample size for testing the null-hypothesis of the equality of two means of independent normally distributed variables with a common variance; the alternative hypothesis is two-sided. We choose α = 0.05, β = 0.2 and δ = 5. A rough idea about σ² from earlier experiments might be σ̂² = 25, as a consequence of which the difference of practical interest is about σ. Then


using formula (3) we obtain 17 observations to be sampled in each group. As a matter of fact, this result is always the same as long as α = 0.05, β = 0.2 and the ratio δ/σ = 1.

After determining the sample size in this way, we have to use α = 0.05, and we reject the null-hypothesis if, after data sampling, the t-value in (2) with n1 = n2 = 17 exceeds the 97.5% quantile of the t-distribution with 32 d.f., i.e. if t ≥ 2.037. If the t-value calculated from (2) also exceeds the value which corresponds to the 99.5% quantile of the t-distribution with 32 d.f., this is of no use and must not be emphasized by using the symbol **.

Let us apply a sequential triangular test and again choose α = 0.05, β = 0.2, and δ = 5. Let us assume that the first four observations are as given in Table 1 above the bold line.

Table 1: The data analysed by TRIQ – the first four observations above the bold line; all observations in the order of their occurrence

Step number   Group   Observed value   n1   n2
 1            1       4.6              1
 2            2       5.2                   1
 3            1       5.1              2
 4            2       6.2                   2
 5            1       4.9              3
 6            2       5.9                   3
 7            1       4.8              4
 8            2       5.2                   4
 9            1       4.4              5
10            2       6.7                   5
11            1       5.3              6
12            2       6.9                   6
13            1       4.8              7
14            2       5.9                   7

In Figure 3, the path of the Z-values from (7) along the sample size (or along the V-values from (7); the curve starting from point zero on the ordinate) at first remains inside the grey area (continuation region), so we have to continue. After 10 further observations have been sampled, we terminate (see Tab. 1 again). Because the curve of Z-values finally leaves the grey area, we have to reject the null-hypothesis. Of course, this decision is possible exclusively at the given α-level of 0.05; no other α-level is feasible for interpretation. In order to emphasize the senselessness of the „asterisks practice convention“, we performed the conventional Student's t-test for these 14 observations and obtained by means of SPSS the result t = −4.207 and „Sig. (two-tailed)“ = .001, which many researchers would feel encouraged to promote with ***!
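This SPSS result can be reproduced directly from the data of Table 1 with formulas (1) and (2) (our own check; group 1 and group 2 collect the odd- and even-numbered steps):

```python
import statistics

# The 14 observations of Table 1, separated by group.
group1 = [4.6, 5.1, 4.9, 4.8, 4.4, 5.3, 4.8]
group2 = [5.2, 6.2, 5.9, 5.2, 6.7, 6.9, 5.9]

n1, n2 = len(group1), len(group2)
sp2 = ((n1 - 1) * statistics.variance(group1)
       + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)   # formula (1)
t = ((statistics.mean(group1) - statistics.mean(group2))
     / sp2 ** 0.5 * (n1 * n2 / (n1 + n2)) ** 0.5)                  # formula (2)
# t comes out near -4.21, matching the SPSS value quoted in the text.
```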


Figure 3: Graph for the triangular sequential design for α = 0.05, β = 0.2 and δ = σ. Remark: The fixed sample size needed for the conventional t-test was approximately calculated by TRIQ by using the normal quantiles in place of the quantiles of the t-distribution, and is thus given in the figure as 32 (N = n1 + n2) in place of the 34 calculated from (3)

The realised sample size N = n1 + n2 = 14 was smaller than that (N = n1 + n2 = 34) needed for the t-test, but also smaller than the expected sample sizes of TRIQ (22 and 16, respectively).

8. A simulation study for TRIQ

We have seen that TRIQ is a very good alternative to the conventional t-test if sequential sampling is reasonable. But questions still remain:
• The risks α, β applied in TRIQ – as in any sequential test – are known to be just approximations of the values fixed in advance; how accurate are these approximations?
• The range of the finally realized sample sizes is also of interest.
• And then there is the question of how robust TRIQ is with regard to non-normality (the fact that the t-test is quite robust was shown by Rasch & Guiard, 2004).


In the following, therefore, the average sample size of the triangular sequential t-test will be estimated by a simulation study.

Because probability estimations of α and β are aimed for, the statistical test should be re-peated 10 000 times (cf. for instance Rasch & Guiard, 2004). That is, TRIQ has been used 10 000 times for each of the following situations – the hypotheses were

H0: µ1 = µ2,   HA: µ1 − µ2 = 5 = δ,

the standard deviation was σ = 5, and α and β were fixed by the pairs:
Case 1: α = 0.05 and β = 0.2
Case 2: α = 0.05 and β = 0.1
Case 3: α = 0.01 and β = 0.1
Case 4: α = 0.01 and β = 0.05
For each of these cases a normally distributed variable was simulated in situation 1. In situations 2, 3, and 4 a Fleishman distribution was simulated (see Rasch & Guiard, 2004):
Situation 1: normal distribution
Situation 2: distribution with skewness 0 and kurtosis 4
Situation 3: distribution with skewness 1 and kurtosis 4
Situation 4: distribution with skewness 2 and kurtosis 7
The last one is very extreme and hardly occurs in practical experimentation, but the three non-normal situations serve to give an idea of just how robust the test is with regard to non-normality.
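The repetition scheme can be illustrated in miniature. TRIQ itself is not reproduced here; as a stand-in, the sketch below applies the same Monte-Carlo bookkeeping to the fixed-sample t-test under a true null-hypothesis, with the run count reduced for speed:

```python
# Illustration of the Monte-Carlo scheme: repeat a test many times under a
# true H0 and count rejections to estimate the actual type-I risk. TRIQ is
# not reproduced here; the ordinary two-sample t-test stands in for it, and
# the run count is reduced from 10 000 for speed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
runs, n, sigma, alpha = 2000, 17, 5.0, 0.05

rejections = 0
for _ in range(runs):
    x = rng.normal(0.0, sigma, size=n)  # mu1 = mu2 = 0, i.e. H0 is true
    y = rng.normal(0.0, sigma, size=n)
    _, p = stats.ttest_ind(x, y)
    rejections += p < alpha

f_rej = rejections / runs
print(f_rej)  # should lie close to the nominal alpha = 0.05
```

In the study the same bookkeeping additionally records the sample size of every run, which is what yields the N(A) and N(B) entries of Table 2.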

The normally distributed random variables were generated by the standard generator of SPSS. The distributions in the three other situations were generated from the Fleishman system (Fleishman, 1978) with coefficients taken from Table A3 in Nürnberg (1982). A Fleishman random number y with mean 0, variance 1, skewness γ1 and kurtosis γ2 can be obtained from a normal random number u with mean 0 and variance 1 by the transformation

y = −c + b·u + c·u² + d·u³,

the values of b, c, and d are:

Situation   b                c                d
2           0.737381231250   0                0.080925091736
3           0.776585015190   0.116977231170   0.065499503056
4           0.761585274860   0.260022598940   0.053072273491

The transformed values were finally multiplied by the standard deviation σ = 5 and the means µ1 and µ2 respectively were added. From each of the 10 000 runs in each (case × situation)-combination the sample size needed was recorded and it was then decided whether the


null-hypothesis should be accepted or rejected. Over the 10 000 runs we calculated the average, minimum, and maximum sample size as well as the relative frequencies of rejecting (frej) and accepting (facc) the null-hypothesis. Two values of the non-centrality parameter, analogous to the example in paragraph 7, were simulated:

A: µ1 − µ2 = 0 and B: δ = µ1 − µ2 = σ = 5.

In Table 2, frej is given for case A, thus estimating the nominal risk α (0.05; 0.01), and facc is given for case B, thus estimating the nominal risk β (0.2; 0.1; 0.05) at δ = µ1 − µ2 = σ = 5. The average total size N = n1 + n2 is given for case A as N(A) and for case B as N(B).
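As a check on the Fleishman setup described above, here is a minimal sketch for situation 3 (skewness 1, kurtosis 4). The moment identity Var(y) = b² + 6bd + 2c² + 15d² (with a = −c) is standard for the Fleishman system and is what pins the variance to 1:

```python
# Sketch: Fleishman random numbers for situation 3 (skewness 1, kurtosis 4),
# using the coefficients from the table above (taken from Nuernberg, 1982).
import numpy as np

b, c, d = 0.776585015190, 0.116977231170, 0.065499503056

def fleishman(u):
    # y = -c + b*u + c*u^2 + d*u^3 maps standard-normal u to mean 0, variance 1
    return -c + b * u + c * u**2 + d * u**3

# Moment check: with a = -c the variance is b^2 + 6*b*d + 2*c^2 + 15*d^2
var_y = b**2 + 6 * b * d + 2 * c**2 + 15 * d**2
print(round(var_y, 6))  # ≈ 1.0

# As in the study, scale by sigma = 5 and add the group mean (here mu = 0)
rng = np.random.default_rng(0)
sample = 5.0 * fleishman(rng.standard_normal(100_000))
```

The skewness and kurtosis targets are met in the same way through the third- and fourth-moment equations of the Fleishman (1978) system; only the variance identity is checked here.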

Table 2: Relative frequencies of accepting (facc) and rejecting (frej) the null-hypothesis, with empirical 0.025- and 0.975-quantiles (in brackets) of these frequencies from 10 000 simulations for each (case × situation)-combination; the sums N(A) and N(B) of the averages of both sample sizes needed for a decision, with empirical 0.025- and 0.975-quantiles (in brackets). The sum N of the two sample sizes needed for the respective non-sequential fixed-sample-size t-test is given for each case (calculated by CADEMO from (3)).

Case 1 (α = 0.05, β = 0.2, N = 34):
  Situation 1: frej .054 [.050; .058], N(A) 22.9 [7; 38.8]; facc .182 [.175; .189], N(B) 20.5 [4.6; 36.4]
  Situation 2: frej .053 [.049; .057], N(A) 21.4 [7.3; 35.5]; facc .153 [.146; .160], N(B) 19.7 [4.4; 35]
  Situation 3: frej .047 [.043; .051], N(A) 22.3 [6.6; 38]; facc .158 [.151; .165], N(B) 20 [4.5; 35.5]
  Situation 4: frej .045 [.041; .049], N(A) 22.9 [6.7; 39.1]; facc .142 [.136; .148], N(B) 19.5 [4.2; 34.8]

Case 2 (α = 0.05, β = 0.1, N = 46):
  Situation 1: frej .049 [.045; .054], N(A) 29.8 [11.6; 48]; facc .089 [.084; .094], N(B) 27.6 [6.5; 48.7]
  Situation 2: frej .047 [.043; .051], N(A) 28 [9.3; 46.7]; facc .083 [.078; .088], N(B) 26.2 [6.1; 46.3]
  Situation 3: frej .049 [.045; .053], N(A) 29.1 [7.7; 50.5]; facc .082 [.077; .087], N(B) 27.5 [7.3; 47.7]
  Situation 4: frej .045 [.041; .049], N(A) 30.3 [10.3; 50.3]; facc .074 [.069; .079], N(B) 26.5 [6.3; 46.7]

Case 3 (α = 0.01, β = 0.1, N = 64):
  Situation 1: frej .009 [.008; .010], N(A) 53.8 [27.1; 80.5]; facc .086 [.081; .091], N(B) 43.4 [15.7; 71.1]
  Situation 2: frej .009 [.008; .010], N(A) 46.4 [11; 81.8]; facc .087 [.082; .092], N(B) 43.8 [16.3; 71.3]
  Situation 3: frej .008 [.007; .009], N(A) 47.1 [21.5; 72.7]; facc .089 [.084; .094], N(B) 43.5 [16.2; 70.8]
  Situation 4: frej .008 [.007; .009], N(A) 48 [19; 77]; facc .081 [.076; .086], N(B) 42.8 [15.7; 69.9]

Case 4 (α = 0.01, β = 0.05, N = 72):
  Situation 1: frej .009 [.008; .010], N(A) 52.2 [18.9; 85.5]; facc .046 [.042; .050], N(B) 52.5 [19.2; 85.8]
  Situation 2: frej .010 [.009; .011], N(A) 46.9 [26.4; 67.4]; facc .042 [.039; .045], N(B) 49.9 [18.2; 81.6]
  Situation 3: frej .008 [.007; .009], N(A) 55.4 [21.8; 89]; facc .050 [.046; .054], N(B) 49.6 [17.2; 82]
  Situation 4: frej .008 [.007; .009], N(A) 56.5 [25.4; 87.6]; facc .043 [.040; .046], N(B) 50.7 [18.5; 82.9]

Let us discuss the entry for case 1 × situation 1. This is the case of the assumption of normality (cf. paragraph 7). We found frej: .054 [.050; .058]. This means that the actual frequency of rejecting a true null-hypothesis is on average 0.054 and lay in 9 500 of the runs between 0.050 and 0.058 (for a nominal α = 0.05). Analogously, facc: .182 [.175; .189] means that the actual frequency of accepting the null-hypothesis when it is wrong (for µ1 − µ2 = σ = 5) is on average 0.182 (i.e. the power is 0.818) and lay in 9 500 of the runs between 0.175 and 0.189 (for a nominal β = 0.2, i.e. a nominal power of 0.8). The sample sizes N(A) varied in


9 500 of the runs between 7 and 38.8 with a mean of 22.9, and the sample sizes N(B) varied in 9 500 of the runs between 4.6 and 36.4 with a mean of 20.5. It is not surprising that a test developed under normality assumptions behaves better in some of the non-normal cases than in the normal case: this was already found in the robustness research reported in Rasch and Guiard (2004).

9. Conclusion

In all simulations, even those for extremely non-normal distributions, the triangular sequential test is more powerful than the conventional t-test (the actual type-II-risk is always below the value fixed in advance). Only in case 1 × situation 1 is the actual type-I-risk above the value fixed in advance; things improve with smaller type-II- and type-I-risks and also in some cases of non-normality. In other words, we should even be glad if we have to deal with non-normality, and there is no need to resort to non-parametric tests – that non-parametric tests are sensitive to violations of the assumed continuity of the underlying distribution is shown in Rasch, Häusler, Guiard, Herrendörfer and Kubinger (in prep.). Furthermore, the average sample size does not increase with increasing skewness or kurtosis of the distribution. In fact, triangular sequential t-tests prove to be completely robust against any violation of the normal distribution.

Hence, testing sequentially can be advised for the following reasons:
• There is no opportunity to use asterisks.
• Time and money can be saved because fewer observations are required.
• The test is extremely robust against non-normality.

References

1. Belmont, L. & Marolla, F.A. (1973): Birth order, family size, and intelligence. Science, 182, 1096-1101.

2. Fleishman, A. J. (1978): A method for simulating non-normal distributions. Psychometrika, 43, 521-532.

3. Kubinger, K.D. & Gittler, G. (1983): Der Zusammenhang von Geschwisterkonstellation und Intelligenz in kritischer Sicht [Some critical remarks: The relationship between birth or family size on intelligence]. Sprache und Kognition, 2, 254-263.

4. Nürnberg, G. (1982): Beiträge zur Versuchsplanung für die Schätzung von Varianzkomponenten und Robustheitsuntersuchungen zum Vergleich zweier Varianzen [On the estimation of variance components and the robustness of comparisons of two variances]. Probleme der angewandten Statistik, 6, Dummerstorf-Rostock.

5. Rasch, D. (2003): Determining the Optimal Size of Experiments and Surveys in Empirical Research. Psychology Science, 45, 3-47.

6. Rasch, D., Herrendörfer, G., Bock, J., Victor, N., & Guiard, V. (1996): Verfahrensbibliothek Versuchsplanung und -auswertung, Band I. [Collection of procedures for designing and analysing experiments, vol. I]. Munich: Oldenbourg.


7. Rasch, D., Herrendörfer, G., Bock, J., Victor, N., & Guiard, V. (1998): Verfahrensbibliothek Versuchsplanung und -auswertung, Band II [Collection of procedures for designing and analysing experiments, vol. II]. Munich: Oldenbourg.

8. Rasch, D. & Guiard, V. (2004): The Robustness of Parametric Statistical Methods. Psychology Science, 46, 175-208.

9. Rasch, D., Verdooren, L.R. & Gowers, J.I. (1999): Fundamentals in the Design and Analysis of Experiments and Surveys - Grundlagen der Planung und Auswertung von Versuchen und Erhebungen. Munich: Oldenbourg.

10. Rasch, D., Häusler, J., Guiard, V., Herrendörfer, G. & Kubinger, K.D. (in prep.): How robust are non-parametric significance tests in cases of non-continuous distributions?

11. Schneider, B (1992): An Interactive Computer Program for Design and Monitoring of Sequential Clinical Trials. Proceedings of the XVIth International Biometric Conference 1992, Hamilton, New Zealand (pp. 237-250).

12. Wald, A. (1947): Sequential Analysis. New York: Wiley.
13. Whitehead, J. (1983): The design and analysis of sequential clinical trials. Chichester: Ellis Horwood.
14. Zajonc, R.B. (1976): Family configuration and intelligence. Science, 192, 227-236.
