HYPOTHESIS TESTING - Website of the Department ...fouskakis/SS/hypothesis...

AMS 5

HYPOTHESIS TESTING

Hypothesis Testing

• Was it due to chance, or something else?

• Decide between two hypotheses that are mutually exclusive on the basis of evidence from observations.

Test of Significance

• A certain brand of tobacco has an average nicotine content of 2.5 milligrams, while another one has an average of only 1.5 milligrams. A cigarette manufacturer receives an unlabeled shipment of tobacco and needs to determine the nicotine content using a sample of the tobacco.

• A certain type of vaccine is known to be only 25% effective over a period of 2 years. A new type of vaccine is being tested on 2000 people chosen at random. We want to test if this new vaccine is more effective than the original one.

• A machine for filling bottles of soda has to put 333 ml of liquid in each bottle. If the average amount is too low or too high with respect to the expected content then the machine is considered to be out of control. The machine is regularly inspected to check whether it is out of control by taking a sample of bottles.

• A balance designed for precision weighing in a lab has to be kept very well calibrated. To test the calibration of the balance several measurements of the weight of an object are done. If they differ too much then the balance has to be re-calibrated.

In all these examples a decision has to be taken based on the numbers in a sample. These numbers are subject to uncertainty and we have to decide if the differences that we observe are only due to chance or not.

Null and Alternative Hypotheses

A bill is proposed to simplify the tax code. The proposer claims that the bill is `revenue-neutral', that is, it will not lower tax revenues. A simulation is run using 100 tax returns chosen at random and the differences between the tax paid using the old rules and the tax that would be paid using the new rules are recorded. The average difference comes out to -$219 with a standard deviation of $725. Can we claim that the new rules are revenue-neutral?

We can put the problem in these terms. There are two hypotheses:

NULL HYPOTHESIS, H0

and

ALTERNATIVE HYPOTHESIS, H1

Under the null hypothesis there is no difference in revenue and the fact that the observed value is not 0 is totally due to chance. Under the alternative hypothesis the difference is real.

For the examples that we considered at the beginning we have that:

• Cigarette: H0: the mean nicotine content is 1.5. H1: the mean nicotine content is 2.5.

• Vaccine: H0: the proportion is 25%. H1: the proportion is higher than 25%.

• Soda Bottles: H0: the average amount of liquid is 333 ml. H1: the average amount of liquid is not equal to 333 ml.

• Balance: H0: the device is calibrated. H1: the device is not properly calibrated.

Test Statistics

How do we test the null hypothesis against the alternative? Back to the tax example. Suppose the null is true. Then the difference should be $0. How `large' is -$219 with respect to $0? To answer this question we convert to standard units. Given that the sample was of size 100, the SE of the average is $725 divided by the square root of 100, which is approximately $72:

$$\frac{-\$219 - \$0}{\$72} \approx -3$$

so the difference between the value under the null and the observed value is -3 standard units. The probability of the interval to the left of -3 is about 0.001, that is, one chance in 1,000. So, under the null hypothesis, -$219 is a very unlikely value.
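The arithmetic of the tax example can be checked with a short Python sketch. It uses `statistics.NormalDist` from the standard library in place of a normal table; the variable names are illustrative and not part of the original slides.

```python
from math import sqrt
from statistics import NormalDist

n = 100             # number of tax returns in the sample
sample_sd = 725.0   # SD of the differences, in dollars
observed = -219.0   # average difference, in dollars
expected = 0.0      # value of the average under the null hypothesis

se = sample_sd / sqrt(n)          # SE of the average: 72.5
z = (observed - expected) / se    # about -3
p_left = NormalDist().cdf(z)      # area under the normal curve to the left of z

print(f"SE = {se:.1f}, z = {z:.2f}, P = {p_left:.4f}")
```

Without rounding the SE to $72, z comes out slightly below -3, and the left-tail probability stays close to one chance in 1,000.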

In general we calculate a test statistic given by

$$z = \frac{\text{observed} - \text{expected}}{\text{SE}}$$

which is referred to as the z-test. Once the z-test is calculated we have to decide whether its value is `large' or `small'. In the tax example we look at the probability of the left tail of the normal curve, below the value of the z-test. If this probability is small then the value of the z-test is far from the center of the distribution. This probability is called the observed significance level, or P-value.


The smaller the P-value, the stronger the evidence against the null, but

Making a test of significance

To make a test of significance you need to:

• set up the null hypothesis

• pick a test statistic to measure the difference between the data and what is expected under the null hypothesis

• compute the test statistic and the corresponding observed significance level.

A small observed significance level implies that the data are far from the values expected from the model under the null hypothesis.

What is a small observed significance level? This is somewhat arbitrary, but it is usually considered that if P is less than 5% the results are significant. If P is less than 1% the results are highly significant.
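The steps above can be sketched as a small Python helper. The names `z_test` and `label` are hypothetical, chosen only for illustration, and the `alternative` argument is an assumption about how one might select the tail that matches the alternative hypothesis.

```python
from statistics import NormalDist

def z_test(observed, expected, se, alternative="less"):
    """Return (z, P) for a one-sided z-test.

    alternative="less" uses the left tail, "greater" the right tail.
    """
    z = (observed - expected) / se
    left = NormalDist().cdf(z)
    if alternative == "less":
        p = left
    elif alternative == "greater":
        p = 1.0 - left
    else:
        raise ValueError("alternative must be 'less' or 'greater'")
    return z, p

def label(p):
    """Conventional wording for a P-value, using the 5% and 1% cutoffs."""
    if p < 0.01:
        return "highly significant"
    if p < 0.05:
        return "significant"
    return "not significant"

# Tax example: observed average -219, expected 0, SE 72.5
z, p = z_test(-219, 0, 72.5, alternative="less")
print(round(z, 2), label(p))
```

The label is only a shorthand for the conventional cutoffs; the P-value itself carries more information.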

Examples

1. A random sample of 85 8th graders has a mean score of 265 with an SD of 55 on a national math test. A State Administrator claims that the mean score of 8th graders on the examination is above 260. Is there enough evidence to support the administrator's claim? The hypotheses are:

H0: mean ≤ 260 vs H1: mean > 260

The test statistic is obtained by changing to standard units:

$$z = \frac{265 - 260}{55/\sqrt{85}} \approx 0.838$$

The probability that a standard normal is above 0.838 is about 20%. This is a rather large P-value, so there does not seem to be enough evidence to reject H0.

2. A light bulb manufacturer guarantees that the mean life of the bulbs is at least 750 hours. A random sample of 36 light bulbs has a mean of 725 hours and a standard deviation of 60 hours. Is there enough evidence to reject the manufacturer's claim? The hypotheses are:

H0: mean ≥ 750 vs H1: mean < 750

The test statistic is obtained by changing to standard units:

$$z = \frac{725 - 750}{60/\sqrt{36}} = -2.5$$

The probability that a standard normal is below -2.5 is about 0.6%. There is strong evidence to reject the manufacturer's claim.
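Both examples can be recomputed from their summary numbers with the sketch below. It is meant to show how the tail follows the direction of H1: the right tail when H1 says "greater", the left tail when H1 says "less". With the numbers as stated, 60/√36 is exactly 10, so the second statistic is -2.5. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

# Example 1: H1 says mean > 260, so the P-value is the right tail.
z1 = (265 - 260) / (55 / sqrt(85))
p1 = 1 - std_normal.cdf(z1)

# Example 2: H1 says mean < 750, so the P-value is the left tail.
z2 = (725 - 750) / (60 / sqrt(36))
p2 = std_normal.cdf(z2)

print(f"Example 1: z = {z1:.3f}, P = {p1:.3f}")
print(f"Example 2: z = {z2:.3f}, P = {p2:.4f}")
```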

Binary boxes

Consider again the problem of testing the new vaccine. This is a binary model since we can classify the population in two groups: the group of people for which the vaccine was effective and that for which the vaccine was not. Under the null hypothesis the box model that generates the sample consists of a box in which 25% of the tickets are marked 1 (effective) and 75% are marked 0 (not effective), since there is 25% chance that the vaccine is effective.

Suppose that the number of people in the sample (of 2000 people) for which the vaccine was effective is 534. According to the null hypothesis the expected number would be 500. Is the 34 people difference large enough to reject the null hypothesis and claim that the new vaccine is more effective?

We need to calculate the standard units of the difference between 500 and 534. Under the null hypothesis the SD of the box is

$$\text{SD} = \sqrt{0.25 \times 0.75} \approx 0.43$$

so the standard error is

$$\text{SE} = \sqrt{2000} \times 0.43 \approx 19.23$$

Then:

$$z = \frac{534 - 500}{19.23} \approx 1.768$$

The observed significance level is given by the area under the normal curve corresponding to the interval above 1.768. This is around 4%, which is small enough to conclude that the difference is statistically significant. So there is evidence to support the claim that the new vaccine is more effective than the standard.
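A minimal Python sketch of the same calculation for a 0-1 box follows. It keeps the SD of the box unrounded, so the SE and z differ slightly from the values obtained with SD ≈ 0.43; the conclusion is the same. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 2000         # people in the sample
p0 = 0.25        # fraction of tickets marked 1 under the null hypothesis
observed = 534   # people for whom the vaccine was effective

expected = n * p0                    # expected number of 1s: 500
sd_box = sqrt(p0 * (1 - p0))         # SD of a 0-1 box, about 0.433
se_sum = sqrt(n) * sd_box            # SE of the sum of the draws
z = (observed - expected) / se_sum
p_right = 1 - NormalDist().cdf(z)    # H1 says "more effective": right tail

print(f"SE = {se_sum:.2f}, z = {z:.3f}, P = {p_right:.3f}")
```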

The t-test

The examples that we have seen so far rely on the fact that the sample size is large. So, even when the SD of the box is unknown, we can still use the normal curve to obtain the observed significance level of the test. This is not the case when the sample size is small. In this case we need a modification of the z-test due to `Student', a pen name for a statistician called Gosset.

Consider the following example. To check the calibration of an instrument, five measurements of the concentration of carbon monoxide (CO) are taken from a gas sample where the concentration is precisely controlled to be 70 parts per million (ppm):

78 83 68 72 88

The null hypothesis is that the device is calibrated and so the average of the measurements is 70 ppm. The average of the sample is 77.8 ppm, the SD is 7.22 ppm and thus the SE of the average is

$$7.22/\sqrt{5} \approx 3.23$$

The z-test can be obtained as

$$z = \frac{77.8 - 70}{3.23} \approx 2.4$$

To determine the observed significance level we calculate the area to the right of 2.4 under the normal curve. This is less than 1%, which looks like strong evidence against the null hypothesis. Unfortunately we have to remember that the SD that we have calculated is NOT the SD of the box. It is the SD of the sample, whose size is fairly small, and thus the approximation is not very precise. We correct the procedure with the following steps.

Step 1: Consider a different estimate of the SD:

$$\text{SD}^+ = \sqrt{\frac{\text{number of measurements}}{\text{number of measurements} - 1}} \times \text{SD}$$

Notice that SD+ > SD.

In our previous example we get

$$\text{SD}^+ = \sqrt{\frac{5}{4}} \times 7.22 \approx 8.07$$

so the SE of the average becomes

$$8.07/\sqrt{5} \approx 3.61$$

as opposed to 3.23, reflecting a higher level of uncertainty.

Then the test statistic becomes

$$t = \frac{77.8 - 70}{3.61} \approx 2.2$$
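Step 1 can be checked on the five CO measurements with the Python sketch below. It relies on the standard library: `statistics.pstdev` divides by the number of measurements (the SD used in the text) and `statistics.stdev` divides by that number minus 1, which is the same as SD+. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

data = [78, 83, 68, 72, 88]   # CO measurements, in ppm
mu0 = 70                      # concentration under the null hypothesis, in ppm
n = len(data)

sd = pstdev(data)                   # SD of the sample, about 7.22
sd_plus = sqrt(n / (n - 1)) * sd    # corrected SD, about 8.07

# stdev() already divides by n - 1, so it agrees with SD+
assert abs(sd_plus - stdev(data)) < 1e-9

se = sd_plus / sqrt(n)              # SE of the average, about 3.61
t = (mean(data) - mu0) / se         # about 2.2

print(f"SD = {sd:.2f}, SD+ = {sd_plus:.2f}, SE = {se:.2f}, t = {t:.2f}")
```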

Step 2: To find the observed significance level we cannot use the normal curve any more. We need to use a Student's t curve. This curve depends on the degrees of freedom (DF). These are calculated as

degrees of freedom = number of measurements - 1

A table for the Student's t curves is found at the end of the book. There is one curve for each value of the DF. Each row corresponds to one curve. The probabilities that are reported correspond to the right hand tail, as opposed to what was reported for the normal curve. These curves are symmetric around 0 and for DF above 25 they resemble the normal curve very closely.

Thus, in our example, we need a Student's t curve with 4 DF. The value 2.2 is not present in the table for the row corresponding to 4 DF. The closest value is 2.13, which corresponds to 5%. So the P-value for this test is about 5%, which is much weaker evidence against the null than before.
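When no table is at hand, the right-hand tail of a Student's t curve can be approximated numerically. The sketch below is one possible approach and not the method of the slides: it integrates the t density with Simpson's rule using only the standard library. The function names `t_density` and `t_right_tail` are hypothetical.

```python
from math import exp, lgamma, pi, sqrt

def t_density(x, df):
    """Height of the Student's t curve with df degrees of freedom at x."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2)
    return exp(log_c) / sqrt(df * pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_right_tail(t, df, steps=2000):
    """Area to the right of t (t >= 0), by Simpson's rule on [0, t]."""
    if t < 0:
        raise ValueError("this sketch handles t >= 0 only")
    if steps % 2:
        steps += 1                      # Simpson's rule needs an even count
    h = t / steps
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t            # the curve is symmetric around 0

print(f"{t_right_tail(2.13, 4):.3f}")   # table entry for 5% with 4 DF
print(f"{t_right_tail(2.2, 4):.3f}")    # observed t in the CO example
```

The computed tail for 2.2 lands a little below the 5% read from the nearest table entry, which is consistent with the text's "about 5%".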

Suppose now that 6 measurements are taken with the device:

72 79 65 84 67 77

The average is equal to 74 ppm and the SD is 6.68 ppm. The corrected SD is

$$\text{SD}^+ = \sqrt{6/5} \times 6.68 \approx 7.32$$

The SE of the average is

$$7.32/\sqrt{6} \approx 2.99$$

So the t-test is

$$t = \frac{74 - 70}{2.99} \approx 1.34$$

This time the DF are 5 and if we look at the table we find that the probability corresponding to 1.48 is 10%. Since 1.34 is smaller than 1.48, the P-value is larger than 10%. This is not enough evidence against the null, so the device can be considered to be well calibrated.
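Starting from the raw measurements, the whole recipe fits in one small function. This is a sketch under the assumption that `statistics.stdev` (which divides by the number of measurements minus 1) plays the role of SD+; the name `t_statistic` is hypothetical, and the cutoff 1.48 is the table value quoted in the text.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(data, mu0):
    """Return (t, DF), with t = (average - mu0) / (SD+ / sqrt(n))."""
    n = len(data)
    se = stdev(data) / sqrt(n)   # stdev() is the corrected SD
    return (mean(data) - mu0) / se, n - 1

t, df = t_statistic([72, 79, 65, 84, 67, 77], 70)
cutoff_10pct = 1.48              # table value for a 10% right tail with 5 DF

print(f"t = {t:.2f}, DF = {df}, P-value above 10%: {t < cutoff_10pct}")
```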

Examples

1. An environmentalist estimates that the mean waste recycled by adults in the US is more than 1 pound per person per day. You take a sample of 12 adults and find that the mean waste recycled per person per day is 1.2 pounds with a standard deviation of 0.3 pounds. Can you support the environmentalist's claim? The hypotheses are:

H0: mean ≤ 1 vs H1: mean > 1

The corrected value of the SD is

$$\sqrt{12/11} \times 0.3 \approx 0.313$$

and the test statistic is obtained by changing to standard units:

$$t = \frac{1.2 - 1}{0.313/\sqrt{12}} \approx 2.21$$

The probability that a Student's t with 11 degrees of freedom will be above 2.21 is about 2.5%. This is a rather small P-value, so there seems to be enough evidence to reject H0.

2. A microwave oven repairer says that the mean repair cost for damaged microwave ovens is less than $100. You find that a random sample of 5 ovens has a mean repair cost of $75 with an SD of $12.5. Do you have enough evidence to support the repairer's claim? The hypotheses are:

H0: mean ≥ 100 vs H1: mean < 100

The corrected value of the SD is

$$\sqrt{5/4} \times 12.5 \approx 13.98$$

and the test statistic is obtained by changing to standard units:

$$t = \frac{75 - 100}{13.98/\sqrt{5}} \approx -4.00$$

The probability that a Student's t with 4 degrees of freedom will be below -4.00 is less than 1%. This is a rather small P-value, so there seems to be enough evidence to reject H0.
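When only summary numbers are given, as in the two examples above, the t statistic can be computed directly. The sketch assumes that the quoted SD is the uncorrected sample SD, so the correction factor still has to be applied; the name `t_from_summary` is hypothetical. Carried through without intermediate rounding, the statistics come out at about 2.21 and -4.00.

```python
from math import sqrt

def t_from_summary(n, sample_mean, sd, mu0):
    """t statistic from summary numbers; sd is the uncorrected sample SD."""
    sd_plus = sqrt(n / (n - 1)) * sd   # corrected SD
    se = sd_plus / sqrt(n)             # SE of the average
    return (sample_mean - mu0) / se

t_recycling = t_from_summary(12, 1.2, 0.3, 1.0)   # compare with 11 DF, right tail
t_repair = t_from_summary(5, 75, 12.5, 100)       # compare with 4 DF, left tail

print(f"recycling: t = {t_recycling:.2f}")
print(f"repair:    t = {t_repair:.2f}")
```

Note that SD+/√n simplifies to SD/√(n - 1), which is why the repair example gives an SE of 12.5/2 = 6.25.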

Was the result significant?

• How small does the P-value have to get before you reject the null hypothesis?

• If P-value < 5%: statistically significant.

• If P-value < 1%: highly significant.

• Therefore a P-value of 5.1% is totally different than a P-value of 4.9%!

Investigators should summarize the data, say what test was used, and report the P-value instead of just comparing it to 5% or 1%.

Data Snooping

• "A result which is statistically significant cannot be explained as chance variation." This is false. Even if the null hypothesis is right, there is a 5% chance of getting a difference which the test will call "statistically significant"!

• Therefore an investigator who makes 100 tests can expect to get 5 results which are "statistically significant" due to chance, even if the null hypothesis is right every time!

• To make bad matters worse, investigators often decide which hypotheses to test only after they have seen the data. This is data snooping.
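The "5 out of 100" figure can be illustrated with a small simulation. This is a sketch under assumptions that are not in the slides: every sample is drawn from a normal box with mean 0 and SD 1 (so the null hypothesis is true in every test), each sample has 25 draws, and the test is a one-sided z-test at the 5% level. The function name `count_false_alarms` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

std_normal = NormalDist()

def count_false_alarms(n_tests=100, sample_size=25, level=0.05, rng=random):
    """Count 'significant' z-tests when the null is true in every test."""
    hits = 0
    for _ in range(n_tests):
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]
        se = 1 / sqrt(sample_size)        # SD of the box is known to be 1
        z = (fmean(sample) - 0) / se
        p = 1 - std_normal.cdf(z)         # right-tail P-value
        if p < level:
            hits += 1                     # a false alarm, due to chance alone
    return hits

rng = random.Random(0)                    # fixed seed so the run is repeatable
print(count_false_alarms(rng=rng))
```

Any single run of 100 tests gives a count that varies around 5; averaging many runs brings the count per 100 tests close to 5.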

Practical significance

• Statistical significance and practical significance are two different ideas!

The P-value of a test depends on the sample size. With a large sample, even a small difference can be "statistically significant", that is, hard to explain by the luck of the draw. This doesn't necessarily make it important. Conversely, an important difference may not be statistically significant if the sample is too small.
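The dependence on sample size can be made concrete with a short sketch. It reuses the 5-point difference and the SD of 55 from the 8th-grade example; the sample sizes 1000 and 10000 are hypothetical, and so is the function name `right_tail_p`.

```python
from math import sqrt
from statistics import NormalDist

def right_tail_p(difference, sd, n):
    """P-value of a one-sided z-test for an observed difference."""
    z = difference / (sd / sqrt(n))
    return 1 - NormalDist().cdf(z)

# Same 5-point difference, same SD, growing sample size
for n in (85, 1000, 10000):
    print(n, f"{right_tail_p(5, 55, n):.4f}")
```

The difference of 5 points is the same in every row; only the sample size changes, and with it the P-value.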
