Chapter 8: Hypothesis Testing for Population Proportions The basics of Significance Testing.

Statistical Inference
• We have already discussed confidence intervals for an unknown population parameter, p
• CIs are used when the goal is to estimate an unknown population parameter like p
• This chapter... statistical inference through significance tests
• Evaluate evidence (a statistic) provided by sample data about some claim concerning an unknown population parameter like p
I’m a great free-throw shooter...
• I claim that I make 95% of my basketball free throws.
• To test my claim, I am asked to shoot 20 free throws. I make only 8 of the 20 (only 40%). Now people don’t believe my claim of making 95% of my basketball free throws.
• Making only 8 of 20 attempts would almost never happen if I truly made 95% of my free throws
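To put a number on "almost never", here is a minimal sketch that computes the chance of making 8 or fewer of 20 shots if the 95% claim were true. It assumes the shots are independent with the same success probability (a binomial model), which the slide does not state explicitly; the helper name binom_cdf is made up for this illustration.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Chance of making 8 or fewer of 20 free throws if the true rate is 95%
# (assumes independent shots, an assumption added for this sketch).
prob_8_or_fewer = binom_cdf(8, 20, 0.95)
print(f"P(8 or fewer makes out of 20 | p = 0.95) = {prob_8_or_fewer:.3e}")
```

The printed probability is far below one in a billion, which is why the observed result is such strong evidence against the claim.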
Significance Testing
• Basic idea... An outcome that would rarely happen if a claim were really true is good evidence that the claim is not true.
• Example... I claim that 99% of adult humans are 6 feet tall or taller.
• If my claim were true, it would be very rare for most of the adults in an SRS of 100 to be shorter than 6 feet.
Significance Testing... Let’s begin by knowing μ and σ (unrealistic)
Because paramedic response time is critical to saving lives, several cities monitor these response times. In one city, the mean response time to all accidents involving life-threatening injuries last year was μ = 6.7 minutes with a standard deviation of σ = 2 minutes. The city manager encourages the paramedics to “do better” next year.
At the end of the following year, the city manager selects an SRS of 400 calls involving life-threatening injuries. For this sample, the mean response time was x̄ = 6.48 minutes. Do these data provide good evidence that response times have decreased since last year?
Previous year: μ = 6.7 minutes; σ = 2 minutes
Following year: SRS of 400 with x̄ = 6.48 minutes
Do these data provide good evidence that the response times have decreased?
Remember, statistics vary from sample to sample. Maybe x̄ = 6.48 is a result of sampling variability.
Maybe response time hasn’t improved.
Previous year: μ = 6.7 minutes; σ = 2 minutes
Following year: SRS of 400 with x̄ = 6.48 minutes
Make a claim and see if the data provides evidence against it.
Ho: μ = 6.7 minutes
Ho, null hypothesis; usually no effect, no change, no difference; neutral
A hypothesis always refers to some population parameter, like μ or p (NEVER a sample statistic! We don’t want to make a hypothesis about something we already know.)
Previous year: μ = 6.7 minutes; σ = 2 minutes
Following year: SRS of 400 with x̄ = 6.48 minutes
Ho: μ = 6.7 minutes
We are seeking evidence of a decrease in response time, so
Ha: μ < 6.7 minutes
Ha, alternative hypothesis, claim about population we are trying to find evidence for.
One-sided, only interested in a decrease (in this case), but can be two-sided, such as
Ho: μ = 0 and Ha: μ ≠ 0
Is our x̄ rare in its sampling distribution?
Practice: State the appropriate null hypothesis and alternative hypothesis in each case. Be sure to define your parameter each time.
Larry's car averages 26 miles per gallon on the highway. He switches to a new brand of motor oil that is advertised to increase gas mileage. After driving 3000 highway miles with the new oil, he wants to determine if the average gas mileage has increased.
Parameter: μ = mean gas mileage for Larry’s car on the highway
Ho: μ = 26 mpg Ha: μ > 26 mpg
Practice: State the appropriate null hypothesis and alternative hypothesis in each case. Be sure to define your parameter each time.
A May 2005 Gallup Poll report on a national survey of 1028 teenagers revealed that 72% of teens said they rarely or never argue with their friends. You wonder whether this national result would be true in your school. So you conduct your own survey of a random sample of students at your school.
Parameter: p = proportion of students in your school who rarely or never argue with their friends.
Ho: p = 0.72 Ha: p ≠ 0.72
Explain what is wrong in each situation and why it is wrong
A change is made that should improve student satisfaction with the parking situation at your school. The null hypothesis, that there is an improvement, is tested versus the alternative, that there is no change.
Ho and Ha have been switched. The null hypothesis should be a statement of ‘no change.’
Explain what is wrong in each situation and why it is wrong
A researcher tests the following null hypothesis: H0: x̄ = 10.
The null hypothesis (and the alternative hypothesis as well) should be a statement/claim about a population parameter (like µ), not a sample statistic (like x̄)
Explain what is wrong in each situation and why it is wrong
The Survey of Study Habits and Attitudes (SSHA) is a psychological test that measures students' attitudes toward school and study habits. Scores range from 0 to 200. The mean score for U.S. college students is about 115. A teacher suspects that older students have better attitudes toward school.
Ho: μ = 115 Ha: μ > 120
Ho and Ha must share same numeric value (only change =, >, <, ≠)
Explain what is wrong in each situation and why it is wrong
The Census Bureau reports that households spend an average of 31% of their total spending on housing. A homebuilders association in Cleveland believes that this average is lower in their area. They interview a sample of 40 households in the Cleveland metropolitan area to learn what percent of their spending goes toward housing. Take μ to be the mean percent of spending devoted to housing among all Cleveland households.
H0: p = 31% Ha: p < 31%
Conditions for Significance Tests (just like Confidence Intervals)
• SRS (randomization)
• Normality (for means, proportions; requirements are different)
• Independence (the population must be, or we must be able to reasonably assume that it is, at least 10 times as large as the sample; and one observation has no influence on any other)
Significance tests use test statistics...
Some principles that apply to most tests:
• The test is based on a statistic that compares the hypothesized value of the parameter (e.g., Ho: μ = μ0) with an estimate of the parameter from the sample data (x̄, p̂)
• Values of the estimate far from the parameter value in the direction specified by the alternative hypothesis give evidence against H0
• To assess how far the estimate is from the parameter, standardize the estimate (what does this mean?): test statistic = (estimate – hypothesized value) / (standard deviation of the estimate)
Good evidence against μ = 6.7 minutes...
... But rather μ < 6.7 minutes this year. Data like these would be unlikely if Ho were true. But how unlikely? We need a precise way to measure ‘how unlikely.’
P-Values
A p-value is a quantitative measure of how rare (how unlikely) a finding is
Small p-values are evidence against Ho
Large p-values fail to give evidence against Ho
Definition of a P-Value
The probability, computed assuming that Ho is true, that the observed outcome would take a value as extreme as or more extreme than the actual observed value.
Let’s go back to the paramedics example again...
Ho: μ = 6.7 Ha: μ < 6.7
z = -2.20
(Just FYI, in this example, negative values of z favor Ha over Ho (not always the case))
Remember, the p-value is a probability... or the area under the curve... what’s the area under the curve to the left of z = -2.20?
Ho: μ = 6.7 Ha: μ < 6.7
p-value = 0.0139
A small p-value is strong evidence against Ho
Favors the alternative hypothesis, Ha: μ < 6.7 minutes
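The z statistic and p-value for the paramedic example can be reproduced with a short sketch. It assumes, as the slides do, that σ = 2 is known and that the sampling distribution of x̄ is approximately Normal for n = 400; the variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar = 6.7, 2.0, 400, 6.48

# Standardize the sample mean under Ho: mu = 6.7
standard_error = sigma / sqrt(n)
z = (xbar - mu0) / standard_error

# One-sided test (Ha: mu < 6.7): p-value is the area to the LEFT of z
p_value = NormalDist().cdf(z)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")  # z = -2.20, p-value = 0.0139
```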
What’s the difference between...
Ho: p = 0.2 Ha: p < 0.2
and
Ho: p = 0.2 Ha: p ≠ 0.2
• Because the alternative is two-sided, the P-value is the probability of getting a z at least as far from 0 in either direction as the observed z = 1.20.
If Ha is 2-sided (≠), both directions count
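A small sketch of how the same observed z leads to different p-values under the two alternatives. It takes the z = 1.20 quoted on the slide as given and assumes the standard Normal model for the test statistic.

```python
from statistics import NormalDist

z_observed = 1.20
std_normal = NormalDist()

# Ha: p < 0.2 (one-sided, lower tail): area to the left of z
p_lower_tail = std_normal.cdf(z_observed)

# Ha: p != 0.2 (two-sided): area beyond |z| in BOTH tails
p_two_sided = 2 * (1 - std_normal.cdf(abs(z_observed)))

print(f"one-sided (Ha: p < 0.2):  {p_lower_tail:.4f}")
print(f"two-sided (Ha: p != 0.2): {p_two_sided:.4f}")
```

With a positive z, the lower-tail alternative gets a very large p-value (the data point the "wrong" way), while the two-sided p-value counts both tails.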
Statistical Significance...
• Most of the time, we take one more step to assess evidence against Ho
• We compare the p-value to some predetermined value (versus ‘unlikely’) called a significance level, symbol α (alpha)
• Can think of this as a rejection zone (sketch)
Statistical Significance
• Significance level makes ‘not likely’ more exact, more informative
• Most common α levels are α = 0.05 or α = 0.01
• Interpretation:
– At α = 0.05, data give evidence against Ho so strong it would happen no more than 5% of the time when Ho is true
Statistical Significance
• If the p-value is as small as or smaller than α, we say the data are statistically significant at level α
• Note: ‘significant’ in statistics doesn’t mean important (like in English); it means not likely to happen by chance
Statistically Significant Sketches
• If the p-value is p = 0.03... this is significant at the α = 0.05 level (in rejection zone)
• If the p-value is p = 0.03... this is not significant at the α = 0.01 level (not in rejection zone)
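The decision rule in these sketches is just a comparison, which can be written out as follows. The function name is_significant is hypothetical, introduced only for this illustration.

```python
def is_significant(p_value: float, alpha: float) -> bool:
    """Data are statistically significant at level alpha when p-value <= alpha."""
    return p_value <= alpha

p_value = 0.03
for alpha in (0.05, 0.01):
    verdict = "significant" if is_significant(p_value, alpha) else "not significant"
    print(f"p-value = {p_value} is {verdict} at alpha = {alpha}")
```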
Interpretation/Wording
Reject Ho (Null Hypothesis):
This happens when the sample result is statistically significant: the p-value is at or below α, so a result this extreme is too unlikely to have occurred by chance if Ho were true (we don’t believe the null hypothesis); the test statistic is in the rejection zone
Wording must reference all of the following for a complete interpretation... p-value, α level, reject Ho, and conclusion in context (be cautious about using the words ‘cause’ or ‘prove’).
Interpretation/Wording
Fail to Reject Ho (Null Hypothesis):
This happens when the sample result could plausibly have occurred by chance if Ho were true: the p-value is larger than α, so the test statistic is not in the rejection zone. We lack convincing evidence for the alternative; this is not proof that the null hypothesis is true.
Wording must reference all of the following for a complete interpretation... p-value, α level, fail to reject Ho, and conclusion in context (be cautious about using the words ‘cause’ or ‘prove’)
Tests about a population proportion ...
• Test statistic (1-proportion z test): z = (p̂ – p0) / √(p0(1 – p0)/n)
Conditions for Tests about a population proportion...
• Random Sample ... SRS or randomly selected or randomly assigned
• Large Sample Size; Normality ... np0 ≥ 10 and n(1 – p0) ≥ 10
• Independence ... Population at least 10 times sample size; and each observation has no influence on any other
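The two numeric conditions above can be checked mechanically. This is a minimal sketch with hypothetical helper names; the random-sample condition and the "no influence" part of independence still have to be judged from how the data were collected.

```python
def large_sample_ok(n: int, p0: float, minimum: float = 10) -> bool:
    """Normality condition: expected successes and failures under Ho are both >= 10."""
    return n * p0 >= minimum and n * (1 - p0) >= minimum

def ten_times_rule_ok(population_size: int, n: int) -> bool:
    """Independence condition: population at least 10 times the sample size."""
    return population_size >= 10 * n

# Illustrative values: n = 100 with p0 = 0.75 passes; n = 20 with p0 = 0.95 does not,
# because only 20 * 0.05 = 1 failure is expected.
print(large_sample_ok(100, 0.75))
print(large_sample_ok(20, 0.95))
print(ten_times_rule_ok(population_size=1000, n=100))
```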
Work stress...
According to the National Institute for Occupational Safety and Health, job stress poses a major threat to the health of workers. A national survey of restaurant employees found that 75% said that work stress had a negative impact on their personal lives.
A simple random sample of 100 employees from a large restaurant chain finds that 68 answer “Yes” when asked, “Does work stress have a negative impact on your personal life?” Is this good reason to think that the proportion of all employees in this chain who would say “Yes” differs from the national proportion p0 = 0.75?
H0: p = 0.75 Ha: p ≠ 0.75
We want to test a claim about p, the true proportion of this chain's employees who would say that work stress has a negative impact on their personal lives.
Work stress... Conditions: 1-proportion z test
SRS – stated in problem
Normality – The expected numbers of “Yes” and “No” responses are (100)(0.75) = 75 and (100)(0.25) = 25, respectively. Both are at least 10.
Independence – Since we are sampling without replacement, this “large chain” must have at least (10)(100) = 1000 employees; and we must assume that one employee does not influence the response of any other employee
Work stress....
Calculations for 1-prop z test; use Minitab
1 sample, proportion; change options and data as needed
Work stress...
Interpretation:
Fail to reject Ho. The p-value is over 10% (well above any reasonable α level): there is more than a 10% chance of obtaining a sample result as unusual as or even more unusual than ours (p̂ = 0.68) when the null hypothesis is true. We have insufficient evidence to suggest that the proportion of this restaurant chain's employees who say work stress has a negative impact on their personal lives is different from the national survey result, 0.75.
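The slides run this test in Minitab. As a cross-check, here is a sketch of the same 1-proportion z test using the Normal approximation. The function name and its alternative argument are made up for this example, and Minitab's own output may differ slightly depending on the options selected.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(x: int, n: int, p0: float, alternative: str = "two-sided"):
    """Return (z, p_value) for Ho: p = p0 using the Normal approximation."""
    p_hat = x / n
    standard_error = sqrt(p0 * (1 - p0) / n)  # uses p0 because Ho is assumed true
    z = (p_hat - p0) / standard_error
    cdf = NormalDist().cdf
    if alternative == "less":
        p_value = cdf(z)
    elif alternative == "greater":
        p_value = 1 - cdf(z)
    else:  # two-sided: both tails count
        p_value = 2 * (1 - cdf(abs(z)))
    return z, p_value

# Work stress: 68 "Yes" out of 100, Ho: p = 0.75, Ha: p != 0.75
z, p_value = one_prop_z_test(68, 100, 0.75)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

The p-value comes out a little above 0.10, consistent with the "over a 10% chance" wording in the interpretation.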
We want to be rich...
• In a recent study, 73% of first-year college students responding to a national survey identified “being very well-off financially” as an important personal goal. A state university finds that 132 of an SRS of 200 of its first-year students say that this goal is important.
• Is there evidence that the proportion of first-year students at this university who think being very well-off is important differs from the national value, 73%? Carry out a significance test to help answer this question.
n = 200; x = 132; SRS; p0 = 0.73; p̂ = 0.66
We want to test Ho: p = 0.73 versus Ha: p ≠ 0.73, where p is the proportion of all first-year students at this university who think being very well-off is important.
n = 200; x = 132; SRS; p0 = 0.73; p̂ = 0.66
Conditions:
SRS – stated in problem
Normality – np0 ≥ 10 and n(1 – p0) ≥ 10: (200)(0.73) = 146 ≥ 10 and (200)(1 – 0.73) = 54 ≥ 10
Independence – We must assume at least (10)(200) = 2000 first-year students in the population and that one student’s response does not influence any other student’s response.
Interpretation...
Reject Ho. With a p-value of 0.0258, and assuming α = 0.05, we conclude that we do have statistically significant evidence that the proportion of all first-year students at this university who think being very well-off is important differs from the national value.
(determination, p-value, α, and context... Always)
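The p-value of 0.0258 quoted in the interpretation can be traced step by step. This sketch uses the Normal approximation with the standard error computed from p0 = 0.73; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, x, p0 = 200, 132, 0.73

p_hat = x / n                               # sample proportion, 0.66
standard_error = sqrt(p0 * (1 - p0) / n)    # standard deviation of p-hat under Ho
z = (p_hat - p0) / standard_error           # how many SEs p-hat is from p0

# Two-sided alternative (Ha: p != 0.73): double the tail area beyond |z|
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"p-hat = {p_hat:.2f}, z = {z:.2f}, p-value = {p_value:.4f}")
```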
Use & Abuse of Tests...
• Significance tests are used in a variety of settings... Marketing, FDA drug testing, discrimination court cases, etc.
• Significance tests quantify how unlikely an event is to occur simply by chance
• Different levels of significance (α) are chosen depending on the given situation; typically α = 0.10, 0.05, or 0.01
Use & Abuse of Tests...
• P-values allow us to decide individually if evidence is sufficiently strong
• But, there is still no practical distinction between p-values of, say, 0.049 and 0.051
• Statistical inference does not correct basic flaws in survey or experimental design
Using Inference to Make Decisions...
Sometimes we do everything correctly... data collection, conditions, calculations, interpretation... but we still make an incorrect decision/determination... perhaps we just happen to get a sample statistic that is very extreme... that really doesn’t represent our population accurately
... we reject the null hypothesis when we really should have failed to reject (Ho was really true)
OR we fail to reject the null hypothesis when we really should have rejected the null hypothesis (Ho was really false)
... we make an error
Making errors when using inference...
• Type I Error
We reject Ho (null hypothesis) when Ho is really true
In other words, we conclude Ha (alternative hypothesis) is true when, in actuality, Ho (null hypothesis) is true
• Type II Error
We fail to reject Ho (null hypothesis) when Ho is really false
In other words, we stay with Ho (null hypothesis) when, in reality, Ha (alternative hypothesis) is true
Type I and Type II Errors...
Paramedic Response Times Revisited...
H0: μ = 6.7 minutes
Ha: μ < 6.7 minutes
... where μ is the mean response time to all calls involving life-threatening injuries this year.
Type I error: reject H0 when H0 is true
Description: The city manager concludes that the mean response time this year is less than 6.7 minutes (last year's average) when in fact the mean response time is still 6.7 minutes (or higher).
Consequences: The city manager believes that paramedic response times have improved when they really haven't. This could result in additional loss of life for accident victims.
Paramedic Response Times Revisited...
H0: μ = 6.7 minutes
Ha: μ < 6.7 minutes
... where μ is the mean response time to all calls involving life-threatening injuries this year.
Type II error: fail to reject H0 when H0 is false
Description: The city manager decides that the paramedics' mean response time this year is still 6.7 minutes (or higher) when it is actually less than 6.7 minutes.
Consequences: The city manager may take action to decrease paramedic response times when such action is unnecessary. This could result in considerable expense for the city, as well as some disgruntled paramedics.
Probabilities of Type I and Type II Errors...
• Probability of Type I Error (rejecting Ho when null is really true): α, your significance level for the hypothesis test.
• Probability of Type II Error (failing to reject Ho when alternative is really true): β. Very complicated to calculate. Beyond scope of this course.
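A quick simulation can make "probability of a Type I error = α" concrete. The sketch below reuses the paramedic setup (μ0 = 6.7, σ = 2, n = 400) and assumes Ho is exactly true; as a shortcut, it draws each sample mean directly from its Normal sampling distribution instead of simulating 400 individual calls. The function name and the seed are arbitrary choices for this illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def type_i_error_rate(mu0, sigma, n, alpha, trials, seed=0):
    """Fraction of simulated samples that reject Ho even though Ho is true."""
    rng = random.Random(seed)
    standard_error = sigma / sqrt(n)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(trials):
        xbar = rng.gauss(mu0, standard_error)   # sample mean when Ho is true
        z = (xbar - mu0) / standard_error
        p_value = std_normal.cdf(z)             # one-sided, Ha: mu < mu0
        if p_value <= alpha:
            rejections += 1                     # a Type I error
    return rejections / trials

rate = type_i_error_rate(mu0=6.7, sigma=2.0, n=400, alpha=0.05, trials=20_000)
print(f"Simulated Type I error rate: {rate:.3f} (alpha = 0.05)")
```

The simulated rejection rate should land close to α, up to simulation noise.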
Power of a Test...
• Power: Probability that a test will reject Ho when Ha is true
• Think of power as making the correct decision, not making an error, not making a mistake
• High level of power is a good thing
• Power = 1 – β (remember β is the probability of making a Type II error); so ‘power’ and β are complementary
Power of a Test...
• How can we increase power (making the correct decision)?
• Increase α
• Increase n
• Decrease standard deviation (same effect as increasing the sample size, n)
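Although the slides leave β and power calculations out of scope, the simple known-σ case is short enough to sketch, and it shows all three levers at work. The example reuses the paramedic test (Ho: μ = 6.7 vs. Ha: μ < 6.7); the "true" mean of 6.5 minutes is a hypothetical value chosen only for illustration, and power_lower_tail is a made-up helper name.

```python
from math import sqrt
from statistics import NormalDist

def power_lower_tail(mu0, mu_true, sigma, n, alpha):
    """Power of the one-sided z test (Ha: mu < mu0) when the true mean is mu_true."""
    std_normal = NormalDist()
    standard_error = sigma / sqrt(n)
    z_critical = std_normal.inv_cdf(alpha)           # negative cutoff for the lower tail
    xbar_critical = mu0 + z_critical * standard_error  # reject Ho when xbar falls below this
    # Probability that xbar lands in the rejection zone when mu = mu_true
    return std_normal.cdf((xbar_critical - mu_true) / standard_error)

baseline = power_lower_tail(6.7, 6.5, sigma=2.0, n=400, alpha=0.05)
bigger_alpha = power_lower_tail(6.7, 6.5, sigma=2.0, n=400, alpha=0.10)
bigger_n = power_lower_tail(6.7, 6.5, sigma=2.0, n=800, alpha=0.05)
smaller_sigma = power_lower_tail(6.7, 6.5, sigma=1.5, n=400, alpha=0.05)

print(f"baseline:       {baseline:.3f}")
print(f"alpha = 0.10:   {bigger_alpha:.3f}")
print(f"n = 800:        {bigger_n:.3f}")
print(f"sigma = 1.5:    {smaller_sigma:.3f}")
```

Each change raises the power above the baseline, though raising α also raises the Type I error rate.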
Comparing Proportions from Two Populations: Hypothesis Testing
• Ho: p1 = p2
• Ha: p1 ≠ or > or < p2
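We must first find the combined (pooled) proportion of successes in both samples
p̂c = (count of successes in both samples combined) / (count of individuals in both samples combined) = (x1 + x2) / (n1 + n2)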
Two Proportion Hypothesis Testing
• Ho: p1 = p2
• Ha: p1 ≠ or > or < p2
Minitab will calculate this for us; no need to memorize
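The formula from this slide did not survive in the transcript. For reference, the usual pooled two-proportion z statistic, which is presumably what the slide showed, is written below, assuming the pooled estimate is used for the standard error. Here x1 and x2 are the success counts and n1 and n2 the sample sizes.

```latex
\hat{p}_c = \frac{x_1 + x_2}{n_1 + n_2},
\qquad
z = \frac{\hat{p}_1 - \hat{p}_2}
         {\sqrt{\hat{p}_c\,(1 - \hat{p}_c)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
```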
Two Proportion Hypothesis Testing Conditions...
• SRS – Each of the two samples must be SRSs from their respective populations or they must each be randomized experiments
• Normality – Each of the following must be ≥ 10:
(n1)(p̂c)
(n1)(1 – p̂c)
(n2)(p̂c)
(n2)(1 – p̂c)
Two Proportion Hypothesis Testing Conditions...
• Independence – Each of the populations must be at least 10 times the corresponding sample size; and one sample does not influence the other
Confidence Interval for p1 – p2
To study the long-term effects of preschool programs for poor children, a research foundation has followed two groups of Michigan children since early childhood. A control group of 61 children represents Population 1, poor children with no preschool. Another group of 62 children from the same area and similar backgrounds, who attended preschool as 3- and 4-year-olds, represents Population 2, poor children who attend preschool. Sizes are n1 = 61 and n2 = 62.
One response variable of interest is the need for social services as adults. In the past ten years, 38 of the preschool sample and 49 of the control sample have needed social services (mainly welfare). Carry out a hypothesis test to determine if there is significant evidence that preschool reduces or increases the later need for social services.
n preschool = 62; n no preschool = 61
38 of preschool needed social services; 49 of no preschool needed social services
State null and alternative hypotheses
Ho: pno preschool = ppreschool
Ha: pno preschool ≠ ppreschool
Conditions
Randomization, Normality/Large Sample, Independence
Ho: pno preschool = ppreschool
Ha: pno preschool ≠ ppreschool
Minitab to calculate test statistic, pvalue, etc.
Two Sample, Proportion, Options & Data
Ho: pno preschool = ppreschool
Ha: pno preschool ≠ ppreschool
Interpretation:
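Reject the null hypothesis. At a significance level of 5% (α = 0.05), with a p-value of approximately 0.02, there is sufficient evidence that pno preschool ≠ ppreschool; that is, the proportion who later need social services differs between poor children who attend preschool and those who do not.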
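The slides leave the computation to Minitab. As a rough cross-check, here is a sketch of a pooled two-proportion z test applied to the preschool data. The function name is hypothetical, and the pooled standard error is an assumption about which option the Minitab run used.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_test(x1: int, n1: int, x2: int, n2: int, alternative: str = "two-sided"):
    """Return (z, p_value) for Ho: p1 = p2 using the pooled standard error."""
    p_hat1, p_hat2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # combined proportion of successes
    standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p_hat1 - p_hat2) / standard_error
    cdf = NormalDist().cdf
    if alternative == "less":
        p_value = cdf(z)
    elif alternative == "greater":
        p_value = 1 - cdf(z)
    else:  # two-sided: both tails count
        p_value = 2 * (1 - cdf(abs(z)))
    return z, p_value

# Group 1: no preschool (49 of 61 needed services); Group 2: preschool (38 of 62)
z, p_value = two_prop_z_test(49, 61, 38, 62)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The two-sided p-value is about 0.02, in line with the value quoted in the interpretation.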
Fear of Crime...
The elderly fear crime more than younger people, even though they are less likely to be victims of crime. One of the few studies that looked at older blacks recruited random samples of 56 black women and 63 black men over the age of 65 from Atlantic City, New Jersey. Of the women, 27 said they “felt vulnerable” to crime; 46 of the men said this.
What proportion of women in the sample feel vulnerable? Of men? (Note: Men are victims of crime more often than women, so we expect a higher proportion of men to feel vulnerable.)
Fear of Crime...
Test the hypothesis that the true, unknown population proportion of elderly black males who feel vulnerable is higher than that of elderly black women who feel vulnerable.
Hypothesis, Conditions, Computations, Interpretation
Cholesterol & Heart Attacks...
• High levels of cholesterol in the blood are associated with higher risk of heart attacks. Will using a drug to lower blood cholesterol reduce heart attacks? The Helsinki Heart Study looked at this question. Middle-aged men were assigned at random to one of two treatments: 2051 men took the drug gemfibrozil to reduce their cholesterol levels, and a control group of 2030 men took a placebo. During the next five years, 56 men in the gemfibrozil group and 84 men in the placebo group had heart attacks.
• Is the apparent benefit of gemfibrozil statistically significant?
Ho: pgemfibrozil = pplacebo
Ha: pgemfibrozil < pplacebo
We want to use this comparative randomized experiment to draw conclusions about p1, the proportion of middle-aged men who would suffer heart attacks after taking gemfibrozil, and p2, the proportion of middle-aged men who would suffer heart attacks if they only took a placebo. We hope to show that gemfibrozil reduces heart attacks, so we have a one-sided alternative.
Note: you could also state as
Ho: pgemfibrozil – pplacebo = 0
Ha: pgemfibrozil – pplacebo < 0
A Civil Action
The movie A Civil Action tells the story of a major legal battle that took place in the small town of Woburn, Massachusetts. A town well that supplied water to East Woburn residents was contaminated by industrial chemicals. During the period that residents drank water from this well, a sample of 414 births showed 16 birth defects. On the west side of Woburn, a sample of 228 babies born during the same time period revealed 3 with birth defects. The plaintiffs suing the companies responsible for the contamination claimed that these data show that the rate of birth defects was significantly higher in East Woburn, where the contaminated well water was in use.
Assume all conditions have been checked and met. How strong is the evidence supporting this claim? What should the judge for this case conclude?
p̂East = 16/414 = 0.0386 p̂West = 3/228 = 0.0132
• Is the rate of birth defects in East Woburn higher than in West Woburn?
Ho: pEast = pWest or pEast – pWest = 0
Ha: pEast > pWest or pEast – pWest > 0
Is the difference p̂East – p̂West = 0.0386 – 0.0132 = 0.0254 statistically significant? (Remember, these are just p̂s; don’t decide that 2% is within the rejection zone! This is NOT a p-value; you must actually do the test to reach a p-value.)
p̂East = 16/414 = 0.0386 p̂West = 3/228 = 0.0132
Ho: pEast = pWest or pEast – pWest = 0
Ha: pEast > pWest or pEast – pWest > 0
p-value = 0.034
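The arithmetic behind the 0.034 can be laid out directly. This sketch assumes a pooled standard error and the Normal approximation, taking the slide's statement that all conditions are met at face value; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x_east, n_east = 16, 414   # birth defects / births, East Woburn
x_west, n_west = 3, 228    # birth defects / births, West Woburn

p_hat_east = x_east / n_east
p_hat_west = x_west / n_west
difference = p_hat_east - p_hat_west   # about 0.0254: a difference of p-hats, NOT a p-value

# Pooled proportion under Ho: pEast = pWest
p_pooled = (x_east + x_west) / (n_east + n_west)
standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n_east + 1 / n_west))
z = difference / standard_error

# One-sided test (Ha: pEast > pWest): area to the RIGHT of z
p_value = 1 - NormalDist().cdf(z)

print(f"difference = {difference:.4f}, z = {z:.2f}, p-value = {p_value:.3f}")
```

The difference of about 0.025 and the p-value of about 0.034 are different quantities that happen to be of similar size, which is exactly the confusion the slide warns against.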
Seat Belt Use...
The proportion of drivers who use seat belts depends on things like age (young people are more likely to go unbelted) and gender (women are more likely to buckle up). It also depends on local law. Here are data from observing random samples of female Hispanic drivers in two cities:
Seat Belt Use...
Comparing local law suggests that a larger proportion of drivers wear seat belts in New York than in Boston. Do the data give good evidence that this is true for female Hispanic drivers? Justify your answer. Assume all conditions have been checked and met.
Ho: pNY = pB Ha: pNY > pB
p-value = 0.000000253
Reject Ho. With a p-value of ≈ 0, there is strong evidence at any reasonable α that a smaller proportion of female Hispanic drivers wear seat belts in Boston than in New York.
We check conditions for a reason...
• If conditions are not satisfied, our results may not be accurate, reliable, trustworthy, etc.