Ch 14 – Inference for Regression YMS - 14.1 Inference about the Model.
Ch 14 – Inference for Regression
YMS - 14.1
Inference about the Model
Revisiting Chapter 3
• Plot and interpret data
– Form, strength, and direction
• Numerical summary
– r is a measure of linear association
– r² is the percent of variation in the y values explained by the regression of y on x
• Mathematical model
– Use the LSRL for prediction
– Residuals from a LSRL always have a sum and mean of zero (use the RESID list)
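The zero-sum property of least-squares residuals can be checked directly; a minimal sketch (the data here are made up for illustration, not from the text):

```python
# Verify that residuals from a least-squares line sum to (essentially) zero.
# The data are illustrative; any data set works.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept from the usual formulas.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(sum(residuals))  # essentially 0, up to floating-point roundoff
```

This is the same list a calculator stores in RESID after a regression.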
Inference About the Model
• When a scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we can use a least-squares line fitted to the data to predict y for a given value of x.
• Now we want to do tests and confidence intervals in this setting.
• The first step in inference is to estimate the unknown parameters α, β, and σ.
Inference About the Model
• Plot and interpret: Plot the explanatory variable horizontally and the response variable vertically. Look for the form, direction, and strength of the relationship.
– Look for outliers or other deviations.
• Numerical summary: If the scatterplot shows a roughly linear pattern, the correlation r describes the direction and strength of the relationship.
• Mathematical model: Find the least-squares regression line for the given data in the form ŷ = a + bx.
Regression Conditions
• Linearity Assumption
– Examine a scatterplot of the data. If the plot looks straight, we're good to go.
• Independence Assumption
– For any fixed value of x, the response y varies according to a normal distribution. Repeated responses are independent of each other.
• Equal Standard Deviation Assumption
– The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.
• Normal Population Assumption
– Make a histogram or stemplot of the residuals to check for skewness and other signs of non-normality.
Regression Model
• The regression model assumes that the means of the y distributions fall along a line, even though the individual responses are scattered around it.
• The true mean response μy is represented by the line below.
• For a given sample, the slope b and intercept a are unbiased estimators of β and α, respectively.
• If we were to take many samples, we would find different values for a and b. These values, however, will be normally distributed about β and α.
True mean response: μy = α + βx
Best estimate of the true mean response line: ŷ = a + bx
From: Watkins, Scheaffer and Cobb, Statistics in Action, 2004, p. 636
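The sampling behavior of a and b described above can be illustrated by simulation; a sketch, assuming a hypothetical true model with α = 2, β = 0.5, σ = 1 (all made-up values):

```python
import random

# Draw many samples from a hypothetical true model y = alpha + beta*x + eps
# (alpha = 2, beta = 0.5, sigma = 1 are made-up values) and watch the
# least-squares estimates a and b scatter around the true parameters.
random.seed(1)
alpha, beta, sigma = 2.0, 0.5, 1.0
x = [float(v) for v in range(1, 11)]
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)

slopes, intercepts = [], []
for _ in range(2000):
    y = [alpha + beta * xi + random.gauss(0, sigma) for xi in x]
    y_bar = sum(y) / len(y)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    slopes.append(b)
    intercepts.append(y_bar - b * x_bar)

print(sum(slopes) / len(slopes))          # close to beta = 0.5
print(sum(intercepts) / len(intercepts))  # close to alpha = 2.0
```

The averages of the 2000 estimates sit close to β and α, which is what "unbiased estimator" means in practice.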
Regression Model
• Degrees of Freedom
– Equal to n − 2, since we must estimate two parameters (slope and intercept) from the data
• Standard Error
– Still called s, since we're estimating the unknown σ
– Find the sum of the squares of the residuals, divide by n − 2, and then take the square root: s = √(Σ residual² / (n − 2))
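The recipe for s can be written out step by step; a sketch with illustrative data (not from the textbook):

```python
import math

# Compute the regression standard error s = sqrt( SSE / (n - 2) ),
# where SSE is the sum of squared residuals. Illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))  # divide by n - 2 degrees of freedom
print(s)
```

Note the divisor n − 2, not n − 1: two parameters (a and b) were estimated from the data.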
p788 #14.1-14.2, 14.4
1. An Extinct Beast, I Archaeopteryx is an extinct beast having feathers like a bird but teeth and a long tail like a reptile. Here are the lengths in centimeters of the femur (a leg bone) and the humerus (a bone in the upper arm) for the five fossil specimens that preserve both bones:
The strong linear relationship between the lengths of the two bones helped persuade scientists that all five specimens belong to the same species.
Femur (cm):   38 56 59 64 74
Humerus (cm): 41 63 70 72 84
a) Examine the data. Make a scatterplot with femur length as the explanatory variable. Use your calculator to obtain the correlation r and the equation of the least-squares regression line. Do you think that femur length will allow good prediction of humerus length?
b) Explain in words what the slope β of the true regression line says about Archaeopteryx. What is the estimate of β from the data? What is your estimate of the intercept α of the true regression line?
c) Calculate the residuals for the five data points. Check that their sum is 0 (up to roundoff error). Use the residuals to estimate the standard deviation σ in the regression model.
You have now estimated all three parameters (α, β, and σ) in the model.
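For readers working without a calculator, the estimates in parts (a)–(c) can be sketched in a few lines of code using the exercise data:

```python
import math

# Least-squares regression of humerus length on femur length (exercise data),
# plus the residual check and the estimate of sigma.
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]

n = len(femur)
x_bar = sum(femur) / n
y_bar = sum(humerus) / n
sxx = sum((x - x_bar) ** 2 for x in femur)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(femur, humerus))
syy = sum((y - y_bar) ** 2 for y in humerus)

b = sxy / sxx                   # estimate of beta
a = y_bar - b * x_bar           # estimate of alpha
r = sxy / math.sqrt(sxx * syy)  # correlation

residuals = [y - (a + b * x) for x, y in zip(femur, humerus)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # estimate of sigma

print(round(b, 3), round(a, 2), round(r, 3), round(s, 3))
```

The correlation near 0.994 explains why the scientists found the linear relationship so persuasive.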
2. Backpacks Body weights and backpack weights were collected for eight students. Here are the results:
These data were entered into a statistics package and least-squares regression of backpack weight on body weight was requested. Here are the results:
Body weight:     120 187 109 103 131 165 158 116
Backpack weight:  26  30  26  24  29  35  31  28
Predictor Coef Stdev t-ratio p
Constant 16.265 3.937 4.13 0.006
BodyWT 0.09080 0.02831 3.21 0.018
s = 2.270 R-sq = 63.2% R-sq(adj) = 57.0%
a) What is the equation of the least-squares line? (Hint: Look for the column “Coef.” What is the intercept? What is the slope?)
b) The model for regression inference has three parameters, which we call α, β, and σ. Can you determine the estimates for α and β from the computer printout? What are they?
c) The computer output reports that s = 2.270. This is an estimate of the parameter σ. Use the formula for s to verify the computer's value.
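As a check on parts (a)–(c), the printout quantities can be reproduced from the raw data with the textbook formulas; a sketch:

```python
import math

# Reproduce the Minitab-style printout quantities from the raw data:
# intercept, slope, s, and R-sq should match the computer output above.
body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]

n = len(body)
x_bar, y_bar = sum(body) / n, sum(pack) / n
sxx = sum((x - x_bar) ** 2 for x in body)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(body, pack))
syy = sum((y - y_bar) ** 2 for y in pack)

b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(body, pack))
s = math.sqrt(sse / (n - 2))
r_sq = 1 - sse / syy

print(round(a, 3), round(b, 5), round(s, 3), round(100 * r_sq, 1))
# matches Constant = 16.265, BodyWT = 0.09080, s = 2.270, R-sq = 63.2%
```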
4. Competitive Runners Exercise 3.71 on page 187 provided data on the speed of competitive runners and the number of steps they took per second. Good runners take more steps per second as they speed up. Here are the data again:
a) Enter the data into your calculator, perform least-squares regression, and plot the scatterplot with the least-squares line. What is the strength of the association between speed and steps per second?
Speed (ft/s):     15.86 16.88 17.50 18.62 19.97 21.06 22.11
Steps per second:  3.05  3.12  3.17  3.25  3.36  3.46  3.55
b) Find the residuals for all 7 data points. Check that their sum is 0 (up to roundoff error).
c) The model for regression inference has three parameters, α, β, and σ. Estimate these parameters from the data.
Confidence Intervals for β
• The slope β of the true regression line is usually the most important parameter in a regression problem.
• The slope b of the least-squares line is an unbiased estimator of β.
• A confidence interval is more useful than the point estimate alone because it shows how accurate the estimate b is likely to be.
• The confidence interval for β has the familiar form estimate ± t* · SEestimate.
Confidence Intervals for β
• Because b is our estimate, the confidence interval becomes b ± t* · SEb.
• The standard error of the least-squares slope is SEb = s / √(Σ(x − x̄)²)
• and t* is the upper critical value from the t distribution with n − 2 degrees of freedom.
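Applied to the backpack printout from Problem 2, the interval works out in two lines; a sketch (t* is read from a t table with 6 df):

```python
# 95% confidence interval for the slope beta of the backpack regression:
# b ± t* · SE_b, with values taken from the printout.
b = 0.09080     # slope ("Coef" for BodyWT)
se_b = 0.02831  # its standard error ("Stdev" for BodyWT)
t_star = 2.447  # upper .025 critical value of t with n - 2 = 6 df

lower = b - t_star * se_b
upper = b + t_star * se_b
print(round(lower, 4), round(upper, 4))  # roughly 0.0215 to 0.1601
```

The interval does not contain 0, which already hints at the test result in the next slides.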
Hypothesis Test
• We can also test hypotheses about the slope β.
– H0: β = 0 says
• There is no linear relationship between x and y
OR
• Straight-line dependence on x is of no value for predicting y
OR
• There is no correlation between x and y
• Test Statistic: t = b / SEb
Hypothesis Test
• In terms of a random variable T having the t(n − 2) distribution, the P-value for a test of H0 against
– Ha: β > 0 is P(T ≥ t)
– Ha: β < 0 is P(T ≤ t)
– Ha: β ≠ 0 is 2P(T ≥ |t|)
• This test is also a test of the hypothesis that the correlation ρ is 0 in the population.
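For the backpack printout from Problem 2, the test statistic works out directly; a sketch (the critical value 2.447 is read from a t table with 6 df, and the P-value of 0.018 comes from the printout itself):

```python
# Test H0: beta = 0 against Ha: beta != 0 for the backpack regression,
# using values from the printout.
b = 0.09080     # estimated slope
se_b = 0.02831  # standard error of the slope
t = b / se_b
print(round(t, 2))  # 3.21, matching the printout's t-ratio column

# With 6 df, |t| = 3.21 exceeds the two-sided 0.05 critical value 2.447,
# so we reject H0; this agrees with the printout's P-value of 0.018.
```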
Homework: 14.3, 14.6 and 14.10
YMS - 14.2
Predictions and Conditions
Confidence Interval
• A confidence interval for the mean response μy when x takes the value x* has the form ŷ ± t* · SEμ̂
• Standard Error: SEμ̂ = s √(1/n + (x* − x̄)² / Σ(x − x̄)²)
Prediction Intervals
• Used for predicting a single observation y when x takes the value x*
• Will be wider than the CI for the mean response
• Form: ŷ ± t* · SEŷ
• Standard Error: SEŷ = s √(1 + 1/n + (x* − x̄)² / Σ(x − x̄)²)
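The two intervals can be computed side by side; a sketch using the Archaeopteryx data from 14.1, with x* = 60 chosen purely for illustration (t* from a t table with 3 df):

```python
import math

# Confidence interval for the mean response and prediction interval for a
# single new observation, both at femur length x* = 60 cm (an illustrative
# choice), using the Archaeopteryx data.
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]

n = len(femur)
x_bar = sum(femur) / n
y_bar = sum(humerus) / n
sxx = sum((x - x_bar) ** 2 for x in femur)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(femur, humerus)) / sxx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2
                  for x, y in zip(femur, humerus)) / (n - 2))

x_star = 60.0
y_hat = a + b * x_star
t_star = 3.182  # upper .025 critical value of t with n - 2 = 3 df

se_mu = s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)     # mean response
se_y = s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)  # single obs.
ci = (y_hat - t_star * se_mu, y_hat + t_star * se_mu)
pi = (y_hat - t_star * se_y, y_hat + t_star * se_y)
print(ci, pi)  # the prediction interval is strictly wider
```

The extra "1 +" inside SEŷ accounts for the scatter of an individual observation about its mean, which is why the prediction interval is always wider.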
In Class Exercises
p802 #14.12 and 14.19
Review Exercises
p806 #14.18, 14.21, 14.22, 14.24, 14.25