Ch 14 – Inference for Regression YMS - 14.1 Inference about the Model.
Ch 14 – Inference for Regression
YMS - 14.1
Inference about the Model
Revisiting Chapter 3
• Plot and interpret data
– Form, strength, and direction
• Numerical summary
– r is a measure of linear association
– r² is the percent of variation in the y values explained by the regression of y on x
• Mathematical model
– Use the LSRL for prediction
– Residuals from a LSRL always have a sum and mean of zero (use the RESID list)
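The zero-sum property of least-squares residuals can be checked directly; a minimal sketch (the data here are made up for illustration, not from the text):

```python
# Verify that residuals from a least-squares line sum to (essentially) zero.
# The data are illustrative; any data set works.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept from the usual formulas.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(sum(residuals))  # essentially 0, up to floating-point roundoff
```

This is the same list a calculator stores in RESID after a regression.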
Inference About the Model
• When a scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we can use a least-squares line fitted to the data to predict y for a given value of x.
• Now we want to do tests and confidence intervals in this setting.
• The first step in inference is to estimate the unknown parameters α, β, and σ.
Inference About the Model
• Plot and interpret: Plot the explanatory variable horizontally and the response variable vertically. Look for the form, direction, and strength of the relationship.
– Look for outliers or other deviations.
• Numerical summary: If the scatterplot shows a roughly linear pattern, the correlation r describes the direction and strength of the relationship.
• Mathematical model: Find the least-squares regression line for the given data in the form ŷ = a + bx.
Regression Conditions
• Linearity Assumption
– Examine a scatterplot of the data. If the plot looks straight, we're good to go.
• Independence Assumption
– For any fixed value of x, the response y varies according to a normal distribution. Repeated responses are independent of each other.
• Equal Standard Deviation Assumption
– The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.
• Normal Population Assumption
– Make a histogram or stemplot of the residuals to check for skewness and other signs of non-normality.
Regression Model
• The regression model assumes that the means of the y distributions fall along a line, even though the individual responses are scattered around it.
• The true mean response μy is represented by the line below.
• For a given sample, the slope b and intercept a are unbiased estimators of β and α, respectively.
• If we were to take many samples, we would find different values for a and b. These values, however, will be normally distributed about β and α.
True mean response: μy = α + βx
Best estimate of the true mean response line: ŷ = a + bx
From: Watkins, Scheaffer and Cobb, Statistics in Action, 2004, p. 636
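The sampling behavior of a and b described above can be illustrated by simulation; a sketch, assuming a hypothetical true model with α = 2, β = 0.5, σ = 1 (all made-up values):

```python
import random

# Draw many samples from a hypothetical true model y = alpha + beta*x + eps
# (alpha = 2, beta = 0.5, sigma = 1 are made-up values) and watch the
# least-squares estimates a and b scatter around the true parameters.
random.seed(1)
alpha, beta, sigma = 2.0, 0.5, 1.0
x = [float(v) for v in range(1, 11)]
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)

slopes, intercepts = [], []
for _ in range(2000):
    y = [alpha + beta * xi + random.gauss(0, sigma) for xi in x]
    y_bar = sum(y) / len(y)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    slopes.append(b)
    intercepts.append(y_bar - b * x_bar)

print(sum(slopes) / len(slopes))          # close to beta = 0.5
print(sum(intercepts) / len(intercepts))  # close to alpha = 2.0
```

The averages of the 2000 estimates sit close to β and α, which is what "unbiased estimator" means in practice.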
Regression Model
• Degrees of Freedom
– Equal to n − 2, since we must estimate two parameters (slope and intercept) from the data
• Standard Error
– Still called s, since we're estimating the unknown σ
– Find the sum of the squares of the residuals, divide by n − 2, and then take the square root: s = √(Σ residual² / (n − 2))
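The recipe for s can be written out step by step; a sketch with illustrative data (not from the textbook):

```python
import math

# Compute the regression standard error s = sqrt( SSE / (n - 2) ),
# where SSE is the sum of squared residuals. Illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))  # divide by n - 2 degrees of freedom
print(s)
```

Note the divisor n − 2, not n − 1: two parameters (a and b) were estimated from the data.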
p788 #14.1-14.2, 14.4
1. An Extinct Beast, I Archaeopteryx is an extinct beast having feathers like a bird but teeth and a long tail like a reptile. Here are the lengths in centimeters of the femur (a leg bone) and the humerus (a bone in the upper arm) for the five fossil specimens that preserve both bones:
The strong linear relationship between the lengths of the two bones helped persuade scientists that all five specimens belong to the same species.
Femur (cm):   38 56 59 64 74
Humerus (cm): 41 63 70 72 84
a) Examine the data. Make a scatterplot with femur length as the explanatory variable. Use your calculator to obtain the correlation r and the equation of the least-squares regression line. Do you think that femur length will allow good prediction of humerus length?
b) Explain in words what the slope β of the true regression line says about Archaeopteryx. What is the estimate of β from the data? What is your estimate of the intercept α of the true regression line?
c) Calculate the residuals for the five data points. Check that their sum is 0 (up to roundoff error). Use the residuals to estimate the standard deviation σ in the regression model.
You have now estimated all three parameters (α, β, and σ) in the model.
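For readers working without a calculator, the estimates in parts (a)–(c) can be sketched in a few lines of code using the exercise data:

```python
import math

# Least-squares regression of humerus length on femur length (exercise data),
# plus the residual check and the estimate of sigma.
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]

n = len(femur)
x_bar = sum(femur) / n
y_bar = sum(humerus) / n
sxx = sum((x - x_bar) ** 2 for x in femur)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(femur, humerus))
syy = sum((y - y_bar) ** 2 for y in humerus)

b = sxy / sxx                   # estimate of beta
a = y_bar - b * x_bar           # estimate of alpha
r = sxy / math.sqrt(sxx * syy)  # correlation

residuals = [y - (a + b * x) for x, y in zip(femur, humerus)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # estimate of sigma

print(round(b, 3), round(a, 2), round(r, 3), round(s, 3))
```

The correlation near 0.994 explains why the scientists found the linear relationship so persuasive.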
2. Backpacks Body weights and backpack weights were collected for eight students. Here are the results:
These data were entered into a statistics package and least-squares regression of backpack weight on body weight was requested. Here are the results:
Body weight:     120 187 109 103 131 165 158 116
Backpack weight:  26  30  26  24  29  35  31  28
Predictor Coef Stdev t-ratio p
Constant 16.265 3.937 4.13 0.006
BodyWT 0.09080 0.02831 3.21 0.018
s = 2.270 R-sq = 63.2% R-sq(adj) = 57.0%
a) What is the equation of the least-squares line? (Hint: Look for the column “Coef.” What is the intercept? What is the slope?)
b) The model for regression inference has three parameters, which we call α, β, and σ. Can you determine the estimates for α and β from the computer printout? What are they?
c) The computer output reports that s = 2.270. This is an estimate of the parameter σ. Use the formula for s to verify the computer's value.
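As a check on parts (a)–(c), the printout quantities can be reproduced from the raw data with the textbook formulas; a sketch:

```python
import math

# Reproduce the Minitab-style printout quantities from the raw data:
# intercept, slope, s, and R-sq should match the computer output above.
body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]

n = len(body)
x_bar, y_bar = sum(body) / n, sum(pack) / n
sxx = sum((x - x_bar) ** 2 for x in body)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(body, pack))
syy = sum((y - y_bar) ** 2 for y in pack)

b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(body, pack))
s = math.sqrt(sse / (n - 2))
r_sq = 1 - sse / syy

print(round(a, 3), round(b, 5), round(s, 3), round(100 * r_sq, 1))
# matches Constant = 16.265, BodyWT = 0.09080, s = 2.270, R-sq = 63.2%
```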
4. Competitive Runners Exercise 3.71 on page 187 provided data on the speed of competitive runners and the number of steps they took per second. Good runners take more steps per second as they speed up. Here are the data again:
a) Enter the data into your calculator, perform least-squares regression, and plot the scatterplot with the least-squares line. What is the strength of the association between speed and steps per second?
Speed (ft/s):     15.86 16.88 17.50 18.62 19.97 21.06 22.11
Steps per second:  3.05  3.12  3.17  3.25  3.36  3.46  3.55
b) Find the residuals for all 7 data points. Check that their sum is 0 (up to roundoff error).
c) The model for regression inference has three parameters, α, β, and σ. Estimate these parameters from the data.
Confidence Intervals for β
• The slope β of the true regression line is usually the most important parameter in a regression problem.
• The slope b of the least-squares line is an unbiased estimator of β.
• A confidence interval is more useful than the point estimate alone because it shows how accurate the estimate b is likely to be.
• The confidence interval for β has the familiar form estimate ± t* · SEestimate.
Confidence Intervals for β
• Because b is our estimate, the confidence interval becomes b ± t* · SEb.
• The standard error of the least-squares slope is SEb = s / √(Σ(x − x̄)²)
• and t* is the upper critical value from the t distribution with n − 2 degrees of freedom.
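Applied to the backpack printout from Problem 2, the interval works out in two lines; a sketch (t* is read from a t table with 6 df):

```python
# 95% confidence interval for the slope beta of the backpack regression:
# b ± t* · SE_b, with values taken from the printout.
b = 0.09080     # slope ("Coef" for BodyWT)
se_b = 0.02831  # its standard error ("Stdev" for BodyWT)
t_star = 2.447  # upper .025 critical value of t with n - 2 = 6 df

lower = b - t_star * se_b
upper = b + t_star * se_b
print(round(lower, 4), round(upper, 4))  # roughly 0.0215 to 0.1601
```

The interval does not contain 0, which already hints at the test result in the next slides.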
Hypothesis Test
• We can also test hypotheses about the slope β.
– H0: β = 0 says
• There is no linear relationship between x and y
OR
• Straight-line dependence on x is of no value for predicting y
OR
• There is no correlation between x and y
• Test Statistic: t = b / SEb
Hypothesis Test
• In terms of a random variable T having the t(n − 2) distribution, the P-value for a test of H0 against
– Ha: β > 0 is P(T ≥ t)
– Ha: β < 0 is P(T ≤ t)
– Ha: β ≠ 0 is 2P(T ≥ |t|)
• This test is also a test of the hypothesis that the correlation ρ is 0 in the population.
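For the backpack printout from Problem 2, the test statistic works out directly; a sketch (the critical value 2.447 is read from a t table with 6 df, and the P-value of 0.018 comes from the printout itself):

```python
# Test H0: beta = 0 against Ha: beta != 0 for the backpack regression,
# using values from the printout.
b = 0.09080     # estimated slope
se_b = 0.02831  # standard error of the slope
t = b / se_b
print(round(t, 2))  # 3.21, matching the printout's t-ratio column

# With 6 df, |t| = 3.21 exceeds the two-sided 0.05 critical value 2.447,
# so we reject H0; this agrees with the printout's P-value of 0.018.
```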
Homework: 14.3, 14.6 and 14.10
YMS - 14.2
Predictions and Conditions
Confidence Interval
• A confidence interval for the mean response μy when x takes the value x* has the form ŷ ± t* · SEμ̂
• Standard Error: SEμ̂ = s √(1/n + (x* − x̄)² / Σ(x − x̄)²)
Prediction Intervals
• Used for predicting a single observation y when x takes the value x*
• Will be wider than the CI for the mean response
• Form: ŷ ± t* · SEŷ
• Standard Error: SEŷ = s √(1 + 1/n + (x* − x̄)² / Σ(x − x̄)²)
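The two intervals can be computed side by side; a sketch using the Archaeopteryx data from 14.1, with x* = 60 chosen purely for illustration (t* from a t table with 3 df):

```python
import math

# Confidence interval for the mean response and prediction interval for a
# single new observation, both at femur length x* = 60 cm (an illustrative
# choice), using the Archaeopteryx data.
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]

n = len(femur)
x_bar = sum(femur) / n
y_bar = sum(humerus) / n
sxx = sum((x - x_bar) ** 2 for x in femur)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(femur, humerus)) / sxx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2
                  for x, y in zip(femur, humerus)) / (n - 2))

x_star = 60.0
y_hat = a + b * x_star
t_star = 3.182  # upper .025 critical value of t with n - 2 = 3 df

se_mu = s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)     # mean response
se_y = s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)  # single obs.
ci = (y_hat - t_star * se_mu, y_hat + t_star * se_mu)
pi = (y_hat - t_star * se_y, y_hat + t_star * se_y)
print(ci, pi)  # the prediction interval is strictly wider
```

The extra "1 +" inside SEŷ accounts for the scatter of an individual observation about its mean, which is why the prediction interval is always wider.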
In Class Exercises
p802 #14.12 and 14.19
Review Exercises
p806 #14.18, 14.21, 14.22, 14.24, 14.25