
Chapter 11: Multiple Regression

Learning goals for this chapter:

Interpret a scatterplot, residual plot, and Normal probability plot.

Use SPSS to find the following: least-squares regression line, correlation, r², and

estimate for σ.

Find the predicted response and residual for particular explanatory variable

values.

Perform a hypothesis test for regression ANOVA, individual coefficients, and

zero population correlation/independence, including: stating the null and

alternative hypotheses, obtaining the test statistic and P-value from SPSS, and

stating conclusions in terms of the story.

Understand why it is important to refine the model instead of always keeping all

the explanatory variables in the model.

Know which changes are good changes when reducing the model one variable at

a time.

Know which changes are bad changes when reducing the model one variable at a

time, and also know what to do after a bad change.

Explain which explanatory variables should be included in a model.

Multiple Regression is what you use when you have 2 or more quantitative

explanatory variables which will be used to predict another quantitative response

variable.

Simple Linear Regression (Chapters 2 and 10) is used when you have just 1 quantitative

explanatory variable and 1 quantitative response variable.

For simple linear regression (Chapters 2 and 10), our statistical model was:

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

In multiple regression (Chapter 11), our statistical model is:

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i$

where you have p explanatory variables.

Just because you have data for several x variables doesn’t mean that all the x variables are

important enough to go in your model. We must do a multiple-step procedure to decide

which x variables are the most important when describing y.

So what do we do when we have multiple x variables?


1. Look at the variables individually.

Means, standard deviations, minimums, and maximums, outliers (if any), stem

plots or histograms are all good ways to show what is happening with your

individual variables.

In SPSS, Analyze → Descriptive Statistics → Explore.

2. Look at the relationships between the variables using the correlation and scatter

plots.

In SPSS, Analyze → Correlate → Bivariate. Put all your variables (all the x’s

and y) into the “variables” box, and hit “ok.”

The larger the Pearson Correlation between 2 variables is in absolute value, the stronger the linear relationship, and the lower the Sig. (2-tailed), the stronger the evidence of a real correlation. The P-value (Sig.) is the result of the test $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$ that we did in Chapter 10.

Which are the stronger relationships between an x and the y? Which are the

stronger x-to-x relationships?

Look at scatter plots between each pair of variables, too (you will look at a

LOT of graphs).

We are only interested in keeping the variables which had strong correlations.
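(Not required for this class, but if you want to double-check SPSS's correlation table outside SPSS, here is a minimal Python sketch using pandas and scipy. The file name cheese.csv is a hypothetical stand-in for wherever you stored the data; pearsonr returns both the correlation and the two-sided P-value for the test of zero population correlation.)

# Minimal sketch of Analyze -> Correlate -> Bivariate outside SPSS.
# Assumes a hypothetical file cheese.csv with columns Taste, Acetic, H2S, Lactic.
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("cheese.csv")

# Pearson correlation and two-sided P-value for every pair of variables
for a, b in combinations(df.columns, 2):
    r, p = pearsonr(df[a], df[b])
    print(f"{a} vs. {b}: r = {r:.3f}, Sig. (2-tailed) = {p:.4f}")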

3. Do a regression using the variables you decided were important from part 2.

This will include an ANOVA table and coefficient output like what we saw in

Chapter 10.

ANOVA Table for Multiple Regression:

Source       SS    df            MS                   F        Significance
Regression   SSM   DFM = p       MSM = SSM/DFM        MSM/MSE  P-value
Residual     SSE   DFE = n-p-1   MSE = SSE/DFE = s²
Total        SST   DFT = n-1     MST = SST/DFT

s = estimate for the standard deviation = $\sqrt{MSE}$
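Every entry in this table can be rebuilt from the sums of squares, n, and p. Here is a quick sketch in plain Python (the numbers plugged in come from the CHEESE example later in this chapter):

# Rebuilding the ANOVA table from SSM, SSE, n, and p.
# Values below come from the CHEESE example later in this chapter.
SSM, SSE = 4994.476, 2668.411  # regression and residual sums of squares
n, p = 30, 3                   # 30 cheeses, 3 explanatory variables

DFM, DFE = p, n - p - 1        # 3 and 26
MSM, MSE = SSM / DFM, SSE / DFE
F = MSM / MSE                  # ANOVA F statistic
s = MSE ** 0.5                 # estimate for the standard deviation

print(f"F = {F:.3f}, s = {s:.4f}")  # F = 16.221, s = 10.1307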

Analysis of Variance F-Test:

In the multiple regression model, the hypotheses are

$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$
$H_a$: not ALL of $\beta_1, \beta_2, \ldots, \beta_p$ equal 0

(We had ANOVA results for simple linear regression in Ch. 10, too, but since we only had one $\beta_i$, we didn’t need to use it.)

$H_a$ means at least one $\beta_j \neq 0$. We can’t tell how many regression coefficients are not 0 at this point. We need to do t-tests to be more specific. (Think back: We did Bonferroni multiple comparisons t tests if we could reject the null hypothesis in a One-way ANOVA F test.) If we reject $H_0$, basically we have determined that this problem is worthy of further study.

Even if the P-value (Sig.) is small, you need to look at R². If the R² is small, it means the model (variables) you are using does not do a very good job of explaining the variation in y.

You can get a fitted regression equation from the estimates for bj in the

SPSS output at this point.

The SPSS output will also include t test statistics and P-values for each individual $b_j$.

The degrees of freedom we will use for the t-tests will be n-p-1 where

n is the # of subjects we study,

p is the # of explanatory variables.

To test $H_0: \beta_j = 0$ for an individual variable, use the t statistic

$t = \dfrac{b_j}{SE_{b_j}}$

(given to you in SPSS)
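As a cross-check on the SPSS coefficient table, here is a minimal sketch using Python's statsmodels, again assuming the hypothetical cheese.csv from the earlier sketch. The summary reproduces each $b_j$, its standard error, $t = b_j / SE_{b_j}$, and the two-sided P-value on n-p-1 degrees of freedom.

# Fitting the multiple regression and reading off the per-coefficient t-tests.
# Assumes the hypothetical cheese.csv used above.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("cheese.csv")
X = sm.add_constant(df[["Acetic", "H2S", "Lactic"]])  # adds the intercept column
fit = sm.OLS(df["Taste"], X).fit()

print(fit.summary())   # ANOVA F-test, R^2, s, and the coefficient table
print(fit.tvalues)     # each t = b_j / SE_{b_j}
print(fit.pvalues)     # matching two-sided P-values on n - p - 1 df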

4. Interpretation of Results

Sometimes variables that are significant by themselves may not be significant

when other variables are included too.

The significance tests for individual regression coefficients assess the significance

of each predictor variable assuming that all other predictors are included in the

regression equation.

5. Residuals

Use residuals to help determine whether the multiple regression model is

appropriate for the data.

Use several residual plots since there are several explanatory variables.

Plot residuals vs. each of the explanatory variables and vs. $\hat{y}$.


Look for outliers, influential observations, evidence of a nonlinear relation, and

anything else unusual.

Use a Normal probability plot to determine that the residuals are normally

distributed. (Look for your points to make an increasing line.)
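Here is a sketch of those checks with matplotlib and scipy, reusing df and fit from the statsmodels sketch above:

# Residual plots: residuals vs. each x, vs. y-hat, and a Normal probability plot.
# Reuses df and fit from the statsmodels sketch above.
import matplotlib.pyplot as plt
from scipy import stats

resid = fit.resid

for var in ["Acetic", "H2S", "Lactic"]:
    plt.figure()
    plt.scatter(df[var], resid)
    plt.axhline(0, linestyle="--")
    plt.xlabel(var)
    plt.ylabel("Residual")

plt.figure()
plt.scatter(fit.fittedvalues, resid)  # residuals vs. predicted values
plt.xlabel("Predicted Taste")
plt.ylabel("Residual")

plt.figure()
stats.probplot(resid, plot=plt)       # points should make an increasing line
plt.show()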

6. Refining the Model

Try deleting the variable with the largest P-value, then rerun the regression.

Check to see if R², s, the P-value from the F-test, and the individual t statistics change much.

o R² should be as high as possible (or at least not drop drastically when you remove a variable)

o The standard deviation, s, should be as small as possible

o The F-test statistic from ANOVA should get bigger, and the P-value from the ANOVA F-test should get smaller

o Any variables left in the equation should have a significant P-value from their t-test of the coefficient (their confidence intervals should not contain 0), unless taking out a slightly insignificant coefficient makes R² and s move in the wrong direction.

Our goal is to keep only the variables which are the most useful to us. Get rid of any excess variables, but balance removing insignificant variables against the change that removal has on the whole model.

How do we know which variables should be included in our model and which should

not?

***Procedure 1:

Start with a model that contains all your explanatory variables with strong correlations, run the regression, and then remove, one at a time, whichever variables aren't significant in their t-tests, until you find that your R² starts to decrease too rapidly or your s goes up too rapidly. You may end up leaving in one or more variables which are not significant on their own. You just have to see what removing them does to the whole model. (This is the procedure that we will follow in the lecture notes and that you should use for this class.)

How much of a bad change is too much?

If R² drops by more than 1-2%, or if s increases by more than 5 when dropping out a variable, and if the P-value on the newly dropped variable’s coefficient is between α and 2α, then maybe you should put that dropped variable back in.


Good changes:

• R² increases
• s decreases
• F increases, P-value decreases
• Fewer coefficients with big P-values

OK changes:

• R² decreases by less than 1 or 2%
• s increases by less than 5

Bad changes:

• R² decreases by more than 2%
• s increases by more than 5
• F decreases, P-value increases
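For intuition, here is a rough sketch of Procedure 1 as a loop (Python/statsmodels, df as in the earlier sketches). It drops the least significant variable one at a time, but in this class you should still judge each drop against the good/OK/bad criteria above rather than trusting a loop blindly.

# Rough sketch of Procedure 1: backward elimination by largest P-value.
# Reuses df from earlier; judge each step against the good/OK/bad criteria.
import statsmodels.api as sm

predictors = ["Acetic", "H2S", "Lactic"]
alpha = 0.05

while len(predictors) > 1:
    X = sm.add_constant(df[predictors])
    fit = sm.OLS(df["Taste"], X).fit()
    pvals = fit.pvalues.drop("const")      # ignore the intercept
    worst = pvals.idxmax()                 # least significant variable
    if pvals[worst] <= alpha:
        break                              # everything left is significant
    print(f"current model: R^2 = {fit.rsquared:.3f}, s = {fit.mse_resid**0.5:.3f}; "
          f"dropping {worst} (P = {pvals[worst]:.3f})")
    predictors.remove(worst)

print("kept:", predictors)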

Procedure 2:

Start with a model that contains only one explanatory variable and add one variable at a time till you find that your R² is no longer increasing rapidly. Chapter 11 in the book uses a slightly different method. You should read it over to compare.

Sometimes there may be more than one appropriate choice for your model. The

most important thing is to be able to explain why you chose the model you did. Not

every model is as easy to define as the one in the CHEESE example below.

Example: As cheddar cheese matures a variety of chemical processes take place. The

taste of mature cheese is related to the concentration of several chemicals in the final

product. In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia,

samples of cheese were analyzed for their chemical composition and were subjected to

taste tests.

Data for one type of cheese-manufacturing process appears below. The

variable “Case” is used to number the observations from 1 to 30. “Taste” is the response

variable of interest. The taste scores were obtained by combining the scores from several

tasters.

Three chemicals whose concentrations were measured were acetic acid, hydrogen

sulfide, and lactic acid. For acetic acid and hydrogen sulfide (natural) log

transformations were taken. Thus the explanatory variables are the transformed

concentrations of acetic acid (“Acetic”) and hydrogen sulfide (“H2S”), and the

untransformed concentration of lactic acid (“Lactic”).


Case  Taste  Acetic   H2S     Lactic
  1   12.3   4.543     3.135   0.86
  2   20.9   5.159     5.043   1.53
  3   39     5.366     5.438   1.57
  4   47.9   5.759     7.496   1.81
  5    5.6   4.663     3.807   0.99
  6   25.9   5.697     7.601   1.09
  7   37.3   5.892     8.726   1.29
  8   21.9   6.078     7.966   1.78
  9   18.1   4.898     3.85    1.29
 10   21     5.242     4.174   1.58
 11   34.9   5.74      6.142   1.68
 12   57.2   6.446     7.908   1.9
 13    0.7   4.477     2.996   1.06
 14   25.9   5.236     4.942   1.3
 15   54.9   6.151     6.752   1.52
 16   40.9   6.365     9.588   1.74
 17   15.9   4.787     3.912   1.16
 18    6.4   5.412     4.7     1.49
 19   18     5.247     6.174   1.63
 20   38.9   5.438     9.064   1.99
 21   14     4.564     4.949   1.15
 22   15.2   5.298     5.22    1.33
 23   32     5.455     9.242   1.44
 24   56.7   5.855    10.199   2.01
 25   16.8   5.366     3.664   1.31
 26   11.6   6.043     3.219   1.46
 27   26.5   6.458     6.962   1.72
 28    0.7   5.328     3.912   1.25
 29   13.4   5.802     6.685   1.08
 30    5.5   6.176     4.787   1.25

a) For each of the 4 variables in the CHEESE data set, find the mean, median,

standard deviation, and IQR. Display each distribution by means of a

stemplot.


Descriptives

Statistic                       Taste      Acetic     H2S        Lactic
Mean                            24.533     5.49803    5.94177    1.4420
Std. Error of Mean               2.9678     .104228    .388313    .05541
95% CI for Mean, Lower Bound    18.463     5.28486    5.14758    1.3287
95% CI for Mean, Upper Bound    30.603     5.71120    6.73596    1.5553
5% Trimmed Mean                 24.052     5.50043    5.87765    1.4407
Median                          20.950     5.42500    5.32900    1.4500
Variance                       264.237      .326      4.524       .092
Std. Deviation                  16.2554     .570878   2.126879    .30349
Minimum                           .7       4.477      2.996       .86
Maximum                         57.2       6.458     10.199      2.01
Range                           56.5       1.981      7.203      1.15
Interquartile Range             24.6        .713      3.766       .46

Taste Stem-and-Leaf Plot

Frequency   Stem & Leaf
   5.00      0 . 00556
   9.00      1 . 123455688
   6.00      2 . 011556
   5.00      3 . 24789
   2.00      4 . 07
   3.00      5 . 467

Acetic Stem-and-Leaf Plot

Frequency   Stem & Leaf
   1.00      4 . 4
   5.00      4 . 55678
  11.00      5 . 12222333444
   6.00      5 . 677888
   7.00      6 . 001134

H2S Stem-and-Leaf Plot

Frequency   Stem & Leaf
   1.00      2 . 9
   7.00      3 . 1268899
   5.00      4 . 17799
   3.00      5 . 024
   5.00      6 . 11679
   4.00      7 . 4699
   1.00      8 . 7
   3.00      9 . 025
   1.00     10 . 1

Lactic Stem-and-Leaf Plot

Frequency   Stem & Leaf
   2.00      0 . 89
  15.00      1 . 000112222333444
  12.00      1 . 555566777899
   1.00      2 . 0

b) Make a scatterplot for each pair of variables in the CHEESE data set (you will

have 6 plots). Describe the relationships. Calculate the correlation for each

pair of variables and report the P-value for the test of zero population

correlation in each case.

Correlations

                              Taste      Acetic     H2S        Lactic
Taste   Pearson Correlation   1          .550(**)   .756(**)   .704(**)
        Sig. (2-tailed)       .          .002       .000       .000
        N                     30         30         30         30
Acetic  Pearson Correlation   .550(**)   1          .618(**)   .604(**)
        Sig. (2-tailed)       .002       .          .000       .000
        N                     30         30         30         30
H2S     Pearson Correlation   .756(**)   .618(**)   1          .645(**)
        Sig. (2-tailed)       .000       .000       .          .000
        N                     30         30         30         30
Lactic  Pearson Correlation   .704(**)   .604(**)   .645(**)   1
        Sig. (2-tailed)       .000       .000       .000       .
        N                     30         30         30         30

** Correlation is significant at the 0.01 level (2-tailed).

[Scatterplots for each pair of the four variables (6 plots) appeared here.]

c) Perform a multiple regression using the explanatory variables which look

important at this point. Give the fitted regression equation.

Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .807(a)  .652       .612                10.1307

a Predictors: (Constant), Lactic, Acetic, H2S

ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  4994.476          3   1664.825      16.221   .000(a)
   Residual    2668.411         26    102.631
   Total       7662.887         29

a Predictors: (Constant), Lactic, Acetic, H2S
b Dependent Variable: Taste
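As a quick check on this output: R² = SSM/SST = 4994.476/7662.887 ≈ .652, F = MSM/MSE = 1664.825/102.631 ≈ 16.221, and s = √MSE = √102.631 ≈ 10.1307, all matching the Model Summary and ANOVA tables above.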

d) State your hypotheses for an ANOVA F-test, give the test statistic and its P-

value, and state your conclusion.

e) Report the t statistics and P-values for the tests of the regression coefficients

of your explanatory variables. What conclusions do you draw from these

tests?

Coefficients(a)

Model          B        Std. Error   Beta    t       Sig.   95% CI for B (Lower, Upper)
1  (Constant)  -28.877  19.735               -1.463  .155   (-69.444, 11.690)
   H2S           3.912   1.248       .512     3.133  .004   (1.346, 6.478)
   Lactic       19.671   8.629       .367     2.280  .031   (1.933, 37.408)
   Acetic         .328   4.460       .012      .073  .942   (-8.839, 9.495)

a Dependent Variable: Taste
(B = unstandardized coefficient; Beta = standardized coefficient)


f) What is the value of s, the estimate for the standard deviation of the model?

g) What percent of variation in taste is explained by these explanatory variables?

h) One variable looks like a good candidate to be dropped. Which one is it? Try

running the multiple regression again without this variable. Look at parts c

through h again.

Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .807(a)  .652       .626                9.9424

a Predictors: (Constant), Lactic, H2S

ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  4993.921          2   2496.961      25.260   .000(a)
   Residual    2668.965         27     98.851
   Total       7662.887         29

a Predictors: (Constant), Lactic, H2S
b Dependent Variable: Taste

Coefficients(a)

Model          B        Std. Error   Beta    t       Sig.   95% CI for B (Lower, Upper)
1  (Constant)  -27.592   8.982               -3.072  .005   (-46.021, -9.163)
   H2S           3.946   1.136       .516     3.475  .002   (1.616, 6.277)
   Lactic       19.887   7.959       .371     2.499  .019   (3.557, 36.218)

a Dependent Variable: Taste

What changed? What stayed the same or improved?


                                  Original (all 3          New (only H2S
                                  explanatory variables)   and Lactic)     Change
R²                                65.2%                    65.2%
s                                 10.1307                  9.9424
F, P-value                        16.221, 0                25.260, 0
Insignificant explanatory vars    Acetic                   None

j) Now look at a residual plot for each of the variables you still have in the

model. Do a Normal probability plot, too.

k) Using the better model, predict the “taste” for H2S = 4 and Lactic = 1.
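For reference, plugging into the fitted equation from the reduced model above: predicted taste = -27.592 + 3.946(4) + 19.887(1) = -27.592 + 15.784 + 19.887 ≈ 8.08.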

[Residual plot: Unstandardized Residual vs. H2S]

[Residual plot: Unstandardized Residual vs. Lactic]

[Normal P-P Plot of Regression Standardized Residual; Dependent Variable: Taste]