Chapter 11: Multiple Regression - Purdue University (ghobbs/STAT_301/Chapter11.pdf)
Chapter 11: Multiple Regression
Learning goals for this chapter:
• Interpret a scatterplot, residual plot, and Normal probability plot.
• Use SPSS to find the following: least-squares regression line, correlation, $r^2$, and
  estimate for σ.
• Find the predicted response and residual for particular explanatory variable
  values.
• Perform a hypothesis test for regression ANOVA, individual coefficients, and
  zero population correlation/independence, including: stating the null and
  alternative hypotheses, obtaining the test statistic and P-value from SPSS, and
  stating conclusions in terms of the story.
• Understand why it is important to refine the model instead of always keeping all
  the explanatory variables in the model.
• Know which changes are good changes when reducing the model one variable at
  a time.
• Know which changes are bad changes when reducing the model one variable at a
  time, and also know what to do after a bad change.
• Explain which explanatory variables should be included in a model.
Multiple Regression is what you use when you have 2 or more quantitative
explanatory variables which will be used to predict another quantitative response
variable.
Simple Linear Regression (Chapters 2 and 10) is used when you have just 1 quantitative
explanatory variable and 1 quantitative response variable.
For simple linear regression (Chapters 2 and 10), our statistical model was:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
In multiple regression (Chapter 11), our statistical model is:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \varepsilon_i$$
where you have p explanatory variables.
Just because you have data for several x variables doesn’t mean that all the x variables are
important enough to go in your model. We must do a multiple-step procedure to decide
which x variables are the most important when describing y.
So what do we do when we have multiple x variables?
1. Look at the variables individually.
Means, standard deviations, minimums, and maximums, outliers (if any), stem
plots or histograms are all good ways to show what is happening with your
individual variables.
In SPSS, Analyze → Descriptive Statistics → Explore.
2. Look at the relationships between the variables using the correlation and scatter
plots.
In SPSS, Analyze → Correlate → Bivariate. Put all your variables (all the x’s
and y) into the “Variables” box, and click “OK.”
The larger the absolute value of the Pearson Correlation between 2 variables, the
stronger the linear relationship, and the lower the Sig. (2-tailed), the stronger the
evidence against zero correlation. The P-value (Sig.) is the result of the test
$H_0: \rho = 0$ vs. $H_a: \rho \neq 0$ that we did in Chapter 10.
Which are the stronger relationships between an x and the y? Which are the
stronger x-to-x relationships?
Look at scatter plots between each pair of variables, too (you will look at a
LOT of graphs).
We are only interested in keeping the variables which had strong correlations.
3. Do a regression using the variables you decided were important from part 2.
This will include an ANOVA table and coefficient output like what we saw in
Chapter 10.
ANOVA Table for Multiple Regression:
ANOVA
Source       SS    df             MS                    F         Significance
Regression   SSM   DFM = p        MSM = SSM/DFM         MSM/MSE   P-value
Residual     SSE   DFE = n-p-1    MSE = SSE/DFE = s^2
Total        SST   DFT = n-1      MST = SST/DFT
s = estimate for the standard deviation σ = $\sqrt{MSE}$
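To see how the table entries fit together, here is a minimal sketch that recomputes the mean squares, F, s, and $R^2$ from the sums of squares. The numbers are the SSM and SSE that SPSS reports for the three-variable CHEESE model later in this chapter. The function name `anova_summary` is hypothetical (it is not an SPSS feature), and $R^2$ = SSM/SST is used as the fraction of variation explained.

```python
import math

def anova_summary(ssm, sse, n, p):
    """Fill in the regression ANOVA table from SSM, SSE, n subjects, p predictors."""
    dfm = p              # model degrees of freedom
    dfe = n - p - 1      # error (residual) degrees of freedom
    msm = ssm / dfm
    mse = sse / dfe
    return {
        "DFM": dfm, "DFE": dfe, "DFT": n - 1,
        "MSM": msm, "MSE": mse,
        "F": msm / mse,
        "s": math.sqrt(mse),          # estimate of sigma
        "R2": ssm / (ssm + sse),      # SST = SSM + SSE
    }

# Sums of squares from the CHEESE output (Taste on Acetic, H2S, Lactic)
table = anova_summary(ssm=4994.476, sse=2668.411, n=30, p=3)
for name, value in table.items():
    print(f"{name:>4}: {value:.4f}")
```

The printed F and s should agree with the SPSS values (16.221 and 10.1307) up to rounding.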
Analysis of Variance F-Test:
In the multiple regression model, the hypotheses are
$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$
$H_a$: not all of $\beta_1, \beta_2, \dots, \beta_p$ are 0
(We had ANOVA results for simple linear regression in Ch. 10, too, but since we
only had one slope, $\beta_1$, we didn’t need to use them.)
$H_a$ means at least one $\beta_j \neq 0$. We can’t tell how many regression coefficients are not
0 at this point. We need to do t-tests to be more specific. (Think back: We did
Bonferroni multiple comparisons t tests if we could reject the null hypothesis in a
One-way ANOVA F test.) If we reject $H_0$, basically we have determined that this problem is
worthy of further study.
Even if the P-value (Sig.) is small, you need to look at $R^2$. If $R^2$ is small, it
means the model (variables) you are using does not do a very good job of
explaining the variation in y.
You can get a fitted regression equation from the estimates $b_j$ in the
SPSS output at this point.
The SPSS output will also include t test statistics and P-values
for the individual $b_j$.
The degrees of freedom we will use for the t-tests will be n-p-1 where
n is the # of subjects we study,
p is the # of explanatory variables.
To test $H_0: \beta_j = 0$ for an individual variable, use the t statistic
$$t = \frac{b_j}{SE_{b_j}}$$
(given to you in SPSS)
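As a small illustration of this t statistic, the sketch below divides an estimate by its standard error and builds the matching 95% confidence interval as $b_j \pm t^* SE_{b_j}$. The inputs are the H2S coefficient and standard error from the three-variable CHEESE output later in this chapter. The critical value 2.056 for 26 degrees of freedom is taken from a t table, and the function names are hypothetical.

```python
def t_statistic(b, se):
    """t = b_j / SE_bj, with n - p - 1 degrees of freedom."""
    return b / se

def confidence_interval(b, se, t_star):
    """Interval b_j +/- t* x SE_bj."""
    margin = t_star * se
    return b - margin, b + margin

b_h2s, se_h2s = 3.912, 1.248   # H2S row of the CHEESE coefficient table
T_STAR_DF26 = 2.056            # assumed: 95% critical value for 30 - 3 - 1 = 26 df (t table)

t = t_statistic(b_h2s, se_h2s)
low, high = confidence_interval(b_h2s, se_h2s, T_STAR_DF26)
print(f"t = {t:.3f}")
print(f"95% CI for the H2S coefficient: ({low:.3f}, {high:.3f})")
```

Because the inputs are already rounded to three decimals, the computed t differs slightly in the last digit from the 3.133 that SPSS prints. The interval does not contain 0, which matches the small P-value for H2S.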
4. Interpretation of Results
Sometimes variables that are significant by themselves may not be significant
when other variables are included too.
The significance tests for individual regression coefficients assess the significance
of each predictor variable assuming that all other predictors are included in the
regression equation.
5. Residuals
Use residuals to help determine whether the multiple regression model is
appropriate for the data.
Use several residual plots since there are several explanatory variables.
Plot residuals vs. each of the explanatory variables and vs. the predicted values $\hat{y}$.
Look for outliers, influential observations, evidence of a nonlinear relation, and
anything else unusual.
Use a Normal probability plot to check whether the residuals are approximately Normally
distributed. (Look for your points to fall close to a straight, increasing line.)
6. Refining the Model
Try deleting the variable with the largest P-value, then rerun the regression.
Check to see whether $R^2$, s, the P-value from the F-test, and the individual t statistics
change much.
o $R^2$ should be as high as possible (or at least not drop drastically when you
remove a variable).
o The standard deviation, s, should be as small as possible.
o The F-test statistic from ANOVA should get bigger, and the P-value from
the ANOVA F-test should get smaller.
o Any variables left in the equation should have a significant P-value from
their t-test of the coefficient (their confidence intervals should not contain
0), unless taking out a slightly insignificant coefficient makes $R^2$ and s
move in the wrong direction.
Our goal is to keep only the variables which are the most useful to us. Get rid of
any excess variables, but balance removing insignificant variables against the
effect that has on the whole model.
How do we know which variables should be included in our model and which should
not?
***Procedure 1:
Start with a model that contains all your explanatory variables with strong correlations,
run the regression, and then remove, one at a time, whichever variables aren’t significant
in the t-test until you find that your $R^2$ starts to decrease too rapidly or your s goes up
too rapidly. You may end up leaving in one or more variables which are not significant
on their own. You just have to see what removing them does to the whole model. (This
is the procedure that we will follow in the lecture notes and that you should use for this
class.)
How much of a bad change is too much?
If $R^2$ drops by more than 1-2%, or if s increases by more than 5% when dropping out a
variable, and if the P-value on the newly dropped variable’s coefficient is between α and
2α, then maybe you should put that dropped variable back in.
Good changes:
• $R^2$ increases
• s decreases
• F increases, P-value decreases
• Fewer coefficients with big P-values
OK changes:
• $R^2$ decreases by less than 1 or 2%
• s increases by less than 5%
Bad changes:
• $R^2$ decreases by more than 2%
• s increases by more than 5%
• F decreases, P-value increases
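The rules of thumb above can be sketched as a small helper. This is one possible reading, not a definitive rule: it assumes the $R^2$ limit is a drop of 2 percentage points and that the limit on s is a 5 percent increase, and it labels "no change" as OK. The function name `classify_change` and its parameters are hypothetical.

```python
def classify_change(r2_old, r2_new, s_old, s_new,
                    r2_drop_limit=2.0, s_rise_limit=5.0):
    """Label the change in R^2 and s after dropping one variable.

    r2_old, r2_new: R^2 in percent (for example 65.2).
    r2_drop_limit:  assumed limit on the drop in R^2, in percentage points.
    s_rise_limit:   assumed limit on the increase in s, in percent of the old s.
    """
    r2_drop = r2_old - r2_new                    # positive means R^2 got worse
    s_rise = 100.0 * (s_new - s_old) / s_old     # positive means s got worse

    def label(worsening, limit):
        if worsening < 0:
            return "good"
        return "OK" if worsening < limit else "bad"

    return {"R2": label(r2_drop, r2_drop_limit), "s": label(s_rise, s_rise_limit)}

# CHEESE example: dropping Acetic leaves R^2 at 65.2% and lowers s
result = classify_change(r2_old=65.2, r2_new=65.2, s_old=10.1307, s_new=9.9424)
print(result)
```

The labels are only a starting point. The notes still ask you to weigh the F statistic and the individual P-values before deciding.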
Procedure 2:
Start with a model that contains only one explanatory variable and add one variable at a
time until you find that your $R^2$ is no longer increasing rapidly. Chapter 11 in the book uses
a slightly different method. You should read it over to compare.
Sometimes there may be more than one appropriate choice for your model. The
most important thing is to be able to explain why you chose the model you did. Not
every model is as easy to define as the one in the CHEESE example below.
Example: As cheddar cheese matures, a variety of chemical processes take place. The
taste of mature cheese is related to the concentration of several chemicals in the final
product. In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia,
samples of cheese were analyzed for their chemical composition and were subjected to
taste tests.
Data for one type of cheese-manufacturing process appear below. The
variable “Case” is used to number the observations from 1 to 30. “Taste” is the response
variable of interest. The taste scores were obtained by combining the scores from several
tasters.
Three chemicals whose concentrations were measured were acetic acid, hydrogen
sulfide, and lactic acid. For acetic acid and hydrogen sulfide (natural) log
transformations were taken. Thus the explanatory variables are the transformed
concentrations of acetic acid (“Acetic”) and hydrogen sulfide (“H2S”), and the
untransformed concentration of lactic acid (“Lactic”).
Case Taste Acetic H2S Lactic
1 12.3 4.543 3.135 0.86
2 20.9 5.159 5.043 1.53
3 39 5.366 5.438 1.57
4 47.9 5.759 7.496 1.81
5 5.6 4.663 3.807 0.99
6 25.9 5.697 7.601 1.09
7 37.3 5.892 8.726 1.29
8 21.9 6.078 7.966 1.78
9 18.1 4.898 3.85 1.29
10 21 5.242 4.174 1.58
11 34.9 5.74 6.142 1.68
12 57.2 6.446 7.908 1.9
13 0.7 4.477 2.996 1.06
14 25.9 5.236 4.942 1.3
15 54.9 6.151 6.752 1.52
16 40.9 6.365 9.588 1.74
17 15.9 4.787 3.912 1.16
18 6.4 5.412 4.7 1.49
19 18 5.247 6.174 1.63
20 38.9 5.438 9.064 1.99
21 14 4.564 4.949 1.15
22 15.2 5.298 5.22 1.33
23 32 5.455 9.242 1.44
24 56.7 5.855 10.199 2.01
25 16.8 5.366 3.664 1.31
26 11.6 6.043 3.219 1.46
27 26.5 6.458 6.962 1.72
28 0.7 5.328 3.912 1.25
29 13.4 5.802 6.685 1.08
30 5.5 6.176 4.787 1.25
a) For each of the 4 variables in the CHEESE data set, find the mean, median,
standard deviation, and IQR. Display each distribution by means of a
stemplot.
Descriptives
Statistic                        Taste      Acetic     H2S        Lactic
Mean                             24.533     5.49803    5.94177    1.4420
Std. Error of Mean               2.9678     .104228    .388313    .05541
95% CI for Mean, Lower Bound     18.463     5.28486    5.14758    1.3287
95% CI for Mean, Upper Bound     30.603     5.71120    6.73596    1.5553
5% Trimmed Mean                  24.052     5.50043    5.87765    1.4407
Median                           20.950     5.42500    5.32900    1.4500
Variance                         264.237    .326       4.524      .092
Std. Deviation                   16.2554    .570878    2.126879   .30349
Minimum                          .7         4.477      2.996      .86
Maximum                          57.2       6.458      10.199     2.01
Range                            56.5       1.981      7.203      1.15
Interquartile Range              24.6       .713       3.766      .46
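For readers without SPSS, the summary statistics for Taste can be reproduced roughly as follows with Python's standard `statistics` module. This is a sketch, not the SPSS procedure: the Taste values are copied from the CHEESE table above, and the quartiles use the module's default "exclusive" method, which may not match SPSS for every data set.

```python
import statistics

# Taste column of the CHEESE data (cases 1 to 30)
taste = [12.3, 20.9, 39, 47.9, 5.6, 25.9, 37.3, 21.9, 18.1, 21,
         34.9, 57.2, 0.7, 25.9, 54.9, 40.9, 15.9, 6.4, 18, 38.9,
         14, 15.2, 32, 56.7, 16.8, 11.6, 26.5, 0.7, 13.4, 5.5]

mean = statistics.mean(taste)
median = statistics.median(taste)
sd = statistics.stdev(taste)                     # sample SD (divides by n - 1)
q1, _, q3 = statistics.quantiles(taste, n=4)     # quartiles, default "exclusive" method
iqr = q3 - q1

print(f"mean = {mean:.3f}, median = {median:.3f}")
print(f"sd = {sd:.4f}, IQR = {iqr:.3f}")
```

For Taste, the results should agree with the Descriptives table after rounding.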
Taste Stem-and-Leaf Plot
Frequency Stem & Leaf
5.00 0 . 00556
9.00 1 . 123455688
6.00 2 . 011556
5.00 3 . 24789
2.00 4 . 07
3.00 5 . 467
Acetic Stem-and-Leaf Plot
Frequency Stem & Leaf
1.00 4 . 4
5.00 4 . 55678
11.00 5 . 12222333444
6.00 5 . 677888
7.00 6 . 001134
H2S Stem-and-Leaf Plot
Frequency Stem & Leaf
1.00 2 . 9
7.00 3 . 1268899
5.00 4 . 17799
3.00 5 . 024
5.00 6 . 11679
4.00 7 . 4699
1.00 8 . 7
3.00 9 . 025
1.00 10 . 1
Lactic Stem-and-Leaf Plot
Frequency Stem & Leaf
2.00 0 . 89
15.00 1 . 000112222333444
12.00 1 . 555566777899
1.00 2 . 0
b) Make a scatterplot for each pair of variables in the CHEESE data set (you will
have 6 plots). Describe the relationships. Calculate the correlation for each
pair of variables and report the P-value for the test of zero population
correlation in each case.
Correlations
                               Taste      Acetic     H2S        Lactic
Taste   Pearson Correlation    1          .550(**)   .756(**)   .704(**)
        Sig. (2-tailed)        .          .002       .000       .000
        N                      30         30         30         30
Acetic  Pearson Correlation    .550(**)   1          .618(**)   .604(**)
        Sig. (2-tailed)        .002       .          .000       .000
        N                      30         30         30         30
H2S     Pearson Correlation    .756(**)   .618(**)   1          .645(**)
        Sig. (2-tailed)        .000       .000       .          .000
        N                      30         30         30         30
Lactic  Pearson Correlation    .704(**)   .604(**)   .645(**)   1
        Sig. (2-tailed)        .000       .000       .000       .
        N                      30         30         30         30
** Correlation is significant at the 0.01 level (2-tailed).
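One entry of the correlation table can be checked by hand. The sketch below computes the Pearson correlation between Taste and H2S from the CHEESE data, then the t statistic for the test of zero population correlation, $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with n - 2 degrees of freedom (the standard form of that test, assumed here to be the Chapter 10 version). The function names are hypothetical.

```python
import math

# Taste and H2S columns of the CHEESE data (cases 1 to 30)
taste = [12.3, 20.9, 39, 47.9, 5.6, 25.9, 37.3, 21.9, 18.1, 21,
         34.9, 57.2, 0.7, 25.9, 54.9, 40.9, 15.9, 6.4, 18, 38.9,
         14, 15.2, 32, 56.7, 16.8, 11.6, 26.5, 0.7, 13.4, 5.5]
h2s = [3.135, 5.043, 5.438, 7.496, 3.807, 7.601, 8.726, 7.966, 3.85, 4.174,
       6.142, 7.908, 2.996, 4.942, 6.752, 9.588, 3.912, 4.7, 6.174, 9.064,
       4.949, 5.22, 9.242, 10.199, 3.664, 3.219, 6.962, 3.912, 6.685, 4.787]

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = pearson_r(h2s, taste)
t = correlation_t(r, len(taste))
print(f"r = {r:.3f}, t = {t:.2f} on {len(taste) - 2} df")
```

A t statistic this large on 28 degrees of freedom is consistent with the Sig. value of .000 that SPSS reports for this pair.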
c) Perform a multiple regression using the explanatory variables which look
important at this point. Give the fitted regression equation.
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .807(a)   .652       .612                10.1307
a Predictors: (Constant), Lactic, Acetic, H2S
ANOVA(b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1 Regression    4994.476         3    1664.825      16.221   .000(a)
  Residual      2668.411         26   102.631
  Total         7662.887         29
a Predictors: (Constant), Lactic, Acetic, H2S
b Dependent Variable: Taste
d) State your hypotheses for an ANOVA F-test, give the test statistic and its P-
value, and state your conclusion.
e) Report the t statistics and P-values for the tests of the regression coefficients
of your explanatory variables. What conclusions do you draw from these
tests?
Coefficients(a)
                Unstandardized           Standardized                      95% Confidence Interval for B
Model           B         Std. Error     Beta           t        Sig.     Lower Bound   Upper Bound
1 (Constant)    -28.877   19.735                        -1.463   .155     -69.444       11.690
  H2S           3.912     1.248          .512           3.133    .004     1.346         6.478
  Lactic        19.671    8.629          .367           2.280    .031     1.933         37.408
  Acetic        .328      4.460          .012           .073     .942     -8.839        9.495
a Dependent Variable: Taste
f) What is the value of s, the estimate of the model standard deviation σ?
g) What percent of variation in taste is explained by these explanatory variables?
h) One variable looks like a good candidate to be dropped. Which one is it? Try
running the multiple regression again without this variable. Look at parts c
through h again.
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .807(a)   .652       .626                9.9424
a Predictors: (Constant), Lactic, H2S
ANOVA(b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1 Regression    4993.921         2    2496.961      25.260   .000(a)
  Residual      2668.965         27   98.851
  Total         7662.887         29
a Predictors: (Constant), Lactic, H2S
b Dependent Variable: Taste
Coefficients(a)
                Unstandardized           Standardized                      95% Confidence Interval for B
Model           B         Std. Error     Beta           t        Sig.     Lower Bound   Upper Bound
1 (Constant)    -27.592   8.982                         -3.072   .005     -46.021       -9.163
  H2S           3.946     1.136          .516           3.475    .002     1.616         6.277
  Lactic        19.887    7.959          .371           2.499    .019     3.557         36.218
a Dependent Variable: Taste
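The two SPSS fits can be cross-checked without SPSS. The sketch below fits both models by ordinary least squares, solving the normal equations $(X^T X) b = X^T y$ with a small hand-written Gaussian elimination. This is a teaching sketch in plain Python, not how SPSS computes the fit. The helper names `solve` and `fit` are hypothetical, and the data are the CHEESE columns from the table earlier in this chapter.

```python
import math

taste = [12.3, 20.9, 39, 47.9, 5.6, 25.9, 37.3, 21.9, 18.1, 21,
         34.9, 57.2, 0.7, 25.9, 54.9, 40.9, 15.9, 6.4, 18, 38.9,
         14, 15.2, 32, 56.7, 16.8, 11.6, 26.5, 0.7, 13.4, 5.5]
acetic = [4.543, 5.159, 5.366, 5.759, 4.663, 5.697, 5.892, 6.078, 4.898, 5.242,
          5.74, 6.446, 4.477, 5.236, 6.151, 6.365, 4.787, 5.412, 5.247, 5.438,
          4.564, 5.298, 5.455, 5.855, 5.366, 6.043, 6.458, 5.328, 5.802, 6.176]
h2s = [3.135, 5.043, 5.438, 7.496, 3.807, 7.601, 8.726, 7.966, 3.85, 4.174,
       6.142, 7.908, 2.996, 4.942, 6.752, 9.588, 3.912, 4.7, 6.174, 9.064,
       4.949, 5.22, 9.242, 10.199, 3.664, 3.219, 6.962, 3.912, 6.685, 4.787]
lactic = [0.86, 1.53, 1.57, 1.81, 0.99, 1.09, 1.29, 1.78, 1.29, 1.58,
          1.68, 1.9, 1.06, 1.3, 1.52, 1.74, 1.16, 1.49, 1.63, 1.99,
          1.15, 1.33, 1.44, 2.01, 1.31, 1.46, 1.72, 1.25, 1.08, 1.25]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit(y, *xs):
    """Least-squares fit of y on the given x lists; returns b, R^2, and s."""
    rows = [[1.0, *vals] for vals in zip(*xs)]       # leading 1.0 is the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    b = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(b, r)) for r in rows]
    n, p = len(y), k - 1
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return {"b": b, "R2": 1 - sse / sst, "s": math.sqrt(sse / (n - p - 1))}

full = fit(taste, acetic, h2s, lactic)     # coefficients: constant, Acetic, H2S, Lactic
reduced = fit(taste, h2s, lactic)          # coefficients: constant, H2S, Lactic
for name, model in (("full", full), ("reduced", reduced)):
    coefs = [round(v, 3) for v in model["b"]]
    print(name, coefs, f"R2 = {model['R2']:.3f}", f"s = {model['s']:.4f}")
```

Comparing the two printed lines shows the pattern in the SPSS output: $R^2$ barely moves when Acetic is dropped, while s gets smaller.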
What changed? What stayed the same or improved?
                                       Original (all 3            New (only H2S
                                       explanatory variables)     and Lactic)       Change
R^2                                    65.2%                      65.2%
s                                      10.1307                    9.9424
F, P-value                             16.221, .000               25.260, .000
Insignificant explanatory variables    Acetic                     None
j) Now look at a residual plot for each of the variables you still have in the
model. Do a Normal probability plot, too.
k) Using the better model, predict the “taste” for an H2S=4 and Lactic=1.
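A prediction and a residual follow directly from the fitted equation. The sketch below plugs values into the reduced model, using the rounded coefficients from the SPSS output above (-27.592, 3.946, 19.887). The function name `predict_taste` is hypothetical, and Case 1 is used only as an example of residual = observed - predicted.

```python
def predict_taste(h2s, lactic):
    """Fitted value from the reduced model: Taste-hat = b0 + b1*H2S + b2*Lactic."""
    b0, b1, b2 = -27.592, 3.946, 19.887   # rounded SPSS estimates
    return b0 + b1 * h2s + b2 * lactic

# Part k: predicted taste at H2S = 4 and Lactic = 1
predicted = predict_taste(h2s=4, lactic=1)

# Residual for Case 1 (Taste = 12.3, H2S = 3.135, Lactic = 0.86)
residual_case1 = 12.3 - predict_taste(h2s=3.135, lactic=0.86)

print(f"predicted taste = {predicted:.3f}")
print(f"residual for case 1 = {residual_case1:.3f}")
```

Because the coefficients are rounded to three decimals, results may differ from SPSS in the last digit.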
[Figure: residual plot, Unstandardized Residual vs. H2S]
[Figure: residual plot, Unstandardized Residual vs. Lactic]
[Figure: Normal P-P Plot of Regression Standardized Residual (Dependent Variable: Taste), Expected Cum Prob vs. Observed Cum Prob]