Post on 19-Mar-2021
Categorical Explanatory Variables
INSR 260, Spring 2009Bob Stine
1
Overview Review MRM
Group identification, dummy variables
Partial F test
Interaction
Prediction! ! ! ! ! ! ! ! ! ! ! similar to SRM
Example!! ! ! (from Bowerman, Ch 4)Sales volume and location
2
Multiple Regression ModelEquation has k explanatory variablesMean! ! ! ! E Y|X = β0 + β1 X1 +...+ βk Xk = μy|x
Observations ! ! yi = β0 + β1 xi1 +...+ βk xik + εi
AssumptionsIndependent observationsEqual variance σ2
Normal distribution around “line”! yi ~ N(μy|x,σ2)! ! ! ! εi ~ N(0, σ2)
Issue for this lectureHow to incorporate categorical explanatory variables that measure group differences.
3
Example (Table 4.9)Context
Retailer is studying the relationship betweenY = Sales volume in franchise stores, in $1,000X = Number of households near location, in thousands
Overall 15 locations, SRM gives
QuestionDoes the type of location influence the relationship between sales volume and population near the location?Three locations: in mall, suburban, or downtown
4
75
100
125
150
175
200
225
250
Sale
s (
$0
00
)
100 150 200 250
Households (000)
Intercept
Households (000)
Term
14.867648
0.9371196
Estimate
13.12805
0.073045
Std Error
1.13
12.83
t Ratio
0.2779
<.0001*
Prob>|t|
75
100
125
150
175
200
225
250
Sale
s (
$0
00
)
100 150 200 250
Households (000)
B&W Color
Separate FitsQuestion
Does the type of location influence the relationship between sales volume and population near the location?
Mall, suburban, downtownFive stores from each type of location
Are differences important? Statistically significant?
5
100
125
150
175
200
225
250
Sale
s (
$0
00
)
100 125 150 175 200 225 250
Households (000)
Linear Fit
Sales ($000) = 18.155451 + 0.887074*Households (000)
Linear Fit
Bivariate Fit of Sales ($000) By Households (000) Location=downtown
Downtown
120
140
160
180
200
220
240
260
Sale
s (
$0
00
)
100 120 140 160 180 200 220 240
Households (000)
Linear Fit
Sales ($000) = 50.630163 + 0.8289871*Households (000)
Linear Fit
Bivariate Fit of Sales ($000) By Households (000) Location=mall
Mall
90
100
110
120
130
140
150
160
Sale
s (
$0
00
)
90 100 110 120 130 140 150 160 170
Households (000)
Linear Fit
Sales ($000) = 7.9004191 + 0.9207038*Households (000)
Linear Fit
Bivariate Fit of Sales ($000) By Households (000) Location=street
Street
Intercept
Households (000)
Term
14.867648
0.9371196
Estimate
13.12805
0.073045
Std Error
1.13
12.83
t Ratio
0.2779
<.0001*
Prob>|t|
SRM
Qualitative VariablesRepresent categories using “dummy variables”
A 0/1 indicator for each of the categoriesRedundant: only need 2 dummies for the 3 categories
Data tableJMP software makes the manual creation of dummy variables unnecessary.
6
Regression with CategoricalAdd the dummy variables to the regression…
Or simply add the categorical variable itself…
Interpretation of fitted models?By default, JMP handles a categorical explanatory variable differently than with dummy variables. Same fit, but different slope estimates, interpretation.
7
RSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observations (or Sum Wgts)
0.986846
0.983258
6.349409
176.9893
15
Summary of Fit
Intercept
Households (000)
Location[downtown]
Location[mall]
Term
26.723538
0.8685884
-4.882067
16.627912
Estimate
7.194046
0.04049
2.553028
2.359355
Std Error
3.71
21.45
-1.91
7.05
t Ratio
0.0034*
<.0001*
0.0822
<.0001*
Prob>|t|
Parameter Estimates
RSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observations (or Sum Wgts)
0.986846
0.983258
6.349409
176.9893
15
Summary of Fit
Intercept
Households (000)
DD
DM
Term
14.977693
0.8685884
6.8637768
28.373756
Estimate
6.188445
0.04049
4.770477
4.461307
Std Error
2.42
21.45
1.44
6.36
t Ratio
0.0340*
<.0001*
0.1780
<.0001*
Prob>|t|
Parameter Estimates
JMP Fit with Dummy VarsAdd the dummy variables to the regression…
Add categorical variable “indicator parameterization”
Interpretation of fitted models?Slope estimates now match upStill missing that other category
8
RSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observations (or Sum Wgts)
0.986846
0.983258
6.349409
176.9893
15
Summary of Fit
Intercept
Households (000)
Location[downtown]
Location[mall]
Term
14.977693
0.8685884
6.8637768
28.373756
Estimate
6.188445
0.04049
4.770477
4.461307
Std Error
11.00
11.00
11.00
11.00
DFDen
2.42
21.45
1.44
6.36
t Ratio
0.0340*
<.0001*
0.1780
<.0001*
Prob>|t|
Indicator Function Parameterization
RSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observations (or Sum Wgts)
0.986846
0.983258
6.349409
176.9893
15
Summary of Fit
Intercept
Households (000)
DD
DM
Term
14.977693
0.8685884
6.8637768
28.373756
Estimate
6.188445
0.04049
4.770477
4.461307
Std Error
2.42
21.45
1.44
6.36
t Ratio
0.0340*
<.0001*
0.1780
<.0001*
Prob>|t|
Parameter Estimates
InterpretationPlot of fitted model (with categorical variable added) shows fit of the model as 3 parallel lines
Slopes are shifts (changes in the intercept) relative to the excluded group (street locations)
9
100
150
200
250
Sale
s (
$0
00
)
100 150 200 250
Households (000)
downtown
mall
street
Regression Plot
Intercept
Households (000)
Location[downtown]
Location[mall]
Term
14.977693
0.8685884
6.8637768
28.373756
Estimate
6.188445
0.04049
4.770477
4.461307
Std Error
11.00
11.00
11.00
11.00
DFDen
2.42
21.45
1.44
6.36
t Ratio
0.0340*
<.0001*
0.1780
<.0001*
Prob>|t|
Indicator Function Parameterization
Partial F-TestAre the differences among intercepts for the locations statistically significant?
H0: βdowntown = βmall = 0Test of two coefficient simultaneously
Partial F-test considers the contribution to the fit obtained by 1 or more explanatory variables
Two ways to compute test statisticJMP provides “Effect Test” for categorical variableCompare R2 statistics between the models (then you’ll need to obtain the p-value of the test)
10
F = (Change in R2)/(# added x’s) (1 - Rall2)/(n-k-1)
ExampleTest H0: βdowntown = βmall = 0
JMP provides effect test, rejecting H0
Compare explained variation obtained by two regressions, with and without categorical terms
11
Households (000)
Location
Source
1
2
Nparm
1
2
DF
18552.427
2024.342
Sum of
Squares
460.1867
25.1066
F Ratio
<.0001*
<.0001*
Prob > F
E!ect Tests
RSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observations (or Sum Wgts)
0.926798
0.921167
13.77793
176.9893
15
Summary of Fit
WithRSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observations (or Sum Wgts)
0.986846
0.983258
6.349409
176.9893
15
Summary of Fit
Without
F = (0.9868-0.9268)/2 ≈ 25 (1-0.9868)/(15-1-3)
InteractionWhy assume that the slopes parallel?
Why should the relationship between the number of households and sales be the same in the three locations?
Interaction implies that the slope of an explanatory variable depends on the value of another explanatory variable.
Most common interaction: between a categorical and numerical variable. The slope depends upon the group. Slopes in the initial simple regressions are not identical.Can also have interactions between other variables (text)
An interaction is obtained by adding the product of two explanatory variables.
12
Fitting an InteractionTwo approaches
Let JMP build the products for youBuild products of the dummy and numerical variables and add these to the regression model
JMP builds this model by “crossing” the number of households with the location
13
100
150
200
250
Sale
s (
$000)
100 150 200 250
Households (000)
downtown
mall
street
Regression PlotRSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observations (or Sum Wgts)
0.987657
0.9808
6.799532
176.9893
15
Summary of Fit
Intercept
Households (000)
Location[downtown]
Location[mall]
Location[downtown]*Households (000)
Location[mall]*Households (000)
Term
7.9004191
0.9207038
10.255032
42.729744
-0.03363
-0.091717
Estimate
17.03513
0.123428
21.28319
21.5042
0.138188
0.14163
Std Error
9.00
9.00
9.00
9.00
9.00
9.00
DFDen
0.46
7.46
0.48
1.99
-0.24
-0.65
t Ratio
0.6538
<.0001*
0.6414
0.0782
0.8132
0.5334
Prob>|t|
Indicator Function Parameterization
Mall: ŷ = 7.90 + 0.921 Households + 42.73 - 0.092 Households! ! = 50.63 + 0.829 Households
Testing the InteractionFitted equation with the interaction reproduces original simple regressions for each category:! ! Are the slopes really so different?
Partial F testTest H0: βinteraction terms = 0; not significant.
Location is not statistically significant when the interaction is present in the fitted model.Typical advice: Remove an interaction that is not statistically significant. Decide status of Location after simplifying model.
14
Households (000)
Location
Location*Households (000)
Source
1
2
2
Nparm
1
2
2
DF
13437.839
229.353
27.362
Sum of
Squares
290.6507
2.4804
0.2959
F Ratio
<.0001*
0.1387
0.7508
Prob > F
E!ect Tests
Checking AssumptionsUsual diagnostic plots
Color-coding is very helpful
Least squares meansAverage of response in each group at the average value of the explanatory variableHandy comparison among groups at common value of explanatory variable
15
-15
-10
-5
0
5
Sale
s (
$0
00
)
Resid
ual
100 150 200 250
Sales ($000) Predicted
Residual by Predicted Plot
100
150
200
250
Sale
s (
$000)
Levera
ge R
esid
uals
100 150 200 250
Households (000)
Leverage, P<.0001
Leverage Plot
Households (000)
100
150
200
250
Sale
s (
$0
00
)
Levera
ge R
esid
uals
165 170 175 180 185 190 195 200
Location Leverage, P<.0001
Leverage Plot
downtown
mall
street
Level
172.10727
193.61725
165.24349
Least
Sq Mean
3.0340765
2.8730165
3.2142985
Std Error
195.038
202.998
132.932
Mean
Least Squares Means Table
Location
Another DiagnosticWhy assume that variances of the errors are the same in each group?
Slopes, intercepts may be differentWhy force all 3 groups to have the same RMSE?
Plot residuals, grouped by categoryToo few to be definitive in this example (5 in each), but seem similar
16
-15
-10
-5
0
5
Resid
ual
Sale
s (
$000)
downtown mall street
Location
Oneway Analysis of Residual Sales ($000) By Location
PredictionUse fitted model with number of households, location to predict sales
Prediction interval determined by common estimate s2 and any extrapolation.
Check the normal quantileplot before rely on normality
17
Intercept
Households (000)
Location[downtown]
Location[mall]
Term
14.977693
0.8685884
6.8637768
28.373756
Estimate
6.188445
0.04049
4.770477
4.461307
Std Error
11.00
11.00
11.00
11.00
DFDen
2.42
21.45
1.44
6.36
t Ratio
0.0340*
<.0001*
0.1780
<.0001*
Prob>|t|
Indicator Function Parameterization
-15
-10
-5
0
5
1 2 3 4 5 6
Count
.01 .05.10 .25 .50 .75 .90.95 .99
-3 -2 -1 0 1 2 3
Normal Quantile Plot
Residual Sales ($000)
Summary Distinguishing groups using dummy variables
Refer to JMP’s “indicator parameterization”
Partial F testTest a subset of estimates, such as those associated with a categorical variable
Interaction: slope depends on groupOther types of interaction, such as quadratic are described in the text
DiscussionWhy not fit separate regressions for each group?
18