
Categorical Explanatory Variables

INSR 260, Spring 2009
Bob Stine


Overview

Review MRM

Group identification, dummy variables

Partial F test

Interaction

Prediction (similar to SRM)

Example (from Bowerman, Ch 4)
  Sales volume and location


Multiple Regression Model

Equation has k explanatory variables
  Mean: $E(Y \mid X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k = \mu_{y|x}$
  Observations: $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i$

Assumptions
  Independent observations
  Equal variance $\sigma^2$
  Normal distribution around the “line”: $y_i \sim N(\mu_{y|x}, \sigma^2)$, equivalently $\varepsilon_i \sim N(0, \sigma^2)$

Issue for this lecture
  How to incorporate categorical explanatory variables that measure group differences.


Example (Table 4.9)

Context
  Retailer is studying the relationship between
  Y = Sales volume in franchise stores, in $1,000
  X = Number of households near location, in thousands

Overall 15 locations, SRM gives

Parameter Estimates
Term               Estimate    Std Error   t Ratio   Prob>|t|
Intercept          14.867648   13.12805    1.13      0.2779
Households (000)   0.9371196   0.073045    12.83     <.0001*

Question
  Does the type of location influence the relationship between sales volume and population near the location?
  Three types of location: mall, suburban (labeled “street” in the JMP output), or downtown

[Figure: scatterplot of Sales ($000) versus Households (000) for the 15 locations, shown in black-and-white and color versions]


Separate Fits

Question
  Does the type of location influence the relationship between sales volume and population near the location?
  Mall, suburban, downtown
  Five stores from each type of location

Are differences important? Statistically significant?

[Figures: Bivariate Fit of Sales ($000) By Households (000), one panel per location (downtown, mall, street), each with a linear fit]

Linear fits by location
  Downtown: Sales ($000) = 18.155451 + 0.887074 * Households (000)
  Mall:     Sales ($000) = 50.630163 + 0.8289871 * Households (000)
  Street:   Sales ($000) = 7.9004191 + 0.9207038 * Households (000)

SRM (all 15 locations)
Term               Estimate    Std Error   t Ratio   Prob>|t|
Intercept          14.867648   13.12805    1.13      0.2779
Households (000)   0.9371196   0.073045    12.83     <.0001*


Qualitative Variables

Represent categories using “dummy variables”
  A 0/1 indicator for each of the categories
  Redundant: only need 2 dummies for the 3 categories

Data table
  JMP software makes the manual creation of dummy variables unnecessary.
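For readers working outside JMP, the 0/1 coding described above can be sketched by hand in Python. This is only an illustration: the names DD and DM follow the JMP output shown later in these slides, while the function name `make_dummies` and the short list of labels are made up here and are not the Table 4.9 data.

```python
def make_dummies(locations, baseline="street"):
    """Return one 0/1 indicator list for each category other than the baseline."""
    levels = sorted(set(locations) - {baseline})
    return {lvl: [1 if loc == lvl else 0 for loc in locations] for lvl in levels}

# Hypothetical labels (NOT the Table 4.9 data), used only to show the coding.
locations = ["street", "mall", "downtown", "mall", "street", "downtown"]

dummies = make_dummies(locations)
DD = dummies["downtown"]  # 1 for downtown stores, 0 otherwise
DM = dummies["mall"]      # 1 for mall stores, 0 otherwise

# Street stores have DD = 0 and DM = 0, so a third dummy would be redundant.
for loc, dd, dm in zip(locations, DD, DM):
    print(f"{loc:9s} DD={dd} DM={dm}")
```

Leaving out the baseline group is what keeps the dummies from duplicating the intercept; any of the three groups could serve as the baseline.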


Regression with Categorical

Add the dummy variables to the regression…

Or simply add the categorical variable itself…

Interpretation of fitted models?
  By default, JMP codes a categorical explanatory variable differently from 0/1 dummy variables. Same fit, but different estimates for the intercept and the location terms, and a different interpretation.

Summary of Fit (identical for both fits)
RSquare                      0.986846
RSquare Adj                  0.983258
Root Mean Square Error       6.349409
Mean of Response             176.9893
Observations (or Sum Wgts)   15

Parameter Estimates, categorical variable Location added (JMP default coding)
Term                 Estimate    Std Error   t Ratio   Prob>|t|
Intercept            26.723538   7.194046    3.71      0.0034*
Households (000)     0.8685884   0.04049     21.45     <.0001*
Location[downtown]   -4.882067   2.553028    -1.91     0.0822
Location[mall]       16.627912   2.359355    7.05      <.0001*

Parameter Estimates, dummy variables DD and DM added
Term                 Estimate    Std Error   t Ratio   Prob>|t|
Intercept            14.977693   6.188445    2.42      0.0340*
Households (000)     0.8685884   0.04049     21.45     <.0001*
DD                   6.8637768   4.770477    1.44      0.1780
DM                   28.373756   4.461307    6.36      <.0001*
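The two sets of estimates describe the same three parallel lines. A minimal sketch of the link between them, assuming JMP's default coding is sum-to-zero ("effect") coding in which the omitted level (street) receives minus the sum of the displayed location effects; that assumption is not stated on the slide, but the numbers below are consistent with it. Variable names are illustrative.

```python
# Estimates copied from the fit with the categorical variable (JMP default coding).
intercept = 26.723538
effect = {"downtown": -4.882067, "mall": 16.627912}

# Assumption: the location effects sum to zero, so the omitted level (street)
# gets minus the sum of the two displayed effects.
effect["street"] = -sum(effect.values())

# Intercept of each group's line under the default coding.
group_intercept = {g: intercept + e for g, e in effect.items()}

# Re-express relative to street, the group left out of the DD/DM fit.
b0 = group_intercept["street"]
DD_coef = group_intercept["downtown"] - b0
DM_coef = group_intercept["mall"] - b0

print(f"street intercept = {b0:.4f}, DD = {DD_coef:.4f}, DM = {DM_coef:.4f}")
```

Under this reading, the default-coding intercept is the average of the three group intercepts, whereas the dummy-variable intercept belongs to the street group alone.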


JMP Fit with Dummy Vars

Add the dummy variables to the regression…

Add categorical variable, using the “indicator parameterization”

Interpretation of fitted models?
  Estimates now match up
  Still missing that other category

Summary of Fit (identical for both fits)
RSquare                      0.986846
RSquare Adj                  0.983258
Root Mean Square Error       6.349409
Mean of Response             176.9893
Observations (or Sum Wgts)   15

Indicator Function Parameterization (categorical variable Location)
Term                 Estimate    Std Error   DFDen   t Ratio   Prob>|t|
Intercept            14.977693   6.188445    11.00   2.42      0.0340*
Households (000)     0.8685884   0.04049     11.00   21.45     <.0001*
Location[downtown]   6.8637768   4.770477    11.00   1.44      0.1780
Location[mall]       28.373756   4.461307    11.00   6.36      <.0001*

Parameter Estimates (dummy variables DD and DM)
Term                 Estimate    Std Error   t Ratio   Prob>|t|
Intercept            14.977693   6.188445    2.42      0.0340*
Households (000)     0.8685884   0.04049     21.45     <.0001*
DD                   6.8637768   4.770477    1.44      0.1780
DM                   28.373756   4.461307    6.36      <.0001*


Interpretation

Plot of fitted model (with categorical variable added) shows fit of the model as 3 parallel lines

Coefficients of the location terms are shifts (changes in the intercept) relative to the excluded group (street locations)

[Figure: Regression Plot of Sales ($000) versus Households (000), with parallel fitted lines for downtown, mall, and street]

Indicator Function Parameterization
Term                 Estimate    Std Error   DFDen   t Ratio   Prob>|t|
Intercept            14.977693   6.188445    11.00   2.42      0.0340*
Households (000)     0.8685884   0.04049     11.00   21.45     <.0001*
Location[downtown]   6.8637768   4.770477    11.00   1.44      0.1780
Location[mall]       28.373756   4.461307    11.00   6.36      <.0001*


Partial F-Test

Are the differences among intercepts for the locations statistically significant?
  H0: $\beta_{\text{downtown}} = \beta_{\text{mall}} = 0$
  Test of two coefficients simultaneously

Partial F-test considers the contribution to the fit obtained by 1 or more explanatory variables

Two ways to compute test statistic
  JMP provides “Effect Test” for categorical variable
  Compare $R^2$ statistics between the models (then you’ll need to obtain the p-value of the test)

$$F = \frac{(\text{Change in } R^2)/(\#\text{ added } x\text{'s})}{(1 - R^2_{\text{all}})/(n-k-1)}$$

Here $R^2_{\text{all}}$ and $k$ refer to the larger model, the one that includes the added explanatory variables.


Example

Test H0: $\beta_{\text{downtown}} = \beta_{\text{mall}} = 0$

JMP provides effect test, rejecting H0

Compare explained variation obtained by two regressions, with and without categorical terms

Effect Tests
Source             Nparm   DF   Sum of Squares   F Ratio    Prob > F
Households (000)   1       1    18552.427        460.1867   <.0001*
Location           2       2    2024.342         25.1066    <.0001*

Summary of Fit           Without Location   With Location
RSquare                      0.926798           0.986846
RSquare Adj                  0.921167           0.983258
Root Mean Square Error       13.77793           6.349409
Mean of Response             176.9893           176.9893
Observations (or Sum Wgts)   15                 15

$$F = \frac{(0.9868 - 0.9268)/2}{(1 - 0.9868)/(15 - 1 - 3)} \approx 25$$
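The R-squared route to the partial F statistic can be checked with a few lines of Python. This is a sketch, not JMP's procedure: the function names are made up, and the p-value uses a closed form for the upper tail of an F distribution that holds only when exactly two coefficients are tested (2 numerator degrees of freedom). For other cases a statistics library or an F table would be needed.

```python
def partial_f(r2_full, r2_reduced, n_added, n, k_full):
    """Partial F statistic from the R-squared of the full and reduced models."""
    numerator = (r2_full - r2_reduced) / n_added
    denominator = (1.0 - r2_full) / (n - k_full - 1)
    return numerator / denominator

def p_value_two_numerator_df(f, df_den):
    """Upper-tail probability of F(2, df_den); valid only for 2 numerator df."""
    return (1.0 + 2.0 * f / df_den) ** (-df_den / 2.0)

# R-squared values from the two Summary of Fit tables (with and without Location).
f_location = partial_f(r2_full=0.986846, r2_reduced=0.926798, n_added=2, n=15, k_full=3)
p_location = p_value_two_numerator_df(f_location, df_den=15 - 3 - 1)

print(f"F = {f_location:.2f}, p-value = {p_location:.6f}")
```

The statistic comes out at about 25.1, in line with the F Ratio for Location in the Effect Tests, and the p-value falls below .0001.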


Interaction

Why assume that the fitted lines are parallel?
  Why should the relationship between the number of households and sales be the same in the three locations?

Interaction implies that the slope of an explanatory variable depends on the value of another explanatory variable.

Most common interaction: between a categorical and numerical variable.
  The slope depends upon the group. Slopes in the initial simple regressions are not identical.
  Can also have interactions between other variables (text)

An interaction is obtained by adding the product of two explanatory variables.


Fitting an Interaction

Two approaches
  Let JMP build the products for you
  Build products of the dummy and numerical variables and add these to the regression model

JMP builds this model by “crossing” the number of households with the location

[Figure: Regression Plot of Sales ($000) versus Households (000), with separate fitted lines for downtown, mall, and street]

Summary of Fit
RSquare                      0.987657
RSquare Adj                  0.9808
Root Mean Square Error       6.799532
Mean of Response             176.9893
Observations (or Sum Wgts)   15

Indicator Function Parameterization
Term                                  Estimate    Std Error   DFDen   t Ratio   Prob>|t|
Intercept                             7.9004191   17.03513    9.00    0.46      0.6538
Households (000)                      0.9207038   0.123428    9.00    7.46      <.0001*
Location[downtown]                    10.255032   21.28319    9.00    0.48      0.6414
Location[mall]                        42.729744   21.5042     9.00    1.99      0.0782
Location[downtown]*Households (000)   -0.03363    0.138188    9.00    -0.24     0.8132
Location[mall]*Households (000)       -0.091717   0.14163     9.00    -0.65     0.5334

Mall: $\hat{y} = 7.90 + 0.921\,\text{Households} + 42.73 - 0.092\,\text{Households} = 50.63 + 0.829\,\text{Households}$
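The same arithmetic can be carried out for all three groups. The sketch below copies the estimates from the Indicator Function Parameterization table and adds each group's shift and product-term coefficient to the baseline (street) intercept and slope; the dictionary keys and the helper name `group_line` are illustrative only.

```python
# Estimates copied from the interaction fit (street is the baseline group).
b = {
    "intercept": 7.9004191,
    "households": 0.9207038,
    "downtown": 10.255032,
    "mall": 42.729744,
    "downtown*households": -0.03363,
    "mall*households": -0.091717,
}

def group_line(location):
    """Intercept and slope implied by the interaction model for one location."""
    intercept = b["intercept"] + b.get(location, 0.0)
    slope = b["households"] + b.get(f"{location}*households", 0.0)
    return intercept, slope

lines = {loc: group_line(loc) for loc in ("downtown", "mall", "street")}
for loc, (a, s) in lines.items():
    print(f"{loc:9s} sales = {a:.3f} + {s:.4f} * households")
```

Each resulting line agrees, up to rounding, with the separate simple regression fitted to that location earlier in the lecture.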


Testing the Interaction

Fitted equation with the interaction reproduces original simple regressions for each category.
  Are the slopes really so different?

Partial F test
  Test H0: $\beta_{\text{interaction terms}} = 0$; not significant.

Location is not statistically significant when the interaction is present in the fitted model.
  Typical advice: Remove an interaction that is not statistically significant. Decide status of Location after simplifying model.

Effect Tests
Source                      Nparm   DF   Sum of Squares   F Ratio    Prob > F
Households (000)            1       1    13437.839        290.6507   <.0001*
Location                    2       2    229.353          2.4804     0.1387
Location*Households (000)   2       2    27.362           0.2959     0.7508
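The F ratios in this table can also be recovered from the sums of squares: each is the effect's mean square (sum of squares divided by its DF) over the model's error mean square, which is the square of the Root Mean Square Error of the interaction fit. A small sketch using the numbers printed on the slides; the variable names are illustrative.

```python
rmse = 6.799532   # Root Mean Square Error of the model with the interaction
mse = rmse ** 2   # error mean square, on 9 degrees of freedom

# (sum of squares, DF) copied from the Effect Tests table.
effects = {
    "Households (000)": (13437.839, 1),
    "Location": (229.353, 2),
    "Location*Households (000)": (27.362, 2),
}

# F ratio = (effect sum of squares / DF) / error mean square
f_ratio = {name: (ss / df) / mse for name, (ss, df) in effects.items()}
for name, f in f_ratio.items():
    print(f"{name:28s} F = {f:.4f}")
```

The small F ratio for the interaction says that letting the three slopes differ explains little beyond what the parallel-lines model already captures.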


Checking Assumptions

Usual diagnostic plots
  Color-coding is very helpful

Least squares means
  Average of response in each group at the average value of the explanatory variable
  Handy comparison among groups at common value of explanatory variable

[Figures: Residual by Predicted Plot (Sales ($000) Residual versus Sales ($000) Predicted); Leverage Plot for Households (000), P<.0001; Leverage Plot for Location, P<.0001]

Least Squares Means Table (Location)
Level      Least Sq Mean   Std Error   Mean
downtown   172.10727       3.0340765   195.038
mall       193.61725       2.8730165   202.998
street     165.24349       3.2142985   132.932
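The least squares means appear to be the parallel-lines model evaluated at the overall average number of households. That average is not printed on the slides; the sketch below backs it out from the simple regression, using the fact that a least squares line passes through the point of means. Treat this as a plausibility check under that assumption rather than a description of how JMP computes the table.

```python
# Simple regression on all 15 stores: the fitted line passes through
# (mean households, mean sales), so the mean of households can be recovered.
srm_intercept, srm_slope = 14.867648, 0.9371196
mean_sales = 176.9893
mean_households = (mean_sales - srm_intercept) / srm_slope

# Parallel-lines model (indicator parameterization, street as baseline).
b0, b_households = 14.977693, 0.8685884
shift = {"downtown": 6.8637768, "mall": 28.373756, "street": 0.0}

# Fitted value for each location at the common, average number of households.
ls_mean = {loc: b0 + s + b_households * mean_households for loc, s in shift.items()}
for loc, m in ls_mean.items():
    print(f"{loc:9s} least squares mean = {m:.2f}")
```

Comparing these with the raw means in the last column of the table shows how much of the raw difference among groups reflects differences in the number of nearby households rather than the type of location.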


Another Diagnostic

Why assume that variances of the errors are the same in each group?
  Slopes, intercepts may be different
  Why force all 3 groups to have the same RMSE?

Plot residuals, grouped by category
  Too few to be definitive in this example (5 in each), but seem similar

[Figure: Oneway Analysis of Residual Sales ($000) By Location (downtown, mall, street)]


Prediction

Use fitted model with number of households, location to predict sales

Prediction interval determined by common estimate $s^2$ and any extrapolation.

Check the normal quantile plot before relying on normality

Indicator Function Parameterization
Term                 Estimate    Std Error   DFDen   t Ratio   Prob>|t|
Intercept            14.977693   6.188445    11.00   2.42      0.0340*
Households (000)     0.8685884   0.04049     11.00   21.45     <.0001*
Location[downtown]   6.8637768   4.770477    11.00   1.44      0.1780
Location[mall]       28.373756   4.461307    11.00   6.36      <.0001*

[Figure: Normal Quantile Plot of Residual Sales ($000), with histogram of counts]
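A point prediction from the parallel-lines model is a matter of plugging in. The sketch below uses the estimates in the table above for a hypothetical mall store with 200 thousand households nearby (a value chosen for illustration, not taken from the data). The interval shown is deliberately rough: it uses only the RMSE and the 97.5th percentile of the t distribution with 11 degrees of freedom (2.201, from a t table) and ignores the uncertainty in the estimated coefficients, so the 95% prediction interval reported by software would be somewhat wider.

```python
b0, b_households = 14.977693, 0.8685884
shift = {"downtown": 6.8637768, "mall": 28.373756, "street": 0.0}
rmse = 6.349409   # s, the common estimate of the error SD (11 error df)
t_crit = 2.201    # 97.5th percentile of t with 11 df, taken from a t table

def predict_sales(households, location):
    """Point prediction of sales ($000) from the parallel-lines model."""
    return b0 + shift[location] + b_households * households

# Hypothetical new store: mall location, 200 thousand households nearby.
y_hat = predict_sales(200, "mall")

# Rough half-width: omits the extra term for estimation error and extrapolation,
# so it understates the width of the actual prediction interval.
half_width = t_crit * rmse

print(f"predicted sales: {y_hat:.1f} ($000)")
print(f"rough 95% range: {y_hat - half_width:.1f} to {y_hat + half_width:.1f}")
```

The omitted term grows as the new store's number of households moves away from the values seen in the data, which is the sense in which extrapolation widens the interval.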


Summary

Distinguishing groups using dummy variables
  Refer to JMP’s “indicator parameterization”

Partial F test
  Test a subset of estimates, such as those associated with a categorical variable

Interaction: slope depends on group
  Other types of interaction, such as quadratic, are described in the text

Discussion
  Why not fit separate regressions for each group?
