Categorical Explanatory Variablesstine/insr260_2009/lectures/... · 2009. 3. 21. · Leverage, P

Categorical Explanatory Variables

INSR 260, Spring 2009Bob Stine

Overview Review MRM

Group identification, dummy variables

Partial F test

Interaction

Prediction! ! ! ! ! ! ! ! ! ! ! similar to SRM

Example!! ! ! (from Bowerman, Ch 4)Sales volume and location

Multiple Regression ModelEquation has k explanatory variablesMean! ! ! ! E Y|X = β0 + β1 X1 +...+ βk Xk = μy|x

Observations ! ! yi = β0 + β1 xi1 +...+ βk xik + εi

AssumptionsIndependent observationsEqual variance σ2

Normal distribution around “line”! yi ~ N(μy|x,σ2)! ! ! ! εi ~ N(0, σ2)

Issue for this lectureHow to incorporate categorical explanatory variables that measure group differences.

Example (Table 4.9)Context

Retailer is studying the relationship betweenY = Sales volume in franchise stores, in $1,000X = Number of households near location, in thousands

Overall 15 locations, SRM gives

QuestionDoes the type of location influence the relationship between sales volume and population near the location?Three locations: in mall, suburban, or downtown

100 150 200 250

Households (000)

Intercept

Households (000)

14.867648

0.9371196

Estimate

13.12805

0.073045

Std Error

t Ratio

0.2779

<.0001*

Prob>|t|

100 150 200 250

Households (000)

B&W Color

Separate FitsQuestion

Does the type of location influence the relationship between sales volume and population near the location?

Mall, suburban, downtownFive stores from each type of location

Are differences important? Statistically significant?

100 125 150 175 200 225 250

Households (000)

Linear Fit

Sales ($000) = 18.155451 + 0.887074*Households (000)

Linear Fit

Bivariate Fit of Sales ($000) By Households (000) Location=downtown

Downtown

100 120 140 160 180 200 220 240

Households (000)

Linear Fit

Bivariate Fit of Sales ($000) By Households (000) Location=mall

90 100 110 120 130 140 150 160 170

Households (000)

Linear Fit

Bivariate Fit of Sales ($000) By Households (000) Location=street

Street

Intercept

Households (000)

14.867648

0.9371196

Estimate

13.12805

0.073045

Std Error

t Ratio

0.2779

<.0001*

Prob>|t|

Qualitative VariablesRepresent categories using “dummy variables”

A 0/1 indicator for each of the categoriesRedundant: only need 2 dummies for the 3 categories

Data tableJMP software makes the manual creation of dummy variables unnecessary.

Regression with CategoricalAdd the dummy variables to the regression…

Or simply add the categorical variable itself…

Interpretation of fitted models?By default, JMP handles a categorical explanatory variable differently than with dummy variables. Same fit, but different slope estimates, interpretation.

RSquare

RSquare Adj

Root Mean Square Error

Mean of Response

Observations (or Sum Wgts)

0.986846

0.983258

6.349409

176.9893

Summary of Fit

Intercept

Households (000)

Location[downtown]

Location[mall]

26.723538

0.8685884

-4.882067

16.627912

Estimate

7.194046

0.04049

2.553028

2.359355

Std Error

t Ratio

0.0034*

<.0001*

0.0822

<.0001*

Prob>|t|

Parameter Estimates

RSquare

RSquare Adj

Mean of Response

0.986846

0.983258

6.349409

176.9893

Summary of Fit

Intercept

Households (000)

14.977693

0.8685884

6.8637768

28.373756

Estimate

6.188445

0.04049

4.770477

4.461307

Std Error

t Ratio

0.0340*

<.0001*

0.1780

<.0001*

Prob>|t|

Parameter Estimates

JMP Fit with Dummy VarsAdd the dummy variables to the regression…

Add categorical variable “indicator parameterization”

Interpretation of fitted models?Slope estimates now match upStill missing that other category

RSquare

RSquare Adj

Mean of Response

0.986846

0.983258

6.349409

176.9893

Summary of Fit

Intercept

Households (000)

Location[downtown]

Location[mall]

14.977693

0.8685884

6.8637768

28.373756

Estimate

6.188445

0.04049

4.770477

4.461307

Std Error

t Ratio

0.0340*

<.0001*

0.1780

<.0001*

Prob>|t|

Indicator Function Parameterization

RSquare

RSquare Adj

Mean of Response

0.986846

0.983258

6.349409

176.9893

Summary of Fit

Intercept

Households (000)

14.977693

0.8685884

6.8637768

28.373756

Estimate

6.188445

0.04049

4.770477

4.461307

Std Error

t Ratio

0.0340*

<.0001*

0.1780

<.0001*

Prob>|t|

Parameter Estimates

InterpretationPlot of fitted model (with categorical variable added) shows fit of the model as 3 parallel lines

Slopes are shifts (changes in the intercept) relative to the excluded group (street locations)

100 150 200 250

Households (000)

downtown

street

Regression Plot

Intercept

Households (000)

Location[downtown]

Location[mall]

14.977693

0.8685884

6.8637768

28.373756

Estimate

6.188445

0.04049

4.770477

4.461307

Std Error

t Ratio

0.0340*

<.0001*

0.1780

<.0001*

Prob>|t|

Partial F-TestAre the differences among intercepts for the locations statistically significant?

H0: βdowntown = βmall = 0Test of two coefficient simultaneously

Partial F-test considers the contribution to the fit obtained by 1 or more explanatory variables

Two ways to compute test statisticJMP provides “Effect Test” for categorical variableCompare R2 statistics between the models (then you’ll need to obtain the p-value of the test)

F = (Change in R2)/(# added x’s) (1 - Rall2)/(n-k-1)

ExampleTest H0: βdowntown = βmall = 0

JMP provides effect test, rejecting H0

Compare explained variation obtained by two regressions, with and without categorical terms

Households (000)

Location

Source

18552.427

2024.342

Sum of

Squares

460.1867

25.1066

F Ratio

<.0001*

Prob > F

E!ect Tests

RSquare

RSquare Adj

Mean of Response

0.926798

0.921167

13.77793

176.9893

Summary of Fit

WithRSquare

RSquare Adj

Mean of Response

0.986846

0.983258

6.349409

176.9893

Summary of Fit

Without

F = (0.9868-0.9268)/2 ≈ 25 (1-0.9868)/(15-1-3)

InteractionWhy assume that the slopes parallel?

Why should the relationship between the number of households and sales be the same in the three locations?

Interaction implies that the slope of an explanatory variable depends on the value of another explanatory variable.

Most common interaction: between a categorical and numerical variable. The slope depends upon the group. Slopes in the initial simple regressions are not identical.Can also have interactions between other variables (text)

An interaction is obtained by adding the product of two explanatory variables.

Fitting an InteractionTwo approaches

Let JMP build the products for youBuild products of the dummy and numerical variables and add these to the regression model

JMP builds this model by “crossing” the number of households with the location

100 150 200 250

Households (000)

downtown

street

Regression PlotRSquare

RSquare Adj

Mean of Response

0.987657

0.9808

6.799532

176.9893

Summary of Fit

Intercept

Households (000)

Location[downtown]

Location[mall]

Location[downtown]*Households (000)

Location[mall]*Households (000)

7.9004191

0.9207038

10.255032

42.729744

-0.03363

-0.091717

Estimate

17.03513

0.123428

21.28319

21.5042

0.138188

0.14163

Std Error

t Ratio

0.6538

<.0001*

0.6414

0.0782

0.8132

0.5334

Prob>|t|

Mall: ŷ = 7.90 + 0.921 Households + 42.73 - 0.092 Households! ! = 50.63 + 0.829 Households

Testing the InteractionFitted equation with the interaction reproduces original simple regressions for each category:! ! Are the slopes really so different?

Partial F testTest H0: βinteraction terms = 0; not significant.

Location is not statistically significant when the interaction is present in the fitted model.Typical advice: Remove an interaction that is not statistically significant. Decide status of Location after simplifying model.

Households (000)

Location

Location*Households (000)

Source

13437.839

229.353

27.362

Sum of

Squares

290.6507

2.4804

0.2959

F Ratio

<.0001*

0.1387

0.7508

Prob > F

E!ect Tests

Checking AssumptionsUsual diagnostic plots

Color-coding is very helpful

Least squares meansAverage of response in each group at the average value of the explanatory variableHandy comparison among groups at common value of explanatory variable

100 150 200 250

Sales ($000) Predicted

Residual by Predicted Plot

Levera

100 150 200 250

Households (000)

Leverage, P<.0001

Leverage Plot

Households (000)

Levera

165 170 175 180 185 190 195 200

Location Leverage, P<.0001

Leverage Plot

downtown

street

172.10727

193.61725

165.24349

Sq Mean

3.0340765

2.8730165

3.2142985

Std Error

195.038

202.998

132.932

Least Squares Means Table

Location

Another DiagnosticWhy assume that variances of the errors are the same in each group?

Slopes, intercepts may be differentWhy force all 3 groups to have the same RMSE?

Plot residuals, grouped by categoryToo few to be definitive in this example (5 in each), but seem similar

downtown mall street

Location

Oneway Analysis of Residual Sales ($000) By Location

PredictionUse fitted model with number of households, location to predict sales

Prediction interval determined by common estimate s2 and any extrapolation.

Check the normal quantileplot before rely on normality

Intercept

Households (000)

Location[downtown]

Location[mall]

14.977693

0.8685884

6.8637768

28.373756

Estimate

6.188445

0.04049

4.770477

4.461307

Std Error

t Ratio

0.0340*

<.0001*

0.1780

<.0001*

Prob>|t|

1 2 3 4 5 6

.01 .05.10 .25 .50 .75 .90.95 .99

-3 -2 -1 0 1 2 3

Normal Quantile Plot

Residual Sales ($000)

Summary Distinguishing groups using dummy variables

Refer to JMP’s “indicator parameterization”

Partial F testTest a subset of estimates, such as those associated with a categorical variable

Interaction: slope depends on groupOther types of interaction, such as quadratic are described in the text

DiscussionWhy not fit separate regressions for each group?

Categorical Explanatory Variablesstine/insr260_2009/lectures/... · 2009. 3. 21. · Leverage, P

Documents

Transcript of Categorical Explanatory Variablesstine/insr260_2009/lectures/... · 2009. 3. 21. · Leverage, P

Volkert Siersma Research Unit for General Practice in Copenhagen V.Siersma@gpract.ku.dk P γ measure for association between categorical.

1 Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland.

Sections 6.1, 6.2, 6 · Sections 6.1, 6.2, 6.3 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1/33

2.5.2 Derivaons...3.1 Least squares with two or more explanatory variables 3.4 Stas+cal sogware output for mul+ple regression 2 - R and adjR2 and 3.4.1 Proper+es of R2 and σ2 - Sum

Categorical Explanatory VariablesHow to incorporate categorical explanatory variables that measure group differences. 3 Example (Table 4.9) Context Retailer is studying the relationship

Categorical Data Analysis - University College Dublin · Categorical Data Analysis ... Method of analysis is the same. 9 ... This is a snapshot of opinion at a moment in time hence

Two-sample Categorical data: Testing - University of …myweb.uiowa.edu/pbreheny/5710/f14/notes/10-29.pdfwould contain 53 blue balls and 22 red balls) and drawing out 40 balls that

RS-7 Explanatory Notes Bulletin - Vancouver...August 2019 City of Vancouver Planning - By-law Administration Bulletins Community Services, 453 W. 12th Ave Vancouver, BC V5Y 1V4 Φ

Two Categorical Variables Sections 25.1, 25.2, 25.3, 25.4 ...people.hsc.edu/faculty-staff/robbk/math121/lectures... · Outline 1 Two Categorical Variables 2 Expected Counts 3 The

Cloud and IPv6 · •An IPv6 perspective will change your Cloud architecture for the better and vice versa •Leverage funding and prioritization •Many opportunities for innovation

HE DEFENCE INDUSTRY AS AN EXPLANATORY FACTOR OF …¤he-Defence-industry-as-an-explanatory-factor-of...26,000 artillery pieces in 52 countries. Major foreign markets were those of

Econometrics: Models with Endogenous Explanatory - OCW - UC3M

1 Discrete and Categorical Data William N. Evans Department of Economics University of Maryland.

On Sticky Leverage by Gomes, Jermann and Schmid · On "Sticky Leverage" by Gomes, Jermann and Schmid Julia K. Thomas April 2015 Julia K. Thomas ( )Gomes, Jermann and Schmid April

Provable Deterministic Leverage Score Sampling€¦ · The Column Subset Selection Problem (CSSP) Deﬁnition Let A 2Rm n and let c

REPRESENTATION AND CHARACTER THEORY OF FINITE CATEGORICAL ... › tac › volumes › 31 › 21 › 31-21.pdf · A linear representation of a categorical group Gwith centre End G(1)

Intuitionistic First-Order Logic: Categorical semantics via the Curry-Howard isomorphism

CPSY 501: Lecture 12, Nov 21 Review non-parametric tests Categorical analysis: χ 2, Log-Linear Meta-analysis: e.g., Hill & Lent article Review: Cycles.

Coherence for Weak Units - uni-bielefeld.de · 1995, seems still to represent the highest-dimensional explicit weak categorical structure that can be manipulated by hand (i.e. without

The Multiple Regression Model. Two Explanatory Variables y t = 1 + 2 x t2 + 3 x t3 + ε t ytyt x t2 = 2 x t3 ytyt = 3 x t affect y t.