Explanatory Variables Must Be Linearly Independent — yibi/teaching/stat222, Handout 1C


  • Explanatory Variables Must Be Linearly Independent. Recall that the multiple linear regression model

    Yj = β0 + β1X1j + β2X2j + · · · + βpXpj + εj ,  j = 1, · · · , n,

    is shorthand for the n linear relationships

    Y1 = β0 + β1X11 + β2X21 + · · · + βpXp1 + ε1
    Y2 = β0 + β1X12 + β2X22 + · · · + βpXp2 + ε2
    ⋮
    Yn = β0 + β1X1n + β2X2n + · · · + βpXpn + εn

    The least squares estimates (β̂0, β̂1, . . . , β̂p) exist under 2 conditions:

    - n ≥ p (we cannot include too many covariates)
    - The p covariates, together with the intercept, must be linearly independent

    What does “linearly independent” mean?

  • Definition of Linear Dependence and Independence

    A set of vectors v1, v2, . . . , vn is called linearly dependent if there exist scalars a1, a2, . . . , an, not all zero, such that

    a1v1 + a2v2 + · · ·+ anvn = 0.

    Otherwise, the vectors v1, v2, . . . , vn are linearly independent. For example, the four vectors below are linearly dependent because v1 − v2 − v3 − v4 = 0,

    v1 = (1, 1, 1, 1)ᵀ,  v2 = (1, 1, 0, 0)ᵀ,  v3 = (0, 0, 1, 0)ᵀ,  v4 = (0, 0, 0, 1)ᵀ,

    but v1, v2, v3 are linearly independent because the only scalars a1, a2, a3 that make a1v1 + a2v2 + a3v3 = 0 are a1 = a2 = a3 = 0.

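    Aside (not from the handout): linear (in)dependence can be verified numerically in R by checking the rank of the matrix whose columns are the vectors.

    V <- cbind(v1 = c(1, 1, 1, 1),   # the four vectors above, as columns
               v2 = c(1, 1, 0, 0),
               v3 = c(0, 0, 1, 0),
               v4 = c(0, 0, 0, 1))
    qr(V)$rank          # 3 < 4 columns, so v1, v2, v3, v4 are linearly dependent
    qr(V[, 1:3])$rank   # 3 = 3 columns, so v1, v2, v3 are linearly independent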

  • Example

    Suppose in some study, the covariates include

    WT2 = weight at age 2, in kg

    WT9 = weight at age 9, in kg

    DW = weight gain from age 2 to 9, in kg

    The covariates WT2, WT9, and DW are linearly dependent, because

    DW = WT9−WT2.


  • What Happens When Explanatory Variables Are Linearly Dependent?

    We cannot fit the model

    Y = β0 + β1WT2 + β2WT9 + β3DW + ε,

    because the coefficients cannot be uniquely determined. Observe that

    Y = β0 + (β1 + c)WT2 + (β2 − c)WT9 + (β3 + c)DW + ε
      = β0 + β1WT2 + β2WT9 + β3DW + c(WT2 − WT9 + DW) + ε,

    where WT2 − WT9 + DW = 0. Regardless of the value of c, the mean of the response Y is the same. The set of coefficients (β1, β2, β3) fits the data as well as (β1 + c, β2 − c, β3 + c) does, for any constant c.
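    A small simulated sketch (the data here are made up for illustration) shows the symptom in R: lm() cannot estimate a coefficient for a covariate that is linearly dependent on the others, and reports NA for it.

    set.seed(1)
    WT2 <- rnorm(20, mean = 13, sd = 1)   # weight at age 2
    WT9 <- rnorm(20, mean = 28, sd = 2)   # weight at age 9
    DW  <- WT9 - WT2                      # exactly linearly dependent on WT2, WT9
    Y   <- 50 + 2 * WT9 + rnorm(20)
    coef(lm(Y ~ WT2 + WT9 + DW))          # the coefficient for DW comes back NA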

  • What to Do When Explanatory Variables Are Linearly Dependent?

    - Remove some of the explanatory variables that are linearly dependent with others, until the remaining explanatory variables are linearly independent
      - e.g., removing any one of WT2, WT9, and DW makes the remaining two linearly independent
    - Put constraint(s) on the β’s so that they can be uniquely determined
      - a commonly adopted approach for models in experimental designs


  • Dummy Variables (1)

    - Sometimes the explanatory variables are categorical, like blood type (O, A, B, AB). However, it makes NO sense to write a model

      Y = β0 + β1(blood type) + ε,

      because blood type is not a number.
    - In experimental design, the treatment factors are often categorical, e.g., the type of fertilizer.
    - How can we represent categorical variables “numerically” in a model?
      - Create a dummy variable (a.k.a. indicator variable) for each category of the categorical variable.

  • Dummy Variables (2)

    For example, for the variable “blood type”, four dummy variables are created for the 4 categories O, A, B, and AB:

    - DO = 1 if one’s blood type is O, and 0 otherwise
    - DA = 1 if one’s blood type is A, and 0 otherwise
    - DB = 1 if one’s blood type is B, and 0 otherwise
    - DAB = 1 if one’s blood type is AB, and 0 otherwise

    Though the model Y = β0 + β1(blood type) + ε makes no sense, the following model does, because DO, DA, DB, and DAB are all numbers (either 0 or 1):

    Y = β0 + β1DO + β2DA + β3DB + β4DAB + ε

    The mean responses E(Y) for the 4 blood types are then

    blood type   E(Y)
    O            β0 + β1
    A            β0 + β2
    B            β0 + β3
    AB           β0 + β4
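    As an illustration (the blood vector below is hypothetical, not from the handout), these dummy variables can be built by hand in R:

    blood <- c("O", "A", "B", "AB", "O", "A")
    DO  <- as.integer(blood == "O")    # 1 0 0 0 1 0
    DA  <- as.integer(blood == "A")    # 0 1 0 0 0 1
    DB  <- as.integer(blood == "B")    # 0 0 1 0 0 0
    DAB <- as.integer(blood == "AB")   # 0 0 0 1 0 0
    cbind(DO, DA, DB, DAB)             # each row sums to 1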

  • But the Dummy Variables Are Linearly Dependent...

    Since every individual must fall in exactly one of the 4 categories, it is always true that

    DO + DA + DB + DAB − 1 = 0.

    This means:

    - One of the 4 dummy variables is redundant, because knowing any 3 of them determines the remaining one
    - DO, DA, DB, DAB and the intercept are linearly dependent, and consequently the coefficients (β0, β1, β2, β3, β4) cannot be uniquely determined

    For this reason, we say the model

    Y = β0 + β1DO + β2DA + β3DB + β4DAB + ε

    is overparameterized: it specifies more parameters than we actually need.

  • How to Deal with Overparameterization?

    There are various ways to deal with overparameterization in the model

    Y = β0 + β1DO + β2DA + β3DB + β4DAB + ε.

    Some common ways include:

    - dropping the intercept (i.e., letting β0 = 0)
    - dropping one dummy variable, e.g., DO (i.e., letting β1 = 0)
      - The category whose dummy variable is dropped is called the baseline. If DO is dropped, the baseline is blood type O
    - letting β1 + β2 + β3 + β4 = 0

  • When the Intercept Is Dropped...

    Dropping the intercept β0, the coefficients of the dummy variables become the mean responses E(Y) of the corresponding blood types:

    Y = β1DO + β2DA + β3DB + β4DAB + ε

    blood type   E(Y)
    O            β1
    A            β2
    B            β3
    AB           β4

  • When One of the Dummy Variables Is Dropped...

    If one of the dummy variables, say DO, is dropped, the model becomes

    Y = β0 + β2DA + β3DB + β4DAB + ε,

    and the mean responses E(Y) for the 4 blood types are

    blood type   E(Y)
    O            β0
    A            β0 + β2
    B            β0 + β3
    AB           β0 + β4

    - The mean of Y under the baseline (blood type O) is β0
    - The mean of Y for blood type A is β0 + β2
    - One can compare the means of Y for blood types A and O by testing β2 = 0
    - This parametrization is useful for comparing categories with the baseline category

  • Choice of the Baseline Category Can Be Arbitrary

    If blood type O is the baseline:

    Y = β0 + β2DA + β3DB + β4DAB + ε

    blood type   E(Y)
    O            β0
    A            β0 + β2
    B            β0 + β3
    AB           β0 + β4

    If blood type A is the baseline:

    Y = β′0 + β′1DO + β′3DB + β′4DAB + ε

    blood type   E(Y)
    O            β′0 + β′1
    A            β′0
    B            β′0 + β′3
    AB           β′0 + β′4

    The 2 models are equivalent in the sense that they give identical group means:

    β0 = β′0 + β′1
    β0 + β2 = β′0
    β0 + β3 = β′0 + β′3
    β0 + β4 = β′0 + β′4

  • Example: Salary Survey

    S X E M

    13876 1 1 1

    11608 1 3 0

    18701 1 3 1

    11283 1 2 0

    11767 1 3 0

    20872 2 2 1

    11772 2 2 0

    10535 2 1 0

    12195 2 3 0

    12313 3 2 0

    14975 3 1 1

    21371 3 2 1

    19800 3 3 1

    11417 4 1 0

    20263 4 3 1

    13231 4 3 0

    12884 4 2 0

    13245 5 2 0

    13677 5 3 0

    ...

    19346 20 1 0

    S = Salary
    X = Experience, in years
    E = Education (1 if H.S. only, 2 if Bachelor’s only, 3 if Advanced degree)
    M = Management Status (1 if manager, 0 if non-manager)

  • Example: Salary Survey — Coding Variables (1)

    Let’s first consider the effect of experience (X) and education (E) on employees’ salary (S), ignoring the effect of management status.

    - Experience (X): numerical
    - Education (E): qualitative, 3 categories, so we need 3 dummy variables:

      Ei1 = 1 if the ith person has a high school diploma only, and 0 otherwise
      Ei2 = 1 if the ith person has a Bachelor’s degree only, and 0 otherwise
      Ei3 = 1 if the ith person has an advanced degree, and 0 otherwise

    - Model 1: S = β0 + βX + δ1E1 + δ2E2 + δ3E3 + ε

  • Example: Salary Survey — Coding Variables

    Model 1: S = β0 + δ1E1 + δ2E2 + δ3E3 + βX + ε

    This model is overparameterized; we need a constraint.

    If we drop the intercept (i.e., let β0 = 0), then

    S = δ1 + βX + ε   if H.S. only
        δ2 + βX + ε   if Bachelor’s only
        δ3 + βX + ε   if advanced degree

    In this parametrization, δ1, δ2, δ3 represent the 3 different intercepts of the regression lines of S on X at the 3 education levels.

    Often, we are interested in comparisons between categories, e.g.,

    - whether Bachelors earn more than H.S. graduates on average, i.e., whether δ2 > δ1

  • Example: Salary Survey — Coding Variables

    If instead we drop the dummy variable E2 for Bachelors, i.e., use the Bachelor’s degree as the baseline, then

    S = β0 + δ1 + βX + ε   if H.S. only
        β0 + βX + ε        if Bachelor’s only
        β0 + δ3 + βX + ε   if advanced degree

    This parametrization is convenient for comparisons between categories.

    - One can test whether Bachelors earn more than H.S. graduates by testing δ1 < 0, and test whether an advanced degree increases salary by testing δ3 > 0

  • Example: Salary Survey — Regression Fit (1)

    > salary = read.table("SalarySurvey.txt", header=TRUE)

    > lm1a = lm(S ~ E+X, data = salary)

    > summary(lm1a)

    (... Part of the R output is omitted)

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 8279.9 1814.6 4.563 4.18e-05 ***

    E 2418.4 706.9 3.421 0.00138 **

    X 560.8 105.8 5.299 3.78e-06 ***

    ---

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3604 on 43 degrees of freedom

    Multiple R-squared: 0.4422, Adjusted R-squared: 0.4163

    F-statistic: 17.05 on 2 and 43 DF, p-value: 3.538e-06

    Something wrong?


  • Example: Salary Survey — Regression Fit (2)

    Let’s check the model matrix.

    > model.matrix(lm1a)

    (Intercept) E X

    1 1 1 1

    2 1 3 1

    3 1 3 1

    4 1 2 1

    5 1 3 1

    6 1 2 2

    7 1 2 2

    8 1 1 2

    ...(omitted)

    45 1 2 17

    46 1 1 20

    attr(,"assign")

    [1] 0 1 2

    R treats E (education) as a numerical variable taking values 1, 2, and 3, not a categorical one.


  • Example: Salary Survey — Numerical or Categorical?

    If one treats E (education) as a numerical variable taking values 1, 2, and 3, the model becomes

    Model 2: S = β0 + βX + δE + ε.

    But Model 2 has a different implication from Model 1: it implies that, on average,

    - a Bachelor’s degree increases salary by δ (over H.S. only);
    - a Bachelor’s degree plus an advanced degree increases salary by 2δ.

    That is, the salary bonus for completing college is exactly as much as the bonus for completing an advanced degree . . . unrealistic and too restrictive.

    Treating E as a categorical variable allows the salary bonuses for a Bachelor’s degree and for an advanced degree to be different.

    Remark: Model 2 is nested in Model 1 (Why?).
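    Since the two models are nested, they can be compared with a partial F-test; a minimal sketch (the object names are ours), assuming the salary data frame is loaded as above:

    lm2num <- lm(S ~ X + E, data = salary)             # Model 2: E as a number
    lm1fac <- lm(S ~ X + as.factor(E), data = salary)  # Model 1: E as categories
    anova(lm2num, lm1fac)   # partial F-test of Model 2 (reduced) vs Model 1 (full)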

  • Example: Salary Survey — Regression Fit (3)

    > salary$E = as.factor(salary$E)

    > lm1 = lm(S ~ E+X, data = salary)

    > summary(lm1)

    (... Part of the R output is omitted)

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 10474.3 1305.4 8.024 5.19e-10 ***

    E2 3221.1 1275.8 2.525 0.01544 *

    E3 4780.1 1422.7 3.360 0.00167 **

    X 548.6 107.6 5.100 7.69e-06 ***

    ---

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3622 on 42 degrees of freedom

    Multiple R-squared: 0.4498, Adjusted R-squared: 0.4104

    F-statistic: 11.44 on 3 and 42 DF, p-value: 1.291e-05

    - The command as.factor(E) tells R that E is categorical
    - By default, R uses the lowest level (E = 1) as the baseline

  • Example: Salary Survey — Regression Fit (4)

    Let’s check the model matrix of Model 1.

    > model.matrix(lm1)

    (Intercept) X E2 E3

    1 1 1 0 0

    2 1 1 0 1

    3 1 1 0 1

    4 1 1 1 0

    5 1 1 0 1

    6 1 2 1 0

    (... omitted...)

    45 1 17 1 0

    46 1 20 0 0

    attr(,"assign")

    [1] 0 1 2 2

    attr(,"contrasts")

    attr(,"contrasts")$E

    [1] "contr.treatment"

    Now R knows E is categorical, creates 2 dummy variables E2 and E3, and treats H.S. diploma (E = 1) as the baseline.

  • Example: Salary Survey — Interpreting Coefficients

    From the output of Model 1, the predicted salary is

    Ŝ = 10474.3 + 548.6X + 3221.1E2 + 4780.1E3.

    This model implies that, on average:

    - each extra year of experience is worth $548.6;
    - completing college increases salary by $3221.1;
    - completing college plus an advanced degree increases salary by $4780.1.

    All 3 coefficients above are significantly different from 0 (P-values < 5%).

    - What if we want to compare Bachelors with advanced degree holders?

  • Example: Salary Survey — Changing Baseline (1)

    If we are not happy with the baseline category R chooses, say we want E = 2 (Bachelor’s degree) to be the baseline, we can either manually create the dummy variables E1 and E3

    > salary$E1 = as.integer(salary$E==1)

    > salary$E3 = as.integer(salary$E==3)

    > lm1b = lm(S ~ X + E1 + E3, data = salary)

    or use the command relevel()

    > salary$E = relevel(salary$E, ref = "2")

    > lm1c = lm(S ~ X + E,data=salary)

    Both will fit Model 1 using E = 2 as the baseline. See the R outputs on the next page.

    Conclusion: looking at the coefficient for E3 on the next page, we can conclude that advanced degree holders do NOT earn significantly more than Bachelors (P-value ≈ 0.25).

  • Example: Salary Survey — Changing Baseline (2)

    > summary(lm1b)

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 13695.4 1225.0 11.180 3.63e-14 ***

    X 548.6 107.6 5.100 7.69e-06 ***

    E1 -3221.1 1275.8 -2.525 0.0154 *

    E3 1559.0 1338.6 1.165 0.2507

    ---

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3622 on 42 degrees of freedom

    Multiple R-squared: 0.4498, Adjusted R-squared: 0.4104

    F-statistic: 11.44 on 3 and 42 DF, p-value: 1.291e-05

    > summary(lm1c)

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 13695.4 1225.0 11.180 3.63e-14 ***

    X 548.6 107.6 5.100 7.69e-06 ***

    E1 -3221.1 1275.8 -2.525 0.0154 *

    E3 1559.0 1338.6 1.165 0.2507

    ---

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3622 on 42 degrees of freedom

    Multiple R-squared: 0.4498, Adjusted R-squared: 0.4104

    F-statistic: 11.44 on 3 and 42 DF, p-value: 1.291e-05


  • What If We Want To Drop The Intercept?

    > lm1e = lm(S ~ -1 + X + E, data = salary)

    > summary(lm1e)

    Estimate Std. Error t value Pr(>|t|)

    X 548.6 107.6 5.100 7.69e-06 ***

    E2 13695.4 1225.0 11.180 3.63e-14 ***

    E1 10474.3 1305.4 8.024 5.19e-10 ***

    E3 15254.4 1167.8 13.062 < 2e-16 ***

    ---

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3622 on 42 degrees of freedom

    Multiple R-squared: 0.9626, Adjusted R-squared: 0.959

    F-statistic: 270.1 on 4 and 42 DF, p-value: < 2.2e-16

    This fits the model

    S = δ1 + βX + ε   if H.S. only
        δ2 + βX + ε   if Bachelor’s only
        δ3 + βX + ε   if advanced degree

    with δ̂1 = 10474.3, δ̂2 = 13695.4, δ̂3 = 15254.4, and β̂ = 548.6.
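    A quick sanity check (ours, assuming lm1 and lm1e are fitted as above): dropping the intercept is only a reparametrization of the same model, so the two fits produce identical fitted values. (The much larger R-squared reported for lm1e is an artifact: R computes R-squared without centering when the intercept is omitted.)

    all.equal(fitted(lm1), fitted(lm1e))   # TRUE: same model, different parametrization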

  • What About the Sum-to-Zero Constraint δ1 + δ2 + δ3 = 0?

    For the salary example,

    S = β0 + δ1E1 + δ2E2 + δ3E3 + βX + ε
      = β0 + δ1 + βX + ε   if H.S. only
        β0 + δ2 + βX + ε   if Bachelor’s only
        β0 + δ3 + βX + ε   if advanced degree

    the sum-to-zero constraint δ1 + δ2 + δ3 = 0 is NOT intuitive; under it, the coefficients δ1, δ2, δ3 and β0 have NO natural interpretations.

    Nonetheless, the sum-to-zero constraint will exhibit its power in factorial designs, in which two or more treatment factors are administered in an experiment.

    We will come back to this in Chapter 8.
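    For reference, a minimal sketch (ours) of imposing the sum-to-zero constraint in R via sum contrasts, assuming salary$E is already a factor as above:

    lmsz <- lm(S ~ X + E, data = salary, contrasts = list(E = "contr.sum"))
    coef(lmsz)   # two education effects are reported; the third is minus their sum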

  • Interaction Between Categorical and Numerical Variables

    Regardless of which constraint is used, the model

    S = β0 + δ1E1 + δ2E2 + δ3E3 + βX + ε
      = β0 + δ1 + βX + ε   if H.S.
        β0 + δ2 + βX + ε   if B.A. or B.S.
        β0 + δ3 + βX + ε   if advanced

    assumes a constant effect of experience X on salary S (the slope β) across all education levels, which can be unrealistic.

    - If the effect of a variable on the response changes with the level of another variable, we say the effects of the two variables interact; if not, we say their effects are additive.
    - e.g., the model above assumes the effects of education (E) and experience (X) on salary are additive.

    How can we write an MLR model in which the slope of X changes with education level?

  • Interaction Between Categorical and Numerical Variables

    Consider the model

    S = β0 + δ1E1 + δ2E2 + δ3E3 + βX + γ1(E1 · X) + γ2(E2 · X) + γ3(E3 · X) + ε.

    Here (E1 · X) means the product of the variables E1 and X. Then

    S = β0 + δ1 + (β + γ1)X + ε   if H.S.
        β0 + δ2 + (β + γ2)X + ε   if B.A. or B.S.
        β0 + δ3 + (β + γ3)X + ε   if advanced

    Again, the model is overparameterized. We need one additional constraint on β and the γ’s. Some common constraints are

    - β = 0
    - γ1 = 0 (or γ2 = 0, or γ3 = 0)
    - γ1 + γ2 + γ3 = 0

  • If one uses H.S. diploma as the baseline, i.e., lets δ1 = 0 and γ1 = 0, then

    S = β0 + δ2E2 + δ3E3 + βX + γ2(E2 · X) + γ3(E3 · X) + ε
      = β0 + βX + ε                 if H.S.
        β0 + δ2 + (β + γ2)X + ε     if B.A. or B.S.
        β0 + δ3 + (β + γ3)X + ε     if advanced

    Then γ2 is the extra salary per year of experience for completing college, and γ3 is that for getting an advanced degree.

  • Fitting Models with Interaction in R

    In R, the term X:E in a model formula represents the interaction terms of X and E; X*E is shorthand for X + E + X:E.

    By default, R uses the lowest level E = 1 (H.S. diploma) as the baseline; since we releveled E to 2 earlier, we first set the reference back to 1.

    > salary$E = relevel(salary$E, ref = "1")

    > lm2 = lm(S ~ X+E+X:E, data = salary)

    > summary(lm2)

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 13760.2 1543.8 8.913 4.78e-11 ***

    X 540.9 157.1 3.443 0.00136 **

    E2 -1461.2 2326.4 -0.628 0.53352

    E3 -562.9 2216.0 -0.254 0.80077

    X:E2 -216.3 238.6 -0.907 0.37005

    X:E3 379.2 275.4 1.377 0.17625

    Neither γ2 nor γ3 is significantly different from 0 (P-values 0.37 and 0.18).
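    To make the interpretation concrete, the fitted slopes of salary on experience by education level, computed from the coefficients above, are:

    H.S.: β̂ = 540.9;  B.A. or B.S.: β̂ + γ̂2 = 540.9 − 216.3 = 324.6;  advanced: β̂ + γ̂3 = 540.9 + 379.2 = 920.1.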

  • Test for Interaction

    Testing whether the effect of experience on salary changes with education level is equivalent to testing

    H0 : γ1 = γ2 = γ3.

    That is, it compares the full model and the reduced model below:

    S = β0 + δ1E1 + δ2E2 + δ3E3 + βX + γ1(E1 · X) + γ2(E2 · X) + γ3(E3 · X) + ε   (full)
    S = β0 + δ1E1 + δ2E2 + δ3E3 + βX + ε   (reduced)

    > lm1 = lm(S ~ X+E, data = salary)

    > lm2 = lm(S ~ X+E+X:E, data = salary)

    > anova(lm1,lm2)

    Analysis of Variance Table

    Model 1: S ~ X + E

    Model 2: S ~ X + E + X:E

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 42 550853135

    2 40 497897342 2 52955792 2.1272 0.1325

    The interaction is not significant.
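    As a check, the F statistic in the table is the usual full-vs-reduced comparison:

    F = (Sum of Sq / Df) / (RSS_full / Res.Df_full) = (52955792 / 2) / (497897342 / 40) ≈ 2.1272,

    in agreement with the anova() output.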

  • Interaction Between Two Categorical Variables

    Now let’s take another categorical variable, management status (M), into account:

    M = 1 if manager, 0 if non-manager.

    Since M is a categorical variable, just like E, we should create dummy variables M0 and M1 for the two categories and consider the model

    S = β0 + α0M0 + α1M1 + δ1E1 + δ2E2 + δ3E3 + βX + ε.

    However, we don’t need both M0 and M1, since M0 + M1 = 1 and the model is again overparameterized. We can drop one of M0 and M1, and one of E1, E2, and E3. So we drop M0 and E1 and consider the model

    S = β0 + α1M1 + δ2E2 + δ3E3 + βX + ε.

  • Interaction Between Two Categorical Variables

    S = β0 + α1M1 + δ2E2 + δ3E3 + βX + ε.

    This model implies that, on average:

    - managers earn α1 more than non-managers;
    - completing college increases salary by δ2;
    - completing college plus an advanced degree increases salary by δ3.

    However, the model above assumes the effect of management status on salary does not change with education level. Thus we may consider the following model with management-status-by-education interactions:

    S = β0 + α1M1 + δ2E2 + δ3E3 + θ2(M1 · E2) + θ3(M1 · E3) + βX + ε.

  • Interaction Between Two Categorical Variables in R

    No interaction

    > lm3 = lm(S ~ X+E+M, data = salary)

    > summary(lm3)

    (... omitted ... )

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 8035.60 386.69 20.781 < 2e-16 ***

    X 546.18 30.52 17.896 < 2e-16 ***

    E2 3144.04 361.97 8.686 7.73e-11 ***

    E3 2996.21 411.75 7.277 6.72e-09 ***

    M 6883.53 313.92 21.928 < 2e-16 ***

    ---

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1027 on 41 degrees of freedom

    Multiple R-squared: 0.9568, Adjusted R-squared: 0.9525

    F-statistic: 226.8 on 4 and 41 DF, p-value: < 2.2e-16


  • Interaction Between Two Categorical Variables in R

    With interaction

    > lm4 = lm(S ~ X+E+M+E:M, data = salary)

    > summary(lm4)

    (... omitted ... )

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 9472.685 80.344 117.90
    (... rest of the output omitted ...)

  • Interaction Between Two Categorical Variables in R

    Test of interaction

    > anova(lm3,lm4)

    Analysis of Variance Table

    Model 1: S ~ X + E + M

    Model 2: S ~ X + E + M + E:M

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 41 43280719

    2 39 1178168 2 42102552 696.84 < 2.2e-16 ***

    This time, the interaction between education level and management status is highly significant (P < 2.2e-16).
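    Since the E:M interaction is significant, the management premium differs by education level. A closing sketch (ours, assuming lm4 is fitted as above) tabulates the fitted salaries for every education-by-management combination at, say, X = 10 years of experience:

    newd <- expand.grid(E = levels(salary$E), M = c(0, 1), X = 10)
    cbind(newd, fitted = round(predict(lm4, newdata = newd)))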
