Ch. 4 Multiple Linear Regression Models
• Notation
• Estimation
• Confidence Intervals, Testing and Prediction
• Extra Sums of Squares and Multiple Testing
• Analysis of Variance
• Weighted Least-Squares and Generalized Least-Squares
(To do some of the R calculations, you will need the functions in Ch4.R.)
1
Notation
• Start with simple linear regression
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \text{ (i.i.d.)}$$
• Extension: add more variables on the right, i.e. more explanatory variables.
– e.g. polynomial regression
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_p x_i^p + \varepsilon_i$$
– e.g. additional variables
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$
2
Matrix form
$$y = X\beta + \varepsilon$$
where $y$ is the column vector of responses:
$$y = [y_1\ y_2\ \cdots\ y_n]^T$$
($T$ denotes matrix transpose), and
$$\beta = [\beta_0\ \beta_1\ \cdots\ \beta_k]^T, \qquad \varepsilon = [\varepsilon_1\ \varepsilon_2\ \cdots\ \varepsilon_n]^T$$
3
X is an $n \times (k+1)$ matrix (called the model matrix or design matrix):
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}$$
• e.g. regression through the origin: $y_i = \beta_1 x_i + \varepsilon_i$ is $y = X\beta + \varepsilon$ with
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \quad \beta = [\beta_1]$$
e.g. Simple linear regression: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ is $y = X\beta + \varepsilon$ with
$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$
4
e.g. Quadratic regression: $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ is $y = X\beta + \varepsilon$ with
$$X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$$
5
e.g. Regression with 2 predictor variables: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$ is $y = X\beta + \varepsilon$ with
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$$
6
Exercises
Find the model matrix X for each of the following.
1. $y_i = \mu + \varepsilon_i$
2. $y_{ij} = \mu_i + \varepsilon_{ij}$, $i = 1, 2$, $j = 1, 2, 3$.
[Hint: $\beta = [\mu_1\ \mu_2]^T$ and $y = [y_{11}\ y_{12}\ \cdots\ y_{23}]^T$.]
(This is an example of a 1-way ANOVA model.)
3. $y_{ij} = \mu_i + \beta x_{ij} + \varepsilon_{ij}$, $i = 1, 2$, $j = 1, 2, 3$.
(This is an example of an analysis of covariance model.)
4. $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i1}x_{i2} + \varepsilon_i$
5. $y_i = \beta_0 + \beta_1\cos(x_i) + \beta_2\sin(x_i) + \varepsilon_i$
6. $y_i = \beta_1 B_1(x_i) + \beta_2 B_2(x_i) + \beta_3 B_3(x_i) + \beta_4 B_4(x_i) + \varepsilon_i$, where $B_1$, $B_2$, $B_3$ and $B_4$ are given real-valued functions.
7
4.2 Least-Squares Estimation
• Differentiation w.r.t. Vectors
– Suppose $c = [c_1\ c_2\ \cdots\ c_k]^T$ and $x = [x_1\ x_2\ \cdots\ x_k]^T$. Differentiate
$$f(x) = c^Tx = \sum_{i=1}^k c_i x_i$$
with respect to $x$. The partial derivatives with respect to $x_1, x_2, \dots, x_k$ are $c_1, c_2, \dots, c_k$, so the derivative is a vector (called the gradient):
$$f'(x) = c^T$$
8
Example
• Suppose $B$ is a symmetric $k \times k$ matrix. Differentiate
$$f(x) = x^TBx$$
with respect to $x$.
• The gradient is $f'(x) = 2x^TB$.
9
Estimation of $\beta$
$$y = X\beta + \varepsilon \qquad y - X\beta = \varepsilon$$
• Minimize the sum of squares of the errors,
$$L = \varepsilon^T\varepsilon = (y - X\beta)^T(y - X\beta),$$
with respect to $\beta$.
10
Estimation of $\beta$ (cont'd)
• Differentiate w.r.t. $\beta$:
$$L = y^Ty - \beta^TX^Ty - y^TX\beta + \beta^TX^TX\beta = y^Ty - 2\,y^TX\beta + \beta^TX^TX\beta$$
$$L'(\beta) = -2y^TX + 2\beta^TX^TX$$
($X^TX$ is symmetric.)
11
Estimation of $\beta$ (cont'd)
• Set $L' = 0$:
$$\hat\beta^TX^TX = y^TX \qquad\text{or}\qquad X^TX\hat\beta = X^Ty$$
$$\hat\beta = (X^TX)^{-1}X^Ty$$
provided $X^TX$ has an inverse (the columns of $X$ must be linearly independent, and the number of columns cannot exceed the number of rows).
12
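The closed form above is easy to check numerically. The following is a minimal sketch (not from the notes; the data and names are made up) in Python with NumPy, comparing the normal-equations solution with a dedicated least-squares solver:

```python
import numpy as np

# Made-up data: n = 20 observations, intercept plus k = 2 predictors.
rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # model matrix
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X'X) betahat = X'y.
betahat = np.linalg.solve(X.T @ X, X.T @ y)

# A least-squares routine gives the same answer, and is numerically
# preferable when X'X is ill-conditioned.
betahat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Inverting (or solving with) $X^TX$ is fine for illustration; in practice QR-based routines such as R's `qr.solve(X, y)` or NumPy's `lstsq` are more stable.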
Fitted Values
$$\hat{y} = X\hat\beta = X(X^TX)^{-1}X^Ty = Hy$$
where
$$H = X(X^TX)^{-1}X^T$$
is the so-called hat matrix.
13
Residuals
$$e = \hat\varepsilon = y - \hat{y} = y - Hy = (I - H)y$$
14
Geometric Interpretation
• The goal of least squares is to identify values of the coefficients $\beta$ which make $\hat{y} = X\beta$ as close as possible to $y$.
• Vectors of the form $y$ lie in $n$-dimensional space, while vectors of the form $X\beta$ are linear combinations of the $p$ $n$-vectors comprising the columns of $X$: a $p$-dimensional subspace (provided that the columns are linearly independent).
• Least squares amounts to finding the vector in this subspace which is closest to $y$.
• The orthogonal projection of $y$ onto the subspace is the minimizing vector: i.e. set the inner product between each of the columns of $X$ and $y - \hat{y}$ to 0:
$$X^T(y - \hat{y}) = 0$$
15
This ensures that the vector $y - \hat{y}$ is perpendicular to the subspace.
• Of course, $\hat{y}$ must be of the form $X\hat\beta$, so we must have
$$X^T(y - X\hat\beta) = 0$$
so that
$$\hat\beta = (X^TX)^{-1}X^Ty \quad\text{and}\quad X\hat\beta = X(X^TX)^{-1}X^Ty = Hy$$
• The hat matrix is an example of what is called an orthogonal projector, satisfying $H = H^T$ and $H = H^2$. This last property ensures that the projection of a vector already in the $p$-dimensional subspace lands back in that subspace:
$$H(Hy) = H^2y = Hy.$$
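The projector properties of $H$ can be verified numerically. A small illustrative sketch (made-up data, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'

y = rng.normal(size=n)
resid = (np.eye(n) - H) @ y             # residual vector (I - H) y
```

The tests below confirm symmetry, idempotence, and that the residuals are orthogonal to the columns of $X$.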
Properties of Least-Squares Estimates
• Model:
$$y = X\beta + \varepsilon$$
where $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = E[\varepsilon\varepsilon^T] = \sigma^2 I$.
• Unbiasedness:
$$\hat\beta = (X^TX)^{-1}X^Ty = (X^TX)^{-1}X^T(X\beta + \varepsilon) = \beta + (X^TX)^{-1}X^T\varepsilon$$
so $E[\hat\beta] = \beta$, since $E[\varepsilon] = 0$.
16
Properties of Least-Squares Estimates (cont'd)
• Variance:
$$\mathrm{Var}(\hat\beta) = E[(\hat\beta - \beta)(\hat\beta - \beta)^T] = E\Big[\big((X^TX)^{-1}X^T\varepsilon\big)\big((X^TX)^{-1}X^T\varepsilon\big)^T\Big] = (X^TX)^{-1}X^TE[\varepsilon\varepsilon^T]X(X^TX)^{-1} = \sigma^2(X^TX)^{-1}$$
• $\hat\beta$ is also the m.l.e. in the case where the noise is normally distributed, and it is itself normally distributed in that case.
• $\hat\beta$ is approximately normal in general.
• $\hat\beta$ is the Best Linear Unbiased Estimator for $\beta$: the Gauss-Markov Theorem.
17
Gauss-Markov Theorem
• Among all estimators of the scalar quantity $\ell^T\beta$ having the form $q^Ty$, where $\ell$ is a fixed vector of length $p$ and
$$E[q^Ty] = \ell^T\beta,$$
the variance of $q^Ty$ is smallest when
$$q^Ty = \ell^T\hat\beta$$
18
Proof
Suppose
$$q^T = \ell^T(X^TX)^{-1}X^T + \Delta^T$$
Then
$$E[q^Ty] = \ell^T\beta + \Delta^TX\beta$$
In order for $q^Ty$ to be an unbiased estimator of $\ell^T\beta$, we must have $\Delta^TX\beta = 0$. Note that this must hold for all possible values of $\beta$ (i.e. any $p \times 1$ vector), which forces $\Delta^TX = 0$; transposing, $X^T\Delta = 0$ as well.
19
Proof (cont'd)
Now, let us determine the value of $\Delta$ which minimizes the variance of $q^Ty = \sum_{i=1}^n q_iy_i$:
$$\mathrm{Var}(q^Ty) = \sigma^2 q^Tq = \sigma^2\Big(\ell^T(X^TX)^{-1}\ell + \Delta^TX(X^TX)^{-1}\ell + \ell^T(X^TX)^{-1}X^T\Delta + \Delta^T\Delta\Big) = \sigma^2\ell^T(X^TX)^{-1}\ell + \sigma^2\Delta^T\Delta$$
since $(X^TX)^{-1}\ell$ is a $p \times 1$ vector and $\Delta^TX = 0$, so that $\Delta^TX[(X^TX)^{-1}\ell] = 0$.
Thus, $\mathrm{Var}(q^Ty)$ is minimized when $\Delta^T\Delta = 0$, i.e. $\Delta = 0$.
20
Estimation of $\sigma^2$
$$\hat\sigma^2 = \mathrm{MSE} = \frac{1}{n - \#\text{parameters}}\sum e_i^2 = \frac{1}{n-p}\,e^Te = \frac{1}{n-p}\,y^T(I-H)^2y = \frac{1}{n-p}\,y^T(I-H)y$$
where $p = k + 1$, and $H$ is symmetric and idempotent ($H^2 = H$).
21
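The two expressions for MSE agree, and on simulated data with known $\sigma^2$ the estimate lands near the truth. A sketch (made-up data with true $\sigma^2 = 1$; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)  # true sigma^2 = 1

H = X @ np.linalg.solve(X.T @ X, X.T)
e = y - H @ y                          # residuals
mse_from_resid = e @ e / (n - p)       # (1/(n-p)) e'e
mse_quadform = y @ (np.eye(n) - H) @ y / (n - p)   # (1/(n-p)) y'(I-H)y
```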
L-S Estimation - R Example
• The dataframe litters consists of brain weights and body weights of 20 mice. The size of the litter in which each mouse was born is also recorded.
library(DAAG)
data(litters)
litters
>
lsize bodywt brainwt
1 3 9.45 0.444
2 3 9.78 0.436
.......................
20 12 6.05 0.401
22
Look at all pairwise relationships
pairs(litters, pch=16)
[Figure: scatterplot matrix of lsize, bodywt, and brainwt]
23
Observations
– Body weight decreases with litter size.
– Brain weight decreases with litter size.
– Brain weight increases with body weight.
• In order to find out how brain weight relates to both body weight and litter size, we can use the following model:
brainwt = β0 + β1bodywt + β2lsize + ε
24
Fitting the model in R
litters.lm <- lm(brainwt ˜ bodywt + lsize,
data = litters)
summary(litters.lm)
>
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.17825    0.07532    2.37   0.0301
bodywt       0.02431    0.00678    3.59   0.0023
lsize        0.00669    0.00313    2.14   0.0475

Residual standard error: 0.012 on 17 degrees of freedom
25
Example (cont’d)
• The fitted model is then
$$\hat{y} = .18 + .024x_1 + .0067x_2$$
where $y$ is brain weight, $x_1$ is body weight and $x_2$ is litter size. The error variance estimate is $.012^2 = .000144$.
• Note that this fitted model says that for a fixed body weight, brain weight is actually higher for larger litters. This is consistent with what is known as 'brain sparing': nutritional deprivation that results from large litter sizes has a proportionately smaller effect on brain weight than on body weight.
26
Details of the L-S calculations
X <- model.matrix(litters.lm) # X matrix
X
>
(Intercept) bodywt lsize
1 1 9.45 3
...........................
20 1 6.05 12
27
Details (cont’d)
XX <- crossprod(X,X)    # X'X
XX
>
            (Intercept) bodywt lsize
(Intercept)          20    155   150
bodywt              155   1236  1089
lsize               150   1089  1290

XXinv <- qr.solve(XX)   # calculates inverse of X'X
XXinv
>
      [,1]   [,2]    [,3]
[1,] 39.71 -3.556 -1.6144
[2,] -3.56  0.322  0.1419
[3,] -1.61  0.142  0.0687
28
Details (cont’d)
# Alternative:
XXinv <- summary(litters.lm)$cov.unscaled
y <- litters$brainwt
Xy<- crossprod(X, y)
Xy # X’y:
>
[,1]
(Intercept) 8.33
bodywt 64.95
lsize 61.85
29
Details (cont’d)
betahat <- XXinv%*%Xy # betahat=(X’X)ˆ(-1) X’y
betahat # coefficient estimates
>
[,1]
[1,] 0.17825
[2,] 0.02431
[3,] 0.00669
# Best Alternative (Most Stable)
betahat <- qr.solve(X, y)
betahat
>
(Intercept) bodywt lsize
0.178246962 0.024306344 0.006690331
30
Details (cont’d)
# Calculation of fitted values:
yhat <- crossprod(t(X), betahat)
yhat
>
[,1]
1 0.428
2 0.436
.........
20 0.406
31
Details (cont’d)
SSE <- crossprod(y, y) - crossprod(y, yhat)
# SSE = y’(I-H)y = y’y - y’yhat
SSE
>
[,1]
[1,] 0.00243
MSE <- SSE/(length(y)-3)
MSE # error variance estimate
>
[,1]
[1,] 0.000143
32
Allometry
• An allometric growth model is most appropriate for modeling the relation between brainwt and bodywt:
$$\text{brainwt} = e^{\beta_0 + \varepsilon}\,\text{bodywt}^{\beta_1}$$
i.e.
$$\log(\text{brainwt}) = \beta_0 + \beta_1\log(\text{bodywt}) + \varepsilon$$
where $\varepsilon$ is $N(0, \sigma^2)$.
litters.lm <- lm(log(brainwt) ˜ log(bodywt),
data = litters)
summary(litters.lm)
Coefficients:
Estimate Std. Error t value
(Intercept) -1.2835 0.0814 -15.76
log(bodywt) 0.2004 0.0399 5.02
33
Allometry (cont’d)
• As expected, larger brains are associated with larger bodies, but the relation is not linear. The hypothesis of interest here is $\beta_1 = 1$, not $\beta_1 = 0$, so we should ignore the t-value and p-value given in the default output. Instead, we may be interested in the following test:
$$H_0: \beta_1 = 1 \qquad H_1: \beta_1 \neq 1$$
$$t_0 = \frac{\hat\beta_1 - 1}{\text{s.e.}(\hat\beta_1)} = \frac{.2 - 1}{.0399} = -20.1$$
pt(-20.1, 18)  # one-tailed area: 4.42e-14 (double it for the two-sided p-value)
• Conclusion: if the allometric model assumptions hold, the exponent is not 1.
34
Allometry (cont’d)
• However, the linear model may be a good approximation. The following code allows for comparison of the two fitted models (see the next figure):
plot(brainwt ˜ bodywt, data = litters, pch=16)
litters.lm <- lm(brainwt ˜ bodywt, data = litters)
abline(litters.lm,lwd=2)
litters.lm <- lm(log(brainwt) ˜ log(bodywt),
data = litters)
coeffs <- coef(litters.lm)
MSE <- summary(litters.lm)$sigma^2
x <- seq(min(litters$bodywt), max(litters$bodywt), length = 100)
lines(x, x^(coeffs[2]) * exp(coeffs[1] + MSE/2), col = 2, lwd = 2)
35
Allometry (cont’d)
Which model is more accurate? Does it make a real difference in this case?
# alternative: litters.comparison()
36
[Figure: brainwt vs bodywt for the litters data, with the fitted allometric model and linear model curves]
4.3 Confidence Intervals and Hypothesis Testing
• Confidence Intervals and Tests for $\beta_j$, $j = 0, 1, \dots, k$
– Recall that
$$\mathrm{Var}(\hat\beta) = \sigma^2(X^TX)^{-1}$$
– Let $c_{jj}$ be the $j$th diagonal element of $(X^TX)^{-1}$.
– Then the variance of $\hat\beta_j$ is
$$\mathrm{Var}(\hat\beta_j) = \sigma^2 c_{jj}$$
– An estimate of the standard error of $\hat\beta_j$ is
$$\text{s.e.}(\hat\beta_j) = \sqrt{\widehat{\mathrm{Var}}(\hat\beta_j)} = \sqrt{\mathrm{MSE}\,c_{jj}}$$
37
Confidence Intervals and Tests (cont'd)
• Since $\hat\beta_j$ has a normal distribution and $\mathrm{SSE}/\sigma^2$ has a chi-squared distribution on $n - k - 1$ degrees of freedom,
$$\frac{\hat\beta_j - \beta_j}{\sqrt{\mathrm{MSE}\,c_{jj}}} \sim t_{n-k-1}$$
• A $(1 - \alpha)$ confidence interval for $\beta_j$ is
$$\hat\beta_j \pm t_{n-k-1,\,\alpha/2}\sqrt{\mathrm{MSE}\,c_{jj}}$$
38
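The interval formula can be applied directly to reported summary output. A sketch (in Python, for illustration) recomputing a 95% confidence interval for the bodywt coefficient in the litters fit shown earlier, using the reported estimate and standard error; the critical value $t_{17,.025} \approx 2.11$ is hard-coded from tables:

```python
# Estimate and standard error for bodywt, from the litters summary above.
beta_hat, se = 0.02431, 0.00678
t_crit = 2.11          # t_{17, .025}, from tables

lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
```

Both endpoints are positive, consistent with the small p-value (.0023) reported for bodywt.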
Example
• log.hills
* 35 observations taken on the winning times to run the Scottish hill races
* predictor variables: log.climb, log.dist
* response: log.time
data(hills)
log.hills <- log(hills)
names(log.hills) <- c("log.dist",
"log.climb", "log.time")
39
Example (cont’d)
log.hills
>
log.dist log.climb log.time
1 0.88 6.5 -1.317
2 1.79 7.8 -0.216
..............................
35 3.00 8.5 0.980
• Fit a linear regression model to these data and test whether the coefficient of log.climb differs from 0. Find a 95% confidence interval for this coefficient.
40
Solution
• The model matrix $X$ and response vector $y$ are

X:                 y:
1  0.88  6.5       -1.317
1  1.79  7.8       -0.216
.............      ......
1  3.00  8.5        0.980

$X^Ty$:
  1(-1.317) +    1(-.216) + ... +    1(.980) = -10.7
.88(-1.317) + 1.79(-.216) + ... + 3.00(.980) = -6.9
6.5(-1.317) +  7.8(-.216) + ... +  8.5(.980) = -62.5
41
Solution (cont’d)
$$X^TX = \begin{bmatrix} 35 & 64 & 251 \\ 64 & 129 & 471 \\ 251 & 471 & 1826 \end{bmatrix} \qquad (X^TX)^{-1} = \begin{bmatrix} 2.89 & 0.302 & -0.476 \\ 0.30 & 0.164 & -0.084 \\ -0.48 & -0.084 & 0.088 \end{bmatrix}$$
$$\hat\beta = (X^TX)^{-1}X^Ty = \begin{bmatrix} -3.17 \\ 0.89 \\ 0.17 \end{bmatrix}$$
42
Solution (cont'd)
The fitted regression model is
$$\hat{y} = -3.17 + .89\,\text{log.dist} + .17\,\text{log.climb}$$
The error variance is estimated as follows:
$$\mathrm{SSE} = y^Ty - \hat\beta^TX^Ty = (-1.317)^2 + \cdots + (.98)^2 - [-3.17(-10.7) + .89(-6.9) + .17(-62.5)] = 20 - 17 = 3.0$$
$p = k + 1 = 3$, so the degrees of freedom for error are $35 - 3 = 32$.
The estimate of the error variance is $\mathrm{MSE} = 3/32 = .094$.
43
Solution (cont’d)
Standard errors for the estimated coefficients of log.dist and log.climb:
$$\text{s.e.}(\hat\beta_1) = \sqrt{c_{11}\mathrm{MSE}} = \sqrt{.164(.094)} = .12$$
$$\text{s.e.}(\hat\beta_2) = \sqrt{c_{22}\mathrm{MSE}} = \sqrt{.088(.094)} = .091$$
$$H_0: \beta_2 = 0 \qquad H_1: \beta_2 \neq 0 \qquad t = \frac{.17 - 0}{.091} = 1.9$$
The p-value is .066: not very strong evidence that $\beta_2 \neq 0$.
A 95% confidence interval for $\beta_2$ is
$$.17 \pm 2.04(.091) = .17 \pm .186$$
44
Prediction Intervals for New Observations
• Estimate $\beta$ and predict the value of $y$ at a new vector $x_0$:
$$y_0 = x_0^T\beta + \varepsilon_0 \qquad \hat{y}_0 = x_0^T\hat\beta$$
$$\mathrm{Var}(y_0 - \hat{y}_0) = \sigma^2 + \sigma^2 x_0^T(X^TX)^{-1}x_0$$
Therefore, a prediction interval is
$$\hat{y}_0 \pm t_{n-p,\,\alpha/2}\sqrt{\mathrm{MSE}\big(1 + x_0^T(X^TX)^{-1}x_0\big)}$$
45
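The prediction interval is always wider than the confidence interval for the mean response at the same $x_0$, because of the extra $\sigma^2$ term. A sketch on made-up data (illustrative, not the litters or hills data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y
e = y - X @ betahat
mse = e @ e / (n - 2)

x0 = np.array([1.0, 0.5])
se_new = np.sqrt(mse * (1 + x0 @ XtX_inv @ x0))   # new observation
se_mean = np.sqrt(mse * (x0 @ XtX_inv @ x0))      # mean response
```

The squared standard errors differ by exactly MSE, the estimate of $\sigma^2$.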
Confidence Intervals for the Mean Response
$$E[y \mid x = x_0] = x_0^T\beta$$
• Estimate this with $\hat{y}_0 = x_0^T\hat\beta$. Then the C.I. is
$$\hat{y}_0 \pm t_{n-p,\,\alpha/2}\sqrt{\mathrm{MSE}\;x_0^T(X^TX)^{-1}x_0}$$
46
Simultaneous Confidence Intervals
• If $\hat\beta$ is the least-squares estimator for the $p$-vector $\beta$, then
$$\frac{(\hat\beta - \beta)^TX^TX(\hat\beta - \beta)/p}{\mathrm{MSE}} \sim F_{p,n-p}$$
• A $1 - \alpha$ joint confidence region for all of the parameters in $\beta$ is then given by the region in $p$-dimensional space defined by
$$\frac{(\hat\beta - \beta)^TX^TX(\hat\beta - \beta)/p}{\mathrm{MSE}} \leq F_{p,n-p,\alpha}$$
47
Example
• litters data
Recall:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.178247   0.075323   2.366  0.03010 *
bodywt      0.024306   0.006779   3.586  0.00228 **
lsize       0.006690   0.003132   2.136  0.04751 *

A 95% confidence region for $(\beta_0, \beta_1, \beta_2)$ will be centered at $(.178, .024, .0067)$. The confidence region given by
$$\frac{(\hat\beta - \beta)^TX^TX(\hat\beta - \beta)/p}{\mathrm{MSE}} \leq F_{3,17,.05}$$
is an ellipsoid in 3-dimensional space.
48
Example (cont’d)
• We can get an idea of what this ellipsoid looks like by testing randomly generated points $\beta$ in a neighborhood of $\hat\beta$ to see whether they exceed $F_{3,17,.05} = 3.19$. e.g. for $\beta = (.177, .023, .0068)$:

qf(cr.test(c(.177,.023,.0068)),3,17)
>
[1] 5.383459

This exceeds 3.19, so it is not in the 95% confidence ellipsoid.
The function litters.cr() can be used to see what the ellipsoid looks like for fixed values of $\beta_0$. What we see are essentially cross-sections of the ellipsoid in the $(\beta_1, \beta_2)$ space as we vary $\beta_0$.
49
Bonferroni Intervals
• This is a simpler method:
– In order to have $(1 - \alpha)$ confidence that $\ell$ confidence intervals are all correct, we can use
$$\hat\beta_j \pm t_{\alpha/(2\ell),\,n-p}\;\text{s.e.}(\hat\beta_j)$$
– e.g. For the litters data, simultaneous 95% confidence intervals for $\beta_1$ and $\beta_2$ are
$$.024 \pm t_{.05/4,17}(.0068) = .024 \pm 2.45(.0068) = .024 \pm .017$$
and
$$.0067 \pm t_{.05/4,17}(.0031) = .0067 \pm .0076$$
– If we had wanted simultaneous 95% confidence intervals for all 3 parameters, we would have had to use $t_{.05/6,17} = 2.65$.
50
Scheffe Intervals
• Similar idea to Bonferroni, but only applicable when $\ell = p$, the number of coefficients:
$$\hat\beta_j \pm (2F_{\alpha,p,n-p})^{1/2}\,\text{s.e.}(\hat\beta_j), \qquad j = 0, 1, \dots, k$$
51
4.4 Testing Several Coefficients; Extra Sums of Squares
• Partition or split:
$$\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$$
• $\beta_2$ contains the $r$ coefficients that we want to test. Are they all 0?
• Partition $X$ accordingly:
$$X = [X_1\ X_2]$$
• Then the full model is
$$y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon$$
52
Extra Sums of Squares (cont’d)
• Under the full model, define the (uncorrected) regression sum of squares as
$$\mathrm{SSR}(\beta) = \hat\beta^TX^Ty$$
• If $\beta_2 = 0$, then we have the reduced model
$$y = X_1\beta_1 + \varepsilon$$
• Under the reduced model, the regression sum of squares is
$$\mathrm{SSR}(\beta_1) = \hat\beta_1^TX_1^Ty$$
53
Extra Sums of Squares (cont’d)
• Recall that the regression sum of squares indicates the amount of variability in the response explained by the regression.
• By adding $\beta_2$ to the model, we are able to explain more of the variability in the response than with the reduced model ($\beta_1$ only). The difference is
$$\mathrm{SSR}(\beta_2|\beta_1) = \mathrm{SSR}(\beta) - \mathrm{SSR}(\beta_1) = \hat\beta^TX^Ty - \hat\beta_1^TX_1^Ty$$
54
Extra Sums of Squares (cont’d)
• To test $H_0: \beta_2 = 0$, use
$$F_0 = \frac{\mathrm{SSR}(\beta_2|\beta_1)/r}{\mathrm{MSE}}$$
• Degrees of freedom: $\mathrm{SSR}(\beta)$ has $k + 1$ d.f. and $\mathrm{SSR}(\beta_1)$ has $k - r + 1$ d.f., so the difference has $k + 1 - (k - r + 1) = r$ d.f.
• The above test is called a partial F-test.
55
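The partial F statistic only needs the error sums of squares from the two fitted models. A sketch using the cfseal numbers that appear in the anova() output later in this section ($\mathrm{SSE}_{red} = 0.90273$, $\mathrm{SSE}_{full} = 0.72763$, $r = 2$, 26 error d.f.):

```python
# Error sums of squares for the reduced and full cfseal models
# (values taken from the anova() output in this section).
sse_red, sse_full = 0.90273, 0.72763
r, df_err = 2, 26        # extra coefficients tested; full-model error d.f.

extra_ss = sse_red - sse_full            # = SSR(beta_2 | beta_1)
f0 = (extra_ss / r) / (sse_full / df_err)
```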
Example
• cfseal data
– Reduced Model:
$$\log(\text{heart}) = \beta_0 + \beta_1\log(\text{weight}) + \varepsilon$$

cfseal.red <- lm(log(heart)~log(weight))
coef(cfseal.red)
>
[1] 1.20 1.13

56
Example (cont'd)
$\beta_1 = [\beta_0\ \beta_1]^T$

$X_1^Ty$:

crossprod(model.matrix(cfseal.red), log(heart))
>
            [,1]
(Intercept)  165
log(weight)  643
57
Example (cont’d)
$\hat\beta_1^TX_1^Ty$:
$$1.20(165) + 1.13(643) = 923$$
* Full Model:
$$\log(\text{heart}) = \beta_0 + \beta_1\log(\text{weight}) + \cdots + \varepsilon$$
Other variables (without missing data) include log(stomach) and log(kidney).
* $k = 3$ ($p = 4$), $n = 30$
* $\beta_2 = [\beta_2\ \beta_3]^T$
58
Using R
attach(cfseal)
cfseal.full <- lm(log(heart) ~ log(weight) +
                  log(stomach) + log(kidney))
summary(cfseal.full)
>
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.383      0.432    0.89  0.38345
log(weight)     0.723      0.190    3.80  0.00078
log(stomach)    0.246      0.199    1.24  0.22652
log(kidney)     0.132      0.284    0.46  0.64681

Residual standard error: 0.167 on 26 degrees of freedom
59
Example (cont’d)
Note that all of the p-values for the other individual tests are large. Does this mean that we should conclude $\beta_2 = \beta_3 = 0$?
$$\mathrm{MSE} = .167^2 = .0279$$
$X^Ty$:

crossprod(model.matrix(cfseal.full), log(heart))
             [,1]
(Intercept)   165
log(weight)   643
log(stomach) 1076
log(kidney)   987
60
Example (cont’d)
$\mathrm{SSR}(\beta_2|\beta_1) = \hat\beta^TX^Ty - \hat\beta_1^TX_1^Ty$:

coef(cfseal.full)%*%
  crossprod(model.matrix(cfseal.full), log(heart)) -
  crossprod(coef(cfseal.red),
    t(model.matrix(cfseal.red)))%*%log(heart)
>
      [,1]
[1,] 0.175

$$F_0 = \frac{.175/2}{.0279} = 3.1$$
61
Example (cont’d)
p-value:
> 1 - pf(3.14, 2,26)
[1] 0.06
(We have two numerator degrees of freedom because we are testing $\beta_2 = \beta_3 = 0$.)
Conclusion: weak evidence against the null hypothesis.
62
Automatic Method in R
• Quick R way to do this partial F-test:
cfseal.full <- lm(log(heart) ˜ log(weight) +
log(stomach) + log(kidney))
cfseal.red <- lm(log(heart) ˜ log(weight))
anova(cfseal.red,cfseal.full)
>
Analysis of Variance Table
Model 1: log(heart) ˜ log(weight)
Model 2: log(heart) ˜ log(weight) +
log(stomach) + log(kidney)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 0.90273
2 26 0.72763 2 0.17510 3.1284 0.06061 .
63
Sequential Sums of Squares
• We can use the general relationship
$$\mathrm{SSR}(\beta_2|\beta_1) = \mathrm{SSR}(\beta) - \mathrm{SSR}(\beta_1)$$
to build up SSR from individual components called the sequential sums of squares.
Start with $\mathrm{SSR}(\beta_0) = \frac{1}{n}y^TJy$. Then
$$\mathrm{SSR}(\beta_1|\beta_0) = \mathrm{SSR}(\beta_0,\beta_1) - \mathrm{SSR}(\beta_0) = [\hat\beta_0\ \hat\beta_1]\,[\mathbf{1}\ x_1]^Ty - \frac{1}{n}y^TJy$$
(This is the 'corrected' regression sum of squares that we defined earlier.)
64
Sequential Sums of Squares (cont’d)
$$\mathrm{SSR}(\beta_2|\beta_0,\beta_1) = \mathrm{SSR}(\beta_0,\beta_1,\beta_2) - \mathrm{SSR}(\beta_0,\beta_1)$$
$$\mathrm{SSR}(\beta_3|\beta_0,\beta_1,\beta_2) = \mathrm{SSR}(\beta_0,\beta_1,\beta_2,\beta_3) - \mathrm{SSR}(\beta_0,\beta_1,\beta_2)$$
Continuing, one can obtain all of the sequential sums of squares.
• Direct evaluation using R: after fitting the full model, type
anova(cfseal.full)
>
Analysis of Variance Table
Response: log(heart)
Df Sum Sq Mean Sq F value Pr(>F)
log(weight) 1 13.68 13.68 488.75 <2e-16
log(stomach) 1 0.17 0.17 6.04 0.021
log(kidney) 1 0.01 0.01 0.21 0.647
Residuals 26 0.73 0.03
65
Example (cont’d)
$$\mathrm{SSR}(\beta_1|\beta_0) = 13.68 \qquad \mathrm{SSR}(\beta_2|\beta_0,\beta_1) = .17 \qquad \mathrm{SSR}(\beta_3|\beta_0,\beta_1,\beta_2) = .01$$
• For our test of $\beta_2 = \beta_3 = 0$, we were interested in $\mathrm{SSR}(\beta_2,\beta_3|\beta_0,\beta_1)$:
$$\mathrm{SSR}(\beta_0,\beta_1,\beta_2,\beta_3) - \mathrm{SSR}(\beta_0,\beta_1)$$
66
Sequential Sums of Squares (cont’d)
• Note that
$$\mathrm{SSR}(\beta_2|\beta_0,\beta_1) = \mathrm{SSR}(\beta_0,\beta_1,\beta_2) - \mathrm{SSR}(\beta_0,\beta_1)$$
and
$$\mathrm{SSR}(\beta_3|\beta_0,\beta_1,\beta_2) = \mathrm{SSR}(\beta_0,\beta_1,\beta_2,\beta_3) - \mathrm{SSR}(\beta_0,\beta_1,\beta_2)$$
Therefore,
$$\mathrm{SSR}(\beta_2,\beta_3|\beta_0,\beta_1) = \mathrm{SSR}(\beta_2|\beta_0,\beta_1) + \mathrm{SSR}(\beta_3|\beta_0,\beta_1,\beta_2) = .17 + .01 = .18$$
$$F_0 = \frac{.18/2}{.73/26} = 3.2$$
• Exercise 1: Conduct the F-test for $\beta_3 = 0$ when the reduced model includes $\beta_0$, $\beta_1$ and $\beta_2$.
67
• Exercise 2: Show that
$$\mathrm{SSR}(\beta_3,\beta_4|\beta_0,\beta_1,\beta_2) = \mathrm{SSR}(\beta_3|\beta_0,\beta_1,\beta_2) + \mathrm{SSR}(\beta_4|\beta_0,\beta_1,\beta_2,\beta_3)$$
Orthogonal Columns in X
• If the columns of $X_1$ are orthogonal to the columns of $X_2$, then
$$X_1^TX_2 = 0 \quad\text{and}\quad X_2^TX_1 = 0$$
Then, under the full model,
$$\hat\beta = (X^TX)^{-1}X^Ty = \begin{bmatrix} X_1^TX_1 & 0 \\ 0 & X_2^TX_2 \end{bmatrix}^{-1}\begin{bmatrix} X_1^Ty \\ X_2^Ty \end{bmatrix}$$
so that
$$\hat\beta = \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix}$$
These are the estimates that would have been obtained from the separate reduced models:
$$y = X_1\beta_1 + \varepsilon \quad\text{and}\quad y = X_2\beta_2 + \varepsilon$$
68
Orthogonal Columns (cont'd)
• We can then show that
$$\mathrm{SSR}(\beta_2|\beta_1) = \mathrm{SSR}(\beta_2)$$
since
$$\mathrm{SSR}(\beta) - \mathrm{SSR}(\beta_1) = \hat\beta^TX^Ty - \hat\beta_1^TX_1^Ty = \hat\beta_2^TX_2^Ty = \mathrm{SSR}(\beta_2)$$
• If $X_1$ and $X_2$ are not orthogonal, we have
$$\hat\beta \neq \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix}$$
so
$$\mathrm{SSR}(\beta_2|\beta_1) \neq \mathrm{SSR}(\beta_2)$$
69
Testing the General Linear Hypothesis
• Suppose $T$ is an $r \times p$ matrix ($r \leq p$).
• General Linear Hypothesis:
$$H_0: T\beta = 0$$
• $T\beta$ is estimated by $T\hat\beta$.
$$\mathrm{Var}(T\hat\beta) = T\,\mathrm{Var}(\hat\beta)\,T^T = \Sigma = \sigma^2T(X^TX)^{-1}T^T$$
70
General Linear Hypothesis (cont’d)
• Under $H_0$,
$$\hat\beta^TT^T\Sigma^{-1}T\hat\beta \sim \chi^2(r).$$
• To see this, first note that under $H_0$,
$$T\hat\beta = T(X^TX)^{-1}X^T\varepsilon$$
so that
$$\hat\beta^TT^TC^{-1}T\hat\beta = \varepsilon^TX(X^TX)^{-1}T^TC^{-1}T(X^TX)^{-1}X^T\varepsilon$$
where $C = \Sigma/\sigma^2 = T(X^TX)^{-1}T^T$. The matrix $X(X^TX)^{-1}T^TC^{-1}T(X^TX)^{-1}X^T$ is idempotent, and
$$\mathrm{tr}\big(X(X^TX)^{-1}T^TC^{-1}T(X^TX)^{-1}X^T\big) = \mathrm{tr}(C^{-1}C) = \mathrm{tr}(I_r) = r$$
[$C$ is an $r \times r$ matrix.]
71
General Linear Hypothesis (cont’d)
• Finally, we note that
$$\varepsilon^TX(X^TX)^{-1}T^TC^{-1}T(X^TX)^{-1}X^T\varepsilon/\sigma^2 = \varepsilon^TX(X^TX)^{-1}T^T\Sigma^{-1}T(X^TX)^{-1}X^T\varepsilon$$
which implies that the latter quadratic form has a $\chi^2(r)$ distribution.
• The test statistic is
$$F_0 = \frac{\hat\beta^TT^TC^{-1}T\hat\beta/r}{\mathrm{MSE}} \sim F_{r,n-p}$$
where MSE is computed for the full model (with $p$ parameters).
72
General Linear Hypothesis (cont’d)
• To see that this is a valid F statistic (under $H_0$), we need to verify that
$$(I - H)\big[X(X^TX)^{-1}T^TC^{-1}T(X^TX)^{-1}X^T\big] = 0$$
since this is the product of the matrices of the quadratic forms for the numerator sum of squares and the error sum of squares. Since $H = X(X^TX)^{-1}X^T$, the required result follows almost immediately. Therefore, the numerator and denominator sums of squares are independent of each other.
73
Example
• Test the equality of all regression coefficients: $\beta_0 = \beta_1 = \beta_2 = \cdots = \beta_k$.
T = 1 -1 0 ... 0 0
0 1 -1 ... 0 0
................
0 0 0 ... 1 -1
T is a k × (k + 1) matrix, so F0 ∼ Fk,n−k−1 under H0.
74
4.5 The ANOVA Test for Significance of Regression
• Model:
$$y = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k + \varepsilon \qquad\text{i.e.}\qquad y = X\beta + \varepsilon$$
• Some observations:
$$(I - H)X = 0 \quad\text{and}\quad X^T(I - H) = 0$$
$$\mathrm{tr}(H) = \mathrm{tr}([X(X^TX)^{-1}]X^T) = \mathrm{tr}(X^TX(X^TX)^{-1}) = k + 1$$
$$\mathrm{SSE} = y^T(I - H)y = \varepsilon^T(I - H)\varepsilon = \mathrm{tr}(\varepsilon^T(I - H)\varepsilon) = \mathrm{tr}((I - H)\varepsilon\varepsilon^T)$$
75
ANOVA (cont’d)
• Unbiased estimation of $\sigma^2$:
$$E[\mathrm{SSE}] = \mathrm{tr}(I - H)\sigma^2 = (n - k - 1)\sigma^2$$
so an unbiased estimator for $\sigma^2$ is
$$\mathrm{MSE} = \mathrm{SSE}/(n - k - 1)$$
• Partitioning the variation in the responses
– Recall from simple linear regression:
$$\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 = \mathrm{SSR} + \mathrm{SSE}$$
$$\mathrm{MSR} = \mathrm{SSR} = \mathrm{TSS} - \mathrm{SSE}, \qquad E[\mathrm{MSR}] = \sigma^2 + \beta_1^2 S_{xx}$$
76
ANOVA (cont’d)
• What about TSS − SSE in the multiple regression case?
$$\mathrm{TSS} = y^T\Big(I - \frac{1}{n}J\Big)y$$
where $J = \mathbf{1}\mathbf{1}^T$ is the matrix of all 1's, and
$$\mathrm{SSE} = y^T(I - H)y$$
Therefore,
$$\mathrm{SSR} = \mathrm{TSS} - \mathrm{SSE} = y^T\Big(H - \frac{1}{n}J\Big)y$$
77
ANOVA (cont’d)
$$E[\mathrm{SSR}] = \beta^TX^T\Big(H - \frac{1}{n}J\Big)X\beta + E\Big[\varepsilon^T\Big(H - \frac{1}{n}J\Big)\varepsilon\Big] = \beta^TX^T\Big(H - \frac{1}{n}J\Big)X\beta + E\Big[\mathrm{tr}\Big(\Big(H - \frac{1}{n}J\Big)\varepsilon\varepsilon^T\Big)\Big]$$
$$E\Big[\mathrm{tr}\Big(\Big(H - \frac{1}{n}J\Big)\varepsilon\varepsilon^T\Big)\Big] = \mathrm{tr}\Big(H - \frac{1}{n}J\Big)\sigma^2 = (k + 1 - 1)\sigma^2$$
so
$$E[\mathrm{SSR}] = \beta^TX^T\Big(H - \frac{1}{n}J\Big)X\beta + k\sigma^2$$
78
ANOVA (cont’d)
• If $\beta = 0$, the first term vanishes.
• Even if $\beta_0$ is nonzero, the first term vanishes when $\beta_1 = \cdots = \beta_k = 0$:
$$E[\mathrm{SSR}] = k\sigma^2 + \beta_0^2\mathbf{1}^T\Big(I - \frac{1}{n}J\Big)\mathbf{1} = k\sigma^2.$$
• Thus, if $\beta_1 = \cdots = \beta_k = 0$, then another unbiased estimator of $\sigma^2$ is
$$\mathrm{MSR} = \mathrm{SSR}/k$$
79
Quadratic Forms, Chi-squares, and Independence
• Assume $\beta_1 = \cdots = \beta_k = 0$.
• $\mathrm{SSR}/\sigma^2$ has a $\chi^2(k)$ distribution.
• Note the relation between the degrees of freedom and the trace of $(H - \frac{1}{n}J)$, the matrix of the quadratic form SSR. Also, note that this matrix is idempotent and symmetric.
• $\mathrm{SSE} = y^T(I - H)y$.
80
Quadratic Forms, Chi-squares, and Independence
• $I - H$ is idempotent and symmetric with trace $n - k - 1$. Therefore, $\mathrm{SSE}/\sigma^2$ has a $\chi^2$ distribution on $n - k - 1$ degrees of freedom.
• $(I - H)(H - \frac{1}{n}J) = 0$, so SSE and SSR are independent.
• Hence,
$$F_0 = \frac{\mathrm{MSR}}{\mathrm{MSE}}$$
has an F distribution on $(k, n - k - 1)$ degrees of freedom.
• If some of the $\beta$'s are nonzero, then $F_0$ will tend to be larger than an $F_{k,n-k-1}$ random variable.
81
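The decomposition TSS = SSR + SSE via the three quadratic forms can be verified numerically. A sketch on made-up data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
J = np.ones((n, n))                     # matrix of all 1's

tss = y @ (np.eye(n) - J / n) @ y
sse = y @ (np.eye(n) - H) @ y
ssr = y @ (H - J / n) @ y
```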
The ANOVA table
For testing
H0 : β1 = · · · = βk = 0
vs.
H1 : at least one coefficient is nonzero.
Source       df        SS    MS    F
Regression   k         SSR   MSR   F0 = MSR/MSE
Error        n-k-1     SSE   MSE
Total        n-1       TSS

Reject $H_0$ if the p-value is very small, i.e. if $F_0$ is larger than $F_{k,n-k-1,\alpha}$.
82
Textbook formula for SSR
$$\mathrm{SSR} = \mathrm{TSS} - \mathrm{SSE} = \sum_{i=1}^n \hat{y}_iy_i - \frac{1}{n}\Big(\sum_{j=1}^n y_j\Big)^2 = \hat\beta^TX^Ty - \frac{1}{n}\Big(\sum_{j=1}^n y_j\Big)^2$$
83
Example
litters data ($y$ = brainwt): $\sum_{i=1}^n y_i = 8.33$, $\sum_{i=1}^n y_i^2 = 3.48$, $n = 20$, $k = 2$ (bodywt and lsize).
$$\hat\beta = [0.178\ \ 0.0243\ \ 0.00669]^T \qquad X^Ty = [8.33\ \ 64.95\ \ 61.85]^T$$
$$\mathrm{TSS} = 3.48 - \frac{1}{20}(8.33^2) = .00695$$
$$\mathrm{SSR} = .178(8.33) + .0243(64.95) + .00669(61.85) - \frac{8.33^2}{20} = .00452$$
84
Example (cont’d)
Source       df   SS        MS        F
Regression    2   0.00452   0.00226   F0 = 15.8
Error        17   0.00243   0.000143
Total        19   0.00695

p-value:
> 1 - pf(15.8, 2, 17)
[1] 0.000133

Conclusion: Reject $H_0$. There is a relation between brainwt and the explanatory variables (bodywt and lsize).
85
Hidden Extrapolation
• When making predictions, it is important not to extrapolate beyondthe range of the given data.
• In simple regression, it is obvious when one is extrapolating:
one is predicting y outside the range of given x-values.
• In multiple regression, extrapolation is not obvious.
• The diagonal elements $h_{ii}$ of the hat matrix $H$ can be useful in determining when one is extrapolating.
• $h_{ii}$ gives an idea of the distance from the $i$th observation to the 'center' of the observations:
$$h_{ii} = x_i^T(X^TX)^{-1}x_i$$
86
Example
• The following can be used to identify the hat diagonal elements for some or all of the observations in the litters data:

attach(litters)
litters.lm <- lm(brainwt ~ bodywt + lsize)
extrap.fn(litters.lm, litters, n=3)  # see the accompanying plot
detach(litters)
87
[Figure: bodywt vs lsize scatterplot with hat diagonal values (e.g. 0.17, 0.08, 0.43) labeling selected points]
• Note how the $h_{ii}$ values are largest for those observations near the 'edge' of the data.
Hidden Extrapolation (cont'd)
• If we want to predict $y$ at $x_0$, then we will be extrapolating if
$$x_0^T(X^TX)^{-1}x_0 > h_{ii}$$
for all $i = 1, 2, \dots, n$, i.e. if
$$x_0^T(X^TX)^{-1}x_0 > h_{\max}$$
88
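The $h_{\max}$ rule is easy to implement directly. A sketch (made-up data, with a deliberately distant $x_0$; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 15
X = np.column_stack([np.ones(n), rng.uniform(0, 1, size=(n, 2))])
XtX_inv = np.linalg.inv(X.T @ X)

# Diagonal of the hat matrix: h_ii = x_i' (X'X)^{-1} x_i.
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
hmax = h.max()

x0 = np.array([1.0, 5.0, 5.0])   # far outside the observed predictor range
extrapolating = x0 @ XtX_inv @ x0 > hmax
```

With an intercept in the model, each $h_{ii}$ lies between $1/n$ and 1, so any quadratic form much larger than 1 is an immediate red flag.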
Example
• Suppose we want to predict the brain weight for a mouse whose body weight is 7 and who came from a litter of size 7. Are we extrapolating?
$$x_0 = [1\ 7\ 7]^T$$
The following function evaluates $x_0^T(X^TX)^{-1}x_0$. We are extrapolating if this value exceeds .43.
89
Example (cont’d)
hidden.extrap(litters.lm, c(1, 7,7))
>
[,1]
[1,] 0.353
predict(litters.lm, newdata =
data.frame(bodywt=7,lsize=7),
interval="prediction")
>
fit lwr upr
[1,] 0.395 0.366 0.425
90
Example (cont’d)
• Suppose we now want to predict the brain weight for a mouse whose body weight is 7 and who came from a litter of size 12. Are we extrapolating?
hidden.extrap(litters.lm, c(1, 7,12))
>
[,1]
[1,] 0.665
Since this is larger than .43, we must conclude that we are extrapolating.
91
4.6 Weighted Least Squares
• Consider the regression-through-the-origin model
$$y_i = \beta_1x_i + \varepsilon_i$$
with $E[\varepsilon_i] = 0$, and suppose $V(y_i|x_i) = \sigma^2/w_i$, where $w_i$ is a known weight, i.e.
$$E[\varepsilon_i^2] = \sigma^2/w_i$$
• The least-squares estimate was previously found by minimizing $\sum_{i=1}^n \varepsilon_i^2$:
$$\hat\beta_1 = \frac{\sum x_iy_i}{\sum x_i^2}$$
• Gauss-Markov Theorem: when the variances are constant, $\hat\beta_1$ has the smallest variance of any linear unbiased estimator of $\beta_1$.
92
Weighted Least Squares (cont’d)
• $\hat\beta_1$ is not the best linear unbiased estimator for $\beta_1$ when there are weights $w_i$.
• To find the BLUE now, multiply the model by $a_i$:
$$a_iy_i = a_i\beta_1x_i + a_i\varepsilon_i \qquad\text{or}\qquad y_i^* = \beta_1x_i^* + \varepsilon_i^*$$
Compute $\tilde\beta_1$ for the new data $(x_i^*, y_i^*)$:
$$\tilde\beta_1 = \frac{\sum x_i^*y_i^*}{\sum (x_i^*)^2}$$
$$E[\tilde\beta_1] = \beta_1 \text{ (unbiased)} \qquad V(\tilde\beta_1) = \frac{\sigma^2\sum x_i^2a_i^4/w_i}{\big(\sum a_i^2x_i^2\big)^2}$$
93
Weighted Least Squares (cont’d)
• How do we choose $a_1, a_2, \dots, a_n$ to make this as small as possible?
• Recall the Cauchy-Schwarz inequality:
$$\Big(\sum_{i=1}^n u_iv_i\Big)^2 \leq \Big(\sum_{j=1}^n u_j^2\Big)\Big(\sum_{k=1}^n v_k^2\Big)$$
(equality holds if the $u_i$'s are proportional to the $v_i$'s: $u_i = cv_i$).
• Apply this to the denominator of our variance, with $u_i = a_i^2x_i/\sqrt{w_i}$ and $v_i = \sqrt{w_i}\,x_i$:
$$\Big(\sum_{i=1}^n a_i^2x_i^2\Big)^2 \leq \Big(\sum_{i=1}^n a_i^4x_i^2/w_i\Big)\Big(\sum_{i=1}^n w_ix_i^2\Big)$$
(equality holds when $a_i^4x_i^2/w_i$ is proportional to $w_ix_i^2$, e.g. $a_i = \sqrt{w_i}$).
94
Weighted Least Squares (cont’d)
• Thus, $V(\tilde\beta_1)$ is minimized if $a_i = \sqrt{w_i}$:
$$V(\tilde\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n w_ix_i^2}$$
• Note also that
$$E[\sqrt{w_i}\,\varepsilon_i] = 0 \quad\text{and}\quad V(\sqrt{w_i}\,\varepsilon_i) = \sigma^2$$
and that instead of minimizing $\sum_{i=1}^n \varepsilon_i^2$ (ordinary least squares), we are now minimizing $\sum_{i=1}^n w_i\varepsilon_i^2$ (weighted least squares).
95
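For the through-the-origin model, the weighted estimate and the OLS estimate on the rescaled data $(\sqrt{w_i}\,x_i, \sqrt{w_i}\,y_i)$ coincide. A sketch (made-up data with $\mathrm{Var}(y_i) \propto x_i^2$, i.e. $w_i = 1/x_i^2$; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(1, 5, size=n)
w = 1 / x**2                          # known weights: Var(y_i) = sigma^2 / w_i
y = 2.0 * x + x * rng.normal(size=n)  # noise s.d. proportional to x

# Weighted least squares: minimize sum w_i (y_i - beta x_i)^2.
beta_wls = np.sum(w * x * y) / np.sum(w * x**2)

# Same thing as OLS on the rescaled data (sqrt(w) x, sqrt(w) y).
xs, ys = np.sqrt(w) * x, np.sqrt(w) * y
beta_rescaled = np.sum(xs * ys) / np.sum(xs**2)
```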
Example
• roller data
• Ordinary Least Squares:
roller.lm <- lm(depression˜weight, data=roller)
plot(roller.lm, which=1)  # residuals vs fitted values
96
Example (cont'd)
[Figure: residuals vs fitted values for lm(depression ~ weight, data = roller)]
• The residual plot indicates that the variance might not be constant.
97
Weighted Least Squares
roller.wlm <- lm(depression ~ weight,
                 data = roller, weights = 1/weight^2)
plot(roller.wlm, which=1)  # residuals vs fitted values

[Figure: residuals vs fitted values for the weighted fit of depression on weight]
– a somewhat clearer pattern: the variance does seem to be increasing
98
Weighted Least Squares
• Comparing the fitted lines:

[Figure: the roller data (depression vs weight) with the OLS and WLS fitted lines]
99
Generalized Least Squares
• Model:
$$y = X\beta + \varepsilon$$
with $E[\varepsilon] = 0$ and $E[\varepsilon\varepsilon^T] = \Sigma = \sigma^2V$.
$\Sigma$ must be symmetric and positive definite. This implies, among other things, that $\Sigma$ possesses an inverse.
• Weighted least squares is the special case where $\Sigma$ is a diagonal matrix with $ii$ element $\sigma^2/w_i$.
• $V = K^2$ for some symmetric nonsingular $K$.
100
Generalized Least Squares (cont’d)
• Consider
$$K^{-1}y = K^{-1}X\beta + K^{-1}\varepsilon$$
Note that
$$\mathrm{Var}(K^{-1}\varepsilon) = E[K^{-1}\varepsilon\varepsilon^TK^{-1}] = K^{-1}\sigma^2VK^{-1} = \sigma^2I$$
• By multiplying through by $K^{-1}$ we now have constant variance, so $\beta$ can be estimated by least squares:
$$\hat\beta = (X^TK^{-2}X)^{-1}X^TK^{-2}y = (X^TV^{-1}X)^{-1}X^TV^{-1}y$$
• $\hat\beta$ is the generalized least-squares estimator for $\beta$.
101
Generalized Least Squares (cont’d)
• Unbiased: $E[\hat\beta] = \beta$
• Variance:
$$\mathrm{Var}(\hat\beta) = (X^TV^{-1}X)^{-1}X^TV^{-1}\Sigma V^{-1}X(X^TV^{-1}X)^{-1} = \sigma^2(X^TV^{-1}X)^{-1}$$
102
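The GLS estimator equals OLS applied to the transformed system $K^{-1}y = K^{-1}X\beta + K^{-1}\varepsilon$. A sketch (made-up $V$; the symmetric square root $K$ is computed by eigendecomposition; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=n)])
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)           # symmetric positive definite
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

# Direct GLS: betahat = (X'V^{-1}X)^{-1} X'V^{-1} y.
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Symmetric square root K of V, so V = K^2, then OLS on the
# transformed system K^{-1}y = K^{-1}X beta + K^{-1}eps.
w_eig, U = np.linalg.eigh(V)
K = U @ np.diag(np.sqrt(w_eig)) @ U.T
Xstar, ystar = np.linalg.solve(K, X), np.linalg.solve(K, y)
beta_ols_star = np.linalg.solve(Xstar.T @ Xstar, Xstar.T @ ystar)
```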