Ch. 4 Multiple Linear Regression Models
• Notation
• Estimation
• Confidence Intervals, Testing and Prediction
• Extra Sums of Squares and Multiple Testing
• Analysis of Variance
• Weighted Least-Squares and Generalized Least-Squares
(To do some of the R calculations, you will need the functions in Ch4.R.)
1
Notation
• Start with simple linear regression
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \text{ (i.i.d.)}$$
• Extension: add more variables on the right, i.e. more explanatory variables.
– e.g. polynomial regression
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_p x_i^p + \varepsilon_i$$
– e.g. additional variables
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$
2
Matrix form
$$y = X\beta + \varepsilon$$
where $y$ is the column vector of responses:
$$y = [y_1\ y_2\ \cdots\ y_n]^T$$
($T$ denotes matrix transpose), and
$$\beta = [\beta_0\ \beta_1\ \cdots\ \beta_k]^T, \qquad \varepsilon = [\varepsilon_1\ \varepsilon_2\ \cdots\ \varepsilon_n]^T$$
3
X is an $n \times (k+1)$ matrix (called the model matrix or design matrix):
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}$$
• e.g. regression through the origin: $y_i = \beta_1 x_i + \varepsilon_i$ is $y = X\beta + \varepsilon$ with
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \quad \beta = [\beta_1]$$
e.g. Simple linear regression: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ is $y = X\beta + \varepsilon$ with
$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$
4
e.g. Quadratic regression: $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ is $y = X\beta + \varepsilon$ with
$$X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$$
5
e.g. Regression with 2 predictor variables: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$ is $y = X\beta + \varepsilon$ with
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$$
6
Exercises
Find the model matrix X for each of the following.
1. $y_i = \mu + \varepsilon_i$
2. $y_{ij} = \mu_i + \varepsilon_{ij}$, $i = 1, 2$, $j = 1, 2, 3$.
[Hint: $\beta = [\mu_1\ \mu_2]^T$ and $y = [y_{11}\ y_{12}\ \cdots\ y_{23}]^T$.]
(This is an example of a 1-way ANOVA model.)
3. $y_{ij} = \mu_i + \beta x_{ij} + \varepsilon_{ij}$, $i = 1, 2$, $j = 1, 2, 3$.
(This is an example of an analysis of covariance model.)
4. $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i1}x_{i2} + \varepsilon_i$
5. $y_i = \beta_0 + \beta_1\cos(x_i) + \beta_2\sin(x_i) + \varepsilon_i$
6. $y_i = \beta_1 B_1(x_i) + \beta_2 B_2(x_i) + \beta_3 B_3(x_i) + \beta_4 B_4(x_i) + \varepsilon_i$, where $B_1$, $B_2$, $B_3$ and $B_4$ are given real-valued functions.
7
4.2 Least-Squares Estimation
• Differentiation w.r.t. Vectors
– Suppose $c = [c_1\ c_2\ \cdots\ c_k]^T$ and $x = [x_1\ x_2\ \cdots\ x_k]^T$. Differentiate
$$f(x) = c^Tx = \sum_{i=1}^k c_i x_i$$
with respect to $x$. The partial derivatives with respect to $x_1, x_2, \dots, x_k$ are $c_1, c_2, \dots, c_k$, so the derivative is a vector (called the gradient):
$$f'(x) = c^T$$
8
Example
• Suppose $B$ is a symmetric $k \times k$ matrix. Differentiate
$$f(x) = x^TBx$$
with respect to $x$.
• The gradient is $f'(x) = 2x^TB$.
9
Estimation of $\beta$
$$y = X\beta + \varepsilon \qquad y - X\beta = \varepsilon$$
• Minimize the sum of squares of the errors,
$$L = \varepsilon^T\varepsilon = (y - X\beta)^T(y - X\beta),$$
with respect to $\beta$.
10
Estimation of $\beta$ (cont'd)
• Differentiate w.r.t. $\beta$:
$$L = y^Ty - \beta^TX^Ty - y^TX\beta + \beta^TX^TX\beta = y^Ty - 2\,y^TX\beta + \beta^TX^TX\beta$$
$$L'(\beta) = -2y^TX + 2\beta^TX^TX$$
($X^TX$ is symmetric.)
11
Estimation of $\beta$ (cont'd)
• Set $L' = 0$:
$$\hat\beta^TX^TX = y^TX \qquad\text{or}\qquad X^TX\hat\beta = X^Ty$$
$$\hat\beta = (X^TX)^{-1}X^Ty$$
provided $X^TX$ has an inverse (the columns of $X$ must be linearly independent, and the number of columns cannot exceed the number of rows).
12
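The closed form above is easy to check numerically. The following is a minimal sketch (not from the notes; the data and names are made up) in Python with NumPy, comparing the normal-equations solution with a dedicated least-squares solver:

```python
import numpy as np

# Made-up data: n = 20 observations, intercept plus k = 2 predictors.
rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # model matrix
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X'X) betahat = X'y.
betahat = np.linalg.solve(X.T @ X, X.T @ y)

# A least-squares routine gives the same answer, and is numerically
# preferable when X'X is ill-conditioned.
betahat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Inverting (or solving with) $X^TX$ is fine for illustration; in practice QR-based routines such as R's `qr.solve(X, y)` or NumPy's `lstsq` are more stable.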
Fitted Values
$$\hat{y} = X\hat\beta = X(X^TX)^{-1}X^Ty = Hy$$
where
$$H = X(X^TX)^{-1}X^T$$
is the so-called hat matrix.
13
Residuals
$$e = \hat\varepsilon = y - \hat{y} = y - Hy = (I - H)y$$
14
Geometric Interpretation
• The goal of least squares is to identify values of the coefficients $\beta$ which make $\hat{y} = X\beta$ as close as possible to $y$.
• Vectors of the form $y$ lie in $n$-dimensional space, while vectors of the form $X\beta$ are linear combinations of the $p$ $n$-vectors comprising the columns of $X$: a $p$-dimensional subspace (provided that the columns are linearly independent).
• Least squares amounts to finding the vector in this subspace which is closest to $y$.
• The orthogonal projection of $y$ onto the subspace is the minimizing vector: i.e. set the inner product between each of the columns of $X$ and $y - \hat{y}$ to 0:
$$X^T(y - \hat{y}) = 0$$
15
This ensures that the vector $y - \hat{y}$ is perpendicular to the subspace.
• Of course, $\hat{y}$ must be of the form $X\hat\beta$, so we must have
$$X^T(y - X\hat\beta) = 0$$
so that
$$\hat\beta = (X^TX)^{-1}X^Ty \quad\text{and}\quad X\hat\beta = X(X^TX)^{-1}X^Ty = Hy$$
• The hat matrix is an example of what is called an orthogonal projector, satisfying $H = H^T$ and $H = H^2$. This last property ensures that the projection of a vector already in the $p$-dimensional subspace lands back in that subspace:
$$H(Hy) = H^2y = Hy.$$
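The projector properties of $H$ can be verified numerically. A small illustrative sketch (made-up data, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'

y = rng.normal(size=n)
resid = (np.eye(n) - H) @ y             # residual vector (I - H) y
```

The tests below confirm symmetry, idempotence, and that the residuals are orthogonal to the columns of $X$.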
Properties of Least-Squares Estimates
• Model:
$$y = X\beta + \varepsilon$$
where $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = E[\varepsilon\varepsilon^T] = \sigma^2 I$.
• Unbiasedness:
$$\hat\beta = (X^TX)^{-1}X^Ty = (X^TX)^{-1}X^T(X\beta + \varepsilon) = \beta + (X^TX)^{-1}X^T\varepsilon$$
so $E[\hat\beta] = \beta$, since $E[\varepsilon] = 0$.
16
Properties of Least-Squares Estimates (cont'd)
• Variance:
$$\mathrm{Var}(\hat\beta) = E[(\hat\beta - \beta)(\hat\beta - \beta)^T] = E\Big[\big((X^TX)^{-1}X^T\varepsilon\big)\big((X^TX)^{-1}X^T\varepsilon\big)^T\Big] = (X^TX)^{-1}X^TE[\varepsilon\varepsilon^T]X(X^TX)^{-1} = \sigma^2(X^TX)^{-1}$$
• $\hat\beta$ is also the m.l.e. in the case where the noise is normally distributed, and it is itself normally distributed in that case.
• $\hat\beta$ is approximately normal in general.
• $\hat\beta$ is the Best Linear Unbiased Estimator for $\beta$: the Gauss-Markov Theorem.
17
Gauss-Markov Theorem
• Among all estimators of the scalar quantity $\ell^T\beta$ having the form $q^Ty$, where $\ell$ is a fixed vector of length $p$ and
$$E[q^Ty] = \ell^T\beta,$$
the variance of $q^Ty$ is smallest when
$$q^Ty = \ell^T\hat\beta$$
18
Proof
Suppose
$$q^T = \ell^T(X^TX)^{-1}X^T + \Delta^T$$
Then
$$E[q^Ty] = \ell^T\beta + \Delta^TX\beta$$
In order for $q^Ty$ to be an unbiased estimator of $\ell^T\beta$, we must have $\Delta^TX\beta = 0$. Note that this must hold for all possible values of $\beta$ (i.e. any $p \times 1$ vector), which forces $\Delta^TX = 0$; transposing, $X^T\Delta = 0$ as well.
19
Proof (cont'd)
Now, let us determine the value of $\Delta$ which minimizes the variance of $q^Ty = \sum_{i=1}^n q_iy_i$:
$$\mathrm{Var}(q^Ty) = \sigma^2 q^Tq = \sigma^2\Big(\ell^T(X^TX)^{-1}\ell + \Delta^TX(X^TX)^{-1}\ell + \ell^T(X^TX)^{-1}X^T\Delta + \Delta^T\Delta\Big) = \sigma^2\ell^T(X^TX)^{-1}\ell + \sigma^2\Delta^T\Delta$$
since $(X^TX)^{-1}\ell$ is a $p \times 1$ vector and $\Delta^TX = 0$, so that $\Delta^TX[(X^TX)^{-1}\ell] = 0$.
Thus, $\mathrm{Var}(q^Ty)$ is minimized when $\Delta^T\Delta = 0$, i.e. $\Delta = 0$.
20
Estimation of $\sigma^2$
$$\hat\sigma^2 = \mathrm{MSE} = \frac{1}{n - \#\text{parameters}}\sum e_i^2 = \frac{1}{n-p}\,e^Te = \frac{1}{n-p}\,y^T(I-H)^2y = \frac{1}{n-p}\,y^T(I-H)y$$
where $p = k + 1$, and $H$ is symmetric and idempotent ($H^2 = H$).
21
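The two expressions for MSE agree, and on simulated data with known $\sigma^2$ the estimate lands near the truth. A sketch (made-up data with true $\sigma^2 = 1$; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)  # true sigma^2 = 1

H = X @ np.linalg.solve(X.T @ X, X.T)
e = y - H @ y                          # residuals
mse_from_resid = e @ e / (n - p)       # (1/(n-p)) e'e
mse_quadform = y @ (np.eye(n) - H) @ y / (n - p)   # (1/(n-p)) y'(I-H)y
```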
L-S Estimation - R Example
• The dataframe litters consists of brain weights and body weights of 20 mice. The size of the litter in which each mouse was born is also recorded.
library(DAAG)
data(litters)
litters
>
lsize bodywt brainwt
1 3 9.45 0.444
2 3 9.78 0.436
.......................
20 12 6.05 0.401
22
Look at all pairwise relationships
pairs(litters, pch=16)
[Figure: scatterplot matrix of lsize, bodywt, and brainwt]
23
Observations
– Body weight decreases with litter size.
– Brain weight decreases with litter size.
– Brain weight increases with body weight.
• In order to find out how brain weight relates to both body weight and litter size, we can use the following model:
brainwt = β0 + β1bodywt + β2lsize + ε
24
Fitting the model in R
litters.lm <- lm(brainwt ˜ bodywt + lsize,
data = litters)
summary(litters.lm)
>
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.17825    0.07532    2.37   0.0301
bodywt       0.02431    0.00678    3.59   0.0023
lsize        0.00669    0.00313    2.14   0.0475

Residual standard error: 0.012 on 17 degrees of freedom
25
Example (cont’d)
• The fitted model is then
$$\hat{y} = .18 + .024x_1 + .0067x_2$$
where $y$ is brain weight, $x_1$ is body weight and $x_2$ is litter size. The error variance estimate is $.012^2 = .000144$.
• Note that this fitted model says that for a fixed body weight, brain weight is actually higher for larger litters. This is consistent with what is known as 'brain sparing': nutritional deprivation that results from large litter sizes has a proportionately smaller effect on brain weight than on body weight.
26
Details of the L-S calculations
X <- model.matrix(litters.lm) # X matrix
X
>
(Intercept) bodywt lsize
1 1 9.45 3
...........................
20 1 6.05 12
27
Details (cont’d)
XX <- crossprod(X,X)    # X'X
XX
>
            (Intercept) bodywt lsize
(Intercept)          20    155   150
bodywt              155   1236  1089
lsize               150   1089  1290

XXinv <- qr.solve(XX)   # calculates inverse of X'X
XXinv
>
      [,1]   [,2]    [,3]
[1,] 39.71 -3.556 -1.6144
[2,] -3.56  0.322  0.1419
[3,] -1.61  0.142  0.0687
28
Details (cont’d)
# Alternative:
XXinv <- summary(litters.lm)$cov.unscaled
y <- litters$brainwt
Xy<- crossprod(X, y)
Xy # X’y:
>
[,1]
(Intercept) 8.33
bodywt 64.95
lsize 61.85
29
Details (cont’d)
betahat <- XXinv%*%Xy # betahat=(X’X)ˆ(-1) X’y
betahat # coefficient estimates
>
[,1]
[1,] 0.17825
[2,] 0.02431
[3,] 0.00669
# Best Alternative (Most Stable)
betahat <- qr.solve(X, y)
betahat
>
(Intercept) bodywt lsize
0.178246962 0.024306344 0.006690331
30
Details (cont’d)
# Calculation of fitted values:
yhat <- crossprod(t(X), betahat)
yhat
>
[,1]
1 0.428
2 0.436
.........
20 0.406
31
Details (cont’d)
SSE <- crossprod(y, y) - crossprod(y, yhat)
# SSE = y’(I-H)y = y’y - y’yhat
SSE
>
[,1]
[1,] 0.00243
MSE <- SSE/(length(y)-3)
MSE # error variance estimate
>
[,1]
[1,] 0.000143
32
Allometry
• An allometric growth model is most appropriate for modeling the relation between brainwt and bodywt:
$$\text{brainwt} = e^{\beta_0 + \varepsilon}\,\text{bodywt}^{\beta_1}$$
i.e.
$$\log(\text{brainwt}) = \beta_0 + \beta_1\log(\text{bodywt}) + \varepsilon$$
where $\varepsilon$ is $N(0, \sigma^2)$.
litters.lm <- lm(log(brainwt) ˜ log(bodywt),
data = litters)
summary(litters.lm)
Coefficients:
Estimate Std. Error t value
(Intercept) -1.2835 0.0814 -15.76
log(bodywt) 0.2004 0.0399 5.02
33
Allometry (cont’d)
• As expected, larger brains are associated with larger bodies, but the relation is not linear. The hypothesis of interest here is $\beta_1 = 1$, not $\beta_1 = 0$, so we should ignore the t-value and p-value given in the default output. Instead, we may be interested in the following test:
$$H_0: \beta_1 = 1 \qquad H_1: \beta_1 \neq 1$$
$$t_0 = \frac{\hat\beta_1 - 1}{\text{s.e.}(\hat\beta_1)} = \frac{.2 - 1}{.0399} = -20.1$$
pt(-20.1, 18)  # one-tailed area: 4.42e-14 (double it for the two-sided p-value)
• Conclusion: if the allometric model assumptions hold, the exponent is not 1.
34
Allometry (cont’d)
• However, the linear model may be a good approximation. The following code allows for comparison of the two fitted models (see the next figure):
plot(brainwt ˜ bodywt, data = litters, pch=16)
litters.lm <- lm(brainwt ˜ bodywt, data = litters)
abline(litters.lm,lwd=2)
litters.lm <- lm(log(brainwt) ˜ log(bodywt),
data = litters)
coeffs <- coef(litters.lm)
MSE <- summary(litters.lm)$sigma^2
x <- seq(min(litters$bodywt), max(litters$bodywt), length = 100)
lines(x, x^(coeffs[2]) * exp(coeffs[1] + MSE/2), col = 2, lwd = 2)
35
Allometry (cont’d)
Which model is more accurate? Does it make a real difference in this case?
# alternative: litters.comparison()
36
[Figure: brainwt vs bodywt for the litters data, with the fitted allometric model and linear model curves]
4.3 Confidence Intervals and Hypothesis Testing
• Confidence Intervals and Tests for $\beta_j$, $j = 0, 1, \dots, k$
– Recall that
$$\mathrm{Var}(\hat\beta) = \sigma^2(X^TX)^{-1}$$
– Let $c_{jj}$ be the $j$th diagonal element of $(X^TX)^{-1}$.
– Then the variance of $\hat\beta_j$ is
$$\mathrm{Var}(\hat\beta_j) = \sigma^2 c_{jj}$$
– An estimate of the standard error of $\hat\beta_j$ is
$$\text{s.e.}(\hat\beta_j) = \sqrt{\widehat{\mathrm{Var}}(\hat\beta_j)} = \sqrt{\mathrm{MSE}\,c_{jj}}$$
37
Confidence Intervals and Tests (cont'd)
• Since $\hat\beta_j$ has a normal distribution and $\mathrm{SSE}/\sigma^2$ has a chi-squared distribution on $n - k - 1$ degrees of freedom,
$$\frac{\hat\beta_j - \beta_j}{\sqrt{\mathrm{MSE}\,c_{jj}}} \sim t_{n-k-1}$$
• A $(1 - \alpha)$ confidence interval for $\beta_j$ is
$$\hat\beta_j \pm t_{n-k-1,\,\alpha/2}\sqrt{\mathrm{MSE}\,c_{jj}}$$
38
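The interval formula can be applied directly to reported summary output. A sketch (in Python, for illustration) recomputing a 95% confidence interval for the bodywt coefficient in the litters fit shown earlier, using the reported estimate and standard error; the critical value $t_{17,.025} \approx 2.11$ is hard-coded from tables:

```python
# Estimate and standard error for bodywt, from the litters summary above.
beta_hat, se = 0.02431, 0.00678
t_crit = 2.11          # t_{17, .025}, from tables

lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
```

Both endpoints are positive, consistent with the small p-value (.0023) reported for bodywt.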
Example
• log.hills
* 35 observations taken on the winning times to run the Scottish hill races
* predictor variables: log.climb, log.dist
* response: log.time
data(hills)
log.hills <- log(hills)
names(log.hills) <- c("log.dist",
"log.climb", "log.time")
39
Example (cont’d)
log.hills
>
log.dist log.climb log.time
1 0.88 6.5 -1.317
2 1.79 7.8 -0.216
..............................
35 3.00 8.5 0.980
• Fit a linear regression model to these data and test whether the coefficient of log.climb differs from 0. Find a 95% confidence interval for this coefficient.
40
Solution
• The model matrix $X$ and response vector $y$ are

X:                 y:
1  0.88  6.5       -1.317
1  1.79  7.8       -0.216
.............      ......
1  3.00  8.5        0.980

$X^Ty$:
  1(-1.317) +    1(-.216) + ... +    1(.980) = -10.7
.88(-1.317) + 1.79(-.216) + ... + 3.00(.980) = -6.9
6.5(-1.317) +  7.8(-.216) + ... +  8.5(.980) = -62.5
41
Solution (cont’d)
$$X^TX = \begin{bmatrix} 35 & 64 & 251 \\ 64 & 129 & 471 \\ 251 & 471 & 1826 \end{bmatrix} \qquad (X^TX)^{-1} = \begin{bmatrix} 2.89 & 0.302 & -0.476 \\ 0.30 & 0.164 & -0.084 \\ -0.48 & -0.084 & 0.088 \end{bmatrix}$$
$$\hat\beta = (X^TX)^{-1}X^Ty = \begin{bmatrix} -3.17 \\ 0.89 \\ 0.17 \end{bmatrix}$$
42
Solution (cont'd)
The fitted regression model is
$$\hat{y} = -3.17 + .89\,\text{log.dist} + .17\,\text{log.climb}$$
The error variance is estimated as follows:
$$\mathrm{SSE} = y^Ty - \hat\beta^TX^Ty = (-1.317)^2 + \cdots + (.98)^2 - [-3.17(-10.7) + .89(-6.9) + .17(-62.5)] = 20 - 17 = 3.0$$
$p = k + 1 = 3$, so the degrees of freedom for error are $35 - 3 = 32$.
The estimate of the error variance is $\mathrm{MSE} = 3/32 = .094$.
43
Solution (cont’d)
Standard errors for the estimated coefficients of log.dist and log.climb:
$$\text{s.e.}(\hat\beta_1) = \sqrt{c_{11}\mathrm{MSE}} = \sqrt{.164(.094)} = .12$$
$$\text{s.e.}(\hat\beta_2) = \sqrt{c_{22}\mathrm{MSE}} = \sqrt{.088(.094)} = .091$$
$$H_0: \beta_2 = 0 \qquad H_1: \beta_2 \neq 0 \qquad t = \frac{.17 - 0}{.091} = 1.9$$
The p-value is .066: not very strong evidence that $\beta_2 \neq 0$.
A 95% confidence interval for $\beta_2$ is
$$.17 \pm 2.04(.091) = .17 \pm .186$$
44
Prediction Intervals for New Observations
• Estimate $\beta$ and predict the value of $y$ at a new vector $x_0$:
$$y_0 = x_0^T\beta + \varepsilon_0 \qquad \hat{y}_0 = x_0^T\hat\beta$$
$$\mathrm{Var}(y_0 - \hat{y}_0) = \sigma^2 + \sigma^2 x_0^T(X^TX)^{-1}x_0$$
Therefore, a prediction interval is
$$\hat{y}_0 \pm t_{n-p,\,\alpha/2}\sqrt{\mathrm{MSE}\big(1 + x_0^T(X^TX)^{-1}x_0\big)}$$
45
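The prediction interval is always wider than the confidence interval for the mean response at the same $x_0$, because of the extra $\sigma^2$ term. A sketch on made-up data (illustrative, not the litters or hills data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y
e = y - X @ betahat
mse = e @ e / (n - 2)

x0 = np.array([1.0, 0.5])
se_new = np.sqrt(mse * (1 + x0 @ XtX_inv @ x0))   # new observation
se_mean = np.sqrt(mse * (x0 @ XtX_inv @ x0))      # mean response
```

The squared standard errors differ by exactly MSE, the estimate of $\sigma^2$.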
Confidence Intervals for the Mean Response
$$E[y \mid x = x_0] = x_0^T\beta$$
• Estimate this with $\hat{y}_0 = x_0^T\hat\beta$. Then the C.I. is
$$\hat{y}_0 \pm t_{n-p,\,\alpha/2}\sqrt{\mathrm{MSE}\;x_0^T(X^TX)^{-1}x_0}$$
46
Simultaneous Confidence Intervals
• If $\hat\beta$ is the least-squares estimator for the $p$-vector $\beta$, then
$$\frac{(\hat\beta - \beta)^TX^TX(\hat\beta - \beta)/p}{\mathrm{MSE}} \sim F_{p,n-p}$$
• A $1 - \alpha$ joint confidence region for all of the parameters in $\beta$ is then given by the region in $p$-dimensional space defined by
$$\frac{(\hat\beta - \beta)^TX^TX(\hat\beta - \beta)/p}{\mathrm{MSE}} \leq F_{p,n-p,\alpha}$$
47
Example
• litters data
Recall:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.178247   0.075323   2.366  0.03010 *
bodywt      0.024306   0.006779   3.586  0.00228 **
lsize       0.006690   0.003132   2.136  0.04751 *

A 95% confidence region for $(\beta_0, \beta_1, \beta_2)$ will be centered at $(.178, .024, .0067)$. The confidence region given by
$$\frac{(\hat\beta - \beta)^TX^TX(\hat\beta - \beta)/p}{\mathrm{MSE}} \leq F_{3,17,.05}$$
is an ellipsoid in 3-dimensional space.
48
Example (cont’d)
• We can get an idea of what this ellipsoid looks like by testing randomly generated points $\beta$ in a neighborhood of $\hat\beta$ to see whether they exceed $F_{3,17,.05} = 3.19$. e.g. for $\beta = (.177, .023, .0068)$:

qf(cr.test(c(.177,.023,.0068)),3,17)
>
[1] 5.383459

This exceeds 3.19, so it is not in the 95% confidence ellipsoid.
The function litters.cr() can be used to see what the ellipsoid looks like for fixed values of $\beta_0$. What we see are essentially cross-sections of the ellipsoid in the $(\beta_1, \beta_2)$ space as we vary $\beta_0$.
49
Bonferroni Intervals
• This is a simpler method:
– In order to have $(1 - \alpha)$ confidence that $\ell$ confidence intervals are all correct, we can use
$$\hat\beta_j \pm t_{\alpha/(2\ell),\,n-p}\;\text{s.e.}(\hat\beta_j)$$
– e.g. For the litters data, simultaneous 95% confidence intervals for $\beta_1$ and $\beta_2$ are
$$.024 \pm t_{.05/4,17}(.0068) = .024 \pm 2.45(.0068) = .024 \pm .017$$
and
$$.0067 \pm t_{.05/4,17}(.0031) = .0067 \pm .0076$$
– If we had wanted simultaneous 95% confidence intervals for all 3 parameters, we would have had to use $t_{.05/6,17} = 2.65$.
50
Scheffe Intervals
• Similar idea to Bonferroni, but only applicable when $\ell = p$, the number of coefficients:
$$\hat\beta_j \pm (2F_{\alpha,p,n-p})^{1/2}\,\text{s.e.}(\hat\beta_j), \qquad j = 0, 1, \dots, k$$
51
4.4 Testing Several Coefficients; Extra Sums of Squares
• Partition or split:
$$\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$$
• $\beta_2$ contains the $r$ coefficients that we want to test. Are they all 0?
• Partition $X$ accordingly:
$$X = [X_1\ X_2]$$
• Then the full model is
$$y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon$$
52
Extra Sums of Squares (cont’d)
• Under the full model, define the (uncorrected) regression sum of squares as
$$\mathrm{SSR}(\beta) = \hat\beta^TX^Ty$$
• If $\beta_2 = 0$, then we have the reduced model
$$y = X_1\beta_1 + \varepsilon$$
• Under the reduced model, the regression sum of squares is
$$\mathrm{SSR}(\beta_1) = \hat\beta_1^TX_1^Ty$$
53
Extra Sums of Squares (cont’d)
• Recall that the regression sum of squares indicates the amount of variability in the response explained by the regression.
• By adding $\beta_2$ to the model, we are able to explain more of the variability in the response than with the reduced model ($\beta_1$ only). The difference is
$$\mathrm{SSR}(\beta_2|\beta_1) = \mathrm{SSR}(\beta) - \mathrm{SSR}(\beta_1) = \hat\beta^TX^Ty - \hat\beta_1^TX_1^Ty$$
54
Extra Sums of Squares (cont’d)
• To test $H_0: \beta_2 = 0$, use
$$F_0 = \frac{\mathrm{SSR}(\beta_2|\beta_1)/r}{\mathrm{MSE}}$$
• Degrees of freedom: $\mathrm{SSR}(\beta)$ has $k + 1$ d.f. and $\mathrm{SSR}(\beta_1)$ has $k - r + 1$ d.f., so the difference has $k + 1 - (k - r + 1) = r$ d.f.
• The above test is called a partial F-test.
55
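The partial F statistic only needs the error sums of squares from the two fitted models. A sketch using the cfseal numbers that appear in the anova() output later in this section ($\mathrm{SSE}_{red} = 0.90273$, $\mathrm{SSE}_{full} = 0.72763$, $r = 2$, 26 error d.f.):

```python
# Error sums of squares for the reduced and full cfseal models
# (values taken from the anova() output in this section).
sse_red, sse_full = 0.90273, 0.72763
r, df_err = 2, 26        # extra coefficients tested; full-model error d.f.

extra_ss = sse_red - sse_full            # = SSR(beta_2 | beta_1)
f0 = (extra_ss / r) / (sse_full / df_err)
```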
Example
• cfseal data
– Reduced Model:
$$\log(\text{heart}) = \beta_0 + \beta_1\log(\text{weight}) + \varepsilon$$

cfseal.red <- lm(log(heart)~log(weight))
coef(cfseal.red)
>
[1] 1.20 1.13

56
Example (cont'd)
$\beta_1 = [\beta_0\ \beta_1]^T$

$X_1^Ty$:

crossprod(model.matrix(cfseal.red), log(heart))
>
            [,1]
(Intercept)  165
log(weight)  643
57
Example (cont’d)
$\hat\beta_1^TX_1^Ty$:
$$1.20(165) + 1.13(643) = 923$$
* Full Model:
$$\log(\text{heart}) = \beta_0 + \beta_1\log(\text{weight}) + \cdots + \varepsilon$$
Other variables (without missing data) include log(stomach) and log(kidney).
* $k = 3$ ($p = 4$), $n = 30$
* $\beta_2 = [\beta_2\ \beta_3]^T$
58
Using R
attach(cfseal)
cfseal.full <- lm(log(heart) ~ log(weight) +
                  log(stomach) + log(kidney))
summary(cfseal.full)
>
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.383      0.432    0.89  0.38345
log(weight)     0.723      0.190    3.80  0.00078
log(stomach)    0.246      0.199    1.24  0.22652
log(kidney)     0.132      0.284    0.46  0.64681

Residual standard error: 0.167 on 26 degrees of freedom
59
Example (cont’d)
Note that all of the p-values for the other individual tests are large. Does this mean that we should conclude $\beta_2 = \beta_3 = 0$?
$$\mathrm{MSE} = .167^2 = .0279$$
$X^Ty$:

crossprod(model.matrix(cfseal.full), log(heart))
             [,1]
(Intercept)   165
log(weight)   643
log(stomach) 1076
log(kidney)   987
60
Example (cont’d)
$\mathrm{SSR}(\beta_2|\beta_1) = \hat\beta^TX^Ty - \hat\beta_1^TX_1^Ty$:

coef(cfseal.full)%*%
  crossprod(model.matrix(cfseal.full), log(heart)) -
  crossprod(coef(cfseal.red),
    t(model.matrix(cfseal.red)))%*%log(heart)
>
      [,1]
[1,] 0.175

$$F_0 = \frac{.175/2}{.0279} = 3.1$$
61
Example (cont’d)
p-value:
> 1 - pf(3.14, 2,26)
[1] 0.06
(We have two numerator degrees of freedom because we are testing $\beta_2 = \beta_3 = 0$.)
Conclusion: weak evidence against the null hypothesis.
62
Automatic Method in R
• Quick R way to do this partial F-test:
cfseal.full <- lm(log(heart) ˜ log(weight) +
log(stomach) + log(kidney))
cfseal.red <- lm(log(heart) ˜ log(weight))
anova(cfseal.red,cfseal.full)
>
Analysis of Variance Table
Model 1: log(heart) ˜ log(weight)
Model 2: log(heart) ˜ log(weight) +
log(stomach) + log(kidney)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 28 0.90273
2 26 0.72763 2 0.17510 3.1284 0.06061 .
63
Sequential Sums of Squares
• We can use the general relationship
$$\mathrm{SSR}(\beta_2|\beta_1) = \mathrm{SSR}(\beta) - \mathrm{SSR}(\beta_1)$$
to build up SSR from individual components called the sequential sums of squares.
Start with $\mathrm{SSR}(\beta_0) = \frac{1}{n}y^TJy$. Then
$$\mathrm{SSR}(\beta_1|\beta_0) = \mathrm{SSR}(\beta_0,\beta_1) - \mathrm{SSR}(\beta_0) = [\hat\beta_0\ \hat\beta_1]\,[\mathbf{1}\ x_1]^Ty - \frac{1}{n}y^TJy$$
(This is the 'corrected' regression sum of squares that we defined earlier.)
64
Sequential Sums of Squares (cont’d)
$$\mathrm{SSR}(\beta_2|\beta_0,\beta_1) = \mathrm{SSR}(\beta_0,\beta_1,\beta_2) - \mathrm{SSR}(\beta_0,\beta_1)$$
$$\mathrm{SSR}(\beta_3|\beta_0,\beta_1,\beta_2) = \mathrm{SSR}(\beta_0,\beta_1,\beta_2,\beta_3) - \mathrm{SSR}(\beta_0,\beta_1,\beta_2)$$
Continuing, one can obtain all of the sequential sums of squares.
• Direct evaluation using R: after fitting the full model, type
anova(cfseal.full)
>
Analysis of Variance Table
Response: log(heart)
Df Sum Sq Mean Sq F value Pr(>F)
log(weight) 1 13.68 13.68 488.75 <2e-16
log(stomach) 1 0.17 0.17 6.04 0.021
log(kidney) 1 0.01 0.01 0.21 0.647
Residuals 26 0.73 0.03
65
Example (cont’d)
$$\mathrm{SSR}(\beta_1|\beta_0) = 13.68 \qquad \mathrm{SSR}(\beta_2|\beta_0,\beta_1) = .17 \qquad \mathrm{SSR}(\beta_3|\beta_0,\beta_1,\beta_2) = .01$$
• For our test of $\beta_2 = \beta_3 = 0$, we were interested in $\mathrm{SSR}(\beta_2,\beta_3|\beta_0,\beta_1)$:
$$\mathrm{SSR}(\beta_0,\beta_1,\beta_2,\beta_3) - \mathrm{SSR}(\beta_0,\beta_1)$$
66
Sequential Sums of Squares (cont’d)
• Note that
$$\mathrm{SSR}(\beta_2|\beta_0,\beta_1) = \mathrm{SSR}(\beta_0,\beta_1,\beta_2) - \mathrm{SSR}(\beta_0,\beta_1)$$
and
$$\mathrm{SSR}(\beta_3|\beta_0,\beta_1,\beta_2) = \mathrm{SSR}(\beta_0,\beta_1,\beta_2,\beta_3) - \mathrm{SSR}(\beta_0,\beta_1,\beta_2)$$
Therefore,
$$\mathrm{SSR}(\beta_2,\beta_3|\beta_0,\beta_1) = \mathrm{SSR}(\beta_2|\beta_0,\beta_1) + \mathrm{SSR}(\beta_3|\beta_0,\beta_1,\beta_2) = .17 + .01 = .18$$
$$F_0 = \frac{.18/2}{.73/26} = 3.2$$
• Exercise 1: Conduct the F-test for $\beta_3 = 0$ when the reduced model includes $\beta_0$, $\beta_1$ and $\beta_2$.
67
• Exercise 2: Show that
$$\mathrm{SSR}(\beta_3,\beta_4|\beta_0,\beta_1,\beta_2) = \mathrm{SSR}(\beta_3|\beta_0,\beta_1,\beta_2) + \mathrm{SSR}(\beta_4|\beta_0,\beta_1,\beta_2,\beta_3)$$
Orthogonal Columns in X
• If the columns of $X_1$ are orthogonal to the columns of $X_2$, then
$$X_1^TX_2 = 0 \quad\text{and}\quad X_2^TX_1 = 0$$
Then, under the full model,
$$\hat\beta = (X^TX)^{-1}X^Ty = \begin{bmatrix} X_1^TX_1 & 0 \\ 0 & X_2^TX_2 \end{bmatrix}^{-1}\begin{bmatrix} X_1^Ty \\ X_2^Ty \end{bmatrix}$$
so that
$$\hat\beta = \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix}$$
These are the estimates that would have been obtained from the separate reduced models:
$$y = X_1\beta_1 + \varepsilon \quad\text{and}\quad y = X_2\beta_2 + \varepsilon$$
68
Orthogonal Columns (cont'd)
• We can then show that
$$\mathrm{SSR}(\beta_2|\beta_1) = \mathrm{SSR}(\beta_2)$$
since
$$\mathrm{SSR}(\beta) - \mathrm{SSR}(\beta_1) = \hat\beta^TX^Ty - \hat\beta_1^TX_1^Ty = \hat\beta_2^TX_2^Ty = \mathrm{SSR}(\beta_2)$$
• If $X_1$ and $X_2$ are not orthogonal, we have
$$\hat\beta \neq \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix}$$
so
$$\mathrm{SSR}(\beta_2|\beta_1) \neq \mathrm{SSR}(\beta_2)$$
69
Testing the General Linear Hypothesis
• Suppose $T$ is an $r \times p$ matrix ($r \leq p$).
• General Linear Hypothesis:
$$H_0: T\beta = 0$$
• $T\beta$ is estimated by $T\hat\beta$.
$$\mathrm{Var}(T\hat\beta) = T\,\mathrm{Var}(\hat\beta)\,T^T = \Sigma = \sigma^2T(X^TX)^{-1}T^T$$
70
General Linear Hypothesis (cont’d)
• Under $H_0$,
$$\hat\beta^TT^T\Sigma^{-1}T\hat\beta \sim \chi^2(r).$$
• To see this, first note that under $H_0$,
$$T\hat\beta = T(X^TX)^{-1}X^T\varepsilon$$
so that
$$\hat\beta^TT^TC^{-1}T\hat\beta = \varepsilon^TX(X^TX)^{-1}T^TC^{-1}T(X^TX)^{-1}X^T\varepsilon$$
where $C = \Sigma/\sigma^2 = T(X^TX)^{-1}T^T$. The matrix $X(X^TX)^{-1}T^TC^{-1}T(X^TX)^{-1}X^T$ is idempotent, and
$$\mathrm{tr}\big(X(X^TX)^{-1}T^TC^{-1}T(X^TX)^{-1}X^T\big) = \mathrm{tr}(C^{-1}C) = \mathrm{tr}(I_r) = r$$
[$C$ is an $r \times r$ matrix.]
71
General Linear Hypothesis (cont’d)
• Finally, we note that
$$\varepsilon^TX(X^TX)^{-1}T^TC^{-1}T(X^TX)^{-1}X^T\varepsilon/\sigma^2 = \varepsilon^TX(X^TX)^{-1}T^T\Sigma^{-1}T(X^TX)^{-1}X^T\varepsilon$$
which implies that the latter quadratic form has a $\chi^2(r)$ distribution.
• The test statistic is
$$F_0 = \frac{\hat\beta^TT^TC^{-1}T\hat\beta/r}{\mathrm{MSE}} \sim F_{r,n-p}$$
where MSE is computed for the full model (with $p$ parameters).
72
General Linear Hypothesis (cont’d)
• To see that this is a valid F statistic (under $H_0$), we need to verify that
$$(I - H)\big[X(X^TX)^{-1}T^TC^{-1}T(X^TX)^{-1}X^T\big] = 0$$
since this is the product of the matrices of the quadratic forms for the numerator sum of squares and the error sum of squares. Since $H = X(X^TX)^{-1}X^T$, the required result follows almost immediately. Therefore, the numerator and denominator sums of squares are independent of each other.
73
Example
• Test the equality of all regression coefficients: $\beta_0 = \beta_1 = \beta_2 = \cdots = \beta_k$.
T = 1 -1 0 ... 0 0
0 1 -1 ... 0 0
................
0 0 0 ... 1 -1
T is a k × (k + 1) matrix, so F0 ∼ Fk,n−k−1 under H0.
74
4.5 The ANOVA Test for Significance of Regression
• Model:
$$y = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k + \varepsilon \qquad\text{i.e.}\qquad y = X\beta + \varepsilon$$
• Some observations:
$$(I - H)X = 0 \quad\text{and}\quad X^T(I - H) = 0$$
$$\mathrm{tr}(H) = \mathrm{tr}([X(X^TX)^{-1}]X^T) = \mathrm{tr}(X^TX(X^TX)^{-1}) = k + 1$$
$$\mathrm{SSE} = y^T(I - H)y = \varepsilon^T(I - H)\varepsilon = \mathrm{tr}(\varepsilon^T(I - H)\varepsilon) = \mathrm{tr}((I - H)\varepsilon\varepsilon^T)$$
75
ANOVA (cont’d)
• Unbiased estimation of $\sigma^2$:
$$E[\mathrm{SSE}] = \mathrm{tr}(I - H)\sigma^2 = (n - k - 1)\sigma^2$$
so an unbiased estimator for $\sigma^2$ is
$$\mathrm{MSE} = \mathrm{SSE}/(n - k - 1)$$
• Partitioning the variation in the responses
– Recall from simple linear regression:
$$\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 = \mathrm{SSR} + \mathrm{SSE}$$
$$\mathrm{MSR} = \mathrm{SSR} = \mathrm{TSS} - \mathrm{SSE}, \qquad E[\mathrm{MSR}] = \sigma^2 + \beta_1^2 S_{xx}$$
76
ANOVA (cont’d)
• What about TSS − SSE in the multiple regression case?
$$\mathrm{TSS} = y^T\Big(I - \frac{1}{n}J\Big)y$$
where $J = \mathbf{1}\mathbf{1}^T$ is the matrix of all 1's, and
$$\mathrm{SSE} = y^T(I - H)y$$
Therefore,
$$\mathrm{SSR} = \mathrm{TSS} - \mathrm{SSE} = y^T\Big(H - \frac{1}{n}J\Big)y$$
77
ANOVA (cont’d)
$$E[\mathrm{SSR}] = \beta^TX^T\Big(H - \frac{1}{n}J\Big)X\beta + E\Big[\varepsilon^T\Big(H - \frac{1}{n}J\Big)\varepsilon\Big] = \beta^TX^T\Big(H - \frac{1}{n}J\Big)X\beta + E\Big[\mathrm{tr}\Big(\Big(H - \frac{1}{n}J\Big)\varepsilon\varepsilon^T\Big)\Big]$$
$$E\Big[\mathrm{tr}\Big(\Big(H - \frac{1}{n}J\Big)\varepsilon\varepsilon^T\Big)\Big] = \mathrm{tr}\Big(H - \frac{1}{n}J\Big)\sigma^2 = (k + 1 - 1)\sigma^2$$
so
$$E[\mathrm{SSR}] = \beta^TX^T\Big(H - \frac{1}{n}J\Big)X\beta + k\sigma^2$$
78
ANOVA (cont’d)
• If $\beta = 0$, the first term vanishes.
• Even if $\beta_0$ is nonzero, the first term vanishes when $\beta_1 = \cdots = \beta_k = 0$:
$$E[\mathrm{SSR}] = k\sigma^2 + \beta_0^2\mathbf{1}^T\Big(I - \frac{1}{n}J\Big)\mathbf{1} = k\sigma^2.$$
• Thus, if $\beta_1 = \cdots = \beta_k = 0$, then another unbiased estimator of $\sigma^2$ is
$$\mathrm{MSR} = \mathrm{SSR}/k$$
79
Quadratic Forms, Chi-squares, and Independence
• Assume $\beta_1 = \cdots = \beta_k = 0$.
• $\mathrm{SSR}/\sigma^2$ has a $\chi^2(k)$ distribution.
• Note the relation between the degrees of freedom and the trace of $(H - \frac{1}{n}J)$, the matrix of the quadratic form SSR. Also, note that this matrix is idempotent and symmetric.
• $\mathrm{SSE} = y^T(I - H)y$.
80
Quadratic Forms, Chi-squares, and Independence
• $I - H$ is idempotent and symmetric with trace $n - k - 1$. Therefore, $\mathrm{SSE}/\sigma^2$ has a $\chi^2$ distribution on $n - k - 1$ degrees of freedom.
• $(I - H)(H - \frac{1}{n}J) = 0$, so SSE and SSR are independent.
• Hence,
$$F_0 = \frac{\mathrm{MSR}}{\mathrm{MSE}}$$
has an F distribution on $(k, n - k - 1)$ degrees of freedom.
• If some of the $\beta$'s are nonzero, then $F_0$ will tend to be larger than an $F_{k,n-k-1}$ random variable.
81
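The decomposition TSS = SSR + SSE via the three quadratic forms can be verified numerically. A sketch on made-up data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
J = np.ones((n, n))                     # matrix of all 1's

tss = y @ (np.eye(n) - J / n) @ y
sse = y @ (np.eye(n) - H) @ y
ssr = y @ (H - J / n) @ y
```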
The ANOVA table
For testing
H0 : β1 = · · · = βk = 0
vs.
H1 : at least one coefficient is nonzero.
Source       df        SS    MS    F
Regression   k         SSR   MSR   F0 = MSR/MSE
Error        n-k-1     SSE   MSE
Total        n-1       TSS

Reject $H_0$ if the p-value is very small, i.e. if $F_0$ is larger than $F_{k,n-k-1,\alpha}$.
82
Textbook formula for SSR
$$\mathrm{SSR} = \mathrm{TSS} - \mathrm{SSE} = \sum_{i=1}^n \hat{y}_iy_i - \frac{1}{n}\Big(\sum_{j=1}^n y_j\Big)^2 = \hat\beta^TX^Ty - \frac{1}{n}\Big(\sum_{j=1}^n y_j\Big)^2$$
83
Example
litters data ($y$ = brainwt): $\sum_{i=1}^n y_i = 8.33$, $\sum_{i=1}^n y_i^2 = 3.48$, $n = 20$, $k = 2$ (bodywt and lsize).
$$\hat\beta = [0.178\ \ 0.0243\ \ 0.00669]^T \qquad X^Ty = [8.33\ \ 64.95\ \ 61.85]^T$$
$$\mathrm{TSS} = 3.48 - \frac{1}{20}(8.33^2) = .00695$$
$$\mathrm{SSR} = .178(8.33) + .0243(64.95) + .00669(61.85) - \frac{8.33^2}{20} = .00452$$
84
Example (cont’d)
Source       df   SS        MS        F
Regression    2   0.00452   0.00226   F0 = 15.8
Error        17   0.00243   0.000143
Total        19   0.00695

p-value:
> 1 - pf(15.8, 2, 17)
[1] 0.000133

Conclusion: Reject $H_0$. There is a relation between brainwt and the explanatory variables (bodywt and lsize).
85
Hidden Extrapolation
• When making predictions, it is important not to extrapolate beyondthe range of the given data.
• In simple regression, it is obvious when one is extrapolating:
one is predicting y outside the range of given x-values.
• In multiple regression, extrapolation is not obvious.
• The diagonal elements $h_{ii}$ of the hat matrix $H$ can be useful in determining when one is extrapolating.
• $h_{ii}$ gives an idea of the distance from the $i$th observation to the 'center' of the observations:
$$h_{ii} = x_i^T(X^TX)^{-1}x_i$$
86
Example
• The following can be used to identify the hat diagonal elements for some or all of the observations in the litters data:

attach(litters)
litters.lm <- lm(brainwt ~ bodywt + lsize)
extrap.fn(litters.lm, litters, n=3)  # see the accompanying plot
detach(litters)
87
[Figure: bodywt vs lsize scatterplot with hat diagonal values (e.g. 0.17, 0.08, 0.43) labeling selected points]
• Note how the $h_{ii}$ values are largest for those observations near the 'edge' of the data.
Hidden Extrapolation (cont'd)
• If we want to predict $y$ at $x_0$, then we will be extrapolating if
$$x_0^T(X^TX)^{-1}x_0 > h_{ii}$$
for all $i = 1, 2, \dots, n$, i.e. if
$$x_0^T(X^TX)^{-1}x_0 > h_{\max}$$
88
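The $h_{\max}$ rule is easy to implement directly. A sketch (made-up data, with a deliberately distant $x_0$; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 15
X = np.column_stack([np.ones(n), rng.uniform(0, 1, size=(n, 2))])
XtX_inv = np.linalg.inv(X.T @ X)

# Diagonal of the hat matrix: h_ii = x_i' (X'X)^{-1} x_i.
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
hmax = h.max()

x0 = np.array([1.0, 5.0, 5.0])   # far outside the observed predictor range
extrapolating = x0 @ XtX_inv @ x0 > hmax
```

With an intercept in the model, each $h_{ii}$ lies between $1/n$ and 1, so any quadratic form much larger than 1 is an immediate red flag.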
Example
• Suppose we want to predict the brain weight for a mouse whose body weight is 7 and who came from a litter of size 7. Are we extrapolating?
$$x_0 = [1\ 7\ 7]^T$$
The following function evaluates $x_0^T(X^TX)^{-1}x_0$. We are extrapolating if this value exceeds .43.
89
Example (cont’d)
hidden.extrap(litters.lm, c(1, 7,7))
>
[,1]
[1,] 0.353
predict(litters.lm, newdata =
data.frame(bodywt=7,lsize=7),
interval="prediction")
>
fit lwr upr
[1,] 0.395 0.366 0.425
90
Example (cont’d)
• Suppose we now want to predict the brain weight for a mouse whose body weight is 7 and who came from a litter of size 12. Are we extrapolating?
hidden.extrap(litters.lm, c(1, 7,12))
>
[,1]
[1,] 0.665
Since this is larger than .43, we must conclude that we are extrapolating.
91
4.6 Weighted Least Squares
• Consider the regression-through-the-origin model
$$y_i = \beta_1x_i + \varepsilon_i$$
with $E[\varepsilon_i] = 0$, and suppose $V(y_i|x_i) = \sigma^2/w_i$, where $w_i$ is a known weight, i.e.
$$E[\varepsilon_i^2] = \sigma^2/w_i$$
• The least-squares estimate was previously found by minimizing $\sum_{i=1}^n \varepsilon_i^2$:
$$\hat\beta_1 = \frac{\sum x_iy_i}{\sum x_i^2}$$
• Gauss-Markov Theorem: when the variances are constant, $\hat\beta_1$ has the smallest variance of any linear unbiased estimator of $\beta_1$.
92
Weighted Least Squares (cont’d)
• $\hat\beta_1$ is not the best linear unbiased estimator for $\beta_1$ when there are weights $w_i$.
• To find the BLUE now, multiply the model by $a_i$:
$$a_iy_i = a_i\beta_1x_i + a_i\varepsilon_i \qquad\text{or}\qquad y_i^* = \beta_1x_i^* + \varepsilon_i^*$$
Compute $\tilde\beta_1$ for the new data $(x_i^*, y_i^*)$:
$$\tilde\beta_1 = \frac{\sum x_i^*y_i^*}{\sum (x_i^*)^2}$$
$$E[\tilde\beta_1] = \beta_1 \text{ (unbiased)} \qquad V(\tilde\beta_1) = \frac{\sigma^2\sum x_i^2a_i^4/w_i}{\big(\sum a_i^2x_i^2\big)^2}$$
93
Weighted Least Squares (cont’d)
• How do we choose $a_1, a_2, \dots, a_n$ to make this as small as possible?
• Recall the Cauchy-Schwarz inequality:
$$\Big(\sum_{i=1}^n u_iv_i\Big)^2 \leq \Big(\sum_{j=1}^n u_j^2\Big)\Big(\sum_{k=1}^n v_k^2\Big)$$
(equality holds if the $u_i$'s are proportional to the $v_i$'s: $u_i = cv_i$).
• Apply this to the denominator of our variance, with $u_i = a_i^2x_i/\sqrt{w_i}$ and $v_i = \sqrt{w_i}\,x_i$:
$$\Big(\sum_{i=1}^n a_i^2x_i^2\Big)^2 \leq \Big(\sum_{i=1}^n a_i^4x_i^2/w_i\Big)\Big(\sum_{i=1}^n w_ix_i^2\Big)$$
(equality holds when $a_i^4x_i^2/w_i$ is proportional to $w_ix_i^2$, e.g. $a_i = \sqrt{w_i}$).
94
Weighted Least Squares (cont’d)
• Thus, $V(\tilde\beta_1)$ is minimized if $a_i = \sqrt{w_i}$:
$$V(\tilde\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n w_ix_i^2}$$
• Note also that
$$E[\sqrt{w_i}\,\varepsilon_i] = 0 \quad\text{and}\quad V(\sqrt{w_i}\,\varepsilon_i) = \sigma^2$$
and that instead of minimizing $\sum_{i=1}^n \varepsilon_i^2$ (ordinary least squares), we are now minimizing $\sum_{i=1}^n w_i\varepsilon_i^2$ (weighted least squares).
95
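For the through-the-origin model, the weighted estimate and the OLS estimate on the rescaled data $(\sqrt{w_i}\,x_i, \sqrt{w_i}\,y_i)$ coincide. A sketch (made-up data with $\mathrm{Var}(y_i) \propto x_i^2$, i.e. $w_i = 1/x_i^2$; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(1, 5, size=n)
w = 1 / x**2                          # known weights: Var(y_i) = sigma^2 / w_i
y = 2.0 * x + x * rng.normal(size=n)  # noise s.d. proportional to x

# Weighted least squares: minimize sum w_i (y_i - beta x_i)^2.
beta_wls = np.sum(w * x * y) / np.sum(w * x**2)

# Same thing as OLS on the rescaled data (sqrt(w) x, sqrt(w) y).
xs, ys = np.sqrt(w) * x, np.sqrt(w) * y
beta_rescaled = np.sum(xs * ys) / np.sum(xs**2)
```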
Example
• roller data
• Ordinary Least Squares:
roller.lm <- lm(depression˜weight, data=roller)
plot(roller.lm, which=1)  # residuals vs fitted values
96
Example (cont'd)
[Figure: residuals vs fitted values for lm(depression ~ weight, data = roller)]
• The residual plot indicates that the variance might not be constant.
97
Weighted Least Squares
roller.wlm <- lm(depression ~ weight,
                 data = roller, weights = 1/weight^2)
plot(roller.wlm, which=1)  # residuals vs fitted values

[Figure: residuals vs fitted values for the weighted fit of depression on weight]
– a somewhat clearer pattern: the variance does seem to be increasing
98
Weighted Least Squares
• Comparing the fitted lines:

[Figure: the roller data (depression vs weight) with the OLS and WLS fitted lines]
99
Generalized Least Squares
• Model:
$$y = X\beta + \varepsilon$$
with $E[\varepsilon] = 0$ and $E[\varepsilon\varepsilon^T] = \Sigma = \sigma^2V$.
$\Sigma$ must be symmetric and positive definite. This implies, among other things, that $\Sigma$ possesses an inverse.
• Weighted least squares is the special case where $\Sigma$ is a diagonal matrix with $ii$ element $\sigma^2/w_i$.
• $V = K^2$ for some symmetric nonsingular $K$.
100
Generalized Least Squares (cont’d)
• Consider
$$K^{-1}y = K^{-1}X\beta + K^{-1}\varepsilon$$
Note that
$$\mathrm{Var}(K^{-1}\varepsilon) = E[K^{-1}\varepsilon\varepsilon^TK^{-1}] = K^{-1}\sigma^2VK^{-1} = \sigma^2I$$
• By multiplying through by $K^{-1}$ we now have constant variance, so $\beta$ can be estimated by least squares:
$$\hat\beta = (X^TK^{-2}X)^{-1}X^TK^{-2}y = (X^TV^{-1}X)^{-1}X^TV^{-1}y$$
• $\hat\beta$ is the generalized least-squares estimator for $\beta$.
101
Generalized Least Squares (cont’d)
• Unbiased: $E[\hat\beta] = \beta$
• Variance:
$$\mathrm{Var}(\hat\beta) = (X^TV^{-1}X)^{-1}X^TV^{-1}\Sigma V^{-1}X(X^TV^{-1}X)^{-1} = \sigma^2(X^TV^{-1}X)^{-1}$$
102
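The GLS estimator equals OLS applied to the transformed system $K^{-1}y = K^{-1}X\beta + K^{-1}\varepsilon$. A sketch (made-up $V$; the symmetric square root $K$ is computed by eigendecomposition; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=n)])
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)           # symmetric positive definite
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

# Direct GLS: betahat = (X'V^{-1}X)^{-1} X'V^{-1} y.
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Symmetric square root K of V, so V = K^2, then OLS on the
# transformed system K^{-1}y = K^{-1}X beta + K^{-1}eps.
w_eig, U = np.linalg.eigh(V)
K = U @ np.diag(np.sqrt(w_eig)) @ U.T
Xstar, ystar = np.linalg.solve(K, X), np.linalg.solve(K, y)
beta_ols_star = np.linalg.solve(Xstar.T @ Xstar, Xstar.T @ ystar)
```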