
Ch. 9 Variable Selection

• 9.1 Model Misspecification

◦ Consider linear regression of a response y on k predictor variables.

◦ Under the full model (all predictors),

$$\hat{\beta} = (X^T X)^{-1} X^T y \qquad \text{and} \qquad MSE = \frac{y^T (I - H) y}{n - k - 1}$$

◦ Under the reduced model with q parameters (q < k + 1):

$$\hat{\beta}_q = (X_q^T X_q)^{-1} X_q^T y \qquad \text{and} \qquad \hat{\sigma}_q^2 = \frac{y^T (I - H_q) y}{n - q}$$

where $H_q = X_q (X_q^T X_q)^{-1} X_q^T$.

◦ Suppose the reduced model is correct, and we use the full model. Then we have over-parametrized the problem. We have

1. $E[\hat{\beta}] = \beta$ (some of the parameters are 0) ⇒ unbiased

2. $V(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$

By including unnecessary variables, we risk introducing unnecessary multicollinearity ⇒ inflated standard errors ⇒ imprecise. Prediction is affected too: unbiased, but less precise than with the correct model.
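A small illustration of this precision cost (a hypothetical sketch, not from the notes): adding an unnecessary regressor x2 that is highly correlated with x1 inflates the standard error of the x1 coefficient, even though both fits are unbiased.

set.seed(3)
n <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)       # unnecessary and nearly collinear with x1
y <- 1 + 2*x1 + rnorm(n)            # the reduced model (x1 only) is correct
coef(summary(lm(y ~ x1)))["x1", "Std. Error"]       # smaller
coef(summary(lm(y ~ x1 + x2)))["x1", "Std. Error"]  # inflated by the collinearity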

◦ Suppose the full model is correct, and we use the reduced model. Then

$$E[\hat{\beta}_q] = \beta_q + (X_q^T X_q)^{-1} X_q^T X_r \beta_r \neq \beta_q$$

⇒ biased. $A = (X_q^T X_q)^{-1} X_q^T X_r$ is the alias matrix.

$$V(\hat{\beta}_q) = \sigma^2 (X_q^T X_q)^{-1}$$

Because of bias, we need to use

$$MSE = E[(\hat{\beta}_q - \beta_q)(\hat{\beta}_q - \beta_q)^T] = V(\hat{\beta}_q) + \bigl(\mathrm{Bias}(\hat{\beta}_q)\bigr)\bigl(\mathrm{Bias}(\hat{\beta}_q)\bigr)^T = \sigma^2 (X_q^T X_q)^{-1} + (A\beta_r)(A\beta_r)^T$$

$$E[\hat{\sigma}_q^2] = \sigma^2 + \frac{1}{n - q}\, \beta_r^T X_r^T (I - H_q) X_r \beta_r$$

⇒ $\hat{\sigma}_q^2$ over-estimates $\sigma^2$.
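The two effects above (bias in the retained coefficients and over-estimation of $\sigma^2$ when a needed regressor is dropped) can be seen in a small simulation (a hypothetical sketch, not part of the notes):

set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- 0.7*x1 + rnorm(n, sd = 0.5)    # x2 is correlated with x1
y <- 2 + 1.5*x1 + 1.0*x2 + rnorm(n)  # the full model is correct
full <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)                # wrongly drops x2
coef(reduced)["x1"]                  # biased: absorbs part of x2's effect (the alias term)
summary(full)$sigma^2                # near the true sigma^2 = 1
summary(reduced)$sigma^2             # over-estimates sigma^2, as derived above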


◦ Summary: Loss of precision occurs when too many variables are included in a model. Bias results when too few variables are included in a model.

◦ How do we decide how many variables to include in a model?

1. $R^2 = \frac{SSR(p)}{SST} = 1 - \frac{SSE(p)}{SST}$: the proportion of variation in the response explained by the (p variable) regression. Problem: it increases with p (a small R illustration follows this list).

2. Adjusted $R^2$ or residual mean square:

$$R^2_{adj} = 1 - \left(\frac{n - 1}{n - p}\right)(1 - R_p^2)$$

This does not always increase with p; maximizing it is equivalent to minimizing

$$MSE(p) = \frac{SSE(p)}{n - p}$$

This often decreases as variables are added, but sometimes begins to increase when too many are added. One might want to choose p where this first starts to level off. Better yet, one might use ...
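The small R illustration promised above (hypothetical data, not from the notes): adding pure-noise predictors always pushes $R^2$ up, while adjusted $R^2$ need not improve.

set.seed(42)
n <- 30
x1 <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n)   # z1, z2 are pure noise
y <- 1 + 2*x1 + rnorm(n)
m.small <- lm(y ~ x1)
m.big <- lm(y ~ x1 + z1 + z2)
summary(m.small)$r.squared; summary(m.big)$r.squared          # R^2 can only increase
summary(m.small)$adj.r.squared; summary(m.big)$adj.r.squared  # adjusted R^2 typically does not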

• Mallows’ Cp

◦ Consider

$$\hat{\Gamma}_p = \frac{SSE(p)}{\sigma^2} - n + 2p$$

(SSE(p) is the residual sum of squares for the p parameter model, i.e. the reduced model.)

◦ Since

$$E[SSE(p)] = \sigma^2 (n - p) + \beta_r^T X_r^T (I - H_p) X_r \beta_r,$$

it follows that

$$\Gamma_p = E[\hat{\Gamma}_p] = p + \frac{1}{\sigma^2}\, \beta_r^T X_r^T (I - H_p) X_r \beta_r$$

◦ ⇒ Choose p so that all components of $\beta_p$ are nonzero and $\beta_r = 0$.

◦ Take p to be the smallest value for which $\Gamma_p = p$.

◦ In practice, take p to be the smallest value for which $\hat{\Gamma}_p \doteq p$, since we have just shown that $\hat{\Gamma}_p$ is an unbiased estimator for $\Gamma_p$.

◦ We need an estimator for $\sigma^2$.

◦ If the full k variable model contains all important regressors (plus, possibly, some extras), then

$$E[MSE(k + 1)] = \sigma^2$$

so we can use $\hat{\sigma}^2 = MSE(k + 1)$.

◦ ⇒ We estimate $\hat{\Gamma}_p$ by

$$C_p = \frac{SSE(p)}{MSE} - n + 2p$$

◦ Example: An experiment involving 30 measurements on 5 regressors yields an MSE of 20.

1. The MSE for the best 2 parameter model is 40. Find C2 for this model. (30)

2. The MSE for the best 3 parameter model is 30. Find C3 for this model. (16.5)

3. The MSE for the best 4 parameter model is 25. Find C4 for this model. (10.5)


4. The MSE for the best 5 parameter model is 19. Find C5 for this model. (3.75)
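These answers follow directly from the Cp formula; a quick R check (numbers taken from the example above) is:

n <- 30
sigma2.hat <- 20                 # MSE of the full model (5 regressors)
p <- 2:5
mse.p <- c(40, 30, 25, 19)       # MSE(p) for the best p-parameter models
sse.p <- mse.p * (n - p)         # since MSE(p) = SSE(p)/(n - p)
sse.p / sigma2.hat - n + 2 * p   # 30.00 16.50 10.50  3.75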

• An example: the cement data

> library(MPV) # this contains the cement data set

> library(leaps) # this contains the variable selection routines

> data(cement)

> x <- cement[,-1] # this removes the y variable from

# the cement data set

> y <- cement[,1] # this is the y variable

> cement.leaps <- leaps(x,y)

> attach(cement.leaps) # this allows us to access

# the variables in cement.leaps

# directly

# e.g., Mallows Cp is one of the

# variables calculated

> plot(size, Cp) # size = no. of parameters

> abline(0,1) # reference line to see where Cp = p

> identify(size, Cp) # which models are close to Cp = p?

# Click on the

# plotted points near the reference line.

# [1] 6 14 15

which[6,] # which variables are included in model 6?

# 1 2 3 4

# TRUE FALSE FALSE TRUE

# Variables 1 and 4 are in the model.

> cement.6 <- lm(y ~ as.matrix(x[,which[6,]]))

> summary(cement.6)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 103.09738 2.12398 48.54 3.32e-13 ***

as.matrix(x[, which[6, ]])x1 1.43996 0.13842 10.40 1.11e-06 ***

as.matrix(x[, which[6, ]])x4 -0.61395 0.04864 -12.62 1.81e-07 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 2.734 on 10 degrees of freedom

Multiple R-Squared: 0.9725, Adjusted R-squared: 0.967

F-statistic: 176.6 on 2 and 10 DF, p-value: 1.581e-08

> PRESS(cement.6)

[1] 121.2244
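(PRESS, from the MPV package, is the prediction sum of squares: the sum of squared leave-one-out prediction errors, so smaller is better. As a quick check, it can be reproduced from the ordinary residuals and hat values via the standard identity PRESS = sum((e_i/(1 - h_ii))^2):)

> e <- residuals(cement.6)
> h <- lm.influence(cement.6)$hat
> sum((e/(1 - h))^2)   # matches PRESS(cement.6), about 121.2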

# What about model 14?

cement.14 <- lm(y ~ as.matrix(x[,which[14,]]))

summary(cement.14)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 203.6420 20.6478 9.863 4.01e-06 ***

as.matrix(x[, which[14, ]])x2 -0.9234 0.2619 -3.525 0.006462 **

as.matrix(x[, which[14, ]])x3 -1.4480 0.1471 -9.846 4.07e-06 ***

as.matrix(x[, which[14, ]])x4 -1.5570 0.2413 -6.454 0.000118 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1


Residual standard error: 2.864 on 9 degrees of freedom

Multiple R-Squared: 0.9728, Adjusted R-squared: 0.9638

F-statistic: 107.4 on 3 and 9 DF, p-value: 2.302e-07

> PRESS(cement.14)

[1] 146.8527 # higher than Model 6, so choose Model 6

# What about Model 15? (This is the full model.)

> cement.15 <- lm(y ~ as.matrix(x[,which[15,]]))

> summary(cement.15)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 62.4054 70.0710 0.891 0.3991

as.matrix(x[, which[15, ]])x1 1.5511 0.7448 2.083 0.0708 .

as.matrix(x[, which[15, ]])x2 0.5102 0.7238 0.705 0.5009

as.matrix(x[, which[15, ]])x3 0.1019 0.7547 0.135 0.8959

as.matrix(x[, which[15, ]])x4 -0.1441 0.7091 -0.203 0.8441

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 2.446 on 8 degrees of freedom

Multiple R-Squared: 0.9824, Adjusted R-squared: 0.9736

F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07

> PRESS(cement.15)

[1] 110.3466 # this is lower than for Model 6. Is this the best model?

plot(cement.15) # check the residual plots to ensure that everything is OK

9.2 Variable Selection Methods

• Forward Selection

* Fit models with each X variable. Choose the variable which gives the highest |t| statistic (if the p-value < 0.5; otherwise, stop). Suppose the variable Xi was chosen.

* Fit models which include Xi and each of the other X variables. Add the X variable with the largest |t| statistic (if the p-value < 0.5; otherwise, stop).

* Continue adding variables in this manner, until the first time there are no p-values less than 0.5. Then stop.

• Backward Selection

* Begin with all X variables.

* Eliminate the variable with the largest p-value.

* Continue eliminating variables until all remaining variables have p-values < 0.1.

• Stepwise

* Proceed as in Forward selection, but at each step, remove any variable whose p-value is larger than 0.15.

• All Possible Regressions (Exhaustive Search)
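Forward and backward runs with regsubsets() follow. The notes do not include a stepwise run; as a rough sketch (using AIC rather than the p-value rules above, and assuming the cement data frame loaded earlier has its response column named y), base R's step() performs a comparable search:

> cement.full <- lm(y ~ ., data = cement)
> cement.step <- step(cement.full, direction = "both", trace = FALSE)
> summary(cement.step)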

Forward example:

seal.fwd <- regsubsets(age ~ ., data = cfseal1,

method = "forward")

summary.regsubsets(seal.fwd)$cp

[1] -0.0368 -2.1432 -1.7354 -0.8088

[5] 1.1139 3.0690 5.0171 7.0098


# try 2nd model:

subset <-

c(F,summary.regsubsets(seal.fwd)$which[2,-1])

seal.lm <- lm(cfseal1[,1] ~

as.matrix(cfseal1[,subset]))

summary(seal.lm)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -3.57779 3.27459 -1.09 0.28695

lung 0.01652 0.00708 2.33 0.02973

leftkid 0.19219 0.04796 4.01 0.00064

Residual standard error: 7.3 on 21 degrees of freedom

PRESS(seal.lm)

[1] 1552

# eliminate intercept:

seal.lm <- lm(cfseal1[,1] ~ as.matrix(cfseal1[,subset])-1)

summary(seal.lm)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

lung 0.01667 0.00711 2.34 0.02856

leftkid 0.17546 0.04565 3.84 0.00088

Residual standard error: 7.33 on 22 degrees of freedom

PRESS(seal.lm)

[1] 1490

Backward Example

seal.bwd <- regsubsets(age ~ ., data = cfseal1,

method = "backward")

summary.regsubsets(seal.bwd)$cp

[1] 1.503 0.246 -1.788 -0.825

[5] 1.098 3.064 5.017 7.010

# try 3rd model:

subset <-

c(F,summary.regsubsets(seal.bwd)$which[3,-1])

seal.lm <- lm(cfseal1[,1] ~

as.matrix(cfseal1[,subset]))

PRESS(seal.lm)

[1] 1536

Exhaustive search example

seal.ex <- regsubsets(age ~ ., data = cfseal1,

method = "exhaustive")

Spline example - geophones:

tr.pwr <- function (x, knot, degree=3)

{ # truncated power function

(x > knot)*(x - knot)^degree

}

# one knot per function


xx <- cbind(distance,distance^2,distance^3,

outer(distance,seq(20,80,length=20),tr.pwr))

# we start with 20 knots equally spaced between

# 20 and 80, and use forward selection to choose

# the best ones:

geophones.fwd <- regsubsets(thickness ~ xx,

method="forward", nvmax=12, data=geophones)

summary.regsubsets(geophones.fwd)$cp

[1] 153.39 52.34 31.41 20.27 15.77 10.59

[7] 10.78 10.68 9.75 8.81 8.52 9.21

# Which knots are in?

seq(20,80,length=20)[summary.regsubsets(geophones.fwd)$which[11,-seq(1,4)]]

[1] 20.0 32.6 35.8 38.9 45.3 51.6 54.7 70.5 73.7 76.8

knots.sub <- summary.regsubsets(geophones.fwd)$which[11,-seq(1,4)]

knots.try<-seq(20,80,length=20)[knots.sub]

geophones.bs <- lm(thickness ~ bs(distance, knots = knots.try,

Boundary.knots = c(0,100)),data=geophones)

PRESS(geophones.bs)

[1] 285

plot(geophones)

lines(spline(geophones$distance,predict(geophones.bs)),col=4)

# you can check plot(geophones.bs) to see if there are

# problems

Titanium Example

xx <- cbind(temperature,temperature^2,temperature^3,

outer(temperature,seq(620,1050,length=30),tr.pwr))

titanium.fwd <- regsubsets(g ~ xx, method="forward",

nvmax=15, data=titanium)

summary.regsubsets(titanium.fwd)$cp

[1] 72343.36 57292.13 42136.90 27174.83 11382.40 5551.22

[7] 1795.84 351.67 103.14 68.64 11.84 5.72

[13] 6.91 8.15 9.52

> knots.try<-seq(620,1050,

length=30)[summary.regsubsets(titanium.fwd)$which[12,-seq(1,4)]]

titanium.bs <- lm(g ~ bs(temperature, knots = knots.try, Boundary.knots = c(500,1100)))

plot(titanium)

lines(spline(temperature,predict(titanium.bs)),col=4)

# plot(titanium.bs)
