Ch. 9 Variable Selection - University of Western Ontario
• 9.1 Model Misspecification
◦ Consider linear regression of a response y on k predictor variables.
◦ Under the full model (all predictors),
\[ \hat\beta = (X^T X)^{-1} X^T y \qquad \text{and} \qquad MSE = \frac{y^T (I - H) y}{n - k - 1}. \]
◦ Under the reduced model with q parameters (q < k + 1):
\[ \hat\beta_q = (X_q^T X_q)^{-1} X_q^T y \qquad \text{and} \qquad \hat\sigma_q^2 = \frac{y^T (I - H_q) y}{n - q}, \]
where \( H_q = X_q (X_q^T X_q)^{-1} X_q^T \).
◦ Suppose the reduced model is correct, and we use the full model. Then we have over-parametrized the problem. We have
1. \( E[\hat\beta] = \beta \) (some of the parameters are 0) ⇒ unbiased
2. \( V(\hat\beta) = \sigma^2 (X^T X)^{-1} \)
By including unnecessary variables, we risk introducing unnecessary multicollinearity ⇒ inflated standard errors ⇒ imprecise estimates. Prediction is affected too: it remains unbiased, but is less precise than with the correct model.
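As an illustration (not from the notes), a short simulation sketch in base R shows the standard-error inflation from including a nearly collinear, unnecessary predictor; all variable names and parameter values here are made up.

```r
# Illustrative sketch: fit y on one real predictor, then add a highly
# correlated but irrelevant predictor, and compare the SE of the x1 slope.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
y  <- 2 + 3 * x1 + rnorm(n)     # x2 is irrelevant to y
se_reduced <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_full    <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
se_full > se_reduced   # TRUE: the unnecessary variable inflates the SE
```

Both fits give (nearly) unbiased slope estimates; the cost of the extra variable shows up entirely in precision.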
◦ Suppose the full model is correct, and we use the reduced model. Then
\[ E[\hat\beta_q] = \beta_q + (X_q^T X_q)^{-1} X_q^T X_r \beta_r \neq \beta_q \]
⇒ biased. The matrix \( A = (X_q^T X_q)^{-1} X_q^T X_r \) is the alias matrix. Also,
\[ V(\hat\beta_q) = \sigma^2 (X_q^T X_q)^{-1}. \]
Because of the bias, we need to use
\[ MSE = E[(\hat\beta_q - \beta_q)(\hat\beta_q - \beta_q)^T] = V(\hat\beta_q) + (\mathrm{Bias}(\hat\beta_q))(\mathrm{Bias}(\hat\beta_q))^T = \sigma^2 (X_q^T X_q)^{-1} + (A\beta_r)(A\beta_r)^T. \]
Furthermore,
\[ E[\hat\sigma_q^2] = \sigma^2 + \frac{1}{n - q}\, \beta_r^T X_r^T (I - H_q) X_r \beta_r \]
⇒ \( \hat\sigma_q^2 \) over-estimates \( \sigma^2 \).
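The bias formula can be checked numerically; the following base-R sketch (with an invented two-column design, not from the notes) verifies that the omitted-variable bias is exactly \( A\beta_r \) when the response is noise-free.

```r
# Sketch: verify E[beta_hat_q] = beta_q + A beta_r for a small design,
# where A = (Xq' Xq)^{-1} Xq' Xr is the alias matrix.
set.seed(2)
n  <- 100
Xq <- cbind(1, rnorm(n))          # intercept + one retained predictor
Xr <- cbind(rnorm(n) + Xq[, 2])   # one omitted predictor, correlated with Xq
A  <- solve(t(Xq) %*% Xq) %*% t(Xq) %*% Xr
beta_q <- c(1, 2)
beta_r <- 3
y <- Xq %*% beta_q + Xr %*% beta_r        # noise-free, so the bias is exact
bhat_q <- solve(t(Xq) %*% Xq) %*% t(Xq) %*% y
all.equal(c(bhat_q), c(beta_q + A %*% beta_r))  # TRUE
```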
◦ Summary: Loss of precision occurs when too many variables are included in a model. Bias results when too few variables are included in a model.
◦ How do we decide how many variables to include in a model?
1. \( R^2 = \frac{SSR(p)}{SST} = 1 - \frac{SSE(p)}{SST} \): the proportion of variation in the response explained by the (p variable) regression. Problem: it always increases with p.
2. Adjusted \( R^2 \) or residual mean square:
\[ R^2_{adj} = 1 - \left( \frac{n-1}{n-p} \right) (1 - R_p^2) \]
This does not always increase with p, and maximizing it is equivalent to minimizing
\[ MSE(p) = \frac{SSE(p)}{n - p}. \]
This often decreases as variables are added, but sometimes begins to increase when too many are added. One might want to choose p where this first starts to level off. Better yet, one might use ...
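The contrast between the two criteria can be seen on simulated data; this base-R sketch (invented variables, with x3 pure noise) also confirms that maximizing \( R^2_{adj} \) and minimizing MSE(p) pick the same model.

```r
# Sketch: R^2 always rises with p; R^2_adj (equivalently MSE(p)) need not.
set.seed(3)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x3 is pure noise
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)
sst <- sum((y - mean(y))^2)
r2 <- r2adj <- msep <- numeric(0)
for (fm in list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3)) {
  fit <- lm(fm)
  p   <- length(coef(fit))
  sse <- sum(resid(fit)^2)
  r2    <- c(r2, 1 - sse / sst)
  r2adj <- c(r2adj, 1 - (n - 1) / (n - p) * (1 - sse / sst))
  msep  <- c(msep, sse / (n - p))
}
r2     # strictly increasing in p
r2adj  # need not increase once x3 enters
```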
• Mallows’ Cp
◦ Consider
\[ \hat\Gamma_p = \frac{SSE(p)}{\sigma^2} - n + 2p \]
(SSE(p) is the residual sum of squares for the p parameter model, i.e. the reduced model.)
◦ Since
\[ E[SSE(p)] = \sigma^2 (n - p) + \beta_r^T X_r^T (I - H_p) X_r \beta_r, \]
we have
\[ \Gamma_p = E[\hat\Gamma_p] = p + \frac{1}{\sigma^2}\, \beta_r^T X_r^T (I - H_p) X_r \beta_r. \]
◦ ⇒ Choose p so that all components of \( \beta_p \) are nonzero and \( \beta_r = 0 \).
◦ Take p to be the smallest value for which \( \Gamma_p = p \).
◦ In practice, take p to be the smallest value for which \( \hat\Gamma_p \approx p \), since we have just shown that \( \hat\Gamma_p \) is an unbiased estimator for \( \Gamma_p \).
◦ We need an estimator for \( \sigma^2 \).
◦ If the full k variable model contains all important regressors (plus, possibly, some extras), then
\[ E[MSE(k+1)] = \sigma^2, \]
so we can use \( \hat\sigma^2 = MSE(k+1) \).
◦ ⇒ We estimate \( \Gamma_p \) by
\[ C_p = \frac{SSE(p)}{MSE} - n + 2p. \]
◦ Example: An experiment involving 30 measurements on 5 regressors yields an MSE of 20.
1. The MSE for the best 2 parameter model is 40. Find C2 for this model. (30)
2. The MSE for the best 3 parameter model is 30. Find C3 for this model. (16.5)
3. The MSE for the best 4 parameter model is 25. Find C4 for this model. (10.5)
4. The MSE for the best 5 parameter model is 19. Find C5 for this model. (3.75)
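The answers above can be checked directly from SSE(p) = MSE(p) · (n − p); a short base-R sketch:

```r
# Worked example: n = 30 observations, full-model MSE = 20.
n <- 30
mse_full <- 20                 # MSE from the full model (estimates sigma^2)
p <- c(2, 3, 4, 5)             # parameters in each candidate model
mse_p <- c(40, 30, 25, 19)     # MSE(p) for the best p-parameter model
sse_p <- mse_p * (n - p)       # SSE(p) = MSE(p) * (n - p)
cp <- sse_p / mse_full - n + 2 * p
cp                             # 30.00 16.50 10.50 3.75
```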
• An example: the cement data
> library(MPV) # this contains the cement data set
> library(leaps) # this contains the variable selection routines
> data(cement)
> x <- cement[,-1] # this removes the y variable from
# the cement data set
> y <- cement[,1] # this is the y variable
> cement.leaps <- leaps(x,y)
> attach(cement.leaps) # this allows us to access
# the variables in cement.leaps
# directly
# e.g., Mallows Cp is one of the
# variables calculated
> plot(size, Cp) # size = no. of parameters
> abline(0,1) # reference line to see where Cp = p
> identify(size, Cp) # which models are close to Cp = p?
# Click on the
# plotted points near the reference line.
# [1] 6 14 15
> which[6,] # which variables are included in model 6?
# 1 2 3 4
# TRUE FALSE FALSE TRUE
# Variables 1 and 4 are in the model.
> cement.6 <- lm(y ~ as.matrix(x[,which[6,]]))
> summary(cement.6)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 103.09738 2.12398 48.54 3.32e-13 ***
as.matrix(x[, which[6, ]])x1 1.43996 0.13842 10.40 1.11e-06 ***
as.matrix(x[, which[6, ]])x4 -0.61395 0.04864 -12.62 1.81e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 2.734 on 10 degrees of freedom
Multiple R-Squared: 0.9725, Adjusted R-squared: 0.967
F-statistic: 176.6 on 2 and 10 DF, p-value: 1.581e-08
> PRESS(cement.6)
[1] 121.2244
# What about model 14?
> cement.14 <- lm(y ~ as.matrix(x[,which[14,]]))
> summary(cement.14)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 203.6420 20.6478 9.863 4.01e-06 ***
as.matrix(x[, which[14, ]])x2 -0.9234 0.2619 -3.525 0.006462 **
as.matrix(x[, which[14, ]])x3 -1.4480 0.1471 -9.846 4.07e-06 ***
as.matrix(x[, which[14, ]])x4 -1.5570 0.2413 -6.454 0.000118 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 2.864 on 9 degrees of freedom
Multiple R-Squared: 0.9728, Adjusted R-squared: 0.9638
F-statistic: 107.4 on 3 and 9 DF, p-value: 2.302e-07
> PRESS(cement.14)
[1] 146.8527 # higher than Model 6, so choose Model 6
# What about Model 15? (This is the full model.)
> cement.15 <- lm(y ~ as.matrix(x[,which[15,]]))
> summary(cement.15)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.4054 70.0710 0.891 0.3991
as.matrix(x[, which[15, ]])x1 1.5511 0.7448 2.083 0.0708 .
as.matrix(x[, which[15, ]])x2 0.5102 0.7238 0.705 0.5009
as.matrix(x[, which[15, ]])x3 0.1019 0.7547 0.135 0.8959
as.matrix(x[, which[15, ]])x4 -0.1441 0.7091 -0.203 0.8441
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-Squared: 0.9824, Adjusted R-squared: 0.9736
F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
> PRESS(cement.15)
[1] 110.3466 # this is lower than for Model 6. Is this the best model?
> plot(cement.15) # check the residual plots to ensure that everything is OK
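The PRESS() calls above use the MPV package, but PRESS can be reproduced from a single fit with no refitting, via the leave-one-out residual identity \( e_{(i)} = e_i / (1 - h_{ii}) \); a base-R sketch (on mtcars rather than the cement data):

```r
# Hand-rolled PRESS using deleted residuals e_i / (1 - h_ii)
press <- function(fit) sum((resid(fit) / (1 - hatvalues(fit)))^2)
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model
press(fit)
```

Lower PRESS means better leave-one-out prediction, which is why it was used to compare Models 6, 14, and 15.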
9.2 Variable Selection Methods
• Forward Selection
* Fit models with each X variable. Choose the variable which gives the largest |t| statistic (if the p-value < 0.5; otherwise, stop). Suppose the variable Xi was chosen.
* Fit models which include Xi and each of the other X variables. Add the X variable with the largest |t| statistic (if the p-value < 0.5; otherwise, stop).
* Continue adding variables in this manner, until the first time there are no p-values less than 0.5. Then stop.
• Backward Selection
* Begin with all X variables.
* Eliminate the variable with the largest p-value.
* Continue eliminating variables until all remaining variables have p-values < 0.1.
• Stepwise
* Proceed as in Forward selection, but at each step, remove any variable whose p-value is larger than 0.15.
• All Possible Regressions (Exhaustive Search)
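As an aside (not in the notes), base R's step() runs forward, backward, and stepwise searches, but it adds and drops terms by AIC rather than the p-value cutoffs described above, so its choices can differ; a sketch on invented data where only x1 matters:

```r
# step()-based selection on simulated data (AIC criterion, not p-values)
set.seed(4)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)            # only x1 matters
full <- lm(y ~ x1 + x2 + x3, data = d)
null <- lm(y ~ 1, data = d)
fwd  <- step(null, scope = ~ x1 + x2 + x3, direction = "forward", trace = 0)
bwd  <- step(full, direction = "backward", trace = 0)
both <- step(null, scope = ~ x1 + x2 + x3, direction = "both", trace = 0)
formula(fwd)                              # the model forward selection settles on
```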
Forward example:
seal.fwd <- regsubsets(age ~ ., data = cfseal1,
method = "forward")
summary.regsubsets(seal.fwd)$cp
[1] -0.0368 -2.1432 -1.7354 -0.8088
[5] 1.1139 3.0690 5.0171 7.0098
# try 2nd model:
subset <-
c(F,summary.regsubsets(seal.fwd)$which[2,-1])
seal.lm <- lm(cfseal1[,1] ~
as.matrix(cfseal1[,subset]))
summary(seal.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.57779 3.27459 -1.09 0.28695
lung 0.01652 0.00708 2.33 0.02973
leftkid 0.19219 0.04796 4.01 0.00064
Residual standard error: 7.3 on 21 degrees of freedom
PRESS(seal.lm)
[1] 1552
# eliminate intercept:
seal.lm <- lm(cfseal1[,1] ~ as.matrix(cfseal1[,subset])-1)
summary(seal.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lung 0.01667 0.00711 2.34 0.02856
leftkid 0.17546 0.04565 3.84 0.00088
Residual standard error: 7.33 on 22 degrees of freedom
PRESS(seal.lm)
[1] 1490
Backward Example
seal.bwd <- regsubsets(age ~ ., data = cfseal1,
method = "backward")
summary.regsubsets(seal.bwd)$cp
[1] 1.503 0.246 -1.788 -0.825
[5] 1.098 3.064 5.017 7.010
# try 3rd model:
subset <-
c(F,summary.regsubsets(seal.bwd)$which[3,-1])
seal.lm <- lm(cfseal1[,1] ~
as.matrix(cfseal1[,subset]))
PRESS(seal.lm)
[1] 1536
Exhaustive search example
seal.ex <- regsubsets(age ~ ., data = cfseal1,
method = "exhaustive")
Spline example - geophones:
tr.pwr <- function (x, knot, degree=3)
{ # truncated power function
(x > knot)*(x - knot)^degree
}
# one knot per function
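A quick sanity check of tr.pwr (redefined here so the snippet stands alone): it is zero at and below the knot, and grows like (x − knot)^degree above it.

```r
# tr.pwr is zero up to the knot, then (x - knot)^degree beyond it
tr.pwr <- function (x, knot, degree=3)
{ # truncated power function
  (x > knot)*(x - knot)^degree
}
tr.pwr(c(1, 2, 3, 4), knot = 2)   # 0 0 1 8
```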
xx <- cbind(distance,distance^2,distance^3,
outer(distance,seq(20,80,length=20),tr.pwr))
# we start with 20 knots equally spaced between
# 20 and 80, and use forward selection to choose
# the best ones:
geophones.fwd <- regsubsets(thickness ~ xx,
method="forward", nvmax=12, data=geophones)
summary.regsubsets(geophones.fwd)$cp
[1] 153.39 52.34 31.41 20.27 15.77 10.59
[7] 10.78 10.68 9.75 8.81 8.52 9.21
# Which knots are in?
seq(20,80,length=20)[summary.regsubsets(geophones.fwd)$which[11,-seq(1,4)]]
[1] 20.0 32.6 35.8 38.9 45.3 51.6 54.7 70.5 73.7 76.8
knots.sub <- summary.regsubsets(geophones.fwd)$which[11,-seq(1,4)]
knots.try<-seq(20,80,length=20)[knots.sub]
geophones.bs <- lm(thickness ~ bs(distance, knots = knots.try,
Boundary.knots = c(0,100)),data=geophones)
PRESS(geophones.bs)
[1] 285
plot(geophones)
lines(spline(geophones$distance,predict(geophones.bs)),col=4)
# you can check plot(geophones.bs) to see if there are
# problems
Titanium Example
xx <- cbind(temperature,temperature^2,temperature^3,
outer(temperature,seq(620,1050,length=30),tr.pwr))
titanium.fwd <- regsubsets(g ~ xx, method="forward",
nvmax=15, data=titanium)
summary.regsubsets(titanium.fwd)$cp
[1] 72343.36 57292.13 42136.90 27174.83 11382.40 5551.22
[7] 1795.84 351.67 103.14 68.64 11.84 5.72
[13] 6.91 8.15 9.52
> knots.try<-seq(620,1050,
length=30)[summary.regsubsets(titanium.fwd)$which[12,-seq(1,4)]]
titanium.bs <- lm(g ~ bs(temperature, knots = knots.try, Boundary.knots = c(500,1100)))
plot(titanium)
lines(spline(temperature,predict(titanium.bs)),col=4)
# plot(titanium.bs)