From Linear to Nonlinear
Fit a nonlinear regression model via basis expansions
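$$g(x) = \sum_{m=1}^{M} \beta_m h_m(x)$$
• $h_m(x) = x_j^2,\ x_j x_l,\ \ldots$,
• $h_m(x) = I(a_m < x_j < b_m)$,
• $h_m(x)$ are wavelet basis functions.
We can use least squares (LS) to estimate the $\beta_m$'s, and regularization can be incorporated into this framework too:
$$\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{m=1}^{M} \beta_m h_m(x_i) \Big)^2 + \mathrm{Pen}(\beta).$$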
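A minimal R sketch of the basis-expansion framework, under stated assumptions: the data are simulated, the four basis functions are an arbitrary illustrative choice, and Pen(β) is taken to be a ridge penalty λ Σ β_m² with λ = 1. None of these choices come from the lecture.

```r
# Simulated data (an assumption for illustration)
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)

# Basis functions h_m(x): three polynomial terms and one indicator
H <- cbind(
  h1 = x,
  h2 = x^2,
  h3 = x^3,
  h4 = as.numeric(x > 0 & x < 1)      # I(0 < x < 1)
)

# Plain least squares on the expanded features; lm() adds the intercept
fit_ls <- lm(y ~ H)

# Ridge-type penalty, solved in closed form.
# Centering H and y leaves the intercept unpenalized.
lambda <- 1
Hc <- scale(H, center = TRUE, scale = FALSE)
yc <- y - mean(y)
beta_ridge <- solve(crossprod(Hc) + lambda * diag(ncol(Hc)), crossprod(Hc, yc))

# Ridge shrinks the coefficient vector relative to least squares
c(ls = sum(coef(fit_ls)[-1]^2), ridge = sum(beta_ridge^2))
```

Once the features h_m(x) are computed, the model is linear in β, so any linear-model tool applies unchanged.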
Questions
• How to choose the basis functions?
• What we’ll learn: estimate the function locally by either using local basis
functions or fitting it locally
• How to deal with large p? One solution is the GAM approach
$$g(x) \approx f_1(x_1) + \cdots + f_p(x_p) + h_j(x_1 x_2) + \cdots$$
and we estimate each component, the $f$'s and $h$'s.
Spline Models
• Introduction to cubic splines (CS) and natural cubic splines (NCS)
• Regression splines
• Smoothing splines
Cubic Splines^a
• knots: $a < \xi_1 < \xi_2 < \cdots < \xi_m < b$
• A function $g$ defined on $[a, b]$ is a cubic spline w.r.t. knots $\{\xi_i\}_{i=1}^{m}$ if:
1) $g$ is a cubic polynomial in each of the $m + 1$ intervals,
$$g(x) = d_i x^3 + c_i x^2 + b_i x + a_i, \quad x \in [\xi_i, \xi_{i+1}],$$
where $i = 0 : m$, $\xi_0 = a$ and $\xi_{m+1} = b$;
2) $g$ is continuous up to the 2nd derivative,
$$g^{(k)}(\xi_i^+) = g^{(k)}(\xi_i^-), \quad k = 0, 1, 2, \quad i = 1 : m.$$
^a From now on, $x \in \mathbb{R}$ is one-dimensional.
• Given a set of knots $\{\xi_i\}_{i=1}^{m}$, the corresponding cubic splines form a linear space (of functions) with dimension $m + 4$.
• A set of basis functions for that space:
$$h_1(x) = 1; \quad h_2(x) = x; \quad h_3(x) = x^2; \quad h_4(x) = x^3;$$
$$h_{i+4}(x) = (x - \xi_i)_+^3, \quad i = 1, 2, \ldots, m.$$
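The dimension count can be read off the definition: 4(m + 1) polynomial coefficients minus 3m continuity constraints leaves m + 4 free parameters. The R sketch below builds the truncated power basis and checks that it spans the same space as the B-spline basis from bs(). The function name trunc_power_basis, the simulated data, and the three knots are assumptions made for illustration.

```r
library(splines)

# Truncated power basis h_1, ..., h_{m+4} for cubic splines with the given knots
# (trunc_power_basis is a hypothetical helper name)
trunc_power_basis <- function(x, knots) {
  cbind(1, x, x^2, x^3,
        sapply(knots, function(k) pmax(x - k, 0)^3))
}

set.seed(1)
x <- sort(runif(100))                       # simulated inputs
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
knots <- c(0.25, 0.5, 0.75)                 # m = 3 interior knots (arbitrary)

H <- trunc_power_basis(x, knots)
dim_space <- qr(H)$rank                     # numerical rank of the basis matrix

# Same function space, different basis: B-splines from bs()
fit_tp <- lm(y ~ H - 1)
fit_bs <- lm(y ~ bs(x, knots = knots, intercept = TRUE) - 1)
max_diff <- max(abs(fitted(fit_tp) - fitted(fit_bs)))
```

Because both bases span the same space, the least squares fitted values agree up to rounding even though the coefficients differ.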
Natural Cubic Splines (NCS)
• A cubic spline on $[a, b]$ is an NCS if its second and third derivatives are zero at $a$ and $b$.
• That is, it is linear in the two extreme intervals $[a, \xi_1]$ and $[\xi_m, b]$.
• The degrees of freedom of an NCS with $m$ knots is $m$: the two linearity constraints at each end remove 4 of the $m + 4$ dimensions. One version of the basis functions can be found in the textbook (section 5.2.1).
• For a curve estimation problem with data $\{(x_i, y_i)\}_{i=1}^{n}$, if we put $n$ knots at the $n$ data points (assumed to be unique), then we obtain a smooth curve (using NCS) passing through all the $y$'s.
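The interpolation claim in the last bullet can be checked with base R's splinefun(), whose method = "natural" option fits a natural cubic spline through the supplied points. This is only a sketch on simulated data; the sample size and seed are arbitrary.

```r
set.seed(1)
n <- 10
x <- sort(runif(n))            # n unique knots (simulated)
y <- rnorm(n)

# Natural cubic spline interpolant with knots at all n data points
g <- splinefun(x, y, method = "natural")

interp_error  <- max(abs(g(x) - y))        # the curve passes through every y_i
end_curvature <- g(range(x), deriv = 2)    # second derivative at the two boundary knots
```

With n knots and n degrees of freedom, the interpolant uses up every free parameter, which is why the fit is exact and, for noisy data, far too wiggly.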
Regression Splines
• A basis expansion approach:
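$$g(x) = \beta_1 h_1(x) + \beta_2 h_2(x) + \cdots + \beta_p h_p(x),$$
where $p = m + 4$ for regression with cubic splines and $p = m$ for NCS.
• Represent the model on the observed $n$ data points in matrix notation,
$$\hat\beta = \arg\min_{\beta} \|y - F\beta\|^2,$$
where
$$\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}}_{n \times 1}
=
\underbrace{\begin{pmatrix}
h_1(x_1) & h_2(x_1) & \cdots & h_p(x_1) \\
h_1(x_2) & h_2(x_2) & \cdots & h_p(x_2) \\
\vdots & \vdots & & \vdots \\
h_1(x_n) & h_2(x_n) & \cdots & h_p(x_n)
\end{pmatrix}}_{n \times p}
\underbrace{\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{p \times 1}$$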
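• We can obtain the design matrix F with the commands bs or ns in R (package splines), and then call the regression function lm.
• Understand how R counts the degrees of freedom.
> library(splines)
> bs(x, knots = quantile(x, c(1/3, 2/3)))   # 2 interior knots, 5 columns (no intercept column)
> bs(x, df = 5)                             # 5 - 3 = 2 interior knots, placed at quantiles of x
> bs(x, df = 6, intercept = TRUE)           # intercept column included: 6 - 4 = 2 interior knots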
• One more input: the knots?
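A short R sketch of how the column counts work out, on simulated data (the data and the df values are assumptions for illustration). By default bs() and ns() omit the intercept column, so lm() adds one back: bs(x, df = 5) plus the intercept gives p = m + 4 = 6 with m = 2 interior knots. For ns(), the lecture's m counts the two boundary knots as well, so ns(x, df = 4) has 3 interior knots and, with the intercept, m = 5 parameters.

```r
library(splines)
set.seed(1)
x <- runif(200)                                   # simulated inputs
y <- cos(3 * x) + rnorm(200, sd = 0.1)

B1 <- bs(x, knots = quantile(x, c(1/3, 2/3)))     # 2 interior knots, no intercept column
B2 <- bs(x, df = 5)                               # df = 5 gives 5 - 3 = 2 interior knots
B3 <- bs(x, df = 6, intercept = TRUE)             # df = 6 gives 6 - 4 = 2 interior knots
N1 <- ns(x, df = 4)                               # NCS: 4 - 1 = 3 interior knots

cols <- c(ncol(B1), ncol(B2), ncol(B3), ncol(N1))

# bs(x, df = 5) chooses the same knots as the explicit quantile call
same_knots <- isTRUE(all.equal(unname(attr(B1, "knots")),
                               unname(attr(B2, "knots"))))
n_interior_ns <- length(attr(N1, "knots"))

# lm() adds its own intercept to the 5 spline columns
fit <- lm(y ~ bs(x, df = 5))
n_coef <- length(coef(fit))
```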
Choice of Knots
• Number and location of knots: to simplify this problem, we first ignore the selection of locations; by default, the knots are located at the quantiles of the $x_i$'s. Then we just need to select the number of knots, which can be cast as a variable selection problem (an easier version, since there are just $p$ models, not $2^p$).
• AIC/BIC
• m-fold CV (cross-validation)
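One way the selection could be carried out in R, sketched on simulated data. The candidate range 3:15, the choice of 5 folds, and the test function are all assumptions made for illustration. The boundary knots are fixed to the full data range so that held-out points never fall outside the spline's support.

```r
library(splines)
set.seed(2)
n <- 300
x <- sort(runif(n))
y <- sin(4 * pi * x) + rnorm(n, sd = 0.3)   # simulated; not the lecture's data
dat <- data.frame(x = x, y = y)

dfs <- 3:15                                  # candidate spline df (knots at quantiles)

# Information criteria: one lm() fit per candidate
aic <- sapply(dfs, function(d) AIC(lm(y ~ bs(x, df = d), data = dat)))
bic <- sapply(dfs, function(d) BIC(lm(y ~ bs(x, df = d), data = dat)))

# K-fold cross-validation of the prediction error
K <- 5
fold <- sample(rep(1:K, length.out = n))
cv <- sapply(dfs, function(d) {
  mean(sapply(1:K, function(k) {
    train <- dat[fold != k, ]
    test  <- dat[fold == k, ]
    fit <- lm(y ~ bs(x, df = d, Boundary.knots = range(dat$x)), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  }))
})

best <- c(AIC = dfs[which.min(aic)],
          BIC = dfs[which.min(bic)],
          CV  = dfs[which.min(cv)])
```

Since BIC penalizes each extra parameter more heavily than AIC once log(n) > 2, its choice is never larger than AIC's on the same candidate list.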
Summary: Regression Splines
• Use LS to fit a spline model: specify the DF^a $p$, and then fit a regression model with a design matrix of $p$ columns (including the intercept).
• How to do it in R?
• How to select the number/location of knots?
^a Not the polynomial degree, but the DF of the spline, related to the number of knots.
Example: Phoneme Recognition
• Binary response Y : two classes “aa” (695) and “ao” (1022).
• Numeric features X: Log-periodogram is measured at 256 uniformly
spaced frequencies.
• Logistic regression model:
$$\log \frac{P(\text{aa} \mid x)}{P(\text{ao} \mid x)} \approx \sum_{j=1}^{256} x_j \beta_j.$$
• Go to the review on logistic regression.
• Recall that the 256 measurements for each person are not the same as
measurements collected from 256 independent predictors. They are
observations (at discrete locations) from a continuous Log-periodogram
function.
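• Naturally we would expect $\beta_j = \beta(v_j)$ to be continuous in the frequency domain as well. Model it by splines
$$\beta(v) = \sum_{m=1}^{M} h_m(v) \alpha_m.$$
• Using matrix notation, with $F_{jm} = h_m(v_j)$ and $B = XF$, we have
$$X_{n \times p}\, \beta_{p \times 1} = X_{n \times p}\, F_{p \times m}\, \alpha_{m \times 1} = B_{n \times m}\, \alpha_{m \times 1}.$$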
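A sketch of this reparameterization in R. The phoneme data are not reproduced here, so X is a simulated stand-in for the log-periodograms, beta_true is a made-up smooth coefficient curve, and the choice of 12 natural-spline basis functions is arbitrary. The name Fmat is used instead of F because F is shorthand for FALSE in R.

```r
library(splines)
set.seed(3)
n <- 400
p <- 256
v <- 1:p                                      # frequency index
beta_true <- 0.15 * sin(2 * pi * v / p)       # smooth coefficient curve (made up)
X <- matrix(rnorm(n * p), n, p)               # stand-in for the log-periodograms
y <- rbinom(n, 1, plogis(as.vector(X %*% beta_true)))

Fmat <- ns(v, df = 12)                        # p x 12 basis matrix, Fmat[j, m] = h_m(v_j)
B <- X %*% Fmat                               # n x 12 filtered features

# Logistic regression on 12 features instead of 256
fit <- glm(y ~ B, family = binomial)
alpha_hat <- coef(fit)[-1]
beta_hat <- as.vector(Fmat %*% alpha_hat)     # smooth estimate of beta(v_j), length 256

# X beta_hat and B alpha_hat give the same linear predictor
gap <- max(abs(X %*% beta_hat - B %*% alpha_hat))
```

The regression has only as many free coefficients as basis functions, yet beta_hat is defined at all 256 frequencies and is smooth by construction.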
Logistic Regression
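• $P(Y = 1 \mid X = x) = p(x)$,
$$\log \frac{p(x)}{1 - p(x)} = g(x), \qquad p(x) = \frac{\exp(g(x))}{1 + \exp(g(x))}.$$
• MLE: maximizing the likelihood is equivalent to maximizing the log-likelihood,
$$\max_{\beta} \prod_{i=1}^{n} p(x_i)^{y_i} \big(1 - p(x_i)\big)^{1 - y_i}
\iff
\max_{\beta} \sum_{i=1}^{n} \Big[ y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \Big].$$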
Iterative Re-weighting Algorithm
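Iterative algorithm^a ($\beta^{old}$ denotes the current estimate of the coefficient, and all the $p_i$'s are calculated based on $\beta^{old}$):
$$\beta^{new} = \beta^{old} - \left( \frac{\partial^2 \ell(\beta)}{\partial\beta\, \partial\beta^T} \right)^{-1} \frac{\partial \ell(\beta)}{\partial\beta} = (X^T W X)^{-1} X^T W z,$$
where
$$W = \mathrm{diag}\big(p_i (1 - p_i)\big), \qquad p_i = \frac{\exp(x_i^T \beta^{old})}{1 + \exp(x_i^T \beta^{old})}$$
is the weighting matrix and
$$z = X \beta^{old} + W^{-1} (y - p), \qquad p = (p_1, \ldots, p_n)^T$$
is the vector of “latent” continuous observations. Here $X$ is the $n \times p$ design matrix with rows $x_i^T$.
^a This is the Newton-Raphson algorithm with
$$\frac{\partial \ell(\beta)}{\partial\beta} = X^T (y - p), \qquad \frac{\partial^2 \ell(\beta)}{\partial\beta\, \partial\beta^T} = -X^T W X.$$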
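The update can be sketched in a few lines of R and compared against glm(). The function name irls_logistic, the simulated data, the zero starting value, and the stopping rule are assumptions for illustration; X is taken to be the n × p design matrix with one row per observation.

```r
set.seed(4)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))             # n x p design matrix, first column = intercept
beta_true <- c(-0.5, 1, -1)                   # made-up coefficients
y <- rbinom(n, 1, plogis(as.vector(X %*% beta_true)))

# Iteratively re-weighted least squares for logistic regression
# (irls_logistic is a hypothetical helper name)
irls_logistic <- function(X, y, tol = 1e-10, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)
    p <- plogis(eta)                          # p_i computed from the current beta
    w <- p * (1 - p)                          # diagonal of W
    z <- eta + (y - p) / w                    # working response z = X beta + W^{-1}(y - p)
    # Weighted least squares step: (X' W X)^{-1} X' W z
    beta_new <- as.vector(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

beta_irls <- irls_logistic(X, y)
beta_glm  <- unname(coef(glm(y ~ X - 1, family = binomial)))
max(abs(beta_irls - beta_glm))
```

Each pass is an ordinary weighted regression of z on X, which is where the name "iterative re-weighting" comes from.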
Smoothing Splines
• In Regression Splines (let’s use NCS), we need to choose the number and
the location of knots.
• What's a Smoothing Spline? Start with an easy but “horrible” solution: put knots at all the observed data points $(x_1, \ldots, x_n)$:
$$y_{n \times 1} = F_{n \times n}\, \beta_{n \times 1}.$$
Instead of selecting knots, let's do ridge-type shrinkage ($\Omega$ will be defined later):
$$\min_{\beta} \Big[ \|y - F\beta\|^2 + \lambda\, \beta^T \Omega \beta \Big],$$
where the tuning parameter $\lambda$ is often chosen by CV or GCV.
• Next we'll see how smoothing splines are derived from a different perspective.
Roughness Penalty Approach
• Let S[a, b] be the space of all “smooth” functions defined on [a, b].
• Among all the functions in S[a, b], look for the minimizer of the following
penalized residual sum of squares
$$\mathrm{RSS}(g, \lambda) = \sum_{i=1}^{n} [y_i - g(x_i)]^2 + \lambda \int_a^b [g''(x)]^2\, dx, \qquad (1)$$
where $\lambda$ is a smoothing parameter.
• Theorem. $\hat g = \arg\min_g \mathrm{RSS}(g, \lambda)$ is an NCS with knots at the $n$ data points $x_1, \ldots, x_n$ ($x_i \neq x_j$).
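(WLOG, assume $n \ge 2$.) Let $g$ be a function on $[a, b]$ and $\tilde g$ be an NCS with knots at $x_1, \ldots, x_n$ and $\tilde g(x_i) = g(x_i)$, $i = 1 : n$. Does such a $\tilde g$ exist? (Yes: an NCS with $n$ knots has $n$ degrees of freedom and can pass through any $n$ values.)
Then
$$\int g''^2 \ge \int \tilde g''^2 \qquad (*)$$
with equality only if $g \equiv \tilde g$.
PROOF: Let $h(x) = g(x) - \tilde g(x)$. So $h(x_i) = 0$ for $i = 1, \ldots, n$.
Then $(*)$ holds true because
$$\int g''^2 = \int \tilde g''^2 + \int h''^2 + 2 \underbrace{\int \tilde g'' h''}_{=0}.$$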
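The cross term is the only step that needs an argument. A sketch of the standard integration-by-parts reasoning, assuming g is smooth enough for integration by parts on [a, b], and writing g̃ for the interpolating NCS and h = g − g̃:

```latex
\begin{align*}
\int_a^b \tilde g''(x)\, h''(x)\, dx
  &= \Big[ \tilde g''(x)\, h'(x) \Big]_a^b - \int_a^b \tilde g'''(x)\, h'(x)\, dx \\
  &= 0 - \sum_{j=1}^{n-1} \tilde g'''(x_j^+) \int_{x_j}^{x_{j+1}} h'(x)\, dx \\
  &= - \sum_{j=1}^{n-1} \tilde g'''(x_j^+) \big[ h(x_{j+1}) - h(x_j) \big] = 0 .
\end{align*}
% Line 1 -> 2: \tilde g''(a) = \tilde g''(b) = 0 because \tilde g is natural;
%              \tilde g''' vanishes on [a, x_1] and [x_n, b] and is constant on each (x_j, x_{j+1}).
% Line 3:      h(x_j) = 0 at every knot.
```

For the equality case: if the two integrals in (∗) are equal then ∫h''² = 0, so h is linear on [a, b]; a linear function that vanishes at n ≥ 2 distinct points is identically zero, which is where the assumption n ≥ 2 is used.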
Smoothing Splines
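Write $g(x) = \sum_{i=1}^{n} \beta_i h_i(x)$ where the $h_i$'s are basis functions for NCS with knots at $x_1, \ldots, x_n$.
$$\sum_{i=1}^{n} [y_i - g(x_i)]^2 = (y - F\beta)^T (y - F\beta),$$
where $F_{n \times n}$ with $F_{ij} = h_j(x_i)$.
$$\int_a^b [g''(x)]^2\, dx = \int_a^b \Big[ \sum_i \beta_i h_i''(x) \Big]^2 dx
= \sum_{i,j} \beta_i \beta_j \int_a^b h_i''(x)\, h_j''(x)\, dx = \beta^T \Omega \beta,$$
where $\Omega_{n \times n}$ with $\Omega_{ij} = \int_a^b h_i''(x)\, h_j''(x)\, dx$.
So
$$\mathrm{RSS}(\beta, \lambda) = (y - F\beta)^T (y - F\beta) + \lambda\, \beta^T \Omega \beta,$$
and the solution is
$$\hat\beta = \arg\min_{\beta} \mathrm{RSS}(\beta, \lambda) = (F^T F + \lambda \Omega)^{-1} F^T y.$$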
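The closed form follows from setting the gradient of the quadratic criterion to zero. A short worked derivation, using only the symmetry of Ω (which holds by its definition as a matrix of integrals of products):

```latex
\begin{align*}
\frac{\partial\, \mathrm{RSS}(\beta, \lambda)}{\partial \beta}
  &= -2 F^T (y - F\beta) + 2 \lambda \Omega \beta = 0 \\
\Longrightarrow \quad (F^T F + \lambda \Omega)\, \beta &= F^T y \\
\Longrightarrow \quad \hat\beta &= (F^T F + \lambda \Omega)^{-1} F^T y .
\end{align*}
% The Hessian 2 (F^T F + \lambda \Omega) is positive semi-definite, so the
% stationary point is a minimizer whenever the inverse exists.
```

This is the ridge regression solution with the identity penalty matrix replaced by Ω.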
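• Demmler & Reinsch (1975): a basis with a double orthogonality property, i.e.
$$F^T F = I, \qquad \Omega = \mathrm{diag}(d_i),$$
where $d_1 = d_2 = 0$ (Why?).
• Using this basis, we have
$$\hat\beta = (F^T F + \lambda \Omega)^{-1} F^T y = \big(I + \lambda\, \mathrm{diag}(d_i)\big)^{-1} F^T y,$$
i.e.,
$$\hat\beta_i = \frac{1}{1 + \lambda d_i}\, \hat\beta_i^{(LS)}.$$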
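• Smoother matrix $S_\lambda$:
$$\hat y = F \hat\beta = F (F^T F + \lambda \Omega)^{-1} F^T y = S_\lambda y.$$
• Using the D&R basis,
$$S_\lambda = F\, \mathrm{diag}\Big( \frac{1}{1 + \lambda d_i} \Big) F^T.$$
So the columns of $F$ are the eigenvectors of $S_\lambda$, and they do not depend on $\lambda$.
• Effective df of a smoothing spline:
$$\mathrm{df}(\lambda) = \mathrm{tr}\, S_\lambda = \sum_{i=1}^{n} \frac{1}{1 + \lambda d_i}.$$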
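A numerical sketch of these facts in R. Rather than building F and Ω from a spline basis, it parameterizes the NCS by its values at the knots and uses the penalty matrix K = Q R⁻¹ Qᵀ described by Green & Silverman (1994), for which the penalty equals gᵀKg and S_λ = (I + λK)⁻¹. That construction is swapped in here for convenience and is not taken from the lecture; ncs_penalty is a hypothetical helper name, and the design and λ are arbitrary.

```r
# Penalty matrix K such that g' K g is the integral of g''^2 for the NCS that
# takes values g at the sorted, distinct knots x (hypothetical helper)
ncs_penalty <- function(x) {
  n <- length(x)
  h <- diff(x)
  Q <- matrix(0, n, n - 2)
  R <- matrix(0, n - 2, n - 2)
  for (j in seq_len(n - 2)) {                 # column j belongs to interior knot x[j + 1]
    Q[j, j]     <- 1 / h[j]
    Q[j + 1, j] <- -1 / h[j] - 1 / h[j + 1]
    Q[j + 2, j] <- 1 / h[j + 1]
    R[j, j] <- (h[j] + h[j + 1]) / 3
    if (j < n - 2) R[j, j + 1] <- R[j + 1, j] <- h[j + 1] / 6
  }
  K <- Q %*% solve(R, t(Q))
  (K + t(K)) / 2                              # remove rounding asymmetry
}

n <- 25
x <- seq(0, 1, length.out = n)
lambda <- 1e-3                                # arbitrary smoothing parameter

K <- ncs_penalty(x)
eig <- eigen(K, symmetric = TRUE)
U <- eig$vectors              # orthonormal basis evaluated at the data: U'U = I
d <- pmax(eig$values, 0)      # penalties d_i in decreasing order; the last two are 0

S        <- U %*% diag(1 / (1 + lambda * d)) %*% t(U)   # smoother via the eigen-basis
S_direct <- solve(diag(n) + lambda * K)                 # smoother computed directly

df_trace   <- sum(diag(S))
df_formula <- sum(1 / (1 + lambda * d))

# Straight lines are unpenalized, so the smoother reproduces them
line_gap <- max(abs(S %*% (1 + 2 * x) - (1 + 2 * x)))
```

The two zero eigenvalues belong to the constant and linear directions, whose second derivative is zero; every other direction is shrunk by 1 / (1 + λ d_i), more strongly the wigglier it is.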
Choice of λ
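• Leave-one-out CV
$$\mathrm{CV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \big[ y_i - \hat g^{[-i]}(x_i) \big]^2
= \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat g(x_i)}{1 - S_\lambda(i, i)} \right)^2.$$
• Generalized CV
$$\mathrm{GCV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat g(x_i)}{1 - \frac{1}{n} \mathrm{tr}\, S_\lambda} \right)^2.$$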
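The leave-one-out shortcut can be checked by brute force. The R sketch below assumes the value parameterization of the smoothing spline, ĝ = (I + λK)⁻¹y with K = Q R⁻¹ Qᵀ as in Green & Silverman (1994), which is not how the lecture writes it; ncs_penalty is a hypothetical helper name, and the data and λ are simulated or arbitrary. Each leave-one-out fit is evaluated at the held-out x through the natural spline interpolating its fitted values.

```r
# Penalty matrix K such that g' K g is the integral of g''^2 for the NCS that
# takes values g at the sorted, distinct knots x (hypothetical helper)
ncs_penalty <- function(x) {
  n <- length(x)
  h <- diff(x)
  Q <- matrix(0, n, n - 2)
  R <- matrix(0, n - 2, n - 2)
  for (j in seq_len(n - 2)) {
    Q[j, j]     <- 1 / h[j]
    Q[j + 1, j] <- -1 / h[j] - 1 / h[j + 1]
    Q[j + 2, j] <- 1 / h[j + 1]
    R[j, j] <- (h[j] + h[j + 1]) / 3
    if (j < n - 2) R[j, j + 1] <- R[j + 1, j] <- h[j + 1] / 6
  }
  K <- Q %*% solve(R, t(Q))
  (K + t(K)) / 2
}

set.seed(1)
n <- 30
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)     # simulated responses
lambda <- 1e-3                                # arbitrary smoothing parameter

S <- solve(diag(n) + lambda * ncs_penalty(x)) # smoother matrix
fit <- as.vector(S %*% y)

# Shortcut formulas: only one fit is needed
cv_shortcut <- mean(((y - fit) / (1 - diag(S)))^2)
gcv         <- mean(((y - fit) / (1 - mean(diag(S))))^2)

# Brute force: refit without observation i, then evaluate that spline at x[i]
loo_pred <- sapply(seq_len(n), function(i) {
  g_i <- solve(diag(n - 1) + lambda * ncs_penalty(x[-i]), y[-i])
  splinefun(x[-i], g_i, method = "natural")(x[i])
})
cv_brute <- mean((y - loo_pred)^2)

c(shortcut = cv_shortcut, brute = cv_brute, gcv = gcv)
```

GCV replaces each diagonal entry S_λ(i, i) by their average, so it needs only the trace of the smoother matrix.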
Summary: Smoothing Splines
• Start with a model with the maximum complexity: NCS with knots at n
(unique) x points.
• Fit a Ridge Regression model on the data. If we parameterize the NCS function space by the D&R basis, then the design matrix is orthogonal and each coefficient is penalized differently: no penalty for the two linear basis functions, and a higher penalty for wigglier basis functions.
• How to do it in R?
• How to select the tuning parameter λ or equivalently the df?
• What if we have collected two obs at the same location x?
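For the "how to do it in R" question, base R's smooth.spline() is one option. A hedged sketch on simulated data: the target df = 6 is arbitrary, and the lambda that smooth.spline() reports is on its own internal scale, which should not be assumed to match the λ used in these slides.

```r
set.seed(1)
n <- 100
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)     # simulated data

fit_df  <- smooth.spline(x, y, df = 6)        # specify the effective df directly
fit_cv  <- smooth.spline(x, y, cv = TRUE)     # smoothing parameter chosen by leave-one-out CV
fit_gcv <- smooth.spline(x, y)                # default: chosen by GCV

# Evaluate the fitted curve on a grid
grid <- seq(min(x), max(x), length.out = 200)
curve_hat <- predict(fit_gcv, grid)$y

# Effective df of the three fits
c(df = fit_df$df, cv = fit_cv$df, gcv = fit_gcv$df)
```

Specifying df and specifying the smoothing parameter are two ways of choosing the same thing, since df(λ) decreases monotonically as λ grows.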
Weighted Smoothing Splines
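Suppose the first two obs have the same x value, i.e.,
$(x_1, y_1), (x_2, y_2)$, where $x_1 = x_2$.
Then
$$[y_1 - g(x_1)]^2 + [y_2 - g(x_1)]^2
= \sum_{i=1}^{2} \Big[ y_i - \frac{y_1 + y_2}{2} + \frac{y_1 + y_2}{2} - g(x_1) \Big]^2$$
$$= \Big( y_1 - \frac{y_1 + y_2}{2} \Big)^2 + \Big( y_2 - \frac{y_1 + y_2}{2} \Big)^2 + 2 \Big[ \frac{y_1 + y_2}{2} - g(x_1) \Big]^2.$$
The first two terms do not depend on $g$, so we can replace the first two obs by one, $\big(x_1, \frac{y_1 + y_2}{2}\big)$, with weight 2, while the weights for the other obs are 1.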
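The same decomposition holds for any number of replicates at one x value, with the weight equal to the number of replicates. A small R check on simulated numbers, followed by one way the collapsed data could be passed to smooth.spline() through its w argument; the data, the grid of candidate values, and df = 4 are assumptions for illustration.

```r
set.seed(1)
y_rep <- rnorm(5)                 # five replicated responses at one x value (simulated)
ybar  <- mean(y_rep)
g_vals <- seq(-2, 2, by = 0.5)    # candidate values of g at that x

# Left side: squared errors of all replicates
lhs <- sapply(g_vals, function(g) sum((y_rep - g)^2))
# Right side: a constant plus the weighted squared error of the mean
rhs <- sapply(g_vals, function(g) {
  sum((y_rep - ybar)^2) + length(y_rep) * (ybar - g)^2
})
max_gap <- max(abs(lhs - rhs))

# Collapse tied x values to their mean response and use the counts as weights
x <- c(0.1, 0.1, 0.2, 0.35, 0.5, 0.6, 0.8, 0.9)
y <- sin(2 * pi * x) + rnorm(length(x), sd = 0.2)
x_unique <- sort(unique(x))
y_mean   <- as.vector(tapply(y, x, mean))     # ordered like x_unique
counts   <- as.vector(tapply(y, x, length))
fit_w <- smooth.spline(x_unique, y_mean, w = counts, df = 4)
```

Only the term involving the mean depends on g, so minimizing the weighted criterion on the collapsed data gives the same curve as minimizing the original one.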