
From Linear to Nonlinear

Fit a nonlinear regression model via basis expansions:

$$g(x) = \sum_{m=1}^{M} \beta_m h_m(x)$$

• $h_m(x) = x_j^2,\ x_j x_l, \ldots$ (polynomial and interaction terms);

• $h_m(x) = I(a_m < x_j < b_m)$ (indicator, i.e., piecewise-constant terms);

• $h_m(x)$ are wavelet basis functions.

We can use least squares (LS) to estimate the $\beta_m$'s, and regularization can be
incorporated into this framework as well:

$$\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{m=1}^{M} \beta_m h_m(x_i) \Big)^2 + \mathrm{Pen}(\beta).$$
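For concreteness, a minimal LS fit of a basis-expansion model in R; the data are simulated and the cubic-polynomial basis is just one illustrative choice:

set.seed(1)
n <- 100
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

# Hand-built basis h_1(x) = x, h_2(x) = x^2, h_3(x) = x^3; LS fit via lm()
H <- cbind(x, x^2, x^3)
fit <- lm(y ~ H)
coef(fit)   # intercept plus the estimated beta_m's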


Questions

• How to choose the basis functions?

• What we'll learn: estimate the function locally, either by using local basis
functions or by fitting it locally.

• How to deal with large $p$? One solution is the GAM (generalized additive
model) approach,

$$g(x) \approx f_1(x_1) + \cdots + f_p(x_p) + h_j(x_1, x_2) + \cdots,$$

and we estimate each component, the $f$'s and $h$'s. (A sketch in R follows below.)
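As an illustration of the GAM idea (not part of the slides), a minimal sketch using the R package mgcv; the data frame dat and its columns y, x1, x2 are hypothetical:

library(mgcv)
# Each s() term is a smooth component f_j, estimated from the data
fit <- gam(y ~ s(x1) + s(x2), data = dat)
summary(fit)
plot(fit)   # displays each fitted component f_j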


Spline Models

• Introduction to cubic splines (CS) and natural cubic splines (NCS)

• Regression splines

• Smoothing splines


Cubic Splines*

• Knots: $a < \xi_1 < \xi_2 < \cdots < \xi_m < b$.

• A function $g$ defined on $[a, b]$ is a cubic spline w.r.t. the knots $\{\xi_i\}_{i=1}^{m}$ if:

1) $g$ is a cubic polynomial in each of the $m + 1$ intervals,

$$g(x) = d_i x^3 + c_i x^2 + b_i x + a_i, \quad x \in [\xi_i, \xi_{i+1}],$$

where $i = 0:m$, $\xi_0 = a$ and $\xi_{m+1} = b$;

2) $g$ is continuous up to the 2nd derivative,

$$g^{(k)}(\xi_i^+) = g^{(k)}(\xi_i^-), \quad k = 0, 1, 2, \quad i = 1:m.$$

*From now on, $x \in \mathbb{R}$ is one-dimensional.


• Given a set of knots $\{\xi_i\}_{i=1}^{m}$, the corresponding cubic splines form a linear
space (of functions) with dimension $m + 4$.

• A set of basis functions for that space (the truncated power basis):

$$h_1(x) = 1; \quad h_2(x) = x; \quad h_3(x) = x^2; \quad h_4(x) = x^3;$$
$$h_{i+4}(x) = (x - \xi_i)_+^3, \quad i = 1, 2, \ldots, m.$$
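A small sketch that builds this basis in R (the knots on [0, 1] are illustrative):

# Truncated power basis for a cubic spline: returns an n x (m + 4) matrix
trunc_power_basis <- function(x, knots) {
  H <- cbind(1, x, x^2, x^3)
  for (k in knots) H <- cbind(H, pmax(x - k, 0)^3)   # (x - xi_i)_+^3
  H
}

x <- seq(0, 1, length.out = 50)
Fmat <- trunc_power_basis(x, knots = c(0.25, 0.5, 0.75))   # F in the slides
dim(Fmat)   # 50 x 7, i.e., m + 4 = 7 columns for m = 3 knots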


Natural Cubic Splines (NCS)

• A cubic spline on $[a, b]$ is an NCS if its second and third derivatives are zero
at $a$ and $b$.

• That is, it is linear in the two extreme intervals $[a, \xi_1]$ and $[\xi_m, b]$.

• The degrees of freedom of NCSs with $m$ knots is $m$. One version of the
basis functions can be found in the textbook (Section 5.2.1).

• For a curve estimation problem with data $\{(x_i, y_i)\}_{i=1}^{n}$, if we put $n$ knots at
the $n$ data points (assumed to be unique), then we obtain a smooth curve
(using NCS) passing through all the $y$'s.


Regression Splines

• A basis expansion approach:

$$g(x) = \beta_1 h_1(x) + \beta_2 h_2(x) + \cdots + \beta_p h_p(x),$$

where $p = m + 4$ for regression with cubic splines and $p = m$ for NCS.

• Represent the model on the observed $n$ data points in matrix notation, and
estimate $\beta$ by LS:

$$\hat\beta = \arg\min_{\beta} \|y - F\beta\|^2,$$


where

$$\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}}_{n \times 1} = \underbrace{\begin{pmatrix} h_1(x_1) & h_2(x_1) & \cdots & h_p(x_1) \\ h_1(x_2) & h_2(x_2) & \cdots & h_p(x_2) \\ \vdots & \vdots & & \vdots \\ h_1(x_n) & h_2(x_n) & \cdots & h_p(x_n) \end{pmatrix}}_{F,\ n \times p} \underbrace{\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{p \times 1}$$


• We can obtain the design matrix $F$ via the commands bs or ns in R, and then
call the regression function lm.

• Understand how R counts the degrees of freedom:

> bs(x, knots=quantile(x, c(1/3, 2/3)))
> bs(x, df=5)
> bs(x, df=6, intercept=TRUE)

• One more input: the knots?
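To make the df counting concrete, a short sketch (assuming numeric vectors x and y already exist):

library(splines)

# Cubic (degree 3) B-splines with 2 interior knots: 3 + 2 = 5 columns
F1 <- bs(x, knots = quantile(x, c(1/3, 2/3)))
ncol(F1)   # 5

# df = 5 with the default degree 3 implies 5 - 3 = 2 interior knots
F2 <- bs(x, df = 5)

# With intercept = TRUE one column is absorbed by the intercept,
# so df = 6 again gives 6 - 3 - 1 = 2 interior knots
F3 <- bs(x, df = 6, intercept = TRUE)

fit <- lm(y ~ bs(x, df = 5))   # regression spline fitted by LS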


Choice of Knots

• Number and location of knots: to simplify this problem, we first ignore the
selection of locations; by default, the knots are located at the quantiles of
the $x_i$'s. Then we just need to select the number of knots, which can be
cast as a variable selection problem (an easier version, since there are just
$p$ models, not $2^p$).

• AIC/BIC

• m-fold CV (cross-validation)
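For example, a sketch of choosing the number of knots (via the df of an ns basis) by AIC in R; the candidate range 4:10 is arbitrary:

library(splines)
dfs  <- 4:10
aics <- sapply(dfs, function(d) AIC(lm(y ~ ns(x, df = d))))
best <- dfs[which.min(aics)]
fit  <- lm(y ~ ns(x, df = best))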


Summary: Regression Splines

• Use LS to fit a spline model: specify the DF* $p$, and then fit a regression
model with a design matrix of $p$ columns (including the intercept).

• How to do it in R?

• How to select the number/location of knots?

*Not the polynomial degree, but the DF of the spline, which is related to the number of knots.


Example: Phoneme Recognition

• Binary response $Y$: two classes, “aa” (695) and “ao” (1022).

• Numeric features $X$: the log-periodogram measured at 256 uniformly
spaced frequencies.

• Logistic regression model:

$$\log \frac{P(\text{aa} \mid x)}{P(\text{ao} \mid x)} \approx \sum_{j=1}^{256} x_j \beta_j.$$

• Go to the review on logistic regression.


• Recall that the 256 measurements for each person are not the same as
measurements collected from 256 independent predictors: they are
observations (at discrete locations) of a continuous log-periodogram
function.

• Naturally we would expect the coefficients $\beta_j = \beta(v_j)$ to be continuous in the
frequency domain as well. Model $\beta(v)$ by splines:

$$\beta(v) = \sum_{m=1}^{M} h_m(v)\, \alpha_m.$$

• Using matrix notation, we have

$$X_{n \times p}\, \beta_{p \times 1} = X_{n \times p}\, F_{p \times M}\, \alpha_{M \times 1} = B_{n \times M}\, \alpha_{M \times 1},$$

so the logistic regression only needs the $M$ coefficients $\alpha$.
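A minimal sketch of this construction in R (assumptions: X is the n x 256 matrix of log-periodograms, y is the 0/1 class indicator, and M = 12 basis functions is an arbitrary choice):

library(splines)
H <- ns(1:256, df = 12)           # F: 256 x 12 natural spline basis over frequency
B <- X %*% H                      # B = X F, an n x 12 matrix
fit <- glm(y ~ B, family = binomial)
beta_hat <- H %*% coef(fit)[-1]   # smoothed coefficient curve beta(v), 256 x 1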


Logistic Regression

• Let $P(Y = 1 \mid X = x) = p(x)$, where

$$\log \frac{p(x)}{1 - p(x)} = g(x), \qquad p(x) = \frac{\exp(g(x))}{1 + \exp(g(x))}.$$

• MLE (equivalently, maximize the log-likelihood):

$$\max_{\beta} \prod_{i=1}^{n} p(x_i)^{y_i} \big(1 - p(x_i)\big)^{1 - y_i} \iff \max_{\beta} \sum_{i=1}^{n} \big[ y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \big]$$


Iterative Reweighting Algorithm

The iterative algorithm* ($\beta^{old}$ denotes the current estimate of the coefficients, and
all the $p_i$'s are calculated based on $\beta^{old}$):

$$\beta^{new} = \beta^{old} - \Big( \frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta^t} \Big)^{-1} \frac{\partial \ell(\beta)}{\partial \beta} = (X W X^t)^{-1} X W z,$$

where

$$W = \mathrm{diag}\big(p_i(1 - p_i)\big), \qquad p_i = \frac{\exp(x_i^T \beta^{old})}{1 + \exp(x_i^T \beta^{old})}$$

is the weighting matrix and

$$z = X^t \beta^{old} + W^{-1}(y - p), \qquad p = (p_1, \ldots, p_n)^T,$$

is the vector of “latent” continuous observations. (Here $X$ is $p \times n$, with columns
$x_i$, which is why $X^t \beta^{old}$ appears in $z$.)

*This is the Newton-Raphson algorithm with

$$\frac{\partial \ell(\beta)}{\partial \beta} = X(y - p), \qquad \frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta^t} = -X W X^t.$$
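A self-contained IRLS sketch in R; note that the code uses the more common n x p design matrix, i.e., the transpose of the slides' p x n convention, and the function name is made up:

irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (it in 1:maxit) {
    eta <- drop(X %*% beta)
    p   <- 1 / (1 + exp(-eta))
    w   <- p * (1 - p)                 # diagonal of W
    z   <- eta + (y - p) / w           # working ("latent") response
    beta_new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}

# Sanity check: with X containing a leading column of 1's, the result
# should match coef(glm(y ~ X - 1, family = binomial))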


Smoothing Splines

• In regression splines (let's use NCS), we need to choose the number and
the location of the knots.

• What's a smoothing spline? Start with an easy but “horrible” solution: put
knots at all the observed data points $(x_1, \ldots, x_n)$:

$$y_{n \times 1} = F_{n \times n}\, \beta_{n \times 1}.$$

Instead of selecting knots, let's do ridge-type shrinkage ($\Omega$ will be defined
later):

$$\min_{\beta} \big[ \|y - F\beta\|^2 + \lambda\, \beta^t \Omega \beta \big],$$

where the tuning parameter $\lambda$ is often chosen by CV or GCV.

• Next we'll see how smoothing splines are derived from a different perspective.


Roughness Penalty Approach

• Let S[a, b] be the space of all “smooth” functions defined on [a, b].

• Among all the functions in $S[a, b]$, look for the minimizer of the following
penalized residual sum of squares:

$$\mathrm{RSS}(g, \lambda) = \sum_{i=1}^{n} [y_i - g(x_i)]^2 + \lambda \int_a^b [g''(x)]^2\, dx, \qquad (1)$$

where $\lambda$ is a smoothing parameter.

• Theorem. $\hat g = \arg\min_{g} \mathrm{RSS}(g, \lambda)$ is an NCS with knots at the $n$ data
points $x_1, \ldots, x_n$ (assumed distinct, $x_i \neq x_j$).


(WLOG, assume $n \ge 2$.) Let $g$ be any function on $[a, b]$ and let $\tilde g$ be an NCS
with $\tilde g(x_i) = g(x_i)$, $i = 1:n$. (Does such a $\tilde g$ exist?)

Then

$$\int \tilde g''^2 \le \int g''^2 \qquad (*)$$

with equality only if $g \equiv \tilde g$.

PROOF: Let $h(x) = g(x) - \tilde g(x)$, so $h(x_i) = 0$ for $i = 1, \ldots, n$. Then $(*)$
holds true because

$$\int g''^2 = \int \tilde g''^2 + \int h''^2 + \underbrace{2 \int \tilde g'' h''}_{= 0}.$$

(The cross term vanishes by integration by parts: $\tilde g''(a) = \tilde g''(b) = 0$, and $\tilde g'''$
is constant between consecutive knots, where $h$ vanishes.)


Smoothing Splines

Write $g(x) = \sum_{i=1}^{n} \beta_i h_i(x)$, where the $h_i$'s are basis functions for NCS with
knots at $x_1, \ldots, x_n$. Then

$$\sum_{i=1}^{n} [y_i - g(x_i)]^2 = (y - F\beta)^t (y - F\beta),$$

where $F_{n \times n}$ with $F_{ij} = h_j(x_i)$, and

$$\int_a^b [g''(x)]^2\, dx = \int \Big[ \sum_i \beta_i h_i''(x) \Big]^2 dx = \sum_{i,j} \beta_i \beta_j \int h_i''(x)\, h_j''(x)\, dx = \beta^t \Omega \beta,$$

where $\Omega_{n \times n}$ with $\Omega_{ij} = \int_a^b h_i''(x)\, h_j''(x)\, dx$.


So

$$\mathrm{RSS}(\beta, \lambda) = (y - F\beta)^t (y - F\beta) + \lambda\, \beta^t \Omega \beta,$$

and the solution is

$$\hat\beta = \arg\min_{\beta} \mathrm{RSS}(\beta, \lambda) = (F^t F + \lambda \Omega)^{-1} F^t y.$$
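Computationally this is a single linear solve; a sketch in R, assuming the matrix Fmat (F in the slides), the penalty matrix Omega, and the response vector y have already been constructed:

# beta_hat = (F'F + lambda * Omega)^{-1} F'y
smooth_coef <- function(Fmat, Omega, y, lambda) {
  solve(crossprod(Fmat) + lambda * Omega, crossprod(Fmat, y))
}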


• Demmler & Reinsch (1975): a basis with a double orthogonality property, i.e.,

$$F^t F = I, \qquad \Omega = \mathrm{diag}(d_i),$$

where $d_1 = d_2 = 0$ (Why?).

• Using this basis, we have

$$\hat\beta = (F^t F + \lambda \Omega)^{-1} F^t y = (I + \lambda\, \mathrm{diag}(d_i))^{-1} F^t y,$$

i.e.,

$$\hat\beta_i = \frac{1}{1 + \lambda d_i}\, \hat\beta_i^{(LS)}.$$


• Smoother matrix $S_\lambda$:

$$\hat y = F \hat\beta = F (F^t F + \lambda \Omega)^{-1} F^t y = S_\lambda y.$$

• Using the D&R basis,

$$S_\lambda = F\, \mathrm{diag}\Big( \frac{1}{1 + \lambda d_i} \Big)\, F^t.$$

So the columns of $F$ are the eigenvectors of $S_\lambda$, and they do not depend on $\lambda$.

• Effective df of a smoothing spline:

$$\mathrm{df}(\lambda) = \mathrm{tr}\, S_\lambda = \sum_{i=1}^{n} \frac{1}{1 + \lambda d_i}.$$


Choice of λ

• Leave-one-out CV:

$$\mathrm{CV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \big[ y_i - \hat g^{[-i]}(x_i) \big]^2 = \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{y_i - \hat g(x_i)}{1 - S_\lambda(i, i)} \Big)^2.$$

• Generalized CV:

$$\mathrm{GCV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{y_i - \hat g(x_i)}{1 - \frac{1}{n} \mathrm{tr}\, S_\lambda} \Big)^2.$$
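Both criteria need only the residuals and the diagonal of $S_\lambda$; a small illustrative sketch in R (the smoother matrix S and responses y are assumed given):

cv_criteria <- function(S, y) {
  res <- y - drop(S %*% y)
  cv  <- mean((res / (1 - diag(S)))^2)          # leave-one-out CV identity
  gcv <- mean((res / (1 - mean(diag(S))))^2)    # mean(diag(S)) = tr(S)/n
  c(CV = cv, GCV = gcv)
}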


Summary: Smoothing Splines

• Start with the model of maximum complexity: an NCS with knots at the $n$
(unique) $x$ points.

• Fit a ridge regression model on the data. If we parameterize the NCS
function space by the D&R basis, then the design matrix is orthogonal and
each coefficient is penalized differently: no penalty for the two linear basis
functions, and a higher penalty for wigglier basis functions.

• How to do it in R? (See the sketch after this list.)

• How to select the tuning parameter λ or equivalently the df?

• What if we have collected two obs at the same location x?
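A minimal answer to the R question above, using the built-in smooth.spline on simulated data:

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

fit_gcv <- smooth.spline(x, y)              # lambda chosen by GCV (default)
fit_cv  <- smooth.spline(x, y, cv = TRUE)   # lambda chosen by leave-one-out CV
fit_df  <- smooth.spline(x, y, df = 8)      # or fix the effective df directly

fit_gcv$df                 # effective df, tr(S_lambda)
predict(fit_gcv, x = 0.5)  # fitted curve at a new point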


Weighted Smoothing Splines

Suppose the first two obs have the same $x$ value, i.e.,

$$(x_1, y_1), (x_2, y_2), \quad \text{where } x_1 = x_2.$$

Then, writing $\bar y = (y_1 + y_2)/2$,

$$[y_1 - g(x_1)]^2 + [y_2 - g(x_1)]^2 = \sum_{i=1}^{2} \big[ y_i - \bar y + \bar y - g(x_1) \big]^2$$
$$= (y_1 - \bar y)^2 + (y_2 - \bar y)^2 + 2 \big[ \bar y - g(x_1) \big]^2,$$

since the cross terms sum to zero. The first two terms do not involve $g$, so we can
replace the first two obs by a single one, $(x_1, \bar y)$, with weight 2, while the weights
for the other obs are 1.
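In R, smooth.spline accepts such weights through its w argument; a toy sketch with one duplicated x location (the data are made up):

x <- c(0.1, 0.1, 0.3, 0.45, 0.6, 0.75, 0.9, 1.0)
y <- c(1.0, 1.4, 0.7, 0.5, 0.3, 0.2, 0.1, 0.0)

# Collapse the tied pair to its mean response and give it weight 2
x_u <- x[-2]
y_u <- c(mean(y[1:2]), y[3:8])
w_u <- c(2, rep(1, 6))

fit <- smooth.spline(x_u, y_u, w = w_u)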
