
From Linear to Nonlinear

Fit a nonlinear regression model via basis expansions:

$$g(x) = \sum_{m=1}^{M} \beta_m h_m(x)$$

• $h_m(x) = x_j^2,\ x_j x_l, \ldots$ (polynomial and interaction terms);

• $h_m(x) = I(a_m < x_j < b_m)$ (indicator, i.e., piecewise-constant terms);

• $h_m(x)$ are wavelet basis functions.

We can use least squares (LS) to estimate the $\beta_m$'s, and regularization can be
incorporated into this framework as well:

$$\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{m=1}^{M} \beta_m h_m(x_i) \Big)^2 + \mathrm{Pen}(\beta).$$
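For concreteness, a minimal LS fit of a basis-expansion model in R; the data are simulated and the cubic-polynomial basis is just one illustrative choice:

set.seed(1)
n <- 100
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

# Hand-built basis h_1(x) = x, h_2(x) = x^2, h_3(x) = x^3; LS fit via lm()
H <- cbind(x, x^2, x^3)
fit <- lm(y ~ H)
coef(fit)   # intercept plus the estimated beta_m's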


Questions

• How to choose the basis functions?

• What we'll learn: estimate the function locally, either by using local basis
functions or by fitting it locally.

• How to deal with large $p$? One solution is the GAM (generalized additive
model) approach,

$$g(x) \approx f_1(x_1) + \cdots + f_p(x_p) + h_j(x_1, x_2) + \cdots,$$

and we estimate each component, the $f$'s and $h$'s. (A sketch in R follows below.)
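As an illustration of the GAM idea (not part of the slides), a minimal sketch using the R package mgcv; the data frame dat and its columns y, x1, x2 are hypothetical:

library(mgcv)
# Each s() term is a smooth component f_j, estimated from the data
fit <- gam(y ~ s(x1) + s(x2), data = dat)
summary(fit)
plot(fit)   # displays each fitted component f_j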


Spline Models

• Introduction to cubic splines (CS) and natural cubic splines (NCS)

• Regression splines

• Smoothing splines


Cubic Splines*

• Knots: $a < \xi_1 < \xi_2 < \cdots < \xi_m < b$.

• A function $g$ defined on $[a, b]$ is a cubic spline w.r.t. the knots $\{\xi_i\}_{i=1}^{m}$ if:

1) $g$ is a cubic polynomial in each of the $m + 1$ intervals,

$$g(x) = d_i x^3 + c_i x^2 + b_i x + a_i, \quad x \in [\xi_i, \xi_{i+1}],$$

where $i = 0:m$, $\xi_0 = a$ and $\xi_{m+1} = b$;

2) $g$ is continuous up to the 2nd derivative,

$$g^{(k)}(\xi_i^+) = g^{(k)}(\xi_i^-), \quad k = 0, 1, 2, \quad i = 1:m.$$

*From now on, $x \in \mathbb{R}$ is one-dimensional.


• Given a set of knots $\{\xi_i\}_{i=1}^{m}$, the corresponding cubic splines form a linear
space (of functions) with dimension $m + 4$.

• A set of basis functions for that space (the truncated power basis):

$$h_1(x) = 1; \quad h_2(x) = x; \quad h_3(x) = x^2; \quad h_4(x) = x^3;$$
$$h_{i+4}(x) = (x - \xi_i)_+^3, \quad i = 1, 2, \ldots, m.$$
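A small sketch that builds this basis in R (the knots on [0, 1] are illustrative):

# Truncated power basis for a cubic spline: returns an n x (m + 4) matrix
trunc_power_basis <- function(x, knots) {
  H <- cbind(1, x, x^2, x^3)
  for (k in knots) H <- cbind(H, pmax(x - k, 0)^3)   # (x - xi_i)_+^3
  H
}

x <- seq(0, 1, length.out = 50)
Fmat <- trunc_power_basis(x, knots = c(0.25, 0.5, 0.75))   # F in the slides
dim(Fmat)   # 50 x 7, i.e., m + 4 = 7 columns for m = 3 knots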


Natural Cubic Splines (NCS)

• A cubic spline on $[a, b]$ is an NCS if its second and third derivatives are zero
at $a$ and $b$.

• That is, it is linear in the two extreme intervals $[a, \xi_1]$ and $[\xi_m, b]$.

• The degrees of freedom of NCSs with $m$ knots is $m$. One version of the
basis functions can be found in the textbook (Section 5.2.1).

• For a curve estimation problem with data $\{(x_i, y_i)\}_{i=1}^{n}$, if we put $n$ knots at
the $n$ data points (assumed to be unique), then we obtain a smooth curve
(using NCS) passing through all the $y$'s.


Regression Splines

• A basis expansion approach:

$$g(x) = \beta_1 h_1(x) + \beta_2 h_2(x) + \cdots + \beta_p h_p(x),$$

where $p = m + 4$ for regression with cubic splines and $p = m$ for NCS.

• Represent the model on the observed $n$ data points in matrix notation, and
estimate $\beta$ by LS:

$$\hat\beta = \arg\min_{\beta} \|y - F\beta\|^2,$$


where

$$\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}}_{n \times 1} = \underbrace{\begin{pmatrix} h_1(x_1) & h_2(x_1) & \cdots & h_p(x_1) \\ h_1(x_2) & h_2(x_2) & \cdots & h_p(x_2) \\ \vdots & \vdots & & \vdots \\ h_1(x_n) & h_2(x_n) & \cdots & h_p(x_n) \end{pmatrix}}_{F,\ n \times p} \underbrace{\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{p \times 1}$$


• We can obtain the design matrix $F$ via the commands bs or ns in R, and then
call the regression function lm.

• Understand how R counts the degrees of freedom:

> bs(x, knots=quantile(x, c(1/3, 2/3)))
> bs(x, df=5)
> bs(x, df=6, intercept=TRUE)

• One more input: the knots?
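To make the df counting concrete, a short sketch (assuming numeric vectors x and y already exist):

library(splines)

# Cubic (degree 3) B-splines with 2 interior knots: 3 + 2 = 5 columns
F1 <- bs(x, knots = quantile(x, c(1/3, 2/3)))
ncol(F1)   # 5

# df = 5 with the default degree 3 implies 5 - 3 = 2 interior knots
F2 <- bs(x, df = 5)

# With intercept = TRUE one column is absorbed by the intercept,
# so df = 6 again gives 6 - 3 - 1 = 2 interior knots
F3 <- bs(x, df = 6, intercept = TRUE)

fit <- lm(y ~ bs(x, df = 5))   # regression spline fitted by LS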


Choice of Knots

• Number and location of knots: to simplify this problem, we first ignore the
selection of locations; by default, the knots are located at the quantiles of
the $x_i$'s. Then we just need to select the number of knots, which can be
cast as a variable selection problem (an easier version, since there are just
$p$ models, not $2^p$).

• AIC/BIC

• m-fold CV (cross-validation)
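For example, a sketch of choosing the number of knots (via the df of an ns basis) by AIC in R; the candidate range 4:10 is arbitrary:

library(splines)
dfs  <- 4:10
aics <- sapply(dfs, function(d) AIC(lm(y ~ ns(x, df = d))))
best <- dfs[which.min(aics)]
fit  <- lm(y ~ ns(x, df = best))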


Summary: Regression Splines

• Use LS to fit a spline model: specify the DF* $p$, and then fit a regression
model with a design matrix of $p$ columns (including the intercept).

• How to do it in R?

• How to select the number/location of knots?

*Not the polynomial degree, but the DF of the spline, which is related to the number of knots.


Example: Phoneme Recognition

• Binary response $Y$: two classes, “aa” (695) and “ao” (1022).

• Numeric features $X$: the log-periodogram measured at 256 uniformly
spaced frequencies.

• Logistic regression model:

$$\log \frac{P(\text{aa} \mid x)}{P(\text{ao} \mid x)} \approx \sum_{j=1}^{256} x_j \beta_j.$$

• Go to the review on logistic regression.


• Recall that the 256 measurements for each person are not the same as
measurements collected from 256 independent predictors: they are
observations (at discrete locations) of a continuous log-periodogram
function.

• Naturally we would expect the coefficients $\beta_j = \beta(v_j)$ to be continuous in the
frequency domain as well. Model $\beta(v)$ by splines:

$$\beta(v) = \sum_{m=1}^{M} h_m(v)\, \alpha_m.$$

• Using matrix notation, we have

$$X_{n \times p}\, \beta_{p \times 1} = X_{n \times p}\, F_{p \times M}\, \alpha_{M \times 1} = B_{n \times M}\, \alpha_{M \times 1},$$

so the logistic regression only needs the $M$ coefficients $\alpha$.
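A minimal sketch of this construction in R (assumptions: X is the n x 256 matrix of log-periodograms, y is the 0/1 class indicator, and M = 12 basis functions is an arbitrary choice):

library(splines)
H <- ns(1:256, df = 12)           # F: 256 x 12 natural spline basis over frequency
B <- X %*% H                      # B = X F, an n x 12 matrix
fit <- glm(y ~ B, family = binomial)
beta_hat <- H %*% coef(fit)[-1]   # smoothed coefficient curve beta(v), 256 x 1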


Logistic Regression

• Let $P(Y = 1 \mid X = x) = p(x)$, where

$$\log \frac{p(x)}{1 - p(x)} = g(x), \qquad p(x) = \frac{\exp(g(x))}{1 + \exp(g(x))}.$$

• MLE (equivalently, maximize the log-likelihood):

$$\max_{\beta} \prod_{i=1}^{n} p(x_i)^{y_i} \big(1 - p(x_i)\big)^{1 - y_i} \iff \max_{\beta} \sum_{i=1}^{n} \big[ y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \big]$$


Iterative Reweighting Algorithm

The iterative algorithm* ($\beta^{old}$ denotes the current estimate of the coefficients, and
all the $p_i$'s are calculated based on $\beta^{old}$):

$$\beta^{new} = \beta^{old} - \Big( \frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta^t} \Big)^{-1} \frac{\partial \ell(\beta)}{\partial \beta} = (X W X^t)^{-1} X W z,$$

where

$$W = \mathrm{diag}\big(p_i(1 - p_i)\big), \qquad p_i = \frac{\exp(x_i^T \beta^{old})}{1 + \exp(x_i^T \beta^{old})}$$

is the weighting matrix and

$$z = X^t \beta^{old} + W^{-1}(y - p), \qquad p = (p_1, \ldots, p_n)^T,$$

is the vector of “latent” continuous observations. (Here $X$ is $p \times n$, with columns
$x_i$, which is why $X^t \beta^{old}$ appears in $z$.)

*This is the Newton-Raphson algorithm with

$$\frac{\partial \ell(\beta)}{\partial \beta} = X(y - p), \qquad \frac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta^t} = -X W X^t.$$
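A self-contained IRLS sketch in R; note that the code uses the more common n x p design matrix, i.e., the transpose of the slides' p x n convention, and the function name is made up:

irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (it in 1:maxit) {
    eta <- drop(X %*% beta)
    p   <- 1 / (1 + exp(-eta))
    w   <- p * (1 - p)                 # diagonal of W
    z   <- eta + (y - p) / w           # working ("latent") response
    beta_new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}

# Sanity check: with X containing a leading column of 1's, the result
# should match coef(glm(y ~ X - 1, family = binomial))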


Smoothing Splines

• In regression splines (let's use NCS), we need to choose the number and
the location of the knots.

• What's a smoothing spline? Start with an easy but “horrible” solution: put
knots at all the observed data points $(x_1, \ldots, x_n)$:

$$y_{n \times 1} = F_{n \times n}\, \beta_{n \times 1}.$$

Instead of selecting knots, let's do ridge-type shrinkage ($\Omega$ will be defined
later):

$$\min_{\beta} \big[ \|y - F\beta\|^2 + \lambda\, \beta^t \Omega \beta \big],$$

where the tuning parameter $\lambda$ is often chosen by CV or GCV.

• Next we'll see how smoothing splines are derived from a different perspective.


Roughness Penalty Approach

• Let S[a, b] be the space of all “smooth” functions defined on [a, b].

• Among all the functions in $S[a, b]$, look for the minimizer of the following
penalized residual sum of squares:

$$\mathrm{RSS}(g, \lambda) = \sum_{i=1}^{n} [y_i - g(x_i)]^2 + \lambda \int_a^b [g''(x)]^2\, dx, \qquad (1)$$

where $\lambda$ is a smoothing parameter.

• Theorem. $\hat g = \arg\min_{g} \mathrm{RSS}(g, \lambda)$ is an NCS with knots at the $n$ data
points $x_1, \ldots, x_n$ (assumed distinct, $x_i \neq x_j$).


(WLOG, assume $n \ge 2$.) Let $g$ be any function on $[a, b]$ and let $\tilde g$ be an NCS
with $\tilde g(x_i) = g(x_i)$, $i = 1:n$. (Does such a $\tilde g$ exist?)

Then

$$\int \tilde g''^2 \le \int g''^2 \qquad (*)$$

with equality only if $g \equiv \tilde g$.

PROOF: Let $h(x) = g(x) - \tilde g(x)$, so $h(x_i) = 0$ for $i = 1, \ldots, n$. Then $(*)$
holds true because

$$\int g''^2 = \int \tilde g''^2 + \int h''^2 + \underbrace{2 \int \tilde g'' h''}_{= 0}.$$

(The cross term vanishes by integration by parts: $\tilde g''(a) = \tilde g''(b) = 0$, and $\tilde g'''$
is constant between consecutive knots, where $h$ vanishes.)


Smoothing Splines

Write $g(x) = \sum_{i=1}^{n} \beta_i h_i(x)$, where the $h_i$'s are basis functions for NCS with
knots at $x_1, \ldots, x_n$. Then

$$\sum_{i=1}^{n} [y_i - g(x_i)]^2 = (y - F\beta)^t (y - F\beta),$$

where $F_{n \times n}$ with $F_{ij} = h_j(x_i)$, and

$$\int_a^b [g''(x)]^2\, dx = \int \Big[ \sum_i \beta_i h_i''(x) \Big]^2 dx = \sum_{i,j} \beta_i \beta_j \int h_i''(x)\, h_j''(x)\, dx = \beta^t \Omega \beta,$$

where $\Omega_{n \times n}$ with $\Omega_{ij} = \int_a^b h_i''(x)\, h_j''(x)\, dx$.


So

$$\mathrm{RSS}(\beta, \lambda) = (y - F\beta)^t (y - F\beta) + \lambda\, \beta^t \Omega \beta,$$

and the solution is

$$\hat\beta = \arg\min_{\beta} \mathrm{RSS}(\beta, \lambda) = (F^t F + \lambda \Omega)^{-1} F^t y.$$
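Computationally this is a single linear solve; a sketch in R, assuming the matrix Fmat (F in the slides), the penalty matrix Omega, and the response vector y have already been constructed:

# beta_hat = (F'F + lambda * Omega)^{-1} F'y
smooth_coef <- function(Fmat, Omega, y, lambda) {
  solve(crossprod(Fmat) + lambda * Omega, crossprod(Fmat, y))
}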


• Demmler & Reinsch (1975): a basis with a double orthogonality property, i.e.,

$$F^t F = I, \qquad \Omega = \mathrm{diag}(d_i),$$

where $d_1 = d_2 = 0$ (Why?).

• Using this basis, we have

$$\hat\beta = (F^t F + \lambda \Omega)^{-1} F^t y = (I + \lambda\, \mathrm{diag}(d_i))^{-1} F^t y,$$

i.e.,

$$\hat\beta_i = \frac{1}{1 + \lambda d_i}\, \hat\beta_i^{(LS)}.$$


• Smoother matrix $S_\lambda$:

$$\hat y = F \hat\beta = F (F^t F + \lambda \Omega)^{-1} F^t y = S_\lambda y.$$

• Using the D&R basis,

$$S_\lambda = F\, \mathrm{diag}\Big( \frac{1}{1 + \lambda d_i} \Big)\, F^t.$$

So the columns of $F$ are the eigenvectors of $S_\lambda$, and they do not depend on $\lambda$.

• Effective df of a smoothing spline:

$$\mathrm{df}(\lambda) = \mathrm{tr}\, S_\lambda = \sum_{i=1}^{n} \frac{1}{1 + \lambda d_i}.$$


Choice of λ

• Leave-one-out CV:

$$\mathrm{CV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \big[ y_i - \hat g^{[-i]}(x_i) \big]^2 = \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{y_i - \hat g(x_i)}{1 - S_\lambda(i, i)} \Big)^2.$$

• Generalized CV:

$$\mathrm{GCV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{y_i - \hat g(x_i)}{1 - \frac{1}{n} \mathrm{tr}\, S_\lambda} \Big)^2.$$
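Both criteria need only the residuals and the diagonal of $S_\lambda$; a small illustrative sketch in R (the smoother matrix S and responses y are assumed given):

cv_criteria <- function(S, y) {
  res <- y - drop(S %*% y)
  cv  <- mean((res / (1 - diag(S)))^2)          # leave-one-out CV identity
  gcv <- mean((res / (1 - mean(diag(S))))^2)    # mean(diag(S)) = tr(S)/n
  c(CV = cv, GCV = gcv)
}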


Summary: Smoothing Splines

• Start with the model of maximum complexity: an NCS with knots at the $n$
(unique) $x$ points.

• Fit a ridge regression model on the data. If we parameterize the NCS
function space by the D&R basis, then the design matrix is orthogonal and
each coefficient is penalized differently: no penalty for the two linear basis
functions, and a higher penalty for wigglier basis functions.

• How to do it in R? (See the sketch after this list.)

• How to select the tuning parameter λ or equivalently the df?

• What if we have collected two obs at the same location x?
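A minimal answer to the R question above, using the built-in smooth.spline on simulated data:

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

fit_gcv <- smooth.spline(x, y)              # lambda chosen by GCV (default)
fit_cv  <- smooth.spline(x, y, cv = TRUE)   # lambda chosen by leave-one-out CV
fit_df  <- smooth.spline(x, y, df = 8)      # or fix the effective df directly

fit_gcv$df                 # effective df, tr(S_lambda)
predict(fit_gcv, x = 0.5)  # fitted curve at a new point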


Weighted Smoothing Splines

Suppose the first two obs have the same $x$ value, i.e.,

$$(x_1, y_1), (x_2, y_2), \quad \text{where } x_1 = x_2.$$

Then, writing $\bar y = (y_1 + y_2)/2$,

$$[y_1 - g(x_1)]^2 + [y_2 - g(x_1)]^2 = \sum_{i=1}^{2} \big[ y_i - \bar y + \bar y - g(x_1) \big]^2$$
$$= (y_1 - \bar y)^2 + (y_2 - \bar y)^2 + 2 \big[ \bar y - g(x_1) \big]^2,$$

since the cross terms sum to zero. The first two terms do not involve $g$, so we can
replace the first two obs by a single one, $(x_1, \bar y)$, with weight 2, while the weights
for the other obs are 1.
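In R, smooth.spline accepts such weights through its w argument; a toy sketch with one duplicated x location (the data are made up):

x <- c(0.1, 0.1, 0.3, 0.45, 0.6, 0.75, 0.9, 1.0)
y <- c(1.0, 1.4, 0.7, 0.5, 0.3, 0.2, 0.1, 0.0)

# Collapse the tied pair to its mean response and give it weight 2
x_u <- x[-2]
y_u <- c(mean(y[1:2]), y[3:8])
w_u <- c(2, rep(1, 6))

fit <- smooth.spline(x_u, y_u, w = w_u)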
