D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x)...

25
E D = + f H f H g (D) (x) ¯ g (x) f (x) E E N N E E N E E N

Transcript of D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x)...

Page 1: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Review of Le ture 8• Bias and varian eExpe ted value of Eout w.r.t. D

= bias + varf

H bias

var f

H

g(D)(x) → g(x) → f(x)

• Learning urvesHow Ein and Eout vary with N

PSfrag repla ements

Number of Data Points, N

Expe tedError

biasvarian e Eout

Ein

204060800.160.170.180.190.20.210.22

PSfrag repla ements

Number of Data Points, N

Expe tedError

in-sample errorgeneralization errorEout

Ein

204060800.160.170.180.190.20.210.22• N ∝ �VC dimension�

B-V:

VC:

Page 2: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Learning From DataYaser S. Abu-MostafaCalifornia Institute of Te hnologyLe ture 9: The Linear Model II

Sponsored by Calte h's Provost O� e, E&AS Division, and IST • Tuesday, May 1, 2012

Page 3: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Where we are• Linear lassi� ation X

• Linear regression X

• Logisti regression ?• Nonlinear transforms ≈X

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 2/24

Page 4: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Nonlinear transformsx = (x0, x1, · · · , xd)

Φ−→ z = (z0, z1, · · · · · · · · · · · · , zd)

Ea h zi = φi(x) z = Φ(x)

Example: z = (1, x1, x2, x1x2, x21, x

22)

Final hypothesis g(x) in X spa e:sign (wTΦ(x)

) or wTΦ(x)

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 3/24

Page 5: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

The pri e we payx = (x0, x1, · · · , xd)

Φ−→ z = (z0, z1, · · · · · · · · · · · · , zd)

↓ ↓

w w

dv = d + 1 dv ≤ d + 1

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 4/24

Page 6: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Two non-separable ases

PSfrag repla ements-1-0.500.51-1.5-1-0.500.511.5 © AM

L Creator: Yaser Abu-Mostafa - LFD Le ture 9 5/24

Page 7: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

First ase

PSfrag repla ements-1-0.500.51-1.5-1-0.500.511.5

Use a linear model in X ; a ept Ein > 0

orInsist on Ein = 0; go to high-dimensional Z

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 6/24

Page 8: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Se ond ase

PSfrag repla ements-1-0.500.51-1.5-1-0.500.511.5

z = (1, x1, x2, x1x2, x21, x

22)

Why not: z = (1, x21, x

22)

or better yet: z = (1, x21 + x2

2)

or even: z = (x21 + x2

2 − 0.6)

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 7/24

Page 9: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Lesson learnedLooking at the data before hoosing the model an be hazardous to your Eout

Data snooping

Learning From Data - Le ture 9 8/24

Page 10: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Logisti regression - Outline• The model• Error measure• Learning algorithm

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 9/24

Page 11: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

A third linear models =

d∑

i=0

wixi

linear lassi� ation linear regression logisti regressionh(x) = sign(s) h(x) = s h(x) = θ(s)

sx

x

x

x0

1

2

d

h x( )

sx

x

x

x0

1

2

d

h x( )

sx

x

x

x0

1

2

d

h x( )

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 10/24

Page 12: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

The logisti fun tion θ

The formula:θ(s) =

es

1 + es

PSfrag repla ements

θ(s)1

0 s

-4-202400.51soft threshold: un ertaintysigmoid: �attened out `s'

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 11/24

Page 13: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Probability interpretationh(x) = θ(s) is interpreted as a probability

Example. Predi tion of heart atta ksInput x: holesterol level, age, weight, et .θ(s): probability of a heart atta kThe signal s = w

Tx �risk s ore�

h(x) = θ(s) © AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 12/24

Page 14: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Genuine probabilityData (x, y) with binary y, generated by a noisy target:

P (y | x) =

{

f(x) for y = +1;

1−f(x) for y = −1.

The target f : Rd→ [0, 1] is the probability

Learn g(x) = θ(wTx) ≈ f(x)

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 13/24

Page 15: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Error measureFor ea h (x, y), y is generated by probability f(x)

Plausible error measure based on likelihood:If h = f , how likely to get y from x?P (y | x) =

{

h(x) for y = +1;

1− h(x) for y = −1.

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 14/24

Page 16: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Formula for likelihoodP (y | x) =

{

h(x) for y = +1;

1− h(x) for y = −1.

Substitute h(x) = θ(wTx), noting θ(−s) = 1− θ(s)

PSfrag repla ements

θ(s)

1

0 s

-4-202400.51P (y | x) = θ(y w

Tx)

Likelihood of D = (x1, y1), . . . , (xN , yN) isN∏

n=1

P (yn | xn) =

N∏

n=1

θ(ynwTxn)

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 15/24

Page 17: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Maximizing the likelihoodMinimize −

1

Nln

(N∏

n=1

θ(yn wTxn)

)

=1

N

N∑

n=1

ln

(1

θ(yn wTxn)

) [

θ(s) =1

1 + e−s

]

Ein(w) =1

N

N∑

n=1

ln(

1 + e−ynwTxn

)

︸ ︷︷ ︸e(h(xn),yn)� ross-entropy� error © AM

L Creator: Yaser Abu-Mostafa - LFD Le ture 9 16/24

Page 18: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Logisti regression - Outline• The model• Error measure• Learning algorithm

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 17/24

Page 19: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

How to minimize EinFor logisti regression,

Ein(w) =1

N

N∑

n=1

ln(

1 + e−ynwTxn

)

Compare to linear regression:Ein(w) =

1

N

N∑

n=1

(wTxn − yn)

2←− losed-form solution

←− iterative solution

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 18/24

Page 20: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Iterative method: gradient des ent

PSfrag repla ements

Weights, wIn-sampleError,E

in-10-8-6-4-20210152025

General method for nonlinear optimizationStart at w(0); take a step along steepest slopeFixed step size: w(1) = w(0) + η v

What is the dire tion v?

Ein(w)

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 19/24

Page 21: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Formula for the dire tion v

∆Ein = Ein(w(0) + ηv )− Ein(w(0))

= η∇Ein(w(0))tv + O(η2)

≥ −η‖∇Ein(w(0))‖

Sin e v is a unit ve tor,v = −

∇Ein(w(0))

‖∇Ein(w(0))‖ © AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 20/24

Page 22: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Fixed-size step?How η a�e ts the algorithm:

PSfrag repla ements

Weights, wIn-sampleError,Ein

-1-0.8-0.6-0.4-0.200.20.40.60.8100.20.40.60.81

PSfrag repla ements

Weights, wIn-sampleError,E

in-1-0.8-0.6-0.4-0.200.20.40.60.8100.050.10.150.20.25

PSfrag repla ements

Weights, wIn-sampleError,E

in large η

small η

-1-0.8-0.6-0.4-0.200.20.40.60.8100.20.40.60.81η too small η too large variable η � just right

η should in rease with the slope © AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 21/24

Page 23: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Easy implementationInstead of

∆w = η v

= − η∇Ein(w(0))

‖∇Ein(w(0))‖

Have∆w = − η ∇Ein(w(0))

Fixed learning rate η © AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 22/24

Page 24: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Logisti regression algorithm1: Initialize the weights at t = 0 to w(0)2: for t = 0, 1, 2, . . . do3: Compute the gradient

∇Ein = −1

N

N∑

n=1

ynxn

1 + e ynwT(t)xn

4: Update the weights: w(t + 1) = w(t)− η∇Ein5: Iterate to the next step until it is time to stop6: Return the �nal weights w

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 23/24

Page 25: D ¯( N - Yaser Abu-Mostafawork.caltech.edu/slides/slides09.pdf · rn Lea g(x) = θ(w T x) ≈f(x) c AML r: Creato aser Y Abu-Mostafa-LFD Lecture 9 13/ 24. r Erro measure r o F each

Summary of Linear Models

CreditAnalysis Amountof CreditApproveor DenyProbabilityof Default

Per eptronLogisti RegressionLinear Regression Classi� ation ErrorPLA, Po ket,. . .

Cross-entropy ErrorGradient des entSquared ErrorPseudo-inverse

© AML Creator: Yaser Abu-Mostafa - LFD Le ture 9 24/24