
Learning From Data, Lecture 11: Overfitting

What is Overfitting?

When does Overfitting Occur

Stochastic and Deterministic Noise

M. Magdon-Ismail
CSCI 4100/6100

recap: Nonlinear Transforms

1. Original data: xn ∈ X

2. Transform the data (Φ −→): zn = Φ(xn) ∈ Z

3. Separate data in Z-space: g̃(z) = sign(w̃ᵗz)

4. Classify in X-space (‘Φ⁻¹’ ←−): g(x) = g̃(Φ(x)) = sign(w̃ᵗΦ(x))

X-space is R^d: x = (1, x1, . . . , xd)ᵗ; data x1, x2, . . . , xN; labels y1, y2, . . . , yN; no weights are learned here; dvc = d + 1.

Z-space is R^d̃: z = Φ(x) = (1, Φ1(x), . . . , Φd̃(x))ᵗ = (1, z1, . . . , zd̃)ᵗ; data z1, z2, . . . , zN; the same labels y1, y2, . . . , yN; weights w̃ = (w0, w1, . . . , wd̃)ᵗ; dvc = d̃ + 1.

g(x) = sign(w̃ᵗΦ(x))
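In code, the recipe is: transform, fit linear weights in Z-space, classify back in X-space. Below is a minimal sketch, assuming a 1-D polynomial feature map for Φ and a least-squares fit for w̃ (both are my choices, not the slide's):

```python
# A minimal sketch of the four-step recipe, assuming a 1-D polynomial
# feature map for Phi and a least-squares fit for the weights.
import numpy as np

def phi(x, d_tilde=3):
    """Step 2: z = Phi(x) = (1, x, x^2, ..., x^d~)."""
    return np.array([x**k for k in range(d_tilde + 1)])

# Toy data: no threshold on x alone separates these labels.
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.sign(X**2 - 1.5)                  # +1 far from 0, -1 near 0

Z = np.array([phi(x) for x in X])        # transform the data
w = np.linalg.pinv(Z) @ y                # step 3: linear fit in Z-space

g = lambda x: np.sign(w @ phi(x))        # step 4: g(x) = sign(w~' Phi(x))
print([g(x) for x in X])                 # matches y on this toy set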


Superstitions – Myth or Reality?

• Paraskevidekatriaphobia – fear of Friday the 13th.

– Are future Friday the 13ths really more dangerous?

• OCD: the subject performs an action that leads to a good outcome and thereby generalizes it as cause and effect: the action will always give good results. Having overfit the data, the subject compulsively engages in that activity.

Humans are overfitting machines, very good at “finding coincidences”.


An Illustration of Overfitting on a Simple Example

Quadratic target f, 5 data points, a little noise (measurement error).

5 data points → 4th order polynomial fit.

[Figure: x vs. y, showing the data, the quadratic target, and the 4th order polynomial fit.]

Classic overfitting: simple target with excessively complex H.

Ein ≈ 0; Eout ≫ 0

The noise did us in. (why?)
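A minimal sketch of this experiment; the target, noise level, and sample locations are my stand-ins, not the lecture's exact values:

```python
# Quadratic target, 5 noisy points, 4th-order polynomial fit.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2                         # simple quadratic target
X = np.linspace(-1, 1, 5)                  # 5 data points
y = f(X) + 0.1 * rng.standard_normal(5)    # a little measurement noise

w = np.polyfit(X, y, deg=4)                # 4th order fit: 5 parameters, 5 points

x_test = np.linspace(-1, 1, 1000)
E_in = np.mean((np.polyval(w, X) - y)**2)               # ~0: the fit interpolates
E_out = np.mean((np.polyval(w, x_test) - f(x_test))**2)
print(E_in, E_out)                         # E_in ~ 0 while E_out >> 0
```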


What is Overfitting?

Fitting the data more than is warranted


Overfitting is Not Just Bad Generalization

[Figure: in-sample and out-of-sample error versus VC dimension dvc; the gap between the curves is bad generalization, and the region where pushing Ein down drives Eout up is overfitting.]

VC Analysis: covers bad generalization, but with lots of slack – the VC bound is loose.

Overfitting: going for lower and lower Ein results in higher and higher Eout.


Case Study: 2nd vs 10th Order Polynomial Fit

[Figure: two data sets with their targets; left: 10th order f with noise; right: 50th order f with no noise.]

H2: 2nd order polynomial fit.
H10: 10th order polynomial fit ← a special case of linear models with the feature transform x ↦ (1, x, x², . . .).

Which model do you pick for which problem, and why?

Case Study: 2nd vs 10th Order Polynomial Fit

[Figure: for each target, the data with the 2nd order and 10th order fits.]

simple noisy target:
             2nd Order   10th Order
    Ein        0.050        0.034
    Eout       0.127        9.00

complex noiseless target:
             2nd Order   10th Order
    Ein        0.029        10⁻⁵
    Eout       0.120        7680

Go figure: the simpler H is better even for the more complex target with no noise.
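The case study is easy to reproduce in spirit. A hedged sketch follows: the target, noise level, and N are my assumptions, so the numbers will differ from the table, but the pattern (H10 wins on Ein, loses on Eout) shows up readily:

```python
import numpy as np

def trial(f, N, sigma, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, N)
    y = f(X) + sigma * rng.standard_normal(N)
    x_test = np.linspace(-1, 1, 2000)
    out = {}
    for Q in (2, 10):                      # H2 and H10
        w = np.polyfit(X, y, deg=Q)
        out[Q] = (np.mean((np.polyval(w, X) - y)**2),                 # E_in
                  np.mean((np.polyval(w, x_test) - f(x_test))**2))    # E_out
    return out

simple_f = lambda x: x**3 - 0.5 * x        # stand-in "simple noisy target"
for Q, (e_in, e_out) in trial(simple_f, N=15, sigma=0.2).items():
    print(f"H{Q}: E_in={e_in:.4f}  E_out={e_out:.4f}")
```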


Is there Really “No Noise” with the Complex f?

[Figure: left, simple f with noise; right, complex f with no noise. Shown first with the targets, then with the targets removed: looking only at the data, the two cases are hard to tell apart.]

H should match the quantity and quality of the data, not f.

When is H2 Better than H10?

[Figure: learning curves for H2 (left) and H10 (right); expected Ein and Eout versus the number of data points N.]

Overfitting: Eout(H10) > Eout(H2).
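A sketch of how such learning curves can be estimated, averaging Ein and Eout over many randomly drawn data sets for each N (the target, noise level, and trial counts are my stand-ins):

```python
import numpy as np

def learning_curve(f, Q, Ns, sigma=0.2, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1, 1, 1000)
    for N in Ns:
        e_in, e_out = 0.0, 0.0
        for _ in range(trials):
            X = rng.uniform(-1, 1, N)
            y = f(X) + sigma * rng.standard_normal(N)
            w = np.polyfit(X, y, deg=Q)
            e_in += np.mean((np.polyval(w, X) - y)**2)
            e_out += np.mean((np.polyval(w, x_test) - f(x_test))**2)
        yield N, e_in / trials, e_out / trials

f = lambda x: np.sin(np.pi * x)            # stand-in target
for N, e_in, e_out in learning_curve(f, Q=10, Ns=(15, 30, 60, 120)):
    print(N, round(e_in, 4), round(e_out, 4))   # the E_in/E_out gap shrinks with N
```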


Overfit Measure: Eout(H10) − Eout(H2)

[Figure: the overfit measure as a function of the number of data points N; left panel: versus noise level σ²; right panel: versus target complexity Qf.]

Number of data points ↑  Overfitting ↓
Noise ↑  Overfitting ↑
Target complexity ↑  Overfitting ↑


Noise

That part of y we cannot model

it has two sources . . .


Stochastic Noise — Data Error

We would like to learn from ◦: yn = f(xn)

Unfortunately, we only observe ◦: yn = f(xn) + ‘stochastic noise’
                                                ↑ no one can model this

[Figure: data ◦ scattered about the curve y = f(x); the vertical deviations are the stochastic noise.]

Stochastic Noise: fluctuations/measurement errors we cannot model.


Deterministic Noise — Model Error

We would like to learn from ◦: yn = h∗(xn)

Unfortunately, we only observe ◦: yn = f(xn) = h∗(xn) + ‘deterministic noise’
                                                         ↑ H cannot model this

[Figure: the target y = f(x) together with h∗(x), the best approximation to f in H; the gap between the two curves is the deterministic noise.]

Deterministic Noise: the part of f we cannot model.
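One way to see deterministic noise concretely: project a complex f onto H2 and look at the residual f − h∗. A sketch, with a degree-5 Chebyshev polynomial standing in for the complex target (my choice, not the lecture's):

```python
import numpy as np

f = np.polynomial.chebyshev.Chebyshev([0, 0, 0, 0, 0, 1])  # T_5(x), quite wiggly
x = np.linspace(-1, 1, 2001)                               # dense grid ~ all of X

w_star = np.polyfit(x, f(x), deg=2)        # h*: best 2nd-order approximation to f
det_noise = f(x) - np.polyval(w_star, x)   # the part of f that H2 cannot capture

print(np.sqrt(np.mean(det_noise**2)))      # re-measuring y leaves this unchanged;
                                           # changing H (the degree) changes it
```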


Stochastic & Deterministic Noise Hurt Learning

Stochastic Noise
[Figure: y = f(x) + stoch. noise, data scattered about f(x).]
source: random measurement errors
re-measure yn → the stochastic noise changes
change H → the stochastic noise stays the same

Deterministic Noise
[Figure: y = h∗(x) + det. noise, the target about h∗.]
source: the learner's H cannot model f
re-measure yn → the deterministic noise stays the same
change H → the deterministic noise changes

We have a single D and a fixed H, so we cannot distinguish the two noises.


Noise and the Bias-Variance Decomposition

y = f(x) + ε
        ↑ measurement error

E[Eout(x)] = E_{D,ε}[(g(x) − f(x) − ε)²]
           = E_{D,ε}[(g(x) − f(x))²  −  2(g(x) − f(x))ε  +  ε²]
                         ↓                    ↓               ↓
                    bias + var                0               σ²

(The cross term averages to zero because ε has zero mean and is independent of g(x) − f(x); E[ε²] = σ².)

E[Eout(x)] = σ² + bias + var
             ↑      ↑      ↑
        stochastic  deterministic  indirect impact
          noise        noise          of noise
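This decomposition can be checked by simulation. A sketch, assuming a sin target and H = lines (the classic toy setup; the specific numbers are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)
sigma, N, trials = 0.2, 5, 5000
x = np.linspace(-1, 1, 201)                    # grid standing in for E_x[.]

G = np.empty((trials, x.size))
for t in range(trials):
    X = rng.uniform(-1, 1, N)
    y = f(X) + sigma * rng.standard_normal(N)
    G[t] = np.polyval(np.polyfit(X, y, deg=1), x)   # g^(D) on the grid

g_bar = G.mean(axis=0)                         # average hypothesis
bias = np.mean((g_bar - f(x))**2)              # deterministic noise
var = np.mean((G - g_bar)**2)                  # indirect impact of the noise
E_out = np.mean((G - f(x))**2) + sigma**2      # direct sigma^2 added analytically

print(E_out, sigma**2 + bias + var)            # the two agree closely
```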


Noise is the Culprit

Overfitting is the disease

Noise is the cause
Learning is led astray by fitting the noise more than the signal.

Cures

Regularization: Putting on the brakes.

Validation: A reality check from peeking at Eout (the bottom line).


Regularization

[Figure: the data, target, and fit; left panel: no regularization; right panel: regularization! The regularized fit is far tamer.]
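As a teaser of how the brakes go on: refit the earlier 5-point, 4th-order example with weight decay (ridge), one common form of regularization. The penalty strength lam is a hypothetical knob, not a value from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2
X = np.linspace(-1, 1, 5)
y = f(X) + 0.1 * rng.standard_normal(5)

Z = np.vander(X, 5)                            # 4th-order features, highest power first
w_plain = np.polyfit(X, y, deg=4)              # no regularization
lam = 1e-2                                     # assumed regularization strength
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(5), Z.T @ y)   # ridge solution

x_test = np.linspace(-1, 1, 1000)
for name, w in (("no reg", w_plain), ("reg", w_reg)):
    e_out = np.mean((np.polyval(w, x_test) - f(x_test))**2)
    print(name, e_out)                         # regularization typically tames E_out
```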

© AML Creator: Malik Magdon-Ismail