Transcript of Learning From Data Lecture 13: Validation and Model Selection

Learning From Data: Lecture 13

Validation and Model Selection

The Validation Set
Model Selection
Cross Validation

M. Magdon-Ismail, CSCI 4100/6100

recap: Regularization

Regularization combats the effects of noise by putting a leash on the algorithm.

Eaug(h) = Ein(h) + (λ/N) Ω(h)

Ω(h) favors smooth, simple h; noise is rough and complex.

Different regularizers give different results, and we can choose λ, the amount of regularization.

[Figure: fits of the same data set for λ = 0, λ = 0.0001, λ = 0.01, and λ = 1, each panel showing Data, Target, and Fit in the (x, y) plane, ranging from Overfitting (λ too small) to Underfitting (λ too large).]

Optimal λ balances approximation and generalization, bias and variance.
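
To make the augmented error concrete, here is a minimal numpy sketch (not the course code) that fits a weight-decay-regularized polynomial for the four values of λ above; the synthetic target and the quadratic regularizer Ω(w) = wᵀw are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): noisy sine target, 5th-order polynomial fit.
N = 20
x = np.sort(rng.uniform(-1, 1, N))
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(N)
Z = np.vander(x, 6, increasing=True)                 # features (1, x, x^2, ..., x^5)

def fit_weight_decay(Z, y, lam):
    # Minimizes Eaug(w) = Ein(w) + (lam/N) w'w, with Ein the mean squared error.
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

for lam in [0.0, 0.0001, 0.01, 1.0]:
    w = fit_weight_decay(Z, y, lam)
    Ein = np.mean((Z @ w - y) ** 2)
    Eaug = Ein + (lam / N) * (w @ w)
    print(f"lambda = {lam:>7}: Ein = {Ein:.4f}, Eaug = {Eaug:.4f}")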

Validation: A Sneak Peek at Eout

Eout(g) = Ein(g) + overfit penalty

The VC analysis bounds the overfit penalty using a complexity error bar for H; regularization estimates it through a heuristic complexity penalty for g.

Validation goes directly for the jugular:

Eout(g) = Ein(g) + overfit penalty,

where validation estimates Eout(g) directly.

An in-sample estimate of Eout is the Holy Grail of learning from data.

The Test Set

D (N data points)  −→  final hypothesis g

Dtest (K test points):  ek = e(g(xk), yk)  −→  e1, e2, ..., eK

Etest = (1/K) Σ_{k=1}^{K} ek  −→  estimate of Eout(g)

Etest is an estimate for Eout(g)

E_Dtest[ek] = Eout(g)

E[Etest] = (1/K) Σ_{k=1}^{K} E[ek] = (1/K) Σ_{k=1}^{K} Eout(g) = Eout(g)

Since e1, ..., eK are independent,

Var[Etest] = (1/K²) Σ_{k=1}^{K} Var[ek] = (1/K) Var[e]   ← decreases like 1/K

Bigger K  =⇒  more reliable Etest.
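
A minimal sketch of computing Etest, assuming binary error and a generic predict function (both placeholders, not from the slides):

import numpy as np

def test_error(predict, X_test, y_test):
    # Etest = (1/K) * sum_k e(g(x_k), y_k), here with binary error e = [g(x) != y].
    errors = (predict(X_test) != y_test).astype(float)   # e_1, ..., e_K
    K = len(errors)
    Etest = errors.mean()                                 # unbiased estimate of Eout(g)
    stderr = errors.std(ddof=1) / np.sqrt(K)              # error bar shrinks like 1/sqrt(K)
    return Etest, stderr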

The Validation Set

D (N data points) is split into Dtrain (N − K training points) and Dval (K validation points).

Learning on Dtrain gives g⁻; on each validation point, ek = e(g⁻(xk), yk)  −→  e1, e2, ..., eK

Eval = (1/K) Σ_{k=1}^{K} ek  −→  estimate of Eout(g⁻)

1. Remove K points from D:  D = Dtrain ∪ Dval.

2. Learn using Dtrain  −→  g⁻.

3. Test g⁻ on Dval  −→  Eval.

4. Use Eval to estimate Eout(g⁻).

Eval is an estimate for Eout(g⁻):

E_Dval[ek] = Eout(g⁻)

E[Eval] = (1/K) Σ_{k=1}^{K} E[ek] = (1/K) Σ_{k=1}^{K} Eout(g⁻) = Eout(g⁻)

Since e1, ..., eK are independent,

Var[Eval] = (1/K²) Σ_{k=1}^{K} Var[ek] = (1/K) Var[e(g⁻)]   ← decreases like 1/K, and depends on g⁻, not on H

Bigger K  =⇒  more reliable Eval?
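
A minimal sketch of the split and of Eval, assuming a generic train routine that returns a predictor and squared error (illustrative placeholders); the 1/K behaviour of Var[Eval] comes from averaging the K independent ek.

import numpy as np

def split_train_val(X, y, K, rng):
    # D = Dtrain ∪ Dval: randomly hold out K of the N points for validation.
    idx = rng.permutation(len(y))
    return idx[K:], idx[:K]

def validation_error(train, X, y, K, rng):
    tr, val = split_train_val(X, y, K, rng)
    g_minus = train(X[tr], y[tr])                 # g- : learned on the N - K training points
    e = (g_minus(X[val]) - y[val]) ** 2           # e_1, ..., e_K
    return e.mean()                               # Eval, with Var[Eval] = Var[e]/K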

Choosing K

[Figure: expected Eval versus the size of the validation set, K.]

Rule of thumb: K∗ = N/5.

Restoring D

[Figure: D (N points) is split into Dtrain (N − K) and Dval (K); Dtrain gives g⁻ and Dval gives Eval(g⁻); the final g, trained on all of D, goes to the CUSTOMER.]

Primary goal: output best hypothesis.

g was trained on all the data.

Secondary goal: estimate Eout(g).

g⁻ is behind closed doors.

Eout(g) is estimated by Ein(g);  Eout(g⁻) is estimated by Eval(g⁻).

Which should we use?
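
A sketch of this restoring-D workflow, again assuming a generic train routine and squared error (placeholders):

import numpy as np

def validate_then_restore(train, X, y, K, rng):
    # Use Dval only to estimate the error; ship the hypothesis trained on all of D.
    idx = rng.permutation(len(y))
    val, tr = idx[:K], idx[K:]
    g_minus = train(X[tr], y[tr])                              # g- learned on Dtrain
    Eval = float(np.mean((g_minus(X[val]) - y[val]) ** 2))     # estimate of Eout(g-)
    g = train(X, y)                                            # restore D: retrain on all N points
    return g, Eval                                             # give g to the customer, report Eval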

Eval Versus Ein

Eout(g) ≤ Ein(g) + O(√((dvc/N) log N))
(biased error bar; depends on H)

Eout(g) ≤ Eout(g⁻) ≤ Eval(g⁻) + O(1/√K)
(the first inequality holds because the learning curve is decreasing, a practical truth rather than a theorem; the O(1/√K) term is an unbiased error bar that depends on g⁻, not on H)

Eval(g⁻) usually wins as an estimate for Eout(g), especially when the learning curve is not steep.

Model Selection

The most important use of validation

[Figure: M hypothesis sets H1, H2, ..., HM, each trained on Dtrain to give g1⁻, g2⁻, ..., gM⁻, to be evaluated on Dval.]

Validation Estimate for (H1, g1)

[Figure: H1 is trained on Dtrain to give g1⁻, which is evaluated on Dval to give Eval(g1⁻).]

Validation Estimate for (H1, g1)

[Figure: the same pipeline, with the validation estimate Eval(g1⁻) now simply called E1.]

Compute Validation Estimates for All Models

[Figure: every Hm is trained on Dtrain to give gm⁻, and every gm⁻ is evaluated on Dval, producing validation errors E1, E2, E3, ..., EM.]

Pick The Best Model According to Validation Error

[Figure: the same pipeline; the model whose validation error is smallest among E1, E2, ..., EM is selected.]
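
A minimal model-selection sketch, where each candidate model is represented by a train function (an assumed interface, not from the slides): train each on Dtrain, evaluate on Dval, pick the winner, then restore D by retraining it on all the data.

import numpy as np

def select_model(models, X, y, K, rng):
    # models: list of train(X, y) -> predict functions, one per hypothesis set Hm.
    idx = rng.permutation(len(y))
    val, tr = idx[:K], idx[K:]
    Xtr, ytr, Xval, yval = X[tr], y[tr], X[val], y[val]

    E = []
    for train in models:
        g_minus = train(Xtr, ytr)                        # gm- learned on Dtrain
        E.append(np.mean((g_minus(Xval) - yval) ** 2))   # Em = Eval(gm-)

    m_star = int(np.argmin(E))                           # pick the best validation error
    g_star = models[m_star](X, y)                        # restore D: retrain on all data
    return m_star, g_star, E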

Eval(gm∗⁻) is not Unbiased for Eout(gm∗⁻)

[Figure: expected error versus validation set size K, showing Eval(gm∗⁻) below Eout(gm∗⁻): the validation estimate of the selected finalist is optimistically biased.]

. . . because we choose one of the M finalists:

Eout(gm∗⁻) ≤ Eval(gm∗⁻) + O(√(ln M / K))

(a VC-style error bar for selecting one hypothesis out of M using a data set of size K)
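
For a feel of how this error bar scales with M and K, here is a hedged numeric illustration using one standard concrete form of the O(√(ln M / K)) term, a Hoeffding bound with a union bound over the M finalists; this exact form is an assumption (the slide only gives the big-O behaviour) and applies to errors bounded in [0, 1].

import numpy as np

def selection_error_bar(M, K, delta=0.05):
    # Assumed Hoeffding + union bound: with probability >= 1 - delta (errors in [0, 1]),
    # Eout(gm*-) <= Eval(gm*-) + sqrt(ln(2M/delta) / (2K)).
    return np.sqrt(np.log(2 * M / delta) / (2 * K))

for M, K in [(3, 25), (10, 25), (100, 25), (10, 250)]:
    print(f"M = {M:>3}, K = {K:>3}: error bar ≈ {selection_error_bar(M, K):.3f}")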

Restoring D

[Figure: the selection pipeline with D restored: after the validation errors E1, ..., EM are computed, the winning model is retrained on all of D to give the final hypothesis gm∗.]

The model with the best g⁻ also has the best g.   ← leap of faith

We can find the model with the best g⁻ using validation.   ← true, modulo the Eval error bar

Comparing Ein and Eval for Model Selection

[Figure: expected Eout versus validation set size K for model selection: curves for the optimal choice, in-sample selection, and validation-based selection (gm∗ and gm∗⁻); inset: the pipeline in which each Hm gives gm⁻ on Dtrain and Em on Dval, the best (Hm∗, Em∗) is picked, and gm∗ is obtained from all of D.]

Application to Selecting λ

Which regularization parameter to use?

λ1, λ2, . . . , λM.

This is a special case of model selection over M models:

(H, λ1), (H, λ2), (H, λ3), . . . , (H, λM)  −→  g1, g2, g3, . . . , gM

Picking a model amounts to choosing the optimal λ.
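
A minimal sketch of selecting λ by validation for linear regression with weight decay, using the closed form wreg = (ZᵀZ + λI)⁻¹Zᵀy; the λ grid and the random split are illustrative assumptions.

import numpy as np

def ridge_fit(Z, y, lam):
    # wreg = (Z'Z + lam*I)^(-1) Z'y  (linear regression with weight decay)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

def choose_lambda(Z, y, lambdas, K, rng):
    idx = rng.permutation(len(y))
    val, tr = idx[:K], idx[K:]
    Evals = []
    for lam in lambdas:
        w_minus = ridge_fit(Z[tr], y[tr], lam)                  # gm- for model (H, lam)
        Evals.append(np.mean((Z[val] @ w_minus - y[val]) ** 2)) # Em = Eval(gm-)
    lam_star = lambdas[int(np.argmin(Evals))]
    return lam_star, ridge_fit(Z, y, lam_star)                  # restore D with the chosen lambda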

The Dilemma When Choosing K

Validation relies on the following chain of reasoning,

Eout(g)  ≈  Eout(g⁻)  ≈  Eval(g⁻)

(the first approximation calls for small K; the second calls for large K)

Can we get away with K = 1?

Yes, almost!

The Leave One Out Error (K = 1)

[Figure: leave out the single point (x1, y1); training on the remaining N − 1 points gives g1, and e1 = e(g1(x1), y1), with E[e1] = Eout(g1).]

... but it is a wild estimate.

The Leave-One-Out Errors

[Figure: the leave-one-out errors e1, e2, e3, ..., each from a fit that omits one data point; E[e1] = Eout(g1).]

Ecv = (1/N) Σ_{n=1}^{N} en
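
A direct, brute-force sketch of Ecv that retrains N times, assuming a generic train routine and squared error (placeholders); the analytic shortcut for linear models appears a few slides later.

import numpy as np

def loo_cross_validation(train, X, y):
    # Ecv = (1/N) sum_n e(g_n(x_n), y_n), where g_n is trained with point n left out.
    N = len(y)
    e = np.empty(N)
    for n in range(N):
        mask = np.arange(N) != n
        g_n = train(X[mask], y[mask])              # learned on the other N - 1 points
        e[n] = (g_n(X[n:n + 1])[0] - y[n]) ** 2    # error on the left-out point
    return e.mean()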

Cross Validation is Unbiased

Theorem. Ecv is an unbiased estimate of Eout(N − 1), the expected Eout when learning with N − 1 points.

Reliability of Ecv

en and em are not independent: en depends on gn, which was trained on (xm, ym), and em is evaluated on (xm, ym).

[Figure: a scale from 1 to N showing the effective number of fresh examples giving a comparable estimate of Eout, for a single en versus Ecv.]

Cross Validation is Computationally Intensive

N epochs of learning, each on a data set of size N − 1.

• Analytic approaches exist, for example linear regression with weight decay:

wreg = (ZᵀZ + λI)⁻¹ Zᵀy,    H(λ) = Z(ZᵀZ + λI)⁻¹ Zᵀ,

Ecv = (1/N) Σ_{n=1}^{N} ( (ŷn − yn) / (1 − Hnn(λ)) )²,   where ŷ = H(λ)y
(see the sketch after this list)

• 10-fold cross validation: partition D into folds D1, D2, ..., D10; validate on each fold in turn while training on the other nine.
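
A sketch of both options for linear regression with weight decay: the analytic Ecv uses the hat matrix H(λ) above, while the fold-based version is a generic illustration (the fold count, squared error, and shuffling are assumptions).

import numpy as np

def analytic_Ecv(Z, y, lam):
    # Leave-one-out Ecv for weight decay, without retraining N times.
    N, d = Z.shape
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T)   # H(lam) = Z (Z'Z + lam I)^-1 Z'
    y_hat = H @ y
    return float(np.mean(((y_hat - y) / (1.0 - np.diag(H))) ** 2))

def kfold_Ecv(Z, y, lam, folds=10, rng=None):
    # 10-fold cross validation: train on 9 folds, validate on the held-out fold.
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(y))
    errs = []
    for val in np.array_split(idx, folds):
        tr = np.setdiff1d(idx, val)
        w = np.linalg.solve(Z[tr].T @ Z[tr] + lam * np.eye(Z.shape[1]), Z[tr].T @ y[tr])
        errs.append(np.mean((Z[val] @ w - y[val]) ** 2))
    return float(np.mean(errs))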

Restoring D

[Figure: from D, form the leave-one-out sets D1, D2, ..., DN, where Dn omits (xn, yn); train g1, g2, ..., gN and compute e1, e2, ..., eN; take the average to get Ecv; the final hypothesis g, trained on all of D, goes to the CUSTOMER.]

Eout(g) ≤ Eout(N − 1) ≤ Ecv + O(1/√N)

(the first inequality uses the decreasing learning curve; the second uses the nearly independent en)

Ecv can be used for model selection just as Eval, for example to choose λ.
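
For example, a short sketch of choosing λ by Ecv using the analytic formula from the previous slide (the λ grid is an illustrative assumption):

import numpy as np

def analytic_Ecv(Z, y, lam):
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T)
    return float(np.mean((((H @ y) - y) / (1.0 - np.diag(H))) ** 2))

def choose_lambda_by_cv(Z, y, lambdas):
    Ecvs = [analytic_Ecv(Z, y, lam) for lam in lambdas]          # Ecv for each candidate lambda
    lam_star = lambdas[int(np.argmin(Ecvs))]
    w_star = np.linalg.solve(Z.T @ Z + lam_star * np.eye(Z.shape[1]), Z.T @ y)
    return lam_star, w_star                                      # final weights trained on all of D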

Digits Problem: ‘1’ Versus ‘Not 1’

[Figure: left, the digits data in the (average intensity, symmetry) plane, '1' versus 'not 1'; right, Ein, Ecv, and Eout versus the number of features used (5 to 20), with errors in the 0.01 to 0.03 range.]

x = (1, x1, x2)

z = (1, x1, x2, x1², x1x2, x2², x1³, x1²x2, x1x2², x2³, . . . , x1⁵, x1⁴x2, x1³x2², x1²x2³, x1x2⁴, x2⁵)

(5th-order polynomial transform −→ 20-dimensional nonlinear feature space)
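
A sketch of this transform (the ordering of the monomials is an assumption; any fixed ordering spans the same feature space):

import numpy as np

def poly_transform(x1, x2, degree=5):
    # Map (x1, x2) to (1, x1, x2, x1^2, x1*x2, x2^2, ..., x2^degree).
    feats = [np.ones_like(x1)]
    for d in range(1, degree + 1):
        for j in range(d + 1):                 # monomials x1^(d-j) * x2^j of total degree d
            feats.append(x1 ** (d - j) * x2 ** j)
    return np.column_stack(feats)              # shape (N, 21): bias plus 20 nonlinear features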

Validation Wins In the Real World

[Figure: decision boundaries in the (average intensity, symmetry) plane for the two cases below.]

no validation (20 features):    Ein = 0%,   Eout = 2.5%

cross validation (6 features):  Ein = 0.8%,  Eout = 1.5%

© AML Creator: Malik Magdon-Ismail. Validation and Model Selection: 29/29.