Lecture 4: Model Selection
Tuo Zhao, Schools of ISYE and CSE, Georgia Tech

Transcript of Lecture 4: Model Selection

Page 1:

Lecture 4: Model Selection

Tuo Zhao

Schools of ISYE and CSE, Georgia Tech

Page 2:

ISYE/CSE 6740: Computational Data Analysis

Regularization Selection

Given λ₁ and λ₂, we solve

θ̂_{λ₁} = argmin_θ L(θ) + (λ₁/2) ‖θ‖₂²,
θ̂_{λ₂} = argmin_θ L(θ) + (λ₂/2) ‖θ‖₂².

Which one is better?

(Continuous) Model Selection

Tuo Zhao — Lecture 4: Model Selection 2/19
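As an illustration (not from the slides), here is a minimal NumPy sketch that solves the two regularized problems in closed form, assuming L(θ) is the average squared loss; the synthetic data, the `ridge` helper, and the fresh-data comparison are all assumptions made for the example:

```python
import numpy as np

def ridge(X, y, lam):
    """Solve argmin_theta (1/n)||X theta - y||_2^2 + lam ||theta||_2^2 in closed form."""
    n, d = X.shape
    # Normal equations of the regularized problem: (X^T X / n + lam I) theta = X^T y / n
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
n, d = 50, 20
theta_true = np.zeros(d); theta_true[:3] = 1.0            # synthetic ground truth
X = rng.standard_normal((n, d))
y = X @ theta_true + 0.5 * rng.standard_normal(n)

for lam in (0.01, 1.0):                                   # two candidate lambdas
    theta = ridge(X, y, lam)
    X_new = rng.standard_normal((1000, d))                # fresh data to compare generalization
    err = np.mean((X_new @ theta - X_new @ theta_true) ** 2)
    print(f"lambda={lam}: prediction error {err:.3f}")
```

Neither λ is better a priori; the rest of the lecture is about how to choose between them from data.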


Page 5:

Margin?

(Figure-only slide.)

Page 6:

Model Selection

Given a regression problem, we consider two models,

Y = θ₀ + θ₁X + θ₂X² + θ₃X³,
Y = θ₀ + θ₁X + θ₂X².

Which one is better?

(Discrete) Model Selection.
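The cubic-vs-quadratic choice above can be sketched numerically: fit both candidate models and compare them on held-out data. The synthetic quadratic ground truth and the 50/50 split are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.3 * rng.standard_normal(200)  # true model is quadratic

# Hold out half of the data to compare the two candidate models.
x_tr, y_tr, x_va, y_va = x[:100], y[:100], x[100:], y[100:]

for degree in (3, 2):                          # cubic vs quadratic
    coeffs = np.polyfit(x_tr, y_tr, degree)    # least-squares fit on the training half
    mse = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    print(f"degree {degree}: validation MSE {mse:.4f}")
```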


Page 9:

Residuals?

(Figure-only slide.)

Page 10:

Regularization and Constraint

Constrained Empirical Risk Minimization:

θ̂ = argmin_θ L(θ) subject to ‖θ‖₂² ≤ R.

Min-Max Problem:

(θ̂, λ̂) = argmin_θ max_{λ≥0} L(θ) + λ(‖θ‖₂² − R).

Regularized Empirical Risk Minimization:

θ̂ = argmin_θ L(θ) + λ ‖θ‖₂².
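The correspondence above can be checked numerically: the regularized solution θ̂ satisfies the KKT stationarity condition of the min-max problem, and it solves the constrained problem whose radius is R = ‖θ̂‖₂². A sketch, assuming a squared loss and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

lam = 0.5
# Regularized solution: argmin_theta (1/n)||X theta - y||^2 + lam ||theta||_2^2
theta = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# KKT stationarity: grad L(theta) + lam * grad ||theta||_2^2 = 0
grad_L = 2 * X.T @ (X @ theta - y) / n
assert np.allclose(grad_L + 2 * lam * theta, 0, atol=1e-8)

# The same theta solves the constrained problem with radius R = ||theta||_2^2,
# where the constraint is active.
R = np.sum(theta ** 2)
print(f"constraint active at R = {R:.4f}")
```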


Page 13:

Regularization and Constraint

Constrained Empirical Risk Minimization:

f̂_R = argmin_f L(f) subject to f ∈ F_R.

Regularized Empirical Risk Minimization:

f̂_λ = argmin_f L(f) + R_λ(f).

One-to-one correspondence between F_R and R_λ.


Page 16:

Learn to Generalize

Given a loss function ℓ(f(X), Y), we define

E(f) = E_{X,Y} ℓ(f(X), Y).

Empirical Risk Minimization:

f̂ = argmin_{f∈F_R} Ê(f), where Ê(f) = (1/n) Σ_{i=1}^{n} ℓ(f(xᵢ), yᵢ) is the training error.

How do we estimate the testing error E(f̂)?

We need an independent data set!
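A small sketch of the point above, assuming a squared loss and a synthetic over-parameterized regression: the training error of the empirical risk minimizer is an optimistic estimate, while independent data drawn from the same distribution estimates E(f̂) honestly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 40, 30                                    # few samples, many features -> overfitting
X = rng.standard_normal((n, d))
y = X[:, 0] + rng.standard_normal(n)

theta = np.linalg.lstsq(X, y, rcond=None)[0]     # empirical risk minimizer (squared loss)
train_err = np.mean((X @ theta - y) ** 2)

# An independent sample from the same distribution estimates the testing error E(f).
X_te = rng.standard_normal((10000, d))
y_te = X_te[:, 0] + rng.standard_normal(10000)
test_err = np.mean((X_te @ theta - y_te) ** 2)

print(f"training error {train_err:.3f} << testing error {test_err:.3f}")
```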


Page 20:

A Simple Note on Learning Theory

Oracle Model:

f* = argmin_{f∈F_R} E(f).

Generalization Bound:

E(f̂) − Ê(f̂) ≤ ?   (testing error minus training error)

Excess Risk Bound:

E(f̂) − E(f*) ≤ ?   (testing error minus oracle error)


Page 23:

Learning and Validation Sets

We split the whole dataset into two disjoint subsets:

Training Set {(x₁, y₁), ..., (xₙ, yₙ)}:

f̂_{R_k} = argmin_{f∈F_{R_k}} Ê(f)

Validation Set {(x₁, y₁), ..., (x_m, y_m)}:

λ̂ = argmin_{λ∈{λ₁,...,λ_K}} Ẽ(f̂_{R_k}), where Ẽ(f̂_{R_k}) = (1/m) Σ_{i=1}^{m} ℓ(f̂_{R_k}(xᵢ), yᵢ)
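The two-step procedure above can be sketched in a few lines, assuming ridge regression as the model family and a synthetic dataset; the `ridge` helper and the candidate grid are assumptions made for the example:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution for the average squared loss."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(4)
X = rng.standard_normal((150, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(150)

# Disjoint training and validation sets.
X_tr, y_tr = X[:100], y[:100]
X_va, y_va = X[100:], y[100:]

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
val_errs = []
for lam in lambdas:
    theta = ridge(X_tr, y_tr, lam)                        # fit on the training set only
    val_errs.append(np.mean((X_va @ theta - y_va) ** 2))  # score on the validation set
best_lam = lambdas[int(np.argmin(val_errs))]
print(f"selected lambda = {best_lam}")
```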


Page 25:

Cross Validation

(Figure-only slide.)
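A minimal k-fold cross-validation sketch, again assuming ridge regression and synthetic data: each fold is held out once, the model is trained on the remaining folds, and the held-out errors are averaged to score each λ:

```python
import numpy as np

def ridge(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def cv_error(X, y, lam, k=5):
    """Average held-out error of ridge(lam) over k folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for fold in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False                       # hold out this fold
        theta = ridge(X[mask], y[mask], lam)     # train on the remaining k-1 folds
        errs.append(np.mean((X[fold] @ theta - y[fold]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 8))
y = X @ rng.standard_normal(8) + rng.standard_normal(100)

lambdas = [0.01, 0.1, 1.0, 10.0]
best = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print(f"5-fold CV selects lambda = {best}")
```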

Page 26:

Double Cross Validation

Is cross validation a reliable estimate of the testing error?

No! The optimal λ is selected based on all of the data, so the cross-validation error is not obtained from independent data.

Double Cross Validation:

Learning Set: training the model
Validation Set: selecting the model
Testing Set: estimating the testing error
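The three-way split above can be sketched as follows, assuming ridge regression and synthetic data: λ is chosen on the validation set, and only the untouched test set yields an honest estimate of the selected model's error:

```python
import numpy as np

def ridge(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(6)
X = rng.standard_normal((300, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(300)

# Three disjoint subsets: learn, validate, test.
X_le, y_le = X[:150], y[:150]        # learning set: train each candidate model
X_va, y_va = X[150:225], y[150:225]  # validation set: pick lambda
X_te, y_te = X[225:], y[225:]        # testing set: untouched until the very end

lambdas = [0.01, 0.1, 1.0, 10.0]
models = {lam: ridge(X_le, y_le, lam) for lam in lambdas}
best = min(lambdas, key=lambda lam: np.mean((X_va @ models[lam] - y_va) ** 2))

# Only the test set gives an unbiased estimate of the selected model's error.
test_err = np.mean((X_te @ models[best] - y_te) ** 2)
print(f"selected lambda = {best}, estimated testing error = {test_err:.3f}")
```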


Page 29:

Double Cross Validation

(Figure-only slide.)

Page 30:

Early Stopping

(Figure-only slide.)
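A sketch of early stopping as implicit regularization, with gradient descent on an over-parameterized least-squares problem; the synthetic data, learning rate, and `patience` rule are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 60, 50                                    # over-parameterized: GD will overfit
X = rng.standard_normal((n, d))
y = X[:, 0] + 0.5 * rng.standard_normal(n)
X_va = rng.standard_normal((500, d))             # validation set monitors generalization
y_va = X_va[:, 0] + 0.5 * rng.standard_normal(500)

theta = np.zeros(d)
lr, patience, best_err, best_theta, bad = 0.01, 10, np.inf, theta.copy(), 0
for step in range(2000):
    theta -= lr * 2 * X.T @ (X @ theta - y) / n  # gradient step on the training loss
    val_err = np.mean((X_va @ theta - y_va) ** 2)
    if val_err < best_err:
        best_err, best_theta, bad = val_err, theta.copy(), 0
    else:
        bad += 1
        if bad >= patience:                      # validation error stopped improving
            break
print(f"stopped at step {step} with validation error {best_err:.3f}")
```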

Page 31:

Grid Search

(Figure-only slide.)
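A grid-search sketch over two hyperparameters; the ridge-regularized polynomial model, the grid values, and the `fit_poly_ridge` helper are hypothetical choices made for the example:

```python
import numpy as np
from itertools import product

def fit_poly_ridge(x, y, degree, lam):
    """Ridge-regularized polynomial fit: two hyperparameters (degree, lam)."""
    Phi = np.vander(x, degree + 1)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)

rng = np.random.default_rng(8)
x = rng.uniform(-1, 1, 120)
y = np.sin(3 * x) + 0.2 * rng.standard_normal(120)
x_tr, y_tr, x_va, y_va = x[:80], y[:80], x[80:], y[80:]

def val_err(deg, lam):
    w = fit_poly_ridge(x_tr, y_tr, deg, lam)
    return np.mean((np.vander(x_va, deg + 1) @ w - y_va) ** 2)

# Exhaustively evaluate every point on the hyperparameter grid.
degrees = [1, 3, 5, 7]
lambdas = [1e-4, 1e-2, 1e0]
best_deg, best_lam = min(product(degrees, lambdas), key=lambda p: val_err(*p))
print(f"grid search picks degree={best_deg}, lambda={best_lam}")
```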

Page 32:

Hill Climbing

(Figure-only slide.)
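A hill-climbing sketch over a single hyperparameter, moving on the log₁₀(λ) axis toward whichever neighbor lowers the validation error; the ridge model, synthetic data, and step-halving rule are assumptions made for the example:

```python
import numpy as np

def ridge(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(9)
X = rng.standard_normal((150, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(150)
X_tr, y_tr, X_va, y_va = X[:100], y[:100], X[100:], y[100:]

def val_err(log_lam):
    theta = ridge(X_tr, y_tr, 10.0 ** log_lam)
    return np.mean((X_va @ theta - y_va) ** 2)

# Hill climbing on log10(lambda): step toward the better neighbor.
log_lam, step = 0.0, 1.0
for _ in range(50):
    candidates = [log_lam - step, log_lam, log_lam + step]
    best = min(candidates, key=val_err)
    if best == log_lam:
        step /= 2            # no neighbor improves: refine the step size
    log_lam = best
print(f"hill climbing settles at lambda = {10.0 ** log_lam:.4g}")
```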

Page 33:

Hyperparameter Optimization

Regularization parameter selection can be viewed as an optimization problem

λ̂ = argmin_λ Ẽ(λ),

where Ẽ(λ) is the validation error of the λ-regularized model on the validation set.

Different assumptions on Ẽ(λ) lead to different algorithms.

Example: Gaussian Process, ....


Page 36:

Random Search

(Figure-only slide.)
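A random-search sketch: instead of a fixed grid, draw λ uniformly on a log scale and keep the sample with the lowest validation error. The ridge model, synthetic data, sampling range, and budget of 20 draws are assumptions made for the example:

```python
import numpy as np

def ridge(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(10)
X = rng.standard_normal((150, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(150)
X_tr, y_tr, X_va, y_va = X[:100], y[:100], X[100:], y[100:]

def val_err(lam):
    theta = ridge(X_tr, y_tr, lam)
    return np.mean((X_va @ theta - y_va) ** 2)

# Random search: 20 lambdas drawn log-uniformly from [1e-4, 1e2].
samples = 10.0 ** rng.uniform(-4, 2, size=20)
best = min(samples, key=val_err)
print(f"random search picks lambda = {best:.4g}")
```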

Page 37:

Random Latin Search

(Figure-only slide.)
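Latin hypercube sampling refines plain random search: each hyperparameter axis is divided into K strata and each stratum is hit exactly once, so the samples cover every dimension evenly. A sketch; the two hypothetical hyperparameters (log₁₀ λ and degree) and their ranges are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(11)
K = 8  # number of hyperparameter configurations to try

def latin_hypercube(k, dims, rng):
    """k points in [0, 1)^dims with exactly one point per stratum per dimension."""
    cols = []
    for _ in range(dims):
        # Permute the k strata, then jitter uniformly inside each stratum.
        cols.append((rng.permutation(k).reshape(-1, 1) + rng.uniform(size=(k, 1))) / k)
    return np.hstack(cols)

points = latin_hypercube(K, 2, rng)
# Map unit coordinates to the two hypothetical hyperparameters.
log_lams = -4 + 6 * points[:, 0]               # lambda in [1e-4, 1e2)
degrees = 1 + (6 * points[:, 1]).astype(int)   # degree in {1, ..., 6}
for ll, deg in zip(log_lams, degrees):
    print(f"try lambda=10^{ll:.2f}, degree={deg}")
```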