Lecture 4: Model Selection
Tuo Zhao
Schools of ISYE and CSE, Georgia Tech
ISYE/CSE 6740: Computational Data Analysis
Regularization Selection

Given $\lambda_1$ and $\lambda_2$, we solve
$$\hat{\theta}_{\lambda_1} = \mathop{\mathrm{argmin}}_{\theta}\ \mathcal{L}(\theta) + \frac{\lambda_1}{2}\,\|\theta\|_2^2, \qquad \hat{\theta}_{\lambda_2} = \mathop{\mathrm{argmin}}_{\theta}\ \mathcal{L}(\theta) + \frac{\lambda_2}{2}\,\|\theta\|_2^2.$$
Which one is better?

(Continuous) Model Selection

Tuo Zhao — Lecture 4: Model Selection 2/19
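To make the comparison concrete, here is a minimal sketch (data and λ values are made up) of the two regularized fits for a one-dimensional least-squares loss, where the minimizer has a closed form:

```python
# Ridge fit for a 1-D model y ≈ θx with objective
#   L(θ) = ½ Σ (y_i − θ x_i)² + (λ/2) θ²,
# whose minimizer is θ̂_λ = Σ x_i y_i / (Σ x_i² + λ).
def ridge_1d(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.1, 4.2]             # made-up data, roughly y = x

theta_1 = ridge_1d(xs, ys, lam=0.1)   # weak regularization
theta_2 = ridge_1d(xs, ys, lam=10.0)  # strong regularization
print(theta_1, theta_2)               # the larger λ shrinks θ̂ toward zero
```

Both fits are legitimate; which λ is "better" is exactly the model-selection question the rest of the lecture addresses.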
Margin?
Tuo Zhao — Lecture 4: Model Selection 3/19
Model Selection

Given a regression problem, we consider two models:
$$Y = \theta_0 + \theta_1 X + \theta_2 X^2 + \theta_3 X^3,$$
$$Y = \theta_0 + \theta_1 X + \theta_2 X^2.$$
Which one is better?

(Discrete) Model Selection.

Tuo Zhao — Lecture 4: Model Selection 4/19
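A quick way to see why training error alone cannot arbitrate between the two models: the cubic nests the quadratic, so its least-squares training error is never larger. A minimal sketch (data are made up; `fit_poly` is an illustrative helper solving the normal equations):

```python
def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations."""
    n = degree + 1
    M = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):                  # Gaussian elimination with pivoting
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= f * M[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for r in reversed(range(n)):          # back-substitution
        coef[r] = (b[r] - sum(M[r][c] * coef[c] for c in range(r + 1, n))) / M[r][r]
    return coef

def mse(coef, xs, ys):
    preds = (sum(c * x ** i for i, c in enumerate(coef)) for x in xs)
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0.1, 0.4, 0.9, 2.3, 3.9, 6.4, 8.8]   # made-up, roughly quadratic data
quad = fit_poly(xs, ys, degree=2)
cubic = fit_poly(xs, ys, degree=3)
# The richer (cubic) model can only match or beat the quadratic on training
# data, so we need held-out data to compare them honestly.
print(mse(quad, xs, ys), mse(cubic, xs, ys))
```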
Residuals?
Tuo Zhao — Lecture 4: Model Selection 5/19
Regularization and Constraint

Constrained Empirical Risk Minimization:
$$\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta}\ \mathcal{L}(\theta) \quad \text{subject to} \quad \|\theta\|_2^2 \le R.$$

Min-Max Problem (Lagrangian form):
$$(\hat{\theta}, \hat{\lambda}) = \mathop{\mathrm{argmin}}_{\theta}\, \max_{\lambda \ge 0}\ \mathcal{L}(\theta) + \lambda\,(\|\theta\|_2^2 - R).$$

Regularized Empirical Risk Minimization:
$$\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta}\ \mathcal{L}(\theta) + \lambda\,\|\theta\|_2^2.$$

Tuo Zhao — Lecture 4: Model Selection 6/19
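The correspondence can be seen numerically in the 1-D ridge example: as λ grows, the squared norm of the regularized solution shrinks monotonically, so every feasible radius R is reached by some λ, and the constrained and regularized problems trace out the same solution path. A sketch with made-up data:

```python
# For the 1-D ridge objective ½Σ(y − θx)² + (λ/2)θ², the minimizer is
# θ̂_λ = Σxy / (Σx² + λ); its squared norm decreases monotonically in λ.
def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0], [1.1, 2.2, 2.9]   # made-up data
lams = [0.0, 0.5, 2.0, 8.0, 32.0]
radii = [ridge_1d(xs, ys, lam) ** 2 for lam in lams]
print(radii)   # ‖θ̂_λ‖² strictly decreasing in λ
```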
Regularization and Constraint

Constrained Empirical Risk Minimization:
$$\hat{f}_R = \mathop{\mathrm{argmin}}_{f}\ \mathcal{L}(f) \quad \text{subject to} \quad f \in \mathcal{F}_R.$$

Regularized Empirical Risk Minimization:
$$\hat{f}_\lambda = \mathop{\mathrm{argmin}}_{f}\ \mathcal{L}(f) + \mathcal{R}_\lambda(f).$$

One-to-one correspondence: $\mathcal{F}_R$ and $\mathcal{R}_\lambda$.

Tuo Zhao — Lecture 4: Model Selection 7/19
Learn to Generalize

Given a loss function $\ell(f(X), Y)$, we define
$$\mathcal{E}(f) = \mathbb{E}_{X,Y}\,\ell(f(X), Y).$$

Empirical Risk Minimization:
$$\hat{f} = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}_R}\ \hat{\mathcal{E}}(f), \quad \text{where} \quad \hat{\mathcal{E}}(f) = \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i)}_{\text{Training Error}}.$$

How do we estimate the testing error $\mathcal{E}(\hat{f})$?

We need an independent data set!

Tuo Zhao — Lecture 4: Model Selection 8/19
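An independent sample gives an unbiased estimate of the testing error. As a toy check (the predictor, data distribution, and sample size are illustrative assumptions): for a fixed predictor $f(x) = x$ on data $Y = X + \varepsilon$ with $\varepsilon \sim N(0,1)$, the true risk under squared loss is $\mathrm{Var}(\varepsilon) = 1$, and the empirical average over fresh data recovers it.

```python
import random

rng = random.Random(0)

# Draw independent data from Y = X + ε, ε ~ N(0, 1).
def draw(n):
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [x + rng.gauss(0, 1) for x in xs]
    return xs, ys

# Empirical risk of f(x) = x on an independent sample: an unbiased estimate
# of the true risk E[(f(X) − Y)²] = Var(ε) = 1.
xs, ys = draw(20000)
est_risk = sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(est_risk)   # close to the true risk 1.0
```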
A Simple Note on Learning Theory

Oracle Model:
$$f^* = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}_R}\ \mathcal{E}(f).$$

Generalization Bound:
$$\underbrace{\mathcal{E}(\hat{f})}_{\text{Testing Error}} - \underbrace{\hat{\mathcal{E}}(\hat{f})}_{\text{Training Error}} \le\ ?$$

Excess Risk Bound:
$$\underbrace{\mathcal{E}(\hat{f})}_{\text{Testing Error}} - \underbrace{\mathcal{E}(f^*)}_{\text{Oracle Error}} \le\ ?$$

Tuo Zhao — Lecture 4: Model Selection 9/19
Learning and Validation Sets

We split the whole dataset into two disjoint subsets:

Training Set: $\{(x_1, y_1), \dots, (x_n, y_n)\}$
$$\hat{f}_{R_k} = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}_{R_k}}\ \hat{\mathcal{E}}(f)$$

Validation Set: $\{(\tilde{x}_1, \tilde{y}_1), \dots, (\tilde{x}_m, \tilde{y}_m)\}$
$$\hat{\lambda} = \mathop{\mathrm{argmin}}_{\lambda_k \in \{\lambda_1, \dots, \lambda_K\}}\ \tilde{\mathcal{E}}(\hat{f}_{R_k}), \quad \text{where} \quad \tilde{\mathcal{E}}(\hat{f}_{R_k}) = \frac{1}{m}\sum_{i=1}^{m} \ell(\hat{f}_{R_k}(\tilde{x}_i), \tilde{y}_i)$$

Tuo Zhao — Lecture 4: Model Selection 10/19
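The hold-out selection procedure above can be sketched end to end for the 1-D ridge model (the data, λ grid, and `ridge_1d` helper are illustrative assumptions):

```python
# Fit on the training set for each candidate λ, then pick the λ whose fit
# has the smallest error on the held-out validation set.
def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def val_error(theta, xs, ys):
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = [1.0, 2.0, 3.0, 4.0], [1.3, 1.8, 3.2, 4.1]   # training set
val_x, val_y = [1.5, 2.5, 3.5], [1.6, 2.4, 3.6]                 # validation set

grid = [0.01, 0.1, 1.0, 10.0]
fits = {lam: ridge_1d(train_x, train_y, lam) for lam in grid}
errors = {lam: val_error(fits[lam], val_x, val_y) for lam in grid}
best_lam = min(errors, key=errors.get)
print(best_lam, errors[best_lam])
```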
Cross Validation
Tuo Zhao — Lecture 4: Model Selection 11/19
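The cross-validation scheme this slide illustrates can be written out as a minimal K-fold sketch for the 1-D ridge model (data, λ grid, and the fold assignment are illustrative assumptions):

```python
# K-fold cross-validation: each fold serves once as the validation set, and
# the K validation errors are averaged for each candidate λ.
def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=5):
    n, total = len(xs), 0.0
    for fold in range(k):
        val_idx = set(range(fold, n, k))              # every k-th point
        tr_x = [xs[i] for i in range(n) if i not in val_idx]
        tr_y = [ys[i] for i in range(n) if i not in val_idx]
        theta = ridge_1d(tr_x, tr_y, lam)
        total += sum((theta * xs[i] - ys[i]) ** 2 for i in val_idx) / len(val_idx)
    return total / k

xs = [0.5 * i for i in range(1, 11)]
ys = [x + (-1) ** i * 0.2 for i, x in enumerate(xs)]   # made-up data
scores = {lam: cv_error(xs, ys, lam) for lam in [0.01, 0.1, 1.0]}
print(min(scores, key=scores.get))
```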
Double Cross Validation

Cross validation: a reliable estimate of the testing error?

The optimal λ is selected based on all of the data.

No! The cross-validation error is not obtained from independent data.

Double Cross Validation:
Learning Set: training the model
Validation Set: selecting the model
Testing Set: estimating the testing error

Tuo Zhao — Lecture 4: Model Selection 12/19
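The three-way split can be sketched as follows (data, split sizes, λ grid, and `ridge_1d` are illustrative assumptions): select λ on the validation set, then estimate the testing error on data that played no role in either fitting or selection.

```python
def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def err(theta, xs, ys):
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

data = [(0.3 * i, 0.3 * i + (-1) ** i * 0.15) for i in range(1, 16)]  # made up
learn, valid, test = data[0:9], data[9:12], data[12:15]   # disjoint splits
lx, ly = zip(*learn); vx, vy = zip(*valid); tx, ty = zip(*test)

grid = [0.01, 0.1, 1.0, 10.0]
# Select λ on the validation set ...
best = min(grid, key=lambda lam: err(ridge_1d(lx, ly, lam), vx, vy))
# ... and only then estimate the testing error on the untouched testing set.
test_err = err(ridge_1d(lx, ly, best), tx, ty)
print(best, test_err)
```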
Double Cross Validation
Tuo Zhao — Lecture 4: Model Selection 13/19
Early Stopping
Tuo Zhao — Lecture 4: Model Selection 14/19
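Early stopping treats the iteration count itself as the model-selection knob: run gradient descent on the training loss and halt once the validation error stops improving. A minimal sketch for the 1-D least-squares model (data, learning rate, and patience are illustrative assumptions):

```python
# Gradient descent on the training loss Σ(θx − y)², halted when the
# validation error fails to improve for `patience` consecutive steps.
train_x, train_y = [1.0, 2.0, 3.0], [1.4, 1.7, 3.3]   # made-up data
val_x, val_y = [1.5, 2.5], [1.5, 2.6]

def val_err(theta):
    return sum((theta * x - y) ** 2 for x, y in zip(val_x, val_y)) / len(val_x)

theta, lr, patience = 0.0, 0.01, 3
best_theta, best_err, bad_steps = theta, val_err(theta), 0
for step in range(1000):
    grad = sum(2 * (theta * x - y) * x for x, y in zip(train_x, train_y))
    theta -= lr * grad
    e = val_err(theta)
    if e < best_err:
        best_theta, best_err, bad_steps = theta, e, 0
    else:
        bad_steps += 1
        if bad_steps >= patience:   # validation error stopped improving
            break
print(best_theta, best_err)
```

Note the stopped iterate, not the fully converged one, is returned: stopping early acts as an implicit form of regularization.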
Grid Search
Tuo Zhao — Lecture 4: Model Selection 15/19
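Grid search simply evaluates every combination of hyperparameter values on a validation criterion and keeps the best; the cost grows multiplicatively with each added dimension. A sketch (the two grids and the `validation_error` criterion are made-up stand-ins for a real validation error):

```python
import itertools

lams = [0.01, 0.1, 1.0]
degrees = [1, 2, 3]

def validation_error(lam, degree):        # hypothetical stand-in criterion
    return (lam - 0.1) ** 2 + (degree - 2) ** 2

# Exhaustively score every (λ, degree) pair on the grid.
candidates = list(itertools.product(lams, degrees))
best = min(candidates, key=lambda p: validation_error(*p))
print(len(candidates), best)   # 9 candidates; best is (0.1, 2)
```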
Climb Hill
Tuo Zhao — Lecture 4: Model Selection 16/19
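Hill climbing replaces the exhaustive grid with a local search: start somewhere on the grid and move to a neighboring value whenever it improves the validation error. A sketch over a 1-D λ grid (the grid and the `validation_error` criterion are made-up stand-ins):

```python
grid = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]

def validation_error(lam):                # hypothetical stand-in criterion
    return (lam - 0.3) ** 2

i = 0                                     # start at the first grid point
while True:
    neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(grid)]
    best_j = min(neighbors, key=lambda j: validation_error(grid[j]))
    if validation_error(grid[best_j]) < validation_error(grid[i]):
        i = best_j                        # climb to the better neighbor
    else:
        break                             # local optimum reached
print(grid[i])   # 0.3
```

This needs far fewer evaluations than a full grid sweep, but it can stall at a local optimum if the validation error is not unimodal along the grid.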
Hyperparameter Optimization

Regularization parameter selection can be viewed as an optimization problem
$$\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta}\ \tilde{\mathcal{E}}(\theta),$$
where $\tilde{\mathcal{E}}(\theta)$ is the validation error on the validation set (here θ denotes the hyperparameters).

Different assumptions on $\tilde{\mathcal{E}}(\theta)$ lead to different algorithms.

Example: Gaussian Process, ....

Tuo Zhao — Lecture 4: Model Selection 17/19
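A toy sketch of the sequential model-based idea behind such algorithms. A Gaussian-process surrogate is the usual choice; for brevity this version substitutes a quadratic surrogate fit through the best observed (λ, validation-error) pairs and evaluates its minimizer next (the criterion, starting points, and surrogate choice are all illustrative assumptions):

```python
def validation_error(lam):                # hypothetical stand-in criterion
    return (lam - 1.7) ** 2 + 0.5

observed = [(0.0, validation_error(0.0)), (1.0, validation_error(1.0)),
            (4.0, validation_error(4.0))]

for _ in range(3):
    # Fit a quadratic surrogate exactly through the three best points so far;
    # divided differences give its leading coefficients in closed form.
    (x0, y0), (x1, y1), (x2, y2) = sorted(observed, key=lambda p: p[1])[:3]
    d0 = y0 / ((x0 - x1) * (x0 - x2))
    d1 = y1 / ((x1 - x0) * (x1 - x2))
    d2 = y2 / ((x2 - x0) * (x2 - x1))
    a = d0 + d1 + d2                      # surrogate is aλ² + bλ + c
    b = -(d0 * (x1 + x2) + d1 * (x0 + x2) + d2 * (x0 + x1))
    if a <= 0:
        break                             # surrogate not convex; stop
    nxt = -b / (2 * a)                    # minimizer of the surrogate
    if any(abs(nxt - x) < 1e-9 for x, _ in observed):
        break                             # already evaluated; converged
    observed.append((nxt, validation_error(nxt)))

best_lam, best_err = min(observed, key=lambda p: p[1])
print(best_lam, best_err)
```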
Random Search
Tuo Zhao — Lecture 4: Model Selection 18/19
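Random search draws hyperparameter candidates at random (often log-uniformly, since regularization parameters span orders of magnitude) and keeps the best. A sketch (the range, sample count, and `validation_error` criterion are made-up stand-ins):

```python
import random

rng = random.Random(0)

def validation_error(lam):                # hypothetical stand-in criterion
    return (lam - 0.5) ** 2

# Sample λ log-uniformly from [1e-3, 1e1] and keep the best candidate.
samples = [10 ** rng.uniform(-3, 1) for _ in range(50)]
best = min(samples, key=validation_error)
print(best)
```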
Random Latin Search
Tuo Zhao — Lecture 4: Model Selection 19/19
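A Latin (hypercube) design improves on plain random search by stratifying: partition each axis into n equal strata and place exactly one sample per stratum per dimension, using an independent random permutation of the strata for every axis. A minimal sketch (the `latin_hypercube` helper and its parameters are illustrative):

```python
import random

def latin_hypercube(n, d, rng):
    """n points in [0, 1)^d with one sample per stratum along each axis."""
    coords = []
    for _ in range(d):
        perm = list(range(n))             # one stratum index per sample
        rng.shuffle(perm)
        coords.append([(p + rng.random()) / n for p in perm])
    return list(zip(*coords))

rng = random.Random(0)
points = latin_hypercube(n=10, d=2, rng=rng)
print(points[:2])
```

Unlike independent uniform draws, this guarantees every marginal stratum is covered, so no region of any single hyperparameter's range is left unexplored.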