
Machine Learning Basics Lecture 6: Overfitting

Princeton University COS 495

Instructor: Yingyu Liang

Review: machine learning basics

Math formulation

• Given training data {(𝑥𝑖, 𝑦𝑖) : 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Find 𝑦 = 𝑓(𝑥) ∈ 𝓗 that minimizes the empirical loss 𝐿̂(𝑓) = (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑙(𝑓, 𝑥𝑖, 𝑦𝑖)

• s.t. the expected loss 𝐿(𝑓) = 𝔼_{(𝑥,𝑦)∼𝐷}[𝑙(𝑓, 𝑥, 𝑦)] is small (both quantities are illustrated in the sketch below)
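To make the two losses concrete, here is a minimal Python sketch (not from the lecture; the hypothesis 𝑓, the toy distribution 𝐷, and the choice of squared loss are illustrative assumptions). It computes the empirical loss on a small training sample and approximates the expected loss with a very large fresh sample:

```python
# Minimal sketch: empirical vs. expected loss for squared loss
# l(f, x, y) = (f(x) - y)^2.  The hypothesis f and the distribution D
# below are illustrative assumptions, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A fixed hypothesis: f(x) = 2x
    return 2.0 * x

def sample_D(n):
    # Toy distribution D: x ~ Uniform[0, 1], y = 3x + Gaussian noise
    x = rng.uniform(0.0, 1.0, n)
    y = 3.0 * x + rng.normal(0.0, 0.1, n)
    return x, y

def empirical_loss(f, x, y):
    # L_hat(f) = (1/n) * sum_i l(f, x_i, y_i)
    return np.mean((f(x) - y) ** 2)

x_tr, y_tr = sample_D(20)
print("empirical loss:", empirical_loss(f, x_tr, y_tr))

# The expected loss L(f) = E_{(x,y)~D}[l(f, x, y)] is approximated here
# by the empirical loss on a very large fresh sample.
x_big, y_big = sample_D(1_000_000)
print("~ expected loss:", empirical_loss(f, x_big, y_big))
```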

Machine learning 1-2-3

• Collect data and extract features (feature mapping)

• Build model: choose hypothesis class 𝓗 and loss function 𝑙 (Occam's razor, maximum likelihood; this lecture: overfitting)

• Optimization: minimize the empirical loss (gradient descent; convex optimization)

Linear vs nonlinear models

• Feature mapping for the degree-2 polynomial kernel, for input 𝑥 = (𝑥1, 𝑥2) and constant 𝑐:

𝜙(𝑥) = (𝑥1², 𝑥2², √2 𝑥1𝑥2, √2𝑐 𝑥1, √2𝑐 𝑥2, 𝑐)

• Classifier: 𝑦 = sign(𝑤ᵀ𝜙(𝑥) + 𝑏)

Polynomial kernel (a sketch verifying the kernel identity follows)
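The defining property of this mapping is that inner products in feature space equal the kernel value 𝑘(𝑥, 𝑧) = (𝑥ᵀ𝑧 + 𝑐)². A minimal sketch checking this identity numerically (the particular points 𝑥, 𝑧 and the constant 𝑐 are arbitrary choices):

```python
# Minimal sketch: the degree-2 polynomial kernel identity
# phi(x) . phi(z) == (x . z + c)^2 for 2-D inputs.
import numpy as np

def phi(x, c):
    # Feature mapping from the slide, for x = (x1, x2)
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0 * c) * x1,
                     np.sqrt(2.0 * c) * x2,
                     c])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
c = 1.0
print(phi(x, c) @ phi(z, c))   # 0.25
print((x @ z + c) ** 2)        # 0.25 -- same value
```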

Linear vs nonlinear models

• Linear model: 𝑓(𝑥) = 𝑎0 + 𝑎1𝑥

• Nonlinear model: 𝑓(𝑥) = 𝑎0 + 𝑎1𝑥 + 𝑎2𝑥² + 𝑎3𝑥³ + … + 𝑎𝑀𝑥^𝑀

• Linear model ⊆ nonlinear model (one can always set 𝑎𝑖 = 0 for 𝑖 > 1)

• So the nonlinear model can always achieve the same or smaller training error

• Why, then, use Occam's razor (choose a smaller hypothesis class)?

Example: regression using polynomial curve

• Data: 𝑡 = sin(2𝜋𝑥) + 𝜖

• Regression using polynomial of degree M (a numerical sketch follows)

[Figures from Pattern Recognition and Machine Learning, Bishop: polynomial fits of increasing degree M to the noisy data]
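The flavor of Bishop's figures can be reproduced numerically. A minimal sketch (my own toy setup, not Bishop's code; sample sizes, noise level, and degrees are illustrative): fit polynomials of increasing degree M to ten noisy samples of sin(2𝜋𝑥) and compare training error with error on fresh test data:

```python
# Minimal sketch: polynomial curve fitting on t = sin(2*pi*x) + noise.
# Setup (sample sizes, noise level, degrees) is illustrative.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.3):
    x = rng.uniform(0.0, 1.0, n)
    t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise, n)
    return x, t

x_tr, t_tr = make_data(10)       # small training set, as in the figures
x_te, t_te = make_data(1000)     # large fresh test set

def rmse(coeffs, x, t):
    return np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))

for M in (0, 1, 3, 9):
    coeffs = np.polyfit(x_tr, t_tr, M)   # least-squares fit of degree M
    print(f"M={M}: train RMSE {rmse(coeffs, x_tr, t_tr):.3f}, "
          f"test RMSE {rmse(coeffs, x_te, t_te):.3f}")
# Typically M=9 drives the training error toward 0 while the test error
# blows up: overfitting.
```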

Prevent overfitting

• Empirical loss and expected loss are different

• Also called training error and test/generalization error

• The larger the data set, the smaller the difference between the two

• The larger the hypothesis class, the easier it is to find a hypothesis that exploits the difference between the two, i.e., has small training error but large test error (overfitting)

• A larger data set helps! (use experience/data to prune hypotheses; see the sketch below)

• Throwing away useless hypotheses also helps! (use prior knowledge/model to prune hypotheses)
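A minimal sketch of the "larger data set helps" point, continuing the toy polynomial setup above (sample sizes and noise level are again illustrative): keep the hypothesis class fixed at degree-9 polynomials and watch the gap between training and test error shrink as n grows:

```python
# Minimal sketch: with capacity fixed (degree-9 polynomials), more data
# shrinks the gap between training and test error.  Setup is illustrative.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, noise=0.3):
    x = rng.uniform(0.0, 1.0, n)
    t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise, n)
    return x, t

x_te, t_te = make_data(2000)
for n in (10, 100, 1000):
    x_tr, t_tr = make_data(n)
    c = np.polyfit(x_tr, t_tr, 9)        # fixed capacity: M = 9
    tr = np.sqrt(np.mean((np.polyval(c, x_tr) - t_tr) ** 2))
    te = np.sqrt(np.mean((np.polyval(c, x_te) - t_te) ** 2))
    print(f"n={n}: train RMSE {tr:.3f}, test RMSE {te:.3f}, gap {te - tr:.3f}")
```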

Prior vs. data

Prior vs experience

• Super strong prior knowledge: 𝓗 = {𝑓∗}

• No data is needed!

𝑓∗: the best function

Prior vs experience

• Strong prior knowledge: 𝓗 = {𝑓∗, 𝑓1}

• A few data points suffice to detect 𝑓∗


Prior vs experience

• Super large data set: infinite data

• Hypothesis class 𝓗 can be all functions!


Prior vs experience

• Practical scenarios: finite data, 𝓗 of medium capacity, 𝑓∗ in or not in 𝓗

[Figure: two hypothesis classes 𝓗1 and 𝓗2, with 𝑓∗ inside one but not the other]

Prior vs experience

• Practical scenarios lie between the two extreme cases

[Spectrum: 𝓗 = {𝑓∗} at one extreme, infinite data at the other; practice lies in between]

General Phenomenon

Figure from Deep Learning, Goodfellow, Bengio and Courville

Cross validation

Model selection

• How to choose the optimal capacity?

• E.g., choose the best degree for polynomial curve fitting

• Cannot be done by training data alone

• Create held-out data to approximate the test error

• Called the validation data set

Model selection: cross validation

• Partition the training data into several groups

• Each time, use one group as the validation set and the rest for training (see the sketch after the figure)

Figure from Pattern Recognition and Machine Learning, Bishop
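A minimal sketch of k-fold cross validation in plain NumPy (not from the lecture; the data set, k = 5, and the candidate degrees are illustrative), used here to pick the polynomial degree M:

```python
# Minimal sketch: k-fold cross validation for choosing polynomial degree M.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, 30)

def cv_error(x, t, M, k=5):
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)          # partition into k groups
    errs = []
    for i in range(k):
        val = folds[i]                      # one group as validation set
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[trn], t[trn], M)
        errs.append(np.mean((np.polyval(coeffs, x[val]) - t[val]) ** 2))
    return np.mean(errs)                    # average over the k folds

for M in range(10):
    print(f"M={M}: CV error {cv_error(x, t, M):.3f}")
# Pick the M with the smallest cross-validation error.
```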

Model selection: cross validation

• Also used for selecting other hyper-parameters for the model/algorithm

• E.g., learning rate, stopping criterion of SGD, etc.

• Pros: general, simple

• Cons: computationally expensive; even worse when there are more hyper-parameters (a rough cost sketch follows)
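A back-of-the-envelope sketch of that cost (the grids below are hypothetical): with k-fold cross validation, each hyper-parameter combination requires k training runs, so the total number of runs is k times the product of all grid sizes:

```python
# Minimal sketch: CV cost grows multiplicatively with hyper-parameter grids.
# The grids below are hypothetical.
learning_rates = [0.001, 0.01, 0.1]
stopping_tols = [1e-3, 1e-4, 1e-5]
k = 5
runs = k * len(learning_rates) * len(stopping_tols)
print(runs)  # 45 full training runs for just a 3x3 grid with 5-fold CV
```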