
Machine Learning Basics Lecture 4: SVM I

Princeton University COS 495

Instructor: Yingyu Liang

Review: machine learning basics

Math formulation

• Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$

• Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$

• s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y)\sim D}[l(f, x, y)]$

Machine learning 1-2-3

โ€ข Collect data and extract features

• Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$

โ€ข Optimization: minimize the empirical loss

Loss function

โ€ข ๐‘™2 loss: linear regression

โ€ข Cross-entropy: logistic regression

โ€ข Hinge loss: Perceptron

โ€ข General principle: maximum likelihood estimation (MLE)โ€ข ๐‘™2 loss: corresponds to Normal distribution

โ€ข logistic regression: corresponds to sigmoid conditional distribution
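For reference, the losses named above can be written out directly; a minimal numpy sketch, assuming the conventions $z = w^T x$ for the raw score, $y \in \{0, 1\}$ for cross-entropy, and $y \in \{-1, +1\}$ for the hinge loss (these conventions are assumptions of the sketch, not fixed by the slide):

import numpy as np

def l2_loss(y_hat, y):
    # squared error, used for linear regression
    return (y_hat - y) ** 2

def cross_entropy_loss(z, y):
    # logistic regression: y in {0, 1}, z = w^T x is the raw score
    p = 1.0 / (1.0 + np.exp(-z))              # sigmoid probability of class 1
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge_loss(z, y, margin=1.0):
    # max(0, margin - y*z) with y in {-1, +1}; margin=0 gives the Perceptron
    # criterion, margin=1 the standard SVM hinge loss
    return np.maximum(0.0, margin - y * z)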

Optimization

โ€ข Linear regression: closed form solution

โ€ข Logistic regression: gradient descent

โ€ข Perceptron: stochastic gradient descent

• General principle: local improvement
  • SGD: Perceptron; can also be applied to linear regression/logistic regression
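To make "local improvement" concrete, here is a minimal SGD sketch for logistic regression (the step size, epoch count, and the convention $y \in \{0, 1\}$ are illustrative choices, not prescribed by the lecture):

import numpy as np

def sgd_logistic_regression(X, Y, lr=0.1, epochs=100, seed=0):
    # stochastic gradient descent on the cross-entropy loss, one example at a time
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            p = 1.0 / (1.0 + np.exp(-(w @ X[i])))   # predicted P(y = 1 | x_i)
            w -= lr * (p - Y[i]) * X[i]             # per-example gradient step
    return w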

Principle for hypothesis class?

โ€ข Yes, there exists a general principle (at least philosophically)

• Different names/faces/connections:
  • Occam's razor
  • VC dimension theory
  • Minimum description length
  • Tradeoff between bias and variance; uniform convergence
  • The curse of dimensionality

โ€ข Running example: Support Vector Machine (SVM)

Motivation

Linear classification: $(w^*)^T x = 0$

[Figure: the hyperplane $(w^*)^T x = 0$ with normal vector $w^*$; Class +1 lies on the side where $(w^*)^T x > 0$ and Class -1 on the side where $(w^*)^T x < 0$.]

Assume perfect separation between the two classes.

Attempt

• Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$

• Hypothesis $y = \operatorname{sign}(f_w(x)) = \operatorname{sign}(w^T x)$
  • $y = +1$ if $w^T x > 0$
  • $y = -1$ if $w^T x < 0$

• Let's assume that we can optimize to find $w$
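As code, this hypothesis is just a thresholded inner product; a tiny sketch (variable names illustrative; the behaviour on the boundary $w^T x = 0$ is a convention left to the implementer):

import numpy as np

def predict(w, x):
    # y = sign(w^T x): +1 if w^T x > 0, -1 if w^T x < 0
    return 1 if w @ x > 0 else -1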

Multiple optimal solutions?

[Figure: three hyperplanes $w_1$, $w_2$, $w_3$ that all separate Class +1 from Class -1.]

Same empirical loss; different test/expected loss.

What about $w_1$?

[Figure: Class +1, Class -1, hyperplane $w_1$, and new test data; new points can fall on the wrong side of $w_1$.]

What about $w_3$?

[Figure: Class +1, Class -1, hyperplane $w_3$, and new test data; new points can fall on the wrong side of $w_3$.]

Most confident: $w_2$

[Figure: Class +1, Class -1, hyperplane $w_2$, and new test data; the new points are still classified correctly.]

Intuition: margin

[Figure: $w_2$ separates Class +1 from Class -1 with a large margin.]

Margin

• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^T x = 0$

Proof:

• $w$ is orthogonal to the hyperplane

• The unit direction is $\frac{w}{\|w\|}$

• The projection of $x$ onto this direction is $\left(\frac{w}{\|w\|}\right)^T x = \frac{f_w(x)}{\|w\|}$

[Figure: $x$ and its projection onto the direction $\frac{w}{\|w\|}$, with the hyperplane $w^T x = 0$ through the origin.]

Margin: with bias

• Claim 1: $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^T x + b = 0$

Proof:

• Pick any $x_1$ and $x_2$ on the hyperplane

• $w^T x_1 + b = 0$

• $w^T x_2 + b = 0$

• So $w^T (x_1 - x_2) = 0$

Margin: with bias

• Claim 2: the origin $0$ has signed distance $\frac{-b}{\|w\|}$ to the hyperplane $w^T x + b = 0$

Proof:

• Pick any $x_1$ on the hyperplane

• Project $x_1$ onto the unit direction $\frac{w}{\|w\|}$ to get the distance

• $\left(\frac{w}{\|w\|}\right)^T x_1 = \frac{-b}{\|w\|}$ since $w^T x_1 + b = 0$

Margin: with bias

• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^T x + b = 0$

Proof:

• Write $x = x_\perp + r\frac{w}{\|w\|}$, where $x_\perp$ is the projection of $x$ onto the hyperplane; then $|r|$ is the distance

• Multiply both sides by $w^T$ and add $b$

• Left-hand side: $w^T x + b = f_{w,b}(x)$

• Right-hand side: $w^T x_\perp + r\frac{w^T w}{\|w\|} + b = 0 + r\|w\|$

• So $r = \frac{f_{w,b}(x)}{\|w\|}$, and the distance is $|r| = \frac{|f_{w,b}(x)|}{\|w\|}$
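A quick numerical sanity check of Lemma 2 (the values below are arbitrary illustrative data): the formula $\frac{|f_{w,b}(x)|}{\|w\|}$ should match the geometric distance from $x$ to its projection onto the hyperplane.

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)
b = rng.normal()
x = rng.normal(size=3)

# Distance according to Lemma 2
dist_formula = abs(w @ x + b) / np.linalg.norm(w)

# Geometric check: drop a perpendicular from x onto the hyperplane
x_perp = x - ((w @ x + b) / (w @ w)) * w    # foot of the perpendicular
assert np.isclose(w @ x_perp + b, 0.0)      # x_perp lies on the hyperplane
dist_geometric = np.linalg.norm(x - x_perp)

print(dist_formula, dist_geometric)         # the two values agree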

๐‘ฆ ๐‘ฅ = ๐‘ค๐‘‡๐‘ฅ + ๐‘ค0

The notation here is:

Figure from Pattern Recognition and Machine Learning, Bishop

Support Vector Machine (SVM)

SVM: objective

• Margin over all training data points:
$$\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$$

• Since we only want a correct $f_{w,b}$, and recall $y_i \in \{+1, -1\}$, we have
$$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$

• If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative

SVM: objective

• Maximize the margin over all training data points:
$$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^T x_i + b)}{\|w\|}$$

• A bit complicated …

SVM: simplified objective

• Observation: when $(w, b)$ is scaled by a factor $c > 0$, the margin is unchanged:
$$\frac{y_i (c\,w^T x_i + c\,b)}{\|c\,w\|} = \frac{y_i (w^T x_i + b)}{\|w\|}$$

• Let's consider a fixed scale such that
$$y_{i^*} (w^T x_{i^*} + b) = 1$$
where $x_{i^*}$ is the point closest to the hyperplane

SVM: simplified objective

• Let's consider a fixed scale such that
$$y_{i^*} (w^T x_{i^*} + b) = 1$$
where $x_{i^*}$ is the point closest to the hyperplane

• Now we have, for all data points,
$$y_i (w^T x_i + b) \ge 1$$
and the equality holds for at least one $i$

• Then the margin is $\frac{1}{\|w\|}$

SVM: simplified objective

• The optimization simplifies to
$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1,\ \forall i$$

• How to find the optimum $\hat{w}^*$?
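One concrete (if naive) way to find $\hat{w}^*$ is to hand this quadratic program to a generic constrained solver; the sketch below uses scipy's SLSQP on a toy separable dataset (the data, starting point, and variable names are illustrative, and in practice a dedicated QP/SVM solver would be used):

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
d = X.shape[1]

def objective(theta):                        # theta packs [w, b]
    w = theta[:d]
    return 0.5 * w @ w                       # (1/2) ||w||^2

def constraints(theta):                      # must be >= 0 for every i
    w, b = theta[:d], theta[d]
    return y * (X @ w + b) - 1.0             # y_i (w^T x_i + b) - 1

res = minimize(objective, x0=np.full(d + 1, 0.1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraints}])
w_opt, b_opt = res.x[:d], res.x[d]
print(w_opt, b_opt, 1.0 / np.linalg.norm(w_opt))   # weights, bias, and the margin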

SVM: principle for hypothesis class

Thought experiment

• Suppose we pick an $R$, and suppose we can decide whether there exists a $(w, b)$ satisfying
$$\frac{1}{2}\|w\|^2 \le R, \qquad y_i (w^T x_i + b) \ge 1,\ \forall i$$

• Decrease $R$ until we cannot find a $(w, b)$ satisfying the inequalities

Thought experiment

โ€ข เท๐‘คโˆ— is the best weight (i.e., satisfying the smallest ๐‘…)

เท๐‘คโˆ—


Thought experiment

• To handle the difference between empirical and expected losses:
  • Choose a large-margin hypothesis (high confidence)
  • Choose a small hypothesis class

[Figure: $\hat{w}^*$ inside the region $\frac{1}{2}\|w\|^2 \le R$, which corresponds to the hypothesis class.]

Thought experiment

• Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
  • Also true beyond SVM
  • Also true for the case without perfect separation between the two classes
  • Math formulation: VC-dimension theory, etc.

Thought experiment

• Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
  • Whatever you know about the ground truth, add it as a constraint/regularizer

SVM: optimization

• Optimization (Quadratic Programming):
$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1,\ \forall i$$

• Solved by the Lagrange multiplier method:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$
where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers

• Details in the next lecture

Reading

โ€ข Review Lagrange multiplier method

• E.g., Section 5 of Andrew Ng's note on SVM
  • Posted on the course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/