
Machine Learning Basics Lecture 4: SVM I

Princeton University COS 495

Instructor: Yingyu Liang

Review: machine learning basics

Math formulation

• Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$

• Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$

• s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y)\sim D}[l(f, x, y)]$

Machine learning 1-2-3

โ€ข Collect data and extract features

• Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$

โ€ข Optimization: minimize the empirical loss

Loss function

โ€ข ๐‘™2 loss: linear regression

โ€ข Cross-entropy: logistic regression

โ€ข Hinge loss: Perceptron

โ€ข General principle: maximum likelihood estimation (MLE)โ€ข ๐‘™2 loss: corresponds to Normal distribution

โ€ข logistic regression: corresponds to sigmoid conditional distribution
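For reference, the losses named above can be written out directly; a minimal numpy sketch, assuming the conventions $z = w^T x$ for the raw score, $y \in \{0, 1\}$ for cross-entropy, and $y \in \{-1, +1\}$ for the hinge loss (these conventions are assumptions of the sketch, not fixed by the slide):

import numpy as np

def l2_loss(y_hat, y):
    # squared error, used for linear regression
    return (y_hat - y) ** 2

def cross_entropy_loss(z, y):
    # logistic regression: y in {0, 1}, z = w^T x is the raw score
    p = 1.0 / (1.0 + np.exp(-z))              # sigmoid probability of class 1
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge_loss(z, y, margin=1.0):
    # max(0, margin - y*z) with y in {-1, +1}; margin=0 gives the Perceptron
    # criterion, margin=1 the standard SVM hinge loss
    return np.maximum(0.0, margin - y * z)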

Optimization

โ€ข Linear regression: closed form solution

โ€ข Logistic regression: gradient descent

โ€ข Perceptron: stochastic gradient descent

• General principle: local improvement
  • SGD: Perceptron; can also be applied to linear regression/logistic regression
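To make "local improvement" concrete, here is a minimal SGD sketch for logistic regression (the step size, epoch count, and the convention $y \in \{0, 1\}$ are illustrative choices, not prescribed by the lecture):

import numpy as np

def sgd_logistic_regression(X, Y, lr=0.1, epochs=100, seed=0):
    # stochastic gradient descent on the cross-entropy loss, one example at a time
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            p = 1.0 / (1.0 + np.exp(-(w @ X[i])))   # predicted P(y = 1 | x_i)
            w -= lr * (p - Y[i]) * X[i]             # per-example gradient step
    return w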

Principle for hypothesis class?

โ€ข Yes, there exists a general principle (at least philosophically)

• Different names/faces/connections:
  • Occam's razor
  • VC dimension theory
  • Minimum description length
  • Tradeoff between bias and variance; uniform convergence
  • The curse of dimensionality

โ€ข Running example: Support Vector Machine (SVM)

Motivation

Linear classification: $(w^*)^T x = 0$

[Figure: the hyperplane $(w^*)^T x = 0$ with normal vector $w^*$; Class +1 lies on the side where $(w^*)^T x > 0$ and Class -1 on the side where $(w^*)^T x < 0$.]

Assume perfect separation between the two classes.

Attempt

• Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$

• Hypothesis $y = \operatorname{sign}(f_w(x)) = \operatorname{sign}(w^T x)$
  • $y = +1$ if $w^T x > 0$
  • $y = -1$ if $w^T x < 0$

• Let's assume that we can optimize to find $w$
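As code, this hypothesis is just a thresholded inner product; a tiny sketch (variable names illustrative; the behaviour on the boundary $w^T x = 0$ is a convention left to the implementer):

import numpy as np

def predict(w, x):
    # y = sign(w^T x): +1 if w^T x > 0, -1 if w^T x < 0
    return 1 if w @ x > 0 else -1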

Multiple optimal solutions?

[Figure: three hyperplanes $w_1$, $w_2$, $w_3$ that all separate Class +1 from Class -1.]

Same empirical loss; different test/expected loss.

What about $w_1$?

[Figure: Class +1, Class -1, hyperplane $w_1$, and new test data; new points can fall on the wrong side of $w_1$.]

What about $w_3$?

[Figure: Class +1, Class -1, hyperplane $w_3$, and new test data; new points can fall on the wrong side of $w_3$.]

Most confident: $w_2$

[Figure: Class +1, Class -1, hyperplane $w_2$, and new test data; the new points are still classified correctly.]

Intuition: margin

[Figure: $w_2$ separates Class +1 from Class -1 with a large margin.]

Margin

• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^T x = 0$

Proof:

• $w$ is orthogonal to the hyperplane

• The unit direction is $\frac{w}{\|w\|}$

• The projection of $x$ onto this direction is $\left(\frac{w}{\|w\|}\right)^T x = \frac{f_w(x)}{\|w\|}$

[Figure: $x$ and its projection onto the direction $\frac{w}{\|w\|}$, with the hyperplane $w^T x = 0$ through the origin.]

Margin: with bias

• Claim 1: $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^T x + b = 0$

Proof:

• Pick any $x_1$ and $x_2$ on the hyperplane

• $w^T x_1 + b = 0$

• $w^T x_2 + b = 0$

• So $w^T (x_1 - x_2) = 0$

Margin: with bias

• Claim 2: the origin $0$ has signed distance $\frac{-b}{\|w\|}$ to the hyperplane $w^T x + b = 0$

Proof:

• Pick any $x_1$ on the hyperplane

• Project $x_1$ onto the unit direction $\frac{w}{\|w\|}$ to get the distance

• $\left(\frac{w}{\|w\|}\right)^T x_1 = \frac{-b}{\|w\|}$ since $w^T x_1 + b = 0$

Margin: with bias

• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^T x + b = 0$

Proof:

• Write $x = x_\perp + r\frac{w}{\|w\|}$, where $x_\perp$ is the projection of $x$ onto the hyperplane; then $|r|$ is the distance

• Multiply both sides by $w^T$ and add $b$

• Left-hand side: $w^T x + b = f_{w,b}(x)$

• Right-hand side: $w^T x_\perp + r\frac{w^T w}{\|w\|} + b = 0 + r\|w\|$

• So $r = \frac{f_{w,b}(x)}{\|w\|}$, and the distance is $|r| = \frac{|f_{w,b}(x)|}{\|w\|}$
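A quick numerical sanity check of Lemma 2 (the values below are arbitrary illustrative data): the formula $\frac{|f_{w,b}(x)|}{\|w\|}$ should match the geometric distance from $x$ to its projection onto the hyperplane.

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)
b = rng.normal()
x = rng.normal(size=3)

# Distance according to Lemma 2
dist_formula = abs(w @ x + b) / np.linalg.norm(w)

# Geometric check: drop a perpendicular from x onto the hyperplane
x_perp = x - ((w @ x + b) / (w @ w)) * w    # foot of the perpendicular
assert np.isclose(w @ x_perp + b, 0.0)      # x_perp lies on the hyperplane
dist_geometric = np.linalg.norm(x - x_perp)

print(dist_formula, dist_geometric)         # the two values agree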

๐‘ฆ ๐‘ฅ = ๐‘ค๐‘‡๐‘ฅ + ๐‘ค0

The notation here is:

Figure from Pattern Recognition and Machine Learning, Bishop

Support Vector Machine (SVM)

SVM: objective

• Margin over all training data points:
$$\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$$

• Since we only want a correct $f_{w,b}$, and recall $y_i \in \{+1, -1\}$, we have
$$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$

• If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative

SVM: objective

• Maximize the margin over all training data points:
$$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^T x_i + b)}{\|w\|}$$

• A bit complicated …

SVM: simplified objective

• Observation: when $(w, b)$ is scaled by a factor $c > 0$, the margin is unchanged:
$$\frac{y_i (c\,w^T x_i + c\,b)}{\|c\,w\|} = \frac{y_i (w^T x_i + b)}{\|w\|}$$

• Let's consider a fixed scale such that
$$y_{i^*} (w^T x_{i^*} + b) = 1$$
where $x_{i^*}$ is the point closest to the hyperplane

SVM: simplified objective

• Let's consider a fixed scale such that
$$y_{i^*} (w^T x_{i^*} + b) = 1$$
where $x_{i^*}$ is the point closest to the hyperplane

• Now we have, for all data points,
$$y_i (w^T x_i + b) \ge 1$$
and the equality holds for at least one $i$

• Then the margin is $\frac{1}{\|w\|}$

SVM: simplified objective

• The optimization simplifies to
$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1,\ \forall i$$

• How to find the optimum $\hat{w}^*$?
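One concrete (if naive) way to find $\hat{w}^*$ is to hand this quadratic program to a generic constrained solver; the sketch below uses scipy's SLSQP on a toy separable dataset (the data, starting point, and variable names are illustrative, and in practice a dedicated QP/SVM solver would be used):

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
d = X.shape[1]

def objective(theta):                        # theta packs [w, b]
    w = theta[:d]
    return 0.5 * w @ w                       # (1/2) ||w||^2

def constraints(theta):                      # must be >= 0 for every i
    w, b = theta[:d], theta[d]
    return y * (X @ w + b) - 1.0             # y_i (w^T x_i + b) - 1

res = minimize(objective, x0=np.full(d + 1, 0.1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraints}])
w_opt, b_opt = res.x[:d], res.x[d]
print(w_opt, b_opt, 1.0 / np.linalg.norm(w_opt))   # weights, bias, and the margin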

SVM: principle for hypothesis class

Thought experiment

• Suppose we pick an $R$, and suppose we can decide whether there exists a $(w, b)$ satisfying
$$\frac{1}{2}\|w\|^2 \le R, \qquad y_i (w^T x_i + b) \ge 1,\ \forall i$$

• Decrease $R$ until we cannot find a $(w, b)$ satisfying the inequalities

Thought experiment

โ€ข เท๐‘คโˆ— is the best weight (i.e., satisfying the smallest ๐‘…)

เท๐‘คโˆ—


Thought experiment

• To handle the difference between empirical and expected losses:
  • Choose a large-margin hypothesis (high confidence)
  • Choose a small hypothesis class

[Figure: $\hat{w}^*$ inside the region $\frac{1}{2}\|w\|^2 \le R$, which corresponds to the hypothesis class.]

Thought experiment

• Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
  • Also true beyond SVM
  • Also true for the case without perfect separation between the two classes
  • Math formulation: VC-dimension theory, etc.

Thought experiment

• Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
  • Whatever you know about the ground truth, add it as a constraint/regularizer

SVM: optimization

• Optimization (Quadratic Programming):
$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1,\ \forall i$$

• Solved by the Lagrange multiplier method:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$
where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers

• Details in the next lecture

Reading

โ€ข Review Lagrange multiplier method

• E.g., Section 5 of Andrew Ng's note on SVM
  • Posted on the course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/