
Transcript of Support Vector Machines (SVMs), …rmaclin/cs8751/Notes/L15_SVMs.pdf

Page 1


CS 8751 ML & KDD Support Vector Machines 1

Support Vector Machines (SVMs)
• Learning mechanism based on linear programming
• Chooses a separating plane based on maximizing the notion of a margin
  – Based on PAC learning
• Has mechanisms for
  – Noise
  – Non-linear separating surfaces (kernel functions)
• Notes based on those of Prof. Jude Shavlik

CS 8751 ML & KDD Support Vector Machines 2

Support Vector Machines

A+

A-

Find the best separating plane in feature space
- many possibilities to choose from
- which is the best choice?

CS 8751 ML & KDD Support Vector Machines 3

SVMs – The General Idea
• How to pick the best separating plane?
• Idea:
  – Define a set of inequalities we want to satisfy
  – Use advanced optimization methods (e.g., linear programming) to find satisfying solutions
• Key issues:
  – Dealing with noise
  – What if no good linear separating surface?

CS 8751 ML & KDD Support Vector Machines 4

Linear Programming
• Subset of Math Programming
• Problem has the following form:
  function f(x1, x2, x3, …, xn) to be maximized
  subject to a set of constraints of the form:
  g(x1, x2, x3, …, xn) > b
• Math programming - find a set of values for the variables x1, x2, x3, …, xn that meets all of the constraints and maximizes the function f
• Linear programming - solving math programs where the constraint functions and the function to be maximized use linear combinations of the variables
  – Generally easier than the general Math Programming problem
  – Well-studied problem
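As a minimal illustration (not from the slides), a toy linear program of this form could be solved with SciPy's linprog; the objective and constraints below are made-up values, and since linprog minimizes subject to ≤ constraints, the maximization objective is negated:

from scipy.optimize import linprog

# toy problem: maximize f(x1, x2) = 3*x1 + 2*x2
# subject to x1 + x2 <= 4, x1 <= 3, x1 >= 0, x2 >= 0
res = linprog(c=[-3, -2],                  # negate: maximizing f == minimizing -f
              A_ub=[[1, 1], [1, 0]],       # left-hand sides of the <= constraints
              b_ub=[4, 3],
              bounds=[(0, None), (0, None)])
print(res.x, -res.fun)                     # optimal point and maximized objective value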

CS 8751 ML & KDD Support Vector Machines 5

Maximizing the Margin

A+

A-

the decision boundary
The margin between categories
- want this distance to be maximal
- (we'll assume linearly separable for now)

CS 8751 ML & KDD Support Vector Machines 6

PAC Learning
• PAC – Probably Approximately Correct learning
• Theorems that can be used to define bounds for the risk (error) of a family of learning functions
• Basic formula, with probability (1 - η):
• R – risk function, α – the parameters chosen by the learner, N – the number of data points, h – the VC dimension (something like an estimate of the complexity of the class of functions)

R(α) ≤ R_emp(α) + sqrt[ ( h (log(2N/h) + 1) − log(η/4) ) / N ]
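A small numeric illustration (not from the slides) of the bound as reconstructed above; the values of emp_risk, h, N, and η are arbitrary:

import math

def vc_bound(emp_risk, h, N, eta):
    # empirical risk plus the VC confidence term from the formula above
    return emp_risk + math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

print(vc_bound(emp_risk=0.05, h=10, N=10000, eta=0.05))   # larger N tightens the bound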

Page 2


CS 8751 ML & KDD Support Vector Machines 7

Margins and PAC Learning
• Theorems connect PAC theory to the size of the margin
• Basically, the larger the margin, the better the expected accuracy
• See, for example, Chapter 4 of Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2002

CS 8751 ML & KDD Support Vector Machines 8

Some Equations

Separating plane:  w · x = γ
  (w – weights, x – input features, γ – threshold)
For all positive examples:  w · x ≥ γ + 1
For all negative examples:  w · x ≤ γ − 1
  - the 1s result from dividing through by a constant for convenience
Distance between blue and red planes (the margin):  margin = 2 / ||w||₂
  - ||w||₂ is the Euclidean length ("2 norm") of the weight vector
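For a concrete (made-up) weight vector, the margin width from the formula above can be computed as:

import numpy as np

w = np.array([3.0, 4.0])                 # hypothetical weight vector
margin = 2.0 / np.linalg.norm(w, 2)      # distance between the gamma+1 and gamma-1 planes
print(margin)                            # 0.4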

CS 8751 ML & KDD Support Vector Machines 9

What the Equations Mean

A+

A-

Support Vectors

Margin

x · w = γ + 1
x · w = γ − 1
2 / ||w||₂

CS 8751 ML & KDD Support Vector Machines 10

Choosing a Separating Plane

A+

A-

?

CS 8751 ML & KDD Support Vector Machines 11

Our “Mathematical Program” (so far)

min over w, γ of  ||w||₂²     (for technical reasons it is easier to optimize this "quadratic program")
such that
  w · x ≥ γ + 1   (for positive examples)
  w · x ≤ γ − 1   (for negative examples)
Note: w, γ are our adjustable parameters (we could, of course, use the ANN "trick" and move γ to the left side of our inequalities)
We can now use existing math programming optimization software to find a solution to the above (a global optimal solution)
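A minimal sketch of this quadratic program using the cvxpy modeling library, assuming linearly separable toy data; Xpos and Xneg are hypothetical matrices whose rows are the positive and negative examples:

import cvxpy as cp
import numpy as np

Xpos = np.array([[2.0, 2.0], [3.0, 3.0]])    # hypothetical positive examples
Xneg = np.array([[0.0, 0.0], [1.0, 0.0]])    # hypothetical negative examples

w = cp.Variable(2)                           # one weight per input feature
gamma = cp.Variable()                        # threshold

constraints = [Xpos @ w >= gamma + 1,        # positive examples on or above the +1 plane
               Xneg @ w <= gamma - 1]        # negative examples on or below the -1 plane
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()
print(w.value, gamma.value)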

CS 8751 ML & KDD Support Vector Machines 12

Dealing with Non-Separable Data
We can add what is called a "slack" variable to each example
This variable can be viewed as:
  0 if the example is correctly separated
  y – the "distance" we need to move the example to make it correct (i.e., the distance from its surface)

Page 3


CS 8751 ML & KDD Support Vector Machines 13

“Slack” Variables

A+

A-

y
Support Vectors

CS 8751 ML & KDD Support Vector Machines 14

The Math Program with Slack Variables

min over w, γ, s of  ||w||₂² + μ ||s||₁
such that
  w · x_i + s_i ≥ γ + 1   (for positive examples i)
  w · x_j − s_j ≤ γ − 1   (for negative examples j)
  s_k ≥ 0  for all k
where
  w – one weight for each input feature
  s – one slack variable for each example
  μ – scaling constant
  ||s||₁ – "one norm": sum of the components of s (all positive)
This is the "traditional" Support Vector Machine
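Extending the earlier cvxpy sketch with slack variables gives the soft-margin program above; again the data and the scaling constant mu are made-up placeholders:

import cvxpy as cp
import numpy as np

Xpos = np.array([[2.0, 2.0], [3.0, 3.0], [0.2, 0.1]])   # last row is a "noisy" positive
Xneg = np.array([[0.0, 0.0], [1.0, 0.0]])

w = cp.Variable(2)
gamma = cp.Variable()
s_pos = cp.Variable(len(Xpos), nonneg=True)   # one slack per positive example
s_neg = cp.Variable(len(Xneg), nonneg=True)   # one slack per negative example
mu = 1.0                                      # scaling constant on the slack penalty

objective = cp.Minimize(cp.sum_squares(w) + mu * (cp.sum(s_pos) + cp.sum(s_neg)))
constraints = [Xpos @ w + s_pos >= gamma + 1,
               Xneg @ w - s_neg <= gamma - 1]
cp.Problem(objective, constraints).solve()
print(w.value, gamma.value, s_pos.value, s_neg.value)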

CS 8751 ML & KDD Support Vector Machines 15

Why the word "Support"?
• All those examples on or on the wrong side of the two separating planes are the support vectors
  – We'd get the same answer if we deleted all the non-support vectors!
  – i.e., the "support vectors [examples]" support the solution

CS 8751 ML & KDD Support Vector Machines 16

PAC and the Number of Support Vectors
• The fewer the support vectors, the better the generalization will be
• Recall, non-support vectors are
  – Correctly classified
  – Don't change the learned model if left out of the training set
• So
  leave-one-out error rate ≤ (# support vectors) / (# training examples)

CS 8751 ML & KDD Support Vector Machines 17

Finding Non-Linear Separating Surfaces
• Map inputs into new space
  Example:  features x1, x2                         values: 5, 4
  Example:  features x1, x2, x1², x2², x1·x2        values: 5, 4, 25, 16, 20
• Solve SVM program in this new space
  – Computationally complex if many features
  – But a clever trick exists (a sketch of the explicit mapping follows below)
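A small sketch of this explicit mapping (the function name quad_map is made up for illustration):

import numpy as np

def quad_map(x):
    # map (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2), as in the example above
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

print(quad_map(np.array([5.0, 4.0])))   # [ 5.  4. 25. 16. 20.]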

CS 8751 ML & KDD Support Vector Machines 18

The Kernel Trick
• Optimization problems often/always have a "primal" and a "dual" representation
  – So far we've looked at the primal formulation
  – The dual formulation is better for the case of a non-linear separating surface

Page 4


CS 8751 ML & KDD Support Vector Machines 19

Perceptrons Re-Visited
In perceptrons, if T ≡ 1 and F ≡ −1, and example i is currently misclassified:
  w_{k+1} = w_k + η y_i x_i
So
  w_final = Σ_{i=1..#examples} α_i y_i x_i
  where α_i is some number of times we get example i wrong and change the weights
This assumes w_initial = 0 (all zeros)

CS 8751 ML & KDD Support Vector Machines 20

Dual Form of the Perceptron Learning Rule
output of perceptron:  h(x) ≡ sgn(w · x)
  where sgn(z) = 1 if z ≥ 0, −1 otherwise
So  h(x) = sgn( Σ_{i=1..#examples} α_i y_i (x_i · x) )
New (i.e., dual) perceptron algorithm:
  For each example i:
    if y_i * Σ_{j=1..#examples} α_j y_j (x_j · x_i) ≤ 0   (i.e., predicted ≠ teacher)
    then α_i = α_i + 1   (counts errors)
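A hedged Python sketch of this dual-form perceptron; the toy data, epoch count, and function name are made up, and the learning rate is absorbed into the mistake counts α:

import numpy as np

def dual_perceptron(X, y, epochs=10):
    n = len(X)
    alpha = np.zeros(n)                 # alpha[i] counts mistakes on example i
    gram = X @ X.T                      # all pairwise dot products x_j . x_i
    for _ in range(epochs):
        for i in range(n):
            # prediction uses sum_j alpha_j y_j (x_j . x_i); <= 0 means predicted != teacher
            if y[i] * np.sum(alpha * y * gram[:, i]) <= 0:
                alpha[i] += 1           # count the error
    return alpha

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(dual_perceptron(X, y))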

CS 8751 ML & KDD Support Vector Machines 21

Primal versus Dual Space
• Primal – "weight space"
  – Weight features to make output decision
    h(x_new) = sgn(w · x_new)
• Dual – "training-examples space"
  – Weight distance (which is based on the features) to training examples
    h(x_new) = sgn( Σ_{j=1..#examples} α_j y_j (x_j · x_new) )

CS 8751 ML & KDD Support Vector Machines 22

The Dual SVM

Let n = # training examples
min over α of  (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j (x_i · x_j)  −  Σ_{i=1..n} α_i
such that
  Σ_{i=1..n} α_i y_i = 0
  α_i ≥ 0  for all i
Can convert back to primal form:
  w = Σ_{i=1..n} α_i y_i x_i
  γ = ( max_{i: y_i = −1} (w · x_i) + min_{i: y_i = +1} (w · x_i) ) / 2
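A hedged cvxpy sketch of this dual program (hard-margin, matching the constraints above); the toy data is made up, and the double sum is rewritten as a squared norm since Σ_i α_i y_i x_i is exactly w:

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])   # hypothetical data
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(X)

alpha = cp.Variable(n)
M = X * y[:, None]                       # row i is y_i * x_i
# (1/2) sum_ij alpha_i alpha_j y_i y_j (x_i . x_j) == (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Minimize(0.5 * cp.sum_squares(M.T @ alpha) - cp.sum(alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

w = (alpha.value * y) @ X                # convert back to primal: w = sum_i alpha_i y_i x_i
print(np.round(alpha.value, 3), w)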

CS 8751 ML & KDD Support Vector Machines 23

Non-Zero αi’s

Those examples with α_i > 0 are the support vectors
Recall:  w = Σ_{i=1..n} α_i y_i x_i
  - only the support vectors (i.e., α_i > 0) contribute to the weights

CS 8751 ML & KDD Support Vector Machines 24

Generalizing the Dot Product

We can generalize Dot_Product(x_i, x_j) ≡ x_i · x_j to other "kernel functions"
  e.g., K(x_i, x_j) ≡ (x_i · x_j)^δ
An acceptable kernel maps the original features (usually non-linearly) into a new space implicitly
  - in this new space we're computing a dot product
  - we don't need to explicitly know the features in the new space
  - usually more efficient than converting directly to the new space

Page 5


CS 8751 ML & KDD Support Vector Machines 25

The New Space for a Sample Kernel

Let K(x, z) = (x · z)²  and #features = 2
(x · z)² = (x1 z1 + x2 z2)²
        = x1x1 z1z1 + x1x2 z1z2 + x2x1 z2z1 + x2x2 z2z2
        = ⟨x1x1, x1x2, x2x1, x2x2⟩ · ⟨z1z1, z1z2, z2z1, z2z2⟩
Our new feature space (with 4 dimensions)
  - we're doing a dot product in it
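A quick numeric check (not from the slides) that the kernel equals a dot product in this 4-dimensional space; the helper name phi is made up:

import numpy as np

def phi(v):
    # explicit feature map for K(x, z) = (x . z)^2 with two input features
    return np.array([v[0]*v[0], v[0]*v[1], v[1]*v[0], v[1]*v[1]])

x = np.array([5.0, 4.0])
z = np.array([2.0, 3.0])
print(np.dot(x, z) ** 2, np.dot(phi(x), phi(z)))   # both print 484.0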

CS 8751 ML & KDD Support Vector Machines 26

Visualizing the Kernel
[Figure: positive (+) and negative (−) examples in the original Input Space, with a separating plane that is non-linear here but linear in the derived space; the same examples appear as g(+) and g(−) in the Derived Feature Space (New Space)]
g() is the feature transformation function
The process is similar to what hidden units do in ANNs, but the kernel is user chosen

CS 8751 ML & KDD Support Vector Machines 27

More Sample Kernels

1)  K(x, z) = (x · z + const)^d
2)  K(x, z) = e^( −||x − z||₂² / σ² )   – Gaussian kernel, leads to an RBF network
3)  K(x, z) = tanh( c * (x · z) + d )   – related to the sigmoid of ANNs
  - plus many more, including many designed for specific tasks (text, DNA, etc.)
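The three kernels above as small Python functions (a hedged sketch; the default parameter values are arbitrary):

import numpy as np

def poly_kernel(x, z, const=1.0, d=2):
    return (np.dot(x, z) + const) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # the kernel that leads to an RBF network
    return np.exp(-np.dot(x - z, x - z) / sigma ** 2)

def tanh_kernel(x, z, c=1.0, d=0.0):
    # related to the sigmoid of ANNs
    return np.tanh(c * np.dot(x, z) + d)

x, z = np.array([5.0, 4.0]), np.array([2.0, 3.0])
print(poly_kernel(x, z), gaussian_kernel(x, z), tanh_kernel(x, z))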

CS 8751 ML & KDD Support Vector Machines 28

What Makes a Kernel

Mercer's theorem characterizes when a function f(x, z) is a kernel
If K₁ and K₂ are kernels, then so are:
1) K₁() + K₂()
2) c * K₁(), where c is a constant
3) K₁() * K₂()
4) f(x) * f(z), where f() returns a real …

CS 8751 ML & KDD Support Vector Machines 29

Key SVM Ideas
• Maximize the margin between positive and negative examples (connects to PAC theory)
• Penalize errors in non-separable case
• Only the support vectors contribute to the solution
• Kernels map examples into a new, usually non-linear space
  – We implicitly do dot products in this new space (in the "dual" form of the SVM program)