
Transcript of Support Vector Machines (SVMs), …rmaclin/cs8751/Notes/L15_SVMs.pdf

Page 1


CS 8751 ML & KDD Support Vector Machines 1

Support Vector Machines (SVMs)
• Learning mechanism based on linear programming
• Chooses a separating plane based on maximizing the notion of a margin
  – Based on PAC learning
• Has mechanisms for
  – Noise
  – Non-linear separating surfaces (kernel functions)
• Notes based on those of Prof. Jude Shavlik

CS 8751 ML & KDD Support Vector Machines 2

Support Vector Machines

A+

A-

Find the best separating plane in feature space
- many possibilities to choose from
- which is the best choice?

CS 8751 ML & KDD Support Vector Machines 3

SVMs – The General Idea
• How to pick the best separating plane?
• Idea:
  – Define a set of inequalities we want to satisfy
  – Use advanced optimization methods (e.g., linear programming) to find satisfying solutions
• Key issues:
  – Dealing with noise
  – What if no good linear separating surface?

CS 8751 ML & KDD Support Vector Machines 4

Linear Programming
• Subset of Math Programming
• Problem has the following form:
  function f(x1, x2, x3, …, xn) to be maximized
  subject to a set of constraints of the form:
  g(x1, x2, x3, …, xn) > b
• Math programming - find a set of values for the variables x1, x2, x3, …, xn that meets all of the constraints and maximizes the function f
• Linear programming - solving math programs where the constraint functions and the function to be maximized use linear combinations of the variables
  – Generally easier than the general Math Programming problem
  – Well-studied problem
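As a minimal illustration (not from the slides), a toy linear program of this form could be solved with SciPy's linprog; the objective and constraints below are made-up values, and since linprog minimizes subject to ≤ constraints, the maximization objective is negated:

from scipy.optimize import linprog

# toy problem: maximize f(x1, x2) = 3*x1 + 2*x2
# subject to x1 + x2 <= 4, x1 <= 3, x1 >= 0, x2 >= 0
res = linprog(c=[-3, -2],                  # negate: maximizing f == minimizing -f
              A_ub=[[1, 1], [1, 0]],       # left-hand sides of the <= constraints
              b_ub=[4, 3],
              bounds=[(0, None), (0, None)])
print(res.x, -res.fun)                     # optimal point and maximized objective value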

CS 8751 ML & KDD Support Vector Machines 5

Maximizing the Margin

A+

A-

the decision boundary
The margin between categories
- want this distance to be maximal
- (we'll assume linearly separable for now)

CS 8751 ML & KDD Support Vector Machines 6

PAC Learning
• PAC – Probably Approximately Correct learning
• Theorems that can be used to define bounds for the risk (error) of a family of learning functions
• Basic formula, with probability (1 - η):
• R – risk function, α – the parameters chosen by the learner, N – the number of data points, h – the VC dimension (something like an estimate of the complexity of the class of functions)

R(α) ≤ R_emp(α) + sqrt[ ( h (log(2N/h) + 1) − log(η/4) ) / N ]
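A small numeric illustration (not from the slides) of the bound as reconstructed above; the values of emp_risk, h, N, and η are arbitrary:

import math

def vc_bound(emp_risk, h, N, eta):
    # empirical risk plus the VC confidence term from the formula above
    return emp_risk + math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

print(vc_bound(emp_risk=0.05, h=10, N=10000, eta=0.05))   # larger N tightens the bound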

Page 2


CS 8751 ML & KDD Support Vector Machines 7

Margins and PAC Learning
• Theorems connect PAC theory to the size of the margin
• Basically, the larger the margin, the better the expected accuracy
• See, for example, Chapter 4 of Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2002

CS 8751 ML & KDD Support Vector Machines 8

Some Equations

Separating plane:  w · x = γ
  (w – weights, x – input features, γ – threshold)
For all positive examples:  w · x ≥ γ + 1
For all negative examples:  w · x ≤ γ − 1
  - the 1s result from dividing through by a constant for convenience
Distance between blue and red planes (the margin):  margin = 2 / ||w||₂
  - ||w||₂ is the Euclidean length ("2 norm") of the weight vector
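For a concrete (made-up) weight vector, the margin width from the formula above can be computed as:

import numpy as np

w = np.array([3.0, 4.0])                 # hypothetical weight vector
margin = 2.0 / np.linalg.norm(w, 2)      # distance between the gamma+1 and gamma-1 planes
print(margin)                            # 0.4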

CS 8751 ML & KDD Support Vector Machines 9

What the Equations Mean

A+

A-

Support Vectors

Margin

x · w = γ + 1
x · w = γ − 1
2 / ||w||₂

CS 8751 ML & KDD Support Vector Machines 10

Choosing a Separating Plane

A+

A-

?

CS 8751 ML & KDD Support Vector Machines 11

Our “Mathematical Program” (so far)

min over w, γ of  ||w||₂²     (for technical reasons it is easier to optimize this "quadratic program")
such that
  w · x ≥ γ + 1   (for positive examples)
  w · x ≤ γ − 1   (for negative examples)
Note: w, γ are our adjustable parameters (we could, of course, use the ANN "trick" and move γ to the left side of our inequalities)
We can now use existing math programming optimization software to find a solution to the above (a global optimal solution)
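A minimal sketch of this quadratic program using the cvxpy modeling library, assuming linearly separable toy data; Xpos and Xneg are hypothetical matrices whose rows are the positive and negative examples:

import cvxpy as cp
import numpy as np

Xpos = np.array([[2.0, 2.0], [3.0, 3.0]])    # hypothetical positive examples
Xneg = np.array([[0.0, 0.0], [1.0, 0.0]])    # hypothetical negative examples

w = cp.Variable(2)                           # one weight per input feature
gamma = cp.Variable()                        # threshold

constraints = [Xpos @ w >= gamma + 1,        # positive examples on or above the +1 plane
               Xneg @ w <= gamma - 1]        # negative examples on or below the -1 plane
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()
print(w.value, gamma.value)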

CS 8751 ML & KDD Support Vector Machines 12

Dealing with Non-Separable Data
We can add what is called a "slack" variable to each example
This variable can be viewed as:
  0 if the example is correctly separated
  y – the "distance" we need to move the example to make it correct (i.e., the distance from its surface)

Page 3


CS 8751 ML & KDD Support Vector Machines 13

“Slack” Variables

A+

A-

y
Support Vectors

CS 8751 ML & KDD Support Vector Machines 14

The Math Program with Slack Variables

min over w, γ, s of  ||w||₂² + μ ||s||₁
such that
  w · x_i + s_i ≥ γ + 1   (for positive examples i)
  w · x_j − s_j ≤ γ − 1   (for negative examples j)
  s_k ≥ 0  for all k
where
  w – one weight for each input feature
  s – one slack variable for each example
  μ – scaling constant
  ||s||₁ – "one norm": sum of the components of s (all positive)
This is the "traditional" Support Vector Machine
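Extending the earlier cvxpy sketch with slack variables gives the soft-margin program above; again the data and the scaling constant mu are made-up placeholders:

import cvxpy as cp
import numpy as np

Xpos = np.array([[2.0, 2.0], [3.0, 3.0], [0.2, 0.1]])   # last row is a "noisy" positive
Xneg = np.array([[0.0, 0.0], [1.0, 0.0]])

w = cp.Variable(2)
gamma = cp.Variable()
s_pos = cp.Variable(len(Xpos), nonneg=True)   # one slack per positive example
s_neg = cp.Variable(len(Xneg), nonneg=True)   # one slack per negative example
mu = 1.0                                      # scaling constant on the slack penalty

objective = cp.Minimize(cp.sum_squares(w) + mu * (cp.sum(s_pos) + cp.sum(s_neg)))
constraints = [Xpos @ w + s_pos >= gamma + 1,
               Xneg @ w - s_neg <= gamma - 1]
cp.Problem(objective, constraints).solve()
print(w.value, gamma.value, s_pos.value, s_neg.value)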

CS 8751 ML & KDD Support Vector Machines 15

Why the word "Support"?
• All those examples on or on the wrong side of the two separating planes are the support vectors
  – We'd get the same answer if we deleted all the non-support vectors!
  – i.e., the "support vectors [examples]" support the solution

CS 8751 ML & KDD Support Vector Machines 16

PAC and the Number of Support Vectors
• The fewer the support vectors, the better the generalization will be
• Recall, non-support vectors are
  – Correctly classified
  – Don't change the learned model if left out of the training set
• So
  leave-one-out error rate ≤ (# support vectors) / (# training examples)

CS 8751 ML & KDD Support Vector Machines 17

Finding Non-Linear Separating Surfaces
• Map inputs into new space
  Example:  features x1, x2                         values: 5, 4
  Example:  features x1, x2, x1², x2², x1·x2        values: 5, 4, 25, 16, 20
• Solve SVM program in this new space
  – Computationally complex if many features
  – But a clever trick exists (a sketch of the explicit mapping follows below)
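A small sketch of this explicit mapping (the function name quad_map is made up for illustration):

import numpy as np

def quad_map(x):
    # map (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2), as in the example above
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

print(quad_map(np.array([5.0, 4.0])))   # [ 5.  4. 25. 16. 20.]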

CS 8751 ML & KDD Support Vector Machines 18

The Kernel Trick
• Optimization problems often/always have a "primal" and a "dual" representation
  – So far we've looked at the primal formulation
  – The dual formulation is better for the case of a non-linear separating surface

Page 4


CS 8751 ML & KDD Support Vector Machines 19

Perceptrons Re-Visited
In perceptrons, if T ≡ 1 and F ≡ −1, and example i is currently misclassified:
  w_{k+1} = w_k + η y_i x_i
So
  w_final = Σ_{i=1..#examples} α_i y_i x_i
  where α_i is some number of times we get example i wrong and change the weights
This assumes w_initial = 0 (all zeros)

CS 8751 ML & KDD Support Vector Machines 20

Dual Form of the Perceptron Learning Rule
output of perceptron:  h(x) ≡ sgn(w · x)
  where sgn(z) = 1 if z ≥ 0, −1 otherwise
So  h(x) = sgn( Σ_{i=1..#examples} α_i y_i (x_i · x) )
New (i.e., dual) perceptron algorithm:
  For each example i:
    if y_i * Σ_{j=1..#examples} α_j y_j (x_j · x_i) ≤ 0   (i.e., predicted ≠ teacher)
    then α_i = α_i + 1   (counts errors)
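A hedged Python sketch of this dual-form perceptron; the toy data, epoch count, and function name are made up, and the learning rate is absorbed into the mistake counts α:

import numpy as np

def dual_perceptron(X, y, epochs=10):
    n = len(X)
    alpha = np.zeros(n)                 # alpha[i] counts mistakes on example i
    gram = X @ X.T                      # all pairwise dot products x_j . x_i
    for _ in range(epochs):
        for i in range(n):
            # prediction uses sum_j alpha_j y_j (x_j . x_i); <= 0 means predicted != teacher
            if y[i] * np.sum(alpha * y * gram[:, i]) <= 0:
                alpha[i] += 1           # count the error
    return alpha

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(dual_perceptron(X, y))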

CS 8751 ML & KDD Support Vector Machines 21

Primal versus Dual Space
• Primal – "weight space"
  – Weight features to make output decision
    h(x_new) = sgn(w · x_new)
• Dual – "training-examples space"
  – Weight distance (which is based on the features) to training examples
    h(x_new) = sgn( Σ_{j=1..#examples} α_j y_j (x_j · x_new) )

CS 8751 ML & KDD Support Vector Machines 22

The Dual SVM

Let n = # training examples
min over α of  (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j (x_i · x_j)  −  Σ_{i=1..n} α_i
such that
  Σ_{i=1..n} α_i y_i = 0
  α_i ≥ 0  for all i
Can convert back to primal form:
  w = Σ_{i=1..n} α_i y_i x_i
  γ = ( max_{i: y_i = −1} (w · x_i) + min_{i: y_i = +1} (w · x_i) ) / 2
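A hedged cvxpy sketch of this dual program (hard-margin, matching the constraints above); the toy data is made up, and the double sum is rewritten as a squared norm since Σ_i α_i y_i x_i is exactly w:

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])   # hypothetical data
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(X)

alpha = cp.Variable(n)
M = X * y[:, None]                       # row i is y_i * x_i
# (1/2) sum_ij alpha_i alpha_j y_i y_j (x_i . x_j) == (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Minimize(0.5 * cp.sum_squares(M.T @ alpha) - cp.sum(alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

w = (alpha.value * y) @ X                # convert back to primal: w = sum_i alpha_i y_i x_i
print(np.round(alpha.value, 3), w)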

CS 8751 ML & KDD Support Vector Machines 23

Non-Zero αi’s

Those examples with α_i > 0 are the support vectors
Recall:  w = Σ_{i=1..n} α_i y_i x_i
  - only the support vectors (i.e., α_i > 0) contribute to the weights

CS 8751 ML & KDD Support Vector Machines 24

Generalizing the Dot Product

We can generalize Dot_Product(x_i, x_j) ≡ x_i · x_j to other "kernel functions"
  e.g., K(x_i, x_j) ≡ (x_i · x_j)^δ
An acceptable kernel maps the original features (usually non-linearly) into a new space implicitly
  - in this new space we're computing a dot product
  - we don't need to explicitly know the features in the new space
  - usually more efficient than converting directly to the new space

Page 5


CS 8751 ML & KDD Support Vector Machines 25

The New Space for a Sample Kernel

Let K(x, z) = (x · z)²  and #features = 2
(x · z)² = (x1 z1 + x2 z2)²
        = x1x1 z1z1 + x1x2 z1z2 + x2x1 z2z1 + x2x2 z2z2
        = ⟨x1x1, x1x2, x2x1, x2x2⟩ · ⟨z1z1, z1z2, z2z1, z2z2⟩
Our new feature space (with 4 dimensions)
  - we're doing a dot product in it
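A quick numeric check (not from the slides) that the kernel equals a dot product in this 4-dimensional space; the helper name phi is made up:

import numpy as np

def phi(v):
    # explicit feature map for K(x, z) = (x . z)^2 with two input features
    return np.array([v[0]*v[0], v[0]*v[1], v[1]*v[0], v[1]*v[1]])

x = np.array([5.0, 4.0])
z = np.array([2.0, 3.0])
print(np.dot(x, z) ** 2, np.dot(phi(x), phi(z)))   # both print 484.0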

CS 8751 ML & KDD Support Vector Machines 26

Visualizing the Kernel
[Figure: positive (+) and negative (−) examples in the original Input Space, with a separating plane that is non-linear here but linear in the derived space; the same examples appear as g(+) and g(−) in the Derived Feature Space (New Space)]
g() is the feature transformation function
The process is similar to what hidden units do in ANNs, but the kernel is user chosen

CS 8751 ML & KDD Support Vector Machines 27

More Sample Kernels

1)  K(x, z) = (x · z + const)^d
2)  K(x, z) = e^( −||x − z||₂² / σ² )   – Gaussian kernel, leads to an RBF network
3)  K(x, z) = tanh( c * (x · z) + d )   – related to the sigmoid of ANNs
  - plus many more, including many designed for specific tasks (text, DNA, etc.)
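The three kernels above as small Python functions (a hedged sketch; the default parameter values are arbitrary):

import numpy as np

def poly_kernel(x, z, const=1.0, d=2):
    return (np.dot(x, z) + const) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # the kernel that leads to an RBF network
    return np.exp(-np.dot(x - z, x - z) / sigma ** 2)

def tanh_kernel(x, z, c=1.0, d=0.0):
    # related to the sigmoid of ANNs
    return np.tanh(c * np.dot(x, z) + d)

x, z = np.array([5.0, 4.0]), np.array([2.0, 3.0])
print(poly_kernel(x, z), gaussian_kernel(x, z), tanh_kernel(x, z))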

CS 8751 ML & KDD Support Vector Machines 28

What Makes a Kernel

Mercer's theorem characterizes when a function f(x, z) is a kernel
If K₁ and K₂ are kernels, then so are:
1) K₁() + K₂()
2) c * K₁(), where c is a constant
3) K₁() * K₂()
4) f(x) * f(z), where f() returns a real …

CS 8751 ML & KDD Support Vector Machines 29

Key SVM Ideas
• Maximize the margin between positive and negative examples (connects to PAC theory)
• Penalize errors in non-separable case
• Only the support vectors contribute to the solution
• Kernels map examples into a new, usually non-linear space
  – We implicitly do dot products in this new space (in the "dual" form of the SVM program)