Machine Learning - 10601
SVMs: Duality and Kernel Trick
Geoff Gordon, Miroslav Dudík (partly based on slides of Ziv Bar-Joseph)
http://www.cs.cmu.edu/~ggordon/10601/
November 18, 2009
SVMs as quadratic programs

Two optimization problems, for the separable and the non-separable case:

Separable:
$$\min_{w,b}\ \frac{w^T w}{2} \qquad \text{s.t. } (w^T x_i + b)\,y_i \ge 1 \ \text{ for all } i$$

Non-separable:
$$\min_{w,b,\varepsilon}\ \frac{w^T w}{2} + C \sum_{i=1}^{n} \varepsilon_i \qquad \text{s.t. } (w^T x_i + b)\,y_i \ge 1 - \varepsilon_i,\ \ \varepsilon_i \ge 0,\ \ i = 1,\dots,n$$
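The non-separable program is equivalent to minimizing regularized hinge loss, since at the optimum ε_i = max(0, 1 − y_i(wᵀx_i + b)). A minimal NumPy subgradient-descent sketch of that equivalent objective (the learning rate and iteration count are illustrative assumptions, not from the slides):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=1e-3, iters=1000):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b)),
    equivalent to the non-separable (soft-margin) QP above."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1                                 # points violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```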
Dual for separable case

$$\min_{w,b}\ \frac{w^T w}{2} \qquad \text{s.t. } (w^T x_i + b)\,y_i - 1 \ge 0 \quad (\alpha_i)$$

Lagrangian:
$$L(w, b, \alpha) = \frac{w^T w}{2} - \sum_i \alpha_i \left[ (w^T x_i + b)\,y_i - 1 \right]$$
Dual for separable case

Strong duality lets us swap the order of min and max:
$$\min_{w,b}\ \max_{\alpha \ge 0}\ L(w, b, \alpha) \;=\; \max_{\alpha \ge 0}\ \min_{w,b}\ L(w, b, \alpha)$$

Optimality of w and b:
$$w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0$$

Optimality of α_i, for all i:
$$\alpha_i \left[ (w^T x_i + b)\,y_i - 1 \right] = 0, \qquad \alpha_i \ge 0$$
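Filling in the substitution step that takes us from these optimality conditions to the dual formulation on the next slide (a standard derivation, not spelled out on the slides):

$$\begin{aligned}
L(w,b,\alpha)
&= \tfrac{1}{2}\Big\|\sum_i \alpha_i y_i x_i\Big\|^2
   - \sum_i \alpha_i \Big[ y_i \Big( \sum_j \alpha_j y_j\, x_j^T x_i + b \Big) - 1 \Big] \\
&= \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
   - \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
   - b \underbrace{\sum_i \alpha_i y_i}_{=\,0}
   + \sum_i \alpha_i \\
&= \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j .
\end{aligned}$$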
Dual for separable case

Optimality conditions (KKT conditions):
$$w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0$$
$$\alpha_i \left[ (w^T x_i + b)\,y_i - 1 \right] = 0, \qquad \alpha_i \ge 0 \quad \text{for all } i$$

Dual formulation (plugging the conditions into $\max_{\alpha \ge 0} \min_{w,b} L(w,b,\alpha)$):
$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0$$
Dual SVM - interpretation

$$w = \sum_i \alpha_i y_i x_i$$

Only the training points whose α_i is not 0 contribute to w — these are the support vectors. The α's also satisfy Σ_i α_i y_i = 0 and α_i ≥ 0.
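A minimal sketch of solving this dual QP numerically, assuming the cvxopt package is available (the toy data is made up for illustration; cvxopt minimizes ½αᵀPα + qᵀα subject to Gα ≤ h, Aα = b):

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])  # toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i.x_j
P, q = matrix(Q), matrix(-np.ones(n))            # max sum(a) - a'Qa/2  ==  min a'Pa/2 + q'a
G, h = matrix(-np.eye(n)), matrix(np.zeros(n))   # alpha_i >= 0
A, b = matrix(y[None, :]), matrix(0.0)           # sum_i alpha_i y_i = 0

alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
sv = alpha > 1e-6                                # support vectors: alpha_i > 0
w = ((alpha * y)[:, None] * X).sum(axis=0)       # w = sum_i alpha_i y_i x_i
bias = np.mean(y[sv] - X[sv] @ w)                # from alpha_i[(w.x_i + b)y_i - 1] = 0
```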
Dual SVM for linearly separable case

Our dual target function:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0$$
(a dot product across all pairs of training samples)

To evaluate a new sample x we need to compute:
$$w^T x + b = \sum_i \alpha_i y_i\, x_i^T x + b$$
(a dot product with all training samples)

This might be too much work!
(e.g. when lifting x into high dimensions)
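In code the evaluation rule is just those dot products, continuing the hypothetical alpha, X, y, bias from the QP sketch above; only the support vectors (α_i > 0) actually contribute:

```python
def predict(x_new, alpha, X, y, bias):
    """Sign of sum_i alpha_i y_i (x_i . x_new) + b, using dot products only."""
    return np.sign((alpha * y) @ (X @ x_new) + bias)
```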
Classifying in 1-d

Can an SVM correctly classify this data? What about this?
[figure: two 1-d datasets on the x axis; the second is not separable by a single threshold]

And now?
[figure: the same points lifted to 2-d by adding an x² feature, where a line separates the classes]
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
Non-linear SVMs in 2-d

• The original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable:
$$\varphi:\ x \to \varphi(x), \qquad x = (x_1, x_2), \qquad \varphi(x) = \left(x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\right)$$
[figure: non-separable 2-d data becomes separable after the mapping]

If data is mapped into a sufficiently high dimension, then samples will in general be linearly separable: N data points are in general separable in a space of N-1 dimensions or more!
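A quick check (hypothetical data, NumPy assumed) that circularly arranged 2-d data becomes linearly separable under this φ: in feature space, the plane with normal (1, 1, 0) measures x₁² + x₂², i.e. squared distance from the origin:

```python
import numpy as np

def phi2(x):
    """Map (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

inner = np.array([[0.5, 0.0], [0.0, -0.5]])   # class -1: near the origin
outer = np.array([[2.0, 0.0], [0.0, 2.0]])    # class +1: far from the origin
w = np.array([1.0, 1.0, 0.0])                 # phi(x).w = x1^2 + x2^2

# Inner points score 0.25, outer points 4.0: a plane separates them in feature space.
print([phi2(p) @ w for p in inner], [phi2(p) @ w for p in outer])
```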
Transformation of Inputs

• Possible problems:
- High computational burden due to high dimensionality
- Many more parameters
• SVMs solve these two issues simultaneously:
- "Kernel tricks" allow efficient computation
- The dual formulation only assigns parameters to samples, not features

[figure: input space mapped by φ(·) into feature space]
Polynomials of degree two

Dual formulation with mapped features:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \varphi(x_i)^T \varphi(x_j)
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0$$

• While working in higher dimensions is beneficial, it also increases our running time because of the dot-product computation
• However, there is a neat trick we can use
• Consider all quadratic terms for x1, x2, …, xm (m is the number of features in each vector):

$$\varphi(x) = \left(1,\ \sqrt{2}\,x_1,\ \dots,\ \sqrt{2}\,x_m,\ x_1^2,\ \dots,\ x_m^2,\ \sqrt{2}\,x_1 x_2,\ \dots,\ \sqrt{2}\,x_{m-1} x_m\right)$$

m+1 constant and linear terms, m quadratic terms, m(m-1)/2 pairwise terms. The √2 will become clear on the next slide.
Dot product for polynomials of degree two

$$\varphi(x)^T \varphi(z) = 1 + \sum_i 2\,x_i z_i + \sum_i x_i^2 z_i^2 + \sum_i \sum_{j<i} 2\,x_i x_j z_i z_j$$

How many operations do we need for the dot product?
m + m + m(m-1)/2 ≈ m² operations.
Polynomials of degree d in m variables

Original formulation:
$$\min_{w,b}\ \frac{w^T w}{2} \qquad \text{s.t. } (w^T \varphi(x_i) + b)\,y_i \ge 1$$

Dual formulation:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \varphi(x_i)^T \varphi(x_j)
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0$$

Only the dot products φ(x_i)ᵀφ(x_j) are needed, and for degree d the trick on the next slide generalizes: φ(x)ᵀφ(z) = (xᵀz + 1)^d.
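As a function, the degree-d polynomial kernel is a one-liner, O(m) regardless of d (a generic sketch, not code from the slides):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Degree-d polynomial kernel: equals phi(x).phi(z) for an implicit
    feature map over monomials of degree <= d, but costs only O(m)."""
    return (np.dot(x, z) + 1.0) ** d
```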
The kernel trick

How many operations do we need for the dot product?
$$\varphi(x)^T \varphi(z) = 1 + \sum_i 2\,x_i z_i + \sum_i x_i^2 z_i^2 + \sum_i \sum_{j<i} 2\,x_i x_j z_i z_j$$
m + m + m(m-1)/2 ≈ m² operations.

There's structure to this dot product… we can do this faster!
$$(x \cdot z + 1)^2 = (x \cdot z)^2 + 2\,(x \cdot z) + 1
= \Big(\sum_i x_i z_i\Big)^2 + \sum_i 2\,x_i z_i + 1
= \sum_i x_i^2 z_i^2 + \sum_i \sum_{j<i} 2\,x_i x_j z_i z_j + \sum_i 2\,x_i z_i + 1
= \varphi(x)^T \varphi(z)$$

We only need m operations!

Note that to evaluate a new sample we are also using dot products, so we save there as well.
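A quick numeric sanity check of this identity (NumPy assumed; the explicit map phi below spells out the slide's feature vector):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
x, z = rng.normal(size=m), rng.normal(size=m)

def phi(v):
    """Explicit degree-2 map: 1, sqrt(2)*v_i, v_i^2, sqrt(2)*v_i*v_j (j < i)."""
    pairs = [np.sqrt(2) * v[i] * v[j] for i in range(m) for j in range(i)]
    return np.concatenate(([1.0], np.sqrt(2) * v, v**2, pairs))

assert np.isclose(phi(x) @ phi(z), (x @ z + 1.0) ** 2)  # O(m^2) vs O(m): same value
```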
Where we are

Our dual target function:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \varphi(x_i)^T \varphi(x_j)
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0$$
(mn² operations to evaluate all coefficients)

To evaluate a new sample x we need to compute:
$$w^T \varphi(x) + b = \sum_i \alpha_i y_i\, \varphi(x_i)^T \varphi(x) + b$$
(mr operations, where r is the number of support vectors, i.e. those with α_i > 0)
Other kernels

• Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function
- Radial basis function (RBF):
$$K(x, z) = \exp\!\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$$
- Kernel functions for discrete objects (graphs, strings, etc.)
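A sketch of the RBF kernel as code (NumPy assumed; sigma is a bandwidth hyperparameter you would tune):

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))
```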
Kernels are measures of similarity

Decision rule for a new sample x:
$$w^T \varphi(x) + b = \sum_i \alpha_i y_i\, K(x_i, x) + b, \qquad
K(x, z) = \exp\!\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$$
This slide is courtesy of Hastie-Tibshirani-Friedman, 2nd ed.
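That decision rule in code, reusing the hypothetical rbf_kernel above and the dual variables alpha, X, y, bias from the earlier QP sketch:

```python
def predict_kernel(x_new, alpha, X, y, bias, sigma=1.0):
    """sign( sum_i alpha_i y_i K(x_i, x_new) + b ) with an RBF kernel."""
    k = rbf_kernel(X, x_new[None, :], sigma).ravel()  # similarities to training points
    return np.sign((alpha * y) @ k + bias)
```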
Dual formulation for non-separable case

Dual target function:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ C \ge \alpha_i \ge 0$$

To evaluate a new sample x we need to compute:
$$w^T x + b = \sum_i \alpha_i y_i\, x_i^T x + b$$

The only difference is that the α_i's are now bounded.
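In the earlier cvxopt sketch, this box constraint is the only change: the inequality block becomes -α_i ≤ 0 and α_i ≤ C (again an illustrative sketch, not slide code, reusing n and the imports from above):

```python
# Replace G, h in the separable-case QP with the box constraint 0 <= alpha_i <= C:
C_val = 1.0
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # -alpha <= 0  and  alpha <= C
h = matrix(np.concatenate([np.zeros(n), C_val * np.ones(n)]))
```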
Why do SVMs work?
• If we are using huge feature spaces (with kernels), how come we are not overfitting the data?
- We maximize the margin!
- We minimize loss + regularization
Software
• A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class classification
• SVMLight is among the earliest implementations of SVMs
• Several MATLAB toolboxes for SVMs are also available
Applications of SVMs
• Bioinformatics
• Machine Vision
• Text Categorization
• Ranking (e.g., Google searches)
• Handwritten Character Recognition
• Time series analysis
Lots of very successful applications!!!
Handwritten digit recognition
Important points
• Difference between regression classifiers and SVMs
• Maximum-margin principle
• Target function for SVMs
• Linearly separable and non-separable cases
• Dual formulation of SVMs
• Kernel trick and computational complexity