Support Vector Machines for Structured Classification and The Kernel Trick
William Cohen
3-6-2007
Announcements
• Don’t miss this one: Lise Getoor, 2:30 in Newell-Simon 3305
The voted perceptron
A sends B an instance xi. B computes the prediction ŷi = sign(vk · xi) and returns ŷi; A replies with the true label yi.
If mistake: vk+1 = vk + yi xi
[Figure: the margin picture — the separator u (and −u) with margin 2γ > γ, and the guesses v1, v2.
(3a) The guess v2 after the two positive examples: v2 = v1 + x2
(3b) The guess v2 after the one positive and one negative example: v2 = v1 − x2]
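The protocol above can be sketched in a few lines of code. This is my own minimal illustration, not from the lecture; the function names `perceptron_train` and `voted_predict` are hypothetical:

```python
# Minimal sketch of the voted perceptron described above (my illustration).
def perceptron_train(examples, epochs=10):
    """examples: list of (x, y) with x a tuple of floats and y in {-1, +1}.
    Returns a list of (v_k, survival_count) pairs used by the vote."""
    dim = len(examples[0][0])
    v = [0.0] * dim          # current weight vector v_k
    vs = []                  # retired hypotheses with their survival counts
    count = 0
    for _ in range(epochs):
        for x, y in examples:
            score = sum(vi * xi for vi, xi in zip(v, x))
            y_hat = 1 if score >= 0 else -1
            if y_hat != y:                       # mistake:
                vs.append((list(v), count))      #   retire v_k with its count
                # the update from the slide: v_{k+1} = v_k + y_i x_i
                v = [vi + y * xi for vi, xi in zip(v, x)]
                count = 1
            else:
                count += 1
    vs.append((list(v), count))
    return vs

def voted_predict(vs, x):
    """Prediction is a survival-count-weighted vote of all hypotheses."""
    s = 0
    for v, c in vs:
        score = sum(vi * xi for vi, xi in zip(v, x))
        s += c * (1 if score >= 0 else -1)
    return 1 if s >= 0 else -1
```

On a linearly separable toy set the final hypothesis quickly dominates the vote, so the voted prediction agrees with the converged perceptron.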
Perceptrons vs SVMs
• For the voted perceptron to “work” (in this proof), we need to assume there is some u with ||u|| = 1 such that u·xi·yi > γ for all i.
Perceptrons vs SVMs
• Question: why not use this assumption directly in the learning algorithm? i.e.
– Given: γ, (x1,y1), (x2,y2), (x3,y3), …
– Find: some w where
• ||w|| = 1 and
• for all i, w·xi·yi > γ
Perceptrons vs SVMs
• Question: why not use this assumption directly in the learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find: some w and γ such that
• ||w|| = 1 and
• for all i, w·xi·yi > γ
The best possible w and γ
Perceptrons vs SVMs
• Question: why not use this assumption directly in the learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Maximize γ under the constraints
• ||w|| = 1 and
• for all i, w·xi·yi > γ
– Equivalently: minimize ||w||² under the constraint
• for all i, w·xi·yi > 1
Units are arbitrary: rescaling the data scales both γ and ||w||, so we can fix the margin at 1 and minimize ||w||² instead.
almost Thorsten’s eq (5-6), SVM0
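One simple way to attack this optimization numerically — my sketch, not part of the lecture — is to soften the constraints `w·xi·yi ≥ 1` into hinge losses and run Pegasos-style subgradient descent on `lam/2·||w||² + average hinge loss`; all names here are mine:

```python
# Subgradient-descent sketch of "minimize ||w||^2 s.t. y_i (w . x_i) >= 1"
# via the soft (hinge-loss) relaxation. My illustration, not from the slides.
def svm_subgradient(examples, lam=0.1, epochs=500):
    """examples: list of (x, y), y in {-1, +1}. Returns w approximately
    minimizing lam/2 * ||w||^2 + average hinge loss."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in examples:
            t += 1
            eta = 1.0 / (lam * t)        # standard decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            shrink = 1.0 - eta * lam     # pull from the ||w||^2 term
            if margin < 1:               # constraint violated: hinge is active
                w = [shrink * wi + eta * y * xi for wi, xi in zip(w, x)]
            else:
                w = [shrink * wi for wi in w]
    return w
```

With a small regularizer `lam` the soft solution approaches the hard-margin one, so on separable data the learned w classifies all training points correctly.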
The voted perceptron for ranking
A sends B a set of instances x1, x2, x3, x4, … B computes the scores vk · xi and returns the index b* of the “best” (highest-scoring) xi; A replies with the correct index b.
If mistake: vk+1 = vk + xb − xb*
[Figure: the ranking version of the margin picture — u, −u, and margin 2γ > γ, with v1 and v2.
(3a) The guess v2 after the two positive examples: v2 = v1 + x2
Second panel: the instances x plotted against u, −u and the hypothesis v.]
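The ranking protocol can be sketched the same way (again my own illustration, with hypothetical names): B returns the argmax instance, and a mistake moves weight toward the correct instance and away from the guess.

```python
# Sketch of the voted-perceptron-for-ranking update (my illustration).
def rank_score(v, x):
    return sum(vi * xi for vi, xi in zip(v, x))

def ranking_perceptron(episodes, dim, epochs=20):
    """episodes: list of (candidates, b) where candidates is a list of
    instance vectors x_1..x_n and b is the index of the correct one."""
    v = [0.0] * dim
    for _ in range(epochs):
        for candidates, b in episodes:
            # B's answer: the index b* of the highest-scoring candidate
            b_star = max(range(len(candidates)),
                         key=lambda i: rank_score(v, candidates[i]))
            if b_star != b:
                # the update from the slide: v <- v + x_b - x_{b*}
                v = [vi + xb - xs for vi, xb, xs
                     in zip(v, candidates[b], candidates[b_star])]
    return v
```

After training, the learned v should score each episode's correct candidate above the alternatives.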
The voted perceptron for NER
A sends B a set of instances z1, z2, z3, z4, … B computes the scores vk · zi and returns the index b* of the “best” zi; A replies with the correct index b.
If mistake: vk+1 = vk + zb − zb*
1. A sends B the Sha & Pereira paper and instructions for creating the instances:
• A sends a word vector xi. Then B could create the instances F(xi, y) for each possible label sequence y…
• but instead B just returns the y* that gives the best score for the dot product vk · F(xi, y*), found using Viterbi.
2. A sends B the correct label sequence yi.
3. On errors, B sets vk+1 = vk + zb − zb* = vk + F(xi, yi) − F(xi, y*)
The voted perceptron for NER
A sends B a set of instances z1, z2, z3, z4, … B computes the scores vk · zi and returns the index b* of the “best” zi; A replies with the correct index b.
If mistake: vk+1 = vk + zb − zb*
1. A sends a word vector xi.
2. B just returns the y* that gives the best score for vk · F(xi, y*).
3. A sends B the correct label sequence yi.
4. On errors, B sets vk+1 = vk + zb − zb* = vk + F(xi, yi) − F(xi, y*)
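The protocol above is the structured (collins-style) perceptron. Here is my own toy sketch of it: the feature function is a hypothetical stand-in for the Sha & Pereira features, and brute-force enumeration of label sequences stands in for Viterbi, so it only works for toy-sized problems.

```python
# Toy structured-perceptron sketch (my illustration, not the lecture's code).
from itertools import product

def feat(xs, ys):
    """Joint feature map F(x, y): counts of word-tag emissions and tag-tag
    transitions, stored sparsely in a dict. A toy stand-in only."""
    f = {}
    for w, t in zip(xs, ys):
        f[("emit", w, t)] = f.get(("emit", w, t), 0) + 1
    for t1, t2 in zip(ys, ys[1:]):
        f[("trans", t1, t2)] = f.get(("trans", t1, t2), 0) + 1
    return f

def score(v, f):
    return sum(v.get(k, 0.0) * c for k, c in f.items())

def structured_perceptron(data, tags, epochs=10):
    """data: list of (word_seq, tag_seq). Brute-force argmax over all tag
    sequences replaces Viterbi here (exponential, fine only for toys)."""
    v = {}
    for _ in range(epochs):
        for xs, ys in data:
            y_star = list(max(product(tags, repeat=len(xs)),
                              key=lambda y: score(v, feat(xs, y))))
            if y_star != list(ys):
                # the update from the slide: v <- v + F(x, y) - F(x, y*)
                for k, c in feat(xs, ys).items():
                    v[k] = v.get(k, 0.0) + c
                for k, c in feat(xs, y_star).items():
                    v[k] = v.get(k, 0.0) - c
    return v
```

After a pass or two on consistent toy data, the argmax sequence matches the gold tags.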
Thorsten’s notation vs mine
• Thorsten (ranking): u · (xl − xj) for each pair where xl should be ranked above xj
• Mine: ∀(xi, yi) ∈ D, ∀y ≠ yi: u · (F(xi, yi) − F(xi, y)) > γ
• i.e.: ∀(xi, yi) ∈ D, ∀y ≠ yi: u · (Φ(xi, yi) − Φ(xi, y)) > γ
• abbreviated: ∀(xi, yi) ∈ D, ∀y ≠ yi: u · δΦi(y) > γ
SVM for ranking: assumptions
∀(xi, yi) ∈ D, ∀y ≠ yi: u · δΦi(y) > γ
where δΦi(y) = Φ(xi, yi) − Φ(xi, y)
and ||u|| = 1
SVM for ranking: assumptions
Assumption: ∀(xi, yi) ∈ D, ∀y ≠ yi: u · δΦi(y) > γ
where δΦi(y) = Φ(xi, yi) − Φ(xi, y) and ||u|| = 1
The assumption suggests the algorithm:
– Minimize ||w||² under the constraint
• for all i, w·xi·yi > 1, which in the new notation becomes
• ∀i, ∀y ≠ yi: w · δΦi(y) ≥ 1 — Thorsten’s eq (5-6), SVM0
The Kernel Trick
The voted perceptron
A sends B an instance xi. B computes the prediction ŷi = sign(vk · xi) and returns ŷi; A replies with the true label yi.
If mistake: vk+1 = vk + yi xi
The kernel trick
vk = yi1 xi1 + yi2 xi2 + … + yik xik
where i1, …, ik are the mistakes… so:
vk · xtest = (yi1 xi1 + yi2 xi2 + … + yik xik) · xtest
= yi1 (xi1 · xtest) + yi2 (xi2 · xtest) + … + yik (xik · xtest)
Remember: the only updates are vk+1 = vk + yi xi on mistakes, so vk is just a signed sum of the mistake examples.
The kernel trick – con’t
vk = yi1 xi1 + yi2 xi2 + … + yik xik
where i1, …, ik are the mistakes… then
vk · xtest = yi1 (xi1 · xtest) + yi2 (xi2 · xtest) + … + yik (xik · xtest)
Since only inner products are ever needed, consider a preprocessor that replaces every x with x’ to include, directly in the example, all the pairwise variable interactions, so what is learned is a vector v’:
v’k · x’test = yi1 (x’i1 · x’test) + yi2 (x’i2 · x’test) + … + yik (x’ik · x’test)
= yi1 K(xi1, xtest) + yi2 K(xi2, xtest) + … + yik K(xik, xtest)
where K(x, xtest) = x’ · x’test
The kernel trick – con’t
u = (a, b), the linear function ax + by
v = (c, d), the linear function cx + dy
u’ = (a, b, e, f, g, h), the function ax + by + ex² + fy² + gxy + h
v’ = (c, d, l, m, n, p), the function cx + dy + lx² + my² + nxy + p
A voted perceptron over vectors like u, v is a linear function…
Replacing u with u’ would lead to non-linear functions – f(x, y, xy, x², …)
The kernel trick – con’t
u = (a, b), v = (c, d) as before, with expanded versions u’, v’.
But notice… if we replace u·v with (u·v + 1)² …
(u·v + 1)² = (ac + bd + 1)(ac + bd + 1)
= a²c² + 2abcd + b²d² + 2ac + 2bd + 1
Compare to u’·v’ with the expanded vectors u’ = (a², b², ab, a, b, 1) and v’ = (c², d², cd, c, d, 1):
u’·v’ = a²c² + b²d² + abcd + ac + bd + 1
The kernel trick – con’t
So – up to constants on the cross-product terms – (u·v + 1)² = u’·v’.
Why not replace the computation of
v’k · x’test = yi1 (x’i1 · x’test) + … + yik (x’ik · x’test)
with the computation of
v’k · x’test = yi1 K(xi1, xtest) + … + yik K(xik, xtest)
where K(x, xi) = (x·xi + 1)² ?
The kernel trick – con’t
General idea: replace an expensive preprocessor x → x’ and an ordinary inner product with no preprocessor and a function K(x, xi) where
K(x, xi) = x’ · x’i
This is really useful when you want to learn over objects x with some non-trivial structure… as in the two Mooney papers.
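This general idea is the kernel perceptron. A minimal sketch of it (mine, with hypothetical names) stores the mistakes instead of an explicit weight vector and scores test points purely through K, here the polynomial kernel K(x, xi) = (x·xi + 1)² derived above:

```python
# Kernel perceptron sketch (my illustration): no explicit weight vector,
# only the list of mistakes (y_i, x_i) and calls to the kernel K.
def poly_kernel(x, z):
    """K(x, z) = (x . z + 1)^2, the kernel from the slides."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** 2

def kernel_perceptron(examples, K, epochs=10):
    mistakes = []
    for _ in range(epochs):
        for x, y in examples:
            # v_k . x = sum over mistakes of y_i K(x_i, x)
            s = sum(yi * K(xi, x) for yi, xi in mistakes)
            if (1 if s >= 0 else -1) != y:
                mistakes.append((y, x))
    return mistakes

def kernel_predict(mistakes, K, x):
    s = sum(yi * K(xi, x) for yi, xi in mistakes)
    return 1 if s >= 0 else -1
```

XOR-style data (label = sign of the product of the two coordinates) is not linearly separable, but the implicit xy feature of the quadratic kernel separates it.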
The kernel trick – con’t
Even more general idea: use any function K that is
• Continuous
• Symmetric – i.e., K(u,v) = K(v,u)
• “Positive semidefinite” – i.e., every matrix of pairwise values M[i,j] = K(xi, xj) has no negative eigenvalues
Then by an ancient theorem due to Mercer, K corresponds to some combination of a preprocessor and an inner product: i.e.,
K(x, xi) = x’ · x’i
Terminology: K is a Mercer kernel. The set of all x’ is a reproducing kernel Hilbert space (RKHS). The matrix M[i,j] = K(xi, xj) is a Gram matrix.
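The Gram-matrix condition can be sanity-checked numerically. A small sketch (mine; assumes NumPy is available) builds the Gram matrix of the polynomial kernel from the slides on a few points and inspects its eigenvalues:

```python
# Numerical check of the Mercer/Gram-matrix property (my illustration).
import numpy as np

def gram(K, xs):
    """Gram matrix M[i, j] = K(x_i, x_j)."""
    return np.array([[K(a, b) for b in xs] for a in xs])

# The polynomial kernel from the slides: K(x, z) = (x . z + 1)^2
poly = lambda a, b: (np.dot(a, b) + 1) ** 2

xs = [np.array(p, dtype=float) for p in [(0, 1), (1, 1), (2, -1), (-1, 3)]]
M = gram(poly, xs)

# For a Mercer kernel, M is symmetric and its smallest eigenvalue is
# nonnegative (up to floating-point noise).
min_eig = np.linalg.eigvalsh(M).min()
```

A kernel that fails this check on some point set is not a Mercer kernel, so no preprocessor x → x’ can reproduce it as an inner product.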