Support Vector Machines for Structured Classification and The Kernel Trick
William Cohen
3-6-2007
Announcements
• Don’t miss this one: Lise Getoor, 2:30 in Newell-Simon 3305
The voted perceptron
A sends B an instance xi. B computes the prediction ŷi = sign(vk · xi) and returns ŷi; A replies with the true label yi.
If mistake: vk+1 = vk + yi xi
[Figure: the margin picture — the separator u (and −u) with margin 2γ > γ, and the guesses v1, v2.
(3a) The guess v2 after the two positive examples: v2 = v1 + x2
(3b) The guess v2 after the one positive and one negative example: v2 = v1 − x2]
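The protocol above can be sketched in a few lines of code. This is my own minimal illustration, not from the lecture; the function names `perceptron_train` and `voted_predict` are hypothetical:

```python
# Minimal sketch of the voted perceptron described above (my illustration).
def perceptron_train(examples, epochs=10):
    """examples: list of (x, y) with x a tuple of floats and y in {-1, +1}.
    Returns a list of (v_k, survival_count) pairs used by the vote."""
    dim = len(examples[0][0])
    v = [0.0] * dim          # current weight vector v_k
    vs = []                  # retired hypotheses with their survival counts
    count = 0
    for _ in range(epochs):
        for x, y in examples:
            score = sum(vi * xi for vi, xi in zip(v, x))
            y_hat = 1 if score >= 0 else -1
            if y_hat != y:                       # mistake:
                vs.append((list(v), count))      #   retire v_k with its count
                # the update from the slide: v_{k+1} = v_k + y_i x_i
                v = [vi + y * xi for vi, xi in zip(v, x)]
                count = 1
            else:
                count += 1
    vs.append((list(v), count))
    return vs

def voted_predict(vs, x):
    """Prediction is a survival-count-weighted vote of all hypotheses."""
    s = 0
    for v, c in vs:
        score = sum(vi * xi for vi, xi in zip(v, x))
        s += c * (1 if score >= 0 else -1)
    return 1 if s >= 0 else -1
```

On a linearly separable toy set the final hypothesis quickly dominates the vote, so the voted prediction agrees with the converged perceptron.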
Perceptrons vs SVMs
• For the voted perceptron to “work” (in this proof), we need to assume there is some u with ||u|| = 1 such that u·xi·yi > γ for all i.
Perceptrons vs SVMs
• Question: why not use this assumption directly in the learning algorithm? i.e.
– Given: γ, (x1,y1), (x2,y2), (x3,y3), …
– Find: some w where
• ||w|| = 1 and
• for all i, w·xi·yi > γ
Perceptrons vs SVMs
• Question: why not use this assumption directly in the learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find: some w and γ such that
• ||w|| = 1 and
• for all i, w·xi·yi > γ
The best possible w and γ
Perceptrons vs SVMs
• Question: why not use this assumption directly in the learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Maximize γ under the constraints
• ||w|| = 1 and
• for all i, w·xi·yi > γ
– Equivalently: minimize ||w||² under the constraint
• for all i, w·xi·yi > 1
Units are arbitrary: rescaling the data scales both γ and ||w||, so we can fix the margin at 1 and minimize ||w||² instead.
almost Thorsten’s eq (5-6), SVM0
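One simple way to attack this optimization numerically — my sketch, not part of the lecture — is to soften the constraints `w·xi·yi ≥ 1` into hinge losses and run Pegasos-style subgradient descent on `lam/2·||w||² + average hinge loss`; all names here are mine:

```python
# Subgradient-descent sketch of "minimize ||w||^2 s.t. y_i (w . x_i) >= 1"
# via the soft (hinge-loss) relaxation. My illustration, not from the slides.
def svm_subgradient(examples, lam=0.1, epochs=500):
    """examples: list of (x, y), y in {-1, +1}. Returns w approximately
    minimizing lam/2 * ||w||^2 + average hinge loss."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in examples:
            t += 1
            eta = 1.0 / (lam * t)        # standard decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            shrink = 1.0 - eta * lam     # pull from the ||w||^2 term
            if margin < 1:               # constraint violated: hinge is active
                w = [shrink * wi + eta * y * xi for wi, xi in zip(w, x)]
            else:
                w = [shrink * wi for wi in w]
    return w
```

With a small regularizer `lam` the soft solution approaches the hard-margin one, so on separable data the learned w classifies all training points correctly.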
The voted perceptron for ranking
A sends B a set of instances x1, x2, x3, x4, … B computes the scores vk · xi and returns the index b* of the “best” (highest-scoring) xi; A replies with the correct index b.
If mistake: vk+1 = vk + xb − xb*
[Figure: the ranking version of the margin picture — u, −u, and margin 2γ > γ, with v1 and v2.
(3a) The guess v2 after the two positive examples: v2 = v1 + x2
Second panel: the instances x plotted against u, −u and the hypothesis v.]
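The ranking protocol can be sketched the same way (again my own illustration, with hypothetical names): B returns the argmax instance, and a mistake moves weight toward the correct instance and away from the guess.

```python
# Sketch of the voted-perceptron-for-ranking update (my illustration).
def rank_score(v, x):
    return sum(vi * xi for vi, xi in zip(v, x))

def ranking_perceptron(episodes, dim, epochs=20):
    """episodes: list of (candidates, b) where candidates is a list of
    instance vectors x_1..x_n and b is the index of the correct one."""
    v = [0.0] * dim
    for _ in range(epochs):
        for candidates, b in episodes:
            # B's answer: the index b* of the highest-scoring candidate
            b_star = max(range(len(candidates)),
                         key=lambda i: rank_score(v, candidates[i]))
            if b_star != b:
                # the update from the slide: v <- v + x_b - x_{b*}
                v = [vi + xb - xs for vi, xb, xs
                     in zip(v, candidates[b], candidates[b_star])]
    return v
```

After training, the learned v should score each episode's correct candidate above the alternatives.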
The voted perceptron for NER
A sends B a set of instances z1, z2, z3, z4, … B computes the scores vk · zi and returns the index b* of the “best” zi; A replies with the correct index b.
If mistake: vk+1 = vk + zb − zb*
1. A sends B the Sha & Pereira paper and instructions for creating the instances:
• A sends a word vector xi. Then B could create the instances F(xi, y) for each possible label sequence y…
• but instead B just returns the y* that gives the best score for the dot product vk · F(xi, y*), found using Viterbi.
2. A sends B the correct label sequence yi.
3. On errors, B sets vk+1 = vk + zb − zb* = vk + F(xi, yi) − F(xi, y*)
The voted perceptron for NER
A sends B a set of instances z1, z2, z3, z4, … B computes the scores vk · zi and returns the index b* of the “best” zi; A replies with the correct index b.
If mistake: vk+1 = vk + zb − zb*
1. A sends a word vector xi.
2. B just returns the y* that gives the best score for vk · F(xi, y*).
3. A sends B the correct label sequence yi.
4. On errors, B sets vk+1 = vk + zb − zb* = vk + F(xi, yi) − F(xi, y*)
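The protocol above is the structured (collins-style) perceptron. Here is my own toy sketch of it: the feature function is a hypothetical stand-in for the Sha & Pereira features, and brute-force enumeration of label sequences stands in for Viterbi, so it only works for toy-sized problems.

```python
# Toy structured-perceptron sketch (my illustration, not the lecture's code).
from itertools import product

def feat(xs, ys):
    """Joint feature map F(x, y): counts of word-tag emissions and tag-tag
    transitions, stored sparsely in a dict. A toy stand-in only."""
    f = {}
    for w, t in zip(xs, ys):
        f[("emit", w, t)] = f.get(("emit", w, t), 0) + 1
    for t1, t2 in zip(ys, ys[1:]):
        f[("trans", t1, t2)] = f.get(("trans", t1, t2), 0) + 1
    return f

def score(v, f):
    return sum(v.get(k, 0.0) * c for k, c in f.items())

def structured_perceptron(data, tags, epochs=10):
    """data: list of (word_seq, tag_seq). Brute-force argmax over all tag
    sequences replaces Viterbi here (exponential, fine only for toys)."""
    v = {}
    for _ in range(epochs):
        for xs, ys in data:
            y_star = list(max(product(tags, repeat=len(xs)),
                              key=lambda y: score(v, feat(xs, y))))
            if y_star != list(ys):
                # the update from the slide: v <- v + F(x, y) - F(x, y*)
                for k, c in feat(xs, ys).items():
                    v[k] = v.get(k, 0.0) + c
                for k, c in feat(xs, y_star).items():
                    v[k] = v.get(k, 0.0) - c
    return v
```

After a pass or two on consistent toy data, the argmax sequence matches the gold tags.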
Thorsten’s notation vs mine
• Thorsten (ranking): u · (xl − xj) for each pair where xl should be ranked above xj
• Mine: ∀(xi, yi) ∈ D, ∀y ≠ yi: u · (F(xi, yi) − F(xi, y)) > γ
• i.e.: ∀(xi, yi) ∈ D, ∀y ≠ yi: u · (Φ(xi, yi) − Φ(xi, y)) > γ
• abbreviated: ∀(xi, yi) ∈ D, ∀y ≠ yi: u · δΦi(y) > γ
SVM for ranking: assumptions
∀(xi, yi) ∈ D, ∀y ≠ yi: u · δΦi(y) > γ
where δΦi(y) = Φ(xi, yi) − Φ(xi, y)
and ||u|| = 1
SVM for ranking: assumptions
Assumption: ∀(xi, yi) ∈ D, ∀y ≠ yi: u · δΦi(y) > γ
where δΦi(y) = Φ(xi, yi) − Φ(xi, y) and ||u|| = 1
The assumption suggests the algorithm:
– Minimize ||w||² under the constraint
• for all i, w·xi·yi > 1, which in the new notation becomes
• ∀i, ∀y ≠ yi: w · δΦi(y) ≥ 1 — Thorsten’s eq (5-6), SVM0
The Kernel Trick
The voted perceptron
A sends B an instance xi. B computes the prediction ŷi = sign(vk · xi) and returns ŷi; A replies with the true label yi.
If mistake: vk+1 = vk + yi xi
The kernel trick
vk = yi1 xi1 + yi2 xi2 + … + yik xik
where i1, …, ik are the mistakes… so:
vk · xtest = (yi1 xi1 + yi2 xi2 + … + yik xik) · xtest
= yi1 (xi1 · xtest) + yi2 (xi2 · xtest) + … + yik (xik · xtest)
Remember: the only updates are vk+1 = vk + yi xi on mistakes, so vk is just a signed sum of the mistake examples.
The kernel trick – con’t
vk = yi1 xi1 + yi2 xi2 + … + yik xik
where i1, …, ik are the mistakes… then
vk · xtest = yi1 (xi1 · xtest) + yi2 (xi2 · xtest) + … + yik (xik · xtest)
Since only inner products are ever needed, consider a preprocessor that replaces every x with x’ to include, directly in the example, all the pairwise variable interactions, so what is learned is a vector v’:
v’k · x’test = yi1 (x’i1 · x’test) + yi2 (x’i2 · x’test) + … + yik (x’ik · x’test)
= yi1 K(xi1, xtest) + yi2 K(xi2, xtest) + … + yik K(xik, xtest)
where K(x, xtest) = x’ · x’test
The kernel trick – con’t
u = (a, b), the linear function ax + by
v = (c, d), the linear function cx + dy
u’ = (a, b, e, f, g, h), the function ax + by + ex² + fy² + gxy + h
v’ = (c, d, l, m, n, p), the function cx + dy + lx² + my² + nxy + p
A voted perceptron over vectors like u, v is a linear function…
Replacing u with u’ would lead to non-linear functions – f(x, y, xy, x², …)
The kernel trick – con’t
u = (a, b), v = (c, d) as before, with expanded versions u’, v’.
But notice… if we replace u·v with (u·v + 1)² …
(u·v + 1)² = (ac + bd + 1)(ac + bd + 1)
= a²c² + 2abcd + b²d² + 2ac + 2bd + 1
Compare to u’·v’ with the expanded vectors u’ = (a², b², ab, a, b, 1) and v’ = (c², d², cd, c, d, 1):
u’·v’ = a²c² + b²d² + abcd + ac + bd + 1
The kernel trick – con’t
So – up to constants on the cross-product terms – (u·v + 1)² = u’·v’.
Why not replace the computation of
v’k · x’test = yi1 (x’i1 · x’test) + … + yik (x’ik · x’test)
with the computation of
v’k · x’test = yi1 K(xi1, xtest) + … + yik K(xik, xtest)
where K(x, xi) = (x·xi + 1)² ?
The kernel trick – con’t
General idea: replace an expensive preprocessor x → x’ and an ordinary inner product with no preprocessor and a function K(x, xi) where
K(x, xi) = x’ · x’i
This is really useful when you want to learn over objects x with some non-trivial structure… as in the two Mooney papers.
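This general idea is the kernel perceptron. A minimal sketch of it (mine, with hypothetical names) stores the mistakes instead of an explicit weight vector and scores test points purely through K, here the polynomial kernel K(x, xi) = (x·xi + 1)² derived above:

```python
# Kernel perceptron sketch (my illustration): no explicit weight vector,
# only the list of mistakes (y_i, x_i) and calls to the kernel K.
def poly_kernel(x, z):
    """K(x, z) = (x . z + 1)^2, the kernel from the slides."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** 2

def kernel_perceptron(examples, K, epochs=10):
    mistakes = []
    for _ in range(epochs):
        for x, y in examples:
            # v_k . x = sum over mistakes of y_i K(x_i, x)
            s = sum(yi * K(xi, x) for yi, xi in mistakes)
            if (1 if s >= 0 else -1) != y:
                mistakes.append((y, x))
    return mistakes

def kernel_predict(mistakes, K, x):
    s = sum(yi * K(xi, x) for yi, xi in mistakes)
    return 1 if s >= 0 else -1
```

XOR-style data (label = sign of the product of the two coordinates) is not linearly separable, but the implicit xy feature of the quadratic kernel separates it.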
The kernel trick – con’t
Even more general idea: use any function K that is
• Continuous
• Symmetric – i.e., K(u,v) = K(v,u)
• “Positive semidefinite” – i.e., every matrix of pairwise values M[i,j] = K(xi, xj) has no negative eigenvalues
Then by an ancient theorem due to Mercer, K corresponds to some combination of a preprocessor and an inner product: i.e.,
K(x, xi) = x’ · x’i
Terminology: K is a Mercer kernel. The set of all x’ is a reproducing kernel Hilbert space (RKHS). The matrix M[i,j] = K(xi, xj) is a Gram matrix.
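The Gram-matrix condition can be sanity-checked numerically. A small sketch (mine; assumes NumPy is available) builds the Gram matrix of the polynomial kernel from the slides on a few points and inspects its eigenvalues:

```python
# Numerical check of the Mercer/Gram-matrix property (my illustration).
import numpy as np

def gram(K, xs):
    """Gram matrix M[i, j] = K(x_i, x_j)."""
    return np.array([[K(a, b) for b in xs] for a in xs])

# The polynomial kernel from the slides: K(x, z) = (x . z + 1)^2
poly = lambda a, b: (np.dot(a, b) + 1) ** 2

xs = [np.array(p, dtype=float) for p in [(0, 1), (1, 1), (2, -1), (-1, 3)]]
M = gram(poly, xs)

# For a Mercer kernel, M is symmetric and its smallest eigenvalue is
# nonnegative (up to floating-point noise).
min_eig = np.linalg.eigvalsh(M).min()
```

A kernel that fails this check on some point set is not a Mercer kernel, so no preprocessor x → x’ can reproduce it as an inner product.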