Machine Learning - 10601
SVMs: Duality and Kernel Trick
Geoff Gordon, Miroslav Dudík (partly based on slides of Ziv Bar-Joseph)
http://www.cs.cmu.edu/~ggordon/10601/
November 18, 2009
SVMs as quadratic programs

Two optimization problems, for the separable and the non-separable case:

Separable:
$$\min_{w,b}\ \frac{w^T w}{2} \qquad \text{s.t. } (w^T x_i + b)\,y_i \ge 1 \ \text{ for all } i$$

Non-separable:
$$\min_{w,b,\varepsilon}\ \frac{w^T w}{2} + C \sum_{i=1}^{n} \varepsilon_i \qquad \text{s.t. } (w^T x_i + b)\,y_i \ge 1 - \varepsilon_i,\ \ \varepsilon_i \ge 0,\ \ i = 1,\dots,n$$
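The non-separable program is equivalent to minimizing regularized hinge loss, since at the optimum ε_i = max(0, 1 − y_i(wᵀx_i + b)). A minimal NumPy subgradient-descent sketch of that equivalent objective (the learning rate and iteration count are illustrative assumptions, not from the slides):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=1e-3, iters=1000):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b)),
    equivalent to the non-separable (soft-margin) QP above."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1                                 # points violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```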
Dual for separable case

$$\min_{w,b}\ \frac{w^T w}{2} \qquad \text{s.t. } (w^T x_i + b)\,y_i - 1 \ge 0 \quad (\alpha_i)$$

Lagrangian:
$$L(w, b, \alpha) = \frac{w^T w}{2} - \sum_i \alpha_i \left[ (w^T x_i + b)\,y_i - 1 \right]$$
Dual for separable case

Strong duality lets us swap the order of min and max:
$$\min_{w,b}\ \max_{\alpha \ge 0}\ L(w, b, \alpha) \;=\; \max_{\alpha \ge 0}\ \min_{w,b}\ L(w, b, \alpha)$$

Optimality of w and b:
$$w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0$$

Optimality of α_i, for all i:
$$\alpha_i \left[ (w^T x_i + b)\,y_i - 1 \right] = 0, \qquad \alpha_i \ge 0$$
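Filling in the substitution step that takes us from these optimality conditions to the dual formulation on the next slide (a standard derivation, not spelled out on the slides):

$$\begin{aligned}
L(w,b,\alpha)
&= \tfrac{1}{2}\Big\|\sum_i \alpha_i y_i x_i\Big\|^2
   - \sum_i \alpha_i \Big[ y_i \Big( \sum_j \alpha_j y_j\, x_j^T x_i + b \Big) - 1 \Big] \\
&= \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
   - \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
   - b \underbrace{\sum_i \alpha_i y_i}_{=\,0}
   + \sum_i \alpha_i \\
&= \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j .
\end{aligned}$$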
Dual for separable case

Optimality conditions (KKT conditions):
$$w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0$$
$$\alpha_i \left[ (w^T x_i + b)\,y_i - 1 \right] = 0, \qquad \alpha_i \ge 0 \quad \text{for all } i$$

Dual formulation (plugging the conditions into $\max_{\alpha \ge 0} \min_{w,b} L(w,b,\alpha)$):
$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0$$
Dual SVM - interpretation

$$w = \sum_i \alpha_i y_i x_i$$

Only the training points whose α_i is not 0 contribute to w — these are the support vectors. The α's also satisfy Σ_i α_i y_i = 0 and α_i ≥ 0.
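A minimal sketch of solving this dual QP numerically, assuming the cvxopt package is available (the toy data is made up for illustration; cvxopt minimizes ½αᵀPα + qᵀα subject to Gα ≤ h, Aα = b):

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])  # toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i.x_j
P, q = matrix(Q), matrix(-np.ones(n))            # max sum(a) - a'Qa/2  ==  min a'Pa/2 + q'a
G, h = matrix(-np.eye(n)), matrix(np.zeros(n))   # alpha_i >= 0
A, b = matrix(y[None, :]), matrix(0.0)           # sum_i alpha_i y_i = 0

alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
sv = alpha > 1e-6                                # support vectors: alpha_i > 0
w = ((alpha * y)[:, None] * X).sum(axis=0)       # w = sum_i alpha_i y_i x_i
bias = np.mean(y[sv] - X[sv] @ w)                # from alpha_i[(w.x_i + b)y_i - 1] = 0
```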
Dual SVM for linearly separable case

Our dual target function:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0$$
(a dot product across all pairs of training samples)

To evaluate a new sample x we need to compute:
$$w^T x + b = \sum_i \alpha_i y_i\, x_i^T x + b$$
(a dot product with all training samples)

This might be too much work!
(e.g. when lifting x into high dimensions)
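In code the evaluation rule is just those dot products, continuing the hypothetical alpha, X, y, bias from the QP sketch above; only the support vectors (α_i > 0) actually contribute:

```python
def predict(x_new, alpha, X, y, bias):
    """Sign of sum_i alpha_i y_i (x_i . x_new) + b, using dot products only."""
    return np.sign((alpha * y) @ (X @ x_new) + bias)
```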
Classifying in 1-d

Can an SVM correctly classify this data? What about this?
[figure: two 1-d datasets on the x axis; the second is not separable by a single threshold]

And now?
[figure: the same points lifted to 2-d by adding an x² feature, where a line separates the classes]
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
Non-linear SVMs in 2-d

• The original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable:
$$\varphi:\ x \to \varphi(x), \qquad x = (x_1, x_2), \qquad \varphi(x) = \left(x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\right)$$
[figure: non-separable 2-d data becomes separable after the mapping]

If data is mapped into a sufficiently high dimension, then samples will in general be linearly separable: N data points are in general separable in a space of N-1 dimensions or more!
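A quick check (hypothetical data, NumPy assumed) that circularly arranged 2-d data becomes linearly separable under this φ: in feature space, the plane with normal (1, 1, 0) measures x₁² + x₂², i.e. squared distance from the origin:

```python
import numpy as np

def phi2(x):
    """Map (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

inner = np.array([[0.5, 0.0], [0.0, -0.5]])   # class -1: near the origin
outer = np.array([[2.0, 0.0], [0.0, 2.0]])    # class +1: far from the origin
w = np.array([1.0, 1.0, 0.0])                 # phi(x).w = x1^2 + x2^2

# Inner points score 0.25, outer points 4.0: a plane separates them in feature space.
print([phi2(p) @ w for p in inner], [phi2(p) @ w for p in outer])
```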
Transformation of Inputs

• Possible problems:
- High computational burden due to high dimensionality
- Many more parameters
• SVMs solve these two issues simultaneously:
- "Kernel tricks" allow efficient computation
- The dual formulation only assigns parameters to samples, not features

[figure: input space mapped by φ(·) into feature space]
Polynomials of degree two

Dual formulation with mapped features:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \varphi(x_i)^T \varphi(x_j)
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0$$

• While working in higher dimensions is beneficial, it also increases our running time because of the dot-product computation
• However, there is a neat trick we can use
• Consider all quadratic terms for x1, x2, …, xm (m is the number of features in each vector):

$$\varphi(x) = \left(1,\ \sqrt{2}\,x_1,\ \dots,\ \sqrt{2}\,x_m,\ x_1^2,\ \dots,\ x_m^2,\ \sqrt{2}\,x_1 x_2,\ \dots,\ \sqrt{2}\,x_{m-1} x_m\right)$$

m+1 constant and linear terms, m quadratic terms, m(m-1)/2 pairwise terms. The √2 will become clear on the next slide.
Dot product for polynomials of degree two

$$\varphi(x)^T \varphi(z) = 1 + \sum_i 2\,x_i z_i + \sum_i x_i^2 z_i^2 + \sum_i \sum_{j<i} 2\,x_i x_j z_i z_j$$

How many operations do we need for the dot product?
m + m + m(m-1)/2 ≈ m² operations.
Polynomials of degree d in m variables

Original formulation:
$$\min_{w,b}\ \frac{w^T w}{2} \qquad \text{s.t. } (w^T \varphi(x_i) + b)\,y_i \ge 1$$

Dual formulation:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \varphi(x_i)^T \varphi(x_j)
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0$$

Only the dot products φ(x_i)ᵀφ(x_j) are needed, and for degree d the trick on the next slide generalizes: φ(x)ᵀφ(z) = (xᵀz + 1)^d.
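As a function, the degree-d polynomial kernel is a one-liner, O(m) regardless of d (a generic sketch, not code from the slides):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Degree-d polynomial kernel: equals phi(x).phi(z) for an implicit
    feature map over monomials of degree <= d, but costs only O(m)."""
    return (np.dot(x, z) + 1.0) ** d
```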
The kernel trick

How many operations do we need for the dot product?
$$\varphi(x)^T \varphi(z) = 1 + \sum_i 2\,x_i z_i + \sum_i x_i^2 z_i^2 + \sum_i \sum_{j<i} 2\,x_i x_j z_i z_j$$
m + m + m(m-1)/2 ≈ m² operations.

There's structure to this dot product… we can do this faster!
$$(x \cdot z + 1)^2 = (x \cdot z)^2 + 2\,(x \cdot z) + 1
= \Big(\sum_i x_i z_i\Big)^2 + \sum_i 2\,x_i z_i + 1
= \sum_i x_i^2 z_i^2 + \sum_i \sum_{j<i} 2\,x_i x_j z_i z_j + \sum_i 2\,x_i z_i + 1
= \varphi(x)^T \varphi(z)$$

We only need m operations!

Note that to evaluate a new sample we are also using dot products, so we save there as well.
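A quick numeric sanity check of this identity (NumPy assumed; the explicit map phi below spells out the slide's feature vector):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
x, z = rng.normal(size=m), rng.normal(size=m)

def phi(v):
    """Explicit degree-2 map: 1, sqrt(2)*v_i, v_i^2, sqrt(2)*v_i*v_j (j < i)."""
    pairs = [np.sqrt(2) * v[i] * v[j] for i in range(m) for j in range(i)]
    return np.concatenate(([1.0], np.sqrt(2) * v, v**2, pairs))

assert np.isclose(phi(x) @ phi(z), (x @ z + 1.0) ** 2)  # O(m^2) vs O(m): same value
```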
Where we are

Our dual target function:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \varphi(x_i)^T \varphi(x_j)
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ \alpha_i \ge 0$$
(mn² operations to evaluate all coefficients)

To evaluate a new sample x we need to compute:
$$w^T \varphi(x) + b = \sum_i \alpha_i y_i\, \varphi(x_i)^T \varphi(x) + b$$
(mr operations, where r is the number of support vectors, i.e. those with α_i > 0)
Other kernels

• Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function
- Radial basis function (RBF):
$$K(x, z) = \exp\!\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$$
- Kernel functions for discrete objects (graphs, strings, etc.)
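A sketch of the RBF kernel as code (NumPy assumed; sigma is a bandwidth hyperparameter you would tune):

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))
```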
Kernels are measures of similarity

Decision rule for a new sample x:
$$w^T \varphi(x) + b = \sum_i \alpha_i y_i\, K(x_i, x) + b, \qquad
K(x, z) = \exp\!\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$$
This slide is courtesy of Hastie-Tibshirani-Friedman, 2nd ed.
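That decision rule in code, reusing the hypothetical rbf_kernel above and the dual variables alpha, X, y, bias from the earlier QP sketch:

```python
def predict_kernel(x_new, alpha, X, y, bias, sigma=1.0):
    """sign( sum_i alpha_i y_i K(x_i, x_new) + b ) with an RBF kernel."""
    k = rbf_kernel(X, x_new[None, :], sigma).ravel()  # similarities to training points
    return np.sign((alpha * y) @ k + bias)
```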
Dual formulation for non-separable case

Dual target function:
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
\qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\ \ C \ge \alpha_i \ge 0$$

To evaluate a new sample x we need to compute:
$$w^T x + b = \sum_i \alpha_i y_i\, x_i^T x + b$$

The only difference is that the α_i's are now bounded.
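In the earlier cvxopt sketch, this box constraint is the only change: the inequality block becomes -α_i ≤ 0 and α_i ≤ C (again an illustrative sketch, not slide code, reusing n and the imports from above):

```python
# Replace G, h in the separable-case QP with the box constraint 0 <= alpha_i <= C:
C_val = 1.0
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # -alpha <= 0  and  alpha <= C
h = matrix(np.concatenate([np.zeros(n), C_val * np.ones(n)]))
```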
Why do SVMs work?
• If we are using huge feature spaces (with kernels), how come we are not overfitting the data?
- We maximize the margin!
- We minimize loss + regularization
Software
• A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class classification
• SVMLight is among the earliest implementations of SVMs
• Several MATLAB toolboxes for SVMs are also available
Applications of SVMs
• Bioinformatics
• Machine Vision
• Text Categorization
• Ranking (e.g., Google searches)
• Handwritten Character Recognition
• Time series analysis
Lots of very successful applications!!!
Handwritten digit recognition
Important points
• Difference between regression classifiers and SVMs
• Maximum-margin principle
• Target function for SVMs
• Linearly separable and non-separable cases
• Dual formulation of SVMs
• Kernel trick and computational complexity