Kernels and Support Vector Machines
Seminar 2: Kernels and Support Vector Machines
Edgar Marca
Supervisor: DSc. André M. S. Barreto
Petrópolis, Rio de Janeiro - Brazil. September 2nd, 2015
Kernels
Why Kernelize?
At first sight, introducing $k(x, x')$ has not improved our situation. Instead of calculating $\langle \Phi(x_i), \Phi(x_j)\rangle$ for $i, j = 1, \ldots, n$ we have to calculate $k(x_i, x_j)$, which has exactly the same values. However, there are two potential reasons why the kernelized setup can be advantageous:

Speed: We might find an expression for $k(x_i, x_j)$ that is faster to calculate than forming $\Phi(x_i)$, $\Phi(x_j)$ and then taking the inner product $\langle \Phi(x_i), \Phi(x_j)\rangle$.

Flexibility: We can construct functions $k(x, x')$ for which we know that they correspond to inner products after some feature mapping $\Phi$, but we don't know how to compute $\Phi$ explicitly.
How to use the Kernel Trick
To evaluate a decision function $f(x)$ on an example $x$, one typically employs the kernel trick as follows:

$$
f(x) = \langle w, \Phi(x)\rangle
= \left\langle \sum_{i=1}^{N} \alpha_i \Phi(x_i), \Phi(x)\right\rangle
= \sum_{i=1}^{N} \alpha_i \langle \Phi(x_i), \Phi(x)\rangle
= \sum_{i=1}^{N} \alpha_i\, k(x_i, x)
$$
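As a concrete illustration, here is a minimal NumPy sketch of this evaluation. The kernel choice (an RBF kernel), the coefficients `alpha`, and the training points `X_train` are hypothetical placeholders, not values from the slides:

```python
import numpy as np

def rbf_kernel(xi, x, sigma=1.0):
    # k(x_i, x) = exp(-||x_i - x||^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(xi - x) ** 2 / (2.0 * sigma ** 2))

def decision_value(x, X_train, alpha, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i * k(x_i, x), evaluated without ever forming Phi(x)
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))

# toy example with made-up training points and coefficients
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alpha = np.array([0.5, -0.3, 0.8])
print(decision_value(np.array([1.0, 0.5]), X_train, alpha))
```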
How to prove that a function is a kernel?
Some Definitions
Definition 1.1 (Positive Definite Kernel). Let $X$ be a nonempty set. A function $k : X \times X \to \mathbb{C}$ is called positive definite if and only if

$$
\sum_{i=1}^{n}\sum_{j=1}^{n} c_i \overline{c_j}\, k(x_i, x_j) \geq 0 \tag{1}
$$

for all $n \in \mathbb{N}$, $\{x_1, \ldots, x_n\} \subseteq X$ and $\{c_1, \ldots, c_n\} \subseteq \mathbb{C}$.

Unfortunately, there is no common use of the preceding definition in the literature. Indeed, some authors call positive definite functions positive semi-definite, and strictly positive definite functions are sometimes called positive definite.

Note: For fixed $x_1, x_2, \ldots, x_n \in X$, the $n \times n$ matrix $K := [k(x_i, x_j)]_{1 \leq i,j \leq n}$ is often called the Gram matrix.
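A quick numerical way to probe this condition on a finite sample is to build the Gram matrix and inspect its eigenvalues: for a real, symmetric kernel, condition (1) on the sample is equivalent to the Gram matrix being positive semi-definite. The kernel and sample points in this sketch are hypothetical:

```python
import numpy as np

def gram_matrix(kernel, xs):
    # K[i, j] = k(x_i, x_j) for the sample points xs
    n = len(xs)
    return np.array([[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)])

def looks_positive_definite(kernel, xs, tol=1e-10):
    # a symmetric real kernel satisfies (1) on this sample iff all
    # eigenvalues of the Gram matrix are >= 0 (up to numerical tolerance)
    K = gram_matrix(kernel, xs)
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# example with a linear kernel on made-up 1-D points
xs = [np.array([0.5]), np.array([1.0]), np.array([-2.0])]
print(looks_positive_definite(lambda a, b: float(a @ b), xs))  # True
```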
Mercer Condition
Theorem 1.2. Let $X = [a, b]$ be a compact interval and let $k : [a, b] \times [a, b] \to \mathbb{C}$ be continuous. Then $k$ is positive definite if and only if

$$
\int_a^b \int_a^b c(x)\,\overline{c(y)}\, k(x, y)\, dx\, dy \geq 0 \tag{2}
$$

for each continuous function $c : X \to \mathbb{C}$.
Theorem 1.3 (Symmetric, positive definite functions are kernels). A function $k : X \times X \to \mathbb{R}$ is a kernel if and only if it is symmetric and positive definite.
Theorem 1.4. Let $k_1, k_2, \ldots$ be arbitrary positive definite kernels on $X \times X$, where $X$ is a nonempty set.

- The set of positive definite kernels is a closed convex cone, that is:
  1. If $\alpha_1, \alpha_2 \geq 0$, then $\alpha_1 k_1 + \alpha_2 k_2$ is positive definite.
  2. If $k(x, x') := \lim_{n \to \infty} k_n(x, x')$ exists for all $x, x'$, then $k$ is positive definite.
- The product $k_1 \cdot k_2$ is a positive definite kernel.
- Assume that for $i = 1, 2$, $k_i$ is a positive definite kernel on $X_i \times X_i$, where $X_i$ is a nonempty set. Then the tensor product $k_1 \otimes k_2$ and the direct sum $k_1 \oplus k_2$ are positive definite kernels on $(X_1 \times X_2) \times (X_1 \times X_2)$.
- Suppose that $Y$ is a nonempty set and let $f : Y \to X$ be an arbitrary function. Then $k(x, y) := k_1(f(x), f(y))$ is a positive definite kernel on $Y \times Y$.
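As a quick numerical sanity check of the conic-combination rule and the product rule, the following sketch (with made-up sample points, and a linear and an RBF kernel as the two ingredients) verifies that the resulting Gram matrices have no negative eigenvalues:

```python
import numpy as np

def gram(kernel, xs):
    return np.array([[kernel(a, b) for b in xs] for a in xs])

def min_eig(kernel, xs):
    return np.linalg.eigvalsh(gram(kernel, xs)).min()

k1 = lambda a, b: float(a @ b)                                # linear kernel
k2 = lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2 / 2.0)   # RBF kernel

xs = [np.random.randn(3) for _ in range(10)]

k_sum = lambda a, b: 2.0 * k1(a, b) + 0.5 * k2(a, b)   # conic combination
k_prod = lambda a, b: k1(a, b) * k2(a, b)              # pointwise product

# both minimal eigenvalues should be >= 0 (up to rounding)
print(min_eig(k_sum, xs), min_eig(k_prod, xs))
```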
Kernel Families
Translation Invariant Kernels
Definition 1.5. A translation invariant kernel is given by

$$
K(x, y) = k(x - y) \tag{3}
$$

where $k$ is an even function on $\mathbb{R}^n$, i.e., $k(-x) = k(x)$ for all $x \in \mathbb{R}^n$.
Definition 1.6. A function $f : (0, \infty) \to \mathbb{R}$ is completely monotonic if it is $C^{\infty}$ and, for all $r > 0$ and $k \geq 0$,

$$
(-1)^k f^{(k)}(r) \geq 0 \tag{4}
$$

Here $f^{(k)}$ denotes the $k$-th derivative of $f$.

Theorem 1.7. Let $X \subseteq \mathbb{R}^n$, $f : (0, \infty) \to \mathbb{R}$ and $K : X \times X \to \mathbb{R}$ be defined by $K(x, y) = f(\|x - y\|^2)$. If $f$ is completely monotonic then $K$ is positive definite.
Corollary 1.8. Let $c \neq 0$. Then the following kernels, defined on a compact domain $X \subset \mathbb{R}^n$, are Mercer kernels.

Gaussian Kernel, also called Radial Basis Function (RBF) or Squared Exponential (SE) Kernel:

$$
k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \tag{5}
$$

Inverse Multiquadratic Kernel:

$$
k(x, y) = \left(c^2 + \|x - y\|^2\right)^{-\alpha}, \quad \alpha > 0 \tag{6}
$$
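A minimal NumPy sketch of these two kernels, with the bandwidth $\sigma$, the constant $c$ and the exponent $\alpha$ chosen arbitrarily for illustration:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # (5): k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def inverse_multiquadratic_kernel(x, y, c=1.0, alpha=0.5):
    # (6): k(x, y) = (c^2 + ||x - y||^2)^(-alpha), with alpha > 0
    return (c ** 2 + np.linalg.norm(x - y) ** 2) ** (-alpha)

x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(gaussian_kernel(x, y), inverse_multiquadratic_kernel(x, y))
```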
Polynomial Kernels
$$
k(x, x') = \left(\gamma\,\langle x, x'\rangle + c\right)^{d}, \quad \gamma > 0,\; c \geq 0,\; d \in \mathbb{Z}^{+} \tag{7}
$$
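A corresponding one-line sketch of the polynomial kernel, with illustrative parameter values:

```python
import numpy as np

def polynomial_kernel(x, y, gamma=1.0, c=1.0, d=3):
    # (7): k(x, y) = (gamma * <x, y> + c)^d
    return (gamma * float(x @ y) + c) ** d

print(polynomial_kernel(np.array([1.0, 2.0]), np.array([0.5, -1.0])))
```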
Non Mercer Kernels
Example 1.9. Let $k : X \times X \to \mathbb{R}$ be defined as

$$
k(x, x') =
\begin{cases}
1 & \text{if } \|x - x'\| \leq 1 \\
0 & \text{otherwise}
\end{cases} \tag{8}
$$

Suppose that $k$ is a Mercer kernel and set $x_1 = 1$, $x_2 = 2$ and $x_3 = 3$. Then the matrix $K_{ij} = k(x_i, x_j)$ for $1 \leq i, j \leq 3$ is

$$
K = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix} \tag{9}
$$

The eigenvalues of $K$ are $\lambda_1 = 1 + \sqrt{2} > 0$, $\lambda_2 = 1$ and $\lambda_3 = 1 - \sqrt{2} < 0$. This is a contradiction: if $k$ were a Mercer kernel, all the eigenvalues of $K$ would be nonnegative. We conclude that $k$ is not a Mercer kernel.
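The contradiction is easy to reproduce numerically; this short sketch builds the same $3 \times 3$ matrix from (8) and prints its eigenvalues, one of which is negative:

```python
import numpy as np

def k(x, y):
    # (8): 1 if |x - y| <= 1, 0 otherwise
    return 1.0 if abs(x - y) <= 1 else 0.0

xs = [1, 2, 3]
K = np.array([[k(a, b) for b in xs] for a in xs])
print(np.linalg.eigvalsh(K))  # approx [-0.414, 1.0, 2.414]: not positive semi-definite
```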
References for Kernels
[3] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer-Verlag, 1984.

[9] Felipe Cucker and Ding Xuan Zhou. Learning Theory. Cambridge University Press, 2007.

[47] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.
Support Vector Machines
Figure: Linear Support Vector Machine, showing the separating hyperplane $\langle w, x\rangle + b = 0$, the margin hyperplanes $\langle w, x\rangle + b = \pm 1$, and the margin between them.
Primal Problem
Theorem 3.1. The optimization program for the maximum margin classifier is

$$
\begin{aligned}
\min_{w, b} \quad & \frac{1}{2}\|w\|^2 \\
\text{s.t.} \quad & y_i\left(\langle w, x_i\rangle + b\right) \geq 1, \quad \forall i,\; 1 \leq i \leq m
\end{aligned} \tag{10}
$$
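For intuition, here is a small sketch that solves this primal program with the generic convex solver cvxpy on a made-up, linearly separable toy set; the data and the choice of cvxpy are illustrative, not part of the slides:

```python
import cvxpy as cp
import numpy as np

# toy, linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -1.0], [-1.5, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# (10): minimize (1/2)||w||^2 subject to y_i (<w, x_i> + b) >= 1
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w)),
    [cp.multiply(y, X @ w + b) >= 1],
)
problem.solve()
print(w.value, b.value)
```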
Theorem 3.2. Let $F$ be the function defined as:

$$
F : \mathbb{R}^m \to \mathbb{R}_{+}, \qquad w \mapsto F(w) = \frac{1}{2}\|w\|^2
$$

Then the following statements hold:

1. $F$ is infinitely differentiable.
2. The gradient of $F$ is $\nabla F(w) = w$.
3. The Hessian of $F$ is $\nabla^2 F(w) = I_{m \times m}$.
4. Since the Hessian $\nabla^2 F(w)$ is positive definite, $F$ is strictly convex.
Theorem 3.3 (The dual problem). The dual optimization program of (10) is:

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \\
\text{s.t.} \quad & \alpha_i \geq 0, \quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \forall i,\; 1 \leq i \leq m
\end{aligned} \tag{11}
$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$, and the solution of this dual problem will be denoted by $\alpha^{*} = (\alpha_1^{*}, \alpha_2^{*}, \ldots, \alpha_m^{*})$.
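A minimal sketch that solves this dual numerically with scipy.optimize, reusing the hypothetical toy data X, y from the primal sketch above, and then recovers $w$ from the stationarity condition $w = \sum_i \alpha_i y_i x_i$ derived in the proof below:

```python
import numpy as np
from scipy.optimize import minimize

# A_ij = y_i y_j <x_i, x_j>
A = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    # negative of the dual objective in (11), since scipy minimizes
    return 0.5 * alpha @ A @ alpha - alpha.sum()

m = len(y)
result = minimize(
    neg_dual,
    x0=np.zeros(m),
    bounds=[(0.0, None)] * m,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = result.x
w = (alpha * y) @ X   # w = sum_i alpha_i y_i x_i
print(alpha, w)
```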
Proof. The Lagrangian of the problem is

$$
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i\left[y_i\left(\langle w, x_i\rangle + b\right) - 1\right] \tag{12}
$$

Since the KKT conditions hold ($F$ is continuous and differentiable and the constraints are also continuous and differentiable), we can write down the following conditions.

Stationarity:

$$
\nabla_w L = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \;\Longrightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i \tag{13}
$$

$$
\nabla_b L = -\sum_{i=1}^{m} \alpha_i y_i = 0 \;\Longrightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0 \tag{14}
$$
Primal feasibility:

$$
y_i\left(\langle w, x_i\rangle + b\right) \geq 1, \quad \forall i \in [1, m] \tag{15}
$$

Dual feasibility:

$$
\alpha_i \geq 0, \quad \forall i \in [1, m] \tag{16}
$$

Complementary slackness:

$$
\alpha_i\left[y_i\left(\langle w, x_i\rangle + b\right) - 1\right] = 0 \;\Longrightarrow\; \alpha_i = 0 \;\text{ or }\; y_i\left(\langle w, x_i\rangle + b\right) = 1, \quad \forall i \in [1, m] \tag{17}
$$

Substituting $w = \sum_{i=1}^{m} \alpha_i y_i x_i$ from (13) into the Lagrangian:

$$
L(w, b, \alpha) = \frac{1}{2}\left\|\sum_{i=1}^{m} \alpha_i y_i x_i\right\|^2
- \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle
- \underbrace{\sum_{i=1}^{m} \alpha_i y_i b}_{=0}
+ \sum_{i=1}^{m} \alpha_i \tag{18}
$$

and since $\frac{1}{2}\left\|\sum_{i} \alpha_i y_i x_i\right\|^2 = \frac{1}{2}\sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle$, it follows that

$$
L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \tag{19}
$$
Theorem 3.4. Let $G$ be the function defined as:

$$
G : \mathbb{R}^m \to \mathbb{R}, \qquad \alpha \mapsto G(\alpha) = \alpha^{\top}\mathbf{1}_m - \frac{1}{2}\,\alpha^{\top} A\, \alpha
$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$, $\mathbf{1}_m$ is the vector of ones, and $A = [y_i y_j \langle x_i, x_j\rangle]_{1 \leq i,j \leq m} \in \mathbb{R}^{m \times m}$. Then the following statements hold:

1. The matrix $A$ is symmetric.
2. The function $G$ is differentiable and $\nabla G(\alpha) = \mathbf{1}_m - A\alpha$.
3. The function $G$ is twice differentiable and $\nabla^2 G(\alpha) = -A$.
4. The function $G$ is a concave function (since $A$ is positive semi-definite, $\nabla^2 G(\alpha) = -A \preceq 0$).
Linear Support Vector Machines
We call a (linear) Support Vector Machine the decision function defined by

$$
f(x) = \operatorname{sign}\left(\langle w, x\rangle + b\right) = \operatorname{sign}\left(\sum_{i=1}^{m} \alpha_i^{*}\, y_i \langle x_i, x\rangle + b\right) \tag{20}
$$

where:
- $m$ is the number of training points.
- $\alpha_i^{*}$ are the Lagrange multipliers of the dual problem (11).
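Continuing the earlier dual sketch (same hypothetical X, y, and the alpha obtained there), the decision function (20) can be evaluated directly; b is recovered from any support vector via the complementary slackness condition $y_i(\langle w, x_i\rangle + b) = 1$:

```python
import numpy as np

# pick a support vector (alpha_i > 0) and recover b from y_i (<w, x_i> + b) = 1
sv = np.argmax(alpha > 1e-6)
b = y[sv] - (alpha * y) @ (X @ X[sv])

def f(x):
    # (20): sign( sum_i alpha_i* y_i <x_i, x> + b )
    return np.sign((alpha * y) @ (X @ x) + b)

print(f(np.array([2.5, 2.0])), f(np.array([-2.0, -2.0])))
```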
Non Linear Support Vector Machines
We call a Non-Linear Support Vector Machine the decision function defined by

$$
f(x) = \operatorname{sign}\left(\langle w, \Phi(x)\rangle + b\right) = \operatorname{sign}\left(\sum_{i=1}^{m} \alpha_i^{*}\, y_i \langle \Phi(x_i), \Phi(x)\rangle + b\right) \tag{21}
$$

where:
- $m$ is the number of training points.
- $\alpha_i^{*}$ are the Lagrange multipliers of the dual problem (11).
Applying the Kernel Trick
Using the kernel trick, we can replace $\langle \Phi(x_i), \Phi(x)\rangle$ by a kernel $k(x_i, x)$:

$$
f(x) = \operatorname{sign}\left(\sum_{i=1}^{m} \alpha_i^{*}\, y_i\, k(x_i, x) + b\right) \tag{22}
$$

where:
- $m$ is the number of training points.
- $\alpha_i^{*}$ are the Lagrange multipliers of the dual problem (11).
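In practice one rarely solves the dual by hand; a library such as scikit-learn exposes exactly the quantities appearing in (22). The following sketch (toy data and parameter choices are illustrative) fits a kernelized SVM and checks that the library's decision function matches the expansion $\sum_i \alpha_i^{*} y_i\, k(x_i, x) + b$ over the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # non-linearly separable labels

gamma = 0.5
clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)

def rbf(a, b):
    # scikit-learn's RBF convention: k(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.linalg.norm(a - b) ** 2)

x_new = np.array([0.2, 0.1])
# dual_coef_ stores alpha_i* y_i for the support vectors, intercept_ stores b
manual = sum(
    coef * rbf(sv, x_new)
    for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_)
) + clf.intercept_[0]
print(manual, clf.decision_function([x_new])[0])  # the two values should agree
```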
References for Support Vector Machines
[31] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.Foundations of Machine Learning. The MIT Press, 2012.