     • date post

08-Apr-2017

### Transcript of Kernels and Support Vector Machines

• Seminar 2: Kernels and Support Vector Machines

Edgar Marca

Supervisor: DSc. André M.S. Barreto

Petrópolis, Rio de Janeiro - Brazil, September 2nd, 2015

1 / 28

• Kernels

• Kernels

Why Kernelize?

At first sight, introducing $k(x, x')$ has not improved our situation. Instead of calculating $\langle \phi(x_i), \phi(x_j) \rangle$ for $i, j = 1, \dots, n$ we have to calculate $k(x_i, x_j)$, which has exactly the same values. However, there are two potential reasons why the kernelized setup can be advantageous:

Speed: We might find an expression for $k(x_i, x_j)$ that is faster to calculate than forming $\phi(x_i)$ and then $\langle \phi(x_i), \phi(x_j) \rangle$.

Flexibility: We can construct functions $k(x, x')$ for which we know that they correspond to inner products after some feature mapping $\phi$, but we do not know how to compute $\phi$.
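As a small illustration of the speed argument, the sketch below compares the homogeneous quadratic kernel $k(x, y) = \langle x, y \rangle^2$ on $\mathbb{R}^2$ with its explicit feature map $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$. This particular kernel and the names `dot`, `phi` and `quadratic_kernel` are assumptions chosen for illustration; they do not appear in the slides.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2 (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def quadratic_kernel(x, y):
    """k(x, y) = <x, y>^2, computed without ever forming phi."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))     # inner product in feature space
implicit = quadratic_kernel(x, y)  # same value, one inner product in input space
print(explicit, implicit)
```

Both routes agree up to floating-point rounding, but the kernel needs only one inner product in the input space. For higher degrees and dimensions the explicit feature vector grows quickly, while the kernel evaluation stays a single inner product followed by a power.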

3 / 28

• Kernels

How to use the Kernel Trick

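To evaluate a decision function $f(x)$ on an example $x$, one typically employs the kernel trick as follows:

$$
\begin{aligned}
f(x) &= \langle w, \phi(x) \rangle \\
&= \left\langle \sum_{i=1}^{N} \alpha_i \phi(x_i), \phi(x) \right\rangle \\
&= \sum_{i=1}^{N} \alpha_i \langle \phi(x_i), \phi(x) \rangle \\
&= \sum_{i=1}^{N} \alpha_i k(x_i, x)
\end{aligned}
$$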
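A minimal sketch of this evaluation, assuming the coefficients $\alpha_i$ and the training points $x_i$ are already available. The function name `decision_value` and the numeric values are hypothetical. The linear kernel is used only because its weight vector $w = \sum_i \alpha_i x_i$ can be formed explicitly for comparison.

```python
def linear_kernel(u, v):
    """k(u, v) = <u, v>; here phi is the identity map."""
    return sum(a * b for a, b in zip(u, v))

def decision_value(alphas, train_points, x, kernel):
    """Evaluate f(x) = sum_i alpha_i * k(x_i, x) without forming w."""
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, train_points))

# Hypothetical coefficients and training points, for illustration only.
train_points = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
alphas = [0.5, -1.0, 2.0]
x = (3.0, 4.0)

f_kernel = decision_value(alphas, train_points, x, linear_kernel)

# For the linear kernel, w = sum_i alpha_i * x_i can be built explicitly.
w = [sum(a * xi[d] for a, xi in zip(alphas, train_points)) for d in range(2)]
f_explicit = linear_kernel(w, x)
print(f_kernel, f_explicit)
```

Swapping `linear_kernel` for any other kernel function changes the feature space without touching `decision_value`, which is the practical point of the trick.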

4 / 28

• How to prove that a function is a kernel?

• Kernels

Some Definitions

Definition 1.1 (Positive Definite Kernel)
Let $X$ be a nonempty set. A function $k : X \times X \to \mathbb{C}$ is called positive definite if and only if

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i \overline{c_j}\, k(x_i, x_j) \geq 0 \tag{1}$$

for all $n \in \mathbb{N}$, $\{x_1, \dots, x_n\} \subseteq X$ and $\{c_1, \dots, c_n\} \subseteq \mathbb{C}$.

Unfortunately, there is no common use of the preceding definition in the literature. Indeed, some authors call positive definite functions positive semi-definite, and strictly positive definite functions are sometimes called positive definite.

Note: For fixed $x_1, x_2, \dots, x_n \in X$, the $n \times n$ matrix $K := [k(x_i, x_j)]_{1 \leq i,j \leq n}$ is often called the Gram matrix.
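For a real, symmetric kernel, condition (1) on a fixed set of points says that the Gram matrix has no negative eigenvalues. The sketch below checks this numerically. It assumes NumPy is available, and the helper names `gram_matrix` and `is_positive_semidefinite`, the tolerance, and the sample points are illustrative choices rather than anything taken from the slides.

```python
import numpy as np

def gram_matrix(kernel, points):
    """Build K[i, j] = kernel(points[i], points[j])."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(points[i], points[j])
    return K

def is_positive_semidefinite(K, tol=1e-10):
    """True if all eigenvalues of the symmetric matrix K are >= -tol."""
    # eigvalsh is intended for symmetric (Hermitian) matrices.
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

def linear_kernel(x, y):
    return float(np.dot(x, y))

# Hypothetical sample points in R^2.
points = [np.array([1.0, 2.0]), np.array([0.0, -1.0]), np.array([3.0, 0.5])]
K = gram_matrix(linear_kernel, points)
print(is_positive_semidefinite(K))
```

A passing check on one point set is only evidence, not a proof, that a function is positive definite. A single failing point set, on the other hand, is enough to rule it out.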

6 / 28

• Kernels

Mercer Condition

Theorem 1.2
Let $X = [a, b]$ be a compact interval and let $k : [a, b] \times [a, b] \to \mathbb{C}$ be continuous. Then $k$ is positive definite if and only if

$$\int_a^b \int_a^b c(x) \overline{c(y)}\, k(x, y)\, dx\, dy \geq 0 \tag{2}$$

for each continuous function $c : X \to \mathbb{C}$.

7 / 28

• Kernels

Theorem 1.3 (Symmetric, positive definite functions are kernels)
A function $k : X \times X \to \mathbb{R}$ is a kernel if and only if it is symmetric and positive definite.

8 / 28

• Kernels

Theorem 1.4
Let $k_1, k_2, \dots$ be arbitrary positive definite kernels on $X \times X$, where $X$ is a nonempty set.

The set of positive definite kernels is a closed convex cone, that is,
1. If $\alpha_1, \alpha_2 \geq 0$, then $\alpha_1 k_1 + \alpha_2 k_2$ is positive definite.
2. If $k(x, x') := \lim_{n \to \infty} k_n(x, x')$ exists for all $x, x'$, then $k$ is positive definite.

The product $k_1 \cdot k_2$ is a positive definite kernel.

Assume that for $i = 1, 2$, $k_i$ is a positive definite kernel on $X_i \times X_i$, where $X_i$ is a nonempty set. Then the tensor product $k_1 \otimes k_2$ and the direct sum $k_1 \oplus k_2$ are positive definite kernels on $(X_1 \times X_2) \times (X_1 \times X_2)$.

Suppose that $Y$ is a nonempty set and let $f : Y \to X$ be an arbitrary function. Then $k(x, y) = k_1(f(x), f(y))$ is a positive definite kernel on $Y \times Y$.
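The first two closure properties can be checked numerically on Gram matrices: a nonnegative combination of kernels gives the same combination of Gram matrices, and a product of kernels gives their entrywise product. The sketch below assumes NumPy. The random sample, the two kernels (linear and Gaussian with unit width), and the weights 0.5 and 2.0 are arbitrary choices for illustration.

```python
import numpy as np

def min_eigenvalue(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(K).min())

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # six hypothetical sample points in R^3

# Gram matrix of the linear kernel k1(x, y) = <x, y>.
K1 = X @ X.T

# Gram matrix of a Gaussian kernel k2(x, y) = exp(-||x - y||^2 / 2).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K2 = np.exp(-sq_dists / 2.0)

K_sum = 0.5 * K1 + 2.0 * K2  # Gram matrix of 0.5*k1 + 2*k2
K_prod = K1 * K2             # entrywise product: Gram matrix of k1*k2

print(min_eigenvalue(K_sum), min_eigenvalue(K_prod))
```

Up to rounding error, both smallest eigenvalues come out nonnegative, as the theorem predicts for this sample.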

9 / 28

• Kernel Families

• Kernels Kernel Families

Translation Invariant Kernels

Definition 1.5
A translation invariant kernel is given by

$$K(x, y) = k(x - y) \tag{3}$$

where $k$ is an even function on $\mathbb{R}^n$, i.e., $k(-x) = k(x)$ for all $x$ in $\mathbb{R}^n$.

11 / 28

• Kernels Kernel Families

Translation Invariant Kernels

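Definition 1.6
A function $f : (0, \infty) \to \mathbb{R}$ is completely monotonic if it is $C^\infty$ and, for all $r > 0$ and $k \geq 0$,

$$(-1)^k f^{(k)}(r) \geq 0 \tag{4}$$

Here $f^{(k)}$ denotes the $k$th derivative of $f$.

Theorem 1.7
Let $X \subset \mathbb{R}^n$, $f : (0, \infty) \to \mathbb{R}$ and $K : X \times X \to \mathbb{R}$ be defined by $K(x, y) = f(\|x - y\|^2)$. If $f$ is completely monotonic then $K$ is positive definite.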

12 / 28

• Kernels Kernel Families

Translation Invariant Kernels

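Corollary 1.8
Let $c \neq 0$. Then the following kernels, defined on a compact domain $X \subset \mathbb{R}^n$, are Mercer kernels.

Gaussian Kernel or Radial Basis Function (RBF) or Squared Exponential Kernel (SE)

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \tag{5}$$

$$k(x, y) = \left(c^2 + \|x - y\|^2\right)^{-\alpha}, \quad \alpha > 0 \tag{6}$$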
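As a worked check of how the complete monotonicity criterion applies to the Gaussian kernel, take $f(r) = \exp(-r / (2\sigma^2))$ with $\sigma > 0$, so that the Gaussian kernel equals $f(\|x - y\|^2)$. The derivation is a sketch added for illustration and is not part of the slides.

$$
f^{(k)}(r) = \left(-\frac{1}{2\sigma^2}\right)^{k} \exp\left(-\frac{r}{2\sigma^2}\right)
\quad \Longrightarrow \quad
(-1)^k f^{(k)}(r) = \left(\frac{1}{2\sigma^2}\right)^{k} \exp\left(-\frac{r}{2\sigma^2}\right) > 0
\quad \text{for all } r > 0,\ k \geq 0.
$$

Hence $f$ is completely monotonic, and the Gaussian kernel is positive definite by the theorem on kernels of the form $f(\|x - y\|^2)$.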

13 / 28

• Kernels Kernel Families

Polynomial Kernels

$$k(x, x') = \left(\alpha \langle x, x' \rangle + c\right)^d, \quad \alpha > 0,\ c \geq 0,\ d \in \mathbb{Z}_{+} \tag{7}$$

14 / 28

• Kernels Kernel Families

Non Mercer Kernels

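Example 1.9
Let $k : X \times X \to \mathbb{R}$ be defined as

$$k(x, x') = \begin{cases} 1, & |x - x'| \leq 1 \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

Suppose that $k$ is a Mercer kernel and set $x_1 = 1$, $x_2 = 2$ and $x_3 = 3$. Then the matrix $K_{ij} = k(x_i, x_j)$ for $1 \leq i, j \leq 3$ is

$$K = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix} \tag{9}$$

Then $K$ has the eigenvalues $\lambda_1 = (\sqrt{2} - 1)^{-1} > 0$ and $\lambda_2 = 1 - \sqrt{2} < 0$. This is a contradiction, because all eigenvalues of the Gram matrix of a Mercer kernel must be nonnegative. We conclude that $k$ is not a Mercer kernel.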
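The eigenvalues in this example can be confirmed numerically. The sketch below assumes NumPy, and the function name `indicator_kernel` is a label chosen here, not one used in the slides.

```python
import numpy as np

def indicator_kernel(x, y):
    """k(x, y) = 1 if |x - y| <= 1, else 0 (the function from the example)."""
    return 1.0 if abs(x - y) <= 1 else 0.0

xs = [1.0, 2.0, 3.0]
K = np.array([[indicator_kernel(a, b) for b in xs] for a in xs])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigenvalues = np.linalg.eigvalsh(K)
print(eigenvalues)
```

The smallest eigenvalue is $1 - \sqrt{2}$ and the largest is $1 + \sqrt{2} = (\sqrt{2} - 1)^{-1}$, matching the values stated in the example.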

15 / 28

• Kernels Kernel Families

References for Kernels

C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer Science+Business Media, LLC, 1984.

Felipe Cucker and Ding-Xuan Zhou. Learning Theory. Cambridge University Press, 2007.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. 2008.

16 / 28

• Support Vector Machines

• Applications SVM

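Support Vector Machines

[Figure: Linear Support Vector Machine. The separating hyperplane $\langle w, x \rangle + b = 0$ lies between the hyperplanes $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$; the gap between them is the margin.]

18 / 28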

• Applications SVM

Primal Problem

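Theorem 3.1
The optimization program for the maximum margin classifier is

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (\langle w, x_i \rangle + b) \geq 1, \quad \forall i,\ 1 \leq i \leq m \tag{10}$$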

19 / 28

• Applications SVM

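Theorem 3.2
Let $F$ be the function defined as

$$F : \mathbb{R}^m \to \mathbb{R}_+, \qquad w \mapsto F(w) = \frac{1}{2} \|w\|^2$$

Then the following statements hold:

1. $F$ is infinitely differentiable.
2. The gradient of $F$ is $\nabla F(w) = w$.
3. The Hessian of $F$ is $\nabla^2 F(w) = I_{m \times m}$.
4. The Hessian $\nabla^2 F(w)$ is positive definite, so $F$ is strictly convex.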

20 / 28

• Applications SVM

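Theorem 3.3 (The dual problem)
The dual optimization program of (10) is:

$$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \alpha_i \geq 0 \ \wedge \ \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \forall i,\ 1 \leq i \leq m \tag{11}$$

where $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_m)$, and the solution of this dual problem will be denoted by $\alpha^* = (\alpha_1^*, \alpha_2^*, \dots, \alpha_m^*)$.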
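To make the dual program concrete, the sketch below evaluates its objective on a hypothetical two-point training set: $x_1 = (1, 0)$ with $y_1 = +1$ and $x_2 = (-1, 0)$ with $y_2 = -1$. The equality constraint forces $\alpha_1 = \alpha_2 = a$, so the objective reduces to $2a - 2a^2$ and a coarse grid search over $a \geq 0$ is enough here. This is not a general-purpose solver, and the data set and all names in the code are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alphas, points, labels):
    """sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    m = len(points)
    linear = sum(alphas)
    quadratic = sum(
        alphas[i] * alphas[j] * labels[i] * labels[j] * dot(points[i], points[j])
        for i in range(m)
        for j in range(m)
    )
    return linear - 0.5 * quadratic

# Hypothetical toy data: one point per class.
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1.0, -1.0]

# With two points of opposite labels, sum_i alpha_i y_i = 0 means alpha_1 = alpha_2 = a.
grid = [k / 100.0 for k in range(0, 201)]  # a in [0, 2], step 0.01
best_a = max(grid, key=lambda a: dual_objective([a, a], points, labels))
alpha_star = [best_a, best_a]

# Recover the primal weight vector from the multipliers: w = sum_i alpha_i y_i x_i.
w = [sum(alpha_star[i] * labels[i] * points[i][d] for i in range(2)) for d in range(2)]
print(best_a, w)
```

On this toy set the maximizer is $a = 0.5$, which gives $w = (1, 0)$. Both training points then satisfy $y_i \langle w, x_i \rangle = 1$ with $b = 0$, so both are support vectors.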

21 / 28

• Applications SVM

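Proof.
The Lagrangian of the function $F$ under the constraints in (10) is

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right] \tag{12}$$

Since the KKT conditions hold ($F$ is continuous and differentiable, and the constraints are also continuous and differentiable), we can impose the following conditions.

Stationarity:

$$\nabla_w L = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \ \Longrightarrow \ w = \sum_{i=1}^{m} \alpha_i y_i x_i \tag{13}$$

$$\nabla_b L = -\sum_{i=1}^{m} \alpha_i y_i = 0 \ \Longrightarrow \ \sum_{i=1}^{m} \alpha_i y_i = 0 \tag{14}$$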

22 / 28

• Applications SVM

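Primal feasibility:

$$y_i (\langle w, x_i \rangle + b) \geq 1, \quad \forall i \in [1, m] \tag{15}$$

Dual feasibility:

$$\alpha_i \geq 0, \quad \forall i \in [1, m] \tag{16}$$

Complementary slackness:

$$\alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right] = 0 \ \Longrightarrow \ \alpha_i = 0 \ \vee \ y_i (\langle w, x_i \rangle + b) = 1, \quad \forall i \in [1, m] \tag{17}$$

Substituting the stationarity conditions into the Lagrangian gives

$$L(w, b, \alpha) = \underbrace{\frac{1}{2} \left\| \sum_{i=1}^{m} \alpha_i y_i x_i \right\|^2 - \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle}_{= -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle} - \underbrace{\sum_{i=1}^{m} \alpha_i y_i b}_{= 0} + \sum_{i=1}^{m} \alpha_i \tag{18}$$

then

$$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \tag{19}$$

23 / 28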

• Applications SVM

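Theorem 3.4
Let $G$ be the function defined as

$$G : \mathbb{R}^m \to \mathbb{R}, \qquad \alpha \mapsto G(\alpha) = \alpha^t \mathbf{1}_m - \frac{1}{2} \alpha^t A \alpha$$

where $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_m)$, $\mathbf{1}_m \in \mathbb{R}^m$ is the vector of ones, and $A = [y_i y_j \langle x_i, x_j \rangle]_{1 \leq i,j \leq m} \in \mathbb{R}^{m \times m}$. Then the following statements hold:

1. The matrix $A$ is symmetric.

2. The function $G$ is differentiable and $\dfrac{\partial G(\alpha)}{\partial \alpha} = \mathbf{1}_m - A\alpha$.

3. The function $G$ is twice differentiable and $\dfrac{\partial^2 G(\alpha)}{\partial \alpha^2} = -A$.

4. The function $G$ is a concave function.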
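These statements can be checked numerically on a small random instance. The sketch below assumes NumPy. The data, labels, and step size are hypothetical, and the linear term of $G$ is implemented as the sum of the entries of $\alpha$, matching the linear term of the dual objective.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                # hypothetical training points
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])  # hypothetical labels

# A[i, j] = y_i * y_j * <x_i, x_j>
A = (y[:, None] * y[None, :]) * (X @ X.T)

def G(alpha):
    """Dual objective: sum(alpha) - 1/2 * alpha^T A alpha."""
    return alpha.sum() - 0.5 * alpha @ A @ alpha

def grad_G(alpha):
    """Closed-form gradient: vector of ones minus A @ alpha."""
    return np.ones_like(alpha) - A @ alpha

alpha = rng.uniform(size=5)
eps = 1e-6
# Central finite differences along each coordinate direction.
numeric_grad = np.array([
    (G(alpha + eps * e) - G(alpha - eps * e)) / (2 * eps)
    for e in np.eye(5)
])

# The Hessian is -A; concavity means its eigenvalues are all <= 0.
hessian_eigenvalues = np.linalg.eigvalsh(-A)
print(np.max(np.abs(numeric_grad - grad_G(alpha))), hessian_eigenvalues.max())
```

Because $A$ is a Gram matrix of the vectors $y_i x_i$, it is positive semi-definite, which is why $-A$ has no positive eigenvalues and $G$ is concave.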

24 / 28

• Applications SVM

Linear Support Vector Machines

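We call Support Vector Machine the decision function defined by

$$f(x) = \operatorname{sign}(\langle w, x \rangle + b) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i^* y_i \langle x_i, x \rangle + b \right) \tag{20}$$

where $m$ is the number of training points and $\alpha_i^*$ are the Lagrange multipliers of the dual problem (11).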

25 / 28

• Applications Non Linear SVM

Non Linear Support Vector Machines

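We call Non Linear Support Vector Machine the decision function defined by

$$f(x) = \operatorname{sign}(\langle w, \phi(x) \rangle + b) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i^* y_i \langle \phi(x_i), \phi(x) \rangle + b \right) \tag{21}$$

where $m$ is the number of training points and $\alpha_i^*$ are the Lagrange multipliers of the dual problem (11).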

26 / 28

• Applications Non Linear SVM

Applying the Kernel Trick

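Using the kernel trick we can replace $\langle \phi(x_i), \phi(x) \rangle$ by a kernel $k(x_i, x)$:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i^* y_i k(x_i, x) + b \right) \tag{22}$$

where $m$ is the number of training points and $\alpha_i^*$ are the Lagrange multipliers of the dual problem (11).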
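A minimal sketch of this kernelized decision function with a Gaussian kernel. The support points, labels, multipliers, and offset below are hypothetical values picked for illustration; they were not obtained by solving the dual problem. The names `gaussian_kernel` and `svm_predict` are likewise assumptions.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """k(u, v) = exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_predict(x, support_points, labels, alphas, b, kernel):
    """Return sign(sum_i alpha_i * y_i * k(x_i, x) + b) as +1 or -1."""
    score = b + sum(
        a * y * kernel(xi, x)
        for a, y, xi in zip(alphas, labels, support_points)
    )
    # Convention chosen here: a score of exactly zero is mapped to +1.
    return 1 if score >= 0 else -1

# Hypothetical model parameters, for illustration only.
support_points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
alphas = [0.5, 0.5]
b = 0.0

pred_right = svm_predict((2.0, 0.5), support_points, labels, alphas, b, gaussian_kernel)
pred_left = svm_predict((-2.0, -0.5), support_points, labels, alphas, b, gaussian_kernel)
print(pred_right, pred_left)
```

Only points with nonzero multipliers contribute to the sum, so in practice the prediction loops over the support vectors rather than the whole training set.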

27 / 28

• Applications Non Linear SVM

References for Support Vector Machines

 Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.Foundations of Machine Learning. The MIT Press, 2012.

28 / 28
