Kernels and Support Vector Machines


  • Seminar 2: Kernels and Support Vector Machines

    Edgar Marca

    Supervisor: DSc. André M. S. Barreto

    Petrópolis, Rio de Janeiro - Brazil. September 2nd, 2015


  • Kernels

  • Kernels

    Why Kernelize?

    At first sight, introducing $k(x, x')$ has not improved our situation. Instead of calculating $\langle \Phi(x_i), \Phi(x_j)\rangle$ for $i, j = 1, \dots, n$ we have to calculate $k(x_i, x_j)$, which has exactly the same values. However, there are two potential reasons why the kernelized setup can be advantageous:

    Speed: We might find an expression for $k(x_i, x_j)$ that is faster to calculate than forming $\Phi(x_i)$ and then $\langle \Phi(x_i), \Phi(x_j)\rangle$.

    Flexibility: We can construct functions $k(x, x')$ for which we know that they correspond to inner products after some feature mapping $\Phi$, but we don't know how to compute $\Phi$.


  • Kernels

    How to use the Kernel Trick

    To evaluate a decision function $f(x)$ on an example $x$, one typically employs the kernel trick as follows:

    $$f(x) = \langle w, \Phi(x)\rangle = \sum_{i=1}^{N} \alpha_i \langle \Phi(x_i), \Phi(x)\rangle = \sum_{i=1}^{N} \alpha_i\, k(x_i, x)$$

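    As an illustration (not part of the slides), here is a minimal Python sketch of such a kernelized decision function, assuming the coefficients $\alpha_i$ and the training points $x_i$ are already known; the helper `linear_kernel` is a stand-in for any valid kernel $k$:

```python
import numpy as np

def linear_kernel(x, y):
    # k(x, y) = <x, y>; any positive definite kernel could be used instead.
    return float(np.dot(x, y))

def decision_function(x, train_points, alphas, kernel=linear_kernel):
    # f(x) = sum_i alpha_i k(x_i, x), evaluated without ever forming Phi(x_i).
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, train_points))
```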

  • How to prove that a function is a kernel?

  • Kernels

    Some Definitions

    Definition 1.1 (Positive Definite Kernel)
    Let $X$ be a nonempty set. A function $k : X \times X \to \mathbb{C}$ is called positive definite if and only if

    $$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i \overline{c_j}\, k(x_i, x_j) \geq 0 \quad (1)$$

    for all $n \in \mathbb{N}$, $\{x_1, \dots, x_n\} \subseteq X$ and $\{c_1, \dots, c_n\} \subseteq \mathbb{C}$.

    Unfortunately, there is no common use of the preceding definition in the literature. Indeed, some authors call positive definite functions positive semi-definite, and strictly positive definite functions are sometimes called positive definite.

    Note: For fixed $x_1, x_2, \dots, x_n \in X$, the $n \times n$ matrix $K := [k(x_i, x_j)]_{1 \leq i, j \leq n}$ is often called the Gram matrix.

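    For real-valued symmetric kernels, Definition 1.1 restricted to a finite sample amounts to checking that the Gram matrix has no negative eigenvalues. A small numerical spot check (my own sketch, not from the slides):

```python
import numpy as np

def gram_matrix(kernel, xs):
    # K[i, j] = k(x_i, x_j) for the sample points xs.
    return np.array([[kernel(xi, xj) for xj in xs] for xi in xs])

def looks_positive_definite(kernel, xs, tol=1e-10):
    # Necessary condition only: this checks Definition 1.1 on the given
    # sample {x_1, ..., x_n}, not on every finite subset of X.
    eigenvalues = np.linalg.eigvalsh(gram_matrix(kernel, xs))
    return bool(np.all(eigenvalues >= -tol))
```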

  • Kernels

    Mercer Condition

    Theorem 1.2
    Let $X = [a, b]$ be a compact interval and let $k : [a, b] \times [a, b] \to \mathbb{C}$ be continuous. Then $k$ is positive definite if and only if

    $$\int_a^b \int_a^b c(x)\, \overline{c(y)}\, k(x, y)\, dx\, dy \geq 0 \quad (2)$$

    for each continuous function $c : X \to \mathbb{C}$.


  • Kernels

    Theorem 1.3 (Symmetric, positive definite functions are kernels)
    A function $k : X \times X \to \mathbb{R}$ is a kernel if and only if it is symmetric and positive definite.


  • Kernels

    Theorem 1.4
    Let $k_1, k_2, \dots$ be arbitrary positive definite kernels on $X \times X$, where $X$ is a nonempty set. Then:

    The set of positive definite kernels is a closed convex cone, that is:
    1. If $\lambda_1, \lambda_2 \geq 0$, then $\lambda_1 k_1 + \lambda_2 k_2$ is positive definite.
    2. If $k(x, x') := \lim_{n \to \infty} k_n(x, x')$ exists for all $x, x'$, then $k$ is positive definite.

    The product $k_1 \cdot k_2$ is a positive definite kernel.

    Assume that for $i = 1, 2$, $k_i$ is a positive definite kernel on $X_i \times X_i$, where $X_i$ is a nonempty set. Then the tensor product $k_1 \otimes k_2$ and the direct sum $k_1 \oplus k_2$ are positive definite kernels on $(X_1 \times X_2) \times (X_1 \times X_2)$.

    Suppose that $Y$ is a nonempty set and let $f : Y \to X$ be any function. Then $k(x, y) := k_1(f(x), f(y))$ is a positive definite kernel on $Y \times Y$.

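    To make the closure properties concrete, here is a quick numerical check (my own illustration, not from the slides): conic combinations and pointwise products of two positive definite Gram matrices stay positive semi-definite. The linear and RBF kernels used here are only demo choices.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))                               # five sample points in R^3

linear = lambda x, y: float(x @ y)                         # k1: linear kernel
rbf = lambda x, y: float(np.exp(-np.sum((x - y) ** 2)))    # k2: RBF kernel

K1 = np.array([[linear(x, y) for y in xs] for x in xs])
K2 = np.array([[rbf(x, y) for y in xs] for x in xs])

# The conic combination lambda1*k1 + lambda2*k2 and the product k1*k2
# (elementwise on Gram matrices) keep all eigenvalues >= 0.
for K in (2.0 * K1 + 3.0 * K2, K1 * K2):
    assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
```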

  • Kernel Families

  • Kernels Kernel Families

    Translation Invariant Kernels

    Definition 1.5
    A translation invariant kernel is given by

    $$K(x, y) = k(x - y) \quad (3)$$

    where $k$ is an even function on $\mathbb{R}^n$, i.e., $k(-x) = k(x)$ for all $x$ in $\mathbb{R}^n$.


  • Kernels Kernel Families

    Translation Invariant Kernels

    Definition 1.6
    A function $f : (0, \infty) \to \mathbb{R}$ is completely monotonic if it is $C^\infty$ and, for all $r > 0$ and $k \geq 0$,

    $$(-1)^k f^{(k)}(r) \geq 0 \quad (4)$$

    Here $f^{(k)}$ denotes the $k$-th derivative of $f$.

    Theorem 1.7
    Let $X \subseteq \mathbb{R}^n$, $f : (0, \infty) \to \mathbb{R}$ and $K : X \times X \to \mathbb{R}$ be defined by $K(x, y) = f(\|x - y\|^2)$. If $f$ is completely monotonic then $K$ is positive definite.


  • Kernels Kernel Families

    Translation Invariant Kernels

    Corollary 1.8
    Let $c \neq 0$. Then the following kernels, defined on a compact domain $X \subset \mathbb{R}^n$, are Mercer kernels.

    Gaussian Kernel, also known as Radial Basis Function (RBF) or Squared Exponential (SE) kernel:

    $$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \quad (5)$$

    Inverse Multiquadratic Kernel:

    $$k(x, y) = \left(c^2 + \|x - y\|^2\right)^{-\alpha}, \quad \alpha > 0 \quad (6)$$

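    A minimal Python sketch of these two kernels (my own illustration; the parameter names sigma, c, and alpha follow the formulas above, and the inputs are assumed to be NumPy arrays):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Eq. (5): k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def inverse_multiquadratic_kernel(x, y, c=1.0, alpha=0.5):
    # Eq. (6): k(x, y) = (c^2 + ||x - y||^2)^(-alpha), with c != 0 and alpha > 0
    return float((c ** 2 + np.sum((x - y) ** 2)) ** (-alpha))
```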

  • Kernels Kernel Families

    Polynomial Kernels

    $$k(x, x') = \left(\gamma \langle x, x'\rangle + c\right)^d, \quad \gamma > 0,\ c \geq 0,\ d \in \mathbb{N} \quad (7)$$

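    A corresponding one-line sketch (again my own, with gamma, c and d as in (7)):

```python
import numpy as np

def polynomial_kernel(x, y, gamma=1.0, c=1.0, d=3):
    # Eq. (7): k(x, y) = (gamma * <x, y> + c)^d
    return float((gamma * np.dot(x, y) + c) ** d)
```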

  • Kernels Kernel Families

    Non Mercer Kernels

    Example 1.9
    Let $k : X \times X \to \mathbb{R}$ be defined as

    $$k(x, x') = \begin{cases} 1 & \text{if } \|x - x'\| \leq 1 \\ 0 & \text{otherwise} \end{cases} \quad (8)$$

    Suppose that $k$ is a Mercer kernel and set $x_1 = 1$, $x_2 = 2$ and $x_3 = 3$. Then the matrix $K_{ij} = k(x_i, x_j)$ for $1 \leq i, j \leq 3$ is

    $$K = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix} \quad (9)$$

    The eigenvalues of $K$ are $\lambda_1 = (\sqrt{2} - 1)^{-1} = 1 + \sqrt{2} > 0$, $\lambda_2 = 1 > 0$ and $\lambda_3 = 1 - \sqrt{2} < 0$. This is a contradiction: if $k$ were a Mercer kernel, all eigenvalues of $K$ would have to be nonnegative. We conclude that $k$ is not a Mercer kernel.

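    A quick numerical confirmation (my own, not part of the slides) that the Gram matrix in (9) indeed has a negative eigenvalue:

```python
import numpy as np

def k(x, y):
    # Eq. (8): 1 if |x - y| <= 1, 0 otherwise.
    return 1.0 if abs(x - y) <= 1 else 0.0

xs = [1, 2, 3]
K = np.array([[k(xi, xj) for xj in xs] for xi in xs])   # the matrix in Eq. (9)
print(np.linalg.eigvalsh(K))   # [-0.414..., 1.0, 2.414...]; 1 - sqrt(2) < 0, so k is not Mercer
```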

  • Kernels Kernel Families

    References for Kernels

    [3] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer Science+Business Media, 1984.

    [9] Felipe Cucker and Ding-Xuan Zhou. Learning Theory. Cambridge University Press, 2007.

    [47] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.


  • Support Vector Machines

  • Applications SVM

    Support Vector Machines

    [Figure: Linear Support Vector Machine. The separating hyperplane $\langle w, x\rangle + b = 0$ with margin boundaries $\langle w, x\rangle + b = 1$ and $\langle w, x\rangle + b = -1$.]

  • Applications SVM

    Primal Problem

    Theorem 3.1
    The optimization problem for the maximum margin classifier is

    $$\begin{aligned} \min_{w, b} \quad & \frac{1}{2}\|w\|^2 \\ \text{s.t.} \quad & y_i(\langle w, x_i\rangle + b) \geq 1, \quad \forall i,\ 1 \leq i \leq m \end{aligned} \quad (10)$$


  • Applications SVM

    Theorem 3.2
    Let $F$ be the function defined as

    $$F : \mathbb{R}^m \to \mathbb{R}_{+}, \quad w \mapsto F(w) = \frac{1}{2}\|w\|^2$$

    Then the following statements hold:

    1. $F$ is infinitely differentiable.
    2. The gradient of $F$ is $\nabla F(w) = w$.
    3. The Hessian of $F$ is $\nabla^2 F(w) = I_{m \times m}$.
    4. The Hessian $\nabla^2 F(w)$ is positive definite, hence $F$ is strictly convex.


  • Applications SVM

    Theorem 3.3 (The dual problem)
    The dual optimization problem of (10) is:

    $$\begin{aligned} \max_{\alpha} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \\ \text{s.t.} \quad & \alpha_i \geq 0, \quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \forall i,\ 1 \leq i \leq m \end{aligned} \quad (11)$$

    where $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_m)$ and the solution of this dual problem will be denoted by $\alpha^* = (\alpha_1^*, \alpha_2^*, \dots, \alpha_m^*)$.

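    A minimal sketch of solving (11) numerically with a general-purpose solver (my own illustration; real implementations use dedicated QP or SMO solvers, and the SLSQP call below is only adequate for small toy problems):

```python
import numpy as np
from scipy.optimize import minimize

def solve_hard_margin_dual(X, y, kernel=np.dot):
    # X: (m, n) array of training points; y: (m,) array of labels in {-1, +1}.
    m = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    Q = (y[:, None] * y[None, :]) * K            # Q_ij = y_i y_j <x_i, x_j>

    # Minimize the negated dual objective of (11).
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()
    gradient = lambda a: Q @ a - np.ones(m)

    result = minimize(objective, np.zeros(m), jac=gradient, method="SLSQP",
                      bounds=[(0.0, None)] * m,
                      constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    return result.x                              # the multipliers alpha*
```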

  • Applications SVM

    Proof.
    The Lagrangian of the problem is

    $$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[y_i(\langle w, x_i\rangle + b) - 1\right] \quad (12)$$

    Because the KKT conditions hold ($F$ is continuous and differentiable and the constraints are also continuous and differentiable), we can add the complementary conditions.

    Stationarity:

    $$\nabla_w L = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \implies w = \sum_{i=1}^{m} \alpha_i y_i x_i \quad (13)$$

    $$\nabla_b L = -\sum_{i=1}^{m} \alpha_i y_i = 0 \implies \sum_{i=1}^{m} \alpha_i y_i = 0 \quad (14)$$


  • Applications SVM

    Primal feasibility:

    $$y_i(\langle w, x_i\rangle + b) \geq 1, \quad \forall i \in [1, m] \quad (15)$$

    Dual feasibility:

    $$\alpha_i \geq 0, \quad \forall i \in [1, m] \quad (16)$$

    Complementary slackness:

    $$\alpha_i\left[y_i(\langle w, x_i\rangle + b) - 1\right] = 0 \implies \alpha_i = 0 \ \lor\ y_i(\langle w, x_i\rangle + b) = 1, \quad \forall i \in [1, m] \quad (17)$$

    Substituting (13) and (14) back into the Lagrangian:

    $$L(w, b, \alpha) = \frac{1}{2}\Big\|\sum_{i=1}^{m} \alpha_i y_i x_i\Big\|^2 - \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle - \underbrace{b \sum_{i=1}^{m} \alpha_i y_i}_{=0} + \sum_{i=1}^{m} \alpha_i \quad (18)$$

    and since $\frac{1}{2}\big\|\sum_i \alpha_i y_i x_i\big\|^2 = \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle$, then

    $$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \quad (19)$$

  • Applications SVM

    Theorem 3.4
    Let $G$ be the function defined as

    $$G : \mathbb{R}^m \to \mathbb{R}, \quad \alpha \mapsto G(\alpha) = \langle \mathbf{1}_m, \alpha\rangle - \frac{1}{2}\alpha^{t} A\, \alpha$$

    where $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_m)$, $\mathbf{1}_m$ is the all-ones vector, and $A = [y_i y_j \langle x_i, x_j\rangle]_{1 \leq i, j \leq m} \in \mathbb{R}^{m \times m}$. Then the following statements hold:

    1. The matrix $A$ is symmetric.
    2. The function $G$ is differentiable and $\dfrac{\partial G(\alpha)}{\partial \alpha} = \mathbf{1}_m - A\alpha$.
    3. The function $G$ is twice differentiable and $\dfrac{\partial^2 G(\alpha)}{\partial \alpha^2} = -A$.
    4. The function $G$ is a concave function.


  • Applications SVM

    Linear Support Vector Machines

    We will call Linear Support Vector Machine the decision function defined by

    $$f(x) = \operatorname{sign}\left(\langle w, x\rangle + b\right) = \operatorname{sign}\left(\sum_{i=1}^{m} \alpha_i^{*} y_i \langle x_i, x\rangle + b\right) \quad (20)$$

    where $m$ is the number of training points and $\alpha_i^{*}$ are the Lagrange multipliers of the dual problem (11).

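    As a sketch of how (20) is used in practice (my own illustration, assuming the multipliers alpha were obtained from the dual, e.g. with the toy solver sketched after Theorem 3.3): $w$ follows from (13) and $b$ from complementary slackness (17) at any support vector.

```python
import numpy as np

def recover_primal(X, y, alpha, tol=1e-8):
    # Eq. (13): w = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X
    # Complementary slackness (17): points with alpha_i > 0 lie exactly on the
    # margin, so b = y_i - <w, x_i> for any such support vector.
    sv = int(np.argmax(alpha > tol))
    b = y[sv] - w @ X[sv]
    return w, b

def predict(X_new, w, b):
    # Eq. (20): f(x) = sign(<w, x> + b)
    return np.sign(X_new @ w + b)
```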

  • Applications Non Linear SVM

    Non Linear Support Vector Machines

    We will call Non-Linear Support Vector Machine the decision function defined by

    $$f(x) = \operatorname{sign}\left(\langle w, \Phi(x)\rangle + b\right) = \operatorname{sign}\left(\sum_{i=1}^{m} \alpha_i^{*} y_i \langle \Phi(x_i), \Phi(x)\rangle + b\right) \quad (21)$$

    where $m$ is the number of training points and $\alpha_i^{*}$ are the Lagrange multipliers of the dual problem (11).


  • Applications Non Linear SVM

    Applying the Kernel Trick

    Using the kernel trick we can replace $\langle \Phi(x_i), \Phi(x)\rangle$ by a kernel $k(x_i, x)$:

    $$f(x) = \operatorname{sign}\left(\sum_{i=1}^{m} \alpha_i^{*} y_i\, k(x_i, x) + b\right) \quad (22)$$

    where $m$ is the number of training points and $\alpha_i^{*}$ are the Lagrange multipliers of the dual problem (11).

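    In practice the whole pipeline of (11) and (22) is available in standard libraries; a short usage sketch with scikit-learn (my own example, not part of the slides; the dataset and hyperparameters are arbitrary):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# An RBF-kernel SVM applies (22) without ever computing the feature map Phi.
clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)

print(clf.score(X, y))        # training accuracy
print(len(clf.support_))      # number of support vectors (alpha_i* > 0)
```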

  • Applications Non Linear SVM

    References for Support Vector Machines

    [31] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

