Kernel Methods: Support Vector Machines - Maximum Margin Classifiers and Support Vector Machines

Transcript
Page 1

Kernel Methods: Support Vector Machines

• Maximum Margin Classifiers and Support Vector Machines

Page 2

Generalized Linear Discriminant Functions

A linear discriminant function g(x) can be written as:

g(x) = w0 + Σi wi xi,   i = 1, …, d   (d is the number of features).

We could add additional terms to obtain a quadratic function:

g(x) = w0 + Σi wi xi + Σi Σj wij xi xj

The quadratic discriminant function introduces d(d+1)/2 additional coefficients corresponding to the products of pairs of attributes (including the squared terms). The decision surfaces are thus more complicated (hyperquadric surfaces).
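
As a small illustration (not from the slides; the weights below are made up), the quadratic discriminant can be evaluated as an ordinary linear function of an expanded feature vector:

```python
import numpy as np

# A quadratic discriminant written as a linear function of expanded features (d = 2).
def quadratic_features(x):
    """Map x = (x1, x2) to (1, x1, x2, x1*x1, x1*x2, x2*x2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

# Hypothetical coefficients: w0, the linear wi, and the quadratic wij.
w = np.array([-1.0, 0.5, -0.2, 1.0, 0.3, 1.0])

def g(x):
    # g(x) = w0 + sum_i wi xi + sum_i sum_j wij xi xj, computed as one dot product
    return np.dot(w, quadratic_features(x))

print(g(np.array([0.7, -1.2])))   # the sign of g(x) gives the predicted class
```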

Page 3

Generalized Linear Discriminant Functions

We could even add more terms wijk xi xj xk and obtain the class of polynomial discriminant functions. The generalized form is

g(x) = Σi ai φi(x)

g(x) = a^t φ(x)

where the summation goes over all functions φi(x). The φi(x) are called the phi (φ) functions. The discriminant is now linear in the φi(x).

The φ functions map the d-dimensional x-space into a d'-dimensional y-space. Example: g(x) = a1 + a2 x + a3 x^2, with φ = (1, x, x^2)^t.
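
A minimal sketch of the slide's example, with hypothetical coefficients a1, a2, a3:

```python
import numpy as np

# The slide's example: g(x) = a1 + a2*x + a3*x^2 is linear in phi(x) = (1, x, x^2).
a = np.array([2.0, -1.0, 0.5])          # hypothetical coefficients a1, a2, a3

def phi(x):
    return np.array([1.0, x, x ** 2])   # maps the 1-D x-space into a 3-D y-space

def g(x):
    return a @ phi(x)                   # g(x) = a^t phi(x)

print(g(3.0))                           # 2 - 3 + 4.5 = 3.5
```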

Page 4

Figure 5.5

Page 5

Historical Background

Vladimir Vapnik:

Publications: 6 books and over a hundred research papers.

Developed a theory for expected risk minimization.

Invented Support Vector Machines

Page 6

Historical Background

Alexey Chervonenkis

Together with Vladimir Vapnik, he developed the concept of the Vapnik-Chervonenkis dimension.

Page 7

Support Vector Machines

What are support vector machines (SVMs)?

A very popular classifier based on the previously discussed concepts of linear discriminants and the new concept of margins.

To begin, SVMs preprocess the data by representing all examples in a higher-dimensional space. With a sufficiently high-dimensional representation, the classes can be separated by a hyperplane.
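
A tiny illustrative sketch of this idea, assuming made-up 1-D data and scikit-learn as the tooling: the classes cannot be split by a threshold on the original axis, but after mapping x to (x, x^2) a hyperplane separates them.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data that no single threshold separates: the positive class sits in the middle.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.6, 3.1])
t = np.array([-1, -1, 1, 1, 1, -1, -1])

# Represent each example in a higher-dimensional space: x -> (x, x^2).
X_mapped = np.column_stack([x, x ** 2])

# In the mapped space a hyperplane (here a line) separates the two classes.
clf = SVC(kernel="linear", C=1e6).fit(X_mapped, t)
print(clf.predict(X_mapped))   # matches t: the mapped data is linearly separable
```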

Page 8

The Margin

Page 9

The Goal in Support Vector Machines

Now, let t be 1 or -1 depending on whether the example x belongs to the positive or the negative class. A separating hyperplane ensures that: t g(x) >= 0

The goal in support vector machines is to find the separating hyperplane with the “largest” margin, where the margin is the distance between the hyperplane and the example closest to it.

Page 10

The Support Vectors

Now, the distance from a pattern x to the hyperplane is |g(x)| / ||w||. So let's change our objective to finding a vector w that maximizes the margin m in:

t g(x) / ||w|| >= m

We can fix the scale so that the support vectors are those patterns x for which t g(x) = 1, because we can rescale the w vector (and w0) and leave the hyperplane in the same place. The support vectors are the patterns equally close to the hyperplane, at distance 1/||w||.

These are the patterns that are most difficult to separate. These are the most “informative” patterns.
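
A short sketch of this distance computation with a hypothetical hyperplane and patterns; the pattern(s) attaining the smallest distance are the ones that would act as support vectors.

```python
import numpy as np

# Distance from a pattern x to the hyperplane g(x) = w . x + w0 is |g(x)| / ||w||.
w = np.array([1.0, 2.0])      # hypothetical hyperplane parameters
w0 = -1.0
X = np.array([[1.0, 1.0], [3.0, 0.5], [0.0, -1.0], [-2.0, 0.0]])

g = X @ w + w0
distances = np.abs(g) / np.linalg.norm(w)
print(distances)

# The margin is the smallest of these distances; the patterns attaining it
# are the ones that would act as support vectors.
print("margin:", distances.min())
print("closest pattern(s):", X[np.isclose(distances, distances.min())])
```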

Page 11

The Support Vectors

We said we want to find a vector w that maximizes the margin 1/||w|| subject to:

t g(x) >= 1

This means all we really need to do is maximize ||w||^-1 (equivalently, minimize ||w||) under certain constraints.

So we have the following optimization problem:

arg min_w ½ ||w||^2   subject to   t g(x) >= 1 for every training example x

This can be solved using Lagrange multipliers.
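
A minimal sketch of this optimization problem on a made-up toy set, using scipy's general-purpose SLSQP routine in place of the dedicated quadratic-programming solver an SVM implementation would normally use:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-0.5, 1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

def objective(p):
    w = p[:d]
    return 0.5 * np.dot(w, w)          # (1/2)||w||^2 ; the bias w0 is not penalized

# One margin constraint per training example: t_i (w . x_i + w0) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda p, x=x, ti=ti: ti * (np.dot(p[:d], x) + p[d]) - 1.0}
    for x, ti in zip(X, t)
]

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP", constraints=constraints)
w, w0 = res.x[:d], res.x[d]
print("w =", w, " w0 =", w0)
print("t * g(x):", t * (X @ w + w0))               # support vectors sit near 1
print("geometric margin 1/||w|| =", 1.0 / np.linalg.norm(w))
```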

Page 12

The Support Vectors

What happens when there are unavoidable errors?

arg min_w ½ ||w||^2 + λ Σi ei

subject to ti g(xi) >= 1 - ei and ei >= 0,

where ei is the error (margin violation) incurred by example xi.

The ei are known as slack variables.
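
A hedged sketch of the soft-margin case using scikit-learn (an assumption about tooling; its parameter C plays the role of λ above), on made-up overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with some overlap between the classes (made up for illustration).
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.3, 0.6],    # positives; the last one strays
              [0.0, 0.0], [-0.5, 1.0], [1.9, 2.2]])  # negatives; the last one strays
t = np.array([1, 1, 1, -1, -1, -1])

# Soft-margin linear SVM; the parameter C weights the slack term (the lambda above).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, t)

print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_, " b =", clf.intercept_)
print("t * g(x):", t * clf.decision_function(X))   # values below 1 use nonzero slack
```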

Page 13

The Support Vectors

We can write this in a dual form (the Karush-Kuhn-Tucker construction):

max_α  Σi αi − ½ Σi Σj αi αj ti tj (xi · xj)

subject to 0 <= αi <= λ and Σi αi ti = 0

Page 14

The Support Vectors

The final result is a set of multipliers αi, one for each training example. The optimal hyperplane can be expressed in the dual representation as:

f(x) = Σi ti αi (xi · x) + b

where w = Σi ti αi xi
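
A minimal sketch that solves this dual on made-up toy data with scipy's SLSQP (again standing in for a proper QP solver) and then recovers w and b from the support vectors:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up separable toy data (same style as before).
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-0.5, 1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
lam = 1e6                                           # upper bound on alpha (≈ hard margin)

G = (t[:, None] * X) @ (t[:, None] * X).T           # G[i, j] = ti tj (xi . xj)

def neg_dual(alpha):                                # maximize the dual = minimize its negative
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

cons = [{"type": "eq", "fun": lambda a: a @ t}]     # sum_i alpha_i t_i = 0
bnds = [(0.0, lam)] * len(t)                        # 0 <= alpha_i <= lambda
alpha = minimize(neg_dual, np.zeros(len(t)), method="SLSQP",
                 bounds=bnds, constraints=cons).x

w = ((alpha * t)[:, None] * X).sum(axis=0)          # w = sum_i t_i alpha_i x_i
sv = alpha > 1e-6                                   # support vectors have alpha_i > 0
b = np.mean(t[sv] - X[sv] @ w)                      # from t_i (w . x_i + b) = 1 on the SVs
print("alpha =", alpha.round(3))
print("w =", w, " b =", b)
print("f(x) on the training set:", X @ w + b)
```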

Page 15

The Support Vectors

We can use kernel functions to map from the original space to a new space:

max_α  Σi αi − ½ Σi Σj αi αj ti tj (φ(xi) · φ(xj))

subject to 0 <= αi <= λ and Σi αi ti = 0

Page 16

The Support Vectors

Computing the dot product is simplified. For the degree-2 polynomial mapping:

φ(xi) · φ(xj) = 1 + 2 Σk xik xjk + Σk Σl xik xil xjk xjl

But fortunately that is equal to:

(1 + xi · xj)^2 = K(xi, xj)

In general, all we need is to compute the dot product of every pair of examples in the original space. This results in the Gram matrix K.
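
A short numerical check of this identity, applying the slide's degree-2 expansion to made-up points: the kernel value matches the dot product in the mapped space, and the Gram matrix K is built without ever forming φ(x).

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input (the expansion used on this slide)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

def poly_kernel(xi, xj):
    return (1.0 + np.dot(xi, xj)) ** 2

X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])   # made-up points

# The kernel value equals the dot product in the mapped space (up to rounding).
for xi in X:
    for xj in X:
        assert np.isclose(np.dot(phi(xi), phi(xj)), poly_kernel(xi, xj))

# Gram matrix K of all pairwise kernel values, computed without ever forming phi(x).
K = (1.0 + X @ X.T) ** 2
print(K)
```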

Page 17

The Support Vectors

The final formulation is as follows:

max_α  Σi αi − ½ Σi Σj αi αj ti tj K(xi, xj)

subject to 0 <= αi <= λ and Σi αi ti = 0
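
A sketch of the "only the Gram matrix is needed" point using scikit-learn's precomputed-kernel interface on made-up data (the tooling is an assumption, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data; only the Gram matrix K enters the final formulation.
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]])
t = np.array([1, 1, -1, -1])

K = (1.0 + X @ X.T) ** 2                   # degree-2 polynomial Gram matrix

clf = SVC(kernel="precomputed", C=1.0)     # the solver sees only K, never phi(x)
clf.fit(K, t)

# To classify a new point, supply its kernel values against the training examples.
x_new = np.array([[0.5, 0.8]])
K_new = (1.0 + x_new @ X.T) ** 2
print(clf.predict(K_new))
```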

Page 18

An Example

The XOR problem is known to be not linearly separable:

[Figure: the four XOR patterns at (x1, x2) = (±1, ±1); no line in the (x1, x2) plane separates the two classes.]

We use the phi functions (1, √2 x1, √2 x2, √2 x1x2, x1^2, x2^2), hidden in the kernel function (√2 ≈ 1.41).

Page 19

An Example

The optimal hyperplane is found to be g(x1, x2) = x1 x2 = 0. The margin is ρ = √2 ≈ 1.41.

[Figure: the patterns plotted in the (√2 x1, √2 x1x2) subspace; the optimal hyperplane g = 0 lies along the √2 x1 axis, with margin b = 1.41.]
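
A hedged sketch reproducing this example with scikit-learn's degree-2 polynomial kernel; which XOR class is labeled positive is an assumption made here:

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR patterns; labeling (1,1) and (-1,-1) as the positive class is an
# assumption here, so the learned function comes out proportional to +x1*x2.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
t = np.array([1, 1, -1, -1])

# Degree-2 polynomial kernel K(xi, xj) = (1 + xi . xj)^2, i.e. the phi mapping above.
clf = SVC(kernel="poly", degree=2, coef0=1, gamma=1.0, C=1e6)   # large C ≈ hard margin
clf.fit(X, t)

print(clf.predict(X))              # all four patterns classified correctly
print(clf.support_vectors_)        # all four patterns end up as support vectors
print(clf.decision_function(X))    # approximately x1*x2 at the training points
```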

Page 20

Benefits of SVMs

Benefits:

The complexity of the classifier depends on the number of support vectors rather than on the dimensionality of the feature space.

This makes the algorithm less prone to overfitting.