
Support Vector Machines

Content

Introduction

The VC Dimension & Structural Risk Minimization

Linear SVM: The Separable Case

Linear SVM: The Non-Separable Case

Lagrange Multipliers

Support Vector Machines

Introduction

Learning Machines

A machine to learn the mapping

$$\mathbf{x}_i \mapsto y_i$$

defined as

$$f(\mathbf{x}, \alpha)$$

Learning by adjusting this parameter $\alpha$?

Generalization vs. Learning

How does a machine learn?

– By adjusting its parameters so as to partition the pattern (feature) space for classification.

– How to adjust? Minimize the empirical risk (traditional approaches).

What has the machine learned?

– Does it memorize the patterns it sees, or

– memorize the rules it finds for different classes?

– What does the machine actually learn if it minimizes the empirical risk only?

Risks

Expected Risk (test error):

$$R(\alpha) = \int \tfrac{1}{2}\,\left| y - f(\mathbf{x}, \alpha) \right| \, dP(\mathbf{x}, y)$$

Empirical Risk (training error):

$$R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} \left| y_i - f(\mathbf{x}_i, \alpha) \right|$$

Is $R_{emp}(\alpha) \approx R(\alpha)$?
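As a sanity check in code, the empirical risk for $\pm 1$ labels is just the training error rate; a minimal numpy sketch (the function name and the sample arrays are my own illustration, not from the slides):

```python
import numpy as np

def empirical_risk(y, f_x):
    """R_emp = (1 / 2l) * sum_i |y_i - f(x_i, alpha)|, labels in {-1, +1}."""
    # |y - f| / 2 is 0 on a correct prediction and 1 on an error.
    return np.mean(np.abs(y - f_x) / 2)

print(empirical_risk(np.array([1, -1, 1, 1]), np.array([1, -1, -1, 1])))  # 0.25
```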


More on Empirical Risk

How can we make the empirical risk arbitrarily small? – Let the machine have a very large memorization capacity.

Does a machine with a small empirical risk also achieve a small expected risk?

How do we keep the machine from merely memorizing the training patterns instead of generalizing?

How do we control the memorization capacity of a machine?

What should the new criterion be?

Structural Risk Minimization

Goal: Learn both the right 'structure' and the right 'rules' for classification.

Right structure:

E.g., the right number and the right forms of the components or parameters that participate in the learning machine.

Right rules:

The empirical risk will also be reduced if the right rules are learned.

New Criterion

Total Risk = Empirical Risk +

Risk due to the structure of the learning machine

Support Vector Machines

The VC Dimension &

Structural Risk Minimization

The VC Dimension

Consider a set of functions $f(\mathbf{x}, \alpha) \in \{-1, +1\}$.

A given set of $l$ points can be labeled in $2^l$ ways. If a member of the set $\{f(\alpha)\}$ can be found which correctly assigns the labels for every labeling, then the set of points is shattered by that set of functions.

The VC dimension of $\{f(\alpha)\}$ is the maximum number of training points that can be shattered by $\{f(\alpha)\}$.

VC: Vapnik–Chervonenkis

The VC Dimension for Oriented Lines in $\mathbb{R}^2$

VC dimension = 3


More on VC Dimension

In general, the VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is $n + 1$.

VC dimension is a measure of memorization capability.

VC dimension is not directly related to number of parameters. Vapnik (1995) has an example with 1 parameter and infinite VC dimension.

Bound on Expected Risk

Expected Risk: $R(\alpha) = \int \tfrac{1}{2}\,|y - f(\mathbf{x}, \alpha)|\, dP(\mathbf{x}, y)$

Empirical Risk: $R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(\mathbf{x}_i, \alpha)|$

With probability $1 - \eta$,

$$R(\alpha) \le R_{emp}(\alpha) + \underbrace{\sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}}_{\text{VC Confidence}}$$

Bound on Expected Risk

Consider small $\eta$ (e.g., $\eta = 0.05$). Then, with high probability,

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}$$

Traditional approaches minimize the empirical risk only.

Structural risk minimization wants to minimize the bound: empirical risk plus VC confidence.

VC Confidence

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}$$

The VC confidence term grows with the VC dimension $h$ (e.g., for $\eta = 0.05$ and $l = 10{,}000$).

Amongst machines with zero empirical risk, choose the one with the smallest VC dimension.

How to evaluate the VC dimension?
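To get a feel for the VC confidence term, here is a small sketch evaluating it for the slide's setting ($\eta = 0.05$, $l = 10{,}000$); the function name is an assumption of mine:

```python
import math

def vc_confidence(h, l, eta=0.05):
    """Square-root term of the bound; holds with probability 1 - eta."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

# The term grows with the VC dimension h, so among machines with zero
# empirical risk the bound favors the smallest h.
for h in (10, 100, 1000, 5000):
    print(h, round(vc_confidence(h, 10_000), 3))
```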

Structural Risk Minimization

Nested subsets of functions with different VC dimensions:

$$h_1 \le h_2 \le h_3 \le h_4$$


Support Vector Machines

The Linear SVM

The Separable Case

The Linear Separability

Linearly separable vs. not linearly separable.

Linearly separable: there exist $\mathbf{w}$ and $b$ such that

$$\mathbf{w}^T\mathbf{x} + b > 0 \quad \text{for one class}, \qquad \mathbf{w}^T\mathbf{x} + b < 0 \quad \text{for the other}$$

Rescaling $\mathbf{w}$ and $b$ so that the closest points satisfy the equalities:

$$\mathbf{w}^T\mathbf{x} + b \ge +1, \qquad \mathbf{w}^T\mathbf{x} + b \le -1$$

Linearly Separable

$$\mathbf{w}^T\mathbf{x}_i + b \ge +1 \quad \text{for } y_i = +1$$

$$\mathbf{w}^T\mathbf{x}_i + b \le -1 \quad \text{for } y_i = -1$$

Combined:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$$

Margin Width

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$$

The margin hyperplanes $\mathbf{w}^T\mathbf{x} + b = +1$ and $\mathbf{w}^T\mathbf{x} + b = -1$ lie at distances $\frac{|1 - b|}{\|\mathbf{w}\|}$ and $\frac{|-1 - b|}{\|\mathbf{w}\|}$ from the origin $O$, so the margin width is

$$d = \frac{2}{\|\mathbf{w}\|}$$

How about maximizing the margin?

What is the relation between the margin width and the VC dimension?
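A quick numeric check of the margin formula, with toy values of $\mathbf{w}$ and $b$ chosen by me:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), 2.0            # ||w|| = 5

# Signed distances of w'x + b = +1 and w'x + b = -1 from the origin.
d_plus = (1 - b) / np.linalg.norm(w)
d_minus = (-1 - b) / np.linalg.norm(w)

print(d_plus - d_minus)                     # 0.4: gap between the two planes
print(2 / np.linalg.norm(w))                # 0.4: matches 2 / ||w||
```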

Maximum Margin Classifier

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i, \qquad d = \frac{2}{\|\mathbf{w}\|}$$

Maximizing the margin $2/\|\mathbf{w}\|$ amounts to minimizing $\|\mathbf{w}\|$.

Supporters: the patterns that lie exactly on the margin hyperplanes $\mathbf{w}^T\mathbf{x} + b = \pm 1$.

Building SVM

Minimize $\frac{1}{2}\|\mathbf{w}\|^2$

Subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$

This requires knowledge of the method of Lagrange multipliers.


The Method of Lagrange

Minimize $\frac{1}{2}\|\mathbf{w}\|^2$

Subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$

The Lagrangian:

$$L(\mathbf{w}, b; \Lambda) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i\left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right], \qquad \alpha_i \ge 0$$

Minimize it w.r.t. $\mathbf{w}$ and $b$, while maximizing it w.r.t. $\Lambda$.

What value should $\alpha_i$ take if its constraint is feasible and nonzero? How about if it is zero?

Expanding the Lagrangian:

$$L(\mathbf{w}, b; \Lambda) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i y_i(\mathbf{w}^T\mathbf{x}_i + b) + \sum_{i=1}^{l} \alpha_i$$

Duality

The Primal: Minimize $\frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$.

The Dual: Maximize $L(\mathbf{w}^*, b^*; \Lambda)$ subject to

$$\nabla_{\mathbf{w}, b}\, L(\mathbf{w}, b; \Lambda) = \mathbf{0}, \qquad \alpha_i \ge 0, \quad i = 1, \dots, l$$

where

$$L(\mathbf{w}, b; \Lambda) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i y_i(\mathbf{w}^T\mathbf{x}_i + b) + \sum_{i=1}^{l} \alpha_i$$

Setting the derivatives to zero:

$$\nabla_{\mathbf{w}}\, L(\mathbf{w}, b; \Lambda) = \mathbf{w} - \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i = \mathbf{0} \;\Rightarrow\; \mathbf{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i \mathbf{x}_i$$

$$\frac{\partial}{\partial b}\, L(\mathbf{w}, b; \Lambda) = -\sum_{i=1}^{l} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0$$

Substituting $\mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$ into the Lagrangian:

$$L(\mathbf{w}^*, b^*; \Lambda) = \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j - \sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j - b\sum_{i=1}^{l}\alpha_i y_i + \sum_{i=1}^{l}\alpha_i$$

$$= \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j\, \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

Maximize this as $F(\Lambda)$.


In matrix form, with $D_{ij} = y_i y_j\, \langle \mathbf{x}_i, \mathbf{x}_j \rangle$, the dual becomes

$$\text{Maximize} \quad F(\Lambda) = \Lambda^T\mathbf{1} - \frac{1}{2}\Lambda^T D\Lambda \qquad \text{subject to} \quad \Lambda^T\mathbf{y} = 0$$

Duality

The Primal: Minimize $\frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$.

The Dual: Maximize $F(\Lambda) = \Lambda^T\mathbf{1} - \frac{1}{2}\Lambda^T D\Lambda$ subject to $\Lambda \ge \mathbf{0}$ and $\Lambda^T\mathbf{y} = 0$.
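As a concrete instance of this QP, here is a minimal sketch using the cvxopt solver (assuming numpy and cvxopt are installed, X of shape (l, n), y a float array of ±1; cvxopt minimizes, so the sign of $F(\Lambda)$ is flipped):

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_hard_margin_dual(X, y):
    """Maximize F(L) = L'1 - (1/2)L'DL  subject to  L >= 0, L'y = 0."""
    l = X.shape[0]
    Z = y[:, None] * X                        # row i is y_i * x_i
    D = Z @ Z.T                               # D_ij = y_i y_j <x_i, x_j>
    P, q = matrix(D), matrix(-np.ones(l))     # qp minimizes (1/2)L'PL + q'L
    G, h = matrix(-np.eye(l)), matrix(np.zeros(l))              # -alpha <= 0
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)  # y'alpha = 0
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                 # alpha*
```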

The Solution

Find $\Lambda^*$ by Quadratic Programming:

Maximize $F(\Lambda) = \Lambda^T\mathbf{1} - \frac{1}{2}\Lambda^T D\Lambda$ subject to $\Lambda \ge \mathbf{0}$ and $\Lambda^T\mathbf{y} = 0$.

Then

$$\mathbf{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i \mathbf{x}_i$$

How about $b^*$? From the Lagrangian, any $i$ with $\alpha_i^* > 0$ has an active constraint, which gives

$$b^* = y_i - \mathbf{w}^{*T}\mathbf{x}_i, \qquad \alpha_i^* > 0$$

Call $\mathbf{x}_i$ a support vector if $\alpha_i^* > 0$.
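Continuing the sketch above, $\mathbf{w}^*$ and $b^*$ follow directly from $\Lambda^*$ (the tolerance for "$\alpha_i > 0$" is my assumption, to absorb solver noise; averaging $b^*$ over all supporters is a common stabilization):

```python
import numpy as np

def recover_hyperplane(alpha, X, y, tol=1e-6):
    """w* = sum_i alpha_i y_i x_i;  b* = y_i - w*'x_i for any alpha_i > 0."""
    w = ((alpha * y)[:, None] * X).sum(axis=0)
    sv = alpha > tol                          # the support vectors
    b = np.mean(y[sv] - X[sv] @ w)            # average over the supporters
    return w, b
```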

The Karush-Kuhn-Tucker Conditions

$$\nabla_{\mathbf{w}}\, L(\mathbf{w}, b; \Lambda) = \mathbf{w} - \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i = \mathbf{0}$$

$$\frac{\partial}{\partial b}\, L(\mathbf{w}, b; \Lambda) = -\sum_{i=1}^{l} \alpha_i y_i = 0$$

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0, \quad i = 1, \dots, l$$

$$\alpha_i \ge 0, \quad i = 1, \dots, l$$

$$\alpha_i\left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right] = 0, \quad i = 1, \dots, l \quad \text{(complementary slackness)}$$

Classification

$$f(\mathbf{x}) = \operatorname{sgn}\left(\mathbf{w}^{*T}\mathbf{x} + b^*\right) = \operatorname{sgn}\left(\sum_{i=1}^{l} \alpha_i^* y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b^*\right) = \operatorname{sgn}\left(\sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b^*\right)$$


Classification Using Supporters

$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b^* \right)$$

$\langle \mathbf{x}_i, \mathbf{x} \rangle$: the similarity measure between the input and the $i$th support vector.

$\alpha_i^* y_i$: the weight of the $i$th support vector.

$b^*$: the bias.
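In this form, prediction touches only the supporters; a sketch reusing the names from the blocks above:

```python
import numpy as np

def predict(x, alpha, X, y, b, tol=1e-6):
    """f(x) = sgn( sum_{alpha_i > 0} alpha_i y_i <x_i, x> + b )."""
    sv = alpha > tol                          # only supporters contribute
    return np.sign(np.sum(alpha[sv] * y[sv] * (X[sv] @ x)) + b)
```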

Demonstration

Support Vector Machines

The Linear SVM

The Non-Separable Case

The Non-Separable Case

We require that

$$\mathbf{w}^T\mathbf{x}_i + b \ge +1 - \xi_i \quad \text{for } y_i = +1$$

$$\mathbf{w}^T\mathbf{x}_i + b \le -1 + \xi_i \quad \text{for } y_i = -1$$

i.e.,

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0, \qquad \xi_i \ge 0 \quad \forall i$$

Mathematical Formulation

Minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i} \xi_i^k$

Subject to

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0, \qquad \xi_i \ge 0 \quad \forall i$$

For simplicity, we consider $k = 1$.

The Lagrangian

Minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

Subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0$ and $\xi_i \ge 0 \quad \forall i$

$$L(\mathbf{w}, b, \Xi; \Lambda, M) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \right] - \sum_i \mu_i \xi_i$$

with $\alpha_i \ge 0$ and $\mu_i \ge 0$.


Duality

The Primal: Minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0$ and $\xi_i \ge 0 \quad \forall i$.

The Dual: Maximize $L(\mathbf{w}^*, b^*, \Xi^*; \Lambda, M)$ subject to

$$\nabla_{\mathbf{w}, b, \Xi}\, L(\mathbf{w}, b, \Xi; \Lambda, M) = \mathbf{0}, \qquad \Lambda \ge \mathbf{0}, \quad M \ge \mathbf{0}$$

Setting the derivatives to zero:

$$\nabla_{\mathbf{w}}\, L = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = \mathbf{0} \;\Rightarrow\; \mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i$$

$$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$$

$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \;\Rightarrow\; \alpha_i + \mu_i = C \;\Rightarrow\; 0 \le \alpha_i \le C$$

Substituting back, the dual objective has the same form as in the separable case:

$$L(\mathbf{w}^*, b^*, \Xi^*; \Lambda, M) = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j\, \langle \mathbf{x}_i, \mathbf{x}_j \rangle = F(\Lambda)$$

Maximize this. In matrix form:

$$F(\Lambda) = \Lambda^T\mathbf{1} - \frac{1}{2}\Lambda^T D\Lambda, \qquad \Lambda^T\mathbf{y} = 0, \qquad \mathbf{0} \le \Lambda \le C\mathbf{1}$$

Duality

The Primal: Minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i^k$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0$ and $\xi_i \ge 0 \quad \forall i$.

The Dual: Maximize $F(\Lambda) = \Lambda^T\mathbf{1} - \frac{1}{2}\Lambda^T D\Lambda$ subject to $\mathbf{0} \le \Lambda \le C\mathbf{1}$ and $\Lambda^T\mathbf{y} = 0$.
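Relative to the hard-margin QP sketch earlier, only the inequality block changes: $\mathbf{0} \le \Lambda$ becomes $\mathbf{0} \le \Lambda \le C\mathbf{1}$. A sketch of just that change (the helper name is mine):

```python
import numpy as np
from cvxopt import matrix

def box_constraints(l, C):
    """G, h encoding 0 <= alpha_i <= C as G @ alpha <= h."""
    G = matrix(np.vstack([-np.eye(l),         # -alpha_i <= 0
                          np.eye(l)]))        #  alpha_i <= C
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    return G, h
```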

The Karush-Kuhn-Tucker Conditions

$$\nabla_{\mathbf{w}}\, L = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = \mathbf{0}, \qquad \frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$$

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0, \qquad \xi_i \ge 0, \qquad \alpha_i \ge 0, \qquad \mu_i \ge 0$$

$$\alpha_i\left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \right] = 0, \qquad \mu_i \xi_i = 0$$


The Solution

Find $\Lambda^*$ by Quadratic Programming:

Maximize $F(\Lambda) = \Lambda^T\mathbf{1} - \frac{1}{2}\Lambda^T D\Lambda$ subject to $\mathbf{0} \le \Lambda \le C\mathbf{1}$ and $\Lambda^T\mathbf{y} = 0$.

Then

$$\mathbf{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i \mathbf{x}_i$$

How about $b^*$?

From the Lagrangian and the KKT conditions, any $i$ with $0 < \alpha_i^* < C$ has $\mu_i > 0$ and hence $\xi_i = 0$, which gives

$$b^* = y_i - \mathbf{w}^{*T}\mathbf{x}_i, \qquad 0 < \alpha_i^* < C$$

Call $\mathbf{x}_i$ a support vector if $0 < \alpha_i^* < C$.

The slacks are then recovered as

$$\xi_i = \max\left(0,\; 1 - y_i(\mathbf{w}^{*T}\mathbf{x}_i + b^*)\right)$$

The pattern $\mathbf{x}_i$ is misclassified if $\xi_i > 1$.
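Recovering the slacks from the trained $(\mathbf{w}^*, b^*)$ is a one-liner; a sketch in the notation of the earlier blocks:

```python
import numpy as np

def slacks(X, y, w, b):
    """xi_i = max(0, 1 - y_i (w'x_i + b)); xi_i > 1 flags a misclassification."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))
```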

Classification

$$f(\mathbf{x}) = \operatorname{sgn}\left(\mathbf{w}^{*T}\mathbf{x} + b^*\right) = \operatorname{sgn}\left(\sum_{i=1}^{l} \alpha_i^* y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b^*\right) = \operatorname{sgn}\left(\sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b^*\right)$$

Classification Using Supporters

$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b^* \right)$$

$\langle \mathbf{x}_i, \mathbf{x} \rangle$: the similarity measure between the input and the $i$th support vector.

$\alpha_i^* y_i$: the weight of the $i$th support vector.

$b^*$: the bias.
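In practice the whole soft-margin machine is available off the shelf; a brief usage sketch with scikit-learn's linear SVC (toy data of my own; C plays the same role as above):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [0., 1.], [2., 0.], [2., 1.]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.support_vectors_)        # the supporters
print(clf.dual_coef_)              # alpha_i * y_i for each supporter
print(clf.intercept_)              # b*
print(clf.predict([[1.5, 0.5]]))   # -> [1]
```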

Support Vector Machines

Lagrange Multipliers


Lagrange Multiplier

The Lagrange multiplier method is a powerful tool for constrained optimization.

Contributed by Lagrange.

Milkmaid Problem

Travel from $(x_1, y_1)$ to $(x_2, y_2)$ via a point $(x, y)$ on the river bank $g(x, y) = 0$, minimizing the total distance walked.

Goal: Minimize $f(x, y)$ subject to $g(x, y) = 0$, where

$$f(x, y) = \sum_{i=1}^{2} \sqrt{(x - x_i)^2 + (y - y_i)^2}$$

Observation

Goal: Minimize $f(x, y)$ subject to $g(x, y) = 0$.

At the extreme point, say $(x^*, y^*)$,

$$\nabla f(x^*, y^*) = \lambda\, \nabla g(x^*, y^*)$$

i.e., the two gradients are parallel there.

Optimization with Equality Constraints

Goal: Min/Max $f(\mathbf{x})$ subject to $g(\mathbf{x}) = 0$

Lemma: At an extreme point, say $\mathbf{x}^*$, we have

$$\nabla f(\mathbf{x}^*) = \lambda\, \nabla g(\mathbf{x}^*) \quad \text{if } \nabla g(\mathbf{x}^*) \ne \mathbf{0}$$


Proof

Claim: $\nabla f(\mathbf{x}^*) = \lambda\, \nabla g(\mathbf{x}^*)$ if $\nabla g(\mathbf{x}^*) \ne \mathbf{0}$.

Let $\mathbf{x}^*$ be an extreme point, and let $\mathbf{r}(t)$ be any differentiable path on the surface $g(\mathbf{x}) = 0$ such that $\mathbf{r}(t_0) = \mathbf{x}^*$. Then $\dot{\mathbf{r}}(t_0)$ is a vector tangent to the surface $g(\mathbf{x}) = 0$ at $\mathbf{x}^*$, and $f(\mathbf{x}^*) = f(\mathbf{r}(t))\big|_{t = t_0}$.

Since $\mathbf{x}^*$ is an extreme point,

$$\frac{d}{dt} f(\mathbf{r}(t)) = \nabla f(\mathbf{r}(t)) \cdot \dot{\mathbf{r}}(t), \qquad 0 = \nabla f(\mathbf{r}(t_0)) \cdot \dot{\mathbf{r}}(t_0) = \nabla f(\mathbf{x}^*) \cdot \dot{\mathbf{r}}(t_0)$$

This is true for any path $\mathbf{r}$ passing through $\mathbf{x}^*$ on the surface $g(\mathbf{x}) = 0$. It implies that $\nabla f(\mathbf{x}^*) \perp \tau$, where $\tau$ is the tangent plane of the surface $g(\mathbf{x}) = 0$ at $\mathbf{x}^*$. Since $\nabla g(\mathbf{x}^*)$ is also normal to $\tau$, the two gradients are parallel.

Optimization with Equality Constraints

Goal: Min/Max $f(\mathbf{x})$ subject to $g(\mathbf{x}) = 0$

Lemma: At an extreme point, say $\mathbf{x}^*$, we have

$$\nabla f(\mathbf{x}^*) = \lambda\, \nabla g(\mathbf{x}^*) \quad \text{if } \nabla g(\mathbf{x}^*) \ne \mathbf{0}$$

$\lambda$ is called the Lagrange multiplier.

The Method of Lagrange

Goal: Min/Max $f(\mathbf{x})$ subject to $g(\mathbf{x}) = 0$

Find the extreme points by solving the following equations:

$$\nabla f(\mathbf{x}) = \lambda\, \nabla g(\mathbf{x}), \qquad g(\mathbf{x}) = 0$$

With $\mathbf{x}$ of dimension $n$, this is $n + 1$ equations in $n + 1$ variables.
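A toy run of these $n + 1$ equations with sympy (the example $f$ and $g$ are my own, not from the slides):

```python
import sympy as sp

# Minimize f(x, y) = x**2 + y**2 subject to g(x, y) = x + y - 1 = 0.
x, y, lam = sp.symbols('x y lambda', real=True)
f = x**2 + y**2
g = x + y - 1

# n + 1 = 3 equations: grad f = lambda * grad g, plus g = 0.
eqs = [sp.Eq(sp.diff(f, x), lam * sp.diff(g, x)),
       sp.Eq(sp.diff(f, y), lam * sp.diff(g, y)),
       sp.Eq(g, 0)]
print(sp.solve(eqs, [x, y, lam]))  # {x: 1/2, y: 1/2, lambda: 1}
```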

Lagrangian

Goal: Min/Max $f(\mathbf{x})$ subject to $g(\mathbf{x}) = 0$ (constrained optimization)

Define the Lagrangian

$$L(\mathbf{x}; \lambda) = f(\mathbf{x}) + \lambda\, g(\mathbf{x})$$

Solve the unconstrained problem

$$\nabla_{\mathbf{x}}\, L(\mathbf{x}; \lambda) = \mathbf{0}, \qquad \frac{\partial}{\partial \lambda}\, L(\mathbf{x}; \lambda) = 0$$

Optimization with Multiple Equality Constraints

Min/Max $f(\mathbf{x})$ subject to $g_i(\mathbf{x}) = 0,\ i = 1, \dots, m$

Define the Lagrangian

$$L(\mathbf{x}; \Lambda) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}), \qquad \Lambda = (\lambda_1, \dots, \lambda_m)^T$$

Solve $\nabla_{\mathbf{x}, \Lambda}\, L(\mathbf{x}; \Lambda) = \mathbf{0}$.


Optimization with Inequality Constraints

Minimize $f(\mathbf{x})$

Subject to $g_i(\mathbf{x}) = 0,\ i = 1, \dots, m$ and $h_j(\mathbf{x}) \le 0,\ j = 1, \dots, n$

You can always reformulate your problem into the above form.

Lagrange Multipliers

Minimize $f(\mathbf{x})$ subject to $g_i(\mathbf{x}) = 0,\ i = 1, \dots, m$ (zero for feasible solutions) and $h_j(\mathbf{x}) \le 0,\ j = 1, \dots, n$ (non-positive for feasible solutions).

Lagrangian:

$$L(\mathbf{x}; \Lambda, M) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x})$$

$$\Lambda = (\lambda_1, \dots, \lambda_m)^T, \qquad M = (\mu_1, \dots, \mu_n)^T, \qquad \mu_j \ge 0$$

Duality

Let $\mathbf{x}^*$ be a local extreme, so $\nabla_{\mathbf{x}}\, L(\mathbf{x}; \Lambda, M)\big|_{\mathbf{x} = \mathbf{x}^*} = \mathbf{0}$.

Define the dual function

$$D(\Lambda, M) = L(\mathbf{x}^*; \Lambda, M) = f(\mathbf{x}^*) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}^*) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x}^*) \le f(\mathbf{x}^*)$$

Maximize it w.r.t. $\Lambda$ and $M$.

So: minimize the Lagrangian w.r.t. $\mathbf{x}$, while maximizing it w.r.t. $\Lambda$ and $M$.

Saddle Point Determination

$$L(\mathbf{x}; \Lambda, M) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x})$$


Duality

The Primal: Minimize $f(\mathbf{x})$ subject to $g_i(\mathbf{x}) = 0,\ i = 1, \dots, m$ and $h_j(\mathbf{x}) \le 0,\ j = 1, \dots, n$.

The Dual: Maximize $L(\mathbf{x}^*; \Lambda, M)$ subject to

$$\nabla_{\mathbf{x}}\, L(\mathbf{x}; \Lambda, M) = \mathbf{0}, \qquad M \ge \mathbf{0}$$

The Karush-Kuhn-Tucker Conditions

$$\nabla_{\mathbf{x}}\, L(\mathbf{x}; \Lambda, M) = \nabla f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, \nabla g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, \nabla h_j(\mathbf{x}) = \mathbf{0}$$

$$g_i(\mathbf{x}) = 0, \quad i = 1, \dots, m$$

$$h_j(\mathbf{x}) \le 0, \quad j = 1, \dots, n$$

$$\mu_j \ge 0, \quad j = 1, \dots, n$$

$$\mu_j\, h_j(\mathbf{x}) = 0, \quad j = 1, \dots, n$$
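To make the conditions concrete, here is a one-variable toy check with sympy (the problem is my own illustration): minimize $f(x) = x^2$ subject to $h(x) = 1 - x \le 0$.

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
L = x**2 + mu * (1 - x)                           # L = f + mu * h

candidates = sp.solve([sp.Eq(sp.diff(L, x), 0),   # stationarity: 2x - mu = 0
                       sp.Eq(mu * (1 - x), 0)],   # complementary slackness
                      [x, mu], dict=True)

# Keep candidates with h <= 0 (primal) and mu >= 0 (dual) feasibility.
print([s for s in candidates if 1 - s[x] <= 0 and s[mu] >= 0])  # [{x: 1, mu: 2}]
```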