
MA 751 Part 7

Solving SVM: Quadratic Programming

1. Quadratic programming (QP): Introducing Lagrange multipliers α_j and μ_j (this can be justified in QP for inequality as well as equality constraints), we define the Lagrangian

L(a, b, ξ, α, μ) ≡ (1/n) Σ_{j=1}^n ξ_j + λ aᵀKa − Σ_{j=1}^n α_j [ y_j ( Σ_{i=1}^n a_i K(x_i, x_j) + b ) − 1 + ξ_j ] − Σ_{j=1}^n μ_j ξ_j.

By Lagrange multiplier theory for constraints with inequalities, the solution in a, b, ξ, α = (α_1, …, α_n), μ = (μ_1, …, μ_n) is a stationary point of this Lagrangian (derivatives vanish): L is minimized with respect to a, b, ξ and maximized with respect to the Lagrange multipliers α, μ, subject to the constraints

α_i, μ_i ≥ 0.  (5)

Derivatives:

∂L/∂b = 0 ⇒ Σ_{j=1}^n α_j y_j = 0;  (6a)

∂L/∂ξ_j = 0 ⇒ 1/n − α_j − μ_j = 0.  (6b)
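In display form (a LaTeX restatement of (6a)-(6b), obtained by differentiating the Lagrangian above term by term):

```latex
\frac{\partial L}{\partial b} = -\sum_{j=1}^{n} \alpha_j y_j = 0
  \quad\Longrightarrow\quad \sum_{j=1}^{n} \alpha_j y_j = 0,
\qquad
\frac{\partial L}{\partial \xi_j} = \frac{1}{n} - \alpha_j - \mu_j = 0 .
```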

Plugging these in, we get the reduced Lagrangian

L*(a, α) = λ aᵀKa − Σ_{j=1}^n α_j [ y_j Σ_{i=1}^n a_i K(x_i, x_j) − 1 ]

= λ aᵀKa − aᵀKYα + Σ_{j=1}^n α_j,

where Y = diag(y_1, y_2, …, y_n), the n × n diagonal matrix with the labels y_i on the diagonal

(note (6) eliminates the ξ_j terms), with the same constraints (5).

Now:

∂L/∂a = 0 ⇒ 2λKa − KYα = 0  (7)

⇒ a_i = y_i α_i / (2λ).

Plug in for a_i using (7), replacing Ka by (1/(2λ)) KYα everywhere:

L*(a, α) = −(1/(4λ)) αᵀ YKYᵀ α + Σ_{j=1}^n α_j

= −(1/(4λ)) αᵀQα + Σ_{j=1}^n α_j,

where Q = YKYᵀ.

Constraints: α_j, μ_j ≥ 0; by (6b) this implies

0 ≤ α_j ≤ 1/n.

Define C = 1/(2λn) and ᾱ = (1/(2λ)) α [note the bar does not denote a complex conjugate!].

Then we want to minimize (dividing by the constant 2λ is OK; it does not change the minimizing α)

" Ð# Ñ " "

# # % ## T œ T ß

- - --α α

-

4œ" 4œ"

8 8

4 4

#X Xα α α α (8)

subject to the constraint 0 ≤ ᾱ_i ≤ C; it is also convenient to include (6a) as a constraint: ᾱ · y = 0. Thus the constraints are:

0 ≤ ᾱ ≤ C;  ᾱ · y = 0.
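Collecting (8) with these constraints, the dual problem in display form (a consolidation of the above, nothing new):

```latex
\min_{\bar{\alpha} \in \mathbb{R}^{n}}\;
  \frac{1}{2}\,\bar{\alpha}^{\mathsf{T}} Q\,\bar{\alpha}
  \;-\; \sum_{j=1}^{n} \bar{\alpha}_{j}
\qquad\text{subject to}\qquad
  0 \le \bar{\alpha}_{j} \le C,\quad \bar{\alpha}\cdot\mathbf{y} = 0 ,
```

where Q = YKYᵀ and C = 1/(2λn).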

Summarizing the above relationships:

f(x) = Σ_{j=1}^n a_j K(x_j, x) + b,

where

a_j = y_j α_j / (2λ),  α_j = 2λ ᾱ_j,

and the ᾱ_j are the minimizers of (8) under these constraints, with

Q = YKYᵀ.

After the a_j are determined, b must be computed directly by plugging into the original optimization problem (see below).

More briefly,

f(x) = Σ_{j=1}^n y_j ᾱ_j K(x_j, x) + b,

where the ᾱ_j minimize (8).

Finally, to find b, we must plug into the original optimization problem: that is, we minimize

(1/n) Σ_{j=1}^n (1 − y_j f(x_j))_+ + λ ‖f‖²_K

= (1/n) Σ_{j=1}^n [ 1 − y_j ( Σ_{i=1}^n a_i K(x_i, x_j) + b ) ]_+ + λ aᵀKa.
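Problem (8) is a standard quadratic program, so any off-the-shelf QP solver applies. Below is a minimal sketch in Python, assuming the cvxopt package (an assumption; any solver handling linear inequality and equality constraints works). cvxopt's solvers.qp minimizes (1/2)xᵀPx + qᵀx subject to Gx ≤ h and Ax = b, which matches (8) with x = ᾱ, P = Q, and q = −1.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(K, y, lam):
    """Solve (8): minimize (1/2) abar' Q abar - sum_j abar_j
    subject to 0 <= abar_j <= C = 1/(2*lam*n) and abar . y = 0,
    where Q = Y K Y^T.  Returns abar."""
    n = len(y)
    y = np.asarray(y, dtype=float)
    C = 1.0 / (2.0 * lam * n)
    Q = np.diag(y) @ np.asarray(K, dtype=float) @ np.diag(y)  # Q = Y K Y^T
    P, q = matrix(Q), matrix(-np.ones(n))           # (1/2) x'Px + q'x
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))  # -abar <= 0 and abar <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)    # equality: abar . y = 0
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                       # coefficients: a_j = y_j * abar_j
```

After solving, b is recovered as in the text by a one-dimensional minimization of the original functional over b.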

2. The RKHS for SVM

General SVM: the solution function is (see (4) above)

f(x) = Σ_j a_j K(x_j, x) + b,

with the solution for the a_j given by quadratic programming as above.

Consider a simple case (linear kernel):

K(x_j, x) = x_j · x.

Then we have

f(x) = Σ_j (a_j x_j) · x + b ≡ w · x + b,

where

w ≡ Σ_j a_j x_j.

This gives the kernel. What class of functions is the corresponding space ℋ?

Claim it is the set of linear functions of x:

ℋ = { w · x | w ∈ ℝ^d }

with inner product

⟨w_1 · x, w_2 · x⟩ = w_1 · w_2

is the RKHS of the K(x, y) above.

Indeed, to show that K(x, y) = x · y is the reproducing kernel for ℋ, note that if f(x) = w · x ∈ ℋ, then since K(x, y) = x · y,

⟨f(·), K(·, y)⟩_ℋ = w · y = f(y),

as desired.

Thus the matrix K_ij = x_i · x_j, and we find the optimal separator

f(x) = w · x

by solving for w as before.

Note that when we add b to f(x) (as done earlier), we get all affine functions f(x) = w · x + b.

Note the above inner product gives the norm

‖w · x‖²_ℋ = ‖w‖²_{ℝ^d} = Σ_{j=1}^d w_j².

Why use this norm? It encodes our a priori information about the function class.

Final classification rule: f(x) > 0 ⇒ y = 1; f(x) < 0 ⇒ y = −1.

Learning from training data:

Nf = (f(x_1), …, f(x_n)) = (y_1, …, y_n).

Thus

ℋ = { f(x) = w · x : w ∈ ℝ^d }

is the set of linear separator functions (known as perceptrons in neural network theory).

Consider the separating hyperplane H: f(x) = 0.

3. Toy example:

Information:

Nf = { [(1,1), 1], [(−1,1), −1], [(−1,−1), −1], [(1,−1), 1] }

(red = +1, blue = −1);

f = w · x + b = Σ_i a_i (x · x_i) + b,  where x · x_i = K(x, x_i),

so

w = Σ_i a_i x_i.

L(f) = (1/4) Σ_j (1 − y_j f(x_j))_+ + (1/2)|w|²  (9)

(we let λ = 1/2; minimize with respect to w, b).

Equivalent:

L(f) = (1/4) Σ_{j=1}^4 ξ_j + (1/2)|w|²,

subject to

y_j f(x_j) ≥ 1 − ξ_j;  ξ_j ≥ 0.

[Note that effectively ξ_i = (1 − y_i (w · x_i + b))_+ .]

Define the kernel matrix

K_ij = K(x_i, x_j) = x_i · x_j =

    [  2   0  −2   0 ]
    [  0   2   0  −2 ]
    [ −2   0   2   0 ]
    [  0  −2   0   2 ]

Then

‖f‖²_ℋ = |w|² = aᵀKa = 2 Σ_{i=1}^4 a_i² − 4(a_1 a_3 + a_2 a_4),

where a = (a_1, a_2, a_3, a_4)ᵀ.

Formulate

L(f) = L(a, b, ξ) = (1/4) Σ_{j=1}^4 ξ_j + (1/2) aᵀKa

= (1/4) Σ_{j=1}^4 ξ_j + Σ_{i=1}^4 a_i² − 2(a_1 a_3 + a_2 a_4),

subject to (Eq. 4a):

ξ_j ≥ 1 − y_j [(Ka)_j + b],  ξ_j ≥ 0.

Lagrange multipliers α = (α_1, …, α_4)ᵀ, μ = (μ_1, …, μ_4)ᵀ (see (4b)): optimize

L(a, b, ξ, α, μ) = (1/4) Σ_{j=1}^4 ξ_j + (1/2) aᵀKa − Σ_{j=1}^4 α_j [ y_j ((Ka)_j + b) − 1 + ξ_j ] − Σ_{j=1}^4 μ_j ξ_j

= (1/4) Σ_{j=1}^4 ξ_j + (1/2) aᵀKa − aᵀKYα − b y · α + Σ_{j=1}^4 α_j − αᵀξ − μᵀξ  (10)

with constraints

α_i, μ_i ≥ 0.

The solution has (see (7) above)

α = 2λ Y⁻¹ a

(recall Y = diag(y_1, y_2, y_3, y_4) = diag(1, −1, −1, 1))

and (7a above)

ᾱ = (1/(2λ)) α = α  (since 2λ = 1).

Finally, optimize (8):

L = (1/2) ᾱᵀQᾱ − Σ_{i=1}^4 ᾱ_i,

where

Q = YKYᵀ = diag(1, −1, −1, 1) K diag(1, −1, −1, 1)

    [ 2  0  2  0 ]
  = [ 0  2  0  2 ]
    [ 2  0  2  0 ]
    [ 0  2  0  2 ]

(conjugating K by Y flips the signs of the −2 entries).

The constraint is

0 ≤ ᾱ ≤ C ≡ 1/(2λn) = 1/4.  (10a)

Thus optimize

L = Σ_{i=1}^4 ᾱ_i² + 2ᾱ_1ᾱ_3 + 2ᾱ_2ᾱ_4 − Σ_{i=1}^4 ᾱ_i

= (ᾱ_1 + ᾱ_3)² + (ᾱ_2 + ᾱ_4)² − Σ_{i=1}^4 ᾱ_i

= u² + v² − u − v,

where

u = ᾱ_1 + ᾱ_3;  v = ᾱ_2 + ᾱ_4.

Minimizing:

2u − 1 = 0;  2v − 1 = 0  ⇒  u = v = 1/2.

Thus we have

ᾱ_i = 1/4

for all i (recall the constraint (10a): since u = ᾱ_1 + ᾱ_3 = 1/2 and v = ᾱ_2 + ᾱ_4 = 1/2 with each ᾱ_i ≤ 1/4, every ᾱ_i must equal 1/4; note the constraint ᾱ · y = 0 is also satisfied). Then

α = 2λᾱ = ᾱ = (1/4, 1/4, 1/4, 1/4)ᵀ.

Thus

a = Yα/(2λ) = Yᾱ = (1/4, −1/4, −1/4, 1/4)ᵀ.

Thus

w = Σ_i a_i x_i = (1/4)(x_1 − x_2 − x_3 + x_4) = (1/4)(4, 0) = (1, 0).

Margin = 1/|w| = 1.

Now we find b separately from the original equation (9); we minimize with respect to b the original functional

L(f) = (1/4) Σ_j (1 − y_j (w · x_j + b))_+ + (1/2)|w|²

= (1/4) { [1 − (1 + b)(1)]_+ + [1 − (−1 + b)(−1)]_+ + [1 − (−1 + b)(−1)]_+ + [1 − (1 + b)(1)]_+ } + 1/2

= (1/4) { [−b]_+ + [b]_+ + [b]_+ + [−b]_+ } + 1/2

= (1/2) { [b]_+ + [−b]_+ } + 1/2.

Clearly the above is minimized when b = 0.

Thus w = (1, 0), b = 0 ⇒

f(x) = w · x + b = x_1.
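As a sanity check, the whole toy example can be reproduced numerically; a minimal NumPy sketch (using the data ordering assumed above):

```python
import numpy as np

# toy data of section 3 (ordering as assumed above)
X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
y = np.array([1, -1, -1, 1], dtype=float)
lam, n = 0.5, 4
C = 1 / (2 * lam * n)                        # = 1/4, the bound (10a)

K = X @ X.T                                  # linear kernel matrix
Q = np.diag(y) @ K @ np.diag(y)              # Q = Y K Y^T

abar = np.full(n, 0.25)                      # claimed minimizer of (8)
print(0.5 * abar @ Q @ abar - abar.sum())    # dual objective: -0.5
print(abar @ y)                              # equality constraint: 0.0

a = y * abar                                 # a_j = y_j * abar_j  (2*lam = 1)
w = X.T @ a                                  # w = sum_i a_i x_i = (1, 0)
b = 0.0
print(w, y * (X @ w + b))                    # margins y_j f(x_j), all = 1
```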

4. Other choices of kernel

Recall that in SVM we have used the kernel

K(x, y) = x · y.

There are many other choices of kernel, e.g.,

K(x, y) = e^{−|x−y|}  or  K(x, y) = (1 + |x · y|)^n;

note that we must choose a kernel function which is positive definite. How do these choices change the discrimination function f(x) in SVM?

Ex 1: Gaussian kernel

K_σ(x, y) = e^{−|x−y|²/(2σ²)}

[one can show this is a positive definite Mercer kernel]

SVM: from (4) above we have

f(x) = Σ_j a_j K(x_j, x) + b = Σ_j a_j e^{−|x_j−x|²/(2σ²)} + b,

where the training examples x_j have known classifications y_j, and a_j, b are obtained by quadratic programming.

What kind of classifier is this? It depends on σ (see the Vert movies). Movie 1 varies σ in the Gaussian (σ = ∞ corresponds to a linear SVM); movie 2 varies the margin 1/|w| (in the linear feature space F) as determined by changing λ, or equivalently C = 1/(2λn).
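Concretely, the Gaussian-kernel discriminant can be evaluated as follows (a NumPy sketch with illustrative names; the a_j and b are assumed to come from the QP above):

```python
import numpy as np

def gaussian_kernel(X, x, sigma):
    # K_sigma(x_j, x) = exp(-|x_j - x|^2 / (2 sigma^2)) for every row x_j of X
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))

def f(x, X_train, a, b, sigma):
    # f(x) = sum_j a_j K_sigma(x_j, x) + b; classify by sign(f(x))
    return a @ gaussian_kernel(X_train, x, sigma) + b
```

Small σ makes the classifier very local (each training point influences only a small neighborhood); as σ grows the decision boundary flattens, approaching the linear case noted above.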

5. Software available

Software which implements the quadratic programming algorithm above includes:

• SVMLight: http://svmlight.joachims.org
• SVMTorch: http://www.idiap.ch/learning/SVMTorch.html
• LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm

A Matlab package which implements most of these is Spider:

http://www.kyb.mpg.de/bs/people/spider/whatisit.html
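As a modern convenience (an addition, not on the original list): scikit-learn's sklearn.svm.SVC is built on LIBSVM, so the same QP is solved under the hood. A minimal example on the toy data of section 3:

```python
import numpy as np
from sklearn.svm import SVC

# toy data from section 3; with C large enough this recovers
# the max-margin separator w = (1, 0), b = 0
X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
y = np.array([1, -1, -1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # approx. [[1., 0.]] and [0.]
print(clf.predict([[2.0, 0.5]]))   # -> [1], since f(x) = x_1 > 0
```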

6. Example application: handwritten digit recognition - USPS (Schölkopf, Burges, Vapnik)

Handwritten digits. Training set: 7300; test set: 2000.

10-class classifier; the i-th class has a separating SVM function

f_i(x) = w_i · x + b_i.

The chosen class is

Class = argmax_{i ∈ {0,…,9}} f_i(x).

Feature map Φ: digit g → feature vector Φ(g) = x ∈ F.

Kernels in feature space F:

RBF: K(x_i, x_j) = e^{−|x_i−x_j|²/(2σ²)}

Polynomial: K = (x_i · x_j + θ)^d

Sigmoidal: K = tanh(b (x_i · x_j) + θ)
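A sketch of the one-vs-rest decision rule described above (illustrative names; assuming the ten per-class decision functions f_i have already been trained):

```python
import numpy as np

def classify_digit(x, classifiers):
    """One-vs-rest rule: classifiers[i](x) returns
    f_i(x) = w_i . Phi(x) + b_i for class i = 0, ..., 9;
    the chosen class is argmax_i f_i(x)."""
    return int(np.argmax([f(x) for f in classifiers]))
```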

Results: [error-rate table shown in the original slides; not reproduced in this transcript]

Computational Biology Applications

References:

T. Golub et al., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring." Science, 1999.

S. Ramaswamy et al., "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures." PNAS, 2001.

7. Microarrays for cancer classification

Goal: infer cancer genetics by looking at microarrays.

A microarray reveals expression patterns and can hopefully be used to discriminate similar cancers, and thus lead to better treatments.

Usual problem: small sample size (e.g., 50 cancer tissue samples) and high dimensionality (e.g., 20,000-30,000 genes). Curse of dimensionality.

Example 1: Myeloid vs. lymphoblastic leukemias

ALL: acute lymphoblastic leukemia
AML: acute myeloblastic leukemia

SVM training: leave-one-out cross-validation.

fig. 1: Myeloid and lymphoblastic leukemia classification by SVM (S. Mukherjee)

fig. 2: AML vs. ALL error rates with increasing sample size (S. Mukherjee)

In the above figure, the curves represent error rates for various splits between training and test sets; the red dot represents leave-one-out cross-validation.