Source: Yaser Abu-Mostafa, work.caltech.edu/slides/slides15.pdf

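Review of Lecture 14

• The margin

Maximizing the margin ⟹ dual problem:

$$\mathcal{L}(\boldsymbol{\alpha}) = \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m \, \alpha_n \alpha_m \, \mathbf{x}_n^{\mathsf T} \mathbf{x}_m$$

(quadratic programming)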

• Support vectors

$\mathbf{x}_n$ (or $\mathbf{z}_n$) with Lagrange multiplier $\alpha_n > 0$

$$\mathbb{E}[E_{\text{out}}] \le \frac{\mathbb{E}[\#\text{ of SV's}]}{N - 1}$$

(in-sample check of out-of-sample error)

• Nonlinear transform

Complex $h$, but simple $\mathcal{H}$

Learning From Data
Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 15: Kernel Methods

Sponsored by Caltech's Provost Office, E&AS Division, and IST • Tuesday, May 22, 2012

Outline
• The kernel trick
• Soft-margin SVM

© AML Creator: Yaser Abu-Mostafa - LFD Lecture 15

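What do we need from the Z space?

$$\mathcal{L}(\boldsymbol{\alpha}) = \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m \, \alpha_n \alpha_m \, \mathbf{z}_n^{\mathsf T} \mathbf{z}_m$$

Constraints: $\alpha_n \ge 0$ for $n = 1, \dots, N$ and $\sum_{n=1}^{N} \alpha_n y_n = 0$

$$g(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{\mathsf T} \mathbf{z} + b) \qquad \text{need } \mathbf{z}_n^{\mathsf T} \mathbf{z}$$

$$\text{where } \mathbf{w} = \sum_{\mathbf{z}_n \text{ is SV}} \alpha_n y_n \mathbf{z}_n$$

$$\text{and } b: \; y_m (\mathbf{w}^{\mathsf T} \mathbf{z}_m + b) = 1 \qquad \text{need } \mathbf{z}_n^{\mathsf T} \mathbf{z}_m$$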

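Generalized inner product

Given two points $\mathbf{x}$ and $\mathbf{x}' \in \mathcal{X}$, we need $\mathbf{z}^{\mathsf T} \mathbf{z}'$.

Let $\mathbf{z}^{\mathsf T} \mathbf{z}' = K(\mathbf{x}, \mathbf{x}')$ (the kernel), the "inner product" of $\mathbf{x}$ and $\mathbf{x}'$.

Example: $\mathbf{x} = (x_1, x_2) \longrightarrow$ 2nd-order $\Phi$

$$\mathbf{z} = \Phi(\mathbf{x}) = (1, \, x_1, \, x_2, \, x_1^2, \, x_2^2, \, x_1 x_2)$$

$$K(\mathbf{x}, \mathbf{x}') = \mathbf{z}^{\mathsf T} \mathbf{z}' = 1 + x_1 x_1' + x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + x_1 x_1' x_2 x_2'$$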
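As a minimal sketch of the example above, the kernel can be evaluated the long way, by transforming both points and taking the inner product in Z. The function names and the two sample points are illustrative assumptions, not part of the slides.

```python
def phi(x):
    """Second-order transform from the slide: (1, x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return (1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2)


def kernel_via_transform(x, x_prime):
    """K(x, x') = z^T z', computed by explicitly visiting the Z space."""
    return sum(a * b for a, b in zip(phi(x), phi(x_prime)))


def kernel_expanded(x, x_prime):
    """The same quantity written out term by term, as on the slide."""
    x1, x2 = x
    p1, p2 = x_prime
    return (1 + x1 * p1 + x2 * p2
            + x1**2 * p1**2 + x2**2 * p2**2
            + x1 * p1 * x2 * p2)


# Hypothetical sample points, chosen only for illustration.
x, x_prime = (1.0, 2.0), (3.0, -1.0)
print(kernel_via_transform(x, x_prime))  # 9.0
print(kernel_expanded(x, x_prime))       # 9.0
```

Both routes cost work proportional to the dimension of Z, which is what the next slide sets out to avoid.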


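The trick

Can we compute $K(\mathbf{x}, \mathbf{x}')$ without transforming $\mathbf{x}$ and $\mathbf{x}'$?

Example: Consider

$$K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^{\mathsf T} \mathbf{x}')^2 = (1 + x_1 x_1' + x_2 x_2')^2$$

$$= 1 + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_1' + 2 x_2 x_2' + 2 x_1 x_1' x_2 x_2'$$

This is an inner product!

$$(\, 1, \; x_1^2, \; x_2^2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2, \; \sqrt{2}\, x_1 x_2 \,)$$

$$(\, 1, \; x_1'^2, \; x_2'^2, \; \sqrt{2}\, x_1', \; \sqrt{2}\, x_2', \; \sqrt{2}\, x_1' x_2' \,)$$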
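The identity above can be checked numerically. This is a small sketch under assumed names (`poly2_kernel`, `phi_trick`) and assumed sample points; it compares the kernel computed directly in X with the inner product of the two six-dimensional vectors listed on the slide.

```python
import math


def poly2_kernel(x, x_prime):
    """K(x, x') = (1 + x^T x')^2, computed without leaving the X space."""
    dot = sum(a * b for a, b in zip(x, x_prime))
    return (1.0 + dot) ** 2


def phi_trick(x):
    """The transform whose inner product reproduces (1 + x^T x')^2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, x1 * x1, x2 * x2, r2 * x1, r2 * x2, r2 * x1 * x2)


# Hypothetical sample points, chosen only for illustration.
x, x_prime = (1.0, 2.0), (3.0, -1.0)
lhs = poly2_kernel(x, x_prime)
rhs = sum(a * b for a, b in zip(phi_trick(x), phi_trick(x_prime)))
print(lhs, rhs, math.isclose(lhs, rhs))
```

The left-hand side needs one dot product in two dimensions and one squaring; the right-hand side needs the full transform of both points.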


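The polynomial kernel

$\mathcal{X} = \mathbb{R}^d$ and $\Phi : \mathcal{X} \to \mathcal{Z}$ is polynomial of order $Q$

The "equivalent" kernel

$$K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^{\mathsf T} \mathbf{x}')^Q = (1 + x_1 x_1' + x_2 x_2' + \cdots + x_d x_d')^Q$$

Compare for $d = 10$ and $Q = 100$

Can adjust scale: $K(\mathbf{x}, \mathbf{x}') = (a\, \mathbf{x}^{\mathsf T} \mathbf{x}' + b)^Q$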
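To make the comparison for d = 10 and Q = 100 concrete, the sketch below counts the coordinates an explicit order-Q transform would need and contrasts that with the kernel evaluation. It assumes the transform contains every monomial of degree at most Q (the standard count, C(d+Q, Q)); the helper names and the sample vectors are illustrative.

```python
import math


def z_dimension(d, Q):
    """Number of monomials of degree <= Q in d variables, constant included."""
    return math.comb(d + Q, Q)


def poly_kernel(x, x_prime, Q, a=1.0, b=1.0):
    """K(x, x') = (a x^T x' + b)^Q: d multiplications, then one power."""
    dot = sum(u * v for u, v in zip(x, x_prime))
    return (a * dot + b) ** Q


print(z_dimension(2, 2))     # 6, matching the 2nd-order example
print(z_dimension(10, 100))  # size of an explicit Z for d = 10, Q = 100

# Hypothetical 10-dimensional points; the kernel costs the same for any Q.
x = (0.1,) * 10
x_prime = (0.1,) * 10
print(poly_kernel(x, x_prime, Q=100))
```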


We only need Z to exist!

If $K(\mathbf{x}, \mathbf{x}')$ is an inner product in some space $\mathcal{Z}$, we are good.

Example:

$$K(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma \, \| \mathbf{x} - \mathbf{x}' \|^2 \right)$$

Infinite-dimensional $\mathcal{Z}$: take the simple case

$$K(x, x') = \exp\left( -(x - x')^2 \right) = \exp(-x^2) \exp(-x'^2) \underbrace{\sum_{k=0}^{\infty} \frac{2^k \, x^k \, x'^k}{k!}}_{\exp(2 x x')}$$
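The series above suggests one coordinate of z per value of k. As a rough sketch (the function names and the truncation at 20 terms are assumptions made for illustration), a finite prefix of that infinite-dimensional inner product already reproduces the kernel closely for small inputs.

```python
import math


def rbf_1d(x, x_prime):
    """K(x, x') = exp(-(x - x')^2) for scalar inputs."""
    return math.exp(-(x - x_prime) ** 2)


def feature(x, k):
    """k-th coordinate of the implied z: exp(-x^2) * sqrt(2^k / k!) * x^k."""
    return math.exp(-x * x) * math.sqrt(2.0**k / math.factorial(k)) * x**k


def truncated_inner_product(x, x_prime, terms):
    """Inner product over the first `terms` coordinates of z and z'."""
    return sum(feature(x, k) * feature(x_prime, k) for k in range(terms))


# Hypothetical scalar inputs, chosen only for illustration.
x, x_prime = 0.5, -0.3
exact = rbf_1d(x, x_prime)
approx = truncated_inner_product(x, x_prime, terms=20)
print(exact, approx)
```

The kernel itself never needs the truncation; it is only used here to show that the coordinates of Z exist.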

This kernel in action

Slightly non-separable case: transforming $\mathcal{X}$ into an ∞-dimensional $\mathcal{Z}$

Overkill? Count the support vectors


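Kernel formulation of SVM

Remember quadratic programming? The only difference now is:

$$\underbrace{\begin{bmatrix}
y_1 y_1 K(\mathbf{x}_1, \mathbf{x}_1) & y_1 y_2 K(\mathbf{x}_1, \mathbf{x}_2) & \cdots & y_1 y_N K(\mathbf{x}_1, \mathbf{x}_N) \\
y_2 y_1 K(\mathbf{x}_2, \mathbf{x}_1) & y_2 y_2 K(\mathbf{x}_2, \mathbf{x}_2) & \cdots & y_2 y_N K(\mathbf{x}_2, \mathbf{x}_N) \\
\vdots & \vdots & \ddots & \vdots \\
y_N y_1 K(\mathbf{x}_N, \mathbf{x}_1) & y_N y_2 K(\mathbf{x}_N, \mathbf{x}_2) & \cdots & y_N y_N K(\mathbf{x}_N, \mathbf{x}_N)
\end{bmatrix}}_{\text{quadratic coefficients}}$$

Everything else is the same.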
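A minimal sketch of building that matrix of quadratic coefficients follows. The helper names and the three-point dataset are assumptions for illustration; the resulting matrix would be handed to whichever quadratic programming package is in use, which is not shown here.

```python
def quadratic_coefficients(xs, ys, kernel):
    """Entry (n, m) is y_n * y_m * K(x_n, x_m)."""
    n = len(xs)
    return [[ys[i] * ys[j] * kernel(xs[i], xs[j]) for j in range(n)]
            for i in range(n)]


def linear_kernel(x, x_prime):
    """Plain inner product: recovers the original (non-kernel) SVM."""
    return sum(a * b for a, b in zip(x, x_prime))


def poly2_kernel(x, x_prime):
    """Second-order polynomial kernel (1 + x^T x')^2."""
    return (1.0 + linear_kernel(x, x_prime)) ** 2


# Hypothetical toy dataset.
xs = [(1.0, 2.0), (3.0, -1.0), (0.0, 0.0)]
ys = [1, -1, 1]

for row in quadratic_coefficients(xs, ys, poly2_kernel):
    print(row)
```

Swapping `poly2_kernel` for `linear_kernel` is the only change needed to go back to the linear case, which is the point of the slide.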


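The final hypothesis

Express $g(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{\mathsf T} \mathbf{z} + b)$ in terms of $K(-, -)$

$$\mathbf{w} = \sum_{\mathbf{z}_n \text{ is SV}} \alpha_n y_n \mathbf{z}_n \;\Longrightarrow\; g(\mathbf{x}) = \operatorname{sign}\left( \sum_{\alpha_n > 0} \alpha_n y_n K(\mathbf{x}_n, \mathbf{x}) + b \right)$$

$$\text{where } b = y_m - \sum_{\alpha_n > 0} \alpha_n y_n K(\mathbf{x}_n, \mathbf{x}_m)$$

for any support vector ($\alpha_m > 0$)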
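The two formulas above translate almost directly into code. In this sketch the function names, the tolerance, and the convention sign(0) = +1 are assumptions; the multipliers for the two-point toy dataset were worked out by hand (w = α1 + α2 = 1 and α1 − α2 = 0 give α1 = α2 = 0.5) rather than produced by a solver.

```python
def linear_kernel(x, x_prime):
    return sum(a * b for a, b in zip(x, x_prime))


def bias(alphas, xs, ys, kernel, tol=1e-12):
    """b = y_m - sum_{alpha_n > 0} alpha_n y_n K(x_n, x_m) for a support vector m."""
    m = next(i for i, a in enumerate(alphas) if a > tol)
    return ys[m] - sum(a * y * kernel(xn, xs[m])
                       for a, xn, y in zip(alphas, xs, ys) if a > tol)


def g(x, alphas, xs, ys, b, kernel, tol=1e-12):
    """g(x) = sign(sum_{alpha_n > 0} alpha_n y_n K(x_n, x) + b)."""
    signal = sum(a * y * kernel(xn, x)
                 for a, xn, y in zip(alphas, xs, ys) if a > tol) + b
    return 1 if signal >= 0 else -1  # sign(0) taken as +1 by assumption


# Toy data: two 1-D points, hand-derived multipliers.
xs = [(1.0,), (-1.0,)]
ys = [1, -1]
alphas = [0.5, 0.5]

b = bias(alphas, xs, ys, linear_kernel)
print(b)                                         # 0.0
print(g((2.0,), alphas, xs, ys, b, linear_kernel))   # 1
print(g((-0.3,), alphas, xs, ys, b, linear_kernel))  # -1
```

Note that neither function touches w or z: only kernel evaluations against the support vectors are needed.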

How do we know that Z exists . . .

. . . for a given $K(\mathbf{x}, \mathbf{x}')$? (valid kernel)

Three approaches:

1. By construction
2. Math properties (Mercer's condition)
3. Who cares?


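Design your own kernel

$K(\mathbf{x}, \mathbf{x}')$ is a valid kernel iff

1. It is symmetric, and
2. The matrix

$$\begin{bmatrix}
K(\mathbf{x}_1, \mathbf{x}_1) & K(\mathbf{x}_1, \mathbf{x}_2) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\
K(\mathbf{x}_2, \mathbf{x}_1) & K(\mathbf{x}_2, \mathbf{x}_2) & \cdots & K(\mathbf{x}_2, \mathbf{x}_N) \\
\vdots & \vdots & \ddots & \vdots \\
K(\mathbf{x}_N, \mathbf{x}_1) & K(\mathbf{x}_N, \mathbf{x}_2) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N)
\end{bmatrix}$$

is positive semi-definite

for any $\mathbf{x}_1, \dots, \mathbf{x}_N$ (Mercer's condition)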
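Mercer's condition quantifies over every finite set of points, so code cannot prove it; it can only look for a counterexample. The sketch below is such a spot check on one assumed set of points: it tests symmetry and then evaluates the quadratic form v^T K v for random vectors v, reporting failure if any value is negative. Passing is necessary but not sufficient. All names, the sample points, and the deliberately invalid kernel `neg_sq_dist` are illustrative assumptions.

```python
import math
import random


def gram_matrix(points, kernel):
    return [[kernel(p, q) for q in points] for p in points]


def is_symmetric(G, tol=1e-12):
    n = len(G)
    return all(abs(G[i][j] - G[j][i]) <= tol for i in range(n) for j in range(n))


def quadratic_form(G, v):
    """v^T G v; must be >= 0 for every v if G is positive semi-definite."""
    n = len(G)
    return sum(v[i] * G[i][j] * v[j] for i in range(n) for j in range(n))


def passes_spot_check(G, trials=200, seed=0, tol=1e-9):
    """False as soon as one random v gives v^T G v < 0; True otherwise."""
    rng = random.Random(seed)
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(len(G))]
        if quadratic_form(G, v) < -tol:
            return False
    return True


def rbf(x, x_prime, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x_prime)))


def neg_sq_dist(x, x_prime):
    """Symmetric, but not a valid kernel: used here as a counterexample."""
    return -sum((a - b) ** 2 for a, b in zip(x, x_prime))


points = [(0.0,), (1.0,), (2.5,)]  # hypothetical sample
for name, k in [("rbf", rbf), ("neg_sq_dist", neg_sq_dist)]:
    G = gram_matrix(points, k)
    print(name, is_symmetric(G), passes_spot_check(G))
```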


Outline
• The kernel trick
• Soft-margin SVM


Error measure

Margin violation: $y_n (\mathbf{w}^{\mathsf T} \mathbf{x}_n + b) \ge 1$ fails

Quantify: $y_n (\mathbf{w}^{\mathsf T} \mathbf{x}_n + b) \ge 1 - \xi_n$, with $\xi_n \ge 0$

$$\text{Total violation} = \sum_{n=1}^{N} \xi_n$$
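For a fixed w and b, the smallest ξ_n that satisfies both inequalities above is max(0, 1 − y_n(wᵀx_n + b)). The sketch below computes these minimal slacks and their total; the function name and the four-point dataset are assumptions for illustration.

```python
def margin_violations(w, b, xs, ys):
    """Smallest xi_n >= 0 with y_n (w^T x_n + b) >= 1 - xi_n, for each point."""
    slacks = []
    for x, y in zip(xs, ys):
        signal = sum(wi * xi for wi, xi in zip(w, x)) + b
        slacks.append(max(0.0, 1.0 - y * signal))
    return slacks


# Hypothetical hyperplane and data.
w, b = (1.0, 0.0), 0.0
xs = [(2.0, 0.0), (0.5, 1.0), (-1.0, 0.0), (0.25, 0.0)]
ys = [1, 1, -1, -1]

xi = margin_violations(w, b, xs, ys)
print(xi)       # [0.0, 0.5, 0.0, 1.25]
print(sum(xi))  # 1.75
```

The second point is classified correctly but sits inside the margin (0 < ξ < 1); the fourth is misclassified (ξ > 1).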


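The new optimization

$$\text{Minimize} \quad \frac{1}{2} \mathbf{w}^{\mathsf T} \mathbf{w} + C \sum_{n=1}^{N} \xi_n$$

$$\text{subject to} \quad y_n (\mathbf{w}^{\mathsf T} \mathbf{x}_n + b) \ge 1 - \xi_n \quad \text{for } n = 1, \dots, N$$

$$\text{and} \quad \xi_n \ge 0 \quad \text{for } n = 1, \dots, N$$

$$\mathbf{w} \in \mathbb{R}^d, \quad b \in \mathbb{R}, \quad \boldsymbol{\xi} \in \mathbb{R}^N$$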
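The role of C can be seen by evaluating the objective for two candidate hyperplanes. This is only an illustrative sketch, not a solver: the candidates, the dataset, and the function name are assumptions, and the slacks are taken at their minimal feasible values.

```python
def soft_margin_objective(w, b, xs, ys, C):
    """(1/2) w^T w + C * sum(xi_n), with each xi_n at its smallest feasible value."""
    slack_total = 0.0
    for x, y in zip(xs, ys):
        signal = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack_total += max(0.0, 1.0 - y * signal)
    return 0.5 * sum(wi * wi for wi in w) + C * slack_total


# Hypothetical 1-D data and two candidate solutions.
xs = [(2.0,), (0.5,), (-2.0,)]
ys = [1, 1, -1]
wide = ((0.5,), 0.0)    # wide margin, one violation
narrow = ((2.0,), 0.0)  # narrow margin, no violations

for C in (1.0, 10.0):
    print(C,
          soft_margin_objective(*wide, xs, ys, C),
          soft_margin_objective(*narrow, xs, ys, C))
```

With C = 1 the wide-margin candidate has the lower objective; with C = 10 the violation becomes too expensive and the narrow-margin candidate wins.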


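Lagrange formulation

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2} \mathbf{w}^{\mathsf T} \mathbf{w} + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} \alpha_n \left( y_n (\mathbf{w}^{\mathsf T} \mathbf{x}_n + b) - 1 + \xi_n \right) - \sum_{n=1}^{N} \beta_n \xi_n$$

Minimize w.r.t. $\mathbf{w}$, $b$, and $\boldsymbol{\xi}$ and maximize w.r.t. each $\alpha_n \ge 0$ and $\beta_n \ge 0$

$$\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{w} - \sum_{n=1}^{N} \alpha_n y_n \mathbf{x}_n = \mathbf{0}$$

$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{n=1}^{N} \alpha_n y_n = 0$$

$$\frac{\partial \mathcal{L}}{\partial \xi_n} = C - \alpha_n - \beta_n = 0$$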
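The step from these three conditions to the dual on the next slide is short, and it is sketched here for completeness. Grouping the Lagrangian by variable and substituting the conditions makes the $\xi_n$ and $b$ terms vanish:

```latex
\begin{align*}
\mathcal{L}
  &= \tfrac{1}{2}\mathbf{w}^{\mathsf T}\mathbf{w}
     - \mathbf{w}^{\mathsf T}\sum_{n=1}^{N}\alpha_n y_n \mathbf{x}_n
     - b\sum_{n=1}^{N}\alpha_n y_n
     + \sum_{n=1}^{N}\alpha_n
     + \sum_{n=1}^{N}(C - \alpha_n - \beta_n)\,\xi_n \\
  &= \tfrac{1}{2}\mathbf{w}^{\mathsf T}\mathbf{w}
     - \mathbf{w}^{\mathsf T}\mathbf{w}
     - b \cdot 0
     + \sum_{n=1}^{N}\alpha_n
     + \sum_{n=1}^{N} 0 \cdot \xi_n \\
  &= \sum_{n=1}^{N}\alpha_n
     - \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}
       y_n y_m\,\alpha_n\alpha_m\,\mathbf{x}_n^{\mathsf T}\mathbf{x}_m .
\end{align*}
```

The multipliers $\beta_n$ drop out of the objective, but they leave a trace in the constraints: $\beta_n = C - \alpha_n \ge 0$ together with $\alpha_n \ge 0$ gives $0 \le \alpha_n \le C$.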


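and the solution is . . .

$$\text{Maximize} \quad \mathcal{L}(\boldsymbol{\alpha}) = \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m \, \alpha_n \alpha_m \, \mathbf{x}_n^{\mathsf T} \mathbf{x}_m \quad \text{w.r.t. } \boldsymbol{\alpha}$$

$$\text{subject to} \quad 0 \le \alpha_n \le C \quad \text{for } n = 1, \dots, N \quad \text{and} \quad \sum_{n=1}^{N} \alpha_n y_n = 0$$

$$\Longrightarrow \quad \mathbf{w} = \sum_{n=1}^{N} \alpha_n y_n \mathbf{x}_n \quad \text{minimizes} \quad \frac{1}{2} \mathbf{w}^{\mathsf T} \mathbf{w} + C \sum_{n=1}^{N} \xi_n$$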
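As a sanity check rather than a solver, the sketch below evaluates both objectives on a two-point toy problem. The multipliers α = (0.5, 0.5) were derived by hand and b = 0 is taken as given; all names and the choice C = 1 are assumptions. The α is dual-feasible, the recovered w is primal-feasible with zero slack, and the two objective values coincide, which is what optimality of the pair requires.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alphas, xs, ys):
    """sum(alpha_n) - (1/2) sum_n sum_m y_n y_m alpha_n alpha_m x_n^T x_m."""
    n = len(xs)
    quad = sum(ys[i] * ys[j] * alphas[i] * alphas[j] * dot(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quad


def is_dual_feasible(alphas, ys, C, tol=1e-12):
    """0 <= alpha_n <= C for all n, and sum(alpha_n y_n) = 0."""
    in_box = all(-tol <= a <= C + tol for a in alphas)
    balanced = abs(sum(a * y for a, y in zip(alphas, ys))) <= tol
    return in_box and balanced


def weights_from_alphas(alphas, xs, ys):
    """w = sum_n alpha_n y_n x_n."""
    return tuple(sum(a * y * x[k] for a, y, x in zip(alphas, ys, xs))
                 for k in range(len(xs[0])))


def primal_objective(w, b, xs, ys, C):
    slack = sum(max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys))
    return 0.5 * dot(w, w) + C * slack


# Toy data with hand-derived multipliers (an assumption, not solver output).
xs, ys, alphas, C = [(1.0,), (-1.0,)], [1, -1], [0.5, 0.5], 1.0
w = weights_from_alphas(alphas, xs, ys)
print(is_dual_feasible(alphas, ys, C))        # True
print(w)                                      # (1.0,)
print(dual_objective(alphas, xs, ys))         # 0.5
print(primal_objective(w, 0.0, xs, ys, C))    # 0.5
```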


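Types of support vectors

margin support vectors ($0 < \alpha_n < C$)

$$y_n (\mathbf{w}^{\mathsf T} \mathbf{x}_n + b) = 1 \qquad (\xi_n = 0)$$

non-margin support vectors ($\alpha_n = C$)

$$y_n (\mathbf{w}^{\mathsf T} \mathbf{x}_n + b) < 1 \qquad (\xi_n > 0)$$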
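In practice the multipliers come back from a numerical solver, so the comparisons with 0 and C need a tolerance. The sketch below sorts points into the two categories above plus non-support vectors; the function name, the tolerance value, and the sample multipliers are assumptions.

```python
def classify_support_vectors(alphas, C, tol=1e-8):
    """Label each point by where its alpha_n falls in [0, C]."""
    labels = []
    for a in alphas:
        if a <= tol:
            labels.append("not a support vector")       # alpha_n = 0
        elif a >= C - tol:
            labels.append("non-margin support vector")  # alpha_n = C
        else:
            labels.append("margin support vector")      # 0 < alpha_n < C
    return labels


# Hypothetical multipliers, as a QP solver might return them for C = 1.
print(classify_support_vectors([0.0, 0.3, 1.0, 1e-12], C=1.0))
```

Only margin support vectors should be used as the index m when solving for b, since they are the ones that satisfy the margin equation with ξ = 0.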


Two technical observations

1. Hard margin: What if data is not linearly separable?

"primal $\longrightarrow$ dual" breaks down

2. $\mathcal{Z}$: What if there is $w_0$?

All goes to $b$ and $w_0 \to 0$

