Linear Discrimination Centering on Support Vector Machines


Transcript of Linear Discrimination Centering on Support Vector Machines

Page 1: Linear Discrimination Centering on Support Vector Machines

CHAPTER 10:

Linear Discrimination

Eick/Alpaydin: Topic 13

Page 2: Linear Discrimination Centering on Support Vector Machines

Lecture Notes for E. Alpaydın, 2004, Introduction to Machine Learning © The MIT Press (V1.1)

Likelihood- vs. Discriminant-based Classification

Likelihood-based: assume a model for p(x|Ci) and use Bayes' rule to calculate P(Ci|x):
gi(x) = log P(Ci|x)

Discriminant-based: assume a model for gi(x|Φi); no density estimation.

Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries.

Page 3: Linear Discrimination Centering on Support Vector Machines


Linear Discriminant

Linear discriminant:
gi(x | wi, wi0) = wi^T x + wi0 = Σ_{j=1}^d wij xj + wi0

Advantages:
- Simple: O(d) space/computation
- Knowledge extraction: weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
- Optimal when the p(x|Ci) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
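As an illustration (not part of the original slides), here is a minimal sketch of evaluating such a linear discriminant for K classes and picking the class with the largest gi(x); the weights and the example x are made up:

```python
# Minimal sketch: evaluate g_i(x) = w_i^T x + w_i0 for every class and
# pick the argmax. Weights below are hypothetical.
import numpy as np

def linear_discriminant(x, W, w0):
    """W: (K, d) weight matrix, w0: (K,) biases; returns g_i(x) for all K classes."""
    return W @ x + w0

# Hypothetical weights for K=2 classes and d=3 attributes (e.g., credit-scoring features).
W = np.array([[ 0.8, -0.5, 0.2],
              [-0.3,  0.6, 0.1]])
w0 = np.array([0.1, -0.2])
x = np.array([1.0, 2.0, 0.5])

g = linear_discriminant(x, W, w0)
print("g(x) =", g, "-> choose class", np.argmax(g))
```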

Page 4: Linear Discrimination Centering on Support Vector Machines


Generalized Linear Model

Quadratic discriminant:
gi(x | Wi, wi, wi0) = x^T Wi x + wi^T x + wi0

Higher-order (product) terms, e.g.:
z1 = x1, z2 = x2, z3 = x1^2, z4 = x2^2, z5 = x1 x2

Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:
gi(x) = Σ_{j=1}^k wij φij(x)
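A minimal sketch (with made-up weights) of this idea: compute z = φ(x) with the basis functions above and apply a linear discriminant in z-space, which is quadratic in the original x-space:

```python
# Minimal sketch: map x = (x1, x2) to z = (x1, x2, x1^2, x2^2, x1*x2)
# and use a linear discriminant in z-space.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# Hypothetical weights in z-space (5 basis functions + bias).
w, w0 = np.array([1.0, -2.0, 0.5, 0.5, -1.0]), 0.3

x = np.array([0.7, -1.2])
g = w @ phi(x) + w0          # linear in z, quadratic in x
print("g(x) =", g)
```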

Page 5: Linear Discrimination Centering on Support Vector Machines


Two Classes

g(x) = g1(x) - g2(x)
     = (w1^T x + w10) - (w2^T x + w20)
     = (w1 - w2)^T x + (w10 - w20)
     = w^T x + w0

Choose C1 if g(x) > 0, and C2 otherwise.

Page 6: Linear Discrimination Centering on Support Vector Machines


Geometry

Page 7: Linear Discrimination Centering on Support Vector Machines

Support Vector Machines

One Possible Solution

[Figure: one candidate decision boundary, B1]

Page 8: Linear Discrimination Centering on Support Vector Machines

Support Vector Machines

Another possible solution

[Figure: another candidate decision boundary, B2]

Page 9: Linear Discrimination Centering on Support Vector Machines

Support Vector Machines

Other possible solutions

[Figure: several other candidate decision boundaries]

Page 10: Linear Discrimination Centering on Support Vector Machines

Support Vector Machines

Which one is better? B1 or B2? How do you define better?

[Figure: the two boundaries B1 and B2]

Page 11: Linear Discrimination Centering on Support Vector Machines

Support Vector Machines

Find the hyperplane that maximizes the margin => B1 is better than B2.

[Figure: B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; the margin of B1 is wider]

Page 12: Linear Discrimination Centering on Support Vector Machines

Support Vector Machines

[Figure: boundary B1 with its margin hyperplanes b11 and b12]

Decision boundary: w · x + b = 0
Margin hyperplanes: w · x + b = +1 and w · x + b = -1

f(x) = +1 if w · x + b >= 1
f(x) = -1 if w · x + b <= -1

Margin = 2 / ||w||

Examples are (x1, ..., xn, y) with y ∈ {-1, 1}.
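A minimal sketch (with an assumed w and b, not taken from the slides) of the decision function and the resulting margin 2/||w||:

```python
# Minimal sketch: classifier f(x) = sign(w.x + b) and the margin 2/||w||
# between the hyperplanes w.x + b = +1 and w.x + b = -1.
import numpy as np

w = np.array([2.0, 1.0])   # assumed weight vector
b = -3.0                   # assumed bias

def f(x):
    return 1 if np.dot(w, x) + b >= 0 else -1

margin = 2.0 / np.linalg.norm(w)
print("f([2, 1]) =", f(np.array([2.0, 1.0])), " margin =", margin)
```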

Page 13: Linear Discrimination Centering on Support Vector Machines

Support Vector Machines

We want to maximize:
Margin = 2 / ||w||

which is equivalent to minimizing:
L(w) = ||w||^2 / 2

subject to the following N constraints:
yi (w · xi + b) >= 1,  i = 1, ..., N

This is a constrained convex quadratic optimization problem that can be solved in polynomial time. Numerical approaches to solve it (e.g., quadratic programming) exist. The function to be optimized has only a single minimum, so there is no local-minimum problem.
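As a hedged illustration (not part of the slides), the sketch below hands this optimization to scikit-learn's SVC with a linear kernel on a tiny made-up data set; a very large C approximates the hard-margin problem described here:

```python
# Minimal sketch: solve the (approximately hard-margin) linear SVM problem
# with an existing solver on a small, assumed 2-D data set.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 0.5], [0.0, 3.0],      # class +1 (made up)
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.5]])  # class -1 (made up)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```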

Page 14: Linear Discrimination Centering on Support Vector Machines

Support Vector Machines

What if the problem is not linearly separable?

Page 15: Linear Discrimination Centering on Support Vector Machines

Linear SVM for Non-linearly Separable Problems

What if the problem is not linearly separable? Introduce slack variables ξi.

Need to minimize:
L(w) = ||w||^2 / 2 + C Σ_{i=1}^N ξi^k
(the first term is the inverse of the size of the margin between the hyperplanes; the second term measures the prediction error, weighted by the parameter C)

Subject to (i = 1, ..., N):
(1) yi (w · xi + b) >= 1 - ξi
(2) ξi >= 0
(the slack variable ξi allows the constraint to be violated to a certain degree)

C is chosen using a validation set, trying to keep the margins wide while keeping the training error low.
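One possible way to realize the validation-based choice of C mentioned above, sketched with scikit-learn on synthetic data (the data, the candidate C values, and the split are all assumptions for illustration):

```python
# Minimal sketch: try several values of C on a held-out validation set and
# keep the one with the best validation accuracy.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=200) > 0, 1, -1)  # noisy labels

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1, 10, 100]:
    acc = SVC(kernel="linear", C=C).fit(X_tr, y_tr).score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc
print("chosen C =", best_C, "validation accuracy =", best_acc)
```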

Page 16: Linear Discrimination Centering on Support Vector Machines

Nonlinear Support Vector Machines

What if the decision boundary is not linear?

Alternative 1: Use a technique that employs non-linear decision boundaries.

[Figure: a non-linear function separating the two classes]

Page 17: Linear Discrimination Centering on Support Vector Machines

Nonlinear Support Vector Machines

Alternative 2: Transform into a higher-dimensional attribute space and find linear decision boundaries in this space:
1. Transform the data into a higher-dimensional space.
2. Find the best hyperplane using the methods introduced earlier.

Page 18: Linear Discrimination Centering on Support Vector Machines

Nonlinear Support Vector Machines

1. Choose a non-linear kernel function Φ to transform the data into a different, usually higher-dimensional, attribute space.

2. Find a good hyperplane in the transformed space: minimize
L(w) = ||w||^2 / 2
subject to the following N constraints:
yi (w · Φ(xi) + b) >= 1,  i = 1, ..., N

Page 19: Linear Discrimination Centering on Support Vector Machines

Example: Polynomial Kernel Function

Polynomial kernel function:
Φ(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)
K(u, v) = Φ(u) · Φ(v) = (u · v + 1)^2

A support vector machine with a polynomial kernel function classifies a new example z as follows:
sign(Σ_i αi yi Φ(xi) · Φ(z) + b) = sign(Σ_i αi yi (xi · z + 1)^2 + b)

Remark: the αi and b are determined using the methods for linear SVMs that were discussed earlier.

Kernel function trick: perform the computations in the original space, even though we solve an optimization problem in the transformed space; this is more efficient. More details in Topic14.
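The kernel identity can be checked numerically. The sketch below (not from the slides) uses the degree-2 feature map including the sqrt(2)*x1*x2 cross term, without which Φ(u)·Φ(v) would not equal (u·v + 1)^2:

```python
# Minimal sketch: verify that the explicit feature map Phi reproduces the
# kernel K(u, v) = (u.v + 1)^2, which is computed in the original space.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

def K(u, v):
    return (np.dot(u, v) + 1.0) ** 2

u, v = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.dot(phi(u), phi(v)))   # 42.25, computed in the 6-D feature space
print(K(u, v))                  # 42.25, computed in the original 2-D space
```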

Page 20: Linear Discrimination Centering on Support Vector Machines

Summary: Support Vector Machines

Support vector machines learn hyperplanes that separate two classes while maximizing the margin between them (the empty space between the instances of the two classes).

Support vector machines introduce slack variables in case the classes are not linearly separable, trying to maximize the margins while keeping the training error low.

The most popular versions of SVMs use non-linear kernel functions to map the attribute space into a higher-dimensional space, to facilitate finding "good" linear decision boundaries in the modified space.

Support vector machines find "margin-optimal" hyperplanes by solving a convex quadratic optimization problem. However, this optimization process is quite slow, and support vector machines tend to fail as the number of examples grows beyond 500/2000/5000…

In general, support vector machines achieve quite high accuracies compared to other techniques.

Page 21: Linear Discrimination Centering on Support Vector Machines

Useful Support Vector Machine Links

Lecture notes that are much more helpful for understanding the basic ideas:
http://www.ics.uci.edu/~welling/teaching/KernelsICS273B/Kernels.html
http://cerium.raunvis.hi.is/~tpr/courseware/svm/kraekjur.html

Tools that are often used in publications:
libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
spider: http://www.kyb.tuebingen.mpg.de/bs/people/spider/index.html

Tutorial slides: http://www.support-vector.net/icml-tutorial.pdf
Surveys: http://www.svms.org/survey/Camp00.pdf

More general material:
http://www.learning-with-kernels.org/
http://www.kernel-machines.org/
http://kernel-machines.org/publications.html
http://www.support-vector.net/tutorial.html

Remarks: Thanks to Chaofan Sun for providing these links!

Page 22: Linear Discrimination Centering on Support Vector Machines


Optimal Separating Hyperplane

(Cortes and Vapnik, 1995; Vapnik, 1995)

X = {x^t, r^t}, where r^t = +1 if x^t ∈ C1 and r^t = -1 if x^t ∈ C2.

Find w and w0 such that
w^T x^t + w0 >= +1 for r^t = +1
w^T x^t + w0 <= -1 for r^t = -1

which can be rewritten as
r^t (w^T x^t + w0) >= +1

Alpaydin transparencies on Support Vector Machines—not used in lecture!

Page 23: Linear Discrimination Centering on Support Vector Machines


Margin

Distance from the discriminant to the closest instances on either side.

Distance of x^t to the hyperplane:
|w^T x^t + w0| / ||w||

We require
r^t (w^T x^t + w0) / ||w|| >= ρ, for all t

For a unique solution, fix ρ ||w|| = 1; then, to maximize the margin:
min (1/2) ||w||^2  subject to  r^t (w^T x^t + w0) >= +1, for all t

Page 24: Linear Discrimination Centering on Support Vector Machines


Page 25: Linear Discrimination Centering on Support Vector Machines


min (1/2) ||w||^2  subject to  r^t (w^T x^t + w0) >= +1, for all t

Primal Lagrangian:
Lp = (1/2) ||w||^2 - Σ_{t=1}^N α^t [ r^t (w^T x^t + w0) - 1 ]
   = (1/2) ||w||^2 - Σ_t α^t r^t (w^T x^t + w0) + Σ_t α^t

Setting the derivatives to zero:
∂Lp/∂w = 0  ⇒  w = Σ_t α^t r^t x^t
∂Lp/∂w0 = 0  ⇒  Σ_t α^t r^t = 0

Page 26: Linear Discrimination Centering on Support Vector Machines


Dual: maximize Ld with respect to the α^t:
Ld = (1/2) w^T w - w^T Σ_t α^t r^t x^t - w0 Σ_t α^t r^t + Σ_t α^t
   = -(1/2) w^T w + Σ_t α^t
   = -(1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s + Σ_t α^t

subject to Σ_t α^t r^t = 0 and α^t >= 0, for all t

Most α^t are 0, and only a small number have α^t > 0; the corresponding x^t are the support vectors.

Page 27: Linear Discrimination Centering on Support Vector Machines


Soft Margin Hyperplane

Not linearly separable:
r^t (w^T x^t + w0) >= 1 - ξ^t

Soft error: Σ_t ξ^t

New primal is:
Lp = (1/2) ||w||^2 + C Σ_t ξ^t - Σ_t α^t [ r^t (w^T x^t + w0) - 1 + ξ^t ] - Σ_t μ^t ξ^t

Page 28: Linear Discrimination Centering on Support Vector Machines


Kernel Machines

Preprocess the input x by basis functions:
z = φ(x), g(z) = w^T z, so g(x) = w^T φ(x)

The SVM solution:
w = Σ_t α^t r^t z^t = Σ_t α^t r^t φ(x^t)

g(x) = w^T φ(x) = Σ_t α^t r^t φ(x^t)^T φ(x) = Σ_t α^t r^t K(x^t, x)
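A minimal sketch of this kernel-machine form of the discriminant, with made-up support vectors, labels, multipliers and bias (in practice these come from solving the dual problem above):

```python
# Minimal sketch: g(x) = sum_t alpha^t r^t K(x^t, x) + w0, evaluated without
# ever forming phi(x) explicitly.
import numpy as np

def K(u, v):                       # degree-2 polynomial kernel (assumed choice)
    return (np.dot(u, v) + 1.0) ** 2

# Hypothetical support vectors x^t, labels r^t, multipliers alpha^t and bias w0.
sv    = np.array([[1.0, 1.0], [-1.0, -0.5], [0.5, -2.0]])
r     = np.array([1, -1, -1])
alpha = np.array([0.7, 0.4, 0.3])
w0    = 0.1

def g(x):
    return sum(a * rt * K(xt, x) for a, rt, xt in zip(alpha, r, sv)) + w0

x = np.array([0.8, 0.2])
print("g(x) =", g(x), "-> class", 1 if g(x) > 0 else -1)
```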

Page 29: Linear Discrimination Centering on Support Vector Machines


Kernel Functions

Polynomials of degree q:
K(x^t, x) = (x^T x^t + 1)^q

Example for d = 2 and q = 2:
K(x, y) = (x^T y + 1)^2
        = 1 + 2 x1 y1 + 2 x2 y2 + 2 x1 x2 y1 y2 + x1^2 y1^2 + x2^2 y2^2
which corresponds to the basis functions
φ(x) = [1, sqrt(2) x1, sqrt(2) x2, sqrt(2) x1 x2, x1^2, x2^2]^T

Radial-basis functions:
K(x^t, x) = exp( -||x^t - x||^2 / (2 s^2) )

Sigmoidal functions:
K(x^t, x) = tanh( 2 x^T x^t + 1 )

(Cherkassky and Mulier, 1998)
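These three kernel families can be written directly as plain functions; in the sketch below (not from the slides) the RBF width s and the test points are assumed example values:

```python
# Minimal sketch: polynomial, radial-basis, and sigmoidal kernels.
import numpy as np

def poly_kernel(xt, x, q=2):
    return (np.dot(x, xt) + 1.0) ** q

def rbf_kernel(xt, x, s=1.0):
    return np.exp(-np.linalg.norm(xt - x) ** 2 / (2 * s ** 2))

def sigmoid_kernel(xt, x):
    return np.tanh(2 * np.dot(x, xt) + 1.0)

u, v = np.array([1.0, 0.5]), np.array([0.2, -1.0])
print(poly_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```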

Page 30: Linear Discrimination Centering on Support Vector Machines


SVM for Regression

Use a linear model (possibly kernelized):
f(x) = w^T x + w0

Use the ε-sensitive error function:
e_ε(r^t, f(x^t)) = 0 if |r^t - f(x^t)| < ε, and |r^t - f(x^t)| - ε otherwise

min (1/2) ||w||^2 + C Σ_t (ξ+^t + ξ-^t)  subject to
r^t - (w^T x^t + w0) <= ε + ξ+^t
(w^T x^t + w0) - r^t <= ε + ξ-^t
ξ+^t, ξ-^t >= 0
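A minimal sketch of the ε-sensitive error function above (ε = 0.1 and the target/prediction values are assumed example numbers):

```python
# Minimal sketch: epsilon-insensitive error -- zero inside the tube of
# width epsilon around the prediction f(x), linear outside it.
def eps_insensitive_error(r, fx, eps=0.1):
    return max(0.0, abs(r - fx) - eps)

print(eps_insensitive_error(1.00, 0.95))   # inside the tube -> 0.0
print(eps_insensitive_error(1.00, 0.60))   # outside the tube -> 0.3
```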