LINEAR CLASSIFIERS

Transcript of lecture slides (37 pages).

Page 1

Problem: Consider a two-class task with classes ω1, ω2 and the linear discriminant function

g(x) = w^T x + w_0 = w_1 x_1 + w_2 x_2 + ... + w_l x_l + w_0 = 0

Assume x_1, x_2 are two points on the decision hyperplane:

0 = w^T x_1 + w_0 = w^T x_2 + w_0   ⟹   w^T (x_1 - x_2) = 0,   ∀ x_1, x_2 on the hyperplane

Page 2

Hence:  w ⊥ to the hyperplane g(x) = w^T x + w_0 = 0.

[Figure: the hyperplane, its distance d from the origin, and the distance z of a point x from it]

d = |w_0| / sqrt(w_1^2 + w_2^2),        z = |g(x)| / sqrt(w_1^2 + w_2^2)
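
As a quick numeric check of these two formulas, here is a minimal Python/NumPy sketch (the weight values and the test point are made up for the illustration):

import numpy as np

w = np.array([3.0, 4.0])        # hyperplane direction (hypothetical values)
w0 = -5.0                       # threshold
x = np.array([2.0, 1.0])        # an arbitrary point

norm_w = np.linalg.norm(w)      # sqrt(w_1^2 + w_2^2) = 5
d = abs(w0) / norm_w            # distance of the hyperplane from the origin: 1.0
z = abs(w @ x + w0) / norm_w    # distance of x from the hyperplane: |3*2 + 4*1 - 5| / 5 = 1.0
print(d, z)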

Page 3

The Perceptron Algorithm

Assume linearly separable classes, i.e., there exists w* such that

w*^T x > 0   ∀ x ∈ ω1
w*^T x < 0   ∀ x ∈ ω2

The case w*^T x + w_0* falls under the above formulation, since defining

w' = [w*^T, w_0*]^T,    x' = [x^T, 1]^T

gives  w*^T x + w_0* = w'^T x'.

Page 4

Our goal: Compute a solution, i.e., a hyperplane w, so that

w^T x > 0,  x ∈ ω1
w^T x < 0,  x ∈ ω2

• The steps
  – Define a cost function to be minimized
  – Choose an algorithm to minimize the cost function
  – The minimum corresponds to a solution

Page 5

The Cost Function

J(w) = \sum_{x ∈ Y} (δ_x w^T x)

• where Y is the subset of the vectors wrongly classified by w, and

δ_x = -1  if x ∈ Y and x ∈ ω1
δ_x = +1  if x ∈ Y and x ∈ ω2

• Hence J(w) ≥ 0. When Y = ∅ (empty set) a solution has been achieved and J(w) = 0.
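
A minimal sketch of evaluating this cost in Python/NumPy, assuming labels y_i equal to +1 for ω1 and -1 for ω2, so that δ_x = -y_x:

import numpy as np

def perceptron_cost(w, X, y):
    # X: (N, l) matrix of (augmented) training vectors, one per row; y: (N,) labels in {+1, -1}.
    scores = X @ w
    wrong = y * scores <= 0                    # Y: the vectors misclassified by w
    return np.sum(-y[wrong] * scores[wrong])   # always >= 0, and 0 when Y is empty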

Page 6

• J(w) is piecewise linear (WHY?)

The Algorithm
• The philosophy of the gradient descent is adopted.

Page 7

w(new) = w(old) + Δw,        Δw = -ρ ∂J(w)/∂w |_{w = w(old)}

• Wherever valid,

∂J(w)/∂w = \sum_{x ∈ Y} δ_x x

so the iteration becomes

w(t+1) = w(t) - ρ_t \sum_{x ∈ Y} δ_x x

This is the celebrated Perceptron Algorithm.
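
A possible implementation of this batch iteration in Python/NumPy (the constant learning rate and the stopping rule are my own choices, not part of the slide):

import numpy as np

def batch_perceptron(X, y, rho=0.1, max_iter=1000):
    # X: (N, l) augmented training vectors; y: (N,) labels, +1 for omega_1, -1 for omega_2.
    w = np.zeros(X.shape[1])
    for t in range(max_iter):
        wrong = y * (X @ w) <= 0                                # the misclassified set Y
        if not wrong.any():                                     # Y empty: a solution was found
            break
        grad = np.sum((-y[wrong])[:, None] * X[wrong], axis=0)  # sum over Y of delta_x * x
        w = w - rho * grad                                      # w(t+1) = w(t) - rho * grad
    return w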

Page 8

An example

[Figure: a single update step, w(t+1) = w(t) + ρ x, for a misclassified vector x]

The perceptron algorithm converges in a finite number of iteration steps to a solution if the sequence ρ_t is chosen so that

lim_{t→∞} \sum_{k=0}^{t} ρ_k = ∞        and        lim_{t→∞} \sum_{k=0}^{t} ρ_k^2 < ∞

e.g., ρ_t = c/t.

Page 9

A useful variant of the perceptron algorithm

It is a reward and punishment type of algorithm:

w(t+1) = w(t) + ρ x_(t)    if x_(t) ∈ ω1 and w^T(t) x_(t) ≤ 0
w(t+1) = w(t) - ρ x_(t)    if x_(t) ∈ ω2 and w^T(t) x_(t) ≥ 0
w(t+1) = w(t)              otherwise

where x_(t) is the training vector presented at iteration step t.
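
A compact sketch of this reward-and-punishment scheme in Python/NumPy (repeated passes over the training set and a constant ρ are implementation choices):

import numpy as np

def reward_punishment_perceptron(X, y, rho=0.7, max_epochs=100):
    # X: (N, l) augmented training vectors; y: (N,) labels, +1 for omega_1, -1 for omega_2.
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:        # misclassified (or on the boundary)
                w = w + rho * y_t * x_t     # +rho*x for omega_1, -rho*x for omega_2
                updated = True
        if not updated:                     # a full pass with no corrections: done
            break
    return w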

Page 10

The perceptron

[Figure: the perceptron network; the w_i's are the synapses or synaptic weights and w_0 is the threshold]

It is a learning machine that learns from the training vectors via the perceptron algorithm.

The network is called perceptron or neuron.

Page 11

Example: At some stage t the perceptron algorithm results in

w_1 = 1,  w_2 = 1,  w_0 = -0.5

The corresponding hyperplane is  x_1 + x_2 - 0.5 = 0.

With ρ = 0.7, the update step based on the two misclassified (augmented) vectors gives

w(t+1) = [1, 1, -0.5]^T + 0.7(-1)[-0.2, 0.75, 1]^T + 0.7(+1)[0.4, 0.05, 1]^T = [1.42, 0.51, -0.5]^T
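
The arithmetic of this update can be checked directly; a short NumPy snippet, using the point/sign assignment read off above:

import numpy as np

w_t = np.array([1.0, 1.0, -0.5])
step = 0.7 * (-1) * np.array([-0.2, 0.75, 1.0]) + 0.7 * (+1) * np.array([0.4, 0.05, 1.0])
print(w_t + step)    # -> [ 1.42  0.51 -0.5 ]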

Page 12

Least Squares Methods

If the classes are linearly separable, the perceptron output results in ±1.

If the classes are NOT linearly separable, we shall compute the weights w_0, w_1, ..., w_l so that the difference between

• the actual output of the classifier, w^T x, and
• the desired outputs, e.g.

y = +1  if x ∈ ω1
y = -1  if x ∈ ω2

is SMALL.

Page 13

SMALL, in the mean square error sense, means to choose w so that the cost function

J(w) = E[ (y - w^T x)^2 ]

is minimum:

ŵ = argmin_w J(w)

where y denotes the corresponding desired response.

Page 14

Minimizing J(w) w.r.t. w results in:

∂J(w)/∂w = ∂/∂w E[ (y - w^T x)^2 ] = -2 E[ x (y - x^T w) ] = 0

⟹   E[ x x^T ] w = E[ x y ]   ⟹   ŵ = R_x^{-1} E[ x y ]

where R_x ≡ E[ x x^T ] is the autocorrelation matrix

R_x = [ E[x_1 x_1]   E[x_1 x_2]   ...   E[x_1 x_l]
        ...
        E[x_l x_1]   E[x_l x_2]   ...   E[x_l x_l] ]

and E[ x y ] is the crosscorrelation vector

E[ x y ] = [ E[x_1 y], ..., E[x_l y] ]^T
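
In practice the expectations are replaced by sample averages over a training set; a minimal sketch in Python/NumPy (the function name is mine):

import numpy as np

def mse_weights(X, y):
    # Rows of X are the training vectors x_i, y holds the desired responses y_i.
    N = X.shape[0]
    R_x = (X.T @ X) / N                  # sample estimate of the autocorrelation matrix E[x x^T]
    r_xy = (X.T @ y) / N                 # sample estimate of the crosscorrelation vector E[x y]
    return np.linalg.solve(R_x, r_xy)    # w_hat = R_x^{-1} E[x y]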

Page 15

SMALL, in the sum of error squares sense, means

J(w) = \sum_{i=1}^{N} ( y_i - w^T x_i )^2

where (y_i, x_i), i = 1, ..., N, are the training pairs, that is, the input x_i and its corresponding class label y_i (±1).

∂J(w)/∂w = 0   ⟹   ( \sum_{i=1}^{N} x_i x_i^T ) w = \sum_{i=1}^{N} x_i y_i

Page 16

Pseudoinverse Matrix

Define

X = [ x_1^T
      x_2^T
      ...
      x_N^T ]        (an N × l matrix)

X^T = [ x_1, x_2, ..., x_N ]        (an l × N matrix)

y = [ y_1, ..., y_N ]^T :  the corresponding desired responses

Then

X^T X = \sum_{i=1}^{N} x_i x_i^T,        X^T y = \sum_{i=1}^{N} x_i y_i

Page 17

Thus

( \sum_{i=1}^{N} x_i x_i^T ) ŵ = \sum_{i=1}^{N} x_i y_i   ⟺   (X^T X) ŵ = X^T y   ⟺   ŵ = (X^T X)^{-1} X^T y

X^† ≡ (X^T X)^{-1} X^T :  the Pseudoinverse of X

Assume N = l, so that X is square and invertible. Then

X^† = (X^T X)^{-1} X^T = X^{-1} (X^T)^{-1} X^T = X^{-1}
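
A minimal Python/NumPy sketch of this solution; np.linalg.pinv computes a pseudoinverse via the SVD, which is the numerically safer choice when X^T X is close to singular:

import numpy as np

def ls_weights(X, y):
    # w_hat = (X^T X)^{-1} X^T y, assuming X^T X is invertible.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalently:  w_hat = np.linalg.pinv(X) @ y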

Page 18

Assume N > l. Then, in general, there is no solution satisfying all the equations simultaneously:

X w = y :    x_1^T w = y_1
             x_2^T w = y_2
             ....
             x_N^T w = y_N        (N equations, l unknowns)

The "solution" ŵ = X^† y corresponds to the minimum sum of squares solution.

Page 19

Example:

ω1:  [0.4, 0.5]^T, [0.6, 0.5]^T, [0.1, 0.4]^T, [0.2, 0.7]^T, [0.3, 0.3]^T
ω2:  [0.4, 0.6]^T, [0.6, 0.2]^T, [0.7, 0.4]^T, [0.8, 0.6]^T, [0.7, 0.5]^T

Augmenting each vector with a third coordinate equal to 1 (for the threshold) and using desired responses +1 for ω1 and -1 for ω2:

X = [ 0.4  0.5  1
      0.6  0.5  1
      0.1  0.4  1
      0.2  0.7  1
      0.3  0.3  1
      0.4  0.6  1
      0.6  0.2  1
      0.7  0.4  1
      0.8  0.6  1
      0.7  0.5  1 ]

y = [ 1, 1, 1, 1, 1, -1, -1, -1, -1, -1 ]^T

Page 20

X^T X = [ 2.8    2.24   4.8
          2.24   2.41   4.7
          4.8    4.7    10  ]

X^T y = [ -1.6, 0.1, 0.0 ]^T

ŵ = (X^T X)^{-1} X^T y ≈ [ -3.22, 0.24, 1.43 ]^T
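
These numbers can be reproduced, up to rounding, with a few lines of NumPy:

import numpy as np

X = np.array([[0.4, 0.5, 1], [0.6, 0.5, 1], [0.1, 0.4, 1], [0.2, 0.7, 1], [0.3, 0.3, 1],
              [0.4, 0.6, 1], [0.6, 0.2, 1], [0.7, 0.4, 1], [0.8, 0.6, 1], [0.7, 0.5, 1]])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])

print(X.T @ X)                               # the 3x3 matrix above
print(X.T @ y)                               # [-1.6  0.1  0. ]
print(np.linalg.solve(X.T @ X, X.T @ y))     # approximately [-3.22  0.24  1.43]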

Page 21

SUPPORT VECTOR MACHINES

The goal: Given two linearly separable classes, design the classifier

g(x) = w^T x + w_0 = 0

that leaves the maximum margin from both classes.

Page 22

Margin: Each hyperplane is characterized by
• its direction in space, i.e., w
• its position in space, i.e., w_0

• For EACH direction, choose the hyperplane that leaves the SAME distance from the nearest points of each class. The margin is twice this distance.

Page 23

The distance of a point x̂ from a hyperplane is given by

z = |g(x̂)| / ||w||

Scale w, w_0 so that at the nearest points from each class the discriminant function is ±1:

g(x) = +1 for the nearest point(s) of ω1    and    g(x) = -1 for the nearest point(s) of ω2

Thus the margin is given by

1/||w|| + 1/||w|| = 2/||w||

Also, the following is valid:

w^T x + w_0 ≥ +1,   ∀ x ∈ ω1
w^T x + w_0 ≤ -1,   ∀ x ∈ ω2

Page 24

SVM (linear) classifier:  g(x) = w^T x + w_0 = 0

Minimize        J(w) = (1/2) ||w||^2

subject to      y_i (w^T x_i + w_0) ≥ 1,   i = 1, 2, ..., N

where y_i = +1 for x_i ∈ ω1 and y_i = -1 for x_i ∈ ω2.

This is because, by minimizing ||w||, the margin 2/||w|| is maximized.

Page 25

The above is a quadratic optimization task subject to a set of linear inequality constraints. The Karush-Kuhn-Tucker (KKT) conditions state that the minimizer satisfies:

• (1)  ∂L(w, w_0, λ)/∂w = 0
• (2)  ∂L(w, w_0, λ)/∂w_0 = 0
• (3)  λ_i ≥ 0,   i = 1, 2, ..., N
• (4)  λ_i [ y_i (w^T x_i + w_0) - 1 ] = 0,   i = 1, 2, ..., N

• where L(·, ·, ·) is the Lagrangian

L(w, w_0, λ) = (1/2) w^T w - \sum_{i=1}^{N} λ_i [ y_i (w^T x_i + w_0) - 1 ]

Page 26

The solution: from the above, it turns out that

w = \sum_{i=1}^{N} λ_i y_i x_i

\sum_{i=1}^{N} λ_i y_i = 0
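
These two equations follow from KKT conditions (1) and (2) applied to the Lagrangian of the previous page; the differentiation step, written out:

∂L/∂w = w - \sum_{i=1}^{N} λ_i y_i x_i = 0   ⟹   w = \sum_{i=1}^{N} λ_i y_i x_i

∂L/∂w_0 = - \sum_{i=1}^{N} λ_i y_i = 0   ⟹   \sum_{i=1}^{N} λ_i y_i = 0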

Page 27

Remarks:
• The Lagrange multipliers can be either zero or positive. Thus,

w = \sum_{i=1}^{N_s} λ_i y_i x_i

where N_s ≤ N is the number of vectors corresponding to positive Lagrange multipliers.

– From constraint (4) above, i.e.,

λ_i [ y_i (w^T x_i + w_0) - 1 ] = 0,   i = 1, 2, ..., N

the vectors contributing to w satisfy

w^T x_i + w_0 = ±1

Page 28

– These vectors are known as SUPPORT VECTORS and are the closest vectors, from each class, to the classifier.

– Once w is computed, w_0 is determined from conditions (4).

– The optimal hyperplane classifier of a support vector machine is UNIQUE.

Page 29

Dual Problem Formulation
• The SVM formulation is a convex programming problem, with
  – a convex cost function
  – a convex region of feasible solutions
• Thus, its solution can be achieved by its dual problem, i.e.,

– maximize        L(w, w_0, λ)

– subject to      w = \sum_{i=1}^{N} λ_i y_i x_i,    \sum_{i=1}^{N} λ_i y_i = 0,    λ ≥ 0

Page 30

• Combine the above to obtain:

– maximize        \sum_{i=1}^{N} λ_i - (1/2) \sum_{i,j} λ_i λ_j y_i y_j x_i^T x_j

– subject to      \sum_{i=1}^{N} λ_i y_i = 0,    λ ≥ 0
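
A small numerical sketch of this dual on toy data (Python, using the SLSQP solver from scipy.optimize; the data, the solver choice and the 1e-6 threshold for picking support vectors are my own choices):

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two points per class, labels +1 / -1.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = (y[:, None] * X) @ (y[:, None] * X).T           # K[i, j] = y_i y_j x_i^T x_j

def neg_dual(lam):                                   # minimize the negative of the dual
    return -(lam.sum() - 0.5 * lam @ K @ lam)

cons = {"type": "eq", "fun": lambda lam: lam @ y}    # sum_i lambda_i y_i = 0
bnds = [(0.0, None)] * len(y)                        # lambda_i >= 0
res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP", bounds=bnds, constraints=cons)

lam = res.x
w = (lam * y) @ X                                    # w = sum_i lambda_i y_i x_i
sv = lam > 1e-6                                      # support vectors: positive multipliers
w0 = np.mean(y[sv] - X[sv] @ w)                      # from condition (4): y_i (w^T x_i + w_0) = 1
print(w, w0)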

Page 31

Remarks:
• Support vectors enter via inner products.
• Although the solution, w, is unique, the Lagrange multipliers ARE NOT.

Non-Separable Classes

Page 32

In this case there is no hyperplane such that

y (w^T x + w_0) > 1,   ∀ x

• Recall that the margin is defined as twice the distance between the following two hyperplanes:

w^T x + w_0 = +1    and    w^T x + w_0 = -1

Page 33

The training vectors belong to one of three possible categories:

• A. Vectors outside the band which are correctly classified, i.e.,

y_i (w^T x + w_0) > 1

• B. Vectors inside the band which are correctly classified, i.e.,

0 ≤ y_i (w^T x + w_0) < 1

• C. Vectors that are misclassified, i.e.,

y_i (w^T x + w_0) < 0

Page 34

All three cases above can be represented as

y_i (w^T x + w_0) ≥ 1 - ξ_i

A.  ξ_i = 0
B.  0 < ξ_i ≤ 1
C.  ξ_i > 1

The ξ_i are known as slack variables.

Page 35

The goal of the optimization is now two-fold:
• maximize the margin
• minimize the number of patterns with ξ_i > 0

That is,

J(w, w_0, ξ) = (1/2) ||w||^2 + C \sum_{i=1}^{N} I(ξ_i)

where C is a constant and

I(ξ_i) = 1 if ξ_i > 0,    I(ξ_i) = 0 if ξ_i = 0

• I(·) is not differentiable. In practice, we use the approximation

J(w, w_0, ξ) = (1/2) ||w||^2 + C \sum_{i=1}^{N} ξ_i

• Following a similar procedure as before, we obtain the KKT conditions.

Page 36

KKT conditions:

(1)  w = \sum_{i=1}^{N} λ_i y_i x_i

(2)  \sum_{i=1}^{N} λ_i y_i = 0

(3)  C - μ_i - λ_i = 0,   i = 1, 2, ..., N

(4)  λ_i [ y_i (w^T x_i + w_0) - 1 + ξ_i ] = 0,   i = 1, 2, ..., N

(5)  μ_i ξ_i = 0,   i = 1, 2, ..., N

(6)  μ_i, λ_i ≥ 0,   i = 1, 2, ..., N

where the μ_i are the Lagrange multipliers associated with the constraints ξ_i ≥ 0.

Page 37

The associated dual problem:

Maximize        \sum_{i=1}^{N} λ_i - (1/2) \sum_{i,j} λ_i λ_j y_i y_j x_i^T x_j

subject to      0 ≤ λ_i ≤ C,  i = 1, 2, ..., N,        \sum_{i=1}^{N} λ_i y_i = 0

Remarks: The only difference with the separable class case is the existence of C in the constraints.
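
As a closing illustration, this soft-margin dual is what a linear-kernel SVM solver computes; a minimal sketch with scikit-learn (assuming it is installed; the data are those of the least squares example, where the two classes overlap slightly, and C = 1 is an arbitrary choice):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.4, 0.5], [0.6, 0.5], [0.1, 0.4], [0.2, 0.7], [0.3, 0.3],
              [0.4, 0.6], [0.6, 0.2], [0.7, 0.4], [0.8, 0.6], [0.7, 0.5]])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)      # w and w_0
print(clf.support_vectors_)           # the support vectors (vectors with lambda_i > 0)
print(clf.dual_coef_)                 # lambda_i * y_i for the support vectors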