

L5: Quadratic classifiers

• Bayes classifiers for Normally distributed classes
– Case 1: $\Sigma_i = \sigma^2 I$

– Case 2: $\Sigma_i = \Sigma$ ($\Sigma$ diagonal)

– Case 3: $\Sigma_i = \Sigma$ ($\Sigma$ non-diagonal)

– Case 4: $\Sigma_i = \sigma_i^2 I$

– Case 5: $\Sigma_i \neq \Sigma_j$ (general case)

• Numerical example

• Linear and quadratic classifiers: conclusions


Bayes classifiers for Gaussian classes

• Recap

– In L4 we showed that the decision rule that minimized $P[error]$ could be formulated in terms of a family of discriminant functions

• For Normally distributed classes, these DFs reduce to simple expressions
– The multivariate Normal pdf is

$$f_X(x) = (2\pi)^{-N/2}\,|\Sigma|^{-1/2}\,e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$

– Using Bayes rule, the DFs become

$$g_i(x) = P(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)} = (2\pi)^{-N/2}\,|\Sigma_i|^{-1/2}\,e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}\,\frac{P(\omega_i)}{p(x)}$$

– Eliminating constant terms

$$g_i(x) = |\Sigma_i|^{-1/2}\,e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}\,P(\omega_i)$$

– And taking natural logs

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

– This expression is called a quadratic discriminant function

[Figure: block diagram of the discriminant-function classifier — features $x_1, x_2, x_3, \dots, x_d$ feed discriminant functions $g_1(x), g_2(x), \dots, g_C(x)$; a "Select max" stage (optionally weighted by costs) produces the class assignment]
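– A minimal NumPy sketch of this quadratic discriminant (names are illustrative; the class means, covariances, and priors are assumed to be estimated elsewhere):

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """Evaluate g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 log|Sigma| + log P(omega_i)."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)   # quadratic (Mahalanobis-type) term
            - 0.5 * np.log(np.linalg.det(Sigma))         # log-determinant term
            + np.log(prior))                              # prior term

def bayes_classify(x, mus, Sigmas, priors):
    """Assign x to the class whose discriminant g_i(x) is largest."""
    scores = [quadratic_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores))
```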


Case 1: $\Sigma_i = \sigma^2 I$

• This situation occurs when the features are statistically independent, with equal variance for all classes
– In this case, the quadratic DFs become

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T (\sigma^2 I)^{-1} (x-\mu_i) - \frac{1}{2}\log|\sigma^2 I| + \log P_i \;\equiv\; -\frac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i) + \log P_i$$

– Expanding this expression

$$g_i(x) = -\frac{1}{2\sigma^2}\left[x^T x - 2\mu_i^T x + \mu_i^T \mu_i\right] + \log P_i$$

– Eliminating the term $x^T x$, which is constant for all classes

$$g_i(x) = -\frac{1}{2\sigma^2}\left[-2\mu_i^T x + \mu_i^T \mu_i\right] + \log P_i = w_i^T x + w_0$$

– So the DFs are linear, and the decision boundaries $g_i(x) = g_j(x)$ are hyper-planes

– If we assume equal priors

$$g_i(x) = -\frac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i)$$

• This is called a minimum-distance or nearest-mean classifier

• The equiprobable contours are hyper-spheres

• For unit variance ($\sigma^2 = 1$), maximizing $g_i(x)$ amounts to minimizing the Euclidean distance to $\mu_i$

[Figure: minimum-distance classifier — the input $x$ is compared against Distance 1 … Distance C and a minimum selector outputs the class; after Schalkoff, 1992]
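– A minimal sketch of this nearest-mean rule under the Case 1 assumptions (equal priors, $\Sigma_i = \sigma^2 I$), with the class means assumed known:

```python
import numpy as np

def nearest_mean(x, mus):
    """Minimum-distance (nearest-mean) classifier: pick the class whose mean is closest to x."""
    dists = [np.linalg.norm(x - mu) for mu in mus]  # Euclidean distance to each class mean
    return int(np.argmin(dists))
```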


• Example – Three-class 2D problem with equal priors

$$\mu_1 = [3\ \ 2]^T \qquad \mu_2 = [7\ \ 4]^T \qquad \mu_3 = [2\ \ 5]^T$$

$$\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$$
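– Using the nearest-mean sketch above, the example's parameters can be plugged in directly (the query point below is a hypothetical test sample, used only for illustration):

```python
mus = [np.array([3.0, 2.0]), np.array([7.0, 4.0]), np.array([2.0, 5.0])]
# With equal priors and Sigma_i = 2*I, the Bayes rule reduces to the nearest mean.
print(nearest_mean(np.array([6.0, 4.0]), mus))  # prints 1, i.e. the class of mu_2
```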


Case 2: $\Sigma_i = \Sigma$ ($\Sigma$ diagonal)

• Classes still have the same covariance matrix, but features are allowed to have different variances
– In this case, the quadratic DFs become

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\log|\Sigma_i| + \log P_i = -\frac{1}{2}\sum_{k=1}^{N}\frac{(x_k-\mu_{i,k})^2}{\sigma_k^2} - \frac{1}{2}\log\prod_{k=1}^{N}\sigma_k^2 + \log P_i$$

– Eliminating the term $x_k^2$, which is constant for all classes

$$g_i(x) = -\frac{1}{2}\sum_{k=1}^{N}\frac{-2x_k\mu_{i,k} + \mu_{i,k}^2}{\sigma_k^2} - \frac{1}{2}\log\prod_{k=1}^{N}\sigma_k^2 + \log P_i$$

– This discriminant is also linear, so the decision boundaries $g_i(x) = g_j(x)$ will also be hyper-planes

– The equiprobable contours are hyper-ellipses aligned with the reference frame

– Note that the only difference from the previous classifier is that the distance along each axis is normalized by its variance
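– A minimal sketch of this variance-normalized distance rule, assuming equal priors and that the shared per-feature variances are known:

```python
import numpy as np

def normalized_distance_classify(x, mus, variances):
    """Minimum-distance classifier where each axis is normalized by its variance (Case 2)."""
    dists = [np.sum((x - mu) ** 2 / variances) for mu in mus]  # per-axis variance scaling
    return int(np.argmin(dists))
```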


• Example – Three-class 2D problem with equal priors

$$\mu_1 = [3\ \ 2]^T \qquad \mu_2 = [5\ \ 4]^T \qquad \mu_3 = [2\ \ 5]^T$$

$$\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$$


Case 3: $\Sigma_i = \Sigma$ ($\Sigma$ non-diagonal)

• Classes have the same covariance matrix, but it is no longer diagonal
– The quadratic discriminant becomes

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) - \frac{1}{2}\log|\Sigma| + \log P_i$$

– Eliminating the term $\log|\Sigma|$, which is constant for all classes, and assuming equal priors

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)$$

– The quadratic term is called the Mahalanobis distance, a very important concept in statistical pattern recognition

– The Mahalanobis distance is a vector distance that uses a $\Sigma^{-1}$ norm; $\Sigma^{-1}$ acts as a stretching factor on the space

– Note that when $\Sigma = I$, the Mahalanobis distance becomes the familiar Euclidean distance

[Figure: contours in the $(x_1, x_2)$ plane of constant Euclidean distance $K = \|x - \mu_i\|^2$ and constant Mahalanobis distance $K = (x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$]


– Expanding the quadratic term

$$g_i(x) = -\frac{1}{2}\left[x^T \Sigma^{-1} x - 2\mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i\right]$$

– Removing the term $x^T \Sigma^{-1} x$, which is constant for all classes

$$g_i(x) = -\frac{1}{2}\left[-2\mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i\right] = w_i^T x + w_0$$

– So the DFs are still linear, and the decision boundaries will also be hyper-planes

– The equiprobable contours are hyper-ellipses aligned with the eigenvectors of $\Sigma$

– This is known as a minimum (Mahalanobis) distance classifier

[Figure: minimum Mahalanobis distance classifier — the input $x$ is compared against Distance 1 … Distance C, each computed with the shared covariance $\Sigma$, and a minimum selector outputs the class]
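– A minimal sketch of this minimum Mahalanobis distance rule, assuming the shared covariance $\Sigma$ and the class means are given:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x-mu)^T Sigma^{-1} (x-mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

def mahalanobis_classify(x, mus, Sigma):
    """Minimum (Mahalanobis) distance classifier with a covariance shared by all classes."""
    return int(np.argmin([mahalanobis_sq(x, mu, Sigma) for mu in mus]))
```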


• Example – Three-class 2D problem with equal priors

$$\mu_1 = [3\ \ 2]^T \qquad \mu_2 = [5\ \ 4]^T \qquad \mu_3 = [2\ \ 5]^T$$

$$\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix} 1 & 0.7 \\ 0.7 & 2 \end{bmatrix}$$


Case 4: $\Sigma_i = \sigma_i^2 I$

• In this case, each class has a different covariance matrix, which is proportional to the identity matrix
– The quadratic discriminant becomes

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \sigma_i^{-2} (x-\mu_i) - \frac{N}{2}\log\sigma_i^2 + \log P_i$$

– This expression cannot be reduced further

– The decision boundaries are quadratic: hyper-ellipses

– The equiprobable contours are hyper-spheres aligned with the feature axis
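– A minimal sketch of the Case 4 discriminant, where each class keeps its own spherical variance $\sigma_i^2$ (means, variances, and priors assumed known):

```python
import numpy as np

def spherical_discriminant(x, mu, var, prior):
    """g_i(x) = -||x-mu||^2 / (2*var) - N/2 * log(var) + log(prior), for Sigma_i = var * I."""
    diff = x - mu
    return -0.5 * diff @ diff / var - 0.5 * len(x) * np.log(var) + np.log(prior)
```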


• Example – Three-class 2D problem with equal priors

$$\mu_1 = [3\ \ 2]^T \qquad \mu_2 = [5\ \ 4]^T \qquad \mu_3 = [2\ \ 5]^T$$

$$\Sigma_1 = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} \qquad \Sigma_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad \Sigma_3 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$$



Case 5: $\Sigma_i \neq \Sigma_j$ (general case)

• We already derived the expression for the general case

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\log|\Sigma_i| + \log P_i$$

– Reorganizing terms in a quadratic form yields

$$g_i(x) = x^T W_{2,i}\, x + w_{1,i}^T x + w_{0,i}$$

where

๐‘Š2,๐‘– = โˆ’1

2ฮฃ๐‘–โˆ’1

๐‘ค1,๐‘– = ฮฃ๐‘–โˆ’1๐œ‡๐‘–

๐‘ค๐‘œ,๐‘– = โˆ’1

2๐œ‡๐‘–๐‘‡ฮฃ๐‘–

โˆ’1๐œ‡๐‘– โˆ’1

2log ฮฃ๐‘– + ๐‘™๐‘œ๐‘”๐‘ƒ๐‘–

– The equiprobable contours are hyper-ellipses, oriented with the eigenvectors of $\Sigma_i$ for that class

– The decision boundaries are again quadratic: hyper-ellipses or hyper-paraboloids

– Notice that the quadratic expression in the discriminant is proportional to the Mahalanobis distance for covariance $\Sigma_i$
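– A minimal sketch of how these per-class coefficients could be assembled so that $g_i(x)$ is evaluated directly from the quadratic form (names are illustrative; parameters assumed known):

```python
import numpy as np

def quadratic_form_coeffs(mu, Sigma, prior):
    """Return (W2, w1, w0) so that g_i(x) = x^T W2 x + w1^T x + w0 for this class."""
    Sigma_inv = np.linalg.inv(Sigma)
    W2 = -0.5 * Sigma_inv
    w1 = Sigma_inv @ mu
    w0 = (-0.5 * mu @ Sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))
    return W2, w1, w0

def g_from_coeffs(x, W2, w1, w0):
    """Evaluate the quadratic discriminant from its coefficients."""
    return float(x @ W2 @ x + w1 @ x + w0)
```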


• Example – Three-class 2D problem with equal priors

$$\mu_1 = [3\ \ 2]^T \qquad \mu_2 = [5\ \ 4]^T \qquad \mu_3 = [3\ \ 4]^T$$

$$\Sigma_1 = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \qquad \Sigma_2 = \begin{bmatrix} 1 & -1 \\ -1 & 7 \end{bmatrix} \qquad \Sigma_3 = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 3 \end{bmatrix}$$



Numerical example

• Derive a linear DF for the following 2-class 3D problem

$$\mu_1 = [0\ \ 0\ \ 0]^T; \quad \mu_2 = [1\ \ 1\ \ 1]^T; \quad \Sigma_1 = \Sigma_2 = \begin{bmatrix} 0.25 & 0 & 0 \\ 0 & 0.25 & 0 \\ 0 & 0 & 0.25 \end{bmatrix}; \quad P_2 = 2P_1$$

– Solution

$$g_1(x) = -\frac{1}{2\sigma^2}(x-\mu_1)^T (x-\mu_1) + \log P_1 = -\frac{1}{2}\begin{bmatrix} x-0 \\ y-0 \\ z-0 \end{bmatrix}^T \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix} \begin{bmatrix} x-0 \\ y-0 \\ z-0 \end{bmatrix} + \log\frac{1}{3}$$

โ€ข ๐‘”2 ๐‘ฅ = โˆ’1

2

๐‘ฅ โˆ’ 1๐‘ฆ โˆ’ 1๐‘ง โˆ’ 1

๐‘‡4

44

๐‘ฅ โˆ’ 1๐‘ฆ โˆ’ 1๐‘ง โˆ’ 1

+ ๐‘™๐‘œ๐‘”2

3

โ€ข ๐‘”1 ๐‘ฅ๐œ”1><๐œ”2

๐‘”2 ๐‘ฅ โ‡’ โˆ’2 ๐‘ฅ2 + ๐‘ฆ2 + ๐‘ง2 + ๐‘™๐‘”1

3 ๐œ”1><๐œ”2

โˆ’ 2 ๐‘ฅ โˆ’ 1 2 + ๐‘ฆ โˆ’ 1 2 + ๐‘ง โˆ’ 1 2 + lg2

3

โ€ข ๐‘ฅ + ๐‘ฆ + ๐‘ง ๐œ”2><๐œ”1

6โˆ’๐‘™๐‘œ๐‘”2

4= 1.32

– Classify the test example $x_u = [0.1\ \ 0.7\ \ 0.8]^T$

$$0.1 + 0.7 + 0.8 = 1.6 \;\underset{\omega_1}{\overset{\omega_2}{\gtrless}}\; 1.32 \;\Rightarrow\; x_u \in \omega_2$$
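– A short sketch that checks this numerical example with the general quadratic discriminant (the helper mirrors the formula derived earlier; parameters are taken from the problem statement):

```python
import numpy as np

def g(x, mu, Sigma, prior):
    """Quadratic discriminant for a Gaussian class."""
    d = x - mu
    return (-0.5 * d @ np.linalg.solve(Sigma, d)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

mu1, mu2 = np.zeros(3), np.ones(3)
Sigma = 0.25 * np.eye(3)
P1, P2 = 1/3, 2/3                    # since P2 = 2*P1 and the priors sum to 1
x_u = np.array([0.1, 0.7, 0.8])

print(g(x_u, mu2, Sigma, P2) > g(x_u, mu1, Sigma, P1))  # True -> x_u assigned to omega_2
print(x_u.sum(), (6 - np.log(2)) / 4)                   # 1.6 vs. the threshold (6 - log 2)/4
```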


Conclusions

• The examples in this lecture illustrate the following points

– The Bayes classifier for Gaussian classes (general case) is quadratic
– The Bayes classifier for Gaussian classes with equal covariance is linear
– The Mahalanobis distance classifier is Bayes-optimal for
  • normally distributed classes and
  • equal covariance matrices and
  • equal priors
– The Euclidean distance classifier is Bayes-optimal for
  • normally distributed classes and
  • equal covariance matrices proportional to the identity matrix and
  • equal priors
– Both the Euclidean and Mahalanobis distance classifiers are linear classifiers

• Thus, some of the simplest and most popular classifiers can be derived from decision-theoretic principles
– Using a specific (Euclidean or Mahalanobis) minimum-distance classifier implicitly corresponds to certain statistical assumptions
– Whether these assumptions hold can rarely be answered in practice; in most cases we can only determine whether the classifier solves our problem