EE 6885 Statistical Pattern Recognition


Transcript of EE 6885 Statistical Pattern Recognition

Page 1: EE 6885 Statistical Pattern Recognition


EE 6885 Statistical Pattern Recognition

Fall 2005, Prof. Shih-Fu Chang

http://www.ee.columbia.edu/~sfchang

Review: Final Exam (12/12/2005)


Final Exam: Dec. 16th (Friday), 1:10-3 pm, Mudd Rm 644

Page 2: EE 6885 Statistical Pattern Recognition


Chap 5: Linear Discriminant Functions


Linear discriminant classifiers find the weight vector w and bias $w_0$, combined into an augmented weight vector a.

Augmented vectors: $y = [1, x_1, \ldots, x_d]^t = [1, x^t]^t$, $\quad a = [w_0, w_1, \ldots, w_d]^t = [w_0, w^t]^t$

$g(x) = w^t x + w_0 \;\Rightarrow\; g(x) = g(y) = a^t y$

Map y to class $\omega_1$ if $g(y) > 0$, otherwise to class $\omega_2$.

Normalization: $\forall\, y_i$ in class $\omega_2$, $y_i \leftarrow -y_i$

Design objective: $a^t y_i > b, \;\forall i$

(The solution can be visualized either in y-space or in a-space.)
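As a concrete illustration of the augmented, sign-normalized representation, here is a minimal Python sketch (assuming NumPy; the weight values are made up for illustration and are not from the course):

```python
import numpy as np

def augment(X):
    """Map each sample x to y = [1, x1, ..., xd]^T."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def sign_normalize(Y, labels):
    """Negate the augmented samples of class omega_2, so the goal becomes a^T y_i > 0 for all i."""
    Y = Y.copy()
    Y[labels == 2] *= -1
    return Y

X = np.array([[1., 2.], [2., 0.], [3., 1.], [2., 3.]])   # same samples as the MSE example below
labels = np.array([1, 1, 2, 2])
Y = sign_normalize(augment(X), labels)

a = np.array([3.5, -1., -1.])      # hypothetical augmented weights [w0, w1, w2]
print(Y @ a > 0)                    # True wherever a^T y_i > 0, i.e. correctly classified
```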

Page 3: EE 6885 Statistical Pattern Recognition


Minimal Squared-Error Solution

Objective: $a^t y_i = b_i, \;\forall i$, i.e. solve $Ya = b$.

$J_s(a) = \|Ya - b\|^2 = (Ya - b)^t (Ya - b) = \sum_{i=1}^{n} (a^t y_i - b_i)^2$

Training sample matrix: $Y = [y_1, y_2, \ldots, y_n]^t$, dimension n x (d+1).

$\nabla_a J_s = 2\, Y^t (Ya - b) = 0 \;\Rightarrow\; a = (Y^t Y)^{-1} Y^t b = Y^\dagger b$

Pseudo-inverse: $Y^\dagger = (Y^t Y)^{-1} Y^t$, dimension (d+1) x n.

Example. Training samples, class $\omega_1$: $(1,2)^t, (2,0)^t$; class $\omega_2$: $(3,1)^t, (2,3)^t$.

$Y = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$

Find $Y^\dagger$, then compute $a^* = Y^\dagger b$.
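A quick numerical check of this example (a sketch using NumPy, which is an assumption here; `np.linalg.pinv` computes the same pseudo-inverse via the SVD):

```python
import numpy as np

# Augmented, sign-normalized training matrix Y and target vector b from the example above
Y = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
b = np.ones(4)

Y_pinv = np.linalg.pinv(Y)   # pseudo-inverse; equals (Y^T Y)^{-1} Y^T when Y has full column rank
a_star = Y_pinv @ b          # MSE solution a* = Y^dagger b,  approx. [3.667, -1.333, -0.667]
print(a_star)

# Check the design objective: here Y @ a* equals b exactly, so every a^T y_i = 1 > 0
print(Y @ a_star)
```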


Vector Derivative (Gradient) and Chain Rule

Consider a scalar function of a vector input, $J(x)$.

Vector derivative (gradient): $\nabla_x J = [\partial J / \partial x_1, \; \partial J / \partial x_2, \; \ldots, \; \partial J / \partial x_d]^t$

Inner product: $J = a^t b = \sum_k a_k b_k \;\Rightarrow\; \nabla_a (a^t b) = b, \quad \nabla_b (a^t b) = a$

Quadratic (Hermitian) form: $J = x^t A x = \sum_i \sum_j x_i A_{ij} x_j \;\Rightarrow\; \nabla_x (x^t A x) = A x + A^t x$

Generalized chain rule: now consider $x = A x'$, i.e. $x_i = \sum_j A_{ij} x'_j$, so $\partial x_i / \partial x'_j = A_{ij}$

$\Rightarrow\; \nabla_{x'} J = \left( \frac{\partial x_i}{\partial x'_j} \right)^t \nabla_x J = A^t \nabla_x J$

Matrix-vector multiplication: for a scalar $J$ that depends on b through $Ab$, the chain rule gives $\nabla_b J = A^t \nabla_{(Ab)} J$.

HW#5 P.1
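These identities are easy to sanity-check numerically. A small sketch using finite differences (NumPy assumed; this is an illustration, not part of the homework):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
a, b, x = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
A = rng.normal(size=(d, d))

def num_grad(f, v, eps=1e-6):
    """Central-difference gradient of a scalar function f at v."""
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v); e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

# grad_a (a^T b) = b
print(np.allclose(num_grad(lambda u: u @ b, a), b))
# grad_x (x^T A x) = A x + A^T x
print(np.allclose(num_grad(lambda u: u @ A @ u, x), A @ x + A.T @ x))
# chain rule: J(x') = ||A x'||^2  =>  grad_{x'} J = A^T grad_u (u^T u)|_{u=Ax'} = 2 A^T A x'
print(np.allclose(num_grad(lambda v: np.sum((A @ v) ** 2), x), 2 * A.T @ A @ x))
```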

Page 4: EE 6885 Statistical Pattern Recognition


Chap. 5.11 and Burges '98 paper: Support Vector Machine


Support Vector Machine (tutorial by Burges ‘98)

Linearly separable case: look for the separating plane with the highest margin.

Decision boundary H: $w^t x + b = 0$

Constraints:
$w^t x_i + b \ge +1$ for $x_i$ in class $\omega_1$, i.e. $y_i = +1$
$w^t x_i + b \le -1$ for $x_i$ in class $\omega_2$, i.e. $y_i = -1$

Combined inequality constraints: $y_i (w^t x_i + b) - 1 \ge 0, \;\forall i$

Two parallel hyperplanes define the margin:
hyperplane $H_1$ ($H_+$): $w^t x_i + b = +1$
hyperplane $H_2$ ($H_-$): $w^t x_i + b = -1$

Margin: sum of the distances of the closest points (on each side) to the separating plane; margin $= 2 / \|w\|$.

The best plane is defined by w and b.

HW#5 P.2

Page 5: EE 6885 Statistical Pattern Recognition


Finding the maximal margin

Use the Lagrange multiplier technique for the constrained optimization problem.

Primal problem: minimize $\frac{1}{2}\|w\|^2$ subject to the inequality constraints $y_i (w^t x_i + b) - 1 \ge 0, \; i = 1, \ldots, l$.

$L_p = \frac{1}{2} w^t w - \sum_{i=1}^{l} \alpha_i \left( y_i (w^t x_i + b) - 1 \right), \quad \alpha_i \ge 0$

Minimize $L_p$ w.r.t. w and b:

$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i, \qquad \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0$

Dual problem: maximize $L_D$ w.r.t. the multipliers $\alpha_i$ (a quadratic programming problem),

$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

with conditions: $\sum_{i=1}^{l} \alpha_i y_i = 0$ and $\alpha_i \ge 0$.

HW#6 P.1


KKT conditions for the separable case:

$w^* = \sum_{i=1}^{l} \alpha_i y_i x_i$

If $\alpha_i > 0$, then $x_i$ lies on $H_+$ or $H_-$ and is a support vector; points with $\alpha_i = 0$ contribute nothing to $w^*$.

How to compute w and b? How to classify new data?
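A minimal sketch of computing w, b and classifying new data, using scikit-learn's SVC as the quadratic-programming solver (an assumption; the slides do not prescribe a particular solver, and the toy data below is made up). A very large C approximates the hard-margin, separable case:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels y in {+1, -1}
X = np.array([[1., 2.], [2., 0.], [0., 1.], [3., 3.], [4., 2.], [3.5, 4.]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel='linear', C=1e6)       # very large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]                        # w* = sum_i alpha_i y_i x_i
b = clf.intercept_[0]
print("w =", w, " b =", b)
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors (alpha_i > 0):\n", clf.support_vectors_)

# Classify new data with sign(w^T x + b); should agree with clf.predict
x_new = np.array([[2.5, 2.5]])
print(np.sign(x_new @ w + b), clf.predict(x_new))
```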

Page 6: EE 6885 Statistical Pattern Recognition


Non-separable

Add slack variables $\xi_i \ge 0$ (ensure positivity).

If $\xi_i > 1$, then $x_i$ is misclassified (i.e., a training error).

New objective function (again minimized with Lagrange multipliers): $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$, subject to $y_i (w^t x_i + b) \ge 1 - \xi_i$.


All points located inside the margin gap or on the wrong side get $\alpha_i = C$; all support vectors satisfy $0 < \alpha_i \le C$.

When C increases, samples with errors get more weight: better training accuracy, but a smaller margin and hence less generalization performance.

(Figure: decision boundary and margins before and after C increases.)

Page 7: EE 6885 Statistical Pattern Recognition


Mapping to a Higher-Dimensional Space: map the data to a high-dimensional space to make it separable.

Find the SVM in the high-dimensional (embedding) space. We can use the same method (the dual problem) to maximize $L_D$ and find the $\alpha_i$:

$L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, \Phi(x_i) \cdot \Phi(x_j) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j)$

Define the kernel $K(s_i, x) = \Phi(s_i) \cdot \Phi(x)$. With the support vectors $s_i$, $i = 1, \ldots, N_s$:

$g(x) = \sum_{i=1}^{N_s} \alpha_i y_i \, \Phi(s_i) \cdot \Phi(x) + b \;\Rightarrow\; g(x) = \sum_{i=1}^{N_s} \alpha_i y_i \, K(s_i, x) + b$

HW#5 P.2
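A sketch of the kernel form $g(x) = \sum_i \alpha_i y_i K(s_i, x) + b$, using scikit-learn's RBF-kernel SVC (library choice and toy ring data are assumptions, not from the slides); the manual kernel sum should reproduce the library's decision function:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy data that is not linearly separable: an inner ring vs. an outer ring
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[rng.normal(1.0, 0.1, 100), rng.normal(3.0, 0.1, 100)]
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.r_[np.full(100, -1), np.full(100, +1)]

gamma = 0.5
clf = SVC(kernel='rbf', gamma=gamma, C=10.0).fit(X, y)

# g(x) = sum_i alpha_i y_i K(s_i, x) + b over the support vectors s_i
x_new = np.array([[0.0, 1.0]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma=gamma)   # K(s_i, x)
g = (clf.dual_coef_ @ K).ravel() + clf.intercept_          # dual_coef_ holds alpha_i * y_i
print(g[0], clf.decision_function(x_new)[0])               # the two values should match
```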


Chap. 9: Analysis of Learning Algorithms

Page 8: EE 6885 Statistical Pattern Recognition


Bias vs. variance for an estimator

Assume F is a quantity whose value is to be estimated. Randomly draw n samples $D = \{x_1, x_2, \ldots, x_n\}$ from the data source, learn an estimate $g_D$ of F, and repeat multiple times.

Expected estimation error: $E_D\!\left[ (g_D - F)^2 \right]$

$E_D\!\left[ (g_D - F)^2 \right] = \left( E_D[g_D] - F \right)^2 + E_D\!\left[ \left( g_D - E_D[g_D] \right)^2 \right]$ = Bias$^2$ + Variance
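A quick Monte-Carlo illustration of this decomposition (NumPy assumed; the "shrunk mean" estimator and the numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
F = 2.0          # true quantity to estimate (here: the mean of the data source)
n = 20           # samples per draw of D
trials = 10000

# Estimator g_D: a shrunk sample mean -- biased, but with lower variance than the plain mean
estimates = np.array([0.8 * rng.normal(F, 1.0, n).mean() for _ in range(trials)])

mse = np.mean((estimates - F) ** 2)
bias2 = (estimates.mean() - F) ** 2
var = estimates.var()
print(mse, bias2 + var)   # the two numbers agree: E[(g_D - F)^2] = Bias^2 + Variance
```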


Bias vs. variance for classification. Ground truth: 2-D Gaussians.

Complex models have smaller bias but larger variance than simple models. Increasing the training pool size helps reduce the variance. (Occam's Razor principle.)

Page 9: EE 6885 Statistical Pattern Recognition


Boosting

Training data $D = \{x_1, x_2, \ldots, x_n\}$. Randomly draw a subset of $n_1$ samples, $D_1$, and train weak classifier $C_1$. Use the most informative subset $D_2$ from the remaining set to train weak classifier $C_2$, and so on up to classifier $C_k$; the component classifiers are then combined by classifier fusion.

For each component classifier, use the subset of data that is most informative given the current set of component classifiers.


As in AdaBoost Ref.

HW#7 P.2

Page 10: EE 6885 Statistical Pattern Recognition


Final classifier $h_f$

When will the final classifier be incorrect? Suppose c(i) = 0; then $h_f(i)$ is incorrect if

$\sum_{t=1}^{T} \left( \log \beta_t^{-1} \right) h_t(i) \;\ge\; \frac{1}{2} \sum_{t=1}^{T} \log \beta_t^{-1}$,

namely $\prod_{t=1}^{T} \beta_t^{-h_t(i)} \ge \prod_{t=1}^{T} \beta_t^{-1/2}$, i.e. $\prod_{t=1}^{T} \beta_t^{1 - h_t(i)} \ge \prod_{t=1}^{T} \beta_t^{1/2}$.

In general, $h_f(i)$ is incorrect if $\prod_{t=1}^{T} \beta_t^{-|h_t(i) - c(i)|} \ge \left( \prod_{t=1}^{T} \beta_t \right)^{-1/2}$.

Multiply both sides by D(i):

$D(i) \prod_{t=1}^{T} \beta_t^{-|h_t(i) - c(i)|} \;\ge\; D(i) \left( \prod_{t=1}^{T} \beta_t \right)^{-1/2}$

$\Rightarrow\; w_i^{T+1} \ge D(i) \left( \prod_{t=1}^{T} \beta_t \right)^{1/2}$  (?)


Theorem 1 in Ref.: $\sum_{i=1}^{N} w_i^{t+1} \le \sum_{i=1}^{N} w_i^{t} \left( 1 - (1 - E_t)(1 - \beta_t) \right)$

With $\beta_t = E_t / (1 - E_t)$: $\sum_{i=1}^{N} w_i^{t+1} \le \sum_{i=1}^{N} w_i^{t} \, (2 E_t)$

$h_f(i)$ is incorrect only if $w_i^{T+1} \ge D(i) \left( \prod_{t=1}^{T} \beta_t \right)^{1/2}$, so

$\sum_{i:\, h_f(i) \ne c(i)} w_i^{T+1} \;\ge\; \left( \prod_{t=1}^{T} \beta_t \right)^{1/2} \sum_{i:\, h_f(i) \ne c(i)} D(i) \;=\; \left( \prod_{t=1}^{T} \beta_t \right)^{1/2} E$

where $E = \sum_{i:\, h_f(i) \ne c(i)} D(i)$ is the error of the final classifier. Also

$\sum_{i=1}^{N} w_i^{T+1} \le \prod_{t=1}^{T} (2 E_t)$  (?)

$\therefore\; E \le \prod_{t=1}^{T} (2 E_t) \Big/ \left( \prod_{t=1}^{T} \beta_t \right)^{1/2} = \prod_{t=1}^{T} 2 \sqrt{E_t (1 - E_t)}$  (?)

... Fill in the details to complete HW#7 P.2, using $\beta_t = \dfrac{E_t}{1 - E_t}$.
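The derivation above uses the weight update $w_i \leftarrow w_i \, \beta_t^{1 - |h_t(x_i) - c(i)|}$ with $\beta_t = E_t / (1 - E_t)$. A minimal sketch of that AdaBoost loop in Python, assuming decision stumps as the weak classifiers (the slides leave the weak learner unspecified):

```python
import numpy as np

def adaboost_train(X, c, T=10):
    """Minimal AdaBoost sketch; labels c are in {0, 1}.
    Weight update: w_i <- w_i * beta_t^(1 - |h_t(x_i) - c_i|), beta_t = E_t / (1 - E_t)."""
    N = len(c)
    w = np.full(N, 1.0 / N)                      # initial weights D(i)
    stumps, betas = [], []
    for _ in range(T):
        p = w / w.sum()                          # distribution used for the weighted error E_t
        best = None
        for f in range(X.shape[1]):              # exhaustive search over decision stumps
            for thr in np.unique(X[:, f]):
                for flip in (0, 1):
                    h = (X[:, f] > thr).astype(int) ^ flip
                    E = np.sum(p * np.abs(h - c))
                    if best is None or E < best[0]:
                        best = (E, f, thr, flip)
        E_t, f, thr, flip = best
        beta = max(E_t, 1e-12) / (1.0 - E_t)     # beta_t = E_t / (1 - E_t), clamped for E_t = 0
        h = (X[:, f] > thr).astype(int) ^ flip
        w = w * beta ** (1.0 - np.abs(h - c))    # correctly classified samples lose weight
        stumps.append((f, thr, flip))
        betas.append(beta)
    return stumps, betas

def adaboost_predict(X, stumps, betas):
    """h_f(x) = 1 iff sum_t log(1/beta_t) h_t(x) >= (1/2) sum_t log(1/beta_t)."""
    score = np.zeros(len(X))
    half = 0.5 * np.sum(np.log(1.0 / np.array(betas)))
    for (f, thr, flip), beta in zip(stumps, betas):
        score += np.log(1.0 / beta) * ((X[:, f] > thr).astype(int) ^ flip)
    return (score >= half).astype(int)

# Tiny made-up example
X = np.array([[0.1], [0.4], [0.35], [0.8], [0.9], [0.6]])
c = np.array([0, 0, 0, 1, 1, 1])
stumps, betas = adaboost_train(X, c, T=5)
print(adaboost_predict(X, stumps, betas))   # reproduces c on this toy set
```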

Page 11: EE 6885 Statistical Pattern Recognition


AdaBoost Learning: the first two features after feature selection.


Cascade classifier for efficiency: break a large classifier into a cascade of smaller classifiers.

E.g., 200 features to {1, 10, 25, 50, 50}

Adjust the threshold in the early stages so that unlikely regions are rejected quickly.

Design tradeoffs: the number of features in each classifier, the threshold used in each classifier, and the number of classifiers.

Add stages until the objective in precision-recall (P-R) is met.
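A minimal sketch of how a cascade evaluates a candidate region, with hypothetical stage functions and thresholds (everything below is illustrative, not the course's actual detector):

```python
def cascade_classify(x, stages):
    """Evaluate a cascade: each stage is (score_fn, threshold).
    Reject as soon as any stage's score falls below its threshold."""
    for score_fn, threshold in stages:
        if score_fn(x) < threshold:
            return 0          # rejected early: most regions exit here cheaply
    return 1                  # passed every stage

# Hypothetical stages: cheap classifiers (few features) first, larger ones later
stages = [
    (lambda x: x[0],              0.2),   # 1-feature stage
    (lambda x: x[0] + x[1],       0.8),   # small stage
    (lambda x: sum(x) / len(x),   0.5),   # larger stage
]
print(cascade_classify([0.9, 0.4, 0.6, 0.3], stages))
```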

Page 12: EE 6885 Statistical Pattern Recognition


Mixture of Experts: each component classifier is treated as an expert; the predictions from the experts are pooled and fused by a gating subsystem.

$P(y \mid x, \Theta) = \sum_{r=1}^{k} P(r \mid x, \theta_0) \, P(y \mid x, \theta_r)$

where x is the input pattern and y is the output.

$P(r \mid x, \theta_0)$ determines the mixture priors, i.e., the gating weights.

Maximize the data likelihood $l(D, \Theta) = \sum_i \ln \sum_{r=1}^{k} P(r \mid x_i, \theta_0) \, P(y_i \mid x_i, \theta_r)$ by gradient descent or EM.
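A small sketch of the pooled prediction $P(y \mid x) = \sum_r P(r \mid x, \theta_0) P(y \mid x, \theta_r)$, with a softmax gate and softmax experts as an assumed (illustrative) parameterization:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts_predict(x, gate_W, expert_Ws):
    """A softmax gate P(r|x, theta_0) weights each expert's class posterior P(y|x, theta_r)."""
    gate = softmax(gate_W @ x)                                     # one weight per expert
    expert_out = np.array([softmax(W @ x) for W in expert_Ws])     # one posterior per expert
    return gate @ expert_out                                       # pooled class posterior

rng = np.random.default_rng(0)
d, k, n_classes = 3, 2, 2
x = rng.normal(size=d)
gate_W = rng.normal(size=(k, d))                                    # hypothetical gating params theta_0
expert_Ws = [rng.normal(size=(n_classes, d)) for _ in range(k)]     # hypothetical expert params theta_r
print(mixture_of_experts_predict(x, gate_W, expert_Ws))             # a distribution over classes
```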


Chap. 10: Feature Dimension Reduction and Clustering

Page 13: EE 6885 Statistical Pattern Recognition


PCA for feature dimension reduction: approximate the data with reduced dimensions.

1-D approximation: $\hat{x} = m + a\,e$, where m is the sample mean.

Approximation error:

$J_1(e) = \sum_{k=1}^{n} \|\hat{x}_k - x_k\|^2 = \sum_{k=1}^{n} \|(m + a_k e) - x_k\|^2$

$= \sum_{k=1}^{n} a_k^2 \|e\|^2 - 2 \sum_{k=1}^{n} a_k \, e^t (x_k - m) + \sum_{k=1}^{n} \|x_k - m\|^2$

With $\|e\| = 1$ and the optimal coefficients $a_k = e^t (x_k - m)$ (set $\partial J_1 / \partial a_k = 0$):

$= -\sum_{k=1}^{n} \left[ e^t (x_k - m) \right]^2 + \sum_{k=1}^{n} \|x_k - m\|^2 = -\sum_{k=1}^{n} e^t (x_k - m)(x_k - m)^t e + \sum_{k=1}^{n} \|x_k - m\|^2 = -e^t S e + \sum_{k=1}^{n} \|x_k - m\|^2$

S: scatter matrix $= (n - 1) \times$ sample covariance.

The optimal e minimizing the error $J_1$ is the eigenvector of S with the largest eigenvalue.

Multi-dimensional approximation: $\hat{x} = m + \sum_{i=1}^{d'} a_i e_i$; what are the optimal $e_i$?
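A NumPy sketch of the 1-D case (library choice and toy data are assumptions): compute the scatter matrix, take its top eigenvector, and verify the two expressions for the error $J_1$ agree.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=200)   # n x d data

m = X.mean(axis=0)
Xc = X - m
S = Xc.T @ Xc                                   # scatter matrix = (n-1) * sample covariance

eigvals, eigvecs = np.linalg.eigh(S)            # eigh: S is symmetric
e = eigvecs[:, np.argmax(eigvals)]              # eigenvector with the largest eigenvalue

a = Xc @ e                                      # optimal coefficients a_k = e^T (x_k - m)
X_hat = m + np.outer(a, e)                      # 1-D approximation x_hat = m + a e
print("residual error J1        =", np.sum((X_hat - X) ** 2))
print("sum ||x_k - m||^2 - e'Se =", np.sum(Xc ** 2) - e @ S @ e)   # same value
```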


Independent Component Analysis: seek the most independent directions, instead of minimizing the representation error (sum of squared errors) as in PCA. Example application: blind source separation of speech mixtures.

f: sigmoid, $f(x) = 1 / (1 + e^{-x})$

Page 14: EE 6885 Statistical Pattern Recognition


Find the best weights to make the output components independent. How do we measure independence?

A linear combination of random variables tends toward a Normal distribution, so use higher-order statistics to measure non-Gaussianity; apply gradient descent to the weights to discover each component.

(from Ellis)

FastICA Matlab package : http://www.cis.hut.fi/projects/ica/fastica/
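Besides the MATLAB package above, scikit-learn provides a FastICA implementation; a sketch of the blind-source-separation setup (the toy sources and mixing matrix below are made up):

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent, non-Gaussian sources, linearly mixed (blind source separation setup)
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),        # square-ish wave
          rng.laplace(size=t.size)]      # heavy-tailed noise source
A = np.array([[1.0, 0.5],
              [0.7, 1.0]])               # unknown mixing matrix
X = S @ A.T                              # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)             # recovered sources (up to scale and permutation)

C = np.abs(np.corrcoef(S_hat.T, S.T))[:2, 2:]   # |corr| between recovered and true sources
print(C.round(2))                                # each recovered component matches one true source
```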

(Figure: directions recovered by PCA vs. ICA.)


LDA: Linear Discriminant Analysis

Given a set of data $x_1, x_2, \ldots, x_n$ and their class labels, find the best projection direction w, $y_i = w^t x_i$, so that the projected points $y_i$ are most separable.

$m_i$: sample means; sample means of the projected points: $\tilde{m}_i = \frac{1}{n_i} \sum_{x \in D_i} w^t x = w^t m_i$

Scatter of the projected points: $\tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2$; $\tilde{s}_1^2 + \tilde{s}_2^2$: within-class scatter.

LDA maximizes the criterion function $J(w) = \dfrac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$

(Figure: projection directions found by LDA vs. PCA, with projected means $\tilde{m}_1, \tilde{m}_2$ and scatters $\tilde{s}_1^2, \tilde{s}_2^2$.)

Page 15: EE 6885 Statistical Pattern Recognition


LDA Scatter Matrices

Before projection: $S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t$

After projection: $\tilde{s}_i^2 = w^t S_i w$, so $\tilde{s}_1^2 + \tilde{s}_2^2 = w^t (S_1 + S_2) w = w^t S_W w$

$S_W = S_1 + S_2$: within-class scatter matrix.

Similarly, the between-class scatter matrix is $S_B = (m_1 - m_2)(m_1 - m_2)^t$, and $|\tilde{m}_1 - \tilde{m}_2|^2 = w^t S_B w$

$\Rightarrow\; J(w) = \dfrac{w^t S_B w}{w^t S_W w}$

$w_{opt} = \arg\max_w J(w) = S_W^{-1} (m_1 - m_2)$

Recall the Gaussian case: $w = \Sigma^{-1} (\mu_i - \mu_j)$, the mean-difference vector in the PCA space.

(P. 25-28 of Chap. 10)
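A short NumPy sketch of $w_{opt} = S_W^{-1}(m_1 - m_2)$ and the Fisher criterion on made-up two-class Gaussian data (library choice and data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 0.5]], size=100)
X2 = rng.multivariate_normal([2, 1], [[1.0, 0.3], [0.3, 0.5]], size=100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)                  # per-class scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
SW = S1 + S2                                   # within-class scatter matrix

w = np.linalg.solve(SW, m1 - m2)               # w_opt = SW^{-1} (m1 - m2)
w /= np.linalg.norm(w)

# Fisher criterion J(w) = (m1_proj - m2_proj)^2 / (s1^2 + s2^2) for the projected data
y1, y2 = X1 @ w, X2 @ w
J = (y1.mean() - y2.mean()) ** 2 / (np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2))
print("w =", w, " J(w) =", J)
```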


Multi-Dimensional Scaling (MDS): visualize the data points in a lower-dimensional space.

How to preserve the original structure (e.g., distances)?

Optimization criteria ($\delta_{ij}$: original dissimilarities, $d_{ij}$: distances between the new locations $y_i$):

$J_{ee} = \dfrac{\sum_{i<j} (d_{ij} - \delta_{ij})^2}{\sum_{i<j} \delta_{ij}^2}, \qquad J_{ff} = \sum_{i<j} \dfrac{(d_{ij} - \delta_{ij})^2}{\delta_{ij}^2}$

Use gradient descent to find the new locations:

$\nabla_{y_k} J_{ee} = \dfrac{2}{\sum_{i<j} \delta_{ij}^2} \sum_{j \ne k} (d_{kj} - \delta_{kj}) \, \dfrac{y_k - y_j}{d_{kj}}$

Sometimes rank order is more important
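A minimal gradient-descent sketch on a plain sum-of-squared-difference stress (the numerator of $J_{ee}$, without the normalization; step size, iteration count, and the toy square layout are assumptions):

```python
import numpy as np

def mds(delta, dim=2, iters=2000, lr=0.01, seed=0):
    """Gradient descent moving points y_k so that low-dim distances d_ij match delta_ij."""
    n = delta.shape[0]
    Y = np.random.default_rng(seed).normal(scale=0.1, size=(n, dim))
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]            # y_i - y_j
        d = np.linalg.norm(diff, axis=2) + np.eye(n)    # +I avoids divide-by-zero on the diagonal
        grad = np.zeros_like(Y)
        for k in range(n):
            w = (d[k] - delta[k]) / d[k]                # (d_kj - delta_kj) / d_kj
            w[k] = 0.0
            grad[k] = 2 * (w[:, None] * diff[k]).sum(axis=0)
        Y -= lr * grad
    return Y

# Example: recover a 2-D layout from the pairwise distances of 4 known points (a unit square)
P = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
delta = np.linalg.norm(P[:, None] - P[None, :], axis=2)
Y = mds(delta)
# The recovered pairwise distances should approximately match delta (up to rotation/translation)
print(np.round(np.linalg.norm(Y[:, None] - Y[None, :], axis=2), 2))
```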

Page 16: EE 6885 Statistical Pattern Recognition


Classification vs. Clustering

(Figure, left: labeled data (+ / -) in the $(x_1, x_2)$ plane with a decision boundary. Data with labels: supervised learning, find decision boundaries.)

(Figure, right: unlabeled data in the $(x_1, x_2)$ plane. Data without labels: unsupervised learning, find data structures and clusters.)


Review: Mixture Of Gaussians

Model the data distribution as a GMM:

$p(x) = \sum_z p(z)\, p(x \mid z) = \sum_z \pi_z \, N(x \mid \mu_z, \Sigma_z)$, with $N(x \mid \mu_z, \Sigma_z) = \dfrac{1}{(2\pi)^{D/2} |\Sigma_z|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_z)^t \Sigma_z^{-1} (x - \mu_z)}$

Given data $x_1, \ldots, x_N$, the log-likelihood (two-component case, parameters $\theta = \{\mu_0, \Sigma_0, \mu_1, \Sigma_1\}$):

$l = \sum_{n=1}^{N} \log \left( \pi_0 \, N(x_n \mid \mu_0, \Sigma_0) + \pi_1 \, N(x_n \mid \mu_1, \Sigma_1) \right)$

Posterior probability of x being generated by cluster i: $\tau_i = p(z = i \mid x, \theta)$

Optimization: find $\{\mu_0, \Sigma_0, \mu_1, \Sigma_1\}$ and the mixture priors $\pi_z$ to maximize the likelihood.

(Figure: data scatter and the mixture density p(x) with priors $\pi_0$, $\pi_1$.)
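A short sketch of fitting such a mixture with EM via scikit-learn's GaussianMixture (library choice and the two-blob toy data are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2) * 0.5, 150),
               rng.multivariate_normal([3, 3], np.eye(2) * 1.0, 100)])

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X)                                        # EM: maximizes the log-likelihood

print("mixture priors pi_z:", gmm.weights_)
print("means mu_z:\n", gmm.means_)
print("log-likelihood:", gmm.score(X) * len(X))   # score() is the mean per-sample log-likelihood

# Posterior tau_i = p(z = i | x, theta) for a new point: a 'soft' cluster assignment
print(gmm.predict_proba([[1.5, 1.5]]))
```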

Page 17: EE 6885 Statistical Pattern Recognition


GMM for Clustering

Given the estimated GMM model $\theta = \{\mu_0, \Sigma_0, \mu_1, \Sigma_1, \pi\}$, compute the probability that x was generated by cluster i:

Posteriors: $\tau_i = p(z = i \mid x, \theta)$

Expectation step: $\tau_i^{(n)} = \dfrac{\pi_i^{(t)} \, N\!\left( x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)} \right)}{\sum_j \pi_j^{(t)} \, N\!\left( x_n \mid \mu_j^{(t)}, \Sigma_j^{(t)} \right)}$

Each sample is assigned to every cluster with a 'soft' decision.


Comparison: K-Means Clustering

K-means clustering: fix K, choose an initial representative (center) $C_k$ for each cluster, then map each sample to its closest cluster:

for i = 1, 2, ..., N: $x_i \rightarrow C_k$ if $Dist(x_i, C_k) < Dist(x_i, C_{k'})$ for all $k' \ne k$

Re-compute the centers and repeat. This is a hard decision (each sample belongs to exactly one cluster); the result can be used to initialize the EM for a GMM.

(Figure: samples $x^{(1)}, x^{(2)}$ assigned to clusters $C_1, C_2, C_3, \ldots, C_K$.)
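A plain NumPy sketch of this loop (initialization scheme and toy data are assumptions):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means: hard-assign each sample to its closest center, re-compute the
    centers, and repeat until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]     # initial representatives
    assign = np.full(len(X), -1)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # Dist(x_i, C_k)
        new_assign = d.argmin(axis=1)                       # hard decision
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)    # re-compute the centers
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)), rng.normal([3, 3], 0.5, (100, 2))])
centers, assign = kmeans(X, K=2)
print(centers)     # near the true cluster means; could also seed the EM for a GMM
```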

Page 18: EE 6885 Statistical Pattern Recognition


Hierarchical Clustering: add hierarchical structure to the clusters.

Many real-world problems have such hierarchical structure, e.g., biological or semantic taxonomies.

Agglomerative vs. divisive; visualize the merges with a dendrogram.

Use a large gap in similarity to choose a suitable number of clusters (clustering validity).


Distances (or similarities) used for merging:

$d_{\min}(D_i, D_j) = \min_{x \in D_i,\, x' \in D_j} \|x - x'\|, \qquad d_{\max}(D_i, D_j) = \max_{x \in D_i,\, x' \in D_j} \|x - x'\|$

$d_{\min}$: nearest-neighbor (minimum) algorithm; merging yields the minimum-distance spanning tree, but it is sensitive to noise and outliers.

$d_{\max}$: farthest-neighbor (maximum) algorithm; use a distance threshold to avoid large-diameter clusters, which discourages elongated clusters.

HW#8 P.2
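A sketch of agglomerative clustering with both linkage rules, using SciPy (library choice and the three-blob toy data are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.3, (20, 2)),
               rng.normal([3, 0], 0.3, (20, 2)),
               rng.normal([1.5, 3], 0.3, (20, 2))])

# 'single'  = nearest-neighbor merging (d_min), 'complete' = farthest-neighbor (d_max)
Z_single = linkage(X, method='single')
Z_complete = linkage(X, method='complete')

# The third column of Z holds the merge distances; a large gap between successive
# merges suggests a suitable number of clusters (clustering validity).
print(Z_complete[-4:, 2])

labels = fcluster(Z_complete, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
print(np.bincount(labels))
```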