EE 6885 Statistical Pattern Recognition
Transcript of EE 6885 Statistical Pattern Recognition
Review-Final-1EE6887-Chang
EE 6885 Statistical Pattern Recognition
Fall 2005, Prof. Shih-Fu Chang
http://www.ee.columbia.edu/~sfchang
Review: Final Exam (12/12/2005)
Final Exam: Dec. 16th (Friday), 1:10-3 pm, Mudd Rm 644
Chap 5: Linear Discriminant Functions
Linear discriminant classifiers find the weight $\mathbf{w}$ and bias $w_0$.

Augmented vectors:
$$\mathbf{y} = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{bmatrix} = \begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix}, \qquad \mathbf{a} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix} = \begin{bmatrix} w_0 \\ \mathbf{w} \end{bmatrix}$$

$$g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 \;\Rightarrow\; g(\mathbf{x}) = g(\mathbf{y}) = \mathbf{a}^t \mathbf{y}$$

Map $\mathbf{y}$ to class $\omega_1$ if $g(\mathbf{y}) > 0$, otherwise to class $\omega_2$.

Normalization: $\forall\, \mathbf{y}_i$ in class $\omega_2$, $\mathbf{y}_i \leftarrow -\mathbf{y}_i$.

Design objective: $\mathbf{a}^t \mathbf{y}_i > b, \ \forall\, \mathbf{y}_i$ (illustrated both in $\mathbf{y}$-space and in $\mathbf{a}$-space).
Minimal Squared-Error Solution

$$J_s(\mathbf{a}) = \|Y\mathbf{a} - \mathbf{b}\|^2 = (Y\mathbf{a} - \mathbf{b})^t (Y\mathbf{a} - \mathbf{b}) = \sum_{i=1}^n (\mathbf{a}^t \mathbf{y}_i - b_i)^2$$

Training sample matrix $Y$: rows $\mathbf{y}_1^t, \mathbf{y}_2^t, \dots, \mathbf{y}_n^t$; dimension $n \times (d+1)$.

$$\nabla_\mathbf{a} J_s = 2\,Y^t (Y\mathbf{a} - \mathbf{b}) = 0 \;\Rightarrow\; \mathbf{a} = (Y^t Y)^{-1} Y^t \mathbf{b} = Y^\dagger \mathbf{b}$$

where $Y^\dagger = (Y^t Y)^{-1} Y^t$ is the pseudo-inverse, dimension $(d+1) \times n$.

Objective: $\mathbf{a}^t \mathbf{y}_i = b_i, \ \forall\, \mathbf{y}_i$.

Example. Training samples, class $\omega_1$: $(1,2)^t, (2,0)^t$; class $\omega_2$: $(3,1)^t, (2,3)^t$ (class-$\omega_2$ rows negated by the normalization step):

$$Y = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$

Find $Y^\dagger$, then compute $\mathbf{a}^* = Y^\dagger \mathbf{b}$.
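The example above can be worked numerically. A minimal sketch using numpy's pseudo-inverse (the matrix $Y$ and target $\mathbf{b}$ are exactly the slide's example; the check via the normal equations is our addition):

```python
import numpy as np

# MSE solution a = (Y^T Y)^(-1) Y^T b = pinv(Y) b for the slide's example.
# Class w1: (1,2), (2,0); class w2: (3,1), (2,3) -- the class-2 rows are
# negated by the normalization step y <- -y.
Y = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
b = np.ones(4)

a = np.linalg.pinv(Y) @ b          # shape (d+1,) = (3,)

# The least-squares solution is characterized by the normal equations
# Y^T (Y a - b) = 0; the residual below should be numerically zero.
residual = Y.T @ (Y @ a - b)
print(a, residual)
```

The same `a` is obtained from `np.linalg.lstsq(Y, b)`; `pinv` is used here only to mirror the slide's $Y^\dagger$ notation.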
Vector Derivative (Gradient) and Chain Rule

Consider a scalar function of a vector input, $J(\mathbf{x})$.

Vector derivative (gradient): $\nabla_\mathbf{x} J = [\partial J/\partial x_1, \ \partial J/\partial x_2, \ \dots, \ \partial J/\partial x_d]^t$

Inner product: $J = \mathbf{a}^t \mathbf{b} = \sum_k a_k b_k \;\Rightarrow\; \nabla_\mathbf{a}\, \mathbf{a}^t \mathbf{b} = \mathbf{b}$, and likewise $\nabla_\mathbf{b}\, \mathbf{a}^t \mathbf{b} = \mathbf{a}$.

Quadratic (Hermitian) form: $J = \mathbf{x}^t A \mathbf{x} = \sum_i \sum_j x_i A_{ij} x_j \;\Rightarrow\; \nabla_\mathbf{x}\, \mathbf{x}^t A \mathbf{x} = A\mathbf{x} + A^t \mathbf{x}$

Generalized chain rule: now consider the matrix-vector multiplication $\mathbf{x}' = A\mathbf{x}$, i.e. $x_i' = \sum_j A_{ij} x_j$, so $\partial x_i'/\partial x_j = A_{ij}$ and

$$\nabla_\mathbf{x} J = \left(\frac{\partial x_i'}{\partial x_j}\right)^t \nabla_{\mathbf{x}'} J = A^t\, \nabla_{\mathbf{x}'} J$$
HW#5 P.1
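The two gradient identities are easy to spot-check numerically. A small sketch using central finite differences (the random test vectors and tolerances are arbitrary choices):

```python
import numpy as np

# Numerical check of:  grad_a (a.b) = b   and   grad_x (x^T A x) = (A + A^T) x
rng = np.random.default_rng(0)
d = 4
a, b, x = rng.standard_normal((3, d))
A = rng.standard_normal((d, d))

def num_grad(f, v, eps=1e-6):
    # Central finite-difference gradient of scalar function f at v.
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

g1 = num_grad(lambda u: u @ b, a)          # should equal b
g2 = num_grad(lambda u: u @ A @ u, x)      # should equal (A + A^T) x
print(np.allclose(g1, b, atol=1e-5), np.allclose(g2, (A + A.T) @ x, atol=1e-4))
```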
Chap. 5.11 and Burges '98 paper: Support Vector Machine

Support Vector Machine (tutorial by Burges '98)

Decision boundary $H_0$: $\mathbf{w}^t \mathbf{x} + b = 0$.

Look for the separating plane with the highest margin (linearly separable case).

Inequality constraints:
$$\mathbf{w}^t \mathbf{x}_i + b \ge +1 \ \text{ for } \mathbf{x}_i \text{ in class 1, i.e. } y_i = +1$$
$$\mathbf{w}^t \mathbf{x}_i + b \le -1 \ \text{ for } \mathbf{x}_i \text{ in class 2, i.e. } y_i = -1$$
combined: $y_i(\mathbf{w}^t \mathbf{x}_i + b) - 1 \ge 0, \ \forall i$.

Two parallel hyperplanes define the margin:
hyperplane $H_+$ ($H_1$): $\mathbf{w}^t \mathbf{x}_i + b = +1$
hyperplane $H_-$ ($H_2$): $\mathbf{w}^t \mathbf{x}_i + b = -1$

Margin: sum of the distances of the closest points to the separating plane; margin $= 2/\|\mathbf{w}\|$. The best plane is defined by $\mathbf{w}$ and $b$.
HW#5 P.2
Finding the maximal margin

Use the Lagrange multiplier technique for the constrained optimization problem.

Primal problem: minimize $\frac{1}{2}\|\mathbf{w}\|^2$ subject to the inequality constraints $y_i(\mathbf{w}^t \mathbf{x}_i + b) - 1 \ge 0, \ i = 1, \dots, l$.

$$L_p = \frac{1}{2}\mathbf{w}^t\mathbf{w} - \sum_{i=1}^l \alpha_i \big( y_i(\mathbf{w}^t \mathbf{x}_i + b) - 1 \big)$$

Minimize $L_p$ w.r.t. $\mathbf{w}$ and $b$:

$$\frac{dL_p}{d\mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^l \alpha_i y_i \mathbf{x}_i, \qquad \frac{dL_p}{db} = 0 \;\Rightarrow\; \sum_{i=1}^l \alpha_i y_i = 0$$

Dual problem: maximize $L_D$ w.r.t. the multipliers $\alpha_i$ (quadratic programming):

$$L_D = \sum_{i=1}^l \alpha_i - \frac{1}{2}\sum_{i=1}^l \sum_{j=1}^l \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j$$

with conditions: $\sum_{i=1}^l \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
HW#6 P.1
KKT conditions for the separable case: if $\alpha_i > 0$, then $\mathbf{x}_i$ lies on $H_+$ or $H_-$ and is a support vector; $\alpha_i = 0$ for all other points.

$$\mathbf{w}^* = \sum_{i=1}^l \alpha_i y_i \mathbf{x}_i$$

How to compute $\mathbf{w}$ and $b$? How to classify new data?
Non-separable case

Add slack variables $\xi_i$: if $\xi_i > 1$, then $\mathbf{x}_i$ is misclassified (i.e., a training error).

New objective function: minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$ subject to $y_i(\mathbf{w}^t \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ (to ensure positivity), again solved with Lagrange multipliers.
All the points located in the margin gap or on the wrong side get $\alpha_i = C$; points exactly on $H_+$ or $H_-$ have $0 < \alpha_i \le C$.

When $C$ increases, samples with errors get more weight: better training accuracy, but a smaller margin, hence less generalization performance.
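The slides solve the soft-margin problem through the dual QP. As a runnable sketch, one can instead minimize the equivalent primal objective $\frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{n}\sum_i \max(0,\, 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b))$ by subgradient descent; the toy dataset, $\lambda$, learning rate, and iteration count below are illustrative choices, not the course's method:

```python
import numpy as np

# Full-batch subgradient descent on the hinge-loss primal of a soft-margin
# linear SVM.  Points with y*(w.x+b) < 1 are "active": they sit inside the
# margin gap or on the wrong side and contribute a subgradient term.
X = np.array([[2., 2.], [3., 3.], [2., 3.],
              [0., 0.], [1., 0.], [0., 1.]])
y = np.array([1., 1., 1., -1., -1., -1.])
lam, lr, n = 0.01, 0.1, len(y)

w, b = np.zeros(2), 0.0
for _ in range(5000):
    margins = y * (X @ w + b)
    active = margins < 1
    grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
    grad_b = -y[active].sum() / n
    w -= lr * grad_w
    b -= lr * grad_b

print(np.sign(X @ w + b))   # should match y on this separable toy set
```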
Mapping to a Higher-Dimensional Space

Map $\mathbf{x} \mapsto \Phi(\mathbf{x})$ into a high-dimensional (embedding) space to make the data separable, and find the SVM there. We can use the same method (the dual problem) to maximize $L_D$ and find the $\alpha_i$:

$$L_D = \sum_{i=1}^l \alpha_i - \frac{1}{2}\sum_{i=1}^l \sum_{j=1}^l \alpha_i \alpha_j\, y_i y_j\, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) = \sum_{i=1}^l \alpha_i - \frac{1}{2}\sum_{i=1}^l \sum_{j=1}^l \alpha_i \alpha_j\, y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j)$$

Define the kernel $K(\mathbf{s}_i, \mathbf{x}) = \Phi(\mathbf{s}_i) \cdot \Phi(\mathbf{x})$. Then, with support vectors $\mathbf{s}_i$,

$$g(\mathbf{x}) = \sum_{i=1}^{N_s} \alpha_i y_i\, \Phi(\mathbf{s}_i) \cdot \Phi(\mathbf{x}) + b \;\Rightarrow\; g(\mathbf{x}) = \sum_{i=1}^{N_s} \alpha_i y_i\, K(\mathbf{s}_i, \mathbf{x}) + b$$
HW#5 P.2
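The point of the kernel is that $K(\mathbf{s},\mathbf{x}) = \Phi(\mathbf{s})\cdot\Phi(\mathbf{x})$ without ever forming $\Phi$. A small check for the degree-2 polynomial kernel in 2-D, whose explicit embedding is known in closed form (the test vectors are arbitrary):

```python
import numpy as np

# For K(s, x) = (s.x)^2 in 2-D, the explicit feature map is
# Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2); the two computations must agree.
def K(s, x):
    return float(np.dot(s, x)) ** 2

def Phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

s = np.array([1.0, 2.0])
x = np.array([3.0, -1.0])
print(K(s, x), Phi(s) @ Phi(x))   # both equal 1.0 here
```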
Chap. 9: Analysis of Learning Algorithms
Bias vs. variance for an estimator

Assume $F$ is a quantity whose value is to be estimated. Randomly draw $n$ samples $D = \{x_1, x_2, \dots, x_n\}$ from the data source, learn $g_D$ to estimate $F$, and repeat multiple times.

Expected estimation error:
$$E_D\big[(g_D - F)^2\big] = \big(E_D[g_D] - F\big)^2 + E_D\big[(g_D - E_D[g_D])^2\big]$$
i.e., bias$^2$ + variance.
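The decomposition can be illustrated by Monte Carlo. The estimator below (sample mean of $n$ draws, shrunk toward 0) is an arbitrary choice made so that both bias and variance are nonzero:

```python
import numpy as np

# Simulate many datasets D, compute g_D each time, and verify
#   mean squared error = bias^2 + variance
# (this holds exactly over the empirical distribution of g, by algebra).
rng = np.random.default_rng(1)
F, n, trials = 2.0, 10, 20000

g = np.array([0.8 * rng.normal(F, 1.0, n).mean() for _ in range(trials)])

mse = np.mean((g - F) ** 2)
bias_sq = (g.mean() - F) ** 2            # approx (0.8*F - F)^2 = 0.16
variance = np.mean((g - g.mean()) ** 2)
print(mse, bias_sq + variance)
```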
Bias vs. variance for classification (ground truth: 2-D Gaussian)

Complex models have smaller biases but more variance than simple models. Increasing the training pool size helps reduce the variance. Occam's Razor principle: prefer the simpler model.
Boosting

From the training data $D = \{x_1, x_2, \dots, x_n\}$, randomly draw a subset $D_1$ of $n_1$ samples and train weak classifier $C_1$; then use the most informative subset $D_2$ from the remaining set to train weak classifier $C_2$; continue up to classifier $C_k$; combine the outputs by classifier fusion.

For each component classifier, use the subset of data that is most informative given the current set of component classifiers.
As in the AdaBoost Ref.
HW#7 P.2
Final Classifier $h_f$

When will the final classifier be incorrect? Suppose $c(i) = 0$; then $h_f(i)$ is incorrect if

$$\sum_{t=1}^T \big(\log \beta_t^{-1}\big)\, h_t(i) \;\ge\; \frac{1}{2} \sum_{t=1}^T \log \beta_t^{-1}, \quad\text{namely}\quad \prod_{t=1}^T \beta_t^{-h_t(i)} \;\ge\; \prod_{t=1}^T \beta_t^{-1/2}$$

In general, $h_f(i)$ is incorrect if

$$\prod_{t=1}^T \beta_t^{-|h_t(i) - c(i)|} \;\ge\; \left(\prod_{t=1}^T \beta_t\right)^{-1/2}$$

Multiplying both sides by $D(i)\prod_t \beta_t$ and recalling the weight update $w_i^{T+1} = D(i)\prod_{t=1}^T \beta_t^{1 - |h_t(i) - c(i)|}$:

$$D(i)\prod_{t=1}^T \beta_t^{1 - |h_t(i) - c(i)|} \;\ge\; D(i)\left(\prod_{t=1}^T \beta_t\right)^{1/2} \;\Rightarrow\; w_i^{T+1} \;\ge\; D(i)\left(\prod_{t=1}^T \beta_t\right)^{1/2}$$

Theorem 1 in Ref.:

$$\sum_{i=1}^N w_i^{t+1} \;\le\; \sum_{i=1}^N w_i^t \,\big(1 - (1 - E_t)(1 - \beta_t)\big)$$

With $\beta_t = \dfrac{E_t}{1 - E_t}$, this becomes $\sum_i w_i^{t+1} \le \sum_i w_i^t \,(2E_t)$, hence $\sum_i w_i^{T+1} \le \prod_{t=1}^T (2E_t)$.

Summing over the misclassified samples, the training error $E$ satisfies

$$E = \sum_{i:\, h_f(i) \ne c(i)} D(i) \;\le\; \left(\prod_{t=1}^T \beta_t\right)^{-1/2} \sum_{i:\, h_f(i) \ne c(i)} w_i^{T+1} \;\le\; \left(\prod_{t=1}^T \beta_t\right)^{-1/2} \prod_{t=1}^T (2E_t)$$

$$\therefore\quad E \;\le\; \prod_{t=1}^T (2E_t) \Big/ \left(\prod_{t=1}^T \beta_t\right)^{1/2} \;=\; \prod_{t=1}^T 2\sqrt{E_t(1 - E_t)}$$

… Fill in details to complete HW7 P.2.
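The bound $E \le \prod_t 2\sqrt{E_t(1-E_t)}$ can be observed on a toy run. The sketch below uses the common $\pm 1$-label formulation with $\alpha_t = \frac{1}{2}\ln\frac{1-E_t}{E_t}$ (equivalent to the $\beta_t$ form above); the dataset and stump learner are made-up illustrations:

```python
import numpy as np

# AdaBoost with decision stumps; track each round's weighted error E_t and
# check that the final training error never exceeds prod_t 2*sqrt(E_t(1-E_t)).
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (40, 2))
y = np.where(X[:, 0] + X[:, 1] ** 2 > 0.3, 1, -1)   # not linearly separable

def best_stump(X, y, D):
    # Exhaustive search over (feature, threshold, sign) threshold tests.
    best = None
    for f in range(X.shape[1]):
        for thr in X[:, f]:
            for sign in (1, -1):
                pred = sign * np.where(X[:, f] > thr, 1, -1)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, pred)
    return best

D = np.full(len(y), 1.0 / len(y))
F, bound = np.zeros(len(y)), 1.0
for t in range(10):
    err, pred = best_stump(X, y, D)
    err = min(max(err, 1e-12), 0.5 - 1e-12)         # guard the log
    alpha = 0.5 * np.log((1 - err) / err)
    D = D * np.exp(-alpha * y * pred)               # reweight: errors go up
    D = D / D.sum()
    F = F + alpha * pred                            # weighted vote
    bound *= 2 * np.sqrt(err * (1 - err))

train_err = np.mean(np.sign(F) != y)
print(train_err, bound)
```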
AdaBoost Learning: the first two features after feature selection.
Cascade classifier for efficiency: break a large classifier into a cascade of smaller classifiers, e.g., 200 features into stages of {1, 10, 25, 50, 50}.

Adjust the threshold in the early stages so that unlikely regions are rejected quickly.

Design tradeoffs (in the precision-recall sense): the number of features in each classifier, the threshold used in each classifier, and the number of classifiers.
Add stages until objective in P-R is met
Mixture of Experts

Each component classifier is treated as an expert; the predictions from each expert are pooled and fused by a gating subsystem:

$$P(\mathbf{y} \,|\, \mathbf{x}, \Theta) = \sum_{r=1}^k P(r \,|\, \mathbf{x}, \theta_0)\, P(\mathbf{y} \,|\, \mathbf{x}, \theta_r)$$

where $\mathbf{x}$ is the input pattern and $\mathbf{y}$ is the output. How to determine $P(r \,|\, \mathbf{x}, \theta_0)$, i.e., the mixture priors?

Maximize the data likelihood, by gradient descent or EM:

$$l(D, \Theta) = \sum_i \ln \sum_{r=1}^k P(r \,|\, \mathbf{x}_i, \theta_0)\, P(\mathbf{y}_i \,|\, \mathbf{x}_i, \theta_r)$$
Chap. 10: feature dimension reduction and clustering
PCA for feature dimension reduction: approximate the data with reduced dimensions.

1-D approximation: $\hat{\mathbf{x}} = \mathbf{m} + a\mathbf{e}$, $\mathbf{m}$: mean.

Approximation error:
$$J_1(\mathbf{e}) = \sum_{k=1}^n \|\hat{\mathbf{x}}_k - \mathbf{x}_k\|^2 = \sum_{k=1}^n \|\mathbf{m} + a_k\mathbf{e} - \mathbf{x}_k\|^2$$
$$= \sum_{k=1}^n a_k^2 \|\mathbf{e}\|^2 - 2\sum_{k=1}^n a_k\, \mathbf{e}^t(\mathbf{x}_k - \mathbf{m}) + \sum_{k=1}^n \|\mathbf{x}_k - \mathbf{m}\|^2$$

With $\|\mathbf{e}\| = 1$ and the optimal $a_k = \mathbf{e}^t(\mathbf{x}_k - \mathbf{m})$:

$$= -\sum_{k=1}^n a_k^2 + \sum_{k=1}^n \|\mathbf{x}_k - \mathbf{m}\|^2 = -\mathbf{e}^t \left[\sum_{k=1}^n (\mathbf{x}_k - \mathbf{m})(\mathbf{x}_k - \mathbf{m})^t\right] \mathbf{e} + \sum_{k=1}^n \|\mathbf{x}_k - \mathbf{m}\|^2 = -\mathbf{e}^t S \mathbf{e} + \sum_{k=1}^n \|\mathbf{x}_k - \mathbf{m}\|^2$$

$S$: scatter matrix, $= (n-1) \times$ sample covariance.

The optimal $\mathbf{e}$ minimizing the error $J_1$ is the eigenvector of $S$ with the largest eigenvalue.

Multi-dimensional approximation: $\hat{\mathbf{x}} = \mathbf{m} + \sum_{i=1}^{d'} a_i \mathbf{e}_i$; what are the optimal $\mathbf{e}_i$?
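The optimality of the top eigenvector can be checked directly: no other unit direction gives a smaller 1-D reconstruction error. A sketch on a random 2-D Gaussian cloud (the covariance and sample size are arbitrary):

```python
import numpy as np

# Build the scatter matrix S, take its top eigenvector e, and compare the
# 1-D reconstruction error J1(e) against a sweep of other unit directions.
rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 1.0]], size=200)

m = X.mean(axis=0)
S = (X - m).T @ (X - m)                  # scatter matrix = (n-1) x sample cov
vals, vecs = np.linalg.eigh(S)
e = vecs[:, -1]                          # eigenvector of the largest eigenvalue

def recon_err(u):
    # J1(u) with the optimal coefficients a_k = u.(x_k - m)
    a = (X - m) @ u
    return np.sum((m + np.outer(a, u) - X) ** 2)

others = [np.array([np.cos(t), np.sin(t)]) for t in np.linspace(0.0, np.pi, 50)]
print(recon_err(e), min(recon_err(u) for u in others))
```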
Independent Component Analysis

Seek the most independent directions, instead of minimizing representation error (sum-squared error) as in PCA. Example: blind source separation of a speech mixture.

$f$: sigmoid, $f(x) = 1/(1 + e^{-x})$.
Find the best weights to make the output components independent. How to measure independence?

A linear combination of random variables tends toward a Normal distribution, so use higher-order statistics to measure non-Gaussianity; apply gradient descent to the weights to discover each component. (from Ellis)

FastICA Matlab package: http://www.cis.hut.fi/projects/ica/fastica/

[Figure: PCA vs. ICA directions]
LDA: Linear Discriminant Analysis

Given a set of data $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$ and their class labels, find the best projection direction $\mathbf{w}$, $y_i = \mathbf{w}^t \mathbf{x}_i$, so that the projected points $y_i$ are most separable.

$\mathbf{m}_i = \frac{1}{n_i}\sum_{\mathbf{x} \in D_i} \mathbf{x}$: sample means; $\tilde{m}_i = \mathbf{w}^t \mathbf{m}_i$: sample means of the projected points.

$\tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2$; $\tilde{s}_1^2 + \tilde{s}_2^2$: within-class scatter.

LDA maximizes the criterion function:
$$J(\mathbf{w}) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

(contrast with the PCA direction, which ignores the class labels).
LDA Scatter Matrices

Before projection: $S_i = \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^t$.

After projection: $\tilde{s}_i^2 = \mathbf{w}^t S_i \mathbf{w}$, so

$$\tilde{s}_1^2 + \tilde{s}_2^2 = \mathbf{w}^t (S_1 + S_2)\mathbf{w} = \mathbf{w}^t S_W \mathbf{w}, \qquad S_W = S_1 + S_2: \text{ within-class scatter matrix}$$

Similarly, with the between-class scatter matrix $S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^t$:

$$J(\mathbf{w}) = \frac{\mathbf{w}^t S_B \mathbf{w}}{\mathbf{w}^t S_W \mathbf{w}} \;\Rightarrow\; \mathbf{w}_{opt} = \arg\max_\mathbf{w} J(\mathbf{w}) = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$

Recall the Gaussian cases: $\mathbf{w} = \Sigma^{-1}(\mu_i - \mu_j)$, the mean difference vector in the PCA space. (P.25-28 of Chap 10)
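A minimal sketch of the closed form $\mathbf{w}_{opt} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ on two Gaussian blobs (the blob parameters are arbitrary): the projected class means should end up well separated relative to the within-class spread.

```python
import numpy as np

# Fisher direction from the within-class scatter matrices of two blobs.
rng = np.random.default_rng(4)
cov = [[1.0, 0.5], [0.5, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, 100)
X2 = rng.multivariate_normal([4.0, 2.0], cov, 100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)
S2 = (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(S1 + S2, m1 - m2)      # w_opt = S_W^{-1} (m1 - m2)

y1, y2 = X1 @ w, X2 @ w                    # 1-D projections
gap = abs(y1.mean() - y2.mean())
spread = np.sqrt(y1.var() + y2.var())
print(gap / spread)                        # large ratio = well separated
```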
Multi-Dimensional Scaling (MDS)

Visualize the data points in a lower-dimensional space. How to preserve the original structure (e.g., the distances $\delta_{ij}$)?

Optimization criteria:
$$J_{ee} = \frac{\sum_{i<j} (d_{ij} - \delta_{ij})^2}{\sum_{i<j} \delta_{ij}^2}, \qquad J_{ff} = \sum_{i<j} \left(\frac{d_{ij} - \delta_{ij}}{\delta_{ij}}\right)^2$$

Use gradient descent to find the new locations $\mathbf{y}_k$:

$$\nabla_{\mathbf{y}_k} J_{ee} = \frac{2}{\sum_{i<j} \delta_{ij}^2} \sum_{j \ne k} (d_{kj} - \delta_{kj})\, \frac{\mathbf{y}_k - \mathbf{y}_j}{d_{kj}}$$

Sometimes rank order is more important.
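A minimal sketch of the $J_{ee}$ gradient descent: three points embedded on a line, with made-up target distances $\delta_{ij}$ that happen to be realizable in 1-D, so the stress should fall from its random starting value.

```python
import numpy as np

# Metric MDS in 1-D: follow grad J_ee = (2/sum d^2) * sum_j (d_kj - delta_kj)
# * (y_k - y_j) / d_kj and watch the stress decrease.
rng = np.random.default_rng(5)
delta = np.array([[0., 1., 2.],
                  [1., 0., 1.],
                  [2., 1., 0.]])           # target distances (1-D realizable)
n = 3
y = rng.standard_normal((n, 1))
norm = (delta[np.triu_indices(n, 1)] ** 2).sum()

def stress(y):
    d = np.abs(y[:, None, 0] - y[None, :, 0])
    iu = np.triu_indices(n, 1)
    return ((d[iu] - delta[iu]) ** 2).sum() / norm

J0 = stress(y)
for _ in range(500):
    d = np.abs(y[:, None, 0] - y[None, :, 0]) + np.eye(n)   # eye: avoid /0 on diag
    coeff = (d - (delta + np.eye(n))) / d                   # (d_kj - delta_kj)/d_kj
    np.fill_diagonal(coeff, 0.0)
    grad = 2.0 / norm * (coeff[:, :, None] * (y[:, None, :] - y[None, :, :])).sum(axis=1)
    y = y - 0.1 * grad
print(J0, stress(y))
```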
Classification vs. Clustering

Classification: data with labels; supervised; find decision boundaries. [Figure: labeled '+' and '-' samples in the $(x_1, x_2)$ plane with a decision boundary.]

Clustering: data without labels; unsupervised; find data structures and clusters. [Figure: unlabeled samples in the $(x_1, x_2)$ plane.]
Review: Mixture of Gaussians

Given data $x_1, \dots, x_N$, the log-likelihood is

$$l = \sum_{n=1}^N \log\big(\pi_0\, N(x_n; \mu_0, \Sigma_0) + \pi_1\, N(x_n; \mu_1, \Sigma_1)\big)$$

Model the data distribution as a GMM:

$$p(x) = \sum_z p(z)\, p(x \,|\, z) = \sum_z \pi_z\, N(x; \mu_z, \Sigma_z), \qquad N(x; \mu_z, \Sigma_z) = \frac{1}{(2\pi)^{D/2} |\Sigma_z|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_z)^T \Sigma_z^{-1} (x - \mu_z)}$$

Posterior probability of $x$ being generated by cluster $i$: $\tau_i = p(z = i \,|\, x, \theta)$.

Parameters: $\theta = \{\mu_0, \Sigma_0, \mu_1, \Sigma_1\}$. Optimization: find $\{\mu_0, \Sigma_0, \mu_1, \Sigma_1\}$ and the mixture priors $\pi_z$ to maximize the likelihood.
GMM for Clustering

Given the estimated GMM model, compute the probability (posterior) that $x$ is generated by cluster $i$: $\tau_i = p(z = i \,|\, x, \theta)$, with $\theta = \{\mu_0, \Sigma_0, \mu_1, \Sigma_1, \pi_0, \dots\}$.

Expectation step:

$$\tau_i^{(t)}(n) = \frac{\pi_i^{(t)}\, N\big(x_n \,|\, \mu_i^{(t)}, \Sigma_i^{(t)}\big)}{\sum_j \pi_j^{(t)}\, N\big(x_n \,|\, \mu_j^{(t)}, \Sigma_j^{(t)}\big)}$$

Each sample is assigned to every cluster with a 'soft' decision.
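The E-step is a one-liner once the component densities are evaluated. A 1-D, 2-component sketch (the parameter values stand in for the current estimates $\mu_i^{(t)}, \Sigma_i^{(t)}, \pi_i^{(t)}$ and are made up):

```python
import numpy as np

# Soft assignments tau_i(n) = pi_i N(x_n | mu_i, var_i) / sum_j pi_j N(x_n | mu_j, var_j)
def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

x = np.array([-2.0, -1.5, 0.2, 2.1, 2.5])
pi = np.array([0.4, 0.6])
mu = np.array([-2.0, 2.0])
var = np.array([1.0, 1.0])

num = pi[None, :] * normal_pdf(x[:, None], mu[None, :], var[None, :])
tau = num / num.sum(axis=1, keepdims=True)     # each row sums to 1
print(tau.round(3))
```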
Comparison: K-Means Clustering

K-mean clustering: fix $K$; choose an initial representative for each cluster; map each sample to its closest cluster (hard decision):

for $i = 1, 2, \dots, N$: $\mathbf{x}_i \to C_k$ if $Dist(\mathbf{x}_i, C_k) < Dist(\mathbf{x}_i, C_{k'}),\ \forall\, k' \ne k$

Then re-compute the centers and repeat. Can be used to initialize the EM for GMM.
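A minimal K-means loop matching the steps above: hard-assign each sample to its nearest center, re-compute the centers, and stop when assignments no longer change. The initialization (evenly spaced samples) and the two toy blobs are illustrative choices:

```python
import numpy as np

def kmeans(X, K, iters=100):
    # Initialize centers from evenly spaced samples of the data.
    centers = X[np.linspace(0, len(X) - 1, K).astype(int)].copy()
    labels = None
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new = d.argmin(axis=1)                 # hard decision
        if labels is not None and np.array_equal(new, labels):
            break                              # assignments converged
        labels = new
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)),  # blob around (0, 0)
               rng.normal(3.0, 0.3, (30, 2))]) # blob around (3, 3)
centers, labels = kmeans(X, 2)
print(centers.round(2))
```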
Hierarchical Clustering

Add hierarchical structures to clusters; many real-world problems have such hierarchical structure, e.g., biological or semantic taxonomies. Agglomerative vs. divisive; dendrogram.

Use a large gap in similarity to find a suitable number of clusters (clustering validity).
Distances or similarities for merging:

$$d_{min}(D_i, D_j) = \min_{\mathbf{x} \in D_i,\, \mathbf{x}' \in D_j} \|\mathbf{x} - \mathbf{x}'\|, \qquad d_{max}(D_i, D_j) = \max_{\mathbf{x} \in D_i,\, \mathbf{x}' \in D_j} \|\mathbf{x} - \mathbf{x}'\|$$

$d_{min}$: nearest-neighbor (minimal) algorithm; merging results in the minimum-distance spanning tree, but it is sensitive to noise/outliers.

$d_{max}$: farthest-neighbor (maximum) algorithm; use a distance threshold to avoid large-diameter clusters; it discourages forming elongated clusters.
HW#8 P.2
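The two merging criteria written out directly (the two tiny clusters are made-up 2-D point sets):

```python
import numpy as np

# Single-link (nearest-neighbor) and complete-link (farthest-neighbor)
# distances between two clusters, as defined on the slide.
def d_min(Di, Dj):
    return min(np.linalg.norm(x - xp) for x in Di for xp in Dj)

def d_max(Di, Dj):
    return max(np.linalg.norm(x - xp) for x in Di for xp in Dj)

Di = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
Dj = [np.array([3.0, 0.0]), np.array([5.0, 0.0])]
print(d_min(Di, Dj), d_max(Di, Dj))   # 2.0 and 5.0
```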