Principal Component Analysis and Matrix Factorizations for Learning (Part 2), Chris Ding, ICML 2005 Tutorial


Page 1

Part 2. Spectral Clustering from Matrix Perspective

A brief tutorial emphasizing recent developments

(A more detailed tutorial was given at ICML'04)

Page 2

From PCA to Spectral Clustering Using Generalized Eigenvectors

Consider the kernel matrix: $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$

In Kernel PCA we compute the eigenvectors: $W v = \lambda v$

Generalized eigenvector: $W q = \lambda D q$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$ and $d_i = \sum_j w_{ij}$

This leads to Spectral Clustering!
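
A minimal NumPy/SciPy sketch (not part of the original slides) of the generalized eigenproblem above, assuming a precomputed symmetric similarity matrix W with positive row sums:

```python
import numpy as np
from scipy.linalg import eigh

def generalized_eigenvectors(W):
    """Solve W q = lambda D q with D = diag(row sums of W).

    Returns eigenvalues/eigenvectors sorted by decreasing eigenvalue."""
    d = W.sum(axis=1)
    D = np.diag(d)
    # eigh solves the symmetric-definite generalized problem A v = lambda B v
    vals, vecs = eigh(W, D)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]
```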

Page 3

Indicator Matrix Quadratic Clustering Framework

Unsigned cluster indicator matrix $H = (h_1, \ldots, h_K)$

Kernel K-means clustering:
$\max_H \mathrm{Tr}(H^T W H)$, s.t. $H^T H = I$, $H \ge 0$

K-means: $W = X^T X$;  Kernel K-means: $W = (\langle \phi(x_i), \phi(x_j) \rangle)$

Spectral clustering (normalized cut):
$\max_H \mathrm{Tr}(H^T W H)$, s.t. $H^T D H = I$, $H \ge 0$
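
A small illustrative sketch (mine, not the tutorial's code): build the unsigned indicator matrix H from integer cluster labels and evaluate the quadratic objective $\mathrm{Tr}(H^T W H)$:

```python
import numpy as np

def indicator_matrix(labels, K):
    """H[:, k] is the normalized indicator of cluster k, so that H^T H = I."""
    labels = np.asarray(labels)
    H = np.zeros((len(labels), K))
    for k in range(K):
        members = (labels == k)
        H[members, k] = 1.0 / np.sqrt(members.sum())
    return H

def trace_objective(W, labels, K):
    H = indicator_matrix(labels, K)
    return np.trace(H.T @ W @ H)
```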

Page 4

Brief Introduction to Spectral Clustering (Laplacian matrix based clustering)

Page 5

Some historical notes

• Fiedler, 1973, 1975, graph Laplacian matrix
• Donath & Hoffman, 1973, bounds
• Hall, 1970, Quadratic Placement (embedding)
• Pothen, Simon, Liou, 1990, spectral graph partitioning (many related papers thereafter)
• Hagen & Kahng, 1992, Ratio-cut
• Chan, Schlag & Zien, multi-way Ratio-cut
• Chung, 1997, Spectral graph theory book
• Shi & Malik, 2000, Normalized Cut

Page 6

Spectral Gold-Rush of 2001
9 papers on spectral clustering

• Meila & Shi, AI-Stat 2001. Random Walk interpretation of Normalized Cut
• Ding, He & Zha, KDD 2001. Perturbation analysis of Laplacian matrix on sparsely connected graphs
• Ng, Jordan & Weiss, NIPS 2001. K-means algorithm on the embedded eigen-space
• Belkin & Niyogi, NIPS 2001. Spectral Embedding
• Dhillon, KDD 2001. Bipartite graph clustering
• Zha et al, CIKM 2001. Bipartite graph clustering
• Zha et al, NIPS 2001. Spectral Relaxation of K-means
• Ding et al, ICDM 2001. MinMaxCut, uniqueness of relaxation
• Gu et al, 2001. K-way relaxation of NormCut and MinMaxCut

Page 7

Spectral Clustering

min cutsize, without explicit size constraints

Need to balance sizes

But where to cut?

Page 8

Graph Clustering

max within-cluster similarities (weights): $s(A,A) = \sum_{i \in A} \sum_{j \in A} w_{ij}$

min between-cluster similarities (weights): $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$

Balance weight

Balance size

Balance volume

Page 9

Clustering Objective Functions

• Ratio Cut: $J_{Rcut}(A,B) = \frac{s(A,B)}{|A|} + \frac{s(A,B)}{|B|}$

• Normalized Cut: $J_{Ncut}(A,B) = \frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B} = \frac{s(A,B)}{s(A,A)+s(A,B)} + \frac{s(A,B)}{s(B,B)+s(A,B)}$

• Min-Max-Cut: $J_{MMC}(A,B) = \frac{s(A,B)}{s(A,A)} + \frac{s(A,B)}{s(B,B)}$

where $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$ and $d_A = \sum_{i \in A} d_i$.
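
A hedged sketch (mine, not from the slides) that evaluates the three 2-way objectives for a given partition of the nodes into index sets A and B:

```python
import numpy as np

def cut_objectives(W, A, B):
    """W: symmetric similarity matrix; A, B: index arrays of the two clusters."""
    sAB = W[np.ix_(A, B)].sum()
    sAA = W[np.ix_(A, A)].sum()
    sBB = W[np.ix_(B, B)].sum()
    dA = W[A, :].sum()                     # volume of A (sum of degrees in A)
    dB = W[B, :].sum()
    J_rcut = sAB / len(A) + sAB / len(B)
    J_ncut = sAB / dA + sAB / dB
    J_mmc = sAB / sAA + sAB / sBB
    return J_rcut, J_ncut, J_mmc
```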

Page 10

Normalized Cut (Shi & Malik, 2000)

Min similarity between A & B: $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$

Balance weights

Cluster indicator:
$q(i) = \begin{cases} \sqrt{d_B / (d\, d_A)} & \text{if } i \in A \\ -\sqrt{d_A / (d\, d_B)} & \text{if } i \in B \end{cases}$

$J_{Ncut}(A,B) = \frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B}$, with $d_A = \sum_{i \in A} d_i$ and $d = \sum_{i \in G} d_i$

Normalization: $q^T D q = 1$, $q^T D e = 0$. Substituting q leads to $J_{Ncut}(q) = q^T (D - W) q$

$\min_q \; q^T (D - W) q + \lambda (q^T D q - 1)$

The solution is an eigenvector of $(D - W) q = \lambda D q$
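
A minimal sketch (my own, assuming a dense symmetric W on a connected graph) of the resulting 2-way normalized cut: solve the generalized eigenproblem and split by the sign of the second eigenvector (the first is the trivial constant one):

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W):
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                               # graph Laplacian
    vals, vecs = eigh(L, D)                 # (D - W) q = lambda D q
    q2 = vecs[:, 1]                         # second-smallest eigenvalue's eigenvector
    A = np.where(q2 >= 0)[0]
    B = np.where(q2 < 0)[0]
    return A, B
```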

Page 11

A simple example: 2 dense clusters, with sparse connections between them.

[Figure: adjacency matrix and eigenvector q2]

Page 12

K-way Spectral Clustering (K ≥ 2)

Page 13

K-way Clustering Objectives

• Ratio Cut:
$J_{Rcut}(C_1, \ldots, C_K) = \sum_{k<l} \left( \frac{s(C_k, C_l)}{|C_k|} + \frac{s(C_k, C_l)}{|C_l|} \right) = \sum_k \frac{s(C_k, G - C_k)}{|C_k|}$

• Normalized Cut:
$J_{Ncut}(C_1, \ldots, C_K) = \sum_{k<l} \left( \frac{s(C_k, C_l)}{d_k} + \frac{s(C_k, C_l)}{d_l} \right) = \sum_k \frac{s(C_k, G - C_k)}{d_k}$

• Min-Max-Cut:
$J_{MMC}(C_1, \ldots, C_K) = \sum_{k<l} \left( \frac{s(C_k, C_l)}{s(C_k, C_k)} + \frac{s(C_k, C_l)}{s(C_l, C_l)} \right) = \sum_k \frac{s(C_k, G - C_k)}{s(C_k, C_k)}$
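
A hedged sketch (mine) of the K-way objectives, using the second form above with $s(C_k, G - C_k)$ divided by size, volume, or within-cluster similarity:

```python
import numpy as np

def kway_objectives(W, labels):
    labels = np.asarray(labels)
    J_rcut = J_ncut = J_mmc = 0.0
    for k in np.unique(labels):
        Ck = np.where(labels == k)[0]
        rest = np.where(labels != k)[0]
        cut = W[np.ix_(Ck, rest)].sum()     # s(C_k, G - C_k)
        J_rcut += cut / len(Ck)             # divide by |C_k|
        J_ncut += cut / W[Ck, :].sum()      # divide by d_k (volume of C_k)
        J_mmc += cut / W[np.ix_(Ck, Ck)].sum()   # divide by s(C_k, C_k)
    return J_rcut, J_ncut, J_mmc
```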

Page 14

K-way Spectral Relaxation

Unsigned cluster indicators:
$h_1 = (1 \cdots 1,\ 0 \cdots 0,\ 0 \cdots 0)^T$
$h_2 = (0 \cdots 0,\ 1 \cdots 1,\ 0 \cdots 0)^T$
$\cdots$
$h_k = (0 \cdots 0,\ 0 \cdots 0,\ 1 \cdots 1)^T$

Re-write:

$J_{Rcut}(h_1, \ldots, h_k) = \frac{h_1^T (D-W) h_1}{h_1^T h_1} + \cdots + \frac{h_k^T (D-W) h_k}{h_k^T h_k}$

$J_{Ncut}(h_1, \ldots, h_k) = \frac{h_1^T (D-W) h_1}{h_1^T D h_1} + \cdots + \frac{h_k^T (D-W) h_k}{h_k^T D h_k}$

$J_{MMC}(h_1, \ldots, h_k) = \frac{h_1^T (D-W) h_1}{h_1^T W h_1} + \cdots + \frac{h_k^T (D-W) h_k}{h_k^T W h_k}$

Page 15

K-way Normalized Cut Spectral Relaxation

Unsigned cluster indicators:
$y_k = D^{1/2} (0 \cdots 0,\ 1 \cdots 1,\ 0 \cdots 0)^T / \| D^{1/2} h_k \|$

Re-write:
$J_{Ncut}(y_1, \ldots, y_k) = y_1^T (I - \tilde{W}) y_1 + \cdots + y_k^T (I - \tilde{W}) y_k = \mathrm{Tr}\big( Y^T (I - \tilde{W}) Y \big)$, where $\tilde{W} = D^{-1/2} W D^{-1/2}$

Optimize: $\min_Y \mathrm{Tr}\big( Y^T (I - \tilde{W}) Y \big)$, subject to $Y^T Y = I$

By K. Fan's theorem, the optimal solution is given by eigenvectors: $Y = (v_1, v_2, \ldots, v_k)$, $(I - \tilde{W}) v_k = \lambda_k v_k$,
$(D - W) u_k = \lambda_k D u_k$, $u_k = D^{-1/2} v_k$

$\lambda_1 + \cdots + \lambda_k \le \min J_{Ncut}(y_1, \ldots, y_k)$  (Gu, et al, 2001)
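
A minimal sketch (my own, under the usual connected-graph assumption) of this relaxation: take the K smallest eigenvectors of $I - \tilde{W}$ and map them back via $u_k = D^{-1/2} v_k$:

```python
import numpy as np

def ncut_embedding(W, K):
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    W_tilde = d_isqrt[:, None] * W * d_isqrt[None, :]
    vals, vecs = np.linalg.eigh(np.eye(len(d)) - W_tilde)
    V = vecs[:, :K]                         # K smallest eigenvalues of I - W~
    U = d_isqrt[:, None] * V                # u_k = D^{-1/2} v_k
    return vals[:K], U
```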

Page 16

K-way Spectral Clustering is difficult

• Spectral clustering is best applied to 2-way clustering
  – positive entries for one cluster
  – negative entries for the other cluster
• For K-way (K > 2) clustering
  – Positive and negative signs make cluster assignment difficult
  – Recursive 2-way clustering
  – Low-dimension embedding: project the data onto the eigenvector subspace, then use another clustering method such as K-means to cluster the data (Ng et al; Zha et al; Bach & Jordan, etc.); a sketch follows below
  – Linearized cluster assignment using spectral ordering and cluster crossing
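
A hedged sketch of the "embedding + K-means" route (in the spirit of Ng, Jordan & Weiss; the row-normalization step is my assumption, not prescribed by the slides):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_kmeans(W, K):
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    W_tilde = d_isqrt[:, None] * W * d_isqrt[None, :]
    _, vecs = np.linalg.eigh(W_tilde)
    V = vecs[:, -K:]                        # K largest eigenvectors of W~
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # row-normalize
    _, labels = kmeans2(V, K, minit='++')   # K-means in the embedded space
    return labels
```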

Page 17

Scaled PCA: a Unified Framework for clustering and ordering

• Scaled PCA has two optimality properties
  – Distance-sensitive ordering
  – Min-max principle clustering
• SPCA on a contingency table ⇒ Correspondence Analysis
  – Simultaneous ordering of rows and columns
  – Simultaneous clustering of rows and columns

Page 18

Scaled PCA

Similarity matrix $S = (s_{ij})$ (generated from $XX^T$)

Nonlinear re-scaling: $\tilde{S} = D^{-1/2} S D^{-1/2}$, $\tilde{s}_{ij} = s_{ij} / (s_{i.}\, s_{.j})^{1/2}$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$ and $d_i = s_{i.}$

Apply SVD on $\tilde{S}$ ⇒
$S = D^{1/2} \tilde{S} D^{1/2} = D^{1/2} \Big[ \sum_k \lambda_k z_k z_k^T \Big] D^{1/2} = \sum_k \lambda_k\, D q_k q_k^T D$

$q_k = D^{-1/2} z_k$ is the scaled principal component

Subtract the trivial component: $\lambda_0 = 1$, $z_0 = d^{1/2} / s_{..}^{1/2}$, $q_0 = 1$

⇒ $S - d d^T / s_{..} = \sum_{k=1} \lambda_k\, D q_k q_k^T D$

(Ding, et al, 2002)
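
A minimal sketch (mine, assuming a nonnegative symmetric similarity matrix S) of scaled PCA: rescale S, take the eigen-decomposition of $\tilde{S}$, and drop the trivial component ($\lambda_0 = 1$, $q_0 = 1$):

```python
import numpy as np

def scaled_pca(S, n_components=2):
    d = S.sum(axis=1)                       # d_i = s_i.
    d_isqrt = 1.0 / np.sqrt(d)
    S_tilde = d_isqrt[:, None] * S * d_isqrt[None, :]
    vals, vecs = np.linalg.eigh(S_tilde)    # ascending eigenvalues
    idx = np.argsort(vals)[::-1][1:n_components + 1]   # skip the trivial top one
    Q = d_isqrt[:, None] * vecs[:, idx]     # q_k = D^{-1/2} z_k
    return vals[idx], Q
```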

Page 19

Scaled PCA on a Rectangular Matrix ⇒ Correspondence Analysis

Nonlinear re-scaling: $\tilde{P} = D_r^{-1/2} P D_c^{-1/2}$, $\tilde{p}_{ij} = p_{ij} / (p_{i.}\, p_{.j})^{1/2}$

Apply SVD on $\tilde{P}$ and subtract the trivial component:

$P - r c^T / p_{..} = \sum_{k=1} \lambda_k\, D_r f_k g_k^T D_c$

where $r = (p_{1.}, \ldots, p_{n.})^T$ and $c = (p_{.1}, \ldots, p_{.n})^T$

$f_k = D_r^{-1/2} u_k$ and $g_k = D_c^{-1/2} v_k$ are the scaled row and column principal components (standard coordinates in CA)
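
A hedged sketch of correspondence analysis along these lines (my own code, not the tutorial's): rescale the contingency table, run an SVD, and drop the trivial leading pair:

```python
import numpy as np

def correspondence_analysis(counts, n_components=2):
    P = counts / counts.sum()               # joint probabilities p_ij
    r = P.sum(axis=1)                       # row masses p_i.
    c = P.sum(axis=0)                       # column masses p_.j
    P_tilde = P / np.sqrt(np.outer(r, c))
    U, svals, Vt = np.linalg.svd(P_tilde, full_matrices=False)
    # the first singular triplet (singular value 1) is the trivial component
    F = U[:, 1:n_components + 1] / np.sqrt(r)[:, None]      # row coordinates f_k
    G = Vt[1:n_components + 1, :].T / np.sqrt(c)[:, None]   # column coordinates g_k
    return svals[1:n_components + 1], F, G
```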

Page 20

Correspondence Analysis (CA)

• Mainly used in graphical display of data
• Popular in France (Benzécri, 1969)
• Long history
  – Simultaneous row and column regression (Hirschfeld, 1935)
  – Reciprocal averaging (Richardson & Kuder, 1933; Horst, 1935; Fisher, 1940; Hill, 1974)
  – Canonical correlations, dual scaling, etc.
• Formulation is a bit complicated ("convoluted", Jolliffe, 2002, p.342)
• "A neglected method" (Hill, 1974)

Page 21

Clustering of Bipartite Graphs (rectangular matrix)

Simultaneous clustering of rows and columns of a contingency table (adjacency matrix B)

Examples of bipartite graphs

• Information Retrieval: word-by-document matrix

• Market basket data: transaction-by-item matrix

• DNA Gene expression profiles

• Protein vs protein-complex

Page 22

Bipartite Graph Clustering

Clustering indicators for rows and columns:
$f(i) = \begin{cases} 1 & \text{if } r_i \in R_1 \\ -1 & \text{if } r_i \in R_2 \end{cases}$
$g(i) = \begin{cases} 1 & \text{if } c_i \in C_1 \\ -1 & \text{if } c_i \in C_2 \end{cases}$

$B = \begin{pmatrix} B_{R_1,C_1} & B_{R_1,C_2} \\ B_{R_2,C_1} & B_{R_2,C_2} \end{pmatrix}$,
$W = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}$,
$q = \begin{pmatrix} f \\ g \end{pmatrix}$

Substitute and obtain:
$J_{MMC}(C_1, C_2; R_1, R_2) = \frac{s(W_{12})}{s(W_{11})} + \frac{s(W_{12})}{s(W_{22})}$

f, g are determined by:
$\begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix} \begin{pmatrix} f \\ g \end{pmatrix} = \lambda \begin{pmatrix} D_r & 0 \\ 0 & D_c \end{pmatrix} \begin{pmatrix} f \\ g \end{pmatrix}$
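
A minimal sketch (assumptions mine) of this bipartite co-clustering: the coupled eigenproblem above can be solved through an SVD of $D_r^{-1/2} B D_c^{-1/2}$, splitting rows and columns by the sign of the second singular vectors:

```python
import numpy as np

def bipartite_bipartition(B):
    dr = B.sum(axis=1)                      # row degrees
    dc = B.sum(axis=0)                      # column degrees
    B_tilde = B / np.sqrt(np.outer(dr, dc))
    U, svals, Vt = np.linalg.svd(B_tilde, full_matrices=False)
    f = U[:, 1] / np.sqrt(dr)               # second left singular vector -> f
    g = Vt[1, :] / np.sqrt(dc)              # second right singular vector -> g
    R1, R2 = np.where(f >= 0)[0], np.where(f < 0)[0]
    C1, C2 = np.where(g >= 0)[0], np.where(g < 0)[0]
    return (R1, R2), (C1, C2)
```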

Page 23

Spectral Clustering of Bipartite Graphs

Simultaneous clustering of rows and columns (adjacency matrix B)

$J_{MMC}(C_1, C_2; R_1, R_2) = \frac{s(B_{R_1,C_2}) + s(B_{R_2,C_1})}{2\, s(B_{R_1,C_1})} + \frac{s(B_{R_1,C_2}) + s(B_{R_2,C_1})}{2\, s(B_{R_2,C_2})}$

cut: min between-cluster sum of weights: $s(R_1,C_2)$, $s(R_2,C_1)$

max within-cluster sum of weights: $s(R_1,C_1)$, $s(R_2,C_2)$

$s(B_{R_1,C_2}) = \sum_{r_i \in R_1} \sum_{c_j \in C_2} b_{ij}$

(Ding, AI-STAT 2003)

Page 24

Internet Newsgroups

Simultaneous clustering of documents and words

Page 25

Embedding in Principal Subspace

Cluster Self-Aggregation (proved in perturbation analysis)

(Hall, 1970, "quadratic placement" (embedding) of a graph)

Page 26

Spectral Embedding: Self-aggregation

(Ding, 2004)

• Compute K eigenvectors of the Laplacian
• Embed objects in the K-dimensional eigenspace

Page 27

Spectral embedding is not topology preserving

700 3-D data points form 2 interlocking rings

In eigenspace, they shrink and separate

Page 28

Spectral Embedding

(Ding, 2004)

Simplex Embedding Theorem. Objects self-aggregate to K centroids. The centroids are located on the K corners of a simplex.

• The simplex consists of K basis vectors plus the coordinate origin
• The simplex is rotated by an orthogonal transformation T
• T is determined by perturbation analysis

Page 29

Perturbation Analysis

Assume the data has 3 dense clusters $C_1$, $C_2$, $C_3$, sparsely connected.

$W = \begin{pmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ W_{31} & W_{32} & W_{33} \end{pmatrix}$

$\hat{W} = D^{-1/2} W D^{-1/2}$, $\hat{W} z = \lambda z$; equivalently $W q = \lambda D q$ with $q = D^{-1/2} z$

Off-diagonal blocks are between-cluster connections, assumed small and are treated as a perturbation

(Ding et al, KDD’01)

Page 30

Spectral Perturbation Theorem

Spectral perturbation matrix:

$\bar{\Gamma} = \begin{pmatrix} h_1 & -s_{12} & \cdots & -s_{1K} \\ -s_{21} & h_2 & \cdots & -s_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ -s_{K1} & -s_{K2} & \cdots & h_K \end{pmatrix}$, with $s_{pq} = s(C_p, C_q)$ and $h_k = \sum_{p \ne k} s_{kp}$

Orthogonal transform matrix $T = (t_1, \ldots, t_K)$

T is determined by:
$\Gamma\, t_k = \lambda_k t_k$, where $\Gamma = \Omega^{-1/2} \bar{\Gamma}\, \Omega^{-1/2}$ and $\Omega = \mathrm{diag}(\rho(C_1), \ldots, \rho(C_K))$

Page 31

Connectivity Network

$C_{ij} = \begin{cases} 1 & \text{if } i, j \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases}$

Scaled PCA provides: $C \cong \sum_{k=1}^{K} \lambda_k\, D q_k q_k^T D$

Green's function: $C \approx G \equiv \sum_{k=2}^{K} \frac{q_k q_k^T}{1 - \lambda_k}$

Projection matrix: $C \approx P \equiv \sum_{k=1}^{K} q_k q_k^T$

(Ding et al, 2002)
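
A hedged sketch (my own) that approximates the connectivity matrix C from the leading scaled principal components, using the projection-matrix form $C \approx \sum_k q_k q_k^T$:

```python
import numpy as np

def connectivity_estimate(W, K):
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    W_tilde = d_isqrt[:, None] * W * d_isqrt[None, :]
    vals, vecs = np.linalg.eigh(W_tilde)
    Z = vecs[:, -K:]                        # top-K eigenvectors z_k
    Q = d_isqrt[:, None] * Z                # q_k = D^{-1/2} z_k
    return Q @ Q.T                          # C ~ sum_k q_k q_k^T
```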

Page 32

1st order Perturbation: Example 1

Between-cluster connections suppressed

Within-cluster connections enhanced

[Figure: similarity matrix W and the corresponding connectivity matrix, showing the effects of self-aggregation]

$\lambda_2 = 0.300,\ 0.268$

1st order solution

Page 33

Optimality Properties of Scaled PCA

Scaled principal components have optimality properties:

Ordering
– Adjacent objects along the order are similar
– Far-away objects along the order are dissimilar
– The optimal solution for the permutation indexes is given by scaled PCA

Clustering
– Maximize within-cluster similarity
– Minimize between-cluster similarity
– The optimal solution for the cluster membership indicators is given by scaled PCA

Page 34

Spectral Graph Ordering

(Hall, 1970), "quadratic placement of a graph":

Find a coordinate x to minimize
$J(x) = \sum_{ij} (x_i - x_j)^2 w_{ij} = x^T (D - W) x$

The solution is given by eigenvectors of the Laplacian.

(Barnard, Pothen, Simon, 1993), envelope reduction of a sparse matrix: find an ordering such that the envelope is minimized:
$\min \sum_i \max_j |i - j|\, w_{ij} \;\Rightarrow\; \min \sum_{ij} (x_i - x_j)^2 w_{ij}$

Page 35

Distance Sensitive Ordering

Given a graph, find an optimal ordering of the nodes.

$\pi = (\pi_1, \ldots, \pi_n)$: permutation indexes

$J_d(\pi) = \sum_{i=1}^{n-d} w_{\pi_i, \pi_{i+d}}$

$\min_\pi J(\pi) = \sum_{d=1}^{n-1} d^2 J_d(\pi)$

The larger the distance, the larger the weight (penalty).

[Figure: arcs linking nodes at distance d along the order; e.g. $J_{d=2}(\pi)$ includes $w_{\pi_1, \pi_3}$]

Page 36

Distance Sensitive Ordering

$J(\pi) = \sum_{i,j} (i - j)^2\, w_{\pi_i \pi_j} = \sum_{i,j} (\pi_i^{-1} - \pi_j^{-1})^2\, w_{ij}$

Define shifted and rescaled inverse permutation indexes:
$q_i = \frac{\pi_i^{-1} - (n+1)/2}{n/2} \in \left\{ \frac{1-n}{n}, \frac{3-n}{n}, \ldots, \frac{n-1}{n} \right\}$

Then
$J(\pi) = \frac{n^2}{8} \sum_{i,j} (q_i - q_j)^2\, w_{ij} = \frac{n^2}{4}\; q^T (D - W)\, q$

Page 37

Distance Sensitive Ordering

Once $q_2$ is computed, since
$q_2(i) < q_2(j) \;\Rightarrow\; \pi_i^{-1} < \pi_j^{-1}$,
$\pi_i^{-1}$ can be uniquely recovered from $q_2$.

Implementation: sorting $q_2$ induces the ordering $\pi$.
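
A minimal sketch (mine) of the resulting ordering procedure: compute the second generalized eigenvector of $(D - W) q = \lambda D q$ and sort it:

```python
import numpy as np
from scipy.linalg import eigh

def spectral_ordering(W):
    d = W.sum(axis=1)
    D = np.diag(d)
    vals, vecs = eigh(D - W, D)             # generalized Laplacian eigenproblem
    q2 = vecs[:, 1]                         # eigenvector of 2nd smallest eigenvalue
    return np.argsort(q2)                   # node ordering pi induced by sorting q2
```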

Page 38

Re-ordering of Genes and Tissues

$r = \frac{J(\pi)}{J(\mathrm{random})}$,  $r_{d=1} = \frac{J_{d=1}(\pi)}{J_{d=1}(\mathrm{random})}$

$r = 0.18$,  $r_{d=1} = 3.39$

Page 39

Spectral clustering vs Spectral ordering

• The continuous approximations of both integer programming problems are given by the same eigenvector.

• Different problems can have the same continuous approximate solution.

• Quality of the approximation:

Ordering: better quality, since the solution relaxes from a set of evenly spaced discrete values.

Clustering: lower quality, since the solution relaxes from only 2 discrete values.

Page 40

Linearized Cluster Assignment

• Spectral ordering on the connectivity network
• Cluster crossing
  – Sum of similarities along the anti-diagonal
  – Gives a 1-D curve with valleys and peaks
  – Divide the valleys and peaks into clusters

This turns spectral clustering into a 1-D clustering problem.

Page 41

Cluster overlap and crossing

• Cluster overlap: given similarity W and clusters A, B,
$s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$

• Cluster crossing computes a smaller fraction of the cluster overlap.

• Cluster crossing depends on an ordering o. It sums the weights crossing site i along the order:
$\rho(i) = \sum_{j=1}^{m} w\big( o(i-j),\, o(i+j) \big)$

• This is a sum along the anti-diagonals of W.
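
A hedged sketch (my own) of the cluster-crossing profile: reorder W by the spectral order o, then sum a short anti-diagonal band around each site i; valleys in the profile suggest cluster boundaries:

```python
import numpy as np

def cluster_crossing(W, order, m=5):
    Wo = W[np.ix_(order, order)]            # similarity matrix in spectral order
    n = len(order)
    rho = np.zeros(n)
    for i in range(n):
        for j in range(1, m + 1):
            if i - j >= 0 and i + j < n:
                rho[i] += Wo[i - j, i + j]  # anti-diagonal entries around site i
    return rho
```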

Page 42

cluster crossing

Page 43

K-way Clustering Experiments

Accuracy of clustering results:

Method   | Linearized Assignment | Recursive 2-way clustering | Embedding + K-means
Data A   | 89.0%                 | 82.8%                      | 75.1%
Data B   | 75.7%                 | 67.2%                      | 56.4%

Page 44

Some Additional Advanced/Related Topics

• Random walks and normalized cut
• Semi-definite programming
• Sub-sampling in spectral clustering
• Extending to semi-supervised classification
• Green's function approach
• Out-of-sample embedding