Large-scalespectralclusteringusingdiffusion ...

29
Large-scale spectral clustering using diffusion coordinates on landmark-based bipartite graphs Guangliang Chen and Khiem Pham TextGraphs-2018 Workshop, New Orleans, LA June 6, 2018

Transcript of Large-scalespectralclusteringusingdiffusion ...

Page 1: Large-scalespectralclusteringusingdiffusion ...

Large-scale spectral clustering using diffusioncoordinates on landmark-based bipartite graphs

Guangliang Chen and Khiem Pham

TextGraphs-2018 Workshop, New Orleans, LA

June 6, 2018

Page 2: Large-scalespectralclusteringusingdiffusion ...

Outline

• Introduction + background

• Our scalable approach

• Experiments

• Conclusions

Page 3: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

What is spectral clustering?A family of clustering algorithms that utilize the spectral decompositionof a similarity matrix constructed on the given data x1, . . . ,xn ∈ Rd:

W = (wij) ∈ Rn×n, wij =

κ(xi,xj), if i 6= j

0, if i = j.

Here, κ(·, ·) is a similarity function, such as

• an indicator function (whether two points are “sufficiently close”),

• the Gaussian radial basis function (RBF), and

• the cosine similarity.

Guangliang Chen and Khiem Pham | San José State University, CA 3/29

Page 4: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

A (very convenient) graph cut point of view

W (as a weight matrix) defines aweighted graph on the given data.

b

b b bb

b

b b bb

b

b

b b

bb

b

bb

b

b

b

b b

bb

bb b

b

b

b

b

b b

b

b

bb

b

bb

bb

bb

bb

b

bb

b

b

bb b

bbb

b

b

b

bbb

bb

bbb

b

bb

bb

bb

b

b

Accordingly, clustering = finding anoptimal cut (under some criterion):e.g., RatioCut, NCut, MinMaxCut.

Some graph terminology:–Degree matrix D = diag(W1),with Dii =

∑j Wij .

–Graph Laplacian L = D−Wand its normalized versions:

Lsym = D− 12 LD− 1

2 = I−D− 12 WD− 1

2︸ ︷︷ ︸W (symmetric)

Lrw = D−1L = I− D−1W︸ ︷︷ ︸P (row stochastic)

Guangliang Chen and Khiem Pham | San José State University, CA 4/29

Page 5: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Different spectral clustering algorithms...use different kinds of spectral embedding for kmeans clustering:

• The Ng-Jordan-Weiss (NJW) algorithm (NIPS’01): U ∈ Rn×k,top k eigenvectors of W:

W ≈ Un×kΛk×kUTn×k, where Λ = diag(λ1 ≥ · · · ≥ λk)

• Normalized Cut (NCut) by Shi and Malik (PAMI’00):U = D−

12 U, top k eigenvectors of P = D−1W

• Diffusion Maps (DM(t)) by Coifman et al. (PNAS’05):U(t) = UΛt = D−

12 UΛt, diffusion coordinates in t time steps

Guangliang Chen and Khiem Pham | San José State University, CA 5/29

Page 6: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Computational challenges

Spectral clustering has achieved superior results in many applications (suchas image segmentation, documents clustering, social network partitioning),but requires significant computational power due to the matrix W ∈ Rn×n:

• Extensive memory requirement: O(n2)

• High computational cost: O(n3)← spectral decomposition

Consequently, there has been an urgent need to develop fast, approximatespectral clustering algorithms that are scalable to large data.

Guangliang Chen and Khiem Pham | San José State University, CA 6/29

Page 7: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Landmark-based scalable methodsMost existing scalable methods use a small landmark set y1, . . . ,ym ∈ Rd,selected from the given data x1, . . . ,xn ∈ Rd (e.g., uniformly at randomor via k-means), to first construct a (sparse) similarity matrix betweenthem:

A = (aij) ∈ Rn×m, aij = κ(xi,yj)

b

b b bb

b

b b bb

b

b

b b

bb

b

b

b b

b

b

b

b b

bb

bb b

b

b

b

b

b b

b

b

bb

b

bb

bb

bb

bb

b

bb

b

b

bb b

bbb

b

b

b

bbb

bb

bbb

b

bb

bb

bb

b

b

*

* ** *

*

**

* *

* *

**

*

bbbbbbbbbbbbbbbbbbb

* * * * * * * * * * * *r r rr r rr r rr r rr r rr r rr r rr r rr r r

r r rr r rr r rr r rr rr rr r rr r rr r r

A =

Guangliang Chen and Khiem Pham | San José State University, CA 7/29

Page 8: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Afterwards, different methods use the similarity matrix A in different ways:

• cSPEC (Wang et al., 2009): Regards A as a column-sampledversion of W and uses linear algebra to estimate eigenvectors of W

• KASP (Yan, Huang and Jordan, 2009): Uses vector quantizationtechnique (k-means) to aggressively reduce the given data to acollection of centroids (landmarks) and applies spectral clustering togroup them

• LSC (Cai and Chen, 2015): Obtains the matrix A from a sparsecoding perspective with the landmarks as a dictionary and thenapplies spectral clustering to the rows of A (after performing certainrow and column normalizations).

Guangliang Chen and Khiem Pham | San José State University, CA 8/29

Page 9: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Overview of our approachWe follow the direction of landmark-based spectral clustering, but

• use the landmarks to form a bipartite graph,

• and then run a random walk on the graph.

bbbbbbbbbbbbb

*******

******

*******

b

bb

bb

b

bb

bb

bb

bbbbbbbbbbbbb

bbbbbbbbbbbbb

*******

α = 1 α = 2q α = 2q + 1

Guangliang Chen and Khiem Pham | San José State University, CA 9/29

Page 10: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

MotivationDhillon (2001) proposed a bipartite graph model for the setting of docu-ments data, with the goal to co-cluster documents and terms.

rrrrrrrrr

rr

×××××r

rrrrrrrr

rr

×× × × × documents terms542

00000 0 0 5

24

b

b

b

Frequency matrix (under bag of words model)

In principle, they apply NCut to themixture of documents and termswith the weight matrix

W =(

0 AAT 0

)∈ R(n+m)×(n+m).

They derived an efficient procedurefor computing the eigenvectors ofP = D−1W directly from A.

Guangliang Chen and Khiem Pham | San José State University, CA 10/29

Page 11: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Theorem (Dhillon’01). Let

D1 = diag(A1), D2 = diag(AT1), D = diag(W1) =(

D1D2

),

andA = D−1/2

1 AD−1/22 .

Then for each pair of left and right singular vectors Av2 = σv1,

v =(

D−1/21

D−1/22

)(v1v2

)= D−1/2 v

is an eigenvector of P = D−1W (corresponding to eigenvalue σ).

Guangliang Chen and Khiem Pham | San José State University, CA 11/29

Page 12: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Our approachWe extend the bipartite graph model by Dhillon (2001) in two ways:

(1) We adapt it for landmark-basedclustering by using instead the givendata and a landmark set as its twoparts.

b

b b bb

b

b b bb

b

b

b b

bb

b

bb

b

b

b

b b

bb

bb b

b

b

b

b

b b

b

b

bb

b

bb

bb

bb

bb

b

bb

b

b

bb b

bbb

b

b

b

bbb

bb

bbb

b

bb

bb

bb

b

b

*

* ** *

*

**

* *

* *

**

*

ab

(2) We run a random walk on thebipartite graph to gather global in-formation about the data set.

bbbbbbbbbbbbb

*******

******

*******

b

bb

bb

b

bb

bb

bb

bbbbbbbbbbbbb

bbbbbbbbbbbbb

*******

α = 1 α = 2q α = 2q + 1

Guangliang Chen and Khiem Pham | San José State University, CA 12/29

Page 13: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Two important remarks

(1) The diffusion coordinates at any time step α can be computed efficientlyfrom A too:

[· · · | σα v | · · · ]

(2) Depending on whether α is odd or even, there are three ways to clusterthe given data:

• α even (two disjoint subgraphs): direct clustering, or landmarkclustering + NN classification (faster)

• α odd (still a bipartite graph): co-clustering first but removing thelandmarks later

Guangliang Chen and Khiem Pham | San José State University, CA 13/29

Page 14: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Alg. 1 Landmark-based Bipartite Diffusion Maps (LBDM)

Input:

• Data x1, . . . ,xn ∈ Rd

• similarity function κ

• # clusters k

• # diffusion steps α

• landmark selection method

• # landmark points m

• # nearest landmark points s

• clustering method (direct,landmark, or co-clustering)

Output: Clusters C1, . . . , Ck

Guangliang Chen and Khiem Pham | San José State University, CA 14/29

Page 15: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Steps:

1. Select m landmark points {yj} by the given method.

2. Compute the s-sparse adjacency matrix A = (aij), aij = κ(xi,yj)between each given data point xi and the s nearest landmarks yj .

3. Normalize A by using its row and column sums A = D−1/21 AD−1/2

2and then calculate the diffusion coordinates for the bipartite graph(through the rank-k SVD of A):

[· · · | σαD−1/2v | · · · ]

4. Use the indicated clustering method to cluster the given data.

Guangliang Chen and Khiem Pham | San José State University, CA 15/29

Page 16: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Complexity analysisTotal running time is O(nm(d+ s) + nk(s+ k)), obtained as follows:

• Landmark sampling: O(nmd) (k-means), or O(m) (uniform)

• Constructing A: O(nm(d+ s))

• Computing A: O(ns)

• Rank-k SVD of A: O(nsk)

• Diffusion coordinates: O((n+m)k)

• Final k-means: O(nk2) (direct), or O(mk2 + ns) (landmark), orO((n+m)k2) (co-clustering)

Guangliang Chen and Khiem Pham | San José State University, CA 16/29

Page 17: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Experiments: methods and setup• To be compared with: plain NCut, KASP, LSC, cSPEC, Dhillon

• Gaussian similarity κ(x,y) = exp(−‖x−y‖2

2σ2

)• k-means sampling (same landmark points for all)

• Parameters: m = 500 (for all scalable methods), s = 5 (for LSC,Dhillon, and LBDM), α = 1, 2 (only for LBDM)

• Evaluation metrics: clustering accuracy and CPU time (averagedover 50 trials for each method)

Guangliang Chen and Khiem Pham | San José State University, CA 17/29

Page 18: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Experiments: benchmark data sets

Data n d k

usps 9,298 256 10pendigits 10,992 16 10letter 20,000 16 26protein 24,387 357 3shuttle 58,000 9 7mnist 70,000 784 10

Guangliang Chen and Khiem Pham | San José State University, CA 18/29

Page 19: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Experiments: clustering accuracy (%)

Dataset Ncut KASP LSC cSPEC Dhillon LBDM(1) –(2,X) –(2,Y )

usps 66.21 67.25 66.86 66.89 68.21 67.80 68.10 69.45pendigits 69.73 68.45 77.93 67.93 73.20 72.95 74.70 73.22letter 24.93 26.19 31.51 24.98 32.06 32.13 32.21 31.28protein 43.68 43.85 43.85 44.84 43.35 43.55 43.16 45.88shuttle 74.52 39.71 82.78 74.24 74.26 74.38 74.49mnist 57.99 70.28 54.50 72.15 72.43 72.37 73.29

Guangliang Chen and Khiem Pham | San José State University, CA 19/29

Page 20: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Experiments: CPU time (in seconds)

Dataset Ncut (k-means) KASP LSC cSPEC Dhillon LBDM(1) –(2,X) –(2,Y )

usps 131.78 (7.46) 0.61 4.44 7.89 4.45 4.39 4.17 1.95pendigits 246.08 (3.13) 0.55 3.08 5.26 3.14 2.91 3.08 1.65letter 1180.70 (5.30) 0.77 12.24 25.07 13.51 14.96 12.87 2.78protein 2024.54 (27.04) 0.41 3.55 7.54 3.93 4.04 3.93 4.40shuttle (23.89) 1.23 8.49 61.68 12.35 15.09 12.15 5.88mnist (299.74) 0.63 25.07 39.26 27.17 25.69 25.83 16.67

Guangliang Chen and Khiem Pham | San José State University, CA 20/29

Page 21: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Experiments: parameter study (m)

s = 5 fixed (top: clustering accuracy, bottom: CPU time)

200 400 600 800 1000m (# landmarks)

65

66

67

68

69

70

71

accu

racy

(%

)

usps

200 400 600 800 1000m (# landmarks)

22

24

26

28

30

32

34

accu

racy

(%

)letter

200 400 600 800 1000m (# landmarks)

42

43

44

45

46

47

accu

racy

(%

)

protein

200 400 600 800 1000m (# landmarks)

50

60

70

80

accu

racy

(%

)

mnist

200 400 600 800 1000m (# landmarks)

0

2

4

6

8

10

12

CP

U ti

me

(s)

usps

200 400 600 800 1000m (# landmarks)

0

10

20

30

40

CP

U ti

me

(s)

letter

200 400 600 800 1000m (# landmarks)

0

5

10

15

CP

U ti

me

(s)

proteinLBDM2YLBDM2XLBDM1cSPECLSCDhillonKASP

200 400 600 800 1000m (# landmarks)

0

20

40

60

CP

U ti

me

(s)

mnist

Guangliang Chen and Khiem Pham | San José State University, CA 21/29

Page 22: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Experiments: parameter study (s)

m = 500 fixed (top: accuracy, bottom: CPU time)

0 2 4 6 8 10s (#nearest landmarks)

64

66

68

70

72

accu

racy

(%

)

usps

0 2 4 6 8 10s (#nearest landmarks)

25

30

35

accu

racy

(%

)letter

0 2 4 6 8 10s (#nearest landmarks)

42

43

44

45

46

47

accu

racy

(%

)

protein

0 2 4 6 8 10s (#nearest landmarks)

50

60

70

80

accu

racy

(%

)

mnist

0 2 4 6 8 10s (#nearest landmarks)

0

2

4

6

8

CP

U ti

me

(s)

usps

0 2 4 6 8 10s (#nearest landmarks)

0

10

20

30

CP

U ti

me

(s)

letterLBDM2YLBDM2XLBDM1cSPECLSCDhillonKASP

0 2 4 6 8 10s (#nearest landmarks)

0

2

4

6

8

CP

U ti

me

(s)

protein

0 2 4 6 8 10s (#nearest landmarks)

0

10

20

30

40

CP

U ti

me

(s)

mnist

Guangliang Chen and Khiem Pham | San José State University, CA 22/29

Page 23: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

LBDM for document-term bipartite graphs

We can apply LBDM directly to documents data by simply treating termsas “landmarks”.

rrrrrrrrr

rr

×××××r

rrrrrrrr

rr

×× × × × documents terms542

00000 0 0 5

24

b

b

b

LBDM extends Dhillon’s methodby running a random walk onthe document-term bipartite graphto construct document-document,term-term, and document-termgraphs (at different time steps).

Guangliang Chen and Khiem Pham | San José State University, CA 23/29

Page 24: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Experiments: documents data

Some documents data used in our experiments:

• Reuters-21578: We choose the largest 30 categories with a totalof 8,067 documents and 18,933 words.

• TDT2: We choose the largest 30 categories with a total of 9,392documents and 34,090 words.

• 20newsgroups: There are 18,774 documents (partitioned into 20newsgroups) and 61,118 words.

Guangliang Chen and Khiem Pham | San José State University, CA 24/29

Page 25: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Experiments: accuracy (for different α)

0 5 10 15 20 (time step)

0.3

0.35

0.4

0.45

0.5

0.55

0.6

accu

racy

(%

)

reuters

LBDM XLBDMDhillon

0 5 10 15 20 (time step)

0.4

0.5

0.6

0.7

0.8

accu

racy

(%

)

TDT2

Guangliang Chen and Khiem Pham | San José State University, CA 25/29

Page 26: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Experiments: topic identification for 20news

0 0.1 0.2 0.3

sharewarexv

tgapov

animationfile

viewerjpeg

ftptiff

formatsimages

filesimage

convertbmppcx

formatgraphics

gif

comp.graphics

0 0.1 0.2

turbochevy

scauto

transmissiontaurusautosdealer

rearmustanguokmax

tirescallisonnissanmiles

toyotaford

enginecarscar

rec.autos

0 0.1 0.2

securitydes

omissionsexcepted

strnlghtprivacy

pgpwiretap

algorithmencrypted

securecrypto

chipkeynsa

sternlightescrow

keysencryption

clipper

sci.crypt

Guangliang Chen and Khiem Pham | San José State University, CA 26/29

Page 27: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

0 0.1 0.2 0.3

zoomitems

codmm

mintselling

lensprice

excellentsell

offersmanuals

brandforsaleasking

obooffer

conditionshipping

sale

misc.forsale

0 0.1

sovietoccupied

proceededarabs

melkoniansahak

appressianohanus

jewsextermination

armeniansserdar

argicarmenian

arabturks

armeniaturkishisraeliisrael

talk.politics.mideast

0 0.05 0.1 0.15

holyspirit

catholicbiblical

doctrineclh

faithscripture

christianitylord

heavenchristian

churchsin

christiansbibleathos

rutgerschristjesus

soc.religion.christian

Guangliang Chen and Khiem Pham | San José State University, CA 27/29

Page 28: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

A unified view

These scalable algorithms are all based on the matrix A ∈ Rn×m, but maydiffer in three aspects (besides motivation).

NormalizationsMethods Sparsity row column order ClusteringLBDM 1 < s� m sqrt-L1 sqrt-L1 same time all 3LSC 1 < s� m L1 sqrt-L1 row first directDhillon 1 < s� m sqrt-L1 sqrt-L1 same time co-clusteringcSPEC s = m directKASP s = 1 landmark

Guangliang Chen and Khiem Pham | San José State University, CA 28/29

Page 29: Large-scalespectralclusteringusingdiffusion ...

Landmark-based Bipartite Diffusion Maps (LBDM)

Thank you for your attention!We presented a new landmark-based spectral clustering method andalso provided a unified view.

Our algorithm is simple to implement, fast to run, and accurate.

It can be applied for documents grouping and topic identification.

Acknowledgment. This work was motivated by a project with VerizonWireless, whose goal was to cluster their cell phone users based on dailywebsite visits.

Contact: [email protected],[email protected]

Guangliang Chen and Khiem Pham | San José State University, CA 29/29