Page 1: A penalized matrix decomposition, with application to ...statweb.stanford.edu/~tibs/sta306bfiles/PMD.pdf · \Star Wars: Episode V: The Empire Strikes Back" \Star Wars: Episode VI:

A penalized matrix decomposition, with application to sparse hierarchical clustering

Daniela Witten
PhD thesis 2009; Department of Statistics

Stanford University

May 4, 2015

Sparsity

Suppose we wish to predict $y \in \mathbb{R}^n$ using $X \in \mathbb{R}^{n \times p}$ where the number of features $p$ is very large.

We can fit a model $y = X\beta + \epsilon$ in such a way that most elements of $\hat{\beta}$ equal zero. Then our model involves just a small number of features: the model is sparse.

Today we discuss sparsity in the unsupervised setting.

Matrix Decompositions

Consider an n × p matrix X for which we want a low-rank approximation. For simplicity, assume that the row and column means of X are zero.

Matrix Decompositions

We might want this low-rank approximation in order to

1. obtain a lower-dimensional projection of the data that captures most of the variability, or

2. achieve a better understanding and interpretation of the data, or

3. impute missing values, e.g. for movie recommender systems

The singular value decomposition

We decompose the matrix X as

$$X = U D V^T$$

where U and V have orthonormal columns and D is diagonal; $d_1 \ge d_2 \ge \ldots \ge d_p \ge 0$.
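As a quick numerical check of this decomposition, here is a minimal NumPy sketch. The random matrix, its size, and the variable names are illustrative assumptions and are not taken from the slides. It centers a small matrix, computes the thin SVD, and verifies the stated properties.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Make the column means and then the row means zero, as assumed in the slides.
X = X - X.mean(axis=0, keepdims=True)
X = X - X.mean(axis=1, keepdims=True)

# Thin SVD: U is 6x4, d holds the singular values, Vt is V transposed (4x4).
U, d, Vt = np.linalg.svd(X, full_matrices=False)

reconstruction = U @ np.diag(d) @ Vt       # X = U D V^T
u_gram = U.T @ U                           # identity if columns are orthonormal
v_gram = Vt @ Vt.T                         # identity if columns of V are orthonormal
is_sorted = bool(np.all(np.diff(d) <= 0))  # d_1 >= d_2 >= ... >= d_p

print(np.allclose(reconstruction, X), is_sorted)
```

NumPy returns the singular values as a vector rather than a diagonal matrix, which is why `np.diag(d)` appears in the reconstruction.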

A sparse matrix decomposition

The SVD has many useful and interesting properties, but in general, the columns of U and V are not sparse - that is, no elements of U and V are exactly zero.

We want a matrix decomposition with sparse elements, for conciseness, parsimony, and interpretability.

Example of the sparse matrix decomposition: Netflix Data

The Netflix Data

Netflix Data

“Lord of the Rings: The Fellowship of the Ring”
“Lord of the Rings: The Two Towers: Extended Edition”
“Lord of the Rings: The Fellowship of the Ring: Extended Edition”
“Lord of the Rings: The Two Towers”
“Lord of the Rings: The Return of the King”
“Lord of the Rings: The Return of the King: Extended Edition”
“Star Wars: Episode V: The Empire Strikes Back”
“Star Wars: Episode VI: Return of the Jedi”
“Star Wars: Episode IV: A New Hope”
“Raiders of the Lost Ark”

Applications of the penalized matrix decomposition

Input matrix          Result
Data                  data interpretation, missing value imputation, matrix completion
Variance-covariance   sparse PCA
Cross-products        sparse CCA
Dissimilarity         sparse clustering, sparse MDS

Criterion for the Singular Value Decomposition

Recall that the first components u, v, and d of the SVD comprise the best rank-1 approximation to the matrix X, in the sense of the Frobenius norm:

$$\underset{u,v,d}{\text{minimize}} \; \|X - d u v^T\|_F^2 \quad \text{subject to } \|u\|_2 = 1,\ \|v\|_2 = 1$$
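The rank-1 optimality claim can be probed numerically. The sketch below is an illustration under assumed inputs (a random matrix and random unit vectors, none of which come from the slides). It compares the error of the leading singular triplet with the error of many random unit-norm pairs, each given its own best scale d = u^T X v.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))

U, d, Vt = np.linalg.svd(X, full_matrices=False)
u1, v1, d1 = U[:, 0], Vt[0], d[0]


def rank1_error(X, d, u, v):
    """Squared Frobenius error ||X - d u v^T||_F^2."""
    return np.linalg.norm(X - d * np.outer(u, v), "fro") ** 2


best = rank1_error(X, d1, u1, v1)

# Smallest amount by which a random unit-norm pair exceeds the SVD error.
smallest_gap = np.inf
for _ in range(1000):
    u = rng.normal(size=8)
    u /= np.linalg.norm(u)
    v = rng.normal(size=5)
    v /= np.linalg.norm(v)
    err = rank1_error(X, u @ X @ v, u, v)  # d = u^T X v is the best scale for (u, v)
    smallest_gap = min(smallest_gap, err - best)

print(best, np.sum(d[1:] ** 2), smallest_gap)
```

The error of the leading triplet equals the sum of the squared remaining singular values, and no random pair falls below it.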

Criterion for the Penalized Matrix Decomposition

Suppose we add in additional penalty terms to that criterion:

$$\underset{u,v,d}{\text{minimize}} \; \|X - d u v^T\|_F^2 \quad \text{subject to } \|u\|_2 = \|v\|_2 = 1,\ P_1(u) \le c_1,\ P_2(v) \le c_2,$$

where $P_1$ and $P_2$ are arbitrary penalty functions. We can call this the rank one penalized matrix decomposition.

For now, let $P_1(u) = \|u\|_1$, $P_2(v) = \|v\|_1$.

This encourages sparsity.

This is related to a proposal of Shen and Huang (2008).

L1 (lasso) constraints

- Lasso (Tibshirani 1996)
- Basis Pursuit (Chen, Donoho, and Saunders 1998)
- LARS (Efron, Hastie, Johnstone, and Tibshirani 2004)
- The Dantzig selector (Candes and Tao 2007)
- Coordinate descent procedures (Friedman, Hastie, and Tibshirani 2008)

Hot active area (Emmanuel Candes, David Donoho and many others)

More on the Single-Factor PMD Model

Note that u, v that minimize

$$\|X - d u v^T\|_F^2 \quad \text{subject to } \|u\|_2 = \|v\|_2 = 1$$

also maximize

$$u^T X v \quad \text{subject to } \|u\|_2 \le 1,\ \|v\|_2 \le 1.$$

This means that we can re-write the single-factor PMD criterion as

$$\underset{u,v}{\text{maximize}} \; u^T X v \quad \text{subject to } \|u\|_2 \le 1,\ \|v\|_2 \le 1,\ \|u\|_1 \le c_1,\ \|v\|_1 \le c_2.$$
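The equivalence between the two criteria can be seen by expanding the Frobenius norm. The short derivation below is a sketch that uses only the unit-norm constraints on u and v.

```latex
\begin{aligned}
\|X - d\,u v^T\|_F^2
  &= \|X\|_F^2 - 2d\,u^T X v + d^2\,\|u\|_2^2\,\|v\|_2^2 \\
  &= \|X\|_F^2 - 2d\,u^T X v + d^2
  \qquad \text{since } \|u\|_2 = \|v\|_2 = 1 .
\end{aligned}
```

For fixed u and v this is minimized at d = u^T X v, where it equals ||X||_F^2 - (u^T X v)^2. The first term does not depend on u or v, so minimizing the criterion amounts to maximizing (u^T X v)^2, and after flipping the sign of u if necessary, to maximizing u^T X v. Because u^T X v is linear in each argument, its maximum over the unit balls is attained on their boundary, which is why the equality constraints can be relaxed to inequalities.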

Computation for PMD

$$\underset{u,v}{\text{maximize}} \; u^T X v \quad \text{subject to } \|u\|_2 \le 1,\ \|v\|_2 \le 1,\ \|u\|_1 \le c_1,\ \|v\|_1 \le c_2.$$

With u fixed, the criterion is convex in v, and with v fixed, it’s convex in u. This bi-convexity leads to a convenient iterative algorithm!

Algorithm for Sparse Matrix Decomposition

1. Initialize v to satisfy the constraints $\|v\|_2 = 1$, $\|v\|_1 \le c_2$.

2. Iterate until convergence:
   - $u \leftarrow \arg\max_u u^T X v$ subject to $\|u\|_1 \le c_1$, $\|u\|_2 \le 1$.
   - $v \leftarrow \arg\max_v u^T X v$ subject to $\|v\|_1 \le c_2$, $\|v\|_2 \le 1$.

For c1 and c2 sufficiently small, the resulting u and v will be sparse.

In the absence of L1 penalties, this yields the rank one SVD.

L1 and L2 penalties

Movie in http://statweb.stanford.edu/~tibs/sta306bfiles/pmd-movie.mpg

Algorithm for Sparse Matrix Decomposition

It’s not hard to show that the iterative steps are as follows:

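- $u \leftarrow \dfrac{S(Xv, \delta_1)}{\|S(Xv, \delta_1)\|_2}$,
- $v \leftarrow \dfrac{S(X^T u, \delta_2)}{\|S(X^T u, \delta_2)\|_2}$.

Here $S(x, t) = \mathrm{sgn}(x)(|x| - t)_+$ (the soft-thresholding operator), applied elementwise. $\delta_1 = 0$ if that already gives $\|u\|_1 \le c_1$; otherwise $\delta_1 > 0$ is chosen so that $\|u\|_1 = c_1$. Similarly for $\delta_2$.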
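The two updates can be sketched in NumPy as follows. This is a minimal illustration, not the author's implementation. The function names, the bisection search for the threshold, the SVD-based starting value for v, the fixed iteration count, and the simulated data are all assumptions made for the example. The bounds are assumed to satisfy 1 ≤ c1 ≤ sqrt(n) and 1 ≤ c2 ≤ sqrt(p).

```python
import numpy as np


def soft_threshold(x, t):
    """S(x, t) = sgn(x) * (|x| - t)_+, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def l2_normalize(x):
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x


def penalized_update(a, c, n_steps=50):
    """Return S(a, delta) / ||S(a, delta)||_2.

    delta is 0 if the L1 bound c already holds; otherwise it is found by
    bisection (an assumed search strategy) so that the L1 norm is about c.
    """
    x = l2_normalize(a)
    if np.sum(np.abs(x)) <= c:
        return x
    lo, hi = 0.0, np.max(np.abs(a))
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        x = l2_normalize(soft_threshold(a, mid))
        if np.sum(np.abs(x)) > c:
            lo = mid  # L1 norm still too large: threshold harder
        else:
            hi = mid  # bound satisfied: try a smaller threshold
    return l2_normalize(soft_threshold(a, hi))


def pmd_rank_one(X, c1, c2, n_iter=100):
    """Rank-one PMD with L1 bounds c1 on u and c2 on v."""
    # Start from the leading right singular vector, made feasible for c2.
    v = penalized_update(np.linalg.svd(X, full_matrices=False)[2][0], c2)
    for _ in range(n_iter):
        u = penalized_update(X @ v, c1)
        v = penalized_update(X.T @ u, c2)
    return u @ X @ v, u, v


# Simulated example: a sparse rank-one signal plus Gaussian noise.
rng = np.random.default_rng(0)
u_true = np.zeros(50)
u_true[:5] = 1.0
v_true = np.zeros(40)
v_true[:4] = 1.0
X = 3.0 * np.outer(u_true, v_true) + rng.normal(size=(50, 40))

d, u, v = pmd_rank_one(X, c1=2.5, c2=2.2)
print(d, np.count_nonzero(u), np.count_nonzero(v))
```

A fixed number of iterations is used here for brevity; a practical version would stop once u and v change by less than a tolerance.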

Algorithm in action

[Figure sequence: Update u, Update v, Update u, Update v]


Extension to K-factor Decomposition

1. Let $X^1 \leftarrow X$.

2. For $k \in 1, \ldots, K$:
   - Find $u_k$ and $v_k$ by applying the single-factor algorithm to data $X^k$.
   - $X^{k+1} \leftarrow X^k - d_k u_k v_k^T$ where $d_k = u_k^T X^k v_k$.

In the absence of L1 penalties, this gives the rank K SVD.
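The deflation loop, and the claim that it recovers the SVD when no L1 penalty is applied, can be illustrated with a short NumPy sketch. The helper names, the test matrix with singular values 5, 3, 2, 1, and the fixed iteration count are assumptions for the example. The single-factor step here is the unpenalized version (alternating normalization only).

```python
import numpy as np


def rank_one_unpenalized(X, n_iter=200):
    """Single-factor algorithm without L1 penalties: alternate and normalize."""
    v = np.ones(X.shape[1]) / np.sqrt(X.shape[1])
    for _ in range(n_iter):
        u = X @ v
        u /= np.linalg.norm(u)
        v = X.T @ u
        v /= np.linalg.norm(v)
    return u, v


def k_factor(X, K, rank_one=rank_one_unpenalized):
    """Extract K factors by repeatedly fitting one factor and deflating."""
    Xk = X.copy()
    factors = []
    for _ in range(K):
        u, v = rank_one(Xk)
        d = u @ Xk @ v                 # d_k = u_k^T X^k v_k
        factors.append((d, u, v))
        Xk = Xk - d * np.outer(u, v)   # X^{k+1} = X^k - d_k u_k v_k^T
    return factors


# Test matrix with known singular values 5, 3, 2, 1.
rng = np.random.default_rng(2)
Q1, _ = np.linalg.qr(rng.normal(size=(30, 4)))
Q2, _ = np.linalg.qr(rng.normal(size=(6, 4)))
X = Q1 @ np.diag([5.0, 3.0, 2.0, 1.0]) @ Q2.T

ds = np.array([d for d, _, _ in k_factor(X, K=3)])
svd_ds = np.linalg.svd(X, compute_uv=False)[:3]
print(ds, svd_ds)
```

Passing a penalized single-factor routine as `rank_one` would give the sparse K-factor decomposition; in that case the factors are generally no longer orthogonal.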

Example of Sparse Matrix Decomposition: Netflix Data

Netflix Data: Factor 1 - All movies have negative weights

“Lord of the Rings: The Fellowship of the Ring”
“Lord of the Rings: The Two Towers: Extended Edition”
“Lord of the Rings: The Fellowship of the Ring: Extended Edition”
“Lord of the Rings: The Two Towers”
“Lord of the Rings: The Return of the King”
“Lord of the Rings: The Return of the King: Extended Edition”
“Star Wars: Episode V: The Empire Strikes Back”
“Star Wars: Episode VI: Return of the Jedi”
“Star Wars: Episode IV: A New Hope”
“Raiders of the Lost Ark”

Netflix Data: Factor 5 - Movies with positive weights

“Austin Powers in Goldmember”
“Austin Powers: International Man of Mystery”
“Austin Powers: The Spy Who Shagged Me”
“The Nutty Professor”
“Big Mommas House”
“Wild Wild West”
“Dodgeball: A True Underdog Story”
“Anchorman: The Legend of Ron Burgundy”
“Mr. Deeds”
“Punch-Drunk Love”
“Anger Management”
“Moulin Rouge”
“Spaceballs”

Netflix Data: Factor 5 - Movies with negative weights

“Star Wars: Episode V: The Empire Strikes Back”
“Lord of the Rings: The Two Towers: Extended Edition”
“Lord of the Rings: The Fellowship of the Ring: Extended Edition”
“Lord of the Rings: The Return of the King: Extended Edition”
“Raiders of the Lost Ark”
“The Silence of the Lambs”
“Rain Man”
“We Were Soldiers”
“The Godfather”
“The Shawshank Redemption: Special Edition”
“Saving Private Ryan”
“E.T. the Extra-Terrestrial: The 20th Anniversary (Rerelease)”
“Finding Nemo (Widescreen)”

Hierarchical clustering

There has been a resurgence of interest in hierarchical clustering in the field of genomics.

Clustering when p ≫ n

Suppose we wish to cluster n observations on p features, where p ≫ n.

If the true underlying classes are defined on only a subset of the features, then the presence of noise features can obscure this signal.

Example

A simple example with 10 observations; 2 classes are defined on 10 important features.

Example: 10 important features; 10 features total

[Dendrogram; leaf order: 9 8 10 6 7 1 4 5 2 3]

Example: 10 important features; 500 features total

[Dendrogram; leaf order: 7 6 8 10 9 3 1 4 5 2]

Example: 10 important features; 5000 features total

[Dendrogram; leaf order: 9 7 6 10 5 2 8 1 4 3]

Sparse hierarchical clustering results: 10 important features; 5000 features total

[Dendrogram; leaf order: 1 4 5 2 3 10 6 7 8 9]

Sparse Clustering

We want a method to hierarchically cluster observations based on a small subset of the features; we will call this sparse hierarchical clustering.

We want an automated way to

- find a subset of features to use in the clustering, and
- obtain a more accurate or interesting clustering using that subset of features.

Assumption: We assume that the dissimilarity measure used is additive in the features: $D_{i,i'} = \sum_{j=1}^p d_{i,i',j}$
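To make the additivity assumption concrete, the sketch below uses squared differences as the feature-wise dissimilarity. This choice is an assumption for the illustration, since the slides do not fix a particular measure, and the function name and random data are likewise hypothetical. Summing the feature-wise terms over j then gives the squared Euclidean distance.

```python
import numpy as np


def featurewise_dissimilarities(X):
    """Return d with shape (n, n, p), where d[i, k, j] = (X[i, j] - X[k, j])**2."""
    diff = X[:, None, :] - X[None, :, :]
    return diff ** 2


rng = np.random.default_rng(3)
X = rng.normal(size=(5, 3))

d = featurewise_dissimilarities(X)
D = d.sum(axis=2)  # overall dissimilarity: sum over the p features

# Independent check: squared Euclidean distances from the Gram matrix.
sq_norms = (X ** 2).sum(axis=1)
direct = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T

print(np.allclose(D, direct))
```

Absolute differences would be another additive choice; the weighting idea that follows applies to any measure of this form.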

Dissimilarity matrix for the n observations

Dissimilarity matrix is a sum of dissimilarity matrices over the features

Hierarchical clustering sums the dissimilarity matrices for the features

Weighted sum of the dissimilarity matrices for the features


Sparse hierarchical clustering and the PMD

Let D denote the $n^2 \times p$ matrix for which column j is the feature-wise dissimilarity matrix for feature j.

Then, suppose we apply the PMD to D:

$$\underset{u,w}{\text{maximize}} \; u^T D w \quad \text{subject to } \|u\|_2 \le 1,\ \|w\|_2 \le 1,\ \sum_j w_j \le s,\ w_j \ge 0$$

Then, $w_j$ is a weight on the dissimilarity matrix for feature j. If we re-arrange the elements of Dw into an n × n matrix, then performing hierarchical clustering on this re-weighted dissimilarity matrix gives sparse hierarchical clustering.

Algorithm: Sparse hierarchical clustering

1. Compute D, the $n^2 \times p$ matrix of which column j consists of $\{d_{i,i',j}\}_{i,i'}$.

2. Initialize w as $w_1 = \ldots = w_p = \frac{1}{\sqrt{p}}$.

3. Iterate until convergence:
   - $u \leftarrow \arg\max_u u^T D w$ subject to $\|u\|_2 \le 1$.
   - $w \leftarrow \arg\max_w u^T D w$ subject to $\|w\|_2 \le 1$, $\sum_j w_j \le s$, $w_j \ge 0$.

4. Perform hierarchical clustering on the n × n dissimilarity matrix obtained by rearranging the terms in Dw.
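Steps 1 to 3 can be sketched in NumPy as below. This is a hedged illustration rather than the sparcl implementation. It assumes squared differences as the feature-wise dissimilarity, solves the w-step by soft-thresholding the nonnegative part of D^T u (the nonnegative analogue of the soft-thresholding update shown earlier) with a bisection search for the threshold, and uses a fixed number of iterations. The function names and simulated data are hypothetical, and s is assumed to lie between 1 and sqrt(p).

```python
import numpy as np


def update_w(a, s, n_steps=60):
    """Maximize a^T w subject to ||w||_2 <= 1, sum_j w_j <= s, w_j >= 0."""
    a = np.maximum(a, 0.0)  # negative entries get zero weight

    def normalized(delta):
        x = np.maximum(a - delta, 0.0)  # soft-threshold a nonnegative vector
        norm = np.linalg.norm(x)
        return x / norm if norm > 0 else x

    w = normalized(0.0)
    if w.sum() <= s:
        return w
    lo, hi = 0.0, a.max()
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        if normalized(mid).sum() > s:
            lo = mid  # weights still too spread out: threshold harder
        else:
            hi = mid
    return normalized(hi)


def sparse_hclust_weights(X, s, n_iter=20):
    """Return feature weights w and the re-weighted n x n dissimilarity matrix."""
    n, p = X.shape
    # Step 1: column j holds the n^2 squared differences for feature j.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).reshape(n * n, p)
    # Step 2: equal starting weights with unit L2 norm.
    w = np.full(p, 1.0 / np.sqrt(p))
    # Step 3: alternate the u and w updates.
    for _ in range(n_iter):
        u = D @ w
        u /= np.linalg.norm(u)
        w = update_w(D.T @ u, s)
    return w, (D @ w).reshape(n, n)


# Simulated data: 2 classes that differ only on the first 5 of 30 features.
rng = np.random.default_rng(4)
X = rng.normal(size=(12, 30))
X[:6, :5] += 3.0

w, Dw = sparse_hclust_weights(X, s=2.0)
print(np.count_nonzero(w), np.round(w[:5], 3))
```

For step 4, the matrix `Dw` can be handed to any hierarchical clustering routine that accepts a precomputed dissimilarity matrix, for example SciPy's `scipy.cluster.hierarchy.linkage` after converting to condensed form with `scipy.spatial.distance.squareform`.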

Sparse hierarchical clustering in action

A simulated example with 6 classes defined on 100 signal features; 2000 features in total.

[Figure: dendrograms for Ordinary Clustering and Sparse Clustering; plot of the weights W against Index (0 to 2000)]

An important breast cancer paper

Nature (2000) 406:747-752.

Breast cancer data

- 65 breast tumor samples for which gene expression data is available. Some samples are replicates from the same tumor (before and after chemo).
- Clustered based on full set of 1753 genes first.
- Clustered based on 496 intrinsic genes for which the variation between different tumors is large relative to the variation within a tumor.
- Based on the intrinsic gene clustering, determined that 62 of 65 tumors fall into one of four classes: normal-breast-like, basal-like, ER+, Erb-B2+.

Clustering results: normal-breast-like, basal-like, ER+, Erb-B2+

[Figure: dendrograms for Clustering Using All 1753 Genes and Clustering Using 496 Intrinsic Genes]

Sparse clustering

We wonder: If we sparsely cluster the observations using all of the genes, can we identify the four classes successfully?

Three types of clustering:

1. Sparse hierarchical clustering of all 1753 genes, with the tuning parameter chosen to yield 496 genes.

2. Sparse hierarchical clustering of all 1753 genes, with the tuning parameter chosen by the gap statistic.

3. Standard hierarchical clustering using the 496 genes with highest marginal variance.

normal-breast-like, basal-like, ER+, Erb-B2+

[Figure: dendrograms for Sparse Clustering: 496 Genes; Sparse Clustering: 106 Genes; 496 High-Variance Genes]

Genes with high weights

#    Gene                                                      Weight
1    S100 CALCIUM-BINDING PROTEIN A8 (CALGRANULIN A)           0.223
2    SECRETED FRIZZLED-RELATED PROTEIN 1                       0.2126
3    ESTROGEN RECEPTOR 1                                       0.2076
4    KERATIN 17                                                0.1627
5    HUMAN REARRANGED IMMUNOGLOBULIN LAMBDA                    0.1568
6    CYTOCHROME P450, SUBFAMILY IIA                            0.155
7    APOLIPOPROTEIN D                                          0.1509
8    LACTOTRANSFERRIN                                          0.1471
9    ESTROGEN RECEPTOR 1                                       0.1405
10   134783                                                    0.14
11   HEPATOCYTE NUCLEAR FACTOR 3, ALPHA                        0.1332
12   HUMAN REARRANGED IMMUNOGLOBULIN LAMBDA LIGHT              0.1309
13   FATTY ACID BINDING PROTEIN 4, ADIPOCYTE                   0.1292
14   CERULOPLASMIN (FERROXIDASE)                               0.126
15   HUMAN SECRETORY PROTEIN (P1.B) MRNA                       0.1208
16   NON-SPECIFIC CROSS REACTING ANTIGEN                       0.1199
17   LIPOPROTEIN LIPASE                                        0.1123
18   IMMUNOGLOBULIN LAMBDA LIGHT CHAIN                         0.112
19   CRYSTALLIN, ALPHA B                                       0.1108
20   FATTY ACID BINDING PROTEIN 4, ADIPOCYTE                   0.11
21   PLEIOTROPHIN (HEPARIN BINDING GROWTH FACTOR 8)            0.1099
22   85660                                                     0.1077
23   ESTS, HIGHLY SIMILAR TO PROBABLE ATAXIA-TELANGIECTASIA    0.1071
24   V-FOS FBJ MURINE OSTEOSARCOMA VIRAL ONCOGENE HOMOLOG      0.1056
25   EPIDIDYMIS-SPECIFIC, WHEY-ACIDIC PROTEIN TYPE             0.1013
26   ALDO-KETO REDUCTASE FAMILY 1, MEMBER C1                   0.1007

Conclusions

- Clustering methods are very sensitive to the set of features used.
- In high dimensions, we may not want to simply use all of the features that happen to be available.
- Objective methods are required for selecting the features for use in clustering.
- This proposal can be applied to K-means clustering, hierarchical clustering, K-medoids clustering, and more.
- Unsupervised learning when p ≫ n: need better ways to select tuning parameters and validate results obtained.
- R package sparcl: Sparse Clustering.
- Witten and Tibshirani (2010) 'A framework for feature selection in clustering', JASA (T & M) 105(490): 713-726.

References

1. Chin et al. (2006) 'Genomic and transcriptional aberrations linked to breast cancer pathophysiologies', Cancer Cell 10: 529-541.

2. Perou et al. (2000) 'Molecular portraits of human breast tumours', Nature 406: 747-752.

3. Shen and Huang (2008) 'Sparse principal component analysis via regularized low rank matrix approximation', Journal of Multivariate Analysis 99(6): 1015-1034.

4. Witten, Tibshirani, and Hastie (2009) 'A penalized matrix decomposition, with applications to canonical correlation analysis and principal components', Biostatistics 10(3): 515-534.

5. Witten and Tibshirani (2009) 'A framework for feature selection in clustering', Submitted.