SparsePCAviaBipartiteMatchings - GitHub...

arX

iv:1

508.

0062

5v1

[st

at.M

L]

4 A

ug 2

015

Sparse PCA via Bipartite Matchings

Megasthenis Asterisα, Dimitris Papailiopoulosβ , Anastasios Kyrillidisα

Alexandros G. Dimakisα

αUT Austin, βUC Berkeley

August 2015

Abstract

We consider the following multi-component sparse PCA problem: given a set of data points,we seek to extract a small number of sparse components with disjoint supports that jointlycapture the maximum possible variance. These components can be computed one by one,repeatedly solving the single-component problem and deflating the input data matrix, but aswe show this greedy procedure is suboptimal. We present a novel algorithm for sparse PCAthat jointly optimizes multiple disjoint components. The extracted features capture variancethat lies within a multiplicative factor arbitrarily close to 1 from the optimal. Our algorithmis combinatorial and computes the desired components by solving multiple instances of thebipartite maximum weight matching problem. Its complexity grows as a low order polynomialin the ambient dimension of the input data matrix, but exponentially in its rank. However,it can be effectively applied on a low-dimensional sketch of the data; this allows us to obtainpolynomial-time approximation guarantees via spectral bounds. We evaluate our algorithm onreal data-sets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches.

1 Introduction

Principal Component Analysis (PCA) reduces the dimensionality of a data set by projecting it ontoprincipal subspaces spanned by the leading eigenvectors of the sample covariance matrix. SparsePCA is a useful variant that offers higher data interpretability [1, 2, 3], a property that is sometimesdesired even at the cost of statistical fidelity [4]. Furthermore, when the obtained features are usedin subsequent learning tasks, sparsity potentially leads to better generalization error [5].

Given a real n× d data matrix S representing n centered data points supported on d features,the leading sparse principal component of the data set is the sparse vector that maximizes theexplained variance:

x⋆ , argmax‖x‖2=1,‖x‖0=s

x⊤Ax, (1)

where A = 1/n · S⊤S is the d × d empirical covariance matrix. The sparsity constraint makesthe problem NP-hard and hence computationally intractable in general, and hard to approximatewithin some small constant [6]. A significant volume of prior work has focused on algorithms thatapproximately solve the optimization problem [2, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15], while a largevolume of theoretical results has been established under planted statistical models [16, 17, 18, 19,20, 21, 22, 23, 24].

1

http://arxiv.org/abs/1508.00625v1

In most practical settings, we tend to go beyond computing a single sparse PC. Contrary to thesingle-component problem, there has been limited work on computing multiple components. Thescarcity is partially attributed to conventional PCA wisdom: multiple components can be computedone-by-one, repeatedly, by solving the single-component sparse PCA problem (1) and deflating theinput data to remove information captured by previously extracted components [25]. In fact,the multi-component version of sparse PCA is not uniquely defined in the literature. Differentdeflation-based approaches can lead to different outputs: extracted components may or may not beorthogonal, while they may have disjoint or overlapping supports [25]. In the statistics literature,where the objective is typically to recover a “true” principal subspace, a branch of work has focusedon the “subspace row sparsity” [26], an assumption that leads to sparse components all supportedon the same set of variables. While in [27], the authors discuss an alternative perspective on thefundamental objective of the sparse PCA problem.

In this work, we develop a novel algorithm for the multi-component sparse PCA problem withdisjoint supports. Formally, we are interested in finding k components that are s-sparse, havedisjoint supports, and jointly maximize the explained variance:

X⋆ , argmaxX∈Xk

Tr(X⊤AX

), (2)

where the feasible set is

Xk ,X ∈ R

d×k : ‖Xj‖2 = 1, ‖Xj‖0 = s, supp(Xi) ∩ supp(Xj) = ∅, ∀ j ∈ [k], i < j,

with Xj denoting the jth column of X. The number k of the desired components is a user definedparameter and we consider it to be a small constant.

Contrary to the greedy sequential approach that repeatedly uses deflation, our algorithm jointlycomputes all the vectors in X, and comes with theoretical approximation guarantees. We note thateven if one could solve each single-component sparse PCA problem (1) exactly, greedy deflationcan be highly suboptimal. We show this through a simple example in Section 7.

Our Contributions

1. We develop an algorithm that provably approximates the solution to the sparse PCA problem (2)within a multiplicative factor arbitrarily close to 1. To the best of our knowledge, this is thefirst algorithm that jointly optimizes multiple components with disjoint supports, provably. Ouralgorithm is combinatorial; it recasts sparse PCA as multiple instances of bipartite maximumweight matching on graphs determined by the input data.

2. The computational complexity of our algorithm grows as a low order polynomial in the ambientdimension d, but is exponential in the intrinsic dimension of the input data, i.e., the rankof A. To alleviate the impact of this dependence, our algorithm can be applied on a low-dimensional sketch of the input data to obtain an approximate solution to (2). This extra levelof approximation introduces an additional penalty in our theoretical approximation guarantees,which naturally depends on the quality of the sketch and, in turn, the spectral decay of A. Weshow how these bounds further translate to an additive PTAS (polynomial-time approximationscheme) for sparse PCA. Our additive PTAS outputs an approximate solution with explainedvariance of at least OPT − ǫ · s, for any sparsity s ∈ 1, . . . , n, any constant error ǫ > 0 andany k = O(1) number of orthogonal components.1

1Here, OPT is the explained variance captured by the optimal set of k components that are s sparse and havedisjoint supports.

2

3. We empirically evaluate our algorithm on real datasets, and compare it against state-of-the-artmethods for the single-component sparse PCA problem (1) in conjunction with the appro-priate deflation step. In many cases, our algorithm—as a result of jointly optimizing overmultiple components—leads to significantly improved results, and outperforms deflation-basedapproaches.

2 Sparse PCA through Bipartite Matchings

Our algorithm approximately solves the constrained maximization (2) on a d × d rank-r PSDmatrix A within a multiplicative factor arbitrarily close to 1. It operates by recasting the maxi-mization into multiple instances of the bipartite maximum weight matching problem. Each instanceultimately yields a feasible solution: a set of k components that are s-sparse and have disjoint sup-ports. The algorithm examines these solutions, and outputs the one that maximizes the explainedvariance, i.e., the quadratic objective in (2).

The computational complexity of our algorithm grows as a low order polynomial in the ambientdimension d of the input, but exponentially in its rank r. Despite the unfavorable dependence onthe rank, it is unlikely that a substantial improvement can be achieved in general [6]. However, de-coupling the dependence on the ambient and the intrinsic dimension of the input has an interestingramification; instead of the original input A, our algorithm can be applied on a low-rank surrogateto obtain an approximate solution, alleviating the dependence on r. We discuss this in Section 3,and present the approximation bound that this allows us to obtain.

Let A = UΛU⊤ denote the truncated eigenvalue decomposition of A; Λ is a diagonal r × rwhose ith diagonal entry is equal to the ith largest eigenvalue of A, while the columns of U arethe corresponding eigenvectors. By the Cauchy-Schwartz inequality, for any x ∈ R

d,

x⊤Ax =∥∥Λ1/2U⊤x

∥∥22≥⟨Λ1/2U⊤x, c

⟩2, ∀ c ∈ R

r : ‖c‖2 = 1. (3)

In fact, equality in (3) can always be achieved for c colinear to Λ1/2Ux ∈ Rr and in turn

x⊤Ax = maxc∈Sr−1

2

⟨x, UΛ1/2c

⟩2,

where Sr−12 denotes the ℓ2-unit sphere in r dimensions. More generally, for any X ∈ R

d×k,

Tr

(X⊤AX

)=

k∑

j=1

Xj⊤AXj = maxC:Cj∈Sr−1

2 ∀j

k∑

j=1

⟨Xj , UΛ1/2Cj

⟩2. (4)

Under the variational characterization of the trace objective in (4), the sparse PCA problem (2)can be re-written as a joint maximization over the variables X and C as follows:

maxX∈Xk

Tr(X⊤AX

)= max

X∈Xk

maxC:Cj∈Sr−1

2 ∀j

k∑

j=1

⟨Xj , UΛ1/2Cj

⟩2. (5)

The alternative formulation of the sparse PCA problem in (5) takes a step towards decoupling thedependence of the optimization on the ambient and intrinsic dimensions d and r, respectively. Themotivation behind the introduction of the auxiliary variable C will become clear in the sequel.

For a given C, the value of X ∈ Xk that maximizes the objective in (5) for that C is

X , argmaxX∈Xk

k∑

j=1

⟨Xj,Wj

⟩2, (6)

3

where W,UΛ1/2C is a real d× k matrix. The constrained, non-convex maximization (6) plays acentral role in our developments. We will later describe a combinatorial O(d · (s · k)2) procedure toefficiently compute X, reducing the maximization to an instance of the bipartite maximum weightmatching problem. For now, however, let us assume that such a procedure exists.

Let X⋆, C⋆ be the pair that attains the maximum in (5); in other words, X⋆ is the desiredsolution to the sparse PCA problem. If the optimal auxiliary variable C⋆ was known, then wewould be able to recover X⋆ by solving the maximization (6) for C = C⋆. Of course, C⋆ is notknown, and it is not possible to exhaustively consider all possible values in the domain of C.Instead, we examine only a finite number of possible values of C over a fine discretization of itsdomain. In particular, let Nǫ/2(S

r−12 ) denote a finite ǫ/2-net of the r-dimensional ℓ2-unit sphere; for

any point in Sr−12 , the net contains a point within an ǫ/2 radius from the former. There are several

ways to construct such a net [28]. Further, let [Nǫ/2(Sr−12 )]⊗k ⊂ R

d×k denote the kth Cartesianpower of the aforementioned ǫ/2-net. By construction, this collection of points contains a matrix Cthat is column-wise close to C⋆. In turn, it can be shown using the properties of the net, that thecandidate solution X ∈ Xk obtained through (6) at that point C will be approximately as good asthe optimal X⋆ in terms of the quadratic objective in (2).

Algorithm 1 Sparse PCA (Multiple disjoint components)

input : PSD d× d rank-r matrix A, ǫ ∈ (0, 1), k ∈ Z+.output : X ∈ Xk Theorem 11: C ← 2: [U,Λ]← EIG(A)3: for each C ∈ [Nǫ/2(S

r−12 )]⊗k do

4: W← UΛ1/2C W ∈ Rd×k

5: X← argmaxX∈Xk

∑kj=1

⟨Xj ,Wj

⟩2 Alg. 26: C ← C ∪

X

7: end for8: X← argmax

X∈C Tr(X⊤AX

)

All above observations yield aprocedure for approximately solvingthe sparse PCA problem (2). Thesteps are outlined in Algorithm 1.Given the desired number of com-ponents k and an accuracy parame-ter ǫ ∈ (0, 1), the algorithm gener-ates a net [Nǫ/2(S

r−12 )]⊗k and iterates

over its points. At each point C, itcomputes a feasible solution for thesparse PCA problem – a set of k s-sparse components – by solving themaximization in (6) via a procedure(Alg. 2) that will be described in thesequel. The algorithm collects the candidate solutions identified at the points of the net. The bestamong them achieves an objective in (2) that provably lies close to optimal. More formally,

Theorem 1. For any real d× d rank-r PSD matrix A, desired number of components k, number sof nonzero entries per component, and accuracy parameter ǫ ∈ (0, 1), Algorithm 1 outputs X ∈ Xk

such that

Tr(X

⊤AX

)≥ (1− ǫ) ·Tr

(X⊤

⋆ AX⋆

),

where X⋆, argmaxX∈XkTr(X⊤AX

), in time TSVD(r) +O

((4ǫ

)r·k · d · (s · k)2).

Algorithm 1 is the first nontrivial algorithm that provably approximates the solution of thesparse PCA problem (2). According to Theorem 1, it achieves an objective value that lies withina multiplicative factor from the optimal, arbitrarily close to 1. Its complexity grows as a low-orderpolynomial in the dimension d of the input, but exponentially in the intrinsic dimension r. Note,however, that it can be exponentially faster compared to the O(ds·k) brute force approach thatexhaustively considers all candidate supports for the k sparse components. The complexity of ouralgorithm follows from the cardinality of the net and the complexity of Algorithm 2, the subroutinethat solves the constrained maximization (6). The latter is a key ingredient of our algorithm, and isdiscussed in detail in the next subsection. A formal proof of Theorem 1 is provided in Section 9.2.

4

2.1 Sparse Components via Bipartite Matchings

In the core of Algorithm 1 lies Algorithm 2, a procedure that solves the constrained maximizationin (6). The algorithm breaks down the maximization into two stages. First, it identifies the supportof the optimal solution X. Determining the support reduces to an instance of the maximummatching problem on a weighted bipartite graph G. Then, it recovers the exact values of thenonzero entries in X based on the Cauchy-Schwarz inequality. In the sequel, we provide a briefdescription of Algorithm 2, leading up to its guarantees in Lemma 2.1.

Let Ij,supp(Xj) be the support of the jth column of X, j = 1, . . . , k. The objective in (6)becomes

k∑

j=1

⟨Xj ,Wj

⟩2=

k∑

j=1

(∑

i∈Ij

Xij ·Wij

)2≤

k∑

j=1

∑

i∈Ij

W 2ij . (7)

The last inequality is an application of the Cauchy-Schwarz Inequality and the constraint ‖Xj‖2 = 1∀ j ∈ 1, . . . , k. In fact, if an oracle reveals the supports Ij, j = 1, . . . , k, the upper bound in (7)can always be achieved by setting the nonzero entries of X as in Algorithm 2 (Line 6). Therefore,the key in solving (6) is determining the collection of supports to maximize the right-hand sideof (7).

u(1)

1

u(1)s

...

u(k)

1

u(k)s

...

v1

vd

vi

...

...

...

W2

i1

W2

i1

W2

ik

W2

ik

U1

Uk

V

Figure 1: The graph G generated byAlg. 2. It is used to determine the supportof the solution X in (6).

By constraint, the sets Ij must be pairwise dis-joint, each with cardinality s. Consider a weightedbipartite graph G =

(U = U1, . . . , Uk, V,E

)con-

structed as follows2 (Fig. 1):

• V is a set of d vertices v1, . . . , vd, corresponding tothe d variables, i.e., the d rows of X.

• U is a set of k · s vertices, conceptually partitionedinto k disjoint subsets U1, . . . , Uk, each of cardinal-ity s. The jth subset, Uj , is associated with thesupport Ij; the s vertices u(j)

α , α = 1, . . . , s in Uj

serve as placeholders for the variables/indices in Ij.

• Finally, the edge set is E = U × V . The edgeweights are determined by the d×k matrixW in (6).In particular, the weight of edge (u(j)

α , vi) is equalto W 2

ij . Note that all vertices in Uj are effectivelyidentical; they all share a common neighborhoodand edge weights.

Any feasible support Ijkj=1 corresponds to a perfect matching in G and vice-versa. Recall thata matching is a subset of the edges containing no two edges incident to the same vertex, while aperfect matching, in the case of an unbalanced bipartite graph G = (U, V,E) with |U | ≤ |V |, is amatching that contains at least one incident edge for each vertex in U . Given a perfect matchingM ⊆ E, the disjoint neighborhoods of Ujs under M yield a support Ijkj=1. Conversely, anyvalid support yields a unique perfect matching in G (taking into account that all vertices in Uj

are isomorphic). Moreover, due to the choice of weights in G, the right-hand side of (7) for agiven support Ijkj=1 is equal to the weight of the matching M in G induced by the former, i.e.,

2The construction is formally outlined in Algorithm 4 in Section 8.

5

Algorithm 2 Compute Candidate Solution

input Real d× k matrix Woutput X = argmaxX∈Xk

∑kj=1

⟨Xj ,Wj

⟩2

1: G(Ujkj=1, V,E

)← GenBiGraph(W) Alg. 4

2: M←MaxWeightMatch(G) ⊂ E3: X← 0d×k

4: for j = 1, . . . , k do5: Ij ← i ∈ 1, . . . , d : (u, vi) ∈ M, u ∈ Uj6: [Xj ]Ij ← [Wj ]Ij/‖[Wj ]Ij‖27: end for

∑kj=1

∑i∈Ij

W 2ij=∑

(u,v)∈M w(u, v). It follows that determining the support of the solution in (6),reduces to solving the maximum weight matching problem on the bipartite graph G.

Algorithm 2 readily follows. Given W ∈ Rd×k, the algorithm generates a weighted bipartite

graph G as described, and computes its maximum weight matching. Based on the latter, it firstrecovers the desired support of X (Line 5), and subsequently the exact values of its nonzero entries(Line 6). The running time is dominated by the computation of the matching, which can be donein O

(|E||U |+ |U |2 log |U |

)using a variant of the Hungarian algorithm [29]. Hence,

Lemma 2.1. For any W ∈ Rd×k, Algorithm 2 computes the solution to (6), in time O

(d · (s · k)2

).

A more formal analysis and proof of Lemma 2.1 is available in Section 9.1. With Algorithm 2and Lemma 2.1 in place, we complete the description of our sparse PCA algorithm (Algorithm 1)and the proof sketch of Theorem 1.

3 Sparse PCA on Low-Dimensional Sketches

Algorithm 3 Sparse PCA on Low Dim. Sketch

input : Real n× d S, r ∈ Z+, ǫ ∈ (0, 1), k ∈ Z+.output X(r) ∈ Xk. Thm. 21: S← Sketch(S, r)

2: A← S⊤S

3: X(r) ← Algorithm 1 (A, ǫ, k).

Algorithm 1 approximately solves thesparse PCA problem (2) on a d × d rank-rPSD matrix A, in time that grows as alow-order polynomial in the ambient dimen-sion d, but depends exponentially on r. Thisdependence can be prohibitive in practice.To mitigate its effect, instead of the origi-nal input, we can apply our sparse PCA al-gorithm on a low-rank approximation of A.Intuitively, the quality of the extracted components should depend on how well that low-ranksurrogate approximates the original input.

More formally, let S be the real n × d data matrix representing n (potentially centered) dat-apoints in d variables, and A the corresponding d × d covariance matrix. Further, let S be alow-dimensional sketch of the original data; an n × d matrix whose rows lie in an r-dimensionalsubspace, with r being an accuracy parameter. Such a sketch can be obtained in several ways,including for example exact or approximate SVD, or online sketching methods [30]. Finally, letA = 1/n · S⊤

S be the covariance matrix of the sketched data. Then, instead of A, we can approxi-mately solve the sparse PCA problem by applying Algorithm 1 on the low-rank surrogate A. Theabove are formally outlined in Algorithm 3. We note that the covariance matrix A does not needto be explicitly computed; Algorithm 1 can operate directly on the (sketched) input data matrix.

6

Theorem 2. For any n × d input data matrix S, with corresponding empirical covariance matrixA = 1/n · S⊤S, any desired number of components k, and accuracy parameters ǫ ∈ (0, 1) and r,Algorithm 3 outputs X(r) ∈ Xk such that

Tr(X⊤

(r)AX(r)

)≥ (1− ǫ) ·Tr

(X⊤

⋆ AX⋆

)− 2 · k · λ1,s(A−A),

in time TSKETCH(r) + TSVD(r) + O((

4ǫ

)r·k · d · (s · k)2). Here, X⋆, argmaxX∈Xk

Tr(X⊤AX

), and

λ1,s(A) denotes the sparse eigenvalue, i.e., the eigenvalue that corresponds to the principal s-sparseeigenvector of A.

The error λ1,s(A−A) and in turn the tightness of the approximation guarantees hinges on thequality of the sketch A. Higher values of the parameter r (the rank of the sketch) can allow fora more accurate solution and tighter guarantees. That is the case, for example, when the sketchis obtained through exact SVD. In that sense, Theorem 2 establishes a natural trade-off betweenthe running time of Algorithm 3 and the quality of the approximation guarantees. A formal proofof Theorem 2 is provided in Section 9.3. Observe that the error term itself is a sparse eigenvaluethat is hard to approximate, however even loose bounds provide tight conditional approximationresults, as we see next.

Using the main matrix approximation result of [31], the next theorem establishes that Algo-rithm 3 can be turned into an additive PTAS.

Theorem 3. Let A be a d × d positive semidefinite matrix with entries in [−1, 1], V be a d × dmatrix such that A = VV⊤. Further, let R be a random d × r matrix with entries drawn i.i.d.according to N (0, 1/r), and define

A,VRR⊤V⊤.

For any constant ǫ ∈ (0, 1], let r = O(ǫ−2 log d). Then, for any desired sparsity s, and numberof components k = O(1), Algorithm 1 with input argument A and accuracy parameter ǫ, outputsX(r) ∈ Xk such that

Tr(X⊤

(r)AX(r)

)≥ Tr

(X⊤

⋆ AX⋆

)− ǫ · s

with probability at least 1− 1/poly(d), in time nO(log(1/ǫ)/ǫ2)).

Remark 3.1. Note that λ1(A−A) serves as another elementary upper bound on λ1,s(A−A). IfA is a the rank-d SVD approximation of A, then—similar to [32]—we can obtain a multiplicativePTAS for sparse PCA, under the assumption of a decaying spectrum (e.g., under a power-lawdecay), and for s = Ω(n).

4 Related Work

We are not aware of any algorithm with provable guarantees for sparse PCA with disjoint supports.Multiple components can be extracted by repeatedly solving (1) using one of the aforementionedmethods. To ensure disjoint supports, variables “selected” by a component are removed fromthe dataset. This greedy approach, however, can result in highly suboptimal objective value (Seeexample in Sec. 7).

A significant volume of work has focused on the single-component sparse PCA problem (1);we scratch the surface and refer the reader to citations therein. Representative examples rangefrom early heuristics in [2], to the LASSO based techniques in [3], the elastic net ℓ1-regression in[4], ℓ1 and ℓ0 regularized optimization methods such as GPower in [7], a greedy branch-and-bound

7

technique in [8], or semidefinite programming approaches [9, 10, 11]. The authors of [13] presentan approach that uses ideas from an expectation-maximization (EM) formulation of the problem.More recently, [12] presents a simple and very efficient truncated version of the power iteration(TPower). Finally, [15] introduces an exact solver for the low-rank case of the problem; this solverwas then used on low-rank sketches in the work of [14] (SpanSPCA), that provides conditionalapproximation guarantees under spectral assumptions on the input data. Several ideas in this workare inspired by the aforementioned low-rank solvers. In our experiments, we compare against EM,TPower, and SpanSPCA, which all are experimentally achieving state-of-the-art performance.

Parallel to the algorithmic and optimization perspective, there is large line of statistical analysisfor sparse PCA that focuses on guarantees pertaining to planted models and the recovery of a “true”sparse component [16, 17, 18, 19, 20, 21, 22, 23, 24].

There has been some work on the explicit estimation of principal subspaces or multiple compo-nents under sparsity constraints. Non-deflation-based algorithms include extensions of the diagonalthresholding algorithm [33] and iterative thresholding approaches [17], while [34] and [35] proposemethods that rely on the “row sparsity for subspaces” assumption of [26]. These methods yieldcomponents supported on a common set of variables, and hence solve a problem different from (2).Magdon-Ismail and Boutsidis [27] discuss the multiple component Sparse PCA problem, proposean alternative objective function and for that problem obtain interesting theoretical guarantees.Finally, [36] develops a framework for sparse matrix factorizaiton problems, based on a novel atomicnorm. That framework captures sparse PCA – although not explicitly the constraint of disjointsupports – but the resulting optimization problem, albeit convex, is NP-hard.

5 Experiments

We evaluate our algorithm on a series of real datasets, and compare it to deflation-based approachesfor sparse PCA using TPower [12], EM [13], and SpanSPCA [14]. The latter are representativeof the state of the art for the single-component sparse PCA problem (1). Multiple componentsare computed one by one. To ensure disjoint supports, the deflation step effectively amounts toremoving from the dataset all variables used by previously extracted components. For algorithmsthat are randomly initialized, we depict best results over multiple random restarts. Additionalexperimental results are listed in Section 11 of the appendix.

Our experiments are conducted in a Matlab environment. Due to its nature, our algorithmis easily parallelizable; its prototypical implementation utilizes the Parallel Pool Matlab featureto exploit multicore (or distributed cluster) capabilities. Recall that our algorithm operates on alow-rank approximation of the input data. Unless otherwise specified, it is configured for a rank-4approximation obtained via truncated SVD. Finally, we put a time barrier in the execution ofour algorithm, at the cost of the theoretical approximation guarantees; the algorithm returns bestresults at the time of termination. This “early termination” can only hurt the performance of ouralgorithm.

Leukemia Dataset. We evaluate our algorithm on the Leukemia dataset [37]. The datasetcomprises 72 samples, each consisting of expression values for 12582 probe sets. We extract k = 5sparse components, each active on s = 50 features. In Fig. 2(a), we plot the cumulative explainedvariance versus the number of components. Deflation-based approaches are greedy: the leadingcomponents capture high values of variance, but subsequent ones contribute less. On the contrary,our algorithm jointly optimizes the k = 5 components and achieves higher total cumulative variance;one cannot identify a top component. We repeat the experiment for multiple values of k. Fig. 2(b)depicts the total cumulative variance capture by each method, for each value of k.

8

(a) (b)

Figure 2: Cumulative variance captured by the k s-sparse extracted components – Leukemiadataset [37]. Sparsity is arbitrarily set to s = 50 nonzero entries per component. Fig. 2(a) de-picts the cum. variance versus the number of components, for k = 5. Deflation-based approachesare greedy; first components capture high variance, but subsequent ones contribute less. Our algo-rithm jointly optimizes the k = 5 components and achieves higher total cum. variance. Fig. 2(b)depicts the total cum. variance achieved for various values of k.

Additional Datasets. We repeat the experiment on multiple datasets, arbitrarily selectedfrom [37]. Table 1 lists the total cumulative variance captured by k = 5 components, each withs = 40 nonzero entries, extracted using the four methods. Our algorithm achieves the highest valuesin most cases.

Bag of Words (BoW) Dataset. [37] This is a collection of text corpora stored under the“bag-of-words” model. For each text corpus, a vocabulary of d words is extracted upon tokenization,and the removal of stopwords and words appearing fewer than ten times in total. Each documentis then represented as a vector in that d-dimensional space, with the ith entry corresponding to thenumber of appearances of the ith vocabulary entry in the document.

We solve the sparse PCA problem (2) on the word-by-word cooccurrence matrix, and extractk = 8 sparse components, each with cardinality s = 10. We note that the latter is not explicitlyconstructed; our algorithm can operate directly on the input word-by-document matrix. Table 2lists the variance captured by each method; our algorithm consistently outperforms the other

TPower EM sPCA SpanSPCA SPCABiPart

Amzn Com Rev (1500×10000) 7.31e+ 03 7.32e + 03 7.31e+ 03 7.79e+ 03

Arcence Train (100×10000) 1.08e+ 07 1.02e + 07 1.08e+ 07 1.10e+ 07

CBCL Face Train (2429×361) 5.06e+ 00 5.18e + 00 5.23e+ 00 5.29e+ 00

Isolet-5 (1559×617) 3.31e+ 01 3.43e + 01 3.34e+ 01 3.51e+ 01

Leukemia (72×12582) 5.00e+ 09 5.03e + 09 4.84e+ 09 5.37e+ 09

Pems Train (267×138672) 3.94e+ 00 3.58e + 00 3.89e+ 00 3.75e+ 00

Mfeat Pix (2000×240) 5.00e+ 02 5.27e + 02 5.08e+ 02 5.47e+ 02

Table 1: Total cumulative variance captured by k = 5 40-sparse extracted components on variousdatasets [37]. For each dataset, we list the size (#samples×#variables) and the value of variancecaptured by each method. Our algorithm operates on a rank-4 sketch in all cases.

9

TPower EM sPCA SpanSPCA SPCABiPart

BoW:NIPS (1500×12419) 2.51e+ 03 2.57e+ 03 2.53e+ 03 3.34e+ 03 (+29.98%)

BoW:KOS (3430×6906) 4.14e+ 01 4.24e+ 01 4.21e+ 01 6.14e+ 01 (+44.57%)

BoW:Enron (39861×28102) 2.11e+ 02 2.00e+ 02 2.09e+ 02 2.38e+ 02 (+12.90%)

BoW:NyTimes (300000×102660) 4.81e+ 01 − 4.81e+ 01 5.31e+ 01 (+10.38%)

Table 2: Total variance captured by k = 8 extracted components, each with s = 15 nonzero entries– Bag of Words dataset [37]. For each corpus, we list the size (#documents×#vocabulary-size)and the explained variance. Our algorithm operates on a rank-5 sketch in all cases.

approaches.Finally, note that here each sparse component effectively selects a small set of words. In turn,

the k extracted components can be interpreted as a set of well-separated topics. In Table 3, we listthe topics extracted from the NY Times corpus (part of the Bag of Words dataset). The corpusconsists of 3 · 105 news articles and a vocabulary of d = 102660 words.

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8

1: percent zzz united states zzz bush company team cup school zzz al gore

2: million zzz u s official companies game minutes student zzz george bush

3: money zzz american government market season add children campaign

4: high attack president stock player tablespoon women election

5: program military group business play oil show plan

6: number palestinian leader billion point teaspoon book tax

7: need war country analyst run water family public

8: part administration political firm right pepper look zzz washington

9: problem zzz white house american sales home large hour member

10: com games law cost won food small nation

Table 3: BoW:NyTimes dataset [37]. The table lists the words corresponding to the s = 10nonzero entries of each of the k = 8 extracted components (topics). Words corresponding to highermagnitude entries appear higher in the topic.

6 Conclusions

We considered the sparse PCA problem for multiple components with disjoint supports. Existingmethods for the single component problem can be used along with an appropriate deflation step tocompute multiple components one by one, leading to potentially suboptimal results. We presenteda novel algorithm for jointly optimizing multiple sparse and disjoint components with provableapproximation guarantees. Our algorithm is combinatorial and exploits interesting connectionsbetween the sparse PCA and the bipartite maximum weight matching problems. It runs in timethat grows as a low-order polynomial in the ambient dimension of the input data, but dependsexponentially on its rank. To alleviate this dependency, we can apply the algorithm on a low-dimensional sketch of the input, at the cost of an additional error in our theoretical approximationguarantees. Empirical evaluation of our algorithm demonstrated that in many cases it outperformsdeflation-based approaches.

10

Acknowledgments

DP is generously supported by NSF awards CCF-1217058 and CCF-1116404 and MURI AFOSRgrant 556016. This research has been supported by NSF Grants CCF 1344179, 1344364, 1407278,1422549 and ARO YIP W911NF-14-1-0258.

References

[1] H.F. Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika,23(3):187–200, 1958.

[2] I.T. Jolliffe. Rotation of principal components: choice of normalization constraints. Journalof Applied Statistics, 22(1):29–35, 1995.

[3] I.T. Jolliffe, N.T. Trendafilov, and M. Uddin. A modified principal component technique basedon the lasso. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.

[4] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journalof computational and graphical statistics, 15(2):265–286, 2006.

[5] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Sparse features for pca-likelinear regression. In Advances in Neural Information Processing Systems, pages 2285–2293,2011.

[6] Siu On Chan, Dimitris Papailiopoulos, and Aviad Rubinstein. On the worst-case approxima-bility of sparse PCA. preprint, 2015.

[7] M. Journee, Y. Nesterov, P. Richtarik, and R. Sepulchre. Generalized power method for sparseprincipal component analysis. The Journal of Machine Learning Research, 11:517–553, 2010.

[8] B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse pca: Exact and greedyalgorithms. NIPS, 18:915, 2006.

[9] Alexandre d’Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparseprincipal component analysis. The Journal of Machine Learning Research, 9:1269–1294, 2008.

[10] Y. Zhang, A. d’Aspremont, and L.E. Ghaoui. Sparse pca: Convex relaxations, algorithms andapplications. Handbook on Semidefinite, Conic and Polynomial Optimization, pages 915–940,2012.

[11] A. d’Aspremont, L. El Ghaoui, M.I. Jordan, and G.R.G. Lanckriet. A direct formulation forsparse pca using semidefinite programming. SIAM review, 49(3):434–448, 2007.

[12] Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems.The Journal of Machine Learning Research, 14(1):899–925, 2013.

[13] Christian D. Sigg and Joachim M. Buhmann. Expectation-maximization for sparse and non-negative pca. In Proceedings of the 25th International Conference on Machine Learning, ICML’08, pages 960–967, New York, NY, USA, 2008. ACM.

[14] Dimitris Papailiopoulos, Alexandros Dimakis, and Stavros Korokythakis. Sparse pca throughlow-rank approximations. In Proceedings of The 30th International Conference on MachineLearning, pages 747–755, 2013.

11

[15] Megasthenis Asteris, Dimitris S. Papailiopoulos, and Georgios N. Karystinos. The sparseprincipal component of a constant-rank matrix. Information Theory, IEEE Transactions on,60(4):2281–2290, April 2014.

[16] Arash Amini and Martin Wainwright. High-dimensional analysis of semidefinite relaxationsfor sparse principal components. In Information Theory, 2008. ISIT 2008. IEEE InternationalSymposium on, pages 2454–2458. IEEE, 2008.

[17] Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals ofStatistics, 41(2):772–801, 2013.

[18] A. d’Aspremont, F. Bach, and L.E. Ghaoui. Approximation bounds for sparse principal com-ponent analysis. arXiv preprint arXiv:1205.0121, 2012.

[19] T Tony Cai, Zongming Ma, and Yihong Wu. Sparse pca: Optimal rates and adaptive estima-tion. arXiv preprint arXiv:1211.1309, 2012.

[20] Yash Deshpande and Andrea Montanari. Sparse pca via covariance thresholding. arXivpreprint arXiv:1311.5179, 2013.

[21] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components inhigh dimension. Ann. Statist., 41(1):1780–1815, 2013.

[22] Q. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse principal componentdetection. Journal of Machine Learning Research (JMLR), 30:1046–1066 (electronic), 2013.

[23] Tengyao Wang, Quentin Berthet, and Richard J. Samworth. Statistical and computationaltrade-offs in estimation of sparse principal components. arXiv preprint arXiv:1408.5369, 2014.

[24] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations solvesparse PCA up to the information limit? Annals of Probability, 43:1300–1322, 2015.

[25] L. Mackey. Deflation methods for sparse pca. NIPS, 21:1017–1024, 2009.

[26] Vincent Vu and Jing Lei. Minimax rates of estimation for sparse pca in high dimensions. InInternational Conference on Artificial Intelligence and Statistics, pages 1278–1286, 2012.

[27] Malik Magdon-Ismail and Christos Boutsidis. Optimal sparse linear auto-encoders and sparsePCA. CoRR, abs/1502.06626, 2015.

[28] Jirı Matousek. Lectures on discrete geometry, volume 212. Springer New York, 2002.

[29] Lyle Ramshaw and Robert E Tarjan. On minimum-cost assignments in unbalanced bipartitegraphs. HP Labs, Palo Alto, CA, USA, Tech. Rep. HPL-2012-40R1, 2012.

[30] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness:Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review,53(2):217–288, 2011.

[31] Noga Alon, Troy Lee, Adi Shraibman, and Santosh Vempala. The approximate rank of amatrix and its algorithmic applications: approximate rank. In Proceedings of the forty-fifthannual ACM symposium on Theory of computing, pages 675–684. ACM, 2013.

12

[32] Megasthenis Asteris, Dimitris Papailiopoulos, and Alexandros Dimakis. Nonnegative sparsepca with provable guarantees. In Proceedings of the 31st International Conference on MachineLearning (ICML-14), pages 1728–1736, 2014.

[33] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal componentsanalysis in high dimensions. Journal of the American Statistical Association, 104(486), 2009.

[34] Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: Anear-optimal convex relaxation of sparse pca. In NIPS, pages 2670–2678, 2013.

[35] Zhaoran Wang, Huanran Lu, and Han Liu. Nonconvex statistical optimization: minimax-optimal sparse pca in polynomial time. arXiv preprint arXiv:1408.5352, 2014.

[36] Emile Richard, Guillaume R Obozinski, and Jean-Philippe Vert. Tight convex relaxationsfor sparse matrix factorization. In Advances in Neural Information Processing Systems, pages3284–3292, 2014.

[37] M. Lichman. UCI machine learning repository, 2013.

[38] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson andlindenstrauss. Random structures and algorithms, 22(1):60–65, 2003.

Supplemental Material

7 On the sub-optimality of deflation – An example

We provide a simple example demonstrating the sub-optimality of deflation based approaches forcomputing multiple sparse components with disjoint supports. Consider the real 4× 4 matrix

A =

1 0 0 ǫ

0 δ 0 0

0 0 δ 0

ǫ 0 0 1

,

with ǫ, δ > 0 such that ǫ+ δ < 1. Note that A is PSD; A = B⊤B for

B =

1 0 0 ǫ

0√δ 0 0

0 0√δ 0

0 0 0√1− ǫ2

.

We seek two 2-sparse components with disjoint supports, i.e., the solution to

maxX∈X

2∑

j=1

x⊤j Axj, (8)

where

X,X ∈ R

4×2 : ‖xi‖2 ≤ 1, ‖xi‖0 ≤ 2 ∀ i ∈ 1, 2, supp(x1) ∩ supp(x2) = ∅.

13

Iterative computation with deflation. Following an iterative, greedy procedure with adeflation step, we compute one component at the time. The first component is

x1 = argmax‖x‖0=2,‖x‖2=1

x⊤Ax. (9)

Recall that for any unit norm vector x with support I = supp(x),

x⊤Ax ≤ λmax (AI,I) , (10)

where AI,I denotes the principal submatrix of A formed by the rows and columns indexed by I.Equality can be achieved in (10) for x equal to the leading eigenvector of AI,I . Hence, it suffices todetermine the optimal support for x1. Due to the small size of the example, it is easy to determinethat the set I1 = 1, 4 maximizes the objective in (10) over all sets of two indices, achieving value

x⊤1 Ax1 = λmax

([1 ǫ

ǫ 1

])= 1 + ǫ. (11)

Since subsequent components must have disjoint supports, it follows that the support of the second2-sparse component x2 is I2 = 2, 3, and x2 achieves value

x⊤2 Ax2 = λmax

([δ 0

0 δ

])= δ. (12)

In total, the objective value in (8) achieved by the greedy computation with a deflation step is

2∑

j=1

x⊤j Axj = 1 + ǫ+ δ. (13)

The sub-optimality of deflation. Consider an alternative pair of 2-sparse components x′1 and x′

2

with support sets I ′1 = 1, 2 and I ′2 = 3, 4, respectively. Based on the above, such a pair achievesobjective value in (8) equal to

λmax

([1 0

0 δ

])+ λmax

([δ 0

0 1

])= 1 + 1 = 2,

which clearly outperforms the objective value in (13) (under the assumption ǫ + δ < 1), demon-strating the sub-optimality of the x1, x2 pair computed by the deflation-based approach. In fact,for small ǫ, δ the objective value in the second case is larger than the former by almost a factor oftwo.

8 Construction of Bipartite Graph

The following algorithm formally outlines the steps for generating the bipartite graphG =(Ujkj=1, V,E

)

given a weight d× k matrix W.

14

Algorithm 4 Generate Bipartite Graph

input Real d× k matrix Woutput Bipartite G =

(Ujkj=1, V,E

)Fig. 1

1: for j = 1, . . . , k do

2: Uj ←u(j)1 , . . . , u

(j)s

3: end for4: U ← ∪kj=1Uj |U | = k · s5: V ←

1, . . . , d

6: E ← U × V7: for i = 1, . . . , d do8: for j = 1, . . . , k do9: for each u ∈ Uj do

10: w(u, vi

)←W 2

ij

11: end for12: end for13: end for

9 Proofs

9.1 Guarantees of Algorithm 2

Lemma 2.1. For any real d× k matrix W, and Algorithm 2 outputs

X = argmaxX∈Xk

k∑

j=1

⟨Xj,Wj

⟩2(14)

in time O(d · (s · k)2

).

Proof. Consider a matrix X ∈ Xk and let Ij , j = 1 . . . , k denote the support sets of its columns.By the constraints in Xk, those sets are disjoint, i.e., Ij1 ∩ Ij2 = ∅ ∀j1, j2 ∈ 1, . . . , k, j1 6= j2, and

k∑

j=1

⟨Xj , Wj

⟩2=

k∑

j=1

(∑

i∈Ij

Xij ·Wij

)2≤

k∑

j=1

(∑

i∈Ij

W 2ij

). (15)

The last inequality is due to Cauchy-Schwarz and the fact that ‖Xj‖2 ≤ 1, ∀ j ∈ 1, . . . , k. Infact, if the supports sets Ij, j = 1, . . . , k were known, the upper bound in (15) would be achieved

by setting XjIj

= WjIj/‖Wj

Ij‖2, i.e., setting the nonzero subvector of the jth column of X colinear

to the corresponding subvector of the jth column of W. Hence, the key step towards computingthe optimal solution X is to determine the support sets Ij , j = 1, . . . , k of its columns.

Consider the set of binary matrices

Z,Z ∈ 0, 1d×k : ‖Zj‖0 ≤ s ∀ j ∈ [k], supp(Zi) ∩ supp(Zj) = ∅ ∀ i, j ∈ [k], i 6= j

.

The set represents all possible supports for the members of Xk. Taking into account the previousdiscussion, the maximization in (14) can be written with respect to Z ∈ Z:

maxX∈Xk

k∑

j=1

⟨Xj, Wj

⟩2= max

Z∈Z

k∑

j=1

d∑

i=1

ZijW2ij . (16)

15

Let Z ∈ Z denote the optimal solution, which corresponds to the (support) indicator of X. Next,we show that computing Z boils down to solving a maximum weight matching problem on thebipartite graph generated by Algorithm 4. Recall that given W ∈ R

d×k, Algorithm 4 generates acomplete weighted bipartite graph G = (U, V,E) where

• V is a set of d vertices v1, . . . , vd, corresponding to the d variables, i.e., the d rows of X.• U is a set of k · s vertices, conceptually partitioned into k disjoint subsets U1, . . . , Uk, each of

cardinality s. The jth subset, Uj, is associated with the support Ij; the s vertices u(j)α , α = 1, . . . , s

in Uj serve as placeholders for the variables/indices in Ij.• Finally, the edge set is E = U × V . The edge weights are determined by the d × k matrix W

in (6). In particular, the weight of edge (u(j)α , vi) is equal to W 2

ij . Note that all vertices in Uj areeffectively identical; they all share a common neighborhood and edge weights.

It is straightforward to verify that any Z ∈ Z corresponds to a perfect matching in G and viceversa; Zij = 1 if and only if vertex vi ∈ V is matched with a vertex in Uj (all vertices in Uj areequivalent with respect to their neighborhood). Further, the objective value in (16) for a givenZ ∈ Z is equal to the weight of the corresponding matching in G. More formally,

• Given a perfect matching M, the support Ij of the jth column of Z is determined by theneighborhood of Uj in the matching:

Ij ←i ∈ [d] : (u, vi) ∈ M, u ∈ Uj

, j = 1, . . . , k. (17)

Note that the sets Ij , j = 1, . . . , k are indeed disjoint, and each has cardinality equal to s. Theweight of the matchingM is

∑

(u,v)∈M

w(u, v) =

k∑

j=1

∑

(u,vi)∈M:u∈Uj

w(u, vi) =

k∑

j=1

∑

i∈Ij

W 2ij =

k∑

j=1

d∑

i=1

Zij ·W 2ij, (18)

which is equal to the objective function in (16).• Conversely, given an indicator matrix Z ∈ Z, let Ij,supp(Zj), and let Ij(α) denote the αth

element in the set, α = 1, . . . , s (with an arbitrary ordering). Then,

M =(u(j)

α , vIj(α)), α = 1, . . . , s, j = 1, . . . , k⊂ E

is a perfect matching in G. The objective value achieved by Z is equal to the weight ofM:

k∑

j=1

d∑

i=1

Zij ·W 2ij =

k∑

j=1

∑

i∈Ij

W 2ij =

k∑

j=1

s∑

α=1

W 2Ij(α),j

=∑

(u,v)∈M

w(u, v). (19)

It follows from (18) and (19) that to determine Z, it suffices to compute a maximum weight perfectmatching in G. The desired support is then obtained as described in (17) (lines 4-7 of Algorithm 2).This complete the proof of correctness of Algorithm 2 which proceeds in the steps described aboveto determine the support of X.

The weighted bipartite graph G is generated in O(d ·(s ·k)). The running time of Algorithm 2 isdominated by computing the maximum weight matching of G. For the case of unbalanced bipartitegraph with |U | = s · k < d = |V | the Hungarian algorithm can be modified [29] to computethe maximum weight bipartite matching in time O

(|E||U |+ |U |2 log |U |

)= O

(d · (s · k)2

). This

completes the proof.

16

9.2 Guarantees of Algorithm 1 – Proof of Theorem 1

We first prove a more general version of Theorem 1 for arbitrary constraint sets. Combining thatwith the guarantees of Algorithm 2, we prove the Theorem 1.

Lemma 9.2. For any real d × d rank-r PSD matrix A and arbitrary set X ⊂ Rd×k, let X⋆,

argmaxX∈X Tr(X⊤AX

). Assuming that there exists an operator PX : Rd×k → X such that PX (W),

argmaxX∈X

⟨xj , wj

⟩2, then Algorithm 1 outputs X ∈ X such that

Tr(X

⊤AX

)≥ (1− ǫ) ·Tr

(X⊤

⋆ AX⋆

),

in time TSVD(r)+O((

4ǫ

)r·k ·(TX +kd

)), where TX is the time required to compute PX (·) and TSVD(r)

the time required to compute the truncated SVD of A.

Proof. Let A = UΛU⊤denote the truncated eigenvalue decomposition of A; Λ is a diagonal r× r

whose ith diagonal entry Λii is equal to the ith largest eigenvalue of A, while the columns of Ucontain the corresponding eigenvectors. By the Cauchy-Schwartz inequality, for any x ∈ R

d,

x⊤Ax =∥∥Λ1/2

U⊤x∥∥22≥⟨Λ

1/2U

⊤x, c

⟩2, ∀ c ∈ R

r : ‖c‖2 = 1. (20)

In fact, equality in (20) is achieved for c colinear to Λ1/2

Ux, and hence,

x⊤Ax = maxc∈Sr−1

2

⟨Λ

1/2U

⊤x, c

⟩2. (21)

In turn,

Tr

(X⊤AX

)=

k∑

j=1

Xj⊤AXj = maxC:Cj∈Sr−1

2 ∀j

k∑

j=1

⟨Λ

1/2U

⊤Xj , Cj

⟩2. (22)

Recall that X⋆ is the optimal solution of the trace maximization on A, i.e.,

X⋆, argmaxX∈X

Tr

(X⊤AX

).

Let C⋆ be the maximizing value of C in (22) for X = X⋆, i.e., C⋆ is an r×k matrix with unit-normcolumns such that for all j ∈ 1, . . . , k,

Xj⋆⊤AXj

⋆ =⟨Λ

1/2U

⊤Xj

⋆, Cj⋆

⟩2. (23)

Algorithm 1 iterates over the points (r × k matrices) C in N⊗kǫ/2

(Sr−12

), the kth cartesian power of

a finite ǫ/2-net of the r-dimensional l2-unit sphere. At each such point C, it computes a candidate

X = argmaxX∈X

k∑

j=1

⟨Xj,UΛ1/2Cj

⟩2

via Algorithm 2 (See Lemma 9.1 for the guarantees of Algorithm 2). By construction, the setN⊗k

ǫ/2

(Sr−12

)contains a C♯ such that

‖C♯ −C⋆‖∞,2 = maxj∈1,...,k

‖Cj♯ −Cj

⋆‖2 ≤ ǫ/2. (24)

17

Based on the above, for all j ∈ 1, . . . , k,(Xj

⋆⊤AXj

⋆

)1/2=∣∣⟨Λ1/2

U⊤Xj

⋆, Cj⋆

⟩∣∣

=∣∣⟨Λ1/2

U⊤Xj

⋆, Cj♯

⟩+⟨Λ

1/2U

⊤Xj

⋆,(Cj

⋆ −Cj♯

)⟩∣∣

≤∣∣⟨Λ1/2

U⊤Xj

⋆, Cj♯

⟩∣∣+∣∣⟨Λ1/2

U⊤Xj

⋆,(Cj

⋆ −Cj♯

)⟩∣∣

≤∣∣⟨Λ1/2

U⊤Xj

⋆, Cj♯

⟩∣∣+∥∥Λ1/2

U⊤Xj

⋆

∥∥ ·∥∥Cj

⋆ −Cj♯

∥∥

≤∣∣⟨Λ1/2

U⊤Xj

⋆, Cj♯

⟩∣∣+ (ǫ/2) ·(Xj

⋆⊤AXj

⋆

)1/2. (25)

The first step follows by the definition of C⋆, the second by the linearity of the inner product,the third by the triangle inequality, the fourth by Cauchy-Schwarz inequality and the last by (24).Rearranging the terms in (25),

∣∣⟨Λ1/2U

⊤Xj

⋆, Cj♯

⟩∣∣ ≥(1− ǫ

2

)·(Xj

⋆⊤AXj

⋆

)1/2 ≥ 0,

and in turn,

⟨Λ

1/2U

⊤Xj

⋆, Cj♯

⟩2 ≥(1− ǫ

2

)2 ·Xj⋆⊤AXj

⋆ ≥ (1− ǫ) ·Xj⋆⊤AXj

⋆ (26)

Summing the terms in (26) over all j ∈ 1, . . . , k,k∑

j=1

⟨Λ

1/2U

⊤Xj

⋆, Cj♯

⟩2 ≥ (1− ǫ) ·Tr

(X⊤

⋆ AX⋆

). (27)

Let X♯ ∈ X be the candidate solution produced by the algorithm at C♯, i.e.,

X♯, argmaxX∈X

k∑

j=1

⟨xj, UΛ

1/2Cj

♯

⟩2. (28)

Then,

Tr

(X⊤

♯ AX♯

)(α)= max

C:Cj∈Sr−12 ∀j

k∑

j=1

⟨Λ

1/2U

⊤Xj

♯ , Cj⟩2

(β)

≥k∑

j=1

⟨Λ

1/2U

⊤Xj

♯ , Cj♯

⟩2

(γ)

≥k∑

j=1

⟨Xj

⋆, UΛ1/2

Cj♯

⟩2

(δ)

≥ (1− ǫ) ·Tr

(X⊤

⋆ AX⋆

), (29)

where (α) follows from the observation in (22), (β) from the sub-optimality of C♯, (γ) by thedefinition of X♯ in (28), while (δ) follows from (27). According to (29), at least one of the candidatesolutions produced by Algorithm 1, namely X♯, achieves an objective value within a multiplicativefactor (1− ǫ) from the optimal, implying the guarantees of the lemma.

Finally, the running time of Algorithm 1 follows immediately from the cost per iteration andthe cardinality of the ǫ/2-net on the unit-sphere. Note that matrix multiplications can exploit thesingular value decomposition which is performed once.

18

Theorem 1. For any real d× d rank-r PSD matrix A, desired number of components k, number sof nonzero entries per component, and accuracy parameter ǫ ∈ (0, 1), Algorithm 1 outputs X ∈ Xk

such that

Tr(X

⊤AX

)≥ (1− ǫ) ·Tr

(X⊤

⋆ AX⋆

),


), in time TSVD(r) +O

((4ǫ

)r·k · d · (s · k)2). TSVD(r) is the time

required to compute the truncated SVD of A.

Proof. Recall that Xk is the set of d× k matrices X whose columns have unit length and pairwisedisjoint supports. Algorithm 2, given any W ∈ R

d×k, computes X ∈ Xk that optimally solves theconstrained maximization in line 5. (See Lemma 9.1 for the guarantee of Algorithm 2). in timeO(d · (s · k)2

). The desired result then follows by Lemma 9.2 for the constrained set Xk.

9.3 Guarantees of Algorithm 3 – Proof of Theorem 2

We prove Theorem 2 with the approximation guarantees of Algorithm 3.

Lemma 9.3. For any d× d PSD matrices A and A, and any set X ⊆ Rd×k let

X⋆, argmaxX∈X

Tr

(X⊤AX

), and X⋆, argmax

X∈XTr(X⊤AX

).

Then, for any X ∈ X such that Tr(X

⊤AX

)≥ γ ·Tr

(X⊤

⋆ AX⋆

)for some 0 < γ < 1,

Tr(X

⊤AX

)≥ γ ·Tr

(X⊤

⋆ AX⋆

)− 2 · ‖A−A‖2 ·max

X∈X‖X‖2F.

Proof. By the optimality of X⋆ for A,

Tr

(X⊤

⋆ AX⋆

)≥ Tr

(X⊤

⋆ AX⋆

).

In turn, for any X ∈ X such that Tr

(X

⊤AX

)≥ γ ·Tr

(X⊤

⋆ AX⋆

)for some 0 < γ < 1,

Tr

(X

⊤AX

)≥ γ ·Tr

(X⊤

⋆ AX⋆

). (30)

Let E,A−A. By the linearity of the trace,

Tr

(X

⊤AX

)= Tr

(X

⊤AX

)−Tr

(X

⊤EX

)

≤ Tr

(X

⊤AX

)+∣∣Tr

(X

⊤EX

)∣∣. (31)

By Lemma 10.10,

∣∣Tr

(X

⊤EX

)∣∣ ≤ ‖X‖F · ‖X‖F · ‖E‖2 ≤ ‖E‖2 ·maxX∈X

‖X‖2F , R. (32)

Continuing from (31),

Tr

(X

⊤AX

)≤ Tr

(X

⊤AX

)+R. (33)

19

Similarly,

Tr

(X⊤

⋆ AX⋆

)= Tr

(X⊤

⋆ AX⋆

)−Tr

(X⊤

⋆ EX⋆

)

≥ Tr

(X⊤

⋆ AX⋆

)−∣∣Tr

(X⊤

⋆ EX⋆

)∣∣

≥ Tr

(X⊤

⋆ AX⋆

)−R. (34)

Combining the above, we have

Tr

(X

⊤AX

)≥ Tr

(X

⊤AX

)−R

≥ γ ·Tr

(X⊤

⋆ AX⋆

)−R

≥ γ ·(Tr

(X⊤

⋆ AX⋆

)−R

)−R

= γ ·Tr

(X⊤

⋆ AX⋆

)− (1 + γ) ·R

≥ γ ·Tr

(X⊤

⋆ AX⋆

)− 2 ·R,

where the first inequality follows from (33) the second from (30), the third from (34), and the lastfrom the fact that R ≥ 0 and 0 < γ ≤ 1. This concludes the proof.

Remark 9.2. If in Lemma 9.3 the PSD matrices A and A ∈ Rd×d are such that A −A is also

PSD, then the following tighter bound holds:

Tr(X

⊤AX

)≥ γ ·Tr

(X⊤

⋆ AX⋆

)−

k∑

i=1

λi

(A−A

).

Proof. This follows from the fact that if E,A−A is PSD, then

Tr

(X

⊤EX

)=

d∑

j=1

x⊤j Exj ≥ 0,

and the bound in (31) can be improved to

Tr

(X

⊤AX

)= Tr

(X

⊤AX

)−Tr

(X

⊤EX

)≤ Tr

(X

⊤AX

).

Further, by Lemma 10.11, the bound in (32) can be improved to

Tr(X

⊤EX

)≤

k∑

i=1

λi

(E), R.

The rest of the proof follows as is.

Theorem 2. For any n × d input data matrix S, with corresponding empirical covariance matrixA = 1/n · S⊤S, any desired number of components k, and accuracy parameters ǫ ∈ (0, 1) and r,Algorithm 3 outputs X(r) ∈ Xk such that

Tr(X⊤

(r)AX(r)

)≥ (1− ǫ) ·Tr

(X⊤

⋆ AX⋆

)− 2 · k · ‖A−A‖2,


), in time TSKETCH(r) + TSVD(r) +O

((4ǫ

)r·k · d · (s · k)2).

Proof. The theorem follows from Lemma 9.3 and the approximation guarantees of Algorithm 1.

20

9.4 Proof of Theorem 3

First, we restate and prove the following Lemma by [31].

Lemma 9.4. Let A ∈ Rd×d be an positive semidefinite matrix with entries in [−1, 1], and V ∈ R

d×d

matrix such that A = VV⊤. Consider a random matrix R ∈ Rd×r with entries drawn according to

a Gaussian distribution N(0, 1/r), and define

A = VRR⊤V⊤.

Then, for r = O(ǫ−2 log d),

∣∣[A]i,j − [A]i,j∣∣ ≤ ǫ

for all i, j with probability at least 1− 1/d.

Proof. The proof relies on the Johnson-Lindenstrauss (JL) Lemma [38], according to which for anytwo unit norm vectors x,y ∈ R

d and R generated as described

Pr|x⊤RR⊤y − x⊤y| ≥ ǫ

≤ 2 · e−(ǫ2−ǫ3)·r/4.

Observe that each element of A is in [−1, 1], hence can be rewritten as an inner product of twounit-norm vectors:

[A]i,j = VT:,iV:,j.

Setting r = O(ǫ−2 log d) and using the JL lemma and a union bound over all O(d2) vector pairsV:,i, V:,j we obtain the desired result.

Next, we provide the proof of Theorem 3 for the simple case of k = 1; the proof easily generalizesto the multi-component case k > 1. According to Lemma 9.4, choosing d = O

((δ/6)−2 log n

)=

O(δ−2 log n

)suffices for all entries of A constructed as described in the lemma to satisfiy

∣∣[A]i,j − [A]i,j∣∣ ≤ δ

6

with probability at least 1− 1/d. In turn, for any s-sparse, unit-norm x,

∣∣x⊤Ax− x⊤Ax∣∣ =

∣∣∣∣∣∣

∑

i,j

xixj([A]ij − [A]ij)

∣∣∣∣∣∣≤ δ

6·

∣∣∣∣∣∣

n∑

i=1

|xi|n∑

j=1

|xj|

∣∣∣∣∣∣

≤ δ

6· ‖x‖21 ≤

δ

6·(√

s · ‖x‖2)2

=δ

6· s, (35)

where the second inequality follows from the fact that x is s-sparse and unit norm.We run Algorithm 1 (for k = 1) with input argument the rank-r matrix A, desired sparsity s and

accuracy parameter ǫ = δ/6. Algorithm 1 outputs a s-sparse unit-norm vector x which accordingto Theorem 1 satisfies

(1− δ/6) · xd⊤Axd ≤ x⊤Ax ≤ xd

⊤Axd, (36)

where xd is the true s-sparse principal component of A. This, in turn, implies that x satisfies

∣∣∣x⊤Ax− xd⊤Axd

∣∣∣ ≤ δ

6· xd

⊤Axd ≤δ

6

(1 +

δ

6

)s ≤ δ

3· s, (37)

21

where the second inequality follows from the fact that the entries of A lie in [−1− δ6 , 1 +

δ6 ] and x

is s-sparse and unit-norm.In the following, we bound the difference of the performance of x on the original matrix A from

the optimal value. Let x⋆ denote the s-sparse principal component of A and define

OPT,x⋆TAx⋆.

Then,

|OPT− x⊤Ax| = |x⊤⋆ Ax⋆ − x⊤Ax|

= |x⊤⋆ Ax⋆ − x⊤

d Axd + xd⊤Axd − x⊤Ax|

≤ |x⊤⋆ Ax⋆ − xd

⊤Axd|︸︷︷︸A

+ |xd⊤Axd − x⊤Ax|︸︷︷︸

B

. (38)

Utilizing (35) and the triangle inequality, one can verify that

A = |x⋆⊤Ax⋆ − xd

⊤Axd + xd⊤Axd − xd

⊤Axd|≤ |x⋆

⊤Ax⋆ − xd⊤Axd|+ |xd

⊤Axd − xd⊤Axd|

≤ x⋆⊤Ax⋆ − xd

⊤Axd︸︷︷︸≥0

+δ

6· s

≤ x⋆⊤Ax⋆ − xd

⊤Axd +δ

6· s + xd

⊤Axd − x⋆⊤Ax⋆︸︷︷︸

≥0

≤ x⋆⊤Ax⋆ − x⋆

⊤Ax⋆ +δ

6· s + xd

⊤Axd − xd⊤Axd

≤ |x⋆⊤Ax⋆ − x⋆

⊤Ax⋆|+δ

6· s + |xd

⊤Axd − xd⊤Axd|

≤ δ

2· s. (39)

Similarly,

B =∣∣∣xd

⊤Axd − x⊤Ax+ x⊤Ax− x⊤Ax∣∣∣

=∣∣∣xd

⊤Axd − x⊤Ax∣∣∣+∣∣∣x⊤Ax− x⊤Ax

∣∣∣

≤∣∣∣xd

⊤Axd − x⊤Ax∣∣∣+ δ

6· s

(α)

≤ 2δ

6· s+ δ

6· s

≤ δ

2· s. (40)

where (α) follows from (37). Continuing from (38), combining (39) and (40) we obtain

|OPT −x⊤Ax∣∣∣ ≤ δ · s,

which is the desired result.

22

10 Auxiliary Technical Lemmata

Lemma 10.5. For any real d× n matrix M, and any r, k ≤ mind, n,r+k∑

i=r+1

σi(M) ≤ k√r + k

· ‖M‖F,

where σi(M) is the ith largest singular value of M.

Proof. By the Cauchy-Schwartz inequality,

r+k∑

i=r+1

σi(M) =

r+k∑

i=r+1

|σi(M)| ≤(

r+k∑

i=r+1

σ2i (M)

)1/2

· ‖1k‖2 =√k ·(

r+k∑

i=r+1

σ2i (M)

)1/2

.

Note that σr+1(M), . . . , σr+k(M) are the k smallest among the r+k largest singular values. Hence,

r+k∑

i=r+1

σ2i (M) ≤ k

r + k

r+k∑

i=1

σ2i (M) ≤ k

r + k

mind,n∑

i=1

σ2i (M) =

k

r + k‖M‖2F.

Combining the two inequalities, the desired result follows.

Corollary 1. For any real d× n matrix M and k ≤ mind, n, σk(M) ≤ k−1/2 · ‖M‖F.

Proof. It follows immediately from Lemma 10.5.

Lemma 10.6. Let a1, . . . , an and b1, . . . , bn be 2n real numbers and let p and q be two numberssuch that 1/p+ 1/q = 1 and p > 1. We have

∣∣n∑

i=1

aibi∣∣ ≤

(n∑

i=1

|ai|p)1/p

·(

n∑

i=1

|bi|q)1/q

.

Lemma 10.7. For any two real matrices A and B of appropriate dimensions,

‖AB‖F ≤ min‖A‖2‖B‖F, ‖A‖F‖B‖2 .

Proof. Let bi denote the ith column of B. Then,

‖AB‖2F =∑

i

‖Abi‖22 ≤∑

i

‖A‖22‖bi‖22 = ‖A‖22∑

i

‖bi‖22 = ‖A‖22‖B‖2F.

Similarly, using the previous inequality,

‖AB‖2F = ‖B⊤A⊤‖2F ≤ ‖B⊤‖22‖A⊤‖2F = ‖B‖22‖A‖2F.

Combining the two upper bounds, the desired result follows.

Lemma 10.8. For any A,B ∈ Rn×k,

∣∣〈A,B〉∣∣,∣∣Tr

(A⊤B

)∣∣ ≤ ‖A‖F‖B‖F.

Proof. The inequality follows from Lemma 10.6 for p = q = 2, treating A and B as vectors.

23

Lemma 10.9. For any real m× n matrix A, and any k ≤ minm, n,

maxY∈Rn×k

Y⊤Y=Ik

‖AY‖F =

(k∑

i=1

σ2i (A)

)1/2

.

The maximum is attained by Y coinciding with the k leading right singular vectors of A.

Proof. Let UΣV⊤ be the singular value decomposition of A; U and V are m × m and n × nunitary matrices respectively, while Σ is a diagonal matrix with Σjj = σj, the jth largest singularvalue of A, j = 1, . . . , d, where d,minm,n. Due to the invariance of the Frobenius norm underunitary multiplication,

‖AY‖2F = ‖UΣV⊤Y‖2F = ‖ΣV⊤Y‖2F. (41)

Continuing from (41),

‖ΣV⊤Y‖2F = Tr

(Y⊤VΣ2V⊤Y

)=

k∑

i=1

y⊤i

( d∑

j=1

σ2j · vjv

⊤j

)yi =

d∑

j=1

σ2j ·

k∑

i=1

(v⊤j yi

)2.

Let zj,∑k

i=1

(v⊤j yi

)2, j = 1, . . . , d. Note that each individual zj satisfies

0 ≤ zj,k∑

i=1

(v⊤j yi

)2≤ ‖vj‖2 = 1,

where the last inequality follows from the fact that the columns of Y are orthonormal. Further,

d∑

j=1

zj =

d∑

j=1

k∑

i=1

(v⊤j yi

)2=

k∑

i=1

d∑

j=1

(v⊤j yi

)2=

k∑

i=1

‖yi‖2 = k.

Combining the above, we conclude that

‖AY‖2F =

d∑

j=1

σ2j · zj ≤ σ2

1 + . . .+ σ2k. (42)

Finally, it is straightforward to verify that if yi = vi, i = 1, . . . , k, then (42) holds with equality.

Lemma 10.10. For any real d×n matrix A, and pair of d× k matrix X and n× k matrix Y suchthat X⊤X = Ik and Y⊤Y = Ik with k ≤ mind, n, the following holds:

∣∣Tr

(X⊤AY

)∣∣ ≤√k ·( k∑

i=1

σ2i (A)

)1/2.

Proof. By Lemma 10.8,

|〈X, AY〉| =∣∣Tr

(X⊤AY

)∣∣ ≤ ‖X‖F · ‖AY‖F =√k · ‖AY‖F.

where the last inequality follows from the fact that ‖X‖2F = Tr(X⊤X

)= Tr(Ik) = k. Combining

with a bound on ‖AY‖F as in Lemma 10.9, completes the proof.

24

Lemma 10.11. For any real d × d PSD matrix A, and k × d matrix X with k ≤ d orthonormalcolumns,

Tr

(X⊤AX

)≤

k∑

i=1

λi(A)

where λi(A) is the ith largest eigenvalue of A. Equality is achieved for X coinciding with the kleading eigenvectors of A.

Proof. LetA = VV⊤ be a factorization of the PSDmatrixA. Then,Tr(X⊤AX

)= Tr

(X⊤VV⊤X

)=

‖V⊤X‖2F. The desired result follows by Lemma 10.9 and the fact that λi(A) = σ2i (V), i =

1, . . . , d.

11 Additional Experimental Results

(a) (b)

Figure 3: Cumulative variance captured by k s-spars components computed on the word-by-wordmatrix – BagOfWords:NIPS dataset [37]. Sparsity is arbitrarily set to s = 10 nonzero entriesper component. Fig. 3(a) depicts the cum. variance captured by k = 6 components. Deflation leadsto a greedy formation of components; first components capture high variance, but subsequent onescontribute less. On the contrary, our algorithm jointly optimizes the k components and achieveshigher total cum. variance. Fig. 3(b) depicts the total cum. variance achieved for various valuesof k. Our algorithm operates on a rank-4 approximation of the input.

25


1:

TPower

network algorithm neuron parameter object classifier word noise

2: model data cell point image net speech control

3: learning system pattern distribution recognition classification level dynamic

4: input error layer hidden images class context step

5: function weight information space task test hmm term

6: neural problem signal gaussian features order character optimal

7: unit result visual linear feature examples processing component

8: set number field probability representation rate non equation

9: training method synaptic mean performance values approach single

10: output vector firing case view experiment trained analysis

11:

SpanSPCA

network algorithm neuron parameter recognition control classifier noise

12: model data cell distribution object action classification order

13: input weight pattern point image dynamic class term

14: learning error layer linear word step net component

15: neural problem signal probability performance optimal test rate

16: function output information space task policy speech equation

17: unit result visual gaussian features states examples single

18: set number synaptic hidden representation reinforcement approach analysis

19: system method field case feature values experiment large

20: training vector response mean images controller trained form

21:

SPCABiPart

data function neuron unit learning network model training

22: distribution algorithm cell weight space input parameter hidden

23: gaussian set visual layer action neural information performance

24: probability error direction net order system control recognition

25: component problem firing task step output dynamic classifier

26: approach result synaptic connection linear pattern mean test

27: analysis number response activation case signal noise word

28: mixture method spike architecture values processing field speech

29: likelihood vector activity generalization term image local classification

30: experiment point motion threshold optimal object equation trained

Total Cum. Variance

TPower 2.5999 · 103

SpanSPCA 2.5981 · 103

SPCABiPart 3.2090 · 103

Table 4: BagOfWords:NIPS dataset [37]. We run various SPCA algorithms for k = 8 com-ponents (topics) and s = 10 nonzero entries per component. The table lists the words selected byeach component (words corresponding to higher magnitude entries appear higher in the topic). Ouralgorithm was configured to use a rank-4 approximation of the input data.

26


1:

TPower

percent zzz bush team school women zzz enron drug palestinian

2: company zzz al gore game student show firm patient zzz israel

3: million president season program book zzz arthur andersen doctor zzz israeli

4: companies official player high com deal system zzz yasser arafat

5: market zzz george bush play children look lay problem attack

6: stock campaign games right american financial law leader

7: business government point group need energy care peace

8: money plan run home part executives cost israelis

9: billion administration coach public family accounting help israeli

10: fund zzz white house win teacher found partnership health zzz west bank

11:

SpanSPCA

percent team zzz bush palestinian school cup show won

12: company game zzz al gore attack student minutes com night

13: million season president zzz united states children add part left

14: companies player zzz george bush zzz u s program tablespoon look big

15: market play campaign military home teaspoon need put

16: stock games official leader family oil book win

17: business point government zzz israel women pepper called hit

18: money run political zzz american public water hour job

19: billion right election war high large american ago

20: plan coach group country law sugar help zzz new york

21:

SPCABiPart

percent zzz united states zzz bush company team cup school zzz al gore

22: million zzz u s official companies game minutes student zzz george bush

23: money zzz american government market season add children campaign

24: high attack president stock player tablespoon women election

25: program military group business play oil show plan

26: number palestinian leader billion point teaspoon book tax

27: need war country analyst run water family public

28: part administration political firm right pepper look zzz washington

29: problem zzz white house american sales home large hour member

30: com games law cost won food small nation

Total Cum. Variance

TPower 45.4014

SpanSPCA 46.0075

SPCABiPart 47.7212

Table 5: BagOfWords:NyTimes dataset [37]. We run various SPCA algorithms for k = 8components (topics) and s = 10 nonzero entries per component. The table lists the words selectedby each component (words corresponding to higher magnitude entries appear higher in the topic).Our algorithm was configured to use a rank-4 approximation of the input data.

27


1:

TPower

percent zzz bush team school com zzz enron law palestinian

2: company zzz al gore game student women firm drug zzz israel

3: million zzz george bush season program book deal court zzz israeli

4: companies campaign player children web financial case zzz yasser arafat

5: market right play show american zzz arthur andersen federal peace

6: stock group games public information chief patient israelis

7: money political point need look executive system israeli

8: business zzz united states run part site analyst decision military

9: government zzz u s coach family zzz new york executives bill zzz palestinian

10: official administration home help question lay member zzz west bank

11: billion leader win job number investor lawyer war

12: president attack won teacher called energy doctor security

13: plan zzz white house night country find investment cost violence

14: high tax left problem found employees care killed

15: fund zzz washington guy parent ago accounting health talk

16:

SpanSPCA

percent team official zzz al gore cup show public night

17: company game zzz bush zzz george bush minutes com member big

18: million season zzz united states campaign add part system set

19: companies player attack election tablespoon look case film

20: market play zzz u s political teaspoon need number find

21: stock games palestinian vote oil book question room

22: business point military republican pepper women job place

23: money run leader voter water family told friend

24: billion right zzz american democratic large called put took

25: plan win war school sugar children zzz washington start

26: government coach zzz israel presidential serving help found car

27: president home country zzz white house butter ago information feel

28: high won administration law chopped zzz new york federal half

29: cost left terrorist zzz republican hour program student guy

30: group hit american tax pan problem court early

31:

SPCABiPart

company show cup team percent zzz al gore official school

32: companies home minutes game million zzz george bush zzz bush student

33: stock run add season money campaign government children

34: market com tablespoon player plan right president women

35: billion high oil play business election zzz united states book

36: zzz enron need teaspoon games tax political zzz u s family

37: firm look pepper coach cost point group called

38: analyst part water guy cut leader attack hour

39: industry night large yard job zzz washington zzz american friend

40: fund zzz new york sugar hit pay administration country found

41: investor help serving played deal question military find

42: sales left butter playing quarter member american set

43: customer put chopped ball chief won war room

44: investment ago fat fan executive win law film

45: economy big food shot financial told public small

Total Cum. Variance

TPower 48.140645

SpanSPCA 48.767864

SPCABiPart 51.873063

Table 6: BagOfWords:NyTimes dataset [37]. We run various SPCA algorithms for k = 8components (topics) and cardinality s = 15 per component. The table lists the words correspondingto each component (words corresponding to higher magnitude entries appear higher in the topic).Our algorithm was configured to use a rank-4 approximation of the input data.

28


1:

TPower

percent zzz bush team school com zzz enron drug palestinian

2: company zzz al gore game student women court patient zzz israel

3: million zzz george bush season program book case doctor zzz israeli

4: companies campaign player children web firm cell zzz yasser arafat

5: market zzz united states play show site federal care peace

6: stock zzz u s games public information lawyer disease israelis

7: government political point part zzz new york deal health israeli

8: official attack run family www decision medical zzz palestinian

9: money zzz american home system hour chief test zzz west bank

10: business american coach help find power hospital security

11: president administration win problem mail industry research violence

12: billion leader won law found executive cancer killed

13: plan country left job put according treatment talk

14: group election night called set financial study meeting

15: high zzz washington hit look room office death soldier

16: right military guy member big analyst human minister

17: fund zzz white house yard question told executives heart zzz sharon

18: need war played ago friend zzz arthur andersen blood fire

19: cost tax start teacher director employees trial zzz ariel sharon

20: number nation playing parent place investor benefit zzz arab

21:

SpanSPCA

percent team zzz al gore attack school cup com drug

22: company game zzz bush zzz united states student minutes web patient

23: million season zzz george bush zzz u s children add site cell

24: companies player campaign palestinian program tablespoon information doctor

25: market play election military family oil computer disease

26: stock games political zzz american women teaspoon find care

27: business point tax zzz israel show pepper big health

28: money run republican war help water zzz new york test

29: billion win zzz white house country told large www research

30: government home vote terrorist parent sugar mail human

31: president won law american problem serving set medical

32: plan coach administration zzz taliban book butter put study

33: high left democratic zzz afghanistan job chopped director death

34: group night voter security found hour industry cancer

35: official hit leader zzz israeli friend pan room hospital

36: need guy public nation ago fat small treatment

37: right yard zzz republican member question bowl car scientist

38: part played presidential support teacher gram zzz internet according

39: cost look federal called case food place blood

40: system start zzz washington forces number medium film heart

41:

SPCABiPart

palestinian percent zzz al gore cup school team company official

42: zzz israel million zzz bush minutes right game companies government

43: zzz israeli money zzz george bush add group season market president

44: zzz yasser arafat billion campaign tablespoon show player stock zzz united states

45: peace business election oil home play zzz enron zzz u s

46: war fund political teaspoon high games analyst attack

47: terrorist tax zzz white house pepper program point firm zzz american

48: zzz taliban cost administration water need run industry country

49: zzz afghanistan cut republican hour part coach investor law

50: forces job leader large com win sales plan

51: bin pay vote sugar american won customer public

52: troop economy democratic serving look left price zzz washington

53: laden deal presidential butter help night investment member

54: student big zzz clinton chopped problem hit quarter system

55: zzz pakistan chief support pan called guy executives nation

56: product executive zzz congress fat zzz new york yard consumer case

57: zzz internet financial military bowl number played technology federal

58: profit start policy gram question ball share information

59: earning record court food ago playing prices power

60: shares manager security league told lead growth effort

Total Cum. Variance

TPower 50.7686

SpanSPCA 52.8117

SPCABiPart 54.8906

Table 7: BagOfWords:NyTimes dataset [37]. We run various SPCA algorithms for k = 8components (topics) and cardinality s = 20 per component. The table lists the words correspondingto each component (words corresponding to higher magnitude entries appear higher in the topic).Our algorithm was configured to use a rank-4 approximation of the input data.

29

(a) (b)

Figure 4: Cumulative variance captured by k s-sparse (s = 10) extracted components on theword-by-word matrix – BagOfWords:NyTimes dataset [37]. Fig. 4(a) depicts the cum. variancecaptured by k = 6 components. Deflation leads to a greedy formation of components; first compo-nents capture high variance, but subsequent ones contribute less. On the contrary, our algorithmjointly optimizes the k components and achieves higher total cum. variance. Fig. 4(b) depicts thetotal cum. variance achieved for various values of k. Sparsity is arbitrarily set to s = 10 nonzeroentries per component. Our algorithm operates on a rank-4 approximation.

(a) (b)

Figure 5: Same as Fig. 4, but for sparsity s = 15.

30

(a) (b)

Figure 6: Same as Fig. 4, but for sparsity s = 20.

31

SparsePCAviaBipartiteMatchings - GitHub...

Documents

Transcript of SparsePCAviaBipartiteMatchings - GitHub...