SparsePCAviaBipartiteMatchings - GitHub...
Transcript of SparsePCAviaBipartiteMatchings - GitHub...
arX
iv:1
508.
0062
5v1
[st
at.M
L]
4 A
ug 2
015
Sparse PCA via Bipartite Matchings
Megasthenis Asterisα, Dimitris Papailiopoulosβ , Anastasios Kyrillidisα
Alexandros G. Dimakisα
αUT Austin, βUC Berkeley
August 2015
Abstract
We consider the following multi-component sparse PCA problem: given a set of data points,we seek to extract a small number of sparse components with disjoint supports that jointlycapture the maximum possible variance. These components can be computed one by one,repeatedly solving the single-component problem and deflating the input data matrix, but aswe show this greedy procedure is suboptimal. We present a novel algorithm for sparse PCAthat jointly optimizes multiple disjoint components. The extracted features capture variancethat lies within a multiplicative factor arbitrarily close to 1 from the optimal. Our algorithmis combinatorial and computes the desired components by solving multiple instances of thebipartite maximum weight matching problem. Its complexity grows as a low order polynomialin the ambient dimension of the input data matrix, but exponentially in its rank. However,it can be effectively applied on a low-dimensional sketch of the data; this allows us to obtainpolynomial-time approximation guarantees via spectral bounds. We evaluate our algorithm onreal data-sets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches.
1 Introduction
Principal Component Analysis (PCA) reduces the dimensionality of a data set by projecting it ontoprincipal subspaces spanned by the leading eigenvectors of the sample covariance matrix. SparsePCA is a useful variant that offers higher data interpretability [1, 2, 3], a property that is sometimesdesired even at the cost of statistical fidelity [4]. Furthermore, when the obtained features are usedin subsequent learning tasks, sparsity potentially leads to better generalization error [5].
Given a real n× d data matrix S representing n centered data points supported on d features,the leading sparse principal component of the data set is the sparse vector that maximizes theexplained variance:
x⋆ , argmax‖x‖2=1,‖x‖0=s
x⊤Ax, (1)
where A = 1/n · S⊤S is the d × d empirical covariance matrix. The sparsity constraint makesthe problem NP-hard and hence computationally intractable in general, and hard to approximatewithin some small constant [6]. A significant volume of prior work has focused on algorithms thatapproximately solve the optimization problem [2, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15], while a largevolume of theoretical results has been established under planted statistical models [16, 17, 18, 19,20, 21, 22, 23, 24].
1
In most practical settings, we tend to go beyond computing a single sparse PC. Contrary to thesingle-component problem, there has been limited work on computing multiple components. Thescarcity is partially attributed to conventional PCA wisdom: multiple components can be computedone-by-one, repeatedly, by solving the single-component sparse PCA problem (1) and deflating theinput data to remove information captured by previously extracted components [25]. In fact,the multi-component version of sparse PCA is not uniquely defined in the literature. Differentdeflation-based approaches can lead to different outputs: extracted components may or may not beorthogonal, while they may have disjoint or overlapping supports [25]. In the statistics literature,where the objective is typically to recover a “true” principal subspace, a branch of work has focusedon the “subspace row sparsity” [26], an assumption that leads to sparse components all supportedon the same set of variables. While in [27], the authors discuss an alternative perspective on thefundamental objective of the sparse PCA problem.
In this work, we develop a novel algorithm for the multi-component sparse PCA problem withdisjoint supports. Formally, we are interested in finding k components that are s-sparse, havedisjoint supports, and jointly maximize the explained variance:
X⋆ , argmaxX∈Xk
Tr(X⊤AX
), (2)
where the feasible set is
Xk ,X ∈ R
d×k : ‖Xj‖2 = 1, ‖Xj‖0 = s, supp(Xi) ∩ supp(Xj) = ∅, ∀ j ∈ [k], i < j,
with Xj denoting the jth column of X. The number k of the desired components is a user definedparameter and we consider it to be a small constant.
Contrary to the greedy sequential approach that repeatedly uses deflation, our algorithm jointlycomputes all the vectors in X, and comes with theoretical approximation guarantees. We note thateven if one could solve each single-component sparse PCA problem (1) exactly, greedy deflationcan be highly suboptimal. We show this through a simple example in Section 7.
Our Contributions
1. We develop an algorithm that provably approximates the solution to the sparse PCA problem (2)within a multiplicative factor arbitrarily close to 1. To the best of our knowledge, this is thefirst algorithm that jointly optimizes multiple components with disjoint supports, provably. Ouralgorithm is combinatorial; it recasts sparse PCA as multiple instances of bipartite maximumweight matching on graphs determined by the input data.
2. The computational complexity of our algorithm grows as a low order polynomial in the ambientdimension d, but is exponential in the intrinsic dimension of the input data, i.e., the rankof A. To alleviate the impact of this dependence, our algorithm can be applied on a low-dimensional sketch of the input data to obtain an approximate solution to (2). This extra levelof approximation introduces an additional penalty in our theoretical approximation guarantees,which naturally depends on the quality of the sketch and, in turn, the spectral decay of A. Weshow how these bounds further translate to an additive PTAS (polynomial-time approximationscheme) for sparse PCA. Our additive PTAS outputs an approximate solution with explainedvariance of at least OPT − ǫ · s, for any sparsity s ∈ 1, . . . , n, any constant error ǫ > 0 andany k = O(1) number of orthogonal components.1
1Here, OPT is the explained variance captured by the optimal set of k components that are s sparse and havedisjoint supports.
2
3. We empirically evaluate our algorithm on real datasets, and compare it against state-of-the-artmethods for the single-component sparse PCA problem (1) in conjunction with the appro-priate deflation step. In many cases, our algorithm—as a result of jointly optimizing overmultiple components—leads to significantly improved results, and outperforms deflation-basedapproaches.
2 Sparse PCA through Bipartite Matchings
Our algorithm approximately solves the constrained maximization (2) on a d × d rank-r PSDmatrix A within a multiplicative factor arbitrarily close to 1. It operates by recasting the maxi-mization into multiple instances of the bipartite maximum weight matching problem. Each instanceultimately yields a feasible solution: a set of k components that are s-sparse and have disjoint sup-ports. The algorithm examines these solutions, and outputs the one that maximizes the explainedvariance, i.e., the quadratic objective in (2).
The computational complexity of our algorithm grows as a low order polynomial in the ambientdimension d of the input, but exponentially in its rank r. Despite the unfavorable dependence onthe rank, it is unlikely that a substantial improvement can be achieved in general [6]. However, de-coupling the dependence on the ambient and the intrinsic dimension of the input has an interestingramification; instead of the original input A, our algorithm can be applied on a low-rank surrogateto obtain an approximate solution, alleviating the dependence on r. We discuss this in Section 3,and present the approximation bound that this allows us to obtain.
Let A = UΛU⊤ denote the truncated eigenvalue decomposition of A; Λ is a diagonal r × rwhose ith diagonal entry is equal to the ith largest eigenvalue of A, while the columns of U arethe corresponding eigenvectors. By the Cauchy-Schwartz inequality, for any x ∈ R
d,
x⊤Ax =∥∥Λ1/2U⊤x
∥∥22≥⟨Λ1/2U⊤x, c
⟩2, ∀ c ∈ R
r : ‖c‖2 = 1. (3)
In fact, equality in (3) can always be achieved for c colinear to Λ1/2Ux ∈ Rr and in turn
x⊤Ax = maxc∈Sr−1
2
⟨x, UΛ1/2c
⟩2,
where Sr−12 denotes the ℓ2-unit sphere in r dimensions. More generally, for any X ∈ R
d×k,
Tr
(X⊤AX
)=
k∑
j=1
Xj⊤AXj = maxC:Cj∈Sr−1
2 ∀j
k∑
j=1
⟨Xj , UΛ1/2Cj
⟩2. (4)
Under the variational characterization of the trace objective in (4), the sparse PCA problem (2)can be re-written as a joint maximization over the variables X and C as follows:
maxX∈Xk
Tr(X⊤AX
)= max
X∈Xk
maxC:Cj∈Sr−1
2 ∀j
k∑
j=1
⟨Xj , UΛ1/2Cj
⟩2. (5)
The alternative formulation of the sparse PCA problem in (5) takes a step towards decoupling thedependence of the optimization on the ambient and intrinsic dimensions d and r, respectively. Themotivation behind the introduction of the auxiliary variable C will become clear in the sequel.
For a given C, the value of X ∈ Xk that maximizes the objective in (5) for that C is
X , argmaxX∈Xk
k∑
j=1
⟨Xj,Wj
⟩2, (6)
3
where W,UΛ1/2C is a real d× k matrix. The constrained, non-convex maximization (6) plays acentral role in our developments. We will later describe a combinatorial O(d · (s · k)2) procedure toefficiently compute X, reducing the maximization to an instance of the bipartite maximum weightmatching problem. For now, however, let us assume that such a procedure exists.
Let X⋆, C⋆ be the pair that attains the maximum in (5); in other words, X⋆ is the desiredsolution to the sparse PCA problem. If the optimal auxiliary variable C⋆ was known, then wewould be able to recover X⋆ by solving the maximization (6) for C = C⋆. Of course, C⋆ is notknown, and it is not possible to exhaustively consider all possible values in the domain of C.Instead, we examine only a finite number of possible values of C over a fine discretization of itsdomain. In particular, let Nǫ/2(S
r−12 ) denote a finite ǫ/2-net of the r-dimensional ℓ2-unit sphere; for
any point in Sr−12 , the net contains a point within an ǫ/2 radius from the former. There are several
ways to construct such a net [28]. Further, let [Nǫ/2(Sr−12 )]⊗k ⊂ R
d×k denote the kth Cartesianpower of the aforementioned ǫ/2-net. By construction, this collection of points contains a matrix Cthat is column-wise close to C⋆. In turn, it can be shown using the properties of the net, that thecandidate solution X ∈ Xk obtained through (6) at that point C will be approximately as good asthe optimal X⋆ in terms of the quadratic objective in (2).
Algorithm 1 Sparse PCA (Multiple disjoint components)
input : PSD d× d rank-r matrix A, ǫ ∈ (0, 1), k ∈ Z+.output : X ∈ Xk Theorem 11: C ← 2: [U,Λ]← EIG(A)3: for each C ∈ [Nǫ/2(S
r−12 )]⊗k do
4: W← UΛ1/2C W ∈ Rd×k
5: X← argmaxX∈Xk
∑kj=1
⟨Xj ,Wj
⟩2 Alg. 26: C ← C ∪
X
7: end for8: X← argmax
X∈C Tr(X⊤AX
)
All above observations yield aprocedure for approximately solvingthe sparse PCA problem (2). Thesteps are outlined in Algorithm 1.Given the desired number of com-ponents k and an accuracy parame-ter ǫ ∈ (0, 1), the algorithm gener-ates a net [Nǫ/2(S
r−12 )]⊗k and iterates
over its points. At each point C, itcomputes a feasible solution for thesparse PCA problem – a set of k s-sparse components – by solving themaximization in (6) via a procedure(Alg. 2) that will be described in thesequel. The algorithm collects the candidate solutions identified at the points of the net. The bestamong them achieves an objective in (2) that provably lies close to optimal. More formally,
Theorem 1. For any real d× d rank-r PSD matrix A, desired number of components k, number sof nonzero entries per component, and accuracy parameter ǫ ∈ (0, 1), Algorithm 1 outputs X ∈ Xk
such that
Tr(X
⊤AX
)≥ (1− ǫ) ·Tr
(X⊤
⋆ AX⋆
),
where X⋆, argmaxX∈XkTr(X⊤AX
), in time TSVD(r) +O
((4ǫ
)r·k · d · (s · k)2).
Algorithm 1 is the first nontrivial algorithm that provably approximates the solution of thesparse PCA problem (2). According to Theorem 1, it achieves an objective value that lies withina multiplicative factor from the optimal, arbitrarily close to 1. Its complexity grows as a low-orderpolynomial in the dimension d of the input, but exponentially in the intrinsic dimension r. Note,however, that it can be exponentially faster compared to the O(ds·k) brute force approach thatexhaustively considers all candidate supports for the k sparse components. The complexity of ouralgorithm follows from the cardinality of the net and the complexity of Algorithm 2, the subroutinethat solves the constrained maximization (6). The latter is a key ingredient of our algorithm, and isdiscussed in detail in the next subsection. A formal proof of Theorem 1 is provided in Section 9.2.
4
2.1 Sparse Components via Bipartite Matchings
In the core of Algorithm 1 lies Algorithm 2, a procedure that solves the constrained maximizationin (6). The algorithm breaks down the maximization into two stages. First, it identifies the supportof the optimal solution X. Determining the support reduces to an instance of the maximummatching problem on a weighted bipartite graph G. Then, it recovers the exact values of thenonzero entries in X based on the Cauchy-Schwarz inequality. In the sequel, we provide a briefdescription of Algorithm 2, leading up to its guarantees in Lemma 2.1.
Let Ij,supp(Xj) be the support of the jth column of X, j = 1, . . . , k. The objective in (6)becomes
k∑
j=1
⟨Xj ,Wj
⟩2=
k∑
j=1
(∑
i∈Ij
Xij ·Wij
)2≤
k∑
j=1
∑
i∈Ij
W 2ij . (7)
The last inequality is an application of the Cauchy-Schwarz Inequality and the constraint ‖Xj‖2 = 1∀ j ∈ 1, . . . , k. In fact, if an oracle reveals the supports Ij, j = 1, . . . , k, the upper bound in (7)can always be achieved by setting the nonzero entries of X as in Algorithm 2 (Line 6). Therefore,the key in solving (6) is determining the collection of supports to maximize the right-hand sideof (7).
u(1)
1
u(1)s
...
u(k)
1
u(k)s
...
v1
vd
vi
...
...
...
W2
i1
W2
i1
W2
ik
W2
ik
U1
Uk
V
Figure 1: The graph G generated byAlg. 2. It is used to determine the supportof the solution X in (6).
By constraint, the sets Ij must be pairwise dis-joint, each with cardinality s. Consider a weightedbipartite graph G =
(U = U1, . . . , Uk, V,E
)con-
structed as follows2 (Fig. 1):
• V is a set of d vertices v1, . . . , vd, corresponding tothe d variables, i.e., the d rows of X.
• U is a set of k · s vertices, conceptually partitionedinto k disjoint subsets U1, . . . , Uk, each of cardinal-ity s. The jth subset, Uj , is associated with thesupport Ij; the s vertices u(j)
α , α = 1, . . . , s in Uj
serve as placeholders for the variables/indices in Ij.
• Finally, the edge set is E = U × V . The edgeweights are determined by the d×k matrixW in (6).In particular, the weight of edge (u(j)
α , vi) is equalto W 2
ij . Note that all vertices in Uj are effectivelyidentical; they all share a common neighborhoodand edge weights.
Any feasible support Ijkj=1 corresponds to a perfect matching in G and vice-versa. Recall thata matching is a subset of the edges containing no two edges incident to the same vertex, while aperfect matching, in the case of an unbalanced bipartite graph G = (U, V,E) with |U | ≤ |V |, is amatching that contains at least one incident edge for each vertex in U . Given a perfect matchingM ⊆ E, the disjoint neighborhoods of Ujs under M yield a support Ijkj=1. Conversely, anyvalid support yields a unique perfect matching in G (taking into account that all vertices in Uj
are isomorphic). Moreover, due to the choice of weights in G, the right-hand side of (7) for agiven support Ijkj=1 is equal to the weight of the matching M in G induced by the former, i.e.,
2The construction is formally outlined in Algorithm 4 in Section 8.
5
Algorithm 2 Compute Candidate Solution
input Real d× k matrix Woutput X = argmaxX∈Xk
∑kj=1
⟨Xj ,Wj
⟩2
1: G(Ujkj=1, V,E
)← GenBiGraph(W) Alg. 4
2: M←MaxWeightMatch(G) ⊂ E3: X← 0d×k
4: for j = 1, . . . , k do5: Ij ← i ∈ 1, . . . , d : (u, vi) ∈ M, u ∈ Uj6: [Xj ]Ij ← [Wj ]Ij/‖[Wj ]Ij‖27: end for
∑kj=1
∑i∈Ij
W 2ij=∑
(u,v)∈M w(u, v). It follows that determining the support of the solution in (6),reduces to solving the maximum weight matching problem on the bipartite graph G.
Algorithm 2 readily follows. Given W ∈ Rd×k, the algorithm generates a weighted bipartite
graph G as described, and computes its maximum weight matching. Based on the latter, it firstrecovers the desired support of X (Line 5), and subsequently the exact values of its nonzero entries(Line 6). The running time is dominated by the computation of the matching, which can be donein O
(|E||U |+ |U |2 log |U |
)using a variant of the Hungarian algorithm [29]. Hence,
Lemma 2.1. For any W ∈ Rd×k, Algorithm 2 computes the solution to (6), in time O
(d · (s · k)2
).
A more formal analysis and proof of Lemma 2.1 is available in Section 9.1. With Algorithm 2and Lemma 2.1 in place, we complete the description of our sparse PCA algorithm (Algorithm 1)and the proof sketch of Theorem 1.
3 Sparse PCA on Low-Dimensional Sketches
Algorithm 3 Sparse PCA on Low Dim. Sketch
input : Real n× d S, r ∈ Z+, ǫ ∈ (0, 1), k ∈ Z+.output X(r) ∈ Xk. Thm. 21: S← Sketch(S, r)
2: A← S⊤S
3: X(r) ← Algorithm 1 (A, ǫ, k).
Algorithm 1 approximately solves thesparse PCA problem (2) on a d × d rank-rPSD matrix A, in time that grows as alow-order polynomial in the ambient dimen-sion d, but depends exponentially on r. Thisdependence can be prohibitive in practice.To mitigate its effect, instead of the origi-nal input, we can apply our sparse PCA al-gorithm on a low-rank approximation of A.Intuitively, the quality of the extracted components should depend on how well that low-ranksurrogate approximates the original input.
More formally, let S be the real n × d data matrix representing n (potentially centered) dat-apoints in d variables, and A the corresponding d × d covariance matrix. Further, let S be alow-dimensional sketch of the original data; an n × d matrix whose rows lie in an r-dimensionalsubspace, with r being an accuracy parameter. Such a sketch can be obtained in several ways,including for example exact or approximate SVD, or online sketching methods [30]. Finally, letA = 1/n · S⊤
S be the covariance matrix of the sketched data. Then, instead of A, we can approxi-mately solve the sparse PCA problem by applying Algorithm 1 on the low-rank surrogate A. Theabove are formally outlined in Algorithm 3. We note that the covariance matrix A does not needto be explicitly computed; Algorithm 1 can operate directly on the (sketched) input data matrix.
6
Theorem 2. For any n × d input data matrix S, with corresponding empirical covariance matrixA = 1/n · S⊤S, any desired number of components k, and accuracy parameters ǫ ∈ (0, 1) and r,Algorithm 3 outputs X(r) ∈ Xk such that
Tr(X⊤
(r)AX(r)
)≥ (1− ǫ) ·Tr
(X⊤
⋆ AX⋆
)− 2 · k · λ1,s(A−A),
in time TSKETCH(r) + TSVD(r) + O((
4ǫ
)r·k · d · (s · k)2). Here, X⋆, argmaxX∈Xk
Tr(X⊤AX
), and
λ1,s(A) denotes the sparse eigenvalue, i.e., the eigenvalue that corresponds to the principal s-sparseeigenvector of A.
The error λ1,s(A−A) and in turn the tightness of the approximation guarantees hinges on thequality of the sketch A. Higher values of the parameter r (the rank of the sketch) can allow fora more accurate solution and tighter guarantees. That is the case, for example, when the sketchis obtained through exact SVD. In that sense, Theorem 2 establishes a natural trade-off betweenthe running time of Algorithm 3 and the quality of the approximation guarantees. A formal proofof Theorem 2 is provided in Section 9.3. Observe that the error term itself is a sparse eigenvaluethat is hard to approximate, however even loose bounds provide tight conditional approximationresults, as we see next.
Using the main matrix approximation result of [31], the next theorem establishes that Algo-rithm 3 can be turned into an additive PTAS.
Theorem 3. Let A be a d × d positive semidefinite matrix with entries in [−1, 1], V be a d × dmatrix such that A = VV⊤. Further, let R be a random d × r matrix with entries drawn i.i.d.according to N (0, 1/r), and define
A,VRR⊤V⊤.
For any constant ǫ ∈ (0, 1], let r = O(ǫ−2 log d). Then, for any desired sparsity s, and numberof components k = O(1), Algorithm 1 with input argument A and accuracy parameter ǫ, outputsX(r) ∈ Xk such that
Tr(X⊤
(r)AX(r)
)≥ Tr
(X⊤
⋆ AX⋆
)− ǫ · s
with probability at least 1− 1/poly(d), in time nO(log(1/ǫ)/ǫ2)).
Remark 3.1. Note that λ1(A−A) serves as another elementary upper bound on λ1,s(A−A). IfA is a the rank-d SVD approximation of A, then—similar to [32]—we can obtain a multiplicativePTAS for sparse PCA, under the assumption of a decaying spectrum (e.g., under a power-lawdecay), and for s = Ω(n).
4 Related Work
We are not aware of any algorithm with provable guarantees for sparse PCA with disjoint supports.Multiple components can be extracted by repeatedly solving (1) using one of the aforementionedmethods. To ensure disjoint supports, variables “selected” by a component are removed fromthe dataset. This greedy approach, however, can result in highly suboptimal objective value (Seeexample in Sec. 7).
A significant volume of work has focused on the single-component sparse PCA problem (1);we scratch the surface and refer the reader to citations therein. Representative examples rangefrom early heuristics in [2], to the LASSO based techniques in [3], the elastic net ℓ1-regression in[4], ℓ1 and ℓ0 regularized optimization methods such as GPower in [7], a greedy branch-and-bound
7
technique in [8], or semidefinite programming approaches [9, 10, 11]. The authors of [13] presentan approach that uses ideas from an expectation-maximization (EM) formulation of the problem.More recently, [12] presents a simple and very efficient truncated version of the power iteration(TPower). Finally, [15] introduces an exact solver for the low-rank case of the problem; this solverwas then used on low-rank sketches in the work of [14] (SpanSPCA), that provides conditionalapproximation guarantees under spectral assumptions on the input data. Several ideas in this workare inspired by the aforementioned low-rank solvers. In our experiments, we compare against EM,TPower, and SpanSPCA, which all are experimentally achieving state-of-the-art performance.
Parallel to the algorithmic and optimization perspective, there is large line of statistical analysisfor sparse PCA that focuses on guarantees pertaining to planted models and the recovery of a “true”sparse component [16, 17, 18, 19, 20, 21, 22, 23, 24].
There has been some work on the explicit estimation of principal subspaces or multiple compo-nents under sparsity constraints. Non-deflation-based algorithms include extensions of the diagonalthresholding algorithm [33] and iterative thresholding approaches [17], while [34] and [35] proposemethods that rely on the “row sparsity for subspaces” assumption of [26]. These methods yieldcomponents supported on a common set of variables, and hence solve a problem different from (2).Magdon-Ismail and Boutsidis [27] discuss the multiple component Sparse PCA problem, proposean alternative objective function and for that problem obtain interesting theoretical guarantees.Finally, [36] develops a framework for sparse matrix factorizaiton problems, based on a novel atomicnorm. That framework captures sparse PCA – although not explicitly the constraint of disjointsupports – but the resulting optimization problem, albeit convex, is NP-hard.
5 Experiments
We evaluate our algorithm on a series of real datasets, and compare it to deflation-based approachesfor sparse PCA using TPower [12], EM [13], and SpanSPCA [14]. The latter are representativeof the state of the art for the single-component sparse PCA problem (1). Multiple componentsare computed one by one. To ensure disjoint supports, the deflation step effectively amounts toremoving from the dataset all variables used by previously extracted components. For algorithmsthat are randomly initialized, we depict best results over multiple random restarts. Additionalexperimental results are listed in Section 11 of the appendix.
Our experiments are conducted in a Matlab environment. Due to its nature, our algorithmis easily parallelizable; its prototypical implementation utilizes the Parallel Pool Matlab featureto exploit multicore (or distributed cluster) capabilities. Recall that our algorithm operates on alow-rank approximation of the input data. Unless otherwise specified, it is configured for a rank-4approximation obtained via truncated SVD. Finally, we put a time barrier in the execution ofour algorithm, at the cost of the theoretical approximation guarantees; the algorithm returns bestresults at the time of termination. This “early termination” can only hurt the performance of ouralgorithm.
Leukemia Dataset. We evaluate our algorithm on the Leukemia dataset [37]. The datasetcomprises 72 samples, each consisting of expression values for 12582 probe sets. We extract k = 5sparse components, each active on s = 50 features. In Fig. 2(a), we plot the cumulative explainedvariance versus the number of components. Deflation-based approaches are greedy: the leadingcomponents capture high values of variance, but subsequent ones contribute less. On the contrary,our algorithm jointly optimizes the k = 5 components and achieves higher total cumulative variance;one cannot identify a top component. We repeat the experiment for multiple values of k. Fig. 2(b)depicts the total cumulative variance capture by each method, for each value of k.
8
(a) (b)
Figure 2: Cumulative variance captured by the k s-sparse extracted components – Leukemiadataset [37]. Sparsity is arbitrarily set to s = 50 nonzero entries per component. Fig. 2(a) de-picts the cum. variance versus the number of components, for k = 5. Deflation-based approachesare greedy; first components capture high variance, but subsequent ones contribute less. Our algo-rithm jointly optimizes the k = 5 components and achieves higher total cum. variance. Fig. 2(b)depicts the total cum. variance achieved for various values of k.
Additional Datasets. We repeat the experiment on multiple datasets, arbitrarily selectedfrom [37]. Table 1 lists the total cumulative variance captured by k = 5 components, each withs = 40 nonzero entries, extracted using the four methods. Our algorithm achieves the highest valuesin most cases.
Bag of Words (BoW) Dataset. [37] This is a collection of text corpora stored under the“bag-of-words” model. For each text corpus, a vocabulary of d words is extracted upon tokenization,and the removal of stopwords and words appearing fewer than ten times in total. Each documentis then represented as a vector in that d-dimensional space, with the ith entry corresponding to thenumber of appearances of the ith vocabulary entry in the document.
We solve the sparse PCA problem (2) on the word-by-word cooccurrence matrix, and extractk = 8 sparse components, each with cardinality s = 10. We note that the latter is not explicitlyconstructed; our algorithm can operate directly on the input word-by-document matrix. Table 2lists the variance captured by each method; our algorithm consistently outperforms the other
TPower EM sPCA SpanSPCA SPCABiPart
Amzn Com Rev (1500×10000) 7.31e+ 03 7.32e + 03 7.31e+ 03 7.79e+ 03
Arcence Train (100×10000) 1.08e+ 07 1.02e + 07 1.08e+ 07 1.10e+ 07
CBCL Face Train (2429×361) 5.06e+ 00 5.18e + 00 5.23e+ 00 5.29e+ 00
Isolet-5 (1559×617) 3.31e+ 01 3.43e + 01 3.34e+ 01 3.51e+ 01
Leukemia (72×12582) 5.00e+ 09 5.03e + 09 4.84e+ 09 5.37e+ 09
Pems Train (267×138672) 3.94e+ 00 3.58e + 00 3.89e+ 00 3.75e+ 00
Mfeat Pix (2000×240) 5.00e+ 02 5.27e + 02 5.08e+ 02 5.47e+ 02
Table 1: Total cumulative variance captured by k = 5 40-sparse extracted components on variousdatasets [37]. For each dataset, we list the size (#samples×#variables) and the value of variancecaptured by each method. Our algorithm operates on a rank-4 sketch in all cases.
9
TPower EM sPCA SpanSPCA SPCABiPart
BoW:NIPS (1500×12419) 2.51e+ 03 2.57e+ 03 2.53e+ 03 3.34e+ 03 (+29.98%)
BoW:KOS (3430×6906) 4.14e+ 01 4.24e+ 01 4.21e+ 01 6.14e+ 01 (+44.57%)
BoW:Enron (39861×28102) 2.11e+ 02 2.00e+ 02 2.09e+ 02 2.38e+ 02 (+12.90%)
BoW:NyTimes (300000×102660) 4.81e+ 01 − 4.81e+ 01 5.31e+ 01 (+10.38%)
Table 2: Total variance captured by k = 8 extracted components, each with s = 15 nonzero entries– Bag of Words dataset [37]. For each corpus, we list the size (#documents×#vocabulary-size)and the explained variance. Our algorithm operates on a rank-5 sketch in all cases.
approaches.Finally, note that here each sparse component effectively selects a small set of words. In turn,
the k extracted components can be interpreted as a set of well-separated topics. In Table 3, we listthe topics extracted from the NY Times corpus (part of the Bag of Words dataset). The corpusconsists of 3 · 105 news articles and a vocabulary of d = 102660 words.
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
1: percent zzz united states zzz bush company team cup school zzz al gore
2: million zzz u s official companies game minutes student zzz george bush
3: money zzz american government market season add children campaign
4: high attack president stock player tablespoon women election
5: program military group business play oil show plan
6: number palestinian leader billion point teaspoon book tax
7: need war country analyst run water family public
8: part administration political firm right pepper look zzz washington
9: problem zzz white house american sales home large hour member
10: com games law cost won food small nation
Table 3: BoW:NyTimes dataset [37]. The table lists the words corresponding to the s = 10nonzero entries of each of the k = 8 extracted components (topics). Words corresponding to highermagnitude entries appear higher in the topic.
6 Conclusions
We considered the sparse PCA problem for multiple components with disjoint supports. Existingmethods for the single component problem can be used along with an appropriate deflation step tocompute multiple components one by one, leading to potentially suboptimal results. We presenteda novel algorithm for jointly optimizing multiple sparse and disjoint components with provableapproximation guarantees. Our algorithm is combinatorial and exploits interesting connectionsbetween the sparse PCA and the bipartite maximum weight matching problems. It runs in timethat grows as a low-order polynomial in the ambient dimension of the input data, but dependsexponentially on its rank. To alleviate this dependency, we can apply the algorithm on a low-dimensional sketch of the input, at the cost of an additional error in our theoretical approximationguarantees. Empirical evaluation of our algorithm demonstrated that in many cases it outperformsdeflation-based approaches.
10
Acknowledgments
DP is generously supported by NSF awards CCF-1217058 and CCF-1116404 and MURI AFOSRgrant 556016. This research has been supported by NSF Grants CCF 1344179, 1344364, 1407278,1422549 and ARO YIP W911NF-14-1-0258.
References
[1] H.F. Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika,23(3):187–200, 1958.
[2] I.T. Jolliffe. Rotation of principal components: choice of normalization constraints. Journalof Applied Statistics, 22(1):29–35, 1995.
[3] I.T. Jolliffe, N.T. Trendafilov, and M. Uddin. A modified principal component technique basedon the lasso. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.
[4] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journalof computational and graphical statistics, 15(2):265–286, 2006.
[5] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Sparse features for pca-likelinear regression. In Advances in Neural Information Processing Systems, pages 2285–2293,2011.
[6] Siu On Chan, Dimitris Papailiopoulos, and Aviad Rubinstein. On the worst-case approxima-bility of sparse PCA. preprint, 2015.
[7] M. Journee, Y. Nesterov, P. Richtarik, and R. Sepulchre. Generalized power method for sparseprincipal component analysis. The Journal of Machine Learning Research, 11:517–553, 2010.
[8] B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse pca: Exact and greedyalgorithms. NIPS, 18:915, 2006.
[9] Alexandre d’Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparseprincipal component analysis. The Journal of Machine Learning Research, 9:1269–1294, 2008.
[10] Y. Zhang, A. d’Aspremont, and L.E. Ghaoui. Sparse pca: Convex relaxations, algorithms andapplications. Handbook on Semidefinite, Conic and Polynomial Optimization, pages 915–940,2012.
[11] A. d’Aspremont, L. El Ghaoui, M.I. Jordan, and G.R.G. Lanckriet. A direct formulation forsparse pca using semidefinite programming. SIAM review, 49(3):434–448, 2007.
[12] Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems.The Journal of Machine Learning Research, 14(1):899–925, 2013.
[13] Christian D. Sigg and Joachim M. Buhmann. Expectation-maximization for sparse and non-negative pca. In Proceedings of the 25th International Conference on Machine Learning, ICML’08, pages 960–967, New York, NY, USA, 2008. ACM.
[14] Dimitris Papailiopoulos, Alexandros Dimakis, and Stavros Korokythakis. Sparse pca throughlow-rank approximations. In Proceedings of The 30th International Conference on MachineLearning, pages 747–755, 2013.
11
[15] Megasthenis Asteris, Dimitris S. Papailiopoulos, and Georgios N. Karystinos. The sparseprincipal component of a constant-rank matrix. Information Theory, IEEE Transactions on,60(4):2281–2290, April 2014.
[16] Arash Amini and Martin Wainwright. High-dimensional analysis of semidefinite relaxationsfor sparse principal components. In Information Theory, 2008. ISIT 2008. IEEE InternationalSymposium on, pages 2454–2458. IEEE, 2008.
[17] Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals ofStatistics, 41(2):772–801, 2013.
[18] A. d’Aspremont, F. Bach, and L.E. Ghaoui. Approximation bounds for sparse principal com-ponent analysis. arXiv preprint arXiv:1205.0121, 2012.
[19] T Tony Cai, Zongming Ma, and Yihong Wu. Sparse pca: Optimal rates and adaptive estima-tion. arXiv preprint arXiv:1211.1309, 2012.
[20] Yash Deshpande and Andrea Montanari. Sparse pca via covariance thresholding. arXivpreprint arXiv:1311.5179, 2013.
[21] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal components inhigh dimension. Ann. Statist., 41(1):1780–1815, 2013.
[22] Q. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse principal componentdetection. Journal of Machine Learning Research (JMLR), 30:1046–1066 (electronic), 2013.
[23] Tengyao Wang, Quentin Berthet, and Richard J. Samworth. Statistical and computationaltrade-offs in estimation of sparse principal components. arXiv preprint arXiv:1408.5369, 2014.
[24] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations solvesparse PCA up to the information limit? Annals of Probability, 43:1300–1322, 2015.
[25] L. Mackey. Deflation methods for sparse pca. NIPS, 21:1017–1024, 2009.
[26] Vincent Vu and Jing Lei. Minimax rates of estimation for sparse pca in high dimensions. InInternational Conference on Artificial Intelligence and Statistics, pages 1278–1286, 2012.
[27] Malik Magdon-Ismail and Christos Boutsidis. Optimal sparse linear auto-encoders and sparsePCA. CoRR, abs/1502.06626, 2015.
[28] Jirı Matousek. Lectures on discrete geometry, volume 212. Springer New York, 2002.
[29] Lyle Ramshaw and Robert E Tarjan. On minimum-cost assignments in unbalanced bipartitegraphs. HP Labs, Palo Alto, CA, USA, Tech. Rep. HPL-2012-40R1, 2012.
[30] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness:Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review,53(2):217–288, 2011.
[31] Noga Alon, Troy Lee, Adi Shraibman, and Santosh Vempala. The approximate rank of amatrix and its algorithmic applications: approximate rank. In Proceedings of the forty-fifthannual ACM symposium on Theory of computing, pages 675–684. ACM, 2013.
12
[32] Megasthenis Asteris, Dimitris Papailiopoulos, and Alexandros Dimakis. Nonnegative sparsepca with provable guarantees. In Proceedings of the 31st International Conference on MachineLearning (ICML-14), pages 1728–1736, 2014.
[33] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal componentsanalysis in high dimensions. Journal of the American Statistical Association, 104(486), 2009.
[34] Vincent Q Vu, Juhee Cho, Jing Lei, and Karl Rohe. Fantope projection and selection: Anear-optimal convex relaxation of sparse pca. In NIPS, pages 2670–2678, 2013.
[35] Zhaoran Wang, Huanran Lu, and Han Liu. Nonconvex statistical optimization: minimax-optimal sparse pca in polynomial time. arXiv preprint arXiv:1408.5352, 2014.
[36] Emile Richard, Guillaume R Obozinski, and Jean-Philippe Vert. Tight convex relaxationsfor sparse matrix factorization. In Advances in Neural Information Processing Systems, pages3284–3292, 2014.
[37] M. Lichman. UCI machine learning repository, 2013.
[38] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson andlindenstrauss. Random structures and algorithms, 22(1):60–65, 2003.
Supplemental Material
7 On the sub-optimality of deflation – An example
We provide a simple example demonstrating the sub-optimality of deflation based approaches forcomputing multiple sparse components with disjoint supports. Consider the real 4× 4 matrix
A =
1 0 0 ǫ
0 δ 0 0
0 0 δ 0
ǫ 0 0 1
,
with ǫ, δ > 0 such that ǫ+ δ < 1. Note that A is PSD; A = B⊤B for
B =
1 0 0 ǫ
0√δ 0 0
0 0√δ 0
0 0 0√1− ǫ2
.
We seek two 2-sparse components with disjoint supports, i.e., the solution to
maxX∈X
2∑
j=1
x⊤j Axj, (8)
where
X,X ∈ R
4×2 : ‖xi‖2 ≤ 1, ‖xi‖0 ≤ 2 ∀ i ∈ 1, 2, supp(x1) ∩ supp(x2) = ∅.
13
Iterative computation with deflation. Following an iterative, greedy procedure with adeflation step, we compute one component at the time. The first component is
x1 = argmax‖x‖0=2,‖x‖2=1
x⊤Ax. (9)
Recall that for any unit norm vector x with support I = supp(x),
x⊤Ax ≤ λmax (AI,I) , (10)
where AI,I denotes the principal submatrix of A formed by the rows and columns indexed by I.Equality can be achieved in (10) for x equal to the leading eigenvector of AI,I . Hence, it suffices todetermine the optimal support for x1. Due to the small size of the example, it is easy to determinethat the set I1 = 1, 4 maximizes the objective in (10) over all sets of two indices, achieving value
x⊤1 Ax1 = λmax
([1 ǫ
ǫ 1
])= 1 + ǫ. (11)
Since subsequent components must have disjoint supports, it follows that the support of the second2-sparse component x2 is I2 = 2, 3, and x2 achieves value
x⊤2 Ax2 = λmax
([δ 0
0 δ
])= δ. (12)
In total, the objective value in (8) achieved by the greedy computation with a deflation step is
2∑
j=1
x⊤j Axj = 1 + ǫ+ δ. (13)
The sub-optimality of deflation. Consider an alternative pair of 2-sparse components x′1 and x′
2
with support sets I ′1 = 1, 2 and I ′2 = 3, 4, respectively. Based on the above, such a pair achievesobjective value in (8) equal to
λmax
([1 0
0 δ
])+ λmax
([δ 0
0 1
])= 1 + 1 = 2,
which clearly outperforms the objective value in (13) (under the assumption ǫ + δ < 1), demon-strating the sub-optimality of the x1, x2 pair computed by the deflation-based approach. In fact,for small ǫ, δ the objective value in the second case is larger than the former by almost a factor oftwo.
8 Construction of Bipartite Graph
The following algorithm formally outlines the steps for generating the bipartite graphG =(Ujkj=1, V,E
)
given a weight d× k matrix W.
14
Algorithm 4 Generate Bipartite Graph
input Real d× k matrix Woutput Bipartite G =
(Ujkj=1, V,E
)Fig. 1
1: for j = 1, . . . , k do
2: Uj ←u(j)1 , . . . , u
(j)s
3: end for4: U ← ∪kj=1Uj |U | = k · s5: V ←
1, . . . , d
6: E ← U × V7: for i = 1, . . . , d do8: for j = 1, . . . , k do9: for each u ∈ Uj do
10: w(u, vi
)←W 2
ij
11: end for12: end for13: end for
9 Proofs
9.1 Guarantees of Algorithm 2
Lemma 2.1. For any real d× k matrix W, and Algorithm 2 outputs
X = argmaxX∈Xk
k∑
j=1
⟨Xj,Wj
⟩2(14)
in time O(d · (s · k)2
).
Proof. Consider a matrix X ∈ Xk and let Ij , j = 1 . . . , k denote the support sets of its columns.By the constraints in Xk, those sets are disjoint, i.e., Ij1 ∩ Ij2 = ∅ ∀j1, j2 ∈ 1, . . . , k, j1 6= j2, and
k∑
j=1
⟨Xj , Wj
⟩2=
k∑
j=1
(∑
i∈Ij
Xij ·Wij
)2≤
k∑
j=1
(∑
i∈Ij
W 2ij
). (15)
The last inequality is due to Cauchy-Schwarz and the fact that ‖Xj‖2 ≤ 1, ∀ j ∈ 1, . . . , k. Infact, if the supports sets Ij, j = 1, . . . , k were known, the upper bound in (15) would be achieved
by setting XjIj
= WjIj/‖Wj
Ij‖2, i.e., setting the nonzero subvector of the jth column of X colinear
to the corresponding subvector of the jth column of W. Hence, the key step towards computingthe optimal solution X is to determine the support sets Ij , j = 1, . . . , k of its columns.
Consider the set of binary matrices
Z,Z ∈ 0, 1d×k : ‖Zj‖0 ≤ s ∀ j ∈ [k], supp(Zi) ∩ supp(Zj) = ∅ ∀ i, j ∈ [k], i 6= j
.
The set represents all possible supports for the members of Xk. Taking into account the previousdiscussion, the maximization in (14) can be written with respect to Z ∈ Z:
maxX∈Xk
k∑
j=1
⟨Xj, Wj
⟩2= max
Z∈Z
k∑
j=1
d∑
i=1
ZijW2ij . (16)
15
Let Z ∈ Z denote the optimal solution, which corresponds to the (support) indicator of X. Next,we show that computing Z boils down to solving a maximum weight matching problem on thebipartite graph generated by Algorithm 4. Recall that given W ∈ R
d×k, Algorithm 4 generates acomplete weighted bipartite graph G = (U, V,E) where
• V is a set of d vertices v1, . . . , vd, corresponding to the d variables, i.e., the d rows of X.• U is a set of k · s vertices, conceptually partitioned into k disjoint subsets U1, . . . , Uk, each of
cardinality s. The jth subset, Uj, is associated with the support Ij; the s vertices u(j)α , α = 1, . . . , s
in Uj serve as placeholders for the variables/indices in Ij.• Finally, the edge set is E = U × V . The edge weights are determined by the d × k matrix W
in (6). In particular, the weight of edge (u(j)α , vi) is equal to W 2
ij . Note that all vertices in Uj areeffectively identical; they all share a common neighborhood and edge weights.
It is straightforward to verify that any Z ∈ Z corresponds to a perfect matching in G and viceversa; Zij = 1 if and only if vertex vi ∈ V is matched with a vertex in Uj (all vertices in Uj areequivalent with respect to their neighborhood). Further, the objective value in (16) for a givenZ ∈ Z is equal to the weight of the corresponding matching in G. More formally,
• Given a perfect matching M, the support Ij of the jth column of Z is determined by theneighborhood of Uj in the matching:
Ij ←i ∈ [d] : (u, vi) ∈ M, u ∈ Uj
, j = 1, . . . , k. (17)
Note that the sets Ij , j = 1, . . . , k are indeed disjoint, and each has cardinality equal to s. Theweight of the matchingM is
∑
(u,v)∈M
w(u, v) =
k∑
j=1
∑
(u,vi)∈M:u∈Uj
w(u, vi) =
k∑
j=1
∑
i∈Ij
W 2ij =
k∑
j=1
d∑
i=1
Zij ·W 2ij, (18)
which is equal to the objective function in (16).• Conversely, given an indicator matrix Z ∈ Z, let Ij,supp(Zj), and let Ij(α) denote the αth
element in the set, α = 1, . . . , s (with an arbitrary ordering). Then,
M =(u(j)
α , vIj(α)), α = 1, . . . , s, j = 1, . . . , k⊂ E
is a perfect matching in G. The objective value achieved by Z is equal to the weight ofM:
k∑
j=1
d∑
i=1
Zij ·W 2ij =
k∑
j=1
∑
i∈Ij
W 2ij =
k∑
j=1
s∑
α=1
W 2Ij(α),j
=∑
(u,v)∈M
w(u, v). (19)
It follows from (18) and (19) that to determine Z, it suffices to compute a maximum weight perfectmatching in G. The desired support is then obtained as described in (17) (lines 4-7 of Algorithm 2).This complete the proof of correctness of Algorithm 2 which proceeds in the steps described aboveto determine the support of X.
The weighted bipartite graph G is generated in O(d ·(s ·k)). The running time of Algorithm 2 isdominated by computing the maximum weight matching of G. For the case of unbalanced bipartitegraph with |U | = s · k < d = |V | the Hungarian algorithm can be modified [29] to computethe maximum weight bipartite matching in time O
(|E||U |+ |U |2 log |U |
)= O
(d · (s · k)2
). This
completes the proof.
16
9.2 Guarantees of Algorithm 1 – Proof of Theorem 1
We first prove a more general version of Theorem 1 for arbitrary constraint sets. Combining thatwith the guarantees of Algorithm 2, we prove the Theorem 1.
Lemma 9.2. For any real d × d rank-r PSD matrix A and arbitrary set X ⊂ Rd×k, let X⋆,
argmaxX∈X Tr(X⊤AX
). Assuming that there exists an operator PX : Rd×k → X such that PX (W),
argmaxX∈X
⟨xj , wj
⟩2, then Algorithm 1 outputs X ∈ X such that
Tr(X
⊤AX
)≥ (1− ǫ) ·Tr
(X⊤
⋆ AX⋆
),
in time TSVD(r)+O((
4ǫ
)r·k ·(TX +kd
)), where TX is the time required to compute PX (·) and TSVD(r)
the time required to compute the truncated SVD of A.
Proof. Let A = UΛU⊤denote the truncated eigenvalue decomposition of A; Λ is a diagonal r× r
whose ith diagonal entry Λii is equal to the ith largest eigenvalue of A, while the columns of Ucontain the corresponding eigenvectors. By the Cauchy-Schwartz inequality, for any x ∈ R
d,
x⊤Ax =∥∥Λ1/2
U⊤x∥∥22≥⟨Λ
1/2U
⊤x, c
⟩2, ∀ c ∈ R
r : ‖c‖2 = 1. (20)
In fact, equality in (20) is achieved for c colinear to Λ1/2
Ux, and hence,
x⊤Ax = maxc∈Sr−1
2
⟨Λ
1/2U
⊤x, c
⟩2. (21)
In turn,
Tr
(X⊤AX
)=
k∑
j=1
Xj⊤AXj = maxC:Cj∈Sr−1
2 ∀j
k∑
j=1
⟨Λ
1/2U
⊤Xj , Cj
⟩2. (22)
Recall that X⋆ is the optimal solution of the trace maximization on A, i.e.,
X⋆, argmaxX∈X
Tr
(X⊤AX
).
Let C⋆ be the maximizing value of C in (22) for X = X⋆, i.e., C⋆ is an r×k matrix with unit-normcolumns such that for all j ∈ 1, . . . , k,
Xj⋆⊤AXj
⋆ =⟨Λ
1/2U
⊤Xj
⋆, Cj⋆
⟩2. (23)
Algorithm 1 iterates over the points (r × k matrices) C in N⊗kǫ/2
(Sr−12
), the kth cartesian power of
a finite ǫ/2-net of the r-dimensional l2-unit sphere. At each such point C, it computes a candidate
X = argmaxX∈X
k∑
j=1
⟨Xj,UΛ1/2Cj
⟩2
via Algorithm 2 (See Lemma 9.1 for the guarantees of Algorithm 2). By construction, the setN⊗k
ǫ/2
(Sr−12
)contains a C♯ such that
‖C♯ −C⋆‖∞,2 = maxj∈1,...,k
‖Cj♯ −Cj
⋆‖2 ≤ ǫ/2. (24)
17
Based on the above, for all j ∈ 1, . . . , k,(Xj
⋆⊤AXj
⋆
)1/2=∣∣⟨Λ1/2
U⊤Xj
⋆, Cj⋆
⟩∣∣
=∣∣⟨Λ1/2
U⊤Xj
⋆, Cj♯
⟩+⟨Λ
1/2U
⊤Xj
⋆,(Cj
⋆ −Cj♯
)⟩∣∣
≤∣∣⟨Λ1/2
U⊤Xj
⋆, Cj♯
⟩∣∣+∣∣⟨Λ1/2
U⊤Xj
⋆,(Cj
⋆ −Cj♯
)⟩∣∣
≤∣∣⟨Λ1/2
U⊤Xj
⋆, Cj♯
⟩∣∣+∥∥Λ1/2
U⊤Xj
⋆
∥∥ ·∥∥Cj
⋆ −Cj♯
∥∥
≤∣∣⟨Λ1/2
U⊤Xj
⋆, Cj♯
⟩∣∣+ (ǫ/2) ·(Xj
⋆⊤AXj
⋆
)1/2. (25)
The first step follows by the definition of C⋆, the second by the linearity of the inner product,the third by the triangle inequality, the fourth by Cauchy-Schwarz inequality and the last by (24).Rearranging the terms in (25),
∣∣⟨Λ1/2U
⊤Xj
⋆, Cj♯
⟩∣∣ ≥(1− ǫ
2
)·(Xj
⋆⊤AXj
⋆
)1/2 ≥ 0,
and in turn,
⟨Λ
1/2U
⊤Xj
⋆, Cj♯
⟩2 ≥(1− ǫ
2
)2 ·Xj⋆⊤AXj
⋆ ≥ (1− ǫ) ·Xj⋆⊤AXj
⋆ (26)
Summing the terms in (26) over all j ∈ 1, . . . , k,k∑
j=1
⟨Λ
1/2U
⊤Xj
⋆, Cj♯
⟩2 ≥ (1− ǫ) ·Tr
(X⊤
⋆ AX⋆
). (27)
Let X♯ ∈ X be the candidate solution produced by the algorithm at C♯, i.e.,
X♯, argmaxX∈X
k∑
j=1
⟨xj, UΛ
1/2Cj
♯
⟩2. (28)
Then,
Tr
(X⊤
♯ AX♯
)(α)= max
C:Cj∈Sr−12 ∀j
k∑
j=1
⟨Λ
1/2U
⊤Xj
♯ , Cj⟩2
(β)
≥k∑
j=1
⟨Λ
1/2U
⊤Xj
♯ , Cj♯
⟩2
(γ)
≥k∑
j=1
⟨Xj
⋆, UΛ1/2
Cj♯
⟩2
(δ)
≥ (1− ǫ) ·Tr
(X⊤
⋆ AX⋆
), (29)
where (α) follows from the observation in (22), (β) from the sub-optimality of C♯, (γ) by thedefinition of X♯ in (28), while (δ) follows from (27). According to (29), at least one of the candidatesolutions produced by Algorithm 1, namely X♯, achieves an objective value within a multiplicativefactor (1− ǫ) from the optimal, implying the guarantees of the lemma.
Finally, the running time of Algorithm 1 follows immediately from the cost per iteration andthe cardinality of the ǫ/2-net on the unit-sphere. Note that matrix multiplications can exploit thesingular value decomposition which is performed once.
18
Theorem 1. For any real d× d rank-r PSD matrix A, desired number of components k, number sof nonzero entries per component, and accuracy parameter ǫ ∈ (0, 1), Algorithm 1 outputs X ∈ Xk
such that
Tr(X
⊤AX
)≥ (1− ǫ) ·Tr
(X⊤
⋆ AX⋆
),
where X⋆, argmaxX∈XkTr(X⊤AX
), in time TSVD(r) +O
((4ǫ
)r·k · d · (s · k)2). TSVD(r) is the time
required to compute the truncated SVD of A.
Proof. Recall that Xk is the set of d× k matrices X whose columns have unit length and pairwisedisjoint supports. Algorithm 2, given any W ∈ R
d×k, computes X ∈ Xk that optimally solves theconstrained maximization in line 5. (See Lemma 9.1 for the guarantee of Algorithm 2). in timeO(d · (s · k)2
). The desired result then follows by Lemma 9.2 for the constrained set Xk.
9.3 Guarantees of Algorithm 3 – Proof of Theorem 2
We prove Theorem 2 with the approximation guarantees of Algorithm 3.
Lemma 9.3. For any d× d PSD matrices A and A, and any set X ⊆ Rd×k let
X⋆, argmaxX∈X
Tr
(X⊤AX
), and X⋆, argmax
X∈XTr(X⊤AX
).
Then, for any X ∈ X such that Tr(X
⊤AX
)≥ γ ·Tr
(X⊤
⋆ AX⋆
)for some 0 < γ < 1,
Tr(X
⊤AX
)≥ γ ·Tr
(X⊤
⋆ AX⋆
)− 2 · ‖A−A‖2 ·max
X∈X‖X‖2F.
Proof. By the optimality of X⋆ for A,
Tr
(X⊤
⋆ AX⋆
)≥ Tr
(X⊤
⋆ AX⋆
).
In turn, for any X ∈ X such that Tr
(X
⊤AX
)≥ γ ·Tr
(X⊤
⋆ AX⋆
)for some 0 < γ < 1,
Tr
(X
⊤AX
)≥ γ ·Tr
(X⊤
⋆ AX⋆
). (30)
Let E,A−A. By the linearity of the trace,
Tr
(X
⊤AX
)= Tr
(X
⊤AX
)−Tr
(X
⊤EX
)
≤ Tr
(X
⊤AX
)+∣∣Tr
(X
⊤EX
)∣∣. (31)
By Lemma 10.10,
∣∣Tr
(X
⊤EX
)∣∣ ≤ ‖X‖F · ‖X‖F · ‖E‖2 ≤ ‖E‖2 ·maxX∈X
‖X‖2F , R. (32)
Continuing from (31),
Tr
(X
⊤AX
)≤ Tr
(X
⊤AX
)+R. (33)
19
Similarly,
Tr
(X⊤
⋆ AX⋆
)= Tr
(X⊤
⋆ AX⋆
)−Tr
(X⊤
⋆ EX⋆
)
≥ Tr
(X⊤
⋆ AX⋆
)−∣∣Tr
(X⊤
⋆ EX⋆
)∣∣
≥ Tr
(X⊤
⋆ AX⋆
)−R. (34)
Combining the above, we have
Tr
(X
⊤AX
)≥ Tr
(X
⊤AX
)−R
≥ γ ·Tr
(X⊤
⋆ AX⋆
)−R
≥ γ ·(Tr
(X⊤
⋆ AX⋆
)−R
)−R
= γ ·Tr
(X⊤
⋆ AX⋆
)− (1 + γ) ·R
≥ γ ·Tr
(X⊤
⋆ AX⋆
)− 2 ·R,
where the first inequality follows from (33) the second from (30), the third from (34), and the lastfrom the fact that R ≥ 0 and 0 < γ ≤ 1. This concludes the proof.
Remark 9.2. If in Lemma 9.3 the PSD matrices A and A ∈ Rd×d are such that A −A is also
PSD, then the following tighter bound holds:
Tr(X
⊤AX
)≥ γ ·Tr
(X⊤
⋆ AX⋆
)−
k∑
i=1
λi
(A−A
).
Proof. This follows from the fact that if E,A−A is PSD, then
Tr
(X
⊤EX
)=
d∑
j=1
x⊤j Exj ≥ 0,
and the bound in (31) can be improved to
Tr
(X
⊤AX
)= Tr
(X
⊤AX
)−Tr
(X
⊤EX
)≤ Tr
(X
⊤AX
).
Further, by Lemma 10.11, the bound in (32) can be improved to
Tr(X
⊤EX
)≤
k∑
i=1
λi
(E), R.
The rest of the proof follows as is.
Theorem 2. For any n × d input data matrix S, with corresponding empirical covariance matrixA = 1/n · S⊤S, any desired number of components k, and accuracy parameters ǫ ∈ (0, 1) and r,Algorithm 3 outputs X(r) ∈ Xk such that
Tr(X⊤
(r)AX(r)
)≥ (1− ǫ) ·Tr
(X⊤
⋆ AX⋆
)− 2 · k · ‖A−A‖2,
where X⋆, argmaxX∈XkTr(X⊤AX
), in time TSKETCH(r) + TSVD(r) +O
((4ǫ
)r·k · d · (s · k)2).
Proof. The theorem follows from Lemma 9.3 and the approximation guarantees of Algorithm 1.
20
9.4 Proof of Theorem 3
First, we restate and prove the following Lemma by [31].
Lemma 9.4. Let A ∈ Rd×d be an positive semidefinite matrix with entries in [−1, 1], and V ∈ R
d×d
matrix such that A = VV⊤. Consider a random matrix R ∈ Rd×r with entries drawn according to
a Gaussian distribution N(0, 1/r), and define
A = VRR⊤V⊤.
Then, for r = O(ǫ−2 log d),
∣∣[A]i,j − [A]i,j∣∣ ≤ ǫ
for all i, j with probability at least 1− 1/d.
Proof. The proof relies on the Johnson-Lindenstrauss (JL) Lemma [38], according to which for anytwo unit norm vectors x,y ∈ R
d and R generated as described
Pr|x⊤RR⊤y − x⊤y| ≥ ǫ
≤ 2 · e−(ǫ2−ǫ3)·r/4.
Observe that each element of A is in [−1, 1], hence can be rewritten as an inner product of twounit-norm vectors:
[A]i,j = VT:,iV:,j.
Setting r = O(ǫ−2 log d) and using the JL lemma and a union bound over all O(d2) vector pairsV:,i, V:,j we obtain the desired result.
Next, we provide the proof of Theorem 3 for the simple case of k = 1; the proof easily generalizesto the multi-component case k > 1. According to Lemma 9.4, choosing d = O
((δ/6)−2 log n
)=
O(δ−2 log n
)suffices for all entries of A constructed as described in the lemma to satisfiy
∣∣[A]i,j − [A]i,j∣∣ ≤ δ
6
with probability at least 1− 1/d. In turn, for any s-sparse, unit-norm x,
∣∣x⊤Ax− x⊤Ax∣∣ =
∣∣∣∣∣∣
∑
i,j
xixj([A]ij − [A]ij)
∣∣∣∣∣∣≤ δ
6·
∣∣∣∣∣∣
n∑
i=1
|xi|n∑
j=1
|xj|
∣∣∣∣∣∣
≤ δ
6· ‖x‖21 ≤
δ
6·(√
s · ‖x‖2)2
=δ
6· s, (35)
where the second inequality follows from the fact that x is s-sparse and unit norm.We run Algorithm 1 (for k = 1) with input argument the rank-r matrix A, desired sparsity s and
accuracy parameter ǫ = δ/6. Algorithm 1 outputs a s-sparse unit-norm vector x which accordingto Theorem 1 satisfies
(1− δ/6) · xd⊤Axd ≤ x⊤Ax ≤ xd
⊤Axd, (36)
where xd is the true s-sparse principal component of A. This, in turn, implies that x satisfies
∣∣∣x⊤Ax− xd⊤Axd
∣∣∣ ≤ δ
6· xd
⊤Axd ≤δ
6
(1 +
δ
6
)s ≤ δ
3· s, (37)
21
where the second inequality follows from the fact that the entries of A lie in [−1− δ6 , 1 +
δ6 ] and x
is s-sparse and unit-norm.In the following, we bound the difference of the performance of x on the original matrix A from
the optimal value. Let x⋆ denote the s-sparse principal component of A and define
OPT,x⋆TAx⋆.
Then,
|OPT− x⊤Ax| = |x⊤⋆ Ax⋆ − x⊤Ax|
= |x⊤⋆ Ax⋆ − x⊤
d Axd + xd⊤Axd − x⊤Ax|
≤ |x⊤⋆ Ax⋆ − xd
⊤Axd|︸ ︷︷ ︸A
+ |xd⊤Axd − x⊤Ax|︸ ︷︷ ︸
B
. (38)
Utilizing (35) and the triangle inequality, one can verify that
A = |x⋆⊤Ax⋆ − xd
⊤Axd + xd⊤Axd − xd
⊤Axd|≤ |x⋆
⊤Ax⋆ − xd⊤Axd|+ |xd
⊤Axd − xd⊤Axd|
≤ x⋆⊤Ax⋆ − xd
⊤Axd︸ ︷︷ ︸≥0
+δ
6· s
≤ x⋆⊤Ax⋆ − xd
⊤Axd +δ
6· s + xd
⊤Axd − x⋆⊤Ax⋆︸ ︷︷ ︸
≥0
≤ x⋆⊤Ax⋆ − x⋆
⊤Ax⋆ +δ
6· s + xd
⊤Axd − xd⊤Axd
≤ |x⋆⊤Ax⋆ − x⋆
⊤Ax⋆|+δ
6· s + |xd
⊤Axd − xd⊤Axd|
≤ δ
2· s. (39)
Similarly,
B =∣∣∣xd
⊤Axd − x⊤Ax+ x⊤Ax− x⊤Ax∣∣∣
=∣∣∣xd
⊤Axd − x⊤Ax∣∣∣+∣∣∣x⊤Ax− x⊤Ax
∣∣∣
≤∣∣∣xd
⊤Axd − x⊤Ax∣∣∣+ δ
6· s
(α)
≤ 2δ
6· s+ δ
6· s
≤ δ
2· s. (40)
where (α) follows from (37). Continuing from (38), combining (39) and (40) we obtain
|OPT −x⊤Ax∣∣∣ ≤ δ · s,
which is the desired result.
22
10 Auxiliary Technical Lemmata
Lemma 10.5. For any real d× n matrix M, and any r, k ≤ mind, n,r+k∑
i=r+1
σi(M) ≤ k√r + k
· ‖M‖F,
where σi(M) is the ith largest singular value of M.
Proof. By the Cauchy-Schwartz inequality,
r+k∑
i=r+1
σi(M) =
r+k∑
i=r+1
|σi(M)| ≤(
r+k∑
i=r+1
σ2i (M)
)1/2
· ‖1k‖2 =√k ·(
r+k∑
i=r+1
σ2i (M)
)1/2
.
Note that σr+1(M), . . . , σr+k(M) are the k smallest among the r+k largest singular values. Hence,
r+k∑
i=r+1
σ2i (M) ≤ k
r + k
r+k∑
i=1
σ2i (M) ≤ k
r + k
mind,n∑
i=1
σ2i (M) =
k
r + k‖M‖2F.
Combining the two inequalities, the desired result follows.
Corollary 1. For any real d× n matrix M and k ≤ mind, n, σk(M) ≤ k−1/2 · ‖M‖F.
Proof. It follows immediately from Lemma 10.5.
Lemma 10.6. Let a1, . . . , an and b1, . . . , bn be 2n real numbers and let p and q be two numberssuch that 1/p+ 1/q = 1 and p > 1. We have
∣∣n∑
i=1
aibi∣∣ ≤
(n∑
i=1
|ai|p)1/p
·(
n∑
i=1
|bi|q)1/q
.
Lemma 10.7. For any two real matrices A and B of appropriate dimensions,
‖AB‖F ≤ min‖A‖2‖B‖F, ‖A‖F‖B‖2 .
Proof. Let bi denote the ith column of B. Then,
‖AB‖2F =∑
i
‖Abi‖22 ≤∑
i
‖A‖22‖bi‖22 = ‖A‖22∑
i
‖bi‖22 = ‖A‖22‖B‖2F.
Similarly, using the previous inequality,
‖AB‖2F = ‖B⊤A⊤‖2F ≤ ‖B⊤‖22‖A⊤‖2F = ‖B‖22‖A‖2F.
Combining the two upper bounds, the desired result follows.
Lemma 10.8. For any A,B ∈ Rn×k,
∣∣〈A,B〉∣∣,∣∣Tr
(A⊤B
)∣∣ ≤ ‖A‖F‖B‖F.
Proof. The inequality follows from Lemma 10.6 for p = q = 2, treating A and B as vectors.
23
Lemma 10.9. For any real m× n matrix A, and any k ≤ minm, n,
maxY∈Rn×k
Y⊤Y=Ik
‖AY‖F =
(k∑
i=1
σ2i (A)
)1/2
.
The maximum is attained by Y coinciding with the k leading right singular vectors of A.
Proof. Let UΣV⊤ be the singular value decomposition of A; U and V are m × m and n × nunitary matrices respectively, while Σ is a diagonal matrix with Σjj = σj, the jth largest singularvalue of A, j = 1, . . . , d, where d,minm,n. Due to the invariance of the Frobenius norm underunitary multiplication,
‖AY‖2F = ‖UΣV⊤Y‖2F = ‖ΣV⊤Y‖2F. (41)
Continuing from (41),
‖ΣV⊤Y‖2F = Tr
(Y⊤VΣ2V⊤Y
)=
k∑
i=1
y⊤i
( d∑
j=1
σ2j · vjv
⊤j
)yi =
d∑
j=1
σ2j ·
k∑
i=1
(v⊤j yi
)2.
Let zj,∑k
i=1
(v⊤j yi
)2, j = 1, . . . , d. Note that each individual zj satisfies
0 ≤ zj,k∑
i=1
(v⊤j yi
)2≤ ‖vj‖2 = 1,
where the last inequality follows from the fact that the columns of Y are orthonormal. Further,
d∑
j=1
zj =
d∑
j=1
k∑
i=1
(v⊤j yi
)2=
k∑
i=1
d∑
j=1
(v⊤j yi
)2=
k∑
i=1
‖yi‖2 = k.
Combining the above, we conclude that
‖AY‖2F =
d∑
j=1
σ2j · zj ≤ σ2
1 + . . .+ σ2k. (42)
Finally, it is straightforward to verify that if yi = vi, i = 1, . . . , k, then (42) holds with equality.
Lemma 10.10. For any real d×n matrix A, and pair of d× k matrix X and n× k matrix Y suchthat X⊤X = Ik and Y⊤Y = Ik with k ≤ mind, n, the following holds:
∣∣Tr
(X⊤AY
)∣∣ ≤√k ·( k∑
i=1
σ2i (A)
)1/2.
Proof. By Lemma 10.8,
|〈X, AY〉| =∣∣Tr
(X⊤AY
)∣∣ ≤ ‖X‖F · ‖AY‖F =√k · ‖AY‖F.
where the last inequality follows from the fact that ‖X‖2F = Tr(X⊤X
)= Tr(Ik) = k. Combining
with a bound on ‖AY‖F as in Lemma 10.9, completes the proof.
24
Lemma 10.11. For any real d × d PSD matrix A, and k × d matrix X with k ≤ d orthonormalcolumns,
Tr
(X⊤AX
)≤
k∑
i=1
λi(A)
where λi(A) is the ith largest eigenvalue of A. Equality is achieved for X coinciding with the kleading eigenvectors of A.
Proof. LetA = VV⊤ be a factorization of the PSDmatrixA. Then,Tr(X⊤AX
)= Tr
(X⊤VV⊤X
)=
‖V⊤X‖2F. The desired result follows by Lemma 10.9 and the fact that λi(A) = σ2i (V), i =
1, . . . , d.
11 Additional Experimental Results
(a) (b)
Figure 3: Cumulative variance captured by k s-spars components computed on the word-by-wordmatrix – BagOfWords:NIPS dataset [37]. Sparsity is arbitrarily set to s = 10 nonzero entriesper component. Fig. 3(a) depicts the cum. variance captured by k = 6 components. Deflation leadsto a greedy formation of components; first components capture high variance, but subsequent onescontribute less. On the contrary, our algorithm jointly optimizes the k components and achieveshigher total cum. variance. Fig. 3(b) depicts the total cum. variance achieved for various valuesof k. Our algorithm operates on a rank-4 approximation of the input.
25
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
1:
TPower
network algorithm neuron parameter object classifier word noise
2: model data cell point image net speech control
3: learning system pattern distribution recognition classification level dynamic
4: input error layer hidden images class context step
5: function weight information space task test hmm term
6: neural problem signal gaussian features order character optimal
7: unit result visual linear feature examples processing component
8: set number field probability representation rate non equation
9: training method synaptic mean performance values approach single
10: output vector firing case view experiment trained analysis
11:
SpanSPCA
network algorithm neuron parameter recognition control classifier noise
12: model data cell distribution object action classification order
13: input weight pattern point image dynamic class term
14: learning error layer linear word step net component
15: neural problem signal probability performance optimal test rate
16: function output information space task policy speech equation
17: unit result visual gaussian features states examples single
18: set number synaptic hidden representation reinforcement approach analysis
19: system method field case feature values experiment large
20: training vector response mean images controller trained form
21:
SPCABiPart
data function neuron unit learning network model training
22: distribution algorithm cell weight space input parameter hidden
23: gaussian set visual layer action neural information performance
24: probability error direction net order system control recognition
25: component problem firing task step output dynamic classifier
26: approach result synaptic connection linear pattern mean test
27: analysis number response activation case signal noise word
28: mixture method spike architecture values processing field speech
29: likelihood vector activity generalization term image local classification
30: experiment point motion threshold optimal object equation trained
Total Cum. Variance
TPower 2.5999 · 103
SpanSPCA 2.5981 · 103
SPCABiPart 3.2090 · 103
Table 4: BagOfWords:NIPS dataset [37]. We run various SPCA algorithms for k = 8 com-ponents (topics) and s = 10 nonzero entries per component. The table lists the words selected byeach component (words corresponding to higher magnitude entries appear higher in the topic). Ouralgorithm was configured to use a rank-4 approximation of the input data.
26
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
1:
TPower
percent zzz bush team school women zzz enron drug palestinian
2: company zzz al gore game student show firm patient zzz israel
3: million president season program book zzz arthur andersen doctor zzz israeli
4: companies official player high com deal system zzz yasser arafat
5: market zzz george bush play children look lay problem attack
6: stock campaign games right american financial law leader
7: business government point group need energy care peace
8: money plan run home part executives cost israelis
9: billion administration coach public family accounting help israeli
10: fund zzz white house win teacher found partnership health zzz west bank
11:
SpanSPCA
percent team zzz bush palestinian school cup show won
12: company game zzz al gore attack student minutes com night
13: million season president zzz united states children add part left
14: companies player zzz george bush zzz u s program tablespoon look big
15: market play campaign military home teaspoon need put
16: stock games official leader family oil book win
17: business point government zzz israel women pepper called hit
18: money run political zzz american public water hour job
19: billion right election war high large american ago
20: plan coach group country law sugar help zzz new york
21:
SPCABiPart
percent zzz united states zzz bush company team cup school zzz al gore
22: million zzz u s official companies game minutes student zzz george bush
23: money zzz american government market season add children campaign
24: high attack president stock player tablespoon women election
25: program military group business play oil show plan
26: number palestinian leader billion point teaspoon book tax
27: need war country analyst run water family public
28: part administration political firm right pepper look zzz washington
29: problem zzz white house american sales home large hour member
30: com games law cost won food small nation
Total Cum. Variance
TPower 45.4014
SpanSPCA 46.0075
SPCABiPart 47.7212
Table 5: BagOfWords:NyTimes dataset [37]. We run various SPCA algorithms for k = 8components (topics) and s = 10 nonzero entries per component. The table lists the words selectedby each component (words corresponding to higher magnitude entries appear higher in the topic).Our algorithm was configured to use a rank-4 approximation of the input data.
27
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
1:
TPower
percent zzz bush team school com zzz enron law palestinian
2: company zzz al gore game student women firm drug zzz israel
3: million zzz george bush season program book deal court zzz israeli
4: companies campaign player children web financial case zzz yasser arafat
5: market right play show american zzz arthur andersen federal peace
6: stock group games public information chief patient israelis
7: money political point need look executive system israeli
8: business zzz united states run part site analyst decision military
9: government zzz u s coach family zzz new york executives bill zzz palestinian
10: official administration home help question lay member zzz west bank
11: billion leader win job number investor lawyer war
12: president attack won teacher called energy doctor security
13: plan zzz white house night country find investment cost violence
14: high tax left problem found employees care killed
15: fund zzz washington guy parent ago accounting health talk
16:
SpanSPCA
percent team official zzz al gore cup show public night
17: company game zzz bush zzz george bush minutes com member big
18: million season zzz united states campaign add part system set
19: companies player attack election tablespoon look case film
20: market play zzz u s political teaspoon need number find
21: stock games palestinian vote oil book question room
22: business point military republican pepper women job place
23: money run leader voter water family told friend
24: billion right zzz american democratic large called put took
25: plan win war school sugar children zzz washington start
26: government coach zzz israel presidential serving help found car
27: president home country zzz white house butter ago information feel
28: high won administration law chopped zzz new york federal half
29: cost left terrorist zzz republican hour program student guy
30: group hit american tax pan problem court early
31:
SPCABiPart
company show cup team percent zzz al gore official school
32: companies home minutes game million zzz george bush zzz bush student
33: stock run add season money campaign government children
34: market com tablespoon player plan right president women
35: billion high oil play business election zzz united states book
36: zzz enron need teaspoon games tax political zzz u s family
37: firm look pepper coach cost point group called
38: analyst part water guy cut leader attack hour
39: industry night large yard job zzz washington zzz american friend
40: fund zzz new york sugar hit pay administration country found
41: investor help serving played deal question military find
42: sales left butter playing quarter member american set
43: customer put chopped ball chief won war room
44: investment ago fat fan executive win law film
45: economy big food shot financial told public small
Total Cum. Variance
TPower 48.140645
SpanSPCA 48.767864
SPCABiPart 51.873063
Table 6: BagOfWords:NyTimes dataset [37]. We run various SPCA algorithms for k = 8components (topics) and cardinality s = 15 per component. The table lists the words correspondingto each component (words corresponding to higher magnitude entries appear higher in the topic).Our algorithm was configured to use a rank-4 approximation of the input data.
28
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
1:
TPower
percent zzz bush team school com zzz enron drug palestinian
2: company zzz al gore game student women court patient zzz israel
3: million zzz george bush season program book case doctor zzz israeli
4: companies campaign player children web firm cell zzz yasser arafat
5: market zzz united states play show site federal care peace
6: stock zzz u s games public information lawyer disease israelis
7: government political point part zzz new york deal health israeli
8: official attack run family www decision medical zzz palestinian
9: money zzz american home system hour chief test zzz west bank
10: business american coach help find power hospital security
11: president administration win problem mail industry research violence
12: billion leader won law found executive cancer killed
13: plan country left job put according treatment talk
14: group election night called set financial study meeting
15: high zzz washington hit look room office death soldier
16: right military guy member big analyst human minister
17: fund zzz white house yard question told executives heart zzz sharon
18: need war played ago friend zzz arthur andersen blood fire
19: cost tax start teacher director employees trial zzz ariel sharon
20: number nation playing parent place investor benefit zzz arab
21:
SpanSPCA
percent team zzz al gore attack school cup com drug
22: company game zzz bush zzz united states student minutes web patient
23: million season zzz george bush zzz u s children add site cell
24: companies player campaign palestinian program tablespoon information doctor
25: market play election military family oil computer disease
26: stock games political zzz american women teaspoon find care
27: business point tax zzz israel show pepper big health
28: money run republican war help water zzz new york test
29: billion win zzz white house country told large www research
30: government home vote terrorist parent sugar mail human
31: president won law american problem serving set medical
32: plan coach administration zzz taliban book butter put study
33: high left democratic zzz afghanistan job chopped director death
34: group night voter security found hour industry cancer
35: official hit leader zzz israeli friend pan room hospital
36: need guy public nation ago fat small treatment
37: right yard zzz republican member question bowl car scientist
38: part played presidential support teacher gram zzz internet according
39: cost look federal called case food place blood
40: system start zzz washington forces number medium film heart
41:
SPCABiPart
palestinian percent zzz al gore cup school team company official
42: zzz israel million zzz bush minutes right game companies government
43: zzz israeli money zzz george bush add group season market president
44: zzz yasser arafat billion campaign tablespoon show player stock zzz united states
45: peace business election oil home play zzz enron zzz u s
46: war fund political teaspoon high games analyst attack
47: terrorist tax zzz white house pepper program point firm zzz american
48: zzz taliban cost administration water need run industry country
49: zzz afghanistan cut republican hour part coach investor law
50: forces job leader large com win sales plan
51: bin pay vote sugar american won customer public
52: troop economy democratic serving look left price zzz washington
53: laden deal presidential butter help night investment member
54: student big zzz clinton chopped problem hit quarter system
55: zzz pakistan chief support pan called guy executives nation
56: product executive zzz congress fat zzz new york yard consumer case
57: zzz internet financial military bowl number played technology federal
58: profit start policy gram question ball share information
59: earning record court food ago playing prices power
60: shares manager security league told lead growth effort
Total Cum. Variance
TPower 50.7686
SpanSPCA 52.8117
SPCABiPart 54.8906
Table 7: BagOfWords:NyTimes dataset [37]. We run various SPCA algorithms for k = 8components (topics) and cardinality s = 20 per component. The table lists the words correspondingto each component (words corresponding to higher magnitude entries appear higher in the topic).Our algorithm was configured to use a rank-4 approximation of the input data.
29
(a) (b)
Figure 4: Cumulative variance captured by k s-sparse (s = 10) extracted components on theword-by-word matrix – BagOfWords:NyTimes dataset [37]. Fig. 4(a) depicts the cum. variancecaptured by k = 6 components. Deflation leads to a greedy formation of components; first compo-nents capture high variance, but subsequent ones contribute less. On the contrary, our algorithmjointly optimizes the k components and achieves higher total cum. variance. Fig. 4(b) depicts thetotal cum. variance achieved for various values of k. Sparsity is arbitrarily set to s = 10 nonzeroentries per component. Our algorithm operates on a rank-4 approximation.
(a) (b)
Figure 5: Same as Fig. 4, but for sparsity s = 15.
30
(a) (b)
Figure 6: Same as Fig. 4, but for sparsity s = 20.
31