Provable Deterministic Leverage Score Sampling

Provable Deterministic Leverage Score Sampling Dimitris Papailiopoulos (UC Berkeley) Anastasios Kyrillidis (EPFL) Christos Boutsidis (Yahoo Labs) KDD New York, New York August 27th, 2014


Provable Deterministic Leverage Score Sampling

Dimitris Papailiopoulos (UC Berkeley)

Anastasios Kyrillidis (EPFL)

Christos Boutsidis (Yahoo Labs)

KDD

New York, New York

August 27th, 2014


Singular Value Decomposition

m × n matrix A

k < ρ = rank(A)

Low-rank matrix approximation problem:

$$\min_{X \in \mathbb{R}^{m \times n},\ \mathrm{rank}(X) \le k} \|A - X\|_F$$

Singular Value Decomposition (SVD):

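$$A = U \Sigma V^T = \underbrace{\begin{pmatrix} U_k & U_{\rho-k} \end{pmatrix}}_{m \times \rho} \underbrace{\begin{pmatrix} \Sigma_k & 0 \\ 0 & \Sigma_{\rho-k} \end{pmatrix}}_{\rho \times \rho} \underbrace{\begin{pmatrix} V_k^T \\ V_{\rho-k}^T \end{pmatrix}}_{\rho \times n}$$

$U_k \in \mathbb{R}^{m \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, and $V_k \in \mathbb{R}^{n \times k}$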

Solution via Eckart-Young Theorem

$$A_k = U_k \Sigma_k V_k^T = A V_k V_k^T$$

Computed in $O(mn \min\{m, n\})$ time.
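As a minimal sketch of the construction above, the following NumPy snippet builds $A_k$ from a thin SVD. The function name `best_rank_k` and the random test matrix are illustrative assumptions, not part of the talk.

```python
import numpy as np

def best_rank_k(A, k):
    """Return (A_k, V_k), where A_k = U_k Sigma_k V_k^T is built from a thin SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k].T                       # n x k: top-k right singular vectors
    Ak = (U[:, :k] * s[:k]) @ Vt[:k]    # m x n matrix of rank at most k
    return Ak, Vk

# Illustrative data (an assumption; any matrix with rank > k would do).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
Ak, Vk = best_rank_k(A, 2)
```

The identity $A_k = A V_k V_k^T$ from the slide can be checked numerically with `np.allclose(Ak, A @ Vk @ Vk.T)`.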


The Column Subset Selection Problem (CSSP)

Definition

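Let $A \in \mathbb{R}^{m \times n}$ and let $c < n$ be a sampling parameter. Find $c$ columns of $A$, denoted as $C \in \mathbb{R}^{m \times c}$, that minimize

$$\|A - CC^\dagger A\|_F \quad \text{or} \quad \|A - CC^\dagger A\|_2,$$

where $C^\dagger$ denotes the Moore-Penrose pseudo-inverse.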

CSSP gives a low-rank matrix factorization of A (with $X = C^\dagger A$):

$$A = C X + E$$
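To make the objective concrete, here is a small sketch that evaluates the CSSP error for a given set of column indices. The helper name `cssp_error` and its arguments are hypothetical and chosen only for illustration.

```python
import numpy as np

def cssp_error(A, cols, norm="fro"):
    """Return ||A - C C^+ A|| for C = A[:, cols]; norm is "fro" or 2."""
    C = A[:, cols]
    X = np.linalg.pinv(C) @ A     # X = C^+ A, so that A is approximated by C X
    E = A - C @ X                 # residual E in A = C X + E
    return np.linalg.norm(E, ord=norm)
```

If the chosen columns already span the column space of A, the residual is zero up to rounding; otherwise the value measures how much of A lies outside the span of C.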


Motivation

Consider applying this to date-by-stock matrices.

Returns the most important stocks in the portfolio.

Interpretable matrix decompositions in general.


Prior work on CSSP

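| # | c | Bound on $\|A - CC^\dagger A\|_F^2$ | Running time |
|---|---|---|---|
| 1 | $k/\varepsilon^2$ | $\|A - A_k\|_F^2 + \varepsilon \|A\|_F^2$ | $\mathrm{nnz}(A)$ |
| 2 | $(k \log k)/\varepsilon^2$ | $(1 + \varepsilon)\|A - A_k\|_F^2$ | $mn^2$ |
| 3 | $(k \log k)/\varepsilon^2$ | $(1 + \varepsilon)\|A - A_k\|_F^2$ | $mnk^2 \log k$ |
| 4 | $k/\varepsilon$ | $(1 + \varepsilon)\|A - A_k\|_F^2$ | $mnk/\varepsilon$ |
| 5 | $k/\varepsilon$ | $(1 + \varepsilon)\|A - A_k\|_F^2$ | $m^3nk/\varepsilon$ |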

References:

1 Frieze, Kannan, Vempala. FOCS, 2003.

2 Drineas, Mahoney, Muthukrishnan. RANDOM, 2006.

3 Deshpande, Rademacher, Vempala, Wang. SODA, 2006.

4 Boutsidis, Drineas, Magdon-Ismail. FOCS, 2011.

5 Guruswami, Sinop. SODA, 2012.

There are more results in the linear algebra literature focusing on the spectral norm version of the CSSP.

Leverage scores and randomized sampling

Drineas, Mahoney, and Muthukrishnan. RANDOM, 2006.

Definition

[Leverage scores] Let $V_k \in \mathbb{R}^{n \times k}$ contain the top $k$ right singular vectors of an $m \times n$ matrix $A$ with rank $\rho = \mathrm{rank}(A) \ge k$. Then, the (rank-$k$) leverage score of the $i$-th column of $A$ is defined as

$$\ell_i^{(k)} = \|[V_k]_{i,:}\|_2^2, \quad i = 1, 2, \ldots, n.$$

For a target rank $k < \mathrm{rank}(A)$, define a probability distribution over the columns of $A$: $p_i = \ell_i^{(k)} / k$.

In $c$ independent and identically distributed passes, sample with replacement $c$ columns from $A$.

For $c = O(k \log k / \varepsilon^2)$ and with constant probability: $\|A - CC^\dagger A\|_F \le (1 + \varepsilon)\,\|A - A_k\|_F$.
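The sampling scheme above can be sketched as follows. This is an illustration under stated assumptions rather than the cited authors' implementation: the function names are hypothetical, and any column rescaling used in the original analysis is omitted because it does not change the span of C.

```python
import numpy as np

def leverage_scores(A, k):
    """Rank-k leverage scores: squared row norms of V_k (one score per column of A)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:k] ** 2, axis=0)

def sample_columns(A, k, c, rng):
    """Draw c column indices i.i.d. with replacement, with p_i = score_i / k."""
    p = leverage_scores(A, k) / k
    p = p / p.sum()                      # guard against rounding so that p sums to 1
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    return A[:, idx], idx
```

Since the rows of $V_k^T$ are orthonormal, the scores sum to k whenever rank(A) ≥ k, which is why dividing by k yields a probability distribution.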

Deterministic leverage score sampling [Jolliffe, 1972]

Compute the leverage scores of A w.r.t. some k .

Pick the c columns with the largest leverage scores.

Nice empirical results.

No theoretical analysis.

Contribution of this talk: theoretical analysis of deterministic leverage score sampling.

Deterministic leverage score sampling [revisited]

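Input: $A \in \mathbb{R}^{m \times n}$, $k$, $\theta$ ($0 < \theta < k$)

- Compute $V_k \in \mathbb{R}^{n \times k}$ (via SVD).

- Compute the leverage scores:
    for $i = 1, 2, \ldots, n$
        $\ell_i^{(k)} = \|[V_k]_{i,:}\|_2^2$
    end for

- Without loss of generality, let the $\ell_i^{(k)}$'s be sorted:

$$\ell_1^{(k)} \ge \cdots \ge \ell_i^{(k)} \ge \ell_{i+1}^{(k)} \ge \cdots \ge \ell_n^{(k)}.$$

- Find the index $c \in \{1, \ldots, n\}$ such that

$$c = \arg\min_c \left( \sum_{i=1}^{c} \ell_i^{(k)} > \theta \right),$$

  that is, the smallest $c$ for which the $c$ largest leverage scores sum to more than $\theta$.

- If $c < k$, set $c = k$.

Output: $C \in \mathbb{R}^{m \times c}$ containing the first $c$ columns of $A$ (in the sorted order).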
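The steps of the algorithm can be sketched in NumPy as below. The function name and return convention are assumptions made for illustration; instead of physically reordering A, the sketch keeps an index array `order` that plays the role of the "without loss of generality" sorting.

```python
import numpy as np

def deterministic_leverage_sampling(A, k, theta):
    """Keep the columns with the largest rank-k leverage scores until their sum exceeds theta."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(Vt[:k] ** 2, axis=0)          # leverage scores of the columns
    order = np.argsort(-scores, kind="stable")    # column indices, largest score first
    cumulative = np.cumsum(scores[order])
    # Smallest c whose top-c scores sum to strictly more than theta.
    c = int(np.searchsorted(cumulative, theta, side="right")) + 1
    c = min(max(c, k), A.shape[1])                # enforce c >= k and c <= n
    cols = order[:c]
    return A[:, cols], cols
```

For example, `C, cols = deterministic_leverage_sampling(A, k=3, theta=3 - 0.5)` corresponds to the choice θ = k − ε with ε = 0.5.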


Main result

Theorem

Let $\theta = k - \varepsilon$ for some $\varepsilon \in (0, 1)$. Then, for $\xi \in \{2, F\}$, we have

$$\|A - CC^\dagger A\|_\xi^2 < (1 + \varepsilon) \cdot \|A - A_k\|_\xi^2.$$

Weak result if the leverage scores are almost uniform.
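A short worked example shows why uniform scores make the result weak. It assumes perfectly uniform leverage scores, and the numbers n = 1000, k = 10, ε = 0.5 are illustrative choices, not values from the talk.

```latex
% Uniform scores: \ell_i^{(k)} = k/n for every i, so the stopping rule gives
\sum_{i=1}^{c} \ell_i^{(k)} = \frac{ck}{n} > k - \varepsilon
\quad\Longleftrightarrow\quad
c > n\left(1 - \frac{\varepsilon}{k}\right).
% With n = 1000, k = 10, \varepsilon = 0.5:  c > 1000 \cdot (1 - 0.05) = 950.
```

In this setting the algorithm would keep almost every column of A, so the guarantee says little.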


Main result: leverage scores following a power law

Theorem

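Let the leverage scores follow a power-law decay with exponent $\alpha_k = 1 + \eta$, for $\eta > 0$:

$$\ell_i^{(k)} = \frac{\ell_1^{(k)}}{i^{\alpha_k}}.$$

Let $\theta = k - \varepsilon$. Then,

$$c = \left(\frac{2k}{\varepsilon}\right)^{\frac{1}{1+\eta}}$$

and

$$\|A - CC^\dagger A\|_\xi^2 < (1 + \varepsilon) \cdot \|A - A_k\|_\xi^2.$$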
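To get a feel for the column count in this theorem, the expression for c can be evaluated directly. The helper below is a hypothetical illustration; rounding up and the floor at k (from the algorithm's rule "if c < k, set c = k") are assumptions about how the formula would be used in practice.

```python
import math

def columns_for_power_law(k, eps, eta):
    """Evaluate c = (2k / eps)^(1 / (1 + eta)), rounded up and never below k."""
    c = (2.0 * k / eps) ** (1.0 / (1.0 + eta))
    return max(k, math.ceil(c))
```

Larger η (faster decay) shrinks the exponent 1/(1 + η), so fewer columns are needed for the same k and ε.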


Is power law a realistic assumption?

Test leverage scores of large graphs.

Show leverage scores follow power law decays.


Power law is a realistic assumption

[Figure: 16 panels, one per dataset, each showing the 1000 largest leverage scores (x-axis 1 to 1000, logarithmic y-axis) together with the fitted exponent α_10: amazon 1.45, citeseer 1.5, foursquare 1.7, github 1.13, gnutella 2, google 1.6, gowalla 0.9, livejournal 0.2, slashdot 0.9, nips 1.6, skitter 0.2, slice 0.12, cora 1.58, writers 4, youtube groups 1.75, youtube 0.5.]

k = 10.

The decay of the leverage scores is shown on a logarithmic scale.

Each panel also plots a fitted power-law curve $\beta \cdot x^{-\alpha_k}$.

True leverage scores are plotted with a red × marker.

The fitted curves are drawn as solid blue lines.
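The slides do not say how the curves β · x^(−α_k) were fitted. One common option, assumed here purely for illustration, is ordinary least squares in log-log space, where a power law becomes a straight line. The function name `fit_power_law` is hypothetical.

```python
import numpy as np

def fit_power_law(scores):
    """Fit scores ~ beta * x**(-alpha) over ranks x = 1..n by least squares in log-log space."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]   # largest score first
    s = s[s > 0]                                         # log is undefined at zero
    x = np.arange(1, s.size + 1, dtype=float)
    slope, intercept = np.polyfit(np.log(x), np.log(s), 1)
    return np.exp(intercept), -slope                     # (beta, alpha)
```

A log-log fit weights all ranks equally on the logarithmic scale, so it can differ noticeably from a fit on the raw values when the tail is noisy.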


Power-law decaying leverage scores

[Figure: two rows of four panels (k = 5, 10, 50, 100), one row for α_k = 0.5 and one for α_k = 1.5. Each panel plots the ratio ‖A − CC†A‖₂² / ‖A − A_k‖₂² against the number of columns c. Annotated values: c = 10, 38, 97, 152 in the first row; c = 7, 11, 88, 129 in the second row.]

m = 200, n = 1000.

k = 5, 10, 50, 100.

c = 1, 2, ..., 1000.

α_k = 0.5 and α_k = 1.5.

The blue curve is the relative error ratio $\|A - CC^\dagger A\|_2^2 / \|A - A_k\|_2^2$.

The vertical cyan line corresponds to the point where k = c.

The vertical magenta line indicates the point where the c sampled columns offer a better approximation compared to the best rank-k matrix $A_k$.
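A curve of this kind could be computed along the following lines. This is a sketch under assumptions: the slides do not describe how the test matrices were generated, and the function name `error_ratio_curve` is hypothetical. Columns are taken in decreasing order of leverage score, as in the deterministic algorithm.

```python
import numpy as np

def error_ratio_curve(A, k, c_values):
    """Return ||A - C C^+ A||_2^2 / ||A - A_k||_2^2 for each c, using the top-c leverage columns."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(Vt[:k] ** 2, axis=0)
    order = np.argsort(-scores, kind="stable")
    best = s[k] ** 2                  # ||A - A_k||_2^2 = sigma_{k+1}^2 (requires k < rank(A))
    ratios = []
    for c in c_values:
        C = A[:, order[:c]]
        residual = A - C @ (np.linalg.pinv(C) @ A)
        ratios.append(np.linalg.norm(residual, 2) ** 2 / best)
    return np.array(ratios)
```

The ratio is at least 1 while c ≤ k, because C C†A then has rank at most k, and it can only decrease as more columns are added since the column sets are nested.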


Nearly-uniform leverage scores

[Figure: four panels (k = 5, 10, 50, 100) plotting the ratio ‖A − CC†A‖₂² / ‖A − A_k‖₂² against the number of columns c, from 1 to 1000. Annotated values: c = 473, 404, 629, 630.]

m = 200, n = 1000.

k = 5, 10, 50, 100.

c = 1, 2, ..., 1000.

The blue curve is the relative error ratio $\|A - CC^\dagger A\|_2^2 / \|A - A_k\|_2^2$.

The leftmost vertical cyan line corresponds to the point where k = c.

The rightmost vertical magenta line indicates the point where the c sampled columns offer as good an approximation as that of the best rank-k matrix $A_k$.


Conclusions

The Column Subset Selection Problem

approach: sampling w.r.t. the leverage scores.

Randomized leverage score sampling

theory: strong results [Drineas et al., 2008].

practice: strong performance.

Deterministic leverage score sampling

theory: good performance if the leverage scores follow a power-law decay.

practice: many real datasets exhibit leverage scores with power-law decay.