Dimensionality reduction via sparse matrices
Jelani Nelson, Harvard
June 17, 2014
based on joint works with Jean Bourgain (IAS), Daniel Kane (Stanford), Huy Nguyễn (Princeton)
Metric Johnson-Lindenstrauss lemma
Metric JL (MJL) Lemma, 1984
Every set of N points in Euclidean space can be embedded into O(ε^{-2} log N)-dimensional Euclidean space so that all pairwise distances are preserved up to a 1 ± ε factor.
Uses:
• Speed up geometric algorithms by first reducing dimension of input [Indyk, Motwani '98], [Indyk '01]
• Faster/streaming numerical linear algebra algorithms [Sarlos '06], [LWMRT '07], [Clarkson, Woodruff '09]
• Essentially equivalent to RIP matrices from compressed sensing [Baraniuk et al. '08], [Krahmer, Ward '11] (used for recovery of sparse signals)
• Volume-preserving embeddings (applications to projective clustering) [Magen '02]
• . . .
How to prove the JL lemma
Distributional JL (DJL) lemma
Lemma. For any 0 < ε, δ < 1/2 there exists a distribution D_{ε,δ} on R^{m×n} for m = O(ε^{-2} log(1/δ)) so that for any u of unit ℓ_2 norm
P_{Π∼D_{ε,δ}}( |‖Πu‖_2^2 − 1| > ε ) < δ.
Proof of MJL: Set δ = 1/N^2 in DJL and u as the difference vector of some pair of points. Union bound over the (N choose 2) pairs.
Theorem (Alon, 2003)
For MJL, m = Ω((ε^{-2}/log(1/ε)) log N) is required.
Theorem (Jayram-Woodruff, 2011; Kane-Meka-N., 2011)
For DJL, m = Θ(ε^{-2} log(1/δ)) is optimal.
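As a sanity check on the DJL statement (not part of the talk), here is a minimal numerical sketch using a dense Gaussian Π, one of the classical distributions listed on the next slide; the constant 8 in the choice of m is an illustrative assumption, not the lemma's constant.

```python
# Empirical check of distributional JL with Pi having i.i.d. N(0, 1/m) entries.
import numpy as np

rng = np.random.default_rng(0)
n, eps, delta = 2_000, 0.25, 0.01
m = int(np.ceil(8 * eps**-2 * np.log(1 / delta)))  # m = O(eps^-2 log(1/delta))

u = rng.standard_normal(n)
u /= np.linalg.norm(u)                             # fixed vector of unit l2 norm

trials = 100
errs = np.empty(trials)
for t in range(trials):
    Pi = rng.standard_normal((m, n)) / np.sqrt(m)  # fresh draw Pi ~ D_{eps,delta}
    errs[t] = abs(np.linalg.norm(Pi @ u) ** 2 - 1)

print(f"m = {m}; empirical failure rate {np.mean(errs > eps):.2f} (target < delta = {delta})")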
Proving the distributional JL lemma
Older proofs
• [Johnson-Lindenstrauss, 1984], [Frankl-Maehara, 1988], [Dasgupta-Gupta, 2003]: Random rotation, then projection onto the first m coordinates.
• [Indyk-Motwani, 1998]: Random matrix with independent Gaussian entries.
• [Achlioptas, 2001]: Independent ±1 entries.
• [Clarkson-Woodruff, 2009]: O(log(1/δ))-wise independent ±1 entries.
• [Arriaga-Vempala, 1999], [Matousek, 2008]: Independent entries having mean 0, variance 1/m, and subgaussian tails.
Downside: Performing the embedding is a dense matrix-vector multiplication, O(m · ‖u‖_0) time.
Fast JL Transforms
• [Ailon-Chazelle, 2006]: x ↦ PHDx, O(n log n + m^3) time
P random+sparse, H Fourier, D has random ±1 on the diagonal
• Also follow-up works based on a similar approach which improve the time while, for some, slightly increasing the target dimension [Ailon, Liberty '08], [Ailon, Liberty '11], [Krahmer, Ward '11], [N., Price, Wootters '14], . . .
Downside: Slow to embed sparse vectors: running time is Ω(min{m · ‖u‖_0, n log n}).
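To make the pipeline concrete, here is a minimal sketch of x ↦ PHDx. Assumptions not in the talk: n a power of two, H the Walsh-Hadamard matrix, and P simplified to plain uniform row sampling (the actual [Ailon-Chazelle '06] P is a sparse Gaussian matrix, and a real implementation applies H in O(n log n) time rather than forming it densely).

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, m = 1024, 64                               # n must be a power of two for hadamard()

D = rng.choice([-1.0, 1.0], size=n)           # random signs on the diagonal of D
H = hadamard(n) / np.sqrt(n)                  # orthonormal Hadamard transform ("Fourier")
rows = rng.integers(0, n, size=m)             # simplified P: sample m coordinates

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
y = np.sqrt(n / m) * (H @ (D * x))[rows]      # P H D x, scaled so E ||y||^2 = ||x||^2
print(abs(np.linalg.norm(y) ** 2 - 1))        # small with high probability
```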
Where Do Sparse Vectors Show Up?
• Document as bag of words: u_i = number of occurrences of word i. Compare documents using cosine similarity.
n = lexicon size; most documents aren't dictionaries
• Network traffic: u_{i,j} = #bytes sent from i to j
n = 2^64 (2^256 in IPv6); most servers don't talk to each other
• User ratings: u_{i,j} is user i's score for movie j on Netflix
n = #movies; most people haven't rated all movies
• Streaming: u receives a stream of updates of the form "add v to u_i". Maintaining Πu requires calculating v · Πe_i (see the sketch below).
• . . .
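A minimal sketch of the streaming bullet: if Π is sparse with its columns regenerated on demand, the update "add v to u_i" touches only s entries of the sketch. The per-column seeded generator below is a hypothetical stand-in for the hash functions a real streaming implementation would use.

```python
import numpy as np

m, s = 256, 8

def column(i):
    """Nonzero rows and signs of column i of Pi, regenerated on demand."""
    r = np.random.default_rng(i)                     # stand-in for hashing
    rows = r.choice(m, size=s, replace=False)        # exactly s nonzero rows
    signs = r.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return rows, signs

sketch = np.zeros(m)                                 # maintains Pi @ u
for i, v in [(3, 1.0), (17, -2.5), (3, 0.5)]:        # stream of updates "add v to u_i"
    rows, signs = column(i)
    sketch[rows] += v * signs                        # O(s) work per update, not O(m)
```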
Sparse JL transforms
One way to embed sparse vectors faster: use sparse matrices.
s = #non-zero entries per column in Π (so embedding time is s · ‖u‖_0)

reference | value of s | type
[JL84], [FM88], [IM98], . . . | m ≈ 4ε^{-2} ln(1/δ) | dense
[Achlioptas01] | m/3 | sparse Bernoulli
[WDALS09] | no proof | hashing
[DKS10] | O(ε^{-1} log^3(1/δ)) | hashing
[KN10a], [BOR10] | O(ε^{-1} log^2(1/δ)) | hashing
[KN12] | O(ε^{-1} log(1/δ)) | modified hashing

[N., Nguyễn '13]: for any m ≤ poly(1/ε) · log N, s = Ω(ε^{-1} log N/log(1/ε)) is required, even for metric JL, so [KN12] is optimal up to O(log(1/ε)).
*[Thorup, Zhang '04] gives m = O(ε^{-2}δ^{-1}), s = 1.
Sparse JL Constructions
[DKS, 2010]: s = Θ(ε^{-1} log^2(1/δ))
[KN12]: s = Θ(ε^{-1} log(1/δ)), with each column split into s blocks of height m/s
Sparse JL Constructions (in matrix form)
(Figure: two m × n matrices. Left: each column has s nonzero entries placed in random rows. Right: each column is divided into s blocks of height m/s, with one nonzero entry per block.)
Each black cell is ±1/√s at random
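A minimal construction sketch of both pictures (dense storage for clarity; assumes s divides m in the blocked variant):

```python
import numpy as np

def sparse_jl(n, m, s, blocked, seed=0):
    """m x n sign matrix with s nonzeros (+-1/sqrt(s)) per column."""
    rng = np.random.default_rng(seed)
    Pi = np.zeros((m, n))
    for j in range(n):
        if blocked:   # [KN12]: one nonzero in each of s blocks of height m/s
            rows = np.arange(s) * (m // s) + rng.integers(0, m // s, size=s)
        else:         # s uniformly random distinct rows
            rows = rng.choice(m, size=s, replace=False)
        Pi[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return Pi

Pi = sparse_jl(n=500, m=64, s=8, blocked=True)
u = np.random.default_rng(1).standard_normal(500)
print(np.linalg.norm(Pi @ u) / np.linalg.norm(u))   # close to 1
```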
One open problem . . .
• [Achlioptas '01]:
Π_{i,j} = +1/√(m/3) w.p. 1/6;  −1/√(m/3) w.p. 1/6;  0 w.p. 2/3
• In expectation there are s = m/3 non-zeroes per column.
• [Kane-N. '12]: Eliminate the variance in the number of non-zeroes per column. Force it to be exactly s.
Surprisingly, this helps!
(makes asymptotically smaller s = O(εm) possible)
• Conjecture: Error with exactly s non-zeroes per column is stochastically dominated by error with expected s per column.
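A small illustrative experiment in the spirit of the conjecture, comparing the two distributions empirically (a sanity check only, not evidence for the stochastic-dominance statement):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, trials = 400, 90, 300
s = m // 3
u = rng.standard_normal(n); u /= np.linalg.norm(u)

def err_achlioptas():
    # each entry +-1/sqrt(m/3) w.p. 1/6 each, 0 w.p. 2/3 => expected s = m/3 per column
    signs = rng.choice([-1.0, 0.0, 1.0], size=(m, n), p=[1/6, 2/3, 1/6])
    return abs(np.linalg.norm((signs / np.sqrt(m / 3)) @ u) ** 2 - 1)

def err_exact():
    # exactly s nonzeros per column, entries +-1/sqrt(s)
    Pi = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)
        Pi[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return abs(np.linalg.norm(Pi @ u) ** 2 - 1)

a = [err_achlioptas() for _ in range(trials)]
e = [err_exact() for _ in range(trials)]
print(f"mean error, expected s: {np.mean(a):.4f}; exactly s: {np.mean(e):.4f}")
```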
Analysis
• In both constructions, can write Π_{i,j} = δ_{i,j}σ_{i,j}/√s
‖Πu‖_2^2 − 1 = (1/s) Σ_{r=1}^m Σ_{i≠j} δ_{r,i}δ_{r,j}σ_{r,i}σ_{r,j} u_i u_j = σ^T B σ
B = (1/s) · diag(B_1, B_2, . . . , B_m)   (block diagonal)
• (B_r)_{i,j} = δ_{r,i}δ_{r,j} u_i u_j
• P(|‖Πu‖_2^2 − 1| > ε) = P(|σ^T Bσ| > ε) < ε^{-ℓ} · E|σ^T Bσ|^ℓ. Use a moment bound for quadratic forms, which depends on ‖B‖, ‖B‖_F (Hanson-Wright inequality).
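For reference, one standard form of the Hanson-Wright inequality (formulations and constants vary across references; in the application above B has zero diagonal, so E σ^T Bσ = 0):

```latex
% Hanson--Wright inequality, one standard form: for independent Rademacher
% variables sigma_1, ..., sigma_n and any B in R^{n x n},
\[
  \Pr\Bigl(\bigl|\sigma^T B \sigma - \mathbb{E}\,\sigma^T B \sigma\bigr| > t\Bigr)
  \;\le\; 2\exp\Bigl(-c\,\min\Bigl\{\tfrac{t^2}{\|B\|_F^2},\;\tfrac{t}{\|B\|}\Bigr\}\Bigr),
  \qquad t > 0,
\]
% for a universal constant c > 0; equivalently, the moments satisfy
% (E|sigma^T B sigma - E sigma^T B sigma|^l)^{1/l} <= C (sqrt(l) ||B||_F + l ||B||).
```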
What next?
Natural “matrix extension” of sparse JL
[Kane, N. '12]
Theorem. Let u ∈ R^n be arbitrary with unit ℓ_2 norm, Π a sparse sign matrix. Then
P_Π( |‖Πu‖_2^2 − 1| > ε ) < δ
as long as
m ≳ log(1/δ)/ε^2, s ≳ log(1/δ)/ε, ℓ = log(1/δ)
or
m ≳ 1/(ε^2 δ), s = 1, ℓ = 2   ([Thorup, Zhang '04])
Natural “matrix extension” of sparse JL
[Kane, N. '12]
Theorem. Let U ∈ R^{n×1} be arbitrary with orthonormal columns, Π a sparse sign matrix. Then
P_Π( ‖(ΠU)^T(ΠU) − I_1‖ > ε ) < δ
as long as
m ≳ (1 + log(1/δ))/ε^2, s ≳ log(1/δ)/ε, ℓ = log(1/δ)
or
m ≳ 1^2/(ε^2 δ), s = 1, ℓ = 2
Natural “matrix extension” of sparse JL
Conjecture. Let U ∈ R^{n×d} be arbitrary with orthonormal columns, Π a sparse sign matrix. Then
P_Π( ‖(ΠU)^T(ΠU) − I_d‖ > ε ) < δ
as long as
m ≳ (d + log(1/δ))/ε^2, s ≳ log(d/δ)/ε, ℓ = log(d/δ)
or
m ≳ d^2/(ε^2 δ), s = 1, ℓ = 2
Natural “matrix extension” of sparse JL
What we prove [N., Nguyễn '13]
Theorem. Let U ∈ R^{n×d} be arbitrary with orthonormal columns, Π a sparse sign matrix. Then
P_Π( ‖(ΠU)^T(ΠU) − I_d‖ > ε ) < δ
as long as
m ≳ d · log^c(d/δ)/ε^2, s ≳ log^c(d/δ)/ε    or    m ≳ d^{1.01}/ε^2, s ≳ 1/ε
or
m ≳ d^2/(ε^2 δ), s = 1
Remarks
• [Clarkson, Woodruff '13] was first to show an m = d^2 · polylog(d/ε)/ε^2, s = 1 bound, via other methods
• m = O(d^2/ε^2), s = 1 also obtained by [Mahoney, Meng '13].
• m = O(d^2/ε^2), s = 1 also follows from [Thorup, Zhang '04] + [Kane, N. '12] (observed by Nguyễn)
• What does the "moment method" mean for matrices? For even ℓ,
P_Π( ‖(ΠU)^T(ΠU) − I_d‖ > ε ) < ε^{-ℓ} · E‖(ΠU)^T(ΠU) − I_d‖^ℓ ≤ ε^{-ℓ} · E tr(((ΠU)^T(ΠU) − I_d)^ℓ)
• Classical "moment method" in random matrix theory; e.g. [Wigner, 1955], [Furedi, Komlos, 1981], [Bai, Yin, 1993]
Who cares about this matrix extension?
Motivation for matrix extension of sparse JL
• ‖(ΠU)^T(ΠU) − I‖ ≤ ε is equivalent to ‖Πx‖ = (1 ± ε)‖x‖ for all x ∈ V, where V is the subspace spanned by the columns of U (up to changing ε by a factor of 2): a "subspace embedding".
• Subspace embeddings can be used to speed up algorithms for many numerical linear algebra problems on big matrices [Sarlos, 2006], [Dasgupta, Drineas, Harb, Kumar, Mahoney, 2008], [Clarkson, Woodruff, 2009], [Drineas, Magdon-Ismail, Mahoney, Woodruff, 2012], [Clarkson, Woodruff, 2013], [Clarkson, Drineas, Magdon-Ismail, Mahoney, Meng, Woodruff, 2013], [Woodruff, Zhang, 2013], . . .
• Sparse Π: can multiply ΠA in s · nnz(A) time for a big matrix A (see the sketch below).
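A minimal sketch of the last bullet with scipy.sparse (illustrative sizes): each nonzero A[i, j] meets only the s nonzeros of column i of Π, giving the s · nnz(A) cost.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, d, m, s = 50_000, 50, 2_500, 4

# Sparse sign matrix Pi in CSC form: s nonzeros (+-1/sqrt(s)) per column.
rows = np.concatenate([rng.choice(m, size=s, replace=False) for _ in range(n)])
cols = np.repeat(np.arange(n), s)
vals = rng.choice([-1.0, 1.0], size=n * s) / np.sqrt(s)
Pi = sp.csc_matrix((vals, (rows, cols)), shape=(m, n))

A = sp.random(n, d, density=1e-3, random_state=0, format="csc")  # large sparse A
PiA = Pi @ A          # cost ~ s * nnz(A), independent of m
print(PiA.shape, A.nnz)
```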
Numerical linear algebra
• A ∈ R^{n×d}, n ≫ d, rank(A) = r
Classical numerical linear algebra problems
• Compute the leverage scores of A, i.e. the ℓ_2 norms of the n standard basis vectors when projected onto the subspace spanned by the columns of A.
• Least squares regression: Given also b ∈ R^n. Compute x* = argmin_{x∈R^d} ‖Ax − b‖_2
• ℓ_p regression (p ∈ [1,∞)): Compute x* = argmin_{x∈R^d} ‖Ax − b‖_p
• Low-rank approximation: Given also an integer 1 ≤ k ≤ d. Compute A_k = argmin_{rank(B)≤k} ‖A − B‖_F
• Preconditioning: Compute R ∈ R^{d×d} (for d = r) so that ∀x, ‖ARx‖_2 ≈ ‖x‖_2
Computationally efficient solutions
Singular Value Decomposition
Theorem. Every matrix A ∈ R^{n×d} of rank r can be written as A = UΣV^T, where U ∈ R^{n×r} has orthonormal columns, Σ ∈ R^{r×r} is diagonal positive definite, and V ∈ R^{d×r} has orthonormal columns.
Can compute the SVD in O(nd^{ω−1}) [Demmel, Dumitriu, Holtz, 2007]. ω < 2.373 . . . is the exponent of square matrix multiplication [Coppersmith, Winograd, 1987], [Stothers, 2010], [Vassilevska-Williams, 2012].
Computationally efficient solutions
A = UΣV^T   (U ∈ R^{n×r}, Σ ∈ R^{r×r}, V ∈ R^{d×r} as above)
• Leverage scores: Output row norms of U.
• Least squares regression: Output VΣ^{-1}U^T b.
• Low-rank approximation: Output UΣ_k V^T.
• Preconditioning: Output R = VΣ^{-1}.
Conclusion: In time O(nd^{ω−1}) we can compute the SVD, then solve all the previously stated problems. Is there a faster way?
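A minimal numpy sketch of reading these outputs off a thin SVD (assumes A has full column rank, so r = d; Σ_k keeps the top k singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 20, 5
A = rng.standard_normal((n, d))                      # full column rank w.p. 1, so r = d
b = rng.standard_normal(n)

U, sig, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(sig) @ Vt

leverage = np.sum(U**2, axis=1)                      # squared row norms of U
x_star = Vt.T @ ((U.T @ b) / sig)                    # V Sigma^{-1} U^T b
A_k = (U[:, :k] * sig[:k]) @ Vt[:k, :]               # U Sigma_k V^T
R = Vt.T / sig                                       # preconditioner V Sigma^{-1}

assert np.allclose(x_star, np.linalg.lstsq(A, b, rcond=None)[0])
assert np.allclose((A @ R).T @ (A @ R), np.eye(d))   # A R has orthonormal columns
```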
How to use subspace embeddings
Least squares regression: Let Π be a subspace embedding for the subspace spanned by b and the columns of A. Let x* = argmin ‖Ax − b‖ and x̃ = argmin ‖ΠAx − Πb‖. Then
(1 − ε)‖Ax̃ − b‖ ≤ ‖ΠAx̃ − Πb‖ ≤ ‖ΠAx* − Πb‖ ≤ (1 + ε)‖Ax* − b‖
⇒ ‖Ax̃ − b‖ ≤ ((1 + ε)/(1 − ε)) · ‖Ax* − b‖
Computing the SVD of ΠA takes time O(md^{ω−1}), which is much faster than O(nd^{ω−1}) since m ≪ n.
Note: [Sarlos '06] actually describes a more efficient reduction, where one only needs that (1) Π is a (1 − 1/√2)-subspace embedding, and (2) ‖(ΠU)^T(Π(Ax* − b))‖_2 < √ε · ‖Ax* − b‖. Item (2) only requires m ≳ d/ε, s ≥ 1 [Thorup, Zhang '04] + [Kane, N. '12].
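A minimal sketch-and-solve illustration of the displayed guarantee, with illustrative (not theorem-driven) parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, s = 5_000, 10, 500, 8

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)   # noisy over-determined system

Pi = np.zeros((m, n))                                     # sparse sign matrix (dense storage)
for j in range(n):
    rows_j = rng.choice(m, size=s, replace=False)
    Pi[rows_j, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]             # exact solution
x_tilde = np.linalg.lstsq(Pi @ A, Pi @ b, rcond=None)[0]  # solve the sketched problem

ratio = np.linalg.norm(A @ x_tilde - b) / np.linalg.norm(A @ x_star - b)
print(f"||A x~ - b|| / ||A x* - b|| = {ratio:.4f}")       # ~ (1+eps)/(1-eps)
```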
Back to the analysis
P_Π( ‖(ΠU)^T(ΠU) − I_d‖ > ε ) < ε^{-ℓ} · E tr(((ΠU)^T(ΠU) − I_d)^ℓ)
Analysis (ℓ = 2): s = 1, m = O(d^2/ε^2)
Want to understand S − I, where S = (ΠU)^T(ΠU)
Let the columns of U be u^1, . . . , u^d
Recall Π_{i,j} = δ_{i,j}σ_{i,j}/√s
Some computations yield
(S − I)_{k,k'} = (1/s) Σ_{r=1}^m Σ_{i≠j} δ_{r,i}δ_{r,j}σ_{r,i}σ_{r,j} u^k_i u^{k'}_j
Computing E tr((S − I)^2) = E‖S − I‖_F^2 is straightforward, and one can show E‖S − I‖_F^2 ≤ (d^2 + d)/m
P(‖S − I‖ > ε) < (1/ε^2) · (d^2 + d)/m
Set m ≥ δ^{-1}(d^2 + d)/ε^2 for success probability 1 − δ
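A quick numerical check of the E‖S − I‖_F^2 ≤ (d^2 + d)/m bound for the s = 1 construction (one ±1 per column in a uniformly random row, as in [Thorup, Zhang '04]):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, trials = 2_000, 10, 5_000, 50

U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal columns

total = 0.0
for _ in range(trials):
    rows = rng.integers(0, m, size=n)              # s = 1: one nonzero row per column
    signs = rng.choice([-1.0, 1.0], size=n)
    PiU = np.zeros((m, d))
    np.add.at(PiU, rows, signs[:, None] * U)       # Pi @ U without materializing Pi
    S = PiU.T @ PiU
    total += np.linalg.norm(S - np.eye(d), "fro") ** 2

print(f"empirical E||S - I||_F^2 = {total / trials:.4f}; bound (d^2+d)/m = {(d*d + d) / m:.4f}")
```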
Analysis (large ℓ): s = O_γ(1/ε), m = O(d^{1+γ}/ε^2)
(S − I)_{k,k'} = (1/s) Σ_{r=1}^m Σ_{i≠j} δ_{r,i}δ_{r,j}σ_{r,i}σ_{r,j} u^k_i u^{k'}_j
By induction, for any square matrix B and integer ℓ ≥ 1,
(B^ℓ)_{i,j} = Σ_{i_1,...,i_{ℓ+1} : i_1=i, i_{ℓ+1}=j} Π_{t=1}^ℓ B_{i_t,i_{t+1}}
⇒ tr(B^ℓ) = Σ_{i_1,...,i_{ℓ+1} : i_1=i_{ℓ+1}} Π_{t=1}^ℓ B_{i_t,i_{t+1}}
Analysis (large ℓ): s = O_γ(1/ε), m = O(d^{1+γ}/ε^2)
E tr((S − I)^ℓ) = Σ_{i_1≠j_1,...,i_ℓ≠j_ℓ; r_1,...,r_ℓ; k_1,...,k_{ℓ+1} with k_1=k_{ℓ+1}} ( E Π_{t=1}^ℓ δ_{r_t,i_t}δ_{r_t,j_t} ) ( E Π_{t=1}^ℓ σ_{r_t,i_t}σ_{r_t,j_t} ) Π_{t=1}^ℓ u^{k_t}_{i_t} u^{k_{t+1}}_{j_t}
The strategy: Associate each monomial in the summation above with a graph, group monomials that have the same graph, estimate the contribution of each graph, then do some combinatorics
(a common strategy; see [Wigner, 1955], [Furedi, Komlos, 1981], [Bai, Yin, 1993])
Finite pointsets, then subspaces, now what?
(Linear) dimensionality reduction
• Given T ⊂ R^n, want a Π ∈ R^{m×n} such that
∀u ∈ T, (1 − ε)‖u‖_2 ≤ ‖Πu‖_2 ≤ (1 + ε)‖u‖_2
with m as small as possible.
• The above is equivalent to, assuming T ⊂ S^{n−1},
sup_{u∈T} |‖Πu‖_2^2 − 1| < ε
• We consider Π random:
E_Π sup_{u∈T} |‖Πu‖_2^2 − 1| < ε
Applications
• Finite T: the Johnson-Lindenstrauss (JL) lemma has T = {(u_i − u_j)/‖u_i − u_j‖_2}_{1≤i<j≤N} [JL84].
Applications to clustering, nearest neighbor search, . . .
• Linear subspace: T = E ∩ S^{n−1}, E ⊂ R^n, dim(E) = d.
Subspace embeddings pioneered by [Sarlos '06]; faster least squares regression and low-rank approximation.
Also applications to approximating leverage scores [DMMW'12], k-means clustering [BZMD'11], canonical correlation analysis [ABTZ'13], support vector machines [PBMD'13], ℓ_p regression [CDM+13, WZ13], ridge regression [LDFU'13].
Applications
• Union of subspaces: streaming approximation of eigenvalues [Andoni, Nguyễn '13], compressed sensing [Donoho '04, Candes-Tao '06]
e.g. compressed sensing: T = {u ∈ R^n : ‖u‖_2 = 1, ‖u‖_0 ≤ k}, where ‖u‖_0 is support size. Can also be sparse over some other basis.
• This is a union of (n choose k) subspaces
• A Π for this T is known as having the restricted isometry property (RIP). Given Πu for (nearly) sparse u, can (approximately) recover u in polynomial time [Candes, Tao '05]
Supporting fewer than (n choose k) sparsity patterns is also of interest ("model-based compressed sensing" [BCDH'10])
Applications
• Smooth manifolds: M = F(B_{ℓ_2^d}), for smooth F : B_{ℓ_2^d} → R^n
Manifold learning: classifying images of handwritten digits [Hinton-Dayan-Revow '97], or human faces [Broomhead-Kirby '01]
• Idea: All valid drawings of the digit "2" lie on a low-dimensional manifold in R^{√n×√n} (pixelated image). From samples, learn the manifold.
• [Baraniuk-Wakin '09] suggest reducing dimensionality using Π and doing the learning in the lower-dimensional space
• To preserve curve lengths on the manifold, can set T to be the tangent space of M (infinite union T = (∪_{u∈M} E_u) ∩ S^{n−1})
Euclidean dimensionality reduction: Optimality?
Can we do better than the JL lemma?
• Unfortunately, not by much if any.
[Alon '03]: Let T = {e_1, . . . , e_n} ∪ {(e_i − e_j)/√2}_{i≠j}
• Any embedding of T with distortion 1 + ε requires m ≳ (log n)/(ε^2 log(1/ε)) (off from the JL lemma by just log(1/ε))
Euclidean dimensionality reduction: “Beyond Worst-Case Analysis”
• Suppose Π_{i,j} ∼ N(0, 1/m) (i.i.d. gaussian entries)
• Gordon's theorem (1988): suffices to set m ∼ (g^2(T) + 1)/ε^2, where g(T) = E_g sup_{u∈T} ⟨g, u⟩ (g a gaussian vector)
• g(T) ≲ √(log|T|) always (union bound), but can be much smaller if the vectors in T are well-clusterable
• [Klartag, Mendelson '05]: Same result if Π_{i,j} ∼ ±1/√m
• Same problem as before: Π is dense.
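For a finite T, g(T) is easy to estimate by Monte Carlo. The illustrative sketch below shows the √(log|T|)-type union bound being nearly tight for random directions and far from tight for a well-clustered set:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, trials = 200, 1_000, 300

T_spread = rng.standard_normal((N, n))
T_spread /= np.linalg.norm(T_spread, axis=1, keepdims=True)   # N random unit vectors

T_clust = T_spread[0] + 0.05 * rng.standard_normal((N, n))    # one tight cluster
T_clust /= np.linalg.norm(T_clust, axis=1, keepdims=True)

def g(T):
    G = rng.standard_normal((trials, n))
    return np.mean(np.max(G @ T.T, axis=1))                   # E_g sup_{u in T} <g, u>

print(f"g(spread) = {g(T_spread):.2f}, g(clustered) = {g(T_clust):.2f}, "
      f"sqrt(2 log N) = {np.sqrt(2 * np.log(N)):.2f}")
```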
Sparse JL: Optimality
• JL: m ≲ ε^{-2} log|T|, s ≲ ε^{-1} log|T| [Kane, N. '12]
• Near-optimal: m ≲ ε^{-100} log|T| requires s ≳ ε^{-1} log|T| / log(1/ε) [N., Nguyễn '13]
Sparse JL: “Beyond Worst-Case Analysis”
Question: Can we obtain a Gordon-type theorem to understand how to set m, s as a function of T?
What we show [Bourgain-N. '14]

set T | our m | our s | previous m | previous s | ref
|T| < ∞ | log|T| | log|T| | log|T| | log|T| | [JL'84]
|T| < ∞, ‖u‖_∞ ≤ α | log|T| | ⌈α log|T|⌉^2 | log|T| | ⌈α log|T|⌉^2 | [M'08]
E, dim(E) ≤ d | d | 1 | d | 1 | [NN'13]
S_{n,k} | k log(n/k) | k log(n/k) | k log(n/k) | k log(n/k) | [CT'05]
HS_{n,k} | k log(n/k) | 1 | k log(n/k) | 1 | [RV'08]*
|Θ| < ∞, ∀E ∈ Θ dim(E) ≤ d | d + log|Θ| | log|Θ| | d + (log|Θ|)^6 | (log|Θ|)^3 | [NN'13]
|Θ| < ∞, ∀E ∈ Θ dim(E) ≤ d, ∀j ∀E ‖P_E e_j‖_2 ≤ α | d + log|Θ| | ⌈α log|Θ|⌉^2 | — | — | —
|Θ| infinite, ∀E ∈ Θ dim(E) ≤ d | see paper (non-trivial) | see paper | d | m | [D'14]
M a smooth manifold, ∀E ∈ TM ∀j ‖P_E e_j‖_2 ≤ α | d | ⌈αd⌉^2 | d | d | [D'14]

All bounds shown hide poly(ε^{-1} log n) factors. One row is blank in previous work due to no non-trivial results being previously known.
Question we studied: What's going on here?
Our main contribution [Bourgain-N. '14]
T ⊆ S^{n−1}, Π ∈ R^{m×n} an SJLT. Suffices to set
m ≳ g^2(T) + 1, s ≳ 1
such that
max_{q ≤ (m/s) log s} (1/√(qs)) · ( E_η ( E_g sup_{u∈T} |Σ_{j=1}^n η_j u_j g_j| )^q )^{1/q} ≪ 1
where ≳, ≪ hide (log n)/ε factors, and the (η_j) are i.i.d. Bernoulli of expectation qs/(m log s). We abbreviate the left-hand side as
max_{q ≤ (m/s) log s} (1/√(qs)) ‖ sup_{u∈T} Σ_{j=1}^n η_j u_j g_j ‖_{L^q_η L^1_g}
Fast JL note
Although we study sparse JL transforms, our results also apply to the "Fast JL transforms" of [Ailon, Chazelle '06] and follow-ups.
Φ = S × H × D, where
• S is a sampling matrix (exactly one non-zero per row)
• H is the Fourier (or Hadamard) matrix
• D is a diagonal matrix with random signs on the diagonal
• Imagine replacing S with an SJLT Π, so Φ = ΠHD
• Want Φ to preserve all of T; can view this as Π preserving HDT
An example new application
Model-based compressed sensing
• Our result implies sparse Π for model-based RIP when the number of sparsity patterns is 2^{o(k)}
• Example: block-sparsity. u is divided into n/b blocks of size b each. Each block is either "on" or "off".
• Sparsity k means at most k/b blocks are on.
• log|Θ| = log (n/b choose k/b) ∼ (k/b) log(n/k)
• Thus m ≳ k + (k/b) log(n/k), s ≳ (k/b) log(n/k)
• Non-trivial sparsity s for b ≫ polylog(n)
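A quick check of the counting step with illustrative numbers:

```python
from math import comb, log

n, k, b = 1 << 20, 1 << 10, 1 << 5    # illustrative: n = 2^20, k = 2^10, block size b = 32
log_theta = log(comb(n // b, k // b))                  # log of #(block-sparsity patterns)
print(f"log|Theta| = {log_theta:.1f}")
print(f"(k/b) log(n/k) = {(k / b) * log(n / k):.1f}")  # same order of magnitude
```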
Main theorem: proof outline
• Want to bound E_Π sup_{u∈T} |‖Πu‖_2^2 − 1|
• Can write Π_{i,j} = δ_{i,j}σ_{i,j}/√s, with δ_{i,j} ∈ {0,1}, σ_{i,j} = ±1
• Can write ‖Πu‖_2^2 = ‖A_{u,δ}σ‖_2^2, where A_{u,δ} ∈ R^{m×mn} is the block-diagonal matrix
A_{u,δ} = (1/√s) ·
[ u^{(1)}_1 · · · u^{(1)}_n   0 · · · 0               · · ·   0 · · · 0            ]
[ 0 · · · 0                   u^{(2)}_1 · · · u^{(2)}_n   · · ·   0 · · · 0            ]
[ · · ·                                                                              ]
[ 0 · · · 0                   0 · · · 0               · · ·   u^{(m)}_1 · · · u^{(m)}_n ]
where u^{(i)}_j = δ_{i,j} u_j. Define A_δ = {A_{u,δ} : u ∈ T}.
• Want: E_δ E_σ sup_{A∈A_δ} |‖Aσ‖_2^2 − E‖Aσ‖_2^2| < ε
• Can handle this using chaining and tools from Banach space theory
Proof outline
Want: E_δ E_σ sup_{A∈A_δ} |‖Aσ‖_2^2 − E‖Aσ‖_2^2|
We're in luck!
Theorem (Krahmer, Mendelson, Rauhut '11). For any collection A of matrices and σ a Rademacher vector,
E_σ sup_{A∈A} |‖Aσ‖_2^2 − E‖Aσ‖_2^2| ≲ γ_2^2(A, ‖·‖) + γ_2(A, ‖·‖) · d_F(A) + d_F(A) · d_{ℓ_2→ℓ_2}(A)
d_X(Y) is the radius of Y under the norm ‖·‖_X
γ_2 and Talagrand's “generic chaining”
• What is γ_2? Doesn't matter for us except for three things:
(1) γ_2(Y, ‖·‖_2) ∼ g(Y) for any Y ("majorizing measures") [Fernique '76], [Talagrand '01]
(2) γ_2(Y, ‖·‖_X) ≲ sup_{η>0} η · [log N(Y, ‖·‖_X, η)]^{1/2} for any norm (up to the hidden log factors)
(3) γ_2(Y, ‖·‖_X) is a function only of pairwise distances in Y under the ‖·‖_X norm
• Here N(Y, ‖·‖_X, η) is the minimum number of ‖·‖_X-balls of radius η centered at points in Y required to cover Y (usually called "covering" or "entropy" numbers)
Bounding the right hand side
Fix δ.
E_σ sup_{A∈A_δ} |‖Aσ‖_2^2 − E‖Aσ‖_2^2| ≲ γ_2^2(A_δ, ‖·‖) + γ_2(A_δ, ‖·‖) · d_F(A_δ) + d_F(A_δ) · d_{ℓ_2→ℓ_2}(A_δ)
• γ_2(A_δ, ‖·‖) depends only on ‖·‖-distances
• Facts: (1) A_{u,δ} − A_{w,δ} = A_{u−w,δ}, (2) ‖A_{u,δ}‖ = (1/√s) · max_{1≤i≤m} ‖u^{(i)}‖_2 =: |||u|||
• Therefore ‖A_{u,δ} − A_{w,δ}‖ = |||u − w|||
• Therefore γ_2(A_δ, ‖·‖) = γ_2(T, |||·|||)
Bounding the right hand side
Fix δ.
E_σ sup_{A∈A_δ} |‖Aσ‖_2^2 − E‖Aσ‖_2^2| ≲ γ_2^2(T, |||·|||) + γ_2(T, |||·|||) · d_F(A_δ) + d_F(A_δ) · d_{ℓ_2→ℓ_2}(A_δ)
• Facts: (1) A_{u,δ} − A_{w,δ} = A_{u−w,δ}, (2) ‖A_{u,δ}‖ = (1/√s) · max_{1≤i≤m} ‖u^{(i)}‖_2 = |||u|||
• Since T ⊂ S^{n−1}, ‖A_{u,δ}‖ ≤ (1/√s) · ‖u‖_2 = 1/√s
• Also ‖A_{u,δ}‖_F^2 = (1/s) Σ_{i=1}^m Σ_{j=1}^n δ_{i,j} u_j^2 = ‖u‖_2^2 = 1
• So the RHS is at most γ_2^2(T, |||·|||) + γ_2(T, |||·|||) + 1/√s
Warmup for main theorem: case of a linear subspace
• T = E, the intersection of a d-dimensional linear subspace and S^{n−1}
• Want to bound E_δ γ_2(T, |||·|||)
• One of our facts: γ_2(T, |||·|||) ≲ sup_{η>0} η · [log N(T, |||·|||, η)]^{1/2}
• Another stroke of luck. Dual Sudakov minoration [Bourgain, Lindenstrauss, Milman '89], [Pajor, Tomczak-Jaegermann '86]:
sup_{η>0} η · [log N(B_E, |||·|||, η)]^{1/2} ≲ E_g |||Ug|||
where g is a gaussian vector (very specific to linear subspaces)
• Here the columns of U ∈ R^{n×d} form an orthonormal basis for E
Warmup for main theorem: case of a linear subspace
Let U^{(i)} be U but with its jth row multiplied by δ_{i,j} (for each j).
E_g |||Ug||| = (1/√s) E_g max_{1≤i≤m} ‖U^{(i)}g‖_2
≤ (1/√s) [ max_{1≤i≤m} E_g ‖U^{(i)}g‖_2 + E_g max_{1≤i≤m} |‖U^{(i)}g‖_2 − E_g ‖U^{(i)}g‖_2| ]
≤ (1/√s) [ max_{1≤i≤m} E_g ‖U^{(i)}g‖_2 + ( Σ_{i=1}^m E_g |‖U^{(i)}g‖_2 − E_g ‖U^{(i)}g‖_2|^p )^{1/p} ]
≤ (1/√s) [ max_{1≤i≤m} ‖U^{(i)}‖_F + √(log m) · max_{1≤i≤m} ‖U^{(i)}‖ ]
(last step: chose p = log m and used gaussian concentration of Lipschitz functions [Pisier '86])
Warmup for main theorem: case of a linear subspace
Now must bound
(1/√s) · E_δ [ max_{1≤i≤m} ‖U^{(i)}‖_F + √(log m) max_{1≤i≤m} ‖U^{(i)}‖ ]      (1)
≤ (1/√s) · [ ( Σ_{i=1}^m E_δ ‖U^{(i)}‖_F^p )^{1/p} + √(log m) ( Σ_{i=1}^m E_δ ‖U^{(i)}‖^p )^{1/p} ]      (2)
choosing p = log m again, as usual, so that ℓ_p ∼ ℓ_∞.
• E_δ ‖U^{(i)}‖_F^p is basically the Chernoff bound
• E_δ ‖U^{(i)}‖^p can be handled by the non-commutative Khintchine inequality [Lust-Piquard '86], [Lust-Piquard, Pisier '91]
Going through the calculations gives that we can set:
m ≲ (d(log d)^2 + (log m)(log d)^2)/ε^2 ≈ d/ε^2 (up to log factors)
s ≲ 1 + (log d)^2(log m)^2 · max_{1≤j≤n} ‖P_E e_j‖_2^2 / ε^2
Can even set s = 1 if the "leverage scores" are small (true for random E)
Back to main theorem
General theorem. T ⊆ S^{n−1} arbitrary. Suffices to set m ≳ g^2(T) + 1, s ≳ 1 such that
max_{q ≤ (m/s) log s} (1/√(qs)) ‖ sup_{u∈T} Σ_{j=1}^n η_j u_j g_j ‖_{L^q_η L^1_g} ≪ 1
Proof ingredients
• Dual Sudakov minoration (already seen, [BLM'89], [PTJ'86]): sup_{η>0} η · [log N(E, |||·|||, η)]^{1/2} ≲ E_g |||Ug|||
• Maurey's empirical method [Pisier '81], [Carl '85]
• Duality of entropy numbers [Bourgain, Pajor, Szarek, Tomczak-Jaegermann '89]
General theorem: proof outline
• Recall: γ_2(Y, ‖·‖_X) ≲ sup_{η>0} η · [log N(Y, ‖·‖_X, η)]^{1/2}
• So we have to bound log N(T, |||·|||, η)
• Using duality of entropy numbers [Bourgain, Pajor, Szarek, Tomczak-Jaegermann '89], can bound
log N(T, |||·|||, η) ≲ log N(conv({B_{J_i}}), ‖·‖_T, (1/8)√s · η)
where ‖z‖_T = sup_{u∈T} |⟨u, z⟩|.
• Ultimately want to use dual Sudakov; we're not there yet
• Maurey's lemma: A tool to bound covering numbers of convex combinations of spaces
log N(conv({B_{J_i}}), ‖·‖_T, ε) ≲ (log m)/ε^2 + log(1/ε) · max_{k ≤ 1/ε^2} max_{|A|=k} log N( (1/k) Σ_{i∈A} B_{J_i}, ‖·‖_T, ε )
• Recall: γ2(Y , ‖ · ‖X ) supη>0 η · [logN (Y , ‖ · ‖X , η)]1/2
• So we have to bound logN (T , |||·|||, η)
• Using duality of entropy numbers [Bourgain, Pajor, Szarek,Tomczak-Jaegermann ’89], can bound
logN (T , |||·|||, η) logN (conv(BJi), ‖ · ‖T ,
1
8
√sη)
where ‖z‖T = supu∈T | 〈u, z〉 |.
• Ultimately want to use dual Sudakov; we’re not there yet• Maurey’s lemma: A tool to bound covering numbers of
convex combinations of spaces
logN (conv(BJi ), ‖·‖T , ε) .logm
ε2+log
1
εmaxk≤ 1
ε2
max|A|=k
logN (1
k
∑i∈A
BJi , ‖·‖T , ε)
General theorem: proof outline
• Recall: γ2(Y , ‖ · ‖X ) supη>0 η · [logN (Y , ‖ · ‖X , η)]1/2
• So we have to bound logN (T , |||·|||, η)
• Using duality of entropy numbers [Bourgain, Pajor, Szarek,Tomczak-Jaegermann ’89], can bound
logN (T , |||·|||, η) logN (conv(BJi), ‖ · ‖T ,
1
8
√sη)
where ‖z‖T = supu∈T | 〈u, z〉 |.• Ultimately want to use dual Sudakov; we’re not there yet
• Maurey’s lemma: A tool to bound covering numbers ofconvex combinations of spaces
logN (conv(BJi ), ‖·‖T , ε) .logm
ε2+log
1
εmaxk≤ 1
ε2
max|A|=k
logN (1
k
∑i∈A
BJi , ‖·‖T , ε)
General theorem: proof outline
• Recall: γ2(Y , ‖ · ‖X ) supη>0 η · [logN (Y , ‖ · ‖X , η)]1/2
• So we have to bound logN (T , |||·|||, η)
• Using duality of entropy numbers [Bourgain, Pajor, Szarek,Tomczak-Jaegermann ’89], can bound
logN (T , |||·|||, η) logN (conv(BJi), ‖ · ‖T ,
1
8
√sη)
where ‖z‖T = supu∈T | 〈u, z〉 |.• Ultimately want to use dual Sudakov; we’re not there yet• Maurey’s lemma: A tool to bound covering numbers of
convex combinations of spaces
logN (conv(BJi ), ‖·‖T , ε) .logm
ε2+log
1
εmaxk≤ 1
ε2
max|A|=k
logN (1
k
∑i∈A
BJi , ‖·‖T , ε)
General theorem: proof outline
• Define U_α = U_α(δ) = { j = 1, . . . , n : Σ_{i∈A} δ_{i,j} ∼ 2^α }
• Can show (1/k) Σ_{i∈A} B_{J_i} ⊂ Σ_α (1/√k) 2^{α/2} E_α, where E_α is the linear subspace of vectors supported on U_α
• Therefore log N( (1/k) Σ_{i∈A} B_{J_i}, ‖·‖_T, ε ) ≲ Σ_α log N( (1/√k) 2^{α/2} E_α, ‖·‖_T, ε/log m )
• Dual Sudakov minoration! (. . . plus some calculations)
Conclusion
Open Problems
• OPEN: Prove the conjecture: to get a subspace embedding with probability 1 − δ, can set m = O((d + log(1/δ))/ε^2), s = O(log(d/δ)/ε). Easier: obtain this m with s = m via the moment method. ([Cohen '14] has made progress via other methods)
• OPEN: Show that the tradeoff m = O(d^{1+γ}/ε^2), s = poly(1/γ) · 1/ε is optimal for any distribution over subspace embeddings (the poly is probably linear; some progress in [N., Nguyễn '14])
• OPEN: Show that m = Ω(d^2/ε^2) is optimal for s = 1.
Partial progress: [N., Nguyễn '13] shows m = Ω(d^2)
• OPEN: Improve the parameters of [Bourgain, N. '14]: s should be ∝ 1/ε, not 1/ε^2, and remove the log factors.
• OPEN: Give a full tradeoff curve for s, m for the general theorem
• OPEN: The usual pipeline: m, s "big enough" ⇒ T preserved ⇒ application works.
Can we skip the middle step to get better bounds? e.g. for k-means clustering?