Dimensionality reduction via sparse matrices
Jelani Nelson, Harvard
June 17, 2014
based on joint works with Jean Bourgain (IAS), Daniel Kane (Stanford), Huy Nguyễn (Princeton)
Metric Johnson-Lindenstrauss lemma
Metric JL (MJL) Lemma, 1984
Every set of N points in Euclidean space can be embedded into O(ε^{-2} log N)-dimensional Euclidean space so that all pairwise distances are preserved up to a 1 ± ε factor.
Uses:
• Speed up geometric algorithms by first reducing dimension of input [Indyk, Motwani '98], [Indyk '01]
• Faster/streaming numerical linear algebra algorithms [Sarlos '06], [LWMRT '07], [Clarkson, Woodruff '09]
• Essentially equivalent to RIP matrices from compressed sensing [Baraniuk et al. '08], [Krahmer, Ward '11] (used for recovery of sparse signals)
• Volume-preserving embeddings (applications to projective clustering) [Magen '02]
• . . .
How to prove the JL lemma
Distributional JL (DJL) lemma
Lemma. For any 0 < ε, δ < 1/2 there exists a distribution D_{ε,δ} on R^{m×n} for m = O(ε^{-2} log(1/δ)) so that for any u of unit ℓ_2 norm
P_{Π∼D_{ε,δ}}( |‖Πu‖_2^2 − 1| > ε ) < δ.
Proof of MJL: Set δ = 1/N^2 in DJL and u as the difference vector of some pair of points. Union bound over the (N choose 2) pairs.
Theorem (Alon, 2003)
For MJL, m = Ω((ε^{-2}/log(1/ε)) log N) is required.
Theorem (Jayram-Woodruff, 2011; Kane-Meka-N., 2011)
For DJL, m = Θ(ε^{-2} log(1/δ)) is optimal.
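As a sanity check on the DJL statement (not part of the talk), here is a minimal numerical sketch using a dense Gaussian Π, one of the classical distributions listed on the next slide; the constant 8 in the choice of m is an illustrative assumption, not the lemma's constant.

```python
# Empirical check of distributional JL with Pi having i.i.d. N(0, 1/m) entries.
import numpy as np

rng = np.random.default_rng(0)
n, eps, delta = 2_000, 0.25, 0.01
m = int(np.ceil(8 * eps**-2 * np.log(1 / delta)))  # m = O(eps^-2 log(1/delta))

u = rng.standard_normal(n)
u /= np.linalg.norm(u)                             # fixed vector of unit l2 norm

trials = 100
errs = np.empty(trials)
for t in range(trials):
    Pi = rng.standard_normal((m, n)) / np.sqrt(m)  # fresh draw Pi ~ D_{eps,delta}
    errs[t] = abs(np.linalg.norm(Pi @ u) ** 2 - 1)

print(f"m = {m}; empirical failure rate {np.mean(errs > eps):.2f} (target < delta = {delta})")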
Proving the distributional JL lemma
Older proofs
• [Johnson-Lindenstrauss, 1984], [Frankl-Maehara, 1988], [Dasgupta-Gupta, 2003]: Random rotation, then projection onto the first m coordinates.
• [Indyk-Motwani, 1998]: Random matrix with independent Gaussian entries.
• [Achlioptas, 2001]: Independent ±1 entries.
• [Clarkson-Woodruff, 2009]: O(log(1/δ))-wise independent ±1 entries.
• [Arriaga-Vempala, 1999], [Matousek, 2008]: Independent entries having mean 0, variance 1/m, and subgaussian tails.
Downside: Performing the embedding is a dense matrix-vector multiplication, O(m · ‖u‖_0) time.
Fast JL Transforms
• [Ailon-Chazelle, 2006]: x ↦ PHDx, O(n log n + m^3) time
P random+sparse, H Fourier, D has random ±1 on the diagonal
• Also follow-up works based on a similar approach which improve the time while, for some, slightly increasing the target dimension [Ailon, Liberty '08], [Ailon, Liberty '11], [Krahmer, Ward '11], [N., Price, Wootters '14], . . .
Downside: Slow to embed sparse vectors: running time is Ω(min{m · ‖u‖_0, n log n}).
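To make the pipeline concrete, here is a minimal sketch of x ↦ PHDx. Assumptions not in the talk: n a power of two, H the Walsh-Hadamard matrix, and P simplified to plain uniform row sampling (the actual [Ailon-Chazelle '06] P is a sparse Gaussian matrix, and a real implementation applies H in O(n log n) time rather than forming it densely).

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, m = 1024, 64                               # n must be a power of two for hadamard()

D = rng.choice([-1.0, 1.0], size=n)           # random signs on the diagonal of D
H = hadamard(n) / np.sqrt(n)                  # orthonormal Hadamard transform ("Fourier")
rows = rng.integers(0, n, size=m)             # simplified P: sample m coordinates

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
y = np.sqrt(n / m) * (H @ (D * x))[rows]      # P H D x, scaled so E ||y||^2 = ||x||^2
print(abs(np.linalg.norm(y) ** 2 - 1))        # small with high probability
```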
Where Do Sparse Vectors Show Up?
• Document as bag of words: u_i = number of occurrences of word i. Compare documents using cosine similarity.
n = lexicon size; most documents aren't dictionaries
• Network traffic: u_{i,j} = #bytes sent from i to j
n = 2^64 (2^256 in IPv6); most servers don't talk to each other
• User ratings: u_{i,j} is user i's score for movie j on Netflix
n = #movies; most people haven't rated all movies
• Streaming: u receives a stream of updates of the form "add v to u_i". Maintaining Πu requires calculating v · Πe_i (see the sketch below).
• . . .
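A minimal sketch of the streaming bullet: if Π is sparse with its columns regenerated on demand, the update "add v to u_i" touches only s entries of the sketch. The per-column seeded generator below is a hypothetical stand-in for the hash functions a real streaming implementation would use.

```python
import numpy as np

m, s = 256, 8

def column(i):
    """Nonzero rows and signs of column i of Pi, regenerated on demand."""
    r = np.random.default_rng(i)                     # stand-in for hashing
    rows = r.choice(m, size=s, replace=False)        # exactly s nonzero rows
    signs = r.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return rows, signs

sketch = np.zeros(m)                                 # maintains Pi @ u
for i, v in [(3, 1.0), (17, -2.5), (3, 0.5)]:        # stream of updates "add v to u_i"
    rows, signs = column(i)
    sketch[rows] += v * signs                        # O(s) work per update, not O(m)
```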
Sparse JL transforms
One way to embed sparse vectors faster: use sparse matrices.
s = #non-zero entries per column in Π (so embedding time is s · ‖u‖_0)

reference | value of s | type
[JL84], [FM88], [IM98], . . . | m ≈ 4ε^{-2} ln(1/δ) | dense
[Achlioptas01] | m/3 | sparse Bernoulli
[WDALS09] | no proof | hashing
[DKS10] | O(ε^{-1} log^3(1/δ)) | hashing
[KN10a], [BOR10] | O(ε^{-1} log^2(1/δ)) | hashing
[KN12] | O(ε^{-1} log(1/δ)) | modified hashing

[N., Nguyễn '13]: for any m ≤ poly(1/ε) · log N, s = Ω(ε^{-1} log N/log(1/ε)) is required, even for metric JL, so [KN12] is optimal up to O(log(1/ε)).
*[Thorup, Zhang '04] gives m = O(ε^{-2}δ^{-1}), s = 1.
Sparse JL Constructions
[DKS, 2010]: s = Θ(ε^{-1} log^2(1/δ))
[KN12]: s = Θ(ε^{-1} log(1/δ)), with each column split into s blocks of height m/s
Sparse JL Constructions (in matrix form)
(Figure: two m × n matrices. Left: each column has s nonzero entries placed in random rows. Right: each column is divided into s blocks of height m/s, with one nonzero entry per block.)
Each black cell is ±1/√s at random
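A minimal construction sketch of both pictures (dense storage for clarity; assumes s divides m in the blocked variant):

```python
import numpy as np

def sparse_jl(n, m, s, blocked, seed=0):
    """m x n sign matrix with s nonzeros (+-1/sqrt(s)) per column."""
    rng = np.random.default_rng(seed)
    Pi = np.zeros((m, n))
    for j in range(n):
        if blocked:   # [KN12]: one nonzero in each of s blocks of height m/s
            rows = np.arange(s) * (m // s) + rng.integers(0, m // s, size=s)
        else:         # s uniformly random distinct rows
            rows = rng.choice(m, size=s, replace=False)
        Pi[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return Pi

Pi = sparse_jl(n=500, m=64, s=8, blocked=True)
u = np.random.default_rng(1).standard_normal(500)
print(np.linalg.norm(Pi @ u) / np.linalg.norm(u))   # close to 1
```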
One open problem . . .
• [Achlioptas '01]:
Π_{i,j} = +1/√(m/3) w.p. 1/6;  −1/√(m/3) w.p. 1/6;  0 w.p. 2/3
• In expectation there are s = m/3 non-zeroes per column.
• [Kane-N. '12]: Eliminate the variance in the number of non-zeroes per column. Force it to be exactly s.
Surprisingly, this helps!
(makes asymptotically smaller s = O(εm) possible)
• Conjecture: Error with exactly s non-zeroes per column is stochastically dominated by error with expected s per column.
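A small illustrative experiment in the spirit of the conjecture, comparing the two distributions empirically (a sanity check only, not evidence for the stochastic-dominance statement):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, trials = 400, 90, 300
s = m // 3
u = rng.standard_normal(n); u /= np.linalg.norm(u)

def err_achlioptas():
    # each entry +-1/sqrt(m/3) w.p. 1/6 each, 0 w.p. 2/3 => expected s = m/3 per column
    signs = rng.choice([-1.0, 0.0, 1.0], size=(m, n), p=[1/6, 2/3, 1/6])
    return abs(np.linalg.norm((signs / np.sqrt(m / 3)) @ u) ** 2 - 1)

def err_exact():
    # exactly s nonzeros per column, entries +-1/sqrt(s)
    Pi = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)
        Pi[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return abs(np.linalg.norm(Pi @ u) ** 2 - 1)

a = [err_achlioptas() for _ in range(trials)]
e = [err_exact() for _ in range(trials)]
print(f"mean error, expected s: {np.mean(a):.4f}; exactly s: {np.mean(e):.4f}")
```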
Analysis
• In both constructions, can write Π_{i,j} = δ_{i,j}σ_{i,j}/√s
‖Πu‖_2^2 − 1 = (1/s) Σ_{r=1}^m Σ_{i≠j} δ_{r,i}δ_{r,j}σ_{r,i}σ_{r,j} u_i u_j = σ^T B σ
B = (1/s) · diag(B_1, B_2, . . . , B_m)   (block diagonal)
• (B_r)_{i,j} = δ_{r,i}δ_{r,j} u_i u_j
• P(|‖Πu‖_2^2 − 1| > ε) = P(|σ^T Bσ| > ε) < ε^{-ℓ} · E|σ^T Bσ|^ℓ. Use a moment bound for quadratic forms, which depends on ‖B‖, ‖B‖_F (Hanson-Wright inequality).
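For reference, one standard form of the Hanson-Wright inequality (formulations and constants vary across references; in the application above B has zero diagonal, so E σ^T Bσ = 0):

```latex
% Hanson--Wright inequality, one standard form: for independent Rademacher
% variables sigma_1, ..., sigma_n and any B in R^{n x n},
\[
  \Pr\Bigl(\bigl|\sigma^T B \sigma - \mathbb{E}\,\sigma^T B \sigma\bigr| > t\Bigr)
  \;\le\; 2\exp\Bigl(-c\,\min\Bigl\{\tfrac{t^2}{\|B\|_F^2},\;\tfrac{t}{\|B\|}\Bigr\}\Bigr),
  \qquad t > 0,
\]
% for a universal constant c > 0; equivalently, the moments satisfy
% (E|sigma^T B sigma - E sigma^T B sigma|^l)^{1/l} <= C (sqrt(l) ||B||_F + l ||B||).
```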
What next?
Natural “matrix extension” of sparse JL
[Kane, N. '12]
Theorem. Let u ∈ R^n be arbitrary with unit ℓ_2 norm, Π a sparse sign matrix. Then
P_Π( |‖Πu‖_2^2 − 1| > ε ) < δ
as long as
m ≳ log(1/δ)/ε^2, s ≳ log(1/δ)/ε, ℓ = log(1/δ)
or
m ≳ 1/(ε^2 δ), s = 1, ℓ = 2   ([Thorup, Zhang '04])
Natural “matrix extension” of sparse JL
[Kane, N. '12]
Theorem. Let U ∈ R^{n×1} be arbitrary with orthonormal columns, Π a sparse sign matrix. Then
P_Π( ‖(ΠU)^T(ΠU) − I_1‖ > ε ) < δ
as long as
m ≳ (1 + log(1/δ))/ε^2, s ≳ log(1/δ)/ε, ℓ = log(1/δ)
or
m ≳ 1^2/(ε^2 δ), s = 1, ℓ = 2
Natural “matrix extension” of sparse JL
Conjecture. Let U ∈ R^{n×d} be arbitrary with orthonormal columns, Π a sparse sign matrix. Then
P_Π( ‖(ΠU)^T(ΠU) − I_d‖ > ε ) < δ
as long as
m ≳ (d + log(1/δ))/ε^2, s ≳ log(d/δ)/ε, ℓ = log(d/δ)
or
m ≳ d^2/(ε^2 δ), s = 1, ℓ = 2
Natural “matrix extension” of sparse JL
What we prove [N., Nguyễn '13]
Theorem. Let U ∈ R^{n×d} be arbitrary with orthonormal columns, Π a sparse sign matrix. Then
P_Π( ‖(ΠU)^T(ΠU) − I_d‖ > ε ) < δ
as long as
m ≳ d · log^c(d/δ)/ε^2, s ≳ log^c(d/δ)/ε    or    m ≳ d^{1.01}/ε^2, s ≳ 1/ε
or
m ≳ d^2/(ε^2 δ), s = 1
Remarks
• [Clarkson, Woodruff '13] was first to show an m = d^2 · polylog(d/ε)/ε^2, s = 1 bound, via other methods
• m = O(d^2/ε^2), s = 1 also obtained by [Mahoney, Meng '13].
• m = O(d^2/ε^2), s = 1 also follows from [Thorup, Zhang '04] + [Kane, N. '12] (observed by Nguyễn)
• What does the "moment method" mean for matrices? For even ℓ,
P_Π( ‖(ΠU)^T(ΠU) − I_d‖ > ε ) < ε^{-ℓ} · E‖(ΠU)^T(ΠU) − I_d‖^ℓ ≤ ε^{-ℓ} · E tr(((ΠU)^T(ΠU) − I_d)^ℓ)
• Classical "moment method" in random matrix theory; e.g. [Wigner, 1955], [Furedi, Komlos, 1981], [Bai, Yin, 1993]
Who cares about this matrix extension?
Motivation for matrix extension of sparse JL
• ‖(ΠU)^T(ΠU) − I‖ ≤ ε is equivalent to ‖Πx‖ = (1 ± ε)‖x‖ for all x ∈ V, where V is the subspace spanned by the columns of U (up to changing ε by a factor of 2): a "subspace embedding".
• Subspace embeddings can be used to speed up algorithms for many numerical linear algebra problems on big matrices [Sarlos, 2006], [Dasgupta, Drineas, Harb, Kumar, Mahoney, 2008], [Clarkson, Woodruff, 2009], [Drineas, Magdon-Ismail, Mahoney, Woodruff, 2012], [Clarkson, Woodruff, 2013], [Clarkson, Drineas, Magdon-Ismail, Mahoney, Meng, Woodruff, 2013], [Woodruff, Zhang, 2013], . . .
• Sparse Π: can multiply ΠA in s · nnz(A) time for a big matrix A (see the sketch below).
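A minimal sketch of the last bullet with scipy.sparse (illustrative sizes): each nonzero A[i, j] meets only the s nonzeros of column i of Π, giving the s · nnz(A) cost.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, d, m, s = 50_000, 50, 2_500, 4

# Sparse sign matrix Pi in CSC form: s nonzeros (+-1/sqrt(s)) per column.
rows = np.concatenate([rng.choice(m, size=s, replace=False) for _ in range(n)])
cols = np.repeat(np.arange(n), s)
vals = rng.choice([-1.0, 1.0], size=n * s) / np.sqrt(s)
Pi = sp.csc_matrix((vals, (rows, cols)), shape=(m, n))

A = sp.random(n, d, density=1e-3, random_state=0, format="csc")  # large sparse A
PiA = Pi @ A          # cost ~ s * nnz(A), independent of m
print(PiA.shape, A.nnz)
```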
Numerical linear algebra
• A ∈ R^{n×d}, n ≫ d, rank(A) = r
Classical numerical linear algebra problems
• Compute the leverage scores of A, i.e. the ℓ_2 norms of the n standard basis vectors when projected onto the subspace spanned by the columns of A.
• Least squares regression: Given also b ∈ R^n. Compute x* = argmin_{x∈R^d} ‖Ax − b‖_2
• ℓ_p regression (p ∈ [1,∞)): Compute x* = argmin_{x∈R^d} ‖Ax − b‖_p
• Low-rank approximation: Given also an integer 1 ≤ k ≤ d. Compute A_k = argmin_{rank(B)≤k} ‖A − B‖_F
• Preconditioning: Compute R ∈ R^{d×d} (for d = r) so that ∀x, ‖ARx‖_2 ≈ ‖x‖_2
Computationally efficient solutions
Singular Value Decomposition
Theorem. Every matrix A ∈ R^{n×d} of rank r can be written as A = UΣV^T, where U ∈ R^{n×r} has orthonormal columns, Σ ∈ R^{r×r} is diagonal positive definite, and V ∈ R^{d×r} has orthonormal columns.
Can compute the SVD in O(nd^{ω−1}) [Demmel, Dumitriu, Holtz, 2007]. ω < 2.373 . . . is the exponent of square matrix multiplication [Coppersmith, Winograd, 1987], [Stothers, 2010], [Vassilevska-Williams, 2012].
Computationally efficient solutions
A = UΣV^T   (U ∈ R^{n×r}, Σ ∈ R^{r×r}, V ∈ R^{d×r} as above)
• Leverage scores: Output row norms of U.
• Least squares regression: Output VΣ^{-1}U^T b.
• Low-rank approximation: Output UΣ_k V^T.
• Preconditioning: Output R = VΣ^{-1}.
Conclusion: In time O(nd^{ω−1}) we can compute the SVD, then solve all the previously stated problems. Is there a faster way?
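A minimal numpy sketch of reading these outputs off a thin SVD (assumes A has full column rank, so r = d; Σ_k keeps the top k singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 20, 5
A = rng.standard_normal((n, d))                      # full column rank w.p. 1, so r = d
b = rng.standard_normal(n)

U, sig, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(sig) @ Vt

leverage = np.sum(U**2, axis=1)                      # squared row norms of U
x_star = Vt.T @ ((U.T @ b) / sig)                    # V Sigma^{-1} U^T b
A_k = (U[:, :k] * sig[:k]) @ Vt[:k, :]               # U Sigma_k V^T
R = Vt.T / sig                                       # preconditioner V Sigma^{-1}

assert np.allclose(x_star, np.linalg.lstsq(A, b, rcond=None)[0])
assert np.allclose((A @ R).T @ (A @ R), np.eye(d))   # A R has orthonormal columns
```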
How to use subspace embeddings
Least squares regression: Let Π be a subspace embedding for the subspace spanned by b and the columns of A. Let x* = argmin ‖Ax − b‖ and x̃ = argmin ‖ΠAx − Πb‖. Then
(1 − ε)‖Ax̃ − b‖ ≤ ‖ΠAx̃ − Πb‖ ≤ ‖ΠAx* − Πb‖ ≤ (1 + ε)‖Ax* − b‖
⇒ ‖Ax̃ − b‖ ≤ ((1 + ε)/(1 − ε)) · ‖Ax* − b‖
Computing the SVD of ΠA takes time O(md^{ω−1}), which is much faster than O(nd^{ω−1}) since m ≪ n.
Note: [Sarlos '06] actually describes a more efficient reduction, where one only needs that (1) Π is a (1 − 1/√2)-subspace embedding, and (2) ‖(ΠU)^T(Π(Ax* − b))‖_2 < √ε · ‖Ax* − b‖. Item (2) only requires m ≳ d/ε, s ≥ 1 [Thorup, Zhang '04] + [Kane, N. '12].
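A minimal sketch-and-solve illustration of the displayed guarantee, with illustrative (not theorem-driven) parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, s = 5_000, 10, 500, 8

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)   # noisy over-determined system

Pi = np.zeros((m, n))                                     # sparse sign matrix (dense storage)
for j in range(n):
    rows_j = rng.choice(m, size=s, replace=False)
    Pi[rows_j, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]             # exact solution
x_tilde = np.linalg.lstsq(Pi @ A, Pi @ b, rcond=None)[0]  # solve the sketched problem

ratio = np.linalg.norm(A @ x_tilde - b) / np.linalg.norm(A @ x_star - b)
print(f"||A x~ - b|| / ||A x* - b|| = {ratio:.4f}")       # ~ (1+eps)/(1-eps)
```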
Back to the analysis
P_Π( ‖(ΠU)^T(ΠU) − I_d‖ > ε ) < ε^{-ℓ} · E tr(((ΠU)^T(ΠU) − I_d)^ℓ)
Analysis (ℓ = 2): s = 1, m = O(d^2/ε^2)
Want to understand S − I, where S = (ΠU)^T(ΠU)
Let the columns of U be u^1, . . . , u^d
Recall Π_{i,j} = δ_{i,j}σ_{i,j}/√s
Some computations yield
(S − I)_{k,k'} = (1/s) Σ_{r=1}^m Σ_{i≠j} δ_{r,i}δ_{r,j}σ_{r,i}σ_{r,j} u^k_i u^{k'}_j
Computing E tr((S − I)^2) = E‖S − I‖_F^2 is straightforward, and one can show E‖S − I‖_F^2 ≤ (d^2 + d)/m
P(‖S − I‖ > ε) < (1/ε^2) · (d^2 + d)/m
Set m ≥ δ^{-1}(d^2 + d)/ε^2 for success probability 1 − δ
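A quick numerical check of the E‖S − I‖_F^2 ≤ (d^2 + d)/m bound for the s = 1 construction (one ±1 per column in a uniformly random row, as in [Thorup, Zhang '04]):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, trials = 2_000, 10, 5_000, 50

U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal columns

total = 0.0
for _ in range(trials):
    rows = rng.integers(0, m, size=n)              # s = 1: one nonzero row per column
    signs = rng.choice([-1.0, 1.0], size=n)
    PiU = np.zeros((m, d))
    np.add.at(PiU, rows, signs[:, None] * U)       # Pi @ U without materializing Pi
    S = PiU.T @ PiU
    total += np.linalg.norm(S - np.eye(d), "fro") ** 2

print(f"empirical E||S - I||_F^2 = {total / trials:.4f}; bound (d^2+d)/m = {(d*d + d) / m:.4f}")
```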
Analysis (large ℓ): s = O_γ(1/ε), m = O(d^{1+γ}/ε^2)
(S − I)_{k,k'} = (1/s) Σ_{r=1}^m Σ_{i≠j} δ_{r,i}δ_{r,j}σ_{r,i}σ_{r,j} u^k_i u^{k'}_j
By induction, for any square matrix B and integer ℓ ≥ 1,
(B^ℓ)_{i,j} = Σ_{i_1,...,i_{ℓ+1} : i_1=i, i_{ℓ+1}=j} Π_{t=1}^ℓ B_{i_t,i_{t+1}}
⇒ tr(B^ℓ) = Σ_{i_1,...,i_{ℓ+1} : i_1=i_{ℓ+1}} Π_{t=1}^ℓ B_{i_t,i_{t+1}}
Analysis (large ℓ): s = O_γ(1/ε), m = O(d^{1+γ}/ε^2)
E tr((S − I)^ℓ) = Σ_{i_1≠j_1,...,i_ℓ≠j_ℓ; r_1,...,r_ℓ; k_1,...,k_{ℓ+1} with k_1=k_{ℓ+1}} ( E Π_{t=1}^ℓ δ_{r_t,i_t}δ_{r_t,j_t} ) ( E Π_{t=1}^ℓ σ_{r_t,i_t}σ_{r_t,j_t} ) Π_{t=1}^ℓ u^{k_t}_{i_t} u^{k_{t+1}}_{j_t}
The strategy: Associate each monomial in the summation above with a graph, group monomials that have the same graph, estimate the contribution of each graph, then do some combinatorics
(a common strategy; see [Wigner, 1955], [Furedi, Komlos, 1981], [Bai, Yin, 1993])
Finite pointsets, then subspaces, now what?
(Linear) dimensionality reduction
• Given T ⊂ R^n, want a Π ∈ R^{m×n} such that
∀u ∈ T, (1 − ε)‖u‖_2 ≤ ‖Πu‖_2 ≤ (1 + ε)‖u‖_2
with m as small as possible.
• The above is equivalent to, assuming T ⊂ S^{n−1},
sup_{u∈T} |‖Πu‖_2^2 − 1| < ε
• We consider Π random:
E_Π sup_{u∈T} |‖Πu‖_2^2 − 1| < ε
Applications
• Finite T: the Johnson-Lindenstrauss (JL) lemma has T = {(u_i − u_j)/‖u_i − u_j‖_2}_{1≤i<j≤N} [JL84].
Applications to clustering, nearest neighbor search, . . .
• Linear subspace: T = E ∩ S^{n−1}, E ⊂ R^n, dim(E) = d.
Subspace embeddings pioneered by [Sarlos '06]; faster least squares regression and low-rank approximation.
Also applications to approximating leverage scores [DMMW'12], k-means clustering [BZMD'11], canonical correlation analysis [ABTZ'13], support vector machines [PBMD'13], ℓ_p regression [CDM+13, WZ13], ridge regression [LDFU'13].
Applications
• Union of subspaces: streaming approximation of eigenvalues [Andoni, Nguyễn '13], compressed sensing [Donoho '04, Candes-Tao '06]
e.g. compressed sensing: T = {u ∈ R^n : ‖u‖_2 = 1, ‖u‖_0 ≤ k}, where ‖u‖_0 is support size. Can also be sparse over some other basis.
• This is a union of (n choose k) subspaces
• A Π for this T is known as having the restricted isometry property (RIP). Given Πu for (nearly) sparse u, can (approximately) recover u in polynomial time [Candes, Tao '05]
Supporting fewer than (n choose k) sparsity patterns is also of interest ("model-based compressed sensing" [BCDH'10])
Applications
• Smooth manifolds: M = F(B_{ℓ_2^d}), for smooth F : B_{ℓ_2^d} → R^n
Manifold learning: classifying images of handwritten digits [Hinton-Dayan-Revow '97], or human faces [Broomhead-Kirby '01]
• Idea: All valid drawings of the digit "2" lie on a low-dimensional manifold in R^{√n×√n} (pixelated image). From samples, learn the manifold.
• [Baraniuk-Wakin '09] suggest reducing dimensionality using Π and doing the learning in the lower-dimensional space
• To preserve curve lengths on the manifold, can set T to be the tangent space of M (infinite union T = (∪_{u∈M} E_u) ∩ S^{n−1})
Euclidean dimensionality reduction: Optimality?
Can we do better than the JL lemma?
• Unfortunately, not by much if any.
[Alon '03]: Let T = {e_1, . . . , e_n} ∪ {(e_i − e_j)/√2}_{i≠j}
• Any embedding of T with distortion 1 + ε requires m ≳ (log n)/(ε^2 log(1/ε)) (off from the JL lemma by just log(1/ε))
Euclidean dimensionality reduction: “Beyond Worst-Case Analysis”
• Suppose Π_{i,j} ∼ N(0, 1/m) (i.i.d. gaussian entries)
• Gordon's theorem (1988): suffices to set m ∼ (g^2(T) + 1)/ε^2, where g(T) = E_g sup_{u∈T} ⟨g, u⟩ (g a gaussian vector)
• g(T) ≲ √(log|T|) always (union bound), but can be much smaller if the vectors in T are well-clusterable
• [Klartag, Mendelson '05]: Same result if Π_{i,j} ∼ ±1/√m
• Same problem as before: Π is dense.
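For a finite T, g(T) is easy to estimate by Monte Carlo. The illustrative sketch below shows the √(log|T|)-type union bound being nearly tight for random directions and far from tight for a well-clustered set:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, trials = 200, 1_000, 300

T_spread = rng.standard_normal((N, n))
T_spread /= np.linalg.norm(T_spread, axis=1, keepdims=True)   # N random unit vectors

T_clust = T_spread[0] + 0.05 * rng.standard_normal((N, n))    # one tight cluster
T_clust /= np.linalg.norm(T_clust, axis=1, keepdims=True)

def g(T):
    G = rng.standard_normal((trials, n))
    return np.mean(np.max(G @ T.T, axis=1))                   # E_g sup_{u in T} <g, u>

print(f"g(spread) = {g(T_spread):.2f}, g(clustered) = {g(T_clust):.2f}, "
      f"sqrt(2 log N) = {np.sqrt(2 * np.log(N)):.2f}")
```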
Sparse JL: Optimality
• JL: m ≲ ε^{-2} log|T|, s ≲ ε^{-1} log|T| [Kane, N. '12]
• Near-optimal: m ≲ ε^{-100} log|T| requires s ≳ ε^{-1} log|T| / log(1/ε) [N., Nguyễn '13]
Sparse JL: “Beyond Worst-Case Analysis”
Question: Can we obtain a Gordon-type theorem to understand how to set m, s as a function of T?
What we show [Bourgain-N. '14]

set T | our m | our s | previous m | previous s | ref
|T| < ∞ | log|T| | log|T| | log|T| | log|T| | [JL'84]
|T| < ∞, ‖u‖_∞ ≤ α | log|T| | ⌈α log|T|⌉^2 | log|T| | ⌈α log|T|⌉^2 | [M'08]
E, dim(E) ≤ d | d | 1 | d | 1 | [NN'13]
S_{n,k} | k log(n/k) | k log(n/k) | k log(n/k) | k log(n/k) | [CT'05]
HS_{n,k} | k log(n/k) | 1 | k log(n/k) | 1 | [RV'08]*
|Θ| < ∞, ∀E ∈ Θ dim(E) ≤ d | d + log|Θ| | log|Θ| | d + (log|Θ|)^6 | (log|Θ|)^3 | [NN'13]
|Θ| < ∞, ∀E ∈ Θ dim(E) ≤ d, ∀j ∀E ‖P_E e_j‖_2 ≤ α | d + log|Θ| | ⌈α log|Θ|⌉^2 | — | — | —
|Θ| infinite, ∀E ∈ Θ dim(E) ≤ d | see paper (non-trivial) | see paper | d | m | [D'14]
M a smooth manifold, ∀E ∈ TM ∀j ‖P_E e_j‖_2 ≤ α | d | ⌈αd⌉^2 | d | d | [D'14]

All bounds shown hide poly(ε^{-1} log n) factors. One row is blank in previous work due to no non-trivial results being previously known.
Question we studied: What's going on here?
Our main contribution [Bourgain-N. '14]
T ⊆ S^{n−1}, Π ∈ R^{m×n} an SJLT. Suffices to set
m ≳ g^2(T) + 1, s ≳ 1
such that
max_{q ≤ (m/s) log s} (1/√(qs)) · ( E_η ( E_g sup_{u∈T} |Σ_{j=1}^n η_j u_j g_j| )^q )^{1/q} ≪ 1
where ≳, ≪ hide (log n)/ε factors, and the (η_j) are i.i.d. Bernoulli of expectation qs/(m log s). We abbreviate the left-hand side as
max_{q ≤ (m/s) log s} (1/√(qs)) ‖ sup_{u∈T} Σ_{j=1}^n η_j u_j g_j ‖_{L^q_η L^1_g}
Fast JL note
Although we study sparse JL transforms, our results also apply to the "Fast JL transforms" of [Ailon, Chazelle '06] and follow-ups.
Φ = S × H × D, where
• S is a sampling matrix (exactly one non-zero per row)
• H is the Fourier (or Hadamard) matrix
• D is a diagonal matrix with random signs on the diagonal
• Imagine replacing S with an SJLT Π, so Φ = ΠHD
• Want Φ to preserve all of T; can view this as Π preserving HDT
An example new application
Model-based compressed sensing
• Our result implies sparse Π for model-based RIP when the number of sparsity patterns is 2^{o(k)}
• Example: block-sparsity. u is divided into n/b blocks of size b each. Each block is either "on" or "off".
• Sparsity k means at most k/b blocks are on.
• log|Θ| = log (n/b choose k/b) ∼ (k/b) log(n/k)
• Thus m ≳ k + (k/b) log(n/k), s ≳ (k/b) log(n/k)
• Non-trivial sparsity s for b ≫ polylog(n)
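A quick check of the counting step with illustrative numbers:

```python
from math import comb, log

n, k, b = 1 << 20, 1 << 10, 1 << 5    # illustrative: n = 2^20, k = 2^10, block size b = 32
log_theta = log(comb(n // b, k // b))                  # log of #(block-sparsity patterns)
print(f"log|Theta| = {log_theta:.1f}")
print(f"(k/b) log(n/k) = {(k / b) * log(n / k):.1f}")  # same order of magnitude
```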
Main theorem: proof outline
• Want to bound E_Π sup_{u∈T} |‖Πu‖_2^2 − 1|
• Can write Π_{i,j} = δ_{i,j}σ_{i,j}/√s, with δ_{i,j} ∈ {0,1}, σ_{i,j} = ±1
• Can write ‖Πu‖_2^2 = ‖A_{u,δ}σ‖_2^2, where A_{u,δ} ∈ R^{m×mn} is the block-diagonal matrix
A_{u,δ} = (1/√s) ·
[ u^{(1)}_1 · · · u^{(1)}_n   0 · · · 0               · · ·   0 · · · 0            ]
[ 0 · · · 0                   u^{(2)}_1 · · · u^{(2)}_n   · · ·   0 · · · 0            ]
[ · · ·                                                                              ]
[ 0 · · · 0                   0 · · · 0               · · ·   u^{(m)}_1 · · · u^{(m)}_n ]
where u^{(i)}_j = δ_{i,j} u_j. Define A_δ = {A_{u,δ} : u ∈ T}.
• Want: E_δ E_σ sup_{A∈A_δ} |‖Aσ‖_2^2 − E‖Aσ‖_2^2| < ε
• Can handle this using chaining and tools from Banach space theory
Proof outline
Want: E_δ E_σ sup_{A∈A_δ} |‖Aσ‖_2^2 − E‖Aσ‖_2^2|
We're in luck!
Theorem (Krahmer, Mendelson, Rauhut '11). For any collection A of matrices and σ a Rademacher vector,
E_σ sup_{A∈A} |‖Aσ‖_2^2 − E‖Aσ‖_2^2| ≲ γ_2^2(A, ‖·‖) + γ_2(A, ‖·‖) · d_F(A) + d_F(A) · d_{ℓ_2→ℓ_2}(A)
d_X(Y) is the radius of Y under the norm ‖·‖_X
γ_2 and Talagrand's “generic chaining”
• What is γ_2? Doesn't matter for us except for three things:
(1) γ_2(Y, ‖·‖_2) ∼ g(Y) for any Y ("majorizing measures") [Fernique '76], [Talagrand '01]
(2) γ_2(Y, ‖·‖_X) ≲ sup_{η>0} η · [log N(Y, ‖·‖_X, η)]^{1/2} for any norm (up to the hidden log factors)
(3) γ_2(Y, ‖·‖_X) is a function only of pairwise distances in Y under the ‖·‖_X norm
• Here N(Y, ‖·‖_X, η) is the minimum number of ‖·‖_X-balls of radius η centered at points in Y required to cover Y (usually called "covering" or "entropy" numbers)
Bounding the right hand side
Fix δ.
E_σ sup_{A∈A_δ} |‖Aσ‖_2^2 − E‖Aσ‖_2^2| ≲ γ_2^2(A_δ, ‖·‖) + γ_2(A_δ, ‖·‖) · d_F(A_δ) + d_F(A_δ) · d_{ℓ_2→ℓ_2}(A_δ)
• γ_2(A_δ, ‖·‖) depends only on ‖·‖-distances
• Facts: (1) A_{u,δ} − A_{w,δ} = A_{u−w,δ}, (2) ‖A_{u,δ}‖ = (1/√s) · max_{1≤i≤m} ‖u^{(i)}‖_2 =: |||u|||
• Therefore ‖A_{u,δ} − A_{w,δ}‖ = |||u − w|||
• Therefore γ_2(A_δ, ‖·‖) = γ_2(T, |||·|||)
Bounding the right hand side
Fix δ.
E_σ sup_{A∈A_δ} |‖Aσ‖_2^2 − E‖Aσ‖_2^2| ≲ γ_2^2(T, |||·|||) + γ_2(T, |||·|||) · d_F(A_δ) + d_F(A_δ) · d_{ℓ_2→ℓ_2}(A_δ)
• Facts: (1) A_{u,δ} − A_{w,δ} = A_{u−w,δ}, (2) ‖A_{u,δ}‖ = (1/√s) · max_{1≤i≤m} ‖u^{(i)}‖_2 = |||u|||
• Since T ⊂ S^{n−1}, ‖A_{u,δ}‖ ≤ (1/√s) · ‖u‖_2 = 1/√s
• Also ‖A_{u,δ}‖_F^2 = (1/s) Σ_{i=1}^m Σ_{j=1}^n δ_{i,j} u_j^2 = ‖u‖_2^2 = 1
• So the RHS is at most γ_2^2(T, |||·|||) + γ_2(T, |||·|||) + 1/√s
Warmup for main theorem: case of a linear subspace
• T = E, the intersection of a d-dimensional linear subspace and S^{n−1}
• Want to bound E_δ γ_2(T, |||·|||)
• One of our facts: γ_2(T, |||·|||) ≲ sup_{η>0} η · [log N(T, |||·|||, η)]^{1/2}
• Another stroke of luck. Dual Sudakov minoration [Bourgain, Lindenstrauss, Milman '89], [Pajor, Tomczak-Jaegermann '86]:
sup_{η>0} η · [log N(B_E, |||·|||, η)]^{1/2} ≲ E_g |||Ug|||
where g is a gaussian vector (very specific to linear subspaces)
• Here the columns of U ∈ R^{n×d} form an orthonormal basis for E
Warmup for main theorem: case of a linear subspace
Let U^{(i)} be U but with its jth row multiplied by δ_{i,j} (for each j).
E_g |||Ug||| = (1/√s) E_g max_{1≤i≤m} ‖U^{(i)}g‖_2
≤ (1/√s) [ max_{1≤i≤m} E_g ‖U^{(i)}g‖_2 + E_g max_{1≤i≤m} |‖U^{(i)}g‖_2 − E_g ‖U^{(i)}g‖_2| ]
≤ (1/√s) [ max_{1≤i≤m} E_g ‖U^{(i)}g‖_2 + ( Σ_{i=1}^m E_g |‖U^{(i)}g‖_2 − E_g ‖U^{(i)}g‖_2|^p )^{1/p} ]
≤ (1/√s) [ max_{1≤i≤m} ‖U^{(i)}‖_F + √(log m) · max_{1≤i≤m} ‖U^{(i)}‖ ]
(last step: chose p = log m and used gaussian concentration of Lipschitz functions [Pisier '86])
Warmup for main theorem: case of a linear subspace
Now must bound
(1/√s) · E_δ [ max_{1≤i≤m} ‖U^{(i)}‖_F + √(log m) max_{1≤i≤m} ‖U^{(i)}‖ ]      (1)
≤ (1/√s) · [ ( Σ_{i=1}^m E_δ ‖U^{(i)}‖_F^p )^{1/p} + √(log m) ( Σ_{i=1}^m E_δ ‖U^{(i)}‖^p )^{1/p} ]      (2)
choosing p = log m again, as usual, so that ℓ_p ∼ ℓ_∞.
• E_δ ‖U^{(i)}‖_F^p is basically the Chernoff bound
• E_δ ‖U^{(i)}‖^p can be handled by the non-commutative Khintchine inequality [Lust-Piquard '86], [Lust-Piquard, Pisier '91]
Going through the calculations gives that we can set:
m ≲ (d(log d)^2 + (log m)(log d)^2)/ε^2 ≈ d/ε^2 (up to log factors)
s ≲ 1 + (log d)^2(log m)^2 · max_{1≤j≤n} ‖P_E e_j‖_2^2 / ε^2
Can even set s = 1 if the "leverage scores" are small (true for random E)
Back to main theorem
General theorem. T ⊆ S^{n−1} arbitrary. Suffices to set m ≳ g^2(T) + 1, s ≳ 1 such that
max_{q ≤ (m/s) log s} (1/√(qs)) ‖ sup_{u∈T} Σ_{j=1}^n η_j u_j g_j ‖_{L^q_η L^1_g} ≪ 1
Proof ingredients
• Dual Sudakov minoration (already seen, [BLM'89], [PTJ'86]): sup_{η>0} η · [log N(E, |||·|||, η)]^{1/2} ≲ E_g |||Ug|||
• Maurey's empirical method [Pisier '81], [Carl '85]
• Duality of entropy numbers [Bourgain, Pajor, Szarek, Tomczak-Jaegermann '89]
General theorem: proof outline
• Recall: γ_2(Y, ‖·‖_X) ≲ sup_{η>0} η · [log N(Y, ‖·‖_X, η)]^{1/2}
• So we have to bound log N(T, |||·|||, η)
• Using duality of entropy numbers [Bourgain, Pajor, Szarek, Tomczak-Jaegermann '89], can bound
log N(T, |||·|||, η) ≲ log N(conv({B_{J_i}}), ‖·‖_T, (1/8)√s · η)
where ‖z‖_T = sup_{u∈T} |⟨u, z⟩|.
• Ultimately want to use dual Sudakov; we're not there yet
• Maurey's lemma: A tool to bound covering numbers of convex combinations of spaces
log N(conv({B_{J_i}}), ‖·‖_T, ε) ≲ (log m)/ε^2 + log(1/ε) · max_{k ≤ 1/ε^2} max_{|A|=k} log N( (1/k) Σ_{i∈A} B_{J_i}, ‖·‖_T, ε )
• Recall: γ2(Y , ‖ · ‖X ) supη>0 η · [logN (Y , ‖ · ‖X , η)]1/2
• So we have to bound logN (T , |||·|||, η)
• Using duality of entropy numbers [Bourgain, Pajor, Szarek,Tomczak-Jaegermann ’89], can bound
logN (T , |||·|||, η) logN (conv(BJi), ‖ · ‖T ,
1
8
√sη)
where ‖z‖T = supu∈T | 〈u, z〉 |.
• Ultimately want to use dual Sudakov; we’re not there yet• Maurey’s lemma: A tool to bound covering numbers of
convex combinations of spaces
logN (conv(BJi ), ‖·‖T , ε) .logm
ε2+log
1
εmaxk≤ 1
ε2
max|A|=k
logN (1
k
∑i∈A
BJi , ‖·‖T , ε)
General theorem: proof outline
• Recall: γ2(Y , ‖ · ‖X ) supη>0 η · [logN (Y , ‖ · ‖X , η)]1/2
• So we have to bound logN (T , |||·|||, η)
• Using duality of entropy numbers [Bourgain, Pajor, Szarek,Tomczak-Jaegermann ’89], can bound
logN (T , |||·|||, η) logN (conv(BJi), ‖ · ‖T ,
1
8
√sη)
where ‖z‖T = supu∈T | 〈u, z〉 |.• Ultimately want to use dual Sudakov; we’re not there yet
• Maurey’s lemma: A tool to bound covering numbers ofconvex combinations of spaces
logN (conv(BJi ), ‖·‖T , ε) .logm
ε2+log
1
εmaxk≤ 1
ε2
max|A|=k
logN (1
k
∑i∈A
BJi , ‖·‖T , ε)
General theorem: proof outline
• Recall: γ2(Y , ‖ · ‖X ) supη>0 η · [logN (Y , ‖ · ‖X , η)]1/2
• So we have to bound logN (T , |||·|||, η)
• Using duality of entropy numbers [Bourgain, Pajor, Szarek,Tomczak-Jaegermann ’89], can bound
logN (T , |||·|||, η) logN (conv(BJi), ‖ · ‖T ,
1
8
√sη)
where ‖z‖T = supu∈T | 〈u, z〉 |.• Ultimately want to use dual Sudakov; we’re not there yet• Maurey’s lemma: A tool to bound covering numbers of
convex combinations of spaces
logN (conv(BJi ), ‖·‖T , ε) .logm
ε2+log
1
εmaxk≤ 1
ε2
max|A|=k
logN (1
k
∑i∈A
BJi , ‖·‖T , ε)
General theorem: proof outline
• Define U_α = U_α(δ) = { j = 1, . . . , n : Σ_{i∈A} δ_{i,j} ∼ 2^α }
• Can show (1/k) Σ_{i∈A} B_{J_i} ⊂ Σ_α (1/√k) 2^{α/2} E_α, where E_α is the linear subspace of vectors supported on U_α
• Therefore log N( (1/k) Σ_{i∈A} B_{J_i}, ‖·‖_T, ε ) ≲ Σ_α log N( (1/√k) 2^{α/2} E_α, ‖·‖_T, ε/log m )
• Dual Sudakov minoration! (. . . plus some calculations)
Conclusion
Open Problems
• OPEN: Prove the conjecture: to get a subspace embedding with probability 1 − δ, can set m = O((d + log(1/δ))/ε^2), s = O(log(d/δ)/ε). Easier: obtain this m with s = m via the moment method. ([Cohen '14] has made progress via other methods)
• OPEN: Show that the tradeoff m = O(d^{1+γ}/ε^2), s = poly(1/γ) · 1/ε is optimal for any distribution over subspace embeddings (the poly is probably linear; some progress in [N., Nguyễn '14])
• OPEN: Show that m = Ω(d^2/ε^2) is optimal for s = 1.
Partial progress: [N., Nguyễn '13] shows m = Ω(d^2)
• OPEN: Improve the parameters of [Bourgain, N. '14]: s should be ∝ 1/ε, not 1/ε^2, and remove the log factors.
• OPEN: Give a full tradeoff curve for s, m for the general theorem
• OPEN: The usual pipeline: m, s "big enough" ⇒ T preserved ⇒ application works.
Can we skip the middle step to get better bounds? e.g. for k-means clustering?