Lecture 12: Concentration of sums of independent random variables

May 15-20, 2020
1 Why is probability more and more important?
Probabilistic linear algebra
A new and vigorous research direction
2 Gaussian or normal distribution $N(\mu, \sigma^2)$

Standard normal random variable $X \sim N(0, 1)$:

(i) Tail: $\mathbb{P}\{|X| \ge t\} \le 2\exp(-t^2/2)$ for all $t \ge 0$.

(ii) Moment: $\|X\|_p = (\mathbb{E}|X|^p)^{1/p} = O(\sqrt{p})$ as $p \to \infty$.

(iii) Moment generating function (MGF): $\mathbb{E}\exp(\lambda X) = \exp(\lambda^2/2)$ for all $\lambda \in \mathbb{R}$.

(iv) MGF of the square: $\mathbb{E}\exp(cX^2) \le 2$ for some $c > 0$.

The sum of independent normal random variables is also normal.
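As a quick sanity check of property (i), here is a minimal Monte Carlo sketch (assuming NumPy is available; the sample size and grid of $t$ values are arbitrary illustration choices) comparing the empirical tail of a standard normal sample with the bound $2\exp(-t^2/2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)  # large sample from N(0, 1)

for t in [0.5, 1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(X) >= t)   # Monte Carlo estimate of P{|X| >= t}
    bound = 2 * np.exp(-t**2 / 2)         # tail bound from property (i)
    print(f"t = {t}: empirical = {empirical:.4f}, bound = {bound:.4f}")
```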
3 Sub-gaussian distributions
Theorem 1 (Sub-gaussian properties)

For a random variable $X$, the following properties are equivalent:

Tail: $\mathbb{P}\{|X| \ge t\} \le 2\exp(-t^2/K_1^2)$ for all $t \ge 0$;

Moment: $\|X\|_p = (\mathbb{E}|X|^p)^{1/p} \le K_2\sqrt{p}$ for all $p \ge 1$;

MGF of the square: $\mathbb{E}\exp(X^2/K_3^2) \le 2$.

Moreover, if $\mathbb{E}X = 0$, then these properties are also equivalent to the following one:

MGF: $\mathbb{E}\exp(\lambda X) \le \exp(\lambda^2 K_4^2)$ for all $\lambda \in \mathbb{R}$.

Remark: The parameters $K_i > 0$ appearing in these properties can be different. However, they may differ from each other by at most an absolute constant factor. This means that there exists an absolute constant $C$ such that property 1 implies property 2 with parameter $K_2 \le C K_1$, and similarly for every other pair of properties.
Definition 2

Random variables that satisfy one of the first three properties (and thus all of them) in Theorem 1 are called sub-gaussian.

The best $K_3$ is called the sub-gaussian norm of $X$ and is usually denoted $\|X\|_{\psi_2}$; that is,

$$\|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}\exp(X^2/t^2) \le 2\}.$$

Sub-gaussian random variable examples:

(i) Normal random variables $X \sim N(\mu, \sigma^2)$;

(ii) Bernoulli random variable $X \in \{0, 1\}$ with equal probabilities;

(iii) More generally, any bounded random variable $X$.

Not sub-gaussian random variable examples:

Poisson, exponential, Pareto, and Cauchy distributions.
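To make Definition 2 concrete, here is a minimal sketch (assuming NumPy; the bracketing interval and iteration count are arbitrary) that estimates $\|X\|_{\psi_2}$ from a sample by solving $\mathbb{E}\exp(X^2/t^2) = 2$ for $t$ by bisection. For a symmetric $\pm 1$ variable the exact value is $1/\sqrt{\log 2} \approx 1.201$, which the estimate should approach:

```python
import numpy as np

def psi2_norm(samples, lo=1e-3, hi=100.0, iters=60):
    """Approximate ||X||_{psi_2} = inf{t > 0 : E exp(X^2/t^2) <= 2}
    from a Monte Carlo sample, by bisection on t."""
    def mgf_of_square(t):
        return np.mean(np.exp(samples**2 / t**2))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mgf_of_square(mid) <= 2:
            hi = mid   # condition holds: a smaller t may still work
        else:
            lo = mid   # condition fails: t must be larger
    return hi

rng = np.random.default_rng(1)
rademacher = rng.choice([-1.0, 1.0], size=200_000)
print(psi2_norm(rademacher))  # exact value is 1/sqrt(log 2) ~ 1.201
```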
Theorem 3 (Sums of sub-gaussians)

Let $X_1, \dots, X_N$ be independent, mean-zero, sub-gaussian random variables. Then $\sum_{i=1}^N X_i$ is sub-gaussian, and

$$\Big\| \sum_{i=1}^N X_i \Big\|_{\psi_2}^2 \le C \sum_{i=1}^N \|X_i\|_{\psi_2}^2,$$

where $C$ is an absolute constant.

Proof: Let us bound the MGF of the sum. For any $\lambda \in \mathbb{R}$,

$$\mathbb{E}\exp\Big(\lambda \sum_{i=1}^N X_i\Big) = \prod_{i=1}^N \mathbb{E}\exp(\lambda X_i) \quad \text{(using independence)}$$
$$\le \prod_{i=1}^N \exp\big(C\lambda^2 \|X_i\|_{\psi_2}^2\big) \quad \text{(by the last property in Theorem 1)}$$
$$= \exp(\lambda^2 K^2), \quad \text{where } K^2 = C \sum_{i=1}^N \|X_i\|_{\psi_2}^2.$$

By the MGF property in Theorem 1, this bound shows that the sum is sub-gaussian with $\big\|\sum_{i=1}^N X_i\big\|_{\psi_2} \le C' K$ for an absolute constant $C'$, which proves the claim.
3.1 Hoeffding's inequality

We rewrite Theorem 3 as a concentration inequality by using the first property in Theorem 1.
Theorem 4 (Hoeffding's inequality)

Let $X_1, \dots, X_N$ be independent, mean-zero, sub-gaussian random variables. Then for every $t \ge 0$ we have

$$\mathbb{P}\Big\{ \Big| \sum_{i=1}^N X_i \Big| \ge t \Big\} \le 2\exp\Big( -\frac{c\,t^2}{\sum_{i=1}^N \|X_i\|_{\psi_2}^2} \Big).$$

Remark 5

Hoeffding's inequality controls how far, and with what probability, a sum of independent random variables can deviate from its mean, which is zero.
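As an illustration (a sketch, assuming NumPy; sample sizes are arbitrary), the following compares the empirical tail of a sum of $N$ symmetric $\pm 1$ variables with the classical Hoeffding bound for this case, $2\exp(-t^2/2N)$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials = 100, 200_000
sums = rng.choice([-1, 1], size=(trials, N)).sum(axis=1)  # one sum per trial

for t in [10, 20, 30]:
    empirical = np.mean(np.abs(sums) >= t)   # P{|sum| >= t}, estimated
    bound = 2 * np.exp(-t**2 / (2 * N))      # Hoeffding bound for +/-1 variables
    print(f"t = {t}: empirical = {empirical:.5f}, bound = {bound:.5f}")
```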
4 Sub-exponential distributions
The square $X^2$ of a normal random variable $X \sim N(0, 1)$ is not sub-gaussian.

Theorem 6 (Sub-exponential properties)

For a random variable $X$, the following properties are equivalent:

Tail: $\mathbb{P}\{|X| \ge t\} \le 2\exp(-t/K_1)$ for all $t \ge 0$;

Moment: $\|X\|_p = (\mathbb{E}|X|^p)^{1/p} \le K_2\, p$ for all $p \ge 1$;

MGF of the absolute value: $\mathbb{E}\exp(|X|/K_3) \le 2$.

Moreover, if $\mathbb{E}X = 0$, then these properties imply the following one:

MGF: $\mathbb{E}\exp(\lambda X) \le \exp(\lambda^2 K_4^2)$ for $|\lambda| \le 1/K_4$.

Definition 7

Random variables that satisfy one of the first three properties (and thus all of them) in Theorem 6 are called sub-exponential.
The best $K_3$ is called the sub-exponential norm of $X$ and is usually denoted $\|X\|_{\psi_1}$; that is,

$$\|X\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}\exp(|X|/t) \le 2\}.$$

A random variable $X$ is sub-gaussian if and only if $X^2$ is sub-exponential; moreover,

$$\|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2.$$
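To illustrate this identity, here is a minimal Monte Carlo sketch (assuming NumPy; the bracketing interval and sample size are arbitrary). For $X \sim N(0,1)$ both sides can be computed in closed form: $\mathbb{E}\exp(X^2/t^2) = (1 - 2/t^2)^{-1/2}$, which gives $\|X\|_{\psi_2}^2 = \|X^2\|_{\psi_1} = 8/3 \approx 2.67$, and the estimate below should land near that value (typically slightly below it, since Monte Carlo truncates the tail):

```python
import numpy as np

def psi1_norm(samples, lo=1e-3, hi=1000.0, iters=80):
    """Approximate ||Y||_{psi_1} = inf{t > 0 : E exp(|Y|/t) <= 2} by bisection on t."""
    def mgf(t):
        return np.mean(np.exp(np.abs(samples) / t))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mgf(mid) <= 2:
            hi = mid   # condition holds: try a smaller t
        else:
            lo = mid   # condition fails: t must be larger
    return hi

rng = np.random.default_rng(3)
X = rng.standard_normal(500_000)
print(psi1_norm(X**2))  # should be close to ||X||_{psi_2}^2 = 8/3 ~ 2.67
```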
4.1 Bernstein's inequality
Theorem 8 (Bernstein's inequality)

Let $X_1, \dots, X_N$ be independent, mean-zero, sub-exponential random variables. Then for every $t \ge 0$ we have

$$\mathbb{P}\Big\{ \Big| \sum_{i=1}^N X_i \Big| \ge t \Big\} \le 2\exp\Big[ -c \min\Big( \frac{t^2}{\sum_{i=1}^N \|X_i\|_{\psi_1}^2}, \frac{t}{\max_i \|X_i\|_{\psi_1}} \Big) \Big].$$
Proof: Choose $\lambda \ge 0$ and use Markov's inequality to get (with $S = \sum_{i=1}^N X_i$):

$$\mathbb{P}\{S \ge t\} = \mathbb{P}\{\exp(\lambda S) \ge \exp(\lambda t)\} \le e^{-\lambda t}\, \mathbb{E}\exp(\lambda S).$$

Then by independence we have

$$\mathbb{P}\{S \ge t\} \le e^{-\lambda t} \prod_{i=1}^N \mathbb{E}\exp(\lambda X_i).$$

If we choose $\lambda$ small enough so that $0 < \lambda \le c/\max_i \|X_i\|_{\psi_1}$, then by the last property in Theorem 6 we have

$$\mathbb{E}\exp(\lambda X_i) \le \exp\big( C\lambda^2 \|X_i\|_{\psi_1}^2 \big).$$

Hence

$$\mathbb{P}\{S \ge t\} \le \exp\big( -\lambda t + C\lambda^2\sigma^2 \big), \qquad \sigma^2 = \sum_{i=1}^N \|X_i\|_{\psi_1}^2.$$

The remaining part is left as an exercise.
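For orientation, here is one standard way the exercise can be finished (a sketch; constants are not optimized). Write $K_{\max} = \max_i \|X_i\|_{\psi_1}$ and minimize the exponent $-\lambda t + C\lambda^2\sigma^2$ over $0 < \lambda \le c/K_{\max}$. If the unconstrained minimizer $\lambda^* = t/(2C\sigma^2)$ is admissible, it gives the bound $\exp(-t^2/(4C\sigma^2))$; otherwise take $\lambda = c/K_{\max}$, and the case condition $c/K_{\max} < t/(2C\sigma^2)$ gives $-\lambda t + C\lambda^2\sigma^2 \le -ct/(2K_{\max})$. Combining both cases,

$$\mathbb{P}\{S \ge t\} \le \exp\Big[ -c' \min\Big( \frac{t^2}{\sigma^2}, \frac{t}{K_{\max}} \Big) \Big],$$

and applying the same argument to $-S$ with a union bound yields the factor 2 in Theorem 8.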
Why does Bernstein's inequality have a mixture of two tails?

(i) The sub-exponential tail should of course be there. Indeed, even if the entire sum consisted of a single term $X_i$, the best bound we could hope for would be of the form $\exp(-ct/\|X_i\|_{\psi_1})$.

(ii) The sub-gaussian term can be explained by the central limit theorem, which states that the sum becomes approximately normal as the number of terms $N$ increases to infinity.
Remark 9 (Bernstein's inequality for bounded random variables)

Suppose further that the random variables $X_i$ are uniformly bounded, which is a stronger assumption than being sub-gaussian. If $K > 0$ is such that $|X_i| \le K$ almost surely for all $i$, then for every $t \ge 0$ we have

$$\mathbb{P}\Big\{ \Big| \sum_{i=1}^N X_i \Big| \ge t \Big\} \le 2\exp\Big( -\frac{t^2/2}{\sigma^2 + CKt} \Big),$$

where $\sigma^2 = \sum_{i=1}^N \mathbb{E}X_i^2$ is the variance of the sum.
Note that $\sigma^2 + CKt \le 2\max(\sigma^2, CKt)$. So we can state the probability bound as

$$2\exp\Big[ -c \min\Big( \frac{t^2}{\sigma^2}, \frac{t}{K} \Big) \Big].$$

Just as before, here we also have a mixture of two tails, sub-gaussian and sub-exponential:

The sub-gaussian tail is a bit sharper than in Theorem 8, since it depends on the variances rather than the sub-gaussian norms of $X_i$.

The sub-exponential tail, on the other hand, is weaker, since it depends on the sup-norms rather than the sub-exponential norms of $X_i$.
More on concentration
5 Sub-gaussian random vectors

Definition: Consider a random vector $X$ taking values in $\mathbb{R}^n$. We call $X$ a sub-gaussian random vector if all one-dimensional marginals of $X$, i.e. the random variables $\langle X, x \rangle$ for all $x \in \mathbb{R}^n$, are sub-gaussian.

The sub-gaussian norm of $X$ is defined as

$$\|X\|_{\psi_2} = \sup_{x \in \mathbb{R}^n,\, \|x\|_2 = 1} \|\langle X, x \rangle\|_{\psi_2}.$$

Sub-gaussian random vector examples:

(i) The standard normal distribution $N(0, I_n)$ (why?);

(ii) The uniform distribution on the centered Euclidean sphere of radius $\sqrt{n}$;

(iii) The uniform distribution on the cube $\{-1, 1\}^n$;

(iv) A random vector $X = (X_1, \dots, X_n)$ with independent, sub-gaussian coordinates is sub-gaussian, with $\|X\|_{\psi_2} \le C \max_i \|X_i\|_{\psi_2}$.
6 Johnson-Lindenstrauss Lemma

Concentration inequalities like Hoeffding's and Bernstein's are successfully used in the analysis of algorithms.

Let us give one example, for the problem of dimension reduction. Suppose we have some data that is represented as a set of $N$ points in $\mathbb{R}^n$. We would like to compress the data by representing it in a lower-dimensional space $\mathbb{R}^m$ instead of $\mathbb{R}^n$, with $m \ll n$. By how much can we reduce the dimension without losing the important features of the data?

The basic result in this direction is the Johnson-Lindenstrauss Lemma. It states that a remarkably simple dimension reduction method works: a random linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ with $m \sim \log N$. The logarithmic function grows very slowly, so we can usually reduce the dimension dramatically.

What exactly is a random linear map? We consider an $m \times n$ matrix $A$ whose rows are independent, mean-zero, isotropic, and sub-gaussian random vectors in $\mathbb{R}^n$.
Theorem 10 (Johnson-Lindenstrauss Lemma)

Let $\mathcal{X}$ be a set of $N$ points in $\mathbb{R}^n$ and $\varepsilon \in (0, 1)$. Consider an $m \times n$ matrix $A$ whose rows are independent, mean-zero, isotropic, and sub-gaussian random vectors $X_i$ in $\mathbb{R}^n$. Rescale $A$ by defining the "Gaussian random projection"

$$P = A/\sqrt{m}.$$

Assume that

$$m \ge C\varepsilon^{-2}\log N,$$

where $C$ is an appropriately large constant that depends only on the sub-gaussian norms of the vectors $X_i$. Then, with high probability (say, 0.99), the map $P$ preserves the distances between all points in $\mathcal{X}$ with error $\varepsilon$; that is, for all $x, y \in \mathcal{X}$,

$$(1 - \varepsilon)\|x - y\|_2 \le \|Px - Py\|_2 \le (1 + \varepsilon)\|x - y\|_2.$$
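To see Theorem 10 in action, here is a minimal numerical sketch (assuming NumPy; the dimensions n, N, m below are arbitrary illustration choices, and the rows of A are drawn from $N(0, I_n)$, one valid choice of isotropic sub-gaussian rows). It measures the worst relative distortion of pairwise distances under $P = A/\sqrt{m}$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, m = 2_000, 50, 500                  # ambient dim, #points, target dim

points = rng.standard_normal((N, n))      # the point set X
A = rng.standard_normal((m, n))           # rows ~ N(0, I_n)
P = A / np.sqrt(m)                        # rescaled "Gaussian random projection"
projected = points @ P.T                  # images Px for all x in X

worst = 0.0
for i in range(N):
    for j in range(i + 1, N):
        d_orig = np.linalg.norm(points[i] - points[j])
        d_proj = np.linalg.norm(projected[i] - projected[j])
        worst = max(worst, abs(d_proj / d_orig - 1))
print(f"worst relative distortion: {worst:.3f}")  # small, on the order of sqrt(log(N)/m)
```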
Proof: By linearity of $P$, and since $1 - \varepsilon \ge (1 - \varepsilon)^2$ and $1 + \varepsilon \le (1 + \varepsilon)^2$, it is sufficient to prove that

$$1 - \varepsilon \le \|Pz\|_2^2 \le 1 + \varepsilon \quad \text{for all } z \in T,$$

where

$$T = \Big\{ \frac{x - y}{\|x - y\|_2} : x, y \in \mathcal{X},\ x \ne y \Big\}.$$

Since $Pz = Az/\sqrt{m}$, it is enough to show that

$$\Big| \frac{1}{m} \sum_{i=1}^m \langle X_i, z \rangle^2 - 1 \Big| \le \varepsilon \quad \text{for all } z \in T.$$

We can prove this inequality by combining concentration and a union bound.
In order to use concentration, we first fix $z \in T$. By assumption, the random variables $\langle X_i, z \rangle^2 - 1$ are independent, they have zero mean (why? Exercise), and they are sub-exponential (why? Exercise). Then Bernstein's inequality gives (why? Exercise)

$$\mathbb{P}\Big\{ \Big| \frac{1}{m} \sum_{i=1}^m \langle X_i, z \rangle^2 - 1 \Big| > \varepsilon \Big\} \le 2\exp(-c\varepsilon^2 m).$$

Finally, we can unfix $z$ by taking a union bound over all possible $z \in T$:

$$\mathbb{P}\Big\{ \max_{z \in T} \Big| \frac{1}{m} \sum_{i=1}^m \langle X_i, z \rangle^2 - 1 \Big| > \varepsilon \Big\} \le \sum_{z \in T} \mathbb{P}\Big\{ \Big| \frac{1}{m} \sum_{i=1}^m \langle X_i, z \rangle^2 - 1 \Big| > \varepsilon \Big\} \le |T| \cdot 2\exp(-c\varepsilon^2 m).$$

By definition of $T$, we have $|T| \le N^2$. So if we choose $m \ge C\varepsilon^{-2}\log N$ with an appropriately large constant $C$, we can make

$$|T| \cdot 2\exp(-c\varepsilon^2 m) \le 0.01.$$

The proof is complete.
1 Why is probability more and more important
Probabilistic linear algebra
A new and vigorous research direction
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 2 17
2 Gaussian or normal distribution N (microσ2)
Standard normal random variable X sim N (0 1)
(i) TailP|X| ge t le 2 exp(minust22) forall t ge 0
(ii) Moment
983042X983042p = (E|X|p)1p = O(radicp) as p rarr infin
(iii) Moment Generation function (MGF)
E exp(λX) = exp(λ22) for all λ isin R
(iv) MGF of square
E exp(cX2) le 2 for some c gt 0
The sum of independent normal random variables is also normal
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 3 17
3 Sub-gaussian distributions
Theorem 1 (Sub-gaussian properties)
For a random variable X the following properties are equivalent
Tail P|X| ge t le 2 exp(minust2K21 ) for all t ge 0
Moment 983042X983042p = (E|X|p)1p le K2radicp for all p ge 1
MGF of square E exp(X2K23 ) le 2
Moreover if EX = 0 then these properties are also equivalent to thefollowing one
MGF E exp(λX) le exp(λ2K24 ) for all λ isin R
Remark The parameters Ki gt 0 appearing in these properties can bedifferent However they may differ from each other by at most anabsolute constant factor This means that there exists an absoluteconstant C such that property 1 implies property 2 with parameterK2 le CK1 and similarly for every other pair or properties
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 4 17
Definition 2
Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 1 are called sub-gaussian
The best K3 is called the sub-gaussian norm of X and is usuallydenoted 983042X983042ψ2 that is
983042X983042ψ2 = inft gt 0 E exp(X2t2) le 2
Sub-gaussian random variable examples
(i) Normal random variables X sim N (microσ2)
(ii) Bernoulli random variable X = 0 1 with equal probabilities
(iii) More generally any bounded random variable X
Not sub-gaussian random variable examples
Poisson exponential Pareto and Cauchy distributions
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 5 17
Theorem 3 (Sums of sub-gaussians)
Let X1 XN be independent mean zero sub-gaussian randomvariables Then
983123Ni=1Xi is sub-gaussian and
983056983056983056983056983131N
i=1Xi
9830569830569830569830562
ψ2
le C983131N
i=1983042Xi9830422ψ2
where C is an absolute constant
Proof Let us bound the MGF of the sum for any λ isin R
E exp
983061λ983131N
i=1Xi
983062=
983132N
i=1E exp(λXi) (using independence)
le983132N
i=1exp(Cλ2983042Xi9830422ψ2
) (by last property in Theorem 1)
= exp(λ2K2) where K2 = C983131N
i=1983042Xi9830422ψ2
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 6 17
31 Hoeffdingrsquos inequality
We rewrite Theorem 3 as a concentration inequality by using thefirst property in Theorem 1
Theorem 4 (Hoeffdingrsquos inequality)
Let X1 XN be independent mean zero sub-gaussian randomvariables Then for every t ge 0 we have
P983069983055983055983055983055
983131N
i=1Xi
983055983055983055983055 ge t
983070le 2 exp
983075minus ct2983123N
i=1 983042Xi9830422ψ2
983076
Remark 5
Hoeffdingrsquos inequality controls how far and with what probability a sumof independent random variables can deviate from its mean which iszero
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 7 17
4 Sub-exponential distributions
A square X2 of a normal random variable X sim N (0 1) is notsub-gaussian
Theorem 6 (Sub-exponential properties)
For a random variable X the following properties are equivalent
Tail P|X| ge t le 2 exp(minustK1) for all t ge 0
Moment 983042X983042p = (E|X|p)1p le pK2 for all p ge 1
MGF of square E exp(|X|K3) le 2
Moreover if EX = 0 then these properties imply the following one
MGF E exp(λX) le exp(λ2K24 ) for |λ| le 1K4
Definition 7
Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 6 are called sub-exponential
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 8 17
The best K3 is called the sub-exponential norm of X and isusually denoted 983042X983042ψ1 that is
983042X983042ψ1 = inft gt 0 E exp(|X|t) le 2
All squares of sub-gaussian random variables are sub-exponentialrandom variables (Conversely) We have
983042X2983042ψ1 = 983042X9830422ψ2
41 Bernsteinrsquos inequality
Theorem 8 (Bernsteinrsquos inequality)
Let X1 XN be independent mean zero sub-exponential randomvariables Then for every t ge 0 we have
P
983083983055983055983055983055983055
N983131
i=1
Xi
983055983055983055983055983055 983245 t
983084983249 2 exp
983077minuscmin
983075t2
983123Ni=1 983042Xi9830422ψ1
t
maxi 983042Xi983042ψ1
983076983078
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 9 17
Proof Choose λ ge 0 and use Markovrsquos inequality to get (S =983123N
i=1Xi)
PS 983245 t = Pexp(λS) 983245 exp(λt) 983249 eminusλtE exp(λS)
Then by independence we have
PS 983245 t le eminusλt983132N
i=1E exp(λXi)
If we choose λ small enough so that 0 lt λ 983249 cmaxi983042Xi983042ψ1
then by the
last property in Theorem 6 we have
E exp (λXi) 983249 exp983059Cλ2 983042Xi9830422ψ1
983060
Hence
PS 983245 t le exp983043minusλt+ Cλ2σ2
983044 σ2 =
N983131
i=1
983042Xi9830422ψ1
The remaining part is left as an exerciseIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 10 17
Why does Bernsteinrsquos inequality have a mixture of two tails
(i) The sub-exponential tail should of course be there Indeedeven if the entire sum consisted of a single term Xi the bestbound we could hope for would be of the form exp(minusct983042Xi983042ψ1)
(ii) The sub-gaussian term could be explained by the central limittheorem which states that the sum should becomes approximatelynormal as the number of terms N increases to infinity
Remark 9 (Bernsteinrsquos inequality for bounded random variables)
Suppose further the random variables Xi are uniformly bounded whichis a stronger assumption than being sub-gaussian If K gt 0 is such that|Xi| le K almost surely for all i then for every t ge 0 we have
P
983083983055983055983055983055983055
N983131
i=1
Xi
983055983055983055983055983055 983245 t
983084983249 2 exp
983061minus t22
σ2 + CKt
983062
where σ2 =983123N
i=1 EX2i is the variance of the sum
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 11 17
Note that σ2 + CKt 983249 2max983043σ2 CKt
983044 So we can state the
probability bound as
2 exp
983063minuscmin
983061t2
σ2t
K
983062983064
Just as before here we also have a mixture of two tailssub-gaussian and sub-exponential
The sub-gaussian tail is a bit sharper than in Theorem 8 since itdepends on the variances rather than sub-gaussian norms of Xi
The sub-exponential tail on the other hand is weaker since itdepends on the sup-norms rather than the sub-exponential normsof Xi
More on concentration
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 12 17
5 Sub-gaussian random vectors
Definition Consider a random vector X taking values in Rn Wecall X a sub-gaussian random vector if all one-dimensionalmarginals of X ie the random variables 〈Xx〉 for all x isin Rnare sub-gaussian
The sub-gaussian norm of X is defined as
983042X983042ψ2 = supxisinRn983042x9830422=1
983042〈Xx〉983042ψ2
Sub-gaussian random vector examples
(i) The standard normal distribution N (0 In) (why)
(ii) The uniform distribution on the centered Euclidean sphere ofradius
radicn
(iii) The uniform distribution on the cube minus1 1n
(iv) A random vector X = (X1 middot middot middot Xn) with independent andsub-gaussian coordinates is sub-gaussian 983042X983042ψ2 983249 Cmaxi 983042Xi983042ψ2
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 13 17
6 Johnson-Lindenstrauss Lemma
Concentration inequalities like Hoeffdingrsquos and Bernsteinrsquos aresuccessfully used in the analysis of algorithms
Let us give one example for the problem of dimension reductionSuppose we have some data that is represented as a set of Npoints in Rn We would like to compress the data by representingit in a lower dimensional space Rm instead of Rn with m ≪ n Byhow much can we reduce the dimension without loosing theimportant features of the data
The basic result in this direction is the Johnson-LindenstraussLemma It states that a remarkably simple dimension reductionmethod works - a random linear map from Rn to Rm withm sim logN The logarithmic function grows very slowly so we canusually reduce the dimension dramatically
What exactly is a random linear map We consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors in Rn
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 14 17
Theorem 10 (Johnson-Lindenstrauss Lemma)
Let X be a set of N points in Rn and ε isin (0 1) Consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors Xi in Rn Rescale A by defining theldquoGaussian random projectionrdquo
P = Aradicm
Assume thatm ge Cεminus2 logN
where C is an appropriately large constant that depends only on thesub-gaussian norms of the vectors Xi Then with high probability (say099) the map P preserves the distances between all points in X witherror ε that is for all xy isin X
(1minus ε)983042xminus y9830422 983249 983042Pxminus Py9830422 983249 (1 + ε)983042xminus y9830422
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 15 17
Proof By linearity of P 1minus ε ge (1minus ε)2 and 1 + ε le (1 + ε)2 it issufficient to prove that
1minus ε le 983042P z98304222 le 1 + ε for all z isin T
where
T =
983069xminus y
983042xminus y9830422 xy isin X and x ∕= y
983070
By P z = Azradicm it is enough to show that
9830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 le ε for all z isin T
We can prove this inequality by combining concentration and a unionbound
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 16 17
In order to use concentration we first fix z isin T By assumption therandom variables 〈Xi z〉2 minus 1 are independent they have zero mean(why Exercise) and they are sub-exponential (why Exercise) ThenBernsteinrsquos inequality gives (why Exercise)
P
9830839830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084983249 2 exp
983043minuscε2m
983044
Finally we can unfix z by taking a union bound over all possible z isin T
P
983083maxzisinT
9830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084983249
983131
zisinTP
9830839830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084
983249 |T | middot 2 exp983043minuscε2m
983044
By definition of T we have |T | le N2 So if we choose m ge Cεminus2 logNwith appropriately large constant C we can make
|T | middot 2 exp983043minuscε2m
983044le 001
The proof is completeIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 17 17
2 Gaussian or normal distribution N (microσ2)
Standard normal random variable X sim N (0 1)
(i) TailP|X| ge t le 2 exp(minust22) forall t ge 0
(ii) Moment
983042X983042p = (E|X|p)1p = O(radicp) as p rarr infin
(iii) Moment Generation function (MGF)
E exp(λX) = exp(λ22) for all λ isin R
(iv) MGF of square
E exp(cX2) le 2 for some c gt 0
The sum of independent normal random variables is also normal
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 3 17
3 Sub-gaussian distributions
Theorem 1 (Sub-gaussian properties)
For a random variable X the following properties are equivalent
Tail P|X| ge t le 2 exp(minust2K21 ) for all t ge 0
Moment 983042X983042p = (E|X|p)1p le K2radicp for all p ge 1
MGF of square E exp(X2K23 ) le 2
Moreover if EX = 0 then these properties are also equivalent to thefollowing one
MGF E exp(λX) le exp(λ2K24 ) for all λ isin R
Remark The parameters Ki gt 0 appearing in these properties can bedifferent However they may differ from each other by at most anabsolute constant factor This means that there exists an absoluteconstant C such that property 1 implies property 2 with parameterK2 le CK1 and similarly for every other pair or properties
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 4 17
Definition 2
Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 1 are called sub-gaussian
The best K3 is called the sub-gaussian norm of X and is usuallydenoted 983042X983042ψ2 that is
983042X983042ψ2 = inft gt 0 E exp(X2t2) le 2
Sub-gaussian random variable examples
(i) Normal random variables X sim N (microσ2)
(ii) Bernoulli random variable X = 0 1 with equal probabilities
(iii) More generally any bounded random variable X
Not sub-gaussian random variable examples
Poisson exponential Pareto and Cauchy distributions
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 5 17
Theorem 3 (Sums of sub-gaussians)
Let X1 XN be independent mean zero sub-gaussian randomvariables Then
983123Ni=1Xi is sub-gaussian and
983056983056983056983056983131N
i=1Xi
9830569830569830569830562
ψ2
le C983131N
i=1983042Xi9830422ψ2
where C is an absolute constant
Proof Let us bound the MGF of the sum for any λ isin R
E exp
983061λ983131N
i=1Xi
983062=
983132N
i=1E exp(λXi) (using independence)
le983132N
i=1exp(Cλ2983042Xi9830422ψ2
) (by last property in Theorem 1)
= exp(λ2K2) where K2 = C983131N
i=1983042Xi9830422ψ2
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 6 17
31 Hoeffdingrsquos inequality
We rewrite Theorem 3 as a concentration inequality by using thefirst property in Theorem 1
Theorem 4 (Hoeffdingrsquos inequality)
Let X1 XN be independent mean zero sub-gaussian randomvariables Then for every t ge 0 we have
P983069983055983055983055983055
983131N
i=1Xi
983055983055983055983055 ge t
983070le 2 exp
983075minus ct2983123N
i=1 983042Xi9830422ψ2
983076
Remark 5
Hoeffdingrsquos inequality controls how far and with what probability a sumof independent random variables can deviate from its mean which iszero
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 7 17
4 Sub-exponential distributions
A square X2 of a normal random variable X sim N (0 1) is notsub-gaussian
Theorem 6 (Sub-exponential properties)
For a random variable X the following properties are equivalent
Tail P|X| ge t le 2 exp(minustK1) for all t ge 0
Moment 983042X983042p = (E|X|p)1p le pK2 for all p ge 1
MGF of square E exp(|X|K3) le 2
Moreover if EX = 0 then these properties imply the following one
MGF E exp(λX) le exp(λ2K24 ) for |λ| le 1K4
Definition 7
Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 6 are called sub-exponential
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 8 17
The best K3 is called the sub-exponential norm of X and isusually denoted 983042X983042ψ1 that is
983042X983042ψ1 = inft gt 0 E exp(|X|t) le 2
All squares of sub-gaussian random variables are sub-exponentialrandom variables (Conversely) We have
983042X2983042ψ1 = 983042X9830422ψ2
41 Bernsteinrsquos inequality
Theorem 8 (Bernsteinrsquos inequality)
Let X1 XN be independent mean zero sub-exponential randomvariables Then for every t ge 0 we have
P
983083983055983055983055983055983055
N983131
i=1
Xi
983055983055983055983055983055 983245 t
983084983249 2 exp
983077minuscmin
983075t2
983123Ni=1 983042Xi9830422ψ1
t
maxi 983042Xi983042ψ1
983076983078
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 9 17
Proof Choose λ ge 0 and use Markovrsquos inequality to get (S =983123N
i=1Xi)
PS 983245 t = Pexp(λS) 983245 exp(λt) 983249 eminusλtE exp(λS)
Then by independence we have
PS 983245 t le eminusλt983132N
i=1E exp(λXi)
If we choose λ small enough so that 0 lt λ 983249 cmaxi983042Xi983042ψ1
then by the
last property in Theorem 6 we have
E exp (λXi) 983249 exp983059Cλ2 983042Xi9830422ψ1
983060
Hence
PS 983245 t le exp983043minusλt+ Cλ2σ2
983044 σ2 =
N983131
i=1
983042Xi9830422ψ1
The remaining part is left as an exerciseIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 10 17
Why does Bernsteinrsquos inequality have a mixture of two tails
(i) The sub-exponential tail should of course be there Indeedeven if the entire sum consisted of a single term Xi the bestbound we could hope for would be of the form exp(minusct983042Xi983042ψ1)
(ii) The sub-gaussian term could be explained by the central limittheorem which states that the sum should becomes approximatelynormal as the number of terms N increases to infinity
Remark 9 (Bernsteinrsquos inequality for bounded random variables)
Suppose further the random variables Xi are uniformly bounded whichis a stronger assumption than being sub-gaussian If K gt 0 is such that|Xi| le K almost surely for all i then for every t ge 0 we have
P
983083983055983055983055983055983055
N983131
i=1
Xi
983055983055983055983055983055 983245 t
983084983249 2 exp
983061minus t22
σ2 + CKt
983062
where σ2 =983123N
i=1 EX2i is the variance of the sum
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 11 17
Note that σ2 + CKt 983249 2max983043σ2 CKt
983044 So we can state the
probability bound as
2 exp
983063minuscmin
983061t2
σ2t
K
983062983064
Just as before here we also have a mixture of two tailssub-gaussian and sub-exponential
The sub-gaussian tail is a bit sharper than in Theorem 8 since itdepends on the variances rather than sub-gaussian norms of Xi
The sub-exponential tail on the other hand is weaker since itdepends on the sup-norms rather than the sub-exponential normsof Xi
More on concentration
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 12 17
5 Sub-gaussian random vectors
Definition Consider a random vector X taking values in Rn Wecall X a sub-gaussian random vector if all one-dimensionalmarginals of X ie the random variables 〈Xx〉 for all x isin Rnare sub-gaussian
The sub-gaussian norm of X is defined as
983042X983042ψ2 = supxisinRn983042x9830422=1
983042〈Xx〉983042ψ2
Sub-gaussian random vector examples
(i) The standard normal distribution N (0 In) (why)
(ii) The uniform distribution on the centered Euclidean sphere ofradius
radicn
(iii) The uniform distribution on the cube minus1 1n
(iv) A random vector X = (X1 middot middot middot Xn) with independent andsub-gaussian coordinates is sub-gaussian 983042X983042ψ2 983249 Cmaxi 983042Xi983042ψ2
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 13 17
6 Johnson-Lindenstrauss Lemma
Concentration inequalities like Hoeffdingrsquos and Bernsteinrsquos aresuccessfully used in the analysis of algorithms
Let us give one example for the problem of dimension reductionSuppose we have some data that is represented as a set of Npoints in Rn We would like to compress the data by representingit in a lower dimensional space Rm instead of Rn with m ≪ n Byhow much can we reduce the dimension without loosing theimportant features of the data
The basic result in this direction is the Johnson-LindenstraussLemma It states that a remarkably simple dimension reductionmethod works - a random linear map from Rn to Rm withm sim logN The logarithmic function grows very slowly so we canusually reduce the dimension dramatically
What exactly is a random linear map We consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors in Rn
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 14 17
Theorem 10 (Johnson-Lindenstrauss Lemma)
Let X be a set of N points in Rn and ε isin (0 1) Consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors Xi in Rn Rescale A by defining theldquoGaussian random projectionrdquo
P = Aradicm
Assume thatm ge Cεminus2 logN
where C is an appropriately large constant that depends only on thesub-gaussian norms of the vectors Xi Then with high probability (say099) the map P preserves the distances between all points in X witherror ε that is for all xy isin X
(1minus ε)983042xminus y9830422 983249 983042Pxminus Py9830422 983249 (1 + ε)983042xminus y9830422
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 15 17
Proof By linearity of P 1minus ε ge (1minus ε)2 and 1 + ε le (1 + ε)2 it issufficient to prove that
1minus ε le 983042P z98304222 le 1 + ε for all z isin T
where
T =
983069xminus y
983042xminus y9830422 xy isin X and x ∕= y
983070
By P z = Azradicm it is enough to show that
9830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 le ε for all z isin T
We can prove this inequality by combining concentration and a unionbound
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 16 17
In order to use concentration we first fix z isin T By assumption therandom variables 〈Xi z〉2 minus 1 are independent they have zero mean(why Exercise) and they are sub-exponential (why Exercise) ThenBernsteinrsquos inequality gives (why Exercise)
P
9830839830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084983249 2 exp
983043minuscε2m
983044
Finally we can unfix z by taking a union bound over all possible z isin T
P
983083maxzisinT
9830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084983249
983131
zisinTP
9830839830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084
983249 |T | middot 2 exp983043minuscε2m
983044
By definition of T we have |T | le N2 So if we choose m ge Cεminus2 logNwith appropriately large constant C we can make
|T | middot 2 exp983043minuscε2m
983044le 001
The proof is completeIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 17 17
3 Sub-gaussian distributions
Theorem 1 (Sub-gaussian properties)
For a random variable X the following properties are equivalent
Tail P|X| ge t le 2 exp(minust2K21 ) for all t ge 0
Moment 983042X983042p = (E|X|p)1p le K2radicp for all p ge 1
MGF of square E exp(X2K23 ) le 2
Moreover if EX = 0 then these properties are also equivalent to thefollowing one
MGF E exp(λX) le exp(λ2K24 ) for all λ isin R
Remark The parameters Ki gt 0 appearing in these properties can bedifferent However they may differ from each other by at most anabsolute constant factor This means that there exists an absoluteconstant C such that property 1 implies property 2 with parameterK2 le CK1 and similarly for every other pair or properties
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 4 17
Definition 2
Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 1 are called sub-gaussian
The best K3 is called the sub-gaussian norm of X and is usuallydenoted 983042X983042ψ2 that is
983042X983042ψ2 = inft gt 0 E exp(X2t2) le 2
Sub-gaussian random variable examples
(i) Normal random variables X sim N (microσ2)
(ii) Bernoulli random variable X = 0 1 with equal probabilities
(iii) More generally any bounded random variable X
Not sub-gaussian random variable examples
Poisson exponential Pareto and Cauchy distributions
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 5 17
Theorem 3 (Sums of sub-gaussians)
Let X1 XN be independent mean zero sub-gaussian randomvariables Then
983123Ni=1Xi is sub-gaussian and
983056983056983056983056983131N
i=1Xi
9830569830569830569830562
ψ2
le C983131N
i=1983042Xi9830422ψ2
where C is an absolute constant
Proof Let us bound the MGF of the sum for any λ isin R
E exp
983061λ983131N
i=1Xi
983062=
983132N
i=1E exp(λXi) (using independence)
le983132N
i=1exp(Cλ2983042Xi9830422ψ2
) (by last property in Theorem 1)
= exp(λ2K2) where K2 = C983131N
i=1983042Xi9830422ψ2
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 6 17
31 Hoeffdingrsquos inequality
We rewrite Theorem 3 as a concentration inequality by using thefirst property in Theorem 1
Theorem 4 (Hoeffdingrsquos inequality)
Let X1 XN be independent mean zero sub-gaussian randomvariables Then for every t ge 0 we have
P983069983055983055983055983055
983131N
i=1Xi
983055983055983055983055 ge t
983070le 2 exp
983075minus ct2983123N
i=1 983042Xi9830422ψ2
983076
Remark 5
Hoeffdingrsquos inequality controls how far and with what probability a sumof independent random variables can deviate from its mean which iszero
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 7 17
4 Sub-exponential distributions
A square X2 of a normal random variable X sim N (0 1) is notsub-gaussian
Theorem 6 (Sub-exponential properties)
For a random variable X the following properties are equivalent
Tail P|X| ge t le 2 exp(minustK1) for all t ge 0
Moment 983042X983042p = (E|X|p)1p le pK2 for all p ge 1
MGF of square E exp(|X|K3) le 2
Moreover if EX = 0 then these properties imply the following one
MGF E exp(λX) le exp(λ2K24 ) for |λ| le 1K4
Definition 7
Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 6 are called sub-exponential
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 8 17
The best K3 is called the sub-exponential norm of X and isusually denoted 983042X983042ψ1 that is
983042X983042ψ1 = inft gt 0 E exp(|X|t) le 2
All squares of sub-gaussian random variables are sub-exponentialrandom variables (Conversely) We have
983042X2983042ψ1 = 983042X9830422ψ2
41 Bernsteinrsquos inequality
Theorem 8 (Bernsteinrsquos inequality)
Let X1 XN be independent mean zero sub-exponential randomvariables Then for every t ge 0 we have
P
983083983055983055983055983055983055
N983131
i=1
Xi
983055983055983055983055983055 983245 t
983084983249 2 exp
983077minuscmin
983075t2
983123Ni=1 983042Xi9830422ψ1
t
maxi 983042Xi983042ψ1
983076983078
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 9 17
Proof Choose λ ge 0 and use Markovrsquos inequality to get (S =983123N
i=1Xi)
PS 983245 t = Pexp(λS) 983245 exp(λt) 983249 eminusλtE exp(λS)
Then by independence we have
PS 983245 t le eminusλt983132N
i=1E exp(λXi)
If we choose λ small enough so that 0 lt λ 983249 cmaxi983042Xi983042ψ1
then by the
last property in Theorem 6 we have
E exp (λXi) 983249 exp983059Cλ2 983042Xi9830422ψ1
983060
Hence
PS 983245 t le exp983043minusλt+ Cλ2σ2
983044 σ2 =
N983131
i=1
983042Xi9830422ψ1
The remaining part is left as an exerciseIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 10 17
Why does Bernsteinrsquos inequality have a mixture of two tails
(i) The sub-exponential tail should of course be there Indeedeven if the entire sum consisted of a single term Xi the bestbound we could hope for would be of the form exp(minusct983042Xi983042ψ1)
(ii) The sub-gaussian term could be explained by the central limittheorem which states that the sum should becomes approximatelynormal as the number of terms N increases to infinity
Remark 9 (Bernsteinrsquos inequality for bounded random variables)
Suppose further the random variables Xi are uniformly bounded whichis a stronger assumption than being sub-gaussian If K gt 0 is such that|Xi| le K almost surely for all i then for every t ge 0 we have
P
983083983055983055983055983055983055
N983131
i=1
Xi
983055983055983055983055983055 983245 t
983084983249 2 exp
983061minus t22
σ2 + CKt
983062
where σ2 =983123N
i=1 EX2i is the variance of the sum
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 11 17
Note that σ2 + CKt 983249 2max983043σ2 CKt
983044 So we can state the
probability bound as
2 exp
983063minuscmin
983061t2
σ2t
K
983062983064
Just as before here we also have a mixture of two tailssub-gaussian and sub-exponential
The sub-gaussian tail is a bit sharper than in Theorem 8 since itdepends on the variances rather than sub-gaussian norms of Xi
The sub-exponential tail on the other hand is weaker since itdepends on the sup-norms rather than the sub-exponential normsof Xi
More on concentration
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 12 17
5 Sub-gaussian random vectors
Definition Consider a random vector X taking values in Rn Wecall X a sub-gaussian random vector if all one-dimensionalmarginals of X ie the random variables 〈Xx〉 for all x isin Rnare sub-gaussian
The sub-gaussian norm of X is defined as
983042X983042ψ2 = supxisinRn983042x9830422=1
983042〈Xx〉983042ψ2
Sub-gaussian random vector examples
(i) The standard normal distribution N (0 In) (why)
(ii) The uniform distribution on the centered Euclidean sphere ofradius
radicn
(iii) The uniform distribution on the cube minus1 1n
(iv) A random vector X = (X1 middot middot middot Xn) with independent andsub-gaussian coordinates is sub-gaussian 983042X983042ψ2 983249 Cmaxi 983042Xi983042ψ2
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 13 17
6 Johnson-Lindenstrauss Lemma
Concentration inequalities like Hoeffdingrsquos and Bernsteinrsquos aresuccessfully used in the analysis of algorithms
Let us give one example for the problem of dimension reductionSuppose we have some data that is represented as a set of Npoints in Rn We would like to compress the data by representingit in a lower dimensional space Rm instead of Rn with m ≪ n Byhow much can we reduce the dimension without loosing theimportant features of the data
The basic result in this direction is the Johnson-LindenstraussLemma It states that a remarkably simple dimension reductionmethod works - a random linear map from Rn to Rm withm sim logN The logarithmic function grows very slowly so we canusually reduce the dimension dramatically
What exactly is a random linear map We consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors in Rn
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 14 17
Theorem 10 (Johnson-Lindenstrauss Lemma)
Let X be a set of N points in Rn and ε isin (0 1) Consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors Xi in Rn Rescale A by defining theldquoGaussian random projectionrdquo
P = Aradicm
Assume thatm ge Cεminus2 logN
where C is an appropriately large constant that depends only on thesub-gaussian norms of the vectors Xi Then with high probability (say099) the map P preserves the distances between all points in X witherror ε that is for all xy isin X
(1minus ε)983042xminus y9830422 983249 983042Pxminus Py9830422 983249 (1 + ε)983042xminus y9830422
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 15 17
Proof By linearity of P 1minus ε ge (1minus ε)2 and 1 + ε le (1 + ε)2 it issufficient to prove that
1minus ε le 983042P z98304222 le 1 + ε for all z isin T
where
T =
983069xminus y
983042xminus y9830422 xy isin X and x ∕= y
983070
By P z = Azradicm it is enough to show that
9830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 le ε for all z isin T
We can prove this inequality by combining concentration and a unionbound
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 16 17
In order to use concentration we first fix z isin T By assumption therandom variables 〈Xi z〉2 minus 1 are independent they have zero mean(why Exercise) and they are sub-exponential (why Exercise) ThenBernsteinrsquos inequality gives (why Exercise)
P
9830839830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084983249 2 exp
983043minuscε2m
983044
Finally we can unfix z by taking a union bound over all possible z isin T
P
983083maxzisinT
9830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084983249
983131
zisinTP
9830839830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084
983249 |T | middot 2 exp983043minuscε2m
983044
By definition of T we have |T | le N2 So if we choose m ge Cεminus2 logNwith appropriately large constant C we can make
|T | middot 2 exp983043minuscε2m
983044le 001
The proof is completeIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 17 17
Definition 2
Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 1 are called sub-gaussian
The best K3 is called the sub-gaussian norm of X and is usuallydenoted 983042X983042ψ2 that is
983042X983042ψ2 = inft gt 0 E exp(X2t2) le 2
Sub-gaussian random variable examples
(i) Normal random variables X sim N (microσ2)
(ii) Bernoulli random variable X = 0 1 with equal probabilities
(iii) More generally any bounded random variable X
Not sub-gaussian random variable examples
Poisson exponential Pareto and Cauchy distributions
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 5 17
Theorem 3 (Sums of sub-gaussians)
Let X1 XN be independent mean zero sub-gaussian randomvariables Then
983123Ni=1Xi is sub-gaussian and
983056983056983056983056983131N
i=1Xi
9830569830569830569830562
ψ2
le C983131N
i=1983042Xi9830422ψ2
where C is an absolute constant
Proof Let us bound the MGF of the sum for any λ isin R
E exp
983061λ983131N
i=1Xi
983062=
983132N
i=1E exp(λXi) (using independence)
le983132N
i=1exp(Cλ2983042Xi9830422ψ2
) (by last property in Theorem 1)
= exp(λ2K2) where K2 = C983131N
i=1983042Xi9830422ψ2
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 6 17
31 Hoeffdingrsquos inequality
We rewrite Theorem 3 as a concentration inequality by using thefirst property in Theorem 1
Theorem 4 (Hoeffdingrsquos inequality)
Let X1 XN be independent mean zero sub-gaussian randomvariables Then for every t ge 0 we have
P983069983055983055983055983055
983131N
i=1Xi
983055983055983055983055 ge t
983070le 2 exp
983075minus ct2983123N
i=1 983042Xi9830422ψ2
983076
Remark 5
Hoeffdingrsquos inequality controls how far and with what probability a sumof independent random variables can deviate from its mean which iszero
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 7 17
4 Sub-exponential distributions
A square X2 of a normal random variable X sim N (0 1) is notsub-gaussian
Theorem 6 (Sub-exponential properties)
For a random variable X the following properties are equivalent
Tail P|X| ge t le 2 exp(minustK1) for all t ge 0
Moment 983042X983042p = (E|X|p)1p le pK2 for all p ge 1
MGF of square E exp(|X|K3) le 2
Moreover if EX = 0 then these properties imply the following one
MGF E exp(λX) le exp(λ2K24 ) for |λ| le 1K4
Definition 7
Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 6 are called sub-exponential
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 8 17
The best K3 is called the sub-exponential norm of X and isusually denoted 983042X983042ψ1 that is
983042X983042ψ1 = inft gt 0 E exp(|X|t) le 2
All squares of sub-gaussian random variables are sub-exponentialrandom variables (Conversely) We have
983042X2983042ψ1 = 983042X9830422ψ2
41 Bernsteinrsquos inequality
Theorem 8 (Bernsteinrsquos inequality)
Let X1 XN be independent mean zero sub-exponential randomvariables Then for every t ge 0 we have
P
983083983055983055983055983055983055
N983131
i=1
Xi
983055983055983055983055983055 983245 t
983084983249 2 exp
983077minuscmin
983075t2
983123Ni=1 983042Xi9830422ψ1
t
maxi 983042Xi983042ψ1
983076983078
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 9 17
Proof Choose λ ge 0 and use Markovrsquos inequality to get (S =983123N
i=1Xi)
PS 983245 t = Pexp(λS) 983245 exp(λt) 983249 eminusλtE exp(λS)
Then by independence we have
PS 983245 t le eminusλt983132N
i=1E exp(λXi)
If we choose λ small enough so that 0 lt λ 983249 cmaxi983042Xi983042ψ1
then by the
last property in Theorem 6 we have
E exp (λXi) 983249 exp983059Cλ2 983042Xi9830422ψ1
983060
Hence
PS 983245 t le exp983043minusλt+ Cλ2σ2
983044 σ2 =
N983131
i=1
983042Xi9830422ψ1
The remaining part is left as an exerciseIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 10 17
Why does Bernsteinrsquos inequality have a mixture of two tails
(i) The sub-exponential tail should of course be there Indeedeven if the entire sum consisted of a single term Xi the bestbound we could hope for would be of the form exp(minusct983042Xi983042ψ1)
(ii) The sub-gaussian term could be explained by the central limittheorem which states that the sum should becomes approximatelynormal as the number of terms N increases to infinity
Remark 9 (Bernsteinrsquos inequality for bounded random variables)
Suppose further the random variables Xi are uniformly bounded whichis a stronger assumption than being sub-gaussian If K gt 0 is such that|Xi| le K almost surely for all i then for every t ge 0 we have
P
983083983055983055983055983055983055
N983131
i=1
Xi
983055983055983055983055983055 983245 t
983084983249 2 exp
983061minus t22
σ2 + CKt
983062
where σ2 =983123N
i=1 EX2i is the variance of the sum
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 11 17
Note that σ2 + CKt 983249 2max983043σ2 CKt
983044 So we can state the
probability bound as
2 exp
983063minuscmin
983061t2
σ2t
K
983062983064
Just as before here we also have a mixture of two tailssub-gaussian and sub-exponential
The sub-gaussian tail is a bit sharper than in Theorem 8 since itdepends on the variances rather than sub-gaussian norms of Xi
The sub-exponential tail on the other hand is weaker since itdepends on the sup-norms rather than the sub-exponential normsof Xi
More on concentration
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 12 17
5 Sub-gaussian random vectors
Definition Consider a random vector X taking values in Rn Wecall X a sub-gaussian random vector if all one-dimensionalmarginals of X ie the random variables 〈Xx〉 for all x isin Rnare sub-gaussian
The sub-gaussian norm of X is defined as
983042X983042ψ2 = supxisinRn983042x9830422=1
983042〈Xx〉983042ψ2
Sub-gaussian random vector examples
(i) The standard normal distribution N (0 In) (why)
(ii) The uniform distribution on the centered Euclidean sphere ofradius
radicn
(iii) The uniform distribution on the cube minus1 1n
(iv) A random vector X = (X1 middot middot middot Xn) with independent andsub-gaussian coordinates is sub-gaussian 983042X983042ψ2 983249 Cmaxi 983042Xi983042ψ2
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 13 17
6 Johnson-Lindenstrauss Lemma
Concentration inequalities like Hoeffdingrsquos and Bernsteinrsquos aresuccessfully used in the analysis of algorithms
Let us give one example for the problem of dimension reductionSuppose we have some data that is represented as a set of Npoints in Rn We would like to compress the data by representingit in a lower dimensional space Rm instead of Rn with m ≪ n Byhow much can we reduce the dimension without loosing theimportant features of the data
The basic result in this direction is the Johnson-LindenstraussLemma It states that a remarkably simple dimension reductionmethod works - a random linear map from Rn to Rm withm sim logN The logarithmic function grows very slowly so we canusually reduce the dimension dramatically
What exactly is a random linear map We consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors in Rn
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 14 17
Theorem 10 (Johnson-Lindenstrauss Lemma)
Let X be a set of N points in Rn and ε isin (0 1) Consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors Xi in Rn Rescale A by defining theldquoGaussian random projectionrdquo
P = Aradicm
Assume thatm ge Cεminus2 logN
where C is an appropriately large constant that depends only on thesub-gaussian norms of the vectors Xi Then with high probability (say099) the map P preserves the distances between all points in X witherror ε that is for all xy isin X
(1minus ε)983042xminus y9830422 983249 983042Pxminus Py9830422 983249 (1 + ε)983042xminus y9830422
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 15 17
Proof By linearity of P 1minus ε ge (1minus ε)2 and 1 + ε le (1 + ε)2 it issufficient to prove that
1minus ε le 983042P z98304222 le 1 + ε for all z isin T
where
T =
983069xminus y
983042xminus y9830422 xy isin X and x ∕= y
983070
By P z = Azradicm it is enough to show that
9830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 le ε for all z isin T
We can prove this inequality by combining concentration and a unionbound
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 16 17
In order to use concentration we first fix z isin T By assumption therandom variables 〈Xi z〉2 minus 1 are independent they have zero mean(why Exercise) and they are sub-exponential (why Exercise) ThenBernsteinrsquos inequality gives (why Exercise)
P
9830839830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084983249 2 exp
983043minuscε2m
983044
Finally we can unfix z by taking a union bound over all possible z isin T
P
983083maxzisinT
9830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084983249
983131
zisinTP
9830839830559830559830559830559830551
m
m983131
i=1
〈Xi z〉2 minus 1
983055983055983055983055983055 gt ε
983084
983249 |T | middot 2 exp983043minuscε2m
983044
By definition of T we have |T | le N2 So if we choose m ge Cεminus2 logNwith appropriately large constant C we can make
|T | middot 2 exp983043minuscε2m
983044le 001
The proof is completeIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 17 17
Theorem 3 (Sums of sub-gaussians)
Let X1 XN be independent mean zero sub-gaussian randomvariables Then
983123Ni=1Xi is sub-gaussian and
983056983056983056983056983131N
i=1Xi
9830569830569830569830562
ψ2
le C983131N
i=1983042Xi9830422ψ2
where C is an absolute constant
Proof Let us bound the MGF of the sum for any λ isin R
E exp
983061λ983131N
i=1Xi
983062=
983132N
i=1E exp(λXi) (using independence)
le983132N
i=1exp(Cλ2983042Xi9830422ψ2
) (by last property in Theorem 1)
= exp(λ2K2) where K2 = C983131N
i=1983042Xi9830422ψ2
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 6 17
31 Hoeffdingrsquos inequality
We rewrite Theorem 3 as a concentration inequality by using thefirst property in Theorem 1
Theorem 4 (Hoeffdingrsquos inequality)
Let X1 XN be independent mean zero sub-gaussian randomvariables Then for every t ge 0 we have
P983069983055983055983055983055
983131N
i=1Xi
983055983055983055983055 ge t
983070le 2 exp
983075minus ct2983123N
i=1 983042Xi9830422ψ2
983076
Remark 5
Hoeffdingrsquos inequality controls how far and with what probability a sumof independent random variables can deviate from its mean which iszero
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 7 17
4 Sub-exponential distributions
A square X2 of a normal random variable X sim N (0 1) is notsub-gaussian
Theorem 6 (Sub-exponential properties)
For a random variable X the following properties are equivalent
Tail P|X| ge t le 2 exp(minustK1) for all t ge 0
Moment 983042X983042p = (E|X|p)1p le pK2 for all p ge 1
MGF of square E exp(|X|K3) le 2
Moreover if EX = 0 then these properties imply the following one
MGF E exp(λX) le exp(λ2K24 ) for |λ| le 1K4
Definition 7
Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 6 are called sub-exponential
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 8 17
The best K3 is called the sub-exponential norm of X and isusually denoted 983042X983042ψ1 that is
983042X983042ψ1 = inft gt 0 E exp(|X|t) le 2
All squares of sub-gaussian random variables are sub-exponentialrandom variables (Conversely) We have
983042X2983042ψ1 = 983042X9830422ψ2
41 Bernsteinrsquos inequality
Theorem 8 (Bernsteinrsquos inequality)
Let X1 XN be independent mean zero sub-exponential randomvariables Then for every t ge 0 we have
P
983083983055983055983055983055983055
N983131
i=1
Xi
983055983055983055983055983055 983245 t
983084983249 2 exp
983077minuscmin
983075t2
983123Ni=1 983042Xi9830422ψ1
t
maxi 983042Xi983042ψ1
983076983078
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 9 17
Proof Choose λ ge 0 and use Markovrsquos inequality to get (S =983123N
i=1Xi)
PS 983245 t = Pexp(λS) 983245 exp(λt) 983249 eminusλtE exp(λS)
Then by independence we have
PS 983245 t le eminusλt983132N
i=1E exp(λXi)
If we choose λ small enough so that 0 lt λ 983249 cmaxi983042Xi983042ψ1
then by the
last property in Theorem 6 we have
E exp (λXi) 983249 exp983059Cλ2 983042Xi9830422ψ1
983060
Hence
PS 983245 t le exp983043minusλt+ Cλ2σ2
983044 σ2 =
N983131
i=1
983042Xi9830422ψ1
The remaining part is left as an exerciseIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 10 17
Why does Bernsteinrsquos inequality have a mixture of two tails
(i) The sub-exponential tail should of course be there Indeedeven if the entire sum consisted of a single term Xi the bestbound we could hope for would be of the form exp(minusct983042Xi983042ψ1)
(ii) The sub-gaussian term could be explained by the central limittheorem which states that the sum should becomes approximatelynormal as the number of terms N increases to infinity
Remark 9 (Bernsteinrsquos inequality for bounded random variables)
Suppose further the random variables Xi are uniformly bounded whichis a stronger assumption than being sub-gaussian If K gt 0 is such that|Xi| le K almost surely for all i then for every t ge 0 we have
P
983083983055983055983055983055983055
N983131
i=1
Xi
983055983055983055983055983055 983245 t
983084983249 2 exp
983061minus t22
σ2 + CKt
983062
where σ2 =983123N
i=1 EX2i is the variance of the sum
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 11 17
Note that σ2 + CKt 983249 2max983043σ2 CKt
983044 So we can state the
probability bound as
2 exp
983063minuscmin
983061t2
σ2t
K
983062983064
Just as before here we also have a mixture of two tailssub-gaussian and sub-exponential
The sub-gaussian tail is a bit sharper than in Theorem 8 since itdepends on the variances rather than sub-gaussian norms of Xi
The sub-exponential tail on the other hand is weaker since itdepends on the sup-norms rather than the sub-exponential normsof Xi
More on concentration
Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 12 17
5 Sub-gaussian random vectors

Definition: Consider a random vector $X$ taking values in $\mathbb{R}^n$. We call $X$ a sub-gaussian random vector if all one-dimensional marginals of $X$, i.e. the random variables $\langle X, x\rangle$ for all $x \in \mathbb{R}^n$, are sub-gaussian.
The sub-gaussian norm of $X$ is defined as
\[ \|X\|_{\psi_2} = \sup_{x \in \mathbb{R}^n,\ \|x\|_2 = 1} \|\langle X, x\rangle\|_{\psi_2}. \]
Sub-gaussian random vector examples:
(i) the standard normal distribution $\mathcal{N}(0, I_n)$ (why?);
(ii) the uniform distribution on the centered Euclidean sphere of radius $\sqrt{n}$;
(iii) the uniform distribution on the cube $\{-1, 1\}^n$;
(iv) a random vector $X = (X_1, \dots, X_n)$ with independent, sub-gaussian coordinates is sub-gaussian, with $\|X\|_{\psi_2} \le C\max_i \|X_i\|_{\psi_2}$. A small empirical illustration of example (iii) follows.
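Here is a quick empirical look at example (iii) (a sketch of ours; the comparison value $2\exp(-t^2/2)$ is the exact Hoeffding bound for $\pm 1$ coordinates and a unit vector $x$, so the marginals are genuinely gaussian-like):

import numpy as np

rng = np.random.default_rng(2)
n, samples = 50, 200_000
X = rng.choice([-1.0, 1.0], size=(samples, n))  # uniform on the cube {-1,1}^n

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                          # a fixed unit direction
marg = X @ x                                    # the one-dimensional marginal <X, x>

for t in [2.0, 3.0]:
    emp = np.mean(np.abs(marg) >= t)
    print(f"P(|<X,x>| >= {t}) ~ {emp:.4f}   Hoeffding bound: {2*np.exp(-t**2/2):.4f}")

Since $\mathrm{Var}\langle X, x\rangle = \|x\|_2^2 = 1$, the observed tails track the gaussian tails, consistent with the cube being a sub-gaussian random vector.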
6 Johnson-Lindenstrauss Lemma

Concentration inequalities like Hoeffding's and Bernstein's are successfully used in the analysis of algorithms.
Let us give one example, for the problem of dimension reduction. Suppose we have some data represented as a set of $N$ points in $\mathbb{R}^n$. We would like to compress the data by representing it in a lower-dimensional space $\mathbb{R}^m$ instead of $\mathbb{R}^n$, with $m \ll n$. By how much can we reduce the dimension without losing the important features of the data?
The basic result in this direction is the Johnson-Lindenstrauss Lemma. It states that a remarkably simple dimension reduction method works: a random linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ with $m \sim \log N$. The logarithmic function grows very slowly, so we can usually reduce the dimension dramatically.
What exactly is a random linear map? We consider an $m \times n$ matrix $A$ whose rows are independent, mean-zero, isotropic, and sub-gaussian random vectors in $\mathbb{R}^n$.
Theorem 10 (Johnson-Lindenstrauss Lemma)
Let $\mathcal{X}$ be a set of $N$ points in $\mathbb{R}^n$ and $\varepsilon \in (0, 1)$. Consider an $m \times n$ matrix $A$ whose rows are independent, mean-zero, isotropic, and sub-gaussian random vectors $X_i$ in $\mathbb{R}^n$. Rescale $A$ by defining the "Gaussian random projection"
\[ P = A/\sqrt{m}. \]
Assume that
\[ m \ge C\varepsilon^{-2}\log N, \]
where $C$ is an appropriately large constant that depends only on the sub-gaussian norms of the vectors $X_i$. Then, with high probability (say, 0.99), the map $P$ preserves the distances between all points in $\mathcal{X}$ with error $\varepsilon$; that is, for all $x, y \in \mathcal{X}$,
\[ (1 - \varepsilon)\|x - y\|_2 \le \|Px - Py\|_2 \le (1 + \varepsilon)\|x - y\|_2. \]
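The lemma is easy to see in action (a small demo of ours; the choice $C = 8$ below is an arbitrary stand-in for the unspecified constant, and Gaussian rows are just one admissible choice of isotropic sub-gaussian rows):

import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
N, n, eps = 100, 2000, 0.25
m = int(np.ceil(8 * np.log(N) / eps**2))   # m ~ C eps^-2 log N, with C = 8 (a guess)

pts = rng.standard_normal((N, n))          # the point set X in R^n
A = rng.standard_normal((m, n))            # rows: independent N(0, I_n) vectors
P = A / np.sqrt(m)                         # the "Gaussian random projection"
proj = pts @ P.T

worst = max(
    abs(np.linalg.norm(proj[i] - proj[j]) / np.linalg.norm(pts[i] - pts[j]) - 1)
    for i, j in combinations(range(N), 2)
)
print(f"reduced {n} -> {m} dims; worst relative distortion = {worst:.3f} (eps = {eps})")

On a typical run the worst pairwise distortion comes out well below $\varepsilon$, even though $m \approx 590$ is far smaller than $n = 2000$.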
Proof. By linearity of $P$, and since $1 - \varepsilon \ge (1 - \varepsilon)^2$ and $1 + \varepsilon \le (1 + \varepsilon)^2$, it is sufficient to prove that
\[ 1 - \varepsilon \le \|Pz\|_2^2 \le 1 + \varepsilon \quad \text{for all } z \in T, \]
where
\[ T = \Big\{\frac{x - y}{\|x - y\|_2} : x, y \in \mathcal{X},\ x \ne y\Big\}. \]
Since $Pz = Az/\sqrt{m}$ and the $i$-th coordinate of $Az$ is $\langle X_i, z\rangle$, we have $\|Pz\|_2^2 = \frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2$, so it is enough to show that
\[ \Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| \le \varepsilon \quad \text{for all } z \in T. \]
We can prove this inequality by combining concentration and a union bound.
In order to use concentration, we first fix $z \in T$. By assumption, the random variables $\langle X_i, z\rangle^2 - 1$ are independent; they have zero mean (why? Exercise), and they are sub-exponential (why? Exercise). Then Bernstein's inequality gives (why? Exercise)
\[ \mathbb{P}\Big\{\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| > \varepsilon\Big\} \le 2\exp(-c\varepsilon^2 m). \]
Finally, we can unfix $z$ by taking a union bound over all possible $z \in T$:
\[
  \mathbb{P}\Big\{\max_{z \in T}\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| > \varepsilon\Big\}
  \le \sum_{z \in T}\mathbb{P}\Big\{\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| > \varepsilon\Big\}
  \le |T| \cdot 2\exp(-c\varepsilon^2 m).
\]
By definition of $T$, we have $|T| \le N^2$. So if we choose $m \ge C\varepsilon^{-2}\log N$ with an appropriately large constant $C$, we can make
\[ |T| \cdot 2\exp(-c\varepsilon^2 m) \le 0.01. \]
The proof is complete.