Lecture 12: Concentration of sums of independent random variables

May 15 - 20, 2020


1 Why is probability more and more important?

Probabilistic linear algebra

A new and vigorous research direction


2 Gaussian or normal distribution $N(\mu, \sigma^2)$

Standard normal random variable $X \sim N(0, 1)$:

(i) Tail: $P\{|X| \ge t\} \le 2\exp(-t^2/2)$ for all $t \ge 0$

(ii) Moment: $\|X\|_p = (E|X|^p)^{1/p} = O(\sqrt{p})$ as $p \to \infty$

(iii) Moment generating function (MGF): $E\exp(\lambda X) = \exp(\lambda^2/2)$ for all $\lambda \in \mathbb{R}$

(iv) MGF of the square: $E\exp(cX^2) \le 2$ for some $c > 0$

The sum of independent normal random variables is also normal
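As a quick sanity check (added here as an illustration, not part of the original slides), the tail bound (i) and the MGF formula (iii) can be verified numerically on a simulated standard normal sample; the snippet below is a minimal NumPy sketch.

```python
# Minimal NumPy sketch (illustration only): empirically check the Gaussian
# tail bound P{|X| >= t} <= 2 exp(-t^2/2) and the MGF E exp(lambda X) = exp(lambda^2/2).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)

for t in (1.0, 2.0, 3.0):
    empirical_tail = np.mean(np.abs(X) >= t)
    bound = 2 * np.exp(-t**2 / 2)
    print(f"t={t}: empirical tail {empirical_tail:.4f} <= bound {bound:.4f}")

for lam in (0.5, 1.0):
    mgf = np.mean(np.exp(lam * X))
    print(f"lambda={lam}: empirical MGF {mgf:.3f} vs exp(lambda^2/2) = {np.exp(lam**2 / 2):.3f}")
```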


3 Sub-gaussian distributions

Theorem 1 (Sub-gaussian properties)

For a random variable $X$, the following properties are equivalent:

Tail: $P\{|X| \ge t\} \le 2\exp(-t^2/K_1^2)$ for all $t \ge 0$

Moment: $\|X\|_p = (E|X|^p)^{1/p} \le K_2\sqrt{p}$ for all $p \ge 1$

MGF of the square: $E\exp(X^2/K_3^2) \le 2$

Moreover, if $EX = 0$, then these properties are also equivalent to the following one:

MGF: $E\exp(\lambda X) \le \exp(\lambda^2 K_4^2)$ for all $\lambda \in \mathbb{R}$

Remark: The parameters $K_i > 0$ appearing in these properties can be different. However, they may differ from each other by at most an absolute constant factor. This means that there exists an absolute constant $C$ such that property 1 implies property 2 with parameter $K_2 \le CK_1$, and similarly for every other pair of properties.


Definition 2

Random variables that satisfy one of the first three properties (and thus all of them) in Theorem 1 are called sub-gaussian.

The best $K_3$ is called the sub-gaussian norm of $X$ and is usually denoted $\|X\|_{\psi_2}$, that is,

$\|X\|_{\psi_2} = \inf\{t > 0 : E\exp(X^2/t^2) \le 2\}$

Examples of sub-gaussian random variables:

(i) Normal random variables $X \sim N(\mu, \sigma^2)$

(ii) Bernoulli random variable $X \in \{0, 1\}$ with equal probabilities

(iii) More generally, any bounded random variable $X$

Examples that are not sub-gaussian:

Poisson, exponential, Pareto, and Cauchy distributions
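The $\psi_2$-norm in Definition 2 can also be estimated numerically from a sample by bisecting over $t$ in the condition $E\exp(X^2/t^2) \le 2$. The sketch below is an added illustration under the stated assumptions (the function name and the clipping constant are mine, not from the lecture); note that a finite sample can never certify that a distribution is not sub-gaussian, since an empirical MGF is always finite.

```python
# Monte Carlo sketch (illustration only) of ||X||_{psi_2} = inf{t > 0 : E exp(X^2/t^2) <= 2}.
import numpy as np

def subgaussian_norm(sample, lo=1e-3, hi=100.0, iters=60):
    """Approximate the psi_2 norm of an empirical sample by bisection over t."""
    def mgf_of_square(t):
        # Clip the exponent to avoid overflow for very small t.
        return np.mean(np.exp(np.minimum(sample**2 / t**2, 700.0)))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mgf_of_square(mid) <= 2.0:
            hi = mid            # condition holds: try a smaller t
        else:
            lo = mid
    return hi

rng = np.random.default_rng(1)
gauss = rng.standard_normal(200_000)                 # N(0,1); theory gives sqrt(8/3) ~ 1.63
rademacher = rng.choice([-1.0, 1.0], size=200_000)   # bounded, hence sub-gaussian
print(subgaussian_norm(gauss), subgaussian_norm(rademacher))
```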


Theorem 3 (Sums of sub-gaussians)

Let $X_1, \dots, X_N$ be independent, mean-zero, sub-gaussian random variables. Then $\sum_{i=1}^N X_i$ is sub-gaussian, and

$$\Big\|\sum_{i=1}^N X_i\Big\|_{\psi_2}^2 \le C\sum_{i=1}^N \|X_i\|_{\psi_2}^2,$$

where $C$ is an absolute constant.

Proof. Let us bound the MGF of the sum: for any $\lambda \in \mathbb{R}$,

$$E\exp\Big(\lambda\sum_{i=1}^N X_i\Big) = \prod_{i=1}^N E\exp(\lambda X_i) \quad \text{(using independence)}$$
$$\le \prod_{i=1}^N \exp\big(C\lambda^2\|X_i\|_{\psi_2}^2\big) \quad \text{(by the last property in Theorem 1)}$$
$$= \exp(\lambda^2 K^2), \quad \text{where } K^2 = C\sum_{i=1}^N \|X_i\|_{\psi_2}^2.$$

By the MGF property in Theorem 1 (the sum has mean zero), this bound shows that the sum is sub-gaussian with $\big\|\sum_{i=1}^N X_i\big\|_{\psi_2}^2 \le CK^2$, as claimed.


3.1 Hoeffding's inequality

We rewrite Theorem 3 as a concentration inequality by using the first property in Theorem 1.

Theorem 4 (Hoeffding's inequality)

Let $X_1, \dots, X_N$ be independent, mean-zero, sub-gaussian random variables. Then for every $t \ge 0$ we have

$$P\Big\{\Big|\sum_{i=1}^N X_i\Big| \ge t\Big\} \le 2\exp\Big(-\frac{ct^2}{\sum_{i=1}^N \|X_i\|_{\psi_2}^2}\Big).$$

Remark 5

Hoeffding's inequality controls how far, and with what probability, a sum of independent random variables can deviate from its mean, which is zero.
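A small simulation (an added illustration, not from the slides) makes Remark 5 concrete for Rademacher sums. Since $c$ and the $\psi_2$-norms in Theorem 4 are absolute constants, the sketch only compares the empirical tail against a curve of the Hoeffding shape $2\exp(-t^2/(2N))$.

```python
# Monte Carlo sketch (illustration only) of Hoeffding's inequality for sums of
# N independent Rademacher (+-1) variables, compared with a 2 exp(-t^2 / (2N)) curve.
import numpy as np

rng = np.random.default_rng(2)
N, trials = 100, 100_000
sums = rng.choice([-1.0, 1.0], size=(trials, N)).sum(axis=1)

for t in (10, 20, 30):
    empirical = np.mean(np.abs(sums) >= t)
    reference = 2 * np.exp(-t**2 / (2 * N))   # Hoeffding-shaped reference curve
    print(f"t={t}: empirical {empirical:.2e} <= reference {reference:.2e}")
```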


4 Sub-exponential distributions

The square $X^2$ of a normal random variable $X \sim N(0, 1)$ is not sub-gaussian.

Theorem 6 (Sub-exponential properties)

For a random variable $X$, the following properties are equivalent:

Tail: $P\{|X| \ge t\} \le 2\exp(-t/K_1)$ for all $t \ge 0$

Moment: $\|X\|_p = (E|X|^p)^{1/p} \le K_2 p$ for all $p \ge 1$

MGF of the absolute value: $E\exp(|X|/K_3) \le 2$

Moreover, if $EX = 0$, then these properties imply the following one:

MGF: $E\exp(\lambda X) \le \exp(\lambda^2 K_4^2)$ for $|\lambda| \le 1/K_4$

Definition 7

Random variables that satisfy one of the first three properties (and thus all of them) in Theorem 6 are called sub-exponential.


The best $K_3$ is called the sub-exponential norm of $X$ and is usually denoted $\|X\|_{\psi_1}$, that is,

$\|X\|_{\psi_1} = \inf\{t > 0 : E\exp(|X|/t) \le 2\}$

All squares of sub-gaussian random variables are sub-exponential random variables, and conversely. We have

$\|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2$
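This identity follows directly from the two definitions; writing out the one-line substitution $t = s^2$ (added here for clarity):

$$\|X^2\|_{\psi_1} = \inf\{t > 0 : E\exp(X^2/t) \le 2\} = \inf\{s^2 : s > 0,\ E\exp(X^2/s^2) \le 2\} = \|X\|_{\psi_2}^2.$$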

4.1 Bernstein's inequality

Theorem 8 (Bernstein's inequality)

Let $X_1, \dots, X_N$ be independent, mean-zero, sub-exponential random variables. Then for every $t \ge 0$ we have

$$P\Big\{\Big|\sum_{i=1}^N X_i\Big| \ge t\Big\} \le 2\exp\Big[-c\min\Big(\frac{t^2}{\sum_{i=1}^N \|X_i\|_{\psi_1}^2},\ \frac{t}{\max_i \|X_i\|_{\psi_1}}\Big)\Big].$$


Proof. Choose $\lambda \ge 0$ and use Markov's inequality to get (with $S = \sum_{i=1}^N X_i$)

$$P\{S \ge t\} = P\{\exp(\lambda S) \ge \exp(\lambda t)\} \le e^{-\lambda t}\, E\exp(\lambda S).$$

Then by independence we have

$$P\{S \ge t\} \le e^{-\lambda t}\prod_{i=1}^N E\exp(\lambda X_i).$$

If we choose $\lambda$ small enough so that $0 < \lambda \le c/\max_i \|X_i\|_{\psi_1}$, then by the last property in Theorem 6 we have

$$E\exp(\lambda X_i) \le \exp\big(C\lambda^2 \|X_i\|_{\psi_1}^2\big).$$

Hence

$$P\{S \ge t\} \le \exp\big(-\lambda t + C\lambda^2\sigma^2\big), \quad \sigma^2 = \sum_{i=1}^N \|X_i\|_{\psi_1}^2.$$

The remaining part is left as an exercise.

Why does Bernstein's inequality have a mixture of two tails?

(i) The sub-exponential tail should of course be there. Indeed, even if the entire sum consisted of a single term $X_i$, the best bound we could hope for would be of the form $\exp(-ct/\|X_i\|_{\psi_1})$.

(ii) The sub-gaussian tail can be explained by the central limit theorem, which states that the sum becomes approximately normal as the number of terms $N$ increases to infinity.
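Both regimes are visible in a simulation. The sketch below (an added illustration; centered Exp(1) summands are one convenient choice of sub-exponential variables) compares the empirical tail of the sum with a purely Gaussian-shaped curve: for moderate $t$ the empirical tail stays below that curve, while for large $t$ it exceeds it, which is the sub-exponential regime predicted by Bernstein's inequality.

```python
# Monte Carlo sketch (illustration only): tails of a sum of N centered Exp(1)
# variables versus a purely Gaussian-shaped reference 2 exp(-t^2 / (2 Var)),
# where Var = N for this choice of summands.
import numpy as np

rng = np.random.default_rng(3)
N, trials = 10, 500_000
sums = (rng.exponential(1.0, size=(trials, N)) - 1.0).sum(axis=1)   # mean-zero sum

for t in (5, 10, 15, 18):
    empirical = np.mean(np.abs(sums) >= t)
    gaussian_shape = 2 * np.exp(-t**2 / (2 * N))
    print(f"t={t}: empirical {empirical:.2e}, Gaussian-shaped reference {gaussian_shape:.2e}")
```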

Remark 9 (Bernstein's inequality for bounded random variables)

Suppose further that the random variables $X_i$ are uniformly bounded, which is a stronger assumption than being sub-gaussian. If $K > 0$ is such that $|X_i| \le K$ almost surely for all $i$, then for every $t \ge 0$ we have

$$P\Big\{\Big|\sum_{i=1}^N X_i\Big| \ge t\Big\} \le 2\exp\Big(-\frac{t^2/2}{\sigma^2 + CKt}\Big),$$

where $\sigma^2 = \sum_{i=1}^N EX_i^2$ is the variance of the sum.


Note that $\sigma^2 + CKt \le 2\max(\sigma^2, CKt)$. So we can state the probability bound as

$$2\exp\Big[-c\min\Big(\frac{t^2}{\sigma^2},\ \frac{t}{K}\Big)\Big].$$

Just as before, here we also have a mixture of two tails: sub-gaussian and sub-exponential.

The sub-gaussian tail is a bit sharper than in Theorem 8, since it depends on the variances rather than the sub-gaussian norms of $X_i$.

The sub-exponential tail, on the other hand, is weaker, since it depends on the sup-norms rather than the sub-exponential norms of $X_i$.

More on concentration


5 Sub-gaussian random vectors

Definition: Consider a random vector $X$ taking values in $\mathbb{R}^n$. We call $X$ a sub-gaussian random vector if all one-dimensional marginals of $X$, i.e. the random variables $\langle X, x\rangle$ for all $x \in \mathbb{R}^n$, are sub-gaussian.

The sub-gaussian norm of $X$ is defined as

$$\|X\|_{\psi_2} = \sup_{x \in \mathbb{R}^n,\ \|x\|_2 = 1} \|\langle X, x\rangle\|_{\psi_2}.$$

Examples of sub-gaussian random vectors:

(i) The standard normal distribution $N(0, I_n)$ (why?)

(ii) The uniform distribution on the centered Euclidean sphere of radius $\sqrt{n}$

(iii) The uniform distribution on the cube $\{-1, 1\}^n$

(iv) A random vector $X = (X_1, \dots, X_n)$ with independent, sub-gaussian coordinates is sub-gaussian, with $\|X\|_{\psi_2} \le C\max_i \|X_i\|_{\psi_2}$
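For example (iii), the marginals $\langle X, x\rangle$ are weighted Rademacher sums, so Hoeffding's inequality from Section 3.1 already shows that they are sub-gaussian. The snippet below (an added illustration; the direction $x$ is an arbitrary unit vector) checks the corresponding tail bound empirically.

```python
# Sketch (illustration only): marginals <X, x> of the uniform distribution on the
# cube {-1,1}^n along a fixed unit direction x, compared with the Hoeffding bound
# 2 exp(-t^2 / 2) (the variance of <X, x> is ||x||_2^2 = 1).
import numpy as np

rng = np.random.default_rng(4)
n, trials = 100, 100_000
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                              # fixed unit direction
X = rng.choice([-1.0, 1.0], size=(trials, n))       # uniform on the cube
marginals = X @ x

for t in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(marginals) >= t)
    bound = 2 * np.exp(-t**2 / 2)
    print(f"t={t}: empirical {empirical:.4f} <= Hoeffding bound {bound:.4f}")
```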


6 Johnson-Lindenstrauss Lemma

Concentration inequalities like Hoeffding's and Bernstein's are successfully used in the analysis of algorithms.

Let us give one example for the problem of dimension reduction. Suppose we have some data that is represented as a set of $N$ points in $\mathbb{R}^n$. We would like to compress the data by representing it in a lower-dimensional space $\mathbb{R}^m$ instead of $\mathbb{R}^n$, with $m \ll n$. By how much can we reduce the dimension without losing the important features of the data?

The basic result in this direction is the Johnson-Lindenstrauss Lemma. It states that a remarkably simple dimension reduction method works: a random linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ with $m \sim \log N$. The logarithmic function grows very slowly, so we can usually reduce the dimension dramatically.

What exactly is a random linear map? We consider an $m \times n$ matrix $A$ whose rows are independent, mean-zero, isotropic, and sub-gaussian random vectors in $\mathbb{R}^n$.


Theorem 10 (Johnson-Lindenstrauss Lemma)

Let $X$ be a set of $N$ points in $\mathbb{R}^n$ and $\varepsilon \in (0, 1)$. Consider an $m \times n$ matrix $A$ whose rows are independent, mean-zero, isotropic, and sub-gaussian random vectors $X_i$ in $\mathbb{R}^n$. Rescale $A$ by defining the "Gaussian random projection"

$$P = A/\sqrt{m}.$$

Assume that

$$m \ge C\varepsilon^{-2}\log N,$$

where $C$ is an appropriately large constant that depends only on the sub-gaussian norms of the vectors $X_i$. Then, with high probability (say, 0.99), the map $P$ preserves the distances between all points in $X$ with error $\varepsilon$, that is, for all $x, y \in X$,

$$(1 - \varepsilon)\|x - y\|_2 \le \|Px - Py\|_2 \le (1 + \varepsilon)\|x - y\|_2.$$


Proof. By linearity of $P$, $1 - \varepsilon \ge (1 - \varepsilon)^2$, and $1 + \varepsilon \le (1 + \varepsilon)^2$, it is sufficient to prove that

$$1 - \varepsilon \le \|Pz\|_2^2 \le 1 + \varepsilon \quad \text{for all } z \in T,$$

where

$$T = \Big\{\frac{x - y}{\|x - y\|_2} : x, y \in X,\ x \ne y\Big\}.$$

By $Pz = Az/\sqrt{m}$, it is enough to show that

$$\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| \le \varepsilon \quad \text{for all } z \in T.$$

We can prove this inequality by combining concentration and a union bound.


In order to use concentration, we first fix $z \in T$. By assumption, the random variables $\langle X_i, z\rangle^2 - 1$ are independent; they have zero mean (why? Exercise) and they are sub-exponential (why? Exercise). Then Bernstein's inequality gives (why? Exercise)

$$P\Big\{\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| > \varepsilon\Big\} \le 2\exp(-c\varepsilon^2 m).$$

Finally, we can unfix $z$ by taking a union bound over all possible $z \in T$:

$$P\Big\{\max_{z \in T}\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| > \varepsilon\Big\} \le \sum_{z \in T} P\Big\{\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| > \varepsilon\Big\} \le |T| \cdot 2\exp(-c\varepsilon^2 m).$$

By definition of $T$, we have $|T| \le N^2$. So if we choose $m \ge C\varepsilon^{-2}\log N$ with an appropriately large constant $C$, we can make

$$|T| \cdot 2\exp(-c\varepsilon^2 m) \le 0.01.$$

The proof is complete.
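To close the section, here is a minimal NumPy sketch (an added illustration, not from the slides) of Theorem 10 with Gaussian rows, which are mean-zero, isotropic, and sub-gaussian; the factor 8 in the choice of $m$ is an arbitrary stand-in for the unspecified constant $C$.

```python
# Johnson-Lindenstrauss sketch (illustration only): project N points from R^n to
# R^m with m ~ C eps^-2 log N using a Gaussian random projection P = A / sqrt(m),
# then report the worst relative distortion of pairwise distances.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
N, n, eps = 100, 1000, 0.3
m = int(np.ceil(8 * np.log(N) / eps**2))   # "8" is a plausible stand-in for the constant C

points = rng.standard_normal((N, n))       # an arbitrary point set X in R^n
A = rng.standard_normal((m, n))            # rows: independent N(0, I_n) vectors
P = A / np.sqrt(m)                         # the "Gaussian random projection"
proj = points @ P.T                        # all projected points at once (N x m)

worst = 0.0
for i, j in combinations(range(N), 2):
    d_old = np.linalg.norm(points[i] - points[j])
    d_new = np.linalg.norm(proj[i] - proj[j])
    worst = max(worst, abs(d_new / d_old - 1.0))
print(f"m = {m}, worst relative distortion = {worst:.3f} (target eps = {eps})")
```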

Page 2: Lecture 12: Concentration of sums of independent random ...math.xmu.edu.cn/group/nona/damc/Lecture12.pdf · Independent Random Variables DAMC Lecture 12 May 15 - 20, 2020 14 / 17.

1 Why is probability more and more important

Probabilistic linear algebra

A new and vigorous research direction

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 2 17

2 Gaussian or normal distribution N (microσ2)

Standard normal random variable X sim N (0 1)

(i) TailP|X| ge t le 2 exp(minust22) forall t ge 0

(ii) Moment

983042X983042p = (E|X|p)1p = O(radicp) as p rarr infin

(iii) Moment Generation function (MGF)

E exp(λX) = exp(λ22) for all λ isin R

(iv) MGF of square

E exp(cX2) le 2 for some c gt 0

The sum of independent normal random variables is also normal

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 3 17

3 Sub-gaussian distributions

Theorem 1 (Sub-gaussian properties)

For a random variable X the following properties are equivalent

Tail P|X| ge t le 2 exp(minust2K21 ) for all t ge 0

Moment 983042X983042p = (E|X|p)1p le K2radicp for all p ge 1

MGF of square E exp(X2K23 ) le 2

Moreover if EX = 0 then these properties are also equivalent to thefollowing one

MGF E exp(λX) le exp(λ2K24 ) for all λ isin R

Remark The parameters Ki gt 0 appearing in these properties can bedifferent However they may differ from each other by at most anabsolute constant factor This means that there exists an absoluteconstant C such that property 1 implies property 2 with parameterK2 le CK1 and similarly for every other pair or properties

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 4 17

Definition 2

Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 1 are called sub-gaussian

The best K3 is called the sub-gaussian norm of X and is usuallydenoted 983042X983042ψ2 that is

983042X983042ψ2 = inft gt 0 E exp(X2t2) le 2

Sub-gaussian random variable examples

(i) Normal random variables X sim N (microσ2)

(ii) Bernoulli random variable X = 0 1 with equal probabilities

(iii) More generally any bounded random variable X

Not sub-gaussian random variable examples

Poisson exponential Pareto and Cauchy distributions

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 5 17

Theorem 3 (Sums of sub-gaussians)

Let X1 XN be independent mean zero sub-gaussian randomvariables Then

983123Ni=1Xi is sub-gaussian and

983056983056983056983056983131N

i=1Xi

9830569830569830569830562

ψ2

le C983131N

i=1983042Xi9830422ψ2

where C is an absolute constant

Proof Let us bound the MGF of the sum for any λ isin R

E exp

983061λ983131N

i=1Xi

983062=

983132N

i=1E exp(λXi) (using independence)

le983132N

i=1exp(Cλ2983042Xi9830422ψ2

) (by last property in Theorem 1)

= exp(λ2K2) where K2 = C983131N

i=1983042Xi9830422ψ2

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 6 17

31 Hoeffdingrsquos inequality

We rewrite Theorem 3 as a concentration inequality by using thefirst property in Theorem 1

Theorem 4 (Hoeffdingrsquos inequality)

Let X1 XN be independent mean zero sub-gaussian randomvariables Then for every t ge 0 we have

P983069983055983055983055983055

983131N

i=1Xi

983055983055983055983055 ge t

983070le 2 exp

983075minus ct2983123N

i=1 983042Xi9830422ψ2

983076

Remark 5

Hoeffdingrsquos inequality controls how far and with what probability a sumof independent random variables can deviate from its mean which iszero

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 7 17

4 Sub-exponential distributions

A square X2 of a normal random variable X sim N (0 1) is notsub-gaussian

Theorem 6 (Sub-exponential properties)

For a random variable X the following properties are equivalent

Tail P|X| ge t le 2 exp(minustK1) for all t ge 0

Moment 983042X983042p = (E|X|p)1p le pK2 for all p ge 1

MGF of square E exp(|X|K3) le 2

Moreover if EX = 0 then these properties imply the following one

MGF E exp(λX) le exp(λ2K24 ) for |λ| le 1K4

Definition 7

Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 6 are called sub-exponential

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 8 17

The best K3 is called the sub-exponential norm of X and isusually denoted 983042X983042ψ1 that is

983042X983042ψ1 = inft gt 0 E exp(|X|t) le 2

All squares of sub-gaussian random variables are sub-exponentialrandom variables (Conversely) We have

983042X2983042ψ1 = 983042X9830422ψ2

41 Bernsteinrsquos inequality

Theorem 8 (Bernsteinrsquos inequality)

Let X1 XN be independent mean zero sub-exponential randomvariables Then for every t ge 0 we have

P

983083983055983055983055983055983055

N983131

i=1

Xi

983055983055983055983055983055 983245 t

983084983249 2 exp

983077minuscmin

983075t2

983123Ni=1 983042Xi9830422ψ1

t

maxi 983042Xi983042ψ1

983076983078

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 9 17

Proof Choose λ ge 0 and use Markovrsquos inequality to get (S =983123N

i=1Xi)

PS 983245 t = Pexp(λS) 983245 exp(λt) 983249 eminusλtE exp(λS)

Then by independence we have

PS 983245 t le eminusλt983132N

i=1E exp(λXi)

If we choose λ small enough so that 0 lt λ 983249 cmaxi983042Xi983042ψ1

then by the

last property in Theorem 6 we have

E exp (λXi) 983249 exp983059Cλ2 983042Xi9830422ψ1

983060

Hence

PS 983245 t le exp983043minusλt+ Cλ2σ2

983044 σ2 =

N983131

i=1

983042Xi9830422ψ1

The remaining part is left as an exerciseIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 10 17

Why does Bernsteinrsquos inequality have a mixture of two tails

(i) The sub-exponential tail should of course be there Indeedeven if the entire sum consisted of a single term Xi the bestbound we could hope for would be of the form exp(minusct983042Xi983042ψ1)

(ii) The sub-gaussian term could be explained by the central limittheorem which states that the sum should becomes approximatelynormal as the number of terms N increases to infinity

Remark 9 (Bernsteinrsquos inequality for bounded random variables)

Suppose further the random variables Xi are uniformly bounded whichis a stronger assumption than being sub-gaussian If K gt 0 is such that|Xi| le K almost surely for all i then for every t ge 0 we have

P

983083983055983055983055983055983055

N983131

i=1

Xi

983055983055983055983055983055 983245 t

983084983249 2 exp

983061minus t22

σ2 + CKt

983062

where σ2 =983123N

i=1 EX2i is the variance of the sum

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 11 17

Note that σ2 + CKt 983249 2max983043σ2 CKt

983044 So we can state the

probability bound as

2 exp

983063minuscmin

983061t2

σ2t

K

983062983064

Just as before here we also have a mixture of two tailssub-gaussian and sub-exponential

The sub-gaussian tail is a bit sharper than in Theorem 8 since itdepends on the variances rather than sub-gaussian norms of Xi

The sub-exponential tail on the other hand is weaker since itdepends on the sup-norms rather than the sub-exponential normsof Xi

More on concentration

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 12 17

5 Sub-gaussian random vectors

Definition Consider a random vector X taking values in Rn Wecall X a sub-gaussian random vector if all one-dimensionalmarginals of X ie the random variables 〈Xx〉 for all x isin Rnare sub-gaussian

The sub-gaussian norm of X is defined as

983042X983042ψ2 = supxisinRn983042x9830422=1

983042〈Xx〉983042ψ2

Sub-gaussian random vector examples

(i) The standard normal distribution N (0 In) (why)

(ii) The uniform distribution on the centered Euclidean sphere ofradius

radicn

(iii) The uniform distribution on the cube minus1 1n

(iv) A random vector X = (X1 middot middot middot Xn) with independent andsub-gaussian coordinates is sub-gaussian 983042X983042ψ2 983249 Cmaxi 983042Xi983042ψ2

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 13 17

6 Johnson-Lindenstrauss Lemma

Concentration inequalities like Hoeffdingrsquos and Bernsteinrsquos aresuccessfully used in the analysis of algorithms

Let us give one example for the problem of dimension reductionSuppose we have some data that is represented as a set of Npoints in Rn We would like to compress the data by representingit in a lower dimensional space Rm instead of Rn with m ≪ n Byhow much can we reduce the dimension without loosing theimportant features of the data

The basic result in this direction is the Johnson-LindenstraussLemma It states that a remarkably simple dimension reductionmethod works - a random linear map from Rn to Rm withm sim logN The logarithmic function grows very slowly so we canusually reduce the dimension dramatically

What exactly is a random linear map We consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors in Rn

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 14 17

Theorem 10 (Johnson-Lindenstrauss Lemma)

Let X be a set of N points in Rn and ε isin (0 1) Consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors Xi in Rn Rescale A by defining theldquoGaussian random projectionrdquo

P = Aradicm

Assume thatm ge Cεminus2 logN

where C is an appropriately large constant that depends only on thesub-gaussian norms of the vectors Xi Then with high probability (say099) the map P preserves the distances between all points in X witherror ε that is for all xy isin X

(1minus ε)983042xminus y9830422 983249 983042Pxminus Py9830422 983249 (1 + ε)983042xminus y9830422

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 15 17

Proof By linearity of P 1minus ε ge (1minus ε)2 and 1 + ε le (1 + ε)2 it issufficient to prove that

1minus ε le 983042P z98304222 le 1 + ε for all z isin T

where

T =

983069xminus y

983042xminus y9830422 xy isin X and x ∕= y

983070

By P z = Azradicm it is enough to show that

9830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 le ε for all z isin T

We can prove this inequality by combining concentration and a unionbound

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 16 17

In order to use concentration we first fix z isin T By assumption therandom variables 〈Xi z〉2 minus 1 are independent they have zero mean(why Exercise) and they are sub-exponential (why Exercise) ThenBernsteinrsquos inequality gives (why Exercise)

P

9830839830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084983249 2 exp

983043minuscε2m

983044

Finally we can unfix z by taking a union bound over all possible z isin T

P

983083maxzisinT

9830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084983249

983131

zisinTP

9830839830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084

983249 |T | middot 2 exp983043minuscε2m

983044

By definition of T we have |T | le N2 So if we choose m ge Cεminus2 logNwith appropriately large constant C we can make

|T | middot 2 exp983043minuscε2m

983044le 001

The proof is completeIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 17 17

Page 3: Lecture 12: Concentration of sums of independent random ...math.xmu.edu.cn/group/nona/damc/Lecture12.pdf · Independent Random Variables DAMC Lecture 12 May 15 - 20, 2020 14 / 17.

2 Gaussian or normal distribution N (microσ2)

Standard normal random variable X sim N (0 1)

(i) TailP|X| ge t le 2 exp(minust22) forall t ge 0

(ii) Moment

983042X983042p = (E|X|p)1p = O(radicp) as p rarr infin

(iii) Moment Generation function (MGF)

E exp(λX) = exp(λ22) for all λ isin R

(iv) MGF of square

E exp(cX2) le 2 for some c gt 0

The sum of independent normal random variables is also normal

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 3 17

3 Sub-gaussian distributions

Theorem 1 (Sub-gaussian properties)

For a random variable X the following properties are equivalent

Tail P|X| ge t le 2 exp(minust2K21 ) for all t ge 0

Moment 983042X983042p = (E|X|p)1p le K2radicp for all p ge 1

MGF of square E exp(X2K23 ) le 2

Moreover if EX = 0 then these properties are also equivalent to thefollowing one

MGF E exp(λX) le exp(λ2K24 ) for all λ isin R

Remark The parameters Ki gt 0 appearing in these properties can bedifferent However they may differ from each other by at most anabsolute constant factor This means that there exists an absoluteconstant C such that property 1 implies property 2 with parameterK2 le CK1 and similarly for every other pair or properties

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 4 17

Definition 2

Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 1 are called sub-gaussian

The best K3 is called the sub-gaussian norm of X and is usuallydenoted 983042X983042ψ2 that is

983042X983042ψ2 = inft gt 0 E exp(X2t2) le 2

Sub-gaussian random variable examples

(i) Normal random variables X sim N (microσ2)

(ii) Bernoulli random variable X = 0 1 with equal probabilities

(iii) More generally any bounded random variable X

Not sub-gaussian random variable examples

Poisson exponential Pareto and Cauchy distributions

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 5 17

Theorem 3 (Sums of sub-gaussians)

Let X1 XN be independent mean zero sub-gaussian randomvariables Then

983123Ni=1Xi is sub-gaussian and

983056983056983056983056983131N

i=1Xi

9830569830569830569830562

ψ2

le C983131N

i=1983042Xi9830422ψ2

where C is an absolute constant

Proof Let us bound the MGF of the sum for any λ isin R

E exp

983061λ983131N

i=1Xi

983062=

983132N

i=1E exp(λXi) (using independence)

le983132N

i=1exp(Cλ2983042Xi9830422ψ2

) (by last property in Theorem 1)

= exp(λ2K2) where K2 = C983131N

i=1983042Xi9830422ψ2

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 6 17

31 Hoeffdingrsquos inequality

We rewrite Theorem 3 as a concentration inequality by using thefirst property in Theorem 1

Theorem 4 (Hoeffdingrsquos inequality)

Let X1 XN be independent mean zero sub-gaussian randomvariables Then for every t ge 0 we have

P983069983055983055983055983055

983131N

i=1Xi

983055983055983055983055 ge t

983070le 2 exp

983075minus ct2983123N

i=1 983042Xi9830422ψ2

983076

Remark 5

Hoeffdingrsquos inequality controls how far and with what probability a sumof independent random variables can deviate from its mean which iszero

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 7 17

4 Sub-exponential distributions

A square X2 of a normal random variable X sim N (0 1) is notsub-gaussian

Theorem 6 (Sub-exponential properties)

For a random variable X the following properties are equivalent

Tail P|X| ge t le 2 exp(minustK1) for all t ge 0

Moment 983042X983042p = (E|X|p)1p le pK2 for all p ge 1

MGF of square E exp(|X|K3) le 2

Moreover if EX = 0 then these properties imply the following one

MGF E exp(λX) le exp(λ2K24 ) for |λ| le 1K4

Definition 7

Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 6 are called sub-exponential

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 8 17

The best K3 is called the sub-exponential norm of X and isusually denoted 983042X983042ψ1 that is

983042X983042ψ1 = inft gt 0 E exp(|X|t) le 2

All squares of sub-gaussian random variables are sub-exponentialrandom variables (Conversely) We have

983042X2983042ψ1 = 983042X9830422ψ2

41 Bernsteinrsquos inequality

Theorem 8 (Bernsteinrsquos inequality)

Let X1 XN be independent mean zero sub-exponential randomvariables Then for every t ge 0 we have

P

983083983055983055983055983055983055

N983131

i=1

Xi

983055983055983055983055983055 983245 t

983084983249 2 exp

983077minuscmin

983075t2

983123Ni=1 983042Xi9830422ψ1

t

maxi 983042Xi983042ψ1

983076983078

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 9 17

Proof Choose λ ge 0 and use Markovrsquos inequality to get (S =983123N

i=1Xi)

PS 983245 t = Pexp(λS) 983245 exp(λt) 983249 eminusλtE exp(λS)

Then by independence we have

PS 983245 t le eminusλt983132N

i=1E exp(λXi)

If we choose λ small enough so that 0 lt λ 983249 cmaxi983042Xi983042ψ1

then by the

last property in Theorem 6 we have

E exp (λXi) 983249 exp983059Cλ2 983042Xi9830422ψ1

983060

Hence

PS 983245 t le exp983043minusλt+ Cλ2σ2

983044 σ2 =

N983131

i=1

983042Xi9830422ψ1

The remaining part is left as an exerciseIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 10 17

Why does Bernsteinrsquos inequality have a mixture of two tails

(i) The sub-exponential tail should of course be there Indeedeven if the entire sum consisted of a single term Xi the bestbound we could hope for would be of the form exp(minusct983042Xi983042ψ1)

(ii) The sub-gaussian term could be explained by the central limittheorem which states that the sum should becomes approximatelynormal as the number of terms N increases to infinity

Remark 9 (Bernsteinrsquos inequality for bounded random variables)

Suppose further the random variables Xi are uniformly bounded whichis a stronger assumption than being sub-gaussian If K gt 0 is such that|Xi| le K almost surely for all i then for every t ge 0 we have

P

983083983055983055983055983055983055

N983131

i=1

Xi

983055983055983055983055983055 983245 t

983084983249 2 exp

983061minus t22

σ2 + CKt

983062

where σ2 =983123N

i=1 EX2i is the variance of the sum

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 11 17

Note that σ2 + CKt 983249 2max983043σ2 CKt

983044 So we can state the

probability bound as

2 exp

983063minuscmin

983061t2

σ2t

K

983062983064

Just as before here we also have a mixture of two tailssub-gaussian and sub-exponential

The sub-gaussian tail is a bit sharper than in Theorem 8 since itdepends on the variances rather than sub-gaussian norms of Xi

The sub-exponential tail on the other hand is weaker since itdepends on the sup-norms rather than the sub-exponential normsof Xi

More on concentration

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 12 17

5 Sub-gaussian random vectors

Definition Consider a random vector X taking values in Rn Wecall X a sub-gaussian random vector if all one-dimensionalmarginals of X ie the random variables 〈Xx〉 for all x isin Rnare sub-gaussian

The sub-gaussian norm of X is defined as

983042X983042ψ2 = supxisinRn983042x9830422=1

983042〈Xx〉983042ψ2

Sub-gaussian random vector examples

(i) The standard normal distribution N (0 In) (why)

(ii) The uniform distribution on the centered Euclidean sphere ofradius

radicn

(iii) The uniform distribution on the cube minus1 1n

(iv) A random vector X = (X1 middot middot middot Xn) with independent andsub-gaussian coordinates is sub-gaussian 983042X983042ψ2 983249 Cmaxi 983042Xi983042ψ2

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 13 17

6 Johnson-Lindenstrauss Lemma

Concentration inequalities like Hoeffdingrsquos and Bernsteinrsquos aresuccessfully used in the analysis of algorithms

Let us give one example for the problem of dimension reductionSuppose we have some data that is represented as a set of Npoints in Rn We would like to compress the data by representingit in a lower dimensional space Rm instead of Rn with m ≪ n Byhow much can we reduce the dimension without loosing theimportant features of the data

The basic result in this direction is the Johnson-LindenstraussLemma It states that a remarkably simple dimension reductionmethod works - a random linear map from Rn to Rm withm sim logN The logarithmic function grows very slowly so we canusually reduce the dimension dramatically

What exactly is a random linear map We consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors in Rn

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 14 17

Theorem 10 (Johnson-Lindenstrauss Lemma)

Let X be a set of N points in Rn and ε isin (0 1) Consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors Xi in Rn Rescale A by defining theldquoGaussian random projectionrdquo

P = Aradicm

Assume thatm ge Cεminus2 logN

where C is an appropriately large constant that depends only on thesub-gaussian norms of the vectors Xi Then with high probability (say099) the map P preserves the distances between all points in X witherror ε that is for all xy isin X

(1minus ε)983042xminus y9830422 983249 983042Pxminus Py9830422 983249 (1 + ε)983042xminus y9830422

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 15 17

Proof By linearity of P 1minus ε ge (1minus ε)2 and 1 + ε le (1 + ε)2 it issufficient to prove that

1minus ε le 983042P z98304222 le 1 + ε for all z isin T

where

T =

983069xminus y

983042xminus y9830422 xy isin X and x ∕= y

983070

By P z = Azradicm it is enough to show that

9830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 le ε for all z isin T

We can prove this inequality by combining concentration and a unionbound

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 16 17

In order to use concentration we first fix z isin T By assumption therandom variables 〈Xi z〉2 minus 1 are independent they have zero mean(why Exercise) and they are sub-exponential (why Exercise) ThenBernsteinrsquos inequality gives (why Exercise)

P

9830839830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084983249 2 exp

983043minuscε2m

983044

Finally we can unfix z by taking a union bound over all possible z isin T

P

983083maxzisinT

9830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084983249

983131

zisinTP

9830839830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084

983249 |T | middot 2 exp983043minuscε2m

983044

By definition of T we have |T | le N2 So if we choose m ge Cεminus2 logNwith appropriately large constant C we can make

|T | middot 2 exp983043minuscε2m

983044le 001

The proof is completeIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 17 17

Page 4: Lecture 12: Concentration of sums of independent random ...math.xmu.edu.cn/group/nona/damc/Lecture12.pdf · Independent Random Variables DAMC Lecture 12 May 15 - 20, 2020 14 / 17.

3 Sub-gaussian distributions

Theorem 1 (Sub-gaussian properties)

For a random variable X the following properties are equivalent

Tail P|X| ge t le 2 exp(minust2K21 ) for all t ge 0

Moment 983042X983042p = (E|X|p)1p le K2radicp for all p ge 1

MGF of square E exp(X2K23 ) le 2

Moreover if EX = 0 then these properties are also equivalent to thefollowing one

MGF E exp(λX) le exp(λ2K24 ) for all λ isin R

Remark The parameters Ki gt 0 appearing in these properties can bedifferent However they may differ from each other by at most anabsolute constant factor This means that there exists an absoluteconstant C such that property 1 implies property 2 with parameterK2 le CK1 and similarly for every other pair or properties

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 4 17

Definition 2

Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 1 are called sub-gaussian

The best K3 is called the sub-gaussian norm of X and is usuallydenoted 983042X983042ψ2 that is

983042X983042ψ2 = inft gt 0 E exp(X2t2) le 2

Sub-gaussian random variable examples

(i) Normal random variables X sim N (microσ2)

(ii) Bernoulli random variable X = 0 1 with equal probabilities

(iii) More generally any bounded random variable X

Not sub-gaussian random variable examples

Poisson exponential Pareto and Cauchy distributions

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 5 17

Theorem 3 (Sums of sub-gaussians)

Let X1 XN be independent mean zero sub-gaussian randomvariables Then

983123Ni=1Xi is sub-gaussian and

983056983056983056983056983131N

i=1Xi

9830569830569830569830562

ψ2

le C983131N

i=1983042Xi9830422ψ2

where C is an absolute constant

Proof Let us bound the MGF of the sum for any λ isin R

E exp

983061λ983131N

i=1Xi

983062=

983132N

i=1E exp(λXi) (using independence)

le983132N

i=1exp(Cλ2983042Xi9830422ψ2

) (by last property in Theorem 1)

= exp(λ2K2) where K2 = C983131N

i=1983042Xi9830422ψ2

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 6 17

31 Hoeffdingrsquos inequality

We rewrite Theorem 3 as a concentration inequality by using thefirst property in Theorem 1

Theorem 4 (Hoeffdingrsquos inequality)

Let X1 XN be independent mean zero sub-gaussian randomvariables Then for every t ge 0 we have

P983069983055983055983055983055

983131N

i=1Xi

983055983055983055983055 ge t

983070le 2 exp

983075minus ct2983123N

i=1 983042Xi9830422ψ2

983076

Remark 5

Hoeffdingrsquos inequality controls how far and with what probability a sumof independent random variables can deviate from its mean which iszero

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 7 17

4 Sub-exponential distributions

A square X2 of a normal random variable X sim N (0 1) is notsub-gaussian

Theorem 6 (Sub-exponential properties)

For a random variable X the following properties are equivalent

Tail P|X| ge t le 2 exp(minustK1) for all t ge 0

Moment 983042X983042p = (E|X|p)1p le pK2 for all p ge 1

MGF of square E exp(|X|K3) le 2

Moreover if EX = 0 then these properties imply the following one

MGF E exp(λX) le exp(λ2K24 ) for |λ| le 1K4

Definition 7

Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 6 are called sub-exponential

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 8 17

The best K3 is called the sub-exponential norm of X and isusually denoted 983042X983042ψ1 that is

983042X983042ψ1 = inft gt 0 E exp(|X|t) le 2

All squares of sub-gaussian random variables are sub-exponentialrandom variables (Conversely) We have

983042X2983042ψ1 = 983042X9830422ψ2

41 Bernsteinrsquos inequality

Theorem 8 (Bernsteinrsquos inequality)

Let X1 XN be independent mean zero sub-exponential randomvariables Then for every t ge 0 we have

P

983083983055983055983055983055983055

N983131

i=1

Xi

983055983055983055983055983055 983245 t

983084983249 2 exp

983077minuscmin

983075t2

983123Ni=1 983042Xi9830422ψ1

t

maxi 983042Xi983042ψ1

983076983078

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 9 17

Proof Choose λ ge 0 and use Markovrsquos inequality to get (S =983123N

i=1Xi)

PS 983245 t = Pexp(λS) 983245 exp(λt) 983249 eminusλtE exp(λS)

Then by independence we have

PS 983245 t le eminusλt983132N

i=1E exp(λXi)

If we choose λ small enough so that 0 lt λ 983249 cmaxi983042Xi983042ψ1

then by the

last property in Theorem 6 we have

E exp (λXi) 983249 exp983059Cλ2 983042Xi9830422ψ1

983060

Hence

PS 983245 t le exp983043minusλt+ Cλ2σ2

983044 σ2 =

N983131

i=1

983042Xi9830422ψ1

The remaining part is left as an exerciseIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 10 17

Why does Bernsteinrsquos inequality have a mixture of two tails

(i) The sub-exponential tail should of course be there Indeedeven if the entire sum consisted of a single term Xi the bestbound we could hope for would be of the form exp(minusct983042Xi983042ψ1)

(ii) The sub-gaussian term could be explained by the central limittheorem which states that the sum should becomes approximatelynormal as the number of terms N increases to infinity

Remark 9 (Bernsteinrsquos inequality for bounded random variables)

Suppose further the random variables Xi are uniformly bounded whichis a stronger assumption than being sub-gaussian If K gt 0 is such that|Xi| le K almost surely for all i then for every t ge 0 we have

P

983083983055983055983055983055983055

N983131

i=1

Xi

983055983055983055983055983055 983245 t

983084983249 2 exp

983061minus t22

σ2 + CKt

983062

where σ2 =983123N

i=1 EX2i is the variance of the sum

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 11 17

Note that σ2 + CKt 983249 2max983043σ2 CKt

983044 So we can state the

probability bound as

2 exp

983063minuscmin

983061t2

σ2t

K

983062983064

Just as before here we also have a mixture of two tailssub-gaussian and sub-exponential

The sub-gaussian tail is a bit sharper than in Theorem 8 since itdepends on the variances rather than sub-gaussian norms of Xi

The sub-exponential tail on the other hand is weaker since itdepends on the sup-norms rather than the sub-exponential normsof Xi

More on concentration

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 12 17

5 Sub-gaussian random vectors

Definition Consider a random vector X taking values in Rn Wecall X a sub-gaussian random vector if all one-dimensionalmarginals of X ie the random variables 〈Xx〉 for all x isin Rnare sub-gaussian

The sub-gaussian norm of X is defined as

983042X983042ψ2 = supxisinRn983042x9830422=1

983042〈Xx〉983042ψ2

Sub-gaussian random vector examples

(i) The standard normal distribution N (0 In) (why)

(ii) The uniform distribution on the centered Euclidean sphere ofradius

radicn

(iii) The uniform distribution on the cube minus1 1n

(iv) A random vector X = (X1 middot middot middot Xn) with independent andsub-gaussian coordinates is sub-gaussian 983042X983042ψ2 983249 Cmaxi 983042Xi983042ψ2

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 13 17

6 Johnson-Lindenstrauss Lemma

Concentration inequalities like Hoeffdingrsquos and Bernsteinrsquos aresuccessfully used in the analysis of algorithms

Let us give one example for the problem of dimension reductionSuppose we have some data that is represented as a set of Npoints in Rn We would like to compress the data by representingit in a lower dimensional space Rm instead of Rn with m ≪ n Byhow much can we reduce the dimension without loosing theimportant features of the data

The basic result in this direction is the Johnson-LindenstraussLemma It states that a remarkably simple dimension reductionmethod works - a random linear map from Rn to Rm withm sim logN The logarithmic function grows very slowly so we canusually reduce the dimension dramatically

What exactly is a random linear map We consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors in Rn

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 14 17

Theorem 10 (Johnson-Lindenstrauss Lemma)

Let X be a set of N points in Rn and ε isin (0 1) Consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors Xi in Rn Rescale A by defining theldquoGaussian random projectionrdquo

P = Aradicm

Assume thatm ge Cεminus2 logN

where C is an appropriately large constant that depends only on thesub-gaussian norms of the vectors Xi Then with high probability (say099) the map P preserves the distances between all points in X witherror ε that is for all xy isin X

(1minus ε)983042xminus y9830422 983249 983042Pxminus Py9830422 983249 (1 + ε)983042xminus y9830422

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 15 17

Proof By linearity of P 1minus ε ge (1minus ε)2 and 1 + ε le (1 + ε)2 it issufficient to prove that

1minus ε le 983042P z98304222 le 1 + ε for all z isin T

where

T =

983069xminus y

983042xminus y9830422 xy isin X and x ∕= y

983070

By P z = Azradicm it is enough to show that

9830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 le ε for all z isin T

We can prove this inequality by combining concentration and a unionbound

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 16 17

In order to use concentration we first fix z isin T By assumption therandom variables 〈Xi z〉2 minus 1 are independent they have zero mean(why Exercise) and they are sub-exponential (why Exercise) ThenBernsteinrsquos inequality gives (why Exercise)

P

9830839830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084983249 2 exp

983043minuscε2m

983044

Finally we can unfix z by taking a union bound over all possible z isin T

P

983083maxzisinT

9830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084983249

983131

zisinTP

9830839830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084

983249 |T | middot 2 exp983043minuscε2m

983044

By definition of T we have |T | le N2 So if we choose m ge Cεminus2 logNwith appropriately large constant C we can make

|T | middot 2 exp983043minuscε2m

983044le 001

The proof is completeIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 17 17

Page 5: Lecture 12: Concentration of sums of independent random ...math.xmu.edu.cn/group/nona/damc/Lecture12.pdf · Independent Random Variables DAMC Lecture 12 May 15 - 20, 2020 14 / 17.

Definition 2

Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 1 are called sub-gaussian

The best K3 is called the sub-gaussian norm of X and is usuallydenoted 983042X983042ψ2 that is

983042X983042ψ2 = inft gt 0 E exp(X2t2) le 2

Sub-gaussian random variable examples

(i) Normal random variables X sim N (microσ2)

(ii) Bernoulli random variable X = 0 1 with equal probabilities

(iii) More generally any bounded random variable X

Not sub-gaussian random variable examples

Poisson exponential Pareto and Cauchy distributions

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 5 17

Theorem 3 (Sums of sub-gaussians)

Let X1 XN be independent mean zero sub-gaussian randomvariables Then

983123Ni=1Xi is sub-gaussian and

983056983056983056983056983131N

i=1Xi

9830569830569830569830562

ψ2

le C983131N

i=1983042Xi9830422ψ2

where C is an absolute constant

Proof Let us bound the MGF of the sum for any λ isin R

E exp

983061λ983131N

i=1Xi

983062=

983132N

i=1E exp(λXi) (using independence)

le983132N

i=1exp(Cλ2983042Xi9830422ψ2

) (by last property in Theorem 1)

= exp(λ2K2) where K2 = C983131N

i=1983042Xi9830422ψ2

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 6 17

31 Hoeffdingrsquos inequality

We rewrite Theorem 3 as a concentration inequality by using thefirst property in Theorem 1

Theorem 4 (Hoeffdingrsquos inequality)

Let X1 XN be independent mean zero sub-gaussian randomvariables Then for every t ge 0 we have

P983069983055983055983055983055

983131N

i=1Xi

983055983055983055983055 ge t

983070le 2 exp

983075minus ct2983123N

i=1 983042Xi9830422ψ2

983076

Remark 5

Hoeffdingrsquos inequality controls how far and with what probability a sumof independent random variables can deviate from its mean which iszero

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 7 17

4 Sub-exponential distributions

A square X2 of a normal random variable X sim N (0 1) is notsub-gaussian

Theorem 6 (Sub-exponential properties)

For a random variable X the following properties are equivalent

Tail P|X| ge t le 2 exp(minustK1) for all t ge 0

Moment 983042X983042p = (E|X|p)1p le pK2 for all p ge 1

MGF of square E exp(|X|K3) le 2

Moreover if EX = 0 then these properties imply the following one

MGF E exp(λX) le exp(λ2K24 ) for |λ| le 1K4

Definition 7

Random variables that satisfy one of the first three properties (andthus all of them) in Theorem 6 are called sub-exponential

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 8 17

The best K3 is called the sub-exponential norm of X and isusually denoted 983042X983042ψ1 that is

983042X983042ψ1 = inft gt 0 E exp(|X|t) le 2

All squares of sub-gaussian random variables are sub-exponentialrandom variables (Conversely) We have

983042X2983042ψ1 = 983042X9830422ψ2

41 Bernsteinrsquos inequality

Theorem 8 (Bernsteinrsquos inequality)

Let X1 XN be independent mean zero sub-exponential randomvariables Then for every t ge 0 we have

P

983083983055983055983055983055983055

N983131

i=1

Xi

983055983055983055983055983055 983245 t

983084983249 2 exp

983077minuscmin

983075t2

983123Ni=1 983042Xi9830422ψ1

t

maxi 983042Xi983042ψ1

983076983078

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 9 17

Proof Choose λ ge 0 and use Markovrsquos inequality to get (S =983123N

i=1Xi)

PS 983245 t = Pexp(λS) 983245 exp(λt) 983249 eminusλtE exp(λS)

Then by independence we have

PS 983245 t le eminusλt983132N

i=1E exp(λXi)

If we choose λ small enough so that 0 lt λ 983249 cmaxi983042Xi983042ψ1

then by the

last property in Theorem 6 we have

E exp (λXi) 983249 exp983059Cλ2 983042Xi9830422ψ1

983060

Hence

PS 983245 t le exp983043minusλt+ Cλ2σ2

983044 σ2 =

N983131

i=1

983042Xi9830422ψ1

The remaining part is left as an exerciseIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 10 17

Why does Bernsteinrsquos inequality have a mixture of two tails

(i) The sub-exponential tail should of course be there Indeedeven if the entire sum consisted of a single term Xi the bestbound we could hope for would be of the form exp(minusct983042Xi983042ψ1)

(ii) The sub-gaussian term could be explained by the central limittheorem which states that the sum should becomes approximatelynormal as the number of terms N increases to infinity

Remark 9 (Bernsteinrsquos inequality for bounded random variables)

Suppose further the random variables Xi are uniformly bounded whichis a stronger assumption than being sub-gaussian If K gt 0 is such that|Xi| le K almost surely for all i then for every t ge 0 we have

P

983083983055983055983055983055983055

N983131

i=1

Xi

983055983055983055983055983055 983245 t

983084983249 2 exp

983061minus t22

σ2 + CKt

983062

where σ2 =983123N

i=1 EX2i is the variance of the sum

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 11 17

Note that σ2 + CKt 983249 2max983043σ2 CKt

983044 So we can state the

probability bound as

2 exp

983063minuscmin

983061t2

σ2t

K

983062983064

Just as before here we also have a mixture of two tailssub-gaussian and sub-exponential

The sub-gaussian tail is a bit sharper than in Theorem 8 since itdepends on the variances rather than sub-gaussian norms of Xi

The sub-exponential tail on the other hand is weaker since itdepends on the sup-norms rather than the sub-exponential normsof Xi

More on concentration

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 12 17

5 Sub-gaussian random vectors

Definition Consider a random vector X taking values in Rn Wecall X a sub-gaussian random vector if all one-dimensionalmarginals of X ie the random variables 〈Xx〉 for all x isin Rnare sub-gaussian

The sub-gaussian norm of X is defined as

983042X983042ψ2 = supxisinRn983042x9830422=1

983042〈Xx〉983042ψ2

Sub-gaussian random vector examples

(i) The standard normal distribution N (0 In) (why)

(ii) The uniform distribution on the centered Euclidean sphere ofradius

radicn

(iii) The uniform distribution on the cube minus1 1n

(iv) A random vector X = (X1 middot middot middot Xn) with independent andsub-gaussian coordinates is sub-gaussian 983042X983042ψ2 983249 Cmaxi 983042Xi983042ψ2

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 13 17

6 Johnson-Lindenstrauss Lemma

Concentration inequalities like Hoeffdingrsquos and Bernsteinrsquos aresuccessfully used in the analysis of algorithms

Let us give one example for the problem of dimension reductionSuppose we have some data that is represented as a set of Npoints in Rn We would like to compress the data by representingit in a lower dimensional space Rm instead of Rn with m ≪ n Byhow much can we reduce the dimension without loosing theimportant features of the data

The basic result in this direction is the Johnson-LindenstraussLemma It states that a remarkably simple dimension reductionmethod works - a random linear map from Rn to Rm withm sim logN The logarithmic function grows very slowly so we canusually reduce the dimension dramatically

What exactly is a random linear map We consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors in Rn

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 14 17

Theorem 10 (Johnson-Lindenstrauss Lemma)

Let X be a set of N points in Rn and ε isin (0 1) Consider an mtimes nmatrix A whose rows are independent mean zero isotropic andsub-gaussian random vectors Xi in Rn Rescale A by defining theldquoGaussian random projectionrdquo

P = Aradicm

Assume thatm ge Cεminus2 logN

where C is an appropriately large constant that depends only on thesub-gaussian norms of the vectors Xi Then with high probability (say099) the map P preserves the distances between all points in X witherror ε that is for all xy isin X

(1minus ε)983042xminus y9830422 983249 983042Pxminus Py9830422 983249 (1 + ε)983042xminus y9830422

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 15 17

Proof By linearity of P 1minus ε ge (1minus ε)2 and 1 + ε le (1 + ε)2 it issufficient to prove that

1minus ε le 983042P z98304222 le 1 + ε for all z isin T

where

T =

983069xminus y

983042xminus y9830422 xy isin X and x ∕= y

983070

By P z = Azradicm it is enough to show that

9830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 le ε for all z isin T

We can prove this inequality by combining concentration and a unionbound

Independent Random Variables DAMC Lecture 12 May 15 - 20 2020 16 17

In order to use concentration we first fix z isin T By assumption therandom variables 〈Xi z〉2 minus 1 are independent they have zero mean(why Exercise) and they are sub-exponential (why Exercise) ThenBernsteinrsquos inequality gives (why Exercise)

P

9830839830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084983249 2 exp

983043minuscε2m

983044

Finally we can unfix z by taking a union bound over all possible z isin T

P

983083maxzisinT

9830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084983249

983131

zisinTP

9830839830559830559830559830559830551

m

m983131

i=1

〈Xi z〉2 minus 1

983055983055983055983055983055 gt ε

983084

983249 |T | middot 2 exp983043minuscε2m

983044

By definition of T we have |T | le N2 So if we choose m ge Cεminus2 logNwith appropriately large constant C we can make

|T | middot 2 exp983043minuscε2m

983044le 001

The proof is completeIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 17 17


983249 |T | middot 2 exp983043minuscε2m

983044

By definition of T we have |T | le N2 So if we choose m ge Cεminus2 logNwith appropriately large constant C we can make

|T | middot 2 exp983043minuscε2m

983044le 001

The proof is completeIndependent Random Variables DAMC Lecture 12 May 15 - 20 2020 17 17