The Central Limit Theorem: More of the Story
(Transcript of seminar slides)
Steven Janke
November 2015
Steven Janke (Seminar) The Central Limit Theorem:More of the Story November 2015 1 / 33
Central Limit Theorem
Theorem (Central Limit Theorem)
Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables, each with expectation $\mu$ and variance $\sigma^2$. Then the distribution of
$$Z_n = \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$
converges to the distribution of a standard normal random variable:
$$\lim_{n\to\infty} P(Z_n \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\,dy$$
Central Limit Theorem Applications
The sampling distribution of the mean is approximately normal.
The distribution of experimental errors is approximately normal.
But why the normal distribution?
Benford’s Law
In an arbitrary table of data, such as populations or lake areas,
$$P[\text{Leading digit is } d] = \log_{10}\!\left(1 + \frac{1}{d}\right)$$
Data: List of 60 Tallest Buildings

Lead Digit   Meters   Feet    Benford
    1        0.433    0.300   0.301
    2        0.117    0.133   0.176
    3        0.150    0.133   0.125
    4        0.100    0.100   0.097
    5        0.067    0.167   0.079
    6        0.017    0.083   0.067
    7        0.033    0.033   0.058
    8        0.083    0.017   0.051
    9        0.000    0.033   0.046
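The Benford column above can be reproduced with a short script. As an illustrative data set (not from the slides), the leading digits of the powers of 2 are a classic Benford-distributed sequence:

```python
import math
from collections import Counter

def benford_probs():
    """Benford's law: P[leading digit is d] = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:e}"[0])

# Powers of 2 are a classic Benford-distributed sequence (not the
# building data from the slide).
digits = Counter(leading_digit(2 ** k) for k in range(1, 1001))
observed = {d: digits[d] / 1000 for d in range(1, 10)}
expected = benford_probs()
for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

The nine Benford probabilities telescope to $\log_{10} 10 = 1$, so they form a genuine distribution.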
Benford Justification
Simon Newcomb (1881); Frank Benford (1938)
"Proof" arguments:
Positional number system
Densities
Scale invariance
Scale and base unbiased (Hill 1995)
Central Limit Theorem
Theorem (Central Limit Theorem)
Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables, each with expectation $\mu$ and variance $\sigma^2$. Then the distribution of
$$Z_n = \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$
converges to the distribution of a standard normal random variable:
$$\lim_{n\to\infty} P(Z_n \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\,dy$$
Central Limit Theorem Proof
Proof Sketch:
Let $Y_i = X_i - \mu$. The moment generating function of $Y_i$ is $M_{Y_i}(t) = E\,e^{tY_i}$.
The MGF of $Z_n$ is $M_{Z_n}(t) = \left[M_{Y_1}\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$.
$\lim_{n\to\infty} \ln M_{Z_n}(t) = \frac{t^2}{2}$.
The MGF of the standard normal is $e^{t^2/2}$.
Since the MGFs converge, the distributions converge (Lévy continuity theorem).
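The limit $\ln M_{Z_n}(t) \to t^2/2$ in the sketch can be checked numerically for a concrete case. Assuming Exponential(1) variables (an illustrative choice, not from the slides; $\mu = \sigma = 1$), the MGF of $Y = X - 1$ is $M_Y(t) = e^{-t}/(1-t)$ for $t < 1$:

```python
import math

# Numerical check of ln M_{Z_n}(t) -> t^2/2, assuming Exponential(1)
# variables (mu = sigma = 1).  For Y = X - 1 the MGF is
# M_Y(t) = e^{-t} / (1 - t), so ln M_{Z_n}(t) = n * ln M_Y(t / sqrt(n)).
def log_mgf_Zn(t, n):
    s = t / math.sqrt(n)
    return n * (-s - math.log(1 - s))

for n in (10, 100, 10000):
    print(n, log_mgf_Zn(1.0, n))   # approaches t^2/2 = 0.5
```

Expanding $\ln M_Y(s) = s^2/2 + s^3/3 + \cdots$ shows the error term is $t^3/(3\sqrt{n})$, which vanishes as $n$ grows.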
Counter-Examples
Moment problem: the lognormal R.V. is not determined by its moments.
No first moment: the Cauchy R.V. has no MGF and $E|X| = \infty$, so the CLT does not hold.
No second moment: $f(x) = \frac{1}{|x|^3}$ for $|x| \ge 1$ has mean zero but infinite variance.
Pairwise independence is not sufficient for the CLT.
Demoivre’s Theorem 1733
Each $X_i$ is Bernoulli (0 or 1) with $p = \tfrac{1}{2}$, and $S_n = X_1 + \cdots + X_n$.
$$b(k) = P[S_n = k] = \binom{n}{k}\frac{1}{2^n}$$
$$n! \approx \sqrt{2\pi n}\; n^n e^{-n} \quad \text{(Stirling's formula)}$$
$$b\!\left(\tfrac{n}{2}\right) \approx \frac{\sqrt{2}}{\sqrt{\pi n}}$$
$$\log\frac{b\!\left(\tfrac{n}{2}+d\right)}{b\!\left(\tfrac{n}{2}\right)} \approx -\frac{2d^2}{n}$$
$$b\!\left(\tfrac{n}{2}+d\right) \approx \frac{\sqrt{2}}{\sqrt{\pi n}}\, e^{-2d^2/n}$$
$$\lim_{n\to\infty} P\!\left[a \le \frac{S_n - n/2}{\sqrt{n}/2} \le b\right] = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}\,dx$$
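De Moivre's local approximation can be compared against exact binomial probabilities; a small check (illustrative, not from the slides):

```python
import math

# Check of De Moivre's local approximation for fair-coin sums:
# b(n/2 + d) ~ sqrt(2/(pi*n)) * exp(-2*d^2/n).
def b(n, k):
    """Exact P[S_n = k] for n fair Bernoulli trials."""
    return math.comb(n, k) / 2 ** n

def demoivre(n, d):
    return math.sqrt(2 / (math.pi * n)) * math.exp(-2 * d * d / n)

n = 1000
for d in (0, 10, 20):
    print(d, b(n, n // 2 + d), demoivre(n, d))
```

At $n = 1000$ the exact and approximate values agree to several decimal places, which is exactly the local limit behind the integral statement above.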
Laplace 1810
Dealt with the independent and identically distributed case. Started with discrete variables:
Consider $X_i$ where $p_k = P[X_i = \tfrac{k}{m}]$ for $k = -m, -m+1, \cdots, m-1, m$.
Generating function: $T(t) = \sum_{k=-m}^{m} p_k t^k$
$q_j = P[\sum X_i = \tfrac{j}{m}]$ is the coefficient of $t^j$ in $T(t)^n$.
Substitute $e^{ix}$ for $t$ and recall $\frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-itx} e^{isx}\,dx = \delta_{ts}$.
Then,
$$q_j = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-ijx}\left[\sum_{k=-m}^{m} p_k e^{ikx}\right]^n dx$$
Now expand $e^{ikx}$ in a power series around 0 and use the fact that the mean of $X_i$ is zero.
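Laplace's generating-function step is just polynomial multiplication: the coefficients of $T(t)^n$ are the $n$-fold convolution of the $p_k$. A sketch with a hypothetical example ($m = 1$, $X_i$ uniform on $\{-1, 0, 1\}$, not from the slides):

```python
# Laplace's generating-function step, numerically: the distribution of a
# sum of n i.i.d. discrete variables is the n-fold convolution of the
# single-variable probabilities, i.e. the coefficients of T(t)^n.
def convolve(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

# Hypothetical example (m = 1): X_i uniform on {-1, 0, 1}, so the
# coefficients p_{-1}, p_0, p_1 are each 1/3; list index 0 stands for
# exponent -1, shifting exponents so they stay nonnegative.
p = [1 / 3, 1 / 3, 1 / 3]
dist = p
for _ in range(3):
    dist = convolve(dist, p)   # now the coefficients of T(t)^4
print(dist)                    # P[sum = j] for j = -4, ..., 4
```

The center coefficient is $19/81$, the central trinomial coefficient over $3^4$.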
Why Normal?
Normal Characteristic Function
$$f(u) = \frac{1}{\sqrt{2\pi}}\int e^{iux} e^{-x^2/2}\,dx = e^{-u^2/2}$$
$$f_{S_n/(\sigma\sqrt{n})}(u) = E\!\left[e^{iu S_n/(\sigma\sqrt{n})}\right] = \left(f\!\left(\frac{u}{\sigma\sqrt{n}}\right)\right)^n$$
$$= \left(1 - \frac{\sigma^2 u^2}{2\sigma^2 n} + o\!\left(\frac{\sigma^2 u^2}{\sigma^2 n}\right)\right)^n = \left(1 - \frac{u^2}{2n} + o\!\left(\frac{u^2}{n}\right)\right)^n \to e^{-u^2/2}$$
Lévy Continuity
Theorem
If distribution functions $F_n$ converge to $F$, then the corresponding characteristic functions $f_n$ converge to $f$. Conversely, if $f_n$ converges to a function $g$ continuous at 0, then $g$ is the characteristic function of some distribution $F$ and $F_n$ converges to $F$.
Proof Sketch:
The first direction is the Helly-Bray theorem.
The set $\{e^{iux}\}$ is a separating set for distribution functions.
In both directions, continuity points and the mass of $F_n$ are critical.
History
Laplace never presented a general CLT statement (he was concerned with limiting probabilities for particular problems).
Concern over convergence led Poisson to improvements (the not-identically-distributed case).
Dirichlet and Cauchy changed the conception of analysis (epsilon/delta). Counter-examples uncovered limitations.
Chebyshev proved the CLT using convergence of moments (Markov and Liapounov were his students).
First rigorous proof (Liapounov 1900). The CLT holds for independent (but not necessarily i.i.d.) $X_i$ if
$$\frac{\sum E|X_j|^3}{\left[\sum E X_j^2\right]^{3/2}} = \frac{\sum E|X_j|^3}{s_n^3} \to 0$$
Liapounov proof
Assume:
$$\frac{\sum E|X_j|^3}{\left[\sum E X_j^2\right]^{3/2}} = \frac{\sum E|X_j|^3}{s_n^3} \to 0$$
$$g_n(u) = \prod_{k=1}^{n} f_k\!\left(\frac{u}{s_n}\right) = \prod_{k=1}^{n}\left[1 + \left(f_k\!\left(\frac{u}{s_n}\right) - 1\right)\right]$$
$$f_k\!\left(\frac{u}{s_n}\right) = 1 - \frac{u^2}{2s_n^2}\left(\sigma_k^2 + \delta_k\!\left(\frac{u}{s_n}\right)\right)$$
$$\left|f_k\!\left(\frac{u}{s_n}\right) - 1\right| \le 2u^2\,\frac{\sigma_k^2}{s_n^2} \implies \sum_{k=1}^{n}\left|f_k\!\left(\frac{u}{s_n}\right) - 1\right| \le 2u^2$$
$$\left(E|X_k|^2\right)^{3/2} \le E|X_k|^3$$
$$\sup_{k\le n}\frac{\sigma_k}{s_n} \to 0 \implies \sup_k\left|f_k\!\left(\frac{u}{s_n}\right) - 1\right| \to 0$$
Use $\log(1+z) = z(1+\theta z)$ where $|\theta| \le 1$ for $|z| \le \tfrac{1}{2}$:
$$\log g_n(u) = \sum_{k=1}^{n}\left(f_k\!\left(\frac{u}{s_n}\right) - 1\right) + \theta\sum_{k=1}^{n}\left(f_k\!\left(\frac{u}{s_n}\right) - 1\right)^2$$
Liapounov proof continued
$$\log g_n(u) = \sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right) + \theta\sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right)^2$$
$$\left|\theta\sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right)^2\right| \le \sup_k\left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \cdot \sum_{k=1}^{n}\left|f_k\!\left(\tfrac{u}{s_n}\right) - 1\right| \to 0$$
$$f_k\!\left(\tfrac{u}{s_n}\right) - 1 = -\frac{u^2}{2}\frac{\sigma_k^2}{s_n^2} + \theta_k\,\frac{|u|^3}{s_n^3}\,E|X_k|^3$$
$$\sum_{k=1}^{n}\left(f_k\!\left(\tfrac{u}{s_n}\right) - 1\right) = -\frac{u^2}{2} + \theta\,\frac{|u|^3}{s_n^3}\sum_{k=1}^{n} E|X_k|^3 \to -\frac{u^2}{2}$$
Lindeberg 1922
Theorem (Central Limit Theorem)
Let the variables $X_i$ be independent with $EX_i = 0$ and $EX_i^2 = \sigma_i^2$. Let $s_n$ be the standard deviation of the sum $S_n$ and let $F$ be the distribution of $S_n/s_n$. With $\Phi(x)$ the normal distribution, if
$$\frac{1}{s_n^2}\sum_k \int_{|x|\ge \varepsilon s_n} x^2\,dF_k \to 0,$$
then
$$\sup_x |F(x) - \Phi(x)| \le 5\varepsilon$$
Lindeberg Proof
Pick an auxiliary function $f$.
For an arbitrary distribution $V$, define $F(x) = \int f(x-t)\,dV(t)$.
With $\varphi$ the normal density, define $\Psi(x) = \int f(x-t)\varphi(t)\,dt$.
A Taylor expansion of $f$ to third order gives $|F(x) - \Psi(x)| < k\int |x|^3\,dV(x)$.
With $U_i$ the distribution of $X_i$:
$$F_1(x) = \int f(x-t)\,dU_1(t), \quad \ldots, \quad F_n(x) = \int F_{n-1}(x-t)\,dU_n(t)$$
Note $U(x) = \int\cdots\int U(x - t_1 - t_2 - \cdots - t_n)\,dU_1(t_1)\cdots dU_n(t_n)$.
By selecting $f$ carefully,
$$|U(x) - \Phi(x)| < 3\left(\sum_{i=1}^{n}\int |x|^3\,dU_i(x)\right)^{1/4}$$
Still Why Normal?
Let $X = X_1 + X_2$ where $X$ is $\mathcal{N}(0,1)$ and $X_1$ is independent of $X_2$.
$$f(u) = f_1(u)f_2(u) = e^{-u^2/2}$$
$e^{-u^2/2}$ is an entire, non-vanishing function, and $|f_1(z)| \le e^{c|z|^2}$.
The Hadamard factorization theorem $\implies \log f_1(u)$ is a polynomial in $u$ of degree at most 2.
$f_1$ is a characteristic function $\implies f_1(0) = 1$, $f_1(-u) = \overline{f_1(u)}$, and it is bounded.
Hence $\log f_1(u) = iau - bu^2$ with $b \ge 0$.
This is the general form of the normal characteristic function, so each summand is itself normal.
Feller - Levy 1935
Theorem (Final Central Limit Theorem)
Let the variables $X_i$ be independent with $EX_i = 0$ and $EX_i^2 = \sigma_i^2$. Let $S_n = \sum_{1}^{n} X_i$ and $s_n^2 = \sum_{1}^{n} \sigma_k^2$. $\Phi$ is the normal distribution and $F_k$ is the distribution of $X_k$. Then as $n \to \infty$,
$$P[S_n/s_n \le x] \to \Phi(x) \quad \text{and} \quad \max_{k\le n}\frac{\sigma_k}{s_n} \to 0$$
if and only if for every $\varepsilon > 0$
$$\frac{1}{s_n^2}\sum_k \int_{|x|\ge \varepsilon s_n} x^2\,dF_k \to 0$$
Levy
The following appeared in Lévy's monograph in 1937.
Theorem (Lévy's version)
In order that a sum $S = \sum_j X_j$ of independent variables have a distribution close to Gaussian, it is necessary and sufficient that, after reducing medians to zero, the following conditions be satisfied.
Each summand that is not negligible compared to the dispersion of the entire sum has a distribution close to Gaussian.
The maximum of the absolute value of the negligible summands is itself negligible compared to the dispersion of the sum.
Normed sums and Stable Laws
Definition
$X$ is said to have a stable law if whenever $X_1, X_2, \cdots, X_k$ are independent with the same distribution as $X$, we have
$$X_1 + X_2 + \cdots + X_k \overset{D}{=} aX + b$$
for constants $a > 0$ and $b$ depending on $k$.
Both the Normal ($\sigma^2 < \infty$) and Cauchy ($\sigma^2 = \infty$) laws are stable.
Theorem (Limit of Normed Sums)
Suppose that
$$S_n = \frac{X_1 + X_2 + \cdots + X_n}{A_n} - B_n \overset{D}{\to} X;$$
then $X$ has a stable law.
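Stability of the Cauchy law can be seen in simulation (an illustrative sketch, not from the slides): the average of $k$ standard Cauchy variables is again standard Cauchy, so averaging never concentrates the distribution.

```python
import math, random, statistics

# Illustration of stability of the Cauchy law: the average of k standard
# Cauchy variables is again standard Cauchy, so the interquartile range
# stays near 2*tan(pi/4) = 2 no matter how many variables are averaged.
random.seed(1)

def cauchy():
    """One standard Cauchy sample via the inverse-CDF method."""
    return math.tan(math.pi * (random.random() - 0.5))

def iqr(xs):
    q = statistics.quantiles(xs, n=4)
    return q[2] - q[0]

single = [cauchy() for _ in range(20000)]
averaged = [sum(cauchy() for _ in range(10)) / 10 for _ in range(20000)]
print(iqr(single), iqr(averaged))   # both near 2
```

Contrast with the finite-variance case, where averaging $k$ variables shrinks the spread by $\sqrt{k}$.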
Entropy
Definition (Discrete Entropy)
Let $X$ be a discrete random variable taking values $x_i$ with probability $p_i$. The entropy of $X$ is
$$H(X) = -\sum_i p_i \log p_i$$
$H(X) > 0$ unless $X$ is constant.
$H(aX + b) = H(X)$
If $X$ is the result of flipping a fair coin, $H(X) = 1$ (using the base-2 logarithm).
If $X$ takes $n$ values, $H(X)$ is maximized when $p_i = \frac{1}{n}$.
Extend to joint entropy $H(X, Y)$.
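A minimal implementation of discrete entropy, illustrating the bullet points above (the fair coin gives 1 bit; the uniform distribution maximizes):

```python
import math

def entropy(ps, base=2):
    """Discrete entropy H(X) = -sum p_i log p_i (with 0 log 0 = 0)."""
    return -sum(p * math.log(p, base) for p in ps if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1 bit
print(entropy([0.25] * 4))   # uniform on 4 values: maximal, 2 bits
print(entropy([0.9, 0.1]))   # biased coin: less than 1 bit
print(entropy([1.0]))        # constant: 0
```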
Entropy Axioms
Symmetric in the $p_i$
Continuous in the $p_i$
Normalized: $H(X) = 1$ for a fair coin.
$X$ and $Y$ independent gives $H(X,Y) = H(X) + H(Y)$.
Decomposition:
$$H(r_1, \cdots, r_m, q_1, \cdots, q_n) = \alpha H(r_1, \cdots, r_m) + (1-\alpha) H(q_1, \cdots, q_n)$$
The axioms $\implies H(X) = -\sum_i p_i \log p_i$.
Entropy
Definition (Differential Entropy)
Let $X$ be a continuous random variable with density $p$. The differential entropy of $X$ is
$$H(X) = -\int p(t) \log p(t)\,dt$$
$X$ uniform on $[0, c]$ gives $H(X) = \log(c)$.
$X$ distributed $\mathcal{N}(0, \sigma^2)$ gives $H(X) = \frac{1}{2}\log(2\pi e \sigma^2)$.
$H(aX + b) = H(X) + \log|a|$
$H(X)$ can be negative.
Entropy
Definition (Relative Entropy)
Let $p$ and $q$ be densities. The relative entropy distance from $p$ to $q$ is
$$D(p\|q) = \int p(x) \log\!\left(\frac{p(x)}{q(x)}\right) dx$$
If $\operatorname{supp}(p) \not\subset \operatorname{supp}(q)$, then $D(p\|q) = \infty$.
$D$ is not a metric (not symmetric and no triangle inequality).
$D(p\|q) \ge 0$ with equality if and only if $p = q$ a.e.
Convergence in $D$ is stronger than convergence in $L^1$ (Pinsker's inequality).
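A small sketch of relative entropy for discrete distributions, illustrating the asymmetry and the infinite case:

```python
import math

def kl(p, q):
    """D(p||q) for discrete distributions over the same outcomes;
    infinite when p puts mass where q does not."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            d += pi * math.log(pi / qi)
    return d

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q), kl(q, p))   # unequal: D is not symmetric
print(kl(p, p))             # zero exactly when the distributions agree
```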
Entropy
Lemma (Maximum Entropy)
Let $p$ be the density of a random variable with variance $\sigma^2$, and let $\varphi_{\sigma^2}$ be the density of a $\mathcal{N}(0, \sigma^2)$ random variable. Then
$$H(p) \le H(\varphi_{\sigma^2}) = \frac{\log(2\pi e \sigma^2)}{2}$$
with equality if and only if $p$ is a normal density.
Proof:
$$0 \le D(p\|\varphi_{\sigma^2}) = \int p(x)\left(\log p(x) + \frac{\log(2\pi\sigma^2)}{2} + \frac{x^2}{2\sigma^2}\log(e)\right) dx$$
$$= -H(p) + \frac{\log(2\pi e \sigma^2)}{2} = -H(p) + H(\varphi_{\sigma^2})$$
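The lemma can be checked in closed form for a non-normal density (an illustrative check, not from the slides): the uniform on $[0, c]$ has entropy $\log c$ and variance $c^2/12$, and the matching normal beats it by the constant $\frac{1}{2}\log(\pi e/6)$.

```python
import math

# Closed-form check of the maximum-entropy lemma: the uniform density
# on [0, c] has H = log(c) and variance c^2/12; the normal with the
# same variance must have larger entropy.
def h_uniform(c):
    return math.log(c)

def h_normal(var):
    return 0.5 * math.log(2 * math.pi * math.e * var)

for c in (0.5, 1.0, 10.0):
    gap = h_normal(c * c / 12) - h_uniform(c)
    print(c, gap)   # constant positive gap 0.5*log(pi*e/6) ~ 0.176
```

The gap is independent of $c$ because both entropies shift by $\log c$ under scaling, matching $H(aX) = H(X) + \log|a|$.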
Second Law of Thermodynamics
Let $\Omega$ be the number of microstates corresponding to a particular macrostate. Then the entropy of the macrostate is $S = k \log_e \Omega$.
$\Omega$ is composed of microstates $r$ of probability $p_r$. If there are $v$ copies of the system, about $v p_r$ are microstates of type $r$.
$$\Omega = \frac{v!}{v_1! v_2! \cdots v_k!} \approx v^v\, v_1^{-v_1} v_2^{-v_2} \cdots v_k^{-v_k}$$
Then,
$$S = k \log_e \Omega \approx k\left(v \log(v) - \sum_r v_r \log(v_r)\right) = -kv \sum_r p_r \log(p_r)$$
Entropy is maximized subject to an energy constraint at Gibbs states ($\sum p_r = 1$ and $\sum p_r E_r = E$).
Fisher Information
Definition
Let $Y$ be a random variable with density $g$ and variance $\sigma^2$. Set $\rho = \frac{g'}{g}$ and let $\varphi$ be the normal density with the same mean and variance as $Y$. Fisher information is
$$I(Y) = E[(\rho(Y))^2]$$
Standardized Fisher information is
$$J(Y) = \sigma^2 E[(\rho(Y) - \rho_\varphi(Y))^2]$$
$Z$ with distribution $\mathcal{N}(0, \sigma^2)$ minimizes Fisher information ($I(Z) = \frac{1}{\sigma^2}$) among all distributions with variance $\sigma^2$. (Note: $J(Z) = 0$.)
Fisher Information Properties
Lemma (de Bruijn)
If $Z$ is a normal R.V. independent of $Y$ with the same mean and variance, then
$$D(Y\|Z) = \int_0^1 J\!\left(\sqrt{t}\,Y + \sqrt{1-t}\,Z\right)\frac{1}{2t}\,dt$$
The proof relies on these facts:
The normal density satisfies the heat equation: $\frac{\partial \varphi_\tau}{\partial \tau} = \frac{1}{2}\frac{\partial^2 \varphi_\tau}{\partial x^2}$.
Hence, the density of $Y + Z$ also satisfies the heat equation.
We can then calculate the derivative of $D(Y\|Z)$.
We also have that if $J(Y) \to 0$ then $D(Y\|Z) \to 0$.
Fisher Information Properties
Lemma
If $U$ and $V$ are independent, then for $\beta \in [0, 1]$
$$J(U + V) \le \beta^2 J(U) + (1-\beta)^2 J(V)$$
$$J\!\left(\sqrt{\beta}\,U + \sqrt{1-\beta}\,V\right) \le \beta J(U) + (1-\beta) J(V)$$
with equality if and only if $U$ and $V$ are normal.
In particular,
$$J(X + Y) \le J(X) \qquad H(X + Y) \ge H(X)$$
Fisher Information Properties
From the Lemma, $J(\sqrt{\beta}\,U + \sqrt{1-\beta}\,V) \le \beta J(U) + (1-\beta) J(V)$.
In particular, with $S_n = \left(\sum_{1}^{n} X_i\right)/\sqrt{n}$ and $\beta = \frac{n}{n+m}$,
$$n J(S_n) + m J(S_m) \ge (m+n) J(S_{n+m})$$
If $J(S_n) < \infty$ for some $n$, then $J(S_n)$ converges to 0. (Take $n = m$ and assume i.i.d. variables to see monotone convergence of a subsequence.)
Sketch of Central Limit Theorem Proof
Assume i.i.d. random variables.
$J(S_n)$ converges to 0.
Hence $D(S_n\|Z)$ converges to 0.
For densities $p_{S_n}$ and $\varphi_{\sigma^2}$, Pinsker's inequality gives $\left(\int |p_{S_n} - \varphi_{\sigma^2}|\right)^2 \le 2 D(S_n\|Z)$.
$S_n \to \mathcal{N}(0, \sigma^2)$ in distribution.
This can be generalized to non-i.i.d. variables.
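The entropy route to the CLT can be watched numerically. An illustrative Monte Carlo sketch (not from the slides), assuming uniform $X_i$ with variance 1 and a simple histogram estimate of differential entropy, shows $H(S_n)$ climbing toward the normal maximum $\frac{1}{2}\log(2\pi e) \approx 1.4189$:

```python
import math, random

# Monte Carlo sketch of the entropy view of the CLT: the differential
# entropy of S_n = (X_1 + ... + X_n)/sqrt(n), with X_i uniform on
# [-sqrt(3), sqrt(3)] (mean 0, variance 1), increases toward the normal
# maximum 0.5*log(2*pi*e) ~ 1.4189.
random.seed(7)
HALF_WIDTH = math.sqrt(3)

def sample_Sn(n):
    return sum(random.uniform(-HALF_WIDTH, HALF_WIDTH) for _ in range(n)) / math.sqrt(n)

def entropy_estimate(xs, bins=60):
    """Histogram estimate of differential entropy."""
    lo, hi = min(xs), max(xs)
    w = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        counts[min(int((x - lo) / w), bins - 1)] += 1
    n = len(xs)
    return -sum(c / n * math.log(c / (n * w)) for c in counts if c)

for n in (1, 2, 8):
    xs = [sample_Sn(n) for _ in range(100000)]
    print(n, round(entropy_estimate(xs), 4))
```

For $n = 1$ the estimate sits near $\log(2\sqrt{3}) \approx 1.2425$, the entropy of the variance-1 uniform, and by $n = 8$ it is already close to the normal bound.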
Conclusions
Assume errors are "uniformly asymptotically negligible" (no one error dominates).
Assume independent errors with distributions that have second moments.
Normalize the sum with its mean and variance.
Then the limit is a standard normal distribution.
The normal distribution is stable and "infinitely divisible" ($X = X_1 + X_2$).
The normal distribution maximizes entropy for a given variance.
Convolution (adding R.V.'s) increases entropy.