CHAPTER 1: DISTRIBUTION THEORY


• Formal definitions based on measure theory
A random variable (R.V.) is a measurable function on a probability measure space, which consists of sets in a σ-field equipped with a probability measure; probability mass function:
the Radon-Nikodym derivative of the random-variable-induced measure w.r.t. a counting measure; density function:
the Radon-Nikodym derivative of the random-variable-induced measure w.r.t. the Lebesgue measure.
• Descriptive quantities of a univariate distribution
cumulative distribution function: FX(x) ≡ P(X ≤ x)
moments: E[X^k], k = 1, 2, ... (mean if k = 1); central moments: E[(X − µ)^k] (variance if k = 2)
quantile function: FX^{-1}(p) ≡ inf{x : FX(x) > p}
• Characteristic function (c.f.) definition
φX(t) = E[exp{itX}]
inversion formula (at continuity points a < b of FX):
FX(b) − FX(a) = lim_{T→∞} (1/2π) ∫_{−T}^{T} [(e^{−ita} − e^{−itb})/(it)] φX(t) dt
moment generating function (if it exists)
mX(t) = E[exp{tX}]
covariance and correlation: Cov(X,Y), corr(X,Y)
conditional density or probability (f ’s become probability mass functions if R.V.s are discrete):
fX|Y(x|y) = fX,Y(x, y)/fY(y)
E[X|Y = y] = ∫ x fX|Y(x|y) dx
independence: X and Y are independent iff fX,Y(x, y) = fX(x)fY(y)
• Double expectation formulae
E[X] = E[E[X|Y]]
Var(X) = E[Var(X|Y)] + Var(E[X|Y])
• This is useful for computing the moments of X when X given Y has a simple distribution! In-class exercise Compute E[X] and Var(X) if Y ∼ Bernoulli(p) and X given Y = y follows N(µ_y, σ_y^2).
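A minimal simulation sketch of this exercise in Python (the values of p, µ0, µ1, σ0, σ1 below are illustrative assumptions, not from the notes): the double expectation formulae give E[X] = pµ1 + (1 − p)µ0 and Var(X) = pσ1^2 + (1 − p)σ0^2 + p(1 − p)(µ1 − µ0)^2, which the simulation should reproduce up to Monte Carlo error.

    import numpy as np

    # Illustrative parameter values (assumptions for this sketch only)
    p, mu0, mu1, s0, s1 = 0.3, -1.0, 2.0, 1.0, 0.5
    rng = np.random.default_rng(0)

    y = rng.binomial(1, p, size=1_000_000)            # Y ~ Bernoulli(p)
    x = rng.normal(np.where(y == 1, mu1, mu0),        # X | Y=y ~ N(mu_y, sigma_y^2)
                   np.where(y == 1, s1, s0))

    mean_formula = p * mu1 + (1 - p) * mu0
    var_formula = p * s1**2 + (1 - p) * s0**2 + p * (1 - p) * (mu1 - mu0)**2
    print(x.mean(), mean_formula)                     # should agree closely
    print(x.var(), var_formula)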
Discrete Random Variables
• Binomial distribution A Binomial r.v. is the total number of successes in n independent Bernoulli trials. Binomial(n, p):
P(Sn = k) = (n choose k) p^k (1 − p)^{n−k}, k = 0, 1, ..., n
φX(t) = (1 − p + pe^{it})^n
In-class exercise Prove the last statement.
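This is not a proof, but a quick numerical check of the stated c.f. (n, p and t below are arbitrary illustrative values): E[exp{itX}] computed directly from the Binomial pmf should equal (1 − p + pe^{it})^n.

    import numpy as np
    from scipy.stats import binom

    n, p, t = 10, 0.3, 0.7                      # arbitrary illustrative values
    k = np.arange(n + 1)
    lhs = np.sum(binom.pmf(k, n, p) * np.exp(1j * t * k))   # E[exp{itX}] over the pmf
    rhs = (1 - p + p * np.exp(1j * t)) ** n
    print(np.allclose(lhs, rhs))                # True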
• Negative Binomial distribution It describes the distribution of the number of independent Bernoulli trials needed to obtain m successes. NB(m, p):
P(Wm = k) = (k − 1 choose m − 1) p^m (1 − p)^{k−m}, k = m, m + 1, ...
NB(m1, p) + NB(m2, p) ∼ NB(m1 + m2, p)
In-class exercise Prove the last statement.
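Again not a proof, but a simulation sketch of the additivity property (m1, m2 and p are illustrative). Note that numpy's negative_binomial returns the number of failures before the m-th success, so m is added to convert to the number of trials Wm used in the notes.

    import numpy as np

    rng = np.random.default_rng(1)
    m1, m2, p, B = 3, 5, 0.4, 200_000

    w1 = rng.negative_binomial(m1, p, B) + m1            # trials until m1 successes
    w2 = rng.negative_binomial(m2, p, B) + m2            # trials until m2 successes
    w12 = rng.negative_binomial(m1 + m2, p, B) + m1 + m2

    # empirical pmf of W1 + W2 vs. that of NB(m1 + m2, p) at a few points
    for k in range(m1 + m2, m1 + m2 + 5):
        print(k, np.mean(w1 + w2 == k), np.mean(w12 == k))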
• Hypergeometric distribution It describes the distribution of the number of black balls if we randomly draw n balls from an urn with M blacks and (N − M) whites (urn model). HG(N, M, n):
P(Sn = k) = (M choose k)(N − M choose n − k)/(N choose n)
E[Sn] = nM/N, Var(Sn) = n(M/N)(1 − M/N)(N − n)/(N − 1)
• Poisson distribution It is used to describe the distribution of the number of
incidences in an infinite population. Poisson(λ):
P(X = k) = λ^k e^{−λ}/k!, k = 0, 1, 2, ...
E[X] = Var(X) = λ, φX(t) = exp{−λ(1 − e^{it})}
Summation of two independent r.v.’s:
Poisson(λ1) + Poisson(λ2) ∼ Poisson(λ1 + λ2)
Poisson approximation to the Binomial: if Xn1, ..., Xnn are i.i.d. Bernoulli(pn) and npn → λ, then
P(Sn = k) = [n!/(k!(n − k)!)] pn^k (1 − pn)^{n−k} → λ^k e^{−λ}/k!
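A numerical illustration of the Poisson limit (λ and n are illustrative choices): for large n, the Binomial(n, λ/n) pmf is close to the Poisson(λ) pmf.

    import numpy as np
    from scipy.stats import binom, poisson

    lam, n = 3.0, 1000                    # n*p_n = lam, so p_n = lam/n
    k = np.arange(11)
    err = np.abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam))
    print(err.max())                      # small, and -> 0 as n grows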
• Multinomial distribution Assume Y1, ..., Yn are independent k-category random variables such that P(Yi = category l) = pl, with Σ_{l=1}^{k} pl = 1. Let Nl, 1 ≤ l ≤ k, count the number of times that {Y1, ..., Yn} takes category value l. Then
P(N1 = n1, ..., Nk = nk) = [n!/(n1! · · · nk!)] p1^{n1} · · · pk^{nk}, n1 + · · · + nk = n.
Continuous Random Variables
• Uniform distribution Uniform(a, b):
fX(x) = I[a,b](x)/(b − a),
where IA(x) is an indicator function taking value 1 if x ∈ A and 0 otherwise.
E[X] = (a + b)/2, Var(X) = (b − a)^2/12
The Uniform(0, 1) distribution is the most fundamental one: it is the distribution every pseudo-random number generator tries to generate numbers from.
• Normal distribution It describes the distributional behavior of averaged statistics in practice. N(µ, σ^2):
φX(t) = exp{itµ − σ^2t^2/2}
• Gamma distribution It is useful to describe the distribution of a positive
random variable with long right tail. Gamma(θ, β)
fX(x) = [1/(Γ(θ)β^θ)] x^{θ−1} exp{−x/β}, x > 0
E[X] = θβ, Var(X) = θβ^2
θ is called the shape parameter and β is the scale parameter. θ = 1 reduces to the so-called exponential distribution,
Exp(β), and θ = n/2, β = 2 is equivalent to a chi-squared distribution with n degrees of freedom.
Caution In some references, notation Gamma(θ, β) is used to refer to the distribution Gamma(θ, 1/β) as defined
here!
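A small sketch relating the notes' Gamma(θ, β) (shape θ, scale β) to a generator's parameterization; numpy's gamma sampler also uses shape and scale, so no conversion is needed here, while a rate-based convention would require scale = 1/rate. The values of θ and β below are illustrative.

    import numpy as np

    theta, beta = 2.5, 1.5                              # illustrative shape and scale
    rng = np.random.default_rng(2)
    x = rng.gamma(shape=theta, scale=beta, size=1_000_000)

    print(x.mean(), theta * beta)                       # E[X] = theta * beta
    print(x.var(), theta * beta**2)                     # Var(X) = theta * beta^2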
• Cauchy distribution Cauchy(a, b):
fX(x) = 1/(bπ{1 + (x − a)^2/b^2})
E[|X|] = ∞, φX(t) = exp{iat − |bt|}
This distribution is often used as a counterexample whose mean does not exist.
Algebra of Random Variables
• Assumption We assume that X and Y are independent with distribution
functions FX and FY, respectively. We are interested in the distribution or density of
X + Y, XY, X/Y.
For XY and X/Y, we assume Y to be positive for convenience, although this is not necessary.
• Summation of X and Y Derivation:
FX+Y(z) = E[I(X + Y ≤ z)] = EY[EX[I(X ≤ z − Y)|Y]]
= EY[FX(z − Y)] = ∫ FX(z − y) dFY(y)
Convolution formula (the density version applies when X and Y are continuous):
FX+Y(z) = ∫ FX(z − y) dFY(y), fX+Y(z) = ∫ fX(z − y) fY(y) dy
FXY(z) = E[E[I(XY ≤ z)|Y]] = ∫ FX(z/y) dFY(y)
fXY(z) = ∫ fX(z/y) (1/y) fY(y) dy
FX/Y(z) = E[E[I(X/Y ≤ z)|Y]] = ∫ FX(yz) dFY(y)
fX/Y(z) = ∫ fX(yz) y fY(y) dy
In-class exercise How to obtain the distributions if Y may not be positive?
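A numerical sketch of the convolution formula for a case with a known answer: if X and Y are i.i.d. Exp(1), the convolution integral should return the Gamma(2, 1) density f_{X+Y}(z) = z e^{−z}. The evaluation point z below is arbitrary.

    import numpy as np
    from scipy.integrate import quad

    f = lambda x: np.exp(-x) if x > 0 else 0.0          # Exp(1) density
    z = 1.7                                             # arbitrary evaluation point
    conv, _ = quad(lambda y: f(z - y) * f(y), 0.0, z)   # convolution integral
    print(conv, z * np.exp(-z))                         # should agree closely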
• Application of algebra formulae
N(µ1, σ1^2) + N(µ2, σ2^2) ∼ N(µ1 + µ2, σ1^2 + σ2^2)
Poisson(λ1) + Poisson(λ2) ∼ Poisson(λ1 + λ2)
Negative Binomial(m1, p) + Negative Binomial(m2, p) ∼ Negative Binomial(m1 + m2, p)
• Deriving the summation of R.V.s using characteristic functions
Key property If X and Y are independent, then
φX+Y(t) = φX(t)φY(t)
For example, if X and Y are normal with φX(t) = exp{iµ1t − σ1^2t^2/2} and φY(t) = exp{iµ2t − σ2^2t^2/2}, then
φX+Y(t) = exp{i(µ1 + µ2)t − (σ1^2 + σ2^2)t^2/2}
Assume X ∼ N(0, 1), Y ∼ χ^2_m and Z ∼ χ^2_n are independent. Then
X/√(Y/m) ∼ t_m and (Y/m)/(Z/n) ∼ F_{m,n}.
• Beta distribution Beta(a, b):
fX(x) = [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1}(1 − x)^{b−1} I(x ∈ (0, 1))
It is always recommended that an indicator function be included in the density expression to indicate the support of the r.v.
• Exponential vs Beta distributions Let Y1, ..., Yn+1 be i.i.d. Exp(1) and define Zi = (Y1 + · · · + Yi)/(Y1 + · · · + Yn+1), i = 1, ..., n. Then
Zi ∼ Beta(i, n − i + 1).
In fact, (Z1, . . . ,Zn) has the same joint distribution as that of the order statistics (ξn:1, ..., ξn:n) of n independent Uniform(0,1) random variables.
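A simulation sketch of the Beta marginal of uniform order statistics (n and i below are illustrative): the i-th smallest of n independent Uniform(0, 1) draws should pass a Kolmogorov-Smirnov test against Beta(i, n − i + 1).

    import numpy as np
    from scipy.stats import beta, kstest

    rng = np.random.default_rng(3)
    n, i = 10, 3                                       # illustrative choices
    u = np.sort(rng.uniform(size=(100_000, n)), axis=1)
    zi = u[:, i - 1]                                   # i-th order statistic
    print(kstest(zi, beta(i, n - i + 1).cdf))          # large p-value expected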
Transformation of Random Vector
Theorem 1.3 Suppose that X is a k-dimensional random vector with density function fX(x1, ..., xk). Let g be a one-to-one and continuously differentiable map from R^k to R^k. Then Y = g(X) is a random vector with density function
fY(y1, ..., yk) = fX(g^{-1}(y1, ..., yk)) |J_{g^{-1}}(y1, ..., yk)|,
where g^{-1} is the inverse of g and J_{g^{-1}} is the Jacobian of g^{-1}. In-class exercise Prove it by the change-of-variables formula in multivariate integration.
• An application
Let R^2 ∼ Exp(2) with R > 0 and Θ ∼ Uniform(0, 2π) be independent; then X = R cos Θ and Y = R sin Θ are two independent standard normal random variables. This can be applied to simulate normally distributed data.
In-class exercise Prove this statement.
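A minimal sketch of the resulting simulation recipe (the Box-Muller transform): R^2 = −2 log U1 ∼ Exp(2) and Θ = 2πU2, giving two independent standard normals.

    import numpy as np

    rng = np.random.default_rng(4)
    u1, u2 = rng.uniform(size=(2, 1_000_000))
    r = np.sqrt(-2.0 * np.log(1.0 - u1))           # R^2 ~ Exp(2); 1-U1 ~ U(0,1) avoids log(0)
    theta = 2.0 * np.pi * u2                       # Theta ~ Uniform(0, 2*pi)
    x, y = r * np.cos(theta), r * np.sin(theta)
    print(x.mean(), x.std(), np.corrcoef(x, y)[0, 1])   # ~0, ~1, ~0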
In-class remark It is fundamental to have a mechanism to generate data from U(0, 1) (a pseudo-random number generator). To generate data from any distribution FX, generate U ∼ U(0, 1) and use FX^{-1}(U). Alternative algorithms include the acceptance-rejection method, MCMC, etc.
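A minimal sketch of the inverse-c.d.f. method for one case where FX^{-1} has a closed form, the Exp(β) distribution with FX(x) = 1 − e^{−x/β} (the value of β is illustrative):

    import numpy as np

    rng = np.random.default_rng(5)
    beta = 2.0                                   # illustrative scale; Exp(beta) has mean beta
    u = rng.uniform(size=1_000_000)              # U ~ U(0, 1)
    x = -beta * np.log(1.0 - u)                  # F_X^{-1}(U) for F_X(x) = 1 - exp(-x/beta)
    print(x.mean(), x.var())                     # ~beta, ~beta^2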
Multivariate Normal Distribution
• Definition Y = (Y1, ...,Yn)′ is said to have a multivariate normal distribution with mean vector µ = (µ1, ..., µn)′ and non-degenerate covariance matrix Σn×n if its joint density is
fY(y1, ..., yn) = [1/((2π)^{n/2}|Σ|^{1/2})] exp{−(1/2)(y − µ)′Σ^{-1}(y − µ)}
• Characteristic function
φY(t) = E[exp{it′Y}] = exp{it′µ − t′Σt/2},
so a multivariate normal distribution is uniquely determined by µ and Σ. For the standard multivariate normal distribution,
φX(t) = exp{−t′t/2}.
The moment generating function is MY(t) = exp{t′µ + t′Σt/2}.
In-class exercise Derive the moments of the standard normal distribution.
• Linear transformation of normal r.v. Theorem 1.4 If Y = An×kXk×1 where X ∼ N(0, I) (standard multivariate normal distribution), then Y's characteristic function is given by
φY(t) = exp{−t′Σt/2}, t = (t1, ..., tn)′ ∈ R^n,
where Σ = AA′ so rank(Σ) = rank(A), i.e., Y ∼ N(0, Σ). Conversely, if φY(t) = exp{−t′Σt/2} with Σn×n ≥ 0 of rank k, i.e., Y ∼ N(0, Σ), then there exists a matrix A with rank(A) = k such that
Y = An×kXk×1 with X ∼ N(0, Ik×k).
Proof of Theorem 1.4 For the first part, since Y = AX and X ∼ N(0, Ik×k), we obtain
φY(t) = E[exp{it′(AX)}]
= E[exp{i(A′t)′X}]
= exp{−(A′t)′(A′t)/2}
= exp{−t′AA′t/2}.
Thus, Y ∼ N(0,Σ) with Σ = AA′ . Clearly, rank(Σ) = rank(AA′) = rank(A).
For the second part, suppose φY(t) = exp{−t′Σt/2}. There exists an orthogonal matrix O such that
Σ = O′DO, D = diag((d1, ..., dk, 0, ..., 0)′), d1, ..., dk > 0.
Define Z = OY; then
φZ(t) = E[exp{it′(OY)}] = E[exp{i(O′t)′Y}] = exp{−(O′t)′Σ(O′t)/2} = exp{−d1t1^2/2 − · · · − dktk^2/2}.
Thus, Z1, ..., Zk are independent N(0, d1), ..., N(0, dk), while Zk+1, ..., Zn are identically zero. Let Xi = Zi/√di, i = 1, ..., k, and write O′ = (Bn×k, Cn×(n−k)). Clearly,
Y = O′Z = B diag{√d1, ..., √dk} X with X = (X1, ..., Xk)′ ∼ N(0, Ik×k),
and A = B diag{√d1, ..., √dk} has rank k.
In-class remark Note that this proof also shows the way to generate data from N(0,Σ).
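A sketch of the generation scheme suggested by the proof (the covariance matrix Σ below is an illustrative choice): take Σ = O′DO, set A = B diag{√d1, ..., √dk} from the eigendecomposition, and return Y = AX with X ∼ N(0, I).

    import numpy as np

    rng = np.random.default_rng(6)
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])                     # illustrative covariance matrix

    d, O_t = np.linalg.eigh(Sigma)                     # Sigma = O_t diag(d) O_t'
    A = O_t @ np.diag(np.sqrt(np.clip(d, 0.0, None)))  # A with AA' = Sigma
    X = rng.standard_normal((Sigma.shape[0], 500_000)) # columns are X ~ N(0, I)
    Y = A @ X                                          # columns are Y ~ N(0, Sigma)
    print(np.cov(Y))                                   # close to Sigma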
• Conditional normal distributions Theorem 1.5 Suppose that Y = (Y1, ..., Yk, Yk+1, ..., Yn)′ has a multivariate normal distribution with mean µ = (µ(1)′, µ(2)′)′ and covariance matrix
Σ = ( Σ11  Σ12
      Σ21  Σ22 ).
Let Y(1) = (Y1, ..., Yk)′ and Y(2) = (Yk+1, ..., Yn)′. Then
(i) Y(1) ∼ Nk(µ(1), Σ11).
(ii) Y(1) and Y(2) are independent iff Σ12 = Σ21 = 0.
(iii) For any matrix Am×n, AY has a multivariate normal distribution with mean Aµ and covariance AΣA′.
(iv) The conditional distribution of Y(1) given Y(2) is
Y(1)|Y(2) ∼ Nk(µ(1) + Σ12Σ22^{-1}(Y(2) − µ(2)), Σ11 − Σ12Σ22^{-1}Σ21).
Proof of Theorem 1.5 We prove (iii) first. Since φY(t) = exp{it′µ− t′Σt/2},
φAY(t) = E[eit′AY ]
= E[ei(A′t)′Y ]
= exp{it′(Aµ)− t′(AΣA′)t/2}.
Thus, AY ∼ N(Aµ,AΣA′).
Then (i) follows from (iii) by taking A = (Ik×k 0k×(n−k)). To prove (ii), note that for t = (t(1)′, t(2)′)′,
φY(t) = exp{it(1)′µ(1) + it(2)′µ(2) − (1/2)[t(1)′Σ11t(1) + 2t(1)′Σ12t(2) + t(2)′Σ22t(2)]}.
Thus, Y(1) and Y(2) are independent iff the characteristic function is a product of separate functions of t(1) and t(2), i.e., iff Σ12 = 0.
To prove (iv), define
Z(1) = Y(1) − µ(1) − Σ12Σ22^{-1}(Y(2) − µ(2)).
By (iii), Z(1) is normal with mean 0 and covariance
ΣZ = Cov(Y(1), Y(1)) − Σ12Σ22^{-1}Σ21 = Σ11 − Σ12Σ22^{-1}Σ21.
On the other hand,
Cov(Z(1), Y(2)) = Σ12 − Σ12Σ22^{-1}Σ22 = 0,
so Z(1) is independent of Y(2) by (ii). Therefore, the conditional distribution of Z(1) given Y(2) is the same as the unconditional distribution of Z(1), which is N(0, ΣZ). I.e.,
Z(1)|Y(2) ∼ Z(1) ∼ N(0, Σ11 − Σ12Σ22^{-1}Σ21).
The result holds.
In-class remark The constructed Z has a geometric interpretation: it is (Y(1) − µ(1)) minus its projection onto the normalized variable Σ22^{-1/2}(Y(2) − µ(2)).
• An example of measurement error model
X and U are independent with X ∼ N(0, σx^2) and U ∼ N(0, σu^2), and Y = X + U.
Then (X, Y)′ is bivariate normal with mean 0 and covariance matrix
( σx^2        σx^2
  σx^2   σx^2 + σu^2 ),
so from (iv),
X|Y ∼ N(σx^2 Y/(σx^2 + σu^2), σx^2σu^2/(σx^2 + σu^2)).
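A simulation sketch of this example (the values of σx^2 and σu^2 are illustrative): for jointly normal (X, Y), the least-squares slope of X on Y and the residual variance should match the conditional mean coefficient σx^2/(σx^2 + σu^2) and the conditional variance σx^2σu^2/(σx^2 + σu^2) from (iv).

    import numpy as np

    rng = np.random.default_rng(7)
    sx2, su2, B = 2.0, 1.0, 1_000_000              # illustrative variances
    x = rng.normal(0.0, np.sqrt(sx2), B)
    y = x + rng.normal(0.0, np.sqrt(su2), B)       # Y = X + U

    slope = np.cov(x, y)[0, 1] / y.var()           # empirical regression slope of X on Y
    resid_var = (x - slope * y).var()              # empirical conditional variance
    print(slope, sx2 / (sx2 + su2))
    print(resid_var, sx2 * su2 / (sx2 + su2))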
Summation of independent normal r.v.s:
N(0, 1)^2 + · · · + N(0, 1)^2 ∼ χ^2_n ≡ Gamma(n/2, 2).
If Y ∼ N(0,Σn×n) with Σ > 0, then
Y′Σ^{-1}Y ∼ χ^2_n.
Proof: Y can be expressed as AX where X ∼ N(0, I) and A = Σ^{1/2}. Thus,
Y′Σ^{-1}Y = X′A′(AA′)^{-1}AX = X′X ∼ χ^2_n.
In-class exercise What is the distribution of Y′BY for an n× n symmetric matrix B?
• Noncentral Chi-square distribution Suppose X ∼ N(µ, In×n) and Y = X′X ∼ χ^2_n(δ),
where δ = µ′µ. Y has a noncentral chi-square distribution and its density
function is a Poisson-probability weighted summation of gamma densities:
fY(y) = Σ_{k=0}^{∞} [e^{−δ/2}(δ/2)^k/k!] Gamma(y; (2k + n)/2, 2).
In-class remark Equivalently, we generate S from Poisson(δ/2) and given S = k, we generate
Y ∼ Gamma((2k + n)/2, 2).
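A simulation sketch comparing the two constructions (the mean vector µ below is illustrative): Y = X′X with X ∼ N(µ, In) versus the Poisson-gamma mixture from the remark; both should have mean n + δ and variance 2n + 4δ.

    import numpy as np

    rng = np.random.default_rng(8)
    n, B = 4, 500_000
    mu = np.array([1.0, -0.5, 2.0, 0.0])               # illustrative mean vector
    delta = mu @ mu                                    # noncentrality delta = mu'mu

    y_direct = ((rng.standard_normal((B, n)) + mu) ** 2).sum(axis=1)    # X'X
    s = rng.poisson(delta / 2.0, B)                    # S ~ Poisson(delta/2)
    y_mix = rng.gamma(shape=(2 * s + n) / 2.0, scale=2.0)  # Y | S=k ~ Gamma((2k+n)/2, 2)

    print(y_direct.mean(), y_mix.mean(), n + delta)
    print(y_direct.var(), y_mix.var(), 2 * n + 4 * delta)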
One trick to evaluate E[X′AX] when X ∼ N(µ,Σ):
E[X′AX] = E[tr(X′AX)] = E[tr(AXX′)] = tr(AE[XX′]) = tr(A(µµ′ + Σ))
= µ′Aµ+ tr(AΣ)
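A numerical sketch of the trace trick (µ, A and Σ below are illustrative choices): the Monte Carlo average of X′AX should approach µ′Aµ + tr(AΣ).

    import numpy as np

    rng = np.random.default_rng(9)
    mu = np.array([1.0, 0.5, -1.0])                    # illustrative mean
    A = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])                    # illustrative symmetric matrix
    Sigma = np.array([[1.0, 0.2, 0.1],
                      [0.2, 2.0, 0.0],
                      [0.1, 0.0, 1.0]])                # illustrative covariance

    x = rng.multivariate_normal(mu, Sigma, 1_000_000)
    qf = np.einsum('bi,ij,bj->b', x, A, x)             # X'AX for each draw
    print(qf.mean(), mu @ A @ mu + np.trace(A @ Sigma))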
Distribution Family
• Location-scale family aX + b: fX((x − b)/a)/a, a > 0, b ∈ R, with mean aE[X] + b and variance a^2Var(X)
• Exponential family A family of distributions with a common expression of densities, which share some family-specific properties.
• General form of densities in an exponential family
{pθ(x)}, which consists of densities indexed by θ (one or more parameters), is said to form an s-parameter exponential family if
pθ(x) = exp{ Σ_{k=1}^{s} ηk(θ)Tk(x) − B(θ) } h(x), where exp{B(θ)} = ∫ exp{ Σ_{k=1}^{s} ηk(θ)Tk(x) } h(x) dµ(x) < ∞.
When (η1(θ), ..., ηs(θ)) ′ is exactly the same as θ, the above
form is called the canonical form of the exponential family.
• Examples X1, ..., Xn ∼ N(µ, σ^2):
exp{ (µ/σ^2) Σ_{i=1}^{n} xi − (1/(2σ^2)) Σ_{i=1}^{n} xi^2 − nµ^2/(2σ^2) } / (√(2π)σ)^n
Its canonical parameterization is θ1 = µ/σ^2 and θ2 = 1/σ^2.
X ∼ Binomial(n, p):
P(X = x) = exp{ x log(p/(1 − p)) + n log(1 − p) } (n choose x)
Its canonical parameterization is θ = log(p/(1 − p)).
P(X = x) = exp{x logλ− λ}/x!
Its canonical parameterization is θ = logλ.
• Moment generating function in the exponential family Theorem 1.6 Suppose the densities of an exponential family can be written in the canonical form
exp{ Σ_{k=1}^{s} ηkTk(x) − A(η) } h(x).
Then the moment generating function
(MGF) for (T1, ..., Ts), defined as
MT(t1, ..., ts) = E [exp{t1T1 + ...+ tsTs}]
for t = (t1, ..., ts) ′, is given by
MT(t) = exp{A(η + t)− A(η)}.
Proof of Theorem 1.6 Since
exp{A(η)} = ∫ exp{ Σ_{i=1}^{s} ηiTi(x) } h(x) dµ(x),
we have
MT(t) = ∫ exp{ Σ_{i=1}^{s} (ηi + ti)Ti(x) − A(η) } h(x) dµ(x)
= exp{A(η + t) − A(η)}.
•MGF and cumulant generating function (CGF) MGF can be useful to calculate the moments of (T1, ...,Ts). Define the cumulant generating function (CGF)
KT(t1, ..., ts) = log MT(t1, ..., ts) = A(η + t)− A(η).
The coefficients in its Taylor expansion are called the cumulants for (T1, ...,Ts). The first two cumulants are the mean and variance.
• Examples of using MGF
For X ∼ N(µ, σ^2) with σ^2 known, the canonical parameter is η = µ/σ^2 with T = X and A(η) = σ^2η^2/2, so
MT(t) = exp{A(η + t) − A(η)} = exp{ (σ^2/2)((η + t)^2 − η^2) } = exp{µt + t^2σ^2/2}.
For X ∼ N(0, σ^2), since T = X, from the Taylor expansion we have
E[X^{2r+1}] = 0, E[X^{2r}] = 1 · 3 · · · (2r − 1)σ^{2r}, r = 1, 2, ...
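A quick numerical check of these moment formulas (the value of σ is illustrative), comparing scipy's numerically computed moments of N(0, σ^2) with 1 · 3 · · · (2r − 1)σ^{2r}:

    import numpy as np
    from scipy.stats import norm

    sigma = 1.3                                        # illustrative value
    for r in (1, 2, 3):
        exact = np.prod(np.arange(1, 2 * r, 2)) * sigma ** (2 * r)   # 1*3*...*(2r-1) * sigma^(2r)
        print(r, norm(scale=sigma).moment(2 * r), exact)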
Gamma distribution has a canonical form
exp{−x/b + (a − 1) log x − log(Γ(a)b^a)} I(x > 0).
Treat b as the only parameter (a is a constant); then the canonical parameterization is η = −1/b with T = X. Note A(η) = log(Γ(a)b^a) = a log(−1/η) + log Γ(a). We obtain
MX(t) = exp{A(η + t) − A(η)} = {η/(η + t)}^a = (1 − bt)^{−a}, t < 1/b.
This gives E[X] = ab and Var(X) = ab^2.
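A numerical check of this MGF (a, b and t are illustrative, with t < 1/b): integrating e^{tx} against the Gamma(a, b) density should give (1 − bt)^{−a}.

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gamma as gamma_fn

    a, b, t = 2.0, 1.5, 0.3                            # illustrative; t < 1/b
    dens = lambda x: x ** (a - 1) * np.exp(-x / b) / (gamma_fn(a) * b ** a)
    mgf, _ = quad(lambda x: np.exp(t * x) * dens(x), 0.0, np.inf)
    print(mgf, (1.0 - b * t) ** (-a))                  # should agree closely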
READING MATERIALS: Lehmann and Casella, Sections 1.4 and 1.5