Lecture 1

Basic probability refresher

1.1 Characterizations of random variables

Let (Ω, F, P) be a probability space where Ω is a general set, F is a σ-algebra and P is a probability measure on Ω. A random variable (r.v.) X is a (scalar) measurable function X : (Ω, F) → (R, B), where B is the Borel σ-algebra. We will also write X(ω) to stress the fact that it is a function of ω ∈ Ω.

The cumulative distribution function (c.d.f.) of a random variable X is the function F : R → [0, 1] given by

F(x) = P(X ≤ x) = P(ω : X(ω) ≤ x).

F is monotone nondecreasing, right-continuous and such that F(−∞) = 0 and F(∞) = 1. We also refer to F as the probability law (distribution) of X.

We distinguish 2 types of random variables: discrete variables and continuous variables.

A discrete variable X takes values in a finite or countable set. The Poisson random variable¹ is an example of a discrete variable with a countable value set: for λ > 0 the distribution of X satisfies

P_λ(X = k) = (λ^k / k!) e^{−λ},  k = 0, 1, 2, ...

¹We will see in the sequel the importance of this law and how it is linked to the Poisson point process.


We denote X ∼ P(λ) and say that X is distributed according to the Poisson distribution with parameter λ. The c.d.f. of X is shown below.

[Figure: c.d.f. of the Poisson distribution P(λ), a step function.]

The c.d.f. of a discrete random variable is a step function.

Continuous variable. X is a continuous variable if its distribution admits a density with respect to the Lebesgue measure on R. In this case the c.d.f. F of X is differentiable almost everywhere on R and its derivative

f(x) = F′(x)

is called the probability density of X. Note that f(x) ≥ 0 for all x ∈ R and

∫_{−∞}^{∞} f(x) dx = 1.

Example 1.1

a) Normal distribution N(µ, σ²) with density

f(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)},  x ∈ R,

where µ ∈ R and σ > 0. If µ = 0, σ² = 1, the distribution N(0, 1) is referred to as the standard normal distribution.

b) Uniform distribution U[0, θ] with density

f(x) = (1/θ) I{x ∈ [0, θ]},  x ∈ R,

where θ > 0 and I{·} stands for the indicator function: for a set A,

I{x ∈ A} = 1 if x ∈ A, and 0 otherwise.


c) Exponential distribution E(λ) with density

f(x) = λ e^{−λx} for x ≥ 0, and f(x) = 0 for x < 0,

where λ > 0. The c.d.f. of E(λ) is given by

F(x) = 1 − e^{−λx} for x ≥ 0, and F(x) = 0 for x < 0.

Discrete distributions are entirely determined by the probabilities {P(X = k)}_k; continuous distributions are defined by their density f(·). However, some scalar functionals of the distribution may be useful to characterize the behavior of the corresponding random variables. Examples of such functionals are the moments and the quantiles.

1.1.1 Moments of random variables

Mean (expectation) of a random variable X:

µ = E(X) = ∫_{−∞}^{∞} x dF(x) =
  Σ_i i P(X = i)   in the discrete case,
  ∫ x f(x) dx      in the continuous case.

Moment of order k (k = 1, 2, ...):

µ_k = E(X^k) = ∫_{−∞}^{∞} x^k dF(x),

and, similarly, the central moment of order k:

µ′_k = E((X − µ)^k) = ∫_{−∞}^{∞} (x − µ)^k dF(x).

A special case is the variance σ² (= µ′_2, the central moment of order 2):

σ² = Var(X) = E((X − E(X))²) = E(X²) − (E(X))².

The square root of the variance is called the standard deviation (s.d. or st.d.) of X: σ = √Var(X).

The absolute moment µ_k of order k is

µ_k = E(|X|^k),

and, similarly, the central absolute moment of order k is

µ′_k = E(|X − µ|^k).

Clearly, these definitions assume the existence of the respective integrals, and not all distributions possess moments.

Example 1.2


Let X be a random variable with probability density

f(x) = c / (1 + |x| log²|x|),  x ∈ R,

where the constant c > 0 is such that ∫ f = 1. Then E(|X|^a) = ∞ for all a > 0.

The mean is used to characterize the location (position) of a random variable. The variance characterizes the scale (dispersion) of the distribution.

The normal distribution N(µ, σ²) with mean µ and variance σ²:

[Figure: densities of N(µ, σ²) for a “large” σ (large dispersion) and a “small” σ (little dispersion).]

Let F be the c.d.f. of a random variable X with mean µ and variance σ². By an affine transformation we obtain the variable X₀ = (X − µ)/σ, such that E(X₀) = 0, E(X₀²) = 1 (the standardized variable). If F₀ is the c.d.f. of X₀ then F(x) = F₀((x − µ)/σ). In the continuous case, the density of X satisfies

f(x) = (1/σ) f₀((x − µ)/σ),

where f₀ is the density of X₀.

Note that it is not necessary to assume that the mean and the variance exist to define the standardized distribution F₀ and the representation F(x) = F₀((x − µ)/σ). Typically, this is done to underline that F depends on a location parameter µ and a scale σ. E.g., for the family of Cauchy densities parameterized by µ and σ,

f(x) = 1 / (πσ(1 + [(x − µ)/σ]²)),

the standardized density is f₀(x) = 1/(π(1 + x²)). Meanwhile, the expectation and the variance do not exist for the Cauchy distribution.

An interesting problem of calculus is related to the notion of moments µ_k: let F be a c.d.f. such that all its moments are finite. Given the sequence µ_k, k = 1, 2, ... of moments of F, is it possible to recover F? The general answer to this question is negative. Nevertheless, there exist particular cases where the recovery is possible, namely, under the hypothesis that

lim sup_{k→∞} µ_k^{1/k} / k < ∞

(µ_k being the k-th absolute moment). This hypothesis holds true, for instance, for densities with bounded support. To the best of our knowledge, necessary and sufficient conditions for the existence of a solution to the problem of moments are currently unknown.


1.1.2 Probability quantiles

Let X be a random variable with continuous and strictly increasing c.d.f. F. The quantile of order p, 0 < p < 1, of the distribution F is the solution q_p of the equation

F(q_p) = p.

Observe that if F is strictly increasing and continuous, the solution exists and is unique, thus the quantile q_p is well defined. If F has “flat zones” or is not continuous we can modify the definition, for instance, as follows:

Definition 1.1 Let F be a c.d.f. The quantile q_p of order p of F is the value

q_p = inf{q : F(q) ≥ p}.

The median M of the c.d.f. F is the quantile of order 1/2,

M = q_{1/2}.

Note that if F is continuous, F(M) = 1/2.

The quartiles are the quantiles q_{1/4} and q_{3/4} of order 1/4 and 3/4.

The l% percentile of F is the quantile q_p of order p = l/100, 0 < l < 100.

We note that the median characterizes the location of the probability distribution, while the difference q_{3/4} − q_{1/4} (referred to as the interquartile interval) can be interpreted as a characteristic of scale. These quantities are analogues of the mean µ and the standard deviation σ. However, unlike the mean and the standard deviation, the median and the interquartile interval are well defined for all probability distributions.
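As a quick illustration of Definition 1.1 (not part of the original notes; it assumes NumPy is available, and the exponential sample is an arbitrary choice), the following Python sketch computes q_p = inf{q : F_n(q) ≥ p} for the empirical c.d.f. of a sample and reports the median and the interquartile interval.

import numpy as np

def quantile_inf(sample, p):
    """Return q_p = inf{q : F_n(q) >= p} for the empirical c.d.f. F_n of `sample`."""
    x = np.sort(np.asarray(sample))
    n = len(x)
    # F_n(x[k-1]) = k/n, so the smallest order statistic with k/n >= p works.
    k = int(np.ceil(p * n))
    return x[max(k, 1) - 1]

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=10_000)   # E(1): true median is ln 2
M = quantile_inf(sample, 0.5)
iqr = quantile_inf(sample, 0.75) - quantile_inf(sample, 0.25)
print(f"median ~ {M:.3f} (ln 2 = {np.log(2):.3f}), interquartile interval ~ {iqr:.3f}")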

1.1.3 Other characterizations

The mode. For a discrete distribution F , we call the mode of F the value k∗ such that

P(X = k*) = max_k P(X = k).

In the continuous case, the mode x* is defined as a maximum of the density f:

f(x*) = max_x f(x).

A density f is said to be unimodal if x* is the unique local maximum of f (one can also speak of bi-modal or multi-modal densities). This characteristic is rather imprecise, because even when


the density has a unique global maximum, we will call it multimodal if it has other local maxima. The mode is a characteristic of location which can be of interest in the case of a unimodal density.

[Figure: the mode, the median and the mean of a distribution.]

Skewness and kurtosis

Definition 1.2 The distribution of X (the c.d.f. F) is said to be symmetric with respect to zero (or “simply” symmetric) if for all x ∈ R, F(x) = 1 − F(−x) (f(x) = f(−x) in the continuous case).

Definition 1.3 The distribution of X (the c.d.f. F) is called symmetric with respect to µ ∈ R if

F(x + µ) = 1 − F(µ − x)

(f(x + µ) = f(µ − x) in the continuous case).

In other words, the distribution of X − µ is symmetric (with respect to zero).

Exercise 1.1

a) Show that if F is symmetric with respect to µ, and E(|X|) < ∞, then E(X) = µ. Moreover, if F admits a unimodal density, then the mean = median = mode.

b) If F is symmetric and all absolute moments µ_k exist, then the moments µ_k = 0 for all odd k. If F is symmetric with respect to µ and all the moments µ_k exist, then µ′_k = 0 for all odd k (e.g., µ′_3 = 0).

We can qualify the “asymmetry” of distributions (for which E(|X|³) < ∞) using the skewness parameter

α = µ′_3 / σ³.

Note that α = 0 for a symmetric c.d.f. such that E(|X|³) < ∞. The converse is of course not true: the condition α = 0 does not imply that the distribution is symmetric.

Exercise 1.2


Provide an example of asymmetric density with α = 0.

Observe the role of σ in the definition of α: suppose, for instance, that the density f₀(x) of X satisfies ∫ x f₀(x) dx = 0 and ∫ x² f₀(x) dx = 1, with α₀ = µ′_{3,0} = ∫ x³ f₀(x) dx. For σ > 0, µ ∈ R, the function

f(x) = (1/σ) f₀((x − µ)/σ)

is the density of the random variable σX + µ, and thus Var(σX + µ) = σ² and µ′_3 = ∫ (x − µ)³ f(x) dx = σ³ µ′_{3,0}. When computing α = µ′_3/σ³ we note that α = α₀. Thus the skewness α is invariant with respect to affine transformations (of scale and position).

Note that one cannot say that α > 0 for distributions which are “asymmetric on the right”, or α < 0 for “asymmetric on the left” distributions. The notions of left and right asymmetry are not properly defined.

The kurtosis coefficient β is defined as follows: if the 4th central moment µ′_4 of X exists, then

β = µ′_4 / σ⁴ − 3.

Exercise 1.3

Show that µ′_4/σ⁴ = 3 and β = 0 for the normal distribution N(µ, σ²).

We note that, just like the skewness coefficient α, the kurtosis β is invariant with respect to affine transformations.

The coefficient β is often used to roughly qualify the tails of the distribution of X. One uses the following vocabulary: a distribution F has “heavy tails” if

Q(b) = ∫_{|x|≥b} dF(x) (= ∫_{|x|≥b} f(x) dx in the continuous case)

decreases slowly when b → ∞, for instance, polynomially (as 1/b^r where r > 0). Otherwise, we say that F has “light tails” if Q(b) decreases fast (for example, exponentially).

We may use the following heuristics: if β > 0, we may consider that the distribution tails are heavier than those of the normal distribution (Q(b) = O(e^{−b²/2}) for N(0, 1)). If β < 0 (in which case the distribution is called platykurtic), we may consider that its tails are lighter than those of the normal distribution (β = 0 for the normal distribution).

Note also that β ≥ −2 for all distributions (cf. Section 1.2.1).

Example 1.3

a) The kurtosis β of the uniform distribution U[0, 1] is equal to −1.2 (ultra-light tails).

b) If f(x) ∼ |x|^{−5} when |x| tends to ∞, then σ² is finite but µ′_4 = ∞, implying that β = ∞ (heavy tails).
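To make these definitions concrete, here is a small Python sketch (not part of the original notes; it assumes NumPy, and the choice of test distributions is arbitrary) that estimates α = µ′_3/σ³ and β = µ′_4/σ⁴ − 3 from samples.

import numpy as np

def skewness_kurtosis(x):
    """Sample estimates of alpha = mu'_3/sigma^3 and beta = mu'_4/sigma^4 - 3."""
    x = np.asarray(x)
    c = x - x.mean()
    sigma2 = np.mean(c**2)
    alpha = np.mean(c**3) / sigma2**1.5
    beta = np.mean(c**4) / sigma2**2 - 3.0
    return alpha, beta

rng = np.random.default_rng(1)
n = 200_000
for name, sample in [
    ("normal", rng.normal(size=n)),            # expect alpha ~ 0, beta ~ 0
    ("uniform", rng.uniform(size=n)),          # expect alpha ~ 0, beta ~ -1.2
    ("exponential", rng.exponential(size=n)),  # expect alpha ~ 2, beta ~ 6
]:
    a, b = skewness_kurtosis(sample)
    print(f"{name:12s} skewness ~ {a:+.2f}, kurtosis ~ {b:+.2f}")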


1.2 A toolbox

1.2.1 Some useful inequalities

Proposition 1.1 (Markov inequality) Let h(·) be a nonnegative nondecreasing function, and E(h(X)) < ∞. Then for all a > 0 such that h(a) > 0,

P(X ≥ a) ≤ E(h(X)) / h(a).   (1.1)

Proof: Let a > 0 be such that h(a) > 0. Since h(·) is a nondecreasing function,

P(X ≥ a) ≤ P(h(X) ≥ h(a)) = ∫ I{h(x) ≥ h(a)} dF(x)
  = E(I{h(X) ≥ h(a)}) ≤ E( (h(X)/h(a)) I{h(X) ≥ h(a)} ) ≤ E(h(X)) / h(a).

Corollary 1.1 (Chebyshev inequality) Let X be a random variable such that E(X²) < ∞. Then for all a > 0

P(|X| ≥ a) ≤ E(X²)/a²,

P(|X − E(X)| ≥ a) ≤ Var(X)/a².

Proof: To show the first inequality it suffices to apply (1.1) with h(t) = t² to the variable Y = |X| (or to Y = |X − E(X)| for the second one).
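A quick numerical sanity check of the Chebyshev inequality (an illustration, not from the original notes; assumes NumPy, with a standard normal sample chosen arbitrarily): compare the empirical value of P(|X − E(X)| ≥ a) with the bound Var(X)/a².

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)   # Var(X) = 1
for a in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x - x.mean()) >= a)
    bound = x.var() / a**2      # Chebyshev: P(|X - E(X)| >= a) <= Var(X)/a^2
    print(f"a = {a}: empirical {empirical:.4f} <= bound {bound:.4f}")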

Proposition 1.2 (Hölder inequality) Let r > 1, with 1/r + 1/s = 1. Let ξ and η be two random variables such that E(|ξ|^r) < ∞ and E(|η|^s) < ∞. Then E(|ξη|) < ∞ and

E(|ξη|) ≤ [E(|ξ|^r)]^{1/r} [E(|η|^s)]^{1/s}.

Proof: We first note that for all a > 0, b > 0, by concavity of log t,

(1/r) log a + (1/s) log b ≤ log(a/r + b/s),

which is equivalent to a^{1/r} b^{1/s} ≤ a/r + b/s.

Let us set a = |ξ|^r/E(|ξ|^r) and b = |η|^s/E(|η|^s) (we suppose for a moment that E(|ξ|^r) ≠ 0 and E(|η|^s) ≠ 0), which results in

|ξη| ≤ [E(|ξ|^r)]^{1/r} [E(|η|^s)]^{1/s} ( |ξ|^r/(r E(|ξ|^r)) + |η|^s/(s E(|η|^s)) ),

and we conclude by taking the expectation. If E(|ξ|^r) = 0 or E(|η|^s) = 0, then ξ = 0 (a.s.) or η = 0 (a.s.), and the inequality is trivial.


Corollary 1.2 (Lyapunov inequality) Let 0 < v < t and let X be a random variable such that E(|X|^t) < ∞. Then E(|X|^v) < ∞ and

[E(|X|^v)]^{1/v} ≤ [E(|X|^t)]^{1/t}.   (1.2)

To show the corollary it suffices to apply the Hölder inequality with ξ = |X|^v, η = 1, r = t/v.

Using the inequality (1.2) with v = 2, t = 4 and |X − E(X)| instead of |X| we get µ′_4/σ⁴ ≥ 1. Thus the kurtosis coefficient β satisfies β ≥ −2.

The Lyapunov inequality implies the chain of inequalities

E(|X|) ≤ [E(|X|²)]^{1/2} ≤ ... ≤ [E(|X|^k)]^{1/k}.

Corollary 1.3 (Cauchy-Schwarz inequality) Let ξ and η be two random variables such that E(ξ²) < ∞ and E(η²) < ∞. Then E(|ξη|) < ∞ and

E(|ξη|)² ≤ E(ξ²)E(η²).

(A particular case of the Hölder inequality with r = s = 2.)

Proposition 1.3 (Jensen inequality) Let g(·) be a convex function and X be a random variable such that E(|X|) < ∞. Then

g(E(X)) ≤ E(g(X)).

Proof: By convexity of g, there exists a function g₁(·) such that for all x, x₀ ∈ R

g(x) ≥ g(x₀) + (x − x₀) g₁(x₀).

We put x₀ = E(X). Then

g(X) ≥ g(E(X)) + (X − E(X)) g₁(E(X)).

Taking the expectation we obtain E(g(X)) ≥ g(E(X)).

We have the following simple example of application of the Jensen inequality:

|E(X)| ≤ E(|X|).   (1.3)

Proposition 1.4 (Cauchy-Schwarz inequality, a modification) Let ξ and η be two random variables such that E(ξ²) < ∞ and E(η²) < ∞. Then

(E(ξη))² ≤ E(ξ²)E(η²).   (1.4)

Moreover, the equality is attained if and only if (iff) there are a₁, a₂ ∈ R such that a₁ ≠ 0 or a₂ ≠ 0, and

a₁ξ + a₂η = 0 (a.s.)   (1.5)


Proof: The inequality (1.4) is a consequence of Corollary 1.3 and of (1.3). If (1.5) is true, the equality

(E(ξη))² − E(ξ²)E(η²) = 0   (1.6)

is obvious. On the other hand, if we have (1.6) and E(η²) ≠ 0, then E((ξ − aη)²) = 0 with a = E(ξη)/E(η²), which implies that ξ = aη a.s. The case E(η²) = 0 is trivial.

1.2.2 Convergence of random variables

Let ξ₁, ξ₂, ..., and ξ be random variables (r.v.) on (Ω, F, P).

Definition 1.4 The sequence (ξ_n) converges to a random variable ξ in probability (denoted ξ_n →^P ξ) when n → ∞ if

lim_{n→∞} P(|ξ_n − ξ| ≥ ε) = 0

for any ε > 0.

Definition 1.5 The sequence (ξ_n) converges to ξ in quadratic mean (or “in L²”) if E(ξ²) < ∞ and

lim_{n→∞} E(|ξ_n − ξ|²) = 0.

Definition 1.6 The sequence (ξ_n) converges to ξ almost surely (denoted ξ_n → ξ (a.s.), n → ∞) if

P(ω : ξ_n(ω) ↛ ξ(ω)) = 0.

Remark. It can be shown that this definition is equivalent to the following one: for all ε > 0

lim_{n→∞} P(sup_{k≥n} |ξ_k − ξ| ≥ ε) = 0.

Definition 1.7 The sequence (ξ_n) converges to a random variable ξ in distribution (we denote ξ_n →^D ξ, n → ∞) if

P(ξ_n ≤ t) → P(ξ ≤ t) as n → ∞

at all points of continuity of the c.d.f. F(t) = P(ξ ≤ t).

Remark. The latter definition is equivalent to the convergence

E(f(ξ_n)) → E(f(ξ)) as n → ∞

for all continuous and bounded f (weak convergence).


Relationships between the different types of convergence:

L²-convergence ⟹ convergence in probability ⟹ convergence in distribution;
a.s. convergence ⟹ convergence in probability.

Exercise 1.4

Let (ξ_n) and (η_n) be two sequences of r.v. Prove the following statements:

1°. If a ∈ R is a constant then

ξ_n →^D a ⇔ ξ_n →^P a,

when n → ∞.

2°. (Slutsky’s theorem) If ξ_n →^D a and η_n →^D η when n → ∞, where a ∈ R is a constant, then

ξ_n + η_n →^D a + η,

as n → ∞. Show that if a is a general r.v., these two relations do not hold (construct a counterexample).

3°. Let ξ_n →^P a and let η_n →^D η when n → ∞, where a ∈ R is a constant and η is a random variable. Then

ξ_n η_n →^D aη,

as n → ∞. Would this result continue to hold if we suppose that a is a general random variable?

1.3 Independence and limit theorems

Definition 1.8 Let X and Y be two random variables. The variable X is said to be independent of Y if

P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)

for all A ∈ B and B ∈ B (Borel sets A and B); we denote X ⊥⊥ Y.

If E(|X|) < ∞ and E(|Y|) < ∞, then independence implies

E(XY) = E(X)E(Y)

(the converse does not hold!).

Definition 1.9 Let X₁, ..., X_n be random variables; we say that X₁, ..., X_n are (mutually) independent if for all A₁, ..., A_n ∈ B

P(X₁ ∈ A₁, ..., X_n ∈ A_n) = P(X₁ ∈ A₁) ··· P(X_n ∈ A_n).

Remark. The fact that X_i, i = 1, ..., n are pairwise independent, i.e. X_i ⊥⊥ X_j, does not imply that X₁, ..., X_n are mutually independent. On the other hand, mutual independence implies pairwise independence. In particular, if X₁, ..., X_n are independent and E(|X_i|) < ∞, i = 1, ..., n,

E(X_i X_j) = E(X_i)E(X_j), i ≠ j

(and E(X_i X_j ··· X_k) = E(X_i)E(X_j) ··· E(X_k), etc.).


1.3.1 Sums of independent random variables

Let us consider the sum Σ_{i=1}^n X_i, where X₁, ..., X_n are independent. If E(X_i²) < ∞, i = 1, ..., n (by the Lyapunov inequality this implies E(|X_i|) < ∞), then

E( Σ_{i=1}^n X_i ) = Σ_{i=1}^n E(X_i)   (true without the independence hypothesis)

and, moreover,

Var( Σ_{i=1}^n X_i ) = Σ_{i=1}^n Var(X_i).

Definition 1.10 We say that the variables X₁, ..., X_n are i.i.d. (independent and identically distributed) if they are mutually independent and X_i obeys the same distribution as X_j for all 1 ≤ i, j ≤ n.

Proposition 1.5 Let X₁, ..., X_n be i.i.d. r.v. such that E(X₁) = µ and Var(X₁) = σ² < ∞. Then the arithmetic mean

X̄ = (1/n) Σ_{i=1}^n X_i

satisfies

E(X̄) = µ and Var(X̄) = (1/n) Var(X₁) = σ²/n.

Proposition 1.6 (Kolmogorov’s strong law of large numbers) Let X₁, ..., X_n be i.i.d. r.v. such that E(|X₁|) < ∞, and µ = E(X₁). We have

X̄ → µ (a.s.) when n → ∞.

Counterexample. Let X_i be i.i.d. r.v. with the Cauchy distribution with density

f(x) = 1 / (π(1 + x²)),  x ∈ R.

Then E(|X₁|) = ∞, E(X₁) is not defined and the mean X̄ does not converge (we observe that the Cauchy distribution has “heavy tails”).
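The contrast between Proposition 1.6 and the Cauchy counterexample is easy to see numerically. The following Python sketch (an illustration, not from the original notes; assumes NumPy, and the particular parameters are arbitrary) tracks the running mean X̄_n for a normal sample, where it settles near µ, and for a Cauchy sample, where it keeps jumping.

import numpy as np

def running_mean(x):
    """Running averages X̄_1, X̄_2, ..., X̄_n of the sequence x."""
    return np.cumsum(x) / np.arange(1, len(x) + 1)

rng = np.random.default_rng(3)
n = 100_000
samples = {
    "normal": rng.normal(loc=1.0, scale=2.0, size=n),   # E(X) = 1
    "cauchy": rng.standard_cauchy(size=n),               # E(X) undefined
}
for name, x in samples.items():
    m = running_mean(x)
    print(name, [f"{m[k - 1]:+.2f}" for k in (100, 1_000, 10_000, 100_000)])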

Proposition 1.7 (Central Limit Theorem (CLT)) Let X₁, ..., X_n be i.i.d. r.v. such that E(X₁²) < ∞ and σ² = Var(X₁) > 0. Then

√n (X̄ − µ)/σ →^D η, as n → ∞,

where µ = E(X₁) and η ∼ N(0, 1).


1.3.2 Asymptotic approximations of probability distributions

The CLT (Proposition 1.7) can be rewritten in the equivalent form:

P( √n (X̄ − µ)/σ ≤ t ) → P(η ≤ t), as n → ∞,

for all t ∈ R, where η ∼ N(0, 1), or, equivalently,

P( X̄ ≤ µ + σt/√n ) → P(η ≤ t), as n → ∞,

where we have set x = µ + σt/√n. Let us denote by

Φ(t) = P(η ≤ t)

the standard normal c.d.f. Then

P(X̄ ≤ x) = P( √n (X̄ − µ)/σ ≤ √n (x − µ)/σ ) ≈ Φ( √n (x − µ)/σ )

when n → ∞ and t = √n (x − µ)/σ is “not too large.” In other words, for sufficiently large n, the c.d.f. P(X̄ ≤ x) of X̄ can be approximated by the normal c.d.f.:

P(X̄ ≤ x) ≈ Φ( √n (x − µ)/σ ).
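A short Python check of this approximation (an illustration, not from the original notes; assumes NumPy, and the exponential distribution with n = 50 is an arbitrary choice): simulate many samples of size n and compare the empirical value of P(X̄ ≤ x) with Φ(√n (x − µ)/σ).

import math
import numpy as np

def Phi(t):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

rng = np.random.default_rng(4)
n, reps = 50, 200_000
mu = sigma = 1.0                 # for the E(1) distribution, mu = sigma = 1
xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

for x in (0.9, 1.0, 1.1):
    empirical = np.mean(xbar <= x)
    approx = Phi(math.sqrt(n) * (x - mu) / sigma)
    print(f"x = {x}: P(Xbar <= x) ~ {empirical:.4f}, normal approximation {approx:.4f}")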

1.4 Continuity theorems

Proposition 1.8 (The first continuity theorem) Let g(·) be a continuous function, and let ξ₁, ξ₂, ... and ξ be random variables on (Ω, F, P). Then

(i) ξ_n → ξ (a.s.) ⇒ g(ξ_n) → g(ξ) (a.s.);

(ii) ξ_n →^P ξ ⇒ g(ξ_n) →^P g(ξ);

(iii) ξ_n →^D ξ ⇒ g(ξ_n) →^D g(ξ).

Proof: (i) is evident. We prove (ii) in the particular case where ξ = a (a is fixed and nonrandom), the only case of interest in the sequel. The continuity of g implies that for any ε > 0 there exists δ > 0 such that

|ξ_n − a| ≤ δ ⇒ |g(ξ_n) − g(a)| < ε.

Since ξ_n →^P a as n → ∞, we have

lim_{n→∞} P(|ξ_n − a| < δ) = 1 for all δ > 0.

Thus

lim_{n→∞} P(|g(ξ_n) − g(a)| < ε) = 1 for any ε > 0.


(iii) It suffices to prove (see the comment after Definition 1.7) that for any continuous and bounded function h(x),

E(h(g(ξ_n))) → E(h(g(ξ))), n → ∞.

Since g is continuous, f = h ∘ g is also continuous and bounded, and we arrive at (iii) because ξ_n →^D ξ implies that

E(f(ξ_n)) → E(f(ξ)), n → ∞,

for any continuous and bounded function f.

Proposition 1.9 (Second continuity theorem) Suppose that g(·) is continuously differentiable, and let X₁, ..., X_n be i.i.d. random variables such that E(X₁²) < ∞ and σ² = Var(X₁) > 0. Then

√n (g(X̄) − g(µ))/σ →^D η g′(µ), n → ∞,

where X̄ = (1/n) Σ_{i=1}^n X_i, µ = E(X₁), and η ∼ N(0, 1).

Proof: In the premise of the proposition the function

h(x) = (g(x) − g(µ))/(x − µ) if x ≠ µ, and h(x) = g′(µ) if x = µ,

is continuous. Because X̄ →^P µ (due to Proposition 1.6) and h is continuous, we conclude, by the first continuity theorem, that

h(X̄) →^P h(µ) = g′(µ), n → ∞.   (1.7)

However,

√n (g(X̄) − g(µ))/σ = (√n/σ) h(X̄)(X̄ − µ) = h(X̄) η_n,

where η_n = (√n/σ)(X̄ − µ). Now Proposition 1.7 implies that η_n →^D η ∼ N(0, 1) when n → ∞. Using this fact along with (1.7) and the result 3° of Exercise 1.4 we obtain the desired statement.
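As a numerical illustration of Proposition 1.9 (not part of the original notes; assumes NumPy, and the choices g(x) = x², X_i ∼ E(1) are arbitrary): here µ = σ = 1 and g′(µ) = 2, so √n (g(X̄) − g(µ))/σ should be approximately N(0, g′(µ)²) = N(0, 4) for large n.

import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 50_000
mu = sigma = 1.0                       # exponential E(1)
g = lambda x: x**2                     # g'(mu) = 2

xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (g(xbar) - g(mu)) / sigma

# The empirical variance of the statistic should be close to g'(mu)^2 = 4.
print(f"mean ~ {stat.mean():+.3f} (expected ~ 0), variance ~ {stat.var():.3f} (expected ~ 4)")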

1.5 Simulation of random variables

In applications we often need to “generate” (build) a computer-simulated sequence X₁, ..., X_n of i.i.d. random values following a given distribution F (we call it a sample). Of course, computer simulation only allows us to build pseudo-random variables (not “truly” random ones). That means that the simulated values X₁, ..., X_n are deterministic (they are obtained by a deterministic algorithm), but the properties of the sequence X₁, ..., X_n are “analogous” to those of a random i.i.d. sequence. For example, for pseudo-random variables one has

sup_x |F_n(x) − F(x)| → 0, n → ∞,

where F_n(x) = (1/n)µ_n, and µ_n is the number of X₁, ..., X_n which satisfy X_k ≤ x. We call F_n(x) the empirical distribution function computed from the sequence X₁, ..., X_n (here we consider deterministic convergence, cf. Exercise 1.14). The strong law of large numbers and the central limit theorem also hold for pseudo-random variables, etc.

1.5.1 Simulation of uniformly distributed random variables

The generation program is available in (essentially) all programming languages. How does it work? The c.d.f. F(x) of the distribution U[0, 1] satisfies

F(x) = 0 for x < 0, F(x) = x for x ∈ [0, 1], F(x) = 1 for x > 1.

Congruential algorithm. We fix a real number a > 1 and an integer m (usually a and m are “very large” mutually prime numbers). We start with a fixed value z₀. For 1 ≤ i ≤ n we define

z_i = the remainder of the division of a z_{i−1} by m = a z_{i−1} − [a z_{i−1}/m] m,

where [·] is the integer part. We always have 0 ≤ z_i < m. Thus, if we set

U_i = z_i/m = a z_{i−1}/m − [a z_{i−1}/m],

then 0 ≤ U_i < 1. The sequence U₁, ..., U_n is considered a sample from the uniform distribution U[0, 1]. Even if this is not a random sequence, the empirical c.d.f.

F_n^U(x) = (1/n) Σ_{i=1}^n I{U_i ≤ x}

satisfies sup_{0≤x≤1} |F_n^U(x) − x| ≤ ε(m), n → ∞, with ε(m) converging rapidly to 0 when m → ∞.

A well-developed mathematical theory allows one to justify “good” choices of z₀, a and m. For instance, the following values may be used:

a = 16807 (= 7⁵), m = 2147483647 (= 2³¹ − 1).
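A minimal Python sketch of this congruential generator with the constants above (an illustration, not from the original notes; the seed z0 = 12345 is an arbitrary choice):

def lcg_uniform(n, z0=12345, a=16807, m=2**31 - 1):
    """Generate n pseudo-uniform values U_i = z_i/m with z_i = a*z_{i-1} mod m."""
    z = z0
    us = []
    for _ in range(n):
        z = (a * z) % m          # remainder of the division of a*z_{i-1} by m
        us.append(z / m)
    return us

sample = lcg_uniform(5)
print(sample)                    # five values in [0, 1)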


[Figure: the step-function empirical c.d.f. of the simulated sample versus the theoretical c.d.f. of U[0, 1].]

In modern programming languages pseudo-random generators with improved properties have become available, and simple congruential generators are not used anymore.

1.5.2 Simulation of general pseudo-random variables

Given an i.i.d. sample U₁, ..., U_n from the uniform distribution, we can obtain a sample from a general distribution F(·) using the inversion algorithm. It may be used when an explicit expression for F(·) is available. This technique is based on the following statement:

Proposition 1.10 Let F be a continuous and strictly monotone c.d.f., and let U be a random variable uniformly distributed on [0, 1]. Then the c.d.f. of the r.v.

X = F^{−1}(U)

is exactly F(·).

Proof: We observe that

F(x) = P(U ≤ F(x)) = P(F^{−1}(U) ≤ x) = P(X ≤ x).

For instance, to simulate a sample X₁, ..., X_n from a distribution F which is continuous and strictly increasing, we may take

X_i = F^{−1}(U_i),

where the U_i are pseudo-random variables uniformly distributed on [0, 1], i = 1, ..., n.

If F is not continuous or strictly monotone, we need to modify the definition of the “inverse” F^{−1}. We set

F^{−1}(y) := sup{t : F(t) < y}.

Then,

P(X_i ≤ x) = P(sup{t : F(t) < U_i} ≤ x) = P(U_i ≤ F(x)) = F(x).


Example 1.4 Exponential distribution:

f(x) = e^{−x} I{x > 0}, F(x) = (1 − e^{−x}) I{x > 0}.

We compute F^{−1}(y) = −ln(1 − y) for y ∈ (0, 1), so X_i = −ln(1 − U_i), i = 1, ..., n, where U_i ∼ U[0, 1].
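A minimal Python sketch of inversion sampling for this example (not part of the original notes; assumes NumPy), drawing exponential variables as X_i = −ln(1 − U_i) and checking the empirical moments:

import numpy as np

rng = np.random.default_rng(6)
u = rng.uniform(size=100_000)          # U_i ~ U[0, 1]
x = -np.log(1.0 - u)                   # X_i = F^{-1}(U_i) = -ln(1 - U_i), so X_i ~ E(1)
print(f"empirical mean ~ {x.mean():.3f} (expected 1), empirical variance ~ {x.var():.3f} (expected 1)")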

Example 1.5 Bernoulli distribution:

P(X = 1) = p, P(X = 0) = 1 − p, 0 < p < 1.

We use the modified algorithm:

F^{−1}(y) = sup{t : F(t) < y} = 0 for y ∈ (0, 1 − p], and 1 for y ∈ (1 − p, 1].

If U_i is a uniform r.v. then X_i = F^{−1}(U_i) is a Bernoulli r.v.; we have

X_i = 0 if U_i ∈ (0, 1 − p], and X_i = 1 if U_i ∈ (1 − p, 1].

Exercise 1.5

A r.v. Y takes values 1, 3 and 4 with probabilities P(Y = 1) = 3/5, P(Y = 3) = 1/5 and P(Y = 4) = 1/5. How would you generate Y given a r.v. U ∼ U(0, 1)?

Exercise 1.6

Let U ∼ U(0, 1).

1. Explain how to simulate a die with 6 faces given U.

2. Let Y = [6U + 1], where [a] is the integer part of a. What are the possible values of Y and the corresponding probabilities?

Simulating transformed variables. How do we simulate a sample Y₁, ..., Y_n from the distribution F((x − µ)/σ), given a sample X₁, ..., X_n from F(·)? We suppose that σ > 0 and µ ∈ R. We should take Y_i = σX_i + µ, i = 1, ..., n.

1.5.3 Simulating normal N(0, 1) random variables

Note that while the normal c.d.f. F is continuous and strictly increasing, an explicit expression for F is not available. Thus, one cannot apply the inversion algorithm directly. Nevertheless, there are other techniques for simulating normal r.v. which are efficient from the numerical point of view.


Using the CLT. If U ∼ U[0, 1] then E(U) = 1/2 and Var(U) = 1/12. By the Central Limit Theorem this implies that

(U₁ + ... + U_N − N/2) / √(N/12) →^D N(0, 1), N → ∞,

for an i.i.d. sample U₁, ..., U_N with uniform distribution on [0, 1] (N = 12 is usually sufficient to obtain a “good” approximation!). Thus, one can consider the following simulation algorithm: let U₁, U₂, ..., U_{nN} be a pseudo-random sequence from the uniform distribution U[0, 1]; we take

X_i = (U_{(i−1)N+1} + ... + U_{iN} − N/2) / √(N/12), i = 1, ..., n.

Box–Muller algorithm. The algorithm is based on the following result:

Proposition 1.11 Let ξ and η be independent U[0, 1] random variables. Then the r.v.

X = √(−2 ln ξ) cos(2πη) and Y = √(−2 ln ξ) sin(2πη)

are standard normal and independent.

The proof of this statement is the subject of Exercise 2.19 of Lecture 2. This relation provides us with an efficient simulation technique: let U₁, ..., U_{2n} be i.i.d. r.v. with U₁ ∼ U[0, 1]. We set

X_{2i} = √(−2 ln U_{2i}) cos(2π U_{2i−1}),
X_{2i−1} = √(−2 ln U_{2i}) sin(2π U_{2i−1}),

for i = 1, ..., n.
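A minimal Python sketch of the Box–Muller transform described above (an illustration, not from the original notes; assumes NumPy), producing 2n approximately standard normal values from 2n uniform values:

import numpy as np

def box_muller(n, rng=None):
    """Return 2n standard normal values built from 2n uniform values."""
    if rng is None:
        rng = np.random.default_rng()
    u_odd = rng.uniform(size=n)              # plays the role of U_{2i-1}
    u_even = 1.0 - rng.uniform(size=n)       # in (0, 1], avoids log(0); plays the role of U_{2i}
    r = np.sqrt(-2.0 * np.log(u_even))
    x = r * np.cos(2.0 * np.pi * u_odd)
    y = r * np.sin(2.0 * np.pi * u_odd)
    return np.concatenate([x, y])

z = box_muller(100_000, np.random.default_rng(7))
print(f"mean ~ {z.mean():+.3f}, variance ~ {z.var():.3f}")   # expected ~ 0 and ~ 1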


1.6 Exercises

Exercise 1.7

Suppose two balanced dice are rolled. Find the joint probability distribution of X and Y if:

1. X is the maximum of the obtained values and Y is their sum;

2. X is the value of the first die and Y is the maximum of the two;

3. X and Y are, respectively, the smallest and the largest value.

Exercise 1.8

Suppose that X and Y are two independent Bernoulli B(1/2) random variables. Let U = X + Y and V = |X − Y|.

1. What are the joint probability distribution and the marginal probability distributions of U and V, and the conditional distributions of U given V = 0 and V = 1?

2. Are the r.v. U and V independent?

Exercise 1.9

Let ξ₁, ..., ξ_n be independent r.v., and let

ξ_min = min(ξ₁, ..., ξ_n), ξ_max = max(ξ₁, ..., ξ_n).

1) Show that

P(ξ_min ≥ x) = Π_{i=1}^n P(ξ_i ≥ x), P(ξ_max < x) = Π_{i=1}^n P(ξ_i < x).

2) Suppose, furthermore, that ξ₁, ..., ξ_n are identically distributed with the uniform distribution U[0, a]. Compute E(ξ_min), E(ξ_max), Var(ξ_min) and Var(ξ_max).

Exercise 1.10

Let ξ₁, ..., ξ_n be independent Bernoulli r.v. with

P(ξ_i = 0) = 1 − λ_i∆, P(ξ_i = 1) = λ_i∆,

where λ_i > 0 and ∆ > 0 is small. Show that

P( Σ_{i=1}^n ξ_i = 1 ) = ( Σ_{i=1}^n λ_i ) ∆ + O(∆²), P( Σ_{i=1}^n ξ_i > 1 ) = O(∆²).

Exercise 1.11


1) Prove that inf_{−∞<a<∞} E((ξ − a)²) is attained for a = E(ξ), and so

inf_{−∞<a<∞} E((ξ − a)²) = Var(ξ).

2) Let ξ be a nonnegative r.v. with c.d.f. F and finite expectation. Prove that

E(ξ) = ∫_0^∞ (1 − F(x)) dx.

3) Show, using the result of 2), that if M is the median of the c.d.f. F of ξ,

inf_{−∞<a<∞} E(|ξ − a|) = E(|ξ − M|).

Exercise 1.12

Let X₁ and X₂ be two independent r.v. with the exponential distribution E(λ). Show that min(X₁, X₂) and |X₁ − X₂| are r.v. with distributions, respectively, E(2λ) and E(λ).

Exercise 1.13

Let X be the number of “6” in 12000 independent rolls of a die. Using the Central Limit Theorem, estimate the probability that 1800 < X ≤ 2100 (Φ(√6) ≈ 0.9928, Φ(2√6) ≈ 0.999999518). Compare this approximation to that obtained using the Chebyshev inequality.

Exercise 1.14

Suppose that the r.v. ξ₁, ..., ξ_n are mutually independent and identically distributed with the c.d.f. F. For x ∈ R, let us define the random variable F_n(x) = (1/n)µ_n, where µ_n is the number of ξ₁, ..., ξ_n which satisfy ξ_k ≤ x. Show that for any x

F_n(x) →^P F(x)

(the function F_n(x) is called the empirical distribution function).

Exercise 1.15

[Monte-Carlo method] We want to compute the integral I = ∫_0^1 f(x) dx. Let X be a U[0, 1] random variable; then

E(f(X)) = ∫_0^1 f(x) dx = I.

Let X₁, ..., X_n be i.i.d. r.v. uniformly distributed on [0, 1]. Let us consider the quantity

f̄_n = (1/n) Σ_{i=1}^n f(X_i)

and let us suppose that σ² = Var(f(X)) < ∞. Prove that E(f̄_n) → I and f̄_n →^P I as n → ∞. Estimate P(|f̄_n − I| < ε) using the CLT.
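For illustration only (not part of the original notes; assumes NumPy, and the integrand f(x) = x² is an arbitrary choice), here is the Monte-Carlo estimator f̄_n described in this exercise:

import numpy as np

f = lambda x: x**2                 # arbitrary test integrand; I = 1/3
rng = np.random.default_rng(8)

for n in (100, 10_000, 1_000_000):
    x = rng.uniform(size=n)
    fbar = f(x).mean()             # fbar_n = (1/n) * sum f(X_i)
    print(f"n = {n:>9d}: fbar_n = {fbar:.5f} (I = {1/3:.5f})")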

Exercise 1.16


Weibull distributions are often used in survival and reliability analysis. An example of a distribution from this family is given by the c.d.f.

F(x) = 0 for x < 0, and F(x) = 1 − e^{−5x²} for x ≥ 0.

Explain how to generate a r.v. Z ∼ F given a uniform r.v. U.

Exercise 1.17

Write down an algorithm for simulating a Poisson r.v. by inversion.

Hint: there is no simple expression for the Poisson c.d.f. and the set of values is infinite. However, the Poisson c.d.f. can be easily computed recursively. Observe that if X is a Poisson r.v.,

P(X = k) = e^{−λ} λ^k / k! = (λ/k) P(X = k − 1).


Lecture 2

Random vectors

2.1 Couples of random variables

2.1.1 Joint and marginal distributions

Let (X, Y) be a couple of r.v. The joint c.d.f. of (X, Y) is given by

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y), x, y ∈ R.

The marginal c.d.f.'s are given by

F_X(x) = F_{X,Y}(x, ∞) = P(X ≤ x);

F_Y(y) = F_{X,Y}(∞, y) = P(Y ≤ y).

In the continuous case, we suppose that the derivative

∂²F_{X,Y}(x, y) / ∂x∂y = f_{X,Y}(x, y)   (2.1)

exists a.e. The function f_{X,Y}(x, y) is called the density of F_{X,Y}(x, y).

The marginal densities f_X and f_Y are defined according to

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy, f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx.

In the discrete case X and Y take values in a finite or countable set. The joint distribution of a couple X, Y is defined by the probabilities {P(X = k, Y = m)}_{k,m}. The marginal laws are defined by the probabilities

P(X = k) = Σ_m P(X = k, Y = m),

P(Y = m) = Σ_k P(X = k, Y = m).

If X and Y are independent then

F_{X,Y}(x, y) = F_X(x) F_Y(y) for all (x, y) ∈ R².


The converse is also true. In the continuous case, independence is equivalent to the decomposition

f_{X,Y}(x, y) = f_X(x) f_Y(y) for all (x, y) ∈ R²,

and in the discrete case,

P(X = k, Y = m) = P(X = k) P(Y = m).

2.1.2 Covariance and correlation

Let X and Y be square-integrable r.v., i.e. E(X²) < ∞ and E(Y²) < ∞. We denote

σ²_X = Var(X), σ²_Y = Var(Y).

Definition 2.1 The covariance of X and Y is the value

Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y).

If Cov(X, Y) = 0, we say that X and Y are orthogonal, and we denote X ⊥ Y.

Definition 2.2 Let σ²_X > 0 and σ²_Y > 0. The correlation between X and Y is the value

Corr(X, Y) = ρ_{XY} = Cov(X, Y) / (σ_X σ_Y).

2.1.3 Properties of covariance and correlation

The relationships below are immediate consequences of Definition 2.1.

1. Cov(X, X) = Var(X).
2. Cov(aX, bY) = ab Cov(X, Y), a, b ∈ R.
3. Cov(X + a, Y) = Cov(X, Y), a ∈ R.
4. Cov(X, Y) = Cov(Y, X).
5. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

Indeed,

Var(X + Y) = E((X + Y)²) − (E(X) + E(Y))²
  = E(X²) + E(Y²) + 2E(XY) − E²(X) − E²(Y) − 2E(X)E(Y).

6. If X and Y are independent, Cov(X, Y) = 0.

Important note: the converse is not true. For instance, if X ∼ N(0, 1) and Y = X², then

Cov(X, Y) = E(X³) − E(X)E(X²) = E(X³) = 0

(recall that N(0, 1) is symmetric with respect to 0), although X and Y are clearly dependent.
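A quick numerical illustration of this important note (not from the original notes; assumes NumPy): the sample covariance of X and Y = X² is close to zero even though Y is a deterministic function of X.

import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=1_000_000)
y = x**2                                   # Y = X^2 is fully determined by X
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(f"sample Cov(X, X^2) ~ {cov_xy:+.4f} (theoretical value 0, yet X and Y are dependent)")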


Let us now consider the properties of the correlation:

1. −1 ≤ ρ_{XY} ≤ 1 (the Cauchy-Schwarz inequality):

|Cov(X, Y)| = |E((X − E(X))(Y − E(Y)))| ≤ √E((X − E(X))²) √E((Y − E(Y))²) = σ_X σ_Y.

2. If X and Y are independent, ρ_{XY} = 0.

3. |ρ_{XY}| = 1 if and only if X and Y are linearly dependent: there exist a ≠ 0, b ∈ R such that Y = aX + b.

Proof: Note that |ρ_{XY}| = 1 iff the equality is attained in the Cauchy-Schwarz inequality. By Proposition 1.4, this is only possible if there are α, β ∈ R such that

α(X − E(X)) + β(Y − E(Y)) = 0 (a.s.),

and if α ≠ 0 or β ≠ 0. This is equivalent to the existence of α, β and γ ∈ R such that

αX + βY + γ = 0 (a.s.),

with α ≠ 0 or β ≠ 0. If α ≠ 0 and β ≠ 0 one has

Y = −(α/β)X − γ/β, X = −(β/α)Y − γ/α.

The case α = 0 or β = 0 is impossible, because it would mean that one of the variables Y or X is constant (a.s.), but we have assumed that σ_X and σ_Y are positive.

Observe that if Y = aX + b, a, b ∈ R, a ≠ 0, then

σ²_Y = E((Y − E(Y))²) = a² E((X − E(X))²) = a² σ²_X,

and the covariance is

Cov(X, Y) = E((X − E(X)) a(X − E(X))) = a σ²_X,

so that ρ_{XY} = a σ²_X / (σ_X |a| σ_X) = a/|a|. We say that the correlation between X and Y is positive if ρ_{XY} > 0 and negative if ρ_{XY} < 0. The correlation above is thus positive (= 1) if a > 0 and negative (= −1) if a < 0.

Geometric interpretation of the correlation. Let us consider the set of random variables ξ on (Ω, F, P) such that E(ξ²) < ∞. We will say that ξ ∼ ξ′ (ξ is equivalent to ξ′) if ξ = ξ′ (a.s.) with respect to the measure P. This relation defines a family of equivalence classes over the random variables ξ such that E(ξ²) < ∞.

Definition 2.3 We denote by L²(P) the space of (equivalence classes of) square-integrable r.v. ξ (E(ξ²) < ∞).

The space L²(P) we have just defined is a Hilbert space equipped with the scalar product

⟨X, Y⟩ = E(XY), X, Y ∈ L²(P),


and the corresponding norm ‖X‖ = [E(X²)]^{1/2}, X ∈ L²(P).

Indeed, ⟨·,·⟩ satisfies all the axioms of the scalar product: for all X, ξ, η ∈ L²(P) and a, b ∈ R,

⟨aξ + bη, X⟩ = E([aξ + bη]X) = aE(ξX) + bE(ηX) = a⟨ξ, X⟩ + b⟨η, X⟩,

and ⟨X, X⟩ ≥ 0; ⟨X, X⟩ = 0 implies X = 0 (a.s.).

Note that

Cov(X, Y) = ⟨X − E(X), Y − E(Y)⟩

and

ρ_{XY} = ⟨X − E(X), Y − E(Y)⟩ / (‖X − E(X)‖ ‖Y − E(Y)‖).

In other words, ρ_{XY} is the “cosine of the angle” between X − E(X) and Y − E(Y). Thus, ρ_{XY} = ±1 means that X − E(X) and Y − E(Y) are collinear: Y − E(Y) = a(X − E(X)) for some a ≠ 0.

2.2 Random vectors

Let X = (ξ₁, ..., ξ_p)^T be a random vector,¹ where ξ₁, ..., ξ_p are random (univariate) variables. We introduce random matrices in the same way:

Ξ = (ξ_{ij}), i = 1, ..., p, j = 1, ..., q,

where the ξ_{ij} are (univariate) r.v. The cumulative distribution function of the random vector X is

F(t) = P(ξ₁ ≤ t₁, ..., ξ_p ≤ t_p), t = (t₁, ..., t_p)^T ∈ R^p.

If F(t) is differentiable with respect to t, the density of X (the joint density of ξ₁, ..., ξ_p) exists and is equal to the mixed derivative

f(t) = f(t₁, ..., t_p) = ∂^p F(t) / (∂t₁ ··· ∂t_p).

In this case

F(t) = ∫_{−∞}^{t₁} ... ∫_{−∞}^{t_p} f(u₁, ..., u_p) du₁ ... du_p.

2.2.1 Properties of a multivariate density

We have: f(t) ≥ 0 and ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(t₁, ..., t_p) dt₁ ... dt_p = 1. The marginal density of ξ₁, ..., ξ_k, k < p, is (we use the symbol f(·) as a generic density notation)

f(t₁, ..., t_k) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(t₁, ..., t_p) dt_{k+1} ... dt_p.

¹By convention, a vector X ∈ R^{p×1} is a column vector.


Note that two different random vectors may have the same marginal distributions.

Example 2.1 Consider the densities

f₁(t₁, t₂) = 1 and f₂(t₁, t₂) = 1 + (2t₁ − 1)(2t₂ − 1), 0 < t₁, t₂ < 1.

In both cases, f(t₁) = ∫_0^1 f(t₁, t₂) dt₂ = 1.

Independence. Suppose that two random vectors X₁ and X₂ have a joint density f(x₁, x₂). They are independent iff

f(x₁, x₂) = f₁(x₁) f₂(x₂),

where f₁ and f₂ are probability densities. Independence is preserved by measurable transformations of the vectors X₁ and X₂.

2.2.2 Moments of random vectors

A vector µ = (µ₁, ..., µ_p)^T ∈ R^p is the mean (expectation) of the random vector X = (ξ₁, ..., ξ_p)^T if

µ_j = E(ξ_j) = ∫ ... ∫ t_j f(t₁, ..., t_p) dt₁ ... dt_p, j = 1, ..., p

(we suppose that the corresponding integrals exist); we write µ = E(X). In the same way we define the expectation of a random matrix. As in the scalar case, the expectation is a linear functional: for any matrix A ∈ R^{q×p} and b ∈ R^q,

E(AX + b) = A E(X) + b = Aµ + b.

This property is still valid for random matrices: if Ξ is a p × q random matrix and A ∈ R^{q×p}, then E(AΞ) = A E(Ξ).

The covariance matrix Σ of the random vector X is given by

Σ := V(X) = E((X − µ)(X − µ)^T) = (σ_{ij}),

a p × p matrix, where

σ_{ij} = E((ξ_i − µ_i)(ξ_j − µ_j)) = ∫ ... ∫ (t_i − µ_i)(t_j − µ_j) f(t₁, ..., t_p) dt₁ ... dt_p

(we note that the σ_{ij} are not always positive). Because σ_{ij} = σ_{ji}, Σ is a symmetric matrix. We can also define the covariance matrix of random vectors X (p × 1) and Y (q × 1):

C(X, Y) = E((X − E(X))(Y − E(Y))^T), C ∈ R^{p×q}.

The covariance matrix possesses the following properties:

1°. Σ = E(XX^T) − µµ^T, where µ = E(X).

2°. For any a ∈ R^p, Var(a^T X) = a^T V(X) a.

Proof: Observe that by linearity of the expectation,

Var(a^T X) = E((a^T X − E(a^T X))²) = E((a^T (X − E(X)))²)
  = E(a^T (X − µ)(X − µ)^T a) = a^T E((X − µ)(X − µ)^T) a = a^T V(X) a.


Since Var(a^T X) ≥ 0 for all a ∈ R^p, the matrix V(X) is positive semidefinite; we write V(X) ⪰ 0. Thus, we have

3°. Σ ⪰ 0.

4°. Let A be a q × p matrix. Then V(AX + b) = A V(X) A^T.

Proof: Let Y = AX + b; then by linearity of the expectation,

ν = E(Y) = E(AX + b) = Aµ + b and Y − E(Y) = A(X − µ).

Now we have:

V(Y) = E(A(X − µ)(X − µ)^T A^T) = A V(X) A^T (linearity again).

5°. C(X, X) = V(X). In this case C = C^T ⪰ 0 (a positive semidefinite matrix).

6°. C(X, Y) = C(Y, X)^T.

7°. C(X₁ + X₂, Y) = C(X₁, Y) + C(X₂, Y).

8°. If X and Y are two p-dimensional random vectors,

V(X + Y) = V(X) + C(X, Y) + C(Y, X) + V(Y) = V(X) + C(X, Y) + C(X, Y)^T + V(Y).

9°. If X ⊥⊥ Y, then C(X, Y) = 0 (the null matrix); the converse is not true. This can be proved exactly the same way as in the case of the covariance of r.v.

The correlation matrix P of X is given by P = (ρ_{ij}), 1 ≤ i, j ≤ p, where

ρ_{ij} = σ_{ij} / (√σ_{ii} √σ_{jj}).

We note that the diagonal entries are ρ_{ii} = 1, i = 1, ..., p. If ∆ is the diagonal matrix with ∆_{ii} = √σ_{ii}, then P = ∆^{−1} Σ ∆^{−1}, and the positive semidefiniteness of Σ implies that of P, i.e. P ⪰ 0.

2.2.3 Transformations of random vectors

Let h = (h₁, ..., h_p)^T be a transformation, i.e. a mapping from R^p to R^p,

h(t₁, ..., t_p) = (h₁(t₁, ..., t_p), ..., h_p(t₁, ..., t_p))^T, t = (t₁, ..., t_p)^T ∈ R^p.

The Jacobian of the transformation is defined by

J_h(t) = Det( (∂h_i/∂t_j)(t) )_{i,j}.

Proposition 2.1 (Calculus reminder) Suppose that

(i) the partial derivatives of h_i(·) are continuous on R^p, i = 1, ..., p,

(ii) h is a bijection,


(iii) J_h(t) ≠ 0 for any t ∈ R^p.

Then, for any function f(t) such that

∫_{R^p} |f(t)| dt < ∞

and any Borel set K ⊆ R^p we have

∫_K f(t) dt = ∫_{h^{−1}(K)} f(h(u)) |J_h(u)| du.

Remark: by the inverse function theorem, under the conditions of Proposition 2.1, the inverse function g(·) = h^{−1}(·) exists everywhere on R^p, and

J_{h^{−1}}(t) = 1 / J_h(h^{−1}(t)),

thus

J_{h^{−1}}(h(u)) = 1 / J_h(u).

We conclude that h satisfies conditions (i)-(iii) of Proposition 2.1 iff g = h^{−1} satisfies the same conditions.

We have the following corollary of Proposition 2.1:

Proposition 2.2 Let Y be a random vector with density f_Y(t), t ∈ R^p. Let g : R^p → R^p be a transformation satisfying the premise of Proposition 2.1. Then the density of the random vector X = g(Y) exists and is given by

f_X(u) = f_Y(h(u)) |J_h(u)|, for any u ∈ R^p,

where h = g^{−1}.

Proof: Let X = (ξ₁, ..., ξ_p)^T, v = (v₁, ..., v_p)^T, and A_v = {t ∈ R^p : g_i(t) ≤ v_i, i = 1, ..., p}. Then, by Proposition 2.1 with h = g^{−1} and f = f_Y, the c.d.f. of X is

F_X(v) = P(ξ_i ≤ v_i, i = 1, ..., p) = P(g_i(Y) ≤ v_i, i = 1, ..., p)
  = ∫_{A_v} f_Y(t) dt = ∫_{g(A_v)} f_Y(h(u)) |J_h(u)| du.

On the other hand,

g(A_v) = {u = g(t) ∈ R^p : t ∈ A_v} = {u = g(t) ∈ R^p : g_i(t) ≤ v_i, i = 1, ..., p}
  = {u = (u₁, ..., u_p)^T ∈ R^p : u_i ≤ v_i, i = 1, ..., p}.

We conclude that

F_X(v) = ∫_{−∞}^{v₁} ... ∫_{−∞}^{v_p} f_Y(h(u)) |J_h(u)| du

for any v = (v₁, ..., v_p)^T ∈ R^p. This implies that the density of X is f_Y(h(u)) |J_h(u)|.


Corollary 2.1 If X = AY + b, where Y is a random vector on R^p with density f_Y and A is an invertible p × p matrix, then

f_X(u) = f_Y(A^{−1}(u − b)) |Det(A^{−1})| = f_Y(A^{−1}(u − b)) / |Det(A)|.

To verify the result it suffices to use Proposition 2.2 with u = g(t) = At + b, and thus t = g^{−1}(u) = h(u) = A^{−1}(u − b).

2.2.4 Reminder of properties of symmetric matrices

Recall that a p × p matrix A = (a_{ij}), i, j = 1, ..., p, is called symmetric if a_{ij} = a_{ji}, i, j = 1, ..., p. A p × p matrix Γ is said to be orthogonal if

Γ^{−1} = Γ^T (equivalently, ΓΓ^T = Γ^T Γ = I)

(where I is the p × p identity matrix). In other words, the columns γ_{·j} of Γ are orthogonal vectors of length 1 (the same is true for the rows γ_{i·} of Γ). Of course, |Det(Γ)| = 1. We have the spectral decomposition theorem (Jordan):

Let A ∈ R^{p×p} be a symmetric matrix. Then there exist an orthogonal matrix Γ and a diagonal matrix

Λ = Diag(λ₁, ..., λ_p),

such that

A = ΓΛΓ^T = Σ_{i=1}^p λ_i γ_{·i} γ_{·i}^T,   (2.2)

where the γ_{·i} are the orthonormal eigenvectors of A:²

γ_{·i}^T γ_{·j} = δ_{ij}, i, j = 1, ..., p, Γ = (γ_{·1}, ..., γ_{·p}).

Comments.

1) Though a symmetric matrix may have multiple (repeated) eigenvalues, its eigenvectors can always be chosen orthonormal, as above.

2) We assume in the sequel that the eigenvalues λ_i, i = 1, ..., p are sorted in decreasing order:

λ₁ ≥ λ₂ ≥ ... ≥ λ_p.

We say that γ_{·1} is the first (or principal) eigenvector of A, i.e. the eigenvector corresponding to the maximal eigenvalue; γ_{·2} is the second eigenvector, and so on.

If all eigenvalues λ_i, i = 1, ..., p are nonnegative, the matrix A is positive semidefinite (and positive definite if all λ_i > 0).

²Here δ_{ij} stands for the Kronecker index: δ_{ij} = 1 if i = j, otherwise δ_{ij} = 0.


Other useful properties of square matrices:

1°. Det(A) = Π_{i=1}^p λ_i, Tr(A) = Σ_{i=1}^p λ_i.

2°. Det(AB) = Det(A) Det(B), Det(A^T) = Det(A).

3°. The calculus of matrix functions simplifies for symmetric matrices: for example, a positive integer power A^s, s ∈ N₊, of a symmetric matrix satisfies A^s = ΓΛ^sΓ^T (if the matrix A is positive definite this is true for any real s).

4°. Det(A^{−1}) = Det(A)^{−1} for nondegenerate A.

5°. For any s ∈ R and any matrix A = A^T ≻ 0, Det(A^s) = Det(A)^s (a simple consequence of |Det(Γ)| = 1 for an orthogonal matrix Γ).

Projectors. A symmetric matrix P such that

P² = P (an idempotent matrix)

is called a projection matrix (or projector).

All the eigenvalues of P are 0 or 1, and Rank(P) is the number of eigenvalues equal to 1. In other words, there is an orthogonal matrix Γ such that

Γ^T P Γ = ( I  0
            0  0 ),

where I is the Rank(P) × Rank(P) identity matrix.

Indeed, let v be an eigenvector of P; then Pv = λv, where λ is an eigenvalue of P. Due to P² = P,

(λ² − λ)v = (λP − P)v = (P² − P)v = 0,

which is equivalent to λ = 1 if Pv ≠ 0.

2.2.5 Characteristic function of a random vector

Definition 2.4 Let X ∈ R^p be a random vector. Its characteristic function, for t ∈ R^p, is given by

φ_X(t) = E(exp(i t^T X)).

Exercise 2.1

Two random vectors X ∈ R^p and Y ∈ R^q are independent iff the characteristic function φ_Z(u) of the vector Z = (X^T, Y^T)^T can be represented, for any u = (a^T, b^T)^T with a ∈ R^p and b ∈ R^q, as

φ_Z(u) = φ_X(a) φ_Y(b).

Verify this characterization in the continuous case.


2.3 Multivariate normal distribution

Normal distribution on R: recall that the normal distribution N(µ, σ²) on R is the probability distribution with density

f(x) = (1/(√(2π) σ)) exp( −(x − µ)²/(2σ²) ).

Here µ is the mean and σ² is the variance. The characteristic function of the normal distribution N(µ, σ²) is given by

φ(t) = exp( iµt − σ²t²/2 ).

In particular, for N(0, 1) we have φ(t) = e^{−t²/2}.

2.3.1 The distribution Np(0, I)

N_p(0, I) is the distribution of the random vector X = (ξ₁, ..., ξ_p)^T where ξ_i, i = 1, ..., p, are i.i.d. random variables with distribution N(0, 1).

Properties of N_p(0, I):

1°. The mean and the covariance matrix of X ∼ N_p(0, I) are E(X) = 0 and V(X) = I.

2°. The distribution N_p(0, I) is absolutely continuous with density

f(u) = (2π)^{−p/2} exp(−(1/2) u^T u) = (2π)^{−p/2} Π_{i=1}^p exp(−(1/2) u_i²) = Π_{i=1}^p f₀(u_i),

where u = (u₁, ..., u_p)^T and f₀(t) = (1/√(2π)) e^{−t²/2} is the density of N(0, 1).

3°. The characteristic function of N_p(0, I) is, by definition,

φ_X(a) = E(e^{i a^T X}) = E( Π_{j=1}^p e^{i a_j ξ_j} ) = Π_{j=1}^p E(e^{i a_j ξ_j}) = Π_{j=1}^p φ_{ξ_j}(a_j) = Π_{j=1}^p e^{−a_j²/2} = exp(−(1/2) a^T a),

where a = (a₁, ..., a_p)^T ∈ R^p.

2.3.2 Normal distribution on Rp

Definition 2.5 The random vector X is normally distributed on R^p if and only if there exist a p × p matrix A and a vector µ ∈ R^p such that

X = AY + µ, where Y ∼ N_p(0, I).


Properties:

1°. E(X) = µ, due to E(Y) = 0.

2°. V(X) = A V(Y) A^T = AA^T. We put Σ = AA^T.

3°. The characteristic function is

φ_X(a) = E(e^{i a^T X}) = E(e^{i a^T (AY + µ)}) = e^{i a^T µ} E(e^{i b^T Y}) (with b = A^T a)
  = e^{i a^T µ − (1/2) b^T b} = e^{i a^T µ − (1/2) a^T Σ a}.   (2.3)

We have the following characterization:

Theorem 2.1 Let φ : R^p → C (a complex-valued function). Then φ is the characteristic function of a normal distribution if and only if there exist µ ∈ R^p and a positive semidefinite matrix Σ ∈ R^{p×p} such that

φ(a) = e^{i a^T µ − (1/2) a^T Σ a}, a ∈ R^p.   (2.4)

Remark: in this case µ is the mean and Σ is the covariance matrix of the normal distribution in question.

Proof: The necessity (the “only if” part) has already been proved by (2.3). To show the “if” part, starting from (2.4) we have to build a normal vector X ∈ R^p such that φ(·) is its characteristic function.

1st step: by the spectral decomposition theorem, there exists an orthogonal matrix Γ such that Γ^T Σ Γ = Λ, where Λ is a diagonal matrix of rank k ≤ p with strictly positive eigenvalues λ_j, 1 ≤ j ≤ k. Then (cf. (2.2)),

Σ = Σ_{j=1}^p λ_j γ_{·j} γ_{·j}^T = Σ_{j=1}^p a_{·j} a_{·j}^T,

where the γ_{·j} are the columns of Γ and a_{·j} = √λ_j γ_{·j} (so that a_{·j} = 0 for j > k). Observe that a_{·j} ⊥ a_{·l} for l ≠ j (recall that the γ_{·j} are orthonormal).

2nd step: Let Y ∼ N_p(0, I); let us denote by η_j the components of Y (Y = (η₁, ..., η_p)^T). We consider the random vector

X = η₁ a_{·1} + ... + η_k a_{·k} + µ,

so that X = AY + µ, where A is a p × p matrix with columns a_{·j}, j = 1, ..., k: A = (a_{·1}, ..., a_{·k}, 0, ..., 0). Thus, X is a normal vector. Observe that we have E(X) = µ and

V(X) = E((η₁ a_{·1} + ... + η_k a_{·k})(η₁ a_{·1} + ... + η_k a_{·k})^T) = Σ_{j=1}^k a_{·j} a_{·j}^T = Σ,


due to E(η_l η_j) = δ_{jl}, where δ_{jl} is the Kronecker symbol. Thus, by (2.3), the characteristic function of X coincides with φ(u) in (2.4).

The result of Theorem 2.1 implies the following important corollary: any normal distribution on R^p is entirely defined by its mean vector and its covariance matrix. This explains the notation

X ∼ N(µ, Σ)

for a random vector X which is normally distributed with mean µ and covariance matrix Σ = Σ^T ⪰ 0.

We will distinguish two situations: that of a nondegenerate normal distribution and that of a degenerate normal distribution.

2.3.3 Nondegenerate normal distribution

This is a normal distribution on R^p with a positive definite covariance matrix Σ, i.e. Σ ≻ 0 (⇔ Det(Σ) > 0). Moreover, because Σ is symmetric and Σ ≻ 0, there exists a symmetric matrix A₁ = Σ^{1/2} (a symmetric square root of Σ) such that Σ = A₁² = A₁^T A₁ = A₁ A₁^T. As Det(Σ) = [Det(A₁)]² > 0, we have Det(A₁) > 0 and A₁ is invertible. By (2.3), if X ∼ N(µ, Σ), its characteristic function is

φ_X(a) = e^{i a^T µ − (1/2) a^T Σ a}

for any a ∈ R^p, and due to Σ = A₁ A₁^T, we have

φ_X(a) = e^{i a^T µ − (1/2) a^T Σ a} = E(e^{i a^T (A₁ Y + µ)}) = φ_{A₁Y+µ}(a),

where Y ∼ N_p(0, I). Therefore,

X = A₁ Y + µ

and, due to the invertibility of A₁,

Y = A₁^{−1}(X − µ).

The Jacobian of this linear transformation is Det(A₁^{−1}), and by Corollary 2.1 the density of X is, for any u ∈ R^p,

f_X(u) = Det(A₁^{−1}) f_Y(A₁^{−1}(u − µ)) = (1/Det(A₁)) f_Y(A₁^{−1}(u − µ))
  = 1/((2π)^{p/2} √Det(Σ)) exp( −(1/2)(u − µ)^T Σ^{−1} (u − µ) ).

Definition 2.6 We say that X has a nondegenerate normal distribution N_p(µ, Σ) (with a positive definite covariance matrix Σ) iff X is a random vector with density

f(t) = 1/((2π)^{p/2} √Det(Σ)) exp( −(1/2)(t − µ)^T Σ^{−1} (t − µ) ).
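The representation X = Σ^{1/2} Y + µ is also how one simulates N_p(µ, Σ) in practice. Below is a short Python sketch (an illustration, not from the original notes; assumes NumPy, and the particular µ and Σ are arbitrary choices), using a Cholesky factor A with AA^T = Σ in place of the symmetric square root.

import numpy as np

rng = np.random.default_rng(10)
mu = np.array([1.0, -2.0])                      # arbitrary mean vector
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                  # arbitrary positive definite covariance
A = np.linalg.cholesky(Sigma)                   # A @ A.T == Sigma

n = 200_000
Y = rng.normal(size=(n, 2))                     # rows are i.i.d. N_2(0, I)
X = Y @ A.T + mu                                # rows are i.i.d. N_2(mu, Sigma)

print("sample mean      ", X.mean(axis=0))
print("sample covariance\n", np.cov(X, rowvar=False))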


2.3.4 Degenerate normal distribution

This is a normal distribution on R^p with a singular covariance matrix Σ, i.e. Det(Σ) = 0 (in other words, Rank(Σ) = k < p). For instance, consider Σ = 0; then the characteristic function of X ∼ N(µ, 0) is φ_X(a) = e^{i a^T µ} (by Property 3°) and the distribution of X is the Dirac function at µ.

More generally, if Rank(Σ) = k ≥ 1, we obtain (cf. the proof of Theorem 2.1) that a vector X ∼ N_p(µ, Σ) can be represented as

X = AY + µ,

where Y ∼ N(0, I), A = (a_{·1}, ..., a_{·k}, 0, ..., 0) and AA^T = Σ, with Rank(A) = k.

Proposition 2.3 Let X ∼ N_p(µ, Σ) and Rank(Σ) = k < p. Then there exists a linear subspace H ⊂ R^p of dimension p − k such that the projection a^T X of X on any vector a ∈ H is a “Dirac random variable.”

Proof: We have X = AY + µ where AA^T = Σ and Rank(A) = k. Let H = Ker(A^T), of dimension dim(H) = p − k. If a ∈ H, then A^T a = 0 and Σa = 0.

Now, let a ∈ H; the characteristic function of the r.v. a^T X is

φ(u) = E(e^{i(a^T X)u}) = E(e^{i(ua)^T X}) = e^{i(ua)^T µ − (1/2)(ua)^T Σ (ua)} = e^{i(ua)^T µ}.

Therefore, the distribution of a^T X is a (scalar) Dirac function at a^T µ.

In particular, any component of X is either normally distributed (nondegenerate) or its distribution is a Dirac function.

Theorem 2.2 (Equivalent definition of the multivariate normal distribution) A random vector X ∈ R^p is normally distributed iff all its scalar projections a^T X, a ∈ R^p, are normal random variables.

Remark: we include the Dirac distribution as a special case of normal distributions (corresponding to σ^2 = 0).

Proof : Observe first that for any a ∈ R^p and any u ∈ R the characteristic function φ_{a^T X}(u) of the r.v. a^T X is related to that of X:

φ_{a^T X}(u) = E(e^{i a^T X u}) = φ_X(ua). (2.5)

"only if" part: Let X be a normal vector in R^p. We are to show that a^T X is a normal random variable for any a ∈ R^p. We use (2.5) to obtain, for any u ∈ R,

φ_{a^T X}(u) = e^{i u a^T µ − (1/2) u^2 a^T Σ a},

where µ and Σ are the mean and the covariance matrix of X. Thus,

φ_{a^T X}(u) = e^{i µ_0 u − (1/2) u^2 σ_0^2}

with µ_0 = a^T µ and σ_0^2 = a^T Σ a. As a result,

a^T X ∼ N(µ_0, σ_0^2) = N(a^T µ, a^T Σ a).


"if" part: We are to prove that if a^T X is a normal random variable for any a ∈ R^p, then X is a normal vector. To this end, note that if a^T X is a normal r.v. for any a ∈ R^p, then E(|X|^2) < ∞ (to see why this is true it suffices to choose a as vectors of an orthonormal basis of R^p). Therefore, the mean µ = E(X) and the covariance matrix Σ = V(X) are well defined.

Let us fix a ∈ R^p. By hypothesis, there exist m ∈ R and s^2 ≥ 0 such that a^T X ∼ N(m, s^2). However, this immediately implies that

m = E(a^T X) = a^T µ,   s^2 = Var(a^T X) = a^T Σ a.

Moreover, the characteristic function of a^T X is given by

φ_{a^T X}(u) = e^{i m u − (1/2) s^2 u^2} = e^{i u a^T µ − (1/2) u^2 a^T Σ a}.

Now using (2.5) we obtain

φ_X(a) = φ_{a^T X}(1) = e^{i a^T µ − (1/2) a^T Σ a}.

Because a ∈ R^p is arbitrary, we conclude by Theorem 2.1 that X is a normal random vector in R^p with mean µ and covariance matrix Σ.

2.3.5 Properties of multivariate normal distribution

Here we consider X ∼ N_p(µ, Σ), where µ ∈ R^p and Σ ∈ R^{p×p} is a symmetric matrix, Σ ⪰ 0. The following properties are consequences of the results of the preceding section:

(N1) Let Σ ≻ 0; then the random vector Y = Σ^{−1/2}(X − µ) satisfies

Y ∼ Np(0, I).

(N2) One-dimensional projections aTX of X for any a ∈ Rp are normal random variables:

aTX ∼ N(aTµ, aTΣa).

In particular, the marginal densities of the distribution Np(µ,Σ) are also normal.

Exercise 2.2

Let the joint density of r.v.’s X and Y satisfy

f(x, y) = \frac{1}{2π} e^{−x^2/2} e^{−y^2/2} [1 + xy I{−1 ≤ x, y ≤ 1}].

What is the distribution of X? Of Y?

(N3) Any linear transformation of a normal vector is again a normal vector: if Y = AX + c, where A ∈ R^{q×p} and c ∈ R^q are a fixed (non-random) matrix and vector, then

Y ∼ Nq(Aµ+ c, AΣAT ).

Exercise 2.3


Prove this claim.
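The following Monte Carlo sketch (an illustration, not the requested proof; it assumes NumPy) checks (N2)–(N3) empirically: for a simulated X ∼ N_p(µ, Σ) and an arbitrary illustrative pair (A, c), the empirical mean and covariance of Y = AX + c should be close to Aµ + c and AΣA^T.

    import numpy as np

    rng = np.random.default_rng(1)
    mu = np.array([0.0, 1.0, -1.0])
    Sigma = np.array([[2.0, 0.5, 0.0],
                      [0.5, 1.0, 0.3],
                      [0.0, 0.3, 1.5]])
    A = np.array([[1.0, -1.0, 2.0],
                  [0.0,  3.0, 1.0]])                 # q x p with q = 2, p = 3
    c = np.array([10.0, -5.0])

    X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows ~ N_3(mu, Sigma)
    Y = X @ A.T + c                                        # row-wise Y = A X + c

    print(Y.mean(axis=0), A @ mu + c)                      # empirical mean vs A mu + c
    print(np.cov(Y, rowvar=False), A @ Sigma @ A.T)        # empirical cov vs A Sigma A^T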

(N4) Let σ^2 > 0. The distribution of X ∼ N_p(0, σ^2 I) is invariant with respect to orthogonal transformations: if Γ is an orthogonal matrix, then ΓX ∼ N_p(0, σ^2 I). (The proof is evident: it suffices to use (N3) with A = Γ.)

(N5) All subsets of components of a normal vector are normal vectors: let X = (X_1^T, X_2^T)^T, where X_1 ∈ R^k and X_2 ∈ R^{p−k}; then X_1 and X_2 are (k- and (p−k)-variate) normal vectors.

Proof : We use (N3) with c = 0 and A ∈ R^{k×p}, A = (I_k, 0), where I_k is the k × k identity matrix, to conclude that X_1 is normal. For X_2 we take A = (0, I_{p−k}) ∈ R^{(p−k)×p}.

(N6) Two jointly normal vectors are independent if and only if they are non-correlated.

Proof : Only the sufficiency ("if" part) of the claim needs to be proved. Let Z = (X^T, Y^T)^T, where X ∈ R^p and Y ∈ R^q, Z is a normal vector in R^{p+q}, and C(X, Y) = 0 (the covariance matrix of X and Y). To prove that X and Y are independent it suffices to verify (cf. Exercise 2.1) that the characteristic function φ_Z(u), u = (a^T, b^T)^T with a ∈ R^p and b ∈ R^q, can be decomposed as

φ_Z(u) = φ_X(a) φ_Y(b).

Indeed, we have

E(Z) = \begin{pmatrix} E(X) \\ E(Y) \end{pmatrix},   V(Z) = \begin{pmatrix} V(X) & C(X,Y) \\ C(Y,X) & V(Y) \end{pmatrix} = \begin{pmatrix} V(X) & 0 \\ 0 & V(Y) \end{pmatrix},

where V(X) ∈ R^{p×p} and V(Y) ∈ R^{q×q} are the covariance matrices of X and of Y, respectively. Therefore, the characteristic function φ_Z(u) of Z is given by

φ_Z(u) = φ_Z(a, b) = \exp\left[ i(a^T E(X) + b^T E(Y)) − \frac{1}{2}(a^T, b^T) V(Z) \begin{pmatrix} a \\ b \end{pmatrix} \right]
       = \exp\left[ i a^T E(X) − \frac{1}{2} a^T V(X) a \right] \exp\left[ i b^T E(Y) − \frac{1}{2} b^T V(Y) b \right] = φ_X(a) φ_Y(b)

for any u = (a^T, b^T)^T.

2.3.6 Geometry of the multivariate normal distribution

Let Σ ≻ 0. Note that the density of N_p(µ, Σ) is constant on the surfaces

E_C = {x : (x − µ)^T Σ^{−1} (x − µ) = C^2}.


We call these level sets "contours" of the distribution. In the case of interest, the E_C are ellipsoids, which we refer to as concentration ellipsoids.

[Figure: concentration ellipsoids of X = (ξ_1, ξ_2) and of Y = (η_1, η_2) = Σ^{−1/2} X, for Σ = \begin{pmatrix} 1 & 3/4 \\ 3/4 & 1 \end{pmatrix} (ρ = 0.75).]

2.4 Distributions derived from the normal

2.4.1 Pearson’s χ2 distribution

This is the distribution of the sum

Y = η_1^2 + ... + η_p^2,

where η_1, ..., η_p are i.i.d. N(0, 1) random variables. We denote it Y ∼ χ_p^2 and say that "Y follows the chi-square distribution with p degrees of freedom (d.f.)." The density of the χ_p^2 distribution is given by

f_{χ_p^2}(y) = C(p) y^{p/2−1} e^{−y/2} I{0 < y < ∞}, (2.6)

where

C(p) = (2^{p/2} Γ(p/2))^{−1},

and Γ(·) is the gamma function:

Γ(x) = \int_0^∞ u^{x−1} e^{−u} du,   x > 0.


We have E(Y) = p and Var(Y) = 2p if Y ∼ χ_p^2.

[Figure: densities of the χ_p^2 distribution for p = 1, 2, 3, 6.]

Exercise 2.4

Obtain the expression (2.6) for the density of χ2p.
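A quick simulation check of the χ_p^2 facts above (an illustration, assuming NumPy and SciPy): build Y = η_1^2 + ... + η_p^2 from i.i.d. N(0, 1) draws, compare the empirical mean and variance with p and 2p, and compare the empirical c.d.f. at a point with scipy.stats.chi2.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(2)
    p = 6
    eta = rng.standard_normal((100_000, p))         # i.i.d. N(0, 1)
    Y = (eta ** 2).sum(axis=1)                      # Y ~ chi^2_p

    print(Y.mean(), p)                              # E(Y) = p
    print(Y.var(), 2 * p)                           # Var(Y) = 2p
    print((Y <= 5.0).mean(), chi2(df=p).cdf(5.0))   # empirical vs exact c.d.f.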

2.4.2 Fisher-Snedecor distribution

Let U ∼ χ_p^2 and V ∼ χ_q^2 be two independent r.v. The Fisher distribution with degrees of freedom p and q is the distribution of the random variable

Y = \frac{U/p}{V/q}.

We write Y ∼ F_{p,q}. The density of F_{p,q} is given by

f_{F_{p,q}}(y) = C(p, q) \frac{y^{p/2−1}}{(q + py)^{(p+q)/2}} I{0 < y < ∞}, (2.7)

where

C(p, q) = \frac{p^{p/2} q^{q/2}}{B(p/2, q/2)},   with   B(p, q) = \frac{Γ(p)Γ(q)}{Γ(p + q)}.


In the limit q → ∞ the distribution of pY approaches the χ_p^2 distribution (the denominator V/q tends to 1).

[Figure: densities of the Fisher distributions F(10,4), F(10,10) and F(10,100).]

Exercise 2.5

Verify the expression (2.7) for the density of the Fisher distribution.
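As a sanity check of the construction Y = (U/p)/(V/q) (an illustration, assuming NumPy and SciPy), one can simulate independent χ_p^2 and χ_q^2 variables and compare the empirical distribution of the ratio with scipy.stats.f:

    import numpy as np
    from scipy.stats import f

    rng = np.random.default_rng(3)
    p, q = 10, 4
    U = rng.chisquare(p, size=100_000)
    V = rng.chisquare(q, size=100_000)
    Y = (U / p) / (V / q)                                # Y ~ F_{p,q}

    for y in (0.5, 1.0, 2.0):
        print((Y <= y).mean(), f(dfn=p, dfd=q).cdf(y))   # empirical vs exact c.d.f.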

2.4.3 Student (W. Gosset) t distribution

Let η ∼ N(0, 1) and ξ ∼ χ_q^2 be independent. The Student distribution with q degrees of freedom is that of the r.v.

Y = \frac{η}{\sqrt{ξ/q}};

we write Y ∼ t_q. The density of t_q is

f_{t_q}(y) = C(q)(1 + y^2/q)^{−(q+1)/2},   y ∈ R, (2.8)

where

C(q) = \frac{1}{\sqrt{q}\, B(1/2, q/2)}.

Note that t_1 is the Cauchy distribution, and t_q "approaches" N(0, 1) when q → ∞; the t_q distribution is symmetric. The tails of t_q are heavier than normal tails.

Exercise 2.6


Verify the expression (2.8) for the Student distribution density.
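To see concretely that t_q tails are heavier than normal tails, the sketch below (an illustration, assuming SciPy) compares P(|Y| > c) for Y ∼ t_4 and for Y ∼ N(0, 1) at a few thresholds:

    from scipy.stats import norm, t

    for c in (2.0, 3.0, 4.0):
        tail_t = 2 * t(df=4).sf(c)        # P(|Y| > c) for the Student t_4 law
        tail_n = 2 * norm.sf(c)           # P(|Y| > c) for N(0, 1)
        print(c, tail_t, tail_n)          # the t_4 tail is larger at every c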

[Figure: densities of N(0,1) and of the Student t_4 distribution.]

2.4.4 Cochran’s theorem

Theorem 2.3 Let X ∼ N_p(0, I) and let A_1, ..., A_J, J < p, be p × p matrices such that

(1) A_j^2 = A_j,

(2) A_j is symmetric, Rank(A_j) = n_j,

(3) A_j A_k = 0 when j ≠ k and \sum_{j=1}^J n_j ≤ p.^3

Then,

(i) the vectors A_j X are independent, with normal distributions N_p(0, A_j), j = 1, ..., J, respectively;

(ii) the random variables |A_j X|^2, j = 1, ..., J, are independent, with χ_{n_j}^2 distributions.

Proof :

(i) Observe that E(A_j X) = 0 and

V(A_j X) = A_j V(X) A_j^T = A_j A_j^T = A_j^2 = A_j.

The joint distribution of A_k X and A_j X is clearly normal. However,

C(A_k X, A_j X) = E(A_k X X^T A_j^T) = A_k V(X) A_j^T = A_k A_j^T = A_k A_j = 0

for j ≠ k. By the property (N6) of the normal distribution, this implies that A_k X and A_j X are independent for all k ≠ j.

^3 Some presentations of this result in the statistical literature also assume that A_1 + ... + A_J = I.


(ii) Since A_j is a projector, there exists an orthogonal matrix Γ such that

A_j = Γ Λ Γ^T,   Λ = \begin{pmatrix} I_j & 0 \\ 0 & 0 \end{pmatrix},

where Λ is the diagonal matrix of eigenvalues of A_j. Because the rank of A_j is equal to n_j, we have Rank(I_j) = n_j, and so

|A_j X|^2 = X^T A_j^T A_j X = X^T A_j X = X^T Γ Λ Γ^T X = Y^T Λ Y = \sum_{i=1}^{n_j} η_i^2,

where Y = (η_1, ..., η_p)^T is a normal vector, Y = Γ^T X ∼ N_p(0, I) (due to the property (N4) of the normal distribution). We conclude that |A_j X|^2 ∼ χ_{n_j}^2. Since independence is preserved by measurable transformations, |A_j X|^2 and |A_k X|^2 are independent for j ≠ k.
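A minimal numerical illustration of Cochran's theorem (assuming NumPy and SciPy): take A_1 = (1/p) 1 1^T, the projector on the constant vector, and A_2 = I − A_1. Then |A_1 X|^2 should behave like χ_1^2, |A_2 X|^2 like χ_{p−1}^2, and the two should be empirically uncorrelated. This is exactly the sample-mean / sample-variance decomposition; the sizes below are arbitrary.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(4)
    p, nsim = 5, 100_000
    ones = np.ones((p, 1))
    A1 = ones @ ones.T / p          # rank-1 projector, A1 @ A1 = A1
    A2 = np.eye(p) - A1             # rank p-1 projector, A1 @ A2 = 0

    X = rng.standard_normal((nsim, p))          # rows ~ N_p(0, I)
    Q1 = ((X @ A1) ** 2).sum(axis=1)            # |A1 X|^2  (A1 symmetric)
    Q2 = ((X @ A2) ** 2).sum(axis=1)            # |A2 X|^2

    print(Q1.mean(), 1, Q2.mean(), p - 1)       # means match the degrees of freedom
    print(np.corrcoef(Q1, Q2)[0, 1])            # ~ 0 (independence)
    print((Q2 <= 3.0).mean(), chi2(df=p - 1).cdf(3.0))   # empirical vs chi^2_{p-1} c.d.f.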


2.5 Exercises

Exercise 2.7

Suppose that the joint distribution of X and Y is given by

F(x, y) = 1 − e^{−2x} − e^{−y} + e^{−(2x+y)} if x > 0, y > 0, and F(x, y) = 0 otherwise.

1. Find the marginal distribution of X and Y .

2. Find the joint density of X and Y .

3. Compute the marginal densities of X and Y .

4. Are X and Y independent?

Exercise 2.8

Let the joint density of X and Y be given by:

f(x, y) = e−(x+y), 0 ≤ x <∞, 0 ≤ y <∞

Compute P (X < a) and P (X < Y ).

Exercise 2.9

Two points are chosen at random on opposite sides of the middle point of an interval of length L; in other words, the two points X and Y are independent random variables such that X is uniformly distributed over [0, L/2[ and Y is uniformly distributed over [L/2, L]. Find the probability that the distance |X − Y| is larger than L/3.

Exercise 2.10

Let U_1 and U_2 be 2 independent r.v., both uniformly distributed on [0, a]. Let V = min{U_1, U_2} and Z = max{U_1, U_2}. Show that the joint c.d.f. F of V and Z is given by

F(s, t) = P(V ≤ s, Z ≤ t) = \frac{t^2 − (t − s)^2}{a^2}   for 0 ≤ s ≤ t ≤ a.

Hint: note that V ≤ s and Z ≤ t iff both U_1 ≤ t and U_2 ≤ t, but not both s < U_1 ≤ t and s < U_2 ≤ t.

Exercise 2.11

Given 2 independent r.v. X_1 and X_2 with exponential distributions with parameters λ_1 and λ_2, find the distribution of Z = X_1/X_2. Compute P(X_1 < X_2).

Exercise 2.12

Show that

1. Cov(X + Y,Z) = Cov(X,Z) + Cov(Y,Z),


2. Cov(\sum_{i=1}^n X_i, \sum_{j=1}^n Y_j) = \sum_{i=1}^n \sum_{j=1}^n Cov(X_i, Y_j).

3. Prove that if Var(X_i) = σ^2 and Cov(X_i, X_j) = γ for all 1 ≤ i ≠ j ≤ n, then

Var(X_1 + ... + X_n) = nσ^2 + n(n − 1)γ.

4. Let ξ_1 and ξ_2 be i.i.d. random variables such that 0 < Var(ξ_1) < ∞. Show that the r.v. η_1 = ξ_1 − ξ_2 and η_2 = ξ_1 + ξ_2 are uncorrelated.

Exercise 2.13

Let X be the number of "1" and Y the number of "2" in n throws of an honest (balanced) die. Compute Cov(X, Y).

Before computing the quantity, can you predict whether Cov(X, Y) ≥ 0 or Cov(X, Y) < 0?

Hint: use the relationship 2) of Exercise 2.12.

Exercise 2.14

1o. Let ξ and η be r.v. with E(ξ) = E(η) = 0, Var(ξ) = Var(η) = 1 and correlation coefficient ρ. Show that

E(max(ξ^2, η^2)) ≤ 1 + \sqrt{1 − ρ^2}.

Hint: observe that

max(ξ^2, η^2) = \frac{|ξ^2 + η^2| + |ξ^2 − η^2|}{2}.

2o. Let ρ be the correlation of η and ξ. Show the inequality

P(|ξ − E(ξ)| ≥ ε \sqrt{Var(ξ)}  or  |η − E(η)| ≥ ε \sqrt{Var(η)}) ≤ \frac{1 + \sqrt{1 − ρ^2}}{ε^2}.

Exercise 2.15

Show that if φ is the characteristic function of some r.v., then φ*, |φ|^2 and Re(φ) are also characteristic functions (of certain r.v.).

Hint: for Re(φ) consider 2 independent random variables X and Y, where Y takes values −1 and 1 with probabilities 1/2 and X has characteristic function φ; then compute the characteristic function of XY.

Exercise 2.16

Let Q be a q × p matrix with q > p of rank p.
1o. Show that P = Q(Q^T Q)^{−1} Q^T is a projector.
2o. On what subspace L does P project?

Exercise 2.17

Let (X,Y ) be a random vector with density

f(x, y) = C \exp(−x^2 + xy − y^2/2).

1o. Show that (X, Y) is a normal vector. Compute the expectation, the covariance matrix and the characteristic function of (X, Y). Compute the correlation coefficient ρ_{XY} of X and Y.
2o. What is the distribution of X? Of Y? Of 2X − Y?
3o. Show that X and Y − X are independent random variables with the same distribution.


Exercise 2.18

Let X ∼ N(0, 1) and let Z be a r.v. taking values −1 and 1 with probability 1/2. We suppose that X and Z are independent, and we set Y = ZX.
1o. Show that the distribution of Y is N(0, 1).
2o. Compute the covariance and the correlation of X and Y.
3o. Compute P(X + Y = 0).
4o. Is (X, Y) a normal vector?

Exercise 2.19

Let ξ and η be independent r.v. with uniform distribution U [0, 1]. Prove that

X = \sqrt{−2 \ln ξ} \cos(2πη),   Y = \sqrt{−2 \ln ξ} \sin(2πη)

satisfy Z = (X, Y)^T ∼ N_2(0, I).

Hint: let (X,Y ) ∼ N2(0, I). Change to the polar coordinates.
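Exercise 2.19 is the Box–Muller transform; the following sketch (an illustration, not the requested proof; it assumes NumPy and SciPy) implements it and checks that the resulting coordinates look standard normal and uncorrelated.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    xi = rng.uniform(size=100_000)
    eta = rng.uniform(size=100_000)

    R = np.sqrt(-2.0 * np.log(xi))                # radius
    X = R * np.cos(2 * np.pi * eta)
    Y = R * np.sin(2 * np.pi * eta)

    print(X.mean(), X.var(), Y.mean(), Y.var())   # ~ 0, 1, 0, 1
    print(np.corrcoef(X, Y)[0, 1])                # ~ 0
    print((X <= 1.0).mean(), norm.cdf(1.0))       # empirical vs N(0,1) c.d.f.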

Exercise 2.20

Let Z = (Z_1, Z_2, Z_3)^T be a normal vector with density f,

f(z_1, z_2, z_3) = \frac{1}{4(2π)^{3/2}} \exp\left(−\frac{6z_1^2 + 6z_2^2 + 8z_3^2 + 4z_1 z_2}{32}\right).

1o. What are the distributions of the components of Z?

Let X and Y be random vectors defined by

X = \begin{pmatrix} 2 & 2 & 2 \\ 0 & 2 & 5 \\ 0 & 4 & 10 \\ 1 & 2 & 4 \end{pmatrix} Z   and   Y = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \end{pmatrix} Z.

2o. Is the vector (X, Y) of dimension 6 Gaussian? Does the vector X have a density? Does the vector Y have a density?
3o. Are the vectors X and Y independent?

Exercise 2.21

Let (X, Y, Z)^T be a normal vector with zero mean and covariance matrix

Σ = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}.

1o. We set U = −X + Y + Z, V = X − Y + Z, W = X + Y − Z. What is the distribution of the vector (U, V, W)^T?
2o. What is the distribution of the random variable T = U^2 + V^2 + W^2?

Exercise 2.22


Let the vector (X, Y) be normal N_2(µ, Σ) with mean and covariance matrix

µ = \begin{pmatrix} 0 \\ 2 \end{pmatrix},   Σ = \begin{pmatrix} 4 & 1 \\ 1 & 8 \end{pmatrix}.

1o. What is the distribution of X + 4Y?
2o. What is the joint distribution of Y − 2X and X + 4Y?

Exercise 2.23

Let X be a zero mean normal vector of dimension n with covariance matrix Σ ≻ 0. What is the distribution of the r.v. X^T Σ^{−1} X?

Exercise 2.24

We model the height H of a male person in population P by the Gaussian distribution N(172, 49) (units: cm). In this model:
1o. What is the probability for a man to be of height ≤ 160 cm?
2o. We assume that there are approximately 15 million adult men in P; provide an estimate of the number of men of height ≥ 200 cm.
3o. What is the probability for 10 men randomly drawn from P to all have height in the interval [168, 188] cm?

The height H' of females of P is modeled by the Gaussian distribution N(162, 49) (units: cm).
4o. What is the probability for a male chosen at random to be taller than a randomly chosen female?

We model the heights (H, H') of a man and a woman in a couple by a normal vector, the correlation coefficient ρ between the height of the woman and that of the man being 0.4 (respectively, −0.4).
5o. Compute the probability p (respectively, p') that the height of the man in a couple is larger than that of the woman (before making the computation, what would be your guess: in which order should one rank p and p'?).


Lecture 3

Conditional expectation

3.1 Conditioning (discrete case)

Let A and B be two random events (A, B ∈ F) such that P(B) ≠ 0. The conditional probability P(A|B) of A given B is defined as

P(A|B) = \frac{P(AB)}{P(B)}.

Let X and Y be two discrete r.v. According to this definition,

P(Y = k|X = m) = \frac{P(Y = k, X = m)}{P(X = m)},

for all k, m such that P(X = m) > 0. If X and Y are independent,

P(Y = k|X = m) = \frac{P(Y = k) P(X = m)}{P(X = m)} = P(Y = k). (3.1)

We suppose that P(X = m) > 0 for all admissible m. Then

\sum_k P(Y = k|X = m) = \frac{\sum_k P(Y = k, X = m)}{P(X = m)} = 1.

As a result, the probabilities {P(Y = k|X = m)}_k define a discrete probability distribution. The conditional expectation of Y given that X = m is the numerical function of m given by

E(Y|X = m) = \sum_k k P(Y = k|X = m).

The conditional variance is defined by

Var(Y|X = m) = E(Y^2|X = m) − [E(Y|X = m)]^2.

In a similar way we define conditional moments, conditional quantiles and other characteristics of the conditional distribution.

Definition 3.1 The conditional expectation E(Y|X) of Y given X, where X and Y are discrete r.v. such that E(|Y|) < ∞, is a discrete random variable which only depends on X and takes the values

{E(Y|X = m)}_m with probabilities P(X = m).


It is important not to confuse the random variable E(Y|X) with the (deterministic) numerical function E(Y|X = m) (a function of m).

Note that the condition E(|Y|) < ∞ guarantees the existence of the conditional expectation E(Y|X).

3.1.1 Properties of the conditional expectation (discrete case)

1o. (Linearity) Let E(|Y1|) <∞, E(|Y2|) <∞, then for all a ∈ R, b ∈ R,

E(aY1 + bY2|X) = aE(Y1|X) + bE(Y2|X) (a.s.)

2o. If X and Y are independent and E(|Y |) <∞, then E(Y |X) = E(Y ) (a.s.) (cf. (3.1)).

3o. E(h(X)|X) = h(X) (a.s.) for all Borel functions h.

4o. (Substitution theorem.) If E(|h(Y,X)|) <∞ then

E(h(Y,X)|X = m) = E(h(Y,m)|X = m).

Proof : Let Y' = h(Y, X); this is a discrete r.v. taking values h(k, m). Thus, the conditional distribution of Y' given X is given by the probabilities

P(Y' = a|X = m) = P(h(Y, X) = a|X = m) = \frac{P(h(Y, X) = a, X = m)}{P(X = m)} = \frac{P(h(Y, m) = a, X = m)}{P(X = m)} = P(h(Y, m) = a|X = m).

Therefore, for all m,

E(Y'|X = m) = \sum_a a P(Y' = a|X = m) = \sum_a a P(h(Y, m) = a|X = m) = E(h(Y, m)|X = m).

As a result, if h(x, y) = h_1(y) h_2(x), we have

E(h_1(Y) h_2(X)|X = m) = h_2(m) E(h_1(Y)|X = m),

and

E(h_1(Y) h_2(X)|X) = h_2(X) E(h_1(Y)|X) (a.s.).

5o. (Double expectation theorem) Let E(|Y|) < ∞; then E(E(Y|X)) = E(Y).

Proof : We write

E(E(Y|X)) = \sum_m E(Y|X = m) P(X = m) = \sum_m \sum_k k P(Y = k|X = m) P(X = m)
          = \sum_{m,k} k P(Y = k, X = m) = \sum_k k \sum_m P(Y = k, X = m) = \sum_k k P(Y = k) = E(Y).


Example 3.1 Let ξ and η be two independent Bernoulli r.v., taking values 1 and 0 with probabilities, respectively, p and 1 − p. What is the conditional expectation E(ξ + η|η)? E(η|ξ + η)?

Using the properties 2o and 3o we obtain

E(ξ + η|η) = E(ξ) + η = p + η.

Furthermore, by the definition, for k = 0, 1, 2,

E(η|ξ + η = k) = 1 · P(η = 1|ξ + η = k) = 0 for k = 0,   1/2 for k = 1,   1 for k = 2.

Thus, E(η|ξ + η) = (ξ + η)/2 (a.s.).
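Example 3.1 can be checked by simulation (an illustration, assuming NumPy): group the draws by the value of ξ + η and average η within each group; the averages should be close to 0, 1/2 and 1.

    import numpy as np

    rng = np.random.default_rng(6)
    p = 0.3
    xi = rng.binomial(1, p, size=200_000)
    eta = rng.binomial(1, p, size=200_000)
    s = xi + eta

    for k in (0, 1, 2):
        print(k, eta[s == k].mean())   # ~ k/2, i.e. E(eta | xi + eta = k) = k/2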

3.2 Conditioning as a projection

3.2.1 The best prediction

If the r.v. X and Y are independent, the knowledge of X does not supply any information about Y. However, when X and Y are dependent and we know the realization of X, it does provide some information about Y. We define the problem of the best prediction of Y given X as follows:

Let Y ∈ L_2(P) and X be r.v. on (Ω, F, P). Find a Borel measurable g(·) such that

‖Y − g(X)‖ = \min_{h(·)} ‖Y − h(X)‖, (3.2)

where the minimum is taken over all Borel measurable functions h(·) and ‖·‖ is the norm of L_2(P).^1 The random variable Ŷ = g(X) is referred to as the best prediction of Y given X.

We use the following (statistical) vocabulary: X is the explanatory variable or predictor, Y is the explained variable. We can write (3.2) in the equivalent form

E((Y − g(X))^2) = \min_{h(·)} E((Y − h(X))^2) = \min_{h(X) ∈ L_2^X(P)} E((Y − h(X))^2),

where the linear subspace L_2^X(P) of L_2(P) is defined according to

L_2^X(P) = {ξ = h(X) : E(h^2(X)) < ∞}.

Indeed, it suffices to consider only h(X) ∈ L_2(P), because the solution g(·) to (3.2) is "automatically" in L_2(P).

Therefore, (3.2) is the problem of the orthogonal projection of Y on L_2^X(P). By the properties of the orthogonal projection, g(X) ∈ L_2^X(P) is the solution to (3.2) if and only if

〈Y − g(X), h(X)〉 = 0 for all h(X) ∈ L_2^X(P), (3.3)

^1 Recall (cf. Section 2.1.3) that we have defined L_2(P) as the space of (equivalence classes of) square integrable r.v., equipped with the scalar product 〈X, Y〉 = E(XY). The corresponding norm is ‖X‖ = E(X^2)^{1/2}.


[Figure: Y and its orthogonal projection g(X) on the subspace L_2^X(P).]

and the orthogonal projection g(X) is unique (a.s.). Using expectations instead, (3.3) can be equivalently rewritten as

E((Y − g(X)) h(X)) = 0 for all h(X) ∈ L_2^X(P),

or

E(Y h(X)) = E(g(X) h(X)) for all h(X) ∈ L_2^X(P). (3.4)

In particular,

E(Y I{X ∈ A}) = E(g(X) I{X ∈ A}) for all A ∈ B (all Borel measurable sets). (3.5)

Remark. Note that (3.5) also implies (3.4), making (3.4) and (3.5) equivalent – recall that all functions in L_2 can be approximated by sums of step functions \sum_i c_i I{x ∈ A_i} (piecewise-constant functions).

Let us show that in the discrete case the only r.v. g(X) which verifies (3.4) (and (3.5)), and thus solves the problem of the best prediction (3.2), is the conditional expectation of Y given X.

Proposition 3.1 Let X and Y be discrete r.v. with Y ∈ L_2(P). Then the best prediction Ŷ of Y given X is unique (a.s.) and given by

Ŷ = g(X) = E(Y|X).

Proof : Let h(X) ∈ L_2(P). Then

E(E(Y|X) h(X)) = \sum_k E(Y|X = k) h(k) P(X = k)
               = \sum_k \left[\sum_m m P(Y = m|X = k)\right] h(k) P(X = k)
               = \sum_{k,m} m h(k) P(Y = m, X = k) = E(Y h(X)).

Thus (3.4) is verified with g(X) = E(Y|X). Due to the (a.s.) uniqueness of the orthogonal projection, the best prediction is also unique (a.s.).


3.3 Conditional expectation and probability. General case

We extend the definition of the conditional expectation E(Y|X) to the general case of 2 r.v. X and Y. We use the following definition:

Definition 3.2 Let Y and X be r.v. such that E(|Y|) < ∞. The conditional expectation g(X) = E(Y|X) is an r.v., measurable with respect to X, which verifies

E(Y I{X ∈ A}) = E(g(X) I{X ∈ A}) (3.6)

for all Borel sets A.

Remark. We replaced the condition Y ∈ L_2(P) (≡ E(Y^2) < ∞) with the weaker condition E(|Y|) < ∞. One can show (cf. a course on Probability Theory) that the function g(X) which verifies (3.6) exists and is unique (a.s.) (a consequence of the Radon-Nikodym theorem).

If Y ∈ L_2(P), the existence and the a.s. uniqueness of the function g(X) verifying (3.6) is, as we have already seen, a consequence of the properties of the orthogonal projection in L_2.

Theorem 3.1 (Best prediction) Let X and Y be 2 r.v., Y ∈ L_2(P). Then the best prediction of Y given X is unique (a.s.) and coincides with

Ŷ = g(X) = E(Y|X).

3.3.1 Conditional probability

Let us consider the following special case: we replace Y with Y' = I{Y ∈ B}. Note that the r.v. Y' is bounded (|Y'| ≤ 1), and thus E(|Y'|^2) < ∞. We can define the conditional expectation g(X) = E(Y'|X) by the relationship (cf. (3.6))

E(I{Y ∈ B} I{X ∈ A}) = E(g(X) I{X ∈ A}) for all A, B ∈ B.

Definition 3.3 The conditional probability P (Y ∈ B|X) is a random variable which verifies

P(Y ∈ B, X ∈ A) = E[P(Y ∈ B|X) I{X ∈ A}] for all A ∈ B.

Same as in the discrete case, we also define a numeric function.

Definition 3.4 The function of 2 variables P(Y ∈ B|X = x), B ∈ B (a Borel set) and x ∈ R, is referred to as the conditional probability of Y given X = x if

(i) for every fixed B, P(Y ∈ B|X = x) verifies

P(Y ∈ B, X ∈ A) = \int_A P(Y ∈ B|X = x) dF_X(x); (3.7)

(ii) for every fixed x, P(Y ∈ B|X = x) defines a probability distribution as a function of B.


Remark. We already know that for each B ∈ B there is a function

g_B(x) = P(Y ∈ B|X = x)

such that (i) is valid. However, this function is defined only up to its values on a set N_B of zero measure. It is important that, in general, this set depends on B. Therefore, it may happen that N = \bigcup_{B ∈ B} N_B is of positive measure. This could do serious damage – for example, the additivity of the probability measure could be destroyed. Fortunately, in our case (real r.v. and Borel σ-algebra) there is a result due to Kolmogorov which says that one can always choose a version of the function g_B(·) such that P(Y ∈ B|X = x) is a probability measure for every fixed x ∈ R. We will suppose in the sequel that this version is chosen in every particular case.

We can also define a real-valued function of x,

E(Y|X = x) = \int y P(dy|X = x),

such that

E(Y I{X ∈ A}) = \int_A E(Y|X = x) dF_X(x), for all A ∈ B.

3.3.2 Properties of conditional expectation, general case

1o. (Linearity.) Let E(|Y1|) <∞, E(|Y2|) <∞, then

E(aY1 + bY2|X) = aE(Y1|X) + bE(Y2|X) (a.s.)

2o. If X and Y are independent and E(|Y|) < ∞, then E(Y|X) = E(Y) (a.s.). In view of the definition (3.6) it suffices to prove that

E(Y I{X ∈ A}) = E(E(Y) I{X ∈ A}), for all A ∈ B. (3.8)

But

E(E(Y) I{X ∈ A}) = E(Y) P(X ∈ A),

and so (3.8) is a consequence of the independence of X and Y.

3o. E(h(X)|X) = h(X) (a.s.) for all Borel functions h.

4o. (Substitution theorem.) If E(|h(Y,X)|) <∞, then

E(h(Y,X)|X = x) = E(h(Y, x)|X = x).

5o. (Double expectation theorem)

E(E(Y |X)) = E(Y ).

Proof : Let us set A = R in the definition (3.6); then I{X ∈ A} = 1, and we obtain the desired result.


3.3.3 Conditioning: continuous case

We suppose that there exists a joint density f_{X,Y}(x, y) of the couple (X, Y). Let us set

f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} if f_X(x) > 0, and f_{Y|X}(y|x) = 0 if f_X(x) = 0.

Proposition 3.2 If the joint density of (X, Y) exists, then

P(Y ∈ B|X = x) = \int_B f_{Y|X}(y|x) dy for all B ∈ B.

Proof : It suffices to show (cf. (3.7)) that for all A, B ∈ B

P(Y ∈ B, X ∈ A) = \int_A \left[\int_B f_{Y|X}(y|x) dy\right] dF_X(x).

Since X has a density, dF_X(x) = f_X(x) dx. By the Fubini theorem,

\int_A \int_B f_{Y|X}(y|x) dy\, f_X(x) dx = \int_B \int_A f_{Y|X}(y|x) f_X(x) dx\, dy.

But f_{Y|X}(y|x) f_X(x) = f_{X,Y}(x, y) a.e. (if f_X(x) = 0, then f_{X,Y}(x, y) = 0 a fortiori). Therefore, the last integral is equal to

\int_B \int_A f_{X,Y}(x, y) dx\, dy = P(X ∈ A, Y ∈ B).

The result of Proposition 3.2 provides a direct way to compute conditional expectation:

Corollary 3.1
1. E(Y|X = x) = \int y f_{Y|X}(y|x) dy,
2. \int_{−∞}^{∞} f_{Y|X}(y|x) dy = 1,
3. Y ⊥⊥ X ⇒ f_{Y|X}(y|x) = f_Y(y).

We can define, same as in the discrete case, the conditional variance:

Var(Y|X = x) = E(Y^2|X = x) − (E(Y|X = x))^2 = \int_{−∞}^{∞} y^2 f_{Y|X}(y|x) dy − \left[\int_{−∞}^{∞} y f_{Y|X}(y|x) dy\right]^2.

Example 3.2 Let X and Y be i.i.d. r.v. with exponential distribution. Let us compute the conditional density f(x|z) = f_{X|X+Y}(x|z) and E(X|X + Y).

Let f(u) = λe^{−λu} I{u > 0} be the density of X and Y. If z < x,

P(X + Y < z, X < x) = P(X + Y < z, X < z) = \int_0^z f(u) \int_0^{z−u} f(v) dv\, du,


and if z ≥ x,

P(X + Y < z, X < x) = \int_0^x f(u) \int_0^{z−u} f(v) dv\, du.

As a result, for z ≥ x the joint density of the couple (X + Y, X) is (cf. (2.1))

f(z, x) = \frac{∂^2 P(X + Y < z, X < x)}{∂x ∂z} = f(z − x) f(x) = λ^2 e^{−λz}.

Besides, the density of X + Y is the convolution of two exponential densities, i.e.

f_{X+Y}(z) = λ^2 z e^{−λz}.

We obtain

f_{X|X+Y}(x|z) = \frac{f(z, x)}{f_{X+Y}(z)} = \frac{1}{z}

for 0 ≤ x ≤ z, and f_{X|X+Y}(x|z) = 0 for x > z. In other words, the conditional density is the density of the uniform distribution on [0, z]. Thus we conclude that E(X|X + Y) = (X + Y)/2 (a.s.).

This example is related to the model of requests arriving at a service system. Let X be the instant when the 1st request arrives (t = 0 being the arrival of request zero), and Y the interval of time between the arrivals of the 1st and the 2nd requests. We are looking for the probability density of the instant of the 1st arrival, given that the second request arrived at time z.
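Example 3.2 can also be illustrated numerically (assuming NumPy): among simulated pairs for which X + Y falls close to a fixed value z, the values of X should look uniform on [0, z] and average about z/2. The tolerance eps and the value of z below are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(7)
    lam, z, eps = 2.0, 1.5, 0.01
    X = rng.exponential(1 / lam, size=2_000_000)
    Y = rng.exponential(1 / lam, size=2_000_000)

    sel = np.abs(X + Y - z) < eps          # condition on X + Y being near z
    Xc = X[sel]
    print(Xc.mean(), z / 2)                # ~ z/2, since E(X | X + Y = z) = z/2
    print(Xc.min(), Xc.max())              # values spread roughly over [0, z]
    print(np.histogram(Xc, bins=5, range=(0, z))[0])   # roughly equal counts (uniform)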

3.4 Regression

Definition 3.5 Let X and Y be 2 r.v. such that E(|Y |) <∞. The function g : R→ R,

g(x) = E(Y |X = x)

is called regression of Y on X (of Y in X).

We also refer to this regression as simple (the word means that X and Y are univariate). If X or Y is multi-dimensional, we refer to the regression as multiple.

Geometric interpretation. Let us recall the construction of paragraph 3.2. Suppose that Y is an element of the Hilbert space L_2(P) (i.e. E(Y^2) < ∞) and let, as before, L_2^X(P) be the linear subspace of L_2(P) of all functions h(X) which are measurable with respect to X and such


that E(h^2(X)) < ∞. Then g(X) is the orthogonal projection of Y on L_2^X(P).

[Figure: E(Y|X) as the orthogonal projection of Y on L_2^X(P).]

The r.v. ξ = Y − g(X) is referred to as stochastic error (or residual). We have

Y = g(X) + ξ. (3.9)

By definition of the conditional expectation, E(ξ|X) = 0 (a.s.), and so E(ξ) = 0.

Example 3.3 Let the joint density of X and Y be

f(x, y) = (x + y) I{0 < x < 1, 0 < y < 1}.

What is the regression function g(x) = E(Y|X = x)?

We use Corollary 3.1:

f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)},   where   f_X(x) = \int_0^1 f(x, y) dy = (x + 1/2) I{0 < x < 1}.

We conclude that

f_{Y|X}(y|x) = \frac{x + y}{x + 1/2} I{0 < x < 1, 0 < y < 1},

and

g(x) = E(Y|X = x) = \int_0^1 y f_{Y|X}(y|x) dy = \int_0^1 \frac{y(x + y)}{x + 1/2} dy = \frac{\frac{1}{2}x + \frac{1}{3}}{x + \frac{1}{2}}

for 0 < x < 1. Observe that g(x) is a nonlinear function of x.
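The nonlinear regression function of Example 3.3 can be checked by simulation (an illustration, assuming NumPy): sample (X, Y) from the density f(x, y) = x + y on the unit square by rejection, then compare binned conditional means of Y with g(x) = (x/2 + 1/3)/(x + 1/2).

    import numpy as np

    rng = np.random.default_rng(8)
    n = 2_000_000
    x = rng.uniform(size=n)
    y = rng.uniform(size=n)
    u = rng.uniform(size=n)
    keep = u < (x + y) / 2.0          # rejection sampling: f(x, y) = x + y <= 2 on the square
    X, Y = x[keep], y[keep]

    for a in (0.1, 0.5, 0.9):
        sel = np.abs(X - a) < 0.02                    # bin around X = a
        g = (0.5 * a + 1.0 / 3.0) / (a + 0.5)         # regression function from Example 3.3
        print(a, Y[sel].mean(), g)                    # the two values are close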

3.4.1 Residual variance

The quadratic (mean square) error of approximation of Y by g(X) is the value

∆ = E((Y − g(X))^2) = E((Y − E(Y|X))^2) = E(ξ^2) = Var(ξ).

We call ∆ the residual variance. The residual variance is smaller than the variance of Y. Indeed, let h(X) = E(Y) = const. By the best prediction theorem,

∆ = E((Y − g(X))^2) ≤ E((Y − h(X))^2) = E((Y − E(Y))^2) = Var(Y).


Because E(Y) is an element of L_2^X(P), this means, geometrically, that a leg is smaller than the hypotenuse:

[Figure: Y, its projection E(Y|X) on L_2^X(P), and the projection E(Y) on the subspace L of constant r.v.]

Observe that the space L of "constant r.v." is also a linear subspace of L_2(P). Moreover, it is exactly the intersection of all the L_2^X(P) over all X. But we already know that E(Y) is the projection of Y on L: indeed, for any constant a,

E((Y − a)^2) ≥ E((Y − E(Y))^2)

(cf. Exercise 1.11). By the Pythagoras theorem,

‖Y − E(Y)‖^2 = ‖E(Y|X) − E(Y)‖^2 + ‖Y − E(Y|X)‖^2,

or

Var(Y) = E((Y − E(Y))^2) = E((E(Y|X) − E(Y))^2) + E((Y − E(Y|X))^2)
       = Var(E(Y|X)) + E(Var(Y|X))
       = "variance explained by X" + "residual variance"
       = Var(g(X)) + Var(ξ)
       = Var(g(X)) + ∆.

Definition 3.6 Let Var(Y) > 0. We call the correlation ratio of Y to X the nonnegative value η^2 = η^2_{Y|X} given by

η^2_{Y|X} = \frac{Var(g(X))}{Var(Y)} = \frac{E((E(Y) − E(Y|X))^2)}{Var(Y)}.

Note that, by the Pythagoras theorem,

η^2_{Y|X} = 1 − \frac{E((Y − g(X))^2)}{Var(Y)} = 1 − \frac{∆}{Var(Y)}.

Geometric interpretation. The correlation ratio η^2_{Y|X} is the squared cosine of the angle θ between Y − E(Y) and E(Y|X) − E(Y); thus 0 ≤ η^2_{Y|X} ≤ 1.


Remarks.

1. Generally, η^2_{X|Y} ≠ η^2_{Y|X} (absence of symmetry).

2. The values η^2 = 0 and η^2 = 1 are special: η^2 = 1 implies that E((Y − E(Y|X))^2) = 0, thus Y = g(X) (a.s.); in other words, Y is a function of X.

On the other hand, η^2 = 0 means that E((E(Y) − E(Y|X))^2) = 0, and E(Y|X) = E(Y) (a.s.), so the regression is constant.

It is useful to note that g(X) = const implies the orthogonality of X and Y (i.e. Cov(X, Y) = 0).

Proposition 3.3 Let E(X^2) < ∞, E(Y^2) < ∞ and σ^2_X > 0, σ^2_Y > 0. Then

η^2_{Y|X} ≥ ρ^2_{XY}.

Proof : By the definition of η^2_{Y|X}, it suffices to show that

E((E(Y) − E(Y|X))^2) Var(X) ≥ [E((X − E(X))(Y − E(Y)))]^2.

Yet, by the double expectation theorem,

E((X − E(X))(Y − E(Y))) = E((X − E(X)) E(Y − E(Y)|X)) = E((X − E(X))(E(Y|X) − E(Y))).

Now, by applying the Cauchy-Schwarz inequality, we arrive at

[E((X − E(X))(Y − E(Y)))]^2 ≤ E((X − E(X))^2) E((E(Y|X) − E(Y))^2) = Var(X) E((E(Y|X) − E(Y))^2). (3.10)

Remarks.

• η^2_{Y|X} = 0 implies that ρ_{XY} = 0.

• The residual variance can be expressed in terms of the correlation ratio:

∆ = (1 − η^2_{Y|X}) Var(Y). (3.11)

3.4.2 Linear regression

The particular case E(Y|X = x) = a + bx is called linear regression. Using (3.9), we can write

Y = a + bX + ξ,

where ξ is the residual, E(ξ|X) = 0 (a.s.) (⇒ E(ξ) = 0). Let ρ = ρ_{XY} be the correlation coefficient between X and Y, and let σ_X > 0, σ_Y > 0 be the standard deviations of X and Y. One can express the coefficients a and b of the linear regression in terms of ρ, σ_X and σ_Y. Indeed,

Y − E(Y) = b(X − E(X)) + ξ.


Multiplying this equation by X − E(X) and taking the expectation, we obtain

Cov(X, Y) = b Var(X) = b σ^2_X,

so that

b = \frac{Cov(X, Y)}{σ^2_X} = ρ_{XY} \frac{σ_Y}{σ_X}.

Then

Y = a + ρ_{XY} \frac{σ_Y}{σ_X} X + ξ.

On the other hand,

E(Y) = a + ρ_{XY} \frac{σ_Y}{σ_X} E(X),

and so

a = E(Y) − ρ_{XY} \frac{σ_Y}{σ_X} E(X).

Finally,

Y = E(Y) + ρ_{XY} \frac{σ_Y}{σ_X} (X − E(X)) + ξ. (3.12)

Proposition 3.4 If E(X^2) < ∞ and E(Y^2) < ∞, Var(X) = σ^2_X > 0, Var(Y) = σ^2_Y > 0, and the regression function g(x) = E(Y|X = x) is linear, then it may be written in the form

E(Y|X = x) = E(Y) + ρ_{XY} \frac{σ_Y}{σ_X} (x − E(X)). (3.13)

The residual variance is

∆ = (1 − ρ^2_{XY}) σ^2_Y, (3.14)

where ρ_{XY} is the correlation coefficient between X and Y.

Proof : The equality (3.13) is an immediate consequence of (3.12) along with the fact that E(ξ|X = x) = 0. Let us prove (3.14). We can write (3.12) in the form

ξ = (Y − E(Y)) − ρ_{XY} \frac{σ_Y}{σ_X} (X − E(X)).

Taking the square and the expectation on both sides, we come to

∆ = E(ξ^2) = E\left[(Y − E(Y))^2 − 2ρ_{XY} \frac{σ_Y}{σ_X} (X − E(X))(Y − E(Y)) + \left(ρ_{XY} \frac{σ_Y}{σ_X}\right)^2 (X − E(X))^2\right]
  = Var(Y) − 2ρ_{XY} \frac{σ_Y}{σ_X} Cov(X, Y) + ρ^2_{XY} \frac{σ^2_Y}{σ^2_X} Var(X) = (1 − ρ^2_{XY}) σ^2_Y.
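Formulas (3.13) and (3.14) can be illustrated with a simulated Gaussian couple, for which the regression is exactly linear (cf. Section 3.6); the sketch below assumes NumPy and compares the least-squares slope and the residual variance with ρ σ_Y/σ_X and (1 − ρ^2) σ_Y^2. The numerical values of ρ, σ_X, σ_Y are arbitrary.

    import numpy as np

    rng = np.random.default_rng(9)
    rho, sx, sy = 0.6, 2.0, 3.0
    cov = np.array([[sx**2, rho * sx * sy],
                    [rho * sx * sy, sy**2]])
    X, Y = rng.multivariate_normal([1.0, -1.0], cov, size=500_000).T

    b_hat = np.cov(X, Y)[0, 1] / X.var()        # empirical slope
    a_hat = Y.mean() - b_hat * X.mean()
    resid = Y - (a_hat + b_hat * X)

    print(b_hat, rho * sy / sx)                 # slope ~ rho * sigma_Y / sigma_X
    print(resid.var(), (1 - rho**2) * sy**2)    # residual variance ~ (1 - rho^2) sigma_Y^2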


Corollary 3.2 If the regression of Y on X is linear, then under the premise of Proposition 3.4 we have

η^2_{Y|X} = ρ^2_{XY}.

In other words, in the case of linear regression the correlation ratio coincides with the squared correlation coefficient between X and Y. (In particular, this implies that ρ_{XY} = 0 ⇔ η^2_{Y|X} = 0, and then η^2_{Y|X} = η^2_{X|Y} = 0.)

The converse is also true: if ρ^2_{XY} = η^2_{Y|X}, then the regression is linear.

Proof : Due to (3.11) one has ∆ = (1 − η^2_{Y|X}) Var(Y), but in the linear case, moreover, ∆ = (1 − ρ^2) Var(Y) due to (3.14). To show the converse, note that if equality is attained in the Cauchy-Schwarz inequality (3.10), then there exists α ≠ 0 such that

α(X − E(X)) = E(Y|X) − E(Y),

and thus

E(Y|X) = E(Y) + α(X − E(X)).

Remark. The fact that the regression of Y on X is linear does not, in general, imply that the regression of X on Y is linear too.

Exercise 3.1

Let X and Z be 2 independent r.v. with exponential distributions, X ∼ E(λ), Z ∼ E(1). Let Y = X + Z. Compute the regression function g(y) = E(X|Y = y).

3.5 Conditional expectation of a random vector

Let X = (ξ_1, ..., ξ_p)^T and Y = (η_1, ..., η_q)^T be two random vectors. Here we consider only the continuous case, i.e. we assume that the joint density f_{X,Y}(x, y) = f_{X,Y}(t_1, ..., t_p, s_1, ..., s_q) exists.

Same as in the case p = 2, the conditional density of Y given X is

f_{Y|X}(s_1, ..., s_q|t_1, ..., t_p) = \frac{f_{X,Y}(t_1, ..., t_p, s_1, ..., s_q)}{f_X(t_1, ..., t_p)},

that is,

f_{Y|X}(s|t) = \frac{f_{X,Y}(t, s)}{f_X(t)}.

The conditional expectation E(Y|X) is the q-variate random vector with components

E(η_1|X), ..., E(η_q|X);

here E(η_j|X) = g_j(X) (a measurable function of X), and

g_j(t) = E(η_j|X = t) = \int s_j f_{η_j|X=t}(s_j|t) ds_j = \int s_j f_{η_j|ξ_1=t_1,...,ξ_p=t_p}(s_j|t_1, ..., t_p) ds_j.


In Probability Theory one verifies that the latter quantity is well defined if, for instance, E(|η_j|) < ∞, j = 1, ..., q. All the properties of conditional expectation established above hold true in the vector case.

Same as in the scalar case, we can define the conditional covariance matrix:

V (Y |X) = E(Y Y T |X)− E(Y |X)E(Y |X)T .

3.5.1 Best prediction theorem

Let |a| = \sqrt{a_1^2 + ... + a_p^2} stand for the Euclidean norm on R^p.

Definition 3.7 Let X ∈ R^p and Y ∈ R^q be two random vectors, and G : R^p → R^q. We say that G(X) is the best prediction of Y given X (in the mean square sense) if

E((Y − G(X))(Y − G(X))^T) ⪯ E((Y − H(X))(Y − H(X))^T) (3.15)

(we say that A ⪯ B if the difference B − A is positive semidefinite) for any measurable function H : R^p → R^q.

Clearly, (3.15) implies (please verify this!)

E(|Y − G(X)|^2) = \inf_{H(·)} E(|Y − H(X)|^2),

where the infimum is taken over all measurable functions H(·) : R^p → R^q. Same as in the case p = q = 1, we have

Theorem 3.2 If E(|Y |2) <∞, then the best prediction of Y given X is unique a.s. and satisfies

G(X) = E(Y |X) (a.s.).

Proof : Of course, it is sufficient to look for the minimum among functions H(·) such that E(|H(X)|^2) < ∞. For any such H(X),

E((H(X) − Y)(H(X) − Y)^T)
  = E([(H(X) − G(X)) + (G(X) − Y)][(H(X) − G(X)) + (G(X) − Y)]^T)
  = E((H(X) − G(X))(H(X) − G(X))^T) + E((H(X) − G(X))(G(X) − Y)^T)
    + E((G(X) − Y)(H(X) − G(X))^T) + E((G(X) − Y)(G(X) − Y)^T).

But, using the properties of conditional expectation, we obtain

E((H(X) − G(X))(G(X) − Y)^T) = E[E((H(X) − G(X))(G(X) − Y)^T |X)] = E[(H(X) − G(X)) E((G(X) − Y)^T |X)] = 0.

The statement of the theorem follows.


3.6 Conditioning in Gaussian case

3.6.1 Normal correlation theorem

An important consequence of the properties of the normal distribution (Section 2.3) is the following statement:

Theorem 3.3 (Gauss-Markov) Let X^T = (ξ^T, θ^T), ξ ∈ R^k, θ ∈ R^l, p = k + l, be a normal vector, X ∼ N_p(µ, Σ), where

µ^T = (µ_ξ^T, µ_θ^T),   Σ = \begin{pmatrix} Σ_{ξξ} & Σ_{ξθ} \\ Σ_{θξ} & Σ_{θθ} \end{pmatrix},

Σ_{ξξ} ∈ R^{k×k}, Σ_{θθ} ∈ R^{l×l}, Σ_{θξ}^T = Σ_{ξθ} ∈ R^{k×l}. We suppose that Σ_{ξξ} ≻ 0.

Then

m := E(θ|ξ) = µ_θ + Σ_{θξ} Σ_{ξξ}^{−1} (ξ − µ_ξ) (a.s.),
γ := V(θ|ξ) = Σ_{θθ} − Σ_{θξ} Σ_{ξξ}^{−1} Σ_{ξθ} (a.s.),
(3.16)

and the conditional distribution of θ given ξ is normal: for any s ∈ R^l, P(θ ≤ s|ξ) is (a.s.) the c.d.f. of an l-variate normal distribution with vector of means m and covariance matrix γ (for two vectors a, b ∈ R^l we write a ≤ b for the system of inequalities a_1 ≤ b_1, ..., a_l ≤ b_l).

Moreover, the random vectors ξ and

η = θ − Σ_{θξ} Σ_{ξξ}^{−1} ξ

are independent.

Remarks:

1. The theorem provides an explicit expression for the regression function m = E(θ|ξ) (regression of θ on ξ) and the conditional covariance matrix

γ = V(θ|ξ) = E((θ − m)(θ − m)^T | ξ).

This regression is linear in the case of a Gaussian couple (ξ, θ).

2. If we assume, in addition, that Σ ≻ 0, then the matrix γ is also ≻ 0. Indeed, let a ∈ R^k, b ∈ R^l, not both zero; then

(a^T, b^T) Σ \begin{pmatrix} a \\ b \end{pmatrix} = (a^T, b^T) \begin{pmatrix} Σ_{ξξ} & Σ_{ξθ} \\ Σ_{θξ} & Σ_{θθ} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} > 0,

that is,

a^T Σ_{ξξ} a + a^T Σ_{ξθ} b + b^T Σ_{θξ} a + b^T Σ_{θθ} b > 0. (3.17)

If we choose

a = −Σ_{ξξ}^{−1} Σ_{ξθ} b,

then (3.17) can be rewritten as

−b^T Σ_{θξ} Σ_{ξξ}^{−1} Σ_{ξθ} b + b^T Σ_{θθ} b > 0

for any b ≠ 0. Thus,

Σ_{θθ} − Σ_{θξ} Σ_{ξξ}^{−1} Σ_{ξθ} ≻ 0.


3. The normal correlation theorem allows for the following geometric interpretation: assume that E(ξ) = 0 and E(θ) = 0, and let L_2^ξ(P) be the subspace of random vectors with finite covariance matrix which are measurable with respect to ξ. Then Σ_{θξ} Σ_{ξξ}^{−1} ξ is the orthogonal projection of θ on L_2^ξ(P), and the vector η = θ − Σ_{θξ} Σ_{ξξ}^{−1} ξ is orthogonal to L_2^ξ(P).

4. An important extension of Theorem 3.3 is given by its "conditional" version, in which we assume that the conditional distribution of the couple (ξ, θ) given another r.v., say Z, is normal (a.s.). Indeed, let X = (ξ, θ)^T = ((ξ_1, ..., ξ_k), (θ_1, ..., θ_l))^T be a random vector and Z some other random vector defined on the same probability space (Ω, F, P). Assume that the conditional distribution of X given Z is normal (a.s.) with vector of means

E(X|Z)^T = (E(ξ|Z)^T, E(θ|Z)^T) = (µ_{ξ|Z}^T, µ_{θ|Z}^T),

and covariance matrix

Σ_{X|Z} = \begin{pmatrix} V(ξ|Z) & C(ξ, θ|Z) \\ C(θ, ξ|Z) & V(θ|Z) \end{pmatrix} := \begin{pmatrix} Σ_{ξξ|Z} & Σ_{ξθ|Z} \\ Σ_{θξ|Z} & Σ_{θθ|Z} \end{pmatrix}.

Then the conditional expectation m = E(θ|ξ, Z) and the conditional covariance matrix γ = V(θ|ξ, Z) are given by

m = µ_{θ|Z} + Σ_{θξ|Z} Σ_{ξξ|Z}^{−1} (ξ − µ_{ξ|Z}),
γ = Σ_{θθ|Z} − Σ_{θξ|Z} Σ_{ξξ|Z}^{−1} Σ_{ξθ|Z};
(3.18)

and the conditional distribution of θ given ξ and Z is normal: for any s ∈ R^l, P(θ ≤ s|ξ, Z) is (a.s.) the c.d.f. of an l-variate normal distribution with mean m and covariance matrix γ. Moreover, the random vectors ξ and

η = θ − Σ_{θξ|Z} Σ_{ξξ|Z}^{−1} ξ

are conditionally independent given Z.

This statement can be proved in exactly the same way as Theorem 3.3 and will be used in the next section.

Proof of the normal correlation theorem.

Step 1. Let us compute E(η) and V(η):

E(η) = E(θ − Σ_{θξ} Σ_{ξξ}^{−1} ξ) = µ_θ − Σ_{θξ} Σ_{ξξ}^{−1} µ_ξ,

and

V(η) = E([(θ − µ_θ) − Σ_{θξ} Σ_{ξξ}^{−1} (ξ − µ_ξ)][(θ − µ_θ) − Σ_{θξ} Σ_{ξξ}^{−1} (ξ − µ_ξ)]^T)
     = Σ_{θθ} − Σ_{θξ} Σ_{ξξ}^{−1} E((ξ − µ_ξ)(θ − µ_θ)^T) − E((θ − µ_θ)(ξ − µ_ξ)^T) Σ_{ξξ}^{−1} Σ_{θξ}^T + Σ_{θξ} Σ_{ξξ}^{−1} E((ξ − µ_ξ)(ξ − µ_ξ)^T) Σ_{ξξ}^{−1} Σ_{θξ}^T
     = Σ_{θθ} − Σ_{θξ} Σ_{ξξ}^{−1} Σ_{ξθ}.


Step 2. Let us verify the orthogonality of η and ξ:

C(η, ξ) = C(θ, ξ) − Σ_{θξ} Σ_{ξξ}^{−1} C(ξ, ξ) = Σ_{θξ} − Σ_{θξ} Σ_{ξξ}^{−1} Σ_{ξξ} = 0,

thus η ⊥ ξ.

Step 3. We show that the couple (ξ, η) is normal. Indeed,

\begin{pmatrix} ξ \\ η \end{pmatrix} = AX = A \begin{pmatrix} ξ \\ θ \end{pmatrix},

where

A = \begin{pmatrix} I_k & 0 \\ −Σ_{θξ} Σ_{ξξ}^{−1} & I_l \end{pmatrix},

with identity matrices I_k ∈ R^{k×k} and I_l ∈ R^{l×l}. By the property (N3) of Section 2.3.5, (ξ^T, η^T)^T is a normal vector. Its covariance matrix is given by

V\left(\begin{pmatrix} ξ \\ η \end{pmatrix}\right) = \begin{pmatrix} V(ξ) & C(ξ, η) \\ C(η, ξ) & V(η) \end{pmatrix} = \begin{pmatrix} Σ_{ξξ} & 0 \\ 0 & Σ_{θθ} − Σ_{θξ} Σ_{ξξ}^{−1} Σ_{ξθ} \end{pmatrix}.

Because Σ_{ξξ} ≻ 0 and Σ_{θθ} − Σ_{θξ} Σ_{ξξ}^{−1} Σ_{ξθ} ⪰ 0 (by the Cauchy-Schwarz inequality), we have V((ξ^T, η^T)^T) ⪰ 0. Besides this, V((ξ^T, η^T)^T) = A V(X) A^T ⪰ 0.

Step 4. Now the property (N6) implies that η and ξ are independent. On the other hand, the result of Step 3, along with (N5), allows us to conclude that η is a normal vector. Using the above expressions for E(η) and V(η) we obtain

η ∼ N_l(µ_θ − Σ_{θξ} Σ_{ξξ}^{−1} µ_ξ, Σ_{θθ} − Σ_{θξ} Σ_{ξξ}^{−1} Σ_{ξθ}).

Now it suffices to note that

θ = η + Σ_{θξ} Σ_{ξξ}^{−1} ξ,

where η is independent of ξ. Therefore, the conditional distribution of θ given ξ is the distribution of η, translated by Σ_{θξ} Σ_{ξξ}^{−1} ξ. In particular,

E(θ|ξ) = E(η) + Σ_{θξ} Σ_{ξξ}^{−1} ξ,   V(θ|ξ) = V(η).
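The conclusions of Theorem 3.3 can be checked numerically (an illustration, assuming NumPy): for a simulated Gaussian couple (ξ, θ), the vector η = θ − Σ_{θξ} Σ_{ξξ}^{−1} ξ should be empirically uncorrelated with ξ and have covariance matrix γ = Σ_{θθ} − Σ_{θξ} Σ_{ξξ}^{−1} Σ_{ξθ}. The block sizes and the matrix Σ below are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(10)
    k, l = 2, 2
    mu = np.array([0.0, 1.0, -1.0, 2.0])                 # (mu_xi, mu_theta)
    Sigma = np.array([[2.0, 0.3, 0.5, 0.1],
                      [0.3, 1.0, 0.2, 0.4],
                      [0.5, 0.2, 1.5, 0.3],
                      [0.1, 0.4, 0.3, 1.0]])             # positive definite

    Z = rng.multivariate_normal(mu, Sigma, size=500_000)
    xi, theta = Z[:, :k], Z[:, k:]

    S_xx, S_tx = Sigma[:k, :k], Sigma[k:, :k]
    S_xt, S_tt = Sigma[:k, k:], Sigma[k:, k:]
    gain = S_tx @ np.linalg.inv(S_xx)                    # Sigma_{theta xi} Sigma_{xi xi}^{-1}

    eta = theta - xi @ gain.T                            # eta = theta - gain @ xi (row-wise)
    C = np.cov(np.hstack([eta, xi]), rowvar=False)

    print(C[:l, l:])                                     # ~ 0: eta uncorrelated with xi
    print(C[:l, :l])                                     # ~ gamma, computed next
    print(S_tt - S_tx @ np.linalg.inv(S_xx) @ S_xt)      # gamma = S_tt - S_tx S_xx^{-1} S_xt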

The linearity of the best prediction m = E(θ|ξ) of θ given ξ is of course tightly linked to the normality of the couple (ξ, θ), which allows for a simple computation of m. It is interesting to see what the best linear prediction is in the case where the joint distribution of the couple ξ and


θ is not normal. In other words, we may be interested in finding a matrix A* ∈ R^{l×k} and a vector b* ∈ R^l such that θ̂ = b* + A*ξ satisfies

E((θ − θ̂)(θ − θ̂)^T) = \inf_{A ∈ R^{l×k}, b ∈ R^l} E((θ − Aξ − b)(θ − Aξ − b)^T).

The answer is given by the following lemma, which stresses the importance of the Gaussian case when looking for best linear predictors:

Lemma 3.1 Suppose that (X, Y) is a random vector, X ∈ R^k, Y ∈ R^l, such that E(|X|^2 + |Y|^2) < ∞, V(X) ≻ 0. Let (ξ, θ) be a normal vector with the same mean and covariance matrix, i.e.

E(ξ) = E(X), E(θ) = E(Y), V(ξ) = V(X), V(θ) = V(Y), C(X, Y) = C(ξ, θ).

Let now λ(b) : R^k → R^l be a linear function such that

λ(b) = E(θ|ξ = b).

Then λ(X) is the best linear prediction of Y given X. Besides this, E(λ(X)) = E(Y).

Proof : Observe that the existence of a linear function λ(b) which coincides with E(θ|ξ = b) is a consequence of the normal correlation theorem. Let η(b) be another linear prediction of θ given ξ; then, by the best prediction theorem,

E((θ − λ(ξ))(θ − λ(ξ))^T) ⪯ E((θ − η(ξ))(θ − η(ξ))^T).

By linearity of the predictions λ(·) and η(·), under the premise of the lemma, we get

E((Y − λ(X))(Y − λ(X))^T) = E((θ − λ(ξ))(θ − λ(ξ))^T) ⪯ E((θ − η(ξ))(θ − η(ξ))^T) = E((Y − η(X))(Y − η(X))^T),

which proves the optimality of λ(X). Finally,

E(λ(X)) = E(λ(ξ)) = E(E(θ|ξ)) = E(θ) = E(Y).

Let us consider the following example (cf. Exercise 3.9):

Example 3.4 Let X and Y be r.v. such that the couple (X, Y) is normally distributed with means µ_X = E(X), µ_Y = E(Y), variances σ^2_X = Var(X) > 0, σ^2_Y = Var(Y) > 0 and correlation ρ = ρ_{XY}, |ρ| < 1.

Putting Σ = Var((X, Y)^T), we get

Σ = \begin{pmatrix} σ^2_X & ρ σ_X σ_Y \\ ρ σ_X σ_Y & σ^2_Y \end{pmatrix}


with Det(Σ) = σ^2_X σ^2_Y (1 − ρ^2) > 0. Observe that if in Theorem 3.3 ξ = X and θ = Y, then

Σ_{θξ} = Σ_{ξθ} = ρ σ_X σ_Y,   Σ_{θξ} Σ_{ξξ}^{−1} = ρ \frac{σ_Y}{σ_X}.

So the regression function satisfies

m(x) = E(Y|X = x) = µ_Y + ρ \frac{σ_Y}{σ_X} (x − µ_X),   γ = γ(x) = V(Y|X = x) = σ^2_Y (1 − ρ^2),

and the conditional density of Y given X is

f_{Y|X}(y|x) = \frac{1}{\sqrt{2πγ}} \exp\left(−\frac{(y − m(x))^2}{2γ}\right)

(the density of the distribution N(m(x), γ(x))).

Let us consider the particular case µ_X = µ_Y = 0 and σ_X = σ_Y = 1. Then

Σ = \begin{pmatrix} 1 & ρ \\ ρ & 1 \end{pmatrix},   Σ^{−1} = (1 − ρ^2)^{−1} \begin{pmatrix} 1 & −ρ \\ −ρ & 1 \end{pmatrix}.

The eigenvectors of Σ (and of Σ^{−1}) are

(1, 1)^T and (−1, 1)^T,

corresponding, for Σ, to the eigenvalues

λ_1 = 1 + ρ and λ_2 = 1 − ρ, respectively.

The normalized eigenvectors are γ_1 = 2^{−1/2}(1, 1)^T and γ_2 = 2^{−1/2}(−1, 1)^T. If we put Γ = (γ_1, γ_2), then we have the eigenvalue decomposition

Σ = ΓΛΓ^T = Γ \begin{pmatrix} 1 + ρ & 0 \\ 0 & 1 − ρ \end{pmatrix} Γ^T.

Let us consider the concentration ellipsoids of the joint density of (X, Y). For C > 0 let

E_C = {x ∈ R^2 : x^T Σ^{−1} x ≤ C^2} = {x ∈ R^2 : |y|^2 ≤ C^2},

where y = Σ^{−1/2} x. We set y = (y_1, y_2)^T, x = (x_1, x_2)^T; then

y_1 = \frac{1}{\sqrt{2(1 + ρ)}} (x_1 + x_2),   y_2 = \frac{1}{\sqrt{2(1 − ρ)}} (x_1 − x_2).

In this case the concentration ellipse becomes

E_C = {x^T Σ^{−1} x ≤ C^2} = \left\{ \left(\frac{1}{\sqrt{2(1 + ρ)}} (x_1 + x_2)\right)^2 + \left(\frac{1}{\sqrt{2(1 − ρ)}} (x_1 − x_2)\right)^2 ≤ C^2 \right\}.


[Figure: concentration ellipses of X = (ξ_1, ξ_2) and of Y = (η_1, η_2) = Σ^{−1/2} X, for ρ = 0.75 and ρ = −0.5.]

3.6.2 The Kalman-Bucy filter

Suppose that the sequence of (couples of) random vectors (θ, ξ) = ((θ_n), (ξ_n)), n = 0, 1, 2, ..., with θ_n = (θ_1(n), ..., θ_l(n))^T ∈ R^l and ξ_n = (ξ_1(n), ..., ξ_k(n))^T ∈ R^k, is generated by the recursive equations

θ_{n+1} = a_{n+1} θ_n + b_{n+1} ε^{(0)}_{n+1},
ξ_{n+1} = A_{n+1} θ_n + B_{n+1} ε^{(1)}_{n+1},
(3.19)

with initial conditions (θ_0, ξ_0).

Here ε^{(0)}_n = (ε^{(0)}_1(n), ..., ε^{(0)}_l(n))^T and ε^{(1)}_n = (ε^{(1)}_1(n), ..., ε^{(1)}_k(n))^T are independent normal vectors, ε^{(0)}_1 ∼ N_l(0, I) and ε^{(1)}_1 ∼ N_k(0, I); a_n, b_n, A_n and B_n are deterministic matrices of size, respectively, l × l, l × l, k × l and k × k. We suppose that the matrices B_n are of full rank, and that the initial conditions (θ_0, ξ_0) are independent of the sequences (ε^{(0)}_n) and (ε^{(1)}_n).

In the sequel we use the notation ξ_0^n for the "long" vector ξ_0^n = (ξ_0^T, ..., ξ_n^T)^T.

First, observe that if E(|θ_0|^2 + |ξ_0|^2) < ∞, then for all n ≥ 0, E(|θ_n|^2 + |ξ_n|^2) < ∞. If we assume, in addition, that the couple (θ_0, ξ_0) is a normal vector, then we can easily verify (all θ_n and ξ_n are linear functions of the Gaussian vectors (θ_0, ξ_0), (ε^{(0)}_i) and (ε^{(1)}_i), i = 1, ..., n) that the "long" vector Z^T = (θ_0^T, ξ_0^T, ..., θ_n^T, ξ_n^T) is normal for each n ≥ 0. We can thus apply the normal correlation theorem to compute the best prediction of the sequence (θ_i), 0 ≤ i ≤ n, given (ξ_i), 0 ≤ i ≤ n.

This computation may become rather expensive if we want to build the prediction for large n. This concern is less pressing today, but in the 1950s–60s memory and processing power were important constraints, especially for "onboard" computations. This motivated the search for "cheap" algorithms for computing best predictions, which resulted in 1960 in the discovery of the Kalman-Bucy filter, which computes the prediction in a fully recursive way. The aim of the next exercises is to obtain the recursive equations of the Kalman filter – recursive formulas for

m_n = E(θ_n|ξ_0^n),   γ_n = V(θ_n|ξ_0^n).

This problem, extremely complicated in the general setting, allows for a simple solution if we suppose that the conditional distribution P(θ_0 < a|ξ_0) of the vector θ_0 given ξ_0 is normal (a.s.), which we will assume in this section. Our first objective is to show that under the above conditions


the sequence (θ, ξ) is conditionally Gaussian. In other words, the conditional c.d.f.

P(ξ_{n+1} ≤ x, θ_{n+1} ≤ a|ξ_0^n)

coincides (a.s.) with the c.d.f. of an (l + k)-dimensional normal vector with mean and covariance matrix which depend on ξ_0^n.

Exercise 3.2

Let ζ_n = (ξ_n^T, θ_n^T)^T, t ∈ R^{k+l}. Verify that the conditional c.d.f.

P(ζ_{n+1} ≤ t|ξ_0^n, θ_n = u)

is (a.s.) normal with mean Mu, where M is a (k + l) × l matrix, and with a (k + l) × (k + l) covariance matrix Σ, both to be determined.

Now, let us suppose that for n ≥ 0 the conditional c.d.f.

P(ζ_n ≤ t|ξ_0^{n−1})

is (a.s.) that of an (l + k)-dimensional normal distribution with mean and covariance matrix depending on ξ_0^{n−1}.

Exercise 3.3

Use the "conditional version" of the normal correlation theorem (cf. Remark 4 and display (3.18)) to show that the conditional c.d.f.

P(ζ_{n+1} ≤ t|ξ_0^n), n ≥ 0,

are (a.s.) normal with

E(ζ_{n+1}|ξ_0^n) = \begin{pmatrix} A_{n+1} m_n \\ a_{n+1} m_n \end{pmatrix},   V(ζ_{n+1}|ξ_0^n) = \begin{pmatrix} B_{n+1} B_{n+1}^T + A_{n+1} γ_n A_{n+1}^T & A_{n+1} γ_n a_{n+1}^T \\ a_{n+1} γ_n A_{n+1}^T & b_{n+1} b_{n+1}^T + a_{n+1} γ_n a_{n+1}^T \end{pmatrix},

where m_n = E(θ_n|ξ_0^n) and γ_n = V(θ_n|ξ_0^n).

Hint: compute the conditional characteristic function

E(exp(i t^T ζ_{n+1})|ξ_0^n, θ_n), t ∈ R^{l+k},

then use the fact that in the premise of the exercise the distribution of θ_n, given ξ_0^{n−1} and ξ_n, is conditionally normal with parameters m_n and γ_n.

Exercise 3.4

Apply the (conditional) normal correlation theorem to obtain the recursive relations

m_{n+1} = a_{n+1} m_n + a_{n+1} γ_n A_{n+1}^T (B_{n+1} B_{n+1}^T + A_{n+1} γ_n A_{n+1}^T)^{−1} (ξ_{n+1} − A_{n+1} m_n),
γ_{n+1} = a_{n+1} γ_n a_{n+1}^T + b_{n+1} b_{n+1}^T − a_{n+1} γ_n A_{n+1}^T (B_{n+1} B_{n+1}^T + A_{n+1} γ_n A_{n+1}^T)^{−1} A_{n+1} γ_n a_{n+1}^T
(3.20)

(since B_{n+1} is of full rank, the matrix B_{n+1} B_{n+1}^T + A_{n+1} γ_n A_{n+1}^T is invertible).

Show that ξ_{n+1} and

η = θ_{n+1} − a_{n+1} γ_n A_{n+1}^T (B_{n+1} B_{n+1}^T + A_{n+1} γ_n A_{n+1}^T)^{−1} (ξ_{n+1} − A_{n+1} m_n)

are independent given ξ_0^n.


Example 3.5 Let X = (X_n) and Y = (Y_n) be scalar random sequences such that

X_{n+1} = c X_n + b ε^{(0)}_{n+1},   Y_{n+1} = X_n + B ε^{(1)}_{n+1}, (3.21)

where c, b and B are reals, and ε^{(0)} and ε^{(1)} are sequences of i.i.d. N(0, 1) r.v. which are mutually independent. Let us compute m_n = E(X_n|Y_0^n).

We can suppose that X_n is the "useful signal" and B ε^{(1)}_{n+1} is the observation noise, and we want to recover X_n given the observations Y_0, ..., Y_n. The Kalman equations (3.20) allow us to easily obtain the recursive prediction:

m_n = c m_{n−1} + \frac{c γ_{n−1}}{B^2 + γ_{n−1}} (Y_n − m_{n−1}),

γ_n = c^2 γ_{n−1} + b^2 − \frac{c^2 γ_{n−1}^2}{B^2 + γ_{n−1}}.

Exercise 3.5

Show that if b ≠ 0, B ≠ 0 and |c| < 1, then the "limit error" γ = \lim_{n→∞} γ_n of the Kalman filter exists and is a positive solution of the quadratic (Riccati) equation

γ^2 + (B^2(1 − c^2) − b^2) γ − b^2 B^2 = 0.
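A minimal implementation of the scalar filter of Example 3.5 (a sketch, assuming NumPy; the values of c, b, B and the initial pair (m_0, γ_0) are arbitrary illustrative choices). It simulates model (3.21), runs the recursion for m_n and γ_n as written above, and checks that γ_n settles at the positive root of the Riccati equation of Exercise 3.5.

    import numpy as np

    rng = np.random.default_rng(11)
    c, b, B = 0.8, 1.0, 0.5
    n_steps = 2000

    # simulate the state X and the observations Y of model (3.21)
    X = np.zeros(n_steps)
    Y = np.zeros(n_steps)
    for n in range(1, n_steps):
        X[n] = c * X[n - 1] + b * rng.standard_normal()
        Y[n] = X[n - 1] + B * rng.standard_normal()

    # Kalman recursion for m_n = E(X_n | Y_0^n) and gamma_n
    m_seq = np.zeros(n_steps)
    gam_seq = np.zeros(n_steps)
    m, gamma = 0.0, 1.0                           # assumed initial m_0, gamma_0
    m_seq[0], gam_seq[0] = m, gamma
    for n in range(1, n_steps):
        K = c * gamma / (B**2 + gamma)            # gain
        m = c * m + K * (Y[n] - m)                # innovation Y_n - m_{n-1}
        gamma = c**2 * gamma + b**2 - c**2 * gamma**2 / (B**2 + gamma)
        m_seq[n], gam_seq[n] = m, gamma

    # positive root of gamma^2 + (B^2(1 - c^2) - b^2) gamma - b^2 B^2 = 0
    coef = B**2 * (1 - c**2) - b**2
    gamma_lim = (-coef + np.sqrt(coef**2 + 4 * b**2 * B**2)) / 2
    print(gam_seq[-1], gamma_lim)                       # gamma_n -> Riccati root
    print(np.mean((m_seq[100:] - X[100:]) ** 2))        # empirical error ~ gamma_lim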

Example 3.6 Let θ ∈ R^l be a normal vector with E(θ) = 0 and V(θ) = γ (we assume that γ is known). We look for the best prediction of θ given observations of the k-variate sequence (ξ) = (ξ_n),

ξ_{n+1} = A_{n+1} θ + B_{n+1} ε^{(1)}_{n+1},   ξ_0 = 0,

where A_{n+1}, B_{n+1} and ε^{(1)}_{n+1} satisfy the same hypotheses as in (3.19).

From (3.20) we obtain

m_{n+1} = m_n + γ_n A_{n+1}^T [B_{n+1} B_{n+1}^T + A_{n+1} γ_n A_{n+1}^T]^{−1} (ξ_{n+1} − A_{n+1} m_n),
γ_{n+1} = γ_n − γ_n A_{n+1}^T [B_{n+1} B_{n+1}^T + A_{n+1} γ_n A_{n+1}^T]^{−1} A_{n+1} γ_n.
(3.22)

The solutions to (3.22) are given by

m_{n+1} = \left[I + γ \sum_{m=0}^n A_{m+1}^T (B_{m+1} B_{m+1}^T)^{−1} A_{m+1}\right]^{−1} γ \sum_{m=0}^n A_{m+1}^T (B_{m+1} B_{m+1}^T)^{−1} ξ_{m+1},
γ_{n+1} = \left[I + γ \sum_{m=0}^n A_{m+1}^T (B_{m+1} B_{m+1}^T)^{−1} A_{m+1}\right]^{−1} γ,
(3.23)

where I is the l × l identity matrix.

Exercise 3.6

Derive the formula (3.23).


3.6.3 Hints to the exercises of this section

Exercise 3.2 One can easily verify that (a.s.)

E(θ_{n+1}|ξ_0^n, θ_n = u) = a_{n+1} u,   E(ξ_{n+1}|ξ_0^n, θ_n = u) = A_{n+1} u,
V(θ_{n+1}|ξ_0^n, θ_n = u) = b_{n+1} b_{n+1}^T,   V(ξ_{n+1}|ξ_0^n, θ_n = u) = B_{n+1} B_{n+1}^T,

and

C(θ_{n+1}, ξ_{n+1}|ξ_0^n, θ_n = u) = 0.

Thus, the conditional distribution of ζ_{n+1} is (a.s.) normal with

E(ζ_{n+1}|ξ_0^n, θ_n = u) = \begin{pmatrix} A_{n+1} u \\ a_{n+1} u \end{pmatrix},   V(ζ_{n+1}|ξ_0^n, θ_n = u) = \begin{pmatrix} B_{n+1} B_{n+1}^T & 0 \\ 0 & b_{n+1} b_{n+1}^T \end{pmatrix}.

Exercise 3.3 In the premise of the exercise, by the normal correlation theorem, the distribution of θ_n given ξ_0^n is normal with parameters m_n = E(θ_n|ξ_0^n) and γ_n = V(θ_n|ξ_0^n) (the latter not depending on ξ_0^n). We observe that (a.s.)

E(exp(i t^T ζ_{n+1})|ξ_0^n, θ_n) = \exp\left[ i t^T \begin{pmatrix} A_{n+1} θ_n \\ a_{n+1} θ_n \end{pmatrix} − \frac{1}{2} t^T \begin{pmatrix} B_{n+1} B_{n+1}^T & 0 \\ 0 & b_{n+1} b_{n+1}^T \end{pmatrix} t \right],

and, because

E\left( \exp\left[ i t^T \begin{pmatrix} A_{n+1} θ_n \\ a_{n+1} θ_n \end{pmatrix} \right] \Big| ξ_0^n \right) = \exp\left[ i t^T \begin{pmatrix} A_{n+1} m_n \\ a_{n+1} m_n \end{pmatrix} − \frac{1}{2} t^T \begin{pmatrix} A_{n+1} γ_n A_{n+1}^T & A_{n+1} γ_n a_{n+1}^T \\ a_{n+1} γ_n A_{n+1}^T & a_{n+1} γ_n a_{n+1}^T \end{pmatrix} t \right],

we conclude that

E(exp(i t^T ζ_{n+1})|ξ_0^n) = \exp\left[ i t^T \begin{pmatrix} A_{n+1} m_n \\ a_{n+1} m_n \end{pmatrix} − \frac{1}{2} t^T \begin{pmatrix} A_{n+1} γ_n A_{n+1}^T & A_{n+1} γ_n a_{n+1}^T \\ a_{n+1} γ_n A_{n+1}^T & a_{n+1} γ_n a_{n+1}^T \end{pmatrix} t − \frac{1}{2} t^T \begin{pmatrix} B_{n+1} B_{n+1}^T & 0 \\ 0 & b_{n+1} b_{n+1}^T \end{pmatrix} t \right].

Exercise 3.4 Just apply the (conditional) normal correlation theorem.

Exercise 3.6 Try a Google search for the matrix inversion lemma (the Sherman–Morrison–Woodbury identity), then apply the lemma to

[γ_n^{-1} + A_{n+1}^T (B_{n+1}B_{n+1}^T)^{-1} A_{n+1}]^{-1}.
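In case it is useful, one standard form of that lemma (valid whenever all the indicated inverses exist) reads

(M + UCV)^{-1} = M^{-1} − M^{-1}U(C^{-1} + V M^{-1} U)^{-1} V M^{-1};

applying it with M = γ_n^{-1}, U = A_{n+1}^T, C = (B_{n+1}B_{n+1}^T)^{-1} and V = A_{n+1} recovers the right-hand side of the γ-recursion in (3.22).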


3.7 Exercises

Exercise 3.7

Let the joint distribution of X and Y be given by

F(x, y) = 1 − e^{-2x} − e^{-y} + e^{-(2x+y)} for x > 0, y > 0, and F(x, y) = 0 otherwise

(cf. Exercise 2.7). What is the conditional density of X given Y = y?

Exercise 3.8

Let X and Y be i.i.d. r.v. Use the definition of the conditional expectation to show that E(X|X + Y) = E(Y|X + Y) (a.s.), and thus E(X|X + Y) = E(Y|X + Y) = (X + Y)/2 (a.s.).

Exercise 3.9

Let (X, Y) be a random vector of dimension 2. Suppose that Y ∼ N(m, τ^2) and that the distribution of X given Y = y is N(y, σ^2).
1°. What is the distribution of Y given X = x?
2°. What is the distribution of X?
3°. What is the distribution of E(Y|X)?

Exercise 3.10

Consider the joint density function of X and Y given by

f(x, y) = \frac{6}{7}\Big(x^2 + \frac{xy}{2}\Big),  0 ≤ x ≤ 1, 0 ≤ y ≤ 2.

1. Verify that f is a joint density.
2. Find the density of X and the conditional density f_{Y|X}(y|x).
3. Compute P(Y > 1/2 | X < 1/2).
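For item 1, a numerical integration can serve as a quick check of your hand computation (a sketch assuming scipy is available):

from scipy.integrate import dblquad

# integrate f(x, y) = (6/7) * (x**2 + x*y/2) over 0 <= x <= 1, 0 <= y <= 2
val, err = dblquad(lambda y, x: 6 / 7 * (x**2 + x * y / 2), 0, 1, lambda x: 0, lambda x: 2)
print(val)   # should be close to 1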

Exercise 3.11

Let X and N be r.v. such that N takes values in {1, 2, ...}, E(|X|) < ∞ and E(N) < ∞. Consider a sequence X_1, X_2, ... of independent r.v. with the same distribution as X. Show the Wald identity: if N is independent of the X_i, then

E\Big(\sum_{i=1}^{N} X_i\Big) = E(N)E(X).
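A quick simulation can make the identity plausible before you prove it; the sketch below uses an illustrative choice of the laws of X and N (exponential X, geometric N), which is not imposed by the exercise.

import numpy as np

rng = np.random.default_rng(3)
trials = 100_000
lam, p = 2.0, 0.3                            # illustrative parameters: X ~ Exp(lam), N ~ Geometric(p)
N = rng.geometric(p, size=trials)            # N takes values 1, 2, ...
totals = np.array([rng.exponential(1 / lam, size=n).sum() for n in N])
print(totals.mean(), N.mean() * (1 / lam))   # both should be close to E(N)E(X) = (1/p)(1/lam)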

Exercise 3.12

Suppose that the salary of an individual satisfies Y* = Xb + σε, where σ > 0, b ∈ R, X is a r.v. with finite second-order moment representing the abilities of the individual, and ε ∼ N(0, 1) is a standard normal variable independent of X. If Y* is larger than the SMIC (minimum wage) value S, the received salary Y equals Y*; otherwise it equals S. Compute E(Y|X). Is this expectation linear in X?


Exercise 3.13

Let X, Y_1 and Y_2 be independent r.v., with Y_1 and Y_2 standard normal N(0, 1), and

Z = \frac{Y_1 + X Y_2}{\sqrt{1 + X^2}}.

Using the conditional distribution P(Z < u|X = x), show that Z ∼ N(0, 1).

Exercise 3.14

Let X and Y be two square-integrable r.v. on (Ω, F, P). Prove that

Var(Y ) = E(Var(Y |X)) + Var(E(Y |X)).

Exercise 3.15

Let X_1, ..., X_n be independent r.v. such that X_i ∼ P(λ_i) (Poisson distribution with parameter λ_i, i.e. P(X_i = k) = e^{−λ_i} λ_i^k / k!).

1°. Find the distribution of X = \sum_{i=1}^{n} X_i.

2°. Show that the conditional distribution of (X_1, ..., X_n) given X = r is multinomial M(r, p_1, ..., p_n) (you will compute the corresponding parameters).

Recall that integer-valued r.v. (X_1, ..., X_k) with values in {0, ..., r} have the multinomial distribution M(r, p_1, ..., p_k) if

P(X_1 = n_1, ..., X_k = n_k) = \frac{r!}{n_1! \cdots n_k!} p_1^{n_1} \cdots p_k^{n_k},

with \sum_{i=1}^{k} n_i = r. This is the distribution of (X_1, ..., X_k) where

X_i = “number of Y’s which are equal to i”

in r independent trials Y_1, ..., Y_r with probabilities P(Y_1 = i) = p_i, i = 1, ..., k. Note that if k = 2,

P(X_1 = n_1, X_2 = r − n_1) = P(X_1 = n_1),

and the distribution is denoted M(r, p) = B(r, p).

3°. Compute E(X_1|X_1 + X_2).

4°. Show that if X_n is binomially distributed, X_n ∼ B(n, λ/n), then for all k, P(X_n = k) tends to e^{−λ} λ^k / k! as n → ∞.

Recall that the binomial distribution describes the number X of successes in n independent Bernoulli trials,

P(X = k) = C_n^k p^k (1 − p)^{n−k}.
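For 4°, a short numerical comparison illustrates the convergence before you prove it (a sketch; scipy is assumed to be available, and λ = 3 is an arbitrary illustrative value):

from scipy.stats import binom, poisson

lam = 3.0
for n in (10, 100, 1000):
    pmf_binom = [binom.pmf(k, n, lam / n) for k in range(6)]
    pmf_pois = [poisson.pmf(k, lam) for k in range(6)]
    print(n, max(abs(a - b) for a, b in zip(pmf_binom, pmf_pois)))
# the maximal gap over k = 0, ..., 5 should shrink as n grows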

Exercise 3.16

As in Exercise 2.24, let us model the heights (H, H′) of the male and female partners in a couple by a normal vector with H ∼ N(172, 49), H′ ∼ N(162, 49) (units: cm) and correlation coefficient ρ = 0.4.
Compute the probability p that the man in a couple is taller than the woman given that H′ = 169; redo the computation for H′ = 155.
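If you want to sanity-check your answer numerically, one possibility is a crude Monte Carlo that conditions on H′ falling in a small window around the given value (a sketch; the 0.5 cm window and the sample size are arbitrary choices):

import numpy as np

rng = np.random.default_rng(4)
mu = np.array([172.0, 162.0])
sigma, rho = 7.0, 0.4
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
sample = rng.multivariate_normal(mu, cov, size=2_000_000)
H, Hp = sample[:, 0], sample[:, 1]

for h in (169.0, 155.0):
    sel = np.abs(Hp - h) < 0.5                  # crude conditioning on H' close to h
    print(h, np.mean(H[sel] > Hp[sel]))         # estimate of P(H > H' | H' = h)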


Exercise 3.17

Let Z = (Z_1, Z_2, Z_3)^T be a normal vector with density f (cf. Exercise 2.20),

f(z_1, z_2, z_3) = \frac{1}{4(2π)^{3/2}} \exp\Big(− \frac{6z_1^2 + 6z_2^2 + 8z_3^2 + 4z_1z_2}{32}\Big).

What is the distribution of (Z_2, Z_3) given Z_1 = z_1?
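For exercises of this kind (and for Exercise 3.16 as well), it can be convenient to have a small numerical helper that returns the conditional mean and covariance of a partitioned normal vector, in the spirit of the normal correlation theorem; the sketch below is generic, and its names are of course not part of the exercise. For Exercise 3.17 one would first recover the covariance matrix of Z from the quadratic form in the density (left to the reader) and then call, e.g., gaussian_conditional(np.zeros(3), cov_Z, [0], [z1]).

import numpy as np

def gaussian_conditional(mu, cov, idx_obs, value_obs):
    # Mean and covariance of the remaining coordinates of a normal vector
    # N(mu, cov), given that the coordinates in idx_obs equal value_obs.
    idx_obs = np.asarray(idx_obs)
    idx_rest = np.array([i for i in range(len(mu)) if i not in set(idx_obs.tolist())])
    mu1, mu2 = mu[idx_rest], mu[idx_obs]
    S11 = cov[np.ix_(idx_rest, idx_rest)]
    S12 = cov[np.ix_(idx_rest, idx_obs)]
    S22 = cov[np.ix_(idx_obs, idx_obs)]
    W = S12 @ np.linalg.inv(S22)                       # regression coefficient
    cond_mean = mu1 + W @ (np.asarray(value_obs) - mu2)
    cond_cov = S11 - W @ S12.T                         # Schur complement
    return cond_mean, cond_cov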