
Stat 8151 Lecture Notes*

Fall Quarter 1998

 1. Friday, September 25, 1998.  Definitions and axioms; σ-algebras, Borel sets, probability spaces, random variables

 2. Monday, September 28, 1998.  Distributions, distribution functions, densities, expected values

 3. Wednesday, September 30, 1998.  Riemann-Stieltjes integrals, moments, characteristic functions

 4. Friday, October 2, 1998.  Dominated Convergence Theorem, characteristic functions

 5. Monday, October 5, 1998.  Fourier inversion, inequalities, convex functions

 6. Wednesday, October 7, 1998.  Jensen's inequality, convex sets, Hölder's inequality

 7. Friday, October 9, 1998.  Bivariate distributions: joint, marginal, conditional

 8. Monday, October 12, 1998.  Conditional expectation, conditional variance, independence

 9. Wednesday, October 14, 1998.  Covariance, correlation, Cauchy-Schwarz inequality

10. Friday, October 16, 1998.  Singular distributions, prediction, multivariate distributions

11. Monday, October 19, 1998.  Independence, conditional distributions

12. Wednesday, October 21, 1998.  Independence, random vectors and matrices, singular distributions

13. Friday, October 23, 1998.  Projections, orthogonal complements, range and null spaces

14. Monday, October 26, 1998.  Multivariate normal distributions

15. Wednesday, October 28, 1998.  Multivariate normal distributions, random samples

16. Friday, October 30, 1998.  Multivariate normal distributions, random samples, quadratic forms, noncentral chi-square

17. Monday, November 2, 1998.  Noncentral chi-square, Poisson sums

18. Wednesday, November 4, 1998.  Distribution and independence of quadratic forms, application to samples from bivariate normal

19. Friday, November 6, 1998.  Maximum likelihood estimates for bivariate normal sample, Wishart distribution, prediction, multiple correlation coefficient

20. Monday, November 9, 1998.  Multiple correlation coefficient, exchangeability, Polya urn

21. Wednesday, November 11, 1998.  Exchangeability, de Finetti's theorem

22. Friday, November 13, 1998.  Exchangeability, de Finetti's theorem

23. Monday, November 16, 1998.  Exchangeability for finitely many random variables; exponential family of distributions

24. Wednesday, November 18, 1998.  Exponential families, natural parameter spaces, moments

25. Friday, November 20, 1998.  Unbiasedness, completeness, order statistics

26. Monday, November 23, 1998.  Order statistics

27. Monday, November 30, 1998.  Order statistics, distributions, sufficiency, completeness

28. Wednesday, December 2, 1998.  Completeness of order statistic; statistical information and likelihood

29. Friday, December 4, 1998.  Likelihood, invariance, conditionality, and sufficiency principles

Index

* Although planned for thirty one-hour lectures, this material was actually presented in 29 class meetings. There were two extra-long sessions to accommodate the Thanksgiving holiday.

Friday, September 25, 1998

We study probability so we can make mathematical models for random experiments.

idea                                                notation

elementary outcome                                  ω
exhaustive list of elementary outcomes              Ω
event (collection of elementary outcomes)           A, a subset of Ω
probability of an event                             P(A), a set function defined on subsets of Ω
"measure" the elementary outcome                    X, a real-valued function defined on Ω

We'll need a bit more structure. Let Ω be an arbitrary set, and A ⊆ Ω. Let 𝒜 be a class of events.

𝒜 is called an algebra if it is closed under finite unions and complements. (By De Morgan, we get finite intersections, too.)

𝒜 is called a σ-algebra if it is closed under countable unions and complements.

The set of all subsets of Ω is both an algebra and a σ-algebra, so existence is not a problem.

Another (trivial) algebra is {∅, Ω}. (In response to a question, it was mentioned that an algebra has to include both ∅ and Ω.)¹

Suppose we think of a random experiment, say, picking a number at random from [0, 1]. (That's really not so easy to do, actually.) Then we would have Ω = [0, 1], and a reasonable event might be A = (0, 1/2). Hence, our σ-algebra of events should include all intervals. This leads to the following definition.

Definition. The σ-algebra of Borel sets is the smallest σ-algebra which contains all the intervals. We'll denote this by ℬ.

1 The usual definition specifically includes ∅, but it would also do just to say the class is non-empty. Then by complements and the closure properties we have both Ω and ∅ included.


Such a σ-algebra exists, since the collection of all subsets of Ω is a σ-algebra which contains all the intervals, and the intersection of any such σ-algebras is another one, so the smallest σ-algebra is just the intersection of all of them.

This also contains every singleton.

Note that ℬ is a smaller σ-algebra than the σ-algebra of all subsets of Ω.

We will usually work with the Borel σ-algebra.

Definition. (Ω, 𝒜, P) is called a probability space if Ω is a set, 𝒜 is a σ-algebra of subsets of Ω, and P is a function P : 𝒜 → [0, 1] such that

i. P(Ω) = 1,

ii. P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅,

iii. P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i) if A_i ∩ A_j = ∅ for i ≠ j.

Condition ii. is finite additivity—okay for an algebra; condition iii. is countable additivity—for a σ-algebra. Of course iii. is a stronger condition than ii. These are due to Kolmogorov.²

Some people prefer to use just finite additivity instead of countable additivity.³

A random (uniform) selection from (0, ∞) is not possible with countable additivity, but is possible using just finite additivity.

Using countable additivity on top of finite additivity is equivalent to continuity of measure.

Lemma. P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i) whenever A_i ∩ A_j = ∅ for i ≠ j if and only if for every decreasing sequence of sets B_n with ⋂_{n=1}^∞ B_n = ∅ we have lim_{n→∞} P(B_n) = 0.

Proof: Let B_n be such a decreasing sequence of sets. For each j, let A_j = B_j ∩ B_{j+1}^c = B_j \ B_{j+1}. Then A_{j_1} ∩ A_{j_2} = ∅ if j_1 ≠ j_2, and B_n = ⋃_{j=n}^∞ A_j. The idea is illustrated here—a nested sequence of sets can be expressed as a union of a different sequence of disjoint sets formed by pairwise set differences.

2 The usual list of axioms includes P(A) ≥ 0, P(Ω) = 1, and an additivity condition. Then we also have that A ⊆ B ⇒ P(A) ≤ P(B).

3 Preferring finite additivity is usually associated with Bruno de Finetti; our local expert is Prof. Sudderth.

[Figure: nested sets B_j ⊇ B_{j+1} ⊇ B_{j+2}, decomposed as B_j = A_j ∪ A_{j+1} ∪ A_{j+2}.]

Now, by assumption (countable additivity),

P(⋃_{j=1}^∞ A_j) = ∑_{j=1}^∞ P(A_j) = ∑_{j=1}^{n−1} P(A_j) + ∑_{j=n}^∞ P(A_j)
                 = ∑_{j=1}^{n−1} P(A_j) + P(⋃_{j=n}^∞ A_j) = ∑_{j=1}^{n−1} P(A_j) + P(B_n)

but P(B_n) is the tail of a convergent series, and it converges to 0 as n → ∞. Conversely, let {A_j} be a pairwise disjoint sequence of sets, and for each n let B_n = ⋃_{j=n}^∞ A_j. Now, by finite additivity,

P(⋃_{j=1}^∞ A_j) = P(⋃_{j=1}^{n−1} A_j) + P(⋃_{j=n}^∞ A_j) = ∑_{j=1}^{n−1} P(A_j) + P(B_n)

but P(B_n) → 0 as n → ∞ by assumption, so we conclude that countable additivity holds.

Definition. Given the probability space (Ω, 𝒜, P), a random variable X is a real-valued function defined on Ω such that for every Borel set B ∈ ℬ, we have X^{-1}(B) = {ω : X(ω) ∈ B} ∈ 𝒜.

In other words, X is a Borel-measurable function. Thus we can talk about P[X ∈ B] = P(X^{-1}(B)) = P({ω : X(ω) ∈ B}).

Definition. Let P_X(B) = P(X^{-1}(B)).

Now (R, ℬ, P_X) is a probability space, for P_X(R) = 1 and P_X(⋃_{i=1}^∞ B_i) = ∑_{i=1}^∞ P_X(B_i) for pairwise disjoint Borel sets B_i.


The last assertion is proved as follows.

P_X(⋃_{i=1}^∞ B_i) = P(X^{-1}(⋃_{i=1}^∞ B_i)) = P(⋃_{i=1}^∞ X^{-1}(B_i)) = ∑_{i=1}^∞ P(X^{-1}(B_i)) = ∑_{i=1}^∞ P_X(B_i).

We see that a random variable is a carrier of a distribution from one probability space to another.
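The following small numerical sketch is not part of the original notes; it illustrates the "carrier" idea for a finite Ω, with the example probabilities and random variable chosen only for illustration.

    # Sketch: the induced measure P_X(B) = P(X^{-1}(B)) for a finite Omega.
    # Omega = {1,...,6} with equal probabilities; X(omega) = omega mod 2 records parity.
    from fractions import Fraction

    omega = range(1, 7)
    P = {w: Fraction(1, 6) for w in omega}      # probability on Omega
    X = lambda w: w % 2                         # the random variable

    def P_X(B):
        """P_X(B) = P(X^{-1}(B)) for a set B of real values."""
        return sum(P[w] for w in omega if X(w) in B)

    print(P_X({0}), P_X({1}), P_X({0, 1}))      # 1/2 1/2 1: a probability on the range of X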


Monday, September 28, 1998

We consider X : (Ω, 𝒜, P) → (R, ℬ, P_X); note that statisticians focus on (R, ℬ, P_X) and forget about (Ω, 𝒜, P), while probabilists focus on (Ω, 𝒜, P) and ignore (R, ℬ, P_X).

Definition. Let F_X(x) = P_X((−∞, x]) = P[X ≤ x].

F_X is called the distribution function of X.

The (cumulative) distribution function has these properties:

1. F_X(−∞) = 0 and F_X(+∞) = 1.
2. F_X is nondecreasing.
3. F_X is right-continuous.

Observe that the appropriate inverse images are ∅ and Ω, so the first property is true. Next, if x_1 < x_2 then X^{-1}((−∞, x_1]) ⊆ X^{-1}((−∞, x_2]), so the second follows. Finally, the third involves the idea from the lemma of the previous lecture. Suppose x_n ↓ x. Then F_X(x_n) − F_X(x) = P[X ≤ x_n] − P[X ≤ x] = P[x < X ≤ x_n] = P(A_n) where A_n = {ω : x < X ≤ x_n}. But A_n ↓ ∅, so P(A_n) → 0 (by the lemma) and we conclude that F_X(x_n) ↓ F_X(x).

[In response to a question, GM stated that random variables were finite and real valued.]

Clearly P_X determines F_X, for the distribution function is defined in terms of the induced probability measure. Although less obvious, it is also true that F_X determines P_X. (A standard measure-theoretic extension allows us to go from half-lines to intervals to Borel sets.)

Thus, P_X and F_X are equivalent ways to characterize a distribution. We can work with a probability space (R, ℬ, P_X) using F_X instead of P_X.

Now fix X, and let F = F_X. Then F(x) = P[X ≤ x] by definition, and then F(b) − F(a) = P[X ≤ b] − P[X ≤ a] = P[a < X ≤ b].

It is a fact that P[X = b] = lim_{n→∞} P[b_n < X ≤ b] where b_n ↑ b; the proof is similar to that of an earlier lemma: if A_n ↓ A, where A need not necessarily be ∅, then P(A_n) → P(A). Thus P[X = b] = lim_{n→∞} (F(b) − F(b_n)) = F(b) − F(b−).

Definition. F is said to be a continuous distribution function if it is continuous at each real number. (Recall we already have right-continuity.)

The graph of an arbitrary distribution function may have jumps or flat spots.


The set of discontinuities of F is at most countable, since each jump defines an interval on the y-axis which contains at least one rational number. Of course such intervals are disjoint.

The simplest distribution function has a jump at some c, so

F_X(x) = δ_c(x) = { 0 if x < c,  1 if x ≥ c }

It is easy to check that if F_1, . . . , F_n are distribution functions, and α_i > 0 with ∑_{i=1}^n α_i = 1, then ∑_{i=1}^n α_i F_i is a distribution function. More generally, any "convex combination" of distribution functions is a distribution function, so a mixture of countably many distribution functions will be a distribution function. Thus, if p_1 > 0, p_2 > 0, . . . , and ∑_{i=1}^∞ p_i = 1, then

∑_{i=1}^∞ p_i δ_{x_i}(x)

is a distribution function with jumps of size p_i at each x_i.

Now, for an arbitrary F, let J_F = {x_1, x_2, . . .} be its set of discontinuities. (Recall that this "jump set" is at most countable.) Let p_i = F(x_i) − F(x_i−) = probability at the ith jump. We say that F is discrete if F = ∑_{i=1}^∞ p_i δ_{x_i}, i.e., all the mass is concentrated on at most countably many points.

Recall that F is absolutely continuous if for every ε > 0 there is a δ > 0 such that ∑_{i=1}^n [F(x′_i) − F(x_i)] < ε for every finite collection {(x_i, x′_i)} of non-overlapping intervals with ∑_{i=1}^n (x′_i − x_i) < δ.

[In response to a student's question, absolute continuity implies uniform continuity, but not conversely.]

A fact (from measure theory, but not proved here): "absolute continuity" implies "can be written as a Riemann integral."

If F is absolutely continuous, then there exists Riemann-integrable f ≥ 0 such that F(x) = ∫_{−∞}^x f(u) du. So ∫_{−∞}^∞ f(u) du = 1 and f is called the density of F.

Proposition. If F is a continuous distribution function, then it is uniformly continuous on R.

Proof: We know that a continuous function on a compact set is uniformly continuous there.

Given ε > 0, we must find δ > 0 such that if |x − y| < δ then |F(x) − F(y)| < ε.

Select x_ℓ < x_u such that F(x_ℓ) < ε/2 and 1 − F(x_u) < ε/2. On [x_ℓ − 1, x_u + 1], F is continuous, hence uniformly continuous. So, given ε > 0, there exists δ > 0, which we can pick to satisfy δ < 1/2, such that if x, y ∈ [x_ℓ − 1, x_u + 1] and |x − y| < δ then |F(x) − F(y)| < ε.


Now consider three cases:⁴

1. x ∈ (−∞, x_ℓ − 1/2):
   |x − y| < δ < 1/2 ⟹ x, y ∈ (−∞, x_ℓ]
   ⟹ |F(x) − F(y)| ≤ F(x) + F(y) < ε/2 + ε/2 = ε.

2. x ∈ [x_ℓ − 1/2, x_u + 1/2]:
   |x − y| < δ < 1/2 ⟹ x, y ∈ [x_ℓ − 1, x_u + 1] ⟹ |F(x) − F(y)| < ε.

3. x ∈ (x_u + 1/2, +∞):
   |x − y| < δ < 1/2 ⟹ x, y ∈ (x_u, +∞)
   ⟹ |F(x) − F(y)| = |1 − F(y) − (1 − F(x))| ≤ |1 − F(y)| + |1 − F(x)| < ε/2 + ε/2 = ε.

In each case we have uniform continuity.

Fact. If F_1(x) = F_2(x) for all x ∈ D, a dense subset of R, then F_1(x) = F_2(x) for all x ∈ R.

This follows from the right-continuity of F.⁵

Fact. F is determined by its values on J_F^c, i.e., the complement of the set of discontinuities of F.⁶

Since J_F is at most countable, its complement is dense, and thus is sufficient to determine F.

Recall our setup of X : (Ω, 𝒜, P) → (R, ℬ, P_X) with F = F_X used to characterize P_X.

4 As presented in class, cases 1 and 3 involved half-closed intervals, but those end points are included with case 2.

5 The idea is that limits are unique. Let x ∈ R. There is a sequence (y_n) of points in D converging to x from the right, for if D is dense in R we can find points of D in every interval of the form (x, x + 1/n). That induces sequences converging to F_1(x) and F_2(x), but those sequences are identical and hence have a common limit.

Formally, suppose ε > 0 is given. There exists N_1 such that n > N_1 ⟹ |F_1(y_n) − F_1(x)| < ε/2. Similarly, there exists N_2 such that n > N_2 ⟹ |F_2(y_n) − F_2(x)| < ε/2. Let N = max{N_1, N_2}; for n > N we have |F_1(x) − F_2(x)| = |F_1(x) − F_1(y_n) + F_1(y_n) − F_2(y_n) + F_2(y_n) − F_2(x)| ≤ |F_1(x) − F_1(y_n)| + |F_1(y_n) − F_2(y_n)| + |F_2(y_n) − F_2(x)| < ε/2 + 0 + ε/2 = ε.

6 GM uses an overbar for complement, so he writes J̄_F instead of J_F^c or J_F′.


Definition. The expected value of X is defined as

EX = ∫_Ω X(ω) dP = ∫_R x dP_X = ∫_{−∞}^{+∞} x dF(x)

if it exists.

Only the first equality is the definition; the other two are theorems. One is a change-of-variables theorem (remember that a random variable is a carrier of a distribution), the other uses the distribution function instead of the probability measure.


Wednesday, September 30, 1998

Riemann-Stieltjes integrals

In the definition of expected value, the quantity ∫_{−∞}^{+∞} x dF(x) is a Riemann-Stieltjes integral, a generalization of the ordinary Riemann integral. (If we take F(x) = x then the Riemann-Stieltjes integral is just the usual Riemann integral.)

Here's how we define ∫_a^b g(x) dF(x).

Let Γ be a partition of [a, b], so a = x_0 < x_1 < · · · < x_n = b, and let ∆Γ = max(x_i − x_{i−1}). Choose ξ_i such that x_{i−1} ≤ ξ_i ≤ x_i. Then, let

∫_a^b g(x) dF(x) = lim_{∆Γ→0} ∑_{i=1}^n g(ξ_i)(F(x_i) − F(x_{i−1}))

if it exists.

Fact 1. Suppose g is continuous, and suppose that F is both monotone (increasing or decreasing) and bounded. Then ∫_a^b g(x) dF(x) exists.

Definition. ∫_{−∞}^{+∞} g(x) dF(x) = lim_{a→−∞, b→+∞} ∫_a^b g(x) dF(x).

Fact 2. Suppose g is bounded and continuous, and F is monotone and bounded on R. Then ∫_{−∞}^{+∞} g(x) dF(x) exists.

Example 1. Suppose g is continuous, and F(x) = α if x < c and F(x) = β if x ≥ c. Then

∫_a^b g dF = g(c)(β − α)

Example 2. Suppose g is continuous, and F is both monotone and differentiable. Assume f = F′ is continuous except perhaps at a finite number of points. By the Mean Value Theorem, we know that F(x_i) − F(x_{i−1}) = f(ξ_i)(x_i − x_{i−1}) for some ξ_i ∈ (x_{i−1}, x_i). Now look at ∑_{i=1}^n g(ξ_i)(F(x_i) − F(x_{i−1})). We know this converges to ∫ g dF. But we can also use the MVT to rewrite this as ∑_{i=1}^n g(ξ_i) f(ξ_i)(x_i − x_{i−1}), which converges to the Riemann integral ∫ gf. (We can pick any ξ_i ∈ [x_{i−1}, x_i] for g, so we need only choose the one from the MVT to ensure that the same ξ_i appears in f and g.)

So,

∫_a^b g dF = ∫_a^b g(x) f(x) dx

and we see that f is a density for an absolutely continuous F.

Fact 3. Assume g is continuous, F is increasing, with jumps at t_1, . . . , t_n in [a, b], and differentiable between those jumps, with a continuous derivative except possibly at a finite number of points. Then

∫_a^b g dF = ∑_{i=1}^n g(t_i)[F(t_i+) − F(t_i−)] + ∫_a^b g(x) F′(x) dx.

Hence, if a distribution function is a mixture of discrete and absolutely continuous components, we can deal with each part separately.

In the purely discrete case, p_i = P[X = x_i], so EX = ∑_i x_i p_i = ∫ x dF. In the absolutely continuous situation, with F′ = f, EX = ∫ x f(x) dx = ∫ x dF. Thus in either case, the Riemann-Stieltjes integral is a sum or a Riemann integral, as appropriate.

Fact. Integration by parts is okay. The usual formula, ∫ u dv = uv − ∫ v du, becomes ∫ g dF = gF − ∫ F dg.

Moments

Definition. We call E(X^k) = ∫ x^k dF the kth moment of X if the expectation exists. (Typically k is an integer, although it need not be.)

When k = 1 we have the mean µ = EX; the variance σ² = E(X − µ)² is the second moment about the mean.

There are several functions that involve moments of a random variable X; they are defined in terms of expected values of functions of X and another argument t.

Probability Generating Function    E t^X
Moment Generating Function         E e^{tX}
Laplace Transform                  E e^{−tX}
Characteristic Function            E e^{itX}

The moment generating function (mgf) may or may not exist, but the characteristic function (ch.f.) always exists. (The ch.f. is the Fourier Transform.)

We often write ϕ_X(t) = ϕ(t) = E e^{itX}.


Complex numbers

A complex number z can be expressed in terms of its real and imaginary parts, using the imaginary unit i = √−1. Thus z = x + iy. Its conjugate is z̄ = x − iy, and its modulus is |z| = √(x² + y²). Note that zz̄ = |z|² and that e^z = e^{x+iy} = e^x e^{iy} = e^x(cos y + i sin y). (This last follows by rearranging the Taylor series for e^{iy} into the Taylor series for sine and cosine.) Since e^{iy} = cos y + i sin y, it follows that |e^{iy}| = √(cos² y + sin² y) = 1.

Now suppose we have a C-valued random variable, Z = X + iY. Then EZ = E(X + iY) = EX + iEY by linearity of the expectation operator.

Fact. |EZ| ≤ E|Z|.

This follows from a two-dimensional version of Jensen's Inequality: if f(·, ·) is convex, then Ef(U_1, U_2) ≥ f(EU_1, EU_2). But f(u_1, u_2) = √(u_1² + u_2²) is a convex function on the plane, so |EZ| = [(EX)² + (EY)²]^{1/2} = f(EX, EY) ≤ Ef(X, Y) = E[(X² + Y²)^{1/2}] = E|Z|.

The characteristic function exists for every random variable, even though the moment generating function may not, for ϕ_X(t) = ϕ(t) = E(e^{itX}) = ∫ e^{itx} dF = ∫ cos tx dF + i ∫ sin tx dF, but sine and cosine are bounded continuous functions, hence both these integrals exist. Thus ϕ(t) exists for all t for every F.

A few facts about characteristic functions

1. ϕ(0) = 1.⁷
2. ϕ_{aX+b}(t) = E e^{it(aX+b)} = e^{itb} E e^{iatX} = e^{itb} ϕ_X(at).
3. ϕ_{−X}(t) = E e^{−itX} = ϕ̄_X(t), the complex conjugate of ϕ_X(t).

Claim. ϕ(t) is uniformly continuous.

Proof: Easy, if we have Lebesgue's Dominated Convergence Theorem. We need to show that, for every ε > 0 there exists δ > 0 such that |ϕ(t) − ϕ(t + h)| < ε whenever |h| < δ.⁸ But

|ϕ(t) − ϕ(t+h)| = |∫ (e^{itx} − e^{i(t+h)x}) dF(x)|
                = |∫ e^{itx}(1 − e^{ihx}) dF(x)|
                ≤ ∫ |e^{itx}| |1 − e^{ihx}| dF(x)
                = ∫ |1 − e^{ihx}| dF(x)

Note that the last integral does not depend on t, so we will obtain a uniform bound.

7 Although it was not mentioned in class, it is worth noting that |ϕ(t)| ≤ 1.

8 GM had no absolute value for h, perhaps it is assumed positive?


What we want to do next is to take the limit of that integral as h → 0 by moving the limit under the integral sign. Recall that an integral is defined in terms of a limit, so what we need to do amounts to interchanging two limit operations—a nontrivial task.


Friday, October 2, 1998

To finish the proof from last time, we have

lim_{h→0} sup_t |ϕ(t) − ϕ(t+h)| ≤ lim_{h→0} ∫ |1 − e^{ihx}| dF(x) = ∫ lim_{h→0} |1 − e^{ihx}| dF(x)

where the last equality is an application of Lebesgue's Dominated Convergence Theorem. Now lim_{h→0} |1 − e^{ihx}| = 0, so we obtain ∫ 0 dF = 0 as desired.

Dominated Convergence Theorem. [Lebesgue] Let F be a distribution function. Let {g_n} be a sequence of measurable functions, and let g and h be measurable functions such that g_n(x) → g(x) for almost all x (i.e., except possibly for a set with probability 0 under F), |g_n(x)| ≤ h(x) for all x and n, and ∫ h dF exists. Then

lim_{n→∞} ∫ g_n dF = ∫ lim_{n→∞} g_n dF = ∫ g dF.

The key assumption is that the g_n converging to g are dominated by some integrable h.

In our application, we have

    n → ∞ ⇔ h → 0,    g_n(x) = 1 − e^{ihx},    g(x) = 0,    h(x) = 17.

Of course there's nothing special about 17—any constant is integrable with respect to a probability measure, and 2 would have done just as well as a dominator, since |1 − e^{ihx}| ≤ 1 + |e^{ihx}| = 1 + 1 = 2.

Another application is the useful fact that if E|X|^k < ∞ then E|X|^r < ∞ whenever 0 ≤ r ≤ k. Thus if the kth moment exists, then so do all lower-order moments. To see this, recall that |x|^r ≤ 1 + |x|^k for all x ∈ R, and recall that constant functions are integrable with respect to distribution functions.


Characteristic functions and moments

Lemma. Let X be a random variable with distribution function F and characteristic function ϕ. If E|X|^n < ∞ for some positive integer n, then Φ^{(r)}(t) = (d^r/dt^r) ϕ(t) exists for r = 1, 2, . . . , n and is a continuous function of t, and

Φ^{(r)}(t) = i^r ∫ x^r e^{itx} dF(x)   for r = 1, 2, . . . , n.

We can evaluate this at 0 to get moments.

Proof: (for the case n = 1; the others follow by induction.)

Φ^{(1)}(t) = lim_{h→0} [ϕ(t+h) − ϕ(t)]/h
           = lim_{h→0} ∫ [e^{i(t+h)x} − e^{itx}]/h dF
           = lim_{h→0} ∫ e^{itx} [e^{ihx} − 1]/h dF

Recall that |e^{itx}| = 1 and assume that hx > 0. (The case h < 0 is handled similarly.)⁹

Now e^{ihx} − 1 = i ∫_0^{hx} e^{iu} du, so |e^{ihx} − 1| ≤ ∫_0^{hx} 1 du = |hx|. Hence

|e^{itx} · (e^{ihx} − 1)/h| ≤ 1 · |hx|/h = |x|

but by assumption ∫ |x| dF < ∞. So,

Φ^{(1)}(t) = lim_{h→0} ∫ e^{itx} · [e^{ihx} − 1]/h dF = ∫ lim_{h→0} e^{itx} · [e^{ihx} − 1]/h dF = i ∫ x e^{itx} dF.

The rest follows by induction.

We can evaluate the derivatives at 0 to obtain the moments.
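The numerical sketch below is not from the notes; it checks Φ^{(1)}(0) = i EX and Φ^{(2)}(0) = i² EX² by finite differences for a small discrete distribution chosen only for illustration.

    # Sketch: derivatives of the ch.f. at 0 give moments (checked numerically).
    import numpy as np

    xs = np.array([0.0, 1.0, 3.0])
    ps = np.array([0.2, 0.5, 0.3])                      # a discrete distribution
    phi = lambda t: np.sum(ps * np.exp(1j * t * xs))    # its characteristic function

    h = 1e-4
    d1 = (phi(h) - phi(-h)) / (2 * h)                   # central difference ~ i * E X
    d2 = (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2       # ~ i^2 * E X^2 = -E X^2

    print(d1, 1j * np.sum(ps * xs))                     # both ~ 1.4i
    print(d2, -np.sum(ps * xs**2))                      # both ~ -3.2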

Examples of characteristic functions.

1. Point mass

If F_X = δ_{x_0} then ϕ(t) = e^{itx_0}.

9 GM uses |e^{itx}| ≤ 1 instead of |e^{itx}| = 1; it also is unclear whether he means hx > 0 or merely h > 0.


2. Convex combinations

Suppose F = ∑_j α_j F_j where each F_j is a distribution function, α_j > 0, and ∑_j α_j = 1. Let ϕ_j be the characteristic function for F_j. Then ϕ is the characteristic function of F, and ϕ = ∑_j α_j ϕ_j.

A special case is when F = αF_1 + (1 − α)F_2. Then ϕ = αϕ_1 + (1 − α)ϕ_2, because we can break up the Riemann-Stieltjes integral. This works for countable as well as finite combinations.

If we combine Examples 1 and 2, we get

3. Discrete distributions

If F = ∑_j p_j δ_{x_j} then ϕ(t) = ∑_j p_j e^{itx_j}.

4. Uniform

Let X ∼ Uniform[a, b]. Then

ϕ(t) = E e^{itX} = ∫_a^b e^{itx} · 1/(b−a) dx = [1/(b−a)] · e^{itx}/(it) |_a^b = [1/(b−a)] · (e^{itb} − e^{ita})/(it).

5. Binomial

ϕ(t) = E e^{itX} = ∑_{j=0}^n e^{itj} · (n choose j) p^j (1−p)^{n−j}
     = ∑_{j=0}^n (n choose j) (p e^{it})^j (1−p)^{n−j} = (p e^{it} + 1 − p)^n
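The small check below is not from the notes; it verifies the closed form against the defining sum for one arbitrary choice of n, p, and t.

    # Sketch: Binomial(n, p) ch.f.  (p e^{it} + 1 - p)^n  vs. the defining sum.
    import numpy as np
    from math import comb

    n, p, t = 7, 0.3, 1.7
    closed_form = (p * np.exp(1j * t) + 1 - p) ** n
    as_a_sum = sum(comb(n, j) * p**j * (1 - p)**(n - j) * np.exp(1j * t * j)
                   for j in range(n + 1))
    print(closed_form, as_a_sum)    # agree up to rounding error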


6. Standard Normal

Suppose X ∼ Normal(0, 1).

ϕ(t) = E e^{itX} = ∫_{−∞}^∞ e^{itλ} · e^{−λ²/2}/√(2π) dλ,

but recall that X is symmetric and recall that sine is odd and cosine is even, so

ϕ(t) = ∫_{−∞}^∞ cos tλ · e^{−λ²/2}/√(2π) dλ.

This random variable has all its moments, so we can differentiate under the integral sign.

ϕ′(t) = ∫_{−∞}^∞ −λ sin tλ · e^{−λ²/2}/√(2π) dλ.

Now integrate by parts. Let u = sin tλ and dv = −λ e^{−λ²/2}/√(2π) dλ, so du = t cos tλ dλ and v = e^{−λ²/2}/√(2π). Again we use the fact that the normal density is even and the sine is odd, so the uv term vanishes. We are left with

ϕ′(t) = 0 − ∫_{−∞}^∞ t cos tλ · e^{−λ²/2}/√(2π) dλ = −t · ϕ(t)

When we solve the differential equation ϕ′ = −tϕ we get ϕ(t) = c e^{−t²/2}, but we know ϕ(0) = 1, so c = 1. Finally, we have

ϕ(t) = e^{−t²/2}.

We can extend this to other normal variates. If X ∼ Normal(µ, σ²), then X = σZ + µ where Z is standard normal. Then

ϕ_X(t) = E e^{itX} = E e^{it(σZ+µ)} = e^{itµ} E e^{iσtZ} = e^{itµ} ϕ_Z(σt) = e^{itµ − σ²t²/2}.

7. Cauchy

Let X have density a/[π(a² + x²)] for x ∈ (−∞, ∞). Using contour integration from complex analysis, it can be shown that

ϕ(t) = e^{−a|t|}.

Recall that this distribution has no moments.


Products (characteristic function for a sum of independent random variables)

Fact. If ϕ_1, ϕ_2, . . . , ϕ_n are characteristic functions then ∏_{j=1}^n ϕ_j is a characteristic function.

Let X_j have distribution function F_j and characteristic function ϕ_j, and assume the X_j's are independent. Let S_n = X_1 + X_2 + · · · + X_n. Then

ϕ_{S_n}(t) = E e^{itS_n} = E e^{it(X_1+···+X_n)} = E[e^{itX_1} e^{itX_2} · · · e^{itX_n}] = E e^{itX_1} E e^{itX_2} · · · E e^{itX_n} = ϕ_1(t) ϕ_2(t) · · · ϕ_n(t)
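The sketch below is not from the notes; it checks the product rule by simulation for one arbitrary pair of independent variables, using the Uniform and Normal characteristic functions from Examples 4 and 6.

    # Sketch: ch.f. of a sum of independent r.v.s equals the product of the ch.f.s.
    import numpy as np

    rng = np.random.default_rng(1)
    t = 1.3
    x1 = rng.uniform(0.0, 1.0, size=500_000)   # X_1 ~ Uniform[0, 1]
    x2 = rng.normal(0.0, 1.0, size=500_000)    # X_2 ~ N(0, 1), independent of X_1

    mc = np.mean(np.exp(1j * t * (x1 + x2)))   # simulation estimate of E e^{it(X_1+X_2)}
    phi1 = (np.exp(1j * t) - 1) / (1j * t)     # Uniform[0,1] ch.f. (Example 4, a=0, b=1)
    phi2 = np.exp(-t**2 / 2)                   # standard normal ch.f. (Example 6)
    print(mc, phi1 * phi2)                     # agree up to Monte Carlo error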

Moment generating functions and characteristic functions

Definition. A function f(z) is analytic in D if f′(z) exists for all z ∈ D, i.e., if

lim_{η→0} [f(z + η) − f(z)]/η = f′(z)

exists for all z ∈ D, where the limit does not depend on the path η → 0.

This is a strong property.

Question: if M_X(t) = E e^{tX} exists, how is it related to ϕ_X(t) = E e^{itX}?

Proposition. Let X be a random variable such that M_X(t) = E e^{tX} exists for |t| < δ for some δ > 0. Assume there exists A(z) analytic in |Re z| < δ′ for some δ′ > δ and that A(t) = M_X(t) for |t| < δ. Then ϕ_X(t) = E e^{itX} = A(it) for all t ∈ R.


Monday, October 5, 1998

Examples

We have some examples of the proposition from last time.

1. Exponential

Suppose X ∼ exponential(1). We know the m.g.f. is E e^{tX} = 1/(1 − t). Now look at f(z) = 1/(1 − z), which is analytic in D = { z : |Re z| < 1 }. Since f(t) = E e^{tX} for |t| < 1, we get ϕ(t) = E e^{itX} = 1/(1 − it).

2. Standard Normal

Suppose X ∼ N(0, 1). The m.g.f. is E e^{tX} = e^{t²/2} for all t ∈ R. Now f(z) = e^{z²/2} is analytic in C; since they agree on R, we get ϕ(t) = E e^{itX} = e^{(it)²/2} = e^{−t²/2}.

So, under some conditions, if we know the m.g.f., we can find the characteristic function.

Fact. [Fourier Inversion] There exists a one-to-one correspondence between distribution functions and characteristic functions. If we know F, we can find ϕ, and if we know ϕ we can find F. Let X be a random variable with distribution function F and characteristic function ϕ. Let x_1 < x_2 be continuity points of F. Then

F(x_2) − F(x_1) = lim_{T→∞} (1/2π) ∫_{−T}^T [(e^{−itx_1} − e^{−itx_2})/(it)] ϕ(t) dt

If we let x1 → −∞ then we know F at every continuity point.Characteristic functions are important because they always exist and they

characterize distributions.Recall the usual proof of the Central Limit Theorem, where we recognize

the m.g.f. of the standard normal distribution. In a similar way, we’ll see in theSpring Quarter that when ϕn → ϕ, then Fn → F in some sense.
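The sketch below is not part of the notes; it applies the inversion formula numerically to the standard normal characteristic function (chosen as an example because ϕ is real and even, so the integral can be taken over t > 0) and recovers P(−1 < X ≤ 1) ≈ 0.6827.

    # Sketch: numerical Fourier inversion for phi(t) = exp(-t^2/2).
    import numpy as np

    phi = lambda t: np.exp(-t**2 / 2)           # standard normal ch.f. (real and even)
    x1, x2, T = -1.0, 1.0, 40.0

    t = np.linspace(1e-6, T, 400_000)           # integrand is conjugate-symmetric in t,
    dt = t[1] - t[0]                            # so integrate over t > 0 and use its real part
    integrand = ((np.exp(-1j * t * x1) - np.exp(-1j * t * x2)) / (1j * t) * phi(t)).real
    print(integrand.sum() * dt / np.pi)         # ~ 0.6827 = P(-1 < X <= 1)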

Why does the inversion formula make sense? A characteristic function contains information about the distribution, i.e., its moments.


Theorem. Suppose X ∼ f and Y ∼ g, where f and g are continuous (hence bounded) probability densities on [0, 1]. If EX^n = EY^n for n = 1, 2, . . . , then ∫_0^a f(t) dt = ∫_0^a g(t) dt for all a ∈ [0, 1]. Thus f and g are the same density. (Differentiate both sides to get f(a) = g(a).)

For R instead of [0, 1] this is the Hamburger moment problem; it is much harder, and we need an extra condition.

Proof: If E(X^n) = E(Y^n) for n = 1, 2, . . . , then E_f[P(X)] = E_g[P(Y)] for every polynomial P, because expectation is linear. But then E_f[C(X)] = E_g[C(Y)] for any continuous function C on [0, 1]. (Recall the theorem (Weierstrass) that polynomials are dense in the space of continuous functions, so we can uniformly approximate continuous functions by polynomials.)

Now choose M such that f(x) ≤ M and g(x) ≤ M for every x ∈ [0, 1]. Define I_a by

I_a(x) = { 1 if x ∈ [0, a],  0 if x ∈ (a, 1] }

and, for ε > 0, define t_ε(x), a continuous approximation to I_a, by

t_ε(x) = { x/ε if 0 ≤ x < ε,   1 if ε ≤ x < a − ε,   (a − x)/ε if a − ε ≤ x < a,   0 if a ≤ x ≤ 1 }

Here's what they look like: [Figure: graphs of I_a(x) (gray) and t_ε(x) (black) on [0, 1].]

Now to show that ∫_0^a f dx = ∫_0^a g dx we want to make |∫ I_a f dx − ∫ I_a g dx| small. So,

|∫ I_a f dx − ∫ I_a g dx| = |∫ (I_a f − I_a g) dx|
  = |∫ (I_a f − t_ε f + t_ε f − t_ε g + t_ε g − I_a g) dx|
  ≤ ∫_{(0,ε)∪(a−ε,a)} |I_a f − t_ε f| dx + ∫_{(0,ε)∪(a−ε,a)} |I_a g − t_ε g| dx + |∫_0^1 t_ε (f − g) dx|

where the first two terms are each ≤ 2εM and the last is 0.


In the last expression, the first two integrals are over the only intervals where the integrands are nonzero, and the third integral vanishes anyway, since t_ε is continuous. Thus we have |∫ I_a f dx − ∫ I_a g dx| ≤ 2εM + 2εM + 0 = 4εM → 0 as ε → 0.

Note: It can be difficult to determine whether a given function is or is not a characteristic function. It may be easiest to check if a function ϕ is a characteristic function by guessing the appropriate F, computing ϕ_F, and comparing that to the original ϕ; although Bochner's Theorem can test a characteristic function, it is almost impossible to use.

Inequalities

Lemma. P(|X| ≥ u) ≤ E(|X|^k)/u^k when k > 0 and u > 0.

When k = 1 this becomes the familiar Chebyshev Inequality.

Proof:

P(|X| ≥ u) = P(|X|/u ≥ 1) = ∫ h(x) dF,   where h(x) = { 1 if |x|/u ≥ 1,  0 if |x|/u < 1 }

Let g(x) = |x|^k/u^k and observe that g dominates h whether k > 1 or k < 1. [Figure: g dominates h, shown for both k < 1 and k > 1, with the jump of h at u.]

Now

∫ h(x) dF(x) ≤ ∫ g(x) dF(x) = E|X|^k / u^k

and the proof is complete.

Corollary. P(|X − EX| ≥ u) ≤ E(X − EX)²/u² = Var X / u².
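A quick simulation check (not from the notes; the exponential example is arbitrary) of the corollary:

    # Sketch: P(|X - EX| >= u) <= Var(X)/u^2, with X ~ exponential(1) (EX = Var X = 1).
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.exponential(1.0, size=1_000_000)
    for u in (1.0, 2.0, 3.0):
        lhs = np.mean(np.abs(x - 1.0) >= u)
        print(u, lhs, 1.0 / u**2)       # empirical probability vs. the Chebyshev bound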


Convex Functions

Let I be an interval of real numbers (possibly the whole line), and let g : I → R. We say g is convex if

g(αx_1 + (1 − α)x_2) ≤ αg(x_1) + (1 − α)g(x_2)    (∗)

whenever 0 < α < 1 and x_1, x_2 ∈ I. We say g is strictly convex when we can replace ≤ by < in (∗). The idea is that a chord lies above the graph of the function.

[Figure: a typical (strictly) convex function, with a chord over the points x_1, αx_1 + (1 − α)x_2, and x_2.]

Examples

1. g(x) = x², I = R. This is the classic example of a convex function.
2. g(x) = |x − a|^p, a ∈ R, p ≥ 1, I = R. This is convex, but not strictly convex.
3. g(x) = e^x, I = R.
4. g(x) = − log x, I = (0, ∞).
5. g(x) = x, I = R.
6. g(x) = 1/x, I = (0, ∞).
7. g(x) = { 0 if 0 ≤ x < 1, 1 if x = 1 }, I = [0, 1].

Theorem. Suppose g is twice differentiable on an open interval I. If g″(x) ≥ 0 for all x ∈ I, then g is convex. If g″(x) > 0 for all x ∈ I, then g is strictly convex.

Proof: Let 0 < α < 1, x_1 ∈ I and x_2 ∈ I, and assume x_1 ≠ x_2. Then x_0 = αx_1 + (1 − α)x_2 ∈ I. By Taylor's Theorem, there exists x* ∈ I such that

g(x) = g(x_0) + g′(x_0)(x − x_0) + [g″(x*)/2](x − x_0)²    (1)

Note that the last term in (1) is nonnegative. Thus

g(x) ≥ g(x_0) + g′(x_0)(x − x_0)    (2)

and this inequality will be strict if x ≠ x_0 and g″ > 0.


Apply (1) with x = x_1 and multiply by α to obtain

αg(x_1) ≥ αg(x_0) + αg′(x_0)(x_1 − x_0)    (3)

Similarly, take x = x_2 and multiply by 1 − α to get

(1 − α)g(x_2) ≥ (1 − α)g(x_0) + (1 − α)g′(x_0)(x_2 − x_0)    (4)

Now add (3) and (4):

αg(x_1) + (1 − α)g(x_2) ≥ αg(x_0) + (1 − α)g(x_0) + αg′(x_0)(x_1 − x_0) + (1 − α)g′(x_0)(x_2 − x_0)
  = g(x_0) + g′(x_0)[α(x_1 − x_0) + (1 − α)(x_2 − x_0)]
  = g(x_0) + g′(x_0)[αx_1 − αx_0 + (1 − α)x_2 − (1 − α)x_0]
  = g(x_0) + g′(x_0)[αx_1 + (1 − α)x_2 − αx_0 − (1 − α)x_0]
  = g(x_0) + g′(x_0)[x_0 − αx_0 − x_0 + αx_0]
  = g(x_0)
  = g(αx_1 + (1 − α)x_2)

and g is convex.

Theorem. [Supporting Line] Suppose g is convex on I. Let x_0 be an interior point of I. Then there exists a real number m such that

g(x) ≥ g(x_0) + m(x − x_0)    (⋆)

for x ∈ I. If g is strictly convex, then (⋆) is strict, except when x = x_0.

[Figure: the line of support for g at x_0.] The tangents are below the curve. This theorem will be easier to prove after we have proved a lemma.

Lemma. Let f(x) = [g(x) − g(x_0)]/(x − x_0) for x ≠ x_0 but x ∈ I. Then f is nondecreasing on I \ {x_0}; f is strictly increasing if g is strictly convex.


So f(x) is just the slope of the chord through (x, g(x)) and (x_0, g(x_0)). [Figure: f is nondecreasing: f(x_1) ≤ f(x_2) when x_1 < x_2.]

Proof: (of lemma) For x_1 < x_2, is f(x_1) ≤ f(x_2)? There are three cases:

i. x0 < x1 < x2

ii. x1 < x0 < x2

iii. x1 < x2 < x0

For all three cases, the trick is to write the middle guy as a convex combination of the other two, e.g., for the first case:

x_1 = [(x_1 − x_0)/(x_2 − x_0)] x_2 + [(x_2 − x_1)/(x_2 − x_0)] x_0 = αx_2 + (1 − α)x_0

Now

g(x_1) ≤ [(x_1 − x_0)/(x_2 − x_0)] g(x_2) + [(x_2 − x_1)/(x_2 − x_0)] g(x_0)

subtract g(x_0) from both sides:

g(x_1) − g(x_0) ≤ [(x_1 − x_0)/(x_2 − x_0)] g(x_2) − [(x_1 − x_0)/(x_2 − x_0)] g(x_0)

group over common denominators:

[g(x_1) − g(x_0)]/(x_1 − x_0) ≤ [g(x_2) − g(x_0)]/(x_2 − x_0)

f(x_1) ≤ f(x_2)

The other cases are handled in the same way.


Wednesday, October 7, 1998

Proof: (for the Supporting Line Theorem) The right-hand derivative of g is f(x_0+) = lim_{x↓x_0} f(x) and the left-hand derivative of g is f(x_0−) = lim_{x↑x_0} f(x). The right-hand derivative and the left-hand derivative exist, and f is nondecreasing, so there exists some m such that f(x_0−) ≤ m ≤ f(x_0+). Hence, for x > x_0, we have [g(x) − g(x_0)]/(x − x_0) = f(x) ≥ f(x_0+) ≥ m, so g(x) ≥ g(x_0) + m(x − x_0). The case x < x_0 is handled in the same way. In the strict case, everything is strict.

Jensen’s Inequality.

Suppose g is convex on an interval I and X is a random variable taking values in I. Suppose EX exists.¹⁰ Then

Eg(X) ≥ g(EX).

If g is strictly convex, and P[X ≠ EX] > 0,¹¹ then

Eg(X) > g(EX).

Proof: Case I: EX is an interior point of I. We know that g(x) ≥ g(x_0) + m(x − x_0) for some m, where x_0 = EX. In other words, g(X) ≥ g(EX) + m(X − EX). Now take expectations on both sides, so Eg(X) ≥ g(EX) + m · 0.

If g is strictly convex, then g(X) > g(EX) + m(X − EX) whenever X ≠ EX, and P[X ≠ EX] > 0, so Eg(X) > g(EX).

Case II: EX is an end point of I. The only way that can occur is if the distribution puts all its mass at that point—i.e., X is a constant, and EX = X. Then obviously g(EX) = g(X) = Eg(X).

The classic example is g(x) = x²; applying Jensen's Inequality yields

[EX]² ≤ E[X²].
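The simulation below is not from the notes; it spot-checks Eg(X) ≥ g(EX) for a few of the convex functions listed earlier, using an arbitrary positive, non-constant X.

    # Sketch: Jensen's inequality E g(X) >= g(E X), checked by simulation.
    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.gamma(shape=2.0, scale=1.0, size=500_000)      # a positive, non-constant X

    for name, g in [("x^2", lambda v: v**2),
                    ("e^x", np.exp),
                    ("-log x", lambda v: -np.log(v)),
                    ("1/x", lambda v: 1.0 / v)]:
        print(name, g(x).mean() >= g(x.mean()))            # True for each function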

We can generalize to higher dimensions.

10 Recall that this means that E|X| < ∞, but we don't have to assume that Eg(X) < ∞.

11 i.e., X is not constant.


Definition. A set B ⊆ R^n is convex if αx_1 + (1 − α)x_2 ∈ B whenever x_1 ∈ B, x_2 ∈ B, and α ∈ [0, 1].

For example, in R² the line segment connecting any pair of points in B lies within B.

An immediate further generalization: ∑_{i=1}^k p_i x_i ∈ B whenever all x_i ∈ B, p_i ≥ 0, and ∑_{i=1}^k p_i = 1.

Examples

1. Rectangles and ellipses are convex.
2. Intersections of convex sets are convex.
3. Let a = (a_1, a_2, . . . , a_n)′ and x = (x_1, x_2, . . . , x_n)′. The "half-space" H_{ab} = { x : a′x ≥ b } is convex.

Definition. A real-valued function f defined on B is convex if (i) B is convex, and (ii) f(αx_1 + (1 − α)x_2) ≤ αf(x_1) + (1 − α)f(x_2) for all α ∈ [0, 1] and x_1, x_2 ∈ B.

Definition. Let S ⊆ R^n. The convex hull C(S) is the intersection of all convex sets containing S.

Since R^n is convex, that intersection is nonempty, so the convex hull exists. C(S) is the smallest convex set containing S.

Lemma. Let D = { y : y = ∑_{i=1}^k p_i x_i, where x_1, . . . , x_k ∈ S, p_i ≥ 0, and ∑_{i=1}^k p_i = 1 }. Then D = C(S).

Proof: D ⊇ S, since we can recover S as a collection of elements of D. D is obviously convex. Therefore D ⊇ C(S). Now suppose B is convex and B ⊇ S. Then B ⊇ D, and therefore D ⊆ C(S).

Example. Let S = {x_1, . . . , x_m} ⊂ R^n and let X = (X_1, . . . , X_n)′ be a vector-valued random variable taking values in S with probabilities p_i. (So p_i ≥ 0 and ∑_{i=1}^m p_i = 1.) That is, p_i = P[X = x_i], so the vector of expectations is EX = ∑_{i=1}^m p_i x_i = (EX_1, . . . , EX_n)′, where EX_j = ∑_{i=1}^m p_i x_{ji} for j = 1, . . . , n. Such an expectation is a convex combination, so EX ∈ C(S) = C({x_1, . . . , x_m}).


Example. Consider the convex function f(z) = e^z. Let b_1, . . . , b_n ∈ R with probabilities p_1, . . . , p_n satisfying p_i ≥ 0 and ∑_{i=1}^n p_i = 1. Apply Jensen's Inequality to get

∏_{j=1}^n e^{b_j p_j} = exp[∑_{j=1}^n b_j p_j] ≤ ∑_{j=1}^n p_j e^{b_j}

where the middle term is f(EX) and the last term is Ef(X). Now look at the first and last terms.

∏_{j=1}^n e^{b_j p_j} ≤ ∑_{j=1}^n p_j e^{b_j}

and let a_j = e^{b_j}, so we can write

∏_{j=1}^n a_j^{p_j} ≤ ∑_{j=1}^n p_j a_j

If p_j = 1/n then

[∏_{j=1}^n a_j]^{1/n} ≤ (1/n) ∑_{j=1}^n a_j

and we have shown that the geometric mean ≤ arithmetic mean.

Now rename a_j^{p_j} as a_j (equivalently, substitute a_j^{1/p_j} for a_j), so this time we get

∏_{j=1}^n a_j ≤ ∑_{j=1}^n p_j a_j^{1/p_j}    (∗)


Generalized Hölder Inequality

Let X_1, . . . , X_n be random variables with E|X_j|^{r_j} < ∞ where r_j > 0 and ∑_{j=1}^n 1/r_j = 1. Let a_j = |X_j| / E(|X_j|^{r_j})^{1/r_j} and p_j = 1/r_j. (So a_j > 0 and ∑_{j=1}^n p_j = 1.) Then (∗) becomes

|X_1 · · · X_n| / ∏_{j=1}^n E(|X_j|^{r_j})^{1/r_j} ≤ ∑_{j=1}^n (1/r_j) · |X_j|^{r_j} / E(|X_j|^{r_j})

Take expected values on both sides and simplify (multiply by the denominator of the left-hand side):

E|X_1 · · · X_n| ≤ ∏_{j=1}^n E(|X_j|^{r_j})^{1/r_j} · ∑_{j=1}^n (1/r_j) · 1

but ∑_{j=1}^n 1/r_j = 1, so we have

E|X_1 · · · X_n| ≤ ∏_{j=1}^n E(|X_j|^{r_j})^{1/r_j}

In particular,

E|XY| ≤ √E(X²) · √E(Y²).
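A quick simulation check of this special case (not from the notes; the dependent pair below is chosen only for illustration):

    # Sketch: E|XY| <= sqrt(E X^2) * sqrt(E Y^2) on simulated, dependent X and Y.
    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(size=200_000)
    y = 0.6 * x + rng.normal(size=200_000)     # deliberately correlated with x

    lhs = np.mean(np.abs(x * y))
    rhs = np.sqrt(np.mean(x**2)) * np.sqrt(np.mean(y**2))
    print(lhs, rhs, lhs <= rhs)                # the inequality holds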

Hölder's Inequality for positive random variables

Let X and Y be positive random variables with finite means (i.e., EX < ∞ and EY < ∞). Then, for p ∈ [0, 1],

E(X^p Y^{1−p}) ≤ (EX)^p (EY)^{1−p}

Proof: Trivial when p = 0 or p = 1, so assume 0 < p < 1. Let X_1 = X^p, X_2 = Y^{1−p}, r_1 = 1/p, and r_2 = 1/(1 − p), so 1/r_1 + 1/r_2 = 1. Now use Hölder's Inequality:

E(X^p Y^{1−p}) = E|X_1 X_2| ≤ (E|X_1|^{1/p})^p (E|X_2|^{1/(1−p)})^{1−p} = (EX)^p (EY)^{1−p}


Moments of a sum of random variables

Let X be a random variable, and assume that the nth moment E(X^n) exists. (Here n does not have to be an integer.)

Proposition 1. (a + b)^α ≤ a^α + b^α if a, b ≥ 0 and 0 ≤ α ≤ 1.

Proof: Fix a. The proposition is clearly true if b = 0. Differentiate both sides with respect to b, getting α(a + b)^{α−1} and 0 + αb^{α−1}. Observe that the derivative of the left-hand part is smaller than that of the right-hand part, so a^α + b^α increases faster than (a + b)^α as a function of b.

Proposition 2. (a + b)^α ≤ 2^{α−1}(a^α + b^α) if a, b ≥ 0 and α ≥ 1.

Proof: x^α is convex, so [(a + b)/2]^α ≤ (a^α + b^α)/2, which is equivalent to the desired result.

Proposition 3. If 0 < α < β then a^α < a^β + 1.

Proof: If a > 1 then the proposition is obviously true. If a ≤ 1, then it is still true because a^α ≤ 1.

By Proposition 3, we get that if E|X|^β < ∞ then E|X|^α < ∞ for 0 < α < β. Combining Propositions 1 and 2, we have |X + Y|^α ≤ K_α(|X|^α + |Y|^α) for α > 0, where K_α = 1 if α < 1 and K_α = 2^{α−1} if α ≥ 1. Then E(|X + Y|^α) ≤ K_α[E(|X|^α) + E(|Y|^α)], so if X and Y have nth moments, then X + Y also does.


Friday, October 9, 1998

Random variables in R².

Let (X, Y) be a pair of random variables defined on a common probability space (Ω, 𝒜, P).

Definition. F(x, y) = P[X ≤ x, Y ≤ y] is the joint distribution function of (X, Y).

This function gives the probability that the random pair (X, Y) is in the shaded area (including the boundary) in the illustration. [Figure: the southwest quadrant with corner at (x, y); joint distribution function.]

The joint distribution function has the following properties:

1. F is non-decreasing in each variable.
2. F is right-continuous in each variable.
3. lim_{x→−∞} F(x, y) = lim_{y→−∞} F(x, y) = 0.
4. lim_{min{x,y}→∞} F(x, y) = 1.
5. F(x″, y″) − F(x″, y′) − F(x′, y″) + F(x′, y′) ≥ 0 for all pairs of points (x′, y′) and (x″, y″) in R² satisfying x′ ≤ x″ and y′ ≤ y″.

The first four of these are obvious generalizations from the one-dimensional situation, and they can be proved in the same way as their one-dimensional versions, using limits of monotone sequences of sets. The fifth one, however, has no obvious one-dimensional counterpart. It says that the distribution assigns non-negative mass to every rectangle aligned with the coordinate axes; we need to explicitly mention this, for it does not follow automatically from the other conditions.

[Figure: basic rectangles, with corners (x′, y′) and (x″, y″), get non-negative mass.]

Conversely, if a function F satisfies conditions 1–5, then F is the joint distribution function for some pair of random variables defined on some sample space.

Moreover, F induces a unique probability measure on the Borel sets in the plane. (If we know the probability of rectangles¹² then we can extend that to the Borel sets.)

Definition. The marginal distribution function of X is F_1(x) = F_X(x) = F(x, ∞) = P[X ≤ x].

Of course there is a corresponding marginal cdf for Y. If we know the joint distribution, then we know the marginal distribution, but not conversely. A joint distribution function F can be discrete, continuous, absolutely continuous, or "singular."

Definition. A joint distribution function F is singular if there exists a Borel set S with Lebesgue measure (i.e., area) equal to zero and ∫_{S^c} dF(x, y) = 0.

An example of a singular distribution would be one with support on the line segment between (0, 0) and (1, 1), with the (1-D) probability measure of an interval being length divided by √2. However, that whole segment has 2-D measure zero. [Figure: example of a singular distribution supported on the diagonal segment.]

A discrete distribution is singular, but not all singular distributions are discrete. Singular distributions will be a problem when we work with the multivariate normal distribution.

12 The rectangles include the top and right edges, but not the bottom or left edges. That's like the half-open intervals we get in the one-dimensional case, where F(b) − F(a) = P(a < X ≤ b).


Definition. We say F is absolutely continuous if there exists f : R² → [0, ∞) such that

F(x, y) = ∫_{−∞}^x ∫_{−∞}^y f(u_1, u_2) du_1 du_2

If the density f is continuous at (x, y) then f = ∂²F/∂x∂y.

Expectations, etc.

If g : R² → R then

Eg(X, Y) = ∫∫ g(x, y) dF(x, y)

provided the integral exists. (As usual, by that we mean ∫∫ |g(x, y)| dF(x, y) < ∞.) We can talk about expected values, variances, and characteristic functions.

The joint characteristic function of X and Y is ϕ(t_1, t_2) = E e^{i(t_1 X + t_2 Y)}, and it has the following properties.

1. ϕ(0, 0) = 1.
2. ϕ is uniformly continuous.
3. ϕ(0, t_2) = E e^{it_2 Y} = ϕ_Y(t_2) is a marginal characteristic function of Y.
4. Moments can be found through ϕ. Assume m_{jk} = E(X^j Y^k) exists. Then

   (∂^j/∂t_1^j)(∂^k/∂t_2^k) ϕ(t_1, t_2) |_{t_1=t_2=0} = m_{jk} · i^{j+k}


Transformations and change of variables

R¹ → R¹

First we recall the one-dimensional case. Suppose X is a continuous random variable with density f, and let Y = u(X); assume u is continuous and strictly increasing, and that u′ exists. We can find a density for Y by differentiating the distribution function for Y. Since F_Y(y) = P[Y ≤ y] = P[u(X) ≤ y] = P[X ≤ u^{-1}(y)] = F_X(u^{-1}(y)), we differentiate to get

g(y) = (d/dy) F_Y(y) = f(u^{-1}(y)) · (d/dy) u^{-1}(y)    (1)

On the other hand, if we assumed that u is continuous but strictly decreasing, then F_Y(y) = P[Y ≤ y] = P[u(X) ≤ y] = P[X ≥ u^{-1}(y)] = P[X > u^{-1}(y)] = 1 − P[X ≤ u^{-1}(y)] = 1 − F_X(u^{-1}(y)), and then

g(y) = (d/dy) F_Y(y) = −f(u^{-1}(y)) · (d/dy) u^{-1}(y)    (2)

Note that in either case, the density g is non-negative, for the factor (d/dy) u^{-1}(y) is positive in (1) but negative in (2). We can combine these for strictly monotone (increasing or decreasing) u to get

g(y) = f(u^{-1}(y)) · |(d/dy) u^{-1}(y)|    (3)

Then

∫_R f(x) dx = ∫_S f(u^{-1}(y)) · |(d/dy) u^{-1}(y)| dy


R² → R²

Let X = (X_1, X_2)′ with joint density f(x) = f(x_1, x_2). Now suppose Y_1 = u_1(X_1, X_2) and Y_2 = u_2(X_1, X_2). We can find the joint density g(y_1, y_2) for Y = (Y_1, Y_2)′ by first inverting the transformation. (Solve for the "old" variables in terms of the "new" variables.) Then we have X_1 = v_1(Y_1, Y_2) and X_2 = v_2(Y_1, Y_2).

Next find the Jacobian of the inverse transformation:

J = ∂(x_1, x_2)/∂(y_1, y_2) = | ∂x_1/∂y_1   ∂x_1/∂y_2 |
                              | ∂x_2/∂y_1   ∂x_2/∂y_2 |

If things are "nice,"

J = ∂(x_1, x_2)/∂(y_1, y_2) = 1 / [∂(y_1, y_2)/∂(x_1, x_2)]

so we may not have to work directly with the inverse transformation.

We use |J| as we had used |(d/dy) u^{-1}| in the earlier case.

Then if the region R ⊆ R² is transformed into a region S ⊆ R², we have

∫∫_R f(x_1, x_2) dx_1 dx_2 = ∫∫_S f(v_1(y_1, y_2), v_2(y_1, y_2)) |∂(x_1, x_2)/∂(y_1, y_2)| dy_1 dy_2,

where the integrand on the right-hand side is the density g(y_1, y_2).

Although in general it can be difficult to find S, for a particular R we can usually determine the corresponding S.
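The sketch below is not from the notes; it works the Jacobian formula for one concrete transformation chosen only for illustration. With X_1, X_2 i.i.d. N(0, 1) and Y_1 = X_1 + X_2, Y_2 = X_1 − X_2, the inverse is x_1 = (y_1 + y_2)/2, x_2 = (y_1 − y_2)/2 with |J| = 1/2, and the formula should reproduce the known answer that Y_1 and Y_2 are independent N(0, 2).

    # Sketch: density of (Y1, Y2) via the Jacobian formula, checked at one point.
    import numpy as np

    f = lambda x1, x2: np.exp(-(x1**2 + x2**2) / 2) / (2 * np.pi)     # joint density of (X1, X2)
    n02 = lambda y: np.exp(-y**2 / 4) / np.sqrt(4 * np.pi)            # N(0, 2) density

    y1, y2 = 0.7, -1.2
    g = f((y1 + y2) / 2, (y1 - y2) / 2) * 0.5     # f(v1(y), v2(y)) * |J|
    print(g, n02(y1) * n02(y2))                   # equal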

Conditioning

Recall the elementary definition of conditional probability:

P(A | B) = P(A ∩ B) / P(B),

when P(B) > 0; observe that P(· | B) behaves like a probability function that is restricted to B.


Example (X and Y discrete)

Let p(x, y) = P(X = x, Y = y) be a joint probability function. Then p_1(x) = ∑_y p(x, y) and p_2(y) = ∑_x p(x, y) are, respectively, the marginal probability functions of X and Y. Now

p(x | y) = P(X = x | Y = y) = p(x, y) / p_2(y)

is the conditional probability function of X given Y whenever p_2(y) > 0. In particular, there is the trivial observation that P(a < X < b | Y = y) = ∑_{a<x<b} p(x | y).

Absolutely continuous case

Now suppose X and Y are absolutely continuous with joint density function f(x, y). Define the marginal density of Y by f_2(y) = ∫_{−∞}^∞ f(x, y) dx. Then, by analogy with the discrete case, we'd like to define a conditional density by

f(x | y) = f(x, y) / f_2(y)

whenever f_2(y) > 0. Although this is appealing, there is a problem. The trouble is that for any given y we would be conditioning on an event with zero probability. Although we can justify this rigorously, a careful definition requires the Radon-Nikodym derivative (from measure theory).

An alternative justification

Here is a naive argument that provides some support for the definition we want, but avoids technicalities from measure theory.

First we recall a form of the Fundamental Theorem of Calculus: if Φ(y) = ∫ from c to y of f(t) dt, then

  Φ′(y) = lim{∆y→0} [Φ(y + ∆y) − Φ(y)] / ∆y
        = lim{∆y→0} [ ∫c^{y+∆y} f(t) dt − ∫c^{y} f(t) dt ] / ∆y
        = lim{∆y→0} (1/∆y) ∫y^{y+∆y} f(t) dt
        = f(y),

provided f is continuous at y.


Now, for ∆y > 0 (with an obvious modification for ∆y < 0), define

  P(a < X < b | Y = y)
    = lim{∆y→0} P(a < X < b | y ≤ Y ≤ y + ∆y)
    = lim{∆y→0} P(a < X < b and y ≤ Y ≤ y + ∆y) / P(y ≤ Y ≤ y + ∆y)
    = lim{∆y→0} [ ∫y^{y+∆y} ∫a^b f(x, t) dx dt ] / [ ∫y^{y+∆y} ∫ f(x, t) dx dt ]
    = lim{∆y→0} [ (1/∆y) ∫y^{y+∆y} ∫a^b f(x, t) dx dt ] / [ (1/∆y) ∫y^{y+∆y} ∫ f(x, t) dx dt ]
    = [ ∫a^b f(x, y) dx ] / [ ∫ f(x, y) dx ]
    = [ ∫a^b f(x, y) dx ] / f2(y)
    = ∫a^b [ f(x, y) / f2(y) ] dx
    = ∫a^b f(x | y) dx,

provided f is continuous and f2 is positive at y. (The inner x-integrals without limits run over all of R.)


Monday, October 12, 1998

Conditional Expectation

We'll discuss the case where densities exist, but the results hold generally.

Suppose we have a pair (X, Y) of random variables, with joint density f(x, y) and conditional density f(x | y) = f(x, y)/f2(y).

Definition. The conditional expectation of X given Y = y is

  E(X | Y = y) = ∫ x f(x|y) dx

provided E|X| < ∞. More generally,

  E[h(X) | Y = y] = ∫ h(x) f(x|y) dx

if h is a measurable function.

Note that these conditional expectations are functions of y.

Theorem. [Iterated expectations]

  E(E(X | Y)) = E(X).

Proof:

  E(E(X | Y)) = ∫ [ ∫ x f(x|y) dx ] f2(y) dy
    = ∫ [ ∫ x (f(x, y)/f2(y)) f2(y) dy ] dx
    = ∫ [ ∫ x f(x, y) dy ] dx
    = ∫ [ x ∫ f(x, y) dy ] dx
    = ∫ x f1(x) dx
    = EX.

Similarly, E(E[h(X) | Y]) = E(h(X)).
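Not part of the notes: a quick Monte Carlo check of the iterated expectations theorem, written in Python for an assumed bivariate normal pair in which E(X | Y) = µX + ρ(σX/σY)(Y − µY); the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_x, mu_y, sd_x, sd_y, rho = 1.0, -2.0, 2.0, 0.5, 0.6

# Simulate (X, Y) by drawing Y and then X from its conditional distribution given Y.
y = rng.normal(mu_y, sd_y, size=500_000)
x = rng.normal(mu_x + rho * (sd_x / sd_y) * (y - mu_y),
               sd_x * np.sqrt(1 - rho**2))

e_x_given_y = mu_x + rho * (sd_x / sd_y) * (y - mu_y)   # E(X | Y)
print("E(E(X | Y)) ≈ %.4f" % e_x_given_y.mean())
print("E(X)        ≈ %.4f" % x.mean())
```

Both printed values should agree up to simulation error, as the theorem asserts.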


Conditional Variance

Definition. The conditional variance of X given Y = y is

  Var(X | Y = y) = ∫ (x − E(X | Y = y))² f(x|y) dx    (1)

Fix y, and use E(X | Y = y) and f(x|y).

Now take expected values of both sides of (1):

  E(Var(X | Y)) = ∫ [ ∫ (x − E(X | Y = y))² f(x|y) dx ] f2(y) dy

and, writing this iterated integral as a double integral,

    = ∫∫ [x − E(X | Y = y)]² f(x, y) dx dy
    = E([X − E(X | Y)]²).    (2)

Now let µ = EX, and write Var X as a double integral:

  Var X = ∫∫ (x − µ)² f(x, y) dx dy
    = E(X − µ)²
    = E(X − E(X | Y) + E(X | Y) − µ)²
    = E[ (X − E(X | Y))² + (E(X | Y) − µ)² + 2(X − E(X | Y))(E(X | Y) − µ) ]
    = E[(X − E(X | Y))²] + E[(E(X | Y) − µ)²] + 2E[(X − E(X | Y))(E(X | Y) − µ)].

The first term is E(Var(X | Y)) from (2), and the second is Var(E(X | Y)), because E(X | Y) is a function (of Y) whose expected value is E[E(X | Y)] = EX = µ, by the iterated expectations theorem. The cross-product term vanishes because we can write it as an iterated expectation:

  E[(X − E(X | Y))(E(X | Y) − µ)] = E{ E[ (X − E(X | Y))(E(X | Y) − µ) | Y ] }.

But, conditioned on Y, the factor (E(X | Y) − µ) is a constant, so we can remove it from the inner expectation. The remaining factor is an iterated expectation whose value is zero:

  E{ E[ X − E(X | Y) | Y ] } = E[X − E(X | Y)] = EX − E(E(X | Y)) = EX − EX = 0.

We have established the following useful fact.


Theorem. Var X = E(Var(X | Y)) + Var(E(X | Y)).
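Not part of the notes: a Python simulation of this variance decomposition for an assumed bivariate normal pair, where Var(X | Y) = σX²(1 − ρ²) is constant; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_x, mu_y, sd_x, sd_y, rho = 0.0, 0.0, 1.5, 2.0, -0.4

y = rng.normal(mu_y, sd_y, size=500_000)
x = rng.normal(mu_x + rho * (sd_x / sd_y) * (y - mu_y),
               sd_x * np.sqrt(1 - rho**2))

cond_mean = mu_x + rho * (sd_x / sd_y) * (y - mu_y)   # E(X | Y)
cond_var = sd_x**2 * (1 - rho**2)                     # Var(X | Y), constant here

print("Var X (sample):             %.4f" % x.var())
print("E(Var(X|Y)) + Var(E(X|Y)):  %.4f" % (cond_var + cond_mean.var()))
```

The two printed numbers agree up to Monte Carlo error, illustrating the theorem.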

Independence

Let (Ω, 𝓕, P) be a probability space.

Definition. The events A and B are independent if P(A ∩ B) = P(A)P(B).

This is equivalent to saying that P(A | B) = P(A) (when P(B) > 0), and also to P(B | A) = P(B) (when P(A) > 0).

Definition. The random variables X and Y are independent if P(X ∈ B1 and Y ∈ B2) = P(X ∈ B1)P(Y ∈ B2) for all Borel sets B1 and B2. We could also express this in terms of independent events: P(X⁻¹(B1) ∩ Y⁻¹(B2)) = P(X⁻¹(B1)) P(Y⁻¹(B2)).

Now let Γ1, Γ2, . . . , Γn be classes of subsets of Ω with each Γi ⊆ 𝓕.

Definition. The classes Γ1, Γ2, . . . , Γn are independent if P(A1 ∩ A2 ∩ · · · ∩ An) = ∏(i=1 to n) P(Ai) for all Ai ∈ Γi, i = 1, 2, . . . , n.

If Ω ∈ Γi for each i, then the property of independence is hereditary, i.e., if Γ1, . . . , Γn are independent, then any subcollection is independent. For example, if Γ1, Γ2, and Γ3 are independent, then Γ1 and Γ2 are independent. This is a trivial observation, but the next one is not.

Fact. If Γ1, Γ2, . . . , Γn are algebras and the Γi's are independent, then the σ-algebras generated by them are independent.

Now suppose we have classes { Γα : α ∈ I } where I is some index set, which need not be countable. This collection is independent if every finite subcollection is independent.

Suppose X is a random variable defined on some probability space (Ω, 𝓕, P). Recall that X is measurable, i.e., X⁻¹(B) ∈ 𝓕 for every Borel set B. Consider a (constant) random variable X such that X(ω) = 17 for all ω ∈ Ω. It would be enough to have 𝓕 = {∅, Ω} to make such an X measurable.

Definition. Let 𝓕(X) = { X⁻¹(B) : B is a Borel set }. Then 𝓕(X) is the smallest σ-algebra which makes X measurable. We call 𝓕(X) the σ-algebra generated by X.

It is clear that 𝓕(X) is a σ-algebra.13

Now we can revisit what we said about independent random variables: X and Y are independent iff 𝓕(X) and 𝓕(Y) are independent. So, for an arbitrary collection of random variables, independence means that every finite subcollection of the generated σ-algebras is independent.

13 Other notation includes σ(X).


Facts about independent random variables

Proposition. The following are equivalent:

1. X and Y are independent.
2. F(x, y) = F1(x)F2(y) for all x and y.
3. ϕ(t1, t2) = ϕ1(t1)ϕ2(t2) for all t1 and t2.
4. In the absolutely continuous case, f(x, y) = f1(x)f2(y) for almost all x and y; in the discrete case, p(x, y) = p1(x)p2(y) for all x and y.

Proof: Clearly (2) ⇐⇒ (4) in the absolutely continuous or discrete cases. To show that (2) =⇒ (3), we write

  ϕ(t1, t2) = E e^{i(t1X + t2Y)}
    = ∫∫ e^{i(t1x + t2y)} dF(x, y)
    = ∫∫ e^{i(t1x + t2y)} d[F1(x)F2(y)]
    = ∫ e^{it1x} dF1(x) ∫ e^{it2y} dF2(y)
    = ϕ1(t1)ϕ2(t2).

To show that (3) =⇒ (2), let ϕ ∼ F and let ϕ1 × ϕ2 ∼ F1 × F2. From (3), we know that ϕ = ϕ1 × ϕ2, so by uniqueness of characteristic functions we get F = F1 × F2.

Now suppose (1) is true. Apply the definition, using A = (−∞, x] and B = (−∞, y], to get (2).

Conversely, assume (2). Now F(x, y) and F1(x)F2(y) agree on sets of the form { (x, y) : x ≤ x0 and y ≤ y0 }, so they agree on rectangles, and then on the Borel sets.

(Figure: F and F1F2 agree on basic sets of the form { (x, y) : x ≤ x0 and y ≤ y0 }.)

But if two measures agree on an algebra, they agree on the σ-algebra generated by the algebra.


Proposition. X and Y are independent if and only if

  E(h1(X)h2(Y)) = E(h1(X)) E(h2(Y))

for all functions h1 and h2 for which the integrals exist.

Proof: For "=⇒" factor F(x, y) = F1(x)F2(y); for "⇐=" take h1(x) = I(−∞,a](x) and h2(y) = I(−∞,b](y) and recall that the expected value of an indicator is the probability of the event it indicates.

Let W = X + Y. Then Var W = Var X + Var Y + 2 Cov(X, Y), where Cov(X, Y) = E[(X − EX)(Y − EY)] = E(XY) − (EX)(EY).

If X and Y are independent, then Cov(X, Y) = 0, and in that case Var(X + Y) = Var X + Var Y.

There were additional comments on the first set of homework problems.


Wednesday, October 14, 1998

Covariance and Correlation

Let (X, Y) be a pair of random variables, and write µX = EX and µY = EY. Then Cov(X, Y) = E[(X − µX)(Y − µY)]. We know that if X and Y are independent, then Cov(X, Y) = 0, but the converse is not true.

Definition. Let X and Y be random variables with 0 < Var X < ∞ and 0 < Var Y < ∞. The correlation coefficient ρ is

  ρ(X, Y) = Cov(X, Y) / ( √(Var X) √(Var Y) ).

Note that ρ(aX + b, cY + d) = ρ(X, Y) when a and c have the same sign.

Schwarz Inequality

Proposition. (Cauchy-Schwarz Inequality) Let U and V be random variables with EU² < ∞ and EV² < ∞. Then

  [E(UV)]² ≤ E(U²) E(V²),

with equality only if U = kV for some constant k (with probability one).

Proof: If EV² = 0, then P(V = 0) = 1 and the inequality is trivially satisfied. So assume EV² > 0. Then

  0 ≤ E[ U − (E(UV)/E(V²)) V ]²    (1)
    = E[ U² − 2U (E(UV)/E(V²)) V + ( (E(UV)/E(V²)) V )² ]
    = E(U²) − 2 (E(UV))²/E(V²) + (E(UV))²/E(V²)
    = E(U²) − (E(UV))²/E(V²),

and we're done. The inequality in (1) is strict unless U − (E(UV)/E(V²)) V = 0, but in that case U = (E(UV)/E(V²)) V = kV.

This result yields another:

Corollary. ρ²(X, Y) ≤ 1, so −1 ≤ ρ(X, Y) ≤ 1. Moreover, ρ²(X, Y) = 1 iff Y − EY = k(X − EX) for some k ≠ 0, in which case ρ(X, Y) = 1 if k > 0 and ρ(X, Y) = −1 if k < 0.

The correlation coefficient is bounded, and invariant under location-scale linear transformations (subject to the sign condition on a and c), so it is an essentially unit-free measure of (linear) association.

Example (Uncorrelated does not imply independent)

Suppose θ ∼ Uniform[0, 2π], and let X = sin θ and Y = cos θ. Observe that

  E(XY) = ∫0^{2π} sin θ cos θ · (1/2π) dθ = (1/4π) ∫0^{2π} sin 2θ dθ = (1/8π) ∫0^{4π} sin u du = 0.

Similarly, EX = EY = 0, so ρ(X, Y) = 0, yet X and Y are clearly not independent.
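Not part of the notes: a short Python simulation of this example. The correlation is essentially zero, yet conditioning on X being near 0 changes the distribution of Y drastically (since X² + Y² = 1), which exhibits the dependence.

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.uniform(0, 2 * np.pi, size=500_000)
x, y = np.sin(theta), np.cos(theta)

print("corr(X, Y) ≈ %.4f" % np.corrcoef(x, y)[0, 1])   # essentially 0

# Dependence: knowing X pins down |Y|.
p_marginal = np.mean(np.abs(y) > 0.9)
p_conditional = np.mean(np.abs(y[np.abs(x) < 0.1]) > 0.9)
print("P(|Y| > 0.9)             ≈ %.3f" % p_marginal)
print("P(|Y| > 0.9 | |X| < 0.1) ≈ %.3f" % p_conditional)
```

The marginal probability is roughly 0.29, while the conditional probability is essentially 1, so X and Y are far from independent even though they are uncorrelated.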


Bivariate Normal ("old-fashioned" version)

Let

  f(x, y) = [ 1 / (2π σ1 σ2 √(1 − ρ²)) ] e^{−Q/2}    (2)

where

  Q = [1/(1 − ρ²)] [ ((x − µ1)/σ1)² − 2ρ ((x − µ1)/σ1)((y − µ2)/σ2) + ((y − µ2)/σ2)² ],

with µi ∈ R and σi > 0 for i = 1, 2, and ρ ∈ (−1, 1). (Here we restrict ρ and the σi's to avoid singular distributions, but later we may allow |ρ| = 1 or σi = 0.)

We claim that f appearing in (2) is a density, and we will see that the parameters µi, σi, and ρ are the means, the standard deviations, and the correlation coefficient.

Now

  (1 − ρ²)Q = ((x − µ1)/σ1)² − 2ρ ((x − µ1)/σ1)((y − µ2)/σ2) + ((y − µ2)/σ2)²
    = ((y − µ2)/σ2)² − 2ρ ((x − µ1)/σ1)((y − µ2)/σ2) + ρ²((x − µ1)/σ1)² + (1 − ρ²)((x − µ1)/σ1)²
    = [ (y − µ2)/σ2 − ρ(x − µ1)/σ1 ]² + (1 − ρ²)((x − µ1)/σ1)²
    = [ ( y − (µ2 + ρ(σ2/σ1)(x − µ1)) ) / σ2 ]² + (1 − ρ²)((x − µ1)/σ1)²,

so, writing b(x) = µ2 + ρ(σ2/σ1)(x − µ1),

  Q = (y − b(x))² / (σ2²(1 − ρ²)) + (x − µ1)²/σ1²,

and then

  f(x, y) = { [1/(σ2 √(2π(1 − ρ²)))] exp[ −(y − b(x))² / (2σ2²(1 − ρ²)) ] } × { [1/(σ1√(2π))] exp[ −(x − µ1)²/(2σ1²) ] }


is a product of conditional and marginal densities of known distributions: the first factor is f(y|x), the density of a N(b(x), σ2²(1 − ρ²)) distribution, and the second factor is f1(x), the density of a N(µ1, σ1²) distribution. That is, X ∼ N(µ1, σ1²) and Y | X = x ∼ N(b(x), σ2²(1 − ρ²)).

Now f(x, y) ≥ 0 and

  ∫∫ f(x, y) dx dy = ∫∫ f(y|x) f1(x) dy dx = ∫ [ ∫ f(y|x) dy ] f1(x) dx = ∫ f1(x) dx = 1,

since the inner integral equals 1; so f(x, y) is a density.

We recognize µ1 and σ1 as the mean and standard deviation of the marginal distribution for X; a similar calculation interchanging X and Y, the µi's, and the σi's shows that µ2 and σ2 are the mean and standard deviation of the marginal distribution for Y.

What about ρ? We can use E(Y | X) = b(X) = µ2 + ρ(σ2/σ1)(X − µ1) to compute Cov(X, Y):

  Cov(X, Y) = E[(X − µ1)(Y − µ2)]
    = E[ E[ (X − µ1)(Y − µ2) | X ] ]
    = E[ (X − µ1) E[ (Y − µ2) | X ] ]
    = E[ (X − µ1)( E(Y | X) − µ2 ) ]
    = E[ (X − µ1) ρ(σ2/σ1)(X − µ1) ]
    = ρ(σ2/σ1) E[(X − µ1)²]
    = ρ(σ2/σ1) σ1²
    = ρσ1σ2,

so ρ = Cov(X, Y)/(σ1σ2) is the correlation coefficient; recall that we assumed ρ² < 1.


Constant density on elliptical contours

Suppose Q = c for some constant c > 0, and assume µ1 = µ2 = 0. Then

  Q = [1/(1 − ρ²)] [ (x/σ1)² − 2ρ(x/σ1)(y/σ2) + (y/σ2)² ] = c,

so

  A x² + B xy + C y² = 1,   where A = 1/((1 − ρ²)cσ1²), B = −2ρ/((1 − ρ²)cσ1σ2), C = 1/((1 − ρ²)cσ2²).

This will be an ellipse iff B² − 4AC < 0, i.e.,

  4ρ²/((1 − ρ²)²c²σ1²σ2²) − 4/((1 − ρ²)²c²σ1²σ2²) < 0 ⇐⇒ 4ρ² − 4 < 0 ⇐⇒ ρ² < 1,

since (1 − ρ²)²c²σ1²σ2² > 0.

Note that x and y appear only in Q, so f is constant if and only if Q is constant; the (nondegenerate) bivariate normal density is constant on ellipses.

Moment generating functions

Let Z ∼ N(0, 1). Then

  MZ(t) = E e^{tZ} = ∫ e^{tz} (1/√(2π)) e^{−z²/2} dz
    = ∫ (1/√(2π)) e^{tz − z²/2} dz
    = ∫ (1/√(2π)) e^{−(z² − 2tz)/2} dz
    = ∫ (1/√(2π)) e^{t²/2 − (z² − 2tz + t²)/2} dz
    = e^{t²/2} ∫ (1/√(2π)) e^{−(z − t)²/2} dz
    = e^{t²/2},

because the last integrand is just the density of a N(t, 1) random variable.

If X ∼ N(µ, σ²) we write X = σZ + µ, where Z ∼ N(0, 1), so

  MX(t) = E e^{tX} = E e^{t(σZ + µ)} = E[ e^{µt} e^{(σt)Z} ] = e^{µt} MZ(σt) = e^{µt + σ²t²/2}.    (3)


If we know the mean and the variance of a normal random variable, we know the moment generating function.

Now we will find the joint moment generating function for a bivariate normal distribution. (In what follows, the integrals are over R² or R.)

  M(t1, t2) = ∫∫ e^{t1x + t2y} f(x, y) dx dy = ∫ e^{t1x} [ ∫ e^{t2y} f(y|x) dy ] f1(x) dx    (4)

but the inner integral is the moment generating function of the conditional distribution of Y given X = x, and we know that distribution—it is normal, with mean b(x) = µ2 + ρ(σ2/σ1)(x − µ1) and variance σ2²(1 − ρ²). Hence, by (3), that integral is

  exp{ b(x)t2 + ½σ2²(1 − ρ²)t2² } = exp{ [µ2 + ρ(σ2/σ1)(x − µ1)]t2 + ½t2²σ2²(1 − ρ²) },

which we can substitute in (4) to get

  M(t1, t2) = ∫ exp{ t1x + [µ2 + ρ(σ2/σ1)(x − µ1)]t2 + ½t2²σ2²(1 − ρ²) } f1(x) dx
    = exp{ µ2t2 − ρ(σ2/σ1)µ1t2 + ½t2²σ2²(1 − ρ²) } ∫ exp{ [t1 + ρ(σ2/σ1)t2] x } f1(x) dx    (5)

but the last integral is just

  MX(t1 + ρ(σ2/σ1)t2) = exp{ µ1(t1 + ρ(σ2/σ1)t2) + (σ1²/2)(t1 + ρ(σ2/σ1)t2)² }

by (3), so (5) becomes

  M(t1, t2) = exp{ µ2t2 − ρ(σ2/σ1)µ1t2 + ½t2²σ2²(1 − ρ²) + µ1(t1 + ρ(σ2/σ1)t2) + (σ1²/2)(t1 + ρ(σ2/σ1)t2)² }
    = exp{ µ2t2 − ρ(σ2/σ1)µ1t2 + ½t2²σ2² − ½t2²σ2²ρ² + µ1t1 + ρµ1(σ2/σ1)t2 + (σ1²/2)( t1² + 2ρ(σ2/σ1)t1t2 + ρ²(σ2²/σ1²)t2² ) }
    = exp{ µ2t2 − ρ(σ2/σ1)µ1t2 + ½t2²σ2² − ½t2²σ2²ρ² + µ1t1 + ρµ1(σ2/σ1)t2 + ½σ1²t1² + ρσ1σ2t1t2 + ½ρ²σ2²t2² }
    = exp{ µ1t1 + µ2t2 + ½( σ1²t1² + 2ρσ1σ2t1t2 + σ2²t2² ) }.


Matrix formulation

For a random vector X = (X1, X2)′, we define EX = (EX1, EX2)′ = (µ1, µ2)′ = µ; a similar definition will hold for matrices.

We define the variance-covariance matrix by

  Σ = (σij)2×2 = E[(X − µ)(X − µ)′] = E [ (X1 − µ1)²            (X1 − µ1)(X2 − µ2) ]
                                        [ (X1 − µ1)(X2 − µ2)    (X2 − µ2)²         ].

If Σ = (σ11 σ12; σ21 σ22), then σii = E(Xi − µi)² = Var Xi = σi², and σ12 = σ21 = Cov(X1, X2) = ρσ1σ2, so we can write

  Σ = [ σ1²      ρσ1σ2 ]
      [ ρσ1σ2    σ2²   ].

Now we can express our earlier results for the bivariate normal in matrix form, writing x = (x, y)′ and t = (t1, t2)′.

1. It is easy to check that Q = (x − µ)′Σ⁻¹(x − µ).
2. The joint characteristic function of (X1, X2)′ is e^{it′µ − ½t′Σt}.
3. The moment generating function14 of (X1, X2)′ is e^{t′µ + ½t′Σt}.
4. The joint density is f(x, y) = [1/(2π√|Σ|)] e^{−½(x − µ)′Σ⁻¹(x − µ)}, where |Σ| = det Σ = σ1²σ2²(1 − ρ²).

14 Actually the moment generating function was not mentioned at this point, nor had we looked at the characteristic function earlier in the lecture.
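Not part of the notes: a short Python check that the matrix form of the bivariate normal density (item 4 above) agrees with the factored form f(y|x)f1(x) and with SciPy's implementation, at one arbitrarily chosen point and parameter set.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu1, mu2, s1, s2, rho = 1.0, -0.5, 2.0, 1.5, 0.7
mu = np.array([mu1, mu2])
Sigma = np.array([[s1**2, rho * s1 * s2],
                  [rho * s1 * s2, s2**2]])

x, y = 0.3, 0.8
v = np.array([x, y])

# Matrix form: f = (2*pi*sqrt(det Sigma))^{-1} * exp(-Q/2), Q = (v - mu)' Sigma^{-1} (v - mu).
Q = (v - mu) @ np.linalg.inv(Sigma) @ (v - mu)
f_matrix = np.exp(-Q / 2) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

# Factored form: f(y|x) * f1(x), with Y | X = x ~ N(b(x), s2^2 (1 - rho^2)).
b_x = mu2 + rho * (s2 / s1) * (x - mu1)
f_factored = norm.pdf(y, b_x, s2 * np.sqrt(1 - rho**2)) * norm.pdf(x, mu1, s1)

print(f_matrix, f_factored, multivariate_normal(mu, Sigma).pdf(v))
```

All three printed values coincide, as the algebra above shows they must.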


Friday, October 16, 1998

A continuation of material from the end of the previous class:

Suppose X = (X1, X2)′ ∼ N(µ, Σ). Now suppose A = (a11 a12; a21 a22) with |A| ≠ 0. Then Y = AX ∼ N(Aµ, AΣA′).15

Singular distributions

Last time's method is "old fashioned" because it does not handle singular distributions. We will later see that there is a way to include singular multivariate normal distributions, and we will need to be able to do that.

An example might be a univariate normal distribution concentrated on the line y = x, a line through the origin—all the mass is concentrated on a set of Lebesgue measure (area) zero, so no density on R² can represent it. If we look only at the subspace (the line), there ought to be a way to see the mass concentrated there.

The problem is that we cannot ignore the distribution within the smaller space, yet we cannot ignore the larger space in which probability appears to be zero. (After all, we can always embed Rn in some larger Rm.) It's like the difficulty in conditioning on a set of measure zero for an absolutely continuous random variable.

(Figures: a singular density concentrates mass on a line in R²; as a subset of R², its support has Lebesgue measure zero.)

15 Although not mentioned in class, it is also true that, if b ∈ R², then AX + b ∼ N(Aµ + b, AΣA′).


Prediction

Assume we have a pair (X, Y) of random variables, with (known) joint density f(x, y) and with EX² < ∞ and EY² < ∞. Now we observe X = x and we want to predict Y. If we use h(X) to predict Y, we will use expected squared error E[Y − h(X)]² as a loss function to measure how well we did.

Theorem. E[Y − h(X)]² ≥ E[Y − E(Y | X)]², so the best predictor is the conditional expectation of Y given X.

Proof: This is a standard argument we adapt for many problems: subtract and add the conditional expectation, expand the square, and use iterated expectations and conditioning. Write W(X) = E(Y | X) − h(X). Then

  E[Y − h(X)]² = E[Y − E(Y | X) + W(X)]²
    = E[ (Y − E(Y | X))² + (W(X))² + 2(Y − E(Y | X)) W(X) ]
    = E[(Y − E(Y | X))²] + E[(W(X))²] + 2E[(Y − E(Y | X)) W(X)].    (1)

There's nothing we can do about the first term in (1). However, if we can show that the third (cross-product) term in (1) vanishes, then we minimize the sum of the remaining terms by making the second term as small as possible. The second term is E[(W(X))²], and that is zero if and only if W(X) = 0 with probability one; since W(X) = E(Y | X) − h(X), all we have to do is pick h(X) = E(Y | X). But first we need to show that the cross-product term does in fact equal zero.

Here's where we use iterated expectations after conditioning:

  E[ 2(Y − E(Y | X)) W(X) ] = 2E[ E( (Y − E(Y | X)) W(X) | X ) ],

but W(X) is a constant when we condition on X, so this equals

  2E[ W(X) E( Y − E(Y | X) | X ) ],

and, still conditioning on X,

  = 2E[ W(X) ( E(Y | X) − E(Y | X) ) ] = 2E[W(X) · 0] = 0.

Since the cross-product term vanishes, we are free to select h(X) = E(Y | X) to minimize the loss (as measured by mean squared error).
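Not part of the notes: a small Python experiment comparing the mean squared error of h(X) = E(Y | X) with that of another (arbitrary) predictor, for an assumed model in which E(Y | X) = 2X.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000
x = rng.normal(0, 1, n)
# Y = 2X + noise with conditional mean zero, so E(Y | X) = 2X.
y = 2 * x + rng.normal(0, 1, n) * np.sqrt(1 + x**2)

best = 2 * x                 # h(X) = E(Y | X)
other = 2.3 * x + 0.5        # some other predictor, for comparison

print("MSE of E(Y|X):       %.4f" % np.mean((y - best) ** 2))
print("MSE of 2.3*X + 0.5:  %.4f" % np.mean((y - other) ** 2))
```

The conditional-expectation predictor has the smaller mean squared error, as the theorem guarantees; any other choice of h only adds the nonnegative term E[(W(X))²].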


Linear Predictors

Let's change the problem a bit. Instead of considering all predictors, now suppose we are limited to linear predictors. In other words, our h(X) must have the form α + βX. We then seek α and β to minimize

  E(Y − α − βX)²    (2)

This is linear regression by least squares.

Theorem. The solution to the linear prediction problem is

  α = EY − βEX,   β = Cov(X, Y) / Var X,   min over α, β of E(Y − α − βX)² = (1 − ρ²) Var Y,

where ρ is the correlation coefficient.

Proof: Recall that if EU² < ∞ then E(U − α)² is minimized when α = EU. (The proof of this is just like that of the theorem for the general prediction problem: add and subtract, expand the square, etc.16)

Hence, for a fixed β, E(Y − βX − α)² is minimized at

  α = E(Y − βX) = EY − βEX.    (3)

Now substitute this for α in (2) to get a function of β:

  E(Y − α − βX)² = E( Y − [EY − βEX] − βX )²
    = E( Y − EY − β[X − EX] )²
    = E( (Y − EY)² + β²[X − EX]² − 2β(Y − EY)[X − EX] )
    = E(Y − EY)² + β²E[X − EX]² − 2βE[(Y − EY)(X − EX)]
    = Var Y + β² Var X − 2β Cov(X, Y),    (4)

which is a quadratic in β. We can differentiate—or just use high-school algebra to find the vertex—and we find that the minimum occurs at

  β = −b/(2a) = −(−2 Cov(X, Y)) / (2 Var X) = Cov(X, Y) / Var X.

Now use this in (3) to find α = EY − βEX,

16 It is somewhat easier, since conditioning is not required. Here it is: E(U − α)² = E(U − EU + EU − α)² = E(U − EU)² + E(EU − α)² + 2E[(U − EU)(EU − α)] = E(U − EU)² + E(EU − α)² + 2(EU − α)E[U − EU] = E(U − EU)² + E(EU − α)², since E[U − EU] = 0. This is minimized when E(EU − α)² = 0 ⇐⇒ (EU − α)² = 0 ⇐⇒ EU − α = 0 ⇐⇒ α = EU.


and substitute in (4) to find the minimum value:

  Var Y + β² Var X − 2β Cov(X, Y)
    = Var Y + (Cov(X, Y)/Var X)² Var X − 2 (Cov(X, Y)/Var X) Cov(X, Y)
    = Var Y + [Cov(X, Y)]²/Var X − 2[Cov(X, Y)]²/Var X
    = Var Y − [Cov(X, Y)]²/Var X
    = Var Y − ( [Cov(X, Y)]² / (Var X Var Y) ) Var Y
    = ( 1 − [Cov(X, Y)]² / (Var X Var Y) ) Var Y
    = (1 − ρ²) Var Y,

and the proof is complete.
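Not part of the notes: a Python check of the theorem on simulated data with an assumed linear model; the moment-based α and β are compared with NumPy's least-squares fit, and the minimized loss is compared with (1 − ρ²)Var Y.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
x = rng.gamma(2.0, 1.0, n)
y = 3.0 - 1.5 * x + rng.normal(0, 2.0, n)

beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)
alpha = y.mean() - beta * x.mean()
rho = np.corrcoef(x, y)[0, 1]

print("alpha, beta from the theorem: %.3f, %.3f" % (alpha, beta))
print("np.polyfit (least squares):  ", np.polyfit(x, y, 1))   # [slope, intercept]
print("min MSE: %.4f   vs  (1 - rho^2) Var Y: %.4f"
      % (np.mean((y - alpha - beta * x) ** 2), (1 - rho**2) * y.var()))
```

The two parameter estimates and the two loss values agree up to sampling error.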

Comments:

1. Suppose X and Y are uncorrelated. Then β = 0 and α = EY, and the minimum for (2) is Var Y.
2. In some sense, correlation measures the improvement in prediction: the minimum mean squared error is reduced by the factor 1 − ρ². So if |ρ| = 1, then we have perfect prediction, and the minimum in (2) is zero.
3. If the joint distribution is bivariate normal, then the "best" predictor (over all predictors) is the best linear predictor, because E(Y | X) is linear. In this situation, the linear case is the general case, and not an actual restriction.

Memory aid

This may help in recalling what β is and the form of the conditional expectation for the bivariate normal case. Think of the basic identity "standardized Y = ρ · standardized X":

  (Y − µ2)/σ2 = ρ (X − µ1)/σ1,

and solve it for Y:

  Y = µ2 + ρ(σ2/σ1)(X − µ1);

the coefficient ρσ2/σ1 is β, and the whole right-hand side is E(Y | X) in the bivariate normal case.


Random Vectors

Suppose (Ω, 𝓕, P) is a probability space, and let X: Ω → Rn be a Borel-measurable function. That is, for every Borel set17 B ⊆ Rn, we assume X⁻¹(B) ∈ 𝓕.

Let t = (t1, . . . , tn)′. Then t′X is a random variable (an R-valued Borel-measurable function).

Now let ti′ = (0, 0, . . . , 0, 1, 0, . . . , 0), with a 1 in the ith position and 0 elsewhere, and let Xi = ti′X, so X = (X1, . . . , Xn)′. Each Xi is just a component, or projection.

Joint Distribution Functions

We can easily generalize what we've already done for the case n = 2. We define the joint distribution function F(x1, . . . , xn) = P(X1 ≤ x1, . . . , Xn ≤ xn). As before, we note that F has the following properties.

1. F is right-continuous.
2. F(x1, . . . , −∞, . . . , xn) = 0. (It is enough to have −∞ in at least one position.)
3. F(+∞, . . . , +∞) = 1. (All arguments infinite.)18
4. FXi(xi) = F(+∞, . . . , +∞, xi, +∞, . . . , +∞) is the marginal distribution function for Xi.
5. There are joint marginal distribution functions for any subcollection of the Xi's; e.g., FXi,Xj(xi, xj) = F(+∞, . . . , +∞, xi, +∞, . . . , +∞, xj, +∞, . . . , +∞) is the marginal distribution function for (Xi, Xj).
6. There is a "box condition." We need to ensure that n-dimensional rectangles in Rn have non-negative probability. Here is an example, with n = 3. Consider the rectangle (a1, b1] × (a2, b2] × (a3, b3]. We need 0 ≤ P(a1 < X1 ≤ b1, a2 < X2 ≤ b2, a3 < X3 ≤ b3), and this requires
   F(b1, b2, b3) − [F(a1, b2, b3) + F(b1, a2, b3) + F(b1, b2, a3)] + [F(a1, a2, b3) + F(a1, b2, a3) + F(b1, a2, a3)] − F(a1, a2, a3) ≥ 0.

17 The collection of Borel sets is the smallest σ-algebra that contains all n-dimensional rectangles in Rn.

18 This condition was not mentioned in lecture today, although the corresponding form was mentioned when we did the 2-dimensional case earlier.


Example where the "box condition" fails.

A question arose about the necessity of this last condition. Here is an example in R² showing why we need that condition.

Let F(x, y) = 0 if x + y < 0 or x < −3 or y < −3, and F(x, y) = 1 if x + y ≥ 0 and x ≥ −3 and y ≥ −3.

Note that F(x, y) = 1 on the boundary (to ensure right-hand continuity), as shown in the figures below.

(Figures: the regions where F = 0 and F = 1, separated by the lines x = −3, y = −3, and x + y = 0; and the rectangle with corners (−1, −1) and (2, 2). Captions: "F is not a joint cdf"; "Rectangle with negative mass.")

It is easy to show that F satisfies conditions 1–5; however, it fails to satisfy condition 6. To see that, consider the rectangle (−1, 2] × (−1, 2], as shown in the figure on the right. It would have probability F(2, 2) − F(−1, 2) − F(2, −1) + F(−1, −1) = 1 − 1 − 1 + 0 = −1 < 0.
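Not part of the notes: the failure of the box condition for this F can be checked in a couple of lines of Python.

```python
def F(x, y):
    """The candidate cdf from the example: 1 on {x + y >= 0, x >= -3, y >= -3}, else 0."""
    return 1.0 if (x + y >= 0 and x >= -3 and y >= -3) else 0.0

a1, b1, a2, b2 = -1, 2, -1, 2
mass = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)
print("mass assigned to (-1,2] x (-1,2]:", mass)   # prints -1.0, so F is not a joint cdf
```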

Multivariate characteristic functions and moment generating functions.

For t ∈ Rn, we define the joint characteristic function by ϕ(t) = E e^{it′X}.

ϕ has the following properties:

1. ϕ(0) = 1.
2. ϕ is uniformly continuous.
3. |ϕ(t)| ≤ 1.
4. ϕ always exists.
5. There is a one-to-one correspondence between ϕ's and F's.
6. If the moment E(X1^j1 X2^j2 · · · Xn^jn) exists, then

  E(X1^j1 X2^j2 · · · Xn^jn) = (1 / i^(j1+j2+···+jn)) · ∂^(j1+j2+···+jn) ϕ(t) / (∂t1^j1 ∂t2^j2 · · · ∂tn^jn), evaluated at t = 0.

We also define the joint moment generating function by MX(t) = E e^{t′X}; its properties are similar to those of its univariate form.


Monday, October 19, 1998

Independence and random vectors.

Suppose X = (X1, X2, . . . , Xn)′.

Definition. The components Xi are independent if

  P(X1 ∈ B1, . . . , Xn ∈ Bn) = ∏(i=1 to n) P(Xi ∈ Bi)

for all Borel sets Bi.

Definition. X is absolutely continuous if

  F(x1, . . . , xn) = ∫(−∞ to xn) · · · ∫(−∞ to x1) f(u1, . . . , un) du1 . . . dun,

where f ≥ 0 and ∫Rn f(u) du = 1.

Proposition. The following are equivalent:

1. X1, . . . , Xn are independent.
2. F(x1, . . . , xn) = ∏(i=1 to n) FXi(xi).
3. ϕ(t) = ∏(i=1 to n) ϕXi(ti).
4. f(x1, . . . , xn) = ∏(i=1 to n) fXi(xi) in the absolutely continuous case.

Proposition. X1, . . . , Xn are independent if and only if

  E[ ∏(i=1 to n) hi(Xi) ] = ∏(i=1 to n) E[hi(Xi)]

for all hi for which the expectations exist.


Conditional distributions

Let X = (X1, . . . , Xn)′, and let f(x1, . . . , xn) be the joint density function. As usual, we'll discuss the absolutely continuous case, but the results are more general.

Once we have the joint distribution, the marginal distribution for any subset of the Xi's is well-defined. Hence, the conditional distribution of one subset given another subset is well-defined. In practice, this can be hard to do.

Sometimes, when modeling, we can specify the marginal distributions of certain subsets of the variables and the conditional distributions of certain subsets of the variables. Suppose, for example, that n = 7, and f(x3, x4 | x1, x2), f(x4 | x2, x5), f(x1, x5), and f(x5, x6, x7) are given. The question is "Does there exist f(x1, . . . , x7) such that the above are all true?" The answer is "yes"—provided the given distributions are consistent. For instance, we can get f(x5) from f(x5, x6, x7) and also from f(x1, x5), but these had better be the same f(x5). However, such a joint f(x1, . . . , x7) need not be unique.

The next step is an infinite sequence (X1, X2, . . .) of random variables. Using a theorem of Kolmogorov, we'll be able to assume, e.g., that there exists a sequence (X1, X2, . . .) of iid N(µ, σ²) random variables. We essentially need the joint distribution of every finite subset to be specified in a consistent way to ensure existence. There is also a version for an uncountable collection of random variables.

Let (X, Y, Θ)′ be a random vector with joint density19 function f(x, y, θ).

Proposition 1.

  f(y | x) = ∫ f(y | x, θ) f(θ | x) dθ.

Proof:

  ∫ f(y | x, θ) f(θ | x) dθ = ∫ [ f(x, y, θ)/f(x, θ) ] [ f(θ, x)/f(x) ] dθ = f(x, y)/f(x) = f(y | x).

Proposition 2. Now suppose that, given Θ = θ, X and Y are independent. Then

  f(y | x) = ∫ f(y | θ) f(θ | x) dθ.

Proof:

  f(y | x, θ) = f(x, y, θ)/f(x, θ) = f(x, y | θ)f(θ) / [f(x | θ)f(θ)] = f(x | θ)f(y | θ)f(θ) / [f(x | θ)f(θ)] = f(y | θ).

The key step is the next-to-last equality, where we use the conditional independence of X and Y given Θ; combining this with Proposition 1 gives the result.

19 This “density” could be a probability function, or mixture.


Bayesian calculation example

Suppose X1, X2, . . . , Xn, Xn+1 | Θ = θ ∼ iid Bernoulli(θ) with Θ ∼ Beta(α, β), where α and β are positive and known. Recall that

  fα,β(θ) = [ Γ(α + β)/(Γ(α)Γ(β)) ] θ^(α−1) (1 − θ)^(β−1)

is a density on (0, 1) and E(Θ) = α/(α + β).

We want to apply Proposition 2 with X = (X1, . . . , Xn) and Y = Xn+1, because, conditioned on Θ, X and Y are independent. But to do that, we will need f(θ | x).

  f(θ | x1, . . . , xn) = f(x1, . . . , xn, θ)/f(x1, . . . , xn) = f(x1, . . . , xn | θ)f(θ) / ∫ f(x1, . . . , xn, θ) dθ,

but the denominator is just the normalizing constant that makes this integrate to 1, so

  f(θ | x1, . . . , xn) ∝ f(x1, . . . , xn | θ) f(θ)
    ∝ θ^(Σxi) (1 − θ)^(n−Σxi) θ^(α−1) (1 − θ)^(β−1)    (the sums Σxi in the exponents have n terms)
    = θ^(α+Σxi−1) (1 − θ)^(β+n−Σxi−1),

which we recognize as the kernel of a Beta(α + Σxi, β + n − Σxi) density.

Now use Proposition 2:

  f(xn+1 | x1, . . . , xn)
    = ∫ θ^(xn+1) (1 − θ)^(1−xn+1) · [ Γ(α + β + n) / ( Γ(α + Σ(i=1 to n) xi) Γ(β + n − Σ(i=1 to n) xi) ) ] θ^(α+Σxi−1) (1 − θ)^(β+n−Σxi−1) dθ

(the first factor is f(y | θ) and the remainder is f(θ | x))

    = [ Γ(α + β + n) / ( Γ(α + Σ(i=1 to n) xi) Γ(β + n − Σ(i=1 to n) xi) ) ] ∫ θ^(α+Σxi−1) (1 − θ)^(β+n+1−Σxi−1) dθ,

where the sums in the exponents of the integrand now include n + 1 terms,

    = [ Γ(α + β + n) / ( Γ(α + Σ(i=1 to n) xi) Γ(β + n − Σ(i=1 to n) xi) ) ] · [ Γ(α + Σ(i=1 to n+1) xi) Γ(β + n + 1 − Σ(i=1 to n+1) xi) / Γ(α + β + n + 1) ].    (1)

Now use the fact that Γ(k) = (k − 1)Γ(k − 1) (which reduces to (k − 1)! for integer k) to simplify (1). We are only interested in the special case xn+1 = 1, and in that case we get

  P(Xn+1 = 1 | x1, . . . , xn) = ( Σ(i=1 to n) xi + α ) / (α + β + n).    (2)


Suppose that we want to express ignorance—that is, we want to use a noninformative prior distribution for a parameter. One way to do this is to assume a uniform distribution for Θ. This is a silly argument.20 Nevertheless, we'll take α = β = 1, which makes the Beta distribution for Θ into a uniform distribution on the unit interval. Then (2) becomes

  P(Xn+1 = 1 | x1, . . . , xn) = ( Σ(i=1 to n) xi + 1 ) / (2 + n).    (3)

Now suppose that x1 = · · · = xn = 1. Then (3) becomes

  P(Xn+1 = 1 | X1 = · · · = Xn = 1) = (n + 1)/(n + 2).    (4)

This argument has been used to estimate the probability that the Sun would rise tomorrow, given that it had risen—or was assumed to have risen—for the previous n days.21 Note that (n + 1)/(n + 2) → 1 as n → ∞ when all the xi = 1, and that this limit doesn't depend on α and β.
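Not part of the notes: a short Python check of formula (2). The closed-form posterior predictive probability is compared with a direct numerical integration of θ against the Beta posterior, for an assumed data set of n ones and a uniform prior.

```python
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.integrate import quad

alpha, b, data = 1.0, 1.0, [1, 1, 1, 1, 1]   # n ones, uniform prior (alpha = beta = 1)
n, s = len(data), sum(data)

closed_form = (s + alpha) / (alpha + b + n)               # formula (2); (n+1)/(n+2) = 6/7 here

# Numerical check: P(X_{n+1} = 1 | data) = E(Theta | data), posterior is Beta(alpha + s, b + n - s).
posterior = beta_dist(alpha + s, b + n - s)
numerical, _ = quad(lambda t: t * posterior.pdf(t), 0, 1)

print("closed form:        ", closed_form)
print("numerical integral: ", numerical)
```

Both values equal 6/7 ≈ 0.857, which is Laplace's rule of succession for n = 5.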

Often an uninformative prior will provide results with good frequentist properties. The expression in (2) will be close to the MLE (the sample mean) if α and β are small and equal; however, in the limit as α → 0 and β → 0, we get an improper prior with "density" proportional to 1/(θ(1 − θ)).

A question led to further discussion about what it means to assume ignorance, and what a distribution for a parameter means. Frequentist and objective/subjective Bayesian ideas were compared. See a 1996 JASA paper by Kass and Wasserman for more information.

There were some comments on HW2.

20 One reason is that we are assuming a scale on which the distribution is uniform, but an equivalent reparameterization would no longer have a uniform distribution. Are we more ignorant of Θ than we would be of Θ², or √Θ, or log Θ?
21 Laplace's Rule of Succession.


Wednesday, October 21, 1998

More n-dimensional versions of what we've seen earlier for 1 or 2 dimensions.

Let X = (X1, . . . , Xn)′ with density f(x) = f(x1, . . . , xn).22

Suppose

  y1 = u1(x1, . . . , xn),  y2 = u2(x1, . . . , xn),  . . . ,  yn = un(x1, . . . , xn),

and assume we can solve to get

  x1 = v1(y1, . . . , yn),  x2 = v2(y1, . . . , yn),  . . . ,  xn = vn(y1, . . . , yn).

Thus u: Rn → Rn and we can think of Y = u(X).

Now let the Jacobian J be defined by

  J = ∂(x1, . . . , xn)/∂(y1, . . . , yn) = det( ∂xi/∂yj ),

the determinant of the n × n matrix of partial derivatives ∂xi/∂yj, and recall that under suitable regularity conditions,

  |∂(x1, . . . , xn)/∂(y1, . . . , yn)| = 1 / |∂(y1, . . . , yn)/∂(x1, . . . , xn)|.

Now we can write the density of Y = u(X) as

  g(y1, . . . , yn) = f( v1(y1, . . . , yn), . . . , vn(y1, . . . , yn) ) |J|.

22 Here is another standard abuse of notation. A density is a map f: Rn → R, and we sometimes write f(x) instead of f(x1, . . . , xn). We probably ought to write f(x′) instead, but we pretend that a vector and its transpose are the same in this context. However, we are more careful when we write row and column vectors as if they were row and column matrices; e.g., we write the inner product of two vectors as if they were matrices: x′y.


This leads to a standard way of finding the density for a function of several Xi's. For example, if Y = X1 + · · · + Xn, we could let Y1 = Y and define other Yi's so that there is a 1–1 transformation whose joint density we could find by the previous method. Then we could integrate out the extra variables to get the desired marginal density for Y1 = Y.

Functions of independent random variables.

But another approach works if the Xi's are independent. In that situation we can use characteristic functions. For example, if Y = X1 + · · · + Xn with the Xj's independent, then

  ϕY(t) = E e^{itY} = E e^{it Σ(j=1 to n) Xj} = E ∏(j=1 to n) e^{itXj} = ∏(j=1 to n) E e^{itXj} = ∏(j=1 to n) ϕXj(t),

where independence is used to factor the expectation of the product. (A similar result holds for moment generating functions.) A useful special case: if the Xj's also have the same distribution, then ϕY(t) = [ϕX1(t)]^n.

Examples:

1. Normal

Suppose the Xj's are independent with Xj ∼ N(µj, σj²). We know ϕXj(t) = E e^{itXj} = e^{itµj − σj²t²/2}, so

  ϕY(t) = exp{ it( Σ(j=1 to n) µj ) − ½( Σ(j=1 to n) σj² ) t² },

which we recognize as the characteristic function of a N( Σ(j=1 to n) µj, Σ(j=1 to n) σj² ) distribution.

2. Poisson

Suppose the Xj's are independent with Xj ∼ Poisson(λj). Then ϕXj(t) = exp[λj(e^{it} − 1)]. By the same argument as before, if Y = X1 + · · · + Xn, then

  ϕY(t) = ∏(j=1 to n) ϕXj(t) = ∏(j=1 to n) exp[λj(e^{it} − 1)] = exp{ Σ(j=1 to n) λj(e^{it} − 1) } = exp{ [ Σ(j=1 to n) λj ](e^{it} − 1) },

so we see that Y ∼ Poisson( Σ(j=1 to n) λj ).

3. Binomial

Suppose Xj ∼ binomial(nj, p) and the Xj's are independent. Note that p is the same for every Xj. This time, ϕXj(t) = [ pe^{it} + (1 − p) ]^{nj}. As before, let Y = X1 + · · · + Xn. Then we find that

  ϕY(t) = ∏(j=1 to n) ϕXj(t) = ∏(j=1 to n) [ pe^{it} + (1 − p) ]^{nj} = [ pe^{it} + (1 − p) ]^{Σ(j=1 to n) nj},

and we conclude that Y ∼ binomial( Σ(j=1 to n) nj, p ).
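Not part of the notes: a quick Python simulation of the Poisson case, comparing the empirical distribution of a sum of independent Poisson variables with the Poisson pmf at the summed rate.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(7)
lams = [0.5, 1.2, 3.3]
samples = sum(rng.poisson(lam, size=300_000) for lam in lams)   # Y = X1 + X2 + X3

# Y should be Poisson with rate sum(lams) = 5.0.
emp = np.bincount(samples, minlength=15)[:15] / samples.size
theo = poisson.pmf(np.arange(15), sum(lams))
print("max |empirical - Poisson(5) pmf| =", np.abs(emp - theo).max())
```

The discrepancy is only simulation noise, consistent with Y ∼ Poisson(Σ λj).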


Random matrices

Suppose Z = (Zij) is an m × n matrix of random variables. Then EZ = (EZij); equivalently, (EZ)ij = EZij. Clearly, E(Z′) = (EZ)′.

Special case: if X = (X1, . . . , Xn)′ then EX = (EX1, . . . , EXn)′.

Proposition. Let Z be a random matrix and let A, B, and C be real matrices such that AZB + C is well-defined. Then E(AZB + C) = A(EZ)B + C.

Proof: Use the definition of matrix multiplication and the linearity of the integral. The (i, l) element of AZB + C is Σ(j,k) aij Zjk bkl + cil, so

  E( Σ(j,k) aij Zjk bkl + cil ) = Σ(j,k) aij E(Zjk) bkl + cil,

which is the (i, l) element of A(EZ)B + C.

Corollary. E(a′X + b) = a′EX + b.

Mean Vectors and Covariance Matrices

Definition. Let X = (X1, . . . , Xn)′. Then we define the mean vector by

  EX = (EX1, . . . , EXn)′ = (µ1, . . . , µn)′ = µ,

provided all these expectations exist.

Fact. µ is the unique vector such that E(a′X) = a′µ for all a.

Proof: We know that µ works. Now suppose there is some v such that E(a′X) = a′v for all a. Then a′v = a′µ for all a, so a′(v − µ) = 0 for all a; taking a = v − µ gives (v − µ)′(v − µ) = 0, so v − µ = 0 and v = µ.

Definition. Let X = (X1, . . . , Xn)′ and assume that EXi² < ∞ for i = 1, . . . , n, so Cov(Xi, Xj) exists for all i, j. Then the covariance matrix is Σ = Cov(X) = E[(X − µ)(X − µ)′], and we note that Σij = σij = Cov(Xi, Xj) = E[(Xi − µi)(Xj − µj)].


Definition. An n × n real matrix A is symmetric if A = A′.

Definition. An n × n real matrix A is positive semidefinite if it is symmetric and b′Ab ≥ 0 for all b ∈ Rn. It is positive definite if, in addition, b′Ab > 0 for all b ∈ Rn \ {0}.

Definition. An n × n real matrix A is orthogonal if AA′ = I. In this event, the rows—or columns—form an orthonormal set: they are mutually orthogonal and each has length 1.

Definition. An n × n real matrix D = (dij) is diagonal if dij = 0 whenever i ≠ j. If we write di instead of dii, then D = diag(d1, d2, . . . , dn). Note that some of the di's may be 0.

Fact. If A is symmetric, then there exist an orthogonal matrix Γ and a diagonal matrix D such that A = Γ′DΓ.


Lemma.

1. Σ = E(XX′) − µµ′.
2. If A is m × n and b is m × 1, then Cov(AX + b) = A Cov(X) A′.
3. Σ is the unique positive semidefinite matrix such that Var(a′X) = a′Σa for all a ∈ Rn.

Proof: For the first part, just use the definition:

  Σ = E[(X − µ)(X − µ)′]
    = E[(X − µ)(X′ − µ′)]
    = E[XX′ − Xµ′ − µX′ + µµ′]
    = E[XX′] − E[X]µ′ − µE[X′] + µµ′
    = E[XX′] − µµ′ − µµ′ + µµ′
    = E[XX′] − µµ′.

For the second part, first consider the special case A = I, so we have Cov(X + b), and use the first part:

  Cov(X + b) = E[(X + b)(X + b)′] − (µ + b)(µ + b)′
    = E[(X + b)(X′ + b′)] − (µ + b)(µ′ + b′)
    = E[XX′ + bX′ + Xb′ + bb′] − (µµ′ + µb′ + bµ′ + bb′)
    = E[XX′] + bµ′ + µb′ + bb′ − µµ′ − µb′ − bµ′ − bb′
    = E[XX′] − µµ′.

Now it is easy to show that Cov(AX + b) = Cov(AX).23 Using that, we have

  Cov(AX + b) = Cov(AX)
    = E[(AX)(AX)′] − E(AX)(E(AX))′
    = E[AXX′A′] − Aµµ′A′
    = A E[XX′] A′ − Aµµ′A′
    = A( E[XX′] − µµ′ )A′
    = A Cov(X) A′.

23 We know that E(AX + b) = AEX + b = Aµ + b, hence (AX + b) − E(AX + b) = (AX + b) − (Aµ + b) = AX − Aµ. (Note that this doesn't involve b at all—so we might as well assume b = 0, in which case it is obviously true that Cov(AX + b) = Cov(AX + 0) = Cov(AX).)

Even without that observation, it is still a straightforward computation: Cov(AX + b) = E[(AX − Aµ)(AX − Aµ)′] = E[(AX − Aµ)(X′A′ − µ′A′)] = E[AXX′A′ − AµX′A′ − AXµ′A′ + Aµµ′A′] = E[AX(AX)′] − Aµµ′A′ − Aµµ′A′ + Aµµ′A′ = E[AX(AX)′] − Aµ(Aµ)′ = Cov(AX).


For the third part, first check that Σ works:

  Var(a′X) = Cov(a′X) = a′ Cov(X)(a′)′ = a′ Cov(X) a = a′Σa.

Since Var(a′X) ≥ 0 always, Σ is positive semidefinite.

Now let Σ1 be another matrix that works. Then a′Σ1a = a′Σa, so a′(Σ − Σ1)a = 0; set B = Σ − Σ1. Thus a′Ba = 0 for every a ∈ Rn. Note that B is symmetric, so we can diagonalize it: write B = Γ′DΓ where Γ is orthogonal and D is diagonal. Any orthogonal Γ has |det Γ| = 1, so orthogonal matrices are nonsingular, and hence we can express any a ∈ Rn as a = Γb for some b. Now a′Da = (Γb)′D(Γb) = b′(Γ′DΓ)b = b′Bb = 0 for every b ∈ Rn, and as b ranges over Rn so does a = Γb, so a′Da = 0 for every a ∈ Rn. Taking a to be each standard basis vector shows D = 0. Then B = 0 too, so Σ1 = Σ.
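Not part of the notes: a numerical illustration, in Python, of part 2 of the lemma, Cov(AX + b) = A Cov(X) A′, with an arbitrarily chosen Σ, A, and b.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, N = 4, 3, 400_000

Sigma = np.array([[2.0, 0.6, 0.3, 0.0],
                  [0.6, 1.0, 0.2, 0.1],
                  [0.3, 0.2, 1.5, 0.4],
                  [0.0, 0.1, 0.4, 0.8]])
X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)   # rows are draws of X

A = rng.normal(size=(m, n))
b = rng.normal(size=m)
Y = X @ A.T + b                                           # Y = AX + b, applied row-wise

print("sample Cov(AX + b):\n", np.round(np.cov(Y.T), 3))
print("A Sigma A':\n", np.round(A @ Sigma @ A.T, 3))
```

The sample covariance of Y matches AΣA′ up to Monte Carlo error; the value of b has no effect, as footnote 23 explains.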

Next time, we'll prove the following result. Let 𝓡 be the range of Σ, i.e., 𝓡 = { Σa : a ∈ Rn }.

Lemma. P[X ∈ 𝓡 + µ] = 1. In particular, if Σ is singular (|Σ| = 0), then X is supported by an affine subspace24 of dimension ≤ n − 1.

24 An affine subspace is a translated linear subspace, or hyperplane.


Friday, October 23, 1998

Before proving the lemma from the end of last time's class, we need a few more facts from linear algebra.

Suppose A is an n × n real matrix.

Definition. The range of A is 𝓡A = { Ax : x ∈ Rn }.

Definition. The nullspace of A is 𝓝A = { x ∈ Rn : Ax = 0 }.

Clearly 𝓡A and 𝓝A are linear subspaces of Rn, and dim(𝓡A) + dim(𝓝A) = n.

Definition. The orthogonal complement of a subspace 𝓦 ⊆ Rn is 𝓦⊥ = { y ∈ Rn : y is perpendicular to every member of 𝓦 }.

If A is symmetric, then

1. 𝓡A = 𝓝A⊥
2. 𝓝A = 𝓡A⊥
3. Rn = 𝓡A ⊕ 𝓝A,

where ⊕ denotes the direct sum of vector subspaces.

Definition/Characterizations. Suppose W1, . . . , Wk are linear subspaces of a vector space V. Then the following are equivalent:

1. W1, . . . , Wk are (linearly) independent. (That is, if Σ(i=1 to k) αi = 0 with αi ∈ Wi, then every αi = 0; the only way for a sum to be 0 is for every term to be 0.)
2. Each α ∈ V can be expressed uniquely in the form α = α1 + · · · + αk, with αi ∈ Wi.
3. If Bi is a basis for Wi, then the union of the Bi, i = 1, . . . , k, is a basis for V.
4. V = W1 ⊕ · · · ⊕ Wk. (V is the direct sum of the Wi's.)

Definition. Suppose α, β ∈ Rn, with α = (α1, . . . , αn)′ and β = (β1, . . . , βn)′. The inner product of α and β is ⟨α, β⟩ = Σ(i=1 to n) αiβi.

Note that ⟨α, β⟩ = ⟨β, α⟩ and that ⟨α, β⟩ = α′β.

In R³, the inner product (also called the dot product) is the product of the length of α, the length of β, and the cosine of the angle between α and β. So ⟨α, β⟩ = 0 ⇐⇒ α ⊥ β.


Projections, components

Suppose y1, . . . , yn is an orthonormal basis for Rn, i.e., a basis whose elements are mutually orthogonal and have length 1.

If x ∈ Rn, we can write x = Σ(i=1 to n) ci yi. Note that

  ⟨x, yj⟩ = ⟨ Σ(i=1 to n) ci yi, yj ⟩ = Σ(i=1 to n) ci ⟨yi, yj⟩ = cj.

So x = Σ(i=1 to n) ⟨x, yi⟩ yi.

The inner product gives the length of the projection onto a basis element; we can recover a vector from its components.

Proof of the lemma from last time.

Recall the lemma.

Lemma. P[X ∈ 𝓡 + µ] = 1. In particular, if Σ is singular (|Σ| = 0), then X is supported by an affine subspace of dimension ≤ n − 1.

Proof: It is enough to show that P[X − µ ∈ 𝓡] = 1, so assume µ = 0.

Let m = dim 𝓡. If m = n then there is nothing to prove, so suppose m < n.

Because Σ is a covariance matrix, it is symmetric. Hence 𝓝 = 𝓝Σ = 𝓡⊥, and there is an orthonormal basis y1, . . . , yn for Rn such that y1, . . . , ym is a basis for 𝓡 and ym+1, . . . , yn is a basis for 𝓝.

For every x ∈ Rn, we write x = Σ(i=1 to n) ⟨x, yi⟩ yi.

Look at P(X ∉ 𝓡); we want to show this is 0. Now X ∈ 𝓡 if and only if ⟨X, yi⟩ = 0 for m + 1 ≤ i ≤ n. Thus

  P[X ∉ 𝓡] = P( ⟨X, yi⟩ ≠ 0 for some i with m + 1 ≤ i ≤ n ) ≤ Σ(i=m+1 to n) P( ⟨X, yi⟩ ≠ 0 ).

Now recall that X = (X1, . . . , Xn)′ is a random vector and yi = (yi1, . . . , yin)′ is a fixed basis vector, so ⟨X, yi⟩ is a real-valued random variable, which we can write as yi′X. (For a fixed yi, we observe X and then compute the inner product.)

We will show that, for i > m, both the expected value and the variance of yi′X are zero, so yi′X must be zero with probability one.

First, E⟨X, yi⟩ = E[ Σ(j=1 to n) Xj yij ] = Σ(j=1 to n) yij EXj = ⟨yi, µ⟩ = 0, because we assumed µ = 0.

Second, for a real-valued random variable the variance and the covariance matrix coincide, so Var⟨X, yi⟩ = Var(yi′X) = yi′Σyi = yi′0 = 0 for i > m, since yi ∈ 𝓝 means Σyi = 0.

So P[⟨X, yi⟩ ≠ 0] = 0 for i > m, and hence P[X ∉ 𝓡] = 0.

We'll have another lemma about 𝓡, but first we need to recall another fact from linear algebra.


Fact. Let S be a symmetric n × n positive semidefinite matrix. If X′SX = 0, then SX = 0.

Proof: First suppose S is diagonal,25 say S = D = diag(d1, . . . , dk, 0, . . . , 0) with d1, . . . , dk > 0. Then X′DX = Σ(i=1 to k) di Xi² = 0 implies Xi = 0 for i = 1, . . . , k, and hence DX = 0.

For the general case, write S = Γ′DΓ where D is diagonal and Γ is orthogonal. Then 0 = X′SX = X′(Γ′DΓ)X = (X′Γ′)D(ΓX) = (ΓX)′D(ΓX), so by the special case we know that D(ΓX) = 0. Now multiply by Γ′ to get Γ′[D(ΓX)] = 0, but that is just (Γ′DΓ)X = SX = 0.

Now we're ready for that other lemma.

Lemma. 𝓡 is the smallest linear subspace 𝓜 of Rn such that P[X − µ ∈ 𝓜] = 1.

Proof: As before, we assume µ = 0. Let 𝓜 be another linear subspace such that P(X ∈ 𝓜) = 1.

Let y1, . . . , yn be an orthonormal basis for Rn such that y1, . . . , yr is a basis for 𝓜 and yr+1, . . . , yn is a basis for 𝓜⊥.

Just as before, X = Σ(i=1 to r) ⟨X, yi⟩ yi a.s., since ⟨X, yi⟩ = 0 a.s. for i = r + 1, . . . , n.

This implies Var⟨X, yi⟩ = Var(yi′X) = yi′Σyi = 0 for i > r. Now use the previous result26 to conclude that Σyi = 0, i.e., yi ∈ 𝓝, for i = r + 1, . . . , n. Therefore 𝓜⊥ ⊆ 𝓝 = 𝓡⊥, so 𝓜 ⊇ 𝓡.

25 This will be a standard way to represent an n × n diagonal matrix of rank k ≤ n. We will generally assume that the non-zero diagonal elements come first, and if there are any 0's on the diagonal, they appear last.
26 X′SX = 0 =⇒ SX = 0, with X = yi and S = Σ.


Monday, October 26, 1998

Multivariate Normal Distributions

Recall the univariate normal, X ∼ N(µ, σ²), with characteristic function ϕX(t) = e^{itµ − σ²t²/2}.

• If σ² > 0 then X = σZ + µ where Z ∼ N(0, 1).
• If σ² = 0 then X ≡ µ (a singular distribution).

One difficulty: a multivariate normal in 3-space may really be bivariate, concentrated on a subspace. A useful generalization to higher dimensions needs to take this into account.

Definition. We say X = (X1, . . . , Xn)′ has a multivariate normal distribution if a′X is univariate normal for all a ∈ Rn.

Theorem. For any µ ∈ Rn and any positive semidefinite n × n matrix Σ, there is a unique multivariate normal distribution with mean µ and covariance matrix Σ.

Proof: (Existence) Let Z1, . . . , Zn ∼ iid N(0, 1), and let Z = (Z1, . . . , Zn)′. Then EZ = 0 and Cov(Z) = I.

Note that a′Z = ⟨a, Z⟩ = Σ(i=1 to n) ai Zi is a sum of independent scaled standard normal random variables, hence a′Z ∼ N(0, a′a).

Write Σ = Γ′DΓ where Γ is orthogonal and D = diag(d1, . . . , dk, 0, . . . , 0), where k is the rank of Σ.

Note that di ≥ 0 for all i, so we can let D^{1/2} = diag(√d1, . . . , √dk, 0, . . . , 0).

Now set A = Γ′D^{1/2}Γ; then Σ = AA′, as is easily checked.27 (So A = Σ^{1/2}, in some sense.) Now let X = AZ + µ; this generalizes X = σZ + µ from the univariate situation.

Observe that EX = E(AZ + µ) = AEZ + µ = 0 + µ = µ and Cov X = Cov(AZ + µ) = A(Cov Z)A′ = AIA′ = AA′ = Σ, so we have constructed a random vector X with the proper mean and covariance.

We still need to show that it satisfies the definition (so it is multivariate normal) and that it is unique.

Look at a′X = a′(AZ + µ) = (a′A)Z + a′µ; this is a linear combination of the Zi's plus a constant, so a′X ∼ N(a′µ, a′AA′a) = N(a′µ, a′Σa), and X is multivariate normal.

It remains to show that X is unique.28 The characteristic function of X is ϕX(t) = E e^{it′X} = ϕ(t′X)(1); but t′X is univariate normal29 with mean t′µ and variance t′Σt, so ϕ(t′X)(1) = e^{it′µ − ½t′Σt}.

Now if Y = (Y1, . . . , Yn)′ is another multivariate normal random vector with EY = µ and Cov Y = Σ, then by the same argument ϕY(t) = e^{it′µ − ½t′Σt}. Finally, by uniqueness of characteristic functions, ϕY(t) = ϕX(t) implies Y ∼ X.
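Not part of the notes: a Python sketch of the existence construction for one assumed (µ, Σ). It builds A = Γ′D^{1/2}Γ from the spectral decomposition, samples X = AZ + µ, and checks the sample mean and covariance. (NumPy's `eigh` returns the decomposition as Σ = V diag(d) V′, i.e., Γ = V′ in the notation above.)

```python
import numpy as np

rng = np.random.default_rng(9)
mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

d, V = np.linalg.eigh(Sigma)                               # Sigma = V diag(d) V'
A = V @ np.diag(np.sqrt(np.clip(d, 0, None))) @ V.T        # symmetric square root, A = Gamma' D^{1/2} Gamma

Z = rng.standard_normal((500_000, 3))
X = Z @ A.T + mu                                           # X = A Z + mu, applied row-wise

print("A A' - Sigma (should be ~0):\n", np.round(A @ A.T - Sigma, 10))
print("sample mean:", np.round(X.mean(axis=0), 3))
print("sample covariance:\n", np.round(np.cov(X.T), 3))
```

The construction works even when Σ is singular, since only nonnegative square roots of the eigenvalues are needed.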

It is important to note that this definition does not look at densities. We write X ∼ Nn(µ, Σ), or X ∼ N(µ, Σ) when n is understood.

Theorem.

1. If X ∼ Nn(µ, Σ), then X has a density iff Σ is positive definite (i.e., nonsingular).
2. If Σ is nonsingular, then the density is

  f(x) = [ 1 / ( (2π)^{n/2} √|Σ| ) ] e^{−½(x − µ)′Σ⁻¹(x − µ)}.

Proof: ("only if" part of 1.) By contradiction. Suppose Σ is singular. Then the range space 𝓡 of Σ has dimension < n, and P(X ∈ 𝓡 + µ) = 1.

27 Just multiply: (Γ′D^{1/2}Γ)′ = Γ′(D^{1/2})′(Γ′)′ = Γ′D^{1/2}Γ, so AA′ = (Γ′D^{1/2}Γ)(Γ′D^{1/2}Γ) = Γ′D^{1/2}(ΓΓ′)D^{1/2}Γ = Γ′D^{1/2}D^{1/2}Γ = Γ′DΓ = Σ.

28 Of course, it is the distribution of X that is unique.
29 Take a = t in the definition.


if X has a density f on Rn, then∫5+µ

f = 0 because dim(5 + µ) < n. This isa contradiction, as

∫Bf = P (X ∈ B).

(“if” part of 1, and 2.) Let’s suppose Σ is nonsingular. Then Σ = AA′

where A is nonsingular and X ∼ AZ + µ where Z ∼ Nn(0, In) and the density

of Z is g(z) =(

1√2π

)ne−

12 z′z.

Now change variables: X ∼ AZ + µ =⇒ Z = A−1(X − µ). Then thedensity of X must be f(x) = g

(A−1(x−µ)

)|J | = g

(A−1(x−µ)

) ∣∣A−1∣∣, where

the Jacobian J is just A−1 because we have a linear transformation. Recall thatA = Γ′D1/2Γ, so A−1 =

(Γ′D1/2Γ

)−1= Γ−1(D1/2)−1(Γ′)−1 = Γ′D−1/2Γ, and

then∣∣A−1

∣∣ =∣∣Γ′D−1/2Γ

∣∣ = |Γ′|∣∣D−1/2

∣∣ |Γ| = ∣∣D−1/2∣∣ =

∣∣D−1∣∣1/2 = |D|−1/2 =

Σ−1/2.Observe that z′z =

(A−1(x−µ)

)′(A−1(x−µ)

)= (x−µ)′

(A−1

)′A−1(x−

µ) = (x−µ)′(A′)−1

A−1(x−µ) = (x−µ)′(AA′

)−1(x−µ) = (x−µ)′Σ−1(x−µ).

Now plug these into g to get f(x) =(

1√2π

)n|Σ|−1/2

e−12 (x−µ)′Σ−1(x−µ).

Proposition. X ∼ Nn(µ, Σ) ⇐⇒ a′X ∼ N(a′µ, a′Σa) for all a ∈ Rn.

Proof: "⇐=" Just use the definition; in fact, µ and Σ are unique. "=⇒" By the definition, a′X is univariate normal, and it has the right mean and variance: E(a′X) = a′EX = a′µ and Var(a′X) = a′Σa.

This extends the definition slightly—a′X is univariate normal, with the stated mean and variance. Next we have a slightly stronger result:

Proposition. Let X ∼ Nn(µ, Σ), let B be an m × n matrix, and let c ∈ Rm. Then BX + c ∼ Nm(Bµ + c, BΣB′).

Proof: Look at a′(BX + c) = (a′B)X + a′c ∼ N( (a′B)µ + a′c, (a′B)Σ(a′B)′ ) = N( a′(Bµ + c), a′BΣB′a ). Apply the previous proposition and we're done.

Proposition. Suppose the n × 1 vector X is partitioned as X = (Y′, Z′)′, where Y is m × 1 and Z is (n − m) × 1, and let µ = (ξ′, η′)′ and Σ = [Σ11, Σ12; Σ21, Σ22] be the correspondingly partitioned mean vector and covariance matrix. Then Y ∼ Nm(ξ, Σ11) and Z ∼ N(n−m)(η, Σ22).

Proof: Let B = ( Im  0 ), an m × n matrix, and write Y = BX. By the previous proposition, Y is multivariate normal with mean Bµ = ξ and covariance BΣB′ = Σ11. The proof for Z works the same way (or else just renumber and put the Z's first).

Proposition. Let X = (Y′, Z′)′, µ = (ξ′, η′)′, and Σ = [Σ11, Σ12; Σ21, Σ22]; note that Cov(Y, Z) = Σ12. Then Y and Z are independent iff Σ12 = 0.


Proof: Recall that Y and Z are independent if and only if their joint characteristic function factors as ϕX(a) = ϕY(b)ϕZ(c) for all a = (b′, c′)′ ∈ Rn.

Now,

  a′X = (b′ c′)(Y′, Z′)′ = b′Y + c′Z,
  a′µ = b′ξ + c′η,
  a′Σa = b′Σ11b + b′Σ12c + c′Σ21b + c′Σ22c = b′Σ11b + b′Σ12c + c′Σ12′b + c′Σ22c.    (∗)

Next recall the characteristic function ϕX(a) = e^{ia′µ − ½a′Σa} and plug in the expressions from (∗) to get

  ϕX(a) = e^{ia′µ − ½a′Σa}
    = e^{i(b′ξ + c′η) − ½(b′Σ11b + b′Σ12c + c′Σ12′b + c′Σ22c)}
    = [ e^{ib′ξ − ½b′Σ11b} ] [ e^{ic′η − ½c′Σ22c} ] [ e^{−½(b′Σ12c + c′Σ12′b)} ]
    = ϕY(b) ϕZ(c) K,

where K denotes the last factor. Note that K = 1 for all b and c if and only if b′Σ12c + c′Σ12′b = 2b′Σ12c = 0 for all b and c, and that occurs exactly when Σ12 = 0.

Corollary. (Bivariate normal, n = 2.) With X = (X1, X2)′, X1 ⊥⊥ X2 ⇐⇒ σ12 = 0.

Corollary. If X = (X1, . . . , Xn)′ is multivariate normal, then the Xi's are (mutually) independent if and only if they are pairwise independent.

Independence implies pairwise independence, but the converse does not hold generally. However, for multivariate normal random variables, it does.


Wednesday, October 28, 1998

Proposition. Let X = (Y′, Z′)′, µ = (ξ′, η′)′, and Σ = [Σ11, Σ12; Σ21, Σ22], and assume that Σ22⁻¹ exists. Then Z and Y − Σ12Σ22⁻¹Z are jointly normal and independent, and

  Y − Σ12Σ22⁻¹Z ∼ N( ξ − Σ12Σ22⁻¹η, Σ11 − Σ12Σ22⁻¹Σ21 ).

Proof: Let V = (W′, Z′)′ where W = Y − Σ12Σ22⁻¹Z. Then

  V = (W′, Z′)′ = [ Im, −Σ12Σ22⁻¹; 0, In−m ] (Y′, Z′)′ = BX.

From last time, V is jointly normal. (It is a linear transformation of X.) Now check that it has the proper mean and covariance.

  EV = E(BX) = BEX = Bµ = [ Im, −Σ12Σ22⁻¹; 0, In−m ] (ξ′, η′)′ = ( (ξ − Σ12Σ22⁻¹η)′, η′ )′,

so EW = ξ − Σ12Σ22⁻¹η.

  Cov V = Cov(BX) = BΣB′
    = [ Im, −Σ12Σ22⁻¹; 0, In−m ] [ Σ11, Σ12; Σ21, Σ22 ] [ Im, 0; −Σ22⁻¹Σ21, In−m ]
    = [ Σ11 − Σ12Σ22⁻¹Σ21, Σ12 − Σ12Σ22⁻¹Σ22; Σ21, Σ22 ] [ Im, 0; −Σ22⁻¹Σ21, In−m ]
    = [ Σ11 − Σ12Σ22⁻¹Σ21, 0; Σ21, Σ22 ] [ Im, 0; −Σ22⁻¹Σ21, In−m ]
    = [ Σ11 − Σ12Σ22⁻¹Σ21, 0; Σ21 − Σ22Σ22⁻¹Σ21, Σ22 ]
    = [ Σ11 − Σ12Σ22⁻¹Σ21, 0; 0, Σ22 ].

(Block matrices are written here with rows separated by semicolons.)

Page 74: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

72 Stat 8151

The upper-left block of Cov V is Cov W = Σ11 −Σ12Σ−122 Σ21.

Finally, W and Z are independent because the upper-right block of Cov Vis 0.

Proposition. Let X =[

YZ

], µ =

[ξη

], and Σ =

[Σ11 Σ12

Σ21 Σ22

], and assume

that Σ−122 exists. Then, given Z = z, the conditional distribution of Y is

Nm(ξ + Σ12Σ−122 (z− η),Σ11 −Σ12Σ−1

22 Σ21).

Proof: Calculate the characteristic function of Y | Z = z. We add and sub-tract terms appearing in the exponent, and use the facts that Y −Σ12Σ−1

22 Z isindependent of Y, and that E( Y | Z ) = EY if Y is independent of Z.

ϕY|Z=z(t) = E(eit′Y | Z = z

)= E

(eit′(Y−Σ12Σ−1

22 Z)eit′Σ12Σ−1

22 Z | Z = z)

= eit′Σ12Σ−1

22 zE(eit′(Y−Σ12Σ−1

22 Z) | Z = z)

but Y −Σ12Σ−122 Z is normal with known mean and covariance, so we know its

characteristic function, and we can evaluate the expectation appearing above.Thus,

ϕY|Z=z(t) = eit′Σ12Σ−1

22 z · eit′(ξ−Σ12Σ−122 η)− 1

2 t′(Σ11−Σ12Σ−122 Σ21)t

= eit′[ξ+Σ12Σ−1

22 (z−η)]− 12 t′(Σ11−Σ12Σ−1

22 Σ21)t,

which is the characteristic function of a multivariate normal with mean ξ +Σ12Σ−1

22 (z− η) and covariance Σ11 −Σ12Σ−122 Σ21.

Corollary. E( Y | Z = z ) = EY + Σ12Σ−122 (z− EZ). Note that this is linear in

z, hence the best linear predictor is the best predictor.

A special case occurs when(X1

X2

)∼ N2

([µ1

µ2

],

[σ11 σ12

σ21 σ22

]).

Then

E(X1 | X2 = x2 ) = µ1 +σ12

σ22(x2 − µ2) = µ1 + ρ

√σ11

σ22(x2 − µ2).

10/28/98

Page 75: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 73

Functions of multivariate normal random variables

We will find the distributions of certain functions of multivariate normal ran-dom variables, in particular, quadratic forms. First we review some more linearalgebra.Inner product spaces. The usual inner product is 〈x,y〉 =

∑i xiyi = x′y but

there are others, as well. For instance x1y1 − x2y1 − x1y2 + 4x2y2 is anotherinner product in R2.

Given an inner product and linear operator T, there exists a unique linearoperator T∗ such that 〈Tx,y〉 = 〈x,T∗y〉. We call T∗ the adjoint of T.

For the usual inner product, 〈Tx,y〉 = (Tx)′y = x′T′y = x′(T′y) =〈x,T′y〉, so T∗ = T′ in this case.

Let 0 be a linear subspace of Rn. Now Rn = 0 ⊕00 for lots of choicesof 00. But, if we have an inner product, then we can talk about 0⊥, andRn = 0 ⊕ 0⊥. Now there exists a unique projection P with range 0 andnullspace 0⊥. We say P is the orthogonal projection of Rn onto 0.

Theorem. Let P be a projection, i.e., P2 = P. Then

1. PP∗ = P∗P⇐⇒

2. P = P∗

⇐⇒3. P is the orthogonal projection on its range.

Proof: (Omitted, except for 2 =⇒ 3) 〈Px,y〉 = 〈x,P∗y〉 = 〈x,Py〉, so y ⊥every Px ⇐⇒ Py ⊥ Rn, i.e., Py = 0, i.e., the nullspace of P is the orthogonalcomplement of its range.

Definition. In an inner product space, we say that P is an orthogonal pro-jection if P2 = P and P = P∗.

Suppose P is an orthogonal projection. Now x ∈ range of P ⇐⇒ (I −P)x = 0. This is true because a projection leaves its range unchanged, sox ∈ 5P ⇐⇒ Px = x ⇐⇒ x − Px = 0. But this is a linear operator,so x − Px = Ix − Px = (I − P)x. Note that (I − P)2 = (I − P)(I − P) =I − P − P + P2 = I − P − P + P = I − P. So, I − P is also an orthogonalprojection when P is. In fact, I−P is an orthogonal projection if and only if Pis an orthogonal projection, and the range of one is the nullspace of the other.

Now we are ready to look at some examples of functions of X.

10/28/98

Page 76: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

74 Stat 8151

ExampleSuppose we have a random sample from a multivariate normal distribution.

X =

X1...Xn

where the Xi’s are iid N(µ, σ2).

Then X ∼ Nn(µe, σ2I) where e =

1...1

, and the joint density is

(1√

2πσ2

)ne−

12 (x−µe)′ 1

σ2 I(x−µe)

=(

1√2πσ2

)nexp

[− 1

2σ2

n∑i=1

(xi − µ)2

]

=(

1√2πσ2

)nexp

[− 1

2σ2

n∑i=1

(xi − x+ x− µ)2

]where x =

x1 + · · ·+ xnn

; continuing, we have

=(

1√2πσ2

)nexp

[− 1

2σ2

n∑i=1

((xi − x)2 + (x− µ)2 + 2(xi − x)(x− µ)

)]

=(

1√2πσ2

)nexp

[− 1

2σ2

n∑i=1

((xi − x)2 + (x− µ)2

)]

=(

1√2πσ2

)nexp

[− 1

2σ2

(n∑i=1

(xi − x)2 + n(x− µ)2

)]and this is a function of x and

∑ni=1(xi − x)2. We’ll see later that X and∑n

i=1(Xi −X)2 are sufficient statistics.

Observe that x1 + · · ·+ xn = e′x, so x =1n

e′x.

Similarly, ee′ =

1 · · · 1...

. . ....

1 · · · 1

, so ee′x =

Σxi...

Σxi

=

nx...nx

= nxe.

Let P =ee′

n, and then Px = xe.

Note that∑ni=1(xi − x)2 = (x − xe)′(x − xe) = (x − Px)′(x − Px) =(

(I−P)x)′((I−P)x

).

Note that P2 = P, so P is a projection, and clearly P′ = P. So P is theorthogonal projection of Rn onto 5P where 5P = Px : x ∈ Rn = x : x1 =· · · = xn = all multiples of e.

It is also easy to see that (I−P)2 = (I−P) = (I−P)′.

10/28/98

Page 77: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 75

Friday, October 30, 1998

Continuing last time’s example of a random sample from a multivariate normal

distribution, we had X ∼ Nn(µe, σ2I) where e =

1...1

, and the joint density is

a function of x and∑

(xi − x)2. Recall P = 1nee′, so Px = x = xe, and now let

Q = I−P. We see that both P and Q are orthogonal projections, and Qe = 0.Now Qx = Ix − Px = x − x = x − xe, so 〈Qx,Qx〉 =

∑ni=1(xi − x)2 =

(Qx)′(Qx) = x′Q′Qx = x′QQx = x′Qx.

Lemma. Let X ∼ Nn(µe, σ2I) and suppose P and Q are defined as above.Then:

1. X and QX are independent.

2. X ∼ N1(µ, σ2/n). (Actually, X = PX ∼ N1(µe, σ2/n).)3. QX ∼ Nn(0, σ2Q).

Proof: Let V =(

XQX

). Thus

V =(

XQX

)=(

1ne′

Q

)X =

[1/n 1/n . . . 1/n

Q

]X

but V is a linear transformation of X, and therefore,

V ∼ Nn+1

([1ne′

Q

]µe,

[1ne′

Q

]σ2I [ 1

ne Q ])

∼ Nn+1

([µ0

], σ2

[1/n 00 Q

])from which we immediately obtain the conclusions of the lemma .

Corollary. X and∑ni=1(Xi −X)2 are independent.

Proof: We know X and QX are independent, so then X and (QX)′(QX)

10/30/98

Page 78: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

76 Stat 8151

are independent.30 But recall that QX = X−PX = X−X, so (QX)′(QX) =〈QX,QX〉 = ‖QX‖2 =

∑ni=1(Xi −X)2.

Note that (QX)′(QX) = X′Q′QX = X′QQX since Q is symmetric, andX′QQX = X′QX since Q is idempotent. We want to find the distribution ofX′QX, which is an example of a quadratic form. We will consider the moregeneral problem of finding the distribution of quadratic forms.

Quadratic Forms in Normal Variates

Let’s assume X ∼ Nn(0,Σ) and let B be a positive semidefinite matrix, sox′Bx ≥ 0 for all x ∈ Rn. We want the distribution of W = X′BX. In particular,letting B = Q gives the distribution we wanted before.

Now Σ = AA′, where rank(A) = rank(Σ); but then X ∼ AZ, whereZ ∼ Nn(0, In). So,

W = X′BX = (AZ)′B(AZ) = Z′A′BAZ = Z′CZ

where C = A′BA. Note that C is positive semidefinite, for x′Cx = x′A′BAx =(Ax)′B(Ax) ≥ 0 since B is positive semidefinite.

Diagonalize C = Γ′DΓ where Γ is orthogonal and D is diagonal. As usual,

we have D =

d1

. . . 0dk

0

0 . . .0

. Note that di > 0 for i = 1, . . . ,

k, where k is the rank of C. Recall that W = Z′CZ, so W ∼ Z′(Γ′DΓ)Z =(ΓZ)′D(ΓZ)

Since Γ is orthogonal and Z ∼ Nn(0, In), we know ΓZ ∼ Nn(Γ0,ΓInΓ′) =Nn(0, In), i.e., ΓZ ∼ Z.

Thus W ∼ Z′DZ.But Z′DZ =

∑ki=1 diZ

2i , so W ∼

∑ki=1 diZ

2i where Z1, . . . , Zk are i.i.d.

(univariate) standard normal random variables.

30 Recall that measurable functions of independent random variables are alsoindependent random variables.

Lemma. If X ⊥⊥ Y , then φ(X) ⊥⊥ ψ(Y ).

Proof: P(φ(X) ∈ A,ψ(X) ∈ B

)= P

(X ∈ φ−1(A), Y ∈ ψ−1(B)

)=

P(X ∈ φ−1(A)

)P(Y ∈ ψ−1(B)

)= P

(φ(X) ∈ A

)P(ψ(X) ∈ B

).

10/30/98

Page 79: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 77

Fact. The numbers d1, . . . , dk are the eigenvalues of C = A′BA.

Proof: They are clearly the eigenvalues of D, i.e., roots of det(D−λI) = 0.But look at det(D − λI) = 0. Multiply by det(Γ) det(Γ′) = 1 to get det(D −λI) = det(Γ′) det(D − λI) det(Γ) = det

(Γ′(D − λI)Γ

)= det

(Γ′DΓ − λI

)=

det(C− λI) = 0. Thus, the eigenvalues of D are the same as those of C.

So the distribution of W depends on both A(= Σ1/2) and B (from thequadratic form X′BX), through the eigenvalues of C = A′BA.

We have shown that:

Proposition. Suppose X ∼ Nn(0,Σ), and suppose B is a positive semidefi-nite matrix, and let W = X′BX. Write Σ = AA′ and let d1, . . . , dk be theeigenvalues of A′BA. Then W ∼

∑ki=1 diZ

2i where Z1, . . . , Zk are i.i.d. N1(0, 1).

Corollary. If the eigenvalues are equal, d1 = · · · = dk = d, thenW

d∼ χ2

k.

Proof: W ∼k∑i=1

diZ2i =

k∑i=1

dZ2i = d

k∑i=1

Z2i ⇐⇒ W/d ∼

k∑i=1

Z2i = χ2

k.

Applications(Remember what we’re trying to do.)

1. Suppose X ∼ Nn(0, σ2I) and B is an orthogonal projection of rank k. LetW = X′BX. Then

W

σ2∼ χ2

k.

Proof: Recall that the non-zero eigenvalues of an orthogonal projection are all1, so B has eigenvalues λ1 = · · · = λk = 1 and λk+1 = · · · = λn = 0. Since theseare non-negative, we see that B is positive semidefinite.

Note that Σ = σ2I = (σI)(σI) = AA′, so A′BA = (σI)′B(σI) = σ2B; itfollows that the corresponding eigenvalues for A′BA are d1 = · · · = dk = σ2

and dk+1 = · · · = dn = 0. Now apply the corollary to get W/σ2 ∼ χ2k.

2. Suppose X ∼ Nn(µ, σ2I), W = X′BX, B is an orthogonal projection ofrank k, and µ ∈ 1B, i.e., Bµ = 0. Then

W

σ2∼ χ2

k.

Proof: By the previous result, we know (X− µ)′B(X− µ) ∼ χ2k. But

(X− µ)′B(X− µ) = X′BX−X′Bµ− µ′BX + µ′Bµ = X′BX.

3. (Special case of 2.) Suppose X ∼ Nn(µe, σ2I). Let nS2 =∑ni=1(Xi −X)2 =

X′QX, where Q = I−P. Then

1σ2

n∑i=1

(Xi −X)2 ∼ χ2n−1.

Proof: Q is an orthogonal projection onto 5⊥P, and hence has rank n− 1.

The question arises, “What if µ /∈ 1B?”

10/30/98

Page 80: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

78 Stat 8151

Non-central chi-square distribution.

Definition. Let X1, . . . , Xn be i.i.d. random variables with Xi ∼ Ni(µi, 1).Then X2

1 + · · ·+X2n is said to have a non-central chi-squared distribution.

Note: If µi = 0 for all i, then we get the usual (central) chi-square.

Lemma. The non-central chi-squared characteristic function is

ϕX21+···+X2

n(t) =

e12 (µ2

1+···+µ2n)( 1

1−2it−1)

(1− 2it)n/2.

Again, If µi = 0 for all i, then we get the characteristic function for the usual(central) chi-square. Also notice that this depends on the µi’s only throughµ2

1 + · · ·+ µ2n and n.

10/30/98

Page 81: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 79

Monday, November 2, 1998

Proof: (Of the lemma from last time) We note that the alleged characteristicfunction factors, so it suffices to prove the case n = 1. So, let X1 ∼ N1(µ, 1).Then

ϕX2(t) = EeitX2

=∫eitx

2 1√2πe−

12 (x−µ)2

dx

=∫

1√2π

exp[− 1

2 [−2itx2 + (x− µ)2]]dx

=∫

1√2π

exp[− 1

2 [(1− 2it)x2 − 2µx+ µ2]]dx

= e−µ2/2

∫1√2π

exp[− 1

2 [(1− 2it)x2 − 2µx]]dx

Now let y = (1 − 2it)1/2x, so x = (1 − 2it)−1/2y and dx = (1 − 2it)−1/2 dy;complete the square, and make a change of variables.

= exp[−µ2/2]∫

1√2π

exp[− 1

2

(y − µ(1− 2it)−1/2

)2]× (1− 2it)−1/2 exp[ 1

2µ2/(1− 2it)] dy

= exp[−µ2/2](1− 2it)−1/2 exp[ 12µ

2/(1− 2it)]

×∫

1√2π

exp[− 1

2

(y − µ(1− 2it)−1/2

)2]dy︸ ︷︷ ︸

1

(The integral is 1 because we’re integrating a density.)

= (1− 2it)−1/2 exp[ 12µ

2/(1− 2it)− 12µ

2]× 1

=1

(1− 2it)1/2exp

[µ2

2

(1

1− 2it− 1)]

. (1)

If n > 1 we can use the assumption of independence to get

ϕX21+···+X2

n(t) =

1(1− 2it)n/2

exp[

12

(µ21 + · · ·+ µ2

n)(

11− 2it

− 1)]

. (2)

Note that if µ1 = · · · = µn = 0 then (2) is the characteristic function of the usual(central) chi-square.

11/2/98

Page 82: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

80 Stat 8151

Observe that (1) factors as

1(1− 2it)1/2︸ ︷︷ ︸

characteristic functionof central chi-square

× exp[µ2

2

(1

1− 2it− 1)]

︸ ︷︷ ︸characteristic functionof something else

This suggests that a non-central chi-square may be represented as a sum of aχ2

1 and something else, so we’d like to determine what distribution is associatedwith that second factor. It resembles the characteristic function of a Poissonrandom variable, for if Y ∼ Poisson(λ) then ϕY (t) = eλ(eit−1).

Lemma. Let U1, U2, . . . , be i.i.d. with common characteristic function h(t). Let

N ∼ Poisson(λ) and independent of the Uj ’s. Let SN =∑Nj=1 Uj , where S0 ≡ 0.

Then the characteristic function ψ of SN is ψ(t) = eλ[h(t)−1)].

Proof:

ψ(t) = EeitSN = E[E( eitSN | N )] = E([h(t)]N

)=

n∑n=0

[h(t)]nP (N = n)

=∞∑n=0

[h(t)]ne−λλn

n!= e−λ

∞∑n=0

[λh(t)]n

n!= e−λeλh(t) = eλ[h(t)−1]

and the proof is complete.

Note: If Uj ≡ 1, then we get an ordinary Poisson, for the characteristicfunction of X ≡ 1 would be eit − 1.

So SN is a Poisson (random) sum of i.i.d. Uj ’s.Now we want to consider a central chi-square with a random number of

degrees of freedom.Let Z1, Z2, . . . be i.i.d. N(0, 1), and let N be a non-negative integer-valued

random variable where N is independent of Z1, Z2, . . . . (N is Poisson, but wedon’t need that here.) Set SN =

∑Nj=1 Z

2j with S0 ≡ 0. Then

SN ∼ χ2N .

Now suppose we have one more standard normal variate, say W ∼ N(0, 1)and independent of the Zj ’s and N . Then

W +2N∑j=1

Z2j ∼ χ2

1+2N .

(W guarantees there is always at least one term.)

11/2/98

Page 83: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 81

Lemma. If X ∼ N(µ, 1) then X2 ∼ χ21+2N where N ∼ Poisson( 1

2µ2). We call

µ2 the noncentrality parameter. If µ = 0 then N ∼ Poisson(0), so N = 0 withprobability one, and then this reduces to the usual central chi-square.

Proof: We’ll use characteristic functions. First recall that, under indepen-dence, χ2

p + χ2r ∼ χ2

p+r, so ϕχ2p+r

(t) = ϕχ2p(t) · ϕχ2

r(t). Thus

ϕχ21+2N

(t) = ϕχ21(t) · ϕχ2

2N(t). (3)

We know the first factor, for ϕχ21(t) =

1(1− 2it)1/2

. But we also know the second

factor, by our lemma on Poisson sums. Since χ22N ∼

∑2Nj=1 Z

2j ∼

∑2Nj=1 χ

21 (i.e.,

each term is χ21) we have ϕχ2

2N(t) = exp

µ2

2

(1

1−2it − 1)

. Thus (3) becomes

ϕχ21+2N

(t) =1

(1− 2it)1/2exp

[µ2

2

(1

1− 2it− 1)]

That is the same characteristic function we found in the lemma stated at theend of the previous class and proved at the beginning of today’s class, so we aredone.

A non-central chi-square can be represented as a central chi-square with arandom number of degrees of freedom, or as a sum of a central chi-square and arandom sum of more central chi-squares.

Lemma. If X ∼ Np(µ, Ip) then X′X ∼ χ2p+2N where N ∼ Poisson( 1

2µ′µ).

Proof: Let Γ be orthogonal with first rowµ′

‖µ‖ where ‖µ‖2 =n∑i=1

µ2i .

Then X′X = X′Γ′ΓX = (ΓX)′(ΓX) = Y′Y, say.

Now Y = ΓX ∼ Np(Γµ, I) because ΓIΓ′ = I; observe that Γµ =

‖µ‖

0...0

.

So, Y′Y = Y 21 + · · ·+ Y 2

p where the Yi’s are independent and

Yi ∼

N(‖µ‖, 1) if i = 1,N(0, 1) if i = 2, . . . , p.

Then Y′Y ∼ χ21+2N + χ2

p−1, where N ∼ Poisson( 12µ′µ).

Thus Y′Y ∼ χ2(1+2N)+(p−1) ∼ χ2

p+2N .

We write χ2p(λ) = χ2

p+2N where N ∼ Poisson(λ/2). This is called a non-central chi-squared distribution with p degrees of freedom and non-centrality parameter λ = µ′µ. Its characteristic function is

ϕ(t) =1

(1− 2it)p/2exp

2

(1

1− 2it− 1)]

=1

(1− 2it)p/2exp

[λit

1− 2it

].

11/2/98

Page 84: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

82 Stat 8151

Proposition. Suppose X1 ∼ χ2p1

(λ1) and X2 ∼ χ2p2

(λ2) and suppose that X1

and X2 are independent. Then X1 +X2 ∼ χ2p1+p2

(λ1 + λ2).

Proof: For j = 1, 2, we know Xj ∼ χ2pj (λj). Representing non-central chi-

squares in terms of central chi-squares allows us to use the fact that the centralchi-square and the Poisson distributions are reproductive. We can break this upinto lots of independent central chi-squares, and rearrange them.

χ2p1

(λ1) + χ2p2

(λ2) ∼ χ2p1+2N1

+ χ2p2+2N2

∼ (χ2p1

+ χ22N1

) + (χ2p2

+ χ22N2

)

∼ (χ2p1

+ χ2p2

) + (χ22N1

+ χ22N2

)

∼ χ2p1+p2

+ χ22N1+2N2

∼ χ2p1+p2

+ χ22(N1+2N)

∼ χ2p1+p2

(λ1 + λ2)

becauseN1+N2 ∼ Poisson((λ1+λ2)/2

)ifNj ∼ Poisson(λj/2) are independent.

11/2/98

Page 85: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 83

Wednesday, November 4, 1998

Last time we saw that if X ∼ Np(µ, I), then X′X ∼ χ2p+2N where N ∼

Poisson( 12µ′µ).

Now we want to generalize this. Let X ∼ Nn(µ,Σ) and assume that µ ∈5Σ, i.e., µ = Σα for some α ∈ Rn. This is clearly okay if Σ−1 exists, butotherwise it is an extra assumption.

Write Σ = CC′, and then 5Σ ⊆ 5C. So, we can write µ = Cτ for some τand X ∼ CW where W ∼ Nn(τ , I). Let

Y = X′AX

where A is symmetric. Question: when is Y noncentral χ2?Write Y = X′AX = (CW)′A(CW) = W′C′ACW = W′BW where

B = C′AC. Now B is symmetric, since A and C are, so we can diagonalize Bin the usual way: B = Γ′DΓ where Γ is orthogonal and

D =

d1

. . . 0dr

0

0 . . .0

where r is the rank of B. So, Y ∼W′BW = W′(Γ′DΓ)W = (ΓW)′B(ΓW) =U′DU where U = ΓW.

Since Γ is orthogonal and W is normal, ΓW is also normal. Let ξ = Γτ ,so ΓW ∼ Nn(ξ, I). Then Y ∼ U′DU =

∑ri=1 diU

2i .

If d1 = · · · = dr = d, then Y ∼ dχ2r(∑rj=1 ξ

2j ). a non-central chi-square

with r degrees of freedom and noncentrality parameter∑rj=1 ξ

2j .

Remember, we’re trying to find the distribution of X′AX, which dependson A, Σ (or C), and µ. How does the noncentrality parameter

∑rj=1 ξ

2j relate

to this? Well, µ′Aµ = (Cτ )′A(Cτ ) = τ ′C′ACτ = τ ′Bτ = τ ′(Γ′DΓ)τ =(Γτ )′D(Γτ ) = ξ′Dξ =

∑rj=1 djξ

2j ; if d1 = · · · = dr = d, then this is d

∑rj=1 ξ

2j .

So, µ′Aµ =∑rj=1 djξ

2j , which was the noncentrality parameter.

We have proved the following proposition.

11/4/98

Page 86: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

84 Stat 8151

Proposition. Suppose X ∼ N(µ,Σ), µ ∈ 5Σ, and A is symmetric. Set Y =X′AX, write Σ = CC′ and µ = Cτ . If C′AC = dP where P is an orthogonalprojection of rank r, then

Y ∼ dχ2r

(µ′Aµd

)

Independence of Quadratic Forms

Let X ∼ N(µ,Σ). Recall31 that A1X and A2X are independent iff A1ΣA′2 = 0.What about quadratic forms? Let Y1 = X′AX+a′X and Y2 = X′BX+b′X.

When are these independent?

Theorem. Y1 and Y2 are independent iff AB = 0, Ab = 0, Ba = 0, anda′Σb = 0.

Proof:

⇐= “Kind of easy.”=⇒ “Hard.”

Why would we care about this? Much of the theory for, say, F -tests in one-way ANOVA depends on Cochran’s Theorem; that involves conditions for inde-pendence of the sums of squares whose quotient forms the F statistic. Cochran’sTheorem,32 in turn, follows from these results on independence of quadraticforms.

31 Homework problem32 See Hogg and Craig for a proof of Cochran’s Theorem.

11/4/98

Page 87: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 85

Sampling from a bivariate normal distribution

Suppose X1, . . . ,Xn are i.i.d. N2(µ,Σ) where µ ∈ R2 and Σ =(σ11 σ12

σ12 σ22

)is

nonsingular; assume n ≥ 3. Each Xi =(Xi1

Xi2

).33

We want to find the maximum likelihood estimates (MLEs) for µ and Σ.We start with the joint distribution.

n∏i=1

f(xi) =n∏i=1

|Σ|−1/2

2πe−

12 (xi−µ)′Σ−1(xi−µ)

=|Σ|−n/2(2π)n

exp

[−1

2

n∑i=1

(xi − µ)′Σ−1(xi − µ)

](1)

The sample contains information about µ and Σ. We fix the data, and findvalues to maximize the probability of what we’ve seen. That’s okay for discretedistributions, but doesn’t really make sense for a continuous distribution. Nev-ertheless, we proceed to maximize the density anyway. That is, we want to findvalues of µ and Σ which maximize (1).

First recall a few facts from linear algebra; remember that the trace of asquare matrix is the sum of its diagonal entries. We’ll use these facts about thetrace.

1. tr(CD) = tr(DC)2. tr(

∑i Ci) =

∑i tr(Ci)

3. tr(a) = a if a ∈ R1.

We want to play with the exponent in a way that generalizes what happens inthe 1-D case. Forget the − 1

2 for now, and focus on

n∑i=1

(xi − µ)′Σ−1(xi − µ) =n∑i=1

tr((xi − µ)′Σ−1(xi − µ)

)(2)

because the summands are scalars—use the third trace property above

=n∑i=1

tr(Σ−1(xi − µ)′(xi − µ)

)use the first trace property

= tr(Σ−1

n∑i=1

(xi − µ)′(xi − µ))

(3)

use the second trace property, and pull out a common factor

Now define

x =n∑i=1

1n

xi =(∑n

i=11nxi1∑n

i=11nxi2

)=(x1

x2

)33 I’ll use notation such as Xij to refer to the jth element of the ith vector.

11/4/98

Page 88: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

86 Stat 8151

and look at

n∑i=1

(xi − µ)(xi − µ)′ =n∑i=1

(xi − x + x− µ)(xi − x + x− µ)′

=n∑i=1

[(xi − x)(xi − x)′ + (xi − x)(x− µ)′

+ (x− µ)(xi − x)′ + (x− µ)(x− µ)′]

=n∑i=1

(xi − x)(xi − x)′ +[ n∑i=1

(xi − x)]

︸ ︷︷ ︸0

(x− µ)′

+ (x− µ)n∑i=1

(xi − x)′︸ ︷︷ ︸0

+n∑i=1

(x− µ)(x− µ)′

=n∑i=1

(xi − x)(xi − x)′︸ ︷︷ ︸S

+n∑i=1

(x− µ)(x− µ)′︸ ︷︷ ︸n(x− µ)(x− µ)′

= S + n(x− µ)(x− µ)′ (4)

where S =∑ni=1(xi−x)(xi−x)′ is the sample covariance matrix.34 We can

substitute (4) in (3) and then “retrace” the steps of our earlier trace argumentto get a replacement expression for (2) which we then use in (1); that in turnbecomes

n∏i=1

f(xi) =|Σ|−n/2(2π)n

exp[− 1

2n(x− µ)′Σ−1(x− µ)− 12 tr(Σ−1S)

]. (5)

Now it is obvious what the MLE is—at least for µ. The second term in theexponent doesn’t involve µ, so we can maximize (5) by setting µ = x.

34 The following non-trivial fact is stated without proof.

Theorem. The sample covariance matrix S is positive definite with probabilityone when n ≥ 3.

That’s for our bivariate normal distribution. Generally, we need n at leastone more than the dimension of our normal variate.

11/4/98

Page 89: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 87

Friday, November 6, 1998

Let h(µ,Σ) denote the joint density (likelihood) from last time. so

h(µ,Σ) =|Σ|−n/2(2π)n

exp[− 1

2n(x− µ)′Σ−1(x− µ)− 12 tr(Σ−1S)

].

We want to maximize this, and we’ve already seen that µ, the MLE of µ, is x.We need to find the corresponding MLE for Σ. Now

h(µ,Σ) =|Σ|−n/2(2π)n

exp[− 1

2 tr(Σ−1S)],

and we want to maximize this over positive definite Σ, given that S is positivedefinite.

Claim. Σ, the MLE for Σ, is S/n.

We need to show that

φ(Σ) = |Σ|−n/2 e− 12 tr(Σ−1S) ≤ |S/n|−n/2 e− 1

2 tr((S/n)−1S) = φ(S/n)

(We ignore (2π)n for the moment.)Now (S/n)−1 =

((1/n)S

)−1 = nS−1, so

(S/n)−1S = nS−1S = nI =(n 00 n

).

Then tr((S/n)−1S)

)= tr

(n 00 n

)= 2n, so e−

12 tr((S/n)−1S) = e−n.

Also recall that S is 2× 2, so |S/n| = |S| /n2.Thus

φ(S/n) =( |S|n2

)−n/2e−n = |S|−n/2 nne−n.

Remember, we want to show that this is the maximum value of φ(Σ) over allpositive definite Σ, given that S is positive definite.

11/6/98

Page 90: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

88 Stat 8151

Now φ(Σ) = |Σ|−n/2 e− 12 tr(Σ−1S) =

∣∣SS−1Σ∣∣−n/2 e− 1

2 tr(Σ−1S) and we canwrite S = BB′ since S is positive definite. Then S−1 = (BB′)−1 = (B′)−1B−1,so we have

φ(Σ) = |S|−n/2∣∣(B′)−1B−1Σ

∣∣−n/2 e− 12 tr(Σ−1BB′).

Now the determinant of a product is the product of the determinants, and wecan rearrange those factors as we wish; the same holds for the trace operator, sowe get

φ(Σ) = |S|−n/2∣∣B−1Σ(B′)−1

∣∣−n/2 e− 12 tr(B′Σ−1B).

Next, the reciprocal of a determinant is the determinant of the inverse, so wehave

φ(Σ) = |S|−n/2∣∣B′Σ−1B

∣∣n/2 e− 12 tr(B′Σ−1B).

Now let C = B′Σ−1B; note that C is positive definite.So, we must show that

ψ(C) = |C|n/2 e− 12 tr C ≤ nne−n

for positive definite C. (Again we strip off factors that don’t matter—this timeit is |S|−n/2.)

But C is positive definite, so we can diagonalize it. Write C = ΓDΓ′ where

Γ is orthogonal, and D =(λ1 00 λ2

), with λ1 > 0 and λ2 > 0. Now |C| = |D|

and tr C = tr D, so

ψ(C) = ψ(D) =∣∣∣∣λ1 0

0 λ2

∣∣∣∣n/2 e− 12 tr(λ10

0λ2

)= (λ1λ2)n/2e−(λ1+λ2)/2 = λ

n/21 e−λ1/2λ

n/22 e−λ2/2 = k(λ1)k(λ2)

where k(λ) = λn/2e−λ/2.But elementary calculus shows that k has a unique maximum at λ = n, so

k(λ) ≤ k(n) = nn/2e−n/2. Thus

k(λ1)k(λ2) ≤ (nn/2e−n/2)2 = nne−n

and we’re done.

11/6/98

Page 91: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 89

X and S are sufficient for µ and Σ, so we might want to know their distributions.Recall that their univariate counterparts, X and S2, are normal and chi-square,respectively, and independent.

Let

W =

X(1)

X(2)

=

X11

X21...

Xn1

X12

X22...

Xn2

=

X11

X21...

Xn1

X12

X22...

Xn2

∈ R2n

so we are stacking all the first components, followed by all the second compo-nents. Note that the first n components of W are independent, as are the lastn components. However, each of the first n components is positively correlatedwith one of the last n components. That is, Xi1 and Xj1 are independent if i 6= j,and Xi2 and Xj2 are independent if i 6= j, but Xj1 and Xj2 are correlated.

If we let e =

1...1

, µ =[µ1

µ2

], and Σ =

[σ11 σ12

σ21 σ22

]=[σ11 σ12

σ12 σ22

], then

we can write

W =[

X(1)

X(2)

]∼ N2n

([µ1eµ2e

],

[σ11I σ12Iσ12I σ22I

]).

Now we find a linear transformation. Let P =1n

ee′ and look at

[P 00 P

]W =

[X1eX2e

]and [

I−P 00 I−P

]W =

[X(1) −X1eX(2) −X2e

].

Now X is a function of[

P 00 P

]W, and S is a function of

[I−P 0

0 I−P

]W;

furthermore, these are independent, because[P 00 P

] [σ11I σ12Iσ12I σ22I

] [I−P 0

0 I−P

]= 0.

(Use P2 = P′ = P and recall that A1X ⊥⊥ A2X if A1ΣA′2 = 0.)

11/6/98

Page 92: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

90 Stat 8151

Then

X ∼ N2

(µ,

1n

Σ)

and S ∼n−1∑i=1

YY′,

i.e., S has the same distribution asn−1∑i=1

YY′, where the Yi’s are i.i.d. N2(0,Σ).

Note that the sum involves n−1 terms; recall that in the univariate case we loseone degree of freedom when computing the sample variance.

That distribution is an example of a Wishart distribution; in this casewe have S ∼Wishart(2, n−1,Σ). The parameters reflect the dimension, samplesize (reduced by 1), and covariance of our original distribution for X.

Prediction and Multiple Correlation Coefficients

Suppose Y ∈ R1 and X ∈ Rm, and stack them to form Y

X

∈ Rm+1.

Let E

YX

=µξ

and Cov Y

X

= σ11 Σ12

Σ′12 Σ22

= Σ.

Here Σ12 is 1×m, Σ21 = Σ′12 ism×1, Σ22 ism×m, and Σ is (m+1)×(m+1);obviously we’re assuming first and second moments exist.

What is the best predictor of Y based on X? We want minh

E(Y − h(X)

)2.

The earlier proof (from the univariate case) carries over to the present situation,so just as before we have the answer E(Y | X ).

Then we can ask what is the best linear predictor of Y based on X?

Consider b′X + a where b =

b1...bm

∈ Rm and a ∈ R1.

Proposition. Assume Σ22 > 0 (i.e., it is positive definite), so Σ−122 exists. Then

minb∈Rma∈R1

E(Y − b′X− a)2 = σ11 −Σ12Σ−122 Σ21

and is achieved at

b = Σ−122 Σ21 and a = µ− b′ξ.

Proof deferred for now, but first a few comments:

• We’ve already done a 1-D version of this.• Here we are considering both X and Y as random; but, in applied courses,

e.g., 5161, we often think of X as fixed, and would thus be conditioning onX = x.

• We can think of all this as involving projections in function spaces.

11/6/98

Page 93: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 91

Monday, November 9, 1998

Proof: (Of proposition from last time)For fixed b, we know that E(Y −b′X−a)2 is minimized at a = E(Y −b′X) =

µ − b′ξ. Now, by the usual argument of adding and subtracting the mean andexpanding the square, we see that

E(Y − b′X− a)2 = E[Y − b′X− (µ− b′ξ) + (µ− b′ξ)− a]2

= E[Y − b′X− (µ− b′ξ)]2 + E[(µ− b′ξ)− a]2

+ 2E[(Y − b′X− (µ− b′ξ)

)((µ− b′ξ)− a

)]= E[Y − b′X− (µ− b′ξ)]2 + [(µ− b′ξ)− a]2

+ 2((µ− b′ξ)− a

)E[(Y − b′X− (µ− b′ξ)

)]︸ ︷︷ ︸0

= Var(Y − b′X) + (µ− b′ξ − a)2.

As noted above, given b, we can minimize this by making the second term vanish;that’s how we get a = E(Y −b′X) = µ−b′ξ. But now we want to pick b whichminimizes Var(Y − b′X).

Write

Y − b′X = [ 1 −b′ ][YX

]and recall that Cov(AU) = A(Cov U)A′, so

Var(Y − b′X) = [ 1 −b′ ] Σ[

1−b

]= [ 1 −b′ ]

[σ11 Σ12

Σ21 Σ22

] [1−b

]= [σ11 − b′Σ21 Σ12 − b′Σ22 ]

[1−b

]= σ11 − b′Σ21 −Σ12b + b′Σ22b

11/9/98

Page 94: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

92 Stat 8151

Now recall that Σ12 is 1× n, so Σ12b is a scalar, and Σ12b = (Σ12b)′ = b′Σ21.Thus Var(Y − b′X) = σ11 − 2b′Σ21 + b′Σ22b and we want to find

infb

(σ11 − 2b′Σ21 + b′Σ22b). (1)

Let d = Σ1/222 b, so b = Σ−1/2

22 d.

Then b′ = (Σ−1/222 d)′ = d′

(Σ−1/2

22

)′= d′Σ−1/2

22 (because Σ−1/222 is symmet-

ric) and b′Σ22b = (d′Σ−1/222 )Σ22(Σ−1/2

22 d) = d′d.Rewrite (1) as

infd

(σ11 − 2d′Σ−1/222 Σ21 + d′d). (2)

We’d like to “complete the square” but first we have another change of notation.Let c = d−Σ−1/2

22 Σ21, so d = c + Σ−1/222 Σ21.

Now (2) becomes

infc

[σ11 − 2

(c + Σ−1/2

22 Σ21

)′Σ−1/2

22 Σ21 +(c + Σ−1/2

22 Σ21

)′(c + Σ−1/2

22 Σ21

)]= inf

c

[σ11 − 2 c′Σ−1/2

22 Σ21︸ ︷︷ ︸scalar

−2(Σ−1/2

22 Σ21

)′Σ−1/2

22 Σ21 + c′c

+ c′Σ−1/222 Σ21︸ ︷︷ ︸scalar

+(Σ−1/2

22 Σ21

)′c︸ ︷︷ ︸

scalar

+(Σ−1/2

22 Σ21

)′Σ−1/2

22 Σ21

]= inf

c

[σ11 −Σ12Σ−1

22 Σ21 + c′c]. (3)

The indicated terms are scalars, and hence equal their transposes; their sum iszero.

Now it is easy to see that (3) is minimized when c = 0. Then d = Σ−1/222 Σ21,

and b = Σ−1/222 (Σ−1/2

22 Σ21) = Σ−122 Σ21, as claimed.

Finally, from (3), the minimum value achieved is σ11 −Σ12Σ−122 Σ21, again

as claimed.

11/9/98

Page 95: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 93

Recall that VarY = σ11, and write

E(Y − b′X− a)2 = σ11 −Σ12Σ−122 Σ21 = σ11

(1− Σ12Σ−1

22 Σ21

σ11

)= σ11(1−R2)

where

R2 =Σ12Σ−1

22 Σ21

σ11

is the squared multiple correlation coefficient.

Lemma.max

b[ρ(Y,b′X)]2 = R2

Proof:

[ρ(Y,b′X)]2 =[Cov(Y,b′X)]2

(VarY )(Var b′X)=

(b′Σ21)2

σ11(b′Σ22b)

so

supb 6=0

[ρ(Y,b′X)]2 = supb 6=0

(b′Σ21)2

σ11(b′Σ22b)= sup

d 6=0

(d′Σ−1/222 Σ21)2

σ11(d′d)

if we let d = Σ1/222 b, just as we had earlier done. Now use a version of the

Cauchy-Schwarz inequality,35 to get

supd 6=0

(d′Σ−1/222 Σ21)2

σ11(d′d)≤ d′dΣ12Σ−1

22 Σ21

σ11(d′d)=

Σ12Σ−122 Σ21

σ11= R2.

Remember that equality holds when one vector is a constant times the other, sod = kΣ−1/2

22 Σ21, hence b = kΣ−122 Σ21. We can take the constant k to be 1, for

it clearly works. (It cancels, anyway.)

35 Use (u′v)2 ≤ (u′u)(v′v), with u = d and v = Σ−1/222 Σ21 (in the numerator).

11/9/98

Page 96: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

94 Stat 8151

ExchangeabilityPolya Urn Scheme

Consider an urn in which there are a white balls and b blue balls, wherea ≥ 1 and b ≥ 1. Now we sample according to the following scheme:

• Pick a ball at random.• Note the color of the ball.• Replace the ball, along with an additional ball of the same color.• Repeat.

Let Xi be the outcome of the ith drawing, so Xi = 1 if the ith ball drawn iswhite and Xi = 0 if the ith ball drawn is blue.

Obviously,P (X1 = 1) =

a

a+ b; (4)

however, perhaps a bit surprisingly,

P (X2 = 1) = P (X1 = 1, X2 = 1) + P (X1 = 0, X2 = 1)= P (X1 = 1)P (X2 = 1 | X1 = 1 ) + P (X1 = 0)P (X2 = 1 | X1 = 0 )

=a

a+ b· a+ 1a+ b+ 1

+b

a+ b· a

a+ b+ 1

=a(a+ 1 + b)

(a+ b)(a+ b+ 1)

=a

a+ b

= P (X1 = 1).

The same holds for X3, etc. Although the Xi’s are not i.i.d., they have a commonmarginal distribution.

Fact. P (Xn = 1) =a

a+ b. (5)

Proof: We use mathematical induction.36

The initial step is obvious—see (4) above.We now assume that (5) holds for n = 1, 2, . . . , k − 1, and show that it

must also hold for n = k.Let An be the number of white balls and Bn the number of blue balls just

before the nth drawing; so A1 = a and B1 = b.Then An+1 and Bn+1 count the white and the blue balls after the nth

drawing, including the replaced balls; note that An+1 = An + Xn and Bn+1 =Bn + (1−Xn).

Observe that An+1 +Bn+1 = An+Bn+1, independent of Xn, so we alwaysknow the total number of balls: An +Bn = a+ b+ n− 1.

36 This is an exercise in Feller, Volume I. By the way, the sequence of Xi’s isnot a Markov sequence, although the sequence of Ai’s is.

11/9/98

Page 97: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 95

Consequently, P (Xn = 1) = An/(An +Bn).Apply this with n = k − 1 to get P (Xk−1 = 1) = Ak−1/(Ak−1 + Bk−1).

But, under the induction hypothesis, we believe (5) with n = k − 1, so we alsohave P (Xk−1 = 1) = a/(a+ b).

ThusAk−1

Ak−1 +Bk−1=

a

a+ b,

so(a+ b)Ak−1 = a(Ak−1 +Bk−1) = a(Ak +Bk − 1). (6)

Now

P (Xk = 1) = P (Xk = 1, Xk−1 = 1) + P (Xk = 1, Xk−1 = 0)= P (Xk = 1 | Xk−1 = 1 )P (Xk−1 = 1)

+ P (Xk = 1 | Xk−1 = 0 )P (Xk−1 = 0)

=Ak

Ak +Bk· a

a+ b+

AkAk +Bk

· b

a+ brecall that Ak = Ak−1 +Xk−1 so we have

Ak−1 + 1Ak +Bk

· a

a+ b+

Ak−1

Ak +Bk· b

a+ b

=(a+ b)Ak−1 + a

(Ak +Bk)(a+ b)substitute from (6) to get

a(Ak +Bk − 1) + a

(Ak +Bk)(a+ b)

=a(Ak +Bk)

(Ak +Bk)(a+ b)

=a

a+ b

and we’ve established the induction.

There is a shorter and more elegant proof, due to Feller. The idea is torealize that the induction hypothesis (the probability that the last ball drawnis white is the same as the probability that the first ball drawn is white) isassumed to hold for every sequence of fewer than k draws from the urn, not justthose sequences starting with the actual first ball drawn. In particular, we canconsider a sequence X1, X2, . . . , Xk as a first drawing X1 followed by a sequenceX2, . . . , Xk of length k − 1; it is to that sequence that we apply the inductionassumption, i.e., P (Xk = 1 | X1 = x ) = P (X2 = 1 | X1 = x ) for x = 0 or 1.Thus

P (Xk = 1) = P (Xk = 1 | X1 = 1 )P (X1 = 1) + P (Xk = 1 | X1 = 0 )P (X1 = 0)

=a+ 1

a+ b+ 1· a

a+ b+

a

a+ b+ 1· b

a+ b=

a(a+ b+ 1)(a+ b+ 1)(a+ b)

=a

a+ b.

11/9/98

Page 98: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

96 Stat 8151

Now suppose we draw r white balls, followed by s blue balls (in precisely thatorder). Observe that the denominators increase by 1 with each fraction, nomatter which color ball is being added.

P (X1 = · · · = Xr = 1, Xr+1 = · · · = Xr+s = 0)

=a

a+ b· a+ 1a+ b+ 1

· · · a+ r − 1a+ b+ r − 1

· b

a+ b+ r· · · b+ s− 1

a+ b+ r + s− 1.

This same probability arises for any sequence of r white balls and s blue balls,regardless of the order in which they are drawn, for the factors appearing in thenumerators commute.

Notice that

P (Xn = 1) =n−1∑i=1

P (X1 = x1, . . . , Xn−1 = xn−1, Xn = 1)

where xi = 0 or 1; we sum over all sequences of 0’s and 1’s, of length n, andending with a 1. But these same terms can be rearranged to get

n∑i=2

P (X1 = 1, . . . , Xn−1 = xn, Xn = xn) = P (Xn = 1).

So the Xi’s have the same marginal distribution. In the same way, every pair hasthe same marginal distribution, for instance, (X1, X2) ∼ (X1, X3) ∼ (X4, X17);also every triple, and so forth.

There is time-symmetry, too:

P (X2 = 1 | X1 = 1 ) =P (X1 = 1, X2 = 1)

P (X1 = 1)but P (X1 = 1) = P (X2 = 1), so we have

P (X1 = 1, X2 = 1)P (X2 = 1)

= P (X1 = 1 | X2 = 1 ).

Another example:

P (X1 = 1, . . . , Xn = 1) =a

a+ b· a+ 1a+ b+ 1

· · · a+ n− 1a+ b+ n− 1

These are examples of exchangeable random variables, but there are others. Wecould have more than two colors, or a more complicated replacement scheme.(We could even remove—rather than add—balls.)

11/9/98

Page 99: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 97

Wednesday, November 11, 1998

Exchangeability

Let X1, X2, . . . be a sequence of random variables which take on the values 0 and1. (Most of this generalizes to other random variables in the obvious way.)

The random variables are said to be exchangeable if the probability distri-bution of Xi1 , . . . , Xin is the same no matter how the n distinct indices i1, . . . , inare chosen for n = 1, 2, . . . .

This means that the individual marginal distributions are the same, andthe distributions for pairs are the same, and the distributions for triples are thesame, . . . .

The marginal distribution of Xi does not depend on i. So, P (Xi = 1) = p,independent of i. In the same way, the joint distribution of (Xi, Xj) for i 6= jdoes not depend on i or j.

Examples

1. It is easy to see that i.i.d. random variables are exchangeable.2. Polya sampling (from previous lecture). This shows that random variables

can be exchangeable without being i.i.d.3. Mixtures of i.i.d. random variables are exchangeable.

We’ll look at the third example now. Suppose P ∼ f(·) on [0, 1], and supposethat X1, X2, . . . | P ∼ i.i.d. Bernoulli(P ). Imagine a collection of coins, eachwith its own probability of heads for a single flip. Randomly select such a coin,and then flip that coin repeatedly to generate a sequence of heads or tails. Then

P (Xi1 = xi1 , . . . , Xin = xin) =∫ 1

0

pΣxij (1− p)n−Σxij f(p) dp

because we can compute the probability of any fixed sequence of outcomes givenp, and then we just average over all p. This doesn’t depend on the indices, sowe have exchangeability. A mixture is a generalized convex combination.

11/11/98

Page 100: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

98 Stat 8151

We now claim that the converse to that third example is true. In other words,we claim that infinite sequences of exchangeable random variables are mixturesof i.i.d. random variables.37

We have a few technical items first.Let w(n)

r be the probability of exactly r 1’s in some specified set of n tri-

als.38 There are(n

r

)points making up this event, and each point has the same

probability,w

(n)r(nr

) . Let wn = w(n)n . Then

w(n)r(nr

) = P( 1, . . . , 1︸ ︷︷ ︸

r

, 0, . . . , 0︸ ︷︷ ︸s

)

(1)

where r + s = n.

Claim.

w(n)r(nr

) = wr −(s

1

)wr+1 +

(s

2

)wr+2 − · · ·+ (−1)s

(s

s

)wr+s (2)

Proof: The event in (1) occurs if and only if

X1X2 · · ·Xr(1−Xr+1)(1−Xr+2) · · · (1−Xr+s) = 1. (3)

Let Yj = Xr+j , so (3) becomes

X1X2 · · ·Xr(1− Y1)(1− Y2) · · · (1− Ys) = 1.

Remember that the Xi’s are 0 or 1, so P (Xi = 1) = EXi and the same holds forV = X1X2 · · ·Xr(1− Y1)(1− Y2) · · · (1− Ys). Thus

P (V = 1) = EV = E[X1X2 · · ·Xr(1− Y1)(1− Y2) · · · (1− Ys)]

= E[X1X2 · · ·Xr −

s∑j=1

X1X2 · · ·XrYj +∑j<k

X1X2 · · ·XrYjYk − · · ·

+ (−1)sX1X2 · · ·XrY1 · · ·Ys]

= wr −(s

1

)wr+1 +

(s

2

)wr+2 − · · ·+ (−1)s

(s

s

)wr+s

where we again use the fact that expectations are probabilities for 0-1-valuedrandom variables.

Now EXi doesn’t depend on i (because the Xi’s are exchangeable) so wecan let EXi = m1. Similarly, we let EX2

i = µ2 and EXiXj = m2; in factEXiXj · · ·Xk = mk (for any k of the Xi’s—not just the first k.)

Let Sn = (1/n)∑ni=1Xi.

37 This is De Finetti’s Theorem.38 For instance, the probability of 3 1’s in 5 trials would be w(5)

3 .

11/11/98

Page 101: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 99

Lemma 1. Suppose h < k. Then

E(Sh − Sk)2 =(

1h− 1k

)(µ2 −m2)

Proof:

E(Sh − Sk)2 = E

[X1 + · · ·+Xh

h− X1 + · · ·+Xk

k

]2

= E

[X1 + · · ·+Xh

h

]2

+ E

[X1 + · · ·+Xk

k

]2

− 2E

[X1 + · · ·+Xh

h· X1 + · · ·+Xk

k

]=

1h2

[hµ2 + h(h− 1)m2] +1k2

[kµ2 + k(k − 1)m2]

− 21hk

[hµ2 + h(k − 1)m2]

= µ2

[1h

+1k− 2k

]+m2

[h− 1h

+k − 1k− 2h(k − 1)

hk

]= µ2

(1h− 1k

)+m2

(1k− 1h

)=(

1h− 1k

)(µ2 −m2)

and we’re done.

Note that exchangeability allows using µ2 and m2 here.

Lemma 2. For ε > 0 and δ > 0, there exists Nδ such that P(|Sh − Sk| > ε

)< δ

for Nδ < h < k.

Proof:

P(|Sh − Sk| > ε

)≤ 1ε2

(1h− 1k

)(µ2 −m2)→ 0

as h→∞, by Chebyshev’s inequality.

11/11/98

Page 102: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

100 Stat 8151

Complete convergence

This shows that the sequence Sn is Cauchy in measure, and hence the cor-responding distribution functions converge completely.39 Intuitively, the distri-bution functions of the Sn’s, for large n, look almost alike, and hence “shouldconverge.” (We could also appeal to Helly’s Selection Theorem40 to extract aconvergent subsequence.)

Some facts:Let Fn(z) = P (Sn ≤ z), so Fn is a distribution function on [0, 1]; since Xi’s

are 0 or 1, Sn ∈ [0, 1]. There exists a distribution function F on [0, 1] such thatFn converges to F completely, i.e., Fn(z) → F (z) for all continuity points z ofF , and also at 0 and 1. (This says nothing about discontinuity points of F .)

Moreover, moments converge. In particular,

ESkn =∫ 1

0

zk dFn(z)→∫ 1

0

zk dF (z).

Notice that

ESn = E

(X1 + · · ·+Xn

n

)=nm1

n= m1,

and

ES2n = E

(X1 + · · ·+Xn

n

)2

= E

(∑iX

2i

n2

)+ E

(∑i<j XiXj

n2

)=nµ2

n2+n(n− 1)

n2m2 → m2

as n→ 0; more generally,

ESkn =n(n− 1) · · · (n− k + 1)

nkmk + o(1/n).

Hence ESkn → mk as n → ∞, but ESkn →∫zk dF (z) by complete convergence,

so mk =∫zk dF (z).

Also mk = E(X1 · · ·Xk) = P (k 1’s in k trials) = wk, so∫zk dF (z) = wk.

39 The sequence Sn of random variables is Cauchy in probability, and thusconverges in probability (because Sn ∈ [0, 1], a closed subset of R1, which is com-plete); but then the sequence Sn also converges in distribution, which meansthe corresponding sequence of distribution functions converges pointwise at ev-ery continuity point of its limit. This convergence of the distribution functionsis what GM refers to as “complete convergence.”

40 Actually it was the Helly-Bray Theorem that was mentioned in the lecture.

11/11/98

Page 103: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 101

Now use the claim appearing in (2). When r + s = n, we have

w(n)r = P (r 1’s in n specified Xi’s)

=(n

r

)[wr −

(s

1

)wr+1 +

(s

2

)wr+2 − · · ·+ (−1)s

(s

s

)wr+s

]=(n

r

)[∫zr dF (z)−

(s

1

)∫zr+1 dF (z)

+(s

2

)∫zr+2 dF (z)− · · ·+ (−1)s

(s

s

)∫zr+s dF (z)

]=(n

r

)∫zr[1−

(s

1

)z +

(s

2

)z2 − · · ·+ (−1)s

(s

s

)zs]dF (z)

=(n

r

)∫zr(1− z)s dF (z)

=∫ 1

0

(n

r

)zr(1− z)n−r dF (z).

So, w(n)r , which is the probability of r successes in n specified trials, is seen to

be the average over Z ∈ [0, 1] of a binomial probability given Z = z. In otherwords, given Z = z, the conditional distribution of nSn = (number of successesin n specified trials) is binomial(z); the marginal distribution is a mixture.

We have proved a version of De Finetti’s Theorem.41

41 This proof (by Sudderth and Heath) appeared in The American Statisticianin November 1976.

11/11/98

Page 104: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

102 Stat 8151

Friday, November 13, 1998

Polya sampling revisited

Here is a short proof of the induction step from the earlier discussion ofPolya sampling.42

The idea is to realize that the induction hypothesis (the probability that thelast ball drawn is white is the same as the probability that the first ball drawn iswhite) is assumed to hold for every sequence of fewer than k draws from the urn,not just those sequences starting with the actual first ball drawn. In particular,we can consider a sequence X1, X2, . . . , Xk as a first drawing X1 followed by asequence X2, . . . , Xk of length k − 1; it is to that sequence that we apply theinduction assumption, i.e., P (Xk = 1 | X1 = x ) = P (X2 = 1 | X1 = x ) forx = 0 or 1. Thus

P (Xk = 1) = P (Xk = 1 | X1 = 1 )P (X1 = 1) + P (Xk = 1 | X1 = 0 )P (X1 = 0)

=a+ 1

a+ b+ 1· a

a+ b+

a

a+ b+ 1· b

a+ b=

a(a+ b+ 1)(a+ b+ 1)(a+ b)

=a

a+ b.

42 This just duplicates part of my notes for Monday, November 9; see pages 94–95 for the rest of the proof and an alternative for this part.

11/13/98

Page 105: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 103

Back to our regularly scheduled program . . .

Continuing from last time . . . .We proved a version of de Finetti’s theorem; a more general version, due to

Hewitt and Savage, says

The probability distributions of classes of exchangeable random variablesare “averages” of the probability distributions of classes of independentand identically distributed random variables.

So, if X1, X2, . . . is an infinite sequence of exchangeable random variables, whereXi = 0 or 1, then

P (r 1’s in any specified set of n Xi’s) =∫ 1

0

(n

r

)zr(1− z)n−r dF (z).

In particular, if we think of X1, . . . , Xn, and Z as random variables, then their“joint density” is zΣxi(1− z)n−Σxif(z) where f is the density for Z.

Infinite collections of exchangeable random variables are averages (mixtures)of i.i.d. random variables. However, a finite exchangeable sequence need not bean i.i.d. mixture.

Now suppose X1, X2, . . . is an infinite exchangeable sequence of real-valuedrandom variables. Let fz : z ∈ ] be the set of all densities on the line, indexedby elements of some index set ]. If we assume each Xi has a (marginal) densitythen the joint density for X1, . . . , Xn is given by∫

fz(x1) · · · fz(xn) dF (z).

11/13/98

Page 106: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

104 Stat 8151

Example (Where we can see what the limiting distribution F is.)

Recall the Polya urn scheme. (An urn with a white and b blue balls, a ≥ 1 andb ≥ 1. A ball is drawn and its color noted; that ball and another of the samecolor are returned to the urn, and the process is repeated.) Let Xi = 1 if theith ball drawn is white, and Xi = 0 if the ith ball drawn is blue. The sequenceX1, X2, . . . is exchangeable. (It is crucial that this is an infinite sequence—afinite exchangeable sequence need not be an i.i.d. mixture.)

There exists a probability distribution on [0, 1] with distribution function Fsuch that θ ∼ F and X1, X2, . . . given θ are i.i.d. Bernoulli(θ).

By the Strong Law of Large Numbers,∑ni=1Xi/n → θ a.s., conditionally

on θ.By the Bounded Convergence Theorem, E

(∑ni=1Xi/n

)→ Eθ.

By exchangeability, EXi doesn’t depend on i; thus EXi = EX1 = 1 ·P (X1 =1)+0 ·P (X1 = 0) = a/(a+b). But EXi = E[E(Xi | θ)] = Eθ. Hence, Eθ =

a

a+ b.

Now recall that on 11/1143 we defined wn = P (exactly n 1’s in n trials),and we had already computed that value on 11/9.44

Thus

P (X1 = X2 = · · · = Xn = 1) = wn =∫ 1

0

θn dF (θ)

= P (n white in n trials)

=a(a+ 1) · · · (a+ n− 1)

(a+ b)(a+ b+ 1) · · · (a+ b+ n− 1)

so

E(θn) =a(a+ 1) · · · (a+ n− 1)

(a+ b)(a+ b+ 1) · · · (a+ b+ n− 1),

and we know all the moments of the distribution of θ.But those are the moments of a Beta distribution—in particular, a B(a, b)

distribution—so θ ∼ B(a, b), with densityΓ(a+ b)Γ(a)Γ(b)

θa−1(1− θ)b−1.

43 page 9844 page 96

11/13/98

Page 107: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 105

So, for large n, the sample mean is approximately distributed as B(a, b). Thedistribution of balls in the urn is also approximately described by that sample

mean, for the proportion of white balls isa+X1 + · · ·+Xn

a+ b+ n≈ X1 + · · ·+Xn

nif nÀ a+ b.

• If we start with a = b = 1 then after, say, 100 draws, we would havean approximate B(1, 1) distribution, which is just uniform(0, 1). In otherwords, the proportion is distributed evenly over the whole interval (0, 1).

0.2 0.4 0.6 0.8 1

0.5

1

1.5

2

Beta density with a = b = 1

• If we start with a = 2 and b = 1, then we would have approximately aB(2, 1) distribution, which is skewed.

0.2 0.4 0.6 0.8 1

0.5

1

1.5

2

Beta density with a = 2, b = 1

• If we start with a = b and “large,” then our limiting distribution is sym-metric, but more concentrated near the center of (0, 1) rather than beingspread over the entire interval.

0.2 0.4 0.6 0.8 1

0.5

1

1.5

2

2.5

3

3.5

Beta density with a = b = 10

11/13/98

Page 108: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

106 Stat 8151

Philosophical discussion

A frequentist believes in a fixed but unknown θ (i.e., “truth”) that may bethought of as the limit of an infinite sequence of observations—a Bayesian saysthat such a parameter is meaningless, and that probability is subjective.

The Bayesian’s exchangeable sequence corresponds to the frequentist’s i.i.d.sequence, so de Finetti’s theorem lets a Bayesian justify looking for limitingbehavior without have to admit that he’s doing what a frequentist does.

Bayesian approaches can be hard to implement; nevertheless, Bayesianmethods can produce answers with good frequentist properties.

Neither frequentists nor Bayesians have all the answers.

Generalized Polya Urn

As noted last time, we could generalize this sampling scheme. We could havemore than two colors, in which case we’d get a Dirichlet—rather than a Beta—distribution, and we’d have a random vector to describe the contents of theurn.

We could add more or fewer balls of the same or different colors, or removeballs, instead of always adding one more matching the one drawn.

Exchangeability for finitely many random variables

Consider X1, . . . , Xn where each Xi = 0 or 1. This set is exchangeable if forevery fixed sequence of n 0’s or 1’s, written eini=1,

P (X1 = e1, . . . , Xn = en) = P (X1 = eπ(1), . . . , Xn = eπ(n))

for all π, where π is a permutation of 1, 2, . . . , n.So these are invariant under permutation.De Finetti’s theorem is false in this case, as shown in the following example.

11/13/98

Page 109: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 107

(Counter)example

Suppose n = 2, so we have 4 sample points.

X1 X2 P

1 0 1/20 1 1/20 0 01 1 0

By inspection it is clear that this joint distribution satisfies our definition ofexchangeable.45 Question: Is the distribution of X1 and X2 that of a mixture ofi.i.d. random variables?

Suppose it is, and let µ be the mixing prior.

Now 0 = P (X1 = 1, X2 = 1) =∫ 1

0

p2 dµ(p), so µ puts all its mass at 0, i.e.,

µ(0) = 1.

Similarly, 0 = P (X1 = 0, X2 = 0) =∫ 1

0

(1 − p)2 dµ(p), so µ(1) = 1, a

contradiction.So, exchangeability does not imply an i.i.d. mixture in the finite case.

This is hardly a contrived pathological example—the above distributionrepresents a sample of size two (without replacement) from an urn with twoballs of different colors. A sample of size two means we draw both balls, and ofcourse they are different.

45 P (X1 = 0, X2 = 0) = 0, P (X1 = 1, X2 = 1) = 0, and P (X1 = 1, X2 = 0) =P (X1 = 0, X2 = 1) = 1/2. There are only trivial permutations if X1 = X2, andthe permutation for the remaining case leaves the probabilities unchanged.

11/13/98

Page 110: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

108 Stat 8151

Geometry of finitely many exchangeable random variables

Let 32 = (p1, p2, p3, p4) : p1 = P (X1 = 0, X2 = 0), p2 = P (X1 = 0, X2 =1), p3 = P (X1 = 1, X2 = 0), p4 = P (X1 = 1, X2 = 1), . We claim that 32 is atetrahedron.

(1, 0, 0, 0)

(0, 1, 0, 0)

(0, 0, 1, 0)

(0, 0, 0, 1)

32

Furthermore, any point “inside” has a unique representation as a convex combi-nation of the four vertices.

Now let %2 ⊆ 32 be the set of exchangeable distributions for X1, X2. Thus%2 is the set of (p1, p2, p3, p4) ∈ 32 such that p2 = p3. This turns out to be theintersection of a plane and the tetrahedron, and appears as a triangular regionin the figure. It too is a convex set.

(1, 0, 0, 0)

(0, 1, 0, 0)

(0, 0, 1, 0)

(0, 0, 0, 1)

(0, 12 , 1

2 , 0)

%2 ⊆ 32

11/13/98

Page 111: Stat 8151 Notes - Statisticsusers.stat.umn.edu/~corbett/Stat_8151_Notes.pdfStat 8151 Lecture Notes* Fall Quarter 1998 1. ... 3. Wednesday, September 30, 1998 ... The ¾-algebra of

Mathematical Statistics 109

Monday, November 16, 1998

Continuing from last time, we have the collection 32 of probability vectors(p1, p2, p3, p4) corresponding to all possible distributions for (X1, X2) ∈ 0, 1 ×0, 1.

X2

0 1X1 0 p1 p2

1 p3 p4

Because p1 + p2 + p3 + p4 = 1 each vector lives in a three-dimensional subspaceof a four-dimensional space. We can represent that three-dimensional subspaceas a tetrahedron whose vertices are the four extreme points corresponding toputting all the mass on one sample point and zero on the rest.

Then we look at the subset %2 of exchangeable distributions—the elementsof 32 such that p2 = p3. The set %2 is a triangle formed by the intersection ofa plane with the tetrahedron 32. Both 32 and %2 are shown in this illustration(repeated from last time).

(1, 0, 0, 0)

(0, 1, 0, 0)

(0, 0, 1, 0)

(0, 0, 0, 1)

(0, 12 , 1

2 , 0)

%2 ⊂ 32

Now consider the collection I2 of i.i.d. distributions. Clearly I2 ⫋ E2, and I2 can be parameterized as (p², p(1 − p), p(1 − p), (1 − p)²). Note that this is a one-dimensional object. Where is I2 in our drawing?


Let's look at E2, and see where I2 fits in.

[Figure: the triangle E2 with vertices (1, 0, 0, 0), (0, 0, 0, 1), and (0, 1/2, 1/2, 0); the parabolic curve inside it is I2; de Finetti's Theorem holds below the curve and fails above it.]

In this illustration, the triangle is E2, and I2 is the parabolic curve. The shaded region below that curve is the collection of convex combinations of elements of I2, so it represents the distributions for which de Finetti's Theorem holds. (Remember that the theorem says exchangeable distributions are mixtures—here that means convex combinations—of i.i.d. ones.) These representations are not unique—there are uncountably many ways to represent each convex combination of elements of I2.

The unshaded area above the curve is the collection of exchangeable distributions where de Finetti's Theorem fails, for those are the distributions outside of the convex hull of I2.

Now, if X1, X2, X3 are exchangeable, then so are X1, X2. So now we ask “What subset of E2 contains all the distributions that can arise from an exchangeable sequence of length 3?”

[Figure: the triangle E2 with vertices (1, 0, 0, 0), (0, 0, 0, 1), and (0, 1/2, 1/2, 0); the points (0, 1/3, 1/3, 1/3) and (1/3, 1/3, 1/3, 0) are marked, and the shaded region at the top cannot be embedded in exchangeable 3-sequences.]

This time the shaded region at the top is all the 2-D exchangeable distributions that cannot be embedded in exchangeable sequences of length 3; the unshaded part below that is the collection of 2-D exchangeable distributions that can be embedded in exchangeable sequences of length 3.


More generally, what is the subset of E2 that can be embedded in an exchangeable sequence of length k?

[Figure: the triangle E2 with vertices (1, 0, 0, 0), (0, 0, 0, 1), and (0, 1/2, 1/2, 0); as k grows the embeddable region shrinks toward I2, illustrating de Finetti's Theorem in the limit as k → ∞.]

The boundary between the embeddable and non-embeddable regions is a polygon and it converges downward to I2 as k → ∞; this gives an alternative proof of de Finetti's Theorem.46

Finite exchangeable representations

Let ∆k represent all probability distributions on { (x1, . . . , xk) : xi = 0 or 1 }. Let S(l, k) be the set of points with l 1's and k − l 0's. Let Ek denote the exchangeable members of ∆k.

Now consider an urn with k balls, l marked “1” and the remaining k − l marked “0.” Draw a simple random sample of size k without replacement, i.e., exhaust the urn. If λl is the resulting probability distribution, then λl puts probability 1/(k choose l) on each of the (k choose l) points of S(l, k).

Claim. λl is exchangeable and is an extreme point of Ek.

Proof: Clearly λl is exchangeable. Off S(l, k) it assigns 0 probability, but within S(l, k) it assigns equal mass everywhere. Hence λl is invariant under permutations.

Now suppose λl = αa + (1 − α)b where a and b are exchangeable. We want to show that a and b are in fact λl. Clearly a and b assign 0 probability off S(l, k). In addition, each must assign equal probability to points of S(l, k), by exchangeability. But those facts characterize λl, so a = b = λl.

Hence λ0, λ1, . . . , λk are the extreme points of Ek, and every member of Ek has a unique representation as a mixture of these extreme points. So, for example, λ* = ∑_{l=0}^k wl λl where wl ≥ 0 and ∑_{l=0}^k wl = 1.

46 Diaconis, Synthese 36 (1977), pp. 271–281.


In our earlier example, with k = 2, we have l = 0, 1, or 2, so there are three extreme points. Each vertex of the E2 triangle corresponds to an urn model.

[Figure: the triangle E2 with its three vertices labeled by urn models: the vertex (1, 0, 0, 0) is the urn with two balls marked 0, so P(0, 0) = 1; the vertex (0, 0, 0, 1) is the urn with two balls marked 1, so P(1, 1) = 1; the vertex (0, 1/2, 1/2, 0) is the urn with one 0 and one 1, so P(0, 1) = P(1, 0) = 1/2. These are the three extreme points of E2.]

Exchangeable distributions are mixtures of these urns.
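Here is a quick numerical check of this representation for k = 2 (an illustration I added, not from the notes); the weights w0 = p1, w1 = 2p2, w2 = p4 follow from matching coordinates.

    import numpy as np

    # the three urn distributions (extreme points of E2), as vectors (p1, p2, p3, p4)
    lam0 = np.array([1.0, 0.0, 0.0, 0.0])   # two balls marked 0: P(0,0) = 1
    lam1 = np.array([0.0, 0.5, 0.5, 0.0])   # one 0 and one 1:    P(0,1) = P(1,0) = 1/2
    lam2 = np.array([0.0, 0.0, 0.0, 1.0])   # two balls marked 1: P(1,1) = 1

    p = np.array([0.3, 0.2, 0.2, 0.3])      # an arbitrary exchangeable distribution (p2 = p3)
    w = np.array([p[0], 2 * p[1], p[3]])    # mixing weights w0 = p1, w1 = 2*p2, w2 = p4
    print(np.allclose(w[0]*lam0 + w[1]*lam1 + w[2]*lam2, p), w.sum())   # True 1.0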

The exponential family of distributions

Suppose X1, . . . , Xn are i.i.d. with fθ(·) as the possible probability function or density function, and θ ∈ Θ. In statistics, typically the actual value of θ is unknown, and we will want to make inferences about θ.

Definition. Let Θ ⊆ Rk. A family of distributions on the real line with probability function or density fθ(·) for θ ∈ Θ is said to be an exponential family of distributions if it is given by

    fθ(x) = c(θ) h(x) exp[ ∑_{i=1}^k πi(θ) ti(x) ].   (♯)

The only part where θ and x appear together is in the exponential sum. In the discrete case,

    1/c(θ) = ∑_x h(x) exp[ ∑_{i=1}^k πi(θ) ti(x) ];

and in the continuous case,

    1/c(θ) = ∫_{−∞}^{∞} h(x) exp[ ∑_{i=1}^k πi(θ) ti(x) ] dx.


However, in general we can write

    1/c(θ) = ∫ h(x) exp[ ∑_{i=1}^k πi(θ) ti(x) ] dν.

As usual, the choices ν = counting measure or ν = Lebesgue measure yield, respectively, the discrete and continuous cases above.

Examples

Let X ∼ Bernoulli(θ) for θ ∈ Θ = (0, 1). Then

    fθ(x) = θ^x (1 − θ)^{1−x}   (x = 0 or 1)
          = (1 − θ) ( θ/(1 − θ) )^x
          = (1 − θ) exp[ x log( θ/(1 − θ) ) ]
          = (1 − θ) exp[ x ( log θ − log(1 − θ) ) ]

and this is in the desired form, with k = 1 and

    c(θ) = 1 − θ
    h(x) = 1 if x = 0 or x = 1, and 0 otherwise
    π(θ) = log θ − log(1 − θ)
    t(x) = x

It is basically the same example if we have X ∼ Binomial(n, θ) where n is known and θ ∈ Θ = (0, 1). Then

    fθ(x) = (n choose x) θ^x (1 − θ)^{n−x}   (x = 0, 1, . . . , n)
          = (1 − θ)^n (n choose x) exp[ x ( log θ − log(1 − θ) ) ].

Now h(x) = (n choose x) and c(θ) = (1 − θ)^n, but π(θ) and t(x) are the same as before.

However, if n is unknown then we cannot separate n and x in (n choose x), and the distribution is not in an exponential family.
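As a sanity check (added, not from the notes), the binomial factorization above can be verified numerically; the particular n and θ below are arbitrary.

    from math import comb, log, exp

    n, theta = 12, 0.3
    c  = (1 - theta) ** n                       # c(theta)
    pi = log(theta) - log(1 - theta)            # natural parameter pi(theta)
    for x in range(n + 1):
        pmf  = comb(n, x) * theta**x * (1 - theta)**(n - x)
        fact = c * comb(n, x) * exp(pi * x)     # c(theta) h(x) exp[pi(theta) t(x)]
        assert abs(pmf - fact) < 1e-12
    print("binomial pmf matches its exponential-family factorization")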


Another example, and natural parameter spaces

Now suppose X ∼ N(µ, σ²), with −∞ < µ < ∞ and σ² > 0. Then, for −∞ < x < ∞, we have

    f_{µ,σ²}(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}
                = (1/√(2πσ²)) exp[ −µ²/(2σ²) ] exp[ −x²/(2σ²) + µx/σ² ].

This fits the exponential family pattern if we take θ = (µ, σ), so

    c(θ) = (1/√(2πσ²)) e^{−µ²/(2σ²)}
    h(x) ≡ 1
    π1(θ) = −1/(2σ²),   t1(x) = x²
    π2(θ) = µ/σ²,       t2(x) = x

But we could instead take θ = (θ1, θ2) where θ1 = −1/(2σ²) and θ2 = µ/σ². Then c(θ) = (1/√(π(−1/θ1))) e^{θ2²/(4θ1)}, and πi(θ) = θi with θ1 ∈ (−∞, 0) and θ2 ∈ (−∞, ∞).

So there is more than one way to parameterize, and we get simpler functions the second way. When we choose the parameters so that πi(θ) = θi for i = 1, . . . , k, this is called the natural parameterization.
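A brief numerical check (added) that the two parameterizations above give the same normalizing constant; the particular values of µ and σ² are arbitrary.

    import numpy as np

    mu, sigma2 = 1.7, 2.5
    c_first  = np.exp(-mu**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    theta1, theta2 = -1 / (2 * sigma2), mu / sigma2
    c_second = np.exp(theta2**2 / (4 * theta1)) / np.sqrt(np.pi * (-1 / theta1))

    print(np.isclose(c_first, c_second))   # True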

Definition. A statistic is a function of a random variable or a collection of random variables.

So, if X is defined on a probability space (Ω, 𝒜, P), then T(X) is a statistic.


Random samples

Suppose we take a random sample, with X1, . . . , Xn i.i.d. such that

    Xj ∼ c(θ) h(xj) exp[ ∑_{i=1}^k πi(θ) ti(xj) ]

and form the joint density

    fθ(x1, . . . , xn) = ∏_{j=1}^n c(θ) h(xj) exp[ ∑_{i=1}^k πi(θ) ti(xj) ]
                       = [c(θ)]^n · ∏_{j=1}^n h(xj) · exp[ ∑_{j=1}^n ( ∑_{i=1}^k πi(θ) ti(xj) ) ]
                       = [c(θ)]^n · ∏_{j=1}^n h(xj) · exp[ ∑_{i=1}^k ( πi(θ) ∑_{j=1}^n ti(xj) ) ].

Note that the data (the xj's) appear with the parameters only through the sums of the ti's.

Let T = (T1, . . . , Tk) = ( ∑_{j=1}^n t1(Xj), . . . , ∑_{j=1}^n tk(Xj) ). This T is a statistic, and in fact is a sufficient statistic.

A sufficient statistic stays in the family; recall our examples of Bernoulli and binomial distributions.
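A small check (added), using the Bernoulli example from earlier: two samples with the same value of T = ∑ xj have identical joint probability for every θ, so the joint density depends on the data only through T.

    import numpy as np

    def joint_pmf(x, theta):
        x = np.asarray(x)
        return np.prod(theta**x * (1 - theta)**(1 - x))

    x1 = [1, 1, 0, 0, 1]        # T = 3
    x2 = [0, 1, 1, 1, 0]        # T = 3 as well
    for theta in (0.2, 0.5, 0.9):
        assert np.isclose(joint_pmf(x1, theta), joint_pmf(x2, theta))
    print("joint pmf depends on the data only through T = sum(x_j)")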

Lemma. Let X1, . . . , Xn be i.i.d. with density or probability function c(θ) h(xj) exp[ ∑_{i=1}^k πi(θ) ti(xj) ]. Then the joint probability function or density for T has the form

    f*_θ(t) = c0(θ) h0(t) exp[ ∑_{i=1}^k πi(θ) ti ].

(For the continuous case, we need the additional assumption that a k-dimensional density of T exists.)

Proof: (for the discrete case)

Let A = { (x1, . . . , xn) : ∑_{j=1}^n t1(xj) = t1, . . . , ∑_{j=1}^n tk(xj) = tk }.

Now

    f*_θ(t) = f*_θ(t1, . . . , tk) = Pθ(T1 = t1, . . . , Tk = tk)
            = ∑_A [c(θ)]^n · ∏_{j=1}^n h(xj) · exp[ ∑_{i=1}^k πi(θ) ∑_{j=1}^n ti(xj) ]   (the inner sum over j is Ti);


but the overall sum is over precisely those points for which the innermost sums are constants, i.e., Ti = ∑_{j=1}^n ti(xj) = ti for (x1, . . . , xn) ∈ A.

All we need to do now is identify the functions c0 and h0, so we set

    c0(θ) = [c(θ)]^n

and

    h0(t) = h0(t1, . . . , tk) = ∑_A ∏_{j=1}^n h(xj),

and we are done.47

47 A potentially confusing point is the use of “ti” both for functions of the xj's and for summed values of those functions, e.g., ∑_{j=1}^n ti(xj) = ti. This unfortunate choice of notation is perhaps tolerable because we'll not see both uses together anywhere else, and it is sort of clear from context within this proof which meaning is intended. Another rationalization is that sufficiency allows us to use the ti's without needing all the detail of the underlying xj's; we need only the values taken by the functions, so we can treat them as reduced data.

I've written the parameter in boldface, θ, as a reminder that θ ∈ Rk. Often we use plain θ instead, but when we do so we generally do not intend to restrict θ to scalar values.


Wednesday, November 18, 1998

Natural parameter space

Suppose

    fθ(x) = c(θ) h(x) exp[ ∑_{i=1}^k πi(θ) ti(x) ].   (1)

If in (1) each πi is taken as a parameter, then (1) becomes

    fπ(x) = c(π) h(x) exp[ ∑_{i=1}^k πi ti(x) ],   (2)

where π = (π1, . . . , πk) is the natural parameterization of the exponential family in (1). Basically, forget about θ and use π. Let

    Π = { (π1, . . . , πk) : 1/c(π) = ∫ h(x) exp[ ∑_{i=1}^k πi ti(x) ] dν(x) < ∞ }.   (3)

So Π ⊆ Rk and (π1(θ), . . . , πk(θ)) ∈ Π for each θ ∈ Θ. Note that Π is just the largest collection of π's for which the integral in (3) is finite.

Lemma. Π is convex.

Proof: (for the continuous case, so dν = dx here, although the same proof works in general.)

Let π = (π1, . . . , πk) ∈ Π and π′ = (π′1, . . . , π′k) ∈ Π, and suppose α ∈ (0, 1). Is απ′ + (1 − α)π ∈ Π? To answer that, we want to show that

    ∫ h(x) exp[ ∑_{i=1}^k ( απ′i + (1 − α)πi ) ti(x) ] dx < ∞.


So, do the obvious thing.

    ∫ h(x) exp[ ∑_{i=1}^k ( απ′i + (1 − α)πi ) ti(x) ] dx
      = ∫ h(x) exp[ ∑_{i=1}^k α(π′i − πi) ti(x) ] exp[ ∑_{i=1}^k πi ti(x) ] dx;

multiply and divide by c(π), write 1 = 1^{1−α}, and rearrange to get

      = (1/c(π)) ∫ ( exp[ ∑_{i=1}^k (π′i − πi) ti(x) ] )^α · 1^{1−α} · c(π) h(x) exp[ ∑_{i=1}^k πi ti(x) ] dx
        [the factor c(π) h(x) exp[ ∑_i πi ti(x) ] is the density under π]
      = (1/c(π)) Eπ( ( exp[ ∑_{i=1}^k (π′i − πi) ti(X) ] )^α · 1^{1−α} );

now use Hölder's Inequality (with exponents 1/α and 1/(1 − α)):

      ≤ (1/c(π)) ( Eπ exp[ ∑_{i=1}^k (π′i − πi) ti(X) ] )^α · ( Eπ 1 )^{1−α}
      = (1/c(π)) ( ∫ exp[ ∑_{i=1}^k (π′i − πi) ti(x) ] · c(π) h(x) exp[ ∑_{i=1}^k πi ti(x) ] dx )^α
      = (1/c(π)) ( c(π) ∫ h(x) exp[ ∑_{i=1}^k π′i ti(x) ] dx )^α
      = (1/c(π)) ( c(π)/c(π′) )^α < ∞,

since π′ ∈ Π. Hence απ′ + (1 − α)π ∈ Π, and Π is convex.

Dimension of parameter space

So, Π is a convex set in Rk, but its effective dimension may be less than k, as Π may be contained in some hyperplane V in Rk given by

    ∑_{i=1}^k βi πi = c.   (4)

In that situation, the interior of Π would be a non-empty open set in the relative topology of that hyperplane, but not in Rk.

This is not a big problem, however, as we can solve (4) for some πi in terms of the rest, and thus reduce the dimension by one. If needed, we can repeat that, so without loss of generality, we can assume that Π has a non-empty interior, i.e., contains an open set in Rk for some k, although perhaps not the original k.


[Figure: Π contained in a hyperplane V ⫋ Rk, with dim Π ≤ dim V < dim Rk = k.]

Not all families of distributions are exponential families

Two examples are the Cauchy and the uniform.

1. The Cauchy distribution has density f(x) = (β/π) · 1/( β² + (x − α)² ), where β > 0 and α ∈ R¹. This distribution has no finite moments, but typically we do have moments for distributions that belong to an exponential family.

2. The uniform distribution on (0, θ) has density

       fθ(x) = 1/θ for 0 < x < θ, and 0 elsewhere,

   but a problem is that the range depends on θ. Recall that when we write h(x) exp[π(θ)t(x)], we want to leave it to the dominating measure ν to determine the range (by where it puts mass), but that doesn't work for the uniform distribution.

The classic example of a non-regular family is the uniform(θ), because its range depends on the parameter.


Moments

As noted in the first example above (Cauchy), we typically have good behavior for distributions from exponential families. Consider the one-parameter situation where fπ(x) = c(π) exp[π t(x)] h(x), so

    1/c(π) = ∫ exp[π t(x)] h(x) dν(x).

Suppose we could differentiate (with respect to π) under the integral sign. Then

    −c′(π)/[c(π)]² = ∫ t(x) exp[π t(x)] h(x) dν(x).

Now multiply by c(π) to get

    −c′(π)/c(π) = ∫ t(x) c(π) exp[π t(x)] h(x) dν(x) = Eπ[t(X)].

We can repeat this to get the second moment and the variance; if the parameter is a vector we use partial derivatives, so we can get covariances too.
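A numerical illustration (added) for the Bernoulli family in its natural parameterization: from the earlier example, c = 1 − θ and π = log(θ/(1 − θ)), so as a function of π we have c(π) = 1/(1 + e^π), and the identity Eπ[t(X)] = −c′(π)/c(π) should return E[X] = e^π/(1 + e^π).

    import numpy as np

    def c(pi):
        return 1.0 / (1.0 + np.exp(pi))       # c(pi) for the Bernoulli family

    pi, eps = 0.7, 1e-6
    c_prime = (c(pi + eps) - c(pi - eps)) / (2 * eps)   # numerical derivative of c
    lhs = -c_prime / c(pi)                              # -c'(pi)/c(pi)
    rhs = np.exp(pi) / (1 + np.exp(pi))                 # E[X] = theta
    print(np.isclose(lhs, rhs, atol=1e-6))              # True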

Lemma. Let ε > 0 be given. Then, for |h| < ε,

    | (e^{ht} − 1)/h | ≤ 2k ( e^{2εt} + e^{−2εt} )

for all t ∈ R, where k = kε depends on ε, but not on t.

Proof: Fix t and let ψ(u) = e^{tu}. By the Mean Value Theorem, there exists ξh with |ξh| ≤ |h| such that

    ( ψ(h) − ψ(0) )/(h − 0) = (e^{th} − 1)/h = ψ′(ξh) = t e^{tξh}.

Hence, for |h| < ε,

    | (e^{ht} − 1)/h | ≤ |t| ( e^{εt} + e^{−εt} ).

Since lim_{|t|→∞} |t| e^{−ε|t|} = 0, there exists k = kε such that |t| ≤ k e^{ε|t|} for all t. Now replace |t| by that bound. Hence,

    | (e^{ht} − 1)/h | ≤ k e^{ε|t|} ( e^{εt} + e^{−εt} ).

So now, for t > 0, |t| = t, so e^{ε|t|} (e^{εt} + e^{−εt}) = e^{2εt} + 1 ≤ 2 e^{2εt}. For t < 0, |t| = −t, so e^{ε|t|} (e^{εt} + e^{−εt}) = 1 + e^{−2εt} ≤ 2 e^{−2εt}. Hence,

    | (e^{ht} − 1)/h | ≤ 2k ( e^{2εt} + e^{−2εt} ).

We'll use this to justify using the Dominated Convergence Theorem when we prove the following theorem.48

48 This lemma was actually stated and proved halfway through the proof of the theorem that follows the lemma's proof.


Theorem. Let Π be an open subset of Rk. If φ(X1, . . . , Xn) is a function for which

    ∫ · · · ∫ φ(x1, . . . , xn) exp[ ∑_{i=1}^k πi ∑_{j=1}^n ti(xj) ] · ∏_{j=1}^n h(xj) dx1 · · · dxn

exists for all π ∈ Π, then derivatives of all orders with respect to π may be passed under the integral sign. (Although stated for the continuous case, this is true in general.)

Proof: (for the special case k = 2)

Change of notation: let θi = πi, and let h(x) be absorbed into φ. Now let

    w(θ1, θ2) = ∫ φ(x) e^{θ1 t1(x) + θ2 t2(x)} dν(x).   (5)

Assuming the integral in (5) exists over the natural parameter space Θ, which is an open set in R², let θ′ = (θ′1, θ′2) be an interior point of Θ.

Let δ > 0 be such that (θ′1 + δ, θ′2) and (θ′1 − δ, θ′2) are both in Θ. We want to show that

    ∂w/∂θ1 |_{θ1 = θ′1} = ∫ φ(x) t1(x) e^{θ′1 t1(x) + θ′2 t2(x)} dν.


Friday, November 20, 1998

We continue with the proof of the theorem from last time. We compute the derivative, using the definition.

    ∂w/∂θ1 |_{θ1 = θ′1}
      = lim_{h→0} [ w(θ′1 + h, θ′2) − w(θ′1, θ′2) ] / h
      = lim_{h→0} (1/h) ( ∫ φ(x) e^{(θ′1+h) t1(x) + θ′2 t2(x)} dν(x) − ∫ φ(x) e^{θ′1 t1(x) + θ′2 t2(x)} dν(x) )
      = lim_{h→0} ∫ φ(x) e^{θ′2 t2(x)} [ ( e^{(θ′1+h) t1(x)} − e^{θ′1 t1(x)} ) / h ] dν(x)
      = lim_{h→0} ∫ φ(x) e^{θ′1 t1(x) + θ′2 t2(x)} [ ( e^{h t1(x)} − 1 ) / h ] dν(x).   (1)

Now pick ε such that 0 < ε < δ/2 and apply the lemma to obtain

    | ( e^{h t1(x)} − 1 ) / h | ≤ 2k ( e^{2ε t1(x)} + e^{−2ε t1(x)} ).   (2)

Next use (2) on the integrand in (1) to get

    | φ(x) e^{θ′1 t1(x) + θ′2 t2(x)} [ ( e^{h t1(x)} − 1 ) / h ] |
      ≤ 2k |φ(x)| ( e^{(θ′1+2ε) t1(x) + θ′2 t2(x)} + e^{(θ′1−2ε) t1(x) + θ′2 t2(x)} ).   (3)

Observe that (θ′1 + 2ε, θ′2) ∈ Θ and (θ′1 − 2ε, θ′2) ∈ Θ, so the dominating function in (3) is integrable. Now we can apply Lebesgue's Dominated Convergence Theorem to justify interchanging the limit and the integral in (1), and we're done.

We can repeat this process for higher-order derivatives, e.g., just replace φ(x) by φ(x) t1(x) when finding second derivatives.


Complete statistics

Suppose X is a random variable (possibly vector-valued) and { fθ : θ ∈ Θ } is a family of possible densities or probability functions. Let T be a statistic—a function of the data X. Let γ(θ) be some function of the parameter.

After observing X, we wish to estimate γ(θ), i.e., make a good guess for the value of γ(θ). Think of δ(X) as our estimator of γ(θ).

Definition. We say δ(X) is unbiased for estimating γ(θ) if Eθ(δ(X)) = γ(θ) for all θ ∈ Θ.

Although being unbiased sounds like a desirable property for an estimator, we'll see that it is not necessarily a Good Idea—in fact we'll see examples where it is a Bad Idea.

Now suppose that γ(θ) = 0. Clearly one unbiased estimator of such a γ would be the trivial estimator δ(X) ≡ 0. Completeness says the only unbiased estimator of 0 is 0. If there are nontrivial unbiased estimators of zero, then the family is not complete.

Definition. The statistic T(X), along with its family of distributions for θ ∈ Θ, is said to be complete if every real-valued function g satisfying Eθ[g(T(X))] = 0 for every θ ∈ Θ also satisfies Pθ[g(T(X)) = 0] = 1.

So an unbiased estimator of zero is zero with probability one.

Examples

1. Assume X = (X1, . . . , Xn) where the Xi are i.i.d. Bernoulli(θ) for θ ∈ (0, 1). If γ(θ) = θ then we are trying to estimate θ. Suppose Xi records success (1) or failure (0) for a drug administered to the ith subject in a clinical trial. If we take the sample mean X̄ as our δ(X), and if we observe 17 1's and 3 0's in n = 20 trials, then we estimate θ by .85. Repeated sequences of trials would yield different estimates, but the average of those estimates should be near the true value, i.e., X̄ is unbiased for θ.

2. Suppose X ∼ N(0, θ), with θ ∈ (0, ∞) = Θ. Take T(X) = X and g(t) = t, so Eθ[g(T(X))] = Eθ X = 0 for all θ. But Pθ[g(T(X)) = 0] = Pθ(X = 0) = 0 ≠ 1. Since X is not identically zero, this family is not complete.

Back to the Bernoulli example. Let T(X) = ∑_{i=1}^n Xi; this is a sufficient statistic, and we know that T ∼ Binomial(n, θ).

Claim. T for θ ∈ (0, 1) is complete.

Proof: Suppose for all θ ∈ (0, 1) that Eθ[g(T(X))] = 0. We want to show that g ≡ 0. So

    0 = Eθ[g(T(X))] = ∑_{t=0}^n g(t) (n choose t) θ^t (1 − θ)^{n−t}
      = (1 − θ)^n ∑_{t=0}^n g(t) (n choose t) ( θ/(1 − θ) )^t.


Now (1 − θ)^n ≠ 0 because θ < 1, and we can set ρ = θ/(1 − θ). Then we have

    0 = ∑_{t=0}^n g(t) (n choose t) ρ^t,   for all ρ ∈ (0, ∞).   (4)

But the right-hand side of (4) is a polynomial of degree at most n in ρ, yet it has more than n zeroes; it follows that g(t) ≡ 0.

Completeness is a measure of the richness of the parameter space. Note that uncountably many ρ's are solutions to (4), but we really needed only n + 1 of them to show that T was complete. Suppose Θ ⫋ (0, 1); under what conditions would we have completeness? If Θ contains at least n + 1 points, then, with Θ, T = ∑_{i=1}^n Xi is complete.

If we add enough parameter points then eventually Eθ[g(T(X))] = 0 has only trivial solutions, and we have completeness.

This has applications later when we discuss Best Unbiased Estimators. Suppose we have an urn with ten balls, either white or blue, and we sample n times with replacement. Suppose we want to estimate the proportion of white balls in the urn. If n = 9 then there is a BUE, but if n = 11 a BUE does not exist.

We can generalize this as follows. Let Ω = Rⁿ and let 𝒜 be the Borel sets; let µ be a σ-finite measure on 𝒜. Let fθ(x) = k(θ) h(x) exp[ ∑_{i=1}^r θi ti(x) ] be a density with respect to µ. Let x = (x1, . . . , xn), and let θ = (θ1, . . . , θr) ∈ Θ ⊆ Rr.49

Theorem. Under this setup,50 T (along with its family of distributions indexed by θ ∈ Θ) is complete provided Θ contains a non-empty open subset of Rr.51

Proof: First a few preliminaries:

1. We will use the uniqueness of moment generating functions. Suppose Y = (Y1, . . . , Yr), remember that θ = (θ1, . . . , θr), and let m(θ) = E e^{θ·Y}, so m(θ1, . . . , θr) = E e^{θ1 Y1 + ··· + θr Yr}. At the end of the proof, we'll use the fact that if m(θ) exists for θ in a neighborhood of the origin, then m uniquely determines the distribution of Y.

2. The key assumption is that Θ is of full dimension.

3. Without loss of generality, we may assume that the origin (0, . . . , 0) is an interior point of Θ. If not, we can reparameterize: if θ⁰ = (θ⁰1, . . . , θ⁰r) is an interior point of Θ, write θ* = θ − θ⁰; then fθ(x) = k(θ* + θ⁰) h(x) exp[ ∑_{i=1}^r (θ*i + θ⁰i) ti(x) ] = k*(θ*) h*(x) exp[ ∑_{i=1}^r θ*i ti(x) ], where k*(θ*) = k(θ* + θ⁰) and h*(x) = h(x) exp[ ∑_{i=1}^r θ⁰i ti(x) ]. This is the usual add/subtract/rearrange argument.

49 It's time for another quasi-random change of notation. What we had called c before will now be denoted by k, and what had been k is now called r. As noted earlier, we'll use plain θ instead of boldface θ, so we'll also use x instead of boldface x.

50 I assume T is still defined as ∑_{i=1}^n Xi.

51 In our earlier Bernoulli/binomial example, using this theorem would require that Θ contain an open interval—that's actually more than we needed for that example. This is a stronger general exponential family result.

On to the real proof, by contradiction. Suppose that T is not complete. Hence, there exists some function g such that

    Eθ[g(T(X))] = 0 for all θ ∈ Θ,

and

    Eθ |g(T(X))| > 0 for some θ ∈ Θ.

The latter condition is clearly equivalent to g(T(X)) ≢ 0. In what follows, we suppress the x part.

    0 = Eθ[g(T(X))] = k(θ) ∫_Ω g(t) h exp[ ∑_{i=1}^r θi ti ] dµ

[let dν = h dµ]

      = k(θ) ∫_Ω g(t) e^{∑_i θi ti} dν

[make a change of variables, where 𝒯 is the range of T and νT(C) = ν(T ∈ C)]

      = k(θ) ∫_𝒯 g(y) e^{∑_i θi yi} dνT.   (5)

Next let g+ = max{g, 0} and g− = −min{g, 0}, so g = g+ − g−. Now k(θ) > 0 for all θ, so we can replace (5) by

    0 = ∫_𝒯 ( g+(y) − g−(y) ) e^{∑_i θi yi} dνT,

which is equivalent to

    ∫_𝒯 e^{∑_i θi yi} g+(y) dνT = ∫_𝒯 e^{∑_i θi yi} g−(y) dνT   (for all θ).   (6)

If we consider θ = 0 = (0, . . . , 0) then (6) becomes

    ∫_𝒯 g+(y) dνT = ∫_𝒯 g−(y) dνT,   (7)


so g+ and g− are νT-integrable. (If the common value in (7) is zero, then g+ = g− = 0 a.e. (νT) and we are done; so assume it is positive.) Now divide (6) by (7):

    ∫_𝒯 e^{∑_i θi yi} [ g+(y) / ∫_𝒯 g+ dνT ] dνT = ∫_𝒯 e^{∑_i θi yi} [ g−(y) / ∫_𝒯 g− dνT ] dνT.   (8)

But g+/∫ g+ dνT and g−/∫ g− dνT are densities, so in (8) we have the same function integrated with respect to two densities with respect to a common measure. Thus (8) says those densities have the same moment generating function for θ ∈ Θ, and by assumption Θ contains (0, . . . , 0) in its interior.

Hence g+ = g− a.e. (νT), and (handwaving) g+ = g− a.e. (ν) as well. But then g = g+ − g− = 0 a.e., which contradicts the assumption that Eθ|g(T(X))| > 0 for some θ ∈ Θ.

In an exponential family, the natural sufficient statistic is complete provided the parameter space has a non-empty interior.
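The following sketch (added) mirrors the finite-parameter remark from the binomial example rather than the theorem itself: with n + 1 distinct values of θ, the linear system Eθ[g(T)] = 0 has a full-rank coefficient matrix, so only g ≡ 0 solves it.

    import numpy as np
    from math import comb

    n = 4
    thetas = np.linspace(0.1, 0.9, n + 1)            # n + 1 distinct parameter points
    A = np.array([[comb(n, t) * th**t * (1 - th)**(n - t) for t in range(n + 1)]
                  for th in thetas])                 # row theta gives E_theta[g(T)] coefficients
    print(np.linalg.matrix_rank(A))                  # n + 1, so A g = 0 forces g = 0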

Order Statistics

Assume that F is a continuous distribution function on R; however, it need not be absolutely continuous. Suppose X1, . . . , Xn are i.i.d. ∼ F, and set X = (X1, . . . , Xn)′, a column vector.

We will rearrange the Xi's in order from smallest to largest to form the order statistic.

This is important because whether we think of statistics as being (1) a way to make sense of data, or (2) a guide for making decisions under uncertainty, we are often dealing with an unknown distribution. We can use information based on the order statistic without having to make strong assumptions about that underlying distribution.

Lemma 1. P(Xi = Xj for some i ≠ j) = 0.

“Proof:” P(Xi = Xj for some i ≠ j) ≤ ∑_{i<j} P(Xi = Xj), since P(⋃_i Ai) ≤ ∑_i P(Ai). Hence, it is sufficient to prove that P(X1 = X2) = 0. But P(X1 = X2) = ∫ P(X1 = x | X2 = x) dF(x) = ∫ 0 dF = 0.

There are no ties from a continuous distribution.

Let S = { x ∈ Rⁿ : xi ≠ xj for i ≠ j }. We've just seen that P(X ∈ S) = 1.


Monday, November 23, 1998

Let O = { x : x ∈ S and x1 < x2 < · · · < xn }, and let T : S → O be defined by

    T(x) = (x(1), . . . , x(n))′,   where min_i xi = x(1) < x(2) < · · · < x(n) = max_i xi.

We sometimes write y = T(x), so yi = x(i).

Y = T(X) = (X(1), . . . , X(n))′ is the order statistic. Note that Y is obtained from X by permuting the coordinates.

Let Π = Πn denote the group of permutations of {1, 2, . . . , n}, where the group operation is composition of functions, e.g., (π1 ∘ π2)(i) = π1(π2(i)). Note that there are n! elements in Π.

For x ∈ Rⁿ and π ∈ Π, define πx by (πx)i = xπ(i), so

    πx = π(x1, . . . , xn)′ = (xπ(1), . . . , xπ(n))′.

Lemma 2.
(i) y = Tx ⟺ πy = x for some π ∈ Π ⟺ y = π′x for some π′ ∈ Π.
(ii) Each y ∈ O is the image under T of precisely n! points in S, namely, the points { πy : π ∈ Π }.
(iii) For C ⊆ O, T⁻¹(C) = ⋃_{π∈Π} π(C), and this is a union of disjoint sets.

Proof: “obvious”

Lemma 3. For all π ∈ Π, π(X) ∼ X.

Proof: Remember the Xi are i.i.d. ∼ F.

Proposition. For C ⊆ O, P(Y ∈ C) = n! P(X ∈ C).

Proof: P[Y ∈ C] = P[TX ∈ C] = P[X ∈ T⁻¹(C)] = ∑_π P[X ∈ π(C)] = ∑_π P[π⁻¹(X) ∈ C] = n! P[X ∈ C].


Corollary. If the distribution of X has density f(x) = ∏_{i=1}^n f*(xi), where f* is the density of F, then the distribution of Y has density

    g(y) = n! f(y) for y ∈ O, and 0 elsewhere.

Proof: Follows immediately from the previous proposition.

In particular, if F is uniform on [0, 1], then

    f(x) = f(x1, . . . , xn) = 1 if 0 ≤ xi ≤ 1 for i = 1, . . . , n, and 0 otherwise;

hence

    g(y) = g(y1, . . . , yn) = n! if 0 ≤ y1 ≤ y2 ≤ · · · ≤ yn ≤ 1, and 0 otherwise.

Note that all the above remains true if “i.i.d.” is replaced by “exchangeable”—all we really used was just invariance under permutations, i.e., exchangeability.

For x ∈ S, let πx ∈ Π be that permutation such that

    πx(Tx) = x,   or   Tx = πx⁻¹(x),   or   (Tx)i = x(i) = x_{πx⁻¹(i)}.

Thus, y = T(x) and πx together determine x.

Proposition.
(i) P(πX = π) = 1/n! for all π ∈ Π.
(ii) πX and Y are independent.

Proof:
(i) πX = π ⟺ π(Y) = X ⟺ π⁻¹(X) ∈ O; but P[π⁻¹(X) ∈ O] = P[X ∈ O] by Lemma 3, and in turn that is (1/n!) P[Y ∈ O] = (1/n!) · 1 = 1/n!, by the previous proposition.

(ii) Let C ⊆ O. Now

    P[πX = π and Y ∈ C] = P[Y ∈ C | πX = π] · P[πX = π]
      = P[π⁻¹(X) ∈ C | πX = π] · (1/n!)
      = P[π⁻¹(X) ∈ C | π⁻¹(X) ∈ O] · (1/n!)
      = ( P[π⁻¹(X) ∈ C] / P[π⁻¹(X) ∈ O] ) · (1/n!)   (because C ⊆ O)
      = ( P[X ∈ C] / P[X ∈ O] ) · (1/n!)   (by Lemma 3—added 11/30/98)
      = ( (1/n!) P[Y ∈ C] / ((1/n!) P[Y ∈ O]) ) · (1/n!)
      = ( P[Y ∈ C] / 1 ) · (1/n!)
      = P[Y ∈ C] · P[πX = π].
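A Monte Carlo sanity check (added): for an i.i.d. continuous sample, the sorting permutation πX should be uniform over the n! possibilities, in line with part (i) of the proposition.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(1)
    n, reps = 3, 60_000
    counts = Counter()
    for _ in range(reps):
        x = rng.normal(size=n)
        counts[tuple(np.argsort(x))] += 1    # the permutation that sorts the sample
    for perm, c in sorted(counts.items()):
        print(perm, round(c / reps, 3))      # each frequency is close to 1/6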


Suppose X1, . . . , Xn are i.i.d. F, and X(1) ≤ X(2) ≤ · · · ≤ X(n). Let Fr be the distribution function of X(r), so

    Fr(x) = P[X(r) ≤ x]
          = P[Xi ≤ x for at least r of the Xi's]
          = ∑_{i=r}^n (n choose i) [F(x)]^i [1 − F(x)]^{n−i}.

This is true for any F—we don't need to assume continuity.

However, now assume F has a continuous density f*. We want to find the density for the distribution of X(r). First we need a lemma.

Lemma. lim_{h→0} P[2 or more observations in (u − h, u]] / h = 0,

i.e., P[2 or more observations in a small interval of length h] = o(h).

Proof: The probability that an observation lies within the interval (u − h, u] is F(u) − F(u − h). Let Z be the number of observations (in our sample of n) falling in (u − h, u]. Then Z ∼ Binomial(n, p) where p = F(u) − F(u − h). So,

    P[Z ≥ 2] = ∑_{j=2}^n (n choose j) [F(u) − F(u−h)]^j [1 − (F(u) − F(u−h))]^{n−j}
             = [F(u) − F(u−h)]² × ∑_{j=2}^n (n choose j) [F(u) − F(u−h)]^{j−2} [1 − (F(u) − F(u−h))]^{n−j},

because j ≥ 2. Now divide by h to get

    P[Z ≥ 2]/h = [ (F(u) − F(u−h))/h ]   (α)
                 × [ F(u) − F(u−h) ]   (β)
                 × ∑_{j=2}^n (n choose j) [F(u) − F(u−h)]^{j−2} [1 − (F(u) − F(u−h))]^{n−j}.   (γ)

Now look at the factors labeled α, β, and γ, and let h ↓ 0.

(α) By the Mean Value Theorem (or the Fundamental Theorem of Calculus), there is some ξ ∈ (u − h, u) such that f*(ξ) = (F(u) − F(u−h))/h → f*(u) as h ↓ 0.

(β) lim_{h↓0} [F(u) − F(u−h)] = 0 by continuity of F.

(γ) Again by continuity, there is a finite limit. (For n ≥ 2 that limit is (n choose 2), but existence of the limit is all we need here.)

Finally, lim_{h↓0}(α · β · γ) = lim_{h↓0} α · lim_{h↓0} β · lim_{h↓0} γ = f*(u) · 0 · lim_{h↓0} γ = 0.


Monday, November 30, 1998

Density of the rth order statistic

Remember we're assuming absolute continuity, so f* is the density for X, and now we want to find gr(u), the density for X(r).

We can look at this as a multinomial problem, putting n balls in three cells with this distribution:

    r − 1 observations below u − h,   1 observation in (u − h, u],   n − r observations above u.   (∗)

Now P(u − h < X(r) < u) ≈ h · gr(u), and by our lemma, for sufficiently small h there can be no more than one observation in that interval. So the probability that we have the configuration described in (∗) is

    [ n! / ((r−1)! 1! (n−r)!) ] [F(u−h)]^{r−1} ( ∫_{u−h}^u f*(t) dt ) [1 − F(u)]^{n−r}.

Divide by h,

    P(u − h < X(r) < u)/h = [ n! / ((r−1)! (n−r)!) ] [F(u−h)]^{r−1} × [ ∫_{u−h}^u f*(t) dt / h ] [1 − F(u)]^{n−r},

and take the limit as h ↓ 0, to get

    gr(u) = [ n! / ((r−1)! (n−r)!) ] [F(u)]^{r−1} f*(u) [1 − F(u)]^{n−r}.

We can also write this as

    gr(u) = n f*(u) (n−1 choose r−1) [F(u)]^{r−1} [1 − F(u)]^{n−r}.

The derivation is easier to remember than the result itself.
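A quick Monte Carlo check (added), taking F to be the standard normal just for illustration: the simulated distribution function of X(r) should match the formula ∑_{i≥r} (n choose i) F(x)^i (1 − F(x))^{n−i}.

    import numpy as np
    from math import comb, erf, sqrt

    rng = np.random.default_rng(2)
    n, r, x = 7, 3, 0.25
    samples = np.sort(rng.normal(size=(200_000, n)), axis=1)[:, r - 1]   # simulated X_(r)
    empirical = np.mean(samples <= x)

    Fx = 0.5 * (1 + erf(x / sqrt(2)))                                    # standard normal F(x)
    exact = sum(comb(n, i) * Fx**i * (1 - Fx)**(n - i) for i in range(r, n + 1))
    print(round(empirical, 4), round(exact, 4))    # the two numbers agree to about 3 decimals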


Example

Suppose X1, . . . , Xn are i.i.d. Uniform(0, 1). Then, for u ∈ [0, 1], we have F_X(u) = u and f*(u) = 1, so

    gr(u) = [ n! / ((r−1)! (n−r)!) ] · 1 · u^{r−1} · (1 − u)^{n−r}.

Now for the moment suppose F is some continuous distribution function, and for p ∈ (0, 1) let xp be the largest value of x such that F(x) = p.

[Figure: a continuous cdf F, with the quantiles xp and xq marked on the horizontal axis at heights p and q.]

Note. If X ∼ F then F(X) ∼ Uniform(0, 1).

Proof: Let Z = F(X); for z ∈ (0, 1),

    P{Z ≤ z} = P{F(X) ≤ z} = P{X ≤ xz} = z,

so Z ∼ Uniform(0, 1).

A nice consequence of this is that if X1, . . . , Xn are i.i.d. F where F is continuous, then F(X1), . . . , F(Xn) are i.i.d. Uniform(0, 1). So if X(1), . . . , X(n) are the order statistics, then U1 = F(X(1)), . . . , Un = F(X(n)) is like the order statistic from a random sample from a Uniform(0, 1) distribution. We can use this to construct nonparametric confidence intervals for quantiles.

For example, let ξF be the median of F. (If F is strictly increasing, the median will be unique.) Let 1 ≤ r < s ≤ n, and consider P(X(r) ≤ ξF ≤ X(s)) = P[F(X(r)) ≤ 1/2 ≤ F(X(s))]; but this is like the order statistic from a uniform, so these probabilities don't depend on F.

So, if we are sampling from a continuous F, then

    P[X(r) ≤ x] = P[F(X(r)) ≤ F(x)],

P [X(r) ≤ x] = P [F (X(r)) ≤ F (x)],

but this is like the rth order statistic from a uniform sample, so

Fr(x) = P [X(r) ≤ x] =n!

(r − 1)! (n− r)!

∫ F (x)

0

ur−1(1− u)n−r du. (?)

Remember, (?) holds for continuous F , but we had earlier52 seen

Fr(x) =n∑i=1

(n

i

)[F (x)

]i[1− F (x)]n−i (??)

which holds for arbitrary F . Thus, for continuous F , (?) and (??) agree, so thisgives us a way to evaluate the Incomplete Beta Function integral in (?). Anotherway to accomplish that is repeated integration by parts, e.g., let u = (1− t)n−rand dv = (t)r−1 dt.

52 page 129
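A short computation (added) of the coverage probability of [X(r), X(s)] for the median: it equals P(r ≤ B ≤ s − 1) for B ∼ Binomial(n, 1/2), independent of F; the choices n = 10, r = 3, s = 8 below are just an example.

    from math import comb

    def coverage(n, r, s):
        # X_(r) <= median iff at least r observations fall below the median;
        # median <= X_(s) iff at most s - 1 do.  B = count below ~ Binomial(n, 1/2).
        return sum(comb(n, b) * 0.5**n for b in range(r, s))

    print(round(coverage(10, 3, 8), 3))   # 0.891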


Joint distribution for a pair of order statistics

We can extend what we've already done. Again let 1 ≤ r < s ≤ n, and suppose F has continuous density f*. Then the joint distribution of X(r) and X(s) can be found using a multinomial scheme as before:

    r − 1 observations below u − h,   1 observation in (u − h, u],   s − r − 1 observations between u and v − h,   1 observation in (v − h, v],   n − s observations above v.

So

    gr,s(u, v) · h² ≈ P[u − h < X(r) < u, v − h < X(s) < v] + o(h²),

and then

    gr,s(u, v) = lim_{h↓0} ( [ n! / ((r−1)! (s−r−1)! (n−s)!) ] [F(u−h)]^{r−1} · [ ∫_{u−h}^u f*(t) dt / h ]
                             · [F(v−h) − F(u)]^{s−r−1} · [ ∫_{v−h}^v f*(t) dt / h ] · [1 − F(v)]^{n−s} )
               = [ n! / ((r−1)! (s−r−1)! (n−s)!) ] [F(u)]^{r−1} f*(u) [F(v) − F(u)]^{s−r−1} f*(v) [1 − F(v)]^{n−s},

provided u < v.

Joint density for arbitrary collection of order statistics

If 1 ≤ r1 < r2 < · · · < rk ≤ n for 1 ≤ k ≤ n and F is continuous with density f*, then

    g_{r1,...,rk}(x1, . . . , xk) = [ n! / ((r1−1)! (r2−r1−1)! · · · (n−rk)!) ] [F(x1)]^{r1−1} f*(x1)
                                   × [F(x2) − F(x1)]^{r2−r1−1} f*(x2) · · · f*(xk) [1 − F(xk)]^{n−rk}.

In particular, the joint density of X(1), . . . , X(n) is

    g_{1,...,n}(x1, . . . , xn) = n! f*(x1) f*(x2) · · · f*(xn) where x1 < · · · < xn, and 0 elsewhere.

Note that this is supported only on a 1/n! fraction of Rⁿ.


Sufficiency of the order statistic

Suppose X1, . . . , Xn are i.i.d. F where F ∈ Θ, some (parametric) family of continuous distribution functions, and let X = (X1, . . . , Xn)′. Then the order statistic T(X) is sufficient.53 Recall that

    P(X = x | T(X) = t) = 1/n! if x = π(t) for some permutation π, and 0 otherwise.

Completeness of the order statistic

First, some preliminary remarks.

Let x = (x1, . . . , xn) be a vector of distinct real numbers, and let T(x) = (x(1), . . . , x(n)) be its order statistic.

Now let

    V(x) = (V1(x), . . . , Vn(x)) = ( ∑_i xi, ∑_{i<j} xi xj, ∑_{i<j<k} xi xj xk, . . . , x1 x2 · · · xn )

and

    T*(x) = ( ∑_{i=1}^n xi, ∑_{i=1}^n xi², ∑_{i=1}^n xi³, . . . , ∑_{i=1}^n xiⁿ ).

Let x⁰ = (x⁰1, . . . , x⁰n) be fixed, and let t⁰ = T(x⁰), i.e., its order statistic. Then T⁻¹(t⁰) = S = { x : the coordinates of x are permutations of x⁰ }.

Claim 1. If V(x⁰) = v⁰, then V⁻¹(v⁰) = S also.

Proof: Clearly V⁻¹(v⁰) ⊇ S, because the sums in V are unchanged by permutations.

Now note that

    ∏_{i=1}^n (z − xi) = zⁿ − V1(x) z^{n−1} + V2(x) z^{n−2} − · · · + (−1)ⁿ Vn(x).

Suppose V(y) = v⁰, i.e., (y1, . . . , yn) ∈ V⁻¹(v⁰). Now

    ∏_{i=1}^n (z − yi) = zⁿ − v⁰1 z^{n−1} + v⁰2 z^{n−2} − · · · + (−1)ⁿ v⁰n = ∏_{i=1}^n (z − x⁰i)   (∗)

because V(y) and V(x⁰) have the same values. Since y and x⁰ are both solution sets for the polynomial in z given in (∗), y is just a permutation of x⁰; hence y ∈ S and we're done.

Next time we'll argue that:

1. T* is also the same as the order statistic (i.e., each determines the other).
2. T* is complete.
3. Therefore the order statistic is complete.

53 We'll do this next quarter; the idea is that the conditional distribution of X given T doesn't depend on the parameter.


Wednesday, December 2, 1998

We continue the proof of completeness for the order statistic.

Claim 2. If T*(x⁰) = t⁰* then (T*)⁻¹(t⁰*) = S.

Proof: There is a one-to-one correspondence between T* and V, given by the relations (Newton's identities), for 1 ≤ k ≤ n:

    T*_k − V1 T*_{k−1} + V2 T*_{k−2} − · · · + (−1)^{k−1} V_{k−1} T*_1 + (−1)^k k Vk = 0.

For example, if k = 1, we have

    T*_1 − V1 = 0.

Thus if we know T*_1 then we can find V1, and if we know V1 then we can find T*_1. Next, for k = 2, we have

    T*_2 − T*_1 V1 + 2 V2 = 0,

so we can find either T*_2 or V2 if we know the other, and so on.
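An illustrative sketch (added) of the argument: from the power sums T* we can recover V via Newton's identities, then the polynomial ∏(z − xi), and hence the multiset {xi}, i.e., the order statistic.

    import numpy as np

    x = np.array([2.0, -1.0, 0.5, 3.0])
    n = len(x)
    p = [np.sum(x**k) for k in range(1, n + 1)]        # power sums T*_1, ..., T*_n

    e = [1.0]                                          # e_0 = 1
    for k in range(1, n + 1):                          # Newton's identities solved for e_k
        ek = sum((-1)**(i - 1) * e[k - i] * p[i - 1] for i in range(1, k + 1)) / k
        e.append(ek)

    coeffs = [(-1)**k * e[k] for k in range(n + 1)]    # z^n - e1 z^{n-1} + e2 z^{n-2} - ...
    recovered = np.sort(np.roots(coeffs).real)         # the roots are the original x_i
    print(np.allclose(recovered, np.sort(x)))          # True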

Now recall that if X ∼ fθ where { fθ : θ ∈ Θ } is a family of densities with respect to a σ-finite measure µ, and W(X) is some statistic, then σ(W) is the smallest σ-algebra which makes W measurable; e.g., if X = (X1, . . . , Xn) are i.i.d. Bernoulli(θ) then T1(X) = ∑_{i=1}^n Xi and T2(X) = ∑_{i=1}^n Xi/n generate the same σ-algebra, i.e., σ(T1) = σ(T2).

Claim 2 says that T* and V generate the same σ-algebra; Claim 1 and Claim 2 together say that σ(T) = σ(V) = σ(T*).

But completeness of W for { fθ : θ ∈ Θ } is a property of σ(W). Also recall that W is complete for { fθ : θ ∈ Θ } if

    Eθ[g(W(X))] = 0 for all θ   ⟹   Pθ[g(W(X)) = 0] = 1 for all θ

whenever g is σ(W)-measurable.

So, if we can show any of these three is complete, then they all are. In particular, the order statistic will be complete.


Now let X1, . . . , Xn be i.i.d. with density

    K(θ) e^{−x^{2n} + θ1 x + θ2 x² + ··· + θn xⁿ}   (∗)

for θ = (θ1, . . . , θn) and x ∈ R¹. Note that the parameter depends on the sample size n.

The first term appearing in the exponent in (∗) dominates, hence (∗) is integrable and so defines a density; in fact, for every θ ∈ Rⁿ, (∗) is a probability density which belongs to the exponential family.

Next consider the joint density for X1, . . . , Xn, given by

    [K(θ)]ⁿ e^{−Σ xj^{2n} + θ1 Σ xj + θ2 Σ xj² + ··· + θn Σ xjⁿ},

which we can write as

    [K(θ)]ⁿ e^{−Σ xj^{2n}} e^{θ1 Σ xj + θ2 Σ xj² + ··· + θn Σ xjⁿ},

so by our earlier theorem,54 we see that T*(X) = ( Σ Xi, Σ Xi², . . . , Σ Xiⁿ ) is complete for (∗) with θ ∈ Rⁿ.

So, you might think “Big Deal.” What can this particular example (∗) have to do with the general situation?

Well, suppose X1, . . . , Xn are i.i.d., each with density f where f ∈ Θ and Θ is any class of densities which contains all densities of the form (∗).55

A couple of examples are

1. Θ is the class of all densities with respect to Lebesgue measure on R¹ (that is, all absolutely continuous distributions).

2. Θ is the class of all densities with respect to Lebesgue measure on R¹ which have a first moment. (This is a smaller class.)

We've expanded the class.

Claim. The order statistic is complete for both these classes.

Proof: This will follow from the following lemma.

Lemma. Suppose T is a complete statistic for { P_θ^X : θ ∈ Θ0 }, a family of distributions. Suppose Θ0 ⊂ Θ and { P_θ^X : θ ∈ Θ0 } dominates { P_θ^X : θ ∈ Θ }, i.e.,

    P_θ^X(B) = 0 for all θ ∈ Θ0   ⟹   P_θ^X(B) = 0 for all θ ∈ Θ.

Then T is complete for { P_θ^X : θ ∈ Θ }.

In other words, we can extend completeness from a small family to a larger one if the smaller one dominates the larger one.

54 Completeness of the natural sufficient statistic for an exponential family whose parameter space has a non-empty interior; see page 124.

55 Here Θ is a family of densities—not a parameter space.


Proof: Let (i) and (ii) refer to the following statements:

(i) Eθ[g(T(X))] = 0 for all θ,

and

(ii) Pθ^X[g(T(X)) = 0] = 1 for all θ.

We want to show that (i) implies (ii) for all θ ∈ Θ, when we already know (i) implies (ii) for all θ ∈ Θ0.

Suppose (i) holds for all θ ∈ Θ; it must then also hold for all θ ∈ Θ0. By the assumption of completeness, we know that (ii) holds for all θ ∈ Θ0. Now, by domination, we have that (ii) holds for all θ ∈ Θ. The idea here is that the only way sets can have probability zero is for them to have Lebesgue measure zero, since the densities in (∗) are positive. But then such sets must have probability zero under the extended class too.

Be careful—if X1, . . . , Xn ∼ N(0, 1) this does not imply completeness of the order statistic. We have it for large (nonparametric) families, not for small (parametric) ones.

For the normal case, the order statistic is sufficient, but not complete. (But X̄ is both sufficient and complete.)

Statistical Information and Likelihood

This discussion follows Basu's56 gloss on Birnbaum;57 there is also a discussion by Berger and Wolpert.58

Consider the following assertion.

    All the information about the parameter contained in the data is in the likelihood.

Let E denote an experiment, i.e., E = (X, fθ(·), θ ∈ Θ). If X = x is what we see, then Inf(E, x) = all the information about θ contained in the data (E, x).

An aside—you might argue that the amount of information depends on other things, like

1. prior information, or
2. the decision problem that made you consider E in the first place.

We'll assume that we have no prior information and that we're just doing this “for our health.”(!)

56 1975: Sankhyā
57 1962: JASA
58 1984 IMS publication?


Fact. Inf(E, x) may depend critically on x.

    Example: An urn contains 100 balls, numbered θ + 1, θ + 2, . . . , θ + 100, where θ is an unknown integer, so θ ∈ {0, ±1, ±2, . . .}.

    Now we select two balls without replacement from the urn. Clearly θ + 1 ≤ xi ≤ θ + 100; it follows that max_{i=1,2}(xi) − 100 ≤ θ ≤ min_{i=1,2}(xi) − 1.

    If the xi's are close to each other, then we learn little about θ; if they are far apart we learn a lot. For instance, if we draw (17, 115) then 15 ≤ θ ≤ 16, but if we draw (17, 52) then −48 ≤ θ ≤ 16. We have more information in the first situation than in the second.

Now assume:

1. X takes on only finitely many values,
2. Θ is finite, and
3. ∑_{θ∈Θ} fθ(x) > 0 for all x.

This last condition means that no x gets probability zero for all θ, for if there were such an x we could just discard it.

Definition. When the experiment E = (X, fθ(·), θ ∈ Θ) is performed, and X = x, then the function θ ↦ fθ(x) is called the likelihood function generated by the data (E, x), and we write

    L(θ) = L(θ | x) = L(θ | E, x) = fθ(x).

Definition. Two likelihood functions L1 and L2 defined on Θ (but possibly corresponding to two different pairs (E1, x1) and (E2, x2), respectively) are said to be equivalent if there exists a constant c > 0 such that L1(θ) = c L2(θ) for all θ ∈ Θ, where c may depend on E1, E2, x1, and x2; we write L1 ∼ L2.


Friday, December 4, 1998

Continuing what we were doing last time, we want to show that under a few simple assumptions Inf(E, x) is determined by the likelihood L(θ | x).

We introduce several principles of inference.

Invariance

Principle I′. (Invariance) If fθ(x′) = fθ(x′′) for all θ ∈ Θ, then

    Inf(E, x′) = Inf(E, x′′).

Principle I′ induces an equivalence relation on the sample space of X such that x′ ∼ x′′ if they generate identical likelihood functions. For each x, let πx be the set of sample points equivalent to x; the πx's partition the sample space.

Definition. Let T (x) = πx be the function mapping x to its equivalence class.

Note. T is sufficient.

Proof: We need to show that Pθ(X = x | T = t) does not depend on θ. Now Pθ(X = x | T = t) = 0 if T(x) ≠ t; this doesn't involve θ.

If T(x) = t, then

    Pθ(X = x | T = t) = Pθ(X = x, T = t) / Pθ(T = t) = Pθ(X = x) / Pθ(T = t)
                      = fθ(x) / Pθ(T(X) = πx)
                      = fθ(x) / ∑_{y∈πx} fθ(y)
                      = fθ(x) / [ #(members of πx) · fθ(x) ]
                      = 1 / #(members of πx),

which does not depend on θ.

The point is that partitioning the sample space this way uses all the available information about θ; the likelihoods cancel (so θ drops out) when we look at the conditional distribution of X given T, i.e., within the equivalence classes.


Sufficiency

Principle S′. (Sufficiency) If T is a sufficient statistic and T(x′) = T(x′′), then

    Inf(E, x′) = Inf(E, x′′).

Most statisticians accept some form of S′.

Note. S′ implies I′.

Proof: Suppose fθ(x′) = fθ(x′′) for all θ ∈ Θ. It is enough to show that there exists some sufficient T with T(x′) = T(x′′); the Sufficiency Principle does the rest. Let T be defined by T(x) = “the class of x's whose likelihoods agree with that of x,” just as in the previous proof. Then T is sufficient, and T(x′) = T(x′′) if fθ(x′) = fθ(x′′).

Conditionality

Suppose we have two experiments:

    E1 = (X1, fθ^{(1)}(x), θ ∈ Θ)   and   E2 = (X2, fθ^{(2)}(x), θ ∈ Θ)

with a common parameter space.

We randomly choose an experiment, so we observe E1 with probability γ, and E2 with probability 1 − γ.

Let E = (X, fθ, Θ) where X = (i, Xi), so x = (i, xi), and

    fθ(x) = fθ(i, xi) = γ fθ^{(1)}(x1) if i = 1,   and   (1 − γ) fθ^{(2)}(x2) if i = 2.

We call E a mixture experiment of E1 and E2.

Principle C′. (Weak Conditionality) If E is a mixture of E1 and E2, then for any (i, xi),

    Inf(E, (i, xi)) = Inf(Ei, xi).

After we have observed the experiment, what you could have possibly observed no longer matters.


Likelihood

Principle L. (Likelihood) If the data (E1, x1) and (E2, x2) generate equivalent likelihood functions on Θ, then

Inf(E1, x1) = Inf(E2, x2).

Theorem (Birnbaum). (I′ and C′) ⟹ L.

Proof: Suppose (E1, x1) and (E2, x2) generate equivalent likelihood functions. Then

    L(θ | E1, x1) = c L(θ | E2, x2)   (⋆)

for some constant c. Consider the mixture experiment E where we observe E1 with probability γ = 1/(1 + c) and we observe E2 with probability 1 − γ = c/(1 + c), so (1, x1) and (2, x2) are points in the sample space of E.

We can rewrite (⋆) as

    fθ^{(1)}(x1) = c fθ^{(2)}(x2)

and multiply by γ = 1/(1 + c) to get

    [1/(1 + c)] fθ^{(1)}(x1) = [1/(1 + c)] c fθ^{(2)}(x2).

Now we have

    γ fθ^{(1)}(x1) = (1 − γ) fθ^{(2)}(x2),

which is

    fθ(1, x1) = fθ(2, x2)

in terms of the mixture experiment. But there we have identical likelihoods, so using I′ we have

    Inf(E, (1, x1)) = Inf(E, (2, x2)).

Now invoke C′ (twice) to get

    Inf(E1, x1) = Inf(E2, x2)

and we are done.

Corollary.59 (S′ and C ′) =⇒ L.

Proof: S′ =⇒ I ′.

59 This is what is often called Birnbaum’s Theorem.


Examples.

1. We are given a coin, and are interested in θ = the probability of Heads. We want to investigate

    H: θ = 1/2   vs.   K: θ > 1/2.

Suppose the coin is flipped independently several times and we observe that nine flips are Heads and three flips are Tails.

Now suppose we want to use a test of significance for the problem; we want to reject if there are too many Heads. But to construct a test, we need to know the sample space. Here are two possibilities:

1. The experiment was to flip the coin 12 times, and then stop. In this case, X = the number of Heads ∼ Binomial(12, θ), and the level of significance is α = P_{θ=1/2}(X ≥ 9) = 299/4096 ≈ .073.

2. The experiment was to flip the coin until three Tails appeared, and X = the number of Heads ∼ Negative Binomial(3, θ). Now the level of significance is α = P_{θ=1/2}(X ≥ 9) = P_{1/2}(9H, 3T) + P_{1/2}(10H, 3T) + · · · = 67/2048 ≈ .0327.

So, to compute a level of significance, we need to know the sample space of the experiment.

However, the two likelihood functions are equivalent, for

    L(θ | x = 9) = (12 choose 9) θ⁹ (1 − θ)³ in the first (binomial) plan, and (11 choose 9) θ⁹ (1 − θ)³ in the second (negative binomial) plan,

and these differ only by a multiplicative constant. It is important to know the stopping rule (e.g., flipper is tired, 12 flips, 3 Tails, etc.).

But then our significance test violates the Likelihood Principle. We are summing over the sample space; in effect we are worrying about what we could have seen but did not in fact actually observe.

This same objection applies to the issue of unbiasedness—we get different unbiased estimators in the two cases above, even though the likelihoods are equivalent. Taking an expectation (needed for unbiasedness) involves summing over the whole sample space.
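The two significance levels can be checked directly (an added computation using exact tail sums).

    from math import comb

    # Plan 1: 12 flips, reject for many heads; alpha = P(X >= 9), X ~ Binomial(12, 1/2)
    alpha_binom = sum(comb(12, x) for x in range(9, 13)) / 2**12

    # Plan 2: flip until 3 tails; X = number of heads.  X >= 9 iff at most 2 tails
    # occur in the first 11 flips.
    alpha_negbin = sum(comb(11, t) for t in range(0, 3)) / 2**11

    print(round(alpha_binom, 4), round(alpha_negbin, 4))   # 0.073 0.0327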


2. Another example shows that confidence intervals are slippery things. Suppose Θ = Z = {0, ±1, ±2, . . .}, and let

    fθ(x) = 1/2 if x = θ − 1 or x = θ + 1, and 0 otherwise.

Let (X1, X2) be i.i.d. ∼ fθ, i.e., we have a sample of size two.

Now define a confidence “interval” S for θ. Let

    S(x1, x2) = (x1 + x2)/2 if x1 ≠ x2,   and   (x1 + x2)/2 − 1 = x1 − 1 if x1 = x2.   (∗)

Thus, if we draw different numbers, we know θ, but if we draw the same number twice, we have equal chances of covering or missing θ. (We could have chosen +1 instead of −1 in the second part of (∗).)

Now Pθ{S(X1, X2) = θ} = 1/2 + 1/4 = 3/4, so S is a 75% confidence interval for θ. But that is an overall coverage rate—on a single run, either it is right (and we know it is right whenever X1 ≠ X2), or it has probability 1/2 of being right. In what sense does the 75% figure have anything to do with a single run?

It is determined by looking at the sample space, and hence violates the Likelihood Principle.

This S is not unbiased, because we chose −1 in the second part of (∗). It would also be biased, but in the opposite direction, had we chosen +1 instead. If we randomized the sign there, then we would have an unbiased version.

Similar difficulties are attached to other confidence intervals. Would you be willing to bet even money on coverage success for a 50% confidence interval for the mean of a normally-distributed quantity?
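A simulation of the 75% “interval” above (added as a sanity check; the choice θ = 5 is arbitrary).

    import numpy as np

    rng = np.random.default_rng(3)
    theta, reps = 5, 100_000
    x = theta + rng.choice([-1, 1], size=(reps, 2))               # two observations
    s = np.where(x[:, 0] != x[:, 1], x.mean(axis=1), x[:, 0] - 1.0)
    print(np.mean(s == theta))                                     # about 0.75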

Index

Aabsolute continuity

joint distribution functions, 31random vectors, 54

absolutely continuous, joint distribution functions, 31

absolutely continuous distribution functions, 6, 31densities of, 6uniform continuity of, 6

additivitycountable, 1, 2finite, 1, 2

adjoint, of linear operator, 73algebra, 1analytic functions, definition, 17ANOVA, F-test, 84arithmetic mean, relation to geometric mean, 26

BBasu, 136Bayesian calculation, example, 56Bayesian philosophy, 106Berger, 136best linear predictor, 50best predictor, 49Beta distribution, 104, 105Beta function

Incomplete, 130Birnbaum, 136Birnbaum's Theorem, 140bivariate. see also under multivariatebivariate distributions, 29bivariate normal density

correlation coefficient, 44elliptical contours, 45matrix form, 47moment generating functions, 45

bivariate normal distribution, 43sample from, maximum likelihood estimates, 85

Bochner's Theorem, too difficult in practice, 20Borel sigma-algebra, 1, 2Borel-measurable function, 3box condition, 52, 53

CCauchy sequence

in measure, 100in probability, 100

Cauchy-Schwarz Inequality, 41, 93

change of variablesbivariate transformations, 33transformations, 58univariate transformations, 32

characteristic functionsboundedness, 11complex conjugates, 11complex-valued, 11convex combinations of, 15definition, 10derivatives, 14distribution determined by, 18, 19examples

binomial, 15Cauchy, 16discrete, 15exponential, 18point mass, 14standard normal, 16, 18uniform, 15

existence, 10, 11for sums of independent random variables, 17independent random variables, 39, 59joint, 31moments of a distribution, 14multivariate, 53of linear transformation of random variable, 11products of, 17relation to moment generating functions, 17uniform continuity of, 11

proof, 11–13value at origin, 12

Chebyshev Inequality, 20, 99chf. see characteristic functionschi-square

central, 76non-central, 78

Cochran's Theorem, 84complete convergence, 100complete statistic, 123completeness, 123

definition, 133exponential family, 123–126

natural sufficient statistic, 135of natural sufficient statistic, 123–126order statistics, 133

components, 65components of multivariate normal distribution,

jointly normal, 71


conditional densities, 34conditional distribution, multivariate normal

distribution, 72conditional distributions, 55conditional expectations, 36, 47conditional independence, 55conditional probability functions, 34conditional variance, 37conditionality, 139conditioning, 33–37confidence intervals

for quantiles, 130nonparametric, 130unbiasedness, 142

consistency, of distributions, 55continuity, uniform. see uniform continuitycontinuity of measure, 2continuous distribution functions, uniform

continuity of, 6convergence, of distribution functions, 100convex combinations

expectations as, 25of distribution functions, 5–6

convex functions, 21, 25examples, 21Jensen's Inequality, 24second derivative, 21supporting lines, 21

convex hull, 25convex sets, 25convexity

natural parameter space, 117of functions (see convex functions)of sets (see convex sets)

correlation, 41correlation and independence, 42correlation coefficient, 41, 42

bivariate normal density, 44correlation coefficients, multiple, 90countable additivity, 2covariance, 41covariance matrices, 60covariance matrix, 47

sample, 86cumulative distribution functions, 5

Dde Finetti, 106de Finetti's Theorem, 98, 101, 103

de Finetti's Theorem (Continued)alternative proof, 111Diaconis, 111Hewitt-Savage form, 103

densitiesconditional (see conditional densities)for order statistics, 129independent random variables, 39marginal (see marginal densities)order statistics, 130

Diaconis, proof of de Finetti's Theorem, 111diagonal matrices, 61diagonalization, 76direct sum, 64Dirichlet distribution, 106discontinuities, set of for distribution function, 6discrete distribution functions, 6distribution functions, 5

absolutely continuous, 6, 31continuous, 5, 10convergence of, 100convex combination of, 5–6determined by values

on complement of jump set, 7on dense set, 7

discontinuitiesat most countably many, 6jumps, 5set of, 6

discrete, 6, 10for order statistics, 129independent random variables, 39joint, 52

definition, 29properties, 29

marginal, 30mixtures of continuous and discrete, 10probability measure determined by, 5set of discontinuities, 6simplest, 6singular, 30

distributions, 4Dominated Convergence Theorem, 13domination

of class of distributions, 135

Eeigenvalues, 77elliptical contours, bivariate normal density, 45

examples
    likelihood principle
        violated, 141, 142
exchangeability, 94–96, 97, 98, 99, 103, 128
    and i.i.d., 97, 128
    and order statistics, 128
    finitely many random variables, 108
        geometric view, 108–111
        not an i.i.d. mixture, example, 107
    for finitely many random variables, 106
    mixtures of extreme points, 111

exchangeable sequences, embedded, 110
expectations, 8
    and independence, 40
    as convex combinations, 25
    as probabilities for 0-1 variables, 98
    bivariate, 31
    conditional (see conditional expectations)
    iterated, 36

expected value. see also expectations
    definition, 8

experiment, 136
    mixture, 139

exponential family
    completeness, 123–126
    definition, 112
    examples
        Bernoulli, 113
        binomial, 113
        normal, 114
    moments, 120
    not, examples
        binomial (n unknown), 113
        Cauchy, 119
        uniform, 119
    sample from, 115
    sufficient statistic, 114

extreme points, 111

F
F-test, ANOVA, 84
field. see algebra
finite additivity, 2
Fourier Inversion Theorem, 18
Fourier transforms. see characteristic functions
functions
    analytic, 17
    convex (see convex functions)

G
generating functions
    for moments, 10
    for probability, 10

geometric mean, relation to arithmetic mean, 26
group of permutations, 127

H
Hölder's Inequality, 27

I
i.i.d.
    and exchangeability, 97, 128
    subset of exchangeable, 109–110

ignorance, 57
improper prior, 57
Incomplete Beta function, 130
independence
    and expectations, 40
    of algebras, 38
    of classes, 38
    of components of multivariate normal distribution, 70
    of events, 38
    of quadratic forms, 84
    of random variables, 38, 39
    of random vectors, 54
    of sample mean and sample variance, 89
    of sample mean and sample variance from multivariate normal distribution, 75
    of sigma-algebras, 38
    permutations and order statistics, 128

independence and correlation, 42
independent random variables, 39
    characteristic functions, 39, 59
    densities, 39
    distribution functions, 39
    functions of, 59
    probability functions, 39

induced probability measure, 4
information
    likelihood, 136
inner product, 64
inner product spaces, 73
integrable functions, constant functions, 13
integration by parts, 10
invariance, 138
iterated expectations, 36

J
Jacobian, 58
Jensen's Inequality, 24
joint characteristic functions, 31
joint distribution functions, 29, 52
    absolutely continuous, 31
joint distributions
    order statistics, 132
joint moment generating functions, 53
jointly normal, components of multivariate normal distribution, 71

K
Kolmogorov, 2
    axioms for probability space, 2

L
Laplace transform, 10. see also moment generating function
Lebesgue, Dominated Convergence Theorem, 13
likelihood
    information, 136
likelihood functions, 137
    definition, 137
    equivalence, 137
likelihood principle
    violated
        examples, 141, 142
linear operator, adjoint, 73
linear operators, 73
linear predictor, 50

M
marginal densities, 34
marginal distribution functions, 30
marginal probability functions, 34
matrices, random, 60
matrix, trace of, 85
maximum likelihood estimates, 86, 87, 88
    sample from bivariate normal distribution, 85
mean
    arithmetic, 26
    definition, 10
    geometric, 26

mean vectors, 60
measurable function, 3
mgf. see moment generating function
mixture experiment, 139
mixtures, 97, 101, 103
mixtures of extreme points, exchangeability, 111
moment generating functions, 10
    bivariate normal density, 45
    joint, 53
    multivariate, 53
    relation to characteristic functions, 17

moments
    characteristic functions, 14
    definition, 10
    distribution determined by, 19
    exist when higher-order moment exists, 13
    exponential family, 120
    of sums of random variables, 28

multiple correlation coefficients, 90, 93
multivariate. see also under bivariate
multivariate characteristic functions, 53
multivariate moment generating functions, 53
multivariate normal distribution, 67–70
    best linear predictor, 72
    best predictor, 72
    conditional distribution, 72
    independence of components, 70
    independence of sample mean and sample variance, 75
    predictor, 72
    sample from, 74

multivariate normal random variables, functions of, 73

N
natural parameter, 114
natural parameter space, 117
    convexity, 117
    dimension of, 118

natural parameterization, 117
natural sufficient statistic, completeness of, 123–126
Newton's identities, 134
non-central chi-square, 78, 80
    and quadratic forms, 83
    noncentrality parameter, 81

noncentrality parameter, 81
noninformative prior distribution, 57
nonparametric
    confidence intervals, 130
normal distribution
    order statistic
        sufficient, but not complete, 136

nullspace, 64, 73

O
order statistic
    normal distribution
        sufficient, but not complete, 136
    sufficient, but not complete
        normal distribution, 136

order statistic, definition, 127
order statistics, 126–129
    completeness, 133
    densities, 130
    densities for, 129
    distribution functions for, 129
    exchangeability and, 128
    joint distributions, 132
    permutations and, independence, 128
    relation to uniform (0,1), 130
    sufficiency, 133

orthogonal complement, 64
orthogonal matrices, 61
orthogonal projection, 73

P
parameter, natural, 114
permutations
    group of, 127
    order statistics and, independence, 128

pgf. see probability generating function
Poisson sum, 80
Pólya sampling, 94–96, 104
    generalized, 106
    induction, 95, 102

Pólya urn, 94–96. see also Pólya sampling
positive definite matrices, 61
positive semidefinite matrices, 61
prediction, 49, 50, 90
    best linear predictor, 50
principles of inference
    conditionality, 139
    invariance, 138
    likelihood, 140
        violated, 141, 142
    sufficiency, 139

prior, improper, 57
prior distribution, 57
    noninformative, 57
probabilities, as expectations for 0-1 variables, 98
probability functions
    conditional (see conditional probability functions)
    independent random variables, 39
    marginal (see marginal probability functions)
probability generating functions, 10
probability measures, induced by random variable, 4, 5
probability spaces, 2
    axioms for, 2
projections, 65

Q
quadratic forms, 76
    distribution of, non-central chi-square, 83
    independence, 84
quantiles
    confidence intervals, 130

R
random matrices, 60
random sums, 80
random variables
    bivariate, 29
    definition, 3
    in the plane, 29
    independent, 38, 39

random vectors, 52
    absolute continuity, 54
    independence, 54

randomized test, 142
range, 64, 73
regular family, 119
Riemann-Stieltjes integrals
    conditions for existence, 9
    definition, 9

S
sample, from multivariate normal, 74
sample covariance matrix, 86
sample from multivariate normal distribution, independence of sample mean and sample variance, 75
sample mean and sample variance
    distributions, 89
    independence, 75, 89
    sufficient statistics, 89

Schwarz Inequality. see Cauchy-Schwarz Inequality
sigma-algebra, 1
    generated by a random variable, 38, 134
sigma-field. see sigma-algebra

significance
    test of, 141

singular distribution functions, 30
singular distributions, 48
statistic, 114
    complete, 123
    order (see order statistics)
    sufficient, 74

sufficiency, 139
    order statistics, 133

sufficient statistic, exponential family, 114
sufficient statistics, 74
    sample mean and sample variance, 89
sums of independent random variables, 59
    binomial, 59
    normal, 59
    Poisson, 59

sums of random variables, moments, 28
Supporting Hyperplane Theorem, 21
Supporting Line Theorem, 21
symmetric matrices, 61

T
test
    randomized, 142
test of significance, 141
trace, of a matrix, 85
transformations, change of variables, 58
    bivariate, 33
    univariate, 32

transforms
    Fourier (see characteristic functions)
    Laplace (see also moment generating functions)

U
unbiased estimator of zero, 123
unbiasedness, 123
    confidence intervals, 142
    violation of likelihood principle, 141

uniform (0,1)
    relation to order statistics, 130

uniform continuity
    absolutely continuous distribution functions, 6
    characteristic functions, 11
    continuous distribution functions, 6

V
variance
    conditional (see conditional variance)
    definition, 10

variance-covariance matrix. see covariance matrix
violation of likelihood principle
    unbiasedness, 141

W
Wishart distribution, 90
Wolpert, 136
