STAT 200B: Theoretical Statistics
Arash A. Amini
March 2, 2020
1 / 218
Statistical decision theory
• A probability model P = {Pθ : θ ∈ Ω} for data X ∈ X. Ω: parameter space, X: sample space.
• An action space A: set of available actions (decisions).
• A loss function:
0-1 loss: L(θ, a) = 1{θ ≠ a}, Ω = A = {0, 1}.
Quadratic loss (squared error): L(θ, a) = ‖θ − a‖₂^2, Ω = A = R^d.
Statistical inference as a game:
1. Nature picks the “true” parameter θ, and draws X ∼ Pθ. Thus, X is a random element of X.
2. Statistician observes X and makes a decision δ(X) ∈ A. δ : X → A is called a decision rule.
3. Statistician incurs the loss L(θ, δ(X)).
The goal of the statistician is to minimize the expected loss, a.k.a. the risk:
R(θ, δ) := Eθ L(θ, δ(X))
2 / 218
• The goal of the statistician is to minimize the expected loss, a.k.a. the risk:
R(θ, δ) := Eθ L(θ, δ(X))
= ∫ L(θ, δ(x)) dPθ(x)
= ∫ L(θ, δ(x)) pθ(x) dµ(x)
when the family is dominated: dPθ = pθ dµ.
• Usually we work with the family of densities {pθ(·) : θ ∈ Ω}.
3 / 218
Example 1 (Bernoulli trials)
• A coin is being flipped; we want to estimate the probability of it coming up heads.
• One possible model:
X = (X1, . . . , Xn), Xi iid∼ Ber(θ), for some θ ∈ [0, 1].
• Formally, X = {0, 1}^n, Pθ = (Ber(θ))^⊗n and Ω = [0, 1].
• PMF of Xi:
P(Xi = x) = {θ if x = 1; 1 − θ if x = 0} = θ^x (1 − θ)^(1−x), x ∈ {0, 1}
• Joint PMF: pθ(x1, . . . , xn) = ∏_{i=1}^n θ^xi (1 − θ)^(1−xi)
• Action space: A = Ω.
• Quadratic loss: L(θ, δ) = (θ − δ)2.
4 / 218
Comparing estimators via their risk
Bernoulli trials. Let us look at four estimators:
Sample mean: δ1(X) = (1/n) ∑_{i=1}^n Xi, with R(θ, δ1) = θ(1 − θ)/n.
Constant estimator: δ2(X) = 1/2, with R(θ, δ2) = (θ − 1/2)^2.
Strange looking: δ3(X) = (∑_i Xi + 3)/(n + 6), with R(θ, δ3) = [nθ(1 − θ) + (3 − 6θ)^2]/(n + 6)^2.
Throw data out: δ4(X) = X1, with R(θ, δ4) = θ(1 − θ).
5 / 218
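The four risk formulas above can be checked numerically. A minimal Monte Carlo sketch (the estimator definitions are from the slide; `reps`, `seed`, and the tolerance below are arbitrary choices):

```python
import random

def risks_mc(theta, n, reps=20000, seed=0):
    """Monte Carlo estimate of the quadratic risk of delta_1..delta_4."""
    rng = random.Random(seed)
    acc = [0.0, 0.0, 0.0, 0.0]
    for _ in range(reps):
        x = [1 if rng.random() < theta else 0 for _ in range(n)]
        s = sum(x)
        deltas = (s / n, 0.5, (s + 3) / (n + 6), x[0])  # delta_1..delta_4
        for k, d in enumerate(deltas):
            acc[k] += (d - theta) ** 2
    return [a / reps for a in acc]

def risks_exact(theta, n):
    """Closed-form risks from the slide."""
    return [theta * (1 - theta) / n,
            (theta - 0.5) ** 2,
            (n * theta * (1 - theta) + (3 - 6 * theta) ** 2) / (n + 6) ** 2,
            theta * (1 - theta)]
```

Running both at, say, θ = 0.3 and n = 10 shows the Monte Carlo estimates tracking the closed forms.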
Comparing estimators via their risk
[Figure: risk functions R(θ, ·) of δ1, δ2, δ3, δ4 over θ ∈ [0, 1], for n = 10 (left) and n = 50 (right).]
Comparison depends on the choice of the loss. A different loss gives a different picture.
6 / 218
Comparing estimators via their risk
How to deal with the fact that the risks are functions?
• Summarize them by reducing to numbers:
• (Bayesian) Take a weighted average: inf_δ ∫_Ω R(θ, δ) dπ(θ)
• (Frequentist) Take the maximum: inf_δ max_{θ∈Ω} R(θ, δ)
• Restrict to a class of estimators: unbiased (UMVU), equivariant, etc.
• Rule out estimators that are dominated by others (inadmissible).
7 / 218
Admissibility
Definition 1
Let δ and δ∗ be decision rules. δ∗ (strictly) dominates δ if
• R(θ, δ∗) ≤ R(θ, δ), for all θ ∈ Ω, and
• R(θ, δ∗) < R(θ, δ), for some θ ∈ Ω.
δ is inadmissible if there is a different δ∗ that dominates it; otherwise δ is admissible.
An inadmissible rule can be uniformly “improved”.
δ4 in the Bernoulli example is inadmissible.
We will see a non-trivial example soon (Exponential Distribution).
8 / 218
Bias
Definition 2
The bias of δ for estimating g(θ) is Bθ(δ) := Eθ(δ)− g(θ).
The estimator is unbiased if Bθ(δ) = 0 for all θ ∈ Ω.
Not always possible to find unbiased estimators. Example: g(θ) = sin(θ) in the binomial family. (Keener Example 4.2, p. 62)
Definition 3
g is called U-estimable if there is an unbiased estimator δ for g.
• Usually g(θ) = θ.
9 / 218
Bias-variance decomposition
For the quadratic loss L(θ, a) = (θ − a)^2, the risk is the mean-squared error (MSE). In this case we have the following decomposition:
MSEθ(δ) = [Bθ(δ)]^2 + varθ(δ)
Proof.
Let µθ := Eθ(δ). We have
MSEθ(δ) = Eθ(θ − δ)^2 = Eθ(θ − µθ + µθ − δ)^2
= (θ − µθ)^2 + 2(θ − µθ) Eθ[µθ − δ] + Eθ(µθ − δ)^2,
where the cross term vanishes since Eθ[µθ − δ] = 0.
The same goes for general g(θ) or higher dimensions: L(θ, a) = ‖g(θ) − a‖₂^2.
10 / 218
Example 2 (Berger)
Let X ∼ N(θ, 1).
Class of estimators of the form δc(X) = cX, for c ∈ R.
MSEθ(δc) = (θ − cθ)^2 + c^2 = (1 − c)^2 θ^2 + c^2
For c > 1, we have 1 = MSEθ(δ1) < MSEθ(δc) for all θ, so δ1 dominates every δc with c > 1.
For c ∈ [0, 1] the rules are incomparable.
11 / 218
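A quick simulation of Berger's example, checking both the bias–variance identity and the closed form (1 − c)^2 θ^2 + c^2 (sample size, seed, and tolerance are arbitrary choices):

```python
import random

def mse_parts(c, theta, reps=50000, seed=1):
    """Empirical bias^2, variance and MSE of delta_c(X) = c*X, X ~ N(theta, 1)."""
    rng = random.Random(seed)
    vals = [c * rng.gauss(theta, 1.0) for _ in range(reps)]
    m = sum(vals) / reps
    var = sum((v - m) ** 2 for v in vals) / reps
    bias2 = (m - theta) ** 2
    mse = sum((v - theta) ** 2 for v in vals) / reps
    return bias2, var, mse
```

In-sample, bias^2 + variance equals the empirical MSE exactly, while the MSE itself fluctuates around (1 − c)^2 θ^2 + c^2.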
Optimality depends on the loss
Example 3 (Poisson process)
X1, . . . , Xn are the inter-arrival times of a Poisson process with rate λ, so X1, . . . , Xn iid∼ Expo(λ). The model has the following p.d.f.
pλ(x) = ∏_{i=1}^n pλ(xi) = ∏_i λ e^{−λ xi} 1{xi > 0} = λ^n e^{−λ ∑_i xi} 1{min_i xi > 0}
Ω = A = (0, ∞).
• Let S = ∑_i Xi and X̄ = S/n.
• The MLE for λ is λ̂ = 1/X̄ = n/S.
12 / 218
X1, . . . , Xn iid∼ Expo(λ)
• Let S = ∑_i Xi and X̄ = S/n.
• The MLE for λ is λ̂ = 1/X̄ = n/S.
• S := ∑_{i=1}^n Xi has the Gamma(n, λ) distribution.
• 1/S has the Inv-Gamma(n, λ) distribution with mean λ/(n − 1).
• Eλ[λ̂] = nλ/(n − 1). The MLE is biased for λ.
• Then, λ̃ := (n − 1)λ̂/n is unbiased.
• We also have varλ(λ̃) < varλ(λ̂).
• It follows that
MSEλ(λ̃) < MSEλ(λ̂), ∀λ
The MLE λ̂ is inadmissible for quadratic loss.
13 / 218
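The domination of the MLE by λ̃ under quadratic loss shows up clearly in simulation; a small sketch (λ, n, reps, seed chosen arbitrarily):

```python
import random

def compare_exponential_mles(lam=2.0, n=10, reps=100000, seed=0):
    """Empirical MSEs of the MLE n/S and the unbiased (n-1)/S, plus E[MLE]."""
    rng = random.Random(seed)
    se_mle = se_unb = mean_mle = 0.0
    for _ in range(reps):
        s = sum(rng.expovariate(lam) for _ in range(n))
        mle, unb = n / s, (n - 1) / s
        se_mle += (mle - lam) ** 2
        se_unb += (unb - lam) ** 2
        mean_mle += mle
    return se_mle / reps, se_unb / reps, mean_mle / reps
```

The empirical mean of the MLE also sits near n λ/(n − 1), confirming the bias computation above.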
Possible explanation: Quadratic loss penalizes over-estimation more than under-estimation for Ω = (0, ∞).
Alternative loss function (Itakura–Saito distance)
L(λ, a) = λ/a − 1 − log(λ/a), a, λ ∈ (0, ∞)
• With this loss function, R(λ, λ̃) > R(λ, λ̂), ∀λ.
• That is, the MLE renders λ̃ inadmissible.
This is an example of a Bregman divergence, for φ(x) = − log x. For a convex function φ : R^d → R, the Bregman divergence is defined as
dφ(x, y) = φ(x) − φ(y) − 〈∇φ(y), x − y〉,
the remainder of the first-order Taylor expansion of φ at y.
14 / 218
Details:
• Consider δα(X) = α/S. Then, we have
R(λ, δα) − R(λ, δβ) = (n/α − n/β) − (log(n/α) − log(n/β))
• Take α = n − 1 and β = n.
• Use log x − log y < x − y for x > y ≥ 1.
(Follows from strict concavity of f(x) = log(x): f(x) − f(y) < f′(y)(x − y) for y ≠ x.)
15 / 218
Sufficiency
Idea: Separate the data into
• parts that are relevant for estimating θ (sufficient)
• and parts that are irrelevant (ancillary).
Benefits:
• Achieve data compression: efficient computation and storage
• Irrelevant parts can increase the risk (Rao-Blackwell)
Definition 4
Consider the model P = {Pθ : θ ∈ Ω} for X. A statistic T = T(X) is sufficient for P (or for θ, or for X) if the conditional distribution of X given T does not depend on θ.
More precisely, we have
Pθ(X ∈ A | T = t) = Qt(A), ∀t, A
for some Markov kernel Q. Making this fully precise requires measure theory. Intuitively, given T, we can simulate X by an external source of randomness.
16 / 218
Sufficiency
Example 4 (Coin tossing)
• Xi iid∼ Ber(θ), i = 1, . . . , n.
• Notation: X = (X1, . . . , Xn), x = (x1, . . . , xn).
• Will show that T = T(X) = ∑_i Xi is sufficient for θ. (This should be intuitive.)
Pθ(X = x) = pθ(x) = ∏_{i=1}^n θ^xi (1 − θ)^(1−xi) = θ^T(x) (1 − θ)^(n−T(x))
• Then
Pθ(X = x, T = t) = Pθ(X = x) 1{T(x) = t} = θ^t (1 − θ)^(n−t) 1{T(x) = t}.
17 / 218
• Then
Pθ(X = x, T = t) = Pθ(X = x) 1{T(x) = t} = θ^t (1 − θ)^(n−t) 1{T(x) = t}.
• Marginalizing,
Pθ(T = t) = θ^t (1 − θ)^(n−t) ∑_{x ∈ {0,1}^n} 1{T(x) = t} = (n choose t) θ^t (1 − θ)^(n−t).
• Hence,
Pθ(X = x | T = t) = θ^t (1 − θ)^(n−t) 1{T(x) = t} / [(n choose t) θ^t (1 − θ)^(n−t)]
= 1{T(x) = t} / (n choose t).
• What is the above (conditional) distribution?
18 / 218
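That the conditional law is uniform over the (n choose t) sequences with t ones can be verified by brute-force enumeration (n, θ, t below are arbitrary choices):

```python
from itertools import product
from math import comb

def conditional_given_T(n, theta, t):
    """P(X = x | T = t) for X ~ Ber(theta)^n and T = sum of the Xi."""
    joint = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
             for x in product((0, 1), repeat=n)}
    p_t = sum(p for x, p in joint.items() if sum(x) == t)
    return {x: p / p_t for x, p in joint.items() if sum(x) == t}
```

Every sequence with t ones gets the same conditional probability 1/(n choose t), independently of θ.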
Factorization Theorem
It is not convenient to check for sufficiency this way, hence:
Theorem 1 (Factorization (Fisher–Neyman))
Assume that P = {Pθ : θ ∈ Ω} is dominated by µ. A statistic T is sufficient iff for some functions gθ, h ≥ 0,
pθ(x) = gθ(T(x)) h(x), for µ-a.e. x
The likelihood θ ↦ pθ(X) then depends on X only through T(X).
The family being dominated (having a density) is important.
Proof (only discrete case):
Assume T = T(X) is sufficient. Fix x, and let t = T(x). Then,
Pθ(X = x) = Pθ(X = x, T = t)
= Pθ(X = x | T = t) Pθ(T = t)
= Qt(X = x) gθ(t).
19 / 218
Factorization Theorem
• Now assume the factorization holds. Then,
Pθ(X = x, T = t) = Pθ(X = x) 1{T(x) = t} = gθ(t) h(x) 1{T(x) = t}.
• It follows that
Pθ(T = t) = gθ(t) ∑_{x′} h(x′) 1{T(x′) = t},
• hence
Pθ(X = x | T = t) = gθ(t) h(x) 1{T(x) = t} / [gθ(t) ∑_{x′} h(x′) 1{T(x′) = t}]
= h(x) 1{T(x) = t} / ∑_{x′} h(x′) 1{T(x′) = t}.
20 / 218
Example 5 (Uniform)
• Let X1, . . . , Xn iid∼ U[0, θ].
• The family is dominated by Lebesgue measure.
• X(n) = max{X1, . . . , Xn} is sufficient by the Factorization theorem:
pθ(x) = ∏_{i=1}^n (1/θ) 1{0 ≤ xi ≤ θ}
= (1/θ^n) 1{0 ≤ xi ≤ θ, ∀i} = (1/θ^n) 1{0 ≤ min_i xi} 1{max_i xi ≤ θ}
Useful fact: ∏_i 1{Ai} = 1{∩_i Ai}.
21 / 218
• The entire data (X1, . . . , Xn) is always sufficient.
• For i.i.d. data, a further reduction is always possible.
Example 6 (IID data)
• Let X1, . . . , Xn iid∼ pθ.
• Then, the order statistics (X(1), . . . , X(n)) are sufficient:
• Order the data such that X(1) ≤ X(2) ≤ · · · ≤ X(n), and discard the ranks.
22 / 218
Minimal sufficiency
• There is a hierarchy among sufficient statistics in terms of data reduction.
• Can be made precise by using “functions” as “reduction mechanisms”.
Lemma 1
If T is sufficient for P and T = f (S) a.e. P, then S is sufficient.
We write T ≤s S if such functional relation exists.
• An easy consequence of the factorization theorem.
Examples:
• T sufficient ⇏ T^2 sufficient. (T ≰s T^2) (Missing sign information)
• T^2 sufficient ⇒ T sufficient. (T^2 ≤s T)
• T sufficient ⇔ T^3 sufficient. (bijection)
23 / 218
• T ≤s S is not standard notation, but useful shorthand for:
∃ function f such that T = f (S) a.e. P.
• We want to obtain a sufficient statistic that achieves the greatest reduction, i.e., is at the bottom of that hierarchy.
Definition 5
A statistic T is minimal sufficient if
• T is sufficient, and
• T ≤s S for any sufficient statistic S .
24 / 218
• Minimal sufficient statistics exist under mild conditions.
• Minimal sufficient statistic is essentially unique modulo bijections.
Example 7 (Location family)
Consider X1, . . . , Xn iid∼ pθ, that is, they have density pθ(x) = f(x − θ). For example, consider f(x) = C exp(−β|x|^α).
• The case α = 2 corresponds to the normal location family X1, . . . , Xn iid∼ N(θ, 1).
Then, T = (1/n) ∑_i Xi is sufficient by factorization. We will show later that it is minimal sufficient.
• The case α = 1 (Laplace or double exponential): no further reduction beyond the order statistics is possible.
• A family P is DCS if it is dominated, with densities having common support.
25 / 218
• Goal: To show that the likelihood (ratio) function is minimal sufficient.
• General idea: For any fixed θ and θ0,
pθ(X)/pθ0(X)
will always be a function of any sufficient statistic (by the Factorization Theorem).
• We just have to collect enough of them so that collectively
(pθ1(X)/pθ0(X), pθ2(X)/pθ0(X), pθ3(X)/pθ0(X), . . .)
they are sufficient.
26 / 218
A useful lemma
• Let us fix θ0 ∈ Ω and define
Lθ := Lθ(X) := pθ(X)/pθ0(X).
Lemma 2
In a DCS family, the following are equivalent
(a) U is sufficient for P.
(b) Lθ ≤s U, ∀θ ∈ Ω.
• When DCS fails, (a) still implies (b), but not necessarily vice versa.
27 / 218
Proof of Lemma 2
• Working on the common support, the densities can be assumed positive.
• (a) ⇒ (b): U sufficient implies pθ(x) = gθ(U(x)) h(x) (Fact. Thm.):
Lθ(x) = pθ(x)/pθ0(x) = gθ(U(x))/gθ0(U(x)) = fθ,θ0(U(x))
• (b) ⇒ (a): ∃ fθ,θ0 such that pθ(x) = pθ0(x) fθ,θ0(U(x)), which is a factorization with h = pθ0, so U is sufficient. Q.E.D.
28 / 218
A useful lemma
• Let L := (Lθ)θ∈Ω.
• Since R ≤s U and S ≤s U ⇔ (R, S) ≤s U, we have
Lemma 3
In a DCS family, the following are equivalent
(a) U is sufficient for P.
(b) L≤s U.
• The argument is correct when Ω is finite.
• More care with “a.e. P” is needed when Ω is infinite.
• From Lemma 3 it follows that L is itself sufficient. (Why?)
29 / 218
Likelihood is minimal sufficient
Proposition 1
In a DCS family, L := (Lθ)θ∈Ω is minimal sufficient.
Proof:
• Let U be a sufficient statistic.
• Lemma 3(a) =⇒ (b) gives L≤s U.
• (i.e., L is a function of any sufficient U.)
• We also know that L is itself sufficient.
• The proof is complete.
30 / 218
• Consequence of Prop. 1 is
Corollary 1
A statistic T is minimal sufficient if
• T is sufficient, and
• T ≤s L.
• That is, it is enough to show that T is sufficient and
• T can be written as a function of L.
T ≤s L is equivalent to either of the following:
• L(x) = L(y) ⇒ T(x) = T(y).
• Level sets of L are “included” in level sets of T, i.e.,
• level sets of T are coarser than level sets of L.
31 / 218
Corollary 2
T is minimal sufficient for a DCS family P iff
(a) T is sufficient, and
(b) L(x) = L(y) ⇒ T(x) = T(y)
Corollary 3
T is minimal sufficient for a DCS family P iff
L(x) = L(y) ⇔ T(x) = T(y)
• Can replace L(x) = L(y) with ℓx(θ) ∝ ℓy(θ), where ℓx(θ) = pθ(x) is the likelihood function. (Theorem 3.11 in Keener.)
• Corollary 3: T is minimal sufficient if it has the same level sets as L.
• Informally, the shape of the likelihood is minimal sufficient.
32 / 218
Example 8 (Gaussian location family)
Consider X1, . . . , Xn iid∼ pθ with density pθ(x) = f(x − θ), where f(x) = C exp(−βx^2).
• ℓX(θ) ∝ exp(−β ∑_i (Xi − θ)^2).
• Shape of ℓX(·) uniquely determined by θ ↦ ∑_i (Xi − θ)^2,
• Alternatively, θ ↦ −2(∑_i Xi)θ + nθ^2,
• Alternatively, ∑_i Xi.
Example 9 (Laplace location family)
Consider X1, . . . , Xn iid∼ pθ with density pθ(x) = f(x − θ), where f(x) = C exp(−β|x|).
• ℓX(θ) ∝ exp(−β ∑_i |Xi − θ|).
• Shape of ℓX(·) uniquely determined by the breakpoints of the piecewise linear function θ ↦ ∑_i |Xi − θ|,
• In one-to-one correspondence with the order statistics (X(1), . . . , X(n)).
33 / 218
[Figure: the piecewise linear function θ ↦ ∑_i |Xi − θ|, with breakpoints at X(1), X(2), X(3).]
• The shape of the likelihood for the Laplace location family is determined by the order statistics.
34 / 218
Example with no common support
P = {P0, P1, P2} where
P0 = U[−1, 0],
P1 = U[0, 1],
p2(x) = 2x 1{x ∈ (0, 1)}.
p1/p0 = p2/p0 = {0 on (−1, 0); ∞ on (0, 1)}.
• Cannot tell P1 and P2 apart based on (p1/p0, p2/p0).
• However, there is information in the original model to statistically tell these two apart to some extent.
• Consider in addition p2(x)/p1(x) = 2x 1{x ∈ (0, 1)}.
• A minimal suff. stat.: (1{X < 0}, X₊)
• Could just take X₊, since 1{X < 0} = 1 − 1{X₊ > 0} a.e. P
35 / 218
Empirical distribution
• We saw that for IID data, X1, . . . , Xn iid∼ Pθ,
• the order statistics X(1) ≤ X(2) ≤ · · · ≤ X(n) are sufficient.
• Another way: The empirical distribution Pn is sufficient,
Pn := (1/n) ∑_{i=1}^n δ_Xi, (δx : unit point mass at x)
• δx is the measure defined by δx(A) := 1{x ∈ A}.
• Example: X = (0, 1, −1, 0, 4, 4, 0) ⇒ Pn = (1/7)(δ₋₁ + 3δ₀ + δ₁ + 2δ₄).
36 / 218
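The empirical distribution of the example on this slide can be computed directly; exact arithmetic via fractions keeps the weights as 1/7, 3/7, etc.:

```python
from collections import Counter
from fractions import Fraction

def empirical(xs):
    """The empirical distribution Pn as a dict of point-mass weights."""
    n = len(xs)
    return {v: Fraction(k, n) for v, k in Counter(xs).items()}

Pn = empirical([0, 1, -1, 0, 4, 4, 0])
# Pn = (1/7)(delta_{-1} + 3*delta_0 + delta_1 + 2*delta_4)
```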
Example 10 (Empirical distribution in finite alphabet (IID data))
• Suppose the sample space is finite: X = {a1, . . . , aK}.
• Let P = collection of all prob. measures P on X.
• P can be parametrized by θ = (θ1, . . . , θK) where θj = P({aj}).
• The empirical measure reduces to Pn = ∑_j πj(X) δ_aj where
πj(X) := (1/n) #{i : Xi = aj}
• Show that Pn, or equivalently (π1(X), . . . , πK(X)), is minimal sufficient.
37 / 218
Statements from Theory of Point Estimation (TPE)
Proposition 2 (TPE)
Consider a finite DCS family P = {Pθ, θ ∈ Ω}, i.e., Ω = {θ0, θ1, . . . , θK}. Then, the following is minimal sufficient:
L(X) = (pθ1(X)/pθ0(X), . . . , pθK(X)/pθ0(X)).
Proposition 3 (TPE)
Assume P is DCS and P0 ⊂ P. Assume that T is
• sufficient for P, and
• minimal sufficient for P0.
Then, T is minimal sufficient for P.
Proof: The same support gives “a.e. P0 ⇒ a.e. P”.
S sufficient for P ⇒ S sufficient for P0.
T minimal suff. for P0 ⇒ T = f(S) a.e. P0 and hence a.e. P. Q.E.D.
38 / 218
Example 11 (Gaussian location family)
• Consider X1, . . . , Xn iid∼ N(θ, 1).
• Look at the sub-family P0 = {N(θ0, 1), N(θ1, 1)}. Let T(X) = ∑_i Xi.
• The following is minimal sufficient by Proposition 2:
log Lθ1 := log[pθ1(x)/pθ0(x)] = (1/2) ∑_i (xi − θ0)^2 − (1/2) ∑_i (xi − θ1)^2
= (θ1 − θ0) T(x) + (n/2)(θ0^2 − θ1^2)
showing that T(X) is minimal sufficient for P0.
• Since T(X) is also sufficient for P (Exercise.), Proposition 3 implies that it is minimal sufficient for P.
39 / 218
Completeness and ancillarity
• We can compress even more!
Definition 6
• V = V(X) is ancillary if its distribution does not depend on θ.
• First-order ancillary if its expectation does not depend on θ (Eθ V = c, ∀θ).
• The latter is a weaker notion.
Example 12 (Location family)
• The following statistics are all ancillary:
Xi − Xj, Xi − X(j), X(i) − X(j), X(i) − X̄
• Hint: We can write Xi = θ + εi, where εi iid∼ f.
• A minimal sufficient statistic can still contain ancillary information, for example X(n) − X(1) in the Laplace location family.
40 / 218
• A notion stronger than minimal sufficiency is completeness:
Definition 7
A statistic T is complete if
(Eθ[f(T)] = c, ∀θ) ⇒ f(t) = c, ∀t.
(Actually, P-a.e. t.)
• T is complete if no nonconstant function of it is first-order ancillary.
• A minimal sufficient statistic need not be complete:
Example 13 (Laplace location family)
• X(n) − X(1) is ancillary, hence first-order ancillary.
• f(X(1), . . . , X(n)) is ancillary for the nonconstant function f(z1, . . . , zn) = z1 − zn.
• The converse, however, is true.
41 / 218
• Showing completeness is not easy.
• Will see a general result for exponential families.
• Here is another example:
Example 14
• X1, . . . , Xn iid∼ U[0, θ].
• Will show that T = max{X1, . . . , Xn} is complete.
• CDF of T: FT(t) = (t/θ)^n 1{t ∈ (0, θ)} + 1{t ≥ θ}.
• Density: t ↦ n θ^(−n) t^(n−1) 1{t ∈ (0, θ)}.
• Suppose that Eθ f(T) = 0 for all θ > 0. Then,
0 = Eθ f(T) = n θ^(−n) ∫_0^θ f(t) t^(n−1) dt, ∀θ > 0
• The Fundamental theorem of calculus implies f(t) t^(n−1) = 0, a.e. t > 0,
• Hence f(t) = 0 a.e. t > 0. Conclude that T is complete.
42 / 218
Detour: Conditional expectation as L2 projection
• The L2 space of random variables:
L2 := L2(P) := {X : E[X^2] < ∞}
• We can define an inner product on this space:
〈X, Y〉 := E[XY], X, Y ∈ L2
• The inner product induces a norm, called the L2 norm,
‖X‖2 := √〈X, X〉 = √E[X^2]
• The norm induces a distance ‖X − Y‖2.
• Squared distance ‖X − Y‖2^2 = E(X − Y)^2, the same as MSE.
• Orthogonality: X ⊥ Y if 〈X, Y〉 = 0, i.e., E[XY] = 0.
43 / 218
Detour: Conditional expectation as L2 projection
• Assume E X^2 < ∞ and E Y^2 < ∞ (i.e., X, Y ∈ L2).
• Consider the linear space
L := {g(X) | g is a (measurable) function with E[g(X)]^2 < ∞}
i.e., the space of all (meas.) functions of X.
• There is an essentially unique L2 projection of Y onto L:
Ŷ := argmin_{Z ∈ L} ‖Y − Z‖2
• Alternatively, an essentially unique function ĝ such that
min_g E(Y − g(X))^2 = E(Y − ĝ(X))^2
• We define E[Y|X] := ĝ(X).
44 / 218
Detour: Conditional expectation as L2 projection
• There is an essentially unique function ĝ such that
min_g E(Y − g(X))^2 = E(Y − ĝ(X))^2
• We define E[Y|X] := ĝ(X).
• E[Y|X] is the best prediction of Y given X, in the MSE sense.
• From this definition, we get the following characterization of ĝ:
E[(Y − ĝ(X)) g(X)] = 0, ∀g
saying that the optimal prediction error Y − ĝ(X) is orthogonal to L.
• Applied to the constant function g(X) ≡ 1, we get
E[Y] = E[ĝ(X)] = E[E[Y|X]],
the smoothing or averaging property of conditional expectation.
45 / 218
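The projection view can be illustrated on a finite sample space: within-group averages realize the empirical E[Y|X], and any other function of X predicts worse in MSE. A sketch (the model Y = X^2 + noise and all constants are arbitrary choices):

```python
import random

rng = random.Random(0)
xs = [rng.choice([0, 1, 2]) for _ in range(60000)]
ys = [x * x + rng.gauss(0, 1) for x in xs]  # Y = X^2 + noise

# ghat(x) = within-group average of Y, the empirical E[Y | X = x]
groups = {}
for x, y in zip(xs, ys):
    groups.setdefault(x, []).append(y)
ghat = {x: sum(v) / len(v) for x, v in groups.items()}

def emp_mse(g):
    """Empirical MSE of predicting Y by g(X)."""
    return sum((y - g[x]) ** 2 for x, y in zip(xs, ys)) / len(xs)

mse_proj = emp_mse(ghat)
mse_other = emp_mse({0: 0.5, 1: 0.5, 2: 3.0})  # some competing g
```

In-sample, the within-group mean minimizes the MSE exactly, so `mse_proj` beats any competing g.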
Proposition 4
A complete sufficient statistic is minimal sufficient.
Proof. Let T be complete sufficient, and U minimal sufficient.
• The idea is to show that T is a function of U.
• U = g(T). (By minimal sufficiency of U.)
• Let h(U) := Eθ[T|U], well defined by sufficiency of U.
• Eθ[T − h(U)] = 0, ∀θ ∈ Ω. (By smoothing.)
• T = h(U). (By completeness of T.)
Hints:
• Take f(t) := t − h(g(t)) in the definition of completeness.
• Equalities are a.e. P.
46 / 218
Proposition 5 (Basu)
Let T be complete sufficient and V ancillary. Then T and V are independent.
Proof. Let A be an event.
• qA := Pθ(V ∈ A) well-defined. (By ancillarity of V.)
• fA(T ) := Pθ(V ∈ A|T ) well-defined. (By sufficiency of T .)
• Eθ[qA − fA(T )] = 0, ∀θ.
• qA = fA(T ). (By completeness of T .)
Equalities are a.e. P.
47 / 218
• Application of Basu:
Example 15 (Gaussian location family)
• X1, . . . , Xn iid∼ N(θ, σ^2), θ is unknown, σ^2 is known.
• X̄ := (1/n) ∑_i Xi is complete sufficient. (cf. Exponential families)
• (Xi − X̄, i = 1, . . . , n) is ancillary.
• Hence, the sample variance S^2 := 1/(n−1) ∑_i (Xi − X̄)^2 is ancillary.
• Hence, X̄ and S^2 are independent.
• Had we taken (θ, σ^2) as the parameter, then S^2 would not be ancillary.
48 / 218
Rao–Blackwell
• Rao–Blackwell theorem ties risk minimization with sufficiency.
• It is a statement about convex loss functions.
Definition 8
A function f : R^p → R is convex if for all x, y,
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y), ∀λ ∈ [0, 1], (1)
and strictly convex if the inequality is strict for λ ∈ (0, 1) and x ≠ y.
Example 16 (ℓp loss)
• Loss function a ↦ L(θ, a) = |θ − a|^p on R.
• Convex for p ≥ 1 and nonconvex for p ∈ (0, 1).
• Strictly convex when p > 1.
• In particular, the ℓ1 loss (p = 1) is convex but not strictly convex.
49 / 218
Jensen inequality
By induction, (1) leads to f(∑_i αi xi) ≤ ∑_i αi f(xi), for αi ≥ 0 and ∑_i αi = 1. A sweeping generalization is the following:
Proposition 6 (Jensen inequality)
Assume that f : S → R is convex and consider a random variable X concentrated on S (i.e., P(X ∈ S) = 1), with E|X| < ∞. Then,
E f(X) ≥ f(EX)
If f is strictly convex, equality holds iff X ≡ EX a.s. (that is, X is constant).
Proof. Relies on the existence of supporting hyperplanes to f (i.e., affine minorants that touch the function).
50 / 218
• Let x0 := EX.
• Let A(x) = 〈a, x − x0〉 + f(x0) be a supporting hyperplane to f at x0:
f(x) ≥ A(x), ∀x ∈ S, and A(x0) = f(x0).
• Then, we have
f(X) ≥ A(X) ⇒ E[f(X)] ≥ E[A(X)] (Monotonicity of E)
= 〈a, E[X − x0]〉 + f(x0) (Linearity of E)
= f(x0)
• Strict convexity implies f(x) > A(x) for x ≠ x0.
• If X ≠ x0 with positive prob., we have f(X) > A(X) with positive prob.,
• Hence, E f(X) > E A(X), and the rest follows. Q.E.D.
51 / 218
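A tiny numeric instance of Jensen, with the strictly convex f = exp and a two-point X (an arbitrary choice of distribution):

```python
import math

# X takes values 0 and 2, each with probability 1/2; f = exp is strictly convex
probs = {0.0: 0.5, 2.0: 0.5}
e_x = sum(p * x for x, p in probs.items())             # E[X] = 1
e_fx = sum(p * math.exp(x) for x, p in probs.items())  # E[f(X)]
gap = e_fx - math.exp(e_x)  # Jensen gap; strictly positive since X is not constant
```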
Recall the decision-theoretic setup: Loss function L(θ, a), decision rule δ = δ(X), risk R(θ, δ) = Eθ L(θ, δ).
Theorem 2 (Rao–Blackwell)
Let us assume the following:
• T is sufficient for family P,
• δ = δ(X ) is a possibly randomized decision rule,
• a ↦ L(θ, a) is convex, for all θ ∈ Ω.
Define the estimator η(T ) := Eθ[δ|T ]. Then, η dominates δ, that is,
R(θ, η) ≤ R(θ, δ) for all θ ∈ Ω
The inequality is strict when the loss is strictly convex, unless η = δ a.e. P.
Consequence: for convex loss functions, randomization does not help. Proof:
• η is well-defined. (By sufficiency of T .)
• Eθ[L(θ, δ)|T ] ≥ L(θ,Eθ[δ|T ]) = L(θ, η). (By conditional Jensen inequality.)
• Take expectation and use monotonicity and smoothing. Q.E.D.
52 / 218
Example 17
• Xi iid∼ U[0, θ], i = 1, . . . , n.
• T = max{X1, . . . , Xn} is sufficient. Take δ = (1/n) ∑_i Xi.
• Rao–Blackwell: η = Eθ[δ|T] strictly dominates δ, for any strictly convex loss. Let us verify this:
• Conditional distribution of Xi given T: a mixture of a point mass at T and the uniform distribution on [0, T] (why?):
Pθ(Xi ∈ A|T) = (1/n) δT(A) + (1 − 1/n) ∫_0^T (1/T) 1A(x) dx
• Compactly, Xi | T ∼ (1/n) δT + (1 − 1/n) Unif(0, T).
• It follows
Eθ(Xi |T) = (1/n) T + (1 − 1/n)(T/2) = ((n + 1)/(2n)) T
• The same expression holds for η by symmetry.
53 / 218
• That is, η = Eθ[δ|T] = ((n + 1)/(2n)) T
• Consider the quadratic loss:
• η has the same bias as δ (by smoothing).
• How about variances?
varθ(δ) = (1/n)(θ^2/12)
• Since T/θ is Beta(n, 1) distributed, we have
varθ(η) = ((n + 1)/(2n))^2 · nθ^2/((n + 1)^2(n + 2)) = θ^2/(4n(n + 2)) (2)
• (Note: δ is biased. 2δ is unbiased. Better to compare 2η and 2δ.)
54 / 218
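Both variance formulas, and the domination of δ by η, can be checked by simulation; a sketch comparing the two rules as estimators of E[Xi] = θ/2 (θ, n, reps, seed are arbitrary choices):

```python
import random

def rao_blackwell_uniform(theta=1.0, n=5, reps=200000, seed=3):
    """Empirical squared errors (about theta/2) of delta = sample mean
    and eta = (n+1)/(2n) * max, for U[0, theta] samples."""
    rng = random.Random(seed)
    se_delta = se_eta = 0.0
    for _ in range(reps):
        x = [rng.uniform(0, theta) for _ in range(n)]
        delta = sum(x) / n
        eta = (n + 1) / (2 * n) * max(x)
        se_delta += (delta - theta / 2) ** 2
        se_eta += (eta - theta / 2) ** 2
    return se_delta / reps, se_eta / reps
```

Since both rules are unbiased for θ/2, the squared errors estimate varθ(δ) = θ^2/(12n) and varθ(η) = θ^2/(4n(n + 2)).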
• Another result about strictly convex loss functions: an admissible decision rule is uniquely determined by its risk function.
Proposition 7
• Assume a ↦ L(θ, a) is strictly convex, for all θ. (R(θ, δ) = Eθ[L(θ, δ)].)
Then, the map δ ↦ R(·, δ) is injective over the class of admissible decision rules.
We are identifying decision rules that are the same a.e. P.
Proof.
• Let δ be admissible.
• Let δ′ ≠ δ be such that R(θ, δ) = R(θ, δ′), ∀θ.
• Take δ∗ = (1/2)(δ + δ′). Then, by strict convexity of the loss,
L(θ, δ∗) < (1/2)(L(θ, δ) + L(θ, δ′)), ∀θ
Taking expectation: (Note: X > 0 a.s. ⇒ E[X] > 0)
R(θ, δ∗) < (1/2)(R(θ, δ) + R(θ, δ′)) = R(θ, δ), ∀θ
• δ′ ≠ δ implies δ∗ ≠ δ.
• δ∗ strictly dominates δ, contradicting admissibility of δ.
55 / 218
Rao–Blackwell can fail for non-convex loss. Here is an example:
Example 18
• X ∼ Bin(n, θ). Ω = A = [0, 1].
• ε-sensitive loss function Lε(θ, a) = 1{|θ − a| ≥ ε}.
• Consider a general deterministic estimator δ = δ(X).
• δ takes at most n + 1 values {δ(0), δ(1), . . . , δ(n)} ⊂ [0, 1].
• Divide [0, 1] into bins of length 2ε.
• Assume that N := 1/(2ε) ≥ n + 2 and that N is an integer (for simplicity).
• At least one of the N bins contains no δ(i), i = 0, . . . , n, and the midpoint of that bin is at distance ≥ ε from any δ(i). Hence
sup_{θ ∈ [0,1]} R(θ, δ) = 1
for any nonrandomized rule (assuming ε ≤ 1/[2(n + 2)]).
• Consider the randomized estimator δ′ = U ∼ U[0, 1], independent of X:
R(θ, δ′) = P(|U − θ| ≥ ε) ≤ 1 − ε (worst case at θ = 0, 1)
• sup_{θ ∈ [0,1]} R(θ, δ′) < 1, strictly better than that of δ.
56 / 218
Uniformly minimum variance unbiased (UMVU) criterion
• Comparing estimators based on their risk functions is problematic.
• One way to mitigate this: restrict the class of estimators.
• Focus on quadratic loss, and restrict to unbiased estimators.
• By the bias–variance decomposition, the risk of an unbiased estimator is just its variance.
Let Ug be the class of unbiased estimators of g(θ), that is,
Ug = {δ : Eθ[δ] = g(θ), ∀θ}.
Definition 9
An estimator δ is UMVU for estimating g(θ) if
• δ ∈ Ug , and
• varθ(δ) ≤ varθ(δ′), for all θ ∈ Ω and for all δ′ ∈ Ug.
57 / 218
Theorem 3 (Lehmann–Scheffé)
Consider the family P and assume that
• Ug is nonempty (i.e., g is U-estimable), and
• there is a complete sufficient statistic T for P.
Then, there is an essentially unique UMVU for g(θ) of the form h(T ).
Proof.
• Pick δ ∈ Ug (Valid by non-emptiness.)
• Let η = Eθ[δ|T ] be an estimator (Well-defined by sufficiency of T .)
• Claim: η is the essentially unique UMVU.
• Pick any δ′ ∈ Ug and let η′ = Eθ[δ′|T ].
• Eθ[η − η′] = g(θ)− g(θ) = 0, ∀θ (By smoothing and unbiasedness.)
(a) η − η′ = 0 a.e. P. (By completeness of T .)
• By Rao–Blackwell for the quadratic loss a ↦ (g(θ) − a)^2 and unbiasedness,
varθ(η) = R(θ, η) = R(θ, η′) [by (a)] ≤ R(θ, δ′) = varθ(δ′)
• Since δ′ was an arbitrary element of the class Ug we are done. Q.E.D.
58 / 218
Proof.
• Pick δ ∈ Ug (Valid by non-emptiness.)
• Let η = Eθ[δ|T ] be an estimator (Well-defined by sufficiency of T .)
• Claim: η is the essentially unique UMVU.
• Pick any δ′ ∈ Ug and let η′ = Eθ[δ′|T ].
• Eθ[η − η′] = g(θ)− g(θ) = 0, ∀θ (By smoothing and unbiasedness.)
• η − η′ = 0 a.e. P. (By completeness of T .)
• By Rao–Blackwell for the quadratic loss a ↦ (g(θ) − a)^2 and unbiasedness,
varθ(η) = R(θ, η) = R(θ, η′) < R(θ, δ′) = varθ(δ′) [strict inequality: step (b)]
• Since δ′ was an arbitrary element of the class Ug we are done. Q.E.D.
Remark 1
Note that we have also shown uniqueness:
(b) If δ′ ∈ Ug is UMVU and not a function of T, then it is strictly dominated by η′ (by Rao–Blackwell and strict convexity of the quadratic loss).
• Otherwise, it is equal to η′, which is equal to η a.e. P.
59 / 218
Lehmann–Scheffé suggests a way of constructing UMVUs.
Example 19 (Coin tossing)
• X1, . . . , Xn iid∼ Ber(θ); want to estimate g(θ) = θ^2.
• T = ∑_i Xi is complete and sufficient. (General result for exponential families.)
• Take U = X1X2.
• U is unbiased for θ^2: Eθ[U] = Eθ[X1] Eθ[X2] = θ^2 by independence.
• By Lehmann–Scheffé,
E[U|T = t] = P(X1 + X2 = 2 | T = t)
= {(n−2 choose t−2)/(n choose t) if t ≥ 2; 0 otherwise} = t(t − 1)/(n(n − 1))
is the UMVU estimator for θ^2.
60 / 218
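The conditional expectation t(t − 1)/(n(n − 1)) can be confirmed by enumerating all binary sequences with a given sum, since given T = t the sequences are uniform (n = 6 below is an arbitrary choice):

```python
from itertools import product

def cond_mean_x1x2(n, t):
    """E[X1*X2 | T = t] by direct enumeration over {0,1}^n (uniform given T)."""
    seqs = [x for x in product((0, 1), repeat=n) if sum(x) == t]
    return sum(x[0] * x[1] for x in seqs) / len(seqs)
```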
Approach 2 in obtaining UMVUs:
Example 20
• X1, . . . , Xn iid∼ U[0, θ].
• T = X(n) = max{X1, . . . , Xn} is complete sufficient.
• The UMVU for g(θ) is given by h(X(n)), where
• h is the solution of the following integral equation:
g(θ) = Eθ[h(X(n))] = n θ^(−n) ∫_0^θ t^(n−1) h(t) dt.
• For g(θ) = θ, δ1 = ((n + 1)/n) T is unbiased, hence UMVU by Lehmann–Scheffé.
• MSEθ(δ1) = varθ(δ1) = θ^2/(n(n + 2)).
• On the other hand, among estimators of the form δa = aT, a = (n + 2)/(n + 1) gives the lowest MSE.
• This biased estimator has slightly better MSE = θ^2/(n + 1)^2.
• A little bit of bias is not bad.
61 / 218
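The two MSEs can be computed exactly from the first two moments of T = X(n), namely E[T] = nθ/(n + 1) and E[T^2] = nθ^2/(n + 2); the helper below is a sketch (n and θ are arbitrary choices):

```python
def mse_scaled_max(a, theta, n):
    """Exact MSE of a * X_(n) for U[0, theta] samples, via the moments of the max."""
    et = n * theta / (n + 1)
    et2 = n * theta ** 2 / (n + 2)
    return a * a * et2 - 2 * a * theta * et + theta ** 2

n, theta = 10, 1.0
mse_umvu = mse_scaled_max((n + 1) / n, theta, n)        # unbiased: theta^2/(n(n+2))
mse_best = mse_scaled_max((n + 2) / (n + 1), theta, n)  # biased:   theta^2/(n+1)^2
```

The slightly biased choice a = (n + 2)/(n + 1) beats the UMVU, as claimed on the slide.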
Exponential family
Definition 10
X: a general sample space; Ω: a general parameter space.
• A function T : X → R^d, T(x) = (T1(x), . . . , Td(x)).
• A function η : Ω → R^d, η(θ) = (η1(θ), . . . , ηd(θ)).
• A measure ν on X (e.g., Lebesgue or counting), and a function h : X → R₊.
The exponential family with sufficient statistic T and parametrization η, relative to h · ν, is the dominated family of distributions given by the following densities w.r.t. ν:
pθ(x) = exp{〈η(θ), T(x)〉 − A(θ)} h(x), x ∈ X,
where 〈η(θ), T(x)〉 = ∑_{i=1}^d ηi(θ) Ti(x) is the Euclidean inner product.
62 / 218
• T : X → R^d, η : Ω → R^d.
pθ(x) = exp{〈η(θ), T(x)〉 − A(θ)} h(x), x ∈ X.
• A(θ) is determined by the other ingredients,
• via the normalization constraint ∫ pθ(x) dν(x) = 1:
A(θ) = log ∫ e^{〈η(θ), T(x)〉} dν̃(x)
where dν̃(x) := h(x) dν(x).
• A is called the log-partition function or cumulant generating function.
• The actual parameter space is
Ω0 = {θ ∈ Ω : A(θ) < ∞}.
• By the factorization theorem, T(X) is indeed sufficient.
• The representation of the exponential family is not unique.
63 / 218
Here are some examples:
Example 21
• X ∼ Ber(θ):
pθ(x) = θ^x (1 − θ)^(1−x) = exp[x log(θ/(1 − θ)) + log(1 − θ)].
• Here h(x) = 1{x ∈ {0, 1}},
η(θ) = log(θ/(1 − θ)), T(x) = x
A(θ) = − log(1 − θ), Ω0 = (0, 1)
• We need to take Ω = (0, 1), otherwise η is not well-defined.
64 / 218
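A sanity check that the canonical form above reproduces the Bernoulli pmf:

```python
import math

def bernoulli_canonical(x, theta):
    """exp{x*eta(theta) - A(theta)} with eta(theta) = log(theta/(1-theta))
    and A(theta) = -log(1-theta)."""
    eta = math.log(theta / (1 - theta))
    A = -math.log(1 - theta)
    return math.exp(x * eta - A)
```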
Example 22
• X ∼ N(µ, σ^2): Let θ = (µ, σ^2),
pθ(x) = (1/√(2πσ^2)) exp[−(x − µ)^2/(2σ^2)]
= exp[−x^2/(2σ^2) + (µ/σ^2) x − (µ^2/(2σ^2) + (1/2) log(2πσ^2))]
• Here, h(x) = 1,
η(θ) = (µ/σ^2, −1/(2σ^2)), T(x) = (x, x^2)
A(θ) = µ^2/(2σ^2) + (1/2) log(2πσ^2), Ω0 = {(µ, σ^2) : σ^2 > 0}
• We could have taken h(x) = 1/√(2π) and A(θ) = µ^2/(2σ^2) + (1/2) log σ^2.
65 / 218
Example 23
• X ∼ U[0, θ],
• pθ(x) = θ^(−1) 1{x ∈ (0, θ)}.
• Not an exponential family, since the support depends on the parameter.
66 / 218
Consider the following conditions:
(E1) η(Ω0) has non-empty interior.
(E2) T1, . . . , Td, 1 are linearly independent ν-a.e. That is,
∄ a ∈ R^d \ {0}, c ∈ R such that 〈a, T(x)〉 = c, ν-a.e. x.
(E1′) η(Ω0) is open.
Definition 11
• A family satisfying (E1) and (E2) is called full-rank.
• One that satisfies (E1′) is regular.
• One that satisfies (E2) is minimal.
• Condition (E1) prevents ηi from satisfying a constraint.
• Condition (E2) prevents unidentifiability.
67 / 218
Example 24
• A Bernoulli model: pθ(x) ∝ exp(θ0(1 − x) + θ1x).
• x + (1 − x) = 1, ∀x. Hence, the family is not full-rank.
Example 25
• A continuous model with Ω = R and η1(θ) = θ, η2(θ) = θ^2.
• The interior of η(Ω) is empty, hence the model is not full-rank.
68 / 218
Theorem 4
In a full-rank exponential family, T is complete.
• We will only show that T is minimal sufficient.
• Completeness is more technical, but follows from Laplace transform arguments.
69 / 218
Theorem 5
In a full-rank exponential family, T is complete.
Proof. (Minimal sufficiency.)
• By the factorization theorem, T is sufficient for P (the whole family).
• Choose {θ0, θ1, . . . , θd} ⊂ Ω, with ηi := η(θi), such that
η1 − η0, η2 − η0, . . . , ηd − η0 are linearly independent. (Possible by (E1).)
• The matrix A^T = (η1 − η0, . . . , ηd − η0) ∈ R^{d×d} is full-rank.
• Let P0 = {pθi : i = 0, 1, . . . , d}. Then, with T = T(X),
(log pθ1(X)/pθ0(X), . . . , log pθd(X)/pθ0(X)) = (〈η1 − η0, T〉, . . . , 〈ηd − η0, T〉) = A T
(up to constants not depending on x) is minimal sufficient for P0.
• It follows that T is so, since A is invertible.
• Since P0 and P have common support, T is also minimal for P.
70 / 218
pθ(x) = exp{〈η(θ), T(x)〉 − A(θ)} h(x), x ∈ X.
Definition 12
An exponential family is in canonical (or natural) form if η(θ) = θ.
In this case:
• η = θ is called the natural parameter.
• Ω0 := {θ ∈ R^d : A(θ) < ∞} is called the natural parameter space.
• Ω0 ⊂ R^d.
The family is determined by the choice of X, T(x) and ν̃ = h · ν.
Example 26 (Two-parameter Gaussian)
• X = R, T(x) = (x, x^2).
• pθ(x) = exp(θ1 x + θ2 x^2 − A(θ)), ∀x ∈ X.
• A(θ) = log ∫ e^{θ1 x + θ2 x^2} dx.
• A(θ) < ∞ iff θ2 < 0. Natural parameter space: Ω0 = {(θ1, θ2) : θ2 < 0}.
• Note: θ1 = µ/σ^2 and θ2 = −1/(2σ^2) (in the original parametrization (µ, σ^2)).
71 / 218
Recall: ‖x‖1 = ∑_{i=1}^d |xi|.
Example 27 (Multinomial)
• X = Z₊^d = {x = (x1, . . . , xd) : xi integer, xi ≥ 0}
• T(x) = x.
• h(x) = (n choose x1, x2, . . . , xd) 1{‖x‖1 = n}, ν = counting measure
• Canonical family: pθ(x) = exp(∑_{i=1}^d θi xi − A(θ)) h(x).
• Can show that A(θ) = n log(∑_{i=1}^d e^{θi}), finite everywhere.
• Hence Ω0 = R^d.
• Not full-rank. Violates (E2).
72 / 218
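The claim A(θ) = n log ∑_i e^{θi} is the multinomial theorem in disguise; it can be checked by brute force over all x with ‖x‖1 = n (d = 3, n = 4 below are arbitrary choices):

```python
from itertools import product
from math import factorial, exp, log

def log_partition_bruteforce(theta, n):
    """log of the sum over {x : sum(x) = n} of multinom(n; x) * exp(<theta, x>)."""
    total = 0.0
    for x in product(range(n + 1), repeat=len(theta)):
        if sum(x) != n:
            continue
        coef = factorial(n)
        for xi in x:
            coef //= factorial(xi)
        total += coef * exp(sum(t * xi for t, xi in zip(theta, x)))
    return log(total)

theta = [0.3, -1.0, 0.7]
lhs = log_partition_bruteforce(theta, 4)
rhs = 4 * log(sum(exp(t) for t in theta))  # n * log(sum_i e^{theta_i})
```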
• The multinomial distribution with the usual parameter,
qπ(x) = (n choose x1, x2, . . . , xd) ∏_{i=1}^d πi^{xi},
looks like a subfamily:
• Corresponds to the following subset of the natural parameter space Ω0:
{(log πi) : πi > 0, ∑_{i=1}^d πi = 1} = {θ ∈ R^d : ∑_{i=1}^d e^{θi} = 1}.
• This family is also not full-rank (violates (E1)).
• Actually not a sub-family of Example 27, since pθ = p_{θ+a1} for any a ∈ R.
• That is, the θ parametrization is non-identifiable in Example 27.
73 / 218
Example 28 (Multivariate Gaussian)
• X = R^p. ν = Lebesgue measure on X.
• T(x) = (xi, i = 1, . . . , p | xixj, 1 ≤ i < j ≤ p | xi², i = 1, . . . , p).
• param = (θi, i = 1, . . . , p | 2Θij, 1 ≤ i < j ≤ p | Θii, i = 1, . . . , p).
• Corresponding canonical exponential family:

pθ,Θ(x) = exp( ∑_i θixi + 2 ∑_{i<j} Θijxixj + ∑_i Θiixi² − A(θ,Θ) )

• Compactly, treating Θ as a symmetric matrix,

pθ,Θ(x) = exp{⟨θ, x⟩ + ⟨Θ, xxᵀ⟩ − A(θ,Θ)}

where ⟨Θ, xxᵀ⟩ := tr(Θxxᵀ) = tr(xᵀΘx) = xᵀΘx.
• Dimension (or rank) of the family: d = p + p(p + 1)/2 (p coordinates of θ plus p(p + 1)/2 distinct entries of Θ).
74 / 218
• Density of the multivariate Gaussian N(µ, Σ):

pµ,Σ(x) ∝ (1/|Σ|^{1/2}) exp[ −(1/2)(x − µ)ᵀΣ⁻¹(x − µ) ]
= exp[ −(1/2)xᵀΣ⁻¹x + xᵀΣ⁻¹µ − (1/2)µᵀΣ⁻¹µ − (1/2) log|Σ| ]

• Can be written as a canonical exponential family

pθ,Θ(x) = exp{⟨θ, x⟩ + ⟨Θ, xxᵀ⟩ − A(θ,Θ)}

where ⟨Θ, xxᵀ⟩ := tr(Θxxᵀ) = tr(xᵀΘx) = xᵀΘx.
• Correspondence with the original parameters:
• θ = Σ⁻¹µ and Θ = −(1/2)Σ⁻¹.
• A(θ,Θ) = (1/2)(µᵀΣ⁻¹µ + log|Σ|) = −(1/4)θᵀΘ⁻¹θ − (1/2) log|−2Θ| + const.
• Sometimes called a Gaussian Markov Random Field (GMRF), esp. when Θij = 0 for (i, j) ∉ E, where E is the edge set of a graph.
75 / 218
Example 29 (Ising model)
• Both a graphical model and an exponential family.
• Used in statistical physics. Allows for complex correlations among discrete variables. Discrete counterpart of the GMRF.
• Ingredients:
• A given graph G = (V, E). V := {1, . . . , n}: vertex set. E ⊂ V²: edge set.
• Random variables attached to vertices: X = (Xi : i ∈ V).
• Each xi ∈ {−1, +1}, the spin of node i.
• Take X = {−1, +1}^V ≅ {−1, +1}^n.
• T(X) = (Xi, i ∈ V; XiXj, (i, j) ∈ E).
• Underlying measure is counting (and h(x) ≡ 1):

pθ(x) = exp( ∑_{i∈V} θixi + ∑_{(i,j)∈E} θijxixj − A(θ) )
76 / 218
Example 30 (Exponential random graph model (ERGM))
• A parametric family of probability distributions on graphs.
• Let X = the space of graphs on n nodes.
• Let Ti(G) be functions on the space of graphs, for i = 1, . . . , k.
• Usually subgraph counts:

T1(G) = number of edges
T2(G) = number of triangles
. . .
Tj(G) = number of r-stars (for a given r)
. . .

• Underlying measure: counting measure on graphs.

pθ(G) = exp( ∑_{i=1}^k θiTi(G) − A(θ) )
77 / 218
Focus: full-rank canonical (FRC) exponential families.

Proposition 8
In a canonical exponential family,
(a) A is convex on its domain Ω0,
(b) Ω0 is a convex set.

Proof. (Enough to show (a); convexity of Ω0 follows from convexity of A.)
• Apply Hölder's inequality, with 1/p = α and 1/q = 1 − α. (Exercise.) Q.E.D.

• Hölder's inequality: For X, Y ≥ 0 a.s.,

E[X^α Y^{1−α}] ≤ (EX)^α (EY)^{1−α}, ∀α ∈ [0, 1].

• The expectation can be replaced with an integral w.r.t. a general measure: for f, g ≥ 0 a.e. ν,

∫ f^α g^{1−α} dν ≤ (∫ f dν)^α (∫ g dν)^{1−α}, ∀α ∈ [0, 1].
78 / 218
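Proposition 8(a) can be checked numerically on a concrete family. A small sketch using the canonical Bernoulli family, where A(θ) = log(1 + e^θ) (the choice of family and grid are illustrative assumptions):

```python
import math

def A(theta):
    # log-partition of the canonical Bernoulli family: A(theta) = log(1 + e^theta)
    return math.log(1.0 + math.exp(theta))

# check A(alpha*t1 + (1-alpha)*t2) <= alpha*A(t1) + (1-alpha)*A(t2) on a grid
violations = 0
thetas = [-3.0, -1.0, 0.0, 0.5, 2.0]
for t1 in thetas:
    for t2 in thetas:
        for alpha in (0.1, 0.25, 0.5, 0.75, 0.9):
            lhs = A(alpha * t1 + (1 - alpha) * t2)
            rhs = alpha * A(t1) + (1 - alpha) * A(t2)
            if lhs > rhs + 1e-12:
                violations += 1
```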
Proposition 9
In a FRC exponential family, A is C^∞ on int(Ω0), and moreover

Eθ[T] = ∇A(θ), covθ[T] = ∇²A(θ).

That is,

∂A/∂θi = Eθ[Ti(X)], ∂²A/(∂θi∂θj) = covθ[Ti(X), Tj(X)].

Proof sketch.
• The moment generating function (mgf) of T is

MT(u) := MT(u; θ) := Eθ[e^{⟨u,T⟩}] = ∫ e^{⟨u,T(x)⟩} e^{⟨θ,T(x)⟩−A(θ)} dν(x) = e^{A(u+θ)−A(θ)}.

• If θ ∈ int Ω0, then MT is finite in a neighborhood of zero: MT(u) < ∞ for ‖u‖2 ≤ ε.
• The DCT implies MT is C^∞ in a neighborhood of 0, and we can interchange the order of differentiation and integration.
79 / 218
• Moment generating function:

MT(u) = Eθ[e^{⟨u,T⟩}] = e^{A(u+θ)−A(θ)}.

• We get (fixing θ)

MT(u) (∂A/∂ui)(u + θ) = (∂/∂ui)MT(u) = Eθ[(∂/∂ui)e^{⟨u,T⟩}] = Eθ[Ti e^{⟨u,T⟩}],

valid in a neighborhood of 0.
• Evaluating at u = 0 gives the result for the mean. (MT(0) = 1.)
• Getting the covariance is similar. (Exercise.)

Remark 2
• The covariance matrix is positive semidefinite, hence ∇²A(θ) ⪰ 0.
• This gives another proof of the convexity of A.
80 / 218
Example 31
• X ∼ N(θ, 1):

pθ(x) = (1/√(2π)) e^{−x²/2} exp(θx − θ²/2).

• A(θ) = θ²/2. Hence:

Eθ[X] = A′(θ) = θ, varθ(X) = A″(θ) = 1.
81 / 218
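The identities of Proposition 9 can be verified numerically for the canonical Bernoulli family (A(θ) = log(1 + e^θ), T(x) = x, an illustrative choice): finite differences of A should match the exact moments computed over the two-point support. A minimal sketch:

```python
import math

def A(theta):
    # Bernoulli canonical log-partition: A(theta) = log(1 + e^theta)
    return math.log(1.0 + math.exp(theta))

def exact_moments(theta):
    # E[T] and var(T) computed directly over the support {0, 1}, with T(x) = x
    probs = {x: math.exp(theta * x - A(theta)) for x in (0, 1)}
    m1 = sum(x * p for x, p in probs.items())
    m2 = sum(x * x * p for x, p in probs.items())
    return m1, m2 - m1 ** 2

theta = 0.7
h = 1e-5
A1 = (A(theta + h) - A(theta - h)) / (2 * h)                 # central difference for A'(theta)
A2 = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h ** 2   # second difference for A''(theta)
mean, var = exact_moments(theta)
```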
Mean parameters
• Exponential family:

dPθ(x) = exp[⟨θ, T(x)⟩ − A(θ)] dν(x).

• Alternative parametrization in terms of the mean parameter µ:

µ := µ(θ) = Eθ[T(X)].

• Mean parameters are easy to estimate: µ̂ = (1/n) ∑_{i=1}^n T(X^{(i)}).
Example 32 (Two-parameter Gaussian)
• X = R, T(x) = (x, x²), pθ(x) = exp(θ1x + θ2x² − A(θ)).
• Natural parameter space: Ω0 = {(θ1, θ2) : θ2 < 0}.
• θ1 = m/σ² and θ2 = −1/(2σ²) in the original parametrization N(m, σ²).
• Mean parameters:

µ = (µ1, µ2) = (E[X], E[X²]) = (m, m² + σ²) = ( −θ1/(2θ2), θ1²/(4θ2²) − 1/(2θ2) ).
82 / 218
Realizable means
• An interesting general set:

M := {µ ∈ R^d | µ = Ep[T(X)] for some density p w.r.t. ν},

the set of mean parameters realizable by some distribution (absolutely continuous w.r.t. ν).
• M is essentially the convex hull of the support of ν#T = ν ∘ T⁻¹.
• More precisely, int(M) = int(co(supp(ν#T))).
83 / 218
M := {µ ∈ R^d | µ = Ep[T(X)] for some density p w.r.t. ν}

Example 33
• T(X) = (X, X²) ∈ R² and ν the Lebesgue measure:

(µ1, µ2) = (Ep[X], Ep[X²]).

• By nonnegativity of the variance we need to have

M ⊂ M0 := {(µ1, µ2) : µ2 ≥ µ1²}.

• Any (µ1, µ2) ∈ int M0 can be realized by a N(µ1, µ2 − µ1²).
• bd M0 := {(µ1, µ2) : µ2 = µ1²} cannot be achieved by a density (why?).
• bd M0 can be approached arbitrarily closely by densities.
• We have M = int M0 and cl(M) = M0.
84 / 218
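The claim that any interior point of M0 is realized by N(µ1, µ2 − µ1²) can be illustrated by simulation: with a fixed seed, the empirical first two moments land close to the target. A sketch (sample size and tolerances are ad hoc choices):

```python
import math
import random

random.seed(0)

def realize(mu1, mu2, n=100_000):
    # Sample from N(mu1, mu2 - mu1^2), well-defined when mu2 > mu1^2,
    # and return the empirical mean parameters (E[X], E[X^2]).
    sigma = math.sqrt(mu2 - mu1 ** 2)
    xs = [random.gauss(mu1, sigma) for _ in range(n)]
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    return m1, m2

mu1, mu2 = 1.5, 3.0          # interior point: mu2 = 3.0 > mu1^2 = 2.25
m1_hat, m2_hat = realize(mu1, mu2)
```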
Example 34 (Multivariate Gaussian)
• T(X) = (X, XXᵀ).
• Let µ = Ep[X] and Λ = Ep[XXᵀ] for some density p w.r.t. Lebesgue measure.
• The covariance matrix is PSD, hence Λ − µµᵀ ⪰ 0.
• The closure of M (the set of realizable means) is

cl(M) := {(µ, Λ) | Λ ⪰ µµᵀ}.

• int(cl(M)) = M = {(µ, Λ) | Λ ≻ µµᵀ}: realized by non-degenerate Gaussian distributions N(µ, Λ − µµᵀ), a full-rank exponential family.
85 / 218
A remarkable result: anything in int M can be realized by an exponential family. Let

Ω := dom A := {θ ∈ R^d : A(θ) < ∞}.

Theorem 6
In a FRC exponential family, assuming A is essentially smooth,
• ∇A : int Ω → int M is one-to-one and onto.
In other words, ∇A establishes a bijection between int Ω and int M.

• Recall that as part of the FRC assumption, int Ω ≠ ∅.
• WLOG, we can assume T(x) = x (absorb T into the measure, via ν ∘ T⁻¹).
• That is, we work with the standard family

dPθ(x) = exp(⟨θ, x⟩ − A(θ)) dν(x).

• By Proposition 9, Eθ(X) = ∇A(θ).
• The proof is a tour de force of convex/real analysis.
86 / 218
Proof sketch
∇A : int Ω → int M is one-to-one (injective) and onto (surjective).

Let Φ := ∇A.
1. Φ is regular on int Ω: DΦ = ∇²A ∈ R^{d×d} is a full-rank matrix. True since condition (E2) implies ∇²A ≻ 0.
2. Φ is injective: Since condition (E2) implies ∇²A ≻ 0, we conclude that A is strictly convex. This in turn implies that ∇A is a strictly monotone operator: ⟨∇A(θ) − ∇A(θ′), θ − θ′⟩ > 0 for θ ≠ θ′.
3. Φ is an open mapping (maps open sets to open sets) (Corollary 3.27 of ?):
U ⊂ R^d open, f ∈ C¹(U, R^d) regular on U ⟹ f is an open mapping.
4. By Proposition 9, we have ∇A(int Ω) ⊂ M.
5. But why ∇A(int Ω) ⊂ int M? Follows from ∇A being an open map.¹

¹A continuous map is not necessarily open: x ↦ sin(x) maps (0, 4π) to [−1, 1].
87 / 218
Proof sketch
∇A : int Ω → int M is one-to-one (injective) and onto (surjective).

It remains to show that int M ⊂ ∇A(int Ω):
For any µ ∈ int M, we need to find θ ∈ int Ω such that ∇A(θ) = µ.
6. By applying a shift to ν, WLOG it is enough to show this for µ = 0 ∈ int M.
7. WTS: 0 ∈ int M ⟹ ∃θ ∈ int Ω s.t. ∇A(θ) = 0.

In general 0 ∉ int M. So, without employing a shift, all the arguments are applied to θ ↦ A(θ) − ⟨µ, θ⟩.
88 / 218
Proof sketch
0 ∈ int M ⟹ ∃θ ∈ int Ω s.t. ∇A(θ) = 0.

8. A is lower semi-continuous (lsc) on R^d: lim inf_{θ→θ0} A(θ) ≥ A(θ0).
Follows from Fatou's lemma.
(lsc only matters at bd Ω, since A is continuous on int Ω.)
9. Let Γ0(R^d) := {f : R^d → (−∞, ∞] | f is proper, convex, lsc}. (Proper means not identically ∞.)
10. A ∈ Γ0(R^d).
11. A is coercive: lim_{‖θ‖→∞} A(θ) = ∞. To be shown.
12. A is essentially smooth: by assumption.

A function f ∈ Γ0(R^d) is essentially smooth (a.k.a. steep) if
(a) f is differentiable on int dom f ≠ ∅, and
(b) ‖∇f(xn)‖ → ∞ whenever xn → x ∈ bd dom f.
We in fact only need this for x ∈ (dom f) ∩ (bd dom f), the parts of the boundary that are in the domain. In particular, (b) is not needed if dom f is itself open.
89 / 218
0 ∈ int M ⟹ ∃θ ∈ int Ω s.t. ∇A(θ) = 0.

13. A coercive lsc function attains its minimum (over R^d).
14. f ∈ Γ0(R^d) and essentially smooth ⟹ the minimum cannot be attained at bd dom f.
15. If in addition f is strictly convex on int dom f, the minimum is unique.

Lemma 4
Assume that f ∈ Γ0(R^d) is coercive, essentially smooth, and strictly convex on int dom f. Then f attains its unique minimum at some x ∈ int dom f.

A is coercive, essentially smooth, and strictly convex on int dom A = int Ω.
16. Conclude that A attains its minimum at a unique point θ ∈ int Ω.
17. The necessary first-order optimality condition is ∇A(θ) = 0.
18. Done if we show the only remaining piece: coercivity.
90 / 218
A is coercive.

19. For every ε, let Hu,ε := {x ∈ R^d : ⟨x, u⟩ ≥ ε}, and let S = {u : ‖u‖2 = 1}.
20. 0 ∈ int M (and the full-rank assumption) implies that ∃ε > 0 such that

inf_{u∈S} ν(Hu,ε) > 0,

i.e., ∃ε > 0 and c ∈ R such that ν(Hu,ε) ≥ e^c for all u ∈ S.
21. Then, for any ρ > 0,

∫ e^{⟨ρu,x⟩} ν(dx) ≥ ∫_{Hu,ε} e^{⟨ρu,x⟩} ν(dx) ≥ e^{ρε} ν(Hu,ε) ≥ e^{ρε+c}.

That is, A(ρu) ≥ ρε + c.
22. For any θ ≠ 0, taking u = θ/‖θ‖ and ρ = ‖θ‖, we obtain

A(θ) ≥ ‖θ‖ε + c, ∀θ ∈ R^d \ {0},

showing that A is coercive.
91 / 218
Side note: In fact, ∇A : int Ω → int M is a C¹ diffeomorphism (i.e., a bijection which is C¹ in both directions). This follows from Theorem 3.2.8 of ?:

Theorem 7 (Global inverse function theorem)
Let U ⊂ R^d be open and Φ ∈ C¹(U, R^d). The following are equivalent:
• V = Φ(U) is open and Φ : U → V is a C¹ diffeomorphism.
• Φ is injective and regular on U.

Φ = ∇A is injective and regular on int Ω, hence a C¹ diffeomorphism.
92 / 218
MLE in exponential family
• Significance of Theorem 6 for statistical inference:
• Assume X1, . . . , Xn are i.i.d. draws from

pθ(x) = exp(⟨θ, T(x)⟩ − A(θ)) h(x).

• The likelihood is

LX(θ) = ∏_i pθ(Xi) ∝ exp(⟨θ, ∑_i T(Xi)⟩ − nA(θ)).

• Letting µ̂ = (1/n) ∑_{i=1}^n T(Xi), the log-likelihood is

ℓX(θ) = n[⟨θ, µ̂⟩ − A(θ)] + const.

• If µ̂ ∈ int M, there exists a unique MLE, the solution of ∇A(θ) = µ̂.
• That is, θ̂MLE = (∇A)⁻¹(µ̂) = ∇A*(µ̂).
• A* is the Fenchel–Legendre conjugate of A.
• If µ̂ ∉ M, then the MLE does not exist. What happens at the boundary can be determined on a case-by-case basis.
93 / 218
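Numerically, θ̂MLE = (∇A)⁻¹(µ̂) can be obtained by solving ∇A(θ) = µ̂, e.g. with Newton's method. A sketch for the canonical Bernoulli family, where the inverse map is the logit; the family, starting point, and iteration count are assumptions of this illustration:

```python
import math

def A1(theta):
    # A'(theta) = e^theta / (1 + e^theta): the mean map of the Bernoulli family
    return 1.0 / (1.0 + math.exp(-theta))

def A2(theta):
    # A''(theta) = var_theta(T): derivative used by Newton's method
    p = A1(theta)
    return p * (1 - p)

def mle(mu_hat, theta=0.0, iters=50):
    # Solve A'(theta) = mu_hat by Newton; valid for mu_hat in int M = (0, 1)
    for _ in range(iters):
        theta -= (A1(theta) - mu_hat) / A2(theta)
    return theta

mu_hat = 0.3                                    # empirical mean of T(X) = X
theta_hat = mle(mu_hat)
closed_form = math.log(mu_hat / (1 - mu_hat))   # logit: the known inverse of A'
```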
Technical remarks
• A is always lower semi-continuous (lsc).
• If Ω = dom A is open, lower semicontinuity implies that A(θ) → ∞ as θ approaches the boundary. (Pick θ0 ∈ bd Ω; then lim inf_{θ→θ0} A(θ) ≥ A(θ0) = ∞.)
• In other words, if Ω is open, A is automatically essentially smooth.
94 / 218
Example 35 (Two-parameter Gaussian)
• X = R, T(x) = (x, x²). pθ(x) = exp(θ1x + θ2x² − A(θ)), ∀x ∈ X.
• A(θ) = log ∫ e^{θ1x + θ2x²} dx. A(θ) < ∞ iff θ2 < 0.
• Natural parameter space: Ω = dom A = {(θ1, θ2) : θ2 < 0}.
• θ1 = m/σ² and θ2 = −1/(2σ²) in the original parametrization (m, σ²).
• m = θ1/(−2θ2) and σ² = 1/(−2θ2).
• Mean parametrization: µ1 = θ1/(−2θ2), µ2 = (θ1/(−2θ2))² + 1/(−2θ2).
• A(θ) = m²/(2σ²) + (1/2) log(2πσ²) = θ1²/(−4θ2) + (1/2) log(π/(−θ2)).
• Easy to verify that ∇A(θ) = (µ1, µ2), and it establishes a bijection between

{(θ1, θ2) : θ2 < 0} = int Ω ↔ int M = {(µ1, µ2) : µ2 > µ1²}.

• Note that, since Ω is open and A is lsc (hence essentially smooth), µ(θ) = ∇A(θ) → ∞ as θ approaches the boundary.
• Show a picture of θ ↦ A(θ) − ⟨θ, µ⟩ for µ = (0, 1).
95 / 218
Maximum entropy characterization of exponential family
• Not only do exponential families realize any mean, they achieve it with maximum entropy: the solution to

max_p Ep[−log p(X)] s.t. Ep[T(X)] = µ

is given by a density of the form p(x) ∝ exp(⟨θ, T(x)⟩).
• Discrete case, easy to verify by introducing Lagrange multipliers:
• X = {x1, . . . , xK},
• ν = counting measure and pi = p(xi); let p = (p1, . . . , pK) and ti = T(xi):

max_p −∑_i pi log pi s.t. ∑_i pi ti = µ, pi ≥ 0, ∑_i pi = 1.

• Without the constraint ∑_i pi ti = µ, the uniform distribution maximizes the entropy.
96 / 218
Information inequality (Cramér–Rao)
• How small can the variance of an unbiased estimator be? How well can the UMVU do?
• The bound also plays a role in asymptotics.
• Idea: Use Cauchy–Schwarz (CS), also called the covariance inequality in this context:

(EXY)² ≤ (EX²)(EY²), or [cov(X, Y)]² ≤ var(X) var(Y).

• Running assumption: every RV/estimator has a finite second moment.
• For δ unbiased for some g(θ), and ψ any other estimator,

varθ(δ) ≥ [covθ(δ, ψ)]² / varθ(ψ)   (3)

• Need to get rid of δ on the RHS.
• By cleverly choosing ψ, we can obtain good bounds.
97 / 218
• Assume Pθ+h ≪ Pθ: pθ+h(x) = 0 whenever pθ(x) = 0.
• The (local) likelihood ratio is well-defined (can define it to be 1 for 0/0):

Lθ,h(X) = pθ+h(X)/pθ(X)

• (= dPθ+h/dPθ, the Radon–Nikodym derivative of Pθ+h w.r.t. Pθ.)
• Change of measure by integrating against the likelihood ratio:

Eθ[δLθ,h] = ∫_{pθ>0} δ Lθ,h pθ dµ = ∫_{pθ>0} δ pθ+h dµ = Eθ+h[δ]   (4)

Note that pθ+h is concentrated on {x : pθ(x) > 0}.
98 / 218
Lemma 5 (Hammersley–Chapman–Robbins (HCR))
Assume Pθ+h ≪ Pθ, and let δ be unbiased for g(θ). Then,

varθ(δ) ≥ [g(θ + h) − g(θ)]² / Eθ(Lθ,h − 1)².

Proof.
• Idea: Apply the CS inequality (3) to ψ = Lθ,h − 1.
• Eθ[ψ] = Eθ[Lθ,h] − 1 = 0. (By an application of (4) to δ = 1.)
• Another application of (4) gives²

covθ(δ, ψ) = Eθ[δψ] = Eθ[δLθ,h] − Eθ[δ] = g(θ + h) − g(θ).

²ψ is not an unbiased estimator of 0, since it depends on θ; it is not a proper estimator. Not a contradiction with "the UMVU is uncorrelated with any unbiased estimator of 0".
99 / 218
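For X ∼ Ber(θ) and g(θ) = θ, the HCR denominator has a closed form (it is the χ² divergence between Ber(θ + h) and Ber(θ)), and the bound works out to θ(1 − θ) for every valid h, i.e., it is attained by δ(X) = X. A quick numerical sketch of this computation:

```python
def chi2_bern(theta, h):
    # E_theta[(L_{theta,h} - 1)^2] for Bernoulli: sum over x in {0, 1} of
    # (p_{theta+h}(x) - p_theta(x))^2 / p_theta(x)
    total = 0.0
    for x in (0, 1):
        p0 = theta if x == 1 else 1 - theta
        p1 = theta + h if x == 1 else 1 - theta - h
        total += (p1 - p0) ** 2 / p0
    return total

theta, h = 0.3, 0.1
hcr = h ** 2 / chi2_bern(theta, h)   # HCR bound: g(theta + h) - g(theta) = h
var_delta = theta * (1 - theta)      # variance of the unbiased estimator delta(X) = X
```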
• Assume that θ (and hence h) is a scalar.
• The likelihood ratio approaches 1 as h → 0:

lim_{h→0} (1/h)[Lθ,h(X) − 1] = lim_{h→0} ([pθ+h(X) − pθ(X)]/h) / pθ(X) = ∂θ[pθ(X)] / pθ(X) = ∂θ[log pθ(X)],

called the score function.
• Divide the numerator and denominator of the HCR bound by h², and let h → 0:

varθ(δ) ≥ lim_{h→0} ([g(θ + h) − g(θ)]²/h²) / (Eθ(Lθ,h − 1)²/h²).

• The numerator goes to [g′(θ)]². If we are justified in exchanging the limit and the expectation,

varθ[δ(X)] ≥ [g′(θ)]² / Eθ[∂θ log pθ(X)]².
100 / 218
Cramér–Rao (formal statement)
• Log-likelihood: ℓθ(X) := log pθ(X).
• Score function: ℓ̇θ(X) := ∇θℓθ(X) = ∇θ log pθ(X) ∈ R^d.

Theorem 8 (Cramér–Rao lower bound)
Let P be a dominated family with densities having common support S, on some open parameter space Ω ⊂ R^d. Assume:
(a) δ is an unbiased estimator of g(θ) ∈ R,
(b) g is differentiable over Ω, with gradient ġ = ∇θg ∈ R^d,
(c) ℓ̇θ(x) exists for x ∈ S and θ ∈ Ω,
(d) at least for ξ = 1 and ξ = δ, and ∀θ ∈ Ω,

(∂/∂θi) ∫_S ξ(x)pθ(x) dµ(x) = ∫_S ξ(x)(∂/∂θi)pθ(x) dµ(x), ∀i   (5)

Then,

varθ(δ) ≥ ġ(θ)ᵀ[I(θ)]⁻¹ġ(θ),

where I(θ) = Eθ[ℓ̇θℓ̇θᵀ] ∈ R^{d×d} is the Fisher information matrix.
101 / 218
• Let us rewrite the assumption:

(∂/∂θi) ∫_S ξ(x)pθ(x) dµ(x) = ∫_S ξ(x)(∂/∂θi)pθ(x) dµ(x), ∀i   (6)

• Note that the right-hand side is:

RHS = ∫_S ξ(x) (∂ log pθ(x)/∂θi) pθ(x) dµ(x) = Eθ(ξ(X)[ℓ̇θ(X)]i)

• Putting the pieces together:

∇θEθ[ξ] = Eθ[ξℓ̇θ]   (7)

• which is the differential form of the change-of-measure formula:

Eθ+h[ξ] = Eθ[ξLθ,h].
102 / 218
Proof.
• The score function has zero mean, Eθ[ℓ̇θ] = 0. (Apply (7) with ξ = 1.)
• ġ(θ) = Eθ[δℓ̇θ]. (Apply (7) with ξ = δ.)
• Fix some a ∈ R^d. We will apply the CS inequality (3) with ψ = aᵀℓ̇θ.
• Since aᵀℓ̇θ is zero mean:

aᵀġ(θ) = Eθ[δ aᵀℓ̇θ] = covθ(δ, aᵀℓ̇θ).

• Similarly,

varθ(aᵀℓ̇θ) = Eθ[aᵀℓ̇θℓ̇θᵀa] = aᵀI(θ)a.

• The CS inequality (3) with ψ = aᵀℓ̇θ gives:

varθ(δ) ≥ [covθ(δ, aᵀℓ̇θ)]² / varθ(aᵀℓ̇θ) = (aᵀġ(θ))² / (aᵀI(θ)a).

• Almost done. The problem reduces to (Exercise)

sup_{a≠0} (aᵀv)² / (aᵀBa) = vᵀB⁻¹v.

Hint: Since B ≻ 0, B^{−1/2} is well-defined; take z = B^{1/2}a. Q.E.D.
103 / 218
• Regularity conditions for interchanging the integral and derivative are key,
• and so is the unbiasedness.
• Under the same assumptions (recall ℓθ = log pθ(X)),

I(θ) = Eθ[−ℓ̈θ] = Eθ[−∇²θℓθ].

• I(θ) measures the expected local curvature of the likelihood.
• Attainment of the CRB is related to attainment in Cauchy–Schwarz: Wijsman (1973) shows that it happens if and only if we are in an exponential family.
• Fisher information is not invariant to reparametrization:

θ = θ(µ) ⟹ I(µ) = [θ′(µ)]²I(θ).

• The CRB is invariant to reparametrization. (Exercise.)
• Fisher information is additive over independent sampling.
104 / 218
Multiple parameters
What if g : Ω → R^m where Ω ⊂ R^d?
• Let Jθ = (∂gi/∂θj) ∈ R^{m×d} be the Jacobian of g.
• Then, under similar assumptions (notation: Iθ = I(θ)):

covθ(δ) ⪰ JθIθ⁻¹Jθᵀ

for any δ unbiased for g(θ).
• A ⪰ B means A − B ⪰ 0, i.e., A − B is positive semidefinite (PSD).
• Proof: Fix u ∈ R^m and apply the 1-D theorem to uᵀδ. (Exercise.)
105 / 218
Example 36
• Xi iid∼ N(θ, σ²), i = 1, . . . , n,
• σ² is fixed, g(θ) = θ.

ℓθ(X) = log pθ(X) = ∑_{i=1}^n log pθ(Xi) = −(1/(2σ²)) ∑_{i=1}^n (Xi − θ)² + const.

• Differentiating, we get the score function

ℓ̇θ(X) = (∂/∂θ) log pθ(X) = (1/σ²) ∑_{i=1}^n (Xi − θ) ⟹ ℓ̈θ(X) = −n/σ²,

• whence I(θ) = n/σ².
• The CRB is varθ(δ) ≥ σ²/n and is achieved by the sample mean.
106 / 218
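The identity I(θ) = Eθ[ℓ̇θ²] can be checked by Monte Carlo for this model: the score ∑(Xi − θ)/σ² should have second moment n/σ². A simulation sketch (sample sizes, seed, and tolerance are ad hoc choices):

```python
import random

random.seed(1)
n, theta, sigma2 = 5, 2.0, 4.0
reps = 100_000

acc = 0.0
for _ in range(reps):
    # score of one sample of size n: sum_i (X_i - theta) / sigma^2
    score = sum(random.gauss(theta, sigma2 ** 0.5) - theta for _ in range(n)) / sigma2
    acc += score ** 2

I_hat = acc / reps        # Monte Carlo estimate of E[score^2]
I_exact = n / sigma2      # Fisher information n / sigma^2 = 1.25 here
```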
Example 37 (Exponential families)
• Xi ∼ pθ(xi) = h(xi) exp(⟨θ, T(xi)⟩ − A(θ)), i = 1, . . . , n.

ℓθ(X) = log pθ(X1, . . . , Xn) = ⟨θ, ∑_i T(Xi)⟩ − nA(θ) + const.,

• whence I(θ) = Eθ[−ℓ̈θ(X)] = n∇²A(θ) = n covθ[T].
• Consider the 1-D case and n = 1.
• Want an unbiased estimate of the mean parameter: µ(θ) = Eθ[T] = A′(θ).
• The CRB is

[µ′(θ)]²/I(θ) = [A″(θ)]²/A″(θ) = A″(θ) = varθ(T),

i.e., it is attained by T.
• General case: T̄ := (1/n) ∑_{i=1}^n T(Xi) attains the CRB for the mean parameter:

covθ(δ) ⪰ covθ(T̄), ∀δ s.t. Eθ(δ) = Eθ(T̄).
107 / 218
Example 38
• Xi iid∼ Poi(λ), i = 1, . . . , n.
• Exponential family with T(X) = X and mean parameter λ,
• hence the sample mean δ(X) = (1/n) ∑_i Xi achieves the CRB for λ.
• What if we want an unbiased estimate of g(λ) = λ²?
• Since I(λ) = n/varλ[X1] = n/λ (why?),
• the CRB is [2λ]²/(n/λ) = 4λ³/n.
• The estimator T1 = (1/n) ∑_{i=1}^n Xi(Xi − 1) is unbiased for λ², and

varλ(T1) = 4λ³/n + 2λ²/n > CRB.

• S = ∑_i Xi is complete sufficient, hence
• the Rao–Blackwellized estimator T2 = E[T1|S] = S(S − 1)/n² is UMVU.
• The CRB is still not attained, since (exercise)

varλ(T2) = 4λ³/n + 2λ²/n² > CRB.

A vector of independent Poisson variables, conditioned on their sum, has a multinomial distribution; in this case, Mult(S, (1/n, . . . , 1/n)).
108 / 218
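Since S = ∑ Xi ∼ Poi(nλ), the mean and variance of T2 = S(S − 1)/n² can be computed to high accuracy by truncating the Poisson pmf, confirming unbiasedness for λ² and the variance formula above. A sketch (the truncation point is an assumption chosen so the tail is negligible):

```python
import math

def poi_pmf(s, m):
    # Poisson(m) pmf computed via logs to avoid overflow in m**s / s!
    return math.exp(-m + s * math.log(m) - math.lgamma(s + 1))

lam, n = 1.5, 10
m = n * lam                      # S = sum_i X_i is Poisson(n * lambda)
S_max = 200                      # truncation point; tail mass beyond it is negligible

mean_T2 = sum(s * (s - 1) / n ** 2 * poi_pmf(s, m) for s in range(S_max))
second = sum((s * (s - 1) / n ** 2) ** 2 * poi_pmf(s, m) for s in range(S_max))
var_T2 = second - mean_T2 ** 2

crb = 4 * lam ** 3 / n                                   # CRB for g(lambda) = lambda^2
var_formula = 4 * lam ** 3 / n + 2 * lam ** 2 / n ** 2   # claimed var of T2
```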
Average vs. maximum risk optimality
Bayesian methods:
• Trouble comparing estimators based on whole risk functions θ ↦ R(θ, δ).
• The Bayesian approach: reduce to a (weighted) average risk.
• Assumes that the parameter is a random variable Θ with some distribution Λ, called the prior, having density π(θ) (w.r.t., say, Lebesgue measure).
• The choice of the prior is important in the Bayesian framework.
• Frequentist perspective: Bayes estimators have desirable properties.
109 / 218
• Recall the decision-theoretic framework:
• Family of distributions P = {Pθ : θ ∈ Ω}.
• Bayesian framework: interpret Pθ as the conditional distribution of X given Θ = θ,

Pθ(A) = P(X ∈ A | Θ = θ).

• Together with the marginal (prior) distribution of Θ, we have the joint distribution of (Θ, X).
• Recall the risk, defined as

R(θ, δ) = Eθ[L(θ, δ(X))] = E[L(θ, δ(X)) | Θ = θ],

or in other words, R(Θ, δ) = E[L(Θ, δ(X)) | Θ].
• The Bayes risk is

r(Λ, δ) = E[R(Θ, δ)] = E[L(Θ, δ(X))].
110 / 218
• Write p(x|θ) = pθ(x) for the density of Pθ.
• Recall that

R(θ, δ) = ∫ L(θ, δ(x))p(x|θ)dx.

Then,

r(Λ, δ) = ∫ π(θ)R(θ, δ)dθ = ∫ π(θ)[ ∫ L(θ, δ(x))p(x|θ)dx ]dθ.

• We rarely use this explicit form.
111 / 218
• A Bayes rule or estimator w.r.t. Λ, denoted δΛ, is a minimizer of the Bayes risk:

r(Λ, δΛ) = min_δ r(Λ, δ).

• Depends both on the prior Λ and the loss L.

Theorem 9 (Existence of Bayes estimators)
Assume that
(a) ∃δ′ with r(Λ, δ′) < ∞,
(b) the posterior risk has a minimizer for µ-almost all x, that is,

δΛ(x) := argmin_{a∈A} E[L(Θ, a)|X = x]

is well-defined for µ-almost all x. (Measurable selection.)
Then δΛ is a Bayes rule.

Proof. Condition (a) guarantees that we can use Fubini's theorem.
• By definition of δΛ, for any δ we have E[L(Θ, δ)|X] ≥ E[L(Θ, δΛ)|X].
• Taking expectations and using smoothing finishes the proof.
112 / 218
• The posterior risk can be computed based on the posterior distribution of Θ given X = x. Bayes' rule gives

π(θ|x) = p(x|θ)π(θ)/m(x) ∝ p(x|θ)π(θ),

where m(x) = ∫ π(θ)p(x|θ)dθ is the marginal density of X.
• The posterior is proportional to the prior times the likelihood.

Example 39
Bayes estimators for two simple loss functions:
• Quadratic (or ℓ2) loss, L(θ, a) = (g(θ) − a)²:

δΛ(x) = argmin_a E[(g(Θ) − a)²|X = x] = E[g(Θ)|X = x].

For g(θ) = θ this reduces to the posterior mean.
• ℓ1 loss, L(θ, a) = |θ − a|: here δΛ(x) = median(Θ|X = x) is one possible Bayes estimator. (Not unique in this case.)
113 / 218
Example 40 (Binomial)
• X ∼ Bin(n, θ).
• The PMF is p(x|θ) = (n choose x) θˣ(1 − θ)ⁿ⁻ˣ.
• Put a Beta prior on Θ, with hyperparameters α, β > 0:

π(θ) = [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1}(1 − θ)^{β−1} ∝ θ^{α−1}(1 − θ)^{β−1}.

• We have π(θ|x) ∝ pθ(x)π(θ) ∝ θ^{x+α−1}(1 − θ)^{n−x+β−1},
• showing that Θ|X = x ∼ Beta(α + x, n − x + β), whence

δΛ(x) := E[Θ|X = x] = (x + α)/(n + α + β) = (1 − λ)(x/n) + λ·α/(α + β),

where λ = (α + β)/(n + α + β).
• Note: α/(α + β) is the prior mean, and x/n is the MLE (or unbiased estimator of the mean parameter).
• No coincidence; this happens in a general exponential family.
114 / 218
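The convex-combination form of the posterior mean is a pure algebraic identity; a quick check over all x for one (arbitrary, illustrative) choice of n, α, β:

```python
def posterior_mean(x, n, alpha, beta):
    # Beta(alpha, beta) prior + Bin(n, theta) likelihood => Beta(alpha + x, beta + n - x)
    return (x + alpha) / (n + alpha + beta)

def convex_combination(x, n, alpha, beta):
    # (1 - lam) * MLE + lam * prior mean, with lam = (alpha + beta) / (n + alpha + beta)
    lam = (alpha + beta) / (n + alpha + beta)
    return (1 - lam) * (x / n) + lam * alpha / (alpha + beta)

vals = [(x, 20, 2.0, 3.0) for x in range(21)]
max_gap = max(abs(posterior_mean(*v) - convex_combination(*v)) for v in vals)
```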
Example 41 (Normal location family)
• Assume that Xi|Θ = θ ∼ N(θ, σ²).
• Put a Gaussian prior on Θ: Θ ∼ N(µ, b²).
• The model is equivalent to

Xi = Θ + wi, wi iid∼ N(0, σ²), for i = 1, . . . , n.

• Reparametrize in terms of precisions: τ² = 1/b² and γ² = 1/σ².
• (Θ, X1, . . . , Xn) is jointly Gaussian, and the posterior is

Θ|X = x ∼ N( (1 − λn)x̄ + λnµ, 1/τn² ), with posterior mean δΛ(x) = (1 − λn)x̄ + λnµ,

where

x̄ = (1/n) ∑ xi, τn² = nγ² + τ², λn = τ²/τn² ∈ [0, 1].

• Continued ...
115 / 218
• With

x̄ = (1/n) ∑ xi, τn² = nγ² + τ², λn = τ²/τn² ∈ [0, 1],

• we have

Θ|X = x ∼ N(δΛ(x), 1/τn²).

• The posterior mean δΛ(x), i.e., the Bayes rule for ℓ2 loss, is

δΛ(x) := (1 − λn)x̄ + λnµ,

which is a convex combination of x̄ and µ, and we have
• δΛ(x) → x̄ if n → ∞ or SNR = γ²/τ² → ∞.
• δΛ(x) → µ if SNR = γ²/τ² → 0.
116 / 218
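The shrinkage behavior of δΛ can be seen directly from the formula. A sketch with illustrative numbers (all values below are arbitrary choices for the illustration):

```python
def bayes_rule(xbar, n, mu, sigma2, b2):
    # Posterior mean for X_i | theta ~ N(theta, sigma2), theta ~ N(mu, b2)
    gamma2 = 1.0 / sigma2        # observation precision
    tau2 = 1.0 / b2              # prior precision
    tau2_n = n * gamma2 + tau2   # posterior precision
    lam_n = tau2 / tau2_n        # shrinkage weight toward the prior mean
    return (1 - lam_n) * xbar + lam_n * mu

xbar, mu, sigma2, b2 = 1.0, 5.0, 1.0, 1.0
small_n = bayes_rule(xbar, 1, mu, sigma2, b2)        # equal precisions: halfway point
large_n = bayes_rule(xbar, 10 ** 6, mu, sigma2, b2)  # essentially the sample mean
```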
Conjugate priors
• The two examples above are instances of conjugacy.
• A family Q = {π(·)} of priors is conjugate to a family of likelihoods P = {p(· | θ)} if the corresponding posteriors also belong to Q.
• Examples of conjugate families:

Q: normal | beta | Dirichlet
P: normal | binomial | multinomial

Example 42 (Exponential families)
We have the following conjugate pair:

p(x|θ) = exp{⟨η(θ), T(x)⟩ − A(θ)}
qa,b(θ) = exp{⟨a, η(θ)⟩ + bA(θ) − B(a, b)}
117 / 218
Example 43 (Improper priors)
• Xi ∼ N(θ, σ²), i = 1, . . . , n.
• Is δ(x) = (1/n) ∑ xi a Bayes estimator w.r.t. some prior?
• Not if we require proper priors (finite measures): ∫π(θ)dθ < ∞, in which case π can be normalized to integrate to 1.
• We would need a uniform (proper) prior on the whole of R, which does not exist.
• An improper prior can still be used if the posterior is well-defined. (Generalized Bayes.)
• Alternatively, δ(x) is the limit of Bayes rules for a sequence of proper priors. (See also the Beta-Binomial example.)
118 / 218
Comment on the uniqueness of the Bayes estimator.

Theorem 10 (TPE 4.1.4)
Let Q be the marginal distribution of X, that is, Q(A) = ∫ Pθ(X ∈ A) dΛ(θ).
Recall that δΛ is (a) Bayes estimator. Assume that
• the loss function is strictly convex,
• r(Λ, δΛ) < ∞,
• Q-a.e. ⟹ P-a.e.; equivalently, Pθ ≪ Q for all θ ∈ Ω.
Then there is a unique Bayes estimator.
119 / 218
Minimax criterion
• Instead of averaging the risk, look at the worst-case or maximum risk:

R(δ) := sup_{θ∈Ω} R(θ, δ).

• More in accord with an adversarial nature. (A zero-sum game.)

Definition 13
An estimator δ* is minimax if min_{δ∈D} R(δ) = R(δ*).

• An effective strategy for finding minimax estimators is to look among the Bayes estimators:
• The minimax problem is: inf_δ sup_θ R(θ, δ).
• We generalize this to: inf_δ sup_Λ r(Λ, δ).
120 / 218
• Recall: δΛ is a Bayes estimator for the prior Λ, with Bayes risk

rΛ = inf_δ r(Λ, δ) = r(Λ, δΛ)

(last equality: assume rΛ is finite and achieved).
• We can order priors based on their Bayes risk:

Definition 14
Λ* is a least favorable prior if rΛ* ≥ rΛ for any prior Λ.

• For a least favorable prior, we have

rΛ* = sup_Λ rΛ = sup_Λ inf_δ r(Λ, δ) ≤ inf_δ sup_Λ r(Λ, δ) =: inf_δ r̄(δ),

where r̄(δ) = sup_Λ r(Λ, δ) is a generalization of the maximum risk R(δ).
• We are interested in situations where equality holds.
121 / 218
Characterization of minimax estimators

Theorem 11 (TPE 5.1.4)
Assume that δΛ is Bayes for Λ and r(Λ, δΛ) = R(δΛ). Then,
• δΛ is minimax.
• Λ is least favorable.
• If δΛ is the unique Bayes estimator (a.e. P), then it is the unique minimax estimator.

Proof of minimaxity of δΛ:
• The maximum risk is always lower-bounded by the Bayes risk:

R(δ) = sup_{θ∈Ω} R(θ, δ) ≥ ∫ R(θ, δ)dΛ(θ) = r(Λ, δ), ∀δ.

• R(δ) ≥ r(Λ, δ) ≥ rΛ = R(δΛ). (Last equality by assumption.)
122 / 218
Rest of the proof:
• R(δ) ≥ r(Λ, δ) ≥ rΛ = R(δΛ). (Last equality by assumption.)
• Uniqueness of the Bayes rule makes second inequality strict for δ 6= δΛ,showing the uniqueness of minimax rule.
• On the other hand,
rΛ′ ≤ r(Λ′, δΛ) ≤ R(δΛ) = rΛ.
showing that Λ is least favorable.
123 / 218
• A decision rule δ is called an equalizer if it has constant risk:

R(θ′, δ) = R(θ, δ), for all θ, θ′ ∈ Ω.

• Let ω(δ) := {θ : R(θ, δ) = R(δ)} = argmax_θ R(θ, δ).
• (δ is an equalizer iff ω(δ) = Ω.)

Corollary 4 (TPE 5.1.5–6)
(a) A Bayes estimator with constant risk (i.e., an equalizer) is minimax.
(b) A Bayes estimator δΛ is minimax if Λ(ω(δΛ)) = 1.

• Both of these conditions are sufficient, not necessary.
• (b) is weaker than (a).
• Strategy: Find a prior Λ whose support is contained in argmax_θ R(θ, δΛ).
124 / 218
Example 44 (Bernoulli, continuous parameter space)
• X ∼ Ber(θ) with quadratic loss, and Θ ∈ [0, 1].
• A (nonrandomized) rule is a pair (δ0, δ1), with δx = δ(x) for x = 0, 1.
• Given a prior Λ on [0, 1], let m1 = E[Θ] and m2 = E[Θ²].
• Frequentist risk:

R(θ, δ) = (δ0 − θ)²(1 − θ) + (δ1 − θ)²θ
= θ²[1 + 2(δ0 − δ1)] + θ(δ1² − δ0² − 2δ0) + δ0².

• Bayes risk:

r(Λ, δ) = E[R(Θ, δ)] = m2[1 + 2(δ0 − δ1)] + m1(δ1² − δ0² − 2δ0) + δ0².

• The Bayes decision rule is found by minimizing r(Λ, δ) w.r.t. (δ0, δ1):

δ1* = m2/m1, δ0* = (m1 − m2)/(1 − m1).
125 / 218
• The Bayes decision rule is found by minimizing r(Λ, δ) w.r.t. (δ0, δ1):

δ1* = m2/m1, δ0* = (m1 − m2)/(1 − m1).

• Aside: Since p(θ|x) = Cθˣ(1 − θ)¹⁻ˣπ(θ), check that δx* = E[Θ|X = x], x = 0, 1, as it should be:

δx* = ∫θ^{x+1}(1 − θ)^{1−x}π(θ)dθ / ∫θˣ(1 − θ)^{1−x}π(θ)dθ.

• A general rule δ is an equalizer, i.e., R(θ, δ) does not depend on θ, iff

δ1 − δ0 = 1/2 and δ1² − δ0² − 2δ0 = 0.

• These equations have a single solution: δ0 = 1/4 and δ1 = 3/4. (There is a unique equalizer rule.)
• Equalizer Bayes rule: need 3/4 = m2/m1 and 1/4 = (m1 − m2)/(1 − m1).
• Solving: m1* = 1/2 and m2* = 3/8.
• We need a prior Λ with these moments; Λ = Beta(1/2, 1/2) fits the bill.
• This is a least favorable prior.
• The corresponding Bayes, hence minimax, risk is 1/16.
126 / 218
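The computation can be replayed numerically: the first two moments of Beta(1/2, 1/2) give the equalizer Bayes rule (1/4, 3/4), whose risk is 1/16 for every θ. A quick sketch:

```python
def beta_moments(a, b):
    # first two moments of a Beta(a, b) random variable
    m1 = a / (a + b)
    m2 = a * (a + 1) / ((a + b) * (a + b + 1))
    return m1, m2

m1, m2 = beta_moments(0.5, 0.5)       # the candidate least favorable prior
d1 = m2 / m1                          # Bayes rule delta_1*
d0 = (m1 - m2) / (1 - m1)             # Bayes rule delta_0*

def risk(theta, d0, d1):
    # frequentist risk of the rule (d0, d1) for X ~ Ber(theta), quadratic loss
    return (d0 - theta) ** 2 * (1 - theta) + (d1 - theta) ** 2 * theta

risks = [risk(t / 10, d0, d1) for t in range(11)]   # grid over theta in [0, 1]
```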
• The above can be generalized to an i.i.d. sample of size n: X1, . . . , Xn iid∼ Ber(θ), where Beta(√n/2, √n/2) is least favorable and the associated minimax risk is 1/(4(√n + 1)²).
• Compare with the risk of the sample mean: R(θ, X̄) = θ(1 − θ)/n.
127 / 218
Example 45 (Bernoulli, discrete parameter space)
• Let X ∼ Ber(θ) and Ω = {1/3, 2/3} =: {a, b}.
• Take L(θ, δ) = (θ − δ)².
• Any (nonrandomized) decision rule is specified by a pair of numbers (δ0, δ1).
• Any prior is specified by a single number πa = P(Θ = a) ∈ [0, 1].
• Frequentist risk:

R(θ, δ) = (δ0 − θ)²(1 − θ) + (δ1 − θ)²θ.

• Bayes risk r(π, δ) = E[R(Θ, δ)]:

r(π, δ) = πaR(a, δ) + (1 − πa)R(b, δ).
128 / 218
• Take derivatives w.r.t. δ0, δ1, set them to zero, and find the Bayes rule:

δ0* = [aπa(1 − a) + b(1 − b)(1 − πa)] / [(1 − a)πa + (1 − b)(1 − πa)]
δ1* = [a²πa + b²(1 − πa)] / [aπa + b(1 − πa)]

• For a = 1/3 = 1 − b,

δ0* = 2/(3(πa + 1)) and δ1* = (4 − 3πa)/(6 − 3πa).

• Equalizer rule, one for which R(a, δ) = R(b, δ):

(a + b)[2(δ0 − δ1) + 1] + δ1² − δ0² − 2δ0 = 0.

• A Bayes rule that is also an equalizer occurs for πa* = 1/2.
• This is the least favorable prior.
• The corresponding rule (δ0*, δ1*) = (4/9, 5/9) is minimax.
129 / 218
Geometry of Bayes and Minimax
• Risk body for the two-point Bernoulli problem, Ω = {1/3, 2/3}.
• Deterministic rules, Bayes rules, minimax rule.

[Figure: two plots of the risk set S for this problem, marking the deterministic rules, the Bayes rules, and the minimax rule; axis tick labels omitted.]
130 / 218
Geometry of Bayes and minimax for finite Ω
• Assume Ω = {θ1, . . . , θk} finite, and consider the risk set (or body)

S = {(y1, . . . , yk) | yi = R(θi, δ) for some δ} ⊂ R^k.

• Alternatively, define ρ : D → R^k by

ρ(δ) = (R(θ1, δ), . . . , R(θk, δ)),

where D is the set of randomized decision rules.
• S is the image of D under ρ, i.e., S = ρ(D).

Lemma 6
S is a convex set (with randomized estimators).

Proof. For δ, δ′ ∈ D and a ∈ [0, 1], we can form a randomized decision rule δa such that S ∋ R(θ, δa) = aR(θ, δ) + (1 − a)R(θ, δ′). (Exercise.)
131 / 218
• Every prior Λ corresponds to a vector λ = (λ1, . . . , λk) ∈ R^k via Λ({θi}) = λi. Note that λ lies in the (k − 1)-simplex,

∆ := {(λ1, . . . , λk) ∈ R₊^k : ∑_{i=1}^k λi = 1}.

• The Bayes risk is

r(Λ, δ) = E[R(Θ, δ)] = ∑_{i=1}^k λiR(θi, δ) = λᵀρ(δ).

• Hence finding the Bayes rule is equivalent to

inf_{δ∈D} r(Λ, δ) = inf_{δ∈D} λᵀρ(δ) = inf_{y∈S} λᵀy,

a convex problem in R^k. The minimax problem is

inf_{δ∈D} ‖ρ(δ)‖∞ = inf_{y∈S} ‖y‖∞.

Finding the least favorable prior corresponds to sup_{λ∈∆} [inf_{y∈S} λᵀy].
132 / 218
Admissibility of Bayes rules
• In general, a unique (a.e. P) Bayes rule is admissible (TPE 5.2.4).
• There is a complete answer to the admissibility question for finite parameter spaces.

Proposition 10
Assume Ω = {θ1, . . . , θk} and that δλ is the Bayes rule for λ. If λi > 0 for all i, then δλ is admissible.

Proof. If δλ is inadmissible, there is δ such that

R(θi, δ) ≤ R(θi, δλ), ∀i,

with strict inequality for some j. Then,

∑_i λiR(θi, δ) < ∑_i λiR(θi, δλ),

contradicting the fact that δλ minimizes the Bayes risk. Q.E.D.
133 / 218
Proposition 11
Assume Ω = {θ1, . . . , θk} and δ is admissible. Then δ is Bayes w.r.t. some prior λ.

Proof.
• Let x := ρ(δ), the risk vector of δ, and Qx := {y ∈ R^k | yi ≤ xi} \ {x}.
• Qx is convex. (Removing an extreme point from a convex set preserves convexity.)
• Admissibility means Qx ∩ S = ∅.
• Two non-empty disjoint convex sets in R^k can be separated by a hyperplane:

∃u ≠ 0 s.t. uᵀz ≤ uᵀy, for all z ∈ Qx and y ∈ S.

• Suppose we can choose u to have nonnegative coordinates. (Proof by contradiction. (Exercise.))
134 / 218
• Since u ≠ 0, we can set λ = u/(∑_i ui) ∈ ∆, the (k − 1)-simplex.
• λᵀz ≤ inf_{y∈S} λᵀy = rλ, ∀z ∈ Qx.
• Taking {zn} ⊂ Qx such that zn → x, we obtain λᵀx ≤ rλ.
• But by definition of the optimal Bayes risk, rλ ≤ λᵀx; hence rλ = λᵀx.
135 / 218
M-estimation
• Setup: An i.i.d. sample of size n from a model P(1) = {Pθ : θ ∈ Ω} on sample space X, i.e.,

X1, . . . , Xn iid∼ Pθ.

• The full model is actually P(n) = {Pθ⊗n : θ ∈ Ω}, with sample space Xⁿ.
• M-estimators: those obtained as solutions of optimization problems.

Definition 15
Given a family of functions mθ : X → R, for θ ∈ Ω, the corresponding M-estimator based on X1, . . . , Xn is

θ̂n := θ̂n(X1, . . . , Xn) := argmax_{θ∈Ω} (1/n) ∑_{i=1}^n mθ(Xi).

• We often write Mn(θ) := (1/n) ∑_{i=1}^n mθ(Xi), a random function.
136 / 218
• An alternative approach is to specify θ̂ as a Z-estimator, i.e., the solution of a set of estimating equations

Ψn(θ) := (1/n) ∑_{i=1}^n ψ(Xi, θ) = 0.

• Often the first-order optimality conditions for an M-estimator produce a set of estimating equations. (Simplistic in general, ignoring the possibility of constraints imposed by Ω.)
137 / 218
Example 46
1. mθ(x) = −(x − θ)²; then Mn(θ) = −(1/n) ∑_{i=1}^n (Xi − θ)², giving θ̂ = X̄.
2. mθ(x) = −|x − θ|; then Mn(θ) = −(1/n) ∑_{i=1}^n |Xi − θ|, giving θ̂ = median(X1, . . . , Xn).
3. mθ(x) = log pθ(x); then Mn(θ) = (1/n) ∑_{i=1}^n log pθ(Xi), giving the maximum likelihood estimator (MLE).

• In a location family with pθ(x) = C exp(−β|x − θ|^p), the MLE is equivalent to an M-estimator with mθ(x) = −|x − θ|^p.
• p = 2: Gaussian distribution. (Case 1.)
• p = 1: Laplace distribution. (Case 2.)
• The corresponding Z-estimator forms of 1 and 2 are obtained for ψθ(x) = x − θ and ψθ(x) = sign(x − θ), obtained by differentiation (or sub-differentiation) of mθ.
138 / 218
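Cases 1 and 2 can be illustrated by maximizing Mn directly on a grid: the quadratic criterion is maximized (to grid precision) at the sample mean, and the absolute-value criterion at the sample median. A sketch with made-up data:

```python
import statistics

data = [0.2, 1.5, -0.7, 3.1, 0.9, 2.4, -1.2]

def Mn(theta, m):
    # empirical criterion M_n(theta) = (1/n) sum_i m(X_i, theta)
    return sum(m(x, theta) for x in data) / len(data)

grid = [i / 1000 for i in range(-3000, 4001)]   # theta grid on [-3, 4], step 0.001
theta_sq = max(grid, key=lambda t: Mn(t, lambda x, th: -(x - th) ** 2))
theta_abs = max(grid, key=lambda t: Mn(t, lambda x, th: -abs(x - th)))
```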
Example 47 (Method of Moments (MOM))
• Find θ̂ by matching empirical and population (true) moments:

Eθ[X1^k] = (1/n) ∑_{i=1}^n Xi^k, k = 1, 2, . . . , d.

• Usually d is the dimension of the parameter θ (d equations in d unknowns).
• A set of estimating equations with ψθ(x) = x^k − Eθ[X1^k], k = 1, . . . , d.
• A generalized version of MOM solves

Eθ[ϕk(X1)] = (1/n) ∑_{i=1}^n ϕk(Xi), k = 1, 2, . . . , d,

for some collection of functions {ϕk}, corresponding to a Z-estimator with ψθ(x) = ϕk(x) − Eθ[ϕk(X1)].
139 / 218
• In canonical exponential families,

Xi iid∼ pθ(x) ∝ exp{⟨θ, T(x)⟩ − A(θ)},

ML and MOM are equivalent.
• The MLE is the M-estimator associated with

mθ(x) = log pθ(x) = ⟨θ, T(x)⟩ − A(θ),

hence

Mn(θ) = (1/n) ∑_i [⟨θ, T(Xi)⟩ − A(θ)] = ⟨θ, T̄⟩ − A(θ),

where T̄ = (1/n) ∑_i T(Xi) is the empirical mean of the sufficient statistic.
• The MLE is

θ̂mle = argmax_{θ∈Ω} ⟨θ, T̄⟩ − A(θ).

• Setting derivatives to zero gives T̄ = ∇A(θ̂mle). (First-order optimality.)
• Since Eθ[T] = ∇A(θ), the MLE is the solution of

Eθ[T] = T̄

for θ, which is a MOM estimator. (If you will, Eθ[T(X1)] = (1/n) ∑_i T(Xi).)
140 / 218
Sidenote
• Recall that µ = ∇A(θ) is the mean parameterization.
• The inverse of this map is θ = ∇A*(µ) where

A*(µ) = sup_{θ∈Ω} [〈θ, µ〉 − A(θ)]

is the conjugate dual of A. (Exercise.)
• So θmle = ∇A∗(T ), assuming that T ∈ int(dom(A∗)).
141 / 218
Asymptotics or large-sample theory
Zeroth-order (consistency)
• Statistical behavior of estimators, in particular M-estimators, as n→∞.
• For concreteness consider the sequence

θ̂n = θ̂n(X1, . . . , Xn) = argmax_{θ∈Ω} (1/n) ∑_{i=1}^n mθ(Xi)
Definition 16
Let X1, . . . , Xn iid∼ Pθ0. We say that θ̂n is consistent if θ̂n →p θ0.

• Equivalently,

∀ε > 0, P(d(θ̂n, θ0) > ε) → 0, as n → ∞.

• Usually d(θ̂n, θ0) = ‖θ̂n − θ0‖ for Euclidean parameter spaces Ω ⊂ R^d.
• For d = 1, d(θ̂n, θ0) = |θ̂n − θ0|.
142 / 218
• We write Zn = op(1) if Zn →p 0.
• By the WLLN, for any fixed θ, we have (assuming Eθ0|mθ(X1)| < ∞)

(1/n) ∑_{i=1}^n mθ(Xi) →p Eθ0[mθ(X1)]

• Letting M(θ) := Eθ0[mθ(X1)], for any fixed θ, Mn(θ) →p M(θ).
• If θ0 is the maximizer of M over Ω, we hope that θ̂n, the maximizer of Mn over Ω, approaches it.
• However, pointwise convergence of Mn to M is not enough; we need uniform convergence, i.e.,

‖Mn − M‖∞ := sup_{θ∈Ω} |Mn(θ) − M(θ)|

to go to zero in probability.
143 / 218
Why uniform convergence?
• Even a nonrandom example is enough:

Mn(t) = 1 − n|t − 1/n| for |t| < 2/n, Mn(t) = 1/2 − |t − 1| for 1/2 < t < 3/2, Mn(t) = 0 otherwise,

M(t) = 1/2 − |t − 1| for 1/2 < t < 3/2, M(t) = 0 otherwise.

• Here Mn(t) → M(t) pointwise, but the maximizer of Mn is tn = 1/n with Mn(tn) = 1, while the maximizer of M is t0 = 1 with M(t0) = 1/2; so tn → 0 ≠ t0.

[Figure: plots of Mn (spike near 0 plus tent at 1) and M (tent at 1) on [0, 1.5].]
144 / 218
Theorem 12 (AS 5.7 modified)
Let Mn be random functions, and let M be a fixed function of θ. Let

θ̂n ∈ argmax_{θ∈Ω} Mn(θ)  (cond-M)

be well-defined. Assume:
(a) ‖Mn − M‖∞ →p 0. (Uniform convergence.)
(b) (∀ε > 0) sup_{θ: d(θ,θ0)≥ε} M(θ) < M(θ0). (M has a well-separated maximum.)
Then θ̂n is consistent, i.e., θ̂n →p θ0.

• By optimality of θ̂n for Mn, we have Mn(θ0) ≤ Mn(θ̂n), or

0 ≤ Mn(θ̂n) − Mn(θ0)  (Basic inequality)

• By adding and subtracting, we get

M(θ0) − M(θ̂n) ≤ Mn(θ̂n) − M(θ̂n) − [Mn(θ0) − M(θ0)] ≤ 2‖Mn − M‖∞

(We are keeping random deviations from the mean on one side and fixed functions on the other side.)
145 / 218
• Fix some ε > 0, and let

η(ε) := M(θ0) − sup_{d(θ,θ0)≥ε} M(θ) = inf_{d(θ,θ0)≥ε} [M(θ0) − M(θ)]

• By assumption (b), η(ε) > 0.
• Since d(θ, θ0) ≥ ε implies M(θ0) − M(θ) ≥ η(ε), we have

P(d(θ̂n, θ0) ≥ ε) ≤ P(M(θ0) − M(θ̂n) ≥ η(ε)) ≤ P(2‖Mn − M‖∞ ≥ η(ε)) → 0

by assumption (a). Q.E.D.
Remark 3
A key step is bounding Mn(θ̂n) − M(θ̂n) by ‖Mn − M‖∞.
Exercise: Condition (cond-M) can be replaced with Mn(θ̂n) ≥ Mn(θ0) − op(1).
146 / 218
• Sufficient conditions for uniform convergence can be found in Keener, Chapter 9, Theorem 9.2.
• For example, we have (a) if
  • Ω is compact,
  • θ 7→ mθ(x) is continuous (for a.e. x), and
  • E‖m∗(X1)‖∞ < ∞, where ‖m∗(X1)‖∞ = sup_{θ∈Ω} |mθ(X1)|.
• For example, we have (b) if
  • Ω is compact,
  • M is continuous, and
  • M has a unique maximizer over Ω.
• In general, the key factor in whether uniform convergence holds is the size of the parameter space Ω.
147 / 218
Side note:
• Why do we have (b) if
  • Ω is compact,
  • M is continuous, and
  • M has a unique maximizer over Ω?
• Since Ω is compact and M is continuous, M attains its maximum over

Ω \ B(θ0; ε) := {θ ∈ Ω : d(θ, θ0) ≥ ε},

where B(θ0; ε) is the open ball of radius ε centered at θ0.
• Let θε be a maximizer of M over Ω \ B(θ0; ε). Then,

sup_{θ: d(θ,θ0)≥ε} M(θ) = M(θε) < M(θ0).

• The strict inequality is due to the uniqueness of the maximizer of M over Ω.
• Compactness is key; otherwise uniqueness of the global maximizer does not imply this inequality.
148 / 218
Example 48
• MLE can be obtained as an M-estimator with mθ(x) = log [pθ(x)/pθ0(x)].
• Addition of −log pθ0(x) does not change the maximizer of Mn(θ).

M(θ) = Eθ0[mθ(X1)] = −∫ pθ0(x) log [pθ0(x)/pθ(x)] dx = −D(pθ0‖pθ).

• D(p‖q) is the Kullback–Leibler (KL) divergence between p and q.
• A form of (squared) distance among distributions.
• Does not satisfy the triangle inequality or symmetry.
• D(p‖q) ≥ 0 with equality iff p = q.
• Condition (b) is a bit stronger.
• Often, we can show (strong identifiability)

γ(d(θ0, θ)) ≤ D(pθ0‖pθ)

for some function γ : [0,∞) → [0,∞) strictly increasing in a neighborhood of θ0.
149 / 218
• Example: exponential distribution with pλ(x) = λe^{−λx} 1{x > 0}:

D(pλ0‖pλ) = Eλ0[log (λ0 e^{−λ0 X1}) − log (λ e^{−λ X1})]
          = log(λ0/λ) + Eλ0[(λ − λ0)X1]
          = −log(λ/λ0) + λ/λ0 − 1

• This is the Itakura–Saito distance, i.e., the Bregman divergence for φ(x) = −log x, from an earlier lecture.
• f(x) = −log x + x − 1 is strictly convex on (0, ∞) with unique minimum at x = 1.
150 / 218
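A small Monte Carlo sanity check of the exponential KL formula above (not from the slides; the rates, seed, and sample size are illustrative): estimate D(pλ0‖pλ) = Eλ0[log pλ0(X)/pλ(X)] by averaging log-likelihood ratios and compare with the closed form.

```python
import numpy as np

rng = np.random.default_rng(2)
lam0, lam = 1.5, 0.8  # illustrative rates

# Monte Carlo estimate of D(p_lam0 || p_lam) under X ~ Exp(lam0).
x = rng.exponential(1 / lam0, 500_000)
log_ratio = (np.log(lam0) - lam0 * x) - (np.log(lam) - lam * x)
kl_mc = log_ratio.mean()

# Closed form from the slide: -log(lam/lam0) + lam/lam0 - 1.
kl_exact = -np.log(lam / lam0) + lam / lam0 - 1
```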
First-order (asymptotic normality)
• More refined understanding, by looking at scaled (magnified) deviations ofconsistent estimators.
• IID sequence X1, X2, . . . with mean µ = E[X1] and Σ = cov(X1):

WLLN: X̄n →p µ. (X̄n is consistent for µ.)
CLT: √n(X̄n − µ) →d N(0, Σ). (Characterizes fluctuations of X̄n − µ.)

• Fluctuations are of order n^{−1/2} and after normalization have an approximately Gaussian distribution.
151 / 218
• First, let us look at how modes of convergence interact.
Proposition 12
(a) Xn →p X implies Xn →d X, but not vice versa.
(b) Xn →p c is equivalent to Xn →d c. (c is a constant.)
(c) Continuous mapping (CM): Xn → X and f continuous implies f(Xn) → f(X). Holds for both →d and →p.
(d) Slutsky's: Xn →d X and Yn →d c implies (Xn, Yn) →d (X, c).
(e) Xn →p X and Yn →p Y implies (Xn, Yn) →p (X, Y).
(f) Xn →d X and d(Xn, Yn) →p 0 implies Yn →d X.

• For (c), f only needs to be continuous on a set C with P(X ∈ C) = 1.
• (d) does not hold in general if c is replaced by some random variable Y.
152 / 218
• What is usually called Slutsky's lemma is not exactly (d) above.
• It is in fact a special application of (c) and (d), to the functions (x, y) 7→ x + y, (x, y) 7→ xy and (x, y) 7→ y^{−1}x.

Corollary 5 (Slutsky's lemma)
Let Xn, Yn and X be random variables, or vectors or matrices, and c a constant. Assume that Xn →d X and Yn →d c. Then,

Xn + Yn →d X + c,  Yn Xn →d cX,  Yn^{−1} Xn →d c^{−1}X,

assuming c is invertible for the latter. More generally, f(Xn, Yn) →d f(X, c) for any continuous function f.

E.g., op(1) + op(1) = op(1).
153 / 218
• Simple example:
(a) Xn →d Z ∼ N(0, 1) implies Xn² →d Z² ∼ χ²₁.

Example 49 (Counterexample)
• Xn = X ∼ U(0, 1), ∀n, and

Yn = Xn 1{n odd} + (1 − Xn) 1{n even}.

• Xn →d X and Yn →d X, but (Xn, Yn) does not converge in distribution.
• Why?
• Let C1 = {(x, y) ∈ [0, 1]² : x = y} and C2 = {(x, y) ∈ [0, 1]² : x + y = 1}.
• Let U(Ci) be the uniform distribution on Ci. Then,

(Xn, Yn) ∼ U(C1) for n odd, U(C2) for n even.
Example 50 (t-statistic)
• IID sequence {Xi}, with E[Xi] = µ and var(Xi) = σ².
• Let X̄n = (1/n) ∑ Xi and Sn² = (1/n) ∑ (Xi − X̄n)² = (1/n) ∑ Xi² − (X̄n)². Then,

t_{n−1} := (X̄n − µ)/(Sn/√n) →d N(0, 1).

• Why? (1/n) ∑ Xi² →p E[X1²] = σ² + µ² and (X̄n)² →p µ².
• These imply Sn →p √(σ² + µ² − µ²) = σ.
• It follows that

t_{n−1} = √n(X̄n − µ)/Sn →d N(0, σ²)/σ = N(0, 1)

• Distribution-free result: we are not assuming that the Xi are Gaussian.
155 / 218
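The distribution-free claim in Example 50 can be checked numerically (a sketch, not from the slides; the exponential data, sample size, and seed are illustrative): even for skewed data, the t-statistic is approximately N(0, 1), so roughly 95% of draws should land in [−1.96, 1.96].

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 20_000
mu = 1.0  # mean of Exp(1); deliberately non-Gaussian data

x = rng.exponential(1.0, (reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1)  # 1/n convention, matching S_n on the slide
t = (xbar - mu) / (s / np.sqrt(n))

# If t is approximately N(0,1), about 95% of draws fall in [-1.96, 1.96].
coverage = np.mean(np.abs(t) <= 1.96)
```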
• Also need the concept of uniform tightness or boundedness in probability.
• A collection of random vectors {Xn} is uniformly tight if

∀ε > 0, ∃M such that sup_n P(‖Xn‖ > M) < ε.

We will write Xn = Op(1) in this case.

Proposition 13 (Uniform Tightness)
(a) If Xn →p 0 and {Yn} is uniformly tight, then Xn Yn →p 0.
(b) If Xn →d X, then {Xn} is uniformly tight.

(a) can be written compactly as op(1)Op(1) = op(1).
156 / 218
Simplified notation: E[ṁθ0] in place of E[ṁθ0(X1)].

Theorem 13 (Asymptotic normality of M-estimators)
Assume the following:
(a) The gradient ṁθ0(X1) has moments up to second order, with
  • E[ṁθ0] = 0, and
  • well-defined covariance matrix Sθ0 := E[ṁθ0 ṁθ0ᵀ].
(b) The Hessian m̈θ0(X1) is integrable with Vθ0 := E[m̈θ0] ≺ 0.
(c) θ̂n is consistent for θ0.
(d) ∃ε > 0 such that sup_{‖θ−θ0‖≤ε} ‖M̈n(θ) − M̈(θ0)‖ →p 0.
Let ∆n,θ := (1/√n) ∑_{i=1}^n ṁθ(Xi). Then,

√n(θ̂n − θ0) = −Vθ0^{−1} ∆n,θ0 + op(1), and ∆n,θ0 →d N(0, Sθ0).

In particular, √n(θ̂n − θ0) →d N(0, Vθ0^{−1} Sθ0 Vθ0^{−1}).

In (b), we only need the Hessian Vθ0 to be nonsingular.
(d) is (local) uniform convergence (UC).
157 / 218
Proof of AN
1. θ̂n is a maximizer of Mn, hence
2. Ṁn(θ̂n) = 0. (First-order optimality condition.)
3. Taylor-expand Ṁn around θ0:

Ṁn(θ̂n) − Ṁn(θ0) = M̈n(θ̃n)[θ̂n − θ0]

for some θ̃n in the line segment [θ̂n, θ0]. (Mean-value theorem, assuming continuity of M̈n.)
4. θ̃n = θ0 + op(1). (By consistency of θ̂n.)
5. M̈n(θ̃n) = M̈n(θ0) + op(1). (By (d): UC.)
6. Note that M̈n(θ0) = n^{−1} ∑i m̈θ0(Xi) is an average.
7. M̈n(θ0) = Eθ0[m̈θ0(X1)] + op(1) = Vθ0 + op(1). (By (b) and WLLN.)
8. M̈n(θ̃n) = Vθ0 + op(1). (Combine 5. and 7. + CM.)
9. By CM applied with f(X) = X^{−1}, and invertibility of Vθ0:

[M̈n(θ̃n)]^{−1} = [Vθ0 + op(1)]^{−1} = Vθ0^{−1} + op(1).
158 / 218
10. Combine 2., 3. and 9.:

θ̂n − θ0 = [M̈n(θ̃n)]^{−1}[Ṁn(θ̂n) − Ṁn(θ0)] = [Vθ0^{−1} + op(1)][0 − Ṁn(θ0)]

11. Expand the RHS and multiply by √n:

√n(θ̂n − θ0) = −Vθ0^{−1}[√n Ṁn(θ0)] − op(1)[√n Ṁn(θ0)]  (8)

12. Ṁn(θ0) is an average of zero-mean terms with covariance Sθ0, by (a).
13. √n Ṁn(θ0) →d N(0, Sθ0). (CLT and (a).)
14. √n Ṁn(θ0) = Op(1). (By Prop. 13(b) and 13.)
15. Applying op(1)Op(1) = op(1) to (8), (Prop. 13(a) and 11.)

√n(θ̂n − θ0) = −Vθ0^{−1}[√n Ṁn(θ0)] + op(1).

16. Note that √n Ṁn(θ0) = ∆n,θ0 by definition.
17. Second part: apply CM with f(x) = −Vθ0^{−1} x. (Exercise.)
159 / 218
Example 51 (AN of MLE)
• For MLE, mθ(x) = ℓθ(x) = log pθ(x).
• ṁθ = ℓ̇θ, the score function, zero-mean under regularity conditions.
• Sθ = Eθ[ℓ̇θ ℓ̇θᵀ] = I(θ).
• Vθ = Eθ[ℓ̈θ] = −I(θ).
• Asymptotic covariance of MLE = [−I(θ)]^{−1} I(θ) [−I(θ)]^{−1} = [I(θ)]^{−1}.
• It follows (assuming (c) and (d) hold) that

√n(θ̂mle − θ0) →d N(0, [I(θ0)]^{−1})

• Often interpreted as "MLE is asymptotically efficient",
• i.e., it achieves the Cramér–Rao bound in the limit.
160 / 218
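A numerical illustration of Example 51 (a sketch, not from the slides; the rate, sample sizes, and seed are illustrative): for Exp(λ), the MLE is λ̂ = 1/X̄ and I(λ) = 1/λ², so √n(λ̂ − λ0) should be approximately N(0, λ0²).

```python
import numpy as np

rng = np.random.default_rng(4)
lam0, n, reps = 2.0, 2000, 10_000

# MLE for the exponential rate is 1/Xbar; Fisher info is I(lam) = 1/lam^2,
# so sqrt(n)(lam_hat - lam0) ~ N(0, lam0^2) approximately.
x = rng.exponential(1 / lam0, (reps, n))
lam_hat = 1 / x.mean(axis=1)
z = np.sqrt(n) * (lam_hat - lam0)

emp_mean, emp_sd = z.mean(), z.std()
```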
Hodges' superefficient example
• If √n(δn − θ) →d N(0, σ²(θ)), one might think that σ²(θ) ≥ 1/I(θ) by the CRB.
• If so, any estimator with asymptotic variance 1/I(θ) could be called asymptotically efficient.
• Unfortunately this is not true. (Convergence in distribution is too weak to guarantee this.) Here is a counterexample:

Example 52
• Consider the shrinkage estimator

δ′n = aδn if |δn| ≤ n^{−1/4}, δn otherwise.

• δ′n has the same asymptotic behavior as δn for θ ≠ 0.
• The asymptotic behavior of δ′n at θ = 0 is the same as that of aδn, which has asymptotic variance a²σ²(θ); this can be made arbitrarily small by choosing a sufficiently small.
161 / 218
Delta method
• The delta method is a powerful extension of the CLT.
• Assume that f : Ω → R^k, with Ω ⊂ R^d, is differentiable and θ ∈ Ω.
• Let Jθ = (∂fi/∂xj)|_{x=θ} be the Jacobian of f at θ.
• Note: Jθ ∈ R^{k×d}.

Proposition 14
Under the above assumptions: if an(Xn − θ) →d Z, with an → ∞, then

an[f(Xn) − f(θ)] →d Jθ Z

• If f is differentiable, then it is partially differentiable and its total derivative can be represented (or identified) with the Jacobian matrix Jθ.
• Simplest case, k = d = 1: an[f(Xn) − f(θ)] →d f′(θ)Z.
162 / 218
Proof of Delta method
• an(Xn − θ) = Op(1), and since an → ∞, we have Xn − θ = op(1).
• By differentiability (1st-order Taylor expansion),

f(θ + h) = f(θ) + Jθ h + R(h)‖h‖

where R(h) = o(1) as h → 0. Define R(0) = 0 so that R is continuous at 0.
• Applying this with h = Xn − θ, we have

f(Xn) = f(θ) + Jθ(Xn − θ) + R(Xn − θ)‖Xn − θ‖.

• Multiplying by an, we get

an[f(Xn) − f(θ)] = Jθ[an(Xn − θ)] + R(Xn − θ)‖an(Xn − θ)‖

• Note ‖an(Xn − θ)‖ = Op(1).
163 / 218
• In the display

an[f(Xn) − f(θ)] = Jθ[an(Xn − θ)] + R(Xn − θ)‖an(Xn − θ)‖,

the factor R(Xn − θ) is op(1) while ‖an(Xn − θ)‖ is Op(1).
• R(Xn − θ) = op(1) and Jθ[an(Xn − θ)] →d Jθ Z, both by CM.
• The result follows from op(1)Op(1) = op(1) and Prop. 12(f).
164 / 218
Example 53
• Let Xi be iid with µ = E[Xi] and σ² = var[Xi].
• By CLT, √n(X̄n − µ) →d N(0, σ²).
• Consider the function f(t) = t². Then, by the delta method,

√n(f(X̄n) − f(µ)) →d f′(µ) N(0, σ²),

that is,

√n[(X̄n)² − µ²] →d N(0, σ²(2µ)²).

• For µ = 0, we get the degenerate result √n(X̄n)² →d 0.
• In this case, we need to scale the error further:
• n(X̄n)² →d σ²χ²₁, which follows from the CLT √n X̄n →d σN(0, 1) and CM.
165 / 218
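A quick check of Example 53 by simulation (not from the slides; parameters and seed are illustrative): with f(t) = t² and µ ≠ 0, the delta method predicts a limiting standard deviation of 2|µ|σ for √n[(X̄n)² − µ²].

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 2.0, 1.0, 1000, 20_000

x = rng.normal(mu, sigma, (reps, n))
xbar = x.mean(axis=1)

# Delta method with f(t) = t^2 predicts
#   sqrt(n)[(xbar)^2 - mu^2] -> N(0, sigma^2 (2 mu)^2),
# i.e. a limiting standard deviation of 2*|mu|*sigma = 4 here.
z = np.sqrt(n) * (xbar ** 2 - mu ** 2)
emp_sd = z.std()
```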
Example 54
• Xi iid∼ Ber(p).
• By CLT, √n(X̄n − p) →d N(0, p(1 − p)).
• Let f(p) = p(1 − p).
• f(X̄n) is a plugin estimator for the variance, and

√n(f(X̄n) − f(p)) →d N(0, (1 − 2p)² p(1 − p))

since f′(x) = 1 − 2x.
• Again, at p = 1/2 this is degenerate and the convergence happens at a faster rate.
166 / 218
These examples can be dealt with using the following extension.
Proposition 15
Consider the scalar case k = d = 1. If √n(Xn − θ) →d N(0, σ²) and f is twice differentiable with f′(θ) = 0, then

n[f(Xn) − f(θ)] →d (1/2) f″(θ) σ² χ²₁

Informal derivation:
• f(Xn) − f(θ) = (1/2) f″(θ)(Xn − θ)² + o((Xn − θ)²).
• Since n(Xn − θ)² →d (σZ)², where Z ∼ N(0, 1), we get the result.
167 / 218
Example 55 (Multivariate delta method)
• Recall Sn² = (1/n) ∑i Xi² − (X̄n)². Let

Zn := (1/n) ∑_{i=1}^n (Xi, Xi²)ᵀ, θ = (µ, µ² + σ²)ᵀ, Σ = cov((X1, X1²)ᵀ)

• By the (multivariate) CLT, we have

√n(Zn − θ) →d N(0, Σ)

• Letting f(x, y) = (x, y − x²), we have

√n[(X̄n, Sn²)ᵀ − (µ, σ²)ᵀ] →d Jθ N(0, Σ) = N(0, Jθ Σ Jθᵀ)

• Exercise: evaluate the asymptotic covariance Jθ Σ Jθᵀ.
168 / 218
What are asymptotic normality results useful for?
• Simplify comparison of estimators: Can use asymptotic variances. (ARE)
• Can build asymptotic confidence intervals.
169 / 218
Asymptotic relative efficiency (ARE)
• Can compare estimators based on their asymptotic variance.
• Assume that for two estimators θ̂1,n and θ̂2,n, we have

√n(θ̂i,n − µ(θ)) →d N(0, σi²(θ)), i = 1, 2.

• For large n, the variance of θ̂i,n is ≈ σi²(θ)/n.
• The relative efficiency of θ̂1,n with respect to θ̂2,n can be measured by the ratio of the numbers of samples required to achieve the same asymptotic variance (i.e., error):

σ1²(θ)/n1 = σ2²(θ)/n2 =⇒ ARE_θ(θ̂1, θ̂2) = n2/n1 = σ2²(θ)/σ1²(θ)

If the above ARE > 1, then we prefer θ̂1 over θ̂2.
170 / 218
Example 56
• Xi iid∼ fX with mean = (unique) median = θ, and variance 1.
• By CLT, we have √n(X̄n − θ) →d N(0, 1).
• Sample median: Zn = median(X1, . . . , Xn) = X_(⌈n/2⌉).
• Can show

√n(Zn − θ) →d N(0, 1/(4[fX(θ)]²))

• Consider the normal location family: Xi iid∼ N(θ, 1).
• fX(θ) = φ(0) = 1/√(2π), where φ is the density of the standard normal.
• Hence, σ²_{Zn}(θ) = π/2.
• ARE of sample mean relative to median:

σ²_{Zn}(θ)/σ²_{X̄n}(θ) = π/2 ≈ 1.57

• In the normal family, we prefer the mean, since the median requires roughly 1.57 times more samples to achieve the same accuracy.
171 / 218
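The ARE ≈ π/2 claim of Example 56 can be checked by simulation (a sketch, not from the slides; the odd sample size, repetitions, and seed are illustrative): the scaled variances n·var(mean) and n·var(median) should be close to 1 and π/2.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 1001, 20_000  # odd n: the sample median is a single order statistic

x = rng.standard_normal((reps, n))  # N(0,1): mean = median = 0
var_mean = n * x.mean(axis=1).var()
var_med = n * np.median(x, axis=1).var()

# Asymptotics predict n*var(mean) -> 1 and n*var(median) -> pi/2,
# so the ratio should be close to pi/2 ~ 1.5708.
are = var_med / var_mean
```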
Confidence intervals
An alternative to point estimators which provides a measure of our uncertainty or confidence. Recall X1, . . . , Xn iid∼ Pθ0.

Definition 17
A (1 − α)-confidence set for θ0 is a random set S = S(X1, . . . , Xn) such that Pθ0(θ0 ∈ S) ≥ 1 − α.

• Trade-off between the size of the set S and its coverage probability Pθ0(θ0 ∈ S).
• Want to minimize size while maintaining a lower bound on coverage probability.
• Usually CIs are built based on pivots:
• Functions of the data and the parameter whose distribution is free of the parameter.

Example 57 (Normal family, known variance)
• Xi iid∼ N(µ, σ²), then Z = (X̄n − µ)/(σ/√n) ∼ N(0, 1).
• Let z_{α/2} be such that P(Z ≥ z_{α/2}) = α/2. Then

P(|√n(X̄n − µ)/σ| ≤ z_{α/2}) = 1 − α ⇐⇒ P(µ ∈ [X̄n ± (σ/√n) z_{α/2}]) = 1 − α.
172 / 218
Example 58 (Normal family, unknown variance)
• Xi ∼ N(µ, σ²).
• Z = (X̄n − µ)/(σ/√n) ∼ N(0, 1).
• V = (n − 1)Sn²/σ² ∼ χ²_{n−1}, where Sn² = (1/(n−1)) ∑_{i=1}^n (Xi − X̄n)².
• Hence, T := Z/√(V/(n − 1)) ∼ t_{n−1} (Student's t distribution).
• Let t_{n−1}(α/2) be such that P(|T| ≥ t_{n−1}(α/2)) = α.
• X̄n ± t_{n−1}(α/2) Sn/√n is an exact (1 − α) confidence interval.
173 / 218
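The exactness of the t-interval in Example 58 can be verified by simulation (a sketch, not from the slides). The critical value 2.1448 is the standard two-sided 95% t quantile for 14 degrees of freedom (a table value, stated here as an assumption rather than computed); all other parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 5.0, 2.0, 15, 50_000
t_crit = 2.1448  # t_{14}(alpha/2) for alpha = 0.05 (standard table value)

x = rng.normal(mu, sigma, (reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)  # (n-1)-denominator S_n from the slide
half = t_crit * s / np.sqrt(n)

# Exact 95% interval: empirical coverage should be 0.95 up to MC error.
coverage = np.mean((xbar - half <= mu) & (mu <= xbar + half))
```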
Asymptotic CIs
Definition 18
An asymptotic (1 − α)-confidence set for θ0 is a random set S = S(X1, . . . , Xn) such that Pθ0(θ0 ∈ S) → 1 − α as n → ∞.

Example 59
• If √n(Tn − θ0) →d N(0, σ²(θ0)), then, assuming σ(·) is continuous,

Tn ± √(σ²(Tn)/n) z_{α/2}

is an asymptotic C.I. at level 1 − α.
• Why? Since Tn →p θ0, by the CM theorem, σ(θ0)/σ(Tn) →p 1. By Slutsky's lemma,

√(n/σ²(Tn)) (Tn − θ0) = √(σ²(θ0)/σ²(Tn)) · √(n/σ²(θ0)) (Tn − θ0) →d N(0, 1).
174 / 218
Asymptotic CI for MLE
Example 60 (Asym. CI for MLE based on Fisher info)
• Recall that under regularity, √n(θ̂n − θ0) →d N(0, 1/I(θ0)), or

√(nI(θ0)) (θ̂n − θ0) →d N(0, 1).

• Assuming I(·) is continuous, applying the previous example,

√(nI(θ̂n)) (θ̂n − θ0) →d N(0, 1).

• Hence, the following is an asymptotic (1 − α)-CI for θ0:

θ̂n ± z_{α/2}/√(nI(θ̂n))
175 / 218
Example 61 (Asym. CI based on empirical Fisher Info.)
• Let ℓn(θ) = ∑_{i=1}^n log pθ(Xi) and I(θ) = Eθ[−(∂²/∂θ²) log pθ(X1)].
• One can consider −(1/n) ℓ̈n(θ) as the empirical version of I(θ).
• (It is an unbiased and consistent estimate. I(θ) is the Fisher information based on a sample of size 1.)
• By the same argument as in the AN theorem, −(1/n) ℓ̈n(θ̂n) →p I(θ0).
• It follows that

√(−ℓ̈n(θ̂n)/(nI(θ0))) →p 1

• By Slutsky's lemma,

√(−ℓ̈n(θ̂n)/(nI(θ0))) · √(nI(θ0)) (θ̂n − θ0) →d N(0, 1)

• In other words,

√(−ℓ̈n(θ̂n)) (θ̂n − θ0) →d N(0, 1).

• Hence, the following is an asymptotic (1 − α)-CI for θ0:

θ̂n ± z_{α/2}/√(−ℓ̈n(θ̂n))
176 / 218
Variance-stabilizing transform
• Assume √n(Tn − θ) →d N(0, σ²(θ)).
• By the delta method, √n[f(Tn) − f(θ)] →d N(0, [f′(θ)]² σ²(θ)).
• We can choose f so that [f′(θ)]² σ²(θ) = C, a constant.
• Good for building asymptotic pivots.

Example 62
• Xi iid∼ Poi(θ). Note Eθ[Xi] = varθ[Xi] = θ.
• By CLT, √n(X̄n − θ) →d N(0, θ).
• Take f′(θ) = 1/√θ. This can be realized by f(θ) = 2√θ, hence

2√n(√X̄n − √θ) →d N(0, 1)

• Asymptotic CI for √θ of level 1 − α: (√X̄n ± z_{α/2}/(2√n)).
• Compare with the standard asymptotic CI for θ: (X̄n ± √(X̄n/n) z_{α/2}).
177 / 218
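A numerical illustration of Example 62 (a sketch, not from the slides; the values of θ, n, and the seed are illustrative): before the transform, the variance of √n(X̄n − θ) depends on θ; after f(t) = 2√t it is approximately 1 for every θ.

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 400, 20_000
raw_vars, stab_vars = [], []

for theta in (2.0, 8.0):
    x = rng.poisson(theta, (reps, n))
    xbar = x.mean(axis=1)
    # Un-stabilized: var of sqrt(n)(Xbar - theta) is ~ theta (theta-dependent).
    raw_vars.append((np.sqrt(n) * (xbar - theta)).var())
    # Stabilized: var of 2 sqrt(n)(sqrt(Xbar) - sqrt(theta)) is ~ 1 for all theta.
    stab_vars.append((2 * np.sqrt(n) * (np.sqrt(xbar) - np.sqrt(theta))).var())
```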
Hypothesis testing
• Recall the decision theory framework:
• Probabilistic model {Pθ : θ ∈ Ω} for X ∈ X.
• Special case: Ω is partitioned into two disjoint sets Ω0 and Ω1.
• Want to decide which piece θ belongs to.
• Can form an estimate θ̂ for θ and output 1{θ̂ ∈ Ω1}.
• A general principle:
  Do not estimate more than what you care about.
  The more complex the model, the more potential for fitting to noise.
178 / 218
• We want to test

H0 : θ ∈ Ω0 (null) versus H1 : θ ∈ Ω1 (alternative)

• A non-random test can be specified with a critical region S ⊂ X as

δ(X) = 1{X ∈ S}.

• When δ(X) = 1, we have accepted H1, or "rejected H0".
• The power function of the test is given by

β(θ) = Pθ(X ∈ S) = Eθ[δ(X)]

• We would like β(θ) ≈ 1{θ ∈ Ω1}.
• This cannot be achieved exactly, so we settle for a trade-off. Define

significance level: α = sup_{θ∈Ω0} β(θ)
power of the test: β = inf_{θ∈Ω1} β(θ)

• Neyman–Pearson framework: maximize β subject to a fixed α.
179 / 218
• Often we need to consider a randomized test, in which case we interpret

δ(x) = P(accept H1 | X = x)

• The power function β(θ) = Eθ[δ(X)] still gives the probability of accepting H1, by the smoothing property.
180 / 218
Simple hypothesis test
• Ω0 = {θ0} and Ω1 = {θ1}.
• The Neyman–Pearson criterion reads: fix α and solve

sup_δ Eθ1[δ(X)] s.t. Eθ0[δ(X)] ≤ α.

The solution is the most powerful (MP) test of significance level at most α.
• Neyman–Pearson lemma: most power is achieved by a likelihood ratio test (LRT),

δ(X) = 1{L(X) > τ} + γ 1{L(X) = τ}, L(x) := pθ1(x)/pθ0(x).

• Sometimes write 1{pθ1(x) ≥ τ pθ0(x)} to avoid division by zero.
• For simplicity, write p0 = pθ0 and p1 = pθ1.
• So we write L(x) = p1(x)/p0(x), for example.
181 / 218
Informal proof
• For simplicity drop the dependence on X: δ = δ(X) and L = L(X).
• Introduce a Lagrange multiplier, and solve the unconstrained problem:

δ* = argmax_δ [E1(δ) + λ(α − E0(δ))] = argmax_δ [E1(δ) − λE0(δ)]

• Recall the change-of-measure formula (note L = p1/p0):

E1[δ] = ∫ δ p1 dµ = ∫ δ L p0 dµ = E0[δL].

• The problem reduces to

δ* = argmax_δ E0[δ(L − λ)]

• The optimal solution is

δ* = 1 if L > λ, 0 if L < λ,

which is a likelihood ratio test.
182 / 218
Theorem 14 (Neyman–Pearson Lemma)
Consider the family of (randomized) likelihood ratio tests

δ_{t,γ}(x) = 1 if p1(x) > t p0(x), γ if p1(x) = t p0(x), 0 if p1(x) < t p0(x).

The following hold:
(a) For every α ∈ [0, 1], there are t, γ such that E0[δ_{t,γ}(X)] = α.
(b) If an LRT satisfies E0[δ_{t,γ}(X)] = α, then it is most powerful (MP) at level α.
(c) Any MP test at level α can be written as an LRT.

• Part (a) follows by looking at g(t) = P0(L(X) > t) = 1 − F_Z(t), where Z = L(X). g is non-increasing and right-continuous, etc. (Draw a picture.)
183 / 218
Proof of Neyman-Pearson Lemma
• For part (b), let δ* be the LRT with significance level α.
• Let δ be any other rule satisfying E0[δ(X)] ≤ α = E0[δ*(X)].
• For all x (consider the three possibilities),

δ(x)[p1(x) − t p0(x)] ≤ δ*(x)[p1(x) − t p0(x)]

• Integrate w.r.t. x:

E1[δ(X)] − t E0[δ(X)] ≤ E1[δ*(X)] − t E0[δ*(X)]

or

E1[δ(X)] − E1[δ*(X)] ≤ t(E0[δ(X)] − E0[δ*(X)]) ≤ 0

• Conclude that E1[δ(X)] ≤ E1[δ*(X)].
• Part (c), left as an exercise.
184 / 218
Example 63
• Consider X ∼ N(θ, 1) and the two hypotheses

H0 : θ = θ0 versus H1 : θ = θ1

• The likelihood ratio is

L(x) = p1(x)/p0(x) = exp[−(x − θ1)²/2] / exp[−(x − θ0)²/2]

• The LRT rejects H0 if L(x) > t. Equivalently,

log L(x) > log t ⇐⇒ −(x − θ1)²/2 + (x − θ0)²/2 > log t
⇐⇒ x(θ1 − θ0) + (θ0² − θ1²)/2 > log t
⇐⇒ x · sign(θ1 − θ0) > [log t − (θ0² − θ1²)/2] / |θ1 − θ0| =: τ

• Assume θ1 > θ0. Then the test is equivalent to x > τ.
• We set τ by requiring P0(X > τ) = α. This gives τ = θ0 + Q^{−1}(α).
185 / 218
Power calculation (Previous example continued)
• Q(x) = 1 − Φ(x), where Φ is the CDF of the standard normal distribution.
• The power is (since X − θ1 ∼ N(0, 1) under P1)

β = P1(X > τ) = P1(X − θ1 > τ − θ1) = Q(τ − θ1)

• Plugging in τ, we have β = Q(−δ + Q^{−1}(α)), where δ = θ1 − θ0.
• The plot of β versus α is the ROC curve of the test.
• ROC = Receiver Operating Characteristic.
• See next slide.
• Alternatively, can plot the parametric curve β = Q(τ − θ1), α = Q(τ − θ0), where the parameter τ varies in R.
• The ROC of no test can go above this curve (by the Neyman–Pearson lemma).
186 / 218
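The size and power formulas of Example 63 can be checked by simulation (a sketch, not from the slides). The quantile Q⁻¹(0.05) = 1.6449 is the standard normal table value, stated as an assumption rather than computed; the parameters and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, theta0, theta1, reps = 0.05, 0.0, 2.0, 200_000

# tau = theta0 + Q^{-1}(alpha), with Q^{-1}(0.05) = 1.6449 (normal table value).
q_inv = 1.6449
tau = theta0 + q_inv

# Simulated size under H0 and power under H1 for the test 1{X > tau}.
# Theory: size = alpha, power = Q(tau - theta1) = Q(-0.3551) ~ 0.639.
size = np.mean(rng.normal(theta0, 1.0, reps) > tau)
power = np.mean(rng.normal(theta1, 1.0, reps) > tau)
```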
ROC curve
[Figure: ROC curve, β versus α, for the normal mean test.]
187 / 218
Composite hypothesis testing
• Often want to test H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω1 where Ω0 ∩ Ω1 = ∅.
Example 64
Testing whether a coin is fair or not. Here Ω0 = {1/2} and Ω1 = [0, 1] \ {1/2}.

Definition 19
A test δ of size α is uniformly most powerful (UMP) at level α if

∀ tests φ of level ≤ α, ∀θ ∈ Ω1, βδ(θ) ≥ βφ(θ).

• UMP tests do not always exist (in fact, they often don't).
188 / 218
Example 65 (Coin flipping continued)
• Observe X ∼ Bin(n, θ).
• Consider testing H0 : θ = 1/2 versus H1 : θ = θ1 based on X.
• The most powerful test is an LRT, based on

L(x) = θ1^x (1 − θ1)^{n−x} / [(1/2)^x (1/2)^{n−x}] = (θ1/(1 − θ1))^x ((1 − θ1)/(1/2))^n

• The nature of the test changes based on whether θ1 < 1/2 or θ1 > 1/2:

θ1 < 1/2 =⇒ log[θ1/(1 − θ1)] < 0 =⇒ reject H0 when x < τ
θ1 > 1/2 =⇒ log[θ1/(1 − θ1)] > 0 =⇒ reject H0 when x > τ

• This suggests that a special structure is needed for the existence of a UMP test.
189 / 218
Definition 20
A family P = {pθ(x) : θ ∈ Ω} of densities has a monotone likelihood ratio (MLR) in T(x) if for θ0 < θ1, the LR L(x) = pθ1(x)/pθ0(x) is a non-decreasing function of T(x).

For example, in the coin flipping problem, the model has MLR in T(X) = X.

Example 66 (1-D exponential family)
• Consider pθ(x) = h(x) exp[η(θ)T(x) − A(θ)]. The LR is

L(x) = pθ1(x)/pθ0(x) = exp[(η(θ1) − η(θ0))T(x) − A(θ1) + A(θ0)].

• If η is monotone (e.g., θ0 ≤ θ1 =⇒ η(θ0) ≤ η(θ1)), then the family has MLR in T(x) or −T(x).
• Includes the Bernoulli (or binomial) example above, with η(θ) = log[θ/(1 − θ)].
• Other cases: normal location family, Poisson and exponential.
190 / 218
Example 67 (Non-exponential family)
• Xi iid∼ U[0, θ], i = 1, . . . , n.
• pθ(x) = θ^{−n} 1{x(n) ≤ θ} 1{x(1) ≥ 0}, and

L(x) = (θ1/θ0)^n · 1{x(n) ≤ θ1}/1{x(n) ≤ θ0} = (θ1/θ0)^n for x(n) ∈ [0, θ0), ∞ for x(n) ∈ [θ0, θ1).

• Consider θ1 > θ0.
• For x(n) ∈ (0, θ1), L(x) is increasing in x(n) (it transitions from (θ1/θ0)^n to ∞).
191 / 218
Theorem 15 (UMP for one-sided problems)
• Let P be a family with MLR in T(x).
• Consider the one-sided test H0 : θ ≤ θ0 versus H1 : θ > θ0.
• Then, δ(x) = 1{T(x) > C} + γ 1{T(x) = C}, for γ, C such that βδ(θ0) = α, is UMP of size α.

• Take θ1 > θ0 and let L_{θ1,θ0}(x) = pθ1(x)/pθ0(x) be the corresponding LR.
• By the MLR property, the given test is an LR test, i.e.,

δ(x) = 1{L_{θ1,θ0}(x) > C′} + γ 1{L_{θ1,θ0}(x) = C′}

for some constant C′ = C′(θ1, θ0).
• Since βδ(θ0) = α, by the Neyman–Pearson lemma, δ is MP for testing H0 : θ = θ0 versus H1 : θ = θ1.
• Since θ1 > θ0 was arbitrary, δ is UMP for θ = θ0 versus θ > θ0.
• The last piece to check is βδ(θ) ≤ α for θ < θ0. (Exercise.)
192 / 218
Example 68 (Non-exponential family (continued))
• Xi iid∼ U[0, θ], i = 1, . . . , n.
• The family has MLR in X(n).
• δ(X) = 1{X(n) ≥ t} is UMP for the one-sided test of θ > θ0 against θ ≤ θ0.
• To set the threshold, consider

g(t) = 1 − Pθ0(X(n) ≤ t) = 1 − ∏i Pθ0(Xi ≤ t) = 1 − (t/θ0)^n for t ≤ θ0, 0 for t > θ0,

which is a continuous function.
• Solving g(t) = α gives t = (1 − α)^{1/n} θ0.
• Similarly, the power function is

β(θ) = Pθ(X(n) > t) = [1 − (t/θ)^n]₊

which holds for all θ > 0.
• For the UMP test, we have β(θ) = [1 − (1 − α)(θ0/θ)^n]₊.
193 / 218
• Plots of the power function for the U[0, θ] example for various n (θ0 = 1, α = 0.2).

[Figure: power functions β(θ) for n = 2, 5, 10, 20, 100.]
194 / 218
• Plots of the power function for the U[0, θ] example for various n (θ0 = 1, α = 0.05).

[Figure: power functions β(θ) for n = 2, 5, 10, 20, 100.]
195 / 218
• ROC plots for the U[0, θ] example for various n (θ0 = 1, θ = 1.1).

[Figure: ROC curves for n = 2, 5, 10 and the random (chance) test, θ = 1.1.]
196 / 218
• ROC plots for the U[0, θ] example for various n (θ0 = 1, θ = 2).

[Figure: ROC curves for n = 2, 5, 10 and the random (chance) test, θ = 2.0.]
197 / 218
Generalized likelihood ratio test (GLRT)
• Consider testing H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω1.
• In the absence of UMP tests, a natural extension of the LRT is the following generalized LRT:

L(x) = sup_{θ∈Ω} pθ(x) / sup_{θ∈Ω0} pθ(x) = pθ̂(x)/pθ̂0(x) ∈ [1, ∞]

where Ω = Ω0 ∪ Ω1.
• θ̂ is the unconstrained MLE, whereas θ̂0 is the constrained MLE, the maximizer of θ 7→ pθ(x) over θ ∈ Ω0.
• A GLRT rejects H0 if L(x) > λ for some threshold λ.
• Alternatively, the GLRT can be written as

δ(x) = 1{Λn(x) > τ} + γ 1{Λn(x) = τ}, Λn(x) = 2 log L(x)

• The threshold τ is set as usual by solving sup_{θ∈Ω0} Eθ[δ(X)] = α.
198 / 218
• Why is this a reasonable test?
• Assume γ = 0: no randomization is needed.
• When θ ∈ Ω0, both θ̂ and θ̂0 approach θ as n → ∞.
• Hence, L(x) ≈ 1 when θ ∈ Ω0.
• However, when θ ∈ Ω1, the unconstrained MLE θ̂ approaches θ as n → ∞, while θ̂0 does not. This is because θ̂0 ∈ Ω0 and θ ∈ Ω1, and Ω0 and Ω1 are disjoint.
• It follows that L(x) > 1 when θ ∈ Ω1 (in fact, usually L(x) ≫ 1).
• By thresholding L(x) at some λ > 1, we can tell the two hypotheses apart.
199 / 218
Example 69
• Xi iid∼ N(µ, σ²), both µ and σ² unknown. Let θ = (µ, σ²). Take

Ω = {(µ, σ²) | µ ∈ R, σ² > 0}, Ω0 = {θ0}, for θ0 = (µ0, σ0²) = (0, 1).

• Want to test Ω0 against Ω1 = Ω \ Ω0.

sup_{θ∈Ω0} pθ(x) = pθ0(x) = (2π)^{−n/2} exp(−(1/2) ∑i xi²)

sup_{θ∈Ω} pθ(x) = pθ̂(x) = (2πσ̂²)^{−n/2} exp(−(1/(2σ̂²)) ∑i (xi − µ̂)²)

where θ̂ = (µ̂, σ̂²), with µ̂ = (1/n) ∑i xi and σ̂² = (1/n) ∑i (xi − µ̂)², is the MLE.
200 / 218
Example continued
• The GLR is

L(x) = pθ̂(x)/pθ0(x) = (σ̂²)^{−n/2} exp[(1/2) ∑i xi² − (1/(2σ̂²)) ∑i (xi − µ̂)²]

where the second sum in the exponent equals n/2. The GLRT rejects H0 when L(x) > tα, where Pθ0(L(x) > tα) = α.
• Alternatively, threshold

Λn(x) = 2 log L(x) = −n log σ̂² + ∑i xi² − n.

• We will see that Λn in general has an asymptotic χ²_r distribution, where r is the difference between the dimensions of the full (Ω) versus null (Ω0) parameter spaces.
• Thus, we expect Λn in this problem to have an asymptotic χ²₂ distribution under the null θ = θ0. (It is instructive to try to show this directly.)
201 / 218
Asymptotics of GLRT
• Consider Ω ⊂ R^d, open, and let r ≤ d. Take the null hypothesis to be

Ω0 = {θ ∈ Ω : θ1 = θ2 = · · · = θr = 0} = {θ ∈ Ω : θ = (0, . . . , 0, θ_{r+1}, . . . , θ_d)}.

• Note that Ω0 is a (d − r)-dimensional subset of Ω.

Theorem 16
Under the same assumptions guaranteeing asymptotic normality of MLEs,

Λn = 2 log L(X) →d χ²_r, under H0.

Degrees of freedom r = d − (d − r), that is, the difference in the local dimensions of the full and null parameter sets.
202 / 218
• Recall ℓθ(X) = log pθ(X) and let Mn(θ) = (1/n) ∑i ℓθ(Xi).
• We have Λn = −2n[Mn(θ̂0,n) − Mn(θ̂n)].
• By Taylor expansion around θ̂n (the unrestricted MLE), for some θ̃n ∈ [θ̂n, θ̂0,n],

Mn(θ̂0,n) − Mn(θ̂n) = [Ṁn(θ̂n)]ᵀ(θ̂0,n − θ̂n) + (1/2)(θ̂0,n − θ̂n)ᵀ M̈n(θ̃n)(θ̂0,n − θ̂n)

• Since θ̂n is the MLE, we have Ṁn(θ̂n) = 0, assuming θ̂n ∈ int(Ω).
• By the same uniform-convergence arguments, θ̃n = θ0 + op(1) implies
• M̈n(θ̃n) = M̈n(θ0) + op(1) = −Iθ0 + op(1),
• where the last equality holds because M̈n(θ0) →p Eθ0[ℓ̈θ0] = −Iθ0 by the WLLN.
203 / 218
• Assuming that √n(θ̂0,n − θ̂n) = Op(1), we obtain

Λn = [√n(θ̂0,n − θ̂n)]ᵀ Iθ0 √n(θ̂0,n − θ̂n) + op(1).

• Asymptotically, the GLR measures a particular (squared) distance between θ̂0,n and θ̂n, one which weighs different directions differently, according to the eigenvectors of I(θ0).
• More specifically, let ‖z‖_Q := √(zᵀQz) = ‖Q^{1/2}z‖₂, which defines a norm when Q ≻ 0. Then,

Λn = ‖√n(θ̂0,n − θ̂n)‖²_{Iθ0} + op(1).

• In the simple case where Ω0 = {θ0}, we have √n(θ̂n − θ0) →d N(0, Iθ0^{−1}).
• Equivalently, √n(θ̂n − θ0) →d Iθ0^{−1/2} Z, where Z ∼ N(0, I_d).
• It follows from the CM theorem (since z 7→ ‖z‖_Q is continuous) that

Λn →d ‖Iθ0^{−1/2} Z‖²_{Iθ0} = ‖Iθ0^{1/2} Iθ0^{−1/2} Z‖₂² = ‖Z‖₂².

• Since ‖Z‖₂² = ∑_{i=1}^d Zi² ∼ χ²_d, this proves the case Ω0 = {θ0}.
• The proof of the general case is more complicated and is omitted.
204 / 218
Example 70 (Multinomial: testing uniformity)
• (X1, . . . , Xd) ∼ Multinomial(n, θ), where θ = (θ1, . . . , θd) is a probability vector, that is,

θ ∈ Ω = {θ ∈ R^d : θi ≥ 0, ∑i θi = 1}

• Xj counts how many of the n objects fall into category j.

pθ(x) = (n choose x1, . . . , xd) ∏_{i=1}^d θi^{xi} ∝ ∏_{i=1}^d θi^{xi}.

• Would like to test Ω0 = {θ0}, where θ0 = (1/d, . . . , 1/d), versus Ω1 = Ω \ Ω0.
• The MLE over Ω is given by θ̂i = xi/n. Deriving it requires techniques for constrained optimization, such as Lagrange multipliers, since Ω itself is constrained. (Exercise.)
205 / 218
• We obtain

Λn = 2 log [pθ̂(x)/pθ0(x)] = 2 log ∏_{i=1}^d (θ̂i/(θ0)i)^{xi} = 2 ∑_{i=1}^d xi log[θ̂i/(θ0)i]
   = 2n ∑i θ̂i log[θ̂i/(θ0)i] = 2n D(θ̂‖θ0)

• Both θ̂ and θ0 are probability vectors; D(θ̂‖θ0) is their KL divergence.
• GLRT does a sensible thing: reject the null if θ̂ is far from θ0 in KL divergence.
• Our asymptotic theory implies: Λn →d χ²_{d−1} under the null, i.e., θ = θ0, since Ω is (d − 1)-dimensional and Ω0 is 0-dimensional. This is a fairly non-trivial result.
206 / 218
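The χ²_{d−1} limit of Example 70 can be checked by simulation (a sketch, not from the slides; d, n, repetitions, and seed are illustrative): under the uniform null, Λn = 2n D(θ̂‖θ0) should have approximately the mean (d − 1) and variance 2(d − 1) of a χ²_{d−1} distribution.

```python
import numpy as np

rng = np.random.default_rng(10)
d, n, reps = 5, 2000, 20_000
theta0 = np.full(d, 1 / d)

# Simulate Lambda_n = 2n * D(theta_hat || theta0) under the uniform null.
counts = rng.multinomial(n, theta0, size=reps)
theta_hat = counts / n
with np.errstate(divide="ignore", invalid="ignore"):
    # Convention 0*log(0) = 0 for any empty categories.
    terms = np.where(theta_hat > 0, theta_hat * np.log(theta_hat / theta0), 0.0)
lam = 2 * n * terms.sum(axis=1)

# chi^2_{d-1} with d-1 = 4 df has mean 4 and variance 8.
emp_mean, emp_var = lam.mean(), lam.var()
```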
p-values
• Consider a family of tests {δα, α ∈ (0, 1)} indexed by their level α.
• Assume: α 7→ δα(x) is nondecreasing and right-continuous.
• E.g., if δα(x) = 1{x ∈ C(α)}, then C(α1) ⊆ C(α2) if α1 ≤ α2.
• The p-value, or attained significance, for observed x is defined as

p(x) := inf{α : δα(x) = 1}

• Note that p = p(X) is a random variable. We have

p(X) ≤ α ⇐⇒ δα(X) = 1

p ≤ α implies 1 = δp ≤ δα, since the infimum is attained by the assumptions on α 7→ δα, hence δα = 1. The other direction follows from the definition of inf.
• This implies P0(p(X) ≤ α) = P0(δα(X) = 1) = α.
• That is, p = p(X) ∼ U[0, 1] under the null.
207 / 218
Example 71 (Normal example continued)
• Consider X ∼ N(θ, 1) and H0 : θ = θ0 versus H1 : θ = θ1.
• The MP test at level α is δα(X) = 1{X − θ0 ≥ Q^{−1}(α)}.
• Alternatively, δα(X) = 1{Q(X − θ0) ≤ α}, since Q is decreasing.
• The p-value is

p(X) = inf{α : Q(X − θ0) ≤ α} = Q(X − θ0)  (9)

• Under the null, X − θ0 ∼ N(0, 1), hence Φ(X − θ0) ∼ U[0, 1] (why?).
• Then, p(X) = 1 − Φ(X − θ0) ∼ U[0, 1], as expected.
• Exercise: verify that under H1 : θ = θ1, the CDF of p(X) is

P1(p(X) ≤ t) = Q(−δ + Q^{−1}(t)),

where δ = θ1 − θ0. Note that this curve is the same as the ROC.
• Recall Q(t) = 1 − Φ(t), where Φ is the CDF of N(0, 1).
208 / 218
• Can verify that definition (9) produces the "common" definition of p-values, say when δα(X) = 1{T ≥ τα} or δα(X) = 1{|T| ≥ τα}.

Example 72
• Consider the two-sided test and let G(t) = P0(|T| ≥ t).
• Assume that G is continuous, hence invertible (both G and G^{−1} are decreasing).
• Requiring level α: G(τα) = P0(|T| ≥ τα) = α =⇒ τα = G^{−1}(α).
• This gives δα(X) = 1{|T| ≥ G^{−1}(α)} = 1{G(|T|) ≤ α}.
• By definition (9), p = G(|T|), which is the common definition.
209 / 218
Multiple Hypothesis testing
• We have a collection of null hypotheses H0,i , i = 1, . . . , n.
Example 73 (Basic example)
• Testing in the normal means model
yi ∼ N(µi , 1), i = 1, . . . , n
and H0,i : µi = 0.
• yi could be the expression level (or state) of gene i .
• H0,i means that gene i has no effect on the disease under consideration.
• Suppose that for each H0,i , we have a test, hence a p-value pi .
• Assume under H0,i : pi ∼ U[0, 1].
• A test that rejects H0,i when pi ≤ α is of size α under the ith null.
210 / 218
Testing global null
• The global null is H0 = ⋂ⁿᵢ₌₁ H0,i.
• Want to combine p1, . . . , pn to build a test of level α for H0.
• Can use 1{pi ≤ α} for a fixed i, but we get better power if we use all of them.
• Benferroni’s test for global null:
Reject H0 if mini
pi ≤α
n
• By the union bound (no independence needed),

PH0(reject H0) = PH0(⋃ⁿᵢ₌₁ {pi ≤ α/n}) ≤ ∑ⁿᵢ₌₁ PH0(pi ≤ α/n) = α
• Exercise: Assuming the pi are independent under H0, the exact size of Bonferroni's test is 1 − (1 − α/n)ⁿ → 1 − exp(−α) as n → ∞. Thus for large n and small α, size ≈ 1 − exp(−α) ≈ α, hence the union bound is not bad in this case.
211 / 218
Fisher test for global null
• Fisher combination test:
Reject H0 if Tn := −2 ∑ⁿᵢ₌₁ log pi > χ²₂ₙ(1 − α)
Lemma 7
If p1, . . . , pn are independent U[0, 1], then Tn ∼ χ²₂ₙ.
• Thus, assuming independence under H0, Fisher test has the correct size.
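Lemma 7 can be sanity-checked by simulation: each −2 log pi is Exp with mean 2, i.e. χ²₂, so Tn should have mean 2n and variance 4n. A sketch (n = 20 and the replication count are arbitrary):

```python
import math
import random

random.seed(1)
n, reps = 20, 50_000
# T = -2 * sum_i log(p_i) with p_i iid U[0,1]; each -2*log(p_i) ~ chi^2_2,
# so T ~ chi^2_{2n}, with mean 2n and variance 4n
T = [-2 * sum(math.log(random.random()) for _ in range(n)) for _ in range(reps)]
mean = sum(T) / reps
var = sum((t - mean) ** 2 for t in T) / reps
print(mean, var)  # should be close to 2n = 40 and 4n = 80
```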
212 / 218
Simes test for global null
• Simes test:
Reject H0 if Tn := minᵢ (n/i) p(i) ≤ α

where p(1) ≤ p(2) ≤ · · · ≤ p(n) are the order statistics of p1, . . . , pn.
Lemma 8
If p1, . . . , pn are independent U[0, 1], then Tn ∼ U[0, 1].
• (Independence can be relaxed.)
• Thus, assuming independence under H0, Simes test has the correct size.
• Equivalent form of Simes test:
Reject H0 if p(i) ≤ (i/n) α for some i

• Less conservative than Bonferroni's test, which rejects only if p(1) ≤ α/n.
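A minimal sketch of the Simes statistic (the function name is mine); the example shows a case where Simes rejects but Bonferroni does not:

```python
def simes_stat(pvals):
    # T = min_i (n / i) * p_(i), over the sorted p-values p_(1) <= ... <= p_(n)
    n = len(pvals)
    return min(n * p / i for i, p in enumerate(sorted(pvals), start=1))

alpha = 0.05
p = [0.03, 0.03, 0.03]
print(simes_stat(p) <= alpha)    # Simes rejects: min(0.09, 0.045, 0.03) = 0.03
print(min(p) <= alpha / len(p))  # Bonferroni does not: 0.03 > 0.05/3
```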
213 / 218
Testing individual hypotheses
• In the gene expression example, we care about which genes are null/not null. We would like to test H0,i : µi = 0 versus H1,i : µi = 1, for all i.
• The problem has resemblance to classification.
• Interested in how we are doing in an aggregate sense.
• We can think of having a decision problem
p ∼ Pθ, where θ ∈ {0, 1}ⁿ. (10)
• θi = 1 iff H0,i is true.
• The global null corresponds to θ = 1 (the all-ones vector).
• Minimal assumption: When θi = 1, we have pi ∼ U[0, 1], i.e., the ith marginal of Pθ is uniform.
214 / 218
Terminology (shared with classification)
• Confusion matrix: Count how many combinations we have.
• For example, if θ̂ ∈ {0, 1}ⁿ is our guess for the hypotheses:

TP = (1/n) ∑ⁿᵢ₌₁ 1{θ̂i = 1, θi = 1}

             positive (1)   negative (0)   total
             accepted       rejected
true (1)         TP             TN           T
false (0)        FP             FN           F
total             P              N           n
• True here means H0,i is true.
215 / 218
• Alternative notation:

             positive (1)   negative (0)   total
             accepted       rejected
true (1)          U              V          n0
false (0)         T              S          n − n0
total           n − R            R          n
• Familywise error rate (FWER):
FWERθ = Pθ(V ≥ 1)
• A much less stringent criterion is the False Discovery Rate (FDR).
• Consider the false discovery proportion (a random variable)

FDP = V / max(R, 1) = (V/R) 1{R > 0}
• FDR is the expectation of FDP:
FDRθ = Eθ[FDP]
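The FDP is straightforward to compute from the confusion counts. A sketch using the slide's convention (θi = 1 marks a true null) together with a per-hypothesis rejection indicator:

```python
def fdp(null_true, rejected):
    # null_true[i] = 1 if H0,i is true; rejected[i] = 1 if H0,i is rejected
    R = sum(rejected)                                    # total rejections
    V = sum(t * r for t, r in zip(null_true, rejected))  # true nulls rejected
    return V / max(R, 1)

# 2 true nulls and 2 non-nulls; 3 rejections, 1 of them a true null
print(fdp([1, 1, 0, 0], [1, 0, 1, 1]))  # V/R = 1/3
```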
216 / 218
• Controlling FWER in a strong sense: control for all θ ∈ {0, 1}ⁿ.
Theorem 17
Benferroni’s approach, where we test each hypothesis at level α/n, controlsFWER at level α in a strong sense. In fact
Eθ[V ] ≤ n0
nα, ∀θ ∈ 0, 1n.
• E[V ] =∑n
i=1 P(V ≥ i) which holds for any nonnegative discrete variable.
• Hence, E[V ] ≥ P(V ≥ 1).
• Let Vi = 1{H0,i is true but rejected} = 1{θi = 1, θ̂i = 0}, so that

Eθ[Vi] = 1{θi = 1} Pθ(θ̂i = 0)

• Since V = ∑ᵢ Vi,

Eθ[V] = ∑ᵢ Eθ[Vi] = ∑_{i : θi = 1} Pθ(θ̂i = 0) ≤ ∑_{i : θi = 1} α/n = (n0/n) α

(Here θ̂i is based only on pi.)
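The bound Eθ[V] ≤ (n0/n)α holds with equality when the null p-values are exactly uniform, which a quick simulation illustrates (the split of 30 true nulls / 20 non-nulls and α = 0.1 are arbitrary choices of mine):

```python
import random

random.seed(2)
n, alpha, reps = 50, 0.1, 20_000
theta = [1] * 30 + [0] * 20   # theta_i = 1 marks a true null; n0 = 30
n0 = sum(theta)
total_V = 0
for _ in range(reps):
    # true nulls get p_i ~ U[0,1]; non-nulls get artificially small p-values
    p = [random.random() if t else 0.01 * random.random() for t in theta]
    # Bonferroni: reject H0,i iff p_i <= alpha/n; V counts rejected true nulls
    total_V += sum(1 for t, pi in zip(theta, p) if t and pi <= alpha / n)
print(total_V / reps, n0 * alpha / n)  # empirical E[V] vs the bound 0.06
```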
217 / 218
• Benjamini–Hochberg (BH) procedure: Let i0 be the largest i such that

p(i) ≤ (i/n) q.

Reject all H0,i for i ≤ i0.
Theorem 18
Under independence of the null p-values from each other and from the non-nulls, the BH procedure (uniformly) controls the FDR at level q. In fact,

FDRθ(θ̂BH) = (n0/n) q,  ∀θ ∈ {0, 1}ⁿ
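The BH step-up rule above can be sketched as follows (a standalone implementation of mine, not code from the slides):

```python
def benjamini_hochberg(pvals, q):
    # Step-up rule: find the largest rank i with p_(i) <= (i/n) q,
    # then reject the hypotheses with the i0 smallest p-values.
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    i0 = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / n:
            i0 = rank
    return sorted(order[:i0])  # indices of rejected hypotheses

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.9], 0.1))  # [0, 1, 2]
```

Note that the rule is step-up: p(3) = 0.04 exceeds the Bonferroni cutoff 0.1/4 but is still rejected because it lies below its own threshold (3/4)·0.1.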
218 / 218