Transcript of STAT 200B: Theoretical Statistics
arash.amini/teaching/stat200b/notes/200B_slides.pdf

Page 1:

STAT 200B: Theoretical Statistics

Arash A. Amini

March 2, 2020

Page 2:

Statistical decision theory

• A probability model P = {Pθ : θ ∈ Ω} for data X ∈ X. Ω: parameter space, X: sample space.

• An action space A: set of available actions (decisions).

• A loss function:

0–1 loss: L(θ, a) = 1{θ ≠ a}, with Ω = A = {0, 1}.
Quadratic loss (squared error): L(θ, a) = ‖θ − a‖_2^2, with Ω = A = R^d.

Statistical inference as a game:

1. Nature picks the “true” parameter θ, and draws X ∼ Pθ. Thus, X is a random element of X.

2. Statistician observes X and makes a decision δ(X) ∈ A. δ : X → A is called a decision rule.

3. Statistician incurs the loss L(θ, δ(X )).

The goal of the statistician is to minimize the expected loss, a.k.a. the risk:

R(θ, δ) := Eθ L(θ, δ(X))

Page 3:

• The goal of the statistician is to minimize the expected loss, a.k.a. the risk:

R(θ, δ) := Eθ L(θ, δ(X)) = ∫ L(θ, δ(x)) dPθ(x) = ∫ L(θ, δ(x)) pθ(x) dµ(x)

when the family is dominated: dPθ = pθ dµ.

• Usually we work with the family of densities {pθ(·) : θ ∈ Ω}.

Page 4:

Example 1 (Bernoulli trials)

• A coin is flipped repeatedly; we want to estimate the probability of it coming up heads.

• One possible model:

X = (X1, . . . , Xn), Xi i.i.d. ∼ Ber(θ), for some θ ∈ [0, 1].

• Formally, X = {0, 1}^n, Pθ = (Ber(θ))^⊗n and Ω = [0, 1].

• PMF of Xi:

P(Xi = x) = θ if x = 1, and 1 − θ if x = 0; i.e., P(Xi = x) = θ^x (1 − θ)^{1−x}, x ∈ {0, 1}.

• Joint PMF: pθ(x1, . . . , xn) = ∏_{i=1}^n θ^{xi} (1 − θ)^{1−xi}.

• Action space: A = Ω.

• Quadratic loss: L(θ, δ) = (θ − δ)^2.

Page 5:

Comparing estimators via their risk

Bernoulli trials. Let us look at four estimators:

• Sample mean: δ1(X) = (1/n) ∑_{i=1}^n Xi, with R(θ, δ1) = θ(1 − θ)/n.

• Constant estimator: δ2(X) = 1/2, with R(θ, δ2) = (θ − 1/2)^2.

• Strange looking: δ3(X) = (∑_i Xi + 3)/(n + 6), with R(θ, δ3) = [nθ(1 − θ) + (3 − 6θ)^2]/(n + 6)^2.

• Throw data out: δ4(X) = X1, with R(θ, δ4) = θ(1 − θ).
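These closed-form risks are easy to tabulate numerically. A minimal numpy sketch (not from the slides; n and the θ-grid are arbitrary choices):

```python
import numpy as np

n = 10
theta = np.linspace(0, 1, 101)

# Closed-form risks under quadratic loss for the four estimators above.
R1 = theta * (1 - theta) / n                                          # sample mean
R2 = (theta - 0.5) ** 2                                               # constant 1/2
R3 = (n * theta * (1 - theta) + (3 - 6 * theta) ** 2) / (n + 6) ** 2  # (sum Xi + 3)/(n + 6)
R4 = theta * (1 - theta)                                              # X1 alone

# delta_4 is dominated by delta_1: its risk is exactly n times larger.
assert np.all(R4 >= R1)
```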

Page 6:

Comparing estimators via their risk

[Figure: the risk functions θ ↦ R(θ, δi), i = 1, 2, 3, 4, plotted over θ ∈ [0, 1], for n = 10 (left panel) and n = 50 (right panel).]

Comparison depends on the choice of the loss. A different loss gives a different picture.

Page 7:

Comparing estimators via their risk

How to deal with the fact that the risks are functions?

• Summarize them by reducing to numbers:

• (Bayesian) Take a weighted average:

inf_δ ∫_Ω R(θ, δ) dπ(θ)

• (Frequentist) Take the maximum:

inf_δ max_{θ∈Ω} R(θ, δ)

• Restrict to a class of estimators: unbiased (UMVU), equivariant, etc.

• Rule out estimators that are dominated by others (inadmissible).

Page 8:

Admissibility

Definition 1

Let δ and δ∗ be decision rules. δ∗ (strictly) dominates δ if

• R(θ, δ∗) ≤ R(θ, δ), for all θ ∈ Ω, and

• R(θ, δ∗) < R(θ, δ), for some θ ∈ Ω.

δ is inadmissible if there is a different δ∗ that dominates it; otherwise δ is admissible.

An inadmissible rule can be uniformly “improved”.

δ4 in the Bernoulli example is inadmissible.

We will see a non-trivial example soon (Exponential Distribution).

Page 9:

Bias

Definition 2

The bias of δ for estimating g(θ) is Bθ(δ) := Eθ(δ)− g(θ).

The estimator is unbiased if Bθ(δ) = 0 for all θ ∈ Ω.

It is not always possible to find unbiased estimators. Example: g(θ) = sin(θ) in the binomial family (Keener, Example 4.2, p. 62).

Definition 3
g is called U-estimable if there is an unbiased estimator δ for g.

Usually g(θ) = θ.

Page 10:

Bias-variance decomposition

For the quadratic loss L(θ, a) = (θ − a)^2, the risk is the mean-squared error (MSE). In this case we have the following decomposition:

MSEθ(δ) = [Bθ(δ)]^2 + varθ(δ)

Proof.

Let µθ := Eθ(δ). We have

MSEθ(δ) = Eθ(θ − δ)^2 = Eθ(θ − µθ + µθ − δ)^2
= (θ − µθ)^2 + 2(θ − µθ) Eθ[µθ − δ] + Eθ(µθ − δ)^2.

The cross term vanishes since Eθ[µθ − δ] = µθ − µθ = 0, leaving the bias squared plus the variance.

Same goes for general g(θ) or higher dimensions: L(θ, a) = ‖g(θ) − a‖_2^2.
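A quick Monte Carlo illustration of the decomposition, here using the estimator δ3 = (∑ Xi + 3)/(n + 6) from the Bernoulli example (a sketch; θ, n and the number of replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 10, 200_000

# Draw many samples and compute the estimator on each.
X = rng.binomial(1, theta, size=(reps, n))
delta = (X.sum(axis=1) + 3) / (n + 6)

mse = np.mean((delta - theta) ** 2)
bias = delta.mean() - theta
var = delta.var()

# The decomposition holds exactly for the empirical moments as well.
print(mse, bias ** 2 + var)
```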

Page 11:

Example 2 (Berger)

Let X ∼ N(θ, 1).

Class of estimators of the form δc(X ) = cX , for c ∈ R.

MSEθ(δc) = (θ − cθ)^2 + c^2 = (1 − c)^2 θ^2 + c^2

For c > 1, we have 1 = MSEθ(δ1) < MSEθ(δc) for all θ.

For c ∈ [0, 1] the rules are incomparable.

Page 12:

Optimality depends on the loss

Example 3 (Poisson process)

Let X1, . . . , Xn be the inter-arrival times of a Poisson process with rate λ.

X1, . . . , Xn i.i.d. ∼ Expo(λ). The model has the following p.d.f.:

pλ(x) = ∏_{i=1}^n pλ(xi) = ∏_i λe^{−λxi} 1{xi > 0} = λ^n e^{−λ ∑_i xi} 1{min_i xi > 0}

Ω = A = (0, ∞).

• Let S = ∑_i Xi and X̄ = S/n.

• The MLE for λ is λ̂ = 1/X̄ = n/S.

Page 13:

X1, . . . , Xn i.i.d. ∼ Expo(λ)

• Let S = ∑_i Xi and X̄ = S/n.

• The MLE for λ is λ̂ = 1/X̄ = n/S.

• S := ∑_{i=1}^n Xi has the Gamma(n, λ) distribution.

• 1/S has the Inv-Gamma(n, λ) distribution, with mean λ/(n − 1).

• Eλ[λ̂] = nλ/(n − 1): the MLE is biased for λ.

• Then, λ̃ := (n − 1)λ̂/n = (n − 1)/S is unbiased.

• We also have varλ(λ̃) < varλ(λ̂).

• It follows that

MSEλ(λ̃) < MSEλ(λ̂), ∀λ.

The MLE λ̂ is inadmissible for quadratic loss.
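A small simulation confirming the domination (a sketch; the particular λ and n are arbitrary, and S is drawn directly from its Gamma(n, λ) distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 5, 200_000

# S = sum of n Expo(lam) draws ~ Gamma(shape=n, rate=lam).
S = rng.gamma(shape=n, scale=1 / lam, size=reps)
mle = n / S             # lambda-hat
unbiased = (n - 1) / S  # lambda-tilde

# The unbiased estimator has strictly smaller quadratic risk.
print(np.mean((mle - lam) ** 2), np.mean((unbiased - lam) ** 2))
```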

Page 14:

Possible explanation: Quadratic loss penalizes over-estimation more than under-estimation for Ω = (0, ∞).


Alternative loss function (Itakura–Saito distance)

L(λ, a) = λ/a − 1 − log(λ/a), a, λ ∈ (0, ∞)

• With this loss function, R(λ, λ̃) > R(λ, λ̂), ∀λ.

• That is, the MLE λ̂ renders λ̃ inadmissible.

This is an example of a Bregman divergence, for φ(x) = −log x. For a convex function φ : R^d → R, the Bregman divergence is defined as

dφ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩,

the remainder of the first-order Taylor expansion of φ at y.

Page 15:

Details:

• Consider δα(X) = α/S. Then, we have

R(λ, δα) − R(λ, δβ) = (n/α − n/β) − (log(n/α) − log(n/β))

• Take α = n − 1 and β = n.

• Use log x − log y < x − y for x > y ≥ 1.

(Follows from strict concavity of f(x) = log x: f(x) − f(y) < f′(y)(x − y) for x ≠ y, and f′(y) = 1/y ≤ 1 for y ≥ 1.)
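The reversal under the Itakura–Saito loss can be checked the same way (a sketch with arbitrary λ and n; in fact the IS risk of δα is constant in λ):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 5, 500_000

def is_loss(lam, a):
    r = lam / a
    return r - 1 - np.log(r)  # Itakura-Saito distance

S = rng.gamma(shape=n, scale=1 / lam, size=reps)  # Gamma(n, rate lam)
print(np.mean(is_loss(lam, n / S)),        # MLE n/S: smaller IS risk
      np.mean(is_loss(lam, (n - 1) / S)))  # unbiased (n-1)/S: larger
```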

Page 16:

Sufficiency

Idea: Separate the data into

• parts that are relevant for estimating θ (sufficient)

• and parts that are irrelevant (ancillary).

Benefits:

• Achieve data compression: efficient computation and storage

• Irrelevant parts can increase the risk (Rao-Blackwell)

Definition 4

Consider the model P = {Pθ : θ ∈ Ω} for X. A statistic T = T(X) is sufficient for P (or for θ, or for X) if the conditional distribution of X given T does not depend on θ.

More precisely, we have

Pθ(X ∈ A | T = t) = Qt(A), ∀t, A

for some Markov kernel Q. Making this fully precise requires measure theory. Intuitively, given T, we can simulate X by an external source of randomness.

Page 17:

Sufficiency

Example 4 (Coin tossing)

• Xi i.i.d. ∼ Ber(θ), i = 1, . . . , n.

• Notation: X = (X1, . . . , Xn), x = (x1, . . . , xn).

• Will show that T = T(X) = ∑_i Xi is sufficient for θ. (This should be intuitive.)

Pθ(X = x) = pθ(x) = ∏_{i=1}^n θ^{xi}(1 − θ)^{1−xi} = θ^{T(x)}(1 − θ)^{n−T(x)}

• Then

Pθ(X = x, T = t) = Pθ(X = x) 1{T(x) = t} = θ^t(1 − θ)^{n−t} 1{T(x) = t}.

Page 18:

• Then

Pθ(X = x, T = t) = Pθ(X = x) 1{T(x) = t} = θ^t(1 − θ)^{n−t} 1{T(x) = t}.

• Marginalizing,

Pθ(T = t) = θ^t(1 − θ)^{n−t} ∑_{x∈{0,1}^n} 1{T(x) = t} = (n choose t) θ^t(1 − θ)^{n−t}.

• Hence,

Pθ(X = x | T = t) = θ^t(1 − θ)^{n−t} 1{T(x) = t} / [(n choose t) θ^t(1 − θ)^{n−t}] = 1{T(x) = t} / (n choose t).

• What is the above (conditional) distribution?
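(It is the uniform distribution over the (n choose t) binary sequences with exactly t ones.) This can also be seen by simulation: conditioning on T = t, the sequence frequencies come out uniform regardless of θ. A minimal sketch (parameter values arbitrary):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
n, t, reps = 4, 2, 100_000

for theta in (0.2, 0.7):
    X = rng.binomial(1, theta, size=(reps, n))
    keep = X[X.sum(axis=1) == t]          # condition on T = t
    freq = Counter(map(tuple, keep))
    print({k: round(v / len(keep), 3) for k, v in freq.items()})
# Both runs give ~1/6 for each of the C(4,2) = 6 sequences: uniform, free of theta.
```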

Page 19:

Factorization Theorem

It is not convenient to check for sufficiency this way, hence:

Theorem 1 (Factorization (Fisher–Neyman))

Assume that P = {Pθ : θ ∈ Ω} is dominated by µ. A statistic T is sufficient iff for some functions gθ, h ≥ 0,

pθ(x) = gθ(T(x)) h(x), for µ-a.e. x.

The likelihood θ ↦ pθ(X) then depends on X only through T(X).

The family being dominated (having a density) is important.
Proof (discrete case only):
Assume T = T(X) is sufficient. Fix x, and let t = T(x). Then,

Pθ(X = x) = Pθ(X = x, T = t)
= Pθ(X = x | T = t) Pθ(T = t)
= Qt(X = x) gθ(t).

Page 20:

Factorization Theorem

• Now assume the factorization holds. Then,

Pθ(X = x, T = t) = Pθ(X = x) 1{T(x) = t} = gθ(t) h(x) 1{T(x) = t}.

• It follows that

Pθ(T = t) = gθ(t) ∑_{x′} h(x′) 1{T(x′) = t},

• hence

Pθ(X = x | T = t) = gθ(t) h(x) 1{T(x) = t} / [gθ(t) ∑_{x′} h(x′) 1{T(x′) = t}] = h(x) 1{T(x) = t} / ∑_{x′} h(x′) 1{T(x′) = t}.

Page 21:

Example 5 (Uniform)

• Let X1, . . . , Xn i.i.d. ∼ U[0, θ].

• The family is dominated by Lebesgue measure.

• X(n) = max{X1, . . . , Xn} is sufficient by the factorization theorem:

pθ(x) = ∏_{i=1}^n (1/θ) 1{0 ≤ xi ≤ θ} = (1/θ^n) 1{0 ≤ xi ≤ θ, ∀i} = (1/θ^n) 1{0 ≤ min_i xi} 1{max_i xi ≤ θ}

Useful fact: ∏_i 1{Ai} = 1{∩i Ai}.

Page 22:

• The entire data (X1, . . . ,Xn) is always sufficient.

• For i.i.d. data there is always more reduction.

Example 6 (IID data)

• Let X1, . . . , Xn i.i.d. ∼ pθ.

• Then, the order statistic (X(1), . . . , X(n)) is sufficient:

• Order the data such that X(1) ≤ X(2) ≤ · · · ≤ X(n), and discard the ranks.

Page 23:

Minimal sufficiency

• There is a hierarchy among sufficient statistics in terms of data reduction.

• Can be made precise by using “functions” as “reduction mechanisms”.

Lemma 1

If T is sufficient for P and T = f (S) a.e. P, then S is sufficient.

We write T ≤s S if such a functional relation exists.

• An easy consequence of the factorization theorem.

Examples:

• T sufficient ⇏ T^2 sufficient. (T ≰s T^2; missing sign information)

• T^2 sufficient =⇒ T sufficient. (T^2 ≤s T)

• T sufficient ⇐⇒ T^3 sufficient. (bijection)

Page 24:

• T ≤s S is not standard notation, but useful shorthand for:

∃ function f such that T = f (S) a.e. P.

• We want to obtain a sufficient statistic that achieves the greatest reduction, i.e., is at the bottom of that hierarchy.

Definition 5

A statistic T is minimal sufficient if

• T is sufficient, and

• T ≤s S for any sufficient statistic S .

Page 25:

• Minimal sufficient statistics exist under mild conditions.

• Minimal sufficient statistic is essentially unique modulo bijections.

Example 7 (Location family)

Consider X1, . . . , Xn i.i.d. ∼ pθ, that is, they have density pθ(x) = f(x − θ). For example, consider f(x) = C exp(−β|x|^α).

• The case α = 2 corresponds to the normal location family X1, . . . , Xn i.i.d. ∼ N(θ, 1). Then, T = (1/n) ∑_i Xi is sufficient by factorization. We will show later that it is minimal sufficient.

• In the case α = 1 (Laplace or double exponential), no further reduction beyond the order statistic is possible.

• A family P is DCS if it is dominated, with densities having common support.

Page 26:

• Goal: To show that the likelihood (ratio) function is minimal sufficient.

• General idea: For any fixed θ and θ0,

pθ(X) / pθ0(X)

will always be a function of any sufficient statistic (by the factorization theorem).

• We just have to collect enough of them so that collectively,

(pθ1(X)/pθ0(X), pθ2(X)/pθ0(X), pθ3(X)/pθ0(X), . . .)

they are sufficient.

Page 27:

A useful lemma

• Let us fix θ0 ∈ Ω and define

Lθ := Lθ(X) := pθ(X) / pθ0(X).

Lemma 2

In a DCS family, the following are equivalent

(a) U is sufficient for P.

(b) Lθ ≤s U, ∀θ ∈ Ω.

• When DCS fails, (a) still implies (b), but not necessarily vice versa.

Page 28:

Proof of Lemma 2

• Working on the common support, the densities can be assumed positive.

• (a) =⇒ (b): U sufficient implies pθ(x) = gθ(U(x)) h(x) (factorization theorem):

Lθ(x) = pθ(x)/pθ0(x) = gθ(U(x))/gθ0(U(x)) = fθ,θ0(U(x))

• (b) =⇒ (a): ∃ fθ,θ0 such that pθ(x) = pθ0(x) fθ,θ0(U(x)), which is a factorization with h = pθ0, so U is sufficient. Q.E.D.

Page 29:

A useful lemma

• Let L := (Lθ)θ∈Ω.

• Since [R ≤s U and S ≤s U] ⇐⇒ (R, S) ≤s U, we have

Lemma 3

In a DCS family, the following are equivalent

(a) U is sufficient for P.

(b) L≤s U.

• The argument is correct when Ω is finite.

• Needs more care dealing with “a.e. P” when Ω is infinite.

• From Lemma 3 it follows that L is itself sufficient. (Why?)

Page 30:

Likelihood is minimal sufficient

Proposition 1

In a DCS family, L := (Lθ)θ∈Ω is minimal sufficient.

Proof:

• Let U be a sufficient statistic.

• Lemma 3(a) =⇒ (b) gives L≤s U.

• (i.e., L is a function of any sufficient U.)

• We also know that L is itself sufficient.

• The proof is complete.

Page 31:

• A consequence of Prop. 1 is

Corollary 1

A statistic T is minimal sufficient if

• T is sufficient, and

• T ≤s L.

• That is, it is enough to show that T is sufficient and,

• T can be written as a function of L.

T ≤s L is equivalent to either of the following:

• L(x) = L(y) =⇒ T (x) = T (y).

• Level sets of L are “included” in level sets of T, i.e.,

• level sets of T are coarser than level sets of L.

Page 32:

Corollary 2

T is minimal sufficient for DCS family P iff

(a) T is sufficient, and

(b) L(x) = L(y) =⇒ T (x) = T (y)

Corollary 3

T is minimal sufficient for DCS family P iff

L(x) = L(y) ⇐⇒ T (x) = T (y)

• Can replace L(x) = L(y) with ℓx(θ) ∝ ℓy(θ),

where ℓx(θ) = pθ(x) is the likelihood function. (Theorem 3.11 in Keener.)

• Corollary 3: T is minimal sufficient if it has the same level sets as L.

• Informally, shape of the likelihood is minimal sufficient.

Page 33:

Example 8 (Gaussian location family)

Consider X1, . . . , Xn i.i.d. ∼ pθ(x) = f(x − θ) with f(x) = C exp(−βx^2).

• ℓX(θ) ∝ exp(−β ∑_i (Xi − θ)^2).

• The shape of ℓX(·) is uniquely determined by θ ↦ ∑_i (Xi − θ)^2,

• alternatively, by θ ↦ −2(∑_i Xi)θ + nθ^2,

• alternatively, by ∑_i Xi.

Example 9 (Laplace location family)

Consider X1, . . . , Xn i.i.d. ∼ pθ(x) = f(x − θ) with f(x) = C exp(−β|x|).

• ℓX(θ) ∝ exp(−β ∑_i |Xi − θ|).

• The shape of ℓX(·) is uniquely determined by the breakpoints of the piecewise-linear function θ ↦ ∑_i |Xi − θ|,

• in one-to-one correspondence with the order statistic (X(1), . . . , X(n)).

Page 34:

[Figure: the piecewise-linear map θ ↦ ∑_i |Xi − θ|, with breakpoints at X(1), X(2), X(3).]

• Shape of the likelihood for the Laplace location family is determined by theorder statistics.

Page 35:

Example with no common support

P = {P0, P1, P2} where

P0 = U[−1, 0], P1 = U[0, 1], p2(x) = 2x 1{x ∈ (0, 1)}.

p1/p0 = p2/p0 = 0 on (−1, 0), and = ∞ on (0, 1).

• Cannot tell P1 and P2 apart based on (p1/p0, p2/p0).

• However, there is information in the original model to statistically tell these two apart to some extent.

• Consider in addition p2(x)/p1(x) = 2x 1{x ∈ (0, 1)}.

• A minimal suff. stat.: (1{X < 0}, X+).

• Could just take X+, since a.e. P, 1{X < 0} = 1{X+ = 0} is a function of X+.

Page 36:

Empirical distribution

• We saw that for i.i.d. data, X1, . . . , Xn i.i.d. ∼ Pθ,

• the order statistic X(1) ≤ X(2) ≤ · · · ≤ X(n) is sufficient.

• Another way: the empirical distribution P̂n is sufficient,

P̂n := (1/n) ∑_{i=1}^n δ_{Xi},   (δx: unit point mass at x)

• δx is the measure defined by δx(A) := 1{x ∈ A}.

• Example: X = (0, 1, −1, 0, 4, 4, 0) =⇒ P̂n = (1/7)(δ_{−1} + 3δ_0 + δ_1 + 2δ_4).

Page 37:

Example 10 (Empirical distribution on a finite alphabet (i.i.d. data))

• Suppose the sample space is finite: X = {a1, . . . , aK}.
• Let P = the collection of all probability measures P on X.
• P can be parametrized by θ = (θ1, . . . , θK) where θj = P({aj}).

• The empirical measure reduces to P̂n = ∑_j πj(X) δ_{aj} where

πj(X) := (1/n) #{i : Xi = aj}

• Show that Pn or equivalently (π1(X ), . . . , πK (X )) is minimal sufficient.
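A short numpy sketch computing these empirical frequencies (the alphabet and θ below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
alphabet = np.array(["a", "b", "c"])   # a_1, ..., a_K with K = 3
theta = np.array([0.5, 0.3, 0.2])      # theta_j = P({a_j})

n = 1000
X = rng.choice(alphabet, size=n, p=theta)

# pi_j(X) = (1/n) #{i : X_i = a_j}, the weights of the empirical measure.
pi = np.array([(X == a).mean() for a in alphabet])
print(dict(zip(alphabet, pi)))         # close to theta for large n
```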

Page 38:

Statements from Theory of Point Estimation (TPE)

Proposition 2 (TPE)

Consider a finite DCS family P = {Pθ : θ ∈ Ω}, i.e., Ω = {θ0, θ1, . . . , θK}. Then, the following is minimal sufficient:

L(X) = (pθ1(X)/pθ0(X), . . . , pθK(X)/pθ0(X)).

Proposition 3 (TPE)

Assume P is DCS and P0 ⊂ P. Assume that T is

• sufficient for P, and

• minimal sufficient for P0.

Then, T is minimal sufficient for P.

Same support gives “a.e. P0 =⇒ a.e. P”.
S sufficient for P =⇒ S sufficient for P0.
T minimal suff. for P0 =⇒ T = f(S) a.e. P0, and hence a.e. P. Q.E.D.

Page 39:

Example 11 (Gaussian location family)

• Consider X1, . . . , Xn i.i.d. ∼ N(θ, 1).

• Look at the sub-family P0 = {N(θ0, 1), N(θ1, 1)}. Let T(X) = ∑_i Xi.

• The following is minimal sufficient by Proposition 2:

log Lθ1 := log [pθ1(x)/pθ0(x)] = (1/2) ∑_i (xi − θ0)^2 − (1/2) ∑_i (xi − θ1)^2 = (θ1 − θ0) T(x) + (n/2)(θ0^2 − θ1^2),

showing that T(X) is minimal sufficient for P0.

• Since T(X) is also sufficient for P (exercise), Proposition 3 implies that it is minimal sufficient for P.

Page 40:

Completeness and ancillarity

• We can compress even more!

Definition 6

• V = V(X) is ancillary if its distribution does not depend on θ.

• V is first-order ancillary if its expectation does not depend on θ (EθV = c, ∀θ).

• The latter is a weaker notion.

Example 12 (Location family)

• The following statistics are all ancillary:

Xi − Xj, Xi − X(j), X(i) − X(j), X(i) − X̄

• Hint: We can write Xi = θ + εi, where εi i.i.d. ∼ f.

• A minimal sufficient statistic can still contain ancillary information, for example X(n) − X(1) in the Laplace location family.

Page 41:

• A notion stronger than minimal sufficiency is completeness:

Definition 7
A statistic T is complete if

(Eθ[f(T)] = c, ∀θ) =⇒ f(t) = c, ∀t.

(Actually, for P-a.e. t.)

• T is complete if no nonconstant function of it is first-order ancillary.

• Minimal sufficient statistic need not be complete:

Example 13 (Laplace location family)

• X(n) − X(1) is ancillary, hence first-order ancillary.

• f(X(1), . . . , X(n)) is ancillary for the nonconstant function f(z1, . . . , zn) = z1 − zn, so the order statistic is not complete.

• The converse, however, is true: a complete sufficient statistic is minimal sufficient (Proposition 4 below).

Page 42:

• Showing completeness is not easy.

• Will see a general result for exponential families.

• Here is another example:

Example 14

• X1, . . . , Xn i.i.d. ∼ U[0, θ].

• Will show that T = max{X1, . . . , Xn} is complete.

• CDF of T: FT(t) = (t/θ)^n 1{t ∈ (0, θ)} + 1{t ≥ θ}.
• Density: t ↦ n θ^{−n} t^{n−1} 1{t ∈ (0, θ)}.
• Suppose that Eθ f(T) = 0 for all θ > 0. Then,

0 = Eθ f(T) = n θ^{−n} ∫_0^θ f(t) t^{n−1} dt, ∀θ > 0.

• The fundamental theorem of calculus implies f(t) t^{n−1} = 0 for a.e. t > 0,

• hence f(t) = 0 a.e. t > 0. Conclude that T is complete.

Page 43:

Detour: Conditional expectation as L2 projection

• The L2 space of random variables:

L2 := L2(P) := {X : E[X^2] < ∞}

• We can define an inner product on this space:

⟨X, Y⟩ := E[XY], X, Y ∈ L2

• The inner product induces a norm, called the L2 norm,

‖X‖2 := √⟨X, X⟩ = √E[X^2]

• The norm induces a distance ‖X − Y‖2.

• The squared distance ‖X − Y‖_2^2 = E(X − Y)^2 is the same as the MSE.

• Orthogonality: X ⊥ Y if 〈X ,Y 〉 = 0, i.e., E[XY ] = 0.

Page 44:

Detour: Conditional expectation as L2 projection

• Assume EX^2 < ∞ and EY^2 < ∞ (i.e., X, Y ∈ L2).

• Consider the linear space

L := {g(X) | g is a (measurable) function with E[g(X)^2] < ∞},

i.e., the space of all (measurable) functions of X.

• There is an essentially unique L2 projection of Y onto L:

Ŷ := argmin_{Z∈L} ‖Y − Z‖2

• Alternatively, there is an essentially unique function ĝ such that

min_g E(Y − g(X))^2 = E(Y − ĝ(X))^2

• We define E[Y |X] := ĝ(X).

Page 45:

Detour: Conditional expectation as L2 projection

• There is an essentially unique function ĝ such that

min_g E(Y − g(X))^2 = E(Y − ĝ(X))^2

• We define E[Y |X] := ĝ(X).

• E[Y |X] is the best prediction of Y given X, in the MSE sense.

• From this definition, we get the following characterization of ĝ:

E[(Y − ĝ(X)) g(X)] = 0, ∀g,

saying that the optimal prediction error Y − ĝ(X) is orthogonal to L.

• Applied to the constant function g(X) ≡ 1, we get

E[Y] = E[ĝ(X)] = E[E[Y |X]],

the smoothing or averaging property of conditional expectation.

Page 46:

Proposition 4

A complete sufficient statistic is minimal sufficient.

Proof. Let T be complete sufficient, and U minimal sufficient.

• Idea is to show that T is a function of U.

• U = g(T ). (By minimal sufficiency of U.)

• Let h(U) := Eθ[T |U], well-defined (free of θ) by sufficiency of U.

• Eθ[T − h(U)] = 0, ∀θ ∈ Ω. (By smoothing.)

• T = h(U). (By completeness of T .)

Hints:

• Took f (t) := t − h(g(t)) in the definition of completeness.

• Equalities are a.e. P.

Page 47:

Proposition 5 (Basu)

Let T be complete sufficient and V ancillary. Then T and V are independent.

Proof. Let A be an event.

• qA := Pθ(V ∈ A) is well-defined (free of θ). (By ancillarity of V.)

• fA(T) := Pθ(V ∈ A|T) is well-defined. (By sufficiency of T.)

• Eθ[qA − fA(T)] = 0, ∀θ. (By smoothing.)

• qA = fA(T). (By completeness of T.) Hence P(V ∈ A|T) = P(V ∈ A) for all A, i.e., T and V are independent.

Equalities are a.e. P.

Page 48:

• Application of Basu:

Example 15 (Gaussian location family)

• X1, . . . , Xn i.i.d. ∼ N(θ, σ^2); θ is unknown, σ^2 is known.

• X̄ := (1/n) ∑_i Xi is complete sufficient. (cf. exponential families)

• (Xi − X̄, i = 1, . . . , n) is ancillary.

• Hence, the sample variance S^2 := (1/(n−1)) ∑_i (Xi − X̄)^2 is ancillary.

• Hence, X̄ and S^2 are independent.

• Had we taken (θ, σ^2) as the parameter, then S^2 would not be ancillary.
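A numerical sanity check of the independence (a sketch, not a proof: vanishing correlation between transforms of X̄ and S² is only a necessary consequence of Basu):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, reps = 1.0, 2.0, 10, 100_000

X = rng.normal(theta, sigma, size=(reps, n))
xbar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)   # sample variance with the 1/(n-1) convention

print(np.corrcoef(xbar, s2)[0, 1])                # ~ 0
print(np.corrcoef(xbar ** 2, np.sqrt(s2))[0, 1])  # ~ 0 for transforms too
```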

Page 49:

Rao–Blackwell

• Rao–Blackwell theorem ties risk minimization with sufficiency.

• It is a statement about convex loss functions.

Definition 8
A function f : R^p → R is convex if for all x, y,

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y), ∀λ ∈ [0, 1],   (1)

and strictly convex if the inequality is strict for λ ∈ (0, 1) and x ≠ y.

Example 16 (`p loss)

• Loss function a 7→ L(θ, a) = |θ − a|p on R.

• Convex for p ≥ 1 and nonconvex for p ∈ (0, 1).

• Strictly convex when p > 1.

• In particular, the ℓ1 loss (p = 1) is convex but not strictly convex.

Page 50:

Jensen inequality

By induction, (1) leads to f(∑_i αi xi) ≤ ∑_i αi f(xi), for αi ≥ 0 and ∑_i αi = 1.
A sweeping generalization is the following:

Proposition 6 (Jensen inequality)

Assume that f : S → R is convex, and consider a random variable X concentrated on S (i.e., P(X ∈ S) = 1) with E|X| < ∞. Then,

Ef(X) ≥ f(EX).

If f is strictly convex, equality holds iff X ≡ EX a.s. (that is, X is constant).

Proof. Relies on the existence of supporting hyperplanes to f (i.e., affine minorants that touch the function).

Page 51:

• Let x0 := EX .

• Let A(x) = 〈a, x − x0〉+ f (x0) be a supporting hyperplane to f at x0:

f (x) ≥ A(x), ∀x ∈ S, and A(x0) = f (x0).

• Then, we have

f(X) ≥ A(X) =⇒ E[f(X)] ≥ E[A(X)]   (monotonicity of E)
E[A(X)] = ⟨a, E[X − x0]⟩ + f(x0)   (linearity of E)
= f(x0)

• Strict convexity implies f(x) > A(x) for x ≠ x0.

• If X ≠ x0 with positive probability, then f(X) > A(X) with positive probability,

• Hence, Ef (X ) > EA(X ), and the rest follows. Q.E.D.

Page 52:

Recall the decision-theoretic setup:
Loss function L(θ, a), decision rule δ = δ(X), risk R(θ, δ) = Eθ L(θ, δ).

Theorem 2 (Rao–Blackwell)

Let us assume the following:

• T is sufficient for family P,

• δ = δ(X ) is a possibly randomized decision rule,

• a 7→ L(θ, a) is convex, for all θ ∈ Ω.

Define the estimator η(T ) := Eθ[δ|T ]. Then, η dominates δ, that is,

R(θ, η) ≤ R(θ, δ) for all θ ∈ Ω

The inequality is strict when the loss is strictly convex, unless η = δ a.e. P.

Consequence: for convex loss functions, randomization does not help.

Proof:

• η is well-defined. (By sufficiency of T .)

• Eθ[L(θ, δ)|T ] ≥ L(θ,Eθ[δ|T ]) = L(θ, η). (By conditional Jensen inequality.)

• Take expectation and use monotonicity and smoothing. Q.E.D.

Page 53:

Example 17

• Xi i.i.d. ∼ U[0, θ], i = 1, . . . , n.

• T = max{X1, . . . , Xn} is sufficient. Take δ = (1/n) ∑_i Xi.

• Rao–Blackwell: η = Eθ[δ|T] strictly dominates δ, for any strictly convex loss. Let us verify this:

• Conditional distribution of Xi given T: a mixture of a point mass at T and the uniform distribution on [0, T] (why?):

Pθ(Xi ∈ A | T) = (1/n) δT(A) + (1 − 1/n) ∫_0^T (1/T) 1A(x) dx

• Compactly, Xi | T ∼ (1/n) δT + (1 − 1/n) Unif(0, T).

• It follows that

Eθ(Xi | T) = (1/n) T + (1 − 1/n)(T/2) = ((n + 1)/(2n)) T

• Same expression holds for η by symmetry.
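A crude simulation check of this conditional expectation, binning on T (a sketch; θ, n and the binning are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 3.0, 5, 400_000

X = rng.uniform(0, theta, size=(reps, n))
T = X.max(axis=1)
delta = X.mean(axis=1)

# Compare the conditional mean of delta within bins of T to (n+1)T/(2n).
edges = np.quantile(T, np.linspace(0, 1, 21))
idx = np.digitize(T, edges[1:-1])
for j in (5, 10, 15):
    sel = idx == j
    print(delta[sel].mean(), (n + 1) / (2 * n) * T[sel].mean())
```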

Page 54:

• That is, η = Eθ[δ|T] = ((n + 1)/(2n)) T.

• Consider the quadratic loss:

• η has the same bias as δ (by smoothing).

• How about variances?

varθ(δ) = (1/n) · (θ^2/12)

• Since T/θ is Beta(n, 1) distributed, we have

varθ(η) = ((n + 1)/(2n))^2 · nθ^2/((n + 1)^2(n + 2)) = θ^2/(4n(n + 2))   (2)

• (Note: δ is biased; 2δ is unbiased. Better to compare 2η and 2δ.)

Page 55:

• Another result about strictly convex loss functions: an admissible decision rule is uniquely determined by its risk function.

Proposition 7

• Assume a ↦ L(θ, a) is strictly convex, for all θ. (R(θ, δ) = Eθ[L(θ, δ)].)

Then, the map δ ↦ R(·, δ) is injective over the class of admissible decision rules.

We are identifying decision rules that are the same a.e. P.

Proof.

• Let δ be admissible.

• Let δ′ ≠ δ be such that R(θ, δ) = R(θ, δ′), ∀θ.

• Take δ∗ = (1/2)(δ + δ′). Then, by strict convexity of the loss,

L(θ, δ∗) < (1/2)(L(θ, δ) + L(θ, δ′)), ∀θ, on the event {δ ≠ δ′}.

Taking expectation (note: X > 0 a.s. =⇒ E[X] > 0):

R(θ, δ∗) < (1/2)(R(θ, δ) + R(θ, δ′)) = R(θ, δ), ∀θ.

• δ′ ≠ δ implies δ∗ ≠ δ.

• δ∗ strictly dominates δ, contradicting admissibility of δ.

Page 56:

Rao–Blackwell can fail for non-convex loss. Here is an example:

Example 18

• X ∼ Bin(n, θ). Ω = A = [0, 1].

• ε-sensitive loss function: Lε(θ, a) = 1{|θ − a| ≥ ε}.
• Consider a general deterministic estimator δ = δ(X).

• δ takes at most n + 1 values, {δ(0), δ(1), . . . , δ(n)} ⊂ [0, 1].

• Divide [0, 1] into bins of length 2ε.

• Assume that N := 1/(2ε) ≥ n + 2 and that N is an integer (for simplicity).

• At least one of the N bins contains no δ(i), i = 0, . . . , n, and the midpoint of that bin is at distance ≥ ε from every δ(i). Hence

sup_{θ∈[0,1]} R(θ, δ) = 1

for any nonrandomized rule (assuming ε ≤ 1/[2(n + 2)]).

• Consider the randomized estimator δ′ = U ∼ U[0, 1], independent of X:

R(θ, δ′) = P(|U − θ| ≥ ε) ≤ 1 − ε   (worst case at θ ∈ {0, 1})

• sup_{θ∈[0,1]} R(θ, δ′) ≤ 1 − ε < 1, strictly better than that of δ.

Page 57:

Uniformly minimum variance unbiased (UMVU) criterion

• Comparing estimators based on their risk functions is problematic.

• One way to mitigate this: restrict the class of estimators.

• Focus on quadratic loss, restricted to unbiased estimators.

• By the bias–variance decomposition, the risk of an unbiased estimator is its variance.

Let Ug be the class of unbiased estimators of g(θ), that is,

Ug = {δ : Eθ[δ] = g(θ), ∀θ}.

Definition 9

An estimator δ is UMVU for estimating g(θ) if

• δ ∈ Ug , and

• varθ(δ) ≤ varθ(δ′), for all θ ∈ Ω and all δ′ ∈ Ug.

Page 58:

Theorem 3 (Lehmann–Scheffé)

Consider the family P and assume that

• Ug is nonempty (i.e., g is U-estimable), and

• there is a complete sufficient statistic T for P.

Then, there is an essentially unique UMVU for g(θ) of the form h(T ).

Proof.

• Pick δ ∈ Ug (Valid by non-emptiness.)

• Let η = Eθ[δ|T ] be an estimator (Well-defined by sufficiency of T .)

• Claim: η is the essentially unique UMVU.

• Pick any δ′ ∈ Ug and let η′ = Eθ[δ′|T ].

• Eθ[η − η′] = g(θ)− g(θ) = 0, ∀θ (By smoothing and unbiasedness.)

(a) η − η′ = 0 a.e. P. (By completeness of T .)

• By Rao–Blackwell for the quadratic loss a ↦ (g(θ) − a)^2, and unbiasedness,

varθ(η) = R(θ, η) (a)= R(θ, η′) ≤ R(θ, δ′) = varθ(δ′)

• Since δ′ was an arbitrary element of the class Ug we are done. Q.E.D.

Page 59:

Proof.

• Pick δ ∈ Ug (Valid by non-emptiness.)

• Let η = Eθ[δ|T ] be an estimator (Well-defined by sufficiency of T .)

• Claim: η is the essentially unique UMVU.

• Pick any δ′ ∈ Ug and let η′ = Eθ[δ′|T ].

• Eθ[η − η′] = g(θ)− g(θ) = 0, ∀θ (By smoothing and unbiasedness.)

• η − η′ = 0 a.e. P. (By completeness of T .)

• By Rao–Blackwell for the quadratic loss a ↦ (g(θ) − a)^2, and unbiasedness,

varθ(η) = R(θ, η) = R(θ, η′) (b)< R(θ, δ′) = varθ(δ′)

• Since δ′ was an arbitrary element of the class Ug we are done. Q.E.D.

Remark 1
Note that we have also shown the uniqueness:

(b) If δ′ ∈ Ug is UMVU and not a function of T, then it is strictly dominated by η′ (by Rao–Blackwell and strict convexity of the quadratic loss).

• Otherwise, it is equal to η′ which is equal to η a.e. P.

Page 60:

Lehmann–Scheffé suggests a way of constructing UMVUs.

Example 19 (Coin tossing)

• X1, . . . , Xn i.i.d. ∼ Ber(θ); we want to estimate g(θ) = θ^2.

• T = ∑_i Xi is complete and sufficient. (General result for exponential families.)

• Take U = X1X2.

• U is unbiased for θ^2: Eθ[U] = Eθ[X1] Eθ[X2] = θ^2 by independence.

• By Lehmann–Scheffé,

E[U | T = t] = P(X1 + X2 = 2 | T = t) = (n−2 choose t−2)/(n choose t) for t ≥ 2, and 0 otherwise, = t(t − 1)/(n(n − 1)),

is the UMVU estimator for θ^2.
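A one-line Monte Carlo check of unbiasedness (a sketch; θ and n are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.4, 8, 500_000

T = rng.binomial(n, theta, size=reps)   # T = sum of the Bernoulli's
umvu = T * (T - 1) / (n * (n - 1))

print(umvu.mean(), theta ** 2)          # agree up to Monte Carlo error
```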

Page 61:

A second approach to obtaining UMVUs:

Example 20

• X1, . . . , Xn i.i.d. ∼ U[0, θ].

• T = X(n) = max{X1, . . . , Xn} is complete sufficient.

• The UMVU for g(θ) is given by h(X(n)).

• h is the solution of the following integral equation:

g(θ) = Eθ[h(X(n))] = n θ^{−n} ∫_0^θ t^{n−1} h(t) dt.

• For g(θ) = θ, δ1 = ((n + 1)/n) T is unbiased, hence UMVU by Lehmann–Scheffé.

• MSEθ(δ1) = varθ(δ1) = θ^2/(n(n + 2)).

• On the other hand, among estimators of the form δa = aT, a = (n + 2)/(n + 1) gives the lowest MSE.

• This biased estimator has slightly better MSE = θ^2/(n + 1)^2.

• A little bit of bias is not bad.
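Comparing the two estimators numerically (a sketch; the closed forms in the final comment are the ones quoted above):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 1.0, 10, 500_000

T = rng.uniform(0, theta, size=(reps, n)).max(axis=1)

for a, label in [((n + 1) / n, "UMVU (n+1)/n * T"),
                 ((n + 2) / (n + 1), "min-MSE (n+2)/(n+1) * T")]:
    print(label, np.mean((a * T - theta) ** 2))
# Closed forms: theta^2/(n(n+2)) ~ 0.00833 vs theta^2/(n+1)^2 ~ 0.00826.
```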

Page 62:

Exponential family

Definition 10
X: a general sample space, Ω: a general parameter space,

• A function T : X → R^d, T(x) = (T1(x), . . . , Td(x)).

• A function η : Ω → R^d, η(θ) = (η1(θ), . . . , ηd(θ)).

• A measure ν on X (e.g., Lebesgue or counting), and a function h : X → R+.

The exponential family with sufficient statistic T and parametrization η, relative to h · ν, is the dominated family of distributions given by the following densities w.r.t. ν:

pθ(x) = exp{⟨η(θ), T(x)⟩ − A(θ)} h(x), x ∈ X,

where ⟨η(θ), T(x)⟩ = ∑_{i=1}^d ηi(θ) Ti(x) is the Euclidean inner product.

Page 63:

• T : X → R^d, η : Ω → R^d.

pθ(x) = exp{⟨η(θ), T(x)⟩ − A(θ)} h(x), x ∈ X.

• A(θ) is determined by the other ingredients,

• via the normalization constraint ∫ pθ(x) dν(x) = 1:

A(θ) = log ∫ e^{⟨η(θ),T(x)⟩} dν̃(x),

where dν̃(x) := h(x) dν(x).

• A is called the log-partition function or cumulant generating function.

• The actual parameter space is

Ω0 = {θ ∈ Ω : A(θ) < ∞}.

• By the factorization theorem, T(X) is indeed sufficient.

• The representation of the exponential family is not unique.

Page 64:

Here are some examples:

Example 21

• X ∼ Ber(θ):

pθ(x) = θ^x(1 − θ)^{1−x} = exp[x log(θ/(1 − θ)) + log(1 − θ)].

• Here h(x) = 1{x ∈ {0, 1}},

η(θ) = log(θ/(1 − θ)), T(x) = x

A(θ) = −log(1 − θ), Ω0 = (0, 1)

• We need to take Ω = (0, 1); otherwise η is not well-defined.

Page 65:

Example 22

• X ∼ N(µ, σ^2): Let θ = (µ, σ^2),

pθ(x) = (1/√(2πσ^2)) exp[−(x − µ)^2/(2σ^2)]
= exp[−x^2/(2σ^2) + (µ/σ^2) x − (µ^2/(2σ^2) + (1/2) log(2πσ^2))]

• Here, h(x) = 1,

η(θ) = (µ/σ^2, −1/(2σ^2)), T(x) = (x, x^2)

A(θ) = µ^2/(2σ^2) + (1/2) log(2πσ^2), Ω0 = {(µ, σ^2) : σ^2 > 0}

• We could have taken h(x) = 1/√(2π) and A(θ) = µ^2/(2σ^2) + (1/2) log σ^2.

Page 66:

Example 23

• X ∼ U[0, θ]:

• pθ(x) = θ^{−1} 1{x ∈ (0, θ)}.
• Not an exponential family, since the support depends on the parameter.

Page 67:

Consider the following conditions:

(E1) η(Ω0) has non-empty interior.

(E2) T1, . . . , Td, 1 are linearly independent ν-a.e. That is,

∄ a ∈ R^d \ {0}, c ∈ R such that ⟨a, T(x)⟩ = c, ν-a.e. x.

(E1′) η(Ω0) is open.

Definition 11

• A family satisfying (E1) and (E2) is called full-rank.

• One that satisfies (E1′) is regular.

• One that satisfies (E2) is minimal.

• Condition (E1) prevents ηi from satisfying a constraint.

• Condition (E2) prevents unidentifiability.

Page 68:

Example 24

• A Bernoulli model: pθ(x) ∝ exp(θ0(1− x) + θ1x).

• x + (1 − x) = 1 for all x, so (E2) fails. Hence, the family is not full-rank.

Example 25

• A continuous model with Ω = R and η1(θ) = θ, η2(θ) = θ^2.

• The interior of η(Ω) (a parabola in R^2) is empty, hence the model is not full-rank.

Page 69:

Theorem 4In a full-rank exponential family, T is complete.

• We just show that T is minimal sufficient.

• Completeness is more technical, but follows from Laplace transform arguments.

Page 70:

Theorem 5
In a full-rank exponential family, T is complete.

Proof. (Minimal sufficiency.)

• By the factorization theorem, T is sufficient for P (the whole family).

• Choose {θ0, θ1, . . . , θd} ⊂ Ω, with ηi := η(θi), such that

η1 − η0, η2 − η0, . . . , ηd − η0 are linearly independent. (Possible by (E1).)

• The matrix A^T = (η1 − η0, . . . , ηd − η0) ∈ R^{d×d} is full-rank.

• Let P0 = {pθi : i = 0, 1, . . . , d}. Then, with T = T(X),

(log pθ1(X)/pθ0(X), . . . , log pθd(X)/pθ0(X)) = (⟨η1 − η0, T⟩, . . . , ⟨ηd − η0, T⟩) = AT

is minimal sufficient for P0.

• It follows that T is so, since A is invertible.

• Since P0 and P have common support, T is also minimal for P.

Page 71:

pθ(x) = exp{⟨η(θ), T(x)⟩ − A(θ)} h(x), x ∈ X.

Definition 12

An exponential family is in canonical (or natural) form if η(θ) = θ.

In this case:

• η = θ is called the natural parameter.

• Ω0 := {θ ∈ R^d : A(θ) < ∞} is called the natural parameter space.

• Ω0 ⊂ R^d.

The family is determined by the choice of X, T(x) and ν̃ = h · ν.

Example 26 (Two-parameter Gaussian)

• X = R, T(x) = (x, x^2).

• pθ(x) = exp(θ1 x + θ2 x^2 − A(θ)), ∀x ∈ X.

• A(θ) = log ∫ e^{θ1 x + θ2 x^2} dx.

• A(θ) < ∞ iff θ2 < 0. Natural parameter space: Ω0 = {(θ1, θ2) : θ2 < 0}.
• Note: θ1 = µ/σ^2 and θ2 = −1/(2σ^2) (in the original parametrization (µ, σ^2)).

Page 72:

Recall: ‖x‖1 = ∑_{i=1}^d |xi|.

Example 27 (Multinomial)

• X = Z^d_+ = {x = (x1, . . . , xd) : xi integer, xi ≥ 0}.

• T(x) = x.

• h(x) = (n choose x1, x2, . . . , xd) 1{‖x‖1 = n}, ν = counting measure.

• Canonical family: pθ(x) = exp(∑_{i=1}^d θi xi − A(θ)) h(x).

• Can show that A(θ) = n log(∑_{i=1}^d e^{θi}), finite everywhere.

• Hence Ω0 = R^d.

• Not full-rank: violates (E2).

Page 73:

• The multinomial distribution, with the usual parameter,

qπ(x) = (n choose x1, x2, . . . , xd) ∏_{i=1}^d πi^{xi},

looks like a subfamily:

• It corresponds to the following subset of the natural parameter space Ω0:

{(log πi) : πi > 0, ∑_{i=1}^d πi = 1} = {θ ∈ R^d : ∑_{i=1}^d e^{θi} = 1}.

• This family is also not full-rank (violates (E1)).

• It is actually not a sub-family of Example 27, since pθ = p_{θ+a·1} for any a ∈ R.

• That is, the θ parametrization is non-identifiable in Example 27.

Page 74:

Example 28 (Multivariate Gaussian)

• X = R^p. ν = Lebesgue measure on X.

• T(x) = (xi, i = 1, . . . , p | xi xj, 1 ≤ i < j ≤ p | xi^2, i = 1, . . . , p).

• param = (θi, i = 1, . . . , p | 2Θij, 1 ≤ i < j ≤ p | Θii, i = 1, . . . , p).

• Corresponding canonical exponential family:

pθ,Θ(x) = exp(∑_i θi xi + 2 ∑_{i<j} Θij xi xj + ∑_i Θii xi^2 − A(θ, Θ))

• Compactly, treating Θ as a symmetric matrix,

pθ,Θ(x) = exp{⟨θ, x⟩ + ⟨Θ, xx^T⟩ − A(θ, Θ)}

where ⟨Θ, xx^T⟩ := tr(Θ xx^T) = tr(x^T Θ x) = x^T Θ x.

• Dimension (or rank) of the family: d = p + p(p + 1)/2.

Page 75:

• Density of the multivariate Gaussian N(µ, Σ):

pµ,Σ(x) ∝ |Σ|^{−1/2} exp[−(1/2)(x − µ)^T Σ^{−1} (x − µ)]
= exp[−(1/2) x^T Σ^{−1} x + x^T Σ^{−1} µ − (1/2) µ^T Σ^{−1} µ − (1/2) log|Σ|]

• Can be written as a canonical exponential family:

pθ,Θ(x) = exp{⟨θ, x⟩ + ⟨Θ, xx^T⟩ − A(θ, Θ)}

where ⟨Θ, xx^T⟩ := tr(Θ xx^T) = tr(x^T Θ x) = x^T Θ x.

• Correspondence with the original parameters:

• θ = Σ^{−1} µ and Θ = −(1/2) Σ^{−1}.

• A(θ, Θ) = (1/2)(µ^T Σ^{−1} µ + log|Σ|) = −(1/4) θ^T Θ^{−1} θ − (1/2) log|−2Θ| + const.

• Sometimes called a Gaussian Markov random field (GMRF), esp. when Θij = 0 for (i, j) ∉ E, where E is the edge set of a graph.

Page 76:

Example 29 (Ising model)

• Both a graphical model and an exponential family.

• Used in statistical physics. Allows for complex correlations among discrete variables. Discrete counterpart of the GMRF.

• Ingredients:

• A given graph G = (V, E). V := {1, . . . , n}: vertex set. E ⊂ V^2: edge set.

• Random variables attached to vertices: X = (Xi : i ∈ V).

• Each xi ∈ {−1, +1}, the spin of node i.

• Take X = {−1, +1}^V ≃ {−1, +1}^n.

• T(X) = (Xi, i ∈ V; Xi Xj, (i, j) ∈ E).

• Underlying measure is counting (and h(x) ≡ 1):

pθ(x) = exp(∑_{i∈V} θi xi + ∑_{(i,j)∈E} θij xi xj − A(θ))

Page 77:

Example 30 (Exponential random graph model (ERGM))

• A parametric family of probability distributions on graphs.

• Let X = space of graphs on n nodes.

• Let Ti (G ) be functions on the space of graphs for i = 1, . . . , k.

• Usually subgraph counts:

T1(G) = number of edges
T2(G) = number of triangles
. . .
Tj(G) = number of r-stars (for a given r)
. . .

• Underlying measure: counting measure on graphs.

pθ(G) = exp(∑_{i=1}^k θi Ti(G) − A(θ))

Page 78:

Focus: full-rank canonical (FRC) exponential families.

Proposition 8

In a canonical exponential family,

(a) A is convex on its domain Ω0,

(b) Ω0 is a convex set.

Proof. (Enough to show (a). Convexity of Ω0 follows from convexity of A.)

• Apply the Hölder inequality, with 1/p = α and 1/q = 1 − α. (Exercise) Q.E.D.

• Hölder inequality: For X, Y ≥ 0 a.s.,

E[X^α Y^{1−α}] ≤ (EX)^α (EY)^{1−α}, ∀α ∈ [0, 1].

• Expectation can be replaced with an integral w.r.t. a general measure: for f, g ≥ 0 a.e. ν̃,

∫ f^α g^{1−α} dν̃ ≤ (∫ f dν̃)^α (∫ g dν̃)^{1−α}, ∀α ∈ [0, 1].

Page 79:

Proposition 9

In a FRC exponential family, A is C∞ on int(Ω0) and moreover

Eθ[T] = ∇A(θ), covθ[T] = ∇^2 A(θ)

That is,

∂A/∂θi = Eθ[Ti(X)], ∂^2A/(∂θi ∂θj) = covθ[Ti(X), Tj(X)]

Proof sketch.

• The moment generating function (mgf) of T is

MT(u) := MT(u; θ) := Eθ[e^{⟨u,T⟩}] = ∫ e^{⟨u,T(x)⟩} e^{⟨θ,T(x)⟩−A(θ)} dν̃(x) = e^{A(u+θ)−A(θ)}

• If θ ∈ int Ω0, then MT is finite in a neighborhood of zero: MT(u) < ∞ for ‖u‖2 ≤ ε.

• The DCT implies MT is C^∞ in a neighborhood of 0, and we can interchange the order of differentiation and integration.

Page 80:

• Moment generating function:

MT(u) = Eθ[e^{⟨u,T⟩}] = e^{A(u+θ)−A(θ)}

• We get (fixing θ)

MT(u) (∂A/∂ui)(u + θ) = ∂MT(u)/∂ui = Eθ[(∂/∂ui) e^{⟨u,T⟩}] = Eθ[Ti e^{⟨u,T⟩}],

valid in a neighborhood of 0.

• Evaluating at u = 0 gives the result for the mean. (MT(0) = 1.)

• Getting the covariance is similar. (Exercise)

Remark 2

• Covariance matrices are positive semidefinite, hence ∇^2 A(θ) ⪰ 0.

• This gives another proof of the convexity of A.

Page 81:

Example 31

• X ∼ N(θ, 1):

pθ(x) = (1/√(2π)) e^{−x^2/2} exp(θx − θ^2/2)

• A(θ) = θ^2/2. Hence:

Eθ[X] = A′(θ) = θ, varθ(X) = A′′(θ) = 1
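The identities Eθ[T] = A′(θ) and varθ(T) = A′′(θ) are easy to check by finite differences; here for the Bernoulli family in canonical form (Example 21, natural parameter θ with A(θ) = log(1 + e^θ)), a minimal sketch:

```python
import numpy as np

# Canonical Bernoulli: A(theta) = log(1 + e^theta), T(x) = x.
A = lambda th: np.log1p(np.exp(th))

th, h = 0.7, 1e-4
p = np.exp(th) / (1 + np.exp(th))   # E_theta[X] = sigmoid(theta)

print((A(th + h) - A(th - h)) / (2 * h), p)                       # A' = mean
print((A(th + h) - 2 * A(th) + A(th - h)) / h ** 2, p * (1 - p))  # A'' = var
```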

Page 82:

Mean parameters

• Exponential family

dPθ(x) = exp[〈θ,T (x)〉 − A(θ)] dν(x)

• Alternative parametrization in terms of mean parameter µ:

µ := µ(θ) = Eθ[T (X )]

• Mean parameters are easy to estimate: µ̂ = (1/n) ∑_{i=1}^n T(X^{(i)}).

Example 32 (Two-parameter Gaussian)

• X = R, T(x) = (x, x^2), pθ(x) = exp(θ1 x + θ2 x^2 − A(θ)).

• Natural parameter space: Ω0 = {(θ1, θ2) : θ2 < 0}.
• θ1 = m/σ^2 and θ2 = −1/(2σ^2) in the original parametrization N(m, σ^2).

• Mean parameters:

µ = (µ1, µ2) = (E[X], E[X^2]) = (m, m^2 + σ^2) = (−θ1/(2θ2), θ1^2/(4θ2^2) − 1/(2θ2))

Page 83:

Realizable means

• Interesting general set:

M := {µ ∈ R^d | µ = Ep[T(X)] for some density p w.r.t. ν},

the set of mean parameters realizable by any distribution (absolutely continuous w.r.t. ν).

• M is essentially the convex hull of the support of ν#T = ν ∘ T^{−1}.

• More precisely, int(M) = int(co(supp(ν#T))).

Page 84:

M := {µ ∈ R^d | µ = Ep[T(X)] for some density p w.r.t. ν}

Example 33

• T(X) = (X, X^2) ∈ R^2 and ν the Lebesgue measure:

(µ1, µ2) = (Ep[X], Ep[X^2])

• By nonnegativity of the variance we need to have

M ⊂ M0 := {(µ1, µ2) : µ2 ≥ µ1^2}

• Any (µ1, µ2) ∈ int M0 can be realized by a N(µ1, µ2 − µ1^2).

• bd M0 := {(µ1, µ2) : µ2 = µ1^2} cannot be achieved by a density (why?).

• bd M0 can be approached arbitrarily closely by densities.

• We have M = int M0 and cl(M) = M0.

Page 85:

Example 34 (Multivariate Gaussian)

• T(X) = (X, XX^T).

• Let µ = Ep[X] and Λ = Ep[XX^T] for some density p w.r.t. Lebesgue measure.

• The covariance matrix is PSD, hence Λ − µµ^T ⪰ 0.

• The closure of M (the set of realizable means) is

cl(M) := {(µ, Λ) | Λ ⪰ µµ^T}

• int(cl(M)) = M = {(µ, Λ) | Λ ≻ µµ^T} can be realized by non-degenerate Gaussian distributions N(µ, Λ − µµ^T), a full-rank exponential family.

Page 86:

A remarkable result: anything in int M can be realized by the exponential family. Let

Ω := dom A := {θ ∈ R^d : A(θ) < ∞}

Theorem 6

In a FRC exponential family, assuming A is essentially smooth,

• ∇A : int Ω→ intM is one-to-one and onto.

In other words, ∇A establishes a bijection between int Ω and intM.

• Recall that, as part of the FRC assumption, int Ω ≠ ∅.
• WLOG, we can assume T(x) = x (absorb T into the measure, ν ∘ T^{−1}),

• that is, we work with the standard family

dPθ(x) = exp(⟨θ, x⟩ − A(θ)) dν(x)

• By Proposition 9, Eθ(X ) = ∇A(θ).

• The proof is a tour de France of convex/real analysis.

Page 87:

Proof sketch

∇A : int Ω→ intM is one-to-one (injective) and onto (surjective)

Let Φ := ∇A.

1. Φ is regular on int Ω: DΦ = ∇^2 A ∈ R^{d×d} is a full-rank matrix. True since condition (E2) implies ∇^2 A ≻ 0.

2. Φ is injective: Since condition (E2) implies ∇^2 A ≻ 0, we conclude that A is strictly convex. This in turn implies that ∇A is a strictly monotone operator: ⟨∇A(θ) − ∇A(θ′), θ − θ′⟩ > 0 for θ ≠ θ′.

3. Φ is an open mapping (maps open sets to open sets): (Corollary 3.27 of ?)

U ⊂ R^d open, f ∈ C^1(U, R^d) regular on U =⇒ f an open mapping.

4. By Proposition 9, we have ∇A(int Ω) ⊂ M.

5. But why ∇A(int Ω) ⊂ intM? Follows from ∇A being an open map1

1A continuous map is not necessarily open: x 7→ sin(x) maps (0, 4π) to [−1, 1].87 / 218


Proof sketch

∇A : int Ω → int M is one-to-one (injective) and onto (surjective)

It remains to show that int M ⊂ ∇A(int Ω):

For any µ ∈ int M, we need to find θ ∈ int Ω such that ∇A(θ) = µ.

6. By applying a shift to ν, WLOG it is enough to show this for µ = 0 ∈ int M.

7. WTS: 0 ∈ int M ⟹ ∃θ ∈ int Ω s.t. ∇A(θ) = 0.

In general 0 ∉ int M. So, without employing a shift, all the arguments are applied to θ ↦ A(θ) − 〈µ, θ〉.

88 / 218


Proof sketch

0 ∈ int M ⟹ ∃θ ∈ int Ω s.t. ∇A(θ) = 0.

8. A is lower semi-continuous (lsc) on R^d: lim inf_{θ→θ0} A(θ) ≥ A(θ0). Follows from Fatou's Lemma. (lsc only matters at bd Ω, since A is continuous on int Ω.)

9. Let Γ0(R^d) := {f : R^d → (−∞, ∞] | f is proper, convex, lsc}. (Proper means not identically ∞.)

10. A ∈ Γ0(R^d).

11. A is coercive: lim_{‖θ‖→∞} A(θ) = ∞. To be shown.

12. A is essentially smooth: by assumption.

A function f ∈ Γ0(R^d) is essentially smooth (a.k.a. steep) if

(a) f is differentiable on int dom f ≠ ∅, and
(b) ‖∇f(xn)‖ → ∞ whenever xn → x ∈ bd dom f.

We in fact only need this for x ∈ (dom f) ∩ (bd dom f), the parts of the boundary that are in the domain. In particular, (b) is not needed if dom f is itself open.

89 / 218


0 ∈ int M ⟹ ∃θ ∈ int Ω s.t. ∇A(θ) = 0.

13. A coercive lsc function attains its minimum (over R^d).

14. f ∈ Γ0(R^d) and essentially smooth ⟹ the minimum cannot be attained at bd dom f.

15. If in addition f is strictly convex on int dom f, the minimum is unique.

Lemma 4

Assume that f ∈ Γ0(R^d) is coercive, essentially smooth, and strictly convex on int dom f. Then, f attains its unique minimum at some x ∈ int dom f.

A is coercive, essentially smooth, and strictly convex on int dom A = int Ω.

16. Conclude that A attains its minimum at a unique point θ ∈ int Ω.

17. The necessary 1st-order optimality condition is ∇A(θ) = 0.

18. Done if we show the only remaining piece: coercivity.

90 / 218


A is coercive.

19. For every ε, let Hu,ε := {x ∈ R^d : 〈x, u〉 ≥ ε} and let S = {u : ‖u‖2 = 1}.

20. 0 ∈ int M (and the full-rank assumption) implies that ∃ε > 0 such that

    inf_{u∈S} ν(Hu,ε) > 0,

    i.e., ∃ε > 0 and c ∈ R such that ν(Hu,ε) ≥ e^c for all u ∈ S.

21. Then, for any ρ > 0,

    ∫ e^{〈ρu,x〉} ν(dx) ≥ ∫_{Hu,ε} e^{〈ρu,x〉} ν(dx) ≥ e^{ρε} ν(Hu,ε) ≥ e^{ρε+c}

    That is, A(ρu) ≥ ρε + c.

22. For any θ ≠ 0, taking u = θ/‖θ‖ and ρ = ‖θ‖, we obtain

    A(θ) ≥ ‖θ‖ε + c, ∀θ ∈ R^d \ {0},

    showing that A is coercive.

91 / 218


Side Note: In fact, ∇A : int Ω → int M is a C¹ diffeomorphism (i.e., a bijection which is C¹ in both directions). This follows from Theorem 3.2.8 of ?:

Theorem 7 (Global inverse function theorem)

Let U ⊂ R^d be open and Φ ∈ C¹(U, R^d). The following are equivalent:

• V = Φ(U) is open and Φ : U → V is a C¹ diffeomorphism.

• Φ is injective and regular on U.

Φ = ∇A is injective and regular on int Ω, hence a C¹ diffeomorphism.

92 / 218


MLE in exponential family

• Significance of Theorem 6 for statistical inference:

• Assume X1, . . . , Xn are i.i.d. draws from

  pθ(x) = exp(〈θ, T(x)〉 − A(θ)) h(x).

• The likelihood is

  LX(θ) = ∏_i pθ(Xi) ∝ exp(〈θ, ∑_i T(Xi)〉 − nA(θ)).

• Letting µ̄ = (1/n)∑_{i=1}^n T(Xi), the log-likelihood is

  ℓX(θ) = n[〈θ, µ̄〉 − A(θ)] + const.

• If µ̄ ∈ int M, there exists a unique MLE, the solution of ∇A(θ) = µ̄.

• That is, θ̂MLE = (∇A)⁻¹(µ̄) = ∇A∗(µ̄).

• A∗ is the Fenchel–Legendre conjugate of A.

• If µ̄ ∉ M, then the MLE does not exist. What happens at the boundary can be determined on a case-by-case basis.
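A minimal numerical sketch of this recipe (not from the slides; it assumes numpy and scipy are available, and borrows the two-parameter Gaussian family of Example 35 below, where A(θ) = θ1²/(−4θ2) + (1/2) log(π/(−θ2))): maximize 〈θ, µ̄〉 − A(θ) numerically and compare with the closed-form MLE.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=10_000)   # "true" m = 1, sigma^2 = 4
mu_bar = np.array([x.mean(), (x**2).mean()])      # empirical mean of T(x) = (x, x^2)

def A(theta):
    t1, t2 = theta                                # log-partition, finite iff t2 < 0
    return t1**2 / (-4*t2) + 0.5*np.log(np.pi / (-t2))

def neg_loglik(theta):                            # (1/n) * negative log-likelihood
    return A(theta) - theta @ mu_bar

res = minimize(neg_loglik, x0=np.array([0.0, -1.0]),
               bounds=[(None, None), (None, -1e-8)])  # keep theta2 < 0

m_hat, s2_hat = x.mean(), x.var()                 # closed-form MLE of (m, sigma^2)
theta_closed = np.array([m_hat / s2_hat, -1/(2*s2_hat)])
print(res.x, theta_closed)                        # the two should agree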

93 / 218


Technical remarks

• A is always lower semi-continuous (lsc).

• If Ω = dom A is open, lower semicontinuity implies that A(θ) → ∞ as θ approaches the boundary. (Pick θ0 ∈ bd Ω; then lim inf_{θ→θ0} A(θ) ≥ A(θ0) = ∞.)

• In other words, if Ω is open, A is automatically essentially smooth.

94 / 218


Example 35 (Two-parameter Gaussian)

• X = R, T(x) = (x, x²). pθ(x) = exp(θ1x + θ2x² − A(θ)), ∀x ∈ X.

• A(θ) = log ∫ e^{θ1x + θ2x²} dx. A(θ) < ∞ iff θ2 < 0.

• Natural parameter space: Ω = dom A = {(θ1, θ2) : θ2 < 0}.

• θ1 = m/σ² and θ2 = −1/(2σ²) in the original parametrization (m, σ²).

• m = θ1/(−2θ2) and σ² = 1/(−2θ2).

• Mean parametrization: µ1 = θ1/(−2θ2), µ2 = (θ1/(−2θ2))² + 1/(−2θ2).

• A(θ) = m²/(2σ²) + (1/2) log(2πσ²) = θ1²/(−4θ2) + (1/2) log(π/(−θ2)).

• Easy to verify that ∇A(θ) = (µ1, µ2), and it establishes a bijection between

  int Ω = {(θ1, θ2) : θ2 < 0} ↔ int M = {(µ1, µ2) : µ2 > µ1²}

• Note that, since A is lsc, µ(θ) = ∇A(θ) → ∞ as θ approaches the boundary.

• Show a picture of θ ↦ A(θ) − 〈θ, µ〉 for µ = (0, 1).

95 / 218


Maximum entropy characterization of exponential family

• Not only do exponential families realize any mean, they achieve it with maximum entropy: the solution to

  max_p Ep[− log p(X)] s.t. Ep[T(X)] = µ,

  is given by a density of the form p(x) ∝ exp(〈θ, T(x)〉).

• Discrete case, easy to verify by introducing Lagrange multipliers:

• X = {x1, . . . , xK}
• ν = counting measure and pi = p(xi); let p = (p1, . . . , pK) and ti = T(xi):

  max_p −∑_i pi log pi
  s.t. ∑_i pi ti = µ, pi ≥ 0, ∑_i pi = 1

• Without the constraint ∑_i pi ti = µ, the uniform distribution maximizes the entropy. (See the numerical check below.)
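A small numerical check (not from the slides; it assumes numpy and scipy): solve the discrete maximum-entropy problem directly and verify that the solution has the exponential-family form pi ∝ exp(θ ti), i.e., that log pi is affine in ti.

import numpy as np
from scipy.optimize import minimize

t = np.array([0.0, 1.0, 2.0, 3.0])      # values t_i = T(x_i) on a 4-point space
mu = 1.2                                 # target mean, inside (min t, max t)

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)          # avoid log(0)
    return np.sum(p * np.log(p))

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: p @ t - mu}]
res = minimize(neg_entropy, x0=np.full(4, 0.25), bounds=[(0, 1)]*4,
               constraints=cons, method="SLSQP")
p = res.x

coeffs = np.polyfit(t, np.log(p), 1)     # fit log p_i = theta*t_i - A
resid = np.log(p) - np.polyval(coeffs, t)
print(p, coeffs, resid)                  # residuals of the affine fit are ~0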

96 / 218


Information inequality (Cramer–Rao)

• How small can the variance of an unbiased estimator be? How well can the UMVU do?

• The bound also plays a role in asymptotics.

• Idea: use Cauchy–Schwarz (CS), also called the covariance inequality in this context:

  (EXY)² ≤ (EX²)(EY²), or [cov(X, Y)]² ≤ var(X) var(Y)

• Running assumption: every RV/estimator has finite second moment.

• δ unbiased for some g(θ), ψ any other estimator:

  varθ(δ) ≥ [covθ(δ, ψ)]² / varθ(ψ)   (3)

• Need to get rid of δ on the RHS.

• By cleverly choosing ψ, we can obtain good bounds.

97 / 218


• Assume Pθ+h ≪ Pθ: pθ+h(x) = 0 whenever pθ(x) = 0.

• The (local) likelihood ratio is well-defined (can define it to be 1 for 0/0):

  Lθ,h(X) = pθ+h(X)/pθ(X)

• (= dPθ+h/dPθ, the Radon–Nikodym derivative of Pθ+h w.r.t. Pθ.)

• Change of measure by integrating against the likelihood ratio:

  Eθ[δLθ,h] = ∫_{pθ>0} δ Lθ,h pθ dµ = ∫_{pθ>0} δ pθ+h dµ = Eθ+h[δ]   (4)

  Note that pθ+h is concentrated on {x : pθ(x) > 0}.

98 / 218


Lemma 5 (Hammersley–Chapman–Robbins (HCR))

Assume Pθ+h ≪ Pθ, and let δ be unbiased for g(θ). Then,

  varθ(δ) ≥ [g(θ + h) − g(θ)]² / Eθ(Lθ,h − 1)²

Proof.

• Idea: apply the CS inequality (3) to ψ = Lθ,h − 1.

• Eθ[ψ] = Eθ[Lθ,h] − 1 = 0. (By an application of (4) to δ = 1.)

• Another application of (4) gives²

  covθ(δ, ψ) = Eθ[δψ] = Eθ[δLθ,h] − Eθ[δ] = g(θ + h) − g(θ).

²ψ is not an unbiased estimator of 0, since it depends on θ; it is not a proper estimator. Not a contradiction with "UMVU is uncorrelated with any unbiased estimator of 0".

99 / 218


• Assume that θ (and hence h) is a scalar.

• The likelihood ratio approaches 1 as h → 0:

  lim_{h→0} (1/h)[Lθ,h(X) − 1] = lim_{h→0} {[pθ+h(X) − pθ(X)]/h} / pθ(X) = ∂θ[pθ(X)] / pθ(X) = ∂θ[log pθ(X)],

  called the score function.

• Divide the numerator and denominator of the HCR bound by h², and let h → 0:

  varθ(δ) ≥ lim_{h→0} {[g(θ + h) − g(θ)]²/h²} / {Eθ(Lθ,h − 1)²/h²}

• The numerator goes to [g′(θ)]². If justified in exchanging limit and expectation,

  varθ[δ(X)] ≥ [g′(θ)]² / Eθ[∂θ log pθ(X)]²

100 / 218


Cramer–Rao (formal statement)

• log-likelihood: ℓθ(X) := log pθ(X),
• Score function: ℓ̇θ(X) := ∇θℓθ(X) = ∇θ log pθ(X) ∈ R^d

Theorem 8 (Cramer–Rao lower bound)

Let P be a dominated family with densities having common support S, on an open parameter space Ω ⊂ R^d. Assume:

(a) δ is an unbiased estimator for g(θ) ∈ R.

(b) g is differentiable over Ω, with gradient ġ = ∇θg ∈ R^d,

(c) ℓ̇θ(x) exists for x ∈ S and θ ∈ Ω,

(d) at least for ξ = 1 and ξ = δ, and ∀θ ∈ Ω,

    (∂/∂θi) ∫_S ξ(x) pθ(x) dµ(x) = ∫_S ξ(x) (∂/∂θi) pθ(x) dµ(x), ∀i   (5)

Then,

  varθ(δ) ≥ ġ(θ)ᵀ [I(θ)]⁻¹ ġ(θ)

where I(θ) = Eθ[ℓ̇θ ℓ̇θᵀ] ∈ R^{d×d} is the Fisher information matrix.

101 / 218


• Let us rewrite the assumption:

  (∂/∂θi) ∫_S ξ(x) pθ(x) dµ(x) = ∫_S ξ(x) (∂/∂θi) pθ(x) dµ(x), ∀i   (6)

• Note that the right-hand side is:

  RHS = ∫_S ξ(x) [∂ log pθ(x)/∂θi] pθ(x) dµ(x) = Eθ(ξ(X)[ℓ̇θ(X)]_i)

• Putting the pieces together,

  ∇θEθ[ξ] = Eθ[ξ ℓ̇θ]   (7)

• which is the differential form of the change of measure formula:

Eθ+h[ξ] = Eθ[ξLθ,h]

102 / 218


Proof.

• The score function has zero mean, Eθ[ℓ̇θ] = 0. (Apply (7) with ξ = 1.)

• ġ(θ) = Eθ[δ ℓ̇θ]. (Apply (7) with ξ = δ.)

• Fix some a ∈ R^d. We will apply the CS inequality (3) with ψ = aᵀℓ̇θ.

• Since aᵀℓ̇θ is zero mean:

  aᵀġ(θ) = Eθ[δ aᵀℓ̇θ] = covθ(δ, aᵀℓ̇θ).

• Similarly,

  varθ(aᵀℓ̇θ) = Eθ[aᵀℓ̇θ ℓ̇θᵀ a] = aᵀI(θ)a

• The CS inequality (3) with ψ = aᵀℓ̇θ gives:

  varθ(δ) ≥ [covθ(δ, aᵀℓ̇θ)]² / varθ(aᵀℓ̇θ) = (aᵀġ(θ))² / aᵀI(θ)a.

• Almost done. The problem reduces to (Exercise)

  sup_{a≠0} (aᵀv)² / aᵀBa = vᵀB⁻¹v.

  Hint: Since B ≻ 0, B^{−1/2} is well-defined; take z = B^{1/2}a. Q.E.D.

103 / 218


• Regularity conditions for interchanging the integral and derivative are key; so is unbiasedness.

• Under the same assumptions (recall ℓθ = log pθ(X)),

  I(θ) = Eθ[−ℓ̈θ] = Eθ[−∇²θ ℓθ]

• I(θ) measures the expected local curvature of the likelihood.

• Attainment of the CRB is related to attainment of Cauchy–Schwarz: Wijsman (1973) shows that it happens if and only if we are in the exponential family.

• Fisher info. is not invariant to reparametrization:

  θ = θ(µ) ⟹ I(µ) = [θ′(µ)]² I(θ)

• CRB is invariant to reparametrization. (Exercise.)

• Fisher info. is additive over independent sampling.

104 / 218


Multiple parameters

What if g : Ω → R^m where Ω ⊂ R^d?

• Let Jθ = (∂gi/∂θj) ∈ R^{m×d} be the Jacobian of g.

• Then, under similar assumptions (notation: Iθ = I(θ)):

  covθ(δ) ⪰ Jθ Iθ⁻¹ Jθᵀ

  for any δ unbiased for g(θ).

• A ⪰ B means A − B ⪰ 0, i.e., A − B is positive semidefinite (PSD).

• Proof: fix u ∈ R^m and apply the 1-D theorem to uᵀδ. (Exercise)

105 / 218


Example 36

• Xi iid∼ N(θ, σ²), i = 1, . . . , n,

• σ² is fixed, g(θ) = θ.

  ℓθ(X) = log pθ(X) = ∑_{i=1}^n log pθ(Xi) = −(1/(2σ²)) ∑_{i=1}^n (Xi − θ)² + const.

• Differentiating, we get the score function

  ℓ̇θ(X) = (∂/∂θ) log pθ(X) = (1/σ²) ∑_{i=1}^n (Xi − θ)  ⟹  ℓ̈θ(X) = −n/σ²,

• whence I(θ) = n/σ².

• The CRB is varθ(δ) ≥ σ²/n and is achieved by the sample mean.
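A quick simulation (not from the slides; it assumes numpy) illustrating Example 36: the sample mean is unbiased and its variance matches the Cramer–Rao bound σ²/n = 1/I(θ).

import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, reps = 2.0, 3.0, 50, 200_000

x = rng.normal(theta, sigma, size=(reps, n))
means = x.mean(axis=1)              # delta(X) = sample mean, one per replication

crb = sigma**2 / n                  # = 1 / I(theta), with I(theta) = n / sigma^2
print(means.var(), crb)             # empirical variance ~ 0.18 = CRB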

106 / 218


Example 37 (Exponential families)

• Xi ∼ pθ(xi) = h(xi) exp(〈θ, T(xi)〉 − A(θ)), i = 1, . . . , n.

  ℓθ(X) = log pθ(X1, . . . , Xn) = 〈θ, ∑_i T(Xi)〉 − nA(θ) + const.,

• whence I(θ) = Eθ[−ℓ̈θ(X)] = n∇²A(θ) = n covθ[T].

• Consider the 1-D case and n = 1.

• Want an unbiased estimate of the mean parameter: µ(θ) = Eθ[T] = A′(θ).

• The CRB is

  [µ′(θ)]²/I(θ) = [A″(θ)]²/A″(θ) = A″(θ) = varθ(T),

  i.e., it is attained by T.

• General case: T̄ := (1/n)∑_{i=1}^n T(Xi) attains the CRB for the mean parameter:

  covθ(δ) ⪰ covθ(T̄), ∀δ s.t. Eθ(δ) = Eθ(T̄).

107 / 218


Example 38

• Xi iid∼ Poi(λ), i = 1, . . . , n.

• Exponential family with T(X) = X and mean parameter λ,

• hence, the sample mean δ(X) = (1/n)∑_i Xi achieves the CRB for λ.

• What if we want an unbiased estimate of g(λ) = λ²?

• Since I(λ) = n/varλ[X1] = n/λ (why?),

• the CRB = [2λ]²/(n/λ) = 4λ³/n.

• The estimator T1 = (1/n)∑_{i=1}^n Xi(Xi − 1) is unbiased for λ² and

  varλ(T1) = 4λ³/n + 2λ²/n > CRB

• S = ∑_i Xi is complete sufficient, hence

• the Rao–Blackwellized estimator T2 = E[T1|S] = S(S − 1)/n² is UMVU.

• The CRB is still not attained, since (exercise)

  varλ(T2) = 4λ³/n + 2λ²/n² > CRB.

A vector of independent Poisson variables, conditioned on their sum, has a multinomial distribution. In this case, Mult(S, (1/n, . . . , 1/n)).
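A simulation sketch (not from the slides; it assumes numpy) for Example 38: compare the variances of T1 and the Rao–Blackwellized T2 = S(S − 1)/n² against the stated formulas and the CRB.

import numpy as np

rng = np.random.default_rng(2)
lam, n, reps = 1.5, 20, 400_000

x = rng.poisson(lam, size=(reps, n))
t1 = (x * (x - 1)).mean(axis=1)            # unbiased for lam^2
s = x.sum(axis=1)
t2 = s * (s - 1) / n**2                    # E[T1 | S], also unbiased for lam^2

crb = 4 * lam**3 / n
print(t1.var(), 4*lam**3/n + 2*lam**2/n)       # matches 4*lam^3/n + 2*lam^2/n
print(t2.var(), 4*lam**3/n + 2*lam**2/n**2, crb)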

108 / 218


Average vs. maximum risk optimality

Bayesian Methods:

• Trouble comparing estimators based on whole risk functions θ ↦ R(θ, δ).

• The Bayesian approach: reduce to a (weighted) average risk.

• Assumes that the parameter is a random variable Θ with some distribution Λ, called the prior, having density π(θ) (w.r.t., say, Lebesgue measure).

• The choice of the prior is important in the Bayesian framework.

• Frequentist perspective: Bayes estimators have desirable properties.

109 / 218


• Recall the decision-theoretic framework:

• Family of distributions P = {Pθ : θ ∈ Ω}.
• Bayesian framework: interpret Pθ as the conditional distribution of X given Θ = θ,

  Pθ(A) = P(X ∈ A | Θ = θ)

• Together with the marginal (prior) distribution of Θ, we have the joint distribution of (Θ, X).

• Recall the risk, defined as

  R(θ, δ) = Eθ[L(θ, δ(X))] = E[L(θ, δ(X)) | Θ = θ],

  or in other words, R(Θ, δ) = E[L(Θ, δ(X)) | Θ].

• The Bayes risk is

r(Λ, δ) = E[R(Θ, δ)] = E[L(Θ, δ(X ))].

110 / 218


• Write p(x|θ) = pθ(x) for the density of Pθ.

• Recall that

  R(θ, δ) = ∫ L(θ, δ(x)) p(x|θ) dx.

  Then,

  r(Λ, δ) = ∫ π(θ) R(θ, δ) dθ = ∫ π(θ) [∫ L(θ, δ(x)) p(x|θ) dx] dθ.

• We rarely use this explicit form.

111 / 218


• A Bayes rule or estimator w.r.t. Λ, denoted δΛ, is a minimizer of the Bayes risk:

  r(Λ, δΛ) = min_δ r(Λ, δ)

• Depends both on the prior Λ and the loss L.

Theorem 9 (Existence of Bayes estimators)

Assume that

(a) ∃δ′ with r(Λ, δ′) < ∞,
(b) the posterior risk has a minimizer for µ-almost all x, that is,

    δΛ(x) := argmin_{a∈A} E[L(Θ, a)|X = x]

    is well-defined for µ-almost all x. (Measurable selection.)

Then, δΛ is a Bayes rule.

Proof. Condition (a) guarantees that we can use Fubini's theorem.

• By definition of δΛ, for any δ we have E[L(Θ, δ)|X] ≥ E[L(Θ, δΛ)|X].

• Taking expectations and using smoothing finishes the proof.

112 / 218


• The posterior risk can be computed from the posterior distribution of Θ given X = x. Bayes' rule gives

  π(θ|x) = p(x|θ)π(θ) / m(x) ∝ p(x|θ)π(θ)

  where m(x) = ∫ π(θ)p(x|θ) dθ is the marginal density of X.

• The posterior is proportional to the prior times the likelihood.

Example 39

Bayes estimators for two simple loss functions:

• Quadratic (or ℓ2) loss, L(θ, a) = (g(θ) − a)²:

  δΛ(x) = argmin_a E[(g(Θ) − a)²|X = x] = E[g(Θ)|X = x].

  For g(θ) = θ this reduces to the posterior mean.

• ℓ1 loss, L(θ, a) = |θ − a|: here δΛ(x) = median(Θ|X = x) is one possible Bayes estimator. (Not unique in this case.)

113 / 218


Example 40 (Binomial)

• X ∼ Bin(n, θ).

• The PMF is p(x|θ) = (n choose x) θ^x (1 − θ)^{n−x}.

• Put a Beta prior on Θ, with hyperparameters α, β > 0:

  π(θ) = [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1}(1 − θ)^{β−1} ∝ θ^{α−1}(1 − θ)^{β−1}

• We have π(θ|x) ∝ pθ(x)π(θ) ∝ θ^{x+α−1}(1 − θ)^{n−x+β−1},

• showing that Θ|X = x ∼ Beta(α + x, n − x + β), whence

  δΛ(x) := E[Θ|X = x] = (x + α)/(n + α + β) = (1 − λ)(x/n) + λ · α/(α + β),

  where λ = (α + β)/(n + α + β).

• Note: α/(α + β) is the prior mean, and x/n is the MLE (and unbiased estimator of the mean parameter).

• No coincidence; this happens in a general exponential family. (See the numerical check below.)
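A short sketch (not from the slides; plain Python) of Example 40: the Beta posterior mean is exactly the stated convex combination of the MLE x/n and the prior mean α/(α + β).

alpha, beta, n, x = 2.0, 3.0, 20, 14     # illustrative hyperparameters and data

post_mean = (x + alpha) / (n + alpha + beta)
lam = (alpha + beta) / (n + alpha + beta)
blend = (1 - lam) * (x / n) + lam * alpha / (alpha + beta)
print(post_mean, blend)                  # identical: 0.64 = 0.64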

114 / 218


Example 41 (Normal location family)

• Assume that Xi|Θ = θ ∼ N(θ, σ²).

• Put a Gaussian prior on Θ: Θ ∼ N(µ, b²).

• The model is equivalent to

  Xi = Θ + wi, wi iid∼ N(0, σ²), for i = 1, . . . , n

• Reparametrize in terms of the precisions τ² = 1/b² and γ² = 1/σ².

• (Θ, X1, . . . , Xn) is jointly Gaussian, and the posterior is

  Θ|X = x ∼ N(δΛ(x), 1/τn²), with δΛ(x) = (1 − λn)x̄ + λnµ,

  where

  x̄ = (1/n)∑ xi,  τn² = nγ² + τ²,  λn = τ²/τn² ∈ [0, 1]

• Continued ...

115 / 218


• With

  x̄ = (1/n)∑ xi,  τn² = nγ² + τ²,  λn = τ²/τn² ∈ [0, 1],

• we have

  Θ|X = x ∼ N(δΛ(x), 1/τn²)

• The posterior mean δΛ(x), i.e., the Bayes rule for ℓ2 loss, is

  δΛ(x) := (1 − λn)x̄ + λnµ,

  which is a convex combination of x̄ and µ, and we have

• δΛ(x) → x̄ if n → ∞ or SNR = γ²/τ² → ∞.

• δΛ(x) → µ if SNR = γ²/τ² → 0.
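A numerical sketch (not from the slides; it assumes numpy) of Example 41: compute the posterior mean and variance from the precision formulas above.

import numpy as np

rng = np.random.default_rng(3)
mu, b, sigma, n = 0.0, 1.0, 2.0, 25        # prior N(mu, b^2), noise N(0, sigma^2)
tau2, gamma2 = 1/b**2, 1/sigma**2          # precisions

theta = rng.normal(mu, b)                  # draw the "true" parameter
x = rng.normal(theta, sigma, size=n)

tau2_n = n*gamma2 + tau2                   # posterior precision
lam = tau2 / tau2_n
post_mean = (1 - lam)*x.mean() + lam*mu    # shrinks xbar toward the prior mean
post_var = 1 / tau2_n
print(theta, post_mean, post_var)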

116 / 218


Conjugate priors

• The two examples above are examples of conjugacy.

• A family Q = {π(·)} of priors is conjugate to a family of likelihoods P = {p(· | θ)} if the corresponding posteriors also belong to Q.

• Examples of conjugate families:

  Q: normal   beta      Dirichlet
  P: normal   binomial  multinomial

Example 42 (Exponential families)

We have the following conjugate pairs:

  p(x|θ) = exp(〈η(θ), T(x)〉 − A(θ))
  qa,b(θ) = exp(〈a, η(θ)〉 + bA(θ) − B(a, b))

117 / 218


Example 43 (Improper priors)

• Xi ∼ N(θ, σ²), i = 1, . . . , n.

• Is δ(x) = (1/n)∑ xi a Bayes estimator w.r.t. some prior?

• Not if we require proper priors (finite measures), ∫ π(θ)dθ < ∞, in which case π can be normalized to integrate to 1.

• We would need a uniform (proper) prior on the whole of R, which does not exist.

• An improper prior can still be used if the posterior is well-defined. (Generalized Bayes.)

• Alternatively, δ(x) is the limit of Bayes rules for a sequence of proper priors. (See also the Beta-Binomial example.)

118 / 218


Comment on the uniqueness of the Bayes estimator.

Theorem 10 (TPE 4.1.4)

Let Q be the marginal distribution of X, that is, Q(A) = ∫ Pθ(X ∈ A) dΛ(θ).

Recall that δΛ is (a) Bayes estimator. Assume that

• the loss function is strictly convex,

• r(Λ, δΛ) < ∞,

• Q-a.e. implies Pθ-a.e.; equivalently, Pθ ≪ Q for all θ ∈ Ω.

Then, there is a unique Bayes estimator.

119 / 218


Minimax criterion

• Instead of averaging the risk, look at the worst-case or maximum risk:

  R(δ) := sup_{θ∈Ω} R(θ, δ)

• More in accord with an adversarial nature. (A zero-sum game.)

Definition 13

An estimator δ∗ is minimax if min_{δ∈D} R(δ) = R(δ∗).

• An effective strategy for finding minimax estimators is to look among the Bayes estimators:

• The minimax problem is: inf_δ sup_θ R(θ, δ).

• We generalize this to: inf_δ sup_Λ r(Λ, δ)

120 / 218


• Recall: δΛ is a Bayes estimator for the prior Λ, with Bayes risk

  rΛ = inf_δ r(Λ, δ) = r(Λ, δΛ).

  (Last equality: assume rΛ is finite and is achieved.)

• Can order priors based on their Bayes risk:

Definition 14

Λ∗ is a least favorable prior if rΛ∗ ≥ rΛ for any prior Λ.

• For a least favorable prior, we have

  rΛ∗ = sup_Λ rΛ = sup_Λ inf_δ r(Λ, δ) ≤ inf_δ sup_Λ r(Λ, δ) =: inf_δ r(δ),

  where r(δ) = sup_Λ r(Λ, δ) is a generalization of the maximum risk R(δ).

• Interested in situations where equality holds.

121 / 218


Characterization of minimax estimators

Theorem 11 (TPE 5.1.4)

Assume that δΛ is Bayes for Λ, and r(Λ, δΛ) = R(δΛ). Then,

• δΛ is minimax.

• Λ is least favorable.

• If δΛ is the unique Bayes estimator (a.e. P), then it is the unique minimax estimator.

Proof of minimaxity of δΛ:

• The maximum risk is always lower-bounded by the Bayes risk:

  R(δ) = sup_{θ∈Ω} R(θ, δ) ≥ ∫ R(θ, δ) dΛ(θ) = r(Λ, δ), ∀δ

• R(δ) ≥ r(Λ, δ) ≥ rΛ = R(δΛ). (Last equality by assumption.)

122 / 218


Rest of the proof:

• R(δ) ≥ r(Λ, δ) ≥ rΛ = R(δΛ). (Last equality by assumption.)

• Uniqueness of the Bayes rule makes the second inequality strict for δ ≠ δΛ, showing the uniqueness of the minimax rule.

• On the other hand,

  rΛ′ ≤ r(Λ′, δΛ) ≤ R(δΛ) = rΛ,

  showing that Λ is least favorable.

123 / 218


• A decision rule δ is called an equalizer if it has constant risk:

  R(θ′, δ) = R(θ, δ), for all θ, θ′ ∈ Ω.

• Let ω(δ) := {θ : R(θ, δ) = R(δ)} = argmax_θ R(θ, δ).

• (δ is an equalizer iff ω(δ) = Ω.)

Corollary 4 (TPE 5.1.5–6)

(a) A Bayes estimator with constant risk (i.e., an equalizer) is minimax.

(b) A Bayes estimator δΛ is minimax if Λ(ω(δΛ)) = 1.

• Both of these conditions are sufficient, not necessary.

• (b) is weaker than (a).

• Strategy: Find a prior Λ whose support is contained in argmaxθ R(θ, δΛ).

124 / 218


Example 44 (Bernoulli, continuous parameter space)

• X ∼ Ber(θ) with quadratic loss, and Θ ∈ [0, 1].

• Given a prior Λ on [0, 1], let m1 = E[Θ] and m2 = E[Θ²].

• Frequentist risk (writing δx = δ(x) for x = 0, 1):

  R(θ, δ) = (δ0 − θ)²(1 − θ) + (δ1 − θ)²θ
          = θ²[1 + 2(δ0 − δ1)] + θ(δ1² − δ0² − 2δ0) + δ0²

• Bayes risk

  r(Λ, δ) = E[R(Θ, δ)] = m2[1 + 2(δ0 − δ1)] + m1(δ1² − δ0² − 2δ0) + δ0²

• The Bayes decision rule is found by minimizing r(Λ, δ) w.r.t. δ0, δ1:

  δ1∗ = m2/m1,  δ0∗ = (m1 − m2)/(1 − m1).

125 / 218


• The Bayes decision rule is found by minimizing r(Λ, δ) w.r.t. δ0, δ1:

  δ1∗ = m2/m1,  δ0∗ = (m1 − m2)/(1 − m1).

• Aside: since p(θ|x) = C θ^x(1 − θ)^{1−x} π(θ), check that δx∗ = E[Θ|X = x], x = 0, 1, as it should be:

  δx∗ = ∫ θ^{x+1}(1 − θ)^{1−x} π(θ)dθ / ∫ θ^x(1 − θ)^{1−x} π(θ)dθ.

• A general rule δ is an equalizer, i.e., R(θ, δ) does not depend on θ, iff

  δ1 − δ0 = 1/2 and δ1² − δ0² − 2δ0 = 0.

• These equations have a single solution: δ0 = 1/4 and δ1 = 3/4. (There is a unique equalizer rule.)

• Equalizer Bayes rule: need 3/4 = m2/m1 and 1/4 = (m1 − m2)/(1 − m1).

• Solving: m1∗ = 1/2 and m2∗ = 3/8.

• Need a prior Λ with these moments; Λ = Beta(1/2, 1/2) fits the bill.

• This is a least favorable prior.

• The corresponding Bayes, hence minimax, risk is 1/16. (See the numerical check below.)
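A quick check (not from the slides; it assumes numpy) of Example 44: the rule (δ0, δ1) = (1/4, 3/4) has constant risk 1/16, and Beta(1/2, 1/2) has the required moments.

import numpy as np

d0, d1 = 0.25, 0.75
theta = np.linspace(0, 1, 11)
risk = (d0 - theta)**2 * (1 - theta) + (d1 - theta)**2 * theta
print(risk)                     # constant vector of 0.0625 = 1/16

a = b = 0.5                     # Beta(1/2, 1/2) moments: m1 = 1/2, m2 = 3/8
m1 = a / (a + b)
m2 = m1**2 + a*b / ((a + b)**2 * (a + b + 1))
print(m1, m2)                   # 0.5, 0.375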

126 / 218


• The above can be generalized to an i.i.d. sample of size n: X1, . . . , Xn iid∼ Ber(θ), where Beta(√n/2, √n/2) is least favorable and the associated minimax risk is 1/(4(√n + 1)²).

• Compare with the risk of the sample mean, R(θ, X̄) = θ(1 − θ)/n.

127 / 218


Example 45 (Bernoulli, discrete parameter space)

• Let X ∼ Ber(θ) and Ω = {1/3, 2/3} =: {a, b}.
• Take L(θ, δ) = (θ − δ)².

• Any (nonrandomized) decision rule is specified by a pair of numbers (δ0, δ1).

• Any prior π is specified by a single number πa = P(Θ = a) ∈ [0, 1].

• Frequentist risk:

  R(θ, δ) = (δ0 − θ)²(1 − θ) + (δ1 − θ)²θ

• Bayes risk r(π, δ) = E[R(Θ, δ)]:

  r(π, δ) = πa R(a, δ) + (1 − πa) R(b, δ)

128 / 218


• Take derivatives w.r.t. δ0, δ1, set them to zero, and find the Bayes rule:

  δ0∗ = [aπa(1 − a) + b(1 − b)(1 − πa)] / [(1 − a)πa + (1 − b)(1 − πa)]

  δ1∗ = [a²πa + b²(1 − πa)] / [aπa + b(1 − πa)]

• For a = 1/3 = 1 − b,

  δ0∗ = 2/(3(πa + 1)) and δ1∗ = (4 − 3πa)/(6 − 3πa).

• Equalizer rule, one for which R(a, δ) = R(b, δ):

  (a + b)[2(δ0 − δ1) + 1] + δ1² − δ0² − 2δ0 = 0.

• A Bayes rule that is also an equalizer occurs for πa∗ = 1/2.

• This is the least favorable prior.

• The corresponding rule (δ0∗, δ1∗) = (4/9, 5/9) is minimax.

129 / 218


Geometry of Bayes and Minimax

• Risk body for the two-parameter Bernoulli problem, Ω = {1/3, 2/3}.
• Deterministic rules, Bayes rules, minimax rule.

[Figure: the risk set {(R(1/3, δ), R(2/3, δ)) : δ}, with the deterministic rules, the Bayes rules, and the minimax rule marked; axis tick residue omitted.]

130 / 218


Geometry of Bayes and minimax for finite Ω

• Assume Ω = {θ1, . . . , θk} finite, and consider the risk set (or body)

  S = {(y1, . . . , yk) | yi = R(θi, δ) for some δ} ⊂ R^k.

• Alternatively, define ρ : D → R^k by

  ρ(δ) = (R(θ1, δ), . . . , R(θk, δ)),

  where D is the set of randomized decision rules.

• S is the image of D under ρ, i.e., S = ρ(D).

Lemma 6

S is a convex set (with randomized estimators).

Proof. For δ, δ′ ∈ D and a ∈ [0, 1], we can form a randomized decision rule δa such that S ∋ R(θ, δa) = aR(θ, δ) + (1 − a)R(θ, δ′). (Exercise.)

131 / 218


• Every prior Λ corresponds to a vector λ = (λ1, . . . , λk) ∈ R^k via Λ({θi}) = λi. Note that λ lies in the (k − 1)-simplex,

  ∆ := {(λ1, . . . , λk) ∈ R^k₊ : ∑_{i=1}^k λi = 1}

• The Bayes risk is

  r(Λ, δ) = E[R(Θ, δ)] = ∑_{i=1}^k λi R(θi, δ) = λᵀρ(δ)

• Hence finding the Bayes rule is equivalent to

  inf_{δ∈D} r(Λ, δ) = inf_{δ∈D} λᵀρ(δ) = inf_{y∈S} λᵀy,

  a convex problem in R^k. The minimax problem is

  inf_{δ∈D} ‖ρ(δ)‖∞ = inf_{y∈S} ‖y‖∞.

  Finding the least favorable prior corresponds to sup_{λ∈∆} [inf_{y∈S} λᵀy].

132 / 218


Admissibility of Bayes rules

• In general, a unique (a.e. P) Bayes rule is admissible (TPE 5.2.4).

• Complete answer to the admissibility question for finite parameter spaces:

Proposition 10

Assume Ω = {θ1, . . . , θk} and that δλ is the Bayes rule for λ. If λi > 0 for all i, then δλ is admissible.

Proof. If δλ is inadmissible, there is δ such that

  R(θi, δ) ≤ R(θi, δλ), ∀i,

with strict inequality for some j. Then,

  ∑_i λi R(θi, δ) < ∑_i λi R(θi, δλ),

contradicting the Bayes optimality of δλ. Q.E.D.

133 / 218


Proposition 11

Assume Ω = {θ1, . . . , θk} and δ admissible. Then δ is Bayes w.r.t. some prior λ.

Proof.

• Let x := ρ(δ), the risk vector of δ, and Qx := {y ∈ R^k | yi ≤ xi} \ {x}.

• Qx is convex. (Removing an extreme point from a convex set preserves convexity.)

• Admissibility means Qx ∩ S = ∅.

• Two non-empty disjoint convex sets in R^k can be separated by a hyperplane:

  ∃u ≠ 0 s.t. uᵀz ≤ uᵀy, for all z ∈ Qx and y ∈ S.

• Suppose we can choose u to have nonnegative coordinates. (Proof by contradiction. (Exercise.))

134 / 218


• Since u ≠ 0, we can set λ = u/(∑_i ui) ∈ ∆_{k−1}.

• λᵀz ≤ inf_{y∈S} λᵀy = rλ, ∀z ∈ Qx.

• Taking {zn} ⊂ Qx such that zn → x, we obtain λᵀx ≤ rλ.

• But, by definition of the optimal Bayes risk, rλ ≤ λᵀx; hence rλ = λᵀx.

135 / 218


M-estimation

• Setup: an i.i.d. sample of size n from a model P(1) = {Pθ : θ ∈ Ω} on sample space X, i.e.,

  X1, . . . , Xn iid∼ Pθ.

• The full model is actually P(n) = {Pθ⊗n : θ ∈ Ω}, with sample space X^n.

• M-estimators: those obtained as solutions of optimization problems.

Definition 15

Given a family of functions mθ : X → R, for θ ∈ Ω, the corresponding M-estimator based on X1, . . . , Xn is

  θ̂n := θ̂n(X1, . . . , Xn) := argmax_{θ∈Ω} (1/n)∑_{i=1}^n mθ(Xi)

• Often write Mn(θ) := (1/n)∑_{i=1}^n mθ(Xi), a random function.

136 / 218


• An alternative approach is to specify θ̂ as a Z-estimator, i.e., the solution of a set of estimating equations:

  Ψn(θ) := (1/n)∑_{i=1}^n ψ(Xi, θ) = 0.

• Often the 1st-order optimality conditions for an M-estimator produce a set of estimating equations. (Simplistic in general, ignoring the possibility of constraints imposed by Ω.)

137 / 218


Example 46

1. mθ(x) = −(x − θ)², then Mn(θ) = −(1/n)∑_{i=1}^n (Xi − θ)², giving θ̂ = X̄.

2. mθ(x) = −|x − θ|, then Mn(θ) = −(1/n)∑_{i=1}^n |Xi − θ|, giving θ̂ = median(X1, . . . , Xn).

3. mθ(x) = log pθ(x), then Mn(θ) = (1/n)∑_{i=1}^n log pθ(Xi), giving the maximum likelihood estimator (MLE).

• In a location family with pθ(x) = C exp(−β|x − θ|^p), the MLE is equivalent to an M-estimator with mθ(x) = −|x − θ|^p:
  • p = 2, Gaussian distribution (Case 1);
  • p = 1, Laplace distribution (Case 2).

• The corresponding Z-estimator forms of 1. and 2. are obtained for ψθ(x) = x − θ and ψθ(x) = sign(x − θ), by differentiation (or sub-differentiation) of mθ. (See the numerical sketch below.)
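A small sketch (not from the slides; it assumes numpy and scipy) of Example 46: maximizing the two M-criteria numerically recovers the sample mean and the sample median.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.laplace(loc=1.0, scale=1.0, size=501)

sq = minimize_scalar(lambda t: np.mean((x - t)**2))     # m_theta(x) = -(x-theta)^2
ab = minimize_scalar(lambda t: np.mean(np.abs(x - t)))  # m_theta(x) = -|x-theta|

print(sq.x, x.mean())           # ~ equal
print(ab.x, np.median(x))       # ~ equal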

138 / 218


Example 47 (Method of Moments (MOM))

• Find θ̂ by matching empirical and population (true) moments:

  Eθ[X1^k] = (1/n)∑_{i=1}^n Xi^k, k = 1, 2, . . . , d.

• Usually d is the dimension of the parameter θ (d equations in d unknowns).

• A set of estimating equations with ψθ(x) = (x^k − Eθ[X1^k])_{k=1,...,d}.

• A generalized version of MOM solves

  Eθ[ϕk(X1)] = (1/n)∑_{i=1}^n ϕk(Xi), k = 1, 2, . . . , d,

  for some collection of functions {ϕk}, corresponding to the Z-estimator with ψθ(x) = (ϕk(x) − Eθ[ϕk(X1)])_{k=1,...,d}.

139 / 218


• In canonical exponential families,

  Xi iid∼ pθ(x) ∝ exp(〈θ, T(x)〉 − A(θ)),

  ML and MOM are equivalent.

• The MLE is the M-estimator associated with

  mθ(x) = log pθ(x) = 〈θ, T(x)〉 − A(θ),

  hence

  Mn(θ) = (1/n)∑_i [〈θ, T(Xi)〉 − A(θ)] = 〈θ, T̄〉 − A(θ),

  where T̄ = (1/n)∑_i T(Xi) is the empirical mean of the sufficient statistic.

• The MLE is

  θ̂mle = argmax_{θ∈Ω} 〈θ, T̄〉 − A(θ)

• Setting derivatives to zero gives T̄ = ∇A(θ̂mle). (First-order optimality.)

• Since Eθ[T] = ∇A(θ), the MLE is the solution in θ of

  Eθ[T] = T̄,

  which is a MOM estimator. (If you will, Eθ[T(X1)] = (1/n)∑_i T(Xi).)

140 / 218


Sidenote

• Recall that µ = ∇A(θ) is the mean parameterization.

• The inverse of this map is θ = ∇A∗(µ), where

  A∗(µ) = sup_{θ∈Ω} 〈θ, µ〉 − A(θ)

  is the conjugate dual of A. (Exercise.)

• So θ̂mle = ∇A∗(T̄), assuming that T̄ ∈ int(dom(A∗)).

141 / 218


Asymptotics or large-sample theory: zeroth order (consistency)

• Statistical behavior of estimators, in particular M-estimators, as n → ∞.

• For concreteness, consider the sequence

  θ̂n = θ̂n(X1, . . . , Xn) = argmax_{θ∈Ω} (1/n)∑_{i=1}^n mθ(Xi)

Definition 16

Let X1, . . . , Xn iid∼ Pθ0. We say that θ̂n is consistent if θ̂n →p θ0.

• Equivalently,

  ∀ε > 0, P(d(θ̂n, θ0) > ε) → 0, as n → ∞.

• Usually d(θ̂n, θ0) = ‖θ̂n − θ0‖ for Euclidean parameter spaces Ω ⊂ R^d.

• For d = 1, d(θ̂n, θ0) = |θ̂n − θ0|.

142 / 218


• We write Zn = op(1) if Zn →p 0.

• By the WLLN, for any fixed θ we have (assuming Eθ0|mθ(X1)| < ∞)

  (1/n)∑_{i=1}^n mθ(Xi) →p Eθ0[mθ(X1)]

• Letting M(θ) := Eθ0[mθ(X1)]: for any fixed θ, Mn(θ) →p M(θ).

• If θ0 is the maximizer of M over Ω, we hope that θ̂n, the maximizer of Mn over Ω, approaches it.

• However, pointwise convergence of Mn to M is not enough; we need the uniform convergence

  ‖Mn − M‖∞ := sup_{θ∈Ω} |Mn(θ) − M(θ)|

  to go to zero in probability.

143 / 218


Why uniform convergence?

• Even a nonrandom example is enough:

• Here Mn(t) → M(t) pointwise, but Mn is maximized at tn = 1/n with Mn(tn) = 1, while M is maximized at t0 = 1 with M(t0) = 1/2.

  Mn(t) = 1 − n|t − 1/n|   for |t| < 2/n,
          1/2 − |t − 1|    for 1/2 < |t| < 3/2,
          0                otherwise

  M(t)  = 1/2 − |t − 1|    for 1/2 < |t| < 3/2,
          0                otherwise

[Figure: plot of Mn (a spike of height 1 near t = 0 plus a tent of height 1/2 at t = 1) and M (the tent alone); axis residue omitted.]

144 / 218


Theorem 12 (AS 5.7 modified)

Let Mn be random functions, and let M be a fixed function of θ. Let

  θ̂n ∈ argmax_{θ∈Ω} Mn(θ)   (cond-M)

be well-defined. Assume:

(a) ‖Mn − M‖∞ →p 0. (Uniform convergence.)

(b) (∀ε > 0) sup_{θ: d(θ,θ0)≥ε} M(θ) < M(θ0). (M has a well-separated maximum.)

Then θ̂n is consistent, i.e., θ̂n →p θ0.

• By optimality of θ̂n for Mn, we have Mn(θ0) ≤ Mn(θ̂n), or

  0 ≤ Mn(θ̂n) − Mn(θ0)   (basic inequality)

• By adding and subtracting, we get

  M(θ0) − M(θ̂n) ≤ [Mn(θ̂n) − M(θ̂n)] − [Mn(θ0) − M(θ0)] ≤ 2‖Mn − M‖∞

  (We keep random deviations from the mean on one side and fixed functions on the other side.)

145 / 218


• Fix some ε > 0, and let

  η(ε) := M(θ0) − sup_{d(θ,θ0)≥ε} M(θ) = inf_{d(θ,θ0)≥ε} [M(θ0) − M(θ)]

• By assumption (b), η(ε) > 0.

• Since d(θ̂n, θ0) ≥ ε implies M(θ0) − M(θ̂n) ≥ η(ε), we have

  P(d(θ̂n, θ0) ≥ ε) ≤ P(M(θ0) − M(θ̂n) ≥ η(ε)) ≤ P(2‖Mn − M‖∞ ≥ η(ε)) → 0

  by assumption (a). Q.E.D.

Remark 3

A key step is bounding Mn(θ̂n) − M(θ̂n) by ‖Mn − M‖∞.

Exercise: Condition (cond-M) can be replaced with Mn(θ̂n) ≥ Mn(θ0) − op(1).

146 / 218


• Sufficient conditions for uniform convergence can be found in Keener, Chapter 9, Theorem 9.2.

• For example, we have (a) if
  • Ω is compact,
  • θ ↦ mθ(x) is continuous (for a.e. x), and
  • E‖m∗(X1)‖∞ < ∞, where ‖m∗(X1)‖∞ = sup_{θ∈Ω} |mθ(X1)|.

• For example, we have (b) if
  • Ω is compact,
  • M is continuous, and
  • M has a unique maximizer over Ω.

• In general, the key factor in whether uniform convergence holds is the size of the parameter space Ω.

147 / 218


Side note:

• Why do we have (b) if
  • Ω is compact,
  • M is continuous, and
  • M has a unique maximizer over Ω?

• Since Ω is compact and M is continuous, M attains its maximum over

  Ω \ B(θ0; ε) := {θ ∈ Ω : d(θ, θ0) ≥ ε},

  where B(θ0; ε) is the open ball of radius ε centered at θ0.

• Let θε be a maximizer of M over Ω \ B(θ0; ε). Then,

  sup_{θ: d(θ,θ0)≥ε} M(θ) = M(θε) < M(θ0)

• The strict inequality is due to the uniqueness of the maximizer of M over Ω.

• Compactness is key; otherwise, uniqueness of the global maximizer does not imply this inequality.

148 / 218


Example 48

• The MLE can be obtained as an M-estimator with mθ(x) = log[pθ(x)/pθ0(x)].

• Addition of −log pθ0(x) does not change the maximizer of Mn(θ).

  M(θ) = Eθ0[mθ(X1)] = −∫ pθ0(x) log[pθ0(x)/pθ(x)] dx = −D(pθ0‖pθ).

• D(p‖q) is the Kullback–Leibler (KL) divergence between p and q.

• A form of (squared) distance between distributions.

• Does not satisfy the triangle inequality or symmetry.

• D(p‖q) ≥ 0, with equality iff p = q.

• Condition (b) is a bit stronger.

• Often, we can show (strong identifiability)

  γ(d(θ0, θ)) ≤ D(pθ0‖pθ)

  for some strictly increasing function γ : [0, ∞) → [0, ∞), in a neighborhood of θ0.

149 / 218


• Example: exponential distribution with pλ(x) = λe^{−λx} 1{x > 0}:

  D(pλ0‖pλ) = Eλ0[log(λ0 e^{−λ0 X1} / (λ e^{−λ X1}))]
            = log(λ0/λ) + Eλ0[(λ − λ0)X1]
            = −log(λ/λ0) + λ/λ0 − 1

• This is the Itakura–Saito distance, or the Bregman divergence for φ(x) = −log x, from an earlier lecture.

• f(x) = −log x + x − 1 is strictly convex on (0, ∞), with unique minimum at x = 1.

150 / 218


First-order (asymptotic normality)

• More refined understanding, by looking at scaled (magnified) deviations of consistent estimators.

• IID sequence X1, X2, . . . with mean µ = E[X1] and Σ = cov(X1):

  WLLN: X̄n →p µ (X̄n is consistent for µ).
  CLT: √n(X̄n − µ) →d N(0, Σ) (characterizes the fluctuations of X̄n − µ).

• Fluctuations are of order n^{−1/2} and, after normalization, have an approximately Gaussian distribution.

151 / 218


• First, let us look at how modes of convergence interact.

Proposition 12

(a) Xn →p X implies Xn →d X, but not vice versa.

(b) Xn →p c is equivalent to Xn →d c. (c is a constant.)

(c) Continuous mapping (CM): Xn → X and f continuous implies f(Xn) → f(X). Holds for both →d and →p.

(d) Slutsky's: Xn →d X and Yn →d c implies (Xn, Yn) →d (X, c).

(e) Xn →p X and Yn →p Y implies (Xn, Yn) →p (X, Y).

(f) Xn →d X and d(Xn, Yn) →p 0 implies Yn →d X.

• For (c), f only needs to be continuous on a set C with P(X ∈ C) = 1.

• (d) does not hold in general if c is replaced by some random variable Y.

152 / 218


• What is usually called Slutsky's lemma is not quite what we stated above.

• It is in fact a special application of (c) and (d), to the functions (x, y) ↦ x + y, (x, y) ↦ xy and (x, y) ↦ y⁻¹x.

Corollary 5 (Slutsky's lemma)

Let Xn, Yn and X be random variables, or vectors or matrices, and c a constant. Assume that Xn →d X and Yn →d c. Then,

  Xn + Yn →d X + c,  YnXn →d cX,  Yn⁻¹Xn →d c⁻¹X,

assuming c is invertible for the latter. More generally, f(Xn, Yn) →d f(X, c) for any continuous function f.

E.g. op(1) + op(1) = op(1).

153 / 218


• Simple examples:

  (a) Xn →d Z ∼ N(0, 1) implies Xn² →d Z² ∼ χ²₁.

Example 49 (Counterexample)

• Xn = X ∼ U(0, 1), ∀n, and

  Yn = Xn 1{n odd} + (1 − Xn) 1{n even}.

• Xn →d X and Yn →d X, but (Xn, Yn) does not converge in distribution.

• Why?

• Let C1 = {(x, y) ∈ [0, 1]² : x = y} and C2 = {(x, y) ∈ [0, 1]² : x + y = 1}.

• Let U(Ci) be the uniform distribution on Ci. Then,

  (Xn, Yn) ∼ U(C1) for n odd, U(C2) for n even.

154 / 218


Example 50 (t-statistic)

• IID sequence {Xi}, with E[Xi] = µ and var(Xi) = σ².

• Let X̄n = (1/n)∑ Xi and Sn² = (1/n)∑(Xi − X̄n)² = (1/n)∑ Xi² − (X̄n)². Then,

  tn−1 := (X̄n − µ)/(Sn/√n) →d N(0, 1).

• Why? (1/n)∑ Xi² →d E[X1²] = σ² + µ² and (X̄n)² →d µ².

• These imply Sn →d √(σ² + µ² − µ²) = σ.

• It follows that

  tn−1 = √n(X̄n − µ)/Sn →d N(0, σ²)/σ = N(0, 1)

• Distribution-free result: we are not assuming that the Xi are Gaussian. (See the simulation sketch below.)
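A simulation sketch (not from the slides; it assumes numpy and scipy) of Example 50: even for very skewed data, the t-statistic is approximately N(0, 1) for large n.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, n, reps = 1.0, 200, 100_000
x = rng.exponential(scale=mu, size=(reps, n))   # E[X] = 1, var = 1, skewed

xbar = x.mean(axis=1)
s = x.std(axis=1)                               # 1/n convention, as on the slide
t = (xbar - mu) / (s / np.sqrt(n))

print(t.mean(), t.std())                        # ~ 0 and ~ 1
print(stats.kstest(t, "norm").statistic)        # small KS distance to N(0, 1)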

155 / 218


• We also need the concept of uniform tightness, or boundedness in probability.

• A collection of random vectors {Xn} is uniformly tight if

  ∀ε > 0, ∃M such that sup_n P(‖Xn‖ > M) < ε.

  We write Xn = Op(1) in this case.

Proposition 13 (Uniform Tightness)

(a) If Xn →p 0 and {Yn} is uniformly tight, then XnYn →p 0.

(b) If Xn →d X, then {Xn} is uniformly tight.

(a) can be written compactly as op(1)Op(1) = op(1).

156 / 218


Simplified notation: E[ṁθ0] in place of E[ṁθ0(X1)].

Theorem 13 (Asymptotic normality of M-estimators)

Assume the following:

(a) ṁθ0(X1) has up to second moments, with

    • E[ṁθ0] = 0, and
    • well-defined covariance matrix Sθ0 := E[ṁθ0 ṁθ0ᵀ].

(b) The Hessian m̈θ0(X1) is integrable, with Vθ0 := E[m̈θ0] ≺ 0.

(c) θ̂n is consistent for θ0.

(d) ∃ε > 0 such that sup_{‖θ−θ0‖≤ε} ‖M̈n(θ) − M̈(θ0)‖ →p 0.

Let ∆n,θ := (1/√n)∑_{i=1}^n ṁθ(Xi). Then,

  √n(θ̂n − θ0) = −Vθ0⁻¹ ∆n,θ0 + op(1), and ∆n,θ0 →d N(0, Sθ0).

In particular, √n(θ̂n − θ0) →d N(0, Vθ0⁻¹ Sθ0 Vθ0⁻¹).

In (b), we only need the Hessian to be nonsingular.
(d) is (local) uniform convergence (UC).

157 / 218


Proof of AN

1. θ̂n is a maximizer of Mn, hence

2. Ṁn(θ̂n) = 0. (First-order optimality condition.)

3. Taylor-expand Ṁn around θ0:

   Ṁn(θ̂n) − Ṁn(θ0) = M̈n(θ̃n)[θ̂n − θ0]

   for some θ̃n on the line segment [θ̂n, θ0]. (Mean-value theorem, assuming continuity of M̈n.)

4. θ̃n = θ0 + op(1). (By consistency of θ̂n.)

5. M̈n(θ̃n) = M̈n(θ0) + op(1). (By (d): UC.)

6. Note that M̈n(θ0) = n⁻¹∑_i m̈θ0(Xi) is an average.

7. M̈n(θ0) = Eθ0[m̈θ0(X1)] + op(1) = Vθ0 + op(1). (By (b) and the WLLN.)

8. M̈n(θ̃n) = Vθ0 + op(1). (Combine 5. and 7. + CM.)

9. By CM applied with f(X) = X⁻¹, and invertibility of Vθ0:

   [M̈n(θ̃n)]⁻¹ = [Vθ0 + op(1)]⁻¹ = Vθ0⁻¹ + op(1).

158 / 218


10. Combine 2., 3. and 9.:

    θ̂n − θ0 = [M̈n(θ̃n)]⁻¹[Ṁn(θ̂n) − Ṁn(θ0)]
            = [Vθ0⁻¹ + op(1)][0 − Ṁn(θ0)]

11. Expand the RHS and multiply by √n:

    √n(θ̂n − θ0) = −Vθ0⁻¹[√n Ṁn(θ0)] − op(1)[√n Ṁn(θ0)]   (8)

12. Ṁn(θ0) is an average of zero-mean terms with covariance Sθ0, by (a).

13. √n Ṁn(θ0) →d N(0, Sθ0). (CLT and (a).)

14. √n Ṁn(θ0) = Op(1). (By Prop. 13(b) and 13.)

15. Applying op(1)Op(1) = op(1) to (8), (Prop. 13(a) and 11.)

    √n(θ̂n − θ0) = −Vθ0⁻¹[√n Ṁn(θ0)] + op(1).

16. Note that √n Ṁn(θ0) = ∆n,θ0 by definition.

17. Second part: apply CM with f(x) = −Vθ0⁻¹x. (Exercise.)

159 / 218


Example 51 (AN of MLE)

• For the MLE, mθ(x) = ℓθ(x) = log pθ(x).

• ṁθ = ℓ̇θ, the score function, zero-mean under regularity conditions.

• Sθ = Eθ[ℓ̇θ ℓ̇θᵀ] = I(θ).

• Vθ = Eθ[ℓ̈θ] = −I(θ).

• Asymptotic covariance of the MLE = [−I(θ)]⁻¹ I(θ) [−I(θ)]⁻¹ = [I(θ)]⁻¹.

• It follows (assuming (c) and (d) hold) that

  √n(θ̂mle − θ0) →d N(0, [I(θ0)]⁻¹)

• Often interpreted as "the MLE is asymptotically efficient",

• i.e., it achieves the Cramer–Rao bound in the limit.

160 / 218


Hodges' superefficient example

• If √n(δn − θ) →d N(0, σ²(θ)), one might think that σ²(θ) ≥ 1/I(θ) by the CRB.

• If so, any estimator with asymptotic variance 1/I(θ) could be called asymptotically efficient.

• Unfortunately this is not true. (Convergence in distribution is too weak to guarantee this.) Here is a counterexample:

Example 52

• Consider the shrinkage estimator

  δn′ = aδn  if |δn| ≤ n^{−1/4},  δn  otherwise.

• δn′ has the same asymptotic behavior as δn for θ ≠ 0.

• The asymptotic behavior of δn′ at θ = 0 is the same as that of aδn, which has asymptotic variance a²σ²(θ); this can be made arbitrarily small by choosing a sufficiently small a.

161 / 218


Delta method

• The delta method is a powerful extension of the CLT.

• Assume that f : Ω → R^k, with Ω ⊂ R^d, is differentiable and θ ∈ Ω.

• Let Jθ = (∂fi/∂xj)|_{x=θ} be the Jacobian of f at θ.

• Note: Jθ ∈ R^{k×d}.

Proposition 14

Under the above assumptions: if an(Xn − θ) →d Z, with an → ∞, then

  an[f(Xn) − f(θ)] →d JθZ

• If f is differentiable, then it is partially differentiable and its total derivative can be represented (or identified) with the Jacobian matrix Jθ.

• Simplest case, k = d = 1: an[f(Xn) − f(θ)] →d f′(θ)Z

162 / 218


Proof of Delta method

• an(Xn − θ) = Op(1) and, since an → ∞, we have Xn − θ = op(1).

• By differentiability (1st-order Taylor expansion),

  f(θ + h) = f(θ) + Jθh + R(h)‖h‖,

  where R(h) = o(1) as h → 0. Define R(0) = 0 so that R is continuous at 0.

• Applying this with h = Xn − θ, we have

  f(Xn) = f(θ) + Jθ(Xn − θ) + R(Xn − θ)‖Xn − θ‖.

• Multiplying by an, we get

  an[f(Xn) − f(θ)] = Jθ[an(Xn − θ)] + R(Xn − θ)‖an(Xn − θ)‖

• ‖an(Xn − θ)‖ = Op(1).

163 / 218


• Multiplying by an, we get

  an[f(Xn) − f(θ)] = Jθ[an(Xn − θ)] + R(Xn − θ)‖an(Xn − θ)‖,
  with R(Xn − θ) = op(1) and ‖an(Xn − θ)‖ = Op(1).

• R(Xn − θ) = op(1), and Jθ[an(Xn − θ)] →d JθZ, both by CM.

• The result follows from op(1)Op(1) = op(1) and Prop. 12(f).

164 / 218


Example 53

• Let {Xi} be iid with µ = E[Xi] and σ² = var[Xi].

• By the CLT, √n(X̄n − µ) →d N(0, σ²).

• Consider the function f(t) = t². Then, by the delta method,

  √n(f(X̄n) − f(µ)) →d f′(µ)N(0, σ²),

  that is,

  √n[(X̄n)² − µ²] →d N(0, σ²(2µ)²).

• For µ = 0, we get the degenerate result that √n(X̄n)² →d 0.

• In this case, we need to scale the error further, i.e.,

• n(X̄n)² →d σ²χ²₁, which follows from the CLT √n X̄n →d σN(0, 1) and CM.

165 / 218


Example 54

• Xi iid∼ Ber(p).

• By the CLT, √n(X̄n − p) →d N(0, p(1 − p)).

• Let f(p) = p(1 − p).

• f(X̄n) is a plugin estimator of the variance, and

  √n(f(X̄n) − f(p)) →d N(0, (1 − 2p)² p(1 − p)),

  since f′(x) = 1 − 2x.

• Again, at p = 1/2 this is degenerate and the convergence happens at a faster rate. (See the simulation sketch below.)
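A simulation sketch (not from the slides; it assumes numpy) of Example 54: the delta method predicts that the variance of √n(f(X̄n) − f(p)) approaches (1 − 2p)²p(1 − p).

import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.3, 2_000, 200_000

xbar = rng.binomial(n, p, size=reps) / n
stat = np.sqrt(n) * (xbar*(1 - xbar) - p*(1 - p))

print(stat.var(), (1 - 2*p)**2 * p * (1 - p))   # both ~ 0.0336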

166 / 218


These examples can be dealt with using the following extension.

Proposition 15

Consider the scalar case k = d = 1. If √n(Xn − θ) →d N(0, σ²) and f is twice differentiable with f′(θ) = 0, then

  n[f(Xn) − f(θ)] →d (1/2) f″(θ) σ² χ²₁

Informal derivation:

• f(Xn) − f(θ) = (1/2) f″(θ)(Xn − θ)² + o((Xn − θ)²).

• Since n(Xn − θ)² →d (σZ)², where Z ∼ N(0, 1), we get the result.

167 / 218


Example 55 (Multivariate delta method)

• Sn² = (1/n)∑_i Xi² − (X̄n)². Let

  Zn := (1/n)∑_{i=1}^n (Xi, Xi²),  θ = (µ, µ² + σ²),  Σ = cov((X1, X1²))

• By the (multivariate) CLT, we have

  √n(Zn − θ) →d N(0, Σ)

• Letting f(x, y) = (x, y − x²), we have

  √n[(X̄n, Sn²) − (µ, σ²)] →d JθN(0, Σ) = N(0, JθΣJθᵀ)

• Exercise: evaluate the asymptotic covariance JθΣJθᵀ.

168 / 218


What are asymptotic normality results useful for?

• Simplify comparison of estimators: Can use asymptotic variances. (ARE)

• Can build asymptotic confidence intervals.

169 / 218


Asymptotic relative efficiency (ARE)

• We can compare estimators based on their asymptotic variances.

• Assume that for two estimators θ̂1,n and θ̂2,n, we have

  √n(θ̂i,n − µ(θ)) →d N(0, σi²(θ)), i = 1, 2.

• For large n, the variance of θ̂i,n is ≈ σi²(θ)/n.

• The relative efficiency of θ̂1,n with respect to θ̂2,n can be measured in terms of the ratio of the numbers of samples required to achieve the same asymptotic variance (i.e., error):

  σ1²(θ)/n1 = σ2²(θ)/n2 ⟹ AREθ(θ̂1, θ̂2) = n2/n1 = σ2²(θ)/σ1²(θ)

  If the above ARE > 1, then we prefer θ̂1 over θ̂2.

170 / 218


Example 56

• Xi iid∼ fX with mean = (unique) median = θ, and variance 1.

• By the CLT, we have √n(X̄n − θ) →d N(0, 1).

• Sample median: Zn = median(X1, . . . , Xn) = X(⌈n/2⌉).

• One can show

  √n(Zn − θ) →d N(0, 1/(4[fX(θ)]²))

• Consider the normal location family: Xi iid∼ N(θ, 1).

• fX(θ) = φ(0) = 1/√(2π), where φ is the density of the standard normal.

• Hence, σ²_Zn(θ) = π/2.

• ARE of the sample mean relative to the median:

  σ²_Zn(θ) / σ²_X̄n(θ) = π/2 ≈ 1.57

• In the normal family we prefer the mean, since the median requires roughly 1.57 times as many samples to achieve the same accuracy. (See the simulation sketch below.)

171 / 218
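A quick simulation check of the π/2 ratio (not from the slides; assumes NumPy):

import numpy as np

rng = np.random.default_rng(2)
n, reps = 1_001, 5_000  # odd n, so the median is a single order statistic

X = rng.normal(0.0, 1.0, size=(reps, n))
var_mean = n * X.mean(axis=1).var()
var_med = n * np.median(X, axis=1).var()
print(var_mean, var_med, var_med / var_mean)  # ~1, ~pi/2, ~1.57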


Confidence intervals

An alternative to point estimators which provides a measure of our uncertainty or confidence. Recall X1, …, Xn iid∼ Pθ0.

Definition 17

A (1 − α)-confidence set for θ0 is a random set S = S(X1, …, Xn) such that Pθ0(θ0 ∈ S) ≥ 1 − α.

• There is a trade-off between the size of the set S and its coverage probability Pθ0(θ0 ∈ S).
• We want to minimize the size while maintaining a lower bound on the coverage probability.
• Usually CIs are built based on pivots:
• functions of the data and the parameter whose distribution is independent of the parameter.

Example 57 (Normal family, known variance)

• Xi iid∼ N(µ, σ²); then Z = (X̄n − µ)/(σ/√n) ∼ N(0, 1).
• Let zα/2 be such that P(Z ≥ zα/2) = α/2. Then

  P(|√n(X̄n − µ)/σ| ≤ zα/2) = 1 − α ⟺ P(µ ∈ [X̄n ± (σ/√n) zα/2]) = 1 − α.
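A coverage sketch for this exact interval (not from the slides; assumes NumPy/SciPy; the parameter values are arbitrary):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu, sigma, n, alpha, reps = 2.0, 1.5, 50, 0.05, 10_000

z = norm.ppf(1 - alpha / 2)  # z_{alpha/2}
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
half = z * sigma / np.sqrt(n)
print(np.mean(np.abs(xbar - mu) <= half))  # ~0.95; exactly 1 - alpha in theory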


Example 58 (Normal family, unknown variance)

• Xi iid∼ N(µ, σ²).
• Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1).
• V = (n − 1)S²n/σ² ∼ χ²_{n−1}, where S²n = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)².
• Hence, T := Z/√(V/(n − 1)) ∼ t_{n−1} (Student's t distribution).
• Let t_{n−1}(α/2) be such that P(|T| ≥ t_{n−1}(α/2)) = α.
• X̄n ± t_{n−1}(α/2) Sn/√n is an exact (1 − α) confidence interval.


Asymptotic CIs

Definition 18

An asymptotic (1 − α)-confidence set for θ0 is a random set S = S(X1, …, Xn) such that Pθ0(θ0 ∈ S) → 1 − α as n → ∞.

Example 59

• If √n(Tn − θ0) d→ N(0, σ²(θ0)), then, assuming σ(·) is continuous,

  Tn ± √(σ²(Tn)/n) · zα/2

  is an asymptotic C.I. at level 1 − α.

• Why? Since Tn p→ θ0, by the CM theorem, σ(θ0)/σ(Tn) p→ 1. By Slutsky's lemma,

  √(n/σ²(Tn)) (Tn − θ0) = √(σ²(θ0)/σ²(Tn)) · √(n/σ²(θ0)) (Tn − θ0) d→ N(0, 1).
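A sketch of this plug-in construction for Bernoulli data, where Tn = X̄n and σ²(θ) = θ(1 − θ) (not from the slides; assumes NumPy/SciPy):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
p, n, alpha, reps = 0.3, 200, 0.05, 10_000

z = norm.ppf(1 - alpha / 2)
xbar = rng.binomial(n, p, size=reps) / n
half = z * np.sqrt(xbar * (1 - xbar) / n)  # plug-in sigma(Tn) / sqrt(n)
print(np.mean(np.abs(xbar - p) <= half))   # -> 1 - alpha as n grows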


Asymptotic CI for the MLE

Example 60 (Asymptotic CI for the MLE based on Fisher information)

• Recall that under regularity conditions, √n(θ̂n − θ0) d→ N(0, 1/I(θ0)), or

  √(nI(θ0)) (θ̂n − θ0) d→ N(0, 1).

• Assuming I(·) is continuous, applying the previous example,

  √(nI(θ̂n)) (θ̂n − θ0) d→ N(0, 1).

• Hence, the following is an asymptotic (1 − α)-CI for θ0:

  θ̂n ± zα/2/√(nI(θ̂n)).
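For instance, for Poisson(θ) the MLE is θ̂n = X̄n and I(θ) = 1/θ, so the interval is X̄n ± zα/2 √(X̄n/n). A simulation sketch (not from the slides; assumes NumPy/SciPy):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
theta, n, alpha, reps = 4.0, 100, 0.05, 10_000

z = norm.ppf(1 - alpha / 2)
xbar = rng.poisson(theta, size=(reps, n)).mean(axis=1)  # Poisson MLE
half = z * np.sqrt(xbar / n)                            # 1 / sqrt(n I(theta_hat))
print(np.mean(np.abs(xbar - theta) <= half))            # ~0.95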


Example 61 (Asymptotic CI based on the empirical Fisher information)

• Let ℓn(θ) = ∑_{i=1}^n log pθ(Xi) and I(θ) = E[−(∂²/∂θ²) log pθ(X1)].
• One can consider −(1/n) ℓ̈n(θ) as the empirical version of I(θ).
• (It is an unbiased and consistent estimate; I(θ) is the Fisher information based on a sample of size 1.)
• By the same argument as in the asymptotic normality theorem, −(1/n) ℓ̈n(θ̂n) p→ I(θ0).
• It follows that √(−ℓ̈n(θ̂n)/(nI(θ0))) p→ 1.
• By Slutsky's lemma,

  √(−ℓ̈n(θ̂n)/(nI(θ0))) · √(nI(θ0)) (θ̂n − θ0) d→ N(0, 1).

• In other words, √(−ℓ̈n(θ̂n)) (θ̂n − θ0) d→ N(0, 1).
• Hence, the following is an asymptotic (1 − α)-CI for θ0:

  θ̂n ± zα/2/√(−ℓ̈n(θ̂n)).


Variance-stabilizing transform

• Assume √n(Tn − θ) d→ N(0, σ²(θ)).
• By the delta method, √n[f(Tn) − f(θ)] d→ N(0, [f′(θ)]² σ²(θ)).
• We can choose f so that [f′(θ)]² σ²(θ) = C, a constant.
• This is good for building asymptotic pivots.

Example 62

• Xi iid∼ Poi(θ). Note that Eθ[Xi] = varθ[Xi] = θ.
• By the CLT, √n(X̄n − θ) d→ N(0, θ).
• Take f′(θ) = 1/√θ, which can be realized by f(θ) = 2√θ; hence

  2√n(√X̄n − √θ) d→ N(0, 1).

• Asymptotic CI for √θ of level 1 − α: (√X̄n ± zα/2/(2√n)).
• Compare with the standard asymptotic CI for θ: (X̄n ± √(X̄n/n) · zα/2).
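The stabilized interval can be squared back into an interval for θ; a small coverage comparison at modest n (not from the slides; assumes NumPy/SciPy):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
theta, n, alpha, reps = 0.5, 30, 0.05, 20_000

z = norm.ppf(1 - alpha / 2)
xbar = rng.poisson(theta, size=(reps, n)).mean(axis=1)

wald = np.abs(xbar - theta) <= z * np.sqrt(xbar / n)        # standard CI for theta
lo = np.maximum(np.sqrt(xbar) - z / (2 * np.sqrt(n)), 0.0)  # CI for sqrt(theta)...
hi = np.sqrt(xbar) + z / (2 * np.sqrt(n))
vst = (lo**2 <= theta) & (theta <= hi**2)                   # ...squared back
print(wald.mean(), vst.mean())                              # coverage of the two CIs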


Hypothesis testing

• Recall the decision theory framework:
• a probabilistic model {Pθ : θ ∈ Ω} for X ∈ 𝒳.
• Special case: Ω is partitioned into two disjoint sets Ω0 and Ω1.
• We want to decide to which piece θ belongs.
• We could form an estimate θ̂ of θ and output 1{θ̂ ∈ Ω1}.
• A general principle:

  Do not estimate more than what you care about. The more complex the model, the more potential for fitting to noise.


• We want to test

  H0 : θ ∈ Ω0 (null) versus H1 : θ ∈ Ω1 (alternative).

• A non-randomized test can be specified by a critical region S ⊂ 𝒳 as

  δ(X) = 1{X ∈ S}.

• When δ(X) = 1, we have accepted H1, or "rejected H0".
• The power function of the test is given by

  β(θ) = Pθ(X ∈ S) = Eθ[δ(X)].

• We would like β(θ) ≈ 1{θ ∈ Ω1}.
• This cannot be achieved exactly, so we settle for a trade-off. Define

  significance level: α = sup_{θ∈Ω0} β(θ),
  power of the test: β = inf_{θ∈Ω1} β(θ).

• Neyman–Pearson framework: maximize β subject to a fixed α.


• Often we need to consider a randomized test, in which case we interpret

  δ(x) = P(accept H1 | X = x).

• The power function β(θ) = Eθ[δ(X)] still gives the probability of accepting H1, by the smoothing property.


Simple hypothesis test

• Ω0 = {θ0} and Ω1 = {θ1}.
• The Neyman–Pearson criterion reads: fix α and solve

  sup_δ Eθ1[δ(X)] s.t. Eθ0[δ(X)] ≤ α.

  This is the most powerful (MP) test of significance level at most α.

• Neyman–Pearson lemma: the most power is achieved by a likelihood ratio test (LRT),

  δ(x) = 1{L(x) > τ} + γ·1{L(x) = τ}, L(x) := pθ1(x)/pθ0(x).

• Sometimes we write 1{pθ1(x) ≥ τ pθ0(x)} to avoid division by zero.
• For simplicity, write p0 = pθ0 and p1 = pθ1,
• so that, for example, L(x) = p1(x)/p0(x).


Informal proof

• For simplicity, drop the dependence on X: δ = δ(X) and L = L(X).
• Introduce a Lagrange multiplier, and solve the unconstrained problem:

  δ* = argmax_δ [E1(δ) + λ(α − E0(δ))] = argmax_δ [E1(δ) − λE0(δ)].

• Recall the change-of-measure formula (note L = p1/p0):

  E1[δ] = ∫ δ p1 dµ = ∫ δ L p0 dµ = E0[δL].

• The problem reduces to

  δ* = argmax_δ E0[δ(L − λ)].

• The optimal solution is

  δ* = 1 on {L > λ} and δ* = 0 on {L < λ} (arbitrary on {L = λ}),

  which is a likelihood ratio test.


Theorem 14 (Neyman–Pearson lemma)

Consider the family of (randomized) likelihood ratio tests

  δt,γ(x) = 1 if p1(x) > t·p0(x); = γ if p1(x) = t·p0(x); = 0 if p1(x) < t·p0(x).

The following hold:

(a) For every α ∈ [0, 1], there are t, γ such that E0[δt,γ(X)] = α.
(b) If an LRT satisfies E0[δt,γ(X)] = α, then it is most powerful (MP) at level α.
(c) Any MP test at level α can be written as an LRT.

• Part (a) follows by looking at g(t) = P0(L(X) > t) = 1 − FZ(t), where Z = L(X); g is non-increasing and right-continuous, etc. (Draw a picture.)


Proof of the Neyman–Pearson lemma

• For part (b), let δ* be the LRT with significance level α.
• Let δ be any other rule satisfying E0[δ(X)] ≤ α = E0[δ*(X)].
• For all x (consider the three possibilities p1(x) ⋚ t·p0(x)),

  δ(x)[p1(x) − t·p0(x)] ≤ δ*(x)[p1(x) − t·p0(x)].

• Integrating over x:

  E1[δ(X)] − t·E0[δ(X)] ≤ E1[δ*(X)] − t·E0[δ*(X)],

  or

  E1[δ(X)] − E1[δ*(X)] ≤ t(E0[δ(X)] − E0[δ*(X)]) ≤ 0.

• Conclude that E1[δ(X)] ≤ E1[δ*(X)].
• Part (c) is left as an exercise.


Example 63

• Consider X ∼ N(θ, 1) and the two hypotheses

  H0 : θ = θ0 versus H1 : θ = θ1.

• The likelihood ratio is

  L(x) = p1(x)/p0(x) = exp[−½(x − θ1)²] / exp[−½(x − θ0)²].

• The LRT rejects H0 if L(x) > t. Equivalently,

  log L(x) > log t ⟺ −½(x − θ1)² + ½(x − θ0)² > log t
             ⟺ x(θ1 − θ0) + ½(θ0² − θ1²) > log t
             ⟺ x · sign(θ1 − θ0) > [log t − ½(θ0² − θ1²)] / |θ1 − θ0| =: τ.

• Assume θ1 > θ0. Then the test is equivalent to x > τ.
• We set τ by requiring P0(X > τ) = α. This gives τ = θ0 + Q⁻¹(α).


Power calculation (previous example continued)

• Q(x) = 1 − Φ(x), where Φ is the CDF of the standard normal distribution.
• The power is (since X − θ1 ∼ N(0, 1) under P1)

  β = P1(X > τ) = P1(X − θ1 > τ − θ1) = Q(τ − θ1).

• Plugging in τ, we have β = Q(−δ + Q⁻¹(α)), where δ = θ1 − θ0.
• The plot of β versus α is the ROC curve of the test.
• ROC = Receiver Operating Characteristic; see the next slide.
• Alternatively, one can plot the parametric curve β = Q(τ − θ1), α = Q(τ − θ0), as the parameter τ varies over ℝ.
• No test's ROC can go above this curve (by the Neyman–Pearson lemma).
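A numeric sketch of the threshold, size, and power (not from the slides; assumes NumPy/SciPy, where norm.isf plays the role of Q⁻¹ and norm.sf of Q):

import numpy as np
from scipy.stats import norm

theta0, theta1, alpha = 0.0, 1.0, 0.05
tau = theta0 + norm.isf(alpha)  # Q^{-1}(alpha)
beta = norm.sf(tau - theta1)    # power Q(tau - theta1)
print(tau, beta)

# Monte Carlo check of size and power
rng = np.random.default_rng(7)
print((rng.normal(theta0, 1, 100_000) > tau).mean(),  # ~alpha
      (rng.normal(theta1, 1, 100_000) > tau).mean())  # ~beta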


ROC curve

[Figure: ROC curve of the test, β (vertical axis) versus α (horizontal axis).]


Composite hypothesis testing

• Often we want to test H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω1, where Ω0 ∩ Ω1 = ∅.

Example 64

Testing whether a coin is fair or not. Here Ω0 = {1/2} and Ω1 = [0, 1] \ {1/2}.

Definition 19

A test δ of size α is uniformly most powerful (UMP) at level α if

  for all tests φ of level ≤ α and all θ ∈ Ω1, βδ(θ) ≥ βφ(θ).

• UMP tests do not always exist (in fact, they often don't).


Example 65 (Coin flipping continued)

• Observe X ∼ Bin(n, θ).
• Consider testing H0 : θ = 1/2 versus H1 : θ = θ1 based on X.
• The most powerful test is an LRT, based on

  L(x) = θ1^x (1 − θ1)^{n−x} / [(1/2)^x (1/2)^{n−x}] = (θ1/(1 − θ1))^x · ((1 − θ1)/(1/2))^n.

• The nature of the test changes based on whether θ1 < 1/2 or θ1 > 1/2:

  θ1 < 1/2 ⟹ log[θ1/(1 − θ1)] < 0 ⟹ reject H0 when x < τ,
  θ1 > 1/2 ⟹ log[θ1/(1 − θ1)] > 0 ⟹ reject H0 when x > τ.

• This suggests that a special structure is needed for the existence of a UMP test.


Definition 20

A family P = {pθ(x) : θ ∈ Ω} of densities has a monotone likelihood ratio (MLR) in T(x) if for θ0 < θ1, the LR L(x) = pθ1(x)/pθ0(x) is a non-decreasing function of T(x).

For example, in the coin-flipping problem, the model has MLR in T(X) = X.

Example 66 (1-D exponential family)

• Consider pθ(x) = h(x) exp[η(θ)T(x) − A(θ)]. The LR is

  L(x) = pθ1(x)/pθ0(x) = exp[(η(θ1) − η(θ0))T(x) − A(θ1) + A(θ0)].

• If η is monotone (e.g., θ0 ≤ θ1 ⟹ η(θ0) ≤ η(θ1)), then the family has MLR in T(x) or −T(x).
• This includes the Bernoulli (or binomial) example above, with η(θ) = log(θ/(1 − θ)).

  Other cases: the normal location family, Poisson, and exponential.


Example 67 (Non-exponential family)

• Xi iid∼ U[0, θ], i = 1, …, n.
• pθ(x) = θ^{−n} 1{x(n) ≤ θ} 1{x(1) ≥ 0}, and

  L(x) = (θ1/θ0)^n · 1{x(n) ≤ θ1}/1{x(n) ≤ θ0}
       = (θ1/θ0)^n for x(n) ∈ [0, θ0), and ∞ for x(n) ∈ [θ0, θ1).

• Consider θ1 > θ0.
• For x(n) ∈ (0, θ1), L(x) is non-decreasing in x(n) (it transitions from (θ1/θ0)^n to ∞), so the family has MLR in x(n).


Theorem 15 (UMP for one-sided problems)

• Let P be a family with MLR in T(x).
• Consider the one-sided test H0 : θ ≤ θ0 versus H1 : θ > θ0.
• Then δ(x) = 1{T(x) > C} + γ·1{T(x) = C}, with γ, C such that βδ(θ0) = α, is UMP of size α.

Proof sketch:

• Take θ1 > θ0 and let Lθ1,θ0(x) = pθ1(x)/pθ0(x) be the corresponding LR.
• By the MLR property, the given test is an LR test, i.e.,

  δ(x) = 1{Lθ1,θ0(x) > C′} + γ·1{Lθ1,θ0(x) = C′}

  for some constant C′ = C′(θ1, θ0).

• Since βδ(θ0) = α, by the Neyman–Pearson lemma, δ is MP for testing H0 : θ = θ0 versus H1 : θ = θ1.
• Since θ1 > θ0 was arbitrary, δ is UMP for θ = θ0 versus θ > θ0.
• The last piece to check is βδ(θ) ≤ α for θ < θ0. (Exercise.)


Example 68 (Non-exponential family, continued)

• Xi iid∼ U[0, θ], i = 1, …, n.
• The family has MLR in X(n).
• δ(X) = 1{X(n) ≥ t} is UMP for the one-sided test of θ > θ0 against θ ≤ θ0.
• To set the threshold:

  g(t) = 1 − Pθ0(X(n) ≤ t) = 1 − ∏i Pθ0(Xi ≤ t) = 1 − (t/θ0)^n for t ≤ θ0, and 0 for t > θ0,

  which is a continuous function.

• Solving g(t) = α gives t = (1 − α)^{1/n} θ0.
• Similarly, the power function is

  β(θ) = Pθ(X(n) > t) = [1 − (t/θ)^n]₊,

  which holds for all θ > 0.

• For the UMP test, we have β(θ) = [1 − (1 − α)(θ0/θ)^n]₊.
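A simulation sketch checking the threshold and the power formula (not from the slides; assumes NumPy):

import numpy as np

rng = np.random.default_rng(8)
theta0, alpha, n, reps = 1.0, 0.2, 10, 50_000

t = (1 - alpha) ** (1 / n) * theta0
for theta in [1.0, 1.2, 1.5]:
    xmax = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
    exact = max(1 - (1 - alpha) * (theta0 / theta) ** n, 0.0)
    print(theta, (xmax > t).mean(), exact)  # empirical vs exact power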


• Plots of the power function for the U[0, θ] example for various n (θ0 = 1, α = 0.2).

  [Figure: power function β(θ) versus θ, for n = 2, 5, 10, 20, 100.]


• Plots of the power function for the U[0, θ] example for various n (θ0 = 1, α = 0.05).

  [Figure: power function β(θ) versus θ, for n = 2, 5, 10, 20, 100.]


• ROC plots for the U[0, θ] example for various n (θ0 = 1, θ = 1.1).

  [Figure: ROC curves (β versus α) for n = 2, 5, 10 and the random test.]


• ROC plots for the U[0, θ] example for various n (θ0 = 1, θ = 2.0).

  [Figure: ROC curves (β versus α) for n = 2, 5, 10 and the random test.]


Generalized likelihood ratio test (GLRT)

• Consider testing H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω1.
• In the absence of UMP tests, a natural extension of the LRT is the following generalized LRT:

  L(x) = sup_{θ∈Ω} pθ(x) / sup_{θ∈Ω0} pθ(x) = pθ̂(x)/pθ̂0(x) ∈ [1, ∞],

  where Ω = Ω0 ∪ Ω1.

• θ̂ is the unconstrained MLE, whereas θ̂0 is the constrained MLE, the maximizer of θ ↦ pθ(x) over θ ∈ Ω0.
• A GLRT rejects H0 if L(x) > λ for some threshold λ.
• Alternatively, the GLRT can be written as

  δ(x) = 1{Λn(x) ≥ τ} + γ·1{Λn(x) = τ}, Λn(x) = 2 log L(x).

• The threshold τ is set as usual by solving sup_{θ∈Ω0} Eθ[δ(X)] = α.


• Why is this a reasonable test?
• Assume γ = 0: no randomization is needed.
• When θ ∈ Ω0, both θ̂ and θ̂0 approach θ as n → ∞.
• Hence, L(x) ≈ 1 when θ ∈ Ω0.
• However, when θ ∈ Ω1, the unconstrained MLE θ̂ approaches θ as n → ∞, while θ̂0 does not. This is because θ̂0 ∈ Ω0 and θ ∈ Ω1, and Ω0 and Ω1 are disjoint.
• It follows that L(x) > 1 when θ ∈ Ω1 (in fact, usually L(x) ≫ 1).
• By thresholding L(x) at some λ > 1, we can tell the two hypotheses apart.


Example 69

• Xi iid∼ N(µ, σ²), with both µ and σ² unknown. Let θ = (µ, σ²). Take

  Ω = {(µ, σ²) | µ ∈ ℝ, σ² > 0}, Ω0 = {θ0}, for θ0 = (µ0, σ0²) = (0, 1).

• We want to test Ω0 against Ω1 = Ω \ Ω0.

  sup_{θ∈Ω0} pθ(x) = pθ0(x) = (2π)^{−n/2} exp(−½ ∑i xi²),

  sup_{θ∈Ω} pθ(x) = pθ̂(x) = (2πσ̂²)^{−n/2} exp(−(1/(2σ̂²)) ∑i (xi − µ̂)²),

  where θ̂ = (µ̂, σ̂²), with µ̂ = (1/n)∑i xi and σ̂² = (1/n)∑i (xi − µ̂)², is the MLE.


Example continued

• The GLR is

  L(x) = pθ̂(x)/pθ0(x) = (σ̂²)^{−n/2} exp[½ ∑i xi² − (1/(2σ̂²)) ∑i (xi − µ̂)²],

  where the second term in the exponent equals n/2.

• The GLRT rejects H0 when L(x) > tα, where Pθ0(L(x) > tα) = α.
• Alternatively, threshold

  Λn(x) = 2 log L(x) = −n log σ̂² + ∑i xi² − n.

• We will see that, in general, Λn has an asymptotic χ²_r distribution, where r is the difference between the dimensions of the full (Ω) and null (Ω0) parameter spaces.
• Thus, we expect Λn in this problem to have an asymptotic χ²₂ distribution under the null θ = θ0. (It is instructive to try to show this directly.)
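A simulation sketch of the null distribution of Λn in this example (not from the slides; assumes NumPy/SciPy):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
n, reps = 200, 20_000

X = rng.normal(0.0, 1.0, size=(reps, n))  # the null: mu0 = 0, sigma0^2 = 1
s2_hat = X.var(axis=1)                    # MLE of sigma^2 (divides by n)
Lam = -n * np.log(s2_hat) + (X**2).sum(axis=1) - n
print((Lam > chi2.ppf(0.95, df=2)).mean())  # rejection rate ~0.05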


Asymptotics of the GLRT

• Consider Ω ⊂ ℝᵈ, open, and let r ≤ d. Take the null hypothesis to be

  Ω0 = {θ ∈ Ω : θ1 = θ2 = ⋯ = θr = 0} = {θ ∈ Ω : θ = (0, …, 0, θ_{r+1}, …, θ_d)}.

• Note that Ω0 is a (d − r)-dimensional subset of Ω.

Theorem 16

Under the same assumptions guaranteeing asymptotic normality of the MLE,

  Λn = 2 log L(X) d→ χ²_r under H0.

The degrees of freedom are r = d − (d − r), that is, the difference in the local dimensions of the full and null parameter sets.


• Recall ℓθ(X) = log pθ(X), and let Mn(θ) = (1/n)∑i ℓθ(Xi).
• We have Λn = −2n[Mn(θ̂0,n) − Mn(θ̂n)].
• By a Taylor expansion around θ̂n (the unrestricted MLE), for some θ̃n ∈ [θ̂n, θ̂0,n],

  Mn(θ̂0,n) − Mn(θ̂n) = [Ṁn(θ̂n)]ᵀ(θ̂0,n − θ̂n) + ½(θ̂0,n − θ̂n)ᵀ M̈n(θ̃n)(θ̂0,n − θ̂n).

• Since θ̂n is the MLE, we have Ṁn(θ̂n) = 0, assuming θ̂n ∈ int(Ω).
• By the same uniform arguments, θ̃n = θ0 + op(1) implies

  M̈n(θ̃n) = M̈n(θ0) + op(1) = −Iθ0 + op(1),

  where the last equality holds because M̈n(θ0) p→ Eθ0[ℓ̈θ0] = −Iθ0 by the WLLN.


• Assuming that √n(θ̂0,n − θ̂n) = Op(1), we obtain

  Λn = [√n(θ̂0,n − θ̂n)]ᵀ Iθ0 [√n(θ̂0,n − θ̂n)] + op(1).

• Asymptotically, the GLR measures a particular (squared) distance between θ̂0,n and θ̂n, one which weighs different directions differently, according to the eigenvectors of I(θ0).
• More specifically, let ‖z‖_Q := √(zᵀQz) = ‖Q^{1/2}z‖₂, which defines a norm when Q ≻ 0. Then

  Λn = ‖√n(θ̂0,n − θ̂n)‖²_{Iθ0} + op(1).

• In the simple case where Ω0 = {θ0}, we have √n(θ̂n − θ0) d→ N(0, Iθ0⁻¹).
• Equivalently, √n(θ̂n − θ0) d→ Iθ0^{−1/2} Z, where Z ∼ N(0, I_d).
• It follows from the CM theorem (since z ↦ ‖z‖_Q is continuous) that

  Λn d→ ‖Iθ0^{−1/2} Z‖²_{Iθ0} = ‖Iθ0^{1/2} Iθ0^{−1/2} Z‖₂² = ‖Z‖₂².

• Since ‖Z‖₂² = ∑_{i=1}^d Zi² ∼ χ²_d, this proves the case Ω0 = {θ0}.
• The proof of the general case is more complicated and is omitted.


Example 70 (Multinomial: testing uniformity)

• (X1, …, Xd) ∼ Multinomial(n, θ), where θ = (θ1, …, θd) is a probability vector, that is,

  θ ∈ Ω = {θ ∈ ℝᵈ : θi ≥ 0, ∑i θi = 1}.

• Xj counts how many of the n objects fall into category j:

  pθ(x) = [n!/(x1!⋯xd!)] ∏_{i=1}^d θi^{xi} ∝ ∏_{i=1}^d θi^{xi}.

• We would like to test Ω0 = {θ0}, where θ0 = (1/d, …, 1/d), versus Ω1 = Ω \ Ω0.
• The MLE over Ω is given by θ̂i = xi/n. Deriving it requires techniques for constrained optimization, such as Lagrange multipliers, since Ω itself is constrained. (Exercise.)


• We obtain

  Λn = 2 log[pθ̂(x)/pθ0(x)] = 2 log ∏_{i=1}^d (θ̂i/(θ0)i)^{xi} = 2 ∑_{i=1}^d xi log(θ̂i/(θ0)i)

     = 2n ∑i θ̂i log(θ̂i/(θ0)i) = 2n·D(θ̂‖θ0).

• Both θ̂ and θ0 are probability vectors; D(θ̂‖θ0) is their KL divergence.
• The GLRT does a sensible test: reject the null if θ̂ is far from θ0 in KL divergence.
• Our asymptotic theory implies Λn d→ χ²_{d−1} under the null, i.e., θ = θ0, since Ω is (d − 1)-dimensional and Ω0 is 0-dimensional. This is a fairly non-trivial result.
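A simulation sketch of Λn = 2n·D(θ̂‖θ0) under the uniform null (not from the slides; assumes NumPy/SciPy):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(10)
d, n, reps = 5, 500, 20_000

theta0 = np.full(d, 1 / d)
X = rng.multinomial(n, theta0, size=reps)
theta_hat = X / n

# 2 * sum_i x_i log(theta_hat_i / theta0_i), with 0 * log 0 treated as 0
safe = np.where(X > 0, theta_hat, 1.0)
Lam = 2 * np.where(X > 0, X * np.log(safe / theta0), 0.0).sum(axis=1)
print((Lam > chi2.ppf(0.95, df=d - 1)).mean())  # ~0.05 under the null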


p-values

• Consider a family of tests {δα}, α ∈ (0, 1), indexed by their level α.
• Assume: α ↦ δα(x) is nondecreasing and right-continuous.
• E.g., if δα(x) = 1{x ∈ C(α)}, then C(α1) ⊆ C(α2) if α1 ≤ α2.
• The p-value, or attained significance, for an observed x is defined as

  p(x) := inf{α : δα(x) = 1}.

• Note that p = p(X) is a random variable. We have

  p(X) ≤ α ⟺ δα(X) = 1:

  p ≤ α implies 1 = δp ≤ δα, since the infimum is attained by our assumptions on δα, hence δα = 1. The other direction follows from the definition of the infimum.

• This implies P0(p(X) ≤ α) = P0(δα(X) = 1) = α.
• That is, p = p(X) ∼ U[0, 1] under the null.


Example 71 (Normal example continued)

• Consider X ∼ N(θ, 1) and H0 : θ = θ0 versus H1 : θ = θ1 (with θ1 > θ0).
• The MP test at level α is δα(X) = 1{X − θ0 ≥ Q⁻¹(α)}.
• Alternatively, δα(X) = 1{Q(X − θ0) ≤ α}, since Q is decreasing.
• The p-value is

  p(X) = inf{α : Q(X − θ0) ≤ α} = Q(X − θ0). (9)

• Under the null, X − θ0 ∼ N(0, 1), hence Φ(X − θ0) ∼ U[0, 1] (why?).
• Then, p(X) = 1 − Φ(X − θ0) ∼ U[0, 1], as expected.
• Exercise: verify that under H1 : θ = θ1, the CDF of p(X) is

  P1(p(X) ≤ t) = Q(−δ + Q⁻¹(t)), where δ = θ1 − θ0.

  Note that this curve is the same as the ROC.

• Recall Q(t) = 1 − Φ(t), where Φ is the CDF of N(0, 1).
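A simulation sketch of the p-value distribution under the null and the alternative (not from the slides; assumes NumPy/SciPy; norm.sf is Q and norm.isf is Q⁻¹):

import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(11)
theta0, theta1, reps = 0.0, 1.0, 50_000

p_null = norm.sf(rng.normal(theta0, 1, reps) - theta0)  # Q(X - theta0) under H0
p_alt = norm.sf(rng.normal(theta1, 1, reps) - theta0)   # same statistic under H1

print(kstest(p_null, "uniform").pvalue)             # consistent with U[0, 1]
print((p_alt <= 0.05).mean(),                       # empirical P1(p <= 0.05)
      norm.sf(norm.isf(0.05) - (theta1 - theta0)))  # Q(-delta + Q^{-1}(0.05))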


• One can verify that definition (9) produces the "common" definition of p-values, say when δα(X) = 1{T ≥ τα} or δα(X) = 1{|T| ≥ τα}.

Example 72

• Consider the two-sided test and let G(t) = P0(|T| ≥ t).
• Assume that G is continuous, hence invertible (both G and G⁻¹ are decreasing).
• Requiring level α: G(τα) = P0(|T| ≥ τα) = α ⟹ τα = G⁻¹(α).
• This gives δα(X) = 1{|T| ≥ G⁻¹(α)} = 1{G(|T|) ≤ α}.
• By definition (9), p = G(|T|), which is the common definition.


Multiple hypothesis testing

• We have a collection of null hypotheses H0,i, i = 1, …, n.

Example 73 (Basic example)

• Testing in the normal means model

  yi ∼ N(µi, 1), i = 1, …, n,

  with H0,i : µi = 0.

• yi could be the expression level (or state) of gene i.
• H0,i means that gene i has no effect on the disease under consideration.

• Suppose that for each H0,i we have a test, and hence a p-value pi.
• Assume that under H0,i, pi ∼ U[0, 1].
• A test that rejects H0,i when pi ≤ α is of size α under the i-th null.


Testing the global null

• The global null is H0 = ⋂_{i=1}^n H0,i.
• We want to combine p1, …, pn to build a test of level α for H0.
• We could use 1{pi ≤ α} for a fixed i, but we get better power if we use all of them.
• Bonferroni's test for the global null:

  Reject H0 if mini pi ≤ α/n.

• By the union bound (no independence needed),

  PH0(rejecting H0) = PH0(⋃_{i=1}^n {pi ≤ α/n}) ≤ ∑_{i=1}^n PH0(pi ≤ α/n) = α.

• Exercise: assuming the pi are independent under H0, the exact size of Bonferroni's test is 1 − (1 − α/n)^n → 1 − e^{−α} as n → ∞. Thus, for large n and small α, size ≈ 1 − e^{−α} ≈ α, so the union bound is not bad in this case.
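A quick size check by simulation (not from the slides; assumes NumPy):

import numpy as np

rng = np.random.default_rng(12)
n, alpha, reps = 1_000, 0.05, 5_000

p = rng.uniform(size=(reps, n))        # global null, independent p-values
size = (p.min(axis=1) <= alpha / n).mean()
print(size, 1 - (1 - alpha / n) ** n)  # both ~1 - e^{-alpha} ≈ alpha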


Fisher's test for the global null

• Fisher's combination test:

  Reject H0 if Tn := −2 ∑_{i=1}^n log pi > χ²_{2n}(1 − α).

Lemma 7

If p1, …, pn are independent U[0, 1] variables, then Tn ∼ χ²_{2n}.

• Thus, assuming independence under H0, Fisher's test has the correct size.
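A size check for the combination statistic (not from the slides; assumes NumPy/SciPy):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(13)
n, alpha, reps = 50, 0.05, 20_000

p = rng.uniform(size=(reps, n))
T = -2 * np.log(p).sum(axis=1)                     # ~ chi^2_{2n} under the null
print((T > chi2.ppf(1 - alpha, df=2 * n)).mean())  # ~alpha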


Simes test for the global null

• Simes test:

  Reject H0 if Tn := mini {p(i) · n/i} ≤ α,

  where p(1) ≤ p(2) ≤ ⋯ ≤ p(n) are the order statistics of p1, …, pn.

Lemma 8

If p1, …, pn are independent U[0, 1] variables, then Tn ∼ U[0, 1].

• (Independence can be relaxed.)
• Thus, assuming independence under H0, the Simes test has the correct size.
• Equivalent form of the Simes test:

  Reject H0 if p(i) ≤ (i/n)·α for some i.

• This is less conservative than Bonferroni's test, which rejects if p(1) ≤ (1/n)·α.
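A size check of the Simes statistic under independence (not from the slides; assumes NumPy):

import numpy as np

rng = np.random.default_rng(14)
n, alpha, reps = 100, 0.05, 20_000

p = np.sort(rng.uniform(size=(reps, n)), axis=1)  # order statistics p_(1) <= ... <= p_(n)
T = (p * n / np.arange(1, n + 1)).min(axis=1)     # Simes statistic min_i p_(i) n / i
print((T <= alpha).mean())                        # ~alpha, i.e., T ~ U[0, 1]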


Testing individual hypotheses

• In the gene expression example, we care about which genes are null/not null. We would like to test H0,i : µi = 0 versus H1,i : µi = 1 for all i.
• The problem resembles classification.
• We are interested in how we are doing in an aggregate sense.
• We can think of having a decision problem

  p ∼ Pθ, where θ ∈ {0, 1}^n. (10)

• θi = 1 iff H0,i is true.
• The global null corresponds to θ = (1, …, 1), the all-ones vector.
• Minimal assumption: when θi = 1, we have pi ∼ U[0, 1], i.e., the i-th marginal of Pθ is uniform.


Terminology (shared with classification)

• Confusion matrix: count how many of each combination we have.
• For example, if θ̂ ∈ {0, 1}^n is our guess for the hypotheses:

  TP = (1/n) ∑_{i=1}^n 1{θi = 1, θ̂i = 1}.

                     accepted (1)   rejected (0)   total
  true (θi = 1)          TP             FN           T
  false (θi = 0)         FP             TN           F
  total                  P              N            n

• "True" here means that H0,i is true.


• Alternative notation:

                accepted   rejected   total
  true (1)         U          V         n0
  false (0)        T          S       n − n0
  total          n − R        R          n

• Familywise error rate (FWER):

  FWERθ = Pθ(V ≥ 1).

• A much less stringent criterion is the false discovery rate (FDR).
• Consider the false discovery proportion (a random variable)

  FDP = V/max(R, 1) = (V/R)·1{R > 0}.

• The FDR is the expectation of the FDP:

  FDRθ = Eθ[FDP].


• Controlling FWER in the strong sense: control for all θ ∈ {0, 1}^n.

Theorem 17

Bonferroni's approach, where we test each hypothesis at level α/n, controls the FWER at level α in the strong sense. In fact,

  Eθ[V] ≤ (n0/n)·α, for all θ ∈ {0, 1}^n.

Proof:

• E[V] = ∑_{i=1}^n P(V ≥ i), which holds for any nonnegative integer-valued variable bounded by n.
• Hence, E[V] ≥ P(V ≥ 1) = FWERθ.
• Let Vi = 1{H0,i is true but rejected} = 1{θi = 1, θ̂i = 0}; then

  Eθ[Vi] = 1{θi = 1}·Pθ(θ̂i = 0).

• Since V = ∑i Vi,

  Eθ[V] = ∑i Eθ[Vi] = ∑_{i: θi=1} Pθ(θ̂i = 0) ≤ ∑_{i: θi=1} α/n = (n0/n)·α.

  (Here θ̂i is based only on pi, and n0 = #{i : θi = 1} is the number of true nulls.)


• Benjamini–Hochberg (BH) procedure: let i0 be the largest i such that

  p(i) ≤ (i/n)·q.

  Reject the hypotheses corresponding to p(1), …, p(i0).

Theorem 18

Under independence of the null hypotheses from each other and from the non-nulls, the BH procedure (uniformly) controls the FDR at level q. In fact,

  FDRθ(θ̂BH) = (n0/n)·q, for all θ ∈ {0, 1}^n.
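A sketch of the procedure and its FDR by simulation (not from the slides; assumes NumPy/SciPy; the one-sided normal-means setup and all parameter values are illustrative):

import numpy as np
from scipy.stats import norm

def bh_reject(p, q):
    # Reject the i0 smallest p-values, where i0 is the largest i
    # with p_(i) <= (i/n) q.
    n = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, n + 1) / n
    i0 = np.nonzero(below)[0].max() + 1 if below.any() else 0
    reject = np.zeros(n, dtype=bool)
    reject[order[:i0]] = True
    return reject

rng = np.random.default_rng(15)
n, n0, q, reps = 200, 150, 0.1, 2_000
mu = np.r_[np.zeros(n0), np.full(n - n0, 3.0)]  # first n0 coordinates are true nulls

fdp = []
for _ in range(reps):
    y = rng.normal(mu, 1.0)
    p = norm.sf(y)                  # one-sided p-values for H0,i : mu_i = 0
    R = bh_reject(p, q)
    V = R[:n0].sum()                # false discoveries
    fdp.append(V / max(R.sum(), 1))

print(np.mean(fdp), q * n0 / n)     # FDR ~ (n0/n) q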