STAT 200B: Theoretical Statistics
Arash A. Amini
March 2, 2020
1 / 218
Statistical decision theory
• A probability model P = {Pθ : θ ∈ Ω} for data X ∈ X. Ω: parameter space, X: sample space.
• An action space A: set of available actions (decisions).
• A loss function:
0-1 loss: L(θ, a) = 1{θ ≠ a}, Ω = A = {0, 1}.
Quadratic loss (squared error): L(θ, a) = ‖θ − a‖₂^2, Ω = A = R^d.
Statistical inference as a game:
1. Nature picks the “true” parameter θ, and draws X ∼ Pθ. Thus, X is a random element of X.
2. Statistician observes X and makes a decision δ(X) ∈ A. δ : X → A is called a decision rule.
3. Statistician incurs the loss L(θ, δ(X)).
The goal of the statistician is to minimize the expected loss, a.k.a. the risk:
R(θ, δ) := Eθ L(θ, δ(X))
2 / 218
• The goal of the statistician is to minimize the expected loss, a.k.a. the risk:
R(θ, δ) := Eθ L(θ, δ(X))
= ∫ L(θ, δ(x)) dPθ(x)
= ∫ L(θ, δ(x)) pθ(x) dµ(x)
when the family is dominated: dPθ = pθ dµ.
• Usually we work with the family of densities {pθ(·) : θ ∈ Ω}.
3 / 218
Example 1 (Bernoulli trials)
• A coin is being flipped; we want to estimate the probability of it coming up heads.
• One possible model:
X = (X1, . . . , Xn), Xi iid∼ Ber(θ), for some θ ∈ [0, 1].
• Formally, X = {0, 1}^n, Pθ = (Ber(θ))^⊗n and Ω = [0, 1].
• PMF of Xi:
P(Xi = x) = {θ if x = 1; 1 − θ if x = 0} = θ^x (1 − θ)^(1−x), x ∈ {0, 1}
• Joint PMF: pθ(x1, . . . , xn) = ∏_{i=1}^n θ^xi (1 − θ)^(1−xi)
• Action space: A = Ω.
• Quadratic loss: L(θ, δ) = (θ − δ)2.
4 / 218
Comparing estimators via their risk
Bernoulli trials. Let us look at four estimators:
Sample mean: δ1(X) = (1/n) ∑_{i=1}^n Xi, with R(θ, δ1) = θ(1 − θ)/n.
Constant estimator: δ2(X) = 1/2, with R(θ, δ2) = (θ − 1/2)^2.
Strange looking: δ3(X) = (∑_i Xi + 3)/(n + 6), with R(θ, δ3) = [nθ(1 − θ) + (3 − 6θ)^2]/(n + 6)^2.
Throw data out: δ4(X) = X1, with R(θ, δ4) = θ(1 − θ).
5 / 218
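The four risk formulas above can be checked numerically. A minimal Monte Carlo sketch (the estimator definitions are from the slide; `reps`, `seed`, and the tolerance below are arbitrary choices):

```python
import random

def risks_mc(theta, n, reps=20000, seed=0):
    """Monte Carlo estimate of the quadratic risk of delta_1..delta_4."""
    rng = random.Random(seed)
    acc = [0.0, 0.0, 0.0, 0.0]
    for _ in range(reps):
        x = [1 if rng.random() < theta else 0 for _ in range(n)]
        s = sum(x)
        deltas = (s / n, 0.5, (s + 3) / (n + 6), x[0])  # delta_1..delta_4
        for k, d in enumerate(deltas):
            acc[k] += (d - theta) ** 2
    return [a / reps for a in acc]

def risks_exact(theta, n):
    """Closed-form risks from the slide."""
    return [theta * (1 - theta) / n,
            (theta - 0.5) ** 2,
            (n * theta * (1 - theta) + (3 - 6 * theta) ** 2) / (n + 6) ** 2,
            theta * (1 - theta)]
```

Running both at, say, θ = 0.3 and n = 10 shows the Monte Carlo estimates tracking the closed forms.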
Comparing estimators via their risk
[Figure: risk functions R(θ, ·) of δ1, δ2, δ3, δ4 over θ ∈ [0, 1], for n = 10 (left) and n = 50 (right).]
Comparison depends on the choice of the loss. A different loss gives a different picture.
6 / 218
Comparing estimators via their risk
How to deal with the fact that the risks are functions?
• Summarize them by reducing to numbers:
• (Bayesian) Take a weighted average: inf_δ ∫_Ω R(θ, δ) dπ(θ)
• (Frequentist) Take the maximum: inf_δ max_{θ∈Ω} R(θ, δ)
• Restrict to a class of estimators: unbiased (UMVU), equivariant, etc.
• Rule out estimators that are dominated by others (inadmissible).
7 / 218
Admissibility
Definition 1
Let δ and δ∗ be decision rules. δ∗ (strictly) dominates δ if
• R(θ, δ∗) ≤ R(θ, δ), for all θ ∈ Ω, and
• R(θ, δ∗) < R(θ, δ), for some θ ∈ Ω.
δ is inadmissible if there is a different δ∗ that dominates it; otherwise δ is admissible.
An inadmissible rule can be uniformly “improved”.
δ4 in the Bernoulli example is inadmissible.
We will see a non-trivial example soon (Exponential Distribution).
8 / 218
Bias
Definition 2
The bias of δ for estimating g(θ) is Bθ(δ) := Eθ(δ)− g(θ).
The estimator is unbiased if Bθ(δ) = 0 for all θ ∈ Ω.
Not always possible to find unbiased estimators. Example: g(θ) = sin(θ) in the binomial family. (Keener Example 4.2, p. 62)
Definition 3
g is called U-estimable if there is an unbiased estimator δ for g.
• Usually g(θ) = θ.
9 / 218
Bias-variance decomposition
For the quadratic loss L(θ, a) = (θ − a)^2, the risk is the mean-squared error (MSE). In this case we have the following decomposition:
MSEθ(δ) = [Bθ(δ)]^2 + varθ(δ)
Proof.
Let µθ := Eθ(δ). We have
MSEθ(δ) = Eθ(θ − δ)^2 = Eθ(θ − µθ + µθ − δ)^2
= (θ − µθ)^2 + 2(θ − µθ) Eθ[µθ − δ] + Eθ(µθ − δ)^2,
where the cross term vanishes since Eθ[µθ − δ] = 0.
The same goes for general g(θ) or higher dimensions: L(θ, a) = ‖g(θ) − a‖₂^2.
10 / 218
Example 2 (Berger)
Let X ∼ N(θ, 1).
Class of estimators of the form δc(X) = cX, for c ∈ R.
MSEθ(δc) = (θ − cθ)^2 + c^2 = (1 − c)^2 θ^2 + c^2
For c > 1, we have 1 = MSEθ(δ1) < MSEθ(δc) for all θ, so δ1 dominates every δc with c > 1.
For c ∈ [0, 1] the rules are incomparable.
11 / 218
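A quick simulation of Berger's example, checking both the bias–variance identity and the closed form (1 − c)^2 θ^2 + c^2 (sample size, seed, and tolerance are arbitrary choices):

```python
import random

def mse_parts(c, theta, reps=50000, seed=1):
    """Empirical bias^2, variance and MSE of delta_c(X) = c*X, X ~ N(theta, 1)."""
    rng = random.Random(seed)
    vals = [c * rng.gauss(theta, 1.0) for _ in range(reps)]
    m = sum(vals) / reps
    var = sum((v - m) ** 2 for v in vals) / reps
    bias2 = (m - theta) ** 2
    mse = sum((v - theta) ** 2 for v in vals) / reps
    return bias2, var, mse
```

In-sample, bias^2 + variance equals the empirical MSE exactly, while the MSE itself fluctuates around (1 − c)^2 θ^2 + c^2.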
Optimality depends on the loss
Example 3 (Poisson process)
X1, . . . , Xn are the inter-arrival times of a Poisson process with rate λ, so X1, . . . , Xn iid∼ Expo(λ). The model has the following p.d.f.
pλ(x) = ∏_{i=1}^n pλ(xi) = ∏_i λ e^{−λ xi} 1{xi > 0} = λ^n e^{−λ ∑_i xi} 1{min_i xi > 0}
Ω = A = (0, ∞).
• Let S = ∑_i Xi and X̄ = S/n.
• The MLE for λ is λ̂ = 1/X̄ = n/S.
12 / 218
X1, . . . , Xn iid∼ Expo(λ)
• Let S = ∑_i Xi and X̄ = S/n.
• The MLE for λ is λ̂ = 1/X̄ = n/S.
• S := ∑_{i=1}^n Xi has the Gamma(n, λ) distribution.
• 1/S has the Inv-Gamma(n, λ) distribution with mean λ/(n − 1).
• Eλ[λ̂] = nλ/(n − 1). The MLE is biased for λ.
• Then, λ̃ := (n − 1)λ̂/n is unbiased.
• We also have varλ(λ̃) < varλ(λ̂).
• It follows that
MSEλ(λ̃) < MSEλ(λ̂), ∀λ
The MLE λ̂ is inadmissible for quadratic loss.
13 / 218
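The domination of the MLE by λ̃ under quadratic loss shows up clearly in simulation; a small sketch (λ, n, reps, seed chosen arbitrarily):

```python
import random

def compare_exponential_mles(lam=2.0, n=10, reps=100000, seed=0):
    """Empirical MSEs of the MLE n/S and the unbiased (n-1)/S, plus E[MLE]."""
    rng = random.Random(seed)
    se_mle = se_unb = mean_mle = 0.0
    for _ in range(reps):
        s = sum(rng.expovariate(lam) for _ in range(n))
        mle, unb = n / s, (n - 1) / s
        se_mle += (mle - lam) ** 2
        se_unb += (unb - lam) ** 2
        mean_mle += mle
    return se_mle / reps, se_unb / reps, mean_mle / reps
```

The empirical mean of the MLE also sits near n λ/(n − 1), confirming the bias computation above.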
Possible explanation: Quadratic loss penalizes over-estimation more than under-estimation for Ω = (0, ∞).
Alternative loss function (Itakura–Saito distance)
L(λ, a) = λ/a − 1 − log(λ/a), a, λ ∈ (0, ∞)
• With this loss function, R(λ, λ̃) > R(λ, λ̂), ∀λ.
• That is, the MLE renders λ̃ inadmissible.
This is an example of a Bregman divergence, for φ(x) = − log x. For a convex function φ : R^d → R, the Bregman divergence is defined as
dφ(x, y) = φ(x) − φ(y) − 〈∇φ(y), x − y〉,
the remainder of the first-order Taylor expansion of φ at y.
14 / 218
Details:
• Consider δα(X) = α/S. Then, we have
R(λ, δα) − R(λ, δβ) = (n/α − n/β) − (log(n/α) − log(n/β))
• Take α = n − 1 and β = n.
• Use log x − log y < x − y for x > y ≥ 1.
(Follows from strict concavity of f(x) = log(x): f(x) − f(y) < f′(y)(x − y) for y ≠ x.)
15 / 218
Sufficiency
Idea: Separate the data into
• parts that are relevant for estimating θ (sufficient)
• and parts that are irrelevant (ancillary).
Benefits:
• Achieve data compression: efficient computation and storage
• Irrelevant parts can increase the risk (Rao-Blackwell)
Definition 4
Consider the model P = {Pθ : θ ∈ Ω} for X. A statistic T = T(X) is sufficient for P (or for θ, or for X) if the conditional distribution of X given T does not depend on θ.
More precisely, we have
Pθ(X ∈ A | T = t) = Qt(A), ∀t, A
for some Markov kernel Q. Making this fully precise requires measure theory. Intuitively, given T, we can simulate X by an external source of randomness.
16 / 218
Sufficiency
Example 4 (Coin tossing)
• Xi iid∼ Ber(θ), i = 1, . . . , n.
• Notation: X = (X1, . . . , Xn), x = (x1, . . . , xn).
• Will show that T = T(X) = ∑_i Xi is sufficient for θ. (This should be intuitive.)
Pθ(X = x) = pθ(x) = ∏_{i=1}^n θ^xi (1 − θ)^(1−xi) = θ^T(x) (1 − θ)^(n−T(x))
• Then
Pθ(X = x, T = t) = Pθ(X = x) 1{T(x) = t} = θ^t (1 − θ)^(n−t) 1{T(x) = t}.
17 / 218
• Then
Pθ(X = x, T = t) = Pθ(X = x) 1{T(x) = t} = θ^t (1 − θ)^(n−t) 1{T(x) = t}.
• Marginalizing,
Pθ(T = t) = θ^t (1 − θ)^(n−t) ∑_{x ∈ {0,1}^n} 1{T(x) = t} = (n choose t) θ^t (1 − θ)^(n−t).
• Hence,
Pθ(X = x | T = t) = θ^t (1 − θ)^(n−t) 1{T(x) = t} / [(n choose t) θ^t (1 − θ)^(n−t)]
= 1{T(x) = t} / (n choose t).
• What is the above (conditional) distribution?
18 / 218
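That the conditional law is uniform over the (n choose t) sequences with t ones can be verified by brute-force enumeration (n, θ, t below are arbitrary choices):

```python
from itertools import product
from math import comb

def conditional_given_T(n, theta, t):
    """P(X = x | T = t) for X ~ Ber(theta)^n and T = sum of the Xi."""
    joint = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
             for x in product((0, 1), repeat=n)}
    p_t = sum(p for x, p in joint.items() if sum(x) == t)
    return {x: p / p_t for x, p in joint.items() if sum(x) == t}
```

Every sequence with t ones gets the same conditional probability 1/(n choose t), independently of θ.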
Factorization Theorem
It is not convenient to check for sufficiency this way, hence:
Theorem 1 (Factorization (Fisher–Neyman))
Assume that P = {Pθ : θ ∈ Ω} is dominated by µ. A statistic T is sufficient iff for some functions gθ, h ≥ 0,
pθ(x) = gθ(T(x)) h(x), for µ-a.e. x
The likelihood θ ↦ pθ(X) then depends on X only through T(X).
The family being dominated (having a density) is important.
Proof (only discrete case):
Assume T = T(X) is sufficient. Fix x, and let t = T(x). Then,
Pθ(X = x) = Pθ(X = x, T = t)
= Pθ(X = x | T = t) Pθ(T = t)
= Qt(X = x) gθ(t).
19 / 218
Factorization Theorem
• Now assume the factorization holds. Then,
Pθ(X = x, T = t) = Pθ(X = x) 1{T(x) = t} = gθ(t) h(x) 1{T(x) = t}.
• It follows that
Pθ(T = t) = gθ(t) ∑_{x′} h(x′) 1{T(x′) = t},
• hence
Pθ(X = x | T = t) = gθ(t) h(x) 1{T(x) = t} / [gθ(t) ∑_{x′} h(x′) 1{T(x′) = t}]
= h(x) 1{T(x) = t} / ∑_{x′} h(x′) 1{T(x′) = t}.
20 / 218
Example 5 (Uniform)
• Let X1, . . . , Xn iid∼ U[0, θ].
• The family is dominated by Lebesgue measure.
• X(n) = max{X1, . . . , Xn} is sufficient by the Factorization theorem:
pθ(x) = ∏_{i=1}^n (1/θ) 1{0 ≤ xi ≤ θ}
= (1/θ^n) 1{0 ≤ xi ≤ θ, ∀i} = (1/θ^n) 1{0 ≤ min_i xi} 1{max_i xi ≤ θ}
Useful fact: ∏_i 1{Ai} = 1{∩_i Ai}.
21 / 218
• The entire data (X1, . . . , Xn) is always sufficient.
• For i.i.d. data, a further reduction is always possible.
Example 6 (IID data)
• Let X1, . . . , Xn iid∼ pθ.
• Then, the order statistics (X(1), . . . , X(n)) are sufficient:
• Order the data such that X(1) ≤ X(2) ≤ · · · ≤ X(n), and discard the ranks.
22 / 218
Minimal sufficiency
• There is a hierarchy among sufficient statistics in terms of data reduction.
• Can be made precise by using “functions” as “reduction mechanisms”.
Lemma 1
If T is sufficient for P and T = f (S) a.e. P, then S is sufficient.
We write T ≤s S if such functional relation exists.
• An easy consequence of the factorization theorem.
Examples:
• T sufficient ⇏ T^2 sufficient. (T ≰s T^2) (Missing sign information)
• T^2 sufficient ⇒ T sufficient. (T^2 ≤s T)
• T sufficient ⇔ T^3 sufficient. (bijection)
23 / 218
• T ≤s S is not standard notation, but useful shorthand for:
∃ function f such that T = f (S) a.e. P.
• We want to obtain a sufficient statistic that achieves the greatest reduction, i.e., is at the bottom of that hierarchy.
Definition 5
A statistic T is minimal sufficient if
• T is sufficient, and
• T ≤s S for any sufficient statistic S .
24 / 218
• Minimal sufficient statistics exist under mild conditions.
• Minimal sufficient statistic is essentially unique modulo bijections.
Example 7 (Location family)
Consider X1, . . . , Xn iid∼ pθ, that is, they have density pθ(x) = f(x − θ). For example, consider f(x) = C exp(−β|x|^α).
• The case α = 2 corresponds to the normal location family X1, . . . , Xn iid∼ N(θ, 1).
Then, T = (1/n) ∑_i Xi is sufficient by factorization. We will show later that it is minimal sufficient.
• The case α = 1 (Laplace or double exponential): no further reduction beyond the order statistics is possible.
• A family P is DCS if it is dominated, with densities having common support.
25 / 218
• Goal: To show that the likelihood (ratio) function is minimal sufficient.
• General idea: For any fixed θ and θ0,
pθ(X)/pθ0(X)
will always be a function of any sufficient statistic (by the Factorization Theorem).
• We just have to collect enough of them so that collectively
(pθ1(X)/pθ0(X), pθ2(X)/pθ0(X), pθ3(X)/pθ0(X), . . .)
they are sufficient.
26 / 218
A useful lemma
• Let us fix θ0 ∈ Ω and define
Lθ := Lθ(X) := pθ(X)/pθ0(X).
Lemma 2
In a DCS family, the following are equivalent
(a) U is sufficient for P.
(b) Lθ ≤s U, ∀θ ∈ Ω.
• When DCS fails, (a) still implies (b), but not necessarily vice versa.
27 / 218
Proof of Lemma 2
• Working on the common support, the densities can be assumed positive.
• (a) ⇒ (b): U sufficient implies pθ(x) = gθ(U(x)) h(x) (Fact. Thm.):
Lθ(x) = pθ(x)/pθ0(x) = gθ(U(x))/gθ0(U(x)) = fθ,θ0(U(x))
• (b) ⇒ (a): ∃ fθ,θ0 such that pθ(x) = pθ0(x) fθ,θ0(U(x)), which is a factorization with h = pθ0, so U is sufficient. Q.E.D.
28 / 218
A useful lemma
• Let L := (Lθ)θ∈Ω.
• Since R ≤s U and S ≤s U ⇔ (R, S) ≤s U, we have
Lemma 3
In a DCS family, the following are equivalent
(a) U is sufficient for P.
(b) L≤s U.
• The argument is correct when Ω is finite.
• More care with “a.e. P” is needed when Ω is infinite.
• From Lemma 3 it follows that L is itself sufficient. (Why?)
29 / 218
Likelihood is minimal sufficient
Proposition 1
In a DCS family, L := (Lθ)θ∈Ω is minimal sufficient.
Proof:
• Let U be a sufficient statistic.
• Lemma 3(a) =⇒ (b) gives L≤s U.
• (i.e., L is a function of any sufficient U.)
• We also know that L is itself sufficient.
• The proof is complete.
30 / 218
• Consequence of Prop. 1 is
Corollary 1
A statistic T is minimal sufficient if
• T is sufficient, and
• T ≤s L.
• That is, it is enough to show that T is sufficient and
• T can be written as a function of L.
T ≤s L is equivalent to either of the following:
• L(x) = L(y) ⇒ T(x) = T(y).
• Level sets of L are “included” in level sets of T, i.e.,
• level sets of T are coarser than level sets of L.
31 / 218
Corollary 2
T is minimal sufficient for a DCS family P iff
(a) T is sufficient, and
(b) L(x) = L(y) ⇒ T(x) = T(y)
Corollary 3
T is minimal sufficient for a DCS family P iff
L(x) = L(y) ⇔ T(x) = T(y)
• Can replace L(x) = L(y) with ℓx(θ) ∝ ℓy(θ), where ℓx(θ) = pθ(x) is the likelihood function. (Theorem 3.11 in Keener.)
• Corollary 3: T is minimal sufficient if it has the same level sets as L.
• Informally, the shape of the likelihood is minimal sufficient.
32 / 218
Example 8 (Gaussian location family)
Consider X1, . . . , Xn iid∼ pθ with density pθ(x) = f(x − θ), where f(x) = C exp(−βx^2).
• ℓX(θ) ∝ exp(−β ∑_i (Xi − θ)^2).
• Shape of ℓX(·) uniquely determined by θ ↦ ∑_i (Xi − θ)^2,
• Alternatively, θ ↦ −2(∑_i Xi)θ + nθ^2,
• Alternatively, ∑_i Xi.
Example 9 (Laplace location family)
Consider X1, . . . , Xn iid∼ pθ with density pθ(x) = f(x − θ), where f(x) = C exp(−β|x|).
• ℓX(θ) ∝ exp(−β ∑_i |Xi − θ|).
• Shape of ℓX(·) uniquely determined by the breakpoints of the piecewise linear function θ ↦ ∑_i |Xi − θ|,
• In one-to-one correspondence with the order statistics (X(1), . . . , X(n)).
33 / 218
[Figure: the piecewise linear function θ ↦ ∑_i |Xi − θ|, with breakpoints at X(1), X(2), X(3).]
• The shape of the likelihood for the Laplace location family is determined by the order statistics.
34 / 218
Example with no common support
P = {P0, P1, P2} where
P0 = U[−1, 0],
P1 = U[0, 1],
p2(x) = 2x 1{x ∈ (0, 1)}.
p1/p0 = p2/p0 = {0 on (−1, 0); ∞ on (0, 1)}.
• Cannot tell P1 and P2 apart based on (p1/p0, p2/p0).
• However, there is information in the original model to statistically tell these two apart to some extent.
• Consider in addition p2(x)/p1(x) = 2x 1{x ∈ (0, 1)}.
• A minimal suff. stat.: (1{X < 0}, X₊)
• Could just take X₊, since 1{X < 0} = 1 − 1{X₊ > 0} a.e. P
35 / 218
Empirical distribution
• We saw that for IID data, X1, . . . , Xn iid∼ Pθ,
• the order statistics X(1) ≤ X(2) ≤ · · · ≤ X(n) are sufficient.
• Another way: The empirical distribution Pn is sufficient,
Pn := (1/n) ∑_{i=1}^n δ_Xi, (δx : unit point mass at x)
• δx is the measure defined by δx(A) := 1{x ∈ A}.
• Example: X = (0, 1, −1, 0, 4, 4, 0) ⇒ Pn = (1/7)(δ₋₁ + 3δ₀ + δ₁ + 2δ₄).
36 / 218
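The empirical distribution of the example on this slide can be computed directly; exact arithmetic via fractions keeps the weights as 1/7, 3/7, etc.:

```python
from collections import Counter
from fractions import Fraction

def empirical(xs):
    """The empirical distribution Pn as a dict of point-mass weights."""
    n = len(xs)
    return {v: Fraction(k, n) for v, k in Counter(xs).items()}

Pn = empirical([0, 1, -1, 0, 4, 4, 0])
# Pn = (1/7)(delta_{-1} + 3*delta_0 + delta_1 + 2*delta_4)
```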
Example 10 (Empirical distribution in finite alphabet (IID data))
• Suppose the sample space is finite: X = {a1, . . . , aK}.
• Let P = collection of all prob. measures P on X.
• P can be parametrized by θ = (θ1, . . . , θK) where θj = P({aj}).
• The empirical measure reduces to Pn = ∑_j πj(X) δ_aj where
πj(X) := (1/n) #{i : Xi = aj}
• Show that Pn, or equivalently (π1(X), . . . , πK(X)), is minimal sufficient.
37 / 218
Statements from Theory of Point Estimation (TPE)
Proposition 2 (TPE)
Consider a finite DCS family P = {Pθ, θ ∈ Ω}, i.e., Ω = {θ0, θ1, . . . , θK}. Then, the following is minimal sufficient:
L(X) = (pθ1(X)/pθ0(X), . . . , pθK(X)/pθ0(X)).
Proposition 3 (TPE)
Assume P is DCS and P0 ⊂ P. Assume that T is
• sufficient for P, and
• minimal sufficient for P0.
Then, T is minimal sufficient for P.
Proof: The same support gives “a.e. P0 ⇒ a.e. P”.
S sufficient for P ⇒ S sufficient for P0.
T minimal suff. for P0 ⇒ T = f(S) a.e. P0 and hence a.e. P. Q.E.D.
38 / 218
Example 11 (Gaussian location family)
• Consider X1, . . . , Xn iid∼ N(θ, 1).
• Look at the sub-family P0 = {N(θ0, 1), N(θ1, 1)}. Let T(X) = ∑_i Xi.
• The following is minimal sufficient by Proposition 2:
log Lθ1 := log[pθ1(x)/pθ0(x)] = (1/2) ∑_i (xi − θ0)^2 − (1/2) ∑_i (xi − θ1)^2
= (θ1 − θ0) T(x) + (n/2)(θ0^2 − θ1^2)
showing that T(X) is minimal sufficient for P0.
• Since T(X) is also sufficient for P (Exercise.), Proposition 3 implies that it is minimal sufficient for P.
39 / 218
Completeness and ancillarity
• We can compress even more!
Definition 6
• V = V(X) is ancillary if its distribution does not depend on θ.
• First-order ancillary if its expectation does not depend on θ (Eθ V = c, ∀θ).
• The latter is a weaker notion.
Example 12 (Location family)
• The following statistics are all ancillary:
Xi − Xj, Xi − X(j), X(i) − X(j), X(i) − X̄
• Hint: We can write Xi = θ + εi, where εi iid∼ f.
• A minimal sufficient statistic can still contain ancillary information, for example X(n) − X(1) in the Laplace location family.
40 / 218
• A notion stronger than minimal sufficiency is completeness:
Definition 7
A statistic T is complete if
(Eθ[f(T)] = c, ∀θ) ⇒ f(t) = c, ∀t.
(Actually, P-a.e. t.)
• T is complete if no nonconstant function of it is first-order ancillary.
• A minimal sufficient statistic need not be complete:
Example 13 (Laplace location family)
• X(n) − X(1) is ancillary, hence first-order ancillary.
• f(X(1), . . . , X(n)) is ancillary for the nonconstant function f(z1, . . . , zn) = z1 − zn.
• The converse, however, is true.
41 / 218
• Showing completeness is not easy.
• Will see a general result for exponential families.
• Here is another example:
Example 14
• X1, . . . , Xn iid∼ U[0, θ].
• Will show that T = max{X1, . . . , Xn} is complete.
• CDF of T: FT(t) = (t/θ)^n 1{t ∈ (0, θ)} + 1{t ≥ θ}.
• Density: t ↦ n θ^(−n) t^(n−1) 1{t ∈ (0, θ)}.
• Suppose that Eθ f(T) = 0 for all θ > 0. Then,
0 = Eθ f(T) = n θ^(−n) ∫_0^θ f(t) t^(n−1) dt, ∀θ > 0
• The Fundamental theorem of calculus implies f(t) t^(n−1) = 0, a.e. t > 0,
• Hence f(t) = 0 a.e. t > 0. Conclude that T is complete.
42 / 218
Detour: Conditional expectation as L2 projection
• The L2 space of random variables:
L2 := L2(P) := {X : E[X^2] < ∞}
• We can define an inner product on this space:
〈X, Y〉 := E[XY], X, Y ∈ L2
• The inner product induces a norm, called the L2 norm,
‖X‖2 := √〈X, X〉 = √E[X^2]
• The norm induces a distance ‖X − Y‖2.
• Squared distance ‖X − Y‖2^2 = E(X − Y)^2, the same as MSE.
• Orthogonality: X ⊥ Y if 〈X, Y〉 = 0, i.e., E[XY] = 0.
43 / 218
Detour: Conditional expectation as L2 projection
• Assume E X^2 < ∞ and E Y^2 < ∞ (i.e., X, Y ∈ L2).
• Consider the linear space
L := {g(X) | g is a (measurable) function with E[g(X)]^2 < ∞}
i.e., the space of all (meas.) functions of X.
• There is an essentially unique L2 projection of Y onto L:
Ŷ := argmin_{Z ∈ L} ‖Y − Z‖2
• Alternatively, an essentially unique function ĝ such that
min_g E(Y − g(X))^2 = E(Y − ĝ(X))^2
• We define E[Y|X] := ĝ(X).
44 / 218
Detour: Conditional expectation as L2 projection
• There is an essentially unique function ĝ such that
min_g E(Y − g(X))^2 = E(Y − ĝ(X))^2
• We define E[Y|X] := ĝ(X).
• E[Y|X] is the best prediction of Y given X, in the MSE sense.
• From this definition, we get the following characterization of ĝ:
E[(Y − ĝ(X)) g(X)] = 0, ∀g
saying that the optimal prediction error Y − ĝ(X) is orthogonal to L.
• Applied to the constant function g(X) ≡ 1, we get
E[Y] = E[ĝ(X)] = E[E[Y|X]],
the smoothing or averaging property of conditional expectation.
45 / 218
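The projection view can be illustrated on a finite sample space: within-group averages realize the empirical E[Y|X], and any other function of X predicts worse in MSE. A sketch (the model Y = X^2 + noise and all constants are arbitrary choices):

```python
import random

rng = random.Random(0)
xs = [rng.choice([0, 1, 2]) for _ in range(60000)]
ys = [x * x + rng.gauss(0, 1) for x in xs]  # Y = X^2 + noise

# ghat(x) = within-group average of Y, the empirical E[Y | X = x]
groups = {}
for x, y in zip(xs, ys):
    groups.setdefault(x, []).append(y)
ghat = {x: sum(v) / len(v) for x, v in groups.items()}

def emp_mse(g):
    """Empirical MSE of predicting Y by g(X)."""
    return sum((y - g[x]) ** 2 for x, y in zip(xs, ys)) / len(xs)

mse_proj = emp_mse(ghat)
mse_other = emp_mse({0: 0.5, 1: 0.5, 2: 3.0})  # some competing g
```

In-sample, the within-group mean minimizes the MSE exactly, so `mse_proj` beats any competing g.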
Proposition 4
A complete sufficient statistic is minimal sufficient.
Proof. Let T be complete sufficient, and U minimal sufficient.
• The idea is to show that T is a function of U.
• U = g(T). (By minimal sufficiency of U.)
• Let h(U) := Eθ[T|U], well defined by sufficiency of U.
• Eθ[T − h(U)] = 0, ∀θ ∈ Ω. (By smoothing.)
• T = h(U). (By completeness of T.)
Hints:
• Take f(t) := t − h(g(t)) in the definition of completeness.
• Equalities are a.e. P.
46 / 218
Proposition 5 (Basu)
Let T be complete sufficient and V ancillary. Then T and V are independent.
Proof. Let A be an event.
• qA := Pθ(V ∈ A) well-defined. (By ancillarity of V.)
• fA(T ) := Pθ(V ∈ A|T ) well-defined. (By sufficiency of T .)
• Eθ[qA − fA(T )] = 0, ∀θ.
• qA = fA(T ). (By completeness of T .)
Equalities are a.e. P.
47 / 218
• Application of Basu:
Example 15 (Gaussian location family)
• X1, . . . , Xn iid∼ N(θ, σ^2), θ is unknown, σ^2 is known.
• X̄ := (1/n) ∑_i Xi is complete sufficient. (cf. Exponential families)
• (Xi − X̄, i = 1, . . . , n) is ancillary.
• Hence, the sample variance S^2 := 1/(n−1) ∑_i (Xi − X̄)^2 is ancillary.
• Hence, X̄ and S^2 are independent.
• Had we taken (θ, σ^2) as the parameter, then S^2 would not be ancillary.
48 / 218
Rao–Blackwell
• Rao–Blackwell theorem ties risk minimization with sufficiency.
• It is a statement about convex loss functions.
Definition 8
A function f : R^p → R is convex if for all x, y,
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y), ∀λ ∈ [0, 1], (1)
and strictly convex if the inequality is strict for λ ∈ (0, 1) and x ≠ y.
Example 16 (ℓp loss)
• Loss function a ↦ L(θ, a) = |θ − a|^p on R.
• Convex for p ≥ 1 and nonconvex for p ∈ (0, 1).
• Strictly convex when p > 1.
• In particular, the ℓ1 loss (p = 1) is convex but not strictly convex.
49 / 218
Jensen inequality
By induction, (1) leads to f(∑_i αi xi) ≤ ∑_i αi f(xi), for αi ≥ 0 and ∑_i αi = 1. A sweeping generalization is the following:
Proposition 6 (Jensen inequality)
Assume that f : S → R is convex and consider a random variable X concentrated on S (i.e., P(X ∈ S) = 1), with E|X| < ∞. Then,
E f(X) ≥ f(EX)
If f is strictly convex, equality holds iff X ≡ EX a.s. (that is, X is constant).
Proof. Relies on the existence of supporting hyperplanes to f (i.e., affine minorants that touch the function).
50 / 218
• Let x0 := EX.
• Let A(x) = 〈a, x − x0〉 + f(x0) be a supporting hyperplane to f at x0:
f(x) ≥ A(x), ∀x ∈ S, and A(x0) = f(x0).
• Then, we have
f(X) ≥ A(X) ⇒ E[f(X)] ≥ E[A(X)] (Monotonicity of E)
= 〈a, E[X − x0]〉 + f(x0) (Linearity of E)
= f(x0)
• Strict convexity implies f(x) > A(x) for x ≠ x0.
• If X ≠ x0 with positive prob., we have f(X) > A(X) with positive prob.,
• Hence, E f(X) > E A(X), and the rest follows. Q.E.D.
51 / 218
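A tiny numeric instance of Jensen, with the strictly convex f = exp and a two-point X (an arbitrary choice of distribution):

```python
import math

# X takes values 0 and 2, each with probability 1/2; f = exp is strictly convex
probs = {0.0: 0.5, 2.0: 0.5}
e_x = sum(p * x for x, p in probs.items())             # E[X] = 1
e_fx = sum(p * math.exp(x) for x, p in probs.items())  # E[f(X)]
gap = e_fx - math.exp(e_x)  # Jensen gap; strictly positive since X is not constant
```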
Recall the decision-theoretic setup: Loss function L(θ, a), decision rule δ = δ(X), risk R(θ, δ) = Eθ L(θ, δ).
Theorem 2 (Rao–Blackwell)
Let us assume the following:
• T is sufficient for family P,
• δ = δ(X ) is a possibly randomized decision rule,
• a ↦ L(θ, a) is convex, for all θ ∈ Ω.
Define the estimator η(T ) := Eθ[δ|T ]. Then, η dominates δ, that is,
R(θ, η) ≤ R(θ, δ) for all θ ∈ Ω
The inequality is strict when the loss is strictly convex, unless η = δ a.e. P.
Consequence: for convex loss functions, randomization does not help. Proof:
• η is well-defined. (By sufficiency of T .)
• Eθ[L(θ, δ)|T ] ≥ L(θ,Eθ[δ|T ]) = L(θ, η). (By conditional Jensen inequality.)
• Take expectation and use monotonicity and smoothing. Q.E.D.
52 / 218
Example 17
• Xi iid∼ U[0, θ], i = 1, . . . , n.
• T = max{X1, . . . , Xn} is sufficient. Take δ = (1/n) ∑_i Xi.
• Rao–Blackwell: η = Eθ[δ|T] strictly dominates δ, for any strictly convex loss. Let us verify this:
• Conditional distribution of Xi given T: a mixture of a point mass at T and the uniform distribution on [0, T] (why?):
Pθ(Xi ∈ A|T) = (1/n) δT(A) + (1 − 1/n) ∫_0^T (1/T) 1A(x) dx
• Compactly, Xi | T ∼ (1/n) δT + (1 − 1/n) Unif(0, T).
• It follows
Eθ(Xi |T) = (1/n) T + (1 − 1/n)(T/2) = ((n + 1)/(2n)) T
• The same expression holds for η by symmetry.
53 / 218
• That is, η = Eθ[δ|T] = ((n + 1)/(2n)) T
• Consider the quadratic loss:
• η has the same bias as δ (by smoothing).
• How about variances?
varθ(δ) = (1/n)(θ^2/12)
• Since T/θ is Beta(n, 1) distributed, we have
varθ(η) = ((n + 1)/(2n))^2 · nθ^2/((n + 1)^2(n + 2)) = θ^2/(4n(n + 2)) (2)
• (Note: δ is biased. 2δ is unbiased. Better to compare 2η and 2δ.)
54 / 218
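Both variance formulas, and the domination of δ by η, can be checked by simulation; a sketch comparing the two rules as estimators of E[Xi] = θ/2 (θ, n, reps, seed are arbitrary choices):

```python
import random

def rao_blackwell_uniform(theta=1.0, n=5, reps=200000, seed=3):
    """Empirical squared errors (about theta/2) of delta = sample mean
    and eta = (n+1)/(2n) * max, for U[0, theta] samples."""
    rng = random.Random(seed)
    se_delta = se_eta = 0.0
    for _ in range(reps):
        x = [rng.uniform(0, theta) for _ in range(n)]
        delta = sum(x) / n
        eta = (n + 1) / (2 * n) * max(x)
        se_delta += (delta - theta / 2) ** 2
        se_eta += (eta - theta / 2) ** 2
    return se_delta / reps, se_eta / reps
```

Since both rules are unbiased for θ/2, the squared errors estimate varθ(δ) = θ^2/(12n) and varθ(η) = θ^2/(4n(n + 2)).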
• Another result about strictly convex loss functions: an admissible decision rule is uniquely determined by its risk function.
Proposition 7
• Assume a ↦ L(θ, a) is strictly convex, for all θ. (R(θ, δ) = Eθ[L(θ, δ)].)
Then, the map δ ↦ R(·, δ) is injective over the class of admissible decision rules.
We are identifying decision rules that are the same a.e. P.
Proof.
• Let δ be admissible.
• Let δ′ ≠ δ be such that R(θ, δ) = R(θ, δ′), ∀θ.
• Take δ∗ = (1/2)(δ + δ′). Then, by strict convexity of the loss,
L(θ, δ∗) < (1/2)(L(θ, δ) + L(θ, δ′)), ∀θ
Taking expectation: (Note: X > 0 a.s. ⇒ E[X] > 0)
R(θ, δ∗) < (1/2)(R(θ, δ) + R(θ, δ′)) = R(θ, δ), ∀θ
• δ′ ≠ δ implies δ∗ ≠ δ.
• δ∗ strictly dominates δ, contradicting admissibility of δ.
55 / 218
Rao–Blackwell can fail for non-convex loss. Here is an example:
Example 18
• X ∼ Bin(n, θ). Ω = A = [0, 1].
• ε-sensitive loss function Lε(θ, a) = 1{|θ − a| ≥ ε}.
• Consider a general deterministic estimator δ = δ(X).
• δ takes at most n + 1 values {δ(0), δ(1), . . . , δ(n)} ⊂ [0, 1].
• Divide [0, 1] into bins of length 2ε.
• Assume that N := 1/(2ε) ≥ n + 2 and that N is an integer (for simplicity).
• At least one of the N bins contains no δ(i), i = 0, . . . , n, and the midpoint of that bin is at distance ≥ ε from any δ(i). Hence
sup_{θ ∈ [0,1]} R(θ, δ) = 1
for any nonrandomized rule (assuming ε ≤ 1/[2(n + 2)]).
• Consider the randomized estimator δ′ = U ∼ U[0, 1], independent of X:
R(θ, δ′) = P(|U − θ| ≥ ε) ≤ 1 − ε (worst case at θ = 0, 1)
• sup_{θ ∈ [0,1]} R(θ, δ′) < 1, strictly better than that of δ.
56 / 218
Uniformly minimum variance unbiased (UMVU) criterion
• Comparing estimators based on their risk functions is problematic.
• One way to mitigate this: restrict the class of estimators.
• Focus on quadratic loss, and restrict to unbiased estimators.
• By the bias–variance decomposition, the risk of an unbiased estimator is just its variance.
Let Ug be the class of unbiased estimators of g(θ), that is,
Ug = {δ : Eθ[δ] = g(θ), ∀θ}.
Definition 9
An estimator δ is UMVU for estimating g(θ) if
• δ ∈ Ug , and
• varθ(δ) ≤ varθ(δ′), for all θ ∈ Ω and for all δ′ ∈ Ug.
57 / 218
Theorem 3 (Lehmann–Scheffé)
Consider the family P and assume that
• Ug is nonempty (i.e., g is U-estimable), and
• there is a complete sufficient statistic T for P.
Then, there is an essentially unique UMVU for g(θ) of the form h(T ).
Proof.
• Pick δ ∈ Ug (Valid by non-emptiness.)
• Let η = Eθ[δ|T ] be an estimator (Well-defined by sufficiency of T .)
• Claim: η is the essentially unique UMVU.
• Pick any δ′ ∈ Ug and let η′ = Eθ[δ′|T ].
• Eθ[η − η′] = g(θ)− g(θ) = 0, ∀θ (By smoothing and unbiasedness.)
(a) η − η′ = 0 a.e. P. (By completeness of T .)
• By Rao–Blackwell for the quadratic loss a ↦ (g(θ) − a)^2 and unbiasedness,
varθ(η) = R(θ, η) = R(θ, η′) [by (a)] ≤ R(θ, δ′) = varθ(δ′)
• Since δ′ was an arbitrary element of the class Ug we are done. Q.E.D.
58 / 218
Proof.
• Pick δ ∈ Ug (Valid by non-emptiness.)
• Let η = Eθ[δ|T ] be an estimator (Well-defined by sufficiency of T .)
• Claim: η is the essentially unique UMVU.
• Pick any δ′ ∈ Ug and let η′ = Eθ[δ′|T ].
• Eθ[η − η′] = g(θ)− g(θ) = 0, ∀θ (By smoothing and unbiasedness.)
• η − η′ = 0 a.e. P. (By completeness of T .)
• By Rao–Blackwell for the quadratic loss a ↦ (g(θ) − a)^2 and unbiasedness,
varθ(η) = R(θ, η) = R(θ, η′) < R(θ, δ′) = varθ(δ′) [strict inequality: step (b)]
• Since δ′ was an arbitrary element of the class Ug we are done. Q.E.D.
Remark 1
Note that we have also shown uniqueness:
(b) If δ′ ∈ Ug is UMVU and not a function of T, then it is strictly dominated by η′ (by Rao–Blackwell and strict convexity of the quadratic loss).
• Otherwise, it is equal to η′, which is equal to η a.e. P.
59 / 218
Lehmann–Scheffé suggests a way of constructing UMVUs.
Example 19 (Coin tossing)
• X1, . . . , Xn iid∼ Ber(θ); want to estimate g(θ) = θ^2.
• T = ∑_i Xi is complete and sufficient. (General result for exponential families.)
• Take U = X1X2.
• U is unbiased for θ^2: Eθ[U] = Eθ[X1] Eθ[X2] = θ^2 by independence.
• By Lehmann–Scheffé,
E[U|T = t] = P(X1 + X2 = 2 | T = t)
= {(n−2 choose t−2)/(n choose t) if t ≥ 2; 0 otherwise} = t(t − 1)/(n(n − 1))
is the UMVU estimator for θ^2.
60 / 218
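The conditional expectation t(t − 1)/(n(n − 1)) can be confirmed by enumerating all binary sequences with a given sum, since given T = t the sequences are uniform (n = 6 below is an arbitrary choice):

```python
from itertools import product

def cond_mean_x1x2(n, t):
    """E[X1*X2 | T = t] by direct enumeration over {0,1}^n (uniform given T)."""
    seqs = [x for x in product((0, 1), repeat=n) if sum(x) == t]
    return sum(x[0] * x[1] for x in seqs) / len(seqs)
```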
Approach 2 in obtaining UMVUs:
Example 20
• X1, . . . , Xn iid∼ U[0, θ].
• T = X(n) = max{X1, . . . , Xn} is complete sufficient.
• The UMVU for g(θ) is given by h(X(n)), where
• h is the solution of the following integral equation:
g(θ) = Eθ[h(X(n))] = n θ^(−n) ∫_0^θ t^(n−1) h(t) dt.
• For g(θ) = θ, δ1 = ((n + 1)/n) T is unbiased, hence UMVU by Lehmann–Scheffé.
• MSEθ(δ1) = varθ(δ1) = θ^2/(n(n + 2)).
• On the other hand, among estimators of the form δa = aT, a = (n + 2)/(n + 1) gives the lowest MSE.
• This biased estimator has slightly better MSE = θ^2/(n + 1)^2.
• A little bit of bias is not bad.
61 / 218
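The two MSEs can be computed exactly from the first two moments of T = X(n), namely E[T] = nθ/(n + 1) and E[T^2] = nθ^2/(n + 2); the helper below is a sketch (n and θ are arbitrary choices):

```python
def mse_scaled_max(a, theta, n):
    """Exact MSE of a * X_(n) for U[0, theta] samples, via the moments of the max."""
    et = n * theta / (n + 1)
    et2 = n * theta ** 2 / (n + 2)
    return a * a * et2 - 2 * a * theta * et + theta ** 2

n, theta = 10, 1.0
mse_umvu = mse_scaled_max((n + 1) / n, theta, n)        # unbiased: theta^2/(n(n+2))
mse_best = mse_scaled_max((n + 2) / (n + 1), theta, n)  # biased:   theta^2/(n+1)^2
```

The slightly biased choice a = (n + 2)/(n + 1) beats the UMVU, as claimed on the slide.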
Exponential family
Definition 10
X: a general sample space; Ω: a general parameter space.
• A function T : X → R^d, T(x) = (T1(x), . . . , Td(x)).
• A function η : Ω → R^d, η(θ) = (η1(θ), . . . , ηd(θ)).
• A measure ν on X (e.g., Lebesgue or counting), and a function h : X → R₊.
The exponential family with sufficient statistic T and parametrization η, relative to h · ν, is the dominated family of distributions given by the following densities w.r.t. ν:
pθ(x) = exp{〈η(θ), T(x)〉 − A(θ)} h(x), x ∈ X,
where 〈η(θ), T(x)〉 = ∑_{i=1}^d ηi(θ) Ti(x) is the Euclidean inner product.
62 / 218
• T : X → R^d, η : Ω → R^d.
pθ(x) = exp{〈η(θ), T(x)〉 − A(θ)} h(x), x ∈ X.
• A(θ) is determined by the other ingredients,
• via the normalization constraint ∫ pθ(x) dν(x) = 1:
A(θ) = log ∫ e^{〈η(θ), T(x)〉} dν̃(x)
where dν̃(x) := h(x) dν(x).
• A is called the log-partition function or cumulant generating function.
• The actual parameter space is
Ω0 = {θ ∈ Ω : A(θ) < ∞}.
• By the factorization theorem, T(X) is indeed sufficient.
• The representation of the exponential family is not unique.
63 / 218
Here are some examples:
Example 21
• X ∼ Ber(θ):
pθ(x) = θ^x (1 − θ)^(1−x) = exp[x log(θ/(1 − θ)) + log(1 − θ)].
• Here h(x) = 1{x ∈ {0, 1}},
η(θ) = log(θ/(1 − θ)), T(x) = x
A(θ) = − log(1 − θ), Ω0 = (0, 1)
• We need to take Ω = (0, 1), otherwise η is not well-defined.
64 / 218
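A sanity check that the canonical form above reproduces the Bernoulli pmf:

```python
import math

def bernoulli_canonical(x, theta):
    """exp{x*eta(theta) - A(theta)} with eta(theta) = log(theta/(1-theta))
    and A(theta) = -log(1-theta)."""
    eta = math.log(theta / (1 - theta))
    A = -math.log(1 - theta)
    return math.exp(x * eta - A)
```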
Example 22
• X ∼ N(µ, σ^2): Let θ = (µ, σ^2),
pθ(x) = (1/√(2πσ^2)) exp[−(x − µ)^2/(2σ^2)]
= exp[−x^2/(2σ^2) + (µ/σ^2) x − (µ^2/(2σ^2) + (1/2) log(2πσ^2))]
• Here, h(x) = 1,
η(θ) = (µ/σ^2, −1/(2σ^2)), T(x) = (x, x^2)
A(θ) = µ^2/(2σ^2) + (1/2) log(2πσ^2), Ω0 = {(µ, σ^2) : σ^2 > 0}
• We could have taken h(x) = 1/√(2π) and A(θ) = µ^2/(2σ^2) + (1/2) log σ^2.
65 / 218
Example 23
• X ∼ U[0, θ],
• pθ(x) = θ^(−1) 1{x ∈ (0, θ)}.
• Not an exponential family, since the support depends on the parameter.
66 / 218
Consider the following conditions:
(E1) η(Ω0) has non-empty interior.
(E2) T1, . . . , Td, 1 are linearly independent ν-a.e. That is,
∄ a ∈ R^d \ {0}, c ∈ R such that 〈a, T(x)〉 = c, ν-a.e. x.
(E1′) η(Ω0) is open.
Definition 11
• A family satisfying (E1) and (E2) is called full-rank.
• One that satisfies (E1′) is regular.
• One that satisfies (E2) is minimal.
• Condition (E1) prevents ηi from satisfying a constraint.
• Condition (E2) prevents unidentifiability.
67 / 218
Example 24
• A Bernoulli model: pθ(x) ∝ exp(θ0(1 − x) + θ1x).
• x + (1 − x) = 1, ∀x. Hence, the family is not full-rank.
Example 25
• A continuous model with Ω = R and η1(θ) = θ, η2(θ) = θ^2.
• The interior of η(Ω) is empty, hence the model is not full-rank.
68 / 218
Theorem 4
In a full-rank exponential family, T is complete.
• We will only show that T is minimal sufficient.
• Completeness is more technical, but follows from Laplace transform arguments.
69 / 218
Theorem 5
In a full-rank exponential family, T is complete.
Proof. (Minimal sufficiency.)
• By the factorization theorem, T is sufficient for P (the whole family).
• Choose {θ0, θ1, . . . , θd} ⊂ Ω, with ηi := η(θi), such that
η1 − η0, η2 − η0, . . . , ηd − η0 are linearly independent. (Possible by (E1).)
• The matrix A^T = (η1 − η0, . . . , ηd − η0) ∈ R^{d×d} is full-rank.
• Let P0 = {pθi : i = 0, 1, . . . , d}. Then, with T = T(X),
(log pθ1(X)/pθ0(X), . . . , log pθd(X)/pθ0(X)) = (〈η1 − η0, T〉, . . . , 〈ηd − η0, T〉) = A T
(up to constants not depending on x) is minimal sufficient for P0.
• It follows that T is so, since A is invertible.
• Since P0 and P have common support, T is also minimal for P.
70 / 218
pθ(x) = exp{〈η(θ), T(x)〉 − A(θ)} h(x), x ∈ X.
Definition 12
An exponential family is in canonical (or natural) form if η(θ) = θ.
In this case:
• η = θ is called the natural parameter.
• Ω0 := {θ ∈ R^d : A(θ) < ∞} is called the natural parameter space.
• Ω0 ⊂ R^d.
The family is determined by the choice of X, T(x) and ν̃ = h · ν.
Example 26 (Two-parameter Gaussian)
• X = R, T(x) = (x, x^2).
• pθ(x) = exp(θ1 x + θ2 x^2 − A(θ)), ∀x ∈ X.
• A(θ) = log ∫ e^{θ1 x + θ2 x^2} dx.
• A(θ) < ∞ iff θ2 < 0. Natural parameter space: Ω0 = {(θ1, θ2) : θ2 < 0}.
• Note: θ1 = µ/σ^2 and θ2 = −1/(2σ^2) (in the original parametrization (µ, σ^2)).
71 / 218
Recall: ‖x‖1 = ∑_{i=1}^d |xi|.
Example 27 (Multinomial)
• X = Z₊^d = {x = (x1, . . . , xd) : xi integer, xi ≥ 0}
• T(x) = x.
• h(x) = (n choose x1, x2, . . . , xd) 1{‖x‖1 = n}, ν = counting measure
• Canonical family: pθ(x) = exp(∑_{i=1}^d θi xi − A(θ)) h(x).
• Can show that A(θ) = n log(∑_{i=1}^d e^{θi}), finite everywhere.
• Hence Ω0 = R^d.
• Not full-rank. Violates (E2).
72 / 218
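The claim A(θ) = n log ∑_i e^{θi} is the multinomial theorem in disguise; it can be checked by brute force over all x with ‖x‖1 = n (d = 3, n = 4 below are arbitrary choices):

```python
from itertools import product
from math import factorial, exp, log

def log_partition_bruteforce(theta, n):
    """log of the sum over {x : sum(x) = n} of multinom(n; x) * exp(<theta, x>)."""
    total = 0.0
    for x in product(range(n + 1), repeat=len(theta)):
        if sum(x) != n:
            continue
        coef = factorial(n)
        for xi in x:
            coef //= factorial(xi)
        total += coef * exp(sum(t * xi for t, xi in zip(theta, x)))
    return log(total)

theta = [0.3, -1.0, 0.7]
lhs = log_partition_bruteforce(theta, 4)
rhs = 4 * log(sum(exp(t) for t in theta))  # n * log(sum_i e^{theta_i})
```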
• The multinomial distribution with the usual parameter,
qπ(x) = (n choose x1, x2, . . . , xd) ∏_{i=1}^d πi^{xi},
looks like a subfamily:
• Corresponds to the following subset of the natural parameter space Ω0:
{(log πi) : πi > 0, ∑_{i=1}^d πi = 1} = {θ ∈ R^d : ∑_{i=1}^d e^{θi} = 1}.
• This family is also not full-rank (violates (E1)).
• Actually not a sub-family of Example 27, since pθ = p_{θ+a1} for any a ∈ R.
• That is, the θ parametrization is non-identifiable in Example 27.
73 / 218
Example 28 (Multivariate Gaussian)
• X = R^p. ν = Lebesgue measure on X.
• T(x) = (xi, i = 1, . . . , p | xixj, 1 ≤ i < j ≤ p | xi², i = 1, . . . , p).
• param = (θi, i = 1, . . . , p | 2Θij, 1 ≤ i < j ≤ p | Θii, i = 1, . . . , p).
• Corresponding canonical exponential family:

pθ,Θ(x) = exp( ∑_i θixi + 2 ∑_{i<j} Θijxixj + ∑_i Θiixi² − A(θ,Θ) )

• Compactly, treating Θ as a symmetric matrix,

pθ,Θ(x) = exp{⟨θ, x⟩ + ⟨Θ, xxᵀ⟩ − A(θ,Θ)}

where ⟨Θ, xxᵀ⟩ := tr(Θxxᵀ) = tr(xᵀΘx) = xᵀΘx.
• Dimension (or rank) of the family: d = p + p(p + 1)/2 (p coordinates of θ plus p(p + 1)/2 distinct entries of Θ).
74 / 218
• Density of the multivariate Gaussian N(µ, Σ):

pµ,Σ(x) ∝ (1/|Σ|^{1/2}) exp[ −(1/2)(x − µ)ᵀΣ⁻¹(x − µ) ]
= exp[ −(1/2)xᵀΣ⁻¹x + xᵀΣ⁻¹µ − (1/2)µᵀΣ⁻¹µ − (1/2) log|Σ| ]

• Can be written as a canonical exponential family

pθ,Θ(x) = exp{⟨θ, x⟩ + ⟨Θ, xxᵀ⟩ − A(θ,Θ)}

where ⟨Θ, xxᵀ⟩ := tr(Θxxᵀ) = tr(xᵀΘx) = xᵀΘx.
• Correspondence with the original parameters:
• θ = Σ⁻¹µ and Θ = −(1/2)Σ⁻¹.
• A(θ,Θ) = (1/2)(µᵀΣ⁻¹µ + log|Σ|) = −(1/4)θᵀΘ⁻¹θ − (1/2) log|−2Θ| + const.
• Sometimes called a Gaussian Markov Random Field (GMRF), esp. when Θij = 0 for (i, j) ∉ E, where E is the edge set of a graph.
75 / 218
Example 29 (Ising model)
• Both a graphical model and an exponential family.
• Used in statistical physics. Allows for complex correlations among discrete variables. Discrete counterpart of the GMRF.
• Ingredients:
• A given graph G = (V, E). V := {1, . . . , n}: vertex set. E ⊂ V²: edge set.
• Random variables attached to vertices: X = (Xi : i ∈ V).
• Each xi ∈ {−1, +1}, the spin of node i.
• Take X = {−1, +1}^V ≅ {−1, +1}^n.
• T(X) = (Xi, i ∈ V; XiXj, (i, j) ∈ E).
• Underlying measure is counting (and h(x) ≡ 1):

pθ(x) = exp( ∑_{i∈V} θixi + ∑_{(i,j)∈E} θijxixj − A(θ) )
76 / 218
Example 30 (Exponential random graph model (ERGM))
• A parametric family of probability distributions on graphs.
• Let X = the space of graphs on n nodes.
• Let Ti(G) be functions on the space of graphs, for i = 1, . . . , k.
• Usually subgraph counts:

T1(G) = number of edges
T2(G) = number of triangles
. . .
Tj(G) = number of r-stars (for a given r)
. . .

• Underlying measure: counting measure on graphs.

pθ(G) = exp( ∑_{i=1}^k θiTi(G) − A(θ) )
77 / 218
Focus: full-rank canonical (FRC) exponential families.

Proposition 8
In a canonical exponential family,
(a) A is convex on its domain Ω0,
(b) Ω0 is a convex set.

Proof. (Enough to show (a); convexity of Ω0 follows from convexity of A.)
• Apply Hölder's inequality, with 1/p = α and 1/q = 1 − α. (Exercise.) Q.E.D.

• Hölder's inequality: For X, Y ≥ 0 a.s.,

E[X^α Y^{1−α}] ≤ (EX)^α (EY)^{1−α}, ∀α ∈ [0, 1].

• The expectation can be replaced with an integral w.r.t. a general measure: for f, g ≥ 0 a.e. ν,

∫ f^α g^{1−α} dν ≤ (∫ f dν)^α (∫ g dν)^{1−α}, ∀α ∈ [0, 1].
78 / 218
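Proposition 8(a) can be checked numerically on a concrete family. A small sketch using the canonical Bernoulli family, where A(θ) = log(1 + e^θ) (the choice of family and grid are illustrative assumptions):

```python
import math

def A(theta):
    # log-partition of the canonical Bernoulli family: A(theta) = log(1 + e^theta)
    return math.log(1.0 + math.exp(theta))

# check A(alpha*t1 + (1-alpha)*t2) <= alpha*A(t1) + (1-alpha)*A(t2) on a grid
violations = 0
thetas = [-3.0, -1.0, 0.0, 0.5, 2.0]
for t1 in thetas:
    for t2 in thetas:
        for alpha in (0.1, 0.25, 0.5, 0.75, 0.9):
            lhs = A(alpha * t1 + (1 - alpha) * t2)
            rhs = alpha * A(t1) + (1 - alpha) * A(t2)
            if lhs > rhs + 1e-12:
                violations += 1
```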
Proposition 9
In a FRC exponential family, A is C^∞ on int(Ω0), and moreover

Eθ[T] = ∇A(θ), covθ[T] = ∇²A(θ).

That is,

∂A/∂θi = Eθ[Ti(X)], ∂²A/(∂θi∂θj) = covθ[Ti(X), Tj(X)].

Proof sketch.
• The moment generating function (mgf) of T is

MT(u) := MT(u; θ) := Eθ[e^{⟨u,T⟩}] = ∫ e^{⟨u,T(x)⟩} e^{⟨θ,T(x)⟩−A(θ)} dν(x) = e^{A(u+θ)−A(θ)}.

• If θ ∈ int Ω0, then MT is finite in a neighborhood of zero: MT(u) < ∞ for ‖u‖2 ≤ ε.
• The DCT implies MT is C^∞ in a neighborhood of 0, and we can interchange the order of differentiation and integration.
79 / 218
• Moment generating function:

MT(u) = Eθ[e^{⟨u,T⟩}] = e^{A(u+θ)−A(θ)}.

• We get (fixing θ)

MT(u) (∂A/∂ui)(u + θ) = (∂/∂ui)MT(u) = Eθ[(∂/∂ui)e^{⟨u,T⟩}] = Eθ[Ti e^{⟨u,T⟩}],

valid in a neighborhood of 0.
• Evaluating at u = 0 gives the result for the mean. (MT(0) = 1.)
• Getting the covariance is similar. (Exercise.)

Remark 2
• The covariance matrix is positive semidefinite, hence ∇²A(θ) ⪰ 0.
• This gives another proof of the convexity of A.
80 / 218
Example 31
• X ∼ N(θ, 1):

pθ(x) = (1/√(2π)) e^{−x²/2} exp(θx − θ²/2).

• A(θ) = θ²/2. Hence:

Eθ[X] = A′(θ) = θ, varθ(X) = A″(θ) = 1.
81 / 218
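The identities of Proposition 9 can be verified numerically for the canonical Bernoulli family (A(θ) = log(1 + e^θ), T(x) = x, an illustrative choice): finite differences of A should match the exact moments computed over the two-point support. A minimal sketch:

```python
import math

def A(theta):
    # Bernoulli canonical log-partition: A(theta) = log(1 + e^theta)
    return math.log(1.0 + math.exp(theta))

def exact_moments(theta):
    # E[T] and var(T) computed directly over the support {0, 1}, with T(x) = x
    probs = {x: math.exp(theta * x - A(theta)) for x in (0, 1)}
    m1 = sum(x * p for x, p in probs.items())
    m2 = sum(x * x * p for x, p in probs.items())
    return m1, m2 - m1 ** 2

theta = 0.7
h = 1e-5
A1 = (A(theta + h) - A(theta - h)) / (2 * h)                 # central difference for A'(theta)
A2 = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h ** 2   # second difference for A''(theta)
mean, var = exact_moments(theta)
```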
Mean parameters
• Exponential family:

dPθ(x) = exp[⟨θ, T(x)⟩ − A(θ)] dν(x).

• Alternative parametrization in terms of the mean parameter µ:

µ := µ(θ) = Eθ[T(X)].

• Mean parameters are easy to estimate: µ̂ = (1/n) ∑_{i=1}^n T(X^{(i)}).
Example 32 (Two-parameter Gaussian)
• X = R, T(x) = (x, x²), pθ(x) = exp(θ1x + θ2x² − A(θ)).
• Natural parameter space: Ω0 = {(θ1, θ2) : θ2 < 0}.
• θ1 = m/σ² and θ2 = −1/(2σ²) in the original parametrization N(m, σ²).
• Mean parameters:

µ = (µ1, µ2) = (E[X], E[X²]) = (m, m² + σ²) = ( −θ1/(2θ2), θ1²/(4θ2²) − 1/(2θ2) ).
82 / 218
Realizable means
• An interesting general set:

M := {µ ∈ R^d | µ = Ep[T(X)] for some density p w.r.t. ν},

the set of mean parameters realizable by some distribution (absolutely continuous w.r.t. ν).
• M is essentially the convex hull of the support of ν#T = ν ∘ T⁻¹.
• More precisely, int(M) = int(co(supp(ν#T))).
83 / 218
M := {µ ∈ R^d | µ = Ep[T(X)] for some density p w.r.t. ν}

Example 33
• T(X) = (X, X²) ∈ R² and ν the Lebesgue measure:

(µ1, µ2) = (Ep[X], Ep[X²]).

• By nonnegativity of the variance we need to have

M ⊂ M0 := {(µ1, µ2) : µ2 ≥ µ1²}.

• Any (µ1, µ2) ∈ int M0 can be realized by a N(µ1, µ2 − µ1²).
• bd M0 := {(µ1, µ2) : µ2 = µ1²} cannot be achieved by a density (why?).
• bd M0 can be approached arbitrarily closely by densities.
• We have M = int M0 and cl(M) = M0.
84 / 218
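The claim that any interior point of M0 is realized by N(µ1, µ2 − µ1²) can be illustrated by simulation: with a fixed seed, the empirical first two moments land close to the target. A sketch (sample size and tolerances are ad hoc choices):

```python
import math
import random

random.seed(0)

def realize(mu1, mu2, n=100_000):
    # Sample from N(mu1, mu2 - mu1^2), well-defined when mu2 > mu1^2,
    # and return the empirical mean parameters (E[X], E[X^2]).
    sigma = math.sqrt(mu2 - mu1 ** 2)
    xs = [random.gauss(mu1, sigma) for _ in range(n)]
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    return m1, m2

mu1, mu2 = 1.5, 3.0          # interior point: mu2 = 3.0 > mu1^2 = 2.25
m1_hat, m2_hat = realize(mu1, mu2)
```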
Example 34 (Multivariate Gaussian)
• T(X) = (X, XXᵀ).
• Let µ = Ep[X] and Λ = Ep[XXᵀ] for some density p w.r.t. Lebesgue measure.
• The covariance matrix is PSD, hence Λ − µµᵀ ⪰ 0.
• The closure of M (the set of realizable means) is

cl(M) := {(µ, Λ) | Λ ⪰ µµᵀ}.

• int(cl(M)) = M = {(µ, Λ) | Λ ≻ µµᵀ}: realized by non-degenerate Gaussian distributions N(µ, Λ − µµᵀ), a full-rank exponential family.
85 / 218
A remarkable result: anything in int M can be realized by an exponential family. Let

Ω := dom A := {θ ∈ R^d : A(θ) < ∞}.

Theorem 6
In a FRC exponential family, assuming A is essentially smooth,
• ∇A : int Ω → int M is one-to-one and onto.
In other words, ∇A establishes a bijection between int Ω and int M.

• Recall that as part of the FRC assumption, int Ω ≠ ∅.
• WLOG, we can assume T(x) = x (absorb T into the measure, via ν ∘ T⁻¹).
• That is, we work with the standard family

dPθ(x) = exp(⟨θ, x⟩ − A(θ)) dν(x).

• By Proposition 9, Eθ(X) = ∇A(θ).
• The proof is a tour de force of convex/real analysis.
86 / 218
Proof sketch
∇A : int Ω → int M is one-to-one (injective) and onto (surjective).

Let Φ := ∇A.
1. Φ is regular on int Ω: DΦ = ∇²A ∈ R^{d×d} is a full-rank matrix. True since condition (E2) implies ∇²A ≻ 0.
2. Φ is injective: Since condition (E2) implies ∇²A ≻ 0, we conclude that A is strictly convex. This in turn implies that ∇A is a strictly monotone operator: ⟨∇A(θ) − ∇A(θ′), θ − θ′⟩ > 0 for θ ≠ θ′.
3. Φ is an open mapping (maps open sets to open sets) (Corollary 3.27 of ?):
U ⊂ R^d open, f ∈ C¹(U, R^d) regular on U ⟹ f is an open mapping.
4. By Proposition 9, we have ∇A(int Ω) ⊂ M.
5. But why ∇A(int Ω) ⊂ int M? Follows from ∇A being an open map.¹

¹A continuous map is not necessarily open: x ↦ sin(x) maps (0, 4π) to [−1, 1].
87 / 218
Proof sketch
∇A : int Ω → int M is one-to-one (injective) and onto (surjective).

It remains to show that int M ⊂ ∇A(int Ω):
For any µ ∈ int M, we need to find θ ∈ int Ω such that ∇A(θ) = µ.
6. By applying a shift to ν, WLOG it is enough to show this for µ = 0 ∈ int M.
7. WTS: 0 ∈ int M ⟹ ∃θ ∈ int Ω s.t. ∇A(θ) = 0.

In general 0 ∉ int M. So, without employing a shift, all the arguments are applied to θ ↦ A(θ) − ⟨µ, θ⟩.
88 / 218
Proof sketch
0 ∈ int M ⟹ ∃θ ∈ int Ω s.t. ∇A(θ) = 0.

8. A is lower semi-continuous (lsc) on R^d: lim inf_{θ→θ0} A(θ) ≥ A(θ0).
Follows from Fatou's lemma.
(lsc only matters at bd Ω, since A is continuous on int Ω.)
9. Let Γ0(R^d) := {f : R^d → (−∞, ∞] | f is proper, convex, lsc}. (Proper means not identically ∞.)
10. A ∈ Γ0(R^d).
11. A is coercive: lim_{‖θ‖→∞} A(θ) = ∞. To be shown.
12. A is essentially smooth: by assumption.

A function f ∈ Γ0(R^d) is essentially smooth (a.k.a. steep) if
(a) f is differentiable on int dom f ≠ ∅, and
(b) ‖∇f(xn)‖ → ∞ whenever xn → x ∈ bd dom f.
We in fact only need this for x ∈ (dom f) ∩ (bd dom f), the parts of the boundary that are in the domain. In particular, (b) is not needed if dom f is itself open.
89 / 218
0 ∈ int M ⟹ ∃θ ∈ int Ω s.t. ∇A(θ) = 0.

13. A coercive lsc function attains its minimum (over R^d).
14. f ∈ Γ0(R^d) and essentially smooth ⟹ the minimum cannot be attained at bd dom f.
15. If in addition f is strictly convex on int dom f, the minimum is unique.

Lemma 4
Assume that f ∈ Γ0(R^d) is coercive, essentially smooth, and strictly convex on int dom f. Then f attains its unique minimum at some x ∈ int dom f.

A is coercive, essentially smooth, and strictly convex on int dom A = int Ω.
16. Conclude that A attains its minimum at a unique point θ ∈ int Ω.
17. The necessary first-order optimality condition is ∇A(θ) = 0.
18. Done if we show the only remaining piece: coercivity.
90 / 218
A is coercive.

19. For every ε, let Hu,ε := {x ∈ R^d : ⟨x, u⟩ ≥ ε}, and let S = {u : ‖u‖2 = 1}.
20. 0 ∈ int M (and the full-rank assumption) implies that ∃ε > 0 such that

inf_{u∈S} ν(Hu,ε) > 0,

i.e., ∃ε > 0 and c ∈ R such that ν(Hu,ε) ≥ e^c for all u ∈ S.
21. Then, for any ρ > 0,

∫ e^{⟨ρu,x⟩} ν(dx) ≥ ∫_{Hu,ε} e^{⟨ρu,x⟩} ν(dx) ≥ e^{ρε} ν(Hu,ε) ≥ e^{ρε+c}.

That is, A(ρu) ≥ ρε + c.
22. For any θ ≠ 0, taking u = θ/‖θ‖ and ρ = ‖θ‖, we obtain

A(θ) ≥ ‖θ‖ε + c, ∀θ ∈ R^d \ {0},

showing that A is coercive.
91 / 218
Side note: In fact, ∇A : int Ω → int M is a C¹ diffeomorphism (i.e., a bijection which is C¹ in both directions). This follows from Theorem 3.2.8 of ?:

Theorem 7 (Global inverse function theorem)
Let U ⊂ R^d be open and Φ ∈ C¹(U, R^d). The following are equivalent:
• V = Φ(U) is open and Φ : U → V is a C¹ diffeomorphism.
• Φ is injective and regular on U.

Φ = ∇A is injective and regular on int Ω, hence a C¹ diffeomorphism.
92 / 218
MLE in exponential family
• Significance of Theorem 6 for statistical inference:
• Assume X1, . . . , Xn are i.i.d. draws from

pθ(x) = exp(⟨θ, T(x)⟩ − A(θ)) h(x).

• The likelihood is

LX(θ) = ∏_i pθ(Xi) ∝ exp(⟨θ, ∑_i T(Xi)⟩ − nA(θ)).

• Letting µ̂ = (1/n) ∑_{i=1}^n T(Xi), the log-likelihood is

ℓX(θ) = n[⟨θ, µ̂⟩ − A(θ)] + const.

• If µ̂ ∈ int M, there exists a unique MLE, the solution of ∇A(θ) = µ̂.
• That is, θ̂MLE = (∇A)⁻¹(µ̂) = ∇A*(µ̂).
• A* is the Fenchel–Legendre conjugate of A.
• If µ̂ ∉ M, then the MLE does not exist. What happens at the boundary can be determined on a case-by-case basis.
93 / 218
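Numerically, θ̂MLE = (∇A)⁻¹(µ̂) can be obtained by solving ∇A(θ) = µ̂, e.g. with Newton's method. A sketch for the canonical Bernoulli family, where the inverse map is the logit; the family, starting point, and iteration count are assumptions of this illustration:

```python
import math

def A1(theta):
    # A'(theta) = e^theta / (1 + e^theta): the mean map of the Bernoulli family
    return 1.0 / (1.0 + math.exp(-theta))

def A2(theta):
    # A''(theta) = var_theta(T): derivative used by Newton's method
    p = A1(theta)
    return p * (1 - p)

def mle(mu_hat, theta=0.0, iters=50):
    # Solve A'(theta) = mu_hat by Newton; valid for mu_hat in int M = (0, 1)
    for _ in range(iters):
        theta -= (A1(theta) - mu_hat) / A2(theta)
    return theta

mu_hat = 0.3                                    # empirical mean of T(X) = X
theta_hat = mle(mu_hat)
closed_form = math.log(mu_hat / (1 - mu_hat))   # logit: the known inverse of A'
```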
Technical remarks
• A is always lower semi-continuous (lsc).
• If Ω = dom A is open, lower semicontinuity implies that A(θ) → ∞ as θ approaches the boundary. (Pick θ0 ∈ bd Ω; then lim inf_{θ→θ0} A(θ) ≥ A(θ0) = ∞.)
• In other words, if Ω is open, A is automatically essentially smooth.
94 / 218
Example 35 (Two-parameter Gaussian)
• X = R, T(x) = (x, x²). pθ(x) = exp(θ1x + θ2x² − A(θ)), ∀x ∈ X.
• A(θ) = log ∫ e^{θ1x + θ2x²} dx. A(θ) < ∞ iff θ2 < 0.
• Natural parameter space: Ω = dom A = {(θ1, θ2) : θ2 < 0}.
• θ1 = m/σ² and θ2 = −1/(2σ²) in the original parametrization (m, σ²).
• m = θ1/(−2θ2) and σ² = 1/(−2θ2).
• Mean parametrization: µ1 = θ1/(−2θ2), µ2 = (θ1/(−2θ2))² + 1/(−2θ2).
• A(θ) = m²/(2σ²) + (1/2) log(2πσ²) = θ1²/(−4θ2) + (1/2) log(π/(−θ2)).
• Easy to verify that ∇A(θ) = (µ1, µ2), and it establishes a bijection between

{(θ1, θ2) : θ2 < 0} = int Ω ↔ int M = {(µ1, µ2) : µ2 > µ1²}.

• Note that, since Ω is open and A is lsc (hence essentially smooth), µ(θ) = ∇A(θ) → ∞ as θ approaches the boundary.
• Show a picture of θ ↦ A(θ) − ⟨θ, µ⟩ for µ = (0, 1).
95 / 218
Maximum entropy characterization of exponential family
• Not only do exponential families realize any mean, they achieve it with maximum entropy: the solution to

max_p Ep[−log p(X)] s.t. Ep[T(X)] = µ

is given by a density of the form p(x) ∝ exp(⟨θ, T(x)⟩).
• Discrete case, easy to verify by introducing Lagrange multipliers:
• X = {x1, . . . , xK},
• ν = counting measure and pi = p(xi); let p = (p1, . . . , pK) and ti = T(xi):

max_p −∑_i pi log pi s.t. ∑_i pi ti = µ, pi ≥ 0, ∑_i pi = 1.

• Without the constraint ∑_i pi ti = µ, the uniform distribution maximizes the entropy.
96 / 218
Information inequality (Cramér–Rao)
• How small can the variance of an unbiased estimator be? How well can the UMVU do?
• The bound also plays a role in asymptotics.
• Idea: Use Cauchy–Schwarz (CS), also called the covariance inequality in this context:

(EXY)² ≤ (EX²)(EY²), or [cov(X, Y)]² ≤ var(X) var(Y).

• Running assumption: every RV/estimator has a finite second moment.
• For δ unbiased for some g(θ), and ψ any other estimator,

varθ(δ) ≥ [covθ(δ, ψ)]² / varθ(ψ)   (3)

• Need to get rid of δ on the RHS.
• By cleverly choosing ψ, we can obtain good bounds.
97 / 218
• Assume Pθ+h ≪ Pθ: pθ+h(x) = 0 whenever pθ(x) = 0.
• The (local) likelihood ratio is well-defined (can define it to be 1 for 0/0):

Lθ,h(X) = pθ+h(X)/pθ(X)

• (= dPθ+h/dPθ, the Radon–Nikodym derivative of Pθ+h w.r.t. Pθ.)
• Change of measure by integrating against the likelihood ratio:

Eθ[δLθ,h] = ∫_{pθ>0} δ Lθ,h pθ dµ = ∫_{pθ>0} δ pθ+h dµ = Eθ+h[δ]   (4)

Note that pθ+h is concentrated on {x : pθ(x) > 0}.
98 / 218
Lemma 5 (Hammersley–Chapman–Robbins (HCR))
Assume Pθ+h ≪ Pθ, and let δ be unbiased for g(θ). Then,

varθ(δ) ≥ [g(θ + h) − g(θ)]² / Eθ(Lθ,h − 1)².

Proof.
• Idea: Apply the CS inequality (3) to ψ = Lθ,h − 1.
• Eθ[ψ] = Eθ[Lθ,h] − 1 = 0. (By an application of (4) to δ = 1.)
• Another application of (4) gives²

covθ(δ, ψ) = Eθ[δψ] = Eθ[δLθ,h] − Eθ[δ] = g(θ + h) − g(θ).

²ψ is not an unbiased estimator of 0, since it depends on θ; it is not a proper estimator. Not a contradiction with "the UMVU is uncorrelated with any unbiased estimator of 0".
99 / 218
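For X ∼ Ber(θ) and g(θ) = θ, the HCR denominator has a closed form (it is the χ² divergence between Ber(θ + h) and Ber(θ)), and the bound works out to θ(1 − θ) for every valid h, i.e., it is attained by δ(X) = X. A quick numerical sketch of this computation:

```python
def chi2_bern(theta, h):
    # E_theta[(L_{theta,h} - 1)^2] for Bernoulli: sum over x in {0, 1} of
    # (p_{theta+h}(x) - p_theta(x))^2 / p_theta(x)
    total = 0.0
    for x in (0, 1):
        p0 = theta if x == 1 else 1 - theta
        p1 = theta + h if x == 1 else 1 - theta - h
        total += (p1 - p0) ** 2 / p0
    return total

theta, h = 0.3, 0.1
hcr = h ** 2 / chi2_bern(theta, h)   # HCR bound: g(theta + h) - g(theta) = h
var_delta = theta * (1 - theta)      # variance of the unbiased estimator delta(X) = X
```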
• Assume that θ (and hence h) is a scalar.
• The likelihood ratio approaches 1 as h → 0:

lim_{h→0} (1/h)[Lθ,h(X) − 1] = lim_{h→0} ([pθ+h(X) − pθ(X)]/h) / pθ(X) = ∂θ[pθ(X)] / pθ(X) = ∂θ[log pθ(X)],

called the score function.
• Divide the numerator and denominator of the HCR bound by h², and let h → 0:

varθ(δ) ≥ lim_{h→0} ([g(θ + h) − g(θ)]²/h²) / (Eθ(Lθ,h − 1)²/h²).

• The numerator goes to [g′(θ)]². If we are justified in exchanging the limit and the expectation,

varθ[δ(X)] ≥ [g′(θ)]² / Eθ[∂θ log pθ(X)]².
100 / 218
Cramér–Rao (formal statement)
• Log-likelihood: ℓθ(X) := log pθ(X).
• Score function: ℓ̇θ(X) := ∇θℓθ(X) = ∇θ log pθ(X) ∈ R^d.

Theorem 8 (Cramér–Rao lower bound)
Let P be a dominated family with densities having common support S, on some open parameter space Ω ⊂ R^d. Assume:
(a) δ is an unbiased estimator of g(θ) ∈ R,
(b) g is differentiable over Ω, with gradient ġ = ∇θg ∈ R^d,
(c) ℓ̇θ(x) exists for x ∈ S and θ ∈ Ω,
(d) at least for ξ = 1 and ξ = δ, and ∀θ ∈ Ω,

(∂/∂θi) ∫_S ξ(x)pθ(x) dµ(x) = ∫_S ξ(x)(∂/∂θi)pθ(x) dµ(x), ∀i   (5)

Then,

varθ(δ) ≥ ġ(θ)ᵀ[I(θ)]⁻¹ġ(θ),

where I(θ) = Eθ[ℓ̇θℓ̇θᵀ] ∈ R^{d×d} is the Fisher information matrix.
101 / 218
• Let us rewrite the assumption:

(∂/∂θi) ∫_S ξ(x)pθ(x) dµ(x) = ∫_S ξ(x)(∂/∂θi)pθ(x) dµ(x), ∀i   (6)

• Note that the right-hand side is:

RHS = ∫_S ξ(x) (∂ log pθ(x)/∂θi) pθ(x) dµ(x) = Eθ(ξ(X)[ℓ̇θ(X)]i)

• Putting the pieces together:

∇θEθ[ξ] = Eθ[ξℓ̇θ]   (7)

• which is the differential form of the change-of-measure formula:

Eθ+h[ξ] = Eθ[ξLθ,h].
102 / 218
Proof.
• The score function has zero mean, Eθ[ℓ̇θ] = 0. (Apply (7) with ξ = 1.)
• ġ(θ) = Eθ[δℓ̇θ]. (Apply (7) with ξ = δ.)
• Fix some a ∈ R^d. We will apply the CS inequality (3) with ψ = aᵀℓ̇θ.
• Since aᵀℓ̇θ is zero mean:

aᵀġ(θ) = Eθ[δ aᵀℓ̇θ] = covθ(δ, aᵀℓ̇θ).

• Similarly,

varθ(aᵀℓ̇θ) = Eθ[aᵀℓ̇θℓ̇θᵀa] = aᵀI(θ)a.

• The CS inequality (3) with ψ = aᵀℓ̇θ gives:

varθ(δ) ≥ [covθ(δ, aᵀℓ̇θ)]² / varθ(aᵀℓ̇θ) = (aᵀġ(θ))² / (aᵀI(θ)a).

• Almost done. The problem reduces to (Exercise)

sup_{a≠0} (aᵀv)² / (aᵀBa) = vᵀB⁻¹v.

Hint: Since B ≻ 0, B^{−1/2} is well-defined; take z = B^{1/2}a. Q.E.D.
103 / 218
• Regularity conditions for interchanging the integral and derivative are key,
• and so is the unbiasedness.
• Under the same assumptions (recall ℓθ = log pθ(X)),

I(θ) = Eθ[−ℓ̈θ] = Eθ[−∇²θℓθ].

• I(θ) measures the expected local curvature of the likelihood.
• Attainment of the CRB is related to attainment in Cauchy–Schwarz: Wijsman (1973) shows that it happens if and only if we are in an exponential family.
• Fisher information is not invariant to reparametrization:

θ = θ(µ) ⟹ I(µ) = [θ′(µ)]²I(θ).

• The CRB is invariant to reparametrization. (Exercise.)
• Fisher information is additive over independent sampling.
104 / 218
Multiple parameters
What if g : Ω → R^m where Ω ⊂ R^d?
• Let Jθ = (∂gi/∂θj) ∈ R^{m×d} be the Jacobian of g.
• Then, under similar assumptions (notation: Iθ = I(θ)):

covθ(δ) ⪰ JθIθ⁻¹Jθᵀ

for any δ unbiased for g(θ).
• A ⪰ B means A − B ⪰ 0, i.e., A − B is positive semidefinite (PSD).
• Proof: Fix u ∈ R^m and apply the 1-D theorem to uᵀδ. (Exercise.)
105 / 218
Example 36
• Xi iid∼ N(θ, σ²), i = 1, . . . , n,
• σ² is fixed, g(θ) = θ.

ℓθ(X) = log pθ(X) = ∑_{i=1}^n log pθ(Xi) = −(1/(2σ²)) ∑_{i=1}^n (Xi − θ)² + const.

• Differentiating, we get the score function

ℓ̇θ(X) = (∂/∂θ) log pθ(X) = (1/σ²) ∑_{i=1}^n (Xi − θ) ⟹ ℓ̈θ(X) = −n/σ²,

• whence I(θ) = n/σ².
• The CRB is varθ(δ) ≥ σ²/n and is achieved by the sample mean.
106 / 218
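The identity I(θ) = Eθ[ℓ̇θ²] can be checked by Monte Carlo for this model: the score ∑(Xi − θ)/σ² should have second moment n/σ². A simulation sketch (sample sizes, seed, and tolerance are ad hoc choices):

```python
import random

random.seed(1)
n, theta, sigma2 = 5, 2.0, 4.0
reps = 100_000

acc = 0.0
for _ in range(reps):
    # score of one sample of size n: sum_i (X_i - theta) / sigma^2
    score = sum(random.gauss(theta, sigma2 ** 0.5) - theta for _ in range(n)) / sigma2
    acc += score ** 2

I_hat = acc / reps        # Monte Carlo estimate of E[score^2]
I_exact = n / sigma2      # Fisher information n / sigma^2 = 1.25 here
```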
Example 37 (Exponential families)
• Xi ∼ pθ(xi) = h(xi) exp(⟨θ, T(xi)⟩ − A(θ)), i = 1, . . . , n.

ℓθ(X) = log pθ(X1, . . . , Xn) = ⟨θ, ∑_i T(Xi)⟩ − nA(θ) + const.,

• whence I(θ) = Eθ[−ℓ̈θ(X)] = n∇²A(θ) = n covθ[T].
• Consider the 1-D case and n = 1.
• Want an unbiased estimate of the mean parameter: µ(θ) = Eθ[T] = A′(θ).
• The CRB is

[µ′(θ)]²/I(θ) = [A″(θ)]²/A″(θ) = A″(θ) = varθ(T),

i.e., it is attained by T.
• General case: T̄ := (1/n) ∑_{i=1}^n T(Xi) attains the CRB for the mean parameter:

covθ(δ) ⪰ covθ(T̄), ∀δ s.t. Eθ(δ) = Eθ(T̄).
107 / 218
Example 38
• Xi iid∼ Poi(λ), i = 1, . . . , n.
• Exponential family with T(X) = X and mean parameter λ,
• hence the sample mean δ(X) = (1/n) ∑_i Xi achieves the CRB for λ.
• What if we want an unbiased estimate of g(λ) = λ²?
• Since I(λ) = n/varλ[X1] = n/λ (why?),
• the CRB is [2λ]²/(n/λ) = 4λ³/n.
• The estimator T1 = (1/n) ∑_{i=1}^n Xi(Xi − 1) is unbiased for λ², and

varλ(T1) = 4λ³/n + 2λ²/n > CRB.

• S = ∑_i Xi is complete sufficient, hence
• the Rao–Blackwellized estimator T2 = E[T1|S] = S(S − 1)/n² is UMVU.
• The CRB is still not attained, since (exercise)

varλ(T2) = 4λ³/n + 2λ²/n² > CRB.

A vector of independent Poisson variables, conditioned on their sum, has a multinomial distribution; in this case, Mult(S, (1/n, . . . , 1/n)).
108 / 218
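Since S = ∑ Xi ∼ Poi(nλ), the mean and variance of T2 = S(S − 1)/n² can be computed to high accuracy by truncating the Poisson pmf, confirming unbiasedness for λ² and the variance formula above. A sketch (the truncation point is an assumption chosen so the tail is negligible):

```python
import math

def poi_pmf(s, m):
    # Poisson(m) pmf computed via logs to avoid overflow in m**s / s!
    return math.exp(-m + s * math.log(m) - math.lgamma(s + 1))

lam, n = 1.5, 10
m = n * lam                      # S = sum_i X_i is Poisson(n * lambda)
S_max = 200                      # truncation point; tail mass beyond it is negligible

mean_T2 = sum(s * (s - 1) / n ** 2 * poi_pmf(s, m) for s in range(S_max))
second = sum((s * (s - 1) / n ** 2) ** 2 * poi_pmf(s, m) for s in range(S_max))
var_T2 = second - mean_T2 ** 2

crb = 4 * lam ** 3 / n                                   # CRB for g(lambda) = lambda^2
var_formula = 4 * lam ** 3 / n + 2 * lam ** 2 / n ** 2   # claimed var of T2
```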
Average vs. maximum risk optimality
Bayesian methods:
• Trouble comparing estimators based on whole risk functions θ ↦ R(θ, δ).
• The Bayesian approach: reduce to a (weighted) average risk.
• Assumes that the parameter is a random variable Θ with some distribution Λ, called the prior, having density π(θ) (w.r.t., say, Lebesgue measure).
• The choice of the prior is important in the Bayesian framework.
• Frequentist perspective: Bayes estimators have desirable properties.
109 / 218
• Recall the decision-theoretic framework:
• Family of distributions P = {Pθ : θ ∈ Ω}.
• Bayesian framework: interpret Pθ as the conditional distribution of X given Θ = θ,

Pθ(A) = P(X ∈ A | Θ = θ).

• Together with the marginal (prior) distribution of Θ, we have the joint distribution of (Θ, X).
• Recall the risk, defined as

R(θ, δ) = Eθ[L(θ, δ(X))] = E[L(θ, δ(X)) | Θ = θ],

or in other words, R(Θ, δ) = E[L(Θ, δ(X)) | Θ].
• The Bayes risk is

r(Λ, δ) = E[R(Θ, δ)] = E[L(Θ, δ(X))].
110 / 218
• Write p(x|θ) = pθ(x) for the density of Pθ.
• Recall that

R(θ, δ) = ∫ L(θ, δ(x))p(x|θ)dx.

Then,

r(Λ, δ) = ∫ π(θ)R(θ, δ)dθ = ∫ π(θ)[ ∫ L(θ, δ(x))p(x|θ)dx ]dθ.

• We rarely use this explicit form.
111 / 218
• A Bayes rule or estimator w.r.t. Λ, denoted δΛ, is a minimizer of the Bayes risk:

r(Λ, δΛ) = min_δ r(Λ, δ).

• Depends both on the prior Λ and the loss L.

Theorem 9 (Existence of Bayes estimators)
Assume that
(a) ∃δ′ with r(Λ, δ′) < ∞,
(b) the posterior risk has a minimizer for µ-almost all x, that is,

δΛ(x) := argmin_{a∈A} E[L(Θ, a)|X = x]

is well-defined for µ-almost all x. (Measurable selection.)
Then δΛ is a Bayes rule.

Proof. Condition (a) guarantees that we can use Fubini's theorem.
• By definition of δΛ, for any δ we have E[L(Θ, δ)|X] ≥ E[L(Θ, δΛ)|X].
• Taking expectations and using smoothing finishes the proof.
112 / 218
• The posterior risk can be computed based on the posterior distribution of Θ given X = x. Bayes' rule gives

π(θ|x) = p(x|θ)π(θ)/m(x) ∝ p(x|θ)π(θ),

where m(x) = ∫ π(θ)p(x|θ)dθ is the marginal density of X.
• The posterior is proportional to the prior times the likelihood.

Example 39
Bayes estimators for two simple loss functions:
• Quadratic (or ℓ2) loss, L(θ, a) = (g(θ) − a)²:

δΛ(x) = argmin_a E[(g(Θ) − a)²|X = x] = E[g(Θ)|X = x].

For g(θ) = θ this reduces to the posterior mean.
• ℓ1 loss, L(θ, a) = |θ − a|: here δΛ(x) = median(Θ|X = x) is one possible Bayes estimator. (Not unique in this case.)
113 / 218
Example 40 (Binomial)
• X ∼ Bin(n, θ).
• The PMF is p(x|θ) = (n choose x) θˣ(1 − θ)ⁿ⁻ˣ.
• Put a Beta prior on Θ, with hyperparameters α, β > 0:

π(θ) = [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1}(1 − θ)^{β−1} ∝ θ^{α−1}(1 − θ)^{β−1}.

• We have π(θ|x) ∝ pθ(x)π(θ) ∝ θ^{x+α−1}(1 − θ)^{n−x+β−1},
• showing that Θ|X = x ∼ Beta(α + x, n − x + β), whence

δΛ(x) := E[Θ|X = x] = (x + α)/(n + α + β) = (1 − λ)(x/n) + λ·α/(α + β),

where λ = (α + β)/(n + α + β).
• Note: α/(α + β) is the prior mean, and x/n is the MLE (or unbiased estimator of the mean parameter).
• No coincidence; this happens in a general exponential family.
114 / 218
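The convex-combination form of the posterior mean is a pure algebraic identity; a quick check over all x for one (arbitrary, illustrative) choice of n, α, β:

```python
def posterior_mean(x, n, alpha, beta):
    # Beta(alpha, beta) prior + Bin(n, theta) likelihood => Beta(alpha + x, beta + n - x)
    return (x + alpha) / (n + alpha + beta)

def convex_combination(x, n, alpha, beta):
    # (1 - lam) * MLE + lam * prior mean, with lam = (alpha + beta) / (n + alpha + beta)
    lam = (alpha + beta) / (n + alpha + beta)
    return (1 - lam) * (x / n) + lam * alpha / (alpha + beta)

vals = [(x, 20, 2.0, 3.0) for x in range(21)]
max_gap = max(abs(posterior_mean(*v) - convex_combination(*v)) for v in vals)
```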
Example 41 (Normal location family)
• Assume that Xi|Θ = θ ∼ N(θ, σ²).
• Put a Gaussian prior on Θ: Θ ∼ N(µ, b²).
• The model is equivalent to

Xi = Θ + wi, wi iid∼ N(0, σ²), for i = 1, . . . , n.

• Reparametrize in terms of precisions: τ² = 1/b² and γ² = 1/σ².
• (Θ, X1, . . . , Xn) is jointly Gaussian, and the posterior is

Θ|X = x ∼ N( (1 − λn)x̄ + λnµ, 1/τn² ), with posterior mean δΛ(x) = (1 − λn)x̄ + λnµ,

where

x̄ = (1/n) ∑ xi, τn² = nγ² + τ², λn = τ²/τn² ∈ [0, 1].

• Continued ...
115 / 218
• With

x̄ = (1/n) ∑ xi, τn² = nγ² + τ², λn = τ²/τn² ∈ [0, 1],

• we have

Θ|X = x ∼ N(δΛ(x), 1/τn²).

• The posterior mean δΛ(x), i.e., the Bayes rule for ℓ2 loss, is

δΛ(x) := (1 − λn)x̄ + λnµ,

which is a convex combination of x̄ and µ, and we have
• δΛ(x) → x̄ if n → ∞ or SNR = γ²/τ² → ∞.
• δΛ(x) → µ if SNR = γ²/τ² → 0.
116 / 218
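The shrinkage behavior of δΛ can be seen directly from the formula. A sketch with illustrative numbers (all values below are arbitrary choices for the illustration):

```python
def bayes_rule(xbar, n, mu, sigma2, b2):
    # Posterior mean for X_i | theta ~ N(theta, sigma2), theta ~ N(mu, b2)
    gamma2 = 1.0 / sigma2        # observation precision
    tau2 = 1.0 / b2              # prior precision
    tau2_n = n * gamma2 + tau2   # posterior precision
    lam_n = tau2 / tau2_n        # shrinkage weight toward the prior mean
    return (1 - lam_n) * xbar + lam_n * mu

xbar, mu, sigma2, b2 = 1.0, 5.0, 1.0, 1.0
small_n = bayes_rule(xbar, 1, mu, sigma2, b2)        # equal precisions: halfway point
large_n = bayes_rule(xbar, 10 ** 6, mu, sigma2, b2)  # essentially the sample mean
```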
Conjugate priors
• The two examples above are instances of conjugacy.
• A family Q = {π(·)} of priors is conjugate to a family of likelihoods P = {p(· | θ)} if the corresponding posteriors also belong to Q.
• Examples of conjugate families:

Q: normal | beta | Dirichlet
P: normal | binomial | multinomial

Example 42 (Exponential families)
We have the following conjugate pair:

p(x|θ) = exp{⟨η(θ), T(x)⟩ − A(θ)}
qa,b(θ) = exp{⟨a, η(θ)⟩ + bA(θ) − B(a, b)}
117 / 218
Example 43 (Improper priors)
• Xi ∼ N(θ, σ²), i = 1, . . . , n.
• Is δ(x) = (1/n) ∑ xi a Bayes estimator w.r.t. some prior?
• Not if we require proper priors (finite measures): ∫π(θ)dθ < ∞, in which case π can be normalized to integrate to 1.
• We would need a uniform (proper) prior on the whole of R, which does not exist.
• An improper prior can still be used if the posterior is well-defined. (Generalized Bayes.)
• Alternatively, δ(x) is the limit of Bayes rules for a sequence of proper priors. (See also the Beta-Binomial example.)
118 / 218
Comment on the uniqueness of the Bayes estimator.

Theorem 10 (TPE 4.1.4)
Let Q be the marginal distribution of X, that is, Q(A) = ∫ Pθ(X ∈ A) dΛ(θ).
Recall that δΛ is (a) Bayes estimator. Assume that
• the loss function is strictly convex,
• r(Λ, δΛ) < ∞,
• Q-a.e. ⟹ P-a.e.; equivalently, Pθ ≪ Q for all θ ∈ Ω.
Then there is a unique Bayes estimator.
119 / 218
Minimax criterion
• Instead of averaging the risk, look at the worst-case or maximum risk:

R(δ) := sup_{θ∈Ω} R(θ, δ).

• More in accord with an adversarial nature. (A zero-sum game.)

Definition 13
An estimator δ* is minimax if min_{δ∈D} R(δ) = R(δ*).

• An effective strategy for finding minimax estimators is to look among the Bayes estimators:
• The minimax problem is: inf_δ sup_θ R(θ, δ).
• We generalize this to: inf_δ sup_Λ r(Λ, δ).
120 / 218
• Recall: δΛ is a Bayes estimator for the prior Λ, with Bayes risk

rΛ = inf_δ r(Λ, δ) = r(Λ, δΛ)

(last equality: assume rΛ is finite and achieved).
• We can order priors based on their Bayes risk:

Definition 14
Λ* is a least favorable prior if rΛ* ≥ rΛ for any prior Λ.

• For a least favorable prior, we have

rΛ* = sup_Λ rΛ = sup_Λ inf_δ r(Λ, δ) ≤ inf_δ sup_Λ r(Λ, δ) =: inf_δ r̄(δ),

where r̄(δ) = sup_Λ r(Λ, δ) is a generalization of the maximum risk R(δ).
• We are interested in situations where equality holds.
121 / 218
Characterization of minimax estimators

Theorem 11 (TPE 5.1.4)
Assume that δΛ is Bayes for Λ and r(Λ, δΛ) = R(δΛ). Then,
• δΛ is minimax.
• Λ is least favorable.
• If δΛ is the unique Bayes estimator (a.e. P), then it is the unique minimax estimator.

Proof of minimaxity of δΛ:
• The maximum risk is always lower-bounded by the Bayes risk:

R(δ) = sup_{θ∈Ω} R(θ, δ) ≥ ∫ R(θ, δ)dΛ(θ) = r(Λ, δ), ∀δ.

• R(δ) ≥ r(Λ, δ) ≥ rΛ = R(δΛ). (Last equality by assumption.)
122 / 218
Rest of the proof:
• R(δ) ≥ r(Λ, δ) ≥ rΛ = R(δΛ). (Last equality by assumption.)
• Uniqueness of the Bayes rule makes second inequality strict for δ 6= δΛ,showing the uniqueness of minimax rule.
• On the other hand,
rΛ′ ≤ r(Λ′, δΛ) ≤ R(δΛ) = rΛ.
showing that Λ is least favorable.
123 / 218
• A decision rule δ is called an equalizer if it has constant risk:

R(θ′, δ) = R(θ, δ), for all θ, θ′ ∈ Ω.

• Let ω(δ) := {θ : R(θ, δ) = R(δ)} = argmax_θ R(θ, δ).
• (δ is an equalizer iff ω(δ) = Ω.)

Corollary 4 (TPE 5.1.5–6)
(a) A Bayes estimator with constant risk (i.e., an equalizer) is minimax.
(b) A Bayes estimator δΛ is minimax if Λ(ω(δΛ)) = 1.

• Both of these conditions are sufficient, not necessary.
• (b) is weaker than (a).
• Strategy: Find a prior Λ whose support is contained in argmax_θ R(θ, δΛ).
124 / 218
Example 44 (Bernoulli, continuous parameter space)
• X ∼ Ber(θ) with quadratic loss, and Θ ∈ [0, 1].
• A (nonrandomized) rule is a pair (δ0, δ1), with δx = δ(x) for x = 0, 1.
• Given a prior Λ on [0, 1], let m1 = E[Θ] and m2 = E[Θ²].
• Frequentist risk:

R(θ, δ) = (δ0 − θ)²(1 − θ) + (δ1 − θ)²θ
= θ²[1 + 2(δ0 − δ1)] + θ(δ1² − δ0² − 2δ0) + δ0².

• Bayes risk:

r(Λ, δ) = E[R(Θ, δ)] = m2[1 + 2(δ0 − δ1)] + m1(δ1² − δ0² − 2δ0) + δ0².

• The Bayes decision rule is found by minimizing r(Λ, δ) w.r.t. (δ0, δ1):

δ1* = m2/m1, δ0* = (m1 − m2)/(1 − m1).
125 / 218
• The Bayes decision rule is found by minimizing r(Λ, δ) w.r.t. (δ0, δ1):

δ1* = m2/m1, δ0* = (m1 − m2)/(1 − m1).

• Aside: Since p(θ|x) = Cθˣ(1 − θ)¹⁻ˣπ(θ), check that δx* = E[Θ|X = x], x = 0, 1, as it should be:

δx* = ∫θ^{x+1}(1 − θ)^{1−x}π(θ)dθ / ∫θˣ(1 − θ)^{1−x}π(θ)dθ.

• A general rule δ is an equalizer, i.e., R(θ, δ) does not depend on θ, iff

δ1 − δ0 = 1/2 and δ1² − δ0² − 2δ0 = 0.

• These equations have a single solution: δ0 = 1/4 and δ1 = 3/4. (There is a unique equalizer rule.)
• Equalizer Bayes rule: need 3/4 = m2/m1 and 1/4 = (m1 − m2)/(1 − m1).
• Solving: m1* = 1/2 and m2* = 3/8.
• We need a prior Λ with these moments; Λ = Beta(1/2, 1/2) fits the bill.
• This is a least favorable prior.
• The corresponding Bayes, hence minimax, risk is 1/16.
126 / 218
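The computation can be replayed numerically: the first two moments of Beta(1/2, 1/2) give the equalizer Bayes rule (1/4, 3/4), whose risk is 1/16 for every θ. A quick sketch:

```python
def beta_moments(a, b):
    # first two moments of a Beta(a, b) random variable
    m1 = a / (a + b)
    m2 = a * (a + 1) / ((a + b) * (a + b + 1))
    return m1, m2

m1, m2 = beta_moments(0.5, 0.5)       # the candidate least favorable prior
d1 = m2 / m1                          # Bayes rule delta_1*
d0 = (m1 - m2) / (1 - m1)             # Bayes rule delta_0*

def risk(theta, d0, d1):
    # frequentist risk of the rule (d0, d1) for X ~ Ber(theta), quadratic loss
    return (d0 - theta) ** 2 * (1 - theta) + (d1 - theta) ** 2 * theta

risks = [risk(t / 10, d0, d1) for t in range(11)]   # grid over theta in [0, 1]
```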
• The above can be generalized to an i.i.d. sample of size n: X1, . . . , Xn iid∼ Ber(θ), where Beta(√n/2, √n/2) is least favorable and the associated minimax risk is 1/(4(√n + 1)²).
• Compare with the risk of the sample mean: R(θ, X̄) = θ(1 − θ)/n.
127 / 218
Example 45 (Bernoulli, discrete parameter space)
• Let X ∼ Ber(θ) and Ω = {1/3, 2/3} =: {a, b}.
• Take L(θ, δ) = (θ − δ)².
• Any (nonrandomized) decision rule is specified by a pair of numbers (δ0, δ1).
• Any prior is specified by a single number πa = P(Θ = a) ∈ [0, 1].
• Frequentist risk:

R(θ, δ) = (δ0 − θ)²(1 − θ) + (δ1 − θ)²θ.

• Bayes risk r(π, δ) = E[R(Θ, δ)]:

r(π, δ) = πaR(a, δ) + (1 − πa)R(b, δ).
128 / 218
• Take derivatives w.r.t. δ0, δ1, set them to zero, and find the Bayes rule:

δ0* = [aπa(1 − a) + b(1 − b)(1 − πa)] / [(1 − a)πa + (1 − b)(1 − πa)]
δ1* = [a²πa + b²(1 − πa)] / [aπa + b(1 − πa)]

• For a = 1/3 = 1 − b,

δ0* = 2/(3(πa + 1)) and δ1* = (4 − 3πa)/(6 − 3πa).

• Equalizer rule, one for which R(a, δ) = R(b, δ):

(a + b)[2(δ0 − δ1) + 1] + δ1² − δ0² − 2δ0 = 0.

• A Bayes rule that is also an equalizer occurs for πa* = 1/2.
• This is the least favorable prior.
• The corresponding rule (δ0*, δ1*) = (4/9, 5/9) is minimax.
129 / 218
Geometry of Bayes and Minimax
• Risk body for the two-point Bernoulli problem, Ω = {1/3, 2/3}.
• Deterministic rules, Bayes rules, minimax rule.

[Figure: two plots of the risk set S for this problem, marking the deterministic rules, the Bayes rules, and the minimax rule; axis tick labels omitted.]
130 / 218
Geometry of Bayes and minimax for finite Ω
• Assume Ω = {θ1, . . . , θk} finite, and consider the risk set (or body)

S = {(y1, . . . , yk) | yi = R(θi, δ) for some δ} ⊂ R^k.

• Alternatively, define ρ : D → R^k by

ρ(δ) = (R(θ1, δ), . . . , R(θk, δ)),

where D is the set of randomized decision rules.
• S is the image of D under ρ, i.e., S = ρ(D).

Lemma 6
S is a convex set (with randomized estimators).

Proof. For δ, δ′ ∈ D and a ∈ [0, 1], we can form a randomized decision rule δa such that S ∋ R(θ, δa) = aR(θ, δ) + (1 − a)R(θ, δ′). (Exercise.)
131 / 218
• Every prior Λ corresponds to a vector λ = (λ1, . . . , λk) ∈ R^k via Λ({θi}) = λi. Note that λ lies in the (k − 1)-simplex,

∆ := {(λ1, . . . , λk) ∈ R₊^k : ∑_{i=1}^k λi = 1}.

• The Bayes risk is

r(Λ, δ) = E[R(Θ, δ)] = ∑_{i=1}^k λiR(θi, δ) = λᵀρ(δ).

• Hence finding the Bayes rule is equivalent to

inf_{δ∈D} r(Λ, δ) = inf_{δ∈D} λᵀρ(δ) = inf_{y∈S} λᵀy,

a convex problem in R^k. The minimax problem is

inf_{δ∈D} ‖ρ(δ)‖∞ = inf_{y∈S} ‖y‖∞.

Finding the least favorable prior corresponds to sup_{λ∈∆} [inf_{y∈S} λᵀy].
132 / 218
Admissibility of Bayes rules
• In general, a unique (a.e. P) Bayes rule is admissible (TPE 5.2.4).
• There is a complete answer to the admissibility question for finite parameter spaces.

Proposition 10
Assume Ω = {θ1, . . . , θk} and that δλ is the Bayes rule for λ. If λi > 0 for all i, then δλ is admissible.

Proof. If δλ is inadmissible, there is δ such that

R(θi, δ) ≤ R(θi, δλ), ∀i,

with strict inequality for some j. Then,

∑_i λiR(θi, δ) < ∑_i λiR(θi, δλ),

contradicting the fact that δλ minimizes the Bayes risk. Q.E.D.
133 / 218
Proposition 11
Assume Ω = {θ1, . . . , θk} and δ is admissible. Then δ is Bayes w.r.t. some prior λ.

Proof.
• Let x := ρ(δ), the risk vector of δ, and Qx := {y ∈ R^k | yi ≤ xi} \ {x}.
• Qx is convex. (Removing an extreme point from a convex set preserves convexity.)
• Admissibility means Qx ∩ S = ∅.
• Two non-empty disjoint convex sets in R^k can be separated by a hyperplane:

∃u ≠ 0 s.t. uᵀz ≤ uᵀy, for all z ∈ Qx and y ∈ S.

• Suppose we can choose u to have nonnegative coordinates. (Proof by contradiction. (Exercise.))
134 / 218
• Since u ≠ 0, we can set λ = u/(∑_i ui) ∈ ∆, the (k − 1)-simplex.
• λᵀz ≤ inf_{y∈S} λᵀy = rλ, ∀z ∈ Qx.
• Taking {zn} ⊂ Qx such that zn → x, we obtain λᵀx ≤ rλ.
• But by definition of the optimal Bayes risk, rλ ≤ λᵀx; hence rλ = λᵀx.
135 / 218
M-estimation
• Setup: An i.i.d. sample of size n from a model P(1) = {Pθ : θ ∈ Ω} on sample space X, i.e.,

X1, . . . , Xn iid∼ Pθ.

• The full model is actually P(n) = {Pθ⊗n : θ ∈ Ω}, with sample space Xⁿ.
• M-estimators: those obtained as solutions of optimization problems.

Definition 15
Given a family of functions mθ : X → R, for θ ∈ Ω, the corresponding M-estimator based on X1, . . . , Xn is

θ̂n := θ̂n(X1, . . . , Xn) := argmax_{θ∈Ω} (1/n) ∑_{i=1}^n mθ(Xi).

• We often write Mn(θ) := (1/n) ∑_{i=1}^n mθ(Xi), a random function.
136 / 218
• An alternative approach is to specify θ̂ as a Z-estimator, i.e., the solution of a set of estimating equations

Ψn(θ) := (1/n) ∑_{i=1}^n ψ(Xi, θ) = 0.

• Often the first-order optimality conditions for an M-estimator produce a set of estimating equations. (Simplistic in general, ignoring the possibility of constraints imposed by Ω.)
137 / 218
Example 46
1. mθ(x) = −(x − θ)²; then Mn(θ) = −(1/n) ∑_{i=1}^n (Xi − θ)², giving θ̂ = X̄.
2. mθ(x) = −|x − θ|; then Mn(θ) = −(1/n) ∑_{i=1}^n |Xi − θ|, giving θ̂ = median(X1, . . . , Xn).
3. mθ(x) = log pθ(x); then Mn(θ) = (1/n) ∑_{i=1}^n log pθ(Xi), giving the maximum likelihood estimator (MLE).

• In a location family with pθ(x) = C exp(−β|x − θ|^p), the MLE is equivalent to an M-estimator with mθ(x) = −|x − θ|^p.
• p = 2: Gaussian distribution. (Case 1.)
• p = 1: Laplace distribution. (Case 2.)
• The corresponding Z-estimator forms of 1 and 2 are obtained for ψθ(x) = x − θ and ψθ(x) = sign(x − θ), obtained by differentiation (or sub-differentiation) of mθ.
138 / 218
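Cases 1 and 2 can be illustrated by maximizing Mn directly on a grid: the quadratic criterion is maximized (to grid precision) at the sample mean, and the absolute-value criterion at the sample median. A sketch with made-up data:

```python
import statistics

data = [0.2, 1.5, -0.7, 3.1, 0.9, 2.4, -1.2]

def Mn(theta, m):
    # empirical criterion M_n(theta) = (1/n) sum_i m(X_i, theta)
    return sum(m(x, theta) for x in data) / len(data)

grid = [i / 1000 for i in range(-3000, 4001)]   # theta grid on [-3, 4], step 0.001
theta_sq = max(grid, key=lambda t: Mn(t, lambda x, th: -(x - th) ** 2))
theta_abs = max(grid, key=lambda t: Mn(t, lambda x, th: -abs(x - th)))
```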
Example 47 (Method of Moments (MOM))
• Find θ̂ by matching empirical and population (true) moments:

Eθ[X1^k] = (1/n) ∑_{i=1}^n Xi^k, k = 1, 2, . . . , d.

• Usually d is the dimension of the parameter θ (d equations in d unknowns).
• A set of estimating equations with ψθ(x) = x^k − Eθ[X1^k], k = 1, . . . , d.
• A generalized version of MOM solves

Eθ[ϕk(X1)] = (1/n) ∑_{i=1}^n ϕk(Xi), k = 1, 2, . . . , d,

for some collection of functions {ϕk}, corresponding to a Z-estimator with ψθ(x) = ϕk(x) − Eθ[ϕk(X1)].
139 / 218
• In canonical exponential families,

Xi iid∼ pθ(x) ∝ exp{⟨θ, T(x)⟩ − A(θ)},

ML and MOM are equivalent.
• The MLE is the M-estimator associated with

mθ(x) = log pθ(x) = ⟨θ, T(x)⟩ − A(θ),

hence

Mn(θ) = (1/n) ∑_i [⟨θ, T(Xi)⟩ − A(θ)] = ⟨θ, T̄⟩ − A(θ),

where T̄ = (1/n) ∑_i T(Xi) is the empirical mean of the sufficient statistic.
• The MLE is

θ̂mle = argmax_{θ∈Ω} ⟨θ, T̄⟩ − A(θ).

• Setting derivatives to zero gives T̄ = ∇A(θ̂mle). (First-order optimality.)
• Since Eθ[T] = ∇A(θ), the MLE is the solution of

Eθ[T] = T̄

for θ, which is a MOM estimator. (If you will, Eθ[T(X1)] = (1/n) ∑_i T(Xi).)
140 / 218
Sidenote
• Recall that µ = ∇A(θ) is the mean parameterization.
• The inverse of this map is θ = ∇A*(µ) where

A*(µ) = sup_{θ∈Ω} [〈θ, µ〉 − A(θ)]

is the conjugate dual of A. (Exercise.)
• So θmle = ∇A∗(T ), assuming that T ∈ int(dom(A∗)).
141 / 218
Asymptotics or large-sample theory
Zeroth-order (consistency)
• Statistical behavior of estimators, in particular M-estimators, as n→∞.
• For concreteness consider the sequence

θ̂n = θ̂n(X1, . . . , Xn) = argmax_{θ∈Ω} (1/n) ∑_{i=1}^n mθ(Xi)
Definition 16
Let X1, . . . , Xn iid∼ Pθ0. We say that θ̂n is consistent if θ̂n →p θ0.

• Equivalently,

∀ε > 0, P(d(θ̂n, θ0) > ε) → 0, as n → ∞.

• Usually d(θ̂n, θ0) = ‖θ̂n − θ0‖ for Euclidean parameter spaces Ω ⊂ R^d.
• For d = 1, d(θ̂n, θ0) = |θ̂n − θ0|.
142 / 218
• We write Zn = op(1) if Zn →p 0.
• By the WLLN, for any fixed θ, we have (assuming Eθ0|mθ(X1)| < ∞)

(1/n) ∑_{i=1}^n mθ(Xi) →p Eθ0[mθ(X1)]

• Letting M(θ) := Eθ0[mθ(X1)], for any fixed θ, Mn(θ) →p M(θ).
• If θ0 is the maximizer of M over Ω, we hope that θ̂n, the maximizer of Mn over Ω, approaches it.
• However, pointwise convergence of Mn to M is not enough; we need uniform convergence, i.e.,

‖Mn − M‖∞ := sup_{θ∈Ω} |Mn(θ) − M(θ)|

to go to zero in probability.
143 / 218
Why uniform convergence?
• Even a nonrandom example is enough:

Mn(t) = 1 − n|t − 1/n| for |t| < 2/n, Mn(t) = 1/2 − |t − 1| for 1/2 < t < 3/2, Mn(t) = 0 otherwise,

M(t) = 1/2 − |t − 1| for 1/2 < t < 3/2, M(t) = 0 otherwise.

• Here Mn(t) → M(t) pointwise, but the maximizer of Mn is tn = 1/n with Mn(tn) = 1, while the maximizer of M is t0 = 1 with M(t0) = 1/2; so tn → 0 ≠ t0.

[Figure: plots of Mn (spike near 0 plus tent at 1) and M (tent at 1) on [0, 1.5].]
144 / 218
Theorem 12 (AS 5.7 modified)
Let Mn be random functions, and let M be a fixed function of θ. Let

θ̂n ∈ argmax_{θ∈Ω} Mn(θ)  (cond-M)

be well-defined. Assume:
(a) ‖Mn − M‖∞ →p 0. (Uniform convergence.)
(b) (∀ε > 0) sup_{θ: d(θ,θ0)≥ε} M(θ) < M(θ0). (M has a well-separated maximum.)
Then θ̂n is consistent, i.e., θ̂n →p θ0.

• By optimality of θ̂n for Mn, we have Mn(θ0) ≤ Mn(θ̂n), or

0 ≤ Mn(θ̂n) − Mn(θ0)  (Basic inequality)

• By adding and subtracting, we get

M(θ0) − M(θ̂n) ≤ Mn(θ̂n) − M(θ̂n) − [Mn(θ0) − M(θ0)] ≤ 2‖Mn − M‖∞

(We are keeping random deviations from the mean on one side and fixed functions on the other side.)
145 / 218
• Fix some ε > 0, and let

η(ε) := M(θ0) − sup_{d(θ,θ0)≥ε} M(θ) = inf_{d(θ,θ0)≥ε} [M(θ0) − M(θ)]

• By assumption (b), η(ε) > 0.
• Since d(θ, θ0) ≥ ε implies M(θ0) − M(θ) ≥ η(ε), we have

P(d(θ̂n, θ0) ≥ ε) ≤ P(M(θ0) − M(θ̂n) ≥ η(ε)) ≤ P(2‖Mn − M‖∞ ≥ η(ε)) → 0

by assumption (a). Q.E.D.
Remark 3
A key step is bounding Mn(θ̂n) − M(θ̂n) by ‖Mn − M‖∞.
Exercise: Condition (cond-M) can be replaced with Mn(θ̂n) ≥ Mn(θ0) − op(1).
146 / 218
• Sufficient conditions for uniform convergence can be found in Keener, Chapter 9, Theorem 9.2.
• For example, we have (a) if
  • Ω is compact,
  • θ 7→ mθ(x) is continuous (for a.e. x), and
  • E‖m∗(X1)‖∞ < ∞, where ‖m∗(X1)‖∞ = sup_{θ∈Ω} |mθ(X1)|.
• For example, we have (b) if
  • Ω is compact,
  • M is continuous, and
  • M has a unique maximizer over Ω.
• In general, the key factor in whether uniform convergence holds is the size of the parameter space Ω.
147 / 218
Side note:
• Why do we have (b) if
  • Ω is compact,
  • M is continuous, and
  • M has a unique maximizer over Ω?
• Since Ω is compact and M is continuous, M attains its maximum over

Ω \ B(θ0; ε) := {θ ∈ Ω : d(θ, θ0) ≥ ε},

where B(θ0; ε) is the open ball of radius ε centered at θ0.
• Let θε be a maximizer of M over Ω \ B(θ0; ε). Then,

sup_{θ: d(θ,θ0)≥ε} M(θ) = M(θε) < M(θ0).

• The strict inequality is due to the uniqueness of the maximizer of M over Ω.
• Compactness is key; otherwise uniqueness of the global maximizer does not imply this inequality.
148 / 218
Example 48
• MLE can be obtained as an M-estimator with mθ(x) = log [pθ(x)/pθ0(x)].
• Addition of −log pθ0(x) does not change the maximizer of Mn(θ).

M(θ) = Eθ0[mθ(X1)] = −∫ pθ0(x) log [pθ0(x)/pθ(x)] dx = −D(pθ0‖pθ).

• D(p‖q) is the Kullback–Leibler (KL) divergence between p and q.
• A form of (squared) distance among distributions.
• Does not satisfy the triangle inequality or symmetry.
• D(p‖q) ≥ 0 with equality iff p = q.
• Condition (b) is a bit stronger.
• Often, we can show (strong identifiability)

γ(d(θ0, θ)) ≤ D(pθ0‖pθ)

for some function γ : [0,∞) → [0,∞) strictly increasing in a neighborhood of θ0.
149 / 218
• Example: exponential distribution with pλ(x) = λe^{−λx} 1{x > 0}:

D(pλ0‖pλ) = Eλ0[log (λ0 e^{−λ0 X1}) − log (λ e^{−λ X1})]
          = log(λ0/λ) + Eλ0[(λ − λ0)X1]
          = −log(λ/λ0) + λ/λ0 − 1

• This is the Itakura–Saito distance, i.e., the Bregman divergence for φ(x) = −log x, from an earlier lecture.
• f(x) = −log x + x − 1 is strictly convex on (0, ∞) with unique minimum at x = 1.
150 / 218
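A small Monte Carlo sanity check of the exponential KL formula above (not from the slides; the rates, seed, and sample size are illustrative): estimate D(pλ0‖pλ) = Eλ0[log pλ0(X)/pλ(X)] by averaging log-likelihood ratios and compare with the closed form.

```python
import numpy as np

rng = np.random.default_rng(2)
lam0, lam = 1.5, 0.8  # illustrative rates

# Monte Carlo estimate of D(p_lam0 || p_lam) under X ~ Exp(lam0).
x = rng.exponential(1 / lam0, 500_000)
log_ratio = (np.log(lam0) - lam0 * x) - (np.log(lam) - lam * x)
kl_mc = log_ratio.mean()

# Closed form from the slide: -log(lam/lam0) + lam/lam0 - 1.
kl_exact = -np.log(lam / lam0) + lam / lam0 - 1
```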
First-order (asymptotic normality)
• More refined understanding, by looking at scaled (magnified) deviations ofconsistent estimators.
• IID sequence X1, X2, . . . with mean µ = E[X1] and Σ = cov(X1):

WLLN: X̄n →p µ. (X̄n is consistent for µ.)
CLT: √n(X̄n − µ) →d N(0, Σ). (Characterizes fluctuations of X̄n − µ.)

• Fluctuations are of order n^{−1/2} and after normalization have an approximately Gaussian distribution.
151 / 218
• First, let us look at how modes of convergence interact.
Proposition 12
(a) Xn →p X implies Xn →d X, but not vice versa.
(b) Xn →p c is equivalent to Xn →d c. (c is a constant.)
(c) Continuous mapping (CM): Xn → X and f continuous implies f(Xn) → f(X). Holds for both →d and →p.
(d) Slutsky's: Xn →d X and Yn →d c implies (Xn, Yn) →d (X, c).
(e) Xn →p X and Yn →p Y implies (Xn, Yn) →p (X, Y).
(f) Xn →d X and d(Xn, Yn) →p 0 implies Yn →d X.

• For (c), f only needs to be continuous on a set C with P(X ∈ C) = 1.
• (d) does not hold in general if c is replaced by some random variable Y.
152 / 218
• What is usually called Slutsky's lemma is not exactly (d) above.
• It is in fact a special application of (c) and (d), to the functions (x, y) 7→ x + y, (x, y) 7→ xy and (x, y) 7→ y^{−1}x.

Corollary 5 (Slutsky's lemma)
Let Xn, Yn and X be random variables, or vectors or matrices, and c a constant. Assume that Xn →d X and Yn →d c. Then,

Xn + Yn →d X + c,  Yn Xn →d cX,  Yn^{−1} Xn →d c^{−1}X,

assuming c is invertible for the latter. More generally, f(Xn, Yn) →d f(X, c) for any continuous function f.

E.g., op(1) + op(1) = op(1).
153 / 218
• Simple example:
(a) Xn →d Z ∼ N(0, 1) implies Xn² →d Z² ∼ χ²₁.

Example 49 (Counterexample)
• Xn = X ∼ U(0, 1), ∀n, and

Yn = Xn 1{n odd} + (1 − Xn) 1{n even}.

• Xn →d X and Yn →d X, but (Xn, Yn) does not converge in distribution.
• Why?
• Let C1 = {(x, y) ∈ [0, 1]² : x = y} and C2 = {(x, y) ∈ [0, 1]² : x + y = 1}.
• Let U(Ci) be the uniform distribution on Ci. Then,

(Xn, Yn) ∼ U(C1) for n odd, U(C2) for n even.
Example 50 (t-statistic)
• IID sequence {Xi}, with E[Xi] = µ and var(Xi) = σ².
• Let X̄n = (1/n) ∑ Xi and Sn² = (1/n) ∑ (Xi − X̄n)² = (1/n) ∑ Xi² − (X̄n)². Then,

t_{n−1} := (X̄n − µ)/(Sn/√n) →d N(0, 1).

• Why? (1/n) ∑ Xi² →p E[X1²] = σ² + µ² and (X̄n)² →p µ².
• These imply Sn →p √(σ² + µ² − µ²) = σ.
• It follows that

t_{n−1} = √n(X̄n − µ)/Sn →d N(0, σ²)/σ = N(0, 1)

• Distribution-free result: we are not assuming that the Xi are Gaussian.
155 / 218
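The distribution-free claim in Example 50 can be checked numerically (a sketch, not from the slides; the exponential data, sample size, and seed are illustrative): even for skewed data, the t-statistic is approximately N(0, 1), so roughly 95% of draws should land in [−1.96, 1.96].

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 20_000
mu = 1.0  # mean of Exp(1); deliberately non-Gaussian data

x = rng.exponential(1.0, (reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1)  # 1/n convention, matching S_n on the slide
t = (xbar - mu) / (s / np.sqrt(n))

# If t is approximately N(0,1), about 95% of draws fall in [-1.96, 1.96].
coverage = np.mean(np.abs(t) <= 1.96)
```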
• Also need the concept of uniform tightness or boundedness in probability.
• A collection of random vectors {Xn} is uniformly tight if

∀ε > 0, ∃M such that sup_n P(‖Xn‖ > M) < ε.

We will write Xn = Op(1) in this case.

Proposition 13 (Uniform Tightness)
(a) If Xn →p 0 and {Yn} is uniformly tight, then Xn Yn →p 0.
(b) If Xn →d X, then {Xn} is uniformly tight.

(a) can be written compactly as op(1)Op(1) = op(1).
156 / 218
Simplified notation: E[ṁθ0] in place of E[ṁθ0(X1)].

Theorem 13 (Asymptotic normality of M-estimators)
Assume the following:
(a) The gradient ṁθ0(X1) has moments up to second order, with
  • E[ṁθ0] = 0, and
  • well-defined covariance matrix Sθ0 := E[ṁθ0 ṁθ0ᵀ].
(b) The Hessian m̈θ0(X1) is integrable with Vθ0 := E[m̈θ0] ≺ 0.
(c) θ̂n is consistent for θ0.
(d) ∃ε > 0 such that sup_{‖θ−θ0‖≤ε} ‖M̈n(θ) − M̈(θ0)‖ →p 0.
Let ∆n,θ := (1/√n) ∑_{i=1}^n ṁθ(Xi). Then,

√n(θ̂n − θ0) = −Vθ0^{−1} ∆n,θ0 + op(1), and ∆n,θ0 →d N(0, Sθ0).

In particular, √n(θ̂n − θ0) →d N(0, Vθ0^{−1} Sθ0 Vθ0^{−1}).

In (b), we only need the Hessian Vθ0 to be nonsingular.
(d) is (local) uniform convergence (UC).
157 / 218
Proof of AN
1. θ̂n is a maximizer of Mn, hence
2. Ṁn(θ̂n) = 0. (First-order optimality condition.)
3. Taylor-expand Ṁn around θ0:

Ṁn(θ̂n) − Ṁn(θ0) = M̈n(θ̃n)[θ̂n − θ0]

for some θ̃n in the line segment [θ̂n, θ0]. (Mean-value theorem, assuming continuity of M̈n.)
4. θ̃n = θ0 + op(1). (By consistency of θ̂n.)
5. M̈n(θ̃n) = M̈n(θ0) + op(1). (By (d): UC.)
6. Note that M̈n(θ0) = n^{−1} ∑i m̈θ0(Xi) is an average.
7. M̈n(θ0) = Eθ0[m̈θ0(X1)] + op(1) = Vθ0 + op(1). (By (b) and WLLN.)
8. M̈n(θ̃n) = Vθ0 + op(1). (Combine 5. and 7. + CM.)
9. By CM applied with f(X) = X^{−1}, and invertibility of Vθ0:

[M̈n(θ̃n)]^{−1} = [Vθ0 + op(1)]^{−1} = Vθ0^{−1} + op(1).
158 / 218
10. Combine 2., 3. and 9.:

θ̂n − θ0 = [M̈n(θ̃n)]^{−1}[Ṁn(θ̂n) − Ṁn(θ0)] = [Vθ0^{−1} + op(1)][0 − Ṁn(θ0)]

11. Expand the RHS and multiply by √n:

√n(θ̂n − θ0) = −Vθ0^{−1}[√n Ṁn(θ0)] − op(1)[√n Ṁn(θ0)]  (8)

12. Ṁn(θ0) is an average of zero-mean terms with covariance Sθ0, by (a).
13. √n Ṁn(θ0) →d N(0, Sθ0). (CLT and (a).)
14. √n Ṁn(θ0) = Op(1). (By Prop. 13(b) and 13.)
15. Applying op(1)Op(1) = op(1) to (8), (Prop. 13(a) and 11.)

√n(θ̂n − θ0) = −Vθ0^{−1}[√n Ṁn(θ0)] + op(1).

16. Note that √n Ṁn(θ0) = ∆n,θ0 by definition.
17. Second part: apply CM with f(x) = −Vθ0^{−1} x. (Exercise.)
159 / 218
Example 51 (AN of MLE)
• For MLE, mθ(x) = ℓθ(x) = log pθ(x).
• ṁθ = ℓ̇θ, the score function, zero-mean under regularity conditions.
• Sθ = Eθ[ℓ̇θ ℓ̇θᵀ] = I(θ).
• Vθ = Eθ[ℓ̈θ] = −I(θ).
• Asymptotic covariance of MLE = [−I(θ)]^{−1} I(θ) [−I(θ)]^{−1} = [I(θ)]^{−1}.
• It follows (assuming (c) and (d) hold) that

√n(θ̂mle − θ0) →d N(0, [I(θ0)]^{−1})

• Often interpreted as "MLE is asymptotically efficient",
• i.e., it achieves the Cramér–Rao bound in the limit.
160 / 218
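A numerical illustration of Example 51 (a sketch, not from the slides; the rate, sample sizes, and seed are illustrative): for Exp(λ), the MLE is λ̂ = 1/X̄ and I(λ) = 1/λ², so √n(λ̂ − λ0) should be approximately N(0, λ0²).

```python
import numpy as np

rng = np.random.default_rng(4)
lam0, n, reps = 2.0, 2000, 10_000

# MLE for the exponential rate is 1/Xbar; Fisher info is I(lam) = 1/lam^2,
# so sqrt(n)(lam_hat - lam0) ~ N(0, lam0^2) approximately.
x = rng.exponential(1 / lam0, (reps, n))
lam_hat = 1 / x.mean(axis=1)
z = np.sqrt(n) * (lam_hat - lam0)

emp_mean, emp_sd = z.mean(), z.std()
```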
Hodges' superefficient example
• If √n(δn − θ) →d N(0, σ²(θ)), one might think that σ²(θ) ≥ 1/I(θ) by the CRB.
• If so, any estimator with asymptotic variance 1/I(θ) could be called asymptotically efficient.
• Unfortunately this is not true. (Convergence in distribution is too weak to guarantee this.) Here is a counterexample:

Example 52
• Consider the shrinkage estimator

δ′n = aδn if |δn| ≤ n^{−1/4}, δn otherwise.

• δ′n has the same asymptotic behavior as δn for θ ≠ 0.
• The asymptotic behavior of δ′n at θ = 0 is the same as that of aδn, which has asymptotic variance a²σ²(θ); this can be made arbitrarily small by choosing a sufficiently small.
161 / 218
Delta method
• The delta method is a powerful extension of the CLT.
• Assume that f : Ω → R^k, with Ω ⊂ R^d, is differentiable and θ ∈ Ω.
• Let Jθ = (∂fi/∂xj)|_{x=θ} be the Jacobian of f at θ.
• Note: Jθ ∈ R^{k×d}.

Proposition 14
Under the above assumptions: if an(Xn − θ) →d Z, with an → ∞, then

an[f(Xn) − f(θ)] →d Jθ Z

• If f is differentiable, then it is partially differentiable and its total derivative can be represented (or identified) with the Jacobian matrix Jθ.
• Simplest case, k = d = 1: an[f(Xn) − f(θ)] →d f′(θ)Z.
162 / 218
Proof of Delta method
• an(Xn − θ) = Op(1), and since an → ∞, we have Xn − θ = op(1).
• By differentiability (1st-order Taylor expansion),

f(θ + h) = f(θ) + Jθ h + R(h)‖h‖

where R(h) = o(1) as h → 0. Define R(0) = 0 so that R is continuous at 0.
• Applying this with h = Xn − θ, we have

f(Xn) = f(θ) + Jθ(Xn − θ) + R(Xn − θ)‖Xn − θ‖.

• Multiplying by an, we get

an[f(Xn) − f(θ)] = Jθ[an(Xn − θ)] + R(Xn − θ)‖an(Xn − θ)‖

• Note ‖an(Xn − θ)‖ = Op(1).
163 / 218
• In the display

an[f(Xn) − f(θ)] = Jθ[an(Xn − θ)] + R(Xn − θ)‖an(Xn − θ)‖,

the factor R(Xn − θ) is op(1) while ‖an(Xn − θ)‖ is Op(1).
• R(Xn − θ) = op(1) and Jθ[an(Xn − θ)] →d Jθ Z, both by CM.
• The result follows from op(1)Op(1) = op(1) and Prop. 12(f).
164 / 218
Example 53
• Let Xi be iid with µ = E[Xi] and σ² = var[Xi].
• By CLT, √n(X̄n − µ) →d N(0, σ²).
• Consider the function f(t) = t². Then, by the delta method,

√n(f(X̄n) − f(µ)) →d f′(µ) N(0, σ²),

that is,

√n[(X̄n)² − µ²] →d N(0, σ²(2µ)²).

• For µ = 0, we get the degenerate result √n(X̄n)² →d 0.
• In this case, we need to scale the error further:
• n(X̄n)² →d σ²χ²₁, which follows from the CLT √n X̄n →d σN(0, 1) and CM.
165 / 218
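A quick check of Example 53 by simulation (not from the slides; parameters and seed are illustrative): with f(t) = t² and µ ≠ 0, the delta method predicts a limiting standard deviation of 2|µ|σ for √n[(X̄n)² − µ²].

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 2.0, 1.0, 1000, 20_000

x = rng.normal(mu, sigma, (reps, n))
xbar = x.mean(axis=1)

# Delta method with f(t) = t^2 predicts
#   sqrt(n)[(xbar)^2 - mu^2] -> N(0, sigma^2 (2 mu)^2),
# i.e. a limiting standard deviation of 2*|mu|*sigma = 4 here.
z = np.sqrt(n) * (xbar ** 2 - mu ** 2)
emp_sd = z.std()
```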
Example 54
• Xi iid∼ Ber(p).
• By CLT, √n(X̄n − p) →d N(0, p(1 − p)).
• Let f(p) = p(1 − p).
• f(X̄n) is a plugin estimator for the variance, and

√n(f(X̄n) − f(p)) →d N(0, (1 − 2p)² p(1 − p))

since f′(x) = 1 − 2x.
• Again, at p = 1/2 this is degenerate and the convergence happens at a faster rate.
166 / 218
These examples can be dealt with using the following extension.
Proposition 15
Consider the scalar case k = d = 1. If √n(Xn − θ) →d N(0, σ²) and f is twice differentiable with f′(θ) = 0, then

n[f(Xn) − f(θ)] →d (1/2) f″(θ) σ² χ²₁

Informal derivation:
• f(Xn) − f(θ) = (1/2) f″(θ)(Xn − θ)² + o((Xn − θ)²).
• Since n(Xn − θ)² →d (σZ)², where Z ∼ N(0, 1), we get the result.
167 / 218
Example 55 (Multivariate delta method)
• Recall Sn² = (1/n) ∑i Xi² − (X̄n)². Let

Zn := (1/n) ∑_{i=1}^n (Xi, Xi²)ᵀ, θ = (µ, µ² + σ²)ᵀ, Σ = cov((X1, X1²)ᵀ)

• By the (multivariate) CLT, we have

√n(Zn − θ) →d N(0, Σ)

• Letting f(x, y) = (x, y − x²), we have

√n[(X̄n, Sn²)ᵀ − (µ, σ²)ᵀ] →d Jθ N(0, Σ) = N(0, Jθ Σ Jθᵀ)

• Exercise: evaluate the asymptotic covariance Jθ Σ Jθᵀ.
168 / 218
What are asymptotic normality results useful for?
• Simplify comparison of estimators: Can use asymptotic variances. (ARE)
• Can build asymptotic confidence intervals.
169 / 218
Asymptotic relative efficiency (ARE)
• Can compare estimators based on their asymptotic variance.
• Assume that for two estimators θ̂1,n and θ̂2,n, we have

√n(θ̂i,n − µ(θ)) →d N(0, σi²(θ)), i = 1, 2.

• For large n, the variance of θ̂i,n is ≈ σi²(θ)/n.
• The relative efficiency of θ̂1,n with respect to θ̂2,n can be measured by the ratio of the numbers of samples required to achieve the same asymptotic variance (i.e., error):

σ1²(θ)/n1 = σ2²(θ)/n2 =⇒ ARE_θ(θ̂1, θ̂2) = n2/n1 = σ2²(θ)/σ1²(θ)

If the above ARE > 1, then we prefer θ̂1 over θ̂2.
170 / 218
Example 56
• Xi iid∼ fX with mean = (unique) median = θ, and variance 1.
• By CLT, we have √n(X̄n − θ) →d N(0, 1).
• Sample median: Zn = median(X1, . . . , Xn) = X_(⌈n/2⌉).
• Can show

√n(Zn − θ) →d N(0, 1/(4[fX(θ)]²))

• Consider the normal location family: Xi iid∼ N(θ, 1).
• fX(θ) = φ(0) = 1/√(2π), where φ is the density of the standard normal.
• Hence, σ²_{Zn}(θ) = π/2.
• ARE of sample mean relative to median:

σ²_{Zn}(θ)/σ²_{X̄n}(θ) = π/2 ≈ 1.57

• In the normal family, we prefer the mean, since the median requires roughly 1.57 times more samples to achieve the same accuracy.
171 / 218
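The ARE ≈ π/2 claim of Example 56 can be checked by simulation (a sketch, not from the slides; the odd sample size, repetitions, and seed are illustrative): the scaled variances n·var(mean) and n·var(median) should be close to 1 and π/2.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 1001, 20_000  # odd n: the sample median is a single order statistic

x = rng.standard_normal((reps, n))  # N(0,1): mean = median = 0
var_mean = n * x.mean(axis=1).var()
var_med = n * np.median(x, axis=1).var()

# Asymptotics predict n*var(mean) -> 1 and n*var(median) -> pi/2,
# so the ratio should be close to pi/2 ~ 1.5708.
are = var_med / var_mean
```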
Confidence intervals
An alternative to point estimators which provides a measure of our uncertainty or confidence. Recall X1, . . . , Xn iid∼ Pθ0.

Definition 17
A (1 − α)-confidence set for θ0 is a random set S = S(X1, . . . , Xn) such that Pθ0(θ0 ∈ S) ≥ 1 − α.

• Trade-off between the size of the set S and its coverage probability Pθ0(θ0 ∈ S).
• Want to minimize size while maintaining a lower bound on coverage probability.
• Usually CIs are built based on pivots:
• Functions of the data and the parameter whose distribution is free of the parameter.

Example 57 (Normal family, known variance)
• Xi iid∼ N(µ, σ²), then Z = (X̄n − µ)/(σ/√n) ∼ N(0, 1).
• Let z_{α/2} be such that P(Z ≥ z_{α/2}) = α/2. Then

P(|√n(X̄n − µ)/σ| ≤ z_{α/2}) = 1 − α ⇐⇒ P(µ ∈ [X̄n ± (σ/√n) z_{α/2}]) = 1 − α.
172 / 218
Example 58 (Normal family, unknown variance)
• Xi ∼ N(µ, σ²).
• Z = (X̄n − µ)/(σ/√n) ∼ N(0, 1).
• V = (n − 1)Sn²/σ² ∼ χ²_{n−1}, where Sn² = (1/(n−1)) ∑_{i=1}^n (Xi − X̄n)².
• Hence, T := Z/√(V/(n − 1)) ∼ t_{n−1} (Student's t distribution).
• Let t_{n−1}(α/2) be such that P(|T| ≥ t_{n−1}(α/2)) = α.
• X̄n ± t_{n−1}(α/2) Sn/√n is an exact (1 − α) confidence interval.
173 / 218
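The exactness of the t-interval in Example 58 can be verified by simulation (a sketch, not from the slides). The critical value 2.1448 is the standard two-sided 95% t quantile for 14 degrees of freedom (a table value, stated here as an assumption rather than computed); all other parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 5.0, 2.0, 15, 50_000
t_crit = 2.1448  # t_{14}(alpha/2) for alpha = 0.05 (standard table value)

x = rng.normal(mu, sigma, (reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)  # (n-1)-denominator S_n from the slide
half = t_crit * s / np.sqrt(n)

# Exact 95% interval: empirical coverage should be 0.95 up to MC error.
coverage = np.mean((xbar - half <= mu) & (mu <= xbar + half))
```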
Asymptotic CIs
Definition 18
An asymptotic (1 − α)-confidence set for θ0 is a random set S = S(X1, . . . , Xn) such that Pθ0(θ0 ∈ S) → 1 − α as n → ∞.

Example 59
• If √n(Tn − θ0) →d N(0, σ²(θ0)), then, assuming σ(·) is continuous,

Tn ± √(σ²(Tn)/n) z_{α/2}

is an asymptotic C.I. at level 1 − α.
• Why? Since Tn →p θ0, by the CM theorem, σ(θ0)/σ(Tn) →p 1. By Slutsky's lemma,

√(n/σ²(Tn)) (Tn − θ0) = √(σ²(θ0)/σ²(Tn)) · √(n/σ²(θ0)) (Tn − θ0) →d N(0, 1).
174 / 218
Asymptotic CI for MLE
Example 60 (Asym. CI for MLE based on Fisher info)
• Recall that under regularity, √n(θ̂n − θ0) →d N(0, 1/I(θ0)), or

√(nI(θ0)) (θ̂n − θ0) →d N(0, 1).

• Assuming I(·) is continuous, applying the previous example,

√(nI(θ̂n)) (θ̂n − θ0) →d N(0, 1).

• Hence, the following is an asymptotic (1 − α)-CI for θ0:

θ̂n ± z_{α/2}/√(nI(θ̂n))
175 / 218
Example 61 (Asym. CI based on empirical Fisher Info.)
• Let ℓn(θ) = ∑_{i=1}^n log pθ(Xi) and I(θ) = Eθ[−(∂²/∂θ²) log pθ(X1)].
• One can consider −(1/n) ℓ̈n(θ) as the empirical version of I(θ).
• (It is an unbiased and consistent estimate. I(θ) is the Fisher information based on a sample of size 1.)
• By the same argument as in the AN theorem, −(1/n) ℓ̈n(θ̂n) →p I(θ0).
• It follows that

√(−ℓ̈n(θ̂n)/(nI(θ0))) →p 1

• By Slutsky's lemma,

√(−ℓ̈n(θ̂n)/(nI(θ0))) · √(nI(θ0)) (θ̂n − θ0) →d N(0, 1)

• In other words,

√(−ℓ̈n(θ̂n)) (θ̂n − θ0) →d N(0, 1).

• Hence, the following is an asymptotic (1 − α)-CI for θ0:

θ̂n ± z_{α/2}/√(−ℓ̈n(θ̂n))
176 / 218
Variance-stabilizing transform
• Assume √n(Tn − θ) →d N(0, σ²(θ)).
• By the delta method, √n[f(Tn) − f(θ)] →d N(0, [f′(θ)]² σ²(θ)).
• We can choose f so that [f′(θ)]² σ²(θ) = C, a constant.
• Good for building asymptotic pivots.

Example 62
• Xi iid∼ Poi(θ). Note Eθ[Xi] = varθ[Xi] = θ.
• By CLT, √n(X̄n − θ) →d N(0, θ).
• Take f′(θ) = 1/√θ. This can be realized by f(θ) = 2√θ, hence

2√n(√X̄n − √θ) →d N(0, 1)

• Asymptotic CI for √θ of level 1 − α: (√X̄n ± z_{α/2}/(2√n)).
• Compare with the standard asymptotic CI for θ: (X̄n ± √(X̄n/n) z_{α/2}).
177 / 218
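A numerical illustration of Example 62 (a sketch, not from the slides; the values of θ, n, and the seed are illustrative): before the transform, the variance of √n(X̄n − θ) depends on θ; after f(t) = 2√t it is approximately 1 for every θ.

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 400, 20_000
raw_vars, stab_vars = [], []

for theta in (2.0, 8.0):
    x = rng.poisson(theta, (reps, n))
    xbar = x.mean(axis=1)
    # Un-stabilized: var of sqrt(n)(Xbar - theta) is ~ theta (theta-dependent).
    raw_vars.append((np.sqrt(n) * (xbar - theta)).var())
    # Stabilized: var of 2 sqrt(n)(sqrt(Xbar) - sqrt(theta)) is ~ 1 for all theta.
    stab_vars.append((2 * np.sqrt(n) * (np.sqrt(xbar) - np.sqrt(theta))).var())
```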
Hypothesis testing
• Recall the decision theory framework:
• Probabilistic model {Pθ : θ ∈ Ω} for X ∈ X.
• Special case: Ω is partitioned into two disjoint sets Ω0 and Ω1.
• Want to decide which piece θ belongs to.
• Can form an estimate θ̂ for θ and output 1{θ̂ ∈ Ω1}.
• A general principle:
  Do not estimate more than what you care about.
  The more complex the model, the more potential for fitting to noise.
178 / 218
• We want to test

H0 : θ ∈ Ω0 (null) versus H1 : θ ∈ Ω1 (alternative)

• A non-random test can be specified with a critical region S ⊂ X as

δ(X) = 1{X ∈ S}.

• When δ(X) = 1, we have accepted H1, or "rejected H0".
• The power function of the test is given by

β(θ) = Pθ(X ∈ S) = Eθ[δ(X)]

• We would like β(θ) ≈ 1{θ ∈ Ω1}.
• This cannot be achieved exactly, so we settle for a trade-off. Define

significance level: α = sup_{θ∈Ω0} β(θ)
power of the test: β = inf_{θ∈Ω1} β(θ)

• Neyman–Pearson framework: maximize β subject to a fixed α.
179 / 218
• Often we need to consider a randomized test, in which case we interpret

δ(x) = P(accept H1 | X = x)

• The power function β(θ) = Eθ[δ(X)] still gives the probability of accepting H1, by the smoothing property.
180 / 218
Simple hypothesis test
• Ω0 = {θ0} and Ω1 = {θ1}.
• The Neyman–Pearson criterion reads: fix α and solve

sup_δ Eθ1[δ(X)] s.t. Eθ0[δ(X)] ≤ α.

The solution is the most powerful (MP) test of significance level at most α.
• Neyman–Pearson lemma: most power is achieved by a likelihood ratio test (LRT),

δ(X) = 1{L(X) > τ} + γ 1{L(X) = τ}, L(x) := pθ1(x)/pθ0(x).

• Sometimes write 1{pθ1(x) ≥ τ pθ0(x)} to avoid division by zero.
• For simplicity, write p0 = pθ0 and p1 = pθ1.
• So we write L(x) = p1(x)/p0(x), for example.
181 / 218
Informal proof
• For simplicity drop the dependence on X: δ = δ(X) and L = L(X).
• Introduce a Lagrange multiplier, and solve the unconstrained problem:

δ* = argmax_δ [E1(δ) + λ(α − E0(δ))] = argmax_δ [E1(δ) − λE0(δ)]

• Recall the change-of-measure formula (note L = p1/p0):

E1[δ] = ∫ δ p1 dµ = ∫ δ L p0 dµ = E0[δL].

• The problem reduces to

δ* = argmax_δ E0[δ(L − λ)]

• The optimal solution is

δ* = 1 if L > λ, 0 if L < λ,

which is a likelihood ratio test.
182 / 218
Theorem 14 (Neyman–Pearson Lemma)
Consider the family of (randomized) likelihood ratio tests

δ_{t,γ}(x) = 1 if p1(x) > t p0(x), γ if p1(x) = t p0(x), 0 if p1(x) < t p0(x).

The following hold:
(a) For every α ∈ [0, 1], there are t, γ such that E0[δ_{t,γ}(X)] = α.
(b) If an LRT satisfies E0[δ_{t,γ}(X)] = α, then it is most powerful (MP) at level α.
(c) Any MP test at level α can be written as an LRT.

• Part (a) follows by looking at g(t) = P0(L(X) > t) = 1 − F_Z(t), where Z = L(X). g is non-increasing and right-continuous, etc. (Draw a picture.)
183 / 218
Proof of Neyman-Pearson Lemma
• For part (b), let δ* be the LRT with significance level α.
• Let δ be any other rule satisfying E0[δ(X)] ≤ α = E0[δ*(X)].
• For all x (consider the three possibilities),

δ(x)[p1(x) − t p0(x)] ≤ δ*(x)[p1(x) − t p0(x)]

• Integrate w.r.t. x:

E1[δ(X)] − t E0[δ(X)] ≤ E1[δ*(X)] − t E0[δ*(X)]

or

E1[δ(X)] − E1[δ*(X)] ≤ t(E0[δ(X)] − E0[δ*(X)]) ≤ 0

• Conclude that E1[δ(X)] ≤ E1[δ*(X)].
• Part (c), left as an exercise.
184 / 218
Example 63
• Consider X ∼ N(θ, 1) and the two hypotheses

H0 : θ = θ0 versus H1 : θ = θ1

• The likelihood ratio is

L(x) = p1(x)/p0(x) = exp[−(x − θ1)²/2] / exp[−(x − θ0)²/2]

• The LRT rejects H0 if L(x) > t. Equivalently,

log L(x) > log t ⇐⇒ −(x − θ1)²/2 + (x − θ0)²/2 > log t
⇐⇒ x(θ1 − θ0) + (θ0² − θ1²)/2 > log t
⇐⇒ x · sign(θ1 − θ0) > [log t − (θ0² − θ1²)/2] / |θ1 − θ0| =: τ

• Assume θ1 > θ0. Then the test is equivalent to x > τ.
• We set τ by requiring P0(X > τ) = α. This gives τ = θ0 + Q^{−1}(α).
185 / 218
Power calculation (Previous example continued)
• Q(x) = 1 − Φ(x), where Φ is the CDF of the standard normal distribution.
• The power is (since X − θ1 ∼ N(0, 1) under P1)

β = P1(X > τ) = P1(X − θ1 > τ − θ1) = Q(τ − θ1)

• Plugging in τ, we have β = Q(−δ + Q^{−1}(α)), where δ = θ1 − θ0.
• The plot of β versus α is the ROC curve of the test.
• ROC = Receiver Operating Characteristic.
• See next slide.
• Alternatively, can plot the parametric curve β = Q(τ − θ1), α = Q(τ − θ0), where the parameter τ varies in R.
• The ROC of no test can go above this curve (by the Neyman–Pearson lemma).
186 / 218
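The size and power formulas of Example 63 can be checked by simulation (a sketch, not from the slides). The quantile Q⁻¹(0.05) = 1.6449 is the standard normal table value, stated as an assumption rather than computed; the parameters and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, theta0, theta1, reps = 0.05, 0.0, 2.0, 200_000

# tau = theta0 + Q^{-1}(alpha), with Q^{-1}(0.05) = 1.6449 (normal table value).
q_inv = 1.6449
tau = theta0 + q_inv

# Simulated size under H0 and power under H1 for the test 1{X > tau}.
# Theory: size = alpha, power = Q(tau - theta1) = Q(-0.3551) ~ 0.639.
size = np.mean(rng.normal(theta0, 1.0, reps) > tau)
power = np.mean(rng.normal(theta1, 1.0, reps) > tau)
```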
ROC curve
[Figure: ROC curve, β versus α, for the normal mean test.]
187 / 218
Composite hypothesis testing
• Often want to test H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω1 where Ω0 ∩ Ω1 = ∅.
Example 64
Testing whether a coin is fair or not. Here Ω0 = {1/2} and Ω1 = [0, 1] \ {1/2}.

Definition 19
A test δ of size α is uniformly most powerful (UMP) at level α if

∀ tests φ of level ≤ α, ∀θ ∈ Ω1, βδ(θ) ≥ βφ(θ).

• UMP tests do not always exist (in fact, they often don't).
188 / 218
Example 65 (Coin flipping continued)
• Observe X ∼ Bin(n, θ).
• Consider testing H0 : θ = 1/2 versus H1 : θ = θ1 based on X.
• The most powerful test is an LRT, based on

L(x) = θ1^x (1 − θ1)^{n−x} / [(1/2)^x (1/2)^{n−x}] = (θ1/(1 − θ1))^x ((1 − θ1)/(1/2))^n

• The nature of the test changes based on whether θ1 < 1/2 or θ1 > 1/2:

θ1 < 1/2 =⇒ log[θ1/(1 − θ1)] < 0 =⇒ reject H0 when x < τ
θ1 > 1/2 =⇒ log[θ1/(1 − θ1)] > 0 =⇒ reject H0 when x > τ

• This suggests that a special structure is needed for the existence of a UMP test.
189 / 218
Definition 20
A family P = {pθ(x) : θ ∈ Ω} of densities has a monotone likelihood ratio (MLR) in T(x) if for θ0 < θ1, the LR L(x) = pθ1(x)/pθ0(x) is a non-decreasing function of T(x).

For example, in the coin flipping problem, the model has MLR in T(X) = X.

Example 66 (1-D exponential family)
• Consider pθ(x) = h(x) exp[η(θ)T(x) − A(θ)]. The LR is

L(x) = pθ1(x)/pθ0(x) = exp[(η(θ1) − η(θ0))T(x) − A(θ1) + A(θ0)].

• If η is monotone (e.g., θ0 ≤ θ1 =⇒ η(θ0) ≤ η(θ1)), then the family has MLR in T(x) or −T(x).
• Includes the Bernoulli (or binomial) example above, with η(θ) = log[θ/(1 − θ)].
• Other cases: normal location family, Poisson and exponential.
190 / 218
Example 67 (Non-exponential family)
• Xi iid∼ U[0, θ], i = 1, . . . , n.
• pθ(x) = θ^{−n} 1{x(n) ≤ θ} 1{x(1) ≥ 0}, and

L(x) = (θ1/θ0)^n · 1{x(n) ≤ θ1}/1{x(n) ≤ θ0} = (θ1/θ0)^n for x(n) ∈ [0, θ0), ∞ for x(n) ∈ [θ0, θ1).

• Consider θ1 > θ0.
• For x(n) ∈ (0, θ1), L(x) is increasing in x(n) (it transitions from (θ1/θ0)^n to ∞).
191 / 218
Theorem 15 (UMP for one-sided problems)
• Let P be a family with MLR in T(x).
• Consider the one-sided test H0 : θ ≤ θ0 versus H1 : θ > θ0.
• Then, δ(x) = 1{T(x) > C} + γ 1{T(x) = C}, for γ, C such that βδ(θ0) = α, is UMP of size α.

• Take θ1 > θ0 and let L_{θ1,θ0}(x) = pθ1(x)/pθ0(x) be the corresponding LR.
• By the MLR property, the given test is an LR test, i.e.,

δ(x) = 1{L_{θ1,θ0}(x) > C′} + γ 1{L_{θ1,θ0}(x) = C′}

for some constant C′ = C′(θ1, θ0).
• Since βδ(θ0) = α, by the Neyman–Pearson lemma, δ is MP for testing H0 : θ = θ0 versus H1 : θ = θ1.
• Since θ1 > θ0 was arbitrary, δ is UMP for θ = θ0 versus θ > θ0.
• The last piece to check is βδ(θ) ≤ α for θ < θ0. (Exercise.)
192 / 218
Example 68 (Non-exponential family (continued))
• Xi iid∼ U[0, θ], i = 1, . . . , n.
• The family has MLR in X(n).
• δ(X) = 1{X(n) ≥ t} is UMP for the one-sided test of θ > θ0 against θ ≤ θ0.
• To set the threshold, consider

g(t) = 1 − Pθ0(X(n) ≤ t) = 1 − ∏i Pθ0(Xi ≤ t) = 1 − (t/θ0)^n for t ≤ θ0, 0 for t > θ0,

which is a continuous function.
• Solving g(t) = α gives t = (1 − α)^{1/n} θ0.
• Similarly, the power function is

β(θ) = Pθ(X(n) > t) = [1 − (t/θ)^n]₊

which holds for all θ > 0.
• For the UMP test, we have β(θ) = [1 − (1 − α)(θ0/θ)^n]₊.
193 / 218
• Plots of the power function for the U[0, θ] example for various n (θ0 = 1, α = 0.2).

[Figure: power functions β(θ) for n = 2, 5, 10, 20, 100.]
194 / 218
• Plots of the power function for the U[0, θ] example for various n (θ0 = 1, α = 0.05).

[Figure: power functions β(θ) for n = 2, 5, 10, 20, 100.]
195 / 218
• ROC plots for the U[0, θ] example for various n (θ0 = 1, θ = 1.1).

[Figure: ROC curves for n = 2, 5, 10 and the random (chance) test, θ = 1.1.]
196 / 218
• ROC plots for the U[0, θ] example for various n (θ0 = 1, θ = 2).

[Figure: ROC curves for n = 2, 5, 10 and the random (chance) test, θ = 2.0.]
197 / 218
Generalized likelihood ratio test (GLRT)
• Consider testing H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω1.
• In the absence of UMP tests, a natural extension of the LRT is the following generalized LRT:

L(x) = sup_{θ∈Ω} pθ(x) / sup_{θ∈Ω0} pθ(x) = pθ̂(x)/pθ̂0(x) ∈ [1, ∞]

where Ω = Ω0 ∪ Ω1.
• θ̂ is the unconstrained MLE, whereas θ̂0 is the constrained MLE, the maximizer of θ 7→ pθ(x) over θ ∈ Ω0.
• A GLRT rejects H0 if L(x) > λ for some threshold λ.
• Alternatively, the GLRT can be written as

δ(x) = 1{Λn(x) > τ} + γ 1{Λn(x) = τ}, Λn(x) = 2 log L(x)

• The threshold τ is set as usual by solving sup_{θ∈Ω0} Eθ[δ(X)] = α.
198 / 218
• Why is this a reasonable test?
• Assume γ = 0: no randomization is needed.
• When θ ∈ Ω0, both θ̂ and θ̂0 approach θ as n → ∞.
• Hence, L(x) ≈ 1 when θ ∈ Ω0.
• However, when θ ∈ Ω1, the unconstrained MLE θ̂ approaches θ as n → ∞, while θ̂0 does not. This is because θ̂0 ∈ Ω0 and θ ∈ Ω1, and Ω0 and Ω1 are disjoint.
• It follows that L(x) > 1 when θ ∈ Ω1 (in fact, usually L(x) ≫ 1).
• By thresholding L(x) at some λ > 1, we can tell the two hypotheses apart.
199 / 218
Example 69
• Xi iid∼ N(µ, σ²), both µ and σ² unknown. Let θ = (µ, σ²). Take

Ω = {(µ, σ²) | µ ∈ R, σ² > 0}, Ω0 = {θ0}, for θ0 = (µ0, σ0²) = (0, 1).

• Want to test Ω0 against Ω1 = Ω \ Ω0.

sup_{θ∈Ω0} pθ(x) = pθ0(x) = (2π)^{−n/2} exp(−(1/2) ∑i xi²)

sup_{θ∈Ω} pθ(x) = pθ̂(x) = (2πσ̂²)^{−n/2} exp(−(1/(2σ̂²)) ∑i (xi − µ̂)²)

where θ̂ = (µ̂, σ̂²), with µ̂ = (1/n) ∑i xi and σ̂² = (1/n) ∑i (xi − µ̂)², is the MLE.
200 / 218
Example continued
• The GLR is

L(x) = pθ̂(x)/pθ0(x) = (σ̂²)^{−n/2} exp[(1/2) ∑i xi² − (1/(2σ̂²)) ∑i (xi − µ̂)²]

where the second sum in the exponent equals n/2. The GLRT rejects H0 when L(x) > tα, where Pθ0(L(x) > tα) = α.
• Alternatively, threshold

Λn(x) = 2 log L(x) = −n log σ̂² + ∑i xi² − n.

• We will see that Λn in general has an asymptotic χ²_r distribution, where r is the difference between the dimensions of the full (Ω) versus null (Ω0) parameter spaces.
• Thus, we expect Λn in this problem to have an asymptotic χ²₂ distribution under the null θ = θ0. (It is instructive to try to show this directly.)
201 / 218
Asymptotics of GLRT
• Consider Ω ⊂ R^d, open, and let r ≤ d. Take the null hypothesis to be

Ω0 = {θ ∈ Ω : θ1 = θ2 = · · · = θr = 0} = {θ ∈ Ω : θ = (0, . . . , 0, θ_{r+1}, . . . , θ_d)}.

• Note that Ω0 is a (d − r)-dimensional subset of Ω.

Theorem 16
Under the same assumptions guaranteeing asymptotic normality of MLEs,

Λn = 2 log L(X) →d χ²_r, under H0.

Degrees of freedom r = d − (d − r), that is, the difference in the local dimensions of the full and null parameter sets.
202 / 218
• Recall ℓθ(X) = log pθ(X) and let Mn(θ) = (1/n) ∑i ℓθ(Xi).
• We have Λn = −2n[Mn(θ̂0,n) − Mn(θ̂n)].
• By Taylor expansion around θ̂n (the unrestricted MLE), for some θ̃n ∈ [θ̂n, θ̂0,n],

Mn(θ̂0,n) − Mn(θ̂n) = [Ṁn(θ̂n)]ᵀ(θ̂0,n − θ̂n) + (1/2)(θ̂0,n − θ̂n)ᵀ M̈n(θ̃n)(θ̂0,n − θ̂n)

• Since θ̂n is the MLE, we have Ṁn(θ̂n) = 0, assuming θ̂n ∈ int(Ω).
• By the same uniform-convergence arguments, θ̃n = θ0 + op(1) implies
• M̈n(θ̃n) = M̈n(θ0) + op(1) = −Iθ0 + op(1),
• where the last equality holds because M̈n(θ0) →p Eθ0[ℓ̈θ0] = −Iθ0 by the WLLN.
203 / 218
• Assuming that √n(θ̂0,n − θ̂n) = Op(1), we obtain

Λn = [√n(θ̂0,n − θ̂n)]ᵀ Iθ0 √n(θ̂0,n − θ̂n) + op(1).

• Asymptotically, the GLR measures a particular (squared) distance between θ̂0,n and θ̂n, one which weighs different directions differently, according to the eigenvectors of I(θ0).
• More specifically, let ‖z‖_Q := √(zᵀQz) = ‖Q^{1/2}z‖₂, which defines a norm when Q ≻ 0. Then,

Λn = ‖√n(θ̂0,n − θ̂n)‖²_{Iθ0} + op(1).

• In the simple case where Ω0 = {θ0}, we have √n(θ̂n − θ0) →d N(0, Iθ0^{−1}).
• Equivalently, √n(θ̂n − θ0) →d Iθ0^{−1/2} Z, where Z ∼ N(0, I_d).
• It follows from the CM theorem (since z 7→ ‖z‖_Q is continuous) that

Λn →d ‖Iθ0^{−1/2} Z‖²_{Iθ0} = ‖Iθ0^{1/2} Iθ0^{−1/2} Z‖₂² = ‖Z‖₂².

• Since ‖Z‖₂² = ∑_{i=1}^d Zi² ∼ χ²_d, this proves the case Ω0 = {θ0}.
• The proof of the general case is more complicated and is omitted.
204 / 218
Example 70 (Multinomial: testing uniformity)
• (X1, . . . , Xd) ∼ Multinomial(n, θ), where θ = (θ1, . . . , θd) is a probability vector, that is,

θ ∈ Ω = {θ ∈ R^d : θi ≥ 0, ∑i θi = 1}

• Xj counts how many of the n objects fall into category j.

pθ(x) = (n choose x1, . . . , xd) ∏_{i=1}^d θi^{xi} ∝ ∏_{i=1}^d θi^{xi}.

• Would like to test Ω0 = {θ0}, where θ0 = (1/d, . . . , 1/d), versus Ω1 = Ω \ Ω0.
• The MLE over Ω is given by θ̂i = xi/n. Deriving it requires techniques for constrained optimization, such as Lagrange multipliers, since Ω itself is constrained. (Exercise.)
205 / 218
• We obtain

Λn = 2 log [pθ̂(x)/pθ0(x)] = 2 log ∏_{i=1}^d (θ̂i/(θ0)i)^{xi} = 2 ∑_{i=1}^d xi log[θ̂i/(θ0)i]
   = 2n ∑i θ̂i log[θ̂i/(θ0)i] = 2n D(θ̂‖θ0)

• Both θ̂ and θ0 are probability vectors; D(θ̂‖θ0) is their KL divergence.
• GLRT does a sensible thing: reject the null if θ̂ is far from θ0 in KL divergence.
• Our asymptotic theory implies: Λn →d χ²_{d−1} under the null, i.e., θ = θ0, since Ω is (d − 1)-dimensional and Ω0 is 0-dimensional. This is a fairly non-trivial result.
206 / 218
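The χ²_{d−1} limit of Example 70 can be checked by simulation (a sketch, not from the slides; d, n, repetitions, and seed are illustrative): under the uniform null, Λn = 2n D(θ̂‖θ0) should have approximately the mean (d − 1) and variance 2(d − 1) of a χ²_{d−1} distribution.

```python
import numpy as np

rng = np.random.default_rng(10)
d, n, reps = 5, 2000, 20_000
theta0 = np.full(d, 1 / d)

# Simulate Lambda_n = 2n * D(theta_hat || theta0) under the uniform null.
counts = rng.multinomial(n, theta0, size=reps)
theta_hat = counts / n
with np.errstate(divide="ignore", invalid="ignore"):
    # Convention 0*log(0) = 0 for any empty categories.
    terms = np.where(theta_hat > 0, theta_hat * np.log(theta_hat / theta0), 0.0)
lam = 2 * n * terms.sum(axis=1)

# chi^2_{d-1} with d-1 = 4 df has mean 4 and variance 8.
emp_mean, emp_var = lam.mean(), lam.var()
```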
p-values
• Consider a family of tests {δα, α ∈ (0, 1)} indexed by their level α.
• Assume: α 7→ δα(x) is nondecreasing and right-continuous.
• E.g., if δα(x) = 1{x ∈ C(α)}, then C(α1) ⊆ C(α2) if α1 ≤ α2.
• The p-value, or attained significance, for observed x is defined as

p(x) := inf{α : δα(x) = 1}

• Note that p = p(X) is a random variable. We have

p(X) ≤ α ⇐⇒ δα(X) = 1

p ≤ α implies 1 = δp ≤ δα, since the infimum is attained by the assumptions on α 7→ δα, hence δα = 1. The other direction follows from the definition of inf.
• This implies P0(p(X) ≤ α) = P0(δα(X) = 1) = α.
• That is, p = p(X) ∼ U[0, 1] under the null.
207 / 218
Example 71 (Normal example continued)
• Consider X ∼ N(θ, 1) and H0 : θ = θ0 versus H1 : θ = θ1.
• The MP test at level α is δα(X) = 1{X − θ0 ≥ Q^{−1}(α)}.
• Alternatively, δα(X) = 1{Q(X − θ0) ≤ α}, since Q is decreasing.
• The p-value is

p(X) = inf{α : Q(X − θ0) ≤ α} = Q(X − θ0)  (9)

• Under the null, X − θ0 ∼ N(0, 1), hence Φ(X − θ0) ∼ U[0, 1] (why?).
• Then, p(X) = 1 − Φ(X − θ0) ∼ U[0, 1], as expected.
• Exercise: verify that under H1 : θ = θ1, the CDF of p(X) is

P1(p(X) ≤ t) = Q(−δ + Q^{−1}(t)),

where δ = θ1 − θ0. Note that this curve is the same as the ROC.
• Recall Q(t) = 1 − Φ(t), where Φ is the CDF of N(0, 1).
208 / 218
• Can verify that definition (9) produces the "common" definition of p-values, say when δα(X) = 1{T ≥ τα} or δα(X) = 1{|T| ≥ τα}.

Example 72
• Consider the two-sided test and let G(t) = P0(|T| ≥ t).
• Assume that G is continuous, hence invertible (both G and G^{−1} are decreasing).
• Requiring level α: G(τα) = P0(|T| ≥ τα) = α =⇒ τα = G^{−1}(α).
• This gives δα(X) = 1{|T| ≥ G^{−1}(α)} = 1{G(|T|) ≤ α}.
• By definition (9), p = G(|T|), which is the common definition.
209 / 218
Multiple Hypothesis testing
• We have a collection of null hypotheses H0,i , i = 1, . . . , n.
Example 73 (Basic example)
• Testing in the normal means model
yi ∼ N(µi , 1), i = 1, . . . , n
and H0,i : µi = 0.
• yi could be the expression level (or state) of gene i .
• H0,i means that gene i has no effect on the disease under consideration.
• Suppose that for each H0,i , we have a test, hence a p-value pi .
• Assume under H0,i : pi ∼ U[0, 1].
• A test that rejects H0,i when pi ≤ α is of size α under the ith null.
210 / 218
Testing global null
• The global null is H0 = ⋂ⁿᵢ₌₁ H0,i.
• Want to combine p1, . . . , pn to build a test of level α for H0.
• Can use 1{pi ≤ α} for a fixed i, but we get better power if we use all of them.
• Benferroni’s test for global null:
Reject H0 if mini
pi ≤α
n
• By the union bound (no independence needed),

PH0(reject H0) = PH0(⋃ⁿᵢ₌₁ {pi ≤ α/n}) ≤ ∑ⁿᵢ₌₁ PH0(pi ≤ α/n) = α
• Exercise: Assuming the pi are independent under H0, the exact size of Bonferroni's test is 1 − (1 − α/n)ⁿ → 1 − exp(−α) as n → ∞. Thus for large n and small α, size ≈ 1 − exp(−α) ≈ α, hence the union bound is not bad in this case.
211 / 218
Fisher test for global null
• Fisher combination test:
Reject H0 if Tn := −2 ∑ⁿᵢ₌₁ log pi > χ²₂ₙ(1 − α)
Lemma 7
If p1, . . . , pn are independent U[0, 1], then Tn ∼ χ²₂ₙ.
• Thus, assuming independence under H0, Fisher test has the correct size.
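Lemma 7 can be sanity-checked by simulation: each −2 log pi is Exp with mean 2, i.e. χ²₂, so Tn should have mean 2n and variance 4n. A sketch (n = 20 and the replication count are arbitrary):

```python
import math
import random

random.seed(1)
n, reps = 20, 50_000
# T = -2 * sum_i log(p_i) with p_i iid U[0,1]; each -2*log(p_i) ~ chi^2_2,
# so T ~ chi^2_{2n}, with mean 2n and variance 4n
T = [-2 * sum(math.log(random.random()) for _ in range(n)) for _ in range(reps)]
mean = sum(T) / reps
var = sum((t - mean) ** 2 for t in T) / reps
print(mean, var)  # should be close to 2n = 40 and 4n = 80
```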
212 / 218
Simes test for global null
• Simes test:
Reject H0 if Tn := minᵢ (n/i) p(i) ≤ α

where p(1) ≤ p(2) ≤ · · · ≤ p(n) are the order statistics of p1, . . . , pn.
Lemma 8
If p1, . . . , pn are independent U[0, 1], then Tn ∼ U[0, 1].
• (Independence can be relaxed.)
• Thus, assuming independence under H0, Simes test has the correct size.
• Equivalent form of Simes test:
Reject H0 if p(i) ≤ (i/n) α for some i

• Less conservative than Bonferroni's test, which rejects only if p(1) ≤ α/n.
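A minimal sketch of the Simes statistic (the function name is mine); the example shows a case where Simes rejects but Bonferroni does not:

```python
def simes_stat(pvals):
    # T = min_i (n / i) * p_(i), over the sorted p-values p_(1) <= ... <= p_(n)
    n = len(pvals)
    return min(n * p / i for i, p in enumerate(sorted(pvals), start=1))

alpha = 0.05
p = [0.03, 0.03, 0.03]
print(simes_stat(p) <= alpha)    # Simes rejects: min(0.09, 0.045, 0.03) = 0.03
print(min(p) <= alpha / len(p))  # Bonferroni does not: 0.03 > 0.05/3
```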
213 / 218
Testing individual hypotheses
• In the gene expression example, we care about which genes are null/not null. We would like to test H0,i : µi = 0 versus H1,i : µi = 1, for all i.
• The problem has resemblance to classification.
• Interested in how we are doing in an aggregate sense.
• We can think of having a decision problem
p ∼ Pθ, where θ ∈ {0, 1}ⁿ. (10)
• θi = 1 iff H0,i is true.
• The global null corresponds to θ = 1 (the all-ones vector).
• Minimal assumption: When θi = 1, we have pi ∼ U[0, 1], i.e., the ith marginal of Pθ is uniform.
214 / 218
Terminology (shared with classification)
• Confusion matrix: Count how many combinations we have.
• For example, if θ̂ ∈ {0, 1}ⁿ is our guess for the hypotheses:

TP = (1/n) ∑ⁿᵢ₌₁ 1{θ̂i = 1, θi = 1}

             positive (1)   negative (0)   total
             accepted       rejected
true (1)         TP             TN           T
false (0)        FP             FN           F
total             P              N           n
• True here means H0,i is true.
215 / 218
• Alternative notation:

             positive (1)   negative (0)   total
             accepted       rejected
true (1)          U              V          n0
false (0)         T              S          n − n0
total           n − R            R          n
• Familywise error rate (FWER):
FWERθ = Pθ(V ≥ 1)
• A much less stringent criterion is the False Discovery Rate (FDR).
• Consider the false discovery proportion (a random variable)

FDP = V / max(R, 1) = (V/R) 1{R > 0}
• FDR is the expectation of FDP:
FDRθ = Eθ[FDP]
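The FDP is straightforward to compute from the confusion counts. A sketch using the slide's convention (θi = 1 marks a true null) together with a per-hypothesis rejection indicator:

```python
def fdp(null_true, rejected):
    # null_true[i] = 1 if H0,i is true; rejected[i] = 1 if H0,i is rejected
    R = sum(rejected)                                    # total rejections
    V = sum(t * r for t, r in zip(null_true, rejected))  # true nulls rejected
    return V / max(R, 1)

# 2 true nulls and 2 non-nulls; 3 rejections, 1 of them a true null
print(fdp([1, 1, 0, 0], [1, 0, 1, 1]))  # V/R = 1/3
```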
216 / 218
• Controlling FWER in a strong sense: control for all θ ∈ {0, 1}ⁿ.
Theorem 17
Benferroni’s approach, where we test each hypothesis at level α/n, controlsFWER at level α in a strong sense. In fact
Eθ[V ] ≤ n0
nα, ∀θ ∈ 0, 1n.
• E[V ] =∑n
i=1 P(V ≥ i) which holds for any nonnegative discrete variable.
• Hence, E[V ] ≥ P(V ≥ 1).
• Let Vi = 1{H0,i is true but rejected} = 1{θi = 1, θ̂i = 0}, so that

Eθ[Vi] = 1{θi = 1} Pθ(θ̂i = 0)

• Since V = ∑ᵢ Vi,

Eθ[V] = ∑ᵢ Eθ[Vi] = ∑_{i : θi = 1} Pθ(θ̂i = 0) ≤ ∑_{i : θi = 1} α/n = (n0/n) α

(Here θ̂i is based only on pi.)
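The bound Eθ[V] ≤ (n0/n)α holds with equality when the null p-values are exactly uniform, which a quick simulation illustrates (the split of 30 true nulls / 20 non-nulls and α = 0.1 are arbitrary choices of mine):

```python
import random

random.seed(2)
n, alpha, reps = 50, 0.1, 20_000
theta = [1] * 30 + [0] * 20   # theta_i = 1 marks a true null; n0 = 30
n0 = sum(theta)
total_V = 0
for _ in range(reps):
    # true nulls get p_i ~ U[0,1]; non-nulls get artificially small p-values
    p = [random.random() if t else 0.01 * random.random() for t in theta]
    # Bonferroni: reject H0,i iff p_i <= alpha/n; V counts rejected true nulls
    total_V += sum(1 for t, pi in zip(theta, p) if t and pi <= alpha / n)
print(total_V / reps, n0 * alpha / n)  # empirical E[V] vs the bound 0.06
```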
217 / 218
• Benjamini–Hochberg (BH) procedure: Let i0 be the largest i such that

p(i) ≤ (i/n) q.

Reject all H0,i for i ≤ i0.
Theorem 18
Under independence of the null p-values from each other and from the non-nulls, the BH procedure (uniformly) controls the FDR at level q. In fact,

FDRθ(θ̂BH) = (n0/n) q,  ∀θ ∈ {0, 1}ⁿ
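The BH step-up rule above can be sketched as follows (a standalone implementation of mine, not code from the slides):

```python
def benjamini_hochberg(pvals, q):
    # Step-up rule: find the largest rank i with p_(i) <= (i/n) q,
    # then reject the hypotheses with the i0 smallest p-values.
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    i0 = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / n:
            i0 = rank
    return sorted(order[:i0])  # indices of rejected hypotheses

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.9], 0.1))  # [0, 1, 2]
```

Note that the rule is step-up: p(3) = 0.04 exceeds the Bonferroni cutoff 0.1/4 but is still rejected because it lies below its own threshold (3/4)·0.1.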
218 / 218