Mathematical Statistics, Fall 2016
Chapter 6: Likelihood Methods
Byeong U. Park
Department of Statistics, Seoul National University
6.1 Maximum Likelihood Estimation
6.2 Information Inequality and Efficiency
6.3 Maximum Likelihood Tests
Idea of Maximizing Likelihood
I If pdf(x; θ1) > pdf(x; θ2) for an observation X = x, then the distribution
pdf(·; θ1) is more likely than pdf(·; θ2) to be the true distribution that
generated the observation x.
I Likelihood and log-likelihood functions: pdf(x; θ) as a function of θ for a
given x is called the likelihood (function). Its logarithm is called the
log-likelihood (function).
I Below we write L(θ) ≡ L(θ; x) = pdf(x; θ) and ℓ(θ) ≡ ℓ(θ; x) = log pdf(x; θ).
I For a set of observations x1, . . . , xn of a random sample X1, . . . , Xn from
pdf f(·; θ), we have
L(θ) = ∏_{i=1}^n f(xi; θ),  ℓ(θ) = ∑_{i=1}^n log f(xi; θ).
Maximum Likelihood Estimator (MLE)
I Definition of MLE: The MLE of θ for a given observation x is defined by
θMLE(x) ≡ arg max_{θ∈Θ} L(θ; x)
when it exists.
I MLE may not exist and may not be unique when it exists.
I In some cases an MLE does not exist in the parameter space Θ but exists in an
extended parameter space Θ̄ (e.g., the closure of Θ).
I Suppose that Θ is open in Rd and the log-likelihood function ℓ(θ) is
continuous in Θ and diverges to −∞ as θ approaches the ‘boundary’ of
Θ. Then, an MLE exists.
Likelihood Equation
I When the likelihood is differentiable, an MLE is often found by solving the
likelihood equation ℓ̇(θ; x) = 0, where ℓ̇(θ; x) = (∂/∂θ)ℓ(θ; x).
I Suppose that the likelihood is twice differentiable. For a given x, if an
MLE exists and the second derivative ℓ̈(θ; x) is negative definite for all
θ ∈ Θ, then the solution of the likelihood equation exists and is the unique
MLE.
I Suppose that the likelihood is twice differentiable. For a given x, if
ℓ̇(θ̂(x); x) = 0 and ℓ̈(θ; x) is negative definite for all θ ∈ Θ, then θ̂(x) is
the unique MLE:
ℓ(θ) = ℓ(θ̂) + ℓ̇(θ̂)ᵀ(θ − θ̂) + (1/2)(θ − θ̂)ᵀ ℓ̈(θ*)(θ − θ̂) ≤ ℓ(θ̂)
with the equality holding if and only if θ = θ̂.
MLE of Function of Parameter
I Let θ = (θ1, θ2) and let θ̂ = (θ̂1, θ̂2) be its MLE. Then, we call θ̂j the MLEs of
θj, respectively.
I For an injective function g, the MLE of η = g(θ) is given by
ηMLE = g(θMLE).
Proof. Let L denote the likelihood function of θ. Then, the likelihood of
η equals L(g⁻¹(η); x). This entails
(Likelihood at ηMLE) = L(g⁻¹(ηMLE); x) = L(θMLE; x)
= max_{θ∈Θ} L(θ; x) = max_{η∈g(Θ)} L(g⁻¹(η); x).
MLE of Function of Parameter
I What if the function g in η = g(θ) is not injective?
One may find h such that the map θ ↦ (g(θ), h(θ)) is injective. Then the
MLE of ηext ≡ (g(θ), h(θ)) is given by
ηMLEext = (g(θMLE), h(θMLE)),
so that it remains true that
ηMLE = g(θMLE).
Profiling Method of Finding MLE
Sometimes it is difficult to find the MLE of θ = (θ1, θ2) simultaneously, but
rather easy to find the MLEs of θj with the other being fixed. Let θ1(θ2)
denote the MLE of θ1 when θ2 is fixed.
I L(θ1(θ2), θ2) is a function of θ2 only, called the profile likelihood of θ2.
I The MLE of θ is given by θMLE = (θ1(θ̂2), θ̂2), where
θ̂2 = argmax_{θ2 : (θ1(θ2), θ2) ∈ Θ} L(θ1(θ2), θ2).
Proof. For any (θ1, θ2) ∈ Θ, it holds that
L(θMLE) ≥ L(θ1(θ2), θ2) ≥ L(θ1, θ2).
Newton-Raphson Algorithm
It may not be possible to find the solution of the likelihood equation explicitly
when the likelihood equation is nonlinear.
I The Newton-Raphson method is an iteration scheme based on the linear
approximation of the likelihood equation:
0 = ℓ̇(θ) ≈ ℓ̇(θOLD) + ℓ̈(θOLD)(θ − θOLD).
I The iteration scheme:
θNEW = θOLD − ℓ̈(θOLD)⁻¹ ℓ̇(θOLD).
I Convergence of the iteration: Newton-Kantorovich Theorem!
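The iteration can be sketched in a few lines of Python. As an illustration (not from the text), it is applied to the exponential log-likelihood in the rate parametrization, ℓ(η) = n log η − n x̄ η, treated later in this chapter; its score n/η − n x̄ vanishes at η = 1/x̄. The sample size and sample mean below are hypothetical.

```python
def newton_raphson(score, hessian, theta0, tol=1e-12, max_iter=100):
    """Iterate theta <- theta - hessian(theta)^{-1} * score(theta) (1-D case)."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Hypothetical data: n = 50 observations from Exp with sample mean xbar = 2.0,
# parametrized by the rate eta = 1/theta, so the root is 1/xbar = 0.5.
n, xbar = 50, 2.0
score = lambda eta: n / eta - n * xbar   # first derivative of the log-likelihood
hessian = lambda eta: -n / eta ** 2      # second derivative (always negative)
eta_hat = newton_raphson(score, hessian, 0.3)
```

Starting the iteration too far from the root can make it diverge, which is exactly what the Newton-Kantorovich conditions on the next slide rule out.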
Newton-Kantorovich Theorem
Suppose that there exist constants α, β, γ and r such that 2αβγ < 1 and
2α < r, for which ℓ has a second derivative ℓ̈(θ) at all θ ∈ Br(θ[0]) that is
invertible, and
(i) ‖ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])‖ ≤ α,
(ii) ‖ℓ̈(θ[0])⁻¹‖ ≤ β,
(iii) ‖ℓ̈(θ) − ℓ̈(θ′)‖ ≤ γ‖θ − θ′‖ for all θ, θ′ ∈ Br(θ[0]).
Then, ℓ̇(θ) = 0 has a unique solution θ̂ in B2α(θ[0]). Furthermore, θ̂ can be
approximated by the Newton-Raphson iterative method
θ[k+1] = θ[k] − ℓ̈(θ[k])⁻¹ ℓ̇(θ[k]), k ≥ 0, which converges at a geometric rate:
‖θ[k] − θ̂‖ ≤ α 2^{−(k−1)} q^{2^k − 1},
where q = 2αβγ < 1.
Gradient Descent Algorithm
Sometimes the Newton-Raphson algorithm is unstable, especially when θ is of
high-dimension. The gradient descent method is an iterative scheme of finding
the minimal point of an objective function F. The method moves the
current iterate slightly in the direction opposite to the gradient of F, i.e.,
θNEW = θOLD − γ∇F(θOLD)
for γ small enough. Too small a γ takes too long to reach the minimum, while too
large a γ may overshoot the minimal point. It is suggested to take a larger step at
the beginning and a smaller step as the iteration goes on.
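A minimal sketch of the scheme in Python, on a hypothetical quadratic objective F(θ) = (θ − 3)² with a fixed small step γ (everything below is illustrative, not from the text):

```python
def gradient_descent(grad, theta0, gamma=0.1, n_iter=200):
    """theta <- theta - gamma * grad(theta), repeated n_iter times."""
    theta = theta0
    for _ in range(n_iter):
        theta = theta - gamma * grad(theta)
    return theta

# F(theta) = (theta - 3)^2 has gradient 2*(theta - 3) and minimizer 3.
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

With this fixed γ the error shrinks by the factor 0.8 per step; a decaying step schedule, as suggested above, trades early speed for late precision.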
Coordinate Descent Algorithm
The minimization of a multivariable function F may be achieved by minimizing
it along one direction at a time, i.e., solving univariate optimization problems in
a loop. The algorithm solves the optimization problem
θ[k+1]_j = argmin_{θj} F(θ[k+1]_1, . . . , θ[k+1]_{j−1}, θj, θ[k]_{j+1}, . . . , θ[k]_d), 1 ≤ j ≤ d.
It is easy to see that
F (θ[0]) ≥ F (θ[1]) ≥ F (θ[2]) ≥ . . . .
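A sketch in Python for a hypothetical bivariate quadratic F(θ1, θ2) = (θ1 − 1)² + (θ2 + 2)² + θ1θ2, whose one-dimensional minimizations have closed forms; each sweep performs the update displayed above for j = 1, 2:

```python
def coordinate_descent(theta1, theta2, n_sweeps=100):
    """Minimize F(t1, t2) = (t1 - 1)^2 + (t2 + 2)^2 + t1*t2 one coordinate at a time."""
    for _ in range(n_sweeps):
        theta1 = 1.0 - theta2 / 2.0    # argmin over theta1 with theta2 fixed
        theta2 = -2.0 - theta1 / 2.0   # argmin over theta2 with theta1 fixed
    return theta1, theta2

t1, t2 = coordinate_descent(0.0, 0.0)  # the global minimizer is (8/3, -10/3)
```

Each sweep can only decrease F, matching the chain of inequalities above; for this strictly convex F the sweeps converge to the global minimizer.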
Example: Poisson Model
Suppose that we observe X from Poisson(θ) for unknown θ ∈ Θ = (0,∞).
I The log-likelihood ℓ(θ; x) = −θ + x log θ − log x! is maximized at θ = x when x > 0. When x = 0,
ℓ(θ; 0) = −θ
does not have a maximizer in (0, ∞). Thus, the MLE of θ does not exist
when x = 0.
I If we let Θ̄ = [0, ∞), then the MLE exists for all x ≥ 0 and is given by
θMLE = X.
I Read #6.1.3 for an example of non-uniqueness of MLE.
Example: Bernoulli Model
Suppose that we observe a random sample X1, . . . , Xn from Bernoulli(θ) for
unknown θ ∈ [0, 1]. Then, θMLE = X̄.
I For 0 < ∑_{i=1}^n xi < n, θ = x̄ is the unique MLE:
I For θ = 0 or θ = 1, L(θ) = 0;
I For θ ∈ (0, 1), L(θ) > L(0) = L(1) = 0 and
ℓ̇(θ) = ∑_{i=1}^n xi / θ − (n − ∑_{i=1}^n xi) / (1 − θ) = 0
has the solution θ = x̄, with
ℓ̈(θ) = −∑_{i=1}^n xi / θ² − (n − ∑_{i=1}^n xi) / (1 − θ)² < 0
for all 0 < θ < 1.
I For ∑_{i=1}^n xi = 0, L(θ) = (1 − θ)ⁿ is maximized at θ = 0.
I For ∑_{i=1}^n xi = n, L(θ) = θⁿ is maximized at θ = 1.
Example: Logistic Family of Distributions
Suppose that we observe a random sample X1, . . . , Xn from Logistic(θ, 1) for
unknown θ ∈ R, which has density function
f(x; θ) = exp(−(x − θ)) / (1 + exp(−(x − θ)))² · I_(−∞,∞)(x).
I Derivatives of the log-likelihood:
ℓ̇(θ) = n − 2∑_{i=1}^n exp(−(xi − θ)) / {1 + exp(−(xi − θ))},
ℓ̈(θ) = −2∑_{i=1}^n exp(−(xi − θ)) / {1 + exp(−(xi − θ))}² < 0, θ ∈ R.
I The likelihood equation ℓ̇(θ) = 0 has a unique solution, since ℓ̇ is strictly
decreasing and ℓ̇(θ) → ∓n as θ → ±∞, respectively. Thus, the solution is the
unique MLE.
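Since the score is strictly decreasing with opposite signs at ±∞, the root can be bracketed and located by bisection. A sketch in Python on a small hypothetical sample (for the symmetric data below the root is 0 by symmetry):

```python
import math

def score(theta, xs):
    # n - 2 * sum_i exp(-(x_i - theta)) / (1 + exp(-(x_i - theta)));
    # each summand equals 1 / (1 + exp(x_i - theta)).
    return len(xs) - 2.0 * sum(1.0 / (1.0 + math.exp(x - theta)) for x in xs)

def bisect_root(f, lo, hi, n_iter=200):
    """Bisection for a strictly decreasing f with f(lo) > 0 > f(hi)."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

xs = [-1.0, 0.0, 1.0]   # hypothetical data, symmetric about 0
theta_mle = bisect_root(lambda t: score(t, xs), min(xs) - 1.0, max(xs) + 1.0)
```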
Example: Family of Exponential Distributions
Suppose that we observe a random sample X1, . . . , Xn from Exp(θ) for
unknown θ > 0 that has a density function given by
f(x; θ) = θ⁻¹e^{−x/θ} · I_(0,∞)(x). It is easier to find the MLE of η = θ⁻¹ first and
then transform it back to the MLE of θ = η⁻¹.
I Derivatives of the log-likelihood of η:
ℓ̇(η) = n/η − n x̄,  ℓ̈(η) = −n/η² < 0, η > 0.
I The likelihood equation ℓ̇(η) = 0 has the solution η = 1/x̄; thus it is the
unique MLE of η.
I θMLE = X̄ is the unique MLE of θ.
Example: Families of Double Exponential Distributions
Suppose that we observe a random sample X1, . . . , Xn from DE(θ, 1) for
unknown θ ∈ R that has a density function given by
f(x; θ) = (1/2) exp(−|x − θ|) · I_(−∞,∞)(x).
In this case the likelihood is not differentiable. Let X(1) ≤ X(2) ≤ · · · ≤ X(n)
be the order statistics of X1, . . . , Xn.
I The log-likelihood: ℓ(θ) = −∑_{i=1}^n |xi − θ| − n log 2.
I When n = 2m+ 1, θMLE = X(m+1) is the unique MLE.
I When n = 2m, θMLE = a for any a ∈ [X(m), X(m+1)]; the MLE is not unique.
I Thus, θMLE = med(Xi).
S(θ) = ∑_{i=1}^n |X(i) − θ|
Let
I1 = (−∞, X(1)]
Ij = [X(j−1), X(j)], 2 ≤ j ≤ n “closed interval”
In+1 = [X(n),∞)
(i) S(θ) ≥ S(X(1)) for all θ ∈ I1;
S(θ) ≥ S(X(n)) for all θ ∈ In+1.
(ii) S(θ1) > S(θ2) for all θ1 < θ2 in Ij+1 if j < n − j;
S(θ1) < S(θ2) for all θ1 < θ2 in Ij+1 if j > n − j;
S(θ1) = S(θ2) for all θ1 < θ2 in Ij+1 if j = n − j.
Example: Families of Double Exponential Distributions
Suppose that we observe a random sample X1, . . . , Xn from DE(µ, σ) for
unknown θ = (µ, σ) ∈ R × R+, which has density function
f(x; θ) = (1/(2σ)) exp(−|x − µ|/σ) · I_(−∞,∞)(x).
I The log-likelihood: ℓ(θ) = −∑_{i=1}^n |xi − µ|/σ − n log 2 − n log σ.
I Profiling approach: For each fixed σ, we know µ̂(σ) = med(xi) maximizes
ℓ(µ, σ), and it does not depend on σ. The profile likelihood equals
ℓ(med(xi), σ) = −∑_{i=1}^n |xi − med(xi)|/σ − n log 2 − n log σ,
which is uniquely maximized at σ̂ = n⁻¹∑_{i=1}^n |xi − med(xi)|.
I θMLE = (med(Xi), n⁻¹∑_{i=1}^n |Xi − med(Xi)|).
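The profiled MLE has a fully closed form: the sample median paired with the mean absolute deviation about it. A small sketch in Python (the data values are hypothetical):

```python
def laplace_mle(xs):
    """MLE of (mu, sigma) in DE(mu, sigma): sample median and mean |x_i - median|."""
    ys = sorted(xs)
    n = len(ys)
    if n % 2 == 1:
        med = ys[n // 2]
    else:
        # any point of the middle interval is an MLE; take its midpoint
        med = 0.5 * (ys[n // 2 - 1] + ys[n // 2])
    sigma = sum(abs(x - med) for x in xs) / n
    return med, sigma

mu_hat, sigma_hat = laplace_mle([1.0, 2.0, 4.0, 10.0])  # -> (3.0, 2.75)
```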
Example: Normal Families of Distributions
Suppose that we observe a random sample X1, . . . , Xn from N(µ, 1) for
unknown µ ∈ R.
I The log-likelihood:
ℓ(µ) = −(n/2) log(2π) − (1/2)∑_{i=1}^n (xi − x̄)² − (n/2)(x̄ − µ)².
I The MLE of η = |µ| is ηMLE = |X̄|.
Example: Normal Families of Distributions
Suppose that we observe a random sample X1, . . . , Xn from N(µ, σ2) for
unknown θ = (µ, σ2) ∈ R× R+.
I The log-likelihood:
ℓ(θ) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²))∑_{i=1}^n (xi − x̄)² − (n/(2σ²))(x̄ − µ)².
I Profiling approach: For each fixed σ², µ̂(σ²) = x̄ maximizes ℓ(µ, σ²), and it
does not depend on σ². The profile likelihood equals
ℓ(x̄, σ²) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²))∑_{i=1}^n (xi − x̄)²,
which is uniquely maximized at σ̂² = n⁻¹∑_{i=1}^n (xi − x̄)². Thus,
θMLE = (X̄, n⁻¹∑_{i=1}^n (Xi − X̄)²).
Consistency of MLE: Some Examples
I Let X1, . . . , Xn be a random sample from (i) Poisson(θ), θ ∈ Θ = [0,∞);
(ii) Bernoulli(θ), θ ∈ Θ = [0, 1]; (iii) Exp(θ), θ ∈ Θ = (0,∞). Then,
θMLE = X̄ → θ in Pθ-probability as n → ∞ for all θ ∈ Θ, by the WLLN.
I Let X1, . . . , Xn be a random sample from a population with a distribution
function F that has a density f with respect to the Lebesgue measure
(continuous type). It can be shown that
√n(med(Xi) − F⁻¹(1/2)) →d N(0, 1/[4f(F⁻¹(1/2))²]),
so that med(Xi) →p F⁻¹(1/2) for all continuous-type F. [Use the normal
approximation of the Binomial distribution.]
Consistency of MLE: Some Examples
I Let X1, . . . , Xn be a random sample from DE(θ, 1), θ ∈ R. Then, by the
consistency of the sample median as an estimator of the population
median,
θMLE = med(Xi) → θ in Pθ-probability as n → ∞ for all θ ∈ R.
I Let X1, . . . , Xn be a random sample from U [0, θ], θ ∈ (0,∞). In this case
θMLE = X(n). Since
Eθ(X(n)) = (n/(n + 1))·θ → θ and varθ(X(n)) = (n/((n + 2)(n + 1)²))·θ² → 0,
we get that θMLE → θ in Pθ-probability as n → ∞ for all θ > 0.
Kullback-Leibler Divergence
I Kullback-Leibler divergence: Let P = {Pθ : θ ∈ Θ} be a statistical model
for an observation X. Let f(·; θ) denote the density function of Pθ. The
Kullback-Leibler divergence (of Pθ from Pθ0) is defined by
KL(θ, θ0) = −Eθ0(log f(X; θ)/f(X; θ0)).
I Identifiability of θ: Pθ = Pθ0 implies θ = θ0. (This has been assumed to
hold in the estimation of θ so far!)
I Assume θ in P is identifiable and Pθ have common support, i.e.,
{x : f(x; θ) > 0} does not depend on θ ∈ Θ. Then,
KL(θ, θ0) ≥ 0 and KL(θ, θ0) = 0 if and only if θ = θ0.
[Use the inequality 1 + log z ≤ z for all z > 0 with “=” iff z = 1]
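For instance, for two Bernoulli distributions the divergence has a closed form, and the stated properties can be checked numerically (a sketch, not from the text):

```python
import math

def kl_bernoulli(theta, theta0):
    """KL(theta, theta0) = -E_{theta0} log[f(X; theta)/f(X; theta0)] for Bernoulli."""
    return (theta0 * math.log(theta0 / theta)
            + (1.0 - theta0) * math.log((1.0 - theta0) / (1.0 - theta)))

# Zero iff theta == theta0, strictly positive otherwise:
assert kl_bernoulli(0.5, 0.5) == 0.0
assert kl_bernoulli(0.3, 0.5) > 0.0 and kl_bernoulli(0.7, 0.5) > 0.0
```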
Kullback-Leibler Divergence and MLE
Let X1, . . . , Xn be a random sample from Pθ in P = {Pθ : θ ∈ Θ}. Let f(·; θ)
denote the density function of Pθ. Assume θ in Θ is identifiable and Pθ have
common support. For a fixed θ0, define
Dn(θ) = −n⁻¹∑_{i=1}^n log f(Xi; θ)/f(Xi; θ0),  D0(θ) = KL(θ, θ0).
I θ0 is the unique minimizer of D0(θ) over Θ.
I θMLE is a minimizer of Dn(θ) when it exists.
I By the WLLN, Dn(θ) → D0(θ) in Pθ0-probability for all θ ∈ Θ.
I Does θMLE, the minimizer of Dn(θ), converge to θ0, the minimizer of
D0(θ), in Pθ0-probability?
Kullback-Leibler Divergence and MLE
I In general, for a sequence of random functions Gn and a function G0
defined on Θ, the minimizer of Gn(θ) over Θ converges in probability to
the minimizer of G0(θ) over Θ if Θ is compact (bounded and closed), G0
is continuous on Θ, the minimizer of G0(θ) is unique and
sup_{θ∈Θ} |Gn(θ) − G0(θ)| → 0 in probability.
I Thus, if Θ is compact, D0(θ) is continuous on Θ for all θ0 ∈ Θ and
sup_{θ∈Θ} |Dn(θ) − D0(θ)| → 0 in Pθ0-probability for all θ0 ∈ Θ,
then, θMLE is consistent.
Kullback-Leibler Divergence and MLE
I Convexity Lemma: In general, for a sequence of random functions Gn and
a function G0 defined on Θ, if Gn is a sequence of convex functions and
Θ is a convex set, then the uniform convergence
sup_{θ∈Θ} |Gn(θ) − G0(θ)| → 0 in probability
is implied by the pointwise convergence
Gn(θ) − G0(θ) → 0 in probability for each θ ∈ Θ.
I MLE for a finite parameter space: Let P = {Pθ : θ ∈ Θ}, where Θ is
finite, i.e., Θ = {θ1, . . . , θk} for some k ≥ 1. Assume θ in P is identifiable
and Pθ have common support. Then, θMLE is consistent.
Sufficient Conditions for Consistency of MLE
I Suppose that we observe a random sample from Pθ in P = {Pθ : θ ∈ Θ}.
Assume that θ in Θ is identifiable and that the Pθ have common support.
Assume also that the likelihood is twice differentiable, approaches −∞
on the boundary of Θ, and ℓ̈(θ) is negative definite for all θ ∈ Θ. Then,
the MLE of θ, given by the unique solution of the likelihood equation
ℓ̇(θ) = 0, is consistent. [See Theorem 6.1.3 and Corollary 6.1.1 in the text]
I Logistic(θ, 1) example: The support of Pθ equals R, the log-likelihood
ℓ(θ) → −∞ as θ → ±∞ and ℓ̈(θ) < 0 for all θ. Thus, the MLE given by
the solution of ℓ̇(θ) = 0 is consistent.
6.2 Information Inequality and Efficiency
Basic Regularity Assumptions
Let X1, · · · , Xn be a random sample from a distribution with pdf
f(·; θ), θ ∈ Θ ⊂ Rd, with respect to a measure µ. (µ is either the Lebesgue
measure or the counting measure)
(R0) The parameter θ is identifiable in Θ.
(R1) The densities f(·; θ) have common support X.
(R2) The parameter space Θ is open in Rd.
(R3) The log-density log f(x; θ) is twice differentiable as a function of θ for all
x ∈ X .
(R4) For any statistic u(X1, . . . , Xn) with finite expectation, the integral
Eθ(u(X1, . . . , Xn)) = ∫_{Xⁿ} u(x1, . . . , xn) ∏_{i=1}^n f(xi; θ) dµ(x1) · · · dµ(xn)
is twice differentiable under the integral sign.
Fisher Information
I The derivative of the log-density log f(x; θ) as a function of θ is called the score
function. The larger the magnitude of the score function, the more
information one has about θ.
I Fisher information (that an observation of X1 has about θ) is defined by
I1(θ) ≡ IX1(θ) = varθ((∂/∂θ) log f(X1; θ)).
I Bartlett identity (first-order): Under (R0)–(R4), we get
Eθ((∂/∂θ) log f(X1; θ)) = 0.
I Bartlett identity (second-order): Under (R0)–(R4), we get
varθ((∂/∂θ) log f(X1; θ)) = Eθ(−(∂²/∂θ²) log f(X1; θ)).
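Both identities can be verified exactly for the Bernoulli(θ) density, whose score is x/θ − (1 − x)/(1 − θ); the expectations below are the exact two-point sums, not simulations (a sketch, not from the text):

```python
def bernoulli_score_moments(theta):
    """Exact E[score], var(score) = E[score^2], and E[-(d2/dtheta2) log f] for Bernoulli(theta)."""
    score = lambda x: x / theta - (1 - x) / (1 - theta)
    neg_hess = lambda x: x / theta ** 2 + (1 - x) / (1 - theta) ** 2
    pmf = {0: 1 - theta, 1: theta}
    mean_score = sum(p * score(x) for x, p in pmf.items())
    var_score = sum(p * score(x) ** 2 for x, p in pmf.items())
    mean_neg_hess = sum(p * neg_hess(x) for x, p in pmf.items())
    return mean_score, var_score, mean_neg_hess

m, v, h = bernoulli_score_moments(0.3)
# m is 0, and v and h both equal 1/(theta*(1-theta)) up to rounding:
# the two Bartlett identities, with I_1(theta) = 1/(theta*(1-theta)).
```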
Information Inequality (Cramer-Rao Lower Bound)
Let X1, · · · , Xn be a random sample from a distribution with pdf
f(·; θ), θ ∈ Θ ⊂ Rd, with respect to a measure µ, that satisfies (R0)–(R4).
Assume further that I1(θ) is invertible. Then, for any statistic
Un ≡ u(X1, . . . , Xn) with finite second moment, it holds that
varθ(Un) ≥ ((∂/∂θ)EθUnᵀ)ᵀ (nI1(θ))⁻¹ ((∂/∂θ)EθUnᵀ)
for all θ ∈ Θ. In case Un is multivariate, A ≥ B for matrices A and B should
be interpreted as: A − B is nonnegative definite.
Proof of Information Inequality
Let Vn = ∑_{i=1}^n (∂/∂θ) log f(Xi; θ). Consider first the case where Un is
univariate. For this case, we get
varθ(Un) ≥ max_{a≠0} (aᵀcovθ(Vn, Un))² / (aᵀvarθ(Vn)a)
= covθ(Vn, Un)ᵀ varθ(Vn)⁻¹ covθ(Vn, Un).
In case Un is multivariate, replace Un in the above inequality by tᵀUn and get
tᵀ varθ(Un) t ≥ tᵀ covθ(Vn, Un)ᵀ varθ(Vn)⁻¹ covθ(Vn, Un) t for all t.
The inequality then follows since varθ(Vn) = nI1(θ) and
covθ(Vn, Un) = Eθ(VnUnᵀ) = (∂/∂θ)EθUnᵀ,
due to the first-order Bartlett identity and (R4).
Lower Bound on Variance of Unbiased Estimator
Let X1, · · · , Xn be a random sample from a distribution with pdf
f(·; θ), θ ∈ Θ ⊂ Rd, with respect to a measure µ, that satisfies (R0)–(R4).
Assume further that I1(θ) is invertible. Consider the problem of estimating
g(θ) for a smooth function g, which may be a vector of real-valued functions.
Then, for any unbiased estimator η̂ of η ≡ g(θ) with finite second moment, it
holds that
varθ(η̂) ≥ ((∂/∂θ)g(θ)ᵀ)ᵀ (nI1(θ))⁻¹ ((∂/∂θ)g(θ)ᵀ)
for all θ ∈ Θ.
MLE and Information Bound: Poisson Model
Suppose that we observe a random sample X1, . . . , Xn from Poisson(θ) for
unknown θ ∈ Θ = (0,∞).
I Writing ℓ1(θ) ≡ log f(x; θ) = −θ + x log θ − log x!, we have
ℓ̇1(θ) = −1 + x/θ,  ℓ̈1(θ) = −x/θ².
I Thus, I1(θ) = θ⁻¹, so that for any unbiased estimator θ̂ of θ we get
varθ(θ̂) ≥ (nI1(θ))⁻¹ = θ/n for all θ > 0.
I The MLE θMLE = X̄ is unbiased and varθ(θMLE) = θ/n. Thus, θMLE has the
minimal variance among all unbiased estimators of θ; it is the so-called
Uniformly Minimum Variance Unbiased Estimator (UMVUE) of θ.
MLE and Information Bound: Bernoulli Model
Suppose that we observe a random sample X1, . . . , Xn from Bernoulli(θ) for
unknown θ ∈ Θ = (0, 1).
I We have
ℓ̇1(θ) = x/θ − (1 − x)/(1 − θ),  ℓ̈1(θ) = −x/θ² − (1 − x)/(1 − θ)².
I Thus, I1(θ) = θ⁻¹(1 − θ)⁻¹, so that for any unbiased estimator θ̂ of θ we
get
varθ(θ̂) ≥ (nI1(θ))⁻¹ = θ(1 − θ)/n for all 0 < θ < 1.
I The estimator X̄, which is the MLE when 0 < X̄ < 1, is unbiased with
varθ(X̄) = θ(1 − θ)/n. Thus, X̄ is the UMVUE of θ.
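That the sample mean attains the bound can also be checked by simulation: the empirical variance of X̄ over many replications should match θ(1 − θ)/n. A sketch using only the standard library (the sample size, replication count, and seed are arbitrary choices):

```python
import random

def var_of_sample_mean(theta, n, reps, seed=0):
    """Empirical variance of the Bernoulli(theta) sample mean over `reps` replications."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        s = sum(1 for _ in range(n) if rng.random() < theta)
        means.append(s / n)
    m = sum(means) / reps
    return sum((v - m) ** 2 for v in means) / (reps - 1)

v = var_of_sample_mean(theta=0.3, n=50, reps=20000)
bound = 0.3 * 0.7 / 50   # Cramer-Rao lower bound (n I_1(theta))^{-1}
# v agrees with `bound` up to Monte Carlo error of a few percent.
```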
MLE and Information Bound: Bernoulli Model
Now, consider the estimation of η = θ(1− θ).
I Since ∂η/∂θ = 1 − 2θ, we get that, for any unbiased estimator η̂ of η,
varθ(η̂) ≥ (1 − 2θ)²(nI1(θ))⁻¹ = θ(1 − θ)(1 − 2θ)²/n for all 0 < θ < 1.
I For the estimator η̂ = X̄(1 − X̄), which is the MLE of η when 0 < X̄ < 1,
Eθ(η̂) = ((n − 1)/n)·θ(1 − θ) = ((n − 1)/n)·η,
varθ(η̂) = n⁻¹θ(1 − θ)(1 − 2θ)² + o(n⁻¹),
with o(n⁻¹) = 5n⁻²θ²(1 − θ)(3 − θ) + n⁻³θ(1 − θ)(6θ² − 6θ + 1).
I The bias-corrected estimator nη̂/(n − 1) is unbiased and has the variance
(n/(n − 1)²)·θ(1 − θ)(1 − 2θ)² > θ(1 − θ)(1 − 2θ)²/n
unless θ = 1/2, but its relative efficiency to the CR lower bound
approaches one as n → ∞.
MLE and Information Bound: Beta Model
Suppose that we observe a random sample X1, . . . , Xn from Beta(θ, 1) for
unknown θ > 0.
I We have ℓ̇1(θ) = 1/θ + log x and ℓ̈1(θ) = −1/θ².
I Thus, I1(θ) = 1/θ², so that for any unbiased estimator θ̂ of θ we get
varθ(θ̂) ≥ (nI1(θ))⁻¹ = θ²/n for all θ > 0.
I θMLE = −n/(∑_{i=1}^n log Xi), and using the fact that −θ log Xi are i.i.d.
Exp(1), we get
Eθ(θMLE) = (n/(n − 1))·θ,  varθ(θMLE) = (n²/((n − 1)²(n − 2)))·θ².
I Again, the unbiased estimator (n− 1)θMLE/n has the variance θ2/(n− 2)
that is strictly larger than the CR lower bound, but its relative efficiency to
the lower bound converges to one as n→∞.
MLE and Information Bound: Normal Model
Suppose that we observe a random sample X1, . . . , Xn from N(µ, σ2) for
unknown θ = (µ, σ2) ∈ R× R+.
I The derivatives of the log-density:
ℓ̇1(θ) = ( (x − µ)/σ², (x − µ)²/(2σ⁴) − 1/(2σ²) )ᵀ,
ℓ̈1(θ) = [ −1/σ², −(x − µ)/σ⁴ ; −(x − µ)/σ⁴, −(x − µ)²/σ⁶ + 1/(2σ⁴) ].
I The Fisher information about θ from a single observation:
I1(θ) = Eθ(−ℓ̈1(θ)) = [ 1/σ², 0 ; 0, 1/(2σ⁴) ].
MLE and Information Bound: Normal Model
I For any unbiased estimators µ̂ and σ̂² of µ and σ²,
varθ(µ̂) ≥ (1, 0)(nI1(θ))⁻¹(1, 0)ᵀ = σ²/n,
varθ(σ̂²) ≥ (0, 1)(nI1(θ))⁻¹(0, 1)ᵀ = 2σ⁴/n.
I For the unbiased estimators X̄ and S² = ∑_{i=1}^n (Xi − X̄)²/(n − 1) of µ
and σ², respectively, we get
varθ(X̄) = σ²/n,  varθ(S²) = 2σ⁴/(n − 1).
I Thus, X̄ achieves the minimal variance, and S² has a variance slightly
larger than the lower bound. Later we will see that S² also achieves the
minimal variance among all unbiased estimators. This means that the CR
lower bound may not be sharp!
MLE and Information Bound: Normal Model
I The lower bounds for µ and σ² do not change under the sub-models where
one of µ and σ² is known (given). The main reason is that the covariance
of the score functions, i.e., the off-diagonal term of I1(θ), equals zero.
I In general, for a nonsingular symmetric matrix
A = [ A11, A12 ; A21, A22 ],
it holds that
(A⁻¹)11 = A11⁻¹ + A11⁻¹A12(A22 − A21A11⁻¹A12)⁻¹A21A11⁻¹ ≥ A11⁻¹,
and the equality holds if A12 = 0.
Asymptotic Efficiency
I Unbiased estimators: Typically, the CR lower bound for unbiased
estimators is asymptotically sharp. Thus, the asymptotic efficiency of an
unbiased estimator η̂ of a real-valued parameter η = g(θ) is defined by
eff(η̂) = lim_{n→∞} ((∂/∂θ)g(θ))ᵀ(nI1(θ))⁻¹((∂/∂θ)g(θ)) / varθ(η̂).
I For estimators η̂ with varθ(η̂) ≥ Cn⁻¹ for some constant C > 0 and
(∂/∂θ)bias_η̂(θ) → 0 as n → ∞, we get
varθ(η̂) ≥ ((∂/∂θ)g(θ) + o(1))ᵀ(nI1(θ))⁻¹((∂/∂θ)g(θ) + o(1))
= ((∂/∂θ)g(θ))ᵀ(nI1(θ))⁻¹((∂/∂θ)g(θ)) + o(n⁻¹),
where bias_η̂(θ) = Eθ(η̂) − η. Thus, the asymptotic efficiency eff(η̂) in
terms of variance defined above may be extended to these estimators.
Asymptotic Efficiency
I Lower bound in terms of mean squared error:
MSEθ(η̂) ≥ bias_η̂(θ)² + ((∂/∂θ)g(θ))ᵀ(nI1(θ))⁻¹((∂/∂θ)g(θ)) + o(n⁻¹).
I For those estimators with varθ(η̂) ≥ Cn⁻¹ for some constant C > 0,
(∂/∂θ)bias_η̂(θ) → 0 as n → ∞ and bias_η̂(θ) = o(√varθ(η̂)), it follows
that
(lower bound for MSE)/MSEθ(η̂) = (eff(η̂) + o(1))/(1 + o(1)) = eff(η̂) + o(1).
Thus, eff(η̂) may also serve as a definition of the asymptotic efficiency in
terms of mean squared error for these estimators.
Examples: Asymptotic Efficiency of MLE
I Bernoulli model with Θ = (0, 1): Note first that
Pθ(θMLE = X̄) ≥ Pθ(0 < X̄ < 1) → 1 for all θ ∈ (0, 1). For the MLE
η̂ = X̄(1 − X̄) of η = θ(1 − θ), we get
bias_η̂(θ) = −θ(1 − θ)/n,  varθ(η̂) = θ(1 − θ)(1 − 2θ)²/n + o(n⁻¹).
Thus, eff(η̂) = 1 unless θ = 1/2, both in terms of variance and mean
squared error.
I Beta model with Θ = (0, ∞): For the MLE θ̂ = −n/(∑_{i=1}^n log Xi) of θ,
we have
bias_θ̂(θ) = θ/(n − 1),  varθ(θ̂) = θ²/n + o(n⁻¹).
Thus, eff(θ̂) = 1 both in terms of variance and mean squared error.
Fisher’s Wrong Conjecture
(1) √n(θMLE − θ) →d N(0, I1(θ)⁻¹).
(2) For any estimator θ̃ such that √n(θ̃ − θ) →d N(0, v(θ)), it follows that
v(θ) ≥ I1(θ)⁻¹ for all θ ∈ Θ. (R. A. Fisher, 1925)
I The conjecture (1) is true under some regularity conditions on the model,
while (2) may not be true even under the regularity conditions on the
model. (J. L. Hodges, Jr., 1951)
I Among regular estimators of θ the conjecture (2) was found to be true
under some regularity conditions on the model. (L. Le Cam, 1953)
I Asymptotically efficient estimator: If there exists a regular estimator θ̂
such that √n(θ̂ − θ) →d N(0, I1(θ)⁻¹), then the estimator is said to be
asymptotically efficient.
Further Regularity Assumptions
Let X1, · · · , Xn be a random sample from a distribution with pdf
f(·; θ), θ ∈ Θ ⊂ Rd, with respect to a measure µ. Under the following
assumptions in addition to (R0)–(R4) we may show that the MLE (precisely,
the solution of the likelihood equation) is asymptotically normal with mean
zero and variance I1(θ)−1.
(R5) For all θ0 ∈ Θ, there exist δ0 > 0 and M(·) with Eθ0 M(X1) < ∞ such
that
max_{θ:|θ−θ0|≤δ0} |(∂³/∂θi∂θj∂θk) log f(X; θ)| ≤ M(X).
(R6) The likelihood equation ℓ̇(θ) = 0 has a unique solution θ̂, and the
solution is a consistent estimator of θ.
(R7) The Fisher information I1(θ) exists and is non-singular for all θ ∈ Θ.
Asymptotic Normality of MLE
I Theorem: Under the assumptions (R0)–(R7),
√n(θ̂ − θ) →d N(0, I1(θ)⁻¹) under Pθ for all θ ∈ Θ.
I Proof of the theorem: For simplicity, consider only the case Θ ⊂ R, i.e.,
d = 1. The multi-dimensional extension only involves more complexity in
notation. Let Sj(θ) denote the jth derivative of the scaled log-likelihood
n⁻¹ℓ(θ). The theorem follows immediately from the facts that, for any
θ0 ∈ Θ,
(1) 0 = S1(θ0) + (S2(θ0) + (θ̂ − θ0)S3(θ*)/2)(θ̂ − θ0),
(2) S3(θ*) = Op(1) under Pθ0,
(3) S2(θ0) = −I1(θ0) + op(1) under Pθ0,
(4) √n S1(θ0) →d N(0, I1(θ0)) under Pθ0.
Asymptotic Normality of MLE
The fact (1) is simply a Taylor expansion, (3) follows from the WLLN, and (4)
from the CLT. For the proof of (2), we observe
lim sup_{n→∞} Pθ0(|S3(θ*)| > C)
≤ lim sup_{n→∞} Pθ0(|S3(θ*)| > C, |θ̂ − θ0| ≤ δ0)
≤ lim sup_{n→∞} Pθ0(n⁻¹∑_{i=1}^n M(Xi) > C)
≤ C⁻¹Eθ0 M(X1) → 0 as C → ∞.
Asymptotic Normality of MLE
I Asymptotic linearity of MLE: Under the assumptions (R0)–(R7),
√n(θ̂ − θ) = I1(θ)⁻¹√n S1(θ) + op(1) under Pθ for all θ ∈ Θ.
I Function of θ: Let η = g(θ) for a smooth function g and let η̂ = g(θ̂) for θ̂
in (R6). Under the assumptions (R0)–(R7), it follows that
√n(η̂ − g(θ)) →d N(0, ġ(θ)ᵀI1(θ)⁻¹ġ(θ)) under Pθ for all θ ∈ Θ,
where ġ(θ) = (∂/∂θ)g(θ)ᵀ.
Examples: Asymptotic Normality of MLE
I Bernoulli model with Θ = (0, 1): Recall I1(θ) = θ⁻¹(1 − θ)⁻¹ and that
Pθ(θMLE = X̄) ≥ Pθ(0 < X̄ < 1) → 1 for all θ ∈ (0, 1).
By the CLT we get √n(θMLE − θ) →d N(0, θ(1 − θ)).
I Poisson model with Θ = (0, ∞): Again,
Pθ(θMLE = X̄) ≥ Pθ(X̄ > 0) → 1 for all θ > 0.
Recall I1(θ) = θ⁻¹. By the CLT we get √n(θMLE − θ) →d N(0, θ).
I Exponential model with Θ = (0, ∞): Recall θMLE = X̄ and note
I1(θ) = θ⁻². By the CLT, √n(θMLE − θ) →d N(0, θ²).
Examples: Asymptotic Normality of MLE
I Beta model with Θ = (0, ∞): Recall that −1/θMLE = n⁻¹∑_{i=1}^n log Xi
and I1(θ) = θ⁻². By the CLT, since Eθ(log X1) = −θ⁻¹ and
varθ(log X1) = θ⁻²,
√n(n⁻¹∑_{i=1}^n log Xi + θ⁻¹) →d N(0, θ⁻²).
Thus, by the delta method, √n(θMLE − θ) →d N(0, θ²).
I Double exponential model DE(θ, 1) with Θ = R: Note that θ is the
median of the distribution Pθ, f(θ; θ) = 1/2 and θMLE = med(Xi), so
that √n(θMLE − θ) →d N(0, 1) (apply the asymptotic normality of the
sample median given earlier).
Examples: Asymptotic Normality of MLE
I Double exponential model (continued): The density
f(x; θ) = e^{−|x−θ|}/2 is not differentiable (pointwise) as a function of θ, but
it is differentiable in the L2 sense (Fréchet differentiability), with derivative
(∂/∂θ)f(x; θ) = sgn(x − θ)e^{−|x−θ|}/2:
lim_{h→0} ∫_{−∞}^{∞} ( h⁻¹(e^{−|x−θ−h|}/2 − e^{−|x−θ|}/2) − sgn(x − θ)e^{−|x−θ|}/2 )² dx = 0.
With this derivative, I1(θ) ≡ 1.
I Remark: Indeed, the notion of Fisher information generalizes to models
with Fréchet differentiability, and the regularity conditions (R0)–(R7) may
be relaxed based on this generalization.
Examples: Asymptotic Normality of MLE
I Logistic(θ, 1) model with Θ = R: We know that θMLE exists and is
unique, but we do not have its explicit form. One may check that all the
regularity conditions (R0)–(R7) are met, so that we have
√n(θMLE − θ) →d N(0, I1(θ)⁻¹). Now,
ℓ̇1(θ) = (1 − e^{−(x−θ)})/(1 + e^{−(x−θ)}),
I1(θ) ≡ ∫_{−∞}^{∞} ((1 − e⁻ˣ)/(1 + e⁻ˣ))² · e⁻ˣ/(1 + e⁻ˣ)² dx = 1/3.
Thus, we conclude that √n(θMLE − θ) →d N(0, 3).
I Uniform(0, θ) model with θ > 0 (non-regular model): Note that
θMLE = X(n), and in this case
n(θMLE − θ) →d −Exp(θ).
Examples: Asymptotic Normality of MLE
I Multinomial model: Let X1, . . . , Xn be i.i.d. d-dimensional observations
from Multinomial(p1, . . . , pd), 0 < pj < 1, ∑_{j=1}^d pj = 1. Write
X = (X1, . . . , Xd)ᵀ and x = (x1, . . . , xd)ᵀ for a realization x of Xi, and
θ = (p1, . . . , pd−1)ᵀ. Note that
θMLE = (X̄1, . . . , X̄d−1)ᵀ
with probability tending to one, under Pθ for all θ ∈ Θ. By the CLT,
√n(θMLE − θ) →d N(0, diag(θj) − θθᵀ).
Observe that
ℓ̇1(θ; x) = (x1/θ1 − xd/θd, . . . , xd−1/θd−1 − xd/θd)ᵀ,
ℓ̈1(θ; x) = −diag(xj/θj²) − (xd/θd²)·11ᵀ,
I1(θ) = diag(1/θj) + (1/θd)·11ᵀ,  I1(θ)⁻¹ = diag(θj) − θθᵀ.
Asymptotic Relative Efficiency
I For two estimators θ̂1 and θ̂2 of θ that are asymptotically normal,
√n(θ̂j − θ) →d N(0, vj(θ)), the asymptotic relative efficiency of θ̂1
against θ̂2 is defined by
ARE(θ̂1, θ̂2) = v2(θ)/v1(θ).
I MLE against sample mean or sample median: When f(x; θ) = f0(x − θ),

f0               N(0, 1)   Logistic(0, 1)   DE(0, 1)
against mean     1         π²/9             2
against median   π/2       4/3              1

For the logistic model, we have used the fact ∑_{k=1}^∞ k⁻² = π²/6, so that
∫_{−∞}^{∞} x² e⁻ˣ(1 + e⁻ˣ)⁻² dx = 4∫_0^∞ x e⁻ˣ(1 + e⁻ˣ)⁻¹ dx
= 4∑_{k=1}^∞ (−1)^{k−1} ∫_0^∞ x e^{−kx} dx = 4∑_{k=1}^∞ (−1)^{k−1}/k² = π²/3.
One-Step Approximation of MLE
We have seen that the MLE may exist but may not have a closed form when the
likelihood equation is nonlinear. In that case, we have learned that the
Newton-Raphson method may be used to approximate the MLE. The following
theorem tells us that, if the initial choice, say θ[0], is well located, precisely,
if it lies in an n⁻¹/²-neighborhood of the true θ in probability, then the
one-step update of the Newton-Raphson iteration is enough, i.e., the estimator
θ[1] = θ[0] − ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])
has the same first-order properties as the MLE.
One-Step Approximation of MLE
I Theorem: Assume the assumptions (R0)–(R7) and that
θ[0] = θ + Op(n⁻¹/²) under Pθ for all θ ∈ Θ. Then,
√n(θ[1] − θ) →d N(0, I1(θ)⁻¹) under Pθ for all θ ∈ Θ.
I Proof of the theorem: For simplicity again, consider the case Θ ⊂ R. Let
Sj(θ) be the jth derivative of the scaled log-likelihood n⁻¹ℓ(θ). Then,
√n(θ[1] − θ) = √n(θ[0] − θ) + (−S2(θ[0]))⁻¹·√n S1(θ[0])
= √n(θ[0] − θ) + (−S2(θ) − S3(θ*)(θ[0] − θ))⁻¹
·(√n S1(θ) + S2(θ)√n(θ[0] − θ) + S3(θ**)√n(θ[0] − θ)²/2).
One-Step Approximation of MLE
I Proof of the theorem (continued): Since S3(θ*), S3(θ**) and √n(θ[0] − θ)
are all Op(1) (thus θ[0] − θ = op(1)), and S2(θ) = −I1(θ) + op(1), we
obtain
√n(θ[1] − θ) = √n(θ[0] − θ) + (I1(θ) + op(1))⁻¹
·(√n S1(θ) − I1(θ)√n(θ[0] − θ) + op(1))
= I1(θ)⁻¹·√n S1(θ) + op(1) →d N(0, I1(θ)⁻¹)
under Pθ for all θ ∈ Θ.
One-step Approximation of MLE: Logistic Model
Suppose that we observe a random sample X1, . . . , Xn from Logistic(θ, 1) for
unknown θ ∈ R. We have
ℓ(θ) = −n X̄ + nθ − 2∑_{i=1}^n log(1 + exp(−(Xi − θ))),
ℓ̇(θ) = n − 2∑_{i=1}^n exp(−(Xi − θ))(1 + exp(−(Xi − θ)))⁻¹,
ℓ̈(θ) = −2∑_{i=1}^n exp(−(Xi − θ))(1 + exp(−(Xi − θ)))⁻².
With θ[0] = X̄, which is √n-consistent, the one-step estimator
θ[1] = θ[0] − ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])
is an asymptotically efficient estimator of θ.
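A sketch of this one-step estimator in Python, evaluating the score and Hessian formulas above at the sample mean (the data values are hypothetical; each term exp(−(x − θ))/(1 + exp(−(x − θ))) is computed as 1/(1 + exp(x − θ))):

```python
import math

def score(theta, xs):
    # n - 2 * sum_i exp(-(x_i - theta)) / (1 + exp(-(x_i - theta)))
    return len(xs) - 2.0 * sum(1.0 / (1.0 + math.exp(x - theta)) for x in xs)

def hessian(theta, xs):
    # -2 * sum_i p_i * (1 - p_i), where p_i = exp(-(x_i - theta)) / (1 + exp(-(x_i - theta)))
    total = 0.0
    for x in xs:
        p = 1.0 / (1.0 + math.exp(x - theta))
        total += p * (1.0 - p)
    return -2.0 * total

xs = [0.0, 0.5, 2.0]           # hypothetical observations
theta0 = sum(xs) / len(xs)     # sqrt(n)-consistent initial estimator (sample mean)
theta1 = theta0 - score(theta0, xs) / hessian(theta0, xs)   # one Newton-Raphson step
```

On this sample a single update already drives the score close to zero, which is the practical content of the one-step theorem.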
One-step Approximation of MLE: Gamma Model
Suppose that we observe a random sample X1, . . . , Xn from Gamma(α, β) for
unknown α, β > 0. Let θ = (α, β)>.
I Finding a √n-consistent estimator of θ: Try the MME. Since Eθ(X1) = αβ and
varθ(X1) = αβ², we get
α̂MME = X̄² / (n⁻¹∑_{i=1}^n (Xi − X̄)²),
β̂MME = (n⁻¹∑_{i=1}^n (Xi − X̄)²) / X̄.
It can be shown that θ̂MME = θ + Op(n⁻¹/²) under Pθ for all θ ∈ Θ.
I One-step MLE: With θ[0] = θ̂MME, the one-step estimator
θ[1] = θ[0] − ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])
is an asymptotically efficient estimator of θ.
6.3 Maximum Likelihood Tests
Idea of Maximum Likelihood Test
Suppose that we observe a random sample X1, . . . , Xn from a distribution Pθ
with pdf f(·, θ), θ ∈ Θ.
I The problem: Testing
H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1
at a significance level 0 < α < 1, where Θ0 is a subset of Θ and
Θ1 = Θ \Θ0.
I Fisher’s idea: Compare the maximum likelihoods in Θ0 and Θ1. For a
given observation x ≡ (x1, . . . , xn) of X ≡ (X1, . . . , Xn), reject H0 if
max_{θ∈Θ1} L(θ; x) / max_{θ∈Θ0} L(θ; x) ≥ c
for a critical value c that is determined by the level α as follows.
Idea of Maximum Likelihood Test
I Determination of critical value: Choose c such that (the size of the
test)=α, i.e.,
sup_{θ∈Θ0} Pθ( max_{θ∈Θ1} L(θ; X) / max_{θ∈Θ0} L(θ; X) > c ) = α.
I The maximization of the likelihood in a restricted set, such as Θ1, is more
involved than the maximization in Θ.
I Note that, for c > 1,
R0 ≡ max_{θ∈Θ1} L(θ; X) / max_{θ∈Θ0} L(θ; X) ≥ c
⇔ R ≡ max_{θ∈Θ} L(θ; X) / max_{θ∈Θ0} L(θ; X)
= max{ 1, max_{θ∈Θ1} L(θ; X) / max_{θ∈Θ0} L(θ; X) } ≥ c.
If there exists c > 0 such that sup_{θ∈Θ0} Pθ(R ≥ c) = α, then c > 1 since
R ≥ 1 always and α < 1. This means that the LRT may be based on R,
rather than R0.
Likelihood Ratio Test
Let θ̂Θ and θ̂Θ0 denote the MLEs in Θ and Θ0, respectively. The likelihood
ratio test (LRT), for H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1, rejects H0 when
max_{θ∈Θ} L(θ; x) / max_{θ∈Θ0} L(θ; x) ≥ c,
or equivalently when
2(ℓ(θ̂Θ; x) − ℓ(θ̂Θ0; x)) ≥ c′,
where c and c′ are determined by the given level α.
Remark: When the LR test statistic, 2(ℓ(θ̂Θ; X) − ℓ(θ̂Θ0; X)), has a
distribution on a discrete set, there may not exist c′ such that (the size of the
test) = α. In that case one may randomize to meet the level condition.
Examples of LRT: Normal Mean
Suppose that we observe a random sample X1, . . . , Xn from N(θ, σ2) with
known σ2, θ ∈ R, and want to test H0 : θ = θ0 against H1 : θ 6= θ0 at a level
0 < α < 1.
I Likelihood:
ℓ(θ) = −(n/2) log(2πσ²) − (1/(2σ²))∑_{i=1}^n (xi − x̄)² − (n/(2σ²))(x̄ − θ)².
I Maximization of the likelihood: θ̂Θ = x̄ and θ̂Θ0 = θ0.
I Likelihood ratio:
2(ℓ(θ̂Θ; x) − ℓ(θ̂Θ0; x)) = 2(ℓ(x̄; x) − ℓ(θ0; x)) = (n/σ²)(x̄ − θ0)²
= ((x̄ − θ0)/(σ/√n))².
Examples of LRT: Normal Mean
I Rejection region: Reject H0 if∣∣∣∣ x− θ0
σ/√n
∣∣∣∣ ≥ c.I Determination of critical value c:
Pθ0
(∣∣∣∣ X − θ0
σ/√n
∣∣∣∣ ≥ zα/2) = α.
I Power function:
γ(θ) ≡ Pθ( |(X̄ − θ0)/(σ/√n)| ≥ zα/2 )
= Φ(−zα/2 − (θ − θ0)/(σ/√n)) + Φ(−zα/2 + (θ − θ0)/(σ/√n)).
I The power function is symmetric about θ0, takes the minimal value α at
θ = θ0 and increases to one as θ gets away from θ0.
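The power function is easy to evaluate numerically, since Φ can be written in terms of the error function. A small Python sketch (θ0 = 1, σ = 2, n = 25 and α = 0.05 are illustrative assumptions):

```python
import math

def Phi(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(theta, theta0, sigma, n, z_half):
    # gamma(theta) = Phi(-z - d) + Phi(-z + d), d = (theta - theta0)/(sigma/sqrt(n))
    d = (theta - theta0) / (sigma / math.sqrt(n))
    return Phi(-z_half - d) + Phi(-z_half + d)

z_025 = 1.959963984540054                 # z_{alpha/2} for alpha = 0.05
size = power(1.0, 1.0, 2.0, 25, z_025)    # value of the power function at theta = theta0
```

The computed curve exhibits exactly the properties stated above: size α at θ = θ0, symmetry about θ0, and growth as θ moves away from θ0.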
67/95
Examples of LRT: Exponential Mean
Suppose that we observe a random sample X1, . . . , Xn from Exp(θ) with mean
θ > 0, and want to test H0 : θ = θ0 against H1 : θ ≠ θ0 at a level 0 < α < 1.
I Likelihood: ℓ(θ) = −n log θ − nx̄/θ.
I Maximization of likelihood: θ̂Θ = x̄ and θ̂Θ0 = θ0.
I Likelihood ratio:
2(ℓ(θ̂Θ;x) − ℓ(θ̂Θ0;x)) = 2n(x̄/θ0 − log(x̄/θ0) − 1).
I Rejection region:
x̄/θ0 ≤ c1 or x̄/θ0 ≥ c2, with c1 − log c1 = c2 − log c2.
(Figure: the graph of u ↦ u − log u, with the two cutoffs c1 < 1 < c2 at a common height.)
68/95
Examples of LRT: Exponential Mean
I Determination of critical value: Since 2∑_{i=1}^n Xi/θ0 ∼ χ²(2n) under Pθ0,
we choose c1 and c2 such that
c1 − log c1 = c2 − log c2 and ∫_{2nc1}^{2nc2} pdfχ²(2n)(y) dy = 1 − α.
I Approximation of c1 and c2:
χ²_{1−α/2}(2n)/(2n) = 1 − zα/2/√n + o(n^{−1/2}),
χ²_{α/2}(2n)/(2n) = 1 + zα/2/√n + o(n^{−1/2}).
I Power function:
γ(θ) ≡ 1 − Pθ( c1 ≤ X̄/θ0 ≤ c2 ) = F2n(2nc1θ0/θ) + F̄2n(2nc2θ0/θ),
where F2n(·) is the cdf of χ²(2n) and F̄2n = 1 − F2n.
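Because the degrees of freedom 2n are even, the χ²(2n) cdf has the closed form 1 − e^{−x/2}∑_{j<n}(x/2)^j/j!, so c1 and c2 can be found by nested bisection without any special-function library. A sketch solving exactly the two displayed conditions (n = 10 and α = 0.05 are illustrative):

```python
import math

def chi2_cdf_even(x, df):
    # chi-square cdf for even df: 1 - exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
    m = df // 2
    s, term = 0.0, 1.0
    for j in range(m):
        if j > 0:
            term *= (x / 2) / j
        s += term
    return 1.0 - math.exp(-x / 2) * s

def match_c2(c1):
    # find c2 > 1 with c2 - log c2 = c1 - log c1, by bisection
    target = c1 - math.log(c1)
    lo, hi = 1.0, 2.0
    while hi - math.log(hi) < target:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid - math.log(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def critical_values(n, alpha):
    # bisect on c1 in (0, 1) until F(2n c2) - F(2n c1) = 1 - alpha
    lo, hi = 1e-8, 1.0 - 1e-8
    for _ in range(100):
        c1 = (lo + hi) / 2
        c2 = match_c2(c1)
        cover = chi2_cdf_even(2 * n * c2, 2 * n) - chi2_cdf_even(2 * n * c1, 2 * n)
        if cover > 1 - alpha:   # interval too wide: move c1 toward 1
            lo = c1
        else:
            hi = c1
    return c1, c2

c1, c2 = critical_values(10, 0.05)
```

Both defining conditions hold for the returned pair, with c1 < 1 < c2 as the picture of u − log u suggests.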
69/95
Examples of LRT: Normal Mean With Unknown Variance
Suppose that we observe a random sample X1, . . . , Xn from N(µ, σ²),
θ = (µ, σ²) ∈ R × R+, and want to test H0 : µ = µ0 against H1 : µ ≠ µ0 at a
level 0 < α < 1.
I Likelihood:
ℓ(θ) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^n (xi − µ)².
I Maximization of likelihood: θ̂Θ = (x̄, σ̂²) and θ̂Θ0 = (µ0, σ̂0²), where
σ̂² = n⁻¹∑_{i=1}^n (xi − x̄)² and σ̂0² = n⁻¹∑_{i=1}^n (xi − µ0)².
I Likelihood ratio: Since σ̂0² = σ̂² + (x̄ − µ0)², we get
2(ℓ(θ̂Θ;x) − ℓ(θ̂Θ0;x)) = n log( 1 + (x̄ − µ0)²/σ̂² ).
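Both the decomposition σ̂0² = σ̂² + (x̄ − µ0)² and the equivalent t-statistic form of the LRT statistic can be checked numerically. A short Python sketch with an arbitrary illustrative sample:

```python
import math

xs = [2.3, 1.1, 3.0, 2.2, 1.7, 2.9, 2.4, 1.5]   # arbitrary sample
mu0 = 1.8
n = len(xs)
xbar = sum(xs) / n
sig2hat  = sum((x - xbar) ** 2 for x in xs) / n   # sigma-hat^2
sig2hat0 = sum((x - mu0) ** 2 for x in xs) / n    # sigma-hat_0^2

# LRT statistic two ways: via sigma-hat^2 and via the t-statistic
lrt = n * math.log(1 + (xbar - mu0) ** 2 / sig2hat)
s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
T = (xbar - mu0) / math.sqrt(s2 / n)
lrt_t = n * math.log(1 + T ** 2 / (n - 1))
```

The agreement of `lrt` and `lrt_t` is exactly the identity n log(1 + T² /(n − 1)) used for the rejection region on the next slide.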
70/95
Examples of LRT: Normal Mean With Unknown Variance
I Rejection region: Let s² = (n − 1)⁻¹∑_{i=1}^n (xi − x̄)².
Reject H0 if |(x̄ − µ0)/(s/√n)| ≥ c.
I Determination of critical value c:
Pθ0( |(X̄ − µ0)/(S/√n)| ≥ tα/2(n − 1) ) = α.
I Wilks’ phenomenon: Since Tn ≡ (X̄ − µ0)/(S/√n) ∼ t(n − 1) under H0
and t(n − 1) →ᵈ Z ∼ N(0, 1), we get
2(ℓ(θ̂Θ;X) − ℓ(θ̂Θ0;X)) = n log(1 + Tn²/(n − 1)) = Tn² + op(1) →ᵈ Z² ∼ χ²(1)
under H0. Note that (d.f.) = dim(Θ) − dim(Θ0) = 2 − 1 = 1.
71/95
Examples of LRT: Normal Mean With Unknown Variance
I Power function: With δ = √n(µ − µ0)/σ,
γ(δ) ≡ Pθ( |(X̄ − µ0)/(S/√n)| ≥ tα/2(n − 1) )
= P( |(Z + δ)/√(V/(n − 1))| ≥ tα/2(n − 1) ),
where Z ∼ N(0, 1) and V ∼ χ²(n − 1) are independent.
I The power function is symmetric about δ = 0, takes the minimal value α
at δ = 0 and increases to one as δ gets away from 0.
72/95
Examples of LRT: Two Normal Means With Common Variance
Let X1, . . . , Xn1 and Y1, . . . , Yn2 be random samples, respectively, from
N(µ1, σ²) and N(µ2, σ²), θ = (µ1, µ2, σ²) ∈ R² × R+. Assume that
(X1, . . . , Xn1) and (Y1, . . . , Yn2) are independent. We want to test
H0 : µ1 = µ2 against H1 : µ1 ≠ µ2 at a level 0 < α < 1.
I Likelihood:
ℓ(θ) = (const.) − ((n1 + n2)/2) log σ² − (1/(2σ²))( ∑_{i=1}^{n1} (xi − µ1)² + ∑_{i=1}^{n2} (yi − µ2)² ).
I Maximization of likelihood: θ̂Θ = (x̄, ȳ, σ̂²), θ̂Θ0 = (µ̂0, µ̂0, σ̂0²), where
σ̂² = ( ∑_{i=1}^{n1} (xi − x̄)² + ∑_{i=1}^{n2} (yi − ȳ)² ) / (n1 + n2),
σ̂0² = ( ∑_{i=1}^{n1} (xi − µ̂0)² + ∑_{i=1}^{n2} (yi − µ̂0)² ) / (n1 + n2),
µ̂0 = (n1x̄ + n2ȳ)/(n1 + n2).
73/95
Examples of LRT: Two Normal Means With Common Variance
I Likelihood ratio: Note that
σ̂0²/σ̂² = 1 + ( n1(x̄ − µ̂0)² + n2(ȳ − µ̂0)² )/((n1 + n2)σ̂²) = 1 + n1n2(x̄ − ȳ)²/((n1 + n2)²σ̂²).
Let sp² = (n1 + n2)σ̂²/(n1 + n2 − 2) be the pooled sample variance.
Then, we get
2(ℓ(θ̂Θ;x, y) − ℓ(θ̂Θ0;x, y)) = (n1 + n2) log(σ̂0²/σ̂²)
= (n1 + n2) log( 1 + n1n2(x̄ − ȳ)²/((n1 + n2)²σ̂²) )
= (n1 + n2) log( 1 + (1/(n1 + n2 − 2)) · (x̄ − ȳ)²/((n1⁻¹ + n2⁻¹)sp²) ).
I Rejection region:
Reject H0 if |(x̄ − ȳ)/(sp√(n1⁻¹ + n2⁻¹))| ≥ tα/2(n1 + n2 − 2).
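The chain of identities leading to the pooled-t form can be verified numerically. A Python sketch with two arbitrary illustrative samples:

```python
import math

xs = [3.1, 2.4, 3.8, 2.9, 3.3]                     # arbitrary first sample
ys = [2.0, 2.6, 1.7, 2.2, 2.9, 2.4]                # arbitrary second sample
n1, n2 = len(xs), len(ys)
xbar, ybar = sum(xs) / n1, sum(ys) / n2
mu0 = (n1 * xbar + n2 * ybar) / (n1 + n2)          # null MLE of the common mean
sig2hat = (sum((x - xbar) ** 2 for x in xs)
           + sum((y - ybar) ** 2 for y in ys)) / (n1 + n2)
sig2hat0 = (sum((x - mu0) ** 2 for x in xs)
            + sum((y - mu0) ** 2 for y in ys)) / (n1 + n2)

# variance-ratio identity and the pooled-t form of the LRT statistic
ratio = 1 + n1 * n2 * (xbar - ybar) ** 2 / ((n1 + n2) ** 2 * sig2hat)
sp2 = (n1 + n2) * sig2hat / (n1 + n2 - 2)          # pooled sample variance
T = (xbar - ybar) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
lrt = (n1 + n2) * math.log(sig2hat0 / sig2hat)
lrt_t = (n1 + n2) * math.log(1 + T ** 2 / (n1 + n2 - 2))
```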
74/95
Examples of LRT: Two Normal Means With Common Variance
I Wilks’ phenomenon: Since
Tn ≡ (X̄ − Ȳ)/(Sp√(n1⁻¹ + n2⁻¹)) ∼ t(n1 + n2 − 2) under H0 and
t(n1 + n2 − 2) →ᵈ Z ∼ N(0, 1), we get
2(ℓ(θ̂Θ;X, Y) − ℓ(θ̂Θ0;X, Y)) = (n1 + n2) log(1 + Tn²/(n1 + n2 − 2))
= Tn² + op(1) →ᵈ Z² ∼ χ²(1)
under H0. Note that (d.f.) = dim(Θ) − dim(Θ0) = 3 − 2 = 1.
I Power function: With δ = (µ1 − µ2)/(σ√(n1⁻¹ + n2⁻¹)),
γ(δ) ≡ Pθ(reject H0) = P( |(Z + δ)/√(V/(n1 + n2 − 2))| ≥ tα/2(n1 + n2 − 2) ),
where Z ∼ N(0, 1) and V ∼ χ²(n1 + n2 − 2) are independent.
75/95
Examples of LRT: Two Normal Variances
Let X1, . . . , Xn1 and Y1, . . . , Yn2 be random samples, respectively, from
N(µ1, σ1²) and N(µ2, σ2²), θ = (µ1, µ2, σ1², σ2²) ∈ R² × R+². Assume that
(X1, . . . , Xn1) and (Y1, . . . , Yn2) are independent. We want to test
H0 : σ1² = σ2² against H1 : σ1² ≠ σ2² at a level 0 < α < 1.
I Likelihood:
ℓ(θ) = (const.) − (n1/2) log σ1² − (n2/2) log σ2²
− (1/(2σ1²)) ∑_{i=1}^{n1} (xi − µ1)² − (1/(2σ2²)) ∑_{i=1}^{n2} (yi − µ2)².
I Maximization of likelihood: θ̂Θ = (x̄, ȳ, σ̂1², σ̂2²), θ̂Θ0 = (x̄, ȳ, σ̂², σ̂²),
where
σ̂1² = ∑_{i=1}^{n1} (xi − x̄)²/n1,  σ̂2² = ∑_{i=1}^{n2} (yi − ȳ)²/n2,
σ̂² = ( ∑_{i=1}^{n1} (xi − x̄)² + ∑_{i=1}^{n2} (yi − ȳ)² )/(n1 + n2) = (n1σ̂1² + n2σ̂2²)/(n1 + n2).
76/95
Examples of LRT: Two Normal Variances
I Likelihood ratio: Let s1² = ∑_{i=1}^{n1} (xi − x̄)²/(n1 − 1) and define s2² likewise
with the yi. Then,
2(ℓ(θ̂Θ;x, y) − ℓ(θ̂Θ0;x, y)) = (n1 + n2) log σ̂² − n1 log σ̂1² − n2 log σ̂2²
= n1 log( 1 + ((n2 − 1)/(n1 − 1)) · s2²/s1² ) + n2 log( 1 + ((n1 − 1)/(n2 − 1)) · s1²/s2² ) + (const.)
=: rn1,n2(s1²/s2²).
I Rejection region:
Reject H0 if s1²/s2² ≤ c1 or s1²/s2² ≥ c2,
where c1 and c2 satisfy
rn1,n2(c1) = rn1,n2(c2) and ∫_{c1}^{c2} pdfF(n1−1,n2−1)(y) dy = 1 − α,
noting that s1²/s2² ∼ F(n1 − 1, n2 − 1) under H0.
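The identity defining rn1,n2 can be checked numerically once the absorbed constant is written out; working through the algebra, the constant is n1 log(n1/(n1 + n2)) + n2 log(n2/(n1 + n2)) (this explicit value is my reconstruction, not stated on the slide). A sketch with arbitrary samples:

```python
import math

xs = [1.0, 1.6, 0.4, 1.2, 0.8, 1.4, 0.9]           # arbitrary samples
ys = [2.5, 1.1, 3.2, 0.7, 2.0, 1.6, 2.8, 1.3]
n1, n2 = len(xs), len(ys)
xbar, ybar = sum(xs) / n1, sum(ys) / n2
v1 = sum((x - xbar) ** 2 for x in xs)
v2 = sum((y - ybar) ** 2 for y in ys)
sig1hat, sig2hat = v1 / n1, v2 / n2                # sigma-hat_1^2, sigma-hat_2^2
sigpool = (v1 + v2) / (n1 + n2)                    # sigma-hat^2 under H0
s1sq, s2sq = v1 / (n1 - 1), v2 / (n2 - 1)          # s_1^2, s_2^2

lrt = ((n1 + n2) * math.log(sigpool)
       - n1 * math.log(sig1hat) - n2 * math.log(sig2hat))
# the same statistic as a function of s_1^2/s_2^2, with the constant made explicit
const = n1 * math.log(n1 / (n1 + n2)) + n2 * math.log(n2 / (n1 + n2))
r = (n1 * math.log(1 + (n2 - 1) / (n1 - 1) * s2sq / s1sq)
     + n2 * math.log(1 + (n1 - 1) / (n2 - 1) * s1sq / s2sq) + const)
```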
77/95
Examples of LRT: One-Sided Poisson Mean
Let X1, . . . , Xn be a random sample from Poisson(θ), θ > 0. We want to test
H0 : θ ≤ θ0 against H1 : θ > θ0 at a level 0 < α < 1.
I Likelihood: ℓ(θ) = n(−θ + x̄ log θ − n⁻¹∑_{i=1}^n log xi!).
I Maximization of likelihood: θ̂Θ = x̄ and θ̂Θ0 = min{x̄, θ0}.
I Likelihood ratio:
2(ℓ(θ̂Θ;x) − ℓ(θ̂Θ0;x)) = 2n( x̄ log x̄ − x̄(1 + log θ0) + θ0 ) I(x̄ ≥ θ0).
I Rejection region: f(u) ≡ (u log u − u(1 + log θ0) + θ0)I(u ≥ θ0) equals
zero for u ≤ θ0 and is strictly increasing for u > θ0. Thus, for any λ > 0,
the inequality f(u) ≥ λ is equivalent to u ≥ λ′ for some λ′ that depends
on λ. This implies that the
LRT rejects H0 if ∑_{i=1}^n xi ≥ c.
78/95
Examples of LRT: One-Sided Poisson Mean
I Determination of critical value: Need to find c > 0 such that
supθ≤θ0 Pθ( ∑_{i=1}^n Xi ≥ c ) = α.
For an integer c, the rejection probability under Pθ equals
1 − ∑_{j=0}^{c−1} e^{−nθ}(nθ)^j/j!, which is an increasing function of θ. Thus, the
supremum over θ ≤ θ0 is achieved at θ = θ0. The task then reduces to
finding c > 0 such that
Pθ0( ∑_{i=1}^n Xi ≥ c ) = α.
But this is not possible for most values of 0 < α < 1.
I One may do randomization to get a test such that Pθ0(reject H0) = α.
79/95
Examples of LRT: One-Sided Poisson Mean
I Randomization: For a given 0 < α < 1, suppose that
1 − ∑_{j=0}^{c0} e^{−nθ0}(nθ0)^j/j! =: α0 < α < α1 := 1 − ∑_{j=0}^{c0−1} e^{−nθ0}(nθ0)^j/j!.
The randomized LRT rejects H0 if ∑_{i=1}^n xi ≥ c0 + 1 and does not reject
H0 if ∑_{i=1}^n xi ≤ c0 − 1. On the boundary where ∑_{i=1}^n xi = c0, it rejects
H0 with probability (α − α0)/(α1 − α0). If we denote by φLRT(u) the
probability of rejecting H0 when ∑_{i=1}^n xi = u is observed, then
φLRT(u) = 1 for u ≥ c0 + 1,
φLRT(u) = (α − α0)/(α1 − α0) for u = c0,
φLRT(u) = 0 for u ≤ c0 − 1.
I Level condition: It can be seen that
Pθ0(reject H0) = Pθ0( ∑_{i=1}^n Xi ≥ c0 + 1 ) + ((α − α0)/(α1 − α0)) · Pθ0( ∑_{i=1}^n Xi = c0 ) = α.
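The construction of c0, α0, α1 and the randomization probability can be carried out exactly with the Poisson pmf. A Python sketch (n = 5, θ0 = 2 and α = 0.05 are illustrative; the code picks the smallest c0 with α0 < α ≤ α1, which agrees with the strict inequalities above whenever α is not itself an attainable tail probability):

```python
import math

def pois_cdf(c, lam):
    # P(N <= c) for N ~ Poisson(lam), integer c >= 0
    term = math.exp(-lam)
    s = term
    for j in range(1, c + 1):
        term *= lam / j
        s += term
    return s

def randomized_lrt(n, theta0, alpha):
    # c0 with alpha0 = P(T >= c0 + 1) < alpha <= alpha1 = P(T >= c0),
    # where T = sum Xi ~ Poisson(n * theta0) under theta = theta0
    lam = n * theta0
    c0 = 0
    while 1.0 - pois_cdf(c0, lam) >= alpha:
        c0 += 1
    alpha0 = 1.0 - pois_cdf(c0, lam)
    alpha1 = 1.0 if c0 == 0 else 1.0 - pois_cdf(c0 - 1, lam)
    gamma = (alpha - alpha0) / (alpha1 - alpha0)    # rejection prob at T = c0
    return c0, alpha0, alpha1, gamma

c0, a0, a1, g = randomized_lrt(5, 2.0, 0.05)
level = a0 + g * (a1 - a0)                          # exact size of the randomized test
```

The level condition α0 + γ(α1 − α0) = α holds by construction, reproducing the display above.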
80/95
Approximation of LRT: Simple Null
Let X1, . . . , Xn be a random sample from a distribution with pdf
f(·; θ), θ ∈ Θ ⊂ Rd, with respect to a measure µ. Assume that the regularity
conditions (R0)–(R7) hold. Consider the problem of testing H0 : θ = θ0. Let θ̂
denote the MLE over Θ. For the LRT statistic 2(ℓ(θ̂) − ℓ(θ0)), it holds that,
under H0,
(1) 2(ℓ(θ̂) − ℓ(θ0)) →ᵈ χ²(d);
(2) 2(ℓ(θ̂) − ℓ(θ0)) = Wn + op(1);
(3) 2(ℓ(θ̂) − ℓ(θ0)) = Rn + op(1),
where Wn = (θ̂ − θ0)ᵀ(nI1(θ0))(θ̂ − θ0) (Wald test statistic) and
Rn = ℓ̇(θ0)ᵀ(nI1(θ0))⁻¹ ℓ̇(θ0) (Rao/score test statistic).
Remark: (1) is called Wilks’ phenomenon.
81/95
Approximation of LRT: Simple Null
I Proof of (2): Let Sj(θ) denote the jth derivative of n⁻¹ℓ(θ). Since
ℓ̇(θ̂) = 0, we get
ℓ(θ0) = ℓ(θ̂) + (θ̂ − θ0)ᵀ · nS2(θ*) · (θ̂ − θ0)/2
= ℓ(θ̂) + (θ̂ − θ0)ᵀ( −nI1(θ0) + op(n) )(θ̂ − θ0)/2
= ℓ(θ̂) − Wn/2 + op(1).
I Proof of (3): Since √n(θ̂ − θ0) = I1(θ0)⁻¹√nS1(θ0) + op(1) and
√nS1(θ0) = Op(1) under Pθ0, we obtain
Wn = ( I1(θ0)⁻¹√nS1(θ0) + op(1) )ᵀ I1(θ0) ( I1(θ0)⁻¹√nS1(θ0) + op(1) )
= nS1(θ0)ᵀI1(θ0)⁻¹S1(θ0) + op(1)
= Rn + op(1).
I Proof of (1): Immediately follows from (2) or (3).
82/95
Approximation of LRT: Exponential Mean (Page 66, LecNote)
I Hypothesis: H0 : θ = θ0 versus H1 : θ ≠ θ0.
I Asymptotic LRT: Reject H0 if
2n(x̄/θ0 − log(x̄/θ0) − 1) ≥ χ²α(1).
I Wald and Rao tests: Reject H0 if
n(x̄ − θ0)²/θ0² ≥ χ²α(1).
I Compare these with the LRT.
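The comparison can be made concrete: by a Taylor expansion of u − log u − 1 around u = 1, the LRT and Wald/Rao statistics agree to second order in x̄/θ0 − 1 but drift apart away from θ0. A small sketch with illustrative values of n, θ0 and x̄:

```python
import math

def lrt_stat(xbar, theta0, n):
    # 2n( xbar/theta0 - log(xbar/theta0) - 1 )
    u = xbar / theta0
    return 2 * n * (u - math.log(u) - 1)

def wald_stat(xbar, theta0, n):
    # n (xbar - theta0)^2 / theta0^2; the Rao statistic is identical here
    return n * (xbar - theta0) ** 2 / theta0 ** 2

n, theta0 = 50, 2.0
close = lrt_stat(2.1, theta0, n) / wald_stat(2.1, theta0, n)   # xbar near theta0
far = lrt_stat(3.0, theta0, n) / wald_stat(3.0, theta0, n)     # xbar far from theta0
```

The ratio of the two statistics is near 1 for x̄ close to θ0 and visibly below 1 for x̄ far above θ0, which is the sense in which the asymptotic tests agree under H0.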
83/95
Approximation of LRT: Beta Model
Suppose we observe a random sample X1, . . . , Xn from Beta(θ, 1), θ > 0.
I Hypothesis: H0 : θ = 1 versus H1 : θ ≠ 1.
I Recall that the MLE θ̂ = −n/(∑_{i=1}^n log Xi) and I1(θ) = 1/θ²:
ℓ(θ) = n log θ + (θ − 1)∑_{i=1}^n log xi,  ℓ̇(θ) = n/θ + ∑_{i=1}^n log xi.
I LRT statistic: 2(ℓ(θ̂) − ℓ(1)) = 2n(log θ̂ − 1 + 1/θ̂).
I Wald test statistic: Wn(1) = (θ̂ − 1)² nI1(1) = n(θ̂ − 1)².
I Rao test statistic:
Rn(1) = (ℓ̇(1))² n⁻¹ I1(1)⁻¹ = n( 1 + n⁻¹∑_{i=1}^n log Xi )².
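The three statistics can be computed from a sample directly, and the closed form of the LRT statistic should match the raw log-likelihood difference. A sketch with arbitrary illustrative values in (0, 1):

```python
import math

xs = [0.9, 0.55, 0.7, 0.83, 0.61, 0.95, 0.74, 0.88]   # arbitrary values in (0, 1)
n = len(xs)
slog = sum(math.log(x) for x in xs)
theta_hat = -n / slog                                  # MLE for Beta(theta, 1)

def loglik(theta):
    # l(theta) = n log(theta) + (theta - 1) sum(log xi)
    return n * math.log(theta) + (theta - 1) * slog

lrt = 2 * (loglik(theta_hat) - loglik(1.0))
lrt_closed = 2 * n * (math.log(theta_hat) - 1 + 1 / theta_hat)
wald = n * (theta_hat - 1) ** 2
rao = n * (1 + slog / n) ** 2
```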
84/95
Approximation of LRT: Double Exponential Model
Suppose we observe a random sample X1, . . . , Xn from DE(θ, 1),
−∞ < θ <∞.
I Hypothesis: H0 : θ = θ0 versus H1 : θ ≠ θ0.
I Recall that the MLE θ̂ = med(Xi) and I1(θ) = 1 (page 51):
ℓ(θ) = −∑_{i=1}^n |xi − θ| − n log 2,  ℓ̇(θ) = ∑_{i=1}^n sgn(xi − θ).
Here, we take ℓ̇1(θ;x) = (∂/∂θ)f(x; θ)/f(x; θ) with (∂/∂θ)f(·; θ) being
the Fréchet derivative of f(·; θ).
I LRT statistic: 2(ℓ(θ̂) − ℓ(θ0)) = 2( ∑_{i=1}^n |xi − θ0| − ∑_{i=1}^n |xi − θ̂| ).
I Wald test statistic: W(θ0) = n(θ̂ − θ0)².
I Rao test statistic: R(θ0) = ( ∑_{i=1}^n sgn(Xi − θ0) )²/n.
85/95
Approximation of LRT: Multinomial Model
Suppose we observe a random sample Z1, . . . , Zn from
Multinomial(1, (θ1, . . . , θk)), and wish to test
H0 : θ = θ0 versus H1 : θ ≠ θ0.
I If we take Θ = {(θ1, . . . , θk−1) : θj > 0, θ1 + · · · + θk−1 < 1}, then the
model satisfies (R0)–(R7).
I Let Xj = ∑_{i=1}^n Zij, 1 ≤ j ≤ k, and θ· = ∑_{j=1}^{k−1} θj = 1 − θk. With
θ = (θ1, . . . , θk−1)ᵀ,
ℓ(θ) = ∑_{j=1}^{k−1} xj log θj + xk log(1 − θ·),
ℓ̇(θ) = ( xj/θj − xk/(1 − θ·) : 1 ≤ j ≤ k − 1 ),
the MLE θ̂ = (xj/n : 1 ≤ j ≤ k − 1) and I1(θ) = diag(θj⁻¹) + (1 − θ·)⁻¹11ᵀ.
86/95
Approximation of LRT: Multinomial Model
I LRT statistic: 2n( ∑_{j=1}^k θ̂j log θ̂j − ∑_{j=1}^k θ̂j log θ0j ).
I Wald statistic: n(θ̂ − θ0)ᵀI1(θ0)(θ̂ − θ0) = ∑_{j=1}^k (Xj − nθ0j)²/(nθ0j).
I Rao statistic: n⁻¹ℓ̇(θ0)ᵀI1(θ0)⁻¹ℓ̇(θ0) = ∑_{j=1}^k (Xj − nθ0j)²/(nθ0j).
I nI1(θ)(θ̂ − θ) = (Xj/θj : 1 ≤ j ≤ k − 1) − (Xk/θk) · 1 = ℓ̇(θ).
I Often the Wald and Rao tests are expressed as
Reject H0 if ∑_{j=1}^k (Oj − Ej)²/Ej ≥ χ²α(k − 1),
where Oj = Xj and Ej = E(Xj | H0) = nθ0j.
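That the Wald quadratic form over the k − 1 free coordinates collapses to the Pearson sum over all k cells can be verified numerically. A sketch with illustrative null probabilities and counts:

```python
k = 4
theta0 = [0.1, 0.2, 0.3, 0.4]      # null probabilities (illustrative)
counts = [8, 23, 29, 40]           # observed counts X1..Xk (illustrative)
n = sum(counts)
theta_hat = [x / n for x in counts]

# I1(theta0) on the free coordinates theta_1..theta_{k-1}:
# diag(1/theta0_j) + (1/theta0_k) 11^T
I1 = [[(1 / theta0[j] if j == l else 0.0) + 1 / theta0[k - 1]
       for l in range(k - 1)] for j in range(k - 1)]
d = [theta_hat[j] - theta0[j] for j in range(k - 1)]
wald = n * sum(d[j] * I1[j][l] * d[l]
               for j in range(k - 1) for l in range(k - 1))
pearson = sum((counts[j] - n * theta0[j]) ** 2 / (n * theta0[j]) for j in range(k))
```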
87/95
Approximation of LRT: Composite Null
Here, we discuss the approximation of LRT when the null hypothesis is
composite. For this, let X1, . . . , Xn be a random sample from a distribution
with pdf f(·; θ), θ ∈ Θ ⊂ Rd, that satisfies the regularity conditions (R0)–(R7).
Let θ = (ξᵀ, ηᵀ)ᵀ with η ∈ Rd0, and suppose we wish to test
H0 : ξ = ξ0 versus H1 : ξ ≠ ξ0.
The null parameter space, of dimension dim(Θ0) = d0, is given by
Θ0 = { (ξ0ᵀ, ηᵀ)ᵀ : η ∈ Rd0 and (ξ0ᵀ, ηᵀ)ᵀ ∈ Θ }.
Let θ̂Θ0 and θ̂Θ denote the MLEs in Θ0 and in Θ, respectively.
88/95
Approximation of LRT: Composite Null
Under the composite null hypothesis H0 : ξ = ξ0 ⇔ H0 : θ ∈ Θ0, it holds that
(1) 2(ℓ(θ̂Θ) − ℓ(θ̂Θ0)) →ᵈ χ²(d − d0); (Wilks’ phenomenon)
(2) 2(ℓ(θ̂Θ) − ℓ(θ̂Θ0)) = Wn + op(1);
(3) 2(ℓ(θ̂Θ) − ℓ(θ̂Θ0)) = Rn + op(1),
where
Wn = (θ̂Θ − θ̂Θ0)ᵀ(nI1(θ̂Θ0))(θ̂Θ − θ̂Θ0) (Wald test statistic),
Rn = ℓ̇(θ̂Θ0)ᵀ(nI1(θ̂Θ0))⁻¹ ℓ̇(θ̂Θ0) (Rao/score test statistic).
(Proof: See Bickel and Doksum (2001), Chapter 5.)
89/95
Approximation of LRT: Composite Null
I Remark 1: The above approximations are valid for
H0 : g1(θ) = 0, . . . , gd1(θ) = 0 versus H1 : not H0,
where d1 = d − d0, provided that there exists a smooth reparametrization
ξ1 = g1(θ), . . . , ξd1 = gd1(θ), η1 = gd1+1(θ), . . . , ηd0 = gd(θ).
I Remark 2: The above approximation may be extended to the case of
independent but not identically distributed observations. In this case,
Wn = (θ̂Θ − θ̂Θ0)ᵀIn(θ̂Θ0)(θ̂Θ − θ̂Θ0) (Wald test statistic),
Rn = ℓ̇(θ̂Θ0)ᵀIn(θ̂Θ0)⁻¹ ℓ̇(θ̂Θ0) (Rao/score test statistic),
where In(θ) = −Eθ ℓ̈(θ).
90/95
Approximation of LRT: Independence Test in Contingency Table
Suppose we observe X = (Xjk) ∼ Multinomial(n, (pjk)), where
∑_{j=1}^a ∑_{k=1}^b pjk = 1, pjk > 0, j = 1, . . . , a; k = 1, . . . , b. Consider the
hypothesis
H0 : pjk = pj· × p·k for all j, k versus H1 : not H0,
where pj· = pj1 + · · · + pjb and p·k = p1k + · · · + pak.
I With θ = (p11, . . . , pa,b−1)ᵀ, define
gjk(θ) = pjk − pj·p·k, 1 ≤ j ≤ a − 1, 1 ≤ k ≤ b − 1,
and gjk for other (j, k) appropriately so that the resulting transformation
is a smooth reparametrization of θ. Then
pjk = pj·p·k for all j, k ⇔ gjk(θ) = 0 for 1 ≤ j ≤ a − 1, 1 ≤ k ≤ b − 1.
91/95
Approximation of LRT: Independence Test in Contingency Table
I Indeed,
Θ = { (p11, . . . , pab) : pjk > 0, p11 + · · · + pab = 1 },
Θ0 = { (p1·p·1, . . . , pa·p·b) : pj· > 0, p·k > 0, p1· + · · · + pa· = 1, p·1 + · · · + p·b = 1 }.
Thus, d = dim(Θ) = ab − 1 and d0 = dim(Θ0) = (a − 1) + (b − 1), so
that d1 = dim(Θ) − dim(Θ0) = (a − 1)(b − 1).
I The likelihood under H0:
L(p) = ∏_{j=1}^a ∏_{k=1}^b (pjk)^{xjk} × (constant) = ∏_{j=1}^a (pj·)^{xj·} ∏_{k=1}^b (p·k)^{x·k} × (constant).
I MLEs: p̂jk = xjk/n in the whole parameter space Θ, and
p̂0jk = (xj·/n) × (x·k/n) = p̂0j· × p̂0·k, say, in Θ0.
92/95
Approximation of LRT: Independence Test in Contingency Table
I The Wald and Rao test statistics coincide: With θ = (p11, . . . , pa,b−1)ᵀ,
n(θ̂ − θ)ᵀI1(θ)(θ̂ − θ) = ∑_{j=1}^a ∑_{k=1}^b (Xjk − npjk)²/(npjk) = n⁻¹ℓ̇(θ)ᵀI1(θ)⁻¹ℓ̇(θ).
I The Wald and Rao test statistics are given by
Wn = Rn = ∑_{j=1}^a ∑_{k=1}^b (Xjk − np̂0j·p̂0·k)²/(np̂0j·p̂0·k).
I Therefore, both the Wald and Rao tests reject H0 if
∑_{j=1}^a ∑_{k=1}^b (Ojk − E0jk)²/E0jk ≥ χ²α( (a − 1)(b − 1) ),
where Ojk = Xjk and E0jk = E(Xjk | H0) = np̂0j·p̂0·k.
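The resulting test is the classical Pearson chi-square test of independence. A sketch computing the statistic and its degrees of freedom for an illustrative 2 × 3 table:

```python
table = [[20, 15, 25],
         [30, 20, 40]]     # a = 2 rows (j), b = 3 columns (k); illustrative counts
a, b = len(table), len(table[0])
n = sum(sum(row) for row in table)
row_tot = [sum(table[j]) for j in range(a)]                       # Xj.
col_tot = [sum(table[j][k] for j in range(a)) for k in range(b)]  # X.k

# expected counts under independence: E0_jk = n * (Xj./n) * (X.k/n)
E = [[row_tot[j] * col_tot[k] / n for k in range(b)] for j in range(a)]
chi2 = sum((table[j][k] - E[j][k]) ** 2 / E[j][k]
           for j in range(a) for k in range(b))
df = (a - 1) * (b - 1)
```

The expected counts sum to n, and `chi2` would be compared to the upper-α quantile of χ²(df).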
93/95
Approximation of LRT: Homogeneity of Multinomial Models
Suppose now that we observe b independent multinomial random vectors:
Xk ≡ (X1k, . . . , Xak)ᵀ ∼ Multinomial(nk, (p1k, . . . , pak)), 1 ≤ k ≤ b, where
p1k + · · · + pak = 1 and pjk > 0. We wish to test
H0 : p1 = · · · = pb versus H1 : not H0,
where pk = (p1k, . . . , pak)ᵀ, 1 ≤ k ≤ b.
I With θk = (p1k, . . . , pa−1,k)ᵀ and θ = (θ1ᵀ, . . . , θbᵀ)ᵀ, we get
ℓ(θ) = ∑_{k=1}^b ∑_{j=1}^{a−1} xjk log θjk + ∑_{k=1}^b (nk − x·k) log(1 − θ·k),
ℓ̇(θ) = ( xjk/θjk − xak/(1 − θ·k) : 1 ≤ j ≤ a − 1, 1 ≤ k ≤ b ),
In(θ) = var(ℓ̇(θ)) = diag( nk[ diag(θjk⁻¹) + (1 − θ·k)⁻¹11ᵀ ] ),
where θ·k = ∑_{j=1}^{a−1} θjk = 1 − pak and x·k = ∑_{j=1}^{a−1} xjk = nk − xak.
94/95
Approximation of LRT: Homogeneity of Multinomial Models
I MLEs: p̂jk = xjk/nk in Θ, and p̂0jk = xj·/n (the same for every k) in Θ0,
with n = n1 + · · · + nb.
I The Wald and Rao statistics coincide and
(θ̂Θ − θ)ᵀIn(θ)(θ̂Θ − θ)
= ∑_{k=1}^b nk(θ̂Θk − θk)ᵀ[ diag(θjk⁻¹) + (1 − θ·k)⁻¹11ᵀ ](θ̂Θk − θk)
= ∑_{k=1}^b nk[ ∑_{j=1}^{a−1} (θ̂Θjk − θjk)²/θjk + (θ̂Θ·k − θ·k)²/(1 − θ·k) ]
= ∑_{k=1}^b nk ∑_{j=1}^a (p̂jk − pjk)²/pjk
= ∑_{k=1}^b ∑_{j=1}^a (nkp̂jk − nkpjk)²/(nkpjk)
= ∑_{k=1}^b ∑_{j=1}^a (Xjk − nkpjk)²/(nkpjk).
95/95
Approximation of LRT: Homogeneity of Multinomial Models
I Thus, the Wald and Rao statistics for H0 are given by
Wn = Rn = ∑_{k=1}^b ∑_{j=1}^a (Xjk − nkp̂0jk)²/(nkp̂0jk)
= ∑_{k=1}^b ∑_{j=1}^a ( Xjk − nk(Xj·/n) )²/( nkXj·/n ).
I Note that dim(Θ) = b(a − 1) and
Θ0 = { (p1, . . . , p1) : p11 + · · · + pa1 = 1, pj1 > 0 }, so that dim(Θ0) = a − 1.
I Therefore, the Wald and Rao tests reject H0 if
∑_{k=1}^b ∑_{j=1}^a (Ojk − E0jk)²/E0jk ≥ χ²α( (a − 1)(b − 1) ),
where Ojk = Xjk and E0jk = E(Xjk | H0) = nkp̂0jk = nk(Xj·/n).
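The homogeneity statistic is again of Pearson form. A sketch for b = 3 illustrative multinomial samples with a = 3 categories:

```python
samples = [[12, 18, 20],   # k = 1: category counts, n1 = 50 (illustrative)
           [22, 15, 23],   # k = 2: n2 = 60
           [16, 17, 7]]    # k = 3: n3 = 40
b = len(samples)           # number of independent multinomials
a = len(samples[0])        # number of categories
nk = [sum(s) for s in samples]
n = sum(nk)
xj = [sum(samples[k][j] for k in range(b)) for j in range(a)]   # pooled Xj.

# E0_jk = nk * (Xj./n); the common Wald/Rao statistic
stat = sum((samples[k][j] - nk[k] * xj[j] / n) ** 2 / (nk[k] * xj[j] / n)
           for k in range(b) for j in range(a))
df = (a - 1) * (b - 1)
```

`stat` would be referred to the upper-α quantile of χ²((a − 1)(b − 1)), matching the rejection rule above.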