
1/95

Mathematical Statistics, Fall 2016

Chapter 6: Likelihood Methods

Byeong U. Park

Department of Statistics, Seoul National University

2/95

6.1 Maximum Likelihood Estimation

6.2 Information Inequality and Efficiency

6.3 Maximum Likelihood Tests

3/95

Idea of Maximizing Likelihood

I If pdf(x; θ1) > pdf(x; θ2) for an observation X = x, then the distribution pdf(·; θ1) is more likely than pdf(·; θ2) to be the true distribution that generated the observation x.

I Likelihood and log-likelihood functions: pdf(x; θ) as a function of θ for a

given x is called the likelihood (function). Its logarithm is called the

log-likelihood (function).

I Below we write L(θ) ≡ L(θ; x) = pdf(x; θ) and ℓ(θ) ≡ ℓ(θ; x) = log pdf(x; θ).

I For a set of observations x1, . . . , xn of a random sample X1, . . . , Xn from a pdf f(·; θ), we have

L(θ) = ∏_{i=1}^n f(xi; θ),   ℓ(θ) = ∑_{i=1}^n log f(xi; θ).

4/95

Maximum Likelihood Estimator (MLE)

I Definition of MLE: The MLE of θ for a given observation x is defined by

θMLE(x) ≡ arg max_{θ∈Θ} L(θ; x),

when it exists.

I An MLE may not exist, and when it exists it may not be unique.

I In some cases the MLE does not exist in the parameter space Θ but does exist in an extended parameter space (e.g., its closure).

I Suppose that Θ is open in Rd and the log-likelihood function ℓ(θ) is continuous in Θ and diverges to −∞ as θ approaches the 'boundary' of Θ. Then, an MLE exists.

5/95

Likelihood Equation

I When the likelihood is differentiable, an MLE is often found by solving the likelihood equation ℓ̇(θ; x) = 0, where ℓ̇(θ; x) = (∂/∂θ)ℓ(θ; x).

I Suppose that the likelihood is twice differentiable. For a given x, if an MLE exists and the second derivative ℓ̈(θ; x) is negative definite for all θ ∈ Θ, then the solution of the likelihood equation exists and is the unique MLE.

I Suppose that the likelihood is twice differentiable. For a given x, if ℓ̇(θ̂(x); x) = 0 and ℓ̈(θ; x) is negative definite for all θ ∈ Θ, then θ̂(x) is the unique MLE: for any θ ∈ Θ,

ℓ(θ) = ℓ(θ̂) + ℓ̇(θ̂)ᵀ(θ − θ̂) + (1/2)(θ − θ̂)ᵀ ℓ̈(θ∗)(θ − θ̂) ≤ ℓ(θ̂),

with equality holding if and only if θ = θ̂.

6/95

MLE of Function of Parameter

I Let θ = (θ1, θ2) and let θ̂ = (θ̂1, θ̂2) be its MLE. Then we call θ̂j the MLE of θj, j = 1, 2.

I For an injective function g, the MLE of η = g(θ) is given by ηMLE = g(θMLE).

Proof. Let L denote the likelihood function of θ. Then the likelihood of η equals L(g⁻¹(η); x). This entails

(Likelihood of ηMLE) = L(g⁻¹(ηMLE); x) = L(θMLE; x) = max_{θ∈Θ} L(θ; x) = max_{η∈g(Θ)} L(g⁻¹(η); x).

7/95

MLE of Function of Parameter

I What if the function g in η = g(θ) is not injective? One may find h such that the map θ ↦ (g(θ), h(θ)) is injective. Then the MLE of ηext ≡ (g(θ), h(θ)) is given by

ηextMLE = (g(θMLE), h(θMLE)),

so that it remains true that ηMLE = g(θMLE).

8/95

Profiling Method of Finding MLE

Sometimes it is difficult to find the MLE of θ = (θ1, θ2) simultaneously, but rather easy to find the MLE of one component with the other held fixed. Let θ̂1(θ2) denote the MLE of θ1 when θ2 is fixed.

I L(θ̂1(θ2), θ2) is a function of θ2 only, called the profile likelihood of θ2.

I The MLE of θ is given by θMLE = (θ̂1(θ̂2), θ̂2), where

θ̂2 = argmax_{θ2 : (θ̂1(θ2), θ2) ∈ Θ} L(θ̂1(θ2), θ2).

Proof. For any (θ1, θ2) ∈ Θ, it holds that L(θMLE) ≥ L(θ̂1(θ2), θ2) ≥ L(θ1, θ2).

9/95

Newton-Raphson Algorithm

It may not be possible to find the solution of the likelihood equation explicitly

when the likelihood equation is nonlinear.

I The Newton-Raphson method is an iteration scheme based on the linear approximation of the likelihood equation:

0 = ℓ̇(θ) ≈ ℓ̇(θOLD) + ℓ̈(θOLD)(θ − θOLD).

I The iteration scheme:

θNEW = θOLD − ℓ̈(θOLD)⁻¹ ℓ̇(θOLD).

I Convergence of the iteration: Newton-Kantorovich Theorem!
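A minimal numerical sketch of this iteration, applied to the Logistic(θ, 1) log-likelihood that appears later in these notes; the simulated data and the helper names (newton_raphson, score, hess) are illustrative assumptions, not part of the lecture.

    import numpy as np

    def newton_raphson(score, hess, theta0, tol=1e-8, max_iter=100):
        """Iterate theta_new = theta_old - hess(theta_old)^{-1} score(theta_old) until the step is tiny."""
        theta = np.atleast_1d(np.asarray(theta0, dtype=float))
        for _ in range(max_iter):
            step = np.linalg.solve(np.atleast_2d(hess(theta)), np.atleast_1d(score(theta)))
            theta = theta - step
            if np.max(np.abs(step)) < tol:
                break
        return theta

    # Logistic(theta, 1) sample: score and Hessian of the log-likelihood (see the logistic example below).
    rng = np.random.default_rng(0)
    x = rng.logistic(loc=2.0, scale=1.0, size=200)

    def score(theta):
        e = np.exp(-(x - theta[0]))
        return np.array([x.size - 2.0 * np.sum(e / (1.0 + e))])

    def hess(theta):
        e = np.exp(-(x - theta[0]))
        return np.array([[-2.0 * np.sum(e / (1.0 + e) ** 2)]])

    print(newton_raphson(score, hess, theta0=np.median(x)))  # close to the true value 2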

10/95

Newton-Kantorovich Theorem

Suppose that there exist constants α, β, γ and r such that 2αβγ < 1 and 2α < r, for which ℓ has a second derivative ℓ̈(θ) at all θ ∈ Br(θ[0]) that is invertible, and

(i) ‖ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])‖ ≤ α,
(ii) ‖ℓ̈(θ[0])⁻¹‖ ≤ β,
(iii) ‖ℓ̈(θ) − ℓ̈(θ′)‖ ≤ γ‖θ − θ′‖ for all θ, θ′ ∈ Br(θ[0]).

Then ℓ̇(θ) = 0 has a unique solution θ̂ in B2α(θ[0]). Furthermore, θ̂ can be approximated by the Newton-Raphson iterative method θ[k+1] = θ[k] − ℓ̈(θ[k])⁻¹ ℓ̇(θ[k]), k ≥ 0, which converges at a geometric rate:

‖θ[k] − θ̂‖ ≤ α 2^{−(k−1)} q^{2^k − 1},

where q = 2αβγ < 1.

11/95

Gradient Descent Algorithm

Sometimes the Newton-Raphson algorithm is unstable, especially when θ is high-dimensional. The gradient descent method is an iterative scheme for finding the minimal point of an objective function F. The method moves the current iterate slightly in the direction opposite to the gradient of F, i.e.,

θNEW = θOLD − γ ∇F(θOLD)

for γ small enough. Too small a γ takes too long to reach the minimum, and too large a γ may overshoot the minimal point. It is suggested to take a larger step at the beginning and a smaller step as the iteration goes on.

[Figure: contour plot of an objective function over coordinates (x1, x2), with the path of gradient descent steps toward the minimum.]
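A minimal sketch of the update rule with a decaying step size, as suggested above; the quadratic objective and the particular step-size schedule are illustrative assumptions, not from the lecture.

    import numpy as np

    def gradient_descent(grad, theta0, step0=0.05, decay=0.999, n_iter=1000):
        """theta_new = theta_old - gamma_k * grad(theta_old), with the step gamma_k shrinking over time."""
        theta = np.asarray(theta0, dtype=float)
        step = step0
        for _ in range(n_iter):
            theta = theta - step * grad(theta)
            step *= decay                      # larger steps early, smaller steps later
        return theta

    # Minimize F(x1, x2) = (x1 - 1)^2 + 10 * (x2 + 2)^2; its gradient is supplied explicitly.
    grad = lambda t: np.array([2.0 * (t[0] - 1.0), 20.0 * (t[1] + 2.0)])
    print(gradient_descent(grad, theta0=[0.0, 0.0]))   # approximately [1, -2]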

12/95

Coordinate Descent Algorithm

The minimization of a multivariable function F may be achieved by minimizing it along one coordinate at a time, i.e., by solving univariate optimization problems in a loop. The algorithm solves

θ[k+1]_j = argmin_{θj} F(θ[k+1]_1, . . . , θ[k+1]_{j−1}, θj, θ[k]_{j+1}, . . . , θ[k]_d), 1 ≤ j ≤ d.

It is easy to see that

F (θ[0]) ≥ F (θ[1]) ≥ F (θ[2]) ≥ . . . .
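A minimal sketch of one such loop, using a scalar minimizer for each coordinate update; scipy.optimize.minimize_scalar and the quadratic test function are assumptions made for illustration only.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def coordinate_descent(F, theta0, n_sweeps=50):
        """Each sweep updates every coordinate by a univariate minimization, holding the others fixed."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_sweeps):
            for j in range(theta.size):
                def F_j(t, j=j):
                    z = theta.copy()
                    z[j] = t
                    return F(z)
                theta[j] = minimize_scalar(F_j).x
        return theta

    # A convex quadratic with a cross term, so the coordinate updates interact.
    F = lambda t: (t[0] - 1.0) ** 2 + 10.0 * (t[1] + 2.0) ** 2 + t[0] * t[1]
    print(coordinate_descent(F, theta0=[0.0, 0.0]))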

13/95

Example: Poisson Model

Suppose that we observe X from Poisson(θ) for unknown θ ∈ Θ = (0,∞).

I The log-likelihood is maximized at θ = x ∈ Θ when x > 0. When x = 0,

ℓ(θ; 0) = −θ + x log θ − log x! = −θ

does not have a maximizer in (0, ∞). Thus, the MLE of θ does not exist when x = 0.

I If we let Θ = [0,∞), then the MLE exists for all x ≥ 0 and is given by

θMLE = X.

I Read #6.1.3 for an example of non-uniqueness of MLE.

14/95

Example: Bernoulli Model

Suppose that we observe a random sample X1, . . . , Xn from Bernoulli(θ) for

unknown θ ∈ [0, 1]. Then, θMLE = X.

I For 0 < ∑_{i=1}^n xi < n, θ = x is the unique MLE:

I For θ = 0 or θ = 1, L(θ) = 0;
I For θ ∈ (0, 1), L(θ) > L(0) = L(1) = 0 and

ℓ̇(θ) = ∑_{i=1}^n xi/θ − (n − ∑_{i=1}^n xi)/(1 − θ) = 0

has the solution θ = x, with

ℓ̈(θ) = −∑_{i=1}^n xi/θ² − (n − ∑_{i=1}^n xi)/(1 − θ)² < 0

for all 0 < θ < 1.

I For ∑_{i=1}^n xi = 0, L(θ) = (1 − θ)^n is maximized at θ = 0.
I For ∑_{i=1}^n xi = n, L(θ) = θ^n is maximized at θ = 1.

15/95

Example: Logistic Family of Distributions

Suppose that we observe a random sample X1, . . . , Xn from Logistic(θ, 1) for

unknown θ ∈ R that has a density function given by

f(x; θ) = exp(−(x − θ))/{1 + exp(−(x − θ))}² · I(−∞,∞)(x).

I Derivatives of the log-likelihood:

ℓ̇(θ) = n − 2 ∑_{i=1}^n exp(−(xi − θ))/{1 + exp(−(xi − θ))},
ℓ̈(θ) = −2 ∑_{i=1}^n exp(−(xi − θ))/{1 + exp(−(xi − θ))}² < 0, θ ∈ R.

I The likelihood equation ℓ̇(θ) = 0 has a unique solution, since ℓ̇ is strictly decreasing and ℓ̇(θ) → ∓n as θ → ±∞, respectively. Thus, this solution is the unique MLE.

16/95

Example: Family of Exponential Distributions

Suppose that we observe a random sample X1, . . . , Xn from Exp(θ) for

unknown θ > 0 that has a density function given by

f(x; θ) = θ⁻¹ e^{−x/θ} · I(0,∞)(x). It is easier to find the MLE of η = θ⁻¹ and then transform the MLE of η back to the MLE of θ = η⁻¹.

I Derivatives of the log-likelihood of η:

ℓ̇(η) = n/η − nx,   ℓ̈(η) = −n/η² < 0, η > 0.

I The likelihood equation ℓ̇(η) = 0 has the solution η = 1/x; thus it is the unique MLE of η.

I θMLE = X is the unique MLE of θ.

17/95

Example: Families of Double Exponential Distributions

Suppose that we observe a random sample X1, . . . , Xn from DE(θ, 1) for

unknown θ ∈ R that has a density function given by

f(x; θ) = (1/2) exp(−|x − θ|) · I(−∞,∞)(x).

In this case the likelihood is not differentiable. Let X(1) ≤ X(2) ≤ · · · ≤ X(n) be the order statistics of X1, . . . , Xn.

I The log-likelihood: ℓ(θ) = −∑_{i=1}^n |xi − θ| − n log 2.

I When n = 2m+ 1, θMLE = X(m+1) is the unique MLE.

I When n = 2m, θMLE = a for any a ∈ [X(m), X(m+1)].

I Thus, θMLE = med(Xi).

18/95

Let S(θ) = ∑_{i=1}^n |X(i) − θ|.

[Diagram: the real line with the order statistics X(1) ≤ X(2) ≤ · · · ≤ X(n) marked, splitting it into the intervals I1, I2, . . . , In+1.]

Let

I1 = (−∞, X(1)],
Ij = [X(j−1), X(j)], 2 ≤ j ≤ n ("closed intervals"),
In+1 = [X(n), ∞).

(i) S(θ) ≥ S(X(1)) for all θ ∈ I1, and S(θ) ≥ S(X(n)) for all θ ∈ In+1.
(ii) S(θ1) > S(θ2) for all θ1 < θ2 in Ij+1 if j < n − j;
S(θ1) < S(θ2) for all θ1 < θ2 in Ij+1 if j > n − j;
S(θ1) = S(θ2) for all θ1 < θ2 in Ij+1 if j = n − j.

19/95

Example: Families of Double Exponential Distributions

Suppose that we observe a random sample X1, . . . , Xn from DE(µ, σ) for unknown θ = (µ, σ) ∈ R × R+ that has a density function given by

f(x; θ) = (1/(2σ)) exp(−|x − µ|/σ) · I(−∞,∞)(x).

I The log-likelihood: ℓ(θ) = −∑_{i=1}^n |xi − µ|/σ − n log 2 − n log σ.

I Profiling approach: For each fixed σ, we know that µ̂(σ) = med(xi) maximizes ℓ(µ, σ), and this maximizer does not depend on σ. The profile likelihood equals

ℓ(med(xi), σ) = −∑_{i=1}^n |xi − med(xi)|/σ − n log 2 − n log σ,

which is uniquely maximized at σ = n⁻¹ ∑_{i=1}^n |xi − med(xi)|.

I θMLE = (med(Xi), n⁻¹ ∑_{i=1}^n |Xi − med(Xi)|).

20/95

Example: Normal Families of Distributions

Suppose that we observe a random sample X1, . . . , Xn from N(µ, 1) for

unknown µ ∈ R.

I The log-likelihood:

ℓ(µ) = −(n/2) log(2π) − (1/2) ∑_{i=1}^n (xi − x)² − (n/2)(x − µ)².

I The MLE of η = |µ| is ηMLE = |X|.

21/95

Example: Normal Families of Distributions

Suppose that we observe a random sample X1, . . . , Xn from N(µ, σ2) for

unknown θ = (µ, σ2) ∈ R× R+.

I The log-likelihood:

ℓ(θ) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^n (xi − x)² − (n/(2σ²))(x − µ)².

I Profiling approach: For each fixed σ², µ̂(σ²) = x maximizes ℓ(µ, σ²), and this maximizer does not depend on σ². The profile likelihood equals

ℓ(x, σ²) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^n (xi − x)²,

which is uniquely maximized at σ² = n⁻¹ ∑_{i=1}^n (xi − x)². Thus, θMLE = (X, n⁻¹ ∑_{i=1}^n (Xi − X)²).
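A minimal numerical check of this profiling argument (my own sketch, not from the notes): maximize the profile log-likelihood over σ² and compare with the closed form n⁻¹ ∑ (xi − x)².

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(1)
    x = rng.normal(loc=3.0, scale=2.0, size=500)
    n = x.size

    def profile_negloglik(log_s2):
        """Negative profile log-likelihood -l(x_bar, sigma^2), parametrized by log(sigma^2)."""
        s2 = np.exp(log_s2)
        return 0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(s2) + np.sum((x - x.mean()) ** 2) / (2 * s2)

    res = minimize_scalar(profile_negloglik)
    print(np.exp(res.x), np.mean((x - x.mean()) ** 2))   # the two values agree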

22/95

Consistency of MLE: Some Examples

I Let X1, . . . , Xn be a random sample from (i) Poisson(θ), θ ∈ Θ = [0,∞);

(ii) Bernoulli(θ), θ ∈ Θ = [0, 1]; (iii) Exp(θ), θ ∈ Θ = (0,∞). Then,

θMLE = X → θ in Pθ-probability as n → ∞ for all θ ∈ Θ, by the WLLN.

I Let X1, . . . , Xn be a random sample from a population with a distribution

function F that has a density f with respect to the Lebesgue measure

(continuous type). It can be shown that

√n (med(Xi) − F⁻¹(1/2)) →d N(0, 1/[4 f(F⁻¹(1/2))²]),

so that med(Xi) →p F⁻¹(1/2) for all continuous-type F. [Use the normal approximation of the Binomial distribution.]

23/95

Consistency of MLE: Some Examples

I Let X1, . . . , Xn be a random sample from DE(θ, 1), θ ∈ R. Then, by the

consistency of the sample median as an estimator of the population

median,

θMLE = med(Xi) → θ in Pθ-probability as n → ∞ for all θ ∈ R.

I Let X1, . . . , Xn be a random sample from U [0, θ], θ ∈ (0,∞). In this case

θMLE = X(n). Since

Eθ(X(n)) = (n/(n + 1)) θ → θ and varθ(X(n)) = (n/((n + 2)(n + 1)²)) θ² → 0,

we get that θMLE → θ in Pθ-probability as n → ∞ for all θ > 0.

24/95

Kullback-Leibler Divergence

I Kullback-Leibler divergence: Let P = {Pθ : θ ∈ Θ} be a statistical model for an observation X, and let f(·; θ) denote the density function of Pθ. The Kullback-Leibler divergence (of Pθ from Pθ0) is defined by

KL(θ, θ0) = −Eθ0 [log f(X; θ)/f(X; θ0)].

I Identifiability of θ: Pθ = Pθ0 implies θ = θ0. (This has been assumed to

hold in the estimation of θ so far!)

I Assume θ in P is identifiable and Pθ have common support, i.e.,

{x : f(x; θ) > 0} does not depend on θ ∈ Θ. Then,

KL(θ, θ0) ≥ 0 and KL(θ, θ0) = 0 if and only if θ = θ0.

[Use the inequality 1 + log z ≤ z for all z > 0 with “=” iff z = 1]
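A small numerical illustration of these two properties (my own sketch): for two Poisson distributions the divergence has the closed form KL(θ, θ0) = θ − θ0 + θ0 log(θ0/θ), which is nonnegative and vanishes only at θ = θ0.

    import numpy as np

    def kl_poisson(theta, theta0):
        """KL(theta, theta0) = -E_{theta0} log[f(X; theta) / f(X; theta0)] for the Poisson model."""
        return theta - theta0 + theta0 * np.log(theta0 / theta)

    theta0 = 2.0
    for t in np.linspace(0.5, 5.0, 10):
        print(f"theta = {t:4.2f}   KL = {kl_poisson(t, theta0):8.5f}")
    # KL >= 0 everywhere, and KL = 0 only at theta = theta0 = 2.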

25/95

Kullback-Leibler Divergence and MLE

Let X1, . . . , Xn be a random sample from Pθ in P = {Pθ : θ ∈ Θ}. Let f(·; θ)

denote the density function of Pθ. Assume θ in Θ is identifiable and Pθ have

common support. For a fixed θ0, define

Dn(θ) = −n⁻¹ ∑_{i=1}^n log f(Xi; θ)/f(Xi; θ0),   D0(θ) = KL(θ, θ0).

I θ0 is the unique minimizer of D0(θ) over Θ.

I θMLE is a minimizer of Dn(θ) when it exists.

I By the WLLN, Dn(θ) → D0(θ) in Pθ0-probability for all θ ∈ Θ.

I Does θMLE, the minimizer of Dn(θ), converge to θ0, the minimizer of D0(θ), in Pθ0-probability?

26/95

Kullback-Leibler Divergence and MLE

I In general, for a sequence of random functions Gn and a function G0

defined on Θ, the minimizer of Gn(θ) over Θ converges in probability to

the minimizer of G0(θ) over Θ if Θ is compact (bounded and closed), G0

is continuous on Θ, the minimizer of G0(θ) is unique, and

sup_{θ∈Θ} |Gn(θ) − G0(θ)| →p 0.

I Thus, if Θ is compact, D0(θ) is continuous on Θ for all θ0 ∈ Θ and

sup_{θ∈Θ} |Dn(θ) − D0(θ)| → 0 in Pθ0-probability for all θ0 ∈ Θ,

then θMLE is consistent.

27/95

Kullback-Leibler Divergence and MLE

I Convexity Lemma: In general, for a sequence of random functions Gn and

a function G0 defined on Θ, if Gn is a sequence of convex functions and

Θ is a convex set, then the uniform convergence

sup_{θ∈Θ} |Gn(θ) − G0(θ)| →p 0

is implied by the pointwise convergence

Gn(θ) − G0(θ) →p 0 for each θ ∈ Θ.

I MLE for a finite parameter space: Let P = {Pθ : θ ∈ Θ}, where Θ is

finite, i.e., Θ = {θ1, . . . , θk} for some k ≥ 1. Assume θ in P is identifiable

and Pθ have common support. Then, θMLE is consistent.

28/95

Sufficient Conditions for Consistency of MLE

I Suppose that we observe a random sample from Pθ in P = {Pθ : θ ∈ Θ}.

Assume that θ in Θ is identifiable and that Pθ have common support.

Assume also that the log-likelihood is twice differentiable, approaches −∞ on the boundary of Θ, and that ℓ̈(θ) is negative definite for all θ ∈ Θ. Then the MLE of θ, given by the unique solution of the likelihood equation ℓ̇(θ) = 0, is consistent. [See Theorem 6.1.3 and Corollary 6.1.1 in the text.]

I Logistic(θ, 1) example: The support of Pθ equals R, the log-likelihood ℓ(θ) → −∞ as θ → ±∞, and ℓ̈(θ) < 0 for all θ. Thus, the MLE given by the solution of ℓ̇(θ) = 0 is consistent.

29/95

6.1 Maximum Likelihood Estimation

6.2 Information Inequality and Efficiency

6.3 Maximum Likelihood Tests

30/95

Basic Regularity Assumptions

Let X1, · · · , Xn be a random sample from a distribution with pdf

f(·; θ), θ ∈ Θ ⊂ Rd, with respect to a measure µ. (µ is either the Lebesgue

measure or the counting measure)

(R0) The parameter θ is identifiable in Θ.

(R1) The densities f(·; θ), θ ∈ Θ, have common support X.

(R2) The parameter space is open in Rd.

(R3) The log-density log f(x; θ) is twice differentiable as a function of θ for all

x ∈ X .

(R4) For any statistic u(X1, . . . , Xn) with finite expectation, the integral

Eθ(u(X1, . . . , Xn)) = ∫_{X^n} u(x1, . . . , xn) ∏_{i=1}^n f(xi; θ) dµ(x1) · · · dµ(xn)

is twice differentiable under the integral sign.

31/95

Fisher Information

I The derivative of the log-density log f(x; θ) with respect to θ is called the score function. The larger the magnitude of the score function, the more information one has about θ.

I Fisher information (that an observation of X1 has about θ) is defined by

I1(θ) or IX1(θ) = varθ((∂/∂θ) log f(X1; θ)).

I Bartlett identity (first-order): Under (R0)–(R4), we get

Eθ((∂/∂θ) log f(X1; θ)) = 0.

I Bartlett identity (second-order): Under (R0)–(R4), we get

varθ((∂/∂θ) log f(X1; θ)) = Eθ(−(∂²/∂θ²) log f(X1; θ)).

32/95

Information Inequality (Cramer-Rao Lower Bound)

Let X1, · · · , Xn be a random sample from a distribution with pdf

f(·; θ), θ ∈ Θ ⊂ Rd, with respect to a measure µ, that satisfies (R0)–(R4).

Assume further that I1(θ) is invertible. Then, for any statistic Un ≡ u(X1, . . . , Xn) with finite second moment, it holds that

varθ(Un) ≥ ((∂/∂θ) Eθ Unᵀ)ᵀ (n I1(θ))⁻¹ ((∂/∂θ) Eθ Unᵀ)

for all θ ∈ Θ. In case Un is multivariate, A ≥ B for matrices A and B should be interpreted as A − B being nonnegative definite.

33/95

Proof of Information Inequality

Let Vn = ∑_{i=1}^n (∂/∂θ) log f(Xi; θ). Consider first the case where Un is univariate. For this case, we get

varθ(Un) ≥ max_{a≠0} (aᵀ covθ(Vn, Un))²/(aᵀ varθ(Vn) a) = covθ(Vn, Un)ᵀ varθ(Vn)⁻¹ covθ(Vn, Un).

In case Un is multivariate, replace Un in the above inequality by tᵀUn to get

tᵀ varθ(Un) t ≥ tᵀ covθ(Vn, Un)ᵀ varθ(Vn)⁻¹ covθ(Vn, Un) t for all t.

The inequality then follows since varθ(Vn) = n I1(θ) and

covθ(Vn, Un) = Eθ(Vn Unᵀ) = (∂/∂θ) Eθ Unᵀ,

due to the first-order Bartlett identity and (R4).

34/95

Lower Bound on Variance of Unbiased Estimator

Let X1, · · · , Xn be a random sample from a distribution with pdf

f(·; θ), θ ∈ Θ ⊂ Rd, with respect to a measure µ, that satisfies (R0)–(R4).

Assume further that I1(θ) is invertible. Consider the problem of estimating

g(θ) for a smooth function g, which may be a vector of real-valued functions.

Then, for any unbiased estimator η̂ of η ≡ g(θ) with finite second moment, it holds that

varθ(η̂) ≥ ((∂/∂θ) g(θ)ᵀ)ᵀ (n I1(θ))⁻¹ ((∂/∂θ) g(θ)ᵀ)

for all θ ∈ Θ.

35/95

MLE and Information Bound: Poisson Model

Suppose that we observe a random sample X1, . . . , Xn from Poisson(θ) for

unknown θ ∈ Θ = (0,∞).

I Writing ℓ1(θ) ≡ log f(x; θ) = −θ + x log θ − log x!, we have

ℓ̇1(θ) = −1 + x/θ,   ℓ̈1(θ) = −x/θ².

I Thus, I1(θ) = θ⁻¹, so that for any unbiased estimator θ̂ of θ we get

varθ(θ̂) ≥ (n I1(θ))⁻¹ = θ/n for all θ > 0.

I θMLE = X is unbiased with varθ(θMLE) = θ/n. Thus, θMLE has the minimal variance among all unbiased estimators of θ; it is the so-called Uniformly Minimum Variance Unbiased Estimator (UMVUE) of θ.

36/95

MLE and Information Bound: Bernoulli Model

Suppose that we observe a random sample X1, . . . , Xn from Bernoulli(θ) for

unknown θ ∈ Θ = (0, 1).

I We have

ℓ̇1(θ) = x/θ − (1 − x)/(1 − θ),   ℓ̈1(θ) = −x/θ² − (1 − x)/(1 − θ)².

I Thus, I1(θ) = θ⁻¹(1 − θ)⁻¹, so that for any unbiased estimator θ̂ of θ we get

varθ(θ̂) ≥ (n I1(θ))⁻¹ = θ(1 − θ)/n for all 0 < θ < 1.

I The estimator X, which is the MLE when 0 < X < 1, is unbiased with varθ(X) = θ(1 − θ)/n. Thus, X is the UMVUE of θ.

37/95

MLE and Information Bound: Bernoulli Model

Now, consider the estimation of η = θ(1− θ).

I Since ∂η/∂θ = 1 − 2θ, we get that, for any unbiased estimator η̂ of η,

varθ(η̂) ≥ (1 − 2θ)²(n I1(θ))⁻¹ = θ(1 − θ)(1 − 2θ)²/n for all 0 < θ < 1.

I For the estimator η̂ = X(1 − X), which is the MLE of η when 0 < X < 1,

Eθ(η̂) = ((n − 1)/n) · θ(1 − θ) = ((n − 1)/n) · η,
varθ(η̂) = n⁻¹ θ(1 − θ)(1 − 2θ)² + o(n⁻¹),

with the o(n⁻¹) term equal to 5n⁻² θ²(1 − θ)(3 − θ) + n⁻³ θ(1 − θ)(6θ² − 6θ + 1).

I The bias-corrected estimator n η̂/(n − 1) is unbiased and has variance

(n/(n − 1)²) · θ(1 − θ)(1 − 2θ)² > θ(1 − θ)(1 − 2θ)²/n

unless θ = 1/2, but its relative efficiency to the CR lower bound approaches one as n → ∞.

38/95

MLE and Information Bound: Beta Model

Suppose that we observe a random sample X1, . . . , Xn from Beta(θ, 1) for

unknown θ > 0.

I We have ℓ̇1(θ) = 1/θ + log x and ℓ̈1(θ) = −1/θ².

I Thus, I1(θ) = 1/θ², so that for any unbiased estimator θ̂ of θ we get

varθ(θ̂) ≥ (n I1(θ))⁻¹ = θ²/n for all θ > 0.

I θMLE = −n/(∑_{i=1}^n log Xi), and using the fact that the −θ log Xi are i.i.d. Exp(1), we get

Eθ(θMLE) = (n/(n − 1)) θ,   varθ(θMLE) = (n²/((n − 1)²(n − 2))) θ².

I Again, the unbiased estimator (n − 1)θMLE/n has variance θ²/(n − 2), which is strictly larger than the CR lower bound, but its relative efficiency to the lower bound converges to one as n → ∞.

39/95

MLE and Information Bound: Normal Model

Suppose that we observe a random sample X1, . . . , Xn from N(µ, σ2) for

unknown θ = (µ, σ2) ∈ R× R+.

I The derivatives of the log-density:

ℓ̇1(θ) = ( (x − µ)/σ²,  (x − µ)²/(2σ⁴) − 1/(2σ²) )ᵀ,

ℓ̈1(θ) = [ −1/σ²          −(x − µ)/σ⁴
           −(x − µ)/σ⁴    −(x − µ)²/σ⁶ + 1/(2σ⁴) ].

I The Fisher information about θ from a single observation:

I1(θ) = Eθ(−ℓ̈1(θ)) = [ 1/σ²   0
                        0      1/(2σ⁴) ].

40/95

MLE and Information Bound: Normal Model

I For any unbiased estimators µ̂ and σ̂² of µ and σ²,

varθ(µ̂) ≥ (1, 0)(n I1(θ))⁻¹(1, 0)ᵀ = σ²/n,
varθ(σ̂²) ≥ (0, 1)(n I1(θ))⁻¹(0, 1)ᵀ = 2σ⁴/n.

I For the unbiased estimators X and S² = ∑_{i=1}^n (Xi − X)²/(n − 1) of µ and σ², respectively, we get

varθ(X) = σ²/n,   varθ(S²) = 2σ⁴/(n − 1).

I Thus, X achieves the minimal variance and S2 has a variance slightly

larger than the lower bound. Later we will see that S2 also achieves the

minimal variance among all unbiased estimators. This means that the CR

lower bound may not be sharp!

41/95

MLE and Information Bound: Normal Model

I The lower bounds for µ and σ2 do not change under the sub-models where

one of µ and σ2 is known (given). The main reason is that the covariance

of the score functions, i.e., the off-diagonal term of I1(θ), equals zero.

I In general, for a nonsingular symmetric matrix

A = [ A11  A12
      A21  A22 ],

it holds that

(A⁻¹)₁₁ = A11⁻¹ + A11⁻¹ A12 (A22 − A21 A11⁻¹ A12)⁻¹ A21 A11⁻¹ ≥ A11⁻¹,

and the equality holds if A12 = 0.

42/95

Asymptotic Efficiency

I Unbiased estimators: Typically, the CR lower bound for unbiased

estimators is asymptotically sharp. Thus, the asymptotic efficiency of an

unbiased estimator η̂ of a real-valued parameter η = g(θ) is defined by

eff(η̂) = lim_{n→∞} ((∂/∂θ)g(θ))ᵀ (n I1(θ))⁻¹ ((∂/∂θ)g(θ)) / varθ(η̂).

I For estimators η̂ with varθ(η̂) ≥ C n⁻¹ for some constant C > 0 and (∂/∂θ) biasη̂(θ) → 0 as n → ∞, we get

varθ(η̂) ≥ ((∂/∂θ)g(θ) + o(1))ᵀ (n I1(θ))⁻¹ ((∂/∂θ)g(θ) + o(1))
= ((∂/∂θ)g(θ))ᵀ (n I1(θ))⁻¹ ((∂/∂θ)g(θ)) + o(n⁻¹),

where biasη̂(θ) = Eθ(η̂) − η. Thus, the asymptotic efficiency eff(η̂) in terms of variance defined above may be extended to these estimators.

43/95

Asymptotic Efficiency

I Lower bound in terms of mean squared error:

MSEθ(η̂) ≥ biasη̂(θ)² + ((∂/∂θ)g(θ))ᵀ (n I1(θ))⁻¹ ((∂/∂θ)g(θ)) + o(n⁻¹).

I For those estimators with varθ(η̂) ≥ C n⁻¹ for some constant C > 0, (∂/∂θ) biasη̂(θ) → 0 as n → ∞ and biasη̂(θ) = o(√varθ(η̂)), it follows that

(lower bound for MSE)/MSEθ(η̂) = (eff(η̂) + o(1))/(1 + o(1)) = eff(η̂) + o(1).

Thus, eff(η̂) may also serve as a definition of the asymptotic efficiency in terms of mean squared error for these estimators.

44/95

Examples: Asymptotic Efficiency of MLE

I Bernoulli model with Θ = (0, 1): Note first that Pθ(θMLE = X) ≥ Pθ(0 < X < 1) → 1 for all θ ∈ (0, 1). For the MLE η̂ = X(1 − X) of η = θ(1 − θ), we get

biasη̂(θ) = −θ(1 − θ)/n,   varθ(η̂) = θ(1 − θ)(1 − 2θ)²/n + o(n⁻¹).

Thus, eff(η̂) = 1 unless θ = 1/2, both in terms of variance and mean squared error.

I Beta model with Θ = (0, ∞): For the MLE θ̂ = −n/(∑_{i=1}^n log Xi) of θ, we have

biasθ̂(θ) = θ/(n − 1),   varθ(θ̂) = θ²/n + o(n⁻¹).

Thus, eff(θ̂) = 1, both in terms of variance and mean squared error.

45/95

Fisher’s Wrong Conjecture

(1) √n(θMLE − θ) →d N(0, I1(θ)⁻¹).

(2) For any estimator θ̂ such that √n(θ̂ − θ) →d N(0, v(θ)), it follows that v(θ) ≥ I1(θ)⁻¹ for all θ ∈ Θ. (R. A. Fisher, 1925)

I The conjecture (1) is true under some regularity conditions on the model,

while (2) may not be true even under the regularity conditions on the

model. (J. L. Hodges, Jr., 1951)

I Among regular estimators of θ the conjecture (2) was found to be true

under some regularity conditions on the model. (L. Le Cam, 1953)

I Asymptotically efficient estimator: If there exists a regular estimator θ̂ such that √n(θ̂ − θ) →d N(0, I1(θ)⁻¹), then the estimator is said to be asymptotically efficient.

46/95

Further Regularity Assumptions

Let X1, · · · , Xn be a random sample from a distribution with pdf

f(·; θ), θ ∈ Θ ⊂ Rd, with respect to a measure µ. Under the following

assumptions, in addition to (R0)–(R4), we may show that the MLE (precisely, the solution θ̂ of the likelihood equation) satisfies √n(θ̂ − θ) →d N(0, I1(θ)⁻¹).

(R5) For all θ0 ∈ Θ, there exist δ0 > 0 and M(·) with Eθ0M(X1) <∞ such

that

max_{θ : |θ−θ0| ≤ δ0} |(∂³/∂θi ∂θj ∂θk) log f(X; θ)| ≤ M(X).

(R6) The likelihood equation ℓ̇(θ) = 0 has a unique solution θ̂, and this solution is a consistent estimator of θ.

(R7) The Fisher information I1(θ) exists and is non-singular for all θ ∈ Θ.

47/95

Asymptotic Normality of MLE

I Theorem: Under the assumptions (R0)–(R7),

√n(θ̂ − θ) →d N(0, I1(θ)⁻¹) under Pθ for all θ ∈ Θ.

I Proof of the theorem: For simplicity, consider only the case Θ ⊂ R, i.e., d = 1. The multi-dimensional extension only involves more notational complexity. Let Sj(θ) denote the jth derivative of the scaled log-likelihood n⁻¹ℓ(θ). The theorem follows immediately from the facts that, for any θ0 ∈ Θ,

(1) 0 = S1(θ0) + (S2(θ0) + (θ̂ − θ0) S3(θ∗)/2)(θ̂ − θ0),
(2) S3(θ∗) = Op(1) under Pθ0,
(3) S2(θ0) = −I1(θ0) + op(1) under Pθ0,
(4) √n S1(θ0) →d N(0, I1(θ0)) under Pθ0.

48/95

Asymptotic Normality of MLE

Fact (1) is simply a Taylor expansion, (3) follows from the WLLN and (4) from the CLT. For the proof of (2), we observe that

limsup_{n→∞} Pθ0(|S3(θ∗)| > C)
≤ limsup_{n→∞} Pθ0(|S3(θ∗)| > C, |θ̂ − θ0| ≤ δ0)   (by the consistency of θ̂ in (R6))
≤ limsup_{n→∞} Pθ0( n⁻¹ ∑_{i=1}^n M(Xi) > C )
≤ C⁻¹ Eθ0 M(X1) → 0 as C → ∞,

where the last inequality is Markov's inequality.

49/95

Asymptotic Normality of MLE

I Asymptotic linearity of the MLE: Under the assumptions (R0)–(R7),

√n(θ̂ − θ) = I1(θ)⁻¹ √n S1(θ) + op(1) under Pθ for all θ ∈ Θ.

I Function of θ: Let η = g(θ) for a smooth function g and let η̂ = g(θ̂) for the θ̂ in (R6). Under the assumptions (R0)–(R7), it follows that

√n(η̂ − g(θ)) →d N(0, ġ(θ)ᵀ I1(θ)⁻¹ ġ(θ)) under Pθ for all θ ∈ Θ,

where ġ(θ) = (∂/∂θ) g(θ)ᵀ.

50/95

Examples: Asymptotic Normality of MLE

I Bernoulli model with Θ = (0, 1): Recall I1(θ) = θ⁻¹(1 − θ)⁻¹ and that Pθ(θMLE = X) ≥ Pθ(0 < X < 1) → 1 for all θ ∈ (0, 1). By the CLT we get √n(θMLE − θ) →d N(0, θ(1 − θ)).

I Poisson model with Θ = (0, ∞): Again, Pθ(θMLE = X) ≥ Pθ(X > 0) → 1 for all θ > 0. Recall I1(θ) = θ⁻¹. By the CLT we get √n(θMLE − θ) →d N(0, θ).

I Exponential model with Θ = (0, ∞): Recall θMLE = X and note I1(θ) = θ⁻². By the CLT, √n(θMLE − θ) →d N(0, θ²).

51/95

Examples: Asymptotic Normality of MLE

I Beta model with Θ = (0, ∞): Recall that −1/θMLE = n⁻¹ ∑_{i=1}^n log Xi and I1(θ) = θ⁻². By the CLT, and since Eθ(log X1) = −θ⁻¹ and varθ(log X1) = θ⁻²,

√n ( n⁻¹ ∑_{i=1}^n log Xi + θ⁻¹ ) →d N(0, θ⁻²).

Thus, by the delta method, √n(θMLE − θ) →d N(0, θ²).

I Double exponential model DE(θ, 1) with Θ = R: Note that θ is the median of the distribution Pθ, f(θ; θ) = 1/2 and θMLE = med(Xi), so that √n(θMLE − θ) →d N(0, 1) (see the earlier slide on the consistency of the sample median).

52/95

Examples: Asymptotic Normality of MLE

I Double exponential model (continued): The density f(x; θ) = e^{−|x−θ|}/2 is not differentiable (pointwise) as a function of θ, but it is differentiable in the L2 sense (Frechet differentiability), with derivative (∂/∂θ)f(x; θ) = sgn(x − θ) e^{−|x−θ|}/2:

lim_{h→0} ∫_{−∞}^{∞} ( h⁻¹(e^{−|x−θ−h|}/2 − e^{−|x−θ|}/2) − sgn(x − θ) e^{−|x−θ|}/2 )² dx = 0.

With this derivative, I1(θ) ≡ 1.

I Remark: Indeed, the notion of Fisher information is generalized to models

with Frechet differentiability and the regularity conditions (R0)–(R7) may

be relaxed based on the generalization.

53/95

Examples: Asymptotic Normality of MLE

I Logistic(θ, 1) model with Θ = R: We know that θMLE exists and is unique, but we do not have an explicit form for it. One may check that all the regularity conditions (R0)–(R7) are met, so that √n(θMLE − θ) →d N(0, I1(θ)⁻¹). Now,

ℓ̇1(θ) = (1 − e^{−(x−θ)})/(1 + e^{−(x−θ)}),

I1(θ) ≡ ∫_{−∞}^{∞} ((1 − e^{−x})/(1 + e^{−x}))² · e^{−x}/(1 + e^{−x})² dx = 1/3.

Thus, we conclude that √n(θMLE − θ) →d N(0, 3).

I Uniform(0, θ) model with θ > 0 (non-regular model): Note that θMLE = X(n), and in this case

n(θMLE − θ) →d −Exp(θ).

54/95

Examples: Asymptotic Normality of MLE

I Multinomial model: Let X1, . . . , Xn be i.i.d. d-dimensional observations from Multinomial(p1, . . . , pd), 0 < pj < 1, ∑_{j=1}^d pj = 1. Write X = (X1, . . . , Xd)ᵀ, x = (x1, . . . , xd) for a realization x of Xi, and θ = (p1, . . . , pd−1)ᵀ. Note that

θMLE = (X̄1, . . . , X̄d−1)ᵀ  (the vector of sample proportions)

with probability tending to one, under Pθ for all θ ∈ Θ. By the CLT,

√n(θMLE − θ) →d N(0, diag(θj) − θθᵀ).

Observe that

ℓ̇1(θ; x) = (x1/θ1 − xd/θd, . . . , xd−1/θd−1 − xd/θd)ᵀ,
ℓ̈1(θ; x) = −diag(xj/θj²) − (xd/θd²) · 11ᵀ,
I1(θ) = diag(1/θj) + (1/θd) · 11ᵀ,   I1(θ)⁻¹ = diag(θj) − θθᵀ.

55/95

Asymptotic Relative Efficiency

I For two estimators θ̂1 and θ̂2 of θ that are asymptotically normal, √n(θ̂j − θ) →d N(0, vj(θ)), the asymptotic relative efficiency of θ̂1 against θ̂2 is defined by

ARE(θ̂1, θ̂2) = v2(θ)/v1(θ).

I MLE against sample mean or sample median: When f(x; θ) = f0(x − θ),

f0               N(0, 1)   Logistic(0, 1)   DE(0, 1)
against mean        1          π²/9             2
against median     π/2          4/3             1

For the logistic model, we have used the fact that ∑_{k=1}^∞ k⁻² = π²/6, so that

∫_{−∞}^{∞} x² e^{−x}(1 + e^{−x})⁻² dx = 4 ∫_0^∞ x e^{−x}(1 + e^{−x})⁻¹ dx
= 4 ∑_{k=1}^∞ (−1)^{k−1} k⁻¹ ∫_0^∞ e^{−kx} dx = 4 ∑_{k=1}^∞ (−1)^{k−1}/k² = π²/3.

56/95

One-Step Approximation of MLE

We have seen that the MLE may exist but may not have a closed form when the likelihood equation is nonlinear. In that case, we have learned that the Newton-Raphson method may be used to approximate the MLE. The following theorem tells us that, if the initial choice θ[0] is well located (precisely, if it lies in an n^{−1/2}-neighborhood of the true θ in probability), then the one-step update in the Newton-Raphson iteration is enough, i.e., the estimator

θ[1] = θ[0] − ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])

has the same first-order properties as the MLE.

57/95

One-Step Approximation of MLE

I Theorem: Assume the assumptions (R0)–(R7) and that θ[0] = θ + Op(n^{−1/2}) under Pθ for all θ ∈ Θ. Then,

√n(θ[1] − θ) →d N(0, I1(θ)⁻¹) under Pθ for all θ ∈ Θ.

I Proof of the theorem: For simplicity again, consider the case Θ ⊂ R. Let Sj(θ) be the jth derivative of the scaled log-likelihood n⁻¹ℓ(θ). Then,

√n(θ[1] − θ) = √n(θ[0] − θ) + (−S2(θ[0]))⁻¹ · √n S1(θ[0])
= √n(θ[0] − θ) + (−S2(θ) − S3(θ∗)(θ[0] − θ))⁻¹ · ( √n S1(θ) + S2(θ)√n(θ[0] − θ) + S3(θ∗∗)√n(θ[0] − θ)²/2 ).

58/95

One-Step Approximation of MLE

I Proof of the theorem (continued): Since S3(θ∗), S3(θ∗∗) and √n(θ[0] − θ) are all Op(1) (so that θ[0] − θ = op(1)), and S2(θ) = −I1(θ) + op(1), we obtain

√n(θ[1] − θ) = √n(θ[0] − θ) + (I1(θ) + op(1))⁻¹ · ( √n S1(θ) − I1(θ)√n(θ[0] − θ) + op(1) )
= I1(θ)⁻¹ · √n S1(θ) + op(1) →d N(0, I1(θ)⁻¹)

under Pθ for all θ ∈ Θ.

59/95

One-step Approximation of MLE: Logistic Model

Suppose that we observe a random sample X1, . . . , Xn from Logistic(θ, 1) for unknown θ ∈ R. We have

ℓ(θ) = −nX + nθ − 2 ∑_{i=1}^n log(1 + exp(−(Xi − θ))),
ℓ̇(θ) = n − 2 ∑_{i=1}^n exp(−(Xi − θ))(1 + exp(−(Xi − θ)))⁻¹,
ℓ̈(θ) = −2 ∑_{i=1}^n exp(−(Xi − θ))(1 + exp(−(Xi − θ)))⁻².

With θ[0] = X, which is √n-consistent, the one-step estimator

θ[1] = θ[0] − ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])

is an asymptotically efficient estimator of θ.
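A minimal sketch of this one-step estimator on simulated data (my own illustration; the variable names are not from the notes).

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.logistic(loc=1.5, scale=1.0, size=400)   # true theta = 1.5
    n = x.size

    def score(theta):                  # likelihood-equation left-hand side (first derivative)
        e = np.exp(-(x - theta))
        return n - 2.0 * np.sum(e / (1.0 + e))

    def hess(theta):                   # second derivative of the log-likelihood
        e = np.exp(-(x - theta))
        return -2.0 * np.sum(e / (1.0 + e) ** 2)

    theta0 = x.mean()                                # sqrt(n)-consistent starting value
    theta1 = theta0 - score(theta0) / hess(theta0)   # one Newton-Raphson step
    print(theta0, theta1)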

60/95

One-step Approximation of MLE: Gamma Model

Suppose that we observe a random sample X1, . . . , Xn from Gamma(α, β) for

unknown α, β > 0. Let θ = (α, β)ᵀ.

I Finding a √n-consistent estimator of θ: Try the MME. Since Eθ(X1) = αβ and varθ(X1) = αβ², we get

αMME = ( n⁻¹ ∑_{i=1}^n (Xi − X)² )⁻¹ X²,   βMME = n⁻¹ ∑_{i=1}^n (Xi − X)² · (1/X).

It can be shown that θMME = θ + Op(n^{−1/2}) under Pθ for all θ ∈ Θ.

I One-step MLE: With θ[0] = θMME, the one-step estimator

θ[1] = θ[0] − ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])

is an asymptotically efficient estimator of θ.
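A minimal sketch of this MME-plus-one-Newton-step recipe (my own illustration; it assumes the shape-scale parametrization above, and scipy.special supplies the digamma/trigamma functions needed for the score and Hessian).

    import numpy as np
    from scipy.special import digamma, polygamma

    rng = np.random.default_rng(3)
    x = rng.gamma(shape=2.5, scale=1.8, size=500)     # true (alpha, beta) = (2.5, 1.8)
    n, xbar, s2 = x.size, x.mean(), x.var()           # np.var uses the 1/n convention

    # Method-of-moments start: E(X) = alpha*beta, var(X) = alpha*beta^2.
    theta0 = np.array([xbar ** 2 / s2, s2 / xbar])

    def score(t):
        a, b = t
        return np.array([np.sum(np.log(x)) - n * np.log(b) - n * digamma(a),
                         np.sum(x) / b ** 2 - n * a / b])

    def hess(t):
        a, b = t
        return np.array([[-n * polygamma(1, a), -n / b],
                         [-n / b, n * a / b ** 2 - 2.0 * np.sum(x) / b ** 3]])

    theta1 = theta0 - np.linalg.solve(hess(theta0), score(theta0))   # one-step MLE
    print(theta0, theta1)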

61/95

6.1 Maximum Likelihood Estimation

6.2 Information Inequality and Efficiency

6.3 Maximum Likelihood Tests

62/95

Idea of Maximum Likelihood Test

Suppose that we observe a random sample X1, . . . , Xn from a distribution Pθ

with pdf f(·; θ), θ ∈ Θ.

I The problem: Testing

H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1

at a significance level 0 < α < 1, where Θ0 is a subset of Θ and

Θ1 = Θ \Θ0.

I Fisher’s idea: Compare the maximum likelihoods in Θ0 and Θ1. For a

given observation x ≡ (x1, . . . , xn) of X ≡ (X1, . . . , Xn), reject H0 if

max_{θ∈Θ1} L(θ; x) / max_{θ∈Θ0} L(θ; x) ≥ c

for a critical value c that is determined by the level α as follows.

63/95

Idea of Maximum Likelihood Test

I Determination of critical value: Choose c such that (the size of the test) = α, i.e.,

sup_{θ∈Θ0} Pθ( max_{θ∈Θ1} L(θ; X) / max_{θ∈Θ0} L(θ; X) > c ) = α.

I The maximization of the likelihood in a restricted set, such as Θ1, is more

involved than the maximization in Θ.

I Note that, for c > 1,

R0 ≡ max_{θ∈Θ1} L(θ; X) / max_{θ∈Θ0} L(θ; X) ≥ c
⇔ R ≡ max_{θ∈Θ} L(θ; X) / max_{θ∈Θ0} L(θ; X) = max{ 1, max_{θ∈Θ1} L(θ; X) / max_{θ∈Θ0} L(θ; X) } ≥ c.

If there exists c > 0 such that sup_{θ∈Θ0} Pθ(R ≥ c) = α, then c > 1 since R ≥ 1 always and α < 1. This means that the LRT may be based on R, rather than R0.

64/95

Likelihood Ratio Test

Let θΘ and θΘ0 denote the MLEs in Θ and Θ0, respectively. The likelihood ratio test (LRT) of H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 rejects H0 when

max_{θ∈Θ} L(θ; x) / max_{θ∈Θ0} L(θ; x) ≥ c,

or equivalently when

2(ℓ(θΘ; x) − ℓ(θΘ0; x)) ≥ c′,

where c and c′ are determined by the given level α.

Remark: When the LR test statistic 2(ℓ(θΘ; X) − ℓ(θΘ0; X)) has a distribution on a discrete set, there may not exist c′ such that (the size of the test) = α. In that case one may randomize to meet the level condition.

65/95

Examples of LRT: Normal Mean

Suppose that we observe a random sample X1, . . . , Xn from N(θ, σ2) with

known σ², θ ∈ R, and want to test H0 : θ = θ0 against H1 : θ ≠ θ0 at a level

0 < α < 1.

I Log-likelihood:

ℓ(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (xi − x)² − (n/(2σ²))(x − θ)².

I Maximization of likelihood: θΘ = x and θΘ0 = θ0.

I Likelihood ratio:

2(ℓ(θΘ; x) − ℓ(θΘ0; x)) = 2(ℓ(x; x) − ℓ(θ0; x)) = (n/σ²)(x − θ0)² = ((x − θ0)/(σ/√n))².

66/95

Examples of LRT: Normal Mean

I Rejection region: Reject H0 if |(x − θ0)/(σ/√n)| ≥ c.

I Determination of critical value c:

Pθ0( |(X − θ0)/(σ/√n)| ≥ zα/2 ) = α.

I Power function:

γ(θ) ≡ Pθ( |(X − θ0)/(σ/√n)| ≥ zα/2 )
= Φ(−zα/2 − (θ − θ0)/(σ/√n)) + Φ(−zα/2 + (θ − θ0)/(σ/√n)).

I The power function is symmetric about θ0, takes the minimal value α at

θ = θ0 and increases to one as θ gets away from θ0.

67/95

Examples of LRT: Exponential Mean

Suppose that we observe a random sample X1, . . . , Xn from Exp(θ) with mean

θ > 0, and want to test H0 : θ = θ0 against H1 : θ ≠ θ0 at a level 0 < α < 1.

I Log-likelihood: ℓ(θ) = −n log θ − nx/θ.

I Maximization of likelihood: θΘ = x and θΘ0 = θ0.

I Likelihood ratio:

2(ℓ(θΘ; x) − ℓ(θΘ0; x)) = 2n(x/θ0 − log(x/θ0) − 1).

I Rejection region: x/θ0 ≤ c1 or x/θ0 ≥ c2, with c1 − log c1 = c2 − log c2.

[Figure: the function t ↦ t − log t with a horizontal cutoff level λ, whose two crossing points give c1 < c2.]

68/95

Examples of LRT: Exponential Mean

I Determination of critical values: Since 2 ∑_{i=1}^n Xi/θ0 =d χ²(2n) under Pθ0, we choose c1 and c2 such that

c1 − log c1 = c2 − log c2 and ∫_{2nc1}^{2nc2} pdfχ²(2n)(y) dy = 1 − α.

I Approximation of c1 and c2:

χ²_{1−α/2}(2n)/(2n) = 1 − zα/2/√n + o(n^{−1/2}),
χ²_{α/2}(2n)/(2n) = 1 + zα/2/√n + o(n^{−1/2}).

I Power function:

γ(θ) ≡ 1 − Pθ( c1 ≤ X/θ0 ≤ c2 ) = F2n(2n c1 θ0/θ) + F̄2n(2n c2 θ0/θ),

where F2n(·) is the cdf of χ²(2n) and F̄2n = 1 − F2n.
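A minimal numerical sketch for solving the two conditions for c1 and c2 (my own illustration; it assumes scipy for the chi-square cdf and a scalar root finder).

    import numpy as np
    from scipy.stats import chi2
    from scipy.optimize import brentq

    n, alpha = 20, 0.05
    g = lambda t: t - np.log(t)        # level sets of g give the pair (c1, c2)

    def c2_of_c1(c1):
        # solve g(c2) = g(c1) for the root on the other side of the minimum at t = 1
        return brentq(lambda t: g(t) - g(c1), 1.0, 50.0)

    def coverage_gap(c1):
        c2 = c2_of_c1(c1)
        return chi2.cdf(2 * n * c2, df=2 * n) - chi2.cdf(2 * n * c1, df=2 * n) - (1 - alpha)

    c1 = brentq(coverage_gap, 1e-6, 1 - 1e-6)
    c2 = c2_of_c1(c1)
    print(c1, c2)    # reject H0 when xbar/theta0 <= c1 or xbar/theta0 >= c2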

69/95

Examples of LRT: Normal Mean With Unknown Variance

Suppose that we observe a random sample X1, . . . , Xn from N(µ, σ2),

θ = (µ, σ²) ∈ R × R+, and want to test H0 : µ = µ0 against H1 : µ ≠ µ0 at a

level 0 < α < 1.

I Log-likelihood:

ℓ(θ) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^n (xi − µ)².

I Maximization of likelihood: θΘ = (x, σ̂²) and θΘ0 = (µ0, σ̂0²), where σ̂² = n⁻¹ ∑_{i=1}^n (xi − x)² and σ̂0² = n⁻¹ ∑_{i=1}^n (xi − µ0)².

I Likelihood ratio: Since σ̂0² = σ̂² + (x − µ0)², we get

2(ℓ(θΘ; x) − ℓ(θΘ0; x)) = n log(1 + (x − µ0)²/σ̂²).

70/95

Examples of LRT: Normal Mean With Unknown Variance

I Rejection region: Let s² = (n − 1)⁻¹ ∑_{i=1}^n (xi − x)². Reject H0 if |(x − µ0)/(s/√n)| ≥ c.

I Determination of critical value c:

Pθ0( |(X − µ0)/(S/√n)| ≥ tα/2(n − 1) ) = α.

I Wilks' phenomenon: Since Tn ≡ (X − µ0)/(S/√n) =d t(n − 1) under H0 and t(n − 1) →d Z =d N(0, 1), we get

2(ℓ(θΘ; X) − ℓ(θΘ0; X)) = n log(1 + Tn²/(n − 1)) = Tn² + op(1) →d Z² =d χ²(1)

under H0. Note that (d.f.) = dim(Θ) − dim(Θ0) = 2 − 1 = 1.

71/95

Examples of LRT: Normal Mean With Unknown Variance

I Power function: With δ = √n(µ − µ0)/σ,

γ(δ) ≡ Pθ( |(X − µ0)/(S/√n)| ≥ tα/2(n − 1) ) = P( |(Z + δ)/√(V/(n − 1))| ≥ tα/2(n − 1) ),

where Z =d N(0, 1) and V =d χ²(n − 1) are independent.

I The power function is symmetric about δ = 0, takes the minimal value α

at δ = 0 and increases to one as δ gets away from 0.

72/95

Examples of LRT: Two Normal Means With Common Variance

Let X1, . . . , Xn1 and Y1, . . . , Yn2 be random samples, respectively, from

N(µ1, σ²) and N(µ2, σ²), θ = (µ1, µ2, σ²) ∈ R² × R+. Assume that

(X1, . . . , Xn1) and (Y1, . . . , Yn2) are independent. We want to test

H0 : µ1 = µ2 against H1 : µ1 ≠ µ2 at a level 0 < α < 1.

I Log-likelihood:

ℓ(θ) = (const.) − ((n1 + n2)/2) log σ² − (1/(2σ²)) ( ∑_{i=1}^{n1} (xi − µ1)² + ∑_{i=1}^{n2} (yi − µ2)² ).

I Maximization of likelihood: θΘ = (x, y, σ̂²), θΘ0 = (µ̂0, µ̂0, σ̂0²), where

σ̂² = ( ∑_{i=1}^{n1} (xi − x)² + ∑_{i=1}^{n2} (yi − y)² ) / (n1 + n2),
σ̂0² = ( ∑_{i=1}^{n1} (xi − µ̂0)² + ∑_{i=1}^{n2} (yi − µ̂0)² ) / (n1 + n2),
µ̂0 = (n1 x + n2 y)/(n1 + n2).

73/95

Examples of LRT: Two Normal Means With Common Variance

I Likelihood ratio: Note that

σ̂0²/σ̂² = 1 + (n1(x − µ̂0)² + n2(y − µ̂0)²)/((n1 + n2)σ̂²) = 1 + n1 n2 (x − y)²/((n1 + n2)² σ̂²).

Let sp² = (n1 + n2)σ̂²/(n1 + n2 − 2) be the pooled sample variance. Then we get

2(ℓ(θΘ; x, y) − ℓ(θΘ0; x, y)) = (n1 + n2) log(σ̂0²/σ̂²)
= (n1 + n2) log( 1 + n1 n2 (x − y)²/((n1 + n2)² σ̂²) )
= (n1 + n2) log( 1 + (1/(n1 + n2 − 2)) · (x − y)²/((n1⁻¹ + n2⁻¹) sp²) ).

I Rejection region:

Reject H0 if | (x − y)/(sp √(n1⁻¹ + n2⁻¹)) | ≥ tα/2(n1 + n2 − 2).

74/95

Examples of LRT: Two Normal Means With Common Variance

I Wilks' phenomenon: Since Tn ≡ (X − Y)/(Sp √(n1⁻¹ + n2⁻¹)) =d t(n1 + n2 − 2) under H0 and t(n1 + n2 − 2) →d Z =d N(0, 1), we get

2(ℓ(θΘ; X, Y) − ℓ(θΘ0; X, Y)) = (n1 + n2) log(1 + Tn²/(n1 + n2 − 2)) = Tn² + op(1) →d Z² =d χ²(1)

under H0. Note that (d.f.) = dim(Θ) − dim(Θ0) = 3 − 2 = 1.

I Power function: With δ = (µ1 − µ2)/(σ √(n1⁻¹ + n2⁻¹)),

γ(δ) ≡ Pθ(reject H0) = P( |(Z + δ)/√(V/(n1 + n2 − 2))| ≥ tα/2(n1 + n2 − 2) ),

where Z =d N(0, 1) and V =d χ²(n1 + n2 − 2) are independent.

75/95

Examples of LRT: Two Normal Variances

Let X1, . . . , Xn1 and Y1, . . . , Yn2 be random samples, respectively, from

N(µ1, σ1²) and N(µ2, σ2²), θ = (µ1, µ2, σ1², σ2²) ∈ R² × R+². Assume that (X1, . . . , Xn1) and (Y1, . . . , Yn2) are independent. We want to test

H0 : σ1² = σ2² against H1 : σ1² ≠ σ2² at a level 0 < α < 1.

I Log-likelihood:

ℓ(θ) = (const.) − (n1/2) log σ1² − (n2/2) log σ2² − (1/(2σ1²)) ∑_{i=1}^{n1} (xi − µ1)² − (1/(2σ2²)) ∑_{i=1}^{n2} (yi − µ2)².

I Maximization of likelihood: θΘ = (x, y, σ̂1², σ̂2²), θΘ0 = (x, y, σ̂², σ̂²), where

σ̂1² = ∑_{i=1}^{n1} (xi − x)²/n1,   σ̂2² = ∑_{i=1}^{n2} (yi − y)²/n2,
σ̂² = ( ∑_{i=1}^{n1} (xi − x)² + ∑_{i=1}^{n2} (yi − y)² ) / (n1 + n2) = (n1 σ̂1² + n2 σ̂2²)/(n1 + n2).

76/95

Examples of LRT: Two Normal Variances

I Likelihood ratio: Let s1² = ∑_{i=1}^{n1} (xi − x)²/(n1 − 1) and define s2² likewise with the yi. Then,

2(ℓ(θΘ; x, y) − ℓ(θΘ0; x, y)) = (n1 + n2) log σ̂² − n1 log σ̂1² − n2 log σ̂2²
= n1 log( 1 + ((n2 − 1)/(n1 − 1)) · (s2²/s1²) ) + n2 log( 1 + ((n1 − 1)/(n2 − 1)) · (s1²/s2²) ) + (const)
=: rn1,n2(s1²/s2²).

I Rejection region:

Reject H0 if s1²/s2² ≤ c1 or s1²/s2² ≥ c2,

where c1 and c2 satisfy

rn1,n2(c1) = rn1,n2(c2) and ∫_{c1}^{c2} pdfF(n1−1,n2−1)(y) dy = 1 − α.

77/95

Examples of LRT: One-Sided Poisson Mean

Let X1, . . . , Xn be a random sample from Poisson(θ), θ > 0. We want to test

H0 : θ ≤ θ0 against H1 : θ > θ0 at a level 0 < α < 1.

I Log-likelihood: ℓ(θ) = n(−θ + x log θ − n⁻¹ ∑_{i=1}^n log xi!).

I Maximization of likelihood: θΘ = x and θΘ0 = min{x, θ0}.

I Likelihood ratio:

2(ℓ(θΘ; x) − ℓ(θΘ0; x)) = 2n (x log x − x(1 + log θ0) + θ0) I(x ≥ θ0).

I Rejection region: f(u) ≡ (u log u − u(1 + log θ0) + θ0) I(u ≥ θ0) equals zero for u ≤ θ0 and is strictly increasing for u > θ0. Thus, for any λ > 0, the inequality f(u) ≥ λ is equivalent to u ≥ λ′ for some λ′ that depends on λ. This implies that

the LRT rejects H0 if ∑_{i=1}^n xi ≥ c.

78/95

Examples of LRT: One-Sided Poisson Mean

I Determination of critical value: We need to find c > 0 such that

sup_{θ≤θ0} Pθ( ∑_{i=1}^n Xi ≥ c ) = α.

For an integer c, the rejection probability under Pθ equals 1 − ∑_{j=0}^{c−1} e^{−nθ}(nθ)^j/j!, which is an increasing function of θ. Thus, the supremum over θ ≤ θ0 is achieved at θ = θ0. The task then reduces to finding c > 0 such that

Pθ0( ∑_{i=1}^n Xi ≥ c ) = α.

But this is not possible for most values of 0 < α < 1.

I One may do randomization to get a test such that Pθ0(reject H0) = α.

79/95

Examples of LRT: One-Sided Poisson Mean

I Randomization: For a given 0 < α < 1, suppose that

1 − ∑_{j=0}^{c0} e^{−nθ0}(nθ0)^j/j! =: α0 < α < α1 := 1 − ∑_{j=0}^{c0−1} e^{−nθ0}(nθ0)^j/j!.

The randomized LRT rejects H0 if ∑_{i=1}^n xi ≥ c0 + 1 and does not reject H0 if ∑_{i=1}^n xi ≤ c0 − 1. On the boundary, where ∑_{i=1}^n xi = c0, it rejects H0 with probability (α − α0)/(α1 − α0). If we denote by φLRT(u) the probability of rejecting H0 when ∑_{i=1}^n xi = u is observed, then

φLRT(u) = 1 for u ≥ c0 + 1;   (α − α0)/(α1 − α0) for u = c0;   0 for u ≤ c0 − 1.

I Level condition: It can be seen that

Pθ0(reject H0) = Pθ0( ∑_{i=1}^n Xi ≥ c0 + 1 ) + ((α − α0)/(α1 − α0)) · Pθ0( ∑_{i=1}^n Xi = c0 ) = α.
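A minimal sketch of this randomized test (my own illustration, using scipy.stats.poisson for the null distribution of ∑ Xi).

    import numpy as np
    from scipy.stats import poisson

    n, theta0, alpha = 10, 1.0, 0.05
    mu = n * theta0                               # sum of X_i ~ Poisson(n * theta0) at the boundary of H0

    c0 = int(poisson.ppf(1 - alpha, mu))          # smallest c with P(sum <= c) >= 1 - alpha
    alpha0 = poisson.sf(c0, mu)                   # P(sum >= c0 + 1)
    alpha1 = poisson.sf(c0 - 1, mu)               # P(sum >= c0)
    gamma = (alpha - alpha0) / (alpha1 - alpha0)  # rejection probability on the boundary sum = c0

    def phi_lrt(total):
        """Probability of rejecting H0 when sum(x_i) = total is observed."""
        return 1.0 if total >= c0 + 1 else (gamma if total == c0 else 0.0)

    size = alpha0 + gamma * (alpha1 - alpha0)     # exact size of the randomized test
    print(c0, gamma, size)                        # size equals alpha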

80/95

Approximation of LRT: Simple Null

Let X1, . . . , Xn be a random sample from a distribution with pdf f(·; θ), θ ∈ Θ ⊂ Rd, with respect to a measure µ. Assume that the regularity conditions (R0)–(R7) hold. Consider the problem of testing H0 : θ = θ0. Let θ̂ denote the MLE over Θ. For the LRT statistic 2(ℓ(θ̂) − ℓ(θ0)), it holds that, under H0,

(1) 2(ℓ(θ̂) − ℓ(θ0)) →d χ²(d);
(2) 2(ℓ(θ̂) − ℓ(θ0)) = Wn + op(1);
(3) 2(ℓ(θ̂) − ℓ(θ0)) = Rn + op(1),

where Wn = (θ̂ − θ0)ᵀ(n I1(θ0))(θ̂ − θ0) (Wald test statistic) and Rn = ℓ̇(θ0)ᵀ(n I1(θ0))⁻¹ ℓ̇(θ0) (Rao/score test statistic).

Remark: (1) is called Wilks’ phenomenon.

81/95

Approximation of LRT: Simple Null

I Proof of (2): Let Sj(θ) denote the jth derivative of n⁻¹ℓ(θ). Since ℓ̇(θ̂) = 0, we get

ℓ(θ0) = ℓ(θ̂) + (θ̂ − θ0)ᵀ · n S2(θ∗) · (θ̂ − θ0)/2
= ℓ(θ̂) + (θ̂ − θ0)ᵀ(−n I1(θ0) + op(n))(θ̂ − θ0)/2
= ℓ(θ̂) − Wn/2 + op(1).

I Proof of (3): Since √n(θ̂ − θ0) = I1(θ0)⁻¹ √n S1(θ0) + op(1) and √n S1(θ0) = Op(1) under Pθ0, we obtain

Wn = ( I1(θ0)⁻¹ √n S1(θ0) + op(1) )ᵀ I1(θ0) ( I1(θ0)⁻¹ √n S1(θ0) + op(1) )
= n S1(θ0)ᵀ I1(θ0)⁻¹ S1(θ0) + op(1) = Rn + op(1).

I Proof of (1): Follows immediately from (2) or (3).

82/95

Approximation of LRT: Exponential Mean (Page 66, LecNote)

I Hypothesis: H0 : θ = θ0 versus H1 : θ ≠ θ0.

I Asymptotic LRT: Reject H0 if

2n(x/θ0 − log(x/θ0) − 1) ≥ χ²α(1).

I Wald and Rao tests: Reject H0 if

n(x − θ0)²/θ0² ≥ χ²α(1).

I Compare these with the LRT.
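A small numerical comparison of the two test statistics on simulated exponential data (my own sketch; under H0 both are approximately χ²(1) and close to each other).

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(4)
    n, theta0 = 50, 2.0
    x = rng.exponential(scale=theta0, size=n)    # data generated under H0
    xbar = x.mean()

    lrt = 2 * n * (xbar / theta0 - np.log(xbar / theta0) - 1)
    wald_rao = n * (xbar - theta0) ** 2 / theta0 ** 2    # Wald and Rao coincide in this model

    crit = chi2.ppf(0.95, df=1)
    print(lrt, wald_rao, crit)    # reject H0 at level 0.05 when a statistic exceeds crit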

83/95

Approximation of LRT: Beta Model

Suppose we observe a random sample X1, . . . , Xn from Beta(θ, 1), θ > 0.

I Hypothesis: H0 : θ = 1 versus H1 : θ ≠ 1.

I Recall that the MLE is θ̂ = −n/(∑_{i=1}^n log Xi) and I1(θ) = 1/θ²:

ℓ(θ) = n log θ + (θ − 1) ∑_{i=1}^n log xi,   ℓ̇(θ) = n/θ + ∑_{i=1}^n log xi.

I LRT statistic: 2(ℓ(θ̂) − ℓ(1)) = 2n(log θ̂ − 1 + 1/θ̂).

I Wald test statistic: Wn(1) = (θ̂ − 1)² n I1(1) = n(θ̂ − 1)².

I Rao test statistic:

Rn(1) = (ℓ̇(1))² n⁻¹ I1(1)⁻¹ = n( 1 + n⁻¹ ∑_{i=1}^n log Xi )².

84/95

Approximation of LRT: Double Exponential Model

Suppose we observe a random sample X1, . . . , Xn from DE(θ, 1),

−∞ < θ <∞.

I Hypothesis: H0 : θ = θ0 versus H1 : θ ≠ θ0.

I Recall that the MLE is θ̂ = med(Xi) and I1(θ) = 1 (see the earlier double exponential slide):

ℓ(θ) = −∑_{i=1}^n |xi − θ| − n log 2,   ℓ̇(θ) = ∑_{i=1}^n sgn(xi − θ).

Here we take ℓ̇1(θ; x) = (∂/∂θ)f(x; θ)/f(x; θ), with (∂/∂θ)f(·; θ) the Frechet derivative of f(·; θ).

I LRT statistic: 2(ℓ(θ̂) − ℓ(θ0)) = 2( ∑_{i=1}^n |xi − θ0| − ∑_{i=1}^n |xi − θ̂| ).

I Wald test statistic: W(θ0) = n(θ̂ − θ0)².

I Rao test statistic: R(θ0) = ( ∑_{i=1}^n sgn(Xi − θ0) )²/n.

85/95

Approximation of LRT: Multinomial Model

Suppose we observe a random sample Z1, . . . , Zn from Multinomial(1, (θ1, . . . , θk)), and wish to test

H0 : θ = θ0 versus H1 : θ ≠ θ0.

I If we take Θ = {(θ1, . . . , θk−1) : θj > 0, θ1 + · · · + θk−1 < 1}, then the model satisfies (R0)–(R7).

I Let Xj = ∑_{i=1}^n Zij, 1 ≤ j ≤ k, and θ· = ∑_{j=1}^{k−1} θj = 1 − θk. With θ = (θ1, . . . , θk−1)ᵀ,

ℓ(θ) = ∑_{j=1}^{k−1} xj log θj + xk log(1 − θ·),
ℓ̇(θ) = ( xj/θj − xk/(1 − θ·) : 1 ≤ j ≤ k − 1 ),

the MLE is θ̂ = (xj/n : 1 ≤ j ≤ k − 1), and I1(θ) = diag(θj⁻¹) + (1 − θ·)⁻¹ 11ᵀ.

86/95

Approximation of LRT: Multinomial Model

I LRT statistic: 2n ∑_{j=1}^k θ̂j log(θ̂j/θ0j), where θ̂j = Xj/n.

I Wald statistic: n(θ̂ − θ0)ᵀ I1(θ0)(θ̂ − θ0) = ∑_{j=1}^k (Xj − nθ0j)²/(nθ0j).

I Rao statistic: n⁻¹ ℓ̇(θ0)ᵀ I1(θ0)⁻¹ ℓ̇(θ0) = ∑_{j=1}^k (Xj − nθ0j)²/(nθ0j).

I Indeed, n I1(θ)(θ̂ − θ) = (Xj/θj : 1 ≤ j ≤ k − 1) − (Xk/θk) · 1 = ℓ̇(θ).

I Often the Wald and Rao tests are expressed as

Reject H0 if ∑_{j=1}^k (Oj − Ej)²/Ej ≥ χ²α(k − 1),

where Oj = Xj and Ej = E(Xj | H0) = nθ0j.

87/95

Approximation of LRT: Composite Null

Here, we discuss the approximation of LRT when the null hypothesis is

composite. For this, let X1, . . . , Xn be a random sample from a distribution

with pdf f(·; θ), θ ∈ Θ ⊂ Rd that satisfies the regularity conditions (R0)–(R7).

Let θ = (ξᵀ, ηᵀ)ᵀ with η ∈ Rd0, and suppose we wish to test

H0 : ξ = ξ0 versus H1 : ξ ≠ ξ0.

The null parameter space is of dimension dim(Θ0) = d0, given by

Θ0 = { (ξ0ᵀ, ηᵀ)ᵀ : η ∈ Rd0 and (ξ0ᵀ, ηᵀ)ᵀ ∈ Θ }.

Let θΘ0 and θΘ denote the MLE in Θ0 and in Θ, respectively.

88/95

Approximation of LRT: Composite Null

Under the composite null hypothesis H0 : ξ = ξ0 ⇔ H0 : θ ∈ Θ0, it holds that

(1) 2(ℓ(θΘ) − ℓ(θΘ0)) →d χ²(d − d0); (Wilks' phenomenon)
(2) 2(ℓ(θΘ) − ℓ(θΘ0)) = Wn + op(1);
(3) 2(ℓ(θΘ) − ℓ(θΘ0)) = Rn + op(1),

where

Wn = (θΘ − θΘ0)ᵀ(n I1(θΘ0))(θΘ − θΘ0) (Wald test statistic),
Rn = ℓ̇(θΘ0)ᵀ(n I1(θΘ0))⁻¹ ℓ̇(θΘ0) (Rao/score test statistic).

(Proof: See Bickel and Doksum (2001), Chapter 5.)

89/95

Approximation of LRT: Composite Null

I Remark 1: The above approximations are valid for testing

H0 : g1(θ) = 0, . . . , gd1(θ) = 0 versus H1 : not H0,

where d1 = d − d0, provided that there exists a smooth reparametrization

ξ1 = g1(θ), . . . , ξd1 = gd1(θ), η1 = gd1+1(θ), . . . , ηd0 = gd(θ).

I Remark 2: The above approximation may be extended to the case of independent but not identically distributed observations. In the latter case,

Wn = (θΘ − θΘ0)ᵀ In(θΘ0)(θΘ − θΘ0) (Wald test statistic),
Rn = ℓ̇(θΘ0)ᵀ In(θΘ0)⁻¹ ℓ̇(θΘ0) (Rao/score test statistic),

where In(θ) = −Eθ ℓ̈(θ).

90/95

Approximation of LRT: Independence Test in Contingency Table

Suppose we observe X = (Xjk) =d Multinomial(n, (pjk)), where ∑_{j=1}^a ∑_{k=1}^b pjk = 1, pjk > 0, j = 1, . . . , a; k = 1, . . . , b. Consider the hypothesis

H0 : pjk ≡ pj· × p·k for all j, k versus H1 : not H0,

where pj· = pj1 + · · · + pjb and p·k = p1k + · · · + pak.

I With θ = (p11, . . . , pa,b−1)ᵀ, define

gjk(θ) = pjk − pj· p·k, 1 ≤ j ≤ a − 1, 1 ≤ k ≤ b − 1,

and gjk for the other (j, k) appropriately, so that the resulting transformation is a smooth reparametrization of θ. Then

pjk ≡ pj· × p·k for all j, k ⇔ gjk(θ) = 0 for 1 ≤ j ≤ a − 1, 1 ≤ k ≤ b − 1.

91/95

Approximation of LRT: Independence Test in Contingency Table

I Indeed,

Θ = {(p11, . . . , pab) : pjk > 0, p11 + · · · + pab = 1},
Θ0 = {(p1·p·1, . . . , pa·p·b) : pj· > 0, p·k > 0, p1· + · · · + pa· = 1, p·1 + · · · + p·b = 1}.

Thus, d = dim(Θ) = ab − 1 and d0 = dim(Θ0) = (a − 1) + (b − 1), so that d1 = dim(Θ) − dim(Θ0) = (a − 1)(b − 1).

I The likelihood under H0:

L(p) = ∏_{j=1}^a ∏_{k=1}^b (pjk)^{xjk} × (constant) = ∏_{j=1}^a (pj·)^{xj·} ∏_{k=1}^b (p·k)^{x·k} × (constant).

I MLEs: p̂jk = xjk/n in the whole parameter space Θ, and p̂0jk = (xj·/n) × (x·k/n) = p̂0j· × p̂0·k, say, in Θ0.

92/95

Approximation of LRT: Independence Test in Contingency Table

I The Wald and Rao test statistics coincide: With θ = (p11, . . . , pa,b−1)ᵀ,

n(θ̂ − θ)ᵀ I1(θ)(θ̂ − θ) = ∑_{j=1}^a ∑_{k=1}^b (Xjk − npjk)²/(npjk) = n⁻¹ ℓ̇(θ)ᵀ I1(θ)⁻¹ ℓ̇(θ).

I Evaluated at the H0-MLE, the Wald and Rao test statistics are given by

Wn = ∑_{j=1}^a ∑_{k=1}^b (Xjk − n p̂0j· p̂0·k)²/(n p̂0j· p̂0·k).

I Therefore, both the Wald and Rao tests reject H0 if

∑_{j=1}^a ∑_{k=1}^b (Ojk − E0jk)²/E0jk ≥ χ²α((a − 1)(b − 1)),

where Ojk = Xjk and E0jk = n p̂0j· p̂0·k is the estimated expected count under H0.
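A minimal sketch of this test on a small table (my own illustration; scipy.stats.chi2_contingency computes the same Pearson statistic and its χ²((a−1)(b−1)) p-value).

    import numpy as np
    from scipy.stats import chi2, chi2_contingency

    X = np.array([[30, 20, 10],
                  [20, 40, 30]])             # observed a x b table of counts
    n = X.sum()

    row = X.sum(axis=1, keepdims=True) / n   # estimated p_{j.}
    col = X.sum(axis=0, keepdims=True) / n   # estimated p_{.k}
    E0 = n * row * col                       # expected counts under independence

    stat = np.sum((X - E0) ** 2 / E0)
    df = (X.shape[0] - 1) * (X.shape[1] - 1)
    print(stat, chi2.sf(stat, df))           # Pearson statistic and its p-value

    print(chi2_contingency(X, correction=False)[:2])   # the same two numbers from scipy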

93/95

Approximation of LRT: Homogeneity of Multinomial Models

Suppose now that we observe b independent multinomial random variables:

Xk ≡ (X1k, . . . , Xak)ᵀ =d Multinomial(nk, (p1k, . . . , pak)), 1 ≤ k ≤ b, where p1k + · · · + pak = 1 and pjk > 0. We wish to test

H0 : p1 = · · · = pb versus H1 : not H0,

where pk = (p1k, . . . , pak)ᵀ, 1 ≤ k ≤ b.

I With θk = (p1k, . . . , pa−1,k)ᵀ and θ = (θ1ᵀ, . . . , θbᵀ)ᵀ, we get

ℓ(θ) = ∑_{k=1}^b ∑_{j=1}^{a−1} xjk log θjk + ∑_{k=1}^b (nk − x·k) log(1 − θ·k),
ℓ̇(θ) = ( xjk/θjk − xak/(1 − θ·k) : 1 ≤ j ≤ a − 1, 1 ≤ k ≤ b ),
In(θ) = var(ℓ̇(θ)) = diag( nk [ diag(θjk⁻¹) + (1 − θ·k)⁻¹ 11ᵀ ] ),

where θ·k = ∑_{j=1}^{a−1} θjk = 1 − pak and x·k = ∑_{j=1}^{a−1} xjk = nk − xak.

94/95

Approximation of LRT: Homogeneity of Multinomial Models

I MLEs: p̂jk = xjk/nk in Θ, and p̂0jk ≡ xj·/n (the same for every k) in Θ0, with n = n1 + · · · + nb.

I The Wald and Rao statistics coincide, and

(θΘ − θ)ᵀ In(θ)(θΘ − θ)
= ∑_{k=1}^b nk (θΘk − θk)ᵀ [ diag(θjk⁻¹) + (1 − θ·k)⁻¹ 11ᵀ ] (θΘk − θk)
= ∑_{k=1}^b nk [ ∑_{j=1}^{a−1} (θΘjk − θjk)²/θjk + (θΘ·k − θ·k)²/(1 − θ·k) ]
= ∑_{k=1}^b nk ∑_{j=1}^a (p̂jk − pjk)²/pjk
= ∑_{k=1}^b ∑_{j=1}^a (nk p̂jk − nk pjk)²/(nk pjk)
= ∑_{k=1}^b ∑_{j=1}^a (Xjk − nk pjk)²/(nk pjk).

95/95

Approximation of LRT: Homogeneity of Multinomial Models

I Thus, the Wald and Rao statistics for H0 are given by

Wn = ∑_{k=1}^b ∑_{j=1}^a (Xjk − nk p̂0jk)²/(nk p̂0jk) = ∑_{k=1}^b ∑_{j=1}^a (Xjk − nk(Xj·/n))²/(nk Xj·/n).

I Note that dim(Θ) = b(a − 1) and Θ0 = {(p1, . . . , p1) : p11 + · · · + pa1 = 1, pj1 > 0}, so that dim(Θ0) = a − 1.

I Therefore, the Wald and Rao tests reject H0 if

∑_{k=1}^b ∑_{j=1}^a (Ojk − E0jk)²/E0jk ≥ χ²α((a − 1)(b − 1)),

where Ojk = Xjk and E0jk = nk p̂0jk = nk(Xj·/n) is the estimated expected count under H0.