1/95

Mathematical Statistics, Fall 2016

Chapter 6: Likelihood Methods

Byeong U. Park

Department of Statistics, Seoul National University

2/95

6.1 Maximum Likelihood Estimation

6.2 Information Inequality and Efficiency

6.3 Maximum Likelihood Tests

3/95

Idea of Maximizing Likelihood

- If pdf(x; θ1) > pdf(x; θ2) for an observation X = x, then the distribution pdf(·; θ1) is more likely than pdf(·; θ2) to be the true distribution that generated the observation x.

- Likelihood and log-likelihood functions: pdf(x; θ) as a function of θ for a given x is called the likelihood (function). Its logarithm is called the log-likelihood (function).

- Below we write L(θ) ≡ L(θ; x) = pdf(x; θ) and ℓ(θ) ≡ ℓ(θ; x) = log pdf(x; θ).

- For a set of observations x1, . . . , xn of a random sample X1, . . . , Xn from pdf f(·; θ), we have

  L(θ) = ∏_{i=1}^n f(xi; θ),   ℓ(θ) = ∑_{i=1}^n log f(xi; θ).

4/95

Maximum Likelihood Estimator (MLE)

- Definition of MLE: The MLE of θ for a given observation x is defined by

  θ̂MLE(x) ≡ arg max_{θ∈Θ} L(θ; x)

  when it exists.

- The MLE may not exist, and it may not be unique when it exists.

- In some cases the MLE does not exist in the parameter space Θ but exists in an extended parameter space Θ̄.

- Suppose that Θ is open in R^d and the log-likelihood function ℓ(θ) is continuous in Θ and diverges to −∞ as θ approaches the 'boundary' of Θ. Then, an MLE exists.

5/95

Likelihood Equation

- When the likelihood is differentiable, an MLE is often found by solving the likelihood equation ℓ̇(θ; x) = 0, where ℓ̇(θ; x) = (∂/∂θ)ℓ(θ; x).

- Suppose that the likelihood is twice differentiable. For a given x, if an MLE exists and the second derivative ℓ̈(θ; x) is negative definite for all θ ∈ Θ, then the solution of the likelihood equation exists and is the unique MLE.

- Suppose that the likelihood is twice differentiable. For a given x, if ℓ̇(θ̂(x); x) = 0 and ℓ̈(θ; x) is negative definite for all θ ∈ Θ, then θ̂(x) is the unique MLE:

  ℓ(θ) = ℓ(θ̂) + ℓ̇(θ̂)ᵀ(θ − θ̂) + (1/2)(θ − θ̂)ᵀ ℓ̈(θ*)(θ − θ̂) ≤ ℓ(θ̂),

  with the equality holding if and only if θ = θ̂.

6/95

MLE of Function of Parameter

- Let θ = (θ1, θ2) and let θ̂ = (θ̂1, θ̂2) be its MLE. Then, we call θ̂j the MLEs of θj, respectively.

- For an injective function g, the MLE of η = g(θ) is given by η̂MLE = g(θ̂MLE).

  Proof. Let L denote the likelihood function of θ. Then, the likelihood of η equals L(g⁻¹(η); x). This entails

  (Likelihood of η̂MLE) = L(g⁻¹(η̂MLE); x) = L(θ̂MLE; x)
                        = max_{θ∈Θ} L(θ; x)
                        = max_{η∈g(Θ)} L(g⁻¹(η); x).

7/95

MLE of Function of Parameter

- What if the function g in η = g(θ) is not injective? One may find h such that the map θ ↦ (g(θ), h(θ)) is injective. Then the MLE of ηext ≡ (g(θ), h(θ)) is given by

  η̂MLE_ext = (g(θ̂MLE), h(θ̂MLE)),

  so that it remains true that η̂MLE = g(θ̂MLE).

8/95

Profiling Method of Finding MLE

Sometimes it is difficult to find the MLE of θ = (θ1, θ2) simultaneously, but rather easy to find the MLE of θj with the other component fixed. Let θ̂1(θ2) denote the MLE of θ1 when θ2 is fixed.

- L(θ̂1(θ2), θ2) is a function of θ2 only, called the profile likelihood of θ2.

- The MLE of θ is given by θ̂MLE = (θ̂1(θ̂2), θ̂2), where

  θ̂2 = argmax_{θ2 : (θ̂1(θ2), θ2) ∈ Θ} L(θ̂1(θ2), θ2).

  Proof. For any (θ1, θ2) ∈ Θ, it holds that

  L(θ̂MLE) ≥ L(θ̂1(θ2), θ2) ≥ L(θ1, θ2).

9/95

Newton-Raphson Algorithm

It may not be possible to find the solution of the likelihood equation explicitly when the likelihood equation is nonlinear.

- The Newton-Raphson method is an iteration scheme based on a linear approximation of the likelihood equation:

  0 = ℓ̇(θ) ≈ ℓ̇(θOLD) + ℓ̈(θOLD)(θ − θOLD).

- The iteration scheme:

  θNEW = θOLD − ℓ̈(θOLD)⁻¹ ℓ̇(θOLD).

- Convergence of the iteration: Newton-Kantorovich Theorem!
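To make the scheme concrete, here is a minimal Python sketch of the iteration (the data values and starting point are hypothetical). It is checked on the Poisson log-likelihood, whose score −n + ∑xi/θ vanishes exactly at the sample mean, so the iteration should reproduce x̄.

```python
import numpy as np

def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=100):
    """Solve the likelihood equation score(theta) = 0 by Newton-Raphson.

    `score` and `hessian` are the first and second derivatives of the
    log-likelihood; `theta0` is the starting value (scalar case)."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)  # a matrix solve in the vector case
        theta = theta - step
        if abs(step) < tol:
            break
    return theta

# Check on a case with a closed-form MLE: Poisson(theta), where the
# solution of the likelihood equation is the sample mean.
x = np.array([2.0, 0.0, 3.0, 1.0, 4.0])
n = len(x)
score = lambda t: -n + x.sum() / t       # first derivative of the log-likelihood
hessian = lambda t: -x.sum() / t**2      # second derivative, negative for all t
print(newton_raphson(score, hessian, theta0=1.0), x.mean())  # both 2.0
```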

10/95

Newton-Kantorovich Theorem

Suppose that there exist constants α, β, γ and r with 2αβγ < 1 and 2α < r for which ℓ has a second derivative ℓ̈(θ) at all θ ∈ B_r(θ[0]) that is invertible, and

(i) ‖ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])‖ ≤ α,
(ii) ‖ℓ̈(θ[0])⁻¹‖ ≤ β,
(iii) ‖ℓ̈(θ) − ℓ̈(θ′)‖ ≤ γ‖θ − θ′‖ for all θ, θ′ ∈ B_r(θ[0]).

Then, ℓ̇(θ) = 0 has a unique solution θ̂ in B_{2α}(θ[0]). Furthermore, θ̂ can be approximated by the Newton-Raphson iterative method θ[k+1] = θ[k] − ℓ̈(θ[k])⁻¹ ℓ̇(θ[k]), k ≥ 0, which converges at a geometric rate:

  ‖θ[k] − θ̂‖ ≤ α · 2^{−(k−1)} · q^{2^k − 1},

where q = 2αβγ < 1.

11/95

Gradient Descent Algorithm

Sometimes the Newton-Raphson algorithm is unstable, especially when θ is high-dimensional. The gradient descent method is an iterative scheme for finding the minimal point of an objective function F. The method moves the current iterate slightly in the direction opposite to the gradient of F, i.e.,

  θNEW = θOLD − γ∇F(θOLD)

for γ small enough. A step size γ that is too small takes too long to reach the minimum, while one that is too large may overshoot the minimal point. It is suggested to take a larger step at the beginning and a smaller step as the iteration goes on.

[Figure: contours of F over the (x1, x2)-plane with a gradient descent path.]
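A minimal sketch of the update θNEW = θOLD − γ∇F(θOLD) with a fixed step size; the objective function and the value of γ are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad, theta0, gamma=0.1, tol=1e-8, max_iter=10_000):
    """Minimize F by repeatedly moving against its gradient with step size gamma."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        theta = theta - gamma * g
        if np.linalg.norm(g) < tol:
            break
    return theta

# Minimize F(theta) = (theta1 - 1)^2 + 2*(theta2 + 3)^2; minimum at (1, -3).
grad = lambda t: np.array([2.0 * (t[0] - 1.0), 4.0 * (t[1] + 3.0)])
print(gradient_descent(grad, theta0=[0.0, 0.0]))  # approx [ 1. -3.]
```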

12/95

Coordinate Descent Algorithm

The minimization of a multivariable function F may be achieved by minimizing it along one direction at a time, i.e., solving univariate optimization problems in a loop. The algorithm solves the optimization problem

  θ_j^[k+1] = argmin_{θj} F(θ_1^[k+1], . . . , θ_{j−1}^[k+1], θj, θ_{j+1}^[k], . . . , θ_d^[k]), 1 ≤ j ≤ d.

It is easy to see that

  F(θ[0]) ≥ F(θ[1]) ≥ F(θ[2]) ≥ · · · .
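As an illustration, a crude sketch that minimizes over each coordinate by a grid search while the others are held fixed; the objective, grid width, and number of sweeps are arbitrary choices for the example. The monotone decrease above holds by construction, up to the grid resolution.

```python
import numpy as np

def coordinate_descent(F, theta0, n_sweeps=50, grid_width=5.0, grid_size=2001):
    """Minimize F one coordinate at a time by grid search on each coordinate."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_sweeps):
        for j in range(len(theta)):
            grid = theta[j] + np.linspace(-grid_width, grid_width, grid_size)
            vals = [F(np.concatenate([theta[:j], [g], theta[j + 1:]])) for g in grid]
            theta[j] = grid[int(np.argmin(vals))]
    return theta

# A convex quadratic with a cross term, so the coordinates interact.
F = lambda t: (t[0] - 1.0) ** 2 + (t[1] + 3.0) ** 2 + 0.5 * t[0] * t[1]
print(coordinate_descent(F, [0.0, 0.0]))  # near the global minimizer
```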

13/95

Example: Poisson Model

Suppose that we observe X from Poisson(θ) for unknown θ ∈ Θ = (0,∞).

- The log-likelihood is maximized at x ∈ Θ when x > 0. When x = 0,

  ℓ(θ; 0) = −θ + x log θ − log x! = −θ

  does not have a maximizer in (0,∞). Thus, the MLE of θ does not exist when x = 0.

- If we let Θ̄ = [0,∞), then the MLE exists for all x ≥ 0 and is given by θ̂MLE = X.

- Read #6.1.3 for an example of non-uniqueness of the MLE.

14/95

Example: Bernoulli Model

Suppose that we observe a random sample X1, . . . , Xn from Bernoulli(θ) for unknown θ ∈ [0, 1]. Then, θ̂MLE = X̄.

- For 0 < ∑_{i=1}^n xi < n, θ̂ = x̄ is the unique MLE:

  - For θ = 0 or θ = 1, L(θ) = 0;
  - For θ ∈ (0, 1), L(θ) > L(0) = L(1) = 0 and

    ℓ̇(θ) = ∑_{i=1}^n xi / θ − (n − ∑_{i=1}^n xi) / (1 − θ) = 0

    has the solution θ = x̄, with

    ℓ̈(θ) = −∑_{i=1}^n xi / θ² − (n − ∑_{i=1}^n xi) / (1 − θ)² < 0

    for all 0 < θ < 1.

- For ∑_{i=1}^n xi = 0, L(θ) = (1 − θ)^n is maximized at θ = 0.

- For ∑_{i=1}^n xi = n, L(θ) = θ^n is maximized at θ = 1.

15/95

Example: Logistic Family of Distributions

Suppose that we observe a random sample X1, . . . , Xn from Logistic(θ, 1) for unknown θ ∈ R, which has density function

  f(x; θ) = exp(−(x − θ)) / (1 + exp(−(x − θ)))² · I_(−∞,∞)(x).

- Derivatives of the log-likelihood:

  ℓ̇(θ) = n − 2 ∑_{i=1}^n exp(−(xi − θ)) / {1 + exp(−(xi − θ))},

  ℓ̈(θ) = −2 ∑_{i=1}^n exp(−(xi − θ)) / {1 + exp(−(xi − θ))}² < 0, θ ∈ R.

- The likelihood equation ℓ̇(θ) = 0 has a unique solution, since ℓ̇ is strictly decreasing and ℓ̇(θ) → ∓n as θ → ±∞, respectively. Thus, it is the unique MLE. A numerical illustration follows below.
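For illustration, a small simulation (sample size, seed, and true θ are hypothetical choices) that draws Logistic(θ, 1) data by inverse-cdf sampling and solves ℓ̇(θ) = 0 by Newton-Raphson; since ℓ is strictly concave, any starting point works.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta_true = 200, 1.5
# Logistic(theta, 1) sample via the inverse cdf: X = theta + log(U/(1-U)).
u = rng.uniform(size=n)
x = theta_true + np.log(u / (1.0 - u))

def score(t):
    # l'(t) = n - 2 * sum of e^{-(xi-t)}/(1 + e^{-(xi-t)}) = n - 2 * sum of p
    p = 1.0 / (1.0 + np.exp(x - t))
    return n - 2.0 * p.sum()

def hessian(t):
    # l''(t) = -2 * sum of p*(1-p) < 0 for all t
    p = 1.0 / (1.0 + np.exp(x - t))
    return -2.0 * np.sum(p * (1.0 - p))

t = x.mean()  # any starting value works by strict concavity
for _ in range(50):
    t -= score(t) / hessian(t)
print(t)      # the unique MLE, close to theta_true for moderate n
```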

16/95

Example: Family of Exponential Distributions

Suppose that we observe a random sample X1, . . . , Xn from Exp(θ) for unknown θ > 0, which has density function f(x; θ) = θ⁻¹e^{−x/θ} · I_(0,∞)(x). It is easier to find the MLE of η = θ⁻¹ and then transform the MLE of η back to the MLE of θ = η⁻¹.

- Derivatives of the log-likelihood of η:

  ℓ̇(η) = n/η − nx̄,   ℓ̈(η) = −n/η² < 0, η > 0.

- The likelihood equation ℓ̇(η) = 0 has the solution η̂ = 1/x̄, so it is the unique MLE of η.

- θ̂MLE = X̄ is the unique MLE of θ.

17/95

Example: Families of Double Exponential Distributions

Suppose that we observe a random sample X1, . . . , Xn from DE(θ, 1) for unknown θ ∈ R, which has density function

  f(x; θ) = (1/2) exp(−|x − θ|) · I_(−∞,∞)(x).

In this case the likelihood is not differentiable. Let X(1) ≤ X(2) ≤ · · · ≤ X(n) be the order statistics of X1, . . . , Xn.

- The log-likelihood: ℓ(θ) = −∑_{i=1}^n |xi − θ| − n log 2.

- When n = 2m + 1, θ̂MLE = X(m+1) is the unique MLE.

- When n = 2m, θ̂MLE = a for any a ∈ [X(m), X(m+1)].

- Thus, θ̂MLE = med(Xi).

18/95

S(θ) = ∑_{i=1}^n |X(i) − θ|

[Figure: the piecewise-linear function S(θ) with breakpoints at X(1) ≤ X(2) ≤ · · · ≤ X(n) and intervals I1, I2, . . . , In+1.]

Let

  I1 = (−∞, X(1)],
  Ij = [X(j−1), X(j)], 2 ≤ j ≤ n ("closed intervals"),
  In+1 = [X(n), ∞).

(i) S(θ) ≥ S(X(1)) for all θ ∈ I1, and S(θ) ≥ S(X(n)) for all θ ∈ In+1.

(ii) For all θ1 < θ2 in Ij+1: S(θ1) > S(θ2) if j < n − j; S(θ1) < S(θ2) if j > n − j; S(θ1) = S(θ2) if j = n − j.

19/95

Example: Families of Double Exponential Distributions

Suppose that we observe a random sample X1, . . . , Xn from DE(µ, σ) for unknown θ = (µ, σ) ∈ R × R+, which has density function

  f(x; θ) = (1/(2σ)) exp(−|x − µ|/σ) · I_(−∞,∞)(x).

- The log-likelihood: ℓ(θ) = −∑_{i=1}^n |xi − µ|/σ − n log 2 − n log σ.

- Profiling approach: For each fixed σ, we know µ̂(σ) = med(xi) maximizes ℓ(µ, σ), and it does not depend on σ. The profile likelihood equals

  ℓ(med(xi), σ) = −∑_{i=1}^n |xi − med(xi)|/σ − n log 2 − n log σ,

  which is uniquely maximized at σ̂ = n⁻¹ ∑_{i=1}^n |xi − med(xi)|.

- θ̂MLE = (med(Xi), n⁻¹ ∑_{i=1}^n |Xi − med(Xi)|).
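A quick numerical check of the profiling result; the sample size and the true parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, sigma = 500, 2.0, 1.5
x = rng.laplace(loc=mu, scale=sigma, size=n)  # DE(mu, sigma) sample

mu_hat = np.median(x)                    # profile maximizer for every sigma
sigma_hat = np.mean(np.abs(x - mu_hat))  # maximizer of the profile likelihood
print(mu_hat, sigma_hat)                 # close to (2.0, 1.5)
```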

20/95

Example: Normal Families of Distributions

Suppose that we observe a random sample X1, . . . , Xn from N(µ, 1) for unknown µ ∈ R.

- The log-likelihood:

  ℓ(µ) = −(n/2) log(2π) − (1/2) ∑_{i=1}^n (xi − x̄)² − (n/2)(x̄ − µ)².

- The MLE of η = |µ| is η̂MLE = |X̄|.

21/95

Example: Normal Families of Distributions

Suppose that we observe a random sample X1, . . . , Xn from N(µ, σ²) for unknown θ = (µ, σ²) ∈ R × R+.

- The log-likelihood:

  ℓ(θ) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^n (xi − x̄)² − (n/(2σ²))(x̄ − µ)².

- Profiling approach: For each fixed σ², µ̂(σ²) = x̄ maximizes ℓ(µ, σ²), and it does not depend on σ². The profile likelihood equals

  ℓ(x̄, σ²) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^n (xi − x̄)²,

  which is uniquely maximized at σ̂² = n⁻¹ ∑_{i=1}^n (xi − x̄)². Thus,

  θ̂MLE = (X̄, n⁻¹ ∑_{i=1}^n (Xi − X̄)²).

22/95

Consistency of MLE: Some Examples

- Let X1, . . . , Xn be a random sample from (i) Poisson(θ), θ ∈ Θ = [0,∞); (ii) Bernoulli(θ), θ ∈ Θ = [0, 1]; (iii) Exp(θ), θ ∈ Θ = (0,∞). Then,

  θ̂MLE = X̄ →p θ under Pθ

  as n → ∞ for all θ ∈ Θ by the WLLN.

- Let X1, . . . , Xn be a random sample from a population with a distribution function F that has a density f with respect to the Lebesgue measure (continuous type). It can be shown that

  √n(med(Xi) − F⁻¹(1/2)) →d N(0, 1/[4f(F⁻¹(1/2))²]),

  so that med(Xi) →p F⁻¹(1/2) for all continuous-type F. [Use the normal approximation of the Binomial distribution.]

23/95

Consistency of MLE: Some Examples

- Let X1, . . . , Xn be a random sample from DE(θ, 1), θ ∈ R. Then, by the consistency of the sample median as an estimator of the population median,

  θ̂MLE = med(Xi) →p θ under Pθ as n → ∞ for all θ ∈ R.

- Let X1, . . . , Xn be a random sample from U[0, θ], θ ∈ (0,∞). In this case θ̂MLE = X(n). Since

  Eθ(X(n)) = (n/(n+1)) · θ → θ and varθ(X(n)) = (n/((n+2)(n+1)²)) · θ² → 0,

  we get that θ̂MLE →p θ under Pθ as n → ∞ for all θ > 0.

24/95

Kullback-Leibler Divergence

- Kullback-Leibler divergence: Let P = {Pθ : θ ∈ Θ} be a statistical model for an observation X. Let f(·; θ) denote the density function of Pθ. The Kullback-Leibler divergence (of Pθ from Pθ0) is defined by

  KL(θ, θ0) = −Eθ0(log f(X; θ)/f(X; θ0)).

- Identifiability of θ: Pθ = Pθ0 implies θ = θ0. (This has been assumed to hold in the estimation of θ so far!)

- Assume θ in P is identifiable and the Pθ have common support, i.e., {x : f(x; θ) > 0} does not depend on θ ∈ Θ. Then,

  KL(θ, θ0) ≥ 0, and KL(θ, θ0) = 0 if and only if θ = θ0.

  [Use the inequality 1 + log z ≤ z for all z > 0, with "=" iff z = 1.]
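As a small check: writing out the Poisson log-densities gives the closed form KL(θ, θ0) = θ − θ0 + θ0 log(θ0/θ), which a Monte Carlo average reproduces. The sketch below uses illustrative values of θ and θ0.

```python
import numpy as np

def kl_poisson(theta, theta0):
    """Closed-form KL(theta, theta0) for the Poisson model."""
    return theta - theta0 + theta0 * np.log(theta0 / theta)

# Monte Carlo version of -E_{theta0} log(f(X;theta)/f(X;theta0)); the log
# factorial cancels, so only X*log(theta/theta0) - (theta - theta0) remains.
rng = np.random.default_rng(2)
theta, theta0 = 3.0, 2.0
x = rng.poisson(theta0, size=200_000)
mc = -(x * np.log(theta / theta0) - (theta - theta0)).mean()
print(kl_poisson(theta, theta0), mc)  # both about 0.189, and >= 0
print(kl_poisson(theta0, theta0))     # exactly 0 when theta = theta0
```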

25/95

Kullback-Leibler Divergence and MLE

Let X1, . . . , Xn be a random sample from Pθ in P = {Pθ : θ ∈ Θ}. Let f(·; θ) denote the density function of Pθ. Assume θ in Θ is identifiable and the Pθ have common support. For a fixed θ0, define

  Dn(θ) = −n⁻¹ ∑_{i=1}^n log f(Xi; θ)/f(Xi; θ0),   D0(θ) = KL(θ, θ0).

- θ0 is the unique minimizer of D0(θ) over Θ.

- θ̂MLE is a minimizer of Dn(θ) when it exists.

- By the WLLN, Dn(θ) →p D0(θ) under Pθ0 for all θ ∈ Θ.

- Does θ̂MLE, the minimizer of Dn(θ), converge to θ0, the minimizer of D0(θ), in Pθ0-probability?

26/95

Kullback-Leibler Divergence and MLE

- In general, for a sequence of random functions Gn and a function G0 defined on Θ, the minimizer of Gn(θ) over Θ converges in probability to the minimizer of G0(θ) over Θ if Θ is compact (bounded and closed), G0 is continuous on Θ, the minimizer of G0(θ) is unique, and

  sup_{θ∈Θ} |Gn(θ) − G0(θ)| →p 0.

- Thus, if Θ is compact, D0(θ) is continuous on Θ for all θ0 ∈ Θ, and

  sup_{θ∈Θ} |Dn(θ) − D0(θ)| →p 0 under Pθ0 for all θ0 ∈ Θ,

  then θ̂MLE is consistent.

27/95

Kullback-Leibler Divergence and MLE

- Convexity Lemma: In general, for a sequence of random functions Gn and a function G0 defined on Θ, if Gn is a sequence of convex functions and Θ is a convex set, then the uniform convergence

  sup_{θ∈Θ} |Gn(θ) − G0(θ)| →p 0

  is implied by the pointwise convergence

  Gn(θ) − G0(θ) →p 0 for each θ ∈ Θ.

- MLE for a finite parameter space: Let P = {Pθ : θ ∈ Θ}, where Θ is finite, i.e., Θ = {θ1, . . . , θk} for some k ≥ 1. Assume θ in P is identifiable and the Pθ have common support. Then, θ̂MLE is consistent.

28/95

Sufficient Conditions for Consistency of MLE

- Suppose that we observe a random sample from Pθ in P = {Pθ : θ ∈ Θ}. Assume that θ in Θ is identifiable and that the Pθ have common support. Assume also that the likelihood is twice differentiable, diverges to −∞ on the boundary of Θ, and ℓ̈(θ) is negative definite for all θ ∈ Θ. Then, the MLE of θ, given by the unique solution of the likelihood equation ℓ̇(θ) = 0, is consistent. [See Theorem 6.1.3 and Corollary 6.1.1 in the text.]

- Logistic(θ, 1) example: The support of Pθ equals R, the log-likelihood ℓ(θ) → −∞ as θ → ±∞, and ℓ̈(θ) < 0 for all θ. Thus, the MLE given by the solution of ℓ̇(θ) = 0 is consistent.

29/95

6.1 Maximum Likelihood Estimation

6.2 Information Inequality and Efficiency

6.3 Maximum Likelihood Tests

30/95

Basic Regularity Assumptions

Let X1, · · · , Xn be a random sample from a distribution with pdf f(·; θ), θ ∈ Θ ⊂ R^d, with respect to a measure µ. (µ is either the Lebesgue measure or the counting measure.)

(R0) The parameter θ is identifiable in Θ.

(R1) The densities f(·; θ) have common support X.

(R2) The parameter space Θ is open in R^d.

(R3) The log-density log f(x; θ) is twice differentiable as a function of θ for all x ∈ X.

(R4) For any statistic u(X1, . . . , Xn) with finite expectation, the integral

  Eθ(u(X1, . . . , Xn)) = ∫_{Xⁿ} u(x1, . . . , xn) ∏_{i=1}^n f(xi; θ) dµ(x1) · · · dµ(xn)

  is twice differentiable under the integral sign.

31/95

Fisher Information

- The derivative of the log-density log f(x; θ) as a function of θ is called the score function. The larger the magnitude of the score function is, the more information one has about θ.

- Fisher information (that an observation of X1 has about θ) is defined by

  I1(θ) or I_X1(θ) = varθ((∂/∂θ) log f(X1; θ)).

- Bartlett identity (first-order): Under (R0)–(R4), we get

  Eθ((∂/∂θ) log f(X1; θ)) = 0.

- Bartlett identity (second-order): Under (R0)–(R4), we get

  varθ((∂/∂θ) log f(X1; θ)) = Eθ(−(∂²/∂θ²) log f(X1; θ)).
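A Monte Carlo sanity check of the two identities for the Poisson model, whose score is −1 + x/θ and whose Fisher information is 1/θ (these formulas are written out on a later slide); the sample size and θ here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 2.5
x = rng.poisson(theta, size=1_000_000)

score = -1.0 + x / theta
print(score.mean())  # ~ 0            (first-order Bartlett identity)
# var(score), E[-second derivative of log f], and 1/theta should all agree:
print(score.var(), (x / theta**2).mean(), 1.0 / theta)  # all ~ 0.4
```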

32/95

Information Inequality (Cramer-Rao Lower Bound)

Let X1, · · · , Xn be a random sample from a distribution with pdf f(·; θ), θ ∈ Θ ⊂ R^d, with respect to a measure µ, that satisfies (R0)–(R4). Assume further that I1(θ) is invertible. Then, for any statistic Un ≡ u(X1, . . . , Xn) with finite second moment, it holds that

  varθ(Un) ≥ ((∂/∂θ)EθUnᵀ)ᵀ (nI1(θ))⁻¹ ((∂/∂θ)EθUnᵀ)

for all θ ∈ Θ. In case Un is multivariate, A ≥ B for matrices A and B should be interpreted as meaning that A − B is nonnegative definite.

33/95

Proof of Information Inequality

Let Vn = ∑_{i=1}^n (∂/∂θ) log f(Xi; θ). Consider first the case where Un is univariate. For this case, we get

  varθ(Un) ≥ max_{a≠0} (aᵀcovθ(Vn, Un))² / (aᵀvarθ(Vn)a)
           = covθ(Vn, Un)ᵀ varθ(Vn)⁻¹ covθ(Vn, Un).

In case Un is multivariate, replace Un in the above inequality by tᵀUn and get

  tᵀ · varθ(Un) · t ≥ tᵀ · covθ(Vn, Un)ᵀ varθ(Vn)⁻¹ covθ(Vn, Un) · t for all t.

The inequality then follows since varθ(Vn) = nI1(θ) and

  covθ(Vn, Un) = Eθ(VnUnᵀ) = (∂/∂θ)EθUnᵀ

due to the first-order Bartlett identity and (R4).

34/95

Lower Bound on Variance of Unbiased Estimator

Let X1, · · · , Xn be a random sample from a distribution with pdf f(·; θ), θ ∈ Θ ⊂ R^d, with respect to a measure µ, that satisfies (R0)–(R4). Assume further that I1(θ) is invertible. Consider the problem of estimating g(θ) for a smooth function g, which may be a vector of real-valued functions. Then, for any unbiased estimator η̂ of η ≡ g(θ) with finite second moment, it holds that

  varθ(η̂) ≥ ((∂/∂θ)g(θ)ᵀ)ᵀ (nI1(θ))⁻¹ ((∂/∂θ)g(θ)ᵀ)

for all θ ∈ Θ.

35/95

MLE and Information Bound: Poisson Model

Suppose that we observe a random sample X1, . . . , Xn from Poisson(θ) for unknown θ ∈ Θ = (0,∞).

- Writing ℓ1(θ) ≡ log f(x; θ) = −θ + x log θ − log x!, we have

  ℓ̇1(θ) = −1 + x/θ,   ℓ̈1(θ) = −x/θ².

- Thus, I1(θ) = θ⁻¹, so that for any unbiased estimator θ̂ of θ we get

  varθ(θ̂) ≥ (nI1(θ))⁻¹ = θ/n for all θ > 0.

- θ̂MLE = X̄ is unbiased and varθ(θ̂MLE) = θ/n. Thus, θ̂MLE has the minimal variance among all unbiased estimators of θ; it is the so-called Uniformly Minimum Variance Unbiased Estimator (UMVUE) of θ.

36/95

MLE and Information Bound: Bernoulli Model

Suppose that we observe a random sample X1, . . . , Xn from Bernoulli(θ) for unknown θ ∈ Θ = (0, 1).

- We have

  ℓ̇1(θ) = x/θ − (1 − x)/(1 − θ),   ℓ̈1(θ) = −x/θ² − (1 − x)/(1 − θ)².

- Thus, I1(θ) = θ⁻¹(1 − θ)⁻¹, so that for any unbiased estimator θ̂ of θ we get

  varθ(θ̂) ≥ (nI1(θ))⁻¹ = θ(1 − θ)/n for all 0 < θ < 1.

- The estimator θ̂ = X̄, which is the MLE when 0 < X̄ < 1, is unbiased and varθ(X̄) = θ(1 − θ)/n. Thus, X̄ is the UMVUE of θ.

37/95

MLE and Information Bound: Bernoulli Model

Now, consider the estimation of η = θ(1 − θ).

- Since ∂η/∂θ = 1 − 2θ, we get that, for any unbiased estimator η̂ of η,

  varθ(η̂) ≥ (1 − 2θ)²(nI1(θ))⁻¹ = θ(1 − θ)(1 − 2θ)²/n for all 0 < θ < 1.

- For the estimator η̂ = X̄(1 − X̄), which is the MLE of η when 0 < X̄ < 1,

  Eθ(η̂) = ((n − 1)/n) · θ(1 − θ) = ((n − 1)/n) · η,

  varθ(η̂) = n⁻¹θ(1 − θ)(1 − 2θ)² + o(n⁻¹),

  where the o(n⁻¹) term equals 5n⁻²θ²(1 − θ)(3 − θ) + n⁻³θ(1 − θ)(6θ² − 6θ + 1).

- The bias-corrected estimator nη̂/(n − 1) is unbiased and has variance

  (n/(n − 1)²) · θ(1 − θ)(1 − 2θ)² > θ(1 − θ)(1 − 2θ)²/n

  unless θ = 1/2, but its relative efficiency to the CR lower bound approaches one as n → ∞.

38/95

MLE and Information Bound: Beta Model

Suppose that we observe a random sample X1, . . . , Xn from Beta(θ, 1) for unknown θ > 0.

- We have ℓ̇1(θ) = 1/θ + log x and ℓ̈1(θ) = −1/θ².

- Thus, I1(θ) = 1/θ², so that for any unbiased estimator θ̂ of θ we get

  varθ(θ̂) ≥ (nI1(θ))⁻¹ = θ²/n for all θ > 0.

- θ̂MLE = −n/(∑_{i=1}^n log Xi), and using the fact that the −θ log Xi are i.i.d. Exp(1), we get

  Eθ(θ̂MLE) = (n/(n − 1)) θ,   varθ(θ̂MLE) = (n²/((n − 1)²(n − 2))) θ².

- Again, the unbiased estimator (n − 1)θ̂MLE/n has variance θ²/(n − 2), which is strictly larger than the CR lower bound, but its relative efficiency to the lower bound converges to one as n → ∞.

39/95

MLE and Information Bound: Normal Model

Suppose that we observe a random sample X1, . . . , Xn from N(µ, σ²) for unknown θ = (µ, σ²) ∈ R × R+.

- The derivatives of the log-density:

  ℓ̇1(θ) = ( (x − µ)/σ² , (x − µ)²/(2σ⁴) − 1/(2σ²) )ᵀ,

  ℓ̈1(θ) = [ −1/σ²          −(x − µ)/σ⁴
             −(x − µ)/σ⁴    −(x − µ)²/σ⁶ + 1/(2σ⁴) ].

- The Fisher information about θ from a single observation:

  I1(θ) = Eθ(−ℓ̈1(θ)) = [ 1/σ²   0
                          0      1/(2σ⁴) ].

40/95

MLE and Information Bound: Normal Model

- For any unbiased estimators µ̂ and σ̂² of µ and σ²,

  varθ(µ̂) ≥ (1, 0)(nI1(θ))⁻¹(1, 0)ᵀ = σ²/n,

  varθ(σ̂²) ≥ (0, 1)(nI1(θ))⁻¹(0, 1)ᵀ = 2σ⁴/n.

- For the unbiased estimators X̄ and S² = ∑_{i=1}^n (Xi − X̄)²/(n − 1) of µ and σ², respectively, we get

  varθ(X̄) = σ²/n,   varθ(S²) = 2σ⁴/(n − 1).

- Thus, X̄ achieves the minimal variance, and S² has a variance slightly larger than the lower bound. Later we will see that S² also achieves the minimal variance among all unbiased estimators. This means that the CR lower bound may not be sharp!

41/95

MLE and Information Bound: Normal Model

- The lower bounds for µ and σ² do not change under the sub-models where one of µ and σ² is known (given). The main reason is that the covariance of the score functions, i.e., the off-diagonal term of I1(θ), equals zero.

- In general, for a nonsingular symmetric matrix

  A = [ A11  A12
        A21  A22 ],

  it holds that

  (A⁻¹)11 = A11⁻¹ + A11⁻¹A12(A22 − A21A11⁻¹A12)⁻¹A21A11⁻¹ ≥ A11⁻¹,

  and the equality holds if A12 = 0.

42/95

Asymptotic Efficiency

- Unbiased estimators: Typically, the CR lower bound for unbiased estimators is asymptotically sharp. Thus, the asymptotic efficiency of an unbiased estimator η̂ of a real-valued parameter η = g(θ) is defined by

  eff(η̂) = lim_{n→∞} ((∂/∂θ)g(θ))ᵀ(nI1(θ))⁻¹((∂/∂θ)g(θ)) / varθ(η̂).

- For estimators η̂ with varθ(η̂) ≥ Cn⁻¹ for some constant C > 0 and (∂/∂θ)bias_η̂(θ) → 0 as n → ∞, we get

  varθ(η̂) ≥ ((∂/∂θ)g(θ) + o(1))ᵀ (nI1(θ))⁻¹ ((∂/∂θ)g(θ) + o(1))
           = ((∂/∂θ)g(θ))ᵀ (nI1(θ))⁻¹ ((∂/∂θ)g(θ)) + o(n⁻¹),

  where bias_η̂(θ) = Eθ(η̂) − η. Thus, the asymptotic efficiency eff(η̂) in terms of variance defined above may be extended to these estimators.

43/95

Asymptotic Efficiency

- Lower bound in terms of mean squared error:

  MSEθ(η̂) ≥ bias_η̂(θ)² + ((∂/∂θ)g(θ))ᵀ (nI1(θ))⁻¹ ((∂/∂θ)g(θ)) + o(n⁻¹).

- For those estimators with varθ(η̂) ≥ Cn⁻¹ for some constant C > 0, (∂/∂θ)bias_η̂(θ) → 0 as n → ∞, and bias_η̂(θ) = o(√varθ(η̂)), it follows that

  (lower bound for MSE) / MSEθ(η̂) = (eff(η̂) + o(1)) / (1 + o(1)) = eff(η̂) + o(1).

  Thus, eff(η̂) may also serve as a definition of the asymptotic efficiency in terms of mean squared error for these estimators.

44/95

Examples: Asymptotic Efficiency of MLE

- Bernoulli model with Θ = (0, 1): Note first that Pθ(θ̂MLE = X̄) ≥ Pθ(0 < X̄ < 1) → 1 for all θ ∈ (0, 1). For the MLE η̂ = X̄(1 − X̄) of η = θ(1 − θ), we get

  bias_η̂(θ) = −θ(1 − θ)/n,   varθ(η̂) = θ(1 − θ)(1 − 2θ)²/n + o(n⁻¹).

  Thus, eff(η̂) = 1 unless θ = 1/2, both in terms of variance and mean squared error.

- Beta model with Θ = (0,∞): For the MLE θ̂ = −n/(∑_{i=1}^n log Xi) of θ, we have

  bias_θ̂(θ) = θ/(n − 1),   varθ(θ̂) = θ²/n + o(n⁻¹).

  Thus, eff(θ̂) = 1 both in terms of variance and mean squared error.

45/95

Fisher’s Wrong Conjecture

(1) √n(θ̂MLE − θ) →d N(0, I1(θ)⁻¹).

(2) For any estimator θ̂ such that √n(θ̂ − θ) →d N(0, v(θ)), it follows that v(θ) ≥ I1(θ)⁻¹ for all θ ∈ Θ. (R. A. Fisher, 1925)

- The conjecture (1) is true under some regularity conditions on the model, while (2) may not be true even under the regularity conditions on the model. (J. L. Hodges, Jr., 1951)

- Among regular estimators of θ, the conjecture (2) was found to be true under some regularity conditions on the model. (L. Le Cam, 1953)

- Asymptotically efficient estimator: If a regular estimator θ̂ satisfies √n(θ̂ − θ) →d N(0, I1(θ)⁻¹), then the estimator is said to be asymptotically efficient.

46/95

Further Regularity Assumptions

Let X1, · · · , Xn be a random sample from a distribution with pdf f(·; θ), θ ∈ Θ ⊂ R^d, with respect to a measure µ. Under the following assumptions, in addition to (R0)–(R4), we may show that the MLE (precisely, the solution of the likelihood equation) is asymptotically normal with mean zero and variance I1(θ)⁻¹.

(R5) For all θ0 ∈ Θ, there exist δ0 > 0 and M(·) with Eθ0 M(X1) < ∞ such that

  max_{θ:|θ−θ0|≤δ0} |(∂³/∂θi∂θj∂θk) log f(X; θ)| ≤ M(X).

(R6) The likelihood equation ℓ̇(θ) = 0 has a unique solution θ̂, and the solution is a consistent estimator of θ.

(R7) The Fisher information I1(θ) exists and is non-singular for all θ ∈ Θ.

47/95

Asymptotic Normality of MLE

- Theorem: Under the assumptions (R0)–(R7),

  √n(θ̂ − θ) →d N(0, I1(θ)⁻¹) under Pθ for all θ ∈ Θ.

- Proof of the theorem: For simplicity, consider only the case Θ ⊂ R, i.e., d = 1. The multi-dimensional extension only involves more complexity in notation. Let Sj(θ) denote the jth derivative of the scaled log-likelihood n⁻¹ℓ(θ). The theorem follows immediately from the facts that, for any θ0 ∈ Θ,

  (1) 0 = S1(θ0) + (S2(θ0) + (θ̂ − θ0)S3(θ*)/2)(θ̂ − θ0),
  (2) S3(θ*) = Op(1) under Pθ0,
  (3) S2(θ0) = −I1(θ0) + op(1) under Pθ0,
  (4) √n S1(θ0) →d N(0, I1(θ0)) under Pθ0.

48/95

Asymptotic Normality of MLE

The fact (1) is simply a Taylor expansion, (3) follows from the WLLN, and (4) from the CLT. For the proof of (2), we observe

  lim sup_{n→∞} Pθ0(|S3(θ*)| > C)
    ≤ lim sup_{n→∞} Pθ0(|S3(θ*)| > C, |θ̂ − θ0| ≤ δ0)
    ≤ lim sup_{n→∞} Pθ0(n⁻¹ ∑_{i=1}^n M(Xi) > C)
    ≤ C⁻¹ Eθ0 M(X1) → 0 as C → ∞.

49/95

Asymptotic Normality of MLE

- Asymptotic linearity of MLE: Under the assumptions (R0)–(R7),

  √n(θ̂ − θ) = I1(θ)⁻¹ √n S1(θ) + op(1) under Pθ for all θ ∈ Θ.

- Function of θ: Let η = g(θ) for a smooth function g, and let η̂ = g(θ̂) for θ̂ in (R6). Under the assumptions (R0)–(R7), it follows that

  √n(η̂ − g(θ)) →d N(0, ġ(θ)ᵀ I1(θ)⁻¹ ġ(θ)) under Pθ for all θ ∈ Θ,

  where ġ(θ) = (∂/∂θ)g(θ)ᵀ.

50/95

Examples: Asymptotic Normality of MLE

- Bernoulli model with Θ = (0, 1): Recall I1(θ) = θ⁻¹(1 − θ)⁻¹ and that Pθ(θ̂MLE = X̄) ≥ Pθ(0 < X̄ < 1) → 1 for all θ ∈ (0, 1). By the CLT we get √n(θ̂MLE − θ) →d N(0, θ(1 − θ)).

- Poisson model with Θ = (0,∞): Again, Pθ(θ̂MLE = X̄) ≥ Pθ(X̄ > 0) → 1 for all θ > 0. Recall I1(θ) = θ⁻¹. By the CLT we get √n(θ̂MLE − θ) →d N(0, θ).

- Exponential model with Θ = (0,∞): Recall θ̂MLE = X̄ and note I1(θ) = θ⁻². By the CLT, √n(θ̂MLE − θ) →d N(0, θ²). A quick simulation check follows below.
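A quick simulation of the exponential case (the sample size, number of replications, and seed are arbitrary): the standardized MLE should have mean near 0 and standard deviation near θ.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 400, 10_000
x = rng.exponential(scale=theta, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - theta)  # sqrt(n)*(MLE - theta), one per sample
print(z.mean(), z.std())                   # approx 0 and approx theta = 2.0
```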

51/95

Examples: Asymptotic Normality of MLE

- Beta model with Θ = (0,∞): Recall that −1/θ̂MLE = n⁻¹ ∑_{i=1}^n log Xi and I1(θ) = θ⁻². By the CLT, and since Eθ(log X1) = −θ⁻¹ and varθ(log X1) = θ⁻²,

  √n(n⁻¹ ∑_{i=1}^n log Xi + θ⁻¹) →d N(0, θ⁻²).

  Thus, √n(θ̂MLE − θ) →d N(0, θ²).

- Double exponential model DE(θ, 1) with Θ = R: Note that θ is the median of the distribution Pθ, f(θ; θ) = 1/2, and θ̂MLE = med(Xi), so that √n(θ̂MLE − θ) →d N(0, 1) (see page 21 of these slides).

52/95

Examples: Asymptotic Normality of MLE

- Double exponential model (continued): The density f(x; θ) = e^{−|x−θ|}/2 is not differentiable (pointwise) as a function of θ, but it is differentiable in the L2 sense (Fréchet differentiability) with derivative (∂/∂θ)f(x; θ) = sgn(x − θ)e^{−|x−θ|}/2:

  lim_{h→0} ∫_{−∞}^{∞} ( h⁻¹(e^{−|x−θ−h|}/2 − e^{−|x−θ|}/2) − sgn(x − θ)e^{−|x−θ|}/2 )² dx = 0.

  With this derivative, I1(θ) ≡ 1.

- Remark: Indeed, the notion of Fisher information is generalized to models with Fréchet differentiability, and the regularity conditions (R0)–(R7) may be relaxed based on the generalization.

53/95

Examples: Asymptotic Normality of MLE

- Logistic(θ, 1) model with Θ = R: We know that θ̂MLE exists and is unique, but we do not have its explicit form. One may check that all the regularity conditions (R0)–(R7) are met, so that we have √n(θ̂MLE − θ) →d N(0, I1(θ)⁻¹). Now,

  ℓ̇1(θ) = (1 − e^{−(x−θ)})/(1 + e^{−(x−θ)}),

  I1(θ) ≡ ∫_{−∞}^{∞} ((1 − e^{−x})/(1 + e^{−x}))² · e^{−x}/(1 + e^{−x})² dx = 1/3.

  Thus, we conclude that √n(θ̂MLE − θ) →d N(0, 3).

- Uniform(0, θ) model with θ > 0 (non-regular model): Note that θ̂MLE = X(n), and in this case

  n(θ̂MLE − θ) →d −Exp(θ).

54/95

Examples: Asymptotic Normality of MLE

- Multinomial model: Let X1, . . . , Xn be i.i.d. d-dimensional observations from Multinomial(1, (p1, . . . , pd)), 0 < pj < 1, ∑_{j=1}^d pj = 1. Write X = (X1, . . . , Xd)ᵀ, x = (x1, . . . , xd) for a realization x of Xi, and θ = (p1, . . . , pd−1)ᵀ. Note that

  θ̂MLE = (X̄1, . . . , X̄d−1)ᵀ

  with probability tending to one, under Pθ for all θ ∈ Θ. By the CLT,

  √n(θ̂MLE − θ) →d N(0, diag(θj) − θθᵀ).

  Observe that

  ℓ̇1(θ, x) = (x1/θ1 − xd/θd, . . . , xd−1/θd−1 − xd/θd)ᵀ,
  ℓ̈1(θ, x) = −diag(xj/θj²) − (xd/θd²) · 11ᵀ,
  I1(θ) = diag(1/θj) + (1/θd) · 11ᵀ,   I1(θ)⁻¹ = diag(θj) − θθᵀ.

55/95

Asymptotic Relative Efficiency

- For two estimators θ̂1 and θ̂2 of θ that are asymptotically normal, √n(θ̂j − θ) →d N(0, vj(θ)), the asymptotic relative efficiency of θ̂1 against θ̂2 is defined by

  ARE(θ̂1, θ̂2) = v2(θ)/v1(θ).

- MLE against sample mean or sample median: When f(x; θ) = f0(x − θ),

  f0              N(0, 1)   Logistic(0, 1)   DE(0, 1)
  against mean    1         π²/9             2
  against median  π/2       4/3              1

  For the logistic model, we have used the fact ∑_{k=1}^∞ k⁻² = π²/6, so that

  ∫_{−∞}^{∞} x² e^{−x}(1 + e^{−x})⁻² dx = 4 ∫_0^∞ x e^{−x}(1 + e^{−x})⁻¹ dx
    = 4 ∑_{k=1}^∞ (−1)^{k−1} k⁻¹ ∫_0^∞ e^{−kx} dx = 4 ∑_{k=1}^∞ (−1)^{k−1}/k² = π²/3.
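The integral and the resulting table entry can be checked numerically; a sketch using scipy's quadrature. Since the logistic density is symmetric, e^{−x}/(1 + e^{−x})² = e^{−|x|}/(1 + e^{−|x|})², and the |x| form avoids overflow.

```python
import numpy as np
from scipy.integrate import quad

# Variance of Logistic(0,1): the integral computed above equals pi^2/3.
f = lambda x: x**2 * np.exp(-abs(x)) / (1.0 + np.exp(-abs(x)))**2
val, _ = quad(f, -np.inf, np.inf)
print(val, np.pi**2 / 3)  # both approx 3.2899

# ARE of the MLE against the sample mean under Logistic(0,1):
# v(mean) = pi^2/3, v(MLE) = 1/I1 = 3, so the ratio is pi^2/9.
print((np.pi**2 / 3) / 3.0, np.pi**2 / 9)
```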

56/95

One-Step Approximation of MLE

We have seen that the MLE may exist but may not have a closed form when the likelihood equation is nonlinear. In that case, we have learned that the Newton-Raphson method may be used to approximate the MLE. The following theorem tells us that, if the initial choice θ[0] is well located (precisely, lies in an n^{−1/2}-neighborhood of the true θ in probability), then the one-step update in the Newton-Raphson iteration is enough, i.e., the estimator

  θ[1] = θ[0] − ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])

has the same first-order properties as the MLE.

57/95

One-Step Approximation of MLE

- Theorem: Assume the assumptions (R0)–(R7) and that θ[0] = θ + Op(n^{−1/2}) under Pθ for all θ ∈ Θ. Then,

  √n(θ[1] − θ) →d N(0, I1(θ)⁻¹) under Pθ for all θ ∈ Θ.

- Proof of the theorem: For simplicity again, consider the case Θ ⊂ R. Let Sj(θ) be the jth derivative of the scaled log-likelihood n⁻¹ℓ(θ). Then,

  √n(θ[1] − θ)
    = √n(θ[0] − θ) + (−S2(θ[0]))⁻¹ · √n S1(θ[0])
    = √n(θ[0] − θ) + (−S2(θ) − S3(θ*)(θ[0] − θ))⁻¹
      · (√n S1(θ) + S2(θ)√n(θ[0] − θ) + S3(θ**)√n(θ[0] − θ)²/2).

58/95

One-Step Approximation of MLE

- Proof of the theorem (continued): Since S3(θ*), S3(θ**) and √n(θ[0] − θ) are all Op(1) (thus θ[0] − θ = op(1)), and S2(θ) = −I1(θ) + op(1), we obtain

  √n(θ[1] − θ) = √n(θ[0] − θ) + (I1(θ) + op(1))⁻¹ · (√n S1(θ) − I1(θ)√n(θ[0] − θ) + op(1))
               = I1(θ)⁻¹ · √n S1(θ) + op(1)
               →d N(0, I1(θ)⁻¹)

  under Pθ for all θ ∈ Θ.

59/95

One-step Approximation of MLE: Logistic Model

Suppose that we observe a random sample X1, . . . , Xn from Logistic(θ, 1) for unknown θ ∈ R. We have

  ℓ(θ) = −nX̄ + nθ − 2 ∑_{i=1}^n log(1 + exp(−(Xi − θ))),

  ℓ̇(θ) = n − 2 ∑_{i=1}^n exp(−(Xi − θ))(1 + exp(−(Xi − θ)))⁻¹,

  ℓ̈(θ) = −2 ∑_{i=1}^n exp(−(Xi − θ))(1 + exp(−(Xi − θ)))⁻².

With θ[0] = X̄, which is √n-consistent, the one-step estimator

  θ[1] = θ[0] − ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])

is an asymptotically efficient estimator of θ.
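A minimal sketch of the one-step estimator with θ[0] = X̄; the sample size, seed, and true θ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, theta = 400, 0.7
u = rng.uniform(size=n)
x = theta + np.log(u / (1.0 - u))  # Logistic(theta, 1) sample by inversion

t0 = x.mean()                       # sqrt(n)-consistent starting value
p = 1.0 / (1.0 + np.exp(x - t0))    # e^{-(xi-t0)}/(1 + e^{-(xi-t0)})
score = n - 2.0 * p.sum()
hess = -2.0 * np.sum(p * (1.0 - p))
t1 = t0 - score / hess              # a single Newton-Raphson update
print(t0, t1)                       # t1 is already first-order efficient
```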

60/95

One-step Approximation of MLE: Gamma Model

Suppose that we observe a random sample X1, . . . , Xn from Gamma(α, β) for unknown α, β > 0. Let θ = (α, β)ᵀ.

- Finding a √n-consistent estimator of θ: Try the MME. Since Eθ(X1) = αβ and varθ(X1) = αβ², we get

  α̂MME = (n⁻¹ ∑_{i=1}^n (Xi − X̄)²)⁻¹ X̄²,   β̂MME = n⁻¹ ∑_{i=1}^n (Xi − X̄)² · (1/X̄).

  It can be shown that θ̂MME = θ + Op(n^{−1/2}) under Pθ for all θ ∈ Θ.

- One-step MLE: With θ[0] = θ̂MME, the one-step estimator

  θ[1] = θ[0] − ℓ̈(θ[0])⁻¹ ℓ̇(θ[0])

is an asymptotically efficient estimator of θ.
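A sketch under the parameterization above (mean αβ, so β is the scale), with the score and Hessian of the Gamma log-likelihood written out explicitly; the simulated data, sample size, and seed are illustrative.

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(6)
n, a, b = 500, 3.0, 2.0
x = rng.gamma(shape=a, scale=b, size=n)

# Method-of-moments start: alpha = mean^2/var, beta = var/mean.
m, v = x.mean(), x.var()
t0 = np.array([m * m / v, v / m])

def score(t):
    # log f = -log Gamma(al) - al*log(be) + (al-1)*log x - x/be
    al, be = t
    return np.array([
        -n * digamma(al) - n * np.log(be) + np.log(x).sum(),
        -n * al / be + x.sum() / be**2,
    ])

def hessian(t):
    al, be = t
    return np.array([
        [-n * polygamma(1, al), -n / be],
        [-n / be, n * al / be**2 - 2.0 * x.sum() / be**3],
    ])

t1 = t0 - np.linalg.solve(hessian(t0), score(t0))  # one-step MLE
print(t0, t1)  # both near (3.0, 2.0); t1 is first-order efficient
```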

61/95

6.1 Maximum Likelihood Estimation

6.2 Information Inequality and Efficiency

6.3 Maximum Likelihood Tests

62/95

Idea of Maximum Likelihood Test

Suppose that we observe a random sample X1, . . . , Xn from a distribution Pθ with pdf f(·; θ), θ ∈ Θ.

- The problem: Testing

  H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1

  at a significance level 0 < α < 1, where Θ0 is a subset of Θ and Θ1 = Θ \ Θ0.

- Fisher's idea: Compare the maximum likelihoods over Θ0 and Θ1. For a given observation x ≡ (x1, . . . , xn) of X ≡ (X1, . . . , Xn), reject H0 if

  max_{θ∈Θ1} L(θ; x) / max_{θ∈Θ0} L(θ; x) ≥ c

  for a critical value c that is determined by the level α as follows.

63/95

Idea of Maximum Likelihood Test

- Determination of the critical value: Choose c such that (the size of the test) = α, i.e.,

  sup_{θ∈Θ0} Pθ( max_{θ'∈Θ1} L(θ'; X) / max_{θ'∈Θ0} L(θ'; X) > c ) = α.

- The maximization of the likelihood over a restricted set, such as Θ1, is more involved than the maximization over Θ.

- Note that, for c > 1,

  R0 ≡ max_{θ∈Θ1} L(θ; X) / max_{θ∈Θ0} L(θ; X) ≥ c
  ⇔ R ≡ max_{θ∈Θ} L(θ; X) / max_{θ∈Θ0} L(θ; X) = max{1, R0} ≥ c.

  If there exists c > 0 such that sup_{θ∈Θ0} Pθ(R ≥ c) = α, then c > 1 since R ≥ 1 always and α < 1. This means that the LRT may be based on R, rather than R0.

64/95

Likelihood Ratio Test

Let θ̂Θ and θ̂Θ0 denote the MLEs over Θ and Θ0, respectively. The likelihood ratio test (LRT), for H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1, rejects H0 when

  max_{θ∈Θ} L(θ; x) / max_{θ∈Θ0} L(θ; x) ≥ c,

or equivalently when

  2(ℓ(θ̂Θ; x) − ℓ(θ̂Θ0; x)) ≥ c′,

where c and c′ are determined by the given level α.

Remark: When the LR test statistic, 2(ℓ(θ̂Θ; X) − ℓ(θ̂Θ0; X)), has a distribution on a discrete set, there may not exist c′ such that (the size of the test) = α. In that case one may randomize to meet the level condition.

65/95

Examples of LRT: Normal Mean

Suppose that we observe a random sample X1, . . . , Xn from N(θ, σ²) with known σ², θ ∈ R, and want to test H0 : θ = θ0 against H1 : θ ≠ θ0 at a level 0 < α < 1.

- Log-likelihood:

  ℓ(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (xi − x̄)² − (n/(2σ²))(x̄ − θ)².

- Maximization of the likelihood: θ̂Θ = x̄ and θ̂Θ0 = θ0.

- Likelihood ratio:

  2(ℓ(θ̂Θ; x) − ℓ(θ̂Θ0; x)) = 2(ℓ(x̄; x) − ℓ(θ0; x)) = (n/σ²)(x̄ − θ0)² = ((x̄ − θ0)/(σ/√n))².

66/95

Examples of LRT: Normal Mean

- Rejection region: Reject H0 if |(x̄ − θ0)/(σ/√n)| ≥ c.

- Determination of the critical value c:

  Pθ0(|(X̄ − θ0)/(σ/√n)| ≥ z_{α/2}) = α.

- Power function:

  γ(θ) ≡ Pθ(|(X̄ − θ0)/(σ/√n)| ≥ z_{α/2})
       = Φ(−z_{α/2} − (θ − θ0)/(σ/√n)) + Φ(−z_{α/2} + (θ − θ0)/(σ/√n)).

- The power function is symmetric about θ0, takes the minimal value α at θ = θ0, and increases to one as θ gets away from θ0.

67/95

Examples of LRT: Exponential Mean

Suppose that we observe a random sample X1, . . . , Xn from Exp(θ) with mean θ > 0, and want to test H0 : θ = θ0 against H1 : θ ≠ θ0 at a level 0 < α < 1.

- Log-likelihood: ℓ(θ) = −n log θ − nx̄/θ.

- Maximization of the likelihood: θ̂Θ = x̄ and θ̂Θ0 = θ0.

- Likelihood ratio:

  2(ℓ(θ̂Θ; x) − ℓ(θ̂Θ0; x)) = 2n(x̄/θ0 − log(x̄/θ0) − 1).

- Rejection region:

  x̄/θ0 ≤ c1 or x̄/θ0 ≥ c2, with c1 − log c1 = c2 − log c2.

  [Figure: the convex function t ↦ t − log t − 1 with a horizontal cutoff λ determining c1 and c2.]

68/95

Examples of LRT: Exponential Mean

- Determination of the critical values: Since 2∑_{i=1}^n Xi/θ0 =d χ²(2n) under Pθ0, we choose c1 and c2 such that

  c1 − log c1 = c2 − log c2 and ∫_{2nc1}^{2nc2} pdf_{χ²(2n)}(y) dy = 1 − α.

- Approximation of c1 and c2:

  χ²_{1−α/2}(2n)/(2n) = 1 − z_{α/2}/√n + o(n^{−1/2}),
  χ²_{α/2}(2n)/(2n) = 1 + z_{α/2}/√n + o(n^{−1/2}).

- Power function:

  γ(θ) ≡ 1 − Pθ(c1 ≤ X̄/θ0 ≤ c2) = F_{2n}(2nc1θ0/θ) + F̄_{2n}(2nc2θ0/θ),

  where F_{2n}(·) is the cdf of χ²(2n) and F̄_{2n} = 1 − F_{2n}.
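The pair (c1, c2) can be computed numerically, e.g., by nesting two root-finders; a sketch with illustrative n and α. Since g(c) = c − log c is decreasing on (0, 1) and increasing on (1, ∞), every c1 ∈ (0, 1) has a unique partner c2 > 1 with g(c1) = g(c2).

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

n, alpha = 20, 0.05
g = lambda c: c - np.log(c)  # g(c1) = g(c2) pins c2 down given c1

def c2_of_c1(c1):
    # The partner root c2 > 1 with g(c2) = g(c1), for any c1 in (0, 1).
    return brentq(lambda c: g(c) - g(c1), 1.0 + 1e-12, 100.0)

def coverage(c1):
    # P(c1 <= Xbar/theta0 <= c2) under H0, via 2*sum(Xi)/theta0 ~ chi2(2n).
    c2 = c2_of_c1(c1)
    return chi2.cdf(2 * n * c2, 2 * n) - chi2.cdf(2 * n * c1, 2 * n)

c1 = brentq(lambda c: coverage(c) - (1 - alpha), 1e-6, 1.0 - 1e-9)
c2 = c2_of_c1(c1)
print(c1, c2)  # reject H0 when xbar/theta0 <= c1 or >= c2
```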

69/95

Examples of LRT: Normal Mean With Unknown Variance

Suppose that we observe a random sample X1, . . . , Xn from N(µ, σ²), θ = (µ, σ²) ∈ R × R+, and want to test H0 : µ = µ0 against H1 : µ ≠ µ0 at a level 0 < α < 1.

- Log-likelihood:

  ℓ(θ) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^n (xi − µ)².

- Maximization of the likelihood: θ̂Θ = (x̄, σ̂²) and θ̂Θ0 = (µ0, σ̂0²), where σ̂² = n⁻¹ ∑_{i=1}^n (xi − x̄)² and σ̂0² = n⁻¹ ∑_{i=1}^n (xi − µ0)².

- Likelihood ratio: Since σ̂0² = σ̂² + (x̄ − µ0)², we get

  2(ℓ(θ̂Θ; x) − ℓ(θ̂Θ0; x)) = n log(1 + (x̄ − µ0)²/σ̂²).

70/95

Examples of LRT: Normal Mean With Unknown Variance

- Rejection region: Let s² = (n − 1)⁻¹ ∑_{i=1}^n (xi − x̄)². Reject H0 if |(x̄ − µ0)/(s/√n)| ≥ c.

- Determination of the critical value c:

  Pθ0(|(X̄ − µ0)/(S/√n)| ≥ t_{α/2}(n − 1)) = α.

- Wilks' phenomenon: Since Tn ≡ (X̄ − µ0)/(S/√n) =d t(n − 1) under H0 and t(n − 1) →d Z =d N(0, 1), we get

  2(ℓ(θ̂Θ; X) − ℓ(θ̂Θ0; X)) = n log(1 + Tn²/(n − 1)) = Tn² + op(1) →d Z² =d χ²(1)

  under H0. Note that (d.f.) = dim(Θ) − dim(Θ0) = 2 − 1 = 1.

71/95

Examples of LRT: Normal Mean With Unknown Variance

- Power function: With δ = √n(µ − µ0)/σ,

  γ(δ) ≡ Pθ(|(X̄ − µ0)/(S/√n)| ≥ t_{α/2}(n − 1)) = P(|(Z + δ)/√(V/(n − 1))| ≥ t_{α/2}(n − 1)),

  where Z =d N(0, 1) and V =d χ²(n − 1) are independent.

- The power function is symmetric about δ = 0, takes the minimal value α at δ = 0, and increases to one as δ gets away from 0.

72/95

Examples of LRT: Two Normal Means With Common Variance

Let X1, . . . , Xn1 and Y1, . . . , Yn2 be random samples, respectively, from N(µ1, σ²) and N(µ2, σ²), θ = (µ1, µ2, σ²) ∈ R² × R+. Assume that (X1, . . . , Xn1) and (Y1, . . . , Yn2) are independent. We want to test H0 : µ1 = µ2 against H1 : µ1 ≠ µ2 at a level 0 < α < 1.

- Log-likelihood:

  ℓ(θ) = (const.) − ((n1 + n2)/2) log σ² − (1/(2σ²)) (∑_{i=1}^{n1} (xi − µ1)² + ∑_{i=1}^{n2} (yi − µ2)²).

- Maximization of the likelihood: θ̂Θ = (x̄, ȳ, σ̂²), θ̂Θ0 = (µ̂0, µ̂0, σ̂0²), where

  σ̂² = (∑_{i=1}^{n1} (xi − x̄)² + ∑_{i=1}^{n2} (yi − ȳ)²) / (n1 + n2),
  σ̂0² = (∑_{i=1}^{n1} (xi − µ̂0)² + ∑_{i=1}^{n2} (yi − µ̂0)²) / (n1 + n2),
  µ̂0 = (n1x̄ + n2ȳ)/(n1 + n2).

73/95

Examples of LRT: Two Normal Means With Common Variance

- Likelihood ratio: Note that

  σ̂0²/σ̂² = 1 + (n1(x̄ − µ̂0)² + n2(ȳ − µ̂0)²)/((n1 + n2)σ̂²) = 1 + n1n2(x̄ − ȳ)²/((n1 + n2)²σ̂²).

  Let s_p² = (n1 + n2)σ̂²/(n1 + n2 − 2) be the pooled sample variance. Then, we get

  2(ℓ(θ̂Θ; x, y) − ℓ(θ̂Θ0; x, y)) = (n1 + n2) log(σ̂0²/σ̂²)
    = (n1 + n2) log(1 + n1n2(x̄ − ȳ)²/((n1 + n2)²σ̂²))
    = (n1 + n2) log(1 + (1/(n1 + n2 − 2)) · (x̄ − ȳ)²/((n1⁻¹ + n2⁻¹)s_p²)).

- Rejection region: Reject H0 if

  |(x̄ − ȳ)/(s_p √(n1⁻¹ + n2⁻¹))| ≥ t_{α/2}(n1 + n2 − 2).

74/95

Examples of LRT: Two Normal Means With Common Variance

- Wilks' phenomenon: Since Tn ≡ (X̄ − Ȳ)/(Sp √(n1⁻¹ + n2⁻¹)) =d t(n1 + n2 − 2) under H0 and t(n1 + n2 − 2) →d Z =d N(0, 1), we get

  2(ℓ(θ̂Θ; X, Y) − ℓ(θ̂Θ0; X, Y)) = (n1 + n2) log(1 + Tn²/(n1 + n2 − 2)) = Tn² + op(1) →d Z² =d χ²(1)

  under H0. Note that (d.f.) = dim(Θ) − dim(Θ0) = 3 − 2 = 1.

- Power function: With δ = (µ1 − µ2)/(σ √(n1⁻¹ + n2⁻¹)),

  γ(δ) ≡ Pθ(reject H0) = P(|(Z + δ)/√(V/(n1 + n2 − 2))| ≥ t_{α/2}(n1 + n2 − 2)),

  where Z =d N(0, 1) and V =d χ²(n1 + n2 − 2) are independent.

75/95

Examples of LRT: Two Normal Variances

Let X1, . . . , Xn1 and Y1, . . . , Yn2 be random samples, respectively, from N(µ1, σ1²) and N(µ2, σ2²), θ = (µ1, µ2, σ1², σ2²) ∈ R² × R+². Assume that (X1, . . . , Xn1) and (Y1, . . . , Yn2) are independent. We want to test H0 : σ1² = σ2² against H1 : σ1² ≠ σ2² at a level 0 < α < 1.

- Log-likelihood:

  ℓ(θ) = (const.) − (n1/2) log σ1² − (n2/2) log σ2² − (1/(2σ1²)) ∑_{i=1}^{n1} (xi − µ1)² − (1/(2σ2²)) ∑_{i=1}^{n2} (yi − µ2)².

- Maximization of the likelihood: θ̂Θ = (x̄, ȳ, σ̂1², σ̂2²), θ̂Θ0 = (x̄, ȳ, σ̂², σ̂²), where

  σ̂1² = ∑_{i=1}^{n1} (xi − x̄)²/n1,   σ̂2² = ∑_{i=1}^{n2} (yi − ȳ)²/n2,

  σ̂² = (∑_{i=1}^{n1} (xi − x̄)² + ∑_{i=1}^{n2} (yi − ȳ)²)/(n1 + n2) = (n1σ̂1² + n2σ̂2²)/(n1 + n2).

76/95

Examples of LRT: Two Normal Variances

- Likelihood ratio: Let s1² = ∑_{i=1}^{n1} (xi − x̄)²/(n1 − 1) and define s2² likewise with the yi. Then,

  2(ℓ(θ̂Θ; x, y) − ℓ(θ̂Θ0; x, y))
    = (n1 + n2) log σ̂² − n1 log σ̂1² − n2 log σ̂2²
    = n1 log(1 + ((n2 − 1)/(n1 − 1)) · s2²/s1²) + n2 log(1 + ((n1 − 1)/(n2 − 1)) · s1²/s2²) + (const)
    =: r_{n1,n2}(s1²/s2²).

- Rejection region: Reject H0 if

  s1²/s2² ≤ c1 or s1²/s2² ≥ c2,

  where c1 and c2 satisfy

  r_{n1,n2}(c1) = r_{n1,n2}(c2) and ∫_{c1}^{c2} pdf_{F(n1−1,n2−1)}(y) dy = 1 − α,

  using the fact that s1²/s2² =d F(n1 − 1, n2 − 1) under H0.

77/95

Examples of LRT: One-Sided Poisson Mean

Let X1, . . . , Xn be a random sample from Poisson(θ), θ > 0. We want to test H0 : θ ≤ θ0 against H1 : θ > θ0 at a level 0 < α < 1.

- Log-likelihood: ℓ(θ) = n(−θ + x̄ log θ − n⁻¹ ∑_{i=1}^n log xi!).

- Maximization of the likelihood: θ̂Θ = x̄ and θ̂Θ0 = min{x̄, θ0}.

- Likelihood ratio:

  2(ℓ(θ̂Θ; x) − ℓ(θ̂Θ0; x)) = 2n(x̄ log x̄ − x̄(1 + log θ0) + θ0) I(x̄ ≥ θ0).

- Rejection region: f(u) ≡ (u log u − u(1 + log θ0) + θ0) I(u ≥ θ0) equals zero for u ≤ θ0 and is strictly increasing for u > θ0. Thus, for any λ > 0, the inequality f(u) ≥ λ is equivalent to u ≥ λ′ for some λ′ that depends on λ. This implies that the

  LRT rejects H0 if ∑_{i=1}^n xi ≥ c.

78/95

Examples of LRT: One-Sided Poisson Mean

- Determination of the critical value: We need to find c > 0 such that

  sup_{θ≤θ0} Pθ(∑_{i=1}^n Xi ≥ c) = α.

  For an integer c, the rejection probability under Pθ equals 1 − ∑_{j=0}^{c−1} e^{−nθ}(nθ)^j/j!, which is an increasing function of θ. Thus, the supremum over θ ≤ θ0 is achieved at θ = θ0. The task then reduces to finding c > 0 such that

  Pθ0(∑_{i=1}^n Xi ≥ c) = α.

  But this is not possible for most values of 0 < α < 1.

- One may randomize to get a test such that Pθ0(reject H0) = α.

79/95

Examples of LRT: One-Sided Poisson Mean

- Randomization: For a given 0 < α < 1, suppose that

  1 − ∑_{j=0}^{c0} e^{−nθ0}(nθ0)^j/j! =: α0 < α < α1 := 1 − ∑_{j=0}^{c0−1} e^{−nθ0}(nθ0)^j/j!.

  The randomized LRT rejects H0 if ∑_{i=1}^n xi ≥ c0 + 1 and does not reject H0 if ∑_{i=1}^n xi ≤ c0 − 1. On the boundary where ∑_{i=1}^n xi = c0, it rejects H0 with probability (α − α0)/(α1 − α0). If we denote by φLRT(u) the probability of rejecting H0 when ∑_{i=1}^n xi = u is observed, then

  φLRT(u) = 1 for u ≥ c0 + 1,
          = (α − α0)/(α1 − α0) for u = c0,
          = 0 for u ≤ c0 − 1.

- Level condition: It can be seen that

  Pθ0(reject H0) = Pθ0(∑_{i=1}^n Xi ≥ c0 + 1) + ((α − α0)/(α1 − α0)) · Pθ0(∑_{i=1}^n Xi = c0) = α.
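A sketch of the computation of c0 and the randomization probability (n, θ0 and α are illustrative); it uses the fact that ∑Xi is Poisson(nθ0) under Pθ0.

```python
import numpy as np
from scipy.stats import poisson

n, theta0, alpha = 10, 1.0, 0.05
mu = n * theta0  # the sum of the sample is Poisson(n*theta0) under H0

# Smallest c0 with alpha0 = P(sum >= c0+1) <= alpha < alpha1 = P(sum >= c0).
c0 = int(poisson.ppf(1 - alpha, mu))
alpha0 = poisson.sf(c0, mu)                    # P(sum >= c0 + 1)
alpha1 = poisson.sf(c0 - 1, mu)                # P(sum >= c0)
gamma = (alpha - alpha0) / (alpha1 - alpha0)   # rejection prob. at sum == c0

size = poisson.sf(c0, mu) + gamma * poisson.pmf(c0, mu)
print(c0, gamma, size)  # size equals alpha exactly
```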

80/95

Approximation of LRT: Simple Null

Let X1, . . . , Xn be a random sample from a distribution with pdf f(·; θ), θ ∈ Θ ⊂ R^d, with respect to a measure µ. Assume that the regularity conditions (R0)–(R7) hold. Consider the problem of testing H0 : θ = θ0. Let θ̂ denote the MLE over Θ. For the LRT statistic 2(ℓ(θ̂) − ℓ(θ0)), it holds that, under H0,

(1) 2(ℓ(θ̂) − ℓ(θ0)) →d χ²(d);
(2) 2(ℓ(θ̂) − ℓ(θ0)) = Wn + op(1);
(3) 2(ℓ(θ̂) − ℓ(θ0)) = Rn + op(1),

where Wn = (θ̂ − θ0)ᵀ(nI1(θ0))(θ̂ − θ0) (the Wald test statistic) and Rn = ℓ̇(θ0)ᵀ(nI1(θ0))⁻¹ℓ̇(θ0) (the Rao/score test statistic).

Remark: (1) is called Wilks' phenomenon.

81/95

Approximation of LRT: Simple Null

I Proof of (2): Let Sj(θ) denote the jth derivative of n−1`(θ). Since

˙(θ) = 0, we get

`(θ0) = `(θ) + (θ − θ0)> · nS2(θ∗) · (θ − θ0)/2

= `(θ) + (θ − θ0)>(−nI1(θ0) + op(n))(θ − θ0)/2

= `(θ)−Wn/2 + op(1).

I Proof of (3): Since√n(θ − θ0) = I1(θ0)−1√nS1(θ0) + op(1) and

√nS1(θ0) = Op(1) under Pθ0 , we obtain

Wn =(I1(θ0)−1√nS1(θ0) + op(1)

)>I1(θ0)

(I1(θ0)−1√nS1(θ0) + op(1)

)= nS1(θ0)>I1(θ0)−1S1(θ0) + op(1)

= Rn + op(1).

I Proof of (1): Follows immediately from (2) or (3), since $\sqrt{n}(\hat\theta - \theta_0) \stackrel{d}{\to} N(0, I_1(\theta_0)^{-1})$ under $H_0$ implies $W_n \stackrel{d}{\to} \chi^2(d)$.


Approximation of LRT: Exponential Mean (Page 66, LecNote)

I Hypothesis: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$.

I Asymptotic LRT: Reject $H_0$ if

$2n\bigl(\bar{x}/\theta_0 - \log(\bar{x}/\theta_0) - 1\bigr) \ge \chi^2_\alpha(1)$.

I Wald and Rao tests: Reject $H_0$ if

$n(\bar{x} - \theta_0)^2/\theta_0^2 \ge \chi^2_\alpha(1)$.

I Compare these with the LRT; the sketch below evaluates both statistics on the same data.
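A minimal comparison sketch, assuming Python with numpy and scipy; the sample and $\theta_0$ are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def exp_mean_tests(x, theta0, alpha=0.05):
    """Asymptotic LRT and Wald/Rao statistics for H0: theta = theta0, Exponential(theta)."""
    n, xbar = len(x), float(np.mean(x))
    r = xbar / theta0
    lrt = 2 * n * (r - np.log(r) - 1)              # asymptotic LRT statistic
    wald = n * (xbar - theta0) ** 2 / theta0 ** 2  # Wald = Rao in this model
    return lrt, wald, chi2.ppf(1 - alpha, df=1)

rng = np.random.default_rng(2)
x = rng.exponential(1.2, size=50)                  # hypothetical Exponential(1.2) sample
print(exp_mean_tests(x, theta0=1.0))
```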


Approximation of LRT: Beta Model

Suppose we observe a random sample $X_1, \ldots, X_n$ from Beta$(\theta, 1)$, $\theta > 0$.

I Hypothesis: $H_0 : \theta = 1$ versus $H_1 : \theta \ne 1$.

I Recall that the MLE $\hat\theta = -n/(\sum_{i=1}^n \log X_i)$ and $I_1(\theta) = 1/\theta^2$:

$\ell(\theta) = n\log\theta + (\theta - 1)\sum_{i=1}^n \log x_i, \qquad \dot\ell(\theta) = n/\theta + \sum_{i=1}^n \log x_i.$

I LRT statistic: $2(\ell(\hat\theta) - \ell(1)) = 2n(\log\hat\theta - 1 + 1/\hat\theta)$.

I Wald test statistic: $W_n(1) = (\hat\theta - 1)^2\, n I_1(1) = n(\hat\theta - 1)^2$.

I Rao test statistic:

$R_n(1) = \bigl(\dot\ell(1)\bigr)^2\, n^{-1} I_1(1)^{-1} = n\Bigl(1 + n^{-1}\sum_{i=1}^n \log X_i\Bigr)^2.$
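A small sketch of these three statistics, assuming Python with numpy and scipy; the simulated Beta(1.4, 1) sample is a hypothetical illustration.

```python
import numpy as np
from scipy.stats import chi2

def beta_tests(x):
    """LRT, Wald, and Rao statistics for H0: theta = 1 in the Beta(theta, 1) model."""
    n = len(x)
    s = float(np.sum(np.log(x)))      # sum of log X_i, negative a.s.
    theta_hat = -n / s                # MLE
    lrt = 2 * n * (np.log(theta_hat) - 1 + 1 / theta_hat)
    wald = n * (theta_hat - 1) ** 2   # uses I_1(1) = 1
    rao = n * (1 + s / n) ** 2        # uses the score at theta = 1
    return lrt, wald, rao

rng = np.random.default_rng(3)
x = rng.beta(1.4, 1.0, size=80)       # hypothetical sample
print(beta_tests(x), "critical value:", chi2.ppf(0.95, df=1))
```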


Approximation of LRT: Double Exponential Model

Suppose we observe a random sample $X_1, \ldots, X_n$ from DE$(\theta, 1)$, $-\infty < \theta < \infty$.

I Hypothesis: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$.

I Recall that the MLE $\hat\theta = \mathrm{med}(X_i)$ and $I_1(\theta) = 1$ (page 51):

$\ell(\theta) = -\sum_{i=1}^n |x_i - \theta| - n\log 2, \qquad \dot\ell(\theta) = \sum_{i=1}^n \mathrm{sgn}(x_i - \theta).$

Here, we take $\dot\ell_1(\theta; x) = (\partial/\partial\theta) f(x; \theta)/f(x; \theta)$ with $(\partial/\partial\theta)f(\cdot\,; \theta)$ being the Fréchet derivative of $f(\cdot\,; \theta)$.

I LRT statistic: $2(\ell(\hat\theta) - \ell(\theta_0)) = 2\bigl(\sum_{i=1}^n |x_i - \theta_0| - \sum_{i=1}^n |x_i - \hat\theta|\bigr)$.

I Wald test statistic: $W(\theta_0) = n(\hat\theta - \theta_0)^2$.

I Rao test statistic: $R(\theta_0) = \bigl(\sum_{i=1}^n \mathrm{sgn}(X_i - \theta_0)\bigr)^2/n$.
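The same three statistics in code, as a sketch assuming Python with numpy; the Laplace sample is hypothetical.

```python
import numpy as np

def de_tests(x, theta0):
    """LRT, Wald, and Rao statistics for H0: theta = theta0 in the DE(theta, 1) model."""
    n = len(x)
    theta_hat = np.median(x)                  # MLE
    lrt = 2 * (np.sum(np.abs(x - theta0)) - np.sum(np.abs(x - theta_hat)))
    wald = n * (theta_hat - theta0) ** 2      # uses I_1(theta) = 1
    rao = np.sum(np.sign(x - theta0)) ** 2 / n
    return lrt, wald, rao

rng = np.random.default_rng(4)
x = rng.laplace(loc=0.3, scale=1.0, size=60)  # hypothetical DE(0.3, 1) sample
print(de_tests(x, theta0=0.0))
```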


Approximation of LRT: Multinomial Model

Suppose we observe a random sample $Z_1, \ldots, Z_n$ from Multinomial$(1, (\theta_1, \ldots, \theta_k))$, and wish to test

$H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$.

I If we take $\Theta = \{(\theta_1, \ldots, \theta_{k-1}) : \theta_j > 0,\ \theta_1 + \cdots + \theta_{k-1} < 1\}$, then the model satisfies (R0)–(R7).

I Let $X_j = \sum_{i=1}^n Z_{ij}$, $1 \le j \le k$, and $\theta_\cdot = \sum_{j=1}^{k-1}\theta_j = 1 - \theta_k$. With $\theta = (\theta_1, \ldots, \theta_{k-1})^\top$,

$\ell(\theta) = \sum_{j=1}^{k-1} x_j\log\theta_j + x_k\log(1 - \theta_\cdot), \qquad \dot\ell(\theta) = \bigl(x_j/\theta_j - x_k/(1 - \theta_\cdot) : 1 \le j \le k-1\bigr),$

the MLE $\hat\theta = (x_j/n : 1 \le j \le k-1)$, and $I_1(\theta) = \mathrm{diag}(\theta_j^{-1}) + (1 - \theta_\cdot)^{-1}\mathbf{1}\mathbf{1}^\top$.


I LRT statistic: $2(\ell(\hat\theta) - \ell(\theta_0)) = 2n\sum_{j=1}^k \hat\theta_j \log(\hat\theta_j/\theta_{0j})$.

I Wald statistic: $n(\hat\theta - \theta_0)^\top I_1(\theta_0)(\hat\theta - \theta_0) = \sum_{j=1}^k (X_j - n\theta_{0j})^2/(n\theta_{0j})$.

I Rao statistic: $n^{-1}\dot\ell(\theta_0)^\top I_1(\theta_0)^{-1}\dot\ell(\theta_0) = \sum_{j=1}^k (X_j - n\theta_{0j})^2/(n\theta_{0j})$.

I The two coincide because $n I_1(\theta)(\hat\theta - \theta) = (X_j/\theta_j : 1 \le j \le k-1) - (X_k/\theta_k)\cdot\mathbf{1} = \dot\ell(\theta)$.

I Often the Wald and Rao tests are expressed as

Reject $H_0$ if $\sum_{j=1}^k (O_j - E_j)^2/E_j \ge \chi^2_\alpha(k-1)$,

where $O_j = X_j$ and $E_j = E(X_j \mid H_0) = n\theta_{0j}$.
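As a sketch (assuming Python with numpy and scipy; the counts and null probabilities are hypothetical), the LRT and Pearson-form statistics side by side:

```python
import numpy as np
from scipy.stats import chi2, chisquare

x = np.array([18, 22, 31, 29])            # hypothetical counts X_j
p0 = np.array([0.25, 0.25, 0.25, 0.25])   # null probabilities theta_0j
n = x.sum()

lrt = 2 * np.sum(x * np.log(x / (n * p0)))      # 2n * sum th_hat_j * log(th_hat_j / th_0j)
pearson = np.sum((x - n * p0) ** 2 / (n * p0))  # Wald = Rao = Pearson chi-square
print(lrt, pearson, "critical value:", chi2.ppf(0.95, df=len(x) - 1))
print(chisquare(x, f_exp=n * p0))               # scipy's Pearson test, for comparison
```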


Approximation of LRT: Composite Null

Here, we discuss the approximation of the LRT when the null hypothesis is composite. For this, let $X_1, \ldots, X_n$ be a random sample from a distribution with pdf $f(\cdot\,; \theta)$, $\theta \in \Theta \subset \mathbb{R}^d$, that satisfies the regularity conditions (R0)–(R7). Let $\theta = (\xi^\top, \eta^\top)^\top$ with $\eta \in \mathbb{R}^{d_0}$, and suppose we wish to test

$H_0 : \xi = \xi_0$ versus $H_1 : \xi \ne \xi_0$.

The null parameter space, which has dimension $\dim(\Theta_0) = d_0$, is given by

$\Theta_0 = \bigl\{(\xi_0^\top, \eta^\top)^\top : \eta \in \mathbb{R}^{d_0} \text{ and } (\xi_0^\top, \eta^\top)^\top \in \Theta\bigr\}.$

Let $\hat\theta_{\Theta_0}$ and $\hat\theta_\Theta$ denote the MLEs over $\Theta_0$ and over $\Theta$, respectively.


Under the composite null hypothesis $H_0 : \xi = \xi_0 \Leftrightarrow H_0 : \theta \in \Theta_0$, it holds that

(1) $2(\ell(\hat\theta_\Theta) - \ell(\hat\theta_{\Theta_0})) \stackrel{d}{\to} \chi^2(d - d_0)$; (Wilks' phenomenon)

(2) $2(\ell(\hat\theta_\Theta) - \ell(\hat\theta_{\Theta_0})) = W_n + o_p(1)$;

(3) $2(\ell(\hat\theta_\Theta) - \ell(\hat\theta_{\Theta_0})) = R_n + o_p(1)$,

where

$W_n = (\hat\theta_\Theta - \hat\theta_{\Theta_0})^\top (n I_1(\hat\theta_{\Theta_0}))(\hat\theta_\Theta - \hat\theta_{\Theta_0})$ (Wald test statistic),

$R_n = \dot\ell(\hat\theta_{\Theta_0})^\top (n I_1(\hat\theta_{\Theta_0}))^{-1}\dot\ell(\hat\theta_{\Theta_0})$ (Rao/score test statistic).

(Proof: See Bickel and Doksum (2001), Chapter 5.)


I Remark 1: The above approximations are valid for

$H_0 : g_1(\theta) = 0, \ldots, g_{d_1}(\theta) = 0$ versus $H_1 : \text{not } H_0$,

where $d_1 = d - d_0$, provided that there exists a smooth reparametrization $\xi_1 = g_1(\theta), \ldots, \xi_{d_1} = g_{d_1}(\theta)$, $\eta_1 = g_{d_1+1}(\theta), \ldots, \eta_{d_0} = g_d(\theta)$.

I Remark 2: The above approximation may be extended to the case of independent but not identically distributed observations. In that case,

$W_n = (\hat\theta_\Theta - \hat\theta_{\Theta_0})^\top I_n(\hat\theta_{\Theta_0})(\hat\theta_\Theta - \hat\theta_{\Theta_0})$ (Wald test statistic),

$R_n = \dot\ell(\hat\theta_{\Theta_0})^\top I_n(\hat\theta_{\Theta_0})^{-1}\dot\ell(\hat\theta_{\Theta_0})$ (Rao/score test statistic),

where $I_n(\theta) = -E_\theta\, \ddot\ell(\theta)$.


Approximation of LRT: Independence Test in Contingency Table

Suppose we observe $X = (X_{jk}) \stackrel{d}{=} \text{Multinomial}(n, (p_{jk}))$, where $\sum_{j=1}^a \sum_{k=1}^b p_{jk} = 1$ and $p_{jk} > 0$ for $j = 1, \ldots, a$; $k = 1, \ldots, b$. Consider the hypothesis

$H_0 : p_{jk} = p_{j\cdot} \times p_{\cdot k}$ for all $j, k$ versus $H_1 : \text{not } H_0$,

where $p_{j\cdot} = p_{j1} + \cdots + p_{jb}$ and $p_{\cdot k} = p_{1k} + \cdots + p_{ak}$.

I With $\theta = (p_{11}, \ldots, p_{a,b-1})^\top$, define

$g_{jk}(\theta) = p_{jk} - p_{j\cdot}\, p_{\cdot k}, \quad 1 \le j \le a-1,\ 1 \le k \le b-1,$

and $g_{jk}$ for the other $(j, k)$ appropriately, so that the resulting transformation is a smooth reparametrization of $\theta$. Then

$p_{jk} = p_{j\cdot} \times p_{\cdot k}$ for all $j, k$ $\iff$ $g_{jk}(\theta) = 0$ for $1 \le j \le a-1,\ 1 \le k \le b-1$.


I Indeed,

$\Theta = \{(p_{11}, \ldots, p_{ab}) : p_{jk} > 0,\ p_{11} + \cdots + p_{ab} = 1\},$

$\Theta_0 = \{(p_{1\cdot}p_{\cdot 1}, \ldots, p_{a\cdot}p_{\cdot b}) : p_{j\cdot} > 0,\ p_{\cdot k} > 0,\ p_{1\cdot} + \cdots + p_{a\cdot} = 1,\ p_{\cdot 1} + \cdots + p_{\cdot b} = 1\}.$

Thus $d = \dim(\Theta) = ab - 1$ and $d_0 = \dim(\Theta_0) = (a-1) + (b-1)$, so that $d_1 = \dim(\Theta) - \dim(\Theta_0) = (a-1)(b-1)$.

I The likelihood under $H_0$:

$L(p) = \prod_{j=1}^a \prod_{k=1}^b (p_{jk})^{x_{jk}} \times \text{(constant)} = \prod_{j=1}^a (p_{j\cdot})^{x_{j\cdot}} \prod_{k=1}^b (p_{\cdot k})^{x_{\cdot k}} \times \text{(constant)}.$

I MLEs: $\hat p_{jk} = x_{jk}/n$ in the whole parameter space $\Theta$, and $\hat p^0_{jk} = (x_{j\cdot}/n) \times (x_{\cdot k}/n) = \hat p^0_{j\cdot} \times \hat p^0_{\cdot k}$, say, in $\Theta_0$.


I The Wald and Rao test statistics coincide: with $\theta = (p_{11}, \ldots, p_{a,b-1})^\top$,

$n(\hat\theta - \theta)^\top I_1(\theta)(\hat\theta - \theta) = \sum_{j=1}^a \sum_{k=1}^b (X_{jk} - np_{jk})^2/(np_{jk}) = n^{-1}\dot\ell(\theta)^\top I_1(\theta)^{-1}\dot\ell(\theta).$

I Evaluated at the null MLE, the Wald and Rao test statistics are given by

$W_n = \sum_{j=1}^a \sum_{k=1}^b (X_{jk} - n\hat p^0_{j\cdot}\hat p^0_{\cdot k})^2/(n\hat p^0_{j\cdot}\hat p^0_{\cdot k}).$

I Therefore, both the Wald and Rao tests reject $H_0$ if

$\sum_{j=1}^a \sum_{k=1}^b (O_{jk} - E^0_{jk})^2/E^0_{jk} \ge \chi^2_\alpha\bigl((a-1)(b-1)\bigr),$

where $O_{jk} = X_{jk}$ and $E^0_{jk} = E(X_{jk} \mid H_0) = n\hat p^0_{j\cdot}\hat p^0_{\cdot k}$. The sketch below checks this statistic against scipy.
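A compact check, assuming Python with numpy and scipy; the 2×3 table is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

x = np.array([[20, 15, 25],
              [30, 20, 10]])                    # hypothetical a x b table of counts X_jk
n = x.sum()
e = np.outer(x.sum(axis=1), x.sum(axis=0)) / n  # E0_jk = n * (x_j. / n) * (x_.k / n)
pearson = np.sum((x - e) ** 2 / e)              # Wald = Rao statistic

stat, pval, dof, expected = chi2_contingency(x, correction=False)
print(pearson, stat, "df:", dof)                # the two agree; dof = (a-1)(b-1)
```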


Approximation of LRT: Homogeneity of Multinomial Models

Suppose now that we observe $b$ independent multinomial random vectors: $X_k \equiv (X_{1k}, \ldots, X_{ak})^\top \stackrel{d}{=} \text{Multinomial}(n_k, (p_{1k}, \ldots, p_{ak}))$, $1 \le k \le b$, where $p_{1k} + \cdots + p_{ak} = 1$ and $p_{jk} > 0$. We wish to test

$H_0 : p_1 = \cdots = p_b$ versus $H_1 : \text{not } H_0$,

where $p_k = (p_{1k}, \ldots, p_{ak})^\top$, $1 \le k \le b$.

I With $\theta_k = (p_{1k}, \ldots, p_{a-1,k})^\top$ and $\theta = (\theta_1^\top, \ldots, \theta_b^\top)^\top$, we get

$\ell(\theta) = \sum_{k=1}^b \sum_{j=1}^{a-1} x_{jk}\log\theta_{jk} + \sum_{k=1}^b (n_k - x_{\cdot k})\log(1 - \theta_{\cdot k}),$

$\dot\ell(\theta) = \bigl(x_{jk}/\theta_{jk} - x_{ak}/(1 - \theta_{\cdot k}) : 1 \le j \le a-1,\ 1 \le k \le b\bigr),$

$I_n(\theta) = \mathrm{var}(\dot\ell(\theta)) = \mathrm{diag}\Bigl(n_k\bigl[\mathrm{diag}(\theta_{jk}^{-1}) + (1 - \theta_{\cdot k})^{-1}\mathbf{1}\mathbf{1}^\top\bigr]\Bigr),$

where $\theta_{\cdot k} = \sum_{j=1}^{a-1}\theta_{jk} = 1 - p_{ak}$ and $x_{\cdot k} = \sum_{j=1}^{a-1} x_{jk} = n_k - x_{ak}$.


I MLEs: $\hat p_{jk} = x_{jk}/n_k$ in $\Theta$, and $\hat p^0_{jk} = x_{j\cdot}/n$ for all $k$ in $\Theta_0$, where $n = n_1 + \cdots + n_b$.

I The Wald and Rao statistics coincide, and

$(\hat\theta_\Theta - \theta)^\top I_n(\theta)(\hat\theta_\Theta - \theta) = \sum_{k=1}^b n_k(\hat\theta_{\Theta k} - \theta_k)^\top\bigl[\mathrm{diag}(\theta_{jk}^{-1}) + (1 - \theta_{\cdot k})^{-1}\mathbf{1}\mathbf{1}^\top\bigr](\hat\theta_{\Theta k} - \theta_k)$

$= \sum_{k=1}^b n_k\Bigl[\sum_{j=1}^{a-1}(\hat\theta_{\Theta, jk} - \theta_{jk})^2/\theta_{jk} + (\hat\theta_{\Theta, \cdot k} - \theta_{\cdot k})^2/(1 - \theta_{\cdot k})\Bigr]$

$= \sum_{k=1}^b n_k \sum_{j=1}^a (\hat p_{jk} - p_{jk})^2/p_{jk} = \sum_{k=1}^b \sum_{j=1}^a (n_k\hat p_{jk} - n_k p_{jk})^2/(n_k p_{jk}) = \sum_{k=1}^b \sum_{j=1}^a (X_{jk} - n_k p_{jk})^2/(n_k p_{jk}).$


I Thus, the Wald and Rao statistics for $H_0$ are given by

$W_n = \sum_{k=1}^b \sum_{j=1}^a (X_{jk} - n_k\hat p^0_{jk})^2/(n_k\hat p^0_{jk}) = \sum_{k=1}^b \sum_{j=1}^a \bigl(X_{jk} - n_k(X_{j\cdot}/n)\bigr)^2/\bigl(n_k X_{j\cdot}/n\bigr).$

I Note that $\dim(\Theta) = b(a-1)$ and $\Theta_0 = \{(p_1, \ldots, p_1) : p_{11} + \cdots + p_{a1} = 1,\ p_{j1} > 0\}$, so that $\dim(\Theta_0) = a - 1$.

I Therefore, the Wald and Rao tests reject $H_0$ if

$\sum_{k=1}^b \sum_{j=1}^a (O_{jk} - E^0_{jk})^2/E^0_{jk} \ge \chi^2_\alpha\bigl((a-1)(b-1)\bigr),$

where $O_{jk} = X_{jk}$ and $E^0_{jk} = E(X_{jk} \mid H_0) = n_k\hat p^0_{jk} = n_k(X_{j\cdot}/n)$; a numerical sketch follows below.
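A closing sketch, assuming Python with numpy and scipy; the $b$ hypothetical samples are the columns of the matrix below, so column $k$ sums to $n_k$.

```python
import numpy as np
from scipy.stats import chi2

x = np.array([[12, 20, 18],
              [18, 25, 27],
              [30, 15, 35]])          # hypothetical a x b counts; column k is sample k
n_k = x.sum(axis=0)                   # sample sizes n_k
n = n_k.sum()
e = np.outer(x.sum(axis=1), n_k) / n  # E0_jk = n_k * (X_j. / n)
wn = np.sum((x - e) ** 2 / e)         # Wald = Rao statistic

a, b = x.shape
print(wn, "critical value:", chi2.ppf(0.95, df=(a - 1) * (b - 1)))
```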