6 More about likelihood

6.1 Invariance property of m.l.e.’s

Theorem

If $\hat\theta$ is an m.l.e. of $\theta$ and if $g$ is a function, then $g(\hat\theta)$ is an m.l.e. of $g(\theta)$.

Proof

If $g$ is one-to-one, then
\[
L(\theta) = L\!\left(g^{-1}(g(\theta))\right),
\]
and both sides are maximised by $\hat\theta$, so
\[
\hat\theta = g^{-1}\!\left(\widehat{g(\theta)}\right)
\quad\text{or}\quad
\widehat{g(\theta)} = g(\hat\theta).
\]

If $g$ is many-to-one, then the $\hat\theta$ which maximises $L(\theta)$ still corresponds to $g(\hat\theta)$, so $g(\hat\theta)$ still corresponds to the maximum of $L(\theta)$.

Example

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a Bernoulli distribution $B(1,\theta)$. Consider m.l.e.'s of the mean, $\theta$, and the variance, $\theta(1-\theta)$.

Note, by the way, that θ(1 − θ) is not a 1-1 function of θ.

The log-likelihood is

\[
l(\theta) = \sum_i x_i \log\theta + \Bigl(n - \sum_i x_i\Bigr)\log(1-\theta)
\]
and
\[
\frac{dl(\theta)}{d\theta} = \frac{\sum_i x_i}{\theta} - \frac{n - \sum_i x_i}{1-\theta},
\]
so it is easily shown that the m.l.e. of $\theta$ is $\hat\theta = \bar{X}$.

Putting $\nu = \theta(1-\theta)$,
\[
\frac{dl}{d\nu} = \frac{dl}{d\theta}\cdot\frac{d\theta}{d\nu},
\]
so it is easily seen that, since $d\theta/d\nu$ is not, in general, equal to zero,
\[
\hat\nu = \nu(\hat\theta) = \bar{X}\bigl(1-\bar{X}\bigr).
\]
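As a quick numerical illustration of the invariance property (an addition here, not part of the original notes), the following minimal Python sketch maximises the Bernoulli log-likelihood over a grid and checks that the m.l.e. of $\nu = \theta(1-\theta)$ is $\nu(\hat\theta)$; the sample values are made up purely for illustration.

import numpy as np

# Hypothetical Bernoulli sample, for illustration only.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
n = len(x)

# Direct maximisation of the log-likelihood over a grid of theta values.
theta_grid = np.linspace(0.001, 0.999, 9999)
loglik = x.sum() * np.log(theta_grid) + (n - x.sum()) * np.log(1 - theta_grid)
theta_hat = theta_grid[np.argmax(loglik)]          # numerically close to x.mean()

# Invariance: the m.l.e. of nu = theta(1 - theta) is nu(theta_hat).
nu_hat = theta_hat * (1 - theta_hat)

print(theta_hat, x.mean())                  # ~0.7 and 0.7
print(nu_hat, x.mean() * (1 - x.mean()))    # ~0.21 and 0.21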


6.2 Relative likelihood

If $\sup_\theta L(\theta) < \infty$, the relative likelihood is
\[
RL(\theta) = \frac{L(\theta)}{\sup_\theta L(\theta)}, \qquad 0 \le RL(\theta) \le 1.
\]

Relative likelihood is invariant to known 1-1 transformations of $x$, for if $y$ is a 1-1 function of $x$,
\[
f_Y(y;\theta) = f_X\bigl(x(y);\theta\bigr)\left|\frac{dx}{dy}\right|.
\]
$\left|dx/dy\right|$ is independent of $\theta$, so
\[
RL_X(\theta) = RL_Y(\theta).
\]

6.3 Likelihood summaries

Realistic statistical problems often have many parameters. These cause problems because it can be hard to visualise L(θ), and it becomes necessary to use summaries.

Key idea

In large samples, log-likelihoods are often approximately quadratic near the maximum.


Example

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from an exponential distribution with parameter $\lambda$, i.e.
\[
f_X(x) = \lambda e^{-\lambda x}, \qquad x \ge 0.
\]

Then
\[
l(\lambda) = n\log\lambda - \lambda\sum_i x_i,
\qquad
\frac{dl(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_i x_i,
\qquad
\frac{d^2 l(\lambda)}{d\lambda^2} = -\frac{n}{\lambda^2},
\qquad
\frac{d^3 l(\lambda)}{d\lambda^3} = \frac{2n}{\lambda^3}.
\]
The log-likelihood has a maximum at $\hat\lambda = n\big/\sum_i x_i$, so
\[
RL(\lambda) = \left(\frac{\lambda}{\hat\lambda}\right)^{\!n} e^{\,n-\lambda\sum_i x_i}
= \left(\frac{\lambda}{\hat\lambda}\, e^{1-\lambda/\hat\lambda}\right)^{\!n},
\qquad \lambda > 0,
\]
and $RL(\lambda) \to 1$ as $\lambda \to \hat\lambda$.

Now, what happens, for fixed $\lambda$, as $n\to\infty$?

\[
\begin{aligned}
\log RL(\lambda) &= l(\lambda) - l(\hat\lambda)\\
&= l(\hat\lambda) + l'(\hat\lambda)\bigl(\lambda-\hat\lambda\bigr)
+ \tfrac12\, l''(\lambda_1)\bigl(\lambda-\hat\lambda\bigr)^2 - l(\hat\lambda),
\end{aligned}
\]
using a Taylor series, where $\bigl|\lambda_1 - \hat\lambda\bigr| < \bigl|\lambda - \hat\lambda\bigr|$.

Now $l'(\hat\lambda) = 0$ and $l''(\lambda_1) = -n/\lambda_1^2$, so
\[
\log RL(\lambda) = -\frac{n\bigl(\lambda-\hat\lambda\bigr)^2}{2\lambda_1^2}
\to -\infty \quad\text{as } n\to\infty
\]
unless $\lambda = \hat\lambda$.

Thus, as n→∞,

\[
RL(\lambda) \to
\begin{cases}
1, & \lambda = \hat\lambda,\\
0, & \text{otherwise.}
\end{cases}
\]

Conclusion

Likelihood becomes more concentrated about the maximum as $n\to\infty$, and values far from the maximum become less and less plausible.
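A minimal Python sketch of this concentration effect (an illustration added here, not from the notes): it evaluates $RL(\lambda) = \bigl((\lambda/\hat\lambda)e^{1-\lambda/\hat\lambda}\bigr)^n$ for a fixed, illustrative value of $\hat\lambda$ and increasing $n$.

import numpy as np

# Relative likelihood for the exponential model, written in terms of
# r = lambda / lambda_hat so that only n matters:  RL = (r * exp(1 - r))**n.
def rel_lik(lam, lam_hat, n):
    r = lam / lam_hat
    return (r * np.exp(1.0 - r)) ** n

lam_hat = 2.0                      # illustrative value of the m.l.e.
lams = np.array([1.5, 1.8, 2.0, 2.2, 2.5])
for n in (5, 20, 100):
    print(n, np.round(rel_lik(lams, lam_hat, n), 4))
# RL stays 1 at lambda = lambda_hat and collapses towards 0 elsewhere as n grows.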


In general

We call the value $\hat\theta$ which maximises $L(\theta)$ or, equivalently, $l(\theta) = \log L(\theta)$, the maximum likelihood estimate, and
\[
J(\hat\theta) = -\frac{\partial^2 l(\hat\theta)}{\partial\theta^2}
\]
is called the observed information.

Usually $J(\hat\theta) > 0$, and $J(\hat\theta)$ measures the concentration of $l(\theta)$ at $\hat\theta$. Close to $\hat\theta$, we summarise
\[
l(\theta) \simeq l(\hat\theta) - \tfrac12 \bigl(\theta-\hat\theta\bigr)^2 J(\hat\theta).
\]

6.4 Information

In a model with log-likelihood $l(\theta)$, the observed information is
\[
J(\theta) = -\frac{\partial^2 l(\theta)}{\partial\theta^2}.
\]

When observations are independent, $L(\theta)$ is a product of densities, so
\[
l(\theta) = \sum_i \log f(x_i;\theta)
\]
and
\[
J(\theta) = -\sum_i \frac{\partial^2}{\partial\theta^2}\log f(x_i;\theta).
\]

Since
\[
l(\theta) \simeq l(\hat\theta) - \tfrac12\bigl(\theta-\hat\theta\bigr)^2 J(\hat\theta)
\]
for $\theta$ near to $\hat\theta$, we see that large $J(\hat\theta)$ implies that $l(\theta)$ is more concentrated about $\hat\theta$.

This means that the data are less ambiguous about possible values of $\theta$, i.e. we have more information about $\theta$.


6.5 Expected information

6.5.1 Univariate distributions

Before an experiment is conducted, we have no data, so we cannot evaluate J(θ).

But we can find its expected value
\[
I(\theta) = E\!\left(-\frac{\partial^2 l(\theta)}{\partial\theta^2}\right).
\]

This is called the expected information or Fisher’s information.

If the observations are a random sample, then the whole-sample expected information is
\[
I(\theta) = n\, i(\theta),
\]
where
\[
i(\theta) = E\!\left(-\frac{\partial^2}{\partial\theta^2}\log f(X_i;\theta)\right)
\]
is the single-observation Fisher information.

Example

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a Poisson distribution with parameter $\theta$. Then
\[
L(\theta) = \prod_{i=1}^n \frac{\theta^{x_i} e^{-\theta}}{x_i!},
\]
giving
\[
l(\theta) = \log L(\theta) = \sum_i x_i \log\theta - n\theta - \sum_i \log x_i!
\]

Thus
\[
J(\theta) = -\frac{\partial^2 l(\theta)}{\partial\theta^2} = \frac{\sum_i x_i}{\theta^2}.
\]

To find $I(\theta)$, we need $E(X_i) = \theta$, and then
\[
I(\theta) = \frac{1}{\theta^2}\sum_i E(X_i) = \frac{n}{\theta}.
\]
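As a hedged numerical sketch (not in the original notes), the following Python fragment computes $J(\hat\theta)$ and $I(\hat\theta)$ for a simulated Poisson sample; for this model they coincide at the maximum, since $J(\hat\theta) = \sum_i x_i/\hat\theta^2 = n/\hat\theta = I(\hat\theta)$.

import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0                           # illustrative value only
x = rng.poisson(theta_true, size=50)       # simulated data
n = len(x)

theta_hat = x.mean()                       # m.l.e. of the Poisson mean
J_hat = x.sum() / theta_hat**2             # observed information J(theta_hat)
I_hat = n / theta_hat                      # expected information evaluated at theta_hat

print(theta_hat, J_hat, I_hat)             # the last two numbers agree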


6.5.2 Multivariate distributions

If θ is a (p×1) vector of parameters, then I(θ) and J(θ) are (p×p) matrices.

\[
\{J(\theta)\}_{rs} = -\frac{\partial^2 l(\theta)}{\partial\theta_r\,\partial\theta_s}
\quad\text{and}\quad
\{I(\theta)\}_{rs} = E\!\left(-\frac{\partial^2 l(\theta)}{\partial\theta_r\,\partial\theta_s}\right).
\]

These matrices are obviously symmetric.

We can also write the above as

\[
J(\theta) = -\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta^T}
\quad\text{and}\quad
I(\theta) = E\!\left(-\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta^T}\right).
\]

Example

$X_1, X_2, \ldots, X_n$ is a random sample from a normal distribution with parameters $\mu$ and $\sigma^2$. We have already seen that
\[
L(\mu,\sigma^2) = \bigl(2\pi\sigma^2\bigr)^{-n/2}\exp\!\left[-\frac{1}{2\sigma^2}\sum_i (x_i-\mu)^2\right],
\]
so
\[
l(\mu,\sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_i (x_i-\mu)^2,
\]
and
\[
\frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_i (x_i-\mu),
\qquad
\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i (x_i-\mu)^2,
\]
\[
\frac{\partial^2 l}{\partial\mu^2} = -\frac{n}{\sigma^2},
\qquad
\frac{\partial^2 l}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_i (x_i-\mu),
\qquad
\frac{\partial^2 l}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_i (x_i-\mu)^2.
\]
Hence
\[
J(\mu,\sigma^2) =
\begin{pmatrix}
\dfrac{n}{\sigma^2} & \dfrac{1}{\sigma^4}\sum_i (x_i-\mu)\\[2ex]
\dfrac{1}{\sigma^4}\sum_i (x_i-\mu) & \dfrac{1}{\sigma^6}\sum_i (x_i-\mu)^2 - \dfrac{n}{2\sigma^4}
\end{pmatrix}.
\]

To find $I(\mu,\sigma^2)$, use
\[
E(X_i) = \mu,
\qquad
V(X_i) = E\bigl[(X_i-\mu)^2\bigr] = \sigma^2,
\]
so that
\[
I(\mu,\sigma^2) = E\bigl(J(\mu,\sigma^2)\bigr) =
\begin{pmatrix}
\dfrac{n}{\sigma^2} & 0\\[1ex]
0 & \dfrac{n}{2\sigma^4}
\end{pmatrix}.
\]
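The following Python sketch (added for illustration, assuming numpy) evaluates the observed information matrix $J(\mu,\sigma^2)$ above on a simulated normal sample and compares it with the expected information $I(\mu,\sigma^2)$; the off-diagonal terms fluctuate around zero, as the expectation suggests.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 1.0, 4.0                          # illustrative parameter values
n = 200
x = rng.normal(mu, np.sqrt(sigma2), size=n)    # simulated sample

def observed_info(mu, sigma2, x):
    """Observed information matrix J(mu, sigma^2) from the formulas above."""
    n = len(x)
    d = x - mu
    return np.array([
        [n / sigma2,            d.sum() / sigma2**2],
        [d.sum() / sigma2**2,   (d**2).sum() / sigma2**3 - n / (2 * sigma2**2)],
    ])

J = observed_info(mu, sigma2, x)
I = np.array([[n / sigma2, 0.0], [0.0, n / (2 * sigma2**2)]])   # expected information

print(np.round(J, 2))
print(np.round(I, 2))   # J fluctuates around I; off-diagonals average to zero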


Example Censored exponential data

Lifetimes of $n$ components, safety devices, etc. are observed for a time $c$, when $r$ have failed and $(n-r)$ are still OK.

We have two kinds of observation:

1. exact failure times $x_i$, observed if $x_i \le c$, with
\[
f(x;\lambda) = \lambda e^{-\lambda x}, \qquad x \ge 0;
\]

2. $x_i$ unobserved if $x_i > c$, with
\[
P(X > c) = e^{-\lambda c}.
\]

The data are therefore
\[
x_1, \ldots, x_r, \underbrace{c, \ldots, c}_{n-r \text{ times}}.
\]
The $(n-r)$ components, safety devices, etc. which have not failed are said to be censored.

The likelihood is

\[
\begin{aligned}
L(\lambda) &= \prod_{i=1}^{r} \lambda e^{-\lambda x_i} \prod_{i=r+1}^{n} e^{-\lambda c}\\
&= \lambda^r \exp\!\left[-\lambda\left(\sum_{i=1}^{r} x_i + (n-r)c\right)\right].
\end{aligned}
\]
Hence
\[
l(\lambda) = r\log\lambda - \lambda\left(\sum_{i=1}^{r} x_i + (n-r)c\right),
\qquad
l'(\lambda) = \frac{r}{\lambda} - \left(\sum_{i=1}^{r} x_i + (n-r)c\right),
\qquad
l''(\lambda) = -\frac{r}{\lambda^2}.
\]
Thus $J(\lambda) = r/\lambda^2 > 0$ if $r > 0$, so we must observe at least one exact failure time.

\[
I(\lambda) = E\bigl(r/\lambda^2\bigr) = \frac{1}{\lambda^2}\,E\bigl(\#\,X_i \text{ observed exactly}\bigr).
\]
Now $P(X_i \text{ observed exactly}) = P(X_i \le c) = 1 - e^{-\lambda c}$, so
\[
I_c(\lambda) = \frac{n\bigl(1 - e^{-\lambda c}\bigr)}{\lambda^2}.
\]


No censoring corresponds to $c\to\infty$, giving
\[
I_\infty(\lambda) = \frac{n}{\lambda^2} > I_c(\lambda),
\]
as one might expect.

The asymptotic efficiency when there is censoring at $c$, relative to no censoring, is
\[
I_c(\lambda)\big/I_\infty(\lambda) = 1 - e^{-\lambda c}.
\]
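A minimal Python sketch of the censored-exponential calculation (illustrative, not from the notes; the true rate, sample size and censoring time are made up): it simulates lifetimes, censors them at $c$, solves $l'(\lambda)=0$ for $\hat\lambda = r\big/\bigl(\sum x_i + (n-r)c\bigr)$, and estimates the relative efficiency $1 - e^{-\lambda c}$.

import numpy as np

rng = np.random.default_rng(2)
lam_true, n, c = 0.5, 100, 3.0
t = rng.exponential(1 / lam_true, size=n)     # true lifetimes (simulated)

observed = t[t <= c]                          # exact failure times
r = len(observed)                             # number of failures before c
total_time = observed.sum() + (n - r) * c     # sum x_i + (n - r) c

lam_hat = r / total_time                      # solves l'(lambda) = 0
J_hat = r / lam_hat**2                        # observed information at the m.l.e.
efficiency = 1 - np.exp(-lam_hat * c)         # estimate of I_c / I_infinity

print(r, round(lam_hat, 3), round(efficiency, 3))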

Example Events in a Poisson process

Events are observed for the period $(0, T)$.

$n$ events occur at times $0 < t_1 < t_2 < \cdots < t_n < T$.

Two observers A and B. A records exact times, B uses an automatic counter and goes to the pub (i.e. B merely records how many events there are).

A knows the exact times, and the times between events are independent and exponentially distributed, so
\[
L_A(\lambda) = \lambda e^{-\lambda t_1} \times \lambda e^{-\lambda(t_2-t_1)} \times \cdots \times \lambda e^{-\lambda(t_n-t_{n-1})} \times e^{-\lambda(T-t_n)}
= \lambda^n e^{-\lambda T}.
\]

B merely observes the event $[N = n]$, where $N \sim \mathrm{Poi}(\lambda T)$, so
\[
L_B(\lambda) = \frac{(\lambda T)^n e^{-\lambda T}}{n!}.
\]

The log-likelihoods are
\[
l_A(\lambda) = n\log\lambda - \lambda T,
\qquad
l_B(\lambda) = n\log\lambda + n\log T - \lambda T - \log n!,
\]
and
\[
J_A(\lambda) = J_B(\lambda) = n/\lambda^2.
\]

$E(N) = \lambda T$, so $I_A(\lambda) = I_B(\lambda) = T/\lambda$, and both observers get the same information. As usual, the one who went to the pub did the right thing.


6.6 Maximum likelihood estimates

The maximum likelihood estimate $\hat\theta$ of $\theta$ maximises $L(\theta)$ and often (but not always) satisfies the likelihood equation
\[
\frac{\partial l}{\partial\theta}\bigl(\hat\theta\bigr) = 0,
\]
with
\[
J(\hat\theta) = -\frac{\partial^2 l}{\partial\theta^2}\bigl(\hat\theta\bigr) > 0
\]
for a maximum.

In the vector case, $\hat\theta$ solves simultaneously
\[
\frac{\partial l}{\partial\theta_r}\bigl(\hat\theta\bigr) = 0, \qquad r = 1, \ldots, p,
\]
with
\[
\det J(\hat\theta) > 0
\]
(i.e. $J(\hat\theta)$ positive definite).

If the likelihood equation has many solutions, we find them all and check $L(\theta)$ for each.

Usually, the equation has to be solved numerically. One way is by Newton-Raphson.

Suppose we have a starting value $\theta_0$. Then
\[
0 = \frac{\partial l}{\partial\theta}\bigl(\hat\theta\bigr)
\simeq \frac{\partial l}{\partial\theta}(\theta_0) + \frac{\partial^2 l}{\partial\theta^2}(\theta_0)\bigl(\hat\theta - \theta_0\bigr),
\]
which may be re-arranged to
\[
\hat\theta = \theta_0 + \frac{U(\theta_0)}{J(\theta_0)},
\]
where
\[
U(\theta) = \frac{\partial l}{\partial\theta} \ \text{is the score function,}
\qquad
J(\theta) = -\frac{\partial^2 l}{\partial\theta^2} \ \text{is the observed information.}
\]


Now we iterate, using $\theta_0$ as a starting value and
\[
\theta_{n+1} = \theta_n + \frac{U(\theta_n)}{J(\theta_n)}.
\]

Example Extreme value (Gumbel) distribution

This distribution is used to model such things as annual maximum temperature. Data due to Bliss on numbers of beetles killed by exposure to carbon disulphide are fitted by this model. The c.d.f. is
\[
F(x) = \exp\bigl(-e^{-(x-\eta)}\bigr), \qquad x \in \mathbb{R},\ \eta \in \mathbb{R},
\]
and the density is
\[
f(x) = \exp\bigl[-(x-\eta) - e^{-(x-\eta)}\bigr], \qquad x \in \mathbb{R},\ \eta \in \mathbb{R}.
\]
The sample log-likelihood is
\[
l(\eta) = -\sum_i (x_i - \eta) - \sum_i e^{-(x_i-\eta)},
\]
so that
\[
U(\eta) = n - \sum_i e^{-(x_i-\eta)},
\qquad
J(\eta) = \sum_i e^{-(x_i-\eta)}.
\]
Starting at $\eta_0 = \bar{x}$, iterate using
\[
\eta_{n+1} = \eta_n + \frac{n - \sum_i e^{-(x_i-\eta_n)}}{\sum_i e^{-(x_i-\eta_n)}}.
\]
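A minimal Python implementation of this Newton-Raphson iteration (a sketch added here, not part of the notes; the simulated Gumbel data and starting parameters are purely illustrative):

import numpy as np

def gumbel_mle_newton(x, tol=1e-10, max_iter=100):
    """Newton-Raphson for the location parameter eta of the Gumbel model above."""
    eta = x.mean()                              # starting value eta_0 = xbar
    for _ in range(max_iter):
        w = np.exp(-(x - eta))
        score = len(x) - w.sum()                # U(eta)
        info = w.sum()                          # J(eta)
        step = score / info
        eta += step
        if abs(step) < tol:
            break
    return eta

# Illustration with simulated data from the same model (scale fixed at 1).
rng = np.random.default_rng(3)
x = rng.gumbel(loc=1.5, scale=1.0, size=200)
print(round(gumbel_mle_newton(x), 3))           # close to the simulated location 1.5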

6.6.1 Fisher scoring

This simply involves replacing J(θ) with I(θ).

Example Extreme value distribution

We need

\[
I(\eta) = E\bigl[J(\eta)\bigr] = \sum_i E\bigl[e^{-(X_i-\eta)}\bigr]
= n \int_{-\infty}^{\infty} e^{-(x-\eta)} \exp\bigl[-(x-\eta) - e^{-(x-\eta)}\bigr]\,dx.
\]


Put $u = e^{-(x-\eta)}$ and the integral becomes
\[
I(\eta) = n \int_0^\infty u\,e^{-u}\,du = n,
\]
so Fisher scoring gives the iteration
\[
\eta_{n+1} = \eta_n + 1 - \frac{1}{n}\sum_i e^{-(x_i-\eta_n)}.
\]
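The corresponding Fisher-scoring sketch (again illustrative, not from the notes) only changes the denominator of the step from $J(\eta_n) = \sum_i e^{-(x_i-\eta_n)}$ to $I(\eta) = n$; on the same kind of simulated data it converges to the same $\hat\eta$ as the Newton-Raphson sketch above.

import numpy as np

def gumbel_mle_fisher(x, tol=1e-10, max_iter=100):
    """Fisher scoring: step is U(eta)/I(eta) with I(eta) = n."""
    eta = x.mean()
    n = len(x)
    for _ in range(max_iter):
        step = 1.0 - np.exp(-(x - eta)).sum() / n     # U(eta) / n
        eta += step
        if abs(step) < tol:
            break
    return eta

rng = np.random.default_rng(3)
x = rng.gumbel(loc=1.5, scale=1.0, size=200)
print(round(gumbel_mle_fisher(x), 3))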


6.7 Sufficient statistics

You have already seen a likelihood which cannot be summarised by a quadratic.

Example

\[
f(x_i;\theta) = \theta^{-1}, \qquad 0 < x_i < \theta,
\]
so
\[
L(\theta) = \theta^{-n}, \qquad 0 < \max\{x_i\} < \theta.
\]

Clearly a quadratic approximation is useless here.

Suppose there exists a statistic $s(x)$ such that $L(\theta)$ only depends upon the data $x$ through $s(x)$. Then $s(X)$ is a sufficient statistic for $\theta$, and such a statistic obviously always exists.

The important question is:

Does s(x) reduce the dimensionality of the problem?

Definition

If $S = s(X)$ is such that the conditional density $f_{X\mid S}(x\mid s;\theta)$ is independent of $\theta$, then $S$ is a sufficient statistic.


Example

Suppose X1,X2 ∼ B(n, θ) and consider

\[
\begin{aligned}
P(X_1 = x \mid X_1 + X_2 = r)
&= \frac{P(X_1 = x,\, X_1 + X_2 = r)}{P(X_1 + X_2 = r)}\\
&= \frac{P(X_1 = x,\, X_2 = r - x)}{P(X_1 + X_2 = r)}\\
&= \frac{\binom{n}{x}\theta^x (1-\theta)^{n-x}\,\binom{n}{r-x}\theta^{r-x}(1-\theta)^{n-r+x}}
       {\binom{2n}{r}\theta^r (1-\theta)^{2n-r}}\\
&= \frac{\binom{n}{x}\binom{n}{r-x}}{\binom{2n}{r}}.
\end{aligned}
\]
This does not contain $\theta$, so $X_1 + X_2$ is a sufficient statistic for $\theta$.
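A small numerical check of this calculation (added for illustration, with arbitrary values of n, r and x): the conditional probability, computed from the joint binomial law, takes the same value for different θ and equals the hypergeometric ratio.

from math import comb

def cond_prob(x, r, n, theta):
    """P(X1 = x | X1 + X2 = r) for X1, X2 ~ B(n, theta), from the joint law."""
    num = comb(n, x) * theta**x * (1 - theta)**(n - x) \
        * comb(n, r - x) * theta**(r - x) * (1 - theta)**(n - r + x)
    den = comb(2 * n, r) * theta**r * (1 - theta)**(2 * n - r)
    return num / den

n, r, x = 10, 7, 3
print(cond_prob(x, r, n, 0.2), cond_prob(x, r, n, 0.8))   # identical values
print(comb(n, x) * comb(n, r - x) / comb(2 * n, r))       # the hypergeometric ratio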


Example

$X_1, X_2, \ldots, X_n \sim U(0,\theta)$, so that
\[
L(\theta) = \theta^{-n}, \qquad 0 < x_1, \ldots, x_n < \theta.
\]

Suppose we find the conditional density of X1,X2, . . . ,Xn given X(n).

The joint density of $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ is
\[
f\bigl(x_{(1)}, x_{(2)}, \ldots, x_{(n)}\bigr) = \frac{n!}{\theta^n}, \qquad 0 < x_{(1)} < \cdots < x_{(n)} < \theta,
\]
and the density of $X_{(n)}$ at $y$ is $n y^{n-1}\big/\theta^n$, so the conditional density of $X_{(1)}, \ldots, X_{(n-1)} \mid X_{(n)} = y$ is
\[
\frac{n!}{\theta^n} \Big/ \frac{n y^{n-1}}{\theta^n} = \frac{(n-1)!}{y^{n-1}}, \qquad 0 < x_{(1)} < \cdots < x_{(n-1)} < y.
\]
Thus the density of $X_1, X_2, \ldots, X_n \mid X_{(n)}$ is
\[
f\bigl(x_1, \ldots, x_n \mid x_{(n)} = y\bigr) = \frac{1}{y^{n-1}}, \qquad 0 < x_1, \ldots, x_n < y,
\]
which is free of $\theta$, so $X_{(n)}$ is a sufficient statistic for $\theta$.

Factorization Theorem

$s(X)$ is a sufficient statistic for $\theta$ if and only if there exist functions $g$ and $h$ such that
\[
f(x;\theta) = g\bigl(s(x);\theta\bigr)\,h(x)
\]
for all $x \in \mathbb{R}^n$, $\theta \in \Theta$.

Proof for discrete random variables

(i) Let $s(x) = a$ and suppose the factorization condition is satisfied, so that $f(x;\theta) = g\bigl(s(x);\theta\bigr)h(x)$. Then
\[
P\bigl(s(X) = a\bigr) = \sum_{y \in s^{-1}(a)} f(y;\theta) = g(a;\theta) \sum_{y \in s^{-1}(a)} h(y).
\]
Hence
\[
P\bigl(X = x \mid s(X) = a\bigr) = \frac{h(x)}{\sum_{y \in s^{-1}(a)} h(y)},
\]
and this does not depend upon $\theta$.


(ii) Let $s(X)$ be a sufficient statistic for $\theta$ and put $a = s(x)$. Then
\[
P(X = x) = P\bigl(X = x \mid s(X) = a\bigr)\,P\bigl(s(X) = a\bigr).
\]
But sufficiency implies that $P(X = x \mid s(X) = a)$ does not depend upon $\theta$, so writing $P(s(X) = a) = g(a;\theta)$ and $P(X = x \mid s(X) = a) = h(x)$ gives the result.

The proof in the continuous case requires measure theory and is beyond the scope of this course.

Example

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a Bernoulli distribution. Then
\[
p(x;\theta) = \theta^{\sum_i x_i}(1-\theta)^{n - \sum_i x_i}.
\]
Trivially this factorizes with $s(x) = \sum_i x_i$ and $h(x) = 1$.

Example

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a $N(\mu,\sigma^2)$ distribution, where $(\mu,\sigma^2)^T$ is a vector of unknown parameters. Then
\[
\begin{aligned}
f(x;\mu,\sigma^2) &= \bigl(2\pi\sigma^2\bigr)^{-n/2}\exp\!\left[-\frac{1}{2\sigma^2}\sum_i (x_i-\mu)^2\right]\\
&= \bigl(2\pi\sigma^2\bigr)^{-n/2}\exp\!\left[-\frac{1}{2\sigma^2}\left\{\sum_i (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2\right\}\right].
\end{aligned}
\]
Again this factorizes, where $s(x) = \bigl(\bar{x}, \sum_i (x_i-\bar{x})^2\bigr)^T$, a vector-valued function.


6.8 The exponential family

The density function/probability mass function has the form

\[
f(x;\phi) = \exp\bigl[a(x)\,b(\phi) - c(\phi) + d(x)\bigr],
\]
where $x$ may be continuous or discrete and $\phi$ lies in a suitable space (usually an open set of reals).

For a random sample $X_1, X_2, \ldots, X_n$, we obtain
\[
L(\phi) = \exp\!\left[b(\phi)\sum_i a(x_i) - n\,c(\phi) + \sum_i d(x_i)\right],
\]
and, therefore, by the factorization theorem, $\sum_i a(x_i)$ is sufficient for $\phi$.

Example

Let $X \sim B(n,\theta)$. Then
\[
\begin{aligned}
p_X(x) &= \binom{n}{x}\theta^x (1-\theta)^{n-x}\\
&= \exp\!\left[x\log\theta + (n-x)\log(1-\theta) + \log\binom{n}{x}\right]\\
&= \exp\!\left[x\log\bigl(\theta/(1-\theta)\bigr) + n\log(1-\theta) + \log\binom{n}{x}\right].
\end{aligned}
\]

Calling $Y = a(X)$, $\theta = b(\phi)$ the natural parameterisation, we can write the density function/probability mass function in the form
\[
f(y;\theta) = \exp\bigl[y\theta - k(\theta)\bigr]\,m(y).
\]
Clearly $\sum_i Y_i$ is a sufficient statistic.

Note that, in the continuous case, the moment generating function is
\[
\begin{aligned}
E\bigl(e^{tY}\bigr) &= \int e^{ty + \theta y - k(\theta)}\, m(y)\,dy\\
&= e^{k(\theta+t) - k(\theta)} \int e^{(\theta+t)y - k(\theta+t)}\, m(y)\,dy\\
&= e^{k(\theta+t) - k(\theta)}.
\end{aligned}
\]
The function $k(\theta)$ is called the cumulant generator.

Let us see why.


The cumulant generating function

If $X$ is a random variable with moment generating function $M(t)$, then $K(t) = \log M(t)$ is said to be the cumulant generating function.

Differentiating,
\[
K'(t) = \frac{M'(t)}{M(t)},
\qquad
K'(0) = \frac{M'(0)}{M(0)} = E(X),
\]
\[
K''(t) = \frac{M''(t)}{M(t)} - \frac{M'(t)^2}{M(t)^2},
\qquad
K''(0) = \frac{M''(0)}{M(0)} - \frac{M'(0)^2}{M(0)^2} = V(X),
\]
and so on. The cumulants are generated directly.

For the exponential family,
\[
K(t) = \log M(t) = k(\theta+t) - k(\theta),
\]
so that
\[
K'(t) = k'(\theta+t), \qquad K'(0) = k'(\theta),
\]
and so on. The cumulants are generated by repeated differentiation of $k(\theta)$.

Example Poisson distribution

The p.m.f. is

\[
p(x;\mu) = \frac{\mu^x e^{-\mu}}{x!}, \qquad x = 0, 1, \ldots
= \exp\bigl[x\log\mu - \mu - \log x!\bigr],
\]
so that
\[
a(x) = x, \qquad b(\mu) = \log\mu, \qquad c(\mu) = \mu, \qquad d(x) = -\log x!
\]

Under the natural parameterisation,
\[
y = x, \qquad \theta = \log\mu, \qquad k(\theta) = e^\theta, \qquad m(y) = \frac{1}{y!}.
\]
Cumulants are given by derivatives of $k(\theta)$, all of which are $e^\theta = \mu$.

Example Binomial distribution


The p.m.f. is

\[
p(x;p) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, \ldots, n
= \exp\!\left[x\log\!\left(\frac{p}{1-p}\right) + n\log(1-p) + \log\binom{n}{x}\right].
\]

The natural parameterisation is
\[
y = x, \qquad \theta = \log\!\left(\frac{p}{1-p}\right), \qquad k(\theta) = n\log\bigl(1 + e^\theta\bigr).
\]

For the cumulants,
\[
k'(\theta) = \frac{n e^\theta}{1 + e^\theta} = np,
\qquad
k''(\theta) = \frac{n e^\theta}{(1 + e^\theta)^2} = np(1-p),
\]
and so on.
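These differentiations can be checked symbolically; the following sympy sketch (an illustration added here, assuming sympy is available) verifies that $k'(\theta) = np$ and $k''(\theta) = np(1-p)$ under the natural parameterisation $\theta = \log\bigl(p/(1-p)\bigr)$.

import sympy as sp

theta, n = sp.symbols('theta n', positive=True)
k = n * sp.log(1 + sp.exp(theta))          # cumulant generator for B(n, p)

p = sp.exp(theta) / (1 + sp.exp(theta))    # invert the natural parameterisation
mean = sp.diff(k, theta)                   # first cumulant  k'(theta)
var = sp.diff(k, theta, 2)                 # second cumulant k''(theta)

print(sp.simplify(mean - n * p))           # 0, i.e. k'(theta)  = n p
print(sp.simplify(var - n * p * (1 - p)))  # 0, i.e. k''(theta) = n p (1 - p)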

6.9 Large sample distribution of $\hat\theta$

From the data-summary point of view, the m.l.e. $\hat\theta$ and $J(\hat\theta)$ have been thought of in terms of a particular set of data. We now wish to think of $\hat\theta$ in terms of repeated sampling (i.e. as a random variable).

Main results

In many situations and subject to regularity conditions

\[
\hat\theta \xrightarrow{D} N\bigl(\theta, I(\theta)^{-1}\bigr),
\]
and an approximate 95% confidence interval for $\theta$ is given by
\[
\hat\theta \pm 1.96\, I(\hat\theta)^{-1/2}
\]
[or $\hat\theta \pm 1.96\, J(\hat\theta)^{-1/2}$, regarded by many as better, but not in the books].

In the multivariate case,
\[
\hat\theta \xrightarrow{D} N\bigl(\theta, I(\theta)^{-1}\bigr).
\]


Example Exponential distribution

For an exponential distribution with mean θ,

\[
L(\theta) = \theta^{-n} e^{-\sum_i x_i/\theta}, \qquad \theta > 0,
\qquad
l(\theta) = -n\log\theta - \sum_i x_i\big/\theta,
\]
so that
\[
U(\theta) = -\frac{n}{\theta} + \frac{\sum_i x_i}{\theta^2},
\qquad
J(\theta) = -\frac{n}{\theta^2} + \frac{2\sum_i x_i}{\theta^3}.
\]
Thus
\[
\hat\theta = \bar{x}, \qquad J(\hat\theta) = \frac{n}{\bar{x}^2},
\]
and an approximate 95% confidence interval is
\[
\bar{x} \pm 1.96\,\bar{x}\big/\sqrt{n}.
\]
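A minimal Python sketch of this interval on simulated data (illustrative only, not part of the notes; the mean and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
theta_true, n = 5.0, 80
x = rng.exponential(theta_true, size=n)       # exponential sample with mean theta

theta_hat = x.mean()                          # m.l.e. of the mean
se = theta_hat / np.sqrt(n)                   # J(theta_hat)^(-1/2) = xbar / sqrt(n)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(round(theta_hat, 3), tuple(round(v, 3) for v in ci))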

Example Normal distribution

For a normal random sample,

\[
\hat\mu = \bar{x}, \qquad \hat\sigma^2 = n^{-1}\sum_i (x_i - \bar{x})^2,
\]
\[
J(\mu,\sigma^2) =
\begin{pmatrix}
n/\sigma^2 & \sigma^{-4}\sum_i (x_i-\mu)\\[1ex]
\sigma^{-4}\sum_i (x_i-\mu) & \sigma^{-6}\sum_i (x_i-\mu)^2 - n/(2\sigma^4)
\end{pmatrix}.
\]
Therefore
\[
I(\mu,\sigma^2) =
\begin{pmatrix}
n/\sigma^2 & 0\\
0 & n/(2\sigma^4)
\end{pmatrix}.
\]
An approximate 95% confidence interval for $\mu$ is
\[
\bar{x} \pm 1.96\,\hat\sigma\big/\sqrt{n},
\]
and for $\sigma^2$ is
\[
\hat\sigma^2 \pm 1.96\,\hat\sigma^2\sqrt{\frac{2}{n}}.
\]
Note that the estimators $\hat\mu$ and $\hat\sigma^2$ are asymptotically uncorrelated.

The exact interval for $\mu$ is
\[
\bar{x} \pm \frac{S}{\sqrt{n}}\, t_{0.975}(n-1),
\]


which is not quite the same.

Proof of asymptotic normality

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with p.d.f. $f(x;\theta)$. Then the log-likelihood, score and observed information are
\[
l(\theta) = \sum_i \log f(x_i;\theta),
\qquad
U(\theta) = \sum_i \frac{\partial}{\partial\theta}\log f(x_i;\theta),
\qquad
J(\theta) = -\sum_i \frac{\partial^2}{\partial\theta^2}\log f(x_i;\theta).
\]

Let $U_i(\theta)$ be the random variable $U_i(\theta) = \frac{\partial}{\partial\theta}\log f(X_i;\theta)$. Provided that conditions are such that integration and differentiation are interchangeable,
\[
\begin{aligned}
E\bigl[U_i(\theta)\bigr] &= \int f(x;\theta)\,\frac{\partial}{\partial\theta}\log f(x;\theta)\,dx\\
&= \int \frac{\partial}{\partial\theta} f(x;\theta)\,dx\\
&= \frac{\partial}{\partial\theta}\int f(x;\theta)\,dx = \frac{\partial}{\partial\theta}\,1 = 0
\end{aligned}
\]
and
\[
\begin{aligned}
0 &= \frac{\partial}{\partial\theta}\int f(x;\theta)\,\frac{\partial}{\partial\theta}\log f(x;\theta)\,dx\\
&= \int f(x;\theta)\,\frac{\partial^2}{\partial\theta^2}\log f(x;\theta)\,dx
+ \int \frac{\partial}{\partial\theta} f(x;\theta)\,\frac{\partial}{\partial\theta}\log f(x;\theta)\,dx\\
&= E\!\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right]
+ \int f(x;\theta)\left(\frac{\partial}{\partial\theta}\log f(x;\theta)\right)^{\!2} dx.
\end{aligned}
\]
So
\[
0 = -i(\theta) + E\bigl[U_i(\theta)^2\bigr]
\]
and, therefore, $V\bigl[U_i(\theta)\bigr] = i(\theta)$.

It follows that $E\bigl[U(\theta)\bigr] = 0$, $V\bigl[U(\theta)\bigr] = n\,i(\theta) = I(\theta)$, and the CLT shows that
\[
U(\theta) \xrightarrow{D} N\bigl(0, I(\theta)\bigr).
\]

Now the m.l.e. is a solution of $U(\hat\theta) = 0$, so that, Taylor expanding,
\[
U(\theta) + U'(\theta)\bigl(\hat\theta - \theta\bigr) \simeq 0
\]


or
\[
U(\theta) - J(\theta)\bigl(\hat\theta - \theta\bigr) \simeq 0.
\]
Re-arranging,
\[
\sqrt{I(\theta)}\bigl(\hat\theta - \theta\bigr) \simeq \frac{U(\theta)\sqrt{I(\theta)}}{J(\theta)}
= \frac{U(\theta)\big/\sqrt{I(\theta)}}{J(\theta)\big/I(\theta)}.
\]
From the CLT,
\[
\frac{U(\theta)}{\sqrt{I(\theta)}} \xrightarrow{D} N(0,1),
\]
and from the WLLN,
\[
\frac{J(\theta)}{I(\theta)} \xrightarrow{P} 1.
\]
Slutsky's Theorem therefore gives
\[
\sqrt{I(\theta)}\bigl(\hat\theta - \theta\bigr) \xrightarrow{D} N(0,1),
\]
or
\[
\hat\theta \xrightarrow{D} N\bigl(\theta, I(\theta)^{-1}\bigr).
\]
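The limiting statement can be checked by simulation. The following Python sketch (added for illustration; the parameter values, sample size and number of replications are arbitrary) repeatedly draws exponential samples with mean $\theta$, for which $\hat\theta = \bar{x}$ and $I(\theta) = n/\theta^2$, and confirms that $\sqrt{I(\theta)}(\hat\theta - \theta)$ behaves like a standard normal.

import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 100, 5000

# For the exponential model with mean theta: theta_hat = xbar, I(theta) = n / theta^2.
theta_hats = rng.exponential(theta, size=(reps, n)).mean(axis=1)
z = np.sqrt(n / theta**2) * (theta_hats - theta)

print(round(z.mean(), 3), round(z.std(), 3))          # approximately 0 and 1
print(round(np.mean(np.abs(z) < 1.96), 3))            # approximately 0.95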


Requirements of this proof

1. The true value of $\theta$ is interior to the parameter space.

2. Differentiation under the integral is valid, so that $E[U(\theta)] = 0$ and $V[U(\theta)] = n\,i(\theta)$. This allows a central limit theorem to apply to $U(\theta)$.

3. Taylor expansions are valid for the derivatives of the log-likelihood, so that higher-order terms may be neglected.

4. A weak law of large numbers applies to J(θ).

Example: Exponential family

In the natural parameterisation, the likelihood has the form
\[
L(\theta) = m(y)\,e^{\theta\sum_i y_i - n k(\theta)},
\]
so that
\[
U(\theta) = \sum_i y_i - n k'(\theta),
\qquad
J(\theta) = n k''(\theta).
\]
$\hat\theta$ solves
\[
U(\hat\theta) = 0 \;\Rightarrow\; k'(\hat\theta) = \bar{y}.
\]
Expanding,
\[
k'(\theta) + \bigl(\hat\theta - \theta\bigr)k''(\theta) \simeq \bar{y},
\]
so that
\[
\hat\theta \simeq \theta + \frac{\bar{y} - k'(\theta)}{k''(\theta)}.
\]
Since
\[
E(\bar{Y}) = n^{-1}\sum_i E(Y_i) = k'(\theta),
\]
we have $E(\hat\theta) \simeq \theta$, and
\[
V(\hat\theta) = \frac{1}{k''(\theta)^2}\,V(\bar{Y}) = \frac{n^{-1}k''(\theta)}{k''(\theta)^2} = \frac{1}{n\,k''(\theta)},
\]
which, of course, we could have obtained directly from $\hat\theta \overset{\cdot}{\sim} N\bigl(\theta, I(\theta)^{-1}\bigr)$.

Example: Exponential distribution


For the exponential density
\[
f(x;\lambda) = \lambda e^{-\lambda x}, \qquad x > 0,\ \lambda > 0,
\]
the natural parameterisation has
\[
\theta = -\lambda, \qquad k(\theta) = -\log\lambda = -\log(-\theta),
\qquad
k'(\theta) = -\frac{1}{\theta}, \qquad k''(\theta) = \frac{1}{\theta^2},
\]
so
\[
\hat\theta = -\frac{1}{\bar{y}}, \qquad I(\hat\theta) = n\,k''(\hat\theta) = n\bar{y}^2.
\]
Thus, approximately,
\[
-\frac{1}{\bar{y}} \pm \frac{z_\alpha}{\bar{y}\sqrt{n}}
\]
gives a confidence interval for $\theta$, and
\[
\frac{1}{\bar{y}} \pm \frac{z_\alpha}{\bar{y}\sqrt{n}}
\]
gives a confidence interval for $\lambda$.
