Maximum Likelihood Estimators
February 22, 2016 Debdeep Pati
1 Maximum Likelihood Estimation
Assume X ∼ P_θ, θ ∈ Θ, with joint pdf (or pmf) f(x | θ). Suppose we observe X = x. The likelihood function is

L(θ | x) = f(x | θ)

as a function of θ (with the data x held fixed). The likelihood function L(θ | x) and joint pdf f(x | θ) are the same except that f(x | θ) is generally viewed as a function of x with θ held fixed, and L(θ | x) as a function of θ with x held fixed. f(x | θ) is a density in x for each fixed θ. But L(θ | x) is not a density (or mass function) in θ for fixed x (except by coincidence).
1.1 The Maximum Likelihood Estimator (MLE)
A point estimator θ̂ = θ̂(x) is an MLE for θ if

L(θ̂ | x) = sup_θ L(θ | x),

that is, θ̂ maximizes the likelihood. In most cases, the maximum is achieved at a unique value, and we can refer to "the" MLE, and write

θ̂(x) = argmax_θ L(θ | x).
(But there are cases where the likelihood has flat spots and the MLE is not unique.)
1.2 Motivation for MLE’s
Note: We often write L(θ | x) = L(θ), suppressing x, which is kept fixed at the observed data. Suppose x ∈ R^n.

Discrete Case: If f(· | θ) is a mass function (X is discrete), then

L(θ) = f(x | θ) = P_θ(X = x).

L(θ) is the probability of getting the observed data x when the parameter value is θ.

Continuous Case: When f(· | θ) is a continuous density, P_θ(X = x) = 0, but if B ⊂ R^n is a very, very small ball (or cube) centered at the observed data x, then

P_θ(X ∈ B) ≈ f(x | θ) × Volume(B) ∝ L(θ).

L(θ) is proportional to the probability that the random data X will be close to the observed data x when the parameter value is θ. Thus, the MLE θ̂ is the value of θ which makes the observed data x "most probable".
To find θ̂, we maximize L(θ). This is usually done by calculus (finding a stationary point), but not always. If the parameter space Θ contains endpoints or boundary points, the maximum can be achieved at a boundary point without being a stationary point. If L(θ) is not "smooth" (continuous and everywhere differentiable), the maximum does not have to be achieved at a stationary point.

Cautionary Example: Suppose X1, . . . , Xn are iid Uniform(0, θ) and Θ = (0, ∞). Given data x = (x1, . . . , xn), find the MLE for θ.

L(θ) = ∏_{i=1}^n θ^{−1} I(0 < x_i < θ) = θ^{−n} I(0 ≤ x_(1)) I(x_(n) ≤ θ)

     = θ^{−n}  for θ ≥ x_(n),   and 0  for 0 < θ < x_(n),
which is maximized at θ = x_(n), a point of discontinuity and certainly not a stationary point. Thus, the MLE is θ̂ = x_(n).

Notes: L(θ) = 0 for θ < x_(n) just says that these values of θ are absolutely ruled out by the data (which is obvious). A strange property of the MLE in this example (not typical):

P_θ(θ̂ < θ) = 1.

The MLE is biased; it is always less than the true value.

A Similar Example: Let X1, . . . , Xn be iid Uniform(α, β) and Θ = {(α, β) : α < β}. Given data x = (x1, . . . , xn), find the MLE for θ = (α, β).
L(α, β) = ∏_{i=1}^n (β − α)^{−1} I(α < x_i < β) = (β − α)^{−n} I(α ≤ x_(1)) I(x_(n) ≤ β)

        = (β − α)^{−n}  for α ≤ x_(1) and x_(n) ≤ β,   and 0 otherwise,
which is maximized by making β − α as small as possible without entering the "0 otherwise" region. Clearly, the maximum is achieved at (α, β) = (x_(1), x_(n)). Thus the MLE is θ̂ = (α̂, β̂) = (x_(1), x_(n)). Again, P_{α,β}(α < α̂, β̂ < β) = 1.
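The bias of the Uniform(0, θ) MLE above is easy to see by simulation. A minimal sketch, assuming a true value θ = 5 and sample size n = 20 (both chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 5.0   # assumed true parameter for the simulation
n = 20
reps = 10_000

# MLE for Uniform(0, theta): theta_hat = x_(n), the sample maximum
theta_hats = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)

# Every realization lies strictly below theta: P(theta_hat < theta) = 1
print((theta_hats < theta).all())   # True
# The average of theta_hat is below theta (E x_(n) = theta * n/(n+1)):
print(theta_hats.mean())
```

The simulated mean sits near θ·n/(n+1) ≈ 4.76, confirming the downward bias noted in the example.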
2 Maximizing the Likelihood (one parameter)
2.1 General Remarks
Basic Result: A continuous function g(θ) defined on a closed, bounded interval J attains its supremum (but might do so at one of the endpoints). (That is, there exists a point θ0 ∈ J such that g(θ0) = sup_{θ∈J} g(θ).)

Consequence: Suppose g(θ) is a continuous, non-negative function defined on an open interval J = (c, d) (where perhaps c = −∞ or d = ∞). If

lim_{θ→c} g(θ) = lim_{θ→d} g(θ) = 0,

then g attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.
3 Maxima at Stationary Points
Suppose the function g(θ) is defined on an interval Θ (which may be open or closed, infinite or finite). If g is differentiable and attains its supremum at a point θ0 in the interior of Θ, that point must be a stationary point (that is, g′(θ0) = 0).

1. If g′(θ0) = 0 and g″(θ0) < 0, then θ0 is a local maximum (but might not be the global maximum).

2. If g′(θ0) = 0 and g″(θ) < 0 for all θ ∈ Θ, then θ0 is a global maximum (that is, it attains the supremum).

The condition in (1) is necessary (but not sufficient) for θ0 to be a global maximum. Condition (2) is sufficient (but not necessary). A function satisfying g″(θ) < 0 for all θ ∈ Θ is called strictly concave. It lies below any tangent line. Another useful condition (sufficient, but not necessary) is:

3. If g′(θ) > 0 for θ < θ0 and g′(θ) < 0 for θ > θ0, then θ0 is a global maximum.
4 Maximizing the Likelihood (multi-parameter)
4.1 Basic Result:
A continuous function g(θ) defined on a closed, bounded set J ⊂ R^k attains its supremum (but might do so on the boundary).
4.2 Consequence:
Suppose g(θ) is a continuous, non-negative function defined for all θ ∈ R^k. If g(θ) → 0 as ||θ|| → ∞, then g attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.

Suppose the function g(θ) is defined on a convex set Θ ⊂ R^k (that is, the line segment joining any two points in Θ lies entirely inside Θ). If g is differentiable and attains its supremum at a point θ0 in the interior of Θ, that point must be a stationary point:

∂g(θ0)/∂θ_i = 0,   i = 1, 2, . . . , k.
Define the gradient vector D and Hessian matrix H:

D(θ) = ( ∂g(θ)/∂θ_i )_{i=1}^k   (a k × 1 vector),

H(θ) = ( ∂²g(θ)/∂θ_i ∂θ_j )_{i,j=1}^k   (a k × k matrix).
4.3 Maxima at Stationary Points
1. If D(θ0) = 0 and H(θ0) is negative definite, then θ0 is a local maximum (but might not be the global maximum).

2. If D(θ0) = 0 and H(θ) is negative definite for all θ ∈ Θ, then θ0 is a global maximum (that is, it attains the supremum).

(1) is necessary (but not sufficient) for θ0 to be a global maximum. (2) is sufficient (but not necessary). A function for which H(θ) is negative definite for all θ ∈ Θ is called strictly concave. It lies below any tangent plane.
4.4 Positive and Negative Definite Matrices
Suppose M is a k × k symmetric matrix. Note: Hessian matrices and covariance matrices are symmetric.

Definitions:

1. M is positive definite if x′Mx > 0 for all x ≠ 0 (x ∈ R^k).

2. M is negative definite if x′Mx < 0 for all x ≠ 0.

3. M is non-negative definite (or positive semi-definite) if x′Mx ≥ 0 for all x ∈ R^k.
Facts:
1. M is p.d. iff all its eigenvalues are positive.
2. M is n.d. iff all its eigenvalues are negative.
3. M is n.n.d. iff all its eigenvalues are non-negative.
4. M is p.d. iff −M is n.d.
5. If M is p.d., all its diagonal elements must be positive.
6. If M is n.d., all its diagonal elements must be negative.
7. The determinant of a symmetric matrix is equal to the product of its eigenvalues.

2 × 2 Symmetric Matrices:

M = (m_ij) = [ m11  m12 ]
             [ m21  m22 ],   m12 = m21,

|M| = m11 m22 − m12 m21 = m11 m22 − m12².

A 2 × 2 symmetric matrix is p.d. when the determinant is positive and the diagonal elements are positive. A 2 × 2 symmetric matrix is n.d. when the determinant is positive and the diagonal elements are negative.

The bare minimum you need to check: M is p.d. if m11 > 0 (or m22 > 0) and |M| > 0. M is n.d. if m11 < 0 (or m22 < 0) and |M| > 0.
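The 2 × 2 shortcut can be checked against the eigenvalue characterization in Fact 2. A small sketch (the example matrix is made up for illustration):

```python
import numpy as np

def is_negative_definite(M):
    """Check negative definiteness of a symmetric matrix via its eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(M) < 0))

# Shortcut from the notes: m11 < 0 and det(M) > 0 suffice for a 2x2 symmetric matrix
M = np.array([[-3.0, 1.0],
              [1.0, -2.0]])
print(M[0, 0] < 0 and np.linalg.det(M) > 0)   # True: shortcut says n.d.
print(is_negative_definite(M))                # True: eigenvalue check agrees
```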
Example: Observe X1, X2, . . . , Xn iid Gamma(α, β).

Preliminaries:

L(α, β) = ∏_{i=1}^n x_i^{α−1} e^{−x_i/β} / (β^α Γ(α)).

Maximizing L is the same as maximizing l = log L, given by

l(α, β) = (α − 1)T1 − T2/β − nα log β − n log Γ(α),
where T1 = ∑_i log x_i and T2 = ∑_i x_i. Note that T = (T1, T2) is the natural sufficient statistic of this 2pef.

∂l/∂α = T1 − n log β − nψ(α),   where ψ(α) ≡ (d/dα) log Γ(α) = Γ′(α)/Γ(α),

∂l/∂β = T2/β² − nα/β = (1/β²)(T2 − nαβ),

∂²l/∂α² = −nψ′(α),

∂²l/∂β² = −2T2/β³ + nα/β² = −(1/β³)(2T2 − nαβ),

∂²l/∂α∂β = −n/β.
Situation #1: Suppose α = α0 is known. Find the MLE for β. (Drop α from the arguments: l(β) = l(α0, β), etc.) l(β) is continuous and differentiable. l(β) has a unique stationary point:

l′(β) = ∂l/∂β = (1/β²)(T2 − nα0 β) = 0

iff T2 = nα0 β, iff β = T2/(nα0) (≡ β*).
Now we check the second derivative.
l″(β) = ∂²l/∂β² = −(1/β³)(2T2 − nα0 β) = −(1/β³){T2 + (T2 − nα0 β)}.

Note l″(β*) < 0 since T2 − nα0 β* = 0, but l″(β) > 0 for β > 2T2/(nα0). Thus, the stationary point satisfies the necessary condition for a global maximum, but not the sufficient condition (i.e., l(β) is not a strictly concave function). How can we be sure that we have found the global maximum, and not just a local maximum? In this case, there is a simple argument: the stationary point β* is unique, and l′(β) > 0 for β < β*, and l′(β) < 0 for β > β*. This ensures β* is the unique global maximizer.

Conclusion: β̂ = T2/(nα0). (This is a function of T2, which is a sufficient statistic for β when α is known.)

Situation #2: Suppose β = β0 is known. Find the MLE for α. (Drop β from the arguments: l(α) = l(α, β0), etc.)

Note: l′(α) and l″(α) involve ψ(α). The function ψ is infinitely differentiable on the interval (0, ∞), and satisfies ψ′(α) > 0 and ψ″(α) < 0 for all α > 0. (The function is strictly increasing and strictly concave.)
Also,

lim_{α→0+} ψ(α) = −∞,   lim_{α→∞} ψ(α) = ∞.

Thus ψ^{−1} : R → (0, ∞) exists. l(α) is continuous and differentiable. l(α) has a unique stationary point:
l′(α) = T1 − n log β0 − nψ(α) = 0

iff ψ(α) = T1/n − log β0

iff α = ψ^{−1}(T1/n − log β0).

This is the unique global maximizer since

l″(α) = −nψ′(α) < 0 for all α > 0.

Thus α̂ = ψ^{−1}(T1/n − log β0) is the MLE. (This is a function of T1, which is a sufficient statistic for α when β is known.)

Situation #3: Find the MLE for θ = (α, β). l(α, β) is continuous and differentiable. A stationary point must satisfy the system of two equations:
∂l/∂α = T1 − n log β − nψ(α) = 0,

∂l/∂β = (1/β²)(T2 − nαβ) = 0.

Solving the second equation for β gives

β = T2/(nα).

Plugging this into the first equation, and rearranging a bit, leads to

T1/n − log(T2/n) = ψ(α) − log α ≡ H(α).
The function H(α) is continuous and strictly increasing from (0, ∞) to (−∞, 0), so that it has an inverse mapping (−∞, 0) to (0, ∞). Thus, the solution to the above equation can be written:

α = H^{−1}{T1/n − log(T2/n)}.

Thus the unique stationary point is:

α̂ = H^{−1}{T1/n − log(T2/n)},   β̂ = T2/(nα̂).
Is this the MLE? Let us examine the Hessian.

H(α, β) = [ ∂²l/∂α²     ∂²l/∂α∂β ]
          [ ∂²l/∂α∂β    ∂²l/∂β²  ]

        = [ −nψ′(α)    −n/β                ]
          [ −n/β       −(1/β³)(2T2 − nαβ)  ]

H(α̂, β̂) = [ −nψ′(α̂)     −n²α̂/T2    ]
           [ −n²α̂/T2     −n³α̂³/T2²  ]

The diagonal elements are both negative, and the determinant is equal to

(n⁴α̂²/T2²)(α̂ψ′(α̂) − 1).

This is positive since αψ′(α) − 1 > 0 for all α > 0. This guarantees that H(α̂, β̂) is negative definite, so that (α̂, β̂) is at least a local maximum.
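The stationary-point equations above can be solved numerically: since H(α) = ψ(α) − log α is strictly increasing, a root-finder inverts it directly. A sketch using scipy, with assumed true values α = 3, β = 2 and n = 500 chosen only for the demonstration:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(1)
n = 500
alpha_true, beta_true = 3.0, 2.0   # assumed values for the demo
x = rng.gamma(shape=alpha_true, scale=beta_true, size=n)

T1, T2 = np.log(x).sum(), x.sum()
c = T1 / n - np.log(T2 / n)        # right-hand side; always < 0 by Jensen's inequality

# H(alpha) = psi(alpha) - log(alpha) increases from -inf to 0 on (0, inf),
# so H(alpha) = c has a unique root: alpha_hat = H^{-1}(c).
H = lambda a: digamma(a) - np.log(a)
alpha_hat = brentq(lambda a: H(a) - c, 1e-6, 1e6)
beta_hat = T2 / (n * alpha_hat)

print(alpha_hat, beta_hat)         # close to the assumed (3.0, 2.0)
```

Note the solver recovers exactly the stationary point of the notes: both likelihood equations hold at (α̂, β̂) up to root-finding tolerance.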
5 Invariance principle for MLEs
If η = τ(θ) and θ̂ is the MLE of θ, then η̂ = τ(θ̂) is the MLE of η.

Comments:

1. If τ(θ) is a 1-1 function, this is a trivial theorem.

2. If τ(θ) is not 1-1, this is essentially true by definition of the induced likelihood (see later).

Example: X = (X1, X2, . . . , Xn) iid N(µ, σ²). The usual parameters θ = (µ, σ²) are related to the natural parameters η = (µ/σ², −1/(2σ²)) of the 2pef by a 1-1 function: η = τ(θ). The likelihood in terms of θ is

L1(θ) = (2πσ²)^{−n/2} e^{−nµ²/(2σ²)} e^{(µ/σ²)T1 − (1/(2σ²))T2},

where T1 = ∑ Xi and T2 = ∑ Xi².
Simple Example: X = (X1, X2, . . . , Xn) iid Bernoulli(p). It is known that the MLE of p is p̂ = X̄. Thus

1. The MLE of p² is p̂² = X̄².

2. The MLE of p(1 − p) is X̄(1 − X̄).

The function of p in 1. is 1-1, but the one in 2. is not.
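The Bernoulli example is one line of code. A minimal sketch, with an assumed true p = 0.3 and n = 200 (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=200)   # Bernoulli(p) data, assumed p = 0.3

p_hat = x.mean()                     # MLE of p is the sample mean
print(p_hat ** 2)                    # MLE of p^2, by invariance (tau is 1-1 here)
print(p_hat * (1 - p_hat))           # MLE of p(1-p), by invariance (tau is not 1-1)
```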
5.1 Induced Likelihood
Definition 1. If η = τ(θ), then

L*(η) ≡ sup_{θ : τ(θ)=η} L(θ).

Go back to the example X1, X2, . . . , Xn ∼ N(µ, σ²) iid. If the MLE η̂ of η is defined to be the value which maximizes L*(η), then it is easily seen that η̂ = τ(θ̂). The likelihood in terms of η is

L2(η) = (−π/η2)^{−n/2} e^{nη1²/(4η2)} e^{η1 T1 + η2 T2},
obtained by substituting into L1(θ)

µ = −η1/(2η2),   σ² = −1/(2η2),

that is, evaluating L1 at

θ = (µ, σ²) = (−η1/(2η2), −1/(2η2)) = τ^{−1}(η).

Stated abstractly, L2(η) = L1(τ^{−1}(η)), so that L2 is maximized when τ^{−1}(η) = θ̂, that is, by η = τ(θ̂). The MLE of θ is known to be

θ̂ = (µ̂, σ̂²) = ( X̄, (1/n) ∑_{i=1}^n (Xi − X̄)² ),
so the invariance principle says the MLE of η is

η̂ = τ(θ̂) = ( µ̂/σ̂², −1/(2σ̂²) ).
Continuation of example: What is the MLE of α = µ + σ²? Note that

α = g(µ, σ²) = µ + σ²

is not a 1-1 function, but

α̂ = g(µ̂, σ̂²) = µ̂ + σ̂² = X̄ + SS/n,

where SS = ∑_{i=1}^n (Xi − X̄)².

What are the MLEs of µ and σ²? With g1(x, y) = x and g2(x, y) = y, we have

µ = g1(θ),   σ² = g2(θ),

so that the MLEs are

µ̂ = g1(θ̂) = X̄,   σ̂² = g2(θ̂) = SS/n.

Thus, the invariance principle implies that the MLE of the pair (µ, σ²) is exactly (µ̂, σ̂²), the pair of individual MLEs.
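The normal example, including the invariance step to the natural parameters, can be sketched numerically (assumed values µ = 1, σ = 2, n = 1000, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=1000)   # assumed mu = 1, sigma = 2

mu_hat = x.mean()                                # MLE of mu
sigma2_hat = ((x - mu_hat) ** 2).mean()          # MLE of sigma^2: SS/n, not SS/(n-1)

# Invariance: MLE of the natural parameters eta = (mu/sigma^2, -1/(2 sigma^2))
eta_hat = (mu_hat / sigma2_hat, -1.0 / (2.0 * sigma2_hat))
print(mu_hat, sigma2_hat, eta_hat)
```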
5.2 MLE for Exponential Families
The invariance principle for MLEs allows us to work with the natural parameter η (which is a 1-1 function of θ).

1pef:

f(x | θ) = c(θ)h(x) exp{w(θ)t(x)}.

Natural parameter: η = w(θ). With a little abuse of notation (writing f(x | η) for f*(x | η) = f(x | w^{−1}(η)), and c(η) for c*(η) = c(w^{−1}(η))), we can write

f(x | η) = c(η)h(x) exp{ηt(x)}.
For clarity of notation, we will use x = (x1, . . . , xN) as the observed data and X = (X1, X2, . . . , XN) as the random data. If X1, . . . , XN are iid from f(x | η), then

l(η) = N log c(η) + ∑_{i=1}^N log h(xi) + η ∑_{i=1}^N t(xi).
Since by 3.32(a), E t(Xi) = −(∂/∂η) log c(η), we have

l′(η) = N (∂/∂η) log c(η) + ∑_{i=1}^N t(xi)    (1)

      = −E[ ∑_{i=1}^N t(Xi) ] + ∑_{i=1}^N t(xi)

      = −E T(X) + T(x),

where T(X) = ∑_{i=1}^N t(Xi). Hence the condition for a stationary point is equivalent to:

E_η T(X) = T(x).
Note that, using (1),

l″(η) = N (∂²/∂η²) log c(η) = N{−Var_η t(Xi)} < 0

for all η. Thus any interior stationary point (not on the boundary of Θ* = {w(θ) : θ ∈ Θ}) is automatically a global maximum so long as Θ* is convex. In one dimension (Θ ⊂ R), this means Θ* must be an interval of some sort (it can be infinite). Ignoring this fine point: for a 1pef, the log-likelihood will have a unique stationary point, which will be the MLE.

k-pef:

f(x | θ) = c(θ)h(x) exp{ ∑_{j=1}^k wj(θ) tj(x) }.
Natural parameter: η = (η1, . . . , ηk) = (w1(θ), . . . , wk(θ)), that is, ηj = wj(θ).

f(x | η) = c(η)h(x) exp{ ∑_{j=1}^k ηj tj(x) }.
If X1, X2, . . . , XN are iid from f(x | η), then

l(η) = N log c(η) + ∑_{i=1}^N log h(xi) + ∑_{j=1}^k ηj { ∑_{i=1}^N tj(xi) },

∂l/∂ηj = N (∂/∂ηj) log c(η) + ∑_{i=1}^N tj(xi)    [using (∂/∂ηj) log c(η) = −E tj(Xi)]

       = −E ∑_{i=1}^N tj(Xi) + ∑_{i=1}^N tj(xi),
∂²l/∂ηj∂ηl = N ( (∂²/∂ηj∂ηl) log c(η) ) = N(−Cov(tj(Xi), tl(Xi))).

Thus, the equations for a stationary point,

∂l/∂ηj = 0,   j = 1, . . . , k,

are equivalent to

E_η Tj(X) = Tj(x),   j = 1, . . . , k,    (2)

where Tj(X) = ∑_{i=1}^N tj(Xi) and Tj(x) = ∑_{i=1}^N tj(xi), or in vector notation,

E_η T(X) = T(x),

where T(X) = (T1(X), . . . , Tk(X)) and T(x) = (T1(x), . . . , Tk(x)).
The Hessian matrix H(η) = ( ∂²l/∂ηi∂ηj )_{i,j=1}^k is given by

H(η) = −NΣ(η),

where Σ(η) is the k × k covariance matrix of (t1(X1), t2(X1), . . . , tk(X1)). A covariance matrix will be positive definite (except in degenerate cases), so that H(η) will be negative definite for all η.

Conclusion: An interior stationary point (i.e., a solution of (2)) must be the unique global maximum, and hence the MLE. This result also holds in the original parameterization, with (2) restated as
EθTj(X) = Tj(x), j = 1, . . . , k.
Connection with MOM: For a 1pef with t(x) = x, MOM and MLE agree. For a kpef with tj(x) = x^j, MOM and MLE agree. Why? Because then (2) is equivalent to the equations for the MOM estimator.
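The MOM/MLE agreement is easy to exhibit for a concrete 1pef with t(x) = x, e.g. the Poisson family (assumed rate λ = 3.5 and n = 400, illustrative values only):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(4)
x = rng.poisson(lam=3.5, size=400)   # Poisson is a 1pef with t(x) = x

# MOM estimate: match the first moment, E X = lambda
lam_mom = x.mean()

# MLE: maximize the Poisson log-likelihood numerically
negloglik = lambda lam: -(x.sum() * np.log(lam) - x.size * lam - gammaln(x + 1).sum())
lam_mle = minimize_scalar(negloglik, bounds=(1e-6, 50), method="bounded").x

print(lam_mom, lam_mle)   # the two estimates coincide, up to solver tolerance
```

Here (2) reads E_λ ∑ Xi = ∑ xi, i.e. Nλ = ∑ xi, which is exactly the first-moment equation.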
6 Revisiting Gamma Example:
The system of equations for the MLE of (α, β) may be easily derived directly from (2):

E T1(X) = T1(x),
E T2(X) = T2(x),

which becomes

E ∑_{i=1}^n log Xi = n E log X1 = n(log β + ψ(α)) = T1(x),

E ∑_{i=1}^n Xi = n E X1 = nαβ = T2(x).
The equations are the same as the equations for a stationary point derived earlier. For X ∼ Gamma(α, β), we have used:

E log X = ∫₀^∞ (log x) x^{α−1} e^{−x/β} / (β^α Γ(α)) dx

= ∫₀^∞ (log(x/β) + log β) (x/β)^{α−1} e^{−x/β} / Γ(α) · (dx/β)

= ∫₀^∞ (log z + log β) z^{α−1} e^{−z} / Γ(α) dz

= log β + (1/Γ(α)) ∫₀^∞ (z^{α−1} log z) e^{−z} dz    [note z^{α−1} log z = (∂/∂α) z^{α−1}]

= log β + (1/Γ(α)) (∂/∂α) ∫₀^∞ z^{α−1} e^{−z} dz

= log β + Γ′(α)/Γ(α) = log β + ψ(α).
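The identity E log X = log β + ψ(α) can be sanity-checked by Monte Carlo. A quick sketch with assumed values α = 2.5, β = 1.5 (chosen only for the check):

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(5)
alpha, beta = 2.5, 1.5   # assumed values for the check
z = rng.gamma(shape=alpha, scale=beta, size=200_000)

mc = np.log(z).mean()                    # Monte Carlo estimate of E log X
exact = np.log(beta) + digamma(alpha)    # the identity derived above
print(mc, exact)                         # the two agree to about two decimals
```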
Verifying the Stationary Point is a Global Maximum: The Gamma family is a 2pef (or a 1pef if α or β is held fixed). Switching to the natural parameters η1 = α − 1, η2 = −1/β (or just making the substitution λ = 1/β) simplifies the second derivatives with respect to η2 (or λ). The Hessian matrix is then negative definite for all η = (η1, η2), which is a sufficient condition for the stationary point to be the global maximum.
6.1 MLEs for More General Exponential Families
Proposition 1. If X ∼ P_θ, θ ∈ Θ, where P_θ has a joint pdf (pmf) from an n-variate k-parameter exponential family

f(x | θ) = c(θ)h(x) exp{ ∑_{j=1}^k wj(θ) Tj(x) }

for x ∈ R^n, θ ∈ Θ ⊂ R^k, then the MLE of θ based on the observed data x is the solution (in θ) of the system of equations

E_θ Tj(X) = Tj(x),   j = 1, . . . , k,

provided the solution (call it θ̂) satisfies

w(θ̂) ∈ interior of {w(θ) : θ ∈ Θ}.
Proof. Essentially the same as for the ordinary kpef.
Example: Simple Linear Regression with known variance. Y1, Y2, . . . , Yn are independent with

Yi ∼ N(β0 + β1 xi, σ0²),   θ = (β0, β1).

The joint distribution of Y = (Y1, Y2, . . . , Yn) forms an exponential family. The natural sufficient statistic is

t(Y) = ( ∑_i Yi, ∑_i xi Yi ).

E_θ t(Y) = t(y) has the form

E(∑ Yi) = ∑ yi,

E(∑ xi Yi) = ∑ xi yi.

Thus the MLE θ̂ = (β̂0, β̂1) is the solution of

∑_i (β0 + β1 xi) = ∑_i yi,

∑_i xi (β0 + β1 xi) = ∑_i xi yi.
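The two equations above are linear in (β0, β1), so they can be solved directly as a 2 × 2 system. A sketch on simulated data (assumed true β0 = 1, β1 = 2, σ0 = 1, for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)   # assumed beta0=1, beta1=2

# Normal equations from E_theta t(Y) = t(y):
#   sum(beta0 + beta1 x_i)      = sum y_i
#   sum x_i (beta0 + beta1 x_i) = sum x_i y_i
A = np.array([[x.size,  x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([y.sum(), (x * y).sum()])
beta0_hat, beta1_hat = np.linalg.solve(A, b)

print(beta0_hat, beta1_hat)   # close to the assumed (1.0, 2.0)
```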
6.2 Sufficient statistics and MLEs
If T = T(X) is a sufficient statistic for θ, then there is an MLE which is a function of T. (If the MLE is unique, then we can say the MLE is a function of T.)

Proof. By the factorization criterion,

f(x | θ) = g(T(x), θ) h(x).

Assume for convenience the MLE is unique. Then the MLE is

θ̂(x) = argmax_θ f(x | θ) = argmax_θ g(T(x), θ),

which is clearly a function of T(x).
MLE coincides with "Least Squares" for independent normal rv's with constant variance σ² (known or unknown). Y1, Y2, . . . , Yn are independent with

Yi ∼ N(β0 + β1 xi, σ0²),   θ = (β0, β1),

or more generally,

Yi ∼ N(g(xi, β), σ0²),

where β is possibly a vector. Then

L(β, σ²) = f(y | β, σ²) = (1/(√(2π) σ))^n exp{ −(1/(2σ²)) ∑_{i=1}^n (yi − g(xi, β))² }.

For any σ² (fixed at an arbitrary value), maximizing L(β, σ²) with respect to β is equivalent to minimizing ∑_{i=1}^n (yi − g(xi, β))² with respect to β. Hence MLE and least squares give the same estimates of the β parameters.
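This equivalence can be demonstrated by minimizing both criteria numerically and comparing. A sketch with assumed values β0 = 0.5, β1 = 1.5, σ0 = 0.8 (chosen only for the demo):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = np.linspace(0, 5, 40)
y = 0.5 + 1.5 * x + rng.normal(scale=0.8, size=x.size)   # assumed true parameters

sigma0 = 0.8   # known variance, as in the setup above

# Sum of squared errors, and the normal negative log-likelihood at fixed sigma0
sse = lambda b: ((y - (b[0] + b[1] * x)) ** 2).sum()
negloglik = lambda b: 0.5 * x.size * np.log(2 * np.pi * sigma0 ** 2) + sse(b) / (2 * sigma0 ** 2)

b_ls = minimize(sse, x0=[0.0, 0.0]).x        # least squares
b_ml = minimize(negloglik, x0=[0.0, 0.0]).x  # maximum likelihood
print(b_ls, b_ml)                            # identical up to solver tolerance
```

The negative log-likelihood is an increasing affine function of the SSE, so both optimizers land on the same β.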