Maximum Likelihood Estimators
February 22, 2016 Debdeep Pati
1 Maximum Likelihood Estimation
Assume X ∼ P_θ, θ ∈ Θ, with joint pdf (or pmf) f(x | θ). Suppose we observe X = x. The likelihood function is

L(θ | x) = f(x | θ)

as a function of θ (with the data x held fixed). The likelihood function L(θ | x) and joint pdf f(x | θ) are the same except that f(x | θ) is generally viewed as a function of x with θ held fixed, and L(θ | x) as a function of θ with x held fixed. f(x | θ) is a density in x for each fixed θ. But L(θ | x) is not a density (or mass function) in θ for fixed x (except by coincidence).
1.1 The Maximum Likelihood Estimator (MLE)
A point estimator θ̂ = θ̂(x) is an MLE for θ if

L(θ̂ | x) = sup_θ L(θ | x),

that is, θ̂ maximizes the likelihood. In most cases, the maximum is achieved at a unique value, and we can refer to "the" MLE, and write

θ̂(x) = argmax_θ L(θ | x).
(But there are cases where the likelihood has flat spots and the MLE is not unique.)
1.2 Motivation for MLE’s
Note: We often write L(θ | x) = L(θ), suppressing x, which is kept fixed at the observed data. Suppose x ∈ R^n.

Discrete Case: If f(· | θ) is a mass function (X is discrete), then

L(θ) = f(x | θ) = P_θ(X = x).

L(θ) is the probability of getting the observed data x when the parameter value is θ.

Continuous Case: When f(· | θ) is a continuous density, P_θ(X = x) = 0, but if B ⊂ R^n is a very, very small ball (or cube) centered at the observed data x, then

P_θ(X ∈ B) ≈ f(x | θ) × Volume(B) ∝ L(θ).

L(θ) is proportional to the probability that the random data X will be close to the observed data x when the parameter value is θ. Thus, the MLE θ̂ is the value of θ which makes the observed data x "most probable".
To find θ̂, we maximize L(θ). This is usually done by calculus (finding a stationary point), but not always. If the parameter space Θ contains endpoints or boundary points, the maximum can be achieved at a boundary point without being a stationary point. If L(θ) is not "smooth" (continuous and everywhere differentiable), the maximum does not have to be achieved at a stationary point.

Cautionary Example: Suppose X1, . . . , Xn are iid Uniform(0, θ) and Θ = (0, ∞). Given data x = (x1, . . . , xn), find the MLE for θ.

L(θ) = ∏_{i=1}^n θ^{−1} I(0 < x_i < θ) = θ^{−n} I(0 ≤ x_(1)) I(x_(n) ≤ θ)

     = θ^{−n}  for θ ≥ x_(n),   and 0  for 0 < θ < x_(n),
which is maximized at θ = x_(n), a point of discontinuity and certainly not a stationary point. Thus, the MLE is θ̂ = x_(n).

Notes: L(θ) = 0 for θ < x_(n) just says that these values of θ are absolutely ruled out by the data (which is obvious). A strange property of the MLE in this example (not typical):

P_θ(θ̂ < θ) = 1.

The MLE is biased; it is always less than the true value.

A Similar Example: Let X1, . . . , Xn be iid Uniform(α, β) and Θ = {(α, β) : α < β}. Given data x = (x1, . . . , xn), find the MLE for θ = (α, β).
L(α, β) = ∏_{i=1}^n (β − α)^{−1} I(α < x_i < β) = (β − α)^{−n} I(α ≤ x_(1)) I(x_(n) ≤ β)

        = (β − α)^{−n}  for α ≤ x_(1) and x_(n) ≤ β,   and 0 otherwise,
which is maximized by making β − α as small as possible without entering the "0 otherwise" region. Clearly, the maximum is achieved at (α, β) = (x_(1), x_(n)). Thus the MLE is θ̂ = (α̂, β̂) = (x_(1), x_(n)). Again, P_{α,β}(α < α̂, β̂ < β) = 1.
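The bias of the Uniform(0, θ) MLE above is easy to see by simulation. A minimal sketch, assuming a true value θ = 5 and sample size n = 20 (both chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 5.0   # assumed true parameter for the simulation
n = 20
reps = 10_000

# MLE for Uniform(0, theta): theta_hat = x_(n), the sample maximum
theta_hats = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)

# Every realization lies strictly below theta: P(theta_hat < theta) = 1
print((theta_hats < theta).all())   # True
# The average of theta_hat is below theta (E x_(n) = theta * n/(n+1)):
print(theta_hats.mean())
```

The simulated mean sits near θ·n/(n+1) ≈ 4.76, confirming the downward bias noted in the example.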
2 Maximizing the Likelihood (one parameter)
2.1 General Remarks
Basic Result: A continuous function g(θ) defined on a closed, bounded interval J attains its supremum (but might do so at one of the endpoints). (That is, there exists a point θ0 ∈ J such that g(θ0) = sup_{θ∈J} g(θ).)

Consequence: Suppose g(θ) is a continuous, non-negative function defined on an open interval J = (c, d) (where perhaps c = −∞ or d = ∞). If

lim_{θ→c} g(θ) = lim_{θ→d} g(θ) = 0,

then g attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.
3 Maxima at Stationary Points
Suppose the function g(θ) is defined on an interval Θ (which may be open or closed, infinite or finite). If g is differentiable and attains its supremum at a point θ0 in the interior of Θ, that point must be a stationary point (that is, g′(θ0) = 0).

1. If g′(θ0) = 0 and g″(θ0) < 0, then θ0 is a local maximum (but might not be the global maximum).

2. If g′(θ0) = 0 and g″(θ) < 0 for all θ ∈ Θ, then θ0 is a global maximum (that is, it attains the supremum).

The condition in (1) is necessary (but not sufficient) for θ0 to be a global maximum. Condition (2) is sufficient (but not necessary). A function satisfying g″(θ) < 0 for all θ ∈ Θ is called strictly concave. It lies below any tangent line. Another useful condition (sufficient, but not necessary) is:

3. If g′(θ) > 0 for θ < θ0 and g′(θ) < 0 for θ > θ0, then θ0 is a global maximum.
4 Maximizing the Likelihood (multi-parameter)
4.1 Basic Result:
A continuous function g(θ) defined on a closed, bounded set J ⊂ R^k attains its supremum (but might do so on the boundary).
4.2 Consequence:
Suppose g(θ) is a continuous, non-negative function defined for all θ ∈ R^k. If g(θ) → 0 as ||θ|| → ∞, then g attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.

Suppose the function g(θ) is defined on a convex set Θ ⊂ R^k (that is, the line segment joining any two points in Θ lies entirely inside Θ). If g is differentiable and attains its supremum at a point θ0 in the interior of Θ, that point must be a stationary point:

∂g(θ0)/∂θ_i = 0,   i = 1, 2, . . . , k.
Define the gradient vector D and Hessian matrix H:

D(θ) = ( ∂g(θ)/∂θ_i )_{i=1}^k   (a k × 1 vector),

H(θ) = ( ∂²g(θ)/∂θ_i ∂θ_j )_{i,j=1}^k   (a k × k matrix).
4.3 Maxima at Stationary Points
1. If D(θ0) = 0 and H(θ0) is negative definite, then θ0 is a local maximum (but might not be the global maximum).

2. If D(θ0) = 0 and H(θ) is negative definite for all θ ∈ Θ, then θ0 is a global maximum (that is, it attains the supremum).

(1) is necessary (but not sufficient) for θ0 to be a global maximum. (2) is sufficient (but not necessary). A function for which H(θ) is negative definite for all θ ∈ Θ is called strictly concave. It lies below any tangent plane.
4.4 Positive and Negative Definite Matrices
Suppose M is a k × k symmetric matrix. Note: Hessian matrices and covariance matrices are symmetric.

Definitions:

1. M is positive definite if x′Mx > 0 for all x ≠ 0 (x ∈ R^k).

2. M is negative definite if x′Mx < 0 for all x ≠ 0.

3. M is non-negative definite (or positive semi-definite) if x′Mx ≥ 0 for all x ∈ R^k.
Facts:
1. M is p.d. iff all its eigenvalues are positive.
2. M is n.d. iff all its eigenvalues are negative.
3. M is n.n.d. iff all its eigenvalues are non-negative.
4. M is p.d. iff −M is n.d.
5. If M is p.d., all its diagonal elements must be positive.
6. If M is n.d., all its diagonal elements must be negative.
7. The determinant of a symmetric matrix is equal to the product of its eigenvalues.

2 × 2 Symmetric Matrices:

M = (m_ij) = [ m11  m12 ]
             [ m21  m22 ],   m12 = m21,

|M| = m11 m22 − m12 m21 = m11 m22 − m12².

A 2 × 2 symmetric matrix is p.d. when the determinant is positive and the diagonal elements are positive. A 2 × 2 symmetric matrix is n.d. when the determinant is positive and the diagonal elements are negative.

The bare minimum you need to check: M is p.d. if m11 > 0 (or m22 > 0) and |M| > 0. M is n.d. if m11 < 0 (or m22 < 0) and |M| > 0.
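The 2 × 2 shortcut can be checked against the eigenvalue characterization in Fact 2. A small sketch (the example matrix is made up for illustration):

```python
import numpy as np

def is_negative_definite(M):
    """Check negative definiteness of a symmetric matrix via its eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(M) < 0))

# Shortcut from the notes: m11 < 0 and det(M) > 0 suffice for a 2x2 symmetric matrix
M = np.array([[-3.0, 1.0],
              [1.0, -2.0]])
print(M[0, 0] < 0 and np.linalg.det(M) > 0)   # True: shortcut says n.d.
print(is_negative_definite(M))                # True: eigenvalue check agrees
```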
Example: Observe X1, X2, . . . , Xn iid Gamma(α, β).

Preliminaries:

L(α, β) = ∏_{i=1}^n x_i^{α−1} e^{−x_i/β} / (β^α Γ(α)).

Maximizing L is the same as maximizing l = log L, given by

l(α, β) = (α − 1)T1 − T2/β − nα log β − n log Γ(α),
where T1 = ∑_i log x_i and T2 = ∑_i x_i. Note that T = (T1, T2) is the natural sufficient statistic of this 2pef.

∂l/∂α = T1 − n log β − nψ(α),   where ψ(α) ≡ (d/dα) log Γ(α) = Γ′(α)/Γ(α),

∂l/∂β = T2/β² − nα/β = (1/β²)(T2 − nαβ),

∂²l/∂α² = −nψ′(α),

∂²l/∂β² = −2T2/β³ + nα/β² = −(1/β³)(2T2 − nαβ),

∂²l/∂α∂β = −n/β.
Situation #1: Suppose α = α0 is known. Find the MLE for β. (Drop α from the arguments: l(β) = l(α0, β), etc.) l(β) is continuous and differentiable. l(β) has a unique stationary point:

l′(β) = ∂l/∂β = (1/β²)(T2 − nα0 β) = 0

iff T2 = nα0 β, iff β = T2/(nα0) (≡ β*).
Now we check the second derivative.
l″(β) = ∂²l/∂β² = −(1/β³)(2T2 − nα0 β) = −(1/β³){T2 + (T2 − nα0 β)}.

Note l″(β*) < 0 since T2 − nα0 β* = 0, but l″(β) > 0 for β > 2T2/(nα0). Thus, the stationary point satisfies the necessary condition for a global maximum, but not the sufficient condition (i.e., l(β) is not a strictly concave function). How can we be sure that we have found the global maximum, and not just a local maximum? In this case, there is a simple argument: the stationary point β* is unique, and l′(β) > 0 for β < β*, and l′(β) < 0 for β > β*. This ensures β* is the unique global maximizer.

Conclusion: β̂ = T2/(nα0). (This is a function of T2, which is a sufficient statistic for β when α is known.)

Situation #2: Suppose β = β0 is known. Find the MLE for α. (Drop β from the arguments: l(α) = l(α, β0), etc.)

Note: l′(α) and l″(α) involve ψ(α). The function ψ is infinitely differentiable on the interval (0, ∞), and satisfies ψ′(α) > 0 and ψ″(α) < 0 for all α > 0. (The function is strictly increasing and strictly concave.)
Also,

lim_{α→0+} ψ(α) = −∞,   lim_{α→∞} ψ(α) = ∞.

Thus ψ^{−1} : R → (0, ∞) exists. l(α) is continuous and differentiable. l(α) has a unique stationary point:
l′(α) = T1 − n log β0 − nψ(α) = 0

iff ψ(α) = T1/n − log β0

iff α = ψ^{−1}(T1/n − log β0).

This is the unique global maximizer since

l″(α) = −nψ′(α) < 0 for all α > 0.

Thus α̂ = ψ^{−1}(T1/n − log β0) is the MLE. (This is a function of T1, which is a sufficient statistic for α when β is known.)

Situation #3: Find the MLE for θ = (α, β). l(α, β) is continuous and differentiable. A stationary point must satisfy the system of two equations:
∂l/∂α = T1 − n log β − nψ(α) = 0,

∂l/∂β = (1/β²)(T2 − nαβ) = 0.

Solving the second equation for β gives

β = T2/(nα).

Plugging this into the first equation, and rearranging a bit, leads to

T1/n − log(T2/n) = ψ(α) − log α ≡ H(α).
The function H(α) is continuous and strictly increasing from (0, ∞) to (−∞, 0), so that it has an inverse mapping (−∞, 0) to (0, ∞). Thus, the solution to the above equation can be written:

α = H^{−1}{T1/n − log(T2/n)}.

Thus the unique stationary point is:

α̂ = H^{−1}{T1/n − log(T2/n)},   β̂ = T2/(nα̂).
Is this the MLE? Let us examine the Hessian.

H(α, β) = [ ∂²l/∂α²     ∂²l/∂α∂β ]
          [ ∂²l/∂α∂β    ∂²l/∂β²  ]

        = [ −nψ′(α)    −n/β                ]
          [ −n/β       −(1/β³)(2T2 − nαβ)  ]

H(α̂, β̂) = [ −nψ′(α̂)     −n²α̂/T2    ]
           [ −n²α̂/T2     −n³α̂³/T2²  ]

The diagonal elements are both negative, and the determinant is equal to

(n⁴α̂²/T2²)(α̂ψ′(α̂) − 1).

This is positive since αψ′(α) − 1 > 0 for all α > 0. This guarantees that H(α̂, β̂) is negative definite, so that (α̂, β̂) is at least a local maximum.
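The stationary-point equations above can be solved numerically: since H(α) = ψ(α) − log α is strictly increasing, a root-finder inverts it directly. A sketch using scipy, with assumed true values α = 3, β = 2 and n = 500 chosen only for the demonstration:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(1)
n = 500
alpha_true, beta_true = 3.0, 2.0   # assumed values for the demo
x = rng.gamma(shape=alpha_true, scale=beta_true, size=n)

T1, T2 = np.log(x).sum(), x.sum()
c = T1 / n - np.log(T2 / n)        # right-hand side; always < 0 by Jensen's inequality

# H(alpha) = psi(alpha) - log(alpha) increases from -inf to 0 on (0, inf),
# so H(alpha) = c has a unique root: alpha_hat = H^{-1}(c).
H = lambda a: digamma(a) - np.log(a)
alpha_hat = brentq(lambda a: H(a) - c, 1e-6, 1e6)
beta_hat = T2 / (n * alpha_hat)

print(alpha_hat, beta_hat)         # close to the assumed (3.0, 2.0)
```

Note the solver recovers exactly the stationary point of the notes: both likelihood equations hold at (α̂, β̂) up to root-finding tolerance.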
5 Invariance principle for MLEs
If η = τ(θ) and θ̂ is the MLE of θ, then η̂ = τ(θ̂) is the MLE of η.

Comments:

1. If τ(θ) is a 1-1 function, this is a trivial theorem.

2. If τ(θ) is not 1-1, this is essentially true by definition of the induced likelihood (see later).

Example: X = (X1, X2, . . . , Xn) iid N(µ, σ²). The usual parameters θ = (µ, σ²) are related to the natural parameters η = (µ/σ², −1/(2σ²)) of the 2pef by a 1-1 function: η = τ(θ). The likelihood in terms of θ is

L1(θ) = (2πσ²)^{−n/2} e^{−nµ²/(2σ²)} e^{(µ/σ²)T1 − (1/(2σ²))T2},

where T1 = ∑ Xi and T2 = ∑ Xi².
Simple Example: X = (X1, X2, . . . , Xn) iid Bernoulli(p). It is known that the MLE of p is p̂ = X̄. Thus

1. The MLE of p² is p̂² = X̄².

2. The MLE of p(1 − p) is X̄(1 − X̄).

The function of p in 1. is 1-1, but the one in 2. is not.
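The Bernoulli example is one line of code. A minimal sketch, with an assumed true p = 0.3 and n = 200 (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=200)   # Bernoulli(p) data, assumed p = 0.3

p_hat = x.mean()                     # MLE of p is the sample mean
print(p_hat ** 2)                    # MLE of p^2, by invariance (tau is 1-1 here)
print(p_hat * (1 - p_hat))           # MLE of p(1-p), by invariance (tau is not 1-1)
```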
5.1 Induced Likelihood
Definition 1. If η = τ(θ), then

L*(η) ≡ sup_{θ : τ(θ)=η} L(θ).

Go back to the example X1, X2, . . . , Xn ∼ N(µ, σ²) iid. If the MLE η̂ of η is defined to be the value which maximizes L*(η), then it is easily seen that η̂ = τ(θ̂). The likelihood in terms of η is

L2(η) = (−π/η2)^{−n/2} e^{nη1²/(4η2)} e^{η1 T1 + η2 T2},
obtained by substituting into L1(θ)

µ = −η1/(2η2),   σ² = −1/(2η2),

that is, evaluating L1 at

θ = (µ, σ²) = (−η1/(2η2), −1/(2η2)) = τ^{−1}(η).

Stated abstractly, L2(η) = L1(τ^{−1}(η)), so that L2 is maximized when τ^{−1}(η) = θ̂, that is, by η = τ(θ̂). The MLE of θ is known to be

θ̂ = (µ̂, σ̂²) = ( X̄, (1/n) ∑_{i=1}^n (Xi − X̄)² ),
so the invariance principle says the MLE of η is

η̂ = τ(θ̂) = ( µ̂/σ̂², −1/(2σ̂²) ).
Continuation of example: What is the MLE of α = µ + σ²? Note that

α = g(µ, σ²) = µ + σ²

is not a 1-1 function, but

α̂ = g(µ̂, σ̂²) = µ̂ + σ̂² = X̄ + SS/n,

where SS = ∑_{i=1}^n (Xi − X̄)².

What are the MLEs of µ and σ²? With g1(x, y) = x and g2(x, y) = y, we have

µ = g1(θ),   σ² = g2(θ),

so that the MLEs are

µ̂ = g1(θ̂) = X̄,   σ̂² = g2(θ̂) = SS/n.

Thus, the invariance principle implies that the MLE of the pair (µ, σ²) is exactly (µ̂, σ̂²), the pair of individual MLEs.
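The normal example, including the invariance step to the natural parameters, can be sketched numerically (assumed values µ = 1, σ = 2, n = 1000, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=1000)   # assumed mu = 1, sigma = 2

mu_hat = x.mean()                                # MLE of mu
sigma2_hat = ((x - mu_hat) ** 2).mean()          # MLE of sigma^2: SS/n, not SS/(n-1)

# Invariance: MLE of the natural parameters eta = (mu/sigma^2, -1/(2 sigma^2))
eta_hat = (mu_hat / sigma2_hat, -1.0 / (2.0 * sigma2_hat))
print(mu_hat, sigma2_hat, eta_hat)
```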
5.2 MLE for Exponential Families
The invariance principle for MLEs allows us to work with the natural parameter η (which is a 1-1 function of θ).

1pef:

f(x | θ) = c(θ)h(x) exp{w(θ)t(x)}.

Natural parameter: η = w(θ). With a little abuse of notation (writing f(x | η) for f*(x | η) = f(x | w^{−1}(η)), and c(η) for c*(η) = c(w^{−1}(η))), we can write

f(x | η) = c(η)h(x) exp{ηt(x)}.
For clarity of notation, we will use x = (x1, . . . , xN) as the observed data and X = (X1, X2, . . . , XN) as the random data. If X1, . . . , XN are iid from f(x | η), then

l(η) = N log c(η) + ∑_{i=1}^N log h(xi) + η ∑_{i=1}^N t(xi).
Since by 3.32(a), E t(Xi) = −(∂/∂η) log c(η), we have

l′(η) = N (∂/∂η) log c(η) + ∑_{i=1}^N t(xi)    (1)

      = −E[ ∑_{i=1}^N t(Xi) ] + ∑_{i=1}^N t(xi)

      = −E T(X) + T(x),

where T(X) = ∑_{i=1}^N t(Xi). Hence the condition for a stationary point is equivalent to:

E_η T(X) = T(x).
Note that, using (1),

l″(η) = N (∂²/∂η²) log c(η) = N{−Var_η t(Xi)} < 0

for all η. Thus any interior stationary point (not on the boundary of Θ* = {w(θ) : θ ∈ Θ}) is automatically a global maximum so long as Θ* is convex. In one dimension (Θ ⊂ R), this means Θ* must be an interval of some sort (it can be infinite). Ignoring this fine point: for a 1pef, the log-likelihood will have a unique stationary point, which will be the MLE.

k-pef:

f(x | θ) = c(θ)h(x) exp{ ∑_{j=1}^k wj(θ) tj(x) }.
Natural parameter: η = (η1, . . . , ηk) = (w1(θ), . . . , wk(θ)), that is, ηj = wj(θ).

f(x | η) = c(η)h(x) exp{ ∑_{j=1}^k ηj tj(x) }.
If X1, X2, . . . , XN are iid from f(x | η), then

l(η) = N log c(η) + ∑_{i=1}^N log h(xi) + ∑_{j=1}^k ηj { ∑_{i=1}^N tj(xi) },

∂l/∂ηj = N (∂/∂ηj) log c(η) + ∑_{i=1}^N tj(xi)    [using (∂/∂ηj) log c(η) = −E tj(Xi)]

       = −E ∑_{i=1}^N tj(Xi) + ∑_{i=1}^N tj(xi),
∂²l/∂ηj∂ηl = N ( (∂²/∂ηj∂ηl) log c(η) ) = N(−Cov(tj(Xi), tl(Xi))).

Thus, the equations for a stationary point,

∂l/∂ηj = 0,   j = 1, . . . , k,

are equivalent to

E_η Tj(X) = Tj(x),   j = 1, . . . , k,    (2)

where Tj(X) = ∑_{i=1}^N tj(Xi) and Tj(x) = ∑_{i=1}^N tj(xi), or in vector notation,

E_η T(X) = T(x),

where T(X) = (T1(X), . . . , Tk(X)) and T(x) = (T1(x), . . . , Tk(x)).
The Hessian matrix H(η) = ( ∂²l/∂ηi∂ηj )_{i,j=1}^k is given by

H(η) = −NΣ(η),

where Σ(η) is the k × k covariance matrix of (t1(X1), t2(X1), . . . , tk(X1)). A covariance matrix will be positive definite (except in degenerate cases), so that H(η) will be negative definite for all η.

Conclusion: An interior stationary point (i.e., a solution of (2)) must be the unique global maximum, and hence the MLE. This result also holds in the original parameterization, with (2) restated as
EθTj(X) = Tj(x), j = 1, . . . , k.
Connection with MOM: For a 1pef with t(x) = x, MOM and MLE agree. For a kpef with tj(x) = x^j, MOM and MLE agree. Why? Because then (2) is equivalent to the equations for the MOM estimator.
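The MOM/MLE agreement is easy to exhibit for a concrete 1pef with t(x) = x, e.g. the Poisson family (assumed rate λ = 3.5 and n = 400, illustrative values only):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(4)
x = rng.poisson(lam=3.5, size=400)   # Poisson is a 1pef with t(x) = x

# MOM estimate: match the first moment, E X = lambda
lam_mom = x.mean()

# MLE: maximize the Poisson log-likelihood numerically
negloglik = lambda lam: -(x.sum() * np.log(lam) - x.size * lam - gammaln(x + 1).sum())
lam_mle = minimize_scalar(negloglik, bounds=(1e-6, 50), method="bounded").x

print(lam_mom, lam_mle)   # the two estimates coincide, up to solver tolerance
```

Here (2) reads E_λ ∑ Xi = ∑ xi, i.e. Nλ = ∑ xi, which is exactly the first-moment equation.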
6 Revisiting Gamma Example:
The system of equations for the MLE of (α, β) may be easily derived directly from (2):

E T1(X) = T1(x),
E T2(X) = T2(x),

which becomes

E ∑_{i=1}^n log Xi = n E log X1 = n(log β + ψ(α)) = T1(x),

E ∑_{i=1}^n Xi = n E X1 = nαβ = T2(x).
The equations are the same as the equations for a stationary point derived earlier. For X ∼ Gamma(α, β), we have used:

E log X = ∫₀^∞ (log x) x^{α−1} e^{−x/β} / (β^α Γ(α)) dx

= ∫₀^∞ (log(x/β) + log β) (x/β)^{α−1} e^{−x/β} / Γ(α) · (dx/β)

= ∫₀^∞ (log z + log β) z^{α−1} e^{−z} / Γ(α) dz

= log β + (1/Γ(α)) ∫₀^∞ (z^{α−1} log z) e^{−z} dz    [note z^{α−1} log z = (∂/∂α) z^{α−1}]

= log β + (1/Γ(α)) (∂/∂α) ∫₀^∞ z^{α−1} e^{−z} dz

= log β + Γ′(α)/Γ(α) = log β + ψ(α).
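The identity E log X = log β + ψ(α) can be sanity-checked by Monte Carlo. A quick sketch with assumed values α = 2.5, β = 1.5 (chosen only for the check):

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(5)
alpha, beta = 2.5, 1.5   # assumed values for the check
z = rng.gamma(shape=alpha, scale=beta, size=200_000)

mc = np.log(z).mean()                    # Monte Carlo estimate of E log X
exact = np.log(beta) + digamma(alpha)    # the identity derived above
print(mc, exact)                         # the two agree to about two decimals
```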
Verifying the Stationary Point is a Global Maximum: The Gamma family is a 2pef (or a 1pef if α or β is held fixed). Switching to the natural parameters η1 = α − 1, η2 = −1/β (or just making the substitution λ = 1/β) simplifies the second derivatives with respect to η2 (or λ). The Hessian matrix is then negative definite for all η = (η1, η2), which is a sufficient condition for the stationary point to be the global maximum.
6.1 MLEs for More General Exponential Families
Proposition 1. If X ∼ P_θ, θ ∈ Θ, where P_θ has a joint pdf (pmf) from an n-variate k-parameter exponential family

f(x | θ) = c(θ)h(x) exp{ ∑_{j=1}^k wj(θ) Tj(x) }

for x ∈ R^n, θ ∈ Θ ⊂ R^k, then the MLE of θ based on the observed data x is the solution (in θ) of the system of equations

E_θ Tj(X) = Tj(x),   j = 1, . . . , k,

provided the solution (call it θ̂) satisfies

w(θ̂) ∈ interior of {w(θ) : θ ∈ Θ}.
Proof. Essentially the same as for the ordinary kpef.
Example: Simple Linear Regression with known variance. Y1, Y2, . . . , Yn are independent with

Yi ∼ N(β0 + β1 xi, σ0²),   θ = (β0, β1).

The joint distribution of Y = (Y1, Y2, . . . , Yn) forms an exponential family. The natural sufficient statistic is

t(Y) = ( ∑_i Yi, ∑_i xi Yi ).

E_θ t(Y) = t(y) has the form

E(∑ Yi) = ∑ yi,

E(∑ xi Yi) = ∑ xi yi.

Thus the MLE θ̂ = (β̂0, β̂1) is the solution of

∑_i (β0 + β1 xi) = ∑_i yi,

∑_i xi (β0 + β1 xi) = ∑_i xi yi.
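The two equations above are linear in (β0, β1), so they can be solved directly as a 2 × 2 system. A sketch on simulated data (assumed true β0 = 1, β1 = 2, σ0 = 1, for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)   # assumed beta0=1, beta1=2

# Normal equations from E_theta t(Y) = t(y):
#   sum(beta0 + beta1 x_i)      = sum y_i
#   sum x_i (beta0 + beta1 x_i) = sum x_i y_i
A = np.array([[x.size,  x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([y.sum(), (x * y).sum()])
beta0_hat, beta1_hat = np.linalg.solve(A, b)

print(beta0_hat, beta1_hat)   # close to the assumed (1.0, 2.0)
```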
6.2 Sufficient statistics and MLEs
If T = T(X) is a sufficient statistic for θ, then there is an MLE which is a function of T. (If the MLE is unique, then we can say the MLE is a function of T.)

Proof. By the factorization criterion,

f(x | θ) = g(T(x), θ) h(x).

Assume for convenience the MLE is unique. Then the MLE is

θ̂(x) = argmax_θ f(x | θ) = argmax_θ g(T(x), θ),

which is clearly a function of T(x).
MLE coincides with "Least Squares" for independent normal rv's with constant variance σ² (known or unknown). Y1, Y2, . . . , Yn are independent with

Yi ∼ N(β0 + β1 xi, σ0²),   θ = (β0, β1),

or more generally,

Yi ∼ N(g(xi, β), σ0²),

where β is possibly a vector. Then

L(β, σ²) = f(y | β, σ²) = (1/(√(2π) σ))^n exp{ −(1/(2σ²)) ∑_{i=1}^n (yi − g(xi, β))² }.

For any σ² (fixed at an arbitrary value), maximizing L(β, σ²) with respect to β is equivalent to minimizing ∑_{i=1}^n (yi − g(xi, β))² with respect to β. Hence MLE and least squares give the same estimates of the β parameters.
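This equivalence can be demonstrated by minimizing both criteria numerically and comparing. A sketch with assumed values β0 = 0.5, β1 = 1.5, σ0 = 0.8 (chosen only for the demo):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = np.linspace(0, 5, 40)
y = 0.5 + 1.5 * x + rng.normal(scale=0.8, size=x.size)   # assumed true parameters

sigma0 = 0.8   # known variance, as in the setup above

# Sum of squared errors, and the normal negative log-likelihood at fixed sigma0
sse = lambda b: ((y - (b[0] + b[1] * x)) ** 2).sum()
negloglik = lambda b: 0.5 * x.size * np.log(2 * np.pi * sigma0 ** 2) + sse(b) / (2 * sigma0 ** 2)

b_ls = minimize(sse, x0=[0.0, 0.0]).x        # least squares
b_ml = minimize(negloglik, x0=[0.0, 0.0]).x  # maximum likelihood
print(b_ls, b_ml)                            # identical up to solver tolerance
```

The negative log-likelihood is an increasing affine function of the SSE, so both optimizers land on the same β.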