
Page 1

Maximum Likelihood Estimation

Prof. C. F. Jeff Wu

ISyE 8813

Page 2

Section 1

Motivation

Page 3

What is parameter estimation?

A modeler proposes a model M(θ) for explaining some observed phenomenon

θ are the parameters which dictate the properties of such a model

Parameter estimation is the process by which the modeler determines the best parameter choice θ, given a set of training observations

Page 4

Why parameter estimation?

θ often encodes important and interpretable properties of the model M

e.g., calibration parameters in computer experiment models

An estimate of θ (call this θ̂) allows for:

Model validation: Checking whether the proposed model fits the observed data

Prediction: Forecasting future observations at untested settings

Page 5

Simple example

Suppose we observe a random sample $X_1, X_2, \dots, X_n$:

$X_i = 1$ if student i owns a sports car; $X_i = 0$ if student i does not own a sports car

We then postulate the following model: $X_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\theta)$, i.e., the $X_i$'s are i.i.d. Bernoulli random variables with the same (unknown) parameter θ

We would like to estimate θ based on the observations $X_1, X_2, \dots, X_n$
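To make this concrete, here is a minimal Python sketch with simulated data (the fact that the sample proportion is the MLE for this Bernoulli model is derived in the next section):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3                                   # the "unknown" parameter
x = rng.binomial(n=1, p=theta_true, size=200)      # 1 = student owns a sports car

# For i.i.d. Bernoulli data, the MLE of theta is the sample proportion.
theta_hat = x.mean()
print(f"estimate: {theta_hat:.3f}  (truth: {theta_true})")
```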

Page 6

Popular estimation techniques

Maximum-likelihood estimation (MLE)

Minimax estimation

Method of moments (MOM)

(Non-linear) Least-squares estimation


We will focus on two of these techniques in this lecture: MLE and (non-linear) least-squares estimation.

Page 8

Section 2

Maximum likelihood estimation (MLE)

Page 9

Likelihood function

In words, MLE chooses the parameter setting which maximizes the likelihood of the observed sample

But how do we define this likelihood as a function of θ?

Definition

Let $f(\bar{x} \mid \theta)$ be the probability of observing the sample $\bar{x} = (x_1, \dots, x_n)$ from M(θ). The likelihood function is defined as:

$$L(\theta \mid \bar{x}) = f(\bar{x} \mid \theta)$$

MLE tries to find the parameter setting which maximizes the probability of observing the data

Page 10

Log-likelihood function

If $X_1, X_2, \dots, X_n \overset{\text{i.i.d.}}{\sim} f(x \mid \theta)$, the likelihood function simplifies to

$$L(\theta \mid \bar{x}) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

In this case, the likelihood can become very small, because we are multiplying many terms

To fix this computational problem, the log-likelihood $l(\theta \mid \bar{x}) = \log(L(\theta \mid \bar{x}))$ is often used instead

For i.i.d. observations, this becomes:

$$l(\theta \mid \bar{x}) = \log(L(\theta \mid \bar{x})) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

Page 11

Maximum likelihood estimator

We can now formally define the estimator for MLE:

Definition

Given observed data x̄, the maximum likelihood estimator (MLE) of θ is defined as:

$$\hat{\theta} \in \operatorname*{argmax}_{\theta}\, L(\theta \mid \bar{x})$$

Equivalently, because the log function is monotonic, we can instead solve:

$$\hat{\theta} \in \operatorname*{argmax}_{\theta}\, l(\theta \mid \bar{x})$$

The latter is more numerically stable for optimization.
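In practice the argmax is usually found by numerically minimizing the negative log-likelihood. A sketch for the Bernoulli example above, with scipy's minimize_scalar as one possible optimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.3, size=200)

def neg_log_lik(theta):
    # Bernoulli log-likelihood: sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())   # the numerical optimum matches the sample proportion
```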

Page 12

MLE: a simple example

Suppose some count data is observed, $X_i \in \mathbb{Z}_{+}$, and the following Poisson model is assumed: $X_i \overset{\text{i.i.d.}}{\sim} \mathrm{Pois}(\lambda)$

The likelihood function can be shown to be

$$L(\lambda \mid \bar{x}) = \prod_{i=1}^{n} \left( \frac{\lambda^{x_i}}{x_i!}\, e^{-\lambda} \right)$$

with log-likelihood function:

$$l(\lambda \mid \bar{x}) = \log \left[ \prod_{i=1}^{n} \frac{\lambda^{x_i}}{x_i!}\, e^{-\lambda} \right] = \log(\lambda) \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \log(x_i!) - n\lambda$$
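A quick numerical sanity check of this algebra, using scipy's Poisson log-pmf as the reference (data simulated for illustration):

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln          # log(x!) = gammaln(x + 1)

rng = np.random.default_rng(2)
x = rng.poisson(lam=4.0, size=50)
lam = 3.2                                  # any trial value of lambda

# Closed-form log-likelihood from the slide ...
ll_formula = np.log(lam) * x.sum() - gammaln(x + 1).sum() - x.size * lam
# ... agrees with summing the Poisson log-pmf term by term.
ll_check = poisson.logpmf(x, lam).sum()
print(np.isclose(ll_formula, ll_check))    # True
```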

Page 13

MLE: a simple example

Since $l(\lambda \mid \bar{x})$ is differentiable, we can solve for its maximizer by:

Differentiating the log-likelihood:

$$\frac{d\, l(\lambda \mid \bar{x})}{d\lambda} = \frac{1}{\lambda} \sum_{i=1}^{n} x_i - n$$

Setting it to 0 and solving for λ:

$$\hat{\lambda} = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{X}$$

Checking the second-order condition: the second derivative

$$\frac{d^2\, l(\lambda \mid \bar{x})}{d\lambda^2} = -\frac{1}{\lambda^2} \sum_{i=1}^{n} x_i$$

is negative, so the log-likelihood is concave and λ̂ is a global maximizer.
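The same calculus can be verified symbolically; a sketch with sympy, treating S = Σᵢ xᵢ as a single symbol and dropping the λ-free term Σᵢ log(xᵢ!):

```python
import sympy as sp

lam, n, S = sp.symbols("lambda n S", positive=True)

# Poisson log-likelihood, up to the lambda-free term -sum(log(x_i!)).
ll = sp.log(lam) * S - n * lam

print(sp.solve(sp.Eq(sp.diff(ll, lam), 0), lam))   # [S/n] -> lambda_hat = x_bar
print(sp.diff(ll, lam, 2))                          # -S/lambda**2, negative: a maximum
```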

Page 14

What if a closed-form solution does not exist?

In most practical models, there are two computational difficulties:

No closed-form solution exists for the MLE,

Multiple locally optimal solutions or stationary points.

Standard non-linear optimization methods (Nocedal and Wright, 2006) are often used as a black-box technique for obtaining locally optimal solutions.

Page 15

What if a closed-form solution does not exist?

We can do better if the model exhibits specific structure:

If the optimization program is convex, one can use some form of accelerated gradient descent (Nesterov, 1983)

If non-convex but twice-differentiable, the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method works quite well for local optimization (see the sketch after this list)

If missing data is present, the EM algorithm (Dempster et al., 1977; Wu, 1983) can be employed
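As an illustration of the L-BFGS route, a minimal sketch fitting an i.i.d. N(μ, σ²) model by minimizing its negative log-likelihood (the Gaussian model and the log-σ parameterization are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=500)

def neg_log_lik(params):
    mu, log_sigma = params                 # optimize log(sigma) so sigma stays > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]), method="L-BFGS-B")
print(res.x[0], np.exp(res.x[1]))          # close to the sample mean and std
```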

Page 16

Section 3

Non-linear least-squares estimation

(Adapted from en.wikipedia.org/wiki/Non-linear_least_squares)

Page 17

Problem statement

Consider a set of m data points, (x1, y1), (x2, y2), …, (xm, ym), and a curve (model function) y = f(x; β):

The curve model depends on β = (β1, β2, …, βn), with m ≥ n

We want to choose the parameters β so that the curve best fits the given data in the least-squares sense, i.e., such that

$$S(\beta) = \sum_{i=1}^{m} r_i(\beta)^2$$

is minimized, where the residuals $r_i(\beta)$ are defined as $r_i(\beta) = y_i - f(x_i;\beta)$
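As a concrete setup, a sketch using an exponential decay curve y = β₁ exp(β₂ x) as the model function (an illustrative choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
beta_true = np.array([2.0, -1.5])

def f(x, beta):
    return beta[0] * np.exp(beta[1] * x)    # model function y = f(x; beta)

x = np.linspace(0.0, 2.0, 25)               # m = 25 data points, n = 2 parameters
y = f(x, beta_true) + 0.05 * rng.standard_normal(x.size)

def residuals(beta):
    return y - f(x, beta)                    # r_i(beta) = y_i - f(x_i; beta)

def S(beta):
    return np.sum(residuals(beta) ** 2)      # objective to be minimized
```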

Page 18

Problem statement

The minimum value of S occurs when the gradient equals zero:

Since the model contains n parameters, there are n gradient equations to solve:

$$\frac{\partial S}{\partial \beta_j} = 2 \sum_{i} r_i \frac{\partial r_i}{\partial \beta_j} = 0, \qquad j = 1, \dots, n.$$

In a non-linear model, this system of equations typically does not have a closed-form solution.

One way to solve this system is:

Set initial values for the parameters

Update the parameters using successive iterations:

$$\beta_j^{(k+1)} = \beta_j^{(k)} + \Delta \beta_j$$

Here, k is the iteration count, and Δβ is known as the shift vector

Page 19

Deriving the normal equations

From a first-order Taylor series expansion about $\beta^{(k)}$, we get:

$$f(x_i;\beta) \approx f(x_i;\beta^{(k)}) + \sum_{j} \frac{\partial f(x_i;\beta^{(k)})}{\partial \beta_j} \left(\beta_j - \beta_j^{(k)}\right) = f(x_i;\beta^{(k)}) + \sum_{j} J_{ij}\, \Delta\beta_j,$$

where J is the Jacobian matrix.
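The Jacobian $J_{ij} = \partial f(x_i;\beta)/\partial \beta_j$ can be coded analytically or approximated numerically; a sketch of a central finite-difference approximation (the step size h is a tuning choice):

```python
import numpy as np

def jacobian_fd(f, x, beta, h=1e-6):
    """Finite-difference approximation of J_ij = d f(x_i; beta) / d beta_j."""
    J = np.empty((x.size, beta.size))
    for j in range(beta.size):
        step = np.zeros(beta.size)
        step[j] = h
        # Central difference: fills one column of J per parameter.
        J[:, j] = (f(x, beta + step) - f(x, beta - step)) / (2 * h)
    return J
```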

Page 20

Deriving the normal equations

In terms of the linearized model:

$$J_{ij} = -\frac{\partial r_i}{\partial \beta_j},$$

with residuals given by:

$$r_i = \left(y_i - f(x_i;\beta^{(k)})\right) + \left(f(x_i;\beta^{(k)}) - f(x_i;\beta)\right) = \Delta y_i - \sum_{s=1}^{n} J_{is}\, \Delta\beta_s,$$

where $\Delta y_i = y_i - f(x_i;\beta^{(k)})$. Substituting these into the gradient equations, we get:

$$-2 \sum_{i=1}^{m} J_{ij} \left( \Delta y_i - \sum_{s=1}^{n} J_{is}\, \Delta\beta_s \right) = 0$$

Page 21

Deriving the normal equations

This system of equations is better known as the set of normal equations:

$$\sum_{i=1}^{m} \sum_{s=1}^{n} J_{ij} J_{is}\, \Delta\beta_s = \sum_{i=1}^{m} J_{ij}\, \Delta y_i, \qquad j = 1, \dots, n.$$

In matrix form, this becomes:

$$(J^{T} J)\, \Delta\beta = J^{T} \Delta y,$$

where $\Delta y = (\Delta y_1, \dots, \Delta y_m)^{T}$.
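In code, the shift vector is obtained by solving this linear system directly rather than forming the inverse; a minimal sketch, assuming JᵀJ is well-conditioned at the current iterate:

```python
import numpy as np

def shift_vector(J, dy):
    """Solve the normal equations (J^T J) dbeta = J^T dy for the shift vector."""
    # Solving the system is cheaper and more stable than explicitly inverting J^T J.
    return np.linalg.solve(J.T @ J, J.T @ dy)

# Usage per iteration: beta = beta + shift_vector(J, dy)
```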

Page 22

Solving the normal equations

Several methods have been proposed to solve the set of normal equations:

Gauss-Newton method

Levenberg-Marquardt-Fletcher algorithm

Gradient methods:

Davidon-Fletcher-Powell

Steepest descent


We will focus on two of these in this lecture: the Gauss-Newton method and the Levenberg-Marquardt-Fletcher algorithm.

Page 24

Gauss-Newton method

Starting with an initial guess $\beta^{(0)}$ for the minimum, the Gauss-Newton method iteratively updates $\beta^{(k)}$ by solving for the shift vector Δβ in the normal equations:

$$\beta^{(k+1)} = \beta^{(k)} + (J^{T} J)^{-1} J^{T} r(\beta^{(k)}).$$

This approach can encounter several problems:

$J^{T}J$ is often ill-conditioned, so the inverse $(J^{T}J)^{-1}$ is numerically unstable,

When far from the fixed-point solution, the iterative map may not be contractive; the sequence $(\beta^{(k)})_{k=1}^{\infty}$ may diverge without reaching a limiting solution.
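Putting the pieces together, a minimal Gauss-Newton sketch on the illustrative exponential model from the problem-statement section (analytic Jacobian, fixed iteration count, and none of the safeguards the issues above would call for):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 2.0, 25)
y = 2.0 * np.exp(-1.5 * x) + 0.05 * rng.standard_normal(x.size)   # simulated data

def f(beta):
    return beta[0] * np.exp(beta[1] * x)

def jac(beta):
    # J_ij = d f(x_i; beta) / d beta_j, computed analytically.
    return np.column_stack([np.exp(beta[1] * x),
                            beta[0] * x * np.exp(beta[1] * x)])

beta = np.array([1.0, -1.0])                 # initial guess beta^(0)
for _ in range(20):
    J, r = jac(beta), y - f(beta)            # Jacobian and residuals at beta^(k)
    beta = beta + np.linalg.solve(J.T @ J, J.T @ r)   # Gauss-Newton step
print(beta)                                  # near (2.0, -1.5) if the iteration converges
```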

Page 25

Levenberg-Marquardt-Fletcher algorithm

Historical perspective of the Levenberg-Marquardt-Fletcher (LMF) algorithm:

The first form of LMF was published in 1944 by Kenneth Levenberg, while he was working at the Frankford Army Arsenal,

It was then rediscovered and improved upon by Donald Marquardt in 1963, who worked as a statistician at DuPont,

A further modification was suggested by Roger Fletcher in 1971, which greatly improved the robustness of the algorithm.

Page 26

Levenberg’s contribution

To improve the numerical stability of the algorithm, Levenberg (1944) proposed the modified iterative updates:

$$\beta^{(k+1)} = \beta^{(k)} + (J^{T} J + \lambda I)^{-1} J^{T} r(\beta^{(k)}),$$

where λ is a damping factor and I is the identity matrix.

λ → 0⁺ recovers the previous Gauss-Newton updates, which converge quickly but may diverge,

λ → ∞ approaches the steepest-descent updates, which converge slowly but are more stable.

Page 27

Marquardt’s contribution

How does one choose λ to control this trade-off between quick convergence and numerical stability? Marquardt (1963) proposes the improved iterative updates:

$$\beta^{(k+1)} = \beta^{(k)} + (J^{T} J + \lambda^{(k)} I)^{-1} J^{T} r(\beta^{(k)}),$$

where the damping factor $\lambda^{(k)}$ can vary between iterations.

When convergence is stable, the damping factor is iteratively decreased via $\lambda^{(k+1)} \leftarrow \lambda^{(k)}/\nu$ to exploit the accelerated Gauss-Newton rate,

When divergence is observed, the damping factor is increased via $\lambda^{(k+1)} \leftarrow \lambda^{(k)} \nu$ to restore stability.
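A sketch of Marquardt's adaptive scheme on the same illustrative model (ν = 10 and the simple accept/reject rule are common conventions, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 2.0, 25)
y = 2.0 * np.exp(-1.5 * x) + 0.05 * rng.standard_normal(x.size)

f = lambda b: b[0] * np.exp(b[1] * x)
jac = lambda b: np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])
S = lambda b: np.sum((y - f(b)) ** 2)

beta, lam, nu = np.array([1.0, -1.0]), 1.0, 10.0
for _ in range(50):
    J, r = jac(beta), y - f(beta)
    step = np.linalg.solve(J.T @ J + lam * np.eye(2), J.T @ r)
    if S(beta + step) < S(beta):
        beta, lam = beta + step, lam / nu    # stable: push toward Gauss-Newton speed
    else:
        lam = lam * nu                       # diverging: damp toward steepest descent
print(beta)
```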

Page 28

Fletcher’s contribution

Fletcher (1971) proposes a further improvement of this algorithm through the following update scheme:

$$\beta^{(k+1)} = \beta^{(k)} + \left(J^{T} J + \lambda\, \mathrm{diag}\{J^{T} J\}\right)^{-1} J^{T} r(\beta^{(k)}),$$

where diag{JᵀJ} is the diagonal matrix formed from the diagonal entries of JᵀJ.

Intuition: By scaling each gradient component by the curvature, greater movement is encouraged in directions where the gradient is smaller,

Can be viewed as a pre-conditioning step for solving ill-conditioned problems,

A similar approach is used in Tikhonov regularization for linear least-squares.
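In practice these loops are rarely hand-rolled: scipy.optimize.least_squares, for example, exposes a MINPACK-based Levenberg-Marquardt implementation through method="lm". A sketch on the same illustrative model:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(4)
x = np.linspace(0.0, 2.0, 25)
y = 2.0 * np.exp(-1.5 * x) + 0.05 * rng.standard_normal(x.size)

def residuals(beta):
    return y - beta[0] * np.exp(beta[1] * x)   # r_i(beta) = y_i - f(x_i; beta)

res = least_squares(residuals, x0=np.array([1.0, -1.0]), method="lm")
print(res.x)                                    # near (2.0, -1.5)
```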

Page 29

Summary

Parameter estimation is an unavoidable step in model validation and prediction,

MLE is a popular approach for parameter estimation,

For the non-linear least-squares problem, the Levenberg-Marquardt-Fletcher algorithm provides a stable and efficient method for estimating coefficients.