
Page 1

Maximum Likelihood Estimation

Prof. C. F. Jeff Wu

ISyE 8813

Page 2

Section 1

Motivation

Page 3

What is parameter estimation?

A modeler proposes a model M(θ) for explaining some observed phenomenon

θ are the parameters which dictate the properties of such a model

Parameter estimation is the process by which the modeler determines the best parameter choice θ, given a set of training observations

Page 4

Why parameter estimation?

θ often encodes important and interpretable properties of the model M

e.g., calibration parameters in computer experiment models

An estimate of θ (call this θ̂) allows for:

Model validation: Checking whether the proposed model fits the observed data

Prediction: Forecasting future observations at untested settings

Page 5

Simple example

Suppose we observe a random sample $X_1, X_2, \dots, X_n$:

$X_i = 1$ if student i owns a sports car; $X_i = 0$ if student i does not own a sports car

We then postulate the following model: $X_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\theta)$, i.e., the $X_i$'s are i.i.d. Bernoulli random variables with the same (unknown) parameter θ

We would like to estimate θ based on the observations $X_1, X_2, \dots, X_n$
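To make this concrete, here is a minimal Python sketch with simulated data (the fact that the sample proportion is the MLE for this Bernoulli model is derived in the next section):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3                                   # the "unknown" parameter
x = rng.binomial(n=1, p=theta_true, size=200)      # 1 = student owns a sports car

# For i.i.d. Bernoulli data, the MLE of theta is the sample proportion.
theta_hat = x.mean()
print(f"estimate: {theta_hat:.3f}  (truth: {theta_true})")
```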

Page 6

Popular estimation techniques

Maximum-likelihood estimation (MLE)

Minimax estimation

Method of moments (MOM)

(Non-linear) Least-squares estimation


We will focus on two of these techniques in this lecture: MLE and (non-linear) least-squares estimation.

Page 8

Section 2

Maximum likelihood estimation (MLE)

Page 9

Likelihood function

In words, MLE chooses the parameter setting which maximizes the likelihood of the observed sample

But how do we define this likelihood as a function of θ?

Definition

Let $f(\bar{x} \mid \theta)$ be the probability of observing the sample $\bar{x} = (x_1, \dots, x_n)$ from M(θ). The likelihood function is defined as:

$$L(\theta \mid \bar{x}) = f(\bar{x} \mid \theta)$$

MLE tries to find the parameter setting which maximizes the probability of observing the data

Page 10

Log-likelihood function

If $X_1, X_2, \dots, X_n \overset{\text{i.i.d.}}{\sim} f(x \mid \theta)$, the likelihood function simplifies to

$$L(\theta \mid \bar{x}) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

In this case, the likelihood can become very small, because we are multiplying many terms

To fix this computational problem, the log-likelihood $l(\theta \mid \bar{x}) = \log(L(\theta \mid \bar{x}))$ is often used instead

For i.i.d. observations, this becomes:

$$l(\theta \mid \bar{x}) = \log(L(\theta \mid \bar{x})) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

Page 11

Maximum likelihood estimator

We can now formally define the estimator for MLE:

Definition

Given observed data x̄, the maximum likelihood estimator (MLE) of θ is defined as:

$$\hat{\theta} \in \operatorname*{argmax}_{\theta}\, L(\theta \mid \bar{x})$$

Equivalently, because the log function is monotonic, we can instead solve:

$$\hat{\theta} \in \operatorname*{argmax}_{\theta}\, l(\theta \mid \bar{x})$$

The latter is more numerically stable for optimization.
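In practice the argmax is usually found by numerically minimizing the negative log-likelihood. A sketch for the Bernoulli example above, with scipy's minimize_scalar as one possible optimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.3, size=200)

def neg_log_lik(theta):
    # Bernoulli log-likelihood: sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())   # the numerical optimum matches the sample proportion
```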

Page 12

MLE: a simple example

Suppose some count data is observed, $X_i \in \mathbb{Z}_{+}$, and the following Poisson model is assumed: $X_i \overset{\text{i.i.d.}}{\sim} \mathrm{Pois}(\lambda)$

The likelihood function can be shown to be

$$L(\lambda \mid \bar{x}) = \prod_{i=1}^{n} \left( \frac{\lambda^{x_i}}{x_i!}\, e^{-\lambda} \right)$$

with log-likelihood function:

$$l(\lambda \mid \bar{x}) = \log \left[ \prod_{i=1}^{n} \frac{\lambda^{x_i}}{x_i!}\, e^{-\lambda} \right] = \log(\lambda) \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \log(x_i!) - n\lambda$$
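A quick numerical sanity check of this algebra, using scipy's Poisson log-pmf as the reference (data simulated for illustration):

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln          # log(x!) = gammaln(x + 1)

rng = np.random.default_rng(2)
x = rng.poisson(lam=4.0, size=50)
lam = 3.2                                  # any trial value of lambda

# Closed-form log-likelihood from the slide ...
ll_formula = np.log(lam) * x.sum() - gammaln(x + 1).sum() - x.size * lam
# ... agrees with summing the Poisson log-pmf term by term.
ll_check = poisson.logpmf(x, lam).sum()
print(np.isclose(ll_formula, ll_check))    # True
```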

Page 13

MLE: a simple example

Since $l(\lambda \mid \bar{x})$ is differentiable, we can solve for its maximizer by:

Differentiating the log-likelihood:

$$\frac{d\, l(\lambda \mid \bar{x})}{d\lambda} = \frac{1}{\lambda} \sum_{i=1}^{n} x_i - n$$

Setting it to 0 and solving for λ:

$$\hat{\lambda} = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{X}$$

Checking the second-order condition: the second derivative

$$\frac{d^2\, l(\lambda \mid \bar{x})}{d\lambda^2} = -\frac{1}{\lambda^2} \sum_{i=1}^{n} x_i$$

is negative, so the log-likelihood is concave and λ̂ is a global maximizer.
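The same calculus can be verified symbolically; a sketch with sympy, treating S = Σᵢ xᵢ as a single symbol and dropping the λ-free term Σᵢ log(xᵢ!):

```python
import sympy as sp

lam, n, S = sp.symbols("lambda n S", positive=True)

# Poisson log-likelihood, up to the lambda-free term -sum(log(x_i!)).
ll = sp.log(lam) * S - n * lam

print(sp.solve(sp.Eq(sp.diff(ll, lam), 0), lam))   # [S/n] -> lambda_hat = x_bar
print(sp.diff(ll, lam, 2))                          # -S/lambda**2, negative: a maximum
```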

Page 14

What if a closed-form solution does not exist?

In most practical models, there are two computational difficulties:

No closed-form solution exists for the MLE,

Multiple locally optimal solutions or stationary points.

Standard non-linear optimization methods (Nocedal and Wright, 2006) are often used as a black-box technique for obtaining locally optimal solutions.

Page 15

What if a closed-form solution does not exist?

We can do better if the model exhibits specific structure:

If the optimization program is convex, one can use some form of accelerated gradient descent (Nesterov, 1983)

If non-convex but twice-differentiable, the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method works quite well for local optimization (see the sketch after this list)

If missing data is present, the EM algorithm (Dempster et al., 1977; Wu, 1983) can be employed
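As an illustration of the L-BFGS route, a minimal sketch fitting an i.i.d. N(μ, σ²) model by minimizing its negative log-likelihood (the Gaussian model and the log-σ parameterization are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=500)

def neg_log_lik(params):
    mu, log_sigma = params                 # optimize log(sigma) so sigma stays > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]), method="L-BFGS-B")
print(res.x[0], np.exp(res.x[1]))          # close to the sample mean and std
```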

Page 16

Section 3

Non-linear least-squares estimation

(Adapted from en.wikipedia.org/wiki/Non-linear_least_squares)

Page 17

Problem statement

Consider a set of m data points, (x1, y1), (x2, y2), …, (xm, ym), and a curve (model function) y = f(x; β):

The curve model depends on β = (β1, β2, …, βn), with m ≥ n

We want to choose the parameters β so that the curve best fits the given data in the least-squares sense, i.e., such that

$$S(\beta) = \sum_{i=1}^{m} r_i(\beta)^2$$

is minimized, where the residuals $r_i(\beta)$ are defined as $r_i(\beta) = y_i - f(x_i;\beta)$
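As a concrete setup, a sketch using an exponential decay curve y = β₁ exp(β₂ x) as the model function (an illustrative choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
beta_true = np.array([2.0, -1.5])

def f(x, beta):
    return beta[0] * np.exp(beta[1] * x)    # model function y = f(x; beta)

x = np.linspace(0.0, 2.0, 25)               # m = 25 data points, n = 2 parameters
y = f(x, beta_true) + 0.05 * rng.standard_normal(x.size)

def residuals(beta):
    return y - f(x, beta)                    # r_i(beta) = y_i - f(x_i; beta)

def S(beta):
    return np.sum(residuals(beta) ** 2)      # objective to be minimized
```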

Page 18

Problem statement

The minimum value of S occurs when the gradient equals zero:

Since the model contains n parameters, there are n gradient equations to solve:

$$\frac{\partial S}{\partial \beta_j} = 2 \sum_{i} r_i \frac{\partial r_i}{\partial \beta_j} = 0, \qquad j = 1, \dots, n.$$

In a non-linear model, this system of equations typically does not have a closed-form solution.

One way to solve this system is:

Set initial values for the parameters

Update the parameters using successive iterations:

$$\beta_j^{(k+1)} = \beta_j^{(k)} + \Delta \beta_j$$

Here, k is the iteration count, and Δβ is known as the shift vector

Page 19

Deriving the normal equations

From a first-order Taylor series expansion about $\beta^{(k)}$, we get:

$$f(x_i;\beta) \approx f(x_i;\beta^{(k)}) + \sum_{j} \frac{\partial f(x_i;\beta^{(k)})}{\partial \beta_j} \left(\beta_j - \beta_j^{(k)}\right) = f(x_i;\beta^{(k)}) + \sum_{j} J_{ij}\, \Delta\beta_j,$$

where J is the Jacobian matrix.
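The Jacobian $J_{ij} = \partial f(x_i;\beta)/\partial \beta_j$ can be coded analytically or approximated numerically; a sketch of a central finite-difference approximation (the step size h is a tuning choice):

```python
import numpy as np

def jacobian_fd(f, x, beta, h=1e-6):
    """Finite-difference approximation of J_ij = d f(x_i; beta) / d beta_j."""
    J = np.empty((x.size, beta.size))
    for j in range(beta.size):
        step = np.zeros(beta.size)
        step[j] = h
        # Central difference: fills one column of J per parameter.
        J[:, j] = (f(x, beta + step) - f(x, beta - step)) / (2 * h)
    return J
```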

Page 20

Deriving the normal equations

In terms of the linearized model:

$$J_{ij} = -\frac{\partial r_i}{\partial \beta_j},$$

with residuals given by:

$$r_i = \left(y_i - f(x_i;\beta^{(k)})\right) + \left(f(x_i;\beta^{(k)}) - f(x_i;\beta)\right) = \Delta y_i - \sum_{s=1}^{n} J_{is}\, \Delta\beta_s,$$

where $\Delta y_i = y_i - f(x_i;\beta^{(k)})$. Substituting these into the gradient equations, we get:

$$-2 \sum_{i=1}^{m} J_{ij} \left( \Delta y_i - \sum_{s=1}^{n} J_{is}\, \Delta\beta_s \right) = 0$$

Page 21

Deriving the normal equations

This system of equations is better known as the set of normal equations:

$$\sum_{i=1}^{m} \sum_{s=1}^{n} J_{ij} J_{is}\, \Delta\beta_s = \sum_{i=1}^{m} J_{ij}\, \Delta y_i, \qquad j = 1, \dots, n.$$

In matrix form, this becomes:

$$(J^{T} J)\, \Delta\beta = J^{T} \Delta y,$$

where $\Delta y = (\Delta y_1, \dots, \Delta y_m)^{T}$.
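In code, the shift vector is obtained by solving this linear system directly rather than forming the inverse; a minimal sketch, assuming JᵀJ is well-conditioned at the current iterate:

```python
import numpy as np

def shift_vector(J, dy):
    """Solve the normal equations (J^T J) dbeta = J^T dy for the shift vector."""
    # Solving the system is cheaper and more stable than explicitly inverting J^T J.
    return np.linalg.solve(J.T @ J, J.T @ dy)

# Usage per iteration: beta = beta + shift_vector(J, dy)
```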

Page 22

Solving the normal equations

Several methods have been proposed to solve the set of normal equations:

Gauss-Newton method

Levenberg-Marquardt-Fletcher algorithm

Gradient methods:

Davidon-Fletcher-Powell

Steepest descent


We will focus on two of these in this lecture: the Gauss-Newton method and the Levenberg-Marquardt-Fletcher algorithm.

Page 24

Gauss-Newton method

Starting with an initial guess $\beta^{(0)}$ for the minimum, the Gauss-Newton method iteratively updates $\beta^{(k)}$ by solving for the shift vector Δβ in the normal equations:

$$\beta^{(k+1)} = \beta^{(k)} + (J^{T} J)^{-1} J^{T} r(\beta^{(k)}).$$

This approach can encounter several problems:

$J^{T}J$ is often ill-conditioned, so the inverse $(J^{T}J)^{-1}$ is numerically unstable,

When far from the fixed-point solution, the iterative map may not be contractive; the sequence $(\beta^{(k)})_{k=1}^{\infty}$ may diverge without reaching a limiting solution.
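Putting the pieces together, a minimal Gauss-Newton sketch on the illustrative exponential model from the problem-statement section (analytic Jacobian, fixed iteration count, and none of the safeguards the issues above would call for):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 2.0, 25)
y = 2.0 * np.exp(-1.5 * x) + 0.05 * rng.standard_normal(x.size)   # simulated data

def f(beta):
    return beta[0] * np.exp(beta[1] * x)

def jac(beta):
    # J_ij = d f(x_i; beta) / d beta_j, computed analytically.
    return np.column_stack([np.exp(beta[1] * x),
                            beta[0] * x * np.exp(beta[1] * x)])

beta = np.array([1.0, -1.0])                 # initial guess beta^(0)
for _ in range(20):
    J, r = jac(beta), y - f(beta)            # Jacobian and residuals at beta^(k)
    beta = beta + np.linalg.solve(J.T @ J, J.T @ r)   # Gauss-Newton step
print(beta)                                  # near (2.0, -1.5) if the iteration converges
```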

Page 25

Levenberg-Marquardt-Fletcher algorithm

Historical perspective of the Levenberg-Marquardt-Fletcher (LMF) algorithm:

The first form of LMF was published in 1944 by Kenneth Levenberg, while he was working at the Frankford Army Arsenal,

It was then rediscovered and improved upon by Donald Marquardt in 1963, who worked as a statistician at DuPont,

A further modification was suggested by Roger Fletcher in 1971, which greatly improved the robustness of the algorithm.

Page 26

Levenberg’s contribution

To improve the numerical stability of the algorithm, Levenberg (1944) proposed the modified iterative updates:

$$\beta^{(k+1)} = \beta^{(k)} + (J^{T} J + \lambda I)^{-1} J^{T} r(\beta^{(k)}),$$

where λ is a damping factor and I is the identity matrix.

λ → 0⁺ recovers the previous Gauss-Newton updates, which converge quickly but may diverge,

λ → ∞ approaches the steepest-descent updates, which converge slowly but are more stable.

Page 27

Marquardt’s contribution

How does one choose λ to control this trade-off between quick convergence and numerical stability? Marquardt (1963) proposes the improved iterative updates:

$$\beta^{(k+1)} = \beta^{(k)} + (J^{T} J + \lambda^{(k)} I)^{-1} J^{T} r(\beta^{(k)}),$$

where the damping factor $\lambda^{(k)}$ can vary between iterations.

When convergence is stable, the damping factor is iteratively decreased via $\lambda^{(k+1)} \leftarrow \lambda^{(k)}/\nu$ to exploit the accelerated Gauss-Newton rate,

When divergence is observed, the damping factor is increased via $\lambda^{(k+1)} \leftarrow \lambda^{(k)} \nu$ to restore stability.
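A sketch of Marquardt's adaptive scheme on the same illustrative model (ν = 10 and the simple accept/reject rule are common conventions, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 2.0, 25)
y = 2.0 * np.exp(-1.5 * x) + 0.05 * rng.standard_normal(x.size)

f = lambda b: b[0] * np.exp(b[1] * x)
jac = lambda b: np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])
S = lambda b: np.sum((y - f(b)) ** 2)

beta, lam, nu = np.array([1.0, -1.0]), 1.0, 10.0
for _ in range(50):
    J, r = jac(beta), y - f(beta)
    step = np.linalg.solve(J.T @ J + lam * np.eye(2), J.T @ r)
    if S(beta + step) < S(beta):
        beta, lam = beta + step, lam / nu    # stable: push toward Gauss-Newton speed
    else:
        lam = lam * nu                       # diverging: damp toward steepest descent
print(beta)
```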

Page 28

Fletcher’s contribution

Fletcher (1971) proposes a further improvement of this algorithm through the following update scheme:

$$\beta^{(k+1)} = \beta^{(k)} + \left(J^{T} J + \lambda\, \mathrm{diag}\{J^{T} J\}\right)^{-1} J^{T} r(\beta^{(k)}),$$

where diag{JᵀJ} is the diagonal matrix formed from the diagonal entries of JᵀJ.

Intuition: By scaling each gradient component by the curvature, greater movement is encouraged in directions where the gradient is smaller,

Can be viewed as a pre-conditioning step for solving ill-conditioned problems,

A similar approach is used in Tikhonov regularization for linear least-squares.
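In practice these loops are rarely hand-rolled: scipy.optimize.least_squares, for example, exposes a MINPACK-based Levenberg-Marquardt implementation through method="lm". A sketch on the same illustrative model:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(4)
x = np.linspace(0.0, 2.0, 25)
y = 2.0 * np.exp(-1.5 * x) + 0.05 * rng.standard_normal(x.size)

def residuals(beta):
    return y - beta[0] * np.exp(beta[1] * x)   # r_i(beta) = y_i - f(x_i; beta)

res = least_squares(residuals, x0=np.array([1.0, -1.0]), method="lm")
print(res.x)                                    # near (2.0, -1.5)
```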

Page 29

Summary

Parameter estimation is an unavoidable step in model validation and prediction,

MLE is a popular approach for parameter estimation,

For the non-linear least-squares problem, the Levenberg-Marquardt-Fletcher algorithm provides a stable and efficient method for estimating coefficients.