Maximum Likelihood Estimation
Prof. C. F. Jeff Wu
ISyE 8813
Outline:
Motivation
Maximum likelihood estimation (MLE)
Non-linear least-squares estimation
Section 1
Motivation
What is parameter estimation?
A modeler proposes a model M(θ) for explaining some observed phenomenon
θ are the parameters which dictate the properties of such a model
Parameter estimation is the process by which the modeler determines the best parameter choice θ given a set of training observations
Why parameter estimation?
θ often encodes important and interpretable properties of the model M
e.g., calibration parameters incomputer experiment models
An estimate of θ (call this θ̂) allows for:
Model validation: checking whether the proposed model fits the observed data
Prediction: forecasting future observations at untested settings
Simple example
Suppose we observe a random sample X1, X2, ⋯, Xn:
Xi = 1 if student i owns a sports car, Xi = 0 if student i does not own a sports car
We then postulate the following model: Xi i.i.d.∼ Bernoulli(θ),
i.e., the Xi's are i.i.d. Bernoulli random variables with the same (unknown) parameter θ
We would like to estimate θ based on the observations X1, X2, ⋯, Xn
Popular estimation techniques
Maximum-likelihood estimation (MLE)
Minimax estimation
Method of moments (MOM)
(Non-linear) Least-squares estimation
We will focus on two of these techniques in this lecture: maximum-likelihood estimation and (non-linear) least-squares estimation.
Section 2
Maximum likelihood estimation (MLE)
Likelihood function
In words, MLE chooses the parameter setting which maximizes the likelihood of the observed sample
But how do we define this likelihood as a function of θ?
Definition
Let f(x̄ ∣ θ) be the probability of observing the sample x̄ = (x1, ⋯, xn) from M(θ). The likelihood function is defined as:
L(θ ∣ x̄) = f(x̄ ∣ θ)
MLE tries to find the parameter setting which maximizes the probability of observing the data
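To make this concrete, the following sketch (with a made-up sample, mirroring the earlier sports-car example) evaluates a Bernoulli likelihood on a grid of candidate θ values; the grid maximizer turns out to be the sample proportion:

```python
import math

def bernoulli_likelihood(theta, xs):
    """L(theta | x) = prod_i theta^x_i * (1 - theta)^(1 - x_i)."""
    L = 1.0
    for x in xs:
        L *= theta ** x * (1 - theta) ** (1 - x)
    return L

# Hypothetical sample: 7 of 10 students own a sports car.
xs = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# Evaluate the likelihood on a grid of candidate theta values.
grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=lambda t: bernoulli_likelihood(t, xs))
print(theta_hat)  # 0.7, the sample proportion
```

The grid search is only for illustration; the following slides maximize the likelihood analytically or with numerical optimizers.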
Log-likelihood Function
If X1, X2, ⋯, Xn i.i.d.∼ f(x ∣ θ), the likelihood function simplifies to:
L(θ ∣ x̄) = ∏_{i=1}^n f(x_i ∣ θ)
In this case, the likelihood can become very small, because we are multiplying many terms
To fix this computational problem, the log-likelihood l(θ ∣ x̄) = log(L(θ ∣ x̄)) is often used instead
For i.i.d. observations, this becomes:
l(θ ∣ x̄) = log(L(θ ∣ x̄)) = ∑_{i=1}^n log[f(x_i ∣ θ)]
Maximum likelihood estimator
We can now formally define the estimator for MLE:
Definition
Given observed data x̄, the maximum likelihood estimator (MLE) of θ is defined as:
θ̂ ∈ argmax_θ [L(θ ∣ x̄)]
Equivalently, because the log function is monotonic, we can instead solve for:
θ̂ ∈ argmax_θ [l(θ ∣ x̄)]
The latter is more numerically stable for optimization.
MLE: a simple example
Suppose some count data Xi ∈ Z+ is observed, and the following Poisson model is assumed: Xi i.i.d.∼ Pois(λ)
The likelihood function can be shown to be:
L(λ ∣ x̄) = ∏_{i=1}^n (λ^{x_i} e^{−λ} / x_i!)
with log-likelihood function:
l(λ ∣ x̄) = log[∏_{i=1}^n λ^{x_i} e^{−λ} / x_i!] = log(λ) ∑_{i=1}^n x_i − ∑_{i=1}^n log(x_i!) − nλ
MLE: a simple example
Since l(λ ∣ x̄) is differentiable, we can solve for its maximizer by:
Differentiating the log-likelihood:
d[l(λ ∣ x̄)]/dλ = (1/λ) ∑_{i=1}^n x_i − n
Setting it to 0, and solving for λ:
λ̂ = (∑_{i=1}^n x_i)/n = X̄
Checking the second-order condition for a maximum:
d²[l(λ ∣ x̄)]/dλ² = −(1/λ²) ∑_{i=1}^n x_i ≤ 0
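The closed-form answer λ̂ = X̄ can be verified numerically (a sketch with hypothetical count data; math.lgamma(x + 1) computes log(x!) stably):

```python
import math

def poisson_loglik(lam, xs):
    """l(lambda | x) = log(lambda) * sum(x_i) - sum(log(x_i!)) - n * lambda."""
    return (math.log(lam) * sum(xs)
            - sum(math.lgamma(x + 1) for x in xs)
            - len(xs) * lam)

# Hypothetical count data with sample mean 2.5.
xs = [2, 4, 1, 3, 0, 2, 5, 3]
xbar = sum(xs) / len(xs)

# The grid maximizer agrees with the closed-form MLE, lambda_hat = xbar.
grid = [i / 100 for i in range(1, 1001)]
lam_hat = max(grid, key=lambda lam: poisson_loglik(lam, xs))
print(lam_hat, xbar)  # 2.5 2.5
```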
What if a closed-form solution does not exist?
In most practical models, there are two computational difficulties:
No closed-form solution exists for the MLE,
Multiple locally optimal solutions or stationary points.
Standard non-linear optimization methods (Nocedal and Wright, 2006) are often used as black-box techniques for obtaining locally optimal solutions.
What if a closed-form solution does not exist?
We can do better if the model exhibits specific structure:
If the optimization problem is convex, one can use some form of accelerated gradient descent (Nesterov, 1983)
If non-convex but twice-differentiable, the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method works quite well for local optimization
If missing data is present, the EM algorithm (Dempster et al., 1977; Wu, 1983) can be employed
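To illustrate the last point, here is a minimal EM sketch (not taken from the lecture) for a hypothetical two-component Gaussian mixture in one dimension, where the unobserved component labels play the role of the missing data:

```python
import math

def em_gaussian_mixture(xs, n_iter=200):
    """EM for a 1-D two-component Gaussian mixture with shared variance.

    E-step: responsibilities r_i = P(component 1 | x_i);
    M-step: weighted MLE updates for (pi, mu1, mu2, sigma).
    A minimal sketch; real implementations add convergence checks.
    """
    def pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    # Crude initialization from the data range.
    pi, mu1, mu2, sigma = 0.5, min(xs), max(xs), 1.0
    for _ in range(n_iter):
        # E-step: posterior probability of component 1 for each point.
        r = []
        for x in xs:
            a = pi * pdf(x, mu1, sigma)
            b = (1 - pi) * pdf(x, mu2, sigma)
            r.append(a / (a + b))
        # M-step: weighted maximum-likelihood updates.
        n1 = sum(r)
        n2 = len(xs) - n1
        pi = n1 / len(xs)
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / n2
        var = sum(ri * (x - mu1) ** 2 + (1 - ri) * (x - mu2) ** 2
                  for ri, x in zip(r, xs)) / len(xs)
        sigma = math.sqrt(var)
    return pi, mu1, mu2, sigma

# Hypothetical data: two well-separated clusters around 0 and 10.
xs = [-0.5, 0.1, 0.3, -0.2, 0.4, 9.6, 10.2, 9.9, 10.4, 9.8]
pi, mu1, mu2, sigma = em_gaussian_mixture(xs)
print(round(mu1, 2), round(mu2, 2))  # roughly 0.02 and 9.98
```

Each iteration is guaranteed not to decrease the observed-data likelihood, which is the key property of EM.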
Section 3
Non-linear least-squares estimation
(Adapted from
en.wikipedia.org/wiki/Non-linear_least_squares)
Problem statement
Consider a set of m data points, (x1, y1), (x2, y2), …, (xm, ym), and a curve (model function) y = f(x; β):
The curve model depends on β = (β1, β2, …, βn), with m ≥ n
We want to choose the parameters β so that the curve best fits the given data in the least-squares sense, i.e., such that
S(β) = ∑_{i=1}^m r_i(β)²
is minimized, where the residuals r_i(β) are defined as r_i(β) = y_i − f(x_i; β)
Problem statement
The minimum value of S occurs when the gradient equals zero:
Since the model contains n parameters, there are n gradient equations to solve:
∂S/∂β_j = 2 ∑_i r_i (∂r_i/∂β_j) = 0,  j = 1, …, n.
In a nonlinear model, this system of equations typically does not have a closed-form solution
One way to solve this system is:
Set initial values for the parameters
Update the parameters using successive iterations:
β_j^{(k+1)} = β_j^{(k)} + Δβ_j.
Here, k is the iteration count, and Δβ is known as the shift vector
Deriving the normal equations
From a first-order Taylor series expansion about β^{(k)}, we get:
f(x_i; β) ≈ f(x_i; β^{(k)}) + ∑_j [∂f(x_i; β^{(k)})/∂β_j] (β_j − β_j^{(k)}) = f(x_i; β^{(k)}) + ∑_j J_ij Δβ_j,
where J is the Jacobian matrix.
Deriving the normal equations
In terms of the linearized model:
J_ij = −∂r_i/∂β_j,
with residuals given by:
r_i = (y_i − f(x_i; β^{(k)})) + (f(x_i; β^{(k)}) − f(x_i; β)) = Δy_i − ∑_{s=1}^n J_is Δβ_s,
where Δy_i = y_i − f(x_i; β^{(k)}). Substituting these into the gradient equations, we get:
−2 ∑_{i=1}^m J_ij (Δy_i − ∑_{s=1}^n J_is Δβ_s) = 0
Deriving the normal equations
This system of equations is better known as the set of normal equations:
∑_{i=1}^m ∑_{s=1}^n J_ij J_is Δβ_s = ∑_{i=1}^m J_ij Δy_i,  j = 1, …, n.
In matrix form, this becomes:
(JᵀJ) Δβ = Jᵀ Δy
where Δy = (Δy_1, ⋯, Δy_m)ᵀ.
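A single normal-equations solve can be sketched for a hypothetical two-parameter model (n = 2), forming JᵀJ and JᵀΔy explicitly and applying Cramer's rule; real implementations would use a QR or Cholesky factorization instead:

```python
def normal_equation_step(J, dy):
    """Solve (J^T J) dbeta = J^T dy for a two-parameter model (n = 2),
    using Cramer's rule on the 2x2 system."""
    m = len(J)
    # Build A = J^T J (2x2, symmetric) and b = J^T dy (length 2).
    a11 = sum(J[i][0] * J[i][0] for i in range(m))
    a12 = sum(J[i][0] * J[i][1] for i in range(m))
    a22 = sum(J[i][1] * J[i][1] for i in range(m))
    b1 = sum(J[i][0] * dy[i] for i in range(m))
    b2 = sum(J[i][1] * dy[i] for i in range(m))
    det = a11 * a22 - a12 * a12
    return [(b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det]

# Hypothetical 3x2 Jacobian and residual vector (m = 3 data points).
J = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
dy = [1.0, 2.0, 3.0]
print(normal_equation_step(J, dy))  # [0.0, 1.0]
```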
Solving the normal equations
Several methods have been proposed to solve the set of normal equations:
Gauss-Newton method
Levenberg-Marquardt algorithm
Gradient methods:
Davidon-Fletcher-Powell
Steepest descent
We will focus on two of these in this lecture: the Gauss-Newton method and the Levenberg-Marquardt-Fletcher algorithm.
Gauss-Newton method
Starting with an initial guess β^{(0)} for the minimum, the Gauss-Newton method iteratively updates β^{(k)} by solving for the shift vector Δβ in the normal equations:
β^{(k+1)} = β^{(k)} + (JᵀJ)^{−1} Jᵀ r(β^{(k)}).
This approach can encounter several problems:
JᵀJ is often ill-conditioned, making the inverse numerically unstable,
When far from a fixed-point solution, the iterative map may not be contractive; the sequence (β^{(k)})_{k=1}^∞ may diverge without reaching a limiting solution.
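The full iteration can be sketched for the hypothetical model y = β₁ exp(β₂ x), with noise-free data and a starting point close enough to avoid the divergence issue above:

```python
import math

def gauss_newton_exp(xs, ys, beta, n_iter=20):
    """Gauss-Newton fit of the hypothetical model y = b1 * exp(b2 * x).
    Each iteration solves (J^T J) dbeta = J^T r by Cramer's rule (n = 2)."""
    b1, b2 = beta
    for _ in range(n_iter):
        # Residuals r_i = y_i - f(x_i; beta); Jacobian J_ij = df/dbeta_j.
        r = [y - b1 * math.exp(b2 * x) for x, y in zip(xs, ys)]
        J = [[math.exp(b2 * x), b1 * x * math.exp(b2 * x)] for x in xs]
        a11 = sum(row[0] * row[0] for row in J)
        a12 = sum(row[0] * row[1] for row in J)
        a22 = sum(row[1] * row[1] for row in J)
        g1 = sum(row[0] * ri for row, ri in zip(J, r))
        g2 = sum(row[1] * ri for row, ri in zip(J, r))
        det = a11 * a22 - a12 * a12
        b1 += (g1 * a22 - g2 * a12) / det
        b2 += (a11 * g2 - a12 * g1) / det
    return b1, b2

# Noise-free data generated from b1 = 2, b2 = 0.5.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
b1, b2 = gauss_newton_exp(xs, ys, beta=(1.8, 0.45))
print(round(b1, 4), round(b2, 4))  # 2.0 0.5
```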
Levenberg-Marquardt-Fletcher algorithm
Historical perspective of the Levenberg-Marquardt-Fletcher(LMF) algorithm:
The first form of LMF was published in 1944 by Kenneth Levenberg, while working at the Frankford Army Arsenal,
It was then rediscovered and improved upon by DonaldMarquardt in 1963, who worked as a statistician at DuPont,
A further modification was suggested by Roger Fletcher in1971, which greatly improved the robustness of the algorithm.
Levenberg’s contribution
To improve the numerical stability of the algorithm, Levenberg (1944) proposed the modified iterative updates:
β^{(k+1)} = β^{(k)} + (JᵀJ + λI)^{−1} Jᵀ r(β^{(k)}),
where λ is a damping factor and I is the identity matrix.
λ → 0⁺ gives the previous Gauss-Newton updates, which converge quickly but may diverge,
λ → ∞ gives the steepest-descent updates, which converge slowly but are more stable.
Marquardt’s contribution
How does one choose λ to control this trade-off between quick convergence and numerical stability? Marquardt (1963) proposed the improved iterative updates:
β^{(k+1)} = β^{(k)} + (JᵀJ + λ^{(k)}I)^{−1} Jᵀ r(β^{(k)}),
where the damping factor λ^{(k)} can vary between iterations.
When convergence is stable, the damping factor is iteratively decreased by λ^{(k+1)} ← λ^{(k)}/ν to exploit the accelerated Gauss-Newton rate,
When divergence is observed, the damping factor is increased by λ^{(k+1)} ← λ^{(k)} ν to restore stability.
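Marquardt's adaptive damping can be sketched on the same kind of hypothetical model, y = β₁ exp(β₂ x); the choices ν = 2 and λ⁽⁰⁾ = 1 are illustrative:

```python
import math

def lm_fit(xs, ys, beta, lam=1.0, nu=2.0, n_iter=100):
    """Levenberg-Marquardt sketch for the hypothetical model
    y = b1 * exp(b2 * x): solve (J^T J + lam * I) dbeta = J^T r each
    iteration, shrinking lam after a successful step (S decreases) and
    growing it after a failed one, as in Marquardt (1963)."""
    def sum_sq(b1, b2):
        return sum((y - b1 * math.exp(b2 * x)) ** 2 for x, y in zip(xs, ys))

    b1, b2 = beta
    S = sum_sq(b1, b2)
    for _ in range(n_iter):
        r = [y - b1 * math.exp(b2 * x) for x, y in zip(xs, ys)]
        J = [[math.exp(b2 * x), b1 * x * math.exp(b2 * x)] for x in xs]
        # Damped normal equations: lam is added to the diagonal of J^T J.
        a11 = sum(row[0] * row[0] for row in J) + lam
        a12 = sum(row[0] * row[1] for row in J)
        a22 = sum(row[1] * row[1] for row in J) + lam
        g1 = sum(row[0] * ri for row, ri in zip(J, r))
        g2 = sum(row[1] * ri for row, ri in zip(J, r))
        det = a11 * a22 - a12 * a12
        d1 = (g1 * a22 - g2 * a12) / det
        d2 = (a11 * g2 - a12 * g1) / det
        S_new = sum_sq(b1 + d1, b2 + d2)
        if S_new < S:            # accept: lean toward Gauss-Newton
            b1, b2, S = b1 + d1, b2 + d2, S_new
            lam /= nu
        else:                    # reject: lean toward steepest descent
            lam *= nu
    return b1, b2

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
b1, b2 = lm_fit(xs, ys, beta=(1.0, 0.1))
print(round(b1, 4), round(b2, 4))  # roughly 2.0 0.5
```

Note that the damping lets this sketch start from a much worse initial guess than plain Gauss-Newton would tolerate, at the cost of extra iterations.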
Fletcher’s contribution
Fletcher (1971) proposed a further improvement of this algorithm through the following update scheme:
β^{(k+1)} = β^{(k)} + (JᵀJ + λ diag{JᵀJ})^{−1} Jᵀ r(β^{(k)}),
where diag{JᵀJ} is the diagonal matrix of the diagonal entries of JᵀJ.
Intuition: by scaling each gradient component by the curvature, greater movement is encouraged in directions where the gradient is smaller,
Can be viewed as a pre-conditioning step for solving ill-conditioned problems,
A similar approach is used in Tikhonov regularization for linear least-squares.
Summary
Parameter estimation is an unavoidable step in model validation and prediction,
MLE is a popular approach for parameter estimation,
For the non-linear least-squares problem, the Levenberg-Marquardt-Fletcher algorithm provides a stable and efficient method for estimating coefficients.