
EC771: Econometrics, Spring 2009
Greene, Econometric Analysis (6th ed, 2008)

Chapter 17: Maximum Likelihood Estimation

The preferred estimator in a wide variety of econometric settings is that derived from the principle of maximum likelihood (MLE). The pdf for a random variable y, conditioned on a set of parameters θ, is denoted f(y|θ). This function identifies the data generating process (DGP) that underlies an observed sample of data, and provides a mathematical description of the data that the DGP will produce. That is, for a well-specified DGP, we could generate any desired quantity of artificial data whose properties correspond to that DGP. The joint density of n independently and identically distributed (i.i.d.) observations from that process is the product of the individual densities:

f(y_1, \dots, y_n|\theta) = \prod_{i=1}^{n} f(y_i|\theta) = L(\theta|y).

This joint density, the likelihood function L(θ|y), is a function of the unknown parameter vector θ and the collection of sample data y. It is usually easier to work with its logarithm,

\ln L(\theta|y) = \sum_{i=1}^{n} \ln f(y_i|\theta).

Of course, we need not consider a set of y_i that are i.i.d. to use the MLE framework. In a simple regression framework, y_i = x_i'β + ε_i. Assume that ε is normally distributed. Then, conditioned on a specific x_i, y_i is normally distributed with mean µ_i = x_i'β and variance σ²_ε. Because the errors are independent, the observations are independently (though not identically) distributed, and we may write the loglikelihood function of θ conditioned on y, X as

\ln L(\theta|y,X) = \sum_{i=1}^{n} \ln f(y_i|x_i,\theta) = -\frac{1}{2}\sum_{i=1}^{n}\left[\ln\sigma^2 + \ln(2\pi) + (y_i - x_i'\beta)^2/\sigma^2\right],

where X is the n × K matrix of data with its ith row equal to x_i'. Since the first-order conditions for maximization of this function with respect to β will not be affected by the first two terms in this expression, they are often ignored in the computation.
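As a short worked consequence (standard, though not spelled out in the notes): the score with respect to β sets the same moment conditions as least squares, so the ML and OLS estimates of β coincide in this model:

\frac{\partial \ln L}{\partial \beta} = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i\,(y_i - x_i'\beta) = 0
\quad\Longrightarrow\quad
\hat\beta_{ML} = \left(\sum_{i=1}^{n} x_i x_i'\right)^{-1}\sum_{i=1}^{n} x_i y_i = (X'X)^{-1}X'y.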

Before we consider the mechanics of MLE, we must discuss the identifiability of a particular model. Suppose that we had an infinitely large sample. Could we uniquely determine the values of the parameters θ from the information in that sample? We will not always be able to do so, in which case the underlying model is said to be unidentified. We say that a parameter vector θ is identified, or estimable, if for any other parameter vector θ* ≠ θ and some data y, L(θ*|y) ≠ L(θ|y). For instance, consider a probit equation, where we observe a binary outcome and hypothesize that the probability of that event (e.g. purchase of an automobile) is related to a causal factor x:

Prob(purchase|\beta_1,\beta_2,\sigma,x_i) = Prob(y_i > 0|\beta_1,\beta_2,\sigma,x_i) = Prob\left(\varepsilon_i/\sigma > -(\beta_1+\beta_2 x_i)/\sigma \,\middle|\, \beta_1,\beta_2,\sigma,x_i\right).

In this case y_i may be considered the difference between the amount a buyer is willing to pay and the price of the car: a latent variable. If that difference is positive, we observe purchase, and vice versa. Since multiplying (β_1, β_2, σ) by the same positive constant leaves this probability unchanged, the model's parameter vector is not identified. We must apply a normalization: in this case, we conventionally set σ = 1 (equivalently, σ² = 1) in order to "tie down" the model's parameters, so that we may then uniquely identify the βs.
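To make the identification failure concrete, note that under the probit assumption that ε_i is normally distributed, the probability above reduces to

Prob(y_i > 0 | x_i) = \Phi\!\left(\frac{\beta_1 + \beta_2 x_i}{\sigma}\right),

which depends on (β_1, β_2, σ) only through the ratios β_1/σ and β_2/σ. Any rescaling (cβ_1, cβ_2, cσ) with c > 0 implies exactly the same probabilities; setting σ = 1 pins those ratios down as the coefficients we estimate.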

The principle of maximum likelihood provides a means of choosing an asymptotically efficient estimator for a set of parameters. Let us consider first how this might be done for a discrete pdf such as the Poisson distribution:

f(y_i|\theta) = \frac{e^{-\theta}\,\theta^{y_i}}{y_i!}.

This is the density for each observation. The corresponding loglikelihood function (LLF) for the sample of n observations will be:

\ln L(\theta|y) = -n\theta + \ln\theta \sum_{i=1}^{n} y_i - \sum_{i=1}^{n}\ln(y_i!).
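Differentiating with respect to θ and setting the derivative to zero shows, as a worked step, that the MLE of the Poisson parameter is simply the sample mean:

\frac{\partial \ln L}{\partial \theta} = -n + \frac{1}{\theta}\sum_{i=1}^{n} y_i = 0
\quad\Longrightarrow\quad
\hat\theta_{ML} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar y.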

To perform maximum likelihood estimation in Stata, you must code the likelihood function as a program, or ado-file. In the simplest form of such a program, in a setting where the observations are considered to be independently distributed, you need only express the LLF for a single observation, and Stata will add it up over the sample. This is known in Stata terms as an lf (linear-form) estimator, and it is applicable to a number of estimation problems that satisfy the so-called linear form restrictions. This is a different use of "linear" than that of "linear regression" or "linear model", in that a linear-form MLE problem can involve nonlinearities (such as a binomial probit model). The linear-form restrictions require that the log-likelihood contributions can be calculated separately for each observation, and that the sum of the individual contributions equals the overall log likelihood (Gould, Pitblado and Sribney, 2003, p. 30). More complicated problems require use of Stata's d0 form, in which the entire LLF is coded (or its relatives, d1 and d2, in which first, or first and second, analytic derivatives are also provided). For the lf model, no analytic derivatives are required, and the optimization is more accurate and computationally efficient than that derived from the d0 form, since differentiation of the loglikelihood function may be done with respect to the linear form rather than with respect to each of its elements.

771ml1.html contains the logfile for a simple MLE of the Poisson problem described in the text. The Stata program fishy1_lf.ado expresses the likelihood of a single observation y_i given the parameter θ:

program fishy1_lf
        version 10.1
        args lnf theta
        // log-likelihood contribution of one observation:
        // ln f(y_i|theta) = -theta + y_i*ln(theta) - ln(y_i!)
        quietly replace `lnf' = -`theta' + ///
                log(`theta')*$ML_y1 - lnfactorial($ML_y1)
end

Note that to avoid numerical overflow with the factorial function, we use Stata's lnfactorial() function. The ml model statement sets up the maximum likelihood problem; we use ml check to see whether any obvious errors have been made in the routine. The ml maximize statement instructs Stata to perform the estimation. Following estimation, just as with any estimation command, we may use the test command to perform a Wald test of any linear hypothesis, or the testnl command to perform a Wald-type test of a nonlinear hypothesis via the delta method.
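A minimal sketch of that sequence, assuming we generate an artificial Poisson sample ourselves (the variable name y, the sample size, and the parameter value 3 are illustrative and are not taken from the logfile):

        . clear
        . set obs 1000
        . generate y = rpoisson(3)
        . ml model lf fishy1_lf (y = )
        . ml check
        . ml maximize
        . test [eq1]_cons = 3      // Wald test of theta = 3; eq1 is ml's default name for the equation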

When we switch to a continuous distribution, the likelihood is no longer the probability of observing the given sample of data, since any particular sample has probability zero; nevertheless, the principle is the same. The values of the parameters that maximize the LF or LLF are the maximum likelihood estimates, θ̂, and the necessary conditions for maximizing the LLF are

\frac{\partial \ln L(\theta|\text{data})}{\partial \theta} = 0.

That is, each element of the gradient or score vector must be approximately zero at the optimum (or the norm of that vector must be appropriately close to zero).

Let us consider how the MLE for the mean and variance of a normally distributed random variable may be computed. The LLF for the univariate normal distribution is

\ln L(\mu,\sigma) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2}\sum_{i=1}^{n}\frac{(y_i-\mu)^2}{\sigma^2}.

To maximize this function, we differentiate with respect to both µ and σ² and solve the resulting FOCs:

\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu) = 0,

\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2 = 0.

Solution of these two equations leads to the familiar formula for the mean, µ̂ = ȳ, and to the MLE of σ², which is the biased estimator

\hat\sigma^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(y_i-\bar y)^2.

771ml2.html illustrates MLE of the parameters of a Normal distribution, where we have generated the artificial data with µ = 5 and σ = 2, and want to recover those parameters. The Stata ado-file is an lf-form function here as well, expressing the likelihood of a single observation in terms of the two unknown parameters:

program norm1_lf
        version 10.1
        args lnf mu sigma
        // log of the normal density of y_i with mean `mu' and std. dev. `sigma'
        quietly replace `lnf' = ///
                ln(normalden($ML_y1,`mu',`sigma'))
end

Note that the results of estimation give us confidence intervals for both µ and σ.
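A minimal sketch of the corresponding setup, assuming we generate the artificial sample ourselves (the variable name y and the sample size are illustrative; the DGP follows the text's µ = 5, σ = 2):

        . clear
        . set obs 500
        . generate y = 5 + 2*rnormal()
        . ml model lf norm1_lf (mu: y = ) (sigma: )
        . ml maximize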

The beauty of the linear-form model is that a whole variety of other DGPs may be expressed in this same context without altering the program. This example considers a univariate estimation problem. What if, instead, we consider the model

y_t = \alpha + \beta t + \varepsilon_t,

that is, a linear trend model? We then want to estimate (by means of MLE, although we could surely use a least squares regression to solve this problem) those two parameters as well as σ, the standard error of the ε process. In this case, the LLF above is modified to include µ_i, the conditional mean of y|x, rather than the scalar parameter µ. 771ml3.html presents the results of that estimation. Since the "explanatory variable" t enters the conditional mean equation in linear form, we may merely change the ml model statement to express y as a linear function of t (by default with a constant term, α above), as sketched below. We need not modify the program to estimate this model. Notice that since we are now working with a bivariate relationship, Stata computes a Wald χ² statistic for this model which compares it with the "naive model" in which the coefficient of t is zero (that is, a model in which the time trend plays no role).
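Under the same illustrative assumptions (variable names y and t, with t a simple time index and an arbitrary trend DGP), only the ml model statement changes:

        . clear
        . set obs 200
        . generate t = _n
        . generate y = 1 + 0.5*t + 2*rnormal()     // hypothetical trend DGP
        . ml model lf norm1_lf (mu: y = t) (sigma: )
        . ml maximize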

Finally, we may also model the standard error, rather than assuming it to be common across observations. We may express ε_t in this model as being normally but not identically distributed, with standard error γ_t = σ x_t²: an instance where the relationship is becoming less precise over time (in the implementation, we demean x so that the multiplicative factor is mean zero). This assumption on the error process does not violate the independence of the errors, although their second moments are systematically related to t. In regression terms, it is a heteroskedastic regression model, in which we explicitly model the form of the heteroskedasticity. 771ml4.html presents the estimates of that model: again, we need not modify the program, but merely change the ml model statement so that we may indicate that the second "equation" being estimated (that for the standard error of ε) is non-trivial.

Properties of MLEs

MLEs are most attractive due to their large-sample or asymptotic properties; their finite-sample properties may be suboptimal. For example, the MLE of σ² in a standard regression problem involves a divisor of N, rather than N − 1, so that the MLE is biased downward. In large-sample terms, of course, this does not matter, since plim((N − 1)/N) = 1. To discuss their properties, let us define θ̂ as the MLE, θ_0 as the true parameter vector, and θ as an arbitrary parameter vector (not necessarily either of the above). Under regularity conditions to be discussed, the MLE has the following asymptotic properties:

• Consistency: plim θ̂ = θ_0.

• Asymptotic normality: θ̂ is asymptotically distributed N[θ_0, {I(θ_0)}⁻¹], where I(θ_0) = −E_0[∂² ln L/∂θ_0 ∂θ_0'].

• Asymptotic efficiency: θ̂ is asymptotically efficient and achieves the Cramér–Rao Lower Bound (CRLB) for consistent estimators.

• Invariance: The MLE of γ_0 = c(θ_0) is c(θ̂) if c(θ_0) is a continuous and continuously differentiable function.

The regularity conditions, which we will not further discuss, require that the first three derivatives of ln f(y_i|θ) with respect to θ are continuous and finite for almost all y_i and all θ. This condition guarantees the existence of a certain Taylor series approximation. Furthermore, the conditions necessary to obtain the expectations of the first and second derivatives of ln f(y_i|θ) must be met, and the third derivative must be bounded, permitting that Taylor series to be truncated. With these conditions met, we may define the moments of the derivatives of the log-likelihood function:

• 1. ln f(y_i|θ), g_i = ∂ ln f(y_i|θ)/∂θ and H_i = ∂² ln f(y_i|θ)/∂θ∂θ', i = 1, …, n, are all random samples of random variables.

• 2. E_0[g_i(θ_0)] = 0.

• 3. Var[g_i(θ_0)] = −E[H_i(θ_0)].

The first definition follows from the definition of the likelihood function. The second defines the moment condition, or score vector, by which the MLE may locate the optimum, and indicates that the MLE is one of a more general class of Generalized Method of Moments (GMM) estimators. Intuitively, this condition indicates that the gradient of the likelihood function must be zero at the optimum, requiring that the first-order conditions for a maximum be satisfied. The third condition provides the Information Matrix Equality, which defines the asymptotic covariance matrix of the MLE as being related to the expectation of the Hessian of the log-likelihood. The Hessian must be negative (semi-)definite for a maximum, corresponding to the notion that the asymptotic covariance matrix must be positive (semi-)definite.

The expected value of the log-likelihood is maximized at the true value of the parameters, which implies that the MLE will be consistent. To demonstrate the asymptotic normality of the MLE, expand the first-order conditions g(θ̂) = 0 in a second-order Taylor series around the true parameters:

g(\hat\theta) = g(\theta_0) + H(\bar\theta)(\hat\theta - \theta_0) = 0.

The Hessian is evaluated at a point θ̄ that is between θ̂ and θ_0. Rearrange this expression and multiply by √n to obtain:

\sqrt{n}\,(\hat\theta - \theta_0) = \left[-H(\bar\theta)\right]^{-1}\left[\sqrt{n}\, g(\theta_0)\right].

Since plim(θ̂ − θ_0) = 0, plim(θ̂ − θ̄) = 0 as well, and (dividing the derivatives by n):

\sqrt{n}\,(\hat\theta - \theta_0) \overset{d}{\to} \left[-\frac{1}{n}H(\theta_0)\right]^{-1}\left[\sqrt{n}\,\bar g(\theta_0)\right].

Since [√n ḡ(θ_0)] is √n times the mean of a random sample, we find the limiting variance of that expression, and can express its distribution as:

\left[\sqrt{n}\,\bar g(\theta_0)\right] \overset{d}{\to} N\left[0,\; -E_0\left[\frac{1}{n}H(\theta_0)\right]\right]

or

\sqrt{n}\,(\hat\theta - \theta_0) \overset{d}{\to} N\left[0,\; \left(-E_0\left[\frac{1}{n}H(\theta_0)\right]\right)^{-1}\right].

Combining results, we may derive the expression above for asymptotic normality of the MLE, with a covariance matrix equal to the inverse of the information matrix: minus the expectation of the mean of the Hessian, evaluated at the true parameter vector:

\hat\theta \overset{a}{\sim} N\left[\theta_0,\; [I(\theta_0)]^{-1}\right].

For the normal distribution, the second derivatives of the LLF are:

\frac{\partial^2 \ln L}{\partial\mu^2} = \frac{-n}{\sigma^2}, \qquad
\frac{\partial^2 \ln L}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^{n}(y_i-\mu)^2, \qquad
\frac{\partial^2 \ln L}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^{n}(y_i-\mu).

The expectation of the cross-partial is zero, since E[y_i] = µ. The first expression is nonstochastic, while the second has expectation −n/(2σ⁴), since each of the n terms in the sum has expected value σ². Collecting these expectations in the (diagonal) information matrix, reversing the sign and inverting, we derive the asymptotic covariance matrix for the MLE:

\left(-E_0\left[\frac{\partial^2 \ln L}{\partial\theta_0\,\partial\theta_0'}\right]\right)^{-1} =
\begin{bmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/n \end{bmatrix}.

The elements of this covariance matrix may then be calculated, following the successful maximization of the LLF, by replacing the unknown parameter σ² with its point estimate. Generally, though, the form of the expected values of the second derivatives of the LLF may be unknown, as it will be a complicated nonlinear function of the data. We can instead derive an estimate of the information matrix by evaluating the actual (not expected) second derivatives of the LLF at the ML estimates:

\left[\hat I(\hat\theta)\right]^{-1} = \left(-\frac{\partial^2 \ln L(\hat\theta)}{\partial\hat\theta\,\partial\hat\theta'}\right)^{-1}.

This computation requires that the second derivatives be evaluated, however, and expressing them analytically may be burdensome. As an alternative, we may use the outer product of gradients (OPG) estimator,

\left[\hat I(\hat\theta)\right]^{-1} = \left[\sum_{i=1}^{n}\hat g_i\,\hat g_i'\right]^{-1} = \left[\hat G'\hat G\right]^{-1},

where

\hat g_i = \frac{\partial \ln f(x_i,\hat\theta)}{\partial\hat\theta}, \qquad \hat G = [\hat g_1, \hat g_2, \dots, \hat g_n]'.

Ĝ is an n × K matrix with ith row equal to the transpose of the ith vector of derivatives with respect to the estimated parameters (the gradient vector for the ith observation, given the current guess for θ). This expression is very convenient, since it does not require evaluation of the second derivatives (i.e. computation of the Hessian). This estimate of the covariance matrix is also known as the BHHH (Berndt, Hall, Hall and Hausman) estimator. Nevertheless, if the estimator based on the second derivatives is available, it will often be preferable, especially in small to moderately sized samples.

Many econometric software packages provide the option to calculate ML estimates by alternative methods, or by combinations of algorithms. For instance, Stata's ml command will default to a Newton–Raphson method, technique(nr), but will also allow specification of the BHHH method (technique(bhhh)), the Davidon–Fletcher–Powell method (technique(dfp)) or the Broyden–Fletcher–Goldfarb–Shanno method (technique(bfgs)). These techniques are described in Gould, Pitblado and Sribney, Maximum Likelihood Estimation with Stata, 2d ed. (2003). All four of these methods may be employed, by themselves or in combination, with Stata's lf (linear form) method. For all but the bhhh method, Stata computes estimates of the covariance matrix via vce(oim), utilizing the observed information matrix (the inverse of the negative Hessian). Alternatively, OPG standard errors may be computed via vce(opg), which is the default method when technique(bhhh) is used. Most software that produces maximum likelihood estimates affords similar capabilities.
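As a sketch of how such an option is specified, reusing the illustrative norm1_lf setup from above (the particular algorithm chosen here is arbitrary):

        . ml model lf norm1_lf (mu: y = t) (sigma: ), technique(bhhh)
        . ml maximize        // with technique(bhhh), OPG (vce(opg)) standard errors are the default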

Asymptotically equivalent test procedures

We consider MLE of a parameter vector θ_0 and a test of the hypothesis c(θ_0) = q. There are three approaches that we might use to test the hypothesis:

• Likelihood ratio (LR) test. If the restrictions c(θ_0) = q are valid, then imposing them will not lead to a large reduction in the LLF. The test is thus based on the ratio λ = L̂_R / L̂_U, or in terms of log-likelihoods on the difference (ln L_R − ln L_U), where L_U is the likelihood function value of the unrestricted estimate and L_R is the likelihood function value of the restricted estimate. Under the null hypothesis, −2(ln L_R − ln L_U) ∼ χ², with degrees of freedom equal to the number of restrictions imposed in c(·). This is the natural test to apply in a MLE setting, but it requires that we solve two MLE problems, since both the unrestricted and restricted estimates must be computed.

• Wald test. If the restrictions are valid, then c(θ̂_MLE) should be close to q, since the MLE is consistent. The Wald test statistic is then the quadratic form

  \left[c(\hat\theta)-q\right]'\left(\text{Asy.Var}\left[c(\hat\theta)-q\right]\right)^{-1}\left[c(\hat\theta)-q\right].

  Under the null hypothesis, this quadratic form is distributed as χ², with degrees of freedom equal to the number of restrictions imposed in c(·). The Wald test only requires computation of the unrestricted model.

• Lagrange multiplier (LM, or score) test. If the restrictions are valid, the restricted estimates should be near the point that maximizes the LLF, and the slope of the LLF should be near zero at the restricted estimator. The test is based on the slope of the LLF at the point where the function is maximized subject to the restrictions, and is a quadratic form in the scores (first derivatives) of the restricted LLF:

  \left(\frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R}\right)'\left[I(\hat\theta_R)\right]^{-1}\left(\frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R}\right).

  Under the null hypothesis, this quadratic form is distributed as χ², with degrees of freedom equal to the number of restrictions imposed in c(·). The LM test only requires computation of the restricted model.

All three of these tests have the same asymptotic distribution, and are thus equivalent. However, in finite samples, they will yield different results, and may even lead to different inferences. Ease of computation is sometimes an issue; for instance, estimating a restricted model in the context of linear regression is often difficult if the constraints are nonlinear. In that case, the Wald test (which does not require explicit estimation of the restricted model) may be preferable. On the other hand, if the model is nonlinear in the parameters and the restrictions render it linear, the LM test may be easily computed. For an explicit MLE problem, the LR test is usually straightforward. For a linear model, it has been shown that the χ² statistics from the three tests follow the ordering

W ≥ LR ≥ LM,

so that the LM test will be the most conservative (if it rejects, the others will). (A graphical representation of the three tests appeared here as a figure in the original notes.)

In Stata, the test command performs Wald tests of linear constraints. The testnl command performs Wald-type tests via the "delta method," an approximation appropriate in large samples. The lrtest command performs likelihood ratio tests after maximum likelihood estimation. Many commonly employed tests are actually LM tests: in particular, any test which may be described as an "n R²" test is an LM test. Note also that many of these test procedures will generate F-statistics rather than χ² statistics. Stata uses the rule that if small-sample inferences are presented for the parameter estimates (e.g. if t statistics are computed from the parameters' estimated standard errors), then tests based on those estimates will be reported as F. If large-sample inferences are presented (so that z statistics are computed from the parameters' estimated standard errors), those tests are reported as χ². Since the F statistic is the ratio of two independent χ² variables, each divided by its degrees of freedom, and in most large-sample problems the denominator converges to one, it follows that F ≈ χ²/j, where j is the number of restrictions. For a single restriction (j = 1), the reported F(j, ∞) and χ²(j) statistics will be identical.
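A minimal sketch of a likelihood-ratio test in this framework, reusing the illustrative trend model from above (the stored-estimate names are arbitrary):

        . ml model lf norm1_lf (mu: y = t) (sigma: )
        . ml maximize
        . estimates store unrestricted
        . ml model lf norm1_lf (mu: y = ) (sigma: )
        . ml maximize
        . estimates store restricted
        . lrtest unrestricted restricted     // LR test of the restriction that the trend coefficient is zero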