
EC771: Econometrics, Spring 2009
Greene, Econometric Analysis (6th ed, 2008)

Chapter 17: Maximum Likelihood Estimation

The preferred estimator in a wide variety of econometric settings is that derived from the principle of maximum likelihood (MLE). The pdf for a random variable y, conditioned on a set of parameters θ, is denoted f(y|θ). This function identifies the data generating process (DGP) that underlies an observed sample of data, and provides a mathematical description of the data that the DGP will produce. That is, for a well-specified DGP, we could generate any desired quantity of artificial data whose properties correspond to that DGP. The joint density of n independently and identically distributed (i.i.d.) observations from that process is the product of the individual densities:

f(y_1, \dots, y_n|\theta) = \prod_{i=1}^{n} f(y_i|\theta) = L(\theta|y).

This joint density, the likelihood function L(θ|y), is a function of the unknown parameter vector θ and the collection of sample data y. It is usually easier to work with its logarithm,

\ln L(\theta|y) = \sum_{i=1}^{n} \ln f(y_i|\theta).

Of course, we need not consider a set of y_i that are i.i.d. to use the MLE framework. In a simple regression framework, y_i = x_i'β + ε_i. Assume that ε is normally distributed. Then, conditioned on a specific x_i, y_i is normally distributed with mean µ_i = x_i'β and variance σ²_ε. Because the errors are independent, the observations are independently (though not identically) distributed, and we may write the loglikelihood function of θ conditioned on y, X as

\ln L(\theta|y,X) = \sum_{i=1}^{n} \ln f(y_i|x_i,\theta) = -\frac{1}{2}\sum_{i=1}^{n}\left[\ln\sigma^2 + \ln(2\pi) + (y_i - x_i'\beta)^2/\sigma^2\right],

where X is the n × K matrix of data with its ith row equal to x_i'. Since the first-order conditions for maximization of this function with respect to β will not be affected by the first two terms in this expression, they are often ignored in the computation.
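As a short worked consequence (standard, though not spelled out in the notes): the score with respect to β sets the same moment conditions as least squares, so the ML and OLS estimates of β coincide in this model:

\frac{\partial \ln L}{\partial \beta} = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i\,(y_i - x_i'\beta) = 0
\quad\Longrightarrow\quad
\hat\beta_{ML} = \left(\sum_{i=1}^{n} x_i x_i'\right)^{-1}\sum_{i=1}^{n} x_i y_i = (X'X)^{-1}X'y.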

Before we consider the mechanics of MLE, we must discuss the identifiability of a particular model. Suppose that we had an infinitely large sample. Could we uniquely determine the values of the parameters θ from the information in that sample? We will not always be able to do so, in which case the underlying model is said to be unidentified. We say that a parameter vector θ is identified, or estimable, if for any other parameter vector θ* ≠ θ and some data y, L(θ*|y) ≠ L(θ|y). For instance, consider a probit equation, where we observe a binary outcome and hypothesize that the probability of that event (e.g. purchase of an automobile) is related to a causal factor x:

Prob(purchase|\beta_1,\beta_2,\sigma,x_i) = Prob(y_i > 0|\beta_1,\beta_2,\sigma,x_i) = Prob\left(\varepsilon_i/\sigma > -(\beta_1+\beta_2 x_i)/\sigma \,\middle|\, \beta_1,\beta_2,\sigma,x_i\right).

In this case y_i may be considered the difference between the amount a buyer is willing to pay and the price of the car: a latent variable. If that difference is positive, we observe purchase, and vice versa. Since multiplying (β_1, β_2, σ) by the same positive constant leaves this probability unchanged, the model's parameter vector is not identified. We must apply a normalization: in this case, we conventionally set σ = 1 (equivalently, σ² = 1) in order to "tie down" the model's parameters, so that we may then uniquely identify the βs.
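To make the identification failure concrete, note that under the probit assumption that ε_i is normally distributed, the probability above reduces to

Prob(y_i > 0 | x_i) = \Phi\!\left(\frac{\beta_1 + \beta_2 x_i}{\sigma}\right),

which depends on (β_1, β_2, σ) only through the ratios β_1/σ and β_2/σ. Any rescaling (cβ_1, cβ_2, cσ) with c > 0 implies exactly the same probabilities; setting σ = 1 pins those ratios down as the coefficients we estimate.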

The principle of maximum likelihood provides a means of choosing an asymptotically efficient estimator for a set of parameters. Let us consider first how this might be done for a discrete pdf such as the Poisson distribution:

f(y_i|\theta) = \frac{e^{-\theta}\,\theta^{y_i}}{y_i!}.

This is the density for each observation. The corresponding loglikelihood function (LLF) for the sample of n observations will be:

\ln L(\theta|y) = -n\theta + \ln\theta \sum_{i=1}^{n} y_i - \sum_{i=1}^{n}\ln(y_i!).
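Differentiating with respect to θ and setting the derivative to zero shows, as a worked step, that the MLE of the Poisson parameter is simply the sample mean:

\frac{\partial \ln L}{\partial \theta} = -n + \frac{1}{\theta}\sum_{i=1}^{n} y_i = 0
\quad\Longrightarrow\quad
\hat\theta_{ML} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar y.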

To perform maximum likelihood estimation in Stata, you must code the likelihood function as a program, or ado-file. In the simplest form of such a program, in a setting where the observations are considered to be independently distributed, you need only express the LLF for a single observation, and Stata will add it up over the sample. This is known in Stata terms as an lf (linear-form) estimator, and it is applicable to a number of estimation problems that satisfy the so-called linear form restrictions. This is a different use of "linear" than that of "linear regression" or "linear model", in that a linear-form MLE problem can involve nonlinearities (such as a binomial probit model). The linear-form restrictions require that the log-likelihood contributions can be calculated separately for each observation, and that the sum of the individual contributions equals the overall log likelihood (Gould, Pitblado and Sribney, 2003, p. 30). More complicated problems require use of Stata's d0 form, in which the entire LLF is coded (or its relatives, d1 and d2, in which first, or first and second, analytic derivatives are also provided). For the lf model, no analytic derivatives are required, and the optimization is more accurate and computationally efficient than that derived from the d0 form, since differentiation of the loglikelihood function may be done with respect to the linear form rather than with respect to each of its elements.

771ml1.html contains the logfile for a simple MLE of the Poisson problem described in the text. The Stata program fishy1_lf.ado expresses the likelihood of a single observation y_i given the parameter θ:

program fishy1_lf
        version 10.1
        args lnf theta
        // log-likelihood contribution of one observation:
        // ln f(y_i|theta) = -theta + y_i*ln(theta) - ln(y_i!)
        quietly replace `lnf' = -`theta' + ///
                log(`theta')*$ML_y1 - lnfactorial($ML_y1)
end

Note that to avoid numerical overflow with the factorial function, we use Stata's lnfactorial() function. The ml model statement sets up the maximum likelihood problem; we use ml check to see whether any obvious errors have been made in the routine. The ml maximize statement instructs Stata to perform the estimation. Following estimation, just as with any estimation command, we may use the test command to perform a Wald test of any linear hypothesis, or the testnl command to perform a Wald-type test of a nonlinear hypothesis via the delta method.
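A minimal sketch of that sequence, assuming we generate an artificial Poisson sample ourselves (the variable name y, the sample size, and the parameter value 3 are illustrative and are not taken from the logfile):

        . clear
        . set obs 1000
        . generate y = rpoisson(3)
        . ml model lf fishy1_lf (y = )
        . ml check
        . ml maximize
        . test [eq1]_cons = 3      // Wald test of theta = 3; eq1 is ml's default name for the equation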

When we switch to a continuous distribution, the likelihood is no longer the probability of observing the given sample of data, since any particular sample has probability zero; nevertheless, the principle is the same. The values of the parameters that maximize the LF or LLF are the maximum likelihood estimates, θ̂, and the necessary conditions for maximizing the LLF are

\frac{\partial \ln L(\theta|\text{data})}{\partial \theta} = 0.

That is, each element of the gradient or score vector must be approximately zero at the optimum (or the norm of that vector must be appropriately close to zero).

Let us consider how the MLE for the mean and variance of a normally distributed random variable may be computed. The LLF for the univariate normal distribution is

\ln L(\mu,\sigma) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2}\sum_{i=1}^{n}\frac{(y_i-\mu)^2}{\sigma^2}.

To maximize this function, we differentiate with respect to both µ and σ² and solve the resulting FOCs:

\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu) = 0,

\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2 = 0.

Solution of these two equations leads to the familiar formula for the mean, µ̂ = ȳ, and to the MLE of σ², which is the biased estimator

\hat\sigma^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(y_i-\bar y)^2.

771ml2.html illustrates MLE of the parameters of a Normal distribution, where we have generated the artificial data with µ = 5 and σ = 2, and want to recover those parameters. The Stata ado-file is an lf-form function here as well, expressing the likelihood of a single observation in terms of the two unknown parameters:

program norm1_lf
        version 10.1
        args lnf mu sigma
        // log of the normal density of y_i with mean `mu' and std. dev. `sigma'
        quietly replace `lnf' = ///
                ln(normalden($ML_y1,`mu',`sigma'))
end

Note that the results of estimation give us confidence intervals for both µ and σ.
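A minimal sketch of the corresponding setup, assuming we generate the artificial sample ourselves (the variable name y and the sample size are illustrative; the DGP follows the text's µ = 5, σ = 2):

        . clear
        . set obs 500
        . generate y = 5 + 2*rnormal()
        . ml model lf norm1_lf (mu: y = ) (sigma: )
        . ml maximize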

The beauty of the linear-form model is that a whole variety of other DGPs may be expressed in this same context without altering the program. This example considers a univariate estimation problem. What if, instead, we consider the model

y_t = \alpha + \beta t + \varepsilon_t,

that is, a linear trend model? We then want to estimate (by means of MLE, although we could surely use a least squares regression to solve this problem) those two parameters as well as σ, the standard error of the ε process. In this case, the LLF above is modified to include µ_i, the conditional mean of y|x, rather than the scalar parameter µ. 771ml3.html presents the results of that estimation. Since the "explanatory variable" t enters the conditional mean equation in linear form, we may merely change the ml model statement to express y as a linear function of t (by default with a constant term, α above), as sketched below. We need not modify the program to estimate this model. Notice that since we are now working with a bivariate relationship, Stata computes a Wald χ² statistic for this model which compares it with the "naive model" in which the coefficient of t is zero (that is, a model in which the time trend plays no role).
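Under the same illustrative assumptions (variable names y and t, with t a simple time index and an arbitrary trend DGP), only the ml model statement changes:

        . clear
        . set obs 200
        . generate t = _n
        . generate y = 1 + 0.5*t + 2*rnormal()     // hypothetical trend DGP
        . ml model lf norm1_lf (mu: y = t) (sigma: )
        . ml maximize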

Finally, we may also model the standard error, rather than assuming it to be common across observations. We may express ε_t in this model as being normally but not identically distributed, with standard error γ_t = σ x_t²: an instance where the relationship is becoming less precise over time (in the implementation, we demean x so that the multiplicative factor is mean zero). This assumption on the error process does not violate the independence of the errors, although their second moments are systematically related to t. In regression terms, it is a heteroskedastic regression model, in which we explicitly model the form of the heteroskedasticity. 771ml4.html presents the estimates of that model: again, we need not modify the program, but merely change the ml model statement so that we may indicate that the second "equation" being estimated (that for the standard error of ε) is non-trivial.

Properties of MLEs

MLEs are most attractive due to their large-sample or asymptotic properties; their finite-sample properties may be suboptimal. For example, the MLE of σ² in a standard regression problem involves a divisor of N, rather than N − 1, so that the MLE is biased downward. In large-sample terms, of course, this does not matter, since plim((N − 1)/N) = 1. To discuss their properties, let us define θ̂ as the MLE, θ_0 as the true parameter vector, and θ as an arbitrary parameter vector (not necessarily either of the above). Under regularity conditions to be discussed, the MLE has the following asymptotic properties:

• Consistency: plim θ̂ = θ_0.

• Asymptotic normality: θ̂ is asymptotically distributed N[θ_0, {I(θ_0)}⁻¹], where I(θ_0) = −E_0[∂² ln L/∂θ_0 ∂θ_0'].

• Asymptotic efficiency: θ̂ is asymptotically efficient and achieves the Cramér–Rao Lower Bound (CRLB) for consistent estimators.

• Invariance: The MLE of γ_0 = c(θ_0) is c(θ̂) if c(θ_0) is a continuous and continuously differentiable function.

The regularity conditions, which we will not further discuss, require that the first three derivatives of ln f(y_i|θ) with respect to θ are continuous and finite for almost all y_i and all θ. This condition guarantees the existence of a certain Taylor series approximation. Furthermore, the conditions necessary to obtain the expectations of the first and second derivatives of ln f(y_i|θ) must be met, and the third derivative must be bounded, permitting that Taylor series to be truncated. With these conditions met, we may define the moments of the derivatives of the log-likelihood function:

• 1. ln f(y_i|θ), g_i = ∂ ln f(y_i|θ)/∂θ and H_i = ∂² ln f(y_i|θ)/∂θ∂θ', i = 1, …, n, are all random samples of random variables.

• 2. E_0[g_i(θ_0)] = 0.

• 3. Var[g_i(θ_0)] = −E[H_i(θ_0)].

The first definition follows from the definition of the likelihood function. The second defines the moment condition, or score vector, by which the MLE may locate the optimum, and indicates that the MLE is one of a more general class of Generalized Method of Moments (GMM) estimators. Intuitively, this condition indicates that the gradient of the likelihood function must be zero at the optimum, requiring that the first-order conditions for a maximum be satisfied. The third condition provides the Information Matrix Equality, which defines the asymptotic covariance matrix of the MLE as being related to the expectation of the Hessian of the log-likelihood. The Hessian must be negative (semi-)definite for a maximum, corresponding to the notion that the asymptotic covariance matrix must be positive (semi-)definite.

The expected value of the log-likelihood is maximized at the true value of the parameters, which implies that the MLE will be consistent. To demonstrate the asymptotic normality of the MLE, expand the first-order conditions g(θ̂) = 0 in a second-order Taylor series around the true parameters:

g(\hat\theta) = g(\theta_0) + H(\bar\theta)(\hat\theta - \theta_0) = 0.

The Hessian is evaluated at a point θ̄ that is between θ̂ and θ_0. Rearrange this expression and multiply by √n to obtain:

\sqrt{n}\,(\hat\theta - \theta_0) = \left[-H(\bar\theta)\right]^{-1}\left[\sqrt{n}\, g(\theta_0)\right].

Since plim(θ̂ − θ_0) = 0, plim(θ̂ − θ̄) = 0 as well, and (dividing the derivatives by n):

\sqrt{n}\,(\hat\theta - \theta_0) \overset{d}{\to} \left[-\frac{1}{n}H(\theta_0)\right]^{-1}\left[\sqrt{n}\,\bar g(\theta_0)\right].

Since [√n ḡ(θ_0)] is √n times the mean of a random sample, we find the limiting variance of that expression, and can express its distribution as:

\left[\sqrt{n}\,\bar g(\theta_0)\right] \overset{d}{\to} N\left[0,\; -E_0\left[\frac{1}{n}H(\theta_0)\right]\right]

or

\sqrt{n}\,(\hat\theta - \theta_0) \overset{d}{\to} N\left[0,\; \left(-E_0\left[\frac{1}{n}H(\theta_0)\right]\right)^{-1}\right].

Combining results, we may derive the expression above for asymptotic normality of the MLE, with a covariance matrix equal to the inverse of the information matrix: minus the expectation of the mean of the Hessian, evaluated at the true parameter vector:

\hat\theta \overset{a}{\sim} N\left[\theta_0,\; [I(\theta_0)]^{-1}\right].

For the normal distribution, the second derivatives of the LLF are:

\frac{\partial^2 \ln L}{\partial\mu^2} = \frac{-n}{\sigma^2}, \qquad
\frac{\partial^2 \ln L}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^{n}(y_i-\mu)^2, \qquad
\frac{\partial^2 \ln L}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^{n}(y_i-\mu).

The expectation of the cross-partial is zero, since E[y_i] = µ. The first expression is nonstochastic, while the second has expectation −n/(2σ⁴), since each of the n terms in the sum has expected value σ². Collecting these expectations in the (diagonal) information matrix, reversing the sign and inverting, we derive the asymptotic covariance matrix for the MLE:

\left(-E_0\left[\frac{\partial^2 \ln L}{\partial\theta_0\,\partial\theta_0'}\right]\right)^{-1} =
\begin{bmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/n \end{bmatrix}.

The elements of this covariance matrix may then be calculated, following the successful maximization of the LLF, by replacing the unknown parameter σ² with its point estimate. Generally, though, the form of the expected values of the second derivatives of the LLF may be unknown, as it will be a complicated nonlinear function of the data. We can instead derive an estimate of the information matrix by evaluating the actual (not expected) second derivatives of the LLF at the ML estimates:

\left[\hat I(\hat\theta)\right]^{-1} = \left(-\frac{\partial^2 \ln L(\hat\theta)}{\partial\hat\theta\,\partial\hat\theta'}\right)^{-1}.

This computation requires that the second derivatives be evaluated, however, and expressing them analytically may be burdensome. As an alternative, we may use the outer product of gradients (OPG) estimator,

\left[\hat I(\hat\theta)\right]^{-1} = \left[\sum_{i=1}^{n}\hat g_i\,\hat g_i'\right]^{-1} = \left[\hat G'\hat G\right]^{-1},

where

\hat g_i = \frac{\partial \ln f(x_i,\hat\theta)}{\partial\hat\theta}, \qquad \hat G = [\hat g_1, \hat g_2, \dots, \hat g_n]'.

Ĝ is an n × K matrix with ith row equal to the transpose of the ith vector of derivatives with respect to the estimated parameters (the gradient vector for the ith observation, given the current guess for θ). This expression is very convenient, since it does not require evaluation of the second derivatives (i.e. computation of the Hessian). This estimate of the covariance matrix is also known as the BHHH (Berndt, Hall, Hall and Hausman) estimator. Nevertheless, if the estimator based on the second derivatives is available, it will often be preferable, especially in small to moderately sized samples.

Many econometric software packages provide the option to calculate ML estimates by alternative methods, or by combinations of algorithms. For instance, Stata's ml command will default to a Newton–Raphson method, technique(nr), but will also allow specification of the BHHH method (technique(bhhh)), the Davidon–Fletcher–Powell method (technique(dfp)) or the Broyden–Fletcher–Goldfarb–Shanno method (technique(bfgs)). These techniques are described in Gould, Pitblado and Sribney, Maximum Likelihood Estimation with Stata, 2d ed. (2003). All four of these methods may be employed, by themselves or in combination, with Stata's lf (linear form) method. For all but the bhhh method, Stata computes estimates of the covariance matrix via vce(oim), utilizing the observed information matrix (the inverse of the negative Hessian). Alternatively, OPG standard errors may be computed via vce(opg), which is the default method when technique(bhhh) is used. Most software that produces maximum likelihood estimates affords similar capabilities.
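As a sketch of how such an option is specified, reusing the illustrative norm1_lf setup from above (the particular algorithm chosen here is arbitrary):

        . ml model lf norm1_lf (mu: y = t) (sigma: ), technique(bhhh)
        . ml maximize        // with technique(bhhh), OPG (vce(opg)) standard errors are the default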

Asymptotically equivalent test procedures

We consider MLE of a parameter vector θ_0 and a test of the hypothesis c(θ_0) = q. There are three approaches that we might use to test the hypothesis:

• Likelihood ratio (LR) test. If the restrictions c(θ_0) = q are valid, then imposing them will not lead to a large reduction in the LLF. The test is thus based on the ratio λ = L̂_R / L̂_U, or in terms of log-likelihoods on the difference (ln L_R − ln L_U), where L_U is the likelihood function value of the unrestricted estimate and L_R is the likelihood function value of the restricted estimate. Under the null hypothesis, −2(ln L_R − ln L_U) ∼ χ², with degrees of freedom equal to the number of restrictions imposed in c(·). This is the natural test to apply in a MLE setting, but it requires that we solve two MLE problems, since both the unrestricted and restricted estimates must be computed.

• Wald test. If the restrictions are valid, then c(θ̂_MLE) should be close to q, since the MLE is consistent. The Wald test statistic is then the quadratic form

  \left[c(\hat\theta)-q\right]'\left(\text{Asy.Var}\left[c(\hat\theta)-q\right]\right)^{-1}\left[c(\hat\theta)-q\right].

  Under the null hypothesis, this quadratic form is distributed as χ², with degrees of freedom equal to the number of restrictions imposed in c(·). The Wald test only requires computation of the unrestricted model.

• Lagrange multiplier (LM, or score) test. If the restrictions are valid, the restricted estimates should be near the point that maximizes the LLF, and the slope of the LLF should be near zero at the restricted estimator. The test is based on the slope of the LLF at the point where the function is maximized subject to the restrictions, and is a quadratic form in the scores (first derivatives) of the restricted LLF:

  \left(\frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R}\right)'\left[I(\hat\theta_R)\right]^{-1}\left(\frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R}\right).

  Under the null hypothesis, this quadratic form is distributed as χ², with degrees of freedom equal to the number of restrictions imposed in c(·). The LM test only requires computation of the restricted model.

All three of these tests have the same asymptotic distribution, and are thus equivalent. However, in finite samples, they will yield different results, and may even lead to different inferences. Ease of computation is sometimes an issue; for instance, estimating a restricted model in the context of linear regression is often difficult if the constraints are nonlinear. In that case, the Wald test (which does not require explicit estimation of the restricted model) may be preferable. On the other hand, if the model is nonlinear in the parameters and the restrictions render it linear, the LM test may be easily computed. For an explicit MLE problem, the LR test is usually straightforward. For a linear model, it has been shown that the χ² statistics from the three tests follow the ordering

W ≥ LR ≥ LM,

so that the LM test will be the most conservative (if it rejects, the others will). (A graphical representation of the three tests appeared here as a figure in the original notes.)

In Stata, the test command performs Wald tests of linear constraints. The testnl command performs Wald-type tests via the "delta method," an approximation appropriate in large samples. The lrtest command performs likelihood ratio tests after maximum likelihood estimation. Many commonly employed tests are actually LM tests: in particular, any test which may be described as an "n R²" test is an LM test. Note also that many of these test procedures will generate F-statistics rather than χ² statistics. Stata uses the rule that if small-sample inferences are presented for the parameter estimates (e.g. if t statistics are computed from the parameters' estimated standard errors), then tests based on those estimates will be reported as F. If large-sample inferences are presented (so that z statistics are computed from the parameters' estimated standard errors), those tests are reported as χ². Since the F statistic is the ratio of two independent χ² variables, each divided by its degrees of freedom, and in most large-sample problems the denominator converges to one, it follows that F ≈ χ²/j, where j is the number of restrictions. For a single restriction (j = 1), the reported F(j, ∞) and χ²(j) statistics will be identical.
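A minimal sketch of a likelihood-ratio test in this framework, reusing the illustrative trend model from above (the stored-estimate names are arbitrary):

        . ml model lf norm1_lf (mu: y = t) (sigma: )
        . ml maximize
        . estimates store unrestricted
        . ml model lf norm1_lf (mu: y = ) (sigma: )
        . ml maximize
        . estimates store restricted
        . lrtest unrestricted restricted     // LR test of the restriction that the trend coefficient is zero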