Econometrics - Lecture 6 GMM-Estimator and Econometric Models.
Econometrics I - Stanford Universitydoubleh/eco270/pointestimation.pdf · 2016-11-16 ·...
Econometrics I
Department of Economics, Stanford University
November, 2016
Part II
Topics
• Point Estimation.
• Interval Estimation
• Hypothesis Testing
• Sufficiency and Data Reduction (maybe).
Different Approaches
• Frequentist: There exists a true parameter value θ0.
• Bayesian: θ is a random variable. Prior+Data =⇒ Posterior.
• Fiducial Inference: No prior. Data =⇒ Posterior. Or Bayesian with a uniform (diffuse) prior.
Data, sample of size n
• I.I.D Sampling (sampling with replacement)
• The study of different sampling schemes is a science in itself.
• Usual notation: $X_1, \ldots, X_n$, $Y_1, \ldots, Y_n$, $Z_1, \ldots, Z_n$.
• Xn, Yn, Zn.
• Parameter $\theta$: a function(al) of the distribution, for example the mean:
$$\mu(F_X(\cdot)) = \int x f_X(x)\,dx = \int x\,dF_X(x).$$
• Estimators: a function of the data:
$$\hat\theta = \varphi_n(\mathbf{X}_n) = \varphi_n(X_1, X_2, \ldots, X_n).$$
• Strictly speaking, a sequence of functions of the data, since it is a different function for a different n. For example:
$$\hat\theta = \bar X_n = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
• Estimate: a realized value of the estimator.
• Is $\hat\theta = 1/2$ an estimator or an estimate?
• Empirical Distribution Function (EDF):
$$\hat F_X(x) = \frac{1}{n}\sum_{i=1}^n 1(X_i \le x).$$
• Analog principle: replace the true population value (CDF) with the estimated sample value (EDF):
$$\bar X_n = \hat\mu = \int x\,d\hat F_X(x).$$
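The EDF and the analog principle above can be sketched in a few lines of Python. This is only an illustration: the function names `edf` and `analog_mean` and the four data points are assumptions made for the example, not part of the lecture.

```python
from bisect import bisect_right

def edf(sample):
    """Return the empirical distribution function of `sample` as a callable."""
    xs = sorted(sample)
    n = len(xs)

    def F_hat(x):
        # bisect_right counts how many observations satisfy X_i <= x
        return bisect_right(xs, x) / n

    return F_hat

def analog_mean(sample):
    """Integrate x against the EDF: each observation carries mass 1/n."""
    n = len(sample)
    return sum(x * (1.0 / n) for x in sample)

data = [2.0, 0.5, 1.5, 1.0]   # hypothetical sample
F_hat = edf(data)
print(F_hat(1.0))             # share of observations <= 1.0
print(analog_mean(data))      # coincides with the sample mean
```

Because the EDF puts mass 1/n on each observation, integrating x against it reproduces the sample mean exactly.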
• Properties of Estimators:
  • Finite sample properties:
    • Unbiasedness
    • Mean Square Error
    • Finite sample distribution
  • Asymptotic properties:
    • Consistency
    • Asymptotic Distribution
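• Unbiasedness:
$$E_\theta \hat\theta = \int \cdots \int \hat\theta(X_1, \ldots, X_n)\, f(X_1, \ldots, X_n \mid \theta)\, dX_1 \cdots dX_n = \theta.$$
• MSE (function of $\theta$, or $\theta_0$):
$$\mathrm{MSE} = E\left(\hat\theta - \theta_0\right)^2 = E_\theta\left(\hat\theta - \theta\right)^2$$
$$\mathrm{MSE} = \mathrm{Var}(\hat\theta) + \left(E(\hat\theta - \theta_0)\right)^2$$
• More general loss functions: $E_\theta\, \ell(\hat\theta - \theta)$.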
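• Suppose $X_1, X_2 \sim$ i.i.d. Bernoulli(p).
$$\hat p_1 = \frac{X_1 + X_2}{2}, \qquad \hat p_2 = X_1, \qquad \hat p_3 = \frac{1}{2}.$$
$\hat p_1$ and $\hat p_2$ are unbiased. $\hat p_3$ is biased.
• MSE:
$$\mathrm{MSE}(\hat p_1) = \mathrm{Var}(\hat p_1) = \frac{1}{2} p(1-p)$$
$$\mathrm{MSE}(\hat p_2) = \mathrm{Var}(\hat p_2) = p(1-p)$$
$$\mathrm{MSE}(\hat p_3) = \mathrm{Bias}(\hat p_3)^2 = \left(\frac{1}{2} - p\right)^2.$$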
• $\hat p_2$ is inadmissible: $\hat p_1$ is better.
• $\hat\theta$ is admissible if there is no other estimator that is better (in the MSE sense) than $\hat\theta$ for some p and at least as good as $\hat\theta$ for all p.
• $\hat p_1$ and $\hat p_3$ are admissible: we cannot choose one over the other because p is unknown. A typical Bayesian estimator is $\hat p_4 = w \hat p_1 + (1-w) \hat p_3$.
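The MSE comparison in this Bernoulli example can be checked numerically. The sketch below evaluates the three closed-form MSE expressions on a grid of p values. The grid and the function names are illustrative assumptions.

```python
def mse_p1(p):
    # (X1 + X2) / 2 is unbiased, so its MSE is its variance p(1-p)/2
    return 0.5 * p * (1 - p)

def mse_p2(p):
    # X1 alone is unbiased with variance p(1-p)
    return p * (1 - p)

def mse_p3(p):
    # The constant 1/2 has zero variance and bias (1/2 - p)
    return (0.5 - p) ** 2

grid = [i / 100 for i in range(101)]

# p1 is never worse than p2 anywhere on the grid, so p2 is dominated
p2_dominated = all(mse_p1(p) <= mse_p2(p) for p in grid)

# Neither p1 nor p3 dominates the other: each wins for some p
p3_wins_at_half = mse_p3(0.5) < mse_p1(0.5)
p1_wins_at_tenth = mse_p1(0.1) < mse_p3(0.1)

print(p2_dominated, p3_wins_at_half, p1_wins_at_tenth)
```

The grid check shows why the ranking of the first and third estimators depends on the unknown p, whereas the second is beaten uniformly.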
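• $\bar X_n$ is the best linear unbiased estimator (BLUE).
• Linearity: $\hat\theta = \sum_{i=1}^n \omega_i X_i$
• Unbiasedness: $E\hat\theta = \theta \iff \sum_{i=1}^n \omega_i = 1$.
$$\mathrm{MSE}(\hat\theta) = \sum_{i=1}^n \omega_i^2 \sigma^2 = \mathrm{Var}(\hat\theta)$$
$$(\hat\omega_1, \hat\omega_2, \ldots, \hat\omega_n) = \arg\min_{\omega_1, \omega_2, \ldots, \omega_n} \sum_{i=1}^n \omega_i^2 \quad \text{such that} \quad \sum_{i=1}^n \omega_i = 1.$$
Solution:
$$\hat\omega_1 = \hat\omega_2 = \hat\omega_3 = \cdots = \hat\omega_n = \frac{1}{n}.$$
• $\bar X_n$ is BLUE (Gauss-Markov).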
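One way to see where the equal-weight solution comes from is a Lagrangian argument. The multiplier $\lambda$ and the symbol $\mathcal{L}$ are notation introduced here for illustration. They do not appear in the slides.

```latex
\mathcal{L}(\omega, \lambda) = \sum_{i=1}^n \omega_i^2 - \lambda\Big(\sum_{i=1}^n \omega_i - 1\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial \omega_i} = 2\omega_i - \lambda = 0
\;\Longrightarrow\; \omega_i = \frac{\lambda}{2},
\qquad
\sum_{i=1}^n \omega_i = \frac{n\lambda}{2} = 1
\;\Longrightarrow\; \omega_i = \frac{1}{n}.
```

Since the objective is strictly convex and the constraint is linear, this stationary point is the unique minimizer.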
• What can be better than $\bar X_n$?
  • Nonlinear and unbiased estimator
  • Linear biased estimator
  • Nonlinear and biased estimator
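• Example 1: $X_i \sim$ i.i.d. Uniform$(0, \theta)$, $\mu = EX = \int x\,dF_X(x) = \frac{\theta}{2}$. Want to estimate $\mu$.
• Two candidate estimators:
$$\hat\mu_1 = \bar X_n, \qquad \hat\mu_2 = \frac{n+1}{2n} Z_n, \qquad Z_n = \max(X_1, \ldots, X_n).$$
• Since $Z_n < \theta$, bias correct by multiplying by $\frac{n+1}{n}$.
$$\mathrm{MSE}(\hat\mu_1) = \mathrm{Var}(\bar X_n) = \frac{\sigma^2}{n} = \frac{\theta^2}{12n}$$
$$\mathrm{MSE}(\hat\mu_2) = \mathrm{Var}(\hat\mu_2) + \mathrm{bias}(\hat\mu_2)^2$$
$$F_{Z_n}(z) = P(\max(X_1, \ldots, X_n) \le z) = \prod_{i=1}^n P(X_i \le z) = \left(\frac{z}{\theta}\right)^n, \qquad f_{Z_n}(z) = \frac{\partial}{\partial z} F_{Z_n}(z) = n\,\frac{z^{n-1}}{\theta^n}, \quad 0 \le z \le \theta.$$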
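• Moments of $Z_n$:
$$E Z_n = \int_0^\theta z\,\frac{n z^{n-1}}{\theta^n}\,dz = \frac{n}{n+1}\theta$$
$$E Z_n^2 = \int_0^\theta z^2\,\frac{n z^{n-1}}{\theta^n}\,dz = \frac{n}{n+2}\theta^2$$
$$\mathrm{Var}(Z_n) = E Z_n^2 - (E Z_n)^2 = \frac{n}{n+2}\theta^2 - \left[\frac{n}{n+1}\theta\right]^2 = \frac{n}{(n+2)(n+1)^2}\theta^2$$
$$\mathrm{Var}(\hat\mu_2) = \frac{(n+1)^2}{4n^2}\mathrm{Var}(Z_n) = \frac{\theta^2}{4n(n+2)}$$
$$\mathrm{Bias}(\hat\mu_2) = E\hat\mu_2 - \frac{\theta}{2} = 0$$
$$\mathrm{MSE}(\hat\mu_1) - \mathrm{MSE}(\hat\mu_2) = \frac{1}{12n}\theta^2 - \frac{1}{4n(n+2)}\theta^2 = \theta^2\left(\frac{1}{12n} - \frac{1}{4n(n+2)}\right) > 0 \quad \text{if } n > 1.$$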
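The closed-form MSE comparison above can be cross-checked by simulation. The following is a rough Monte Carlo sketch using only Python's standard library. The values θ = 2, n = 10, the number of replications and the seed are arbitrary assumptions chosen for illustration.

```python
import random

def closed_form_mse(theta, n):
    """MSE of the sample mean and of the bias-corrected maximum, from the formulas above."""
    mse_mean = theta ** 2 / (12 * n)
    mse_max = theta ** 2 / (4 * n * (n + 2))
    return mse_mean, mse_max

def simulated_mse(theta, n, reps=20000, seed=0):
    """Monte Carlo estimate of both MSEs for X_i ~ Uniform(0, theta)."""
    rng = random.Random(seed)
    mu = theta / 2
    se1 = se2 = 0.0
    for _ in range(reps):
        xs = [rng.uniform(0, theta) for _ in range(n)]
        mu1 = sum(xs) / n                      # sample mean
        mu2 = (n + 1) / (2 * n) * max(xs)      # bias-corrected maximum
        se1 += (mu1 - mu) ** 2
        se2 += (mu2 - mu) ** 2
    return se1 / reps, se2 / reps

exact = closed_form_mse(2.0, 10)
approx = simulated_mse(2.0, 10)
print(exact, approx)
```

The simulated values should be close to the closed-form ones, with the estimator based on the maximum showing a markedly smaller MSE.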
• Large Sample Analysis
• Weak consistency: $\hat\theta_n \xrightarrow{p} \theta_0$ as $n \to \infty$.
• Strong consistency: $\hat\theta_n \xrightarrow{a.s.} \theta_0$ as $n \to \infty$.
• Rate of Convergence and Asymptotic Distribution (typicallynormal)
• Asymptotic Efficiency: this can be a difficult concept.
• Maximum Likelihood Estimator
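• Likelihood function, a random function: $f(\mathbf{X}_n \mid \theta) \equiv L(\theta \mid \mathbf{X}_n)$
• Joint likelihood, conditional likelihood, marginal likelihood, partial likelihood.
• If $\mathbf{X}_n = (X_1, \ldots, X_n)$ is i.i.d., then $f(\mathbf{X}_n \mid \theta) = \prod_{i=1}^n f(X_i \mid \theta)$.
• We can define $\hat\theta_{MLE} = \arg\max_{\theta \in \Theta} f(\mathbf{X}_n \mid \theta) \equiv \arg\max_{\theta \in \Theta} L(\theta \mid \mathbf{X}_n)$.
• But for computational and statistical reasons, define
$$\hat\theta_{LMLE} = \arg\max_{\theta \in \Theta} \log L(\theta \mid \mathbf{X}_n) \overset{i.i.d.}{=} \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log f(X_i \mid \theta).$$
• Since the logarithm is strictly increasing, $\hat\theta_{LMLE} = \hat\theta_{MLE}$ whenever both can be computed. But often $\hat\theta_{LMLE}$ can be computed numerically while $\hat\theta_{MLE}$ cannot. E.g., if $\log L(\theta \mid \mathbf{X}_n) \approx -500$, then $L(\theta \mid \mathbf{X}_n) \approx 0$.
• Equivalently, using the average log likelihood (to facilitate the proofs),
$$\hat\theta = \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \log f(X_i \mid \theta) = \arg\max_{\theta \in \Theta} \frac{1}{n} \log L(\theta \mid \mathbf{X}_n).$$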
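The numerical point about working with the log likelihood can be illustrated with a small sketch. It assumes a standard normal model and a deterministic, made-up data set of 2000 points. Both choices are illustrative assumptions.

```python
import math

def normal_log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def likelihood(xs, mu, sigma):
    """Product of densities: prone to floating-point underflow for large n."""
    value = 1.0
    for x in xs:
        value *= math.exp(normal_log_pdf(x, mu, sigma))
    return value

def log_likelihood(xs, mu, sigma):
    """Sum of log densities: stays on a manageable scale."""
    return sum(normal_log_pdf(x, mu, sigma) for x in xs)

# Hypothetical data: 2000 points cycling through -1.5, -1.0, ..., 1.5
xs = [((i % 7) - 3) * 0.5 for i in range(2000)]

lik = likelihood(xs, 0.0, 1.0)
loglik = log_likelihood(xs, 0.0, 1.0)
print(lik, loglik)
```

Each density is below 0.4, so the product of 2000 of them falls below the smallest positive double and is stored as zero, while the log likelihood remains an ordinary finite number that an optimizer can work with.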
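• Example 1: $X_i$ i.i.d. with
$$X_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}$$
so that $\theta = p$ and
$$L(\theta \mid \mathbf{X}_n) = \prod_{i=1}^n p(X_i \mid \theta) = \prod_{i=1}^n p^{X_i}(1-p)^{1-X_i} = p^{\sum X_i}(1-p)^{n - \sum X_i}$$
$$\max_p \log L(p \mid \mathbf{X}_n) = \sum_{i=1}^n \left(X_i \log p + (1 - X_i)\log(1-p)\right)$$
• First order condition:
$$\frac{\partial \log L(p \mid \mathbf{X}_n)}{\partial p} = \frac{1}{p}\sum X_i - \frac{1}{1-p}\sum (1 - X_i) = \sum \frac{X_i - p}{p(1-p)} = 0 \implies \hat p = \frac{1}{n}\sum_{i=1}^n X_i.$$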
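• Example 2: $X_i \sim N(\mu, \sigma^2)$, $\theta = (\mu, \sigma^2)$.
$$L(\theta \mid \mathbf{X}_n) = \prod_{i=1}^n f(X_i \mid \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i - \mu)^2}{2\sigma^2}\right)$$
$$\log L(\theta \mid \mathbf{X}_n) = C + \sum_{i=1}^n \left[\log\frac{1}{\sigma} - \frac{(X_i - \mu)^2}{2\sigma^2}\right]$$
$$\hat\mu = \arg\min_\mu \sum_{i=1}^n (X_i - \mu)^2 = \bar X_n$$
$$\frac{\partial \log L(\theta \mid \mathbf{X}_n)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_{i=1}^n (X_i - \hat\mu)^2}{2\sigma^4} = 0$$
• $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat\mu)^2$.
• $\hat\sigma^2$ is biased, but
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \hat\mu)^2$$
is unbiased.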
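As a small numerical companion to this example, the sketch below computes the MLE pair and the unbiased variance estimator for a made-up sample, and cross-checks them against Python's `statistics` module. The helper name `normal_mle` and the data are assumptions for illustration.

```python
import statistics

def normal_mle(xs):
    """Return (mu_hat, sigma2_hat, s2): the MLEs and the unbiased variance estimator."""
    n = len(xs)
    mu_hat = sum(xs) / n
    ss = sum((x - mu_hat) ** 2 for x in xs)   # sum of squared deviations
    sigma2_hat = ss / n                       # MLE, divides by n (biased)
    s2 = ss / (n - 1)                         # divides by n - 1 (unbiased)
    return mu_hat, sigma2_hat, s2

xs = [1.0, 2.0, 4.0, 5.0]                     # hypothetical sample
mu_hat, sigma2_hat, s2 = normal_mle(xs)

# statistics.pvariance divides by n, statistics.variance by n - 1
print(mu_hat, sigma2_hat, statistics.pvariance(xs))
print(s2, statistics.variance(xs))
```

The two estimators differ only by the factor (n − 1)/n, which is the source of the bias of the MLE.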
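• Example 3: $X_i \sim N(\mu, \Sigma)$ is a $k \times 1$ vector, $\theta = (\mu, \Sigma)$, with $k + \frac{k(k+1)}{2}$ parameters.
$$L(\theta \mid \mathbf{X}_n) = \prod_{i=1}^n f(X_i \mid \theta) = \prod_{i=1}^n \frac{1}{(\sqrt{2\pi})^k |\Sigma|^{1/2}} \exp\left(-\frac{(X_i - \mu)' \Sigma^{-1} (X_i - \mu)}{2}\right)$$
$$\log L(\theta \mid \mathbf{X}_n) = C + \frac{n}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^n (X_i - \mu)' \Sigma^{-1} (X_i - \mu)$$
• Recall (from Dhrymes' book):
$$\frac{\partial x'Ax}{\partial x} = (A + A')x = 2Ax \text{ if } A \text{ is symmetric}$$
$$\frac{\partial}{\partial A}\log|A| = (A')^{-1} = A^{-1} \text{ for symmetric } A, \text{ using principal minors and cofactors}$$
$$\frac{\partial}{\partial A}\mathrm{tr}(AB) = B', \qquad \mathrm{tr}(AB) = \mathrm{tr}(BA), \qquad \mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB).$$
Applying the trace identities,
$$\log L(\theta \mid \mathbf{X}_n) = C + \frac{n}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^n \mathrm{tr}\left((X_i - \mu)' \Sigma^{-1} (X_i - \mu)\right) = C + \frac{n}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^n \mathrm{tr}\left(\Sigma^{-1} (X_i - \mu)(X_i - \mu)'\right)$$
$$\frac{\partial}{\partial \mu}\log L(\theta \mid \mathbf{X}_n) = \sum_{i=1}^n \Sigma^{-1}(X_i - \mu) = 0 \implies \Sigma^{-1}\sum_{i=1}^n (X_i - \mu) = 0 \implies \hat\mu = \bar X_n.$$
$$\frac{\partial}{\partial \Sigma^{-1}}\log L(\hat\mu, \Sigma^{-1} \mid \mathbf{X}_n) = \frac{n}{2}\Sigma - \frac{1}{2}\sum_{i=1}^n (X_i - \hat\mu)(X_i - \hat\mu)' = 0$$
$$\hat\Sigma = \frac{1}{n}\sum_{i=1}^n (X_i - \hat\mu)(X_i - \hat\mu)' = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)(X_i - \bar X_n)'.$$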
Often the MLE cannot be computed by hand. Write $Q_n(\theta) = \log L(\theta \mid \mathbf{X}_n)$ and maximize it numerically:
• Newton Raphson Iteration (max of quadratic approximation)
• Stochastic Optimization
• (Ken Judd) Numerical Methods for Economists
• Root finding: Bisection, Gauss-Newton Iteration
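Initial guess: $\theta^{(0)}$. Take a second-order Taylor expansion around it:
$$Q(\theta) \approx Q(\theta^{(0)}) + \frac{\partial Q(\theta^{(0)})}{\partial \theta'}\left(\theta - \theta^{(0)}\right) + \frac{1}{2}\left(\theta - \theta^{(0)}\right)' \frac{\partial^2 Q(\theta^{(0)})}{\partial\theta\,\partial\theta'}\left(\theta - \theta^{(0)}\right) \equiv \tilde Q(\theta, \theta^{(0)})$$
Maximizing the quadratic approximation $\tilde Q$:
$$0 = \frac{\partial \tilde Q(\theta, \theta^{(0)})}{\partial\theta} = \frac{\partial Q(\theta^{(0)})}{\partial\theta} + \frac{\partial^2 Q(\theta^{(0)})}{\partial\theta\,\partial\theta'}\left(\theta - \theta^{(0)}\right)$$
$$\theta^{(1)} = \theta^{(0)} - \left(\frac{\partial^2 Q(\theta^{(0)})}{\partial\theta\,\partial\theta'}\right)^{-1} \frac{\partial Q(\theta^{(0)})}{\partial\theta}.$$
In general
$$\theta^{(t+1)} = \theta^{(t)} - \left(\frac{\partial^2 Q(\theta^{(t)})}{\partial\theta\,\partial\theta'}\right)^{-1} \frac{\partial Q(\theta^{(t)})}{\partial\theta}.$$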
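The iteration above can be sketched for a scalar parameter as follows. The example applies it to the average Bernoulli log likelihood, where the answer is known to be the sample mean. The function name, tolerance, starting value and data are illustrative assumptions, and a serious implementation would add safeguards such as step damping and bounds checks.

```python
def newton_raphson(grad, hess, theta0, tol=1e-12, max_iter=50):
    """Scalar Newton-Raphson: theta <- theta - grad(theta) / hess(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = grad(theta) / hess(theta)
        theta -= step
        if abs(step) < tol:   # stop once the update is negligible
            break
    return theta

xs = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # hypothetical Bernoulli sample
xbar = sum(xs) / len(xs)

# First and second derivatives of Q_n(p) = xbar*log(p) + (1 - xbar)*log(1 - p)
grad = lambda p: xbar / p - (1 - xbar) / (1 - p)
hess = lambda p: -xbar / p ** 2 - (1 - xbar) / (1 - p) ** 2

p_hat = newton_raphson(grad, hess, theta0=0.4)
print(p_hat, xbar)
```

Because the Bernoulli log likelihood is concave in p, the iteration started at 0.4 settles on the analytical MLE, the sample mean.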
Hopefully, $\theta^{(t)} \to \hat\theta_{MLE}$ as $t \to \infty$.
Statistical properties of MLE
• Finite sample: sometimes unbiased, and sometimes biased.
• Large sample.
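Cramér-Rao Lower Bound (CRLB): Under some regularity conditions (including the support of $X_i$ not depending on $\theta$), any unbiased estimator $\hat\theta$ of $\theta$ has a variance that is no smaller than
$$\mathrm{Var}\left(\frac{\partial \log f(\mathbf{X}_n \mid \theta)}{\partial\theta}\right)^{-1}.$$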
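Proof: Unbiasedness $\iff E\hat\theta = \theta_0 \iff E_\theta\hat\theta = \theta$ for all $\theta$:
$$\int \hat\theta(\mathbf{X}_n) f(\mathbf{X}_n \mid \theta)\,d\mathbf{X}_n = \theta$$
$$\frac{\partial}{\partial\theta}\int \hat\theta(\mathbf{X}_n) f(\mathbf{X}_n \mid \theta)\,d\mathbf{X}_n = I$$
Under regularity conditions (which allow differentiation under the integral sign),
$$\int \hat\theta(\mathbf{X}_n) \frac{\partial}{\partial\theta} f(\mathbf{X}_n \mid \theta)\,d\mathbf{X}_n = I$$
$$\int \hat\theta(\mathbf{X}_n) \frac{\frac{\partial}{\partial\theta} f(\mathbf{X}_n \mid \theta)}{f(\mathbf{X}_n \mid \theta)} f(\mathbf{X}_n \mid \theta)\,d\mathbf{X}_n = I$$
$$\int \hat\theta(\mathbf{X}_n) \frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta)\, f(\mathbf{X}_n \mid \theta)\,d\mathbf{X}_n = I$$
$$E\left[\hat\theta(\mathbf{X}_n) \frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta)\right] = I.$$
Because $E\left[\frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta)\right] = 0$,
$$\mathrm{Cov}\left(\hat\theta(\mathbf{X}_n), \frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta)\right) = I.$$
Suppose $\theta$ is a scalar. Using the Cauchy-Schwarz inequality,
$$1 = \mathrm{Cov}^2\left(\hat\theta(\mathbf{X}_n), \frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta)\right) \le \mathrm{Var}\left(\hat\theta(\mathbf{X}_n)\right) \mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta)\right)$$
so that
$$\mathrm{Var}\left(\hat\theta(\mathbf{X}_n)\right) \ge \mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta)\right)^{-1}.$$
The vector version of the Cauchy-Schwarz inequality takes more work. For vectors U, V:
$$\mathrm{Var}(V) \ge \mathrm{Cov}(V, U)\,\mathrm{Var}(U)^{-1}\,\mathrm{Cov}(U, V)$$
in the sense that the difference is positive semidefinite. Let $U = \frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta)$ and $V = \hat\theta(\mathbf{X}_n)$. Then $\mathrm{Cov}(U, V) = I$ and
$$\mathrm{Var}\left(\hat\theta(\mathbf{X}_n)\right) \ge \mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta)\right)^{-1}.$$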
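If an unbiased estimator achieves the CRLB, then it must be the best (minimum variance) unbiased estimator.
Example of CRLB achievement: Bernoulli, $X_i = 1$ with probability p, $X_i = 0$ with probability $1 - p$.
$$\log f(\mathbf{X}_n \mid p) = \sum_{i=1}^n \left(X_i \log p + (1 - X_i)\log(1 - p)\right)$$
$$\frac{\partial \log f(\mathbf{X}_n \mid p)}{\partial p} = \sum_{i=1}^n \frac{X_i - p}{p(1-p)}.$$
$$\mathrm{CRLB} = \mathrm{Var}\left(\frac{\partial}{\partial p}\log f(\mathbf{X}_n \mid p)\right)^{-1} = \frac{p(1-p)}{n}.$$
Since $\mathrm{Var}(\bar X_n) = p(1-p)/n$, the unbiased estimator $\hat p = \bar X_n$ attains the bound.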
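For a small sample size the Bernoulli CRLB can be verified exactly by enumerating every possible sample. The sketch below does this for p = 0.3 and n = 4. Both values and the function name are arbitrary assumptions.

```python
from itertools import product

def bernoulli_crlb_check(p, n):
    """Enumerate all 2**n samples; return (Var of the sample mean, 1 / Var of the score)."""
    e_hat = e_hat2 = e_score = e_score2 = 0.0
    for xs in product((0, 1), repeat=n):
        k = sum(xs)
        prob = p ** k * (1 - p) ** (n - k)       # probability of this exact sample
        p_hat = k / n                            # estimator: sample mean
        score = (k - n * p) / (p * (1 - p))      # derivative of the log likelihood
        e_hat += prob * p_hat
        e_hat2 += prob * p_hat ** 2
        e_score += prob * score
        e_score2 += prob * score ** 2
    var_hat = e_hat2 - e_hat ** 2
    var_score = e_score2 - e_score ** 2
    return var_hat, 1.0 / var_score

var_hat, crlb = bernoulli_crlb_check(p=0.3, n=4)
print(var_hat, crlb)
```

Both numbers agree with p(1 − p)/n up to floating-point error, which is the sense in which the sample mean attains the bound.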
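Information matrix and the information matrix equality:
$$I(\theta) = -E\frac{\partial^2 \log f(\mathbf{X}_n \mid \theta_0)}{\partial\theta\,\partial\theta'} = \mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta_0)\right).$$
$$E\frac{\partial^2 \log f(\mathbf{X}_n \mid \theta_0)}{\partial\theta\,\partial\theta'} + E\left[\frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta_0)\,\frac{\partial}{\partial\theta}\log f(\mathbf{X}_n \mid \theta_0)'\right] = 0$$
This follows from totally differentiating
$$E\frac{\partial \log f(\mathbf{X}_n \mid \theta)}{\partial\theta} = \int \frac{\partial \log f(\mathbf{X}_n \mid \theta)}{\partial\theta} f(\mathbf{X}_n \mid \theta)\,d\mathbf{X}_n = 0$$
with respect to θ.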
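Spelled out, the differentiation step runs as follows. This is a sketch that assumes differentiation and integration can be interchanged, and writes $f$ for $f(\mathbf{X}_n \mid \theta)$ to save space.

```latex
\begin{aligned}
0 &= \frac{\partial}{\partial\theta'} \int \frac{\partial \log f}{\partial\theta}\, f \, d\mathbf{X}_n \\
  &= \int \frac{\partial^2 \log f}{\partial\theta\,\partial\theta'}\, f \, d\mathbf{X}_n
   + \int \frac{\partial \log f}{\partial\theta}\, \frac{\partial f}{\partial\theta'} \, d\mathbf{X}_n \\
  &= E\,\frac{\partial^2 \log f}{\partial\theta\,\partial\theta'}
   + E\left[\frac{\partial \log f}{\partial\theta}\, \frac{\partial \log f}{\partial\theta'}\right],
\end{aligned}
```

where the last line uses $\partial f / \partial\theta' = f \cdot \partial \log f / \partial\theta'$. Since the score has mean zero, the second expectation is its variance.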
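Normal Example:
$$f(X_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i - \mu)^2}{2\sigma^2}\right)$$
$$\log L(\theta \mid \mathbf{X}_n) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2.$$
First order conditions:
$$\mu: \quad \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu) = 0 \implies \hat\mu_{MLE} = \bar X_n$$
$$\sigma^2: \quad -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (X_i - \mu)^2 = 0 \implies \hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2$$
$$\hat\theta_{MLE} = \begin{pmatrix} \bar X_n \\ \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2 \end{pmatrix}$$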
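To compute the CRLB, note
$$-E\frac{\partial^2 \log f(\mathbf{X}_n \mid \theta_0)}{\partial\theta\,\partial\theta'} = -E\begin{pmatrix} \frac{\partial^2 \log f(\mathbf{X}_n \mid \theta)}{\partial\mu^2} & \frac{\partial^2 \log f(\mathbf{X}_n \mid \theta)}{\partial\mu\,\partial\sigma^2} \\ \frac{\partial^2 \log f(\mathbf{X}_n \mid \theta)}{\partial\mu\,\partial\sigma^2} & \frac{\partial^2 \log f(\mathbf{X}_n \mid \theta)}{\partial(\sigma^2)^2} \end{pmatrix}$$
$$= -E\begin{pmatrix} -\frac{n}{\sigma^2} & -\frac{1}{\sigma^4}\sum (X_i - \mu) \\ -\frac{1}{\sigma^4}\sum (X_i - \mu) & \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^n (X_i - \mu)^2 \end{pmatrix} = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & -\frac{n}{2\sigma^4} + \frac{n\sigma^2}{\sigma^6} = \frac{n}{2\sigma^4} \end{pmatrix}$$
Inverting this diagonal matrix gives a CRLB of $\sigma^2/n$ for $\mu$ and $2\sigma^4/n$ for $\sigma^2$.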
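$\hat\mu$ is unbiased and achieves the CRLB:
$$\mathrm{Var}(\hat\mu) = \sigma^2/n$$
$\hat\sigma^2$ is biased and has a variance even lower than the CRLB:
$$\mathrm{Var}\left(\hat\sigma^2_{MLE}\right) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2\right) = \frac{\sigma^4}{n^2}\mathrm{Var}\left(\sum_{i=1}^n \left(\frac{X_i - \bar X_n}{\sigma}\right)^2\right)$$
But $\sum_{i=1}^n \left(\frac{X_i - \bar X_n}{\sigma}\right)^2$ is $\chi^2_{n-1}$, where $\mathrm{Var}(\chi^2_k) = 2k$, so that
$$\mathrm{Var}\left(\hat\sigma^2_{MLE}\right) = \frac{2(n-1)}{n^2}\sigma^4 < \mathrm{CRLB} = 2\sigma^4/n.$$
This does not contradict the bound, which applies only to unbiased estimators.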
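Unbiased variance estimator:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2, \qquad E S^2 = \sigma^2$$
But
$$\mathrm{Var}(S^2) = \mathrm{Var}\left(\frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2\right) = \frac{\sigma^4}{(n-1)^2}\mathrm{Var}\left(\sum_{i=1}^n \left(\frac{X_i - \bar X_n}{\sigma}\right)^2\right) = \frac{2(n-1)\sigma^4}{(n-1)^2} = \frac{2\sigma^4}{n-1} > \frac{2\sigma^4}{n} = \mathrm{CRLB}.$$
$S^2$ does not achieve the CRLB. However, $S^2$ can be shown to be the minimum variance unbiased estimator for $\sigma^2$ using the notions of completeness and sufficiency.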
Consistency
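• Bernoulli: $\hat p = \bar X_n \xrightarrow{p} p$ by the WLLN. $\hat p$ is consistent.
• Normal$(\mu, \sigma^2)$: $\hat\mu = \bar X_n \xrightarrow{p} \mu$, and
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X_n^2 \xrightarrow{p} \sigma^2$$
by the LLN and the continuous mapping theorem. $\hat\sigma^2$ is biased but still consistent.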
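If the data are i.i.d.,
$$Q_n(\theta) \equiv \frac{1}{n}\sum_{i=1}^n \log f(X_i \mid \theta), \qquad \hat\theta_{MLE} = \arg\max_{\theta \in \Theta} Q_n(\theta).$$
This is a special case of an M (maximization or minimization) estimator.
For each θ, $Q_n(\theta)$ is a random variable. By a (pointwise) LLN:
$$Q_n(\theta) \equiv \frac{1}{n}\sum_{i=1}^n \log f(X_i \mid \theta) \xrightarrow{p} E\log f(X_i \mid \theta)$$
However, we need a stronger statement: the whole function $Q_n(\theta)$ should converge to $Q(\theta)$. Compare:
Pointwise LLN: for any θ,
$$\frac{1}{n}\sum_{i=1}^n \log f(X_i \mid \theta) \xrightarrow{p} E\log f(X_i \mid \theta) \equiv Q(\theta).$$
$$\forall \epsilon > 0, \forall \delta > 0, \exists n_0(\theta) \text{ s.t. } \forall n > n_0(\theta), \quad P\left(|Q_n(\theta) - Q(\theta)| > \epsilon\right) < \delta.$$
Uniform LLN:
$$\sup_{\theta \in \Theta} |Q_n(\theta) - Q(\theta)| \xrightarrow{p} 0.$$
$$\forall \epsilon > 0, \forall \delta > 0, \exists n_0 \text{ s.t. } \forall n > n_0, \quad P\left(\sup_{\theta \in \Theta} |Q_n(\theta) - Q(\theta)| > \epsilon\right) < \delta.$$
Note that in the uniform statement $n_0$ does not depend on θ.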
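M Estimator Theory: Suppose $\hat\theta = \arg\max_{\theta \in \Theta} Q_n(\theta)$. If
• $\sup_{\theta \in \Theta} |Q_n(\theta) - Q(\theta)| \xrightarrow{p} 0$,
• $\theta_0$ uniquely maximizes $Q(\theta)$, in the sense that for any neighborhood $N(\theta_0)$ around $\theta_0$,
$$\sup_{\theta \in \Theta \setminus N(\theta_0)} Q(\theta) < Q(\theta_0),$$
then $\hat\theta_n \xrightarrow{p} \theta_0$.
Consistency of MLE is a special case of the M-estimator theorem, where
• $Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i \mid \theta)$
• $Q(\theta) = E\log f(X_i \mid \theta)$
• $\sup_{\theta \in \Theta} |Q_n(\theta) - Q(\theta)| \xrightarrow{p} 0$.
For now we verify only pointwise convergence, which follows from the LLN, rather than uniform convergence.
Next, that $\theta_0$ uniquely maximizes $Q(\theta) = E\log f(X_i \mid \theta)$ follows from Jensen's inequality: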
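$$Q(\theta) = E\log f(X_i \mid \theta) = \int \log f(X_i \mid \theta)\, f(X_i \mid \theta_0)\,dX_i$$
Want to show that for any $\theta \ne \theta_0$, $Q(\theta) < Q(\theta_0)$.
$$Q(\theta) - Q(\theta_0) = \int \log f(X_i \mid \theta)\, f(X_i \mid \theta_0)\,dX_i - \int \log f(X_i \mid \theta_0)\, f(X_i \mid \theta_0)\,dX_i$$
$$= \int \log\left(\frac{f(X_i \mid \theta)}{f(X_i \mid \theta_0)}\right) f(X_i \mid \theta_0)\,dX_i = E\log\left(\frac{f(X_i \mid \theta)}{f(X_i \mid \theta_0)}\right) \le \log E\frac{f(X_i \mid \theta)}{f(X_i \mid \theta_0)} = \log\left[\int_{x_i : f(x_i \mid \theta_0) > 0} f(x_i \mid \theta)\,dx_i\right] \le 0.$$
• The first inequality is strict when $f(x \mid \theta) \ne f(x \mid \theta_0)$ for x with positive probability under $f(x \mid \theta_0)$.
• The second inequality is strict when the support of x depends on θ. For example, if $X_i \mid \theta \sim$ uniform$(\theta - 0.5, \theta + 0.5)$, then $\int_{x_i : f(x_i \mid \theta_0) > 0} f(x_i \mid \theta)\,dx_i < 1$.
• We have shown that for $\theta \ne \theta_0$, $Q(\theta_0) - Q(\theta) > 0$, or
$$\int \log f(x \mid \theta_0)\, f(x \mid \theta_0)\,dx - \int \log f(x \mid \theta)\, f(x \mid \theta_0)\,dx > 0.$$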
Kullback-Leibler information criterion (KLIC), a quasi-distance between two density functions $f(\cdot)$ and $g(\cdot)$:
$$\mathrm{KLIC}(f(\cdot), g(\cdot)) = \int \log f(x)\, f(x)\,dx - \int \log g(x)\, f(x)\,dx \ge 0.$$
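For discrete distributions the integrals in the KLIC become sums, which makes the definition easy to try out. The sketch below uses Bernoulli probability mass functions and a grid search. The function names, the value p0 = 0.3 and the grid are illustrative assumptions.

```python
import math

def klic(f, g):
    """KLIC(f, g) = sum_x f(x) * (log f(x) - log g(x)) for pmfs given as lists."""
    return sum(fx * (math.log(fx) - math.log(gx)) for fx, gx in zip(f, g) if fx > 0)

def bernoulli(p):
    """pmf of a Bernoulli(p) variable as [P(X=0), P(X=1)]."""
    return [1 - p, p]

p0 = 0.3                                     # hypothetical true parameter
grid = [i / 100 for i in range(1, 100)]      # candidate values of p

# The candidate minimizing the KLIC to the true pmf is the true parameter itself
best_p = min(grid, key=lambda p: klic(bernoulli(p0), bernoulli(p)))
print(best_p, klic(bernoulli(p0), bernoulli(0.5)))
```

The KLIC is zero only when the two distributions coincide and positive otherwise, so minimizing it over the candidate model recovers the true parameter when the model is correctly specified.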
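In the MLE case, for $f(x)$ the true density of the data,
$$\arg\max_{\theta \in \Theta} Q(\theta) = \arg\min_{\theta \in \Theta} -Q(\theta) = \arg\min_{\theta \in \Theta} \mathrm{KLIC}(f(x), f(x \mid \theta)).$$
When the model is correctly specified, $f(x) = f(x \mid \theta_0)$.
Summary:
• Case 1 (textbook): if the model is correctly specified, namely $f(x) = f(x \mid \theta_0)$ for some $\theta_0 \in \Theta$, then $\hat\theta_{MLE} \xrightarrow{p} \theta_0$.
• Case 2 (misspecified model): there is no $\theta \in \Theta$ such that $f(x) = f(x \mid \theta)$; then
$$\hat\theta_{MLE} \xrightarrow{p} \theta^* = \arg\min_{\theta \in \Theta} \mathrm{KLIC}(f(x), f(x \mid \theta))$$
• Case 3: some elements of $\hat\theta$ might still be consistent.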
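Example 1: $X_i$ is Bernoulli,
$$E\log f(X \mid p) = E\log\left(p^X (1-p)^{1-X}\right) = E\left[X\log p + (1 - X)\log(1-p)\right] = p_0\log p + (1 - p_0)\log(1-p)$$
This is maximized at $p = p_0$.
Example 2: $X \sim N(\mu, \sigma^2)$. Up to an additive constant,
$$E\log f(X \mid \mu, \sigma^2) = E\log\left\{\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(X - \mu)^2}{2\sigma^2}\right]\right\} = E\left[-\frac{\log\sigma^2}{2} - \frac{(X - \mu)^2}{2\sigma^2}\right]$$
$$E\left[(X - \mu)^2\right] = E\left[(X - \mu_0 + \mu_0 - \mu)^2\right] = \sigma_0^2 + (\mu - \mu_0)^2$$
$$Q(\mu, \sigma^2, \mu_0, \sigma_0^2) = E\log f(X \mid \mu, \sigma^2) = -\frac{\log\sigma^2}{2} - \frac{1}{2\sigma^2}\left(\sigma_0^2 + (\mu - \mu_0)^2\right)$$
$$(\mu_0, \sigma_0^2) = \arg\max_{\mu, \sigma^2}\left(-\frac{\log\sigma^2}{2} - \frac{1}{2\sigma^2}\left(\sigma_0^2 + (\mu - \mu_0)^2\right)\right)$$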
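Next: rate of convergence and asymptotic distribution. Typically the rate is $\sqrt{n}$ and the limiting distribution is normal. But this is not always the case.
$X_i \sim$ uniform$(0, \theta)$ i.i.d.
$$L(\theta \mid \mathbf{X}_n) = \prod_{i=1}^n f(X_i \mid \theta) = \prod_{i=1}^n \frac{1}{\theta} 1(0 \le X_i \le \theta) = \frac{1}{\theta^n} 1(\max_i X_i \le \theta).$$
$$\hat\theta_{MLE} = Z_n = \max(X_1, \ldots, X_n)$$
$$n\left(\hat\theta - \theta_0\right) \xrightarrow{d} \text{negative of an exponential distribution}$$
For $x < 0$,
$$P\left(n(\hat\theta - \theta_0) < x\right) = P\left(\hat\theta < \theta_0 + \frac{x}{n}\right) = P\left(\max_i X_i \le \theta_0 + \frac{x}{n}\right) = P\left(X_i \le \theta_0 + \frac{x}{n}\right)^n$$
$$= \left(\frac{1}{\theta_0}\left(\theta_0 + \frac{x}{n}\right)\right)^n = \left(1 + \frac{x}{\theta_0 n}\right)^n \xrightarrow{n \to \infty} \exp\left(\frac{x}{\theta_0}\right)$$
Negative exponential distribution: for $z < 0$,
$$F_Z(z) = \exp\left(\frac{z}{\theta_0}\right), \qquad f_Z(z) = \frac{1}{\theta_0}\exp\left(\frac{z}{\theta_0}\right)$$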
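The convergence of the exact distribution of the scaled maximum to the negative exponential limit can be checked directly, since both CDFs are available in closed form. The values θ0 = 2 and x = −1 below are arbitrary assumptions.

```python
import math

def exact_cdf(x, theta0, n):
    """P(n * (max X_i - theta0) <= x) for X_i i.i.d. Uniform(0, theta0) and x <= 0."""
    base = 1.0 + x / (theta0 * n)
    return max(base, 0.0) ** n    # probability is zero once theta0 + x/n falls below 0

def limit_cdf(x, theta0):
    """CDF of the negative exponential limit, valid for x <= 0."""
    return math.exp(x / theta0)

theta0, x = 2.0, -1.0
sample_sizes = (10, 100, 1000, 10000)
gaps = [abs(exact_cdf(x, theta0, n) - limit_cdf(x, theta0)) for n in sample_sizes]
print(gaps)
```

The gap shrinks roughly in proportion to 1/n, consistent with the n rate of convergence, which is faster than the usual root-n rate.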
Asymptotic Distribution of the Maximum Likelihood Estimator: Under regularity conditions and if the support of $X_i$ does not depend on θ, then in most cases $\sqrt{n}(\hat\theta - \theta_0)$ is asymptotically normal.
Assumptions:
• Θ is a compact set. $\theta_0$ is in the interior of Θ.
• The support of X, $\{x : f_X(x \mid \theta) > 0\}$, does not depend on θ.
• The likelihood function has sufficiently many derivatives with sufficiently many bounded moments.
Under these assumptions plus additional regularity conditions:
$$\sqrt{n}\left(\hat\theta_{MLE} - \theta_0\right) \xrightarrow{d} N(0, \Omega)$$
where $\Omega = -H^{-1} = S^{-1} = H^{-1} S H^{-1}$, $H = -S$,
$$H = E\frac{\partial^2 \log f(X_i \mid \theta_0)}{\partial\theta\,\partial\theta'} < 0, \qquad S = \mathrm{Var}\left(\frac{\partial \log f(X_i \mid \theta_0)}{\partial\theta}\right) > 0.$$
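Sketch of Proof: First assume $\hat\theta \xrightarrow{p} \theta_0$. Since $\theta_0$ is in the interior of Θ, so is $\hat\theta$ with probability converging to 1,
$$P\left(\hat\theta \in \mathrm{int}(\Theta)\right) \to 1.$$
All statements are now conditional on this sequence of events. The first order condition and a mean value expansion around $\theta_0$, with $\theta^*$ between $\theta_0$ and $\hat\theta$, give
$$\frac{\partial Q_n(\hat\theta)}{\partial\theta} = 0$$
$$\frac{\partial Q_n(\hat\theta)}{\partial\theta} = \frac{\partial Q_n(\theta_0)}{\partial\theta} + \frac{\partial^2 Q_n(\theta^*)}{\partial\theta\,\partial\theta'}\left(\hat\theta - \theta_0\right) = 0$$
$$\frac{\partial^2 Q_n(\theta^*)}{\partial\theta\,\partial\theta'}\left(\hat\theta - \theta_0\right) = -\frac{\partial}{\partial\theta} Q_n(\theta_0)$$
$$\sqrt{n}\left(\hat\theta - \theta_0\right) = -\left[\frac{\partial^2 Q_n(\theta^*)}{\partial\theta\,\partial\theta'}\right]^{-1} \sqrt{n}\,\frac{\partial}{\partial\theta} Q_n(\theta_0) = -(2)^{-1}(1),$$
where (1) denotes the scaled score $\sqrt{n}\,\frac{\partial}{\partial\theta} Q_n(\theta_0)$ and (2) the Hessian term $\frac{\partial^2 Q_n(\theta^*)}{\partial\theta\,\partial\theta'}$.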
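As long as supp$(X_i)$ does not depend on θ, and other regularity conditions hold,
$$(1) = \sqrt{n}\,\frac{\partial}{\partial\theta} Q_n(\theta_0) = \sqrt{n}\,\frac{\partial}{\partial\theta}\frac{1}{n}\sum_{i=1}^n \log f(X_i \mid \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n W_i$$
where
$$E W_i = E\frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0) = \int \frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0)\, f(X_i \mid \theta_0)\,dX_i = \int \frac{\frac{\partial}{\partial\theta} f(X_i \mid \theta_0)}{f(X_i \mid \theta_0)} f(X_i \mid \theta_0)\,dX_i$$
$$= \int \frac{\partial}{\partial\theta} f(X_i \mid \theta_0)\,dX_i = \frac{\partial}{\partial\theta}\int f(X_i \mid \theta_0)\,dX_i = \frac{\partial}{\partial\theta} 1 = 0.$$
Because $W_i$ has zero mean, by the CLT,
$$\sqrt{n}\,\frac{\partial}{\partial\theta} Q_n(\theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n W_i \xrightarrow{d} N(0, S)$$
where $S = \mathrm{Var}(W_i) = \mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0)\right)$. Next,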
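$$\frac{\partial^2 Q_n(\theta^*)}{\partial\theta\,\partial\theta'} = \frac{\partial^2}{\partial\theta\,\partial\theta'}\frac{1}{n}\sum_{i=1}^n \log f(X_i \mid \theta^*) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(X_i \mid \theta^*)$$
The summands are not i.i.d., since $\theta^*$ depends on all the observations. We need a local uniform version of the LLN:
$$\sup_{\theta \in N(\theta_0)} \left|\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(X_i \mid \theta) - E\frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(X_i \mid \theta)\right| \xrightarrow{p} 0.$$
Since $\hat\theta \xrightarrow{p} \theta_0$ and $\theta^*$ is between $\theta_0$ and $\hat\theta$, $\theta^* \xrightarrow{p} \theta_0$.
Therefore
$$\frac{\partial^2 Q_n(\theta^*)}{\partial\theta\,\partial\theta'} \xrightarrow{p} H = \frac{\partial^2 Q(\theta_0)}{\partial\theta\,\partial\theta'} = E\frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(X_i \mid \theta_0)$$
Using Slutsky's theorem (and the continuous mapping theorem), as long as H is nonsingular,
$$\sqrt{n}\left(\hat\theta - \theta_0\right) = -\left[\frac{\partial^2 Q_n(\theta^*)}{\partial\theta\,\partial\theta'}\right]^{-1} \sqrt{n}\,\frac{\partial}{\partial\theta} Q_n(\theta_0) \xrightarrow{d} N\left(0, H^{-1} S H^{-1}\right).$$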
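Information matrix equality again: $H + S = 0$. For $W_i = \frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0)$, $E W_i = 0$:
$$E\frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0) = 0 \iff \int \frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0)\, f(X_i \mid \theta_0)\,dX_i = 0.$$
Totally differentiate with respect to $\theta_0$.
Sandwich formula: the robust variance is $H^{-1} S H^{-1}$; the nonrobust versions are $-H^{-1}$ and $S^{-1}$. We need to estimate either or both of H and S, so that $\hat H \xrightarrow{p} H$, $\hat S \xrightarrow{p} S$:
$$\hat H_1 = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(X_i \mid \hat\theta), \qquad \hat H_2 = \int \frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(X_i \mid \hat\theta)\, f(X_i \mid \hat\theta)\,dX_i$$
$$\hat S_1 = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i \mid \hat\theta)\,\frac{\partial}{\partial\theta}\log f(X_i \mid \hat\theta)', \qquad \hat S_2 = \int \frac{\partial}{\partial\theta}\log f(X_i \mid \hat\theta)\,\frac{\partial}{\partial\theta}\log f(X_i \mid \hat\theta)'\, f(X_i \mid \hat\theta)\,dX_i$$
$\hat H_2$ and $\hat S_2$ can be computed using numerical integration or simulation. If the model is correct, $\hat H_2$ and $\hat S_2$ can be more precise than $\hat H_1$ and $\hat S_1$.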
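To summarize:
$$\sqrt{n}\left(\hat\theta_{MLE} - \theta_0\right) \overset{A}{\sim} N\left(0, \hat H^{-1}\hat S\hat H^{-1}\right)$$
$$\left(\hat\theta_{MLE} - \theta_0\right) \overset{A}{\sim} N\left(0, \frac{1}{n}\hat H^{-1}\hat S\hat H^{-1}\right) = N\left(0, -\frac{\hat H^{-1}}{n}\right) = N\left(0, \frac{\hat S^{-1}}{n}\right),$$
where the last two equalities rely on the information matrix equality $H = -S$.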
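Example 1: $X_i$ i.i.d., equal to 1 with probability p and 0 with probability $1 - p$.
$$f(X_i \mid \theta) = p^{X_i}(1-p)^{1-X_i}$$
$$\log f(X_i \mid \theta) = X_i\log p + (1 - X_i)\log(1-p)$$
Score function:
$$\frac{\partial}{\partial\theta}\log f(X_i \mid \theta) = \frac{X_i}{p} - \frac{1 - X_i}{1-p} = \frac{X_i - p}{p(1-p)}$$
$$S = \mathrm{Var}\left(\frac{X_i - p}{p(1-p)}\right) = \frac{1}{p(1-p)}$$
$$H = E\frac{\partial}{\partial p}\left(\frac{X_i - p}{p(1-p)}\right) = -\frac{1}{p(1-p)}$$
$H + S = 0$, and since $X_i$ is 1 or 0,
$$\hat S_1 = \frac{1}{n}\sum_{i=1}^n \frac{(X_i - \hat p)^2}{\hat p^2(1 - \hat p)^2} = \frac{1}{\hat p(1 - \hat p)}.$$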
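The sample analogs of H and S for this Bernoulli example can be computed directly from data. The sketch below uses a made-up sample of ten observations. The helper name `bernoulli_hs` and the variable `sandwich_var` are assumptions for illustration.

```python
def bernoulli_hs(xs):
    """Return (p_hat, H1_hat, S1_hat) for a 0/1 sample, evaluated at the MLE."""
    n = len(xs)
    p_hat = sum(xs) / n
    # Per-observation score and second derivative of log f(x | p) at p_hat
    scores = [(x - p_hat) / (p_hat * (1 - p_hat)) for x in xs]
    hessians = [-x / p_hat ** 2 - (1 - x) / (1 - p_hat) ** 2 for x in xs]
    s1 = sum(s * s for s in scores) / n      # average outer product of scores
    h1 = sum(hessians) / n                   # average Hessian
    return p_hat, h1, s1

xs = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]          # hypothetical sample
p_hat, h1, s1 = bernoulli_hs(xs)

# Sandwich estimate of Var(p_hat): H^{-1} S H^{-1} / n
sandwich_var = (1 / h1) * s1 * (1 / h1) / len(xs)
print(p_hat, h1, s1, sandwich_var)
```

In this model the two estimates satisfy the information matrix equality exactly at the MLE, so the sandwich variance collapses to the familiar p̂(1 − p̂)/n.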
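Example 2: $X_i \sim N(\mu, \sigma^2)$, $f(X_i \mid \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(X_i - \mu)^2}{2\sigma^2}\right)$.
$$\log f(X_i \mid \theta) = C - \frac{1}{2}\log\sigma^2 - \frac{(X_i - \mu)^2}{2\sigma^2}$$
$$\frac{\partial \log f(X_i \mid \theta)}{\partial\theta} = \begin{pmatrix} \frac{\partial \log f(X_i \mid \theta)}{\partial\mu} \\ \frac{\partial \log f(X_i \mid \theta)}{\partial\sigma^2} \end{pmatrix} = \begin{pmatrix} \frac{X_i - \mu}{\sigma^2} \\ -\frac{1}{2\sigma^2} + \frac{(X_i - \mu)^2}{2\sigma^4} \end{pmatrix}$$
$$S = \mathrm{Var}\left(\frac{\partial \log f(X_i \mid \theta)}{\partial\theta}\right) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$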