Econometrics I - Stanford Universitydoubleh/eco270/pointestimation.pdf · 2016-11-16 ·...

Econometrics I

Department of Economics, Stanford University

November, 2016

Part II

Topics

• Point Estimation.

• Interval Estimation

• Hypothesis Testing

• Sufficiency and Data Reduction (maybe).

Different Approaches

• Frequentist: There exists a true parameter value θ0.

• Bayesian: θ is a random variable. Prior + Data ⟹ Posterior.

• Fiducial inference: no prior. Data ⟹ Posterior. Or Bayesian with a uniform (diffuse) prior.

Data, sample of size n

• I.I.D Sampling (sampling with replacement)

• The design of sampling schemes is a science in itself.

• Usual notation: $X_1, \dots, X_n$, $Y_1, \dots, Y_n$, $Z_1, \dots, Z_n$.

• Shorthand for the full samples: $X^n$, $Y^n$, $Z^n$.

• Parameter θ: a function(al) of the distribution. For example, the mean:

\[ \mu(F_X(\cdot)) = \int x f_X(x)\,dx = \int x\,dF_X(x). \]

• Estimators: a function of the data:

\[ \hat\theta = \phi_n(X^n) = \phi_n(X_1, X_2, \dots, X_n) \]

• Strictly speaking, a sequence of functions of the data, since it is a different function for each n. For example:

\[ \hat\theta = \bar X_n = \frac{X_1 + X_2 + \cdots + X_n}{n}. \]

• Estimate: a realized value of the estimator.

• Is $\hat\theta = 1/2$ an estimator or an estimate?

• Empirical Distribution Function (EDF):

\[ \hat F_X(x) = \frac{1}{n} \sum_{i=1}^n 1(X_i \le x) \]

• Analog principle: replace the true population value (CDF) with the estimated sample value (EDF):

\[ \bar X_n = \hat\mu = \int x\,d\hat F_X(x) \]

• Properties of estimators:
  • Finite-sample properties:
    • Unbiasedness
    • Mean squared error
    • Finite-sample distribution
  • Asymptotic properties:
    • Consistency
    • Asymptotic distribution

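• Unbiasedness:

\[ E_\theta \hat\theta = \int \cdots \int \hat\theta(X_1, \dots, X_n)\, f(X_1, \dots, X_n|\theta)\, dX_1 \cdots dX_n = \theta. \]

• MSE (a function of θ, or θ0):

\[ MSE = E(\hat\theta - \theta_0)^2 = E_\theta(\hat\theta - \theta)^2 \]

\[ MSE = Var(\hat\theta) + \left(E(\hat\theta - \theta_0)\right)^2 \]

• More general loss functions: $E_\theta\, \ell(\hat\theta - \theta)$.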

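• Suppose $X_1, X_2 \sim$ i.i.d. Bernoulli(p).

\[ \hat p_1 = \frac{X_1 + X_2}{2}, \qquad \hat p_2 = X_1, \qquad \hat p_3 = \frac{1}{2} \]

$\hat p_1$ and $\hat p_2$ are unbiased. $\hat p_3$ is biased.

• MSE:

\[ MSE(\hat p_1) = Var(\hat p_1) = \frac{1}{2} p(1-p) \]

\[ MSE(\hat p_2) = Var(\hat p_2) = p(1-p) \]

\[ MSE(\hat p_3) = Bias(\hat p_3)^2 = \left(\frac{1}{2} - p\right)^2. \]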

• $\hat p_2$ is inadmissible: $\hat p_1$ is better.

• $\hat\theta$ is admissible if there is no estimator that is better (in the MSE sense) than $\hat\theta$ for some p and is at least as good as $\hat\theta$ for all p.

• $\hat p_1$ and $\hat p_3$ are admissible: we cannot choose one over the other because p is unknown. A typical Bayesian estimator is $\hat p_4 = w \hat p_1 + (1-w) \hat p_3$.
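The MSE comparison above can be checked numerically. The following is a minimal sketch that evaluates the three closed-form MSE expressions on a grid of p values; the function names and the grid are illustrative choices, not part of the lecture.

```python
def mse_p1(p):
    """MSE of (X1 + X2) / 2: unbiased, variance p(1-p)/2."""
    return p * (1 - p) / 2


def mse_p2(p):
    """MSE of X1 alone: unbiased, variance p(1-p)."""
    return p * (1 - p)


def mse_p3(p):
    """MSE of the constant 1/2: zero variance, squared bias (1/2 - p)^2."""
    return (0.5 - p) ** 2


# Hypothetical evaluation grid: p = 0.00, 0.05, ..., 1.00
grid = [i / 20 for i in range(21)]

# p1 is at least as good as p2 everywhere on the grid.
p1_dominates_p2 = all(mse_p1(p) <= mse_p2(p) for p in grid)

# Neither p1 nor p3 dominates the other: each has a smaller MSE somewhere.
p1_wins_somewhere = any(mse_p1(p) < mse_p3(p) for p in grid)
p3_wins_somewhere = any(mse_p3(p) < mse_p1(p) for p in grid)

for p in (0.1, 0.5, 0.9):
    print(f"p={p}: MSE1={mse_p1(p):.4f}  MSE2={mse_p2(p):.4f}  MSE3={mse_p3(p):.4f}")
```

Near p = 1/2 the constant estimator has the smallest MSE, while away from 1/2 the sample mean wins, which is why neither can be ruled out without knowing p.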

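• $\bar X_n$ is the best linear unbiased estimator (BLUE).

• Linearity: $\hat\theta = \sum_{i=1}^n \omega_i X_i$

• Unbiasedness: $E\hat\theta = \theta \iff \sum_{i=1}^n \omega_i = 1$ (here $\theta = EX_i$ and $\sigma^2 = Var(X_i)$).

\[ MSE(\hat\theta) = Var(\hat\theta) = \sum_{i=1}^n \omega_i^2 \sigma^2 \]

\[ (\hat\omega_1, \hat\omega_2, \dots, \hat\omega_n) = \arg\min_{\omega_1, \omega_2, \dots, \omega_n} \sum_{i=1}^n \omega_i^2 \quad \text{such that} \quad \sum_{i=1}^n \omega_i = 1. \]

Solution:

\[ \hat\omega_1 = \hat\omega_2 = \hat\omega_3 = \dots = \hat\omega_n = \frac{1}{n}. \]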

• $\bar X_n$ is BLUE (Gauss-Markov)

• What can be better than $\bar X_n$?
  • Nonlinear and unbiased estimator
  • Linear biased estimator
  • Nonlinear and biased estimator
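The equal-weights solution of the BLUE problem can be derived with a Lagrange multiplier. This is a short sketch of that step; the multiplier λ is notation introduced here and does not appear in the slides.

```latex
\mathcal{L}(\omega, \lambda) = \sum_{i=1}^n \omega_i^2 - \lambda \Big( \sum_{i=1}^n \omega_i - 1 \Big)

\frac{\partial \mathcal{L}}{\partial \omega_i} = 2\omega_i - \lambda = 0
  \;\Longrightarrow\; \omega_i = \frac{\lambda}{2} \quad \text{for every } i

\sum_{i=1}^n \omega_i = \frac{n\lambda}{2} = 1
  \;\Longrightarrow\; \lambda = \frac{2}{n}
  \;\Longrightarrow\; \hat\omega_i = \frac{1}{n}, \qquad
  Var(\bar X_n) = \sigma^2 \sum_{i=1}^n \frac{1}{n^2} = \frac{\sigma^2}{n}
```

Because the objective is strictly convex and the constraint is linear, this stationary point is the unique minimizer.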

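• Example 1: $X_i \sim$ i.i.d. Uniform(0, θ), $\mu = EX = \int x\,dF_X(x) = \frac{\theta}{2}$. Want to estimate µ.

• Two candidate estimators:

\[ \hat\mu_1 = \bar X_n, \qquad \hat\mu_2 = \frac{n+1}{2n} Z_n, \qquad Z_n = \max(X_1, \dots, X_n) \]

• Since $Z_n < \theta$, bias correct by multiplying by $\frac{n+1}{n}$.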

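\[ MSE(\hat\mu_1) = Var(\bar X_n) = \frac{\sigma^2}{n} = \frac{\theta^2}{12n} \]

\[ MSE(\hat\mu_2) = Var(\hat\mu_2) + bias(\hat\mu_2)^2 \]

For $0 \le z \le \theta$:

\[ F_{Z_n}(z) = P(\max(X_1, \dots, X_n) \le z) = \prod_{i=1}^n P(X_i \le z) = \left(\frac{z}{\theta}\right)^n \]

\[ f_{Z_n}(z) = \frac{\partial}{\partial z} F_{Z_n}(z) = n \frac{z^{n-1}}{\theta^n}. \]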

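• Moments of $Z_n$:

\[ E Z_n = \int_0^\theta z \, \frac{n z^{n-1}}{\theta^n}\, dz = \frac{n}{n+1}\theta \]

\[ E Z_n^2 = \int_0^\theta z^2 \, \frac{n z^{n-1}}{\theta^n}\, dz = \frac{n}{n+2}\theta^2 \]

\[ Var(Z_n) = E Z_n^2 - (E Z_n)^2 = \frac{n}{n+2}\theta^2 - \left[\frac{n}{n+1}\theta\right]^2 = \frac{n}{(n+2)(n+1)^2}\theta^2 \]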

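\[ Var(\hat\mu_2) = \frac{(n+1)^2}{4n^2} Var(Z_n) = \frac{\theta^2}{4n(n+2)} \]

\[ Bias(\hat\mu_2) = E\hat\mu_2 - \frac{\theta}{2} = 0 \]

\[ MSE(\hat\mu_1) - MSE(\hat\mu_2) = \frac{1}{12n}\theta^2 - \frac{1}{4n(n+2)}\theta^2 = \theta^2 \left(\frac{1}{12n} - \frac{1}{4n(n+2)}\right) > 0 \]

if n > 1.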
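The two MSE formulas for the Uniform(0, θ) example can be compared against a small Monte Carlo experiment. This is a sketch under illustrative assumptions: the sample size, θ, number of replications, seed, and the helper name `simulate_mse` are all choices made here, not values from the lecture.

```python
import random


def simulate_mse(n, theta, reps, seed=0):
    """Monte Carlo MSE of the sample mean and the scaled maximum for mu = theta / 2."""
    rng = random.Random(seed)
    mu = theta / 2
    sq_err_mean = 0.0
    sq_err_max = 0.0
    for _ in range(reps):
        xs = [rng.uniform(0, theta) for _ in range(n)]
        mu1 = sum(xs) / n                      # sample mean
        mu2 = (n + 1) / (2 * n) * max(xs)      # bias-corrected maximum
        sq_err_mean += (mu1 - mu) ** 2
        sq_err_max += (mu2 - mu) ** 2
    return sq_err_mean / reps, sq_err_max / reps


n, theta = 10, 2.0
mse1_sim, mse2_sim = simulate_mse(n, theta, reps=20000)

# Closed-form values from the derivation above.
mse1_theory = theta ** 2 / (12 * n)
mse2_theory = theta ** 2 / (4 * n * (n + 2))

print(f"sample mean : simulated {mse1_sim:.5f}, theory {mse1_theory:.5f}")
print(f"scaled max  : simulated {mse2_sim:.5f}, theory {mse2_theory:.5f}")
```

The estimator based on the maximum is nonlinear in the data, so it is not covered by the Gauss-Markov result and can beat the sample mean.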

• Large Sample Analysis

• Weak consistency: $\hat\theta_n \xrightarrow{p} \theta_0$ as $n \to \infty$.

• Strong consistency: $\hat\theta_n \xrightarrow{a.s.} \theta_0$ as $n \to \infty$.

• Rate of convergence and asymptotic distribution (typically normal)

• Asymptotic Efficiency: this can be a difficult concept.

• Maximum Likelihood Estimator

• Likelihood function, a random function: $f(X^n|\theta) \equiv L(\theta|X^n)$

• Joint likelihood, conditional likelihood, marginal likelihood, partial likelihood.

• If $X^n = (X_1, \dots, X_n)$ is i.i.d., then $f(X^n|\theta) = \prod_{i=1}^n f(X_i|\theta)$.

• We can define $\hat\theta_{MLE} = \arg\max_{\theta \in \Theta} L(\theta|X^n)$, where $L(\theta|X^n) \equiv f(X^n|\theta)$.

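• But for computational and statistical reasons, define

\[ \hat\theta_{LMLE} = \arg\max_{\theta \in \Theta} \log L(\theta|X^n) \overset{i.i.d.}{\equiv} \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log f(X_i|\theta) \]

• Because the logarithm is strictly increasing, $\hat\theta_{LMLE} = \hat\theta_{MLE}$ whenever $\hat\theta_{MLE}$ can be computed analytically. But often $\hat\theta_{LMLE}$ can be computed numerically when $\hat\theta_{MLE}$ cannot. E.g., if $\log L(\theta|X^n) \approx -500$, then $L(\theta|X^n) \approx 0$.

• Recall that, using the average log likelihood (to facilitate the proofs),

\[ \hat\theta = \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \log f(X_i|\theta) = \arg\max_{\theta \in \Theta} \frac{1}{n} \log L(\theta|X^n). \]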
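The computational reason for working with the log likelihood can be seen in a few lines. The sketch below uses made-up, evenly spaced data and a standard normal density (both assumptions for illustration): the product of 2000 densities underflows to zero in double precision, while the sum of log densities stays finite.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


# Hypothetical data: 2000 points spread evenly over [-2, 2].
xs = [-2 + 4 * i / 1999 for i in range(2000)]

# Raw likelihood: a product of 2000 numbers that are each below 0.4.
likelihood = 1.0
for x in xs:
    likelihood *= normal_pdf(x, 0.0, 1.0)

# Log likelihood: a sum of moderately sized negative numbers.
log_likelihood = sum(math.log(normal_pdf(x, 0.0, 1.0)) for x in xs)

print("likelihood     :", likelihood)
print("log likelihood :", log_likelihood)
```

Any optimizer fed the raw likelihood here would see a function that is identically zero, whereas the log likelihood still carries usable curvature.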

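• Example 1: $X_i$ i.i.d. with

\[ X_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases} \]

\[ L(\theta|X^n) = \prod_{i=1}^n p(X_i|\theta) = \prod_{i=1}^n p^{X_i}(1-p)^{1-X_i} = p^{\sum X_i}(1-p)^{n - \sum X_i} \]

\[ \max_p \log L(p|X^n) = \max_p \sum_{i=1}^n \left(X_i \log p + (1-X_i)\log(1-p)\right) \]

• First order condition

\[ \frac{\partial \log L(p|X^n)}{\partial p} = \frac{1}{p}\sum_{i=1}^n X_i - \frac{1}{1-p}\sum_{i=1}^n (1-X_i) = \sum_{i=1}^n \frac{X_i - p}{p(1-p)} = 0 \]

\[ \Longrightarrow \hat p = \frac{1}{n}\sum_{i=1}^n X_i \]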

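• Example 2: $X_i \sim N(\mu, \sigma^2)$, $\theta = (\mu, \sigma^2)$

\[ L(\theta|X^n) = \prod_{i=1}^n f(X_i|\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right) \]

\[ \log L(\theta|X^n) = C + \sum_{i=1}^n \left[\log\frac{1}{\sigma} - \frac{(X_i-\mu)^2}{2\sigma^2}\right] \]

\[ \hat\mu = \arg\min_\mu \sum_{i=1}^n (X_i-\mu)^2 \equiv \bar X_n \]

\[ \frac{\partial \log L(\theta|X^n)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_{i=1}^n (X_i-\hat\mu)^2}{2\sigma^4} = 0 \]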

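• Recall (from Dhrymes' book):

\[ \frac{\partial x'Ax}{\partial x} = (A + A')x = 2Ax \text{ if } A \text{ is symmetric} \]

\[ \frac{\partial}{\partial A}\log|A| = A^{-1} \text{ for symmetric } A \text{ (using principal minors and cofactors)} \]

\[ \frac{\partial}{\partial A} tr(AB) = B', \qquad tr(AB) = tr(BA) \]

\[ tr(ABC) = tr(BCA) = tr(CAB). \]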

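For the multivariate case $X_i \sim N(\mu, \Sigma)$:

\[ \log L(\theta|X^n) = C + \frac{n}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^n tr\left((X_i-\mu)'\Sigma^{-1}(X_i-\mu)\right) \]

\[ = C + \frac{n}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^n tr\left(\Sigma^{-1}(X_i-\mu)(X_i-\mu)'\right) \]

\[ \frac{\partial}{\partial\mu}\log L(\theta|X^n) = \sum_{i=1}^n \Sigma^{-1}(X_i-\mu) = 0 \Longrightarrow \Sigma^{-1}\sum_{i=1}^n (X_i-\mu) = 0 \Longrightarrow \hat\mu = \bar X_n. \]

\[ \frac{\partial}{\partial\Sigma^{-1}}\log L(\mu, \Sigma^{-1}|X^n) = \frac{n}{2}\Sigma - \frac{1}{2}\sum_{i=1}^n (X_i-\mu)(X_i-\mu)' = 0 \]

\[ \hat\Sigma = \frac{1}{n}\sum_{i=1}^n (X_i-\hat\mu)(X_i-\hat\mu)' = \frac{1}{n}\sum_{i=1}^n (X_i-\bar X_n)(X_i-\bar X_n)'. \]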
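The closed-form multivariate normal MLE above translates directly into code. The sketch below uses plain Python lists and a tiny made-up data set; the function name `mvn_mle` and the data are illustrative assumptions.

```python
def mvn_mle(data):
    """Return the MLE (mean vector, covariance matrix) for rows of `data`.

    The covariance uses the 1/n divisor, as in the MLE, not 1/(n-1).
    """
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    cov = [
        [sum((row[j] - mean[j]) * (row[k] - mean[k]) for row in data) / n
         for k in range(d)]
        for j in range(d)
    ]
    return mean, cov


# Hypothetical sample of four bivariate observations.
data = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]]
mu_hat, sigma_hat = mvn_mle(data)

print("mu_hat    =", mu_hat)
print("sigma_hat =", sigma_hat)
```

Multiplying `sigma_hat` by n/(n-1) would give the usual unbiased sample covariance instead.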

Often the MLE cannot be computed by hand and must be found numerically by maximizing $Q_n(\theta) = \log L(\theta|X^n)$:

• Newton Raphson Iteration (max of quadratic approximation)

• Stochastic Optimization

• Reference: Kenneth Judd, Numerical Methods in Economics

• Root finding: Bisection, Gauss-Newton Iteration

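Initial guess: $\theta^{(0)}$. Take a second-order Taylor expansion of Q around it:

\[ Q(\theta) \approx Q(\theta^{(0)}) + \frac{\partial Q(\theta^{(0)})}{\partial\theta'}(\theta - \theta^{(0)}) + \frac{1}{2}(\theta - \theta^{(0)})' \frac{\partial^2 Q(\theta^{(0)})}{\partial\theta\partial\theta'}(\theta - \theta^{(0)}) \]

Write the right-hand side as $\tilde Q(\theta, \theta^{(0)})$, so that $Q(\theta) \approx \tilde Q(\theta, \theta^{(0)})$. Maximizing the quadratic approximation over θ:

\[ 0 = \frac{\partial \tilde Q(\theta, \theta^{(0)})}{\partial\theta} = \frac{\partial Q(\theta^{(0)})}{\partial\theta} + \frac{\partial^2 Q(\theta^{(0)})}{\partial\theta\partial\theta'}(\theta - \theta^{(0)}) \]

\[ \theta^{(1)} = \theta^{(0)} - \left(\frac{\partial^2 Q(\theta^{(0)})}{\partial\theta\partial\theta'}\right)^{-1} \frac{\partial Q(\theta^{(0)})}{\partial\theta}. \]

In general

\[ \theta^{(t+1)} = \theta^{(t)} - \left(\frac{\partial^2 Q(\theta^{(t)})}{\partial\theta\partial\theta'}\right)^{-1} \frac{\partial Q(\theta^{(t)})}{\partial\theta}. \]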
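The Newton-Raphson update above can be sketched for a scalar parameter. The example model is an assumption chosen for illustration, not one from the slides: counts are Poisson with mean exp(θ), so the log likelihood is globally concave and the answer log(sample mean) is known in closed form for checking. The data and the helper name `newton_raphson` are also hypothetical.

```python
import math


def newton_raphson(grad, hess, theta0, tol=1e-10, max_iter=100):
    """Scalar Newton-Raphson: theta <- theta - grad(theta) / hess(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = grad(theta) / hess(theta)
        theta -= step
        if abs(step) < tol:   # stop once the update is negligible
            break
    return theta


# Hypothetical count data.
xs = [2, 3, 1, 4, 0, 5, 2, 3]
n, total = len(xs), sum(xs)

# Poisson with mean exp(theta):
#   Q(theta) = sum_i (x_i * theta - exp(theta)) + constant


def grad(theta):
    """First derivative of Q."""
    return total - n * math.exp(theta)


def hess(theta):
    """Second derivative of Q (always negative, so Q is concave)."""
    return -n * math.exp(theta)


theta_hat = newton_raphson(grad, hess, theta0=0.0)
print("Newton-Raphson :", theta_hat)
print("closed form    :", math.log(total / n))
```

For likelihoods that are not globally concave, the same iteration can diverge or stop at a local maximum, so the starting value matters.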

Hopefully, $\theta^{(t)} \to \hat\theta_{MLE}$ as $t \to \infty$.

Statistical properties of MLE

• Finite sample: sometimes unbiased, and sometimes biased.

• Large sample.

Cramér-Rao Lower Bound (CRLB): Under some regularity conditions (including that the support of $X_i$ does not depend on θ), any unbiased estimator $\hat\theta$ of θ has a variance that is no smaller than

\[ Var\left(\frac{\partial \log f(X^n|\theta)}{\partial\theta}\right)^{-1}. \]

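Proof: Unbiasedness $\iff E\hat\theta = \theta_0 \iff E_\theta\hat\theta = \theta$ for every θ:

\[ \int \hat\theta(X^n) f(X^n|\theta)\, dX^n = \theta \]

Differentiating both sides with respect to θ:

\[ \frac{\partial}{\partial\theta'} \int \hat\theta(X^n) f(X^n|\theta)\, dX^n = I \]

Under regularity conditions (which allow differentiation under the integral sign):

\[ \int \hat\theta(X^n) \frac{\partial}{\partial\theta'} f(X^n|\theta)\, dX^n = I \]

\[ \int \hat\theta(X^n) \frac{\frac{\partial}{\partial\theta'} f(X^n|\theta)}{f(X^n|\theta)} f(X^n|\theta)\, dX^n = I \]

That is, $E_\theta\left[\hat\theta(X^n) \frac{\partial}{\partial\theta'}\log f(X^n|\theta)\right] = I$. Because the score has mean zero, this expectation is the covariance between $\hat\theta(X^n)$ and the score.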

The vector version of the Cauchy-Schwarz inequality takes more work. For random vectors U, V:

\[ Var(V) \ge Cov(V, U)\, Var(U)^{-1}\, Cov(U, V) \]

in the sense that the difference is positive semidefinite. Let $U = \frac{\partial}{\partial\theta}\log f(X^n|\theta)$ and $V = \hat\theta(X^n)$. Then $Cov(U, V) = I$, so

\[ Var\left(\hat\theta(X^n)\right) \ge Var\left(\frac{\partial}{\partial\theta}\log f(X^n|\theta)\right)^{-1}. \]

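If an unbiased estimator achieves the CRLB, then it must be the best (minimum variance) unbiased estimator.

Example of CRLB achievement: Bernoulli, $X_i = 1$ with probability p, $X_i = 0$ with probability $1-p$.

\[ \log f(X^n|p) = \sum_{i=1}^n \left(X_i \log p + (1-X_i)\log(1-p)\right) \]

\[ \frac{\partial \log f(X^n|p)}{\partial p} = \sum_{i=1}^n \frac{X_i - p}{p(1-p)}. \]

\[ CRLB = Var\left(\frac{\partial}{\partial p}\log f(X^n|p)\right)^{-1} = \left(\frac{n\,p(1-p)}{p^2(1-p)^2}\right)^{-1} = \frac{p(1-p)}{n}. \]

This equals $Var(\bar X_n)$, so $\hat p = \bar X_n$ achieves the bound.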

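Normal Example:

\[ f(X_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right) \]

\[ \log L(\theta|X^n) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2. \]

First order conditions:

\[ \mu: \quad \frac{1}{\sigma^2}\sum_{i=1}^n (X_i-\mu) = 0 \Longrightarrow \hat\mu_{MLE} = \bar X_n \]

\[ \sigma^2: \quad -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (X_i-\mu)^2 = 0 \Longrightarrow \hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (X_i-\bar X_n)^2 \]

\[ \hat\theta_{MLE} = \begin{pmatrix} \bar X_n \\ \frac{1}{n}\sum_{i=1}^n (X_i-\bar X_n)^2 \end{pmatrix} \]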

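To compute the CRLB, note that by the information matrix equality the variance of the score equals minus the expected Hessian:

\[ -E\frac{\partial^2 \log f(X^n|\theta_0)}{\partial\theta\partial\theta'} = -E\begin{pmatrix} \frac{\partial^2 \log f(X^n|\theta)}{\partial\mu^2} & \frac{\partial^2 \log f(X^n|\theta)}{\partial\mu\,\partial\sigma^2} \\ \frac{\partial^2 \log f(X^n|\theta)}{\partial\mu\,\partial\sigma^2} & \frac{\partial^2 \log f(X^n|\theta)}{\partial(\sigma^2)^2} \end{pmatrix} \]

\[ = -E\begin{pmatrix} -\frac{n}{\sigma^2} & -\frac{1}{\sigma^4}\sum_{i=1}^n (X_i-\mu) \\ -\frac{1}{\sigma^4}\sum_{i=1}^n (X_i-\mu) & \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^n (X_i-\mu)^2 \end{pmatrix} \]

\[ = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & -\frac{n}{2\sigma^4} + \frac{n\sigma^2}{\sigma^6} \end{pmatrix} = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & \frac{n}{2\sigma^4} \end{pmatrix} \]

Inverting, the CRLB is $\sigma^2/n$ for µ and $2\sigma^4/n$ for $\sigma^2$.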

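$\hat\mu$ is unbiased and achieves the CRLB:

\[ Var(\hat\mu) = \sigma^2/n \]

$\hat\sigma^2_{MLE}$ is biased and has a variance even lower than the CRLB:

\[ Var(\hat\sigma^2_{MLE}) = Var\left(\frac{1}{n}\sum_{i=1}^n (X_i-\bar X_n)^2\right) = \frac{\sigma^4}{n^2} Var\left(\sum_{i=1}^n \left(\frac{X_i-\bar X_n}{\sigma}\right)^2\right) \]

But $\sum_{i=1}^n \left(\frac{X_i-\bar X_n}{\sigma}\right)^2$ is $\chi^2_{n-1}$, where $Var(\chi^2_k) = 2k$, so that

\[ Var(\hat\sigma^2_{MLE}) = \frac{2(n-1)}{n^2}\sigma^4 < CRLB = 2\sigma^4/n. \]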

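Unbiased variance estimator:

\[ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X_n)^2, \qquad ES^2 = \sigma^2 \]

But

\[ Var(S^2) = Var\left(\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X_n)^2\right) = \frac{\sigma^4}{(n-1)^2} Var\left(\sum_{i=1}^n \left(\frac{X_i-\bar X_n}{\sigma}\right)^2\right) \]

\[ = \frac{2(n-1)\sigma^4}{(n-1)^2} = \frac{2\sigma^4}{n-1} > \frac{2\sigma^4}{n} = CRLB. \]

$S^2$ does not achieve the CRLB. However, $S^2$ can be shown to be the minimum variance unbiased estimator for $\sigma^2$ using the notions of completeness and sufficiency.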
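The ordering of the two variances around the CRLB can be illustrated by simulation. This is a sketch with illustrative settings (n = 5, σ² = 1, 40000 replications, fixed seed); the helper names are introduced here and are not from the lecture.

```python
import random


def simulate_variance_estimators(n, sigma2, reps, seed=0):
    """Draw `reps` normal samples of size n; return the MLE and S^2 values."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    mle_vals, s2_vals = [], []
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)   # sum of squared deviations
        mle_vals.append(ss / n)                 # biased MLE
        s2_vals.append(ss / (n - 1))            # unbiased S^2
    return mle_vals, s2_vals


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


n, sigma2 = 5, 1.0
mle_vals, s2_vals = simulate_variance_estimators(n, sigma2, reps=40000)

crlb = 2 * sigma2 ** 2 / n
print(f"mean of MLE {mean(mle_vals):.3f} (theory {(n - 1) / n * sigma2:.3f})")
print(f"Var(MLE) {variance(mle_vals):.3f} (theory {2 * (n - 1) / n**2 * sigma2**2:.3f})")
print(f"CRLB     {crlb:.3f}")
print(f"Var(S^2) {variance(s2_vals):.3f} (theory {2 * sigma2**2 / (n - 1):.3f})")
```

The MLE can sit below the bound only because it is biased; the CRLB stated above restricts unbiased estimators.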

Consistency

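• Bernoulli: $\hat p = \bar X_n \xrightarrow{p} p$ by the WLLN. $\hat p$ is consistent.

• Normal$(\mu, \sigma^2)$: $\hat\mu = \bar X_n \xrightarrow{p} \mu$, and

\[ \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i-\bar X_n)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X_n^2 \xrightarrow{p} \sigma^2 \]

by the LLN and the continuous mapping theorem. $\hat\sigma^2$ is biased but still consistent.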

If the data are i.i.d.,

\[ Q_n(\theta) \equiv \frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta), \qquad \hat\theta_{MLE} = \arg\max_{\theta \in \Theta} Q_n(\theta). \]

This is a special case of an M (maximization or minimization) estimator.

For each θ, $Q_n(\theta)$ is a random variable. By a (pointwise) LLN:

\[ Q_n(\theta) \equiv \frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta) \xrightarrow{p} E\log f(X_i|\theta) \]

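However, we need a stronger statement: the whole function $Q_n(\theta)$ should converge to $Q(\theta)$. Compare:

Pointwise LLN: for any θ,

\[ \frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta) \xrightarrow{p} E\log f(X_i|\theta) \equiv Q(\theta). \]

\[ \forall\theta,\ \forall\epsilon > 0,\ \forall\delta > 0,\ \exists n_0 \text{ s.t. } \forall n > n_0:\quad P\left(|Q_n(\theta) - Q(\theta)| > \epsilon\right) < \delta. \]

Uniform LLN:

\[ \sup_{\theta \in \Theta} |Q_n(\theta) - Q(\theta)| \xrightarrow{p} 0. \]

\[ \forall\epsilon > 0,\ \forall\delta > 0,\ \exists n_0 \text{ s.t. } \forall n > n_0:\quad P\left(\sup_{\theta \in \Theta} |Q_n(\theta) - Q(\theta)| > \epsilon\right) < \delta. \]

Note that in the uniform statement $n_0$ does not depend on θ, whereas in the pointwise statement it may.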
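The uniform gap between $Q_n$ and $Q$ can be computed explicitly in the Bernoulli model. The sketch below assumes a true parameter of 0.3 and restricts Θ to the compact interval [0.1, 0.9]; these values, the grid, and the seed are illustrative assumptions. In this model $Q_n(p) - Q(p) = (\bar X_n - p_0)(\log p - \log(1-p))$, so the supremum is attained at an endpoint of Θ.

```python
import math
import random

P0 = 0.3                                          # assumed true parameter
GRID = [0.1 + 0.8 * i / 200 for i in range(201)]  # compact Theta = [0.1, 0.9]


def q_limit(p):
    """Population objective Q(p) = E log f(X|p)."""
    return P0 * math.log(p) + (1 - P0) * math.log(1 - p)


def q_sample(p, xbar):
    """Sample objective Q_n(p); it depends on the data only through xbar."""
    return xbar * math.log(p) + (1 - xbar) * math.log(1 - p)


def sup_gap(xbar):
    """Largest |Q_n(p) - Q(p)| over the grid."""
    return max(abs(q_sample(p, xbar) - q_limit(p)) for p in GRID)


rng = random.Random(0)
results = {}
for n in (100, 10_000, 100_000):
    xbar = sum(rng.random() < P0 for _ in range(n)) / n
    results[n] = (xbar, sup_gap(xbar))
    print(f"n={n:>7}: xbar={xbar:.4f}, sup gap={results[n][1]:.5f}")
```

If Θ were the open interval (0, 1), the factor $\log p - \log(1-p)$ would be unbounded and the supremum infinite whenever $\bar X_n \ne p_0$, which illustrates why compactness of Θ matters for uniform convergence.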

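M Estimator Theory: Suppose $\hat\theta_n = \arg\max_{\theta \in \Theta} Q_n(\theta)$. If

• $\sup_{\theta \in \Theta} |Q_n(\theta) - Q(\theta)| \xrightarrow{p} 0$,

• $\theta_0$ uniquely maximizes $Q(\theta)$, in the sense that for any neighborhood $N(\theta_0)$ around $\theta_0$,

\[ \sup_{\theta \in \Theta \setminus N(\theta_0)} Q(\theta) < Q(\theta_0), \]

then $\hat\theta_n \xrightarrow{p} \theta_0$.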

Consistency of MLE is a special case of the M-estimator theorem, where

• $Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta)$

• $Q(\theta) = E\log f(X_i|\theta)$

• $\sup_{\theta \in \Theta} |Q_n(\theta) - Q(\theta)| \xrightarrow{p} 0$.

Kullback-Leibler information criterion (KLIC): a quasi-distance between two density functions $f(\cdot)$ and $g(\cdot)$,

\[ KLIC(f(\cdot), g(\cdot)) = \int \log f(x)\, f(x)\, dx - \int \log g(x)\, f(x)\, dx \ge 0. \]

In the MLE case, for $f(x)$ the true density of the data,

\[ \arg\max_{\theta \in \Theta} Q(\theta) = \arg\min_{\theta \in \Theta} -Q(\theta) = \arg\min_{\theta \in \Theta} KLIC(f(x), f(x|\theta)). \]

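When the model is correctly specified, $f(x) = f(x|\theta_0)$.

Summary:

• Case 1 (textbook): if the model is correctly specified, namely $f(x) = f(x|\theta_0)$ for some $\theta_0 \in \Theta$, then $\hat\theta_{MLE} \xrightarrow{p} \theta_0$.

• Case 2 (misspecified model): if there is no $\theta \in \Theta$ such that $f(x) = f(x|\theta)$, then

\[ \hat\theta_{MLE} \xrightarrow{p} \theta^* = \arg\min_{\theta \in \Theta} KLIC(f(x), f(x|\theta)) \]

• Case 3: even under misspecification, some elements of θ might still be estimated consistently.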
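Case 2 can be illustrated with a discrete analogue of the KLIC, replacing the integrals with sums. In this sketch the true distribution on {0, 1, 2} and the misspecified Binomial(2, p) model are made-up assumptions; for this model the KLIC minimizer is known to be E[X]/2, which gives a check on the grid search.

```python
import math
from math import comb

# Hypothetical true distribution on {0, 1, 2}; it is not Binomial(2, p) for any p.
true_pmf = [0.5, 0.1, 0.4]


def binom_pmf(x, p):
    """Binomial(2, p) probability of x: the (misspecified) model."""
    return comb(2, x) * p ** x * (1 - p) ** (2 - x)


def klic(p):
    """Discrete KLIC between the true pmf and the Binomial(2, p) model."""
    return sum(f * (math.log(f) - math.log(binom_pmf(x, p)))
               for x, f in enumerate(true_pmf))


# Grid search over the interior of (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_star = min(grid, key=klic)

true_mean = sum(x * f for x, f in enumerate(true_pmf))
print("KLIC minimizer p* :", p_star)
print("E[X] / 2          :", true_mean / 2)
print("minimum KLIC      :", klic(p_star))
```

The minimum KLIC is strictly positive because no member of the model matches the truth; the pseudo-true value still matches the mean, an instance of Case 3.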

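Example 1: $X_i$ is Bernoulli with true parameter $p_0$,

\[ E\log f(X|\theta) = E\log\left(p^X(1-p)^{1-X}\right) = E\left[X\log p + (1-X)\log(1-p)\right] = p_0\log p + (1-p_0)\log(1-p) \]

This is maximized at $p = p_0$.

Example 2: $X \sim N(\mu_0, \sigma_0^2)$, with model $N(\mu, \sigma^2)$. Ignoring the additive constant $-\frac{1}{2}\log(2\pi)$, which does not affect the maximizer,

\[ E\log f(X|\mu, \sigma^2) = E\log\left(\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(X-\mu)^2}{2\sigma^2}\right]\right) = E\left[-\frac{\log\sigma^2}{2} - \frac{(X-\mu)^2}{2\sigma^2}\right] \]

\[ E\left[(X-\mu)^2\right] = E\left[(X - \mu_0 + \mu_0 - \mu)^2\right] = \sigma_0^2 + (\mu - \mu_0)^2 \]

\[ Q(\mu, \sigma^2, \mu_0, \sigma_0^2) = E\log f(X|\mu, \sigma^2) = -\frac{\log\sigma^2}{2} - \frac{1}{2\sigma^2}\left(\sigma_0^2 + (\mu-\mu_0)^2\right) \]

\[ (\mu_0, \sigma_0^2) = \arg\max_{\mu, \sigma^2} \left(-\frac{\log\sigma^2}{2} - \frac{1}{2\sigma^2}\left(\sigma_0^2 + (\mu-\mu_0)^2\right)\right) \]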
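The claim that the population objective in the normal example is maximized at the true values can be verified from its first-order conditions. This is a brief worked sketch using only the expression for Q given above:

```latex
\frac{\partial Q}{\partial \mu} = -\frac{\mu - \mu_0}{\sigma^2} = 0
  \;\Longrightarrow\; \mu = \mu_0

\frac{\partial Q}{\partial \sigma^2}
  = -\frac{1}{2\sigma^2} + \frac{\sigma_0^2 + (\mu - \mu_0)^2}{2\sigma^4} = 0
  \;\Longrightarrow\; \sigma^2 = \sigma_0^2 + (\mu - \mu_0)^2 = \sigma_0^2
  \quad \text{at } \mu = \mu_0
```

So the unique stationary point is $(\mu_0, \sigma_0^2)$, matching the identification condition required by the M-estimator theorem.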

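Next: rate of convergence and asymptotic distribution. Typically the rate is $\sqrt{n}$ and the limiting distribution is normal. But this is not always the case.

Example: $X_i \sim$ Uniform(0, θ) i.i.d.

\[ L(\theta|X^n) = \prod_{i=1}^n f(X_i|\theta) = \prod_{i=1}^n \frac{1}{\theta}\, 1(0 \le X_i \le \theta) = \frac{1}{\theta^n}\, 1(\max_i X_i \le \theta). \]

\[ \hat\theta_{MLE} = Z_n = \max(X_1, \dots, X_n) \]

\[ n(\hat\theta - \theta_0) \xrightarrow{d} \text{negative of an exponential distribution} \]

For $x < 0$,

\[ P\left(n(\hat\theta - \theta_0) < x\right) = P\left(\hat\theta < \theta_0 + \frac{x}{n}\right) = P\left(\max_i X_i \le \theta_0 + \frac{x}{n}\right) = P\left(X_i \le \theta_0 + \frac{x}{n}\right)^n \]

\[ = \left(\frac{1}{\theta_0}\left(\theta_0 + \frac{x}{n}\right)\right)^n = \left(1 + \frac{x}{\theta_0 n}\right)^n \xrightarrow{n \to \infty} \exp\left(\frac{x}{\theta_0}\right) \]

Negative exponential distribution: for $z < 0$,

\[ F_Z(z) = \exp\left(\frac{z}{\theta_0}\right), \qquad f_Z(z) = \frac{1}{\theta_0}\exp\left(\frac{z}{\theta_0}\right) \]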
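The convergence of the exact distribution of $n(\hat\theta - \theta_0)$ to the negative exponential limit can be checked without simulation, since the finite-sample CDF is available in closed form. The values of θ0 and x below, and the function name, are illustrative assumptions.

```python
import math


def cdf_scaled_max(x, n, theta0):
    """Exact P(n * (Z_n - theta0) <= x), Z_n the max of n Uniform(0, theta0) draws."""
    if x >= 0:
        return 1.0
    base = 1 + x / (theta0 * n)
    return max(base, 0.0) ** n   # probability is zero if theta0 + x/n < 0


theta0, x = 2.0, -1.0
limit = math.exp(x / theta0)     # limiting negative-exponential CDF at x

errors = {}
for n in (10, 100, 1000, 10000):
    exact = cdf_scaled_max(x, n, theta0)
    errors[n] = abs(exact - limit)
    print(f"n={n:>6}: exact={exact:.6f}, limit={limit:.6f}, error={errors[n]:.2e}")
```

The scaling is n rather than the usual square root of n, so the MLE here converges faster than in regular models.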

Asymptotic Distribution of the Maximum Likelihood Estimator: Under regularity conditions and if the support of $X_i$ does not depend on θ, then in most cases $\sqrt{n}(\hat\theta - \theta_0)$ is asymptotically normal.

Assumptions:

• Θ is a compact set. $\theta_0$ is in the interior of Θ.

• The support of X, $\{x : f_X(x|\theta) > 0\}$, does not depend on θ.

• The likelihood function is differentiable sufficiently many times in θ, and the derivatives have sufficiently many bounded moments.

Under these assumptions plus additional regularity conditions:

\[ \sqrt{n}(\hat\theta_{MLE} - \theta_0) \xrightarrow{d} N(0, \Omega) \]

where $\Omega = -H^{-1} = S^{-1} = H^{-1}SH^{-1}$, $H = -S$,

\[ H = E\frac{\partial^2 \log f(X_i|\theta_0)}{\partial\theta\partial\theta'} < 0, \qquad S = Var\left(\frac{\partial \log f(X_i|\theta_0)}{\partial\theta}\right) > 0. \]

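Sketch of Proof: First assume $\hat\theta \xrightarrow{p} \theta_0$. Since $\theta_0$ is in the interior of Θ, so is $\hat\theta$ with probability converging to 1,

\[ P\left(\hat\theta \in int(\Theta)\right) \to 1. \]

All statements are now conditional on this sequence of events. The first-order condition, followed by a mean-value expansion around $\theta_0$ with $\theta^*$ between $\theta_0$ and $\hat\theta$, gives

\[ \frac{\partial Q_n(\hat\theta)}{\partial\theta} = 0 \]

\[ \frac{\partial Q_n(\hat\theta)}{\partial\theta} = \frac{\partial Q_n(\theta_0)}{\partial\theta} + \frac{\partial^2 Q_n(\theta^*)}{\partial\theta\partial\theta'}(\hat\theta - \theta_0) = 0 \]

\[ \frac{\partial^2 Q_n(\theta^*)}{\partial\theta\partial\theta'}(\hat\theta - \theta_0) = -\frac{\partial}{\partial\theta} Q_n(\theta_0) \]

\[ \sqrt{n}(\hat\theta - \theta_0) = -\left[\frac{\partial^2 Q_n(\theta^*)}{\partial\theta\partial\theta'}\right]^{-1} \sqrt{n}\frac{\partial}{\partial\theta} Q_n(\theta_0) = -(2)^{-1}(1), \]

where (1) denotes $\sqrt{n}\frac{\partial}{\partial\theta} Q_n(\theta_0)$ and (2) denotes $\frac{\partial^2 Q_n(\theta^*)}{\partial\theta\partial\theta'}$.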

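Let $W_i = \frac{\partial}{\partial\theta}\log f(X_i|\theta_0)$. Because the $W_i$ are i.i.d. with zero mean, the CLT gives

\[ \sqrt{n}\frac{\partial}{\partial\theta} Q_n(\theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n W_i \xrightarrow{d} N(0, S), \qquad S = Var(W_i) = Var\left(\frac{\partial}{\partial\theta}\log f(X_i|\theta_0)\right). \]

Next,

\[ \frac{\partial^2 Q_n(\theta^*)}{\partial\theta\partial\theta'} = \frac{\partial^2}{\partial\theta\partial\theta'}\frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta^*) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\partial\theta'}\log f(X_i|\theta^*) \]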

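The summands are not i.i.d., since $\theta^*$ depends on all the observations. We need a local uniform version of the LLN:

\[ \sup_{\theta \in N(\theta_0)} \left| \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\partial\theta'}\log f(X_i|\theta) - E\frac{\partial^2}{\partial\theta\partial\theta'}\log f(X_i|\theta) \right| \xrightarrow{p} 0. \]

Since $\hat\theta \xrightarrow{p} \theta_0$ and $\theta^*$ is between $\theta_0$ and $\hat\theta$, $\theta^* \xrightarrow{p} \theta_0$.

Therefore

\[ \frac{\partial^2 Q_n(\theta^*)}{\partial\theta\partial\theta'} \xrightarrow{p} H = \frac{\partial^2 Q(\theta_0)}{\partial\theta\partial\theta'} = E\frac{\partial^2}{\partial\theta\partial\theta'}\log f(X_i|\theta_0) \]

Using Slutsky's theorem (and the continuous mapping theorem), as long as H is nonsingular,

\[ \sqrt{n}(\hat\theta - \theta_0) = -\left[\frac{\partial^2 Q_n(\theta^*)}{\partial\theta\partial\theta'}\right]^{-1} \sqrt{n}\frac{\partial}{\partial\theta} Q_n(\theta_0) \xrightarrow{d} N\left(0, H^{-1}SH^{-1}\right). \]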

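Information matrix equality again: $H + S = 0$. For $W_i = \frac{\partial}{\partial\theta}\log f(X_i|\theta_0)$, $EW_i = 0$:

\[ E\frac{\partial}{\partial\theta}\log f(X_i|\theta_0) = 0 \iff \int \frac{\partial}{\partial\theta}\log f(x|\theta_0)\, f(x|\theta_0)\, dx = 0. \]

Totally differentiating this identity with respect to $\theta_0$ gives

\[ \int \frac{\partial^2}{\partial\theta\partial\theta'}\log f(x|\theta_0)\, f(x|\theta_0)\, dx + \int \frac{\partial}{\partial\theta}\log f(x|\theta_0)\, \frac{\partial}{\partial\theta'}\log f(x|\theta_0)\, f(x|\theta_0)\, dx = 0, \]

which is $H + S = 0$.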

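Sandwich formula: the robust variance is $H^{-1}SH^{-1}$; the nonrobust versions are $-H^{-1}$ and $S^{-1}$. We need to estimate either or both of H and S, so that $\hat H \xrightarrow{p} H$ and $\hat S \xrightarrow{p} S$:

\[ \hat H_1 = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\partial\theta'}\log f(X_i|\hat\theta), \qquad \hat H_2 = \int \frac{\partial^2}{\partial\theta\partial\theta'}\log f(X_i|\hat\theta)\, f(X_i|\hat\theta)\, dX_i \]

\[ \hat S_1 = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i|\hat\theta)\, \frac{\partial}{\partial\theta'}\log f(X_i|\hat\theta), \qquad \hat S_2 = \int \frac{\partial}{\partial\theta}\log f(X_i|\hat\theta)\, \frac{\partial}{\partial\theta'}\log f(X_i|\hat\theta)\, f(X_i|\hat\theta)\, dX_i \]

$\hat H_2$ and $\hat S_2$ can be computed using numerical integration or simulation. If the model is correct, $\hat H_2$ and $\hat S_2$ can be more precise than $\hat H_1$ and $\hat S_1$.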

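To summarize:

\[ \sqrt{n}(\hat\theta_{MLE} - \theta_0) \overset{A}{\sim} N\left(0, H^{-1}SH^{-1}\right) \]

\[ (\hat\theta_{MLE} - \theta_0) \overset{A}{\sim} N\left(0, \frac{1}{n}H^{-1}SH^{-1}\right) = N\left(0, -\frac{H^{-1}}{n}\right) = N\left(0, \frac{S^{-1}}{n}\right). \]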

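Example 1: $X_i$ i.i.d., $X_i = 1$ with probability p and $X_i = 0$ with probability $1-p$.

\[ f(X_i|\theta) = p^{X_i}(1-p)^{1-X_i} \]

\[ \log f(X_i|\theta) = X_i\log p + (1-X_i)\log(1-p) \]

Score function:

\[ \frac{\partial}{\partial\theta}\log f(X_i|\theta) = \frac{X_i}{p} - \frac{1-X_i}{1-p} = \frac{X_i - p}{p(1-p)} \]

\[ S = Var\left(\frac{X_i - p}{p(1-p)}\right) = \frac{1}{p(1-p)} \]

\[ H = E\frac{\partial}{\partial p}\left(\frac{X_i - p}{p(1-p)}\right) = -\frac{1}{p(1-p)} \]

$H + S = 0$, and since $X_i$ is 1 or 0,

\[ \hat S_1 = \frac{1}{n}\sum_{i=1}^n \frac{(X_i - \hat p)^2}{\hat p^2(1-\hat p)^2} = \frac{1}{\hat p(1-\hat p)}. \]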
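The sample analogues of H and S for the Bernoulli example can be computed directly, which also shows the information matrix equality holding exactly in the sample for this model. The data below are made up for illustration, and the variable names are introduced here.

```python
# Hypothetical Bernoulli sample.
xs = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
n = len(xs)
p_hat = sum(xs) / n                      # MLE: the sample mean


def score(x, p):
    """d/dp log f(x|p) for the Bernoulli model."""
    return (x - p) / (p * (1 - p))


def hessian(x, p):
    """d^2/dp^2 log f(x|p) for the Bernoulli model."""
    return -x / p ** 2 - (1 - x) / (1 - p) ** 2


# Sample averages evaluated at the MLE.
h1 = sum(hessian(x, p_hat) for x in xs) / n
s1 = sum(score(x, p_hat) ** 2 for x in xs) / n

# Three variance estimates for p_hat.
sandwich_var = s1 / h1 ** 2 / n          # H^{-1} S H^{-1} / n
hessian_var = -1 / h1 / n                # -H^{-1} / n
score_var = 1 / s1 / n                   # S^{-1} / n

print("p_hat =", p_hat, " H1 =", h1, " S1 =", s1)
print("variance estimates:", sandwich_var, hessian_var, score_var)
```

All three estimates coincide with $\hat p(1-\hat p)/n$ here; in models where the sample versions of H and S differ, the sandwich form is the one that remains valid under misspecification.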

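Example 2: $X_i \sim N(\mu, \sigma^2)$, $f(X_i|\theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right)$.

\[ \log f(X_i|\theta) = C - \frac{1}{2}\log\sigma^2 - \frac{(X_i-\mu)^2}{2\sigma^2} \]

\[ \frac{\partial \log f(X_i|\theta)}{\partial\theta} = \begin{pmatrix} \frac{\partial \log f(X_i|\theta)}{\partial\mu} \\ \frac{\partial \log f(X_i|\theta)}{\partial\sigma^2} \end{pmatrix} = \begin{pmatrix} \frac{X_i-\mu}{\sigma^2} \\ -\frac{1}{2\sigma^2} + \frac{(X_i-\mu)^2}{2\sigma^4} \end{pmatrix} \]

\[ S = Var\left(\frac{\partial \log f(X_i|\theta)}{\partial\theta}\right) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \]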
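The entries of S in the normal example follow from the central moments of the normal distribution, $E(X_i-\mu)^2 = \sigma^2$, $E(X_i-\mu)^3 = 0$ and $E(X_i-\mu)^4 = 3\sigma^4$. A short worked sketch:

```latex
Var\left(\frac{X_i - \mu}{\sigma^2}\right) = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}

Var\left(\frac{(X_i - \mu)^2}{2\sigma^4}\right)
  = \frac{E(X_i-\mu)^4 - \left(E(X_i-\mu)^2\right)^2}{4\sigma^8}
  = \frac{3\sigma^4 - \sigma^4}{4\sigma^8} = \frac{1}{2\sigma^4}

Cov\left(\frac{X_i - \mu}{\sigma^2},\, \frac{(X_i - \mu)^2}{2\sigma^4}\right)
  = \frac{E(X_i-\mu)^3}{2\sigma^6} = 0
```

Inverting S and dividing by n gives $\sigma^2/n$ and $2\sigma^4/n$, the same bounds obtained from the Hessian in the earlier CRLB calculation.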
