
  • Variance of OLS Estimators and Hypothesis Testing

    Charlie Gibbons, ARE 212

    Spring 2011

  • Randomness in the model

    Considering the model

    Y = Xβ + ε,

    what is random?

    β is a parameter and not random,

    X may be random, but we condition on it, and

    ε is random, making Y random as well.

    Though β is not random, our estimator β̂ = (X′X)⁻¹X′Y is random because it is a function of Y.

  • GM assumptions

    Under the Gauss-Markov assumptions,

    1 Y = Xβ + ε (linear model),

    2 X has full column rank (no multicollinearity),

    3 E[ε | X] = 0 (strict exogeneity), and

    4 Var(ε | X) = σ²I (homoskedasticity, no serial correlation).

    Assumptions 1–3 guarantee unbiasedness of the OLS estimator. We have also seen that it is consistent.

    The final assumption guarantees efficiency; the OLS estimator has the smallest variance among linear unbiased estimators of β. The OLS estimator is BLUE.

    Sometimes we add the assumption ε | X ∼ N(0, σ²I), which makes the OLS estimator BUE.

  • Variance of β̂

    We typically calculate the conditional variance of β̂:

    \[
    \begin{aligned}
    \operatorname{Var}(\hat{\beta} \mid X)
      &= \operatorname{Var}\bigl((X'X)^{-1}X'Y \mid X\bigr) \\
      &= (X'X)^{-1}X'\,\operatorname{Var}(Y \mid X)\,X(X'X)^{-1} \\
      &= (X'X)^{-1}X'\,\operatorname{Var}(\epsilon \mid X)\,X(X'X)^{-1} \\
      &= \sigma^2 (X'X)^{-1}.
    \end{aligned}
    \]
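    This result is easy to check by simulation: fix X, draw many error vectors ε ∼ N(0, σ²I), and compare the sampling covariance of β̂ across draws with σ²(X′X)⁻¹. A minimal sketch in Python (the design, β, and σ below are made up for illustration):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, k, sigma = 200, 3, 2.0
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # fixed design
    beta = np.array([1.0, 0.5, -0.25])

    # Draw many samples of Y = X beta + eps and collect the OLS estimates
    draws = np.array([
        np.linalg.solve(X.T @ X, X.T @ (X @ beta + sigma * rng.normal(size=n)))
        for _ in range(20_000)
    ])

    theoretical = sigma**2 * np.linalg.inv(X.T @ X)
    empirical = np.cov(draws, rowvar=False)
    print(np.round(theoretical, 4))
    print(np.round(empirical, 4))  # close to the theoretical matrix
    ```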

  • Simple regression example

    Recall that, for a simple regression, we have

    \[
    (X'X)^{-1} = \frac{1}{n\,\widehat{\operatorname{Var}}(x)}
    \begin{bmatrix} \frac{1}{n}\sum x_i^2 & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix};
    \]

    the variance of the slope coefficient is

    \[
    \widehat{\operatorname{Var}}(\hat{\beta}_1) = \frac{\hat{\sigma}^2}{n\,\widehat{\operatorname{Var}}(x)}.
    \]

  • σ̂² is unbiased

    We show that σ̂² is unbiased:

    \[
    \begin{aligned}
    E\Bigl[\frac{\hat{\epsilon}'\hat{\epsilon}}{N-K} \Bigm| X\Bigr]
      &= E\Bigl[\frac{\epsilon'M'M\epsilon}{N-K} \Bigm| X\Bigr]
       = E\Bigl[\frac{\epsilon'M\epsilon}{N-K} \Bigm| X\Bigr] \\
      &= \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} m_{ji}\,E[\epsilon_i\epsilon_j \mid X]}{N-K}
       = \frac{\sum_{i=1}^{N} m_{ii}\,\sigma^2}{N-K}
       = \frac{\sigma^2\,\operatorname{tr}(M)}{N-K}.
    \end{aligned}
    \]

  • σ̂² is unbiased, continued

    \[
    \begin{aligned}
    \operatorname{tr}(M) &= \operatorname{tr}(I_N - P_X) = \operatorname{tr}(I_N) - \operatorname{tr}(P_X) \\
      &= N - \operatorname{tr}\bigl(X(X'X)^{-1}X'\bigr) \\
      &= N - \operatorname{tr}\bigl((X'X)^{-1}X'X\bigr) \\
      &= N - \operatorname{tr}(I_K) = N - K \\
    \implies E\Bigl[\frac{\hat{\epsilon}'\hat{\epsilon}}{N-K} \Bigm| X\Bigr]
      &= \frac{\sigma^2 (N-K)}{N-K} = \sigma^2.
    \end{aligned}
    \]
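    Both pieces of this argument, tr(M) = N − K and E[σ̂²] = σ², are easy to verify numerically; a small sketch with a simulated design (dimensions chosen arbitrarily):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    N, K = 50, 4
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])

    P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix P_X
    M = np.eye(N) - P                      # residual-maker matrix
    print(np.trace(M))                     # N - K = 46, up to rounding

    # Unbiasedness: since eps_hat'eps_hat = eps' M eps, average sigma2_hat
    # over many error draws and compare with the true sigma^2
    sigma = 1.5
    vals = []
    for _ in range(20_000):
        eps = sigma * rng.normal(size=N)
        vals.append(eps @ M @ eps / (N - K))
    print(np.mean(vals))                   # close to sigma^2 = 2.25
    ```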

  • Known covariance matrix

    Suppose that Var(ε | X) = Ω. This matrix must be symmetric and positive definite; in this case, the Cholesky decomposition says that there exists a lower triangular matrix C such that CC′ = Ω.

    If Ω is known, then we can alter our regression using weighted least squares:

    C⁻¹Y = C⁻¹Xβ + C⁻¹ε,

    and this regression follows the GM assumptions. For example,

    \[
    \operatorname{Var}(C^{-1}\epsilon \mid X) = C^{-1}\Omega(C')^{-1} = C^{-1}CC'(C')^{-1} = I.
    \]

    We could also get back to the GM assumptions if we knew the matrix Ω̃ such that σ²Ω̃ = Ω.
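    As a concrete sketch of the transformation, take a known diagonal Ω (pure heteroskedasticity, made up for illustration), factor it as CC′, and run OLS on the transformed data; note that numpy's cholesky returns the lower-triangular factor:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta = np.array([1.0, 2.0])

    # A known diagonal Omega: pure heteroskedasticity, chosen for illustration
    omega_diag = 0.5 + X[:, 1] ** 2
    Y = X @ beta + rng.normal(size=n) * np.sqrt(omega_diag)

    C = np.linalg.cholesky(np.diag(omega_diag))  # C C' = Omega (lower triangular)
    Xs = np.linalg.solve(C, X)                   # C^{-1} X
    Ys = np.linalg.solve(C, Y)                   # C^{-1} Y

    # OLS on the transformed data: the GLS / weighted least squares estimate
    beta_gls = np.linalg.solve(Xs.T @ Xs, Xs.T @ Ys)
    print(beta_gls)
    ```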

  • Heteroskedasticity

    Suppose that we have heteroskedasticity: Var(εᵢ | X) = σᵢ², but still no serial correlation. Then, our derivation for the variance of our estimator would have

    \[
    \operatorname{Var}(\hat{\beta} \mid X)
      = (X'X)^{-1}X'\,\operatorname{Var}(\epsilon \mid X)\,X(X'X)^{-1}
      = (X'X)^{-1}X'\,E[\epsilon\epsilon' \mid X]\,X(X'X)^{-1}.
    \]

    What are the dimensions of E[εε′ | X]? N × N.

    How many unique elements does it have? N by assumption (N × (N + 1)/2 if serial correlation is possible).

  • Robust covariance matrix

    We estimate the Eicker-White heteroskedasticity-robust ("robust") matrix

    \[
    (X'X)^{-1}X'\,E[\epsilon\epsilon' \mid X]\,X(X'X)^{-1}
    \]

    using the moment estimator

    \[
    (X'X)^{-1}\Bigl(\sum_i x_ix_i'\,\hat{\epsilon}_i^2\Bigr)(X'X)^{-1}.
    \]

    Recall that we said that the asymptotic variance of β̂ − β is

    \[
    \frac{1}{n}\,E[x_ix_i']^{-1}\,E[x_ix_i'\epsilon_i^2]\,E[x_ix_i']^{-1}
    \implies \frac{1}{n}\Bigl(\frac{X'X}{n}\Bigr)^{-1}\,\frac{1}{n}\sum_i x_ix_i'\hat{\epsilon}_i^2\,\Bigl(\frac{X'X}{n}\Bigr)^{-1},
    \]

    which reduces to the top expression; robust standard errors are a consistent estimator of the asymptotic variance of the coefficients.
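    A minimal implementation of this sandwich estimator (the HC0 form, on simulated heteroskedastic data; variable names are illustrative):

    ```python
    import numpy as np

    def robust_cov(X, Y):
        """Eicker-White (HC0) covariance estimate for the OLS coefficients."""
        beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
        e_hat = Y - X @ beta_hat
        bread = np.linalg.inv(X.T @ X)
        meat = (X * e_hat[:, None] ** 2).T @ X   # sum_i x_i x_i' e_i^2
        return bread @ meat @ bread

    rng = np.random.default_rng(3)
    n = 1_000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))
    print(np.sqrt(np.diag(robust_cov(X, Y))))    # robust standard errors
    ```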

  • Hypothesis testing

    Now that we have a well-established distribution for our estimator, we want to ask whether our data are consistent with some belief that we have about the true value of our parameter(s), known as a null hypothesis, typically written H0.

    Specifically, we consider the null hypothesis that β = b (a scalar).

    To perform a hypothesis test, we ask, "What is the probability of getting a value of our estimator further away from our null hypothesis (in absolute value) than our particular estimate, given that the null hypothesis is true?"

  • Fisherian hypothesis testing

    Let β̂ be our estimator and let b̂ be our estimate. Our null hypothesis is β = b, and β̂ has an asymptotically normal distribution. We find

    \[
    \begin{aligned}
    &\Pr\Bigl(\bigl|\hat{\beta} - b\bigr| > \bigl|\hat{b} - b\bigr| \Bigm| \beta = b\Bigr) \\
    &\quad= \Pr\Bigl(\hat{\beta} - b > \bigl|\hat{b} - b\bigr| \Bigm| \beta = b\Bigr)
          + \Pr\Bigl(\hat{\beta} - b < -\bigl|\hat{b} - b\bigr| \Bigm| \beta = b\Bigr) \\
    &\quad= \Pr\Biggl(\frac{\hat{\beta} - b}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}} > \frac{\bigl|\hat{b} - b\bigr|}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}} \Biggm| \beta = b\Biggr)
          + \Pr\Biggl(\frac{\hat{\beta} - b}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}} < -\frac{\bigl|\hat{b} - b\bigr|}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}} \Biggm| \beta = b\Biggr) \\
    &\quad= \Pr\Biggl(\frac{\hat{\beta} - \beta}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}} > \frac{\bigl|\hat{b} - b\bigr|}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}}\Biggr)
          + \Pr\Biggl(\frac{\hat{\beta} - \beta}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}} < -\frac{\bigl|\hat{b} - b\bigr|}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}}\Biggr) \\
    &\quad= 1 - \Phi\Biggl(\frac{\bigl|\hat{b} - b\bigr|}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}}\Biggr)
          + \Phi\Biggl(-\frac{\bigl|\hat{b} - b\bigr|}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}}\Biggr) \\
    &\quad= 2\,\Phi\Biggl(-\frac{\bigl|\hat{b} - b\bigr|}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}}\Biggr)
          = 2\,\Phi\bigl(-|\hat{z}|\bigr).
    \end{aligned}
    \]

  • p values

    2Φ(−|ẑ|) is called the p value.

    p value

    The probability of observing a β̂ at least as far from your null hypothesis as your actual estimate, given that the null hypothesis is true.

    Note that the p value is just a restatement, a one-to-one transformation, of our test statistic ẑ and is just a means of describing our result relative to the null hypothesis; it is sample-dependent and so too is its interpretation (cf. a frequency interpretation, as in the next case).
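    Turning an estimate, a null value, and a standard error into a p value is one line; a sketch with illustrative numbers:

    ```python
    from scipy.stats import norm

    def two_sided_p(beta_hat, b, se):
        """p value for H0: beta = b under the normal approximation."""
        z_hat = (beta_hat - b) / se
        return 2 * norm.cdf(-abs(z_hat))

    print(two_sided_p(beta_hat=0.8, b=0.0, se=0.4))  # z = 2, p ~ 0.0455
    ```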

  • Interpretation

    The p value is calculated assuming that the null hypothesis is true. We calculate the probability of observing our data given this assumption.

    Note that this tells us the probability of our data, not the probability that the null hypothesis is true.

  • Fundamental problem of statistics

    We learn Pr(data | null hypothesis), not Pr(null hypothesis | data).

    How can we go from the former to the latter, the actual quantity of interest?

    Frequentists can't; this is called the fundamental problem of statistics.

  • Bayesian inference

    Bayes’ rule states that

    \[
    \Pr(\text{null hypothesis} \mid \text{data})
      = \frac{\Pr(\text{data} \mid \text{null hypothesis})\,\Pr(\text{null hypothesis})}{\Pr(\text{data})}.
    \]

    What’s the problem?

    Pr(data) isn't known, but we actually don't need it, and

    the prior probability of the null, Pr(null hypothesis), is unknown.

    This is an illustration of Bayesian inference.

  • Neyman-Pearson

    Imagine observing many data sets and calculating many p values. You reject the null hypothesis if the p value is less than some level α. Then the probability of rejecting a null hypothesis when the null hypothesis is true is

    \[
    \begin{aligned}
    \Pr(p(Z) < \alpha \mid H_0) &= \Pr(p(Z) < \alpha) = \Pr(2\Phi(-|Z|) < \alpha) \\
      &= \Pr\bigl(|Z| > -\Phi^{-1}(\tfrac{\alpha}{2})\bigr) \\
      &= \Pr\bigl(Z > -\Phi^{-1}(\tfrac{\alpha}{2})\bigr) + \Pr\bigl(Z < \Phi^{-1}(\tfrac{\alpha}{2})\bigr) \\
      &= \frac{\alpha}{2} + \frac{\alpha}{2} = \alpha.
    \end{aligned}
    \]
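    This size calculation is easy to confirm by simulation: under the null the z statistic is standard normal, so the p values are uniform and the rejection rate equals α. A quick sketch:

    ```python
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    alpha = 0.05

    # Many "data sets" under H0: the z statistic is standard normal
    z = rng.normal(size=100_000)
    p = 2 * norm.cdf(-np.abs(z))
    print(np.mean(p < alpha))   # rejection rate, close to alpha = 0.05
    ```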

  • Significance

    α is called the significance level of a test. We reject the null hypothesis if p(ẑ) < α.

    If you reject the null hypothesis using a level-α test, then, if you performed many α-level tests, you would falsely reject the null hypothesis 100 × α% of the time.

    Note that this is a frequency-based interpretation of a hypothesis test (cf. the data-specific p value).

    Note that this procedure, too, does not tell us whether our specific null hypothesis is true or false; instead, it tells us what proportion of the time we would falsely reject a true null.

  • Alternative hypothesis

    We have only mentioned the null hypothesis; we haven't mentioned what happens if the null is in fact false.

    We haven't specified an alternative hypothesis H1.

    Some test statistics require specifying an alternative, though the ones that we consider here do not (see, e.g., the likelihood ratio test).

  • Type I and type II errors

    Four things could happen:

                         H0 is true     H0 is false
    Do not reject H0     Correct        Type II error
    Reject H0            Type I error   Correct

    1 − Pr(Type II error) is called the power (the Type II error rate is often denoted β, not to be confused with the regression coefficients).

    Neyman and Pearson advocated finding a test that falsely rejects the null hypothesis some specified α proportion of the time and that maximizes the probability of rejecting the null hypothesis when it is false.

    We want to minimize the Type II error (or maximize power) subject to some specified level of Type I error.

  • A courtroom example

    Consider being on a jury, with the previous table relabeled under the null hypothesis of not guilty:

                       Not guilty     Guilty
    Do not convict     Correct        Type II error
    Convict            Type I error   Correct

    We can minimize the Type I error by never convicting anyone, but that would mean that we let a lot of guilty people go free; in other words, we have a high Type II error.

    We could make sure that every guilty person goes to jail by convicting everyone, but that would require convicting a lot of innocent people; minimizing the Type II error leads to a high Type I error.

    There is a trade-off between Type I and Type II errors.

  • Most powerful tests

    Actually, we (may) have taken an alternative into account before we even started.

    The alternative hypothesis helps us choose the "best" (highest power) tests, but we don't (typically) use it in calculating test statistics. These are called most powerful tests.

    A test may be powerful for only a range of alternatives, while another test is more powerful for alternatives in another range. It is hard to find a test that is the most powerful for all alternatives, a uniformly most powerful test.

  • Power calculations

    Let's consider the power of the z test. Suppose, without loss of generality, that the null hypothesis is b = 0. We reject the null hypothesis if |ẑ| > c, where c is chosen to give us the appropriate level of our test (e.g., c = Φ⁻¹(1 − α/2)).

    To calculate power, we calculate the probability of rejecting the null hypothesis across all possible values of the true β. Stated differently, power is a function of the true parameter value.

    Comprehension check: What is the power of this test at β = 0?

  • Calculating power

    Let z be the test statistic and Z represent a standard normal random variable. Then,

    \[
    \begin{aligned}
    \Pr(\text{reject null}) &= \Pr(|z| > c) = 1 - \Pr(|z| < c) = 1 - \Pr(-c < z < c) \\
      &= 1 - \Pr\Biggl(-c < \frac{\hat{\beta} - b}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}} < c\Biggr)
    \end{aligned}
    \]

  • Calculating power, continued

    \[
    \begin{aligned}
    &= 1 - \Pr\Biggl(-c + \frac{b - \beta}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}} < \frac{\hat{\beta} - \beta}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}} < c + \frac{b - \beta}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}}\Biggr) \\
    &= 1 - \Pr\Biggl(-c + \frac{b - \beta}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}} < Z < c + \frac{b - \beta}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}}\Biggr) \\
    &= 1 - \Biggl[\Phi\Biggl(c + \frac{b - \beta}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}}\Biggr) - \Phi\Biggl(-c + \frac{b - \beta}{\sqrt{\widehat{\operatorname{Var}}(\hat{\beta})}}\Biggr)\Biggr].
    \end{aligned}
    \]

  • [Figure: Power for z test with standard error of 1 and null hypothesis of 0 with α = 0.05. The curve plots power (vertical axis, 0 to 1) against the true β (horizontal axis, −4 to 4), with its minimum of 0.05 at β = 0.]
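    The power curve in the figure can be reproduced directly from the closed form above; a sketch (standard error of 1 and null of 0, matching the figure):

    ```python
    import numpy as np
    from scipy.stats import norm

    def power(beta, b=0.0, se=1.0, alpha=0.05):
        """Power of the two-sided z test of H0: beta = b when the truth is beta."""
        c = norm.ppf(1 - alpha / 2)
        shift = (b - beta) / se
        return 1 - (norm.cdf(c + shift) - norm.cdf(-c + shift))

    for true_beta in [-4, -2, 0, 2, 4]:
        print(true_beta, round(power(true_beta), 3))
    # power(0) equals the size alpha = 0.05 and rises toward 1 as |beta| grows
    ```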

  • z test

    The asymptotic distribution of β̂ is N(β, V̂ar(β̂)). The variance is assumed to be given, rather than estimated. We have

    ẑ ∼ N(0, 1).

    Uses: testing single hypotheses (e.g., that a particular coefficient is equal to 0).

    This only applies for "large" n.

    Reject if |ẑ| > Φ⁻¹(1 − α/2).

  • t tests

    If we assume that the errors are distributed N(0, σ²I), then β̂ is distributed N(β, V̂ar(β̂)), but the variance is taken to be estimated. Then,

    ẑ ∼ t(0, 1, d.f.).

    Uses: testing single hypotheses (e.g., that a particular coefficient is equal to 0).

    If the model matrix has rank K, then the degrees of freedom for a regression coefficient is N − K.

    Reject if |ẑ| > t_{N−K, α/2}, where this is the two-sided α critical value of the t distribution with N − K degrees of freedom.
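    A sketch of the full t-test recipe on simulated data (β̂, σ̂², the standard error, the t statistic, and the two-sided critical value; the data-generating process is made up):

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 30
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
    N, K = X.shape

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e_hat = Y - X @ beta_hat
    s2 = e_hat @ e_hat / (N - K)                      # sigma^2 hat
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

    t_stat = beta_hat[1] / se[1]                      # H0: beta_1 = 0
    crit = stats.t.ppf(1 - 0.05 / 2, df=N - K)        # two-sided 5% critical value
    print(t_stat, crit, abs(t_stat) > crit)
    ```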

  • Wald test

    Let the vector β̂ ∼ N(β, V̂). Then, for a matrix R of a set of restrictions with rank r, with the null hypothesis that Rβ = b,

    \[
    W = (R\hat{\beta} - b)'\,\bigl(R\hat{V}R'\bigr)^{-1}\,(R\hat{\beta} - b) \sim \chi^2_r.
    \]

    Based upon the asymptotic distribution.

    Can test multiple restrictions using the robust variance-covariance matrix (cf. F test).

    Reject if W > χ²_{r,α}, the upper-α critical value of the χ²_r distribution.
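    A small helper implementing this statistic; the numbers below are illustrative, with V̂ standing in for, e.g., a robust covariance estimate:

    ```python
    import numpy as np
    from scipy import stats

    def wald_test(beta_hat, V_hat, R, b):
        """Wald statistic and p value for H0: R beta = b."""
        diff = R @ beta_hat - b
        W = diff @ np.linalg.solve(R @ V_hat @ R.T, diff)
        return W, stats.chi2.sf(W, df=R.shape[0])

    # Illustrative numbers: test beta_1 = beta_2 = 0 in a three-coefficient model
    beta_hat = np.array([1.2, 0.4, -0.3])
    V_hat = np.diag([0.04, 0.02, 0.02])   # stand-in for a robust covariance estimate
    R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    print(wald_test(beta_hat, V_hat, R, b=np.zeros(2)))
    ```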

  • F tests

    Let the vector β̂ ∼ N(β, σ̂²(X′X)⁻¹). Then, for a matrix R of a set of restrictions with rank r, with a null hypothesis that Rβ = b,

    \[
    F = \frac{(R\hat{\beta} - b)'\,\bigl(R(X'X)^{-1}R'\bigr)^{-1}\,(R\hat{\beta} - b)}{r\,\hat{\sigma}^2} \sim F_{r,\,N-K}.
    \]

    Usually a finite-sample test; here, σ̂² is treated as a random variable itself, with (N − K)σ̂²/σ² ∼ χ²_{N−K}; asymptotic tests take the variance as fixed.

    Cannot handle a robust variance-covariance matrix.

    Reject if F > F_{r,N−K,α}, the upper-α critical value of the F_{r,N−K} distribution.

  • Partitioned matrix inverse formula

    Note the following fact:

    \[
    \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1}
    = \begin{bmatrix} D_1^{-1} & -D_1^{-1}A_{12}A_{22}^{-1} \\ -D_2^{-1}A_{21}A_{11}^{-1} & D_2^{-1} \end{bmatrix},
    \]

    where D₁ = A₁₁ − A₁₂A₂₂⁻¹A₂₁ and D₂ = A₂₂ − A₂₁A₁₁⁻¹A₁₂; this is the partitioned matrix inverse formula.

    Let R = [0  I_r] and the null hypothesis be Rβ = 0; we are testing whether some subset of the coefficients (the last r) are 0. Then, using the partitioned matrix inverse formula,

    \[
    R(X'X)^{-1}R' = \bigl(X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2\bigr)^{-1} = (X_2'M_1X_2)^{-1}.
    \]

  • Applying the FWL theorem

    Our F statistic, then, is

    \[
    F = \frac{\hat{\beta}_2'\,(X_2'M_1X_2)\,\hat{\beta}_2}{r\,\hat{\sigma}^2}.
    \]

    Let ε̃ be the residuals from a regression of y on X1. Let ε̂ be the residuals from the full regression. X̃2 is the set of residuals from a regression of X2 on X1. The FWL theorem states that ε̃ = X̃2β̂2 + ε̂. Then, because X̃2 is orthogonal to ε̂ (the cross terms vanish),

    \[
    \begin{aligned}
    \tilde{\epsilon}'\tilde{\epsilon} &= \hat{\beta}_2'\tilde{X}_2'\tilde{X}_2\hat{\beta}_2 + \hat{\epsilon}'\hat{\epsilon} \\
      &= \hat{\beta}_2'X_2'M_1X_2\hat{\beta}_2 + \hat{\epsilon}'\hat{\epsilon} \\
    \implies \hat{\beta}_2'X_2'M_1X_2\hat{\beta}_2 &= \tilde{\epsilon}'\tilde{\epsilon} - \hat{\epsilon}'\hat{\epsilon}.
    \end{aligned}
    \]

  • The F statistic reframed

    Recall that σ̂² = ε̂′ε̂/(N − K). The F statistic is

    \[
    F = \frac{N-K}{r}\,\frac{\tilde{\epsilon}'\tilde{\epsilon} - \hat{\epsilon}'\hat{\epsilon}}{\hat{\epsilon}'\hat{\epsilon}}.
    \]

    This is the proportional difference between the sum of squared residuals (SSR) from a regression that omits the variables that we propose omitting and the SSR from the full regression, rescaled; intuitively, the F test asks, "Relatively, how much bigger are the mistakes that we make when we exclude the proposed variables?"

    Note that the SSR from the restricted regression must be greater than that of the unrestricted regression.

  • The F test and R²

    Let the restricted regression be no model at all—i.e., Y = β0.

    Then, the restricted SSR is the total sum of squared errors (SST) in our model. The number of restrictions is r = K − 1. We have:

    \[
    \frac{(\mathrm{SST} - \mathrm{SSR})/(K-1)}{\mathrm{SSR}/(N-K)}
      = \frac{N-K}{K-1}\Bigl(\frac{\mathrm{SST}}{\mathrm{SSR}} - 1\Bigr)
      = \frac{N-K}{K-1}\Bigl(\frac{1}{1-R^2} - 1\Bigr)
      = \frac{N-K}{K-1}\,\frac{R^2}{1-R^2}.
    \]
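    Both forms of the F statistic can be verified numerically: compute the SSR from the full and intercept-only regressions, form the SSR-based statistic, and compare it with the R² version. A sketch on simulated data:

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    Y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)
    N, K = X.shape

    def ssr(X, Y):
        """Sum of squared residuals from an OLS regression of Y on X."""
        e = Y - X @ np.linalg.solve(X.T @ X, X.T @ Y)
        return e @ e

    ssr_full = ssr(X, Y)          # unrestricted regression
    sst = ssr(X[:, :1], Y)        # intercept only: the total sum of squares
    r = K - 1

    F = (N - K) / r * (sst - ssr_full) / ssr_full
    R2 = 1 - ssr_full / sst
    print(F, (N - K) / r * R2 / (1 - R2))   # identical, by the R^2 identity
    print(stats.f.sf(F, r, N - K))          # p value from F_{r, N-K}
    ```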

  • Bonferroni’s inequality

    Let's begin with a fact from basic statistics. Suppose that we have two events, A and B. We can write the probability that either A happens or B happens as

    Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B),

    which is the probability that A happens, plus the probability that B happens, minus the probability that both happen (otherwise we'd be double counting the case where both happen). This leads to Bonferroni's inequality,

    Pr(A ∪ B) ≤ Pr(A) + Pr(B);

    "less than" because we don't subtract off the probability of both A and B occurring from the right-hand side.

  • Testing regression coefficients

    Let’s suppose that we have a multiple regression model

    yi = β0 + β1x1i + β2x2i + εi

    and we want to test β1 = 0 and β2 = 0.

    Suppose that we did two separate t tests to answer this question; i.e., test β1 = 0 and β2 = 0 each by a t test and reject the joint hypothesis if we can reject either separate null hypothesis.

  • Type I error

    To calculate the Type I error of this procedure, we consider

    Pr(reject β1 = 0 or reject β2 = 0|β1 = 0, β2 = 0).

    Based upon our fact of probability, we see that this probability is equal to

    Pr(reject β1 = 0 | β1 = 0, β2 = 0)
      + Pr(reject β2 = 0 | β1 = 0, β2 = 0)
      − Pr(reject β1 = 0 and reject β2 = 0 | β1 = 0, β2 = 0).

    This is the Type I error of the t test for β1 = 0, plus the Type I error of the t test for β2 = 0, minus the probability that you reject both when both are in fact true.

  • Application of Bonferroni’s inequality

    By Bonferroni’s inequality, we have

    Pr(reject β1 = 0 or reject β2 = 0 | β1 = 0, β2 = 0) ≤ α + α = 2α,

    where α is the Type I error for each of the t tests.

    So the Type I error of our joint test can be twice as large as the error from our separate t tests! How can we get around this problem?

    If we want to falsely reject the joint test with probability α, then set our Type I error rate for the separate t tests to α/2.

    More generally, if we have n hypotheses, set the individual Type I error levels to α/n.
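    A simulation makes both the inflation and the correction concrete. With two independent z tests, the naive joint Type I error is 1 − (1 − α)² ≈ 2α; testing each at level α/2 brings it back below α. A sketch:

    ```python
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(7)
    alpha, sims = 0.05, 100_000

    # Two independent z statistics under the joint null
    z1 = rng.normal(size=sims)
    z2 = rng.normal(size=sims)
    c = norm.ppf(1 - alpha / 2)        # uncorrected two-sided critical value
    c_bonf = norm.ppf(1 - alpha / 4)   # per-test level alpha / 2

    reject_naive = (np.abs(z1) > c) | (np.abs(z2) > c)
    reject_bonf = (np.abs(z1) > c_bonf) | (np.abs(z2) > c_bonf)
    print(np.mean(reject_naive))  # ~ 0.0975, nearly 2 * alpha
    print(np.mean(reject_bonf))   # ~ 0.049, back below alpha
    ```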

  • Issues of power

    Note that this is a conservative test; we have shown that the Type I error for the joint test is less than or equal to the sum of the separate Type I errors; "less than" because we ignore correlations between the tests. This means that this test is not powerful.

    The F test does not ignore these correlations and thus is more powerful than this "Bonferroni correction." Tests of joint hypotheses are useful when you have a natural set of hypotheses (e.g., testing that "race doesn't matter" by testing the joint null hypothesis that the coefficients on several race indicator variables are 0).

  • Multiple tests and multicollinearity

    Testing multiple hypotheses is also useful when you have two multicollinear variables; the standard errors of each may be too wide to reject an individual null hypothesis (due to the multicollinearity), but the joint null takes into account this correlation and provides a more powerful test.