-
Variance of OLS Estimators and Hypothesis Testing
Charlie Gibbons
ARE 212
Spring 2011
-
Randomness in the model
Considering the model
Y = Xβ + ε,
what is random?
β is a parameter and not random,
X may be random, but we condition on it, and
ε is random, making Y random as well.
Though β is not random, our estimator β̂ = (X′X)⁻¹X′Y is random because it is a function of Y.
-
GM assumptions
Under the Gauss-Markov assumptions,
1 Y = Xβ + ε (linear model),
2 X has full column rank (no multicollinearity),
3 E[ε | X] = 0 (strict exogeneity), and
4 Var(ε | X) = σ²I (homoskedasticity, no serial correlation).
Assumptions 1–3 guarantee unbiasedness of the OLS estimator. We have also seen that it is consistent.
The final assumption guarantees efficiency; the OLS estimator has the smallest variance among linear unbiased estimators. The OLS estimator is BLUE.
Sometimes we add the assumption ε | X ∼ N(0, σ²I), which makes the OLS estimator BUE.
-
Variance of β̂
We typically calculate the conditional variance of β̂:
Var(β̂ | X) = Var((X′X)⁻¹X′Y | X)
           = (X′X)⁻¹X′ Var(Y | X) X(X′X)⁻¹
           = (X′X)⁻¹X′ Var(ε | X) X(X′X)⁻¹
           = σ²(X′X)⁻¹.
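As a sketch (not from the lecture), the formula σ²(X′X)⁻¹ can be checked by Monte Carlo: the empirical covariance of β̂ across simulated draws of ε should match the closed form. All data here are made up for illustration.

```python
import numpy as np

# Toy design matrix with an intercept; sigma2 is the (known) error variance.
rng = np.random.default_rng(0)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
sigma2 = 4.0
beta = np.array([1.0, 2.0, -1.0])

# The closed-form conditional variance: Var(beta_hat | X) = sigma^2 (X'X)^{-1}.
XtX_inv = np.linalg.inv(X.T @ X)
var_beta = sigma2 * XtX_inv

# Monte Carlo: redraw epsilon many times, re-estimate beta_hat each time.
betas = []
for _ in range(2000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    betas.append(np.linalg.solve(X.T @ X, X.T @ y))
emp_var = np.cov(np.array(betas).T)   # empirical covariance of the estimator
```

The empirical covariance `emp_var` should agree with `var_beta` up to simulation noise.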
-
Simple regression example
Recall that, for a simple regression, we have

(X′X)⁻¹ = 1/(n V̂ar(x)) [ (1/n)Σxᵢ²   −x̄ ]
                        [    −x̄        1 ];

the variance of the slope coefficient is

V̂ar(β̂₁) = σ̂² / (n V̂ar(x)).
-
σ̂² is unbiased
We show that σ̂² is unbiased:

E[ε̂′ε̂ / (N − K) | X] = E[ε′M′Mε / (N − K) | X]
                      = E[ε′Mε / (N − K) | X]
                      = Σᵢ Σⱼ mⱼᵢ E[εᵢεⱼ | X] / (N − K)
                      = Σᵢ mᵢᵢ σ² / (N − K)
                      = σ² tr(M) / (N − K)
-
σ̂2 is unbiased, continued
tr(M) = tr(I_N − P_X) = tr(I_N) − tr(P_X)
      = N − tr(X(X′X)⁻¹X′)
      = N − tr((X′X)⁻¹X′X)
      = N − tr(I_K) = N − K

⟹ E[ε̂′ε̂ / (N − K) | X] = σ²(N − K)/(N − K) = σ².
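A small simulation sketch of the two facts just derived, using a toy design matrix of my own choosing: tr(M) = N − K exactly, and ε̂′ε̂/(N − K) averages to σ² across draws.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
sigma2 = 2.0

# Residual-maker matrix M = I - X(X'X)^{-1}X'; its trace is N - K.
M = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T
trace_M = np.trace(M)

# Average of e'e/(N-K) over many epsilon draws (residuals are M @ epsilon).
draws = []
for _ in range(4000):
    eps = rng.normal(scale=np.sqrt(sigma2), size=N)
    e_hat = M @ eps
    draws.append(e_hat @ e_hat / (N - K))
s2_mean = np.mean(draws)
```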
-
Known covariance matrix
Suppose that Var(ε | X) = Ω. This matrix must be symmetric and positive definite; in this case, the Cholesky decomposition says that there exists a lower triangular matrix C such that CC′ = Ω.
If Ω is known, then we can alter our regression using weighted least squares:

C⁻¹Y = C⁻¹Xβ + C⁻¹ε

and this regression follows the GM assumptions. For example,

Var(C⁻¹ε | X) = C⁻¹Ω(C′)⁻¹ = C⁻¹CC′(C′)⁻¹ = I.

We could also get back to the GM assumptions if we knew the matrix Ω̃ : σ²Ω̃ = Ω.
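The transformation above can be sketched numerically. This is an illustration with an invented diagonal Ω; the key check is that C⁻¹ΩC′⁻¹ = I, so the transformed regression satisfies the GM assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 0.5])

# A known heteroskedastic covariance (diagonal Omega, purely illustrative).
omega_diag = 0.5 + rng.uniform(size=n)
Omega = np.diag(omega_diag)
C = np.linalg.cholesky(Omega)          # lower triangular, C C' = Omega

eps = C @ rng.normal(size=n)           # errors with Var(eps | X) = Omega
y = X @ beta + eps

# Transform and run OLS on the transformed system.
Ci = np.linalg.inv(C)
Xs, ys = Ci @ X, Ci @ y
beta_gls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

# The transformed errors have identity covariance: C^{-1} Omega C'^{-1} = I.
check = Ci @ Omega @ Ci.T
```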
-
Heteroskedasticity
Suppose that we have heteroskedasticity: Var(εᵢ | X) = σᵢ², but still no serial correlation. Then, our derivation for the variance of our estimator would have

Var(β̂ | X) = (X′X)⁻¹X′ Var(ε | X) X(X′X)⁻¹
           = (X′X)⁻¹X′ E[εε′ | X] X(X′X)⁻¹.

What are the dimensions of E[εε′ | X]? N × N.
How many unique elements does it have? N by assumption (N × (N + 1)/2 if serial correlation is possible).
-
Robust covariance matrix
We estimate the Eicker-White heteroskedasticity-robust (robust) matrix using the moment estimator

(X′X)⁻¹X′ E[εε′ | X] X(X′X)⁻¹ = (X′X)⁻¹ (Σᵢ xᵢxᵢ′ ε̂ᵢ²) (X′X)⁻¹.

Recall that we said that the asymptotic variance of β̂ − β is

(1/n) E[xᵢxᵢ′]⁻¹ E[xᵢxᵢ′ εᵢ²] E[xᵢxᵢ′]⁻¹
⟹ (1/n) (X′X/n)⁻¹ ((1/n) Σᵢ xᵢxᵢ′ ε̂ᵢ²) (X′X/n)⁻¹,

which reduces to the top expression; robust standard errors are a consistent estimator for the asymptotic variance of the coefficients.
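A minimal sketch of the sandwich estimator above (the HC0 form), on invented heteroskedastic data: the "meat" is Σᵢ xᵢxᵢ′ε̂ᵢ² and the "bread" is (X′X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
eps = rng.normal(size=n) * (0.5 + np.abs(x))   # variance depends on x
y = X @ np.array([1.0, 2.0]) + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat

# Sandwich: (X'X)^{-1} (sum_i x_i x_i' e_i^2) (X'X)^{-1}.
XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * e_hat[:, None] ** 2).T @ X
V_robust = XtX_inv @ meat @ XtX_inv
robust_se = np.sqrt(np.diag(V_robust))
```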
-
Hypothesis testing
Now that we have a well-established distribution for our estimator, we want to ask whether our data are consistent with some belief that we have about the true value of our parameter(s), known as a null hypothesis, typically written H₀.
Specifically, we consider the null hypothesis that β = b (a scalar).
To perform a hypothesis test, we ask, “what’s the probability of getting a value of our estimator further away from our null hypothesis (in absolute value) than our particular estimate, given that the null hypothesis is true?”
-
Fisherian hypothesis testing
Let β̂ be our estimator and let b̂ be our estimate. Our null hypothesis is β = b, and β̂ has an asymptotically normal distribution. We find
Pr(|β̂ − b| > |b̂ − b| | β = b)
= Pr(β̂ − b > |b̂ − b| | β = b) + Pr(β̂ − b < −|b̂ − b| | β = b)
= Pr((β̂ − b)/√V̂ar(β̂) > |b̂ − b|/√V̂ar(β̂) | β = b)
  + Pr((β̂ − b)/√V̂ar(β̂) < −|b̂ − b|/√V̂ar(β̂) | β = b)
-
= Pr((β̂ − β)/√V̂ar(β̂) > |b̂ − b|/√V̂ar(β̂))
  + Pr((β̂ − β)/√V̂ar(β̂) < −|b̂ − b|/√V̂ar(β̂))
= 1 − Φ(|b̂ − b|/√V̂ar(β̂)) + Φ(−|b̂ − b|/√V̂ar(β̂))
= 2 × Φ(−|b̂ − b|/√V̂ar(β̂)) = 2Φ(−|ẑ|).
-
p values
2Φ(−|ẑ|) is called the p value.
p value
The probability of observing a β̂ at least as far from your null hypothesis as your actual estimate, given that the null hypothesis is true.
Note that the p value is just a restatement, a one-to-one transformation, of our test statistic ẑ and is just a means of describing our result relative to the null hypothesis; it is sample-dependent and so too is its interpretation (cf. a frequency interpretation, as in the next case).
-
Interpretation
The p value is calculated assuming that the null hypothesis is true. We calculate the probability of observing our data given this assumption.
Note that this tells us the probability of our data, not the probability that the null hypothesis is true.
-
Fundamental problem of statistics
We learn Pr(data | null hypothesis), not Pr(null hypothesis | data).
How can we go from the former to the latter, the actual quantity of interest?
Frequentists can’t; this is called the fundamental problem ofstatistics.
-
Bayesian inference
Bayes’ rule states that

Pr(null hypothesis | data) = Pr(data | null hypothesis) Pr(null hypothesis) / Pr(data).

What’s the problem?
Pr(data) isn’t known, but we actually don’t need it, and
the prior probability of the null, Pr(null hypothesis), is unknown.
This is an illustration of Bayesian inference.
-
Neyman-Pearson
Imagine observing many data sets and calculating many p values. You reject the null hypothesis if the p value is less than some level α. Then the probability of rejecting a null hypothesis when the null hypothesis is true is

Pr(p(Z) < α | H₀) = Pr(p(Z) < α) = Pr(2Φ(−|Z|) < α)
= Pr(|Z| > −Φ⁻¹(α/2))
= Pr(Z > −Φ⁻¹(α/2)) + Pr(Z < Φ⁻¹(α/2))
= α/2 + α/2 = α.
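This frequency property can be illustrated by simulation (an invented exercise, not from the slides): when the null is true, the z test rejects at rate α.

```python
import random

random.seed(4)
alpha = 0.05
crit = 1.959963984540054      # Phi^{-1}(1 - alpha/2), the two-sided 5% cutoff

reps = 20000
rejections = 0
for _ in range(reps):
    z = random.gauss(0.0, 1.0)   # the test statistic when H0 is true
    if abs(z) > crit:
        rejections += 1
rate = rejections / reps         # should be close to alpha
```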
-
Significance
α is called the significance level of a test. We reject the null hypothesis if p(ẑ) < α.
If you reject the null hypothesis using a level-α test, then, if you performed many α-level tests of true nulls, you would falsely reject the null hypothesis 100 × α% of the time.
Note that this is a frequency-based interpretation of a hypothesis test (cf. the data-specific p value).
Note that this procedure, too, does not tell us whether our specific null hypothesis is true or false; instead, it tells us what proportion of the time the procedure falsely rejects a true null.
-
Alternative hypothesis
We have only mentioned the null hypothesis; we haven’t mentioned what happens if the null is in fact false.
We haven’t specified an alternative hypothesis H₁.
Some test statistics require specifying an alternative, though the ones that we consider here do not (see, e.g., the likelihood ratio test).
-
Type I and type II errors
Four things could happen:

                   H₀ is true      H₀ is false
Do not reject H₀   Correct         Type II error
Reject H₀          Type I error    Correct

1 − Pr(Type II error) is called the power.
Neyman and Pearson advocated finding a test that falsely rejects the null hypothesis some specified α proportion of the time and that maximizes the probability of rejecting the null hypothesis when it is false.
We want to minimize the Type II error (or maximize power) subject to some specified level of Type I error.
-
A courtroom example
Consider being on a jury, with the previous table relabeled under the null hypothesis of not guilty:

                 Not guilty      Guilty
Do not convict   Correct         Type II error
Convict          Type I error    Correct

We can minimize the Type I error by never convicting anyone, but that would mean that we let a lot of guilty people go free; in other words, we have a high Type II error.
We could make sure that every guilty person goes to jail by convicting everyone, but that would require convicting a lot of innocent people; minimizing the Type II error leads to a high Type I error.
There is a trade-off between Type I and Type II errors.
-
Most powerful tests
Actually, we (may) have taken an alternative into account before we even started.
The alternative hypothesis helps us choose the “best” (highest power) tests, but we don’t (typically) use it in calculating test statistics. These are called most powerful tests.
A test may be powerful for only a range of alternatives, while another test is more powerful for alternatives in another range. It is hard to find a test that is the most powerful for all alternatives, a uniformly most powerful test.
-
Power calculations
Let’s consider the power of the z test. Suppose, without loss of generality, that the null value is b = 0. We reject the null hypothesis if |ẑ| > c, where c is chosen to give us the appropriate level of our test (e.g., c = Φ⁻¹(1 − α/2)).
To calculate power, we calculate the probability of rejecting the null hypothesis across all possible values of the true β. Stated differently, power is a function of the true parameter value.
Comprehension check: What is the power of this test at β = 0?
-
Calculating power
Let ẑ be the test statistic and Z represent a standard normal random variable. Then,

Pr(reject null) = Pr(|ẑ| > c) = 1 − Pr(|ẑ| < c) = 1 − Pr(−c < ẑ < c)
= 1 − Pr(−c < (β̂ − b)/√V̂ar(β̂) < c)
-
Calculating power, continued
= 1 − Pr(−c + (b − β)/√V̂ar(β̂) < (β̂ − β)/√V̂ar(β̂) < c + (b − β)/√V̂ar(β̂))
= 1 − Pr(−c + (b − β)/√V̂ar(β̂) < Z < c + (b − β)/√V̂ar(β̂))
= 1 − [Φ(c + (b − β)/√V̂ar(β̂)) − Φ(−c + (b − β)/√V̂ar(β̂))].
-
Figure: Power for the z test with a standard error of 1 and a null hypothesis of 0, with α = 0.05.
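The power curve in the figure can be reproduced from the formula just derived. A sketch, assuming the figure's setup (null b = 0, standard error 1, α = 0.05); the function names are my own:

```python
import math

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(beta, b=0.0, se=1.0, c=1.959963984540054):
    # power(beta) = 1 - [Phi(c + (b - beta)/se) - Phi(-c + (b - beta)/se)]
    shift = (b - beta) / se
    return 1.0 - (phi(c + shift) - phi(-c + shift))

p0 = power(0.0)      # at the null, power equals the size alpha = 0.05
p_far = power(4.0)   # far from the null, power approaches 1
```

This answers the comprehension check: at β = 0 the test rejects with probability exactly α.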
-
z test
The asymptotic distribution of β̂ is N(β, V̂ar(β̂)). The variance is assumed to be given, rather than estimated. We have

ẑ ∼ N(0, 1).

Uses: testing single hypotheses (e.g., a particular coefficient is equal to 0).
This only applies for “large” n.
Reject if |ẑ| > Φ⁻¹(1 − α/2).
-
t tests
If we assume that the errors are distributed N(0, σ²), then β̂ is distributed N(β, V̂ar(β̂)), but the variance is taken to be estimated. Then,

ẑ ∼ t(0, 1, d.f.).

Uses: testing single hypotheses (e.g., a particular coefficient is equal to 0).
If the model matrix has rank K, then the degrees of freedom for a regression coefficient is N − K.
Reject if |ẑ| > t_{N−K,α/2}, where this is the two-sided α critical value of the t distribution with N − K degrees of freedom.
-
Wald test
Let the vector β̂ ∼ N(β, V̂). Then, for a matrix R of a set of restrictions with rank r, with the null hypothesis that Rβ = b,

W = (Rβ̂ − b)′(RV̂R′)⁻¹(Rβ̂ − b) ∼ χ²_r.

Based upon the asymptotic distribution.
Can test multiple restrictions using a robust variance-covariance matrix (cf. F test).
Reject if W > χ²_{r,α}, the upper-α critical value.
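A sketch of the Wald test on invented data, testing two restrictions jointly (β₁ = β₂ = 0). The 5% critical value χ²₂ = 5.991 is hard-coded to avoid a dependency on a statistics library.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.8, 0.0])     # beta_1 is truly nonzero here
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2 = e @ e / (n - 3)
V = s2 * np.linalg.inv(X.T @ X)      # estimated covariance of beta_hat

# Restrictions R beta = b with R picking out the last two coefficients.
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
b = np.zeros(2)
diff = R @ beta_hat - b
W = diff @ np.linalg.solve(R @ V @ R.T, diff)
reject = W > 5.991                   # chi^2 critical value, r = 2, alpha = 0.05
```

Since β₁ = 0.8 in the simulated truth, the joint null is false and the test should reject.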
-
F tests
Let the vector β̂ ∼ N(β, σ̂²(X′X)⁻¹). Then, for a matrix R of a set of restrictions with rank r, with a null hypothesis that Rβ = b,

F = (Rβ̂ − b)′(R(X′X)⁻¹R′)⁻¹(Rβ̂ − b) / (r σ̂²) ∼ F_{r,N−K}.

Usually a finite-sample test; here, σ̂² is treated as a random variable itself that has a (scaled) χ²_{N−K} distribution; asymptotic tests take the variance as fixed.
Cannot handle a robust variance-covariance matrix.
Reject if F > F_{r,N−K,α}, the upper-α critical value.
-
Partitioned matrix inverse formula
Note the following fact:

[A₁₁ A₁₂]⁻¹   [ D₁⁻¹             −D₁⁻¹A₁₂A₂₂⁻¹ ]
[A₂₁ A₂₂]   = [ −D₂⁻¹A₂₁A₁₁⁻¹     D₂⁻¹         ],

where D₁ = A₁₁ − A₁₂A₂₂⁻¹A₂₁ and D₂ = A₂₂ − A₂₁A₁₁⁻¹A₁₂; this is the partitioned matrix inverse formula.
Let R = [0 I_r] and the null hypothesis be Rβ = 0; we are testing whether some subset of the coefficients (the last r) are 0. Then, using the partitioned matrix inverse formula,

R(X′X)⁻¹R′ = (X₂′X₂ − X₂′X₁(X₁′X₁)⁻¹X₁′X₂)⁻¹ = (X₂′M₁X₂)⁻¹.
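The identity above can be verified numerically (an illustration on random data of my own): the lower-right block of (X′X)⁻¹ equals (X₂′M₁X₂)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k1, k2 = 80, 2, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, k1 - 1))])
X2 = rng.normal(size=(n, k2))
X = np.hstack([X1, X2])

# Lower-right k2 x k2 block of (X'X)^{-1}, i.e., R (X'X)^{-1} R' with R = [0 I].
XtX_inv = np.linalg.inv(X.T @ X)
lower_right = XtX_inv[k1:, k1:]

# (X2' M1 X2)^{-1}, with M1 the annihilator matrix for X1.
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
block = np.linalg.inv(X2.T @ M1 @ X2)
```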
-
Applying the FWL theorem
Our F statistic, then, is

F = β̂₂′(X₂′M₁X₂)β̂₂ / (r σ̂²).

Let ε̃ be the residuals from a regression of y on X₁. Let ε̂ be the residuals from the full regression. X̃₂ is the set of residuals from a regression of X₂ on X₁. The FWL theorem states that ε̃ = X̃₂β̂₂ + ε̂. Then,

ε̃′ε̃ = β̂₂′X̃₂′X̃₂β̂₂ + ε̂′ε̂
     = β̂₂′X₂′M₁X₂β̂₂ + ε̂′ε̂
⟹ β̂₂′X₂′M₁X₂β̂₂ = ε̃′ε̃ − ε̂′ε̂.
-
The F statistic reframed
Recall that σ̂² = ε̂′ε̂/(N − K). The F statistic is

F = [(N − K)/r] × (ε̃′ε̃ − ε̂′ε̂)/ε̂′ε̂.

This is the proportional difference between the sum of squared residuals (SSR) from a regression that omits the variables that we propose omitting and the SSR from the full regression, rescaled; intuitively, the F test asks, “relatively, how much bigger are the mistakes that we make when we exclude the proposed variables?”
Note that the SSR from the restricted regression must be greater than or equal to that of the unrestricted regression.
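The two forms of the F statistic, the SSR comparison above and the quadratic form with R = [0 I_r], agree exactly. A sketch on invented data:

```python
import numpy as np

rng = np.random.default_rng(7)
N, r = 120, 2
X1 = np.column_stack([np.ones(N), rng.normal(size=N)])
X2 = rng.normal(size=(N, r))
X = np.hstack([X1, X2])
K = X.shape[1]
y = X @ np.array([1.0, 0.5, 0.3, 0.0]) + rng.normal(size=N)

def ssr(Xm, ym):
    # Sum of squared residuals from an OLS fit of ym on Xm.
    bh = np.linalg.lstsq(Xm, ym, rcond=None)[0]
    e = ym - Xm @ bh
    return e @ e

# SSR form: restricted regression drops X2.
ssr_u, ssr_r = ssr(X, y), ssr(X1, y)
F_ssr = ((ssr_r - ssr_u) / r) / (ssr_u / (N - K))

# Quadratic form: (R b)'(R (X'X)^{-1} R')^{-1}(R b) / (r s^2).
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
s2 = ssr_u / (N - K)
R = np.hstack([np.zeros((r, K - r)), np.eye(r)])
mid = np.linalg.inv(R @ np.linalg.inv(X.T @ X) @ R.T)
F_quad = (R @ beta_hat) @ mid @ (R @ beta_hat) / (r * s2)
```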
-
The F test and R2
Let the restricted regression be no model at all, i.e., Y = β₀.
Then, the restricted SSR is the total sum of squares (SST) in our model. The number of restrictions is r = K − 1. We have:

[(SST − SSR)/(K − 1)] / [SSR/(N − K)] = [(N − K)/(K − 1)] (SST/SSR − 1)
                                      = [(N − K)/(K − 1)] (1/(1 − R²) − 1)
                                      = [(N − K)/(K − 1)] R²/(1 − R²).
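This identity between the "overall" F statistic and R² is easy to check numerically (illustrative data only):

```python
import numpy as np

rng = np.random.default_rng(8)
N = 150
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
K = X.shape[1]
y = X @ np.array([1.0, 0.4, -0.2]) + rng.normal(size=N)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat
ssr = e @ e                            # unrestricted SSR
sst = np.sum((y - y.mean()) ** 2)      # SSR of the intercept-only model
R2 = 1.0 - ssr / sst

# The two expressions for the overall F statistic.
F_ssr = ((sst - ssr) / (K - 1)) / (ssr / (N - K))
F_R2 = (N - K) / (K - 1) * R2 / (1.0 - R2)
```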
-
Bonferroni’s inequality
Let’s begin with a fact from basic statistics. Suppose that we have two events, A and B. We can write the probability that either A happens or B happens as

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B),

which is the probability that A happens, plus the probability that B happens, minus the probability that both happen (otherwise we’d be double counting the case where both happen). This leads to Bonferroni’s inequality,

Pr(A ∪ B) ≤ Pr(A) + Pr(B);

“less than” because we don’t subtract off the probability of both A and B occurring from the right-hand side.
-
Testing regression coefficients
Let’s suppose that we have a multiple regression model

yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + εᵢ

and we want to test β₁ = 0 and β₂ = 0.
Suppose that we did two separate t tests to answer this question; i.e., test β₁ = 0 and β₂ = 0 each by a t test and reject the joint hypothesis if we can reject either separate null hypothesis.
-
Type I error
To calculate the Type I error of this procedure, we consider

Pr(reject β₁ = 0 or reject β₂ = 0 | β₁ = 0, β₂ = 0).

Based upon our fact of probability, we see that this probability is equal to

Pr(reject β₁ = 0 | β₁ = 0, β₂ = 0) + Pr(reject β₂ = 0 | β₁ = 0, β₂ = 0) − Pr(reject β₁ = 0 and reject β₂ = 0 | β₁ = 0, β₂ = 0).

This is the Type I error of the t test for β₁ = 0, plus the Type I error of the t test for β₂ = 0, minus the probability that you reject both when both are in fact true.
-
Application of Bonferroni’s inequality
By Bonferroni’s inequality, we have

Pr(reject β₁ = 0 or reject β₂ = 0 | β₁ = 0, β₂ = 0) ≤ α + α = 2α,

where α is the Type I error for each of the t tests.
So the Type I error of our joint test can be twice as large as the error from our separate t tests! How can we get around this problem?
If we want to falsely reject the joint test with probability α, then set our Type I error rate for the separate t tests to α/2.
More generally, if we have n hypotheses, set the individual Type I error levels to α/n.
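A simulation sketch of the correction (my own illustration, with two independent z statistics): running each of two tests of true nulls at level α/2 keeps the familywise Type I error at or below α.

```python
import random

random.seed(9)
alpha = 0.05
# Two-sided cutoff for per-test level alpha/2: Phi^{-1}(1 - alpha/4).
crit = 2.241402727604947

reps = 20000
family_rejects = 0
for _ in range(reps):
    z1 = random.gauss(0.0, 1.0)   # both nulls true
    z2 = random.gauss(0.0, 1.0)
    if abs(z1) > crit or abs(z2) > crit:
        family_rejects += 1       # joint test rejects if either t test does
rate = family_rejects / reps      # should be at or just below alpha
```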
-
Issues of power
Note that this is a conservative test; we have shown that the Type I error for the joint test is less than or equal to the sum of the separate Type I errors; “less than” because we ignore correlations between the tests. This means that this test is not powerful.
The F test does not ignore these correlations and thus is more powerful than this “Bonferroni correction.” Tests of joint hypotheses are useful when you have a natural set of hypotheses (e.g., testing that “race doesn’t matter” by testing the joint null hypothesis that the coefficients on several race indicator variables are 0).
-
Multiple tests and multicollinearity
Testing multiple hypotheses is also useful when you have two multicollinear variables; the standard errors of each may be too wide to reject an individual null hypothesis (due to the multicollinearity), but the joint null takes into account this correlation and provides a more powerful test.