Transcript of Heteroskedasticity-Robust Inference
(qed.econ.queensu.ca/pub/faculty/mackinnon/econ850/slides/...For...)

Heteroskedasticity-Robust Inference
Consider the linear regression model with exogenous regressors,

y = Xβ + u,  E(u) = 0,  E(uu^⊤) = Ω,  (1)

where Ω is an N×N matrix with i-th diagonal element equal to ω_i^2 > 0 and all the off-diagonal elements equal to 0.

Since X is assumed to be exogenous, the expectations in (1) can be treated as conditional on X.

We would get the same asymptotic results if, instead of treating X as exogenous, we assumed that E(u_i | X_i) = 0.

The disturbances in (1) are uncorrelated and have mean 0, but their variances differ. They are said to be heteroskedastic.

We assume that the investigator knows nothing about the ω_i^2. In other words, the form of the heteroskedasticity is completely unknown.
November 26, 2020 1 / 28
Whatever the form of Ω, the covariance matrix of β̂ is

E((β̂ − β_0)(β̂ − β_0)^⊤) = (X^⊤X)^{-1} X^⊤ E(uu^⊤) X (X^⊤X)^{-1}
                          = (X^⊤X)^{-1} X^⊤ΩX (X^⊤X)^{-1}.  (2)

This is often called a sandwich covariance matrix, for obvious reasons.

If we knew the ω_i^2, we could evaluate (2). In fact, we could do better and obtain efficient estimates of β.

Observations with low variance convey more information than ones with high variance, and so the former should be given greater weight.

But it is assumed that we do not know the ω_i^2. We cannot hope to estimate them consistently without making additional assumptions, because there are N of them.

For the purposes of asymptotic theory, we wish to consider the covariance matrix of N^{1/2}(β̂ − β_0), the limit of N times the matrix (2).
The asymptotic covariance matrix of N^{1/2}(β̂ − β_0) is

(lim_{N→∞} (1/N) X^⊤X)^{-1} (lim_{N→∞} (1/N) X^⊤ΩX) (lim_{N→∞} (1/N) X^⊤X)^{-1}.  (3)

Under standard assumptions, the factor (lim N^{-1}X^⊤X)^{-1} tends to the inverse of the positive definite matrix S_{X^⊤X}.

To estimate S_{X^⊤X}^{-1}, we can simply use the matrix (N^{-1}X^⊤X)^{-1} itself.

In a very famous paper, White (1980) showed that, under certain conditions, the middle matrix can be estimated consistently by

(1/N) X^⊤ Ω̂ X,  (4)

where Ω̂ is an inconsistent estimator of Ω.

The simplest version of Ω̂ is a diagonal matrix with i-th diagonal element equal to û_i^2, the i-th squared OLS residual.
The matrix lim(N^{-1}X^⊤ΩX) is a k×k symmetric matrix. Therefore, it has exactly (k^2 + k)/2 distinct elements.

Since this number is independent of the sample size, the matrix can be estimated consistently. Its jl-th element is

lim_{N→∞} ( (1/N) ∑_{i=1}^N ω_i^2 x_ij x_il ).  (5)

This is estimated by the jl-th element of (4). For the simplest version of Ω̂, the estimator is

(1/N) ∑_{i=1}^N û_i^2 x_ij x_il.  (6)

Because β̂ is consistent for β_0, û_i must be consistent for u_i, and û_i^2 is therefore consistent for u_i^2.
Asymptotically, expression (6) is equal to

(1/N) ∑_{i=1}^N u_i^2 x_ij x_il = (1/N) ∑_{i=1}^N (ω_i^2 + v_i) x_ij x_il  (7)
  = (1/N) ∑_{i=1}^N ω_i^2 x_ij x_il + (1/N) ∑_{i=1}^N v_i x_ij x_il,  (8)

where v_i is defined to equal u_i^2 minus its mean of ω_i^2.

Under suitable assumptions about the x_ij and the ω_i^2, we can apply a law of large numbers to the second term in (8).

Since E(v_i) = 0, this term converges to 0, while the first term converges to expression (5).

Because N^{-1} ∑_{i=1}^N û_i^2 x_ij x_il is asymptotically equal to N^{-1} ∑_{i=1}^N u_i^2 x_ij x_il, these arguments imply that (6) consistently estimates (5).
In practice, of course, we omit the factors of 1/N. We simply use the matrix

V̂ar_h(β̂) ≡ (X^⊤X)^{-1} X^⊤ Ω̂ X (X^⊤X)^{-1}  (9)

directly to estimate the covariance matrix of β̂.

A more revealing way to write (9) is

V̂ar_h(β̂) ≡ (X^⊤X)^{-1} ( ∑_{i=1}^N û_i^2 X_i^⊤ X_i ) (X^⊤X)^{-1},  (10)

where X_i is the i-th row of X. This makes it clear that the N×N matrix Ω̂ is never used.

The sandwich estimator (10) is a heteroskedasticity-consistent covariance matrix estimator, or HCCME. It is valid for heteroskedasticity of unknown form.

By taking square roots of the diagonal elements of (10), we can obtain heteroskedasticity-robust standard errors.
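Formula (10) is straightforward to compute without ever forming Ω̂. The sketch below, in numpy, is illustrative only: the data-generating process is hypothetical, and `hc0_covariance` is an invented name, not part of any library.

```python
import numpy as np

def hc0_covariance(X, y):
    """White's HC0 sandwich, as in (10):
    (X'X)^{-1} (sum_i uhat_i^2 X_i' X_i) (X'X)^{-1}."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS estimates
    uhat = y - X @ beta_hat                          # OLS residuals
    XtX_inv = np.linalg.inv(X.T @ X)
    # The "filling" sum_i uhat_i^2 X_i' X_i, computed row-wise
    meat = (X * (uhat**2)[:, None]).T @ X
    return XtX_inv @ meat @ XtX_inv

# Hypothetical sample with heteroskedastic disturbances
rng = np.random.default_rng(42)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
u = rng.normal(size=N) * (0.5 + np.abs(X[:, 1]))     # variance depends on x
y = X @ np.array([1.0, 2.0]) + u
se_robust = np.sqrt(np.diag(hc0_covariance(X, y)))   # robust standard errors
```

The row-wise product avoids building the N×N diagonal matrix Ω̂, exactly as the slide points out.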
Asymptotic Theory for OLS
With heteroskedasticity of unknown form, Theorem 4.3 would be replaced by

N^{1/2}(β̂ − β_0) →_d N(0, S_{X^⊤X}^{-1} (lim_{N→∞} (1/N) X^⊤ΩX) S_{X^⊤X}^{-1})  (11)

and

lim_{N→∞} N V̂ar_h(β̂) = S_{X^⊤X}^{-1} (lim_{N→∞} (1/N) X^⊤ΩX) S_{X^⊤X}^{-1}.  (12)

We conclude that β̂ is root-N consistent and asymptotically normal, with (12) showing that V̂ar_h(β̂) provides a consistent estimator of its covariance matrix.

Of course, all the factors of N are omitted when we actually make inferences about β̂.
Alternative Forms of HCCME
The original HCCME (10) of White (1980), often called HC0, uses squared residuals to estimate the diagonal elements of Ω.

But least-squares residuals tend to be too small. Better estimators inflate the squared residuals (MacKinnon and White, 1985).

HC1: Use û_i^2 in Ω̂ and then multiply the entire matrix by the scalar N/(N − k), a standard degrees-of-freedom correction.

HC2: Use û_i^2/(1 − h_i) in Ω̂, where

h_i ≡ X_i (X^⊤X)^{-1} X_i^⊤  (13)

is the i-th diagonal element of the "hat" matrix P_X.

Recall the result that, when Var(u_i) = σ^2 for all i, E(û_i^2) = σ^2(1 − h_i).

Therefore, the ratio of û_i^2 to 1 − h_i would have expectation σ^2 if the disturbances were homoskedastic.
HC3: Use û_i^2/(1 − h_i)^2 in Ω̂. This is a slightly simplified version of what one gets by employing the jackknife.

A jackknife estimator omits one observation at a time when obtaining fitted values, so that the fitted value for observation i does not depend on the data for that observation.

Dividing by (1 − h_i)^2 actually overcorrects the residuals. But observations with large variances often tend to have residuals that are very much too small. Thus, HC3 may be attractive if large variances are associated with large values of h_i.

Inferences based on any HCCME, especially HC0 and HC1, may be seriously inaccurate, even when the sample size is moderately large, if some observations have much higher leverage than others.

By default, Stata uses HC1 with the "robust" and "vce(robust)" options. But "vce(hc2)" and "vce(hc3)" provide HC2 and HC3.
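The four adjustments differ only in how the squared residuals are inflated, so they can be collected in one helper. This is a sketch in numpy, assuming a generic design matrix; `hccme` is a hypothetical name.

```python
import numpy as np

def hccme(X, y, variant="HC1"):
    """HC0-HC3 sandwich estimators; they differ only in how the squared
    residuals on the diagonal of Omega-hat are inflated."""
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    uhat = y - X @ (XtX_inv @ (X.T @ y))
    # Leverages h_i = X_i (X'X)^{-1} X_i', as in (13)
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    w = {
        "HC0": uhat**2,
        "HC1": uhat**2 * N / (N - k),
        "HC2": uhat**2 / (1 - h),
        "HC3": uhat**2 / (1 - h)**2,
    }[variant]
    meat = (X * w[:, None]).T @ X                # sum_i w_i X_i' X_i
    return XtX_inv @ meat @ XtX_inv
```

Since 0 < h_i < 1, the HC2 and HC3 weights are always at least as large as the HC0 weights, which is the inflation the slide describes.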
When Does Heteroskedasticity Matter?
Even when the disturbances are heteroskedastic, we do not necessarily have to use an HCCME.

Consider the jl-th element of N^{-1}X^⊤ΩX, which is

(1/N) ∑_{i=1}^N ω_i^2 x_ij x_il.  (14)

If the limit as N → ∞ of the average of the ω_i^2 exists and is denoted σ^2, then expression (14) can be rewritten as

σ^2 (1/N) ∑_{i=1}^N x_ij x_il + (1/N) ∑_{i=1}^N (ω_i^2 − σ^2) x_ij x_il.  (15)

The first term here is just the jl-th element of σ^2 N^{-1}X^⊤X.
If it happens that

lim_{N→∞} (1/N) ∑_{i=1}^N (ω_i^2 − σ^2) x_ij x_il = 0  (16)

for j, l = 1, . . . , k, then we find that

lim_{N→∞} (1/N) X^⊤ΩX = σ^2 lim_{N→∞} (1/N) X^⊤X.  (17)

The asymptotic covariance matrix of N^{1/2}(β̂ − β_0) is just

(lim_{N→∞} (1/N) X^⊤X)^{-1} σ^2 (lim_{N→∞} (1/N) X^⊤X) (lim_{N→∞} (1/N) X^⊤X)^{-1} = σ^2 S_{X^⊤X}^{-1}.  (18)

The usual OLS estimate of σ^2 is s^2 = (1/(N − k)) ∑_{i=1}^N û_i^2.
If we assume that we can apply a law of large numbers, the probability limit of N^{-1} ∑_{i=1}^N û_i^2 is

lim_{N→∞} (1/N) ∑_{i=1}^N ω_i^2 = σ^2.  (19)

In this special case, the usual OLS covariance matrix estimator s^2 (X^⊤X)^{-1} is valid asymptotically.

If we are estimating a sample mean, then X = ι, and

(1/N) ∑_{i=1}^N ω_i^2 x_ij x_il = (1/N) ∑_{i=1}^N ω_i^2 ι_i^2 = (1/N) ∑_{i=1}^N ω_i^2 → σ^2 as N → ∞.  (20)

Thus (16) holds, and we do not have to worry about heteroskedasticity.

Only heteroskedasticity related to the squares and cross-products of the x_ij affects the validity of the usual OLS covariance matrix estimator.
HAC Covariance Matrix Estimation
The assumption that the matrix Ω is diagonal is what makes it possible to estimate N^{-1}X^⊤ΩX consistently and obtain an HCCME, even though Ω itself cannot be estimated consistently.

The matrix N^{-1}X^⊤ΩX can sometimes be estimated consistently for a model that uses time-series data when the disturbances are correlated across time periods.

Observations that are close to each other may be strongly correlated, but observations that are far apart may be uncorrelated or nearly so.

If so, only the elements of Ω that are on or close to the principal diagonal are large.

We may then be able to obtain an estimate of the covariance matrix of the parameter estimates that is heteroskedasticity and autocorrelation consistent, or HAC.
Cluster-Robust Inference
Data are often collected at the individual level, but each observation is associated with a higher-level entity, such as a city, state, province, or country, a classroom or school, a hospital, or perhaps a time period.

Thus each observation belongs to a cluster, and the regression disturbances may be correlated within the clusters.

We can divide the data into G clusters, indexed by g, and write the linear regression model as

y ≡ [y_1; y_2; … ; y_G] = Xβ + u ≡ [X_1; X_2; … ; X_G] β + [u_1; u_2; … ; u_G],  (21)

where the blocks y_g, X_g, and u_g are stacked vertically by cluster.
Cluster g has N_g observations, and the entire sample has N = ∑_{g=1}^G N_g observations.

The vectors y and u have N rows, the matrix X is N×k, and the parameter vector β has k elements.

We assume that the disturbances are uncorrelated across clusters but potentially correlated and heteroskedastic within clusters, so that

E(u_g u_g^⊤) = Ω_g,  g = 1, . . . , G.  (22)

The N_g×N_g covariance matrices Ω_g are assumed to be unknown. Thus Ω is block diagonal, with the Ω_g forming the diagonal blocks.

As usual, the covariance matrix of β̂ is a sandwich:

Var(β̂) = (X^⊤X)^{-1} X^⊤ΩX (X^⊤X)^{-1}.  (23)

But now the filling is a sum of G matrices, each of them k×k.
We can write the filling in (23) as

∑_{g=1}^G X_g^⊤ Ω_g X_g = ∑_{g=1}^G Σ_g,  (24)

where

Ω_g = E(u_g u_g^⊤) and Σ_g = E(X_g^⊤ u_g u_g^⊤ X_g).  (25)

Here Σ_g is the covariance matrix of the score vector X_g^⊤ u_g.

We can rewrite X^⊤ΩX as Σ = ∑_{g=1}^G Σ_g, so that Var(β̂) becomes (X^⊤X)^{-1} Σ (X^⊤X)^{-1}, which is simpler than (23).

If either the disturbances or the regressors are uncorrelated within cluster g, then Σ_g will be diagonal. If this is true for all g, then Σ is likewise diagonal.

Thus the errors of inference from assuming that Ω is diagonal depend on both the disturbances and the regressors.
Why Clustering Matters
When the N_g are large, the diagonal elements of (23) can be very much larger than those of covariance matrices that treat Ω as diagonal.

Consider the random-effects or error-components model

u_gi = v_g + e_gi,  v_g ∼ IID(0, σ_v^2),  e_gi ∼ IID(0, σ_e^2),  (26)

where v_g is a random variable that affects every observation in cluster g and no observation in any other cluster.

The random-effects model (26) implies that

Var(u_gi) = σ_v^2 + σ_e^2 and Cov(u_gi, u_gj) = σ_v^2,  (27)

so that

ρ_u ≡ Cov(u_gi, u_gj)/Var(u_gi) = σ_v^2/(σ_v^2 + σ_e^2) for all g.  (28)

Thus all the intra-cluster correlations are the same and equal to ρ_u.
Suppose the model is y_gi = β_1 + β_2 x_gi + u_gi, where x_gi is fixed within each cluster.

If N_g is the same for every cluster, then it can be shown that

Var(β̂_2)/Var_c(β̂_2) = 1 + (N_g − 1)ρ_u,  (29)

where Var(β̂_2) is the true variance of β̂_2 based on (23), and Var_c(β̂_2) is the incorrect (conventional) variance based on σ^2(X^⊤X)^{-1}.

The r.h.s. of (29) is a special case of what is called the Moulton factor.

More generally, the Moulton factor is

1 + ρ_x ρ_u (Var(N_g)/N̄_g + N̄_g − 1),  (30)

where N̄_g = N/G, and ρ_x is the intra-cluster correlation of x.
From (29) and (30), Var(β̂_2) can be very much greater than Var_c(β̂_2) when N_g is large, even if ρ_u is quite small.

For example, if ρ_u = 0.05, Var(β̂_2) = 2 Var_c(β̂_2) when N_g = 21, 4 Var_c(β̂_2) when N_g = 61, and 25 Var_c(β̂_2) when N_g = 481.

We could "solve" this problem by including group fixed effects. These would explain the v_g, leaving only the e_gi.

But group fixed effects cannot be included if any regressor does not vary within clusters. Examples include labor market conditions, tax rates, class sizes, local prices or wages, or measures of local amenities.

Laws and policies typically affect entire cities, provinces, or states, so we cannot include fixed effects if we are trying to evaluate them.

Moreover, error-components models like (26) may not describe intra-cluster correlations very well. These correlations may arise from misspecification or from the way the data are collected.
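The variance-inflation figures quoted above follow directly from (29); a two-line check confirms them:

```python
# Moulton factor from (29): variance inflation 1 + (N_g - 1) * rho_u
rho_u = 0.05
for N_g in (21, 61, 481):
    # N_g = 21 -> 2, N_g = 61 -> 4, N_g = 481 -> 25
    print(N_g, 1 + (N_g - 1) * rho_u)
```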
Consider the sample mean ȳ = N^{-1} ∑_{i=1}^N y_i. The usual formula for Var(ȳ) is σ^2/N. Thus the standard error of ȳ is O(N^{-1/2}).

The usual formula assumes that Var(y_i) = σ^2 and Cov(y_i, y_j) = 0.

A formula that is valid under much weaker assumptions is

Var(ȳ) = (1/N^2) ∑_{i=1}^N Var(y_i) + (2/N^2) ∑_{i=1}^N ∑_{j=i+1}^N Cov(y_i, y_j).  (31)

The first term on the r.h.s. of (31) is O(N^{-1}). It simplifies to the usual formula if we define σ^2 as N^{-1} ∑_{i=1}^N σ_i^2.

The second term is O(1), because it involves two summations over N, and it is divided by N^2. It must eventually dominate.

For large samples with clustered disturbances and a fixed number of clusters, standard errors will decline more slowly than N^{-1/2} and will be bounded from below.
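The two-term decomposition (31) can be checked numerically for a block-diagonal Ω with the error-components structure; the parameter values below are arbitrary, and the exact variance σ_v^2/G + σ_e^2/N under that model provides an independent benchmark.

```python
import numpy as np

# Block-diagonal Omega: G clusters of size N_g, with Var = s2_v + s2_e on the
# diagonal and Cov = s2_v within each cluster, as in the error-components model
G, N_g, s2_v, s2_e = 10, 50, 0.5, 1.0
N = G * N_g
block = np.full((N_g, N_g), s2_v) + s2_e * np.eye(N_g)
Omega = np.kron(np.eye(G), block)

# The two terms of (31). The off-diagonal sum of the symmetric Omega equals
# twice the sum over pairs i < j, matching the factor 2/N^2.
var_term = Omega.trace() / N**2                  # O(1/N) variance term
cov_term = (Omega.sum() - Omega.trace()) / N**2  # O(1) covariance term
var_ybar = var_term + cov_term
```

With these values the covariance term is already more than ten times the variance term, illustrating why it must eventually dominate.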
An Empirical Example
Using CPS data for 1979 through 2015, MacKinnon (2016) estimated the earnings equation

y_gti = β_1 + β_2 Ed2_gti + β_3 Ed3_gti + β_4 Ed4_gti + β_5 Ed5_gti + β_6 Age_gti + β_7 Age_gti^2 + γ_t Year_t + δ_g State_g + u_gti,  (32)

where g indexes states from 1 to 51, t indexes years from 1 to 37, and i indexes individuals within each year. There are 36 time fixed effects, the γ_t, and 50 state fixed effects, the δ_g.

The education dummies Ed2 through Ed5 represent, respectively, high school, high school plus two years, college/university, and at least one postgraduate degree.

There are 1,156,597 observations on white men aged 25 to 65.
The coefficient of interest is ϕ = β_5 − β_4. The OLS estimate is 0.10965, which implies a percentage increase of 11.589%.

The resulting 95% confidence interval is [11.155, 12.023].

But when we cluster by state and year (1887 clusters), the 95% confidence interval is [10.967, 12.212].

When we cluster by state (51 clusters), the 95% confidence interval is [10.447, 12.732].

• The standard error based on state-level clustering is 2.63 times the standard error with no clustering.
• Increasing the standard error by a factor of 2.63 is equivalent to reducing the sample size by a factor of 2.63 squared.
• Thus 1,156,597 clustered observations are equivalent to a sample of approximately 167,000 independent observations.

These results hold despite the state-level fixed effects!
Recall the implication of (31) that the accuracy of OLS estimates will grow more slowly than N^{1/2} and will be bounded from above.

To investigate this, I created multiple subsamples of various sizes, ranging from 1/64 of the original sample to 1/2 of the original sample.

Let M denote the reduced sample size, where M ≅ N/m for m = 2, 3, 4, 6, 9, 16, 25, 36, 49, 64.

The subsamples of size M were created by retaining only observations m + j, 2m + j, and so on, for a number of values of j ≤ m.

The figure graphs the inverse of the average of the reported standard errors of ϕ̂ for three covariance matrix estimators against M.

• The HR standard errors are proportional to √(1/M), but the two cluster-robust standard errors decline more slowly than √(1/M).
• For state-level clustering, the standard errors decrease very slowly indeed beyond about M = 250,000.
[Figure: the inverse of the average reported standard error of ϕ̂ plotted against M (from 20,000 to 1,000,000) for the HC1, CV1(S,Y), and CV1(S) estimators.]
Cluster-Robust Variance Matrix Estimation
We can use a cluster-robust variance estimator, or CRVE, that generalizes the HCCME.

The simplest and most widely-used CRVE is

CV1:  (G(N − 1))/((G − 1)(N − k)) (X^⊤X)^{-1} ( ∑_{g=1}^G X_g^⊤ û_g û_g^⊤ X_g ) (X^⊤X)^{-1},  (33)

where û_g is the vector of OLS residuals for cluster g.

Each of the k×k matrices within the summation has rank 1, because it is the vector X_g^⊤ û_g times its transpose.

The rank of the CRVE cannot exceed G (or maybe G − 1), making it impossible to test more than G (or G − 1) restrictions at once.

When N_g = 1 for all g, CV1 reduces to HC1.
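Formula (33) can be sketched directly in numpy; `cv1` and the simulated clustered sample are hypothetical. Note that with every observation in its own cluster the correction factor becomes N/(N − k) and the filling collapses to the HC1 filling, which is a convenient sanity check.

```python
import numpy as np

def cv1(X, y, cluster_ids):
    """CV1 of (33): small-sample-corrected sandwich whose filling is the sum
    over clusters of the rank-1 matrices X_g' uhat_g uhat_g' X_g."""
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    uhat = y - X @ (XtX_inv @ (X.T @ y))
    clusters = np.unique(cluster_ids)
    G = len(clusters)
    meat = np.zeros((k, k))
    for g in clusters:
        score = X[cluster_ids == g].T @ uhat[cluster_ids == g]  # X_g' uhat_g
        meat += np.outer(score, score)                          # rank-1 term
    factor = G * (N - 1) / ((G - 1) * (N - k))
    return factor * XtX_inv @ meat @ XtX_inv

# Hypothetical clustered sample: 20 clusters of 30 observations each,
# regressor and random effect fixed within each cluster
rng = np.random.default_rng(0)
G_demo, Ng_demo = 20, 30
cluster_ids = np.repeat(np.arange(G_demo), Ng_demo)
x = rng.normal(size=G_demo)[cluster_ids]
u = rng.normal(size=G_demo)[cluster_ids] + rng.normal(size=G_demo * Ng_demo)
X = np.column_stack([np.ones(G_demo * Ng_demo), x])
y = X @ np.array([1.0, 2.0]) + u
se_cluster = np.sqrt(np.diag(cv1(X, y, cluster_ids)))
```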
It is customary to base inferences on the t(G − 1) distribution. Each of the terms in the summation in (33) contributes one degree of freedom, and a t test uses up one degree of freedom.

In a very special case, Bester, Conley, and Hansen (2011) show that t statistics based on CV1 actually follow the t(G − 1) distribution for fixed G as N → ∞.

When G is small and N is large, critical values based on t(G − 1) can be substantially larger than ones based on t(N − k).

Djogbenou, MacKinnon, and Nielsen (2019) prove that inference based on CR t statistics is asymptotically valid as G → ∞. The number of clusters G must tend to infinity with the sample size, but G may increase more slowly than N.

The proof would be much easier if N/G were constant as N → ∞. In that case, standard errors would be O(N^{-1/2}) = O(G^{-1/2}).
Inference based on CV1 and the t(G − 1) distribution can be very unreliable in certain cases:

• When G is small (say, G < 50 if the clusters are reasonably homogeneous).
• When one or a few clusters are much larger than average.
• When the disturbances or scores for one or a few clusters have much larger variances than average.
• When only a few clusters are "treated."

The most effective methods for dealing with such cases are based on bootstrap methods (the wild cluster, pairs cluster, or wild bootstraps) or on randomization inference.

Other methods are based on CV2, or maybe CV3, variance matrices.

The g-th diagonal block of the matrix M_X is

M_gg ≡ I_{N_g} − X_g (X^⊤X)^{-1} X_g^⊤,  g = 1, . . . , G.  (34)
The CV2 matrix uses the transformed residuals

u̇_g = M_gg^{-1/2} û_g,  (35)

where M_gg^{-1/2} is the inverse of the symmetric square root of M_gg:

CV2:  (X^⊤X)^{-1} ( ∑_{g=1}^G X_g^⊤ u̇_g u̇_g^⊤ X_g ) (X^⊤X)^{-1}.  (36)

Unfortunately, since the M_gg are N_g×N_g matrices, computing them, as well as their inverse symmetric square roots, can be expensive or impossible.

The CV3 matrix is similar, but it uses ü_g = M_gg^{-1} û_g instead of the u̇_g.

There are methods that use CV1, CV2, or CV3 combined with the t(d) distribution, where d is computed from the data in various ways.
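CV2 can be sketched the same way; an eigendecomposition of the symmetric M_gg is one standard route to M_gg^{-1/2}. With singleton clusters the transformation reduces to dividing û_i by (1 − h_i)^{1/2}, so CV2 collapses to HC2, which makes a useful check. The `cv2` name and the data are hypothetical.

```python
import numpy as np

def cv2(X, y, cluster_ids):
    """CV2 of (36): rescale each cluster's residual vector by M_gg^{-1/2},
    the inverse symmetric square root of M_gg = I - X_g (X'X)^{-1} X_g'."""
    k = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    uhat = y - X @ (XtX_inv @ (X.T @ y))
    meat = np.zeros((k, k))
    for g in np.unique(cluster_ids):
        Xg = X[cluster_ids == g]
        ug = uhat[cluster_ids == g]
        Mgg = np.eye(len(ug)) - Xg @ XtX_inv @ Xg.T
        w, V = np.linalg.eigh(Mgg)               # Mgg is symmetric
        u_dot = V @ ((w ** -0.5) * (V.T @ ug))   # M_gg^{-1/2} uhat_g, as in (35)
        score = Xg.T @ u_dot
        meat += np.outer(score, score)
    return XtX_inv @ meat @ XtX_inv
```

The per-cluster eigendecompositions are exactly the expense the slide warns about when the N_g are large.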