
Transcript of Heteroskedasticity-Robust Inference (qed.econ.queensu.ca/pub/faculty/mackinnon/econ850/slides/...)

  • Heteroskedasticity-Robust Inference

    Heteroskedasticity-Robust Inference

    Consider the linear regression model with exogenous regressors,

    y = Xβ + u,   E(u) = 0,   E(uu⊤) = Ω,   (1)

    where Ω is an N × N matrix with ith diagonal element equal to ω_i² > 0 and all off-diagonal elements equal to 0.

    Since X is assumed to be exogenous, the expectations in (1) can be treated as conditional on X.

    We would get the same asymptotic results if, instead of treating X as exogenous, we assumed that E(u_i | X_i) = 0.

    The disturbances in (1) are uncorrelated and have mean 0, but their variances differ. They are said to be heteroskedastic.

    We assume that the investigator knows nothing about the ω_i². In other words, the form of the heteroskedasticity is completely unknown.

  • Heteroskedasticity-Robust Inference

    Whatever the form of Ω, the covariance matrix of β̂ is

    E((β̂ − β₀)(β̂ − β₀)⊤) = (X⊤X)⁻¹ X⊤E(uu⊤)X (X⊤X)⁻¹ = (X⊤X)⁻¹ X⊤ΩX (X⊤X)⁻¹.   (2)

    This is often called a sandwich covariance matrix, for obvious reasons.

    If we knew the ω_i², we could evaluate (2). In fact, we could do better and obtain efficient estimates of β.

    Observations with low variance convey more information than ones with high variance, and so the former should be given greater weight.

    But it is assumed that we do not know the ω_i². We cannot hope to estimate them consistently without making additional assumptions, because there are N of them.

    For the purposes of asymptotic theory, we wish to consider the covariance matrix of N^{1/2}(β̂ − β₀), the limit of N times the matrix (2).
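
    To make the sandwich formula concrete, here is a minimal numpy sketch (the design, the variances ω_i², and all variable names are illustrative assumptions, not from the slides) that evaluates (2) for a known diagonal Ω and compares it with the conventional formula that pretends the disturbances are homoskedastic:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    N, k = 200, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])

    # Known (hypothetical) heteroskedastic variances omega_i^2.
    omega2 = 0.5 + X[:, 1] ** 2

    XtX_inv = np.linalg.inv(X.T @ X)

    # Exact sandwich (2): (X'X)^{-1} X' Omega X (X'X)^{-1}, with Omega diagonal.
    sandwich = XtX_inv @ ((X * omega2[:, None]).T @ X) @ XtX_inv

    # Conventional matrix that wrongly treats Omega as (average omega^2) * I.
    conventional = omega2.mean() * XtX_inv

    print(np.sqrt(np.diag(sandwich)))      # true standard errors of beta-hat
    print(np.sqrt(np.diag(conventional)))  # what the homoskedastic formula reports
    ```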


  • Heteroskedasticity-Robust Inference

    The asymptotic covariance matrix of N^{1/2}(β̂ − β₀) is

    (lim_{N→∞} (1/N) X⊤X)⁻¹ (lim_{N→∞} (1/N) X⊤ΩX) (lim_{N→∞} (1/N) X⊤X)⁻¹.   (3)

    Under standard assumptions, the factor (lim N⁻¹X⊤X)⁻¹ tends to the positive definite matrix S_{X⊤X}⁻¹.

    To estimate S_{X⊤X}⁻¹, we can simply use the matrix (N⁻¹X⊤X)⁻¹ itself.

    In a very famous paper, White (1980) showed that, under certain conditions, the middle matrix can be estimated consistently by

    (1/N) X⊤Ω̂X,   (4)

    where Ω̂ is an inconsistent estimator of Ω.

    The simplest version of Ω̂ is a diagonal matrix with ith diagonal element equal to û_i², the ith squared OLS residual.

  • Heteroskedasticity-Robust Inference

    The matrix lim(N⁻¹X⊤ΩX) is a k × k symmetric matrix. Therefore, it has exactly (k² + k)/2 distinct elements.

    Since this number is independent of the sample size, the matrix can be estimated consistently. Its jlth element is

    lim_{N→∞} (1/N) ∑_{i=1}^{N} ω_i² x_{ij} x_{il}.   (5)

    This is estimated by the jlth element of (4). For the simplest version of Ω̂, the estimator is

    (1/N) ∑_{i=1}^{N} û_i² x_{ij} x_{il}.   (6)

    Because β̂ is consistent for β₀, û_i must be consistent for u_i, and û_i² is therefore consistent for u_i².

  • Heteroskedasticity-Robust Inference

    Asymptotically, expression (6) is equal to

    (1/N) ∑_{i=1}^{N} u_i² x_{ij} x_{il} = (1/N) ∑_{i=1}^{N} (ω_i² + v_i) x_{ij} x_{il}   (7)

    = (1/N) ∑_{i=1}^{N} ω_i² x_{ij} x_{il} + (1/N) ∑_{i=1}^{N} v_i x_{ij} x_{il},   (8)

    where v_i is defined to equal u_i² minus its mean of ω_i².

    Under suitable assumptions about the x_{ij} and the ω_i², we can apply a law of large numbers to the second term in (8).

    Since E(v_i) = 0, this term converges to 0, while the first term converges to expression (5).

    Because N⁻¹ ∑_{i=1}^{N} û_i² x_{ij} x_{il} is asymptotically equal to N⁻¹ ∑_{i=1}^{N} u_i² x_{ij} x_{il}, these arguments imply that (6) consistently estimates (5).

  • Heteroskedasticity-Robust Inference

    In practice, of course, we omit the factors of 1/N. We simply use the matrix

    V̂ar_h(β̂) ≡ (X⊤X)⁻¹ X⊤Ω̂X (X⊤X)⁻¹   (9)

    directly to estimate the covariance matrix of β̂.

    A more revealing way to write (9) is

    V̂ar_h(β̂) ≡ (X⊤X)⁻¹ (∑_{i=1}^{N} û_i² X_i⊤X_i) (X⊤X)⁻¹,   (10)

    where X_i is the ith row of X. This makes it clear that the N × N matrix Ω̂ is never used.

    The sandwich estimator (10) is a heteroskedasticity-consistent covariance matrix estimator, or HCCME. It is valid for heteroskedasticity of unknown form.

    By taking square roots of the diagonal elements of (10), we can obtain heteroskedasticity-robust standard errors.
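
    As a check on (9) and (10), here is a minimal numpy sketch of HC0 (function and variable names are mine); it builds the filling ∑_i û_i² X_i⊤X_i directly, so the N × N matrix Ω̂ never appears:

    ```python
    import numpy as np

    def hc0_covariance(X, y):
        """White's HC0 sandwich estimator, equations (9)-(10)."""
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        u_hat = y - X @ beta_hat                        # OLS residuals
        middle = (X * u_hat[:, None] ** 2).T @ X        # sum_i u_hat_i^2 X_i' X_i
        cov = XtX_inv @ middle @ XtX_inv                # sandwich (10)
        return beta_hat, cov, np.sqrt(np.diag(cov))     # robust standard errors
    ```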


  • Asymptotic Theory for OLS

    Asymptotic Theory for OLS

    With heteroskedasticity of unknown form, Theorem 4.3 would be replaced by

    N^{1/2}(β̂ − β₀) →d N(0, S_{X⊤X}⁻¹ (lim_{N→∞} (1/N) X⊤ΩX) S_{X⊤X}⁻¹)   (11)

    and

    lim_{N→∞} N V̂ar_h(β̂) = S_{X⊤X}⁻¹ (lim_{N→∞} (1/N) X⊤ΩX) S_{X⊤X}⁻¹.   (12)

    We conclude that β̂ is root-N consistent and asymptotically normal, with (12) providing a consistent estimator of its covariance matrix.

    Of course, all the factors of N are omitted when we actually make inferences about β̂.

  • Alternative Forms of HCCME

    Alternative Forms of HCCME

    The original HCCME (10) of White (1980), often called HC0, uses squared residuals to estimate the diagonal elements of Ω.

    But least-squares residuals tend to be too small. Better estimators inflate the squared residuals (MacKinnon and White, 1985).

    HC1: Use û_i² in Ω̂ and then multiply the entire matrix by the scalar N/(N − k), for a standard degrees-of-freedom correction.

    HC2: Use û_i²/(1 − h_i) in Ω̂, where

    h_i ≡ X_i(X⊤X)⁻¹X_i⊤   (13)

    is the ith diagonal element of the “hat” matrix P_X.

    Recall the result that, when Var(u_i) = σ² for all i, E(û_i²) = σ²(1 − h_i).

    Therefore, the ratio of û_i² to 1 − h_i would have expectation σ² if the disturbances were homoskedastic.

  • Alternative Forms of HCCME

    HC3: Use û_i²/(1 − h_i)² in Ω̂. This is a slightly simplified version of what one gets by employing the jackknife.

    A jackknife estimator omits one observation at a time when obtaining fitted values, so that the fitted value for observation i does not depend on the data for that observation.

    Dividing by (1 − h_i)² actually overcorrects the residuals. But observations with large variances often tend to have residuals that are very much too small. Thus, HC3 may be attractive if large variances are associated with large values of h_i.

    Inferences based on any HCCME, especially HC0 and HC1, may be seriously inaccurate even when the sample size is moderately large if some observations have much higher leverage than others.

    By default, Stata uses HC1 with the “robust” and “vce(robust)” options. But “vce(hc2)” and “vce(hc3)” provide HC2 and HC3.
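
    A minimal numpy sketch of the four variants (function name and code layout are mine; the leverages h_i come straight from (13)):

    ```python
    import numpy as np

    def hccme(X, y, kind="HC1"):
        """HC0-HC3 heteroskedasticity-consistent covariance matrices."""
        N, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        u_hat = y - X @ (XtX_inv @ X.T @ y)             # OLS residuals
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)     # leverages h_i, eq. (13)

        if kind == "HC0":
            omega = u_hat**2
        elif kind == "HC1":
            omega = u_hat**2 * N / (N - k)
        elif kind == "HC2":
            omega = u_hat**2 / (1.0 - h)
        elif kind == "HC3":
            omega = u_hat**2 / (1.0 - h) ** 2
        else:
            raise ValueError(kind)

        middle = (X * omega[:, None]).T @ X             # sum_i omega_i X_i' X_i
        return XtX_inv @ middle @ XtX_inv
    ```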


  • When Does Heteroskedasticity Matter?

    When Does Heteroskedasticity Matter?

    Even when the disturbances are heteroskedastic, we do not necessarily have to use an HCCME.

    Consider the jlth element of N⁻¹X⊤ΩX, which is

    (1/N) ∑_{i=1}^{N} ω_i² x_{ij} x_{il}.   (14)

    If the limit as N → ∞ of the average of the ω_i² exists and is denoted σ², then expression (14) can be rewritten as

    σ² (1/N) ∑_{i=1}^{N} x_{ij} x_{il} + (1/N) ∑_{i=1}^{N} (ω_i² − σ²) x_{ij} x_{il}.   (15)

    The first term here is just the jlth element of σ²N⁻¹X⊤X.

  • When Does Heteroskedasticity Matter?

    If it happens that

    lim_{N→∞} (1/N) ∑_{i=1}^{N} (ω_i² − σ²) x_{ij} x_{il} = 0   (16)

    for j, l = 1, …, k, then we find that

    lim_{N→∞} (1/N) X⊤ΩX = σ² lim_{N→∞} (1/N) X⊤X.   (17)

    The asymptotic covariance matrix of N^{1/2}(β̂ − β₀) is just

    (lim_{N→∞} (1/N) X⊤X)⁻¹ σ² (lim_{N→∞} (1/N) X⊤X) (lim_{N→∞} (1/N) X⊤X)⁻¹ = σ² S_{X⊤X}⁻¹.   (18)

    The usual OLS estimate of σ² is s² = (1/(N − k)) ∑_{i=1}^{N} û_i².

  • When Does Heteroskedasticity Matter?

    If we assume that we can apply a law of large numbers, the probability limit of N⁻¹ ∑_{i=1}^{N} û_i² is

    lim_{N→∞} (1/N) ∑_{i=1}^{N} ω_i² = σ².   (19)

    In this special case, the usual OLS covariance matrix estimator s²(X⊤X)⁻¹ is valid asymptotically.

    If we are estimating a sample mean, then X = ι, and

    (1/N) ∑_{i=1}^{N} ω_i² x_{ij} x_{il} = (1/N) ∑_{i=1}^{N} ω_i² ι_i² = (1/N) ∑_{i=1}^{N} ω_i² → σ² as N → ∞.   (20)

    Thus (16) holds, and we do not have to worry about heteroskedasticity.

    Only heteroskedasticity related to the squares and cross-products of the x_{ij} affects the validity of the usual OLS covariance matrix estimator.
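
    A small simulation sketch of this special case (the data-generating process and all numbers are assumptions of the example, not from the slides): the disturbance variances differ across observations but are unrelated to the regressor, so the conventional and HC0 standard errors for the slope agree closely on average.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    N, R = 2000, 500
    x = rng.normal(size=N)
    X = np.column_stack([np.ones(N), x])
    XtX_inv = np.linalg.inv(X.T @ X)

    # Heteroskedasticity unrelated to x: standard deviations drawn independently of x.
    omega = np.sqrt(rng.uniform(0.5, 1.5, size=N))

    se_conv, se_hc0 = [], []
    for _ in range(R):
        y = 1.0 + 0.5 * x + omega * rng.normal(size=N)
        b = XtX_inv @ X.T @ y
        u = y - X @ b
        se_conv.append(np.sqrt((u @ u / (N - 2)) * XtX_inv[1, 1]))   # conventional
        cov = XtX_inv @ ((X * u[:, None] ** 2).T @ X) @ XtX_inv
        se_hc0.append(np.sqrt(cov[1, 1]))                            # HC0

    print(np.mean(se_conv), np.mean(se_hc0))   # nearly the same on average
    ```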


  • HAC Covariance Matrix Estimation

    HAC Covariance Matrix Estimation

    The assumption that the matrix Ω is diagonal is what makes it possible to estimate N⁻¹X⊤ΩX consistently and obtain an HCCME, even though Ω itself cannot be estimated consistently.

    The matrix N⁻¹X⊤ΩX can sometimes be estimated consistently for a model that uses time-series data when the disturbances are correlated across time periods.

    Observations that are close to each other may be strongly correlated, but observations that are far apart may be uncorrelated or nearly so.

    If so, only the elements of Ω that are on or close to the principal diagonal are large.

    We may be able to obtain an estimate of the covariance matrix of the parameter estimates that is heteroskedasticity and autocorrelation consistent, or HAC.
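
    The slides do not give a formula, but one common HAC estimator of the middle matrix is the Newey-West (1987) one. A minimal numpy sketch, with the Bartlett-kernel weights and the lag truncation L as my own choices:

    ```python
    import numpy as np

    def hac_covariance(X, u_hat, L):
        """Newey-West style HAC sandwich with lag truncation L."""
        XtX_inv = np.linalg.inv(X.T @ X)
        Z = X * u_hat[:, None]                  # scores X_t' u_t, one row per period
        S = Z.T @ Z                             # lag-0 (heteroskedasticity-only) term
        for l in range(1, L + 1):
            w = 1.0 - l / (L + 1.0)             # Bartlett kernel weight
            G = Z[l:].T @ Z[:-l]                # lag-l cross-products
            S += w * (G + G.T)
        return XtX_inv @ S @ XtX_inv
    ```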


  • Cluster-Robust Inference

    Cluster-Robust Inference

    Data are often collected at the individual level, but each observation is associated with a higher-level entity, such as a city, state, province, or country, a classroom or school, a hospital, or perhaps a time period.

    Thus each observation belongs to a cluster, and the regression disturbances may be correlated within the clusters.

    We can divide the data into G clusters, indexed by g, and write the linear regression model as

    y ≡ [y_1; y_2; …; y_G] = Xβ + u ≡ [X_1; X_2; …; X_G] β + [u_1; u_2; …; u_G],   (21)

    where the semicolons denote vertical stacking of the cluster blocks y_g, X_g, and u_g.

  • Cluster-Robust Inference

    Cluster g has N_g observations, and the entire sample has N = ∑_{g=1}^{G} N_g observations.

    The matrix X and the vectors y and u have N rows, the matrix X has k columns, and the parameter vector β has k elements.

    We assume that the disturbances are uncorrelated across clusters but potentially correlated and heteroskedastic within clusters, so that

    E(u_g u_g⊤) = Ω_g,   g = 1, …, G.   (22)

    The N_g × N_g covariance matrices Ω_g are assumed to be unknown. Thus Ω is block diagonal, with the Ω_g forming the diagonal blocks.

    As usual, the covariance matrix of β̂ is a sandwich:

    Var(β̂) = (X⊤X)⁻¹ X⊤ΩX (X⊤X)⁻¹.   (23)

    But now the filling is a sum of G matrices, each of them k × k.

  • Cluster-Robust Inference

    We can write the filling in (23) as

    ∑_{g=1}^{G} X_g⊤Ω_gX_g = ∑_{g=1}^{G} Σ_g,   (24)

    where

    Ω_g = E(u_g u_g⊤)   and   Σ_g = E(X_g⊤ u_g u_g⊤ X_g).   (25)

    Here Σ_g is the covariance matrix of the score vector X_g⊤u_g.

    We can rewrite X⊤ΩX as Σ = ∑_{g=1}^{G} Σ_g, so that Var(β̂) becomes (X⊤X)⁻¹Σ(X⊤X)⁻¹, which is simpler than (23).

    If either the disturbances or the regressors are uncorrelated within cluster g, then Σ_g will be diagonal. If that is true for all g, then so is Σ.

    Thus the errors of inference from assuming that Ω is diagonal depend on both the disturbances and the regressors.

  • Why Clustering Matters

    Why Clustering Matters

    When the N_g are large, the diagonal elements of (23) can be very much larger than those of covariance matrices that treat Ω as diagonal.

    Consider the random-effects or error-components model

    u_gi = v_g + e_gi,   v_g ∼ IID(0, σ_v²),   e_gi ∼ IID(0, σ_e²),   (26)

    where v_g is a random variable that affects every observation in cluster g and no observation in any other cluster.

    The random-effects model (26) implies that

    Var(u_gi) = σ_v² + σ_e²   and   Cov(u_gi, u_gj) = σ_v²,   (27)

    so that

    ρ_u ≡ Cov(u_gi, u_gj) / Var(u_gi) = σ_v² / (σ_v² + σ_e²)   for all g.   (28)

    Thus all the intra-cluster correlations are the same and equal to ρ_u.

  • Why Clustering Matters

    Suppose the model is y_gi = β_1 + β_2 x_gi + u_gi, where x_gi is fixed within each cluster.

    If N_g is the same for every cluster, then it can be shown that

    Var(β̂_2) / Var_c(β̂_2) = 1 + (N_g − 1) ρ_u,   (29)

    where Var(β̂_2) is the true variance of β̂_2 based on (23), and Var_c(β̂_2) is the incorrect (conventional) variance based on σ²(X⊤X)⁻¹.

    The r.h.s. of (29) is a special case of what is called the Moulton factor.

    More generally, the Moulton factor is

    1 + ρ_x ρ_u (Var(N_g)/N̄_g + N̄_g − 1),   (30)

    where N̄_g = N/G, and ρ_x is the intra-cluster correlation of x.
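
    A minimal sketch of (30) as a function (the name and argument layout are mine; Var(N_g) is taken to be the population variance of the cluster sizes):

    ```python
    import numpy as np

    def moulton_factor(Ng, rho_u, rho_x=1.0):
        """Moulton factor (30): variance inflation from intra-cluster correlation."""
        Ng = np.asarray(Ng, dtype=float)
        Nbar = Ng.mean()                                  # N-bar_g = N / G
        return 1.0 + rho_x * rho_u * (Ng.var() / Nbar + Nbar - 1.0)

    # Equal cluster sizes and rho_x = 1 reproduce the special case (29):
    print(moulton_factor(np.full(50, 21), rho_u=0.05))    # 1 + 20 * 0.05 = 2.0
    ```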


  • Why Clustering Matters

    From (29) and (30), Var(β̂_2) can be very much greater than Var_c(β̂_2) when N_g is large, even if ρ_u is quite small.

    For example, if ρ_u = 0.05, Var(β̂_2) = 2 Var_c when N_g = 21, 4 Var_c when N_g = 61, and 25 Var_c when N_g = 481.

    We could “solve” this problem by including group fixed effects. These would explain the v_g, leaving only the e_gi.

    But group fixed effects cannot be included if any regressor does not vary within clusters. Examples include labor market conditions, tax rates, class sizes, local prices or wages, or measures of local amenities.

    Laws and policies typically affect entire cities, provinces, or states, so we cannot include fixed effects if we are trying to evaluate them.

    Moreover, error-components models like (26) may not describe intra-cluster correlations very well. These may arise from misspecification or the way the data are collected.

  • Why Clustering Matters

    Consider the sample mean ȳ = N⁻¹ ∑_{i=1}^{N} y_i. The usual formula for Var(ȳ) is σ²/N. Thus the standard error of ȳ is O(N^{-1/2}).

    The usual formula assumes that Var(y_i) = σ² and Cov(y_i, y_j) = 0.

    A formula that is valid under much weaker assumptions is

    Var(ȳ) = (1/N²) ∑_{i=1}^{N} Var(y_i) + (2/N²) ∑_{i=1}^{N} ∑_{j=i+1}^{N} Cov(y_i, y_j).   (31)

    The first term on the r.h.s. of (31) is O(N⁻¹). It simplifies to the usual formula if we define σ² as N⁻¹ ∑_{i=1}^{N} σ_i².

    The second term is O(1), because it involves two summations over N, and it is divided by N². It must eventually dominate.

    For large samples with clustered disturbances and a fixed number of clusters, standard errors will decline more slowly than N^{-1/2} and will be bounded from below.
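
    To make the last point concrete, here is a small sketch that evaluates (31) exactly when the only nonzero covariances are the within-cluster ones implied by the error-components model (26); the parameter values and function name are assumptions of the example:

    ```python
    import numpy as np

    def var_ybar_clustered(Ng, sigma_v2, sigma_e2):
        """Exact Var(y-bar) from (31) under the error-components model (26)."""
        Ng = np.asarray(Ng, dtype=float)
        N = Ng.sum()
        first = N * (sigma_v2 + sigma_e2) / N**2       # O(1/N) term
        pairs = (Ng * (Ng - 1) / 2).sum()              # within-cluster pairs
        second = 2.0 * sigma_v2 * pairs / N**2         # O(1) term
        return first + second

    # Fixed G = 20 clusters; as the clusters grow, Var(y-bar) stops shrinking
    # and is bounded below by roughly sigma_v2 / G = 0.005.
    for m in (10, 100, 1000, 10000):
        print(m, var_ybar_clustered(np.full(20, m), sigma_v2=0.1, sigma_e2=1.0))
    ```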


  • An Empirical Example

    An Empirical Example

    Using CPS data for 1979 through 2015, MacKinnon (2016) estimated the earnings equation

    y_gti = β_1 + β_2 Ed2_gti + β_3 Ed3_gti + β_4 Ed4_gti + β_5 Ed5_gti + β_6 Age_gti + β_7 Age_gti² + γ_t Year_t + δ_g State_g + u_gti,   (32)

    where g indexes states from 1 to 51, t indexes years from 1 to 37, and i indexes individuals within each year. There are 36 time fixed effects, the γ_t, and 50 state fixed effects, the δ_g.

    The education dummies Ed2 through Ed5 represent, respectively, high school, high school plus two years, college/university, or at least one postgraduate degree.

    There are 1,156,597 observations on white men aged 25 to 65.

  • An Empirical Example

    The coefficient of interest is ϕ = β_5 − β_4. The OLS estimate is 0.10965, which implies a percentage increase of 11.589%.

    The resulting 95% confidence interval is [11.155, 12.023].

    But when we cluster by state and year (1887 clusters), the 95% confidence interval is [10.967, 12.212].

    When we cluster by state (51 clusters), the 95% confidence interval is [10.447, 12.732].

    • The standard error based on state-level clustering is 2.63 times the standard error with no clustering.

    • Increasing the standard error by a factor of 2.63 is equivalent to reducing the sample size by a factor of 2.63 squared.

    • 1,156,597 clustered observations are equivalent to a sample of approximately 167,000 independent observations.

    These results hold despite state-level fixed effects!

  • An Empirical Example

    Recall the implication of (31) that the accuracy of OLS estimates will grow more slowly than N^{1/2} and will be bounded from above.

    To investigate this, I created multiple subsamples of various sizes, ranging from 1/64 of the original sample to 1/2 of the original sample.

    Let M denote the reduced sample size, where M ≅ N/m for m = 2, 3, 4, 6, 9, 16, 25, 36, 49, 64.

    The subsamples of size M were created by retaining only observations m + j, 2m + j, and so on, for a number of values of j ≤ m.

    The figure graphs the inverse of the average of the reported standard errors of ϕ̂ for three covariance matrix estimators against M.

    • The HR standard errors are proportional to √(1/M), but the two cluster-robust standard errors decline more slowly than √(1/M).

    • For state-level clustering, the standard errors decrease very slowly indeed beyond about M = 250,000.

  • An Empirical Example

    [Figure: the inverse of the average reported standard error of ϕ̂ (vertical axis, 0 to 550) plotted against subsample size M (20,000 to 1,000,000), for three estimators: HC1, CV1(S,Y) (clustered by state and year), and CV1(S) (clustered by state).]

  • Cluster-Robust Variance Matrix Estimation

    Cluster-Robust Variance Matrix Estimation

    We can use a cluster-robust variance estimator, or CRVE, that generalizes the HCCME.

    The simplest and most widely used CRVE is

    CV1:   G(N − 1)/((G − 1)(N − k)) (X⊤X)⁻¹ (∑_{g=1}^{G} X_g⊤û_g û_g⊤X_g) (X⊤X)⁻¹,   (33)

    where û_g is the vector of OLS residuals for cluster g.

    Each of the k × k matrices within the summation has rank 1, because it is the vector X_g⊤û_g times its transpose.

    The rank of the CRVE cannot exceed G (or maybe G − 1), making it impossible to test more than G (or G − 1) restrictions at once.

    When N_g = 1 for all g, CV1 reduces to HC1.
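
    A minimal numpy sketch of (33) (function name and argument layout are mine; cluster_ids is any array marking which cluster each observation belongs to):

    ```python
    import numpy as np

    def cv1_covariance(X, y, cluster_ids):
        """CV1 cluster-robust covariance matrix, equation (33)."""
        N, k = X.shape
        clusters = np.unique(cluster_ids)
        G = len(clusters)
        XtX_inv = np.linalg.inv(X.T @ X)
        u_hat = y - X @ (XtX_inv @ X.T @ y)             # OLS residuals

        middle = np.zeros((k, k))
        for g in clusters:
            idx = cluster_ids == g
            s_g = X[idx].T @ u_hat[idx]                 # score vector X_g' u_hat_g
            middle += np.outer(s_g, s_g)                # rank-one contribution

        factor = G * (N - 1) / ((G - 1) * (N - k))      # small-sample correction
        return factor * XtX_inv @ middle @ XtX_inv
    ```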


  • Cluster-Robust Variance Matrix Estimation

    It is customary to base inferences on the t(G − 1) distribution. Each of the terms in the summation in (33) contributes one degree of freedom, and a t test uses up one degree of freedom.

    In a very special case, Bester, Conley, and Hansen (2011) show that t statistics based on CV1 actually follow the t(G − 1) distribution for fixed G as N → ∞.

    When G is small and N is large, critical values based on t(G − 1) can be substantially larger than ones based on t(N − k).

    Djogbenou, MacKinnon, and Nielsen (2019) prove that inference based on cluster-robust t statistics is asymptotically valid as G → ∞.

    The number of clusters G must tend to infinity with the sample size, but G may increase more slowly than N.

    The proof would be much easier if N/G were constant as N → ∞. In that case, standard errors would be O(N^{-1/2}) = O(G^{-1/2}).

  • Cluster-Robust Variance Matrix Estimation

    Inference based on CV1 and the t(G − 1) distribution can be very unreliable in certain cases:

    • When G is small (say G < 50 if reasonably homogeneous).

    • When one or a few clusters are much larger than average.

    • When the disturbances or scores for one or a few clusters have much larger variances than average.

    • When only a few clusters are “treated.”

    Most effective methods for dealing with such cases are based on bootstrap methods (the wild cluster, pairs cluster, or wild bootstraps), or on randomization inference.

    Other methods are based on CV2, or maybe CV3, variance matrices.

    The gth diagonal block of the matrix M_X is

    M_gg ≡ I_{N_g} − X_g(X⊤X)⁻¹X_g⊤,   g = 1, …, G.   (34)

  • Cluster-Robust Variance Matrix Estimation

    The CV2 matrix uses the transformed residuals

    u̇_g = M_gg^{-1/2} û_g,   (35)

    where M_gg^{-1/2} is the inverse of the symmetric square root of M_gg.

    CV2:   (X⊤X)⁻¹ (∑_{g=1}^{G} X_g⊤u̇_g u̇_g⊤X_g) (X⊤X)⁻¹.   (36)

    Unfortunately, since the M_gg are N_g × N_g matrices, computing them, as well as their inverse symmetric square roots, can be expensive or impossible.

    The CV3 matrix is similar, but it uses ü_g = M_gg⁻¹ û_g instead of the u̇_g.

    There are methods that use CV1, CV2, or CV3 combined with the t(d) distribution, where d is computed from the data in various ways.
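
    A minimal numpy sketch of (34)-(36) (names are mine); it forms each M_gg block explicitly, which is exactly the expense noted above, and it assumes every M_gg is nonsingular:

    ```python
    import numpy as np

    def cv2_cv3_covariance(X, y, cluster_ids, kind="CV2"):
        """CV2 and CV3 cluster-robust covariance matrices, equations (34)-(36)."""
        k = X.shape[1]
        XtX_inv = np.linalg.inv(X.T @ X)
        u_hat = y - X @ (XtX_inv @ X.T @ y)             # OLS residuals

        middle = np.zeros((k, k))
        for g in np.unique(cluster_ids):
            idx = cluster_ids == g
            Xg, ug = X[idx], u_hat[idx]
            Mgg = np.eye(idx.sum()) - Xg @ XtX_inv @ Xg.T   # block of M_X, eq. (34)
            if kind == "CV2":
                w, V = np.linalg.eigh(Mgg)                  # symmetric square root
                ug_t = V @ ((V.T @ ug) / np.sqrt(w))        # M_gg^{-1/2} u_hat_g, (35)
            else:                                           # CV3
                ug_t = np.linalg.solve(Mgg, ug)             # M_gg^{-1} u_hat_g
            s_g = Xg.T @ ug_t
            middle += np.outer(s_g, s_g)

        return XtX_inv @ middle @ XtX_inv                   # sandwich (36)
    ```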


    Outline: Heteroskedasticity-Robust Inference · Asymptotic Theory for OLS · Alternative Forms of HCCME · When Does Heteroskedasticity Matter? · HAC Covariance Matrix Estimation · Cluster-Robust Inference · Why Clustering Matters · An Empirical Example · Cluster-Robust Variance Matrix Estimation