
OLS in Matrix Form

Nathaniel Beck

Department of Political Science

University of California, San Diego

La Jolla, CA 92093

[email protected]

http://weber.ucsd.edu/~nbeck

April, 2001


Some useful matrices

If X is a matrix, its transpose, X′, is the matrix with rows and columns flipped, so the ijth element of X becomes the jith element of X′.

Matrix forms to recognize:

For vector x, x′x = sum of squares of the elements of x (scalar)

For vector x, xx′ = N × N matrix with ijth element xixj

A square matrix is symmetric if it can be flipped around its main diagonal, that is, xij = xji. In other words, if X is symmetric, X = X′. xx′ is symmetric.

For a rectangular m × N matrix X, X′X is the N × N square matrix whose typical element ij is the sum of the cross products of the elements of columns i and j of X; the diagonal element ii is the sum of the squares of column i.
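
These forms are easy to check numerically. A minimal sketch in NumPy (the vectors and matrices here are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

print(x @ x)           # x'x: scalar sum of squares, 14.0
print(np.outer(x, x))  # xx': 3 x 3 matrix with ij-th element x_i * x_j

X = np.arange(6.0).reshape(3, 2)          # a rectangular 3 x 2 matrix
print(X.T @ X)                            # X'X: 2 x 2 and square
print(np.allclose(X.T @ X, (X.T @ X).T))  # True: X'X is symmetric
```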


OLS

Let X be an N × K matrix where we have observations on K variables for N units. (Since the model will usually contain a constant term, one of the columns has all ones. This column is no different from any other, and so henceforth we can ignore constant terms.) Let y be an N-vector of observations on the dependent variable. If ε is the N-vector of errors and β is the K-vector of unknown parameters:

We can write the general linear model as

y = Xβ + ε. (1)

The vector of residuals is given by

e = y − Xβ̂ (2)

where the hat over β indicates the OLS estimate of β.

We can find this estimate by minimizing the sum of squared residuals. Note this sum is e′e. Make sure you can see that this is very different from ee′.

e′e = (y − Xβ̂)′(y − Xβ̂) (3)

which is quite easy to minimize using standard calculus (differentiating the matrix quadratic form and then using the chain rule). Expanding gives e′e = y′y − 2β̂′X′y + β̂′X′Xβ̂; setting the derivative with respect to β̂ to zero gives 2X′Xβ̂ − 2X′y = 0.

This yields the famous normal equations

X′Xβ̂ = X′y (4)

or, if X′X is non-singular,

β̂ = (X′X)⁻¹X′y (5)

Under what conditions will X′X be non-singular (of full rank)?

X′X is K ×K.

One necessary condition, based on a trivial theorem on rank, is that N ≥ K. This assumption is usually met trivially: N is usually big, K is usually small.


Next, all of the columns of X must be linearly independent (this is why we did all this work); that is, no variable is a linear combination of the other variables.

This is the assumption of no (perfect) multicollinearity.

Note that only linear combinations are ruled out, NOT non-linear combinations.
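
A sketch of Equations 4 and 5 on simulated data (NumPy; the data-generating values are illustrative). Note the rank check: adding a column that is a linear combination of other columns makes X′X singular, exactly the perfect-multicollinearity case ruled out above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # constant + 2 variables
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=N)

# Normal equations X'X beta_hat = X'y; solve() is numerically safer
# than forming (X'X)^{-1} explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to (1, 2, -0.5)

# Perfect multicollinearity: append a column that is a linear combination
X_bad = np.column_stack([X, X[:, 1] + X[:, 2]])
print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 3 < 4: X'X is singular
```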


Gauss-Markov assumptions

The critical assumption is that we get the mean function right, that is, E(y) = Xβ.

The second critical assumption is either that X is non-stochastic, or, if it is stochastic, that it is independent of ε.

We can very compactly write the Gauss-Markov (OLS) assumptions on the errors as

Ω = σ²I (6)

where Ω is the variance-covariance matrix of the error process,

Ω = E(εε′). (7)

Make sure you can unpack this into

• Homoskedasticity

• Uncorrelated errors
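
A Monte Carlo sketch of what Equation 6 asserts, assuming homoskedastic, uncorrelated normal errors (all values illustrative): averaging εε′ over many draws should reproduce σ²I, with a constant diagonal (homoskedasticity) and zero off-diagonal entries (uncorrelated errors).

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2, reps = 4, 2.0, 200_000

# reps independent draws of an N-vector of errors
eps = rng.normal(scale=np.sqrt(sigma2), size=(reps, N))

Omega_hat = eps.T @ eps / reps  # Monte Carlo estimate of E(eps eps')
print(np.round(Omega_hat, 2))   # approximately 2 * I
```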


VCV Matrix of the OLS estimates

We can derive the variance-covariance matrix of the OLS estimator, β̂.

β̂ = (X′X)⁻¹X′y (8)

= (X′X)⁻¹X′(Xβ + ε) (9)

= (X′X)⁻¹X′Xβ + (X′X)⁻¹X′ε (10)

= β + (X′X)⁻¹X′ε. (11)

This shows immediately that OLS is unbiased so long as X is non-stochastic, so that

E(β̂) = β + (X′X)⁻¹X′E(ε) = β (12)

or still unbiased if X is stochastic but independent of ε, so that E(X′ε) = 0.

The variance-covariance matrix of the OLS estimator is then

E((β̂ − β)(β̂ − β)′) = E((X′X)⁻¹X′ε[(X′X)⁻¹X′ε]′) (13)

= (X′X)⁻¹X′E(εε′)X(X′X)⁻¹ (14)

and then, given our assumption about the variance-covariance of the errors (Equation 6),

= σ²(X′X)⁻¹ (15)
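
A simulation sketch of Equation 15 (X held fixed across replications; all numbers illustrative): the Monte Carlo variance-covariance of β̂ over repeated error draws should match σ²(X′X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 50, 1.5
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, 2.0])

V_theory = sigma**2 * np.linalg.inv(X.T @ X)  # sigma^2 (X'X)^{-1}

# Distribution of beta_hat over repeated draws of the errors
draws = [np.linalg.solve(X.T @ X, X.T @ (X @ beta + sigma * rng.normal(size=N)))
         for _ in range(20_000)]
V_mc = np.cov(np.array(draws).T)

print(np.round(V_theory, 4))
print(np.round(V_mc, 4))  # close to V_theory
```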


Robust (Huber or White) standard errors

Note how the second-to-last formulation makes sense of both White's heteroskedasticity-consistent standard errors and my panel-consistent standard errors.

Heteroskedasticity will lead to incorrect standard errors insofar as

X′E(εε′)X ≠ σ²(X′X) (16)

We don't know the ε but we do know the residuals, e. Obviously each individual residual is not a good estimator of the corresponding ε, but White showed that X′diag(e1², . . . , eN²)X is a good estimator of the corresponding expectation term.

Thus White suggested a test for seeing how far this estimator diverges from what you would get if you just used the OLS standard errors. This test is to regress the squared residuals on the terms in X′X, that is, the squares and cross-products of the independent variables. If the R² exceeds a critical value (NR² is distributed χ² with K degrees of freedom), then heteroskedasticity causes problems. At that point use the White estimate. (By and large, always using the White estimate can do little harm and some good.)
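
A sketch of the classical versus White ("sandwich") variance estimates on simulated heteroskedastic data (NumPy; no small-sample correction, and the data-generating process is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=N)  # error variance grows with |x|

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

# Classical: sigma2_hat (X'X)^{-1}
sigma2_hat = e @ e / (N - X.shape[1])
se_ols = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# White: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}
meat = X.T @ (X * e[:, None] ** 2)
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(se_ols)
print(se_white)  # differs noticeably under heteroskedasticity
```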


Partitioned matrix and partial regression: the FWL theorem

In all of the below, any matrix or submatrix that is inverted is square, but other matrices may be rectangular so long as everything is conformable and only square matrices end up being inverted.

Direct multiplication tells us that

\begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{bmatrix} (17)

Matrices like the above are called block diagonal. As we shall see, this tells us a lot about when we can ignore the fact that we might have added other variables to a regression.
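
Equation 17 is easy to confirm numerically. A sketch with arbitrary well-conditioned blocks (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
A11 = rng.normal(size=(2, 2)) + 2 * np.eye(2)  # shifted to keep the blocks invertible
A22 = rng.normal(size=(3, 3)) + 2 * np.eye(3)

A = np.block([[A11, np.zeros((2, 3))],
              [np.zeros((3, 2)), A22]])  # block diagonal

blockwise = np.block([[np.linalg.inv(A11), np.zeros((2, 3))],
                      [np.zeros((3, 2)), np.linalg.inv(A22)]])
print(np.allclose(np.linalg.inv(A), blockwise))  # True: invert block by block
```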

The situation is more complicated for a general matrix:

\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} A_{11}^{-1}(I + A_{12} F A_{21} A_{11}^{-1}) & -A_{11}^{-1} A_{12} F \\ -F A_{21} A_{11}^{-1} & F \end{bmatrix} (18)

where

F = (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} (19)

There are a lot of ways to show this, but it can be done (tediously) by direct multiplication. NOTHING DEEP HERE.
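
Rather than multiplying it out by hand, a numerical sketch of Equations 18 and 19 (random well-conditioned blocks, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 2, 3
A11 = rng.normal(size=(p, p)) + 3 * np.eye(p)
A12 = rng.normal(size=(p, q))
A21 = rng.normal(size=(q, p))
A22 = rng.normal(size=(q, q)) + 3 * np.eye(q)
A = np.block([[A11, A12], [A21, A22]])

A11i = np.linalg.inv(A11)
F = np.linalg.inv(A22 - A21 @ A11i @ A12)  # Equation 19
Ainv = np.block([
    [A11i @ (np.eye(p) + A12 @ F @ A21 @ A11i), -A11i @ A12 @ F],
    [-F @ A21 @ A11i, F],
])                                          # Equation 18
print(np.allclose(Ainv, np.linalg.inv(A)))  # True
```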

Why do we care? The above formulae allow us to understand what it means to add variables to a regression, and when it matters if we have either too many or too few (omitted variable bias) variables.

First note that if we have two sets of independent variables, say X1 and X2, that are orthogonal to each other, then the sums of cross products of the variables in X1 with those in X2 are zero (by definition). Thus the X′X matrix formed out of X1 and X2 is block diagonal, and so the theorem on the inverse of block diagonal matrices tells us that the OLS estimates of the coefficients of the first set of variables estimated separately are the same as what we would get if we estimated using both sets of variables.
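
A sketch of that claim (NumPy; the QR trick just manufactures exactly orthogonal columns, and all coefficient values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 60
Q, _ = np.linalg.qr(rng.normal(size=(N, 4)))  # orthonormal columns
X1, X2 = Q[:, :2], Q[:, 2:]                   # so X1'X2 = 0 exactly
y = X1 @ [1.0, -1.0] + X2 @ [0.5, 2.0] + rng.normal(size=N)

XX = np.hstack([X1, X2])
b_short = np.linalg.solve(X1.T @ X1, X1.T @ y)  # y on X1 alone
b_long = np.linalg.solve(XX.T @ XX, XX.T @ y)   # y on both sets
print(np.allclose(b_short, b_long[:2]))         # True: same estimates
```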

What does it mean for the two sets of variables to be orthogonal? Essentially, it means they are independent, that is, one has nothing to do with the other. So if we have regressions involving political variables, and we think that hair color is unrelated to any of these, then we need not worry about what would happen if we included hair color in the regression. But if we leave out race or party ID, it will make a difference.

The more interesting question is what happens if the two sets of variables are not orthogonal; in particular, what happens if we estimate a regression using a set of variables X1 but omit the relevant X2. That is, suppose the "true" model is

y = X1β1 + X2β2 + ε (20)

but we "mistakenly" omit the X2 variables from the regression.


The true normal equation is:

\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}^{-1} \begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix} = \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} (21)

Now we can use the results on partitioned inverses to see that

β̂1 = (X′1X1)⁻¹X′1y − (X′1X1)⁻¹X′1X2β̂2 (22)

Note that the first term in this (everything up to the minus sign) is just the OLS estimate of β1 in the regression of y on the X1 variables alone.

Thus it is only safe to ignore "omitted" variables if the second term, after the minus sign, is zero.

What is that term?

The first part of that term, up to the β̂2, is just the coefficients from regressing the variables in X2 (done separately and then put together into a matrix) on all the variables in X1.


This will only be zero if the variables in X1 are linearly unrelated to the variables in X2 (political variables and hair color).

The second term will also be zero if β2 = 0, that is, the X2 variables have no impact on y.

Thus you can ignore all potential omitted variables that are either unrelated to the variables you do include or unrelated to the dependent variable.

But any other variables that do not meet this condition will change your estimates of β1 if you do include them.
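
A sketch of Equation 22 on simulated data where X2 is correlated with X1 (all values illustrative): the short-regression coefficients minus (X′1X1)⁻¹X′1X2β̂2 reproduce the long-regression coefficients exactly.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200
X1 = np.column_stack([np.ones(N), rng.normal(size=N)])
X2 = (0.7 * X1[:, 1] + rng.normal(size=N)).reshape(-1, 1)  # correlated with X1
y = X1 @ [1.0, 2.0] + 1.5 * X2[:, 0] + rng.normal(size=N)

X = np.hstack([X1, X2])
b = np.linalg.solve(X.T @ X, X.T @ y)
b1_long, b2 = b[:2], b[2:]

b1_short = np.linalg.solve(X1.T @ X1, X1.T @ y)  # omitting X2

# Equation 22: b1_long = b1_short - (X1'X1)^{-1} X1' X2 b2
delta = np.linalg.solve(X1.T @ X1, X1.T @ X2)    # X2 regressed on X1
print(np.allclose(b1_long, b1_short - delta @ b2))  # True
```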

To study this further, we need some more matrices!


The residual maker and the hat matrix

There are some useful matrices that pop up a lot.

Note that

e = y − Xβ̂ (23)

= y − X(X′X)⁻¹X′y (24)

= (I − X(X′X)⁻¹X′)y (25)

= My (26)

where M = I − X(X′X)⁻¹X′. M makes residuals out of y. Note that M is N × N, that is, big!

A square matrix A is idempotent if A² = AA = A (in scalars, only 0 and 1 would be idempotent). M is idempotent.

MM = (I − X(X′X)⁻¹X′)(I − X(X′X)⁻¹X′) (27)

= I² − 2X(X′X)⁻¹X′ + X(X′X)⁻¹X′X(X′X)⁻¹X′ (28)

= I − 2X(X′X)⁻¹X′ + X(X′X)⁻¹X′ = I − X(X′X)⁻¹X′ (29)

= M (30)

This will prove useful.

A related matrix is the hat matrix, which makes ŷ, the predicted y, out of y. Just note that

ŷ = y − e = [I − M]y = Hy (31)

where

H = X(X′X)⁻¹X′ (32)

Greene calls this matrix P, but he is alone. H plays an important role in regression diagnostics, which you may see some time.
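
A sketch of M and H and their advertised properties (NumPy, small illustrative data):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = rng.normal(size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix, N x N
M = np.eye(N) - H                     # residual maker

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(M @ M, M))                 # idempotent
print(np.allclose(M, M.T))                   # symmetric
print(np.allclose(M @ y, y - X @ beta_hat))  # My = residuals
print(np.allclose(H @ y, X @ beta_hat))      # Hy = fitted values
```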


Back to comparing big and small regressions

If we "uninvert" the normal equation (Equation 21) we get

\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix} \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = \begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix} (33)

and we can simplify the equation for β̂1 when all variables are included (Equation 22) to

β̂1 = (X′1X1)⁻¹X′1(y − X2β̂2) (34)

Direct multiplication for the second element in Equation 33 gives

X′2X1β̂1 + X′2X2β̂2 = X′2y (35)

and then substituting for β̂1 using Equation 34 gives

X′2X1(X′1X1)⁻¹X′1y − X′2X1(X′1X1)⁻¹X′1X2β̂2 + X′2X2β̂2 = X′2y (36)


which simplifies to

X′2[I − X1(X′1X1)⁻¹X′1]X2β̂2 = X′2[I − X1(X′1X1)⁻¹X′1]y (37)

and then we get β̂2 by premultiplying both sides by the inverse of the term that premultiplies β̂2.

Note that the term in brackets is the M matrix which makes residuals for regressions on the X1 variables; call it M1. M1y is the vector of residuals from regressing y on the X1 variables, and M1X2 is the matrix made up of the column-by-column residuals of regressing each variable (column) in X2 on all the variables in X1.

Because M1 is both idempotent and symmetric, we can then write

β̂2 = (X∗2′X∗2)⁻¹X∗2′y∗ (38)

where X∗2 = M1X2 and y∗ = M1y.

Note Equation 38 shows that β̂2 is just obtained from regressing y∗ on X∗2 (get good at spotting regressions, that is, (X′X)⁻¹X′y forms).


But what are the starred variables? They are just the residuals of the variables after regressing them on the X1 variables.

So the difference between regressing only on X2 and on both X2 and X1 is that the latter first regresses both the dependent variable and all the X2 variables on the X1 variables and then regresses the residuals on each other, while the smaller regression just regresses y on the X2 variables.

This is what it means to hold the X1 variables "constant" in a multiple regression, and explains why we have so many controversies about what variables to include in a multiple regression.
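
To close, a sketch of the whole FWL result (NumPy, illustrative data): partialling X1 out of both y and X2 and regressing the residuals on each other reproduces the β̂2 from the full regression.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 150
X1 = np.column_stack([np.ones(N), rng.normal(size=N)])
X2 = np.column_stack([0.5 * X1[:, 1] + rng.normal(size=N),
                      rng.normal(size=N)])
y = X1 @ [1.0, -2.0] + X2 @ [0.5, 1.5] + rng.normal(size=N)

# Full regression on [X1 X2]
X = np.hstack([X1, X2])
b2_full = np.linalg.solve(X.T @ X, X.T @ y)[2:]

# FWL: residual maker for X1, applied to y and to X2
M1 = np.eye(N) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
y_star, X2_star = M1 @ y, M1 @ X2
b2_fwl = np.linalg.solve(X2_star.T @ X2_star, X2_star.T @ y_star)

print(np.allclose(b2_full, b2_fwl))  # True
```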