
MA 575: Linear Models

MA 575 Linear Models: Cedric E. Ginestet, Boston University

Hat Matrix: Properties and Interpretation, Week 5, Lecture 1

1 Hat Matrix

1.1 From Observed to Fitted Values

The OLS estimator was found to be given by the (p∗ × 1) vector,

β̂ = (X^T X)^{-1} X^T y.

The predicted values ŷ can then be written as,

ŷ = Xβ̂ = X(X^T X)^{-1} X^T y =: Hy,

where H := X(X^T X)^{-1} X^T is an n × n matrix, which “puts the hat on y” and is therefore referred to as the hat matrix. This shows that the fitted values are, in fact, a linear function of the observed values, such that for any y, y′ ∈ R^n, letting f(y) := ŷ = Hy, we have

f(y + y′) = H(y + y′) = Hy + Hy′ = f(y) + f(y′),

and for any a ∈ R, f(ay) = af(y). Similarly, the residuals can also be expressed as a function of H,

ê := y − ŷ = y − Hy = (I − H)y,

with I denoting the n × n identity matrix, and where again the residuals can also be seen to be a linear function of the observed values, y. In summary, we therefore have

ŷ = Hy and ê = (I − H)y.
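As a concrete illustration (not part of the original notes), the following minimal numpy sketch simulates a small design matrix X with an intercept column, forms H explicitly, and checks that Hy and (I − H)y reproduce the fitted values and residuals obtained from the OLS coefficients; the simulated data, seed, and variable names are assumptions of the example.

```python
# Minimal sketch: build the hat matrix for a simulated design and verify
# that y_hat = Hy and e_hat = (I - H)y.  (Simulated data; illustrative only.)
import numpy as np

rng = np.random.default_rng(575)                    # arbitrary seed
n, p_star = 20, 3                                   # n observations, p* columns (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p_star - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T                # H = X(X^T X)^{-1} X^T
y_hat = H @ y                                       # fitted values
e_hat = (np.eye(n) - H) @ y                         # residuals

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # OLS coefficients
print(np.allclose(y_hat, X @ beta_hat))             # True: Hy equals X beta_hat
print(np.allclose(y, y_hat + e_hat))                # True: y = y_hat + e_hat
```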

Crucially, it can be shown that both H and I − H are orthogonal projections. This can be represented visually by contrasting two different viewpoints on multiple linear regression:

i. Thus far, we have mainly been concerned with what may be called the variable space, R^{p∗}, in which each subject is a point, and the variables are dimensions.

ii. We need to accomplish a change of perspective by considering the subject space, R^n. In this Euclidean space, each subject is a dimension, whereas y, ŷ, and ê are treated as vectors.


1.2 Hat Matrix as Orthogonal Projection

The matrix of a projection which is also symmetric is an orthogonal projection. We can show that both H and I − H are orthogonal projections. These two conditions can be restated as follows:

1. A square matrix A is a projection if it is idempotent,

2. A projection A is orthogonal if it is also symmetric.

A projection that is not orthogonal is called an oblique projection. Specifically, H projects y onto the column space of X, whereas I − H projects y onto the orthogonal complement of the image of H. The column space of a matrix is defined as the range or the image of the corresponding linear transformation. Formally, as H is an n × n matrix, its column space is defined as

col(H) := { Σ_{j=1}^{n} c_j H_{·,j} : c_j ∈ R, ∀ j = 1, ..., n },

where H_{·,j} denotes the j-th column of H. This is the span of the linearly independent columns of H.

1.3 Idempotency of the Hat Matrix

H is an n × n square matrix, and moreover, it is idempotent, which can be verified as follows,

HH = (X(X^T X)^{-1} X^T)(X(X^T X)^{-1} X^T)
   = X(X^T X)^{-1}(X^T X)(X^T X)^{-1} X^T
   = X(X^T X)^{-1} X^T
   = H.

Similarly, I − H can also be shown to be idempotent,

(I − H)(I − H) = I − 2H + HH = I − 2H + H = I − H.

Every square and idempotent matrix is a projection matrix. The meaning of a projection can be understood with the following 2 × 2 example of a projection matrix, P, which sends any 2-dimensional vector, x, to a one-dimensional subspace,

Px = \begin{pmatrix} 1 & \alpha \\ 0 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1 + \alpha x_2 \\ 0 \end{pmatrix}.

The idempotency of P implies that once a vector has been projected to a subspace, it “remains” there, even if we re-apply the same projection.
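As a quick numerical check (my own toy example, not from the notes), the sketch below verifies idempotency for a hat matrix built from a simulated design, for its complement I − H, and for the oblique 2 × 2 projection P above; all numerical values are arbitrary choices.

```python
# Sketch: idempotency of H, I - H, and the oblique projection P = [[1, alpha], [0, 0]].
import numpy as np

rng = np.random.default_rng(0)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # simulated design
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)

print(np.allclose(H @ H, H))                      # HH = H
print(np.allclose((I - H) @ (I - H), I - H))      # (I - H)(I - H) = I - H

alpha = 2.0                                       # arbitrary value for the example
P = np.array([[1.0, alpha], [0.0, 0.0]])          # oblique (non-symmetric) projection
x = np.array([3.0, 4.0])
print(np.allclose(P @ (P @ x), P @ x))            # projecting twice changes nothing
```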

1.4 Symmetry of the Hat Matrix

For any square and invertible matrix, the inverse and transpose operators commute,

(X^T)^{-1} = (X^{-1})^T.

Moreover, the transpose unary operator is an involution, since (X^T)^T = X. Thus, it follows that (X^T X)^{-1} is self-transpose (i.e. symmetric), since

[(X^T X)^{-1}]^T = [(X^T X)^T]^{-1} = (X^T X)^{-1}.


One can apply these two rules in order to show that H is self-transpose or symmetric,

H^T = [X(X^T X)^{-1} X^T]^T = X [(X^T X)^{-1}]^T X^T = X(X^T X)^{-1} X^T = H.

This symmetry is inherited by I − H, in the following manner,

(I − H)^T = I^T − H^T = I − H.

The orthogonality of a projection is best understood through an example. Consider the following 2 × 2 matrix, P, and its complement I − P, when applied to any vector x ∈ R^2,

Px = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1 \\ 0 \end{pmatrix}, and (I − P)x = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ x_2 \end{pmatrix}.

One can immediately see that the two resulting vectors are perpendicular, 〈Px, (I − P)x〉 = 0, and that therefore they span two orthogonal linear subspaces of R^2.
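The following short sketch (simulated data, illustrative only) checks the symmetry of H numerically, and that Hx and (I − H)x are perpendicular for an arbitrary vector x, mirroring the 2 × 2 example above.

```python
# Sketch: H is symmetric, and Hx is perpendicular to (I - H)x.
import numpy as np

rng = np.random.default_rng(1)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # simulated design
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))                                   # H^T = H
x = rng.normal(size=n)
print(np.isclose((H @ x) @ ((np.eye(n) - H) @ x), 0.0))      # <Hx, (I - H)x> = 0
```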

1.5 Orthogonal Projections and Orthogonal Matrices

One should be careful not to confuse the matrix of an orthogonal projection with an orthogonal matrix. Recall that the latter satisfies,

A^T A = AA^T = I.

Specifically, one may observe that an orthogonal projection H maps vectors onto a subspace (its range) that is orthogonal to its null space, whereas an orthogonal matrix is one whose column vectors are orthogonal to each other. In general, an orthogonal matrix does not induce an orthogonal projection. In fact, it can be shown that the sole matrix which is both an orthogonal projection and an orthogonal matrix is the identity matrix.
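To illustrate the distinction (an example of my own choosing), the sketch below uses a 2 × 2 rotation matrix, which is an orthogonal matrix but not a projection, since it fails idempotency; the identity matrix passes both tests.

```python
# Sketch: an orthogonal matrix (a rotation) need not be a projection.
import numpy as np

theta = np.pi / 4                                   # arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))              # True: Q^T Q = I (orthogonal matrix)
print(np.allclose(Q @ Q, Q))                        # False: not idempotent, hence not a projection

I2 = np.eye(2)                                      # the identity is both
print(np.allclose(I2 @ I2, I2), np.allclose(I2.T @ I2, np.eye(2)))
```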

2 Orthogonal Decomposition

2.1 Range and Kernel of the Hat Matrix

By combining our definitions of the fitted values and the residuals, we have

ŷ = Hy and ê = (I − H)y.

These equations correspond to an orthogonal decomposition of the observed values, such that

y = ŷ + ê = Hy + (I − H)y.

Observe that the column space or range of H, denoted col(H), is identical to the column space of X. Recall that X is a matrix with real entries, and therefore it is known that the rank of X is equal to the rank of its Gram matrix, defined as X^T X, such that

rank(X) = rank(X^T X) = p∗.

Moreover, we can use some basic operations on matrix ranks, such that for any square matrix A of order k × k, if B is an n × k matrix of rank k, then

rank(BA) = rank(A), and rank(AB^T) = rank(A).


By assumption, we have rank(X) = p∗. Thus, it suffices to apply these rules in order to obtain

rank(H) = rank(X(X^T X)^{-1} X^T) = rank((X^T X)^{-1}) = rank(X^T X) = p∗,

and to note that only full rank matrices are invertible, which implies that matrix inversion preserves rank. Thus, the column space of H is equal to the column space of X, such that

col(H) = col(X),

where the column space of X is the set of all vectors that can be obtained as linear combinations of the columns of X. This is commonly referred to as the span of the columns of X.

The orthogonal complement of this vector subspace is the kernel or null space of H, denoted ker(H). In summary, the space R^n, in which the n columns of H live, can be divided into the following two orthogonal vector subspaces,

i. y ∈ R^n,

ii. ŷ ∈ X := col(H) = span(x_1, ..., x_{p∗}),

iii. ê ∈ X^⊥ := ker(H) = {v ∈ R^n : 〈v, x_j〉 = 0, ∀ j = 1, ..., p∗};

where the vector structure on R^n can be recovered by taking the direct sum of the vector structures on X and X^⊥. Here, X and X^⊥ are referred to as vector subspaces or linear subspaces of R^n. More specifically, the linear subspace X^⊥ is the orthogonal complement of X. Naturally, the fact that ŷ ∈ X and ê ∈ X^⊥ implies that ŷ ⊥ ê, which can be checked algebraically, as demonstrated in the next section.
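As a numerical illustration (simulated design, my own example), the sketch below checks that rank(H) = rank(X) = p∗ and that the residual vector is orthogonal to every column of X, i.e. that it lies in X^⊥ = ker(H).

```python
# Sketch: rank(H) = rank(X) = p*, and X^T e_hat = 0 (residuals lie in ker(H)).
import numpy as np

rng = np.random.default_rng(2)
n, p_star = 25, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p_star - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e_hat = (np.eye(n) - H) @ y

print(np.linalg.matrix_rank(H), np.linalg.matrix_rank(X))    # both equal p* = 4
print(np.allclose(X.T @ e_hat, np.zeros(p_star)))            # <e_hat, x_j> = 0 for every column
```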

2.2 Residuals and Fitted Values are Orthogonal

Firstly, observe that left-multiplication of I − H by H gives an n × n matrix of zeros,

H(I − H) = H − HH = H − H = 0.

Geometrically, the residuals and the fitted values are two orthogonal vectors. Indeed, one can verify that the Euclidean inner product or dot product between these two vectors in R^n is zero:

〈ŷ, ê〉 = ŷ^T ê = (Hy)^T (I − H)y = y^T H(I − H)y = y^T 0 y = 0.

Therefore, ŷ and ê are orthogonal in R^n.
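A one-line numerical confirmation of this inner product, on the same kind of simulated data as in the earlier sketches (again an assumption of the example):

```python
# Sketch: <y_hat, e_hat> = 0 numerically.
import numpy as np

rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat, e_hat = H @ y, (np.eye(n) - H) @ y
print(np.isclose(y_hat @ e_hat, 0.0))              # True up to floating-point error
```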

2.3 Residuals and Fitted Values are Uncorrelated

Moreover, one can also show that ŷ and ê are elementwise uncorrelated. We will need to use the fact that Cov[Ax, By] = A Cov[x, y] B^T, for any two random vectors x, y ∈ R^n. Thus,

Cov[ŷ, ê | X] = Cov[Hy, (I − H)y | X]
             = H Cov[y, y | X](I − H)^T
             = σ^2 H(I − H) = 0.

Here, we are treating ŷ and ê as random vectors, whereas in the previous section, we only considered two vectors of realizations from these random processes.
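The elementwise lack of correlation can also be checked by simulation. The Monte Carlo sketch below (settings and names are my own) holds X fixed, draws many response vectors, and estimates the sample covariance between ŷ_i and ê_i for each coordinate; all estimates should be close to zero.

```python
# Monte Carlo sketch: Cov[y_hat_i, e_hat_i | X] is approximately zero for every i.
import numpy as np

rng = np.random.default_rng(4)
n, n_rep, sigma = 10, 50_000, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # fixed design
H = X @ np.linalg.inv(X.T @ X) @ X.T
beta = np.array([1.0, -1.0, 0.5])

Y = X @ beta + rng.normal(scale=sigma, size=(n_rep, n))      # one draw of y per row
Y_hat = Y @ H                                                # fitted values (H is symmetric)
E_hat = Y - Y_hat                                            # residuals

cov_i = ((Y_hat - Y_hat.mean(axis=0)) * (E_hat - E_hat.mean(axis=0))).mean(axis=0)
print(np.max(np.abs(cov_i)))                                 # small, shrinking as n_rep grows
```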


3 Geometric Interpretation of Regression

3.1 Degrees of Freedom of ŷ and ê

Geometrically, the degrees of freedom associated with ŷ and ê can be seen to simply be the dimensions of the respective vector subspaces onto which these two vectors have been projected. In particular, for the decomposition of R^n into the range and kernel of H, a simple application of the rank-nullity theorem gives

dim(R^n) = dim(X) + dim(X^⊥),
n = rank(H) + nul(H),

and where nul(H) indicates the nullity of H, which is defined as the dimension of the null space of H. When considering a mean function with p∗ different parameters, we have the following decomposition of the degrees of freedom,

df(y) = df(ŷ) + df(ê),
n = p∗ + (n − p∗).     (1)

Since H was noted to be an orthogonal projection, one can also observe that the eigenvalues of H determine the dimensions of these vector subspaces. Indeed, the eigenvalues of the matrix of an orthogonal projection can only be 0 or 1. Thus, the number of zeros in the spectrum of H is equal to the nullity of H, whereas the number of ones in its spectrum is equal to its rank.
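A short numerical sketch of this spectral fact (simulated design, as in the earlier examples): the eigenvalues of H are all 0 or 1, the number of ones equals p∗, and hence trace(H) = p∗.

```python
# Sketch: the spectrum of H contains only 0's and 1's; trace(H) = rank(H) = p*.
import numpy as np

rng = np.random.default_rng(5)
n, p_star = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p_star - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

eig = np.linalg.eigvalsh(H)                         # H is symmetric, so eigvalsh applies
expected = np.concatenate([np.zeros(n - p_star), np.ones(p_star)])
print(np.allclose(np.sort(eig), expected))          # n - p* zeros and p* ones
print(int(round(np.trace(H))))                      # p* = 3
```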

3.2 Variance Partitioning Through Pythagoras’ Theorem

The vectors y, ŷ, and ê determine three points in R^n, which form a triangle. Since 〈ŷ, ê〉 = 0, it follows that this triangle is a right (i.e. right-angled) triangle. Two aspects of regression analysis can be understood using basic trigonometry. Firstly, the decomposition of the total sum of squares (TSS or SYY) into the regression (or explained) sum of squares (SSreg or ESS) and the residual sum of squares (RSS) can be shown to constitute a special case of Pythagoras’ theorem. For convenience, we center the y_i’s at ȳ. Then, the total sum of squares is given by

SYY := Σ_{i=1}^{n} y_i^2 = ||y||^2,

where ||·|| is the Euclidean norm. Using the fact that y_i = ŷ_i + ê_i, for every i = 1, ..., n, we have

Σ_{i=1}^{n} (ŷ_i + ê_i)^2 = Σ_{i=1}^{n} ŷ_i^2 + 2 Σ_{i=1}^{n} ŷ_i ê_i + Σ_{i=1}^{n} ê_i^2
                          = ||ŷ||^2 + 2〈ŷ, ê〉 + ||ê||^2.

Moreover, we have already shown that ŷ ⊥ ê. Therefore, ||y|| is the length of the hypotenuse, whose square is equal to the sum of the squares of the lengths of the two other sides, such that

||y||^2 = ||ŷ||^2 + ||ê||^2.

We will see how to compute the degrees of freedom for these three quantities in the next class. These degrees of freedom may be different from the ones of the random vectors in equation (1), depending on whether or not we have centered our data.
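The decomposition can be confirmed numerically. In the sketch below (simulated data; the centering follows the convention in the text), SYY, ESS, and RSS are computed from the centered observations and fitted values, and the Pythagorean identity holds up to floating-point error.

```python
# Sketch: SYY = ESS + RSS after centering at the sample mean.
import numpy as np

rng = np.random.default_rng(6)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat, e_hat = H @ y, y - H @ y

y_c, y_hat_c = y - y.mean(), y_hat - y.mean()       # with an intercept, mean(y_hat) = mean(y)
syy = np.sum(y_c ** 2)                              # total sum of squares
ess = np.sum(y_hat_c ** 2)                          # sum of squares due to regression
rss = np.sum(e_hat ** 2)                            # residual sum of squares
print(np.isclose(syy, ess + rss))                   # Pythagoras: SYY = ESS + RSS
```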


3.3 Coefficient of Determination using Trigonometry

Secondly, we can also use basic trigonometry in order to measure the degree of linear dependency between y and ŷ. Recall that for any right triangle,

cos θ = |adjacent| / |hypotenuse|,

where the angle θ is measured in radians. In regression, we are interested in the angle at the origin in R^n, with sides given by the vectors y and ŷ. Thus, similarly we have

(cos θ)^2 = ||ŷ||^2 / ||y||^2 = ESS / TSS = (SYY − RSS) / SYY = R^2.

Therefore, maximizing the fit of a regression model is equivalent to decreasing the angle between the vectors y and ŷ in R^n. A perfect correlation arises when this angle is zero, such that cos(0) = 1; whereas an absence of correlation (implied by, for instance, full statistical independence) is equivalent to a right angle, cos(π/2) = 0. Finally, a negative correlation is obtained when the angle between y and ŷ is straight, whereby cos(π) = −1.
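The identification of R^2 with cos^2(θ) can be checked directly, as in the sketch below (simulated data, my own example), by comparing 1 − RSS/SYY with the squared cosine of the angle between the centered y and ŷ.

```python
# Sketch: R^2 computed from sums of squares equals cos^2(theta) between centered vectors.
import numpy as np

rng = np.random.default_rng(7)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat, e_hat = H @ y, y - H @ y

y_c, y_hat_c = y - y.mean(), y_hat - y.mean()
r2 = 1.0 - np.sum(e_hat ** 2) / np.sum(y_c ** 2)                         # 1 - RSS/SYY
cos_theta = (y_c @ y_hat_c) / (np.linalg.norm(y_c) * np.linalg.norm(y_hat_c))
print(np.isclose(r2, cos_theta ** 2))                                    # True
```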

3.4 Collinear Predictors

Moreover, this geometrical perspective on multiple regression also sheds light on the use of the term collinear in statistics. Several points are said to be collinear if they lie on a single line. In the context of multiple regression, each column of X is a vector in R^n. If the correlation between two such column vectors, say X_{·,j} and X_{·,k}, is perfect, then they lie on the same line.
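As a final toy illustration (entirely my own example), the sketch below constructs a design in which one column is an exact multiple of another: the design is rank deficient, X^T X is (numerically) singular, and the hat matrix formula can no longer be applied as written.

```python
# Sketch: perfectly correlated (collinear) columns make X^T X singular.
import numpy as np

rng = np.random.default_rng(8)
n = 30
x1 = rng.normal(size=n)
x2 = 2.0 * x1                                  # lies on the same line as x1
X = np.column_stack([np.ones(n), x1, x2])

print(np.linalg.matrix_rank(X))                # 2, not p* = 3: rank deficient
print(np.linalg.cond(X.T @ X))                 # huge condition number
try:
    np.linalg.inv(X.T @ X)                     # may raise, or silently return nonsense
except np.linalg.LinAlgError as err:
    print("X^T X is singular:", err)
```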
