Transcript of 02 Estimation Ey

Page 1: 02 Estimation Ey

Estimation of the Response Mean

Copyright © 2012 Dan Nettleton (Iowa State University), Statistics 511

Page 2: 02 Estimation Ey

The Gauss-Markov Linear Model

$$y = X\beta + \varepsilon$$

$y$ is an $n \times 1$ random vector of responses.

$X$ is an $n \times p$ matrix of constants with columns corresponding to explanatory variables. $X$ is sometimes referred to as the design matrix.

$\beta$ is an unknown parameter vector in $\mathbb{R}^p$.

$\varepsilon$ is an $n \times 1$ random vector of errors.

$E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I$, where $\sigma^2$ is an unknown parameter in $\mathbb{R}^+$.
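To make the setup concrete, here is a minimal simulation of one draw from this model (a small sketch; the particular X, β, and σ² below are arbitrary choices, and normal errors are just one distribution satisfying the assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustration: n = 4 responses, p = 2 explanatory columns.
X = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])
beta = np.array([3., -1.])  # unknown in practice; fixed here to simulate
sigma2 = 0.25               # unknown positive error variance

# E(eps) = 0 and Var(eps) = sigma^2 * I.  The Gauss-Markov model does not
# require normality; normal errors are simply a convenient choice here.
eps = rng.normal(0.0, np.sqrt(sigma2), size=4)
y = X @ beta + eps          # one realization of y = X beta + eps
```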

Page 3: 02 Estimation Ey

The Column Space of the Design Matrix

$X\beta$ is a linear combination of the columns of $X$:

$$X\beta = [x_1, \ldots, x_p]\begin{bmatrix} \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} = \beta_1 x_1 + \cdots + \beta_p x_p.$$

The set of all possible linear combinations of the columns of $X$ is called the column space of $X$ and is denoted by

$$C(X) = \{Xa : a \in \mathbb{R}^p\}.$$

The Gauss-Markov linear model says $y$ is a random vector whose mean is in the column space of $X$ and whose variance is $\sigma^2 I$ for some positive real number $\sigma^2$, i.e.,

$$E(y) \in C(X) \quad \text{and} \quad \mathrm{Var}(y) = \sigma^2 I,\ \sigma^2 \in \mathbb{R}^+.$$

Page 4: 02 Estimation Ey

An Example Column Space

$$X = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \implies C(X) = \{Xa : a \in \mathbb{R}^1\} = \left\{\begin{bmatrix} 1 \\ 1 \end{bmatrix}[a_1] : a_1 \in \mathbb{R}\right\} = \left\{a_1\begin{bmatrix} 1 \\ 1 \end{bmatrix} : a_1 \in \mathbb{R}\right\} = \left\{\begin{bmatrix} a_1 \\ a_1 \end{bmatrix} : a_1 \in \mathbb{R}\right\}$$

Page 5: 02 Estimation Ey

Another Example Column Space

$$X = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \implies C(X) = \left\{\begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} : a \in \mathbb{R}^2\right\} = \left\{a_1\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + a_2\begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix} : a_1, a_2 \in \mathbb{R}\right\}$$

$$= \left\{\begin{bmatrix} a_1 \\ a_1 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ a_2 \\ a_2 \end{bmatrix} : a_1, a_2 \in \mathbb{R}\right\} = \left\{\begin{bmatrix} a_1 \\ a_1 \\ a_2 \\ a_2 \end{bmatrix} : a_1, a_2 \in \mathbb{R}\right\}$$
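Numerically, whether a given vector lies in $C(X)$ can be checked by finding the best least-squares combination of the columns and testing whether the residual vanishes. A sketch (the function name and tolerance are arbitrary):

```python
import numpy as np

X = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])

def in_col_space(X, v, tol=1e-10):
    """True if v is (numerically) some linear combination Xa of X's columns."""
    a, *_ = np.linalg.lstsq(X, v, rcond=None)   # a minimizing ||v - Xa||
    return bool(np.linalg.norm(v - X @ a) < tol)

print(in_col_space(X, np.array([5., 5., -2., -2.])))  # True: has form (a1,a1,a2,a2)
print(in_col_space(X, np.array([1., 2., 3., 4.])))    # False: it does not
```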

Page 6: 02 Estimation Ey

Different Matrices with the Same Column Space

$$W = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \qquad X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}$$

$$x \in C(W) \implies x = Wa \text{ for some } a \in \mathbb{R}^2 \implies x = X\begin{bmatrix} 0 \\ a \end{bmatrix} \text{ for some } a \in \mathbb{R}^2 \implies x = Xb \text{ for some } b \in \mathbb{R}^3 \implies x \in C(X)$$

Thus, $C(W) \subseteq C(X)$.

Page 7: 02 Estimation Ey

$$W = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \qquad X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}$$

$$x \in C(X) \implies x = Xa \text{ for some } a \in \mathbb{R}^3 \implies x = a_1\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} + a_2\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + a_3\begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix} \text{ for some } a \in \mathbb{R}^3$$

$$\implies x = \begin{bmatrix} a_1 + a_2 \\ a_1 + a_2 \\ a_1 + a_3 \\ a_1 + a_3 \end{bmatrix} \text{ for some } a_1, a_2, a_3 \in \mathbb{R} \implies x = W\begin{bmatrix} a_1 + a_2 \\ a_1 + a_3 \end{bmatrix} \text{ for some } a_1, a_2, a_3 \in \mathbb{R}$$

Page 8: 02 Estimation Ey

$$\implies x = Wb \text{ for some } b \in \mathbb{R}^2 \implies x \in C(W)$$

Thus, $C(X) \subseteq C(W)$. We previously showed that $C(W) \subseteq C(X)$. Thus, it follows that $C(W) = C(X)$.
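The two inclusions just proved can also be confirmed numerically: $C(A) \subseteq C(B)$ holds exactly when every column of $A$ is a linear combination of the columns of $B$. A sketch (the helper name is illustrative):

```python
import numpy as np

W = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])

def contained_in(A, B, tol=1e-10):
    """True if C(A) is a subset of C(B): each column of A is B @ (some vector)."""
    coef, *_ = np.linalg.lstsq(B, A, rcond=None)   # best coef with B @ coef ≈ A
    return bool(np.linalg.norm(A - B @ coef) < tol)

print(contained_in(W, X))  # True: C(W) ⊆ C(X)
print(contained_in(X, W))  # True: C(X) ⊆ C(W), so C(W) = C(X)
```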


Page 9: 02 Estimation Ey

Estimation of E(y)

A fundamental goal of linear model analysis is to estimate E(y).

We could, of course, use y to estimate E(y).

$y$ is obviously an unbiased estimator of $E(y)$, but it is often not a very sensible estimator.

For example, suppose $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\mu + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}$, and we observe $y = \begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix}$.

Should we estimate $E(y) = \begin{bmatrix} \mu \\ \mu \end{bmatrix}$ by $y = \begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix}$?

Page 10: 02 Estimation Ey

Estimation of E(y)

The Gauss-Markov linear model says that $E(y) \in C(X)$, so we should use that information when estimating $E(y)$.

Consider estimating $E(y)$ by the point in $C(X)$ that is closest to $y$ (as measured by the usual Euclidean distance).

This unique point is called the orthogonal projection of $y$ onto $C(X)$ and denoted by $\hat{y}$ (although it could be argued that $\widehat{E(y)}$ might be better notation).

By definition, $\|y - \hat{y}\| = \min_{z \in C(X)} \|y - z\|$, where $\|a\| \equiv \sqrt{\sum_{i=1}^{n} a_i^2}$.
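For the toy example on page 9 ($X = (1, 1)'$ and $y = (6.1, 2.3)'$), this definition can be checked by brute force: every point of $C(X)$ has the form $a_1(1,1)'$, so scanning a grid of $a_1$ values locates the closest one. A sketch (the grid bounds and resolution are arbitrary):

```python
import numpy as np

y = np.array([6.1, 2.3])
x = np.array([1., 1.])

a_grid = np.linspace(0., 8., 8001)                    # candidate coefficients
dists = [np.linalg.norm(y - a * x) for a in a_grid]   # ||y - z|| for z = a*x
a_best = a_grid[np.argmin(dists)]
print(a_best, a_best * x)   # 4.2 and (4.2, 4.2): the orthogonal projection
```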

Page 11: 02 Estimation Ey

Orthogonal Projection Matrices

In Homework Assignment 2, we will formally prove the following:

1. $\forall\ y \in \mathbb{R}^n$, $\hat{y} = P_X y$, where $P_X$ is a unique $n \times n$ matrix known as an orthogonal projection matrix.
2. $P_X$ is idempotent: $P_X P_X = P_X$.
3. $P_X$ is symmetric: $P_X = P_X'$.
4. $P_X X = X$ and $X' P_X = X'$.
5. $P_X = X(X'X)^- X'$, where $(X'X)^-$ is any generalized inverse of $X'X$.
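These properties are easy to check numerically. A sketch using numpy's `pinv` (the Moore-Penrose pseudoinverse, which is one particular generalized inverse of $X'X$):

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])            # X'X is singular here

P = X @ np.linalg.pinv(X.T @ X) @ X.T   # property 5: P_X = X (X'X)^- X'

print(np.allclose(P @ P, P))    # property 2: idempotent
print(np.allclose(P, P.T))      # property 3: symmetric
print(np.allclose(P @ X, X))    # property 4: P_X X = X (and X'P_X = X')
```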

Page 12: 02 Estimation Ey

Why Does $P_X X = X$?

Each column $x_j$ of $X$ lies in $C(X)$, so its orthogonal projection onto $C(X)$ is $x_j$ itself. Thus

$$P_X X = P_X[x_1, \ldots, x_p] = [P_X x_1, \ldots, P_X x_p] = [x_1, \ldots, x_p] = X.$$

Page 13: 02 Estimation Ey

Generalized Inverses

$G$ is a generalized inverse of a matrix $A$ if $AGA = A$.

We usually denote a generalized inverse of $A$ by $A^-$.

If $A$ is nonsingular, i.e., if $A^{-1}$ exists, then $A^{-1}$ is the one and only generalized inverse of $A$:

$$AA^{-1}A = AI = IA = A.$$

If $A$ is singular, i.e., if $A^{-1}$ does not exist, then there are infinitely many generalized inverses of $A$.

Page 14: 02 Estimation Ey

An Algorithm for Finding a Generalized Inverse of a Matrix $A$

1. Find any $r \times r$ nonsingular submatrix of $A$, where $r = \mathrm{rank}(A)$. Call this matrix $W$.
2. Invert and transpose $W$, i.e., compute $(W^{-1})'$.
3. Replace each element of $W$ in $A$ with the corresponding element of $(W^{-1})'$.
4. Replace all other elements of $A$ with zeros.
5. Transpose the resulting matrix to obtain $G$, a generalized inverse of $A$.
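Below is a sketch of this algorithm in code. To keep it short, the row and column indices of a nonsingular $r \times r$ submatrix $W$ are passed in rather than searched for; the function name and interface are illustrative only:

```python
import numpy as np

def g_inverse(A, rows, cols):
    """Generalized inverse of A by the five-step algorithm above.
    rows, cols index an r x r nonsingular submatrix W of A, r = rank(A)."""
    A = np.asarray(A, dtype=float)
    W = A[np.ix_(rows, cols)]           # step 1: the nonsingular submatrix
    Winv_t = np.linalg.inv(W).T         # step 2: (W^{-1})'
    G = np.zeros_like(A)                # step 4: zeros everywhere else
    G[np.ix_(rows, cols)] = Winv_t      # step 3: overwrite W's positions
    return G.T                          # step 5: transpose the result

A = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])                 # rank(A) = 2
G = g_inverse(A, rows=[0, 2], cols=[1, 2])   # this 2x2 submatrix is I_2
print(np.allclose(A @ G @ A, A))             # True: AGA = A
```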

Page 15: 02 Estimation Ey

Invariance of $P_X = X(X'X)^- X'$ to the Choice of $(X'X)^-$

If $X'X$ is nonsingular, then $P_X = X(X'X)^{-1}X'$ because the only generalized inverse of $X'X$ is $(X'X)^{-1}$.

If $X'X$ is singular, then $P_X = X(X'X)^- X'$, and the choice of the generalized inverse $(X'X)^-$ does not matter because $X(X'X)^- X'$ turns out to be the same matrix no matter which generalized inverse of $X'X$ is used.

To see this, suppose $(X'X)^-_1$ and $(X'X)^-_2$ are any two generalized inverses of $X'X$. Then

$$X(X'X)^-_1 X' = X(X'X)^-_2 X'X(X'X)^-_1 X' = X(X'X)^-_2 X',$$

where the first equality uses $P_X X = X$ and the second uses $X'P_X = X'$ (property 4 on page 11, which holds for either choice of generalized inverse).
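A numerical check of this invariance, using two different generalized inverses of $X'X$: the Moore-Penrose pseudoinverse and one built by the algorithm on the previous page:

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
XtX = X.T @ X                    # singular, so g-inverses are not unique

G1 = np.linalg.pinv(XtX)         # one generalized inverse
G2 = np.array([[0., 0., 0.],     # another, via page 14's algorithm using
               [0., .5, 0.],     # the nonsingular submatrix in the last
               [0., 0., .5]])    # two rows and columns of X'X

print(np.allclose(XtX @ G2 @ XtX, XtX))         # True: G2 is a g-inverse
print(np.allclose(G1, G2))                      # False: the g-inverses differ
print(np.allclose(X @ G1 @ X.T, X @ G2 @ X.T))  # True: same P_X either way
```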

Page 16: 02 Estimation Ey

An Example Orthogonal Projection Matrix

Suppose $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\mu + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}$, and we observe $y = \begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix}$. Then

$$X(X'X)^- X' = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}'\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)^-\begin{bmatrix} 1 \\ 1 \end{bmatrix}' = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\left([1\ \ 1]\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)^-[1\ \ 1]$$

$$= \begin{bmatrix} 1 \\ 1 \end{bmatrix}[2]^{-1}[1\ \ 1] = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\left[\tfrac{1}{2}\right][1\ \ 1] = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{bmatrix}.$$

Page 17: 02 Estimation Ey

An Example Orthogonal Projection

Thus, the orthogonal projection of $y = \begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix}$ onto the column space of $X = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ is

$$P_X y = \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{bmatrix}\begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix} = \begin{bmatrix} 4.2 \\ 4.2 \end{bmatrix}.$$
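The same computation takes a few lines in code (here $X'X = [2]$ is nonsingular, so its ordinary inverse serves as the generalized inverse):

```python
import numpy as np

X = np.array([[1.],
              [1.]])
y = np.array([6.1, 2.3])

P = X @ np.linalg.inv(X.T @ X) @ X.T  # [[0.5, 0.5], [0.5, 0.5]]
print(P @ y)                          # [4.2, 4.2]: the projection onto C(X)
```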

Pages 18–22: 02 Estimation Ey

Why is $P_X$ called an orthogonal projection matrix?

Suppose $X = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ and we observe a point $y \in \mathbb{R}^2$ that does not lie in $C(X)$.

[Figure, built up over five slides: the line $C(X)$ through the origin in the direction of $X$; the observed point $y$; its orthogonal projection $\hat{y}$ onto $C(X)$; and the residual vector $y - \hat{y}$, which meets $C(X)$ at a right angle.]

Page 23: 02 Estimation Ey

Why is $P_X$ called an orthogonal projection matrix?

The angle between $\hat{y}$ and $y - \hat{y}$ is $90°$; the vectors $\hat{y}$ and $y - \hat{y}$ are orthogonal:

$$\hat{y}'(y - \hat{y}) = \hat{y}'(y - P_X y) = \hat{y}'(I - P_X)y = (P_X y)'(I - P_X)y = y'P_X'(I - P_X)y$$

$$= y'P_X(I - P_X)y = y'(P_X - P_X P_X)y = y'(P_X - P_X)y = 0.$$
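The zero inner product can be confirmed numerically for any $X$ and $y$; a sketch (the response vector below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = rng.normal(size=4)                   # an arbitrary response vector

P = X @ np.linalg.pinv(X.T @ X) @ X.T    # orthogonal projection matrix
y_hat = P @ y
print(np.isclose(y_hat @ (y - y_hat), 0.0))  # True: y_hat is orthogonal to y - y_hat
```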

Page 24: 02 Estimation Ey

Optimality of $\hat{y}$ as an Estimator of E(y)

$\hat{y}$ is an unbiased estimator of $E(y)$:

$$E(\hat{y}) = E(P_X y) = P_X E(y) = P_X X\beta = X\beta = E(y).$$

It can be shown that $\hat{y} = P_X y$ is the best estimator of $E(y)$ in the class of linear unbiased estimators, i.e., estimators of the form $My$ for $M$ satisfying

$$E(My) = E(y)\ \forall\ \beta \in \mathbb{R}^p \iff MX\beta = X\beta\ \forall\ \beta \in \mathbb{R}^p \iff MX = X.$$

Under the Gauss-Markov linear model, $\hat{y} = P_X y$ is best among all unbiased estimators of $E(y)$.

Page 25: 02 Estimation Ey

Ordinary Least Squares (OLS) Estimation of E(y)

OLS: Find a vector $b^* \in \mathbb{R}^p$ such that

$$Q(b^*) \le Q(b)\ \forall\ b \in \mathbb{R}^p, \quad \text{where } Q(b) \equiv \sum_{i=1}^{n}\left(y_i - x_{(i)}'b\right)^2$$

and $x_{(i)}'$ denotes the $i$th row of $X$. Note that

$$Q(b) = \sum_{i=1}^{n}\left(y_i - x_{(i)}'b\right)^2 = (y - Xb)'(y - Xb) = \|y - Xb\|^2.$$

To minimize this sum of squares, we need to choose $b^* \in \mathbb{R}^p$ such that $Xb^*$ is the point in $C(X)$ that is closest to $y$.

In other words, we need to choose $b^*$ such that $Xb^* = P_X y = X(X'X)^- X'y$.

Clearly, choosing $b^* = (X'X)^- X'y$ will work.
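A quick numerical illustration that $b^* = (X'X)^- X'y$ minimizes $Q$: compare $Q(b^*)$ against $Q$ at randomly perturbed coefficient vectors (the perturbation scheme is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = rng.normal(size=4)

def Q(b):
    """The OLS criterion ||y - Xb||^2."""
    return float(np.sum((y - X @ b) ** 2))

b_star = np.linalg.pinv(X.T @ X) @ X.T @ y
trials = (Q(b_star) <= Q(b_star + rng.normal(scale=0.1, size=3))
          for _ in range(1000))
print(all(trials))   # True: no perturbation attains a smaller Q
```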

Page 26: 02 Estimation Ey

Ordinary Least Squares and the Normal Equations

It can be shown that $Q(b^*) \le Q(b)\ \forall\ b \in \mathbb{R}^p$ if and only if $b^*$ is a solution to the normal equations:

$$X'Xb = X'y.$$

If $X'X$ is nonsingular, multiplying both sides of the normal equations by $(X'X)^{-1}$ shows that the only solution to the normal equations is $b^* = (X'X)^{-1}X'y$.

If $X'X$ is singular, there are infinitely many solutions, which include $(X'X)^- X'y$ for every choice of generalized inverse of $X'X$:

$$X'X[(X'X)^- X'y] = X'[X(X'X)^- X']y = X'P_X y = X'y.$$

Henceforth, we will use $\hat{\beta}$ to denote any solution to the normal equations.
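A sketch of the singular case: two different generalized inverses of $X'X$ (reusing those from page 15) give two different solutions to the normal equations, yet the same fitted vector $X\hat{\beta}$:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = rng.normal(size=4)
XtX, Xty = X.T @ X, X.T @ y

G1 = np.linalg.pinv(XtX)           # one generalized inverse
G2 = np.array([[0., 0., 0.],
               [0., .5, 0.],       # another (see page 15)
               [0., 0., .5]])
b1, b2 = G1 @ Xty, G2 @ Xty

print(np.allclose(XtX @ b1, Xty), np.allclose(XtX @ b2, Xty))  # both solve X'Xb = X'y
print(np.allclose(b1, b2))          # False: the solutions differ
print(np.allclose(X @ b1, X @ b2))  # True: X beta_hat is the same
```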

Page 27: 02 Estimation Ey

Ordinary Least Squares Estimator of $E(y) = X\beta$

We call

$$X\hat{\beta} = P_X X\hat{\beta} = X(X'X)^- X'X\hat{\beta} = X(X'X)^- X'y = P_X y = \hat{y}$$

the OLS estimator of $E(y) = X\beta$.

It might be more appropriate to use $\widehat{X\beta}$ rather than $X\hat{\beta}$ to denote our estimator because we are estimating $X\beta$ rather than pre-multiplying an estimator of $\beta$ by $X$.

As we shall soon see, it does not make sense to estimate $\beta$ when $X'X$ is singular.

However, it does make sense to estimate $E(y) = X\beta$ whether $X'X$ is singular or nonsingular.