Transcript of 02 Estimation Ey

Page 1: 02 Estimation Ey

Estimation of the Response Mean

Copyright © 2012 Dan Nettleton (Iowa State University), Statistics 511

Page 2: 02 Estimation Ey

The Gauss-Markov Linear Model

$$y = X\beta + \varepsilon$$

$y$ is an $n \times 1$ random vector of responses.

$X$ is an $n \times p$ matrix of constants with columns corresponding to explanatory variables. $X$ is sometimes referred to as the design matrix.

$\beta$ is an unknown parameter vector in $\mathbb{R}^p$.

$\varepsilon$ is an $n \times 1$ random vector of errors.

$E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I$, where $\sigma^2$ is an unknown parameter in $\mathbb{R}^+$.
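To make the setup concrete, here is a minimal simulation of one draw from this model (a small sketch; the particular X, β, and σ² below are arbitrary choices, and normal errors are just one distribution satisfying the assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustration: n = 4 responses, p = 2 explanatory columns.
X = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])
beta = np.array([3., -1.])  # unknown in practice; fixed here to simulate
sigma2 = 0.25               # unknown positive error variance

# E(eps) = 0 and Var(eps) = sigma^2 * I.  The Gauss-Markov model does not
# require normality; normal errors are simply a convenient choice here.
eps = rng.normal(0.0, np.sqrt(sigma2), size=4)
y = X @ beta + eps          # one realization of y = X beta + eps
```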

Page 3: 02 Estimation Ey

The Column Space of the Design Matrix

$X\beta$ is a linear combination of the columns of $X$:

$$X\beta = [x_1, \ldots, x_p]\begin{bmatrix} \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} = \beta_1 x_1 + \cdots + \beta_p x_p.$$

The set of all possible linear combinations of the columns of $X$ is called the column space of $X$ and is denoted by

$$C(X) = \{Xa : a \in \mathbb{R}^p\}.$$

The Gauss-Markov linear model says $y$ is a random vector whose mean is in the column space of $X$ and whose variance is $\sigma^2 I$ for some positive real number $\sigma^2$, i.e.,

$$E(y) \in C(X) \quad \text{and} \quad \mathrm{Var}(y) = \sigma^2 I,\ \sigma^2 \in \mathbb{R}^+.$$

Page 4: 02 Estimation Ey

An Example Column Space

$$X = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \implies C(X) = \{Xa : a \in \mathbb{R}^1\} = \left\{\begin{bmatrix} 1 \\ 1 \end{bmatrix}[a_1] : a_1 \in \mathbb{R}\right\} = \left\{a_1\begin{bmatrix} 1 \\ 1 \end{bmatrix} : a_1 \in \mathbb{R}\right\} = \left\{\begin{bmatrix} a_1 \\ a_1 \end{bmatrix} : a_1 \in \mathbb{R}\right\}$$

Page 5: 02 Estimation Ey

Another Example Column Space

$$X = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \implies C(X) = \left\{\begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} : a \in \mathbb{R}^2\right\} = \left\{a_1\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + a_2\begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix} : a_1, a_2 \in \mathbb{R}\right\}$$

$$= \left\{\begin{bmatrix} a_1 \\ a_1 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ a_2 \\ a_2 \end{bmatrix} : a_1, a_2 \in \mathbb{R}\right\} = \left\{\begin{bmatrix} a_1 \\ a_1 \\ a_2 \\ a_2 \end{bmatrix} : a_1, a_2 \in \mathbb{R}\right\}$$
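Numerically, whether a given vector lies in $C(X)$ can be checked by finding the best least-squares combination of the columns and testing whether the residual vanishes. A sketch (the function name and tolerance are arbitrary):

```python
import numpy as np

X = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])

def in_col_space(X, v, tol=1e-10):
    """True if v is (numerically) some linear combination Xa of X's columns."""
    a, *_ = np.linalg.lstsq(X, v, rcond=None)   # a minimizing ||v - Xa||
    return bool(np.linalg.norm(v - X @ a) < tol)

print(in_col_space(X, np.array([5., 5., -2., -2.])))  # True: has form (a1,a1,a2,a2)
print(in_col_space(X, np.array([1., 2., 3., 4.])))    # False: it does not
```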

Page 6: 02 Estimation Ey

Different Matrices with the Same Column Space

$$W = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \qquad X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}$$

$$x \in C(W) \implies x = Wa \text{ for some } a \in \mathbb{R}^2 \implies x = X\begin{bmatrix} 0 \\ a \end{bmatrix} \text{ for some } a \in \mathbb{R}^2 \implies x = Xb \text{ for some } b \in \mathbb{R}^3 \implies x \in C(X)$$

Thus, $C(W) \subseteq C(X)$.

Page 7: 02 Estimation Ey

$$W = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \qquad X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}$$

$$x \in C(X) \implies x = Xa \text{ for some } a \in \mathbb{R}^3 \implies x = a_1\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} + a_2\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + a_3\begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix} \text{ for some } a \in \mathbb{R}^3$$

$$\implies x = \begin{bmatrix} a_1 + a_2 \\ a_1 + a_2 \\ a_1 + a_3 \\ a_1 + a_3 \end{bmatrix} \text{ for some } a_1, a_2, a_3 \in \mathbb{R} \implies x = W\begin{bmatrix} a_1 + a_2 \\ a_1 + a_3 \end{bmatrix} \text{ for some } a_1, a_2, a_3 \in \mathbb{R}$$

Page 8: 02 Estimation Ey

$$\implies x = Wb \text{ for some } b \in \mathbb{R}^2 \implies x \in C(W)$$

Thus, $C(X) \subseteq C(W)$. We previously showed that $C(W) \subseteq C(X)$. Thus, it follows that $C(W) = C(X)$.
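The two inclusions just proved can also be confirmed numerically: $C(A) \subseteq C(B)$ holds exactly when every column of $A$ is a linear combination of the columns of $B$. A sketch (the helper name is illustrative):

```python
import numpy as np

W = np.array([[1., 0.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])

def contained_in(A, B, tol=1e-10):
    """True if C(A) is a subset of C(B): each column of A is B @ (some vector)."""
    coef, *_ = np.linalg.lstsq(B, A, rcond=None)   # best coef with B @ coef ≈ A
    return bool(np.linalg.norm(A - B @ coef) < tol)

print(contained_in(W, X))  # True: C(W) ⊆ C(X)
print(contained_in(X, W))  # True: C(X) ⊆ C(W), so C(W) = C(X)
```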


Page 9: 02 Estimation Ey

Estimation of E(y)

A fundamental goal of linear model analysis is to estimate E(y).

We could, of course, use y to estimate E(y).

$y$ is obviously an unbiased estimator of $E(y)$, but it is often not a very sensible estimator.

For example, suppose $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\mu + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}$, and we observe $y = \begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix}$.

Should we estimate $E(y) = \begin{bmatrix} \mu \\ \mu \end{bmatrix}$ by $y = \begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix}$?

Page 10: 02 Estimation Ey

Estimation of E(y)

The Gauss-Markov linear model says that $E(y) \in C(X)$, so we should use that information when estimating $E(y)$.

Consider estimating $E(y)$ by the point in $C(X)$ that is closest to $y$ (as measured by the usual Euclidean distance).

This unique point is called the orthogonal projection of $y$ onto $C(X)$ and denoted by $\hat{y}$ (although it could be argued that $\widehat{E(y)}$ might be better notation).

By definition, $\|y - \hat{y}\| = \min_{z \in C(X)} \|y - z\|$, where $\|a\| \equiv \sqrt{\sum_{i=1}^{n} a_i^2}$.
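For the toy example on page 9 ($X = (1, 1)'$ and $y = (6.1, 2.3)'$), this definition can be checked by brute force: every point of $C(X)$ has the form $a_1(1,1)'$, so scanning a grid of $a_1$ values locates the closest one. A sketch (the grid bounds and resolution are arbitrary):

```python
import numpy as np

y = np.array([6.1, 2.3])
x = np.array([1., 1.])

a_grid = np.linspace(0., 8., 8001)                    # candidate coefficients
dists = [np.linalg.norm(y - a * x) for a in a_grid]   # ||y - z|| for z = a*x
a_best = a_grid[np.argmin(dists)]
print(a_best, a_best * x)   # 4.2 and (4.2, 4.2): the orthogonal projection
```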

Page 11: 02 Estimation Ey

Orthogonal Projection Matrices

In Homework Assignment 2, we will formally prove the following:

1. $\forall\ y \in \mathbb{R}^n$, $\hat{y} = P_X y$, where $P_X$ is a unique $n \times n$ matrix known as an orthogonal projection matrix.
2. $P_X$ is idempotent: $P_X P_X = P_X$.
3. $P_X$ is symmetric: $P_X = P_X'$.
4. $P_X X = X$ and $X' P_X = X'$.
5. $P_X = X(X'X)^- X'$, where $(X'X)^-$ is any generalized inverse of $X'X$.
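These properties are easy to check numerically. A sketch using numpy's `pinv` (the Moore-Penrose pseudoinverse, which is one particular generalized inverse of $X'X$):

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])            # X'X is singular here

P = X @ np.linalg.pinv(X.T @ X) @ X.T   # property 5: P_X = X (X'X)^- X'

print(np.allclose(P @ P, P))    # property 2: idempotent
print(np.allclose(P, P.T))      # property 3: symmetric
print(np.allclose(P @ X, X))    # property 4: P_X X = X (and X'P_X = X')
```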

Page 12: 02 Estimation Ey

Why Does $P_X X = X$?

Each column $x_j$ of $X$ lies in $C(X)$, so its orthogonal projection onto $C(X)$ is $x_j$ itself. Thus

$$P_X X = P_X[x_1, \ldots, x_p] = [P_X x_1, \ldots, P_X x_p] = [x_1, \ldots, x_p] = X.$$

Page 13: 02 Estimation Ey

Generalized Inverses

$G$ is a generalized inverse of a matrix $A$ if $AGA = A$.

We usually denote a generalized inverse of $A$ by $A^-$.

If $A$ is nonsingular, i.e., if $A^{-1}$ exists, then $A^{-1}$ is the one and only generalized inverse of $A$:

$$AA^{-1}A = AI = IA = A.$$

If $A$ is singular, i.e., if $A^{-1}$ does not exist, then there are infinitely many generalized inverses of $A$.

Page 14: 02 Estimation Ey

An Algorithm for Finding a Generalized Inverse of a Matrix $A$

1. Find any $r \times r$ nonsingular submatrix of $A$, where $r = \mathrm{rank}(A)$. Call this matrix $W$.
2. Invert and transpose $W$, i.e., compute $(W^{-1})'$.
3. Replace each element of $W$ in $A$ with the corresponding element of $(W^{-1})'$.
4. Replace all other elements of $A$ with zeros.
5. Transpose the resulting matrix to obtain $G$, a generalized inverse of $A$.
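Below is a sketch of this algorithm in code. To keep it short, the row and column indices of a nonsingular $r \times r$ submatrix $W$ are passed in rather than searched for; the function name and interface are illustrative only:

```python
import numpy as np

def g_inverse(A, rows, cols):
    """Generalized inverse of A by the five-step algorithm above.
    rows, cols index an r x r nonsingular submatrix W of A, r = rank(A)."""
    A = np.asarray(A, dtype=float)
    W = A[np.ix_(rows, cols)]           # step 1: the nonsingular submatrix
    Winv_t = np.linalg.inv(W).T         # step 2: (W^{-1})'
    G = np.zeros_like(A)                # step 4: zeros everywhere else
    G[np.ix_(rows, cols)] = Winv_t      # step 3: overwrite W's positions
    return G.T                          # step 5: transpose the result

A = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])                 # rank(A) = 2
G = g_inverse(A, rows=[0, 2], cols=[1, 2])   # this 2x2 submatrix is I_2
print(np.allclose(A @ G @ A, A))             # True: AGA = A
```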

Page 15: 02 Estimation Ey

Invariance of $P_X = X(X'X)^- X'$ to the Choice of $(X'X)^-$

If $X'X$ is nonsingular, then $P_X = X(X'X)^{-1}X'$ because the only generalized inverse of $X'X$ is $(X'X)^{-1}$.

If $X'X$ is singular, then $P_X = X(X'X)^- X'$, and the choice of the generalized inverse $(X'X)^-$ does not matter because $X(X'X)^- X'$ turns out to be the same matrix no matter which generalized inverse of $X'X$ is used.

To see this, suppose $(X'X)^-_1$ and $(X'X)^-_2$ are any two generalized inverses of $X'X$. Then

$$X(X'X)^-_1 X' = X(X'X)^-_2 X'X(X'X)^-_1 X' = X(X'X)^-_2 X',$$

where the first equality uses $P_X X = X$ and the second uses $X'P_X = X'$ (property 4 on page 11, which holds for either choice of generalized inverse).
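A numerical check of this invariance, using two different generalized inverses of $X'X$: the Moore-Penrose pseudoinverse and one built by the algorithm on the previous page:

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
XtX = X.T @ X                    # singular, so g-inverses are not unique

G1 = np.linalg.pinv(XtX)         # one generalized inverse
G2 = np.array([[0., 0., 0.],     # another, via page 14's algorithm using
               [0., .5, 0.],     # the nonsingular submatrix in the last
               [0., 0., .5]])    # two rows and columns of X'X

print(np.allclose(XtX @ G2 @ XtX, XtX))         # True: G2 is a g-inverse
print(np.allclose(G1, G2))                      # False: the g-inverses differ
print(np.allclose(X @ G1 @ X.T, X @ G2 @ X.T))  # True: same P_X either way
```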

Page 16: 02 Estimation Ey

An Example Orthogonal Projection Matrix

Suppose $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\mu + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}$, and we observe $y = \begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix}$. Then

$$X(X'X)^- X' = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}'\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)^-\begin{bmatrix} 1 \\ 1 \end{bmatrix}' = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\left([1\ \ 1]\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)^-[1\ \ 1]$$

$$= \begin{bmatrix} 1 \\ 1 \end{bmatrix}[2]^{-1}[1\ \ 1] = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\left[\tfrac{1}{2}\right][1\ \ 1] = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{bmatrix}.$$

Page 17: 02 Estimation Ey

An Example Orthogonal Projection

Thus, the orthogonal projection of $y = \begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix}$ onto the column space of $X = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ is

$$P_X y = \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{bmatrix}\begin{bmatrix} 6.1 \\ 2.3 \end{bmatrix} = \begin{bmatrix} 4.2 \\ 4.2 \end{bmatrix}.$$
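The same computation takes a few lines in code (here $X'X = [2]$ is nonsingular, so its ordinary inverse serves as the generalized inverse):

```python
import numpy as np

X = np.array([[1.],
              [1.]])
y = np.array([6.1, 2.3])

P = X @ np.linalg.inv(X.T @ X) @ X.T  # [[0.5, 0.5], [0.5, 0.5]]
print(P @ y)                          # [4.2, 4.2]: the projection onto C(X)
```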

Pages 18–22: 02 Estimation Ey

Why is $P_X$ called an orthogonal projection matrix?

Suppose $X = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ and we observe a point $y \in \mathbb{R}^2$ that does not lie in $C(X)$.

[Figure, built up over five slides: the line $C(X)$ through the origin in the direction of $X$; the observed point $y$; its orthogonal projection $\hat{y}$ onto $C(X)$; and the residual vector $y - \hat{y}$, which meets $C(X)$ at a right angle.]

Page 23: 02 Estimation Ey

Why is $P_X$ called an orthogonal projection matrix?

The angle between $\hat{y}$ and $y - \hat{y}$ is $90°$; the vectors $\hat{y}$ and $y - \hat{y}$ are orthogonal:

$$\hat{y}'(y - \hat{y}) = \hat{y}'(y - P_X y) = \hat{y}'(I - P_X)y = (P_X y)'(I - P_X)y = y'P_X'(I - P_X)y$$

$$= y'P_X(I - P_X)y = y'(P_X - P_X P_X)y = y'(P_X - P_X)y = 0.$$
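The zero inner product can be confirmed numerically for any $X$ and $y$; a sketch (the response vector below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = rng.normal(size=4)                   # an arbitrary response vector

P = X @ np.linalg.pinv(X.T @ X) @ X.T    # orthogonal projection matrix
y_hat = P @ y
print(np.isclose(y_hat @ (y - y_hat), 0.0))  # True: y_hat is orthogonal to y - y_hat
```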

Page 24: 02 Estimation Ey

Optimality of $\hat{y}$ as an Estimator of E(y)

$\hat{y}$ is an unbiased estimator of $E(y)$:

$$E(\hat{y}) = E(P_X y) = P_X E(y) = P_X X\beta = X\beta = E(y).$$

It can be shown that $\hat{y} = P_X y$ is the best estimator of $E(y)$ in the class of linear unbiased estimators, i.e., estimators of the form $My$ for $M$ satisfying

$$E(My) = E(y)\ \forall\ \beta \in \mathbb{R}^p \iff MX\beta = X\beta\ \forall\ \beta \in \mathbb{R}^p \iff MX = X.$$

Under the Gauss-Markov linear model, $\hat{y} = P_X y$ is best among all unbiased estimators of $E(y)$.

Page 25: 02 Estimation Ey

Ordinary Least Squares (OLS) Estimation of E(y)

OLS: Find a vector $b^* \in \mathbb{R}^p$ such that

$$Q(b^*) \le Q(b)\ \forall\ b \in \mathbb{R}^p, \quad \text{where } Q(b) \equiv \sum_{i=1}^{n}\left(y_i - x_{(i)}'b\right)^2$$

and $x_{(i)}'$ denotes the $i$th row of $X$. Note that

$$Q(b) = \sum_{i=1}^{n}\left(y_i - x_{(i)}'b\right)^2 = (y - Xb)'(y - Xb) = \|y - Xb\|^2.$$

To minimize this sum of squares, we need to choose $b^* \in \mathbb{R}^p$ such that $Xb^*$ is the point in $C(X)$ that is closest to $y$.

In other words, we need to choose $b^*$ such that $Xb^* = P_X y = X(X'X)^- X'y$.

Clearly, choosing $b^* = (X'X)^- X'y$ will work.
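A quick numerical illustration that $b^* = (X'X)^- X'y$ minimizes $Q$: compare $Q(b^*)$ against $Q$ at randomly perturbed coefficient vectors (the perturbation scheme is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = rng.normal(size=4)

def Q(b):
    """The OLS criterion ||y - Xb||^2."""
    return float(np.sum((y - X @ b) ** 2))

b_star = np.linalg.pinv(X.T @ X) @ X.T @ y
trials = (Q(b_star) <= Q(b_star + rng.normal(scale=0.1, size=3))
          for _ in range(1000))
print(all(trials))   # True: no perturbation attains a smaller Q
```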

Page 26: 02 Estimation Ey

Ordinary Least Squares and the Normal Equations

It can be shown that $Q(b^*) \le Q(b)\ \forall\ b \in \mathbb{R}^p$ if and only if $b^*$ is a solution to the normal equations:

$$X'Xb = X'y.$$

If $X'X$ is nonsingular, multiplying both sides of the normal equations by $(X'X)^{-1}$ shows that the only solution to the normal equations is $b^* = (X'X)^{-1}X'y$.

If $X'X$ is singular, there are infinitely many solutions, which include $(X'X)^- X'y$ for every choice of generalized inverse of $X'X$:

$$X'X[(X'X)^- X'y] = X'[X(X'X)^- X']y = X'P_X y = X'y.$$

Henceforth, we will use $\hat{\beta}$ to denote any solution to the normal equations.
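A sketch of the singular case: two different generalized inverses of $X'X$ (reusing those from page 15) give two different solutions to the normal equations, yet the same fitted vector $X\hat{\beta}$:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = rng.normal(size=4)
XtX, Xty = X.T @ X, X.T @ y

G1 = np.linalg.pinv(XtX)           # one generalized inverse
G2 = np.array([[0., 0., 0.],
               [0., .5, 0.],       # another (see page 15)
               [0., 0., .5]])
b1, b2 = G1 @ Xty, G2 @ Xty

print(np.allclose(XtX @ b1, Xty), np.allclose(XtX @ b2, Xty))  # both solve X'Xb = X'y
print(np.allclose(b1, b2))          # False: the solutions differ
print(np.allclose(X @ b1, X @ b2))  # True: X beta_hat is the same
```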

Page 27: 02 Estimation Ey

Ordinary Least Squares Estimator of $E(y) = X\beta$

We call

$$X\hat{\beta} = P_X X\hat{\beta} = X(X'X)^- X'X\hat{\beta} = X(X'X)^- X'y = P_X y = \hat{y}$$

the OLS estimator of $E(y) = X\beta$.

It might be more appropriate to use $\widehat{X\beta}$ rather than $X\hat{\beta}$ to denote our estimator because we are estimating $X\beta$ rather than pre-multiplying an estimator of $\beta$ by $X$.

As we shall soon see, it does not make sense to estimate $\beta$ when $X'X$ is singular.

However, it does make sense to estimate $E(y) = X\beta$ whether $X'X$ is singular or nonsingular.