
Chapter 1: Multiple Linear Regression

The general linear model

Y_i = β_0 + ∑_{j=1}^{p−1} β_j x_{i,j} + ε_i

With:

• Yi the ith observation of the response variable

– dependent variable

– explained, observed variable

• x_{i,j} the (observed) ith value of the jth covariate, also called

– independent variable

– regression variable

– explanatory variable

– predictor, predicting variable

– control variable


• β_j the regression coefficient corresponding to the jth covariate; together these coefficients form the parameter vector β to be estimated

• β0 the intercept (other coefficients sometimes “slopes”)

• εi the error (observational noise) in the ith observation

In matrix form: Y = Xβ + ε, where X ∈ R^{n×p} and the first column of X is all ones (X_{i,0} = 1), corresponding to the intercept. The other columns of X are observed covariate values.
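To make the matrix form concrete, here is a minimal Python/NumPy sketch (not part of the original slides) that builds a design matrix X with an all-ones first column and simulates data from Y = Xβ + ε; all numbers (n, p, β, the noise level) are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                                  # n observations, p columns including the intercept
covariates = rng.normal(size=(n, p - 1))      # observed covariate values
X = np.column_stack([np.ones(n), covariates]) # first column all ones (intercept)
beta = np.array([1.0, 2.0, -0.5])             # arbitrary "true" coefficients
eps = rng.normal(scale=0.3, size=n)           # observational noise
Y = X @ beta + eps                            # Y = X beta + eps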


Examples, specific cases

1. Polynomial regression: Y_i = β_0 + ∑_{j=1}^{p−1} β_j x_i^j + ε_i, so x_{i,j} = x_i^j.

There is only one explanatory variable in the strict sense (x with observations x_i), but the response depends on several powers of x.

2. Models with interaction

Suppose: 2 explanatory variables, x_1 and x_2, but Y also depends on the interaction between them: Y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_{1,2} x_{i,1} x_{i,2}

We then set x3 = x1x2

3. One-way analysis of variance (ANOVA): Y_{ij} = µ_i + ε_{ij} with i = 1, . . . , k and j = 1, . . . , n_i


Note that in this model the observations have double indices

All observations at the same level µ_i differ only by the observational noise, so in a regression model they should share common values of the explanatory variables.

The model becomes Y_{ij} = ∑_{ℓ=1}^{k} x_{ij,ℓ} µ_ℓ + ε_{ij}, where we take the indices i and j together into a single index ij, and x_{ij,ℓ} gets the value 1 if ℓ = i and 0 otherwise.
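As an illustration of this dummy coding (a sketch with hypothetical group sizes and group means, not taken from the slides), the following NumPy fragment builds the indicator design matrix x_{ij,ℓ}.

import numpy as np

n_i = [4, 3, 5]                            # k = 3 groups, hypothetical sizes n_i
k = len(n_i)
groups = np.repeat(np.arange(k), n_i)      # single index "ij" running over all observations
X = np.zeros((groups.size, k))
X[np.arange(groups.size), groups] = 1.0    # x_{ij,l} = 1 if observation belongs to group l
mu = np.array([10.0, 12.0, 9.0])           # hypothetical group means mu_l
Y = X @ mu + np.random.default_rng(1).normal(scale=1.0, size=groups.size)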


The normal equation

Model Y = Xβ + ε

We search for β̂ so that Xβ̂ is as close as possible to Y in quadratic norm.

Algebraically this means that we look for the least squares solution to the overdetermined set of equations Y = Xβ.

‖Y − Xβ‖ has to be minimized. In a Euclidean space (i.e., a space where the notion of orthogonality can be defined using inner products) that means that the residual Y − Xβ̂ has to be orthogonal (perpendicular) to the approximation (or projection) Xβ̂ in the subspace spanned by X.

So (Xβ̂)ᵀ(Y − Xβ̂) = 0

This is the case if Xᵀ(Y − Xβ̂) = 0


from which the normal equation follows: β̂ = (XᵀX)⁻¹XᵀY

Let ŷ be the estimator for y = Xβ, then

ŷ = X(XᵀX)⁻¹XᵀY

This is an orthogonal projection with projection matrix

P = X(XᵀX)⁻¹Xᵀ,   PP = P

The vector of residuals is e = Y − ŷ = (I − P)Y
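A small NumPy sketch of these quantities (using np.linalg.solve rather than an explicit matrix inverse, a standard numerical precaution; the function name is ours).

import numpy as np

def least_squares(X, Y):
    # Normal-equation solution, fitted values and residuals.
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ Y)   # beta_hat = (X'X)^{-1} X'Y
    P = X @ np.linalg.solve(XtX, X.T)          # projection matrix P = X (X'X)^{-1} X'
    y_hat = P @ Y                              # orthogonal projection of Y
    e = Y - y_hat                              # residuals e = (I - P) Y
    return beta_hat, y_hat, e

# sanity checks: P is (numerically) idempotent and e is orthogonal to the columns of X,
# e.g. np.allclose(P @ P, P) and np.allclose(X.T @ e, 0)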


Orthogonalisation

If X has orthonormal columns, then XᵀX = I_p, and thus the projection becomes

ŷ = X·Xᵀ·Y

The ith component is then

ŷ_i = ∑_{j=0}^{p−1} ∑_{k=1}^{n} x_{i,j} x_{k,j} Y_k

The computation of all n components of ŷ can proceed in parallel, without numerical errors propagating from one component to another.

Otherwise, we orthogonalise X.


Gram-Schmidt-orthogonalisation

Let x_j, j = 0, . . . , p−1 be the columns of X, then define

q_0 = x_0 = 1
q_1 = x_1 − (〈x_1, q_0〉/‖q_0‖²) · q_0
. . .
q_j = x_j − ∑_{k=0}^{j−1} (〈x_j, q_k〉/‖q_k‖²) · q_k

This is: q_j is the residual of the orthogonal projection of x_j onto q_0, . . . , q_{j−1}

Looks like linear regression within the covariate matrix X

The new covariates q_j are orthogonal, but not orthonormal (they can be easily normalised)


Gram-Schmidt in matrix form

We have

x_j = q_j + ∑_{k=0}^{j−1} (〈x_j, q_k〉/‖q_k‖²) · q_k

Apply the scalar product with q_j on both sides to find that 〈x_j, q_j〉/‖q_j‖² = 1.

Hence x_j = ∑_{k=0}^{j} (〈x_j, q_k〉/‖q_k‖²) · q_k

In matrix form: QR-decomposition X = Q·R with the columns of Q equal to the q_j, and R an upper triangular matrix with entries

R_{k,j} = 〈x_j, q_k〉/‖q_k‖².

R ∈ R^{p×p} and R is invertible.

Interpretation: Every column of X (every covariate x_j) can be written as a linear combination of orthogonal covariates, which are the columns of Q. (Right-multiplication with R)


Regression after orthogonalisation

Plugging this into the solution of the normal equation:

ŷ = QR(RᵀQᵀQR)⁻¹RᵀQᵀ · Y = Q(QᵀQ)⁻¹Qᵀ · Y

This is a projection onto the orthogonal (but not orthonormal) basis Q.

Elaborated:

ŷ = ∑_{k=0}^{p−1} (〈Y, q_k〉/‖q_k‖²) · q_k

The vector of coefficients γ̂ with γ̂_k = 〈Y, q_k〉/‖q_k‖² satisfies ŷ = Qγ̂ and γ̂ = (QᵀQ)⁻¹Qᵀ · Y


Relation between γ̂ and β̂:

β̂ = (XᵀX)⁻¹Xᵀ · Y = (RᵀQᵀQR)⁻¹RᵀQᵀ · Y = R⁻¹(QᵀQ)⁻¹Qᵀ · Y

β̂ = R⁻¹γ̂,   γ̂ = Rβ̂   (Left-multiplication with R)
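The relation β̂ = R⁻¹γ̂ can be checked numerically. The sketch below uses np.linalg.qr, which returns an orthonormal Q (a different normalisation than the non-normalised q_j above), but the identity β̂ = R⁻¹γ̂ holds for any decomposition X = QR with γ̂ = (QᵀQ)⁻¹QᵀY.

import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)

Q, R = np.linalg.qr(X)                          # X = Q R (here Q has orthonormal columns)
gamma_hat = np.linalg.solve(Q.T @ Q, Q.T @ Y)   # = Q'Y when Q is orthonormal
beta_qr   = np.linalg.solve(R, gamma_hat)       # beta_hat = R^{-1} gamma_hat
beta_ne   = np.linalg.solve(X.T @ X, X.T @ Y)   # normal-equation solution
# np.allclose(beta_qr, beta_ne) should be True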


Elaboration for p = 1, simple linear regression

Model Yi = β0 + β1xi,1 + εi, i = 1, . . . , n

Simple linear regression: x0 = 1 and x1 = x.

q_1 = x_1 − (〈x_1, q_0〉/‖q_0‖²) · q_0 = x_1 − (〈x_1, 1〉/‖1‖²) · 1 = x_1 − ((∑_{i=1}^{n} x_{i,1})/n) · 1 = x_1 − x̄_1 · 1

R = [ 1   x̄_1 ;  0   1 ],   R⁻¹ = [ 1   −x̄_1 ;  0   1 ]


γ̂_0 = Ȳ,   γ̂_1 = 〈q_1, Y〉/‖q_1‖² = 〈x_1 − x̄_1·1, Y〉/‖x_1 − x̄_1·1‖²

Since the second row of R⁻¹ equals [0 1], we find immediately that

β̂_1 = 〈x_1 − x̄_1·1, Y〉/‖x_1 − x̄_1·1‖²

which corresponds to the classical result, namely

β̂_1 = ∑_{i=1}^{n} (Y_i − Ȳ)(x_{i,1} − x̄_1) / ∑_{i=1}^{n} (x_{i,1} − x̄_1)²

β̂_0 = Ȳ − β̂_1 x̄_1


Tools for statistical exploration

In order to further explore the properties of the least squares estimator, we need to establish some classical results in multivariate statistics:

1. Covariance matrix

2. Multivariate normal distribution

3. Trace of a matrix


Covariance matrix

Let X be a p-variate random vector with joint density function f_X(x), joint (cumulative) distribution function F_X(x), and joint characteristic function φ_X(t) = E(e^{i tᵀX})

µ = E(X)

Σ_X = E[(X − µ)(X − µ)ᵀ]

We have (Σ_X)_{ij} = cov(X_i, X_j)

ΣX is symmetric

If Y = AX, then Σ_Y = A Σ_X Aᵀ

Proof:

It holds that µ_Y = Aµ_X, so

Σ_Y = E[(Y − µ_Y)(Y − µ_Y)ᵀ] = E[(A(X − µ_X))(A(X − µ_X))ᵀ] = A Σ_X Aᵀ

So, with A_iᵀ the ith row of A, we have


cov(Y_i, Y_j) = A_iᵀ Σ_X A_j = A_jᵀ Σ_X A_i

Special case: matrix A has just one row, A = aᵀ; then Y = aᵀX and σ²_Y = aᵀ Σ_X a

A symmetric matrix S is called positive (semi-)definite if xᵀSx > 0 (≥ 0) for every vector x ≠ 0

A positive-definite matrix always has real, positive eigenvalues

Σ_X is positive semi-definite, since if aᵀΣ_Xa < 0, then Y = aᵀX would have a negative variance

We further assume that Σ_X is positive definite.

Then there exists a square, orthogonal matrix T so that Σ_X = TΛTᵀ, with Λ a diagonal matrix with the eigenvalues on the diagonal

Furthermore, there exists a matrix A = TΛ^{1/2} so that Σ_X = AAᵀ


For this matrix it holds that [det(A)]² = det(Σ_X)

Suppose now Z = A⁻¹(X − µ) (with µ = EX), then

EZ = 0 and Σ_Z = A⁻¹(AAᵀ)A⁻ᵀ = I

The components of Z are thus uncorrelated

Vice versa, if Σ_Z = I and A is an arbitrary matrix, and X = AZ, then Σ_X = AAᵀ, and this arbitrary A can then be written as A = TΛ^{1/2}, where T is orthogonal.
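A brief numerical illustration of the construction A = TΛ^{1/2} and of the decorrelation Z = A⁻¹(X − µ), with an arbitrary positive definite Σ_X (a sketch, not from the slides).

import numpy as np

rng = np.random.default_rng(3)
Sigma_X = np.array([[2.0, 0.8],
                    [0.8, 1.0]])              # an arbitrary positive definite covariance
lam, T = np.linalg.eigh(Sigma_X)              # Sigma_X = T Lambda T'
A = T @ np.diag(np.sqrt(lam))                 # A = T Lambda^{1/2}, so Sigma_X = A A'
X = (A @ rng.normal(size=(2, 10000))).T + np.array([1.0, -2.0])   # samples with mean mu

Z = np.linalg.solve(A, (X - X.mean(axis=0)).T).T   # Z = A^{-1}(X - mu)
# np.cov(Z.T) should be close to the identity matrix (uncorrelated components)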


The multivariate normal distribution

Definition: multivariate normal distribution

X ∼ N(µ, Σ_X)  ⇔  Z_i ∼ N(0, 1), i.i.d., where Z = A⁻¹(X − µ) as on the previous slide

As the components of Z are uncorrelated and (here) normally distributed, they must be mutually independent.

The joint density function:

f_X(x) = 1 / ( (2π)^{p/2} √det(Σ_X) ) · exp( −(1/2) (x − µ)ᵀ Σ_X⁻¹ (x − µ) )

The characteristic function:

φ_X(t) = exp( i tᵀµ − (1/2) tᵀ Σ_X t )


Properties of the multivariate normal distribution

Property 1: If X ∼ N(µ, Σ_X), then (X − µ)ᵀ Σ_X⁻¹ (X − µ) ∼ χ²_p

This follows from (X − µ)ᵀ Σ_X⁻¹ (X − µ) = (X − µ)ᵀ A⁻ᵀ A⁻¹ (X − µ) = ZᵀZ = ∑_{i=1}^{p} Z_i², with Z_i² ∼ χ²_1 and the Z_i mutually independent

Property 2: If X ∼ N(µ, Σ_X), then for C ∈ R^{q×p} with q ≤ p: CX ∼ N(Cµ, C Σ_X Cᵀ)

Property 3: X ∼ N(µ, Σ_X)  ⇔  aᵀX ∼ N(aᵀµ, aᵀ Σ_X a) for every vector a


Properties of the multivariate normal distribution

Property 4

Suppose X = (X_1; X_2) and X ∼ N(µ, Σ_X) with µ = (µ_1; µ_2) and

Σ_X = [ Σ_11  Σ_12 ;  Σ_21  Σ_22 ]

and det(Σ_22) ≠ 0, then

X_1 | X_2 = x_2 ∼ N(µ_{1|2}, Σ_{1|2})

with µ_{1|2} = µ_1 + Σ_12 Σ_22⁻¹ (x_2 − µ_2) and Σ_{1|2} = Σ_11 − Σ_12 Σ_22⁻¹ Σ_21

It holds that Σ_{1|2}⁻¹ = (Σ⁻¹)_{11}

Note also that Σ_21 = Σ_12ᵀ


Marginal and conditional covariance

(elaboration of previous slide)

Suppose X = (X_1; X_2) with mean µ = (µ_1; µ_2) and covariance matrix

Σ_X = [ Σ_11  Σ_12 ;  Σ_21  Σ_22 ]

and det(Σ_22) ≠ 0, then

Marginal covariance: cov(X_1) = Σ_11

Conditional covariance: cov(X_1 | X_2) = Σ_{1|2} = Σ_11 − Σ_12 Σ_22⁻¹ Σ_21

This expression is known as the Schur complement.

It holds that cov(X_1 | X_2) = Σ_{1|2} = [(Σ⁻¹)_{11}]⁻¹


Proof

Denote L = [ I_11  0 ;  −Σ_22⁻¹Σ_21  I_22 ], then L⁻¹ = [ I_11  0 ;  +Σ_22⁻¹Σ_21  I_22 ] and

Σ_X L = [ Σ_11  Σ_12 ;  Σ_21  Σ_22 ] [ I_11  0 ;  −Σ_22⁻¹Σ_21  I_22 ]
      = [ Σ_11 − Σ_12Σ_22⁻¹Σ_21   Σ_12 ;  0   Σ_22 ]
      = [ Σ_{1|2}   Σ_12 ;  0   Σ_22 ]
      = [ I_11   Σ_12Σ_22⁻¹ ;  0   I_22 ] [ Σ_{1|2}   0 ;  0   Σ_22 ]

So, Σ_X = [ I_11   Σ_12Σ_22⁻¹ ;  0   I_22 ] [ Σ_{1|2}   0 ;  0   Σ_22 ] [ I_11   0 ;  Σ_22⁻¹Σ_21   I_22 ]

and

Σ_X⁻¹ = [ I_11   0 ;  −Σ_22⁻¹Σ_21   I_22 ] [ Σ_{1|2}⁻¹   0 ;  0   Σ_22⁻¹ ] [ I_11   −Σ_12Σ_22⁻¹ ;  0   I_22 ]
      = [ Σ_{1|2}⁻¹   . . . ;  . . .   . . . ]


Trace of a matrix

Tr(A) = ∑_{i=1}^{p} A_{ii}

Tr(AB) = Tr(BA)   (A ∈ R^{p×q}, B ∈ R^{q×p}). Holds for both square (p = q) and rectangular matrices with matching sizes

Proof: Tr(AB) = ∑_{i=1}^{p} (AB)_{ii} = ∑_{i=1}^{p} ∑_{j=1}^{q} A_{ij} B_{ji} = ∑_{j=1}^{q} ∑_{i=1}^{p} B_{ji} A_{ij} = ∑_{j=1}^{q} (BA)_{jj} = Tr(BA)

Corollary: A, B, C square matrices, then:

Tr(ABC) = Tr(BCA) = Tr(CAB)

The trace remains unchanged under cyclic permutations, so NOT under every permutation: in general Tr(ABC) ≠ Tr(BAC)

Holds also for more than 3 matrices and also for rectangular matrices with matching dimensions

Tr(A + B) = Tr(A) + Tr(B).   Attention: Tr(A·B) ≠ Tr(A)·Tr(B)


Maximum likelihood for regression

Model Y = Xβ + ε (random variable Y instead of X in previous slides)

Assume ε ∼ N.I.D.(0, σ²); then the joint observational density is multivariate normal, more precisely

L(β, σ²; Y) = 1/(2πσ²)^{n/2} · exp( −(1/(2σ²)) (Y − Xβ)ᵀ(Y − Xβ) )

To be maximised over β and σ².

log L(β, σ²; Y) = −(n/2) log(2πσ²) − (1/(2σ²)) (Y − Xβ)ᵀ(Y − Xβ)

Dependence on β only through the second term, so β̂ maximises L(β, σ²; Y) or log L(β, σ²; Y) iff β̂ minimises

(Y − Xβ)ᵀ(Y − Xβ) = ‖Y − Xβ‖₂²

This is least squares.

Conclusion: for normal errors, the least squares solution is the maximum likelihood solution.


Properties of maximum likelihood estimators

Model Y = Xβ + ε

Least squares: β̂ = (XᵀX)⁻¹XᵀY

Property 1: unbiased

Eβ̂ = β

Property 2: covariance

Σ_β̂ = σ²(XᵀX)⁻¹

Property 3: normality. β̂ is normally distributed (indeed, the estimator is a linear combination of normally distributed random variables)

Marginal distributions: β̂_j ∼ N(β_j, σ²(XᵀX)⁻¹_{jj})
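These properties can be checked by simulation. The following sketch (arbitrary design and parameters, not from the slides) repeatedly draws new noise and compares the empirical mean and covariance of β̂ with β and σ²(XᵀX)⁻¹.

import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 30, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

estimates = []
for _ in range(5000):                                # repeated noise realisations
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(XtX_inv @ X.T @ Y)              # beta_hat = (X'X)^{-1} X'Y
estimates = np.array(estimates)

print(estimates.mean(axis=0))          # close to beta (unbiasedness)
print(np.cov(estimates.T))             # close to sigma**2 * XtX_inv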


Property 4: best linear unbiased estimator (BLUE)

Denote y = Xβ and ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = PY

We consider estimators for linear combinations of the expected responses: θ = cᵀy = cᵀXβ.

For instance the β_j's themselves. Indeed: β = (XᵀX)⁻¹Xᵀy, so take for c the jth column of X(XᵀX)⁻ᵀ (cᵀ is then the jth row of (XᵀX)⁻¹Xᵀ).

The LS estimator is a linear combination of the observations: θ̂ = cᵀPY = c_Pᵀ Y with c_P = Pᵀc

Note Pᵀ = P and PP = P and Py = y

We have: Eθ̂ = θ, so cᵀPy = θ = cᵀy, or: cᵀPy = cᵀy

Suppose now that θ̃ = dᵀY were also unbiased, i.e., dᵀy = cᵀy

Example: denote X_p the p×p upper part of the n×p matrix X (that is, restrict to the first p observations) and dᵀ = cᵀX_p⁻¹

Because y = Xβ, this becomes: dᵀXβ = cᵀXβ

Since X does not depend on β, the same argument can be repeated for the same d and c with ever different β, so it follows that (dᵀ − cᵀ)X = 0

Taking transposes: Xᵀ(d − c) = 0

Pre-multiplication by X(XᵀX)⁻¹: P(d − c) = 0

So Pd = Pc

And further: var(θ̃) − var(θ̂) = var(dᵀY) − var(cᵀPY) = var(dᵀY) − var(dᵀPY) = dᵀd σ² − dᵀPd σ² = dᵀ(I − P)d σ² = dᵀ[(I − P)ᵀ(I − P)]d σ² = ‖(I − P)d‖² σ² ≥ 0


Variance estimation

S² = (Y − Xβ̂)ᵀ(Y − Xβ̂)/(n − p)

Note that here the transpose is on the first factor (an inner product), unlike in the expression for a covariance matrix (an outer product).

ES² = σ²

Proof: We know ŷ = Xβ̂ = PY, so Y − Xβ̂ = (I − P)Y = (I − P)(Xβ + ε) = (I − P)ε

and so

E[(Y − Xβ̂)ᵀ(Y − Xβ̂)] = E[εᵀ(I − P)ᵀ(I − P)ε] = E[εᵀ(I − P)ε] = E[ ∑_{i=1}^{n} ∑_{j=1}^{n} ε_i (I − P)_{ij} ε_j ] = ∑_{i=1}^{n} (I − P)_{ii} Eε_i² = Tr(I − P)σ² = (n − p)σ²


Distribution of the variance estimator

Theorem

(n − p)S²/σ² ∼ χ²_{n−p}

• The matrix I − P is idempotent: (I − P)(I − P) = I − P

• This matrix has rank n − p: all linear combinations of the columns of X belong to the kernel of the matrix, since (I − P)X = 0

• An idempotent matrix with rank k has k independent eigenvectors with eigenvalue 1 and n − k independent eigenvectors with eigenvalue 0. The latter are (spanned by) the columns of X

• The matrix I − P is symmetric, so eigenvectors corresponding to eigenvalue 1 are orthogonal to eigenvectors with eigenvalue 0. Within each of the two groups of eigenvectors, orthogonalisation (using Gram-Schmidt) is straightforward.


• There exists thus an orthogonal transform T so that I − P = TᵀΛT with Λ = diag(1, . . . , 1, 0, . . . , 0)

• We know from the proof of ES² = σ² that (n − p)S²/σ² = (Y − Xβ̂)ᵀ(Y − Xβ̂)/σ² = (εᵀ/σ)(I − P)(ε/σ) = (εᵀ/σ) TᵀΛT (ε/σ)

• ε/σ is a vector of independent, standard normal random variables: Σ_{ε/σ} = I. After application of the orthogonal transform Z = T(ε/σ) we have Σ_Z = TITᵀ = I, so Z is also a vector of independent, standard normal random variables Z_i, so (n − p)S²/σ² = ∑_{i=1}^{n−p} Z_i² ∼ χ²_{n−p}


Hypothesis testing - likelihood ratio

We want to test H0 : Aβ = c with A ∈ R^{q×p}, where q ≤ p and the rank of A is equal to its number of rows q

Example: c = 0 and A = [I 0]. This leads to the hypothesis test whether the first q β_j equal zero (other subsets of β_j are possible)

As σ is unknown, it is a component of the parameter vector θ = (β, σ²) to be estimated.

Let Ω = {θ} be the space of all values that the parameter vector can take and Ω₀ = {θ ∈ Ω | Aβ = c} the subspace of values that satisfy the null hypothesis.

Likelihood ratio

ℓ = max_{θ₀ ∈ Ω₀} L(θ₀; Y) / max_{θ ∈ Ω} L(θ; Y)


Likelihood ratio

Theorem (from mathematical statistics)

If n → ∞, then −2 log ℓ →_d χ²_{ν−ν₀}

ν is the dimension of the space Ω, ν₀ is the dimension of Ω₀; in our case ν = p, ν₀ = p − q

Maximum likelihood estimator in multiple regression

L(β, σ²; Y) = 1/(2πσ²)^{n/2} · exp( −(1/(2σ²)) (Y − Xβ)ᵀ(Y − Xβ) )

Taking the derivative with respect to σ² yields

σ̂² = (1/n)(Y − Xβ̂)ᵀ(Y − Xβ̂) = ((n − p)/n) S²

where β̂ follows from taking derivatives with respect to the β_j (= least squares as discussed before; but we don't need this for the forthcoming analysis)

Plugging in leads to L(β̂, σ̂²; Y) = (2πσ̂²)^{−n/2} e^{−n/2}


Maximum likelihood estimator within Ω0

Maximise L(β, σ²; Y) under the restriction that Aβ = c

Technique: Lagrange multipliers (elaboration beyond the scope of this text)

This leads to the same expression as before for σ̂₀²:

σ̂₀² = (1/n)(Y − Xβ̂₀)ᵀ(Y − Xβ̂₀)

And so: L(β̂₀, σ̂₀²; Y) = (2πσ̂₀²)^{−n/2} e^{−n/2}

but β̂₀ is now: β̂₀ = β̂ + (XᵀX)⁻¹Aᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂)

The likelihood ratio is then ℓ = (σ̂²/σ̂₀²)^{n/2}

Large values of −2 log ℓ, i.e., small values of ℓ, lead to rejection of the null hypothesis.


Estimators under reduced model

Some heavy calculations ...

Starting from the reduced model H0 : Aβ = c with A ∈ R^{q×p}, where q ≤ p and the rank of A is equal to its number of rows q

If H0 holds, then

β̂₀ = β̂ + (XᵀX)⁻¹Aᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂)

σ̂₀² = (1/n)(Y − Xβ̂₀)ᵀ(Y − Xβ̂₀)

Denote ŷ = Xβ̂ and ŷ₀ = Xβ̂₀, then

ŷ₀ − ŷ = X(β̂₀ − β̂) = X(XᵀX)⁻¹Aᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂)

and

(Y − ŷ)ᵀ(ŷ₀ − ŷ) = [(I − X(XᵀX)⁻¹Xᵀ)Y]ᵀ(ŷ₀ − ŷ)
= Yᵀ(I − X(XᵀX)⁻¹Xᵀ) X(XᵀX)⁻¹Aᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂)
= Yᵀ[(I − X(XᵀX)⁻¹Xᵀ) X(XᵀX)⁻¹] · Aᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂)
= Yᵀ[X(XᵀX)⁻¹ − X(XᵀX)⁻¹(XᵀX)(XᵀX)⁻¹] · Aᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂)
= Yᵀ · 0 · Aᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂) = 0

Corollary

For Y − ŷ₀ = (Y − ŷ) + (ŷ₀ − ŷ) we have

‖Y − ŷ₀‖² = ‖Y − ŷ‖² + ‖ŷ₀ − ŷ‖²

So

σ̂₀² = (1/n)‖Y − ŷ₀‖² = (1/n)‖Y − ŷ‖² + (1/n)‖ŷ₀ − ŷ‖²
    = σ̂² + (1/n)‖X(XᵀX)⁻¹Aᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂)‖²
    = σ̂² + (1/n)(c − Aβ̂)ᵀ[A(XᵀX)⁻¹Aᵀ]⁻ᵀ A(XᵀX)⁻ᵀ XᵀX (XᵀX)⁻¹Aᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂)
    = σ̂² + (1/n)(c − Aβ̂)ᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹[A(XᵀX)⁻¹Aᵀ][A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂)

So σ̂₀² − σ̂² = (1/n)(c − Aβ̂)ᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(c − Aβ̂)


An equivalent F -test

The distribution of −2 log ℓ is only asymptotically χ². The elaboration of −2 log ℓ leads to logarithms of sample variances. Intuitively, a ratio of sample variances suggests the use of an F-test, which is exact, and without the need of taking a log.

(∼ ANOVA)

Is the "extra sum of squares" (the part of the variance that is explained by giving up the restrictions imposed by the null hypothesis) significant?

Formally: F = ((σ̂₀² − σ̂²)/q) / (σ̂²/(n − p))

We have: F = ((n − p)/q) · (σ̂₀² − σ̂²)/σ̂² = ((n − p)/q) · (σ̂₀²/σ̂² − 1) = ((n − p)/q) · (ℓ^{−2/n} − 1)

We now show that F ∼ F_{q,n−p}
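Before turning to the proof, here is a sketch of how this F statistic for H0: Aβ = c can be computed in practice (the function name is ours; scipy.stats is used only to evaluate the F_{q,n−p} tail probability).

import numpy as np
from scipy import stats    # only used for the F reference distribution

def f_test(X, Y, A, c):
    # F statistic for H0: A beta = c in the model Y = X beta + eps.
    n, p = X.shape
    q = A.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / n              # ML variance estimate
    diff = c - A @ beta_hat
    extra = diff @ np.linalg.solve(A @ XtX_inv @ A.T, diff) / n   # sigma0^2 - sigma^2
    F = (extra / q) / (sigma2_hat / (n - p))
    p_value = stats.f.sf(F, q, n - p)
    return F, p_value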


The proof for the F -test

We have

1. σ̂² = ((n − p)/n) S² = (Y − Xβ̂)ᵀ(Y − Xβ̂)/n

2. σ̂₀² − σ̂² = (Aβ̂ − c)ᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(Aβ̂ − c)/n

We have to prove that

1. nσ̂²/σ² = (n − p)S²/σ² ∼ χ²_{n−p} (done before)

2. under H0: n(σ̂₀² − σ̂²)/σ² = (Aβ̂ − c)ᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(Aβ̂ − c)/σ² ∼ χ²_q

3. S² and β̂ are independent. Therefore, all functions of S² are also independent from all functions of β̂. More precisely, σ̂² and σ̂₀² − σ̂² are mutually independent.


The proof for the F -test: distribution under H0

As β̂ ∼ N(β, (XᵀX)⁻¹σ²),

we have that Aβ̂ − c is a vector of length q, which is normally distributed with expected value Aβ − c = 0 (under H0)

and with covariance matrix A(XᵀX)⁻¹Aᵀσ².

Then (property of covariance matrices)

(Aβ̂ − c)ᵀ[A(XᵀX)⁻¹Aᵀ]⁻¹(Aβ̂ − c)/σ² ∼ χ²_q


The proof for the F-test: independence of S² and β̂

We show that (Y − Xβ̂) is independent from β̂. As both are normally distributed, it suffices to show that they are uncorrelated.

The covariance matrix (mind the positions of the transposes: an outer product):

cov(Y − Xβ̂, β̂) = E[(Y − Xβ̂)β̂ᵀ] − E(Y − Xβ̂) · Eβ̂ᵀ
= E[(Y − X(XᵀX)⁻¹XᵀY) · ((XᵀX)⁻¹XᵀY)ᵀ] − 0 · βᵀ
= (I − X(XᵀX)⁻¹Xᵀ) · E(YYᵀ) · ((XᵀX)⁻¹Xᵀ)ᵀ
= (I − X(XᵀX)⁻¹Xᵀ) · (σ²I + XββᵀXᵀ) X(XᵀX)⁻¹
= (I − X(XᵀX)⁻¹Xᵀ) · (X(XᵀX)⁻¹σ² + Xββᵀ) = 0


Multiple testing and simultaneous confidence

We know that β̂ and S² are independent, so it follows immediately that

( (β̂ − β)ᵀ(XᵀX)(β̂ − β)/p ) / S² ∼ F_{p,n−p}

(This is the F-statistic for the full model, so in the test H0: all β_i zero, i.e., q = p)

So

{ β : (β̂ − β)ᵀ(XᵀX)(β̂ − β) / S² ≤ p F_{p,n−p,α} }

is a 1 − α confidence region for β

One single linear combination of β:

cᵀβ̂ ∼ N(cᵀβ, cᵀ(XᵀX)⁻¹c σ²),  so  (cᵀβ̂ − cᵀβ) / (S √(cᵀ(XᵀX)⁻¹c)) ∼ t_{n−p}

So P( |cᵀ(β̂ − β)| / (S √(cᵀ(XᵀX)⁻¹c)) ≥ t_{n−p,α/2} ) = α

We verify now that the simultaneous confidence interval indeed follows from the F-test for the full model.

We search for M_α so that P( max_c |cᵀ(β̂ − β)| / (S √(cᵀ(XᵀX)⁻¹c)) ≥ M_α ) = α

We square the expression and get [cᵀ(β̂ − β)]² = [cᵀ(β̂ − β)]ᵀ[cᵀ(β̂ − β)] = (β̂ − β)ᵀ(ccᵀ)(β̂ − β)

The value of the expression at c is the same as at rc with r ∈ R₀ (r ≠ 0), so we can maximise under the normalisation that cᵀ(XᵀX)⁻¹c = 1

XᵀX is a p×p symmetric, full rank matrix and so it can be decomposed as XᵀX = RRᵀ with R a p×p invertible matrix. So, we can write cᵀ(XᵀX)⁻¹c = cᵀ(R⁻ᵀR⁻¹)c = uᵀu with u = R⁻¹c, so c = Ru.

We maximise over u:

max_u (β̂ − β)ᵀ R u uᵀ Rᵀ (β̂ − β) / (uᵀu)

The numerator contains a squared inner product of u with Rᵀ(β̂ − β). According to the Cauchy-Schwarz inequality:

(β̂ − β)ᵀ R u uᵀ Rᵀ (β̂ − β) ≤ ‖u‖² · ‖Rᵀ(β̂ − β)‖²,

So max_u (β̂ − β)ᵀ R u uᵀ Rᵀ (β̂ − β) / (uᵀu) = ‖Rᵀ(β̂ − β)‖² = (β̂ − β)ᵀ RRᵀ (β̂ − β) = (β̂ − β)ᵀ(XᵀX)(β̂ − β), which, divided by S², is distributed as p F_{p,n−p}


Residuals and outliers

e = Y − ŷ = (I − P)Y

Σ_e = σ²(I − P)

This allows us to standardise/studentise the residuals and then test for normality.

Cook's distance; influence measure; influential points

It measures the influence of observation i on the eventual estimated vector:

D_i = (β̂_(i) − β̂)ᵀ(XᵀX)(β̂_(i) − β̂) / (p S²)

where β̂_(i) denotes the estimate computed without observation i.
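A sketch of these diagnostics in NumPy; the Cook's distances are computed through the leverages h_ii = P_ii, using the standard leverage form D_i = e_i² h_ii / (p S² (1 − h_ii)²), which is equivalent to the leave-one-out definition above.

import numpy as np

def residual_diagnostics(X, Y):
    # Studentised residuals and Cook's distances via the leverages h_ii.
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    P = X @ XtX_inv @ X.T
    e = Y - P @ Y                                  # residuals, e = (I - P) Y
    h = np.diag(P)                                 # leverages
    S2 = e @ e / (n - p)                           # variance estimate S^2
    r = e / np.sqrt(S2 * (1.0 - h))                # (internally) studentised residuals
    D = (e ** 2) * h / (p * S2 * (1.0 - h) ** 2)   # Cook's distances
    return r, D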


Transformations

Model

Y (λ) = Xβ + ε

with Y(λ)i =

Y λi −1λ

General: transformation of a random variable

If Y = g(X) then

fY (y) = fX(x(y))|J(y)| = fX(g−1(y))|J(y)|

with J = det

. . . . . . . . .

. . . ∂xi

∂yj. . .

. . . . . . . . .

and

∂xi

∂yj=

∂g−1i (y)

∂yj

In our case Y (λ) ∼ N(Xβ, σ2I) takes the role of X, so∂xi

∂yi= 0 unless j = i, in that case ∂xi

∂yi= yλ−1

i

So J =∏n

i=1 yλ−1i


The likelihood becomes

L(β, σ², λ; Y) = 1/(2πσ²)^{n/2} · exp( −(1/(2σ²)) (Y^{(λ)} − Xβ)ᵀ(Y^{(λ)} − Xβ) ) · ∏_{i=1}^{n} Y_i^{λ−1}

Differentiation with respect to σ² yields

σ̂² = (1/n)(Y^{(λ)} − Xβ̂)ᵀ(Y^{(λ)} − Xβ̂)

Plugging in:

L(β̂, σ̂², λ; Y) = (2πσ̂²)^{−n/2} e^{−n/2} ∏_{i=1}^{n} Y_i^{λ−1}

This function can be evaluated for different values of λ (note that σ̂² in its turn is also a function of λ)
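A sketch of this profile-likelihood evaluation over a grid of λ values (requires Y > 0; the λ = 0 case is handled by the usual log limit, a convention not spelled out on the slide).

import numpy as np

def boxcox_profile_loglik(Y, X, lambdas):
    # For each lambda: transform Y, fit by least squares, evaluate log L(beta_hat, sigma_hat^2, lambda; Y).
    n = Y.shape[0]
    out = []
    for lam in lambdas:
        Y_lam = np.log(Y) if abs(lam) < 1e-12 else (Y ** lam - 1.0) / lam
        beta_hat = np.linalg.lstsq(X, Y_lam, rcond=None)[0]
        sigma2 = np.mean((Y_lam - X @ beta_hat) ** 2)
        loglik = -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * n \
                 + (lam - 1.0) * np.sum(np.log(Y))          # Jacobian term
        out.append(loglik)
    return np.array(out)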


Some notions of model selection

The analysis until now assumed that the full (global) model is correct. For instance, the expressions Eβ̂ = β and Σ_β̂ = σ²(XᵀX)⁻¹ are true only if Y can indeed be written as Xβ + ε and no explanatory variable is missing.

In order to check if a specific covariate x_i should be considered as a candidate in the global model, one could "play safe": put it in the full model and let the hypothesis test H0 : β_i = 0 decide whether or not it is significant.

However, if we have to play safe for a very wide range of possible explanatory variables, a lot of parameters need to be estimated, leaving just a few (if any at all) degrees of freedom to estimate the variance of the observational error. As a consequence, the resulting hypothesis tests will have weak power, and really significant parameters may be left undiscovered.

We would therefore prefer a procedure that limits the size of the full model before the actual estimation and testing takes place. At the same time, the model selection procedure should be able to move beyond the simple question whether or not a specific explanatory variable (or a group of covariates) should be in the model. The model selection should also be able to compare the quality of two arbitrary models where each model contains variables not present in the other.


As the philosophy of hypothesis testing favors the null hypothesis a priori (the alternative has to come up with significant data), this philosophy cannot be used in a symmetric comparison of two (or more) models.

A model selection procedure therefore makes use of a model selection criterion that expresses how well the data can be described in that model.


Adjusted R2

Least squares is an orthogonal projection: with ŷ = Xβ̂ we have (Y − ŷ)ᵀŷ = 0 and also (Y − ŷ)ᵀ1 = 0, since 1 is the first column of X and the residual is orthogonal to all columns of X. So

∑_{i=1}^{n} (Y_i − ŷ_i)(ŷ_i − Ȳ) = 0

∑_{i=1}^{n} (Y_i − Ȳ)² = ∑_{i=1}^{n} (Y_i − ŷ_i)² + ∑_{i=1}^{n} (ŷ_i − Ȳ)²

SST = SSR + SSE with

Sum of squares Total: SST = ∑_{i=1}^{n} (Y_i − Ȳ)²

Sum of squares Regression: SSR = ∑_{i=1}^{n} (ŷ_i − Ȳ)²

Sum of squares Errors: SSE = ∑_{i=1}^{n} (Y_i − ŷ_i)²


We define the coefficient of determination

R² = SSR/SST = 1 − SSE/SST

R² is a measure of how well the model explains the variance of the response variables. However, a submodel of a full model always has a smaller R², even if the full model only adds unnecessary, insignificant parameters to the submodel.

Note: The denominator in R² is the same for all models. The numerator is (up to a constant) the log-likelihood of a normal model with known variance. So R² measures the likelihood of a model. The more parameters the model has, the more likely the observations become, because there are more degrees of freedom to make the observations fit the model.

Adjusted coefficient of determination

R̄² = 1 − (SSE/(n − p)) / (SST/(n − 1)),   or   1 − R̄² = ((n − 1)/(n − p)) (1 − R²)

Interpretation

We know: S² = (Y − ŷ)ᵀ(Y − ŷ)/(n − p) = (1/(n − p)) ∑_{i=1}^{n} (Y_i − ŷ_i)² = SSE/(n − p)

and σ̂² = (1/n) ∑_{i=1}^{n} (Y_i − ŷ_i)² = ((n − p)/n) S² = SSE/n = (SST − SSR)/n = SST(1 − R²)/n

Suppose we have two nested models, i.e., one model containing all covariates of the other (this is: a full model with p covariates and a reduced model with q restrictions, so with p − q out of p free covariates); the F-value in testing the significance of the full model is then:

F = ((σ̂²_{p−q} − σ̂²_p)/q) / (σ̂²_p/(n − p))

Dividing numerator and denominator by SST leads to:

F = ((R²_p − R²_{p−q})/q) / ((1 − R²_p)/(n − p)) = [ (1 − R²_{p−q}) − (1 − R²_p) ] / [ q(1 − R²_p)/(n − p) ]


We rewrite this in terms of the adjusted R²:

F = ((n − p)/q) [ (1 − R̄²_{p−q})/(1 − R̄²_p) − 1 ] + (1 − R̄²_{p−q})/(1 − R̄²_p)

from which it follows that R̄²_{p−q} ≤ R̄²_p ⇔ F ≥ 1

The criterion that decides which model has higher quality is thus equivalent to a "hypothesis test" with critical value F = 1, instead of F_{q,n−p,α}. This illustrates the symmetry of the model selection criterion, compared to the testing procedure, which uses null and alternative hypotheses with clearly unequal roles.
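A small sketch computing R² and the adjusted R̄² for a given design matrix (intercept in the first column); the comment restates the nested-model equivalence derived above.

import numpy as np

def r2_stats(X, Y):
    # R^2 and adjusted R^2 for the model with design matrix X (intercept included).
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    y_hat = X @ beta_hat
    SSE = np.sum((Y - y_hat) ** 2)
    SST = np.sum((Y - Y.mean()) ** 2)
    R2 = 1.0 - SSE / SST
    R2_adj = 1.0 - (SSE / (n - p)) / (SST / (n - 1))
    return R2, R2_adj

# For two nested designs with p-q and p columns, the F statistic
# F = ((R2_full - R2_red)/q) / ((1 - R2_full)/(n - p)) exceeds 1 exactly
# when the adjusted R^2 of the full model exceeds that of the reduced model.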


Mallows’s Cp statistic

Criterion: Choose the model with the smallest mean prediction error

The prediction for covariate values x_p:

ŷ_p = x_pᵀ β̂_p = x_pᵀ (X_pᵀ X_p)⁻¹ X_pᵀ Y

In this expression X_p is a model with p covariates.

The prediction error is the expected squared difference E(ŷ_p − y)² between the prediction and the true, error-free value of the response for the given covariate values.

We want to assess the quality of the model X_p. The precise covariate values in which we measure the quality are free in theory. However, since we observe the model in a limited number of covariate values, we restrict ourselves to those values. Indeed, it is hard in practice to say something about the quality of a prediction of a response value if we don't even observe that response value. In the points of observation, we still don't know the exact (noise-free) response.

Prediction error = variance + bias²

E(ŷ_p − y)² = var(ŷ_p) + (Eŷ_p − y)²

with y = xᵀβ in the correct model

Because of the linearity of expectations: Eŷ_p = E(x_pᵀ(X_pᵀX_p)⁻¹X_pᵀY) = x_pᵀ(X_pᵀX_p)⁻¹X_pᵀ y

And var(ŷ_p) = x_pᵀ Σ_β̂_p x_p = σ² x_pᵀ(X_pᵀX_p)⁻¹x_p

We search for the minimum average prediction error, where the average is taken over all points of observation, so we want to minimize Δ_p = (1/σ²) ∑_{i=1}^{n} E(ŷ_{p,i} − y_i)²

The average variance can be elaborated as


(1/σ²) ∑_{i=1}^{n} var(ŷ_{p,i}) = ∑_{i=1}^{n} x_{p,i}ᵀ (X_pᵀX_p)⁻¹ x_{p,i}
= ∑_{i=1}^{n} Tr( x_{p,i}ᵀ (X_pᵀX_p)⁻¹ x_{p,i} ) = ∑_{i=1}^{n} Tr( (X_pᵀX_p)⁻¹ (x_{p,i} x_{p,i}ᵀ) )
= Tr( ∑_{i=1}^{n} (X_pᵀX_p)⁻¹ (x_{p,i} x_{p,i}ᵀ) ) = Tr( (X_pᵀX_p)⁻¹ ∑_{i=1}^{n} (x_{p,i} x_{p,i}ᵀ) )
= Tr(I_p) = p

thereby using the fact that the rows of X_p are the x_{p,i}ᵀ, so X_pᵀ X_p = ∑_{i=1}^{n} (x_{p,i} x_{p,i}ᵀ)

The average variance thus depends only on the size of the model. It does not depend on the exact locations of the points in which the prediction quality is evaluated. For this part, we do not need the prediction to be evaluated at the points of observation.


The average mean or expected error (bias)

(1/σ²) ∑_{i=1}^{n} (Eŷ_{p,i} − y_i)² = (1/σ²) (Eŷ_p − y)ᵀ (Eŷ_p − y)

Now Eŷ_p = X_p(X_pᵀX_p)⁻¹X_pᵀ y = P_p y (its ith component is x_{p,i}ᵀ(X_pᵀX_p)⁻¹X_pᵀ y),

so the average bias is (1/σ²) yᵀ(I − P_p)ᵀ(I − P_p)y = (1/σ²) yᵀ(I − P_p)y

This contains the unknown, error-free response variable. We try to estimate the bias by plugging in the observed values instead. This is of course only possible in the points of observation. This is why the information criterion uses points of observation only. We find the expression Yᵀ(I − P_p)Y = SSE_p.

This estimator of the bias is biased again, as follows from

E(Yᵀ(I − P_p)Y) = E[(y + ε)ᵀ(I − P_p)(y + ε)]
= yᵀ(I − P_p)y + E[εᵀ(I − P_p)y] + E[yᵀ(I − P_p)ε] + E[εᵀ(I − P_p)ε]
= yᵀ(I − P_p)y + 0 + 0 + Tr(I − P_p)σ² = yᵀ(I − P_p)y + (n − p)σ²

But this time, the bias (n − p)σ² only depends on the model size, no longer on the unknown response.

So Δ_p σ² = E[ SSE_p + (2p − n)σ² ]

The random variable on the right hand side (between the expectation brackets) is observable and it is an unbiased estimator of the prediction quality measure on the left hand side of the expression.

In most practical situations σ² has to be estimated by S², and so

C_p = SSE_p/S² + 2p − n

is a consistent estimator for Δ_p
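A sketch of C_p in practice: S² is estimated from the full model and SSE_p from each candidate submodel (the function and its arguments are our own illustration, not a prescribed interface).

import numpy as np

def mallows_cp(X_full, Y, subsets):
    # Mallows' C_p for a list of candidate submodels; S^2 is taken from the full model.
    n, p_full = X_full.shape
    beta_full = np.linalg.lstsq(X_full, Y, rcond=None)[0]
    S2 = np.sum((Y - X_full @ beta_full) ** 2) / (n - p_full)
    scores = {}
    for cols in subsets:                           # each subset is a tuple of column indices
        Xp = X_full[:, list(cols)]
        p = Xp.shape[1]
        beta_p = np.linalg.lstsq(Xp, Y, rcond=None)[0]
        SSE_p = np.sum((Y - Xp @ beta_p) ** 2)
        scores[cols] = SSE_p / S2 + 2 * p - n      # C_p = SSE_p / S^2 + 2p - n
    return scores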


Algorithms for searching the best model

• Suppose the full model has K possible variables; then the number of possible models is 2^K

• Computing all adjusted R²'s, C_p scores or other criteria is impossible (exponential computational complexity)

• Possible strategies (a sketch of the first one follows after this list):

1. Forward selection: Start with a zero model, then gradually add the most significant parameter in a test with H0 the current model.

2. Backward elimination

3. Forward-backward alternation: after a variable has been added, it is tested whether any variable in the model can now be eliminated without significant increase in RSS.

4. Grouped/structured selection: the routine may impose that some parameters must be included before others. For instance, no interaction terms without main effects.
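A minimal sketch of strategy 1 (greedy forward selection based on the residual sum of squares; a real implementation would add a significance test or a C_p-based stopping rule, as discussed above).

import numpy as np

def forward_selection(X_full, Y, max_vars):
    # Greedy forward selection: at each step add the column that most reduces
    # the residual sum of squares, starting from the intercept-only model.
    n, K = X_full.shape
    active = [0]                                   # assume column 0 is the intercept
    while len(active) < min(max_vars, K):
        best_rss, best_j = None, None
        for j in range(K):
            if j in active:
                continue
            Xc = X_full[:, active + [j]]
            beta = np.linalg.lstsq(Xc, Y, rcond=None)[0]
            rss = np.sum((Y - Xc @ beta) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        active.append(best_j)      # a full routine would stop earlier, e.g. when the
    return active                  # added variable is not significant or C_p increases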


Sparsity

• Parsimonious model selection: we search for models where“some” covariates have exactly zero contribution.

• Sparse model selection: the number of zeros dominates (by far) the number of nonzeros


Model selection as a constraint optimization problem

Constraint optimization problem = regularized least squares

RegLS(β) = ‖Y − K·β‖² + λ‖β‖_{ℓq}

• Minimum over β

• Defines the estimator within a model

• Operates within the full model, but the estimator is parsimonious (if q < 2)

RegLS(β) depends on the smoothing parameter λ. This parameter can be estimated using penalized likelihood, for instance Mallows' C_p:

C_p = SSE_p/S² + 2p − n

• Information criterion

• Minimize over λ (or equivalently: over model size k)

• The smoothing parameter λ determines the model size k (and vice versa)


Regularisation: from combinatorial to convex constraints

• In the constraint optimization problem RegLS(β) = ‖Y − K·β‖² + λp, the constraint or penalty equals p = ∑_i β_i⁰ = ‖β‖₀, where we take 0⁰ = 0.

ℓ0 regularisation = combinatorial integer programming problem

• Replace ℓ0 by ℓ1 → convex quadratic programming problem

RegLS(β) = ‖Y − K·β‖² + λ ∑_i |β_i|

[Figure: the line Xβ = Y intersected with different constraint curves — the combinatorial constraint I(|β_1|>0) + I(|β_2|>0) = 1, the algebraic constraint |β_1|^{1/2} + |β_2|^{1/2} = 1, the convex constraint |β_1| + |β_2| = 1, and the constraint β_1² + β_2² = r².]


Solving the ℓ1 constraint problem

Steepest descent for quadratic forms

Suppose F(β) = ‖Y − Xβ‖₂² = ∑_{i=1}^{n} ( Y_i − ∑_j X_{ij}β_j )²

then ∂F/∂β_k = −2 ∑_{i=1}^{n} ( Y_i − ∑_j X_{ij}β_j ) X_{ik}

The sum over i is an inner product of the vector Y − Xβ with the kth column of X, so, taking all k together, the direction of steepest descent (the negative gradient, up to the factor 2) is Xᵀ(Y − Xβ)

Karush-Kuhn-Tucker conditions

If β̂ solves the ℓ1 constraint optimization problem IC1(β) = ‖Y − Xβ‖₂² + 2λ‖β‖₁, then

X_jᵀ(Y − Xβ̂) = λ sign(β̂_j)   if β̂_j ≠ 0
|X_jᵀ(Y − Xβ̂)| ≤ λ            if β̂_j = 0

Intuition

• If β̂_j ≠ 0, then ∂IC1(β)/∂β_j is continuous at β̂, and it must be zero. That is expressed by the first line of the Karush-Kuhn-Tucker conditions

• If β̂_j = 0, then the partial derivative is discontinuous. The steepest decrease of ‖Y − Xβ‖₂² obtained by letting β_j deviate from zero should then be smaller than the increase in the penalty

Consequence

The solution of the ℓ1 constraint problem will be sparse, but it does not coincide with the ℓ0 constraint solution. Indeed, the solution that satisfies the KKT conditions cannot be an orthogonal projection (a least squares solution).

On the next slides is a simple, yet important, special case: soft- versus hard-thresholding. The latter is an ℓ0 method, and thus a projection; the former is projection plus shrinkage.


Soft- and Hard-Thresholding

• Suppose X = I, so we have the observational model Y = β + ε. In this model, both the ℓ0- and the ℓ1-penalized closeness-of-fit can be optimized componentwise

• ℓ0-penalized closeness-of-fit: IC(β) = ∑_{i=1}^{n} [ (Y_i − β_i)² + λ² I(|β_i| > 0) ]

Solution is hard-thresholding: β̂_i = HT_λ(Y_i), where HT_λ(x) = x · I(|x| > λ) [figure: graph of the hard-thresholding function, equal to zero on (−λ, λ) and to the identity outside]


• ℓ1-penalized closeness-of-fit: IC(β) = ∑_{i=1}^{n} [ (Y_i − β_i)² + 2λ|β_i| ]

Solution is soft-thresholding: β̂_i = ST_λ(Y_i), where ST_λ(x) = sign(x) · max(|x| − λ, 0) [figure: graph of the soft-thresholding function, zero on (−λ, λ) and shifted towards zero by λ outside]
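Both thresholding rules in NumPy (a direct transcription of the componentwise solutions above; the function names are ours).

import numpy as np

def hard_threshold(x, lam):
    # HT_lambda(x): keep values with |x| > lambda, set the rest to zero.
    return np.where(np.abs(x) > lam, x, 0.0)

def soft_threshold(x, lam):
    # ST_lambda(x): shrink towards zero by lambda, i.e. sign(x)*max(|x|-lambda, 0).
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)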


LASSO/Basis pursuit

• ℓ1 penalization/regularization/constraint is called least absolute shrinkage and selection operator (LASSO) or Basis Pursuit

• For a given λ, ℓ1 penalization leads to the same degree of sparsity as ℓ0 (see the constraint figure above)

• For fixed λ, and if all nonzero β are large enough, ℓ1 penalization is variable selection consistent: for n → ∞, the set of nonzero variables in the selection equals the true set with probability tending to one

• The convex optimization problem can be solved by quadratic/dynamic programming or by specific methods, such as

– Least Angle Regression (LARS), a direct solver

– Iterative Soft Thresholding (It. ST), an iterative solver

We discuss both on the subsequent slides


Least Angle Regression

• Brings in a new variable: the one with maximum projection of the current residual, i.e., the component with the highest magnitude in the vector Xᵀ(Y − Xβ̂₀)

• Moves along the equiangular vector u_A: β̂₁ = β̂₀ + α u_A with

u_A = X_A(X_AᵀX_A)⁻¹1_A / √( 1_Aᵀ(X_AᵀX_A)⁻¹1_A )

where A is the set of active variables (variables currently in the model), until one variable not in A has the same projection onto the new residual Xᵀ(Y − Xβ̂₁)

• Stopping criterion: Mallows's C_p: we stop if (we think that) C_p(p) has reached a minimum


Iterative soft-thresholding

• General X, observational model Y = Xβ + ε

• We want ‖Y − Xβ‖₂² as small as possible, under the constraint that ‖β‖₁ = ∑_i |β_i| is restricted

• Suppose we have an estimator β̂^{(r)}; then we improve this estimator in two steps (a sketch follows below):

– We proceed in the direction of the steepest descent of ‖Y − Xβ‖₂². The steepest descent direction is Xᵀ(Y − Xβ̂^{(r)}). We define β̂^{(r+1/2)} = β̂^{(r)} + Xᵀ(Y − Xβ̂^{(r)})

– We search for β̂^{(r+1)} which is as close as possible to β̂^{(r+1/2)}, but which has restricted ℓ1-norm, so we take β̂^{(r+1)} = ST_λ(β̂^{(r+1/2)})
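A sketch of the iteration exactly as written above; note (our remark, not on the slides) that this plain scheme with unit step size converges when the largest eigenvalue of XᵀX is at most 1, so X may need rescaling.

import numpy as np

def iterative_soft_thresholding(X, Y, lam, n_iter=500):
    # Two-step iteration: gradient step followed by soft-thresholding.
    # Assumes the largest eigenvalue of X'X is at most 1 (rescale X if necessary).
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta_half = beta + X.T @ (Y - X @ beta)                                  # steepest-descent step
        beta = np.sign(beta_half) * np.maximum(np.abs(beta_half) - lam, 0.0)     # ST_lambda
    return beta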
