
Assumptions in linear regression models

Yi = β0 + β1 x1i + … + βk xki + εi, i = 1, … , n

Assumptions

x1i , … , xki are deterministic (not random variables)

ε1, …, εn are independent random variables with zero mean, i.e. E(εi) = 0, and common variance, i.e. V(εi) = σ2.

Consequences

E(Yi) = β0 + β1x1i + … + βkxki and V(Yi) = σ2, i = 1, …, n

The OLS (ordinary least squares) estimators of β0, …, βk, denoted b0, …, bk, are BLUE (Best Linear Unbiased Estimators) by the Gauss–Markov theorem.
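For readers working outside SPSS, here is a minimal Python sketch of OLS on simulated data; all names and numbers are illustrative, not from the slides:

```python
import numpy as np

# Minimal OLS sketch: columns of X are a leading 1 (intercept), then x1, x2.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, 2))])
beta = np.array([1.0, 2.0, -0.5])          # assumed "true" coefficients
y = X @ beta + rng.normal(0, 1.0, n)       # eps_i ~ (0, sigma^2), independent

b, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS: minimizes the sum of squared errors
print(b)  # unbiased for beta under the assumptions above (Gauss-Markov)
```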

Normality assumption

If, in addition, we assume that the errors are Normal random variables:

ε1, …, εn are independent NORMAL r.v. with zero mean and common variance σ2, i.e. εi ~ N(0, σ2), i = 1, …, n

Consequences

Yi ~ N(β0 + β1x1i + … + βkxki, σ2), i = 1, …, n

bi ~ N(βi, V(bi)), i = 0, …, k

The normality assumption is needed for reliable inference (confidence intervals and tests of hypotheses): with Normal errors, the probability statements are exact.

If the normality assumption does not hold, then under some conditions a large number of observations n allows reliable asymptotic inference on the estimated betas, via a Central Limit Theorem.
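As a sketch of the exact t-based inference this enables, assuming Normal errors and using statsmodels on illustrative simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 40)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 40)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.conf_int(alpha=0.05))  # exact 95% CIs when the errors are Normal
print(fit.pvalues)               # t-tests on b0, b1
```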

Checking assumptions

The error term ε is unobservable. Instead, we estimate it using the parameter estimates.

The regression residual is defined as

ei = yi – ŷi, i = 1, 2, …, n

Plots of the regression residuals are fundamental in revealing model inadequacies such as
a) non-normality
b) unequal variances
c) presence of outliers
d) correlation (in time) of error terms

Detecting model lack of fit with residuals

• Plot the residuals ei on the vertical axis against each of the independent variables x1, ..., xk on the horizontal axis.

• Plot the residuals ei on the vertical axis against the predicted values ŷ on the horizontal axis.

• In each plot, look for trends, dramatic changes in variability, and/or more than 5% of the residuals lying outside ±2s of 0. Any of these patterns indicates a problem with model fit.

Use the Scatter/Dot graph command in SPSS to construct any of the plots above.
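An equivalent residual-vs-predicted plot can be sketched in Python (illustrative data; the ±2s bands reflect the 5% rule above):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 60)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 60)

fit = sm.OLS(y, sm.add_constant(x)).fit()
e = fit.resid               # residuals e_i = y_i - yhat_i
s = np.sqrt(fit.mse_resid)  # estimate of sigma

plt.scatter(fit.fittedvalues, e)
plt.axhline(0)
plt.axhline(2 * s, ls="--")
plt.axhline(-2 * s, ls="--")
plt.xlabel("Predicted value")
plt.ylabel("Residuals")
plt.show()  # look for trends, funnels, and points beyond +/-2s
```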

Examples: residuals vs. predicted

[Four residual-vs-predicted plots (residuals on the vertical axis, predicted value on the horizontal): one fine, one showing nonlinearity, one showing outliers, one showing unequal variances.]

Examples: residuals vs. predicted

[Two residual-vs-predicted plots: one showing nonlinearity and auto-correlation, one showing auto-correlation.]

Partial residuals plot

An alternative method to detect lack of fit in models with more than one independent variable uses the partial residuals; for a selected j-th independent variable xj,

e* = y – (b0 + b1x1 + ... + bj-1xj-1 + bj+1xj+1 + ... + bkxk) = e + bjxj

Partial residuals measure the influence of xj after the effects of all other independent variables have been removed.

A plot of the partial residuals for xj against xj often reveals more information about the relationship between y and xj than the usual residual plot.

If everything is fine, the points should scatter around a straight line with slope bj. Partial residual plots can be produced in SPSS by selecting “Produce all partial plots” in the “Plots” options of the “Regression” dialog box.
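A minimal sketch of computing partial residuals by hand, assuming a two-variable model with simulated data (not from the slides):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 80
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(0, 1.0, n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# Partial residuals for x1: ordinary residuals plus x1's fitted contribution,
# i.e. e* = e + b1*x1; plotted against x1 they should be linear with slope b1.
e_star = fit.resid + fit.params[1] * x1
```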

Example

A supermarket chain wants to investigate the effect of price p on the weekly demand of a house brand of coffee.

Eleven prices were randomly assigned to the stores and were advertised using the same procedure.

A few weeks later the chain conducted the same experiment using no advertising:

– Y : weekly demand in pounds

– X1: price, dollars/pound

– X2: advertisement: 1 = Yes, 0 = No.

Model 1: E(Y) = β0 + β1x1 + β2x2

Data: Coffee2.sav

Computer Output

Model Summary

Model 1: R = .988(a), R Square = .975, Adjusted R Square = .973, Std. Error of the Estimate = 49.876
a. Predictors: (Constant), Advertisement, Price per pound

ANOVA(b)

              Sum of Squares   df   Mean Square   F         Sig.
Regression    1859299          2    929649.475    373.710   .000(a)
Residual      47264.868        19   2487.625
Total         1906564          21

a. Predictors: (Constant), Advertisement, Price per pound
b. Dependent Variable: Weekly demand

Coefficients(a)   (B, Std. Error: unstandardized; Beta: standardized)

                   B          Std. Error   Beta    t         Sig.
(Constant)         2400.182   68.914               34.829    .000
Price per pound    -456.295   16.813       -.980   -27.139   .000
Advertisement      70.182     21.267       .119    3.300     .004

a. Dependent Variable: Weekly demand

Residual and partial residual (price) plots

[Residuals vs. price: shows non-linearity. Partial residuals for price vs. price: shows the nature of the non-linearity.]

Try using 1/x instead of x

E(Y) = β0 + β1(1/x1) + β2x2

Model Summary(b)

Model 1: R = .999(a), R Square = .999, Adjusted R Square = .999, Std. Error of the Estimate = 11.097
a. Predictors: (Constant), RecPrice, Advertisement
b. Dependent Variable: Weekly demand

ANOVA(b)

              Sum of Squares   df   Mean Square   F          Sig.
Regression    1904224          2    952111.958    7731.145   .000(a)
Residual      2339.903         19   123.153
Total         1906564          21

a. Predictors: (Constant), RecPrice, Advertisement
b. Dependent Variable: Weekly demand

Coefficients(a)   (B, Std. Error: unstandardized; Beta: standardized)

                B           Std. Error   Beta    t         Sig.
(Constant)      -1217.343   14.898               -81.711   .000
Advertisement   70.182      4.732        .119    14.831    .000
RecPrice        6986.507    56.589       .992    123.460   .000

a. Dependent Variable: Weekly demand

RecPrice = 1/Price

Residual and partial residual (1/price) plots

After fitting the independent variable “x1 = 1/price”, the residual plot does not show any pattern and the partial residual plot for 1/price does not show any non-linearity.
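The same transformation can be sketched outside SPSS; the data below are simulated stand-ins for Coffee2.sav, whose actual values are not reproduced here:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in for Coffee2.sav: 11 prices, each run with and
# without advertising (22 observations, matching the ANOVA df above).
rng = np.random.default_rng(4)
price = np.tile(np.linspace(2.0, 4.0, 11), 2)
adv = np.repeat([1, 0], 11)
demand = -1200 + 7000 / price + 70 * adv + rng.normal(0, 11, 22)

X = sm.add_constant(np.column_stack([1.0 / price, adv]))  # RecPrice = 1/Price
fit = sm.OLS(demand, X).fit()
print(fit.rsquared_adj)  # the reciprocal model should fit far better
```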

An example with simulated data

The true model, supposedly unknown, is

Y = 1 + x1 + 2∙x2 + 1.5∙x1∙x2 + ε, with ε~N(0,1)

Data: Interaz.sav
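A minimal sketch of how such data could be generated; the distributions of x1 and x2 are assumed, since the slides do not state them:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x1 = rng.uniform(0, 5, n)  # assumed design; Interaz.sav's actual x's are unknown
x2 = rng.uniform(0, 5, n)
eps = rng.normal(0, 1, n)

y = 1 + x1 + 2 * x2 + 1.5 * x1 * x2 + eps  # the "true" model
```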

Fit a model based on data

[Scatter plots of Y against x1 and against x2.]

Cor(X1, X2) = 0.131

Model 1: E(Y) = β0 + β1x1 + β2x2

ANOVA

              SS        df   MS        F         Sig.
Regression    8447.42   2    4223.71   768.494   .000
Residual      533.12    97   5.496
Total         8980.54   99

Adj. R² = 0.939

Coefficients

             B        Std. Error   t        Sig.   VIF
(Constant)   -6.092   .630         -9.668   .000
X1           3.625    .207         17.528   .000   1.018
X2           6.145    .189         32.465   .000   1.018

Model 1: standardized residual plot

[Standardized residuals vs. Y: nonlinearity is present.]

What is it due to? Since the scatter plots do not show any non-linearity, it could be due to an interaction.

Model 1: partial regression plots

[Partial regression plots for X1 and X2: they show that the linear effects are roughly fine, but some non-linearity shows up.]

Model 2: E(Y) = β0 + β1x1 + β2x2 + β3x1x2

ANOVA

              SS         df   MS         F         Sig.
Regression    8885.372   3    2961.791   2987.64   .000
Residual      95.169     96   .991
Total         8980.541   99

Adj. R² = 0.989

Coefficients

             B       Std. Error   t        Sig.   VIF
(Constant)   .305    .405         .753     .453
X1           1.288   .142         9.087    .000   2.648
X2           2.098   .209         10.051   .000   6.857
IntX1X2      1.411   .067         21.018   .000   9.280

Model 2: standardized residual plot

Looks fine

Model 2: partial regression plots

[Partial regression plots for X1, X2, and X1X2: all plots show linearity of the corresponding terms. Maybe an outlier is present.]

Model 3: E(Y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x2²

Suppose I want to try fitting a quadratic term.

ANOVA

              SS         df   MS        F         Sig.
Regression    8890.686   4    2222.67   2349.92   .000
Residual      89.856     95   .946
Total         8980.541   99

Adj. R² = 0.990

Coefficients

             B       Std. Error   t        Sig.   VIF
(Constant)   .023    .413         .055     .956
X1           1.258   .139         9.051    .000   2.670
X2           2.615   .299         8.757    .000   14.713
IntX1X2      1.436   .066         21.619   .000   9.528
X2Square     -.137   .058         -2.370   .020   11.307

x2² seems fine, but multicollinearity is higher (see the VIFs).
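As a side note (illustrative, not from the slides), VIFs like those reported above can be computed from the design matrix, e.g. with statsmodels' variance_inflation_factor:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1, x2 = rng.uniform(0, 5, (2, 100))  # assumed design, as in the sketch above

# Design matrix with interaction and quadratic terms, as in Model 3.
X = sm.add_constant(np.column_stack([x1, x2, x1 * x2, x2**2]))
for j, name in enumerate(["const", "X1", "X2", "X1X2", "X2^2"]):
    print(name, round(variance_inflation_factor(X, j), 2))
```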

Model 3: standardized residual plot

Looks fine

Model 3: partial regression plots

[Partial regression plots for X1, X2, X1X2, and X2²: the plot for X2² doesn't show “linearity”.]

Checking the normality assumption

The inference procedures on the estimates (tests and confidence intervals) are based on the Normality assumption for the error term ε. If this assumption is not satisfied, the conclusions drawn may be wrong.

Two widely used graphical tools are

1) the P-P plot for Normality of the residuals

2) the histogram of the residuals compared with the Normal density function.

Again, the residuals ei are used for checking this assumption.

The P-P plot for Normality and the histogram of the residuals can be obtained in SPSS by selecting the appropriate boxes in the “Plots” options in the “Regression” dialog box.
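Outside SPSS, a comparable check can be sketched in Python; note that scipy's probplot draws a Q-Q rather than a P-P plot, which serves the same purpose here (illustrative data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 100)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

fig, (ax1, ax2) = plt.subplots(1, 2)
stats.probplot(resid, dist="norm", plot=ax1)  # points should hug the line
ax2.hist(resid, bins=15, density=True)        # compare with the Normal density
grid = np.linspace(resid.min(), resid.max(), 200)
ax2.plot(grid, stats.norm.pdf(grid, 0, resid.std()))
plt.show()
```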

Social Workers example: E(ln(Y)) = β0 + β1x

[P-P plot and histogram of the residuals. Points in the P-P plot should be as close as possible to the straight line; the histogram should match the continuous Normal curve.]

Neither graph shows strong departures from the Normality assumption.