Assumptions in linear regression models
Yi = β0 + β1 x1i + … + βk xki + εi, i = 1, … , n
Assumptions
x1i, …, xki are deterministic (not random variables).
ε1, …, εn are independent random variables with null mean, i.e. E(εi) = 0, and common variance, i.e. V(εi) = σ².
Consequences
E(Yi) = β0 + β1x1i + … + βkxki and V(Yi) = σ², i = 1, …, n
The OLS (ordinary least squares) estimators of β0, …, βk, indicated with b0, …, bk, are BLUE (Best Linear Unbiased Estimators) – Gauss-Markov theorem.
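The OLS estimators can be computed directly from the normal equations b = (X'X)⁻¹X'y. A minimal Python sketch on simulated data (the data, seed, and true betas are illustrative assumptions, not from the lecture):

```python
# Hedged sketch: OLS estimates b0, ..., bk via the normal equations,
# on simulated data satisfying the assumptions above.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
eps = rng.normal(0, 1, n)                   # E(eps)=0, common variance sigma^2=1
y = 2.0 + 1.5 * x1 - 0.8 * x2 + eps         # true betas: 2.0, 1.5, -0.8

X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept column
b = np.linalg.solve(X.T @ X, X.T @ y)       # OLS: b = (X'X)^{-1} X'y
print(b)                                    # estimates close to the true betas
```

Because the estimators are unbiased, the printed values scatter around the true coefficients, with spread governed by V(bi).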
Normality assumption
If, in addition, we assume that the errors are Normal r.v.:
ε1, …, εn are independent NORMAL r.v. with null mean and common variance σ², i.e. εi ~ N(0, σ²), i = 1, …, n
Consequences
Yi ~ N(β0 + β1x1i + … + βkxki, σ²), i = 1, …, n
bi ~ N(βi, V(bi)), i = 0, …, k
The normality assumption is needed for reliable inference (confidence intervals and tests of hypotheses); i.e., the probability statements are exact.
If the normality assumption does not hold, then under some conditions a large n (number of observations) allows, via a Central Limit Theorem, reliable asymptotic inference on the estimated betas.
Checking assumptions
The error term ε is unobservable. Instead, we can estimate it by using the parameter estimates.
The regression residual is defined as
ei = yi – ŷi, i = 1, 2, …, n
Plots of the regression residuals are fundamental in revealing model inadequacies such as
a) non-normality
b) unequal variances
c) presence of outliers
d) correlation (in time) of error terms
Detecting model lack of fit with residuals
• Plot the residuals ei on the vertical axis against each of the independent variables x1, …, xk on the horizontal axis.
• Plot the residuals ei on the vertical axis against the predicted value ŷ on the horizontal axis.
• In each plot look for trends, dramatic changes in variability, and/or more than 5% of the residuals lying outside 2s of 0. Any of these patterns indicates a problem with the model fit.
Use the Scatter/Dot graph command in SPSS to construct any of the plots above.
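The "more than 5% outside 2s" rule of thumb can also be checked numerically. A minimal Python sketch on simulated data (the data and seed are illustrative assumptions, not the lecture's datasets):

```python
# Hedged sketch: compute residuals e_i = y_i - yhat_i and apply the
# rule of thumb above (flag a problem if more than ~5% of residuals
# fall outside +/- 2s of 0).
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, n)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b
e = y - yhat                          # regression residuals
s = np.sqrt(e @ e / (n - 2))          # residual standard error
frac_outside = np.mean(np.abs(e) > 2 * s)
print(frac_outside)                   # roughly 0.05 when assumptions hold
```

A scatter of e against yhat (e.g. with matplotlib's `plt.scatter(yhat, e)`) should then show no trend and roughly constant spread, matching the "fine" panel below.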
Examples: residuals vs. predicted
[Four residual-vs-predicted plots illustrating: (a) fine, (b) nonlinearity, (c) outliers, (d) unequal variances.]
Examples: residuals vs. predicted
[Two residual-vs-predicted plots illustrating: (a) nonlinearity and auto-correlation, (b) auto-correlation.]
Partial residuals plot
An alternative method to detect lack of fit in models with more than one independent variable uses the partial residuals; for a selected j-th independent variable xj,
e* = y – (b0 + b1x1 + … + bj-1xj-1 + bj+1xj+1 + … + bkxk)
   = e + bjxj
Partial residuals measure the influence of xj after the effects of all other independent variables have been removed.
A plot of the partial residuals for xj against xj often reveals more information about the relationship between y and xj than the usual residual plot.
If everything is fine, they should show a straight line with slope bj.
Partial residual plots can be obtained in SPSS by selecting "Produce all partial plots" in the "Plots" options in the "Regression" dialog box.
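The identity e* = e + bjxj makes the partial residuals trivial to compute once the model is fitted. A minimal Python sketch on simulated two-regressor data (illustrative assumptions, not the lecture's datasets):

```python
# Hedged sketch: partial residuals for x1, i.e. e* = e + b1*x1,
# which should scatter around a straight line of slope b1 against x1
# when the model is adequate.
import numpy as np

rng = np.random.default_rng(2)
n = 150
x1 = rng.uniform(0, 5, n)
x2 = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
e_star = e + b[1] * x1                # partial residuals for x1

# Because OLS residuals are orthogonal to x1, a line fitted to
# (x1, e*) recovers exactly the slope b1:
slope = np.polyfit(x1, e_star, 1)[0]
print(slope, b[1])                    # the two values coincide
```

Plotting e_star against x1 would then reveal any curvature left in the relationship between y and x1, as in the coffee example below.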
Example
A supermarket chain wants to investigate the effect of price p on the weekly demand of a house brand of coffee.
Eleven prices were randomly assigned to the stores and were advertised using the same procedure.
A few weeks later the chain conducted the same experiment using no advertising.
– Y: weekly demand in pounds
– X1: price, dollars/pound
– X2: advertisement: 1 = Yes, 0 = No
Model 1: E(Y) = β0 + β1x1 + β2x2
Data: Coffee2.sav
Computer Output

Model Summary
Model 1: R = .988, R Square = .975, Adjusted R Square = .973, Std. Error of the Estimate = 49.876
Predictors: (Constant), Advertisment, Price per pound

ANOVA
              Sum of Squares   df   Mean Square        F      Sig.
Regression       1859299        2    929649.475    373.710    .000
Residual         47264.868     19      2487.625
Total            1906564       21
Predictors: (Constant), Advertisment, Price per pound
Dependent Variable: Weekly demand

Coefficients
                        B     Std. Error    Beta        t      Sig.
(Constant)         2400.182     68.914               34.829    .000
Price per pound    -456.295     16.813     -.980    -27.139    .000
Advertisment         70.182     21.267      .119      3.300    .004
Dependent Variable: Weekly demand
Residual and partial residual (price) plots
[Residuals vs. price: shows non-linearity. Partial residuals for price vs. price: shows the nature of the non-linearity.]
Try using 1/x instead of x
E(Y) = β0 + β1(1/x1) + β2x2
Model Summary
Model 1: R = .999, R Square = .999, Adjusted R Square = .999, Std. Error of the Estimate = 11.097
Predictors: (Constant), RecPrice, Advertisment
Dependent Variable: Weekly demand

ANOVA
              Sum of Squares   df   Mean Square        F       Sig.
Regression       1904224        2    952111.958    7731.145    .000
Residual         2339.903      19       123.153
Total            1906564       21
Predictors: (Constant), RecPrice, Advertisment
Dependent Variable: Weekly demand

Coefficients
                    B      Std. Error    Beta        t       Sig.
(Constant)     -1217.343     14.898               -81.711    .000
Advertisment      70.182      4.732      .119       14.831    .000
RecPrice        6986.507     56.589      .992      123.460    .000
Dependent Variable: Weekly demand

RecPrice = 1/Price
Residual and partial residual (1/price) plots
After fitting the independent variable x1 = 1/price, the residual plot does not show any pattern and the partial residual plot for 1/price does not show any non-linearity.
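The effect of the 1/price transformation can be reproduced in spirit with simulated data (the real analysis used Coffee2.sav in SPSS; the data below, the coefficients, and the noise level are illustrative assumptions):

```python
# Hedged sketch: fit demand on price and on 1/price and compare R^2,
# on simulated data where the true relationship is in 1/price.
import numpy as np

rng = np.random.default_rng(3)
n = 60
price = rng.uniform(2.0, 5.0, n)                 # dollars/pound
adv = rng.integers(0, 2, n).astype(float)        # 1 = Yes, 0 = No
demand = -1200 + 7000 / price + 70 * adv + rng.normal(0, 10, n)

def r_squared(X, y):
    """Coefficient of determination for an OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return 1 - (e @ e) / np.sum((y - y.mean()) ** 2)

ones = np.ones(n)
r2_lin = r_squared(np.column_stack([ones, price, adv]), demand)
r2_rec = r_squared(np.column_stack([ones, 1 / price, adv]), demand)
print(r2_lin, r2_rec)                # the 1/price model fits clearly better
```

As in the lecture's output, the reciprocal-price model absorbs the curvature that the linear-price model leaves in its residuals.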
An example with simulated data
The true model, supposedly unknown, is
Y = 1 + x1 + 2∙x2 + 1.5∙x1∙x2 + ε, with ε~N(0,1)
Data: Interaz.sav
Fit a model based on the data
[Scatter-plots of Y vs. x1 and Y vs. x2.]
Cor(X1, X2) = 0.131
Model 1: E(Y) = β0 + β1x1 + β2x2
ANOVA
              SS        df      MS          F       Sig.
Regression  8447.42      2    4223.71    768.494    .000
Residual     533.12     97       5.496
Total       8980.54     99
Adj. R² = 0.939

Coefficients
                B     Std. Error       t      Sig.     VIF
(Constant)   -6.092      .630       -9.668    .000
X1            3.625      .207       17.528    .000    1.018
X2            6.145      .189       32.465    .000    1.018
Model 1: standardized residual plot
[Standardized residuals vs. predicted Y.]
Nonlinearity is present. What is it due to?
Since the scatter-plots do not show any non-linearity, it could be due to an interaction.
Model 1: partial regression plots
[Partial regression plots for X1 and X2.]
They show that the linear effects are roughly fine, but some non-linearity shows up.
Model 2: E(Y) = β0 + β1x1 + β2x2 + β3x1x2
ANOVA
              SS        df      MS          F       Sig.
Regression  8885.372     3    2961.791   2987.64    .000
Residual      95.169    96        .991
Total       8980.541    99
Adj. R² = 0.989

Coefficients
                B     Std. Error       t      Sig.     VIF
(Constant)     .305      .405         .753    .453
X1            1.288      .142        9.087    .000    2.648
X2            2.098      .209       10.051    .000    6.857
IntX1X2       1.411      .067       21.018    .000    9.280
Model 2: partial regression plots
[Partial regression plots for X1, X2, and X1X2.]
All plots show linearity of the corresponding terms. Maybe an outlier is present.
Suppose I want to try fitting a quadratic term.
Model 3: E(Y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x2²

ANOVA
              SS        df      MS          F       Sig.
Regression  8890.686     4    2222.67    2349.92    .000
Residual      89.856    95        .946
Total       8980.541    99
Adj. R² = 0.990

Coefficients
                B     Std. Error       t      Sig.      VIF
(Constant)     .023      .413         .055    .956
X1            1.258      .139        9.051    .000     2.670
X2            2.615      .299        8.757    .000    14.713
IntX1X2       1.436      .066       21.619    .000     9.528
X2Square     -.137       .058       -2.370    .020    11.307

The x2² term seems fine, but the VIFs show higher multicollinearity.
Checking the normality assumption
The inference procedures on the estimates (tests and confidence intervals) are based on the Normality assumption for the error term ε. If this assumption is not satisfied, the conclusions drawn may be wrong.
Two widely used graphical tools are
1) the P-P plot for Normality of the residuals
2) the histogram of the residuals compared with the Normal density function.
Again, the residuals ei are used for checking this assumption.
The P-P plot for Normality and the histogram of the residuals can be obtained in SPSS by selecting the appropriate boxes in the "Plots" options in the "Regression" dialog box.
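The P-P plot compares the empirical cumulative probabilities of the standardized residuals with the corresponding standard Normal probabilities. A minimal Python sketch of this comparison on simulated data (illustrative assumptions, not the lecture's datasets), using only the standard library's NormalDist for the Normal CDF:

```python
# Hedged sketch: a numeric stand-in for the P-P plot, measuring the
# largest gap between the empirical CDF of the standardized residuals
# and the standard Normal CDF (in SPSS the two are plotted against
# each other; a straight diagonal means Normal-looking residuals).
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
z = np.sort((e - e.mean()) / e.std(ddof=2))    # standardized residuals

emp = (np.arange(1, n + 1) - 0.5) / n          # empirical probabilities
theo = np.array([NormalDist().cdf(v) for v in z])
max_gap = np.max(np.abs(emp - theo))           # small gap -> Normality plausible
print(max_gap)
```

Plotting theo against emp (e.g. `plt.plot(theo, emp)` with matplotlib) would draw the P-P plot itself; points hugging the diagonal support the Normality assumption.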