Multivariate Regression Analysis


  • GEO6161: Intermediate Quantitative Methods for Geographers



    Kalaivanan Murthy



    1. Plot the Ys vs individual Xs:

    (Reading anti-clockwise) Y vs X1 is widely scattered, Y vs X2 is close to linear, X3 = 1 is associated with higher Y values than X3 = 0, and the X1-X3 and X2-X3 plots are not very informative.

    2. Run Naive Model: Ŷ = β0 + β1X1 + β2X2 + β3X3 + β4A


    Estimate Std. Error t value Pr(>|t|)

    (Intercept) -1082.3793 165.4371 -6.543 8.17e-08 ***

    X1 0.2397 3.4967 0.069 0.946

    X2 1.2993 0.1620 8.019 7.45e-10 ***

    X3 67.9740 47.9893 1.416 0.164

    A 2.8016 7.1449 0.392 0.697


    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 120.4 on 40 degrees of freedom

    Multiple R-squared: 0.8635, Adjusted R-squared: 0.8498

    F-statistic: 63.24 on 4 and 40 DF, p-value: < 2.2e-16


  • Only β2 is significant at the 95% confidence level. R2 = 86% and adjusted R2 = 85% imply the model explains about 85% of the variability in the data and is concise. The F-statistic p-value < 2.2e-16 shows the model is significant overall.

    3. Improve the model with an interaction term (dropping A, per the appendix: y.reg = lm(Y~X1+X2+X3+X1:X2)): Ŷ = β0 + β1X1 + β2X2 + β3X3 + β5X1X2

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 1.771e+03 4.499e+02 3.936 0.000322 ***

    X1 -6.394e+01 1.037e+01 -6.168 2.75e-07 ***

    X2 -1.078e+00 3.852e-01 -2.799 0.007855 **

    X3 8.190e+01 3.386e+01 2.419 0.020217 *

    X1:X2 5.275e-02 8.278e-03 6.372 1.42e-07 ***


  • ---

    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 85.01 on 40 degrees of freedom

    Multiple R-squared: 0.932, Adjusted R-squared: 0.9252

    F-statistic: 137 on 4 and 40 DF, p-value: < 2.2e-16

    > AIC(y.reg)

    [1] 534.2501

    All βs are significant at the 95% confidence level. R2 = 93.2% (vs 86%) and adjusted R2 = 92.52% (vs 85%) have improved markedly. The F-statistic = 137 (vs 63) and its p-value < 2.2e-16 confirm overall significance.

    4. Apply a power (Box-Cox) transformation to Y (λ = -0.277) and refit with A included: Ŷ' = β0 + β1X1 + β2X2 + β3X3 + β4A + β5X1X2

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 2.998e+00 9.894e-02 30.297 < 2e-16 ***

    X1 -6.471e-03 2.197e-03 -2.945 0.005418 **

    X2 -5.618e-05 8.309e-05 -0.676 0.502958

    X3 1.814e-02 7.115e-03 2.549 0.014834 *

    A -2.139e-03 1.069e-03 -2.001 0.052373 .

    X1:X2 6.259e-06 1.755e-06 3.568 0.000973 ***


    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.01782 on 39 degrees of freedom

    Multiple R-squared: 0.9377, Adjusted R-squared: 0.9297

    F-statistic: 117.3 on 5 and 39 DF, p-value: < 2.2e-16

    > AIC(y.reg.transform)

    [1] -227.1882

    5. Though the coefficients for X2 and A are insignificant at the 95% confidence level, we keep them because they help the model conform to the normality assumption. Analysis shows that removing X2 and A yields a slightly better model but violates the normality assumption. The adjusted R2 = 92.9% is high, and the F-statistic p-value < 2.2e-16 confirms overall significance.
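The power used in the transformation can be found by maximizing the profile log-likelihood over λ. The appendix uses car::powerTransform; a minimal sketch of the same idea with MASS::boxcox on invented data (the variables x and y below are illustrative, not the report's data set) is:

```r
# Sketch: estimating a Box-Cox power by profile likelihood.
# Data are simulated so that the true optimal lambda is about 0.5.
library(MASS)

set.seed(42)
x <- runif(60, 1, 10)
y <- (2 + 0.8 * x + rnorm(60, sd = 0.3))^2   # squared response: lambda ~ 0.5

bc <- boxcox(lm(y ~ x), lambda = seq(-2, 2, 0.01), plotit = FALSE)
lambda.hat <- bc$x[which.max(bc$y)]          # lambda maximizing the log-likelihood
```

For the report's data the same procedure yields λ = -0.277, the power appearing in the final model.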


    6. Test for Normality: The following methods can be used

    i. Shapiro-Wilk

    ii. Kolmogorov-Smirnov

    iii. D'Agostino's battery of tests

    The sample size is relatively small (n = 45); hence the Shapiro-Wilk test is performed, and its p-value = 0.2489 > 0.05.

    Fail to reject H0: Distribution is Normal.
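The Shapiro-Wilk check in step 6 uses base R's shapiro.test() on the standardized residuals. A small sketch on simulated data (model and variables invented for illustration; only the procedure mirrors the report):

```r
# Sketch: Shapiro-Wilk normality check on standardized residuals.
set.seed(1)
x <- runif(45)
y <- 3 + 2 * x + rnorm(45, sd = 0.2)   # normal errors by construction
fit <- lm(y ~ x)

sw <- shapiro.test(rstandard(fit))
sw$p.value    # fail to reject H0 (normal errors) when p > 0.05
```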

    7. Test for independence: The following tests can be used to check whether the error terms are independent in space and time:


    i. Runs test

    ii. Durbin Watson test

    The runs test is performed here. In the figure above, the right-hand plot shows the distribution of residuals about the mean; no clear spatial dependence is observable. The p-value = 0.453 > 0.05, so we fail to reject H0: the error terms are independent.
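The runs test is simple enough to state in a few lines of base R (packaged versions exist, e.g. tseries::runs.test). A minimal normal-approximation sketch, counting runs of residuals above and below their median:

```r
# Sketch: runs test for independence (normal approximation).
runs.test.simple <- function(e) {
  s  <- sign(e - median(e))
  s  <- s[s != 0]                        # drop values tied with the median
  r  <- 1 + sum(s[-1] != s[-length(s)])  # number of runs
  n1 <- sum(s > 0); n2 <- sum(s < 0); n <- n1 + n2
  mu <- 2 * n1 * n2 / n + 1                              # E[runs] under H0
  v  <- 2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1))  # Var[runs]
  2 * pnorm(-abs((r - mu) / sqrt(v)))    # two-sided p-value
}

set.seed(7)
p <- runs.test.simple(rnorm(45))   # independent noise: expect a large p
```

A strictly alternating sequence, by contrast, has far too many runs and yields a tiny p-value.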

    8. Test for homoscedasticity: Homoscedasticity, or homogeneity of variance, can be tested with the Bartlett, Levene, and Fligner-Killeen tests. The Bartlett test is roughly valid only when the data are normally distributed. Levene's test is performed here; it relies less on the normality assumption and is therefore more dependable than Bartlett's test when normality is in doubt.

    Levene's test yields p-value = 0.92 > 0.05; hence we fail to reject H0: homogeneity of variance.
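Levene's test is just a one-way ANOVA on absolute deviations from group means, which can be written directly in base R (car::leveneTest packages the same idea). Grouping the standardized residuals by low vs high fitted value, as sketched below, is one common choice assumed here, not taken from the report:

```r
# Sketch: Levene's test via ANOVA on |x - group mean|.
levene.simple <- function(x, g) {
  g <- as.factor(g)
  d <- abs(x - ave(x, g))            # absolute deviation from group mean
  anova(lm(d ~ g))[["Pr(>F)"]][1]    # p-value of the one-way ANOVA
}

set.seed(3)
x <- runif(45)
y <- 3 + 2 * x + rnorm(45, sd = 0.2)   # homoscedastic by construction
fit <- lm(y ~ x)
grp <- fitted(fit) > median(fitted(fit))
p <- levene.simple(rstandard(fit), grp)   # large p: homogeneous variance
```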



    9. Coefficient of Determination (adjusted R2): The transformed model has adjusted R2 = 92.97%, which means it explains about 92.97% of the variability.

    10. Fisher-Snedecor Statistic: The transformed model has F-statistic = 117.3 and its p-value < 2.2e-16, effectively zero, which means the model explains the variability significantly better than a model with no predictors.
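The F-statistic p-value reported by summary.lm() is an upper-tail probability of the F(5, 39) distribution, exactly as computed by pf() in the appendix's MODEL.ACCURACY():

```r
# Sketch: upper-tail F probability for the transformed model's statistic.
p.fstat <- pf(117.3, df1 = 5, df2 = 39, lower.tail = FALSE)
# p.fstat is far below 2.2e-16, matching the reported "p-value < 2.2e-16"
```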

    11. Akaike Information Criterion (AIC): A measure of the relative quality of statistical models. The transformed model yields AIC = -227.19, far lower than the naive model's 534.25.
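For an lm fit, AIC equals 2k - 2*logLik (with k counting the residual variance as a parameter), so lower is better. The identity can be checked directly on simulated data (invented for illustration):

```r
# Sketch: AIC computed by hand agrees with AIC() for a linear model.
set.seed(5)
x <- runif(45)
y <- 3 + 2 * x + rnorm(45, sd = 0.2)
fit <- lm(y ~ x)

ll <- logLik(fit)
aic.byhand <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
all.equal(aic.byhand, AIC(fit))   # TRUE
```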

    12. Modified Coefficient of Efficiency (E*): E* is less sensitive to large values because the deviation terms are not squared. This model has E* = 0.7569.
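E* here matches the mod.coefVAR computed in the appendix: one minus the ratio of summed absolute errors to summed absolute deviations from the mean. A tiny worked sketch on made-up numbers:

```r
# Sketch: modified coefficient of efficiency (absolute, not squared, terms).
mod.E <- function(obs, pred) {
  1 - sum(abs(obs - pred)) / sum(abs(obs - mean(obs)))
}

obs  <- c(10, 12, 14, 16)
pred <- c(10, 13, 14, 15)
mod.E(obs, pred)   # 1 - 2/8 = 0.75
```

A perfect fit gives E* = 1; a model no better than the mean gives E* = 0.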

    There are other testing procedures available, but these are the most powerful. With this we can finalize our model.


    The best fit regression model is

    Y^(-0.277) = 2.998 - 6.471e-3*X1 - 5.618e-5*X2 + 1.814e-2*X3 - 2.139e-3*A + 6.259e-6*X1*X2

    Y: Monthly Mortgage Payment ($)

    X1: Household Disposable Income (x1000 $)

    X2: Square Footage of housing units

    X3: Mortgage Type

    A: Housing Units Age
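To predict on the original dollar scale, the fitted value must be back-transformed. The appendix uses car::bcPower, i.e. z = (Y^λ - 1)/λ with λ = -0.277, so the inverse is Y = (1 + λz)^(1/λ). The helper predict.mortgage below and its input values are hypothetical, invented for illustration, not part of the report's code:

```r
# Sketch: evaluate the fitted equation and invert the Box-Cox transform.
# lambda and the coefficients come from the report; inputs are made up.
predict.mortgage <- function(X1, X2, X3, A, lambda = -0.277) {
  z <- 2.998 - 6.471e-3 * X1 - 5.618e-5 * X2 +
       1.814e-2 * X3 - 2.139e-3 * A + 6.259e-6 * X1 * X2
  (1 + lambda * z)^(1 / lambda)   # invert z = (Y^lambda - 1) / lambda
}

y.hat <- predict.mortgage(X1 = 50, X2 = 2000, X3 = 1, A = 10)
```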


    13. Some of the tests for model validation were not performed explicitly but are implicit in the model. They are noted in the following lines:

    i. Multicollinearity: A test for multicollinearity was not systematically performed. It requires regressing each independent variable on the remaining independent variables and computing the coefficient of determination (and hence the VIF) for each regression. In this model, multicollinearity exists with X2, but removing X2 was found to violate normality and other assumptions.

    ii. Polynomial interaction terms: Since adjusted R2 is above 90%, polynomial interaction terms are not included in the final model.

    iii. Outliers: Outliers were not explicitly identified. These are points with high deviations from the predicted values (usually standardized residuals |e_i| > 2). Two points lie outside +/-2; removing them might give a better model.

    iv. Since not many candidate models were formulated, Mallow's Cp, which identifies the best model among a set of models, is not implemented here.

    v. In MODEL.ACCURACY(), a function created to check reliability, the significance test for the intercept (β0) is not explicitly performed, but it is implied by the t-statistics of the individual βs.


  • VI. APPENDIX The following program, written in R, is used to simulate the above results. (The code was flattened in the source; it is reflowed below. The data-frame name `dat` and a few truncated lines, marked with comments, are reconstructed.)

    #read data
    dat=read.csv("~/My R Codes/Data/LabDataGEO6161.csv",header=T)   # name 'dat' assumed; truncated in source
    attach(dat); length(Y)

    #split window
    dev.list()
    mat=matrix(c(1,2,1,2,3,4,3,5),2,4)
    layout(mat)

    #scatter plot for each variable
    plot(X1,Y,main="Y - X1",ylab="Y",las=1)
    plot(X2,Y,main="Y - X2",ylab="Y",las=1)   # source showed plot(-log10(A),...); X2 matches the title
    boxplot(Y~X3,main="Y - X3",ylab="Y",xlab="X3",las=1)
    boxplot(X1~X3,main="X1 - X3",ylab="X1",xlab="X3",las=1)
    boxplot(X2~X3,main="X2 - X3",ylab="X2",xlab="X3",las=1)

    TEST.ASSUMPTIONS=function(reg.sample,Yi) {
      error.sample=rstandard(reg.sample)
      mat=matrix(c(1,1,2,3),2,2); layout(mat)
      #anderson-darling normality
      qqnorm(error.sample,datax=TRUE); qqline(error.sample,datax=TRUE)
      p.norm=nortest::ad.test(error.sample)$p.value
      norm=ifelse(p.norm<0.05,"Not Normal","Normal")   # line truncated in source; reconstructed
      # ... (remainder of TEST.ASSUMPTIONS lost in the source)
    }

    MODEL.ACCURACY=function(reg.sample,Yi) {     # header reconstructed from the calls below
      r.sq.adj=summary(reg.sample)$adj.r.squared   # reconstructed; truncated in source
      fstat=summary(reg.sample)$fstatistic         # reconstructed; truncated in source
      p.fstat=pf(fstat[1],fstat[2],fstat[3],lower.tail=F)
      mod.coefVAR=1-(sum(abs(Yi-reg.sample$fitted.values))/sum(abs(Yi-mean(Yi))))
      RESULTS.B=list("R2-adj:"=r.sq.adj,"F-statistic:"=c(fstat,round(p.fstat,4)),
                     "AIC"=AIC(reg.sample),"Modified E*:"=mod.coefVAR)
      return(RESULTS.B)
    }

    #naive model
    y.reg.naive=lm(Y~X1+X2+X3+A); summary(y.reg.naive)
    MODEL.ACCURACY(y.reg.naive,Y)

    #correlation matrix
    cov.mat=round(cor(dat[c("Y","X1","X2","X3","A")]),2); cov.mat

    #model improvement
    y.reg=lm(Y~X1+X2+X3+X1:X2); summary(y.reg)
    MODEL.ACCURACY(y.reg,Y)
    TEST.ASSUMPTIONS(y.reg,Y)

    #power transformation (powerTransform and bcPower are from the car package)
    power.transform=powerTransform(y.reg)
    Y.transform=bcPower(Y,power.transform$lambda)
    y.reg.transform=lm(Y.transform~X1+X2+X3+X1:X2+A); summary(y.reg.transform)
    y.reg.transform$coefficients
    MODEL.ACCURACY(y.reg.transform,Y.transform)
    TEST.ASSUMPTIONS(y.reg.transform,Y.transform)
