Bivariate analysis
The Multiple Regression Model
Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (Xi)
Multiple Regression Model with k Independent Variables:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

(β0 = Y-intercept; β1 … βk = population slopes; εi = random error)
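The model above can be illustrated with a small simulation. This is a sketch, not part of the slides: it generates data from a two-predictor version of the model with coefficients chosen for illustration, then recovers them by ordinary least squares.

```python
import numpy as np

# Simulate Y = b0 + b1*X1 + b2*X2 + e (illustrative coefficients, not slide data)
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))            # two independent variables
beta = np.array([3.0, 1.5, -2.0])      # b0, b1, b2
e = rng.normal(scale=0.1, size=n)      # random error term
y = beta[0] + X @ beta[1:] + e

# Least-squares fit with an intercept column in the design matrix
A = np.column_stack([np.ones(n), X])
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(b_hat)                           # close to [3.0, 1.5, -2.0]
```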
Assumptions of Regression
Use the acronym LINE:
• Linearity – the underlying relationship between X and Y is linear
• Independence of Errors – error values are statistically independent
• Normality of Error – error values (ε) are normally distributed for any given value of X
• Equal Variance (Homoscedasticity) – the probability distribution of the errors has constant variance
Regression Statistics
Multiple R         0.998368
R Square           0.996739
Adjusted R Square  0.995808
Standard Error     1.350151
Observations       28
ANOVA
            df    SS        MS        F         Significance F
Regression   6    11701.72  1950.286  1069.876  5.54E-25
Residual    21    38.28108  1.822908
Total       27    11740

r² = SSR / SST = 11701.72 / 11740 = 0.996739

99.674% of the variation in Y is explained by the independent variables.
Adjusted r2
• r² never decreases when a new X variable is added to the model
  – This can be a disadvantage when comparing models
• What is the net effect of adding a new variable?
  – We lose a degree of freedom when a new X variable is added
  – Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
• Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used (where n = sample size, k = number of independent variables)
  – Penalizes excessive use of unimportant independent variables
  – Smaller than r²
  – Useful in comparing among models
Adjusted r²

r²adj = 1 − (1 − r²) · (n − 1) / (n − k − 1)
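Plugging the slide's own numbers (r² = 0.996739, n = 28, k = 6) into this formula reproduces the Adjusted R Square in the regression output, up to rounding of r²:

```python
def adjusted_r2(r2, n, k):
    """r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)"""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2_adj = adjusted_r2(0.996739, n=28, k=6)
print(r2_adj)   # ≈ 0.995807, matching Adjusted R Square = 0.995808 up to rounding
```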
Error and coefficients relationship
• b1 = Covar(y, x) / Varp(x)

         Y           X1         X2        X3          X4          X5         X6
Varp     419.28571   1103.4439  115902.4  1630165.82  36245060.6  706538.59  195.9184
Covar                662.14286  6862.5    25621.4286  120976.786  16061.643  257.1429
b1                   0.6000694  0.059209  0.01571707  0.00333775  0.0227329  1.3125
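The covariance/variance bullet can be checked in the simple (one-predictor) case. This sketch uses made-up data, not the slide's dataset; note that with several correlated predictors, the multiple-regression coefficients are not equal to these one-at-a-time ratios.

```python
import numpy as np

# Illustrative data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

covar_p = np.mean(x * y) - np.mean(x) * np.mean(y)   # population covariance
var_p = np.mean(x**2) - np.mean(x)**2                # population variance
b1 = covar_p / var_p                                 # simple-regression slope
b0 = np.mean(y) - b1 * np.mean(x)                    # intercept
print(b1, b0)                                        # 1.94 0.15
```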
Is the Model Significant?
• F Test for Overall Significance of the Model
• Shows if there is a linear relationship between all of the X variables considered together and Y
• Use F-test statistic
• Hypotheses: H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
F Test for Overall Significance
• Test statistic:

F = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))

where F has k (numerator) and (n − k − 1) (denominator) degrees of freedom.
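Computing the F statistic from the ANOVA table shown earlier (SSR = 11701.72, SSE = 38.28108, k = 6, n = 28) reproduces the reported value:

```python
# Values taken from the ANOVA table above
SSR, SSE, k, n = 11701.72, 38.28108, 6, 28
MSR = SSR / k                # mean square regression ≈ 1950.286
MSE = SSE / (n - k - 1)      # mean square error ≈ 1.822908
F = MSR / MSE
print(round(F, 2))           # ≈ 1069.88, matching F = 1069.876 in the output
```

The reported Significance F of 5.54E-25 is the upper-tail probability of this F value with (6, 21) degrees of freedom, so H0 is rejected.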
Case discussion
Multiple Regression Assumptions
Assumptions:
• The errors are normally distributed
• Errors have a constant variance
• The model errors are independent

Errors (residuals) from the regression model:

ei = Yi − Ŷi
Error terms and coefficient estimates
• Once we think of the Error term as a random variable, it becomes clear that the estimates of b1, b2, … (as distinguished from their true values) will also be random variables, because the estimates generated by the SSE criterion will depend upon the particular value of e drawn by nature for each individual in the data set.
Statistical Inference and Goodness of fit
• The parameter estimates are themselves random variables, dependent upon the random variables e.
• Thus, each estimate can be thought of as a draw from some underlying probability distribution, the nature of that distribution as yet unspecified.
• If we assume that the error terms e are all drawn from the same normal distribution, it is possible to show that the parameter estimates have a normal distribution as well.
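This sampling-distribution idea can be made concrete with a simulation. The sketch below (illustrative values, not slide data) draws many samples from Y = 2 + 0.5X + e with normal errors and re-estimates the slope each time; the estimates cluster around the true 0.5 with an approximately normal spread.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
slopes = []
for _ in range(2000):
    e = rng.normal(scale=1.0, size=x.size)   # fresh error draw each sample
    y = 2.0 + 0.5 * x + e
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # OLS slope
    slopes.append(b1)
slopes = np.array(slopes)
print(slopes.mean(), slopes.std())   # mean ≈ 0.5; spread reflects the error term
```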
T Statistic and P value
• t = (b1 − β1) / S_b1

Discussion: can we hypothesize a particular value for β1 (for example, that it equals the b1 estimate) and carry out the t test against it?
Are Individual Variables Significant?
• Use t tests of individual variable slopes
• Shows if there is a linear relationship between the variable Xj and Y
• Hypotheses:
– H0: βj = 0 (no linear relationship)
– H1: βj ≠ 0 (linear relationship does exist between Xj and Y)
Are Individual Variables Significant?

Test Statistic:

t = (bj − 0) / S_bj    (df = n − k − 1)
           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept  -59.0661      11.28404        -5.23448  3.45E-05  -82.5325   -35.5996
OFF        -0.00696      0.04619         -0.15068  0.881663  -0.10302   0.089097
BAR        0.041988      0.005271        7.966651  8.81E-08  0.031028   0.052949
YNG        0.002716      0.000999        2.717326  0.012904  0.000637   0.004794
VEH        0.00147       0.000265        5.540878  1.69E-05  0.000918   0.002021
INV        -0.00274      0.001336        -2.05135  0.052914  -0.00552   3.78E-05
SPD        -0.2682       0.068418        -3.92009  0.000786  -0.41049   -0.12592

with n − (k + 1) degrees of freedom
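Each t Stat in the table is just the coefficient divided by its standard error. Checking the BAR row:

```python
# BAR row from the coefficient table above
b_bar, s_bar = 0.041988, 0.005271
t_bar = b_bar / s_bar
print(round(t_bar, 2))   # ≈ 7.97, matching t Stat = 7.966651 up to rounding
```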
Confidence Interval Estimate for the Slope
• Confidence interval for the population slope βj
bj ± t(n−k−1) · S_bj

where t has (n − k − 1) degrees of freedom.
Example: Form a 95% confidence interval for the effect of changes in Bars on fatal accidents:
0.041988 ±(2.079614 )(0.005271)So the interval is (0.031028, 0.052949 )
(This interval does not contain zero, so BAR has a significant effect on accidents.)
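The interval can be reproduced directly from the table values, using the t critical value 2.079614 quoted in the example (df = 28 − 6 − 1 = 21):

```python
# BAR slope and standard error from the table; t critical value from the example
b_bar, s_bar, t_crit = 0.041988, 0.005271, 2.079614
lo = b_bar - t_crit * s_bar
hi = b_bar + t_crit * s_bar
print(lo, hi)   # ≈ (0.031028, 0.052949); zero lies outside, so BAR is significant
```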
Using Dummy Variables
• A dummy variable is a categorical explanatory variable with two levels:
  – yes or no, on or off, male or female
  – coded as 0 or 1
• Regression intercepts are different if the variable is significant
• Assumes equal slopes for other variables
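A quick sketch with simulated data (illustrative coefficients, not slide data) shows both properties: the dummy shifts the intercept, while the slope on X1 is shared by both groups because no interaction term is included.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(0, 10, n)
d = rng.integers(0, 2, n)                  # dummy variable coded 0/1
y = 1.0 + 2.0 * x1 + 5.0 * d + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), x1, d])
b0, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]
# Intercept is b0 for d = 0 and b0 + b2 for d = 1; slope b1 is common to both
print(b0, b1, b2)
```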
Interaction Between Independent Variables
• Hypothesizes interaction between pairs of X variables
  – The response to one X variable may vary at different levels of another X variable
• Contains a cross-product term:

Ŷ = b0 + b1X1 + b2X2 + b3X3
  = b0 + b1X1 + b2X2 + b3(X1X2)
Effect of Interaction
• Given: Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
• Without the interaction term, the effect of X1 on Y is measured by β1
• With the interaction term, the effect of X1 on Y is measured by β1 + β3X2
• The effect changes as X2 changes
Interaction Example

Suppose X2 is a dummy variable and the estimated regression equation is:

Ŷ = 1 + 2X1 + 3X2 + 4X1X2

X2 = 1: Ŷ = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
X2 = 0: Ŷ = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

Slopes are different if the effect of X1 on Y depends on the X2 value.

[Figure: the two lines Ŷ = 4 + 6X1 and Ŷ = 1 + 2X1 plotted against X1]
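The changing slope can be verified numerically by evaluating the example equation at each dummy level:

```python
# Example equation from the slide: Y-hat = 1 + 2*X1 + 3*X2 + 4*X1*X2
def y_hat(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# Slope on X1 = change in Y-hat per unit change in X1, at each X2 level
slope_d0 = y_hat(1, 0) - y_hat(0, 0)   # X2 = 0
slope_d1 = y_hat(1, 1) - y_hat(0, 1)   # X2 = 1
print(slope_d0, slope_d1)              # 2 6
```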
Residual Analysis
• The residual for observation i, ei, is the difference between its observed and predicted value:

ei = Yi − Ŷi

• Check the assumptions of regression by examining the residuals
  – Examine for the linearity assumption
  – Evaluate the independence assumption
  – Evaluate the normal distribution assumption
  – Examine for constant variance at all levels of X (homoscedasticity)
• Graphical analysis of residuals
  – Can plot residuals vs. X
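Computing residuals is straightforward once a model is fit. A sketch on simulated data (illustrative, not from the slides); with an intercept in the model, least-squares residuals average to numerically zero by construction, and plotting them against X is the standard visual check of the assumptions above.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 100)
y = 1.0 + 3.0 * x + rng.normal(scale=0.4, size=100)

A = np.column_stack([np.ones(x.size), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef           # e_i = Y_i - Y-hat_i
print(residuals.mean())            # ≈ 0 when the model includes an intercept
```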
Residual Analysis for Independence

[Figure: residuals-vs-X plots — a systematic pattern in the residuals indicates the errors are not independent; a random scatter indicates independence]
Residual Analysis for Equal Variance

[Figure: residuals-vs-x plots — a fan-shaped spread indicates non-constant variance; an even band indicates constant variance]
Linear vs. Nonlinear Fit

[Figure: Y-vs-X scatter plots with fitted curves and residual plots — a linear fit to curved data does not give random residuals, while a nonlinear fit gives random residuals]
Quadratic Regression Model
Quadratic models may be considered when the scatter diagram takes on one of the following shapes:

Yi = β0 + β1X1i + β2X1i² + εi

β1 = the coefficient of the linear term
β2 = the coefficient of the squared term

[Figure: four scatter shapes against X1 — (β1 < 0, β2 > 0), (β1 > 0, β2 > 0), (β1 < 0, β2 < 0), (β1 > 0, β2 < 0)]
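A quadratic model is still linear in its coefficients, so it can be fit by ordinary least squares. A sketch on simulated data (illustrative coefficients, not from the slides):

```python
import numpy as np

# Simulate Y = b0 + b1*X1 + b2*X1^2 + e with b0=2.0, b1=-1.0, b2=0.8
rng = np.random.default_rng(4)
x = np.linspace(-3, 3, 120)
y = 2.0 - 1.0 * x + 0.8 * x**2 + rng.normal(scale=0.2, size=x.size)

# polyfit returns coefficients highest degree first: [b2, b1, b0]
b2, b1, b0 = np.polyfit(x, y, deg=2)
print(b0, b1, b2)   # close to 2.0, -1.0, 0.8
```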