Simple linear regression Chap 10 IPS



Page 1: Simple linear regression Chap 10 IPS

Simple linear regression Chap 2/3p90

Given n observations on the explanatory variable x and the response variable y,

(x_1, y_1), (x_2, y_2), …, (x_n, y_n),

the statistical model for simple linear regression of y on x is given by

y_i = β_0 + β_1 x_i + ε_i,

where the ε_i are independent and normally distributed with mean 0 and standard deviation σ. β_0, β_1 and σ are the model parameters.

β_0 and β_1 are estimated by minimizing the sum of squares of errors.

β_1 is estimated by β̂_1 = SS_xy / SS_xx.

β_0 is estimated by β̂_0 = ȳ − β̂_1 x̄,

where

SS_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = Σ_{i=1}^n x_i y_i − n x̄ ȳ

and

SS_xx = Σ_{i=1}^n (x_i − x̄)² = Σ_{i=1}^n x_i² − n x̄².


Page 2: Simple linear regression Chap 10 IPS

Example (Ex 10.20, p. 697, IPS)

Data Display

Row   HR    VO2
  1   94   0.473
  2   96   0.753
  3   95   0.929
  4   95   0.939
  5   94   0.832
  6   95   0.983
  7   94   1.049
  8  104   1.178
  9  104   1.176
 10  106   1.292
 11  108   1.403
 12  110   1.499
 13  113   1.529
 14  113   1.599
 15  118   1.749
 16  115   1.746
 17  121   1.897
 18  127   2.040
 19  131   2.231

VO2 = oxygen uptake, HR = heart rate

[Scatterplot of VO2 versus HR]

Page 3: Simple linear regression Chap 10 IPS

Regression Analysis

The regression equation is
VO2 = - 2.80 + 0.0387 HR

Predictor       Coef      StDev       T      P
Constant     -2.8044     0.2583  -10.86  0.000
HR          0.038652   0.002400   16.10  0.000

S = 0.1205   R-Sq = 93.8%   R-Sq(adj) = 93.5%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  3.7619  3.7619  259.27  0.000
Residual Error  17  0.2467  0.0145
Total           18  4.0085

Unusual Observations
Obs   HR     VO2     Fit  StDev Fit  Residual  St Resid
  1   94  0.4730  0.8289     0.0417   -0.3559    -3.15R

R denotes an observation with a large standardized residual
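The fitted coefficients can be reproduced from the SS_xy/SS_xx formulas on page 1. A minimal Python sketch (Python stands in for Minitab here), using the HR/VO2 data above:

```python
# Reproduce the Minitab estimates from the least-squares formulas.
hr  = [94, 96, 95, 95, 94, 95, 94, 104, 104, 106,
       108, 110, 113, 113, 118, 115, 121, 127, 131]
vo2 = [0.473, 0.753, 0.929, 0.939, 0.832, 0.983, 1.049, 1.178,
       1.176, 1.292, 1.403, 1.499, 1.529, 1.599, 1.749, 1.746,
       1.897, 2.040, 2.231]

n = len(hr)
x_bar, y_bar = sum(hr) / n, sum(vo2) / n
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hr, vo2))
ss_xx = sum((x - x_bar) ** 2 for x in hr)
ss_yy = sum((y - y_bar) ** 2 for y in vo2)   # total SS (4.0085 in the ANOVA table)

b1 = ss_xy / ss_xx            # slope:     Minitab gives 0.038652
b0 = y_bar - b1 * x_bar       # intercept: Minitab gives -2.8044
sse = ss_yy - b1 * ss_xy      # residual SS (0.2467 in the ANOVA table)
s = (sse / (n - 2)) ** 0.5    # S = 0.1205
r_sq = 1 - sse / ss_yy        # R-Sq = 93.8%
```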

[Regression plot: Y = -2.80435 + 3.87E-02 X, R-Sq = 93.8%]

Page 4: Simple linear regression Chap 10 IPS

[Plot: Residuals versus HR (response is VO2)]

[Plot: Residuals versus the fitted values (response is VO2)]

[Normal probability plot of the residuals (response is VO2)]

Page 5: Simple linear regression Chap 10 IPS

[Histogram of the residuals (response is VO2)]
Page 6: Simple linear regression Chap 10 IPS

Minitab commands for the above regression analysis


Page 9: Simple linear regression Chap 10 IPS

Confidence Intervals and Significance Tests for Regression Slope and Intercept

CI for β_0 is given by β̂_0 ± t_{α/2} S_{β̂_0}.

CI for β_1 is given by β̂_1 ± t_{α/2} S_{β̂_1},

where

S²_{β̂_1} = MSE / Σ(x_i − x̄)²

and

S²_{β̂_0} = MSE ( 1/n + x̄² / Σ(x_i − x̄)² ).
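As a check, these standard-error formulas reproduce the StDev column of the Minitab output on page 3. A minimal Python sketch, reusing the HR/VO2 data from page 2:

```python
# Standard errors of the slope and intercept from MSE and SS_xx.
hr  = [94, 96, 95, 95, 94, 95, 94, 104, 104, 106,
       108, 110, 113, 113, 118, 115, 121, 127, 131]
vo2 = [0.473, 0.753, 0.929, 0.939, 0.832, 0.983, 1.049, 1.178,
       1.176, 1.292, 1.403, 1.499, 1.529, 1.599, 1.749, 1.746,
       1.897, 2.040, 2.231]

n = len(hr)
x_bar, y_bar = sum(hr) / n, sum(vo2) / n
ss_xx = sum((x - x_bar) ** 2 for x in hr)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hr, vo2)) / ss_xx
b0 = y_bar - b1 * x_bar
mse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(hr, vo2)) / (n - 2)

se_b1 = (mse / ss_xx) ** 0.5                          # Minitab: 0.002400
se_b0 = (mse * (1 / n + x_bar ** 2 / ss_xx)) ** 0.5   # Minitab: 0.2583
```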

Example (Ex above)

Significance tests for Regression Slope

To test the null hypothesis H_0: β_1 = 0, compute the test statistic

t = β̂_1 / S_{β̂_1} = r √(n − 2) / √(1 − r²).

This test statistic has a t distribution with n − 2 degrees of freedom.

Examples (Ex above)

In Ex 10.20 above, test

1) H_0: β_1 = 0 against H_1: β_1 ≠ 0

2) H_0: β_1 = 0 against H_1: β_1 > 0

3) H_0: β_1 = 0 against H_1: β_1 < 0

Page 10: Simple linear regression Chap 10 IPS

4) H_0: β_1 = 0.03 against H_1: β_1 > 0.03

5) H_0: β_0 = 0 against H_1: β_0 ≠ 0

6) H_0: β_0 = 0 against H_1: β_0 < 0
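For test 1, both forms of the test statistic give the same value; a quick Python check using only the rounded values in the Minitab output on page 3:

```python
from math import sqrt

# Test 1: H0: beta_1 = 0 against H1: beta_1 != 0, for the HR/VO2 example.
b1, se_b1 = 0.038652, 0.002400    # Coef and StDev for HR (page 3)
ssr, sst, n = 3.7619, 4.0085, 19  # ANOVA table values

t1 = b1 / se_b1                        # first form of the statistic
r = sqrt(ssr / sst)                    # correlation (positive, since b1 > 0)
t2 = r * sqrt(n - 2) / sqrt(1 - r**2)  # second form

# Both agree with Minitab's T = 16.10 (df = n - 2 = 17), and
# t^2 equals the ANOVA F statistic, 259.27, up to rounding.
```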

Confidence Interval for the mean response

A level C CI for the mean response μ_y when x takes the value x_p is

μ̂ ± t_{α/2} S_{μ̂},

where

S_{μ̂} = s √( 1/n + (x_p − x̄)² / Σ(x_i − x̄)² )

and t_{α/2} is the value for the t(n − 2) density curve with area C between −t_{α/2} and t_{α/2}.

Page 11: Simple linear regression Chap 10 IPS

Example

Regression Analysis

The regression equation is
Wages = 44.2 + 0.0731 LOS

Predictor     Coef    StDev      T      P
Constant    44.213    2.628  16.82  0.000
LOS        0.07310  0.03015   2.42  0.018

S = 11.98   R-Sq = 9.2%   R-Sq(adj) = 7.6%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       1   843.5  843.5  5.88  0.018
Residual Error  58  8322.9  143.5
Total           59  9166.4

Predicted Values (for LOS = 36 months)
  Fit  StDev Fit        95.0% CI          95.0% PI
46.84       1.86  (43.11, 50.57)    (22.58, 71.11)
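The 95% CI that Minitab prints can be recovered from Fit and StDev Fit; a small Python check (the critical value t_{α/2} ≈ 2.002 for 58 df is taken from a t table, and the only slack is the rounding in the printout):

```python
# 95% CI for the mean wage at LOS = 36: Fit +/- t * (StDev Fit).
fit, se_fit = 46.84, 1.86   # from the Minitab output above
t_crit = 2.002              # t(0.025) critical value, 58 df

lo = fit - t_crit * se_fit
hi = fit + t_crit * se_fit
# Minitab: (43.11, 50.57), matching up to rounding.
```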


Page 12: Simple linear regression Chap 10 IPS

Minitab commands for prediction


Page 14: Simple linear regression Chap 10 IPS

Prediction Interval for a new observation

A level C PI for a new observation with x = x_p is

ŷ ± t_{α/2} S_{ŷ},

where

S_{ŷ} = s √( 1 + 1/n + (x_p − x̄)² / Σ(x_i − x̄)² )

and t_{α/2} is the value for the t(n − 2) density curve with area C between −t_{α/2} and t_{α/2}.

Note: S²_{ŷ} = S²_{μ̂} + MSE.

Example: Give a 95% PI for the wage of an employee with 3 years' experience (i.e. LOS = 36).

Example: Give a 90% PI for the wage of an employee with 3 years' experience (i.e. LOS = 36).
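Using the note above, S_ŷ can be assembled from the StDev Fit and MSE in the output on page 11, which reproduces the printed 95% PI. A Python sketch (t critical values for 58 df taken from a t table):

```python
from math import sqrt

# 95% PI at LOS = 36, using S_yhat^2 = (StDev Fit)^2 + MSE.
fit, se_fit, mse = 46.84, 1.86, 143.5  # from the output on page 11
t_95 = 2.002                           # t(0.025), 58 df
se_pred = sqrt(se_fit**2 + mse)

lo = fit - t_95 * se_pred
hi = fit + t_95 * se_pred
# Minitab: (22.58, 71.11), matching up to rounding.

# For the 90% PI only the critical value changes: t(0.05), 58 df.
t_90 = 1.672
lo90, hi90 = fit - t_90 * se_pred, fit + t_90 * se_pred
```

Note how much wider the PI is than the CI at the same LOS: the MSE term dominates S_ŷ, reflecting the variability of an individual wage around the mean response.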


Page 15: Simple linear regression Chap 10 IPS

Multiple regression

The statistical model for multiple linear regression is

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + … + β_k x_{i,k} + ε_i,

for i = 1, 2, …, n. The errors ε_i are independent and normally distributed with mean 0 and std dev σ.

We will use MINITAB to estimate the β's:

ŷ = β̂_0 + β̂_1 x_1 + β̂_2 x_2 + … + β̂_k x_k

- Interpretation of the regression coefficients
- Tests and CIs for the β's
- ANOVA table
- ANOVA F-test
- R-square = SSR/SST = 1 − SSE/SST and
  R-sq(adj) = 1 − [(n − 1)/(n − (k + 1))](1 − R²) = 1 − MSE/MST, where MST = SST/(n − 1) (p184)
  R-sq(adj) ≤ R-sq
- Estimate of σ²: MSE

Comments
1. 0 ≤ R² ≤ 1.
2. R-sq = 1 only when SSE = 0, i.e. when all regression predictions are perfect (ŷ_i = y_i for all observations).
3. It can be shown that R-sq can be viewed as a coefficient of simple determination between the responses y_i and the fitted values ŷ_i.
4. A large value of R-sq does not necessarily imply that the fitted model is a useful one. For instance, observations may have been taken at only a few levels of the predictor variables.
5. The coefficient of multiple correlation (R) is the positive square root of R-sq.


Page 16: Simple linear regression Chap 10 IPS

Example (A first-order model with quantitative predictors)

Regression Analysis

The regression equation is
gpa = 0.327 + 0.000944 satm - 0.000408 satv + 0.146 hsm + 0.0359 hss + 0.0553 hse

Predictor        Coef      StDev      T      P
Constant       0.3267     0.4000   0.82  0.415
satm        0.0009436  0.0006857   1.38  0.170
satv       -0.0004078  0.0005919  -0.69  0.492
hsm           0.14596    0.03926   3.72  0.000
hss           0.03591    0.03780   0.95  0.343
hse           0.05529    0.03957   1.40  0.164

S = 0.7000   R-Sq = 21.1%   R-Sq(adj) = 19.3%

Analysis of Variance
Source          DF        SS      MS      F      P
Regression       5   28.6436  5.7287  11.69  0.000
Residual Error 218  106.8191  0.4900
Total          223  135.4628
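R-Sq and R-Sq(adj) in this printout follow directly from the ANOVA table and the formulas on the previous page; a short Python check:

```python
# R-sq and adjusted R-sq from the ANOVA table of the gpa regression.
ssr, sse, sst = 28.6436, 106.8191, 135.4628
n, k = 224, 5   # n = total DF + 1 = 224 observations; k = 5 predictors

r_sq = ssr / sst                                      # Minitab: R-Sq = 21.1%
r_sq_adj = 1 - (n - 1) / (n - (k + 1)) * (1 - r_sq)   # Minitab: R-Sq(adj) = 19.3%
```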


Page 17: Simple linear regression Chap 10 IPS

Minitab commands for multiple regression


Page 19: Simple linear regression Chap 10 IPS

- Prediction

Residual Analysis

- We will use residuals for examining the following six types of departures from the model:
  - The regression is nonlinear
  - The error terms do not have constant variance
  - The error terms are not independent
  - The model fits but there are some outliers
  - The error terms are not normally distributed
  - One or more important variables have been omitted from the model

Residual plots
- Residuals vs X or fitted values
- Residuals vs time (when the data are obtained in a time sequence) or other variables
- Residuals vs normal scores
- Stemplots, boxplots of residuals
- Plots of the absolute values of the residuals (or squared residuals) against X or against the fitted values are also helpful in diagnosing nonconstancy of the error variance.

Example. Residual analysis for the above example

[Plot: Standardized residuals versus the fitted values (response is gpa)]

Page 20: Simple linear regression Chap 10 IPS

[Normal probability plot of the standardized residuals (response is gpa)]

[Histogram of the standardized residuals (response is gpa)]