Simple Linear Regression


Page 1: Simple Linear Regression

ETM 620 - 09U

Simple Linear Regression

Often we want to understand the relationships among variables, e.g.,

- SAT scores and college GPA
- car weight and gas mileage
- amount of a certain pollutant in wastewater and bacteria growth in local streams
- number of takeoffs and landings and degree of metal fatigue in aircraft structures

Simplest relationship:

Y = β0 + β1x

[Figure: scatter plot of the response variable, y, against the predictor variable, x.]

Page 2: Simple Linear Regression

Example

An electric power cooperative is concerned about the cost of power outages in the winter, and the analyst has an idea that these costs are directly related to the average temperature during the outage period. A random sampling of power outages over a number of years was conducted and the cost per 100 homes (adjusted for inflation) was determined, with these results:

Temp, °F    Cost/Outage
45          $3,639
42          $4,111
44          $3,928
37          $4,252
33          $5,020
45          $3,838
35          $4,293
38          $4,244
39          $4,227
40          $4,111
30          $5,335

[Figure: scatter plot "Avg. Cost/Outage": cost ($3,000 to $5,500) against temperature (25 to 50).]

Page 3: Simple Linear Regression

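Estimating the regression coefficients

Method of Least Squares

Determine estimates for β0 and β1 so that the sum of the squares of the residuals is minimized, that is,

$$
\min L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
$$

Solution to the minimization gives:

$$
\hat{\beta}_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$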
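As a sketch of the step behind the least-squares solution (not shown on the slide), setting the partial derivatives of L with respect to β0 and β1 to zero gives the two normal equations:

$$
\frac{\partial L}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
\quad\Rightarrow\quad
n \beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i
$$

$$
\frac{\partial L}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0
\quad\Rightarrow\quad
\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
$$

Dividing the first equation by n gives the intercept in terms of the slope and the two sample means; substituting that into the second and solving for β1 gives the slope in terms of the sums of x, y, xy, and x².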

Page 4: Simple Linear Regression

For our example,

Sample   Temp, x   Cost, y   xiyi        xi²
1        45        $3,639    163,755     2025
2        42        $4,111    172,662     1764
3        44        $3,928    172,832     1936
4        37        $4,252    157,324     1369
5        33        $5,020    165,660     1089
6        45        $3,838    172,710     2025
7        35        $4,293    150,255     1225
8        38        $4,244    161,272     1444
9        39        $4,227    164,853     1521
10       40        $4,111    164,440     1600
11       30        $5,335    160,050     900
sum =    428       46998     1805813     16898

$\hat{\beta}_1$ = ______________________________

$\hat{\beta}_0$ = ______________________________
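The calculation for this example can be sketched in a few lines of Python. This is only an illustration: the function name `least_squares` is hypothetical, and the data are the eleven (temperature, cost) pairs from the example. The results should agree with the coefficients in the Excel output shown later in the slides.

```python
# Outage data from the example: temperature (°F) and cost per outage ($).
temps = [45, 42, 44, 37, 33, 45, 35, 38, 39, 40, 30]
costs = [3639, 4111, 3928, 4252, 5020, 3838, 4293, 4244, 4227, 4111, 5335]


def least_squares(x, y):
    """Return (b0, b1) computed from the same column sums as the table."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Slope: (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - Sum(x)^2)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: y-bar minus slope times x-bar
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares(temps, costs)
print(f"slope b1 = {b1:.2f}, intercept b0 = {b0:.2f}")
```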

Page 5: Simple Linear Regression

What does this mean?

We can draw the regression line that describes the relationship between temperature and outage cost:

$$
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
$$

We can also predict the cost of outages based on expected temperatures.

[Figure: "Cost vs Temperature": cost ($3,000 to $5,500) against temperature (25 to 50).]
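As a minimal sketch of such a prediction, the snippet below plugs a few temperatures into the fitted line. The coefficient values are taken from the Excel output later in the slides (rounded to two decimals); the function name `predict_cost` and the chosen temperatures are illustrative assumptions.

```python
# Coefficients as reported in the Excel summary output (rounded).
B0 = 7900.61   # intercept, dollars
B1 = -93.24    # slope, dollars per degree F


def predict_cost(temp_f):
    """Predicted cost per outage at the given average temperature."""
    return B0 + B1 * temp_f


# Temperatures inside the range of the sample data (30 to 45 °F).
for t in (32, 36, 40):
    print(f"{t} F -> ${predict_cost(t):,.2f}")
```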

Page 6: Simple Linear Regression

Dangers of regression analysis

You can regress any variable on any other variable, e.g.,

- hair loss and heart disease
- hours playing video games and number of arrests for violent behavior
- consecutive hours in class and retention of material
- etc.

Which of these relationships can you legitimately claim reflect a causal relationship between the “predictor” and the “response”?

The regression equation is a “best fit” for the data on which it is based, but may lose validity for predictor values outside the range of the data. For example, our outage cost data imply that the cost per outage decreases as the temperature increases. Do you believe that temperatures in the 80s or 90s will result in low-cost outages?
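The extrapolation problem can be made concrete with a short sketch. It uses the rounded coefficients from the Excel output later in the slides and flags any temperature outside the sampled range of 30 to 45 °F; the names `predict_cost`, `TEMP_MIN`, and `TEMP_MAX` are hypothetical.

```python
B0, B1 = 7900.61, -93.24        # rounded coefficients from the Excel output
TEMP_MIN, TEMP_MAX = 30, 45     # lowest and highest temperatures in the sample


def predict_cost(temp_f):
    """Return (predicted cost, True if temp_f lies inside the sampled range)."""
    inside = TEMP_MIN <= temp_f <= TEMP_MAX
    return B0 + B1 * temp_f, inside


for t in (40, 85):
    cost, inside = predict_cost(t)
    note = "" if inside else "  <-- outside the data range, do not trust"
    print(f"{t} F -> ${cost:,.2f}{note}")
```

At 85 °F the fitted line gives a negative cost, which is a clear sign that the straight-line model has been pushed beyond the data that support it.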

Page 7: Simple Linear Regression

How good is our prediction?

Estimating the variance:

$$
\hat{\sigma}^2 = s^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}
$$

Lack of fit test

Tests the hypotheses

H0: the model adequately fits the data
H1: the model does not fit the data

As with our goodness-of-fit tests, a high p-value indicates that the model is adequate.

__________________

(see next page)
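A minimal sketch of the variance estimate for the outage data follows. It refits the line using the centered form of the slope (Sxy / Sxx, which is algebraically equivalent to the sums formula), then computes the residuals, SSE, and s². Variable names are illustrative.

```python
import math

temps = [45, 42, 44, 37, 33, 45, 35, 38, 39, 40, 30]
costs = [3639, 4111, 3928, 4252, 5020, 3838, 4293, 4244, 4227, 4111, 5335]

n = len(temps)
x_bar = sum(temps) / n
y_bar = sum(costs) / n

# Least-squares fit in centered form.
s_xx = sum((x - x_bar) ** 2 for x in temps)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, costs))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals: observed cost minus fitted cost.
residuals = [y - (b0 + b1 * x) for x, y in zip(temps, costs)]

sse = sum(e ** 2 for e in residuals)   # sum of squared residuals
s2 = sse / (n - 2)                     # variance estimate, n - 2 degrees of freedom
print(f"SSE = {sse:.2f}, s^2 = {s2:.2f}, s = {math.sqrt(s2):.2f}")
```

The value of s should match the "Standard Error" line of the Excel regression statistics, and s² the residual mean square in the ANOVA table.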

Page 8: Simple Linear Regression

How good is our prediction?

Coefficient of determination, R²

A measure of the “quality of fit,” or the proportion of the variability explained by the fitted model.

$$
R^2 = 1 - \frac{SSE}{S_{yy}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

Use with care: increasing the number of variables will usually increase R², but this doesn’t necessarily make it a “better” model!
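The definition of R² translates directly into code. The sketch below uses a hypothetical helper, `r_squared`, that takes observed and fitted values; the fitted values come from a least-squares line through the outage data.

```python
temps = [45, 42, 44, 37, 33, 45, 35, 38, 39, 40, 30]
costs = [3639, 4111, 3928, 4252, 5020, 3838, 4293, 4244, 4227, 4111, 5335]


def r_squared(y, y_hat):
    """R^2 = 1 - SSE / Syy for observed y and fitted y_hat."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / s_yy


# Least-squares fit (centered form) to obtain the fitted values.
n = len(temps)
x_bar, y_bar = sum(temps) / n, sum(costs) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, costs))
      / sum((x - x_bar) ** 2 for x in temps))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in temps]

r2 = r_squared(costs, fitted)
print(f"R^2 = {r2:.4f}")
```

The result should agree with the "R Square" entry in the Excel output: roughly 87% of the variability in outage cost is explained by temperature.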

Page 9: Simple Linear Regression

Linear regression in Excel …

Step 1: Graph the data

Does it look like a straight line is the best fit?

[Figure: scatter plot "Avg. Cost/Outage": cost ($0 to $6,000) against temperature (25 to 50).]

Page 10: Simple Linear Regression

Step 2: Perform the analysis

Choose “Regression” from the Data Analysis menu (under Tools).

Input the Y-range (Cost, including the label) and X-range (Temp, including the label), then select

- “Labels” if you included those in your data range.
- Your desired location for the output.
- Residuals and Normal Probability Plot, as desired.

Choose “OK”.

Page 11: Simple Linear Regression

Step 3: Check assumptions

Look at the residuals plot and the normal probability plot.

[Figure: "Temp Residual Plot": residuals (-500 to 500) against Temp (0 to 50).]

[Figure: "Normal Probability Plot": Avg. Cost/Outage (0 to 10000) against Sample Percentile (0 to 150).]

Page 12: Simple Linear Regression

Step 4: Evaluate the results.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9318278
R Square             0.868303
Adjusted R Square    0.85367
Standard Error       189.43373
Observations         11

ANOVA
             df    SS            MS            F       Significance F
Regression   1     2129376.47    2129376.47    59.34   0.00
Residual     9     322966.25     35885.14
Total        10    2452342.73

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept    7900.61        474.43           16.65    0.00      6827.36     8973.85     6827.36       8973.85
Temp         -93.24         12.10            -7.70    0.00      -120.63     -65.86      -120.63       -65.86
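Several entries in this output can be reproduced by hand from the data, which is a useful check on how they relate to one another. The sketch below uses only standard textbook formulas for simple linear regression (F = MSR / MSE, standard errors from s and Sxx); it is not a description of Excel's internal implementation, and the variable names are illustrative.

```python
import math

temps = [45, 42, 44, 37, 33, 45, 35, 38, 39, 40, 30]
costs = [3639, 4111, 3928, 4252, 5020, 3838, 4293, 4244, 4227, 4111, 5335]

n = len(temps)
x_bar, y_bar = sum(temps) / n, sum(costs) / n
s_xx = sum((x - x_bar) ** 2 for x in temps)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, costs))
s_yy = sum((y - y_bar) ** 2 for y in costs)    # total sum of squares

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

ss_reg = b1 * s_xy           # regression sum of squares
ss_res = s_yy - ss_reg       # residual sum of squares (SSE)
ms_res = ss_res / (n - 2)    # residual mean square
f_stat = ss_reg / ms_res     # regression has 1 degree of freedom

s = math.sqrt(ms_res)                                 # "Standard Error"
se_b1 = s / math.sqrt(s_xx)                           # std. error of slope
se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)      # std. error of intercept

print(f"F = {f_stat:.2f}")
print(f"Temp:      {b1:.2f}  se {se_b1:.2f}  t {b1 / se_b1:.2f}")
print(f"Intercept: {b0:.2f}  se {se_b0:.2f}  t {b0 / se_b0:.2f}")
```

Note that in simple linear regression the F statistic equals the square of the slope's t statistic, which is why both lead to the same p-value.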

Page 13: Simple Linear Regression

Step 5: Specify and use the model.

Simple linear model:

$\hat{y}$ = _________________

Use the model to:

- Make predictions
  - expected costs
  - budgeting
- Recommend actions
  - identify and address sources of cost increase

Page 14: Simple Linear Regression

In Minitab …

Step 1: Graph the data (for one or two predictor variables)! Again, do you think a simple linear relationship is the best fit?
Step 2: Select Stat > Regression > Regression …
Step 3: Choose “Response” (y) and “Predictor” (x).
Step 4: In “Options”, check the “Lack of Fit” box. (“Fit Intercept” box should be checked by default.) Click “OK”.
Step 5: In “Graphs”, select the appropriate residual plots to create.
Step 6: Click “OK”.
Step 7: Evaluate the residual plots and results.

Page 15: Simple Linear Regression

Transformation to a straight line …

If simple linear regression is not appropriate because the underlying function is nonlinear, then we have two choices:

- fit a more complex model
- transform the model to a straight-line model

Simplest transformation – logarithmic transformation

Original model:

$$
y = \beta_0 e^{\beta_1 x} \varepsilon
$$

Transformed model:

$$
\ln y = \ln \beta_0 + \beta_1 x + \ln \varepsilon
$$
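A minimal sketch of the logarithmic transformation in practice, assuming an exponential model of the form y = β0·e^(β1·x): regress ln y on x with ordinary least squares, then exponentiate the fitted intercept to recover β0. The data below are synthetic and noise-free (generated with β0 = 2 and β1 = 0.5 purely for illustration), and the function name `fit_line` is hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b


# Synthetic, noise-free data from y = 2 * exp(0.5 * x) (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Fit the straight line ln(y) = ln(beta0) + beta1 * x.
ln_b0, beta1 = fit_line(xs, [math.log(y) for y in ys])
beta0 = math.exp(ln_b0)   # back-transform the intercept

print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}")
```

With real, noisy data the fit minimizes squared error on the log scale rather than on the original scale, so the residual plots should be checked for the transformed model.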