Transcript
Page 1: Lecture 13. Inference for regression

Lecture 13. Inference for regression

Page 2: Lecture 13. Inference for regression

Objectives

Inference for regression

(NHST Regression Inference Award)[B level award]

The regression model

Confidence interval for the regression slope β

Testing the hypothesis of no linear relationship

Inference for prediction

Conditions for inference

Page 3: Lecture 13. Inference for regression

Common Questions Asked of Bivariate Relationships

Most scatterplots are created from sample data.

Is the observed relationship statistically significant (not explained by chance events due to random sampling alone)?

What is the population mean response μy as a function of the explanatory variable x?

μy = α + βx

Page 4: Lecture 13. Inference for regression

The regression model

The least-squares regression line

ŷ = a + bx

is a mathematical model of the relationship between two quantitative variables:

“observed data = fit + residual”

The regression line is the fit.

For each data point in the sample, the residual is the difference (y - ŷ).
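
As an illustration of "observed data = fit + residual", here is a minimal Python sketch (with made-up sample values, not data from the lecture) that computes the least-squares slope b and intercept a and splits each observation into fit plus residual.

```python
# Minimal sketch with made-up data: least-squares fit y-hat = a + b*x,
# then "observed data = fit + residual" for each point.
import numpy as np

x = np.array([19.0, 21.0, 22.0, 24.0, 26.0, 28.0])   # hypothetical explanatory values
y = np.array([38.0, 42.0, 45.0, 49.0, 55.0, 60.0])   # hypothetical responses

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # least-squares slope
a = y.mean() - b * x.mean()                          # least-squares intercept
y_hat = a + b * x                                    # the fit
residuals = y - y_hat                                # observed data minus fit

print(f"y-hat = {a:.2f} + {b:.2f} x")
print("residuals:", np.round(residuals, 2))
```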

Page 5: Lecture 13. Inference for regression

At the population level, the model becomes

yi = (α + βxi) + εi

with the residuals εi independent and Normally distributed N(0, σ).

The population mean response μy is

μy = α + βx

ŷ is an unbiased estimate for the mean response μy
a is an unbiased estimate for the intercept α
b is an unbiased estimate for the slope β
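
A quick way to see what these unbiased estimates mean is to simulate the population model and check that a and b land near α and β. This is a minimal sketch with assumed parameter values (the α, β, σ below are not from the lecture):

```python
# Minimal sketch with assumed parameters: simulate y_i = alpha + beta*x_i + eps_i,
# eps_i ~ N(0, sigma), then fit by least squares and compare a, b to alpha, beta.
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, sigma = -6.0, 2.3, 2.8          # assumed population parameters
x = rng.uniform(15, 30, size=500)            # explanatory variable
y = alpha + beta * x + rng.normal(0, sigma, size=500)   # population regression model

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
print(f"a = {a:.2f} (alpha = {alpha}),  b = {b:.3f} (beta = {beta})")
```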

Page 6: Lecture 13. Inference for regression

s is an unbiased estimate of the regression standard deviation σ.

The model has constant variance of Y (σ is the same for all values of x).

For any fixed x, the responses y follow a Normal distribution with standard deviation σ.

The regression standard error, s, for n sample data points is computed from the residuals (yi − ŷi):

s = √( Σ residuals² / (n − 2) ) = √( Σ(yi − ŷi)² / (n − 2) )
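
A minimal sketch of that computation, reusing the fit from the earlier snippet: the sum of squared residuals is divided by n − 2 (not n), because two parameters were estimated.

```python
# Minimal sketch: regression standard error s = sqrt( sum((y - y_hat)^2) / (n - 2) ).
import numpy as np

def regression_standard_error(y, y_hat):
    """Estimate of sigma from the residuals, using n - 2 degrees of freedom."""
    n = len(y)
    return np.sqrt(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2) / (n - 2))

# example: s = regression_standard_error(y, a + b * x) with x, y, a, b as above
```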

Page 7: Lecture 13. Inference for regression

Regression Analysis

The regression equation is
Frequency = -6.19 + 2.33 Temperature

Predictor      Coef  SE Coef      T      P
Constant     -6.190    8.243  -0.75  0.462
Temperature  2.3308   0.3468   6.72  0.000

S = 2.82159   R-Sq = 71.5%   R-Sq(adj) = 69.9%

Frog mating call

Scientists recorded the call frequency of 20 gray tree frogs, Hyla chrysoscelis, as a function of field air temperature.

MINITAB

s = √( Σ residuals² / (n − 2) ) = √( Σ(yi − ŷi)² / (n − 2) )

s: the regression standard error (s² is the "MSE"), an unbiased estimate of σ

Page 8: Lecture 13. Inference for regression

Confidence interval for the slope β

Estimating the regression parameter for the slope is a case of one-sample inference with σ unknown. Hence we rely on t distributions.

The standard error of the slope b is:

SEb = s / √( Σ(x − x̄)² )

(s is the regression standard error)

Thus, a level C confidence interval for the slope is:

estimate ± t*SEestimate

b ± t* SEb

t* is the critical value for the t(df = n − 2) density curve with C% between −t* and +t*

Page 9: Lecture 13. Inference for regression

Regression Analysis

The regression equation is
Frequency = -6.19 + 2.33 Temperature

Predictor      Coef  SE Coef      T      P
Constant     -6.190    8.243  -0.75  0.462
Temperature  2.3308   0.3468   6.72  0.000

S = 2.82159   R-Sq = 71.5%   R-Sq(adj) = 69.9%

Frog mating call

standard error of the slope SEb

We are 95% confident that a 1 degree Celsius increase in temperature results in an increase in mating call frequency of between 1.60 and 3.06 notes/second.

95% CI for the slope is:

b ± t*SEb = 2.3308 ± (2.101)(0.3468) = 2.3308 ± 0.7286, or 1.6022 to 3.0594

n = 20, df = 18

t* = 2.101
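
A minimal sketch that reproduces this interval from the estimates in the Minitab output above (b = 2.3308, SEb = 0.3468, n = 20), using scipy only to look up the critical value t*:

```python
# Minimal sketch: 95% CI for the slope, b ± t* SEb with df = n - 2.
from scipy import stats

b, se_b, n = 2.3308, 0.3468, 20              # from the Minitab output
t_star = stats.t.ppf(0.975, df=n - 2)        # about 2.101 for df = 18
lo, hi = b - t_star * se_b, b + t_star * se_b
print(f"t* = {t_star:.3f}; 95% CI: {lo:.4f} to {hi:.4f}")   # roughly 1.60 to 3.06
```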

Page 10: Lecture 13. Inference for regression

Testing the hypothesis of no relationship

SEb = s / √( Σ(x − x̄)² )

Page 11: Lecture 13. Inference for regression

Regression Analysis

The regression equation is
Frequency = -6.19 + 2.33 Temperature

Predictor      Coef  SE Coef      T      P
Constant     -6.190    8.243  -0.75  0.462
Temperature  2.3308   0.3468   6.72  0.000

S = 2.82159   R-Sq = 71.5%   R-Sq(adj) = 69.9%

Frog mating call

Two-sided P-value

for H0: β = 0

t = b / SEb = 2.3308 / 0.3468 = 6.721, with df = n – 2 = 18,

Table C: t > 3.922, so P < 0.001 (two-sided test); highly significant.

There is a significant relationship between temperature and call frequency.

We test H0: β = 0 versus Ha: β ≠ 0
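
The same test statistic and two-sided P-value can be checked directly; a minimal sketch using the estimates from the output above:

```python
# Minimal sketch: t = b / SEb under H0: beta = 0, two-sided P-value from t(df = n - 2).
from scipy import stats

b, se_b, n = 2.3308, 0.3468, 20
t = b / se_b                                   # about 6.72
p = 2 * stats.t.sf(abs(t), df=n - 2)           # two-sided P-value
print(f"t = {t:.2f}, df = {n - 2}, P = {p:.6f}")
```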

Page 12: Lecture 13. Inference for regression

Testing for lack of correlation

The regression slope b and the correlation coefficient r are related, and b = 0 if and only if r = 0.

Likewise, the population parameter for the slope, β, is related to the population correlation coefficient ρ, and β = 0 if and only if ρ = 0.

Thus, testing the hypothesis H0: β = 0 is the same as testing the hypothesis of no correlation between x and y in the population from which our data were drawn.

b = r (sy / sx)
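
A minimal numerical check of the identity b = r (sy / sx) on arbitrary made-up data (so b = 0 exactly when r = 0):

```python
# Minimal sketch: the slope equals r * (s_y / s_x), so b and r are zero together.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # made-up data
y = np.array([2.1, 2.9, 3.4, 4.2, 5.1])

r = np.corrcoef(x, y)[0, 1]
b_from_r = r * np.std(y, ddof=1) / np.std(x, ddof=1)
b_direct = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
print(np.isclose(b_from_r, b_direct))          # True
```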

Page 13: Lecture 13. Inference for regression

Hand length and body height for a random sample of 21 adult men

Regression Analysis: Hand length (mm) vs Height (m)

The regression equation is
Hand length (mm) = 126 + 36.4 Height (m)

Predictor     Coef  SE Coef     T      P
Constant    125.51    39.77  3.16  0.005
Height (m)   36.41    21.91  1.66  0.113

S = 6.74178   R-Sq = 12.7%   R-Sq(adj) = 8.1%

No significant relationship

Page 14: Lecture 13. Inference for regression

Objectives (PSLS Chapter 23)

Inference for regression

(NHST Regression Inference Award)[B level award] Note Change

The regression model

Confidence interval for the regression slope β

Testing the hypothesis of no linear relationship

Inference for prediction

Conditions for inference

Page 15: Lecture 13. Inference for regression

Two Types of Inference for Prediction

1. One use of regression is for prediction within the range of the data: ŷ = a + bx. But this prediction depends on the particular sample drawn, so we need statistical inference to generalize our conclusions.

2. To estimate an individual response y for a given value of x, we use a prediction interval.

If we randomly sampled many times, there would be many different values of y obtained for a particular x, following N(0, σ) around the mean response µy.

Page 16: Lecture 13. Inference for regression

Confidence interval for µy

Predicting the population mean value of y, µy, for any value of x within the range (domain) of the data tested.

Using inference, we calculate a level C confidence interval for the population mean μy of all responses y when x takes the value x*:

This interval is centered on ŷ, the unbiased estimate of μy.

The true value of the population mean μy at a given value of x will indeed be within our confidence interval for C% of all intervals computed from many different random samples.

Page 17: Lecture 13. Inference for regression

A level C prediction interval for a single observation on y when x takes the value x* is:

ŷ ± t* SEŷ

A level C confidence interval for the mean response μy at a given value x* of x is:

ŷ ± t* SEμ̂

Use t* for a t distribution with df = n – 2
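
The slide does not write out the two standard errors; the standard textbook formulas are SEμ̂ = s √( 1/n + (x* − x̄)² / Σ(x − x̄)² ) for the mean response and SEŷ = s √( 1 + 1/n + (x* − x̄)² / Σ(x − x̄)² ) for an individual observation. A minimal sketch computing either interval under those formulas:

```python
# Minimal sketch (standard textbook formulas, not printed on the slide):
# CI for mu_y and PI for an individual y at a new value x_star.
import numpy as np
from scipy import stats

def interval_at(x, y, x_star, level=0.95, prediction=False):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    a = y.mean() - b * x.mean()
    s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))      # regression standard error
    sxx = np.sum((x - x.mean()) ** 2)
    extra = 1.0 if prediction else 0.0                         # the "+1" turns the CI into a PI
    se = s * np.sqrt(extra + 1.0 / n + (x_star - x.mean()) ** 2 / sxx)
    t_star = stats.t.ppf((1 + level) / 2, df=n - 2)
    fit = a + b * x_star
    return fit - t_star * se, fit + t_star * se
```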

Page 18: Lecture 13. Inference for regression

Confidence interval for the mean call frequency at 22 degrees C: µy within 43.3 and 46.9 notes/sec.

Prediction interval for individual call frequencies at 22 degrees C: y within 38.9 and 51.3 notes/sec.

Minitab
Predicted Values for New Observations: Temperature = 22

New Obs     Fit  SE Fit            95% CI            95% PI
      1  45.088   0.863  (43.273, 46.902)  (38.888, 51.287)
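
This kind of output can be reproduced in Python with statsmodels. The 20 frog measurements are not listed in the transcript, so the data frame below is a made-up placeholder; with the real data, the mean_ci and obs_ci columns would match the Minitab CI and PI above.

```python
# Minimal sketch (placeholder data, NOT the real frog measurements):
# fit Frequency ~ Temperature, then CI for the mean response and a PI at 22 degrees C.
import pandas as pd
import statsmodels.formula.api as smf

frogs = pd.DataFrame({
    "Temperature": [18, 19, 20, 21, 22, 23, 24, 25, 26, 27],
    "Frequency":   [36, 38, 40, 43, 45, 47, 50, 52, 55, 57],
})
res = smf.ols("Frequency ~ Temperature", data=frogs).fit()
pred = res.get_prediction(pd.DataFrame({"Temperature": [22]}))
print(pred.summary_frame(alpha=0.05))   # mean_ci_* = CI for mu_y, obs_ci_* = 95% PI
```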

Page 19: Lecture 13. Inference for regression

Blood alcohol content (BAC) as a function of alcohol consumed (number of beers) for a random sample of 16 adults.

What do you predict for 5 beers consumed?

What are typical values of BAC when 5 beers are consumed? Give a range.

Page 20: Lecture 13. Inference for regression

Conditions for inference

The observations are independent

The relationship is indeed linear (in the parameters)

The standard deviation of y, σ, is the same for all values of x

The response y varies Normally around its mean.

Page 21: Lecture 13. Inference for regression

Using residual plots to check for regression validity

The residuals (y − ŷ) give useful information about the contribution of individual data points to the overall pattern of scatter.

We view the residuals in a residual plot:

If residuals are scattered randomly around 0 with uniform variation, it indicates that the data fit a linear model, have Normally distributed residuals for each value of x, and constant standard deviation σ.
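
A minimal sketch of such a residual plot with matplotlib, given a fitted line (a, b) and sample arrays x and y (hypothetical names, as in the earlier snippets):

```python
# Minimal sketch: plot residuals (y - y_hat) against x to check the conditions.
import numpy as np
import matplotlib.pyplot as plt

def residual_plot(x, y, a, b):
    residuals = np.asarray(y) - (a + b * np.asarray(x))
    plt.scatter(x, residuals)
    plt.axhline(0, linestyle="--")      # residuals should scatter evenly around this line
    plt.xlabel("x")
    plt.ylabel("residual (y - y-hat)")
    plt.show()
```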

Page 22: Lecture 13. Inference for regression

Residuals are randomly scattered → good!

Curved pattern → the shape of the relationship is not modeled correctly.

Change in variability across the plot → σ is not equal for all values of x.

Page 23: Lecture 13. Inference for regression

Frog mating call

The data are a random sample of frogs.

The relationship is clearly linear.

The residuals are roughly Normally distributed.

The spread of the residuals around 0 is fairly homogeneous along all values of x.

Page 24: Lecture 13. Inference for regression

Objectives (PSLS Chapter 23)

Inference for regression

(NHST Regression Inference Award)[B level award] Note Change

The regression model

Confidence interval for the regression slope β

Testing the hypothesis of no linear relationship

Inference for prediction

Conditions for inference