Lecture 13. Inference for regression

Click here to load reader

  • date post

    04-Jan-2016
  • Category

    Documents

  • view

    44
  • download

    0

Embed Size (px)

description

Lecture 13. Inference for regression. Objectives. Inference for regression (NHST Regression Inference Award)[B level award] The regression model Confidence interval for the regression slope β Testing the hypothesis of no linear relationship Inference for prediction - PowerPoint PPT Presentation

Transcript of Lecture 13. Inference for regression

  • Lecture 13. Inference for regression

  • Objectives Inference for regression (NHST Regression Inference Award)[B level award]

    The regression modelConfidence interval for the regression slope Testing the hypothesis of no linear relationshipInference for predictionConditions for inference

  • Most scatterplots are created from sample data. Is the observed relationship statistically significant (not explained by chance events due to random sampling alone)?

    What is the population mean response my as a function of the explanatory variable x? my = a + bxCommon Questions Asked of Bivariate Relationships

  • The regression modelThe least-squares regression line = a + bx is a mathematical model of the relationship between 2 quantitative variables: observed data = fit + residual

    The regression line is the fit.For each data point in the sample, the residual is the difference (y - ).

  • At the population level, the model becomes yi = (a + bxi) + (ei) with residuals ei independent and Normally distributed N(0,s).

    The population mean response my is my = a + bx unbiased estimate for mean response mya unbiased estimate for intercept ab unbiased estimate for slope b

  • The regression standard error, s, for n sample data points is computed from the residuals (yi i):

    s is an unbiased estimate of the regression standard deviation s.The model has constant variance of Y (s is the same for all values of x).For any fixed x, the responses y follow a Normal distribution with standard deviation s.

  • Regression Analysis: The regression equation isFrequency = - 6.19 + 2.33 Temperature

    Predictor Coef SE Coef T PConstant -6.190 8.243 -0.75 0.462Temperature 2.3308 0.3468 6.72 0.000

    S = 2.82159 R-Sq = 71.5% R-Sq(adj) = 69.9%Frog mating callScientists recorded the call frequency of 20 gray tree frogs, Hyla chrysoscelis, as a function of field air temperature.

    MINITABs: aka MSE, regression standard error, unbiased estimate of s

  • Confidence interval for the slope Estimating the regression parameter b for the slope is a case of one-sample inference with unknown. Hence we rely on t distributions.

    The standard error of the slope b is:(s is the regression standard error) Thus, a level C confidence interval for the slope b is: estimate t*SEestimate b t* SEb t* is t critical for t(df = n 2) density curve with C% between t* and +t*

  • Regression Analysis: The regression equation isFrequency = - 6.19 + 2.33 Temperature

    Predictor Coef SE Coef T PConstant -6.190 8.243 -0.75 0.462Temperature 2.3308 0.3468 6.72 0.000

    S = 2.82159 R-Sq = 71.5% R-Sq(adj) = 69.9%Frog mating callstandard error of the slope SEbWe are 95% confident that a 1 degree Celsius increase in temperature results in an increase in mating call frequency between 1.60 and 3.06 notes/seconds. 95% CI for the slope b is: b t*SEb = 2.3308 (2.101)(0.3468) = 2.3308 07286, or 1.6022 to 3.0594n = 20, df = 18t* = 2.101

  • Testing the hypothesis of no relationship

  • Regression Analysis: The regression equation isFrequency = - 6.19 + 2.33 Temperature

    Predictor Coef SE Coef T PConstant -6.190 8.243 -0.75 0.462Temperature 2.3308 0.3468 6.72 0.000

    S = 2.82159 R-Sq = 71.5% R-Sq(adj) = 69.9%Frog mating callTwo-sided P-value for H0: = 0

    t = b / SEb = 2.3308 / 0.3468 = 6.721, with df = n 2 = 18,

    Table C: t > 3.922 P < 0.001 (two-sided test), highly significant. There is a significant relationship between temperature and call frequency. We test H0: = 0 versus Ha: 0

  • Testing for lack of correlationThe regression slope b and the correlation coefficient r are related and b = 0 r = 0.

    Conversely, the population parameter for the slope is related to the population correlation coefficient , and when = 0 = 0.

    Thus, testing the hypothesis H0: = 0 is the same as testing the hypothesis of no correlation between x and y in the population from which our data were drawn.

  • Hand length and body height for a random sample of 21 adult menRegression Analysis: Hand length (mm) vs Height (m)

    The regression equation isHand length (mm) = 126 + 36.4 Height (m)

    Predictor Coef SE Coef T PConstant 125.51 39.77 3.16 0.005Height (m) 36.41 21.91 1.66 0.113

    S = 6.74178 R-Sq = 12.7% R-Sq(adj) = 8.1% No significant relationship

  • Objectives (PSLS Chapter 23)Inference for regression (NHST Regression Inference Award)[B level award] Note Change

    The regression modelConfidence interval for the regression slope Testing the hypothesis of no linear relationshipInference for predictionConditions for inference

  • Two Types of Inference for prediction1. One use of regression is for prediction within range: = a + bx.But this prediction depends on the particular sample drawn. We need statistical inference to generalize our conclusions.

    2. To estimate an individual response y for a given value of x, we use a prediction interval. If we randomly sampled many times, there would be many different values of y obtained for a particular x following N(0,) around the mean response y.

  • Confidence interval for yPredicting the population mean value of y, y, for any value of x within the range (domain) of data tested.

    Using inference, we calculate a level C confidence interval for the population mean y of all responses y when x takes the value x*:

    This interval is centered on , the unbiased estimate of y. The true value of the population mean y at a given value of x, will indeed be within our confidenceinterval in C% of all intervals computed from many different random samples.

  • A level C prediction interval for a single observation on y when x takes the value x* is: t* SEA level C confidence interval for the mean response y at a given value x* of x is: t* SEm^Use t* for a t distribution with df = n 2

  • Confidence interval for the mean call frequency at 22 degrees C: y within 43.3 and 46.9 notes/secPrediction interval for individual call frequencies at 22 degrees C: within 38.9 and 51.3 notes/secMinitabPredicted Values for New Observations: Temperature=22

    NewObs Fit SE Fit 95% CI 95% PI 1 45.088 0.863 (43.273, 46.902) (38.888, 51.287)

  • Blood alcohol content (bac) as a function of alcohol consumed (number of beers) for a random sample of 16 adults.What do you predict for 5 beers consumed? What are typical values of BAC when 5 beers are consumed? Give a range.

  • Conditions for inferenceThe observations are independentThe relationship is indeed linear (in the parameters)The standard deviation of y, , is the same for all values of xThe response y varies Normally around its mean

  • Using residual plots to check for regression validityThe residuals (y ) give useful information about the contribution of individual data points to the overall pattern of scatter.

    We view the residuals in a residual plot:

    If residuals are scattered randomly around 0 with uniform variation, it indicates that the data fit a linear model, have Normally distributed residuals for each value of x, and constant standard deviation .

  • Residuals are randomly scattered good!

    Curved pattern the shape of the relationship is not modeled correctly.

    Change in variability across plot not equal for all values of x.

  • Frog mating call

    The data are a random sample of frogs.

    The relationship is clearly linear.

    The residuals are roughly Normally distributed.

    The spread of the residuals around 0 is fairly homogenous along all values of x.

  • Objectives (PSLS Chapter 23)Inference for regression (NHST Regression Inference Award)[B level award] Note Change

    The regression modelConfidence interval for the regression slope Testing the hypothesis of no linear relationshipInference for predictionConditions for inference

    SE changes with N, changes with changes in distribution of X*The prediction interval contains C% of all the individual values taken by y at a particular value of x. *The confidence interval for y contains with C% confidence the population mean y of all responses at a particular value of x.*(In practice, use technology)The prediction interval for a single observation will always be larger than the confidence interval for the mean response.

    *Graphically, a series of confidence intervals is shown for the whole range of x values.

    *The scatterplot is clearly linear, and the t test for the slope gives t = 7.47, P < 0.0001, highly significant.The confidence interval for the mean bac for 5 beers consumed is 0.066 to 0.088.The prediction interval for individual bac values for 5 beers consumed is 0.032 to 0.122.*Minitab here displays the residuals in four different graphs: Normal quantile plot and histogram (top and bottom left) to assess Normality; a residual plot against the fitted values (top right) to assess the spread of the residuals; and a residual plot against the order of the observations (bottom right) to determine whether a pattern might have arisen during data collection.*