Copyright Sharyn O'Halloran

Statistics and Quantitative Analysis U4320

Segment 9: Assumptions and Hypothesis Testing

Prof. Sharyn O’Halloran

Key Points

- Assumptions of OLS regression
- Violations
- How good are our estimates a and b of the true population parameters α and β?
  - Develop notions of error or variance around the predicted line.
  - Construct estimates of the standard error of the sampling process.
- Test hypotheses about β (the impact of X on Y).
  - Is it significantly different from zero?
  - Construct confidence intervals.
  - Use the p-value test.

Assumptions of the Linear Regression Model

- Overview
  - Develop techniques to test hypotheses about the parameters of our regression line.
  - The parameters of interest will be:
    - The intercept a
    - The slope b
  - For instance, one important question is whether the slope of the regression line is statistically distinguishable from zero.
  - As we showed last time, if the independent variable has no effect on the dependent variable, then b should be close to zero.

Assumptions of the Linear Regression Model

- Overview (cont.)
  - So testing whether b = 0 is really a test of whether the independent variable has any statistically significant effect on the dependent variable.

[Figure: scatter plot of Yield against Fertilizer with a flat fitted line Y = a + bX; the intercept is a and the slope is b = 0.]

Assumptions of the Linear Regression Model

- OLS Line
  - The predicted line takes the form:

    Ŷ = a + bX,  where  b = Σxy / Σx²  and  a = Ȳ − bX̄

  - Think of this regression line as the expected value of Y for a given value of X.

[Figure: "Relation Between Yield and Fertilizer": Yield (bushel/acre, 0 to 100) plotted against Fertilizer (lb/acre, 0 to 800), with the fitted trend line.]

That is, for any value of the independent variable there is a single most likely value for the dependent variable.
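As a quick sketch, the slope and intercept formulas above can be computed directly. The numbers here are made up for illustration, not the lecture's data:

```python
# Hypothetical data, for illustration only
X = [1, 2, 3, 4]
Y = [2.1, 3.9, 6.2, 7.8]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# b = Σxy / Σx², where x and y are deviations from the means
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y)) / \
    sum((xi - x_bar) ** 2 for xi in X)
a = y_bar - b * x_bar                 # a = Ȳ − bX̄

# The fitted line Ŷ = a + bX gives the expected value of Y at each X
y_hat = [a + b * xi for xi in X]
```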

Assumptions of the Linear Regression Model

- Random Error
  - When we actually collect data, however, we will not necessarily obtain the predicted value every time.
  - That is, we should expect to observe some variation in our dependent variable.
  - For example, the same amount of fertilizer does not necessarily produce the same amount of wheat each time.
- Need to measure this error
  - We will be doing this the same way we did with point estimates.
  - We will have some population from which we draw samples, and we will make inferences about that population subject to some error in the sampling process.

Assumptions of the Linear Regression Model

- Error term
  - To include this in our analysis we use the equation:

    Yi = α + βXi + εi

    where εi is the error in observation i.

[Figure: the Yield-vs-Fertilizer scatter, with vertical distances ε1, ε2, ε3 drawn from individual observations to the line Y = a + bX.]

  - These epsilons (errors) are random variables, and each has its own distribution:

    εi = Yi − (a + bXi)

Assumptions of the Linear Regression Model

- Assumptions: To be explicit, the analysis that follows makes the following four assumptions:
  1. Linearity
     - The true relation between Y and X is captured in the equation: Y = α + βX
  2. Homoskedasticity (homogeneous variance)
     - Each of the εi has the same variance: E(εi²) = σ² for all i

Assumptions of the Linear Regression Model

  3. Independence
     - Each of the εi's is independent of the others: Cov(εi, εj) = 0 for i ≠ j
     - That is, the value of one observation's error does not affect the value of any other observation's error.
  4. Normality
     - Each εi is normally distributed.

[Figure: a normal curve centered at µ, labeled "Distribution of εi".]

Assumptions of the Linear Regression Model

- Properties
  - Combining assumptions 4 and 2 implies that the error terms are normally distributed with mean 0 and variance σ². We write this as:

    εi ~ N(0, σ²)

Assumptions of the Linear Regression Model

- Notice that σ², the variance of the ε's, is the expected squared deviation of the Y's from the regression line:

    σ² = E(Yi − Ŷi)²
       = E(Yi − (a + bXi))²
       = E(εi²)
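A small simulation can make this concrete. Under the assumptions above, if we generate errors εi ~ N(0, σ²) around a hypothetical true line (the parameter values below are made up), the mean squared deviation of the Y's from that line should come out close to σ²:

```python
import random

random.seed(42)
alpha, beta, sigma = 10.0, 0.5, 6.0          # hypothetical true parameters

X = list(range(100, 800, 100)) * 200         # many observations at each X
eps = [random.gauss(0, sigma) for _ in X]    # independent N(0, σ²) errors
Y = [alpha + beta * x + e for x, e in zip(X, eps)]

# Average squared deviation of Y from the true line: should be near σ² = 36
msd = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(X, Y)) / len(X)
```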

Assumptions of the Linear Regression Model

- Examples of Violations
- Non-Linearity
  - The true relation between the independent and dependent variables may not be linear.
  - For example, consider campaign fundraising and the probability of winning an election.

[Figure: "Probability of Winning an Election": P(w) plotted against Spending. The probability of winning increases with each additional dollar spent and then levels off after $50,000.]

Assumptions of the Linear Regression Model

- Homoskedasticity
  - The homoskedasticity assumption implies that, on average, we do not expect to get larger errors in some cases than in others.
  - Of course, due to the luck of the draw, some errors will turn out to be larger than others.
  - But homoskedasticity is violated only when this happens in a predictable manner.
  - Example: income and spending on certain goods.
    - People with higher incomes have more choices about what to buy.
    - We would expect that their consumption of certain goods is more variable than for families with lower incomes.

Assumptions of the Linear Regression Model

[Figure: Spending plotted against income for observations 1 through 10. As income increases, so do the errors (the vertical distances from the predicted line), e.g. ε6 = Y6 − (a + bX6) and ε9 = Y9 − (a + bX9). The relation between income and spending violates homoskedasticity.]

Assumptions of the Linear Regression Model

- Non-Independence
  - The independence assumption means that the error terms of two observations do not influence one another.
  - Technically, the RESIDUALS or error terms are uncorrelated.
  - The most common violation occurs with data that are collected over time, i.e., time-series analysis.
  - Example: high tariff rates in one period are often associated with very high tariff rates in the next period.
  - Example: nominal GNP and consumption.

[Figure: residuals of GNP and consumption over time, which are highly correlated.]

Assumptions of the Linear Regression Model

- Non-Normality
  - Nicotine use is characterized by a large number of people not smoking at all and another large number of people who smoke every day: an example of a bimodal distribution.

[Figure: frequency of nicotine use, a bimodal distribution.]

  - Of all the assumptions, this is the one that we need to be least worried about violating.
  - Why?

Assumptions of the Linear Regression Model

- Summary of Assumptions
- Algebra
  - These assumptions can be written as: the random variables Y1, Y2, ..., Yn are independent with

    Mean = α + βX
    Variance = σ²

  - Notice that when β = 0, the mean of Yi is just the intercept α.

Assumptions of the Linear Regression Model

- Also, the deviation of Y from its expected value is captured by the disturbance term, or error, e:

    Yi = a + bXi + ei

  where e1, e2, ..., en are independent errors with

    Mean = 0
    Variance = σ²

- Overview
  - http://www.basic.nwu.edu/statguidefiles/linreg.html

Assumptions of the Linear Regression Model

- Recap
  - There are two things to be clear on:
    - First, we never observe the true population regression line or the actual errors.
    - Second, the only information we have are the Y observations and the resulting fitted regression line.
  - We observe only X and Y, from which we estimate:
    - the regression line Ŷ; and
    - the distribution of the error terms εi around the observed data.

Discussion of the Error Term ε

- Regression as an approximation
- The Fitted Line
  - The least squares method gives the line that fits the data with the minimum amount of variation.
  - That is, the line that minimizes the sum of squared deviations. Remember the criterion:

    Minimize Σd² = Σ(Yi − Ŷi)²

Discussion of the Error Term

- Approximation
  - But the regression equation is just an approximation, unless all of the observed data fall on the predicted line.

[Figure: Yield against Fertilizer with every observation lying exactly on the line Y = a + bX (no variation in the data); the slope b is the change in Y for a given change in X.]

Discussion of the Error Term

- The Fitted Line with Variance
  - Analogy to point estimates
    - With point estimates, we did not expect that all the observed data would fall exactly on the mean.
    - Rather, the mean represented the expected or average value.
    - Similarly, then, we do not expect the regression line to be a perfect estimate of the true regression line; we expect to see some error or randomness.
  - Measurement of variance
    - We need to develop a notion of dispersion or variance around the regression line.
    - A measurement of the error of the regression line is:

      ei = Yi − Ŷi

Discussion of the Error Term

- The Fitted Line with Variance (cont.)

[Figure: Yield against Fertilizer with the fitted line Ŷ = a + bX and a residual ei marked. We are interested in explaining the Y-deviation from the fitted line, i.e., the variation in the dependent variable, Yield. ei represents the difference between the observed and the predicted Yi for a given value of X.]

Discussion of the Error Term

- Two components of error:
  - Measurement error
    - Error that results from collecting data on the dependent variable.
    - For example, there may be difficulty in measuring crop yields due to sloppy harvesting.
    - Or perhaps two scales are calibrated slightly differently.
  - Inherent variability
    - Variation due to other factors beyond the experimenter's control.
    - What would be some examples in our fertilizer problem? Rain, land conditions, and different skill levels of farmers.

Discussion of the Error Term

- Note:
  - The error term is very important for regression analysis.
  - Since the rest of the regression equation is completely deterministic, all unexpected variation is included in the error term.
  - It is this variation that provides the basis of all estimation and hypothesis testing in regression analysis.

Sampling Variability

- Overview
  - Are the parameters a and b good estimates of the true regression line?
  - Just as with point estimates, to what extent is the true value of Y approximated by the predicted line?

    E(Y|x) = α + βx   (true regression line; population parameters)
    Ŷ = a + bx        (estimated regression line; sample parameters)

Sampling Variability

- Sampling Distribution of b
  - We know from the central limit theorem that if we randomly draw a large enough sample, b will be normally distributed, with

    Expected value of b = β
    Standard error of b = σ / √Σx²

  - Here σ represents the standard deviation of the Y observations about the population line. We define:

    σ = √( Σ(Y − Ŷ)² / n )
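As a sketch of the formula, plugging in numbers from the fertilizer example used later in this segment (Σx² = 280,000, with the residual estimate s = √35.5 standing in for the unknown σ):

```python
Sxx = 280_000            # Σx² for the fertilizer data
sigma_hat = 35.5 ** 0.5  # s ≈ 5.96, used here as a stand-in for σ

se_b = sigma_hat / Sxx ** 0.5   # SE(b) = σ / √Σx²
# se_b ≈ 0.0113
```

This is the .0113 that reappears in the confidence-interval calculation later in the lecture.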

Sampling Variability

- Standard Error of the Estimate b
  - Graph: as before, we can represent the estimate of the true population parameter as follows:

[Figure: the sampling distribution p(b), centered at E(b) = β, with SE(b) = σ / √Σx².]

Sampling Variability

- Formulas
  - We can restate the standard error as:

    Standard error = σ / √Σx² = (1/√n) · σ / √(Σx²/n)

  - Recall that x² = (X − X̄)², so Σx²/n is just the mean squared deviation of X, which is approximately the variance s²X. Thus:

    Standard error ≈ (1/√n) · σ / sX

  where sX is the spread of X.

Sampling Variability

- Three ways to improve the accuracy of estimating b
  - Ways to reduce the standard error and so improve the accuracy of our estimate b:
    - By reducing σ, the variability of the Y observations (controlling for outside effects).
    - By increasing n, the sample size.
    - By increasing sX, the spread of the X values, which is determined by the experimenter.

Sampling Variability

- Example: data with little spread
  - When sX is small, we have little information from which to predict the value of Y.
  - Our regression line will be very sensitive to the exact location of each X value.
  - Small errors may have a large effect on our estimates of the true regression line.

[Figure: a tight cluster of x's; the population line Y = α + βX and the estimate Ŷ = a + bX diverge sharply outside the cluster.]

Sampling Variability

- Example: data with large spread

[Figure: x's spread widely along X; the population line Y = α + βX and the estimate Ŷ = a + bX nearly coincide. If the x values take on a wide range of values (sX, the spread, is large), then our estimates are not so sensitive to small errors.]

Sampling Variability

- Ruler Analogy
  - For example, think about the ruler analogy.
  - If I want you to draw a line and I give you two points that are very close together, then there are many lines that you can draw.
  - On the other hand, if I give you two points that are at the ends of the ruler, you will be more accurate in drawing your line.
  - The spread of X has the same effect.

Hypothesis Tests for β

- Estimating the Standard Error of b
- σ² Known
  - Before we can make inferences about our regression parameters, there is one more piece of information that we need.
  - Our formula for the standard error includes σ², the variance of Y about the true population regression line: the variation of each of the εi's.

    Standard error of b = σ / √Σx² = (1/√n) · σ / √(Σx²/n)

Hypothesis Tests for β

- σ² unknown
  - But as we know, most times we do not observe the underlying population.
  - Since σ² is generally unknown, it must be estimated.
  - Remember that for the εi's, σ² is the average squared distance from the mean.
  - So we estimate σ² by the average squared distance of the observed data points from the estimated regression line.

Hypothesis Tests for β

- An obvious candidate would just be the average of the d²'s: the average squared deviation of the Yi from the predicted line.

    s² = Σ(Y − Ŷ)² / n = Σd² / n

[Figure: observations scattered about the regression line Ŷ = a + bX, with deviations d1 and d2 marked.]

Hypothesis Tests for β

- n − 2 Degrees of Freedom
  - Correction: we need to divide by n − 2 instead of n, so that we have an unbiased estimate of σ²:

    s² = Σd² / (n − 2) = Σ(Y − Ŷ)² / (n − 2)

  - Here, we have already used up two degrees of freedom in calculating the regression line: one for the y-intercept and one for the slope.
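Using the fertilizer data and the fitted line from this lecture, the unbiased estimate works out as a sketch like this:

```python
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
a, b = 36.4, 0.059               # fitted line from the fertilizer example

d = [y - (a + b * x) for x, y in zip(X, Y)]   # residuals d = Y − Ŷ
s2 = sum(di ** 2 for di in d) / (len(X) - 2)  # divide by n − 2, not n
# s2 ≈ 35.5
```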

Hypothesis Tests for β

- Correction (cont.)
  - So for the estimated standard error of b, we have the formula:

    estimated SE(b) = s / √Σx²

  - Since we used the estimated standard error instead of the actual standard error, what distribution do you think we have to use?

Hypothesis Tests: Confidence Intervals

- 95% Confidence Interval
- Formula
  - We can specify a confidence interval for the true slope β given our estimate b:

    β = b ± t.025 · SE = b ± t.025 · s / √Σx²

- Distribution
  - We use the t-distribution with n − 2 degrees of freedom, with α/2 = .025 in each tail.

Hypothesis Tests: Confidence Intervals

- Example: Fertilizer and Yield

  Observation    X     Y    x = X−X̄   y = Y−Ȳ     x·y      x·x
  1             100    40     −300       −20      6000    90,000
  2             200    50     −200       −10      2000    40,000
  3             300    50     −100       −10      1000    10,000
  4             400    70        0        10         0         0
  5             500    65      100         5       500    10,000
  6             600    65      200         5      1000    40,000
  7             700    80      300        20      6000    90,000
  Sum          2800   420        0         0    16,500   280,000
  Mean          400    60

  slope:            b = Σxy / Σx² = 16,500 / 280,000 = .059
  intercept:        a = Ȳ − bX̄ = 60 − .059 (400) = 36.4
  regression line:  Ŷ = 36.4 + .059X

  n = 7, α = .05, α/2 = .025

Hypothesis Tests: Confidence Intervals

- Step 1: State the Hypothesis
  - The hypothesis we're most interested in testing is that β = 0:

    H0: β = 0
    Ha: β ≠ 0

  - The null is that the slope coefficient is equal to zero, against the alternative that there is a relation between the independent and dependent variables.
  - How can we perform this test at the α = 5% level?

Hypothesis Tests: Confidence Intervals

- Step 2: Calculate a 95% Confidence Interval for the Slope b

  Relation between fertilizer and yield:

  X     Y    Ŷ = 36.4 + .059X   d = Y − Ŷ   (Y − Ŷ)²
  100   40        42.3             −2.3        5.29
  200   50        48.2              1.8        3.24
  300   50        54.1             −4.1       16.81
  400   70        60.0             10.0      100.00
  500   65        65.9             −0.9         .81
  600   65        71.8             −6.8       46.24
  700   80        77.7              2.3        5.29

  X̄ = 400, Ȳ = 60, Σd = 0, Σ(Y − Ŷ)² = 177.68

Hypothesis Tests: Confidence Intervals

- Calculate the regression line:
  - We calculated the regression line to be: Ŷ = 36.4 + .059X
- Calculate a 95% confidence interval:
  - We can calculate s² as:

    s² = Σ(Y − Ŷ)² / (n − 2) = 177.68 / (7 − 2) = 35.5

Hypothesis Tests: Confidence Intervals for β

- Calculate a 95% confidence interval (cont.):

    β = b ± t.025 · s / √Σx²
    β = .059 ± 2.57 · √(35.5 / 280,000)
    β = .059 ± 2.57 (.0113)
    β = .059 ± .029

    .030 ≤ β ≤ .088

  (95% of the total area lies between .030 and .088.)
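The interval above can be reproduced in a few lines (t.025 = 2.57 with 5 degrees of freedom, taken from the t-table):

```python
b = 0.059
s2, Sxx = 35.5, 280_000
t_025 = 2.57                     # t critical value, 5 d.f.

se_b = (s2 / Sxx) ** 0.5         # ≈ 0.0113
lo = b - t_025 * se_b
hi = b + t_025 * se_b
# ≈ (0.030, 0.088); zero is outside the interval
```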

Hypothesis Tests: Confidence Intervals for β

- If the slope were .030, what would the y-intercept be?

    a = Ȳ − bX̄ = 60 − .030 (400) = 48

- If the slope were .088, what would the y-intercept be?

    a = 60 − .088 (400) = 24.8

Hypothesis Tests: Confidence Intervals for β

- We can graph these two lines like this:

[Figure: Yield against Fertilizer, showing the fitted line Ŷ = 36.4 + .059X together with the bounds Ŷ = 48 + .030X and Ŷ = 24.8 + .088X. This means that we are 95% confident that the true regression line falls within these lower and upper bounds.]

Hypothesis Tests: Confidence Intervals for β

- Step 3: Reject or Fail to Reject
  - Since zero does not fall within this interval, we reject the hypothesis that β = 0.
  - That is, according to our data, fertilizer really does help increase crop yield. Yay!

[Figure: the interval .030 to .088 on the β axis; "fail to reject" inside the interval, "reject" outside it.]

Hypothesis Tests: Calculating p-Values

- Definition of p-value:
  - The p-value is the probability of obtaining a test statistic equal to or more extreme than the result obtained from the sample data, given that the null hypothesis H0 is really true.
  - p-values summarize the credibility of the null hypothesis:
    - If the p-value is greater than or equal to α, the null hypothesis is not rejected.
    - If the p-value is smaller than α, the null hypothesis is rejected.
    - The smaller the p-value, the stronger the evidence against H0.
  - In essence, p-values state how surprising the observed data would be if the null hypothesis were true.

Hypothesis Tests: Calculating p-Values

- Calculating p-Values
  - We calculate the test statistic:

    t = (b − b0) / SE

    where b0 is the value of β under the null hypothesis, b is the estimated parameter, and the standard error of the sampling process is SE = s / √Σx².

  - Since we are interested in the hypothesis H0: β = 0, the statistic of interest is:

    t = b / SE

Hypothesis Tests: Calculating p-Values for β

- Just like before, we can draw the distribution of b:

[Figure: the distribution of b centered at β = 0, with rejection regions of area α = .025 in each tail and the observed value far in the upper tail.]

- If the observed value of b falls within one of the tails, then we can reject the null hypothesis.

Hypothesis Tests: Calculating p-Values for β

- Step 1: State the Null Hypothesis

    H0: β = 0
    Ha: β ≠ 0   at the α = 5% significance level

  - What do we need to calculate the test?
- Step 2: Calculate a p-value
  - For the fertilizer data, we have:

    t = .059 / .0113 = 5.2;  d.f. = 7 − 2 = 5

  - We go to our t-tables and find that for 5 degrees of freedom, 5.2 is greater than the value t.0025 = 4.77.
- Step 3: Reject or Fail to Reject the Null Hypothesis
  - These data are very significant; therefore, we can reject the null hypothesis.
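The test statistic can be checked directly; 4.77 is the tabulated t.0025 value for 5 degrees of freedom:

```python
b, se_b = 0.059, 0.0113
t_stat = b / se_b               # ≈ 5.2
df = 7 - 2

t_0025 = 4.77                   # from the t-table, 5 d.f.
reject = abs(t_stat) > t_0025   # True: significant at this level
```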

Homework

- Variables
  - Take the same two variables you used for the homework you turned in today. I used PARTYID and INCOME91.
- Hypothesis
  - Does income have a significant effect on party identification? (Income → Party)
  - The computer will automatically delete all missing data for regressions, so you don't have to recode the variables.

Homework

- Commands
  - The command to run a regression is just REGRESSION.
  - Next, you input the variables you'll be using in the analysis.
    - You enter all variables, including the dependent and independent variables.
    - Then tell the computer which variable is the dependent variable. In this case, it was PARTYID.
  - Finally, you put in /METHOD ENTER.
    - Don't worry about this part of the command for now; I'll explain it later.

Homeworkn Results

n The bottom of the page shows the regression results.n The slope was 0.039, which is positive.

n This means that as income rises, you are more likely to identify yourself as a Republican.

n The standard error was .0125, as reported in the next column.

n The t-statistic, then, is just .039 /.0125, which is 3.093.n The p-value corresponding to this is given in the "Sig T"

column, and it's .0020.

XY 039.20.2ˆ +=

20.2

Party ID

Income

Homework

- Reject or Fail to Reject?
  - Can we reject H0: β = 0 in favor of the alternative Ha: β > 0 at the 5% level?
  - Since .0020 is less than .05, we reject the null.
- Interpretation
  - As your income rises, you're more likely to identify yourself as Republican.

[Figure: the distribution of b under β = 0, with α = .05 in the upper tail; the observed value has p = .0020.]

Hypothesis Tests for β: Example

- Example: Income and Savings
  - Now we want to know: do people save more money as their income increases?
  - Suppose we observed 4 individuals' income and saving rates:

    Income (X)  Savings (Y)    x     y     xy    x²     Ŷ     d = Y − Ŷ   (Y − Ŷ)²
    22          2.0            1   −0.2   −0.2    1    2.34    −0.34       0.116
    18          2.0           −3   −0.2    0.6    9    1.77     0.23       0.053
    17          1.6           −4   −0.6    2.4   16    1.63    −0.03       0.0009
    27          3.2            6    1.0    6.0   36    3.05     0.15       0.0225

    X̄ = 21, Ȳ = 2.2, Σx = 0, Σy = 0, Σxy = 8.8, Σx² = 62, Σ(Y − Ŷ)² = 0.192

Confidence Intervals and Tests for β

- Calculate the fitted line: Ŷ = a + bX
  - Estimate b:

    b = Σxy / Σx² = 8.8 / 62 = 0.142

    What does this mean? On average, people save a little over 14% of every extra dollar they earn.
  - Intercept a:

    a = Ȳ − bX̄ = 2.2 − 0.142 (21) = −0.782

    What does this mean? With no income, people borrow.
  - The regression equation is:

    Ŷ = −0.78 + 0.142X
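The savings example's slope and intercept can be verified directly from the table:

```python
X = [22, 18, 17, 27]      # income
Y = [2.0, 2.0, 1.6, 3.2]  # savings

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n    # 21, 2.2
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))  # Σxy = 8.8
Sxx = sum((x - x_bar) ** 2 for x in X)                      # Σx² = 62

b = Sxy / Sxx             # ≈ 0.142
a = y_bar - b * x_bar     # ≈ −0.78
```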

Confidence Intervals and Tests for β

n State the hypothesisn Now let's test the null hypothesis that β= 0.

That is, the hypothesis that people do not tend to save any of the extra money they earn.

H0: β= 0Ha: β≠ 0;

at the 5% significance level

Confidence Intervals and Tests for β

- Step 2: Calculate a 95% confidence interval
  - What do we need to calculate the confidence interval?

    s² = Σd² / (n − 2) = .192 / 2 = 0.096
    s = √.096 = .309

  - What is the formula for the confidence interval?

    β = b ± t.025 · s / √Σx²
    β = .142 ± 4.30 · .309 / √62
    β = .142 ± .169

    −.027 ≤ β ≤ .311
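The same interval as a sketch in code (t.025 = 4.30 with 2 degrees of freedom):

```python
b = 0.142
s2 = 0.192 / (4 - 2)       # 0.096
s = s2 ** 0.5              # ≈ 0.309
t_025 = 4.30               # t critical value, 2 d.f.

se_b = s / 62 ** 0.5
lo, hi = b - t_025 * se_b, b + t_025 * se_b
# ≈ (−0.027, 0.311); zero is inside the interval
```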

Confidence Intervals and Tests for β

- Step 3: Fail to reject the null hypothesis
  - Since zero falls within this interval, we cannot reject the null hypothesis. This is probably due to the small sample size.

[Figure: the interval −.027 to .311 on the β axis, containing β = 0.]

Confidence Intervals and Tests for β

- Additional Examples:
  - How about the hypothesis that β = .50, so that people save half their extra income? It is outside the confidence interval, so we can reject this hypothesis.
  - Let's say that it is well known that Japanese consumers save 20% of their income on average. Can we use these data (presumably from American families) to test the hypothesis that Japanese save at a higher rate than Americans?
  - Since 20% also falls within the confidence interval, we cannot reject the null hypothesis that Americans save at the same rate as the Japanese.

Homework

- Part of your homework for next time is to run a regression in SPSS.
- This will be a quick exercise, just to get you used to the commands in regression analysis.

Homework

- Predicting Y at a given level of X
  - As we said before, regression analysis can be used to predict outcomes as well as to summarize data.
  - For instance, after running a regression and getting the best-fit line for some data, we can ask what is the most likely value for Y at a given level of X, say X0.
  - We know that this is the value Ŷ0 = a + bX0.
  - Now we might ask how confident we are of this prediction. That is, we want a confidence interval for Ŷ0.
  - It turns out that this confidence interval can be calculated in two ways, depending on exactly what we're interested in:

Predicting Y at a Given Level of X

- What is the true expected value of Ŷ0 over many different observations? That is, what is the expected value of Y when X = X0?
  - For instance, we might ask how large, on average, the yield should be if we add 450 lbs of fertilizer per acre.
- What will the value of Y be if we have only one observation of it for X = X0?
  - For instance, say we want a 95% confidence interval for next year's crop only, given that we plan to add 450 lbs of fertilizer per acre.
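The point prediction itself is just the fitted line evaluated at X0. For the fertilizer line from earlier in the lecture and X0 = 450 lbs/acre:

```python
a, b = 36.4, 0.059   # fitted fertilizer line from earlier in the lecture
X0 = 450
Y0_hat = a + b * X0  # predicted yield at X0 = 450
# ≈ 62.95 bushels per acre
```

The two confidence intervals discussed next differ only in how wide a band they put around this same point prediction.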

Predicting Y at a Given Level of X

- Confidence interval for the mean of Y at X
- Prediction interval for a single observation
- Comparisons of the two

Background

- Simple linear regression uses the model

    y = β0 + β1x + ε

  - y is the response; β0 and β1 are the intercept and slope; ε is the model error.
- β0 and β1 are usually estimated using the method known as least squares.
- This involves some serious assumptions.

Assumptions

- The xi are non-random and measured with negligible error.
- The εi are random variables with mean 0 and constant variance σ².
- The εi are also uncorrelated.
- We are interested in the mean of the distribution of the yi:

    E[yi] = β0 + β1xi