Multiple Regression - Kasetsart University
fin.bus.ku.ac.th/135512 Economic Environment for...
The Multiple Regression Model
Examines the linear relationship between one dependent variable (Y) and two or more independent variables (Xi)
Multiple Regression Model with k Independent Variables:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

(β0: Y-intercept; β1, …, βk: population slopes; ε: random error)
Multiple Regression Equation
The coefficients of the multiple regression model are
estimated using sample data
Multiple regression equation with k independent variables:

ŷi = b0 + b1x1i + b2x2i + … + bkxki

(ŷi: estimated or predicted value of y; b0: estimated intercept; b1, …, bk: estimated slope coefficients)
We will always use a computer to obtain the regression
slope coefficients and other regression summary
measures.
Sales Example
Salest = b0 + b1 (Price)t
+ b2 (Advertising)t + et
Week | Pie Sales | Price ($) | Advertising ($100s)
1 350 5.50 3.3
2 460 7.50 3.3
3 350 8.00 3.0
4 430 8.00 4.5
5 350 6.80 3.0
6 380 7.50 4.0
7 430 4.50 3.0
8 470 6.40 3.7
9 450 7.00 3.5
10 490 5.00 4.0
11 340 7.20 3.5
12 300 7.90 3.2
13 440 5.90 4.0
14 450 5.00 3.5
15 300 7.00 2.7
Multiple regression equation:
Multiple Regression Output
Regression Statistics
Multiple R 0.72213
R Square 0.52148
Adjusted R Square 0.44172
Standard Error 47.46341
Observations 15
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888
Sales = 306.526 − 24.975(Price) + 74.131(Advertising)
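The estimates above come from Excel, but they can be cross-checked by solving the normal equations (X′X)b = X′y directly. A minimal pure-Python sketch, with Gaussian elimination standing in for the spreadsheet's solver:

```python
# Reproduce the pie-sales regression by solving the normal equations
# (X'X) b = X'y in plain Python (no third-party libraries).
sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
adv   = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]

X = [[1.0, p, a] for p, a in zip(price, adv)]   # design matrix with intercept column
k = 3
XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
Xty = [sum(row[i] * y for row, y in zip(X, sales)) for i in range(k)]

# Gaussian elimination with partial pivoting on the augmented system
A = [r[:] + [v] for r, v in zip(XtX, Xty)]
for col in range(k):
    piv = max(range(col, k), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    for r in range(col + 1, k):
        f = A[r][col] / A[col][col]
        for c in range(col, k + 1):
            A[r][c] -= f * A[col][c]
b = [0.0] * k
for i in reversed(range(k)):
    b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]

print([round(v, 3) for v in b])   # intercept, Price slope, Advertising slope
```

The printed coefficients should match the Intercept, Price, and Advertising rows of the output table above.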
Adjusted R²

• R² never decreases when a new X variable is added to the model, even if the new variable is not an important predictor variable
– Hence, models with different numbers of explanatory variables cannot be compared by R²
• What is the net effect of adding a new variable?
– We lose a degree of freedom when a new X variable is added
– Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
• Adjusted R² penalizes excessive use of unimportant independent variables
• Adjusted R² is always smaller than R² (except when R² = 1)
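The penalty is easy to compute by hand. A small sketch of the adjustment formula, adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1), checked against the pie-sales output:

```python
# Adjusted R-squared penalizes extra regressors:
#   adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
def adjusted_r2(r2, n, k):
    """r2: ordinary R^2; n: number of observations; k: number of X variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Pie-sales output: R^2 = 0.52148, n = 15, k = 2
adj = adjusted_r2(0.52148, 15, 2)
print(round(adj, 5))   # matches "Adjusted R Square" in the output (0.44172)
```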
F-Test for Overall Significance
of the Model
• Shows if there is a linear relationship between all of the
X variables considered together and Y
• Use F test statistic
• Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
F = MSR / MSE = 14730.0 / 2252.8 = 6.5386
With 2 and 12 degrees of freedom, the P-value for the F-test is 0.01201.

The ANOVA Table in Regression

Source of Variation   Sum of Squares   Degrees of Freedom        Mean Square
Regression            SSR              k                         MSR = SSR / k
Error                 SSE              n − (k + 1) = n − k − 1   MSE = SSE / (n − (k + 1))
Total                 SST              n − 1                     MST = SST / (n − 1)

F Ratio: F = MSR / MSE

R² = SSR / SST = 1 − SSE / SST

Adjusted R² = 1 − [SSE / (n − (k + 1))] / [SST / (n − 1)] = 1 − MSE / MST

F in terms of R²: F = (R² / k) / [(1 − R²) / (n − (k + 1))]
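These ANOVA identities can be checked numerically against the pie-sales sums of squares; a quick sketch:

```python
# Build the ANOVA quantities from the sums of squares in the pie-sales output.
SSR, SSE = 29460.027, 27033.306
n, k = 15, 2
SST = SSR + SSE              # 56493.333
MSR = SSR / k                # 14730.013
MSE = SSE / (n - (k + 1))    # 2252.776
F = MSR / MSE
R2 = SSR / SST
print(round(F, 4), round(R2, 5))

# Equivalent expression of F in terms of R^2:
F_from_R2 = (R2 / k) / ((1 - R2) / (n - (k + 1)))
assert abs(F - F_from_R2) < 1e-9
```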
Tests of the Significance of Individual Regression Parameters

Hypothesis tests about individual regression slope parameters:
(1) H0: b1 = 0
H1: b1 ≠ 0
(2) H0: b2 = 0
H1: b2 ≠ 0
. . .
(k) H0: bk = 0
H1: bk ≠ 0

Test statistic for test i:

t(n − (k + 1)) = (bi − 0) / s(bi)
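Each t Stat in the regression output is just the coefficient divided by its standard error, compared against t with n − (k + 1) = 12 degrees of freedom. A quick check against the pie-sales coefficients table:

```python
# t statistic for an individual slope: t = (b_i - 0) / s(b_i).
# Coefficients and standard errors taken from the pie-sales output.
coef = {"Price": (-24.97509, 10.83213), "Advertising": (74.13096, 25.96732)}
tstats = {name: b / se for name, (b, se) in coef.items()}
print({name: round(t, 5) for name, t in tstats.items()})
# Price: about -2.30565; Advertising: about 2.85478, matching the t Stat column
```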
The Concept of Partial Regression
Coefficients
In multiple regression, the interpretation of slope
coefficients requires special attention:
• Here, b1 shows the relationship between X1 and
Y holding X2 constant (i.e. controlling for the
effect of X2 ).
ŷi = b0 + b1x1i + b2x2i

Purifying X1 from X2 (i.e., removing the effect of X2 on X1): run a regression of X1 on X2:

X1i = α0 + α1X2i + vi

vi = X1i − (α0 + α1X2i) is X1 purified from X2.

Then, run a regression of Yi on vi:

Yi = γ0 + γ1vi

γ1 is the b1 in the original multiple regression equation: b1 shows the relationship between X1 purified from X2 and Y.
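This purification claim can be verified with two bivariate regressions. A pure-Python sketch on the pie-sales data, taking Price as X1 and Advertising as X2 (the `simple_ols` helper is illustrative, not course code):

```python
# Frisch-Waugh check of the partial-coefficient interpretation,
# using only simple bivariate regressions.
sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
adv   = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]

def simple_ols(x, y):
    """Intercept and slope of y = a + b*x fit by least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Step 1: purify X1 (Price) of X2 (Advertising)
a0, a1 = simple_ols(adv, price)
v = [p - (a0 + a1 * ad) for p, ad in zip(price, adv)]
# Step 2: regress Y (Sales) on the purified X1
g0, g1 = simple_ols(v, sales)
print(round(g1, 3))   # should equal b1, the Price slope from the full regression
```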
Whenever a new explanatory variable is added to the regression equation or removed from it, all b coefficients change (unless the covariance of the added or removed variable with all other variables is zero).
The Principle of Parsimony:
Any insignificant explanatory variable should be removed out of the regression equation.
The Principle of Generosity:
Any significant variable must be included in the regression equation.
Choosing the best model:
Choose the model with the highest adjusted R2 or F or the lowest AIC (Akaike Information Criterion) or SC (Schwarz Criterion).
Apply the stepwise regression procedure.
Multiple Regression
For example:
A researcher may be interested in the
relationship between Education and Income
and Number of Children in a family.
Independent Variables
Education
Family Income
Dependent Variable
Number of Children
Multiple Regression
For example:
Research Hypothesis: As education of respondents
increases, the number of children in families will
decline (negative relationship).
Research Hypothesis: As family income of
respondents increases, the number of children in
families will decline (negative relationship).
Multiple Regression
For example:
Null Hypothesis: There is no relationship between
education of respondents and the number of children
in families.
Null Hypothesis: There is no relationship between
family income and the number of children in families.
Multiple Regression
Bivariate regression is based on fitting a line as
close as possible to the plotted coordinates of
your data on a two-dimensional graph.
Trivariate regression is based on fitting a plane
as close as possible to the plotted coordinates of
your data on a three-dimensional graph.
Case:                1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Children (Y):        2  5  1  9  6  3  0  3  7  7  2  5  1  9  6  3  0  3  7 14  2  5  1  9  6
Education (X1):     12 16 20 12  9 18 16 14  9 12 12 10 20 11  9 18 16 14  9  8 12 10 20 11  9
Income 1=$10K (X2):  3  4  9  5  4 12 10  1  4  3 10  4  9  4  4 12 10  6  4  1 10  3  9  2  4
[3-D scatterplot: plotted coordinates (cases 1–10) for Education (X1), Income (X2), and Number of Children (Y)]
Multiple Regression
Case:                1  2  3  4  5  6  7  8  9 10
Children (Y):        2  5  1  9  6  3  0  3  7  7
Education (X1):     12 16 20 12  9 18 16 14  9 12
Income 1=$10K (X2):  3  4  9  5  4 12 10  1  4  3
[3-D scatterplot, as above: what multiple regression does is fit a plane to these coordinates]
Multiple Regression
• Mathematically, that plane is:
Y = a + b1X1 + b2X2
a = y-intercept, where X’s equal zero
b=coefficient or slope for each variable
For our problem, SPSS says the equation is:
Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
Multiple Regression
• Let’s take a moment to reflect…
Why do I write the equation:
Y = a + b1X1 + b2X2
whereas KBM often write:
Yi = a + b1X1i + b2X2i + ei
One is the equation for a prediction; the other is the value of a data point for a person.
Multiple Regression
Model Summary

Model | R     | R Square | Adjusted R Square | Std. Error of the Estimate
1     | .757a | .573     | .534              | 2.33785

a. Predictors: (Constant), Income, Education

ANOVAb

Model 1    | Sum of Squares | df | Mean Square | F      | Sig.
Regression | 161.518        | 2  | 80.759      | 14.776 | .000a
Residual   | 120.242        | 22 | 5.466       |        |
Total      | 281.760        | 24 |             |        |

a. Predictors: (Constant), Income, Education
b. Dependent Variable: Children

Coefficientsa

Model 1    | Unstandardized B | Std. Error | Standardized Beta | t      | Sig.
(Constant) | 11.770           | 1.734      |                   | 6.787  | .000
Education  | -.364            | .173       | -.412             | -2.105 | .047
Income     | -.403            | .194       | -.408             | -2.084 | .049

a. Dependent Variable: Children
Y = 11.8 - .36X1 - .40X2
57% of the variation in
number of children is
explained by education
and income!
Y = 11.8 - .36X1 - .40X2
r² = [Σ(Y − Ȳ)² − Σ(Y − Ŷ)²] / Σ(Y − Ȳ)²

161.518 ÷ 281.76 = .573
Multiple Regression
So what does our equation tell us?
Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
Try “plugging in” some values for your
variables.
Multiple Regression
So what does our equation tell us?
Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
If Education equals:   If Income equals:   Then Ŷ (children) equals:
0                      0                   11.8
10                     0                   8.2
10                     10                  4.2
20                     10                  0.6
20                     11                  0.2
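The plugging-in exercise is just evaluating the fitted equation. A one-function sketch (the function name `expected_children` is mine, for illustration):

```python
# Evaluate the fitted equation: Y-hat = 11.8 - .36*Educ - .40*Income
# (income measured in $10K units, as in the data).
def expected_children(educ, income):
    return 11.8 - 0.36 * educ - 0.40 * income

for e, i in [(0, 0), (10, 0), (10, 10), (20, 10), (20, 11)]:
    print(e, i, round(expected_children(e, i), 1))
```

The printed values reproduce the table above: 11.8, 8.2, 4.2, 0.6, 0.2.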
Multiple Regression
So what does our equation tell us?
Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
If Education equals:   If Income equals:   Then Ŷ (children) equals:
1                      0                   11.44
1                      1                   11.04
1                      5                   9.44
1                      10                  7.44
1                      15                  5.44
Multiple Regression
So what does our equation tell us?
Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
If Education equals:   If Income equals:   Then Ŷ (children) equals:
0                      1                   11.40
1                      1                   11.04
5                      1                   9.60
10                     1                   7.80
15                     1                   6.00
Multiple Regression
If graphed, holding one variable constant produces a two-
dimensional graph for the other variable.
[Graph 1: Ŷ vs. Income (X2 from 0 to 15), Education held at 1; the line falls from 11.44 to 5.44, slope b = −.40]
[Graph 2: Ŷ vs. Education (X1 from 0 to 15), Income held at 1; the line falls from 11.40 to 6.00, slope b = −.36]
Dummy Explanatory Variables
Qualitative binomial (0,1) variables. Di
Yi = β0 + β1Xi + β2Di + ui
For Di = 0 : Yi = β0 + β1Xi + ui
For Di = 1 : Yi = β0 + β1Xi + β2 +ui
Yi = (β0+β2)+ β1Xi +ui
To measure the effect of Di on the relation between X and Y
Yi = β0 + β1Xi + β2Xi*Di + ui
For Di = 0 : Yi = β0 + β1Xi + ui
For Di = 1 : Yi = β0 + β1Xi + β2Xi +ui
Yi = β0+ (β1+β2)Xi +ui
Warning: dummy variables can be used only as regressors. Should the dependent variable be binomial, you need to use Logit or Probit regression models, which employ the ML estimator. This is because the binomial feature violates the normal-distribution assumption, which renders t-statistics invalid. (You can learn these techniques in Econometrics II.)
Time-period dummies can be used for:
1) measuring the stability of a relationship over time
2) treating outliers
Seasonal dummies can be used to treat seasonal variation in seasonally unadjusted data: simply create n − 1 dummies for n seasonal sections and include them as regressors to control for seasonal variation.
Multiple Regression
The way you use nominal variables in regression is by converting them to a series of dummy variables.
Recode into different dummy variables:

Nominal Variable: Race (1 = White, 2 = Black, 3 = Other)

Dummy Variables:
1. White: 0 = Not White; 1 = White
2. Black: 0 = Not Black; 1 = Black
3. Other: 0 = Not Other; 1 = Other
Multiple Regression
The way you use nominal variables in regression is by converting them to a series of dummy variables.
Recode into different dummy variables:

Nominal Variable: Religion (1 = Catholic, 2 = Protestant, 3 = Jewish, 4 = Muslim, 5 = Other Religions)

Dummy Variables:
1. Catholic: 0 = Not Catholic; 1 = Catholic
2. Protestant: 0 = Not Prot.; 1 = Protestant
3. Jewish: 0 = Not Jewish; 1 = Jewish
4. Muslim: 0 = Not Muslim; 1 = Muslim
5. Other Religions: 0 = Not Other; 1 = Other Relig.
Multiple Regression
When you need to use a nominal variable in
regression (like race), just convert it to a series
of dummy variables.
When you enter the variables into your model,
you MUST LEAVE OUT ONE OF THE
DUMMIES.
Leave Out One: White
Enter Rest into Regression: Black, Other
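The recode-and-leave-one-out step might look like this in code (hypothetical sample; the dummy variable names are mine):

```python
# Recoding a nominal race variable (1 = White, 2 = Black, 3 = Other)
# into dummies, leaving White out as the reference group.
race = [1, 2, 3, 1, 2, 2, 3, 1]            # hypothetical sample of 8 cases
black = [1 if r == 2 else 0 for r in race]
other = [1 if r == 3 else 0 for r in race]
# White cases are identified by black == 0 and other == 0, so a third
# "white" dummy would be redundant (perfect collinearity with the intercept).
print(black)   # [0, 1, 0, 0, 1, 1, 0, 0]
print(other)   # [0, 0, 1, 0, 0, 0, 1, 0]
```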
Multiple Regression
The reason you MUST LEAVE OUT ONE OF THE DUMMIES is that regression is mathematically impossible without an excluded group: if all of the dummies were included, they would sum to 1 for every case, making them perfectly collinear with the intercept (the dummy variable trap). If all were in, holding one of them constant would prohibit variation in all the rest.
Leave Out One: Catholic
Enter Rest into Regression: Protestant, Jewish, Muslim, Other Religion
Multiple Regression
The regression equations for dummies will
look the same.
For Race, with 3 dummies, predicting self-esteem:
Y = a + b1X1 + b2X2
a = the y-intercept,
which in this case is
the predicted value
of self-esteem for
the excluded group,
white.
b1 = the slope
for variable
X1, black
b2 = the slope
for variable
X2, other
Multiple Regression
• If our equation were:
For Race, with 3 dummies, predicting self-esteem:
Y = 28 + 5X1 – 2X2
a = the y-intercept,
which in this case is
the predicted value
of self-esteem for
the excluded group,
white.
5 = the slope
for variable
X1, black
-2 = the slope
for variable
X2, other
Plugging in values for
the dummies tells you
each group’s self-
esteem average:
White = 28
Black = 33
Other = 26
When cases’ values for X1 = 0 and X2 = 0, they are white;
when X1 = 1 and X2 = 0, they are black;
when X1 = 0 and X2 = 1, they are other.
Multiple Regression
• Dummy variables can be entered into multiple regression along with other dichotomous and continuous variables.
• For example, you could regress self-esteem on sex, race, and education:
Y = a + b1X1 + b2X2 + b3X3 + b4X4
How would you interpret this?
Y = 30 – 4X1 + 5X2 – 2X3 + 0.3X4
X1 = Female
X2 = Black
X3 = Other
X4 = Education
Multiple Regression
How would you interpret this?
Y = 30 – 4X1 + 5X2 – 2X3 + 0.3X4
1. Women’s self-esteem is 4 points lower than men’s.
2. Blacks’ self-esteem is 5 points higher than whites’.
3. Others’ self-esteem is 2 points lower than whites’ and
consequently 7 points lower than blacks’.
4. Each year of education improves self-esteem by 0.3
units.
X1 = Female
X2 = Black
X3 = Other
X4 = Education
Multiple Regression
How would you interpret this?
Y = 30 – 4X1 + 5X2 – 2X3 + 0.3X4
Plugging in some select values, we’d get self-esteem for
select groups:
• White males with 10 years of education = 33
• Black males with 10 years of education = 38
• Other females with 10 years of education = 27
• Other females with 16 years of education = 28.8
X1 = Female
X2 = Black
X3 = Other
X4 = Education
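These group predictions are easy to verify by evaluating the equation; a sketch (the function name is illustrative):

```python
# Evaluate Y = 30 - 4*X1 + 5*X2 - 2*X3 + 0.3*X4
# (X1 = female, X2 = black, X3 = other, X4 = years of education).
def esteem(female, black, other, educ):
    return 30 - 4 * female + 5 * black - 2 * other + 0.3 * educ

print(round(esteem(0, 0, 0, 10), 1))   # white male, 10 yrs of education
print(round(esteem(0, 1, 0, 10), 1))   # black male, 10 yrs
print(round(esteem(1, 0, 1, 10), 1))   # other female, 10 yrs
print(round(esteem(1, 0, 1, 16), 1))   # other female, 16 yrs
```

The four values reproduce the slide: 33, 38, 27, and 28.8.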
Multiple Regression
How would you interpret this?
Y = 30 – 4X1 + 5X2 – 2X3 + 0.3X4
The same regression rules apply. The slopes represent
the linear relationship of each independent variable in
relation to the dependent while holding all other
variables constant.
X1 = Female
X2 = Black
X3 = Other
X4 = Education
Make sure you get into the habit of saying the
slope is the effect of an independent variable
“while holding everything else constant.”
Seasonal adjustment using dummy variables
Example: Suppose a researcher is using seasonally-
unadjusted data at the quarterly frequency for the
variable Yt.
For 4 quarters, create 3 dummies:
D1= 1 if t is Q1, 0 otherwise
D2= 1 if t is Q2, 0 otherwise
D3= 1 if t is Q3, 0 otherwise
The residuals of the regression
Yt = β0 + β1D1,t + β2D2,t + β3D3,t + εt
are the seasonally adjusted Yt.
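When the only regressors are the quarterly dummies plus the intercept, the fitted values are simply the quarter means, so the residuals can be computed directly without a regression routine. A sketch on made-up quarterly data:

```python
# Seasonal adjustment with quarterly dummies (hypothetical data).
# Regressing Y on an intercept plus Q1-Q3 dummies fits each quarter's mean,
# so the residuals are Y with the quarterly pattern removed.
y = [10, 14, 8, 12,  11, 15, 9, 13,  12, 16, 10, 14]   # 3 years, Q1..Q4
quarter = [t % 4 for t in range(len(y))]                # 0..3 codes Q1..Q4

qmean = {q: sum(v for v, qq in zip(y, quarter) if qq == q) / quarter.count(q)
         for q in range(4)}
adjusted = [v - qmean[q] for v, q in zip(y, quarter)]   # regression residuals
print([round(a, 2) for a in adjusted])
```

Here the seasonal pattern vanishes and only the year-to-year movement (−1, 0, +1) remains.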
Log Transformations
Yi = β0 + β1Xi + ui
The β1 in the above regression indicates the
expected change in Yi resulting from a 1-unit
increase in Xi. – not the relationship in % terms –
If you need to compute the expected % change in
Yi resulting from a 1% increase in Xi , you need to
run the following regression:
Ln(Yi )= β0 + β1Ln(Xi) + ui
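A synthetic check of the log-log interpretation: if Y = A·X² exactly, the regression of Ln(Y) on Ln(X) recovers an elasticity of β1 = 2. The bivariate slope formula stands in for a full regression routine:

```python
import math

# Constant-elasticity illustration on synthetic data: Y = 3 * X^2.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [3.0 * x ** 2 for x in X]

lx = [math.log(x) for x in X]
ly = [math.log(v) for v in Y]
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
beta1 = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / \
        sum((a - mx) ** 2 for a in lx)
print(round(beta1, 6))   # a 1% rise in X raises Y by about 2%
```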
Assumptions of OLS Estimator
1) E(ei) = 0 (unbiasedness)
2) Var(ei) is constant (homoscedasticity)
3) Cov(ui,uj) = 0 (independent error terms)
4) Cov(ui,Xi) = 0 (error terms unrelated to X’s)
ei ~ iid(0, σ²)
Gauss-Markov Theorem: If these conditions
hold, OLS is the best linear unbiased
estimator (BLUE).
Additional Assumption: ei’s are normally distributed.
Time Series Regressions
Lagged variable: Yt = β0+β1Xt+β2Xt-1+ut
Autoregressive Model: Xt = β1Xt-1+β2Xt-2+ut
Time-Trend: Yt = β0 + β1Xt + β2Tt+ut
Spurious Regressions
• As a general and very strict rule:
All variables in a time-series regression must
be stationary.
Never run a regression with nonstationary
variables!
* The DW statistic will warn you of spurious regression.
A nonstationary variable can be made stationary by
taking its first difference.
If X is nonstationary, ΔXt = Xt − Xt−1 may be stationary.
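A minimal illustration of first differencing: a series with a deterministic trend (and hence a nonconstant mean) becomes constant after differencing.

```python
# First differences of a trending series.
x = [2 * t + 5 for t in range(10)]                 # trending, hence nonstationary in mean
dx = [x[t] - x[t - 1] for t in range(1, len(x))]   # first differences
print(dx)   # every difference equals the trend slope, 2
```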
Exercise: How to create a regression?
• Descriptive statistics: mean, median, etc.
• Correlation: not over 0.5 among the xi (explanatory variables)
• Stationarity: ADF test
• Run the regression
• Test for heteroscedasticity and normality
• Test VIF in case of multicollinearity