STAT 360—REGRESSION ANALYSIS - Technology...


Page 1: STAT 360—REGRESSION ANALYSIS - Technology ...course1.winona.edu/cmalone/stat360/Notes/Handout15.docx · Web viewThat is, the derivative of the mean function with respect to each

Handout #15: Polynomial Terms in a Linear Regression Model

Section 15.1: Fitting a Linear Regression Model with Polynomial Terms

Consider the situation of modeling the temperature profile across the United States.

[Figure: Temperature Profile of United States, with axes Latitude and Longitude]

Standard Linear Regression Setup

Response Variable: Temperature
Predictor Variables: Latitude and Longitude
Initially assume the following structure for the mean and variance functions:

o $E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude}$
o $\mathrm{Var}(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$


Understanding the Effects in this model

[Figure: two panels, "Reality" and "Model Effects", each with axes Latitude and Longitude]

A linear term, i.e. constant rate of change, for Longitude does not appear to match reality.


Possible Fix: Include a quadratic term in the mean function.

$E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2$


Comments

Consider the proposed (updated) mean function

$E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2$

1. This model is said to have two predictors, but three terms in addition to intercept. Generally speaking, a mean function is constructed using terms. Terms may be simple predictors, combinations of predictor variables, or functions of the predictor variables.

Predictors    Terms
Latitude      Intercept
Longitude     Latitude
              Longitude
              Longitude^2
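The distinction between predictors and terms can be seen in the design matrix that R builds from a model formula. The following is a minimal sketch, not part of the original handout; the three coordinate pairs are those of Albany, Atlanta, and Austin as listed in Section 15.3, and the object names `toy` and `X` are made up for illustration.

```r
# Coordinates of three cities (Albany, Atlanta, Austin) used only for illustration
toy <- data.frame(Latitude  = c(42.67, 33.75, 30.27),
                  Longitude = c(73.75, 84.38, 97.73))

# Two predictors enter the formula; I() protects the arithmetic so that
# Longitude^2 is treated as its own term
X <- model.matrix(~ Latitude + Longitude + I(Longitude^2), data = toy)

colnames(X)   # one column per term, including the intercept
ncol(X)       # number of terms built from the two predictors
```

Each column of `X` corresponds to one term, so two predictors give rise to four columns here.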

2. This model is said to be a linear model even though it includes a quadratic term. The notion of linear here means linear in its coefficients. That is, the derivative of the mean function with respect to each coefficient does not depend on any of the coefficients.

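o $\frac{\partial}{\partial \beta_0} E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = 1$
o $\frac{\partial}{\partial \beta_1} E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \text{Latitude}$
o $\frac{\partial}{\partial \beta_2} E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \text{Longitude}$
o $\frac{\partial}{\partial \beta_3} E(\text{Temperature} \mid \text{Latitude}, \text{Longitude}) = \text{Longitude}^2$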

An example of a non-linear model: consider the Michaelis-Menten model for enzyme kinetics. In this model v = reaction rate and x = concentration of substrate. Notice that the partial derivative with respect to each coefficient is a function of at least one coefficient.

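$E(v \mid x) = \frac{\beta_1 \cdot x}{\beta_2 + x}$

Derivatives

o $\frac{\partial}{\partial \beta_1} E(v \mid x) = \frac{x}{\beta_2 + x}$
o $\frac{\partial}{\partial \beta_2} E(v \mid x) = \frac{-\beta_1 x}{(\beta_2 + x)^2}$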
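The contrast between the two sets of derivatives can be checked symbolically. The sketch below, which is not part of the original handout, uses base R's `D()` function; the names `b0` to `b3`, `linear_mean`, and `mm_mean` are chosen here for illustration only.

```r
# Mean functions written as R expressions; b0..b3 play the role of the coefficients
linear_mean <- expression(b0 + b1 * Latitude + b2 * Longitude + b3 * Longitude^2)
mm_mean     <- expression(b1 * x / (b2 + x))

# Linear model: the derivative with respect to b3 involves predictors only
d_lin <- D(linear_mean, "b3")

# Michaelis-Menten model: the derivatives still involve coefficients
d_b1 <- D(mm_mean, "b1")
d_b2 <- D(mm_mean, "b2")

# all.vars() lists the symbols that remain in each derivative
all.vars(d_lin)
all.vars(d_b1)
all.vars(d_b2)
```

No coefficient name survives in the derivative of the linear mean function, while `b2` (and, for the second derivative, `b1`) remains in the Michaelis-Menten derivatives.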

Visual


Estimating the coefficients of a linear model is more straightforward than for a non-linear model. Consider the construction of the X matrix used by software to estimate model coefficients. In a linear regression model, the X matrix is built from the terms alone, so the coefficients are obtained in a single step; for a non-linear model, however, the elements of the X matrix depend on the estimated coefficients. Thus, estimation must be done iteratively: obtain initial estimates for the coefficients, update the X matrix, re-estimate the coefficients, update the X matrix, re-estimate the coefficients, and so on. This is known as iterative least squares estimation and is repeated until the coefficients change very little from one iteration to the next.
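The iterative scheme described above can be sketched for the Michaelis-Menten model. This is a bare-bones Gauss-Newton loop written for illustration, not the handout's or any package's implementation; the data are simulated, and the true values (10 and 1.5), the starting values (8 and 1), and all object names are assumptions made here.

```r
set.seed(1)
# Simulated enzyme-kinetics data (hypothetical true values b1 = 10, b2 = 1.5)
x <- seq(0.2, 5, length.out = 30)
v <- 10 * x / (1.5 + x) + rnorm(length(x), sd = 0.1)

mm_mean <- function(b, x) b[1] * x / (b[2] + x)
# "X matrix" for the non-linear model: its columns are the partial derivatives,
# which depend on the current coefficient values
mm_X <- function(b, x) cbind(x / (b[2] + x), -b[1] * x / (b[2] + x)^2)

b <- c(8, 1)                              # initial estimates
for (iter in 1:50) {
  X <- mm_X(b, x)                         # update the X matrix
  r <- v - mm_mean(b, x)                  # residuals at the current estimates
  step <- solve(t(X) %*% X, t(X) %*% r)   # least squares on the linearized problem
  b <- b + as.vector(step)                # re-estimate the coefficients
  if (max(abs(step)) < 1e-8) break        # stop when the estimates barely change
}
b

# For comparison: R's built-in non-linear least squares routine
fit_nls <- nls(v ~ b1 * x / (b2 + x), start = list(b1 = 8, b2 = 1))
coef(fit_nls)
```

The loop has no safeguards such as step halving, so it should be read as a demonstration of the idea rather than a robust estimator.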

Linear Model: $\hat{\beta} = (X'X)^{-1} X'Y$
Non-Linear Model: no single-step solution; the coefficients are estimated iteratively.


Section 15.2: Predicting January Temperature in continental United States

Example 15.2.1 For this example, consider the US City Weather dataset on our course website. A snippet of the dataset is provided here.

Note: Albuquerque, NM will be removed from our analysis. This city is an extreme outlier. The effect of this city on the analysis will be considered after fitting an appropriate model.

To begin, consider a standard linear model setup. In Section 15.1, we learned that this is likely an inappropriate model, as a quadratic term for Longitude is probably necessary. The inadequate form of this model with respect to Longitude will be apparent when plotting the residuals from this model against Longitude.

Model Setup (without the use of a quadratic term for Longitude)

Response Variable: Jan Temp
Predictor Variables: Latitude and Longitude
Begin with the standard mean and variance functions:

o $E(\text{JanTemp} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude}$
o $\mathrm{Var}(\text{JanTemp} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$

Output from JMP for the model specified above.

[Figures: Standard Regression Output; Residual Plot]

Note: Potential problems with residuals. Further investigation is warranted.


The residuals from the initial model are plotted against each predictor, Latitude and Longitude respectively. The anticipated lack-of-fit due to not incorporating a quadratic term for Longitude is apparent in the plot to the right.

[Figure: residuals plotted against Latitude, showing a weak quadratic trend, and against Longitude, showing a much stronger quadratic trend, as suggested by the following output]

Some may consider the model fitting done above as overkill. If the goal is simply to trend the residuals, a kernel smoother is likely sufficient. The usual Analyze > Fit Y by X platform can be used in JMP; however, the Graph Builder framework is somewhat quicker and easier. This can be done by selecting Graph > Graph Builder.
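The same kind of residual trend can be drawn in R. The sketch below is an illustration only: the data are simulated (they are not the course dataset), the bandwidth of 15 is an arbitrary choice, and `ksmooth()` from base R stands in for whatever smoother JMP applies.

```r
set.seed(15)
# Simulated stand-in for the city data (NOT the course dataset): the true mean
# is curved in Longitude, but the model fitted below is linear in Longitude
n <- 94
sim <- data.frame(Latitude = runif(n, 25, 48), Longitude = runif(n, 67, 123))
sim$JanTemp <- 100 - 2.4 * sim$Latitude + 0.02 * (sim$Longitude - 95)^2 +
  rnorm(n, sd = 3)

fit_linear <- lm(JanTemp ~ Latitude + Longitude, data = sim)
res <- resid(fit_linear)

# Kernel smoother of the residuals against Longitude (bandwidth chosen by eye)
smooth <- ksmooth(sim$Longitude, res, kernel = "normal", bandwidth = 15)

plot(sim$Longitude, res, xlab = "Longitude", ylab = "Residual")
lines(smooth, lwd = 2)       # smoothed trend in the residuals
abline(h = 0, lty = 2)       # reference line at zero
```

A smoothed curve that bends away from the zero line, as it does for these simulated data, is the signal that a term is missing from the mean function.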


Updated Model Setup

Response Variable: Jan Temp
Predictor Variables: Latitude and Longitude
Terms: Intercept, Latitude, Longitude, Longitude^2

Updated structure for mean function; continue with constant variance function:

o $E(\text{JanTemp} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2$
o $\mathrm{Var}(\text{JanTemp} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$

Fitting this model in JMP


The regression output from the model that includes a quadratic term for Longitude is provided here. The accompanying residual plot is provided as well.

[Figures: Standard Regression Output; Residual Plot]

Note: In addition to the plot provided by JMP, the residuals should be plotted against each predictor to ensure correct functional form.

Comment:

The standard form for a quadratic function is given by $y = a \cdot x^2 + b \cdot x + c$. The effect of each coefficient is shown graphically here.

In my opinion, the fact that the coefficient for Longitude is *not* statistically different from 0 is irrelevant. It implies only that the horizontal position of the vertex of the parabola is not statistically different from the average Longitude (the average Longitude enters because JMP applies a horizontal shift to this term, centering it at the average).

Identifying a parsimonious model, i.e. a model with only significant terms, is of utmost importance when modeling. However, in this case, the Longitude^2 term is statistically important, and keeping a lower-order term of the polynomial, i.e. Longitude, in the model does not really increase the complexity of the model.
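The role of the horizontal shift can be illustrated in R. This sketch assumes that JMP centers the squared term at the average Longitude while leaving the linear term as is, which is how the comment above describes it; the data are simulated and all object names are made up here.

```r
set.seed(360)
# Simulated stand-in for the city data (NOT the course dataset)
n <- 94
sim <- data.frame(Latitude = runif(n, 25, 48), Longitude = runif(n, 67, 123))
sim$JanTemp <- 100 - 2.4 * sim$Latitude + 0.02 * (sim$Longitude - 95)^2 +
  rnorm(n, sd = 3)

# Quadratic term left uncentered
fit_raw <- lm(JanTemp ~ Latitude + Longitude + I(Longitude^2), data = sim)

# Quadratic term centered at the average Longitude
sim$LonC <- sim$Longitude - mean(sim$Longitude)
fit_ctr <- lm(JanTemp ~ Latitude + Longitude + I(LonC^2), data = sim)

coef(fit_raw)   # the Longitude coefficient differs between the two fits ...
coef(fit_ctr)   # ... but the quadratic coefficient is the same
all.equal(fitted(fit_raw), fitted(fit_ctr))   # both describe the same fitted surface
```

The two parameterizations give the same fitted values; only the meaning of the Longitude coefficient changes. In the centered version it is the slope at the average Longitude, so a value near 0 places the vertex near the average.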


Unfortunately and a bit to my surprise, a model that includes a quadratic term did not fix the apparent lack-of-fit in the Longitude direction.

Final Model Setup

Response Variable: Jan Temp
Predictor Variables: Latitude and Longitude
Terms: Intercept, Latitude, Longitude, Longitude^2, Longitude^3

Updated structure for mean function; continue with constant variance function:

o $E(\text{JanTemp} \mid \text{Latitude}, \text{Longitude}) = \beta_0 + \beta_1 \cdot \text{Latitude} + \beta_2 \cdot \text{Longitude} + \beta_3 \cdot \text{Longitude}^2 + \beta_4 \cdot \text{Longitude}^3$
o $\mathrm{Var}(\text{JanTemp} \mid \text{Latitude}, \text{Longitude}) = \sigma^2$

Consider the rationale for including a cubic term for Longitude.

For a quadratic function, i.e. a 2nd-degree polynomial, the rate of decrease at a given distance to the left of the vertex is the same as the rate of increase at that distance to the right.

This is not the case for a 3rd degree polynomial. This functional form may in fact more closely match reality.
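The symmetry claim can be verified with a short derivation. The symbols $a$, $b$, $c$, $e$, $h$, $x_0$, and $d$ below are generic and introduced only for this sketch; they are not the regression coefficients.

```latex
% Quadratic: f(x) = a x^2 + b x + c, with vertex at h = -b / (2a)
f'(x) = 2 a x + b
\quad\Longrightarrow\quad
f'(h - d) = -2 a d, \qquad f'(h + d) = +2 a d .
% The slopes at distance d on either side of the vertex have equal size.

% Cubic: g(x) = a x^3 + b x^2 + c x + e, with a turning point x_0 where g'(x_0) = 0
g'(x) = 3 a x^2 + 2 b x + c
\quad\Longrightarrow\quad
g'(x_0 \pm d) = \pm (6 a x_0 + 2 b)\, d + 3 a d^2 .
% The extra term 3 a d^2 has the same sign on both sides, so the two slopes
% differ in size whenever a \neq 0.
```

The $3 a d^2$ term is what allows a cubic to fall at one rate on one side of a turning point and rise at a different rate on the other.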


Output from Final Model

Observations from this final model

The residual plots provided below suggest a correct functional form for the mean function.

The model appears to be doing well, with a high R^2 value of 95%. The Root Mean Square Error is fairly small, with a value of 2.9. Recall, an approximate 95% prediction interval is given by the predicted value ± 2*RMSE. This suggests that predictions for Jan Temp in most locations across the US can be made to within about 5.8 °F, i.e. 2 * 2.9 = 5.8.

The statistical importance of all model terms is evident through very small p-values as provided under the Prob > |t| column.

A plot of the predicted values against actual Jan Temp values suggests possible over-prediction for lower temps and under-prediction for warmer temps.
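The ±2*RMSE margin quoted in the observations above can be reproduced in a couple of lines. The values 2.904 and 89 degrees of freedom are taken from the R output for the final model in Section 15.3; the variable names are made up for this sketch.

```r
rmse <- 2.904              # residual standard error reported for the final model
half_width <- 2 * rmse     # approximate 95% prediction margin, in degrees F
half_width

# The multiplier 2 approximates the t quantile on 89 residual degrees of freedom
qt(0.975, df = 89)
```

This margin is an approximation: an exact prediction interval also accounts for the uncertainty in the estimated mean, so it is slightly wider, particularly for locations far from the bulk of the data.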


Residual Plots from Final Model

[Figures: Overall residual plot from final model; Checking normality in the residuals]

Outliers are taken to be observations whose residuals exceed 2*RMSE in magnitude. Over-prediction appears to be occurring in northern California (and Reno, which borders California), and under-prediction is occurring in some cities in the northwest.
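A list of this kind can be produced in R by comparing each residual with 2*RMSE. The sketch below uses simulated data with made-up city labels (it is not the course dataset), so the flagged rows have no geographic meaning.

```r
set.seed(42)
# Simulated stand-in for the city data (NOT the course dataset)
n <- 94
sim <- data.frame(City = paste("City", seq_len(n)),
                  Latitude = runif(n, 25, 48), Longitude = runif(n, 67, 123))
sim$JanTemp <- 100 - 2.4 * sim$Latitude + 0.02 * (sim$Longitude - 95)^2 +
  rnorm(n, sd = 3)
fit <- lm(JanTemp ~ Latitude + Longitude + I(Longitude^2), data = sim)

rmse <- sigma(fit)                 # root mean square error of the fit
res <- resid(fit)
flagged <- abs(res) > 2 * rmse     # residual exceeds 2*RMSE in magnitude

# Residual = actual - predicted, so a negative residual means over-prediction
outliers <- data.frame(sim[flagged, c("City", "Latitude", "Longitude")],
                       Residual = res[flagged],
                       Direction = ifelse(res[flagged] > 0,
                                          "under-predicted", "over-predicted"))
outliers
```

Sorting or mapping the flagged rows by Latitude and Longitude is one way to see whether the large residuals cluster in a region.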


Viewing the Fitted Model in JMP

The Surface Profiler functionality allows us to investigate the surface in JMP. Select Factor Profiling > Surface Profiler from the red drop-down menu in JMP.

The fitted model can be seen here.

[Figures: Profile view of Longitude; Profile view of Latitude]

Note: The Contour Profiler produces a 2-dimensional display of this surface.


Investigating the Effect of Albuquerque, New Mexico

Albuquerque, NM was removed from consideration in the above analyses. Albuquerque, NM is a city with a small Latitude (southern city) and a large Longitude (western city), but it has an unusually cold temperature due to its much higher altitude (in the mountains).

Compare the Summary of Fit output from the models with and without Albuquerque, NM.

[Figures: Model excluding Albuquerque, NM; Model including Albuquerque, NM]

The model including Albuquerque, NM clearly shows this city to be an extreme outlier.

An investigation of the hat values suggests that Albuquerque, NM does *not* appear to have high leverage, but is simply an outlier.

Recall, Cook's Distance combines information regarding the degree to which Albuquerque, NM is an outlier with its leverage. Albuquerque's Cook's Distance is substantially higher than that of the other cities.

Plot of Cook’s Distance


$\text{Cook's } D = \frac{(\text{student } r_i)^2}{k+1} \cdot \frac{h_i}{1-h_i} = \frac{(-7.32)^2}{5} \cdot \frac{0.055}{1-0.055} \approx 0.62$
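The Cook's Distance arithmetic above can be checked in R, and the formula can be compared with R's own `cooks.distance()`. This is a sketch: the helper `cooks_d()` is defined here for illustration, the second half uses simulated data unrelated to the weather dataset, and it assumes the studentized residual in the formula is the internally studentized one (what R calls `rstandard()`).

```r
# Cook's D from a studentized residual r, leverage h, and p = k + 1 coefficients
cooks_d <- function(r, h, p) (r^2 / p) * (h / (1 - h))

# Values for Albuquerque, NM as given in the handout
cooks_d(r = -7.32, h = 0.055, p = 5)

# Check of the formula against R's built-in function on simulated data
set.seed(7)
sim <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(40)
fit <- lm(y ~ x1 + x2, data = sim)

manual <- cooks_d(rstandard(fit), hatvalues(fit), p = length(coef(fit)))
all.equal(manual, cooks.distance(fit))
```

With a leverage of only 0.055, nearly all of Albuquerque's Cook's Distance comes from the size of its studentized residual.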

Section 15.3: Fitting this model in R (with visualization of surface)

> # Attach the dataset to be used
> attach(USCity_WeatherData)

> # Get the names of the variables in this dataset
> names(USCity_WeatherData)
 [1] "City"                "Latitude"            "Longitude"
 [4] "JanTemp"             "AprilTemp"           "JulyTemp"
 [7] "OctTemp"             "Precipitation.in."   "Percipitation.days."
[10] "Snowfall.in."

> # Create the additional terms needed for the final model
> Longitude2=Longitude^2
> Longitude3=Longitude^3

> # Reconstruct a data frame with all necessary terms
> # This step is not necessary, but makes it much easier to remove Albuquerque, NM from model consideration
> mydata=data.frame(City,Latitude,Longitude,Longitude2,Longitude3,JanTemp)

> # Use the head() function to see the top portion of the newly created data
> head(mydata)
               City Latitude Longitude Longitude2 Longitude3 JanTemp
1        Albany, NY    42.67     73.75   5439.062   401130.9    22.2
2   Albuquerque, NM    35.08    106.65  11374.223  1213060.8     5.7
3     Asheville, NC    35.35     82.33   6778.229   558051.6    35.8
4       Atlanta, GA    33.75     84.38   7119.984   600784.3    42.7
5 Atlantic City, NJ    39.21     74.25   5513.062   409344.9    32.1
6        Austin, TX    30.27     97.73   9551.153   933434.2    50.2

> # Fit the linear model with Latitude and up to a 3rd-degree polynomial for Longitude.
> # data=mydata[-2,] removes the 2nd observation (Albuquerque, NM) from consideration.
> # If no observations are to be removed, this would simply read data=mydata.


> myfit=lm(JanTemp ~ Latitude + Longitude + Longitude2 + Longitude3,data=mydata[-2,])

> # The summary() function can be used to obtain the standard regression output
> summary(myfit)

Call:
lm(formula = JanTemp ~ Latitude + Longitude + Longitude2 + Longitude3,
    data = mydata[-2, ])

Residuals:
     Min       1Q   Median       3Q      Max
-12.7138  -1.0795   0.1146   1.4033   8.4928


Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.640e+02  8.551e+01  -3.087  0.00269 **
Latitude    -2.417e+00  6.152e-02 -39.289  < 2e-16 ***
Longitude    1.433e+01  2.733e+00   5.244 1.05e-06 ***
Longitude2  -1.720e-01  2.889e-02  -5.952 5.15e-08 ***
Longitude3   6.728e-04  1.004e-04   6.703 1.81e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.904 on 89 degrees of freedom
Multiple R-squared:  0.953,  Adjusted R-squared:  0.9508
F-statistic: 450.8 on 4 and 89 DF,  p-value: < 2.2e-16

Note: As it should, the model output here matches the output obtained from JMP.
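As an alternative to creating Longitude2 and Longitude3 by hand, the powers can be written directly in the model formula with `I()`. The sketch below is not the handout's code; it uses simulated data (not USCity_WeatherData) and made-up object names to show that both approaches give the same coefficients.

```r
set.seed(360)
# Simulated stand-in for USCity_WeatherData (NOT the course dataset)
n <- 95
sim <- data.frame(Latitude = runif(n, 25, 48), Longitude = runif(n, 67, 123))
sim$JanTemp <- 100 - 2.4 * sim$Latitude + 0.02 * (sim$Longitude - 95)^2 +
  rnorm(n, sd = 3)

# Approach used in the handout: build the power terms as separate columns
sim$Longitude2 <- sim$Longitude^2
sim$Longitude3 <- sim$Longitude^3
fit_cols <- lm(JanTemp ~ Latitude + Longitude + Longitude2 + Longitude3,
               data = sim[-2, ])

# Alternative: write the powers inside the formula with I()
fit_I <- lm(JanTemp ~ Latitude + Longitude + I(Longitude^2) + I(Longitude^3),
            data = sim[-2, ])

all.equal(unname(coef(fit_cols)), unname(coef(fit_I)))

# Because the formula builds the terms itself, predict() needs only the predictors
new_city <- data.frame(Latitude = 40, Longitude = 90)
predict(fit_I, newdata = new_city)
```

One practical advantage of the `I()` form is that new locations can be passed to `predict()` without first computing their squared and cubed Longitude.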

Using R to create 3D plot of surface being fit

Step 1: Create a grid pattern for Latitude and Longitude.

> # Get smallest Latitude
> min(Latitude)
[1] 25.77

> # Get largest Latitude
> max(Latitude)
[1] 47.67

> # Create a sequence of x values to make predictions from
> xgrid=seq(from=25,to=48,by=0.1)

Now, do the same for Longitude.

> min(Longitude)
[1] 67.59
> max(Longitude)
[1] 122.68
> ygrid=seq(from=67,to=123,by=0.1)

Creating a grid so that predictions can be made across continental US

Step 2: Create a function from which prediction can be made.

Note: The coefficients for each term are determined from the final regression model.

> mypredict=function(x,y){-263.988 - 2.417*x + 14.333*y - 0.172*y^2 + 0.00067*y^3}

Step 3: Obtain predictions across grid and create the 3d plot

> # Use the outer() function to obtain predictions across the entire grid; save the results in a matrix named z
> z=outer(xgrid,ygrid,mypredict)
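Typing rounded coefficients into a prediction function can shift the surface noticeably, because Longitude^3 is on the order of a million. The sketch below looks only at the cubic term, comparing the value printed by summary() (6.728e-04) with the value typed into mypredict() (0.00067); rounding in the other coefficients may add to or offset this. The helper `surface_from_coef()` is hypothetical and introduced only for illustration.

```r
b4_reported <- 6.728e-04   # Longitude3 estimate as printed by summary()
b4_typed    <- 0.00067     # value typed into the prediction function

lon <- c(70, 95, 120)
shift <- (b4_reported - b4_typed) * lon^3   # change in predicted temperature
round(shift, 2)

# Building the prediction function from a coefficient vector avoids retyping.
# With the fitted model available, it could be called as surface_from_coef(coef(myfit)).
surface_from_coef <- function(b) {
  b <- unname(b)
  function(x, y) b[1] + b[2] * x + b[3] * y + b[4] * y^2 + b[5] * y^3
}

# Small demonstration using the coefficients typed in the handout
f <- surface_from_coef(c(-263.988, -2.417, 14.333, -0.172, 0.00067))
z_small <- outer(c(30, 40), c(80, 100), f)
dim(z_small)
```

The shift grows with Longitude and reaches several degrees at the western edge of the grid, which is comparable to the model's RMSE.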


The graph will be made using an R package named rgl{}. This package must first be downloaded onto your machine and loaded in your current R session.

> # Load the rgl package in R
> # First, download the rgl package onto your machine, e.g. with install.packages("rgl")
> # Next, load this package into your current R session
> library(rgl)

Once the rgl{} package has been loaded into your current R session, use the persp3d() function to create the surface.

> # Use the persp3d() function to create a 3d perspective plot
> persp3d(xgrid,ygrid,z,xlab="Latitude",ylab="Longitude",zlab="JanTemp")

A plot is produced in the rgl device window. This plot can be rotated to see the surface from different angles.

[Figures: A view of the surface in the Latitude direction; A view of the surface in the Longitude direction]


Section 15.4: Fitting the More Appropriate Geo-Spatial Model in R

A geo-spatial model is a modeling strategy that utilizes the correlation structure arising from the proximity of the observations. For example, when making a prediction for a particular city (called kriging in geo-spatial models), cities in close proximity are weighted more heavily than other cities. Model specifications determine the degree of closeness that is appropriate.

[Figure: Sampling locations in our dataset]

Observations close in distance have more impact than others when making predictions in geo-spatial models.

The gstat{} package in R will be used here to make geo-spatial predictions for Jan Temp.

> # Download and load the gstat{} package
> library(gstat)

> # Getting initial estimates for model parameters
> init.model=gstat(id="JanTemp",formula=JanTemp~1,locations = ~ Latitude + Longitude,data=USCity_WeatherData)

A variogram function is used to identify appropriate parameters for the geo-spatial model.

> # Use a variogram function to update model parameters
> plot(variogram(init.model))

Variogram Characteristics

o Sill = 275
o Nugget = 5
o Partial Sill = 270
o Range = 20
o A Gaussian model will be used, i.e. "Gau"

[Figure: Variogram Plot]
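To see how the variogram characteristics fit together, the Gaussian model can be written out and evaluated in base R. This is a sketch under an assumption: it takes the Gaussian variogram to have the form nugget + partial sill * (1 - exp(-(h/range)^2)) for distance h > 0, which is my understanding of the "Gau" model; the function name `gau_variogram()` is made up here and is not part of gstat.

```r
# Gaussian variogram with the values read off the variogram plot
# (assumed form; h is the distance between two locations)
gau_variogram <- function(h, psill = 270, range = 20, nugget = 5) {
  ifelse(h == 0, 0, nugget + psill * (1 - exp(-(h / range)^2)))
}

h <- c(0, 1, 10, 20, 40, 80)
round(gau_variogram(h), 1)

# The curve starts just above the nugget and levels off at
# nugget + partial sill = sill
gau_variogram(1e3)
```

Under this form the curve rises from the nugget (5) toward the sill (275), with the range (20) controlling how quickly it levels off; nearby cities therefore differ little, and distant cities are treated as essentially uncorrelated.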


Using the variogram characteristics to obtain the geo-spatial model.

> # Fitting the variogram from the initial model
> variogram.fit=fit.variogram(variogram(init.model), vgm(psill=270,model="Gau",range=20,nugget=5))

Creating a grid so that predictions can be made across locations in the US.

> # Creating the grid so that predictions can be made
> mygrid = expand.grid(Latitude=seq(from=25,to=48,by=0.1),Longitude = seq(from=67,to=123,by=0.1))

Using the krige() function to make predictions using a geo-spatial model.

> # Using the krige() function to make predictions
> mypredict = krige(id="JanTemp",JanTemp ~ 1, locations = ~ Latitude + Longitude, model=variogram.fit,data=USCity_WeatherData,newdata=mygrid)

Use the plot3d() function to create the 3D plot. The axis3d() function is used here to clean up the axes a bit. The second plot was created using the levelplot() function in the lattice{} package.

> # Creating a visualization using the plot3d() function
> plot3d(mypredict$Latitude,mypredict$Longitude,mypredict$JanTemp.pred, xlab="",ylab="",zlab="",axes=F)
> axis3d("x",at=c(25,48),labels=c("South","North"))
> axis3d("y",at=c(67,90,123),labels=c("East","Midwest","West"))
> axis3d("z")

[Figures: The plot using the above code; Plot created using levelplot() in the lattice{} package]
