MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000...

22
Transformations Merlise Clyde Readings: Gelman & Hill Ch 2-4

Transcript of MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000...

Page 1: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Transformations

Merlise Clyde

Readings: Gelman & Hill Ch 2-4

Page 2: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Assumptions of Linear Regression

Yi = β0 + β1Xi1 + β2Xi2 + . . . βpXip + εi

I Model Linear in Xj but Xj could be a transformation of theoriginal variables

I εi ∼ N(0, σ2)I Yi

ind∼ N(β0 + β1Xi1 + β2Xi2 + . . . βpXip, σ2)

I correct mean functionI constant varianceI independent errorsI Normal errors

Page 3: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Animals

Read in Animal data from MASS. The data set containsmeasurements on body weight and brain weight.

Let’s try to predict brain weight (size) from body weight.

library(MASS)data(Animals)brain.lm = lm(brain ~ body, data=Animals)

Page 4: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Diagnostic Plots## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

540 550 560 570

−10

0010

0030

0050

00

Fitted values

Res

idua

ls

Residuals vs Fitted

African elephant

Asian elephant

Human

−2 −1 0 1 2

−1

01

23

4

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q−Q

African elephant

Asian elephant

Brachiosaurus

540 550 560 570

0.0

0.5

1.0

1.5

2.0

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale−LocationAfrican elephant

Asian elephant

Brachiosaurus

0.0 0.2 0.4 0.6 0.8 1.0

−2

−1

01

23

4

Leverage

Sta

ndar

dize

d re

sidu

als

Cook's distance

10.50.51

Residuals vs Leverage

Brachiosaurus

African elephant

Asian elephant

Page 5: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Leverage plot

0 5 10 15 20 25

0.0

0.2

0.4

0.6

0.8

1.0

Index

hatv

alue

s(br

ain.

lm)

Energetic students: how I should plot with ggplot?

Page 6: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Outliers and Influential Points

Flag outliers after Bonferroni Correction

pval = 2*(1 - pt(abs(rstudent(brain.lm)), brain.lm$df -1))rownames(Animals)[pval < .05/nrow(Animals)]

## [1] "Asian elephant" "African elephant"

Cook’s Distance > 1

rownames(Animals)[cooks.distance(brain.lm) > 1]

## [1] "Brachiosaurus"

Page 7: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Remove Influential Point & Refit

brain2.lm = lm(brain ~ body, data=Animals,subset = !cooks.distance(brain.lm)>1)

par(mfrow=c(2,2)); plot(brain2.lm)

500 1000 1500 2000

−20

000

2000

Fitted values

Res

idua

ls

Residuals vs Fitted

African elephantAsian elephant

Dipliodocus

−2 −1 0 1 2

−2

01

23

4

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q−Q

African elephant

Asian elephant

Dipliodocus

500 1000 1500 2000

0.0

0.5

1.0

1.5

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale−LocationAfrican elephant

Asian elephant

Dipliodocus

0.0 0.1 0.2 0.3 0.4 0.5

−2

01

23

4

Leverage

Sta

ndar

dize

d re

sidu

als

Cook's distance10.5

0.51

Residuals vs Leverage

Dipliodocus

African elephant

Triceratops

Page 8: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Keep removing points?

500 1000 1500 2000 2500 3000

−40

000

2000

4000

Fitted values

Res

idua

ls

Residuals vs Fitted

Asian elephant African elephant

Triceratops

−2 −1 0 1 2

−4

−2

02

4

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q−Q

Triceratops

African elephantAsian elephant

500 1000 1500 2000 2500 3000

0.0

0.5

1.0

1.5

2.0

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale−LocationTriceratops

African elephantAsian elephant

0.0 0.1 0.2 0.3 0.4 0.5 0.6

−4

−2

02

4

Leverage

Sta

ndar

dize

d re

sidu

als

Cook's distance

10.5

0.51

Residuals vs Leverage

Triceratops

African elephantAsian elephant

Page 9: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

And another one bites the dust

0 1000 2000 3000 4000 5000 6000

−10

000

500

1500

Fitted values

Res

idua

ls

Residuals vs Fitted

Asian elephant

Human

African elephant

−2 −1 0 1 2

−4

−2

02

4

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q−Q

Asian elephant

African elephant

Human

0 1000 2000 3000 4000 5000 6000

0.0

0.5

1.0

1.5

2.0

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale−LocationAsian elephant African elephant

Human

0.0 0.2 0.4 0.6 0.8

−4

−2

02

4

Leverage

Sta

ndar

dize

d re

sidu

als

Cook's distance

10.50.51

Residuals vs Leverage

African elephant

Asian elephant

Human

Page 10: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

and another one

0 1000 2000 3000 4000

−50

00

500

1000

Fitted values

Res

idua

ls

Residuals vs Fitted

Human

CowHorse

−2 −1 0 1 2

−1

01

23

4

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q−Q

Human

Asian elephant

Cow

0 1000 2000 3000 4000

0.0

0.5

1.0

1.5

2.0

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale−LocationHuman

Asian elephant

Cow

0.0 0.2 0.4 0.6 0.8

−2

−1

01

23

4

Leverage

Sta

ndar

dize

d re

sidu

als

Cook's distance

10.5

0.51

Residuals vs Leverage

Asian elephant

Human

Cow

Page 11: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

And they just keep coming!

Figure 1: Walt Disney Fantasia

Page 12: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Plot of Original Data (what you should always do first!)

library(ggplot2)ggplot(Animals, aes(x=body, y=brain)) +

geom_point() +xlab("Body Weight") + ylab("Brain Weight")

0

2000

4000

0 25000 50000 75000

Body Weight

Bra

in W

eigh

t

Page 13: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Log Transform

Body Weight

Fre

quen

cy

0e+00 2e+04 4e+04 6e+04 8e+04 1e+05

05

1015

2025

log(Body Weight)

Fre

quen

cy

0 5 10

01

23

45

6Who can reproduce this slide using ggplot? Tell me how on Piazza!Even better make a pull request!

Page 14: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Plot of Transformed DataAnimals= mutate(Animals, log.body = log(body))ggplot(Animals, aes(log.body, brain)) + geom_point()

0

2000

4000

−4 0 4 8 12

log.body

brai

n

#plot(brain ~ body, Animals, log="x")

Page 15: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Diagnostics with log(body)

−500 0 500 1000 1500

−20

000

2000

5000

Fitted values

Res

idua

ls

Residuals vs Fitted

15

7

26

−2 −1 0 1 2

−1

01

23

4

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q−Q

15

7

26

−500 0 500 1000 1500

0.0

0.5

1.0

1.5

2.0

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale−Location15

7

26

0.00 0.05 0.10 0.15−

11

23

4

Leverage

Sta

ndar

dize

d re

sidu

als

Cook's distance

0.5

1

Residuals vs Leverage

15

7

26

Variance increasing with mean

Page 16: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Try Log-LogAnimals= mutate(Animals, log.brain= log(brain))ggplot(Animals, aes(log.body, log.brain)) + geom_point()

0

2

4

6

8

−4 0 4 8 12

log.body

log.

brai

n

#plot(brain ~ body, Animals, log="xy")

Page 17: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Diagnostics with log(body) & log(brain)

2 4 6 8

−4

−2

01

23

Fitted values

Res

idua

ls

Residuals vs Fitted

6 2616

−2 −1 0 1 2

−2

−1

01

2

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q−Q

6 2616

2 4 6 8

0.0

0.5

1.0

1.5

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale−Location6 26

16

0.00 0.05 0.10 0.15

−2

−1

01

2

Leverage

Sta

ndar

dize

d re

sidu

als

Cook's distance 0.5

0.5

Residuals vs Leverage

26616

Page 18: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Optimal Transformation for NormalityThe BoxCox procedure can be used to find “best” powertransformation λ of Y (for positive Y) for a given set of transformedpredictors.

Ψ(Y, λ) ={

Yλ−1λ if λ 6= 0

log(Y) if λ = 0

Find value of λ that maximizes the likelihood derived fromΨ(Y, λ) ∼ N(Xβλ, σ

2λ) (need to obtain distribution of Y first)

Find λ to minimize

RSS(λ) = ‖ΨM(Y, λ)− Xβ̂λ‖2

ΨM(Y, λ) ={

(GM(Y)1−λ(Yλ − 1)/λ if λ 6= 0GM(Y) log(Y) if λ = 0

where GM(Y) = exp(∑

log(Yi )/n) (Geometric mean)

Page 19: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

boxcox in R: Profile likelihood

library(MASS)boxcox(braintransX.lm)

−2 −1 0 1 2

−25

0−

200

−15

0−

100

−50

λ

log−

Like

lihoo

d

95%

Page 20: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Caveats

I Boxcox transformation depends on choice of transformations ofX ’s

I For choice of X transformation use boxTidwell inlibrary(car)

I transformations of X ’s can reduce leverage values (potentialinfluence)

I if the dynamic range of Y or X is less than 1 or 10 (iemax/min) then transformation may have little effect

I transformations such as logs may still be useful forinterpretability

I outliers that are not influential may still

Page 21: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Review of Last Class

I In the model with both response and predictor log transformed,are dinosaurs outliers?

I should you test each one individually or as a group; if as agroup how do you think you would you do this using lm?

I do you think your final model is adequate? What else mightyou change?

Page 22: MerliseClyde Readings: Gelman&HillCh2-4 · 2017-01-25 · Andanotheronebitesthedust 0 1000 2000 3000 4000 5000 6000-1000 0 500 1500 Fitted values Residuals Residuals vs Fitted Asian

Check Your Prediction SkillsAfter you determine whether dinos can stay or go and refine yourmodel, what about prediction?

I I would like to predict Aria’s brain size given her current weightof 259 grams. Give me a prediction and interval estimate.

I Is her body weight within the range of the data in Animals orwill you be extrapolating? What are the dangers here?

I Can you find any data on Rose-Breasted Cockatoo brain sizes?Are the values in the prediction interval?