Transformations
Merlise Clyde
Readings: Gelman & Hill Ch 2-4
Assumptions of Linear Regression
Yi = β0 + β1Xi1 + β2Xi2 + . . . βpXip + εi
I Model Linear in Xj but Xj could be a transformation of theoriginal variables
I εi ∼ N(0, σ2)I Yi
ind∼ N(β0 + β1Xi1 + β2Xi2 + . . . βpXip, σ2)
I correct mean functionI constant varianceI independent errorsI Normal errors
Animals
Read in Animal data from MASS. The data set containsmeasurements on body weight and brain weight.
Let’s try to predict brain weight (size) from body weight.
library(MASS)data(Animals)brain.lm = lm(brain ~ body, data=Animals)
Diagnostic Plots## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
540 550 560 570
−10
0010
0030
0050
00
Fitted values
Res
idua
ls
Residuals vs Fitted
African elephant
Asian elephant
Human
−2 −1 0 1 2
−1
01
23
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
African elephant
Asian elephant
Brachiosaurus
540 550 560 570
0.0
0.5
1.0
1.5
2.0
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−LocationAfrican elephant
Asian elephant
Brachiosaurus
0.0 0.2 0.4 0.6 0.8 1.0
−2
−1
01
23
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
10.50.51
Residuals vs Leverage
Brachiosaurus
African elephant
Asian elephant
Leverage plot
0 5 10 15 20 25
0.0
0.2
0.4
0.6
0.8
1.0
Index
hatv
alue
s(br
ain.
lm)
Energetic students: how I should plot with ggplot?
Outliers and Influential Points
Flag outliers after Bonferroni Correction
pval = 2*(1 - pt(abs(rstudent(brain.lm)), brain.lm$df -1))rownames(Animals)[pval < .05/nrow(Animals)]
## [1] "Asian elephant" "African elephant"
Cook’s Distance > 1
rownames(Animals)[cooks.distance(brain.lm) > 1]
## [1] "Brachiosaurus"
Remove Influential Point & Refit
brain2.lm = lm(brain ~ body, data=Animals,subset = !cooks.distance(brain.lm)>1)
par(mfrow=c(2,2)); plot(brain2.lm)
500 1000 1500 2000
−20
000
2000
Fitted values
Res
idua
ls
Residuals vs Fitted
African elephantAsian elephant
Dipliodocus
−2 −1 0 1 2
−2
01
23
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
African elephant
Asian elephant
Dipliodocus
500 1000 1500 2000
0.0
0.5
1.0
1.5
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−LocationAfrican elephant
Asian elephant
Dipliodocus
0.0 0.1 0.2 0.3 0.4 0.5
−2
01
23
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance10.5
0.51
Residuals vs Leverage
Dipliodocus
African elephant
Triceratops
Keep removing points?
500 1000 1500 2000 2500 3000
−40
000
2000
4000
Fitted values
Res
idua
ls
Residuals vs Fitted
Asian elephant African elephant
Triceratops
−2 −1 0 1 2
−4
−2
02
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
Triceratops
African elephantAsian elephant
500 1000 1500 2000 2500 3000
0.0
0.5
1.0
1.5
2.0
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−LocationTriceratops
African elephantAsian elephant
0.0 0.1 0.2 0.3 0.4 0.5 0.6
−4
−2
02
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
10.5
0.51
Residuals vs Leverage
Triceratops
African elephantAsian elephant
And another one bites the dust
0 1000 2000 3000 4000 5000 6000
−10
000
500
1500
Fitted values
Res
idua
ls
Residuals vs Fitted
Asian elephant
Human
African elephant
−2 −1 0 1 2
−4
−2
02
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
Asian elephant
African elephant
Human
0 1000 2000 3000 4000 5000 6000
0.0
0.5
1.0
1.5
2.0
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−LocationAsian elephant African elephant
Human
0.0 0.2 0.4 0.6 0.8
−4
−2
02
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
10.50.51
Residuals vs Leverage
African elephant
Asian elephant
Human
and another one
0 1000 2000 3000 4000
−50
00
500
1000
Fitted values
Res
idua
ls
Residuals vs Fitted
Human
CowHorse
−2 −1 0 1 2
−1
01
23
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
Human
Asian elephant
Cow
0 1000 2000 3000 4000
0.0
0.5
1.0
1.5
2.0
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−LocationHuman
Asian elephant
Cow
0.0 0.2 0.4 0.6 0.8
−2
−1
01
23
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
10.5
0.51
Residuals vs Leverage
Asian elephant
Human
Cow
And they just keep coming!
Figure 1: Walt Disney Fantasia
Plot of Original Data (what you should always do first!)
library(ggplot2)ggplot(Animals, aes(x=body, y=brain)) +
geom_point() +xlab("Body Weight") + ylab("Brain Weight")
0
2000
4000
0 25000 50000 75000
Body Weight
Bra
in W
eigh
t
Log Transform
Body Weight
Fre
quen
cy
0e+00 2e+04 4e+04 6e+04 8e+04 1e+05
05
1015
2025
log(Body Weight)
Fre
quen
cy
0 5 10
01
23
45
6Who can reproduce this slide using ggplot? Tell me how on Piazza!Even better make a pull request!
Plot of Transformed DataAnimals= mutate(Animals, log.body = log(body))ggplot(Animals, aes(log.body, brain)) + geom_point()
0
2000
4000
−4 0 4 8 12
log.body
brai
n
#plot(brain ~ body, Animals, log="x")
Diagnostics with log(body)
−500 0 500 1000 1500
−20
000
2000
5000
Fitted values
Res
idua
ls
Residuals vs Fitted
15
7
26
−2 −1 0 1 2
−1
01
23
4
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
15
7
26
−500 0 500 1000 1500
0.0
0.5
1.0
1.5
2.0
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−Location15
7
26
0.00 0.05 0.10 0.15−
11
23
4
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
0.5
1
Residuals vs Leverage
15
7
26
Variance increasing with mean
Try Log-LogAnimals= mutate(Animals, log.brain= log(brain))ggplot(Animals, aes(log.body, log.brain)) + geom_point()
0
2
4
6
8
−4 0 4 8 12
log.body
log.
brai
n
#plot(brain ~ body, Animals, log="xy")
Diagnostics with log(body) & log(brain)
2 4 6 8
−4
−2
01
23
Fitted values
Res
idua
ls
Residuals vs Fitted
6 2616
−2 −1 0 1 2
−2
−1
01
2
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
6 2616
2 4 6 8
0.0
0.5
1.0
1.5
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−Location6 26
16
0.00 0.05 0.10 0.15
−2
−1
01
2
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance 0.5
0.5
Residuals vs Leverage
26616
Optimal Transformation for NormalityThe BoxCox procedure can be used to find “best” powertransformation λ of Y (for positive Y) for a given set of transformedpredictors.
Ψ(Y, λ) ={
Yλ−1λ if λ 6= 0
log(Y) if λ = 0
Find value of λ that maximizes the likelihood derived fromΨ(Y, λ) ∼ N(Xβλ, σ
2λ) (need to obtain distribution of Y first)
Find λ to minimize
RSS(λ) = ‖ΨM(Y, λ)− Xβ̂λ‖2
ΨM(Y, λ) ={
(GM(Y)1−λ(Yλ − 1)/λ if λ 6= 0GM(Y) log(Y) if λ = 0
where GM(Y) = exp(∑
log(Yi )/n) (Geometric mean)
boxcox in R: Profile likelihood
library(MASS)boxcox(braintransX.lm)
−2 −1 0 1 2
−25
0−
200
−15
0−
100
−50
λ
log−
Like
lihoo
d
95%
Caveats
I Boxcox transformation depends on choice of transformations ofX ’s
I For choice of X transformation use boxTidwell inlibrary(car)
I transformations of X ’s can reduce leverage values (potentialinfluence)
I if the dynamic range of Y or X is less than 1 or 10 (iemax/min) then transformation may have little effect
I transformations such as logs may still be useful forinterpretability
I outliers that are not influential may still
Review of Last Class
I In the model with both response and predictor log transformed,are dinosaurs outliers?
I should you test each one individually or as a group; if as agroup how do you think you would you do this using lm?
I do you think your final model is adequate? What else mightyou change?
Check Your Prediction SkillsAfter you determine whether dinos can stay or go and refine yourmodel, what about prediction?
I I would like to predict Aria’s brain size given her current weightof 259 grams. Give me a prediction and interval estimate.
I Is her body weight within the range of the data in Animals orwill you be extrapolating? What are the dangers here?
I Can you find any data on Rose-Breasted Cockatoo brain sizes?Are the values in the prediction interval?
Top Related