BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20...

23
BS1a Applied Statistics Lectures 9-10 Dr Geoff Nicholls Week 5 MT10

Transcript of BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20...

Page 1: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

BS1a Applied Statistics

Lectures 9-10

Dr Geoff Nicholls

Week 5 MT10

Page 2: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Box-Cox

Observations of y, x1, ..., xp with yk ≥ 0.

y not linear with x1, ..., xp try

y′ = (yλ − 1)/λ

treating λ as an(other) unknown parameter.

(yλ − 1)/λ gives powers of y and log(y).

Likelihood is now

L(β, σ2, λ; y′) ∝ 1

σnexp

− 1

2σ2

k

(y′k − xTk β)2

.

Exercise Compute MLE’s for β, σ2 and λ (trans-

form to get L(β, σ2, λ; y) then maximise - this

is the first exercise of PS4).

Page 3: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Example: fraction of successful putts as a func-

tion of distance in feet.

> putts<-data.frame(Dist=2:20, Prop=c(0.93,0.83,

0.74,0.59,0.55,0.53,0.46,0.32,0.34,0.32,0.26,

0.24,0.31,0.17,0.13,0.16,0.17,0.14,0.16))

> putts

Dist Prop

1 2 0.93

2 3 0.83

3 4 0.74

...

17 18 0.17

18 19 0.14

19 20 0.16

> y<-(1-putts$Prop)/putts$Prop

> x<-putts$Dist

Page 4: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

5 10 15 20

0.2

0.4

0.6

0.8

data

Dist

Pro

p

5 10 15 20

01

23

45

6

y=(1−Prop)/Prop

Dist

y

5 10 15 20

−1

01

23

(y^lambda−1)/lambda

Dist

(y^l

ambd

a −

1)/

lam

bda

5 10 15 20

−2

−1

01

2

log(y)

Dist

log(

y)

The λ value was estimated by maximising the

likelihood, as above.

Page 5: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

> putts.bc<-boxcox(y~x)

−2 −1 0 1 2

−10

0−

80−

60−

40−

20

λ

log−

Like

lihoo

d

95%

> putts.lm<-lm(sqrt(y)~x)

> summary(putts.lm)

...

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.14342 0.09818 1.461 0.162

x 0.12293 0.00799 15.386 2.07e-11

...√

1 − Prop

Prop= Dist+ ǫ

Page 6: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Weighted Regression

Yi = xiβ + ǫ for i = 1,2, ..., n

ǫi ∼ N(0, σ2/wi) non-constant variance.

Eg: Yi = n−1i

j Yi,j yielding var(Yi) ≃ σ2/ni.

Let W = diag(w1, ..., wn) and Y ′ = W1/2Y and

X ′ = W1/2X so

Y ′ = X ′β + ǫ′

ǫ′ ∼ N(0, σ2In)

β = (XT WX)−1XTWY estimates β, and

s2 = (Y − Xβ)TW (Y − Xβ)/(n − p)

is unbiased for σ2. This is WLS.

Page 7: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

price

0 20 40 60 80 100

100

300

500

700

020

4060

8010

0

month

100 300 500 700 20 40 60 80 120

2040

6080

120

sales

Page 8: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Example: OHP revisited

OHP data were monthly averages, Sales gives

number of sales by month/House Type

Let wi = sales[i] in the ith month. Very simple

NLM for flats alone.

yk ∼ α + γMxk,M + ǫk, ǫk ∼ N(0, σ2/wi),

Not enough (see below). Signs of σ ∝ E(y).

Would like to consider

wi = 1/E(Yi)2

but we dont have E(Yi) (it is a result of the

regression we want to do). Fit unweighted

then fit wi = 1/y2i or wi = sales[i]/y2

i .

Note that when we refit using the results of the

first fit in the second, we replace one model er-

ror (non-constant variance) with another (cor-

relation between observations). We do at least

diagnose the issue.

Page 9: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

> #read the data

> fohp<-read.table(’ohp.txt’,header=TRUE)

> names(fohp)<-c(’price’,’type’,’month’,’sales’)

>

> #collect flats

> ohp<-fohp[ (fohp[,2]==’Flat’),]

#unweighted NLM

> ohp.lm<-lm(price~month,data=ohp)

> yhat<-fitted.values(ohp.lm)

> #weighted by number of sales

> ohp.wlm1<-lm(price~month,weights=ohp$sales,data=ohp)

> #weighted by fitted values (sigma ~ mean price)

> ohp.wlm2<-lm(price~month,weights=1/yhat^2,data=ohp)

> #weighted by sales and fitted values

> ohp.wlm3<-lm(price~month,weights=ohp$sales/yhat^2,data=ohp)

Page 10: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

120 140 160 180 200 220 240

−2

02

4

unweighted

stud

entis

ed r

esid

uals

120 140 160 180 200 220 240−

20

24

weighted by sales

120 140 160 180 200 220 240

−2

02

4

weighted by month

fitted values

stud

entis

ed r

esid

uals

120 140 160 180 200 220 240

−2

02

4

weighted by month and sales

fitted values

Page 11: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

> summary(ohp.wlm3)

...

Estimate Std. Error t value Pr(>|t|)

(Intercept) 125.15076 2.47997 50.47 <2e-16

month 1.25198 0.05204 24.06 <2e-16

...

Residual standard error: 0.5502 on 98 degrees of freedom

F-statistic: 578.8 on 1 and 98 DF, p-value: < 2.2e-16

Exercise Explain how to compute the RSE s = 0.55 and

F = 578.8.

Exercise We fit the weighted response so r′(e′) ∼ t(n−p−1) and Y ′ are independent under NLM for Y ′. However, R

returns Y = Xβ and e = (Y −Xβ) (ie in original unweighted

coordinates). Verify r′(e) ∼ t(n− p−1) and Y independent

under NLM for Y ′.

Page 12: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Generalised Linear Models

Restrict observation model to Y ∼ f(y|θ)

f(y|θ) = exp

(

yθ − κ(θ)

φ+ c(y;φ)

)

, y ∈ Ω

φ = 1 natural exponential family

φ > 0 natural exponential dispersion family.

exp(κ/φ) =

Ωexp

(

φ+ c(y)

)

dy

θ ∈ θ; κ(θ) < ∞

y natural observation

θ natural parameter

φ dispersion parameter

Page 13: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Normal: Y ∼ N(µ, σ2) is NEDF

f(y|θ) =1√

2πσ2exp

(

−(y − µ)2

2σ2

)

= exp

(

σ2− µ2

2σ2− log(2πσ2)

2− y2

2σ2

)

= exp

(

yθ − κ(θ)

φ+ c(y;φ)

)

θ = µ, y ∈ R, φ = σ2,

κ = µ2/2

c(y;φ) = −(1/2) log(2πφ) − y2/2φ

Student’s-t: Y ∼ t(ν) is not in this class

f(y|ν) ∝ (1 + y2/ν)−(

ν+12

)

cant factorise y and ν in exponent.

Page 14: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

eκ/φ =

Ωexp

(

φ+ c(y)

)

dy

κ(θ) cummulant GF (at φ = 1).

κ′eκ/φ =

Ω

(

y

φ

)

exp

(

φ+ c(y)

)

dy

κ′ = E(Y )

(

(κ′)2

φ2+

k′′

φ

)

eκ/φ =

Ω

(

y

φ

)2

exp

(

φ+ c(y)

)

dy

(

(κ′)2

φ2+

k′′

φ

)

= E(Y 2)/φ2

var(Y ) = E(Y 2) − E(Y )2

= (κ′)2 + φκ′′ − (κ′)2

= φκ′′.

Let µi = E(Yi), var(Yi) = φV (µi).

Exercisedµi

dθi= V (µi) so µi increases with θi.

Page 15: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Modeling with GLM’s

three decisions fix the model.

Distribution: Yi ∼ f(yi|θ), iid for i = 1,2, ..., n.

given the EV, how is the response distributed?

Linear predictor: η = x1β1 + x2β2 + ... + xpβp

what variables x1, x2, ..., xp should we use?

Link function g(µi) = ηi

(increasing, continuous, differentiable function)

How does mean response increase with LP?

NLM yi = xiβ + ǫi, ǫi ∼ N(0, σ2)

Stochastic, Yi ∼ N(µi, σ2) jointly indpendent

Deterministic, ηi = xTi β, η = Xβ

Link, choice g(µi) = µi gives NLM E(Yi) = xiβ.

Page 16: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Log-Likelihood (GLM)

ℓ(β; y) =n∑

i=1

yiθi − κ(θi)

φ+ c(yi;φ),

with

θi = θ(xiβ)

since µi = κ′(θi) and g(µi) = ηi both invertible

Canonical link function (important special case)

If we choose link function

g(µi) = κ′−1(µi)

then θi = ηi since g(µi) = θi.

Log-Likelihood (canonical link, φ = 1 or known)

ℓ(β; y) =n∑

i=1

yiηi − κ(ηi)

φ+c(y, φ), ηi = xiβ.

Page 17: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Example GLM for a Poisson response

Yi ∼ Poisson(λi), independent E(Yi) = λi.

exp(−λ)λy

y!= exp (y log(λ) − λ − log(y!))

= exp

(

yθ − κ(θ)

φ+ c(y;φ)

)

θ = log(λ), φ = 1, κ(θ) = exp(θ).

Check E(Y ): µ = κ′(θ) = exp(θ) = λ.

Check var(Y ): φV (µ) = κ′′(θ) = λ = µ.

Link function g(µ) = η where η = xβ:

Canonical link g(µ) = κ′−1(µ) = log(µ)

Log-likelihood

ℓ(β; y) =n∑

i=1

yixiβ − exiβ

Page 18: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

0.0 0.2 0.4 0.6 0.8 1.0

510

1520

y~Poisson(mu) mu=exp(1+2x)

x

y

Page 19: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Example GLM for Yi ∼ Binomial(m, pi)

f(y|θ) = Cmy py(1 − p)m−y

= exp(y log

(

p

1 − p

)

+ m log(1 − p) + log(Cmy ))

θ = log

(

p

1 − p

)

(log odds), φ = 1

κ(θ) = −m log(1 − p) = m log(1 + exp(θ))

Check E(Y ) = mp:

κ′(θ) = m exp(θ)/(1 + exp(θ)).

Exercise Check variance φV = mp(1 − p).

Canonical link g(µ) = κ′−1(µ)

g(µ) = log

(

µ/m

1 − µ/m

)

Log-likelihood

ℓ(β; y) =n∑

i=1

yixiβ − m log(1 + exiβ)

Page 20: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

GLM MLEs

The MLEs β = β for the GLM satisfy scoreequations

∂ℓ(β; y)

∂βi= 0, i = 1,2, ..., p.

β ∈ Rp, ℓ(β; y) : Rp → R.

In terms of the gradient operator

∂ℓ

∂β=

(

∂ℓ

∂β1,

∂ℓ

∂β2, ...,

∂ℓ

∂βp

)T

∂ℓ

∂β= 0 at β = β.

Hessian operator is

∂2ℓ

∂β∂βT=

∂2ℓ∂β2

1

∂2ℓ∂β1∂β2

... ∂2ℓ∂β1∂βp

∂2ℓ∂β2∂β1

∂2ℓ∂β2∂βp

... ...∂2ℓ

∂βp∂β1

∂2ℓ∂βp∂β2

... ∂2ℓ∂β2

p

Page 21: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Observed information

J(y) = −∂2ℓ/∂β∂βT

Expected information

I = −E(∂2ℓ/∂β∂βT)

βD→ N(β, I−1) with n.

Compute I

∂ℓ

∂βi=

n∑

k=1

∂ℓ

∂ηk

∂ηk

∂βi

=∂ℓ

∂ηT

∂η

∂βi∂ℓ

∂βT=

∂ℓ

∂ηT

∂η

∂βT

=∂ℓ

∂ηTX

∂2ℓ

∂ββT=

∂ηT

∂β

∂η

∂ℓ

∂βT

= XT ∂2ℓ

∂η∂ηTX

Page 22: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Let

W = −E

(

∂2ℓ

∂η∂ηT

)

so that I = XTWX and

βD→ N(β, (XTWX)−1).

W is diagonal since ℓ =∑

k ℓk with ℓk = ℓk(ηk).

Wkk = −E

(

∂2ℓ

∂η2k

)

= −E

[

dθk

dηk

]2∂2ℓ

∂θ2k

ηk = g(κ′(θk))

1 =dθk

dηkg′(µk)κ

′′(θk)

dθk

dηk=

1

g′(µk)V (µk)

∂2ℓ

∂θ2k

= −κ′′/φ

Wkk =1

g′(µk)2φV (µk)

Page 23: BS1a Applied Statistics Lectures 9-10nicholls/bs1a/lectures9-10.pdf · 2010. 11. 8. · 5 10 15 20 −2 −1 0 1 2 log(y) Dist log(y) The λ value was estimated by maximising the

Example: Poisson, canonical link

φ = 1, κ(θ) = exp(θ), V (µ) = µ.

Link g(µ) = η, canonical link log(µ) = η.

Log-likelihood

ℓ(β; y) =n∑

i=1

yiηi − eηi

Wkk = −E

(

∂2ℓ

∂η2k

)

= eηk

= µk

=1

g′(µk)2φV (µk)

I = J

= XTdiag(µ1, ..., µn)X