Regression Models, Week 12

Anthony Davison

22 January 2007


Motivation

Generalized Linear Model (GLM)

◮ Regression modelling for proportions, counts, positive responses

◮ Key elements:

• Response distribution is normal, binomial, Poisson, gamma, inverse Gaussian, negative binomial, . . ., with mean µ, variance φV(µ), where V(µ) is variance function

• Linear predictor η = Xβ is function of design matrix X (n × p) and regression parameters β (p × 1)

• Link function η = g(µ) connects µ and η

◮ Fitting using iterative weighted least squares (IWLS)

◮ Compare models using analysis of deviance (likelihood ratio statistics), using large-sample likelihood theory

◮ Residuals/diagnostics generalise versions for linear model

◮ Today: Methods and models for count data
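The IWLS step for a Poisson response with log link can be sketched in a few lines. This is a minimal illustration, not the fitting routine used for the analyses in these slides: it assumes a single covariate plus intercept, and the function name `iwls_poisson` and the toy data in the usage note are hypothetical.

```python
import math

def iwls_poisson(x, y, offset=None, n_iter=25):
    """Fit log(mu_i) = b0 + b1*x_i + offset_i by iterative weighted least squares."""
    n = len(y)
    if offset is None:
        offset = [0.0] * n
    # Start from the intercept-only fit (assumes at least one positive count)
    b0, b1 = math.log(sum(y) / n), 0.0
    for _ in range(n_iter):
        eta = [b0 + b1 * xi + oi for xi, oi in zip(x, offset)]
        mu = [math.exp(e) for e in eta]
        # Log link with Poisson variance V(mu) = mu gives weights w = mu
        w = mu
        # Adjusted (working) response, with the fixed offset removed
        z = [e - o + (yi - m) / m for e, o, yi, m in zip(eta, offset, y, mu)]
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations
        s0 = sum(w)
        s1 = sum(wi * xi for wi, xi in zip(w, x))
        s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
        t0 = sum(wi * zi for wi, zi in zip(w, z))
        t1 = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s0 * s2 - s1 * s1
        b0, b1 = (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det
    return b0, b1
```

For example, `iwls_poisson([0, 1, 2, 3, 4], [1, 3, 4, 9, 14])` returns estimates at which the fitted means satisfy the likelihood equations Σ(y − µ) = 0 and Σx(y − µ) = 0. A fixed number of iterations is used for simplicity; a convergence check on the deviance would be the usual refinement.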


Count data

Types of count data

◮ Basic response y ∈ {0, 1, 2, . . .}, perhaps with upper bound m, depending on sampling scheme/experiment

◮ Examples on following pages

◮ Simplest models:

• single response Y ∼ Pois(µ)

• group of responses (Y1, . . . , Yd) subject to ∑ Yj = m has multinomial distribution, probabilities (π1, . . . , πd) and denominator m

◮ In fact these are intimately related, as we will see.


Count data

Smoking data (Doll and Hill)

Table: Lung cancer deaths in British male physicians. The table gives man-years at risk T / number of cases y of lung cancer, cross-classified by years of smoking t, taken to be age minus 20 years, and number of cigarettes smoked per day, d.

Years of                      Daily cigarette consumption d
smoking t   Nonsmokers   1–9      10–14    15–19    20–24     25–34     35+
15–19       10366/1      3121     3577     4317     5683      3042      670
20–24       8162         2937     3286/1   4214     6385/1    4050/1    1166
25–29       5969         2288     2546/1   3185     5483/1    4290/4    1482
30–34       4496         2015     2219/2   2560/4   4687/6    4268/9    1580/4
35–39       3512         1648/1   1826     1893     3646/5    3529/9    1336/6
40–44       2201         1310/2   1386/1   1334/2   2411/12   2424/11   924/10
45–49       1421         927      988/2    849/2    1567/9    1409/10   556/7
50–54       1121         710/3    684/4    470/2    857/7     663/5     255/4
55–59       826/2        606      449/3    280/5    416/7     284/3     104/1


Count data

Jacamar data

Table: Response (N = not sampled, S = sampled and rejected, E = eaten) of a rufous-tailed jacamar to individuals of seven species of palatable butterflies with artificially coloured wing undersides. Data from Peng Chai, University of Texas.

            Aphrissa      Phoebis    Dryas    Pierella   Consul    Siproeta
            boisduvalli   argante    iulia    luna       fabius    stelenes†
            N/S/E         N/S/E      N/S/E    N/S/E      N/S/E     N/S/E
Unpainted   0/0/14        6/1/0      1/0/2    4/1/5      0/0/0     0/0/1
Brown       7/1/2         2/1/0      1/0/1    2/2/4      0/0/3     0/0/1
Yellow      7/2/1         4/0/2      5/0/1    2/0/5      0/0/1     0/0/3
Blue        6/0/0         0/0/0      0/0/1    4/0/3      0/0/1     0/1/1
Green       3/0/1         1/1/0      5/0/0    6/0/2      0/0/1     0/0/3
Red         4/0/0         0/0/0      6/0/0    4/0/2      0/0/1     3/0/1
Orange      4/2/0         6/0/0      4/1/1    7/0/1      0/0/2     1/1/1
Black       4/0/0         0/0/0      1/0/1    4/2/2      7/1/0     0/1/0

† includes Philaethria dido also.


Count data

Pneumoconiosis data

Table: Period of exposure x and prevalence of pneumoconiosis amongst coalminers.

           Period of exposure (years)
           5.8   15   21.5   27.5   33.5   39.5   46   51.5
Normal     98    51   34     35     32     23     12   4
Present    0     2    6      5      10     7      6    2
Severe     0     1    3      8      9      8      10   5


Log-linear model

Poisson distribution

◮ Y ∼ Pois(µ) implies that

$$f(y;\mu) = \frac{\mu^y}{y!}\,e^{-\mu}, \qquad y = 0, 1, 2, \ldots, \quad \mu > 0.$$

◮ Exponential family with natural parameter θ = log µ, GLM with canonical logarithmic link, $x^T\beta = \eta = \log\mu$.

◮ Sometimes take Y as number of events in Poisson process of rate λ observed for period of length T, then µ = λT and can set $\eta = x^T\beta + \log T$; the offset log T is a fixed part of the linear predictor

◮ Multinomial connection: if $Y_j \overset{\mathrm{ind}}{\sim} \mathrm{Pois}(\mu_j)$, j = 1, . . . , d, then (D1) conditional distribution of Y1, . . . , Yd given Y1 + · · · + Yd = m is multinomial, with denominator m and probabilities

$$\pi_1 = \frac{\mu_1}{\sum_r \mu_r}, \quad \ldots, \quad \pi_d = \frac{\mu_d}{\sum_r \mu_r}.$$
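The multinomial connection (D1) can be checked numerically. The sketch below divides the joint probability of independent Poisson counts by the Poisson probability of their total and compares the result with the multinomial probability. The means and counts are hypothetical values chosen only for illustration.

```python
import math

def pois_pmf(y, mu):
    """Poisson probability f(y; mu) = mu^y exp(-mu) / y!."""
    return mu ** y * math.exp(-mu) / math.factorial(y)

def multinomial_pmf(ys, probs):
    """Multinomial probability with denominator sum(ys)."""
    coef = math.factorial(sum(ys))
    for y in ys:
        coef //= math.factorial(y)
    return coef * math.prod(p ** y for p, y in zip(probs, ys))

mus = [0.5, 1.2, 2.3]   # hypothetical Poisson means mu_1, ..., mu_d
ys = [1, 0, 3]          # hypothetical counts y_1, ..., y_d
m = sum(ys)

# Joint density of independent Poissons, divided by the density of the
# total, which is Poisson with mean mu_1 + ... + mu_d
joint = math.prod(pois_pmf(y, mu) for y, mu in zip(ys, mus))
conditional = joint / pois_pmf(m, sum(mus))

# Multinomial probability with pi_j = mu_j / sum of mu_r
probs = [mu / sum(mus) for mu in mus]
direct = multinomial_pmf(ys, probs)
```

Here `conditional` and `direct` agree up to rounding error, whatever positive means are chosen.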


Log-linear model

Log-linear and logistic regressions

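◮ Special case: if d = 2, then

$$Y_2 \mid Y_1 + Y_2 = m \sim B\left(m, \pi = \frac{\mu_2}{\mu_1 + \mu_2}\right).$$

◮ Hence if $\mu_1 = \exp(\gamma + x_1^T\beta)$, $\mu_2 = \exp(\gamma + x_2^T\beta)$,

$$\pi = \frac{e^{\gamma + x_2^T\beta}}{e^{\gamma + x_1^T\beta} + e^{\gamma + x_2^T\beta}} = \frac{\exp\{(x_2 - x_1)^T\beta\}}{1 + \exp\{(x_2 - x_1)^T\beta\}}.$$

◮ Hence can estimate β using log-linear model or logistic model (but can't estimate γ from logistic model).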
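A short numerical check of why γ cannot be estimated from the logistic model: the probability µ2/(µ1 + µ2) is the same for every value of γ. The covariate vectors and coefficients below are hypothetical.

```python
import math

def linear_predictor(beta, x):
    """Inner product x^T beta."""
    return sum(b * xi for b, xi in zip(beta, x))

def pi_from_poisson(gamma, beta, x1, x2):
    """pi = mu2 / (mu1 + mu2) with mu_k = exp(gamma + x_k^T beta)."""
    mu1 = math.exp(gamma + linear_predictor(beta, x1))
    mu2 = math.exp(gamma + linear_predictor(beta, x2))
    return mu2 / (mu1 + mu2)

def pi_logistic(beta, x1, x2):
    """pi = exp{(x2 - x1)^T beta} / [1 + exp{(x2 - x1)^T beta}]."""
    eta = linear_predictor(beta, [u - v for u, v in zip(x2, x1)])
    return math.exp(eta) / (1.0 + math.exp(eta))

beta = [0.4, -0.7]                   # hypothetical regression parameters
x1, x2 = [1.0, 0.5], [0.2, 1.5]      # hypothetical covariate vectors

# The same pi is obtained for every gamma, so gamma cancels
values = [pi_from_poisson(g, beta, x1, x2) for g in (-1.0, 0.0, 2.5)]
target = pi_logistic(beta, x1, x2)
```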


Log-linear model

Premier League data

> soccer

month day year team1 team2 score1 score2

1 Aug 19 2000 Charlton ManchesterC 4 0

2 Aug 19 2000 Chelsea WestHam 4 2

3 Aug 19 2000 Coventry Middlesbr 1 3

4 Aug 19 2000 Derby Southampton 2 2

5 Aug 19 2000 Leeds Everton 2 0

6 Aug 19 2000 Leicester AstonVilla 0 0

7 Aug 19 2000 Liverpool Bradford 1 0

8 Aug 19 2000 Sunderland Arsenal 1 0

9 Aug 19 2000 Tottenham Ipswich 3 1

10 Aug 20 2000 ManchesterU Newcastle 2 0

11 Aug 21 2000 Arsenal Liverpool 2 0

12 Aug 22 2000 Bradford Chelsea 2 0

13 Aug 22 2000 Ipswich ManchesterU 1 1

14 Aug 22 2000 Middlesbr Tottenham 1 1

15 Aug 23 2000 Everton Charlton 3 0

16 Aug 23 2000 ManchesterC Sunderland 4 2

17 Aug 23 2000 Newcastle Derby 3 2

18 Aug 23 2000 Southampton Coventry 1 2

19 Aug 23 2000 WestHam Leicester 0 1

20 Aug 26 2000 Arsenal Charlton 5 3

...


Log-linear model

Premier League data

◮ 380 soccer matches in English Premier League in 2000–2001 season

◮ Data: home score $y^h_{ij}$ and away score $y^a_{ij}$ when team i is at home to team j, for i, j = 1, . . . , 20, i ≠ j.

◮ Treat these as Poisson counts with means

$$\mu^h_{ij} = \exp(\Delta + \alpha_i - \beta_j), \qquad \mu^a_{ij} = \exp(\alpha_j - \beta_i),$$

where

• ∆ (> 0?) represents the home advantage

• αi and βi represent offensive and defensive strengths of team i

◮ Fit this as GLM
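To fit this as a GLM, each match contributes two Poisson responses, so the 380 matches give 760 observations. One way to arrange the data, sketched below under the assumption of a long format with one row per score, uses a home indicator (for ∆), the scoring team (for α) and the conceding team (for β). The function name and dictionary keys are hypothetical; the three matches are the first rows of the printout above.

```python
matches = [
    # (team1 = home, team2 = away, score1, score2)
    ("Charlton", "ManchesterC", 4, 0),
    ("Chelsea", "WestHam", 4, 2),
    ("Coventry", "Middlesbr", 1, 3),
]

def to_long_format(matches):
    """Return two rows per match: the home score and the away score."""
    rows = []
    for home_team, away_team, home_score, away_score in matches:
        # Home score: mean exp(Delta + alpha_home - beta_away)
        rows.append({"goals": home_score, "home": 1,
                     "offense": home_team, "defense": away_team})
        # Away score: mean exp(alpha_away - beta_home)
        rows.append({"goals": away_score, "home": 0,
                     "offense": away_team, "defense": home_team})
    return rows

rows = to_long_format(matches)
```

The resulting rows can then be passed to any Poisson GLM routine with `home` as a numeric covariate and `offense` and `defense` as factors.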


Log-linear model

Analysis of deviance

Table: Analysis of deviance for log-linear and logistic models fitted to Premier League data.

Log-linear model                          Logistic model
Terms      df    Deviance reduction       Terms      df    Deviance reduction
Home       1     33.58                    Home       1     33.58
Defense    19    39.21                    Team       19    79.63
Offense    19    58.85
Residual   720   801.08                   Residual   332   410.65


Log-linear model

                    Overall (δ)   Offensive (α)   Defensive (β)
Manchester United   0.39          0.22            0.15
Liverpool           0.13          0.12            −0.08
Arsenal             -             0.04            -
Chelsea             −0.09         0.08            −0.22
Leeds               −0.10         0.02            −0.17
Ipswich             −0.16         −0.10           −0.13
Sunderland          −0.33         −0.31           −0.10
Aston Villa         −0.48         −0.31           −0.15
West Ham            −0.53         −0.33           −0.30
Middlesbrough       −0.53         −0.35           −0.17
Charlton            −0.55         −0.21           −0.43
Tottenham           −0.58         −0.28           −0.38
Newcastle           −0.59         −0.35           −0.30
Southampton         −0.60         −0.45           −0.25
Everton             −0.75         −0.32           −0.46
Leicester           −0.77         −0.47           −0.31
Manchester City     −0.90         −0.40           −0.56
Coventry            −0.93         −0.53           −0.52
Derby               −0.93         −0.51           −0.45
Bradford            −1.29         −0.71           −0.62

SEs                 0.29          0.20            0.20

Home advantage: ∆̂ = 0.37 (0.07), exp(∆̂) = 1.45.
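On the log scale the estimates are differences, so they turn into multiplicative effects on expected scores after exponentiating. The sketch below reproduces exp(∆̂) and adds an approximate 95% interval; the normal approximation with multiplier 1.96 is an assumption made here, not something stated on the slide.

```python
import math

delta_hat, se = 0.37, 0.07        # home advantage estimate and standard error

# Multiplicative effect of playing at home on the expected score
ratio = math.exp(delta_hat)

# Approximate 95% interval, assuming normality of the estimate on the log scale
z = 1.96
lower = math.exp(delta_hat - z * se)
upper = math.exp(delta_hat + z * se)

# Ratio of expected scores of two teams against the same opponent at the
# same venue: exp(alpha_i - alpha_k), here Manchester United vs Bradford
offense_ratio = math.exp(0.22 - (-0.71))
```

The interval for exp(∆) lies well above 1, which fits the large deviance reduction for the Home term in the analysis of deviance.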


Contingency tables

Sampling schemes

◮ Contingency table contains count data cross-classified by different categories

◮ Example: jacamar data cross-classify butterflies by

6 species × 8 colours × 3 fates

for a total of 144 categories, each with its count 0, 1, . . . , 14.

◮ Sampling scheme may fix certain totals: in the jacamar data the total for each species and colour is fixed, so the responses are trinomial:

(not eaten, sampled, eaten)

◮ Derivation (D2) of likelihoods for Poisson, multinomial and product multinomial sampling schemes in two-way table


Contingency tables

Connection with log-linear model

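◮ Multinomial models can be fitted using Poisson errors.

◮ Write data as two-way layout, with row totals fixed

◮ Consider Poisson model with means $\mu_{rc} = \exp(\gamma_r + x_{rc}^T\beta)$; interest focuses on β, not γr

◮ Corresponding multinomial model has fixed row totals $m_r = \sum_c y_{rc}$ and probabilities

$$\pi_{rc} = \frac{\mu_{rc}}{\sum_d \mu_{rd}} = \frac{\exp(\gamma_r + x_{rc}^T\beta)}{\sum_d \exp(\gamma_r + x_{rd}^T\beta)} = \frac{\exp(x_{rc}^T\beta)}{\sum_d \exp(x_{rd}^T\beta)},$$

giving log likelihood (D3)

$$\ell_{\mathrm{Mult}}(\beta; y \mid m) \equiv \sum_{r,c} y_{rc} \log \pi_{rc} = \sum_r \left\{ \sum_c y_{rc} x_{rc}^T\beta - m_r \log\left( \sum_c e^{x_{rc}^T\beta} \right) \right\}.$$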


Contingency tables

◮ For Poisson model, can re-express likelihood as

$$f(y;\beta,\gamma) = f(y \mid m;\beta)\, f(m;\tau),$$

where the vector of row totals m has mean τ, with components

$$\tau_r = \tau_r(\beta,\gamma_r) = \sum_c \mu_{rc} = e^{\gamma_r} \sum_c e^{x_{rc}^T\beta} = E(m_r),$$

and mapping τ ↔ γ is 1–1. Hence inferences on β using the multinomial model are equivalent to those based on the Poisson model, provided the row parameters γr are included.

◮ A more detailed calculation shows that the MLEs β̂ and their standard errors are identical under the two models.
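The equivalence can be illustrated numerically. Maximising the Poisson log likelihood over each γr for fixed β gives e^{γr} ∑c exp(xrcβ) = mr, and the resulting profile log likelihood differs from the multinomial log likelihood only by a constant free of β. The sketch below checks this for a hypothetical 2 × 3 table with a scalar covariate; the data and function names are illustrative assumptions.

```python
import math

y = [[3, 5, 2], [1, 4, 7]]                # hypothetical counts y_rc
x = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]    # hypothetical scalar covariate x_rc

def loglik_mult(beta):
    """Multinomial log likelihood, dropping terms free of beta."""
    total = 0.0
    for yr, xr in zip(y, x):
        m_r = sum(yr)
        total += sum(yrc * xrc * beta for yrc, xrc in zip(yr, xr))
        total -= m_r * math.log(sum(math.exp(xrc * beta) for xrc in xr))
    return total

def loglik_pois_profile(beta):
    """Poisson log likelihood (dropping log y! terms), maximised over gamma_r."""
    total = 0.0
    for yr, xr in zip(y, x):
        m_r = sum(yr)
        # gamma_r that maximises the Poisson log likelihood for this beta
        gamma_r = math.log(m_r) - math.log(sum(math.exp(xrc * beta) for xrc in xr))
        for yrc, xrc in zip(yr, xr):
            eta = gamma_r + xrc * beta
            total += yrc * eta - math.exp(eta)
    return total

# The difference is the same for every beta, so both are maximised together
diffs = [loglik_pois_profile(b) - loglik_mult(b) for b in (-1.0, 0.0, 0.3, 2.0)]
```

Every entry of `diffs` equals ∑r (mr log mr − mr), so the two log likelihoods have the same maximiser β̂ and the same curvature in β.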



Smoking data

Models

◮ Suppose number of deaths y has Poisson distribution, mean Tλ(d, t), where T is man-years at risk, d is number of cigarettes smoked daily and t is time smoking (years).

◮ Log-linear model

• $\lambda_{rc} = \exp(\gamma_r + \beta_c)$

• deviance 51.47 on 48 df

• one parameter for each row and column

◮ Substantive model

• $\lambda(d, t) = \beta_0 t^{\beta_1} (1 + \beta_2 d^{\beta_3})$, so

  ◮ background rate of lung cancer is $\beta_0 t^{\beta_1}$ for non-smoker

  ◮ additional risk due to smoking d cigarettes/day is $\beta_2 d^{\beta_3}$

• deviance is 59.58 on 59 df

• just 4 parameters overall
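The degrees of freedom follow from the 9 × 7 = 63 cells of the table, and the deviances are Poisson deviances comparing observed deaths y with fitted means µ = Tλ. A minimal sketch, with a hypothetical helper name and assuming the usual convention that y log(y/µ) is zero when y = 0:

```python
import math

def poisson_deviance(y, mu):
    """Poisson deviance 2 * sum{ y*log(y/mu) - (y - mu) }."""
    dev = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        dev += 2.0 * (term - (yi - mi))
    return dev

n_cells = 9 * 7                     # rows (years of smoking) x columns (consumption)

# Row and column effects: 9 + 7 parameters with one redundancy
df_loglinear = n_cells - (9 + 7 - 1)

# Substantive model has 4 parameters
df_substantive = n_cells - 4
```

Because many cells have zero or very small counts, most deviance contributions come from the `yi == 0` branch, where the contribution reduces to 2µ.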


Smoking data

Substantive model: Some details

◮ Likelihood ratio test of β1 = 0 or β2 = 0 would be non-regular (D4: why?)

◮ Reparametrize to avoid constraints β0, β1 > 0 in maximisation: set

$$\lambda(d, t) = \{e^{\gamma_0} + \exp(\gamma_1 + \beta_2 \log d)\} \exp(\beta_3 \log t).$$

◮ Parameter estimates (next page) suggest that β2 = 1 is plausible; then get deviance of 1.84 on 60 df.

◮ Beware small counts

• χ² approximation to distribution of deviance unreliable, but simulation shows that the models fit well

• residuals not very useful
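Note that the indices of β2 and β3 change meaning between the two forms: in the reparametrized rate, β2 is the power of d and β3 the power of t. The sketch below writes both forms as functions and checks that they agree under the mapping γ0 = log β0, γ1 = log(β0β2), with the exponents swapped accordingly. All parameter values are hypothetical, and the function names are introduced here only for illustration.

```python
import math

def rate_original(d, t, b0, b1, b2, b3):
    """lambda(d, t) = b0 * t**b1 * (1 + b2 * d**b3)."""
    return b0 * t ** b1 * (1.0 + b2 * d ** b3)

def rate_reparam(d, t, g0, g1, power_d, power_t):
    """lambda(d, t) = {exp(g0) + exp(g1 + power_d*log d)} * exp(power_t*log t)."""
    # For non-smokers (d = 0) the smoking term vanishes; log(0) is avoided
    smoking = math.exp(g1 + power_d * math.log(d)) if d > 0 else 0.0
    return (math.exp(g0) + smoking) * math.exp(power_t * math.log(t))

b0, b1, b2, b3 = 0.5, 4.5, 0.3, 1.3     # hypothetical values in the original form
g0, g1 = math.log(b0), math.log(b0 * b2)

pairs = [
    (rate_original(d, t, b0, b1, b2, b3), rate_reparam(d, t, g0, g1, b3, b1))
    for d in (0, 10, 30)
    for t in (17, 42)
]
```

Working with γ0 and γ1 on the log scale means an unconstrained optimiser can be used, since exp(γ0) and exp(γ1) are automatically positive.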


Smoking data

Table: Parameter estimates (standard errors) for lung cancer data.

                    γ0            γ1            β2            β3
Smokers only        0.96 (25.4)   2.15 (1.45)   1.20 (0.40)   4.50 (0.34)
All data            2.94 (0.58)   1.82 (0.66)   1.29 (0.20)   4.46 (0.33)
All data (β2 = 1)   2.75 (0.56)   2.72 (0.09)   -             4.43 (0.33)


Closing

Final comments

◮ Log-linear models mathematically elegant and useful, but interpretation often difficult, especially for contingency tables

◮ Marginal models less elegant mathematically, but have better interpretations in practice

◮ Also possible to fit models for ordinal data, using multinomial models and tolerance distribution interpretation used for binomial data



Closing

Pneumoconiosis data

Figure: Pneumoconiosis data analysis, showing how the implied fitted logistic distributions depend on x. Left panel: empirical logistic transform against exposure x, with points labelled 2 and 3. Right panel: horizontal axis is the linear predictor, with one curve for each of x = 5.8, 15, 21.5, 27.5, 33.5, 39.5, 46 and 51.5.
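The empirical logistic transforms can be computed directly from the pneumoconiosis table. The sketch below uses the standard definition log{(r + ½)/(m − r + ½)}, applied to the cumulative counts for "present or severe" and for "severe". That these are the points labelled 2 and 3 in the figure is an assumption, as is the use of the ½ correction there.

```python
import math

exposure = [5.8, 15, 21.5, 27.5, 33.5, 39.5, 46, 51.5]
normal = [98, 51, 34, 35, 32, 23, 12, 4]
present = [0, 2, 6, 5, 10, 7, 6, 2]
severe = [0, 1, 3, 8, 9, 8, 10, 5]

def empirical_logit(r, m):
    """Empirical logistic transform of r successes out of m."""
    return math.log((r + 0.5) / (m - r + 0.5))

logit_2, logit_3 = [], []
for n0, n1, n2 in zip(normal, present, severe):
    m = n0 + n1 + n2
    logit_2.append(empirical_logit(n1 + n2, m))   # present or severe
    logit_3.append(empirical_logit(n2, m))        # severe only
```

The ½ correction keeps the transform finite at x = 5.8, where no miner shows any sign of the disease.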

