Regression Models, Week 12
Anthony Davison
22 January 2007
Anthony Davison: Regression Models, Week 12 1
Motivation
Generalized Linear Model (GLM)
◮ Regression modelling for proportions, counts, positive responses
◮ Key elements:
• Response distribution is normal, binomial, Poisson, gamma, inverse Gaussian, negative binomial, ..., with mean µ and variance φV(µ), where V(µ) is the variance function
• Linear predictor η = Xβ is a function of the n × p design matrix X and the p × 1 vector of regression parameters β
• Link function η = g(µ) connects µ and η
◮ Fitting using iterative weighted least squares (IWLS)
◮ Compare models using analysis of deviance (likelihood ratio statistics), using large-sample likelihood theory
◮ Residuals/diagnostics generalise versions for linear model
◮ Today: Methods and models for count data
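As a minimal sketch of the IWLS fitting step mentioned above, the following fits a Poisson GLM with the canonical log link using only the Python standard library. The function names `solve` and `iwls_poisson`, the starting values and the toy data are illustrative assumptions, not part of the lecture. For the log link the working weights are the fitted means µ and the working response is z = η + (y − µ)/µ.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def iwls_poisson(X, y, n_iter=50, tol=1e-10):
    """IWLS for a Poisson GLM with log link; returns the estimate of beta."""
    n, p = len(X), len(X[0])
    eta = [math.log(yi + 0.5) for yi in y]  # crude starting values (assumption)
    beta = [0.0] * p
    for _ in range(n_iter):
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the canonical log link.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        xtwx = [[sum(w[i] * X[i][j] * X[i][k] for i in range(n))
                 for k in range(p)] for j in range(p)]
        xtwz = [sum(w[i] * X[i][j] * z[i] for i in range(n)) for j in range(p)]
        new_beta = solve(xtwx, xtwz)
        eta = [sum(xij * bj for xij, bj in zip(X[i], new_beta)) for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Toy two-group example (hypothetical data): intercept plus a group indicator.
X = [[1, 0], [1, 0], [1, 1], [1, 1]]
y = [2, 4, 7, 9]
beta_hat = iwls_poisson(X, y)
```

In this toy example the fitted means must equal the group averages 3 and 8, so `beta_hat` should be close to (log 3, log(8/3)); that gives a simple check that the iteration has converged to the maximum likelihood estimate.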
Count data
Types of count data
◮ Basic response y ∈ {0, 1, 2, ...}, perhaps with upper bound m, depending on sampling scheme/experiment
◮ Examples on following pages
◮ Simplest models:
• single response Y ∼ Pois(µ)
• group of responses (Y1, ..., Yd) subject to ∑ Yj = m has multinomial distribution, with probabilities (π1, ..., πd) and denominator m
◮ In fact these are intimately related, as we will see.
Smoking data (Doll and Hill)
Table: Lung cancer deaths in British male physicians. The table gives
man-years at risk T /number of cases y of lung cancer, cross-classified
by years of smoking t, taken to be age minus 20 years, and number of
cigarettes smoked per day, d.
Years of    Daily cigarette consumption d
smoking t   Nonsmokers  1–9     10–14   15–19   20–24    25–34    35+
15–19       10366/1     3121    3577    4317    5683     3042     670
20–24       8162        2937    3286/1  4214    6385/1   4050/1   1166
25–29       5969        2288    2546/1  3185    5483/1   4290/4   1482
30–34       4496        2015    2219/2  2560/4  4687/6   4268/9   1580/4
35–39       3512        1648/1  1826    1893    3646/5   3529/9   1336/6
40–44       2201        1310/2  1386/1  1334/2  2411/12  2424/11  924/10
45–49       1421        927     988/2   849/2   1567/9   1409/10  556/7
50–54       1121        710/3   684/4   470/2   857/7    663/5    255/4
55–59       826/2       606     449/3   280/5   416/7    284/3    104/1
Jacamar data
Table: Response (N = not sampled, S = sampled and rejected, E =
eaten) of a rufous-tailed jacamar to individuals of seven species of
palatable butterflies with artificially coloured wing undersides. Data
from Peng Chai, University of Texas.
           Aphrissa     Phoebis  Dryas  Pierella  Consul  Siproeta
           boisduvalli  argante  iulia  luna      fabius  stelenes†
           N/S/E        N/S/E    N/S/E  N/S/E     N/S/E   N/S/E
Unpainted  0/0/14       6/1/0    1/0/2  4/1/5     0/0/0   0/0/1
Brown      7/1/2        2/1/0    1/0/1  2/2/4     0/0/3   0/0/1
Yellow     7/2/1        4/0/2    5/0/1  2/0/5     0/0/1   0/0/3
Blue       6/0/0        0/0/0    0/0/1  4/0/3     0/0/1   0/1/1
Green      3/0/1        1/1/0    5/0/0  6/0/2     0/0/1   0/0/3
Red        4/0/0        0/0/0    6/0/0  4/0/2     0/0/1   3/0/1
Orange     4/2/0        6/0/0    4/1/1  7/0/1     0/0/2   1/1/1
Black      4/0/0        0/0/0    1/0/1  4/2/2     7/1/0   0/1/0
† includes Philaethria dido also.
Pneumoconiosis data
Table: Period of exposure x and prevalence of pneumoconiosis amongst
coalminers.
Period of exposure (years)
         5.8  15  21.5  27.5  33.5  39.5  46  51.5
Normal   98   51  34    35    32    23    12  4
Present  0    2   6     5     10    7     6   2
Severe   0    1   3     8     9     8     10  5
Log-linear model
Poisson distribution
◮ Y ∼ Pois(µ) implies that
\[ f(y;\mu) = \frac{\mu^y}{y!}\, e^{-\mu}, \quad y = 0, 1, 2, \ldots, \quad \mu > 0. \]
◮ Exponential family with natural parameter θ = log µ, giving a GLM with canonical logarithmic link, $x^{\mathrm T}\beta = \eta = \log\mu$.
◮ Sometimes take Y as the number of events in a Poisson process of rate λ observed for a period of length T; then µ = λT and we can set $\eta = x^{\mathrm T}\beta + \log T$, where the offset log T is a fixed part of the linear predictor.
◮ Multinomial connection: if $Y_j \overset{\mathrm{ind}}{\sim} \mathrm{Pois}(\mu_j)$, j = 1, ..., d, then (D1) the conditional distribution of Y1, ..., Yd given Y1 + ··· + Yd = m is multinomial, with denominator m and probabilities
\[ \pi_1 = \frac{\mu_1}{\sum_r \mu_r}, \quad \ldots, \quad \pi_d = \frac{\mu_d}{\sum_r \mu_r}. \]
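The multinomial connection can be checked numerically. The sketch below, with hypothetical means and counts, divides the joint Poisson probability by the Poisson probability of the total (the sum of independent Poisson variables is Poisson with mean ∑µr) and compares the result with the multinomial probability. The helper names `pois_pmf` and `multinomial_pmf` are illustrative, not from the lecture.

```python
import math

def pois_pmf(y, mu):
    """Poisson probability mass function."""
    return math.exp(-mu) * mu**y / math.factorial(y)

def multinomial_pmf(ys, probs):
    """Multinomial probability with denominator sum(ys)."""
    coef = math.factorial(sum(ys))
    for y in ys:
        coef //= math.factorial(y)  # exact at every step
    p = float(coef)
    for y, pi in zip(ys, probs):
        p *= pi**y
    return p

mus = [1.5, 0.7, 2.3]  # hypothetical Poisson means
ys = [2, 0, 3]         # hypothetical observed counts
m = sum(ys)

joint = math.prod(pois_pmf(y, mu) for y, mu in zip(ys, mus))
total = pois_pmf(m, sum(mus))       # distribution of Y1 + ... + Yd
conditional = joint / total         # P(Y = ys | sum = m)

probs = [mu / sum(mus) for mu in mus]
multinom = multinomial_pmf(ys, probs)
```

The two quantities `conditional` and `multinom` agree up to rounding error, whatever positive means are chosen.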
Log-linear and logistic regressions
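◮ Special case: if d = 2, then
\[ Y_2 \mid Y_1 + Y_2 = m \sim B\left(m,\ \pi = \frac{\mu_2}{\mu_1 + \mu_2}\right). \]
◮ Hence if $\mu_1 = \exp(\gamma + x_1^{\mathrm T}\beta)$ and $\mu_2 = \exp(\gamma + x_2^{\mathrm T}\beta)$,
\[ \pi = \cdots = \frac{\exp\{(x_2 - x_1)^{\mathrm T}\beta\}}{1 + \exp\{(x_2 - x_1)^{\mathrm T}\beta\}}. \]
◮ Hence can estimate β using log-linear model or logistic model (but can't estimate γ from logistic model).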
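The steps behind the logistic form of π can be written out as follows. This is a routine expansion using only the definitions of µ1 and µ2 on this slide: substitute, then divide numerator and denominator by exp(γ + x1ᵀβ), which is where γ cancels.

```latex
\begin{align*}
\pi = \frac{\mu_2}{\mu_1 + \mu_2}
  &= \frac{\exp(\gamma + x_2^{\mathrm T}\beta)}
          {\exp(\gamma + x_1^{\mathrm T}\beta) + \exp(\gamma + x_2^{\mathrm T}\beta)} \\
  &= \frac{\exp\{(x_2 - x_1)^{\mathrm T}\beta\}}
          {1 + \exp\{(x_2 - x_1)^{\mathrm T}\beta\}} .
\end{align*}
```

Because γ appears in both means, it drops out of π, which is why it cannot be estimated from the logistic model.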
Premier League data
> soccer
month day year team1 team2 score1 score2
1 Aug 19 2000 Charlton ManchesterC 4 0
2 Aug 19 2000 Chelsea WestHam 4 2
3 Aug 19 2000 Coventry Middlesbr 1 3
4 Aug 19 2000 Derby Southampton 2 2
5 Aug 19 2000 Leeds Everton 2 0
6 Aug 19 2000 Leicester AstonVilla 0 0
7 Aug 19 2000 Liverpool Bradford 1 0
8 Aug 19 2000 Sunderland Arsenal 1 0
9 Aug 19 2000 Tottenham Ipswich 3 1
10 Aug 20 2000 ManchesterU Newcastle 2 0
11 Aug 21 2000 Arsenal Liverpool 2 0
12 Aug 22 2000 Bradford Chelsea 2 0
13 Aug 22 2000 Ipswich ManchesterU 1 1
14 Aug 22 2000 Middlesbr Tottenham 1 1
15 Aug 23 2000 Everton Charlton 3 0
16 Aug 23 2000 ManchesterC Sunderland 4 2
17 Aug 23 2000 Newcastle Derby 3 2
18 Aug 23 2000 Southampton Coventry 1 2
19 Aug 23 2000 WestHam Leicester 0 1
20 Aug 26 2000 Arsenal Charlton 5 3
...
Premier League data
◮ 380 soccer matches in English Premier League in 2000–2001 season
◮ Data: home score $y^h_{ij}$ and away score $y^a_{ij}$ when team i is at home to team j, for i, j = 1, ..., 20, i ≠ j.
◮ Treat these as Poisson counts with means
\[ \mu^h_{ij} = \exp(\Delta + \alpha_i - \beta_j), \qquad \mu^a_{ij} = \exp(\alpha_j - \beta_i), \]
where
• ∆ (> 0?) represents the home advantage
• αi and βi represent offensive and defensive strengths of team i
◮ Fit this as GLM
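One way to lay the data out for a GLM fit is to turn each match into two Poisson observations, one for each team's score. The sketch below does this for the first three matches in the printed data. It assumes that `team1` is the home team, and the field names `goals`, `home`, `attack` and `defend` are hypothetical choices for illustration.

```python
# First three matches from the printed data: (team1, team2, score1, score2).
# Assumption: team1 is the home side.
matches = [
    ("Charlton", "ManchesterC", 4, 0),
    ("Chelsea", "WestHam", 4, 2),
    ("Coventry", "Middlesbr", 1, 3),
]

def to_long(matches):
    """Return two rows per match: one for the home score, one for the away score."""
    rows = []
    for home, away, home_goals, away_goals in matches:
        # Home score: mean exp(Delta + alpha_home - beta_away).
        rows.append({"goals": home_goals, "home": 1, "attack": home, "defend": away})
        # Away score: mean exp(alpha_away - beta_home).
        rows.append({"goals": away_goals, "home": 0, "attack": away, "defend": home})
    return rows

rows = to_long(matches)
```

Applied to all 380 matches this layout would give 760 rows, with `home` carrying ∆ and the `attack` and `defend` factors carrying the α and β parameters.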
Analysis of deviance
Table: Analysis of deviance for log-linear and logistic models fitted to
Premier League data.
Log-linear model                Logistic model
Terms     df   Deviance         Terms     df   Deviance
               reduction                       reduction
Home      1    33.58            Home      1    33.58
Defense   19   39.21            Team      19   79.63
Offense   19   58.85
Residual  720  801.08           Residual  332  410.65
                   Overall (δ)  Offensive (α)  Defensive (β)
Manchester United   0.39         0.22           0.15
Liverpool           0.13         0.12          −0.08
Arsenal             -            0.04           -
Chelsea            −0.09         0.08          −0.22
Leeds              −0.10         0.02          −0.17
Ipswich            −0.16        −0.10          −0.13
Sunderland         −0.33        −0.31          −0.10
Aston Villa        −0.48        −0.31          −0.15
West Ham           −0.53        −0.33          −0.30
Middlesbrough      −0.53        −0.35          −0.17
Charlton           −0.55        −0.21          −0.43
Tottenham          −0.58        −0.28          −0.38
Newcastle          −0.59        −0.35          −0.30
Southampton        −0.60        −0.45          −0.25
Everton            −0.75        −0.32          −0.46
Leicester          −0.77        −0.47          −0.31
Manchester City    −0.90        −0.40          −0.56
Coventry           −0.93        −0.53          −0.52
Derby              −0.93        −0.51          −0.45
Bradford           −1.29        −0.71          −0.62
SEs                 0.29         0.20           0.20
Home advantage: ∆̂ = 0.37 (0.07), exp(∆̂) = 1.45.
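To see what these estimates imply, the sketch below plugs a few of them into the model means µʰ = exp(∆ + αi − βj) and µᵃ = exp(αj − βi). It assumes that the blank defensive entry for Arsenal means that Arsenal is the baseline with β = 0; the function name `expected_goals` is hypothetical.

```python
import math

DELTA = 0.37  # estimated home advantage

# Offensive (alpha) and defensive (beta) estimates for three teams.
alpha = {"Manchester United": 0.22, "Arsenal": 0.04, "Bradford": -0.71}
# Assumption: Arsenal's blank defensive entry is a baseline fixed at zero.
beta = {"Manchester United": 0.15, "Arsenal": 0.0, "Bradford": -0.62}

def expected_goals(home, away):
    """Fitted Poisson means for the home and away scores."""
    mu_home = math.exp(DELTA + alpha[home] - beta[away])
    mu_away = math.exp(alpha[away] - beta[home])
    return mu_home, mu_away

mu_h, mu_a = expected_goals("Manchester United", "Bradford")
```

Under these assumptions Manchester United at home to Bradford would be expected to score exp(1.21), about 3.35 goals, and to concede exp(−0.86), about 0.42.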
Contingency tables
Sampling schemes
◮ Contingency table contains count data cross-classified by different categories
◮ Example: jacamar data cross-classify butterflies by
6 species × 8 colours × 3 fates
for a total of 144 categories, each with its count 0, 1, ..., 14.
◮ Sampling scheme may fix certain totals: in the jacamar data the total for each species and colour is fixed, so the responses are trinomial:
(not sampled, sampled and rejected, eaten)
◮ Derivation (D2) of likelihoods for Poisson, multinomial and product multinomial sampling schemes in two-way table
Connection with log-linear model
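◮ Multinomial models can be fitted using Poisson errors.
◮ Write data as two-way layout, with row totals fixed
◮ Consider Poisson model with means $\mu_{rc} = \exp(\gamma_r + x_{rc}^{\mathrm T}\beta)$; interest focuses on β, not γr
◮ Corresponding multinomial model has fixed row totals $m_r$ and probabilities
\[ \pi_{rc} = \frac{\mu_{rc}}{\sum_d \mu_{rd}} = \frac{\exp(\gamma_r + x_{rc}^{\mathrm T}\beta)}{\sum_d \exp(\gamma_r + x_{rd}^{\mathrm T}\beta)} = \frac{\exp(x_{rc}^{\mathrm T}\beta)}{\sum_d \exp(x_{rd}^{\mathrm T}\beta)}, \]
giving log likelihood (D3)
\[ \ell_{\mathrm{Mult}}(\beta; y \mid m) \equiv \sum_{r,c} y_{rc} \log \pi_{rc} = \sum_r \left\{ \sum_c y_{rc}\, x_{rc}^{\mathrm T}\beta - m_r \log\left( \sum_c e^{x_{rc}^{\mathrm T}\beta} \right) \right\}. \]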
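The two forms of the multinomial log likelihood can be compared numerically for a single row r. The sketch below uses a scalar covariate and hypothetical counts and parameter values; the function names are illustrative. The first function computes ∑ y log π from the Poisson means, including γr, and the second uses the closed form in which γr has cancelled.

```python
import math

def multinomial_loglik_direct(y, x, beta, gamma):
    """sum_c y_c log(pi_c), with pi_c = mu_c / sum_d mu_d and mu_c = exp(gamma + x_c beta)."""
    mu = [math.exp(gamma + xc * beta) for xc in x]
    total = sum(mu)
    return sum(yc * math.log(m / total) for yc, m in zip(y, mu))

def multinomial_loglik_closed(y, x, beta):
    """sum_c y_c x_c beta - m_r log(sum_c exp(x_c beta)), where m_r is the row total."""
    m_r = sum(y)
    linear = sum(yc * xc * beta for yc, xc in zip(y, x))
    return linear - m_r * math.log(sum(math.exp(xc * beta) for xc in x))

y = [3, 1, 4]          # hypothetical counts in one row
x = [0.0, 1.0, 2.0]    # hypothetical scalar covariate for each column
beta = 0.3             # hypothetical parameter value

direct_a = multinomial_loglik_direct(y, x, beta, gamma=0.0)
direct_b = multinomial_loglik_direct(y, x, beta, gamma=5.0)
closed = multinomial_loglik_closed(y, x, beta)
```

All three values coincide up to rounding, which illustrates that the row parameter γr carries no information about β once the row total is conditioned on.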
◮ For Poisson model, can re-express likelihood as
\[ f(y; \beta, \gamma) = f(y \mid m; \beta)\, f(m; \tau), \]
where m is the vector of row totals and τ is the vector of their means,
\[ \tau_r = \tau_r(\beta, \gamma_r) = \sum_c \mu_{rc} = e^{\gamma_r} \sum_c e^{x_{rc}^{\mathrm T}\beta} = E(m_r), \]
and the mapping τ ↔ γ is 1–1. Hence inferences on β using the multinomial model are equivalent to those based on the Poisson model, provided the row parameters γr are included.
◮ A more detailed calculation shows that the MLEs β̂ and their standard errors are identical under the two models.
Smoking data
Models
◮ Suppose number of deaths y has Poisson distribution, mean Tλ(d, t), where T is man-years at risk, d is number of cigarettes smoked daily and t is time smoking (years).
◮ Log-linear model
• $\lambda_{rc} = \exp(\gamma_r + \beta_c)$
• deviance 51.47 on 48 df
• one parameter for each row and column
◮ Substantive model
• $\lambda(d, t) = \beta_0 t^{\beta_1} (1 + \beta_2 d^{\beta_3})$, so
  ◮ background rate of lung cancer is $\beta_0 t^{\beta_1}$ for non-smoker
  ◮ additional risk due to smoking d cigarettes/day is $\beta_2 d^{\beta_3}$
• deviance is 59.58 on 59 df
• just 4 parameters overall
Substantive model: Some details
◮ Likelihood ratio test of β1 = 0 or β2 = 0 would be non-regular (D4: why?)
◮ Reparametrize to avoid constraints β0, β1 > 0 in maximisation: set
\[ \lambda(d, t) = \{ e^{\gamma_0} + \exp(\gamma_1 + \beta_2 \log d) \} \exp(\beta_3 \log t). \]
◮ Parameter estimates (in the table that follows) suggest that possibly β2 = 1, then get deviance of 1.84 on 60 df.
◮ Beware small counts
• χ² approximation to distribution of deviance unreliable, but simulation shows that the models fit well
• residuals not very useful
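The reparametrized rate is the same function as the substantive model λ(d, t) = β0 t^β1 (1 + β2 d^β3), with the symbols relabelled. Equating the two forms gives β0 = exp(γ0) and old β2 = exp(γ1 − γ0), while the old exponents β1 (on t) and β3 (on d) become the new β3 and β2 respectively. The sketch below checks this with hypothetical parameter values; the function names are illustrative and the values are not the fitted estimates.

```python
import math

def rate_original(d, t, b0, b1, b2, b3):
    """lambda(d, t) = b0 * t**b1 * (1 + b2 * d**b3)."""
    return b0 * t**b1 * (1.0 + b2 * d**b3)

def rate_reparam(d, t, g0, g1, beta2, beta3):
    """lambda(d, t) = {exp(g0) + exp(g1 + beta2 log d)} * exp(beta3 log t).

    exp(g1 + beta2 * log d) is written as exp(g1) * d**beta2 so that
    non-smokers (d = 0) are handled without taking log(0).
    """
    return (math.exp(g0) + math.exp(g1) * d**beta2) * t**beta3

# Hypothetical values on the unconstrained scale.
g0, g1, beta2, beta3 = -2.0, -1.0, 1.3, 2.0

# Implied parameters of the original form.
b0, b1, b2, b3 = math.exp(g0), beta3, math.exp(g1 - g0), beta2

pairs = [
    (rate_original(d, t, b0, b1, b2, b3), rate_reparam(d, t, g0, g1, beta2, beta3))
    for d in (0, 5, 20)
    for t in (17.5, 30.0, 57.5)
]
```

Each pair in `pairs` holds equal rates, and because γ0 and γ1 range over the whole real line the multiplicative constants exp(γ0) and exp(γ1) stay positive without explicit constraints in the maximisation.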
Table: Parameter estimates (standard errors) for lung cancer data.
                   γ0           γ1           β2           β3
Smokers only       0.96 (25.4)  2.15 (1.45)  1.20 (0.40)  4.50 (0.34)
All data           2.94 (0.58)  1.82 (0.66)  1.29 (0.20)  4.46 (0.33)
All data (β2 = 1)  2.75 (0.56)  2.72 (0.09)  -            4.43 (0.33)
Closing
Final comments
◮ Log-linear models mathematically elegant and useful, but interpretation often difficult, especially for contingency tables
◮ Marginal models less elegant mathematically, but have better interpretations in practice
◮ Also possible to fit models for ordinal data, using multinomial models and tolerance distribution interpretation used for binomial data
Pneumoconiosis data
Figure: Pneumoconiosis data analysis, showing how the implied fitted
logistic distributions depend on x.
[Figure not reproduced. Recoverable axis labels: "Exposure x" (10 to 50) against "Empirical logistic transform", and "Linear predictor" (0 to 15), with curves labelled x = 5.8, 15, 21.5, 27.5, 33.5, 39.5, 46, 51.5.]