Modeling the association between a binary outcome, Y, and an “exposure”, X
BIOST 536 Thompson 1
Modeling the association between a binary outcome, Y, and an “exposure”, X
Slides are from Research Professor M. Thompson
BIOST 536 Thompson 2
We might want to model pX = P(Y=1|X)
What are the characteristics of pX?
0 ≤ pX ≤ 1
pX possibly monotone in X
BIOST 536 Thompson 3
[Figure: logit and probit transforms of p (vertical axis, -5 to 5) plotted against probability p (horizontal axis, 0 to 1)]
Model g(pX) = β0 + β1 X
Logit: g(p) = logit(p) = ln(p/(1-p))
Probit: g(p) = Φ^(-1)(p)
BIOST 536 Thompson 4
Logistic regression with a single binary risk factor
Table A
        X=1    X=0
Y=1     a      b      n1
Y=0     c      d      n0
        m1     m0     N
BIOST 536 Thompson 5
Cohort or Cross-sectional study
        X=1    X=0
Y=1     a      b      n1
Y=0     c      d      n0
        m1     m0     N
a/m1 estimates P(Y=1 | X=1)
b/m0 estimates P(Y=1 | X=0)
ad/bc estimates the odds ratio:
[P(Y=1 | X=1) / P(Y=0 | X=1)] / [P(Y=1 | X=0) / P(Y=0 | X=0)]
BIOST 536 Thompson 6
Under the logistic model:
logit(P(Y=1|X))=β0+β1X
ln(OR) = ln(Ψ) = logit(P(Y=1|X=1))-logit(P(Y=1|X=0))
= β0 + β1 - β0
= β1
i.e. Ψ = exp(β1)
And:
logit(P(Y=1 |X=0)) = β0
P(Y=1 | X=0) is estimated by exp(β̂0) / (1 + exp(β̂0))
BIOST 536 Thompson 7
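The logistic equations:
P(Y=1 | X) = exp(β0 + β1 X) / (1 + exp(β0 + β1 X))
P(Y=0 | X) = 1 / (1 + exp(β0 + β1 X))
For binary X:
P(Y=1 | X=0) = exp(β0) / (1 + exp(β0))
P(Y=0 | X=0) = 1 / (1 + exp(β0))
P(Y=1 | X=1) = exp(β0 + β1) / (1 + exp(β0 + β1))
P(Y=0 | X=1) = 1 / (1 + exp(β0 + β1))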
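The logistic equations for binary X can be checked numerically. The sketch below is an illustration only: the coefficient values are assumed (they happen to match the prostate cancer fit shown later in these slides), and the helper names logit and inv_logit are not from the slides.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of logit: exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Illustrative coefficients (assumed for this sketch).
beta0, beta1 = -1.435085, 1.658228

p_x0 = inv_logit(beta0)          # P(Y=1 | X=0)
p_x1 = inv_logit(beta0 + beta1)  # P(Y=1 | X=1)

# Odds ratio computed directly from the two probabilities.
odds_ratio = (p_x1 / (1 - p_x1)) / (p_x0 / (1 - p_x0))

print(round(p_x0, 4), round(p_x1, 4), round(odds_ratio, 4), round(math.exp(beta1), 4))
```

The odds ratio built from the two probabilities agrees with exp(β1), and logit(P(Y=1 | X=0)) returns β0.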
BIOST 536 Thompson 8
Case Control study
Let Z = 1 if individual was sampled, = 0 otherwise
Define π1 = P(Z=1 | Y=1); π0 = P(Z=1 | Y=0)
Let pZ(X) = P(Y=1 | X, Z=1)
BIOST 536 Thompson 9
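We can model:
logit(pZ(X))
= ln( P(Y=1 | Z=1, X) / P(Y=0 | Z=1, X) )
= ln( P(Y=1, Z=1, X) / P(Y=0, Z=1, X) )
= ln( P(Z=1 | X, Y=1) P(X, Y=1) / [P(Z=1 | X, Y=0) P(X, Y=0)] )
= ln(π1/π0) + ln( P(Y=1 | X) P(X) / [P(Y=0 | X) P(X)] )
= ln(π1/π0) + logit(P(Y=1 | X))
(Sampling depends only on Y, so P(Z=1 | X, Y=1) = π1 and P(Z=1 | X, Y=0) = π0.)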
BIOST 536 Thompson 10
If we model
logit(pZ(X)) = α + β1 X
Then ln(Ψ) = β1 or Ψ = exp(β1) as before.
But:
α = logit(P(Y=1 | X=0, Z=1))
= ln(π1/π0) + logit(P(Y=1 | X=0))
= ln(π1/π0) + β0
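The case-control result (slope unchanged, intercept shifted by ln(π1/π0)) can be verified with Bayes' rule. This is a sketch under assumed values: the population coefficients and the sampling fractions pi1 and pi0 below are hypothetical, chosen only for illustration.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical population model and sampling fractions (assumptions).
beta0, beta1 = -3.0, 0.7
pi1, pi0 = 0.9, 0.05   # P(Z=1 | Y=1), P(Z=1 | Y=0)

def p_given_sampled(x):
    """P(Y=1 | X=x, Z=1) when sampling depends on Y only."""
    p = inv_logit(beta0 + beta1 * x)
    numerator = pi1 * p
    return numerator / (numerator + pi0 * (1.0 - p))

alpha = logit(p_given_sampled(0))            # intercept in the sampled data
slope = logit(p_given_sampled(1)) - alpha    # log odds ratio in the sampled data

print(alpha, math.log(pi1 / pi0) + beta0, slope)
```

Under these assumptions the slope equals β1 while the intercept equals ln(π1/π0) + β0, which is why β0 is not interpretable in a case-control study without knowing the sampling fractions.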
BIOST 536 Thompson 11
Parameter estimation: Maximum Likelihood
We choose that estimate of the parameters that makes the data most likely to have occurred
Let's take the simple setting of a cross-sectional study where we want to estimate the prevalence of a disease. Say we take a random sample of N individuals and w of them have the disease.
The common sense estimate of the prevalence of disease is: p̂ = w/N
BIOST 536 Thompson 12
The likelihood
Let w=number diseased in N independent individuals and let the true disease prevalence in the population be p.
Then the likelihood of observing w diseased individuals in N is given by:
L = C(N, w) p^w (1 − p)^(N−w), where C(N, w) = N! / (w! (N−w)!)
BIOST 536 Thompson 13
We want to choose that value of p which maximizes the likelihood or, equivalently, the log of the likelihood:
l = ln C(N, w) + w ln(p) + (N − w) ln(1 − p)
(C(N, w) is the binomial coefficient.)
Taking the derivative of l with respect to p:
dl/dp = w/p − (N − w)/(1 − p)
Setting the derivative equal to zero and solving for p:
p̂ = w/N
BIOST 536 Thompson 14
In a study involving 53 men with prostate cancer, 20 of the men had nodal involvement
How to estimate the chance of nodal involvement?
p̂ = 20/53 = .377
[Figure: likelihood (vertical axis, 0 to .1) plotted against probability p (horizontal axis, 0 to 1)]
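A minimal numerical check of the prostate example: evaluate the binomial log likelihood on a grid of p values and confirm that the maximum sits at w/N. The grid resolution (0.001) is an arbitrary choice made for this sketch.

```python
import math

def log_lik(p, w, n):
    """Binomial log likelihood: ln C(n, w) + w ln(p) + (n - w) ln(1 - p)."""
    return math.log(math.comb(n, w)) + w * math.log(p) + (n - w) * math.log(1.0 - p)

w, n = 20, 53   # men with nodal involvement, total men

# Candidate values of p strictly between 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=lambda p: log_lik(p, w, n))

print(p_best, w / n)
```

The grid maximizer agrees with the closed-form estimate 20/53 to within the grid spacing.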
BIOST 536 Thompson 15
Using MLE in the logistic regression setting with a single covariate, X:
Say we have N observations (Yi, Xi), i=1,2,…,N, where Y denotes disease status (0=non-diseased, 1=diseased) and X is a risk factor of interest.
Let p(X) denote P(Y=1 | X).
Then:
P(Yi | Xi) = p(Xi)^Yi (1 − p(Xi))^(1−Yi)
BIOST 536 Thompson 16
L = Π(i=1..N) p(Xi)^Yi (1 − p(Xi))^(1−Yi)
l = ln(L) = Σ(i=1..N) Yi ln(p(Xi)) + Σ(i=1..N) (1 − Yi) ln(1 − p(Xi))
Alternative (Binomial) formulation:
If X takes on n different values, Xj, j=1,2,…,n, and, for each Xj, there are nj subjects, where Σ(j=1..n) nj = N, of whom yj are "diseased", we can represent the log likelihood as
l = Σ(j=1..n) [ ln C(nj, yj) + yj ln(p(Xj)) ] + Σ(j=1..n) (nj − yj) ln(1 − p(Xj))
where C(nj, yj) is the binomial coefficient.
BIOST 536 Thompson 17
If we model
p(X) = P(Y=1 | X) = exp(β0 + β1 X) / (1 + exp(β0 + β1 X))
then, for a single dichotomous risk factor, X, as in Table A, the maximum likelihood estimate of
β0 is ln(b/d)
β1 is ln(ad/bc)
and hence the maximum likelihood estimate of
P(Y=1 | X=1) is a/m1
and of
P(Y=1 | X=0) is b/m0.
BIOST 536 Thompson 18
Hypothesis testing and confidence intervals
Say we want to establish whether tumor size affects the chance of nodal involvement in men with prostate cancer
Nodal       |        Tumor
involvement |    large      small |     Total
------------+---------------------+----------
        Yes |       15          5 |        20
            |      56%        19% |       38%
------------+---------------------+----------
         No |       12         21 |        33
            |      44%        81% |       62%
------------+---------------------+----------
      Total |       27         26 |        53
BIOST 536 Thompson 19
Consider
logit(P(nodal involvement | tumor size=X))=β0 + β1 X
The maximum likelihood estimate of β1 is β̂1 = 1.66
Hence the OR is estimated by e^1.66 = 5.25 (= 15x21/(5x12))
How do we test the statistical significance of the OR?
Calculate a confidence interval?
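The closed-form estimates for a single dichotomous risk factor can be reproduced from the tumor size table. In this sketch the cell labels a, b, c, d follow Table A (X=1 is a large tumor, Y=1 is nodal involvement); the variable names are otherwise illustrative.

```python
import math

# Table A cells for the prostate example.
a, b = 15, 5    # Y=1: large tumor, small tumor
c, d = 12, 21   # Y=0: large tumor, small tumor
m1, m0 = a + c, b + d

beta0_hat = math.log(b / d)              # ln(b/d)
beta1_hat = math.log((a * d) / (b * c))  # ln(ad/bc)
odds_ratio = math.exp(beta1_hat)

# Fitted probabilities implied by the logistic model.
p1_hat = 1.0 / (1.0 + math.exp(-(beta0_hat + beta1_hat)))
p0_hat = 1.0 / (1.0 + math.exp(-beta0_hat))

print(round(beta0_hat, 4), round(beta1_hat, 4), round(odds_ratio, 2))
```

The fitted probabilities reduce to a/m1 and b/m0, as the maximum likelihood result states.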
BIOST 536 Thompson 20
Ho: β1=0 <=> Ho: OR=Ψ=1
[Figure: log likelihood (vertical axis, -36 to -31) plotted against beta1 (horizontal axis, 0 to 3), annotated to show the LR test, Wald test and Score test]
BIOST 536 Thompson 21
The deviance compares observed to predicted values via the likelihood:
D = -2 ln( likelihood of the current model / likelihood of the saturated model )
  = -2 Σ(j=1..n) [ yj ln(ŷj/yj) + (nj − yj) ln((nj − ŷj)/(nj − yj)) ]
where ŷj = nj p̂(Xj)
To assess the role of X in the logistic model:
Logit(P(Y=1|X)) = β0 + β1 X
We can consider
G = D(model without X) - D(model with X)
  = -2 ln( likelihood without X / likelihood with X )
BIOST 536 Thompson 22
Let Y=nodal involvement in prostate cancer, X=tumor size
We estimate: logit(P(Y=1|X)) = -1.44 + 1.66 X, and OR = Ψ = 5.25
Ln L = -31.276
Under the null model: Logit(P(Y=1)) = constant, then Ln L = -35.126
Under the hypothesis H0: β1=0, G has a χ² distribution with 1 degree of freedom
Here G = -2*(-35.126+31.276) = 7.7
LR test: P(χ²1 > 7.7) = .0055
Score test: P(χ²1 > 7.44) = .0064
Wald test: P(χ²1 > 6.92) = .0090
STATA gives the LR test for the fitted model versus the null model
STATA does not do the Score test easily
STATA gives the single parameter Wald test
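The three test statistics can be recomputed from the 2x2 table. This sketch relies on two facts that hold for a single binary covariate but are not stated on the slides: the score statistic coincides with the uncorrected Pearson chi-square, and the Wald standard error of ln(OR) is sqrt(1/a + 1/b + 1/c + 1/d). The helper names are illustrative.

```python
import math

a, b, c, d = 15, 5, 12, 21   # Table A cells: large/small tumor by nodal yes/no
n = a + b + c + d

def chi2_1df_upper(x):
    """P(chi-square with 1 df > x), using the normal tail: erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

def binom_loglik(y, m):
    """Maximized Bernoulli log likelihood for y events among m subjects."""
    return y * math.log(y / m) + (m - y) * math.log(1.0 - y / m)

# Likelihood ratio statistic G.
ll_full = binom_loglik(a, a + c) + binom_loglik(b, b + d)
ll_null = binom_loglik(a + b, n)
G = -2.0 * (ll_null - ll_full)

# Score statistic (Pearson chi-square without continuity correction).
score = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Wald statistic: (beta1_hat / SE)^2.
se_beta1 = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
wald = (math.log(a * d / (b * c)) / se_beta1) ** 2

for name, stat in (("LR", G), ("Score", score), ("Wald", wald)):
    print(name, round(stat, 2), round(chi2_1df_upper(stat), 4))
```

The log likelihoods agree with the values reported on the slide, and the three statistics round to 7.70, 7.44 and 6.92.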
BIOST 536 Thompson 23
Stata code
. logistic node tumor

Logistic regression                               Number of obs   =         53
                                                  LR chi2(1)      =       7.70
                                                  Prob > chi2     =     0.0055
Log likelihood = -31.276312                       Pseudo R2       =     0.1096

------------------------------------------------------------------------------
        node | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       tumor |       5.25   3.310487     2.63   0.009      1.52552    18.06761
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
        node |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       tumor |   1.658228    .630569     2.63   0.009     .4223355    2.894121
       _cons |  -1.435085   .4976116    -2.88   0.004    -2.410385   -.4597837
------------------------------------------------------------------------------

Pseudo R2 = 1 - lm/l0
BIOST 536 Thompson 24
The information matrix
Maximum likelihood theory states that the variance estimators for estimates obtained from MLE can be derived from the matrix of second partial derivatives of the log likelihood.
Minus this matrix is called the information matrix, I, and the estimated variances and covariances of the parameter estimates are obtained from the inverse of the matrix.
BIOST 536 Thompson 25
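Let X = the N x 2 matrix with rows (1, Xi), i=1,…,N:
    | 1  X1 |
    | 1  X2 |
    | .  .  |
    | 1  XN |
and β = (β0, β1)'
and let V = the N x N diagonal matrix with diagonal entries pi(1−pi), i=1,…,N:
    | p1(1−p1)  0         .  0        |
    | 0         p2(1−p2)  .  0        |
    | .         .         .  .        |
    | 0         0         .  pN(1−pN) |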
BIOST 536 Thompson 26
Then I = X' V X and it can be shown that
β̂ ~ N(β, I^(-1))
and so an approximate 95% CI for, e.g., β1 is given by:
β̂1 ± 1.96 se(β̂1)
and hence a 95% CI for the OR is obtained by exponentiation of the CI for β1
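As a sketch of how I = X' V X yields standard errors, the block below builds the 2 x 2 information matrix for the tumor size example in grouped form and inverts it by hand. It assumes the fitted probabilities equal the observed proportions (true for a single binary covariate) and uses 1.959964 as the normal quantile; the variable names are illustrative.

```python
import math

# (x, number with nodal involvement, group size) for large and small tumors.
groups = [(1, 15, 27), (0, 5, 26)]

# Accumulate X'VX: each subject contributes p(1-p) times (1, x)'(1, x).
i00 = i01 = i11 = 0.0
for x, y, m in groups:
    p = y / m
    weight = m * p * (1.0 - p)
    i00 += weight
    i01 += weight * x
    i11 += weight * x * x

# Invert the symmetric 2x2 matrix [[i00, i01], [i01, i11]].
det = i00 * i11 - i01 ** 2
var_b0 = i11 / det
var_b1 = i00 / det
se_b0, se_b1 = math.sqrt(var_b0), math.sqrt(var_b1)

beta1_hat = math.log(15 * 21 / (5 * 12))
z = 1.959964
ci_beta1 = (beta1_hat - z * se_b1, beta1_hat + z * se_b1)
ci_or = (math.exp(ci_beta1[0]), math.exp(ci_beta1[1]))

print(round(se_b0, 4), round(se_b1, 4), [round(v, 3) for v in ci_or])
```

The standard errors and the odds ratio interval agree with the Stata output for this model up to rounding.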
BIOST 536 Thompson 27
Interpretation of coefficients
Dichotomous X (coded 0 or 1)
Here OR = exp(β1)
or β1 = ln(OR)
Interpretation of β0 depends on study design.
BIOST 536 Thompson 28
Polytomous X
              Smoking cigs/day
CHD        >30    21-30    1-20       0
Present     39       50      70      98
Absent     253      355     735    1554
OR        2.44     2.23    1.51    1.00
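The odds ratios in the table compare each smoking category with non-smokers. A short check, with dictionary keys chosen for this sketch:

```python
# CHD counts by cigarettes per day.
present = {">30": 39, "21-30": 50, "1-20": 70, "0": 98}
absent = {">30": 253, "21-30": 355, "1-20": 735, "0": 1554}
baseline = "0"   # non-smokers are the reference category

# Odds ratio of each category relative to the baseline.
odds_ratios = {
    level: (present[level] * absent[baseline]) / (absent[level] * present[baseline])
    for level in present
}

for level, value in odds_ratios.items():
    print(level, round(value, 2))
```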
BIOST 536 Thompson 29
Polytomous X with k categories
We define X1, X2, …, Xk-1 dummy 0-1 design variables and consider the model:
logit(P(Y=1 | X)) = β0 + β1 X1 + β2 X2 + … + βk-1 Xk-1
exp(βj) is the odds ratio for the j'th category of X relative to the baseline category.
BIOST 536 Thompson 30
Stata code:
. input chd smoke count
. 1 3 39
. 1 2 50
. 1 1 70
. 1 0 98
. 0 3 253
. 0 2 355
. 0 1 735
. 0 0 1554
. end
BIOST 536 Thompson 31
. xi: logit chd i.smoke [fweight = count]
i.smoke           _Ismoke_0-3         (naturally coded; _Ismoke_0 omitted)

Iteration 0:   log likelihood = -890.62187
Iteration 1:   log likelihood = -876.52013
Iteration 2:   log likelihood = -875.84853
Iteration 3:   log likelihood = -875.84738

Logistic regression                               Number of obs   =       3154
                                                  LR chi2(3)      =      29.55
                                                  Prob > chi2     =     0.0000
Log likelihood = -875.84738                       Pseudo R2       =     0.0166

------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   .4122448   .1627693     2.53   0.011     .0932229    .7312667
   _Ismoke_2 |   .8035253   .1834786     4.38   0.000     .4439138    1.163137
   _Ismoke_3 |   .8937922   .2010989     4.44   0.000     .4996455    1.287939
       _cons |   -2.76362   .1041517   -26.53   0.000    -2.967754   -2.559486
------------------------------------------------------------------------------
BIOST 536 Thompson 32
. xi: logistic chd i.smoke [fweight=count]
i.smoke           _Ismoke_0-3         (naturally coded; _Ismoke_0 omitted)

Logistic regression                               Number of obs   =       3154
                                                  LR chi2(3)      =      29.55
                                                  Prob > chi2     =     0.0000
Log likelihood = -875.84738                       Pseudo R2       =     0.0166

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   1.510204   .2458148     2.53   0.011     1.097706    2.077711
   _Ismoke_2 |     2.2334   .4097812     4.38   0.000     1.558796    3.199955
   _Ismoke_3 |   2.444382   .4915626     4.44   0.000     1.648137    3.625307
------------------------------------------------------------------------------
BIOST 536 Thompson 33
. expand count
(3146 observations created)

. xi: logit chd i.smoke

Logistic regression                               Number of obs   =       3154
                                                  LR chi2(3)      =      29.55
                                                  Prob > chi2     =     0.0000
Log likelihood = -875.84738                       Pseudo R2       =     0.0166

------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   .4122448   .1627693     2.53   0.011     .0932229    .7312667
   _Ismoke_2 |   .8035253   .1834786     4.38   0.000     .4439138    1.163137
   _Ismoke_3 |   .8937922   .2010989     4.44   0.000     .4996455    1.287939
       _cons |   -2.76362   .1041517   -26.53   0.000    -2.967754   -2.559486
------------------------------------------------------------------------------

. xi: logistic chd i.smoke

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   1.510204   .2458148     2.53   0.011     1.097706    2.077711
   _Ismoke_2 |     2.2334   .4097812     4.38   0.000     1.558796    3.199955
   _Ismoke_3 |   2.444382   .4915626     4.44   0.000     1.648137    3.625307
------------------------------------------------------------------------------
BIOST 536 Thompson 34
. lincom _Ismoke_2 - _Ismoke_1, or

 ( 1)  - _Ismoke_1 + _Ismoke_2 = 0

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.478873   .2900367     2.00   0.046     1.006916    2.172044
------------------------------------------------------------------------------

. lincom _Ismoke_3 - _Ismoke_2, or

 ( 1)  - _Ismoke_2 + _Ismoke_3 = 0

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.094466   .2505588     0.39   0.693      .698771    1.714234
------------------------------------------------------------------------------

. lincom _Ismoke_3 - _Ismoke_1, or

 ( 1)  - _Ismoke_1 + _Ismoke_3 = 0

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.618577   .3442644     2.26   0.024     1.066809    2.455728
------------------------------------------------------------------------------
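What lincom does for a difference of two coefficients can be sketched from the cell counts. The assumption used here, which is not stated on the slides, is that in this dummy-variable model the variance of each log odds ratio is the sum of reciprocal cell counts, and the covariance between two coefficients sharing the baseline is 1/98 + 1/1554 (the baseline's contribution). The 1.959964 quantile and all names are illustrative.

```python
import math

# CHD cases and non-cases by smoking category (0 = non-smoker baseline).
cases = {0: 98, 1: 70, 2: 50, 3: 39}
noncases = {0: 1554, 1: 735, 2: 355, 3: 253}

def log_or(j):
    """Coefficient for category j: log odds ratio versus the baseline."""
    return math.log(cases[j] * noncases[0] / (noncases[j] * cases[0]))

def var_log_or(j):
    """Variance of the coefficient for category j."""
    return 1 / cases[j] + 1 / noncases[j] + 1 / cases[0] + 1 / noncases[0]

# Covariance between any two category coefficients (shared baseline cells).
cov_shared = 1 / cases[0] + 1 / noncases[0]

# Contrast: category 2 versus category 1.
diff = log_or(2) - log_or(1)
se_diff = math.sqrt(var_log_or(2) + var_log_or(1) - 2 * cov_shared)

z = 1.959964
or_21 = math.exp(diff)
ci_21 = (math.exp(diff - z * se_diff), math.exp(diff + z * se_diff))
se_or_21 = or_21 * se_diff   # delta-method standard error on the OR scale

print(round(or_21, 4), round(se_or_21, 4), [round(v, 4) for v in ci_21])
```

The result agrees with the first lincom output (odds ratio, standard error and interval) up to rounding.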
BIOST 536 Thompson 35
Continuous X
Here interpretation of β1 depends on the units of X.
If the logit is linear in X, then β1 represents the change in log odds for a 1 unit increase in X.
exp(β1) is the odds ratio corresponding to a 1 unit increase in X.
BIOST 536 Thompson 36
Example: Effect of age on nodal involvement in prostate cancer
. logistic node age

Logit estimates                                   Number of obs   =         53
                                                  LR chi2(1)      =       1.09
                                                  Prob > chi2     =     0.2965
Log likelihood = -34.581125                       Pseudo R2       =     0.0155

------------------------------------------------------------------------------
    node | Odds Ratio   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |   .9526993   .0445086     -1.037   0.300      .8693389    1.044053
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
    node |      Coef.   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |   -.048456   .0467184     -1.037   0.300     -.1400223    .0431104
   _cons |   2.366605   2.770912      0.854   0.393     -3.064283    7.797493
------------------------------------------------------------------------------
BIOST 536 Thompson 37
NOTES
The OR for nodal involvement corresponding to a ten year age difference is exp(10βAGE), estimated by .953^10 = .62
The 95% CI for 10βAGE (the log of the 10-year OR) is given by: 10β̂AGE ± 1.96 x 10 x SE(β̂AGE)
Hence the 95% CI for the 10-year OR is given by: (.25, 1.54)
This OR is the same comparing 40 year olds with 30 year olds as comparing 60 year olds with 50 year olds etc
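The ten-year odds ratio and its interval follow directly from the age coefficient and standard error in the Stata output. A short sketch (variable names are illustrative):

```python
import math

beta_age = -0.048456   # coefficient for age from the logit output
se_age = 0.0467184     # its standard error
years = 10

# Scale the coefficient and its standard error by the age difference.
log_or = years * beta_age
half_width = 1.96 * years * se_age

or_10 = math.exp(log_or)
ci_10 = (math.exp(log_or - half_width), math.exp(log_or + half_width))

print(round(or_10, 2), round(ci_10[0], 2), round(ci_10[1], 2))
```

Equivalently, the interval is the Stata interval for the 1-year odds ratio with each endpoint raised to the power 10.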
BIOST 536 Thompson 38
Multiple logistic regression
Logit(P(Y=1 | X1, X2, .., Xk)) = β0 + β1 X1 + β2 X2 + … + βk Xk
P(Y=1 | X1, X2, .., Xk) = exp(β0 + β1 X1 + β2 X2 + … + βk Xk) / (1 + exp(β0 + β1 X1 + β2 X2 + … + βk Xk))
P(Y=0 | X1, X2, .., Xk) = 1 / (1 + exp(β0 + β1 X1 + β2 X2 + … + βk Xk))
BIOST 536 Thompson 39
Estimation
Assume we have N observations (Yi, Xi1, Xi2, .., Xik), i=1,2,…,N
As before, we can use maximum likelihood to obtain estimates of β0, β1, β2,…, βk that maximize the likelihood:
L = Π(i=1..N) p(Xi)^Yi (1 − p(Xi))^(1−Yi)
and we can estimate the variances and covariances of the estimates from the inverse of the information matrix, I.
BIOST 536 Thompson 40
Hypothesis testing
The Wald, Likelihood Ratio and Score tests generalize to the case of k X variables.
In general
Full model: logit(p) = β0 + β1 X1 + β2 X2 + … + βk Xk
Reduced model: logit(p) = β0 + β1 X1 + β2 X2 + … + βp Xp, p<k
H0: βp+1 = βp+2 = … = βk = 0
Ha: βj ≠ 0 for at least one j in p+1, …, k
BIOST 536 Thompson 41
Likelihood ratio test
LR statistic = -2[ln L(reduced) -ln L(full)] = Deviance(reduced) - Deviance(full)
Approximate distribution under H0: χ² with k-p degrees of freedom
We must fit two models to calculate the LR statistic
Stata provides LR test of the current model relative to the null model:
H0 : β1 = β2 = …= βk =0
BIOST 536 Thompson 42
Score test
If H0 implies β = β* then
Score statistic = S(β*)' I^(-1) S(β*)
where I denotes the information matrix
Approximate distribution under H0: χ² with k-p degrees of freedom
Only need to fit the reduced model to calculate the Score statistic
Stata does not perform the Score test easily.
BIOST 536 Thompson 43
Wald test
For a single parameter:
z = β̂j / SE(β̂j) ~ N(0,1) under H0: βj=0.
The Wald test can be generalized to multiple parameters where it also follows a χ² distribution with k-p degrees of freedom under H0.
Most confidence intervals are based on the Wald test statistic
BIOST 536 Thompson 44
LR tests using Stata
In general:
Fit "full" model, then:
. est store A
saves log-likelihood from most recently fitted model and labels it "A"
Fit reduced model, then:
. est store B
saves log-likelihood from most recently fitted model and labels it "B"
Carry out the LR test comparing "full" model (A) with reduced model (B):
. lrtest A B, stats
BIOST 536 Thompson 45
Example: prostate cancer study
                       Tumor large       Tumor small
Nodal involvement     Xray+   Xray-     Xray+   Xray-
Yes                       9       6         2       3
No                        1      11         3      18
BIOST 536 Thompson 46
Fitting “full” model:
. logistic node tsize xray

Logistic regression                               Number of obs   =         53
                                                  LR chi2(2)      =      16.90
                                                  Prob > chi2     =     0.0002
Log likelihood = -26.676709                       Pseudo R2       =     0.2405

------------------------------------------------------------------------------
    node | Odds Ratio   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |   4.895297   3.426809      2.269   0.023      1.241425    19.30357
    xray |   8.326496   6.218498      2.838   0.005      1.926448     35.9888
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
    node |      Coef.   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |   1.588275   .7000206      2.269   0.023      .2162598     2.96029
    xray |   2.119443   .7468325      2.838   0.005      .6556779    3.583208
   _cons |  -2.044627   .6099686     -3.352   0.001     -3.240144   -.8491109
------------------------------------------------------------------------------
. est store A
BIOST 536 Thompson 47
Fitting “reduced” model:
. logistic node tsize

Logistic regression                               Number of obs   =         53
                                                  LR chi2(1)      =       7.70
                                                  Prob > chi2     =     0.0055
Log likelihood = -31.276312                       Pseudo R2       =     0.1096

------------------------------------------------------------------------------
    node | Odds Ratio   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |       5.25   3.310487      2.630   0.009       1.52552    18.06761
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
    node |      Coef.   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |   1.658228    .630569      2.630   0.009      .4223355    2.894121
   _cons |  -1.435085   .4976116     -2.884   0.004     -2.410385   -.4597837
------------------------------------------------------------------------------
. est stor B
BIOST 536 Thompson 48
Likelihood ratio test: comparing models for nodal involvement with and without effect of xray
. lrtest A B, stats
Likelihood-ratio test LR chi2(1) = 9.20
(Assumption: B nested in A) Prob > chi2 = 0.0024
------------------------------------------------------------------------------
Model | Obs ll(null) ll(model) df AIC BIC
-------------+----------------------------------------------------------------
B | 53 -35.12608 -31.27631 2 66.55262 70.49321
A | 53 -35.12608 -26.67671 3 59.35342 65.26429
------------------------------------------------------------------------------
What hypothesis is this testing?
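The entries of the lrtest table can be reproduced from the two log likelihoods. This sketch assumes the usual definitions AIC = -2 ll + 2 df and BIC = -2 ll + df ln(N), which are consistent with the numbers shown; the p-value helper is valid only for 1 degree of freedom.

```python
import math

n_obs = 53
# Model label: (log likelihood, number of parameters), from the lrtest table.
models = {"A": (-26.67671, 3), "B": (-31.27631, 2)}

ll_a, df_a = models["A"]
ll_b, df_b = models["B"]

# LR statistic comparing reduced model B with full model A.
lr_stat = -2.0 * (ll_b - ll_a)
lr_df = df_a - df_b
p_value = math.erfc(math.sqrt(lr_stat / 2.0))   # chi-square upper tail, 1 df only

aic = {name: -2.0 * ll + 2.0 * df for name, (ll, df) in models.items()}
bic = {name: -2.0 * ll + df * math.log(n_obs) for name, (ll, df) in models.items()}

print(round(lr_stat, 2), lr_df, round(p_value, 4))
print({k: round(v, 5) for k, v in aic.items()}, {k: round(v, 5) for k, v in bic.items()})
```

The single degree of freedom corresponds to the one coefficient (xray) that the reduced model sets to zero.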
BIOST 536 Thompson 49
Fitted probabilities in the “full” model:
. predict pnode, p
Fitted probability (raw proportion in parentheses):
P(node | tumor=0, xray=0) = .1146 (.1429)
P(node | tumor=1, xray=0) = .3879 (.3529)
P(node | tumor=0, xray=1) = .5187 (.4000)
P(node | tumor=1, xray=1) = .8407 (.9000)
Note: these are slightly different from what we would get if we used the raw data without modelling. Why?
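The fitted probabilities can be recomputed from the coefficients of the "full" model and set beside the raw proportions from the tumor size by X-ray table. A sketch, with the dictionary layout chosen for illustration:

```python
import math

# Coefficients from the logit output for the model with tsize and xray.
b_cons, b_tsize, b_xray = -2.044627, 1.588275, 2.119443

# (tumor, xray): (number with nodal involvement, group size), from the 2x2x2 table.
counts = {(0, 0): (3, 21), (1, 0): (6, 17), (0, 1): (2, 5), (1, 1): (9, 10)}

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

fitted = {
    key: inv_logit(b_cons + b_tsize * key[0] + b_xray * key[1])
    for key in counts
}
observed = {key: yes / total for key, (yes, total) in counts.items()}

for key in counts:
    print(key, round(fitted[key], 4), round(observed[key], 4))
```

The model has three parameters for four covariate patterns, so it cannot reproduce all four raw proportions exactly; a model with the tsize by xray interaction would.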
BIOST 536 Thompson 50
Confidence intervals
A 100(1-α)% Likelihood Ratio based confidence region for β is given by:
{ β | -2(lnL(β) - lnL(β̂)) ≤ χ²(1-α)(p) }
where χ²(1-α)(p) is the 1-α quantile of the χ² distribution with p degrees of freedom
Stata provides Wald-based CIs for individual parameters:
β̂j ± z(1-α/2) SE(β̂j)
CIs for odds ratios can be obtained by exponentiation
BIOST 536 Thompson 51
Confounding
“With ruin upon ruin, rout on rout, Confusion worse confounded.” Milton: Paradise Lost, ii. line 996.
A confounder, C, is a variable which, because of its relationship to disease, D, and the exposure of interest, E, distorts the disease-exposure relationship. Adjustment can remove its effects. We can adjust for the confounder explicitly by modeling or implicitly by stratification.
Note: C would not be considered a confounder if
- it lies in the causal pathway between E and D, or
- C is caused by both E and D
BIOST 536 Thompson 52
Example
C=0:
        E+    E-
D+      90    10
D-       9     1
OR = 1.0
C=1:
        E+    E-
D+       1     9
D-      10    90
OR = 1.0
Combined over C:
        E+    E-
D+      91    19
D-      19    91
OR = 22.9
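The confounding example is easy to verify: each stratum of C has an odds ratio of 1, yet the table collapsed over C does not. A minimal sketch (tuple order and names are illustrative):

```python
def odds_ratio(d_e, d_not_e, not_d_e, not_d_not_e):
    """Cross-product ratio for a 2x2 table with rows D+/D- and columns E+/E-."""
    return (d_e * not_d_not_e) / (d_not_e * not_d_e)

# Cells in the order (D+E+, D+E-, D-E+, D-E-).
stratum_c0 = (90, 10, 9, 1)
stratum_c1 = (1, 9, 10, 90)

# Collapse over C by adding the two tables cell by cell.
collapsed = tuple(x + y for x, y in zip(stratum_c0, stratum_c1))

or_c0 = odds_ratio(*stratum_c0)
or_c1 = odds_ratio(*stratum_c1)
or_crude = odds_ratio(*collapsed)

print(or_c0, or_c1, round(or_crude, 1))
```

Here C is strongly associated with both E and D, which is what produces the large crude odds ratio.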
BIOST 536 Thompson 53
Effect modification
Effect modification occurs when the chosen summary of association differs in different strata. In some cases there may be effect modification for one summary but not for another.
For example, in clinical trials of cholesterol-lowering drugs it appears that the relative risk of coronary heart disease comparing the treatment to placebo is about the same in people with and without previous disease, but the risk difference is enormously greater in those with previous disease.
Effect modification may or may not be of interest. For example, pleconaril, an investigational antiviral drug, reduced the mean duration of symptoms in subjects with a common cold due to rhinoviruses but had no effect in subjects whose cold was due to some other agent. Here the effect modification was important in checking that the drug really worked by inhibiting rhinovirus. On the other hand, in clinical use of the drug it would typically not be possible to determine the infectious agent and so the average effectiveness across all colds would be a more important quantity.
BIOST 536 Thompson 54
Effect modification and confounding can exist separately or together:
Confounding without effect modification. Here, the overall association is not the same as the causal effect of interest, but after stratification the association is the same within each stratum of the confounder. The ideal solution is to stratify and then to average the association across strata to regain the precision lost by stratifying.
Effect modification without confounding. Here, the overall association correctly estimates the average effect of the exposure, but that effect is different in different subgroups. If the separate associations are of interest then a stratified analysis is called for. If the main scientific interest is in the average effect across the population then a stratified analysis is unnecessary.
Both confounding and effect modification. That is, the overall association does not correctly estimate the average effect of exposure and after stratification the association is different in different subgroups. The confounding means that a stratified analysis is necessary. If the effect modification is scientifically uninteresting the estimates from separate strata can be combined as would be done in the absence of effect modification.
BIOST 536 Thompson 55
Logistic models for a binary exposure (XE) and binary covariate (XC)
Consider the model:
logit(P(Y=1| XE, XC)) = β0 + β1 XE + β2 XC + β3 XE XC
Let
Ψ1 = [P(Y=1 | XE=1, XC=0) / P(Y=0 | XE=1, XC=0)] / [P(Y=1 | XE=0, XC=0) / P(Y=0 | XE=0, XC=0)]
Ψ2 = [P(Y=1 | XE=1, XC=1) / P(Y=0 | XE=1, XC=1)] / [P(Y=1 | XE=0, XC=1) / P(Y=0 | XE=0, XC=1)]
BIOST 536 Thompson 56
Then, in a cohort study, under the logistic model:
ln(Ψ1) = logit(P(Y=1| XE=1, XC=0)) - logit(P(Y=1| XE=0, XC=0))
= β0 + β1 - β0 = β1
ln(Ψ2) = logit(P(Y=1| XE=1, XC=1)) - logit(P(Y=1| XE=0, XC=1))
= β0 + β1 + β2 + β3 - β0 - β2 = β1 + β3
β0 estimates logit(P(Y=1| XE=0, XC=0))
β1 estimates ln(Ψ1)
β2 estimates logit(P(Y=1| XE=0, XC=1)) - logit(P(Y=1| XE=0, XC=0))
β3 estimates ln(Ψ2) - ln(Ψ1)
BIOST 536 Thompson 57
Logistic models for two 2x2 tables
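[Figure: two panels showing logit(p) for E+ and E- within the strata C- and C+]
Panel logit(p) = β0: all four groups lie at b0.
Panel logit(p) = β0 + β1 XE: E- lies at b0 and E+ at b0+b1 in both strata.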
BIOST 536 Thompson 58
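[Figure: three panels showing logit(p) for E+ and E- within the strata C- and C+]
Panel logit(p) = β0 + β2 XC: E+ and E- coincide, at b0 in C- and at b0+b2 in C+.
Panel logit(p) = β0 + β1 XE + β2 XC: E- at b0 and E+ at b0+b1 in C-; E- at b0+b2 and E+ at b0+b1+b2 in C+.
Panel logit(p) = β0 + β1 XE + β2 XC + β3 XE XC: E- at b0 and E+ at b0+b1 in C-; E- at b0+b2 and E+ at b0+b1+b2+b3 in C+.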
BIOST 536 Thompson 59
In a case-control study:
Let Z=1 if an individual is sampled, Z=0 otherwise
And let
π11 = P(Z=1 | Y=1, XC=1)    π10 = P(Z=1 | Y=1, XC=0)
π01 = P(Z=1 | Y=0, XC=1)    π00 = P(Z=1 | Y=0, XC=0)
Then
logit(P(Y=1| XE, XC=1, Z=1)) = log(π11/π01) + logit(P(Y=1| XE, XC=1)), etc.
BIOST 536 Thompson 60
Consider the model:
logit(P(Y=1| XE, XC, Z=1)) = β0* + β1 XE + β2* XC + β3 XE XC
Then:
log(Ψ1) = β1
log(Ψ2) = β1 + β3
as before, but
β0* = ln(π10/ π00) + β0
β2* = ln(π11/ π01) - ln(π10/ π00) + β2
BIOST 536 Thompson 61
Confidence intervals for linear combinations of parameters
100(1-α)% CI for Ψ2 = exp(β1 + β3):
exp( β̂1 + β̂3 ± z(1-α/2) SE(β̂1 + β̂3) )
where
SE(β̂1 + β̂3) = sqrt( Var(β̂1) + Var(β̂3) + 2cov(β̂1, β̂3) )
This can be obtained in Stata using the "lincom" command, or by using a different parameterization.
BIOST 536 Thompson 62
Parameterizations
1. "Full" model with interaction (A):
logit(P(Y=1| XE, XC)) = β0 + β1 XE + β2 XC + β3 XE XC
2. Reduced model without interaction (B):
logit(P(Y=1| XE, XC)) = β0 + β1 XE + β2 XC
XE    XC    Cases    n
1     0     a1       m10
0     0     b1       m00
1     1     a2       m11
0     1     b2       m01
BIOST 536 Thompson 63
3. Another "full" logistic model:
X1 = 1 when XE=1, XC=0; = 0 otherwise
X2 = 1 when XE=1, XC=1; = 0 otherwise
logit(P(Y=1| X1, X2, XC)) = β0 + β1 X1 + β2 XC + γ X2
Here
β1 = log(Ψ1)
β2 = log(ΨC|E-)
γ = log(Ψ2)
This model allows us to estimate Ψ2 directly, but it is more complicated to test for interaction.
(No interaction => γ =β1.)
BIOST 536 Thompson 64
4. A third "full" logistic model
X1 = 1 when XE=1, XC=0; = 0 otherwise
X2 = 1 when XE=0, XC=1; = 0 otherwise
X3 = 1 when XE=1, XC=1; = 0 otherwise
logit(P(Y=1| X1, X2, X3))=β0 + β1 X1 + β2 X2 + η X3
allows us to test each of the above groups against the baseline.
The interpretation of regression coefficients depends on the parameterization and what other variables are in the model.
BIOST 536 Thompson 65
Case-control study of esophageal cancer and alcohol consumption in France (Breslow & Day, Vol I, p137).
6 age strata; 2 exposure variables: daily alcohol consumption (4 categories), daily tobacco consumption (4 categories); 2 disease groups: cases and controls
. infile age alcohol tobacco case using "p:\536\esoph.raw", clear
(975 observations read)
. * begin labelling data and variables
. label data "Esophageal Cancer Case-Control Study"
. label define agelabel 1 "25-34" 2 "35-44" 3 "45-54" 4 "55-64" 5 "65-74" 6 "75+"
. label values age agelabel
. label variable age "Age in years"
. label define alclabel 1 "0-39" 2 "40-79" 3 "80-119" 4 "120+"
. label values alcohol alclabel
. label variable alcohol "Alcohol g/day"
. label define toclabel 1 "0-9" 2 "10-19" 3 "20-29" 4 "30+"
. label values tobacco toclabel
. label variable tobacco "Tobacco g/day"
. label define caselab 0 "Control" 1 "Case"
. label values case caselab
. label variable case "Case-control status"
BIOST 536 Thompson 66
. * CREATE SOME SIMPLE TABLES TO LOOK AT DATA
. tabulate age case, col

    Age in |  Case-control status
     years |   Control       Case |     Total
-----------+----------------------+----------
     25-34 |       115          1 |       116
           |     14.84       0.50 |     11.90
-----------+----------------------+----------
     35-44 |       190          9 |       199
           |     24.52       4.50 |     20.41
-----------+----------------------+----------
     45-54 |       167         46 |       213
           |     21.55      23.00 |     21.85
-----------+----------------------+----------
     55-64 |       166         76 |       242
           |     21.42      38.00 |     24.82
-----------+----------------------+----------
     65-74 |       106         55 |       161
           |     13.68      27.50 |     16.51
-----------+----------------------+----------
       75+ |        31         13 |        44
           |      4.00       6.50 |      4.51
-----------+----------------------+----------
     Total |       775        200 |       975
           |    100.00     100.00 |    100.00
BIOST 536 Thompson 67
. tabulate alcohol case, col
   Alcohol |  Case-control status
     g/day |   Control       Case |     Total
-----------+----------------------+----------
      0-39 |       386         29 |       415
           |     49.81      14.50 |     42.56
-----------+----------------------+----------
     40-79 |       280         75 |       355
           |     36.13      37.50 |     36.41
-----------+----------------------+----------
    80-119 |        87         51 |       138
           |     11.23      25.50 |     14.15
-----------+----------------------+----------
      120+ |        22         45 |        67
           |      2.84      22.50 |      6.87
-----------+----------------------+----------
     Total |       775        200 |       975
           |    100.00     100.00 |    100.00
BIOST 536 Thompson 68
. tabulate tobacco case, col

   Tobacco |  Case-control status
     g/day |   Control       Case |     Total
-----------+----------------------+----------
       0-9 |       447         78 |       525
           |     57.68      39.00 |     53.85
-----------+----------------------+----------
     10-19 |       178         58 |       236
           |     22.97      29.00 |     24.21
-----------+----------------------+----------
     20-29 |        99         33 |       132
           |     12.77      16.50 |     13.54
-----------+----------------------+----------
       30+ |        51         31 |        82
           |      6.58      15.50 |      8.41
-----------+----------------------+----------
     Total |       775        200 |       975
           |    100.00     100.00 |    100.00
BIOST 536 Thompson 69
. table case alcohol tobacco
----------+-----------------------------------------------------------------
Case-cont |                  Tobacco g/day and Alcohol g/day
rol       | ------------- 0-9 ------------   ------------ 10-19 -----------
status    |   0-39   40-79  80-119    120+     0-39   40-79  80-119    120+
----------+-----------------------------------------------------------------
  Control |    252     145      42       8       74      68      30       6
     Case |      9      34      19      16       10      17      19      12
----------+-----------------------------------------------------------------

----------+-----------------------------------------------------------------
Case-cont |                  Tobacco g/day and Alcohol g/day
rol       | ------------ 20-29 -----------   ------------- 30+ ------------
status    |   0-39   40-79  80-119    120+     0-39   40-79  80-119    120+
----------+-----------------------------------------------------------------
  Control |     37      47      10       5       23      20       5       3
     Case |      5      15       6       7        5       9       7      10
----------+-----------------------------------------------------------------
BIOST 536 Thompson 70
. table case alcohol tobacco, by(age)
----------+-----------------------------------------------------------------
Age in    |
years and |
Case-cont |                  Tobacco g/day and Alcohol g/day
rol       | ------------- 0-9 ------------   ------------ 10-19 -----------
status    |   0-39   40-79  80-119    120+     0-39   40-79  80-119    120+
----------+-----------------------------------------------------------------
25-34     |
  Control | 40 27 2 1 10 7 1
     Case | 1
----------+-----------------------------------------------------------------
35-44     |
  Control | 60 35 11 1 13 20 6 3
     Case | 2 1 3
----------+-----------------------------------------------------------------
45-54     |
  Control | 45 32 13 18 17 8 1
     Case | 1 6 3 4 4 6 3
----------+-----------------------------------------------------------------
55-64     |
  Control | 47 31 9 5 19 15 7 1
     Case | 2 9 9 5 3 6 8 6
----------+-----------------------------------------------------------------
65-74     |
  Control | 43 17 7 1 10 7 8 1
     Case | 5 17 6 3 4 3 4 1
----------+-----------------------------------------------------------------
75+       |
  Control | 17 3 4 2
     Case | 1 2 1 2 2 1 1 1
----------+-----------------------------------------------------------------
BIOST 536 Thompson 71
----------+-----------------------------------------------------------------
Age in    |
years and |
Case-cont |                  Tobacco g/day and Alcohol g/day
rol       | ------------ 20-29 -----------   ------------- 30+ ------------
status    |   0-39   40-79  80-119    120+     0-39   40-79  80-119    120+
----------+-----------------------------------------------------------------
25-34     |
  Control | 6 4 1 5 7 2 2
     Case |
----------+-----------------------------------------------------------------
35-44     |
  Control | 7 13 2 2 8 8 1
     Case | 1 2
----------+-----------------------------------------------------------------
45-54     |
  Control | 10 10 4 1 4 2 2
     Case | 5 1 2 5 2 4
----------+-----------------------------------------------------------------
55-64     |
  Control | 9 13 3 1 2 3 1
     Case | 3 4 3 2 4 3 4 5
----------+-----------------------------------------------------------------
65-74     |
  Control | 5 4 1 2
     Case | 2 5 2 1 1 1
----------+-----------------------------------------------------------------
75+       |
  Control | 3 2
     Case | 1 1
----------+-----------------------------------------------------------------
BIOST 536 Thompson 72
Some Stata language for recoding variables:
. generate agegp=recode(age,2,4,6)
. * All obsns with age <= 2 have agegp=2, all with age >2 and <=4
. * have agegp=4 and all with age > 4 have agegp=6
. * Change the coding to 1,2,3
. recode agegp 2=1 4=2 6=3
(975 changes made)
. table age
----------+-----------
Age in |
years | Freq.
----------+-----------
25-34 | 116
35-44 | 199
45-54 | 213
55-64 | 242
65-74 | 161
75+ | 44
----------+-----------
. table agegp
----------+-----------
    agegp |      Freq.
----------+-----------
        1 |        315
        2 |        455
        3 |        205
----------+-----------
BIOST 536 Thompson 73
. drop agegp
. gen agegp=recode(age,2,4)
. table agegp
-------+-----------
 agegp |      Freq.
-------+-----------
     2 |        315
     4 |        660
-------+-----------
. * All observations that are not <= a number in the list are given the last
. * value in the list
. drop agegp
. gen agegp=1+(age>2)+(age>4)
. table agegp
----------+-----------
    agegp |      Freq.
----------+-----------
        1 |        315
        2 |        455
        3 |        205
----------+-----------
BIOST 536 Thompson 74
Analysis with binary tobacco and alcohol variables
. gen binalc=alcohol>2
. gen bintob=tobacco>2
. * Start by looking at some crude and stratified analyses
. table case binalc bintob
----------+-------------------------
Case-cont | bintob and binalc
rol | ---- 0 --- ---- 1 ---
status | 0 1 0 1
----------+-------------------------
Control | 539 86 127 23
Case | 70 66 34 30
----------+-------------------------
BIOST 536 Thompson 75
. cc case binalc, by(bintob)

          bintob |       OR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
               0 |   5.909302      3.94179   8.859986     7.910644 (Cornfield)
               1 |   4.872123     2.523999   9.408074     3.654206 (Cornfield)
-----------------+-------------------------------------------------
           Crude |   5.640085     4.003217    7.94673              (Cornfield)
    M-H combined |   5.581579     3.945401    7.89629
-----------------+-------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =     0.24  Pr>chi2 = 0.6258

Test that combined OR = 1:
                   Mantel-Haenszel chi2(1) =   106.85
                                   Pr>chi2 =   0.0000
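The point estimates in the cc output can be reproduced from the binalc by bintob table of cases and controls. The sketch assumes the standard Mantel-Haenszel estimator, sum(ad/n) / sum(bc/n), with each stratum's weight equal to bc/n; the tuple layout and names are illustrative.

```python
# bintob stratum: (exposed cases, unexposed cases, exposed controls, unexposed controls),
# where "exposed" means binalc = 1.
strata = {0: (66, 70, 86, 539), 1: (30, 34, 23, 127)}

def odds_ratio(a, b, c, d):
    """Odds ratio with a, b = exposed/unexposed cases and c, d = exposed/unexposed controls."""
    return (a * d) / (b * c)

stratum_or = {k: odds_ratio(*cells) for k, cells in strata.items()}

# Mantel-Haenszel weights (bc/n) and combined odds ratio.
mh_weight = {k: (b * c) / (a + b + c + d) for k, (a, b, c, d) in strata.items()}
mh_numerator = sum((a * d) / (a + b + c + d) for (a, b, c, d) in strata.values())
or_mh = mh_numerator / sum(mh_weight.values())

# Crude odds ratio from the table collapsed over bintob.
crude_cells = tuple(sum(cells[i] for cells in strata.values()) for i in range(4))
or_crude = odds_ratio(*crude_cells)

print({k: round(v, 6) for k, v in stratum_or.items()}, round(or_mh, 6), round(or_crude, 6))
```

The crude and Mantel-Haenszel estimates are close here, which suggests little confounding of the alcohol effect by the binary tobacco variable.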