Modeling the association between a binary outcome, Y, and an “exposure”, X

BIOST 536 Thompson

Slides are from Research Professor M. Thompson


We might want to model pX = P(Y=1 | X)

What are the characteristics of pX?

0 ≤ pX ≤ 1

pX possibly monotone in X


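[Figure: the logit and probit transforms of p plotted against probability p (horizontal axis, 0 to 1); vertical axis "Transform of p", from -5 to 5.]

Model g(pX) = β0 + β1 X

Logit: $g(p) = \operatorname{logit}(p) = \ln\left(\dfrac{p}{1-p}\right)$

Probit: $g(p) = \Phi^{-1}(p)$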
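The two link functions can be compared numerically. The following is a minimal sketch using only the Python standard library; the function names `logit`, `probit` and `expit` are chosen here for illustration and are not from the slides.

```python
import math
from statistics import NormalDist

def logit(p: float) -> float:
    """Log odds ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def probit(p: float) -> float:
    """Inverse of the standard normal CDF, defined for 0 < p < 1."""
    return NormalDist().inv_cdf(p)

def expit(x: float) -> float:
    """Inverse of the logit: maps any real x back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Both transforms are 0 at p = 0.5 and increase monotonically in p.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p={p:.2f}  logit={logit(p):+.3f}  probit={probit(p):+.3f}")
```

Both transforms map (0, 1) onto the whole real line, which is what allows a linear predictor β0 + β1 X without violating 0 ≤ pX ≤ 1.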


Logistic regression with a single binary risk factor

Table A

         X=1   X=0
Y=1       a     b     n1
Y=0       c     d     n0
          m1    m0    N


Cohort or Cross-sectional study

         X=1   X=0
Y=1       a     b     n1
Y=0       c     d     n0
          m1    m0    N

$a/m_1$ estimates P(Y=1 | X=1)

$b/m_0$ estimates P(Y=1 | X=0)

$\dfrac{ad}{bc}$ estimates the odds ratio:

$$\Psi = \frac{P(Y=1 \mid X=1)\,/\,P(Y=0 \mid X=1)}{P(Y=1 \mid X=0)\,/\,P(Y=0 \mid X=0)}$$


Case Control study

Let Z = 1 if the individual was sampled, and Z = 0 otherwise

Define π1 = P(Z=1 | Y=1);

π0 = P(Z=1 | Y=0)

Let pZ(X)= P(Y=1 | X, Z=1)


If we model

logit(pZ(X)) = α + β1 X

Then ln(Ψ) = β1 or Ψ = exp(β1) as before.

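But:

$$\alpha = \operatorname{logit}(P(Y=1 \mid X=0, Z=1)) = \ln\left(\frac{\pi_1}{\pi_0}\right) + \operatorname{logit}(P(Y=1 \mid X=0)) = \ln\left(\frac{\pi_1}{\pi_0}\right) + \beta_0$$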
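The intercept shift under case-control sampling can be checked numerically. This is a sketch under stated assumptions: the values of `beta0`, `beta1`, `pi1` and `pi0` are hypothetical, and sampling is assumed to depend on Y only (not on X).

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical population model: logit(P(Y=1 | X)) = beta0 + beta1 * X
beta0, beta1 = -3.0, 1.2
# Hypothetical sampling fractions: pi1 = P(Z=1 | Y=1), pi0 = P(Z=1 | Y=0)
pi1, pi0 = 0.90, 0.05

def p_sampled(x):
    """P(Y=1 | X=x, Z=1) by Bayes' rule, with sampling depending on Y only."""
    p = expit(beta0 + beta1 * x)
    return pi1 * p / (pi1 * p + pi0 * (1.0 - p))

alpha = logit(p_sampled(0))                      # intercept among sampled subjects
slope = logit(p_sampled(1)) - logit(p_sampled(0))

print(f"alpha = {alpha:.4f}, ln(pi1/pi0) + beta0 = {math.log(pi1 / pi0) + beta0:.4f}")
print(f"slope = {slope:.4f}, beta1 = {beta1:.4f}")
```

The slope (and hence the odds ratio) is unchanged by the sampling, while the intercept absorbs ln(π1/π0).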


Parameter estimation: Maximum Likelihood

We choose that estimate of the parameters that makes the data most likely to have occurred

Let's take the simple setting of a cross-sectional study where we want to estimate the prevalence of a disease. Say we take a random sample of N individuals and w of them have the disease.

The common sense estimate of the prevalence of disease is: $\dfrac{w}{N}$


The likelihood

Let w=number diseased in N independent individuals and let the true disease prevalence in the population be p.

Then the likelihood of observing w diseased individuals in N is given by:

$$\binom{N}{w}\, p^{w} (1-p)^{N-w}$$


We want to choose that value of p which maximizes the likelihood or, equivalently, the log of the likelihood:

$$l = \ln\binom{N}{w} + w \ln(p) + (N-w)\ln(1-p)$$

Taking the derivative of l with respect to p:

$$\frac{dl}{dp} = \frac{w}{p} - \frac{N-w}{1-p}$$

Setting the derivative equal to zero and solving for p:

$$\hat{p} = \frac{w}{N}$$


In a study involving 53 men with prostate cancer, 20 of the men had nodal involvement

How to estimate the chance of nodal involvement?

$$\hat{p} = \frac{20}{53} = .377$$

[Figure: likelihood as a function of p (horizontal axis: probability, 0 to 1; vertical axis: likelihood, 0 to .1).]
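The maximization can also be done numerically. The sketch below evaluates the binomial log likelihood for N = 53, w = 20 on a grid of p values (the grid spacing of 0.001 is an arbitrary choice for illustration) and compares the maximizer with the closed-form estimate w/N.

```python
import math

N, w = 53, 20  # prostate cancer example: 20 of 53 men with nodal involvement

def log_lik(p):
    """Binomial log likelihood l(p) = ln C(N, w) + w ln p + (N - w) ln(1 - p)."""
    return math.log(math.comb(N, w)) + w * math.log(p) + (N - w) * math.log(1.0 - p)

# Evaluate l(p) on a grid strictly inside (0, 1) and pick the maximizer.
grid = [i / 1000 for i in range(1, 1000)]
p_hat_grid = max(grid, key=log_lik)
p_hat_closed = w / N

print(f"grid maximizer: {p_hat_grid:.3f}, closed form w/N: {p_hat_closed:.4f}")
```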


Using MLE in the logistic regression setting with a single covariate, X:

Say we have N observations (Yi, Xi), i=1,2,…,N, where Y denotes disease status (0=non-diseased, 1=diseased) and X is a risk factor of interest.

Let p(X) denote P(Y=1 | X).

Then:

$$P(Y_i \mid X_i) = p(X_i)^{Y_i}\,(1 - p(X_i))^{1-Y_i}$$

$$L = \prod_{i=1}^{N} p(X_i)^{Y_i}\,(1 - p(X_i))^{1-Y_i}$$

$$l = \ln(L) = \sum_{i=1}^{N} Y_i \ln(p(X_i)) + \sum_{i=1}^{N} (1 - Y_i)\ln(1 - p(X_i))$$

Alternative (Binomial) formulation:

If X takes on n different values, Xj, j=1,2,…,n, and, for each Xj, there are nj subjects, where $\sum_{j=1}^{n} n_j = N$, of whom yj are “diseased”, we can represent the log likelihood as

$$l = \sum_{j=1}^{n} \ln\binom{n_j}{y_j} + \sum_{j=1}^{n} \left[\, y_j \ln(p(X_j)) + (n_j - y_j)\ln(1 - p(X_j)) \,\right]$$


If we model

$$p(X) = P(Y=1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

then, for a single dichotomous risk factor, X, as in Table A, the maximum likelihood estimate of

β0 is ln(b/d)
β1 is ln(ad/bc)

and hence the maximum likelihood estimate of
P(Y=1 | X=1) is a/m1
and of
P(Y=1 | X=0) is b/m0.


Hypothesis testing and confidence intervals

Say we want to establish whether tumor size affects the chance of nodal involvement in men with prostate cancer

      Nodal |        Tumor
involvement |     large      small |     Total
------------+----------------------+----------
        Yes |        15          5 |        20
            |       56%        19% |       38%
------------+----------------------+----------
         No |        12         21 |        33
            |       44%        81% |       62%
------------+----------------------+----------
      Total |        27         26 |        53


Consider

logit(P(nodal involvement | tumor size=X))=β0 + β1 X

The maximum likelihood estimate of β1 is $\hat{\beta}_1 = 1.66$

Hence the OR is estimated by $e^{1.66} = 5.25$ (= 15×21/(5×12))

How do we test the statistical significance of the OR?

Calculate a confidence interval?
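For a single binary risk factor the estimates have a closed form, so they can be reproduced directly from the 2x2 table. A minimal sketch, using the cell labels a, b, c, d of Table A with X = 1 for a large tumor (the variable names are illustrative):

```python
import math

# Cells of the tumor-size table: a, b = nodal involvement (large, small);
# c, d = no nodal involvement (large, small).
a, b, c, d = 15, 5, 12, 21

beta0_hat = math.log(b / d)              # log odds of involvement when X = 0
beta1_hat = math.log((a * d) / (b * c))  # log odds ratio

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

p1_hat = expit(beta0_hat + beta1_hat)    # fitted P(Y=1 | X=1), equals a / (a + c)
p0_hat = expit(beta0_hat)                # fitted P(Y=1 | X=0), equals b / (b + d)

print(f"beta0 = {beta0_hat:.4f}, beta1 = {beta1_hat:.4f}, OR = {math.exp(beta1_hat):.2f}")
print(f"P(Y=1|X=1) = {p1_hat:.3f}, P(Y=1|X=0) = {p0_hat:.3f}")
```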


H0: β1=0 <=> H0: OR=Ψ=1

[Figure: log likelihood as a function of beta1 (horizontal axis: beta1, 0 to 3; vertical axis: log likelihood, -36 to -31), annotated with the LR test, Wald test and Score test.]


The deviance compares observed to predicted values via the likelihood:

$$D = -2\ln\left(\frac{\text{likelihood of the current model}}{\text{likelihood of the saturated model}}\right) = 2\sum_{j=1}^{n}\left[\, y_j \ln\left(\frac{y_j}{\hat{y}_j}\right) + (n_j - y_j)\ln\left(\frac{n_j - y_j}{n_j - \hat{y}_j}\right) \right]$$

where $\hat{y}_j = n_j\,\hat{p}(X_j)$

To assess the role of X in the logistic model:

logit(P(Y=1|X)) = β0 + β1 X

we can consider

G = D(model without X) - D(model with X)

$$= -2\ln\left(\frac{\text{likelihood without X}}{\text{likelihood with X}}\right)$$


Let Y = nodal involvement in prostate cancer, X = tumor size

We estimate: logit(P(Y=1|X)) = -1.44 + 1.66 X, and OR = Ψ = 5.25

Ln L = -31.276

Under the null model: logit(P(Y=1)) = constant, then Ln L = -35.126

Under the hypothesis H0: β1 = 0, G has approximately a χ² distribution with 1 degree of freedom

Here G = -2*(-35.126+31.276) = 7.7

LR test: P(χ²(1) > 7.7) = .0055

Score test: P(χ²(1) > 7.44) = .0064

Wald test: P(χ²(1) > 6.92) = .0085

Stata gives the LR test for the fitted model versus the null model

Stata does not do the Score test easily

Stata gives the single parameter Wald test
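The LR and Wald statistics can be recomputed from the quantities reported for this model. The sketch below uses the identity P(χ² with 1 df > x) = erfc(sqrt(x/2)), which holds only for 1 degree of freedom; the helper name `chi2_1df_upper_tail` is illustrative. The coefficient and standard error are the values Stata reports for this fit.

```python
import math

def chi2_1df_upper_tail(x):
    """P(chi-square with 1 df > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Log likelihoods under the null model and the model with tumor size
lnL_null, lnL_model = -35.126, -31.276
G = -2.0 * (lnL_null - lnL_model)           # likelihood ratio statistic

# Wald statistic from the fitted coefficient and its standard error
beta1_hat, se_beta1 = 1.658228, 0.630569
wald = (beta1_hat / se_beta1) ** 2

p_lr = chi2_1df_upper_tail(G)
p_wald = chi2_1df_upper_tail(wald)
p_score = chi2_1df_upper_tail(7.44)         # score statistic as quoted above

print(f"LR:    G = {G:.2f}, p = {p_lr:.4f}")
print(f"Score: 7.44, p = {p_score:.4f}")
print(f"Wald:  {wald:.2f}, p = {p_wald:.4f}")
```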


Stata code

. logistic node tumor

Logistic regression                               Number of obs   =         53
                                                  LR chi2(1)      =       7.70
                                                  Prob > chi2     =     0.0055
Log likelihood = -31.276312                       Pseudo R2       =     0.1096

------------------------------------------------------------------------------
        node | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       tumor |       5.25   3.310487     2.63   0.009      1.52552    18.06761
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
        node |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       tumor |   1.658228    .630569     2.63   0.009     .4223355    2.894121
       _cons |  -1.435085   .4976116    -2.88   0.004    -2.410385   -.4597837
------------------------------------------------------------------------------

Pseudo R2 = 1 - lm/l0


The information matrix

Maximum likelihood theory states that the variance estimators for estimates obtained from MLE can be derived from the matrix of second partial derivatives of the log likelihood.

Minus this matrix is called the information matrix, I, and the estimated variances and covariances of the parameter estimates are obtained from the inverse of the matrix.


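Let $X = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_N \end{pmatrix}$ and $\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$

and let $V = \begin{pmatrix} p_1(1-p_1) & 0 & 0 & \cdots & 0 \\ 0 & p_2(1-p_2) & 0 & \cdots & 0 \\ 0 & 0 & p_3(1-p_3) & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & p_N(1-p_N) \end{pmatrix}$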


Then I = X' V X and it can be shown that, approximately,

$$\hat{\beta} \sim N(\beta, I^{-1})$$

and so an approximate 95% CI for, e.g., β1 is given by:

$$\hat{\beta}_1 \pm 1.96\,\operatorname{se}(\hat{\beta}_1)$$

and hence a 95% CI for the OR is obtained by exponentiation of the CI for β1
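As an illustration, the information matrix I = X' V X can be built by hand for the tumor-size example and inverted to recover the standard errors. This is a sketch, not Stata's implementation: it works with the two covariate groups rather than 53 individual rows, and evaluates p at the fitted values, which for this model equal the observed proportions.

```python
import math

# Tumor-size example grouped by X: (X value, number of men, number with nodal involvement)
groups = [(1, 27, 15), (0, 26, 5)]

# Accumulate the 2x2 information matrix I = X' V X, where each subject
# contributes p(1 - p) times the outer product of (1, X) with itself.
I00 = I01 = I11 = 0.0
for x, n, y in groups:
    p = y / n                  # fitted probability for this group
    v = n * p * (1.0 - p)      # sum of p(1 - p) over the n subjects in the group
    I00 += v
    I01 += v * x
    I11 += v * x * x

# Invert the symmetric 2x2 matrix explicitly.
det = I00 * I11 - I01 ** 2
var_b0 = I11 / det
var_b1 = I00 / det
se_b0, se_b1 = math.sqrt(var_b0), math.sqrt(var_b1)

beta1_hat = math.log((15 * 21) / (5 * 12))
ci_beta1 = (beta1_hat - 1.96 * se_b1, beta1_hat + 1.96 * se_b1)
ci_or = (math.exp(ci_beta1[0]), math.exp(ci_beta1[1]))

print(f"se(beta0) = {se_b0:.4f}, se(beta1) = {se_b1:.4f}")
print(f"95% CI for beta1: ({ci_beta1[0]:.3f}, {ci_beta1[1]:.3f})")
print(f"95% CI for OR:    ({ci_or[0]:.2f}, {ci_or[1]:.2f})")
```

The standard errors agree with the Stata output for this model (.4976 and .6306), and the variance of the slope reduces to 1/a + 1/b + 1/c + 1/d in this two-group case.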


Interpretation of coefficients

Dichotomous X (coded 0 or 1)

Here OR = $e^{\beta_1}$

or

ln(OR) = β1

Interpretation of β0 depends on study design.


Polytomous X

                Smoking (cigs/day)
CHD         >30    21-30    1-20       0
Present      39       50      70      98
Absent      253      355     735    1554
OR         2.44     2.23    1.51    1.00


Polytomous X with k categories

We define X1, X2, …, Xk-1 dummy 0-1 design variables and consider the model:

logit(P(Y=1 | X)) = β0 + β1 X1 + β2 X2 + … + βk-1 Xk-1.

$e^{\beta_j}$ is the odds ratio for the j'th category of X relative to the baseline category.
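With a single categorical X the model is saturated, so each coefficient is simply the log odds ratio of its category against the baseline. A minimal sketch for the smoking table, with non-smokers as the baseline category (variable names are illustrative):

```python
import math

# (category, CHD present, CHD absent); the first row is the baseline (0 cigs/day)
table = [("0", 98, 1554), ("1-20", 70, 735), ("21-30", 50, 355), (">30", 39, 253)]

base_present, base_absent = table[0][1], table[0][2]
beta0_hat = math.log(base_present / base_absent)   # log odds of CHD at baseline

results = {}
for label, present, absent in table[1:]:
    odds_ratio = (present * base_absent) / (absent * base_present)
    results[label] = (math.log(odds_ratio), odds_ratio)   # (beta_j, exp(beta_j))

print(f"beta0 = {beta0_hat:.5f}")
for label, (beta_j, or_j) in results.items():
    print(f"{label:>6}: beta = {beta_j:.4f}, OR = {or_j:.2f}")
```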


Stata code:

. input chd smoke count

. 1 3 39

. 1 2 50

. 1 1 70

. 1 0 98

. 0 3 253

. 0 2 355

. 0 1 735

. 0 0 1554

. end


. xi: logit chd i.smoke [fweight = count]

i.smoke           _Ismoke_0-3         (naturally coded; _Ismoke_0 omitted)

Iteration 0:   log likelihood = -890.62187
Iteration 1:   log likelihood = -876.52013
Iteration 2:   log likelihood = -875.84853
Iteration 3:   log likelihood = -875.84738

Logistic regression                               Number of obs   =       3154
                                                  LR chi2(3)      =      29.55
                                                  Prob > chi2     =     0.0000
Log likelihood = -875.84738                       Pseudo R2       =     0.0166

------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   .4122448   .1627693     2.53   0.011     .0932229    .7312667
   _Ismoke_2 |   .8035253   .1834786     4.38   0.000     .4439138    1.163137
   _Ismoke_3 |   .8937922   .2010989     4.44   0.000     .4996455    1.287939
       _cons |   -2.76362   .1041517   -26.53   0.000    -2.967754   -2.559486
------------------------------------------------------------------------------


. xi: logistic chd i.smoke [fweight=count]

i.smoke           _Ismoke_0-3         (naturally coded; _Ismoke_0 omitted)

Logistic regression                               Number of obs   =       3154
                                                  LR chi2(3)      =      29.55
                                                  Prob > chi2     =     0.0000
Log likelihood = -875.84738                       Pseudo R2       =     0.0166

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   1.510204   .2458148     2.53   0.011     1.097706    2.077711
   _Ismoke_2 |     2.2334   .4097812     4.38   0.000     1.558796    3.199955
   _Ismoke_3 |   2.444382   .4915626     4.44   0.000     1.648137    3.625307
------------------------------------------------------------------------------


. expand count
(3146 observations created)

. xi: logit chd i.smoke

Logistic regression                               Number of obs   =       3154
                                                  LR chi2(3)      =      29.55
                                                  Prob > chi2     =     0.0000
Log likelihood = -875.84738                       Pseudo R2       =     0.0166

------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   .4122448   .1627693     2.53   0.011     .0932229    .7312667
   _Ismoke_2 |   .8035253   .1834786     4.38   0.000     .4439138    1.163137
   _Ismoke_3 |   .8937922   .2010989     4.44   0.000     .4996455    1.287939
       _cons |   -2.76362   .1041517   -26.53   0.000    -2.967754   -2.559486
------------------------------------------------------------------------------

. xi: logistic chd i.smoke

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   1.510204   .2458148     2.53   0.011     1.097706    2.077711
   _Ismoke_2 |     2.2334   .4097812     4.38   0.000     1.558796    3.199955
   _Ismoke_3 |   2.444382   .4915626     4.44   0.000     1.648137    3.625307
------------------------------------------------------------------------------


. lincom _Ismoke_2- _Ismoke_1, or

 ( 1)  - _Ismoke_1 + _Ismoke_2 = 0

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.478873   .2900367     2.00   0.046     1.006916    2.172044
------------------------------------------------------------------------------

. lincom _Ismoke_3- _Ismoke_2, or

 ( 1)  - _Ismoke_2 + _Ismoke_3 = 0

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.094466   .2505588     0.39   0.693      .698771    1.714234
------------------------------------------------------------------------------

. lincom _Ismoke_3- _Ismoke_1, or

 ( 1)  - _Ismoke_1 + _Ismoke_3 = 0

------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.618577   .3442644     2.26   0.024     1.066809    2.455728
------------------------------------------------------------------------------


Continuous X

Here interpretation of β1 depends on the units of X.

If the logit is linear in X, then β1 represents the change in log odds for a 1 unit increase in X.

$e^{\beta_1}$ is the odds ratio corresponding to a 1 unit increase in X.


Example: Effect of age on nodal involvement in prostate cancer

. logistic node age

Logit estimates                                   Number of obs   =         53
                                                  LR chi2(1)      =       1.09
                                                  Prob > chi2     =     0.2965
Log likelihood = -34.581125                       Pseudo R2       =     0.0155

------------------------------------------------------------------------------
    node | Odds Ratio   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |   .9526993   .0445086     -1.037   0.300       .8693389    1.044053
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
    node |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |   -.048456   .0467184     -1.037   0.300      -.1400223    .0431104
   _cons |   2.366605   2.770912      0.854   0.393      -3.064283    7.797493
------------------------------------------------------------------------------


NOTES

The OR for nodal involvement corresponding to a ten year age difference is $e^{10\beta_{AGE}}$, estimated by $.953^{10} = .62$

The 95% CI for $10\beta_{AGE}$ (the log of the 10-year OR) is given by:

$$10\hat{\beta}_{AGE} \pm 1.96 \times 10 \times SE(\hat{\beta}_{AGE})$$

Hence the 95% CI for the 10-year OR is given by: (.25, 1.54)

This OR is the same comparing 40 year olds with 30 year olds as comparing 60 year olds with 50 year olds, etc.
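The ten-year odds ratio and its interval follow from the age coefficient and standard error in the Stata output. A short sketch of the arithmetic (variable names are illustrative):

```python
import math

# Coefficient and standard error for age from the logit output
beta_age, se_age = -0.048456, 0.0467184
years = 10

log_or = years * beta_age                 # log OR for a 10-year age difference
half_width = 1.96 * years * se_age        # the SE scales linearly with the multiplier

or_10 = math.exp(log_or)
ci_10 = (math.exp(log_or - half_width), math.exp(log_or + half_width))

print(f"10-year OR = {or_10:.2f}, 95% CI = ({ci_10[0]:.2f}, {ci_10[1]:.2f})")
```

Because the multiplier 10 is a constant, no covariance term is needed: SE(10β) is just 10 × SE(β).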


Multiple logistic regression

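Logit(P(Y=1 | X1, X2, .., Xk)) = β0 + β1 X1 + β2 X2 + … + βk Xk

$$P(Y=1 \mid X_1, X_2, \ldots, X_k) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k}}$$

$$P(Y=0 \mid X_1, X_2, \ldots, X_k) = \frac{1}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k}}$$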

Estimation

Assume we have N observations (Yi, Xi1, Xi2, .., Xik), i=1,2,…,N

As before, we can use maximum likelihood to obtain estimates of β0, β1, β2,…, βk that maximize the likelihood:

$$L = \prod_{i=1}^{N} p(X_i)^{Y_i}\,(1 - p(X_i))^{1-Y_i}$$

and we can estimate the variances and covariances of the estimates from the inverse of the information matrix, I.


Hypothesis testing

The Wald, Likelihood Ratio and Score tests generalize to the case of k X variables.

In general

Full model: logit(p) = β0 + β1 X1 + β2 X2 + … + βk Xk

Reduced model: logit(p) = β0 + β1 X1 + β2 X2 + … + βp Xp, p<k

H0: βp+1 = βp+2 = … = βk = 0

Ha: βj ≠ 0 for at least one j in p+1, …, k


Likelihood ratio test

LR statistic = -2[ln L(reduced) -ln L(full)] = Deviance(reduced) - Deviance(full)

Approximate distribution under H0: $\chi^2_{k-p}$

We must fit two models to calculate the LR statistic

Stata provides LR test of the current model relative to the null model:

H0 : β1 = β2 = …= βk =0


Score test

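If H0 implies β = β* then

Score statistic = $S(\beta^*)'\, I^{-1}\, S(\beta^*)$

where S denotes the score (the vector of first partial derivatives of the log likelihood) and I denotes the information matrix

Approximate distribution under H0: $\chi^2_{k-p}$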

Only need to fit the reduced model to calculate the Score statistic

Stata does not perform the Score test easily.


Wald test

For a single parameter:

$$z = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim N(0,1) \text{ under } H_0: \beta_j = 0.$$

The Wald test can be generalized to multiple parameters where it also follows a $\chi^2_{k-p}$ distribution under H0.

Most confidence intervals are based on the Wald test statistic.


LR tests using Stata

In general:

Fit "full" model, then:

. est store A        saves log-likelihood from most recently fitted model and labels it "A"

Fit reduced model, then:

. est store B        saves log-likelihood from most recently fitted model and labels it "B"

Carry out the LR test comparing "full" model (A) with reduced model (B):

. lrtest A B, stats


Example: prostate cancer study

                        Tumor large        Tumor small
Nodal involvement     Xray+    Xray-     Xray+    Xray-
Yes                     9        6         2        3
No                      1       11         3       18


Fitting “full” model:

. logistic node tsize xray

Logistic regression                               Number of obs   =         53
                                                  LR chi2(2)      =      16.90
                                                  Prob > chi2     =     0.0002
Log likelihood = -26.676709                       Pseudo R2       =     0.2405

------------------------------------------------------------------------------
    node | Odds Ratio   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |   4.895297   3.426809      2.269   0.023       1.241425    19.30357
    xray |   8.326496   6.218498      2.838   0.005       1.926448     35.9888
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
    node |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |   1.588275   .7000206      2.269   0.023       .2162598     2.96029
    xray |   2.119443   .7468325      2.838   0.005       .6556779    3.583208
   _cons |  -2.044627   .6099686     -3.352   0.001      -3.240144   -.8491109
------------------------------------------------------------------------------

. est store A


Fitting “reduced” model:

. logistic node tsize

Logistic regression                               Number of obs   =         53
                                                  LR chi2(1)      =       7.70
                                                  Prob > chi2     =     0.0055
Log likelihood = -31.276312                       Pseudo R2       =     0.1096

------------------------------------------------------------------------------
    node | Odds Ratio   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |       5.25   3.310487      2.630   0.009        1.52552    18.06761
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
    node |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   tsize |   1.658228    .630569      2.630   0.009       .4223355    2.894121
   _cons |  -1.435085   .4976116     -2.884   0.004      -2.410385   -.4597837
------------------------------------------------------------------------------

. est stor B


Likelihood ratio test: comparing models for nodal involvement with and without effect of xray

. lrtest A B, stats

Likelihood-ratio test LR chi2(1) = 9.20

(Assumption: B nested in A) Prob > chi2 = 0.0024

------------------------------------------------------------------------------

Model | Obs ll(null) ll(model) df AIC BIC

-------------+----------------------------------------------------------------

B | 53 -35.12608 -31.27631 2 66.55262 70.49321

A | 53 -35.12608 -26.67671 3 59.35342 65.26429

------------------------------------------------------------------------------

What hypothesis is this testing?
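The numbers in the lrtest table can be reproduced from the two stored log likelihoods. The sketch below assumes the usual definitions AIC = -2 ln L + 2 df and BIC = -2 ln L + df ln(N), which match the values printed above; the p-value shortcut via erfc is valid only because the two models differ by one parameter.

```python
import math

n_obs = 53
ll_full, df_full = -26.676709, 3        # model A: tsize + xray + constant
ll_reduced, df_reduced = -31.276312, 2  # model B: tsize + constant

# Likelihood ratio statistic comparing the reduced model with the full model
lr = -2.0 * (ll_reduced - ll_full)
# Upper tail of chi-square with 1 df (df_full - df_reduced = 1)
p_value = math.erfc(math.sqrt(lr / 2.0))

def aic(ll, df):
    return -2.0 * ll + 2.0 * df

def bic(ll, df, n):
    return -2.0 * ll + df * math.log(n)

print(f"LR chi2(1) = {lr:.2f}, p = {p_value:.4f}")
print(f"B: AIC = {aic(ll_reduced, df_reduced):.5f}, BIC = {bic(ll_reduced, df_reduced, n_obs):.5f}")
print(f"A: AIC = {aic(ll_full, df_full):.5f}, BIC = {bic(ll_full, df_full, n_obs):.5f}")
```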


Fitted probabilities in the “full” model:

. predict pnode, p

P(node | tumor=0, xray=0) = .1146 (.1429)
P(node | tumor=1, xray=0) = .3879 (.3529)
P(node | tumor=0, xray=1) = .5187 (.4000)
P(node | tumor=1, xray=1) = .8407 (.9000)

Note: these are slightly different from what we would get if we used the raw data without modelling (the values in parentheses). Why?
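The fitted probabilities can be recomputed from the coefficients of the "full" model and set beside the raw proportions from the 2x2x2 table. A minimal sketch (the dictionary layout is an illustrative choice):

```python
import math

# Coefficients from the fitted model: logit(p) = b0 + b_tsize * tsize + b_xray * xray
b0, b_tsize, b_xray = -2.044627, 1.588275, 2.119443

# Raw data keyed by (tsize, xray): (number with nodal involvement, group size)
raw = {(0, 0): (3, 21), (1, 0): (6, 17), (0, 1): (2, 5), (1, 1): (9, 10)}

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

fitted = {key: expit(b0 + b_tsize * key[0] + b_xray * key[1]) for key in raw}

for (tsize, xray), (y, n) in raw.items():
    print(f"tsize={tsize}, xray={xray}: fitted = {fitted[(tsize, xray)]:.4f}, raw = {y / n:.4f}")
```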


Confidence intervals

A 100(1-α)% Likelihood Ratio based confidence region for β is given by:

$$\{\beta \mid -2(\ln L(\beta) - \ln L(\hat{\beta})) \le \chi^2_{1-\alpha}(p)\}$$

Stata provides Wald-based CIs for individual parameters:

$$\hat{\beta}_j \pm z_{1-\alpha/2}\, SE(\hat{\beta}_j)$$

CIs for odds ratios can be obtained by exponentiation


Confounding

“With ruin upon ruin, rout on rout, Confusion worse confounded.” Milton: Paradise Lost, ii. line 996.

A confounder, C, is a variable which, because of its relationship to disease, D, and the exposure of interest, E, distorts the disease-exposure relationship. Adjustment can remove its effects. We can adjust for the confounder explicitly by modeling or implicitly by stratification.

Note: C would not be considered a confounder if
- it lies in the causal pathway between E and D, or
- C is caused by both E and D


Example

C = 0
         E+    E-
D+       90    10
D-        9     1
OR = 1.0

C = 1
         E+    E-
D+        1     9
D-       10    90
OR = 1.0

Combined
         E+    E-
D+       91    19
D-       19    91
OR = 22.9
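The example can be verified with a few lines of arithmetic: pooling the two strata produces the combined table, and the crude odds ratio differs sharply from the stratum-specific ones. A minimal sketch (function and variable names are illustrative):

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table with rows (D+, D-) and columns (E+, E-)."""
    return (a * d) / (b * c)

# Each stratum: (D+ E+, D+ E-, D- E+, D- E-)
strata = {"C=0": (90, 10, 9, 1), "C=1": (1, 9, 10, 90)}

# Pool the strata cell by cell to get the combined table
pooled = tuple(sum(cells) for cells in zip(*strata.values()))

for label, cells in strata.items():
    print(f"{label}: OR = {odds_ratio(*cells):.1f}")
print(f"Combined {pooled}: OR = {odds_ratio(*pooled):.1f}")
```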


Effect modification

Effect modification occurs when the chosen summary of association differs in different strata. In some cases there may be effect modification for one summary but not for another.

For example, in clinical trials of cholesterol-lowering drugs it appears that the relative risk of coronary heart disease comparing the treatment to placebo is about the same in people with and without previous disease, but the risk difference is enormously greater in those with previous disease.

Effect modification may or may not be of interest. For example, pleconaril, an investigational antiviral drug, reduced the mean duration of symptoms in subjects with a common cold due to rhinoviruses but had no effect in subjects whose cold was due to some other agent. Here the effect modification was important in checking that the drug really worked by inhibiting rhinovirus. On the other hand, in clinical use of the drug it would typically not be possible to determine the infectious agent and so the average effectiveness across all colds would be a more important quantity.


Effect modification and confounding can exist separately or together:

Confounding without effect modification. Here, the overall association is not the same as the causal effect of interest, but after stratification the association is the same within each stratum of the confounder. The ideal solution is to stratify and then to average the association across strata to regain the precision lost by stratifying.

Effect modification without confounding. Here, the overall association correctly estimates the average effect of the exposure, but that effect is different in different subgroups. If the separate associations are of interest then a stratified analysis is called for. If the main scientific interest is in the average effect across the population then a stratified analysis is unnecessary.

Both confounding and effect modification. That is, the overall association does not correctly estimate the average effect of exposure and after stratification the association is different in different subgroups. The confounding means that a stratified analysis is necessary. If the effect modification is scientifically uninteresting the estimates from separate strata can be combined as would be done in the absence of effect modification.


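Logistic models for a binary exposure (XE) and binary covariate (XC)

Consider the model:

logit(P(Y=1| XE, XC)) = β0 + β1 XE + β2 XC + β3 XE XC

Let

$$\Psi_1 = \frac{P(Y=1 \mid X_E=1, X_C=0)\,/\,P(Y=0 \mid X_E=1, X_C=0)}{P(Y=1 \mid X_E=0, X_C=0)\,/\,P(Y=0 \mid X_E=0, X_C=0)}$$

$$\Psi_2 = \frac{P(Y=1 \mid X_E=1, X_C=1)\,/\,P(Y=0 \mid X_E=1, X_C=1)}{P(Y=1 \mid X_E=0, X_C=1)\,/\,P(Y=0 \mid X_E=0, X_C=1)}$$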

Then, in a cohort study, under the logistic model:

ln(Ψ1) = logit(P(Y=1| XE=1, XC=0)) - logit(P(Y=1| XE=0, XC=0))

= β0 + β1 - β0 = β1

ln(Ψ2) = logit(P(Y=1| XE=1, XC=1)) - logit(P(Y=1| XE=0, XC=1))

= β0 + β1 + β2 + β3 - β0 - β2 = β1 + β3

β0 estimates logit(P(Y=1| XE=0, XC=0))
β1 estimates ln(Ψ1)
β2 estimates logit(P(Y=1| XE=0, XC=1)) - logit(P(Y=1| XE=0, XC=0))
β3 estimates ln(Ψ2) - ln(Ψ1)
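Because the interaction model has four parameters for four exposure-by-covariate groups, its estimates can be read off the observed log odds. The sketch below illustrates this with the prostate cancer table from earlier, taking E = large tumor and C = positive X-ray; that assignment of roles is an assumption made here purely for illustration.

```python
import math

# cells[(xe, xc)] = (number with nodal involvement, number without)
# xe = 1 for large tumor, xc = 1 for positive X-ray (illustrative choice of E and C)
cells = {(0, 0): (3, 18), (1, 0): (6, 11), (0, 1): (2, 3), (1, 1): (9, 1)}

def log_odds(xe, xc):
    yes, no = cells[(xe, xc)]
    return math.log(yes / no)

beta0 = log_odds(0, 0)
beta1 = log_odds(1, 0) - log_odds(0, 0)              # ln(Psi1)
beta2 = log_odds(0, 1) - log_odds(0, 0)
beta3 = (log_odds(1, 1) - log_odds(0, 1)) - beta1    # ln(Psi2) - ln(Psi1)

psi1 = math.exp(beta1)
psi2 = math.exp(beta1 + beta3)

print(f"beta0={beta0:.4f} beta1={beta1:.4f} beta2={beta2:.4f} beta3={beta3:.4f}")
print(f"Psi1 = {psi1:.3f}, Psi2 = {psi2:.3f}")
```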


Logistic models for two 2x2 tables

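[Figures: logit(p) for E+ and E- within the strata C- and C+, one panel for each of the following models.]

logit(p) = β0: E+ and E- are both at b0 in both strata.

logit(p) = β0 + β1 XE: in both strata, E- is at b0 and E+ is at b0+b1.

logit(p) = β0 + β2 XC: E+ and E- are both at b0 in C- and both at b0+b2 in C+.

logit(p) = β0 + β1 XE + β2 XC: E- is at b0 (C-) and b0+b2 (C+); E+ is at b0+b1 (C-) and b0+b1+b2 (C+).

logit(p) = β0 + β1 XE + β2 XC + β3 XE XC: E- is at b0 (C-) and b0+b2 (C+); E+ is at b0+b1 (C-) and b0+b1+b2+b3 (C+).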


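Consider the model:

logit(P(Y=1| XE, XC, Z=1)) = β0* + β1 XE + β2* XC + β3 XE XC

Then:

log(Ψ1) = β1
log(Ψ2) = β1 + β3

as before, but

β0* = ln(π10/π00) + β0

β2* = ln(π11/π01) - ln(π10/π00) + β2

where πyc = P(Z=1 | Y=y, XC=c) is the sampling probability for an individual with outcome y and covariate value c.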


Confidence intervals for linear combinations of parameters

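100(1-α)% CI for $\Psi_2 = e^{\beta_1 + \beta_3}$:

$$e^{(\hat{\beta}_1 + \hat{\beta}_3) \,\pm\, z_{\alpha/2}\, SE(\hat{\beta}_1 + \hat{\beta}_3)}$$

where

$$SE(\hat{\beta}_1 + \hat{\beta}_3) = \sqrt{Var(\hat{\beta}_1) + Var(\hat{\beta}_3) + 2\,cov(\hat{\beta}_1, \hat{\beta}_3)}$$

This can be obtained in Stata using the "lincom" command, or by using a different parameterization.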
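The role of the covariance term can be illustrated numerically. The sketch below again uses the prostate cancer table with E = large tumor and C = positive X-ray (an illustrative assumption). For this saturated model the large-sample variance of each stratum log odds ratio is the sum of reciprocal cell counts, the strata are independent, and so cov(β̂1, β̂3) = -Var(β̂1); these facts are used in place of a fitted covariance matrix.

```python
import math

# cells[(xe, xc)] = (number with nodal involvement, number without)
cells = {(0, 0): (3, 18), (1, 0): (6, 11), (0, 1): (2, 3), (1, 1): (9, 1)}

def var_log_or(xc):
    """Large-sample variance of the log odds ratio in stratum xc: sum of 1/cell count."""
    return sum(1.0 / count for xe in (0, 1) for count in cells[(xe, xc)])

v1, v2 = var_log_or(0), var_log_or(1)

# beta1 = ln(Psi1) and beta3 = ln(Psi2) - ln(Psi1), with independent strata
var_b1 = v1
var_b3 = v1 + v2
cov_b1_b3 = -v1

se_sum = math.sqrt(var_b1 + var_b3 + 2.0 * cov_b1_b3)

log_psi2 = math.log((9 * 3) / (1 * 2))
ci_psi2 = (math.exp(log_psi2 - 1.96 * se_sum), math.exp(log_psi2 + 1.96 * se_sum))

print(f"SE(beta1 + beta3) = {se_sum:.4f}")
print(f"Psi2 = {math.exp(log_psi2):.1f}, 95% CI = ({ci_psi2[0]:.2f}, {ci_psi2[1]:.1f})")
```

Ignoring the negative covariance would overstate the standard error; with it, the variance of β̂1 + β̂3 collapses to the variance of the log odds ratio in the C+ stratum alone.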


Parameterizations

1. "Full" model with interaction (A):

logit(P(Y=1| XE, XC)) = β0 + β1 XE + β2 XC + β3 XE XC

2. Reduced model without interaction (B):

logit(P(Y=1| XE, XC)) = β0 + β1 XE + β2 XC

XE   XC   Cases   n
1    0    a1      m10
0    0    b1      m00
1    1    a2      m11
0    1    b2      m01


3. Another "full" logistic model:

X1 = 1 when XE=1, XC=0
   = 0 otherwise

X2 = 1 when XE=1, XC=1
   = 0 otherwise

logit(P(Y=1| X1, X2, XC)) = β0 + β1 X1 + β2 XC + γ X2

Here
β1 = log(Ψ1)
β2 = log(ΨC|E-)
γ = log(Ψ2)

This model allows us to estimate Ψ2 directly, but it is more complicated to test for interaction.

(No interaction => γ =β1.)


4. A third "full" logistic model

X1 = 1 when XE=1, XC=0
   = 0 otherwise

X2 = 1 when XE=0, XC=1
   = 0 otherwise

X3 = 1 when XE=1, XC=1
   = 0 otherwise

logit(P(Y=1| X1, X2, X3)) = β0 + β1 X1 + β2 X2 + η X3

This model allows us to test each of the above groups against the baseline.

The interpretation of regression coefficients depends on the parameterization and what other variables are in the model.


Case-control study of esophageal cancer and alcohol consumption in France (Breslow & Day, Vol I, p137).

6 age strata; 2 exposure variables: daily alcohol consumption (4 categories), daily tobacco consumption (4 categories); 2 disease groups: cases and controls

. infile age alcohol tobacco case using "p:\536\esoph.raw", clear
(975 observations read)
. * begin labelling data and variables
. label data "Esophageal Cancer Case-Control Study"
. label define agelabel 1 "25-34" 2 "35-44" 3 "45-54" 4 "55-64" 5 "65-74" 6 "75+"
. label values age agelabel
. label variable age "Age in years"
. label define alclabel 1 "0-39" 2 "40-79" 3 "80-119" 4 "120+"
. label values alcohol alclabel
. label variable alcohol "Alcohol g/day"
. label define toclabel 1 "0-9" 2 "10-19" 3 "20-29" 4 "30+"
. label values tobacco toclabel
. label variable tobacco "Tobacco g/day"
. label define caselab 0 "Control" 1 "Case"
. label values case caselab
. label variable case "Case-control status"


. * CREATE SOME SIMPLE TABLES TO LOOK AT DATA

. tabulate age case, col

    Age in |  Case-control status
     years |   Control       Case |     Total
-----------+----------------------+----------
     25-34 |       115          1 |       116
           |     14.84       0.50 |     11.90
-----------+----------------------+----------
     35-44 |       190          9 |       199
           |     24.52       4.50 |     20.41
-----------+----------------------+----------
     45-54 |       167         46 |       213
           |     21.55      23.00 |     21.85
-----------+----------------------+----------
     55-64 |       166         76 |       242
           |     21.42      38.00 |     24.82
-----------+----------------------+----------
     65-74 |       106         55 |       161
           |     13.68      27.50 |     16.51
-----------+----------------------+----------
       75+ |        31         13 |        44
           |      4.00       6.50 |      4.51
-----------+----------------------+----------
     Total |       775        200 |       975
           |    100.00     100.00 |    100.00


. tabulate alcohol case, col

   Alcohol |  Case-control status
     g/day |   Control       Case |     Total
-----------+----------------------+----------
      0-39 |       386         29 |       415
           |     49.81      14.50 |     42.56
-----------+----------------------+----------
     40-79 |       280         75 |       355
           |     36.13      37.50 |     36.41
-----------+----------------------+----------
    80-119 |        87         51 |       138
           |     11.23      25.50 |     14.15
-----------+----------------------+----------
      120+ |        22         45 |        67
           |      2.84      22.50 |      6.87
-----------+----------------------+----------
     Total |       775        200 |       975
           |    100.00     100.00 |    100.00


. tabulate tobacco case, col

   Tobacco |  Case-control status
     g/day |   Control       Case |     Total
-----------+----------------------+----------
       0-9 |       447         78 |       525
           |     57.68      39.00 |     53.85
-----------+----------------------+----------
     10-19 |       178         58 |       236
           |     22.97      29.00 |     24.21
-----------+----------------------+----------
     20-29 |        99         33 |       132
           |     12.77      16.50 |     13.54
-----------+----------------------+----------
       30+ |        51         31 |        82
           |      6.58      15.50 |      8.41
-----------+----------------------+----------
     Total |       775        200 |       975
           |    100.00     100.00 |    100.00


. table case alcohol tobacco

----------+-----------------------------------------------------------------
Case-cont |                  Tobacco g/day and Alcohol g/day
rol       | ------------- 0-9 ------------    ------------ 10-19 -----------
status    |   0-39   40-79  80-119    120+      0-39   40-79  80-119    120+
----------+-----------------------------------------------------------------
  Control |    252     145      42       8        74      68      30       6
     Case |      9      34      19      16        10      17      19      12
----------+-----------------------------------------------------------------

----------+-----------------------------------------------------------------
Case-cont |                  Tobacco g/day and Alcohol g/day
rol       | ------------ 20-29 -----------    ------------- 30+ ------------
status    |   0-39   40-79  80-119    120+      0-39   40-79  80-119    120+
----------+-----------------------------------------------------------------
  Control |     37      47      10       5        23      20       5       3
     Case |      5      15       6       7         5       9       7      10
----------+-----------------------------------------------------------------


. table case alcohol tobacco, by(age)

----------+-----------------------------------------------------------------
Age in    |
years and |
Case-cont | Tobacco g/day and Alcohol g/day
rol       | ------------- 0-9 ------------ ------------ 10-19 -----------
status    | 0-39 40-79 80-119 120+ 0-39 40-79 80-119 120+
----------+-----------------------------------------------------------------
25-34     |
  Control | 40 27 2 1 10 7 1
  Case    | 1
----------+-----------------------------------------------------------------
35-44     |
  Control | 60 35 11 1 13 20 6 3
  Case    | 2 1 3
----------+-----------------------------------------------------------------
45-54     |
  Control | 45 32 13 18 17 8 1
  Case    | 1 6 3 4 4 6 3
----------+-----------------------------------------------------------------
55-64     |
  Control | 47 31 9 5 19 15 7 1
  Case    | 2 9 9 5 3 6 8 6
----------+-----------------------------------------------------------------
65-74     |
  Control | 43 17 7 1 10 7 8 1
  Case    | 5 17 6 3 4 3 4 1
----------+-----------------------------------------------------------------
75+       |
  Control | 17 3 4 2
  Case    | 1 2 1 2 2 1 1 1
----------+-----------------------------------------------------------------


----------+-----------------------------------------------------------------
Age in    |
years and |
Case-cont | Tobacco g/day and Alcohol g/day
rol       | ------------ 20-29 ----------- ------------- 30+ ------------
status    | 0-39 40-79 80-119 120+ 0-39 40-79 80-119 120+
----------+-----------------------------------------------------------------
25-34     |
  Control | 6 4 1 5 7 2 2
  Case    |
----------+-----------------------------------------------------------------
35-44     |
  Control | 7 13 2 2 8 8 1
  Case    | 1 2
----------+-----------------------------------------------------------------
45-54     |
  Control | 10 10 4 1 4 2 2
  Case    | 5 1 2 5 2 4
----------+-----------------------------------------------------------------
55-64     |
  Control | 9 13 3 1 2 3 1
  Case    | 3 4 3 2 4 3 4 5
----------+-----------------------------------------------------------------
65-74     |
  Control | 5 4 1 2
  Case    | 2 5 2 1 1 1
----------+-----------------------------------------------------------------
75+       |
  Control | 3 2
  Case    | 1 1
----------+-----------------------------------------------------------------


Some Stata language for recoding variables:

. generate agegp=recode(age,2,4,6)

. * All obsns with age <= 2 have agegp=2, all with age >2 and <=4

. * have agegp=4 and all with age > 4 have agegp=6

. * Change the coding to 1,2,3

. recode agegp 2=1 4=2 6=3

(975 changes made)

. table age

----------+-----------

Age in |

years | Freq.

----------+-----------

25-34 | 116

35-44 | 199

45-54 | 213

55-64 | 242

65-74 | 161

75+ | 44

----------+-----------

. table agegp

----------+-----------
    agegp |      Freq.
----------+-----------
        1 |        315
        2 |        455
        3 |        205
----------+-----------


. drop agegp

. gen agegp=recode(age,2,4)

. table agegp

-------+-----------
 agegp |      Freq.
-------+-----------
     2 |        315
     4 |        660
-------+-----------

. * All observations that are not <= a number in the list are given the last

. * value in the list

. drop agegp

. gen agegp=1+(age>2)+(age>4)

. table agegp

----------+-----------
    agegp |      Freq.
----------+-----------
        1 |        315
        2 |        455
        3 |        205
----------+-----------
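The grouping logic can be mirrored outside Stata to check the frequencies. The sketch below is a Python imitation of the behaviour described in the comments above, not Stata's own implementation; the helper names `recode` and `tabulate` are hypothetical, and the counts per age category are taken from the "table age" output.

```python
def recode(x, *cutpoints):
    """Return the first cutpoint that is >= x; if none is, return the last cutpoint."""
    for c in cutpoints:
        if x <= c:
            return c
    return cutpoints[-1]

# Number of subjects in each of the six coded age categories (1 = 25-34, ..., 6 = 75+)
age_counts = {1: 116, 2: 199, 3: 213, 4: 242, 5: 161, 6: 44}

def tabulate(group_of):
    """Frequency table of the grouped variable, given a function from age code to group."""
    freq = {}
    for age, n in age_counts.items():
        g = group_of(age)
        freq[g] = freq.get(g, 0) + n
    return freq

tab_three = tabulate(lambda a: recode(a, 2, 4, 6))
tab_two = tabulate(lambda a: recode(a, 2, 4))
# Same grouping built from logical comparisons: True counts as 1, False as 0
tab_logical = tabulate(lambda a: 1 + (a > 2) + (a > 4))

print(tab_three, tab_two, tab_logical)
```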


Analysis with binary tobacco and alcohol variables

. gen binalc=alcohol>2

. gen bintob=tobacco>2

. * Start by looking at some crude and stratified analyses

. table case binalc bintob

----------+-------------------------
Case-cont |    bintob and binalc
rol       | ---- 0 ---   ---- 1 ---
status    |    0     1      0     1
----------+-------------------------
  Control |  539    86    127    23
     Case |   70    66     34    30
----------+-------------------------


. cc case binalc, by(bintob)

          bintob |       OR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
               0 |   5.909302      3.94179   8.859986     7.910644 (Cornfield)
               1 |   4.872123     2.523999   9.408074     3.654206 (Cornfield)
-----------------+-------------------------------------------------
           Crude |   5.640085     4.003217    7.94673              (Cornfield)
    M-H combined |   5.581579     3.945401    7.89629
-----------------+-------------------------------------------------

Test of homogeneity (M-H)      chi2(1) =     0.24  Pr>chi2 = 0.6258

Test that combined OR = 1:
                  Mantel-Haenszel chi2(1) =   106.85
                                  Pr>chi2 =   0.0000
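The point estimates in this output can be reproduced from the four-way table of case status by binalc and bintob. The sketch below computes the stratum-specific, crude and Mantel-Haenszel odds ratios using the standard M-H formula (sum of ad/n over sum of bc/n); it does not attempt the Cornfield intervals or the test statistics.

```python
# Per bintob stratum: (exposed cases, unexposed cases, exposed controls, unexposed controls),
# where "exposed" means binalc = 1
strata = {0: (66, 70, 86, 539), 1: (30, 34, 23, 127)}

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

stratum_or = {k: odds_ratio(*cells) for k, cells in strata.items()}

# Crude OR: collapse over bintob before computing the odds ratio
crude = odds_ratio(*(sum(cells) for cells in zip(*strata.values())))

# Mantel-Haenszel summary: weights b*c/n, as in the "M-H Weight" column
numerator = denominator = 0.0
weights = {}
for k, (a, b, c, d) in strata.items():
    n = a + b + c + d
    numerator += a * d / n
    weights[k] = b * c / n
    denominator += weights[k]
or_mh = numerator / denominator

print(stratum_or, f"crude = {crude:.4f}", f"M-H = {or_mh:.4f}", weights)
```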
