Bayesian Survival Analysis: Model, Prior and Posterior - Eurandom
Bayesian Survival Analysis: Model, Prior and Posterior
Yongdai Kim
Seoul National University, Korea
2010.11.08
0. Contents
1. Model
2. Prior 1: Dirichlet process
3. Prior 2: Neutral to right process
4. Prior 3: Beta process
5. The proportional hazards model
6. Event history data
Seoul National University. 1
1. Model
• Survival times: X1, . . . , Xn ∼ F .
• Right censoring times: C1, . . . , Cn ∼ G.
• Let Ti = min(Xi, Ci) and δi = I(Xi ≤ Ci), i = 1, . . . , n.
• We observe (T1, δ1), . . . , (Tn, δn) and wish to estimate F .
• Parameter space:
F = { all probability distributions on R+}.
Bayesian Analysis in a Nutshell
• The prior is the probability measure on the parameter space Θ which reflects the statistician’s knowledge about θ before he/she sees the data.
θ ∼ π(θ).
• Model: X|θ ∼ f(x|θ).
• The posterior is the conditional distribution of θ given the data X and reflects the knowledge after he/she sees the data.
θ|X ∼ π(θ|X).
• For Bayesian analysis of the survival model, we need a tractable and rich class of prior distributions on the parameter space
F = { all probability distributions on R+}.
2. Prior 1: Dirichlet process (Ferguson, 1973)
• We put prior measures on F by defining prior distributions of (F (A1), . . . , F (Ak)) for all possible finite partitions A1, . . . , Ak.
• A natural conjugate prior for the partition probabilities is a Dirichlet distribution, which leads to the following definition of Dirichlet processes.
• Definition of Dirichlet process
– Let α be a positive finite measure on R. The random c.d.f. F has a Dirichlet process prior with parameter α, denoted by DP(α), if for every finite partition A1, . . . , Ak of R,
(F (A1), . . . , F (Ak)) ∼ Dirichlet(α(A1), . . . , α(Ak)).
Properties of Dirichlet process prior
• The support of a Dirichlet process (with respect to the weak topology) is F .
• Without censoring, it is conjugate. In fact, the posterior distribution is a Dirichlet process with base measure $\alpha^p$, where
$$\alpha^p(\cdot) = \alpha(\cdot) + \sum_{i=1}^{n} \delta_{X_i}(\cdot).$$
• With right censored data, it is not conjugate, and the posterior remained a mystery until Hjort (1990).
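The uncensored conjugacy result can be illustrated with a minimal sketch. It assumes the base measure is α = M · F0 with total mass M and an exponential prior guess F0; the function names, the data and the exponential choice are illustrative assumptions, not part of the talk. The posterior mean of F(t) is the normalized posterior base measure of [0, t].

```python
import math

def dp_posterior_mean_cdf(t, data, total_mass, prior_cdf):
    """Posterior mean of F(t) under DP(alpha), alpha = total_mass * prior_cdf.

    The posterior base measure is alpha + sum_i delta_{X_i}, so the posterior
    mean of F(t) is alpha_p([0, t]) / alpha_p(R+).
    """
    n = len(data)
    count = sum(1 for x in data if x <= t)   # sum_i delta_{X_i}([0, t])
    return (total_mass * prior_cdf(t) + count) / (total_mass + n)

def exp_cdf(t, rate=1.0):
    """Assumed prior guess F0: exponential c.d.f. (illustrative choice)."""
    return 1.0 - math.exp(-rate * t) if t > 0 else 0.0

data = [0.3, 0.8, 1.1, 2.5]                  # made-up uncensored survival times
post_small = dp_posterior_mean_cdf(1.0, data, 0.001, exp_cdf)   # close to the empirical c.d.f.
post_large = dp_posterior_mean_cdf(1.0, data, 1000.0, exp_cdf)  # close to the prior guess F0
```

With a small total mass the estimate is essentially the empirical distribution function; with a large total mass it stays near F0.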
3. Prior 2: Neutral to right process prior (Doksum, 1974)
• Consider the conditional distribution function
$$F(t|s) = \Pr(X_i \le t \mid X_i \ge s) = \frac{F(t) - F(s)}{1 - F(s)}.$$
• Let 0 = t0 < t1 < · · · < tk < tk+1 = ∞ be a finite ordered partition of R+.
• A neutral to right (NTR) process is a random probability measure defined on R+ such that, for every such partition,
F (t1|t0), F (t2|t1), . . . , F (tk+1|tk)
are independent.
Properties of the NTR process prior
• Dirichlet processes are NTR processes.
• The class of NTR processes is conjugate with right censored data (Ferguson and Phadia, 1979).
• Let Y (t) = − log(1 − F (t)). Then, F is an NTR process if and only if Y (t) has independent increments.
• The class of NTR processes is large enough to exhibit all kinds of pathological as well as well-behaved examples (we will see some examples in tomorrow’s talk).
• The class of NTR processes may be too large for conjugacy to have practical meaning (e.g. the class of all distributions is conjugate!)
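The characterization through Y (t) = − log(1 − F (t)) suggests a simple way to simulate an NTR distribution on a grid: draw independent nonnegative increments for Y and transform back. The sketch below assumes gamma-distributed increments with shape proportional to the interval length; that choice, the function name and the parameter values are illustrative assumptions only.

```python
import math
import random

def simulate_ntr_cdf(grid, shape_rate, rng):
    """Return F on the grid, where F(t) = 1 - exp(-Y(t)).

    Y has independent increments; here they are gamma distributed with
    shape = shape_rate * (interval length) and scale 1 (an assumed choice).
    """
    y, cdf = 0.0, []
    for prev, cur in zip([0.0] + grid[:-1], grid):
        y += rng.gammavariate(shape_rate * (cur - prev), 1.0)
        cdf.append(1.0 - math.exp(-y))
    return cdf

rng = random.Random(0)
grid = [i / 10 for i in range(1, 31)]        # t = 0.1, 0.2, ..., 3.0
cdf = simulate_ntr_cdf(grid, shape_rate=2.0, rng=rng)
```

Because the increments of Y are independent, the conditional probabilities F (ti|ti−1) = 1 − exp(−(Y (ti) − Y (ti−1))) are independent, which is the NTR property on this grid.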
4. Prior 3: Beta process (Hjort 1990)
Cumulative hazard function
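• A cumulative hazard function (chf) A of a distribution F is defined as
$$A(t) = \int_0^t \frac{dF(s)}{1 - F(s-)}.$$
• If F is continuous with density f ,
– $$A(t) = \int_0^t \frac{f(s)}{1 - F(s-)}\,ds.$$
– a(t) = dA(t)/dt is called the hazard function.
• Heuristically, dA(s) is the probability of failing at time s given survival up to s, so that
$$A(t) = \int_0^t P(X \in [s, s + ds) \mid X \ge s).$$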
Properties of the chf
• A is nondecreasing with A(0) = 0 and $\lim_{t\to\infty} A(t) = \infty$.
• ∆A(t) ∈ [0, 1]
• Product integration
$$1 - F(t) = \prod_{s \le t} (1 - dA(s))$$
• There is a one-to-one correspondence between the distribution function and the chf.
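For a purely discrete distribution the product integral reduces to an ordinary product over the atoms, which makes the one-to-one correspondence easy to check numerically. The following sketch uses a made-up probability mass function on four ordered atoms; the function names are hypothetical.

```python
def hazards_from_pmf(pmf):
    """Jumps of the chf on ordered atoms: dA(s) = P(X = s) / P(X >= s)."""
    hazards, at_risk = [], 1.0
    for p in pmf:
        hazards.append(p / at_risk)
        at_risk -= p
    return hazards

def survival_from_hazards(hazards):
    """Product integral: 1 - F(t) = prod over s <= t of (1 - dA(s))."""
    surv, out = 1.0, []
    for h in hazards:
        surv *= 1.0 - h
        out.append(surv)
    return out

pmf = [0.1, 0.2, 0.3, 0.4]                   # illustrative distribution
hazards = hazards_from_pmf(pmf)              # jumps of A, each in [0, 1]
survival = survival_from_hazards(hazards)    # recovers 1 - F at each atom
```

Going from the pmf to the hazards and back reproduces the survival function, and the last jump of A equals 1 (up to rounding) because all remaining mass sits on the last atom.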
Nondecreasing process with independent increments: Subordinator
• A stochastic process A on R+ has independent increments (i.e. is a Levy process) if A(t) − A(s), s ≤ t, is independent of σ(A(u), u ≤ s).
• By Levy’s representation, any Levy process A can be decomposed, roughly, as
A(t) = D(t) + B(t) + C(t) where
– D(t): deterministic function
– B(t): Brownian motion
– C(t): limit of compound Poisson processes
• Assume that D(t) = 0. For the process to be a subordinator, B(t) should be 0.
• A subordinator = limit of compound Poisson processes
Compound Poisson process
• Let N(t) be a Poisson process with mean $\Lambda(t) = \int_0^t \lambda(s)\,ds$.
• Let Zt, t ∈ R+, be independent random variables with density ht.
• A compound Poisson process A is given as
$$A(t) = \sum_{s \le t} Z_s I(\Delta N(s) = 1).$$
• For a compound Poisson process to be nondecreasing, the support of ht should be contained in R+.
• Any subordinator can be approximated by letting Λ become large and Zt become small.
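A compound Poisson path is straightforward to simulate. The sketch below assumes a constant intensity λ(s) and exponential jump sizes; both are illustrative choices (the slides allow a general λ and general densities ht), and the function names are hypothetical. A high rate with small jumps mimics the approximation of a subordinator mentioned above.

```python
import random

def simulate_compound_poisson(rate, mean_jump, t_max, rng):
    """Return jump times and jump sizes of a compound Poisson process on [0, t_max].

    Assumes constant intensity lambda(s) = rate and exponential jump sizes
    with mean mean_jump (illustrative assumptions).
    """
    times, sizes = [], []
    t = rng.expovariate(rate)                # first jump time of N
    while t <= t_max:
        times.append(t)
        sizes.append(rng.expovariate(1.0 / mean_jump))
        t += rng.expovariate(rate)           # exponential waiting times between jumps
    return times, sizes

def path_value(times, sizes, t):
    """A(t): sum of the jumps Z_s over jump times s <= t."""
    return sum(z for s, z in zip(times, sizes) if s <= t)

rng = random.Random(0)
times, sizes = simulate_compound_poisson(rate=50.0, mean_jump=0.02, t_max=1.0, rng=rng)
grid = [i / 10 for i in range(11)]
path = [path_value(times, sizes, t) for t in grid]
```

Since all jumps are nonnegative, the simulated path is nondecreasing, as required of a subordinator.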
Levy measure of a subordinator
• By Levy’s representation, for a subordinator A there exists a measure ν on R+ × R+ such that
$$E(\exp(-uA(t))) = \exp\left[-\int_0^t \int_{\mathbb{R}_+} (1 - \exp(-ux))\,\nu(ds, dx)\right]$$
provided E(exp(−uA(t))) exists.
• If there exist functions ft(x) ≥ 0 such that
– $$\nu([0, t] \times B) = \int_0^t \int_B f_s(x)\,dx\,ds$$
– $$\int_0^t \int_0^1 x f_s(x)\,dx\,ds < \infty \text{ for all } t,$$
then there exists a unique subordinator whose Levy measure is ν (see, for example, Kim 1999).
• For a compound Poisson process,
ft(x) = ht(x)λ(t).
• That is,
– $\lambda(t) = \int_0^\infty f_t(x)\,dx$
– ht(x) = ft(x)/λ(t)
provided λ(t) < ∞.
• For a general subordinator with $\int_0^\infty f_t(x)\,dx = \infty$, we can approximate it by a sequence of compound Poisson processes.
• Hence, we can think of ft(x) as unnormalized density functions of jump sizes.
Beta process prior
• The key idea of beta process priors is to put prior measures on the space of chfs rather than the space of distribution functions.
• That is, a priori we let A be a subordinator.
• The beta process is a special subordinator which is conjugate with right censored data.
• Note that ∆A(t) ∈ [0, 1], and so a natural conjugate prior for ∆A(t) is a beta distribution.
• The beta process lets ∆A(t) ∼ Beta(0, c(t)) (incomplete beta distribution) for some nonnegative function c(t).
• A priori, we let E(A(t)) = A0(t).
• Then, the beta process is a subordinator whose corresponding Levy measure is
$$\nu(dt, dx) = \frac{c(t)}{x} (1 - x)^{c(t)-1} I(0 \le x \le 1)\,dx\,dA_0(t)$$
provided A0(t) is absolutely continuous.
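As a short worked check (a sketch, assuming c(s) > 0 and using the standard fact that the mean and variance of a pure-jump subordinator are the integrals of x and x² against its Levy measure), the beta process Levy measure gives the prior mean A0 and a variance that shrinks as c grows:

```latex
\begin{aligned}
E(A(t)) &= \int_0^t \int_0^1 x\,\nu(ds, dx)
         = \int_0^t c(s) \int_0^1 (1-x)^{c(s)-1}\,dx\,dA_0(s)
         = A_0(t), \\
\mathrm{Var}(A(t)) &= \int_0^t \int_0^1 x^2\,\nu(ds, dx)
         = \int_0^t c(s) \int_0^1 x (1-x)^{c(s)-1}\,dx\,dA_0(s)
         = \int_0^t \frac{dA_0(s)}{1 + c(s)}.
\end{aligned}
```

The inner integrals are 1/c(s) and B(2, c(s)) = 1/(c(s)(c(s) + 1)), respectively.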
• Suppose A0(t) = A0c(t) + A0d(t) where A0c is absolutely continuous and $A_{0d}(t) = \sum_{s \le t} \Delta A_0(s)$.
• The beta process A with parameters A0 and c is decomposed as
A(t) = Ac(t) + Ad(t) where
– Ac is a beta process with parameters A0c and c.
– $A_d(t) = \sum_{s \le t} P_s$ where the Ps are independent and
$P_s \sim \mathrm{Beta}(c(s)\Delta A_0(s),\ c(s)(1 - \Delta A_0(s)))$.
– Ac and Ad are independent.
Properties of beta process
• Let A be a beta process with parameter (A0, c).
• E(A(t)) = A0(t)
• $$\mathrm{Var}(A(t)) = \int_0^t \frac{dA_0(s)(1 - dA_0(s))}{1 + c(s)}$$
• Note that the variance is inversely related to c, which is therefore called the precision parameter.
• If F is a Dirichlet process with base measure α, then A is a beta process with
– A0 the chf of F0(t) = α([0, t])/α(R+)
– c(t) = α(R+)(1 − F0(t)).
Different Construction of beta processes
• The construction of beta processes via a subordinator with incomplete beta distribution jumps was introduced by Kim (1999).
• The original construction of beta processes by Hjort (1990) is the limit of time discrete beta processes.
– Let 0 = t0 < t1 < · · · < tk < ∞.
– Given A0 and c, let
∗ ∆A0(ti) = A0(ti) − A0(ti−1).
∗ ∆Ak(ti) ∼ Beta(c(ti)∆A0(ti), c(ti)(1 − ∆A0(ti))), independently over i.
∗ $A_k(t) = \sum_{t_i \le t} \Delta A_k(t_i)$.
– Hjort (1990) proved that Ak(t) converges to a beta process as max |ti − ti−1| → 0.
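The time discrete construction translates directly into a simulation. The sketch below draws independent beta increments on a fine grid; the prior mean A0(t) = t/2, the constant precision c = 5, the grid and the function name are all illustrative assumptions.

```python
import random

def simulate_discrete_beta_process(cum_hazard, c, grid, rng):
    """Simulate A_k on a grid 0 = t_0 < t_1 < ... < t_k.

    Increments are independent Beta(c(t_i) dA0_i, c(t_i) (1 - dA0_i)),
    with dA0_i = A0(t_i) - A0(t_{i-1}).
    """
    path, total = [0.0], 0.0
    for prev, cur in zip(grid[:-1], grid[1:]):
        d_a0 = cum_hazard(cur) - cum_hazard(prev)
        total += rng.betavariate(c(cur) * d_a0, c(cur) * (1.0 - d_a0))
        path.append(total)
    return path

rng = random.Random(1)
grid = [i / 50 for i in range(51)]           # partition of [0, 1]
a0 = lambda t: 0.5 * t                       # assumed prior mean A0(t) = t / 2
c = lambda t: 5.0                            # assumed constant precision
paths = [simulate_discrete_beta_process(a0, c, grid, rng) for _ in range(2000)]
mean_at_1 = sum(p[-1] for p in paths) / len(paths)   # Monte Carlo estimate of E A_k(1)
```

Each increment has mean ∆A0(ti), so the Monte Carlo average at t = 1 should be close to A0(1) = 0.5, and every path is nondecreasing.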
• Lo (1993) proposed a construction of beta processes using gamma processes.
– A gamma process γ with base measure α (written γα) is a subordinator such that γ(t) − γ(s) follows a gamma distribution with shape parameter α((s, t]) and scale parameter 1.
– The corresponding Levy measure is given as
$$\nu(dt, dx) = \frac{1}{x} \exp(-x)\,dx\,\alpha(dt).$$
– Let
$$A(t) = \int_0^t \frac{\gamma_\alpha(ds)}{\gamma_\alpha[s, \infty) + \gamma_\beta[s, \infty)}.$$
– Then, A is a beta process.
Posterior distribution (Hjort 1990, Kim 1999)
• Let
– $N(t) = \sum_{i=1}^n I(T_i \le t, \delta_i = 1)$
– $Y(t) = \sum_{i=1}^n I(T_i \ge t)$
where Ti = min(Xi, Ci) are the observed times.
• A priori, A is a beta process with mean A0 and precision c.
• Then, a posteriori, A is a beta process with
$$A_0^p(t) = \int_0^t \frac{c(s)}{c(s) + Y(s)}\,dA_0(s) + \int_0^t \frac{Y(s)}{c(s) + Y(s)}\,d\hat{A}(s)$$
and
$$c^p(t) = c(t) + Y(t),$$
where $\hat{A}(t) = \int_0^t Y(s)^{-1}\,dN(s)$ is the Nelson-Aalen estimator.
• The Bayes estimator is a convex combination of the prior mean and the Nelson-Aalen estimator.
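The posterior mean can be computed exactly when the prior is simple. The sketch below assumes a linear prior mean A0(s) = prior_rate · s and a constant precision c > 0; the data, the parameter values and the function name are illustrative assumptions. It uses the fact that the at-risk count is constant between consecutive observed times and that the Nelson-Aalen part contributes a jump of size ∆N(s)/(c + Y (s)) at each uncensored time.

```python
def beta_process_posterior_mean(t, times, events, prior_rate, c):
    """Posterior mean of A(t) for prior mean A0(s) = prior_rate * s and constant c > 0.

    times: observed times T_i; events: delta_i (1 = uncensored, 0 = censored).
    """
    at_risk = lambda s: sum(1 for u in times if u >= s)           # Y(s)
    # Continuous part: Y is constant on (a, b] between consecutive observed times.
    cuts = sorted({u for u in times if u < t} | {0.0, t})
    cont = sum(c / (c + at_risk(b)) * prior_rate * (b - a)
               for a, b in zip(cuts[:-1], cuts[1:]))
    # Jump part: dN(s) / (c + Y(s)) at the uncensored times s <= t.
    jumps = sum(d / (c + at_risk(u)) for u, d in zip(times, events) if u <= t)
    return cont + jumps

times = [1.0, 2.0, 3.0, 4.0]                 # made-up observed times
events = [1, 0, 1, 1]                        # the second observation is censored
near_nelson_aalen = beta_process_posterior_mean(4.0, times, events, 0.5, 1e-9)
near_prior = beta_process_posterior_mean(4.0, times, events, 0.5, 1e9)
in_between = beta_process_posterior_mean(4.0, times, events, 0.5, 2.0)
```

For these data the Nelson-Aalen estimate at t = 4 is 1/4 + 1/2 + 1 = 1.75 and the prior mean is 2; a tiny c reproduces the former, a huge c the latter, and c = 2 gives a value between them.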
Relation with NTR processes
• If A is a beta process, then F is an NTR process.
• If F is an NTR process, then A is a subordinator.
• A priori, A is a subordinator with Levy measure
ν(dt, dx) = ft(x)dxdt.
• (Kim 1999) Then, a posteriori, A is again a subordinator with Levy measure
$$\nu^p(dt, dx) = x^{\Delta N(t)} (1 - x)^{Y(t) - \Delta N(t)} f_t(x)\,dx\,dt.$$
• For the beta process,
$$\nu^p(dt, dx) = c(t)\, x^{\Delta N(t) - 1} (1 - x)^{Y(t) - \Delta N(t) + c(t) - 1}\,dx\,dA_0(t).$$
5. Bayesian analysis of the proportional hazards model
The proportional hazards model
• Data: (X1, Z1), . . . , (Xn, Zn).
• Distribution
$$X_i \mid Z_i \sim F(t \mid Z_i) = 1 - (1 - F(t))^{\exp(Z_i'\beta)}$$
or
$$A(t \mid Z_i) = \exp(Z_i'\beta) A(t),$$
where A(t|z) is the cumulative hazard function of a patient whose covariate is z.
• Parameters: β and F (or A)
Left truncated right censored data
• In survival analysis, typically we do not observe the data completely.
• The most common reason for partial observation is right censoring, in which the event is observed only if it occurs prior to some prespecified time (censoring time).
• Along with right censoring, data can be subject to left truncation.
• Left truncation occurs when a subject is observed by the investigator only when the survival time is greater than a predetermined time (truncation time).
• Under left truncation, the investigator is not aware of the subjects whose survival times are less than the truncation times (unlike left censored data).
Example of left truncation
• Survival times of individuals in a retirement center
• The observations are subject to left truncation since individuals must survive to a sufficient age to enter the retirement community.
• Individuals who die at an early age are excluded from the study.
Observations
• Let (W1, C1), . . . , (Wn, Cn) be the truncation and censoring times of n individuals.
• Let Ti = min{Ci, Xi} and δi = I(Xi ≤ Ci).
• We observe (Ti, δi) only when Ti ≥ Wi.
• Observations: (T1, δ1,W1, Z1), . . . , (Tn, δn,Wn, Zn).
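The observation scheme can be mimicked in a small simulation. The sketch below generates survival, truncation and censoring times and keeps a subject only when Ti ≥ Wi; the exponential and uniform distributions, the omission of covariates and the function name are illustrative assumptions.

```python
import random

def simulate_ltrc(n_subjects, rng):
    """Generate left truncated, right censored observations (T, delta, W).

    The distributions below are assumed for illustration; covariates Z are omitted.
    """
    kept = []
    for _ in range(n_subjects):
        x = rng.expovariate(1.0)             # survival time X
        w = rng.uniform(0.0, 1.0)            # truncation (entry) time W
        c = w + rng.expovariate(0.5)         # censoring time C, after entry
        t, delta = min(x, c), int(x <= c)
        if t >= w:                           # the subject is observed only if T >= W
            kept.append((t, delta, w))
    return kept

rng = random.Random(2)
sample = simulate_ltrc(500, rng)
```

Subjects who fail before their truncation time never enter the sample, so the number of observations is random and typically smaller than the number of simulated subjects.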
Prior
• β ∼ π(β)
• F ∼ a process neutral to the right with Levy measure ν.
• That is, the corresponding chf A is a subordinator with Levy measure ν.
Literature review
• Kalbfleisch (1978) uses gamma process priors. The formula for the posterior distribution of β is very messy, in particular when there are ties among uncensored observations.
• Hjort (1990) uses beta process priors. The posterior distribution of β is available only when there are no ties among uncensored observations.
• Laud, Damien and Smith (1998) develop an efficient MCMC algorithm with beta process priors. The derivation of the posterior distribution of F given β and Dn is not available.
• Kim and Lee (2003) provided complete answers.
Notations
• Suppose a priori A is a subordinator with Levy measure ν given by
ν(dt, dx) = ft(x)dxdt
• Define
Dn(t) = {i : Ti = t, δi = 1, i = 1, . . . , n},
Rn(t) = {i : Wi < t ≤ Ti, i = 1, . . . , n}
and $R_n^+(t) = R_n(t) - D_n(t)$.
• Dn = ((T1, δ1,W1, Z1), . . . , (Tn, δn,Wn, Zn)).
• qn : the number of distinct uncensored observations
• t1 < t2 < · · · < tqn : the ordered distinct uncensored observations.
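These sets are simple to compute from the data. The sketch below uses made-up observations stored as (T, δ, W) tuples with covariates omitted; note that Python indices start at 0 whereas the slides index subjects from 1. The function names are hypothetical.

```python
def death_set(t, obs):
    """D_n(t): indices i with T_i = t and delta_i = 1."""
    return {i for i, (T, d, W) in enumerate(obs) if T == t and d == 1}

def risk_set(t, obs):
    """R_n(t): indices i with W_i < t <= T_i."""
    return {i for i, (T, d, W) in enumerate(obs) if W < t <= T}

# Illustrative data: (T_i, delta_i, W_i)
obs = [(2.0, 1, 0.0), (2.0, 1, 0.5), (3.0, 0, 1.0), (5.0, 1, 2.5), (4.0, 1, 0.0)]
event_times = sorted({T for T, d, W in obs if d == 1})   # t_1 < ... < t_{q_n}
d_first = death_set(event_times[0], obs)                 # tied deaths at t_1
r_first = risk_set(event_times[0], obs)
r_plus_first = r_first - d_first                         # R_n^+(t_1)
```

In this example two subjects die at the first event time, the subject entering at W = 2.5 is not yet at risk, and the remaining two subjects form the set R+n(t1).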
Posterior distribution of F given β and Dn.
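• The posterior distribution of F is a process neutral to the right with Levy measure
$$\nu(dt, dx \mid \beta, D_n) = (1 - x)^{\sum_{j \in R_n(t)} \exp(\beta^T Z_j)} f_t(x)\,dx\,dt + \sum_{i=1}^{q_n} h_i(x \mid \beta)\,dx\,\delta_{t_i}(dt),$$
where
$$h_i(x \mid \beta) = \prod_{j \in D_n(t_i)} \left(1 - (1 - x)^{\exp(\beta^T Z_j)}\right) \times (1 - x)^{\sum_{j \in R_n^+(t_i)} \exp(\beta^T Z_j)} f_{t_i}(x).$$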
Marginal posterior distribution of β given Dn.
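• $$\pi(\beta \mid D_n) \propto e^{-\rho(\beta)} \prod_{i=1}^{q_n} \int_0^1 h_i(x \mid \beta)\,dx\ \pi(\beta),$$
where
– $$\rho(\beta) = \sum_{i=1}^{n} \int_{W_i}^{T_i} \int_0^1 \left(1 - (1 - x)^{\exp(\beta^T Z_i)}\right) \times (1 - x)^{\sum_{j=1}^{i-1} Y_j(t) \exp(\beta^T Z_j)} f_t(x)\,dx\,dt$$
– Yj(t) = I(Wj < t ≤ Tj) for j = 1, . . . , n.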
Propriety with uniform prior
• Conical hull
– For a set of vectors {x1, · · · , xm} in Rp, a conical (nonnegative linear) combination is a point $x = \sum_{j=1}^{m} \lambda_j x_j$, where λj ≥ 0 for j = 1, · · · , m.
– For a set A in Rp, the conical hull of A is the collection of all conical combinations of vectors from A, or
$$\mathrm{coni}(A) = \left\{ \sum_{j=1}^{m} \lambda_j x_j : x_j \in A,\ \lambda_j \ge 0 \text{ and } m \text{ is a positive integer} \right\}.$$
• Assumptions
– A1. There exists a positive number ς1 such that
$$\sup_{t \in [0, \tau],\, x \in [0, 1]} x f_t(x) (1 - x)^{1 - \varsigma_1} \ (= M_1) < \infty$$
where τ = max{T1, . . . , Tn}.
– A2. There exist positive constants M2 and ς2 and a positive function a0(t) continuous on (0, τ ) such that
$$x f_t(x) \ge M_2 (1 - x)^{\varsigma_2 - 1} a_0(t)$$
for all x ∈ [0, 1] and t ≤ τ .
• Propriety
– If
$$\mathrm{coni}(\{Z_j - Z_k : i = 1, \ldots, q_n,\ j \in D_n(t_i),\ k \in R_n^+(t_i)\}) = \mathbb{R}^p, \qquad (1)$$
then the posterior distribution with the constant prior on the regression coefficients is proper under A1 and A2.
– The propriety condition (1) is equivalent to the uniqueness of the maximum likelihood estimator and log-concavity of the partial likelihood (Jacobsen 1989).
Stability of posterior
• Question
– Suppose a priori F is a Dirichlet process with mean F0 and precision parameter α (i.e. the base measure is αF0).
– In practice, when not much prior information is available, one chooses a small value of α.
– Question: How does the value of α affect π(β|Dn)?
• Definition of stability
– The posterior distribution of β, π(β|Dn), is stable with respect to α if π(β|Dn) converges to a certain density function as α → 0.
• Facts learned from the marginal posterior of β
– In some cases, π(β|Dn) is not stable.
– This implies that π(β|Dn) is very sensitive to the choice of α when α is small.
– Hence, α should be chosen carefully.
– We can make π(β|Dn) stable by choosing the values of covariates appropriately. For example, using the centered covariates (Zi − Z̄) may be helpful.
6. Bayesian analysis of event history data
What is “event history data”?
• A subject (patient, mouse, machine, company etc.) can experience several types of events at random time points.
• “Event history data” record the types and times of events the subject has experienced during the follow-up period.
• Examples are
– Survival time with censoring
– Multiple event time data
– Illness-death model
– Competing risks model
Example 1: Censored survival data
• Let X be a survival time and let C be a censoring time.
• An observation is (T, δ) where T = min{X, C} and δ = I(X ≤ C).
• There are two types of events (death and censoring) and only one time point.
Example 2: Recurrent event data
• A subject experiences a certain event repeatedly. An example is a recurrent disease (e.g. diarrhea).
• Data consist of the random time points T1, T2, . . . where Ti is the time of the i-th event.
• There is only one type of event but many time points.
Example 3: Illness-Death model
• A subject changes states at random time points before the absorbing event (death).
[Figure: state diagram with the states healthy, disease and death]
• The data consist of the random time points of state transitions and the corresponding types of transition (health-to-disease, disease-to-health, health-to-death and disease-to-death).
• There are many types of event and many time points.
Example 4: Competing risks model
• A subject experiences one event among various types of events at a random time point.
[Figure: diagram of transitions from survival to cause 1, cause 2, . . . , cause k]
• An example: a mouse dies of thymic lymphoma, reticulum cell sarcoma (both specific types of cancer) or other causes.
• The data consist of the type and time of the event the subject experiences.
• There are many types of event and only one time point.
Time inhomogeneous Markov process
• Multi-state event history data can be modeled by a time inhomogeneous Markov process.
• Let X be a time inhomogeneous Markov process with the state space S = {0, . . . , K}.
• For event history data, X(t) represents the state at time t.
• Examples
– Censored survival data: S = {survival, death, censored} and X(0) = survival.
– Multiple event time data: S = {0, 1, . . .} and X(t) is the number of events experienced up to time t.
– Illness-death model: S = {health, disease, death} and X(0) = health or disease.
– Competing risks model: S = {health, 1, . . . , K} and X(0) = health.
Transition probabilities and Intensity functions
• For a given Markov process X, let
Phj(s, t) = Pr(X(t) = j|X(s) = h)
for h, j ∈ {0, . . . , K}.
• The (K + 1) × (K + 1) matrix function defined by
$$\mathbf{P}(s, t) = [P_{hj}(s, t)]_{h, j \in \{0, \ldots, K\}}$$
is called the “(matrix of) transition probabilities”.
• Let
$$\mathbf{A}(t) = [A_{hj}(t)]_{h, j \in \{0, \ldots, K\}}$$
where Ahj , j ≠ h, is the cumulative intensity function for transitions from state h to state j and $A_{hh}(t) = -\sum_{j \ne h} A_{hj}(t)$.
• Then, we have
$$\mathbf{P}(s, t) = \prod_{s < u \le t} (\mathbf{I} + d\mathbf{A}(u))$$
where I is the (K + 1) × (K + 1) identity matrix.
• We let
Ah = (Ahj , j = 0, . . . , K)
be the (vector of cumulative) intensity functions from state h.
• We want to develop a prior process for Ah, a multivariate version of beta processes.
• Note that there are constraints: for j ≠ h,
$$0 \le \Delta A_{hj}(t) \quad \text{and} \quad \sum_{j \ne h} \Delta A_{hj}(t) \le 1.$$
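When the intensities only jump at finitely many time points, the matrix product integral is an ordinary ordered product of the matrices I + ∆A(u). The sketch below uses a three-state illness-death example (0 = healthy, 1 = disease, 2 = death) with made-up jump sizes that satisfy the constraints above; the function names are hypothetical.

```python
def mat_mul(a, b):
    """Plain matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def increment_matrix(off_diag):
    """Build dA(u) from off-diagonal jumps; each diagonal entry is minus its row sum."""
    n = len(off_diag)
    return [[off_diag[h][j] if j != h
             else -sum(off_diag[h][m] for m in range(n) if m != h)
             for j in range(n)] for h in range(n)]

def product_integral(increments):
    """P(s, t) as the time-ordered product of (I + dA(u)) over the jump times u."""
    n = len(increments[0])
    p = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for d_a in increments:                   # increments must be ordered in time
        step = [[(1.0 if i == j else 0.0) + d_a[i][j] for j in range(n)]
                for i in range(n)]
        p = mat_mul(p, step)
    return p

jumps = [[0.0, 0.1, 0.05], [0.2, 0.0, 0.1], [0.0, 0.0, 0.0]]   # illustrative values
d_a = increment_matrix(jumps)
p_one = product_integral([d_a])              # one jump time
p_three = product_integral([d_a] * 3)        # the same increment at three jump times
```

Because every off-diagonal jump is nonnegative and each row of off-diagonal jumps sums to at most 1, every factor is a stochastic matrix, so the product has rows summing to 1 and death remains absorbing.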
Understanding the intensity functions
• $A_{h\cdot}(t) = \sum_{j \ne h} A_{hj}(t)$: the (cumulative) intensity function for transitions out of state h.
• phj(t) = dAhj(t)/dAh·(t): the (instantaneous) probability of a transition from state h to state j, given that a transition out of h occurs at time t.
Beta-Dirichlet process prior (Kim et al., 2009)
• For given Ah
– Ah·(t) and {phj(t), j 6= h} are independent
– Ah· is a beta process
– {phj(t), j 6= h} follows a Dirichlet distribution for given t.
Levy measure of (stochastically continuous) beta-Dirichlet process
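• A stochastically continuous K-dimensional beta-Dirichlet process A with parameter (A0(·), c(·), {γj(·), j = 1, . . . , K}) is a K-dimensional Levy process with
$$E(\exp(-\langle \theta, A(t) \rangle)) = \exp\left[-\int_0^t \int_{[0,1]^K} \left(1 - e^{-\langle \theta, x \rangle}\right) \nu(ds, dx)\right]$$
where
$$\nu(dt, dx) = d(t)\, x_1^{\gamma_1(t)-1} \cdots x_K^{\gamma_K(t)-1} \left(\sum_{j=1}^{K} x_j\right)^{-\sum_{j=1}^{K} \gamma_j(t)} \left(1 - \sum_{j=1}^{K} x_j\right)^{c(t)-1} dx\, a_0(t)\,dt$$
for xj ≥ 0 and $\sum_{j=1}^{K} x_j \le 1$, where a0(t) = dA0(t)/dt and
$$d(t) = \frac{\Gamma\left(\sum_j \gamma_j(t)\right)}{\Gamma(\gamma_1(t)) \cdots \Gamma(\gamma_K(t))}\, c(t).$$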
Moments
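• Mean
$$E(A_i(t)) = \int_0^t \frac{\gamma_i(s)}{\sum_h \gamma_h(s)}\, a_0(s)\,ds,$$
• Variance
$$\mathrm{Var}(A_i(t)) = \int_0^t \frac{\gamma_i(s)(\gamma_i(s) + 1)}{\left(\sum_h \gamma_h(s)\right)\left(\sum_h \gamma_h(s) + 1\right)} \frac{1}{c(s) + 1}\, a_0(s)\,ds \qquad (2)$$
• Covariance
$$\mathrm{Cov}(A_i(t), A_j(t)) = \int_0^t \frac{\gamma_i(s)\gamma_j(s)}{\left(\sum_h \gamma_h(s)\right)\left(\sum_h \gamma_h(s) + 1\right)} \frac{1}{c(s) + 1}\, a_0(s)\,ds \qquad (3)$$
for i ≠ j.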
General beta-Dirichlet process
• A general beta-Dirichlet process A with parameter (A0(·), c(·), {γj(·), j = 1, . . . , K}) is defined as
A = Ac + Ad
where
– Ac is a stochastically continuous beta-Dirichlet process with parameter (A0c(·), c(·), {γj(·), j = 1, . . . , K}) where A0c(·) is the absolutely continuous part of A0;
– Ad is a time discrete beta-Dirichlet process with parameters α(t) = c(t)∆A0(t), β(t) = c(t)(1 − ∆A0(t)) and αj(t) = γj(t);
– Ac and Ad are independent.
Posterior distribution
• For given A, let X1, . . . , Xn be independent Markov processes on [0, τ ] with the intensity functions A.
• A priori, A0, . . . , AK are independent beta-Dirichlet processes with parameters $(A_h^0(\cdot), c_h(\cdot), \{\gamma_{hj}(\cdot), j \ne h\})$ for h = 0, . . . , K, respectively.
• Let
– $N_{hj}(t) = \sum_{i=1}^{n} \sum_{s \le t} I(X_i(s-) = h, X_i(s) = j)$
– $N_{h\cdot}(t) = \sum_{i=1}^{n} \sum_{s \le t} I(X_i(s-) = h, X_i(s) \ne X_i(s-))$
– $Y_h(t) = \sum_{i=1}^{n} I(X_i(t-) = h)$.
• Then, the posterior distribution of A is again independent beta-Dirichlet processes with parameters $(A_h^{0p}, c_h^p, \{\gamma_{hj}^p, j \ne h\})$, h = 0, . . . , K, where
$$A_h^{0p}(t) = \int_0^t \frac{c_h(s)\,dA_h^0(s) + dN_{h\cdot}(s)}{Y_h(s) + c_h(s)},$$
$$c_h^p(t) = Y_h(t) + c_h(t)$$
and
$$\gamma_{hj}^p(t) = \gamma_{hj}(t) + \Delta N_{hj}(t)$$
for j ≠ h.
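At a single time point the posterior update is simple bookkeeping. The sketch below applies the three update formulas for transitions out of one fixed state h; the function name, the dictionary layout and the numbers are illustrative assumptions.

```python
def beta_dirichlet_update(c, d_a0, gammas, at_risk, transitions):
    """Update the parameters for transitions out of a fixed state h at one time t.

    c, d_a0: prior precision c_h(t) and prior mean increment dA0_h(t).
    gammas: {j: gamma_hj(t)}; at_risk: Y_h(t); transitions: {j: dN_hj(t)}.
    """
    total = sum(transitions.values())                     # dN_h.(t)
    d_a0_post = (c * d_a0 + total) / (at_risk + c)        # increment of the posterior mean
    c_post = at_risk + c
    gammas_post = {j: g + transitions.get(j, 0) for j, g in gammas.items()}
    return d_a0_post, c_post, gammas_post

# Illustrative numbers: 10 subjects in state h just before t, 3 of them move to state 1.
d_a0_post, c_post, gammas_post = beta_dirichlet_update(
    c=2.0, d_a0=0.1, gammas={1: 1.0, 2: 1.0}, at_risk=10, transitions={1: 3})
```

The posterior mean increment (0.2 + 3)/12 is pulled from the prior value 0.1 toward the empirical proportion 3/10, and only the Dirichlet parameter of the observed transition type increases.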