Bayesian Survival Analysis: Model, Prior and Posterior - Eurandom

Bayesian Survival Analysis: Model, Prior and Posterior

Yongdai Kim

Seoul National University, Korea

2010.11.08

0. Contents

1. Model

2. Prior 1: Dirichlet process

3. Prior 2: Neutral to right process

4. Prior 3: Beta process

5. The proportional hazards model

6. Event history data


1. Model

• Survival times: X1, . . . , Xn ∼ F .

• Right censoring times: C1, . . . , Cn ∼ G.

• Let Ti = min(Xi, Ci) and δi = I(Xi ≤ Ci), i = 1, . . . , n.

• We observe (T1, δ1), . . . , (Tn, δn) and wish to estimate F .

• Parameter space:

F = { all probability distributions on R+}.
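
The observation scheme above can be illustrated with a small simulation. This is only a sketch: the exponential choices for F and G, the rate values and the function name are assumptions made for illustration and are not part of the talk.

```python
import random

def simulate_right_censored(n, event_rate=1.0, censor_rate=0.5, seed=0):
    """Draw (T_i, delta_i) pairs.

    Assumption for illustration: F = Exp(event_rate), G = Exp(censor_rate).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.expovariate(event_rate)    # latent survival time X_i
        c = rng.expovariate(censor_rate)   # latent censoring time C_i
        t = min(x, c)                      # observed time T_i = min(X_i, C_i)
        delta = 1 if x <= c else 0         # delta_i = I(X_i <= C_i)
        data.append((t, delta))
    return data

sample = simulate_right_censored(5)
for t, delta in sample:
    print(f"T = {t:.3f}, delta = {delta}")
```

Only the pairs (T, delta) would be available to the analyst; the latent X and C are discarded on purpose.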


Bayesian Analysis in a Nutshell

• The prior is the probability measure on the parameter space Θ which reflects the statistician’s knowledge about θ before he/she sees the data.

θ ∼ π(θ).

• Model: X|θ ∼ f(x|θ).

• The posterior is the conditional distribution of θ given the data X and reflects the knowledge after he/she sees the data.

θ|X ∼ π(θ|X).

• For Bayesian analysis of the survival model, we need a tractable and rich class of prior distributions on the parameter space

F = { all probability distributions on R+}.


2. Prior 1: Dirichlet process (Ferguson, 1973)

• We put prior measures on F by defining prior distributions of (F(A1), . . . , F(Ak)) for all possible finite partitions A1, . . . , Ak.

• A natural conjugate prior for the partition probabilities is a Dirichlet distribution, which leads to the following definition of Dirichlet processes.

• Definition of Dirichlet process

– Let α be a positive finite measure on R. The random c.d.f. F has a Dirichlet process prior with parameter α, denoted by DP(α), if for every finite partition A1, . . . , Ak of R,

(F(A1), . . . , F(Ak)) ∼ Dirichlet(α(A1), . . . , α(Ak)).


Properties of Dirichlet process prior

• The support of a Dirichlet process (with respect to the weak topology) is F.

• Without censoring, it is conjugate. In fact, the posterior distribution is a Dirichlet process with the base measure αp where

$$\alpha_p(\cdot) = \alpha(\cdot) + \sum_{i=1}^{n} \delta_{X_i}(\cdot).$$

• With right censored data, it is not conjugate, and the posterior had been a mystery until Hjort (1990).
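
As a small illustration of the conjugate update without censoring, the posterior mean of F(t) is αp([0, t])/αp(R), because each coordinate of a Dirichlet vector has mean equal to its parameter divided by the total mass. The sketch below assumes the base measure is written as α = M · F0 with a total mass M and a prior guess F0. The Exp(1) choice for F0, the data values and the function names are hypothetical.

```python
import math

def dp_posterior_mean_cdf(t, data, total_mass, prior_cdf):
    """Posterior mean of F(t) under DP(alpha) with alpha = total_mass * F0.

    Uses alpha_p = alpha + sum_i delta_{X_i}; valid for uncensored data only.
    """
    n_le_t = sum(1 for x in data if x <= t)          # sum_i delta_{X_i}([0, t])
    alpha_p_t = total_mass * prior_cdf(t) + n_le_t   # alpha_p([0, t])
    alpha_p_total = total_mass + len(data)           # alpha_p(R)
    return alpha_p_t / alpha_p_total

def exp_cdf(t):
    """Hypothetical prior guess F0: the Exp(1) distribution function."""
    return 1.0 - math.exp(-t) if t > 0 else 0.0

data = [0.5, 1.2, 2.0, 3.1]   # made-up uncensored survival times
estimate = dp_posterior_mean_cdf(1.0, data, total_mass=2.0, prior_cdf=exp_cdf)
print(estimate)
```

With total_mass close to 0 the estimate approaches the empirical distribution function, and with a large total_mass it approaches F0.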


3. Prior 2: Neutral to right process prior (Doksum, 1974)

• Consider the conditional distribution function

$$F(t \mid s) = \Pr(X_i \le t \mid X_i \ge s) = \frac{F(t) - F(s)}{1 - F(s)}.$$

• Let 0 = t0 < t1 < · · · < tk < tk+1 = ∞ be a finite ordered partition of R+.

• A neutral to right (NTR) process is a random probability measure defined on R+ such that

F(t1|t0), F(t2|t1), . . . , F(tk+1|tk)

are independent.


Properties of the NTR process prior

• Dirichlet processes are NTR processes.

• The class of NTR processes is conjugate with right censored data (Ferguson and Phadia, 1979).

• Let Y(t) = − log(1 − F(t)). Then, F is an NTR process if and only if Y(t) has independent increments.

• The class of NTR processes is large enough to exhibit all kinds of pathological as well as well-behaved examples (we will see some examples in tomorrow's talk).

• The class of NTR processes may be too large for conjugacy to have practical meaning (e.g. the class of all distributions is conjugate!).


4. Prior 3: Beta process (Hjort 1990)

Cumulative hazard function

• A cumulative hazard function (chf) A of a distribution F is defined as

$$A(t) = \int_0^t \frac{dF(s)}{1 - F(s-)}.$$

• If F is continuous,

– $$A(t) = \int_0^t \frac{f(s)}{1 - F(s-)}\, ds.$$

– a(t) = dA(t)/dt is called the hazard function.

• The chf A is roughly

$$A(t) = \int_0^t P(X = s \mid X \ge s)\, ds.$$


Properties of the chf

• A is nondecreasing with A(0) = 0 and $\lim_{t \to \infty} A(t) = \infty$.

• ∆A(t) ∈ [0, 1]

• Product integration

$$1 - F(t) = \prod_{s \le t} (1 - dA(s))$$

• There is a one-to-one correspondence between the distribution function and the chf.
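
For a purely discrete chf the product integral is an ordinary product over the jump points, which makes the one-to-one correspondence easy to check numerically. The sketch below assumes jumps ∆A(t1), ∆A(t2), . . . that are all strictly less than 1. The function names and the jump values are illustrative.

```python
def survival_from_hazard_jumps(jumps):
    """S(t_i) = prod_{s <= t_i} (1 - dA(s)) for jumps dA(t_1), dA(t_2), ..."""
    surv, s = [], 1.0
    for d in jumps:
        s *= 1.0 - d
        surv.append(s)
    return surv

def hazard_jumps_from_survival(surv):
    """Inverse map: dA(t_i) = 1 - S(t_i) / S(t_{i-1}), with S(t_0) = 1.

    Assumes every S(t_{i-1}) > 0, i.e. no earlier jump of size 1.
    """
    jumps, prev = [], 1.0
    for s in surv:
        jumps.append(1.0 - s / prev)
        prev = s
    return jumps

jumps = [0.1, 0.2, 0.5]                      # made-up hazard jumps in [0, 1)
surv = survival_from_hazard_jumps(jumps)     # survival function 1 - F at the jump times
recovered = hazard_jumps_from_survival(surv)
print(surv)
print(recovered)
```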


Nondecreasing process with independent increments: Subordinator

• A stochastic process A on R+ has independent increments (i.e. is a Levy process) if A(t) − A(s), s ≤ t, is independent of σ(A(u), u ≤ s).

• By the famous Levy representation, any Levy process A can be decomposed roughly as

A(t) = D(t) + B(t) + C(t) where

– D(t): deterministic function

– B(t): Brownian motion

– C(t): limit of compound Poisson processes

• Assume that D(t) = 0. For a given process to be a subordinator, B(t) should be 0.

• A subordinator = limit of compound Poisson processes


Compound Poisson process

• Let N(t) be a Poisson process with mean $\Lambda(t) = \int_0^t \lambda(s)\, ds$.

• Let Zt, t ∈ R+, be independent random variables with density ht.

• A compound Poisson process A is given as

$$A(t) = \sum_{s \le t} Z_s\, I(\Delta N(s) = 1).$$

• For a given compound Poisson process to be nondecreasing, the support of ht should be contained in R+.

• Any subordinator can be approximated by letting Λ become large and Zt become small.
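
A nondecreasing compound Poisson path can be simulated directly from this definition. The sketch below makes simplifying assumptions that are not in the talk: a constant intensity λ(s) = rate, and exponential jump sizes with the same density ht at every t. All names and parameter values are illustrative.

```python
import random

def simulate_compound_poisson(horizon, rate, jump_mean, seed=0):
    """Jump times and sizes of a compound Poisson process on [0, horizon].

    Assumptions: constant intensity `rate`; jump sizes Z ~ Exp(mean = jump_mean),
    which are supported on R+, so the path is nondecreasing.
    """
    rng = random.Random(seed)
    times, jumps = [], []
    t = rng.expovariate(rate)                 # first jump time of N
    while t <= horizon:
        times.append(t)
        jumps.append(rng.expovariate(1.0 / jump_mean))
        t += rng.expovariate(rate)            # exponential waiting time to next jump
    return times, jumps

def path_value(t, times, jumps):
    """A(t) = sum of the jump sizes Z_s over jump times s <= t."""
    return sum(z for s, z in zip(times, jumps) if s <= t)

times, jumps = simulate_compound_poisson(horizon=10.0, rate=2.0, jump_mean=0.1)
print(len(times), path_value(10.0, times, jumps))
```

Increasing `rate` while decreasing `jump_mean` gives many small jumps, which is the regime used to approximate a general subordinator.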


Levy measure of a subordinator

• By the famous Levy representation, for a subordinator A there exists a measure ν on R+ × R+ such that

$$E(\exp(-uA(t))) = \exp\Big[ -\int_0^t \int_{\mathbb{R}^+} (1 - \exp(-ux))\, \nu(ds, dx) \Big]$$

provided E(exp(−uA(t))) exists.

• If there exist functions ft(x) ≥ 0 such that

– $$\nu([0, t] \times B) = \int_0^t \int_B f_s(x)\, dx\, ds$$

– $$\int_0^t \int_0^1 x f_s(x)\, dx\, ds < \infty \quad \text{for all } t,$$

then there exists a unique subordinator whose Levy measure is ν (see, for example, Kim 1999).


• For a compound Poisson process,

ft(x) = ht(x)λ(t).

• That is,

– $\lambda(t) = \int_0^\infty f_t(x)\, dx$

– ht(x) = ft(x)/λ(t)

provided λ(t) < ∞.

• For a general subordinator with $\int_0^\infty f_t(x)\, dx = \infty$, we can approximate it by a sequence of compound Poisson processes.

• Hence, we can think of ft(x) as unnormalized density functions of jump sizes.


Beta process prior

• The key idea of beta process priors is to put prior measures on the space of chfs rather than the space of distribution functions.

• That is, a priori we let A be a subordinator.

• The beta process is a special form of subordinator which is conjugate with right censored data.

• Note that ∆A(t) ∈ [0, 1], and so a natural conjugate prior for ∆A(t) is a beta distribution.

• The beta process lets ∆A(t) ∼ Beta(0, c(t)) (incomplete beta distribution) for some nonnegative function c(t).


• A priori, we let E(A(t)) = A0(t).

• Then, the beta process is a subordinator whose corresponding Levy measure is

$$\nu(dt, dx) = \frac{c(t)}{x} (1 - x)^{c(t) - 1}\, I(0 \le x \le 1)\, dx\, dA_0(t)$$

provided A0(t) is absolutely continuous.


• Suppose A0(t) = A0c(t) + A0d(t) where A0c is absolutely continuous and $A_{0d}(t) = \sum_{s \le t} \Delta A_0(s)$.

• The beta process A with parameter A0 and c is decomposed by

A(t) = Ac(t) + Ad(t) where

– Ac is a beta process with parameter A0c and c.

– $A_d(t) = \sum_{s \le t} P_s$ where the Ps are independent and

$$P_s \sim \mathrm{Beta}\big(c(s)\Delta A_0(s),\ c(s)(1 - \Delta A_0(s))\big).$$

– Ac and Ad are independent.


Properties of beta process

• Let A be a beta process with parameter (A0, c).

• E(A(t)) = A0(t)

• $$\mathrm{Var}(A(t)) = \int_0^t \frac{dA_0(s)(1 - dA_0(s))}{1 + c(s)}$$

• Note that the variance decreases as c increases (the integrand is inversely proportional to 1 + c), so c is called the precision parameter.

• If F is a Dirichlet process with base measure α, then A is a beta process with

– A0 is the chf of F0(t) = α([0, t])/α(R+)

– c(t) = α(R+)(1 − F0(t)).


Different Construction of beta processes

• The construction of beta processes via a subordinator with incomplete beta distribution jumps was introduced by Kim (1999).

• The original construction of beta processes by Hjort (1990) is the limit of time discrete beta processes.

– Let 0 = t0 < t1 < · · · < tk < ∞.

– Given A0 and c, let

∗ $\Delta A_0(t_i) = A_0(t_i) - A_0(t_{i-1})$.

∗ $\Delta A_k(t_i) \sim \mathrm{Beta}\big(c(t_i)\Delta A_0(t_i),\ c(t_i)(1 - \Delta A_0(t_i))\big)$.

∗ $A_k(t) = \sum_{t_i \le t} \Delta A_k(t_i)$.

– As max |ti − ti−1| → 0, Hjort (1990) proved that Ak(t) converges to a beta process.
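
The time discrete construction is straightforward to simulate with independent beta increments. The following is a minimal sketch under stated assumptions: the prior mean A0(t) = 0.5 t, the constant precision c = 2, the evaluation of c at the right endpoint of each interval, and all function names are illustrative choices, not taken from Hjort (1990).

```python
import random

def prior_chf(t):
    """Hypothetical prior mean A0(t) = 0.5 * t."""
    return 0.5 * t

def precision(t):
    """Hypothetical constant precision function c(t) = 2."""
    return 2.0

def simulate_discrete_beta_process(grid, seed=0):
    """One path of A_k on the grid t_0 < t_1 < ... < t_k (values at t_1..t_k).

    Each increment is Beta(c * dA0, c * (1 - dA0)); requires 0 < dA0 < 1.
    """
    rng = random.Random(seed)
    path, total = [], 0.0
    for left, right in zip(grid[:-1], grid[1:]):
        d0 = prior_chf(right) - prior_chf(left)   # prior mean of the increment
        c = precision(right)                      # assumption: c at right endpoint
        total += rng.betavariate(c * d0, c * (1.0 - d0))
        path.append(total)
    return path

grid = [0.1 * i for i in range(11)]   # 0.0, 0.1, ..., 1.0
path = simulate_discrete_beta_process(grid)
print(path[-1])
```

Each increment has mean ∆A0(ti), so the simulated path has prior mean A0 at the grid points.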


• Lo (1993) proposed a beta process using gamma processes.

– A gamma process γ with base measure α (write γα) is a subordinator such that γ(t) − γ(s) follows a gamma distribution with shape parameter α((s, t]) and scale parameter 1.

– The corresponding Levy measure is given as

$$\nu(dt, dx) = \frac{1}{x} \exp(-x)\, dx\, \alpha(dt).$$

– Let

$$A(t) = \int_0^t \frac{\gamma_\alpha(ds)}{\gamma_\alpha[s, \infty) + \gamma_\beta[s, \infty)}.$$

– Then, A is a beta process.


Posterior distribution (Hjort 1990, Kim 1999)

• Let

– $N(t) = \sum_{i=1}^{n} I(T_i \le t,\ \delta_i = 1)$

– $Y(t) = \sum_{i=1}^{n} I(T_i \ge t)$

• A priori, A is a beta process with mean A0 and precision c.

• Then, a posteriori, A is a beta process with

$$A_0^p(t) = \int_0^t \frac{c(s)}{c(s) + Y(s)}\, dA_0(s) + \int_0^t \frac{Y(s)}{c(s) + Y(s)}\, d\hat{A}(s)$$

and

$$c^p(t) = c(t) + Y(t)$$

where $\hat{A}(t) = \int_0^t 1/Y(s)\, dN(s)$ is the Nelson-Aalen estimator.

• The Bayes estimator is a convex combination of the prior mean and the Nelson-Aalen estimator.
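
The posterior mean can be evaluated exactly when Y is a step function. The sketch below assumes a prior mean A0(t) = prior_rate · t and a constant precision c. Both are illustrative simplifications, as are the function names and the toy data. The second integral reduces to a sum of ∆N(s)/(c + Y(s)) over the uncensored times.

```python
def at_risk(s, data):
    """Y(s) = number of subjects with observed time T_i >= s."""
    return sum(1 for t_i, _ in data if t_i >= s)

def events_at(s, data):
    """Delta N(s) = number of uncensored observations at time s."""
    return sum(1 for t_i, d in data if t_i == s and d == 1)

def nelson_aalen(t, data):
    """Nelson-Aalen estimator: sum over event times s <= t of dN(s) / Y(s)."""
    event_times = sorted({t_i for t_i, d in data if d == 1 and t_i <= t})
    return sum(events_at(s, data) / at_risk(s, data) for s in event_times)

def beta_process_posterior_mean(t, data, prior_rate, c):
    """Posterior mean of A(t), assuming A0(t) = prior_rate * t and constant c."""
    # Prior part: Y is constant between observed times, so integrate piecewise.
    cuts = sorted({t_i for t_i, _ in data if t_i < t}) + [t]
    total, left = 0.0, 0.0
    for right in cuts:
        total += c * prior_rate * (right - left) / (c + at_risk(right, data))
        left = right
    # Data part: Y/(c+Y) * dN/Y = dN/(c+Y) at each uncensored time.
    for s in sorted({t_i for t_i, d in data if d == 1 and t_i <= t}):
        total += events_at(s, data) / (c + at_risk(s, data))
    return total

data = [(1.0, 1), (2.0, 0), (3.0, 1), (4.0, 1)]   # made-up (T_i, delta_i) pairs
print(nelson_aalen(3.5, data))
print(beta_process_posterior_mean(3.5, data, prior_rate=0.2, c=2.0))
```

A very small c gives essentially the Nelson-Aalen estimator, and a very large c gives essentially the prior mean.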


Relation with NTR processes

• If A is a beta process, then F is a NTR process.

• If F is a NTR process, then A is a subordinator.

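• A priori, A is a subordinator with Levy measure

ν(dt, dx) = ft(x)dxdt.

• (Kim 1999) Then, a posteriori, A is again a subordinator with Levy measure

$$\nu^p(dt, dx) = x^{\Delta N(t)} (1 - x)^{Y(t) - \Delta N(t)} f_t(x)\, dx\, dt.$$

• For the beta process, where $f_t(x)\, dt = c(t) x^{-1} (1 - x)^{c(t) - 1}\, dA_0(t)$,

$$\nu^p(dt, dx) = c(t)\, x^{\Delta N(t) - 1} (1 - x)^{Y(t) - \Delta N(t) + c(t) - 1}\, dx\, dA_0(t).$$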


5. Bayesian analysis of the proportional hazards model

The proportional hazards model

• Data: (X1, Z1), . . . , (Xn, Zn).

• Distribution

$$X_i \mid Z_i \sim F(t \mid Z_i) = 1 - (1 - F(t))^{\exp(Z_i'\beta)}$$

or

$$A(t \mid Z_i) = \exp(Z_i'\beta)\, A(t)$$

where A(t|z) is the cumulative hazard function of the patient whose covariate is z.

• Parameters: β and F (or A)
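
The first form of the model is easy to evaluate once a baseline value F(t) is given. The sketch below uses made-up values for F(t), the covariate z and β, with a hypothetical function name. It also checks numerically that −log(1 − F(t|z)) equals exp(z′β) times −log(1 − F(t)), which is how the two forms agree when F is continuous.

```python
import math

def ph_conditional_cdf(baseline_cdf_value, z, beta):
    """F(t | z) = 1 - (1 - F(t)) ** exp(z' beta) for a given baseline value F(t)."""
    risk = math.exp(sum(zk * bk for zk, bk in zip(z, beta)))   # exp(z' beta)
    return 1.0 - (1.0 - baseline_cdf_value) ** risk

f_t = 0.3                       # made-up baseline value F(t)
z, beta = [1.0, 0.5], [0.7, -0.4]
f_t_given_z = ph_conditional_cdf(f_t, z, beta)

# On the -log(1 - F) scale the covariate acts multiplicatively.
lhs = -math.log(1.0 - f_t_given_z)
rhs = math.exp(0.7 * 1.0 - 0.4 * 0.5) * (-math.log(1.0 - f_t))
print(f_t_given_z, lhs, rhs)
```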


Left truncated right censored data

• In survival analysis, we typically do not observe the data completely.

• The most common reason for partial observation is right censoring, in which the event is observed only if it occurs prior to some prespecified time (censoring time).

• Along with right censoring, data can be subject to left truncation.

• Left truncation occurs when a subject is observed by the investigator only when the survival time is greater than a predetermined time (truncation time).

• In left truncation, the investigator will not be aware of the subjects whose survival times are less than the prespecified times (different from left censored data).


Example of left truncation

• Survival times of individuals in a retirement center

• The observations are subject to left truncation since individuals must survive to a sufficient age to enter the retirement community.

• Individuals who die at an early age are excluded from the study.


Observations

• Let (W1, C1), . . . , (Wn, Cn) be the truncation and censoring times of n individuals.

• Let Ti = min{Ci, Xi} and δi = I(Xi ≤ Ci).

• We observe (Ti, δi) only when Ti ≥ Wi.

• Observations: (T1, δ1,W1, Z1), . . . , (Tn, δn,Wn, Zn).


Prior

• β ∼ π(β)

• F ∼ a process neutral to right with Levy measure ν.

• That is, the corresponding chf A is a subordinator with Levy measure ν.


Literature review

• Kalbfleisch (1978) uses gamma process priors. The formula of the posterior distribution of β is very messy, in particular when there are ties among uncensored observations.

• Hjort (1990) uses beta process priors. The posterior distribution of β is available only when no tie exists among uncensored observations.

• Laud, Damien and Smith (1998) develop an efficient MCMC algorithm with beta process priors. The derivation of the posterior distribution of F given β and Dn is not available.

• Kim and Lee (2003) provided complete answers.


Notations

• Suppose a priori A is a subordinator with Levy measure ν given by

ν(dt, dx) = ft(x)dxdt

• Define

$$D_n(t) = \{i : T_i = t,\ \delta_i = 1,\ i = 1, \dots, n\}$$

$$R_n(t) = \{i : W_i < t \le T_i,\ i = 1, \dots, n\}$$

and $R_n^+(t) = R_n(t) - D_n(t)$.

• Dn = ((T1, δ1,W1, Z1), . . . , (Tn, δn,Wn, Zn)).

• qn : the number of distinct uncensored observations

• t1 < t2 < · · · < tqn : the ordered distinct uncensored observations.
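
These sets can be computed directly from the observations. In the sketch below each observation is stored as a tuple (T, δ, W), covariates are omitted, and indices are 0-based rather than 1-based. The data values and function names are made up for illustration.

```python
def death_set(t, data):
    """D_n(t): indices with T_i = t and delta_i = 1."""
    return [i for i, (t_i, d, w) in enumerate(data) if t_i == t and d == 1]

def risk_set(t, data):
    """R_n(t): indices with W_i < t <= T_i (entered before t, still under observation)."""
    return [i for i, (t_i, d, w) in enumerate(data) if w < t <= t_i]

def risk_set_minus_deaths(t, data):
    """R_n^+(t) = R_n(t) - D_n(t)."""
    deaths = set(death_set(t, data))
    return [i for i in risk_set(t, data) if i not in deaths]

def distinct_event_times(data):
    """t_1 < ... < t_{q_n}: ordered distinct uncensored observation times."""
    return sorted({t_i for t_i, d, w in data if d == 1})

# Made-up observations (T_i, delta_i, W_i).
data = [(2.0, 1, 0.0), (2.0, 1, 0.5), (3.0, 0, 1.0), (5.0, 1, 2.5)]
print(distinct_event_times(data))
print(death_set(2.0, data), risk_set(2.0, data), risk_set_minus_deaths(2.0, data))
```

The last subject has W = 2.5, so it is not yet in the risk set at t = 2.0 because of left truncation.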


Posterior distribution of F given β and Dn.

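• The posterior distribution of F is a process neutral to the right with Levy measure

$$\nu(dt, dx \mid \beta, D_n) = (1 - x)^{\sum_{j \in R_n(t)} \exp(\beta^T Z_j)} f_t(x)\, dx\, dt + \sum_{i=1}^{q_n} h_i(x \mid \beta)\, dx\, \delta_{t_i}(dt),$$

where

$$h_i(x \mid \beta) = \prod_{j \in D_n(t_i)} \Big(1 - (1 - x)^{\exp(\beta^T Z_j)}\Big) \times (1 - x)^{\sum_{j \in R_n^+(t_i)} \exp(\beta^T Z_j)} f_{t_i}(x).$$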


Marginal posterior distribution of β given Dn.

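• $$\pi(\beta \mid D_n) \propto e^{-\rho(\beta)} \prod_{i=1}^{q_n} \int_0^1 h_i(x \mid \beta)\, dx\ \pi(\beta),$$

where

– $$\rho(\beta) = \sum_{i=1}^{n} \int_{W_i}^{T_i} \int_0^1 \Big(1 - (1 - x)^{\exp(\beta^T Z_i)}\Big) \times (1 - x)^{\sum_{j=1}^{i-1} Y_j(t) \exp(\beta^T Z_j)} f_t(x)\, dx\, dt$$

– Yj(t) = I(Wj < t ≤ Tj) for j = 1, . . . , n.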


Propriety with uniform prior

• Conical hull

– For a set of vectors {x1, · · · , xm} in R^p, their conical (nonnegative linear) combination is represented as the point $x = \sum_{j=1}^{m} \lambda_j x_j$, where λj ≥ 0 for j = 1, · · · , m.

– For a set A in R^p, the conical hull of A is the collection of all conical combinations of vectors from A, or

$$\mathrm{coni}(A) = \Big\{ \sum_{j=1}^{m} \lambda_j x_j : x_j \in A,\ \lambda_j \ge 0 \text{ and } m \text{ is a positive integer} \Big\}.$$


• Assumptions

– A1. There exists a positive number ς1 such that

$$\sup_{t \in [0, \tau],\ x \in [0, 1]} x f_t(x) (1 - x)^{1 - \varsigma_1}\ (= M_1) < \infty$$

where τ = max{T1, . . . , Tn}.

– A2. There exist positive constants M2 and ς2 and a positive function a0(t) continuous on (0, τ) such that

$$x f_t(x) \ge M_2 (1 - x)^{\varsigma_2 - 1} a_0(t)$$

for all x ∈ [0, 1] and t ≤ τ.


• Propriety

– If

$$\mathrm{coni}(\{Z_j - Z_k : i = 1, \dots, q_n,\ j \in D_n(t_i),\ k \in R_n^+(t_i)\}) = \mathbb{R}^p, \qquad (1)$$

then the posterior distribution with the constant prior on the regression coefficients is proper under A1 and A2.

– The propriety condition (1) is equivalent to the uniqueness of the maximum likelihood estimator and log-concavity of the partial likelihood (Jacobsen 1989).


Stability of posterior

• Question

– Suppose a priori F is a Dirichlet process with mean F0 and precision parameter α (i.e. the base measure is αF0).

– In practice, when not much prior information is available, one chooses a small value of α.

– Question: How does the value of α affect π(β|Dn)?


• Definition of stability

– The posterior distribution of β, π(β|Dn), is stable with respect to α if π(β|Dn) converges to a certain density function as α → 0.


• Facts learned from the marginal posterior of β

– In some cases, π(β|Dn) is not stable.

– This implies that π(β|Dn) is very sensitive to the choice of α when α is small.

– Hence, α should be chosen carefully.

– We can make π(β|Dn) stable by choosing the values of covariates appropriately. For example, using the centered covariates (Zi − Z̄) may be helpful.


6. Bayesian analysis of event history data

What is “event history data”?

• A subject (patient, mouse, machine, company etc.) can experience several types of events at random time points.

• “Event history data” records the types and times of events the subject has experienced during the follow-up period.

• Examples are

– Survival time with censoring

– Multiple event time data

– Illness-death model

– Competing risks model


Example 1: Censored survival data

• Let X be a survival time and let C be a censoring time.

• An observation is (T, δ) where T = min{X, C} and δ = I(X ≤ C).

• There are two types of event (death and censoring) and only one time point.


Example 2: Recurrent event data

• A subject experiences a certain event repeatedly. An example is a recurrent disease (e.g. diarrhea).

• Data consist of the random time points T1, T2, . . . where Ti is the time of the i-th event.

• There is only one type of event and many time points.


Example 3: Illness-Death model

• A subject changes states at random time points before the absorbing event (death).

[Figure: state diagram with the states healthy, disease and death.]

• The data consist of the random time points of state transition and the corresponding types of transition (health-to-disease, disease-to-health, health-to-death and disease-to-death).

• There are many types of event and many time points.


Example 4: Competing risks model

• A subject experiences one event among various types of events at a random time point.

[Figure: diagram with the states survival, cause 1, cause 2 and cause k.]

• An example: a mouse dies due to either thymic lymphoma, reticulum cell sarcoma (both specific types of cancer) or other causes.

• The data consist of the type and time of the event the subject experiences.

• There are many types of event and only one time point.


Time inhomogeneous Markov process

• Multi-state event history data can be modeled by a time inhomogeneous Markov process.

• Let X be a time inhomogeneous Markov process with the state space S = {0, . . . , K}.

• For event history data, X(t) represents the state at time t.

• Examples

– Censored survival data: S = {survival, death, censored} and X(0) = survival.

– Multiple event time data: S = {0, 1, . . .} and X(t) is the number of events experienced up to time t.

– Illness-death model: S = {health, disease, death} and X(0) = health or disease.

– Competing risks model: S = {health, 1, . . . , K} and X(0) = health.


Transition probabilities and Intensity functions

• For a given Markov process X, let

$$P_{hj}(s, t) = \Pr(X(t) = j \mid X(s) = h)$$

for h, j ∈ {0, . . . , k}.

• The (k + 1) × (k + 1) matrix function defined by

$$\mathbf{P}(s, t) = [P_{hj}(s, t)]_{h, j \in \{0, \dots, k\}}$$

is called the “(matrix of) transition probabilities”.

• Let

$$\mathbf{A}(t) = [A_{hj}(t)]_{h, j \in \{0, \dots, k\}}$$

where Ahj, j ≠ h, are the cumulative intensity functions of transitions from state h to state j, and $A_{hh}(t) = -\sum_{j \ne h} A_{hj}(t)$.


• Then, we have

$$\mathbf{P}(s, t) = \prod_{s < u \le t} (\mathbf{I} + d\mathbf{A}(u))$$

where I is the (k + 1) × (k + 1) identity matrix.

• We let

$$\mathbf{A}_h = (A_{hj},\ j = 0, \dots, k)$$

be the (vector of cumulative) intensity functions from state h.

• We want to develop a prior process for Ah, a multivariate version of beta processes.

• Note that there are constraints: for j ≠ h,

$$0 \le \Delta A_{hj}(t) \quad \text{and} \quad \sum_{j \ne h} \Delta A_{hj}(t) \le 1.$$
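
When A is purely discrete, the product integral for P(s, t) is a finite ordered product of the matrices I + ∆A(u). The sketch below uses a made-up illness-death example with the states ordered as healthy, disease, death. The increment values and function names are illustrative, and each increment matrix respects the constraints listed above, with its diagonal equal to minus the off-diagonal row sum.

```python
def mat_mul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[h][m] * b[m][j] for m in range(n)) for j in range(n)]
            for h in range(n)]

def product_integral(increments):
    """P(s, t) = product over jump times u in (s, t] of (I + dA(u)), in time order."""
    n = len(increments[0])
    p = [[1.0 if h == j else 0.0 for j in range(n)] for h in range(n)]
    for d_a in increments:
        step = [[(1.0 if h == j else 0.0) + d_a[h][j] for j in range(n)]
                for h in range(n)]
        p = mat_mul(p, step)
    return p

# Made-up increment dA(u); states: 0 = healthy, 1 = disease, 2 = death (absorbing).
d_a = [[-0.3, 0.2, 0.1],
       [0.0, -0.3, 0.3],
       [0.0, 0.0, 0.0]]
p = product_integral([d_a, d_a])   # two jump times with the same increment
for row in p:
    print(row)
```

Each row of the result sums to one, and the absorbing state keeps probability one of staying where it is.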


Understanding the intensity functions

• $A_{h\cdot}(t) = \sum_{j \ne h} A_{hj}(t)$: the (cumulative) intensity function for transitions from state h.

• $p_{hj}(t) = dA_{hj}(t)/dA_{h\cdot}(t)$: the (instantaneous) transition probability from state h to state j, given that a transition occurs at time t.


Beta-Dirichlet process prior (Kim et al., 2009)

• For given Ah

– Ah·(t) and {phj(t), j ≠ h} are independent

– Ah· is a beta process

– {phj(t), j ≠ h} follows a Dirichlet distribution for given t.


Levy measure of (stochastically continuous) beta-Dirichlet process

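• A stochastically continuous K-dimensional beta-Dirichlet process A with parameter (A0(·), c(·), {γj(·), j = 1, . . . , K}) is a K-dimensional Levy process with

$$E(\exp(-\langle \theta, A(t) \rangle)) = \exp\Big[ -\int_0^t \int_{[0,1]^K} \big(1 - e^{-\langle \theta, x \rangle}\big)\, \nu(ds, dx) \Big]$$

where

$$\nu(dt, dx) = d(t)\, x_1^{\gamma_1(t) - 1} \cdots x_K^{\gamma_K(t) - 1} \Big( \sum_{j=1}^{K} x_j \Big)^{-\sum_{j=1}^{K} \gamma_j(t)} \Big( 1 - \sum_{j=1}^{K} x_j \Big)^{c(t) - 1} dx\, a_0(t)\, dt$$

for xj ≥ 0 and $\sum_{j=1}^{K} x_j \le 1$, where a0(t) = dA0(t)/dt and

$$d(t) = \frac{\Gamma\big(\sum_j \gamma_j(t)\big)}{\Gamma(\gamma_1(t)) \cdots \Gamma(\gamma_K(t))}\, c(t).$$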


Moments

• Mean

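$$E(A_i(t)) = \int_0^t \frac{\gamma_i(s)}{\sum_h \gamma_h(s)}\, a_0(s)\, ds,$$

• Variance

$$\mathrm{Var}(A_i(t)) = \int_0^t \frac{\gamma_i(s)(\gamma_i(s) + 1)}{\big(\sum_h \gamma_h(s)\big)\big(\sum_h \gamma_h(s) + 1\big)} \cdot \frac{1}{c(s) + 1}\, a_0(s)\, ds \qquad (2)$$

• Covariance

$$\mathrm{Cov}(A_i(t), A_j(t)) = \int_0^t \frac{\gamma_i(s)\gamma_j(s)}{\big(\sum_h \gamma_h(s)\big)\big(\sum_h \gamma_h(s) + 1\big)} \cdot \frac{1}{c(s) + 1}\, a_0(s)\, ds \qquad (3)$$

for i ≠ j.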


General beta-Dirichlet process

• A general beta-Dirichlet process A with parameter (A0(·), c(·), {γj(·), j = 1, . . . , K}) is defined as

A = Ac + Ad

where

– Ac is a stochastically continuous beta-Dirichlet process with parameter (A0c(·), c(·), {γj(·), j = 1, . . . , K}) where A0c(·) is the absolutely continuous part of A0;

– Ad is a time discrete beta-Dirichlet process with parameter α(t) = c(t)∆A0(t), β(t) = c(t)(1 − ∆A0(t)) and αj(t) = γj(t);

– Ac and Ad are independent.


Posterior distribution

• For given A, let X1, . . . , Xn be independent Markov processes on [0, τ] with the intensity functions A.

• A priori, A0, . . . , AK are independent beta-Dirichlet processes with parameters $(A^0_h(\cdot), c_h(\cdot), \{\gamma_{hj}(\cdot), j \ne h\})$ for h = 0, . . . , K, respectively.

• Let

– $N_{hj}(t) = \sum_{i=1}^{n} \sum_{s \le t} I(X_i(s-) = h,\ X_i(s) = j)$

– $N_{h\cdot}(t) = \sum_{i=1}^{n} \sum_{s \le t} I(X_i(s-) = h,\ X_i(s) \ne X_i(s-))$

– $Y_h(t) = \sum_{i=1}^{n} I(X_i(t-) = h)$.


• Then, the posterior distribution of A is again independent beta-Dirichlet processes with parameters $(A^{0p}_h, c^p_h, \gamma^p_{hj}, j \ne h)$, h = 0, . . . , K, where

$$A^{0p}_h(t) = \int_0^t \frac{c_h(s)\, dA^0_h(s) + dN_{h\cdot}(s)}{Y_h(s) + c_h(s)},$$

$$c^p_h(t) = Y_h(t) + c_h(t)$$

and

$$\gamma^p_{hj}(t) = \gamma_{hj}(t) + \Delta N_{hj}(t)$$

for j ≠ h.
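
The updates of ch and γhj at a given time point only require the at-risk count Yh(t) and the jump count ∆Nhj(t). The sketch below stores each right-continuous path as a list of (jump time, new state) pairs starting with (0.0, initial state). This representation, the prior values and the function names are all assumptions made for illustration.

```python
def state_before(path, t):
    """X(t-): the state just before time t (None if the path has not started)."""
    state = None
    for time, new_state in path:
        if time < t:
            state = new_state
    return state

def at_risk(paths, h, t):
    """Y_h(t) = number of paths with X_i(t-) = h."""
    return sum(1 for p in paths if state_before(p, t) == h)

def jumps_at(paths, h, j, t):
    """Delta N_hj(t) = number of paths jumping from h to j exactly at time t."""
    return sum(1 for p in paths if state_before(p, t) == h and (t, j) in p)

def posterior_update(paths, h, j, t, c_h, gamma_hj):
    """Return (c_h^p(t), gamma_hj^p(t)) from prior values c_h(t), gamma_hj(t)."""
    return c_h + at_risk(paths, h, t), gamma_hj + jumps_at(paths, h, j, t)

# Made-up paths on states 0, 1, 2 (2 absorbing); entries are (jump_time, new_state).
paths = [[(0.0, 0), (1.0, 1), (2.0, 2)],
         [(0.0, 0), (1.0, 2)],
         [(0.0, 0)]]
updated = posterior_update(paths, h=0, j=1, t=1.0, c_h=2.0, gamma_hj=0.5)
print(updated)
```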

