
Chapter 1

Adaptive Markov Chain Monte Carlo: Theory and Methods

Yves Atchadé 1, Gersende Fort and Eric Moulines 2, Pierre Priouret 3

1.1 Introduction

Markov chain Monte Carlo (MCMC) methods allow one to generate samples from an arbitrary distribution π known up to a scaling factor; see Robert and Casella (1999). The method consists in sampling a Markov chain {Xk, k ≥ 0} on a state space X with transition probability P admitting π as its unique invariant distribution, i.e. πP = π. Such samples can be used, e.g., to approximate π(f) def= ∫_X f(x) π(dx) for some π-integrable function f : X → R, by

    (1/n) Σ_{k=1}^n f(Xk) .    (1.1)

In general, the transition probability P of the Markov chain depends on some tuning parameter θ defined on some space Θ which can be either finite-dimensional or infinite-dimensional. The success of the MCMC procedure depends crucially upon a proper choice of θ. To illustrate, consider the standard Metropolis-Hastings (MH) algorithm. For simplicity, we assume that π has a density, also denoted by π, with respect to the Lebesgue measure on X = R^d endowed with its Borel σ-field X. Given that the chain is at x, a candidate y is sampled from a proposal transition density q(x, ·) and is accepted with probability α(x, y) defined as

    α(x, y) = 1 ∧ [π(y) q(y, x)] / [π(x) q(x, y)] ,

where a ∧ b def= min(a, b). Otherwise, the move is rejected and the Markov chain stays at its current location x. The transition kernel P of the associated Markov chain is reversible with respect to π, i.e. for any non-negative measurable function f,

    ∫∫ π(dx) P(x, dy) f(x, y) = ∫∫ π(dx) P(x, dy) f(y, x) ,

and therefore admits π as invariant distribution. A commonly used choice for the proposal kernel is the symmetric increment random walk, leading to the Random Walk MH algorithm

1 University of Michigan, 1085 South University, Ann Arbor, MI 48109, United States. 2 LTCI, CNRS - Telecom ParisTech, 46 rue Barrault, 75634 Paris Cedex 13, France. 3 LPMA, Université P. & M. Curie, Boîte courrier 188, 75252 Paris Cedex 05, France.


(hereafter SRWM), in which q(x, y) = q(y − x) for some symmetric proposal distribution q on R^d. The SRWM is frequently used because it is easy to implement and often efficient enough, even for complex high-dimensional distributions, provided that the proposal distribution is chosen properly. Finding a good proposal for a particular problem is not necessarily an easy task. A possible choice of the increment distribution q is the multivariate normal with zero mean and covariance matrix Γ, N(0, Γ), leading to the N-SRWM algorithm. As illustrated in Figure 1.1 in the one-dimensional case d = 1, if the variance is either too small or too large, then the convergence rate of the N-SRWM algorithm will be slow and any inference from values drawn from the chain is likely to be unreliable 4. Intuitively, this may be understood as follows. If the variance is too small, then almost all the proposed values are accepted, and the algorithm behaves almost as a random walk. Because the difference between two successive values is small, the algorithm visits the state space very slowly. On the contrary, if the variance is too large, the proposed moves nearly all fall into low-probability areas of the target distribution. These proposals are often rejected and the algorithm stays at the same place. Finding a proper scale is thus mandatory.


Figure 1.1: The N-SRWM in one dimension.

Gelman et al. (1996) have shown that if the target and the proposal distributions are both Gaussian, then an appropriate choice of covariance matrix for the N-SRWM is Γ = (2.38²/d) Γπ, where Γπ is the covariance matrix of the target distribution. In the large-dimensional context (d → ∞), (Roberts and Rosenthal, 2001, Theorem 5) prove that this choice is optimal (see also Gelman et al. (1996) and Roberts et al. (1997)). In practice, this covariance matrix Γ is determined by trial and error, using several realizations of the Markov chain. This hand-tuning requires some expertise and can be time-consuming. In order to circumvent this problem, Haario et al. (1999) (see also Haario et al. (2001)) have proposed a novel algorithm, referred to as adaptive Metropolis, which continuously updates Γ during the run, according to the past values of the simulations. This

4 Following J. Rosenthal, this effect is referred to as the Goldilocks principle.


algorithm can be summarized as follows:

    μ_{k+1} = μ_k + (k+1)^{-1} (X_{k+1} − μ_k) ,  k ≥ 0 ,    (1.2)

    Γ_{k+1} = Γ_k + (k+1)^{-1} ( (X_{k+1} − μ_k)(X_{k+1} − μ_k)^T − Γ_k ) ,    (1.3)

X_{k+1} being simulated from the Metropolis kernel with Gaussian proposal distribution N(0, (2.38²/d) Γ_k); a minimal implementation sketch is given below.
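The following short Python sketch makes the recursion concrete. It is an illustration only: the target log-density log_pi, the initial covariance Γ_0 = I_d and the small diagonal loading eps (one of the regularizations mentioned further below) are assumptions of this sketch, not prescriptions from the text.

```python
import numpy as np

def adaptive_metropolis(log_pi, x0, n_iter=10_000, eps=1e-6, rng=None):
    """Sketch of the adaptive Metropolis (AM) recursions (1.2)-(1.3)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    d = x.size
    mu, gamma = x.copy(), np.eye(d)        # running mean and covariance
    chain = np.empty((n_iter, d))
    for k in range(n_iter):
        # MH step with Gaussian proposal N(x, (2.38^2/d) Gamma_k); the
        # diagonal loading keeps the proposal covariance non-singular.
        cov = (2.38**2 / d) * (gamma + eps * np.eye(d))
        y = rng.multivariate_normal(x, cov)
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x = y
        chain[k] = x
        # Stochastic-approximation updates (1.2)-(1.3), step size 1/(k+1).
        g = 1.0 / (k + 1)
        delta = x - mu                     # X_{k+1} - mu_k (old mean)
        mu = mu + g * delta
        gamma = gamma + g * (np.outer(delta, delta) - gamma)
    return chain

# Example: chain = adaptive_metropolis(lambda x: -0.5 * x @ x, np.zeros(5))
```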

It was soon recognized by Andrieu and Robert (2001), in an influential technical report, that such a scheme can be cast into a more general framework, referred to as controlled MCMC. In this framework, the transition kernel of the chain depends upon a finite-dimensional parameter θ: instead of having a single transition kernel P with stationary distribution π, we consider a parametric family {Pθ, θ ∈ Θ}, each kernel having π as its stationary distribution, πPθ = π for all θ ∈ Θ. In a controlled MCMC, the value of the parameter θ is updated online, ideally by optimizing a criterion reflecting the sampler's performance. This problem shares some similarities with Markov decision processes, where the choice of the parameter θ can be seen as an action. There are some difficulties, however, when trying to push this similarity too far, because it is not obvious how to define a proper reward function. Controlled MCMC is a specific example of an internal adaptation setting, where the "parameter" θ is updated from the past history of the chain. Other examples of internal adaptation algorithms, which do not necessarily rely on a stochastic approximation step, are given in Section 1.2.2; see also Andrieu and Thoms (2008) and Rosenthal (2009). When attempting to simulate from probability measures with multiple modes, or when the dimension of the state space is large, the Markov kernels {Pθ, θ ∈ Θ} might mix so slowly that an internal adaptation strategy cannot always be expected to work. Other forms of adaptation can then be considered, using one or several auxiliary processes which are run in parallel to the chain {Xk, k ≥ 0} targeting π. Because the target chain is adapted using some auxiliary processes, we refer to this adaptation framework as external. The idea of running several MCMC chains in parallel and making them interact has been suggested by many authors (see for example Chauveau and Vandekerkhove (1999), Chauveau and Vandekerkhove (2002), Jasra et al. (2007)). For example, one may sample an instrumental target distribution π⋆ on a product space X^N (N being the number of MCMC chains run in parallel) such that π⋆ admits π as a marginal. The simplest idea consists in defining the target density on the product space X² as the (tensor) product of two independent distributions, π⋆ = π ⊗ π₁, where π₁ is easier to explore but related to π, and constructing a kernel allowing to swap the components of the chain via a so-called exchange step. This simple construction has been shown to improve the overall mixing of the chain, by allowing the badly mixing chain to explore the state space more efficiently with the help of the auxiliary chain. It is possible to extend this simple idea significantly by allowing more general interactions between the auxiliary chains. In particular, as suggested in Kou et al. (2006), Andrieu et al. (2007) or Atchadé (2009), it is possible to make the auxiliary chains interact with the whole set of past simulations. Instead of swapping only the current states of the auxiliary chains, we may for example exchange the current state of the target chain with one of the states visited by an auxiliary chain in the past. The selection of this state can be guided by sampling in the past of the auxiliary chains, with weights depending for example on the current value of the state. This class of methods is referred to as interacting MCMC. These chains can be cast into the framework outlined above, by allowing the parameter θ to take its value in an infinite-dimensional space (θk might for example be an appropriately specified weighted empirical measure of the auxiliary chains at time k). The purpose of


this chapter is to review adaptive MCMC methods, emphasizing the links between internal (controlled MCMC) and external (interacting MCMC) algorithms. The emphasis of this review is on general techniques to construct well-behaved adaptive MCMC algorithms and to prove their convergence. This chapter complements two recent surveys on this topic by Andrieu and Thoms (2008) and Rosenthal (2009), which put more emphasis on the design of adaptive algorithms.

The chapter is organized as follows. In Section 1.2.1, a common unifying framework covering both types of adaptation is presented. In Section 1.2.2, the general framework of controlled MCMC is introduced and some examples are given. In Section 1.2.3, interacting MCMC algorithms are presented.

In the context of adaptive MCMC, the convergence of the parameter θk is not the central issue; the focus is rather on the way the simulations {Xk, k ≥ 0} approximate π. The minimal requirements are that the marginal distribution of Xk converges in an appropriate sense to the stationary distribution π, and that the sample mean n^{-1} Σ_{k=1}^n f(Xk), for f chosen in a suitably large class of functions, converges to π(f). In Section 1.3, we establish the convergence of the marginal distribution of {Xk, k ≥ 0}, and in Section 1.4, we establish the consistency (i.e. a strong law of large numbers) of n^{-1} Σ_{k=1}^n f(Xk). Finally, and for pedagogical purposes, we show in Section 1.5 how to apply these results to the equi-energy sampler of Kou et al. (2006).

1.2 Algorithms

1.2.1 A general framework for adaptive MCMC

In the sequel, all the variables are defined on a common probability space (Ω, A, P). We let {Pθ, θ ∈ Θ} be a parametric family of Markov kernels on (X, X). We consider a process {(Xn, θn), n ≥ 0} and a filtration F = {Fn, n ≥ 0} such that {(Xn, θn), n ≥ 0} is adapted to F and, for each n and any non-negative function f,

    E[f(X_{n+1}) | F_n] = P_{θn} f(Xn) = ∫ P_{θn}(Xn, dy) f(y) ,  P-a.s.    (1.4)

1.2.2 Internal adaptive MCMC

The so-called internal adaptive MCMC algorithms correspond to the case where the parameter θn depends on the whole history X0, . . . , Xn, θ0, . . . , θ_{n−1}, though in practice it is often the case that the pair process {(Xn, θn), n ≥ 0} is Markovian. The so-called controlled MCMC algorithm is a specific class of internal adaptive algorithms. According to this scheme, the parameter θk is updated according to a single step of a stochastic approximation procedure,

    θ_{k+1} = θ_k + γ_{k+1} H(θ_k, X_k, X_{k+1}) ,  k ≥ 0 ,    (1.5)

where X_{k+1} is sampled from P_{θk}(Xk, ·). In most cases, the function H is chosen so that the adaptation is easy to implement, requiring only moderate amounts of extra computer programming and not adding a large computational overhead (Rosenthal (2007) developed some generic adaptive software). For reasons that will become obvious below, the rate of adaptation tends to zero as the number k of iterations goes to infinity, i.e. lim_{k→∞} γ_k = 0. On the other hand, Σ_{k=0}^∞ γ_k = ∞, meaning that the sum of the parameter moves can still be infinite, i.e. the sequence {θk, k ≥ 0} may move an infinite distance from the initial value θ0. It is not necessarily required that the parameters {θk, k ≥ 0} converge to some fixed value. An in-depth


description of controlled MCMC algorithms is given in Andrieu and Thoms (2008), where these algorithms are illustrated with many examples (some of which are given below). Instead of adapting the parameter θk continuously with smaller and smaller step sizes, it is possible to adapt it by batches, which can be computationally less demanding. We may for example define an increasing sequence of time instants {Tj, j ≥ 0} where adaptation occurs, the parameter being kept constant between two such instants. In this case, the recursion becomes

    θ_{T_{j+1}} = θ_{T_j} + (T_{j+1} − T_j)^{-1} Σ_{k=T_j+1}^{T_{j+1}} H(θ_{T_j}, X_k, X_{k+1}) ,    (1.6)

where X_{k+1} is sampled from P_{θk}(Xk, ·), with θ_k = θ_{T_j} for k ∈ {T_j, . . . , T_{j+1} − 1}. The rate at which the adaptation takes place is diminished by taking lim_{j→∞} T_{j+1} − T_j = +∞. Instead of selecting deterministic time intervals, it is also possible to adapt the parameter at time k with a probability p_k, with lim_{k→∞} p_k = 0 (see Roberts and Rosenthal (2007)). The usefulness of this extra bit of randomness is somewhat questionable. The recursions (1.5) and (1.6) are aimed at solving the equation h(θ) = 0, where θ ↦ h(θ) is the mean field of the stochastic approximation procedure (see for example Benveniste et al. (1990) or Kushner and Yin (2003)),

    h(θ) def= ∫_{X×X} H(θ, x, x′) π(dx) Pθ(x, dx′) .    (1.7)

Returning to the adaptive Metropolis example in the introduction, the parameter θ consists of the mean and the covariance matrix, θ = (μ, Γ) ∈ Θ = R^d × C_d^+, where C_d^+ is the cone of symmetric non-negative d × d matrices. The expression of H is given explicitly by Eqs. (1.2) and (1.3). Assuming that ∫_X |x|² π(dx) < ∞, one can easily check that the associated mean field function is given by

    h(θ) = ∫_X H(θ, x) π(dx) = ( μ_π − μ , (μ_π − μ)(μ_π − μ)^T + Γ_π − Γ ) ,    (1.8)

with μ_π and Γ_π the mean and covariance of the target distribution,

    μ_π def= ∫_X x π(dx)   and   Γ_π def= ∫_X (x − μ_π)(x − μ_π)^T π(dx) .    (1.9)
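As a quick sanity check (this two-line computation is not spelled out in the text but follows directly from (1.2), (1.3) and (1.9)), write H(θ, x) = (x − μ, (x − μ)(x − μ)^T − Γ); then

    ∫_X (x − μ) π(dx) = μ_π − μ ,
    ∫_X ( (x − μ)(x − μ)^T − Γ ) π(dx) = Γ_π + (μ_π − μ)(μ_π − μ)^T − Γ ,

where the second identity uses x − μ = (x − μ_π) + (μ_π − μ) and the fact that the cross terms vanish by (1.9). In particular, h(θ) = 0 if and only if μ = μ_π and Γ = Γ_π.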

The stationary point of the algorithm is unique, θ⋆ = (μ_π, Γ_π). Provided that the step size is appropriately chosen, a stochastic approximation procedure will typically converge toward that stationary point. The convergence is nevertheless not trivial to establish in general; see for example Andrieu et al. (2005), Andrieu and Moulines (2006). The behavior of the Adaptive Metropolis (AM) algorithm is illustrated in Figure 1.2. The target distribution is a multivariate Gaussian distribution N(0, Σ) with dimension d = 200. The eigenvalues of the covariance of the target distribution are regularly spaced in the interval [10^{-2}, 10^3]. This is a rather challenging simulation task, with a rather large dispersion of the eigenvalues. The proposal distribution at step k ≥ 2d is given by

    Pθk(x, ·) = (1 − β) N(x, (2.38)² Γ_k / d) + β N(x, 0.1 I_d / d) ,

where Γ_k is the current estimate of the covariance matrix given in (1.3), and β is a positive constant (we take β = 0.05, as suggested in Roberts and Rosenthal (2006)).


Figure 1.2: Left panel: trace plot of the first 20000 iterations of the AM algorithm. Right panel, top: suboptimality factor as a function of the number of iterations; right panel, bottom: mean acceptance rate as a function of the number of iterations (obtained by averaging the number of accepted moves over a sliding window of size 6000).

The initial value for Γ_0 is (2.38/√d) I_d. The rationale for using such a β is to avoid the algorithm getting stuck with a singular covariance matrix (in the original AM algorithm, Haario et al. (2001) suggested to regularize the covariance matrix by loading the diagonal; another, more sophisticated solution, based on projections onto compact sets, is considered in Andrieu and Moulines (2006)). Figure 1.2 displays the trace plot of the first coordinate of the chain for dimension d = 200, together with the suboptimality criterion introduced in Roberts and Rosenthal (2001), defined as

    b_k def= d Σ_{i=1}^d λ_{i,k}^{-2} / ( Σ_{i=1}^d λ_{i,k}^{-1} )² ,

where the λ_{i,k} are the eigenvalues of the matrix Γ_k^{1/2} Σ^{-1/2}. Usually we will have b_k > 1, and the closer b_k is to 1, the better. The criterion being optimized in AM is therefore b_k^{-1}.

Another quantity which benefits from a strong theoretical background is the expected acceptance probability (i.e. the fraction of proposed moves which are accepted) of the MH algorithm, for random-walk Metropolis algorithms or Langevin-based MH updates (Roberts et al. (1997)). As shown for the SRWM algorithm in one dimension, if the expected acceptance probability is (very) close to 1, this suggests that the scale is too small. If this fraction is (very) close to 0, this suggests that it is too large (as in Figure 1.1). But if this fraction is far from 0 and far from 1, then we may consider that the algorithm avoids both extremes and is properly tuned. The choice of a proper scale can thus be automated by controlling the expected acceptance ratio. For simplicity, we consider only the one-dimensional SRWM, where θ is the scale. In this case, we choose H(θ, x, x′) def= α(x, x′) − α⋆, which is associated with the mean field h(θ) = ∫∫ π(dx) Pθ(x, dx′) α(x, x′) − α⋆, where the acceptance ratio is α(x, x′) = 1 ∧ π(x′)/π(x). The value of α⋆ can be set to 0.4 or 0.5 (the "optimal" value in this context is 0.44); a minimal sketch of this recursion is given below. Controlling the stationary acceptance probability has several advantages; in particular, unlike the suboptimality factor b_k, it can be estimated directly from the run. The same idea applies in the large-dimensional context. We may for example couple this approach with the AM algorithm: instead of using the asymptotic (2.38)²/d factor, we might let the algorithm determine a proper scaling automatically by controlling both the covariance matrix of the proposal distribution and the mean acceptance rate. Consider first the case where the target distribution and the proposal distribution are Gaussian with identity covariance matrix. We learn the proper scaling by targeting a mean acceptance rate α⋆ = 0.234, which is known to be optimal in a particular large-dimensional context. To illustrate the behavior of the algorithm, we display in Figure 1.3 the autocorrelation function Corr_π[g(X_0), g(X_k)], k ≥ 0, and the integrated autocorrelation time v = 1 + 2 Σ_{k≥1} Corr_π(g(X_0), g(X_k)) for the function g(x_1, . . . , x_d) = x_1 and dimensions d = 10, 25, 50 (these quantities are estimated using a single run of length 50000).
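The coerced-acceptance-rate recursion for the one-dimensional SRWM can be sketched as follows. Adapting the logarithm of the scale, so that the scale remains positive, is a standard device assumed here for the illustration rather than taken from the text.

```python
import numpy as np

def srwm_acceptance_control(log_pi, x0=0.0, alpha_star=0.44,
                            n_iter=10_000, rng=None):
    """One-dimensional SRWM with scale tuned by the recursion (1.5),
    using H(theta, x, x') = alpha(x, x') - alpha_star."""
    rng = np.random.default_rng() if rng is None else rng
    x, log_scale = float(x0), 0.0
    chain = np.empty(n_iter)
    for k in range(n_iter):
        y = x + np.exp(log_scale) * rng.standard_normal()
        alpha = min(1.0, np.exp(log_pi(y) - log_pi(x)))   # 1 ^ pi(y)/pi(x)
        if rng.uniform() < alpha:
            x = y
        chain[k] = x
        # Step sizes gamma_k = 1/(k+1) satisfy lim gamma_k = 0 and
        # sum gamma_k = infinity, as required after (1.5).
        log_scale += (alpha - alpha_star) / (k + 1)
    return chain, np.exp(log_scale)
```

If the scale is too small, the realized acceptance probability exceeds α⋆ on average and the scale increases; if it is too large, the opposite happens, so the recursion drives the mean acceptance rate toward α⋆.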


Figure 1.3: Autocorrelation function for g(x_1, . . . , x_d) = x_1 and value of the integrated autocorrelation time v for different dimensions: d = 10 (v = 1.8578, 1.8645), d = 25 (v = 1.8124, 1.7704), d = 50 (v = 1.8555, 1.6026). In the top panels, the optimal scaling is used; in the bottom panels, the scaling is adapted by controlling the mean acceptance rate.


Figure 1.4: Left panel: SRWM algorithm with identity covariance matrix and optimal scaling; right panel: Adaptive Metropolis with adapted scaling; the targeted value of the mean acceptance rate is α⋆ = 0.234.

In another experiment, we try to learn the covariance and the scaling simultaneously; see Figure 1.4. The target distribution is a zero-mean Gaussian with covariance drawn at random as above, in dimension d = 100. Another possibility consists in using a Metropolis-within-Gibbs algorithm, where the variables are updated one at a time (in either systematic or random order), each using an MCMC algorithm with a one-dimensional proposal (see Roberts and Rosenthal (2006); Rosenthal (2009); this algorithm is also closely related to the Single-Component Adaptive Metropolis introduced in Haario et al. (2005)). It has recently been applied successfully to high-dimensional inference for statistical genetics (Turro et al., 2007). More sophisticated adaptation techniques may be used. An obvious idea is to make the adaptation "local", i.e. to adapt to the local behavior of the target density. Such


techniques have been introduced to alleviate the main weakness of the adaptive Metropolis algorithm when applied to spatially non-homogeneous targets, which is due to the use of a single global covariance matrix for the proposal. Consider for example the case where the target density is a mixture of Gaussian distributions, π = Σ_{j=1}^p a_j N(μ_j, Σ_j). Provided that the overlap between the components is weak and the covariances Σ_j are widely different, there does not exist a common proposal distribution which is well fitted to sample in the regions surrounding each mode. This example suggests tuning the empirical covariance matrices by learning from the history of the past simulations in different regions of the state space. To be more specific, assume that there exists a partition X = ∪_{j=1}^p X_j. Then, according to the discussion above, it seems beneficial to use different proposal distributions in each set of the partition. We might for example use the proposal

    q_θ(x; x′) = Σ_{j=1}^p 1_{X_j}(x) φ(x′; x, Γ_j) ,

where 1_A(x) is the indicator of the set A, and φ(x; μ, Γ) is the density of a d-dimensional Gaussian distribution with mean μ and covariance Γ. Here, the parameter θ collects the covariances of the individual proposal distributions within each region. With such a proposal, the acceptance ratio of the MH algorithm becomes

    α_θ(x; x′) = 1 ∧ Σ_{i,j=1}^p [π(x′) φ(x; x′, Γ_j)] / [π(x) φ(x′; x, Γ_i)] 1_{X_i}(x) 1_{X_j}(x′) .

To adapt the covariance matrices Γ_i, i = 1, . . . , p, we can for example use the update equations (1.2) and (1.3) within each region, as in the sketch below.
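Here is a minimal sketch of one step of this regional scheme. The region-assignment function region, the per-region bookkeeping (mus, gammas, counts) and the absence of any regularization of the learned covariances are assumptions of this illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def regional_mh_step(x, log_pi, region, mus, gammas, counts, rng):
    """One MH step with the region-dependent proposal q_theta above,
    followed by the AM updates (1.2)-(1.3) restricted to one region."""
    i = region(x)
    y = rng.multivariate_normal(x, gammas[i])
    j = region(y)
    # Acceptance ratio alpha_theta(x; y): the forward proposal density
    # uses Gamma_i (region of x), the reverse one Gamma_j (region of y).
    log_ratio = (log_pi(y) + mvn.logpdf(x, mean=y, cov=gammas[j])
                 - log_pi(x) - mvn.logpdf(y, mean=x, cov=gammas[i]))
    if np.log(rng.uniform()) < log_ratio:
        x = y
    # Update the running mean/covariance of the region currently visited.
    i = region(x)
    counts[i] += 1
    g = 1.0 / counts[i]
    delta = x - mus[i]
    mus[i] = mus[i] + g * delta                                       # (1.2)
    gammas[i] = gammas[i] + g * (np.outer(delta, delta) - gammas[i])  # (1.3)
    return x
```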

To ensure proper communication between the regions, it is recommended to mix the adaptive kernel with a fixed kernel. This technique is investigated in Craiu et al. (2008). Another interesting direction of research is to adapt the proposal distribution in an independent MH setting. Recall that in this case, the proposed moves do not depend on the current state, i.e. q(x, x′) = q(x′), where q is some probability distribution on X. The acceptance ratio in this case writes α(x, x′) = 1 ∧ (π(x′)q(x))/(π(x)q(x′)). To perform adequately, the proposal q should be chosen sufficiently close to the target π. A natural idea is to choose a parametric family of distributions {qθ, θ ∈ Θ}, and to adapt the parameter θ from the history of the draws. This technique is of course closely related to the adaptive importance sampling idea; see for example Rubinstein and Kroese (2008). Because of their extreme flexibility and their tractability, finite mixtures of Gaussian distributions are an appealing choice for that purpose. Recall that any continuous distribution can be approximated arbitrarily well by a finite mixture of normal densities with common covariance matrix; in addition, discrete mixtures of Gaussians are fast to sample from, and their likelihood is easy to calculate, which is of key importance. In this case, the parameters θ are the mixing proportions and the means and covariances of the component densities. Several approaches have been considered to fit these parameters. Other mixtures from the exponential family can also be considered, such as a discrete/continuous mixture of Student's t-distributions (see Andrieu and Thoms (2008) for details). Andrieu and Moulines (2006) suggest fitting these parameters using a maximum likelihood approach or, equivalently, by minimizing the cross-entropy between the proposal distribution and the target; this algorithm shares some similarities with the so-called adaptive independence sampler developed in Keith et al. (2008). In this framework, the parameters are fitted using a sequential version of the EM algorithm (Cappé and Moulines, 2009); several improvements on this basic scheme are presented in Andrieu and Thoms (2008).


Figure 1.5: Adaptive fit of a mixture of three Gaussian distributions with arbitrary means and covariances, using the maximum likelihood approach developed in Andrieu and Moulines (2006).

Giordani and Kohn (2008) proposed a principled variant replacing the EM by the k-harmonic mean algorithm, an extension of the k-means algorithm that allows for soft membership. This algorithm is less sensitive to convergence to local minima; in addition, degeneracies of the covariance matrices of the components can easily be prevented. An example of fit is given in Figure 1.5. There are many other possible forms of adaptive MCMC, and many of these are presented in the recent surveys by Roberts and Rosenthal (2006), Andrieu and Thoms (2008) and Rosenthal (2009).
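Before turning to external adaptation, here is a minimal sketch of an adaptive independence MH loop of the kind discussed above. The proposal-fitting routine fit_proposal, its (sample, logpdf) return interface and the batch refitting schedule are hypothetical placeholders standing in for the EM and k-harmonic-mean procedures cited in the text.

```python
import numpy as np

def adaptive_independence_mh(log_pi, fit_proposal, x0, n_iter=10_000,
                             refit_every=500, rng=None):
    """Independence MH whose proposal q_theta is refitted to past draws."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    chain = [x]
    sample_q, logpdf_q = fit_proposal(np.array(chain))
    for k in range(1, n_iter):
        if k % refit_every == 0:   # adapt by batches, in the spirit of (1.6)
            sample_q, logpdf_q = fit_proposal(np.array(chain))
        y = sample_q(rng)
        # alpha(x, y) = 1 ^ [pi(y) q(x)] / [pi(x) q(y)]
        log_ratio = (log_pi(y) + logpdf_q(x)) - (log_pi(x) + logpdf_q(y))
        if np.log(rng.uniform()) < log_ratio:
            x = y
        chain.append(x)
    return np.array(chain)
```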

1.2.3 External Adaptive Algorithm

The so-called external adaptive MCMC algorithms correspond to the case where the parameter process {θn, n ≥ 0} is computed using an auxiliary process {Yk, k ≥ 0} run independently from the process {Xk, k ≥ 0} (several auxiliary processes can of course be used). More precisely, it is assumed that the parameter process is adapted to the natural filtration of the process {Yk, k ≥ 0}, meaning that for each n, θn is a function of the history Y_{0:n} def= (Y_0, Y_1, . . . , Y_n) of the auxiliary process. In addition, conditionally on the auxiliary process {Yk, k ≥ 0}, {Xk, k ≥ 0} is an inhomogeneous Markov chain: for any 0 ≤ k ≤ n and any bounded function f,

    E[f(X_{k+1}) | X_{0:k}, Y_{0:k}] = P_{θk} f(Xk) .

The use of an auxiliary process to learn the proposal distribution in an independent MH algorithm has been considered in Chauveau and Vandekerkhove (2001) and Chauveau and Vandekerkhove (2002). In this setting, θn is a distribution (and is thus nonparametric) and is obtained using a histogram. This approach works best in situations where the dimension d of the state space is small; otherwise, the histogram estimate becomes very unreliable (in Chauveau and Vandekerkhove (2002), only one- and two-dimensional examples are considered).

Kou et al. (2006) have introduced another form of interaction: instead of trying to learn a well-fitted proposal distribution, these authors suggest to "swap" the current state of the Markov chain with a state sampled from the history of the


auxiliary processes. A similar idea has been advocated in Atchadé (2009). In this setting, the "parameter" θn is also infinite-dimensional, and these algorithms may be seen as a kind of nonparametric extension of the controlled MCMC procedures.

All these new ideas originate from parallel tempering and simulated tempering, two influential algorithms developed in the early 1990s to speed up the convergence of MCMC algorithms (Geyer (1991); Marinari and Parisi (1992); Geyer and Thompson (1995)). In these approaches, the sampling algorithm moves progressively toward the target distribution π through a sequence of "easily sampled" distributions. The idea behind the parallel tempering algorithm of Geyer (1991) is to perform parallel Metropolis sampling at different temperatures. Occasionally, a swap between the states of two neighboring chains (two chains running at adjacent temperature levels) is proposed. The acceptance probability for the swap is computed to ensure that the joint state of all the parallel chains evolves according to the Metropolis-Hastings rule targeting the product distribution. The objective of parallel tempering is to use the faster mixing of the high-temperature chains to improve the mixing of the low-temperature chains. The simulated tempering algorithm introduced in Marinari and Parisi (1992) exploits a similar idea but with a markedly different approach. Instead of using multiple parallel chains, this algorithm runs a single chain but augments the state of this chain with an auxiliary variable, the temperature, that is dynamically moved up or down the temperature ladder.

The Equi-Energy (EE) sampler exploits the parallel tempering idea, in the sense that the algorithm runs several chains at different temperatures, but it allows for more general swaps between the states of neighboring chains. The idea is to replace an instantaneous swap by a so-called equi-energy swap. To avoid cumbersome notations, we assume here that there is a single auxiliary process, but it should be stressed that the EE sampler works best in the presence of multiple auxiliary processes covering a wide range of temperatures. Let π be the target density on (X, X). For β ∈ (0, 1), define the tempered density π_β ∝ π^β. The auxiliary process {Yn, n ≥ 0} is X-valued and is such that its marginal distribution converges to π_β as n goes to infinity. Let P be a transition kernel on (X, X) with unique invariant distribution π (in most cases, P is an MH kernel). Choose ε ∈ (0, 1), the probability of proposing a swap between the states of the two neighboring chains. Define a partition X = ∪_{ℓ=1}^K X_ℓ, where the X_ℓ are the so-called rings (a term linked with the particular choice of the partition in Kou et al. (2006), where the rings are defined as level sets of the logarithm of the target distribution). At iteration n of the algorithm, two actions may be taken:

1. with probability (1 − ε), we move the current state Xn according to the Markov kernel P;

2. with probability ε, we propose to swap the current state Xn with a state Z drawn from the past of the auxiliary process with weights proportional to {g(Xn, Yi), i ≤ n}, where

    g(x, y) def= Σ_{ℓ=1}^K 1_{X_ℓ × X_ℓ}(x, y) .    (1.10)

More precisely, we propose a move Z drawn at random within the same ring as Xn. This move is accepted with probability α(Xn, Z), where the acceptance probability α : X × X → [0, 1] is defined by

    α(x, y) def= 1 ∧ ( [π(y)/π_β(y)] [π(x)/π_β(x)]^{-1} ) = 1 ∧ [ π^{1−β}(y) / π^{1−β}(x) ] .    (1.11)

A minimal sketch of one EE iteration is given below, before the formal description of the corresponding transition kernel.
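The sketch below illustrates one iteration of the EE sampler with a single auxiliary process. The energy-ring function ring and the local-move routine mh_kernel are assumptions of this illustration (in Kou et al. (2006), the rings are built from level sets of −log π).

```python
import numpy as np

def ee_step(x, aux_past, log_pi, beta, eps, ring, mh_kernel, rng):
    """One iteration of the equi-energy sampler.

    aux_past  : past states of the auxiliary chain targeting pi_beta.
    ring      : callable mapping a state to its ring index (assumption).
    mh_kernel : one MH step targeting pi, i.e. the kernel P (assumption).
    """
    if rng.uniform() >= eps:
        return mh_kernel(x, rng)          # action 1: local move under P
    # Action 2: equi-energy swap; by (1.10), the admissible candidates are
    # the past auxiliary states lying in the same ring as x.
    candidates = [y for y in aux_past if ring(y) == ring(x)]
    if not candidates:                    # eps_theta(x) = 0, cf. (1.14)
        return mh_kernel(x, rng)
    z = candidates[rng.integers(len(candidates))]
    # Acceptance probability (1.11): 1 ^ [pi(z)/pi(x)]^(1 - beta).
    if np.log(rng.uniform()) < (1.0 - beta) * (log_pi(z) - log_pi(x)):
        return z
    return x
```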


More formally, let Θ be the set of probability measures on (X, X). For any distribution θ ∈ Θ, define the Markov transition kernel

    Pθ(x, A) def= (1 − ε_θ(x)) P(x, A) + ε_θ(x) K_θ(x, A) ,    (1.12)

with

    K_θ(x, A) def= ∫_A α(x, y) g(x, y) θ(dy) / θ[g(x, ·)] + 1_A(x) ∫ (1 − α(x, y)) g(x, y) θ(dy) / θ[g(x, ·)] ,    (1.13)

where θ[g(x, ·)] def= ∫ g(x, y) θ(dy) and

    ε_θ(x) def= ε 1_{θ[g(x,·)] > 0} .    (1.14)

The kernel K_{π_β} can be seen as a Metropolis-Hastings kernel with proposal kernel g(x, y) π_β(y) / ∫ g(x, y) π_β(dy) and target distribution π. Hence, π P_{π_β} = π.

Using the samples drawn from this process, a family of weighted empirical probability distributions {θn, n ≥ 0} is recursively constructed as follows:

    θ_n def= (n+1)^{-1} Σ_{j=0}^n δ_{Y_j} = ( 1 − (n+1)^{-1} ) θ_{n−1} + (n+1)^{-1} δ_{Y_n} ,    (1.15)

where δ_y is the Dirac mass at y and, by convention, θ_{−1} = 0. Given the current value Xn and the sequence {θk, k ≤ n}, X_{n+1} is obtained by sampling from the kernel P_{θn}(Xn, ·).

Figure 1.6 illustrates the effectiveness of the equi-energy sampler. In this example, the target distribution is the two-dimensional Gaussian mixture introduced in (Kou et al., 2006, pp. 1591-1592). Figure 1.6(a) represents independent sample points from the target distribution. This distribution has 20 well-separated modes (most local modes are more than 15 standard deviations away from the nearest ones) and poses a serious challenge for sampling algorithms. We test a plain SRWM, parallel tempering and the equi-energy sampler on this problem. For parallel tempering and the equi-energy sampler, we use 5 parallel chains at inverse temperatures β = 1, 0.36, 0.13, 0.05, 0.02. For the equi-energy sampler, we define 5 equi-energy rings X1 = {x ∈ R² : −log π(x) < 2}, X2 = {x ∈ R² : 2 ≤ −log π(x) < 6.3}, X3 = {x ∈ R² : 6.3 ≤ −log π(x) < 20}, X4 = {x ∈ R² : 20 ≤ −log π(x) < 63.2} and X5 = {x ∈ R² : −log π(x) ≥ 63.2}. Figure 1.6(b) plots the first 2,000 iterations from the EE sampler; (c) plots the first 2,000 iterations from parallel tempering; and (d) plots the first 2,000 iterations from a plain SRWM algorithm. The plain SRWM mixes very poorly on this example. Parallel tempering mixes better, but the equi-energy sampler mixes even faster.

1.3 Convergence of the marginal distribution

There is a difficulty with both the internal and the external adaptation procedures: as the parameter estimate θk depends on the whole past of either the process or the auxiliary process, the process {Xk, k ≥ 0} is no longer a Markov chain and classical convergence results do not hold. This may cause serious problems, as illustrated by the following naive example. Let X = {1, 2} and consider, for θ, t_1, t_2 ∈ Θ = (0, 1) with t_1 ≠ t_2, the Markov transition probability matrices

    Pθ = [ 1−θ    θ  ]        P = [ 1−t_1    t_1  ]
         [  θ    1−θ ] ,          [  t_2    1−t_2 ] .



Figure 1.6: Comparison of the Equi-Energy sampler (b), parallel tempering (c) and a plain SRWM (d); (a) represents the target distribution.

For any θ ∈ Θ, π = (1/2, 1/2) satisfies πPθ = π. However, if we let θk be a given function Ξ : X → (0, 1) of the current state, i.e. θk = Ξ(Xk) with Ξ(i) = t_i, i ∈ {1, 2}, we define a new Markov chain with transition probability P. Now P has [t_2/(t_1 + t_2), t_1/(t_1 + t_2)] as invariant distribution. Instead of improving the mixing, the adaptation has in this case destroyed the convergence to π.
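This is immediate to verify (a short check, not spelled out in the text): writing μ = (t_2, t_1)/(t_1 + t_2) for the candidate invariant distribution of P,

    (μP)(1) = [ t_2 (1 − t_1) + t_1 t_2 ] / (t_1 + t_2) = t_2 / (t_1 + t_2) = μ(1) ,

and similarly for the second component, so μP = μ. Since t_1 ≠ t_2, μ ≠ (1/2, 1/2), and the adaptive chain no longer targets π.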

The first, admittedly weak, requirement for an MCMC algorithm to be well behaved is that the distribution of Xn converges weakly to the target distribution π as n → ∞, i.e. for any measurable bounded function f : X → R, lim_{n→∞} E[f(Xn)] = π(f). When this limit holds uniformly over bounded functions f : X → R, this convergence of the marginals {Xn, n ≥ 0} is also referred to in the literature as the "ergodicity" of the marginals {Xn, n ≥ 0} (see e.g. Roberts and Rosenthal (2007)). Convergence of the marginal distributions has been obtained for a general class of adaptive algorithms (allowing "extra" randomization) by Roberts and Rosenthal (2007), and for a more restricted class of adaptive algorithms by Andrieu and Moulines (2006) and Atchadé and Fort (to appear). The analysis of interacting MCMC algorithms began more recently, and the theory is still less developed. In this section, we extend the results obtained in Roberts and Rosenthal (2007) to cover the case where the kernels {Pθ, θ ∈ Θ} do not all have the same invariant distribution π.

All the proofs of this section are postponed to Section 1.6.

1.3.1 Notations

For a bounded function f , the supremum norm is denoted by |f |∞. Let bX be theset of measurable functions from X to R which are bounded by 1: |f |∞ ≤ 1.

More generally, for a function V : X → [1,+∞), we define the V -norm of afunction f : X → R by |f |V

def= supX |f |V −1. Let LV be the set of the functionswith finite V -norm.

For a signed measure µ on (X,X ), we define the total variation norm and theV -norm by ‖µ‖TV

def= supf :|f |∞≤1 |µ(f)| and ‖µ‖Vdef= supf :|f |V ≤1 |µ(f)|. We

denote by a.s.−→ and P−→ and D−→, the a.s. convergence and the convergence inprobability.

1.3.2 Main result

Consider the following assumptions:


A1 For any θ ∈ Θ, Pθ is a transition kernel on (X, X) that possesses a unique invariant probability measure πθ.

For any θ, θ′ ∈ Θ, define

    D_TV(θ, θ′) def= sup_{x∈X} ‖Pθ(x, ·) − Pθ′(x, ·)‖_TV .    (1.16)

A2 (Diminishing Adaptation) D_TV(θn, θn−1) →_P 0.

The diminishing adaptation condition A2 is fundamental. It requires that the amount of change in the parameter value at the n-th iteration vanishes as n → ∞. Note that it is not necessarily required that the parameter sequence {θn, n ≥ 0} converges to some fixed value. Since the user controls the updating scheme, assumption A2 can be ensured by an appropriate design of the adaptation. For example, if the algorithm adapts only at the successive time instants {Tn, n ≥ 0} (or adapts at iteration n with probability p(n)), then A2 is automatically satisfied if lim_{n→∞} T_{n+1} − T_n = ∞ (or if lim_{n→∞} p(n) = 0).

One interesting consequence of (1.4) and A2 is that, for any integer N ≥ 1, the conditional expectation E[f(X_{n+N}) | F_n] behaves like P^N_{θn} f(Xn), i.e. the conditional expectation is as if the adaptation were frozen at time n during the N successive time steps.

Proposition 1.3.1. For any integers n, N > 0,

    sup_{f ∈ bX} | E[f(X_{n+N}) | F_n] − P^N_{θn} f(Xn) | ≤ Σ_{j=1}^{N−1} E[ D_TV(θ_{n+j}, θn) | F_n ] ,  P-a.s.

For x ∈ X, θ ∈ Θ and any ε > 0, define

    M_ε(x, θ) def= inf{ n ≥ 0 : ‖P^n_θ(x, ·) − πθ‖_TV ≤ ε } .

A3 (Containment Condition) For any ε > 0, the sequence {M_ε(Xn, θn), n ≥ 0} is bounded in probability, i.e. lim_{M→∞} lim sup_{n→∞} P(M_ε(Xn, θn) ≥ M) = 0.

The containment condition A3 expresses a kind of uniform-in-θ ergodicity of the kernels {Pθ, θ ∈ Θ}. We show indeed in Proposition 1.3.4 below that a sufficient condition for A3 is a uniform-in-θ geometric drift inequality, which is known to imply the uniform-in-θ ergodicity lim_n sup_θ ‖P^n_θ(x, ·) − πθ‖_TV = 0 (see e.g. Lemma 1.7.1(iii)). As discussed in Atchadé and Fort (to appear) (see also Bai (2008)), this uniform-in-θ geometric drift inequality can be weakened and replaced by a uniform-in-θ sub-geometric drift inequality. The containment condition A3 is thus satisfied for many sensible adaptive schemes. It holds in particular for adaptive SRWM and Metropolis-within-Gibbs algorithms under very general conditions (Bai et al., 2009). It is, however, possible to construct pathological counter-examples where containment does not hold.

Theorem 1.3.2 establishes the convergence of the marginal distribution:

Theorem 1.3.2. Assume A1, A2 and A3. Then for any ε > 0, there exist n_0 and N, n_0 ≥ N, such that for any n ≥ n_0,

    sup_{f ∈ bX} | E[ f(Xn) − π_{θ_{n−N}}(f) ] | ≤ ε .


When πθ = π⋆ for all θ ∈ Θ, this result shows the convergence of the marginal distribution to π⋆; see (Roberts and Rosenthal, 2007, Theorem 13). The fact that some of the conditions A1, A2 and A3 are also necessary has been discussed in the case where the transition kernels {Pθ, θ ∈ Θ} have the same invariant probability distribution π⋆. In particular, (Bai, 2008, Theorem 4.1) shows that if the transition kernels {Pθ, θ ∈ Θ} are ergodic, uniformly in θ, and the marginal {Xk, k ≥ 0} converges to π⋆, then the containment condition holds.

Theorem 1.3.3 establishes the convergence of {πθn, n ≥ 0} under the additional assumption:

A4 There exist F ⊆ bX and a probability measure π⋆ on (X, X) such that, for any ε > 0,

    lim_{n→∞} sup_{f ∈ F} P( |πθn(f) − π⋆(f)| ≥ ε ) = 0 .

Theorem 1.3.3. Assume A1, A2, A3 and A4. Then

    lim_{n→∞} sup_{f ∈ F} | E[f(Xn)] − π⋆(f) | = 0 .

1.3.3 Sufficient conditions for A1, A2, A3 and A4

To check the existence of an invariant probability measure πθ for each transition kernel Pθ (assumption A1) and the containment condition (assumption A3), we provide conditions in terms, essentially, of a uniform-in-θ drift condition and a uniform-in-θ minorization condition. For the convergence of the random sequence of expectations {πθn(f), n ≥ 0} to π⋆(f) (assumption A4), we provide a sufficient condition in terms of almost-sure convergence of the kernels {Pθn, n ≥ 0}, which implies the almost-sure convergence of {πθn(f), n ≥ 0} for any fixed bounded function f.

B1 (a) For any θ ∈ Θ, Pθ is phi-irreducible.

(b) There exist a measurable function V : X → [1, +∞), constants b ∈ (0, +∞), λ ∈ (0, 1) and a set C ∈ X such that for any θ ∈ Θ,

    Pθ V ≤ λ V + b 1_C .

(c) The level sets of V are 1-small uniformly in θ, i.e. for any v ≥ 1, there exist a constant ε > 0 and a probability measure ν such that for any θ ∈ Θ, Pθ(x, ·) ≥ ε 1_{{V ≤ v}}(x) ν(·).

Note that in B1b, we can assume without loss of generality (and we do so) that sup_C V < +∞. Note also that B1 implies that each transition kernel Pθ is aperiodic.

Proposition 1.3.4. (i) Assume B1. For any θ ∈ Θ, the kernel Pθ is positive recurrent with invariant probability πθ, and sup_θ πθ(V) ≤ (1 − λ)^{-1} b.

(ii) Under B1b, the containment condition A3 holds.

As shown by the counter-example in the introduction of this Section 1.3, the rate at which the parameter is adapted cannot be arbitrary. By Theorem 1.3.2, combining Proposition 1.3.4 with A2 yields the following result in the case πθ = π⋆ for all θ ∈ Θ.


Theorem 1.3.5 (Convergence of the Marginal when πθ = π⋆ for all θ ∈ Θ). Assume B1, A2 and πθ = π⋆ for all θ ∈ Θ. Then

    lim_n sup_{f ∈ bX} | E[f(Xn)] − π⋆(f) | = 0 .

In the general case, when the invariant distributions πθ depend upon θ, checking A4 is usually problem-specific. When πθ is available in closed form, it might be possible to check A4 directly. For external adaptation algorithms (such as the EE sampler), such an expression is not readily available (πθ is known to exist, but computing its expression is out of reach). In this case, the only available option is to state conditions on the transition kernel:

E There exist θ⋆ ∈ Θ and a set A ∈ A such that P(A) = 1 and, for all ω ∈ A, x ∈ X and B ∈ X, lim_n P_{θn(ω)}(x, B) = P_{θ⋆}(x, B).

Lemma 1.7.3 shows that this condition implies lim_n P^k_{θn(ω)}(x, B) = P^k_{θ⋆}(x, B) for any integer k. This result, combined with assumption B1, implies the following proposition.

Proposition 1.3.6. Assume B1 and E. Let α ∈ [0, 1) and let f : X → R be a measurable function such that f ∈ L_{V^α}. Then lim_n πθn(f) = π_{θ⋆}(f), P-a.s.

In E, it is assumed that the limiting kernel is of the form P_{θ⋆} for some θ⋆ ∈ Θ. This assumption can be relaxed by imposing additional conditions on the limiting kernel; these conditions can be read from the proof of Proposition 1.3.6. Details are left to the interested reader. We thus have:

Theorem 1.3.7 (Convergence of the Marginal). Assume B1, A2 and E. Then, for any bounded function f,

    lim_{n→∞} | E[f(Xn)] − π_{θ⋆}(f) | = 0 .

1.4 Strong law of large numbers

In this section, we establish a strong law of large numbers (LLN) for adaptive MCMC. Haario et al. (2001) were the first to prove the consistency of Monte Carlo averages for the algorithm described by (1.2) and (1.3) for bounded functions, using mixingale techniques; these results were later extended to unbounded functions by Atchadé and Rosenthal (2005). Andrieu and Moulines (2006) have established the consistency and the asymptotic normality of n^{-1} Σ_{k=1}^n f(Xk) for bounded and unbounded functions, for controlled MCMC algorithms associated with a stochastic approximation procedure (see Atchadé and Fort (to appear) for extensions). Roberts and Rosenthal (2007) prove a weak law of large numbers for bounded functions for general adaptive MCMC samplers. Finally, Atchadé (2009) provides a consistency result for an external adaptive sampler.

The proof is based on the so-called martingale technique (see (Meyn and Tweedie, 2009, Chapter 17)). We first consider this approach in the simple case of a homogeneous Markov chain, i.e. Pθ = P. In this context, the technique amounts to decomposing n^{-1} Σ_{k=1}^n f(Xk) − π⋆(f) as n^{-1} Mn(f) + Rn(f), where {Mn(f), n ≥ 0} is a P-martingale (w.r.t. the natural filtration) and Rn(f) is a remainder term. The martingale is shown to converge a.s. using standard results, and the remainder term is shown to converge to 0. This decomposition is not unique, and different choices can be found in the literature. The most usual is based on the Poisson equation


with forcing function f , namely f − P f = f − π?(f). Sufficient conditions for theexistence of a solution to the Poisson equation can be found in (Meyn and Tweedie,2009, Chapter 17) (see also Lemma 1.7.1 below). In terms of f , the martingale andthe remainder terms may be expressed as

Mn(f) def=n∑k=1

f(Xk)− P f(Xk−1)

, Rn(f) def= n−1

[P f(X0)− P f(Xn)

].
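For completeness (this telescoping check is not spelled out in the text), summing the Poisson equation over k gives

    Σ_{k=1}^n { f(Xk) − π⋆(f) } = Σ_{k=1}^n { f̂(Xk) − P f̂(Xk) }
        = Σ_{k=1}^n { f̂(Xk) − P f̂(X_{k−1}) } + P f̂(X_0) − P f̂(Xn) = Mn(f) + n Rn(f) ,

so that n^{-1} Σ_{k=1}^n f(Xk) − π⋆(f) = n^{-1} Mn(f) + Rn(f), as claimed.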

Proposition 1.4.1, which follows directly from (Hall and Heyde, 1980, Theorem 2.18), provides sufficient conditions for the almost-sure convergence of n^{-1} Σ_{k=1}^n f(Xk) to π⋆(f).

Proposition 1.4.1. Let {Xk, k ≥ 0} be a Markov chain with transition kernel P and invariant distribution π⋆. Let f : X → R be a measurable function; assume that the Poisson equation g − Pg = f − π⋆(f) with forcing function f possesses a solution denoted by f̂. If |P f̂(X_0)| < +∞ P-a.s. and there exists p ∈ [1, 2] such that Σ_k k^{-p} E[ |f̂(Xk)|^p ] < +∞, then lim_{n→∞} n^{-1} Σ_{k=1}^n f(Xk) = π⋆(f), P-a.s.

Provided that Σ_{j≥0} |P^j {f − π⋆(f)}(x)| < +∞ for any x ∈ X, the function f̂(x) def= Σ_{j≥0} P^j {f − π⋆(f)}(x) is a solution to the Poisson equation. Lemma 1.7.1 gives sufficient conditions for this series to converge: under conditions which are essentially a Foster-Lyapunov drift inequality of the form PV ≤ λV + b 1_C, for any β ∈ [0, 1] and any f ∈ L_{V^β}, a solution to the Poisson equation exists, and this solution satisfies sup_k E[ |f̂(Xk)|^p ] < +∞ for any p such that β ≤ pβ ≤ 1. Hence, Proposition 1.4.1 with p ∈ (1, 2] shows that a strong LLN holds for any function f ∈ L_{V^β}, β ∈ [0, 1). On the other hand, this drift condition also implies that π⋆(V) < +∞. The approach based on martingales therefore yields slightly weaker results than those obtained using the regenerative approach (see (Meyn and Tweedie, 2009, Section 17.2)).

The method based on martingales has been successfully used to prove strong LLNs (and central limit theorems) for different adaptive chains (see e.g. Andrieu and Moulines (2006); Atchadé and Fort (to appear); Atchadé (2009)). A weak law of large numbers for bounded functions is also proved in Roberts and Rosenthal (2007); the proof is not based on a martingale-type transformation but relies on the coupling of the adaptive process with a "frozen" chain with fixed transition kernel. Their convergence is established under the assumption that the kernels {Pθ, θ ∈ Θ} are simultaneously uniformly ergodic, i.e. lim_n sup_{x∈X} sup_{θ∈Θ} ‖P^n_θ(x, ·) − πθ‖_TV = 0. Using the same coupling approach, Yang (2008) also discusses the existence of a weak LLN for unbounded functions.

We develop below a general scheme of proof for the strong LLN, based on the martingale approach. Let f : X → R be a measurable function. For any θ ∈ Θ, denote by f̂_θ the solution to the Poisson equation g − Pθ g = f − πθ(f) with forcing function f. Consider the following decomposition:

    n^{-1} Σ_{k=1}^n { f(Xk) − π_{θ_{k−1}}(f) } = n^{-1} Mn(f) + Σ_{i=1}^2 R_{n,i}(f) ,    (1.17)


where

    Mn(f) def= Σ_{k=1}^n { f̂_{θ_{k−1}}(Xk) − P_{θ_{k−1}} f̂_{θ_{k−1}}(X_{k−1}) } ,

    R_{n,1}(f) def= n^{-1} ( P_{θ_0} f̂_{θ_0}(X_0) − P_{θ_{n−1}} f̂_{θ_{n−1}}(Xn) ) ,

    R_{n,2}(f) def= n^{-1} Σ_{k=1}^{n−1} { P_{θ_k} f̂_{θ_k}(Xk) − P_{θ_{k−1}} f̂_{θ_{k−1}}(Xk) } .

{Mn(f), n ≥ 0} is a P-martingale, and R_{n,i}(f), i = 1, 2, are remainder terms. Conditions for the almost-sure convergence to zero of {n^{-1} Mn(f), n ≥ 0} and of the residual term {R_{n,1}(f), n ≥ 1} are similar to those of Proposition 1.4.1; as discussed above, they are implied by a geometric drift inequality, that is, in the adaptive case, by B1. The remaining term R_{n,2}(f) requires additional attention; a (slight) strengthening of the diminishing adaptation condition is required. Define, for θ, θ′ ∈ Θ,

    D_{V^α}(θ, θ′) def= sup_{x∈X} V^{−α}(x) ‖Pθ(x, ·) − Pθ′(x, ·)‖_{V^α} .

Lemma 1.4.2. Assume B1 and let α ∈ (0, 1). Then there exists a constant C such that for any f ∈ L_{V^α} and any θ, θ′ ∈ Θ,

    ‖πθ − πθ′‖_{V^α} ≤ C D_{V^α}(θ, θ′) ,
    |Pθ f̂_θ − Pθ′ f̂_{θ′}|_{V^α} ≤ C |f|_{V^α} D_{V^α}(θ, θ′) ,
    |f̂_θ − f̂_{θ′}|_{V^α} ≤ C |f|_{V^α} D_{V^α}(θ, θ′) .

The proof is in Section 1.6. As a consequence of this lemma, a sufficient condition for the convergence of {R_{n,2}(f), n ≥ 0} when f ∈ L_{V^α} is:

B2 For any α ∈ (0, 1),

    Σ_{k≥1} k^{-1} D_{V^α}(θk, θ_{k−1}) V^α(Xk) < +∞ ,  P-a.s.

We now have all the ingredients to establish the following result.

Theorem 1.4.3. Assume B1, B2 and E[V(X_0)] < +∞. Then, for any α ∈ (0, 1) and f ∈ L_{V^α},

    lim_{n→∞} n^{-1} Σ_{k=1}^n { f(Xk) − π_{θ_{k−1}}(f) } = 0 ,  P-a.s.

If πθ = π⋆ for all θ ∈ Θ, Theorem 1.4.3 implies the strong LLN for any function f ∈ L_{V^α}, whatever α ∈ (0, 1). When πθ ≠ π⋆, we have to prove the almost-sure convergence of the term n^{-1} Σ_{k=1}^n π_{θ_{k−1}}(f) to π⋆(f) for some probability measure π⋆. This last convergence can be obtained from condition E. Combining Proposition 1.3.6 and Theorem 1.4.3 yields:

Theorem 1.4.4. Assume B1 and E. Then, for any α ∈ (0, 1) and f ∈ L_{V^α},

    lim_{n→+∞} n^{-1} Σ_{k=1}^n π_{θ_{k−1}}(f) = π_{θ⋆}(f) ,  P-a.s.

1.5 Convergence of the Equi-Energy sampler

We study the convergence of the EE sampler described in Section 1.2.3. We provethe convergence of the marginals and a law of large numbers.


If {Yn, n ≥ 0} is such that n^{-1} Σ_{k=1}^n f(Yk) → π_β(f) a.s. for any bounded function f, the empirical distributions {θn, n ≥ 0} converge weakly to π_β, so that, asymptotically, the dynamic of Xn is given by P_{π_β}. Since π P_{π_β} = π, it is expected that π governs the distribution of {Xn, n ≥ 0} asymptotically. By application of the results of Sections 1.3 and 1.4, this intuition can now be formalized.

1.5.1 Convergence of the marginal

We need X to be a Polish space. For simplicity, we assume that X = R^d, but our results remain true for more general state spaces X. In addition, it is assumed throughout this section that the rings are well behaved, so that the closure of each ring X_ℓ coincides with the closure of its interior. Assume that:

EES1 (a) π is a positive density on X and sup_X π < +∞.

(b) π is continuous on X.

EES2 P is a phi-irreducible transition kernel on (X, X) such that πP = π. The level sets {x ∈ X : π(x) ≥ p} are 1-small for P, for any p > 0.

EES3 (a) There exist τ ∈ (0, 1], λ ∈ (0, 1), b < +∞ and a set C ∈ X such that

    PV ≤ λV + b 1_C ,  with  V(x) def= ( π(x) / sup_X π )^{−τ(1−β)} .

(b) The probability of swapping ε is bounded by

    0 ≤ ε < (1 − λ) / ( 1 − λ + τ (1 − τ)^{(1−τ)/τ} ) .    (1.18)

If P is a symmetric random-walk Metropolis-Hastings (SRWM) kernel with a symmetric proposal distribution q which is positive on X, then πP = π and P is π-irreducible (Mengersen and Tweedie, 1996, Lemma 1.1). If in addition q is positive and continuous on X then, since π is positive and continuous on X, any compact set of X is 1-small (Mengersen and Tweedie, 1996, Lemma 1.2). This discussion shows that under EES1, condition EES2 is verified with P being, for example, a SRWM kernel with Gaussian proposal. Drift conditions for P a SRWM kernel have often been discussed in the literature: the conditions bear both on the target distribution and on the proposal distribution. Depending on how π decays in the tails, the drift condition is geometric (as assumed in EES3) or sub-geometric; see e.g. Roberts and Tweedie (1996) for the geometric case and Fort and Moulines (2000) for the polynomial case. Conditions on π such that EES3 holds when P is a SRWM kernel (resp. a hybrid sampler) can be found e.g. in Roberts and Tweedie (1996) and Jarner and Hansen (2000) (resp. in Fort et al. (2003)).

Under conditions which imply that the target density π is super-exponential, Jarner and Hansen (2000) show that functions proportional to π^{−s} for some s ∈ (0, 1) solve a Foster-Lyapunov drift inequality (Jarner and Hansen, 2000, Theorems 4.1 and 4.3). Hence, when P is a SRWM kernel with a target density and a proposal distribution satisfying the conditions of Jarner and Hansen (2000), we can choose τ ∈ (0, 1/(1 − β)).

Lemma 1.5.1. Assume EES1a and EES3. Then, for any θ ∈ Θ, the Foster-Lyapunov condition PθV(x) ≤ λ̃ V(x) + b 1_C(x) holds, with

    λ̃ def= (1 − ε) λ + ε ( 1 + τ (1 − τ)^{(1−τ)/τ} ) ;

note that λ̃ < 1 precisely when ε satisfies (1.18).


The proof of Lemma 1.5.1 is in Section 1.6.

Proposition 1.5.2. (i) Assume EES1a and EES3. Then B1b holds, and there exists a constant C such that sup_n E[V(Xn)] ≤ C E[V(X_0)].

(ii) Assume EES1a, EES2 and EES3. Then B1a and B1c hold.

EES4 For any bounded function f : X → R, lim_n θn(f) = π_β(f), P-a.s.

Condition EES4 holds whenever $\{Y_n, n \geq 0\}$ is an ergodic Markov chain with stationary distribution $\pi$. A sufficient condition for ergodicity is the phi-irreducibility, the aperiodicity and the existence of a unique invariant distribution $\pi$ for the transition kernel of the chain $\{Y_n, n \geq 0\}$. For example, such a Markov chain $\{Y_n, n \geq 0\}$ can be obtained by running a HM sampler with invariant distribution $\pi$, with a proposal distribution chosen in such a way that the sampler is ergodic (see e.g. Robert and Casella (1999)).
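Continuing the illustrative sketch given at the beginning of this section, a minimal auxiliary process of this kind is a plain SRWM chain with invariant density $\pi$; $\theta_n$ is then the empirical law of the first $n$ samples. Names and defaults are again ours.

```python
def auxiliary_chain(pi, y0, n, rng, scale=1.0):
    """SRWM chain targeting pi (assumes numpy imported as np, as above)."""
    ys, y = [], np.asarray(y0, dtype=float)
    for _ in range(n):
        z = y + scale * rng.standard_normal(y.shape)
        if rng.random() < min(1.0, pi(z) / pi(y)):   # symmetric proposal
            y = z
        ys.append(y.copy())
    return ys   # theta_n is the empirical distribution of ys[:n]
```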

Proposition 1.5.3. Assume EES1b and EES4. Then A2 and E hold.

The proof is in Section 1.6. The second assertion really depends upon the topology of $\mathsf{X}$: we assumed that $\mathsf{X} = \mathbb{R}^d$ but this condition can be weakened; details are left to the interested reader. From the above discussions, we have by application of Theorem 1.3.7:

Theorem 1.5.4. Assume EES1 to EES4. Then for any bounded function $f$,

$$\lim_n \left| \mathbb{E}[f(X_n)] - \pi(f) \right| = 0 .$$

1.5.2 Strong law of large numbers

We have to strengthen the conditions on the auxiliary process $\{Y_n, n \geq 0\}$ as follows.

EES5 $\sup_n \mathbb{E}[V(Y_n)] < +\infty$ and, for any $\alpha \in (0,1)$, $\lim_n \theta_n(V^\alpha) = \pi(V^\alpha)$ $\mathbb{P}$-a.s.

When $\{Y_n, n \geq 0\}$ is a Markov chain with transition kernel $Q$ and invariant distribution $\pi$, a sufficient condition for EES5 is the existence of a Foster-Lyapunov drift inequality for $Q$. More precisely, if $Q$ is phi-irreducible and aperiodic, if there exist constants $\lambda' \in (0,1)$, $b' \in (0,+\infty)$ and a function $W$ such that $QW \leq \lambda' W + b'$, and if $V \in \mathcal{L}_W$ and $\mathbb{E}[W(Y_0)] < +\infty$, then $\sup_n \mathbb{E}[W(Y_n)] \leq C \, \mathbb{E}[W(Y_0)]$ (see Lemma 1.7.1) and $\lim_n \theta_n(U) = \pi(U)$ a.s. for any function $U \in \mathcal{L}_W$ (see (Meyn and Tweedie, 2009, Theorem 17.1.7)). We thus have EES5.

Proposition 1.5.5. Assume EES1a, EES3, EES5 and $\mathbb{E}[V(X_0)] < +\infty$. Then, for any $\alpha \in (0,1)$, B2 holds.

The proof is in Section 1.6. We now have all the ingredients to apply Theorems 1.4.3 and 1.4.4: from Propositions 1.5.2, 1.5.3 and 1.5.5, we have

Theorem 1.5.6. Assume EES1 to EES5 and $\mathbb{E}[V(X_0)] < +\infty$. For any $\alpha \in (0,1)$ and any $f \in \mathcal{L}_{V^\alpha}$,

$$\lim_n n^{-1} \sum_{k=1}^{n} f(X_k) = \pi_\star(f) , \qquad \mathbb{P}\text{-a.s.}$$


We have thus established that the process $\{X_n, n \geq 0\}$ possesses $V$-moments which are uniformly bounded (Proposition 1.5.2), that its distribution converges to $\pi$ (Theorem 1.5.4) and that it satisfies a strong law of large numbers for a wide family of functions (Theorem 1.5.6). These results are obtained provided the auxiliary process $\{Y_n, n \geq 0\}$ also possesses uniformly bounded $V$-moments and satisfies a strong law of large numbers (see EES4 and EES5). Repeated applications of this analysis provide sufficient conditions for the equi-energy sampler with more than one auxiliary process - as originally described in Kou et al. (2006) - to be ergodic and to satisfy a strong law of large numbers. Details are omitted.
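For completeness, the two sketches above combine into a toy run producing the ergodic averages covered by Theorem 1.5.6; the target, the ring boundaries and the run length are arbitrary illustrative choices, and the quadratic-time ring scan is kept only for clarity.

```python
rng = np.random.default_rng(1)
pi = lambda x: float(np.exp(-0.5 * np.dot(x, x)))   # unnormalized N(0, I_2) target
levels = np.array([0.1, 0.5])                       # hypothetical ring boundaries
ys = auxiliary_chain(pi, np.zeros(2), 1000, rng)    # auxiliary samples Y_1..Y_n
x, avg = np.zeros(2), 0.0
for n in range(1000):
    x = ee_step(x, pi, beta=0.5, eps=0.1, history=ys[: n + 1],
                levels=levels, rng=rng)
    avg += x[0]
print(avg / 1000.0)   # n^{-1} sum_k f(X_k) with f(x) = x_1
```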

1.6 Proof of the main results

1.6.1 Proofs of Section 1.3

Proof of Proposition 1.3.1

We prove this by induction. By (1.4), the statement is trivially true for $N = 1$ and any $n \geq 0$. Suppose that the statement is true for some $N \geq 1$ and for all $n \geq 0$. Then, for $f \in b\mathcal{X}$:

$$\left| \mathbb{E}[f(X_{n+N+1}) \mid \mathcal{F}_n] - P_{\theta_n}^{N+1} f(X_n) \right| = \left| \mathbb{E}[P_{\theta_{n+N}} f(X_{n+N}) \mid \mathcal{F}_n] - P_{\theta_n}^{N+1} f(X_n) \right|$$
$$\leq \left| \mathbb{E}[P_{\theta_{n+N}} f(X_{n+N}) - P_{\theta_n} f(X_{n+N}) \mid \mathcal{F}_n] \right| + \left| \mathbb{E}[(P_{\theta_n} f)(X_{n+N}) \mid \mathcal{F}_n] - P_{\theta_n}^{N}(P_{\theta_n} f)(X_n) \right| .$$

The first term on the rhs is bounded by $\mathbb{E}[D(\theta_{n+N}, \theta_n) \mid \mathcal{F}_n]$ and the second term is bounded by $\sum_{j=1}^{N-1} \mathbb{E}[D(\theta_{n+j}, \theta_n) \mid \mathcal{F}_n]$ by the induction assumption.

Proof of Theorem 1.3.2

Let $\epsilon > 0$. Since $k \mapsto \| P_\theta^k(x,\cdot) - \pi_\theta \|_{\mathrm{TV}}$ is non-increasing,

$$\sup_{k \geq M_\epsilon(x,\theta)} \left\| P_\theta^k(x,\cdot) - \pi_\theta \right\|_{\mathrm{TV}} \leq \left\| P_\theta^{M_\epsilon(x,\theta)}(x,\cdot) - \pi_\theta \right\|_{\mathrm{TV}} \leq \epsilon , \qquad (1.19)$$

where we used the definition of $M_\epsilon(x,\theta)$ in the last inequality. Under A3, there exists $N \in \mathbb{N}$ such that $\mathbb{P}(M_\epsilon(X_n, \theta_n) \geq N) \leq \epsilon$ for any $n \in \mathbb{N}$.

Set $\eta \stackrel{\mathrm{def}}{=} \epsilon/(2N^2)$. Under A2, there exists $n_{\epsilon,N}$ such that for any $n \geq n_{\epsilon,N}$, $\mathbb{P}(D_{\mathrm{TV}}(\theta_{n+1}, \theta_n) \geq \eta) \leq \eta$. Using Proposition 1.3.1, we get for $n \geq N + n_{\epsilon,N}$

$$\sup_{f, |f|_\infty \leq 1} \mathbb{E}\left[ \left| \mathbb{E}[f(X_n) \mid \mathcal{F}_{n-N}] - P_{\theta_{n-N}}^N f(X_{n-N}) \right| \right] \leq \sum_{j=1}^{N-1} \sum_{l=1}^{j} \mathbb{E}\left[ D_{\mathrm{TV}}(\theta_{n-N+l}, \theta_{n-N+l-1}) \right]$$
$$\leq \epsilon/2 + \sum_{j=1}^{N-1} \sum_{l=1}^{j} \mathbb{P}\left( D_{\mathrm{TV}}(\theta_{n-N+l}, \theta_{n-N+l-1}) \geq \eta \right) \leq \epsilon .$$

We write for $n \geq N$,

$$\mathbb{E}[f(X_n) \mid \mathcal{F}_{n-N}] - \pi_{\theta_{n-N}}(f) = \mathbb{E}[f(X_n) \mid \mathcal{F}_{n-N}] - P_{\theta_{n-N}}^N f(X_{n-N}) + P_{\theta_{n-N}}^N f(X_{n-N}) - \pi_{\theta_{n-N}}(f) ,$$


so that for $n \geq N + n_{\epsilon,N}$ and for any $f \in b\mathcal{X}$,

$$\left| \mathbb{E}\left[ f(X_n) - \pi_{\theta_{n-N}}(f) \right] \right| \leq \mathbb{E}\left[ \left| \mathbb{E}[f(X_n) \mid \mathcal{F}_{n-N}] - P_{\theta_{n-N}}^N f(X_{n-N}) \right| \right] + \mathbb{E}\left[ \left| P_{\theta_{n-N}}^N f(X_{n-N}) - \pi_{\theta_{n-N}}(f) \right| \right]$$
$$\leq \epsilon + \mathbb{E}\left[ \left| P_{\theta_{n-N}}^N f(X_{n-N}) - \pi_{\theta_{n-N}}(f) \right| \right] .$$

Furthermore,

$$\mathbb{E}\left[ \left| P_{\theta_{n-N}}^N f(X_{n-N}) - \pi_{\theta_{n-N}}(f) \right| \right] \leq 2\, \mathbb{P}\left( M_\epsilon(X_{n-N}, \theta_{n-N}) \geq N \right) + \mathbb{E}\left[ \left\| P_{\theta_{n-N}}^N(X_{n-N}, \cdot) - \pi_{\theta_{n-N}}(\cdot) \right\|_{\mathrm{TV}} \mathbb{1}_{\{M_\epsilon(X_{n-N}, \theta_{n-N}) \leq N\}} \right] .$$

The rhs is upper bounded by $3\epsilon$, from (1.19) and the definition of $N$, uniformly in $f$ for $f \in b\mathcal{X}$.

Proof of Theorem 1.3.3

Let $\epsilon > 0$. Under A4, there exists $n_\epsilon$ such that

$$\sup_{n \geq n_\epsilon} \sup_{f \in \mathcal{F}} \mathbb{P}\left( |\pi_{\theta_{n-N}}(f) - \pi_\star(f)| \geq \epsilon \right) \leq \epsilon . \qquad (1.20)$$

By (1.20), we have

$$\sup_{n \geq n_\epsilon} \sup_{f \in \mathcal{F}} \mathbb{E}\left[ \left| \pi_{\theta_{n-N}}(f) - \pi_\star(f) \right| \right] \leq 2 \sup_{n \geq n_\epsilon} \sup_{f \in \mathcal{F}} \mathbb{P}\left( |\pi_{\theta_{n-N}}(f) - \pi_\star(f)| \geq \epsilon \right) + \epsilon \leq 3\epsilon .$$

This is combined with Theorem 1.3.2 to prove the stated result.

Proof of Proposition 1.3.4

By (Meyn and Tweedie, 2009, Theorems 14.0.1 and 14.3.7), $P_\theta$ possesses a unique invariant probability measure $\pi_\theta$ and $\pi_\theta(V) \leq (1-\lambda)^{-1} b\, \pi_\theta(C) \leq (1-\lambda)^{-1} b$.

By Lemma 1.7.1, there exists a constant $C$ (depending upon $\epsilon$) such that $\sup_\theta M_\epsilon(x,\theta) \leq C \ln V(x)$ and thus a constant $C'$ such that

$$\mathbb{P}\left( M_\epsilon(X_n, \theta_n) \geq M \right) \leq \mathbb{E}\left[ C' \frac{V(X_n)}{M} \wedge 1 \right] \leq \mathbb{E}\left[ C' \frac{\mathbb{E}[V(X_n) \mid \mathcal{F}_0]}{M} \wedge 1 \right] .$$

By iterating the drift inequality we have $\sup_n \mathbb{E}[V(X_n) \mid \mathcal{F}_0] \leq V(X_0) + (1-\lambda)^{-1} b < \infty$ (see (Andrieu and Moulines, 2006, Lemma 5) for a similar argument), which implies that for some constant $C''$,

$$\sup_n \mathbb{P}\left( M_\epsilon(X_n, \theta_n) \geq M \right) \leq \mathbb{E}\left[ C'' \frac{V(X_0)}{M} \wedge 1 \right] .$$

Since $V(x)$ is finite for any $x$, the dominated convergence theorem (letting $M \to \infty$) yields the desired result.


Proof of Proposition 1.3.6

Let $\epsilon > 0$. We have for any $k \geq 0$ and $x \in \mathsf{X}$,

$$|\pi_{\theta_n}(f) - \pi_{\theta_\star}(f)| \leq 2 \sup_{\theta \in \Theta} \left\| \pi_\theta - P_\theta^k(x,\cdot) \right\|_{V^\alpha} + |P_{\theta_n}^k f(x) - P_{\theta_\star}^k f(x)| .$$

Under B1, there exist constants $C < \infty$ and $\rho \in (0,1)$ such that $\sup_\theta \| P_\theta^k(x,\cdot) - \pi_\theta \|_{V^\alpha} \leq C \rho^k V^\alpha(x)$; see Lemma 1.7.1. We choose $k$ large enough so that $2 C \rho^k V^\alpha(x_\star) < \epsilon/2$. This implies that for any $\omega \in A$ and any $n \geq 0$,

$$|\pi_{\theta_n(\omega)}(f) - \pi_{\theta_\star}(f)| \leq \epsilon/2 + |P_{\theta_n(\omega)}^k f(x_\star) - P_{\theta_\star}^k f(x_\star)| .$$

It remains to prove that for any $\omega \in A$, the second term on the rhs tends to zero as $n \to +\infty$, for fixed $k, f, x_\star, \theta_\star$. We have

$$P_{\theta_n}^k f(x_\star) - P_{\theta_\star}^k f(x_\star) = \sum_{j=1}^{k} \left\{ P_{\theta_n}^j P_{\theta_\star}^{k-j} f(x_\star) - P_{\theta_n}^{j-1} P_{\theta_\star}^{k-j+1} f(x_\star) \right\}$$
$$= \sum_{j=1}^{k} \int_{\mathsf{X}^2} P_{\theta_n}^{j-1}(x_\star, dy) \left\{ P_{\theta_n}(y, dz) - P_{\theta_\star}(y, dz) \right\} P_{\theta_\star}^{k-j} f(z) .$$

The proof follows the same lines as (Atchade, 2009, Lemma 4.9). We establish that for any $j \leq k$ and any $\omega \in A$,

$$\lim_n \int_{\mathsf{X}^2} P_{\theta_n(\omega)}^{j-1}(x_\star, dy) \left\{ P_{\theta_n(\omega)}(y, dz) - P_{\theta_\star}(y, dz) \right\} P_{\theta_\star}^{k-j} f(z) = 0 , \qquad (1.21)$$

which will conclude the proof. This is done by two successive applications of Lemma 1.7.2. Under E, for any $\omega \in A$, $y \in \mathsf{X}$ and $j \leq k$, Lemma 1.7.2 applied with $\mu_n \leftarrow P_{\theta_n(\omega)}(y, \cdot)$, $\mu \leftarrow P_{\theta_\star}(y, \cdot)$, $f_n = f \leftarrow P_{\theta_\star}^{k-j} f$ and $V \leftarrow V^\alpha$ shows that for all $\omega \in A$, $y \in \mathsf{X}$, $j \leq k$,

$$\lim_n \int_{\mathsf{X}} \left\{ P_{\theta_n(\omega)}(y, dz) - P_{\theta_\star}(y, dz) \right\} P_{\theta_\star}^{k-j} f(z) = 0 . \qquad (1.22)$$

Let $j \leq k$. By Lemma 1.7.3, for any $\omega \in A$ and $B \in \mathcal{X}$, $\lim_n P_{\theta_n(\omega)}^{j-1}(x_\star, B) = P_{\theta_\star}^{j-1}(x_\star, B)$. Then (1.22) and Lemma 1.7.2 applied with $\mu_n \leftarrow P_{\theta_n(\omega)}^{j-1}(x_\star, \cdot)$, $\mu \leftarrow P_{\theta_\star}^{j-1}(x_\star, \cdot)$, $f_n(y) \leftarrow P_{\theta_n(\omega)} P_{\theta_\star}^{k-j} f(y)$, $f(y) \leftarrow P_{\theta_\star}^{k-j+1} f(y)$ and $V \leftarrow V^\alpha$, imply (1.21).

1.6.2 Proofs of Section 1.4

Proof of Lemma 1.4.2

We provide a self-contained proof which follows the same arguments as in (Benveniste et al., 1990, Part II). Let $\alpha \in (0,1)$. Jensen's inequality implies that $P_\theta V^\alpha \leq \lambda^\alpha V^\alpha + b^\alpha \mathbb{1}_C$. The results of Lemma 1.7.1 thus apply with $V$ replaced by $V^\alpha$.

First assertion. We have for any $n \geq 1$,

$$P_\theta^n f - P_{\theta'}^n f = \sum_{j=0}^{n-1} P_\theta^j (P_\theta - P_{\theta'}) \left( P_{\theta'}^{n-j-1} f - \pi_{\theta'}(f) \right) .$$

By Lemma 1.7.1(iii) there exist constants $C$ and $\rho \in (0,1)$ such that for any $n \geq 0$ and $x \in \mathsf{X}$, $\sup_\theta \| P_\theta^n(x,\cdot) - \pi_\theta \|_{V^\alpha} \leq C \rho^n V^\alpha(x)$. Furthermore, by iterating the drift inequality B1b, we have, for any $x \in \mathsf{X}$, $\sup_j \sup_\theta P_\theta^j V^\alpha(x) < +\infty$. This yields for any $k \geq 1$, $x_\star \in \mathsf{X}$,

$$\| \pi_\theta - \pi_{\theta'} \|_{V^\alpha} \leq \| \pi_\theta - P_\theta^k(x_\star,\cdot) \|_{V^\alpha} + \| P_\theta^k(x_\star,\cdot) - P_{\theta'}^k(x_\star,\cdot) \|_{V^\alpha} + \| P_{\theta'}^k(x_\star,\cdot) - \pi_{\theta'} \|_{V^\alpha}$$
$$\leq 2 C \rho^k V^\alpha(x_\star) + \sup_{|h|_{V^\alpha} \leq 1} \left| \sum_{j=0}^{k-1} P_\theta^j (P_\theta - P_{\theta'}) \left( P_{\theta'}^{k-j-1} h - \pi_{\theta'}(h) \right) (x_\star) \right| .$$

The second term on the rhs is upper bounded by

$$C D_{V^\alpha}(\theta, \theta') \sum_{j=0}^{k-1} \rho^{k-j-1} \sup_\theta P_\theta^j V^\alpha(x_\star) \leq \frac{C}{1-\rho} D_{V^\alpha}(\theta, \theta') \sup_j \sup_\theta P_\theta^j V^\alpha(x_\star) .$$

Hence, there exists a constant $C$ such that for any $k \geq 1$,

$$\| \pi_\theta - \pi_{\theta'} \|_{V^\alpha} \leq C \left\{ \rho^k + D_{V^\alpha}(\theta, \theta') \right\} .$$

Since this holds true for any $k$, this yields the desired result.

Second assertion. Under the stated assumptions, $\hat{f}_\theta$ exists and is given by $\sum_{n \geq 0} \{ P_\theta^n f - \pi_\theta(f) \}$. Following the same lines as in Benveniste et al. (1990) (see also (Andrieu and Moulines, 2006, Proposition 3)), we have for any $n \geq 1$,

$$P_\theta^n f - P_{\theta'}^n f = \sum_{j=0}^{n-1} \left( P_\theta^j - \pi_\theta \right) (P_\theta - P_{\theta'}) \left( P_{\theta'}^{n-j-1} f - \pi_{\theta'}(f) \right) + \sum_{j=0}^{n-1} \left\{ \pi_\theta P_{\theta'}^{n-j-1} f - \pi_\theta P_{\theta'}^{n-j} f \right\}$$
$$= \sum_{j=0}^{n-1} \left( P_\theta^j - \pi_\theta \right) (P_\theta - P_{\theta'}) \left( P_{\theta'}^{n-j-1} f - \pi_{\theta'}(f) \right) + \pi_\theta(f) - \pi_\theta P_{\theta'}^n f ,$$

where we used that $\pi_\theta P_\theta = \pi_\theta$ in the second equality (the second sum telescopes). This implies for any $n \geq 1$,

$$\left\{ P_\theta^n f - \pi_\theta(f) \right\} - \left\{ P_{\theta'}^n f - \pi_{\theta'}(f) \right\} = \sum_{j=0}^{n-1} \left( P_\theta^j - \pi_\theta \right) (P_\theta - P_{\theta'}) \left( P_{\theta'}^{n-j-1} f - \pi_{\theta'}(f) \right) - \left\{ \pi_\theta P_{\theta'}^n f - \pi_{\theta'}(f) \right\} .$$

Hence, summing the previous identity with respect to $n$,

$$P_\theta \hat{f}_\theta - P_{\theta'} \hat{f}_{\theta'} = \sum_{n \geq 1} \sum_{j=0}^{n-1} \left( P_\theta^j - \pi_\theta \right) (P_\theta - P_{\theta'}) \left( P_{\theta'}^{n-j-1} f - \pi_{\theta'}(f) \right) - \pi_\theta P_{\theta'} \hat{f}_{\theta'}$$
$$= \sum_{n \geq 1} \sum_{j=0}^{n-1} \left( P_\theta^j - \pi_\theta \right) (P_\theta - P_{\theta'}) \left( P_{\theta'}^{n-j-1} f - \pi_{\theta'}(f) \right) + (\pi_{\theta'} - \pi_\theta) P_{\theta'} \hat{f}_{\theta'} ,$$

where we used that $P_{\theta'} \hat{f}_{\theta'}$ is $\pi_{\theta'}$-integrable (see Lemma 1.7.1) and $\pi_{\theta'}(P_{\theta'} \hat{f}_{\theta'}) = 0$. Since $\sup_\theta |\hat{f}_\theta| \leq C V^\alpha$, we have

$$\left| (\pi_{\theta'} - \pi_\theta) \hat{f}_{\theta'} \right| \leq \sup_\theta |\hat{f}_\theta|_{V^\alpha} \, \| \pi_{\theta'} - \pi_\theta \|_{V^\alpha} .$$


This implies that for any $n \geq 1$ and $j \leq n-1$,

$$\left| \left( P_\theta^j - \pi_\theta \right) (P_\theta - P_{\theta'}) \left( P_{\theta'}^{n-j-1} f - \pi_{\theta'}(f) \right)(x) \right| \leq C \rho^j V^\alpha(x) \left| (P_\theta - P_{\theta'}) \left( P_{\theta'}^{n-j-1} f - \pi_{\theta'}(f) \right) \right|_{V^\alpha}$$
$$\leq C \rho^j V^\alpha(x) \sup_{x \in \mathsf{X}} V^{-\alpha}(x) \left\| P_\theta(x,\cdot) - P_{\theta'}(x,\cdot) \right\|_{V^\alpha} \left| P_{\theta'}^{n-j-1} f - \pi_{\theta'}(f) \right|_{V^\alpha}$$
$$\leq C |f|_{V^\alpha} \, \rho^{n-1} D_{V^\alpha}(\theta, \theta') \, V^\alpha(x) .$$

This concludes the proof of the second assertion. The third assertion is based on the equality

$$P_\theta \hat{f}_\theta(x) - P_{\theta'} \hat{f}_{\theta'}(x) = \hat{f}_\theta(x) - \hat{f}_{\theta'}(x) + \pi_\theta(f) - \pi_{\theta'}(f) ,$$

which follows from the Poisson equation $\hat{f}_\theta - P_\theta \hat{f}_\theta = f - \pi_\theta(f)$, and the two other assertions.

Proof of Theorem 1.4.3

Under B1, each transition kernel $P_\theta$ is aperiodic and $\sup_C V < +\infty$ (see the comments in Section 1.3.3). By Jensen's inequality, we have $P_\theta V^\alpha \leq \lambda^\alpha V^\alpha + b^\alpha \mathbb{1}_C$. Hence, by Lemma 1.7.1(ii), there exists a constant $C$ such that for any $\theta \in \Theta$, any $x \in \mathsf{X}$ and any $f \in \mathcal{L}_{V^\alpha}$, $|\hat{f}_\theta(x)| + |P_\theta \hat{f}_\theta(x)| \leq C |f|_{V^\alpha} V^\alpha(x)$. By iterating the drift inequality we have $\sup_n \mathbb{E}[V(X_n)] \leq \mathbb{E}[V(X_0)] + (1-\lambda)^{-1} b$. Therefore, there exists $p \in (1,2]$ such that

$$\sum_{k \geq 1} k^{-p}\, \mathbb{E}\left[ \left| \hat{f}_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1}) \right|^p \right] < +\infty .$$

As in Proposition 1.4.1, it is proved that $\lim_{n \to \infty} n^{-1} \{ M_n(f) + R_{n,1}(f) \} = 0$ $\mathbb{P}$-a.s. Lemma 1.4.2 shows that, for $f \in \mathcal{L}_{V^\alpha}$,

$$|R_{n,2}(f)| \leq C |f|_{V^\alpha} \, n^{-1} \sum_{k=1}^{n-1} D_{V^\alpha}(\theta_k, \theta_{k-1}) \, V^\alpha(X_k) ,$$

which implies, using the Kronecker lemma (Hall and Heyde, 1980, Section 2.6), that $R_{n,2}(f) \to 0$ a.s.

1.6.3 Proofs of Section 1.5

Proof of Lemma 1.5.1

Set $C_\theta(x) \stackrel{\mathrm{def}}{=} \int g(x,y)\, \theta(dy)$. If $C_\theta(x) = 0$, then $P_\theta V = P V \leq \lambda V + b \mathbb{1}_C$ and the result follows. Assume now that $C_\theta(x) > 0$. By (1.13),

$$K_\theta(x, V) = V(x) + C_\theta^{-1}(x) \int g(x,y)\, \alpha(x,y) \left\{ V(y) - V(x) \right\} \theta(dy) .$$

Following the same lines as in the proof of (Atchade, 2009, Lemma 4.2), we write for $x \in \mathsf{X}_\ell$,

$$V^{-1}(x) \int g(x,y)\, \alpha(x,y) \left\{ V(y) - V(x) \right\} \theta(dy) = \int_{\mathsf{X}_\ell} \left( 1 \wedge \frac{\pi^{1-\beta}(y)}{\pi^{1-\beta}(x)} \right) \left\{ \frac{V(y)}{V(x)} - 1 \right\} \theta(dy)$$
$$= \int_{A(x) \cap \mathsf{X}_\ell} \left\{ \frac{V(y)}{V(x)} - 1 \right\} \theta(dy) + \int_{R(x) \cap \mathsf{X}_\ell} \left( \frac{\pi(y)}{\pi(x)} \right)^{1-\beta} \left\{ \frac{V(y)}{V(x)} - 1 \right\} \theta(dy) ,$$

where $A(x) \stackrel{\mathrm{def}}{=} \{ y \in \mathsf{X} : \pi(y) \geq \pi(x) \} = \{ y : \alpha(x,y) = 1 \}$ is the acceptance region and $R(x) \stackrel{\mathrm{def}}{=} \mathsf{X} \setminus A(x)$ is the rejection region. Since $V \propto \pi^{-\tau(1-\beta)}$, the integral over the acceptance region is non-positive and, for $x \in \mathsf{X}_\ell$,

$$V^{-1}(x) \int g(x,y)\, \alpha(x,y) \left\{ V(y) - V(x) \right\} \theta(dy) \leq \int_{R(x) \cap \mathsf{X}_\ell} \left( \frac{\pi(y)}{\pi(x)} \right)^{1-\beta} \left\{ \left( \frac{\pi(y)}{\pi(x)} \right)^{-\tau(1-\beta)} - 1 \right\} \theta(dy) \leq \tau (1-\tau)^{(1-\tau)/\tau}\, C_\theta(x) ,$$

by using that, for $\tau \in (0,1]$, $\sup_{x \in [0,1]} x (x^{-\tau} - 1) \leq \tau (1-\tau)^{(1-\tau)/\tau}$. The proof follows.
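For the reader's convenience, the elementary bound invoked in the last step can be checked by calculus (this verification is ours). The map $x \mapsto x(x^{-\tau} - 1) = x^{1-\tau} - x$ on $[0,1]$ has derivative $(1-\tau) x^{-\tau} - 1$, which vanishes at $x_\star = (1-\tau)^{1/\tau}$, so that

$$\sup_{x \in [0,1]} x (x^{-\tau} - 1) = (1-\tau)^{(1-\tau)/\tau} - (1-\tau)^{1/\tau} = \tau (1-\tau)^{(1-\tau)/\tau} .$$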

Proof of Proposition 1.5.2

Lemma 1.5.1 implies B1b. By iterating the drift inequality we obtain the upper bound on $\sup_n \mathbb{E}[V(X_n)]$. Since $P$ is phi-irreducible and $\epsilon < 1$, $P_\theta$ is phi-irreducible for any $\theta \in \Theta$, thus showing B1a. The level sets $\{V \leq v\}$ of $V$ are small for $P$ and thus for $P_\theta$, uniformly in $\theta$, since $P_\theta(x,\cdot) \geq (1-\epsilon) P(x,\cdot)$. This yields B1c.

Proof of Proposition 1.5.3

(i) Set $C_\theta(x) \stackrel{\mathrm{def}}{=} \int g(x,y)\, \theta(dy)$. Under the stated assumptions, there exists $A \in \mathcal{A}$ such that $\mathbb{P}(A) = 1$ and, for any $\omega \in A$, any $\ell$ and any $x \in \mathsf{X}_\ell$,

$$\lim_n C_{\theta_n(\omega)}(x) = \int_{\mathsf{X}_\ell} \pi(dy) ,$$

and the rhs is positive under the assumptions on $\pi$ and on the partition of $\mathsf{X}$. Hence, there exists a random variable $N$, finite a.s., such that for any $n \geq N$ and any bounded function $f$ such that $|f|_\infty \leq 1$, we have a.s.

$$\left| P_{\theta_n} f(x) - P_{\theta_{n-1}} f(x) \right| \leq 2 \epsilon \left\| \frac{g(x,\cdot)\, \theta_n}{C_{\theta_n}(x)} - \frac{g(x,\cdot)\, \theta_{n-1}}{C_{\theta_{n-1}}(x)} \right\|_{\mathrm{TV}} ,$$

upon noting that, by definition of $\alpha$, $|\alpha(x,y)| \leq 1$. By definition of $\{\theta_n, n \geq 0\}$, we have a.s. for any $n \geq N$,

$$\sup_{x \in \mathsf{X}} \sup_{f : |f|_\infty \leq 1} \left| \frac{\theta_n(g(x,\cdot) f)}{C_{\theta_n}(x)} - \frac{\theta_{n-1}(g(x,\cdot) f)}{C_{\theta_{n-1}}(x)} \right| \leq 2 \sup_{x \in \mathsf{X}} \frac{1}{\sum_{k=1}^{n} g(x, Y_k)} .$$

By definition of $g$, $\sup_{x \in \mathsf{X}} \left( \sum_{k=1}^{n} g(x, Y_k) \right)^{-1} = n^{-1} \sup_{\ell \in \{1, \ldots, K\}} \left( \theta_n(\mathsf{X}_\ell) \right)^{-1}$. Hence, EES4 implies that the rhs tends to zero a.s., which concludes the proof.

(ii) Set $\theta_\star \stackrel{\mathrm{def}}{=} \pi$ and $C_{\theta_\star}(x) \stackrel{\mathrm{def}}{=} \int g(x,y)\, \pi(dy)$. As discussed above, there exists a random variable $N$, finite a.s., such that for any $n \geq N$, we have a.s.

$$P_{\theta_n(\omega)}(x, B) - P_\pi(x, B) = \epsilon \int_B \left\{ \frac{\theta_n(dy)}{C_{\theta_n}(x)} - \frac{\pi(dy)}{C_{\theta_\star}(x)} \right\} g(x,y)\, \alpha(x,y) - \epsilon\, \mathbb{1}_B(x) \int_{\mathsf{X}} \left\{ \frac{\theta_n(dy)}{C_{\theta_n}(x)} - \frac{\pi(dy)}{C_{\theta_\star}(x)} \right\} g(x,y)\, \alpha(x,y) .$$

Under EES4, for any $x \in \mathsf{X}$ and $B \in \mathcal{X}$, there exists $A_{x,B} \in \mathcal{A}$ such that $\mathbb{P}(A_{x,B}) = 1$ and for any $\omega \in A_{x,B}$,

$$\lim_n P_{\theta_n(\omega)}(x, B) = P_{\theta_\star}(x, B) . \qquad (1.23)$$


Since $\mathsf{X}$ is Polish, it admits a dense subset $\mathcal{D}$ and a countable generating algebra $\mathcal{B}$ of $\mathcal{X}$. Hence, there exists a set $A_{\mathcal{D},\mathcal{B}} \in \mathcal{A}$ such that $\mathbb{P}(A_{\mathcal{D},\mathcal{B}}) = 1$ and, for any $x \in \mathcal{D}$, $B \in \mathcal{B}$ and $\omega \in A_{\mathcal{D},\mathcal{B}}$, the limit (1.23) holds.

We now prove that the limit holds for any $x \in \mathcal{D}$ and $B \in \mathcal{X}$. Let $x \in \mathcal{D}$ and $B \in \mathcal{X}$. For $\delta > 0$, there exist $B_-, B_+ \in \mathcal{B}$ such that $B_- \subseteq B \subseteq B_+$ and

$$P_{\theta_\star}(x, B) - \delta \leq P_{\theta_\star}(x, B_-) \leq P_{\theta_\star}(x, B) \leq P_{\theta_\star}(x, B_+) \leq P_{\theta_\star}(x, B) + \delta .$$

By positivity of the probability $P_{\theta_n}(x, \cdot)$, we have $P_{\theta_n}(x, B_-) \leq P_{\theta_n}(x, B) \leq P_{\theta_n}(x, B_+)$. In addition, for any $\omega \in A_{\mathcal{D},\mathcal{B}}$, $\lim_n P_{\theta_n(\omega)}(x, B') = P_{\theta_\star}(x, B')$ for $B' \in \{B_-, B_+\}$. This yields

$$P_{\theta_\star}(x, B) - \delta \leq P_{\theta_\star}(x, B_-) \leq \liminf_n P_{\theta_n(\omega)}(x, B) \leq \limsup_n P_{\theta_n(\omega)}(x, B) \leq P_{\theta_\star}(x, B_+) \leq P_{\theta_\star}(x, B) + \delta .$$

This implies that, for any $\omega \in A_{\mathcal{D},\mathcal{B}}$, $x \in \mathcal{D}$ and $B \in \mathcal{X}$, (1.23) holds. The last step is to prove that the limit holds for any $x \in \mathsf{X}$. For $x \in \mathsf{X}$, we write

$$P_{\theta_n}(x, B) - P_{\theta_\star}(x, B) = \epsilon \left\{ N_{\theta_n}(x, B) - N_{\theta_\star}(x, B) - \mathbb{1}_B(x) \left( N_{\theta_n}(x, \mathsf{X}) - N_{\theta_\star}(x, \mathsf{X}) \right) \right\} ,$$

where $N_\theta(x, B) \stackrel{\mathrm{def}}{=} C_\theta(x)^{-1} \int \theta(dy)\, g(x,y)\, \alpha(x,y)\, \mathbb{1}_B(y)$. By EES1b and Lemma 1.6.1, for any $x \in \mathsf{X}$ and $k \geq 1$, there exists $x_k \in \mathcal{D}$ such that $\lim_k x_k = x$, $g(x, \cdot) = g(x_k, \cdot)$ (this is a consequence of the assumptions on the partition of $\mathsf{X}$) and, for any $B \in \mathcal{X}$ and $\theta \in \Theta$,

$$N_\theta(x_k, B) - \frac{1}{k} \leq N_\theta(x, B) \leq N_\theta(x_k, B) + \frac{1}{k} .$$

This implies that for any $\omega \in A_{\mathcal{D},\mathcal{B}}$, $B \in \mathcal{X}$ and $x \in \mathsf{X}$,

$$N_{\theta_\star}(x_k, B) - \frac{1}{k} \leq \liminf_n N_{\theta_n(\omega)}(x, B) \leq \limsup_n N_{\theta_n(\omega)}(x, B) \leq N_{\theta_\star}(x_k, B) + \frac{1}{k} ,$$

since $\lim_n N_{\theta_n(\omega)}(x_k, B) = N_{\theta_\star}(x_k, B)$. By Lemma 1.6.1, $\lim_k N_{\theta_\star}(x_k, B) = N_{\theta_\star}(x, B)$. Hence, we have proved that for any $\omega \in A_{\mathcal{D},\mathcal{B}}$, $x \in \mathsf{X}$ and $B \in \mathcal{X}$, (1.23) holds.

Proof of Proposition 1.5.5

As in the proof of Proposition 1.5.3, there exists a random variable $N$, finite a.s., such that for any $n \geq N$, we have a.s. $\sup_{x \in \mathsf{X}} |C_{\theta_n}(x) - C_{\theta_{n-1}}(x)| = 0$ and $\inf_{x \in \mathsf{X}} C_{\theta_n}(x) > 0$. We thus have for any $n \geq N$, a.s.

$$D_{V^\alpha}(\theta_n, \theta_{n-1}) \leq 2 \epsilon \sup_{x \in \mathsf{X}} \left\| \frac{g(x,\cdot)\, \theta_n}{C_{\theta_n}(x)} - \frac{g(x,\cdot)\, \theta_{n-1}}{C_{\theta_{n-1}}(x)} \right\|_{V^\alpha}$$
$$\leq n^{-1} \left\{ \sup_\ell \theta_n^{-1}(\mathsf{X}_\ell)\, V^\alpha(Y_n) + \left( (n-1)^{-1} \sum_{k=1}^{n-1} V^\alpha(Y_k) \right) \sup_\ell \theta_{n-1}^{-1}(\mathsf{X}_\ell) \right\} .$$

Note that $\lim_n \{ \theta_n^{-1}(\mathsf{X}_\ell) - \pi^{-1}(\mathsf{X}_\ell) \} = 0$ and $\lim_n n^{-1} \sum_{k=1}^{n-1} V^\alpha(Y_k) = \pi(V^\alpha) > 0$, $\mathbb{P}$-a.s. Hence, there exist a $\mathbb{P}$-a.s. finite integer-valued random variable $\bar{N} \geq N$ and two finite random variables $U_1$ and $U_2$ such that, for all $n \geq \bar{N}$,

$$D_{V^\alpha}(\theta_n, \theta_{n-1}) \leq n^{-1} \left( V^\alpha(Y_n)\, U_1 + U_2 \right) . \qquad (1.24)$$


To check B2, we prove that $\sum_{k \geq \bar{N}} k^{-2} V^\alpha(X_k) < +\infty$ and $\sum_{k \geq \bar{N}} k^{-2} V^\alpha(Y_k) V^\alpha(X_k) < +\infty$ a.s. by applying Lemma 1.7.4. The first sum converges since $\sup_k \mathbb{E}[V(X_k)] \leq C\, \mathbb{E}[V(X_0)]$ (see Proposition 1.5.2). For the second sum, we apply Lemma 1.7.4 with $a = (1+\delta)/2$ for some $\delta \in (0,1)$ such that $\alpha(1+\delta) \leq 1$. We have by Hölder's inequality

$$\mathbb{E}\left[ V^{\alpha a}(Y_k)\, V^{\alpha a}(X_k) \right] \leq \mathbb{E}\left[ V^{\alpha(1+\delta)}(Y_k) \right]^{1/2} \mathbb{E}\left[ V^{\alpha(1+\delta)}(X_k) \right]^{1/2} \leq \mathbb{E}[V(Y_k)]^{1/2}\, \mathbb{E}[V(X_k)]^{1/2} .$$

Under the stated assumptions, $\sup_k \mathbb{E}[V(Y_k) + V(X_k)] < +\infty$, which concludes the proof of the convergence of the series, and the proof.

Lemma 1.6.1 is adapted from (Atchade, 2009, Lemma 4.3).

Lemma 1.6.1. For any $\theta \in \Theta$, $x \in \mathsf{X}$ and $B \in \mathcal{X}$ such that $C_\theta(x) \stackrel{\mathrm{def}}{=} \int g(x,y)\, \theta(dy) \in (0, +\infty)$, define

$$N_\theta(x, B) \stackrel{\mathrm{def}}{=} C_\theta(x)^{-1} \int_B \theta(dy)\, g(x,y) \left\{ 1 \wedge \frac{\pi^{1-\beta}(y)}{\pi^{1-\beta}(x)} \right\} .$$

For any $x, x' \in \mathsf{X}$ such that $g(x, \cdot) = g(x', \cdot)$,

$$\sup_{B \in \mathcal{X}} \sup_{\theta \in \Theta} \left| N_\theta(x, B) - N_\theta(x', B) \right| \leq \left( \frac{\pi(x) \vee \pi(x')}{\pi(x) \wedge \pi(x')} \right)^{1-\beta} - 1 .$$

Proof. We assume that $\pi(x) \leq \pi(x')$. We have

$$C_\theta(x) \left| N_\theta(x, B) - N_\theta(x', B) \right| = \left| \int \theta(dy)\, g(x,y) \left\{ 1 \wedge \frac{\pi^{1-\beta}(y)}{\pi^{1-\beta}(x)} - 1 \wedge \frac{\pi^{1-\beta}(y)}{\pi^{1-\beta}(x')} \right\} \right|$$
$$\leq \int_{\{y : \pi(y) \leq \pi(x)\}} \theta(dy)\, g(x,y)\, \pi^{1-\beta}(y) \left( \frac{1}{\pi^{1-\beta}(x)} - \frac{1}{\pi^{1-\beta}(x')} \right) + \int_{\{y : \pi(x) \leq \pi(y) \leq \pi(x')\}} \theta(dy)\, g(x,y)\, \pi^{1-\beta}(y) \left( \frac{1}{\pi^{1-\beta}(y)} - \frac{1}{\pi^{1-\beta}(x')} \right)$$
$$\leq \left( \frac{1}{\pi^{1-\beta}(x)} - \frac{1}{\pi^{1-\beta}(x')} \right) \int_{\{y : \pi(y) \leq \pi(x')\}} \theta(dy)\, g(x,y)\, \pi^{1-\beta}(y) \leq \left( \frac{\pi^{1-\beta}(x')}{\pi^{1-\beta}(x)} - 1 \right) C_\theta(x) .$$

1.7 Technical lemmas

Lemma 1.7.1. Let $P$ be a phi-irreducible and aperiodic transition kernel. Assume that there exist a measurable function $V : \mathsf{X} \to [1, +\infty)$, $\lambda \in (0,1)$, $b \in (0, +\infty)$ and a set $C$ such that $PV \leq \lambda V + b \mathbb{1}_C$ and $\sup_C V < +\infty$. Assume in addition that the level sets $\{V \leq v\}$ of $V$ are 1-small with minorizing constant $\epsilon_v$.

(i) The chain possesses an invariant distribution $\pi$ such that $\pi(V) \leq (1-\lambda)^{-1} b$. In addition, $\sup_n \int \xi(dx)\, P^n V(x) \leq \lambda\, \xi(V) + (1-\lambda)^{-1} b$ for any distribution $\xi$ on $(\mathsf{X}, \mathcal{X})$.


(ii) For any function $f \in \mathcal{L}_V$, there exists a solution $\hat{f}$ to the Poisson equation which satisfies, for any $a \in (0,1)$ and any $x \in \mathsf{X}$,

$$|\hat{f}(x)| \leq \frac{|f|_V}{a} \left( V(x) + \pi(V) + \frac{2b + \lambda \left( \sup_C V + (1-a)^{-1} b \right)}{\epsilon_{\sup_C V \vee (1-a)^{-1} b}} \right) .$$

(iii) There exist constants $C$ and $\rho \in (0,1)$, depending upon $\sup_C V$, $b$, $\lambda$ and the minorizing constant $\epsilon_{\sup_C V}$, such that

$$\| P^n(x, \cdot) - \pi \|_V \leq C \rho^n V(x) .$$

Proof. The control of $\pi(V)$ is a consequence of (Meyn and Tweedie, 2009, Theorem 14.3.7). The uniform control of $\mathbb{E}_\xi[V(X_n)]$ is obtained by iterating the drift inequality. Item (ii) is a consequence of (Fort and Moulines, 2003, Proposition 22); details of the proof are omitted. Item (iii) is proved in Douc et al. (2004).

Lemma 1.7.2 is established in (Atchade, 2009, Lemma 4.8).

Lemma 1.7.2. Let $\{\mu_n, n \geq 0\}$ be a family of probability measures on $(\mathsf{X}, \mathcal{X})$ and $\{f_n : \mathsf{X} \to \mathbb{R}, n \geq 0\}$ be a family of measurable functions. Assume that

(i) $\mu_n(A) \to \mu(A)$ for any $A \in \mathcal{X}$.

(ii) there exists $f : \mathsf{X} \to \mathbb{R}$ such that $f_n(x) \to f(x)$ for any $x \in \mathsf{X}$.

(iii) there exists a function $V : \mathsf{X} \to (0, +\infty)$ such that $\sup_n |f_n|_V < +\infty$, $\mu(V) < +\infty$ and $\sup_n \mu_n(V^\alpha) < +\infty$ for some $\alpha > 1$.

Then $\lim_n \mu_n(f_n) = \mu(f)$.

Lemma 1.7.3. Assume E. Then for any $k \geq 1$, for all $\omega \in A$, $x \in \mathsf{X}$ and $B \in \mathcal{X}$, $\lim_n P_{\theta_n(\omega)}^k(x, B) = P_{\theta_\star}^k(x, B)$.

Proof. The proof is by induction on $k$. The case $k = 1$ holds by E. Assume that the property holds for $k$. We write

$$P_{\theta_n(\omega)}^{k+1}(x, B) - P_{\theta_\star}^{k+1}(x, B) = \int P_{\theta_n(\omega)}^k(x, dy)\, P_{\theta_n(\omega)}(y, B) - \int P_{\theta_\star}^k(x, dy)\, P_{\theta_\star}(y, B)$$

and apply Lemma 1.7.2 with $\mu_n \leftarrow P_{\theta_n(\omega)}^k(x, \cdot)$, $\mu \leftarrow P_{\theta_\star}^k(x, \cdot)$, $f_n \leftarrow P_{\theta_n(\omega)}(\cdot, B)$, $f \leftarrow P_{\theta_\star}(\cdot, B)$ and $V = 1$. The induction assumption and E imply that the conditions of Lemma 1.7.2 hold, and thus the property holds for $k+1$. This concludes the induction.

Lemma 1.7.4. Let $\{W_k, k \geq 1\}$ be a family of positive random variables and $\{\rho_k, k \geq 1\}$ be a positive sequence. Then the series $\sum_k \rho_k W_k$ is finite $\mathbb{P}$-a.s. if there exists $a \in (0,1]$ such that (a) $\sum_k \rho_k^a < +\infty$ and (b) $\sup_k \mathbb{E}[W_k^a] < +\infty$.

Proof. We have by sub-additivity

$$\mathbb{E}\left[ \left( \sum_k \rho_k W_k \right)^a \right] \leq \sum_k \rho_k^a\, \mathbb{E}[W_k^a] \leq \left( \sup_k \mathbb{E}[W_k^a] \right) \sum_k \rho_k^a .$$

The rhs is finite, so $\left( \sum_k \rho_k W_k \right)^a$ - and hence the series itself - is finite $\mathbb{P}$-a.s.


Bibliography

Andrieu, C., Jasra, A., Doucet, A. and Del Moral, P. (2007). On non-linear Markov chain Monte Carlo via self-interacting approximations. Tech. rep., available at http://stats.ma.ic.ac.uk/a/aj2/public html/.

Andrieu, C. and Moulines, E. (2006). On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab. 16 1462–1505.

Andrieu, C., Moulines, E. and Priouret, P. (2005). Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. 44 283–312 (electronic).

Andrieu, C. and Robert, C. (2001). Controlled MCMC for optimal sampling. Tech. Rep. 0125, Cahiers de Mathématiques du Ceremade, Université Paris-Dauphine.

Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and Computing 18 343–373.

Atchade, Y. (2009). A cautionary tale on the efficiency of some adaptive Monte Carlo schemes. Tech. rep., arXiv:0901.1378v1.

Atchade, Y. and Fort, G. (to appear). Limit theorems for some adaptive MCMC algorithms with subgeometric kernels. Bernoulli. arXiv math.PR/0807.2952 (revision, Jan 2009).

Atchade, Y. F. and Rosenthal, J. S. (2005). On adaptive Markov chain Monte Carlo algorithms. Bernoulli 11 815–828.

Bai, Y. (2008). The simultaneous drift conditions for adaptive Markov chain Monte Carlo algorithms. Tech. rep., Univ. of Toronto, available at http://www.probability.ca/jeff/ftpdir/yanbai2.pdf.

Bai, Y., Roberts, G. and Rosenthal, J. (2009). On the containment condition for adaptive Markov chain Monte Carlo algorithms. Tech. rep., Univ. of Toronto, available at http://www.probability.ca/jeff/.

Benveniste, A., Metivier, M. and Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations, vol. 22. Springer-Verlag, Berlin.

Cappe, O. and Moulines, E. (2009). Online EM algorithm for latent data models. To appear in J. Roy. Stat. Soc. B.


Chauveau, D. and Vandekerkhove, P. (1999). Un algorithme de Hastings-Metropolis avec apprentissage séquentiel. C. R. Acad. Sci. Paris Sér. I Math. 329 173–176.

Chauveau, D. and Vandekerkhove, P. (2001). Algorithmes de Hastings-Metropolis en interaction. C. R. Acad. Sci. Paris Sér. I Math. 333 881–884.

Chauveau, D. and Vandekerkhove, P. (2002). Improving convergence of the Hastings-Metropolis algorithm with an adaptive proposal. Scand. J. Statist. 29 13–29.

Craiu, R. V., Rosenthal, J. S. and Yang, C. (2008). Learn from thy neighbor: Parallel-chain adaptive MCMC. Tech. rep., University of Toronto, available at http://www.probability.ca/jeff/ftpdir/chao4.pdf.

Douc, R., Moulines, E. and Rosenthal, J. (2004). Quantitative bounds for geometric convergence rates of Markov chains. Ann. Appl. Probab. 14 1643–1665.

Fort, G. and Moulines, E. (2000). V-subgeometric ergodicity for a Hastings-Metropolis algorithm. Stat. Probab. Lett. 49 401–410.

Fort, G. and Moulines, E. (2003). Convergence of the Monte Carlo Expectation Maximization for curved exponential families. Ann. Statist. 31 1220–1251.

Fort, G., Moulines, E., Roberts, G. and Rosenthal, J. (2003). On the geometric ergodicity of hybrid samplers. J. Appl. Probab. 40 123–146.

Gelman, A., Roberts, G. O. and Gilks, W. R. (1996). Efficient Metropolis jumping rules. In Bayesian Statistics, 5 (Alicante, 1994). Oxford Sci. Publ., Oxford Univ. Press, New York.

Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics: Proc. 23rd Symposium on the Interface, Interface Foundation, Fairfax Station, VA 156–163.

Geyer, C. J. and Thompson, E. (1995). Annealing Markov chain Monte Carlo with applications to pedigree analysis. Journal of the American Statistical Association 90 909–920.

Giordani, P. and Kohn, R. (2008). Adaptive Independent Metropolis-Hastings by Fast Estimation of Mixtures of Normals.

Haario, H., Saksman, E. and Tamminen, J. (1999). Adaptive proposal distribution for random walk Metropolis algorithm. Comput. Stat. 14 375–395.

Haario, H., Saksman, E. and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli 7 223–242.

Haario, H., Saksman, E. and Tamminen, J. (2005). Componentwise adaptation for high dimensional MCMC. Comput. Statist. 20 265–273.

Hall, P. and Heyde, C. (1980). Martingale Limit Theory and Its Application. Academic Press, New York.

Jarner, S. and Hansen, E. (2000). Geometric ergodicity of Metropolis algorithms. Stoch. Proc. Appl. 85 341–361.


Jasra, A., Stephens, D. A. and Holmes, C. C. (2007). On population-based simulation for static inference. Stat. Comput. 17 263–279.

Keith, J., Kroese, D. and Sofronov, G. (2008). Adaptive independence samplers. Statistics and Computing 18 409–420.

Kou, S., Zhou, Q. and Wong, W. (2006). Equi-energy sampler with applications to statistical inference and statistical mechanics (with discussion). Ann. Statist. 34 1581–1619.

Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. 2nd ed. Springer, New York.

Marinari, E. and Parisi, G. (1992). Simulated tempering: a new Monte Carlo scheme. Europhysics Letters 19 451–458.

Mengersen, K. and Tweedie, R. (1996). Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24 101–121.

Meyn, S. and Tweedie, R. (2009). Markov Chains and Stochastic Stability. 2nd ed. Cambridge University Press.

Robert, C. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer-Verlag, New York.

Roberts, G., Gelman, A. and Gilks, W. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability 7 110–120.

Roberts, G. and Rosenthal, J. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statist. Sci. 16 351–367.

Roberts, G. and Rosenthal, J. (2006). Examples of adaptive MCMC. Tech. rep., Univ. of Toronto. To appear in J. Computational Graphical Statistics.

Roberts, G. and Tweedie, R. (1996). Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 83 95–110.

Roberts, G. O. and Rosenthal, J. S. (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Probab. 44 458–475.

Rosenthal, J. S. (2007). AMCMC: An R interface for adaptive MCMC. Comput. Statist. Data Anal. 51 5467–5470.

Rosenthal, J. S. (2009). MCMC Handbook, chap. Optimal Proposal Distributions and Adaptive MCMC. Chapman & Hall/CRC Press.

Rubinstein, R. and Kroese, D. (2008). Simulation and the Monte Carlo Method. 2nd ed. Wiley Series in Probability and Statistics, Wiley-Interscience [John Wiley & Sons], Hoboken, NJ.

Turro, E., Bochkina, N., Hein, A.-M. and Richardson, S. (2007). BGX: a Bioconductor package for the Bayesian integrated analysis of Affymetrix GeneChips. BMC Bioinformatics 8 439.

Yang, C. (2008). On the weak law of large numbers for unbounded functionals of adaptive MCMC. Tech. rep., Univ. of Toronto, available at http://www.probability.ca/jeff/ftpdir/chao2.pdf.