Bayesian Parameter Inference in State-Space Models using Particle MCMC

Arnaud Doucet

Department of Statistics, Oxford University

University College London

5th October 2012

A. Doucet (UCL Masterclass Oct. 2012) 5th October 2012

State-Space Models with Unknown Parameters

In most scenarios of interest, the state-space model contains an unknown static parameter θ ∈ Θ so that

\[ X_1 \sim \mu_\theta(\cdot) \quad \text{and} \quad X_t \mid (X_{t-1} = x_{t-1}) \sim f_\theta(\cdot \mid x_{t-1}). \]

The observations $\{Y_t\}_{t \ge 1}$ are conditionally independent given $\{X_t\}_{t \ge 1}$ and θ:

\[ Y_t \mid (X_t = x_t) \sim g_\theta(\cdot \mid x_t). \]

Aim: Given observations $y_{1:T}$, we want to infer θ in a Bayesian framework.

Bayesian Parameter Inference in State-Space Models

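Set a prior p(θ) on θ so that inference now relies on

\[ p(\theta, x_{1:T} \mid y_{1:T}) = \frac{p(\theta, x_{1:T}, y_{1:T})}{p(y_{1:T})} \]

where

\[ p(\theta, x_{1:T}, y_{1:T}) = p(\theta)\, p_\theta(x_{1:T}, y_{1:T}), \]

\[ p_\theta(x_{1:T}, y_{1:T}) = \mu_\theta(x_1) \prod_{k=2}^{T} f_\theta(x_k \mid x_{k-1}) \prod_{k=1}^{T} g_\theta(y_k \mid x_k). \]

Standard approaches rely on MCMC.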

Common MCMC Approaches and Limitations

MCMC Idea: Simulate an ergodic Markov chain $\{\theta(i), X_{1:T}(i)\}_{i \ge 0}$ of invariant distribution $p(\theta, x_{1:T} \mid y_{1:T})$... infinite number of possibilities.

Typical strategies consist of iteratively updating $X_{1:T}$ conditional upon θ, then θ conditional upon $X_{1:T}$.

To update $X_{1:T}$ conditional upon θ, use MCMC kernels updating subblocks according to $p_\theta(x_{t:t+K-1} \mid y_{t:t+K-1}, x_{t-1}, x_{t+K})$.

Standard MCMC algorithms are inefficient if θ and $X_{1:T}$ are strongly correlated.

This strategy is impossible to implement when it is only possible to sample from the prior but impossible to evaluate it pointwise.

Metropolis-Hastings (MH) Sampling

To bypass these problems, we want to update θ and $X_{1:T}$ jointly.

Assume that the current state of our Markov chain is $(\theta, x_{1:T})$; we propose to update the parameter and the states simultaneously using a proposal

\[ q((\theta^*, x^*_{1:T}) \mid (\theta, x_{1:T})) = q(\theta^* \mid \theta)\, q_{\theta^*}(x^*_{1:T} \mid y_{1:T}). \]

The proposal $(\theta^*, x^*_{1:T})$ is accepted with MH acceptance probability

\[ 1 \wedge \frac{p(\theta^*, x^*_{1:T} \mid y_{1:T})}{p(\theta, x_{1:T} \mid y_{1:T})} \, \frac{q((x_{1:T}, \theta) \mid (x^*_{1:T}, \theta^*))}{q((x^*_{1:T}, \theta^*) \mid (x_{1:T}, \theta))}. \]

Problem: Designing a proposal $q_{\theta^*}(x^*_{1:T} \mid y_{1:T})$ such that the acceptance probability is not extremely small is very difficult.

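“Idealized” Marginal MH Sampler

Consider the following so-called marginal Metropolis-Hastings (MH) algorithm, which uses as a proposal

\[ q((x^*_{1:T}, \theta^*) \mid (x_{1:T}, \theta)) = q(\theta^* \mid \theta)\, p_{\theta^*}(x^*_{1:T} \mid y_{1:T}). \]

The MH acceptance probability is

\[ 1 \wedge \frac{p(\theta^*, x^*_{1:T} \mid y_{1:T})}{p(\theta, x_{1:T} \mid y_{1:T})} \, \frac{q((x_{1:T}, \theta) \mid (x^*_{1:T}, \theta^*))}{q((x^*_{1:T}, \theta^*) \mid (x_{1:T}, \theta))} = 1 \wedge \frac{p_{\theta^*}(y_{1:T})\, p(\theta^*)}{p_\theta(y_{1:T})\, p(\theta)} \, \frac{q(\theta \mid \theta^*)}{q(\theta^* \mid \theta)}. \]

In this MH algorithm, $X_{1:T}$ has been essentially integrated out.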
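The simplification of the acceptance ratio can be spelled out as a short worked step. It uses only the factorisation $p(\theta, x_{1:T} \mid y_{1:T}) = p(\theta \mid y_{1:T})\, p_\theta(x_{1:T} \mid y_{1:T})$ and Bayes' rule $p(\theta \mid y_{1:T}) = p_\theta(y_{1:T})\, p(\theta) / p(y_{1:T})$; no notation beyond that of the slides is assumed.

```latex
\begin{align*}
&\frac{p(\theta^*, x^*_{1:T}\mid y_{1:T})}{p(\theta, x_{1:T}\mid y_{1:T})}
 \cdot
 \frac{q(\theta\mid\theta^*)\,p_{\theta}(x_{1:T}\mid y_{1:T})}
      {q(\theta^*\mid\theta)\,p_{\theta^*}(x^*_{1:T}\mid y_{1:T})} \\
&\quad= \frac{p(\theta^*\mid y_{1:T})\,p_{\theta^*}(x^*_{1:T}\mid y_{1:T})}
             {p(\theta\mid y_{1:T})\,p_{\theta}(x_{1:T}\mid y_{1:T})}
 \cdot
 \frac{q(\theta\mid\theta^*)\,p_{\theta}(x_{1:T}\mid y_{1:T})}
      {q(\theta^*\mid\theta)\,p_{\theta^*}(x^*_{1:T}\mid y_{1:T})} \\
&\quad= \frac{p(\theta^*\mid y_{1:T})}{p(\theta\mid y_{1:T})}
 \cdot \frac{q(\theta\mid\theta^*)}{q(\theta^*\mid\theta)}
 = \frac{p_{\theta^*}(y_{1:T})\,p(\theta^*)}{p_{\theta}(y_{1:T})\,p(\theta)}
 \cdot \frac{q(\theta\mid\theta^*)}{q(\theta^*\mid\theta)}.
\end{align*}
```

The conditional densities of the states cancel exactly because the states are proposed from their true conditional distribution, and the normalising constant $p(y_{1:T})$ cancels between numerator and denominator.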

Implementation Issues

Problem 1: We do not know $p_\theta(y_{1:T}) = \int p_\theta(x_{1:T}, y_{1:T})\, dx_{1:T}$ analytically.

Problem 2: We do not know how to sample from $p_\theta(x_{1:T} \mid y_{1:T})$.

“Idea”: Use particle approximations of $p_\theta(x_{1:T} \mid y_{1:T})$ and $p_\theta(y_{1:T})$.

Particle Filters

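Given θ, particle methods provide approximations of $p_\theta(x_{1:T} \mid y_{1:T})$ and $p_\theta(y_{1:T})$.

At time T, we obtain the following approximation of the posterior of interest

\[ \hat p_\theta(x_{1:T} \mid y_{1:T}) = \frac{1}{N} \sum_{k=1}^{N} \delta_{X^{(k)}_{1:T}}(x_{1:T}) \]

and an approximation of $p_\theta(y_{1:T})$ is given by

\[ \hat p_\theta(y_{1:T}) = \hat p_\theta(y_1) \prod_{t=2}^{T} \hat p_\theta(y_t \mid y_{1:t-1}) = \prod_{t=1}^{T} \left( \frac{1}{N} \sum_{k=1}^{N} g_\theta(y_t \mid X^{(k)}_t) \right) \]

if we use $f_\theta(x_t \mid x_{t-1})$ as a proposal.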
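The likelihood estimate above can be sketched in code as a bootstrap particle filter. This is a minimal illustration, not the implementation used in the talk: the function names (`bootstrap_log_likelihood`, `demo`), the use of multinomial resampling at every step, and the toy AR(1)-plus-noise model with its parameter values and data are all assumptions made for the example. Only the current particles are tracked, so the sketch returns the log of $\hat p_\theta(y_{1:T})$ but not the path approximation.

```python
import math
import random


def bootstrap_log_likelihood(y, n_particles, sample_init, sample_transition,
                             log_obs_density, rng):
    """Bootstrap particle filter estimate of log p_theta(y_{1:T}).

    sample_init(rng)           -> one draw from mu_theta
    sample_transition(x, rng)  -> one draw from f_theta(. | x)
    log_obs_density(y_t, x)    -> log g_theta(y_t | x)
    """
    particles = [sample_init(rng) for _ in range(n_particles)]
    log_lik = 0.0
    for t, y_t in enumerate(y):
        if t > 0:
            # Propagate each particle with the transition density f_theta.
            particles = [sample_transition(x, rng) for x in particles]
        log_w = [log_obs_density(y_t, x) for x in particles]
        # log of (1/N) * sum_k g_theta(y_t | X_t^(k)), computed stably.
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        log_lik += m + math.log(sum(w) / n_particles)
        # Multinomial resampling proportional to the weights (an assumption).
        particles = rng.choices(particles, weights=w, k=n_particles)
    return log_lik


def demo(seed=0, n_particles=200):
    """Toy AR(1)-plus-noise model; all numbers here are made up."""
    rng = random.Random(seed)
    phi, sd_x, sd_y = 0.8, 1.0, 0.5
    y = [0.3, -0.1, 0.7, 1.2, 0.4]

    def log_g(y_t, x):
        return (-0.5 * math.log(2.0 * math.pi * sd_y ** 2)
                - 0.5 * ((y_t - x) / sd_y) ** 2)

    return bootstrap_log_likelihood(
        y, n_particles,
        sample_init=lambda r: r.gauss(0.0, 1.0),
        sample_transition=lambda x, r: phi * x + r.gauss(0.0, sd_x),
        log_obs_density=log_g,
        rng=rng,
    )
```

Calling `demo()` returns one random draw of the log-likelihood estimate; repeated calls with different seeds give a feel for its variance at a given number of particles.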

A Few Theoretical Results

We have $E[\hat p_\theta(y_{1:T})] = p_\theta(y_{1:T})$.

Under mixing assumptions, we have

\[ \frac{V[\hat p_\theta(y_{1:T})]}{p_\theta^2(y_{1:T})} \le D_\theta. \]

Under mixing assumptions, we also have

\[ \int \left| E[\hat p_\theta(x_{1:T} \mid y_{1:T})] - p_\theta(x_{1:T} \mid y_{1:T}) \right| dx_{1:T} \le C_\theta, \]

so if I run a particle method to obtain $\hat p_\theta(x_{1:T} \mid y_{1:T})$ and then sample $X_{1:T} \sim \hat p_\theta(x_{1:T} \mid y_{1:T})$, then unconditionally $X_{1:T} \sim E[\hat p_\theta(\cdot \mid y_{1:T})]$.

Problem: We cannot compute analytically the particle filter proposal $q_\theta(x_{1:T} \mid y_{1:T}) = E[\hat p_\theta(x_{1:T} \mid y_{1:T})]$, as it involves an expectation w.r.t. all the variables appearing in the particle algorithm...

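“Idealized” Marginal MH Sampler

At iteration i

Sample $\theta^* \sim q(\cdot \mid \theta(i-1))$.

Sample $X^*_{1:T} \sim p_{\theta^*}(\cdot \mid y_{1:T})$.

With probability

\[ 1 \wedge \frac{p_{\theta^*}(y_{1:T})\, p(\theta^*)}{p_{\theta(i-1)}(y_{1:T})\, p(\theta(i-1))} \, \frac{q(\theta(i-1) \mid \theta^*)}{q(\theta^* \mid \theta(i-1))} \]

set $\theta(i) = \theta^*$, $X_{1:T}(i) = X^*_{1:T}$; otherwise set $\theta(i) = \theta(i-1)$, $X_{1:T}(i) = X_{1:T}(i-1)$.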

Particle Marginal MH Sampler

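At iteration i

Sample $\theta^* \sim q(\cdot \mid \theta(i-1))$ and run a particle filter to obtain $\hat p_{\theta^*}(x_{1:T} \mid y_{1:T})$ and $\hat p_{\theta^*}(y_{1:T})$.

Sample $X^*_{1:T} \sim \hat p_{\theta^*}(\cdot \mid y_{1:T})$.

With probability

\[ 1 \wedge \frac{\hat p_{\theta^*}(y_{1:T})\, p(\theta^*)}{\hat p_{\theta(i-1)}(y_{1:T})\, p(\theta(i-1))} \, \frac{q(\theta(i-1) \mid \theta^*)}{q(\theta^* \mid \theta(i-1))} \]

set $\theta(i) = \theta^*$, $X_{1:T}(i) = X^*_{1:T}$ and $\hat p_{\theta(i)}(y_{1:T}) = \hat p_{\theta^*}(y_{1:T})$; otherwise keep the values from iteration $i-1$, including the stored likelihood estimate.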
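The iteration above can be sketched as a generic loop. This is an illustrative outline under stated assumptions rather than the speaker's code: the function name `pmmh` and its arguments (`log_prior`, `propose`, `log_q`, `estimate_log_lik`) are hypothetical, `log_q(a, b)` is taken to mean $\log q(a \mid b)$, and the sketch tracks only θ (a sampled trajectory $X^*_{1:T}$ would be stored and accepted or rejected together with θ∗). The important detail is that the likelihood estimate attached to the current state is stored and reused, never recomputed.

```python
import math
import random


def pmmh(theta0, n_iters, log_prior, propose, log_q, estimate_log_lik, rng):
    """Particle marginal MH over theta with a noisy log-likelihood estimate.

    estimate_log_lik(theta, rng) should return the log of an unbiased
    likelihood estimate, for instance from a particle filter.
    """
    theta = theta0
    log_lik_hat = estimate_log_lik(theta, rng)  # stored with the current state
    chain, n_accept = [], 0
    for _ in range(n_iters):
        theta_star = propose(theta, rng)
        log_prior_star = log_prior(theta_star)
        if log_prior_star > -math.inf:
            log_lik_star = estimate_log_lik(theta_star, rng)
            log_ratio = (log_lik_star + log_prior_star + log_q(theta, theta_star)
                         - log_lik_hat - log_prior(theta) - log_q(theta_star, theta))
            # Accept with probability min(1, exp(log_ratio)); u lies in (0, 1].
            u = 1.0 - rng.random()
            if math.log(u) <= log_ratio:
                theta, log_lik_hat = theta_star, log_lik_star
                n_accept += 1
        chain.append(theta)
    return chain, n_accept / n_iters
```

With a symmetric random-walk proposal, `log_q` can simply return `0.0`, since the two proposal terms cancel.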

Validity of the Particle Marginal MH Sampler

Proposition. Assume that the ‘idealized’ marginal MH sampler chain is ergodic; then, under very weak assumptions, the PMMH sampler chain is ergodic and admits $p(\theta, x_{1:T} \mid y_{1:T})$ as invariant distribution for any N ≥ 1.

It is easy to show the simpler result that the PMMH admits $p(\theta \mid y_{1:T})$ as invariant distribution for any N ≥ 1.

Let U denote all the random variables introduced to build the SMC estimate, with density $q_\theta(u)$; then write $\hat p_\theta(y_{1:T}) = \hat p_\theta(y_{1:T}; U)$, and unbiasedness reads

\[ \int \hat p_\theta(y_{1:T}; u)\, q_\theta(u)\, du = p_\theta(y_{1:T}). \]

An Incomplete But Trivial Proof

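The PMMH targets the distribution

\[ \hat\pi(\theta, u) \propto p(\theta)\, \hat p_\theta(y_{1:T}; u)\, q_\theta(u), \]

which satisfies $\hat\pi(\theta) = p(\theta \mid y_{1:T})$.

The PMMH sampler uses as a proposal

\[ q((\theta^*, u^*) \mid (\theta, u)) = q(\theta^* \mid \theta)\, q_{\theta^*}(u^*), \]

so that the MH ratio is

\[ \frac{\hat\pi(\theta^*, u^*)}{\hat\pi(\theta, u)} \, \frac{q((\theta, u) \mid (\theta^*, u^*))}{q((\theta^*, u^*) \mid (\theta, u))} = \frac{p(\theta^*)\, \hat p_{\theta^*}(y_{1:T}; u^*)}{p(\theta)\, \hat p_\theta(y_{1:T}; u)} \, \frac{q(\theta \mid \theta^*)}{q(\theta^* \mid \theta)}. \]

Trivial but deep result: if you plug any unbiased likelihood estimate into an MCMC scheme, you do not perturb the invariant distribution.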
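The two steps behind this argument can be written out explicitly. The block below is a worked derivation using only the unbiasedness identity and the definitions of the extended target and proposal; it adds no assumptions beyond those.

```latex
\begin{align*}
\hat\pi(\theta)
 &= \int \hat\pi(\theta, u)\,du
 \;\propto\; p(\theta)\int \hat p_\theta(y_{1:T};u)\,q_\theta(u)\,du
 = p(\theta)\,p_\theta(y_{1:T})
 \;\propto\; p(\theta\mid y_{1:T}), \\[4pt]
\frac{\hat\pi(\theta^*,u^*)}{\hat\pi(\theta,u)}\cdot
\frac{q(\theta\mid\theta^*)\,q_{\theta}(u)}{q(\theta^*\mid\theta)\,q_{\theta^*}(u^*)}
 &= \frac{p(\theta^*)\,\hat p_{\theta^*}(y_{1:T};u^*)\,q_{\theta^*}(u^*)}
         {p(\theta)\,\hat p_{\theta}(y_{1:T};u)\,q_{\theta}(u)}\cdot
    \frac{q(\theta\mid\theta^*)\,q_{\theta}(u)}{q(\theta^*\mid\theta)\,q_{\theta^*}(u^*)} \\
 &= \frac{p(\theta^*)\,\hat p_{\theta^*}(y_{1:T};u^*)}
         {p(\theta)\,\hat p_{\theta}(y_{1:T};u)}\cdot
    \frac{q(\theta\mid\theta^*)}{q(\theta^*\mid\theta)}.
\end{align*}
```

The first line shows that the θ-marginal of the extended target is the posterior; the second shows that the intractable densities $q_\theta(u)$ of the auxiliary variables cancel, so the ratio only involves quantities the particle filter returns.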

Inference for Stochastic Kinetic Models

Two species $X^1_t$ (prey) and $X^2_t$ (predator):

\[
\begin{aligned}
\Pr(X^1_{t+dt} = x^1_t + 1,\; X^2_{t+dt} = x^2_t \mid x^1_t, x^2_t) &= \alpha\, x^1_t\, dt + o(dt), \\
\Pr(X^1_{t+dt} = x^1_t - 1,\; X^2_{t+dt} = x^2_t + 1 \mid x^1_t, x^2_t) &= \beta\, x^1_t\, x^2_t\, dt + o(dt), \\
\Pr(X^1_{t+dt} = x^1_t,\; X^2_{t+dt} = x^2_t - 1 \mid x^1_t, x^2_t) &= \gamma\, x^2_t\, dt + o(dt),
\end{aligned}
\]

with $Y_k = X^1_{k \Delta T} + W_k$, where $W_k \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$.

We are interested in the kinetic rate constants θ = (α, β, γ), a priori distributed as (Boys et al., 2008; Kunsch, 2011)

\[ \alpha \sim G(1, 10), \quad \beta \sim G(1, 0.25), \quad \gamma \sim G(1, 7.5). \]

MCMC methods require reversible jumps, whereas particle MCMC requires only forward simulation.
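To make "only forward simulation" concrete, the jump process above can be simulated exactly with Gillespie's algorithm. The sketch below is a swapped-in standard technique for illustration, not something shown in the talk; the function name `simulate_lotka_volterra` and all argument names are hypothetical. A particle filter for this model would call such a simulator between observation times as its transition sampler.

```python
import random


def simulate_lotka_volterra(x1, x2, alpha, beta, gamma, t_end, rng):
    """Simulate prey (x1) and predator (x2) counts up to time t_end.

    Returns a list of (time, prey, predator) tuples, one per event,
    starting with the initial state at time 0.
    """
    t = 0.0
    path = [(t, x1, x2)]
    while True:
        # Hazards of the three reactions in the model.
        rates = (alpha * x1, beta * x1 * x2, gamma * x2)
        total = rates[0] + rates[1] + rates[2]
        if total == 0.0:
            break  # both species extinct, nothing can happen
        t += rng.expovariate(total)  # exponential waiting time
        if t > t_end:
            break
        u = rng.random() * total
        if u < rates[0]:
            x1 += 1                  # prey birth
        elif u < rates[0] + rates[1]:
            x1 -= 1                  # predation: prey dies,
            x2 += 1                  # predator is born
        else:
            x2 -= 1                  # predator death
        path.append((t, x1, x2))
    return path
```

For example, `simulate_lotka_volterra(50, 20, 1.0, 0.01, 0.5, 0.5, random.Random(0))` returns one random trajectory; the rate values in that call are arbitrary and are not taken from the priors above.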

Experimental Results

Figure: Simulated data (prey and predator).

Figure: Posterior distributions.

Autocorrelation Functions

Figure: Autocorrelation of α (left) and β (right) for the PMMH sampler for various N (50, 100, 200, 500 and 1000 particles).

Summary

Offline Bayesian parameter inference is feasible by using particle filter proposals within MCMC.

Particle MCMC allows us to perform Bayesian inference for dynamic models for which only forward simulation is possible.

Computationally intensive, but several GPU implementations are already available, with applications in control, ecology, econometrics, biochemical systems, epidemiology, water resources research, etc.

This approach does not suffer from the degeneracy problem, and N scales roughly linearly with T.

How to Select the Number of Samples

Selection of N is a key issue.

If N is too small, then the algorithm mixes poorly and will require many MCMC iterations.

If N is too large, then each MCMC iteration is expensive.

Aim: We would like to provide guidelines on how to select N.

Joint work with Mike Pitt (Warwick) and Robert Kohn (UNSW).

MCMC with Intractable Likelihood Function

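Let $Z = \log \hat p_\theta(y; U) - \log p_\theta(y)$ be the error in the log-likelihood.

The proposal from which Z arises is denoted $g(z \mid \theta)$.

We can rewrite the extended target as

\[ \hat\pi(\theta, z) = p(\theta \mid y) \exp(z)\, g(z \mid \theta), \]

which is directly related to $\hat\pi(\theta, u)$ through the many-to-one transformation from u to z.

The previous algorithm proposes $\theta' \sim q(\cdot \mid \theta)$ and $Z' \sim g(\cdot \mid \theta')$, accepting $(\theta', z')$ with probability

\[ \alpha_Q(\theta, Z; \theta', Z') = \min\{1, \exp(Z' - Z)\, \omega(\theta'; \theta) / \omega(\theta; \theta')\}, \]

where $\omega(\theta'; \theta) = \pi(\theta') / q(\theta' \mid \theta)$.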

Inefficiency Measure

Consider a stationary Markov chain $\{\theta_j\}$ with invariant density π(θ) and $h : \Theta \to \mathbb{R}$ with $V_\pi[h(\theta)] < \infty$. Define

\[ \mu_h = E_\pi[h(\theta)] \quad \text{and} \quad \hat\mu_{h,n} = n^{-1} \sum_{j=1}^{n} h(\theta_j). \]

Then, under regularity conditions, we have

\[ \lim_{n \to \infty} n\, V(\hat\mu_{h,n}) = V_\pi[h(\theta)]\, IF_h \quad \text{with} \quad IF_h = 1 + 2 \sum_{\tau=1}^{\infty} \rho_h(\tau), \]

where $\rho_h(\tau)$ is the autocorrelation at lag τ of the stationary sequence $\{h(\theta_j)\}$.

The integrated autocorrelation time (IACT), $IF_h$, quantifies how many times more samples are required from the Markov chain relative to using iid samples from π(θ) to achieve a given precision.
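In practice the inefficiency factor has to be estimated from a finite chain. The following is a minimal sketch of one simple estimator: empirical autocorrelations summed up to a fixed cut-off lag. The function names and the fixed truncation at `max_lag` are assumptions made for illustration; the talk does not say which estimator was used, and more careful windowing rules exist.

```python
def autocorrelation(xs, lag):
    """Empirical autocorrelation of the sequence xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[j] - mean) * (xs[j + lag] - mean) for j in range(n - lag)) / n
    return cov / var


def inefficiency_factor(xs, max_lag):
    """Estimate IF_h = 1 + 2 * sum_{tau >= 1} rho_h(tau), truncated at max_lag."""
    return 1.0 + 2.0 * sum(autocorrelation(xs, tau) for tau in range(1, max_lag + 1))
```

Here `xs` would be the recorded values $h(\theta_j)$ of the chain after burn-in. A value near 1 indicates nearly independent draws, while a value of 20 means roughly twenty MCMC draws are worth one independent draw.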

A Bounding Chain

For the sake of analysis, we introduce an alternative $Q^*$ chain.

Let (θ, Z) be the current state of the Markov chain.

1. Propose $\theta' \sim q(\cdot \mid \theta)$ and $Z' \sim g(\cdot \mid \theta')$.
2. Accept θ′ w.p. $\alpha_{Q^{EX}}(\theta; \theta') = \min\{1, \omega(\theta'; \theta) / \omega(\theta; \theta')\}$.
3. Accept Z′ w.p. $\alpha_{Q^Z}(Z; Z') = \min\{1, \exp(Z' - Z)\}$.
4. (θ′, Z′) is accepted if and only if there is acceptance in both criteria.
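The difference between the two chains comes down to two acceptance functions, which can be compared numerically. In this small sketch the names `alpha_q` and `alpha_q_star` are hypothetical, and `omega_ratio` stands for $\omega(\theta'; \theta) / \omega(\theta; \theta')$, assumed to be supplied by the caller.

```python
import math


def alpha_q(z, z_new, omega_ratio):
    """Acceptance probability of the PMMH-type chain Q."""
    return min(1.0, math.exp(z_new - z) * omega_ratio)


def alpha_q_star(z, z_new, omega_ratio):
    """Acceptance probability of the bounding chain Q*: both tests must pass."""
    accept_theta = min(1.0, omega_ratio)         # exact-likelihood criterion
    accept_z = min(1.0, math.exp(z_new - z))     # criterion on the error Z
    return accept_theta * accept_z
```

Since $\min\{1, ab\} \ge \min\{1, a\} \min\{1, b\}$ for non-negative a and b, the chain $Q^*$ never accepts more often than Q. For instance, with `z = 0`, `z_new = 1` and `omega_ratio = 0.5`, Q accepts with probability 1 whereas $Q^*$ accepts with probability 0.5.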

A Bounding Chain

Lemma: The Markov chain $Q^*$ has the following properties:

1. We have

\[ \alpha_Q(\theta, z; \theta', z') \ge \alpha_{Q^*}(\theta, z; \theta', z') = \alpha_{Q^{EX}}(\theta; \theta') \times \alpha_{Q^Z}(z; z'). \]

2. $Q^*$ is a reversible Markov chain with invariant density $\hat\pi(\theta, z)$.
3. For any function h(θ) the IACT is higher for the $Q^*$ chain than for the Q chain; i.e. $IF^{Q^*}_h \ge IF^{Q}_h$ (Peskun, 1973; Tierney 1998).

Remark: We have $\alpha_{Q^*}(\theta, z; \theta', z') = \alpha_Q(\theta, z; \theta', z') = \alpha_{Q^{EX}}(\theta; \theta')$ when the likelihood is known exactly, and $\alpha_{Q^*}(\theta, z; \theta', z') = \alpha_Q(\theta, z; \theta', z') = \alpha_{Q^Z}(z; z')$ when the proposal is perfect, i.e. $q(\theta' \mid \theta) = \pi(\theta')$.

Making Assumptions to Move Forward

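Assumption. Let $Z = \log \hat p_\theta(y; U) - \log p_\theta(y)$ be the error in the estimator of the log-likelihood.

1. We have for N particles

\[ g(z \mid \theta) = \varphi(z; -\gamma^2(\theta)/2N, \gamma^2(\theta)/N), \]

\[ \hat\pi(z \mid \theta) = \exp(z)\, g(z \mid \theta) = \varphi(z; \gamma^2(\theta)/2N, \gamma^2(\theta)/N), \]

where $\varphi(z; a, b^2)$ is a univariate normal density with mean a and variance $b^2$.

2. For a given value of $\sigma^2$ we set $N = N_{\sigma^2}(\theta) = \gamma^2(\theta) / \sigma^2$.

Under this assumption, both $g(z \mid \theta)$ and $\hat\pi(z \mid \theta)$ are functions of $\sigma^2$ only, and we write $g(z \mid \theta)$ and $\pi(z \mid \theta)$ as

\[ g(z \mid \sigma^2) = \varphi(z; -\sigma^2/2, \sigma^2), \quad \pi(z \mid \sigma^2) = \varphi(z; \sigma^2/2, \sigma^2), \]

and θ and Z are independent under $\hat\pi(\theta, z)$.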
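The claim that tilting the proposal density of Z by exp(z) simply flips the sign of its mean follows from completing the square. The worked step below uses only the normal density with mean $-\sigma^2/2$ and variance $\sigma^2$.

```latex
\begin{align*}
\exp(z)\,\varphi(z;-\sigma^2/2,\sigma^2)
 &= \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( z - \frac{(z+\sigma^2/2)^2}{2\sigma^2} \right) \\
 &= \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -\frac{z^2 + \sigma^2 z + \sigma^4/4 - 2\sigma^2 z}{2\sigma^2} \right) \\
 &= \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -\frac{(z-\sigma^2/2)^2}{2\sigma^2} \right)
  = \varphi(z;\sigma^2/2,\sigma^2).
\end{align*}
```

The mean $-\sigma^2/2$ of the proposal is exactly what makes $E[\exp(Z)] = 1$ under g, which is the unbiasedness of the likelihood estimator expressed on the log scale.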

Empirical vs Asymptotic Distribution of Log-Likelihood Estimator

Figure: Histograms of proposed (red) and accepted (pink) values of z in the PMCMC scheme. Overlaid are Gaussian pdfs from our simplifying Assumption for a target of σ = 0.92. Panels: Probit example, N = 120, ρ = 0; AR1 plus noise example, N = 60, ρ = 0; Probit example, N = 120, ρ = 0.97; AR1 plus noise example, N = 60, ρ = 0.9.

Bounding Chain with Perfect Proposal

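Lemma (Pitt et al., 2012). Under the previous assumption, the following results hold for the chain $\{\theta_j, Z_j\}$ arising from $Q^*$ that uses the perfect proposal $q(\theta' \mid \theta) = \pi(\theta')$, denoted $Q^Z$.

1. Let $p(z; \sigma^2)$ be the probability of rejection given the current value z. Then

\[ p(z; \sigma^2) = 1 - \int \alpha_{Q^Z}(z; z')\, g(z' \mid \sigma^2)\, dz' = \Phi(z/\sigma + \sigma/2) - \exp(-z)\, \Phi(z/\sigma - \sigma/2). \]

2. We have

\[ IF^Z(\sigma^2) = E_{\pi(\cdot \mid \sigma^2)}\!\left( \frac{1 + p(z; \sigma^2)}{1 - p(z; \sigma^2)} \right) = \int \frac{1 + p^*(w, \sigma)}{1 - p^*(w, \sigma)}\, \varphi(w)\, dw, \]

where $p^*(w, \sigma) = \Phi(w + \sigma) - \exp(-w\sigma - \sigma^2/2)\, \Phi(w)$, φ(·) is the standard normal pdf and Φ(·) the standard normal cdf. In particular, $IF^Z(\sigma^2)$ is independent of h.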
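The integral in the lemma is one-dimensional and easy to evaluate numerically. The sketch below is an illustration under stated assumptions: the function names, the trapezoidal rule and the integration range [-10, 10] are choices made here, not taken from the talk. It works with the acceptance probability $1 - p^*(w, \sigma)$, written as a sum of two positive terms, so that the ratio $(1 + p^*)/(1 - p^*)$ stays numerically stable in the right tail.

```python
import math


def norm_cdf(x):
    """Standard normal cdf."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def norm_pdf(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def acceptance_prob(w, sigma):
    """1 - p*(w, sigma), i.e. the acceptance probability given the state w."""
    return norm_cdf(-(w + sigma)) + math.exp(-w * sigma - 0.5 * sigma ** 2) * norm_cdf(w)


def inefficiency_z(sigma, half_width=10.0, n_steps=4000):
    """IF^Z(sigma^2) by the trapezoidal rule on [-half_width, half_width]."""
    h = 2.0 * half_width / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        w = -half_width + i * h
        a = acceptance_prob(w, sigma)
        weight = 0.5 if i in (0, n_steps) else 1.0
        # (1 + p*) / (1 - p*) with p* = 1 - a equals (2 - a) / a.
        total += weight * (2.0 - a) / a * norm_pdf(w)
    return total * h
```

As a sanity check, `inefficiency_z(0.0)` is 1 up to quadrature error (no noise, no loss), and the value grows quickly as σ increases.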

Main Result

Under the previous assumption and additional regularity conditions,

\[ IF^{Q}_h(\sigma^2) \le IF^{Q^*}_h(\sigma^2) \le IF^{U}_h(\sigma^2), \]

where

\[ IF^{U}_h(\sigma^2) = \tfrac{1}{2} (1 + IF^{EX}_h)(1 + IF^Z(\sigma^2)) - 1. \]

The inequality becomes exact when σ → 0 (as $IF^Z(\sigma^2) \to 1$ for σ → 0) and when the proposal is perfect (as $IF^{EX}_h = 1$ when $q(\theta' \mid \theta) = \pi(\theta')$).

For the unconditional acceptance probability, we have

\[ P_Q(A \mid \sigma^2) \ge P_{Q^*}(A \mid \sigma^2) = 2\Phi(-\sigma/\sqrt{2})\, P_{EX}(A), \]

with the bound getting tighter if either $P_{EX}(A) \to 1$ or $P_Z(A \mid \sigma^2) \to 1$.
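Both bounds are closed-form once $IF^{EX}_h$, $IF^Z(\sigma^2)$ and $P_{EX}(A)$ are available, so they are easy to tabulate. In the sketch below the function names are hypothetical and the inputs are assumed to be supplied by the user, for example from a pilot run and from numerical integration.

```python
import math


def if_upper_bound(if_ex, if_z):
    """IF^U_h = 0.5 * (1 + IF^EX_h) * (1 + IF^Z) - 1."""
    return 0.5 * (1.0 + if_ex) * (1.0 + if_z) - 1.0


def acceptance_lower_bound(sigma, p_ex):
    """Lower bound 2 * Phi(-sigma / sqrt(2)) * P_EX(A) on the acceptance rate.

    Uses the identity 2 * Phi(-sigma / sqrt(2)) = erfc(sigma / 2).
    """
    return math.erfc(sigma / 2.0) * p_ex
```

Two limiting cases are worth checking by hand: `if_upper_bound(if_ex, 1.0)` returns `if_ex` (an exact likelihood costs nothing), and `if_upper_bound(1.0, if_z)` returns `if_z` (a perfect proposal leaves only the noise penalty). With σ = 0.92 and `p_ex = 1.0` the acceptance bound is about 0.515.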

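Relative Inefficiency

We have

\[ \frac{IF^{Q}_h(\sigma^2)}{IF^{EX}_h} \le \frac{IF^{U}_h(\sigma^2)}{IF^{EX}_h} = RIF^{U}_h(\sigma^2), \]

where

\[ RIF^{U}_h(\sigma^2) = \frac{1}{2} \frac{IF^Z(\sigma^2) - 1}{IF^{EX}_h} + \frac{1}{2} (1 + IF^Z(\sigma^2)). \]

For fixed σ, $RIF^{U}_h(\sigma^2)$ decreases as $IF^{EX}_h$ increases, from a value of $IF^Z(\sigma^2)$ for $IF^{EX}_h = 1$ to

\[ RIF^{U}_h(\sigma^2) \to \tfrac{1}{2} (1 + IF^Z(\sigma^2)) \le IF^Z(\sigma^2) \quad \text{as } IF^{EX}_h \to \infty. \]

The loss in efficiency from using the estimated likelihood goes down as the proposal deteriorates.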

Computing Time

The Computing Time (CT) for Q is defined as

\[ CT^{Q}_h(\sigma^2) = IF^{Q}_h(\sigma^2) / \sigma^2; \]

i.e. we take into account the computational effort associated with $\sigma^2 \propto 1/N$.

1. $CT^{Q}_h(\sigma^2) \le CT^{U}_h(\sigma^2)$ where $CT^{U}_h(\sigma^2) := IF^{U}_h(\sigma^2) / \sigma^2$.
2. If $IF^{EX}_h = 1$, then $CT^{U}_h(\sigma^2)$ is minimized at $\sigma^U_{opt} = 0.92$ and $IF^Z((\sigma^U_{opt})^2) = 4.54$, $P_Z(A \mid \sigma^U_{opt}) = 0.5153$.
3. The minimizing $\sigma^U_{opt}$ rises with $IF^{EX}_h$ to $\sigma^U_{opt} = 1.0206$ as $IF^{EX}_h \to \infty$.

Define the relative computing time for the inefficiency bound $IF^{U}_h(\sigma^2)$:

\[ RCT^{U}_h(\sigma^2) = \frac{RIF^{U}_h(\sigma^2)}{\sigma^2}. \]

Both $RIF^{U}_h(\sigma^2)$ and $RCT^{U}_h(\sigma^2)$ are decreasing functions of $IF^{EX}_h$.

Relative Upper Bounds on Inefficiency and Computing Time

Figure: $RCT^{U}_h$ (top) and $RIF^{U}_h$ (bottom) against $1/\sigma^2$ (left) and σ (right). Different values of $IF^{EX}_h$ (1, 2.5, 4, 20 and 100) are shown on each plot.

Example 1: Probit Model

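We use a simple Bernoulli Probit model, where for t = 1, ..., T

\[ Y_t = \mathbb{I}(X_t > 0), \quad X_t \overset{\text{iid}}{\sim} N(\theta, 1). \]

The likelihood is known explicitly as $\Pr(Y_t = 1) = \Phi(\theta)$ but is estimated through

\[ \widehat{\Pr}_\theta(Y_t = 1) = \frac{1}{N} \sum_{k=1}^{N} \mathbb{I}(X^{(k)}_t > 0), \quad X^{(k)}_t \overset{\text{iid}}{\sim} N(\theta, 1), \]

and set $\theta \sim N(0, \sigma^2_\theta)$ with $\sigma^2_\theta \gg 1$.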

Autoregressive Metropolis proposal

\[ \theta' = \hat\theta + \rho(\theta - \hat\theta) + \sqrt{\hat\sigma^2 (1 - \rho^2)\, \frac{\nu - 2}{\nu}}\; t_5, \]

where $\hat\theta$ is the posterior mode and $\hat\sigma^2$ is chosen as the negative inverse of the second derivative of the log posterior.

We will consider ρ ∈ {0, 0.4, 0.6, 0.9, 0.97}.

Setting the Number of Samples

We want to assess experimentally whether our upper/lower bounds are sharp.

We select N so that σ is roughly equal to a pre-specified value.

1. Choose a large initial value for the number of samples, $N_S$.
2. Run the MCMC scheme for a fixed number of iterates, recording θ.
3. Record the estimated variance of the log of the likelihood estimator, $V(\theta, N_S) = \hat V[\log \hat p_\theta(y; U)]$.
4. Set $N_\theta = V(\theta, N_S) / \sigma^2$.

Acceptance Probabilities for Probit Example

Figure: Probit example, T = 100, θ = 0.5. Acceptance probability against σ(θ), with one panel for each of ρ = 0, 0.4, 0.6, 0.9 and 0.97. The estimated probability for the exact MCMC scheme is shown (constant), together with the estimated probability from the simulated likelihood scheme (red) and the lower bound given by the exact-scheme probability times $2\Phi(-\sigma/\sqrt{2})$ (blue).

Relative Inefficiency and Computing Time

Figure: $RCT^{Q}_h$ (top) and $RIF^{Q}_h$ (bottom) against N (left) and σ(θ) (right) for various values of ρ (0, 0.4, 0.6, 0.9 and 0.97).

Example 2: Noisy Autoregressive Example

We have

\[ X_{t+1} = \mu(1 - \phi) + \phi X_t + \sigma_\eta \eta_t \quad \text{and} \quad Y_t = X_t + \sigma_\varepsilon W_t, \]

where $\eta_t$ and $W_t$ are independent standard normal and $\theta = (\phi, \mu, \sigma^2_\eta)$.

The likelihood can be computed exactly using the Kalman filter.

Autoregressive Metropolis proposal for θ based on a multivariate t-distribution.

N is selected in the same manner so as to obtain an approximately constant σ.
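For this model the exact log-likelihood that serves as the benchmark can be computed with a scalar Kalman filter. The sketch below is an illustration under one assumption not stated in the slides: the initial state is drawn from the stationary distribution $N(\mu, \sigma^2_\eta / (1 - \phi^2))$, which requires |φ| < 1. The function and argument names are hypothetical.

```python
import math


def kalman_log_likelihood(y, phi, mu, sigma2_eta, sigma2_eps):
    """Exact log p_theta(y_{1:T}) for the AR(1)-plus-noise model.

    Assumes X_1 ~ N(mu, sigma2_eta / (1 - phi^2)), the stationary law.
    """
    m = mu                                 # predictive mean of X_t
    p = sigma2_eta / (1.0 - phi ** 2)      # predictive variance of X_t
    log_lik = 0.0
    for y_t in y:
        s = p + sigma2_eps                 # variance of Y_t given the past
        v = y_t - m                        # innovation
        log_lik += -0.5 * (math.log(2.0 * math.pi * s) + v * v / s)
        k = p / s                          # Kalman gain
        m_filt = m + k * v                 # filtered mean
        p_filt = (1.0 - k) * p             # filtered variance
        # One-step prediction through the AR(1) dynamics.
        m = mu * (1.0 - phi) + phi * m_filt
        p = phi * phi * p_filt + sigma2_eta
    return log_lik
```

Comparing this exact value with repeated particle filter estimates at a fixed θ gives a direct empirical estimate of the variance of Z, and hence of σ, for a given N.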

Acceptance Probabilities

Figure: AR1 plus noise example with T = 300, φ = 0.8, µ = 0.5, $\sigma^2_\eta = 1$, $\sigma^2_\varepsilon = 0.5$. Probabilities of acceptance (PMCMC, exact and bound) displayed against σ(θ), with one panel for each of ρ = 0, 0.4, 0.6 and 0.9.

Relative Inefficiency and Computing Time

Figure: From left to right: $RCT^{Q}_h$ vs N, $RCT^{Q}_h$ vs σ(θ), $RIF^{Q}_h$ against N and $RIF^{Q}_h$ against σ(θ) for various values of ρ (0, 0.4, 0.6 and 0.9) and different parameters.

Discussion

We have provided an approximate analysis of the PMMH sampler.

For a general proposal and under simplifying assumptions on the likelihood estimator, we can get guidelines on how to select σ: as long as σ is around 1, you are fine.

Coming up with a nicer way to adapt N would be useful; e.g. Lee, 2011.

Particle Gibbs samplers are a powerful alternative that is not yet well understood (Andrieu, D. & Holenstein, 2010; Whiteley, Andrieu, D., 2010; Lindsten, Jordan, Schon, 2012).
