no U-turn sampler, a discussion of Hoffman & Gelman NUTS algorithm


description

presentation made by Marco Banterle at the Bayes in Paris seminar, CREST, April 11, 2013

Transcript of no U-turn sampler, a discussion of Hoffman & Gelman NUTS algorithm

Page 1

No-U-Turn Sampler, based on M. D. Hoffman and A. Gelman (2011)

Marco Banterle

Presented at the ”Bayes in Paris” reading group

11 / 04 / 2013

Page 2

Outline

1 Hamiltonian MCMC

2 NUTS

3 ε tuning

4 Numerical Results

5 Conclusion

Page 3

Outline

1 Hamiltonian MCMC: Hamiltonian Monte Carlo; Other useful tools

2 NUTS

3 ε tuning

4 Numerical Results

5 Conclusion

Page 4

Hamiltonian dynamics

Hamiltonian MC techniques have a nice foundation in physics.

They describe the total energy of a system composed of a frictionless particle sliding on a (hyper-)surface.

Position q ∈ R^d and momentum p ∈ R^d are necessary to define the total energy H(q, p), which is usually formalized as the sum of a potential energy term U(q) and a kinetic energy term K(p).

Hamiltonian dynamics are characterized by

∂q/∂t = ∂H/∂p,    ∂p/∂t = −∂H/∂q

which describe how the system changes through time.

Page 5

Properties of the Hamiltonian

• Reversibility: the mapping from (q, p) at time t to (q, p) at time t + s is one-to-one

• Conservation of the Hamiltonian: ∂H/∂t = 0, i.e. total energy is conserved

• Volume preservation: applying the map resulting from time-shifting to a region R of the (q, p)-space does not change the volume of the (projected) region

• Symplecticness: a stronger condition than volume preservation.

Page 6

What about MCMC?

H(q, p) = U(q) + K(p)

We can easily interpret U(q) as minus the log target density for the variable of interest q, while p will be introduced artificially.

Page 7

What about MCMC?

H(q, p) = U(q) + K(p) → −log(π(q|y)) − log(f(p))

Let's examine its properties under a statistical lens:

• Reversibility: MCMC updates that use these dynamics leave the desired distribution invariant

• Conservation of the Hamiltonian: Metropolis updates using Hamiltonians are always accepted

• Volume preservation: we don't need to account for any change in volume in the acceptance probability of Metropolis updates (no need to compute the determinant of the Jacobian matrix of the mapping)

Page 8

The leapfrog

The solution of Hamilton's equations is not always¹ explicitly available, hence the need for time discretization with some small step size ε.

A numerical integrator that serves our purposes is the leapfrog.

Leapfrog integrator

For j = 1, . . . , L

1. p_{t+ε/2} = p_t − (ε/2) ∂U/∂q (q_t)

2. q_{t+ε} = q_t + ε ∂K/∂p (p_{t+ε/2})

3. p_{t+ε} = p_{t+ε/2} − (ε/2) ∂U/∂q (q_{t+ε})

t = t + ε

¹ unless they are of quadratic form
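A minimal Python/NumPy sketch of these steps (an illustration, not the paper's code): grad_U is assumed to be a user-supplied function returning ∂U/∂q, and the mass matrix is taken to be the identity, so ∂K/∂p(p) = p.

    import numpy as np

    def leapfrog(q, p, grad_U, eps, L):
        # L leapfrog steps of size eps with identity mass matrix; returns the proposed (q, p)
        q = np.array(q, dtype=float)
        p = np.array(p, dtype=float)
        p -= 0.5 * eps * grad_U(q)          # initial half step for the momentum
        for step in range(L):
            q += eps * p                    # full step for the position (M = I, so dK/dp = p)
            if step < L - 1:
                p -= eps * grad_U(q)        # full momentum step (two merged half steps)
        p -= 0.5 * eps * grad_U(q)          # final half step for the momentum
        return q, p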

Page 9

A few more words

The kinetic energy is usually (for simplicity) assumed to be K(p) = p^T M^{-1} p / 2, which corresponds to p|q ≡ p ∼ N(0, M).

This finally implies that in the leapfrog ∂K/∂p (p_{t+ε/2}) = M^{-1} p_{t+ε/2}.

Usually M, for which little guidance exists, is taken to be diagonal and often equal to the identity matrix.

The leapfrog method preserves volume exactly and, due to its symmetry, it is also reversible by simply negating p, applying the same number L of steps again, and then negating p again². It does not, however, conserve the total energy, and thus the deterministic move to (q′, p′) will be accepted with probability

min[1, exp(−H(q′, p′) + H(q, p))]

² negating ε serves the same purpose

Page 10

HMC algorithm

We now have all the elements to construct an MCMC method based on the Hamiltonian dynamics:

HMC

Given q_0, ε, L, M. For i = 1, . . . , N

1. p ∼ N(0, M)

2. Set q_i ← q_{i−1}, q′ ← q_{i−1}, p′ ← p

3. For j = 1, . . . , L update (q′, p′) through the leapfrog

4. Set q_i ← q′ with probability

min[1, exp(−H(q′, p′) + H(q_{i−1}, p))]
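A compact sketch of this loop in Python/NumPy (a minimal illustration with M = I, not the reference implementation): log_prob and grad_log_prob are assumed user-supplied, so U(q) = −log_prob(q) and grad_U(q) = −grad_log_prob(q).

    import numpy as np

    def hmc(log_prob, grad_log_prob, q0, eps, L, n_samples, rng=np.random.default_rng()):
        # Basic HMC with identity mass matrix; returns an array of samples
        def U(q): return -log_prob(q)
        def grad_U(q): return -grad_log_prob(q)

        q = np.array(q0, dtype=float)
        samples = []
        for _ in range(n_samples):
            p = rng.standard_normal(q.shape)            # 1. resample the momentum
            q_new, p_new = q.copy(), p.copy()
            p_new -= 0.5 * eps * grad_U(q_new)          # 3. leapfrog trajectory of L steps
            for step in range(L):
                q_new += eps * p_new
                if step < L - 1:
                    p_new -= eps * grad_U(q_new)
            p_new -= 0.5 * eps * grad_U(q_new)
            h_old = U(q) + 0.5 * p @ p                  # 4. Metropolis accept/reject
            h_new = U(q_new) + 0.5 * p_new @ p_new
            if np.log(rng.uniform()) < h_old - h_new:
                q = q_new
            samples.append(q.copy())
        return np.array(samples)

For example, hmc(lambda q: -0.5 * q @ q, lambda q: -q, np.zeros(2), eps=0.1, L=20, n_samples=1000) samples a standard bivariate normal.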

Page 11

HMC benefits

HMC makes use of gradient information to move across the space, so its typical trajectories do not resemble random walks. Moreover, the error in the Hamiltonian stays bounded³ and hence we also get a high acceptance rate.

³ even if a theoretical justification for this is missing

Page 12

HMC benefits

Random walk:

• requires proposed changes with magnitude comparable to the standard deviation in the most constrained direction (square root of the smallest eigenvalue of the covariance matrix)

• to reach an almost-independent state it needs a number of iterations mostly determined by how long it takes to explore the least constrained direction

• proposals have no tendency to move consistently in the same direction.

For HMC the proposal moves according to the gradient for L steps, even if ε is still constrained by the smallest eigenvalue.

Page 13

Connected difficulties

Performance depends strongly on choosing suitable values for ε and L.

• ε too large → inaccurate simulation & high reject rate

• ε too small → wasted computational power (small steps)

• L too small → random walk behavior and slow mixing

• L too large → trajectories retrace their steps → U-TURN

even worse: PERIODICITY!!

Page 15

Periodicity

Ergodicity of HMC may fail if a produced trajectory of length Lε is an exact period for some state.

Example

Consider q ∼ N(0, 1); then H(q, p) = q²/2 + p²/2 and the resulting dynamics are

dq/dt = p,    dp/dt = −q

which have the exact solution

q(Lε) = r cos(a + Lε),    p(Lε) = −r sin(a + Lε)

for some real a and r.

Page 16

Periodicity

Example (cont.)

q ∼ N(0, 1), H(q, p) = q²/2 + p²/2

dq/dt = p,    dp/dt = −q

q(Lε) = r cos(a + Lε),    p(Lε) = −r sin(a + Lε)

If we choose a trajectory length such that Lε = 2π, at the end of the iteration we will return to the starting point!

• Lε near the period makes HMC ergodic but practically useless

• interactions between variables prevent exact periodicities, but near-periodicities might still slow HMC considerably!

Page 19

Other useful tools - Windowed HMC

The leapfrog introduces "random" errors in H at each step, and hence the acceptance probability may have high variability. Smoothing out these oscillations (over windows of states) could lead to higher acceptance rates.

Windows map

We map (q, p) → [(q_0, p_0), . . . , (q_{W−1}, p_{W−1})] by writing

(q, p) = (q_s, p_s),    s ∼ U(0, W − 1)

and deterministically recovering the other states. This window has probability density

P([(q_0, p_0), . . . , (q_{W−1}, p_{W−1})]) = (1/W) Σ_{i=0}^{W−1} P(q_i, p_i)

Page 20

Windowed HMC

Similarly, we perform L − W + 1 leapfrog steps starting from (q_{W−1}, p_{W−1}) up to (q_L, p_L), and then accept the window [(q_{L−W+1}, −p_{L−W+1}), . . . , (q_L, −p_L)] with probability

min[1, Σ_{i=L−W+1}^{L} P(q_i, p_i) / Σ_{i=0}^{W−1} P(q_i, p_i)]

and finally select (q_a, −p_a) with probability

P(q_a, p_a) / Σ_{i ∈ accepted window} P(q_i, p_i)

Page 21

Slice Sampler

Slice sampling idea

Sampling from f(x) is equivalent to uniform sampling from SG(f) = {(x, y) | 0 ≤ y ≤ f(x)}, the subgraph of f.

We simply make use of this by sampling from f with an auxiliary-variable Gibbs sampler:

• y|x ∼ U(0, f(x))

• x|y ∼ U(S_y), where S_y = {x | y ≤ f(x)} is the slice
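A minimal sketch of this auxiliary-variable Gibbs scheme in Python for a one-dimensional target (an illustration: the unnormalized density f is assumed to have its support inside a known interval [lo, hi], so the slice can be sampled by simple rejection).

    import numpy as np

    def slice_sample(f, x0, n_samples, lo=-10.0, hi=10.0, rng=np.random.default_rng()):
        # Slice sampler for a 1-d unnormalized density f supported inside [lo, hi]
        x = float(x0)
        out = []
        for _ in range(n_samples):
            y = rng.uniform(0.0, f(x))          # y | x ~ U(0, f(x))
            while True:                         # x | y ~ Uniform on the slice S_y
                x_prop = rng.uniform(lo, hi)    # rejection sampling within [lo, hi]
                if f(x_prop) >= y:
                    x = x_prop
                    break
            out.append(x)
        return np.array(out)

    # e.g. sampling a standard normal (unnormalized density):
    # draws = slice_sample(lambda x: np.exp(-0.5 * x**2), 0.0, 1000)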

Page 22

Outline

1 Hamiltonian MCMC

2 NUTS: Idea; Flow of the simplest NUTS; A more efficient implementation

3 ε tuning

4 Numerical Results

5 Conclusion

Page 23

What we (may) have and want to avoid

[Figure] Opacity and size grow with the number of leapfrog steps made; start and end points in black.

Page 24

Main idea

Define a criterion that helps us avoid these U-turns, stopping when we have simulated "long enough": the instantaneous distance gain

C(q, q′) = ∂/∂t [ (1/2)(q′ − q)^T (q′ − q) ] = (q′ − q)^T ∂/∂t (q′ − q) = (q′ − q)^T p

Simulating until C(q, q′) < 0 leads to a non-reversible Markov chain, so the authors devised a different scheme.

Page 26

Main Idea

• NUTS augments the model with a slice variable u

• Add a finite set C of candidates for the update

• C ⊆ B deterministically chosen, with B the set of all leapfrog steps

(At a high level) B is built by doubling and checking C(q, q′) on sub-trees

Page 27

High level NUTS steps

1. resample momentum p ∼ N(0, I)

2. sample u|q, p ∼ U[0, exp(−H(q_t, p))]

3. generate the proposal from p(B, C|q_t, p, u, ε)

4. sample (q_{t+1}, p) ∼ T(·|q_t, p, C)

T(·|q_t, p, C) is such that it leaves the uniform distribution over C invariant (C contains (q′, p′) s.t. u ≤ exp(−H(q′, p′)) and satisfies reversibility)

Page 28

Justify T(q′, p′|q_t, p, C)

Conditions on p(B, C|q, p, u, ε):

C.1: Elements in C are chosen in a volume-preserving way

C.2: p((q, p) ∈ C|q, p, u, ε) = 1

C.3: p(u ≤ exp(−H(q′, p′))|(q′, p′) ∈ C) = 1

C.4: If (q, p) ∈ C and (q′, p′) ∈ C then

p(B, C|q, p, u, ε) = p(B, C|q′, p′, u, ε)

From these:

p(q, p|u, B, C, ε) ∝ p(B, C|q, p, u, ε) p(q, p|u)
                   ∝ p(B, C|q, p, u, ε) I{u ≤ exp(−H(q, p))}    [C.1]
                   ∝ I{(q, p) ∈ C}                               [C.2, C.3, C.4]

Page 29

p(B, C|q, p, u, ε) - Building B by doubling

Build B by repeatedly doubling a binary tree with (q, p) leaves.

• Choose a random "direction in time" ν_j ∼ U({−1, 1})

• Take 2^j leapfrog steps of size ν_j ε from

(q−, p−) I_{−1}(ν_j) + (q+, p+) I_{+1}(ν_j)

• Continue until a stopping rule is met

Given the start (q, p) and ε: 2^j equi-probable height-j trees. Reconstructing a particular height-j tree from any leaf has probability 2^{−j}.

Possible stopping rule: at height j

• for one of the 2^j − 1 subtrees

(q+ − q−)^T p− < 0   or   (q+ − q−)^T p+ < 0

• the tree includes a leaf s.t. log(u) + H(q, p) > ∆_max
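A small Python/NumPy sketch of the two checks above (an illustration of the criterion, not the paper's tree-building routine): q_minus, q_plus, p_minus, p_plus denote the two ends of the current (sub-)trajectory, and delta_max is the divergence threshold (1000 in the paper).

    import numpy as np

    def no_u_turn(q_minus, q_plus, p_minus, p_plus):
        # True while the trajectory keeps moving away from itself (no U-turn yet)
        dq = q_plus - q_minus
        return (dq @ p_minus >= 0) and (dq @ p_plus >= 0)

    def not_diverged(log_u, H, delta_max=1000.0):
        # True while the simulation error stays acceptable: log(u) + H(q, p) <= delta_max
        return log_u + H <= delta_max

    # doubling continues only while both checks hold on the relevant (sub)trees:
    # keep_going = no_u_turn(q_minus, q_plus, p_minus, p_plus) and not_diverged(log_u, H_leaf)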

Page 38

Satisfying Detailed-Balance

We've defined p(B|q, p, u, ε) → deterministically select C

Remember the conditions:

C.1: ✓ Satisfied because of the leapfrog

C.2: ✓ Satisfied if C includes the initial state

C.3: ✓ Satisfied if we exclude points outside the slice u

C.4: ✓ Satisfied if we exclude states that couldn't generate B (as long as C given B is deterministic, p(B, C|q, p, u, ε) = 2^{−j} or p(B, C|q, p, u, ε) = 0)

Finally we select (q′, p′) at random from C
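To make the flow concrete, here is a deliberately simplified Python/NumPy sketch of one such transition (slice variable, doubling in a random direction, U-turn check, uniform draw from C). It checks the U-turn condition only between the two overall endpoints rather than on every sub-tree, and it keeps the states of the final (stopping) doubling, so it illustrates the mechanics rather than reproducing the paper's exactly reversible algorithm; log_prob, grad_log_prob and max_doublings are assumptions for the illustration.

    import numpy as np

    def naive_nuts_step(log_prob, grad_log_prob, q, eps, max_doublings=10,
                        rng=np.random.default_rng()):
        # One simplified NUTS-like transition; returns the new position
        def H(q, p): return -log_prob(q) + 0.5 * p @ p
        def leapfrog(q, p, direction):
            # one leapfrog step of size direction * eps
            p = p + direction * 0.5 * eps * grad_log_prob(q)
            q = q + direction * eps * p
            p = p + direction * 0.5 * eps * grad_log_prob(q)
            return q, p

        p0 = rng.standard_normal(q.shape)
        log_u = np.log(rng.uniform()) - H(q, p0)       # u ~ U(0, exp(-H(q, p0)))
        q_minus = q_plus = np.array(q, dtype=float)
        p_minus = p_plus = p0.copy()
        C = [np.array(q, dtype=float)]                 # candidate set (positions on the slice)
        n_steps = 1
        for j in range(max_doublings):
            v = 1 if rng.uniform() < 0.5 else -1       # random direction in time
            for _ in range(n_steps):                   # take 2^j leapfrog steps
                if v == 1:
                    q_plus, p_plus = leapfrog(q_plus, p_plus, +1)
                    q_new, p_new = q_plus, p_plus
                else:
                    q_minus, p_minus = leapfrog(q_minus, p_minus, -1)
                    q_new, p_new = q_minus, p_minus
                if log_u <= -H(q_new, p_new):          # keep states inside the slice
                    C.append(q_new.copy())
            n_steps *= 2
            dq = q_plus - q_minus                      # endpoint U-turn check
            if dq @ p_minus < 0 or dq @ p_plus < 0:
                break
        return C[rng.integers(len(C))]                 # uniform draw from C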

Page 41

Efficient NUTS

The computational cost per leapfrog step is comparable with HMC (just 2^{j+1} − 2 more inner products). However:

• it requires storing 2^j position-momentum states

• long jumps are not guaranteed

• time is wasted if a stopping criterion is met during a doubling

Page 42

Efficient NUTS

The authors address these issues as follows:

• time is wasted if a stopping criterion is met during a doubling

→ as soon as a stopping rule is met, break out of the loop

Page 44

Efficient NUTS

The authors address these issues as follows:

• it requires storing 2^j position-momentum states

• long jumps are not guaranteed

Consider the kernel:

T(w′|w, C) =
  I[w′ ∈ C_new] / |C_new|                                                              if |C_new| > |C_old|
  (|C_new| / |C_old|) I[w′ ∈ C_new] / |C_new| + (1 − |C_new| / |C_old|) I[w′ = w]     if |C_new| ≤ |C_old|

where w is short for (q, p), and C_new and C_old are respectively the set C′ introduced during the final iteration and the older elements already in C.

Iteratively applying the above after every doubling moreover requires storing only O(j) positions (and sizes) rather than O(2^j).
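A small sketch of this kernel in Python (an illustration assuming C_old and C_new are lists of stored states, with C_old non-empty since it always contains the current state): with probability min(1, |C_new|/|C_old|) the chain jumps to a uniformly chosen element of the newly added set, otherwise it stays put.

    import numpy as np

    def progressive_transition(w, C_old, C_new, rng=np.random.default_rng()):
        # Kernel T(w' | w, C): favours the newly added half of the trajectory,
        # so long jumps become more likely
        if len(C_new) == 0:
            return w                                      # nothing new to jump to
        accept_prob = min(1.0, len(C_new) / len(C_old))   # = 1 when |C_new| > |C_old|
        if rng.uniform() < accept_prob:
            return C_new[rng.integers(len(C_new))]        # uniform draw from C_new
        return w                                          # otherwise keep the current state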

Page 45

Outline

1 Hamiltonian MCMC

2 NUTS

3 ε tuning: Adaptive MCMC; Random ε

4 Numerical Results

5 Conclusion

Page 46

Adaptive MCMC

Classic adaptive MCMC idea: stochastic optimization of the parameter with vanishing adaptation:

θ_{t+1} ← θ_t − η_t H_t

where η_t ∈ R with η_t → 0 at the "correct" pace, and H_t = δ − α_t defines a characteristic of interest of the chain.

Example

Consider the case of a random-walk MH with x′|x_t ∼ N(x_t, θ_t), adapted on the acceptance rate α_t in order to reach the desired optimum δ = 0.234.
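A minimal Python/NumPy sketch of this vanishing-adaptation recipe on the random-walk example (an illustration: the decay exponent kappa, and the choice to adapt log θ so the variance stays positive, are assumptions rather than part of the slide).

    import numpy as np

    def adaptive_rw_mh(log_prob, x0, n_samples, delta=0.234, kappa=0.6,
                       rng=np.random.default_rng()):
        # Random-walk MH whose proposal scale is tuned by vanishing adaptation
        # towards an average acceptance rate delta
        x = float(x0)
        log_theta = 0.0                                   # log proposal variance
        samples = []
        for t in range(1, n_samples + 1):
            x_prop = x + np.sqrt(np.exp(log_theta)) * rng.standard_normal()
            alpha = min(1.0, np.exp(log_prob(x_prop) - log_prob(x)))
            if rng.uniform() < alpha:
                x = x_prop
            eta = t ** (-kappa)                           # vanishing adaptation step eta_t
            log_theta -= eta * (delta - alpha)            # H_t = delta - alpha_t
            samples.append(x)
        return np.array(samples)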

Page 48

Double Averaging

This idea, however, has (in particular) an intrinsic flaw:

• the diminishing step sizes η_t give more weight to the early iterations

The authors then rely on the following scheme³ (applied to log ε in the paper):

ε_{t+1} = µ − (√t / γ) · (1 / (t + t_0)) · Σ_{i=1}^{t} H_i        (adaptation: γ controls shrinkage towards µ, t_0 down-weights the early iterations)

ε̃_{t+1} = η_t ε_{t+1} + (1 − η_t) ε̃_t,   η_t = t^{−k},  k ∈ (0.5, 1]        (sampling uses ε̃)

³ introduced by Nesterov (2009) for stochastic convex optimization
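A compact Python sketch of this dual-averaging update (an illustration; the constants γ = 0.05, t_0 = 10, k = 0.75 and µ = log(10 ε_0) follow the paper's suggested defaults, and the running-average form of Σ H_i matches the paper's recursion).

    import numpy as np

    class DualAveragingStepSize:
        # Dual-averaging adaptation of the step size (Nesterov-style scheme)
        def __init__(self, eps0, delta=0.65, gamma=0.05, t0=10.0, kappa=0.75):
            self.mu = np.log(10.0 * eps0)     # shrinkage target
            self.delta, self.gamma, self.t0, self.kappa = delta, gamma, t0, kappa
            self.h_bar = 0.0                  # running average of H_t = delta - alpha_t
            self.log_eps_bar = 0.0            # averaged log step size, used for sampling
            self.t = 0

        def update(self, alpha):
            # feed the acceptance statistic of iteration t; returns (eps_t, eps_bar_t)
            self.t += 1
            w = 1.0 / (self.t + self.t0)
            self.h_bar = (1.0 - w) * self.h_bar + w * (self.delta - alpha)
            log_eps = self.mu - np.sqrt(self.t) / self.gamma * self.h_bar
            eta = self.t ** (-self.kappa)
            self.log_eps_bar = eta * log_eps + (1.0 - eta) * self.log_eps_bar
            return np.exp(log_eps), np.exp(self.log_eps_bar)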

Page 49

Select the criterion

The authors advise using H_t = δ − α_t with δ = 0.65 and

α_t^HMC = min[1, P(q′, −p′) / P(q_t, −p_t)]

α_t^NUTS = (1 / |B_t^final|) Σ_{(q′_t, p′_t) ∈ B_t^final} min[1, P(q′_t, −p′_t) / P(q_{t−1}, −p_t)]

Page 50

Alternative: random ε

Periodicity is still a problem, and the "optimal" ε may differ in different regions of the target! (e.g. mixtures)
The solution might be to randomize ε around some ε_0

Example

ε ∼ E(ε_0)    or    ε ∼ U(c_1 ε_0, c_2 ε_0),   c_1 < 1, c_2 > 1

Page 51

Alternative: random ε

Helpful especially for HMC; care is needed for NUTS, as L is usually inversely proportional to ε.

What was said about adaptation can be combined with this by using ε_0 = ε_t

Page 52

Outline

1 Hamiltonian MCMC

2 NUTS

3 ε tuning

4 Numerical Results: Paper's examples; Multivariate Normal

5 Conclusion

Page 53

Numerical Examples

"It is at least as good as an 'optimally' tuned HMC."
Tested on:

1. (MVN): 250-d Normal whose precision matrix was sampled from a Wishart with identity scale matrix and 250 df

2. (LR): (24+1)-dim regression coefficient distribution on the German credit data with weak priors

3. (HLR): same dataset - exponentially distributed variance hyperparameters and two-way interactions included → (300+1)*2-d predictors

4. (SV): 3000 days of returns from the S&P 500 index → 3001-d target following this scheme:

τ ∼ E(100);   ν ∼ E(100);   s_1 ∼ E(100);
log s_i ∼ N(log s_{i−1}, τ^{−1});
(log y_i − log y_{i−1}) / s_i ∼ t_ν.

Page 54

ε and L in NUTS

Also, the dual averaging usually does a good job of coercing the statistic H to its desired value.

Most of the trajectory lengths are integer powers of two, indicating that the U-turn criterion is usually satisfied only after a doubling is completed, which is desirable since it means that we only occasionally have to throw out entire half-trajectories to satisfy detailed balance.

Page 56

Multivariate Normal

Highly correlated, 30-dim multivariate Normal

To get the full advantage of HMC, the trajectory has to be long enough (but not much longer) that in the least constrained direction the end point is distant from the start point.

Page 58

Multivariate Normal

Highly correlated, 30-dim multivariate Normal

Trajectories reverse direction before the least constrained direction has been explored.

This can produce a U-turn when the trajectory is much shorter than is optimal.

Page 59

Multivariate Normal

Highly correlated, 30-dim multivariate Normal

Fortunately, too-short trajectories are cheap relative to long-enough trajectories!

But this does not help in avoiding random-walk behaviour.

Page 60

Multivariate Normal

One way to overcome the problem could be to check for a U-turn only in the least constrained direction.

But usually such information is not known a priori. Maybe after a trial run to discover the order of magnitude of the variables?

(someone said periodicity?)

Page 61

Outline

1 Hamiltonian MCMC

2 NUTS

3 ε tuning

4 Numerical Results

5 Conclusion: Improvements; Discussion

Page 62

Non-Uniform weighting system

Qin and Liu (2001) propose a weighting procedure similar to Neal's:

• Generate the whole trajectory from (q_0, p_0) = (q, p) to (q_L, p_L)

• Select a point in the "accept window" [(q_{L−W+1}, p_{L−W+1}), . . . , (q_L, p_L)], say (q_{L−k}, p_{L−k})

• equivalent to selecting k ∼ U(0, W)

• Take k steps backward from (q_0, p_0) to (q_{−k}, p_{−k})

• Reject window: [(q_{−k}, p_{−k}), . . . , (q_{W−k−1}, p_{W−k−1})]

• Accept (q_{L−k}, p_{L−k}) with probability

min[1, Σ_{i=1}^{W} w_i P(q_{L−W+i}, −p_{L−W+i}) / Σ_{i=1}^{W} w_i P(q_{i−k−1}, −p_{i−k−1})]

• With uniform weights w_i = 1/W ∀i

• Other weights may favor states further from the start!

Page 66

Riemann Manifold HMC

• In statistical modeling the parameter space is a manifold

• d(p(y|θ), p(y|θ + δθ)) can be defined as δθ^T G(θ) δθ, where G(θ) is the expected Fisher information matrix [Rao (1945)]

• G(θ) defines a position-specific Riemann metric.

Girolami & Calderhead (2011) used this to tune K(p) using local 2nd-order derivative information encoded in M = G(q)

Page 67

Riemann Manifold HMC

The deterministic proposal is guided not only by the gradient of the target density but also exploits local geometric structure.

Possible optimality: paths produced by solving the Hamiltonian equations follow geodesics on the manifold.

Page 68

Riemann Manifold HMC

Practically: p|q ∼ N(0, G(q)), resolving the scaling issues in HMC

• tuning of ε is less critical

• non-separable Hamiltonian

H(q, p) = U(q) + (1/2) log{(2π)^D |G(q)|} + (1/2) p^T G(q)^{−1} p

• need for an (expensive) implicit integrator (generalized leapfrog + fixed-point iterations)

Page 69

RMHMC & Related

"In some cases, the computational overhead for solving implicit equations undermines RMHMC's benefits." [Lan et al. (2012)]

Lagrangian Dynamics (RMLMC)
Replaces momentum p (mass × velocity) in HMC by velocity v, with volume correction through the Jacobian

• semi-implicit integrator for "Hamiltonian" dynamics

• different, fully explicit integrator for Lagrangian dynamics; two extra matrix inversions to update v

Adaptively Updated (AUHMC)
Replaces G(q) with M(q, q′) = (1/2)[G(q) + G(q′)]

• fully explicit leapfrog integrator (M(q, q′) constant through the trajectory)

• less locally adaptive (especially for long trajectories)

• is it really reversible?

Page 70

Split Hamiltonian

Variations on HMC obtained by using discretizations of Hamiltonian dynamics, splitting H into:

H(q, p) = H_1(q, p) + H_2(q, p) + · · · + H_K(q, p)

This may allow much of the movement to be done at low cost.

• split U(q) into minus the log of a Gaussian plus a second term; quadratic forms allow for an explicit solution

• H_1 and its gradient can be evaluated quickly, with only a slowly-varying H_2 requiring costly computations, e.g. splitting the data

Page 72

split-wRMLNUTS & co.

• NUTS itself provides a nice automatic tuning of HMC

• very few citations of their work so far!

• integration with RMHMC (just out!) may overcome ε-related problems and provide better criteria for the stopping rule!

HMC may not be the definitive sampler, but it is definitely as useful as it is unknown.

It seems the direction taken may finally overcome the difficulties connected with its use and help it spread.

Page 73

Betancourt (2013b)

• RMHMC has smoother movements on the surface but..

• .. (q′ − q)^T p no longer has a meaning

The paper addresses this issue by generalizing the stopping rule to other mass matrices M.

Page 74

References I

Introduction:

• M. D. Hoffman & A. Gelman (2011) - The No-U-Turn Sampler

• R. M. Neal (2011) - MCMC using Hamiltonian dynamics, in Handbook of Markov Chain Monte Carlo

• Z. S. Qin and J. S. Liu (2001) - Multipoint Metropolis method with application to hybrid Monte Carlo

• R. M. Neal (2003) - Slice Sampling

Page 75

References II

Further Readings:

• M. Girolami, B. Calderhead (2011) - Riemann manifold Langevin and Hamiltonian Monte Carlo methods

• Z. Wang, S. Mohamed, N. de Freitas (2013) - Adaptive Hamiltonian and Riemann Manifold Monte Carlo Samplers

• S. Lan, V. Stathopoulos, B. Shahbaba, M. Girolami (2012) - Lagrangian Dynamical Monte Carlo

• M. Burda, J. M. Maheu (2013) - Bayesian Adaptively Updated Hamiltonian Monte Carlo with an Application to High-Dimensional BEKK GARCH Models

• M. Betancourt (2013) - Generalizing the No-U-Turn Sampler to Riemannian Manifolds