
Advanced Hamiltonian Monte Carlo: Riemann Manifold HMC and the No-U-Turn Sampler

Nilesh Tripuraneni, Adam Scibior

nt357@cam.ac.uk ams240@cam.ac.uk

22/01/2015

B.(HM)C.

- MCMC requires a proposal kernel T(θ, θ′) to sample from p(θ) = e^{−U(θ)}.

- A random-walk proposal T(θ, θ′) ∼ N(θ′ | θ, σ) is agnostic to any details of the target density.

- Motivated by the SDE of Langevin dynamics, dθ(t) = −∇θU(θ) dt + dW(t), propose

  θ′ = θ − (ε²/2) ∇θU(θ) + ε N(0, I)
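A minimal sketch of this Langevin-style proposal, run here without a Metropolis correction and on a standard-Gaussian target purely for illustration (the name `langevin_propose` and the target are ours):

```python
import numpy as np

def langevin_propose(theta, grad_U, eps, rng):
    """Propose theta' = theta - (eps^2 / 2) * grad_U(theta) + eps * N(0, I)."""
    noise = rng.standard_normal(theta.shape)
    return theta - 0.5 * eps ** 2 * grad_U(theta) + eps * noise

# Illustration: standard Gaussian target, U(theta) = 0.5 * theta @ theta, so grad_U(theta) = theta.
rng = np.random.default_rng(0)
theta = np.zeros(2)
samples = []
for _ in range(20000):
    theta = langevin_propose(theta, lambda t: t, eps=0.3, rng=rng)
    samples.append(theta.copy())
samples = np.array(samples)
```

Run uncorrected like this, the chain targets the density only approximately (with O(ε²) bias); a Metropolis-Hastings accept/reject step restores exactness.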

Figure: From Information Theory, Inference, and Learning Algorithms – D.J.C. MacKay


Hamiltonian Monte Carlo

- Have density f(θ) = e^{−U(θ)}, and introduce an auxiliary variable p ∼ N(p | 0, I) in order to obtain the joint density

  f(θ, p) = (1/(2π)^{d/2}) e^{−½ pᵀp} f(θ) = (1/(2π)^{d/2}) e^{−(½ pᵀp + U(θ))}

- Energy: H(p, θ) = ½ log (2π)^d − log f(θ) + ½ pᵀp, where −log f(θ) = U(θ) is the potential and ½ pᵀp = K(p) is the kinetic energy.

- Integrate out p, ∫ dp (1/(2π)^{d/2}) e^{−(½ pᵀp + U(θ))} = e^{−U(θ)}, to recover the correct parameter marginal.


Hamiltonian Dynamics

- Hamiltonian: H = ½ pᵀp + U(θ)

- Hamilton's equations:

  dθ/dτ = ∂H/∂p = p          (1)
  dp/dτ = −∂H/∂θ = −∇θU(θ)   (2)

- Naive Euler integrator:

  θ(τ + ε) = θ(τ) + ε dθ(τ)/dτ = θ(τ) + ε p(τ)
  p(τ + ε) = p(τ) + ε dp(τ)/dτ = p(τ) − ε ∇θU(θ(τ))


Hamiltonian Dynamics cont’d.

- For separable H, "split" H as H(p, θ) = U(θ)/2 + K(p) + U(θ)/2 to induce the flow maps on (p, θ):

  Φ_{ε/2, U(θ)} ∘ Φ_{ε, K(p)} ∘ Φ_{ε/2, U(θ)}

- Reversible, volume-preserving leapfrog integrator:

  p(τ + ε/2) = p(τ) − ε ∇θU(θ(τ))/2
  θ(τ + ε)   = θ(τ) + ε p(τ + ε/2)
  p(τ + ε)   = p(τ + ε/2) − ε ∇θU(θ(τ + ε))/2

Figure: (a) Leapfrog (b) Symplectic integration
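The leapfrog scheme above can be sketched as follows (our own minimal implementation, illustrated on a harmonic oscillator, where near-conservation of H is easy to check):

```python
import numpy as np

def leapfrog(theta, p, grad_U, eps, L):
    """L leapfrog steps for the separable Hamiltonian H = U(theta) + 0.5 * p @ p."""
    p = p - 0.5 * eps * grad_U(theta)       # initial half step in momentum
    for i in range(L):
        theta = theta + eps * p             # full step in position
        if i < L - 1:
            p = p - eps * grad_U(theta)     # full step in momentum
    p = p - 0.5 * eps * grad_U(theta)       # final half step in momentum
    return theta, p

# Harmonic oscillator U(theta) = 0.5 * theta @ theta: energy is nearly conserved.
theta0, p0 = np.array([1.0]), np.array([0.0])
theta1, p1 = leapfrog(theta0, p0, lambda t: t, eps=0.1, L=100)
H0 = 0.5 * (p0 @ p0 + theta0 @ theta0)
H1 = 0.5 * (p1 @ p1 + theta1 @ theta1)
```

Note the sign convention: momentum moves down the gradient of U, matching dp/dτ = −∇θU(θ).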

Hamiltonian Monte Carlo

- Where's the sampling? First Gibbs-sample p, then do all of the above, iterating the leapfrog integrator L times with step size ε as a deterministic proposal.

- Apply the MH correction step

  min(1, [e^{−H(p*, θ*)} / e^{−H(p, θ)}] · [T((p*, θ*) → (p, θ)) / T((p, θ) → (p*, θ*))])

  where the ratio of transition kernels equals 1, to correct for integrator bias. Note that our careful construction has resulted in the cancellation of the "Hastings" term.
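Putting the pieces together, a single HMC transition might look like this (a sketch under the slides' conventions; the name `hmc_step` and the standard-Gaussian demo target are ours):

```python
import numpy as np

def hmc_step(theta, U, grad_U, eps, L, rng):
    """One HMC transition: resample momentum, run leapfrog, MH accept/reject."""
    p = rng.standard_normal(theta.shape)           # Gibbs step: p ~ N(0, I)
    th, pn = theta.copy(), p.copy()
    pn = pn - 0.5 * eps * grad_U(th)               # leapfrog, L steps
    for i in range(L):
        th = th + eps * pn
        if i < L - 1:
            pn = pn - eps * grad_U(th)
    pn = pn - 0.5 * eps * grad_U(th)
    H_old = U(theta) + 0.5 * p @ p
    H_new = U(th) + 0.5 * pn @ pn
    # Metropolis correction min(1, e^{-(H_new - H_old)}); the Hastings term cancels.
    if rng.uniform() < np.exp(H_old - H_new):
        return th
    return theta

rng = np.random.default_rng(1)
theta = np.zeros(2)
draws = []
for _ in range(3000):
    theta = hmc_step(theta, lambda t: 0.5 * t @ t, lambda t: t, eps=0.3, L=10, rng=rng)
    draws.append(theta.copy())
draws = np.array(draws)
```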


Hamiltonian Monte Carlo Picture

Figure: From Information Theory, Inference, and Learning Algorithms – D.J.C. MacKay


HMC parameters - L

The number of steps for which Hamilton's equations are simulated in a single MCMC step.

Small L - random walk behaviour

Large L - wasteful re-exploration of the state space


HMC parameters - ε

The size of a discrete step used for numeric integration.

Small ε - a larger number of steps is required to simulate the same amount of "time"

Large ε - significant discretisation errors, low acceptance rate


Automatically setting L - the U turn

The U turn is a convenient stopping criterion:

  (d/dt) (θ⁺ − θ⁻)²/2 = (θ⁺ − θ⁻)ᵀ p⁺ < 0
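The criterion itself is a one-liner; the slide shows the time derivative at the "+" end, and NUTS checks it at both ends of the trajectory (a sketch, our naming):

```python
import numpy as np

def u_turn(theta_minus, theta_plus, p_minus, p_plus):
    """True when further integration at either end would shrink the end-to-end distance."""
    d = theta_plus - theta_minus
    return bool(d @ p_minus < 0 or d @ p_plus < 0)

# Both ends still travelling rightwards along d: no U turn yet.
d_apart = u_turn(np.array([0.0, 0.0]), np.array([1.0, 0.0]),
                 np.array([1.0, 0.0]), np.array([1.0, 0.0]))
# Right end has turned back towards the left end: U turn.
d_back = u_turn(np.array([0.0, 0.0]), np.array([1.0, 0.0]),
                np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
```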

But what about detailed balance?


Automatically setting L - NUTS

We want

  min(1, [f(p*, θ*) / f(p, θ)] · [T((p*, θ*) → (p, θ)) / T((p, θ) → (p*, θ*))]) = 1

Three steps of NUTS:

- Simulate Hamiltonian dynamics until the U turn occurs, generating a trace B.
- Pick a suitable subset C of candidate states from B.
- Pick the new state uniformly from C.


NUTS - slice sampling

Slice sampling makes NUTS simpler.

  f(p, θ, u) ∝ I[u ∈ [0, f(p, θ)]]

  f(p, θ) = ∫ f(p, θ, u) du

  f(p, θ | u) ∝ I[f(p, θ) ≥ u]

  f(u | p, θ) = Uniform(0, f(p, θ))


NUTS - generating trajectories

function TRAJECTORY((p, θ))
    B ← {(p, θ)}
    for n = 0 to ∞ do
        direction ← Uniform(left, right)
        B′ ← extend(B, 2ⁿ, direction)
        if STOPPING(B′) then break end if
        B ← B′
    end for
    return B
end function

function STOPPING(B)
    (p⁻, θ⁻) ← leftmost(B)
    (p⁺, θ⁺) ← rightmost(B)
    Uturn ← (θ⁺ − θ⁻)ᵀ p⁻ < 0
    return Uturn ∨ STOPPING(left(B)) ∨ STOPPING(right(B))
end function

NUTS - full algorithm

function NUTSSTEP((p, θ))
    u ← Uniform(0, f(p, θ))
    B ← TRAJECTORY((p, θ))
    C ← {(p*, θ*) ∈ B | f(p*, θ*) ≥ u}
    (p*, θ*) ← Uniform(C)
    return (p*, θ*)
end function
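A runnable transliteration of this pseudocode, in the spirit of Hoffman and Gelman's "naive" NUTS (Algorithm 2 of their paper) rather than the memory-efficient recursive version used in Stan. All names are ours, the slice check is done in log space for numerical stability, and, as in the pseudocode, a doubling that triggers the stopping test is discarded wholesale:

```python
import numpy as np

def leapfrog(theta, p, grad_U, eps):
    p = p - 0.5 * eps * grad_U(theta)
    theta = theta + eps * p
    p = p - 0.5 * eps * grad_U(theta)
    return theta, p

def stopping(traj):
    """Recursive U-turn check over the balanced subtrees of an ordered trajectory."""
    if len(traj) < 2:
        return False
    (th_m, p_m), (th_p, p_p) = traj[0], traj[-1]
    d = th_p - th_m
    if d @ p_m < 0 or d @ p_p < 0:
        return True
    half = len(traj) // 2
    return stopping(traj[:half]) or stopping(traj[half:])

def nuts_step(theta, U, grad_U, eps, rng, max_doublings=10):
    p = rng.standard_normal(theta.shape)
    log_u = np.log(rng.uniform()) - (U(theta) + 0.5 * p @ p)   # slice variable, log scale
    traj = [(theta, p)]                                         # ordered left -> right
    for n in range(max_doublings):
        new = []
        if rng.uniform() < 0.5:                                 # extend left (backwards in time)
            th, m = traj[0][0], -traj[0][1]
            for _ in range(2 ** n):
                th, m = leapfrog(th, m, grad_U, eps)
                new.append((th, -m))                            # store forward-time momentum
            cand = new[::-1] + traj
        else:                                                   # extend right
            th, m = traj[-1]
            for _ in range(2 ** n):
                th, m = leapfrog(th, m, grad_U, eps)
                new.append((th, m))
            cand = traj + new
        if stopping(cand):
            break                                               # discard the doubling that U-turned
        traj = cand
    C = [th for th, m in traj if -(U(th) + 0.5 * m @ m) >= log_u]
    return C[rng.integers(len(C))]

rng = np.random.default_rng(2)
theta = np.zeros(1)
draws = []
for _ in range(2000):
    theta = nuts_step(theta, lambda t: 0.5 * t @ t, lambda t: t, eps=0.2, rng=rng)
    draws.append(theta[0])
draws = np.array(draws)
```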


NUTS - why it works

Pr(B | (p, θ)) = Pr(B | (p*, θ*))   if (p, θ), (p*, θ*) ∈ B

f(u | (p, θ)) / f(u | (p*, θ*)) = f(p*, θ*) / f(p, θ)

C is computed deterministically from u and B

Pr((p, θ) | C) = Pr((p*, θ*) | C)   if (p, θ), (p*, θ*) ∈ C

T((p, θ) → (p*, θ*)) / T((p*, θ*) → (p, θ)) = f(p*, θ*) / f(p, θ)


Automatically setting ε

Asymptotically vanishing adaptation:

  ε_{t+1} = ε_t + η_t H_t

  Σ_t η_t = ∞,  Σ_t η_t² < ∞

  H_t = (α_t − δ)

α_t - Metropolis-Hastings acceptance probability at time t

δ - target acceptance ratio; 0.65 is a reasonable default

η_t controls the decay of adaptation; a good choice is η_t = t^{−κ} with κ ∈ (0.5, 1]
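A sketch of this stochastic-approximation rule against a toy acceptance-rate model (the linear model `alpha = 1 - eps/2` is ours, purely for illustration; its fixed point for δ = 0.65 is ε = 0.7):

```python
def adapt_step_size(eps, alpha, t, delta=0.65, kappa=0.75):
    """Robbins-Monro update eps_{t+1} = eps_t + eta_t * (alpha_t - delta)."""
    eta = (t + 1.0) ** (-kappa)        # satisfies sum(eta) = inf, sum(eta^2) < inf
    return eps + eta * (alpha - delta)

# Toy model: acceptance falls linearly with step size; adaptation homes in on eps = 0.7.
eps = 0.1
for t in range(2000):
    alpha = max(0.0, 1.0 - eps / 2.0)
    eps = adapt_step_size(eps, alpha, t)
```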


Automatically setting ε

Dual averaging:

  ε_{t+1} = µ − (√t / (t + t₀)) Σ_{i=0}^{t} H_i

  ε̄_{t+1} = η_t ε_{t+1} + (1 − η_t) ε̄_t


Stan

A C++-based probabilistic programming framework which implements HMC and NUTS.

http://mc-stan.org


Pre-Conditioning

Problem: Sample from a correlated distribution with different scales using an isotropic proposal.

Figure: From Information Theory, Inference, and Learning Algorithms – D.J.C. MacKay

- If ε ∼ L, too many rejections.

- If ε ≪ L, slow mixing along the "L" direction.

Solution: Use a "preconditioned" proposal Q(x, x′) ∼ N(x′ | x, εΣ), with Σ matched to the target covariance.

HMC with Pre-Conditioning

- Density p(θ); introduce an auxiliary variable p ∼ N(p | 0, M).

- Energy:

  H(p, θ) = −log p(θ) + ½ log((2π)^d |M|) + ½ pᵀM⁻¹p

- Hamilton's equations:

  dθ/dτ = ∂H/∂p = M⁻¹p   and   dp/dτ = −∂H/∂θ = −∇θU(θ)

- Reversible, volume-preserving leapfrog integrator:

  p(τ + ε/2) = p(τ) − ε ∇θU(θ(τ))/2
  θ(τ + ε)   = θ(τ) + ε M⁻¹ p(τ + ε/2)
  p(τ + ε)   = p(τ + ε/2) − ε ∇θU(θ(τ + ε))/2

- How to set M? Requires either the exact covariance of the target or an empirical estimate of it.


Global → Local

Global covariance estimates can be badly locally miscalibrated.

Figure: From Kameleon MCMC – Gretton et al.


Philosophizing

- Coordinates are for calculations, but geometry should be defined in a "covariant" way.

- From a formal perspective all coordinates are equally good.

- If so, ideally our calculations (gradients, sampling proposals, etc.) should be independent of parametrization.

- Perhaps there are better notions of "closeness" of parameters in probability models - i.e. should we consider N(0, 1) as being the same distance from N(0, 2) as N(0, 99) is from N(0, 100)?


Manifold

- Characterized by an atlas - a consistent collection of open sets Uᵢ and functions φᵢ that bicontinuously map the manifold M to a real vector space R^m - (Uᵢ, φᵢ).

- Useful to consider curves c : (a, b) → M and functions f : M → R on the manifold in order to define differentiable structure.

Figure: From Geometry, Topology and Physics - M. Nakahara


Tangent Space

- If c : (a, b) → M is a curve and f : M → R, then

  df(c(t))/dt |_{t=0} = X^µ (∂f/∂x^µ) ≡ X[f]

  where ∂f/∂x^µ really means ∂(f ∘ φ⁻¹(x))/∂x^µ and X^µ = d(φ^µ(c(t)))/dt |_{t=0}

- The tangent space TₚM is a linearization of the manifold, spanned by the directional derivatives at p.


Metric

The tangent space TₚM at each point is endowed with an inner product via the metric tensor Gₚ : TₚM × TₚM → R, giving a notion of distance/angle. It is also:

- Symmetric: Gₚ(t₁, t₂) = Gₚ(t₂, t₁)

- Bilinear: Gₚ(t₁ + t₂, t₃) = Gₚ(t₁, t₃) + Gₚ(t₂, t₃)

- Positive-definite: Gₚ(t, t) > 0 for t ≠ 0

If we have a path θ(t) : R → M, then the length of the curve is given by:

  D(t₁, t₂) = ∫_{t₁}^{t₂} √( G_{θ(t)}(dθ/dt, dθ/dt) ) dt

For example, the metric on R² in polar (r, θ) coordinates is:

  [ 1   0  ]
  [ 0   r² ]
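The length formula and the polar example can be checked numerically (a sketch; all names are ours). In the polar metric, a quarter circle of radius 2 should have length 2 · π/2 = π:

```python
import numpy as np

def curve_length(gamma, dgamma, G, t0, t1, n=2000):
    """Midpoint-rule integral of sqrt( dgamma(t)^T G(gamma(t)) dgamma(t) ) over [t0, t1]."""
    ts = np.linspace(t0, t1, n + 1)
    mids = 0.5 * (ts[:-1] + ts[1:])
    dt = (t1 - t0) / n
    return sum(np.sqrt(dgamma(t) @ G(gamma(t)) @ dgamma(t)) for t in mids) * dt

# Polar metric on R^2: G(r, phi) = diag(1, r^2).
G = lambda x: np.diag([1.0, x[0] ** 2])
gamma = lambda t: np.array([2.0, t])        # r = 2, phi = t: quarter circle for t in [0, pi/2]
dgamma = lambda t: np.array([0.0, 1.0])     # velocity of the curve in (r, phi) coordinates
length = curve_length(gamma, dgamma, G, 0.0, np.pi / 2)
```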


Connections and Parallel Transport

- In Euclidean space R^m the derivative of the vector field V = V^µ e_µ with respect to x^ν has µth component

  ∂V^µ/∂x^ν = lim_{Δx^ν → 0} [V^µ(..., x^ν + Δx^ν, ...) − V^µ(..., x^ν, ...)] / Δx^ν

- Take an affine connection ∇; then, with a chart (U, φ) with coordinates x = φ(p) on M, define the connection coefficients by ∇_{e_ν} e_µ = Γ^λ_{νµ} e_λ, where e_µ = ∂/∂x^µ

- The covariant derivative of V with respect to x^ν is defined by

  lim_{Δx^ν → 0} [V^µ(x + Δx) − Ṽ^µ(x + Δx)] / Δx^ν (∂/∂x^µ) = ∇_ν V = (∂V^µ/∂x^ν + V^λ Γ^µ_{νλ})(∂/∂x^µ)

  where Ṽ is V parallel transported to x + Δx

- Allows us to connect nearby tangent spaces


Pictures


Putting It All Together

- A manifold with a metric defines an essentially unique connection called the Levi-Civita connection:

  Γ^k_{ij} = ½ G^{km} (∂ᵢG_{jm} + ∂ⱼG_{im} − ∂ₘG_{ij})

- Given a curve c(t) with velocity field V = dx^µ/dt, it is a geodesic if ∇_V V = 0. This corresponds to the notion of "straightness" on a manifold.

- A geodesic is equivalent to a curve minimizing the length D(t₁, t₂) defined before.

- Proposals will be geodesics.


Information Geometry

- Describe a set of probability distributions parametrized by θ as a statistical manifold.

- Denote the expected Fisher information as

  G(θ) = Cov(∇θ log p(y | θ)) = −E[∂² log p(y | θ) / ∂θ²]

- To second order:

  D(θ ‖ θ + δθ) = ∫ dy p(y | θ + δθ) log [p(y | θ + δθ) / p(y | θ)] ≈ ½ δθᵀ G(θ) δθ
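For a Gaussian family this quadratic approximation can be checked against the closed-form KL divergence (our sketch; we use the single-observation, N = 1, form of the Fisher metric for N(µ, σ) in (µ, σ) coordinates):

```python
import numpy as np

def kl_gauss(mu0, s0, mu1, s1):
    """KL( N(mu0, s0^2) || N(mu1, s1^2) ), closed form."""
    return np.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2 * s1 ** 2) - 0.5

# Fisher metric of N(mu, sigma) in (mu, sigma) coordinates, single observation.
mu, s = 0.0, 2.0
G = np.diag([1.0 / s ** 2, 2.0 / s ** 2])
delta = np.array([0.01, 0.02])
kl = kl_gauss(mu, s, mu + delta[0], s + delta[1])
quad = 0.5 * delta @ G @ delta      # agrees with the KL up to O(|delta|^3)
```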


N (µ, σ)

- For N observations of N(µ, σ), the metric in (µ, σ) coordinates is:

  G = [ N/σ²    0    ]
      [ 0     2N/σ²  ]

  ∂G/∂µ = [ 0  0 ]
          [ 0  0 ]

  ∂G/∂σ = [ −2N/σ³     0    ]
          [ 0       −4N/σ³  ]

- Riemannian Langevin diffusion:

  dθᵢ(t) = ½ [G⁻¹(θ(t)) ∇θL(θ)]ᵢ dt
         + |G(θ(t))|^{−1/2} Σⱼ ∂/∂θⱼ [G⁻¹(θ(t))ᵢⱼ |G(θ(t))|^{1/2}] dt
         + [√(G⁻¹(θ(t))) dW(t)]ᵢ


More Pictures

- With a sample size of N = 30 drawn from N(µ = 0, σ = 10)


Riemannian HMC

- HMC on a manifold metrized with the Fisher information:

  H(θ, p) = −L(θ) + ½ log((2π)^D |G(θ)|) + ½ pᵀ G(θ)⁻¹ p

- Integrating out p leaves us with the correct marginal over θ.


Riemannian HMC cont'd.

- Hamilton's equations for the time evolution are equivalent to the 2nd-order geodesic equations - so RHMC proposals are geodesics.

  dθᵢ/dt = ∂H/∂pᵢ = (G(θ)⁻¹p)ᵢ                                   (3)

  dpᵢ/dt = −∂H/∂θᵢ = ∂L(θ)/∂θᵢ − ½ Tr[G(θ)⁻¹ ∂G(θ)/∂θᵢ]
                     + ½ pᵀ G(θ)⁻¹ (∂G(θ)/∂θᵢ) G(θ)⁻¹ p          (4)

- Define an implicit, volume-preserving, reversible Generalized Leapfrog integrator, which is solved iteratively.


Riemannian HMC Performance


Riemann SGLD Performance


Citations

- Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

- Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113-162, 2010.

- Hoffman, Matthew D. and Gelman, Andrew. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593-1623.

- Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73(2):123-214, 2011.

- Nakahara, Mikio. Geometry, Topology and Physics.
