Markov Chain Monte Carlo Inference
Melih Kandemir
Heidelberg Grad Days 2019
Lecture 3
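Monte Carlo Integration
The big question: Evaluate
\[ \mathbb{E}_{p(z)}[f(z)] = \int f(z)\, p(z)\, dz \]
Examples
- Bayesian prediction:
  \[ p(z_{\text{new}} \mid z, D) = \int p(z_{\text{new}} \mid \theta)\, p(\theta \mid D)\, d\theta = \mathbb{E}_{p(\theta \mid D)}[p(z_{\text{new}} \mid \theta)] \]
- Difficult variational updates: $\log q(z_1) \leftarrow \mathbb{E}_{p(z_2)}[\log p(z_1, z_2)]$
- Difficult E-step in EM: $Q(\theta, \theta^{\text{old}}) = \mathbb{E}_{p(z \mid D, \theta^{\text{old}})}[\log p(z, D \mid \theta)]$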
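Approximating the integral by samples
\[ \mathbb{E}_{p(z)}[f(z)] = \int f(z)\, p(z)\, dz \simeq \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)}) \]
where $z^{(l)}$ are samples drawn from $p(z)$.
As long as the samples are drawn iid from the true $p(z)$, the variance of the estimator is $\mathrm{var}[f]/L$ regardless of the dimensionality of $z$, so around 20 samples are often sufficient for a good approximation.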
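The sample average above can be sketched in a few lines. This is a minimal illustration, assuming a toy case that is not in the slides: $p(z) = \mathcal{N}(0, 1)$ and $f(z) = z^2$, for which the exact expectation is 1. The helper name `mc_expectation` is hypothetical.

```python
import numpy as np

def mc_expectation(f, sampler, num_samples, rng):
    """Approximate E_p[f(z)] by averaging f over iid samples z ~ p."""
    z = sampler(num_samples, rng)
    return np.mean(f(z))

rng = np.random.default_rng(0)
# Assumed toy example: p(z) = N(0, 1) and f(z) = z^2, so E_p[f(z)] = 1.
sampler = lambda n, rng: rng.standard_normal(n)
f = lambda z: z**2
small = mc_expectation(f, sampler, 20, rng)        # rough estimate, L = 20
large = mc_expectation(f, sampler, 100_000, rng)   # tight estimate
```

The estimate with L = 20 is already in the right range; the larger sample only shrinks the error at the rate $1/\sqrt{L}$.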
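Sampling from inverse CDF
Let $h$ be the cumulative distribution function (CDF) of the target distribution.
- Draw $u \sim \mathrm{Uniform}(0, 1)$
- Calculate $y = h^{-1}(u)$
Because:
\[ \Pr(h^{-1}(u) \le y) = \Pr(u \le h(y)) = h(y) \]
Problem: How do we compute $h^{-1}(u)$ for an arbitrary distribution?
Source: Bishop, PRML, 2006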
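For a distribution whose CDF can be inverted by hand, the two steps above are straightforward. The sketch below assumes an exponential distribution (an illustrative choice, not from the slides), where $h(y) = 1 - e^{-\lambda y}$ and therefore $h^{-1}(u) = -\ln(1 - u)/\lambda$.

```python
import numpy as np

def sample_exponential(rate, num_samples, rng):
    """Inverse-CDF sampling for h(y) = 1 - exp(-rate * y)."""
    u = rng.uniform(0.0, 1.0, size=num_samples)
    # Solve u = h(y) for y; log1p(-u) = log(1 - u) is finite since u < 1.
    return -np.log1p(-u) / rate

rng = np.random.default_rng(0)
y = sample_exponential(rate=2.0, num_samples=100_000, rng=rng)
# The sample mean should be close to 1 / rate = 0.5.
```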
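Rejection Sampling
Target distribution $p(z)$ and envelope distribution $q(z)$, with a constant $k$ chosen such that $k\,q(z) \ge p(z)$ for all $z$.
Procedure:
- $z^{(t)} \sim q(z)$
- $u^{(t)} \sim \mathrm{Uniform}(0, k\,q(z^{(t)}))$
- Accept the sample if $u^{(t)} \le p(z^{(t)})$
\[ p(\text{accept}) = \int \frac{p(z)}{k\,q(z)}\, q(z)\, dz = \frac{1}{k} \int p(z)\, dz \]
Source: Bishop, PRML, 2006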
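The procedure and the acceptance probability $1/k$ can be checked numerically. The sketch below assumes a Beta(2, 2) target and a uniform envelope; both are illustrative choices, and `rejection_sample` is a hypothetical helper name.

```python
import numpy as np

def rejection_sample(p, q_sample, q_pdf, k, num_proposals, rng):
    """Draw proposals from q and keep those that fall under the curve p."""
    z = q_sample(num_proposals, rng)
    u = rng.uniform(0.0, k * q_pdf(z))   # u ~ Uniform(0, k q(z))
    accepted = u <= p(z)
    return z[accepted], accepted.mean()

rng = np.random.default_rng(0)
# Assumed target: Beta(2, 2) density p(z) = 6 z (1 - z) on [0, 1].
p = lambda z: 6.0 * z * (1.0 - z)
# Assumed envelope: q = Uniform(0, 1) with k = 1.5, since max p(z) = p(0.5) = 1.5.
q_sample = lambda n, rng: rng.uniform(0.0, 1.0, size=n)
q_pdf = lambda z: np.ones_like(z)
samples, acceptance_rate = rejection_sample(p, q_sample, q_pdf, 1.5, 100_000, rng)
```

Since the target here is normalized, the acceptance rate should be close to $1/k = 2/3$.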
Adaptive Rejection Sampling
The envelope function is a set of piecewise exponential functions:
\[ q(z) = k_i \lambda_i \exp\{-\lambda_i (z - z_{i-1})\}, \qquad z_{i-1} \le z \le z_i \]
Each rejected sample is added as a grid point.
Acceptance rate decays exponentially wrt dimensionality!
Source: Bishop, PRML, 2006
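Importance Sampling (1)
\[ \mathbb{E}_{p(z)}[f(z)] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz \]
Draw $L$ samples $z^{(l)}$ from $q(z)$. Then,
\[ \mathbb{E}_{p(z)}[f(z)] \simeq \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})\, \frac{p(z^{(l)})}{q(z^{(l)})} \]
where the ratio $p(z^{(l)})/q(z^{(l)})$ is the importance weight.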
Importance Sampling (2)
- (+) All samples are retained.
- (-) Depends too heavily on how similar $q(z)$ is to $p(z)$.
- (-) No diagnostic measures available!
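A short sketch of the importance-weighted average, assuming a standard normal target and a wider normal proposal (illustrative choices; `normal_pdf` and `importance_estimate` are hypothetical helpers). With $f(z) = z^2$ the exact answer is 1.

```python
import numpy as np

def normal_pdf(z, mean, std):
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def importance_estimate(f, p_pdf, q_pdf, q_sample, num_samples, rng):
    """Estimate E_p[f(z)] from samples of q, reweighted by p(z) / q(z)."""
    z = q_sample(num_samples, rng)
    weights = p_pdf(z) / q_pdf(z)   # importance weights
    return np.mean(f(z) * weights)

rng = np.random.default_rng(0)
# Assumed setup: target p = N(0, 1), proposal q = N(0, 2^2), f(z) = z^2.
estimate = importance_estimate(
    f=lambda z: z**2,
    p_pdf=lambda z: normal_pdf(z, 0.0, 1.0),
    q_pdf=lambda z: normal_pdf(z, 0.0, 2.0),
    q_sample=lambda n, rng: 2.0 * rng.standard_normal(n),
    num_samples=100_000,
    rng=rng,
)
```

The proposal is deliberately wider than the target; a proposal narrower than the target would produce a few very large weights and a much noisier estimate.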
Markov Chain Monte Carlo
- Robust to high dimensionality
- Samples form a Markov chain with a transition function $T(z \mid z')$
- Samples are drawn from the target distribution $p(z)$ if
  - $p(z)$ is invariant wrt $T(z \mid z')$:
    \[ p(z) = \int p(z')\, T(z \mid z')\, dz' \]
  - the Markov chain governed by $T(z \mid z')$ is ergodic.
- Invariance: Ensured by detailed balance:
  \[ p(z)\, T(z' \mid z) = p(z')\, T(z \mid z') \]
- Ergodicity: More tricky. Imposed by sampling algorithms.
Metropolis-Hastings
Procedure:
- Propose the next state $z'$ from $Q(z' \mid z)$, e.g. $\mathcal{N}(z, \sigma^2)$
- Accept with probability
  \[ \min\left(1, \frac{p(z')\, Q(z \mid z')}{p(z)\, Q(z' \mid z)}\right) \]
- Otherwise, stay at the current state (add another copy of it to the samples list)
The proposal variance $\sigma^2$ is very influential.
- Determines step size
- If large, low acceptance rate
- If small, slow convergence
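The procedure can be sketched as a random-walk sampler. This is a minimal version under stated assumptions: the proposal is the symmetric Gaussian $\mathcal{N}(z, \sigma^2)$, so the $Q$ terms cancel in the acceptance ratio, and the target is an unnormalized standard normal chosen only for illustration.

```python
import numpy as np

def metropolis_hastings(log_p, z0, sigma, num_samples, rng):
    """Random-walk MH with the symmetric proposal Q(z'|z) = N(z, sigma^2).

    log_p may be unnormalized; with a symmetric Q the ratio is p(z') / p(z).
    """
    samples = np.empty(num_samples)
    z, num_accepted = z0, 0
    for t in range(num_samples):
        z_prop = z + sigma * rng.standard_normal()
        # Compare in log space: accept with probability min(1, p(z') / p(z)).
        if np.log(rng.uniform()) < log_p(z_prop) - log_p(z):
            z = z_prop
            num_accepted += 1
        samples[t] = z  # a rejected proposal repeats the current state
    return samples, num_accepted / num_samples

rng = np.random.default_rng(0)
log_p = lambda z: -0.5 * z**2  # assumed target: unnormalized N(0, 1)
samples, rate = metropolis_hastings(log_p, 0.0, 2.0, 50_000, rng)
# A very large proposal variance lowers the acceptance rate.
rate_large = metropolis_hastings(log_p, 0.0, 50.0, 5_000, rng)[1]
```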
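Metropolis-Hastings (2)
Detailed balance holds:
\begin{align*}
p(z)\, T(z' \mid z) &= p(z)\, Q(z' \mid z) \min\left(1, \frac{p(z')\, Q(z \mid z')}{p(z)\, Q(z' \mid z)}\right) \\
&= \min\big(p(z)\, Q(z' \mid z),\; p(z')\, Q(z \mid z')\big) \\
&= p(z')\, Q(z \mid z') \min\left(\frac{p(z)\, Q(z' \mid z)}{p(z')\, Q(z \mid z')}, 1\right) \\
&= p(z')\, T(z \mid z')
\end{align*}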
Metropolis-Hastings (3)
1-D Demo:
Source: Murray, MLSS, 2009
Gibbs Sampling
Procedure:
- Initialize $z_1^{(1)}, z_2^{(1)}, z_3^{(1)}$
- For $l = 1$ to $L - 1$:
  - $z_1^{(l+1)} \sim p(z_1 \mid z_2^{(l)}, z_3^{(l)})$
  - $z_2^{(l+1)} \sim p(z_2 \mid z_1^{(l+1)}, z_3^{(l)})$
  - $z_3^{(l+1)} \sim p(z_3 \mid z_1^{(l+1)}, z_2^{(l+1)})$
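The same sweep pattern can be sketched on a model whose full conditionals are known in closed form. The example below assumes a two-variable (rather than three-variable) zero-mean, unit-variance Gaussian with correlation $\rho$, where $z_1 \mid z_2 \sim \mathcal{N}(\rho z_2, 1 - \rho^2)$ and symmetrically for $z_2$; this model is an illustrative assumption.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, num_samples, rng):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian."""
    samples = np.empty((num_samples, 2))
    z1, z2 = 0.0, 0.0                      # initialization
    cond_std = np.sqrt(1.0 - rho**2)
    for l in range(num_samples):
        z1 = rho * z2 + cond_std * rng.standard_normal()  # uses the current z2
        z2 = rho * z1 + cond_std * rng.standard_normal()  # uses the new z1
        samples[l] = (z1, z2)
    return samples

rng = np.random.default_rng(0)
samples = gibbs_bivariate_gaussian(rho=0.8, num_samples=50_000, rng=rng)
empirical_rho = np.corrcoef(samples.T)[0, 1]
```

Each variable is always updated conditioned on the most recent values of the others, exactly as in the index pattern of the procedure.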
Gibbs Sampling (2)
- Invariance: All conditioned variates are constant by definition, and the remaining variable is sampled from the true conditional distribution.
- Ergodicity: Guaranteed if all conditional probabilities are non-zero in their entire domain.
- Gibbs sampling is a special case of Metropolis-Hastings with $q_k(z' \mid z) = p(z'_k \mid z_{\setminus k})$. Since $z'_{\setminus k} = z_{\setminus k}$,
  \[ A(z' \mid z) = \frac{p(z'_k \mid z'_{\setminus k})\, p(z'_{\setminus k})\, p(z_k \mid z'_{\setminus k})}{p(z_k \mid z_{\setminus k})\, p(z_{\setminus k})\, p(z'_k \mid z_{\setminus k})} = 1 \]
  Hence, all samples are accepted.
Gibbs Sampling (3)
Step size is governed by covariances of conditional distributions.
Iterative conditional modes: Instead of sampling, update wrt a point estimate (e.g. mean, mode).
Source: Bishop, PRML, 2006
Collapsed Gibbs Sampling
Integrating out some of the variables can make the remaining ones conditionally independent, which entails faster convergence.
Rao-Blackwell Theorem: Let $z$ and $\theta$ be dependent variables, and $f(z, \theta)$ be some scalar function. Then,
\[ \mathrm{var}_{z,\theta}[f(z, \theta)] \ge \mathrm{var}_{z}\big[\mathbb{E}_{\theta}[f(z, \theta) \mid z]\big]. \]
Example: Gaussian Mixture Model
Place conjugate priors on:
- cluster means
- cluster covariances
- mixture probabilities
Then integrate them out!
Source: Murphy, Mach. Learn., 2012
Implementation tricks
- Thinning: Keep every $K$th sample to decorrelate
- Burn-in: Discard the first portion (e.g. half) of the samples, which were drawn before the chain mixed
- Multiple runs: To neutralize the effect of initialization
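Burn-in and thinning amount to slicing the stored chain. A minimal sketch, with the hypothetical helper `burn_in_and_thin` and default values chosen only for illustration:

```python
import numpy as np

def burn_in_and_thin(chain, burn_in_fraction=0.5, thin=10):
    """Drop the first part of a chain, then keep every `thin`-th sample."""
    chain = np.asarray(chain)
    start = int(burn_in_fraction * len(chain))
    return chain[start::thin]

chain = np.arange(1000)  # stand-in for 1000 stored MCMC samples
kept = burn_in_and_thin(chain, burn_in_fraction=0.5, thin=10)
```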
Diagnosing Convergence 1: Traceplots
Diagnosing Convergence 2: Running mean plots
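Diagnosing Convergence 3: Gelman-Rubin Metric
- Calculate the within-chain variance $W$ and the between-chain variance $B$ from multiple chains of $n$ samples each
- Calculate the estimated variance
  \[ \widehat{\mathrm{Var}}(\theta) = (1 - 1/n)\, W + (1/n)\, B \]
- Calculate and monitor the Potential Scale Reduction Factor (PSRF)
  \[ \hat{R} = \sqrt{\frac{\widehat{\mathrm{Var}}(\theta)}{W}} \]
- $\hat{R}$ should decrease towards 1 as the chains converge.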
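The PSRF can be computed from an array of parallel chains. The sketch below assumes the common convention that $B$ is $n$ times the sample variance of the chain means and $W$ is the average of the per-chain sample variances; `gelman_rubin` is a hypothetical helper name.

```python
import numpy as np

def gelman_rubin(chains):
    """PSRF for an array of shape (num_chains, n) holding a scalar parameter."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    var_hat = (1.0 - 1.0 / n) * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
mixed = rng.standard_normal((4, 2000))                   # chains that agree
stuck = mixed + np.array([[0.0], [1.0], [2.0], [3.0]])   # chains stuck at different offsets
r_mixed, r_stuck = gelman_rubin(mixed), gelman_rubin(stuck)
```

Chains that agree give a value near 1, while chains stuck in different regions give a value clearly above 1.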
Diagnosing Convergence 4: Other metrics
- Geweke diagnostic: Take the first x and last y samples in the chain and test if they come from the same distribution.
- Raftery and Lewis diagnostic: Calculate the number of iterations until a desired level of accuracy is reached for a posterior quantile.
- Heidelberger and Welch diagnostic: Repeated significance testing (null hypothesis: the chain is stationary)
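Example: Bayesian logistic regression
\begin{align*}
p(f_i \mid \mathbf{w}, \mathbf{x}_i) &= \mathcal{N}(f_i \mid \mathbf{w}^T \mathbf{x}_i, \sigma^2), & i &= 1, \cdots, N \\
p(y_i \mid f_i) &= \frac{1}{1 + e^{-f_i y_i}}, & i &= 1, \cdots, N \\
p(w_d \mid \alpha_d) &= \mathcal{N}(w_d \mid 0, \alpha_d^{-1}), & d &= 1, \cdots, D \\
p(\alpha_d) &= \mathcal{G}(\alpha_d \mid a, b), & d &= 1, \cdots, D
\end{align*}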
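Let's aim for a Gibbs sampler
We require the following conditional distributions:
\begin{align}
& p(\mathbf{w} \mid \mathbf{f}, \boldsymbol{\alpha}, \mathbf{X}, \mathbf{y}), \tag{1} \\
& p(\boldsymbol{\alpha} \mid \mathbf{w}, \mathbf{f}, \mathbf{X}, \mathbf{y}), \tag{2} \\
& p(\mathbf{f} \mid \mathbf{w}, \boldsymbol{\alpha}, \mathbf{X}, \mathbf{y}) \tag{3}
\end{align}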
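The log joint
\begin{align*}
\log p(\mathbf{w}, \mathbf{f}, \boldsymbol{\alpha}, \mathbf{X}, \mathbf{y})
&= \sum_{i=1}^{N} \log p(f_i \mid \mathbf{w}, \mathbf{x}_i) + \sum_{i=1}^{N} \log p(y_i \mid f_i) + \sum_{d=1}^{D} \log p(w_d \mid \alpha_d) + \sum_{d=1}^{D} \log p(\alpha_d) \\
&= -\frac{1}{2} \log |\sigma^2 \mathbf{I}| - \frac{1}{2\sigma^2} (\mathbf{f}^T - \mathbf{w}^T \mathbf{X}^T)(\mathbf{f} - \mathbf{X}\mathbf{w}) - \sum_{i=1}^{N} \log(1 + e^{-y_i f_i}) \\
&\quad + \frac{1}{2} \sum_{d=1}^{D} \log \alpha_d - \frac{1}{2} \mathbf{w}^T \mathbf{A} \mathbf{w} + \sum_{d=1}^{D} (a - 1) \log \alpha_d - \sum_{d=1}^{D} b\, \alpha_d + \text{const}
\end{align*}
where $A_{dd} = \alpha_d$ and $A_{ij} = 0$ for $i \neq j$.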
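The conditionals
\begin{align*}
p(\alpha_d \mid \boldsymbol{\alpha}_{-d}, \mathbf{w}, \mathbf{f}, \mathbf{X}, \mathbf{y}) &= \mathcal{G}\left(\alpha_d \,\middle|\, a + \tfrac{1}{2},\; b + \tfrac{1}{2} w_d^2\right) \\
p(\mathbf{w} \mid \mathbf{f}, \boldsymbol{\alpha}, \mathbf{X}, \mathbf{y}) &= \mathcal{N}\left(\mathbf{w} \,\middle|\, (\mathbf{X}^T\mathbf{X} + \sigma^2 \mathbf{A})^{-1} \mathbf{X}^T \mathbf{f},\; \sigma^2 (\mathbf{X}^T\mathbf{X} + \sigma^2 \mathbf{A})^{-1}\right) \\
p(\mathbf{f} \mid \mathbf{w}, \boldsymbol{\alpha}, \mathbf{X}, \mathbf{y}) &: \text{Metropolis with } q(f_i) = \mathcal{N}(f_i \mid \mathbf{w}^T \mathbf{x}_i, \sigma^2)
\end{align*}
For $\sigma^2 = 1$ the conditional of $\mathbf{w}$ reduces to $\mathcal{N}\big(\mathbf{w} \mid (\mathbf{X}^T\mathbf{X} + \mathbf{A})^{-1} \mathbf{X}^T \mathbf{f},\; (\mathbf{X}^T\mathbf{X} + \mathbf{A})^{-1}\big)$.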
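One sweep over the three conditionals could look as follows. This is a sketch under stated assumptions, not the lecture's own implementation: labels are coded as $y_i \in \{-1, +1\}$, the Gamma distribution uses the shape/rate convention implied by the log joint, and the Metropolis proposal for $f_i$ is treated as an independence proposal that equals the prior term $\mathcal{N}(f_i \mid \mathbf{w}^T\mathbf{x}_i, \sigma^2)$, so the acceptance ratio reduces to a ratio of likelihoods $p(y_i \mid f_i') / p(y_i \mid f_i)$. The function name `gibbs_sweep` and the synthetic data are hypothetical.

```python
import numpy as np

def log_sigmoid(t):
    return -np.logaddexp(0.0, -t)   # log(1 / (1 + exp(-t))), numerically stable

def gibbs_sweep(w, f, alpha, X, y, a, b, sigma2, rng):
    N, D = X.shape
    # alpha_d | w ~ Gamma(shape = a + 1/2, rate = b + w_d^2 / 2); NumPy takes scale = 1 / rate.
    alpha = rng.gamma(shape=a + 0.5, scale=1.0 / (b + 0.5 * w**2))
    # w | f, alpha ~ N(mean, cov) with precision X^T X / sigma2 + diag(alpha).
    cov = np.linalg.inv(X.T @ X / sigma2 + np.diag(alpha))
    cov = 0.5 * (cov + cov.T)       # remove round-off asymmetry
    mean = cov @ X.T @ f / sigma2
    w = rng.multivariate_normal(mean, cov)
    # f_i | w, y_i: Metropolis step with proposal N(w^T x_i, sigma2).
    f_prop = X @ w + np.sqrt(sigma2) * rng.standard_normal(N)
    log_ratio = log_sigmoid(y * f_prop) - log_sigmoid(y * f)
    accept = np.log(rng.uniform(size=N)) < log_ratio
    f = np.where(accept, f_prop, f)
    return w, f, alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                                  # synthetic inputs
y = np.where(X @ np.array([1.0, -2.0, 0.5]) > 0, 1.0, -1.0)       # synthetic labels
w, f, alpha = np.zeros(3), np.zeros(50), np.ones(3)
for _ in range(200):
    w, f, alpha = gibbs_sweep(w, f, alpha, X, y, a=1.0, b=1.0, sigma2=1.0, rng=rng)
```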
Problems with standard MCMCs
Metropolis and Metropolis-Hastings:
- The proposal distribution is agnostic about the target model.
- The proposals tend to perform a random walk, hence shoot blindly.
- The outcome often is an unacceptably low acceptance rate.
Gibbs:
- Requires conditional distributions available in closed form (not tractable even for logistic regression).
Remedy ⇒ Incorporate model curvature into the sampling scheme.
Back to the past: Potential and Kinetic Energy
Figure. http://99daveva31893.blogspot.com.tr/2013/05/potential-and-kinetic-energy.html.jpg
For physicists
- System state: Position $\theta$ of the roller coaster.
- Potential energy: A score $U(\theta)$ proportional to the height of the roller coaster.
- Kinetic energy: A score $K(r)$ that grows with the speed (actually momentum) $r$ of the roller coaster.
- Total energy: Rule that governs how potential and kinetic energies are related, $H(\theta, r) = K(r) + U(\theta)$.
Here $H(\theta, r)$ is also called the Hamiltonian function.
For us machine learners
- System state: Position $\theta$ on the explored space of latent variables.
- Potential energy: A score $U(\theta)$ given by the negative log of the unnormalized posterior (i.e. no normalization constant) we aim to sample from.
- Kinetic energy: Auxiliary variables $r$ that animate the system state as fast as an auxiliary score $K(r)$.
- Total energy: Rule $H(\theta, r) = K(r) + U(\theta)$ assuring that the animated system is a Markov chain which has the posterior as the stationary distribution.
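Hamiltonian dynamics
A physically-inspired way to model the Markov chain dynamics: For the $i$th latent variable (particle for physicists), we have
\begin{align}
\frac{d\theta_i}{dt} &= \frac{\partial H}{\partial r_i}, \tag{4} \\
\frac{dr_i}{dt} &= -\frac{\partial H}{\partial \theta_i}, \tag{5}
\end{align}
meaning
- (4) A particle moves as fast as the change in kinetic energy.
- (5) The momentum of the particle increases as fast as the decrease in its height.
We choose
\begin{align*}
K(r) &= \frac{1}{2} r^T M^{-1} r \\
U(\theta) &= -\log p(\theta) - \log p(D \mid \theta) = -\log p(\theta) - \sum_{x \in D} \log p(x \mid \theta).
\end{align*}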
Understanding Hamiltonian dynamics: The hockey puck analogy
\[ \frac{d\theta_i}{dt} = \frac{\partial H}{\partial r_i}, \qquad \frac{dr_i}{dt} = -\frac{\partial H}{\partial \theta_i} \]
- Assume a hockey puck ($\theta$) placed on a rugged surface of frictionless ice ($U(\theta)$).
- We let the puck move by pushing it in an arbitrary direction, which gives it the kinetic energy $K(r)$.
- As the surface is frictionless, the puck will keep moving forever (so we can sample as much as we want!).
- On a flat surface, the puck will keep constant speed.
- Under positive slope $\partial H / \partial \theta_i > 0$ it will climb and then lose speed, and vice versa.
- The puck swings between steep regions (modes) and keeps speed on plateaus!
How to solve the Hamiltonian system of differential equations
No analytically tractable solution for interesting models. Approximate by finite differences.
Way 1: Euler's method:
\begin{align*}
r_{t+\epsilon} &\leftarrow r_t + \epsilon \frac{dr_t}{dt} = r_t - \epsilon \nabla_{\theta} U(\theta_t) \\
\theta_{t+\epsilon} &\leftarrow \theta_t + \epsilon \frac{d\theta_t}{dt} = \theta_t + \epsilon M^{-1} r_t
\end{align*}
- Advantage: Extremely simple.
- Disadvantage: The finite difference approximation drifts further from the true trajectory at every time step.
How to solve the Hamiltonian system of differential equations
Way 2: Modified Euler's method:
\begin{align*}
r_{t+\epsilon} &\leftarrow r_t + \epsilon \frac{dr_t}{dt} = r_t - \epsilon \nabla_{\theta} U(\theta_t) \\
\theta_{t+\epsilon} &\leftarrow \theta_t + \epsilon \frac{d\theta_t}{dt} = \theta_t + \epsilon M^{-1} r_{t+\epsilon}
\end{align*}
- Coupling the two updates brings about a charming improvement in accuracy, yet does not solve all the problems.
- The finite difference approximation is still not sufficiently accurate.
- Stronger coupling is required.
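How to solve the Hamiltonian system of differential equations
Way 3: The Leapfrog method:
\begin{align*}
r_{t+\epsilon/2} &\leftarrow r_t + (\epsilon/2) \frac{dr_t}{dt} = r_t - (\epsilon/2) \nabla_{\theta} U(\theta_t) \\
\theta_{t+\epsilon} &\leftarrow \theta_t + \epsilon \frac{d\theta_t}{dt} = \theta_t + \epsilon M^{-1} r_{t+\epsilon/2} \\
r_{t+\epsilon} &\leftarrow r_{t+\epsilon/2} + (\epsilon/2) \frac{dr_{t+\epsilon}}{dt} = r_{t+\epsilon/2} - (\epsilon/2) \nabla_{\theta} U(\theta_{t+\epsilon})
\end{align*}
Here is how the Leapfrog method takes one step ahead:
- Take half a step with the right leg.
- Take a full step with the left leg.
- Take half a step with the right leg.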
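The difference between Euler's method and the Leapfrog method shows up in how well each preserves the total energy $H$. The sketch below assumes a one-dimensional harmonic oscillator, $U(\theta) = \theta^2/2$ with $M = 1$, as an illustrative test system; on the exact trajectory $H$ stays at its initial value 0.5.

```python
def euler_step(theta, r, grad_U, eps):
    r_new = r - eps * grad_U(theta)
    theta_new = theta + eps * r                      # uses the old momentum
    return theta_new, r_new

def leapfrog_step(theta, r, grad_U, eps):
    r_half = r - 0.5 * eps * grad_U(theta)           # half step for the momentum
    theta_new = theta + eps * r_half                 # full step for the position
    r_new = r_half - 0.5 * eps * grad_U(theta_new)   # half step at the new position
    return theta_new, r_new

# Assumed test system: U(theta) = theta^2 / 2 and K(r) = r^2 / 2.
grad_U = lambda theta: theta
hamiltonian = lambda theta, r: 0.5 * theta**2 + 0.5 * r**2

def simulate(step, num_steps=200, eps=0.1):
    theta, r = 1.0, 0.0                              # initial energy H = 0.5
    for _ in range(num_steps):
        theta, r = step(theta, r, grad_U, eps)
    return hamiltonian(theta, r)

h_euler = simulate(euler_step)        # grows by a factor (1 + eps^2) per step
h_leapfrog = simulate(leapfrog_step)  # stays close to 0.5
```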
Euler’s method and Leapfrog
Figure. R. Neal, MCMC using Hamiltonian Dynamics, 2011
Hamiltonian Monte Carlo Sampling
Figure. T. Chen et al. Stochastic Gradient Hamiltonian Monte Carlo, ICML, 2014
The Metropolis Correction
The numerical inaccuracies resulting from the finite difference approximation accumulate and cause the chain to diverge from the target distribution. We can avoid this by accessing the true posterior every now and then.
Define the Hamiltonian joint distribution
\[ \pi(\theta, r) \propto \exp\left(-U(\theta) - \frac{1}{2} r^T M^{-1} r\right) = \exp(-H(\theta, r)). \]
Applying the Metropolis criterion to a proposal $(\theta, r)$ simulated from the starting point $(\theta_0, r_0)$, we get the acceptance probability
\[ \min\left(1, \frac{\pi(\theta, r)}{\pi(\theta_0, r_0)}\right) = \min\Big(1, \underbrace{e^{H(\theta_0, r_0) - H(\theta, r)}}_{\rho}\Big). \]
Remark: Accessing the posterior $p(\theta \mid D)$ is nice, but could be unacceptably expensive for some models, such as deep neural nets!
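Putting the leapfrog integrator and the Metropolis correction together gives a basic HMC sampler. The sketch below assumes $M = I$, a fresh Gaussian momentum per iteration, and a two-dimensional standard normal target chosen only for illustration; the step size and trajectory length are arbitrary.

```python
import numpy as np

def hmc(U, grad_U, theta0, eps, num_leapfrog, num_samples, rng):
    """HMC with M = I: leapfrog proposal followed by a Metropolis correction."""
    theta = np.array(theta0, dtype=float)
    samples = np.empty((num_samples, theta.size))
    for t in range(num_samples):
        r0 = rng.standard_normal(theta.size)        # resample the momentum
        theta_new, r = theta.copy(), r0.copy()
        r = r - 0.5 * eps * grad_U(theta_new)       # first half step
        for i in range(num_leapfrog):
            theta_new = theta_new + eps * r
            if i < num_leapfrog - 1:
                r = r - eps * grad_U(theta_new)     # two merged half steps
        r = r - 0.5 * eps * grad_U(theta_new)       # last half step
        h_old = U(theta) + 0.5 * r0 @ r0
        h_new = U(theta_new) + 0.5 * r @ r
        # Accept with probability min(1, exp(H(theta0, r0) - H(theta, r))).
        if np.log(rng.uniform()) < h_old - h_new:
            theta = theta_new
        samples[t] = theta
    return samples

rng = np.random.default_rng(0)
U = lambda th: 0.5 * th @ th      # assumed target: N(0, I) in two dimensions
grad_U = lambda th: th
samples = hmc(U, grad_U, np.zeros(2), eps=0.2, num_leapfrog=10, num_samples=5_000, rng=rng)
```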
HMC in R
Figure. R. Neal, MCMC using Hamiltonian Dynamics, 2011
Random walk versus Hamiltonian dynamics
Figure. R. Neal, MCMC using Hamiltonian Dynamics, 2011
HMC in high dimensionalities
Figure. R. Neal, MCMC using Hamiltonian Dynamics, 2011
Stochastic Gradient HMC
- HMC is great for accurate posterior inference, but every jump requires a full pass over the data, which is no longer practical for today's large data sets.
- The naive way out is to switch from the exact gradient to a stochastic gradient for the potential energy (i.e. the model):
  \[ \nabla \tilde{U}(\theta) = -\frac{|D|}{|\tilde{D}|} \sum_{x \in \tilde{D}} \nabla \log p(x \mid \theta) - \nabla \log p(\theta), \]
  where $\tilde{D}$ is a random minibatch.
- The stochastic gradient trick is mostly used for very large data sets, which results in excessively many iterations. Since each minibatch gradient sums over many independently drawn data points, the Central Limit Theorem governs the stochastic gradient noise
  \[ \nabla \tilde{U}(\theta) - \nabla U(\theta) \sim \mathcal{N}(0, V), \]
  where $V$ is the noise covariance.
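The minibatch estimator can be checked against the full gradient on a small model. The sketch below assumes a scalar Gaussian model (prior $\theta \sim \mathcal{N}(0, 1)$, likelihood $x \sim \mathcal{N}(\theta, 1)$), which is an illustrative choice; the function names are hypothetical.

```python
import numpy as np

def grad_U_full(theta, data):
    """Gradient of U(theta) = -log p(theta) - sum_x log p(x | theta)."""
    return theta + np.sum(theta - data)

def grad_U_minibatch(theta, data, batch_size, rng):
    batch = rng.choice(data, size=batch_size, replace=False)
    scale = len(data) / batch_size        # the factor |D| / |D~|
    return theta + scale * np.sum(theta - batch)

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=1000)
exact = grad_U_full(0.3, data)
noisy = np.array([grad_U_minibatch(0.3, data, 50, rng) for _ in range(5000)])
# noisy.mean() is close to exact, while noisy.std() shows the gradient noise.
```

Averaged over many minibatches the estimator matches the exact gradient, but each individual estimate carries substantial noise, which is what the term $\mathcal{N}(0, V)$ models.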
The notorious efficiency-accuracy tradeoff
Figure. Minibatch size axis from small to large: small minibatches give efficient computation, large minibatches give an accurate gradient.
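Naive Stochastic Gradient HMC
With a little sloppy notation, let us write
\[ \nabla \tilde{U}(\theta) - \nabla U(\theta) \sim \mathcal{N}(0, V) \;\Rightarrow\; \nabla \tilde{U}(\theta) = \nabla U(\theta) + \mathcal{N}(0, V). \]
With the added stochastic gradient noise, the $\epsilon$-discretized momentum update turns into
\begin{align*}
\Delta r = -\epsilon \nabla \tilde{U}(\theta) &= -\epsilon(\nabla U(\theta) + p), & p &= Ls, \; s \sim \mathcal{N}(0, I), \\
&= -\epsilon \nabla U(\theta) - \epsilon p, & \epsilon p &= \epsilon L s, \\
&= -\epsilon \nabla U(\theta) + \mathcal{N}(0, \epsilon^2 V),
\end{align*}
where $V = LL^T$ is the Cholesky decomposition of $V$, and the last line uses the symmetry of the zero-mean Gaussian.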
Naive Stochastic Gradient HMC
Taking $\epsilon \to 0$, we obtain the dynamical system below
\begin{align*}
d\theta &= M^{-1} r\, dt, \\
dr &= -\nabla U(\theta)\, dt + \mathcal{N}(0, 2B(\theta)\, dt),
\end{align*}
where $B(\theta) = \frac{1}{2} \epsilon V(\theta)$.
- The hockey puck on the ice surface is now under random wind!
- The Hamiltonian system preserves entropy under exact gradients.
- The extra entropy coming from the stochastic gradient breaks the balance and accumulates entropy at every iteration.
- Consequently, the dynamical system above converges to a uniform distribution!
Stochastic Gradient HMC with Friction
A new term is added to the system to counter the stochastic gradient entropy:
\begin{align*}
d\theta &= M^{-1} r\, dt, \\
dr &= -\nabla U(\theta)\, dt - B M^{-1} r\, dt + \mathcal{N}(0, 2B(\theta)\, dt).
\end{align*}
- The hockey puck is now on an ice surface, under random wind, and experiences some friction!
- With the friction term added, the system is again able to preserve entropy, hence the equilibrium distribution is the posterior.
- The dynamical system above is known by physicists as second-order Langevin dynamics [Wang & Uhlenbeck, 1945].
Practical Issues
- All is well, but we do not know the noise model $V$, hence the $B$ matrix in the friction term.
- Yet we can approximate it empirically, $B \approx \hat{B}$, where $\hat{B} = \frac{1}{2} \epsilon \hat{V}$ with $\hat{V}$ being the empirical Fisher information.
- For any $C \succeq \hat{B}$, the system below is equivalent to SGHMC with friction
  \begin{align*}
  d\theta &= M^{-1} r\, dt, \\
  dr &= -\nabla U(\theta)\, dt - C M^{-1} r\, dt + \mathcal{N}(0, 2(C - \hat{B})\, dt) + \mathcal{N}(0, 2B(\theta)\, dt).
  \end{align*}
- When $C = \hat{B} = 0$ the SGHMC boils down to SGD with momentum.
- No need to do Metropolis correction any longer! Recall that this saves a lot of time (evaluation of the model on the entire training data).
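A discretized version of the system above can be sketched for a scalar parameter. The assumptions here are illustrative: $M = 1$, a user-chosen friction $C$, the crude noise estimate $\hat{B} = 0$, and a Gaussian mean model (prior $\mathcal{N}(0, 1)$, likelihood $\mathcal{N}(\theta, 1)$) whose posterior mean is known. The term $\mathcal{N}(0, 2B(\theta)\,dt)$ is not sampled explicitly because it is already contained in the minibatch gradient.

```python
import numpy as np

def sghmc(grad_U_tilde, theta0, eps, C, B_hat, num_steps, rng):
    """Discretized SGHMC with friction for a scalar theta and M = 1."""
    theta, r = theta0, 0.0
    samples = np.empty(num_steps)
    noise_std = np.sqrt(2.0 * (C - B_hat) * eps)   # injected noise N(0, 2 (C - B_hat) eps)
    for t in range(num_steps):
        theta = theta + eps * r
        r = (r - eps * grad_U_tilde(theta, rng)    # stochastic gradient step
             - eps * C * r                         # friction term
             + noise_std * rng.standard_normal())
        samples[t] = theta
    return samples

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=100)              # assumed synthetic data set

def grad_U_tilde(theta, rng, batch_size=50):
    batch = rng.choice(data, size=batch_size, replace=False)
    return theta + (len(data) / batch_size) * np.sum(theta - batch)

samples = sghmc(grad_U_tilde, theta0=0.0, eps=0.01, C=5.0, B_hat=0.0,
                num_steps=20_000, rng=rng)
posterior_mean = data.sum() / (len(data) + 1)      # exact for this conjugate model
```

With $\hat{B} = 0$ the gradient noise is not compensated, so the sampled variance is somewhat inflated; the sample mean still tracks the posterior mean, and no Metropolis step is used.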
Stochastic Gradient HMC
Figure. T. Chen et al. Stochastic Gradient Hamiltonian Monte Carlo, ICML, 2014.
Naive versus principled SGHMC
Figure. T. Chen et al. Stochastic Gradient Hamiltonian Monte Carlo, ICML, 2014.
SGHMC for deep learning
A Bayesian multilayer perceptron with 100 hidden neurons evaluated on MNIST.
Figure. T. Chen et al. Stochastic Gradient Hamiltonian Monte Carlo, ICML, 2014.
Maximum A-Posteriori Estimation
For a Bayesian model $p(\theta)\, p(D \mid \theta)$ with $p(D \mid \theta) = \prod_{x \in D} p(x \mid \theta)$, the mode of the posterior is called the Maximum A-Posteriori Estimate (MAP). The MAP of a model can be found by
\[ \arg\min_{\theta} \left\{ -\log p(\theta) - \sum_{x \in D} \log p(x \mid \theta) \right\}. \]
When a closed-form solution is not available, we do gradient descent
\[ \theta_{t+1} \leftarrow \theta_t - \epsilon_t \left( -\nabla \log p(\theta_t) - \sum_{x \in D} \nabla \log p(x \mid \theta_t) \right). \]
The second term demands a full pass on the training set, which is not feasible in many cases.
Stochastic Gradient Descent (SGD)
The Robbins-Monro theorem [Robbins & Monro 1951] says that gradient steps taken with the approximate gradient
\[ \frac{|D|}{|\tilde{D}|} \sum_{x \in \tilde{D}} \nabla \log p(x \mid \theta), \]
obtained by randomly sampling a minibatch $\tilde{D}$ (hence assuming that the training data consists of $|D|/|\tilde{D}|$ replications of $\tilde{D}$), will converge to a local optimum after infinitely many iterations if the learning rate follows a series satisfying
\[ \sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty. \]
Hence, the learning rate should not decay too quickly (left), but should still keep decreasing over time (right).
(First-order) Langevin Dynamics
\begin{align*}
\Delta \theta_t &= \frac{\epsilon}{2} \left( \nabla \log p(\theta_t) + \sum_{x \in D} \nabla \log p(x \mid \theta_t) \right) + \eta_t, \\
\eta_t &\sim \mathcal{N}(0, \epsilon).
\end{align*}
- Just as in HMC, the update discretizes a differential equation (here a stochastic one).
- Converges to the posterior, but the discretization error has to be corrected by a Metropolis step.
- Unlike HMC, Gaussian noise is injected into the gradient step to enhance stochasticity.
- The noise variance $\epsilon$ is proportional to the learning rate.
- Decreasing $\epsilon$ also decreases the discretization error. Will be useful soon!
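Stochastic Gradient Langevin Dynamics (SGLD)
Adapt Langevin dynamics to SGD
\begin{align*}
\Delta \theta_t &= \frac{\epsilon_t}{2} \left( \nabla \log p(\theta_t) + \frac{|D|}{|\tilde{D}|} \sum_{x \in \tilde{D}} \nabla \log p(x \mid \theta_t) \right) + \eta_t, \\
\eta_t &\sim \mathcal{N}(0, \epsilon_t).
\end{align*}
- To be eligible for Robbins-Monro, the learning rate $\epsilon_t$ has to decrease in time.
- The discretization error will also decrease ⇒ Metropolis correction will no longer be required!
Source: M. Welling, Y.W. Teh, Bayesian Learning via Stochastic Gradient Langevin Dynamics, ICML, 2011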
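The SGLD update can be sketched on a model with a known posterior. The assumptions are illustrative: a scalar Gaussian mean model (prior $\mathcal{N}(0, 1)$, likelihood $\mathcal{N}(\theta, 1)$), a polynomially decaying schedule $\epsilon_t = a (b + t)^{-\gamma}$ with $\gamma \in (0.5, 1]$ so that the Robbins-Monro conditions hold, and arbitrary constants $a$, $b$, $\gamma$.

```python
import numpy as np

def sgld(grad_log_post_tilde, theta0, step_size, num_steps, rng):
    """SGLD for a scalar theta; returns the samples and the step sizes used."""
    theta = theta0
    samples, eps = np.empty(num_steps), np.empty(num_steps)
    for t in range(num_steps):
        eps[t] = step_size(t)
        eta = np.sqrt(eps[t]) * rng.standard_normal()   # eta_t ~ N(0, eps_t)
        theta = theta + 0.5 * eps[t] * grad_log_post_tilde(theta, rng) + eta
        samples[t] = theta
    return samples, eps

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=100)                   # assumed synthetic data set

def grad_log_post_tilde(theta, rng, batch_size=10):
    batch = rng.choice(data, size=batch_size, replace=False)
    # grad log p(theta) + (|D| / |D~|) * sum over the minibatch of grad log p(x | theta)
    return -theta + (len(data) / batch_size) * np.sum(batch - theta)

step_size = lambda t: 0.05 * (10.0 + t) ** (-0.55)      # assumed decaying schedule
samples, eps = sgld(grad_log_post_tilde, 0.0, step_size, 10_000, rng)
posterior_mean = data.sum() / (len(data) + 1)           # exact for this conjugate model
```

No accept/reject step appears anywhere; the decreasing step size is what keeps the discretization error under control.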
SGD ⇒ SGLD
- The algorithm has two phases. It starts with the first and switches to the second after some iterations:
  1. SGD: Performs only SGD with extended stochasticity ($\eta_t$) to escape local minima.
  2. Langevin Dynamics: Samples from the true posterior.
- The key question is when this switching takes place.
- The system transitions into phase two when
  \[ \epsilon_t \approx \frac{4 \alpha |\tilde{D}|}{|D|} \lambda_{\min}(I_F^{-1}), \]
  where $I_F^{-1}$ is the inverse of the empirical Fisher information of the stochastic gradient error, $\lambda_{\min}(\cdot)$ is the smallest eigenvalue of the argument, and $\alpha$ is a sample threshold.
Calculating posterior expectations
- During training, we collect a set of samples $\{\theta_1, \theta_2, \cdots, \theta_T\}$.
- We will use them to approximate an expectation $\mathbb{E}[f(\theta)]$, for instance the posterior predictive of a future observation $x^*$
  \[ p(x^* \mid D) = \int p(x^* \mid \theta)\, p(\theta \mid D)\, d\theta = \mathbb{E}[p(x^* \mid \theta)]. \]
- Simple sample averaging
  \[ \mathbb{E}[p(x^* \mid \theta)] \approx \frac{1}{T} \sum_{t=1}^{T} p(x^* \mid \theta_t) \]
  will over-emphasize the tail of the chain, where $\epsilon_t$ is small and consecutive samples are highly correlated, which inflates the variance of the estimator.
- Instead, do
  \[ \mathbb{E}[p(x^* \mid \theta)] \approx \frac{\sum_{t=1}^{T} \epsilon_t\, p(x^* \mid \theta_t)}{\sum_{t=1}^{T} \epsilon_t}. \]
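The step-size-weighted average is a one-liner once the values $p(x^* \mid \theta_t)$ and the step sizes $\epsilon_t$ are stored. The numbers below are made-up placeholders used only to show the arithmetic, and `weighted_posterior_expectation` is a hypothetical helper name.

```python
import numpy as np

def weighted_posterior_expectation(f_values, step_sizes):
    """Average f(theta_t) with weights proportional to the step sizes eps_t."""
    return np.average(f_values, weights=step_sizes)

f_values = np.array([0.0, 1.0, 2.0])     # placeholder values of p(x* | theta_t)
step_sizes = np.array([3.0, 2.0, 1.0])   # placeholder step sizes eps_t
estimate = weighted_posterior_expectation(f_values, step_sizes)
# (3*0 + 2*1 + 1*2) / (3 + 2 + 1) = 4/6
```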
Bayesian neural net benchmarking on some UCI data sets
A Bayesian neural net with a single hidden layer of 50 units is used. Root Mean Square Error on the test split is reported.
Data set   HMC    Dropout  PBP    Varout  SGLD
boston     2.76   2.97     3.01   2.70    2.21
concrete   4.12   5.23     5.67   4.89    4.19
kin8nm     0.06   0.10     0.10   0.08    0.02
power      3.73   4.02     4.12   4.04    2.42
protein    3.91   4.36     4.73   4.13    1.07
red wine   0.63   0.62     0.64   0.63    0.21
yacht      0.56   1.11     1.02   0.71    1.32