Geoff Gordon - Carnegie Mellon School of Computer Science
Transcript of ggordon/MCMC/ggordon.MCMC-tutorial.pdf (MCMC tutorial)
Numerical integration problem
[Figure: surface plot of f(X, Y) over X, Y ∈ [−1, 1]]

∫_{x∈X} f(x) dx
Used for: function approximation
f(x) ≈ α_1 f_1(x) + α_2 f_2(x) + . . .

Orthonormal system: ∫ f_i(x)² dx = 1 and ∫ f_i(x) f_j(x) dx = 0 for i ≠ j

• Fourier (sin x, cos x, . . . )
• Chebyshev (1, x, 2x² − 1, 4x³ − 3x, . . . )
• . . .

Coefficients are α_i = ∫ f(x) f_i(x) dx
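As a concrete illustration (not from the slides), the coefficients α_i can themselves be estimated by Monte-Carlo integration; the sketch below uses a made-up f, two Fourier basis functions, and the uniform-sampling estimator introduced later in the tutorial.

```python
import numpy as np

# Estimating expansion coefficients alpha_i = ∫ f(x) f_i(x) dx by Monte-Carlo integration.
# The target f and the two Fourier basis functions are made-up illustrations.
rng = np.random.default_rng(0)
f = lambda x: np.exp(np.sin(x))

basis = {
    "sin": lambda x: np.sin(x) / np.sqrt(np.pi),   # orthonormal on [-pi, pi]
    "cos": lambda x: np.cos(x) / np.sqrt(np.pi),
}

n = 100_000
x = rng.uniform(-np.pi, np.pi, size=n)   # uniform samples over the domain
V = 2 * np.pi                            # volume (length) of the domain
for name, fi in basis.items():
    alpha = V * np.mean(f(x) * fi(x))    # uniform-sampling estimate of the integral
    print(f"alpha_{name} ~= {alpha:.3f}")
```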
Used for: optimization
Optimization problem: minimize T (x) for x ∈ X
Assume unique global optimum x∗
Define Gibbs distribution with temperature 1/β for β > 0:
P_β(x) = (1/Z(β)) exp(−β T(x))

As β → ∞, have E_{x∼P_β}(x) → x∗

Simulated annealing: track E_β(x) = ∫ x P_β(x) dx as β → ∞
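A minimal simulated-annealing sketch, assuming a Gaussian random-walk tweak and a geometric schedule for β; the objective T and all tuning constants below are made up for illustration.

```python
import numpy as np

# Minimal simulated-annealing sketch (schedule, proposal, and objective are assumptions).
rng = np.random.default_rng(0)

def T(x):                          # objective to minimize (hypothetical example)
    return (x**2 - 1.0)**2 + 0.1 * x

x = rng.uniform(-2, 2)             # arbitrary starting point
beta = 1.0                         # inverse temperature
for step in range(20_000):
    x_new = x + rng.normal(scale=0.1)               # local tweak
    # Metropolis rule for P_beta(x) ∝ exp(-beta * T(x))
    if np.log(rng.uniform()) < -beta * (T(x_new) - T(x)):
        x = x_new
    beta *= 1.0005                                  # slowly raise beta (lower temperature)

print("approximate minimizer:", x)
```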
Used for: Bayes net inference
Undirected Bayes net on x = x1, x2, . . .:
P(x) = (1/Z) ∏_j φ_j(x)

Typical inference problem: compute E(x_i)

Belief propagation is fast if the argument lists of the φ_j's are small and form a junction tree
If not, MCMC
Used for: SLAM
Used for
Image segmentation
Tracking radar/sonar returns
Outline
Uniform sampling, importance sampling
MCMC and Metropolis-Hastings algorithm
What if f(x) has internal structure:
• SIS, SIR (particle filter)
• Gibbs sampler
Combining SIR w/ MCMC
A one-dimensional problem
[Figure: a 1-D integrand f(x) on x ∈ [−1, 1], with values up to about 70]
Uniform sampling
[Figure: the same 1-D integrand, sampled uniformly at 30 points]
true integral 24.0; uniform sampling 14.7 w/ 30 samples
Uniform sampling
Pick an x uniformly at random from X
E(f(x)) = ∫ P(x) f(x) dx = (1/V) ∫ f(x) dx

where V is the volume of X
So E(V f(x)) = desired integral
But variance can be big (esp. if V large)
Uniform sampling
Do it a bunch of times: pick x_i, compute

(V/n) ∑_{i=1}^{n} f(x_i)
Same expectation, lower variance
Variance decreases as 1/n (standard deviation as 1/√n)
Not all that fast; limitation of most MC methods
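A sketch of this estimator; the integrand below is a stand-in (the slides' actual f is not given), and X = [−1, 1] so V = 2.

```python
import numpy as np

# Uniform-sampling Monte-Carlo estimate of ∫ f(x) dx over X = [-1, 1].
# The integrand is made up for illustration; it is not the one plotted in the slides.
rng = np.random.default_rng(0)
f = lambda x: 50 * np.exp(-50 * x**2)      # a narrow bump, hard to hit by uniform sampling

V = 2.0                                    # volume (length) of X
n = 30
x = rng.uniform(-1, 1, size=n)
estimate = V * np.mean(f(x))               # E[V f(x)] equals the desired integral
print("uniform-sampling estimate:", estimate)
```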
Nonuniform sampling
[Figure: the same 1-D integrand with samples drawn from Q = N(0, 0.25²)]

true integral 24.0; importance sampling (Q = N(0, 0.25²)) 25.8
Importance sampling
Suppose we pick x nonuniformly, x ∼ Q(x)
Q(x) is importance distribution
Use Q to (approximately) pick out areas where f is large
But E_Q(f(x)) = ∫ Q(x) f(x) dx
Not what we want
Importance sampling
Define g(x) = f(x)/Q(x)
Now
E_Q(g(x)) = ∫ Q(x) g(x) dx = ∫ Q(x) f(x)/Q(x) dx = ∫ f(x) dx
Importance sampling
So, sample x_i from Q, take the average of g(x_i):

(1/n) ∑_i f(x_i)/Q(x_i)

w_i = 1/Q(x_i) is the importance weight
Uniform sampling is just importance sampling with Q = uniform = 1/V
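A sketch of importance sampling for the same stand-in integrand, using the proposal Q = N(0, 0.25²) quoted in the earlier slide caption.

```python
import numpy as np
from scipy.stats import norm

# Importance sampling: draw x_i ~ Q, average g(x_i) = f(x_i)/Q(x_i).
# Q = N(0, 0.25^2) as in the slide caption; f is the same stand-in integrand as above.
rng = np.random.default_rng(0)
f = lambda x: 50 * np.exp(-50 * x**2)

n = 30
x = rng.normal(loc=0.0, scale=0.25, size=n)
q = norm.pdf(x, loc=0.0, scale=0.25)       # proposal density at the samples
estimate = np.mean(f(x) / q)               # importance weights are 1/Q(x_i)
print("importance-sampling estimate:", estimate)
```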
Parallel importance sampling
Suppose f(x) = P(x) g(x)

Desired integral is ∫ f(x) dx = E_P(g(x))

But suppose we only know g(x) and λP(x), for some unknown constant λ
Parallel importance sampling
Pick n samples x_i from proposal Q(x)

If we could compute importance weights w_i = P(x_i)/Q(x_i), then

E_Q[w_i g(x_i)] = ∫ Q(x) (P(x)/Q(x)) g(x) dx = ∫ f(x) dx

so (1/n) ∑_i w_i g(x_i) would be our IS estimate
Parallel importance sampling
Assign raw importance weights w_i = λ P(x_i)/Q(x_i)

E(w_i) = ∫ Q(x) (λ P(x)/Q(x)) dx = λ ∫ P(x) dx = λ

So w_i is an unbiased estimate of λ

Define w = (1/n) ∑_i w_i ⇒ also unbiased, but lower variance
Parallel importance sampling
w_i/w approximates the true importance weight P(x_i)/Q(x_i), but is computed without knowing λ
So, make the estimate

∫ f(x) dx ≈ (1/n) ∑_i (w_i/w) g(x_i)
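A sketch of this self-normalized (parallel IS) estimator; the unnormalized target λP and the function g below are made up so the answer can be checked against the known value.

```python
import numpy as np
from scipy.stats import norm

# Parallel (self-normalized) importance sampling sketch.
# We can only evaluate lam_P = λ·P(x) (unnormalized target) and g(x); λ itself is unknown.
rng = np.random.default_rng(0)
lam_P = lambda x: np.exp(-0.5 * (x / 0.5)**2)      # ∝ N(0, 0.5²); normalizer "unknown" to us
g = lambda x: x**2                                  # we want E_P[g(x)]

n = 1000
x = rng.normal(0.0, 1.0, size=n)                    # proposal Q = N(0, 1)
w = lam_P(x) / norm.pdf(x)                          # raw weights w_i = λP(x_i)/Q(x_i)
w_bar = np.mean(w)                                  # estimates λ
estimate = np.mean((w / w_bar) * g(x))              # ≈ ∫ P(x) g(x) dx = E_P[g]
print("self-normalized IS estimate of E_P[x^2]:", estimate)   # true value is 0.25
```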
Parallel IS bias
Parallel IS is biased
E(w) = λ, but E(1/w) ≠ 1/λ in general

Bias → 0 as n → ∞, since the variance of w → 0
Parallel IS example
[Figure: proposal samples in the (x, y) plane, x, y ∈ [−2, 2]]
Q: (x, y) ∼ N(1, 1), θ ∼ U(−π, π)

f(x, y, θ) = Q(x, y, θ) P(o = 0.8 | x, y, θ)/Z
Parallel IS example
[Figure: weighted posterior samples in the (x, y) plane]
Posterior E(x, y, θ) = (0.496,0.350,0.084)
Back to n dimensions
Picking a good sampling distribution becomes hard in high-d
Major contribution to integral can be hidden in small areas
Danger of missing these areas ⇒ need to search for areas of large f(x)

Naively, searching could bias our choice of x in strange ways, making it hard to design an unbiased estimator
Markov chain Monte-Carlo
Design a Markov chain M whose moves tend to increase f(x) if it is small

This chain encodes a search strategy: start at an arbitrary x, run the chain for a while to find an x with reasonably high f(x)

For an x found by an arbitrary search algorithm, we don't know what importance weight we should use to correct for search bias

For an x found by M after sufficiently many moves, we can use the stationary distribution of M, P_M(x), as the importance weight
Picking P_M

MCMC works well if f(x)/P_M(x) has low variance

f(x) ≫ P_M(x) means there's a region of comparatively large f(x) that we don't sample enough

f(x) ≪ P_M(x) means we waste samples in regions where f(x) ≈ 0

So, e.g., if f(x) = g(x) P(x), we could ask for P_M = P
Metropolis-Hastings
A way of getting a chain M with the desired P_M
Basic strategy: start from arbitrary x
Repeatedly tweak x a little to get x′
If P_M(x′) ≥ P_M(x) α, move to x′

If P_M(x′) ≪ P_M(x) α, stay at x
In intermediate cases, randomize
Proposal distributions
MH has one parameter: how do we tweak x to get x′
Encoded in one-step proposal distribution Q(x′ | x)
Good proposals explore quickly but remain in regions of high PM(x)
Optimal proposal: Q(x′ | x) = P_M(x′) for all x
Metropolis-Hastings algorithm
The MH transition probability T_M(x′ | x) is defined as follows:
Sample x′ ∼ Q(x′ | x)
Compute p = (P_M(x′)/P_M(x)) · (Q(x | x′)/Q(x′ | x)) = P_M(x′) / (P_M(x) α), where α = Q(x′ | x)/Q(x | x′)

With probability min(1, p), set x ← x′
Repeat
Stop after, say, t steps (possibly with ≪ t distinct samples)
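A minimal Metropolis-Hastings sketch, assuming a symmetric Gaussian random-walk proposal (so the Q-ratio cancels and p reduces to P_M(x′)/P_M(x)); the unnormalized target below is a placeholder, not the slides' example.

```python
import numpy as np

# Metropolis-Hastings with a symmetric random-walk proposal (Q(x'|x) = Q(x|x'), so the
# Q-ratio cancels).  The unnormalized log-target is a placeholder for illustration.
rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * np.sum(x**2) / 0.3**2     # ∝ log of an unnormalized Gaussian

def mh(n_steps, x0, step=0.25):
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x_prop = x + rng.normal(scale=step, size=x.shape)
        # accept with probability min(1, P(x')/P(x))
        if np.log(rng.uniform()) < log_p(x_prop) - log_p(x):
            x = x_prop
        samples.append(x.copy())                   # rejected steps repeat the old sample
    return np.array(samples)

chain = mh(10_000, x0=[1.0, 1.0])
burned = chain[1000:]                              # drop burn-in
print("estimate of E[x^2]:", np.mean(burned[:, 0]**2))   # true value 0.09 for this target
```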
Metropolis-Hastings notes
Only need P_M up to a constant factor, which is nice for problems where the normalizing constant is hard
Efficiency determined by
• how fast Q(x′ | x) moves us around
• how high acceptance probability p is
Tension between fast Q and high p
Metropolis-Hastings proof
Given P_M(x) and T_M(x′ | x)

Want to show P_M is the stationary distribution for T_M

Based on the "detailed balance" condition

P_M(x) T_M(x′ | x) = P_M(x′) T_M(x | x′)   ∀x, x′

Detailed balance implies

∫ P_M(x) T_M(x′ | x) dx = ∫ P_M(x′) T_M(x | x′) dx = P_M(x′) ∫ T_M(x | x′) dx = P_M(x′)
So, if we can show detailed balance we are done
Proving detailed balance
Want to show P_M(x) T_M(x′ | x) = P_M(x′) T_M(x | x′) for x ≠ x′

P_M(x) T_M(x′ | x) = P_M(x) Q(x′ | x) min(1, [P_M(x′) Q(x | x′)] / [P_M(x) Q(x′ | x)])

P_M(x′) T_M(x | x′) = P_M(x′) Q(x | x′) min(1, [P_M(x) Q(x′ | x)] / [P_M(x′) Q(x | x′)])

Exactly one of the two min statements chooses 1 (unless both ratios equal 1, in which case the claim is immediate)

Wlog, suppose it's the first
Detailed balance
P_M(x) T_M(x′ | x) = P_M(x) Q(x′ | x)

P_M(x′) T_M(x | x′) = P_M(x′) Q(x | x′) · [P_M(x) Q(x′ | x)] / [P_M(x′) Q(x | x′)] = P_M(x) Q(x′ | x)

So, P_M is the stationary distribution of the Metropolis-Hastings sampler
Metropolis-Hastings example
[Figure: Metropolis-Hastings samples on [−1, 1] × [−1, 1]]
MH example accuracy
True E(x²) ≈ 0.28
σ = 0.25 in proposal leads to acceptance rate 55–60%
After 1000 samples minus burn-in of 100:
final estimate 0.282361
final estimate 0.271167
final estimate 0.322270
final estimate 0.306541
final estimate 0.308716
Structure in f(x)
Suppose f(x) = g(x) P(x) as above

And suppose P(x) can be factored, e.g.,

P(x) = (1/Z) φ_12(x_1, x_2) φ_13(x_1, x_3) φ_245(x_2, x_4, x_5) . . .

Then we can take advantage of the structure to sample from P efficiently and compute E_P(g(x))
Linear or tree structure
P(x) = (1/Z) φ_12(x_1, x_2) φ_23(x_2, x_3) φ_34(x_3, x_4)
Pick a node as root arbitrarily
Sample a value for root
Sample children conditional on parents
Repeat until we have sampled all of x
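A sketch of this ancestral-sampling procedure, assuming the chain's potentials have already been converted into conditionals P(x_1) and P(x_{t+1} | x_t) (always possible for chains and trees); the probability tables below are made up.

```python
import numpy as np

# Ancestral (forward) sampling on a chain, assuming the factors have been converted
# into conditionals P(x_1) and P(x_{t+1} | x_t).
rng = np.random.default_rng(0)

p_root = np.array([0.6, 0.4])                   # P(x_1) over 2 states (made-up numbers)
p_trans = np.array([[0.9, 0.1],                 # P(x_{t+1} | x_t): row = current state
                    [0.3, 0.7]])

def sample_chain(T):
    x = [rng.choice(2, p=p_root)]               # sample the root
    for _ in range(T - 1):
        x.append(rng.choice(2, p=p_trans[x[-1]]))   # sample child given its parent
    return x

samples = [sample_chain(4) for _ in range(5)]
print(samples)
```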
Sequential importance sampling
Assume a chain graph x_1 . . . x_T (a tree would be fine too)

Want to estimate E(x_T)

Can evaluate but not sample from P(x_1), P(x_{t+1} | x_t)
Sequential importance sampling
Suppose we have proposals Q(x_1), Q(x_{t+1} | x_t)

Sample x_1 ∼ Q(x_1), compute weight w_1 = P(x_1)/Q(x_1)

Sample x_2 ∼ Q(x_2 | x_1), weight w_2 = w_1 · P(x_2 | x_1)/Q(x_2 | x_1)

. . . continue until the last variable x_T

Weight w_T at the final step is P(x)/Q(x)
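A sketch of SIS on a toy linear-Gaussian chain; the target, the (deliberately mismatched) proposal, and the effective-sample-size diagnostic are all illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy.stats import norm

# Sequential importance sampling along a chain x_1 ... x_T (toy linear-Gaussian model).
# Target:   P(x_1) = N(0, 1),    P(x_{t+1} | x_t) = N(0.9 x_t, 0.5^2)
# Proposal: Q(x_1) = N(0, 2^2),  Q(x_{t+1} | x_t) = N(0.9 x_t, 1)   (deliberately too wide)
rng = np.random.default_rng(0)
n, T = 1000, 10

x = rng.normal(0.0, 2.0, size=n)                              # x_1 ~ Q(x_1)
w = norm.pdf(x, 0, 1) / norm.pdf(x, 0, 2)                     # w_1 = P(x_1)/Q(x_1)
for t in range(T - 1):
    x_new = rng.normal(0.9 * x, 1.0)                          # x_{t+1} ~ Q(. | x_t)
    w *= norm.pdf(x_new, 0.9 * x, 0.5) / norm.pdf(x_new, 0.9 * x, 1.0)
    x = x_new

print("SIS estimate of E[x_T]:", np.sum(w * x) / n)           # true value is 0
print("effective sample size:", np.sum(w)**2 / np.sum(w**2))  # shows weight degeneracy
```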
Problems with SIS
w_T often has really high variance

We often only know P(x_{t+1} | x_t) up to a constant factor

For example, in an HMM after conditioning on observation y_{t+1},

P(x_{t+1} | x_t, y_{t+1}) = (1/Z) P(x_{t+1} | x_t) P(y_{t+1} | x_{t+1})
Parallel SIS
Apply parallel IS trick to SIS:
• Generate n SIS samples x_i with weights w_i

• Normalize the w_i so that ∑_i w_i = n

Gets rid of the problem of having to know normalized P's

Introduces a bias which → 0 as n → ∞

Still not practical (variance of the w_i)
Sequential importance resampling
SIR = particle filter, sample = particle
Run SIS, keep weights normalized to sum to n
Monitor variance of weights
If too few particles get most of the weight, resample to fix it
Resampling reduces the variance of the final estimate, but increases the bias due to normalization
Resampling
After normalization, suppose a particle has weight 0 < w < 1
Set its weight to 1 w/ probability w, or to 0 w/ probability 1− w
E(weight) is still w, but can throw particle away if weight is 0
Resampling
A particle with weight w ≥ 1 will get ⌊w⌋ copies for sure, plus one more with probability w − ⌊w⌋

Total number of particles is ≈ n

Can make it exactly n (e.g., with systematic resampling):
High-weight particles are replicated at expense of low-weight ones
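A sketch of this resampling rule (⌊w⌋ copies plus a Bernoulli extra copy), assuming particles are stored in a NumPy array; the helper name `resample` is made up.

```python
import numpy as np

# Resampling as described above: a particle with normalized weight w gets floor(w)
# copies for sure, plus one more with probability w - floor(w).
rng = np.random.default_rng(0)

def resample(particles, weights):
    w = np.asarray(weights, dtype=float)
    w = w * len(w) / w.sum()                                  # normalize so weights sum to n
    counts = np.floor(w).astype(int)
    counts += rng.uniform(size=len(w)) < (w - np.floor(w))    # Bernoulli extra copy
    new_particles = np.repeat(particles, counts, axis=0)
    new_weights = np.ones(len(new_particles))                 # surviving copies get weight 1
    return new_particles, new_weights

particles = np.array([-1.0, 0.0, 0.5, 2.0])
weights = np.array([0.05, 0.05, 0.2, 0.7])
print(resample(particles, weights))
```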
SIR example
[DC factored filter movie]
Gibbs sampler
Recall
P(x) = (1/Z) φ_12(x_1, x_2) φ_13(x_1, x_3) φ_245(x_2, x_4, x_5) . . .
What if we don’t have a nice tree structure?
Gibbs sampler
MH algorithm for sampling from P(x)

Proposal distribution: pick an i at random, resample x_i from its conditional distribution holding x_{¬i} fixed

That is, Q(x′ | x) = 0 if x and x′ differ in more than one component

If x and x′ differ in component i,

Q(x′ | x) = (1/n) P(x′_i | x_{¬i})
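A Gibbs-sampler sketch for a made-up Ising-style grid MRF; it sweeps over the sites in order rather than picking i at random (both are common), and each update only needs the factors that touch x_i.

```python
import numpy as np

# Gibbs sampler for a small Ising-style MRF on a grid: P(x) ∝ exp(J * Σ_{i~j} x_i x_j),
# with x_i ∈ {-1, +1}.  The model is a made-up illustration of the Gibbs update.
rng = np.random.default_rng(0)
H, W, J = 16, 16, 0.4
x = rng.choice([-1, 1], size=(H, W))

def neighbor_sum(x, i, j):
    s = 0
    if i > 0:     s += x[i - 1, j]
    if i < H - 1: s += x[i + 1, j]
    if j > 0:     s += x[i, j - 1]
    if j < W - 1: s += x[i, j + 1]
    return s

for sweep in range(100):
    for i in range(H):
        for j in range(W):
            # conditional P(x_ij = +1 | rest) involves only the factors touching x_ij
            p_plus = 1.0 / (1.0 + np.exp(-2 * J * neighbor_sum(x, i, j)))
            x[i, j] = 1 if rng.uniform() < p_plus else -1

print("mean spin after 100 sweeps:", x.mean())
```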
Gibbs acceptance probability
The MH acceptance probability is

p = (P(x′)/P(x)) · (Q(x | x′)/Q(x′ | x))

For Gibbs, suppose we are resampling x_1, which participates in φ_7(x_1, x_4) and φ_9(x_1, x_3, x_6)

P(x′)/P(x) = [φ_7(x′_1, x_4) φ_9(x′_1, x_3, x_6)] / [φ_7(x_1, x_4) φ_9(x_1, x_3, x_6)]
First factor is easy
Gibbs acceptance probability
Second factor:
Q(x | x′)/Q(x′ | x) = P(x_1 | x′_{¬1}) / P(x′_1 | x_{¬1})

P(x′_1 | x_{¬1}) is simple too:

P(x′_1 | x_{¬1}) = (1/Z) φ_7(x′_1, x_4) φ_9(x′_1, x_3, x_6)

So Q(x | x′)/Q(x′ | x) = [φ_7(x_1, x_4) φ_9(x_1, x_3, x_6)] / [φ_7(x′_1, x_4) φ_9(x′_1, x_3, x_6)]
Better yet
P(x′)/P(x) = [φ_7(x′_1, x_4) φ_9(x′_1, x_3, x_6)] / [φ_7(x_1, x_4) φ_9(x_1, x_3, x_6)]

Q(x | x′)/Q(x′ | x) = [φ_7(x_1, x_4) φ_9(x_1, x_3, x_6)] / [φ_7(x′_1, x_4) φ_9(x′_1, x_3, x_6)]
The two factors cancel!
So p = 1: always accept
Gibbs in practice
Simple to implement
Often works well
Common failure mode: knowing x_{¬i} "locks down" x_i

Results in slow mixing, since it takes a lot of low-probability moves to get from x to a very different x′
Locking down
E.g., handwriting recognition: “antidisestablishmen?arianism”
Even if we do propose and accept "antidisestablishmenqarianism", likely to go right back
E.g., image segmentation: if all my neighboring pixels in an 11 × 11 region are background, I'm highly likely to be background as well
E.g., HMMs: knowing xt−1 and xt+1 often gives a good idea of xt
Sometimes the lock-down depends on the values of the other variables: ai? ↦ {aid, ail, aim, air} but th? ↦ the (and maybe thy or tho)
Worked example
[switch to Matlab]
Related topics
Reversible-jump MCMC
• for when we don’t know the dimension of x
Rao-Blackwellization
• hybrid between Monte-Carlo and exact
• treat some variables exactly, sample over rest
Swendsen-Wang
• modification to Gibbs that mixes faster in locked-down distributions
Data-driven proposals: EKPF, UPF