Monte Carlo and Machine Learning

Transcript
Page 1:

Monte Carlo and Machine Learning

Iain Murray, University of Edinburgh

http://iainmurray.net

Page 2:

Linear regression: y = θ1 x + θ2,   p(θ) = N(θ; 0, 0.4² I)

[Figure: lines sampled from the prior p(θ); axes x vs y]

Page 3:

Linear regression: y(n) = θ1 x(n) + θ2 + ε(n),   ε(n) ∼ N(0, 0.1²)

p(θ | Data) ∝ p(Data | θ) p(θ)

[Figure: lines sampled from the posterior, drawn over the observed data; axes x vs y]
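For this conjugate model the posterior over θ is Gaussian in closed form. A minimal NumPy sketch (with illustrative toy data; the slides' dataset is not given) that draws plausible lines from p(θ | Data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for the slides' (unspecified) dataset.
x = rng.uniform(-2, 4, size=15)
y_obs = 0.8 * x - 1.0 + rng.normal(0, 0.1, size=15)

# Model: y = theta1*x + theta2 + noise, noise ~ N(0, 0.1^2),
# prior theta ~ N(0, 0.4^2 I). Conjugacy gives a Gaussian posterior:
Phi = np.column_stack([x, np.ones_like(x)])          # design matrix [x, 1]
noise_var, prior_var = 0.1**2, 0.4**2
precision = Phi.T @ Phi / noise_var + np.eye(2) / prior_var
cov = np.linalg.inv(precision)
mean = cov @ (Phi.T @ y_obs) / noise_var

# Each posterior sample is a plausible line (theta1, theta2).
thetas = rng.multivariate_normal(mean, cov, size=12)
```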

Page 4:

Linear regression (zoomed in)

p(θ | Data) ∝ p(Data | θ) p(θ)

[Figure: zoomed-in view of the posterior lines; axes x vs y]

Page 5:

Model mismatch

[Figure: dataset that a straight line cannot fit well; axes x vs y]

What will Bayesian linear regression do?

Page 6:

Quiz

Given a (wrong) linear assumption, which explanations are typical of the posterior distribution?

[Figure: three candidate posteriors over lines, shown in panels A, B, and C]

D) All of the above
E) None of the above
Z) Not sure

Page 7:

‘Underfitting’

[Figure: posterior lines tightly clustered despite a poor fit to the data; axes x vs y]

Posterior very certain despite blatant misfit. Peaked around least bad option.

Page 8:

Model checking

arXiv:1705.07057 — Masked Autoregressive Flow

Introduction: Gelman et al., Bayesian Data Analysis

Page 9:

Roadmap

— Looking at samples

— Monte Carlo computations

— Scaling to large datasets

Page 10:

Simple Monte Carlo Integration

∫ f(θ) π(θ) dθ = “average over π of f”

              ≈ (1/S) ∑_{s=1}^S f(θ(s)),   θ(s) ∼ π

Unbiased. Variance ∼ 1/S.
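A minimal NumPy sketch of this estimator, assuming a target π we can sample directly (a standard normal here) and an integrand f of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda theta: theta**2        # the function whose average we want
S = 10_000
theta = rng.standard_normal(S)    # theta^(s) ~ pi (standard normal here)
estimate = np.mean(f(theta))      # unbiased; variance shrinks like 1/S
print(estimate)                   # close to E[theta^2] = 1 under N(0, 1)
```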

Page 11:

Prediction

p(y∗ | x∗, D) = ∫ p(y∗ | x∗, θ) p(θ | D) dθ

              ≈ (1/S) ∑_s p(y∗ | x∗, θ(s)),   θ(s) ∼ p(θ | D)

[Figure: posterior predictions at a test input x∗; axes x vs y]

Page 12:

Prediction

[Figure: same setup, now overlaying p(y∗ | x∗, θ(s)) for a single posterior sample]

Page 13:

Prediction

[Figure: overlaying the running average (1/12) ∑_{s=1}^{12} p(y∗ | x∗, θ(s))]

Page 14:

Prediction

[Figure: the Monte Carlo predictive p(y∗ | x∗, D) ≈ (1/100) ∑_{s=1}^{100} p(y∗ | x∗, θ(s))]
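Continuing the regression sketch above (the `thetas` array is assumed from there), the Monte Carlo predictive is simply the average of the per-sample likelihood curves:

```python
import numpy as np

def normal_pdf(y, mean, std):
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

x_star = 1.5
y_grid = np.linspace(-4, 2, 200)

# p(y* | x*, theta^(s)) = N(y*; theta1*x* + theta2, 0.1^2) for each sample,
# then average the S density curves to get the Monte Carlo predictive.
densities = [normal_pdf(y_grid, th[0] * x_star + th[1], 0.1) for th in thetas]
predictive = np.mean(densities, axis=0)   # approximates p(y* | x*, D)
```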

Page 15:

How do we sample p(θ | D)?

π(θ) ∝ π∗(θ) = p(D | θ) p(θ)

Page 16:

The Metropolis et al. (1953) paper: >35,000 citations.

Marshall Rosenbluth’s account: http://dx.doi.org/10.1063/1.1887186

Page 17:

Metropolis–Hastings

θ′ ∼ q(θ′; θ(s))

if accept:
    θ(s+1) ← θ′
else:
    θ(s+1) ← θ(s)

P(accept) = min(1, [π∗(θ′) q(θ(s); θ′)] / [π∗(θ(s)) q(θ′; θ(s))])
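A minimal sketch of the algorithm with a symmetric Gaussian proposal, so the q terms cancel in the acceptance ratio; `log_pi_star` is a stand-in for log π∗(θ) = log p(D | θ) + log p(θ):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi_star(theta):
    # Unnormalized log target; a standard normal stands in for
    # log p(D | theta) + log p(theta).
    return -0.5 * np.sum(theta**2)

def metropolis(theta0, n_steps, step_size=0.5):
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = theta + step_size * rng.standard_normal(theta.shape)
        # Symmetric proposal: the q ratio is 1, leaving pi*(theta')/pi*(theta).
        log_a = log_pi_star(proposal) - log_pi_star(theta)
        if np.log(rng.uniform()) < log_a:
            theta = proposal              # accept
        samples.append(theta.copy())      # else keep the old theta
    return np.array(samples)

samples = metropolis(np.zeros(2), n_steps=5000)
```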

Page 18:

What can we approximate?

Unbiased estimates from a subset of the data?

A) p(D | θ) = ∏_n p(Dn | θ)

B) log p(D | θ) = ∑_n log p(Dn | θ)

C) ∇θ p(D | θ)

D) ∇θ log p(D | θ)

E) None of the above
F) All of the above
Z) Don’t know
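For the sum in (B), and likewise its gradient in (D), a subsampled sum rescaled by N/m is unbiased. A sketch of this standard construction (names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def unbiased_loglik_estimate(log_lik_n, data, theta, m):
    # Sample m of the N terms uniformly and rescale by N/m:
    # E[estimate] = sum_n log p(D_n | theta).  The same scaling gives an
    # unbiased gradient estimate (D).  Exponentiating an unbiased
    # log-likelihood estimate gives a *biased* likelihood estimate, so
    # (A) and (C) have no equally simple fix.
    N = len(data)
    idx = rng.choice(N, size=m, replace=False)
    return (N / m) * sum(log_lik_n(data[i], theta) for i in idx)
```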

Page 19:

Pseudo-marginal MCMC

MCMC with unbiased estimate of π∗(θ)

Step towards big data: arXiv:1403.5693

Allowing negative estimators: arXiv:1306.4032

Easy-to-use algorithm: http://iainmurray.net/pub/16pmss/

Not yet a solution to MCMC for big data.

Page 20:

Dynamical methods

Simulation using ∇θ log π∗(θ) estimates

Langevin/Hamiltonian dynamics, and newer developments

An overview: https://papers.nips.cc/paper/5891-a-complete-recipe-for-stochastic-gradient-mcmc

A recent new direction: arXiv:1611.07873

Use for optimization?

http://proceedings.mlr.press/v51/chen16c.html
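As one concrete instance, a sketch of the stochastic gradient Langevin dynamics update (Welling and Teh, 2011), assuming a user-supplied stochastic estimate of ∇θ log π∗(θ):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld(grad_log_pi_star_estimate, theta0, n_steps, eps=1e-3):
    # Langevin update: theta <- theta + (eps/2) * grad + N(0, eps I).
    # With minibatch gradient estimates this is SGLD; the injected noise
    # dominates the gradient noise as the step size eps shrinks.
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_steps):
        grad = grad_log_pi_star_estimate(theta)
        theta = theta + 0.5 * eps * grad \
                + np.sqrt(eps) * rng.standard_normal(theta.shape)
        samples.append(theta.copy())
    return np.array(samples)
```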

Page 21:

SMC

Sequential Monte Carlo

Popular in signal processing, probabilistic programming

One recent paper: https://papers.nips.cc/paper/5450-asynchronous-anytime-sequential-monte-carlo

Page 22:

Variational Methods

Approximate p(θ | D) = π(θ) with q(θ)

Classic approach: minimize D_KL(q || π)

Monte Carlo approximation to gradients

“Black box”: just need ∇θ log p(Dn | θ)

Refs: http://shakirm.com/papers/VITutorial.pdf

https://www.cs.toronto.edu/~duvenaud/papers/blackbox.pdf

http://www.inf.ed.ac.uk/teaching/courses/mlpr/2016/notes/
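A sketch of one such gradient estimator, the reparameterization trick for a diagonal Gaussian q, which needs only gradients of the log joint (prior plus ∑_n log p(Dn | θ)); `grad_log_joint` is an assumed user-supplied function:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_grad_estimate(grad_log_joint, mean, log_std, S=16):
    # Reparameterization trick for q(theta) = N(mean, exp(log_std)^2 I):
    # theta = mean + std * eps with eps ~ N(0, I), so Monte Carlo ELBO
    # gradients only require d log p(D, theta) / d theta.
    std = np.exp(log_std)
    eps = rng.standard_normal((S, mean.size))
    theta = mean + std * eps
    g = np.array([grad_log_joint(t) for t in theta])
    g_mean = g.mean(axis=0)
    # The Gaussian entropy term contributes +1 per dimension here.
    g_log_std = (g * eps * std).mean(axis=0) + 1.0
    return g_mean, g_log_std  # ascend these to maximize the ELBO
```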

Page 23:

Will neural nets eat everything?

Can simulate from p(θ) and p(D | θ)

Training data!

{θ(s),D(s)} ∼ p(θ,D)

Just fit p(D | θ) and/or p(θ | D)?

Example recognition network: arXiv:1605.06376

Example conditional density estimator: arXiv:1705.07057
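A sketch of generating such training pairs by simulation, with a toy linear-regression simulator standing in for p(D | θ); in practice a flexible conditional density estimator (e.g. the MAF cited above) would then be fit to the pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior():
    # theta ~ p(theta), matching the earlier N(0, 0.4^2 I) prior.
    return rng.normal(0.0, 0.4, size=2)

def simulate(theta, n=20):
    # Toy stand-in for the simulator p(D | theta): noisy line data.
    x = rng.uniform(-2, 4, size=n)
    y = theta[0] * x + theta[1] + rng.normal(0.0, 0.1, size=n)
    return np.stack([x, y], axis=1)

# Training pairs {theta^(s), D^(s)} ~ p(theta, D): fit a conditional
# model to these to approximate p(D | theta) and/or p(theta | D).
pairs = []
for _ in range(1000):
    theta = sample_prior()
    pairs.append((theta, simulate(theta)))
```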

Page 24:

Appendix slides

Page 25:

MCMC References

My first reading: MacKay’s book and Neal’s review:

www.inference.phy.cam.ac.uk/mackay/itila/

www.cs.toronto.edu/~radford/review.abstract.html

Handbook of Markov Chain Monte Carlo (Brooks, Gelman, Jones, Meng, eds, 2011)

http://www.mcmchandbook.net/HandbookSampleChapters.html

Page 26:

Practical MCMC examples

My exercise sheet: http://iainmurray.net/teaching/09mlss/

BUGS examples and more in Stan: https://github.com/stan-dev/example-models/

Kaggle entry: http://iainmurray.net/pub/12kaggle_dark/