
Distributed Bayesian Learning with Stochastic Natural-gradient EP and the Posterior Server Yee Whye Teh in collaboration with: Minjie Xu, Balaji Lakshminarayanan, Leonard Hasenclever, Thibaut Lienart, Stefan Webb, Sebastian Vollmer, Charles Blundell

Transcript of Distributed Bayesian Learning with Stochastic Natural-gradient EP and the Posterior Server

Page 1:

Distributed Bayesian Learning with Stochastic Natural-gradient EP and the Posterior Server

Yee Whye Teh

in collaboration with: Minjie Xu, Balaji Lakshminarayanan, Leonard Hasenclever, Thibaut Lienart, Stefan Webb, Sebastian Vollmer, Charles Blundell

Page 2:


Bayesian Learning

! Parameter vector X.

! Data items Y = y1, y2,... yN.

! Model:

  p(X, Y) = p(X) \prod_{i=1}^{N} p(y_i \mid X)

  [Figure: graphical model with parameter X and observations y_1, y_2, ..., y_N]

! Aim: compute the posterior

  p(X \mid Y) = \frac{p(X)\, p(Y \mid X)}{p(Y)}

! Inference algorithms:
  ! Variational inference: parametrise the posterior as q_\theta and optimize \theta.
  ! Markov chain Monte Carlo: construct samples X_1, ..., X_n ~ p(X \mid Y) (a minimal sampler sketch follows below).
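To make the MCMC option above concrete, here is a minimal sketch, not from the talk, of a random-walk Metropolis sampler targeting p(X | Y) ∝ p(X) ∏_i p(y_i | X); the toy model (standard-normal prior, unit-variance Gaussian likelihood) and all names are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy model: X ~ N(0, 1) prior, y_i | X ~ N(X, 1) likelihood.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=100)  # simulated data

def log_joint(x):
    """log p(X) + sum_i log p(y_i | X), up to additive constants."""
    log_prior = -0.5 * x**2
    log_lik = -0.5 * np.sum((y - x)**2)
    return log_prior + log_lik

def random_walk_metropolis(n_samples=5000, step=0.2, x0=0.0):
    """Minimal random-walk Metropolis sampler for p(X | Y)."""
    x, samples = x0, []
    lp = log_joint(x)
    for _ in range(n_samples):
        prop = x + step * rng.normal()
        lp_prop = log_joint(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
            x, lp = prop, lp_prop
        samples.append(x)
    return np.array(samples)

samples = random_walk_metropolis()
print("posterior mean ≈", samples.mean())
```

Any sampler with the same interface (evaluate the unnormalized log posterior, propose, accept/reject) could stand in here; later slides use NUTS and stochastic gradient Langevin dynamics as the worker-side samplers instead.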

Page 3:


Machine Learning on Distributed Systems

[Figure: data subsets y_{1i}, y_{2i}, y_{3i}, y_{4i} stored across worker machines]

! Distributed storage

! Distributed computation

! Costly network communications.

Page 4:


Parameter Server

! Parameter server [Ahmed et al 2012], DistBelief network [Dean et al 2012].

[Figure: parameter server architecture]
  parameter server: holds the parameter x.
  worker: pulls x_i = x, updates it locally to x_i', returns Δx_i = x_i' - x_i; each worker stores its own data shard y_{1i}, y_{2i}, ... (a minimal sketch of this loop follows below).
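A minimal sketch of the pull / update / push loop described above, assuming a hypothetical ParameterServer and Worker API and a plain gradient step as the worker-side update; none of this is code from the talk.

```python
import numpy as np

class ParameterServer:
    """Holds the global parameter x and applies deltas sent by workers."""
    def __init__(self, dim):
        self.x = np.zeros(dim)

    def pull(self):
        return self.x.copy()

    def push(self, delta):
        self.x += delta                            # accumulate worker update Δx_i = x_i' - x_i

class Worker:
    """Owns a data shard; repeatedly pulls x, updates it locally, pushes the delta."""
    def __init__(self, data, lr=0.1):
        self.data, self.lr = data, lr

    def local_gradient(self, x):
        # Gradient of the average squared loss 0.5 * (x - y)^2 over this worker's shard.
        return x - self.data.mean()

    def step(self, server):
        x_i = server.pull()                                  # x_i = x
        x_new = x_i - self.lr * self.local_gradient(x_i)     # local update -> x_i'
        server.push(x_new - x_i)                             # return Δx_i

# Toy run: 4 workers, scalar parameter.
rng = np.random.default_rng(0)
server = ParameterServer(dim=1)
workers = [Worker(rng.normal(3.0, 1.0, size=250)) for _ in range(4)]
for _ in range(100):
    for w in workers:
        w.step(server)
print("server parameter ≈", server.x)
```

The point is only the protocol: workers communicate deltas Δx_i rather than raw data.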

Page 5:

Embarrassingly Parallel MCMC Sampling

[Figure: data subsets y_{1i}, y_{2i}, y_{3i}, y_{4i} split across the worker machines]

! Treat the subsets as independent inference problems and collect samples {X_{ji}}_{j=1...m, i=1...n} on the m workers.

! "Combine" the samples into a single set {X_i}_{i=1...n} approximating the full posterior (see the combination sketch after the references below).

! Only communication is at the combination stage.

[Scott et al 2013, Neiswanger et al 2013, Wang & Dunson 2013, Stanislav et al 2014]
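As one concrete combination rule among those cited, consensus Monte Carlo [Scott et al 2013] averages aligned worker draws with weights given by the inverses of the sub-posterior sample covariances. A sketch under that assumption (hypothetical function names, toy Gaussian sub-posteriors):

```python
import numpy as np

def consensus_combine(worker_samples):
    """Weighted average of aligned worker draws, one combined draw per index i.

    worker_samples: list of length m, each an (n, d) array of sub-posterior samples.
    Weights W_j are the inverse sample covariances, as in consensus Monte Carlo.
    """
    weights = [np.linalg.inv(np.cov(s, rowvar=False).reshape(s.shape[1], s.shape[1]))
               for s in worker_samples]
    total_w_inv = np.linalg.inv(sum(weights))
    n = worker_samples[0].shape[0]
    combined = np.stack([
        total_w_inv @ sum(w @ s[i] for w, s in zip(weights, worker_samples))
        for i in range(n)
    ])
    return combined

# Toy check: 4 workers whose sub-posteriors are Gaussians around different means.
rng = np.random.default_rng(0)
worker_samples = [rng.normal(loc=mu, scale=0.5, size=(1000, 2))
                  for mu in ([0.8, 1.2], [1.0, 0.9], [1.2, 1.1], [0.9, 1.0])]
print(consensus_combine(worker_samples).mean(axis=0))
```

Other cited approaches combine the sub-posteriors differently (e.g. via products of estimated sub-posterior densities), which is exactly where the difficulties on the next slide arise.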

Page 6:


Embarrassingly Parallel MCMC Sampling

! Unclear how to combine worker samples well.

! Particularly if local posteriors on worker machines do not overlap.

Figure from Wang & Dunson

Page 7:


Main Idea

! Identify regions of high (global) posterior probability mass.

! Shift each local posterior to agree with the high-probability region, and draw samples from these shifted posteriors.

! How to find the high-probability region?
  ! It is defined in terms of low-order moments.
  ! Use information gained from local posterior samples (with a small amount of communication).

Figure from Wang & Dunson

Page 8:


Tilting Local Posteriors

! Each worker machine j has access only to its data subset.

! The local posterior on worker j is

  p_j(X \mid y_j) \propto p_j(X) \prod_{i=1}^{I} p(y_{ji} \mid X),

  where p_j(X) is a local prior and p_j(X \mid y_j) is the local posterior.

! Adapt the local priors p_j(X) so that the local posteriors agree on certain moments:

  \mathbb{E}_{p_j(X \mid y_j)}[s(X)] = s_0 \quad \forall j

! Use expectation propagation (EP) [Minka 2001] to adapt the local priors.

Page 9:


Expectation Propagation

! If N is large, the worker-j likelihood term p(y_j \mid X) should be well approximated by a Gaussian:

  p(y_j \mid X) \approx q_j(X) = N(X; \mu_j, \Sigma_j)

! The parameters are fit iteratively to minimize a KL divergence [Minka 2001]:

  p(X \mid y) \approx p_j(X \mid y) \propto p(y_j \mid X) \underbrace{p(X) \prod_{k \neq j} q_k(X)}_{p_j(X)}

  q_j^{new}(\cdot) = \arg\min_{N(\cdot; \mu, \Sigma)} \mathrm{KL}\big( p_j(\cdot \mid y) \,\big\|\, N(\cdot; \mu, \Sigma)\, p_j(\cdot) \big)

! The optimal q_j is such that the first two moments of N(\cdot; \mu, \Sigma)\, p_j(\cdot) agree with those of p_j(\cdot \mid y).

! Moments of the local posterior are estimated using MCMC sampling.

! At convergence, the first two moments of all local posteriors agree (an EP sketch with sampled moments follows below).
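Below is a minimal sketch of this EP loop, not the talk's implementation, for a hypothetical 1-D conjugate model; exact draws from each tilted distribution stand in for a worker's MCMC run, and natural parameters (mean/variance, 1/variance) make the cavity and site updates simple subtractions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: X ~ N(0, 1) prior; y_i | X ~ N(X, 1); data split over m workers.
m, n_per = 4, 250
data = [rng.normal(1.0, 1.0, size=n_per) for _ in range(m)]

def to_natural(mean, var):            # (r, q) = (mean/var, 1/var); additive under products
    return mean / var, 1.0 / var

def to_moment(r, q):                  # (mean, var) from the natural parameters
    return r / q, 1.0 / q

prior_nat = np.array(to_natural(0.0, 1.0))
sites = [np.array([0.0, 0.0]) for _ in range(m)]      # site approximations q_j, start flat
global_nat = prior_nat + sum(sites)                   # q(X) ∝ p(X) ∏_j q_j(X)

for sweep in range(10):
    for j in range(m):
        cavity = global_nat - sites[j]                # p_j(X) = q(X) / q_j(X)
        c_mean, c_var = to_moment(*cavity)

        # Tilted distribution p_j(X) p(y_j | X); conjugate here, so exact draws
        # stand in for a worker's MCMC run, and the (noisy) sample moments are
        # used exactly as EP's moment-matching step would use them.
        y = data[j]
        t_var = 1.0 / (1.0 / c_var + len(y))
        t_mean = t_var * (c_mean / c_var + y.sum())
        draws = rng.normal(t_mean, np.sqrt(t_var), size=2000)
        tilted_nat = np.array(to_natural(draws.mean(), draws.var()))

        sites[j] = tilted_nat - cavity                # new site approximation q_j
        global_nat = cavity + sites[j]                # refresh the global posterior

print("EP posterior mean/var ≈", to_moment(*global_nat))
```

For this conjugate toy the approximation is essentially exact; the sampling noise in the moments is the thing that, per the later slide, EP cannot formally handle.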

Page 10:


Posterior Server Architecture

[Figure: Posterior Server architecture. The Posterior Server maintains the approximate posterior; each worker holds its data and likelihood, a cavity distribution and a likelihood approximation, and runs MCMC locally; the workers and the Posterior Server communicate over the network.]
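A sketch of the communication pattern in the diagram, assuming (as in the EP sketch above) that the server state is the prior's natural parameters plus the sum of the workers' likelihood-approximation (site) parameters; the class and method names are hypothetical, and the local MCMC + moment-matching step is left abstract.

```python
import numpy as np

class PosteriorServer:
    """Holds natural parameters of the approximate posterior: prior + sum of worker sites."""
    def __init__(self, prior_nat):
        self.nat = np.array(prior_nat, dtype=float)

    def pull(self):
        return self.nat.copy()

    def push(self, delta_site):
        # A worker sends only the change in its likelihood approximation.
        self.nat += delta_site

class EPWorker:
    def __init__(self, dim):
        self.site = np.zeros(dim)                  # current likelihood approximation q_j

    def step(self, server, estimate_tilted_nat):
        posterior_nat = server.pull()
        cavity = posterior_nat - self.site         # remove own contribution
        tilted_nat = estimate_tilted_nat(cavity)   # local MCMC + moment matching (abstract)
        new_site = tilted_nat - cavity
        server.push(new_site - self.site)          # communicate the delta only
        self.site = new_site

# Dummy moment-estimation stand-in so the example runs end to end.
server = PosteriorServer(prior_nat=[0.0, 1.0])
worker = EPWorker(dim=2)
worker.step(server, estimate_tilted_nat=lambda cavity: cavity + np.array([0.5, 2.0]))
print(server.pull())
```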

Page 11:


Bayesian Logistic Regression

! Simulated dataset: d = 20, N = 1000 data items.

! NUTS-based sampler.
  ! # workers m = 4, 10, 50.
  ! # MCMC iterations T = 1000, 1000, 10000.

! # EP iterations k indicated as vertical lines.

[Figure: three panels (m = 4, 10, 50) showing convergence against k × T × N/m × 10³]

Page 12:


Bayesian Logistic Regression

! MSE of the posterior mean, as a function of the total # iterations.

[Figure: MSE of the posterior mean (log scale, 10⁻⁶ to 10⁰) versus k × T × m, comparing SMS(s), SMS(a), SCOT, NEIS(p), NEIS(n), WANG]

Page 13:


Stochastic Natural-gradient EP

! EP has no guarantee of convergence.
! EP technically cannot handle stochasticity in moment estimates.
! A long MCMC run is needed for good moment estimates.
! It fails for neural nets and other complex high-dimensional models.

! Stochastic Natural-gradient EP:
  ! An alternative variational algorithm to EP.
  ! Convergent, even with Monte Carlo estimates of moments (an illustration of why averaging noisy moment estimates is stable follows below).
  ! A double-loop algorithm [Welling & Teh 2001, Yuille 2002, Heskes & Zoeter 2002].
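The actual SNEP update is the double-loop algorithm referenced above; the sketch below only illustrates the stochastic-approximation ingredient, Robbins-Monro averaging, that makes noisy Monte Carlo moment estimates usable: plugging each short-run estimate in directly keeps jittering, while damped updates with decreasing step sizes settle at the true moment. The toy numbers and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_moment = 2.0

def noisy_moment(n_samples=20):
    """Stand-in for a short, noisy Monte Carlo moment estimate."""
    return rng.normal(true_moment, 1.0, size=n_samples).mean()

# (a) Plugging each noisy estimate in directly (EP-style hard moment matching):
plug_in = [noisy_moment() for _ in range(2000)]

# (b) Robbins-Monro averaging of the same noisy estimates with decreasing step sizes:
theta, trace = 0.0, []
for t in range(1, 2001):
    rho = 1.0 / t                      # steps with sum(rho) = inf, sum(rho^2) < inf
    theta += rho * (noisy_moment() - theta)
    trace.append(theta)

print("plug-in estimates keep jittering:  std ≈", np.std(plug_in[-100:]))
print("stochastic approximation settles:  last value ≈", trace[-1])
```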

Page 14:


Demonstrative Example

[Figure: side-by-side comparison of EP (500 samples) and SNEP (500 samples)]

Page 15:


Comparison to Maximum Likelihood SGD

! Maximum likelihood via SGD:
  ! DistBelief [Dean et al 2012].
  ! Elastic-averaging SGD [Zhang et al 2015].

! Separate likelihood approximations and states per worker.
  ! Worker parameters are not forced to be exactly the same.

! Each worker learns to approximate its own likelihood.
  ! This can be achieved without detailed knowledge from other workers.

! Diagonal Gaussian exponential family.
  ! Variance estimates are important to learning (see the parameter-conversion sketch below).
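For concreteness, a sketch, not the talk's implementation, of the diagonal Gaussian exponential family: each coordinate carries a mean and a variance, stored as the additive parameters (mean/variance, 1/variance), so cavity and site arithmetic stays coordinate-wise; the per-coordinate variance is the extra information exchanged beyond a point estimate.

```python
import numpy as np

def diag_gaussian_to_natural(mean, var):
    """Per-coordinate (mean/var, 1/var); both components add under products of Gaussians."""
    mean, var = np.asarray(mean, float), np.asarray(var, float)
    return np.stack([mean / var, 1.0 / var])

def natural_to_diag_gaussian(nat):
    """Inverse map back to per-coordinate (mean, variance)."""
    r, q = nat
    return r / q, 1.0 / q

# Cavity/site arithmetic is coordinate-wise addition/subtraction of these parameters.
prior = diag_gaussian_to_natural(np.zeros(3), np.ones(3))
site = diag_gaussian_to_natural([0.2, -0.1, 0.4], [4.0, 4.0, 4.0])
posterior = prior + site
print(natural_to_diag_gaussian(posterior))   # combined per-coordinate means and variances
```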

[Figure: Posterior Server architecture, repeated from Page 10]

Page 16:


Experiments on Distributed Bayesian Neural Networks

! Bayesian approach to learning a neural network:
  ! Compute the parameter posterior given a complex neural network likelihood.
  ! Diagonal-covariance Gaussian prior and exponential-family approximation.

! Two datasets and architectures: MNIST fully-connected, CIFAR10 convnet.

! Implementation in Julia.
  ! Workers are cores on a server.
  ! Sampler is stochastic gradient Langevin dynamics (SGLD) [Welling & Teh 2011]; a minimal SGLD step is sketched below.
  ! Adagrad [Duchi et al 2011] / RMSprop [Tieleman & Hinton 2012] type adaptation.
  ! Evaluated on test accuracy.
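As a reference for the sampler named above, here is a minimal sketch of a basic stochastic gradient Langevin dynamics step [Welling & Teh 2011], without the Adagrad/RMSprop-style preconditioning used in the talk; the toy model, function names, and step size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, minibatch, grad_log_prior, grad_log_lik, N, eps):
    """One SGLD update [Welling & Teh 2011], without preconditioning:
    theta <- theta + (eps/2) * (grad log p(theta) + N/|B| * sum_{i in B} grad log p(y_i | theta))
                   + Normal(0, eps) injected noise.
    """
    n = len(minibatch)
    grad = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(theta, y) for y in minibatch)
    return theta + 0.5 * eps * grad + rng.normal(scale=np.sqrt(eps), size=theta.shape)

# Toy usage (all assumptions): N(0, 1) prior on a scalar theta, y_i | theta ~ N(theta, 1).
N = 10_000
data = rng.normal(1.0, 1.0, size=N)
theta, eps = np.zeros(1), 1e-4     # constant step size for the sketch; the paper decays it
trace = []
for t in range(2000):
    batch = rng.choice(data, size=100, replace=False)
    theta = sgld_step(theta, batch,
                      grad_log_prior=lambda th: -th,
                      grad_log_lik=lambda th, y: y - th,
                      N=N, eps=eps)
    trace.append(theta.copy())
print("posterior mean estimate ≈", np.mean(trace[1000:]))
```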

Page 17:


MNIST 500x300

Varying the number of workers

[Figure: test error in % (1.0 to 2.5) versus epochs per worker (0 to 20); curves for 1, 2, 4, 8 workers and an Adam baseline]

Page 18:


MNIST 500x300

Power SNEP - Varying the number of workers

[Figure: test error in % (1.0 to 2.5) versus epochs per worker (0 to 20); curves for 1, 2, 4, 8 workers]

Page 19:


MNIST 500x300

Comparison of distributed methods (8 workers)

[Figure: test error in % (1.0 to 2.5) versus epochs per worker (0 to 20); curves for SNEP, p-SNEP, A-SGD, EASGD, Adam]

Page 20:


MNIST Very Deep MLP

16 workers

[Figure: test error in % (1 to 5) versus epochs per worker (0 to 50); curves for ASGD, EASGD, p-SNEP]

Page 21:


CIFAR10 ConvNet

Comparison of distributed methods (8 workers)

[Figure: test error in % (20 to 40) versus epochs per worker (0 to 30); curves for SNEP, p-SNEP, A-SGD, EASGD, Adam]

Page 22:


Concluding Remarks

! A novel distributed learning method based on a combination of Monte Carlo and a convergent alternative to expectation propagation.

! Combination of variational and MCMC algorithms.
  ! Advantageous over both pure variational and pure MCMC algorithms.

! Being Bayesian can be advantageous computationally in a distributed setting.

! Thank you!