Deep generative learning_icml_part2

Transcript
Page 1: Deep generative learning_icml_part2

Stochastic Gradient Fisher Scoring - Ahn, Korattikara, Welling (2012)

Large gradient / small gradient: mixing issues

Bernstein-von Mises theorem (a.k.a. Bayesian CLT)

θ0 - true parameter; I_N - Fisher information at θ0
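In its standard form, and with I_N taken (as defined above) to be the Fisher information of the full dataset at θ0, the Bayesian CLT says the posterior is approximately Gaussian for large N:

```latex
p(\theta \mid x_1, \dots, x_N) \;\approx\; \mathcal{N}\!\left(\theta \,;\, \theta_0,\; I_N^{-1}\right)
```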



Page 5: Deep generative learning_icml_part2

SGFS vs. Stochastic Gradient Langevin

Stochastic Gradient Langevin: low bias, high variance; samples from the correct posterior at low ϵ.

SGFS: a Markov chain for an approximate posterior; low variance, high bias; samples from the approximate posterior at any ϵ.


Page 6: Deep generative learning_icml_part2

SGFS

Small ϵ

Large ϵ

Bias

Variance

3

3vrijdag 4 juli 14

Page 7: Deep generative learning_icml_part2

SGFS

Small ϵ vs. large ϵ: trades bias against variance

(a term in the update compensates for subsampling noise)


Page 8: Deep generative learning_icml_part2

The SGFS Knob

Burn-in using large ϵ: low variance (fast), high bias
Sampling with small ϵ: high variance (slow), low bias
Decrease ϵ over time → exact sampling

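To make the knob concrete, here is a minimal sketch of a stochastic gradient Langevin sampler on a toy Gaussian model whose step size ϵ starts large (fast, biased burn-in) and decays over time toward low-bias sampling. The toy model, the decay schedule, and all constants are illustrative assumptions; the SGFS preconditioning matrix itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x_i ~ N(theta_true, 1); we sample the posterior over theta
# under a standard normal prior.
theta_true, N = 2.0, 10_000
x = rng.normal(theta_true, 1.0, size=N)

def stoch_grad_log_post(theta, batch):
    """Mini-batch estimate of the gradient of log p(theta | x)."""
    return -theta + (N / len(batch)) * np.sum(batch - theta)

def epsilon(t, eps0=1e-4, gamma=0.55):
    """The knob: large steps early (burn-in), decreasing over time."""
    return eps0 / (1.0 + t) ** gamma

theta, samples = 0.0, []
for t in range(5_000):
    batch = rng.choice(x, size=100, replace=False)
    eps = epsilon(t)
    # gradient step of size eps/2 plus injected Gaussian noise of variance eps
    theta += 0.5 * eps * stoch_grad_log_post(theta, batch) + rng.normal(0.0, np.sqrt(eps))
    samples.append(theta)

print("posterior mean estimate:", np.mean(samples[1_000:]))
```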

Page 9: Deep generative learning_icml_part2

Demo SGFS ε = 2



Page 11: Deep generative learning_icml_part2

Demo SGFS ε = 0.4




Page 19: Deep generative learning_icml_part2

Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) - Patterson & Teh, 2013

Euclidean space of parameters θ = (σ, µ) of a normal distribution

Euclidean distance between parameters is 1, but the densities p(x|θ) are very different.

Euclidean distance between parameters is 10, but the densities p(x|θ) are almost identical.

where G(θ) is positive semi-definite

Natural gradient; change in curvature; align noise
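For orientation, Riemannian Langevin dynamics uses the metric G(θ) to precondition both the gradient and the injected noise; a generic form of the proposal (the notation and step-size convention in Patterson & Teh may differ, and Γ(θ) denotes the curvature-correction term built from derivatives of G(θ)^{-1}) is:

```latex
\theta_{t+1} = \theta_t
  + \frac{\epsilon}{2}\left( G(\theta_t)^{-1}\,\nabla_\theta \log p(\theta_t \mid x) + \Gamma(\theta_t) \right)
  + G(\theta_t)^{-1/2}\,\eta_t,
\qquad \eta_t \sim \mathcal{N}(0,\, \epsilon I)
```

The three ingredients match the labels above: G(θ)^{-1} times the gradient is the natural gradient, Γ(θ) accounts for the change in curvature, and G(θ)^{-1/2} aligns the noise with the local geometry.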



Page 24: Deep generative learning_icml_part2

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) - T. Chen, E. B. Fox, C. Guestrin (2014)

An (over-)simplified explanation of Hamiltonian Monte Carlo (HMC):

Langevin update: one informative gradient step of size ϵ + one random step of size ϵ = random-walk-type movement and bad mixing
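For concreteness, the Langevin update with a stochastic gradient (one gradient step of size ϵ plus Gaussian noise of matching scale, essentially as in Welling & Teh, 2011, with a mini-batch of n out of N points) is:

```latex
\theta_{t+1} = \theta_t
  + \frac{\epsilon}{2}\left( \nabla \log p(\theta_t)
      + \frac{N}{n}\sum_{i \in \text{mini-batch}} \nabla \log p(x_i \mid \theta_t) \right)
  + \eta_t,
\qquad \eta_t \sim \mathcal{N}(0,\, \epsilon)
```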

• HMC allows multiple gradient steps per noise step (sketched below)
• HMC can make distant proposals with high acceptance probability

• Naively using stochastic gradients in HMC does not work well
• The authors use a correction term to cancel the effect of noise in the gradients
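Below is a minimal sketch of plain HMC (not the SGHMC algorithm itself) illustrating the first bullet: several leapfrog gradient steps are taken per momentum (noise) draw, followed by a Metropolis-Hastings accept/reject. The toy target and all tuning constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_p(theta):            # toy target: standard normal
    return -0.5 * theta @ theta

def grad_log_p(theta):
    return -theta

def hmc_step(theta, eps=0.1, n_leapfrog=20):
    """One HMC transition: many gradient (leapfrog) steps per noise
    (momentum) draw, then a Metropolis-Hastings accept/reject."""
    p = rng.normal(size=theta.shape)               # resample momentum: the single "noise step"
    theta_new, p_new = theta.copy(), p.copy()
    p_new += 0.5 * eps * grad_log_p(theta_new)     # initial half momentum step
    for _ in range(n_leapfrog):
        theta_new += eps * p_new                   # full position step
        p_new += eps * grad_log_p(theta_new)       # full momentum step
    p_new -= 0.5 * eps * grad_log_p(theta_new)     # undo the extra half step
    # MH test on the joint (position, momentum) energy
    log_accept = (log_p(theta_new) - 0.5 * p_new @ p_new) - (log_p(theta) - 0.5 * p @ p)
    return theta_new if np.log(rng.uniform()) < log_accept else theta

theta = np.zeros(2)
samples = []
for _ in range(1_000):
    theta = hmc_step(theta)
    samples.append(theta)
print("sample mean:", np.mean(samples, axis=0))
```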

Talk tomorrow afternoon

In Track C (Monte Carlo)



Page 32: Deep generative learning_icml_part2

Distributed SGLD - Ahn, Shahbaba, Welling (2014)

Total of N data points, partitioned across machines

Adaptive Load Balancing: Longer trajectories from faster machines

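A minimal sketch of the distributed idea: the N data points are partitioned over workers, the chain runs SGLD against one worker's shard for a trajectory, then moves on to another worker, and faster machines are given longer trajectories. The worker speeds, the scheduling rule, and the toy model are illustrative assumptions, not the exact D-SGLD algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data partitioned over workers; total of N data points.
theta_true, N, n_workers = 1.5, 12_000, 4
x = rng.normal(theta_true, 1.0, size=N)
shards = np.array_split(x, n_workers)
speeds = np.array([4.0, 2.0, 1.0, 1.0])          # assumed relative machine speeds

def sgld_step(theta, shard, eps=1e-4, batch_size=100):
    """One SGLD step using only this worker's shard to estimate the full-data gradient."""
    batch = rng.choice(shard, size=batch_size, replace=False)
    grad = -theta + (N / batch_size) * np.sum(batch - theta)   # N(0,1) prior, unit-variance likelihood
    return theta + 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps))

theta, samples = 0.0, []
for round_ in range(200):
    w = round_ % n_workers                        # visit workers in turn
    # adaptive load balancing: longer trajectories on faster machines
    traj_len = int(20 * speeds[w] / speeds.min())
    for _ in range(traj_len):
        theta = sgld_step(theta, shards[w])
        samples.append(theta)

print("posterior mean estimate:", np.mean(samples[len(samples) // 2:]))
```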


Page 34: Deep generative learning_icml_part2

D-SGLD Results

Wikipedia dataset: 4.6M articles, 811M tokens, vocabulary size: 7702
PubMed dataset: 8.2M articles, 730M tokens, vocabulary size: 39987

Model: Latent Dirichlet Allocation


Talk tomorrow afternoon

In Track C (Monte Carlo)



Page 36: Deep generative learning_icml_part2

A Recap

Use an efficient proposal so that the Metropolis-Hastings test can be avoided:

SGLD - Langevin dynamics with stochastic gradients
SGFS - Preconditioning matrix based on the Fisher information at the mode
SGRLD - Position-specific preconditioning matrix based on Riemannian geometry
SGHMC - Avoids random walks by taking multiple gradient steps
DSGLD - Distributed version of the above algorithms

Approximate the Metropolis-Hastings test using less data


Page 37: Deep generative learning_icml_part2

Why approximate the MH test? (if gradient-based methods seem to work so well)

• Gradient-based proposals are not always available:
  – Parameter spaces of different dimensionality
  – Distributions on constrained manifolds
  – Discrete variables

• High gradients may catapult the sampler to low-density regions


Page 38: Deep generative learning_icml_part2

Metropolis-Hastings
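For reference, the exact Metropolis-Hastings step that the following slides approximate: propose θ' ~ q(θ' | θ) and accept it with probability

```latex
P_a(\theta \to \theta') = \min\!\left(1,\;
  \frac{p(\theta')\,\prod_{i=1}^{N} p(x_i \mid \theta')\;q(\theta \mid \theta')}
       {p(\theta)\,\prod_{i=1}^{N} p(x_i \mid \theta)\;q(\theta' \mid \theta)}\right)
```

Evaluating this ratio touches all N likelihood terms, which is the cost the approximate tests below try to avoid.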




Page 41: Deep generative learning_icml_part2

Metropolis-Hastings

Does not depend on the data (x)
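The test referred to here is the usual rearrangement of the MH acceptance rule: draw u ~ Uniform(0, 1) and accept θ' if the average log-likelihood ratio exceeds a threshold µ0 that depends only on u, the prior, and the proposal, not on the data:

```latex
\frac{1}{N}\sum_{i=1}^{N} \log \frac{p(x_i \mid \theta')}{p(x_i \mid \theta)}
\;>\;
\mu_0 \;=\; \frac{1}{N}\,\log\!\left[\, u \;\frac{p(\theta)\,q(\theta' \mid \theta)}{p(\theta')\,q(\theta \mid \theta')} \,\right]
```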



Page 46: Deep generative learning_icml_part2

Approximate Metropolis-Hastings

How do we choose Δ+ and Δ-?

Collect more data



Page 53: Deep generative learning_icml_part2

Approach 1: Using Confidence Intervals - Korattikara, Chen, Welling (2014)

Collect more data

(c is chosen as in a t-test for µ = µ0 vs. µ ≠ µ0)
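A minimal sketch of the sequential test idea, in the same setup as the rearranged MH test earlier: grow a mini-batch of log-likelihood-ratio terms, form a t-based confidence statement about their mean versus µ0, and decide as soon as the test is confident; otherwise collect more data. The function name, the fixed confidence level, and the toy usage are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def approx_mh_decision(log_lik_ratios, mu0, eps=0.05, batch=200):
    """Decide 'accept' iff the mean of all N log-likelihood ratios exceeds mu0,
    looking at as few terms as possible."""
    N = len(log_lik_ratios)
    perm = rng.permutation(N)
    n = 0
    while True:
        n = min(n + batch, N)
        sample = log_lik_ratios[perm[:n]]
        mean, sd = sample.mean(), sample.std(ddof=1)
        if n == N or sd == 0.0:
            return mean > mu0                      # we have seen everything
        # std. error of the mean, corrected for sampling without replacement
        se = sd / np.sqrt(n) * np.sqrt(1.0 - (n - 1) / (N - 1))
        # probability of deciding the wrong side of mu0
        delta = 1.0 - stats.t.cdf(abs(mean - mu0) / se, df=n - 1)
        if delta < eps:
            return mean > mu0                      # confident: decide now
        # otherwise: loop and collect more data

# toy usage: ratios concentrated above mu0 should be accepted after few batches
print(approx_mh_decision(rng.normal(0.3, 1.0, size=100_000), mu0=0.0))
```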

Talk tomorrow afternoon

In Track C (Monte Carlo)

• Singh, Wick, McCallum (2012) – inference in large-scale factor graphs
• DuBois, Korattikara, Welling, Smyth (2014) – approximate slice sampling


Page 54: Deep generative learning_icml_part2

Independent Component Analysis

Mixture of 4 audio sources - 1.95 million data points, 16 dimensions
Test function is the Amari distance to the true unmixing matrix


Page 55: Deep generative learning_icml_part2

SGLD + approximate MH

(curves: SGLD vs. SGLD + MH)



Page 59: Deep generative learning_icml_part2

Approach 2: Using Concentration Inequalities - Bardenet, Doucet, Holmes (2014)

Collect more data

• Complementary to the previous method
• More robust, as it does not use any CLT assumptions
• Uses more data per test if the CLT assumptions do hold
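One concentration inequality commonly used in this setting is an empirical Bernstein bound (stated here generically; the exact bound and constants used by the authors may differ). With Λ̂_n the mean of n subsampled log-likelihood-ratio terms, Λ the full-data mean, σ̂_n their empirical standard deviation, and R their range, it holds with probability at least 1 − δ that

```latex
\left|\hat{\Lambda}_n - \Lambda\right| \;\le\;
\hat{\sigma}_n \sqrt{\frac{2\log(3/\delta)}{n}} \;+\; \frac{3R\log(3/\delta)}{n}
```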

Talk tomorrow afternoon

In Track C (Monte Carlo)


Page 60: Deep generative learning_icml_part2

Summary

Use an efficient proposal so that the Metropolis-Hastings test can be avoided:

SGLD - Langevin dynamics with stochastic gradients
SGFS - Preconditioning matrix based on the Fisher information at the mode
SGRLD - Position-specific preconditioning based on Riemannian geometry
SGHMC - Avoids random walks by taking multiple gradient steps
DSGLD - Distributed version of the above algorithms

Approximate the Metropolis-Hastings test using less data:

Confidence Intervals - Based on confidence levels using CLT assumptions
Concentration Bounds - Based on concentration bounds; more robust as it does not use CLT assumptions, but uses more data than the above if the CLT assumptions hold



Page 62: Deep generative learning_icml_part2

Analysis: SGLD - I. Sato and H. Nakagawa (2014)

Langevin Dynamics

• The Langevin update is a discrete-time approximation of a stochastic differential equation (SDE)

• The stationary distribution of this SDE is S0(θ)

• Discretization introduces O(ϵ) errors that are corrected using an MH test
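Writing the target density as S0(θ), the SDE in question is the (overdamped) Langevin diffusion, here in one common convention:

```latex
d\theta_t \;=\; \tfrac{1}{2}\,\nabla_\theta \log S_0(\theta_t)\,dt \;+\; dW_t
```

where W_t is standard Brownian motion; S0(θ) is its stationary distribution, and the Langevin update above is its Euler discretization with step ϵ.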


Stochastic Gradient Langevin Dynamics

• The stationary distribution of the SDE that SGLD represents can also be shown to be S0(θ) (I. Sato and H. Nakagawa, 2014)

• Time-discretized SGLD converges weakly to the SGLD SDE, i.e. for any continuously differentiable function f of polynomial growth:
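A generic way to write that statement, with θ^SGLD_t the time-discretized SGLD chain (step size ϵ) and θ_t the SDE solution (the slide's exact notation may differ):

```latex
\mathbb{E}\!\left[ f\!\left(\theta^{\mathrm{SGLD}}_{t}\right) \right]
\;\longrightarrow\;
\mathbb{E}\!\left[ f\!\left(\theta_{t}\right) \right]
\quad \text{as } \epsilon \to 0
```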

Talk Monday afternoon

In Track C (Monte Carlo & Approximate Inference)


Page 63: Deep generative learning_icml_part2

Analysis: Approximate MH

Assume uniform ergodicity; control the error in the transition kernel.

Control the probability of making a wrong decision:
– The error in the acceptance probability is then bounded
– The error in the transition probability is then bounded (in total variation)
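In symbols, if each individual accept/reject decision of the approximate test is wrong with probability at most δ, then, writing α and α̃ for the exact and approximate acceptance probabilities and T and T̃ for the corresponding transition kernels, bounds of the following form hold (a sketch; the precise statements and conditions are in the references cited on the next slide):

```latex
\bigl|\tilde{\alpha}(\theta, \theta') - \alpha(\theta, \theta')\bigr| \;\le\; \delta,
\qquad
\bigl\lVert \tilde{T}(\theta, \cdot) - T(\theta, \cdot) \bigr\rVert_{\mathrm{TV}} \;\le\; \delta
```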


Page 64: Deep generative learning_icml_part2

Analysis: Approximate MH - Error in Stationary Distribution

If the error in the transition probability is bounded, and uniform ergodicity holds, then the error in the stationary distribution is bounded.
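The resulting bound has the following general shape: if the exact chain is uniformly ergodic, ‖T^t(θ, ·) − π‖_TV ≤ C λ^t with λ < 1, and the one-step transition error is at most δ in total variation, then the stationary distribution π̃ of the approximate chain satisfies, up to constants (see the references listed below for the precise statements),

```latex
\lVert \tilde{\pi} - \pi \rVert_{\mathrm{TV}} \;\lesssim\; \frac{C\,\delta}{1 - \lambda}
```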


For more details:
1. P. Alquier, N. Friel, R. Everitt, A. Boland (2014)
2. R. Bardenet, A. Doucet, C. Holmes (2014)
3. A. Korattikara, Y. Chen, M. Welling (2014)
4. N. S. Pillai, A. Smith (2014)


Page 65: Deep generative learning_icml_part2

References - MCMC

Approximate MCMC algorithms using mini-batch gradients

• Stochastic Gradient Langevin Dynamics – M. Welling and Y. W. Teh (ICML 2011)
• Stochastic Gradient Fisher Scoring – S. Ahn, A. Korattikara, M. Welling (ICML 2012)
• Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex – S. Patterson and Y. W. Teh (NIPS 2013)
• Stochastic Gradient Hamiltonian Monte Carlo – T. Chen, E. B. Fox, C. Guestrin (ICML 2014)
• Distributed Stochastic Gradient MCMC – S. Ahn, B. Shahbaba, M. Welling (ICML 2014)

Approximate MCMC algorithms using mini-batch Metropolis-Hastings

• Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget – A. Korattikara, Y. Chen, M. Welling (ICML 2014)
• Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach – R. Bardenet, A. Doucet, C. Holmes (ICML 2014)
• Approximate Slice Sampling for Bayesian Posterior Inference – C. DuBois, A. Korattikara, M. Welling, P. Smyth (AISTATS 2014)

Theory

• Approximation Analysis of Stochastic Gradient Langevin Dynamics using Fokker-Planck Equation and Ito Process – I. Sato and H. Nakagawa (ICML 2014)

• Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels - P. Alquier, N. Friel, R. Everitt, A. Boland (arXiv 2014)

• Ergodicity of Approximate MCMC Chains with Applications to Large Data Sets - N. S. Pillai, A. Smith (arXiv 2014)

Asymptotically unbiased MCMC algorithms using mini-batches

• Asymptotically Exact, Embarrassingly Parallel MCMC – W. Neiswanger, C. Wang, E. Xing (arXiv 2013)
• Firefly Monte Carlo: Exact MCMC with Subsets of Data – D. Maclaurin, R. P. Adams (arXiv 2014)
• Accelerating MCMC via Parallel Predictive Prefetching – E. Angelino, E. Kohler, A. Waterland, M. Seltzer, R. P. Adams (arXiv 2014)


Page 66: Deep generative learning_icml_part2

Conclusions & Future Directions

• Bayesian inference is not superfluous in the context of big data.

• Two requirements:
  – Stochastic / mini-batch based updates
  – Distributed implementation

• Two fruitful approaches:
  – Stochastic Variational Bayes
  – Mini-batch MCMC

• Future VB:
  – Very flexible variational posteriors, very small remaining bias
  – Black-box inference engine, a la Infer.net, BUGS

• Future MCMC:
  – Better theory
  – Better use of powerful (stochastic) optimization methods


Page 67: Deep generative learning_icml_part2

Stochastic Fully Structured Distributed Variational Bayes (driving bias to 0)

Stochastic Approximation MCMC (driving variance to 0)


Page 68: Deep generative learning_icml_part2

Acknowledgements & Collaborators

• Yee Whye Teh
• Sungjin Ahn
• Babak Shahbaba
• Yutian Chen
• Durk Kingma
• Taco Cohen
• Alex Ihler
• Chris DuBois
• Padhraic Smyth
• Dan Gillen
