
Auto-Encoding Variational Bayes
Paweł F. P. Budzianowski, Thomas F. W. Nicholson, William C. Tebbutt

Problem Definition

Perform approximate inference in a model with local latent variables $z_i$, whilst learning point estimates of the MAP solution for the global parameters $\theta$, having observed $x_i$.

[Graphical model: global parameters $\theta$; plate over $i = 1, \dots, N$ containing local latent variables $z_i$ and observations $x_i$.]

SGVB

Stochastic Gradient Variational Bayes provides a method for finding a deterministic approximation to an intractable posterior distribution, by finding parameters $\phi$ such that $D_{\mathrm{KL}}\big(q_\phi(z_i \mid x_i) \,\|\, p_\theta(z_i \mid x_i)\big)$ is minimised for all $i$. This is achieved by maximising, for each observation, a lower bound

\[
\mathcal{L}(\phi; x_i) = \mathbb{E}_{q_\phi(z_i \mid x_i)}\left[\log p_\theta(x_i \mid z_i)\right] - D_{\mathrm{KL}}\big(q_\phi(z_i \mid x_i) \,\|\, p(z_i)\big).
\]
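This is a lower bound on the marginal likelihood since, for each observation (a standard decomposition, stated here for reference),
\[
\log p_\theta(x_i) = \mathcal{L}(\phi; x_i) + D_{\mathrm{KL}}\big(q_\phi(z_i \mid x_i) \,\|\, p_\theta(z_i \mid x_i)\big) \ge \mathcal{L}(\phi; x_i),
\]
so maximising $\mathcal{L}$ is equivalent to minimising the KL divergence to the true posterior.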

The expectation term in this lower bound cannot typically be computed exactly. It is instead approximated by

\[
\tilde{\mathcal{L}}_B(\theta, \phi; x_i) = \frac{1}{L} \sum_{l=1}^{L} \Big( \log p_\theta(x_i \mid z_{i,l}) - D_{\mathrm{KL}}\big(q_\phi(z_i \mid x_i) \,\|\, p_\theta(z_i)\big) \Big),
\]

where reparameterising $z = g_\phi(x, \epsilon)$ with $\epsilon \sim p(\epsilon)$ yields a differentiable Monte Carlo approximation.
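Below is a minimal PyTorch sketch of this estimator, assuming a Gaussian recognition model (closed-form KL against the $\mathcal{N}(0, I)$ prior) and a Bernoulli decoder over binarised pixels; `encoder` and `decoder` are hypothetical modules returning $(\mu, \log\sigma^2)$ and Bernoulli logits respectively.

```python
import torch
import torch.nn.functional as F

def elbo_estimate_B(x, encoder, decoder, L=1):
    """Monte Carlo estimate of the lower bound for a minibatch x.

    Assumes q_phi(z|x) = N(mu, diag(sigma^2)) and a Bernoulli p_theta(x|z);
    `encoder` and `decoder` are hypothetical torch.nn.Modules.
    """
    mu, log_var = encoder(x)
    # Closed-form KL divergence between N(mu, sigma^2) and the N(0, I) prior.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)

    recon = 0.0
    for _ in range(L):
        eps = torch.randn_like(mu)                 # eps ~ p(eps) = N(0, I)
        z = mu + torch.exp(0.5 * log_var) * eps    # reparameterisation z = g_phi(x, eps)
        logits = decoder(z)
        # log p_theta(x | z) for Bernoulli outputs, summed over pixels.
        recon = recon - F.binary_cross_entropy_with_logits(
            logits, x, reduction="none").sum(dim=1)
    recon = recon / L

    return (recon - kl).mean()                     # maximise this, or minimise its negation
```

Because $z$ is written as a deterministic function of $\epsilon$, gradients with respect to $\phi$ and $\theta$ flow through the samples, so this estimate can be optimised with standard stochastic gradient methods.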

Variational Autoencoder

The Variational Autoencoder is a generative latent variable model for data in which $z_i \sim \mathcal{N}(0, I)$ and $x_i \sim p_\theta(x_i \mid z_i)$, where this conditional is parameterised by a multi-layer perceptron (MLP).

An MLP recognition model $q_\phi(z_i \mid x_i)$ is used to provide fast approximate posterior inference in $z_i \mid x_i$.

The MLPs used in the recognition model $q_\phi$ and the conditional distribution $p_\theta(x_i \mid z_i)$ are often compared to the encoder and decoder networks of traditional autoencoders, respectively.
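A minimal sketch of such an encoder/decoder pair is shown below; the layer sizes and tanh nonlinearity are illustrative assumptions rather than the exact architecture used in the experiments.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """MLP recognition model q_phi(z|x): maps x to the mean and log-variance
    of a diagonal Gaussian over z."""
    def __init__(self, x_dim=784, h_dim=500, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """MLP conditional p_theta(x|z): maps z to Bernoulli logits over pixels."""
    def __init__(self, z_dim=20, h_dim=500, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, z):
        return self.net(z)
```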

Noisy KL-divergence estimate

In the case of non-Gaussian distributions it is often impossible to obtain a closed-form expression for the KL-divergence term, which then also requires estimation by sampling. This yields a more generic estimator of the form:

\[
\tilde{\mathcal{L}}_A(\theta, \phi; x^{(i)}) = \frac{1}{L} \sum_{l=1}^{L} \Big( \log p_\theta(x^{(i)}, z^{(i,l)}) - \log q_\phi(z^{(i,l)} \mid x^{(i)}) \Big).
\]
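A sketch of this generic estimator, under the same hypothetical Gaussian encoder and Bernoulli decoder as above and the $\mathcal{N}(0, I)$ prior, could look as follows; here both the prior and entropy terms are estimated by sampling rather than evaluated in closed form.

```python
import torch
import torch.nn.functional as F

def elbo_estimate_A(x, encoder, decoder, L=1):
    """Generic estimator: every term, including the KL, is approximated by sampling."""
    mu, log_var = encoder(x)
    std = torch.exp(0.5 * log_var)
    total = 0.0
    for _ in range(L):
        z = mu + std * torch.randn_like(mu)        # reparameterised sample from q_phi(z|x)
        # log p_theta(x, z) = log p_theta(x|z) + log p(z), with p(z) = N(0, I).
        log_lik = -F.binary_cross_entropy_with_logits(
            decoder(z), x, reduction="none").sum(dim=1)
        log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(dim=1)
        log_q = torch.distributions.Normal(mu, std).log_prob(z).sum(dim=1)
        total = total + log_lik + log_prior - log_q
    return (total / L).mean()
```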

[Figure: lower bound $\mathcal{L}$ against number of training samples evaluated (up to $10^8$) on MNIST, for $N_z = 3, 10, 20, 200$, comparing the $\tilde{\mathcal{L}}_A$ and $\tilde{\mathcal{L}}_B$ estimators on train and test sets.]

Visualisation of learned manifolds

A linearly spaced grid of coordinates over the unit square is mapped through the inverse CDF of the Gaussian to obtain values of $z$, which can then be used to sample from $p_\theta(x \mid z)$ with the estimated parameters $\theta$.
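A sketch of this procedure for a two-dimensional latent space is given below; `decode_mean(z)` is a hypothetical function returning the decoder's mean image $\mathbb{E}[x \mid z]$ as a 28×28 array.

```python
import numpy as np
from scipy.stats import norm

n = 20
# Linearly spaced grid over the unit square, kept away from 0 and 1 so that
# the inverse Gaussian CDF stays finite.
grid = np.linspace(0.05, 0.95, n)
zs = np.array([[norm.ppf(u), norm.ppf(v)] for u in grid for v in grid])

# Decode each latent coordinate and tile the results into a single canvas.
images = np.stack([decode_mean(z) for z in zs])           # shape (n*n, 28, 28)
canvas = (images.reshape(n, n, 28, 28)
                .transpose(0, 2, 1, 3)
                .reshape(n * 28, n * 28))
```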

Bayesian: is it really all that?

Comparing reconstruction error against a vanilla autoencoder, we see stronger performance from VAEB.

[Figure: reconstruction MSE for the AE and VAE at latent space sizes 2, 10 and 20; example reconstructions (AE, VAE, original) for latent dimensions 2, 10 and 20.]

Full Variational Bayes

It is possible to perform full VB on the parameters:

\[
\mathcal{L}(\phi; X) = \int q_\phi(\theta)\,\big(\log p_\theta(X) + \log p_\alpha(\theta) - \log q_\phi(\theta)\big)\,\mathrm{d}\theta.
\]

A differentiable Monte Carlo estimate of this bound allows SGVB to be performed, yielding a distribution over the parameters. Our implementation showed a decrease in the variational lower bound, but no evidence of learning, possibly due to the strict Gaussian assumptions on the variational approximate posteriors.
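A simplified sketch of such an estimator is shown below, treating $\theta$ as a single flat vector with a diagonal Gaussian variational posterior; `log_lik_fn(X, theta)` and `log_prior(theta)` are hypothetical callables returning $\log p_\theta(X)$ and $\log p_\alpha(\theta)$.

```python
import torch

def full_vb_bound(X, log_lik_fn, log_prior, theta_mu, theta_log_var, S=1):
    """Monte Carlo estimate of L(phi; X), with q_phi(theta) a diagonal Gaussian."""
    std = torch.exp(0.5 * theta_log_var)
    total = 0.0
    for _ in range(S):
        theta = theta_mu + std * torch.randn_like(theta_mu)   # reparameterised parameter sample
        log_q = torch.distributions.Normal(theta_mu, std).log_prob(theta).sum()
        total = total + log_lik_fn(X, theta) + log_prior(theta) - log_q
    return total / S
```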

Architecture experiments

We examined various changes to the original architecture of the autoencoder to test the robustness and flexibility of the model, and whether they lead to improvements in optimising the lower bound and in computational efficiency.

• Different activation functions.

[Figure: lower bound $\mathcal{L}$ against number of training samples evaluated (up to $10^8$) on MNIST for $N_z = 5, 10, 20$, comparing sigmoid, ReLU and tanh activations on train and test sets.]

• Increasing the depth of the encoder.

[Figure: lower bound $\mathcal{L}$ against number of training samples evaluated (up to $10^8$) on MNIST for the default AEVB and for encoder depths 2, 3 and 4, on train and test sets.]

Future works

I. Scheduled training of VAEB [2].
II. Direct parameterisation of the differentiable transform [3].
III. Different priors over the latent space.

References

1. Kingma, D. P., and Welling, M., "Auto-Encoding Variational Bayes." arXiv preprint arXiv:1312.6114 (2013).

2. Geras, K. J., and Sutton, C., "Scheduled Denoising Autoencoders." arXiv preprint arXiv:1406.3269 (2014).

3. Tran, D., Ranganath, R., and Blei, D. M., "The Variational Gaussian Process." arXiv preprint arXiv:1511.06499 (2015).