
Probabilistic & Bayesian deep learning

Andreas Damianou

Amazon Research Cambridge, UK

Talk at University of Sheffield, 19 March 2019

In this talk

Not in this talk: CRFs, Boltzmann machines, ...


A standard neural network

[Figure: a fully connected network with inputs x₁ … x_q, basis functions ϕ₁ … ϕ_q, weights w₁ … w_q, and output Y]

- Define: φ_j = φ(x_j) and f_j = w_j φ_j (ignore bias for now)

- Once we've defined all w's with back-prop, then f (and the whole network) becomes deterministic (see the sketch below).

- What does that imply?
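A minimal sketch of the deterministic forward pass just described (the basis function and the weight values are hypothetical placeholders, not from the talk):

```python
import numpy as np

def phi(x):
    # Hypothetical basis function; the slide leaves phi generic (e.g. tanh, ReLU).
    return np.tanh(x)

def forward(x, w):
    # f_j = w_j * phi(x_j), summed into a single output y (bias ignored, as above).
    return np.sum(w * phi(x))

rng = np.random.default_rng(0)
x = rng.normal(size=4)   # inputs x_1 .. x_q with q = 4
w = rng.normal(size=4)   # point-estimate weights, e.g. found by back-prop
print(forward(x, w))     # the same output every time: the trained network is deterministic
```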

Trained neural network is deterministic. Implications?

- Generalization: overfitting occurs, creating the need for ad-hoc regularizers: dropout, early stopping, ...

- Data generation: a model which generalizes well should also understand, or even be able to create ("imagine"), variations of the data.

- No predictive uncertainty: uncertainty needs to be propagated across the model to be reliable.

Need for uncertainty

- Reinforcement learning
- Critical predictive systems
- Active learning
- Semi-automatic systems
- Scarce data scenarios
- ...

BNN with priors on its weights

[Figure: the same network with a prior distribution placed on each weight w₁ … w_q (left), and with approximate posteriors q(w₁) … q(w_q) after inference (right)]

BNN with priors on its weights

[Figure: drawing a concrete value for each weight (e.g. 0.23, 0.90, 0.12, 0.54) from its distribution yields one realization of the network; repeating the draw gives an ensemble of networks]
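To make this concrete, here is a minimal sketch of how sampling weights turns a single deterministic prediction into a predictive distribution. The factorised Gaussian q(w) and all values are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed factorised Gaussian approximate posterior q(w) over 4 weights
q_mu = rng.normal(size=4)
q_sigma = 0.1 + np.abs(rng.normal(size=4))

def phi(x):
    return np.tanh(x)

def predict(x, w):
    return np.sum(w * phi(x))

x_star = rng.normal(size=4)
# Each draw w^(s) ~ q(w) gives a different network, hence a different prediction.
samples = [predict(x_star, q_mu + q_sigma * rng.normal(size=4)) for _ in range(1000)]
print(np.mean(samples), np.std(samples))   # predictive mean and predictive uncertainty
```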

Probabilistic re-formulation

- DNN: y = g(W, x) = w₁ϕ(w₂ϕ(· · · x))

- Training minimizes a loss:

$$\arg\min_{W} \; \underbrace{\frac{1}{2}\sum_{i=1}^{N}\big(g(W, x_i) - y_i\big)^2}_{\text{fit}} \;+\; \underbrace{\lambda \sum_{i} \|w_i\|}_{\text{regularizer}}$$

- Equivalent probabilistic view for regression: maximize the posterior probability

$$\arg\max_{W} \; \underbrace{\log p(y \mid x, W)}_{\text{fit}} \;+\; \underbrace{\log p(W)}_{\text{regularizer}}$$

where p(y|x, W) is Gaussian and p(W) is Laplace (the Gaussian likelihood gives the squared-error fit term, the Laplace prior the norm regularizer).


Integrating out weights

- Define: D = (x, y)

- Remember Bayes' rule:

$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}, \qquad p(D) = \int p(D \mid w)\, p(w)\, dw$$

- For Bayesian inference, the weights need to also be integrated out. This gives us a properly defined posterior on the parameters (see the toy sketch below).
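As a toy illustration of integrating out a weight (a one-dimensional example added here, not from the slides), Bayes' rule can be evaluated on a grid: the evidence p(D) is the integral of likelihood times prior, and dividing by it normalizes the posterior:

```python
import numpy as np

# Toy 1-D model: y_i = w * x_i + Gaussian noise (std 0.3), prior w ~ N(0, 1).
rng = np.random.default_rng(2)
x = rng.normal(size=20)
y = 0.7 * x + 0.3 * rng.normal(size=20)

w_grid = np.linspace(-3, 3, 2001)
log_prior = -0.5 * w_grid**2
log_lik = np.array([-0.5 * np.sum((y - w * x) ** 2) / 0.3**2 for w in w_grid])

log_post_unnorm = log_lik + log_prior
unnorm = np.exp(log_post_unnorm - log_post_unnorm.max())   # stabilised, unnormalised posterior
dw = w_grid[1] - w_grid[0]
evidence = unnorm.sum() * dw                               # proportional to p(D)
posterior = unnorm / evidence                              # p(w|D) on the grid
print(w_grid[np.argmax(posterior)])                        # posterior mode, near 0.7
```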

(Bayesian) Occam’s Razor

"A plurality is not to be posited without necessity." (William of Ockham)

"Everything should be made as simple as possible, but not simpler." (A. Einstein)

The evidence is higher for the model that is not "unnecessarily complex" but still "explains" the data D.

Bayesian Inference

Remember: Separation of Model and Inference

Inference

- p(D) (and hence p(w|D)) is difficult to compute because of the nonlinear way in which w appears through g.

- Attempt at variational inference:

$$\underbrace{\mathrm{KL}\big(q(w;\theta)\,\|\,p(w \mid D)\big)}_{\text{minimize}} = \log p(D) - \underbrace{\mathcal{L}(\theta)}_{\text{maximize}}, \qquad \mathcal{L}(\theta) = \underbrace{\mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\big]}_{\mathcal{F}} + H\big[q(w;\theta)\big]$$

- The expectation term F is still problematic. Solution: Monte Carlo (MC); see the sketch below.

- Such approaches can be formulated as black-box inference.
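A minimal sketch of estimating the intractable term F = E_q[log p(D, w)] by Monte Carlo, and hence the bound L(θ) = F + H[q]. The toy single-weight model and all names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 1.5 * x + 0.2 * rng.normal(size=30)

def log_joint(w):
    # log p(D, w) = log p(y|x, w) + log p(w) for a toy linear model with unit Gaussian prior
    log_lik = -0.5 * np.sum((y - w * x) ** 2) / 0.2**2
    log_prior = -0.5 * w**2
    return log_lik + log_prior

mu, sigma = 1.0, 0.5                                   # variational parameters theta of q(w) = N(mu, sigma^2)
w_samples = mu + sigma * rng.normal(size=500)          # MC samples from q
F = np.mean([log_joint(w) for w in w_samples])         # MC estimate of E_q[log p(D, w)]
H = 0.5 * np.log(2 * np.pi * np.e * sigma**2)          # entropy of the Gaussian q
print("ELBO estimate:", F + H)
```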


[Four illustrative figure slides omitted. Credit: Tushar Tank]

Inference: “Score function method”

$$\nabla_\theta \mathcal{F} = \nabla_\theta \mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\big] = \mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\, \nabla_\theta \log q(w;\theta)\big]$$

$$\approx \frac{1}{K} \sum_{k=1}^{K} \log p\big(D, w^{(k)}\big)\, \nabla_\theta \log q\big(w^{(k)};\theta\big), \qquad w^{(k)} \overset{iid}{\sim} q(w;\theta)$$

(Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014; Ruiz et al., 2016)
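A minimal sketch of the score-function estimator above, reusing the toy single-weight model from before (all names are my own): draw w^(k) from q and weight the score ∇_θ log q by log p(D, w^(k)). The estimator is unbiased but typically has high variance:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=30)
y = 1.5 * x + 0.2 * rng.normal(size=30)

def log_joint(w):
    return -0.5 * np.sum((y - w * x) ** 2) / 0.2**2 - 0.5 * w**2

mu, sigma = 0.0, 1.0                       # theta = (mu, sigma) of q(w) = N(mu, sigma^2)
K = 2000
w = mu + sigma * rng.normal(size=K)        # w^(k) iid from q(w; theta)
# Score of a Gaussian w.r.t. its mean: d/dmu log q(w) = (w - mu) / sigma^2
score_mu = (w - mu) / sigma**2
grad_mu = np.mean([log_joint(wk) * s for wk, s in zip(w, score_mu)])
print("score-function estimate of dF/dmu:", grad_mu)   # noisy, but unbiased
```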

Inference: “Reparameterization gradient”

- Reparametrize w as a transformation T of a simpler variable ε: w = T(ε; θ)

- q(ε) is now independent of θ

$$\nabla_\theta \mathcal{F} = \nabla_\theta \mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\big] = \mathbb{E}_{q(\varepsilon)}\Big[\big.\nabla_w \log p(D, w)\big|_{w = T(\varepsilon;\theta)}\, \nabla_\theta T(\varepsilon;\theta)\Big]$$

- For example: w ∼ N(µ, σ) becomes, under T, w = µ + σ·ε with ε ∼ N(0, 1)

- MC by sampling from q(ε) (thus obtaining samples of w through T)

(Salimans and Knowles, 2013; Kingma and Welling, 2014; Ruiz et al., 2016)
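A sketch of the same toy gradient computed with the reparameterization trick, letting automatic differentiation flow through w = µ + σ·ε. The toy model and all names are assumptions, not part of the slides:

```python
import torch

torch.manual_seed(0)
x = torch.randn(30)
y = 1.5 * x + 0.2 * torch.randn(30)

mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)     # parameterize sigma > 0 via its log

eps = torch.randn(2000)                               # eps ~ N(0, 1), independent of theta
w = mu + torch.exp(log_sigma) * eps                   # w = T(eps; theta)
log_joint = (-0.5 * ((y[None, :] - w[:, None] * x[None, :]) ** 2).sum(dim=1) / 0.2**2
             - 0.5 * w**2)
F = log_joint.mean()                                  # MC estimate of E_q[log p(D, w)]
F.backward()                                          # gradients flow through T to (mu, log_sigma)
print(mu.grad, log_sigma.grad)
```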


Black-box VI (github.com/blei-lab/edward)
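Libraries such as Edward package the estimators above into a generic routine. As a rough sketch of what such a black-box VI loop does under the hood (a toy Bayesian linear model optimized in PyTorch; this is not Edward's API):

```python
import math
import torch

torch.manual_seed(1)
x = torch.randn(50)
y = 1.5 * x + 0.2 * torch.randn(50)

# Variational parameters theta of q(w) = N(mu, sigma^2) for a single-weight toy model
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    opt.zero_grad()
    eps = torch.randn(32)
    w = mu + log_sigma.exp() * eps                              # reparameterized samples from q
    log_lik = -0.5 * ((y[None, :] - w[:, None] * x[None, :]) ** 2).sum(dim=1) / 0.2**2
    log_prior = -0.5 * w**2
    entropy = 0.5 * math.log(2 * math.pi * math.e) + log_sigma  # H[q] for a Gaussian
    elbo = (log_lik + log_prior).mean() + entropy
    (-elbo).backward()                                          # ascend the ELBO
    opt.step()

print(mu.item(), log_sigma.exp().item())  # approximate posterior: mean near 1.5, small sigma
```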

Priors on weights (what we saw before)

[Figure: a multilayer architecture X → H → · · · → Y with priors on the weights of each layer of basis functions ϕ]

From NN to GP

[Figure: the same layered architecture X → H → · · · → Y with each parametric layer of basis functions ϕ replaced by a GP mapping]

- NN: H₂ = W₂ φ(H₁)

- GP: φ is ∞-dimensional, so H₂ = f₂(H₁; θ₂) + ε

- NN: p(W)

- GP: p(f(·))

- A VAE can be seen as a special case of this


From NN to GP

[Figure: the same layered architecture, read as a latent variable model in which latents H generate the observed data Y]

- The real world is perfectly described by unobserved latent variables: H

- But we only observe noisy, high-dimensional data: Y

- We try to interpret the world and infer the latents: Ĥ ≈ H

- Inference:

$$p(H \mid Y) = \frac{p(Y \mid H)\, p(H)}{p(Y)}$$

Deep Gaussian processes

- Uncertainty about parameters: check. Uncertainty about structure?

- A deep GP simultaneously brings in:
  - a prior on the "weights"
  - a kernelized input/latent space
  - stochasticity in the warping

Introducing Gaussian Processes:

- A Gaussian distribution depends on a mean and a covariance matrix.

- A Gaussian process depends on a mean and a covariance function.

Samples from a 1-D Gaussian

[Animated figure sequence of repeated samples omitted]
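In the same spirit as these sample plots, here is a sketch of drawing function samples from a GP prior: the covariance matrix comes from evaluating a covariance function on a grid of inputs (the RBF kernel and all settings are my own choices for illustration):

```python
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=0.5):
    # Covariance function k(x, x') = variance * exp(-0.5 * (x - x')^2 / lengthscale^2)
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(5)
x = np.linspace(-3, 3, 200)
K = rbf(x, x) + 1e-6 * np.eye(len(x))        # small jitter for numerical stability
L = np.linalg.cholesky(K)
samples = L @ rng.normal(size=(len(x), 3))   # three function draws from GP(0, k)
print(samples.shape)                          # (200, 3): each column is one sampled function
```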

How do GPs solve the overfitting problem (i.e. regularize)?

- Answer: integrate over the function itself!

- So, we will average out all possible function forms, under a (GP) prior!

Recap:

$$\text{MAP:}\quad \arg\max_{w}\; p\big(y \mid w, \phi(x)\big)\, p(w), \qquad \text{e.g. } y = \phi(x)^\top w + \varepsilon$$

$$\text{Bayesian:}\quad \arg\max_{\theta}\; \int_{f} p(y \mid f)\, \underbrace{p(f \mid x, \theta)}_{\text{GP prior}}, \qquad \text{e.g. } y = f(x; \theta) + \varepsilon$$

- θ are hyperparameters

- The Bayesian approach (GP) automatically balances the data fit with the complexity penalty.
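A sketch of the quantity maximized over θ in the Bayesian case: the GP log marginal likelihood log N(y | 0, K_θ + σ²I), whose data-fit and log-determinant (complexity) terms give the automatic trade-off just mentioned. The kernel, data, and lengthscale values are toy choices of mine:

```python
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=0.5):
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(6)
x = np.linspace(-3, 3, 40)
y = np.sin(x) + 0.1 * rng.normal(size=40)

def log_marginal(lengthscale, noise=0.1):
    K = rbf(x, x, lengthscale=lengthscale) + noise**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y via triangular solves
    fit = -0.5 * y @ alpha                                 # data-fit term
    complexity = -np.sum(np.log(np.diag(L)))               # -0.5 log|K|, the Occam penalty
    return fit + complexity - 0.5 * len(x) * np.log(2 * np.pi)

for ls in (0.05, 0.5, 5.0):
    print(ls, log_marginal(ls))   # too wiggly / well balanced / too rigid
```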


Deep Gaussian processes

[Figure: a chain x → f₁ → h₁ → f₂ → h₂ → f₃ → y of GP-warped layers]

- Define a recursive stacked construction:

$$f(x) \to \text{GP}, \qquad f_L\big(f_{L-1}(f_{L-2} \cdots f_1(x))\big) \to \text{deep GP}$$

Compare to:

$$\phi(x)^\top w \to \text{NN}, \qquad \phi\big(\phi(\phi(x)^\top w_1)^\top \cdots w_{L-1}\big)^\top w_L \to \text{DNN}$$
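Following the recursive construction, a toy sketch of drawing a sample from a two-layer deep GP prior by feeding one GP sample through another (the RBF kernel and grid are my own choices, as in the earlier sketches):

```python
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_sample(inputs, rng):
    # One draw from GP(0, k) evaluated at the given inputs
    K = rbf(inputs, inputs) + 1e-6 * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ rng.normal(size=len(inputs))

rng = np.random.default_rng(7)
x = np.linspace(-3, 3, 200)
h1 = gp_sample(x, rng)        # f_1(x) ~ GP
y = gp_sample(h1, rng)        # f_2(f_1(x)): the warped inputs enter the second kernel
print(y.shape)
```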

Two-layered DGP

[Figure: a two-layered deep GP with observed input X and output Y, an unobserved intermediate layer H, and GP mappings f₁ : X → H and f₂ : H → Y]

Inference in DGPs

- It's tricky. An additional difficulty comes from the fact that the kernel couples all points, so p(y|x) is no longer factorized.

- Indeed, p(y|x, w) = ∫_f p(y|f) p(f|x, w), but w appears inside the non-linear kernel function: p(f|x, w) = N(f | 0, k(X, X | w))

The resulting objective function

By integrating over all unknowns:

- We obtain approximate posteriors q(h), q(f) over them

- We end up with a well regularized variational lower bound on the true marginal likelihood:

$$\mathcal{F} = \text{Data Fit} - \underbrace{\mathrm{KL}\big(q(h_1)\,\|\,p(h_1)\big)}_{\text{Regularisation}} + \sum_{l=2}^{L} \underbrace{H\big(q(h_l)\big)}_{\text{Regularisation}}$$

- Regularization is achieved through uncertainty propagation.
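Both regularisation pieces are available in closed form when the q's are Gaussian. A small sketch, assuming a standard normal prior p(h₁) and diagonal Gaussian q's (all dimensions and values invented):

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    # KL( N(mu, diag(var)) || N(0, I) ): the first regularisation term
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

def gaussian_entropy(var):
    # H[ N(mu, diag(var)) ]: the per-layer entropy terms for l = 2..L
    return 0.5 * np.sum(np.log(2 * np.pi * np.e * var))

rng = np.random.default_rng(8)
mu1, var1 = rng.normal(size=5), np.exp(rng.normal(size=5))     # q(h_1)
vars_upper = [np.exp(rng.normal(size=5)) for _ in range(2)]    # q(h_2), q(h_3)

regulariser = -kl_to_standard_normal(mu1, var1) + sum(gaussian_entropy(v) for v in vars_upper)
print(regulariser)   # added to the data-fit term to form the bound F
```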

An example where uncertainty propagation matters: Recurrent learning.

Dynamics/memory: Deep Recurrent Gaussian Process

Avatar control

Figure: The generated motion with a step function signal, starting with walking (blue), switching to running (red), and switching back to walking (blue).

Videos:
- https://youtu.be/FR-oeGxV6yY (switching between learned speeds)
- https://youtu.be/AT0HMtoPgjc (interpolating (un)seen speed)
- https://youtu.be/FuF-uZ83VMw (constant unseen speed)

[Figure panels comparing RGP, GP-NARX, MLP-NARX, and LSTM]

Stochastic warping

[Figure: the deep architecture X → H → · · · → Y with N(0, I) noise injected at each hidden layer H]

Inference:

- Need to infer posteriors on H

- Define q(H) = N(µ, Σ)

- Amortize: {µ, Σ} = MLP(Y; θ)


Amortized inference

Solution: Reparameterization through recognition model g:

$$\mu_1^{(n)} = g_1\big(y^{(n)}\big), \qquad \mu_l^{(n)} = g_l\big(\mu_{l-1}^{(n)}\big), \qquad g_l = \mathrm{MLP}(\theta_l)$$

$$g_l \text{ deterministic} \;\Rightarrow\; \mu_l^{(n)} = g_l\big(\ldots g_1\big(y^{(n)}\big)\big)$$

Dai, Damianou, Gonzalez, Lawrence, ICLR 2016
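A sketch of such a recognition model: a generic PyTorch MLP mapping each observation y^(n) to the mean of q(h^(n)) layer by layer. Layer sizes and names are placeholders, not the architecture of Dai et al.:

```python
import torch
import torch.nn as nn

class Recognition(nn.Module):
    # g_l = MLP(theta_l): deterministic map from the layer below to the mean mu_l^(n)
    def __init__(self, y_dim=10, h_dims=(5, 2)):
        super().__init__()
        dims = (y_dim,) + tuple(h_dims)
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(h_dims))]
        )

    def forward(self, y):
        mu = y
        for layer in self.layers:
            mu = torch.tanh(layer(mu))     # mu_l = g_l(mu_{l-1}), amortized over all n
        return mu

Y = torch.randn(100, 10)                   # N = 100 observations
mu_top = Recognition()(Y)                  # means of q(H) at the top layer
print(mu_top.shape)                        # (100, 2)
```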

Bayesian approach recap

- Three ways of introducing uncertainty / noise in a NN:
  - Treat weights w as distributions (BNNs)
  - Stochasticity in the warping function φ (VAE)
  - Bayesian non-parametrics applied to DNNs (DGP)

Connection between Bayesian and “traditional” approaches

There's another view of Bayesian NNs: they can be used as a means of explaining and motivating DNN techniques (e.g. dropout) in a more mathematically grounded way.

- Dropout as variational inference (Kingma et al., 2015; Gal and Ghahramani, 2016); see the sketch below.

- Low-rank NN weight approximations and deep Gaussian processes (Hensman and Lawrence, 2014; Damianou, 2015; Louizos and Welling, 2016; Cutajar et al., 2016)

- ...
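As a concrete instance of that connection, here is a sketch of MC dropout in the spirit of Gal and Ghahramani (2016): keep dropout stochastic at prediction time and average several forward passes to obtain a predictive mean and uncertainty. The network, sizes, and dropout rate are placeholders (and the network is untrained), shown only to illustrate the mechanism:

```python
import torch
import torch.nn as nn

torch.manual_seed(2)
net = nn.Sequential(nn.Linear(1, 50), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(50, 1))

x_star = torch.randn(1, 1)
net.train()                                   # keep dropout active at prediction time
with torch.no_grad():
    preds = torch.stack([net(x_star) for _ in range(100)])   # 100 stochastic forward passes
print(preds.mean().item(), preds.std().item())                # predictive mean and uncertainty
```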

Summary

- Motivation for probabilistic and Bayesian reasoning

- Three ways of incorporating uncertainty in DNNs

- Inference is more challenging when uncertainty has to be propagated

- Connection between Bayesian and "traditional" NN approaches