
Transcript
Page 1:

Probabilistic & Bayesian deep learning

Andreas Damianou

Amazon Research Cambridge, UK

Talk at University of Sheffield, 19 March 2019

Pages 2–3:

In this talk

Not in this talk: CRFs, Boltzmann machines, ...


Page 4:

A standard neural network

[Figure: a one-layer network with inputs x1 … xq, basis functions ϕ1 … ϕq, weights w1 … wq, and output Y.]

- Define: φj = φ(xj) and fj = wj φj (ignore bias for now)

- Once we have fitted all the w's with back-prop, f (and the whole network) becomes deterministic.

- What does that imply?

Page 5:

A trained neural network is deterministic. Implications?

- Generalization: overfitting occurs, hence the need for ad hoc regularizers: dropout, early stopping, ...

- Data generation: a model which generalizes well should also understand, or even be able to create ("imagine"), variations.

- No predictive uncertainty: uncertainty needs to be propagated across the model to be reliable.

Page 6:

Need for uncertainty

- Reinforcement learning

- Critical predictive systems

- Active learning

- Semi-automatic systems

- Scarce data scenarios

- ...

Page 7:

BNN with priors on its weights

[Figure: left, the network with point weights w1 … wq; right, the same network with each weight replaced by a distribution q(w1) … q(wq).]

Page 8:

BNN with priors on its weights

[Figure: the network with weight values sampled from the distributions (0.23, 0.90, 0.12, 0.54), alongside the unannotated network.]

Pages 9–11:

Probabilistic re-formulation

- DNN: $y = g(W, x) = w_1\,\phi(w_2\,\phi(\cdots x))$

- Training by minimizing a loss:

  $$\arg\min_{W}\; \underbrace{\frac{1}{2}\sum_{i=1}^{N}\big(g(W, x_i) - y_i\big)^2}_{\text{fit}} \;+\; \underbrace{\lambda \sum_i \|w_i\|}_{\text{regularizer}}$$

- Equivalent probabilistic view for regression, maximizing the posterior probability:

  $$\arg\max_{W}\; \underbrace{\log p(y \mid x, W)}_{\text{fit}} \;+\; \underbrace{\log p(W)}_{\text{regularizer}}$$

  where $p(y \mid x, W)$ is Gaussian and $p(W)$ is Laplace.
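To make the equivalence concrete, here is a minimal numerical check (my own sketch, not from the talk): for a linear "network" with Gaussian likelihood and a Laplace prior, the regularized loss and the negative log posterior differ only by a constant in W, so their minimizers coincide. All names and sizes are illustrative.

```python
# Toy check: loss(W) and -log posterior(W) differ by a constant in W.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # inputs
y = X @ rng.normal(size=3) + 0.1 * rng.normal(size=50)
lam = 0.5                                         # regularization strength

def loss(W):
    fit = 0.5 * np.sum((X @ W - y) ** 2)
    reg = lam * np.sum(np.abs(W))                 # L1 penalty <-> Laplace prior
    return fit + reg

def neg_log_posterior(W):
    # Gaussian likelihood with unit noise variance: -log N(y | XW, I)
    nll = 0.5 * np.sum((X @ W - y) ** 2) + 0.5 * len(y) * np.log(2 * np.pi)
    # Laplace prior with scale 1/lam: -log p(W)
    nlp = lam * np.sum(np.abs(W)) + len(W) * np.log(2 / lam)
    return nll + nlp

W = rng.normal(size=3)
print(loss(W) - neg_log_posterior(W))             # a constant...
print(loss(2 * W) - neg_log_posterior(2 * W))     # ...the same for any W
```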


Page 12:

Integrating out weights

- Define: D = (x, y)

- Remember Bayes' rule:

  $$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}, \qquad p(D) = \int p(D \mid w)\, p(w)\, dw$$

- For Bayesian inference, the weights also need to be integrated out. This gives us a properly defined posterior over the parameters.
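As a toy illustration of what integrating out the weights means (a sketch under assumed distributions, not from the slides), the evidence p(D) of a 1-D linear model can be estimated by simple Monte Carlo over the prior:

```python
# Monte Carlo estimate of p(D) = ∫ p(D|w) p(w) dw for a 1-D linear model.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 20)
y = 1.5 * x + 0.1 * rng.normal(size=20)           # data generated with w = 1.5

def likelihood(w, noise_std=0.1):
    # p(D | w) under a Gaussian noise model
    resid = y - w * x
    return np.prod(np.exp(-0.5 * (resid / noise_std) ** 2)
                   / (noise_std * np.sqrt(2 * np.pi)))

w_samples = rng.normal(0.0, 2.0, size=50_000)     # w ~ p(w) = N(0, 2^2)
evidence = np.mean([likelihood(w) for w in w_samples])
print(evidence)                                   # ≈ p(D)
```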

Page 13:

(Bayesian) Occam’s Razor

"A plurality is not to be posited without necessity." (W. of Ockham)

"Everything should be made as simple as possible, but not simpler." (A. Einstein)

The evidence is higher for the model that is not "unnecessarily complex" but still "explains" the data D.

Page 14:

Bayesian Inference

Remember: Separation of Model and Inference

Pages 15–17:

Inference

- p(D) (and hence p(w|D)) is difficult to compute because of the nonlinear way in which w appears through g.

- Attempt at variational inference:

  $$\underbrace{\mathrm{KL}\big(q(w; \theta)\,\|\,p(w \mid D)\big)}_{\text{minimize}} = \log p(D) - \underbrace{\mathcal{L}(\theta)}_{\text{maximize}}$$

  where

  $$\mathcal{L}(\theta) = \underbrace{\mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\big]}_{\mathcal{F}} + H\big[q(w; \theta)\big]$$

- The expectation term $\mathcal{F}$ (highlighted on the slide) is still problematic. Solution: Monte Carlo (MC).

- Such approaches can be formulated as black-box inference.
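To ground the bound, here is a rough Monte Carlo estimate of L(θ) for a 1-D Gaussian q(w; θ) on the same kind of toy model (Gaussian prior and noise are my assumptions; the entropy of a Gaussian is available in closed form):

```python
# MC estimate of the ELBO L(θ) = E_q[log p(D, w)] + H[q], q(w; θ) = N(μ, σ²).
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 20)
y = 1.5 * x + 0.1 * rng.normal(size=20)

def log_joint(w, noise_std=0.1, prior_std=2.0):
    # log p(D, w) = log p(D | w) + log p(w)
    log_lik = np.sum(-0.5 * ((y - w * x) / noise_std) ** 2
                     - np.log(noise_std * np.sqrt(2 * np.pi)))
    log_prior = (-0.5 * (w / prior_std) ** 2
                 - np.log(prior_std * np.sqrt(2 * np.pi)))
    return log_lik + log_prior

mu, sigma = 1.0, 0.3                              # variational parameters θ
w_samples = rng.normal(mu, sigma, size=10_000)    # w ~ q(w; θ)
entropy = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # H[q], closed form
elbo = np.mean([log_joint(w) for w in w_samples]) + entropy
print(elbo)                                       # a lower bound on log p(D)
```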


Pages 18–21:

[Figure slides; no extractable text. Credit: Tushar Tank]

Page 22:

Inference: “Score function method”

$$\nabla_\theta \mathcal{F} = \nabla_\theta\, \mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\big] = \mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\, \nabla_\theta \log q(w; \theta)\big]$$

$$\approx \frac{1}{K} \sum_{k=1}^{K} \log p(D, w^{(k)})\, \nabla_\theta \log q(w^{(k)}; \theta), \qquad w^{(k)} \overset{\text{iid}}{\sim} q(w; \theta)$$

(Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014; Ruiz et al., 2016)
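A sketch of this estimator for a 1-D Gaussian q(w; θ) = N(μ, σ²), with the score ∇θ log q written out by hand; the toy log joint and all constants are assumptions of mine. Additive constants in log p(D, w) drop out in expectation, since E_q[∇θ log q] = 0.

```python
# Score-function (REINFORCE) gradient estimate for q(w; θ) = N(μ, σ²).
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 20)
y = 1.5 * x + 0.1 * rng.normal(size=20)

def log_joint(w):
    # log p(D | w) + log p(w), up to additive constants
    return np.sum(-0.5 * ((y - w * x) / 0.1) ** 2) - 0.5 * (w / 2.0) ** 2

mu, sigma, K = 1.0, 0.3, 5000
w = rng.normal(mu, sigma, size=K)                 # w^(k) ~ q(w; θ)
f = np.array([log_joint(wk) for wk in w])
score_mu = (w - mu) / sigma**2                    # ∇_μ log q(w; θ)
score_sigma = ((w - mu) ** 2 - sigma**2) / sigma**3   # ∇_σ log q(w; θ)
print(np.mean(f * score_mu), np.mean(f * score_sigma))   # ≈ ∇_μ F, ∇_σ F
```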

Pages 23–26:

Inference: “Reparameterization gradient”

- Reparameterize w as a transformation T of a simpler variable ε: w = T(ε; θ)

- q(ε) is now independent of θ

  $$\nabla_\theta \mathcal{F} = \nabla_\theta\, \mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\big] = \mathbb{E}_{q(\epsilon)}\Big[\nabla_w \log p(D, w)\big|_{w = T(\epsilon;\theta)}\, \nabla_\theta T(\epsilon; \theta)\Big]$$

- For example: $w \sim \mathcal{N}(\mu, \sigma) \;\xrightarrow{T}\; w = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$

- MC by sampling from q(ε) (thus obtaining samples of w through T)

(Salimans and Knowles, 2013; Kingma and Welling, 2014; Ruiz et al., 2016)
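The same gradient via the reparameterization w = μ + σ·ε, for the toy linear-Gaussian model above where ∇_w log p(D, w) is available in closed form (again a sketch with assumed constants). In practice this estimator typically has much lower variance than the score-function one.

```python
# Reparameterization gradient estimate with w = T(ε; θ) = μ + σ·ε.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 20)
y = 1.5 * x + 0.1 * rng.normal(size=20)

def grad_log_joint(w):
    # ∇_w [log p(D | w) + log p(w)] for the linear-Gaussian toy model
    return np.sum((y - w * x) * x) / 0.1**2 - w / 2.0**2

mu, sigma, K = 1.0, 0.3, 5000
eps = rng.normal(size=K)                          # ε ~ q(ε) = N(0, 1)
w = mu + sigma * eps                              # w = T(ε; θ)
g = np.array([grad_log_joint(wk) for wk in w])
print(np.mean(g))                                 # ≈ ∇_μ F, since ∂T/∂μ = 1
print(np.mean(g * eps))                           # ≈ ∇_σ F, since ∂T/∂σ = ε
```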


Page 27:

Black-box VI

Page 28:

Black-box VI (github.com/blei-lab/edward)

Page 29:

Priors on weights (what we saw before)

[Figure: a deep network X → ϕ ϕ ϕ → H → … → H → Y, with priors on the weights between layers.]

Pages 30–32:

From NN to GP

[Figure: the same deep network with each parametric layer replaced by a GP mapping G into hidden layers H.]

- NN: $H_2 = W_2\, \phi(H_1)$

- GP: φ is ∞-dimensional, so: $H_2 = f_2(H_1; \theta_2) + \epsilon$

- NN: p(W)

- GP: p(f(·))

- The VAE can be seen as a special case of this


Page 33:

From NN to GP

[Figure: the same deep-GP graphical model.]

- The real world is perfectly described by unobserved latent variables: H

- But we only observe noisy, high-dimensional data: Y

- We try to interpret the world and infer the latents: $\hat{H} \approx H$

- Inference:

  $$p(H \mid Y) = \frac{p(Y \mid H)\, p(H)}{p(Y)}$$

Page 34:

Deep Gaussian processes

- Uncertainty about parameters: check. Uncertainty about structure?

- A deep GP simultaneously brings in:
  - a prior on the "weights"
  - a kernelized input/latent space
  - stochasticity in the warping

Page 35:

Introducing Gaussian Processes:

- A Gaussian distribution depends on a mean and a covariance matrix.

- A Gaussian process depends on a mean and a covariance function.
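A minimal sketch of this distinction (the kernel choice and grid are my assumptions): evaluating a covariance function on a grid of inputs produces a covariance matrix, and samples from the corresponding multivariate Gaussian are draws from the GP prior.

```python
# Covariance function -> covariance matrix -> GP prior samples.
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=0.3):
    # k(x, x') = σ² exp(-(x - x')² / (2ℓ²))
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2
                             / lengthscale**2)

x = np.linspace(0, 1, 200)
K = rbf(x, x) + 1e-8 * np.eye(len(x))     # jitter for numerical stability
rng = np.random.default_rng(5)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
# Each row of `samples` is one function drawn from the GP prior.
```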

Pages 36–55:

[Figure-only slides; no extractable text.]
Pages 56–63:

Samples from a 1-D Gaussian

[Figure-only animation frames.]

Pages 64–65:

How do GPs solve the overfitting problem (i.e. regularize)?

- Answer: integrate over the function itself!

- So, we will average out all possible function forms under a (GP) prior!

Recap:

$$\text{MAP:}\quad \arg\max_{w}\; p(y \mid w, \phi(x))\, p(w), \qquad \text{e.g. } y = \phi(x)^\top w + \epsilon$$

$$\text{Bayesian:}\quad \arg\max_{\theta}\; \int_f p(y \mid f)\, \underbrace{p(f \mid x, \theta)}_{\text{GP prior}}, \qquad \text{e.g. } y = f(x, \theta) + \epsilon$$

- θ are hyperparameters

- The Bayesian approach (GP) automatically balances the data fit against the complexity penalty.
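This balance is visible directly in the GP log marginal likelihood, which splits into a data-fit term and a complexity penalty. A small sketch with my own toy data, reusing the rbf kernel from the sketch above:

```python
# log p(y|x, θ) = data fit + complexity penalty + constant.
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=0.3):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2
                             / lengthscale**2)

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 30)
y = np.sin(6 * x) + 0.1 * rng.normal(size=30)

def log_marginal(lengthscale, noise_std=0.1):
    K = rbf(x, x, lengthscale=lengthscale) + noise_std**2 * np.eye(len(x))
    fit = -0.5 * y @ np.linalg.solve(K, y)        # data fit
    complexity = -0.5 * np.linalg.slogdet(K)[1]   # complexity penalty
    return fit + complexity - 0.5 * len(x) * np.log(2 * np.pi)

for ls in (0.01, 0.3, 3.0):
    print(ls, log_marginal(ls))   # typically the intermediate lengthscale wins
```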


Page 66:

Deep Gaussian processes

[Figure: a chain x → f1 → h1 → f2 → h2 → f3 → y.]

- Define a recursive stacked construction:

  $$f(x) \to \text{GP}$$

  $$f_L\big(f_{L-1}(f_{L-2}(\cdots f_1(x)))\big) \to \text{deep GP}$$

  Compare to:

  $$\phi(x)^\top w \to \text{NN}$$

  $$\phi\big(\phi(\phi(x)^\top w_1)^\top \cdots w_{L-1}\big)^\top w_L \to \text{DNN}$$
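The construction can be sampled directly: draw a function from one GP layer and feed it through the next. A rough sketch, assuming zero-mean layers and the rbf kernel from earlier:

```python
# A deep GP prior sample: compose draws from stacked GP layers.
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=0.3):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2
                             / lengthscale**2)

def sample_gp_layer(inputs, rng):
    # One draw of f(inputs) from a zero-mean GP
    K = rbf(inputs, inputs) + 1e-8 * np.eye(len(inputs))
    return rng.multivariate_normal(np.zeros(len(inputs)), K)

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 200)
h1 = sample_gp_layer(x, rng)          # f1(x)
h2 = sample_gp_layer(h1, rng)         # f2(f1(x))
y = sample_gp_layer(h2, rng)          # f3(f2(f1(x))): a deep GP prior draw
```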

Page 67:

Two-layered DGP

[Figure: X → f1 → H → f2 → Y, where X is the input, Y the output, and H unobserved.]

Page 68:

Inference in DGPs

- It's tricky. An additional difficulty comes from the fact that the kernel couples all points, so p(y|x) is no longer factorized.

- Indeed, $p(y \mid x, w) = \int_f p(y \mid f)\, p(f \mid x, w)$, but w appears inside the nonlinear kernel function: $p(f \mid x, w) = \mathcal{N}\big(f \mid 0, k(X, X \mid w)\big)$

Page 69:

The resulting objective function

By integrating over all unknowns:

- We obtain approximate posteriors q(h), q(f) over them

- We end up with a well-regularized variational lower bound on the true marginal likelihood:

  $$\mathcal{F} = \text{Data fit} - \underbrace{\mathrm{KL}\big(q(h_1)\,\|\,p(h_1)\big)}_{\text{Regularization}} + \sum_{l=2}^{L} \underbrace{H\big(q(h_l)\big)}_{\text{Regularization}}$$

- Regularization is achieved through uncertainty propagation.

Page 70:

An example where uncertainty propagation matters: recurrent learning.

Page 71:

Dynamics/memory: Deep Recurrent Gaussian Process

Page 72:

Avatar control

Figure: The generated motion with a step-function control signal, starting with walking (blue), switching to running (red), and switching back to walking (blue).

Videos:

- https://youtu.be/FR-oeGxV6yY (switching between learned speeds)
- https://youtu.be/AT0HMtoPgjc (interpolating (un)seen speed)
- https://youtu.be/FuF-uZ83VMw (constant unseen speed)

Pages 73–74:

[Figure panels: RGP, GPNARX, MLP-NARX, LSTM.]

Pages 75–76:

Stochastic warping

[Figure: the deep network where each hidden layer H carries N(0, I) noise and a mapping G.]

Inference:

- Need to infer posteriors on H

- Define $q(H) = \mathcal{N}(\mu, \Sigma)$

- Amortize: $\{\mu, \Sigma\} = \mathrm{MLP}(Y; \theta)$


Page 77:

Amortized inference

Solution: reparameterization through a recognition model g:

$$\mu_1^{(n)} = g_1\big(y^{(n)}\big), \qquad \mu_l^{(n)} = g_l\big(\mu_{l-1}^{(n)}\big), \qquad g_l = \mathrm{MLP}(\theta_l)$$

$$g_l \text{ deterministic} \;\Rightarrow\; \mu_l^{(n)} = g_l\big(\cdots g_1(y^{(n)})\big)$$

Dai, Damianou, Gonzalez, Lawrence, ICLR 2016
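A shape-level sketch of such a recognition model (the layer sizes and tanh MLP are hypothetical, not the paper's architecture): each observation y is mapped deterministically to the means that parameterize q(H) at every layer.

```python
# Amortized inference: a small recognition MLP maps y to the means of q(h).
import numpy as np

rng = np.random.default_rng(8)

def mlp_layer(inputs, W, b):
    return np.tanh(inputs @ W + b)

# Hypothetical sizes: observations in R^10, two latent layers.
W1, b1 = 0.1 * rng.normal(size=(10, 16)), np.zeros(16)
W2, b2 = 0.1 * rng.normal(size=(16, 2)), np.zeros(2)

Y = rng.normal(size=(100, 10))        # a batch of observations y^(n)
mu1 = mlp_layer(Y, W1, b1)            # μ_1^(n) = g_1(y^(n))
mu2 = mlp_layer(mu1, W2, b2)          # μ_2^(n) = g_2(μ_1^(n))
# mu2[n] parameterizes q(h) for data point n; the MLP weights θ are
# trained jointly with the variational bound from the earlier slides.
```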

Page 78:

Bayesian approach recap

- Three ways of introducing uncertainty / noise in a NN:
  - treat the weights w as distributions (BNNs)
  - stochasticity in the warping function φ (VAE)
  - Bayesian non-parametrics applied to DNNs (deep GPs)

Page 79:

Connection between Bayesian and “traditional” approaches

There is another view of Bayesian NNs: they can be used as a means of explaining / motivating DNN techniques (e.g. dropout) in a more mathematically grounded way.

- Dropout as variational inference (Kingma et al., 2015; Gal and Ghahramani, 2016)

- Low-rank NN weight approximations and deep Gaussian processes (Hensman and Lawrence, 2014; Damianou, 2015; Louizos and Welling, 2016; Cutajar et al., 2016)

- ...

Page 80:

Summary

- Motivation for probabilistic and Bayesian reasoning

- Three ways of incorporating uncertainty in DNNs

- Inference is more challenging when uncertainty has to be propagated

- Connection between Bayesian and "traditional" NN approaches