
Probabilistic & Bayesian deep learning

Andreas Damianou

Amazon Research Cambridge, UK

Talk at University of Sheffield, 19 March 2019

In this talk

Not in this talk: CRFs, Boltzmann machines, ...


A standard neural network

[Figure: a fully connected network with inputs x₁ … x_q, basis functions ϕ₁ … ϕ_q, weights w₁ … w_q, and output Y]

- Define: φ_j = φ(x_j) and f_j = w_j φ_j (ignore bias for now)

- Once we've defined all w's with back-prop, then f (and the whole network) becomes deterministic (see the sketch below).

- What does that imply?
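A minimal sketch of the deterministic forward pass just described (the basis function and the weight values are hypothetical placeholders, not from the talk):

```python
import numpy as np

def phi(x):
    # Hypothetical basis function; the slide leaves phi generic (e.g. tanh, ReLU).
    return np.tanh(x)

def forward(x, w):
    # f_j = w_j * phi(x_j), summed into a single output y (bias ignored, as above).
    return np.sum(w * phi(x))

rng = np.random.default_rng(0)
x = rng.normal(size=4)   # inputs x_1 .. x_q with q = 4
w = rng.normal(size=4)   # point-estimate weights, e.g. found by back-prop
print(forward(x, w))     # the same output every time: the trained network is deterministic
```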

Trained neural network is deterministic. Implications?

- Generalization: overfitting occurs, creating the need for ad-hoc regularizers: dropout, early stopping, ...

- Data generation: a model which generalizes well should also understand, or even be able to create ("imagine"), variations of the data.

- No predictive uncertainty: uncertainty needs to be propagated across the model to be reliable.

Need for uncertainty

- Reinforcement learning
- Critical predictive systems
- Active learning
- Semi-automatic systems
- Scarce data scenarios
- ...

BNN with priors on its weights

[Figure: the same network with a prior distribution placed on each weight w₁ … w_q (left), and with approximate posteriors q(w₁) … q(w_q) after inference (right)]

BNN with priors on its weights

[Figure: drawing a concrete value for each weight (e.g. 0.23, 0.90, 0.12, 0.54) from its distribution yields one realization of the network; repeating the draw gives an ensemble of networks]
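To make this concrete, here is a minimal sketch of how sampling weights turns a single deterministic prediction into a predictive distribution. The factorised Gaussian q(w) and all values are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed factorised Gaussian approximate posterior q(w) over 4 weights
q_mu = rng.normal(size=4)
q_sigma = 0.1 + np.abs(rng.normal(size=4))

def phi(x):
    return np.tanh(x)

def predict(x, w):
    return np.sum(w * phi(x))

x_star = rng.normal(size=4)
# Each draw w^(s) ~ q(w) gives a different network, hence a different prediction.
samples = [predict(x_star, q_mu + q_sigma * rng.normal(size=4)) for _ in range(1000)]
print(np.mean(samples), np.std(samples))   # predictive mean and predictive uncertainty
```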

Probabilistic re-formulation

- DNN: y = g(W, x) = w₁ϕ(w₂ϕ(· · · x))

- Training minimizes a loss:

$$\arg\min_{W} \; \underbrace{\frac{1}{2}\sum_{i=1}^{N}\big(g(W, x_i) - y_i\big)^2}_{\text{fit}} \;+\; \underbrace{\lambda \sum_{i} \|w_i\|}_{\text{regularizer}}$$

- Equivalent probabilistic view for regression: maximize the posterior probability

$$\arg\max_{W} \; \underbrace{\log p(y \mid x, W)}_{\text{fit}} \;+\; \underbrace{\log p(W)}_{\text{regularizer}}$$

where p(y|x, W) is Gaussian and p(W) is Laplace (the Gaussian likelihood gives the squared-error fit term, the Laplace prior the norm regularizer).


Integrating out weights

- Define: D = (x, y)

- Remember Bayes' rule:

$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}, \qquad p(D) = \int p(D \mid w)\, p(w)\, dw$$

- For Bayesian inference, the weights need to also be integrated out. This gives us a properly defined posterior on the parameters (see the toy sketch below).
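As a toy illustration of integrating out a weight (a one-dimensional example added here, not from the slides), Bayes' rule can be evaluated on a grid: the evidence p(D) is the integral of likelihood times prior, and dividing by it normalizes the posterior:

```python
import numpy as np

# Toy 1-D model: y_i = w * x_i + Gaussian noise (std 0.3), prior w ~ N(0, 1).
rng = np.random.default_rng(2)
x = rng.normal(size=20)
y = 0.7 * x + 0.3 * rng.normal(size=20)

w_grid = np.linspace(-3, 3, 2001)
log_prior = -0.5 * w_grid**2
log_lik = np.array([-0.5 * np.sum((y - w * x) ** 2) / 0.3**2 for w in w_grid])

log_post_unnorm = log_lik + log_prior
unnorm = np.exp(log_post_unnorm - log_post_unnorm.max())   # stabilised, unnormalised posterior
dw = w_grid[1] - w_grid[0]
evidence = unnorm.sum() * dw                               # proportional to p(D)
posterior = unnorm / evidence                              # p(w|D) on the grid
print(w_grid[np.argmax(posterior)])                        # posterior mode, near 0.7
```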

(Bayesian) Occam’s Razor

"A plurality is not to be posited without necessity." (William of Ockham)

"Everything should be made as simple as possible, but not simpler." (A. Einstein)

The evidence is higher for the model that is not "unnecessarily complex" but still "explains" the data D.

Bayesian Inference

Remember: Separation of Model and Inference

Inference

- p(D) (and hence p(w|D)) is difficult to compute because of the nonlinear way in which w appears through g.

- Attempt at variational inference:

$$\underbrace{\mathrm{KL}\big(q(w;\theta)\,\|\,p(w \mid D)\big)}_{\text{minimize}} = \log p(D) - \underbrace{\mathcal{L}(\theta)}_{\text{maximize}}, \qquad \mathcal{L}(\theta) = \underbrace{\mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\big]}_{\mathcal{F}} + H\big[q(w;\theta)\big]$$

- The expectation term F is still problematic. Solution: Monte Carlo (MC); see the sketch below.

- Such approaches can be formulated as black-box inference.
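A minimal sketch of estimating the intractable term F = E_q[log p(D, w)] by Monte Carlo, and hence the bound L(θ) = F + H[q]. The toy single-weight model and all names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 1.5 * x + 0.2 * rng.normal(size=30)

def log_joint(w):
    # log p(D, w) = log p(y|x, w) + log p(w) for a toy linear model with unit Gaussian prior
    log_lik = -0.5 * np.sum((y - w * x) ** 2) / 0.2**2
    log_prior = -0.5 * w**2
    return log_lik + log_prior

mu, sigma = 1.0, 0.5                                   # variational parameters theta of q(w) = N(mu, sigma^2)
w_samples = mu + sigma * rng.normal(size=500)          # MC samples from q
F = np.mean([log_joint(w) for w in w_samples])         # MC estimate of E_q[log p(D, w)]
H = 0.5 * np.log(2 * np.pi * np.e * sigma**2)          # entropy of the Gaussian q
print("ELBO estimate:", F + H)
```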


[Four illustrative figure slides omitted. Credit: Tushar Tank]

Inference: “Score function method”

$$\nabla_\theta \mathcal{F} = \nabla_\theta \mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\big] = \mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\, \nabla_\theta \log q(w;\theta)\big]$$

$$\approx \frac{1}{K} \sum_{k=1}^{K} \log p\big(D, w^{(k)}\big)\, \nabla_\theta \log q\big(w^{(k)};\theta\big), \qquad w^{(k)} \overset{iid}{\sim} q(w;\theta)$$

(Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014; Ruiz et al., 2016)
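A minimal sketch of the score-function estimator above, reusing the toy single-weight model from before (all names are my own): draw w^(k) from q and weight the score ∇_θ log q by log p(D, w^(k)). The estimator is unbiased but typically has high variance:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=30)
y = 1.5 * x + 0.2 * rng.normal(size=30)

def log_joint(w):
    return -0.5 * np.sum((y - w * x) ** 2) / 0.2**2 - 0.5 * w**2

mu, sigma = 0.0, 1.0                       # theta = (mu, sigma) of q(w) = N(mu, sigma^2)
K = 2000
w = mu + sigma * rng.normal(size=K)        # w^(k) iid from q(w; theta)
# Score of a Gaussian w.r.t. its mean: d/dmu log q(w) = (w - mu) / sigma^2
score_mu = (w - mu) / sigma**2
grad_mu = np.mean([log_joint(wk) * s for wk, s in zip(w, score_mu)])
print("score-function estimate of dF/dmu:", grad_mu)   # noisy, but unbiased
```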

Inference: “Reparameterization gradient”

- Reparametrize w as a transformation T of a simpler variable ε: w = T(ε; θ)

- q(ε) is now independent of θ

$$\nabla_\theta \mathcal{F} = \nabla_\theta \mathbb{E}_{q(w;\theta)}\big[\log p(D, w)\big] = \mathbb{E}_{q(\varepsilon)}\Big[\big.\nabla_w \log p(D, w)\big|_{w = T(\varepsilon;\theta)}\, \nabla_\theta T(\varepsilon;\theta)\Big]$$

- For example: w ∼ N(µ, σ) becomes, under T, w = µ + σ·ε with ε ∼ N(0, 1)

- MC by sampling from q(ε) (thus obtaining samples of w through T)

(Salimans and Knowles, 2013; Kingma and Welling, 2014; Ruiz et al., 2016)
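A sketch of the same toy gradient computed with the reparameterization trick, letting automatic differentiation flow through w = µ + σ·ε. The toy model and all names are assumptions, not part of the slides:

```python
import torch

torch.manual_seed(0)
x = torch.randn(30)
y = 1.5 * x + 0.2 * torch.randn(30)

mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)     # parameterize sigma > 0 via its log

eps = torch.randn(2000)                               # eps ~ N(0, 1), independent of theta
w = mu + torch.exp(log_sigma) * eps                   # w = T(eps; theta)
log_joint = (-0.5 * ((y[None, :] - w[:, None] * x[None, :]) ** 2).sum(dim=1) / 0.2**2
             - 0.5 * w**2)
F = log_joint.mean()                                  # MC estimate of E_q[log p(D, w)]
F.backward()                                          # gradients flow through T to (mu, log_sigma)
print(mu.grad, log_sigma.grad)
```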


Black-box VI (github.com/blei-lab/edward)
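Libraries such as Edward package the estimators above into a generic routine. As a rough sketch of what such a black-box VI loop does under the hood (a toy Bayesian linear model optimized in PyTorch; this is not Edward's API):

```python
import math
import torch

torch.manual_seed(1)
x = torch.randn(50)
y = 1.5 * x + 0.2 * torch.randn(50)

# Variational parameters theta of q(w) = N(mu, sigma^2) for a single-weight toy model
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    opt.zero_grad()
    eps = torch.randn(32)
    w = mu + log_sigma.exp() * eps                              # reparameterized samples from q
    log_lik = -0.5 * ((y[None, :] - w[:, None] * x[None, :]) ** 2).sum(dim=1) / 0.2**2
    log_prior = -0.5 * w**2
    entropy = 0.5 * math.log(2 * math.pi * math.e) + log_sigma  # H[q] for a Gaussian
    elbo = (log_lik + log_prior).mean() + entropy
    (-elbo).backward()                                          # ascend the ELBO
    opt.step()

print(mu.item(), log_sigma.exp().item())  # approximate posterior: mean near 1.5, small sigma
```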

Priors on weights (what we saw before)

[Figure: a multilayer architecture X → H → · · · → Y with priors on the weights of each layer of basis functions ϕ]

From NN to GP

[Figure: the same layered architecture X → H → · · · → Y with each parametric layer of basis functions ϕ replaced by a GP mapping]

- NN: H₂ = W₂ φ(H₁)

- GP: φ is ∞-dimensional, so H₂ = f₂(H₁; θ₂) + ε

- NN: p(W)

- GP: p(f(·))

- A VAE can be seen as a special case of this


From NN to GP

[Figure: the same layered architecture, read as a latent variable model in which latents H generate the observed data Y]

- The real world is perfectly described by unobserved latent variables: H

- But we only observe noisy, high-dimensional data: Y

- We try to interpret the world and infer the latents: Ĥ ≈ H

- Inference:

$$p(H \mid Y) = \frac{p(Y \mid H)\, p(H)}{p(Y)}$$

Deep Gaussian processes

- Uncertainty about parameters: check. Uncertainty about structure?

- A deep GP simultaneously brings in:
  - a prior on the "weights"
  - a kernelized input/latent space
  - stochasticity in the warping

Introducing Gaussian Processes:

- A Gaussian distribution depends on a mean and a covariance matrix.

- A Gaussian process depends on a mean and a covariance function.

Samples from a 1-D Gaussian

[Animated figure sequence of repeated samples omitted]
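In the same spirit as these sample plots, here is a sketch of drawing function samples from a GP prior: the covariance matrix comes from evaluating a covariance function on a grid of inputs (the RBF kernel and all settings are my own choices for illustration):

```python
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=0.5):
    # Covariance function k(x, x') = variance * exp(-0.5 * (x - x')^2 / lengthscale^2)
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(5)
x = np.linspace(-3, 3, 200)
K = rbf(x, x) + 1e-6 * np.eye(len(x))        # small jitter for numerical stability
L = np.linalg.cholesky(K)
samples = L @ rng.normal(size=(len(x), 3))   # three function draws from GP(0, k)
print(samples.shape)                          # (200, 3): each column is one sampled function
```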

How do GPs solve the overfitting problem (i.e. regularize)?

- Answer: integrate over the function itself!

- So, we will average out all possible function forms, under a (GP) prior!

Recap:

$$\text{MAP:}\quad \arg\max_{w}\; p\big(y \mid w, \phi(x)\big)\, p(w), \qquad \text{e.g. } y = \phi(x)^\top w + \varepsilon$$

$$\text{Bayesian:}\quad \arg\max_{\theta}\; \int_{f} p(y \mid f)\, \underbrace{p(f \mid x, \theta)}_{\text{GP prior}}, \qquad \text{e.g. } y = f(x; \theta) + \varepsilon$$

- θ are hyperparameters

- The Bayesian approach (GP) automatically balances the data fit with the complexity penalty.
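A sketch of the quantity maximized over θ in the Bayesian case: the GP log marginal likelihood log N(y | 0, K_θ + σ²I), whose data-fit and log-determinant (complexity) terms give the automatic trade-off just mentioned. The kernel, data, and lengthscale values are toy choices of mine:

```python
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=0.5):
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(6)
x = np.linspace(-3, 3, 40)
y = np.sin(x) + 0.1 * rng.normal(size=40)

def log_marginal(lengthscale, noise=0.1):
    K = rbf(x, x, lengthscale=lengthscale) + noise**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y via triangular solves
    fit = -0.5 * y @ alpha                                 # data-fit term
    complexity = -np.sum(np.log(np.diag(L)))               # -0.5 log|K|, the Occam penalty
    return fit + complexity - 0.5 * len(x) * np.log(2 * np.pi)

for ls in (0.05, 0.5, 5.0):
    print(ls, log_marginal(ls))   # too wiggly / well balanced / too rigid
```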


Deep Gaussian processes

[Figure: a chain x → f₁ → h₁ → f₂ → h₂ → f₃ → y of GP-warped layers]

- Define a recursive stacked construction:

$$f(x) \to \text{GP}, \qquad f_L\big(f_{L-1}(f_{L-2} \cdots f_1(x))\big) \to \text{deep GP}$$

Compare to:

$$\phi(x)^\top w \to \text{NN}, \qquad \phi\big(\phi(\phi(x)^\top w_1)^\top \cdots w_{L-1}\big)^\top w_L \to \text{DNN}$$
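Following the recursive construction, a toy sketch of drawing a sample from a two-layer deep GP prior by feeding one GP sample through another (the RBF kernel and grid are my own choices, as in the earlier sketches):

```python
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_sample(inputs, rng):
    # One draw from GP(0, k) evaluated at the given inputs
    K = rbf(inputs, inputs) + 1e-6 * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ rng.normal(size=len(inputs))

rng = np.random.default_rng(7)
x = np.linspace(-3, 3, 200)
h1 = gp_sample(x, rng)        # f_1(x) ~ GP
y = gp_sample(h1, rng)        # f_2(f_1(x)): the warped inputs enter the second kernel
print(y.shape)
```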

Two-layered DGP

[Figure: a two-layered deep GP with observed input X and output Y, an unobserved intermediate layer H, and GP mappings f₁ : X → H and f₂ : H → Y]

Inference in DGPs

- It's tricky. An additional difficulty comes from the fact that the kernel couples all points, so p(y|x) is no longer factorized.

- Indeed, p(y|x, w) = ∫_f p(y|f) p(f|x, w), but w appears inside the non-linear kernel function: p(f|x, w) = N(f | 0, k(X, X | w))

The resulting objective function

By integrating over all unknowns:

- We obtain approximate posteriors q(h), q(f) over them

- We end up with a well regularized variational lower bound on the true marginal likelihood:

$$\mathcal{F} = \text{Data Fit} - \underbrace{\mathrm{KL}\big(q(h_1)\,\|\,p(h_1)\big)}_{\text{Regularisation}} + \sum_{l=2}^{L} \underbrace{H\big(q(h_l)\big)}_{\text{Regularisation}}$$

- Regularization is achieved through uncertainty propagation.
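Both regularisation pieces are available in closed form when the q's are Gaussian. A small sketch, assuming a standard normal prior p(h₁) and diagonal Gaussian q's (all dimensions and values invented):

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    # KL( N(mu, diag(var)) || N(0, I) ): the first regularisation term
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

def gaussian_entropy(var):
    # H[ N(mu, diag(var)) ]: the per-layer entropy terms for l = 2..L
    return 0.5 * np.sum(np.log(2 * np.pi * np.e * var))

rng = np.random.default_rng(8)
mu1, var1 = rng.normal(size=5), np.exp(rng.normal(size=5))     # q(h_1)
vars_upper = [np.exp(rng.normal(size=5)) for _ in range(2)]    # q(h_2), q(h_3)

regulariser = -kl_to_standard_normal(mu1, var1) + sum(gaussian_entropy(v) for v in vars_upper)
print(regulariser)   # added to the data-fit term to form the bound F
```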

An example where uncertainty propagation matters: Recurrent learning.

Dynamics/memory: Deep Recurrent Gaussian Process

Avatar control

Figure: The generated motion with a step function signal, starting with walking (blue), switching to running (red), and switching back to walking (blue).

Videos:
- https://youtu.be/FR-oeGxV6yY (switching between learned speeds)
- https://youtu.be/AT0HMtoPgjc (interpolating (un)seen speed)
- https://youtu.be/FuF-uZ83VMw (constant unseen speed)

[Figure panels comparing RGP, GP-NARX, MLP-NARX, and LSTM]

Stochastic warping

[Figure: the deep architecture X → H → · · · → Y with N(0, I) noise injected at each hidden layer H]

Inference:

- Need to infer posteriors on H

- Define q(H) = N(µ, Σ)

- Amortize: {µ, Σ} = MLP(Y; θ)


Amortized inference

Solution: Reparameterization through recognition model g:

$$\mu_1^{(n)} = g_1\big(y^{(n)}\big), \qquad \mu_l^{(n)} = g_l\big(\mu_{l-1}^{(n)}\big), \qquad g_l = \mathrm{MLP}(\theta_l)$$

$$g_l \text{ deterministic} \;\Rightarrow\; \mu_l^{(n)} = g_l\big(\ldots g_1\big(y^{(n)}\big)\big)$$

Dai, Damianou, Gonzalez, Lawrence, ICLR 2016
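A sketch of such a recognition model: a generic PyTorch MLP mapping each observation y^(n) to the mean of q(h^(n)) layer by layer. Layer sizes and names are placeholders, not the architecture of Dai et al.:

```python
import torch
import torch.nn as nn

class Recognition(nn.Module):
    # g_l = MLP(theta_l): deterministic map from the layer below to the mean mu_l^(n)
    def __init__(self, y_dim=10, h_dims=(5, 2)):
        super().__init__()
        dims = (y_dim,) + tuple(h_dims)
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(h_dims))]
        )

    def forward(self, y):
        mu = y
        for layer in self.layers:
            mu = torch.tanh(layer(mu))     # mu_l = g_l(mu_{l-1}), amortized over all n
        return mu

Y = torch.randn(100, 10)                   # N = 100 observations
mu_top = Recognition()(Y)                  # means of q(H) at the top layer
print(mu_top.shape)                        # (100, 2)
```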

Bayesian approach recap

- Three ways of introducing uncertainty / noise in a NN:
  - Treat weights w as distributions (BNNs)
  - Stochasticity in the warping function φ (VAE)
  - Bayesian non-parametrics applied to DNNs (DGP)

Connection between Bayesian and “traditional” approaches

There's another view of Bayesian NNs: they can be used as a means of explaining and motivating DNN techniques (e.g. dropout) in a more mathematically grounded way.

- Dropout as variational inference (Kingma et al., 2015; Gal and Ghahramani, 2016); see the sketch below.

- Low-rank NN weight approximations and deep Gaussian processes (Hensman and Lawrence, 2014; Damianou, 2015; Louizos and Welling, 2016; Cutajar et al., 2016)

- ...
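As a concrete instance of that connection, here is a sketch of MC dropout in the spirit of Gal and Ghahramani (2016): keep dropout stochastic at prediction time and average several forward passes to obtain a predictive mean and uncertainty. The network, sizes, and dropout rate are placeholders (and the network is untrained), shown only to illustrate the mechanism:

```python
import torch
import torch.nn as nn

torch.manual_seed(2)
net = nn.Sequential(nn.Linear(1, 50), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(50, 1))

x_star = torch.randn(1, 1)
net.train()                                   # keep dropout active at prediction time
with torch.no_grad():
    preds = torch.stack([net(x_star) for _ in range(100)])   # 100 stochastic forward passes
print(preds.mean().item(), preds.std().item())                # predictive mean and uncertainty
```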

Summary

- Motivation for probabilistic and Bayesian reasoning

- Three ways of incorporating uncertainty in DNNs

- Inference is more challenging when uncertainty has to be propagated

- Connection between Bayesian and "traditional" NN approaches