# Probabilistic & Bayesian deep learning

Posted: 10-Feb-2020


Andreas Damianou

Amazon Research Cambridge, UK

Talk at University of Sheffield, 19 March 2019

In this talk

Not in this talk: CRFs, Boltzmann machines, ...


A standard neural network

[Diagram: feed-forward network mapping inputs x1, ..., xq through basis functions ϕ1, ..., ϕq, weighted by w1, ..., wq, to output Y]

- Define: $\varphi_j = \varphi(x_j)$ and $f_j = w_j \varphi_j$ (ignore bias for now)
- Once we have fitted all the $w$'s with back-prop, $f$ (and the whole network) becomes deterministic.
- What does that imply?

A trained neural network is deterministic. Implications?

- Generalization: overfitting occurs, creating the need for ad-hoc regularizers: dropout, early stopping, ...
- Data generation: a model that generalizes well should also understand, or even be able to create ("imagine"), variations of the data.
- No predictive uncertainty: uncertainty needs to be propagated across the model to be reliable.

Need for uncertainty

- Reinforcement learning
- Critical predictive systems
- Active learning
- Semi-automatic systems
- Scarce data scenarios
- ...

BNN with priors on its weights

[Diagram: the same network as before, with each point weight $w_j$ replaced by a distribution $q(w_j)$ over that weight]

BNN with priors on its weights

[Diagram: a trained standard network carries fixed weight values (e.g. 0.23, 0.90, 0.12, 0.54), whereas the Bayesian network carries a distribution over each weight]

Probabilistic re-formulation

- DNN: $y = g(\mathbf{W}, \mathbf{x}) = \mathbf{w}_1 \varphi(\mathbf{w}_2 \varphi(\dots \mathbf{x}))$
- Training minimizes a loss:

$$\underset{\mathbf{W}}{\operatorname{argmin}} \; \underbrace{\frac{1}{2} \sum_{i=1}^{N} \big(g(\mathbf{W}, \mathbf{x}_i) - y_i\big)^2}_{\text{fit}} + \underbrace{\lambda \sum_i \|\mathbf{w}_i\|}_{\text{regularizer}}$$

- Equivalent probabilistic view for regression: maximize the (log) posterior probability:

$$\underset{\mathbf{W}}{\operatorname{argmax}} \; \underbrace{\log p(\mathbf{y}\,|\,\mathbf{x}, \mathbf{W})}_{\text{fit}} + \underbrace{\log p(\mathbf{W})}_{\text{regularizer}}$$

where $p(\mathbf{y}\,|\,\mathbf{x}, \mathbf{W})$ is Gaussian and $p(\mathbf{W})$ is Laplace.
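To make the equivalence concrete, here is a minimal numerical sketch. A 1-D linear model stands in for $g$, and all names and values are illustrative; it checks that minimizing the regularized squared loss selects the same weight as maximizing the log posterior under a Gaussian likelihood and a Laplace (L1) prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "network" g(w, x) = w * x (illustrative stand-in for g(W, x)).
x = rng.normal(size=20)
y = 2.0 * x + 0.1 * rng.normal(size=20)
lam = 0.5      # regularization strength lambda
sigma2 = 1.0   # assumed Gaussian noise variance

def loss(w):
    # fit term + L1 regularizer (a Laplace prior corresponds to L1)
    return 0.5 * np.sum((w * x - y) ** 2) + lam * np.abs(w)

def neg_log_posterior(w):
    # -log p(y|x,w) - log p(w), dropping constants independent of w:
    # the Gaussian likelihood gives the squared-error term,
    # the Laplace prior gives the L1 term.
    return 0.5 * np.sum((w * x - y) ** 2) / sigma2 + lam * np.abs(w)

# grid search over a single scalar weight, for illustration only
ws = np.linspace(-1.0, 4.0, 2001)
w_loss = ws[np.argmin([loss(w) for w in ws])]
w_map = ws[np.argmin([neg_log_posterior(w) for w in ws])]
# both objectives select the same weight
```

With unit noise variance the two objectives are identical up to additive constants, so their minimizers coincide exactly.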


Integrating out weights

- Define: $\mathcal{D} = (\mathbf{x}, \mathbf{y})$
- Remember Bayes' rule:

$$p(\mathbf{w}\,|\,\mathcal{D}) = \frac{p(\mathcal{D}\,|\,\mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \int p(\mathcal{D}\,|\,\mathbf{w})\, p(\mathbf{w})\, \mathrm{d}\mathbf{w}$$

- For Bayesian inference, the weights also need to be integrated out. This gives us a properly defined posterior over the parameters.

(Bayesian) Occam's Razor

"A plurality is not to be posited without necessity." (William of Ockham)

"Everything should be made as simple as possible, but not simpler." (A. Einstein)

The evidence is higher for the model that is not "unnecessarily complex" but still "explains" the data $\mathcal{D}$.

Bayesian Inference

Remember: Separation of Model and Inference

Inference

- $p(\mathcal{D})$ (and hence $p(\mathbf{w}\,|\,\mathcal{D})$) is difficult to compute because of the nonlinear way in which $\mathbf{w}$ appears through $g$.
- Attempt at variational inference:

$$\underbrace{\mathrm{KL}\big(q(\mathbf{w}; \theta) \,\|\, p(\mathbf{w}\,|\,\mathcal{D})\big)}_{\text{minimize}} = \log p(\mathcal{D}) - \underbrace{\mathcal{L}(\theta)}_{\text{maximize}}$$

where

$$\mathcal{L}(\theta) = \underbrace{\mathbb{E}_{q(\mathbf{w};\theta)}[\log p(\mathcal{D}, \mathbf{w})]}_{\mathcal{F}} + \mathrm{H}[q(\mathbf{w}; \theta)]$$

- The expectation term $\mathcal{F}$ is still problematic. Solution: Monte Carlo (MC).
- Such approaches can be formulated as black-box inference.
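The identity above can be verified exactly on a tiny discrete model, where $\mathbf{w}$ takes only two values so every distribution can be enumerated (all numbers below are arbitrary):

```python
import numpy as np

# w takes two values; p_joint[i] = p(D, w=i). Values are arbitrary.
p_joint = np.array([0.2, 0.1])
p_D = p_joint.sum()            # evidence p(D)
p_post = p_joint / p_D         # posterior p(w|D)

q = np.array([0.7, 0.3])       # some variational distribution q(w; theta)

kl = np.sum(q * np.log(q / p_post))   # KL(q || p(w|D))
F = np.sum(q * np.log(p_joint))       # E_q[log p(D, w)]
H = -np.sum(q * np.log(q))            # entropy H[q]
elbo = F + H                          # L(theta)

# the identity: KL(q || posterior) = log p(D) - L(theta)
gap = np.log(p_D) - elbo
```

Since the KL term is non-negative, $\mathcal{L}(\theta)$ lower-bounds the log evidence, which is why maximizing it over $\theta$ shrinks the KL to the posterior.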


Credit: Tushar Tank


Inference: "Score function method"

$$\nabla_\theta \mathcal{F} = \nabla_\theta \mathbb{E}_{q(\mathbf{w};\theta)}[\log p(\mathcal{D}, \mathbf{w})] = \mathbb{E}_{q(\mathbf{w};\theta)}\big[\log p(\mathcal{D}, \mathbf{w})\, \nabla_\theta \log q(\mathbf{w}; \theta)\big]$$

$$\approx \frac{1}{K} \sum_{k=1}^{K} \log p(\mathcal{D}, \mathbf{w}^{(k)})\, \nabla_\theta \log q(\mathbf{w}^{(k)}; \theta), \qquad \mathbf{w}^{(k)} \overset{\text{iid}}{\sim} q(\mathbf{w}; \theta)$$

(Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014; Ruiz et al., 2016)
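A minimal sketch of this estimator, with $q(w;\theta)=\mathcal{N}(\theta,1)$ and a simple quadratic standing in for $\log p(\mathcal{D},w)$; both choices are illustrative, picked so that the exact gradient is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.5
K = 200_000

def log_p(w):
    # illustrative stand-in for log p(D, w); with q = N(theta, 1),
    # E_q[log_p] = theta^2 + 1, so the true gradient w.r.t. theta is 2*theta
    return w ** 2

w = rng.normal(loc=theta, scale=1.0, size=K)   # w^(k) ~ q(w; theta)
score = w - theta                              # grad_theta log N(w; theta, 1)
grad_est = np.mean(log_p(w) * score)           # score-function estimator
true_grad = 2 * theta
```

The estimator is unbiased but noisy: for this toy problem its per-sample variance is large, which is the practical weakness of the score-function approach.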

Inference: "Reparameterization gradient"

- Reparameterize $\mathbf{w}$ as a transformation $T$ of a simpler variable $\epsilon$: $\mathbf{w} = T(\epsilon; \theta)$
- $q(\epsilon)$ is now independent of $\theta$:

$$\nabla_\theta \mathcal{F} = \nabla_\theta \mathbb{E}_{q(\mathbf{w};\theta)}[\log p(\mathcal{D}, \mathbf{w})] = \mathbb{E}_{q(\epsilon)}\Big[\nabla_{\mathbf{w}} \log p(\mathcal{D}, \mathbf{w})\big|_{\mathbf{w}=T(\epsilon;\theta)}\, \nabla_\theta T(\epsilon; \theta)\Big]$$

- For example: $\mathbf{w} \sim \mathcal{N}(\mu, \sigma) \;\overset{T}{\rightarrow}\; \mathbf{w} = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$
- MC by sampling from $q(\epsilon)$ (thus obtaining samples of $\mathbf{w}$ through $T$)

(Salimans and Knowles, 2013; Kingma and Welling, 2014; Ruiz et al., 2016)
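The same toy gradient as in the score-function sketch, now estimated via reparameterization: write $w = \theta + \epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$ and differentiate through the transformation (the quadratic stand-in for $\log p(\mathcal{D},w)$ is again illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.5
K = 200_000

def grad_log_p(w):
    # d/dw of the illustrative stand-in log p(D, w) = w^2
    return 2.0 * w

eps = rng.normal(size=K)    # eps ~ q(eps) = N(0, 1), independent of theta
w = theta + eps             # w = T(eps; theta); here dT/dtheta = 1
grad_est = np.mean(grad_log_p(w) * 1.0)   # reparameterization estimator
true_grad = 2 * theta
```

For this toy problem the estimator's variance is substantially lower than the score-function version's, which is the usual motivation for preferring reparameterization whenever a suitable $T$ exists.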


Black-box VI (github.com/blei-lab/edward)

Priors on weights (what we saw before)

[Diagram: a deep network X → H → ... → H → Y, with a layer of basis functions ϕ at each level and priors on the weights]

From NN to GP

[Diagram: the same deep architecture, with each finite layer of basis functions ϕ replaced by a GP layer G]

- NN: $H_2 = W_2\, \varphi(H_1)$
- GP: $\varphi$ is $\infty$-dimensional, so $H_2 = f_2(H_1; \theta_2) + \epsilon$
- NN: $p(\mathbf{W})$
- GP: $p(f(\cdot))$
- The VAE can be seen as a special case of this
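To illustrate what a prior $p(f(\cdot))$ over functions looks like, here is a minimal sketch that draws sample functions from a zero-mean GP prior with an RBF kernel (the kernel choice and hyperparameter values are illustrative):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # squared-exponential covariance k(x, x') = var * exp(-0.5 (x-x')^2 / l^2)
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability

# each draw is one random function evaluated at the 50 input locations;
# a finite NN layer would instead draw a finite weight vector from p(W)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```

Whereas a weight prior induces a distribution over functions only indirectly through the architecture, the GP specifies that distribution directly via the kernel.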

