
  • Probabilistic & Bayesian deep learning

    Andreas Damianou

    Amazon Research Cambridge, UK

    Talk at University of Sheffield, 19 March 2019

  • In this talk

    Not in this talk: CRFs, Boltzmann machines, ...

  • A standard neural network

    [Figure: a standard feed-forward network; each input x_1, ..., x_q passes through a basis function φ_1, ..., φ_q and is combined with a weight w_1, ..., w_q to produce the output Y.]

    - Define: φ_j = φ(x_j) and f_j = w_j φ_j (ignore the bias for now); a minimal forward-pass sketch follows below.

    - Once we have fixed all the w's with back-prop, f (and the whole network) becomes deterministic.

    - What does that imply?
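    As a point of reference for the rest of the talk, here is a minimal forward-pass sketch of the deterministic network described above (my own illustration, not from the slides), using tanh as a hypothetical basis function and arbitrary fixed weights:

      import numpy as np

      def forward(x, w):
          # phi_j = phi(x_j), here with phi = tanh (a hypothetical choice);
          # f_j = w_j * phi_j, and the output is y = sum_j f_j (bias ignored).
          phi = np.tanh(x)
          return np.sum(w * phi)

      rng = np.random.default_rng(0)
      x = rng.normal(size=4)   # inputs x_1, ..., x_q (q = 4 here)
      w = rng.normal(size=4)   # weights w_1, ..., w_q, as if fixed by back-prop
      print(forward(x, w))     # the same x always gives exactly this y: no uncertainty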

  • Trained neural network is deterministic. Implications?

    - Generalization: overfitting occurs, hence the need to invent ad-hoc regularizers: dropout, early stopping, ...

    - Data generation: a model which generalizes well should also understand, or even be able to create ("imagine"), variations of the data.

    - No predictive uncertainty: uncertainty needs to be propagated across the model for predictions to be reliable.

  • Need for uncertainty

    - Reinforcement learning

    - Critical predictive systems

    - Active learning

    - Semi-automatic systems

    - Scarce data scenarios

    - ...

  • BNN with priors on its weights

    [Figure: the same network twice. On the left, each weight w_1, ..., w_q is treated as a random variable; on the right, each weight is associated with a distribution q(w_1), ..., q(w_q).]

  • BNN with priors on its weights

    [Figure: drawing one sample from each weight distribution yields a concrete network (e.g. weight values 0.23, 0.90, 0.12, 0.54); a different draw yields a different network, and hence a different prediction.]
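    To make the figure concrete, here is a small sketch (my own, not from the slides) of what sampling the weights implies: each draw from q gives a different concrete network, and the spread of the resulting outputs is a crude measure of predictive uncertainty. The factorised Gaussian q and all numbers below are hypothetical.

      import numpy as np

      rng = np.random.default_rng(0)

      # Hypothetical per-weight distributions q(w_j) = N(mu_j, sigma_j^2).
      mu = rng.normal(size=4)
      sigma = 0.3 * np.ones(4)

      def forward(x, w):
          return np.sum(w * np.tanh(x))   # same small network as before

      x = rng.normal(size=4)
      samples = []
      for _ in range(100):
          w = mu + sigma * rng.normal(size=4)   # one draw: one concrete network
          samples.append(forward(x, w))

      print(np.mean(samples), np.std(samples))  # predictive mean and spread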

  • Probabilistic re-formulation

    - DNN: y = g(W, x) = w_1 φ(w_2 φ(... x))

    - Training by minimizing a loss:

      argmin_W  (1/2) Σ_{i=1}^N ( g(W, x_i) − y_i )²  [fit]   +   λ Σ_i ‖w_i‖  [regularizer]

    - Equivalent probabilistic view for regression: maximizing the posterior probability

      argmax_W  log p(y | x, W)  [fit]   +   log p(W)  [regularizer]

      where p(y | x, W) is Gaussian and p(W) is Laplace. (A small numerical check of this equivalence follows below.)
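    As that numerical check (my own sketch, not from the slides): for a Gaussian likelihood with unit noise variance and a Laplace prior with rate λ, the L1-regularized squared loss and the negative log posterior differ only by an additive constant that does not depend on the weights, so the two objectives have the same optimum. The tiny linear "network" and data below are hypothetical.

      import numpy as np

      rng = np.random.default_rng(1)
      X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
      lam = 0.5  # regularization weight = Laplace rate

      def g(w, X):
          return X @ w  # a linear "network", for simplicity

      def loss(w):
          # fit + regularizer
          return 0.5 * np.sum((g(w, X) - y) ** 2) + lam * np.sum(np.abs(w))

      def neg_log_posterior(w):
          # -[ log N(y | g(w, X), 1) + log Laplace(w | 0, 1/lam) ]
          log_lik = -0.5 * np.sum((g(w, X) - y) ** 2) - 0.5 * len(y) * np.log(2 * np.pi)
          log_prior = -lam * np.sum(np.abs(w)) + len(w) * np.log(lam / 2)
          return -(log_lik + log_prior)

      # The difference is the same constant for any w, so argmin / argmax coincide.
      for w in [rng.normal(size=3), rng.normal(size=3)]:
          print(neg_log_posterior(w) - loss(w))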

  • Integrating out weights

    - Define: D = (x, y)

    - Remember Bayes' rule:

      p(w | D) = p(D | w) p(w) / p(D),     where   p(D) = ∫ p(D | w) p(w) dw

    - For Bayesian inference, the weights also need to be integrated out. This gives us a properly defined posterior over the parameters (and, as noted below, a predictive distribution rather than a point prediction).
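    As a complementary note (not on the slide): the same integration is what turns point predictions into a predictive distribution, which is where the predictive uncertainty mentioned earlier comes from. For a new input x*,

      p(y* | x*, D) = ∫ p(y* | x*, w) p(w | D) dw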

  • (Bayesian) Occam’s Razor

    "A plurality is not to be posited without necessity." (W. of Ockham)

    "Everything should be made as simple as possible, but not simpler." (A. Einstein)

    The evidence is higher for the model that is not "unnecessarily complex" but still "explains" the data D.

  • Bayesian Inference

    Remember: Separation of Model and Inference

  • Inference

    - p(D) (and hence p(w | D)) is difficult to compute because of the nonlinear way in which w appears through g.

    - Attempt at variational inference: minimize

      KL( q(w; θ) ‖ p(w | D) ) = log p(D) − L(θ),

      which is equivalent to maximizing the lower bound

      L(θ) = E_{q(w;θ)}[ log p(D, w) ]  (call this term F)   +   H[ q(w; θ) ]

    - The term F is still problematic. Solution: Monte Carlo (MC); see the sketch below.

    - Such approaches can be formulated as black-box inference.
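    To make the bound concrete, here is a small self-contained sketch (my own, not from the slides) of estimating L(θ) = F + H[q] by Monte Carlo, using a toy model (y_i ~ N(w, 1), prior w ~ N(0, 1)) and a Gaussian q(w; θ); all modelling choices below are hypothetical.

      import numpy as np

      rng = np.random.default_rng(0)
      D = rng.normal(loc=1.0, scale=1.0, size=30)  # toy data, y_i ~ N(w, 1)

      def log_normal(x, mean, var):
          return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

      def log_joint(w):
          # log p(D, w) = log N(D | w, 1) + log N(w | 0, 1)
          return log_normal(D, w, 1.0).sum() + log_normal(w, 0.0, 1.0)

      def elbo(mu, sigma, n_samples=200):
          # MC estimate of F = E_q[log p(D, w)], plus the closed-form entropy of q.
          w_samples = mu + sigma * rng.normal(size=n_samples)
          F = np.mean([log_joint(w) for w in w_samples])
          entropy = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
          return F + entropy

      print(elbo(mu=0.0, sigma=1.0))       # a poor q gives a loose bound
      print(elbo(mu=D.mean(), sigma=0.2))  # a q nearer the posterior gives a tighter bound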

  • Credit: Tushar Tank

  • Inference: “Score function method”

    ∇_θ F = ∇_θ E_{q(w;θ)}[ log p(D, w) ] = E_{q(w;θ)}[ log p(D, w) ∇_θ log q(w; θ) ]

          ≈ (1/K) Σ_{k=1}^K log p(D, w^(k)) ∇_θ log q(w^(k); θ),     w^(k) iid∼ q(w; θ)

    (Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014; Ruiz et al., 2016)
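    Below is a minimal sketch (my own, not from the slides) of the score-function estimator for the same toy model as above, with q(w; θ) = N(μ, σ²): the gradient of F with respect to μ is estimated purely from samples and ∇_θ log q, i.e. without ever differentiating log p(D, w).

      import numpy as np

      rng = np.random.default_rng(0)
      D = rng.normal(loc=1.0, scale=1.0, size=30)  # toy data, y_i ~ N(w, 1)

      def log_normal(x, mean, var):
          return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

      def log_joint(w):
          return log_normal(D, w, 1.0).sum() + log_normal(w, 0.0, 1.0)

      def score_function_grad_mu(mu, sigma, K=2000):
          # Score-function (REINFORCE) estimate of d/d_mu E_q[log p(D, w)].
          w = mu + sigma * rng.normal(size=K)        # w^(k) ~ q(w; theta)
          grad_log_q_mu = (w - mu) / sigma**2        # d/d_mu log N(w | mu, sigma^2)
          f = np.array([log_joint(wk) for wk in w])  # log p(D, w^(k))
          return np.mean(f * grad_log_q_mu)

      print(score_function_grad_mu(mu=0.0, sigma=0.5))  # unbiased but typically high-variance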

  • Inference: “Reparameterization gradient”

    - Reparametrize w as a transformation T of a simpler variable ε:  w = T(ε; θ)

    - q(ε) is now independent of θ:

      ∇_θ F = ∇_θ E_{q(w;θ)}[ log p(D, w) ] = E_{q(ε)}[ ∇_w log p(D, w)|_{w = T(ε;θ)} ∇_θ T(ε; θ) ]

    - For example: for w ∼ N(μ, σ), take T to be w = μ + σ · ε with ε ∼ N(0, 1)

    - MC by sampling from q(ε) (thus obtaining samples of w through T)

    (Salimans and Knowles, 2013; Kingma and Welling, 2014; Ruiz et al., 2016)
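    And a matching sketch (again my own, not from the slides) of the reparameterization estimator for the same toy setup: the gradient now flows through w = μ + σ·ε, which typically gives a much lower-variance estimate than the score-function version above.

      import numpy as np

      rng = np.random.default_rng(0)
      D = rng.normal(loc=1.0, scale=1.0, size=30)  # toy data, y_i ~ N(w, 1)

      def grad_w_log_joint(w):
          # d/dw [ log N(D | w, 1) + log N(w | 0, 1) ] for the toy model.
          return np.sum(D - w) - w

      def reparam_grads(mu, sigma, K=200):
          # Reparameterization estimates of dF/d_mu and dF/d_sigma.
          eps = rng.normal(size=K)          # eps ~ N(0, 1), independent of theta
          w = mu + sigma * eps              # w = T(eps; theta)
          gw = np.array([grad_w_log_joint(wk) for wk in w])
          # Chain rule through T: dT/d_mu = 1, dT/d_sigma = eps.
          return np.mean(gw * 1.0), np.mean(gw * eps)

      print(reparam_grads(mu=0.0, sigma=0.5))  # low-variance gradient estimates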

  • Black-box VI (github.com/blei-lab/edward)
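    For context, Edward exposes this kind of black-box inference as a few lines of model-plus-inference code. The fragment below is a rough sketch in the spirit of Edward's (1.x, TensorFlow 1-based) Bayesian linear regression example, not the code shown on the slide; exact names and signatures may differ across versions, and the data are hypothetical.

      import numpy as np
      import tensorflow as tf
      import edward as ed
      from edward.models import Normal

      N, D = 100, 3  # hypothetical dataset size and input dimension
      X_train = np.random.randn(N, D).astype(np.float32)
      y_train = (X_train @ np.array([1.0, -2.0, 0.5])).astype(np.float32)

      # Model: Bayesian linear regression with a Gaussian prior on the weights.
      X = tf.placeholder(tf.float32, [N, D])
      w = Normal(loc=tf.zeros(D), scale=tf.ones(D))
      y = Normal(loc=ed.dot(X, w), scale=tf.ones(N))

      # Variational approximation q(w) with learnable mean and (positive) scale.
      qw = Normal(loc=tf.Variable(tf.zeros(D)),
                  scale=tf.nn.softplus(tf.Variable(tf.zeros(D))))

      # Black-box variational inference: minimize KL(q || posterior).
      inference = ed.KLqp({w: qw}, data={X: X_train, y: y_train})
      inference.run(n_iter=500)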

  • Priors on weights (what we saw before)

    [Figure: a deep network X → H → ... → H → Y, with basis functions φ at each layer and priors placed on the weights between layers.]

  • From NN to GP

    [Figure: the same deep network X → H → ... → H → Y, but each weighted layer is replaced by a (GP-distributed) function, i.e. implicitly infinitely many basis functions φ per layer.]

    - NN: H_2 = W_2 φ(H_1)

    - GP: φ is ∞-dimensional, so: H_2 = f_2(H_1; θ_2) + ε

    - NN: prior p(W) on the weights

    - GP: prior p(f(·)) on the function itself

    - The VAE can be seen as a special case of this construction.
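    To illustrate what "a prior p(f(·)) on the function itself" means in the GP case, here is a short sketch (my own, not from the slides) that draws whole functions from a GP prior with an RBF kernel; each draw plays the role that one sampled weight configuration played for the BNN earlier. The kernel and its hyperparameters are hypothetical choices.

      import numpy as np

      def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
          # Squared-exponential (RBF) covariance function for 1-D inputs.
          d2 = (X1[:, None] - X2[None, :]) ** 2
          return variance * np.exp(-0.5 * d2 / lengthscale**2)

      rng = np.random.default_rng(0)
      x = np.linspace(-3, 3, 50)                    # inputs where we evaluate f
      K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability

      # Each sample is a whole function f evaluated at x, drawn from the GP prior p(f(.)).
      f_samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
      print(f_samples.shape)  # (3, 50): three sampled functions at 50 inputs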
