Probabilistic & Bayesian deep learning
Andreas Damianou
Amazon Research Cambridge, UK
Talk at University of Sheffield, 19 March 2019
In this talk
Not in this talk: CRFs, Boltzmann machines, ...
A standard neural network

[Figure: a feed-forward network mapping inputs x1, ..., xq through basis functions φ1, ..., φq with weights w1, ..., wq to the output Y.]
- Define: φj = φ(xj) and fj = wj φj (ignore the bias for now)
- Once we've fit all the w's with back-prop, f (and the whole network) becomes deterministic.
- What does that imply?
Trained neural network is deterministic. Implications?
- Generalization: overfitting occurs, creating the need for ad-hoc regularizers: dropout, early stopping, ...
- Data generation: a model which generalizes well should also understand, or even be able to create ("imagine"), variations of the data.
- No predictive uncertainty: uncertainty needs to be propagated through the model to be reliable.
Need for uncertainty
- Reinforcement learning
- Critical predictive systems
- Active learning
- Semi-automatic systems
- Scarce data scenarios
- ...
BNN with priors on its weights

[Figure: the network's point weights w1, ..., wq are replaced by distributions q(w1), ..., q(wq).]
BNN with priors on its weights

[Figure: drawing one sample from each weight distribution (e.g. 0.23, 0.90, 0.12, 0.54) recovers a single deterministic network realization.]
Probabilistic re-formulation

- DNN: y = g(W, x) = w1 φ(w2 φ(. . . x))
- Training minimizes a loss:

argmin_W  (1/2) Σ_{i=1}^N (g(W, x_i) − y_i)²  [fit]  +  λ Σ_i ‖w_i‖  [regularizer]

- Equivalent probabilistic view for regression, maximizing the posterior probability:

argmax_W  log p(y|x, W)  [fit]  +  log p(W)  [regularizer]

where p(y|x, W) is Gaussian and p(W) is Laplace.
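This equivalence can be checked numerically on a toy 1-D linear model (an illustration with synthetic data; the model g(w, x) = w·x, the noise scale sigma, and the Laplace scale b are all assumptions, not from the talk): with λ = 1/b, the negative log posterior and the regularized loss differ only by a w-independent constant, so they have the same minimizer.

```python
import numpy as np

# Toy check (1-D linear "network" g(w, x) = w * x, illustration only):
# minimizing squared loss + L1 penalty is the same as maximizing the log
# posterior under a Gaussian likelihood and a Laplace prior, because the
# two objectives differ only by a w-independent constant.

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 1.5 * x + 0.3 * rng.normal(size=20)
sigma, b = 0.3, 1.0            # likelihood std, Laplace prior scale
n = len(x)

def neg_log_posterior(w):
    # Full -log p(y|x,w) - log p(w), normalization constants included.
    nll = 0.5 * np.sum((w * x - y) ** 2) / sigma**2 + n * np.log(sigma * np.sqrt(2 * np.pi))
    nlp = abs(w) / b + np.log(2 * b)
    return nll + nlp

def regularized_loss(w):
    # fit + regularizer, with lambda = 1/b matching the prior scale.
    return 0.5 * np.sum((w * x - y) ** 2) / sigma**2 + abs(w) / b

# The gap between the two objectives is the same constant at every w.
gap1 = neg_log_posterior(0.7) - regularized_loss(0.7)
gap2 = neg_log_posterior(-2.0) - regularized_loss(-2.0)
```

The 1/(2σ²) weighting of the squared error is where the slide's plain 1/2 factor comes from once the noise variance is absorbed into λ.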
Integrating out weights

- Define: D = (x, y)
- Remember Bayes' rule:

p(w|D) = p(D|w) p(w) / p(D),   where p(D) = ∫ p(D|w) p(w) dw

- For Bayesian inference, the weights also need to be integrated out. This gives us a properly defined posterior over the parameters.
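As a toy numerical illustration of the rule (a standard coin-flip example, not specific to BNNs): with a uniform prior on a coin's bias w and a Bernoulli likelihood, the posterior and evidence can be computed on a grid and checked against the closed-form Beta result.

```python
import numpy as np

# Toy numerical check of Bayes' rule (illustration only, not from the talk):
# a coin with unknown bias w, uniform prior p(w) on [0, 1], and likelihood
# p(D|w) = w^k (1 - w)^(n - k) for D = "k heads in n flips".

w = np.linspace(0.0, 1.0, 10_001)
dw = w[1] - w[0]
k, n = 7, 10

prior = np.ones_like(w)                    # p(w)
likelihood = w**k * (1.0 - w)**(n - k)     # p(D|w)

# Evidence p(D) = integral of p(D|w) p(w) dw (simple quadrature).
evidence = np.sum(likelihood * prior) * dw

posterior = likelihood * prior / evidence  # Bayes' rule: p(w|D)

# Sanity checks: the posterior normalizes to 1, and its mean matches the
# closed-form conjugate answer Beta(k+1, n-k+1), mean (k+1)/(n+2).
post_mean = np.sum(w * posterior) * dw
```

For a BNN the same integral over w is intractable, which is exactly why the approximate inference methods later in the talk are needed.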
(Bayesian) Occam's Razor

"A plurality is not to be posited without necessity." (William of Ockham)
"Everything should be made as simple as possible, but not simpler." (A. Einstein)

The evidence is higher for the model that is not "unnecessarily complex" but still "explains" the data D.
Bayesian Inference
Remember: Separation of Model and Inference
Inference
- p(D) (and hence p(w|D)) is difficult to compute because of the nonlinear way in which w appears through g.
- Attempt variational inference:

KL(q(w; θ) ‖ p(w|D))  [minimize]  =  log p(D) − L(θ)  [maximize]

where L(θ) = E_{q(w;θ)}[log p(D, w)] + H[q(w; θ)], and we call the expectation term F.

- The term F is still problematic. Solution: MC.
- Such approaches can be formulated as black-box inference.
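The decomposition KL(q ‖ p(w|D)) = log p(D) − L(θ) can be verified exactly in a conjugate toy model (my own illustration; the talk does not use this example): Gaussian prior p(w) = N(0, 1), Gaussian likelihood p(d|w) = N(d; w, 1), one datum d, so the posterior is N(d/2, 1/2) and the evidence is N(d; 0, 2).

```python
import numpy as np

# Conjugate toy where every quantity in the identity
# KL(q || p(w|D)) = log p(D) - L(theta) has a closed form.
d = 1.2
m, s = 0.3, 0.8              # variational q(w; theta) = N(m, s^2)

# L(theta) = E_q[log p(D, w)] + H[q]
expected_log_joint = (-0.5 * ((d - m) ** 2 + s**2)   # E_q[log N(d; w, 1)]
                      - 0.5 * (m**2 + s**2)          # E_q[log N(w; 0, 1)]
                      - np.log(2 * np.pi))
entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)      # H[q]
elbo = expected_log_joint + entropy                  # L(theta)

log_evidence = -0.25 * d**2 - 0.5 * np.log(4 * np.pi)   # log N(d; 0, 2)

# Gaussian-to-Gaussian KL: q = N(m, s^2) vs posterior N(d/2, 1/2).
kl = (0.5 * np.log(0.5 / s**2)
      + (s**2 + (m - d / 2) ** 2) / (2 * 0.5)
      - 0.5)

# kl equals log_evidence - elbo, confirming the decomposition on the slide.
```

Since log p(D) is a constant in θ, maximizing the lower bound L(θ) is the same as minimizing the KL to the true posterior.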
Credit: Tushar Tank
Inference: "Score function method"

∇_θ F = ∇_θ E_{q(w;θ)}[log p(D, w)]
      = E_{q(w;θ)}[log p(D, w) ∇_θ log q(w; θ)]
      ≈ (1/K) Σ_{k=1}^K log p(D, w^(k)) ∇_θ log q(w^(k); θ),   w^(k) ~ q(w; θ) i.i.d.

(Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014; Ruiz et al., 2016)
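The estimator can be sketched on a made-up toy problem (everything here, including the fake log-joint, is an assumption for illustration): with q(w; θ) = N(θ, 1) and log p(D, w) = −(w − 2)²/2, we have F(θ) = −((θ − 2)² + 1)/2, so the exact gradient is −(θ − 2).

```python
import numpy as np

# Score-function (REINFORCE-style) Monte Carlo gradient estimate for a toy
# variational problem: q(w; theta) = N(theta, 1), fake log-joint below.

def log_joint(w):
    return -0.5 * (w - 2.0) ** 2          # stand-in for log p(D, w)

def score_function_grad(theta, K=200_000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(theta, 1.0, size=K)    # w^(k) ~ q(w; theta)
    score = w - theta                     # d/dtheta log N(w; theta, 1)
    return np.mean(log_joint(w) * score)  # MC estimate of grad_theta F

theta = 0.5
est = score_function_grad(theta)
exact = -(theta - 2.0)                    # closed-form gradient, here 1.5
```

Note the large K: the score-function estimator is unbiased but high-variance, which motivates the reparameterization gradient that follows.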
Inference: "Reparameterization gradient"

- Reparameterize w as a transformation T of a simpler variable ε: w = T(ε; θ)
- q(ε) is now independent of θ

∇_θ F = ∇_θ E_{q(w;θ)}[log p(D, w)] = E_{q(ε)}[ ∇_w log p(D, w)|_{w=T(ε;θ)} ∇_θ T(ε; θ) ]

- For example: w ∼ N(µ, σ) reparameterizes as w = µ + σ · ε, with ε ∼ N(0, 1)
- MC by sampling from q(ε) (thus obtaining samples of w through T)

(Salimans and Knowles, 2013; Kingma and Welling, 2014; Ruiz et al., 2016)
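On the same toy problem as before (an assumption for illustration: q(w; θ) = N(θ, 1), fake log-joint −(w − 2)²/2, exact gradient −(θ − 2)), the reparameterized estimator pushes the gradient inside the expectation via w = T(ε; θ) = θ + ε:

```python
import numpy as np

# Reparameterization-gradient Monte Carlo estimate for the same toy
# variational problem: w = T(eps; theta) = theta + eps, eps ~ N(0, 1).

def reparam_grad(theta, K=1_000, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=K)
    w = theta + eps                  # w = T(eps; theta)
    dlogp_dw = -(w - 2.0)            # d/dw of the fake log p(D, w)
    dT_dtheta = 1.0                  # d/dtheta T(eps; theta)
    return np.mean(dlogp_dw * dT_dtheta)

theta = 0.5
est = reparam_grad(theta)            # close to the exact gradient 1.5
```

Here K = 1,000 samples suffice where the score-function estimator needed orders of magnitude more: using the gradient of the log-joint itself typically yields far lower variance.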
Black-box VI (github.com/blei-lab/edward)
Priors on weights (what we saw before)

[Figure: a deep network X → φ(·) → H → . . . → H → φ(·) → Y, with priors on the weights of every layer.]
From NN to GP

[Figure: each layer of weighted basis functions φ is replaced by a GP warping G between layers X → H → . . . → Y.]

- NN: H2 = W2 φ(H1)
- GP: φ is ∞-dimensional, so: H2 = f2(H1; θ2) + ε
- NN: p(W)
- GP: p(f(·))
- The VAE can be seen as a special case of this
From NN to GP

[Figure: the same deep architecture viewed as a latent variable model.]

- The real world is perfectly described by unobserved latent variables: H
- But we only observe noisy high-dimensional data: Y
- We try to interpret the world and infer the latents: Ĥ ≈ H
- Inference: p(H|Y) = p(Y|H) p(H) / p(Y)
Deep Gaussian processes

- Uncertainty about parameters: check. Uncertainty about structure?
- A deep GP simultaneously brings in:
  - a prior on the "weights"
  - a kernelized input/latent space
  - stochasticity in the warping
Introducing Gaussian Processes:
- A Gaussian distribution depends on a mean and a covariance matrix.
- A Gaussian process depends on a mean and a covariance function.
Samples from a 1-D Gaussian
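The sampling behind such figures extends naturally from a 1-D Gaussian to function draws from a GP prior. A generic sketch (illustration only, not the talk's code; the RBF kernel, lengthscale, and jitter value are assumptions):

```python
import numpy as np

# Draw sample functions from a zero-mean GP prior with an RBF covariance
# function k(x, x') = exp(-(x - x')^2 / (2 l^2)).

def rbf_kernel(x1, x2, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
K = rbf_kernel(x, x) + 1e-6 * np.eye(len(x))   # jitter for numerical stability
L = np.linalg.cholesky(K)

# Each column is one plausible function under the prior, evaluated on x.
samples = L @ rng.normal(size=(len(x), 3))     # three draws, shape (100, 3)
```

Evaluated at a finite grid, a GP is just a big multivariate Gaussian, which is why a Cholesky factor of the kernel matrix is all that sampling requires.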
How do GPs solve the overfitting problem (i.e. regularize)?

- Answer: integrate over the function itself!
- So, we average out all possible function forms under a (GP) prior.

Recap:

MAP: argmax_w p(y|w, φ(x)) p(w),  e.g. y = φ(x)ᵀw + ε
Bayesian: argmax_θ ∫_f p(y|f) p(f|x, θ)  [GP prior on f],  e.g. y = f(x, θ) + ε

- θ are hyperparameters
- The Bayesian approach (GP) automatically balances the data fit with the complexity penalty.
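This balance is visible in the GP log marginal likelihood, log p(y|x, θ) = −½ yᵀK⁻¹y − ½ log|K| − (n/2) log 2π: the first term rewards fit, the second penalizes complexity. A sketch on synthetic data (the RBF kernel, noise level, and lengthscale grid are my assumptions):

```python
import numpy as np

# GP evidence as fit + complexity, evaluated for different lengthscales.

def rbf(x1, x2, l=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / l) ** 2)

def log_marginal_likelihood(x, y, l, noise=0.1):
    n = len(x)
    K = rbf(x, x, l) + noise**2 * np.eye(n)
    Ki_y = np.linalg.solve(K, y)
    _, logdet = np.linalg.slogdet(K)
    fit = -0.5 * y @ Ki_y                 # data-fit term
    complexity = -0.5 * logdet            # complexity penalty
    return fit + complexity - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = np.sin(x) + 0.1 * rng.normal(size=30)

# A very short lengthscale treats every point as noise (poor evidence);
# a moderate one explains the data with a simple function.
lengthscales = [0.05, 0.5, 5.0]
lmls = [log_marginal_likelihood(x, y, l) for l in lengthscales]
```

Maximizing this quantity over θ selects hyperparameters without any ad-hoc regularizer, which is the Bayesian Occam's razor from earlier in the talk.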
Deep Gaussian processes

[Figure: a chain x → f1 → h1 → f2 → h2 → f3 → y.]

- Define a recursive stacked construction:

f(x) → GP
fL(fL−1(fL−2(· · · f1(x)))) → deep GP

Compare to:

φ(x)ᵀw → NN
φ(φ(φ(x)ᵀw1)ᵀ · · · wL−1)ᵀwL → DNN
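The stacked construction can be sketched by composing GP draws layer by layer, feeding each layer's sampled outputs in as the next layer's inputs (a toy prior sample, not the talk's implementation; kernel and sizes are assumptions):

```python
import numpy as np

# One draw from a 3-layer deep GP prior, f3(f2(f1(x))), on a 1-D grid.

def rbf(a, b, l=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / l) ** 2)

def sample_gp_layer(inputs, rng, jitter=1e-6):
    # Sample function values at `inputs` from a zero-mean GP prior.
    K = rbf(inputs, inputs) + jitter * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ rng.normal(size=len(inputs))

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 50)
h1 = sample_gp_layer(x, rng)    # f1(x)
h2 = sample_gp_layer(h1, rng)   # f2(f1(x))
y = sample_gp_layer(h2, rng)    # f3(f2(f1(x))): one deep GP prior draw
```

Because each layer warps the previous one, the composed draws can be far less smooth than any single-layer GP sample, which is the extra structural flexibility a deep GP buys.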
Two-layered DGP

[Figure: input X mapped by f1 to the unobserved layer H, and H mapped by f2 to the output Y.]
Inference in DGPs

- It's tricky. An additional difficulty comes from the fact that the kernel couples all points, so p(y|x) is no longer factorized.
- Indeed, p(y|x, w) = ∫_f p(y|f) p(f|x, w), but w appears inside the non-linear kernel function: p(f|x, w) = N(f | 0, k(X, X | w))
The resulting objective function

By integrating over all unknowns:

- We obtain approximate posteriors q(h), q(f) over them
- We end up with a well-regularized variational lower bound on the true marginal likelihood:

F = Data fit − KL(q(h1) ‖ p(h1)) + Σ_{l=2}^L H(q(hl))

where the KL and entropy terms act as regularisation.

- Regularization is achieved through uncertainty propagation.
An example where uncertainty propagation matters: Recurrent learning.
Dynamics/memory: Deep Recurrent Gaussian Process
Avatar control
Figure: The generated motion with a step function signal, starting with walking (blue),switching to running (red) and switching back to walking (blue).
Videos:https://youtu.be/FR-oeGxV6yY Switching between learned speeds
https://youtu.be/AT0HMtoPgjc Interpolating (un)seen speed
https://youtu.be/FuF-uZ83VMw Constant unseen speed
[Figure: free-simulation results compared across RGP, GP-NARX, MLP-NARX, and LSTM.]
Stochastic warping

[Figure: the deep network X → H → . . . → Y, with N(0, I) noise injected into each latent layer H through the warping G.]

Inference:

- Need to infer posteriors on H
- Define q(H) = N(µ, Σ)
- Amortize: {µ, Σ} = MLP(Y; θ)
Amortized inference

Solution: reparameterization through a recognition model g:

µ_1^(n) = g_1(y^(n))
µ_l^(n) = g_l(µ_{l−1}^(n))
g_l = MLP(θ_l)

g_l deterministic ⇒ µ_l^(n) = g_l(. . . g_1(y^(n)))

Dai, Damianou, Gonzalez, Lawrence, ICLR 2016
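A minimal sketch of the recognition-model idea (hypothetical layer sizes and random weights, purely for illustration): one shared deterministic network maps every observation y^(n) to its per-layer variational means, so no per-data-point variational parameters are needed.

```python
import numpy as np

# Amortized recognition model: mu_1 = g_1(y), mu_l = g_l(mu_{l-1}),
# where each g_l is a small deterministic MLP layer.

rng = np.random.default_rng(0)

def mlp_layer(params):
    W, b = params
    return lambda x: np.tanh(x @ W + b)

dims = [10, 5, 3]   # observed dim -> layer-1 latent dim -> layer-2 latent dim
layers = []
d_in = dims[0]
for d_out in dims[1:]:
    params = (0.1 * rng.normal(size=(d_in, d_out)), np.zeros(d_out))
    layers.append(mlp_layer(params))
    d_in = d_out

y = rng.normal(size=(4, dims[0]))   # a mini-batch of 4 observations

# Pass the batch through g_1, then g_2: the same weights serve every
# data point, which is what "amortized" means here.
mu = y
for g in layers:
    mu = g(mu)
```

The cost of inference then scales with the network's parameters θ_l rather than with the number of data points.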
Bayesian approach recap
- Three ways of introducing uncertainty / noise in a NN:
  - Treat weights w as distributions (BNNs)
  - Stochasticity in the warping function φ (VAE)
  - Bayesian non-parametrics applied to DNNs (DGP)
Connection between Bayesian and “traditional” approaches
There's another view of Bayesian NNs: they can be used as a means of explaining / motivating DNN techniques (e.g. dropout) in a more mathematically grounded way.

- Dropout as variational inference (Kingma et al. 2015; Gal and Ghahramani 2016)
- Low-rank NN weight approximation and deep Gaussian processes (Hensman and Lawrence 2014; Damianou 2015; Louizos and Welling 2016; Cutajar et al. 2016)
- ...
Summary
- Motivation for probabilistic and Bayesian reasoning
- Three ways of incorporating uncertainty in DNNs
- Inference is more challenging when uncertainty has to be propagated
- Connection between Bayesian and "traditional" NN approaches