
Deep Machine Learning

Seungjin Choi

Department of Computer Science and Engineering
Pohang University of Science and Technology

77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
[email protected]

http://mlg.postech.ac.kr/∼seungjin

October 5, 2016


Deep Learning

f ≈ σ_L ∘ σ_{L−1} ∘ ··· ∘ σ_1

▶ Fully-connected network (MLP): h_t^(l) = σ(W^(l) h_t^(l−1) + b^(l))
▶ Convolutional neural network: h_t^(l) = σ(w^(l) ∗ h_t^(l−1) + b^(l))
▶ Recurrent neural network: h_t^(l) = σ(W^(l) h_t^(l−1) + V^(l) h_{t−1}^(l) + b^(l))

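A minimal numpy sketch of these three building blocks (all names, shapes, and the tanh nonlinearity are illustrative assumptions, not the slide's code):

import numpy as np

def sigma(a):
    return np.tanh(a)  # any elementwise nonlinearity

def mlp_layer(W, b, h):                  # fully connected: σ(W h + b)
    return sigma(W @ h + b)

def conv_layer(w, b, h):                 # 1-D convolution: σ(w ∗ h + b)
    return sigma(np.convolve(h, w, mode="same") + b)

def rnn_step(W, V, b, h_below, h_prev):  # recurrent: σ(W h_t^(l−1) + V h_{t−1}^(l) + b)
    return sigma(W @ h_below + V @ h_prev + b)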

A Big Deal?

▶ Much simpler than other ML algorithms
▶ Scale to large datasets much better
▶ Have attained state-of-the-art performance on several tasks (speech recognition, visual recognition, and so on)


Speech Recognition


Deep learning for speech recognition

http://image.slidesharecdn.com/22-01-15dlmeetup-150122111042-conversion-gate01/95/deep-learning-an-interactive-introduction-for-nlpers-9-638.jpg?cb=1422014515


Apple Siri, Google Now, MS Cortana, Amazon Alexa

Amazon Echo: A wireless speaker and voice command device

Google Home: A smart speaker from Google


Visual Recognition


ImageNet Challenge


2015: A Milestone Year in Computer Science

AlexNet (2012)

▶ AlexNet (5 convolutional layers + 3 fully connected layers), 2012
▶ VGG (very deep CNN, 16-19 weight layers), 2015
▶ GoogLeNet (22 layers), 2015
▶ Deep Residual Net (100-1000 layers), 2015

https://blogs.nvidia.com/blog/2016/01/12/accelerating-ai-artificial-intelligence-gpus/


Deep Convolutional Neural Networks

h_i^(l) = σ( Σ_{k=1}^M w_k^(l) h_{i+k−1}^(l−1) + b_i^(l) ),   h_i^(l) = max_{i−Q+1 ≤ k ≤ i+Q−1} h_k^(l).

Deep CNN: Classify 1.2 million images into 1000 classes

A. Krizhevsky, I. Sutskever, G. Hinton (2012), "ImageNet Classification with Deep Convolutional Neural Networks"

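The two equations above are a 1-D convolution followed by max-pooling over a window; a hedged numpy sketch (the names, edge handling, and window convention are my own assumptions):

import numpy as np

def conv_maxpool(h_prev, w, b, Q, sigma=np.tanh):
    """Convolution h_i = σ(Σ_{k=1}^M w_k h_{i+k−1} + b_i), then
    max-pooling h_i = max_{i−Q+1 ≤ k ≤ i+Q−1} h_k (window clipped at edges)."""
    M = len(w)
    D = len(h_prev) - M + 1
    h = sigma(np.array([w @ h_prev[i:i + M] for i in range(D)]) + b)
    return np.array([h[max(0, i - Q + 1): i + Q].max() for i in range(D)])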

Deep learning is good at representation learning

H. Lee et al., "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," ICML-2009.


Why is CNN so successful?

▶ Similar to simple and complex cells in the V1 area of visual cortex
▶ Deep architecture
▶ Supervised representation learning


Pre-Trained CNNs

▶ AlexNet (5 convolutional layers + 3 fully connected layers): A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 25, 2012.
▶ VGG (very deep CNN, 16-19 weight layers): K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
▶ GoogLeNet (22 layers): C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.


Bayesian Approach: Evidence p(D|M_i) = ∫ p(D|w, M_i) p(w|M_i) dw

▶ Select a model with maximum evidence
▶ Select a subset of pre-trained CNNs in a greedy manner

Y. Kim, T. Jang, B. Han, S. Choi, "Learning to select pre-trained deep representations with Bayesian evidence framework," CVPR-2016.

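A schematic sketch of the greedy loop only, not the paper's exact procedure: log_evidence is an assumed black box that scores a candidate subset (e.g., the marginal likelihood of a Bayesian linear model fit on the concatenated CNN features).

def greedy_select(candidates, log_evidence, max_k=3):
    """Greedily grow a subset S, at each step adding the pre-trained
    representation that most increases the (assumed) log evidence."""
    S = []
    for _ in range(max_k):
        best = max((c for c in candidates if c not in S),
                   key=lambda c: log_evidence(S + [c]), default=None)
        if best is None or log_evidence(S + [best]) <= log_evidence(S):
            break  # no remaining candidate improves the evidence
        S.append(best)
    return S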

DeepFace by Facebook

Nine-layer deep neural network

Face recognition: Detection → Alignment → Representation → Classification

Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the Gap to Human-Level Performance in Face Verification," CVPR-2014.


Photomath

https://photomath.net/en/


Dense Captioning

Fully convolutional localization network: localization and description jointly

Justin Johnson, Andrej Karpathy, Li Fei-Fei, "DenseCap: Fully Convolutional Localization Networks for Dense Captioning," CVPR-2016.


Fully convolutional localization network

Justin Johnson, Andrej Karpathy, Li Fei-Fei, "DenseCap: Fully Convolutional Localization Networks for Dense Captioning," CVPR-2016.


Text and Natural Language


Gmail’s New AI

Smart Reply uses machine learning in the form of an industrial-strength neural network to write the emails.

▶ Hand-crafted rules for common reply scenarios?
▶ Any engineer's ability to invent 'rules' would be quickly outstripped by the tremendous diversity with which real people communicate.

http://www.ign.com/articles/2015/11/04/gmails-new-ai-will-answer-your-emails-for-you


https://research.googleblog.com/2015/11/computer-respond-to-this-email.html


Machine Translation

http://blog.webcertain.com/wp-content/uploads/2015/02/machine-translation-search-engines.jpg


Neural Machine Translation

Kyunghyun Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," EMNLP-2014.


Why is deep learning so successful?

Compared to the 1980-90s, we now dive deeper with ...

▶ Methods and algorithms
  ▶ Pre-training, dropout, ReLU, batch normalization, and so on
▶ Plenty of data
▶ Computing power (GPU computing)


Restricted Boltzmann Machines


Harmonium (or Restricted Boltzmann Machine)

(Figure: bipartite graph with hidden units h_1, ..., h_K and visible units v_1, ..., v_D)

▶ Harmonium (Smolensky, 1986), aka RBM (Hinton and Sejnowski, 1986).
▶ Undirected model which allows only inter-layer connections (bipartite graph).
▶ An energy-based probabilistic model defines a probability distribution through an energy function, associating an energy to each configuration of the variables of interest:

  p(v) = Σ_h p(v, h) = Σ_h (1/Z) e^{−E(v,h)}.

▶ Learning corresponds to modifying the energy function so that its shape has desirable properties (maximum likelihood estimation).


▶ RBM: v ∈ {0, 1}^D and h ∈ {0, 1}^K
  ▶ Energy is given by E(v, h) = −b⊤v − c⊤h − v⊤Wh.
  ▶ Conditional probabilities are calculated as:

    p(h_j = 1|v) = σ(c_j + Σ_{i=1}^D W_{i,j} v_i),   p(v_i = 1|h) = σ(b_i + Σ_{j=1}^K W_{i,j} h_j).

▶ Gaussian-Bernoulli RBM: v ∈ R^D and h ∈ {0, 1}^K
  ▶ Energy is given by

    E(v, h) = Σ_{i=1}^D (v_i − b_i)² / (2σ_i²) − Σ_{j=1}^K c_j h_j − Σ_{i=1}^D Σ_{j=1}^K v_i W_{i,j} h_j / σ_i.

  ▶ Conditional probabilities are calculated as:

    p(h_j = 1|v) = σ(c_j + Σ_{i=1}^D W_{i,j} v_i / σ_i),
    p(v_i|h) = N(v_i | b_i + σ_i Σ_{j=1}^K W_{i,j} h_j, σ_i²).


Gibbs Sampling in RBM with Binary Units

v^(1) ∼ p(v),
h^(1) ∼ p(h|v^(1)),
v^(2) ∼ p(v|h^(1)),
h^(2) ∼ p(h|v^(2)),
...
v^(k+1) ∼ p(v|h^(k))   (k Gibbs steps).

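A minimal numpy sketch of this chain for the binary RBM, using the conditionals from the previous slide (all names are mine):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_v(v, W, c):   # p(h_j = 1 | v) = σ(c_j + Σ_i W_ij v_i)
    return sigmoid(c + v @ W)

def p_v_given_h(h, W, b):   # p(v_i = 1 | h) = σ(b_i + Σ_j W_ij h_j)
    return sigmoid(b + h @ W.T)

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def gibbs_chain(v, W, b, c, k=1):
    """k alternating Gibbs steps: h ~ p(h|v), then v ~ p(v|h), repeated."""
    for _ in range(k):
        h = bernoulli(p_h_given_v(v, W, c))
        v = bernoulli(p_v_given_h(h, W, b))
    return v, p_h_given_v(v, W, c)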

Contrastive Divergence Learning

Average log-likelihood gradient is approximated by k Gibbs steps,

⟨∂ log p(v)/∂θ⟩_{p(v)} ≈ ⟨−∂E(v, h)/∂θ⟩_{p°} + ⟨∂E(v, h)/∂θ⟩_{p^(k)(v,h)},

where p^(k)(v, h) is the joint distribution determined by k Gibbs steps and

p(v) = (1/N) Σ_{n=1}^N δ(v − v_n),   p° = (1/N) Σ_{n=1}^N p(h|v_n).

Gradient ascent learning leads to

W ← W + η [⟨v h⊤⟩_{p°} − ⟨v h⊤⟩_{p^(k)(v,h)}],
b ← b + η [(1/N) Σ_{n=1}^N v_n − ⟨v⟩_{p^(k)(v,h)}],
c ← c + η [⟨h⟩_{p°} − ⟨h⟩_{p^(k)(v,h)}].

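Reusing p_h_given_v, gibbs_chain, and friends from the sketch above, one CD-k parameter update on a minibatch could look like this (learning rate and minibatch convention are my own choices):

def cd_k_update(V, W, b, c, k=1, eta=0.01):
    """One contrastive-divergence step on a minibatch V (N x D of {0,1})."""
    N = V.shape[0]
    ph0 = p_h_given_v(V, W, c)            # positive statistics under p°
    Vk, phk = gibbs_chain(V, W, b, c, k)  # negative statistics after k Gibbs steps
    W += eta * (V.T @ ph0 - Vk.T @ phk) / N
    b += eta * (V - Vk).mean(axis=0)
    c += eta * (ph0 - phk).mean(axis=0)
    return W, b, c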

Dropout


Dropout

Figure: Taken from Srivastava et al., JMLR 2014.

▶ Form a vector of independent Bernoulli random variables, z^(l), where z_i^(l) ∼ Bern(p).
▶ Feedforward operations are:

  y_i^(l+1) = σ( w_i^(l+1)⊤ (y^(l) ⊙ z^(l)) + b_i^(l+1) ).


Figure: Taken from Srivastava et al., JMLR 2014.

Dropout approximates the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.

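A hedged numpy sketch of one dropout layer with this test-time weight scaling (tanh and the keep probability p are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(y, W, b, p=0.5, train=True):
    """y^(l+1) = σ(W (y ⊙ z) + b) with z_i ~ Bern(p) at train time;
    at test time, use the unthinned network with weights scaled by p."""
    if train:
        z = (rng.random(y.shape) < p).astype(y.dtype)  # Bernoulli mask
        pre = W @ (y * z) + b
    else:
        pre = (p * W) @ y + b  # approximates averaging over all thinned networks
    return np.tanh(pre)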

Deep learning has been a big success in ...


So far, deep learning has been successful in perception:

▶ Seeing
▶ Hearing
▶ Reading

The next challenge will be thinking (inference):

▶ Deep probabilistic models
▶ Sequential decision making (deep RL)


The Deeper and the Better?


Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition," Preprint arXiv:1512.03385, 2015.


How to set the depth?

▶ Bayesian optimization
▶ Deep residual net
▶ Stochastic depth


Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Q. Weinberger, "Deep Networks with Stochastic Depth," Preprint arXiv:1603.09382, 2016.


Uncertainty and Bayes


Bayesian Learning for Neural Networks [Neal, 1996]

▶ Question: Does it make sense to use a network with an infinite number of hidden units, training it by maximizing the likelihood of a finite amount of data?
  ▶ The network will overfit the data and its generalization performance will be poor.
  ▶ Well, however, the idea of selecting the network size depending on the amount of training data makes little sense to a Bayesian.
▶ Radford M. Neal (1996) showed that:
  ▶ It is sensible to consider a limit where the number of hidden units in a net tends to infinity, and good predictions can be obtained from such models using the Bayesian machinery.
  ▶ For fixed hyperparameters, a large class of neural network models will converge to a Gaussian process prior over functions in the limit of an infinite number of hidden units.


Bayesian Neural Networks

▶ NN model parameterized by θ = {W^(L), ..., W^(1)}:

  f(x; θ) = W^(L) σ( ··· W^(2) σ(W^(1) x + b^(1)) ··· )

▶ Regression:

  p(y|x, θ) = N(y | f(x; θ), β^{−1}).

▶ Classification:

  p(y = k|x, θ) = softmax(f(x; θ)) = e^{f_k(x;θ)} / Σ_j e^{f_j(x;θ)}.

▶ Inference
  ▶ First level of inference: fit your model to the data (regularization)
  ▶ Second level of inference: model comparison

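As a small sketch of this model (the layer count, nonlinearity, and the absence of a final bias follow the formula above; everything else is an assumption):

import numpy as np

def f(x, Ws, bs, sigma=np.tanh):
    """f(x; θ) = W^(L) σ(··· σ(W^(1) x + b^(1)) ···);
    Ws = [W^(1), ..., W^(L)], bs = [b^(1), ..., b^(L−1)]."""
    h = x
    for W, b in zip(Ws[:-1], bs):
        h = sigma(W @ h + b)
    return Ws[-1] @ h

def class_probs(x, Ws, bs):
    """p(y = k | x, θ) = softmax_k(f(x; θ))."""
    u = f(x, Ws, bs)
    e = np.exp(u - u.max())  # subtract max for numerical stability
    return e / e.sum()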

▶ Training data: D = (X, Y).
▶ Likelihood: p(Y|X, θ).
▶ Prior over parameters: p(θ).
▶ Posterior over parameters:

  p(θ|X, Y) = p(Y|X, θ) p(θ) / p(Y|X).

▶ Prediction is done by

  p(y*|x*, D) = ∫ p(y*|x*, θ) p(θ|X, Y) dθ.

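Assuming posterior samples θ_s ∼ p(θ|X, Y) are available from some approximate method, the predictive integral is typically estimated by Monte Carlo; a sketch (predict is an assumed callable):

import numpy as np

def predictive(x_star, posterior_samples, predict):
    """p(y*|x*, D) ≈ (1/S) Σ_s p(y*|x*, θ_s) with θ_s ~ p(θ|X, Y);
    predict(x, θ) returns the likelihood term p(y*|x*, θ)."""
    return np.mean([predict(x_star, theta) for theta in posterior_samples], axis=0)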

▶ Difficult to compute the posterior p(θ|X, Y).
▶ Various approximate inference methods have been applied:
  ▶ David J. C. MacKay (1992): Laplace approximation
  ▶ Radford Neal (1995): MCMC
  ▶ Alex Graves (2011): Practical variational inference
  ▶ Jose Miguel Hernandez-Lobato and Ryan P. Adams (2015): Probabilistic backprop
  ▶ Charles Blundell et al. (2015): Bayes by backprop

I am working on BNN with a different approach!


Deep Generative Models


(Figure: discriminative model vs. generative model)

▶ Discriminative model: directly model p(h|v)
▶ Generative model: joint distribution p(v, h) = p(v|h) p(h)


Deep Generative Models

https://www.openai.com/blog/generative-models/

▶ A powerful technique for modeling complex high-dimensional datasets
▶ Often challenging to carry out "inference"
▶ Why generative models?
  ▶ Leverage unlabeled data
  ▶ Environment simulator: reinforcement learning and planning
  ▶ Speech synthesis, machine translation, image segmentation, and so on


Deep + Probabilistic = Good?

▶ Deep layers in hierarchy
  ▶ Good performance in representation, as well as in prediction
  ▶ Huge number of parameters (model compression?) need careful regularization (to avoid overfitting)
  ▶ Lack of uncertainty
▶ Probabilistic models
  ▶ Flexible models with uncertainty
  ▶ Require approximate inference
▶ Deep probabilistic models with Bayesian methods
  ▶ Prediction by MLE: p(x*|θ_ML) or p(y*|x*, θ_ML).
  ▶ Prediction by Bayesian:

    p(x*|X) = ∫ p(x*|θ) p(θ|X) dθ,
    p(y*|x*, Y, X) = ∫ p(y*|x*, θ) p(θ|Y, X) dθ.


Learning Deep Directed Generative Models

Two classes of algorithms:

▶ Variational autoencoders
  ▶ Durk Kingma (OpenAI)
▶ Generative adversarial networks
  ▶ Ian Goodfellow (OpenAI)


Taken from Shakir Mohamed’s slides


Variational Autoencoder [Kingma and Welling, 2014]

▶ Probabilistic decoder: p_θ(x|z)
▶ Probabilistic encoder: q_φ(z|x)
▶ Model parameters θ and variational parameters φ are learned by maximizing a variational lower bound on the marginal likelihood (stochastic gradient variational Bayes).
▶ Decoding MLP and encoding MLP. For instance,

  p(x|z) = N(x|μ, Σ),
  h = σ(W_1 z + b_1),
  μ = W_2 h + b_2,
  log diag(Σ) = W_3 h + b_3.


Stochastic Gradient Variational Bayes

The log-likelihood is bounded from below:

log p(x) = log ∫ p(x, z) dz
         ≥ ∫ q(z|x) log [ p(x, z) / q(z|x) ] dz
         = ∫ q(z|x) log [ p(x|z) p(z) / q(z|x) ] dz
         = E_q[log p(x|z)] − KL[q(z|x) ‖ p(z)],

where the first term is estimated by SGVB and the KL term is computed analytically. E_q[·] denotes the expectation w.r.t. q(z|x), and Monte Carlo estimates are performed with the reparameterization trick:

E_q[log p(x|z)] ≈ (1/L) Σ_{l=1}^L log p(x|z^(l)),

where z^(l) = m + √λ ⊙ ε^(l) and ε^(l) ∼ N(0, I). A single sample is often sufficient to form this Monte Carlo estimate in practice.

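A minimal numpy sketch of this estimator with a diagonal Gaussian q(z|x) = N(m, diag(λ)) and standard Gaussian prior; the decoder log-likelihood log_px_given_z is an assumed callable:

import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, m, log_var, log_px_given_z, L=1):
    """E_q[log p(x|z)] − KL[q(z|x) ‖ p(z)], with z = m + √λ ⊙ ε, λ = exp(log_var)."""
    recon = 0.0
    for _ in range(L):
        eps = rng.standard_normal(m.shape)   # ε ~ N(0, I)
        z = m + np.exp(0.5 * log_var) * eps  # reparameterization trick
        recon += log_px_given_z(x, z) / L
    kl = 0.5 * np.sum(np.exp(log_var) + m**2 - 1.0 - log_var)  # KL to N(0, I), closed form
    return recon - kl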

Kingma and Welling, 2014


VAE with Rank-One Covariance [Suh and Choi, 2016]

(Figure: latent points z^(1), ..., z^(6) in Z mapped to means μ(z) with local principal directions a(z) in data space (X_1, X_2))

▶ Probabilistic decoder: p_θ(x|z)
▶ Probabilistic encoder: q_φ(z|x)
▶ Find the local principal direction a at a specific location μ(z):

  p_θ(x|z) = N(μ, ωI + aa⊤),
  p_θ(z) = N(0, I),
  μ = W_μ h + b_μ,
  log ω = w_ω⊤ h + b_ω,
  a = W_a h + b_a,
  h = tanh(W_h z + b_h).


(a) True images (b) Generated images


Generative Adversarial Network [Goodfellow et al., 2014]

▶ Generator network, G(z): R^K → R^D
  ▶ Captures the data distribution
  ▶ Counterfeiter: tries to fool the discriminator
▶ Discriminative network, D(x): R^D → {0, 1}
  ▶ Police: tries to detect counterfeit images

The discriminator notices a difference between the two distributions; the generator adjusts its parameters slightly to make it go away, until at the end (in theory) the generator exactly reproduces the true data distribution and the discriminator is guessing at random, unable to find a difference.


Two-player minimax game:

min_G max_D  E_{x∼p_data(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))].

▶ Both G and D are MLPs.
▶ Does not require any sophisticated inference methods (variational or sampling).
▶ Alternate between k steps of optimizing D and one step of optimizing G.
▶ In practice, train G with max_G E_{z∼p(z)}[log D(G(z))] (stronger gradients early in learning). A toy end-to-end sketch follows below.

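A toy numpy sketch of this alternating game on 1-D data; the data distribution, quadratic discriminator features, non-saturating generator loss, and all hyperparameters are my own choices, and convergence of such a toy GAN is not guaranteed:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def feats(x):  # features [1, x, x^2] so D can see mean and variance mismatch
    return np.stack([np.ones_like(x), x, x * x], axis=1)

def D(x, w):   # discriminator D(x) = σ(w · [1, x, x^2])
    return sigmoid(feats(x) @ w)

a, b = 1.0, 0.0          # generator G(z) = a z + b, z ~ N(0, 1)
w = np.zeros(3)          # discriminator parameters
eta, k, N = 0.01, 1, 128

for _ in range(5000):
    for _ in range(k):   # k discriminator steps per generator step
        x = rng.normal(4.0, 1.5, N)         # real data x ~ N(4, 1.5^2)
        g = a * rng.standard_normal(N) + b  # fake samples G(z)
        # ascend E[log D(x)] + E[log(1 − D(G(z)))]
        w += eta * (feats(x).T @ (1 - D(x, w)) - feats(g).T @ D(g, w)) / N
    z = rng.standard_normal(N)
    g = a * z + b
    grad_g = (1 - D(g, w)) * (w[1] + 2 * w[2] * g)  # ∂ log D(g)/∂g (non-saturating loss)
    a += eta * np.mean(grad_g * z)                  # ∂g/∂a = z
    b += eta * np.mean(grad_g)                      # ∂g/∂b = 1

print(f"G(z) ~ N({b:.2f}, {abs(a):.2f}^2), target N(4.00, 1.50^2)")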

Question and Discussion
