
Deep Machine Learning

Seungjin Choi

Department of Computer Science and Engineering
Pohang University of Science and Technology

77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
[email protected]

http://mlg.postech.ac.kr/∼seungjin

October 5, 2016


Deep Learning

f ≈ σ_L ∘ σ_{L−1} ∘ ··· ∘ σ_1

▶ Fully-connected network (MLP): h_t^(l) = σ(W^(l) h_t^(l−1) + b^(l))
▶ Convolutional neural network: h_t^(l) = σ(w^(l) ∗ h_t^(l−1) + b^(l))
▶ Recurrent neural network: h_t^(l) = σ(W^(l) h_t^(l−1) + V^(l) h_{t−1}^(l) + b^(l))

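A minimal numpy sketch of these three building blocks (all names, shapes, and the tanh nonlinearity are illustrative assumptions, not the slide's code):

import numpy as np

def sigma(a):
    return np.tanh(a)  # any elementwise nonlinearity

def mlp_layer(W, b, h):                  # fully connected: σ(W h + b)
    return sigma(W @ h + b)

def conv_layer(w, b, h):                 # 1-D convolution: σ(w ∗ h + b)
    return sigma(np.convolve(h, w, mode="same") + b)

def rnn_step(W, V, b, h_below, h_prev):  # recurrent: σ(W h_t^(l−1) + V h_{t−1}^(l) + b)
    return sigma(W @ h_below + V @ h_prev + b)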

A Big Deal?

▶ Much simpler than other ML algorithms
▶ Scale to large datasets much better
▶ Have attained state-of-the-art performance on several tasks (speech recognition, visual recognition, and so on)


Speech Recognition


Deep learning for speech recognition

http://image.slidesharecdn.com/22-01-15dlmeetup-150122111042-conversion-gate01/95/deep-learning-an-interactive-introduction-for-nlpers-9-638.jpg?cb=1422014515


Apple Siri, Google Now, MS Cortana, Amazon Alexa

Amazon Echo: A wireless speaker and voice command device

Google Home: A smart speaker from Google


Visual Recognition


ImageNet Challenge


2015: A Milestone Year in Computer Science

AlexNet (2012)

▶ AlexNet (5 convolutional layers + 3 fully connected layers), 2012
▶ VGG (very deep CNN, 16-19 weight layers), 2015
▶ GoogLeNet (22 layers), 2015
▶ Deep Residual Net (100-1000 layers), 2015

https://blogs.nvidia.com/blog/2016/01/12/accelerating-ai-artificial-intelligence-gpus/


Deep Convolutional Neural Networks

h_i^(l) = σ( Σ_{k=1}^M w_k^(l) h_{i+k−1}^(l−1) + b_i^(l) ),   h_i^(l) = max_{i−Q+1 ≤ k ≤ i+Q−1} h_k^(l).

Deep CNN: Classify 1.2 million images into 1000 classes

A. Krizhevsky, I. Sutskever, G. Hinton (2012), "ImageNet Classification with Deep Convolutional Neural Networks"

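The two equations above are a 1-D convolution followed by max-pooling over a window; a hedged numpy sketch (the names, edge handling, and window convention are my own assumptions):

import numpy as np

def conv_maxpool(h_prev, w, b, Q, sigma=np.tanh):
    """Convolution h_i = σ(Σ_{k=1}^M w_k h_{i+k−1} + b_i), then
    max-pooling h_i = max_{i−Q+1 ≤ k ≤ i+Q−1} h_k (window clipped at edges)."""
    M = len(w)
    D = len(h_prev) - M + 1
    h = sigma(np.array([w @ h_prev[i:i + M] for i in range(D)]) + b)
    return np.array([h[max(0, i - Q + 1): i + Q].max() for i in range(D)])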

Deep learning is good at representation learning

H. Lee et al., "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," ICML-2009.


Why is CNN so successful?

▶ Similar to simple and complex cells in the V1 area of visual cortex
▶ Deep architecture
▶ Supervised representation learning


Pre-Trained CNNs

▶ AlexNet (5 convolutional layers + 3 fully connected layers): A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 25, 2012.
▶ VGG (very deep CNN, 16-19 weight layers): K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
▶ GoogLeNet (22 layers): C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.


Bayesian Approach: Evidence p(D|M_i) = ∫ p(D|w, M_i) p(w|M_i) dw

▶ Select a model with maximum evidence
▶ Select a subset of pre-trained CNNs in a greedy manner

Y. Kim, T. Jang, B. Han, S. Choi, "Learning to select pre-trained deep representations with Bayesian evidence framework," CVPR-2016.

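A schematic sketch of the greedy loop only, not the paper's exact procedure: log_evidence is an assumed black box that scores a candidate subset (e.g., the marginal likelihood of a Bayesian linear model fit on the concatenated CNN features).

def greedy_select(candidates, log_evidence, max_k=3):
    """Greedily grow a subset S, at each step adding the pre-trained
    representation that most increases the (assumed) log evidence."""
    S = []
    for _ in range(max_k):
        best = max((c for c in candidates if c not in S),
                   key=lambda c: log_evidence(S + [c]), default=None)
        if best is None or log_evidence(S + [best]) <= log_evidence(S):
            break  # no remaining candidate improves the evidence
        S.append(best)
    return S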

DeepFace by Facebook

Nine-layer deep neural network

Face recognition: Detection → Alignment → Representation → Classification

Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the Gap to Human-Level Performance in Face Verification," CVPR-2014.


Photomath

https://photomath.net/en/


Dense Captioning

Fully convolutional localization network: localization and description jointly

Justin Johnson, Andrej Karpathy, Li Fei-Fei, "DenseCap: Fully Convolutional Localization Networks for Dense Captioning," CVPR-2016.


Fully convolutional localization network

Justin Johnson, Andrej Karpathy, Li Fei-Fei, "DenseCap: Fully Convolutional Localization Networks for Dense Captioning," CVPR-2016.


Text and Natural Language


Gmail’s New AI

Smart Reply uses machine learning in the form of an industrial-strength neural network to write the emails.

▶ Hand-crafted rules for common reply scenarios?
▶ Any engineer's ability to invent 'rules' would be quickly outstripped by the tremendous diversity with which real people communicate.

http://www.ign.com/articles/2015/11/04/gmails-new-ai-will-answer-your-emails-for-you


https://research.googleblog.com/2015/11/computer-respond-to-this-email.html


Machine Translation

http://blog.webcertain.com/wp-content/uploads/2015/02/machine-translation-search-engines.jpg


Neural Machine Translation

Kyunghyun Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," EMNLP-2014.


Why is deep learning so successful?

Compared to the 1980-90s, we now dive deeper with ...

▶ Methods and algorithms
  ▶ Pre-training, dropout, ReLU, batch normalization, and so on
▶ Plenty of data
▶ Computing power (GPU computing)


Restricted Boltzmann Machines


Harmonium (or Restricted Boltzmann Machine)

(Figure: bipartite graph with hidden units h_1, ..., h_K and visible units v_1, ..., v_D)

▶ Harmonium (Smolensky, 1986), aka RBM (Hinton and Sejnowski, 1986).
▶ Undirected model which allows only inter-layer connections (bipartite graph).
▶ An energy-based probabilistic model defines a probability distribution through an energy function, associating an energy to each configuration of the variables of interest:

  p(v) = Σ_h p(v, h) = Σ_h (1/Z) e^{−E(v,h)}.

▶ Learning corresponds to modifying the energy function so that its shape has desirable properties (maximum likelihood estimation).


▶ RBM: v ∈ {0, 1}^D and h ∈ {0, 1}^K
  ▶ Energy is given by E(v, h) = −b⊤v − c⊤h − v⊤Wh.
  ▶ Conditional probabilities are calculated as:

    p(h_j = 1|v) = σ(c_j + Σ_{i=1}^D W_{i,j} v_i),   p(v_i = 1|h) = σ(b_i + Σ_{j=1}^K W_{i,j} h_j).

▶ Gaussian-Bernoulli RBM: v ∈ R^D and h ∈ {0, 1}^K
  ▶ Energy is given by

    E(v, h) = Σ_{i=1}^D (v_i − b_i)² / (2σ_i²) − Σ_{j=1}^K c_j h_j − Σ_{i=1}^D Σ_{j=1}^K v_i W_{i,j} h_j / σ_i.

  ▶ Conditional probabilities are calculated as:

    p(h_j = 1|v) = σ(c_j + Σ_{i=1}^D W_{i,j} v_i / σ_i),
    p(v_i|h) = N(v_i | b_i + σ_i Σ_{j=1}^K W_{i,j} h_j, σ_i²).


Gibbs Sampling in RBM with Binary Units

v^(1) ∼ p(v),
h^(1) ∼ p(h|v^(1)),
v^(2) ∼ p(v|h^(1)),
h^(2) ∼ p(h|v^(2)),
...
v^(k+1) ∼ p(v|h^(k))   (k Gibbs steps).

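A minimal numpy sketch of this chain for the binary RBM, using the conditionals from the previous slide (all names are mine):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_v(v, W, c):   # p(h_j = 1 | v) = σ(c_j + Σ_i W_ij v_i)
    return sigmoid(c + v @ W)

def p_v_given_h(h, W, b):   # p(v_i = 1 | h) = σ(b_i + Σ_j W_ij h_j)
    return sigmoid(b + h @ W.T)

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def gibbs_chain(v, W, b, c, k=1):
    """k alternating Gibbs steps: h ~ p(h|v), then v ~ p(v|h), repeated."""
    for _ in range(k):
        h = bernoulli(p_h_given_v(v, W, c))
        v = bernoulli(p_v_given_h(h, W, b))
    return v, p_h_given_v(v, W, c)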

Contrastive Divergence Learning

Average log-likelihood gradient is approximated by k Gibbs steps,

⟨∂ log p(v)/∂θ⟩_{p(v)} ≈ ⟨−∂E(v, h)/∂θ⟩_{p°} + ⟨∂E(v, h)/∂θ⟩_{p^(k)(v,h)},

where p^(k)(v, h) is the joint distribution determined by k Gibbs steps and

p(v) = (1/N) Σ_{n=1}^N δ(v − v_n),   p° = (1/N) Σ_{n=1}^N p(h|v_n).

Gradient ascent learning leads to

W ← W + η [⟨v h⊤⟩_{p°} − ⟨v h⊤⟩_{p^(k)(v,h)}],
b ← b + η [(1/N) Σ_{n=1}^N v_n − ⟨v⟩_{p^(k)(v,h)}],
c ← c + η [⟨h⟩_{p°} − ⟨h⟩_{p^(k)(v,h)}].

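Reusing p_h_given_v, gibbs_chain, and friends from the sketch above, one CD-k parameter update on a minibatch could look like this (learning rate and minibatch convention are my own choices):

def cd_k_update(V, W, b, c, k=1, eta=0.01):
    """One contrastive-divergence step on a minibatch V (N x D of {0,1})."""
    N = V.shape[0]
    ph0 = p_h_given_v(V, W, c)            # positive statistics under p°
    Vk, phk = gibbs_chain(V, W, b, c, k)  # negative statistics after k Gibbs steps
    W += eta * (V.T @ ph0 - Vk.T @ phk) / N
    b += eta * (V - Vk).mean(axis=0)
    c += eta * (ph0 - phk).mean(axis=0)
    return W, b, c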

Dropout


Dropout

Figure: Taken from Srivastava et al., JMLR 2014.

▶ Form a vector of independent Bernoulli random variables, z^(l), where z_i^(l) ∼ Bern(p).
▶ Feedforward operations are:

  y_i^(l+1) = σ( w_i^(l+1)⊤ (y^(l) ⊙ z^(l)) + b_i^(l+1) ).


Figure: Taken from Srivastava et al., JMLR 2014.

Dropout approximates the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.

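A hedged numpy sketch of one dropout layer with this test-time weight scaling (tanh and the keep probability p are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(y, W, b, p=0.5, train=True):
    """y^(l+1) = σ(W (y ⊙ z) + b) with z_i ~ Bern(p) at train time;
    at test time, use the unthinned network with weights scaled by p."""
    if train:
        z = (rng.random(y.shape) < p).astype(y.dtype)  # Bernoulli mask
        pre = W @ (y * z) + b
    else:
        pre = (p * W) @ y + b  # approximates averaging over all thinned networks
    return np.tanh(pre)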

Deep learning has been a big success in ...


So far, deep learning has been successful in perception:

▶ Seeing
▶ Hearing
▶ Reading

The next challenge will be thinking (inference):

▶ Deep probabilistic models
▶ Sequential decision making (deep RL)


The Deeper and the Better?


Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition," Preprint arXiv:1512.03385, 2015.


How to set the depth?

▶ Bayesian optimization
▶ Deep residual net
▶ Stochastic depth


Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Q. Weinberger, "Deep Networks with Stochastic Depth," Preprint arXiv:1603.09382, 2016.


Uncertainty and Bayes


Bayesian Learning for Neural Networks [Neal, 1996]

▶ Question: Does it make sense to use a network with an infinite number of hidden units, training it by maximizing the likelihood of a finite amount of data?
  ▶ The network will overfit the data and its generalization performance will be poor.
  ▶ Well, however, the idea of selecting the network size depending on the amount of training data makes little sense to a Bayesian.
▶ Radford M. Neal (1996) showed that:
  ▶ It is sensible to consider a limit where the number of hidden units in a net tends to infinity, and good predictions can be obtained from such models using the Bayesian machinery.
  ▶ For fixed hyperparameters, a large class of neural network models will converge to a Gaussian process prior over functions in the limit of an infinite number of hidden units.


Bayesian Neural Networks

▶ NN model parameterized by θ = {W^(L), ..., W^(1)}:

  f(x; θ) = W^(L) σ( ··· W^(2) σ(W^(1) x + b^(1)) ··· )

▶ Regression:

  p(y|x, θ) = N(y | f(x; θ), β^{−1}).

▶ Classification:

  p(y = k|x, θ) = softmax(f(x; θ)) = e^{f_k(x;θ)} / Σ_j e^{f_j(x;θ)}.

▶ Inference
  ▶ First level of inference: fit your model to the data (regularization)
  ▶ Second level of inference: model comparison

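As a small sketch of this model (the layer count, nonlinearity, and the absence of a final bias follow the formula above; everything else is an assumption):

import numpy as np

def f(x, Ws, bs, sigma=np.tanh):
    """f(x; θ) = W^(L) σ(··· σ(W^(1) x + b^(1)) ···);
    Ws = [W^(1), ..., W^(L)], bs = [b^(1), ..., b^(L−1)]."""
    h = x
    for W, b in zip(Ws[:-1], bs):
        h = sigma(W @ h + b)
    return Ws[-1] @ h

def class_probs(x, Ws, bs):
    """p(y = k | x, θ) = softmax_k(f(x; θ))."""
    u = f(x, Ws, bs)
    e = np.exp(u - u.max())  # subtract max for numerical stability
    return e / e.sum()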

▶ Training data: D = (X, Y).
▶ Likelihood: p(Y|X, θ).
▶ Prior over parameters: p(θ).
▶ Posterior over parameters:

  p(θ|X, Y) = p(Y|X, θ) p(θ) / p(Y|X).

▶ Prediction is done by

  p(y*|x*, D) = ∫ p(y*|x*, θ) p(θ|X, Y) dθ.

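Assuming posterior samples θ_s ∼ p(θ|X, Y) are available from some approximate method, the predictive integral is typically estimated by Monte Carlo; a sketch (predict is an assumed callable):

import numpy as np

def predictive(x_star, posterior_samples, predict):
    """p(y*|x*, D) ≈ (1/S) Σ_s p(y*|x*, θ_s) with θ_s ~ p(θ|X, Y);
    predict(x, θ) returns the likelihood term p(y*|x*, θ)."""
    return np.mean([predict(x_star, theta) for theta in posterior_samples], axis=0)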

▶ Difficult to compute the posterior p(θ|X, Y).
▶ Various approximate inference methods have been applied:
  ▶ David J. C. MacKay (1992): Laplace approximation
  ▶ Radford Neal (1995): MCMC
  ▶ Alex Graves (2011): Practical variational inference
  ▶ Jose Miguel Hernandez-Lobato and Ryan P. Adams (2015): Probabilistic backprop
  ▶ Charles Blundell et al. (2015): Bayes by backprop

I am working on BNN with a different approach!


Deep Generative Models


(Figure: discriminative model vs. generative model)

▶ Discriminative model: directly model p(h|v)
▶ Generative model: joint distribution p(v, h) = p(v|h) p(h)


Deep Generative Models

https://www.openai.com/blog/generative-models/

▶ A powerful technique for modeling complex high-dimensional datasets
▶ Often challenging to carry out "inference"
▶ Why generative models?
  ▶ Leverage unlabeled data
  ▶ Environment simulator: reinforcement learning and planning
  ▶ Speech synthesis, machine translation, image segmentation, and so on


Deep + Probabilistic = Good?

▶ Deep layers in hierarchy
  ▶ Good performance in representation, as well as in prediction
  ▶ Huge number of parameters (model compression?) need careful regularization (to avoid overfitting)
  ▶ Lack of uncertainty
▶ Probabilistic models
  ▶ Flexible models with uncertainty
  ▶ Require approximate inference
▶ Deep probabilistic models with Bayesian methods
  ▶ Prediction by MLE: p(x*|θ_ML) or p(y*|x*, θ_ML).
  ▶ Prediction by Bayesian:

    p(x*|X) = ∫ p(x*|θ) p(θ|X) dθ,
    p(y*|x*, Y, X) = ∫ p(y*|x*, θ) p(θ|Y, X) dθ.


Learning Deep Directed Generative Models

Two classes of algorithms:

▶ Variational autoencoders
  ▶ Durk Kingma (OpenAI)
▶ Generative adversarial networks
  ▶ Ian Goodfellow (OpenAI)


Taken from Shakir Mohamed’s slides


Variational Autoencoder [Kingma and Welling, 2014]

▶ Probabilistic decoder: p_θ(x|z)
▶ Probabilistic encoder: q_φ(z|x)
▶ Model parameters θ and variational parameters φ are learned by maximizing a variational lower bound on the marginal likelihood (stochastic gradient variational Bayes).
▶ Decoding MLP and encoding MLP. For instance,

  p(x|z) = N(x|μ, Σ),
  h = σ(W_1 z + b_1),
  μ = W_2 h + b_2,
  log diag(Σ) = W_3 h + b_3.


Stochastic Gradient Variational Bayes

The log-likelihood is bounded from below:

log p(x) = log ∫ p(x, z) dz
         ≥ ∫ q(z|x) log [ p(x, z) / q(z|x) ] dz
         = ∫ q(z|x) log [ p(x|z) p(z) / q(z|x) ] dz
         = E_q[log p(x|z)] − KL[q(z|x) ‖ p(z)],

where the first term is estimated by SGVB and the KL term is computed analytically. E_q[·] denotes the expectation w.r.t. q(z|x), and Monte Carlo estimates are performed with the reparameterization trick:

E_q[log p(x|z)] ≈ (1/L) Σ_{l=1}^L log p(x|z^(l)),

where z^(l) = m + √λ ⊙ ε^(l) and ε^(l) ∼ N(0, I). A single sample is often sufficient to form this Monte Carlo estimate in practice.

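A minimal numpy sketch of this estimator with a diagonal Gaussian q(z|x) = N(m, diag(λ)) and standard Gaussian prior; the decoder log-likelihood log_px_given_z is an assumed callable:

import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, m, log_var, log_px_given_z, L=1):
    """E_q[log p(x|z)] − KL[q(z|x) ‖ p(z)], with z = m + √λ ⊙ ε, λ = exp(log_var)."""
    recon = 0.0
    for _ in range(L):
        eps = rng.standard_normal(m.shape)   # ε ~ N(0, I)
        z = m + np.exp(0.5 * log_var) * eps  # reparameterization trick
        recon += log_px_given_z(x, z) / L
    kl = 0.5 * np.sum(np.exp(log_var) + m**2 - 1.0 - log_var)  # KL to N(0, I), closed form
    return recon - kl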

Kingma and Welling, 2014


VAE with Rank-One Covariance [Suh and Choi, 2016]

(Figure: latent points z^(1), ..., z^(6) in Z mapped to means μ(z) with local principal directions a(z) in data space (X_1, X_2))

▶ Probabilistic decoder: p_θ(x|z)
▶ Probabilistic encoder: q_φ(z|x)
▶ Find the local principal direction a at a specific location μ(z):

  p_θ(x|z) = N(μ, ωI + aa⊤),
  p_θ(z) = N(0, I),
  μ = W_μ h + b_μ,
  log ω = w_ω⊤ h + b_ω,
  a = W_a h + b_a,
  h = tanh(W_h z + b_h).


(a) True images (b) Generated images


Generative Adversarial Network [Goodfellow et al., 2014]

▶ Generator network, G(z): R^K → R^D
  ▶ Captures the data distribution
  ▶ Counterfeiter: tries to fool the discriminator
▶ Discriminative network, D(x): R^D → {0, 1}
  ▶ Police: tries to detect counterfeit images

The discriminator notices a difference between the two distributions; the generator adjusts its parameters slightly to make it go away, until at the end (in theory) the generator exactly reproduces the true data distribution and the discriminator is guessing at random, unable to find a difference.


Two-player minimax game:

min_G max_D  E_{x∼p_data(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))].

▶ Both G and D are MLPs.
▶ Does not require any sophisticated inference methods (variational or sampling).
▶ Alternate between k steps of optimizing D and one step of optimizing G.
▶ In practice, train G with max_G E_{z∼p(z)}[log D(G(z))] (stronger gradients early in learning). A toy end-to-end sketch follows below.

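A toy numpy sketch of this alternating game on 1-D data; the data distribution, quadratic discriminator features, non-saturating generator loss, and all hyperparameters are my own choices, and convergence of such a toy GAN is not guaranteed:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def feats(x):  # features [1, x, x^2] so D can see mean and variance mismatch
    return np.stack([np.ones_like(x), x, x * x], axis=1)

def D(x, w):   # discriminator D(x) = σ(w · [1, x, x^2])
    return sigmoid(feats(x) @ w)

a, b = 1.0, 0.0          # generator G(z) = a z + b, z ~ N(0, 1)
w = np.zeros(3)          # discriminator parameters
eta, k, N = 0.01, 1, 128

for _ in range(5000):
    for _ in range(k):   # k discriminator steps per generator step
        x = rng.normal(4.0, 1.5, N)         # real data x ~ N(4, 1.5^2)
        g = a * rng.standard_normal(N) + b  # fake samples G(z)
        # ascend E[log D(x)] + E[log(1 − D(G(z)))]
        w += eta * (feats(x).T @ (1 - D(x, w)) - feats(g).T @ D(g, w)) / N
    z = rng.standard_normal(N)
    g = a * z + b
    grad_g = (1 - D(g, w)) * (w[1] + 2 * w[2] * g)  # ∂ log D(g)/∂g (non-saturating loss)
    a += eta * np.mean(grad_g * z)                  # ∂g/∂a = z
    b += eta * np.mean(grad_g)                      # ∂g/∂b = 1

print(f"G(z) ~ N({b:.2f}, {abs(a):.2f}^2), target N(4.00, 1.50^2)")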

Question and Discussion
