
5 Deep Learning

• Some Topics in Deep Learning:

∗ Learning algorithms:

· Back propagation, Stochastic Gradient Descent Method

· Dropout, Batch normalization

∗ Generative Adversarial Network (GAN)

∗ Restricted Boltzmann machine

1/71

Multi-layered feedforward neural networks:

input x "−→ output y = f(x;Θ)

Θ : parameter in the model

[Figure: a two-layer network with z0 = x, z1 = φ1(W1x + b1), z2 = φ2(W2z1 + b2); the intermediate values W1x + b1 and W2z1 + b2 are shown on the edges.]

Deep Neural Network (DNN): a NN with many layers.

2/71

f(x;Θ) = φD(· · ·φ2(b2 +W2φ1(b1 +W1x)) · · · ),

Θ = (b1,W1, b2,W2, . . . , bD,WD).

Computation of output

1. z0 = x ∈ R^{d0}

2. For k = 1, . . . , D:

   zk = φk(Wk zk−1 + bk),  Wk ∈ R^{dk×dk−1},  bk ∈ R^{dk},  φk : R^{dk} → R^{dk}

3. output zD
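As a concrete illustration of the procedure above, here is a minimal NumPy sketch of the forward computation; the layer sizes, the random parameters, and the choice of ReLU plus identity activations are made up for the example.

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def forward(x, weights, biases, activations):
    # compute z_D = phi_D(W_D(... phi_1(W_1 x + b_1) ...) + b_D)
    z = x
    for W, b, phi in zip(weights, biases, activations):
        z = phi(W @ z + b)
    return z

# toy network with d0 = 4, d1 = 3, d2 = 2
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]
print(forward(rng.normal(size=4), weights, biases, [relu, lambda v: v]))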


3/71

φ(v): Activation function.

• For φ : R→ R, let us define φ(v) = (φ(v1), . . . ,φ(vd)).

• examples of φ : R → R

∗ Sigmoid function: φ(z) = 1/(1 + e^{−z})

∗ Hyperbolic tangent: φ(z) = tanh(z)

∗ Rectified linear unit (ReLU): φ(z) = max{z, 0}

[Figure: graphs of the sigmoid, tanh, and ReLU activation functions for z ∈ [−4, 4].]

4/71


Neural Networks: an information processing model of the brain.

human brain: more than 10^10 neurons

http://www.lab.kochi-tech.ac.jp/future/1110/okasaka/neural.htm

5/71

• Deep Learning: wide range of applications

6/71

Generic object recognition

• Competition (ILSVRC2012 dataset):

multiclass classification problem with 1.2 million pictures (data) and more than 1,000 classes.

[Figure: trend of the error rate (top-5 error) on the ILSVRC classification task; entries include AlexNet, ZFNet, GoogLeNet, ResNet, SENet, and ensemble models.]

http://aidiary.hatenablog.com/entry/20151108/1446952402

http://image-net.org/challenges/talks_2017/ILSVRC2017_overview.pdf

7/71

Segmentation

• segmentation: classification of each pixel

• application to self-driving cars

Kendall, et al., Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding, arXiv preprint arXiv:1511.02680, 2015.

8/71


Style Transfer

[Figure: content image =⇒ stylized output]

https://research.preferred.jp/2015/09/chainer-gogh/

• middle layers in a CNN: positions of objects & style information

• keep the positions and transfer the style to other pictures

Gatys, et al., A Neural Algorithm of Artistic Style, CVPR 2016.

9/71

• Learning of DNN

10/71

training data: (x1, y1), . . . , (xn, yn)

• regression: xi ∈ R^d, yi ∈ R.

squared loss: ℓ(x, y; Θ) = (1/2)(y − f(x; Θ))^2

squared loss for multi-dimensional output yi ∈ R^{dD}:

ℓ(x, y; Θ) = (1/2)‖y − f(x; Θ)‖^2

11/71

• classification: y ∈ {1, 2, . . . , G}. DNN: f(x; Θ) = (f1(x; Θ), . . . , fG(x; Θ)) ∈ R^G

transform the output f = (f1, . . . , fG) into a probability:

softmax function: sf_y(f) = e^{f_y} / Σ_{ℓ=1}^{G} e^{f_ℓ},  y = 1, . . . , G

logistic loss for the softmax output:

ℓ(x, y; Θ) = − log sf_y(f(x; Θ)),  y ∈ {1, . . . , G}
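As a small illustration, here is a NumPy sketch of the two losses; the log-sum-exp form of the logistic loss is an implementation choice for numerical stability, not something from the slides.

import numpy as np

def squared_loss(y, f):
    # l(x, y; Theta) = (1/2) * ||y - f(x; Theta)||^2
    return 0.5 * np.sum((y - f) ** 2)

def softmax_logistic_loss(y, f):
    # l(x, y; Theta) = -log sf_y(f) = log(sum_l e^{f_l}) - f_y
    m = np.max(f)
    return np.log(np.sum(np.exp(f - m))) + m - f[y]

f = np.array([2.0, -1.0, 0.5])
print(squared_loss(np.array([1.0, 0.0, 0.0]), f))
print(softmax_logistic_loss(0, f))  # class index y = 0 (0-based)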

12/71


Optimization

Purpose: find the optimal solution of

min_Θ L(Θ),   L(Θ) = (1/n) Σ_{i=1}^{n} ℓ(xi, yi; Θ)

• Θ is a high-dimensional parameter.

∗ Learning algorithm: only the function value and the first-order derivatives are available.

∗ Optimization methods that require the Hessian matrix, such as Newton's method, are computationally too demanding.

13/71

Gradient Descent Method (GD)

1. initial point Θ0

2. For t = 0, 1, 2, . . .

• parameter update: Θt+1 = Θt − ηt ∇Θ L(Θt),  where ηt > 0 is an appropriate step size.

• In each iteration, the gradient of the loss at all data points is required:

∂Θ ℓ(xi, yi; Θt),  i = 1, . . . , n

The computation cost is large, so plain GD is not practical for DNN training.

14/71

Stochastic Gradient Descent Method (SGD)

1. initial point Θ0

2. For t = 0, 1, 2, . . .

(a) randomly select m samples, (x_{i1}, y_{i1}), . . . , (x_{im}, y_{im}).

(b) parameter update: Θt+1 = Θt − ηt (1/m) Σ_{k=1}^{m} ∂Θ ℓ(x_{ik}, y_{ik}; Θt),

where ηt > 0 is an appropriate step size.

15/71

• the simplest SGD: m = 1

• SGD with mini-batch: 1 < m ≪ n. Commonly used in DNN learning; typically m = 10 ∼ 1000.

• the update direction of SGD is an unbiased estimator of ∇L(Θ):

E[ (1/m) Σ_{k=1}^{m} ∂Θ ℓ(x_{ik}, y_{ik}; Θt) ] = ∇L(Θt)
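A minimal sketch of the mini-batch SGD loop on a toy least-squares problem; the problem, the constant step size, and the batch size are made up for the illustration.

import numpy as np

def sgd(theta, grad_loss, X, Y, eta=0.1, m=32, epochs=20, seed=0):
    # Theta_{t+1} = Theta_t - eta * (1/m) * sum_k grad l(x_ik, y_ik; Theta_t)
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        for _ in range(n // m):  # n/m iterations = one epoch
            idx = rng.choice(n, size=m, replace=False)
            g = np.mean([grad_loss(X[i], Y[i], theta) for i in idx], axis=0)
            theta = theta - eta * g
    return theta

# toy linear regression: l = (1/2)(y - theta^T x)^2, so grad = -(y - theta^T x) x
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
grad = lambda x, y, th: -(y - th @ x) * x
print(sgd(np.zeros(3), grad, X, Y))  # should approach (1, -2, 0.5)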

16/71


Descent direction of SGD

https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network

17/71

[Figure 18 from "How to Center Deep Boltzmann Machines": evolution of the test log-likelihood over 1000 epochs on (a) MNIST-Sampled and (b) MNIST-Threshold for centered and normal RBMs with 500 hidden units, under learning rates fixed to 0.01 or decayed from 0.01 to 0.001 or 0.0001.]

Evaluation of SGD: the function value versus the epoch.

• In many papers, the function value or the test error is plotted against the epoch.

18/71

epoch:

• An epoch is one complete presentation of the training data set to the learning machine.

∗ SGD with sample size n, mini-batch size m:

1 epoch of DNN learning =⇒ t = n/m iterations of SGD.

19/71

Gradient computation: let us define ℓ(x, y; Θ) = ℓy(f(x; Θ)). Then

∂Θ ℓ(x, y; Θ) = (∂f/∂Θ)(x; Θ) ∇ℓy(f(x; Θ)).

• ℓy(f) = (1/2)(y − f)^2 =⇒ ∇ℓy = f − y

• ℓy(f) = − log sf_y(f1, . . . , fG) =⇒ ∇ℓy = sf(f) − ey,  where sf(f) = (sf1(f), . . . , sfG(f))^T.

note: notation for derivatives:

• For the function ℓy(v1, v2, . . . , vk), define ∇ℓy = (∂ℓy/∂v1, . . . , ∂ℓy/∂vk)^T (the derivative w.r.t. f).

• The Jacobian matrix is denoted by (∂f/∂Θ)(x; Θ). For the function f = (f1, . . . , fd) ∈ R^d, ∂f/∂Θ = (∇Θ f1, . . . , ∇Θ fd) ∈ R^{m×d}, where ∇Θ fk is the gradient vector w.r.t. the parameter Θ.
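The formula ∇ℓy = sf(f) − ey can be checked numerically; a small finite-difference sketch (the test point and step size are arbitrary):

import numpy as np

def softmax(f):
    e = np.exp(f - np.max(f))
    return e / e.sum()

def loss(f, y):
    return -np.log(softmax(f)[y])   # l_y(f) = -log sf_y(f)

f, y, h = np.array([0.3, -1.2, 2.0, 0.1]), 2, 1e-6
analytic = softmax(f) - np.eye(len(f))[y]   # sf(f) - e_y
numeric = np.array([(loss(f + h * np.eye(len(f))[a], y) - loss(f, y)) / h
                    for a in range(len(f))])
print(np.max(np.abs(analytic - numeric)))   # tiny difference expected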

20/71


Exercise

For f = (f1, . . . , fG), let us define sf_y(f) = e^{f_y} / Σ_{ℓ=1}^{G} e^{f_ℓ},  y = 1, . . . , G.

Calculate the gradient vector of log sf_y(f), i.e.,

∇ log sf_y(f) = ( ∂/∂f1 log sf_y(f), . . . , ∂/∂fG log sf_y(f) )^T.

21/71

Back Propagation: a method for calculating ∂f/∂Θ.

This derivative is required to obtain ∂ℓ(x, y; Θ)/∂Θ.

notations:

• Wk zk−1 + bk = (Wk, bk) (zk−1; 1) is rewritten as Wk zk−1 (the bias is absorbed into Wk).

• Define vk = Wk zk−1 = Wk φk−1(vk−1). Then

f(x; Θ) = φD(vD) = φD(WD φD−1(vD−1)) = · · ·

22/71

For f(x; Θ) = φD(vD) = φD(· · · φk(Wk φk−1(vk−1)) · · · ) ∈ R^{dD}, where vk = Wk φk−1(vk−1), we have

∂f/∂vk−1 = (∂φk−1/∂vk−1) Wk^T (∂f/∂vk) ∈ R^{dk−1×dD},

where ∂φk/∂vk = (∇φk,1, ∇φk,2, . . . , ∇φk,dk) ∈ R^{dk×dk}.

Using vk = Wk zk−1, we have

∂fa/∂Wk = (∂fa/∂vk) zk−1^T ∈ R^{dk×dk−1},  a = 1, . . . , dD.

23/71

• calculate the function value from the input layer to the output layer:

z0 = x → z1 → · · · → zD−1 → zD = f(x; Θ),
v1 = W1 z0 → v2 = W2 z1 → · · · → vD

• calculate the derivatives from the output layer to the input layer:

∂f/∂vD = (∂φD/∂vD)(vD) → ∂f/∂vD−1 = (∂φD−1/∂vD−1)(vD−1) WD^T (∂f/∂vD) → · · · → ∂f/∂v1,

∂fa/∂Wk = (∂fa/∂vk) zk−1^T,  a = 1, . . . , dD,

∂Wk ℓ(x, y; Θ) = Σ_{a=1}^{dD} (∂fa/∂Wk)(x; Θ) (∇ℓy(f(x; Θ)))_a.

DNN learning requires an enormous number of matrix computations, for which GPUs are highly beneficial.
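A minimal sketch of this forward/backward scheme for a two-layer net with ReLU, identity output, and squared loss (biases are omitted and the sizes are arbitrary); it illustrates the recursion above rather than being a general implementation.

import numpy as np

def backprop(x, y, W1, W2):
    # gradients of l = (1/2)||y - f||^2 for f = W2 * relu(W1 x)
    v1 = W1 @ x
    z1 = np.maximum(v1, 0.0)              # forward: input -> output
    f = W2 @ z1                           # identity activation in the last layer
    delta2 = f - y                        # grad of l_y at the last layer
    delta1 = (W2.T @ delta2) * (v1 > 0)   # backward through W2 and relu'
    return np.outer(delta1, x), np.outer(delta2, z1)   # dl/dW1, dl/dW2

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))
gW1, gW2 = backprop(rng.normal(size=3), rng.normal(size=2), W1, W2)
print(gW1.shape, gW2.shape)               # (5, 3) and (2, 5), matching W1 and W2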

24/71


Calculation of Derivatives

Fast automatic differentiation

• function: combination of basic operations (the four arithmetic operations, exp, log, trigonometric functions)

• partial differentiation: linear-algebraic calculation using the derivatives of the basic operations and the chain rule.

ex. derivative of sin((x1 + x2)·x2^2)  [the computational-graph figure is omitted]
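Assuming the intended expression is sin((x1 + x2)·x2^2), a minimal sketch that applies the chain rule by hand and compares against finite differences (the evaluation point is arbitrary):

import numpy as np

def f(x1, x2):
    return np.sin((x1 + x2) * x2**2)

def grad_f(x1, x2):
    # chain rule with u = (x1 + x2) * x2^2: df/dx_i = cos(u) * du/dx_i
    u = (x1 + x2) * x2**2
    return np.cos(u) * x2**2, np.cos(u) * (x2**2 + 2.0 * x2 * (x1 + x2))

x1, x2, h = 0.7, -0.4, 1e-6
print(grad_f(x1, x2))
print((f(x1 + h, x2) - f(x1, x2)) / h, (f(x1, x2 + h) - f(x1, x2)) / h)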

25/71

ReLU

When φ(z) is the sigmoid or tanh function, φ′(z) ≈ 0 as |z| → ∞;

−→ a small gradient, together with numerical error, yields a zero gradient and learning ceases.

Rectified Linear Unit (ReLU): φ(z) = max{z, 0}, with φ′(0) := 0 for simplicity:

φ′(z) = 1 for z > 0,  and φ′(z) = 0 for z ≤ 0.

26/71

Improvement of SGD

27/71

• Momentum SGD: add a momentum term to suppress oscillation (a minimal sketch follows this list):

Θt+1 = Θt − ηt × (descent direction of SGD) + α(Θt − Θt−1)

• AdaGrad: automatic control of the step size ηt.

Duchi, et al., "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 12 (2011): 2121–2159.

• Adam (commonly used in DNN learning): both the momentum and the step size are adaptively adjusted. Kingma, Diederik, and Jimmy Ba, "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980 (2014).

other methods: see https://en.wikipedia.org/wiki/Stochastic_gradient_descent
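A minimal sketch of the momentum (heavy-ball) update on a toy quadratic; the step size, the momentum coefficient, and the use of the exact gradient instead of a stochastic one are simplifications for the illustration. AdaGrad and Adam follow the cited papers and are not reproduced here.

import numpy as np

def momentum_sgd(theta0, grad, eta=0.1, alpha=0.5, steps=100):
    # Theta_{t+1} = Theta_t - eta * g_t + alpha * (Theta_t - Theta_{t-1})
    theta_prev = theta0.copy()
    theta = theta0.copy()
    for _ in range(steps):
        theta_next = theta - eta * grad(theta) + alpha * (theta - theta_prev)
        theta_prev, theta = theta, theta_next
    return theta

# toy objective L(theta) = (1/2)||theta||^2, so grad L = theta; minimum at 0
print(momentum_sgd(np.array([5.0, -3.0]), lambda th: th))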

28/71


Optimization: Momentum, AdaGrad/AdaDelta

Alec Radford, https://imgur.com/s25RsOr

Momentum: Add velocity, like a ball with mass rolling downhill

NAG (Nesterov accelerated gradient): jumps ahead, recalculates the gradient, and checks for overshooting.

AdaGrad/AdaDelta (Duchi, Hazan, Singer, 2011): keep a history of gradients and decrease the learning rate in directions with a history of large gradients.


29/71

Regularization

• DNN with too many layers

∗ learning time is quite long since the number of parameters to be learned

is enormous.

∗ overfit to training data

How can we circumvent these issues?

• Regularization:

∗ minimize empirical loss with regularization term.

∗ Early stopping: use a suboptimal solution of empirical loss.

∗ Dropout: In each iteration of SGD, randomly choose the set of parameters

to be updated.

30/71

Dropout

Srivastava, et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting,

JMLR, 15, pp.1929–1958, 2014.

• Training phase: in each step of SGD,

∗ in each layer, choose units with probability p.

∗ update the parameters associated with the chosen units.

• Test phase:

∗ the unit is always present and the weights are multiplied by p.

note: in the test phase, the number of active units is 1/p times that in the training phase.
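A minimal sketch of the train/test behaviour described above for a single layer; the retention probability, the layer sizes, and the random inputs are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                              # retention probability
W = rng.normal(size=(4, 8))          # weights of one layer (toy sizes)
z = rng.normal(size=8)               # activations of the previous layer

# training phase: each unit of z is kept with probability p
mask = rng.uniform(size=8) < p
train_out = W @ (z * mask)           # only the kept units contribute

# test phase: all units are present, the weights are multiplied by p
test_out = (p * W) @ z
print(train_out, test_out)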

31/71

Dropout:

[Figure 1 from the paper: Dropout Neural Net Model. Left: a standard neural net with 2 hidden layers. Right: an example of a thinned net produced by applying dropout to the network on the left; crossed units have been dropped.]

32/71


Interpretation of Dropout

• The effect of Dropout is close to L2-regularization on the weights

∗ Baldi and Sadowski, Understanding Dropout, NIPS 2013.

• Relation to Bayesian inference.

∗ Gal, Ghahramani, Dropout as a Bayesian Approximation, NIPS 2015.

∗ Kendall, Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, NIPS

2017.

33/71

Batch normalization

Ioffe, Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift,

ICML 2015.

Vanishing gradient problem

When φ(z) = tanh(z) (or the sigmoid) is used as the activation function in a DNN, φ′(z) ≈ 0 for large |z|;

−→ the gradient almost vanishes and the learning process comes to a standstill.

classical technique: normalize the input vectors x1, . . . , xn so that the mean value is zero.

34/71

normalization: For x = (x1, . . . , xd) ∈ R^d,

x −→ n(x) = (x − µ)/σ   (element-wise calculation),

µℓ = E_P[xℓ],  σℓ = sqrt(V_P[xℓ]),  ℓ = 1, . . . , d.

35/71

DNN: even if the input x is normalized, the inputs to the middle layers may have a severe bias. This bias results in slow training;

−→ a correction is required.

• standard computation:

1. vk = Wkzk−1

2. zk = φk(vk)

• normalization:

1. vk = Wkzk−1

2. uk = nk(vk): with estimated µ, σ → batch normalization.

3. zk = φk(uk)

36/71


Add middle layers for normalization

• reduces the dependency on the initial values of the training parameters.

• suppresses overfitting to the training data.

37/71

• batch normalization:

∗ inputs in the mini-batch: x_{i1}, . . . , x_{im}

∗ inputs to the k-th middle layer in the mini-batch: v_{k,i1}, . . . , v_{k,im}

Using the mini-batch, estimate the mean and standard deviation −→ µk, σk.

correction: bn_k(vk; γk, βk) = γk ∘ (vk − µk)/(σk + ε) + βk,

where ∘ is the Hadamard product, γk, βk are learning parameters, and ε in the denominator is introduced for numerical stability.
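A minimal sketch of this correction for one mini-batch of pre-activations vk (rows are the m samples, columns the dk units); the values of γk, βk, and ε are arbitrary here.

import numpy as np

def batch_norm(V, gamma, beta, eps=1e-5):
    # bn_k(v; gamma, beta) = gamma o (v - mu_k) / (sigma_k + eps) + beta
    mu = V.mean(axis=0)              # mini-batch mean, per unit
    sigma = V.std(axis=0)            # mini-batch standard deviation, per unit
    return gamma * (V - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
V = rng.normal(loc=3.0, scale=2.0, size=(16, 5))     # m = 16 samples, d_k = 5 units
U = batch_norm(V, gamma=np.ones(5), beta=np.zeros(5))
print(U.mean(axis=0), U.std(axis=0))                 # roughly 0 and 1 per unit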

38/71

Computation

1. v1 = W1 x

2. For k = 1, . . . , D − 1:  v_{k+1} = W_{k+1} φk(bn_k(vk; γk, βk))

• Learning parameters: Wk,γk,βk

• Computation without BN: vk+1 = Wk+1φk(vk).

39/71

Back propagation method:

uk = bn_k(vk; γk, βk),  vk = Wk φk−1(uk−1),

f(x; Θ) = φD(uD) = φD(· · · bn_k(Wk φk−1(uk−1); γk, βk) · · · ) = · · ·

back propagation for the derivatives:

∂f/∂uk−1 = (∂vk/∂uk−1)(∂f/∂vk) = (∂φk−1/∂uk−1) Wk^T (∂f/∂vk),

∂f/∂vk = (γk/σk) ∘ (∂f/∂uk)   (column-wise Hadamard product)

40/71


derivatives w.r.t. the learning parameters:

∂fa/∂Wk = (∂fa/∂vk) · φk−1(uk−1)^T,  a = 1, . . . , dD,

∂f/∂γk = ((vk − µk)/σk) ∘ (∂f/∂uk),

∂f/∂βk = ∂f/∂uk.

41/71

Exercise

1. Calculate the gradient when the batch normalization layer is introduced, i.e., prove the following equations:

∂f/∂vk = (γk/σk) ∘ (∂f/∂uk)   (column-wise Hadamard product),

∂f/∂γk = ((vk − µk)/σk) ∘ (∂f/∂uk)   (column-wise Hadamard product).

42/71

• Deep Learning: Generative Adversarial Nets (GAN)

∗ Goodfellow, et al., Generative Adversarial Nets, NIPS 2014.

43/71

Generative models

• training data: x1, . . . ,xn ∼i.i.d. p0

• purpose: generate samples from p0.

1. estimation of distribution

2. data generation

44/71


• model: x = G(z;Θg), z ∼ q0

∗ G(z; Θg): a DNN

∗ q0(z): predefined simple distribution such as Gaussian, Uniform, etc.

∗ generate data: z′ ∼ q0 −→ x′ = G(z′;Θg)

Let p(x) be the distribution of x = G(z;Θg), z ∼ q0(z).

• learning of complex distributions: employ DNN G(z;Θg) as feature

extractor.

• With standard methods for estimating a probability distribution (cf. MLE), data generation is not necessarily straightforward.

45/71

Learning Methods

Estimation of p0:

Is x′ = G(z; Θg) with z ∼ q0 generated from p0(x)?

Interpret the distribution estimation as the classification problem.

• classifier x "−→ D(x;Θd) ∈ [0, 1]

• D(x;Θd) ≥ 1/2 ⇒ decision is “x is generated from p0”

46/71

Learn the classifier D

• x ∼ p0(x) ⇒ D(x; Θd) is large

• x ∼ p(x) ⇒ D(x; Θd) is small (1 − D(x; Θd) is large)

max_{Θd}  E_{x∼p0}[log D(x; Θd)] + E_{z∼q0}[log(1 − D(G(z); Θd))]

−→ D(x; Θd) discriminates between p0(x) and p(x).

47/71

If the distributions of x ∼ p0 and G(z; Θg) ∼ p are close,

⇔ it is difficult to distinguish them using D.

Learn the generator G

min_{Θg} max_{Θd}  E_{x∼p0}[log D(x; Θd)] + E_{z∼q0}[log(1 − D(G(z; Θg); Θd))]

−→ learn G(z; Θg).

Learning algorithm: solve the min-max problem using SGD with mini-batches.

48/71


• repeat until the solution converges:

1. sampling: resample x′1, . . . , x′m from the training data, and generate z1, . . . , zm from q0.

2. Update Θd:  Θd ← Θd + ηd δΘd,

δΘd = (1/m) Σ_{i=1}^{m} ∇Θd { log D(x′i) + log(1 − D(G(zi))) }

3. Update Θg:  Θg ← Θg − ηg δΘg,

δΘg = (1/m) Σ_{i=1}^{m} ∇Θg log(1 − D(G(zi)))
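A minimal one-dimensional sketch of this alternating scheme, with a logistic-regression discriminator and a location-shift generator instead of DNNs; the target distribution N(3, 1), the learning rates, and the number of iterations are all made up for the illustration (the saturating log(1 − D) loss can make progress slow).

import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
rng = np.random.default_rng(0)

theta_d = np.zeros(2)   # D(x) = sigmoid(theta_d[0] * x + theta_d[1])
theta_g = 0.0           # G(z) = z + theta_g, z ~ N(0, 1); target p0 = N(3, 1)
eta_d, eta_g, m = 0.05, 0.05, 64

for _ in range(2000):
    # 1. sampling: real data and generator inputs
    x_real = 3.0 + rng.normal(size=m)
    z = rng.normal(size=m)
    x_fake = z + theta_g
    # 2. ascend in theta_d on log D(x_real) + log(1 - D(G(z)))
    d_real = sigmoid(theta_d[0] * x_real + theta_d[1])
    d_fake = sigmoid(theta_d[0] * x_fake + theta_d[1])
    grad_d = np.array([np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake),
                       np.mean(1 - d_real) - np.mean(d_fake)])
    theta_d += eta_d * grad_d
    # 3. descend in theta_g on log(1 - D(G(z)))
    d_fake = sigmoid(theta_d[0] * x_fake + theta_d[1])
    theta_g -= eta_g * np.mean(-d_fake * theta_d[0])

print(theta_g)   # should drift toward 3.0, the mean of p0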

49/71

[Figures from the paper: Figure 2 shows samples from the model on (a) MNIST, (b) TFD, (c) CIFAR-10 (fully connected model), and (d) CIFAR-10 (convolutional discriminator and "deconvolutional" generator); the rightmost column shows the nearest training example of the neighboring sample. Figure 3 shows digits obtained by linearly interpolating between coordinates in z space of the full model.]

50/71

Optimal solution of D(x):

max_D ∫ { p0(x) log D(x) + p(x) log(1 − D(x)) } dx

−→ D*(x) = p0(x) / (p0(x) + p(x))   (optimize at each x)

note: the above objective is concave in D(x).

Interpretation: sign(D*(x) − 1/2) is the Bayes rule of the following binary classification problem:

p(x | +1) = p0(x),  p(x | −1) = p(x),  P(y = +1) = P(y = −1) = 1/2

51/71

Find the optimal p(x), the distribution of G(z; Θg):

substitute D*(x) and optimize w.r.t. p:

min_{p: pdf} ∫ { p0(x) log [ p0(x)/(p0(x) + p(x)) ] + p(x) log [ p(x)/(p0(x) + p(x)) ] } dx

−→ p(x) = p0(x)

52/71


Proof.

∫ { p0(x) log [ p0(x)/(p0(x) + p(x)) ] + p(x) log [ p(x)/(p0(x) + p(x)) ] } dx

= 2 ∫ [(p0(x) + p(x))/2] { [p0(x)/(p0(x) + p(x))] log [p0(x)/(p0(x) + p(x))] + [p(x)/(p0(x) + p(x))] log [p(x)/(p0(x) + p(x))] } dx

≥ 2 ∫ [(p0(x) + p(x))/2] (− log 2) dx = −2 log 2

equality condition: p(x) = p0(x), i.e., the distribution of G(z; Θg), z ∼ q0(z), is the same as p0(x). □

• GAN: DNNs are used to learn the classifier D and the generator G.

53/71

Exercise

1. Find the optimal solution of the following problem:

max_{q∈(0,1)}  p0(x) log q + p(x) log(1 − q)

2. Find the optimal solution of the following problem:

min_{r∈(0,1)}  r log r + (1 − r) log(1 − r)

54/71

• Learning of Boltzmann machine

∗ Restricted Boltzmann Machine (RBM)

∗ Contrastive divergence method

∗ Deep Boltzmann machine

55/71

Statistical model for the Boltzmann Machine

probability of the multi-dimensional discrete r.v. x = (x1, . . . , xd) ∈ {0, 1}^d:

p(x; W, b) = exp{x^T W x + x^T b} / Z,   Z: normalization constant,

W = (w_ab) ∈ Sym(d),  w_aa = 0,  b ∈ R^d.

For simplicity, we assume b = 0 and denote the probability by p(x; W):

p(x; W) = exp{x^T W x} / Z

56/71


• Estimate W from i.i.d. data x1, . . . , xn, xi = (xi1, . . . , xid) ∈ {0, 1}^d.

• MLE: min_W ℓ(W),  ℓ(W) = −(1/n) Σ_{i=1}^{n} log p(xi; W)

computation of the gradient for optimization:

(∂/∂w_ab) log [ e^{x^T W x} / Σ_x e^{x^T W x} ] = xa xb − Σ_x xa xb e^{x^T W x} / Σ_x e^{x^T W x}

=⇒ ∇ℓ(W) = −(1/n) Σ_i xi xi^T + E_W[x x^T]

57/71

gradient method:

1. step size ηt > 0, initial value W0.

2. Repeat for t = 0, 1, 2, . . . until Wt converges:

Wt+1 = Wt − ηt ∇ℓ(Wt)

Computation of Z and E_W[x x^T] requires a sum over 2^d terms, which is infeasible;

−→ we cannot compute ∇ℓ(W) exactly.

Approximation of ∇ℓ(W):

• Boltzmann machine: Markov Chain Monte Carlo (MCMC)

• Restricted Boltzmann Machine: contrastive divergence method.

MCMC

x = (x1, . . . , xd): update each element in turn, x1 → x2 → · · · → xd → x1 → · · · .

Let x(−a) be the (d − 1)-dimensional vector obtained by dropping xa from x.

1. Repeat a = 1, 2, . . . , d, 1, 2, . . . , d, . . .:

x′a ∼ p(xa | x(−a); Wt),  xa ←− x′a.

2. Output x, which approximates a sample from p(x; W).

59/71

Sampling from p(xa | x(−a); W)

p(xa | x(−a); W) = p(x1, . . . , xd; W) / Σ_{xa} p(x1, . . . , xd; W)
= exp{2 xa Σ_b xb w_ab} / (exp{2 Σ_b xb w_ab} + 1)
= exp{2 xa (W x)_a} / (exp{2 (W x)_a} + 1)   (tractable)

Sampling method:

1. generate u ∼ U[0, 1].

2. xa = 1 if u ≤ p(1 | x(−a); W), and xa = 0 otherwise.
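A minimal NumPy sketch of this Gibbs sampler; the dimension, the random symmetric W with zero diagonal, and the number of sweeps are made up for the example.

import numpy as np

def gibbs_sweep(x, W, rng):
    # one sweep x_1 -> x_2 -> ... -> x_d for p(x; W) proportional to exp{x^T W x}
    for a in range(len(x)):
        s = W[a] @ x - W[a, a] * x[a]              # (Wx)_a without the x_a term
        p1 = 1.0 / (1.0 + np.exp(-2.0 * s))        # p(x_a = 1 | x_(-a); W)
        x[a] = 1.0 if rng.uniform() <= p1 else 0.0
    return x

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.3, size=(d, d))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
x = (rng.uniform(size=d) < 0.5).astype(float)
for _ in range(100):    # run the chain for a while before using x as a sample
    x = gibbs_sweep(x, W, rng)
print(x)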

60/71


Boltzmann machine with hidden nodes

x = (v, h), where h denotes the hidden nodes, and

W = ( Wv  Wvh ; Wvh^T  Wh ).

p(v; W) ∝ Σ_h exp{x^T W x} ∝ exp{v^T Wv v} Σ_h exp{2 v^T Wvh h + h^T Wh h}

61/71

data: v1, . . . , vn (h is not observed)

min_W ℓ(W),  ℓ(W) = −(1/n) Σ_{i=1}^{n} log Σ_h p(vi, h; W)

(∂/∂w_ab) log [ Σ_h e^{x^T W x} / Z ] = Σ_h xa xb e^{x^T W x} / Σ_h e^{x^T W x} − E_W[xa xb]
= E_W[xa xb | v] − E_W[xa xb]

=⇒ ∇ℓ(W) = −(1/n) Σ_{i=1}^{n} E_W[x x^T | vi] + E_W[x x^T]

62/71

• the 1st term in the gradient: the expectation of (vi; h)(vi^T, h^T) (the outer product of the stacked vector) under p(h | vi; W)

• the 2nd term in the gradient: the expectation of x x^T under p(v, h; W)

Exact calculation of these expectations is computationally demanding.

Approximate the expectations by sample means:

we need samples h ∼ p(h | vi; W) and (v, h) ∼ p(v, h; W).

• RBM: sampling from p(h | vi; W) is easy.

63/71

Restricted Boltzmann machine; RBM

A Boltzmann machine with hidden nodes and a bipartite graph structure:

( Wv  Wvh ; Wvh^T  Wh ) = ( O  W ; W^T  O ),   p(v, h; W) ∝ exp{v^T W h}

Hinton and Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks", Science, 313 (5786): 504–507, 2006.

64/71


Feature extraction using RBM

v "−→ h ∼ p(h|v) "−→ v′ ∼ p(v′|h)

dimension reduction: dimv = dimv′ > dimh.

• training the network so as to be input ≈ output.

• cut-off the noise by the hidden nodes.

65/71

Deep Boltzmann Machines

[Figure 2 from Salakhutdinov and Hinton, "Deep Boltzmann Machines": Left, a three-layer Deep Belief Network and a three-layer Deep Boltzmann Machine. Right, pretraining consists of learning a stack of modified RBMs that are then composed to create a deep Boltzmann machine.]

Pr(v, h1, h2, h3) = Pr(h2, h3) Pr(h1 | h2) Pr(v | h1)

Split the network into a stack of RBMs. Hinton, Osindero, Teh, "A fast learning algorithm for deep belief nets", Neural Computation, Vol. 18, Issue 7, pp. 1527–1554, 2006.

66/71

Some Properties of RBM

Conditional independence of the hidden nodes and the visible nodes:

p(v | h; W) = exp{v^T W h} / Σ_v exp{v^T W h},

Σ_v exp{v^T W h} = Σ_v exp{Σ_ℓ vℓ (W h)ℓ} = Π_ℓ Σ_{vℓ} exp{vℓ (W h)ℓ} = Π_ℓ (1 + e^{(W h)ℓ}).

67/71

Hence we have

• p(v | h; W) = Π_ℓ e^{vℓ (W h)ℓ} / (1 + e^{(W h)ℓ}) = Π_ℓ p(vℓ | h; W),  with p(vℓ = 1 | h; W) = e^{(W h)ℓ} / (1 + e^{(W h)ℓ}).

• p(h | v; W) = Π_ℓ e^{hℓ (W^T v)ℓ} / (1 + e^{(W^T v)ℓ}) = Π_ℓ p(hℓ | v; W),  with p(hℓ = 1 | v; W) = e^{(W^T v)ℓ} / (1 + e^{(W^T v)ℓ}).

68/71


Learning RBM

Computation of ∇ℓ(W): we need E_W[x x^T | vi] and E_W[x x^T], where x = (v^T, h^T)^T.

• Conditional independence gives h ∼ p(h | v; W) and v ∼ p(v | h; W) directly; data generation is easy.

Python code: implementation of h ∼ p(h | v; W)

>>> import numpy as np
>>> vdim = 10; hdim = 100  # dimensions
>>> # initialize W
>>> W = np.random.normal(size=vdim*hdim).reshape(vdim, hdim)
>>> v = np.random.uniform(size=vdim) < 0.2  # initialize v
>>> # prob: array of p(h_l=1|v; W), l=1,..,hdim
>>> q = np.exp(np.dot(W.T, v)); prob = q/(1+q)
>>> # sampling h from p(h|v; W)
>>> hsample = np.random.uniform(size=hdim) < prob

69/71

• For a sample vi, generate many h from p(h | vi; W),

−→ approximate E_W[x x^T | vi] by the sample mean.

• Generate (v, h) from p(v, h; W):

1. randomly choose v′0 = vi from the training data

2. Repeat for t = 0, 1, 2, . . . , T:

h′_{t+1} ∼ p(h | v′_t; W),  v′_{t+1} ∼ p(v | h′_{t+1}; W)

3. Sample (v′_T, h′_T)

• generate many (v′_T, h′_T), and approximate E_W[x x^T] by the sample mean.

Practically, T = 1 is sufficient: the contrastive divergence method. Carreira-Perpinan and Hinton, On Contrastive Divergence Learning, AISTATS 2005.
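Continuing the Python snippet above, a minimal sketch of one contrastive-divergence (T = 1) parameter update for the RBM p(v, h; W) ∝ exp{v^T W h}; the learning rate, the sizes, and the random training vector are made up for the example.

import numpy as np

rng = np.random.default_rng(0)
vdim, hdim, eta = 10, 100, 0.1
W = rng.normal(scale=0.1, size=(vdim, hdim))

def sample_h(v, W):
    prob = 1.0 / (1.0 + np.exp(-(W.T @ v)))   # p(h_l = 1 | v; W)
    return prob, (rng.uniform(size=W.shape[1]) < prob).astype(float)

def sample_v(h, W):
    prob = 1.0 / (1.0 + np.exp(-(W @ h)))     # p(v_l = 1 | h; W)
    return prob, (rng.uniform(size=W.shape[0]) < prob).astype(float)

# one CD-1 update for a single training vector v0 (here a random binary vector)
v0 = (rng.uniform(size=vdim) < 0.2).astype(float)
ph0, h0 = sample_h(v0, W)
_, v1 = sample_v(h0, W)                       # T = 1 reconstruction
ph1, _ = sample_h(v1, W)
W += eta * (np.outer(v0, ph0) - np.outer(v1, ph1))   # approximates -grad l(W)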

70/71

Learning MNIST (hand-written digit data) with an RBM

[Figure: MNIST data, and the training result of the RBM: visualization of the weights from the observed nodes to the hidden nodes.]

Compression from 28^2 = 784 dimensions to 100 dimensions

→ useful for feature extraction.

71/71