
Deep Learning Theory and Practice
Lecture 5

Introduction to deep neural networks

Dr. Ted Willke willke@pdx.edu

Tuesday, January 21, 2020

Review of Lecture 4

• Adaline ‘neuron’ minimizes the squared loss

The Adaline Algorithm:

1: w(1) = 0
2: for iteration t = 1, 2, 3, . . .
3:     pick a point (at random) (x*, y*) ∈ D
4:     compute s = wTx*        (forward pass of ‘signal’)
5:     update the weights w(t + 1) = w(t) + η ⋅ (y* − s(t)) ⋅ x*        (backward pass of updates)
6:     t ← t + 1
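To make the loop concrete, here is a minimal NumPy sketch of the algorithm above; the function name, learning rate, iteration budget, and random seed are illustrative choices, not part of the slide:

import numpy as np

def adaline(X, y, eta=0.01, iters=1000, seed=0):
    # A minimal sketch of the Adaline loop. X: (N, d) inputs, y: (N,) targets.
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)                          # 1: w(1) = 0
    for t in range(iters):                   # 2: for iteration t = 1, 2, 3, ...
        n = rng.integers(N)                  # 3: pick a point (x*, y*) at random
        s = w @ X[n]                         # 4: forward pass: s = wT x*
        w = w + eta * (y[n] - s) * X[n]      # 5: backward pass: w(t+1) = w(t) + η(y* − s)x*
    return w                                 # 6: t ← t + 1 happens via the loop counter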

Review of Lecture 4

• Logistic regression: Better classification

Uses hw(x) = θ(wTx), where θ(s) = 1 / (1 + e−s).

Gives us the probability of y being the label.

• Learning should strive to maximize this joint probability over the training data:

P(y1, . . . , yN | x1, . . . , xN) = ∏n=1..N P(yn | xn) .

• The principle of maximum likelihood says we can do this if we minimize the cross-entropy error:

Ein(w) = (1/N) ∑n=1..N ln(1 + e−yn wTxn) .

• We can’t minimize this analytically, but we can numerically/iteratively set ∇wEin(w) → 0:

1. Compute the gradient: gt = ∇Ein(w(t))
2. Move in the direction: v̂t = −gt
3. Update the weights: w(t + 1) = w(t) + ηv̂t
4. Repeat until converged! A convex problem.
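A minimal NumPy sketch of this four-step recipe for the logistic-regression error above (the function name, iteration budget, and the ±1 label convention are assumptions for illustration):

import numpy as np

def logreg_gd(X, y, eta=0.1, iters=1000):
    # Batch gradient descent on Ein(w) = (1/N) sum_n ln(1 + exp(-y_n wT x_n)).
    # X: (N, d) inputs, y: (N,) labels in {-1, +1}.
    N, d = X.shape
    w = np.zeros(d)
    for t in range(iters):
        g = -(y[:, None] * X) / (1.0 + np.exp(y * (X @ w)))[:, None]   # per-example gradients
        g = g.mean(axis=0)              # 1. compute the gradient g_t = ∇Ein(w(t))
        v = -g                          # 2. move in the direction v̂ = −g_t
        w = w + eta * v                 # 3. update the weights
    return w                            # 4. (fixed iteration budget here instead of a convergence test)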

Review: Gradient Descent

Ball on complicated hilly terrain:
- rolls down to a local valley; this is called a local minimum.

Questions:

1. How to get to the bottom of the deepest valley?

2. How to do this when we don’t have gravity :-)?

Today’s Lecture

•Review of gradient descent

•What is a deep neural network?

•How do we train one?

•How do we train one efficiently?

•Tutorial: Image classification using a logistic regression network

(Many slides adapted from Yaser Abu-Mostafa and Malik Magdon-Ismail, with permission of the authors. Thanks guys!)


Our Ein has only one valley

… because Ein(w) is a convex function of w.

Can you prove this for logistic regression?

How to ‘roll’ down

Assume you are at weights w(t) and you take a step of size η in the direction v̂:

w(t + 1) = w(t) + ηv̂

We get to select v̂. Select v̂ to make Ein(w(t + 1)) as small as possible.

What’s the best direction to take a step in?

The gradient is the fastest way to roll down

The first-order term η∇Ein(w(t))Tv̂ is approximately the change in Ein.

What choice of v̂ will maximize this change (so that its negative gives the steepest descent)?

Maximizing the descent

How do we maximize ΔEin?

ΔEin ≈ η∇Ein(w(t))Tv̂ = η∥∇Ein(w(t))∥∥v̂∥ cos(θ) ,   where θ is the angle between ∇Ein and v̂.

This is maximized when cos(θ) = 1, i.e., when v̂ points in the direction of ∇Ein(w(t)).

Therefore, we can take the largest negative step when

v̂ = −∇Ein(w(t)) / ∥∇Ein(w(t))∥ .
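A quick numerical check of this conclusion (the test function, seed, and all names below are purely illustrative):

import numpy as np

# Among unit directions v̂, the first-order change ∇Ein(w)ᵀ v̂ is most negative
# for v̂ = −∇Ein(w) / ‖∇Ein(w)‖. Here grad() is an arbitrary smooth test gradient.
grad = lambda w: np.array([2 * (w[0] - 1.0), 6 * (w[1] + 2.0)])

w = np.zeros(2)
g = grad(w)
steepest = -g / np.linalg.norm(g)

rng = np.random.default_rng(0)
dirs = rng.standard_normal((1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # 1000 random unit directions

print(steepest @ g <= (dirs @ g).min())               # True: no direction does better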

‘Rolling down’ = iterating the negative gradient


The ‘Goldilocks’ step size


Fixed learning rate gradient descent

Define the step size to scale with the size of the gradient: ηt = η ∥∇Ein(w(t))∥.

Then, with v̂ = −∇Ein(w(t))/∥∇Ein(w(t))∥, the update becomes

w(t + 1) = w(t) − η ∇Ein(w(t))

(this reduces the step size as the minimum is approached, since ∥∇Ein∥ → 0).

Gradient descent can minimize any smooth function.
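A tiny sketch of this fixed-learning-rate rule applied to a smooth function (the quadratic example and all names are illustrative):

import numpy as np

def gradient_descent(grad, w0, eta=0.1, iters=200):
    # Fixed learning rate: w(t+1) = w(t) − η ∇Ein(w(t)).
    w = np.array(w0, dtype=float)
    for t in range(iters):
        w = w - eta * grad(w)       # the step shrinks automatically as ∇Ein → 0
    return w

# Illustrative smooth function E(w) = (w0 − 3)^2 + (w1 + 1)^2, minimized at (3, −1).
print(gradient_descent(lambda w: 2 * (w - np.array([3.0, -1.0])), w0=[0.0, 0.0]))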

Summary of linear models

(Running example: credit analysis.)

• Perceptron: Approve or Deny; Classification Error (PLA)
• Linear regression: Amount of Credit; Squared Error (Pseudo-inverse)
• Logistic regression: Probability of Default; Cross-Entropy Error (Gradient Descent)

Today’s Lecture

•Review of gradient descent

•What is a deep neural network?

•How do we train one?

•How do we train one efficiently?

•Tutorial: Image classification using a logistic regression network


The neural network - biologically inspired

(Figure: biological function vs. biological structure.)

Biological inspiration, not bio-literalism

Engineering success can draw upon biological inspiration at many levels of abstraction. We must account for the unique demands and constraints of the in-silico system.

XOR: A limitation of the linear model

f = h̄1h2 + h1h̄2 , where h1(x) = sign(w1Tx) and h2(x) = sign(w2Tx).

The XOR of two linear separators is not itself linearly separable, so no single linear model can represent f.

Perceptrons for OR and AND

OR(x1, x2) = sign(x1 + x2 + 1.5)
AND(x1, x2) = sign(x1 + x2 − 1.5)

Representing f using OR and AND

f = h̄1h2 + h1h̄2 = OR( AND(h̄1, h2), AND(h1, h̄2) )
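As a sanity check, a short NumPy sketch that composes the OR and AND perceptrons above into f, using the slides' ±1 convention (the function names are mine):

import numpy as np

def sign(s):
    # ±1 Boolean convention, with sign(s) as the activation
    return np.where(s >= 0, 1.0, -1.0)

def OR(a, b):
    return sign(a + b + 1.5)

def AND(a, b):
    return sign(a + b - 1.5)

def f(h1, h2):
    # f = (NOT h1) AND h2, OR, h1 AND (NOT h2); with ±1 values, NOT h is just -h
    return OR(AND(-h1, h2), AND(h1, -h2))

for h1 in (-1.0, 1.0):
    for h2 in (-1.0, 1.0):
        print(h1, h2, f(h1, h2))   # agrees with XOR of h1 and h2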

The multilayer perceptron

(Figure: the multilayer perceptron, a 3-layer ‘feedforward’ network; the middle layers are the hidden layers.)

Universal Approximation

Any target function f that can be decomposed into linear separators can be implemented by a 3-layer MLP.

A powerful model

(Figure: the target decision boundary alongside approximations using 8 perceptrons and 16 perceptrons.)

Red flags for generalization and optimization.

What tradeoff is involved here?

Minimizing Ein

The combinatorial challenge for the MLP is even greater than that of the perceptron.

Ein is not smooth (due to sign( ⋅ )), so we cannot use gradient descent.

sign(x) ≈ tanh(x) ⟶ gradient descent to minimize Ein.

The deep neural network

input layer: l = 0    hidden layers: 0 < l < L    output layer: l = L

How the network operates

Weights w(l)ij, where 1 ≤ l ≤ L indexes layers, 0 ≤ i ≤ d(l−1) indexes inputs, and 1 ≤ j ≤ d(l) indexes outputs.

x(l)j = θ(s(l)j) = θ( ∑i=0..d(l−1) w(l)ij x(l−1)i )

Apply x to the input layer: x(0)1 . . . x(0)d(0) → . . . → x(L)1 = h(x)

θ(s) = tanh(s) = (es − e−s) / (es + e−s)
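A minimal sketch of this forward computation in NumPy, assuming each W(l) is stored as a (d(l−1)+1) × d(l) matrix whose row 0 holds the bias weights (a storage convention chosen here for illustration):

import numpy as np

def forward(x, weights):
    # Forward pass for a fully connected tanh network.
    a = np.asarray(x, dtype=float)                # x(0)
    for W in weights:
        a = np.concatenate(([1.0], a))            # prepend the constant x0 = 1
        a = np.tanh(W.T @ a)                      # x(l) = θ(s(l)), with s(l) = (W(l))T x(l-1)
    return a                                      # x(L) = h(x)

# Example: a 2-3-1 network with small random weights (shapes chosen for illustration).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 3)) * 0.1, rng.standard_normal((4, 1)) * 0.1]
print(forward([0.5, -0.2], weights))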

Today’s Lecture

•Review of gradient descent

•What is a deep neural network?

•How do we train one?

•How do we train one efficiently?

•Tutorial: Image classification using a logistic regression network


How can we efficiently train a deep network?

Gradient descent minimizes

Ein(w) = (1/N) ∑n=1..N e(h(xn), yn)        (for logistic regression, e(h(xn), yn) = ln(1 + e−yn wTxn))

by iterative steps along −∇Ein:

Δw = −η∇Ein(w)

∇Ein is based on ALL examples (xn, yn): ‘batch’ GD.

The stochastic aspect

‘Average’ direction:

𝔼n[ −∇e(h(xn), yn) ] = (1/N) ∑n=1..N −∇e(h(xn), yn) = −∇Ein .

Pick one (xn, yn) at a time and apply GD to e(h(xn), yn):

stochastic gradient descent (SGD)

A randomized version of GD.

Benefits of SGD


Randomization helps.

1. cheaper computation

2. randomization

3. simple

Rule of thumb: η = 0.1 works (adjust empirically, on an exponential scale).
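Putting the pieces together, a minimal SGD sketch for the logistic-regression error (the epoch structure, seed, and ±1 label convention are assumptions for illustration):

import numpy as np

def logreg_sgd(X, y, eta=0.1, epochs=50, seed=0):
    # SGD on the cross-entropy error: pick one (x_n, y_n) at a time and step along
    # the negative gradient of e(h(x_n), y_n) = ln(1 + exp(-y_n wT x_n)).
    # X: (N, d), y: (N,) with labels in {-1, +1}; eta = 0.1 is the rule of thumb above.
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        for n in rng.permutation(N):                       # random order each epoch
            g = -y[n] * X[n] / (1.0 + np.exp(y[n] * (w @ X[n])))
            w = w - eta * g
    return w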

The linear signal

The input s(l) to layer l is a linear combination (using the weights) of the outputs x(l−1) of the previous layer:

s(l) = (W(l))Tx(l−1)

(recall the linear signal s = wTx).

Forward propagation: Computing h(x)

Minimizing Ein

Using θ = tanh makes Ein differentiable, so we can use gradient descent (or SGD) ⟶ local min.

Gradient descent


Gradient descent of Ein

We need the gradient, i.e., the partial derivatives ∂e(x)/∂w(l)ij for every weight.

Numerical Approach

Estimate each partial derivative with a finite difference: approximate and inefficient :-(

Algorithmic Approach :-)

e(x) is a function of s(l), and s(l) = (W(l))Tx(l−1), so by the chain rule

∂e(x)/∂w(l)ij = ∂e(x)/∂s(l)j ⋅ ∂s(l)j/∂w(l)ij = δ(l)j ⋅ x(l−1)i ,

where the ‘sensitivity’ is δ(l)j = ∂e(x)/∂s(l)j.

Computing δ(l) using the chain rule

Multiple applications of the chain rule give a backward recursion (for θ = tanh):

δ(l−1)i = (1 − (x(l−1)i)²) ∑j=1..d(l) w(l)ij δ(l)j ,

starting from the output sensitivity δ(L) and working back toward the first hidden layer.

The backpropagation algorithm

Algorithm for gradient descent on Ein. Can do a batch version or a sequential version (SGD).
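Since the slide only names the algorithm, here is a minimal sketch of one sequential (SGD) backpropagation step, assuming a tanh network, a squared per-example error, and a bias stored as row 0 of each weight matrix (these storage and naming choices are mine, not the slides'):

import numpy as np

def backprop_step(x, y, weights, eta=0.1):
    # One sequential (SGD) step for a tanh network with per-example error e = (x(L) − y)^2.
    # weights[l-1] is the (d(l-1)+1) x d(l) matrix W(l); row 0 holds the bias weights.

    # Forward pass: store the (bias-augmented) outputs x(l) of every layer.
    xs = [np.concatenate(([1.0], np.asarray(x, dtype=float)))]
    for W in weights:
        xs.append(np.concatenate(([1.0], np.tanh(W.T @ xs[-1]))))
    out = xs[-1][1:]                                     # x(L) = h(x)

    # Backward pass: sensitivities δ(l) = ∂e/∂s(l), using tanh'(s) = 1 − x^2.
    delta = 2.0 * (out - y) * (1.0 - out ** 2)           # δ(L)
    grads = [None] * len(weights)
    for l in range(len(weights), 0, -1):
        grads[l - 1] = np.outer(xs[l - 1], delta)        # ∂e/∂W(l): entries x(l-1)_i δ(l)_j
        if l > 1:
            x_prev = xs[l - 1][1:]                       # drop the bias coordinate
            delta = (1.0 - x_prev ** 2) * (weights[l - 1][1:, :] @ delta)   # δ(l-1)

    # Gradient descent step on this single example.
    return [W - eta * g for W, g in zip(weights, grads)]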

Digits Data


Today’s Lecture

•Review of gradient descent

•What is a deep neural network?

•How do we train one?

•How do we train one efficiently?

•Tutorial: Image classification using a logistic regression network


Further reading

• Abu-Mostafa, Y. S., Magdon-Ismail, M., Lin, H.-T. (2012) Learning from data. AMLbook.com.

• Goodfellow et al. (2016) Deep Learning. https://www.deeplearningbook.org/

• Boyd, S., and Vandenberghe, L. (2018) Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares. http://vmls-book.stanford.edu/

• VanderPlas, J. (2016) Python Data Science Handbook. https://jakevdp.github.io/PythonDataScienceHandbook/
