Page 1:

Deep Learning Theory and Practice, Lecture 5

Introduction to deep neural networks

Dr. Ted Willke [email protected]

Tuesday, January 21, 2020

Page 2:

Review of Lecture 4

• Adaline ‘neuron’ minimizes the squared loss

The Adaline Algorithm:

1: w(1) = 0
2: for iteration t = 1, 2, 3, . . .
3:     pick a point (at random) (x*, y*) ∈ D
4:     compute s(t) = w(t)ᵀx*
5:     update the weights: w(t + 1) = w(t) + η ⋅ (y* − s(t)) ⋅ x*
6:     t ← t + 1

Forward pass of ‘signal’: compute s = wᵀx
Backward pass of updates: w(t + 1) = w(t) + η ⋅ (y* − s(t)) ⋅ x*
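
A minimal Python/NumPy sketch of the algorithm above (the function name adaline and the synthetic data are illustrative assumptions, not from the lecture):

import numpy as np

def adaline(X, y, eta=0.01, iterations=1000, seed=0):
    """X: (N, d) inputs with a leading column of 1s; y: (N,) real targets."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # step 1: w(1) = 0
    for t in range(iterations):              # step 2: for t = 1, 2, 3, ...
        n = rng.integers(len(y))             # step 3: pick a point at random
        s = w @ X[n]                         # step 4: forward pass, s = w^T x*
        w = w + eta * (y[n] - s) * X[n]      # step 5: backward pass, update w
    return w

# Example usage on a tiny synthetic dataset (illustrative only):
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + 0.1 * rng.normal(size=100)
print(adaline(X, y))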

Page 3:

Review of Lecture 4

• Logistic regression: Better classification

Uses h_w(x) = θ(wᵀx), where θ(s) = 1 / (1 + e^(−s)).

Gives us the probability of y being the label:

• Learning should strive to maximize this joint probability over the training data:

P(y_1, …, y_N | x_1, …, x_N) = ∏_{n=1}^{N} P(y_n | x_n) .

• The principle of maximum likelihood says we can do this if we minimize this error:

E_in(w) = (1/N) ∑_{n=1}^{N} ln(1 + e^(−y_n wᵀx_n))

• We can’t minimize this analytically, but we can numerically/iteratively set ∇_w E_in(w) → 0:

1. Compute the gradient: g_t = ∇E_in(w(t))

2. Move in the direction: v̂_t = −g_t

3. Update the weights: w(t + 1) = w(t) + η v̂_t

4. Repeat until converged! (A convex problem.)
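
A minimal sketch of these four steps for logistic regression in Python/NumPy (function and variable names such as gradient_descent, X, and y are illustrative assumptions, not from the lecture):

import numpy as np

def gradient_descent(X, y, eta=0.1, steps=1000):
    """X: (N, d) inputs with a leading column of 1s; y: (N,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for t in range(steps):
        margins = y * (X @ w)
        # 1. compute the gradient g_t = grad E_in(w(t)) of the cross-entropy error
        g = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        # 2. move in the direction v = -g_t;  3. update the weights
        w = w - eta * g
        # 4. repeat until converged (here: until the gradient is tiny)
        if np.linalg.norm(g) < 1e-6:
            break
    return w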

Page 4:

Review: Gradient Descent


Ball on complicated hilly terrain

- rolls down to a local valley

this is called a local minimum

Questions:

1. How to get to the bottom of the deepest valley?

2. How to do this when we don’t have gravity :-)?

Page 5:

Today’s Lecture

•Review of gradient descent

•What is a deep neural network?

•How do we train one?

•How do we train one efficiently?

•Tutorial: Image classification using a logistic regression network

(Many slides adapted from Yaser Abu-Mostafa and Malik Magdon-Ismail, with permission of the authors. Thanks guys!)

Page 6:

Today’s Lecture

•Review of gradient descent

•What is a deep neural network?

•How do we train one?

•How do we train one efficiently?

•Tutorial: Image classification using a logistic regression network

(Many slides adapted from Yaser Abu-Mostafa and Malik Magdon-Ismail, with permission of the authors. Thanks guys!)

Page 7:

Our E_in has only one valley

… because E_in(w) is a convex function of w.

Can you prove this for logistic regression?
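
One way to see this (a sketch of the argument, not the assigned proof): each term of the cross-entropy error composes the convex scalar function below with the affine map w ↦ y_n wᵀx_n, and a nonnegative sum of convex functions is convex.

\[
\frac{d^{2}}{ds^{2}}\,\ln\!\left(1+e^{-s}\right)
  \;=\;\frac{e^{s}}{\left(1+e^{s}\right)^{2}}\;>\;0
  \qquad\Longrightarrow\qquad
  s \mapsto \ln\!\left(1+e^{-s}\right)\ \text{is convex.}
\]

Hence E_in(w) = (1/N) ∑_n ln(1 + e^(−y_n wᵀx_n)) is convex in w.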

Page 8:

How to ‘roll’ down

Assume you are at weights w(t) and you take a step of size η in the direction v̂:

w(t + 1) = w(t) + η v̂

We get to select v̂.

Select v̂ to make E_in(w(t + 1)) as small as possible.

What’s the best direction to take a step in?

Page 9:

The gradient is the fastest way to roll down

Approximately, the change in E_in for a small step is

ΔE_in = E_in(w(t) + η v̂) − E_in(w(t)) ≈ η ∇E_in(w(t))ᵀ v̂ .

What choice of v̂ will make ΔE_in as negative as possible (i.e., maximize the descent)?

Page 10:

Maximizing the descent

How do we maximize the descent?

ΔE_in ≈ η ∇E_in(w(t))ᵀ v̂
      = η ‖∇E_in(w(t))‖ ‖v̂‖ cos(θ), where θ is the angle between ∇E_in and v̂.

The magnitude of this change is maximized when cos(θ) = ±1, i.e., when v̂ points along ∇E_in(w(t)); the decrease is largest when cos(θ) = −1.

Therefore, we can take the largest negative step when

v̂ = −∇E_in(w(t)) / ‖∇E_in(w(t))‖ .
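
A small Python/NumPy check of this fact (the toy E_in below and all names are assumptions for illustration): the first-order change η ∇E_in(w)ᵀv̂ is never more negative for a random unit direction than for v̂ = −∇E_in/‖∇E_in‖.

import numpy as np

def E_in(w):
    # arbitrary smooth toy error, only for illustration
    return np.log(1.0 + np.exp(w[0] - 2 * w[1])) + 0.1 * w @ w

def grad_E_in(w):
    s = 1.0 / (1.0 + np.exp(-(w[0] - 2 * w[1])))   # sigmoid of the linear term
    return s * np.array([1.0, -2.0]) + 0.2 * w

rng = np.random.default_rng(0)
w, eta = np.array([1.0, -1.0]), 0.1
g = grad_E_in(w)

v_star = -g / np.linalg.norm(g)                  # steepest-descent direction
print("steepest direction:", eta * g @ v_star)   # the most negative value

for _ in range(3):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)                       # random unit direction
    print("random direction:  ", eta * g @ v)    # never more negative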

Page 11:

‘Rolling down’ = iterating the negative gradient


Page 12:

The ‘Goldilocks’ step size


Page 13:

Fixed learning rate gradient descent

Define the step as a fixed multiple of the gradient: Δw(t) = −η ∇E_in(w(t)).

Then w(t + 1) = w(t) − η ∇E_in(w(t)), with effective step size η ‖∇E_in(w(t))‖

(reduces step size as minimum is approached)

Gradient descent can minimize any smooth function.
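
A tiny Python sketch of fixed-learning-rate descent on an assumed 1-D example (E(w) = (w − 3)², not from the slides), showing the actual step η|E′(w)| shrinking as the minimum is approached:

# derivative of the toy error E(w) = (w - 3)^2
def dE(w):
    return 2.0 * (w - 3.0)

w, eta = 0.0, 0.1
for t in range(25):
    step = -eta * dE(w)      # fixed learning rate; step itself shrinks near the minimum
    w += step
print(w, abs(step))          # w -> 3, |step| -> 0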

Page 14:

Summary of linear models

Credit Analysis           Model                   Error measure (algorithm)
Approve or Deny           Perceptron              Classification Error (PLA)
Amount of Credit          Linear regression       Squared Error (Pseudo-inverse)
Probability of Default    Logistic regression     Cross-Entropy Error (Gradient Descent)

Page 15:

Today’s Lecture

•Review of gradient descent

•What is a deep neural network?

•How do we train one?

•How do we train one efficiently?

•Tutorial: Image classification using a logistic regression network


Page 16:

The neural network - biologically inspired

[Figure: a biological neuron, contrasting biological function and biological structure]

Page 17:

Biological inspiration, not bio-literalism

Engineering success can draw upon biological inspiration at many levels of abstraction. We must account for the unique demands and constraints of the in-silico system.

Page 18:

XOR: A limitation of the linear model


Page 19:

XOR: A limitation of the linear model

f = h1 h̄2 + h̄1 h2   (where h̄ denotes the negation of h)

h1(x) = sign(w1ᵀx),  h2(x) = sign(w2ᵀx)

Page 20:

Perceptrons for OR and AND


OR(x1, x2) = sign(x1 + x2 + 1.5) AND(x1, x2) = sign(x1 + x2 − 1.5)

Page 21:

Representing f using OR and AND

f = h1 h̄2 + h̄1 h2

Page 22:

Representing f using OR and AND

f = h1 h̄2 + h̄1 h2
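
A Python/NumPy sketch of this construction (the particular separators w1 and w2 below are illustrative assumptions; the OR and AND units are the ones from the previous slide):

import numpy as np

def sign(s):
    return np.where(s >= 0, 1, -1)

def perceptron(w):
    # h(x) = sign(w0 + w1*x1 + w2*x2)
    return lambda x: sign(w[0] + w[1] * x[0] + w[2] * x[1])

h1 = perceptron(np.array([-0.5, 1.0, 0.0]))   # assumed linear separator 1
h2 = perceptron(np.array([-0.5, 0.0, 1.0]))   # assumed linear separator 2

def OR(a, b):
    return sign(a + b + 1.5)

def AND(a, b):
    return sign(a + b - 1.5)

def f(x):
    a, b = h1(x), h2(x)
    # f = h1*not(h2) + not(h1)*h2, built from OR and AND of +/-1 signals
    return OR(AND(a, -b), AND(-a, b))

for x in [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]:
    print(x, f(x))   # reproduces XOR: -1, +1, +1, -1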

Page 23:

The multilayer perceptron

[Figure: multilayer perceptron network diagram, combining linear units such as w2ᵀx]

3 layers ‘feedforward’

hidden layers

Page 24:

Universal Approximation


Any target function f that can be decomposed into linear separators can be implemented by a 3-layer MLP.

Page 25:

A powerful model

[Figure: target function vs. MLP approximations using 8 perceptrons and 16 perceptrons]

Red flags for generalization and optimization.

What tradeoff is involved here?

Page 26:

Minimizing E_in

The combinatorial challenge for the MLP is even greater than that of the perceptron.

E_in is not smooth (due to sign(⋅)), so we cannot use gradient descent.

sign(x) ≈ tanh(x) ⟶ gradient descent to minimize E_in.

Page 27:

The deep neural network


input layer (l = 0), hidden layers (0 < l < L), output layer (l = L)

Page 28:

How the network operates

Weights w_ij^(l), with

1 ≤ l ≤ L (layers), 0 ≤ i ≤ d^(l−1) (inputs), 1 ≤ j ≤ d^(l) (outputs)

x_j^(l) = θ(s_j^(l)) = θ( ∑_{i=0}^{d^(l−1)} w_ij^(l) x_i^(l−1) )

Apply x to the input layer x_1^(0) … x_{d^(0)}^(0) → … → x_1^(L) = h(x)

θ(s) = tanh(s) = (e^s − e^(−s)) / (e^s + e^(−s))
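
A minimal Python/NumPy sketch of this forward pass for a tanh network (the weight-matrix shapes, function names, and example sizes are assumptions for illustration; W[l] has shape (1 + d^(l−1), d^(l)) so that row 0 multiplies the constant bias input x_0 = 1):

import numpy as np

def forward(x, weights):
    """Return x^(L)_1 = h(x) and all layer outputs x^(0), ..., x^(L)."""
    xs = [np.concatenate(([1.0], x))]            # x^(0), with bias coordinate
    for l, W in enumerate(weights, start=1):
        s = W.T @ xs[-1]                         # s^(l) = (W^(l))^T x^(l-1)
        x_l = np.tanh(s)                         # x^(l)_j = theta(s^(l)_j)
        if l < len(weights):                     # hidden layers keep a bias 1
            x_l = np.concatenate(([1.0], x_l))
        xs.append(x_l)
    return xs[-1][0], xs

# Example: a 2 -> 3 -> 1 network with random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
h, _ = forward(np.array([0.5, -1.0]), weights)
print(h)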

Page 29:

Today’s Lecture

•Review of gradient descent

•What is a deep neural network?

•How do we train one?

•How do we train one efficiently?

•Tutorial: Image classification using a logistic regression network


Page 30:

How can we efficiently train a deep network?

Gradient descent minimizes

E_in(w) = (1/N) ∑_{n=1}^{N} e(h(x_n), y_n)

(e.g., e(h(x_n), y_n) = ln(1 + e^(−y_n wᵀx_n)) for logistic regression)

by iterative steps along −∇E_in:

Δw = −η ∇E_in(w)

∇E_in is based on ALL examples (x_n, y_n): ‘batch’ GD

Page 31:

The stochastic aspect

‘Average’ direction:

𝔼_n[ −∇e(h(x_n), y_n) ] = −(1/N) ∑_{n=1}^{N} ∇e(h(x_n), y_n) = −∇E_in

Pick one (x_n, y_n) at a time. Apply GD to e(h(x_n), y_n).

stochastic gradient descent (SGD)

A randomized version of GD.
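
A minimal Python/NumPy sketch of SGD for the logistic-regression error, applying the gradient-descent update to one randomly chosen example at a time (function and variable names are illustrative assumptions):

import numpy as np

def sgd(X, y, eta=0.1, epochs=50, seed=0):
    """X: (N, d) inputs with a leading column of 1s; y: (N,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):            # pick points one at a time
            # gradient of e(h(x_n), y_n) = ln(1 + exp(-y_n w^T x_n))
            g_n = -y[n] * X[n] / (1.0 + np.exp(y[n] * (w @ X[n])))
            w = w - eta * g_n                        # GD step on a single example
    return w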

Page 32:

Benefits of SGD


Randomization helps.

1. cheaper computation

2. randomization

3. simple

Rule of thumb:

η = 0.1 works

(empirically adjust; exponentially)

Page 33:

The linear signal

The input s^(l) is a linear combination (using the weights) of the outputs x^(l−1) of the previous layer.

(recall the linear signal s = wᵀx)

Page 34:

Forward propagation: Computing h(x)

Page 35:

Minimizing E_in

Using θ = tanh makes E_in differentiable, so we can use gradient descent (or SGD) ⟶ local min.

Page 36:

Gradient descent


Page 37:

Gradient descent of E_in

We need the partial derivatives ∂e(x)/∂w_ij^(l) for every weight (all i, j, l).

Page 38:

Numerical Approach

Approximate each partial derivative numerically, e.g. by finite differences:

∂e/∂w_ij^(l) ≈ [ e(w_ij^(l) + ε) − e(w_ij^(l) − ε) ] / (2ε)

inefficient: a separate perturbed computation for every weight :-(

Page 39:

Algorithmic Approach :-)

e(x) is a function of s^(l), and s^(l) = (W^(l))ᵀ x^(l−1), so by the chain rule

∂e/∂w_ij^(l) = (∂e/∂s_j^(l)) ⋅ (∂s_j^(l)/∂w_ij^(l)) = δ_j^(l) ⋅ x_i^(l−1) ,

where δ_j^(l) = ∂e/∂s_j^(l) is the ‘sensitivity’.

Page 40:

Computing δ^(l) using the chain rule

Multiple applications of the chain rule:
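
For concreteness, assume θ = tanh and the squared per-example error e(x) = (x_1^(L) − y)² (a common choice; the slide’s own equations are not preserved in this transcript). Then the chain rule gives the backward recursion

\[
\delta^{(L)}_{1}
  = \frac{\partial e}{\partial s^{(L)}_{1}}
  = 2\bigl(x^{(L)}_{1}-y\bigr)\bigl(1-(x^{(L)}_{1})^{2}\bigr),
\qquad
\delta^{(l-1)}_{i}
  = \bigl(1-(x^{(l-1)}_{i})^{2}\bigr)\sum_{j=1}^{d^{(l)}} w^{(l)}_{ij}\,\delta^{(l)}_{j},
\]

and the weight gradients are ∂e/∂w_ij^(l) = x_i^(l−1) δ_j^(l).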

Page 41:

The backpropagation algorithm


Page 42:

Algorithm for gradient descent on E_in

Can do batch version or sequential version (SGD).
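
A Python/NumPy sketch of the per-example backpropagation step, consistent with the forward-pass sketch above and assuming tanh units with squared error (an illustrative implementation, not the lecture’s code):

import numpy as np

def backprop(x, y, weights):
    """Return the list of gradients de/dW^(l) for one example (x, y)."""
    # Forward pass: store every layer output x^(l) (with bias entries).
    xs = [np.concatenate(([1.0], x))]
    for l, W in enumerate(weights, start=1):
        x_l = np.tanh(W.T @ xs[-1])
        if l < len(weights):
            x_l = np.concatenate(([1.0], x_l))
        xs.append(x_l)

    # Backward pass: sensitivities delta^(l) = de/ds^(l).
    grads = [None] * len(weights)
    delta = 2.0 * (xs[-1] - y) * (1.0 - xs[-1] ** 2)         # delta^(L)
    for l in range(len(weights), 0, -1):
        grads[l - 1] = np.outer(xs[l - 1], delta)            # de/dW^(l) = x^(l-1) delta^(l)
        if l > 1:
            # back through W^(l) (dropping the bias row), times theta'(s^(l-1))
            delta = (1.0 - xs[l - 1][1:] ** 2) * (weights[l - 1][1:] @ delta)
    return grads

# Example usage with the 2 -> 3 -> 1 network from the forward-pass sketch:
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
print([g.shape for g in backprop(np.array([0.5, -1.0]), 1.0, weights)])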

Page 43:

Digits Data


Page 44:

Today’s Lecture

•Review of gradient descent

•What is a deep neural network?

•How do we train one?

•How do we train one efficiently?

•Tutorial: Image classification using a logistic regression network


Page 45:

Further reading

• Abu-Mostafa, Y. S., Magdon-Ismail, M., Lin, H.-T. (2012) Learning from data. AMLbook.com.

• Goodfellow et al. (2016) Deep Learning. https://www.deeplearningbook.org/

• Boyd, S., and Vandenberghe, L. (2018) Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares. http://vmls-book.stanford.edu/

• VanderPlas, J. (2016) Python Data Science Handbook. https://jakevdp.github.io/PythonDataScienceHandbook/
