Deep Learning Theory and Practice
Lecture 5
Introduction to deep neural networks
Dr. Ted Willke [email protected]
Tuesday, January 21, 2020
Review of Lecture 4
• Adaline 'neuron' minimizes the squared loss.

The Adaline Algorithm:
1: w(1) = 0
2: for iteration t = 1, 2, 3, . . .
3:    pick a point (at random) (x*, y*) ∈ D
4:    compute s(t) = w(t)Tx*                                     [forward pass of 'signal']
5:    update the weights w(t + 1) = w(t) + η ⋅ (y* − s(t)) ⋅ x*   [backward pass of updates]
6:    t ← t + 1
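A minimal NumPy sketch of this loop, assuming a toy dataset, learning rate, and iteration count of my own choosing (none of these particulars are from the lecture):

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset D: each x already includes the bias coordinate x0 = 1.
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = np.sign(X @ np.array([0.5, 1.0, -1.0]))   # labels in {-1, +1}

eta = 0.01                                    # learning rate
w = np.zeros(X.shape[1])                      # 1: w(1) = 0

for t in range(1000):                         # 2: for iteration t = 1, 2, 3, ...
    n = rng.integers(len(X))                  # 3: pick a point (x*, y*) at random from D
    x_star, y_star = X[n], y[n]
    s = w @ x_star                            # 4: forward pass, s(t) = w(t)^T x*
    w = w + eta * (y_star - s) * x_star       # 5: backward pass, squared-loss update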
Review of Lecture 4
• Logistic regression: Better classification.
  Uses hw(x) = θ(wTx), where θ(s) = 1 / (1 + e−s).
  Gives us the probability of y being the label.
• Learning should strive to maximize this joint probability over the training data:
  P(y1, . . . , yN | x1, . . . , xN) = ∏n=1..N P(yn | xn) .
• The principle of maximum likelihood says we can do this if we minimize the cross-entropy error
  Ein(w) = (1/N) ∑n=1..N ln(1 + e−yn wTxn) .
• We can't minimize this analytically, but we can numerically/iteratively drive ∇wEin(w) → 0. This is a convex problem:
  1. Compute the gradient gt = ∇Ein(w(t)).
  2. Move in the direction v̂t = −gt.
  3. Update the weights: w(t + 1) = w(t) + ηv̂t.
  4. Repeat until converged!
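The four steps above, written as a small NumPy sketch for the logistic-regression cross-entropy error (the synthetic data, learning rate, and stopping tolerance are illustrative assumptions, not values from the lecture):

import numpy as np

def Ein(w, X, y):
    # Cross-entropy in-sample error: (1/N) sum_n ln(1 + exp(-y_n w^T x_n))
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def grad_Ein(w, X, y):
    # Gradient: -(1/N) sum_n y_n x_n / (1 + exp(y_n w^T x_n))
    return -(X.T @ (y / (1.0 + np.exp(y * (X @ w))))) / len(y)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])  # bias + 2 features
y = np.sign(X @ np.array([0.25, 1.0, -1.0]))                   # labels in {-1, +1}

eta, w = 0.1, np.zeros(X.shape[1])
for t in range(10000):
    g = grad_Ein(w, X, y)            # 1. compute the gradient
    if np.linalg.norm(g) < 1e-6:     # 4. stop once the gradient is (nearly) zero
        break
    w = w - eta * g                  # 2-3. step in the direction v = -g_t
print(Ein(w, X, y))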
Review: Gradient Descent
Ball on complicated hilly terrain:
- rolls down to a local valley (this is called a local minimum)
Questions:
1. How to get to the bottom of the deepest valley?
2. How to do this when we don’t have gravity :-)?
Today’s Lecture
•Review of gradient descent
•What is a deep neural network?
•How do we train one?
•How do we train one efficiently?
•Tutorial: Image classification using a logistic regression network
(Many slides adapted from Yaser Abu-Mostafa and Malik Magdon-Ismail, with permission of the authors. Thanks guys!)
Our Ein has only one valley
… because Ein(w) is a convex function of w.
Can you prove this for logistic regression?
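One possible answer, sketched here in LaTeX (this argument is my own, not from the slides): each summand of Ein is a convex function composed with an affine map of w.

\[
e_n(\mathbf{w}) = \ln\bigl(1 + e^{-y_n \mathbf{w}^\top \mathbf{x}_n}\bigr) = \phi(y_n \mathbf{w}^\top \mathbf{x}_n),
\qquad \phi(z) = \ln(1 + e^{-z}).
\]
\[
\phi''(z) = \theta(z)\,\theta(-z) \ge 0
\;\Longrightarrow\; \phi \text{ is convex}
\;\Longrightarrow\; e_n \text{ is convex in } \mathbf{w}
\;\Longrightarrow\; E_{\mathrm{in}}(\mathbf{w}) = \tfrac{1}{N}\sum_n e_n(\mathbf{w}) \text{ is convex.}
\]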
How to ‘roll’ down
Assume you are at weights w(t) and you take a step of size η in the direction v̂:
w(t + 1) = w(t) + ηv̂
We get to select v̂.
Select v̂ to make Ein(w(t + 1)) as small as possible.
What's the best direction to take a step in?
The gradient is the fastest way to roll down
ΔEin ≈ the change in Ein for such a step.
What choice of v̂ will maximize the decrease in Ein?
Maximizing the descent
How do we maximize the decrease ΔEin?
ΔEin ≈ η∇Ein(w(t))Tv̂
     ≈ η∥∇Ein(w(t))∥∥v̂∥ cos(θ) , where θ is the angle between ∇Ein and v̂.
The magnitude is maximized when cos(θ) = 1, i.e., when v̂ points in the direction of ∇Ein(w(t)).
Therefore, we can take the largest negative step when
v̂ = −∇Ein(w(t)) / ∥∇Ein(w(t))∥ .
‘Rolling down’ = iterating the negative gradient
The ‘Goldilocks’ step size
Fixed learning rate gradient descent
Define ηt = η∥∇Ein(w(t))∥.
Then the step is Δw = ηtv̂ = −η∇Ein(w(t))
(reduces step size as minimum is approached).
Gradient descent can minimize any smooth function.
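A small sketch contrasting the two step rules (grad_Ein stands for any function returning ∇Ein(w); the default η is an illustrative choice):

import numpy as np

def gd_step(w, grad_Ein, eta=0.1, fixed_rate=True):
    g = grad_Ein(w)
    if fixed_rate:
        # Fixed learning rate: step length eta * ||g|| shrinks as the minimum nears.
        return w - eta * g
    # Normalized step: always move a distance eta along the unit vector -g / ||g||.
    return w - eta * g / np.linalg.norm(g)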
Summary of linear models (Credit Analysis)
• Perceptron: Approve or Deny? Classification Error (PLA)
• Linear regression: Amount of Credit. Squared Error (Pseudo-inverse)
• Logistic regression: Probability of Default. Cross-Entropy Error (Gradient Descent)
Today’s Lecture
•Review of gradient descent
•What is a deep neural network?
•How do we train one?
•How do we train one efficiently?
•Tutorial: Image classification using a logistic regression network
The neural network - biologically inspired
(Figure: biological function and biological structure.)
Biological inspiration, not bio-literalism
Engineering success can draw upon biological inspiration at many levels of abstraction. We must account for the unique demands and constraints of the in-silico system.
XOR: A limitation of the linear model
f = h1h̄2 + h̄1h2
h1(x) = sign(w1Tx)    h2(x) = sign(w2Tx)
Perceptrons for OR and AND
OR(x1, x2) = sign(x1 + x2 + 1.5)
AND(x1, x2) = sign(x1 + x2 − 1.5)
Representing f using OR and AND
f = h1h̄2 + h̄1h2, built by feeding AND(h1, h̄2) and AND(h̄1, h2) into an OR.
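A small NumPy check of this construction, using the OR and AND weights from the previous slide (the particular h1, h2 weight vectors are hypothetical placeholders):

import numpy as np

def sign(s):
    return np.where(s >= 0, 1, -1)

def OR(a, b):
    return sign(a + b + 1.5)

def AND(a, b):
    return sign(a + b - 1.5)

# Two illustrative linear separators h1, h2 acting on x = (1, x1, x2).
w1, w2 = np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])

def f(x):
    h1, h2 = sign(w1 @ x), sign(w2 @ x)
    # f = h1 h̄2 + h̄1 h2, with negation h̄ = -h and '+' realized by OR
    return OR(AND(h1, -h2), AND(-h1, h2))

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, f(np.array([1.0, x1, x2])))   # +1 exactly when h1 and h2 disagree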
The multilayer perceptron
(Figure: a 3-layer 'feedforward' network with hidden layers; units compute signals such as w0Tx and w2Tx.)
Universal Approximation
Any target function f that can be decomposed into linear separators can be implemented by a 3-layer MLP.
A powerful model
(Figure: target function vs. fits with 8 perceptrons and 16 perceptrons.)
Red flags for generalization and optimization.
What tradeoff is involved here?
Minimizing Ein
The combinatorial challenge for the MLP is even greater than that of the perceptron.
Ein is not smooth (due to sign( ⋅ )), so we cannot use gradient descent.
sign(x) ≈ tanh(x) ⟶ use gradient descent to minimize Ein.
The deep neural network
input layer (l = 0), hidden layers (0 < l < L), output layer (l = L)
How the network operates
Weights w(l)ij, with 1 ≤ l ≤ L layers, 0 ≤ i ≤ d(l−1) inputs, 1 ≤ j ≤ d(l) outputs.

x(l)j = θ(s(l)j) = θ( ∑i=0..d(l−1) w(l)ij x(l−1)i )

Apply x to the input layer: x(0)1 . . . x(0)d(0) → ⋯ → x(L)1 = h(x)

θ(s) = tanh(s) = (es − e−s) / (es + e−s)
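A minimal forward-propagation sketch of these equations (the layer sizes and random weights are made-up placeholders; here W(l) is stored with shape (d(l−1)+1, d(l)) so that row 0 multiplies the bias x0 = 1):

import numpy as np

def forward(x, weights):
    # Compute h(x) for a feedforward network with theta = tanh at every layer.
    x_l = np.asarray(x, dtype=float)          # x(0), without the bias coordinate
    for W in weights:
        x_l = np.concatenate(([1.0], x_l))    # prepend x0 = 1
        s_l = W.T @ x_l                       # s(l) = (W(l))^T x(l-1)
        x_l = np.tanh(s_l)                    # x(l) = theta(s(l))
    return x_l                                # x(L) = h(x)

# Example: a 2 -> 3 -> 1 network with random weights
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
print(forward([0.5, -0.2], weights))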
Today’s Lecture
•Review of gradient descent
•What is a deep neural network?
•How do we train one?
•How do we train one efficiently?
•Tutorial: Image classification using a logistic regression network
How can we efficiently train a deep network?
Gradient descent minimizes
Ein(w) = (1/N) ∑n=1..N e(h(xn), yn)
by iterative steps along −∇Ein :
Δw = −η∇Ein(w)
∇Ein is based on ALL examples (xn, yn): 'batch' GD.
For logistic regression, e(h(xn), yn) = ln(1 + e−yn wTxn).
The stochastic aspect
'Average' direction:
𝔼n[−∇e(h(xn), yn)] = (1/N) ∑n=1..N (−∇e(h(xn), yn)) = −∇Ein
Pick one (xn, yn) at a time. Apply GD to e(h(xn), yn).
This is stochastic gradient descent (SGD): a randomized version of GD.
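Side by side, batch GD and SGD look like this (grad_e is assumed to return the per-example gradient ∇e(h(xn), yn); the names and default η are illustrative):

import numpy as np

def batch_gd_step(w, X, y, grad_e, eta=0.1):
    # 'Batch' GD: average the gradient over ALL N examples, then take one step.
    g = np.mean([grad_e(w, x_n, y_n) for x_n, y_n in zip(X, y)], axis=0)
    return w - eta * g

def sgd_step(w, X, y, grad_e, rng, eta=0.1):
    # SGD: pick ONE example at random and step along its (noisy) gradient;
    # in expectation this is the same direction as -grad Ein.
    n = rng.integers(len(X))
    return w - eta * grad_e(w, X[n], y[n])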
Benefits of SGD
Randomization helps.
1. cheaper computation
2. randomization
3. simple
Rule of thumb: η = 0.1 works (adjust empirically, e.g., exponentially).
The linear signal
The input s(l) is a linear combination (using weights) of the outputs x(l−1) of the previous layer.
(Recall the linear signal s = wTx.)
Forward propagation: Computing h(x)
Minimizing Ein
Using θ = tanh makes Ein differentiable, so we can use gradient descent (or SGD) ⟶ a local min.
Gradient descent
Gradient descent of Ein
We need the partial derivatives ∂e/∂w(l)ij for every weight w(l)ij.
Numerical Approach
Approximate each partial derivative numerically (e.g., by finite differences): inefficient. :-(
Algorithmic Approach :-)
e(x) is a function of s(l), and s(l) = (W(l))Tx(l−1).
Apply the chain rule: ∂e/∂w(l)ij = (∂e/∂s(l)j)(∂s(l)j/∂w(l)ij), where δ(l)j = ∂e/∂s(l)j is the 'sensitivity'.
Computing δ(l) using the chain rule
Multiple applications of the chain rule give δ(l−1) in terms of δ(l): a backward recursion through the layers.
The backpropagation algorithm
Algorithm for gradient descent on Ein.
Can do a batch version or a sequential version (SGD).
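A compact sketch of one backpropagation pass for the tanh network above (the squared-error choice at the output and the exact form of the sensitivity recursion are my assumptions for illustration, not text copied from the slides):

import numpy as np

def forward_all(x, weights):
    # Forward pass that stores every x(l) (with bias) so backprop can reuse them.
    xs = [np.concatenate(([1.0], np.asarray(x, dtype=float)))]
    for W in weights:
        x_next = np.tanh(W.T @ xs[-1])
        xs.append(np.concatenate(([1.0], x_next)))
    return xs

def backprop(x, y, weights):
    # Gradients dE/dW(l) for e = (h(x) - y)^2 with theta = tanh.
    xs = forward_all(x, weights)
    h = xs[-1][1:]                           # network output x(L)
    delta = 2 * (h - y) * (1 - h**2)         # sensitivity delta(L) at the output
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(xs[l], delta)    # dE/dW = x(l) delta(l+1)^T for this layer
        if l > 0:
            x_prev = xs[l][1:]               # x(l) without its bias coordinate
            # Back-propagate: delta(l) = (1 - x(l)^2) * (W(l+1) delta(l+1)), bias row dropped
            delta = (1 - x_prev**2) * (weights[l][1:, :] @ delta)
    return grads

For a single example (x, y), the gradients from backprop(x, y, weights) can be used in either the batch or the sequential (SGD) update from the earlier slides.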
Digits Data
Today’s Lecture
•Review of gradient descent
•What is a deep neural network?
•How do we train one?
•How do we train one efficiently?
•Tutorial: Image classification using a logistic regression network
Further reading
• Abu-Mostafa, Y. S., Magdon-Ismail, M., Lin, H.-T. (2012) Learning from data. AMLbook.com.
• Goodfellow et al. (2016) Deep Learning. https://www.deeplearningbook.org/
• Boyd, S., and Vandenberghe, L. (2018) Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares. http://vmls-book.stanford.edu/
• VanderPlas, J. (2016) Python Data Science Handbook. https://jakevdp.github.io/PythonDataScienceHandbook/