UVA CS 6316: Machine Learning
Lecture 15: Neural Network / Deep Learning Basics
Logistic Regression
Sigmoid function (aka logistic, logit, “S”, soft-step):
sigmoid(wx + b) = e^(wx+b) / (1 + e^(wx+b))
Logistic Regression
[Diagram: a single neuron. Inputs x1, x2, x3 (plus a +1 bias unit) are multiplied by weights w1, w2, w3 (and bias b), combined by a summing function Σ into z, then passed through a sigmoid function to produce the output.]
z = Σi xi wi + b
ŷ = sigmoid(z) = e^z / (1 + e^z)
ŷ = P(Y=1|x; w)
Logistic Regression
[Same neuron diagram, annotated with dimensions: the input x is p x 1 (here p = 3), wᵀ is 1 x p, and z and ŷ are 1 x 1.]
z = Σi xi wi + b = wᵀx + b
ŷ = sigmoid(z) = e^z / (1 + e^z)
ŷ = P(Y=1|x; w)
“Neuron” or “Perceptron”
[Diagram: the same structure; inputs x1, x2, x3 plus a +1 bias unit, weights w1, w2, w3 and bias b, summing function Σ producing z.]
“Neuron” or “Perceptron”
From here on, we leave out the bias for simplicity.
[Diagram: inputs x1, x2, x3, weights w1, w2, w3, summing function Σ producing z, then the output ŷ.]
Neurons
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
“Block View” of a Neuron
[Block diagram: Input x → dot product with w (a parameterized block) → z → sigmoid → Output ŷ.]
z = wᵀx   (x: p x 1, wᵀ: 1 x p, z: 1 x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)
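To make the block view concrete, here is a minimal NumPy sketch of this single neuron (not from the lecture; the example values for x, w, and b are made up):

```python
import numpy as np

def sigmoid(z):
    # e^z / (1 + e^z) is the same function as 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example input (p = 3), weights, and bias.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2

z = w @ x + b          # dot product block
y_hat = sigmoid(z)     # sigmoid block: y_hat = P(Y=1 | x; w)
print(z, y_hat)
```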
Neuron Representation
The linear transformation and the nonlinearity together are typically considered a single neuron.
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feeding Σ, producing ŷ.]
Neuron Representation
[Block view: x → *w → ŷ.]
Multiple Neurons
[Diagram: input x (x1, x2, x3) feeds four summing units Σ through a weight matrix W, producing z and the outputs ŷ1, ŷ2, ŷ3, ŷ4.]
Multiple Neurons
[Same diagram, annotated with dimensions: p = 3 inputs, d = 4 neurons; W is a matrix, z is a vector.]
z = Wᵀx   (x: p x 1, Wᵀ: d x p, z: d x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)   (applied elementwise, ŷ: d x 1)
ŷ = P(Y=1|x; W)
Neural Network (1 Hidden Layer)
[Diagram: input x (x1, x2, x3) → weight matrix W1 → z1 → hidden layer h1 → weight vector w2 → z2 → output ŷ. The subscript represents the layer.]
ŷ = P(Y=1|x; W)
Neural Network (1 Hidden Layer)
[Same diagram with the hidden-layer activations h1 shown explicitly; p = 3 inputs.]
z1 = W1ᵀx,   h1 = sigmoid(z1)   (x: p x 1, W1ᵀ: d x p, z1 and h1: d x 1)
z2 = w2ᵀh1,  ŷ = sigmoid(z2)   (w2ᵀ: 1 x d, z2 and ŷ: 1 x 1)
ŷ = P(Y=1|x; W, w)
Neural Network (1 Hidden Layer)
[Block view: Input x → dot product *W1 → z1 → sigmoid → h1 → dot product *w2 → z2 → sigmoid → Output ŷ. W1 is a matrix, z1 is a vector.]
z1 = W1ᵀx,   h1 = sigmoid(z1)
z2 = w2ᵀh1,  ŷ = sigmoid(z2)
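As a sketch of the forward pass written out above (assuming p = 3 inputs and d = 4 hidden neurons, with arbitrary random weights rather than values from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p, d = 3, 4
rng = np.random.default_rng(0)

x  = rng.normal(size=(p, 1))      # p x 1 input
W1 = rng.normal(size=(p, d))      # input-to-hidden weights
w2 = rng.normal(size=(d, 1))      # hidden-to-output weights

z1 = W1.T @ x                     # d x 1
h1 = sigmoid(z1)                  # d x 1 hidden activations
z2 = w2.T @ h1                    # 1 x 1
y_hat = sigmoid(z2)               # P(Y=1 | x; W1, w2)
print(y_hat)
```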
Neural Network (2 Hidden Layers)
[Diagram: input x (x1, x2, x3) → 1st hidden layer (weights W1, activations h1) → 2nd hidden layer (weights W2, activations h2) → output layer (weights w3) → ŷ.]
Neural Network (2 Hidden Layers)
z1 = W1ᵀx,   h1 = sigmoid(z1)   (hidden layer 1 output)
z2 = W2ᵀh1,  h2 = sigmoid(z2)   (hidden layer 2 output)
z3 = w3ᵀh2,  ŷ = sigmoid(z3)
Multi-Class Neural Network (2 Hidden Layers)
[Same architecture, but the output layer now has several units producing ŷ1, ŷ2, ŷ3.]
z1 = W1ᵀx,   h1 = sigmoid(z1)   (hidden layer 1 output)
z2 = W2ᵀh1,  h2 = sigmoid(z2)   (hidden layer 2 output)
z3 = w3ᵀh2,  ŷ = sigmoid(z3)
“Block View” Of Multi-Layer Neural Network
[Block diagram: x → *W1 → z1 → sigmoid → h1 (1st hidden layer) → *W2 → z2 → sigmoid → h2 (2nd hidden layer) → *w3 → z3 → sigmoid → ŷ (output layer).]
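The block view makes it clear that deeper networks are just more of the same blocks composed in sequence. A minimal sketch of that composition (layer sizes chosen arbitrarily to match the two-hidden-layer picture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weight_list):
    """Apply one (dot product -> sigmoid) block per weight matrix, in order."""
    h = x
    for W in weight_list:
        h = sigmoid(W.T @ h)
    return h

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 1))
# W1: 3x4, W2: 4x4, w3: 4x1 -- two hidden layers and one output unit.
weights = [rng.normal(size=(3, 4)),
           rng.normal(size=(4, 4)),
           rng.normal(size=(4, 1))]
y_hat = forward(x, weights)
print(y_hat)
```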
“Deep” Neural Networks (i.e. > 1 hidden layer)
Toy Example: Will I pass this class?
http://introtodeeplearning.com/2019/materials/2019_6S191_L1.pdf
[Figure series: a small network with weights W1, W2, w3 takes two input features (example values 4 and 5) and outputs ŷ, the predicted probability of passing the class.]
Binary Classification Loss: Binary Cross Entropy Loss
[Diagram: input x (x1, x2) → network (W1, W2, w3; intermediates z1, z2, z3) → ŷ = P(Y=1|X, W); y is the true output.]
E = loss = -log P(Y = y | X = x) = -y log(ŷ) - (1 - y) log(1 - ŷ)
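A small NumPy check of the binary cross entropy formula (the y and ŷ values are made up for illustration):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """E = -y*log(y_hat) - (1 - y)*log(1 - y_hat); eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(1, 0.9))   # small loss: confident and correct
print(binary_cross_entropy(1, 0.1))   # large loss: confident and wrong
```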
Multi-Class Classification Loss: Cross Entropy Loss
[Diagram: input x (x1, x2, x3) → network (W1, W2, W3) → outputs ŷ1, ŷ2, ŷ3.]
ŷi = P(yi = 1 | x), computed by the “softmax” function, a normalizing function which converts each class output to a probability.
E = loss = -Σj=1...K yj log ŷj
y is “0” for all entries except the true class. Example with K = 3: y = [0, 1, 0], ŷ = [0.1, 0.7, 0.2].
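The same computation in code, using the y and ŷ values from the slide; the softmax helper shows how raw class scores would be turned into the probabilities ŷ:

```python
import numpy as np

def softmax(scores):
    """Normalize raw class scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    """E = -sum_j y_j * log(y_hat_j), with y a one-hot vector over K classes."""
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

y     = np.array([0.0, 1.0, 0.0])     # one-hot true class (K = 3)
y_hat = np.array([0.1, 0.7, 0.2])     # predicted probabilities from the slide
print(cross_entropy(y, y_hat))        # = -log(0.7) ≈ 0.36
```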
Regression Loss: Mean Squared Error Loss
[Diagram: input x (x1, x2) → network (W1, W2, w3) → real-valued output ŷ; y is the true output.]
E = loss = ½ (y - ŷ)²
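And the corresponding one-liner for the squared-error loss above:

```python
def mse_loss(y, y_hat):
    """E = 0.5 * (y - y_hat)^2 for a single real-valued prediction."""
    return 0.5 * (y - y_hat) ** 2

print(mse_loss(2.0, 1.5))   # 0.125
```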
Training Neural Networks
[Diagram: input x (x1, x2, x3) → network (W1, W2, w3) → ŷ → loss E(ŷ; W).]
W* = argminW E(ŷ; W)
Training Neural Networks
How do we learn the optimal weights WL for our task?
● Gradient descent:
WL(t+1) = WL(t) - 𝜂 𝝏E/𝝏WL(t)
LeCun et al. Efficient BackProp. 1998
[Diagram: x → W1 → W2 → w3 → ŷ.]
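The update rule in code, applied to a toy one-dimensional objective E(w) = (w - 3)² (the objective and learning rate are illustrative, not from the lecture):

```python
def grad_E(w):
    """Gradient of the toy objective E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1               # initial weight and learning rate (eta)
for t in range(50):
    w = w - lr * grad_E(w)     # w(t+1) = w(t) - eta * dE/dw(t)
print(w)                       # approaches the minimizer w* = 3
```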
Gradient Descent
W* = argminW E(ŷ; W)
[Figure series from http://introtodeeplearning.com/2019/materials/2019_6S191_L1.pdf: the loss surface over the weights; at each step the gradient 𝝏E/𝝏W(t) is computed and the weights are moved in the opposite direction, stepping downhill toward a minimum.]
Training Neural Networks
But how do we get gradients of lower layers (e.g. W1)?
● Backpropagation!
○ Repeated application of the chain rule of calculus
○ Locally minimize the objective
○ Requires all “blocks” of the network to be differentiable
LeCun et al. Efficient BackProp. 1998
[Diagram: x → W1 → W2 → w3 → ŷ.]
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
[Worked example from the cs231n slides, stepping node by node through the forward pass of a small computational graph and then the backward pass, applying the chain rule at each node.]
Tells us: by increasing x by a scale of 1, we decrease f by a scale of 4.
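The figures themselves are not reproduced in this transcript. A sketch in the same spirit, assuming the classic cs231n example f(x, y, z) = (x + y)·z with x = -2, y = 5, z = -4, shows where a statement like the one above comes from:

```python
# Forward pass through f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y            # intermediate node, q = 3
f = q * z            # output, f = -12

# Backward pass: repeated application of the chain rule
df_dq = z            # d(q*z)/dq = z
df_dz = q            # d(q*z)/dz = q
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = z = -4
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = z = -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0: nudging x up by 1 lowers f by about 4
```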
Backpropagation (binary classification example)
Example on a 1-hidden-layer network for binary classification.
[Diagram: x (x1, x2, x3) → W1 → h1 → w2 → ŷ = P(y=1|X, W).]
Backpropagation (binary classification example)
[Block view: x → *W1 → z1 → sigmoid → h1 → *w2 → z2 → sigmoid → ŷ.]
E = loss = -y log(ŷ) - (1 - y) log(1 - ŷ)
Gradient descent to minimize the loss: we need to find 𝝏E/𝝏w2 and 𝝏E/𝝏W1.
𝝏E/𝝏w2 = ??   𝝏E/𝝏W1 = ??
Exploit the chain rule! Each gradient is a product of local derivatives along the path from the loss back to the weight:
𝝏E/𝝏w2 = (𝝏E/𝝏ŷ)(𝝏ŷ/𝝏z2)(𝝏z2/𝝏w2)
𝝏E/𝝏W1 = (𝝏E/𝝏ŷ)(𝝏ŷ/𝝏z2)(𝝏z2/𝝏h1)(𝝏h1/𝝏z1)(𝝏z1/𝝏W1)
The leading factors of 𝝏E/𝝏W1 are already computed! Backpropagation works backward through the network, reusing these intermediate gradients.
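The slide-by-slide derivation appears as figures in the original deck. A hedged NumPy reconstruction of the result (gradients for the 1-hidden-layer binary classifier under the cross entropy loss, with a finite-difference check; shapes and values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, w2):
    z1 = W1.T @ x                 # d x 1
    h1 = sigmoid(z1)              # d x 1
    z2 = w2.T @ h1                # 1 x 1
    return z1, h1, z2, sigmoid(z2)

def loss(y, y_hat):
    return (-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))).item()

rng = np.random.default_rng(2)
p, d = 3, 4
x, W1, w2 = rng.normal(size=(p, 1)), rng.normal(size=(p, d)), rng.normal(size=(d, 1))
y = 1.0                                    # true label

z1, h1, z2, y_hat = forward(x, W1, w2)

# Backward pass (chain rule), reusing quantities from the forward pass:
dE_dz2 = y_hat - y                         # (dE/dy_hat)(dy_hat/dz2) for sigmoid + cross entropy
dE_dw2 = h1 * dE_dz2                       # d x 1, same shape as w2
dE_dh1 = w2 * dE_dz2                       # the "already computed" piece, reused below
dE_dz1 = dE_dh1 * h1 * (1 - h1)            # sigmoid local gradient
dE_dW1 = x @ dE_dz1.T                      # p x d, same shape as W1

# Finite-difference check on one entry of W1:
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (loss(y, forward(x, W1p, w2)[3]) - loss(y, forward(x, W1, w2)[3])) / eps
print(dE_dW1[0, 0], numeric)               # the two should agree closely
```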
“Local-ness” of Backpropagation
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
[Diagram: a block f receives activations x and y on the forward pass and computes its “local gradients” 𝝏f/𝝏x and 𝝏f/𝝏y; on the backward pass, the gradient arriving from above is multiplied by these local gradients and passed back to x and y.]
Example: Sigmoid Block
sigmoid(x) = 𝛔(x) = 1 / (1 + e^(-x)),  with local gradient d𝛔/dx = 𝛔(x)(1 - 𝛔(x))
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
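In code, a sigmoid block only needs to remember its forward output to produce its local gradient (a minimal sketch, not the cs231n implementation):

```python
import numpy as np

class SigmoidBlock:
    """Caches the forward output and uses it for the local gradient."""
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out

    def backward(self, grad_out):
        # Chain rule: incoming gradient times the local gradient sigma * (1 - sigma).
        return grad_out * self.out * (1.0 - self.out)

blk = SigmoidBlock()
print(blk.forward(0.0))    # 0.5
print(blk.backward(1.0))   # 0.25 = 0.5 * (1 - 0.5)
```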
Deep Learning: Layers of Differentiable Parameterized Functions (with nonlinearities)
[Block diagram: x → f1 → f2 → f3 → ŷ → E(ŷ, y).]
Building Deep Neural Nets
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
[Diagram: networks are assembled by composing blocks f that map inputs x to outputs y.]
Block Example Implementation
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
[Code slide from the cs231n lecture illustrating how such a block can be implemented.]
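The code on that slide is not reproduced here; a sketch in the same style, a multiply “gate” that caches its inputs on the forward pass and applies the chain rule on the backward pass:

```python
class MultiplyGate:
    """z = x * y. forward() caches the inputs; backward() returns the input gradients."""
    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y

    def backward(self, grad_z):
        grad_x = self.y * grad_z   # dz/dx = y
        grad_y = self.x * grad_z   # dz/dy = x
        return grad_x, grad_y

gate = MultiplyGate()
print(gate.forward(3.0, -4.0))   # forward: -12.0
print(gate.backward(1.0))        # backward: (-4.0, 3.0)
```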
Deep Learning Frameworks
PyTorch Sample Code
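The sample code on the slide is not reproduced in this transcript. A minimal PyTorch sketch of the same kind of model (a 1-hidden-layer binary classifier and one SGD step; layer sizes, data, and learning rate are made up):

```python
import torch
import torch.nn as nn

# A 1-hidden-layer network like the earlier slides (p = 3 inputs, d = 4 hidden units).
model = nn.Sequential(
    nn.Linear(3, 4),   # W1 (plus a bias)
    nn.Sigmoid(),
    nn.Linear(4, 1),   # w2 (plus a bias)
    nn.Sigmoid(),
)
loss_fn = nn.BCELoss()                                   # binary cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3)                                    # a made-up batch of 8 samples
y = torch.randint(0, 2, (8, 1)).float()                  # made-up binary labels

y_hat = model(x)                  # forward pass
loss = loss_fn(y_hat, y)
optimizer.zero_grad()
loss.backward()                   # backward pass: autograd applies the chain rule
optimizer.step()                  # SGD weight update
print(loss.item())
```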
Building Deep Neural Nets
[Figure: the “GoogLeNet” architecture for object classification, a deep network built by composing many blocks.]
Backprop Demo
[Diagram: inputs x1, x2 → hidden units h1, h2 (via weights w1, w2, w3, w4) → output ŷ (via weights w5, w6).]
f1:  z1 = x1w1 + x2w3,   z2 = x1w2 + x2w4
f2:  h1 = exp(z1) / (1 + exp(z1)),   h2 = exp(z2) / (1 + exp(z2))
f3:  ŷ = h1w5 + h2w6
f4:  E = (y - ŷ)²
Update rule: w(t+1) = w(t) - 𝜂 𝝏E/𝝏w(t),  where 𝝏E/𝝏wi = ??
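A numeric walkthrough of this demo (the inputs, label, initial weights, and learning rate are made up; the equations are exactly f1 - f4 above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up inputs, label, initial weights, and learning rate.
x1, x2, y = 1.0, 2.0, 1.0
w1, w2, w3, w4, w5, w6 = 0.1, -0.2, 0.3, 0.4, -0.5, 0.6
eta = 0.5

# Forward pass (f1 - f4):
z1 = x1 * w1 + x2 * w3
z2 = x1 * w2 + x2 * w4
h1, h2 = sigmoid(z1), sigmoid(z2)
y_hat = h1 * w5 + h2 * w6
E = (y - y_hat) ** 2

# Backward pass (chain rule) from the loss to each weight:
dE_dyhat = -2.0 * (y - y_hat)
dE_dw5, dE_dw6 = dE_dyhat * h1, dE_dyhat * h2
dE_dz1 = dE_dyhat * w5 * h1 * (1.0 - h1)
dE_dz2 = dE_dyhat * w6 * h2 * (1.0 - h2)
dE_dw1, dE_dw3 = dE_dz1 * x1, dE_dz1 * x2
dE_dw2, dE_dw4 = dE_dz2 * x1, dE_dz2 * x2

# One gradient-descent step, w(t+1) = w(t) - eta * dE/dw(t), e.g. for w1 and w5:
w1 -= eta * dE_dw1
w5 -= eta * dE_dw5
print(E, dE_dw1, dE_dw5)
```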
Nonlinearity Functions (i.e. transfer or activation functions)
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feed the summing function Σ (z = x﹡w); a nonlinearity is then applied to z.]
Nonlinearity Functions (i.e. transfer or activation functions)
http://introtodeeplearning.com/2019/materials/2019_6S191_L1.pdf
[Table from https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions comparing common activation functions by name, plot, equation, and derivative (w.r.t. x); one entry is annotated “usually works best in practice”.]
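The table itself is not reproduced in the transcript; as an illustration, here are three activations that typically appear in such comparisons, with their derivatives:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)

xs = np.linspace(-2.0, 2.0, 5)
print(relu(xs), d_relu(xs))
```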
Neural Net Pipeline
[Diagram: x → W1 → W2 → w3 → ŷ → E.]
1. Initialize weights
2. For each batch of input x samples S:
a. Run the network “Forward” on S to compute outputs and loss
b. Run the network “Backward” using outputs and loss to compute gradients
c. Update weights using SGD (or a similar method)
3. Repeat step 2 until loss convergence
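Putting the pipeline together as a NumPy sketch (the dataset, layer sizes, batch size, and learning rate are made up; the gradients are the ones derived in the backpropagation example above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
p, d, n = 3, 4, 200
X = rng.normal(size=(n, p))                      # made-up dataset, one sample per row
Y = (X[:, :1] + X[:, 1:2] > 0).astype(float)     # made-up binary labels, shape (n, 1)

# 1. Initialize weights
W1 = 0.1 * rng.normal(size=(p, d))
w2 = 0.1 * rng.normal(size=(d, 1))
lr, batch = 0.5, 20

for epoch in range(100):                         # 3. repeat until the loss converges
    for i in range(0, n, batch):                 # 2. for each batch of samples S
        xb, yb = X[i:i + batch], Y[i:i + batch]
        # a. forward: outputs (and implicitly the cross entropy loss)
        h1 = sigmoid(xb @ W1)
        y_hat = sigmoid(h1 @ w2)
        # b. backward: gradients via the chain rule
        dz2 = (y_hat - yb) / len(xb)
        dw2 = h1.T @ dz2
        dz1 = (dz2 @ w2.T) * h1 * (1 - h1)
        dW1 = xb.T @ dz1
        # c. update weights with SGD
        W1 -= lr * dW1
        w2 -= lr * dw2

accuracy = np.mean((sigmoid(sigmoid(X @ W1) @ w2) > 0.5) == Y)
print(accuracy)
```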
Non-Convexity of Neural Nets
The objective is non-convex, but in very high dimensions there exist many local minima which are about the same.
Pascanu et al. On the saddle point problem for non-convex optimization. 2014
Advantage of Neural Nets
As long as it's fully differentiable, we can train the model to automatically learn features for us.