UVA CS 6316: Machine Learning
Lecture 15: Neural Network / Deep Learning Basics
Logistic Regression
Sigmoid function (aka logistic, logit, “S”, soft-step):
sigmoid(wx + b) = e^(wx+b) / (1 + e^(wx+b))
Logistic Regression
[Diagram: a single neuron. Inputs x1, x2, x3 (plus a +1 bias unit) are multiplied by weights w1, w2, w3 (and bias b), combined by a summing function Σ into z, then passed through a sigmoid function to produce the output.]
z = Σi xi wi + b
ŷ = sigmoid(z) = e^z / (1 + e^z)
ŷ = P(Y=1|x; w)
Logistic Regression
[Same neuron diagram, annotated with dimensions: the input x is p x 1 (here p = 3), wᵀ is 1 x p, and z and ŷ are 1 x 1.]
z = Σi xi wi + b = wᵀx + b
ŷ = sigmoid(z) = e^z / (1 + e^z)
ŷ = P(Y=1|x; w)
“Neuron” or “Perceptron”
[Diagram: the same structure; inputs x1, x2, x3 plus a +1 bias unit, weights w1, w2, w3 and bias b, summing function Σ producing z.]
“Neuron” or “Perceptron”
From here on, we leave out the bias for simplicity.
[Diagram: inputs x1, x2, x3, weights w1, w2, w3, summing function Σ producing z, then the output ŷ.]
Neurons
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
“Block View” of a Neuron
[Block diagram: Input x → dot product with w (a parameterized block) → z → sigmoid → Output ŷ.]
z = wᵀx   (x: p x 1, wᵀ: 1 x p, z: 1 x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)
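To make the block view concrete, here is a minimal NumPy sketch of this single neuron (not from the lecture; the example values for x, w, and b are made up):

```python
import numpy as np

def sigmoid(z):
    # e^z / (1 + e^z) is the same function as 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example input (p = 3), weights, and bias.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2

z = w @ x + b          # dot product block
y_hat = sigmoid(z)     # sigmoid block: y_hat = P(Y=1 | x; w)
print(z, y_hat)
```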
Neuron Representation
The linear transformation and the nonlinearity together are typically considered a single neuron.
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feeding Σ, producing ŷ.]
Neuron Representation
[Block view: x → *w → ŷ.]
Multiple Neurons
[Diagram: input x (x1, x2, x3) feeds four summing units Σ through a weight matrix W, producing z and the outputs ŷ1, ŷ2, ŷ3, ŷ4.]
Multiple Neurons
[Same diagram, annotated with dimensions: p = 3 inputs, d = 4 neurons; W is a matrix, z is a vector.]
z = Wᵀx   (x: p x 1, Wᵀ: d x p, z: d x 1)
ŷ = sigmoid(z) = e^z / (1 + e^z)   (applied elementwise, ŷ: d x 1)
ŷ = P(Y=1|x; W)
Neural Network (1 Hidden Layer)
[Diagram: input x (x1, x2, x3) → weight matrix W1 → z1 → hidden layer h1 → weight vector w2 → z2 → output ŷ. The subscript represents the layer.]
ŷ = P(Y=1|x; W)
Neural Network (1 Hidden Layer)
[Same diagram with the hidden-layer activations h1 shown explicitly; p = 3 inputs.]
z1 = W1ᵀx,   h1 = sigmoid(z1)   (x: p x 1, W1ᵀ: d x p, z1 and h1: d x 1)
z2 = w2ᵀh1,  ŷ = sigmoid(z2)   (w2ᵀ: 1 x d, z2 and ŷ: 1 x 1)
ŷ = P(Y=1|x; W, w)
Neural Network (1 Hidden Layer)
[Block view: Input x → dot product *W1 → z1 → sigmoid → h1 → dot product *w2 → z2 → sigmoid → Output ŷ. W1 is a matrix, z1 is a vector.]
z1 = W1ᵀx,   h1 = sigmoid(z1)
z2 = w2ᵀh1,  ŷ = sigmoid(z2)
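As a sketch of the forward pass written out above (assuming p = 3 inputs and d = 4 hidden neurons, with arbitrary random weights rather than values from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p, d = 3, 4
rng = np.random.default_rng(0)

x  = rng.normal(size=(p, 1))      # p x 1 input
W1 = rng.normal(size=(p, d))      # input-to-hidden weights
w2 = rng.normal(size=(d, 1))      # hidden-to-output weights

z1 = W1.T @ x                     # d x 1
h1 = sigmoid(z1)                  # d x 1 hidden activations
z2 = w2.T @ h1                    # 1 x 1
y_hat = sigmoid(z2)               # P(Y=1 | x; W1, w2)
print(y_hat)
```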
Neural Network (2 Hidden Layers)
[Diagram: input x (x1, x2, x3) → 1st hidden layer (weights W1, activations h1) → 2nd hidden layer (weights W2, activations h2) → output layer (weights w3) → ŷ.]
Neural Network (2 Hidden Layers)
z1 = W1ᵀx,   h1 = sigmoid(z1)   (hidden layer 1 output)
z2 = W2ᵀh1,  h2 = sigmoid(z2)   (hidden layer 2 output)
z3 = w3ᵀh2,  ŷ = sigmoid(z3)
Multi-Class Neural Network (2 Hidden Layers)
[Same architecture, but the output layer now has several units producing ŷ1, ŷ2, ŷ3.]
z1 = W1ᵀx,   h1 = sigmoid(z1)   (hidden layer 1 output)
z2 = W2ᵀh1,  h2 = sigmoid(z2)   (hidden layer 2 output)
z3 = w3ᵀh2,  ŷ = sigmoid(z3)
“Block View” Of Multi-Layer Neural Network
[Block diagram: x → *W1 → z1 → sigmoid → h1 (1st hidden layer) → *W2 → z2 → sigmoid → h2 (2nd hidden layer) → *w3 → z3 → sigmoid → ŷ (output layer).]
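The block view makes it clear that deeper networks are just more of the same blocks composed in sequence. A minimal sketch of that composition (layer sizes chosen arbitrarily to match the two-hidden-layer picture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weight_list):
    """Apply one (dot product -> sigmoid) block per weight matrix, in order."""
    h = x
    for W in weight_list:
        h = sigmoid(W.T @ h)
    return h

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 1))
# W1: 3x4, W2: 4x4, w3: 4x1 -- two hidden layers and one output unit.
weights = [rng.normal(size=(3, 4)),
           rng.normal(size=(4, 4)),
           rng.normal(size=(4, 1))]
y_hat = forward(x, weights)
print(y_hat)
```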
“Deep” Neural Networks (i.e. > 1 hidden layer)
Toy Example: Will I pass this class?
http://introtodeeplearning.com/2019/materials/2019_6S191_L1.pdf
[Figure series: a small network with weights W1, W2, w3 takes two input features (example values 4 and 5) and outputs ŷ, the predicted probability of passing the class.]
Binary Classification Loss: Binary Cross Entropy Loss
[Diagram: input x (x1, x2) → network (W1, W2, w3; intermediates z1, z2, z3) → ŷ = P(Y=1|X, W); y is the true output.]
E = loss = -log P(Y = y | X = x) = -y log(ŷ) - (1 - y) log(1 - ŷ)
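A small NumPy check of the binary cross entropy formula (the y and ŷ values are made up for illustration):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """E = -y*log(y_hat) - (1 - y)*log(1 - y_hat); eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(1, 0.9))   # small loss: confident and correct
print(binary_cross_entropy(1, 0.1))   # large loss: confident and wrong
```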
Multi-Class Classification Loss: Cross Entropy Loss
[Diagram: input x (x1, x2, x3) → network (W1, W2, W3) → outputs ŷ1, ŷ2, ŷ3.]
ŷi = P(yi = 1 | x), computed by the “softmax” function, a normalizing function which converts each class output to a probability.
E = loss = -Σj=1...K yj log ŷj
y is “0” for all entries except the true class. Example with K = 3: y = [0, 1, 0], ŷ = [0.1, 0.7, 0.2].
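The same computation in code, using the y and ŷ values from the slide; the softmax helper shows how raw class scores would be turned into the probabilities ŷ:

```python
import numpy as np

def softmax(scores):
    """Normalize raw class scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    """E = -sum_j y_j * log(y_hat_j), with y a one-hot vector over K classes."""
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

y     = np.array([0.0, 1.0, 0.0])     # one-hot true class (K = 3)
y_hat = np.array([0.1, 0.7, 0.2])     # predicted probabilities from the slide
print(cross_entropy(y, y_hat))        # = -log(0.7) ≈ 0.36
```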
Regression Loss: Mean Squared Error Loss
[Diagram: input x (x1, x2) → network (W1, W2, w3) → real-valued output ŷ; y is the true output.]
E = loss = ½ (y - ŷ)²
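And the corresponding one-liner for the squared-error loss above:

```python
def mse_loss(y, y_hat):
    """E = 0.5 * (y - y_hat)^2 for a single real-valued prediction."""
    return 0.5 * (y - y_hat) ** 2

print(mse_loss(2.0, 1.5))   # 0.125
```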
Training Neural Networks
[Diagram: input x (x1, x2, x3) → network (W1, W2, w3) → ŷ → loss E(ŷ; W).]
W* = argminW E(ŷ; W)
Training Neural Networks
How do we learn the optimal weights WL for our task?
● Gradient descent:
WL(t+1) = WL(t) - 𝜂 𝝏E/𝝏WL(t)
LeCun et al. Efficient BackProp. 1998
[Diagram: x → W1 → W2 → w3 → ŷ.]
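The update rule in code, applied to a toy one-dimensional objective E(w) = (w - 3)² (the objective and learning rate are illustrative, not from the lecture):

```python
def grad_E(w):
    """Gradient of the toy objective E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1               # initial weight and learning rate (eta)
for t in range(50):
    w = w - lr * grad_E(w)     # w(t+1) = w(t) - eta * dE/dw(t)
print(w)                       # approaches the minimizer w* = 3
```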
Gradient Descent
W* = argminW E(ŷ; W)
[Figure series from http://introtodeeplearning.com/2019/materials/2019_6S191_L1.pdf: the loss surface over the weights; at each step the gradient 𝝏E/𝝏W(t) is computed and the weights are moved in the opposite direction, stepping downhill toward a minimum.]
Training Neural Networks
But how do we get gradients of lower layers (e.g. W1)?
● Backpropagation!
○ Repeated application of the chain rule of calculus
○ Locally minimize the objective
○ Requires all “blocks” of the network to be differentiable
LeCun et al. Efficient BackProp. 1998
[Diagram: x → W1 → W2 → w3 → ŷ.]
Backpropagation Intro
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
[Worked example from the cs231n slides, stepping node by node through the forward pass of a small computational graph and then the backward pass, applying the chain rule at each node.]
Tells us: by increasing x by a scale of 1, we decrease f by a scale of 4.
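The figures themselves are not reproduced in this transcript. A sketch in the same spirit, assuming the classic cs231n example f(x, y, z) = (x + y)·z with x = -2, y = 5, z = -4, shows where a statement like the one above comes from:

```python
# Forward pass through f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y            # intermediate node, q = 3
f = q * z            # output, f = -12

# Backward pass: repeated application of the chain rule
df_dq = z            # d(q*z)/dq = z
df_dz = q            # d(q*z)/dz = q
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = z = -4
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = z = -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0: nudging x up by 1 lowers f by about 4
```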
Backpropagation (binary classification example)
Example on a 1-hidden-layer network for binary classification.
[Diagram: x (x1, x2, x3) → W1 → h1 → w2 → ŷ = P(y=1|X, W).]
Backpropagation (binary classification example)
[Block view: x → *W1 → z1 → sigmoid → h1 → *w2 → z2 → sigmoid → ŷ.]
E = loss = -y log(ŷ) - (1 - y) log(1 - ŷ)
Gradient descent to minimize the loss: we need to find 𝝏E/𝝏w2 and 𝝏E/𝝏W1.
𝝏E/𝝏w2 = ??   𝝏E/𝝏W1 = ??
Exploit the chain rule! Each gradient is a product of local derivatives along the path from the loss back to the weight:
𝝏E/𝝏w2 = (𝝏E/𝝏ŷ)(𝝏ŷ/𝝏z2)(𝝏z2/𝝏w2)
𝝏E/𝝏W1 = (𝝏E/𝝏ŷ)(𝝏ŷ/𝝏z2)(𝝏z2/𝝏h1)(𝝏h1/𝝏z1)(𝝏z1/𝝏W1)
The leading factors of 𝝏E/𝝏W1 are already computed! Backpropagation works backward through the network, reusing these intermediate gradients.
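The slide-by-slide derivation appears as figures in the original deck. A hedged NumPy reconstruction of the result (gradients for the 1-hidden-layer binary classifier under the cross entropy loss, with a finite-difference check; shapes and values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, w2):
    z1 = W1.T @ x                 # d x 1
    h1 = sigmoid(z1)              # d x 1
    z2 = w2.T @ h1                # 1 x 1
    return z1, h1, z2, sigmoid(z2)

def loss(y, y_hat):
    return (-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))).item()

rng = np.random.default_rng(2)
p, d = 3, 4
x, W1, w2 = rng.normal(size=(p, 1)), rng.normal(size=(p, d)), rng.normal(size=(d, 1))
y = 1.0                                    # true label

z1, h1, z2, y_hat = forward(x, W1, w2)

# Backward pass (chain rule), reusing quantities from the forward pass:
dE_dz2 = y_hat - y                         # (dE/dy_hat)(dy_hat/dz2) for sigmoid + cross entropy
dE_dw2 = h1 * dE_dz2                       # d x 1, same shape as w2
dE_dh1 = w2 * dE_dz2                       # the "already computed" piece, reused below
dE_dz1 = dE_dh1 * h1 * (1 - h1)            # sigmoid local gradient
dE_dW1 = x @ dE_dz1.T                      # p x d, same shape as W1

# Finite-difference check on one entry of W1:
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (loss(y, forward(x, W1p, w2)[3]) - loss(y, forward(x, W1, w2)[3])) / eps
print(dE_dW1[0, 0], numeric)               # the two should agree closely
```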
“Local-ness” of Backpropagation
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
[Diagram: a block f receives activations x and y on the forward pass and computes its “local gradients” 𝝏f/𝝏x and 𝝏f/𝝏y; on the backward pass, the gradient arriving from above is multiplied by these local gradients and passed back to x and y.]
Example: Sigmoid Block
sigmoid(x) = 𝛔(x) = 1 / (1 + e^(-x)),  with local gradient d𝛔/dx = 𝛔(x)(1 - 𝛔(x))
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
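In code, a sigmoid block only needs to remember its forward output to produce its local gradient (a minimal sketch, not the cs231n implementation):

```python
import numpy as np

class SigmoidBlock:
    """Caches the forward output and uses it for the local gradient."""
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out

    def backward(self, grad_out):
        # Chain rule: incoming gradient times the local gradient sigma * (1 - sigma).
        return grad_out * self.out * (1.0 - self.out)

blk = SigmoidBlock()
print(blk.forward(0.0))    # 0.5
print(blk.backward(1.0))   # 0.25 = 0.5 * (1 - 0.5)
```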
Deep Learning: Layers of Differentiable Parameterized Functions (with nonlinearities)
[Block diagram: x → f1 → f2 → f3 → ŷ → E(ŷ, y).]
Building Deep Neural Nets
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
[Diagram: networks are assembled by composing blocks f that map inputs x to outputs y.]
Block Example Implementation
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
[Code slide from the cs231n lecture illustrating how such a block can be implemented.]
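The code on that slide is not reproduced here; a sketch in the same style, a multiply “gate” that caches its inputs on the forward pass and applies the chain rule on the backward pass:

```python
class MultiplyGate:
    """z = x * y. forward() caches the inputs; backward() returns the input gradients."""
    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y

    def backward(self, grad_z):
        grad_x = self.y * grad_z   # dz/dx = y
        grad_y = self.x * grad_z   # dz/dy = x
        return grad_x, grad_y

gate = MultiplyGate()
print(gate.forward(3.0, -4.0))   # forward: -12.0
print(gate.backward(1.0))        # backward: (-4.0, 3.0)
```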
Deep Learning Frameworks
PyTorch Sample Code
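The sample code on the slide is not reproduced in this transcript. A minimal PyTorch sketch of the same kind of model (a 1-hidden-layer binary classifier and one SGD step; layer sizes, data, and learning rate are made up):

```python
import torch
import torch.nn as nn

# A 1-hidden-layer network like the earlier slides (p = 3 inputs, d = 4 hidden units).
model = nn.Sequential(
    nn.Linear(3, 4),   # W1 (plus a bias)
    nn.Sigmoid(),
    nn.Linear(4, 1),   # w2 (plus a bias)
    nn.Sigmoid(),
)
loss_fn = nn.BCELoss()                                   # binary cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3)                                    # a made-up batch of 8 samples
y = torch.randint(0, 2, (8, 1)).float()                  # made-up binary labels

y_hat = model(x)                  # forward pass
loss = loss_fn(y_hat, y)
optimizer.zero_grad()
loss.backward()                   # backward pass: autograd applies the chain rule
optimizer.step()                  # SGD weight update
print(loss.item())
```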
Building Deep Neural Nets
[Figure: the “GoogLeNet” architecture for object classification, a deep network built by composing many blocks.]
Backprop Demo
[Diagram: inputs x1, x2 → hidden units h1, h2 (via weights w1, w2, w3, w4) → output ŷ (via weights w5, w6).]
f1:  z1 = x1w1 + x2w3,   z2 = x1w2 + x2w4
f2:  h1 = exp(z1) / (1 + exp(z1)),   h2 = exp(z2) / (1 + exp(z2))
f3:  ŷ = h1w5 + h2w6
f4:  E = (y - ŷ)²
Update rule: w(t+1) = w(t) - 𝜂 𝝏E/𝝏w(t),  where 𝝏E/𝝏wi = ??
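A numeric walkthrough of this demo (the inputs, label, initial weights, and learning rate are made up; the equations are exactly f1 - f4 above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up inputs, label, initial weights, and learning rate.
x1, x2, y = 1.0, 2.0, 1.0
w1, w2, w3, w4, w5, w6 = 0.1, -0.2, 0.3, 0.4, -0.5, 0.6
eta = 0.5

# Forward pass (f1 - f4):
z1 = x1 * w1 + x2 * w3
z2 = x1 * w2 + x2 * w4
h1, h2 = sigmoid(z1), sigmoid(z2)
y_hat = h1 * w5 + h2 * w6
E = (y - y_hat) ** 2

# Backward pass (chain rule) from the loss to each weight:
dE_dyhat = -2.0 * (y - y_hat)
dE_dw5, dE_dw6 = dE_dyhat * h1, dE_dyhat * h2
dE_dz1 = dE_dyhat * w5 * h1 * (1.0 - h1)
dE_dz2 = dE_dyhat * w6 * h2 * (1.0 - h2)
dE_dw1, dE_dw3 = dE_dz1 * x1, dE_dz1 * x2
dE_dw2, dE_dw4 = dE_dz2 * x1, dE_dz2 * x2

# One gradient-descent step, w(t+1) = w(t) - eta * dE/dw(t), e.g. for w1 and w5:
w1 -= eta * dE_dw1
w5 -= eta * dE_dw5
print(E, dE_dw1, dE_dw5)
```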
Nonlinearity Functions (i.e. transfer or activation functions)
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feed the summing function Σ (z = x﹡w); a nonlinearity is then applied to z.]
Nonlinearity Functions (i.e. transfer or activation functions)
http://introtodeeplearning.com/2019/materials/2019_6S191_L1.pdf
[Table from https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions comparing common activation functions by name, plot, equation, and derivative (w.r.t. x); one entry is annotated “usually works best in practice”.]
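The table itself is not reproduced in the transcript; as an illustration, here are three activations that typically appear in such comparisons, with their derivatives:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)

xs = np.linspace(-2.0, 2.0, 5)
print(relu(xs), d_relu(xs))
```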
Neural Net Pipeline
[Diagram: x → W1 → W2 → w3 → ŷ → E.]
1. Initialize weights
2. For each batch of input x samples S:
a. Run the network “Forward” on S to compute outputs and loss
b. Run the network “Backward” using outputs and loss to compute gradients
c. Update weights using SGD (or a similar method)
3. Repeat step 2 until loss convergence
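Putting the pipeline together as a NumPy sketch (the dataset, layer sizes, batch size, and learning rate are made up; the gradients are the ones derived in the backpropagation example above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
p, d, n = 3, 4, 200
X = rng.normal(size=(n, p))                      # made-up dataset, one sample per row
Y = (X[:, :1] + X[:, 1:2] > 0).astype(float)     # made-up binary labels, shape (n, 1)

# 1. Initialize weights
W1 = 0.1 * rng.normal(size=(p, d))
w2 = 0.1 * rng.normal(size=(d, 1))
lr, batch = 0.5, 20

for epoch in range(100):                         # 3. repeat until the loss converges
    for i in range(0, n, batch):                 # 2. for each batch of samples S
        xb, yb = X[i:i + batch], Y[i:i + batch]
        # a. forward: outputs (and implicitly the cross entropy loss)
        h1 = sigmoid(xb @ W1)
        y_hat = sigmoid(h1 @ w2)
        # b. backward: gradients via the chain rule
        dz2 = (y_hat - yb) / len(xb)
        dw2 = h1.T @ dz2
        dz1 = (dz2 @ w2.T) * h1 * (1 - h1)
        dW1 = xb.T @ dz1
        # c. update weights with SGD
        W1 -= lr * dW1
        w2 -= lr * dw2

accuracy = np.mean((sigmoid(sigmoid(X @ W1) @ w2) > 0.5) == Y)
print(accuracy)
```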
Non-Convexity of Neural Nets
The objective is non-convex, but in very high dimensions there exist many local minima which are about the same.
Pascanu et al. On the saddle point problem for non-convex optimization. 2014
Advantage of Neural Nets
As long as it's fully differentiable, we can train the model to automatically learn features for us.