Page 1:

Data Mining and Machine Learning

Madhavan Mukund

Lecture 18, Jan–Apr 2020
https://www.cmi.ac.in/~madhavan/courses/dmml2020jan/

Page 2:

Loss functions (costs) for neural networks

So far, we have used the mean squared error as the loss function.

Consider a single neuron with two inputs x = (x_1, x_2)

C = \frac{1}{n} \sum_{i=1}^{n} (y_i - a_i)^2, \quad \text{where } a_i = \sigma(z_i) = \sigma(w_1 x_{i1} + w_2 x_{i2} + b)

For gradient descent, we compute \frac{\partial C}{\partial w_1}, \frac{\partial C}{\partial w_2}, \frac{\partial C}{\partial b}

- For j = 1, 2,

  \frac{\partial C}{\partial w_j} = \frac{2}{n} \sum_{i=1}^{n} (y_i - a_i) \cdot \left(-\frac{\partial a_i}{\partial w_j}\right) = \frac{2}{n} \sum_{i=1}^{n} (a_i - y_i) \frac{\partial a_i}{\partial z_i} \frac{\partial z_i}{\partial w_j} = \frac{2}{n} \sum_{i=1}^{n} (a_i - y_i) \, \sigma'(z_i) \, x_{ij}

- \frac{\partial C}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (a_i - y_i) \frac{\partial a_i}{\partial z_i} \frac{\partial z_i}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (a_i - y_i) \, \sigma'(z_i)
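As an illustration (not part of the original slides), here is a minimal NumPy sketch of these gradients for a single sigmoid neuron under the quadratic loss; the names X, y, w, b are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quadratic_gradients(w, b, X, y):
    """Gradients of C = (1/n) sum (y_i - a_i)^2 for one sigmoid neuron.

    X: (n, 2) inputs, y: (n,) targets, w: (2,) weights, b: scalar bias.
    """
    z = X @ w + b                     # z_i = w1*x_i1 + w2*x_i2 + b
    a = sigmoid(z)                    # a_i = sigma(z_i)
    sigma_prime = a * (1 - a)         # sigma'(z) = sigma(z)(1 - sigma(z))
    n = len(y)
    common = (2.0 / n) * (a - y) * sigma_prime   # (2/n)(a_i - y_i) sigma'(z_i)
    grad_w = X.T @ common             # sums common_i * x_ij over i, for each j
    grad_b = common.sum()
    return grad_w, grad_b
```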


Page 3:

Loss functions . . .

\frac{\partial C}{\partial w_j} = \frac{2}{n} \sum_{i=1}^{n} (a_i - y_i) \, \sigma'(z_i) \, x_{ij}, \qquad \frac{\partial C}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (a_i - y_i) \, \sigma'(z_i)

Each term in \frac{\partial C}{\partial w_1}, \frac{\partial C}{\partial w_2}, \frac{\partial C}{\partial b} is proportional to \sigma'(z_i)

Ideally, gradient descent should take large steps when a − y is large

- σ(z) is flat at both extremes
- If a is completely wrong, a ≈ (1 − y), we still have σ′(z) ≈ 0
- Learning is slow even when the current model is far from optimal
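A quick numerical illustration (my own) of this saturation: a neuron that is confidently wrong still yields a near-zero σ′(z).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A neuron that is confidently wrong: target y = 0, but z is large positive.
z = 10.0
a = sigmoid(z)             # ~0.99995, so a is close to 1 - y
grad_factor = a * (1 - a)  # sigma'(z) ~ 4.5e-05
print(a, grad_factor)      # the quadratic-loss gradient is scaled by this tiny factor
```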


Page 4:

Cross entropy

A better loss function

C(a, y) = \begin{cases} -\ln(a) & \text{if } y = 1 \\ -\ln(1 - a) & \text{if } y = 0 \end{cases}

- If a ≈ y, then C(a, y) ≈ 0 in both cases
- If a ≈ 1 − y, then C(a, y) → ∞ in both cases

Combine into a single equation

C(a, y) = -[\,y \ln(a) + (1 - y) \ln(1 - a)\,]

- y = 1 ⇒ second term vanishes, C = −ln(a)
- y = 0 ⇒ first term vanishes, C = −ln(1 − a)

This is called cross entropy
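A minimal NumPy sketch of this loss (my own illustration; the clipping constant eps is an added guard against ln(0)):

```python
import numpy as np

def cross_entropy(a, y, eps=1e-12):
    """C(a, y) = -[y ln(a) + (1-y) ln(1-a)], averaged over examples."""
    a = np.clip(a, eps, 1 - eps)  # guard against ln(0)
    return float(np.mean(-(y * np.log(a) + (1 - y) * np.log(1 - a))))

# A confident wrong prediction is punished heavily; a correct one costs ~0.
print(cross_entropy(np.array([0.99]), np.array([0.0])))  # ~4.6, large
print(cross_entropy(np.array([0.99]), np.array([1.0])))  # ~0.01
```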


Page 5:

Cross entropy and gradient descent

C = -[\,y \ln(\sigma(z)) + (1 - y) \ln(1 - \sigma(z))\,]

\frac{\partial C}{\partial w_j} = \frac{\partial C}{\partial \sigma} \frac{\partial \sigma}{\partial w_j} = -\left[\frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)}\right] \frac{\partial \sigma}{\partial w_j}

= -\left[\frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)}\right] \frac{\partial \sigma}{\partial z} \frac{\partial z}{\partial w_j}

= -\left[\frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)}\right] \sigma'(z) \, x_j

= -\left[\frac{y(1 - \sigma(z)) - (1 - y)\sigma(z)}{\sigma(z)(1 - \sigma(z))}\right] \sigma'(z) \, x_j


Page 6:

Cross entropy and gradient descent . . .

\frac{\partial C}{\partial w_j} = -\left[\frac{y(1 - \sigma(z)) - (1 - y)\sigma(z)}{\sigma(z)(1 - \sigma(z))}\right] \sigma'(z) \, x_j

Recall that \sigma'(z) = \sigma(z)(1 - \sigma(z))

Therefore,

\frac{\partial C}{\partial w_j} = -[\,y(1 - \sigma(z)) - (1 - y)\sigma(z)\,] \, x_j = -[\,y - y\sigma(z) - \sigma(z) + y\sigma(z)\,] \, x_j = (\sigma(z) - y) \, x_j = (a - y) \, x_j

Similarly, \frac{\partial C}{\partial b} = (a - y)

Thus, as we wanted, the gradient is proportional to a − y

The greater the error, the faster the learning
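As a sanity check (my own sketch, not from the slides), the analytic gradient (a − y)x_j can be compared against a numerical derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_loss(w, b, x, y):
    a = sigmoid(w @ x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w, b = np.array([0.5, -0.3]), 0.1
x, y = np.array([1.2, 0.7]), 1.0

a = sigmoid(w @ x + b)
analytic = (a - y) * x                 # dC/dw_j = (a - y) x_j

eps = 1e-6                             # central finite differences
numeric = np.array([
    (ce_loss(w + eps * e, b, x, y) - ce_loss(w - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(analytic, numeric))  # True
```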


Page 7:

Cross entropy . . .

Overall,

- \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} (a_i - y_i) \, x_{ij}

- \frac{\partial C}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (a_i - y_i)

Cross entropy allows the network to learn faster when the model is far from the true one (see the training sketch below)

There are other theoretical justifications for using cross entropy

- It can be derived from the goal of maximizing the log-likelihood of the model

- This will be addressed in an advanced ML course
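Putting the pieces together, a compact sketch (my own, not from the slides) of gradient descent for one sigmoid neuron with the cross-entropy loss; the learning rate, step count, and the OR-function toy data are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.5, steps=1000):
    """Gradient descent for one sigmoid neuron with cross-entropy loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        a = sigmoid(X @ w + b)         # predictions a_i
        w -= lr * (X.T @ (a - y)) / n  # dC/dw_j = (1/n) sum (a_i - y_i) x_ij
        b -= lr * (a - y).mean()       # dC/db   = (1/n) sum (a_i - y_i)
    return w, b

# Toy usage: learn the OR function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
w, b = train(X, y)
print(np.round(sigmoid(X @ w + b)))    # ~ [0, 1, 1, 1]
```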


Page 8:

Case study: Handwritten digits

The MNIST database has 70,000 samples of handwritten digits {0, 1, . . . , 9}

Assume the input has been segmented into individual digits
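For concreteness, one common way to load the data (a sketch assuming the tensorflow.keras convenience loader; the slides do not prescribe a library):

```python
# Sketch: load MNIST via tensorflow.keras (an assumption; any loader works).
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)                         # (60000, 28, 28) grayscale images
x_train = x_train.reshape(-1, 784) / 255.0   # linearize rows, scale to [0, 1]
```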


Page 9:

Handwritten digits . . .

Each image is 28 × 28 = 784 pixels

Each pixel is a grayscale value from 0.0 (white) to 1.0 (black)

Building a neural network

- Linearize the image row-wise; the inputs are x_1, x_2, . . . , x_{784}

- Single hidden layer, with 15 nodes

- Output layer has 10 nodes, a decision a_j for each digit j ∈ {0, 1, . . . , 9}

- Final output is the maximum among these
  - Naively, \arg\max_j a_j
  - Softmax: \arg\max_j \frac{e^{a_j}}{\sum_k e^{a_k}}, a smooth approximation to \arg\max
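A small sketch (my own) of both readouts; subtracting the maximum before exponentiating is a standard numerical-stability trick:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # shift-invariant; avoids overflow
    return e / e.sum()

a = np.array([1.2, 0.3, 3.1, 0.8, 0.1, 0.5, 2.9, 0.2, 0.4, 0.6])
print(int(np.argmax(a)))     # naive readout: digit 2
print(softmax(a).round(3))   # smooth scores summing to 1, peaked at index 2
```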


Page 10:

Handwritten digits . . .

[Figure: neural network to recognize handwritten digits]


Page 11:

Handwritten digits . . .

Intuitively, the internal nodes recognize features

Combinations of features identify digits

Hypothetically, suppose the first four hidden neurons focus on the four quadrants of the image

This combination favours the verdict 0


Page 12:

Identifying images

Suppose we have a network that can recognize handwritten digits {0, 1, . . . , 9} in images of size 28 × 28

We want to find these digits in a larger image

Slide a window of size 28 × 28 over the larger image

Apply the original network to each window

Pool the results to see if the digit occurs anywhere in the image
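A sketch of this sliding-window procedure (my own illustration; classify28 stands in for the hypothetical 28 × 28 digit network):

```python
import numpy as np

def sliding_window_detect(image, classify28, stride=1):
    """Apply a 28x28 digit classifier at every window position and pool.

    image: 2D array (H, W); classify28: maps a 28x28 patch to 10 scores.
    Returns the element-wise max of the scores over all windows.
    """
    H, W = image.shape
    pooled = np.full(10, -np.inf)
    for r in range(0, H - 28 + 1, stride):
        for c in range(0, W - 28 + 1, stride):
            scores = classify28(image[r:r + 28, c:c + 28])
            pooled = np.maximum(pooled, scores)  # max-pool across positions
    return pooled
```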


Page 13:

Convolutional neural network

Each “window” connects to a different node in the first hidden layer.


Page 14:

Convolutional neural network . . .

The sliding window performs a convolution of the window network function across the entire input

- Convolutional Neural Network (CNN)

Combine these appropriately; e.g., max-pooling partitions the features into small regions, say 2 × 2, and takes the maximum across each region
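A minimal 2 × 2 max-pool in NumPy (my own sketch, assuming the feature map has even dimensions):

```python
import numpy as np

def max_pool_2x2(features):
    """Partition a 2D feature map into 2x2 blocks and take each block's max."""
    H, W = features.shape  # assumes H and W are even
    blocks = features.reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(x))     # 2x2 output: the max of each 2x2 block
```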


Page 15:

Convolutional neural network . . .

Example

Input is a 28 × 28 MNIST image

Three features, each examining a 5 × 5 window

- Construct a separate hidden layer for each feature

- Each feature produces a hidden layer with 24 × 24 nodes

Max-pooling of size 2 × 2 reduces each feature layer to 12 × 12

- Second hidden layer

Finally, the three max-pool layers from the three hidden features are combined into 10 indicator output nodes for {0, 1, . . . , 9}


Page 16:

Convolutional neural network . . .

Network structure for this example

Input layer is 28 × 28

Three hidden feature layers, each 24 × 24

Three max-pool layers, each 12 × 12

10 output nodes
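For instance, this structure could be expressed in tf.keras (a sketch under the stated layer sizes; the sigmoid activations and softmax readout are my assumptions, not prescribed by the slides):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the example architecture: 28x28 input, three 5x5 feature maps
# (each 24x24), 2x2 max-pool (each 12x12), then 10 output nodes.
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(3, (5, 5), activation="sigmoid"),  # 3 features, 24x24 each
    layers.MaxPooling2D(pool_size=(2, 2)),           # 12x12 each
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),          # one indicator per digit
])
model.summary()
```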


Page 17:

Deep learning

Hidden layers extract “features” from the input

Individual features can be combined into more complex ones

Networks with multiple hidden layers are called deep neural networks

- How deep is deep?
- Originally, even 4 layers was deep
- Recently, 100+ layers have been used

The main challenge is learning the weights

- Vanishing gradients
- Enormously large training data
