
  • Data Mining and Machine Learning

    Madhavan Mukund

    Lecture 18, Jan–Apr 2020 https://www.cmi.ac.in/~madhavan/courses/dmml2020jan/


  • Loss functions (costs) for neural networks

    So far, we have assumed mean sum-squared error as the loss function.

    Consider a single neuron with two inputs, x = (x_1, x_2)

    C = (1/n) ∑_{i=1}^{n} (y_i − a_i)², where a_i = σ(z_i) = σ(w_1 x_{i1} + w_2 x_{i2} + b)

    For gradient descent, we compute ∂C/∂w_1, ∂C/∂w_2, ∂C/∂b

    - For j = 1, 2,

      ∂C/∂w_j = (2/n) ∑_{i=1}^{n} (y_i − a_i) · (−∂a_i/∂w_j)

              = (2/n) ∑_{i=1}^{n} (a_i − y_i) (∂a_i/∂z_i)(∂z_i/∂w_j)

              = (2/n) ∑_{i=1}^{n} (a_i − y_i) σ′(z_i) x_{ij}

    - ∂C/∂b = (2/n) ∑_{i=1}^{n} (a_i − y_i) (∂a_i/∂z_i)(∂z_i/∂b)

            = (2/n) ∑_{i=1}^{n} (a_i − y_i) σ′(z_i)

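    The following is a minimal NumPy sketch of these gradient formulas for a single sigmoid neuron; the helper names (sigmoid, mse_gradients) and the array shapes are our own illustrative choices, not code from the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_gradients(X, y, w, b):
    """Gradients of C = (1/n) sum_i (y_i - a_i)^2 for a single sigmoid neuron.

    X: (n, 2) inputs, y: (n,) targets, w: (2,) weights, b: scalar bias.
    Returns (dC/dw, dC/db) following the formulas above.
    """
    z = X @ w + b                 # z_i = w_1 x_i1 + w_2 x_i2 + b
    a = sigmoid(z)                # a_i = sigma(z_i)
    sigma_prime = a * (1 - a)     # sigma'(z_i) = sigma(z_i)(1 - sigma(z_i))
    n = len(y)
    dC_dw = (2.0 / n) * ((a - y) * sigma_prime) @ X    # (2/n) sum_i (a_i - y_i) sigma'(z_i) x_ij
    dC_db = (2.0 / n) * np.sum((a - y) * sigma_prime)  # (2/n) sum_i (a_i - y_i) sigma'(z_i)
    return dC_dw, dC_db
```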

  • Loss functions . . .

    ∂C/∂w_j = (2/n) ∑_{i=1}^{n} (a_i − y_i) σ′(z_i) x_{ij},   ∂C/∂b = (2/n) ∑_{i=1}^{n} (a_i − y_i) σ′(z_i)

    Each term in ∂C/∂w_1, ∂C/∂w_2, ∂C/∂b is proportional to σ′(z_i)

    Ideally, gradient descent should take large steps when a − y is large

    - σ(z) is flat at both extremes

    - If a is completely wrong, a ≈ 1 − y, we still have σ′(z) ≈ 0

    - Learning is slow even when the current model is far from optimal

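    A tiny numerical illustration of this slowdown (the values of z below are arbitrary; this is only a sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Suppose the target is y = 1 but the neuron currently outputs a close to 0,
# i.e. a is "completely wrong". The MSE gradient carries a factor
# sigma'(z) = a(1 - a), which is tiny at both extremes of the sigmoid.
for z in [-6.0, -4.0, -2.0, 0.0, 2.0]:
    a = sigmoid(z)
    print(f"z = {z:+.1f}  a = {a:.4f}  sigma'(z) = {a * (1 - a):.4f}")
# At z = -6 the error a - y is close to -1, yet sigma'(z) is about 0.0025,
# so the gradient step is tiny and learning is slow.
```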

  • Cross entropy

    A better loss function

    C(a, y) = −ln(a)       if y = 1
              −ln(1 − a)   if y = 0

    - If a ≈ y, C(a, y) ≈ 0 in both cases

    - If a ≈ 1 − y, C(a, y) → ∞ in both cases

    Combine into a single equation

    C(a, y) = −[y ln(a) + (1 − y) ln(1 − a)]

    - y = 1 ⇒ second term vanishes, C = −ln(a)

    - y = 0 ⇒ first term vanishes, C = −ln(1 − a)

    This is called cross entropy

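    A short sketch evaluating this loss at a few (a, y) pairs to see both behaviours; the function name cross_entropy is our own:

```python
import numpy as np

def cross_entropy(a, y):
    """C(a, y) = -[y ln(a) + (1 - y) ln(1 - a)] for a single output a in (0, 1)."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# C is close to 0 when a is close to y, and blows up when a is close to 1 - y.
for y in (0, 1):
    for a in (0.01, 0.5, 0.99):
        print(f"y = {y}, a = {a:4.2f}  C = {cross_entropy(a, y):.4f}")
```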

  • Cross entropy and gradient descent

    C = −[y ln(σ(z)) + (1 − y) ln(1 − σ(z))]

    ∂C/∂w_j = (∂C/∂σ)(∂σ/∂w_j)

            = −[ y/σ(z) − (1 − y)/(1 − σ(z)) ] ∂σ/∂w_j

            = −[ y/σ(z) − (1 − y)/(1 − σ(z)) ] (∂σ/∂z)(∂z/∂w_j)

            = −[ y/σ(z) − (1 − y)/(1 − σ(z)) ] σ′(z) x_j

            = −[ (y(1 − σ(z)) − (1 − y)σ(z)) / (σ(z)(1 − σ(z))) ] σ′(z) x_j


  • Cross entropy and gradient descent . . .

    ∂C/∂w_j = −[ (y(1 − σ(z)) − (1 − y)σ(z)) / (σ(z)(1 − σ(z))) ] σ′(z) x_j

    Recall that σ′(z) = σ(z)(1 − σ(z))

    Therefore, ∂C/∂w_j = −[y(1 − σ(z)) − (1 − y)σ(z)] x_j

                       = −[y − yσ(z) − σ(z) + yσ(z)] x_j

                       = (σ(z) − y) x_j = (a − y) x_j

    Similarly, ∂C/∂b = (a − y)

    Thus, as we wanted, the gradient is proportional to a − y

    The greater the error, the faster the learning

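    As a sanity check, here is a small sketch comparing the closed form (a − y) x_j with a finite-difference estimate of the cross-entropy gradient; the random example and step size are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, y):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# One input x with target y; compare dC/dw_j = (a - y) x_j against
# a central finite-difference estimate of the same derivative.
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
w, b = rng.normal(size=3), 0.1

a = sigmoid(w @ x + b)
analytic = (a - y) * x

eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric[j] = (cross_entropy(sigmoid(w_plus @ x + b), y)
                  - cross_entropy(sigmoid(w_minus @ x + b), y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # tiny: the two gradients agree
```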

  • Cross entropy . . .

    Overall,

    - ∂C/∂w_j = (1/n) ∑_{i=1}^{n} (a_i − y_i) x_{ij}

    - ∂C/∂b = (1/n) ∑_{i=1}^{n} (a_i − y_i)

    Cross entropy allows the network to learn faster when the model is far from the true one

    There are other theoretical justifications for using cross entropy

    - It can be derived from the goal of maximizing the log-likelihood of the model

    - This will be addressed in an advanced ML course

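    To see these averaged gradients in use, here is a minimal full-batch gradient-descent sketch for a single sigmoid neuron with cross-entropy loss; the toy dataset, learning rate, and the name train_single_neuron are illustrative assumptions, not course code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_single_neuron(X, y, lr=0.5, steps=1000):
    """Full-batch gradient descent for one sigmoid neuron with cross-entropy loss.

    Uses dC/dw_j = (1/n) sum_i (a_i - y_i) x_ij and dC/db = (1/n) sum_i (a_i - y_i).
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        a = sigmoid(X @ w + b)
        w -= lr * ((a - y) @ X) / n
        b -= lr * np.mean(a - y)
    return w, b

# Toy usage: learn a simple linear decision rule on two inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b = train_single_neuron(X, y)
print(w, b)   # weights roughly aligned with the direction (1, 1)
```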

  • Case study: Handwritten digits

    The MNIST database has tens of thousands of samples of handwritten digits {0, 1, . . . , 9} (60,000 training and 10,000 test images)

    Assume the input has been segmented into individual digits


  • Handwritten digits . . .

    Each image is 28× 28 = 784 pixels

    Each pixel is a grayscale value from 0.0 (white) to 1.0 (black)

    Building a neural network

    - Linearize the image row-wise, so the inputs are x_1, x_2, . . . , x_784

    - Single hidden layer, with 15 nodes

    - Output layer has 10 nodes, a decision a_j for each digit j ∈ {0, 1, . . . , 9}

    - Final output is the maximum among these

      - Naïvely, arg max_j a_j

      - Softmax: e^{a_j} / ∑_k e^{a_k} for each j, a smooth approximation to arg max

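    One common way to implement the softmax output; this sketch uses our own function name and an arbitrary activation vector, and subtracting the maximum is a standard trick to avoid overflow in the exponentials:

```python
import numpy as np

def softmax(a):
    """Turn the 10 output activations a_0, ..., a_9 into e^{a_j} / sum_k e^{a_k}."""
    e = np.exp(a - np.max(a))   # shifting by max(a) leaves the result unchanged
    return e / np.sum(e)

a = np.array([1.0, 2.0, 0.5, 5.0, 0.1, 0.0, 0.3, 0.2, 1.5, 0.7])
p = softmax(a)
print(p.round(3))          # a distribution sharply peaked at index 3
print(int(np.argmax(p)))   # same digit as the naive arg max of a
```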

  • Handwritten digits . . .

    [Figure: a neural network to recognize handwritten digits]


  • Handwritten digits . . .

    Intuitively, the internal nodes recognize features

    Combinations of features identify digits

    Hypothetically, suppose the first four hidden neurons focus on the four quadrants of the image

    This combination of features favours the verdict 0


  • Identifying images

    Suppose we have a network that can recognize handwritten digits {0, 1, . . . , 9} in images of size 28 × 28

    We want to find these images in a larger image

    Slide a window of size 28× 28 over the larger image

    Apply the original network to each window

    Pool the results to see if the digit occurs anywhere in the image

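    A sketch of this sliding-window idea; classify_digit stands in for the trained 28 × 28 network, and the stride and max-pooling rule are our own illustrative choices:

```python
import numpy as np

def slide_and_pool(image, classify_digit, window=28, stride=4):
    """Apply a 28x28 digit classifier to every window of a larger image and
    pool the per-window scores with a maximum.

    classify_digit: placeholder for the trained network; it should map a
    (28, 28) array to a length-10 vector of scores, one per digit.
    """
    H, W = image.shape
    pooled = np.full(10, -np.inf)
    for r in range(0, H - window + 1, stride):
        for c in range(0, W - window + 1, stride):
            scores = classify_digit(image[r:r + window, c:c + window])
            pooled = np.maximum(pooled, scores)   # max over all window positions
    return pooled   # a high score for digit d suggests d occurs somewhere in the image
```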

  • Convolutional neural network

    Each “window” connects to a different node in the first hidden layer.


  • Convolutional neural network . . .

    Sliding window performs a convolution of the window network function across the entire input

    I Convolutional Neural Network (CNN)

    Combine these appropriately; for example, max-pooling partitions the features into small regions, say 2 × 2, and takes the maximum across each region

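    A minimal sketch of 2 × 2 max-pooling on a 24 × 24 feature map (the function name and the use of NumPy reshaping are our own choices):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Partition a (24, 24) feature map into 2x2 regions and keep each maximum,
    giving a (12, 12) pooled layer."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)   # blocks[i, :, j, :] is one 2x2 region
    return blocks.max(axis=(1, 3))

fm = np.arange(24 * 24, dtype=float).reshape(24, 24)
print(max_pool_2x2(fm).shape)   # (12, 12)
```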

  • Convolutional neural network . . .

    Example

    Input is 28× 28 MNIST image

    Three features, each examines a 5 × 5 window

    - Construct a separate hidden layer for each feature

    - Each feature produces a hidden layer with 24 × 24 nodes

    Max-pool of size 2 × 2 partitions each feature layer into 12 × 12

    - Second hidden layer

    Finally, three max-pool layers from three hidden features are combined into 10 indicator output nodes for {0, 1, . . . , 9}

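    The layer sizes in this example follow from simple arithmetic; here is a quick check (the variable names are ours):

```python
# A 5x5 window slid over a 28x28 image fits in 28 - 5 + 1 = 24 positions
# in each direction, and 2x2 max-pooling halves each side.
image, window, pool, features, outputs = 28, 5, 2, 3, 10

feature_side = image - window + 1    # 24
pooled_side = feature_side // pool   # 12

print(f"each feature layer : {feature_side} x {feature_side}")           # 24 x 24
print(f"each max-pool layer: {pooled_side} x {pooled_side}")             # 12 x 12
print(f"nodes feeding the output layer: {features * pooled_side ** 2}")  # 3 * 144 = 432
print(f"output nodes: {outputs}")
```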

  • Convolutional neural network . . .

    Network structure for this example

    Input layer is 28 × 28

    Three hidden feature layers, each 24 × 24

    Three max-pool layers, each 12 × 12

    10 output nodes


  • Deep learning

    Hidden layers extract “features” from the input

    Individual features can be combined into more complex ones

    Networks with multiple hidden layers are called deep neural networks

    - How deep is deep?

    - Originally, even 4 layers was deep

    - Recently, 100+ layers have been used

    The main challenge is learning the weights

    - Vanishing gradients

    - Enormously large training data
