Incremental Stochastic Gradient Descent - · PDF fileIncremental Stochastic Gradient Descent...


Page 1: Incremental Stochastic Gradient Descent -  · PDF fileIncremental Stochastic Gradient Descent ... ALVINN Drives 70 mph on a public highway ... 158-5(NeuralNetworks).ppt Author:

Incremental Stochastic Gradient Descent

  • Batch mode: gradient descent over the entire data set $D$
      $w \leftarrow w - \eta \nabla E_D[w]$, where $E_D[w] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

  • Incremental mode: gradient descent over individual training examples $d$
      $w \leftarrow w - \eta \nabla E_d[w]$, where $E_d[w] = \frac{1}{2} (t_d - o_d)^2$

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if $\eta$ is small enough.
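The two update modes can be compared side by side in a minimal sketch for a linear unit. The toy data set (targets generated by t = 1 + 2x), the function names, and the learning rate are assumptions made for illustration and are not taken from the slides.

```python
def predict(w, x):
    """Linear unit output o = sum_i w_i * x_i (x[0] is a constant 1 for the bias)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_error(w, data):
    """E_D[w] = 1/2 * sum over d in D of (t_d - o_d)^2."""
    return 0.5 * sum((t - predict(w, x)) ** 2 for x, t in data)

def batch_epoch(w, data, eta):
    """Batch mode: accumulate the gradient of E_D over all of D, then update once."""
    grad = [0.0] * len(w)
    for x, t in data:
        err = t - predict(w, x)
        for i, xi in enumerate(x):
            grad[i] += -err * xi              # dE_D/dw_i
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def incremental_epoch(w, data, eta):
    """Incremental mode: update the weights after every single example d."""
    w = list(w)
    for x, t in data:
        err = t - predict(w, x)
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]   # w - eta * grad E_d[w]
    return w

# Hypothetical toy data: t = 1 + 2*x, with x[0] = 1 as the bias input.
DATA = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]

w_batch = [0.0, 0.0]
w_inc = [0.0, 0.0]
for _ in range(2000):
    w_batch = batch_epoch(w_batch, DATA, eta=0.01)
    w_inc = incremental_epoch(w_inc, DATA, eta=0.01)
print(w_batch, w_inc)   # both should end up close to [1.0, 2.0]
```

With a small learning rate, one incremental pass over the data moves the weights almost exactly as far as one batch step, which is the sense in which the incremental mode approximates the batch mode.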


Comparison: Perceptron and Gradient Descent Rule

Perceptron learning rule is guaranteed to succeed (perfectly classifying the training examples) if
  • Training examples are linearly separable
  • Learning rate η is sufficiently small

Linear unit training rule using gradient descent
  • Guaranteed to converge to the hypothesis with minimum squared error
  • Given a sufficiently small learning rate η
  • Even when the training data contains noise
  • Even when the training data is not separable by H
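As an illustration of the first case, here is a minimal sketch of the perceptron training rule on a linearly separable problem. The AND data set, the learning rate of 0.5, and the function names are assumptions chosen for the example.

```python
def perceptron_output(w, x):
    """Thresholded unit: 1 if the weighted sum is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train_perceptron(data, eta=0.5, max_epochs=100):
    """Perceptron rule: w_i <- w_i + eta * (t - o) * x_i, applied only on mistakes."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in data:
            o = perceptron_output(w, x)
            if o != t:
                mistakes += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:          # every training example classified correctly
            break
    return w

# Hypothetical data: logical AND, with x[0] = 1 as the bias input.
AND_DATA = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]

w_and = train_perceptron(AND_DATA)
print(w_and, [perceptron_output(w_and, x) for x, _ in AND_DATA])
```

On data that is not linearly separable (XOR, for example) this loop would simply run until `max_epochs` without reaching zero mistakes, which is where the gradient descent rule for linear units still gives a best-fit answer.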


Restaurant Problem: Will I wait for a table?

  • Alternate – whether there is a suitable alternative restaurant nearby
  • Bar – whether the restaurant has a comfortable bar area to wait in
  • Fri/Sat – true on Fridays and Saturdays
  • Hungry – whether we are hungry
  • Patrons – how many people are in the restaurant (None, Some, or Full)
  • Price – the restaurant's price range ($, $$, $$$)
  • Raining – whether it is raining outside
  • Reservation – whether we made a reservation
  • Type – the kind of restaurant (French, Italian, Thai, or Burger)
  • WaitEstimate – the wait estimate by the host (0-10 minutes, 10-30, 30-60, > 60)


Multilayer Network


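A compromise function

  • Perceptron: $\text{output} = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ 0 & \text{else} \end{cases}$

  • Linear: $\text{output} = \text{net} = \sum_{i=0}^{n} w_i x_i$

  • Sigmoid (Logistic): $\text{output} = \sigma(\text{net}) = \frac{1}{1 + e^{-\text{net}}}$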
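The three unit types differ only in what they do with the same weighted sum. A minimal sketch follows; the function names and the example weights are illustrative assumptions.

```python
import math

def net_input(w, x):
    """net = sum_{i=0..n} w_i * x_i, with x[0] = 1 acting as the bias input."""
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron_unit(w, x):
    """Hard threshold: 1 if net > 0, else 0."""
    return 1 if net_input(w, x) > 0 else 0

def linear_unit(w, x):
    """Output is the net input itself."""
    return net_input(w, x)

def sigmoid_unit(w, x):
    """Smooth, differentiable squashing of the net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net_input(w, x)))

w = [-1.0, 2.0]   # hypothetical weights
x = [1.0, 0.5]    # net = -1.0 + 2.0 * 0.5 = 0.0
print(perceptron_unit(w, x), linear_unit(w, x), sigmoid_unit(w, x))  # 0 0.0 0.5
```

The sigmoid is the compromise in the sense that it saturates like the perceptron's threshold but stays differentiable like the linear unit, so gradient descent can be applied to it.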


Learning in Multilayer Networks

  • Same method as for perceptrons
  • Example inputs are presented to the network
  • If the network computes an output that matches the desired output, nothing is done
  • If there is an error, then the weights are adjusted to balance the error
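One way to make the weight adjustment concrete is a single backpropagation step for a tiny network with two sigmoid hidden units and one sigmoid output, minimizing E = 1/2 (t - o)^2. This is a sketch under assumptions: the network size, the initial weights, and the learning rate are hypothetical and not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, W2, x):
    """x has a leading 1 for the bias. Returns hidden activations (with bias) and output."""
    h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    o = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, o

def error(W1, W2, x, t):
    return 0.5 * (t - forward(W1, W2, x)[1]) ** 2

def backprop_step(W1, W2, x, t, eta):
    """One gradient descent update of all weights for a single example (x, t)."""
    h, o = forward(W1, W2, x)
    delta_o = (t - o) * o * (1 - o)                          # output error term
    delta_h = [h[j + 1] * (1 - h[j + 1]) * W2[j + 1] * delta_o
               for j in range(len(W1))]                      # hidden error terms
    new_W2 = [w + eta * delta_o * hi for w, hi in zip(W2, h)]
    new_W1 = [[w + eta * delta_h[j] * xi for w, xi in zip(row, x)]
              for j, row in enumerate(W1)]
    return new_W1, new_W2

# Hypothetical initial weights: 2 hidden units, each with a bias and 2 input weights.
W1 = [[0.1, 0.4, -0.3], [-0.2, 0.2, 0.5]]
W2 = [0.3, -0.6, 0.7]
x, t = [1.0, 1.0, 0.0], 1.0

before = error(W1, W2, x, t)
W1_new, W2_new = backprop_step(W1, W2, x, t, eta=0.5)
after = error(W1_new, W2_new, x, t)
print(before, after)   # the error on this example should shrink
```

If the output already equals the target, both error terms are zero and the weights are left unchanged, matching the "nothing is done" case above.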


BackPropagation Learning


Alternative Error Measures


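Neural Network Model

[Figure: feed-forward network diagram. Inputs (independent variables): Age = 34, Gender = 2, Stage = 4. Weighted connections lead through a hidden layer of Σ units to an output Σ unit. Weight labels in the figure: .6, .5, .8, .2, .1, .3, .7, .2, .4, .2. Output (dependent variable, prediction): 0.6, the "Probability of being Alive".]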


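Getting an answer from a NN

[Figure: network diagram with part of the connections shown. Inputs (independent variables): Age = 34, Gender = 2, Stage = 4. Weight labels shown: .6, .5, .8, .1, .7. Hidden layer of Σ units. Output (dependent variable, prediction): 0.6, the "Probability of being Alive".]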


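Getting an answer from a NN

[Figure: network diagram with part of the connections shown. Inputs (independent variables): Age = 34, Gender = 2, Stage = 4. Weight labels shown: .5, .8, .2, .3, .2. Hidden layer of Σ units. Output (dependent variable, prediction): 0.6, the "Probability of being Alive".]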


Getting an answer from a NN

[Figure: network diagram with all connections shown. Inputs (independent variables): Age = 34, Gender = 1, Stage = 4. Weight labels shown: .6, .5, .8, .2, .1, .3, .7, .2. Hidden layer of Σ units. Output (dependent variable, prediction): 0.6, the "Probability of being Alive".]


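Minimizing the Error

[Figure: error surface plotted against a weight w. Training moves from w initial (initial error) to w trained (final error). Where the derivative is negative, the change in w is positive. The final point is a local minimum.]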
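The picture can be reproduced numerically with gradient descent on a one-dimensional error curve. The curve below is a hypothetical example with two minima, chosen only to show that the starting weight decides which minimum is reached; it is not the surface from the slide.

```python
def error(w):
    # Hypothetical 1-D error surface with two minima (not taken from the slides).
    return w**4 - 3 * w**2 + w

def d_error(w):
    # Derivative of the error with respect to w.
    return 4 * w**3 - 6 * w + 1

def descend(w, eta=0.01, steps=2000):
    """Plain gradient descent: a negative derivative gives a positive change in w."""
    for _ in range(steps):
        w -= eta * d_error(w)
    return w

w_left = descend(-2.0)    # starts left of the deeper minimum
w_right = descend(2.0)    # starts right of the shallower (local) minimum
print(w_left, error(w_left))
print(w_right, error(w_right))
```

Both runs stop where the derivative is zero, but only the first one finds the lower error. Gradient descent by itself gives no way of telling the two outcomes apart.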


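Representational Power (FFNN)

  • Boolean functions: every Boolean function can be represented with 2 layers of units
  • Continuous functions: bounded continuous functions can be approximated with 2 layers of units (sigmoid then linear)
  • Arbitrary functions: can be approximated with 3 layers of units (sigmoids then linear)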
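For the Boolean case, XOR is the standard example of a function that a single perceptron cannot represent but two layers of threshold units can. The sketch below uses hand-picked weights (an assumption for illustration, not learned values): the hidden layer computes OR and AND, and the output unit computes "OR and not AND".

```python
def threshold_unit(weights, inputs):
    """Perceptron unit; weights[0] is the bias weight for a constant input of 1."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else 0

def xor_network(x1, x2):
    h_or = threshold_unit([-0.5, 1, 1], [x1, x2])          # fires if x1 OR x2
    h_and = threshold_unit([-1.5, 1, 1], [x1, x2])         # fires if x1 AND x2
    return threshold_unit([-0.5, 1, -1], [h_or, h_and])    # OR and not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```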


Hypothesis Space and Inductive Bias


Hidden Layer Representations


Hidden Layer Representations


Overfitting


Neural Nets for Face Recognition


Learning Hidden Unit Weights


ALVINN Drives 70 mph on a public highway

[Figure: ALVINN network]
  • Camera image: 30x32 pixels as inputs
  • 4 hidden units
  • 30 outputs for steering
  • 30x32 weights into one of the four hidden units
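The layer sizes above fix how many weights the network has to learn. A small sketch of the arithmetic, assuming fully connected layers; whether each unit also has a bias weight is not stated on the slide, so both counts are shown.

```python
def count_weights(layer_sizes, with_bias=False):
    """Number of weights in a fully connected feed-forward network."""
    extra = 1 if with_bias else 0
    return sum((n_in + extra) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

ALVINN_LAYERS = [30 * 32, 4, 30]   # 960 inputs, 4 hidden units, 30 steering outputs

print(count_weights(ALVINN_LAYERS))                  # 3960 = 960*4 + 4*30
print(count_weights(ALVINN_LAYERS, with_bias=True))  # 3994 = 961*4 + 5*30
```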


Handwritten Character Recognition

  • Le Cun et al. (1989) implemented a neural network to read zip codes on hand-addressed envelopes, for sorting purposes
  • To identify the digits, the network uses a 16x16 array of pixels as input, 3 hidden layers, and a distributed output encoding with 10 output units for digits 0-9
  • 256 input nodes, 10 output units (1 for the likelihood of each digit)


Interpreting Satellite Imagery for Automated Weather Forecasting


Recurrent Neural Nets


Neural Network Language Models

  • Statistical language modeling: predict the probability of the next word in a sequence
      I was headed to Madrid, ____
      P(___ = “Spain”) = 0.5, P(___ = “but”) = 0.2, etc.
  • Used in speech recognition, machine translation, and (recently) information extraction
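A network that predicts the next word has to turn one score per candidate word into a probability distribution. A common way to do that is a softmax over the output scores; the slides do not say which output function is used, so the softmax here is an assumption, and the scores are made-up numbers rather than the output of a trained model.

```python
import math

def softmax(scores):
    """Turn a dict of word -> score into a dict of word -> probability."""
    top = max(scores.values())                       # subtract the max for numerical stability
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical output scores for the blank in "I was headed to Madrid, ____".
scores = {"Spain": 2.0, "but": 1.1, "and": 0.3, "the": -1.0}

probs = softmax(scores)
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"P(___ = {word!r}) = {p:.2f}")
```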


Summary

  • Perceptrons (one-layer networks) are insufficiently expressive
  • Multi-layer networks are sufficiently expressive and can be trained by error back-propagation
  • Many applications, including speech, driving, handwritten character recognition, fraud detection, etc.


Local Search algorithms

  • In many optimization problems, the path to the goal is irrelevant; the goal state itself is the solution
  • In such cases, we can use local search algorithms
  • Keep a single "current" state and try to improve it
      • Hill-climbing
      • Simulated annealing
      • Local Beam Search
      • Stochastic Beam Search
      • Genetic Algorithms
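Hill-climbing is the simplest of these methods: keep one current state and move to the best neighbor as long as that improves the value. The sketch below is a generic version; the `neighbors` and `value` callables and the toy landscape are assumptions made for the example.

```python
def hill_climb(state, neighbors, value):
    """Move to the best neighbor until no neighbor is better than the current state."""
    while True:
        best = max(neighbors(state), key=value, default=state)
        if value(best) <= value(state):
            return state          # a (possibly only local) maximum
        state = best

# Hypothetical 1-D landscape: the state is an index into this list of values.
LANDSCAPE = [1, 3, 2, 5, 9, 4]

def index_neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]

def index_value(i):
    return LANDSCAPE[i]

print(hill_climb(0, index_neighbors, index_value))  # 1: stuck on the local maximum (value 3)
print(hill_climb(3, index_neighbors, index_value))  # 4: reaches the global maximum (value 9)
```

The first run shows the main weakness of plain hill-climbing, which the other methods in the list try to address with randomness or by keeping more than one state.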


Genetic algorithms

  • A successor state is generated by combining two parent states
  • Start with k randomly generated states (population)
  • A state is represented as a string over a finite alphabet (often a string of 0s and 1s)
  • Evaluation function (fitness function). Higher values for better states.
  • Produce the next generation of states by selection,


Genetic algorithms

  • Fitness function: number of non-attacking pairs of queens (min = 0, max = 8 × 7/2 = 28)
  • A state is selected for reproduction with probability proportional to its fitness: 24/(24+23+20+11) = 31%, 23/(24+23+20+11) = 29%, etc.
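Both quantities are easy to compute. The sketch below assumes a state is encoded as a list giving the row of the queen in each column; that encoding and the function names are assumptions, while the four fitness values are the ones from the slide.

```python
from itertools import combinations

def non_attacking_pairs(rows):
    """Fitness of an n-queens state, where rows[c] is the row of the queen in column c."""
    attacking = sum(
        1
        for (c1, r1), (c2, r2) in combinations(enumerate(rows), 2)
        if r1 == r2 or abs(r1 - r2) == abs(c1 - c2)   # same row or same diagonal
    )
    n = len(rows)
    return n * (n - 1) // 2 - attacking

def selection_probabilities(fitnesses):
    """Fitness-proportional selection: each state's share of the total fitness."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

print(non_attacking_pairs([1, 5, 8, 6, 3, 7, 2, 4]))   # 28: a solved board
probs = selection_probabilities([24, 23, 20, 11])
print([f"{p:.0%}" for p in probs])                     # ['31%', '29%', '26%', '14%']
```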


Genetic algorithms


Genetic Algorithms Continued…

1.  Choose initial population
2.  Evaluate fitness of each in population
3.  Repeat the following until we hit a terminating condition:
    1.  Select best-ranking to reproduce
    2.  Breed using crossover and mutation
    3.  Evaluate the fitnesses of the offspring
    4.  Replace worst ranked part of population with offspring
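The loop above can be sketched on a toy problem. Everything problem-specific here is an assumption chosen for illustration: the "count the 1 bits" fitness, the population size, the mutation rate, and keeping the top half as parents.

```python
import random

def fitness(bits):
    return sum(bits)                     # hypothetical toy fitness: number of 1 bits

def crossover(a, b, rng):
    cut = rng.randrange(1, len(a))       # single-point crossover
    return a[:cut] + b[cut:]

def mutate(bits, rng, rate=0.05):
    return [1 - b if rng.random() < rate else b for b in bits]

def genetic_algorithm(pop_size=20, length=16, generations=50, seed=0):
    rng = random.Random(seed)
    # 1. Choose initial population
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # 2. / 3.3 Evaluate fitness (of the initial population, then of the offspring) and rank
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        if history[-1] == length:        # terminating condition: optimum found
            break
        # 3.1 Select best-ranking to reproduce
        parents = population[: pop_size // 2]
        # 3.2 Breed using crossover and mutation
        offspring = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                     for _ in range(pop_size - len(parents))]
        # 3.4 Replace worst-ranked part of population with offspring
        population = parents + offspring
    return max(population, key=fitness), history

best, history = genetic_algorithm()
print(fitness(best), history)
```

Because the best-ranked half always survives, the best fitness recorded in `history` can never decrease from one generation to the next.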


How computers play games…


Minimax: An Optimal Strategy


Minimax Algorithm: An Optimal Strategy

Choose the best move based on the resulting states’ MINIMAX-VALUE…

MINIMAX-VALUE(n) =
    if n is a terminal state then Utility(n)
    else if MAX’s turn, the MAXIMUM MINIMAX-VALUE of all possible successors to n
    else if MIN’s turn, the MINIMUM MINIMAX-VALUE of all possible successors to n