Incremental Stochastic Gradient Descent
Batch mode: gradient descent over the entire data set D:
    w ← w − η ∇E_D[w],   where E_D[w] = ½ Σ_d (t_d − o_d)²
Incremental mode: gradient descent over individual training examples d:
    w ← w − η ∇E_d[w],   where E_d[w] = ½ (t_d − o_d)²
Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough.
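The two modes can be contrasted on a toy linear unit. This is a sketch under assumed conventions (bias folded in as x[0] = 1, a one-feature fit of t = 2x), not code from the slides:

```python
def batch_gd_step(w, data, eta):
    """One batch step on E_D[w] = 1/2 * sum_d (t_d - o_d)^2 for a linear
    unit o = w.x; the gradient is accumulated over the whole data set D."""
    grad = [0.0] * len(w)
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        for i in range(len(w)):
            grad[i] += -(t - o) * x[i]          # dE_D/dw_i
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def incremental_gd_step(w, example, eta):
    """One stochastic step on a single example's error E_d[w] = 1/2 (t_d - o_d)^2."""
    x, t = example
    o = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

# Fit t = 2x (bias folded in as x[0] = 1); with a small eta the two
# modes converge to essentially the same weights.
data = [([1.0, x], 2.0 * x) for x in [0.0, 1.0, 2.0, 3.0]]
w_batch = [0.0, 0.0]
w_inc = [0.0, 0.0]
for _ in range(2000):
    w_batch = batch_gd_step(w_batch, data, eta=0.01)
    for ex in data:
        w_inc = incremental_gd_step(w_inc, ex, eta=0.01)
```

Both weight vectors end up near [0, 2], illustrating the claim that the incremental mode approximates the batch mode for small η.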
Comparison: Perceptron and Gradient Descent Rule
The perceptron learning rule is guaranteed to succeed (perfectly classifying the training examples) if:
  the training examples are linearly separable, and
  the learning rate η is sufficiently small.
Linear unit training rules using gradient descent are guaranteed to converge to the hypothesis with minimum squared error, given a sufficiently small learning rate η, even when the training data contains noise and even when the training data is not separable by H.
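The perceptron rule above can be sketched as follows, assuming the standard update w_i ← w_i + η(t − o)x_i with the bias folded in as x[0] = 1 (an illustrative sketch, not code from the slides):

```python
def perceptron_train(data, eta=0.1, epochs=100):
    """Perceptron learning rule: w_i <- w_i + eta * (t - o) * x_i, with
    threshold output o = 1 if w.x > 0 else 0.  Converges when the data
    is linearly separable and eta is small enough."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        errors = 0
        for x, t in data:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if o != t:
                errors += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if errors == 0:      # every training example classified correctly
            break
    return w

# Linearly separable example: logical AND (x[0] = 1 is the bias input).
and_data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = perceptron_train(and_data)
```

On this separable data the loop exits early once an entire epoch passes with no misclassifications.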
Restaurant Problem: Will I wait for a table?
Alternate – whether there is a suitable alternative restaurant nearby
Bar – whether the restaurant has a comfortable bar area to wait in
Fri/Sat – true on Fridays and Saturdays
Hungry – whether we are hungry
Patrons – how many people are in the restaurant (None, Some, or Full)
Price – the restaurant's price range ($, $$, $$$)
Raining – whether it is raining outside
Reservation – whether we made a reservation
Type – the kind of restaurant (French, Italian, Thai, or Burger)
WaitEstimate – the wait estimate by the host (0–10 minutes, 10–30, 30–60, > 60)
Multilayer Network
A compromise function:
Perceptron:  output = 1 if Σ_{i=0..n} w_i x_i > 0, else 0
Linear:  output = net = Σ_{i=0..n} w_i x_i
Sigmoid (Logistic):  output = σ(net) = 1 / (1 + e^−net)
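The three unit types can be written out directly (a minimal sketch; the weight and input vectors are assumed to include the bias term as index 0):

```python
import math

def perceptron_output(w, x):
    """Threshold unit: 1 if sum_i w_i x_i > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def linear_output(w, x):
    """Linear unit: output = net = sum_i w_i x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid_output(w, x):
    """Sigmoid unit: output = 1 / (1 + e^-net) -- the smooth compromise
    between the hard threshold and the unbounded linear unit."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-net))
```

The sigmoid is the one used in multilayer networks because it is differentiable everywhere, which gradient-descent training requires.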
Learning in Multilayer Networks
Same method as for perceptrons: example inputs are presented to the network. If the network computes an output that matches the desired output, nothing is done. If there is an error, the weights are adjusted to reduce the error.
BackPropagation Learning
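A minimal backpropagation sketch for one hidden layer of sigmoid units, using the standard delta rules (δ_o = o(1−o)(t−o) for the output unit, δ_h = h(1−h) w_o δ_o for hidden units). This is an illustrative implementation under those assumptions, trained here on XOR, not code from the slides:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_epoch(W_h, W_o, data, eta):
    """One pass of backpropagation for a 1-hidden-layer sigmoid network.
    W_h: one weight vector per hidden unit; W_o: output-unit weights.
    Hidden activations get a leading 1.0 so W_o[0] acts as the output bias."""
    for x, t in data:
        # forward pass
        h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(wh, x))) for wh in W_h]
        o = sigmoid(sum(w * hi for w, hi in zip(W_o, h)))
        # backward pass: delta rules for sigmoid units
        delta_o = o * (1 - o) * (t - o)
        delta_h = [h[j + 1] * (1 - h[j + 1]) * W_o[j + 1] * delta_o
                   for j in range(len(W_h))]
        W_o[:] = [w + eta * delta_o * hi for w, hi in zip(W_o, h)]
        for j, wh in enumerate(W_h):
            wh[:] = [w + eta * delta_h[j] * xi for w, xi in zip(wh, x)]

def total_error(W_h, W_o, data):
    """Summed squared error 1/2 (t - o)^2 over the data set."""
    e = 0.0
    for x, t in data:
        h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(wh, x))) for wh in W_h]
        o = sigmoid(sum(w * hi for w, hi in zip(W_o, h)))
        e += 0.5 * (t - o) ** 2
    return e

# XOR is not linearly separable, so it genuinely needs the hidden layer.
random.seed(0)
xor = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 0)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
W_o = [random.uniform(-0.5, 0.5) for _ in range(4)]
err_before = total_error(W_h, W_o, xor)
for _ in range(5000):
    backprop_epoch(W_h, W_o, xor, eta=0.5)
err_after = total_error(W_h, W_o, xor)
```

Training drives the squared error down from its initial value; like all gradient descent, backpropagation can stop in a local minimum rather than the global one.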
Alternative Error Measures
Neural Network Model
[Figure: a feed-forward network. Inputs (independent variables): Age = 34, Gender = 2, Stage = 4. Weighted connections (values such as .1–.8) feed a hidden layer of Σ units, whose outputs feed a single output unit. Output (dependent variable / prediction): "Probability of being Alive" = 0.6.]
Getting an answer from a NN
[Figure sequence: the same network evaluated step by step — first one hidden unit, then the other, then the full forward pass — with the inputs Age, Gender, and Stage combined through the weighted hidden layer to produce the output "Probability of being Alive" = 0.6.]
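The forward pass traced in these slides can be written out as code. The weights below are illustrative placeholders in the spirit of the figure, not the slide's exact values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, output_weights):
    """Forward pass: each hidden unit takes a weighted sum (Σ) of the
    inputs, squashes it with a sigmoid, and the output unit combines
    the hidden activations the same way."""
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in hidden_weights]
    return sigmoid(sum(w * hi for w, hi in zip(output_weights, h)))

# Illustrative numbers (hypothetical weights, not the slide's):
x = [34, 2, 4]                                   # Age, Gender, Stage
hidden_weights = [[0.06, 0.5, 0.8], [0.02, 0.1, 0.3]]
output_weights = [0.7, 0.2]
p_alive = forward(x, hidden_weights, output_weights)   # a value in (0, 1)
```

Because the output unit is a sigmoid, the prediction always lands in (0, 1) and can be read as a probability.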
Minimizing the Error
[Figure: error surface plotted over a weight w, showing the initial weight and initial error, a negative derivative producing a positive weight change, and descent to a local minimum with the final trained weight and final error.]
Representational Power (FFNN)
Boolean functions: 2 layers of units
Continuous functions: 2 layers of units (sigmoid then linear)
Arbitrary functions: 3 layers of units (sigmoids then linear)
Hypothesis Space and Inductive Bias
Hidden Layer Representations
Overfitting
Neural Nets for Face Recognition
Learning Hidden Unit Weights
ALVINN Drives 70 mph on a public highway
Camera image: 30x32 pixels as inputs
4 hidden units, with 30x32 weights into each hidden unit
30 outputs for steering
Handwritten Character Recognition
Le Cun et al. (1989) implemented a neural network to read zip codes on hand-addressed envelopes, for sorting purposes.
To identify the digits, it uses a 16x16 array of pixels as input (256 input nodes), 3 hidden layers, and a distributed output encoding with 10 output units, one for the likelihood of each digit 0-9.
Interpreting Satellite Imagery for Automated Weather Forecasting
Recurrent Neural Nets
Neural Network Language Models
Statistical language modeling: predict the probability of the next word in a sequence.
"I was headed to Madrid, ____"   P(____ = "Spain") = 0.5, P(____ = "but") = 0.2, etc.
Used in speech recognition, machine translation, (recently) information extraction
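The prediction task itself can be illustrated with a simple count-based bigram model; neural language models replace these counts with a learned network, but estimate the same quantity (a toy sketch on a hypothetical two-sentence corpus):

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Count-based language model: P(next | prev) is estimated as
    count(prev, next) / count(prev)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def p_next(counts, prev, word):
    """Probability that `word` follows `prev` under the counted model."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

# Hypothetical toy corpus:
corpus = [
    "i was headed to madrid spain",
    "i was headed to madrid but turned back",
]
counts = train_bigram_lm(corpus)
# After "madrid", probability mass is split between "spain" and "but".
```

The counts here give P("spain" | "madrid") = P("but" | "madrid") = 0.5; a real model is trained on vastly more text and conditions on more context than one word.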
Summary
Perceptrons (one-layer networks) are insufficiently expressive.
Multi-layer networks are sufficiently expressive and can be trained by error back-propagation.
Many applications, including speech, driving, handwritten character recognition, and fraud detection.
Local Search algorithms
In many optimization problems the path to the goal is irrelevant; the goal state itself is the solution.
In such cases we can use local search algorithms: keep a single "current" state and try to improve it.
Examples: hill-climbing, simulated annealing, local beam search, stochastic beam search, genetic algorithms.
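The simplest of these, hill-climbing, can be sketched generically (an illustrative sketch maximizing a toy one-dimensional function, not an algorithm from the slides):

```python
def hill_climb(initial, neighbors, value, max_steps=1000):
    """Keep a single current state; repeatedly move to the best neighbor
    as long as that improves the value, stopping at a local maximum."""
    current = initial
    for _ in range(max_steps):
        best = max(neighbors(current), key=value, default=current)
        if value(best) <= value(current):
            return current            # no uphill neighbor: local maximum
        current = best
    return current

# Toy example: maximize f(x) = -(x - 7)^2 over the integers.
f = lambda x: -(x - 7) ** 2
step = lambda x: [x - 1, x + 1]
best = hill_climb(0, step, f)
```

On this single-peaked function hill-climbing reaches the optimum x = 7; on multi-peaked surfaces it can stop at a local maximum, which is exactly what simulated annealing and genetic algorithms try to escape.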
Genetic algorithms
A successor state is generated by combining two parent states.
Start with k randomly generated states (the population).
A state is represented as a string over a finite alphabet (often a string of 0s and 1s).
An evaluation function (fitness function) assigns higher values to better states.
Produce the next generation of states by selection, crossover, and mutation.
Genetic algorithms
Fitness function: number of non-attacking pairs of queens (min = 0, max = 8 × 7/2 = 28).
Selection probabilities are proportional to fitness: 24/(24+23+20+11) = 31%, 23/(24+23+20+11) = 29%, etc.
Genetic Algorithms Continued…
1. Choose the initial population
2. Evaluate the fitness of each individual in the population
3. Repeat the following until we hit a terminating condition:
   1. Select the best-ranking individuals to reproduce
   2. Breed using crossover and mutation
   3. Evaluate the fitnesses of the offspring
   4. Replace the worst-ranked part of the population with the offspring
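The loop above can be sketched for the 8-queens example, with the slide's non-attacking-pairs fitness and fitness-proportionate selection. The population size, mutation rate, and generation cap are illustrative choices, and a run is not guaranteed to reach a perfect solution:

```python
import random

def fitness(state):
    """Number of non-attacking queen pairs (max 8*7/2 = 28).
    state[c] is the row of the queen in column c."""
    n = len(state)
    attacks = sum(1 for i in range(n) for j in range(i + 1, n)
                  if state[i] == state[j] or abs(state[i] - state[j]) == j - i)
    return n * (n - 1) // 2 - attacks

def select(population):
    """Fitness-proportionate selection, as in the 24/(24+23+20+11) example."""
    weights = [fitness(s) for s in population]
    return random.choices(population, weights=weights, k=1)[0]

def crossover(a, b):
    """Splice two parents at a random cut point."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(state, rate=0.1):
    """Occasionally move one queen to a random row."""
    s = list(state)
    if random.random() < rate:
        s[random.randrange(len(s))] = random.randrange(len(s))
    return s

def genetic_algorithm(k=20, generations=200):
    population = [[random.randrange(8) for _ in range(8)] for _ in range(k)]
    for _ in range(generations):
        best = max(population, key=fitness)
        if fitness(best) == 28:            # all pairs non-attacking
            return best
        population = [mutate(crossover(select(population), select(population)))
                      for _ in range(k)]
    return max(population, key=fitness)

random.seed(1)
solution = genetic_algorithm()
```

A known 8-queens solution such as [0, 4, 7, 5, 2, 6, 1, 3] scores the maximum fitness of 28.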
How computers play games…
Minimax: An Optimal Strategy
Choose the best move based on the resulting states' MINIMAX-VALUE:

MINIMAX-VALUE(n) =
  Utility(n)                                      if n is a terminal state
  the maximum MINIMAX-VALUE of n's successors     if it is MAX's turn
  the minimum MINIMAX-VALUE of n's successors     if it is MIN's turn
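The recursive definition translates almost directly into code. The tiny game tree below is a made-up example for illustration:

```python
def minimax_value(state, is_max, successors, utility, is_terminal):
    """MINIMAX-VALUE(n): the utility at terminal states; otherwise the
    max (on MAX's turn) or min (on MIN's turn) over all successors."""
    if is_terminal(state):
        return utility(state)
    values = [minimax_value(s, not is_max, successors, utility, is_terminal)
              for s in successors(state)]
    return max(values) if is_max else min(values)

# Hypothetical two-ply game tree; leaves carry utilities for MAX.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
leaves = {"D": 3, "E": 12, "F": 2, "G": 8}
value = minimax_value("A", True, lambda s: tree[s], leaves.get,
                      lambda s: s in leaves)
```

Here MIN would hold B to min(3, 12) = 3 and C to min(2, 8) = 2, so MAX's best value at the root is 3.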