1
On-Line Learning with Recycled Examples
A Cavity Analysis
Peixun Luo and K Y Michael Wong
Hong Kong University of Science and Technology
2
Inputs ξj, j = 1 … N
Weights Jj, j = 1 … N
Activation y = J·ξ
Output S = f(y)
Formulation
[Diagram: single-layer network with weights Jj producing activation y]
3
Given p = αN examples with
inputs ξjμ, j = 1 … N, μ = 1 … p,
outputs yμ generated by a teacher network.
Learning is done by defining a risk function and minimizing it by gradient descent.
The Learning of a Task
[Diagram: student network with weights Jj and activation y]
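The teacher-student setup above can be sketched numerically. This is a minimal illustration assuming a linear teacher; the values of N and α are illustrative, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100               # input dimension (illustrative)
alpha = 2.0           # load parameter (illustrative)
p = int(alpha * N)    # number of examples, p = alpha N

# Teacher network: a fixed weight vector B generates the target outputs
B = rng.standard_normal(N) / np.sqrt(N)

xi = rng.standard_normal((p, N))   # inputs xi_j^mu, mu = 1 .. p
y = xi @ B                         # outputs y^mu from the teacher

# A student with weights J computes an activation J . xi for each example
J = rng.standard_normal(N) / np.sqrt(N)
activations = xi @ J
```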
4
Define a cost function in terms of the examples: E = Σμ Eμ + regularization terms.
On-line learning: at time t, draw an example σ(t), and ΔJj ~ gradient with respect to σ(t) + weight decay.
Batch learning: at time t, ΔJj ~ average gradient with respect to all examples + weight decay.
Learning Dynamics
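The two update rules can be contrasted in code. This is a schematic sketch for a linear student with quadratic per-example cost Eμ = (yμ − J·ξμ)²/2; the learning rate eta and weight-decay strength lam are illustrative choices, not values from the slides:

```python
import numpy as np

def online_step(J, xi, y, eta, lam, rng):
    """On-line: draw one example sigma(t) and follow its gradient."""
    mu = rng.integers(len(y))                  # random example sigma(t)
    grad = (y[mu] - xi[mu] @ J) * xi[mu]       # gradient for that example only
    return J + eta * (grad - lam * J)          # gradient step + weight decay

def batch_step(J, xi, y, eta, lam):
    """Batch: average the gradient over all p examples at each step."""
    grad = ((y - xi @ J) @ xi) / len(y)        # average gradient
    return J + eta * (grad - lam * J)
```

An on-line step gives the single drawn example a large one-shot correction, while a batch step spreads small changes over all examples, which is the contrast drawn on the next slide.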
5
Batch vs On-line
Batch learning:
- Same batch of examples for all steps
- Simple dynamics: no sequence dependence
- Small stepwise changes of examples
- Previous analysis possible
- Stable but inefficient
On-line learning:
- An independent example per step
- Complex dynamics: sequence dependence
- Giant stepwise boosts of examples
- Previous analysis limited to infinite example sets
- Efficient but less stable
6
The cavity method has been applied to many complex systems, and to the steady-state properties of learning. It uses a self-consistency argument: consider what happens when a set of p examples is expanded to p + 1 examples.
The central quantity is the cavity activation: the activation of example 0 in a network that learns examples 1 to p (but never learns example 0).
Since the original network has no information about example 0, the cavity activation obeys a random distribution (e.g. a Gaussian).
Now suppose the network incorporates example 0 at time s. The activation is no longer random.
The Cavity Method
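The claim that the cavity activation is (approximately) Gaussian can be checked numerically. The sketch below uses the ridge-regression fixed point as a stand-in for the converged student weights (an illustrative assumption, not the slides' gradient-descent dynamics): train on examples 1 to p, then record the activation of the held-out example 0 across many independent trials.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, trials = 50, 100, 2000
lam = 0.1   # weight-decay (ridge) strength, illustrative

cavity = np.empty(trials)
for t in range(trials):
    xi = rng.standard_normal((p + 1, N))        # examples 0 .. p
    B = rng.standard_normal(N) / np.sqrt(N)     # teacher weights
    y = xi @ B
    # Train on examples 1..p only; example 0 never enters the training set
    X, targets = xi[1:], y[1:]
    J = np.linalg.solve(X.T @ X / p + lam * np.eye(N), X.T @ targets / p)
    cavity[t] = xi[0] @ J                       # cavity activation of example 0

# The empirical distribution of `cavity` is close to Gaussian
skew = np.mean((cavity - cavity.mean()) ** 3) / cavity.std() ** 3
```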
7
The cavity activation diffuses randomly. The generic activation, after receiving a stimulus at time s, is no longer random. The background examples also adjust due to the newcomer. Assuming that the background adjustments are small, we can use linear response theory to superpose the effects due to all previous times s.
Linear Response
[Diagram: the generic activation X(t) departs from the randomly diffusing cavity activation h(t) after the stimulation time s]
8
For batch learning:
  xμ(t) = hμ(t) + ∫ ds G(t, s) gμ(s)
i.e. the generic activation of an example at time t equals its cavity activation at time t, plus the integral of the Green's function from time s to t times the gradient term at time s.
For on-line learning:
  xμ(t) = hμ(t) + Σi G(t, si) gμ(si)
i.e. the integral is replaced by a sum over the learning instants si, which are Poisson distributed.
Useful Equations
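The on-line superposition can be evaluated in toy form. The exponentially decaying Green's function and constant gradient term below are illustrative assumptions, not the paper's exact expressions; the point is the structure: a sum of propagated kicks at Poisson-distributed learning instants.

```python
import numpy as np

rng = np.random.default_rng(2)

gamma = 0.5   # decay rate of the toy Green's function (illustrative)
g = 1.0       # constant gradient term (illustrative)
rate = 1.0    # Poisson rate of the learning instants
T = 20.0      # observation time

# Poisson process: exponential inter-arrival times, cumulatively summed
arrivals = np.cumsum(rng.exponential(1.0 / rate, size=200))
instants = arrivals[arrivals < T]

def generic_activation(t, cavity_h):
    """x(t) = h(t) + sum over learning instants s < t of G(t, s) * g."""
    s = instants[instants < t]
    return cavity_h + np.sum(np.exp(-gamma * (t - s)) * g)

# Each learning instant boosts the activation; the boost decays afterwards
x = generic_activation(T, cavity_h=0.0)
```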
9
Simulation Results
[Plot: generic activation, showing giant boosts at Poisson-distributed learning instants. Line: cavity activation from theory. Dots: simulation with the example removed. Theory and simulations agree.]
10
Further Development
[Plots: evolution of the training error and generalization error; theory and simulations agree.]
11
Critical Learning Rate (1)
[Plot: the critical learning rate at which learning diverges; theory and simulations agree, compared against other approximations.]
12
Critical Learning Rate (2)
[Plot: the critical learning rate at which learning diverges; theory and simulations agree, compared against other approximations.]
13
Average Learning
[Plot: theory and simulations agree; the generalization error drops when the dynamics is averaged over monitoring periods.]
14
We have analysed the dynamics of on-line learning with recycled examples using the cavity approach.
The theory reproduces the Poisson-distributed giant boosts of the activations during learning.
Theory and simulations agree well on: the evolution of the training and generalization errors; the critical learning rate at which learning diverges; and the performance of average learning.
Future work: to develop a Monte Carlo sampling procedure for multilayer networks.
Conclusion