17 Machine Learning Radial Basis Functions

Neural Networks: Radial Basis Functions Networks
Andres Mendez-Vazquez
December 10, 2015

Outline
1 Introduction
    Main Idea
    Basic Radial-Basis Functions
2 Separability
    Cover’s Theorem on the separability of patterns
    Dichotomy
    φ-separable functions
    The Stochastic Experiment
    The XOR Problem
    Separating Capacity of a Surface
3 Interpolation Problem
    What is gained?
    Feedforward Network
    Learning Process
    Radial-Basis Functions (RBF)
4 Introduction
    Description of the Problem
    Well-posed or ill-posed
    The Main Problem
5 Regularization Theory
    Solving the issue
    Bias-Variance Dilemma
    Measuring the difference between optimal and learned
    The Bias-Variance
    How can we use this?
    Getting a solution
    We still need to talk about...

Introduction
Observation: The backpropagation algorithm for the design of a multilayer perceptron, as described in the previous chapter, may be viewed as the application of a recursive technique known in statistics as stochastic approximation.
Now: We take a completely different approach by viewing the design of a neural network as a curve-fitting (approximation) problem in a high-dimensional space.
Thus: Learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data, with "best fit" measured under a statistical metric.

Thus
In the context of a neural network: The hidden units provide a set of "functions" that form a "basis" for the input patterns when they are expanded into the hidden space.
Name of these functions: Radial-basis functions.

History
These functions were first introduced as the solution of the real multivariate interpolation problem.
Right now, this is one of the main fields of research in numerical analysis.

A Basic Structure
We have the following structure:
1 Input Layer to connect with the environment.
2 Hidden Layer applying a nonlinear transformation.
3 Output Layer applying a linear transformation.
Example [figure]: Input Nodes, Nonlinear Nodes, Linear Node.

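The three-layer structure can be sketched in a few lines. This is only an illustration: the Gaussian hidden unit, the `width` parameter, and the names `rbf_forward`, `centers`, and `weights` are assumptions made here, not something fixed by the slide.

```python
import math

def rbf_forward(x, centers, weights, bias=0.0, width=1.0):
    """Output of a three-layer RBF network for a single input vector x."""
    # Hidden layer: one nonlinear unit per center (Gaussian is an assumption).
    hidden = [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2.0 * width ** 2))
        for c in centers
    ]
    # Output layer: a plain linear combination of the hidden activations.
    return bias + sum(w * h for w, h in zip(weights, hidden))

# Hypothetical network with two hidden units in a 2-D input space.
centers = [(0.0, 0.0), (1.0, 1.0)]
weights = [1.0, -1.0]
output = rbf_forward((0.0, 0.0), centers, weights)
print(output)
```

The input layer is just the vector `x`; all the nonlinearity lives in the hidden layer, so training the output weights alone is a linear problem.
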
Why the nonlinear transformation?
The justification: According to a paper by Cover (1965), a pattern-classification problem mapped to a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Thus: This is a good reason to make the dimension of the hidden space in a Radial-Basis Function (RBF) network high.

Cover’s Theorem
The Summarized Statement: A complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Actually: The full statement is considerably more involved...

Some facts
A fact: Once we know that a set of patterns is linearly separable, the problem is easy to solve.
Consider: A family of surfaces, each of which separates the space into two regions.
In addition: We have a set of patterns
\[ H = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\} \tag{1} \]

Dichotomy (Binary Partition)
Now: The pattern set is split into two classes $H_1$ and $H_2$.
Definition: A dichotomy (binary partition) of the points is said to be separable with respect to the family of surfaces if a surface exists in the family that separates the points in the class $H_1$ from those in the class $H_2$.
Define: For each pattern $\mathbf{x} \in H$, we define a set of real-valued measurement functions $\{\varphi_1(\mathbf{x}), \varphi_2(\mathbf{x}), \ldots, \varphi_{d_1}(\mathbf{x})\}$.

Thus
We define the following function (vector of measurements)
\[ \varphi : H \to \mathbb{R}^{d_1} \tag{2} \]
defined as
\[ \varphi(\mathbf{x}) = \left(\varphi_1(\mathbf{x}), \varphi_2(\mathbf{x}), \ldots, \varphi_{d_1}(\mathbf{x})\right)^T \tag{3} \]
Now: Suppose that the pattern $\mathbf{x}$ is a vector in a $d_0$-dimensional input space.

Then...
The mapping $\varphi(\mathbf{x})$: It maps points in the $d_0$-dimensional input space into corresponding points in a new space of dimension $d_1$.
Each of these functions $\varphi_i(\mathbf{x})$: It is known as a hidden function because it plays a role similar to that of a hidden unit in a feedforward neural network.
Thus: The space spanned by the set of hidden functions $\{\varphi_i(\mathbf{x})\}_{i=1}^{d_1}$ is called the hidden space or feature space.

φ-separable functions
Definition: A dichotomy $\{H_1, H_2\}$ of $H$ is said to be φ-separable if there exists a $d_1$-dimensional vector $\mathbf{w}$ such that
1 $\mathbf{w}^T \varphi(\mathbf{x}) > 0$ if $\mathbf{x} \in H_1$.
2 $\mathbf{w}^T \varphi(\mathbf{x}) < 0$ if $\mathbf{x} \in H_2$.
The separating hyperplane in the hidden space is defined by the equation
\[ \mathbf{w}^T \varphi(\mathbf{x}) = 0 \tag{4} \]
Now: The inverse image of this hyperplane,
\[ \mathrm{Hyp}^{-1} = \left\{ \mathbf{x} \mid \mathbf{w}^T \varphi(\mathbf{x}) = 0 \right\} \tag{5} \]
defines the separating surface in the input space.

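The definition of φ-separability can be checked mechanically for a given weight vector. The sketch below is a hypothetical one-dimensional example; the helper `is_phi_separated`, the mapping `phi`, and the data are all chosen here for illustration.

```python
def is_phi_separated(w, phi, h1, h2):
    """True if w^T phi(x) > 0 for every x in h1 and < 0 for every x in h2."""
    def score(x):
        return sum(wi * fi for wi, fi in zip(w, phi(x)))
    return all(score(x) > 0 for x in h1) and all(score(x) < 0 for x in h2)

def phi(x):
    """Assumed measurement functions: phi_1(x) = x^2, phi_2(x) = 1."""
    return (x * x, 1.0)

# H1 = {-2, 2} surrounds H2 = {0}, so no single threshold on x separates them.
h1, h2 = [-2.0, 2.0], [0.0]
w = (1.0, -1.0)  # w^T phi(x) = x^2 - 1
separated = is_phi_separated(w, phi, h1, h2)
print(separated)
```

Here the hyperplane $\mathbf{w}^T \varphi(x) = 0$ in the hidden space has the inverse image $\{x \mid x^2 = 1\} = \{-1, 1\}$, which is the separating surface in the input space.
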
Now
Taking into consideration: A natural class of mappings is obtained by using a linear combination of r-wise products of the pattern vector coordinates.
They are called: The separating surfaces corresponding to such mappings are known as rth-order rational varieties.
A rational variety of order $r$ in a space of dimension $d_0$ is described by
\[ \sum_{0 \le i_1 \le i_2 \le \cdots \le i_r \le d_0} a_{i_1 i_2 \ldots i_r} \, x_{i_1} x_{i_2} \cdots x_{i_r} = 0 \tag{6} \]
where $x_i$ is the $i$th coordinate of the input vector $\mathbf{x}$ and $x_0$ is set to unity in order to express the previous equation in homogeneous form.

Homogeneous Functions
Definition: A function $f(x)$ is said to be homogeneous of degree $n$ if, for a constant parameter $\lambda$, replacing the variable $x$ with $\lambda x$ gives
\[ f(\lambda x) = \lambda^n f(x) \tag{7} \]

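A quick numerical check makes the definition concrete. The polynomial below is an arbitrary example picked for illustration: every term has total degree 2, so it should be homogeneous of degree $n = 2$.

```python
def f(x1, x2):
    """An illustrative polynomial whose terms all have total degree 2."""
    return 3.0 * x1 * x1 + 2.0 * x1 * x2

lam, x1, x2 = 4.0, 0.5, -1.5          # arbitrary test values
lhs = f(lam * x1, lam * x2)           # f(lambda * x)
rhs = lam ** 2 * f(x1, x2)            # lambda^n * f(x) with n = 2
print(lhs, rhs)
```

Adding a term of a different degree, such as a constant, would break the equality, which is why $x_0 = 1$ is introduced in (Eq. 6) to pad every product up to order $r$.
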
Homogeneous Equation
Equation (Eq. 6): An rth-order product of entries $x_i$ of $\mathbf{x}$, $x_{i_1} x_{i_2} \cdots x_{i_r}$, is called a monomial.
Properties: For an input space of dimensionality $d_0$, there are
\[ \binom{d_0}{r} = \frac{d_0!}{(d_0 - r)! \, r!} \tag{8} \]
monomials in (Eq. 6).

Example of these surfaces
Hyperplanes (first-order rational varieties) [figure]
Quadrics (second-order rational varieties) [figure]
Hyperspheres (quadrics with certain linear constraints on the coefficients) [figure]

The Stochastic Experiment
Suppose: The activation patterns $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ are chosen independently.
Suppose: All possible dichotomies of $H = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ are equiprobable.
Now: Let $P(N, d_1)$ denote the probability that a particular dichotomy picked at random is φ-separable. Then
\[ P(N, d_1) = \left(\frac{1}{2}\right)^{N-1} \sum_{m=0}^{d_1 - 1} \binom{N-1}{m} \tag{9} \]

What?
Basically, (Eq. 9) represents: The essence of Cover’s Separability Theorem.
Something Notable: It is the cumulative binomial distribution corresponding to the probability that $N - 1$ flips of a fair coin will result in $d_1 - 1$ or fewer heads.
Specifically: The higher we make the dimension $d_1$ of the hidden space in the radial-basis-function network, the closer the probability $P(N, d_1)$ is to one.

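To see how quickly $P(N, d_1)$ approaches one, (Eq. 9) can be evaluated directly. This is a minimal sketch; the function name `cover_probability` and the choice of $N = 20$ patterns are assumptions made for the illustration.

```python
from math import comb

def cover_probability(n_patterns, d1):
    """P(N, d1) from Eq. 9: chance that a random dichotomy is phi-separable."""
    if n_patterns < 1 or d1 < 1:
        raise ValueError("n_patterns and d1 must be positive integers")
    # comb(n, m) is 0 for m > n, so the sum saturates at 2^(N-1) once d1 >= N.
    total = sum(comb(n_patterns - 1, m) for m in range(d1))
    return total / 2 ** (n_patterns - 1)

# Fix N = 20 patterns and grow the hidden-space dimension d1.
for d1 in (1, 5, 10, 15, 20):
    print(d1, cover_probability(20, d1))
```

The probability is exactly one half at $N = 2 d_1$ and equals one whenever $d_1 \ge N$, which is the quantitative content of "higher dimension helps".
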
Final ingredients of Cover’s Theorem
First: Nonlinear formulation of the hidden functions defined by $\varphi_i(\mathbf{x})$, where $\mathbf{x}$ is the input vector and $i = 1, 2, \ldots, d_1$.
Second: High dimensionality of the hidden space compared to the input space. This dimensionality is determined by the value assigned to $d_1$ (i.e., the number of hidden units).
Then: In general, a complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

There is always an exception to every rule!!!
The XOR Problem
[Figure: the XOR patterns on axes running from 0 to 1, labeled Class 1 and Class 2]

Now
We define the following radial functions
\[ \varphi_1(\mathbf{x}) = \exp\left\{-\|\mathbf{x} - \mathbf{t}_1\|_2^2\right\}, \quad \mathbf{t}_1 = (1, 1)^T \]
\[ \varphi_2(\mathbf{x}) = \exp\left\{-\|\mathbf{x} - \mathbf{t}_2\|_2^2\right\}, \quad \mathbf{t}_2 = (0, 0)^T \]
Then: If we apply our classic mapping $\varphi(\mathbf{x}) = [\varphi_1(\mathbf{x}), \varphi_2(\mathbf{x})]$:
Original → Mapping
(0, 1) → (0.3679, 0.3679)
(1, 0) → (0.3679, 0.3679)
(0, 0) → (0.1353, 1)
(1, 1) → (1, 0.1353)

New Space
We have the following new φ1-φ2 space
[Figure: the mapped patterns in the φ1-φ2 plane on axes running from 0 to 1, labeled Class 1 and Class 2]

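The mapping can be reproduced numerically. The sketch below uses the centers $\mathbf{t}_1 = (1, 1)^T$ and $\mathbf{t}_2 = (0, 0)^T$, which are the ones consistent with the tabulated values, and then tries one possible linear separator in the new space. The separator $\varphi_1 + \varphi_2 = 0.95$ is an assumption chosen here, not a value from the slides.

```python
import math

def gaussian_rbf(x, center):
    """exp(-||x - center||^2) for points given as tuples."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, center)))

T1, T2 = (1.0, 1.0), (0.0, 0.0)  # centers consistent with the mapping table

def phi(x):
    return (gaussian_rbf(x, T1), gaussian_rbf(x, T2))

THRESHOLD = 0.95  # assumed; any value between 2/e and 1 + e^-2 would do

def predict(x):
    """Linear rule in the hidden space: phi_1 + phi_2 above threshold -> 0."""
    p1, p2 = phi(x)
    return 0 if p1 + p2 > THRESHOLD else 1

# XOR truth table: input pattern -> XOR output
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
for x, label in xor.items():
    p1, p2 = phi(x)
    print(x, "->", (round(p1, 4), round(p2, 4)), "XOR:", label, "predicted:", predict(x))
```

The two patterns (0, 1) and (1, 0) collapse onto the same point in the φ1-φ2 plane, so a single straight line is enough to split the classes even though the hidden space has the same dimension as the input space.
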
Separating Capacity of a Surface
Something Notable: (Eq. 9) has an important bearing on the expected maximum number of randomly assigned patterns that are linearly separable in a multidimensional space.
Now, given our patterns $\{\mathbf{x}_i\}_{i=1}^{N}$: Let $N$ be a random variable defined as the largest integer such that the sequence $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ is φ-separable.
We have that
\[ \mathrm{Prob}(N = n) = P(n, d_1) - P(n + 1, d_1) \tag{10} \]

Separating Capacity of a Surface
Then
\[ \mathrm{Prob}(N = n) = \left(\frac{1}{2}\right)^{n} \binom{n - 1}{d_1 - 1}, \quad n = 0, 1, 2, \ldots \tag{11} \]
Remark: $\binom{n}{d_1} = \binom{n - 1}{d_1 - 1} + \binom{n - 1}{d_1}$, $0 < d_1 < n$.
To interpret this: Recall the negative binomial distribution.
It describes a repeated sequence of Bernoulli trials with $k$ failures preceding the $r$th success.

Separating Capacity of a Surface
Thus, we have that: Let $p$ and $q$ be the probabilities of success and failure, respectively, with $p + q = 1$.
Definition
\[ p(K = k \mid p, q) = \binom{r + k - 1}{k} p^r q^k \tag{12} \]
What happens with $p = q = \frac{1}{2}$ and $k + r = n$? Any idea?

Separating Capacity of a Surface
Thus: (Eq. 11) is just the negative binomial distribution shifted $d_1$ units to the right, with parameters $d_1$ and $\frac{1}{2}$.
Finally: $N$ corresponds to the “waiting time” for the $d_1$th failure in a sequence of tosses of a fair coin.
We have then
\[ E[N] = 2 d_1 \]
\[ \mathrm{Median}[N] = 2 d_1 \]

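These statements can be checked with exact arithmetic. The sketch below verifies that (Eq. 10) and (Eq. 11) agree for $n \ge 1$, then estimates the mean by truncating the infinite sum. The value $d_1 = 5$, the truncation point `n_max`, and the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def cover_probability(n, d1):
    """P(n, d1) from Eq. 9, as an exact fraction (n >= 1)."""
    return Fraction(sum(comb(n - 1, m) for m in range(d1)), 2 ** (n - 1))

def prob_n(n, d1):
    """Prob(N = n) from Eq. 11, as an exact fraction (n >= 1)."""
    return Fraction(comb(n - 1, d1 - 1), 2 ** n)

d1 = 5        # illustrative choice
n_max = 400   # truncation of the infinite sum; the remaining tail is negligible

# Eq. 10 and Eq. 11 agree term by term.
identity_holds = all(
    cover_probability(n, d1) - cover_probability(n + 1, d1) == prob_n(n, d1)
    for n in range(1, 50)
)

# Truncated mean of N, and the probability mass strictly below 2 * d1.
mean = float(sum(n * prob_n(n, d1) for n in range(1, n_max + 1)))
below_2d1 = sum(prob_n(n, d1) for n in range(1, 2 * d1))
print(identity_holds, mean, below_2d1)
```

The mass strictly below $2 d_1$ comes out as exactly one half, so $2 d_1$ is a median of $N$, and the truncated mean agrees with $E[N] = 2 d_1$ to within floating-point precision.
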
This allows us to state the Corollary to Cover’s Theorem
A celebrated asymptotic result: The expected maximum number of randomly assigned patterns (vectors) that are linearly separable in a space of dimensionality $d_1$ is equal to $2 d_1$.
Something Notable: This result suggests that $2 d_1$ is a natural definition of the separating capacity of a family of decision surfaces having $d_1$ degrees of freedom.

Given a problem of nonlinearly separable patterns
It is possible to see that: There is a benefit to be gained by mapping the input space into a new space of high enough dimension.
For this, we use a nonlinear map: This is quite similar to solving a difficult nonlinear filtering problem by mapping it to a high-dimensional space and then solving it as a linear filtering problem.

Take into consideration the following architecture
Mapping from input space to hidden space, followed by a linear mapping to output space!!!
[Figure: Input Nodes, Nonlinear Nodes, Linear Node]

This can be seen as
We have the following map
\[ s : \mathbb{R}^{d_0} \to \mathbb{R} \tag{13} \]
Therefore: We may think of $s$ as a hypersurface (graph) $\Gamma \subset \mathbb{R}^{d_0 + 1}$.

Example
The red planes represent the mappings and the gray plane is the linear separator [figure].

General Idea
First:
The training phase constitutes the optimization of a fitting procedure for the surface Γ.
It is based on the known data points, given as input-output patterns.
Second:
The generalization phase is synonymous with interpolation between the data points.
The interpolation is performed along the constrained surface generated by the fitting procedure.

This leads to the theory of multivariable interpolation
Interpolation Problem
Given a set of $N$ different points $\{x_i \in \mathbb{R}^{d_0} \mid i = 1, 2, \ldots, N\}$ and a corresponding set of $N$ real numbers $\{d_i \in \mathbb{R} \mid i = 1, 2, \ldots, N\}$, find a function $F : \mathbb{R}^{d_0} \to \mathbb{R}$ that satisfies the interpolation condition:
$$F(x_i) = d_i, \quad i = 1, 2, \ldots, N \tag{14}$$
Remark
For strict interpolation as specified here, the interpolating surface is constrained to pass through all the training data points.
Radial-Basis Functions (RBF)
The function $F$ has the following form (Powell, 1988)
$$F(x) = \sum_{i=1}^{N} w_i \phi(\|x - x_i\|) \tag{15}$$
Where
$$\{\phi(\|x - x_i\|) \mid i = 1, \ldots, N\}$$
is a set of $N$ arbitrary, generally nonlinear, functions known as radial-basis functions, and $\|\cdot\|$ denotes a norm that is usually Euclidean.
In addition
The known data points $x_i \in \mathbb{R}^{d_0}$, $i = 1, 2, \ldots, N$, are taken to be the centers of the radial-basis functions.
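A Set of Simultaneous Linear Equations
Given
$$\phi_{ji} = \phi(\|x_j - x_i\|), \quad j, i = 1, 2, \ldots, N \tag{16}$$
Using (Eq. 14) and (Eq. 15), we get
$$\begin{bmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1N} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{N1} & \phi_{N2} & \cdots & \phi_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix} \tag{17}$$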
Now
We can create the following vectors
$d = [d_1, d_2, \ldots, d_N]^T$ (response vector).
$w = [w_1, w_2, \ldots, w_N]^T$ (linear weight vector).
Now, we define an $N \times N$ matrix called the interpolation matrix
$$\Phi = \{\phi_{ji} \mid j, i = 1, 2, \ldots, N\} \tag{18}$$
Thus, we have
$$\Phi w = d \tag{19}$$
From here
Assuming that $\Phi$ is a nonsingular matrix
$$w = \Phi^{-1} d \tag{20}$$
Question
How can we be sure that the interpolation matrix $\Phi$ is nonsingular?
Answer
It turns out that, for a large class of radial-basis functions (Gaussians, multiquadrics and inverse multiquadrics among them), $\Phi$ is nonsingular whenever the data points $x_i$ are all distinct (Micchelli's theorem).
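The strict interpolation procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's implementation: the Gaussian choice of $\phi$, the width `sigma`, the helper names and the XOR toy data are all assumptions made for the example. The weights are obtained by solving the linear system with the response vector $d$ on the right-hand side, rather than by forming $\Phi^{-1}$ explicitly.

```python
import numpy as np

def gaussian_rbf(r, sigma=1.0):
    """Gaussian radial-basis function of the distance r (sigma is an assumed width)."""
    return np.exp(-(r ** 2) / (2.0 * sigma ** 2))

def interpolation_matrix(X, sigma=1.0):
    """Build Phi with entries phi_ji = phi(||x_j - x_i||), as in Eq. 16."""
    diff = X[:, None, :] - X[None, :, :]      # shape (N, N, d0)
    dist = np.linalg.norm(diff, axis=-1)      # pairwise Euclidean distances
    return gaussian_rbf(dist, sigma)

def fit_strict_interpolation(X, d, sigma=1.0):
    """Solve Phi w = d for the weights without forming the inverse explicitly."""
    Phi = interpolation_matrix(X, sigma)
    return np.linalg.solve(Phi, d)

def evaluate(X_centers, w, X_new, sigma=1.0):
    """Evaluate F(x) = sum_i w_i phi(||x - x_i||) (Eq. 15) at the rows of X_new."""
    diff = X_new[:, None, :] - X_centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return gaussian_rbf(dist, sigma) @ w

# Toy data: the four XOR inputs with their labels (an illustrative choice).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
d = np.array([0.0, 1.0, 1.0, 0.0])

w = fit_strict_interpolation(X, d)
F_at_data = evaluate(X, w, X)
print(np.allclose(F_at_data, d))  # the surface passes through every training point
```

Because the four points are distinct, the Gaussian interpolation matrix is nonsingular and the interpolation condition (Eq. 14) holds at every training point.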
Introduction
Observation
The strict interpolation procedure described may not be a good strategy for the training of RBF networks for certain classes of tasks.
Reason
The number of data points may be much larger than the number of degrees of freedom of the underlying physical process.
Thus
The network may end up fitting misleading variations due to idiosyncrasies or noise in the input data.
Well-posed
The Problem
Assume that we have a domain $X$ and a range $Y$, both metric spaces.
They are related by a mapping
$$f : X \to Y \tag{21}$$
Definition
The problem of reconstructing the mapping $f$ is said to be well-posed if three conditions are satisfied: existence, uniqueness and continuity.
Defining the meaning of this
Existence
For every input vector $x \in X$, there exists an output $y = f(x)$, where $y \in Y$.
Uniqueness
For any pair of input vectors $x, t \in X$, we have $f(x) = f(t)$ if and only if $x = t$.
Continuity
The mapping is continuous if, for any $\epsilon > 0$, there exists $\delta > 0$ such that the condition $d_X(x, t) < \delta$ implies $d_Y(f(x), f(t)) < \epsilon$.
Basically
Example
[Figure: Mapping]
Ill-Posed
Therefore
If any of these conditions is not satisfied, the problem is said to be ill-posed.
Basically
An ill-posed problem means that large data sets may contain a surprisingly small amount of information about the desired solution.
Learning from data
Rebuilding the physical phenomenon using the samples
[Figure: Physical Phenomenon and Data]
We have the following
Physical Phenomena
Speech, pictures, radar signals, sonar signals, seismic data.
Generating the data is a well-posed problem
But learning from such data, i.e. rebuilding the hypersurface, can be an ill-posed inverse problem.
Why
First
The existence criterion may be violated in that a distinct output may not exist for every input.
Second
There may not be as much information in the training sample as we really need to reconstruct the input-output mapping uniquely.
Third
The unavoidable presence of noise or imprecision in real-life training data adds uncertainty to the reconstructed input-output mapping.
The noise problem
Getting out of the range
[Figure: Mapping + Noise]
How?
This can happen when
There is a lack of information.
Lanczos, 1964
“A lack of information cannot be remedied by any mathematical trickery.”
How do we solve the problem?
Something Notable
In 1963, Tikhonov proposed a new method called regularization for solving ill-posed problems.
Tikhonov
He was a Soviet and Russian mathematician known for important contributions to topology, functional analysis, mathematical physics, and ill-posed problems.
Also Known as Ridge Regression
Setup
We have:
Input signal $\{x_i \in \mathbb{R}^{d_0}\}_{i=1}^{N}$.
Output signal $\{d_i \in \mathbb{R}\}_{i=1}^{N}$.
In addition
Note that the output is assumed to be one-dimensional.
Now, assuming that you have an approximation function $y = F(x)$
Standard Error Term
$$E_s(F) = \frac{1}{2}\sum_{i=1}^{N}(d_i - y_i)^2 = \frac{1}{2}\sum_{i=1}^{N}(d_i - F(x_i))^2 \tag{22}$$
Regularization Term
$$E_c(F) = \frac{1}{2}\|DF\|^2 \tag{23}$$
Where
$D$ is a linear differential operator.
Now
Ordinarily $y = F(x)$
Normally, the function space representing the functional $F$ is the $L_2$ space that consists of all real-valued functions $f(x)$ with $x \in \mathbb{R}^{d_0}$.
The quantity to be minimized in regularization theory is
$$E(f) = E_s(f) + \lambda E_c(f) = \frac{1}{2}\sum_{i=1}^{N}(d_i - f(x_i))^2 + \frac{1}{2}\lambda\|Df\|^2 \tag{24}$$
Where
$\lambda$ is a positive real number called the regularization parameter.
$E(f)$ is called the Tikhonov functional.
Introduction
What did we see until now?
The design of learning machines from two main points of view:
Statistical point of view
Linear algebra and optimization point of view
Going back to the probability models
We might think of the machine to be learned as a function $g(x \mid D)$.
Something like curve fitting.
Under a data set
$$D = \{(x_i, y_i) \mid i = 1, 2, \ldots, N\} \tag{25}$$
Remark: Where the $x_i \sim p(x \mid \Theta)$.
Thus, we have that
Two main functions
A function $g(x \mid D)$ obtained using some algorithm.
$E[y \mid x]$, the optimal regression.
Important
The key factor here is the dependence of the approximation on $D$.
Why?
The approximation may be very good for a specific training data set but very bad for another.
This is the reason for studying fusion of information at the decision level.
How do we measure the difference
We have that
$$\mathrm{Var}(X) = E\left[(X - \mu)^2\right]$$
By analogy, we can do that for our data
$$\mathrm{Var}_D(g(x \mid D)) = E_D\left[(g(x \mid D) - E[y \mid x])^2\right]$$
Now, if we add and subtract
$$E_D[g(x \mid D)] \tag{26}$$
Remark: This is the expected output of the machine $g(x \mid D)$.
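Thus, we have that
Our original variance
$$\begin{aligned}
\mathrm{Var}_D(g(x \mid D)) &= E_D\left[(g(x \mid D) - E[y \mid x])^2\right] \\
&= E_D\left[(g(x \mid D) - E_D[g(x \mid D)] + E_D[g(x \mid D)] - E[y \mid x])^2\right] \\
&= E_D\big[(g(x \mid D) - E_D[g(x \mid D)])^2 \\
&\qquad + 2\,(g(x \mid D) - E_D[g(x \mid D)])\,(E_D[g(x \mid D)] - E[y \mid x]) \\
&\qquad + (E_D[g(x \mid D)] - E[y \mid x])^2\big]
\end{aligned}$$
Finally
$$E_D\left[(g(x \mid D) - E_D[g(x \mid D)])\,(E_D[g(x \mid D)] - E[y \mid x])\right] = ? \tag{27}$$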
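The cross term in (Eq. 27) can be evaluated directly. The short derivation below uses only the fact that $E_D[g(x \mid D)]$ and $E[y \mid x]$ do not depend on the particular data set $D$, so that factor can be pulled out of the expectation over $D$:

$$\begin{aligned}
E_D\left[(g(x \mid D) - E_D[g(x \mid D)])\,(E_D[g(x \mid D)] - E[y \mid x])\right]
&= (E_D[g(x \mid D)] - E[y \mid x])\; E_D\left[g(x \mid D) - E_D[g(x \mid D)]\right] \\
&= (E_D[g(x \mid D)] - E[y \mid x])\,\left(E_D[g(x \mid D)] - E_D[g(x \mid D)]\right) \\
&= 0
\end{aligned}$$

Hence only the two squared terms of the expansion survive.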
We have the Bias-Variance
Our Final Equation
$$E_D\left[(g(x \mid D) - E[y \mid x])^2\right] = \underbrace{E_D\left[(g(x \mid D) - E_D[g(x \mid D)])^2\right]}_{\text{VARIANCE}} + \underbrace{(E_D[g(x \mid D)] - E[y \mid x])^2}_{\text{BIAS}}$$
Where the variance
It represents the measure of the error between our machine $g(x \mid D)$ and the expected output of the machine under $x_i \sim p(x \mid \Theta)$.
Where the bias
It represents the quadratic error between the expected output of the machine under $x_i \sim p(x \mid \Theta)$ and the expected output of the optimal regression.
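The decomposition can be checked numerically by drawing many training sets $D$ and refitting the machine on each. The sketch below is only an illustration under assumed choices: a sine curve stands in for the optimal regression $E[y \mid x]$, the machine $g(x \mid D)$ is a least-squares polynomial, and the noise level, sample size and query point `x0` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_regression(x):
    """Stand-in for the optimal regression E[y|x] (an assumed toy target)."""
    return np.sin(2.0 * np.pi * x)

def sample_dataset(n=20, noise=0.3):
    """Draw one training set D = {(x_i, y_i)} with x_i uniform on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = true_regression(x) + noise * rng.normal(size=n)
    return x, y

def fit_and_predict(x, y, x0, degree):
    """Fit a polynomial g(x|D) by least squares and evaluate it at x0."""
    coeffs = np.polyfit(x, y, deg=degree)
    return np.polyval(coeffs, x0)

x0 = 0.3            # query point at which the decomposition is checked
n_datasets = 2000   # number of independent training sets D

results = {}
for degree in (1, 3):
    preds = np.array([fit_and_predict(*sample_dataset(), x0, degree)
                      for _ in range(n_datasets)])
    mean_pred = preds.mean()                             # estimate of E_D[g(x|D)]
    variance = np.mean((preds - mean_pred) ** 2)         # variance term
    bias_sq = (mean_pred - true_regression(x0)) ** 2     # bias term
    mse = np.mean((preds - true_regression(x0)) ** 2)    # left-hand side
    results[degree] = (mse, variance, bias_sq)
    print(degree, mse, variance, bias_sq)
```

For each degree the printed error equals the sum of the variance and bias terms up to rounding, since the cross term vanishes for sample averages as well. Comparing the two degrees gives a feel for how restricting the model trades one term against the other.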
Using this in our favor
Something Notable
Introducing bias is equivalent to restricting the range of functions for which a model can account.
Typically this is achieved by removing degrees of freedom.
Examples
Lowering the order of a polynomial or reducing the number of weights in a neural network.
Ridge Regression
It does not explicitly remove degrees of freedom but instead reduces the effective number of parameters.
Example
In the case of a linear regression model
$$C(w) = \sum_{i=1}^{N}\left(d_i - w^T x_i\right)^2 + \lambda\sum_{j=1}^{d_0} w_j^2 \tag{28}$$
Thus
This is ridge regression (weight decay), and the regularization parameter $\lambda > 0$ controls the balance between fitting the data and avoiding the penalty.
A small value for $\lambda$ means the data can be fit tightly without causing a large penalty.
A large value for $\lambda$ means a tight fit has to be sacrificed if it requires large weights.
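The effect of $\lambda$ can be seen on a small synthetic problem. The sketch below minimizes (Eq. 28) in closed form: setting the gradient to zero gives $(X^T X + \lambda I)\,w = X^T d$, where the rows of $X$ are the inputs $x_i$. The data, the true weights and the three values of $\lambda$ are assumptions chosen only for illustration.

```python
import numpy as np

def ridge_fit(X, d, lam):
    """Minimize sum_i (d_i - w^T x_i)^2 + lam * sum_j w_j^2 (Eq. 28).

    Setting the gradient to zero gives (X^T X + lam I) w = X^T d.
    """
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ d)

# Synthetic data (an illustrative assumption, not taken from the slides).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
w_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
d = X @ w_true + 0.1 * rng.normal(size=50)

lambdas = [0.01, 1.0, 100.0]
norms = [np.linalg.norm(ridge_fit(X, d, lam)) for lam in lambdas]
residuals = [np.sum((d - X @ ridge_fit(X, d, lam)) ** 2) for lam in lambdas]
print(norms)      # weight norm shrinks as lambda grows
print(residuals)  # squared fitting error grows as lambda grows
```

As $\lambda$ increases, the norm of the weight vector decreases while the squared fitting error increases, which is the trade-off described above.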
Important
The Bias
It favors solutions involving small weights, and the effect is to smooth the output function.
Now, we can carry out the optimization
First, we rewrite the cost function the following way
$$S(w) = \sum_{i=1}^{N}(d_i - f(x_i))^2 \tag{29}$$
And we will use a generalized version for $f$
$$f(x_i) = \sum_{j=1}^{d_1} w_j \phi_j(x_i) \tag{30}$$
Where
The free variables are the weights $\{w_j\}_{j=1}^{d_1}$.
Where
In our case, $\phi_j(x_i)$ may be the Gaussian function
$$\phi_j(x_i) = \phi(x_i, x_j) \tag{31}$$
With
$$\phi(x, x_j) = \exp\left\{-\frac{1}{2\sigma^2}\|x - x_j\|^2\right\} \tag{32}$$
Thus
Final cost function assuming there is a regularization term per weight
$$C(w, \lambda) = \sum_{i=1}^{N}(d_i - f(x_i))^2 + \sum_{j=1}^{d_1}\lambda_j w_j^2 \tag{33}$$
What do we do?
1. Differentiate the function with respect to the free variables.
2. Equate the results with zero.
3. Solve the resulting equations.
85 / 96
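Differentiate the function with respect to the free variables.
First
$$\frac{\partial C(\mathbf{w}, \boldsymbol{\lambda})}{\partial w_j} = 2\sum_{i=1}^{N} \left(f(\mathbf{x}_i) - d_i\right) \frac{\partial f(\mathbf{x}_i)}{\partial w_j} + 2\lambda_j w_j \tag{34}$$
The partial derivative of $f(\mathbf{x}_i)$ with respect to $w_j$ follows from (30)
$$\frac{\partial f(\mathbf{x}_i)}{\partial w_j} = \phi_j(\mathbf{x}_i) \tag{35}$$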
86 / 96
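A quick way to check a derivative like this is to compare it with finite differences. The sketch below does that for the regularized cost of equation (33) on scalar inputs. The chain rule on $(d_i - f(\mathbf{x}_i))^2$ gives a factor $(f(\mathbf{x}_i) - d_i)$, which is what the code uses. All data, centers, and penalty values are made up for the example, and the helper names are hypothetical.

```python
import math

def phi(x, c, sigma=1.0):
    """Gaussian basis function for scalar inputs."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def cost(w, xs, ds, centers, lams):
    """C(w, lambda): squared error plus a per-weight penalty, equation (33)."""
    err = 0.0
    for x, d in zip(xs, ds):
        f = sum(wj * phi(x, c) for wj, c in zip(w, centers))
        err += (d - f) ** 2
    return err + sum(lam * wj ** 2 for lam, wj in zip(lams, w))

def grad(w, xs, ds, centers, lams):
    """Analytic gradient: 2 sum_i (f(x_i) - d_i) phi_j(x_i) + 2 lambda_j w_j."""
    g = []
    for j, cj in enumerate(centers):
        s = 0.0
        for x, d in zip(xs, ds):
            f = sum(wk * phi(x, c) for wk, c in zip(w, centers))
            s += (f - d) * phi(x, cj)
        g.append(2.0 * s + 2.0 * lams[j] * w[j])
    return g

def numeric_grad(w, xs, ds, centers, lams, h=1e-6):
    """Central finite differences, used only as a sanity check."""
    g = []
    for j in range(len(w)):
        w_plus = list(w)
        w_plus[j] += h
        w_minus = list(w)
        w_minus[j] -= h
        diff = cost(w_plus, xs, ds, centers, lams) - cost(w_minus, xs, ds, centers, lams)
        g.append(diff / (2.0 * h))
    return g

# Toy problem (all values are illustrative assumptions).
xs = [0.0, 0.5, 1.0, 1.5]
ds = [0.0, 0.4, 0.9, 0.3]
centers = [0.0, 1.0]
lams = [0.1, 0.2]
w = [0.3, -0.7]

analytic = grad(w, xs, ds, centers, lams)
numeric = numeric_grad(w, xs, ds, centers, lams)
```

Because the cost is quadratic in the weights, the central difference agrees with the analytic gradient up to rounding error.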
Now
Setting the derivative to zero, we have then
$$\sum_{i=1}^{N} f(\mathbf{x}_i)\phi_j(\mathbf{x}_i) + \lambda_j w_j = \sum_{i=1}^{N} d_i \phi_j(\mathbf{x}_i) \tag{36}$$
Something Notable
There are $d_1$ such equations, for $1 \leq j \leq d_1$, each representing one constraint on the solution.
Since there are exactly as many constraints as there are unknowns, this system of equations has, except under certain pathological conditions, a unique solution.
87 / 96
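Using Our Linear Algebra
We have then
$$\boldsymbol{\phi}_j^T \mathbf{f} + \lambda_j w_j = \boldsymbol{\phi}_j^T \mathbf{d} \tag{37}$$
Where
$$\boldsymbol{\phi}_j = \begin{bmatrix} \phi_j(\mathbf{x}_1) \\ \phi_j(\mathbf{x}_2) \\ \vdots \\ \phi_j(\mathbf{x}_N) \end{bmatrix}, \quad \mathbf{f} = \begin{bmatrix} f(\mathbf{x}_1) \\ f(\mathbf{x}_2) \\ \vdots \\ f(\mathbf{x}_N) \end{bmatrix}, \quad \mathbf{d} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix} \tag{38}$$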
88 / 96
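Now
Since there is one of these equations for each $j$, each relating one scalar quantity to another, we can stack them
$$\begin{bmatrix} \boldsymbol{\phi}_1^T \mathbf{f} \\ \boldsymbol{\phi}_2^T \mathbf{f} \\ \vdots \\ \boldsymbol{\phi}_{d_1}^T \mathbf{f} \end{bmatrix} + \begin{bmatrix} \lambda_1 w_1 \\ \lambda_2 w_2 \\ \vdots \\ \lambda_{d_1} w_{d_1} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\phi}_1^T \mathbf{d} \\ \boldsymbol{\phi}_2^T \mathbf{d} \\ \vdots \\ \boldsymbol{\phi}_{d_1}^T \mathbf{d} \end{bmatrix} \tag{39}$$
Now, if we define
$$\Phi = \begin{bmatrix} \boldsymbol{\phi}_1 & \boldsymbol{\phi}_2 & \cdots & \boldsymbol{\phi}_{d_1} \end{bmatrix} \tag{40}$$
Written in full form
$$\Phi = \begin{bmatrix} \phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \cdots & \phi_{d_1}(\mathbf{x}_1) \\ \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \cdots & \phi_{d_1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(\mathbf{x}_N) & \phi_2(\mathbf{x}_N) & \cdots & \phi_{d_1}(\mathbf{x}_N) \end{bmatrix} \tag{41}$$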
89 / 96
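The design matrix of equation (41) can be built directly from its definition: row $i$ holds every basis function evaluated at $\mathbf{x}_i$, and column $j$ is the vector $\boldsymbol{\phi}_j$. The following is a minimal sketch, assuming Gaussian basis functions with a squared distance in the exponent. The inputs, centers, and function names are hypothetical.

```python
import math

def gaussian_rbf(x, center, sigma=1.0):
    """Gaussian basis function: exp(-||x - center||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def design_matrix(inputs, centers, sigma=1.0):
    """Return Phi as a list of N rows, each holding the d1 values phi_j(x_i)."""
    return [[gaussian_rbf(x, c, sigma) for c in centers] for x in inputs]

# Toy setup: N = 3 one-dimensional inputs and d1 = 2 centers (made-up values).
inputs = [(0.0,), (1.0,), (2.0,)]
centers = [(0.0,), (2.0,)]

Phi = design_matrix(inputs, centers)
n_rows, n_cols = len(Phi), len(Phi[0])   # N and d1
```

Here `Phi` has one row per training input and one column per basis function, so it is $N \times d_1$ and in general not square.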
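We can then
Define the following matrix equation
$$\Phi^T \mathbf{f} + \Lambda \mathbf{w} = \Phi^T \mathbf{d} \tag{42}$$
Where
$$\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{d_1} \end{bmatrix} \tag{43}$$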
90 / 96
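Now, we have that
The vector $\mathbf{f}$ can be decomposed into the product of two terms: the design matrix and the weight vector.
We have then
$$f_i = f(\mathbf{x}_i) = \sum_{j=1}^{d_1} w_j \phi_j(\mathbf{x}_i) = \boldsymbol{\phi}_i^T \mathbf{w} \tag{44}$$
Where $\boldsymbol{\phi}_i$ here collects all the basis functions evaluated at $\mathbf{x}_i$ (the $i$-th row of $\Phi$ written as a column vector)
$$\boldsymbol{\phi}_i = \begin{bmatrix} \phi_1(\mathbf{x}_i) \\ \phi_2(\mathbf{x}_i) \\ \vdots \\ \phi_{d_1}(\mathbf{x}_i) \end{bmatrix} \tag{45}$$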
91 / 96
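Furthermore
We get that
$$\mathbf{f} = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{bmatrix} = \begin{bmatrix} \boldsymbol{\phi}_1^T \mathbf{w} \\ \boldsymbol{\phi}_2^T \mathbf{w} \\ \vdots \\ \boldsymbol{\phi}_N^T \mathbf{w} \end{bmatrix} = \Phi \mathbf{w} \tag{46}$$
Finally, we have that
$$\begin{aligned} \Phi^T \mathbf{d} &= \Phi^T \mathbf{f} + \Lambda \mathbf{w} \\ &= \Phi^T \Phi \mathbf{w} + \Lambda \mathbf{w} \\ &= \left[\Phi^T \Phi + \Lambda\right] \mathbf{w} \end{aligned}$$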
92 / 96
Now...
We get finally
$$\mathbf{w} = \left[\Phi^T \Phi + \Lambda\right]^{-1} \Phi^T \mathbf{d} \tag{47}$$
Remember
This equation is the most general form of the normal equation.
We have two cases
Standard ridge regression, where $\lambda_j = \lambda$ for $1 \leq j \leq d_1$.
Ordinary least squares, where there is no weight penalty, that is, $\lambda_j = 0$ for all $1 \leq j \leq d_1$.
93 / 96
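Thus, we have
First Case
$$\mathbf{w} = \left[\Phi^T \Phi + \lambda I_{d_1}\right]^{-1} \Phi^T \mathbf{d} \tag{48}$$
Second Case
$$\mathbf{w} = \left[\Phi^T \Phi\right]^{-1} \Phi^T \mathbf{d} \tag{49}$$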
94 / 96
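To make the two cases concrete, the sketch below solves the linear system $\left[\Phi^T \Phi + \Lambda\right]\mathbf{w} = \Phi^T \mathbf{d}$ on a small toy problem, once with $\lambda = 0$ (ordinary least squares) and once with a common $\lambda$ (ridge). It uses only the standard library, so a small Gaussian elimination routine stands in for a linear algebra package. In practice a library solver such as NumPy's `numpy.linalg.solve` would be the usual choice. The data, centers, $\lambda = 0.5$, and all function names are assumptions made for the example.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_weights(Phi, d, lams):
    """Solve (Phi^T Phi + Lambda) w = Phi^T d with Lambda = diag(lams)."""
    n_basis = len(Phi[0])
    A = [[sum(row[j] * row[k] for row in Phi) + (lams[j] if j == k else 0.0)
          for k in range(n_basis)] for j in range(n_basis)]
    b = [sum(row[j] * di for row, di in zip(Phi, d)) for j in range(n_basis)]
    return solve(A, b)

def phi(x, c, sigma=1.0):
    """Gaussian basis function for scalar inputs."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

# Toy data (illustrative values only).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
d = [0.1, 0.6, 1.0, 0.7, 0.2]
centers = [0.0, 1.0, 2.0]
Phi = [[phi(x, c) for c in centers] for x in xs]

w_ols = fit_weights(Phi, d, [0.0] * 3)     # equation (49): no penalty
w_ridge = fit_weights(Phi, d, [0.5] * 3)   # equation (48) with lambda = 0.5
```

Solving the system directly avoids forming the inverse explicitly. The ridge weights have a smaller norm than the least squares weights, which is the smoothing effect of the penalty.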
There are still several things that we need to look at...
First
What is the variance of the weight vector? The Variance Matrix.
Second
The prediction of the output at any of the training set inputs: the Projection Matrix.
Finally
The incremental algorithm for the problem!
96 / 96