17 Machine Learning Radial Basis Functions


Neural Networks: Radial Basis Function Networks

Andres Mendez-Vazquez

December 10, 2015


Outline

1. Introduction: Main Idea; Basic Radial-Basis Functions
2. Separability: Cover's Theorem on the separability of patterns; Dichotomy; φ-separable functions; The Stochastic Experiment; The XOR Problem; Separating Capacity of a Surface
3. Interpolation Problem: What is gained?; Feedforward Network; Learning Process; Radial-Basis Functions (RBF)
4. Introduction: Description of the Problem; Well-posed or ill-posed; The Main Problem
5. Regularization Theory: Solving the issue; Bias-Variance Dilemma; Measuring the difference between optimal and learned; The Bias-Variance; How can we use this?; Getting a solution; We still need to talk about...


Introduction

Observation
The back-propagation algorithm for the design of a multilayer perceptron, as described in the previous chapter, may be viewed as the application of a recursive technique known in statistics as stochastic approximation.

Now
We take a completely different approach, by viewing the design of a neural network as a curve-fitting (approximation) problem in a high-dimensional space.

Thus
Learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data, where "best fit" is measured under a statistical metric.


Thus

In the context of a neural network
The hidden units provide a set of "functions" that constitute a "basis" for the input patterns when they are expanded into the hidden space.

Name of these functions
Radial-basis functions.


History

These functions were first introduced
As the solution of the real multivariate interpolation problem.

Right now
It is now one of the main fields of research in numerical analysis.



A Basic Structure

We have the following structure:
1. Input Layer, to connect with the environment.
2. Hidden Layer, applying a non-linear transformation.
3. Output Layer, applying a linear transformation.

Example


[Figure: input nodes feeding a layer of nonlinear hidden nodes, followed by a single linear output node]
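To make the three-layer structure concrete, here is a minimal forward-pass sketch in Python (our own illustration; the Gaussian hidden function, the centers, and the weights are placeholders, not values from the slides):

```python
import numpy as np

# Hidden layer: nonlinear radial units; output layer: a single linear unit.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # placeholder RBF centers
weights = np.array([0.5, -0.5])                # placeholder output weights
sigma = 1.0                                    # placeholder width

def rbf_forward(x):
    """Input layer -> nonlinear hidden layer -> linear output layer."""
    x = np.asarray(x, dtype=float)
    hidden = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * sigma ** 2))
    return weights @ hidden                    # the output unit is purely linear

print(rbf_forward([0.5, 0.5]))
```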


Why the non-linear transformation?

The justification
In a paper by Cover (1965), a pattern-classification problem mapped to a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Thus
This is a good reason to make the dimension of the hidden space in a Radial-Basis Function (RBF) network high.



Cover’s Theorem

The Statement, Summarized
A complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Actually
The full theorem is quite a bit more involved...


Some facts

A fact
Once we know that a set of patterns is linearly separable, the problem is easy to solve.

Consider
A family of surfaces, each of which separates the space into two regions.

In addition
We have a set of N patterns

H = {x_1, x_2, ..., x_N}    (1)



Dichotomy (Binary Partition)

Now
The pattern set is split into two classes, H_1 and H_2.

Definition
A dichotomy (binary partition) of the points is said to be separable with respect to the family of surfaces if a surface exists in the family that separates the points in the class H_1 from those in the class H_2.

Define
For each pattern x ∈ H, we define a set of real-valued measurement functions {φ_1(x), φ_2(x), ..., φ_{d_1}(x)}.


Thus

We define the following function (Vector of measurements)

φ : H → R^{d_1}    (2)

Defined as

φ(x) = (φ_1(x), φ_2(x), ..., φ_{d_1}(x))^T    (3)

Now
Suppose that the pattern x is a vector in a d_0-dimensional input space.


Then...

We have that the mapping φ(x)
It maps points in the d_0-dimensional input space into corresponding points in a new space of dimension d_1.

Each of these functions φ_i(x)
It is known as a hidden function, because it plays a role similar to that of a hidden unit in a feedforward neural network.

Thus
The space spanned by the set of hidden functions {φ_i(x)}_{i=1}^{d_1} is called the hidden space or feature space.



φ-separable functions

Definition
A dichotomy {H_1, H_2} of H is said to be φ-separable if there exists a d_1-dimensional vector w such that

1. w^T φ(x) > 0 if x ∈ H_1.
2. w^T φ(x) < 0 if x ∈ H_2.

Clearly, the separating hyperplane in the hidden space is defined by the equation

w^T φ(x) = 0    (4)

Now
The inverse image of this hyperplane,

Hyp^{-1} = {x | w^T φ(x) = 0}    (5)

defines the separating surface in the input space.


Now

Taking into consideration
A natural class of mappings is obtained by using a linear combination of r-wise products of the pattern vector coordinates.

They are called
rth-order rational varieties.

A rational variety of order r in a space of dimension d_0 is described by

\sum_{0 \le i_1 \le i_2 \le \cdots \le i_r \le d_0} a_{i_1 i_2 \cdots i_r} x_{i_1} x_{i_2} \cdots x_{i_r} = 0    (6)

where x_i is the i-th coordinate of the input vector x, and x_0 is set to unity in order to express the equation in homogeneous form.
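For example, taking d_0 = 2 and r = 2 in (6) and writing out all the monomials with x_0 = 1 gives

a_{00} + a_{01} x_1 + a_{02} x_2 + a_{11} x_1^2 + a_{12} x_1 x_2 + a_{22} x_2^2 = 0,

which is the general equation of a conic (a quadric in the plane): depending on the coefficients it describes an ellipse, a parabola, a hyperbola, or a pair of lines.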


Homogeneous Functions

Definition
A function f(x) is said to be homogeneous of degree n if, introducing a constant parameter λ and replacing the variable x with λx, we find:

f(λx) = λ^n f(x)    (7)
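For example, f(x_1, x_2) = x_1^2 x_2 is homogeneous of degree 3, since

f(λx_1, λx_2) = (λx_1)^2 (λx_2) = λ^3 x_1^2 x_2 = λ^3 f(x_1, x_2).

In the homogeneous coordinates (x_0, x_1, ..., x_{d_0}), each rth-order monomial appearing in (6) is homogeneous of degree r in this sense.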


Homogeneous Equation

Equation (Eq. 6)
An rth-order product of entries x_i of x, namely x_{i_1} x_{i_2} \cdots x_{i_r}, is called a monomial.

Properties
For an input space of dimensionality d_0, there are

\binom{d_0}{r} = \frac{d_0!}{(d_0 - r)!\, r!}    (8)

monomials in (Eq. 6).


Example of these surfaces

Hyperplanes (first-order rational varieties)

Quadrics (second-order rational varieties)

Hyperspheres (quadrics with certain linear constraints on the coefficients)


The Stochastic Experiment

Suppose
The activation patterns x_1, x_2, ..., x_N are chosen independently.

Suppose
That all possible dichotomies of H = {x_1, x_2, ..., x_N} are equiprobable.

Now, let P(N, d_1) be the probability that a particular dichotomy picked at random is φ-separable. Then

P(N, d_1) = \left(\frac{1}{2}\right)^{N-1} \sum_{m=0}^{d_1 - 1} \binom{N-1}{m}    (9)
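A small numerical sketch of (9), our own illustration using only Python's standard library: it shows P(N, d_1) climbing toward 1 as the hidden dimension d_1 grows for a fixed number of patterns N, and that P(2d_1, d_1) = 1/2, which foreshadows the separating-capacity result discussed later.

```python
from math import comb

def p_separable(N: int, d1: int) -> float:
    """Cover's probability (Eq. 9) that a random dichotomy of N points
    in general position is phi-separable with d1 degrees of freedom."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(d1))

# The probability rises toward 1 as d1 grows relative to N = 20.
for d1 in (5, 10, 20, 40):
    print(d1, round(p_separable(20, d1), 4))
```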


What?

Basically, (Eq. 9) represents
The essence of Cover's Separability Theorem.

Something Notable
It is a statement of the fact that (9) is a cumulative binomial distribution: it equals the probability that N − 1 flips of a fair coin produce d_1 − 1 or fewer heads.

Specifically
The higher we make the dimension of the hidden space in the radial-basis-function network, the closer the probability P(N, d_1) gets to one.


Final Ingredients of Cover's Theorem

First
Nonlinear formulation of the hidden functions defined by φ_i(x), where x is the input vector and i = 1, 2, ..., d_1.

Second
High dimensionality of the hidden space compared to the input space. This dimensionality is determined by the value assigned to d_1 (i.e., the number of hidden units).

Then
In general, a complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.



There is always an exception to every rule!!!

The XOR Problem

[Figure: the four XOR input points in the (x_1, x_2) plane, labeled Class 1 and Class 2; the two classes cannot be separated by a single line in the input space]


Now

We define the following radial functions

φ_1(x) = exp{−‖x − t_1‖^2}, where t_1 = (1, 1)^T

φ_2(x) = exp{−‖x − t_2‖^2}, where t_2 = (0, 0)^T

Then
If we apply the mapping φ(x) = [φ_1(x), φ_2(x)] to the four input patterns:

Original Mapping
(0, 1) → (0.3678, 0.3678)
(1, 0) → (0.3678, 0.3678)
(0, 0) → (0.1353, 1)
(1, 1) → (1, 0.1353)
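A quick check of this mapping in Python (a minimal sketch of the example above, using NumPy): in the (φ_1, φ_2) plane the two XOR classes become linearly separable, for instance by the line φ_1 + φ_2 = 1.

```python
import numpy as np

# Radial functions of the XOR example:
# phi_i(x) = exp(-||x - t_i||^2), with centers t1 = (1, 1), t2 = (0, 0).
t1 = np.array([1.0, 1.0])
t2 = np.array([0.0, 0.0])

def phi(x):
    x = np.asarray(x, dtype=float)
    return np.array([np.exp(-np.sum((x - t1) ** 2)),
                     np.exp(-np.sum((x - t2) ** 2))])

for point in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(point, np.round(phi(point), 4))
# (0, 1) -> [0.3679 0.3679]   (1, 0) -> [0.3679 0.3679]
# (0, 0) -> [0.1353 1.    ]   (1, 1) -> [1.     0.1353]
```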


New Space

We have the following new φ_1–φ_2 space

[Figure: the four mapped points in the (φ_1, φ_2) plane, labeled Class 1 and Class 2; the two classes are now linearly separable]



Separating Capacity of a Surface

Something Notable
(Eq. 9) has an important bearing on the expected maximum number of randomly assigned patterns that are linearly separable in a multidimensional space.

Now, given our patterns {x_i}_{i=1}^{N}
Let N be a random variable defined as the largest integer such that the sequence x_1, x_2, ..., x_N is φ-separable.

We have that

Prob(N = n) = P(n, d_1) − P(n + 1, d_1)    (10)


Separating Capacity of a Surface

Then

Prob(N = n) = \left(\frac{1}{2}\right)^{n} \binom{n-1}{d_1 - 1}, \quad n = 0, 1, 2, ...    (11)

Remark: \binom{n}{d_1} = \binom{n-1}{d_1 - 1} + \binom{n-1}{d_1}, for 0 < d_1 < n.

To interpret this
Recall the negative binomial distribution.

It is a repeated sequence of Bernoulli trials
With k failures preceding the rth success.


Separating Capacity of a Surface

Thus, we have that
Given p and q, the probabilities of success and failure, respectively, with p + q = 1.

Definition

P(K = k | p, q) = \binom{r + k - 1}{k} p^r q^k    (12)

What happens when p = q = 1/2 and k + r = n?
Any idea?


Separating Capacity of a Surface

Thus
(Eq. 11) is just the negative binomial distribution shifted d_1 units to the right, with parameters d_1 and 1/2.

Finally
N corresponds to the "waiting time" for the d_1-th failure in a sequence of tosses of a fair coin.

We then have

E[N] = 2 d_1,    Median[N] = 2 d_1
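A one-line check of the expectation, filling in the step: write N = d_1 + K, where K is the number of successes preceding the d_1-th failure in tosses of a fair coin; by the symmetry p = q = 1/2, K follows the negative binomial distribution (12) with r = d_1, so

E[N] = d_1 + E[K] = d_1 + d_1 \cdot \frac{q}{p} = d_1 + d_1 = 2 d_1.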


This allows us to state the Corollary to Cover's Theorem

A celebrated asymptotic result
The expected maximum number of randomly assigned patterns (vectors) that are linearly separable in a space of dimensionality d_1 is equal to 2 d_1.

Something Notable
This result suggests that 2 d_1 is a natural definition of the separating capacity of a family of decision surfaces having d_1 degrees of freedom.



Given a problem of non-linearly separable patterns

It is possible to see that
There is a benefit to be gained by mapping the input space into a new space of high enough dimension.

For this, we use a non-linear map
This is quite similar to solving a difficult non-linear filtering problem by mapping it to a higher dimension and then solving it as a linear filtering problem.



Take into consideration the following architecture

Mapping from input space to hidden space, followed by a linear mapping to output space.

[Figure: input nodes → nonlinear hidden nodes → linear output node]


This can be seen as

We have the following map

s : R^{d_0} → R    (13)

Therefore
We may think of s as a hypersurface (graph) Γ ⊂ R^{d_0 + 1}.


Example
[Figure: the red planes represent the mappings and the gray plane is the linear separator]



General Idea

First
The training phase constitutes the optimization of a fitting procedure for the surface Γ, based on the known data points presented as input-output patterns.

Second
The generalization phase is synonymous with interpolation between the data points, the interpolation being performed along the constrained surface generated by the fitting procedure.


This leads to the theory of multi-variable interpolation

Interpolation Problem
Given a set of N different points {x_i ∈ R^{d_0} | i = 1, 2, ..., N} and a corresponding set of N real numbers {d_i ∈ R | i = 1, 2, ..., N}, find a function F : R^{d_0} → R that satisfies the interpolation condition:

F(x_i) = d_i,    i = 1, 2, ..., N    (14)

Remark
For strict interpolation as specified here, the interpolating surface is constrained to pass through all the training data points.



Radial-Basis Functions (RBF)

The function F has the following form (Powell, 1988)

F(x) = \sum_{i=1}^{N} w_i φ(‖x − x_i‖)    (15)

Where
{φ(‖x − x_i‖) | i = 1, ..., N} is a set of N arbitrary, generally non-linear, functions known as radial-basis functions, and ‖·‖ denotes a norm that is usually Euclidean.

In addition
The known data points x_i ∈ R^{d_0}, i = 1, 2, ..., N, are taken to be the centers of the radial-basis functions.


A Set of Simultaneous Linear Equations

Given

φ_{ji} = φ(‖x_j − x_i‖),    j, i = 1, 2, ..., N    (16)

Using (Eq. 14) and (Eq. 15), we get

\begin{bmatrix} φ_{11} & φ_{12} & \cdots & φ_{1N} \\ φ_{21} & φ_{22} & \cdots & φ_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ φ_{N1} & φ_{N2} & \cdots & φ_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix}    (17)


Now

We can create the following vectors
d = [d_1, d_2, ..., d_N]^T (response vector)
w = [w_1, w_2, ..., w_N]^T (linear weight vector)

Now, we define an N × N matrix called the interpolation matrix

Φ = {φ_{ji} | j, i = 1, 2, ..., N}    (18)

Thus, we have

Φ w = d    (19)


From here

Assuming that Φ is a non-singular matrix

w = Φ^{-1} d    (20)

Question
How can we be sure that the interpolation matrix Φ is non-singular?

Answer
It turns out that for a large class of radial-basis functions, and under certain conditions on the data points, Φ is indeed non-singular.
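To make Eqs. (15)-(20) concrete, here is a small Python sketch (our own illustration; the Gaussian radial function, its width σ, and the toy target are assumptions, not part of the slides). It builds the interpolation matrix Φ for a set of sample points, solves Φw = d, and verifies that the resulting surface F passes through all the training points.

```python
import numpy as np

# Strict RBF interpolation with an assumed Gaussian phi(r) = exp(-r^2 / (2 sigma^2)).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))       # N = 20 points in R^{d0}, d0 = 2
d = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1]      # target values d_i (toy choice)
sigma = 0.5

def phi(r):
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

# Interpolation matrix Phi_{ji} = phi(||x_j - x_i||)   (Eqs. 16-17)
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Phi = phi(dist)

w = np.linalg.solve(Phi, d)                    # w = Phi^{-1} d   (Eq. 20)

def F(x):
    """F(x) = sum_i w_i * phi(||x - x_i||)   (Eq. 15)."""
    return w @ phi(np.linalg.norm(X - x, axis=1))

print(np.allclose([F(x) for x in X], d))       # the surface passes through all points
```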



Introduction

Observation
The strict interpolation procedure just described may not be a good strategy for the training of RBF networks, for certain classes of tasks.

Reason
The number of data points may be much larger than the number of degrees of freedom of the underlying physical process.

Thus
The network may end up fitting misleading variations due to idiosyncrasies or noise in the input data.



Well-posed

The Problem
Assume that we have a domain X and a range Y, both metric spaces.

They are related by a mapping

f : X → Y    (21)

Definition
The problem of reconstructing the mapping f is said to be well-posed if three conditions are satisfied: existence, uniqueness and continuity.


Defining the meaning of this

Existence
For every input vector x ∈ X, there does exist an output y = f(x), where y ∈ Y.

Uniqueness
For any pair of input vectors x, t ∈ X, we have f(x) = f(t) if and only if x = t.

Continuity
The mapping is continuous: for any ε > 0 there exists δ > 0 such that the condition d_X(x, t) < δ implies d_Y(f(x), f(t)) < ε.


Basically

Example
[Figure: a mapping f taking points of the domain X to points of the range Y]


Ill-Posed

Therefore
If any of these conditions is not satisfied, the problem is said to be ill-posed.

Basically
An ill-posed problem means that large data sets may contain a surprisingly small amount of information about the desired solution.



Learning from data

Rebuilding the physical phenomenon using the samples

[Figure: a physical phenomenon generating the observed data samples]


We have the following

Physical Phenomena
Speech, pictures, radar signals, sonar signals, seismic data.

The data itself is well posed
But learning from such data, i.e. rebuilding the hypersurface, can be an ill-posed inverse problem.


Why

First
The existence criterion may be violated, in that a distinct output may not exist for every input.

Second
There may not be as much information in the training sample as we really need to reconstruct the input-output mapping uniquely.

Third
The unavoidable presence of noise or imprecision in real-life training data adds uncertainty to the reconstructed input-output mapping.


The noise problem

Getting out of the range

[Figure: the mapping corrupted by additive noise, so that some outputs fall outside the range of the clean mapping]


How?

This can happen when
There is a lack of information!!!

Lanczos, 1964
"A lack of information cannot be remedied by any mathematical trickery."



How do we solve the problem?

Something Notable
In 1963, Tikhonov proposed a new method called regularization for solving ill-posed problems.

Tikhonov
He was a Soviet and Russian mathematician known for important contributions to topology, functional analysis, mathematical physics, and ill-posed problems.


Also Known as Ridge Regression

Setup
We have:

Input signal: {x_i ∈ R^{d_0}}_{i=1}^{N}
Output signal: {d_i ∈ R}_{i=1}^{N}

In addition
Note that the output is assumed to be one-dimensional.


Now, assuming that you have an approximation function y = F(x)

Standard Error Term

E_s(F) = \frac{1}{2} \sum_{i=1}^{N} (d_i − y_i)^2 = \frac{1}{2} \sum_{i=1}^{N} (d_i − F(x_i))^2    (22)

Regularization Term

E_c(F) = \frac{1}{2} ‖DF‖^2    (23)

Where
D is a linear differential operator.


Now

Ordinarily, y = F(x)
Normally, the function space representing F is the L_2 space that consists of all real-valued functions f(x) with x ∈ R^{d_0}.

The quantity to be minimized in regularization theory is

E(f) = E_s(f) + λ E_c(f) = \frac{1}{2} \sum_{i=1}^{N} (d_i − f(x_i))^2 + \frac{λ}{2} ‖Df‖^2    (24)

Where
λ is a positive real number called the regularization parameter, and E(f) is called the Tikhonov functional.
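To see what the regularization parameter λ does in practice, here is a hedged Python sketch. It assumes the standard regularization-network form of the solution for RBF centers placed at the data points, w = (Φ + λI)^{-1} d, which is not derived in the excerpt above; strict interpolation (Eq. 20) is recovered at λ = 0. The Gaussian width, the noise level, and the toy data are our own choices, not from the slides.

```python
import numpy as np

# Assumed regularized solution: w = (Phi + lambda * I)^{-1} d.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-1.0, 1.0, size=20))[:, None]      # 1-D inputs
d = np.sin(3.0 * X[:, 0]) + 0.2 * rng.standard_normal(20)   # noisy targets
sigma = 0.3

dist = np.abs(X - X.T)
Phi = np.exp(-dist ** 2 / (2.0 * sigma ** 2))               # Gaussian RBF matrix

for lam in (0.0, 0.1, 1.0):
    w = np.linalg.solve(Phi + lam * np.eye(len(X)), d)
    fit = Phi @ w
    print(f"lambda={lam:4.1f}  training error={np.mean((fit - d) ** 2):.4f}"
          f"  ||w||={np.linalg.norm(w):10.2f}")
# lambda = 0 reproduces the data exactly (and fits the noise); larger lambda
# trades training error for a smoother, smaller-norm weight vector.
```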



Introduction

What did we see until now?
The design of learning machines from two main points of view:
1. The statistical point of view.
2. The linear algebra and optimization point of view.

Going back to the probability models
We might think of the machine to be learned as a function g(x|D)...
Something like curve fitting...

Under a data set

D = {(x_i, y_i) | i = 1, 2, ..., N}    (25)

Remark: where the x_i ∼ p(x|Θ).


Thus, we have that

Two main functions
1. A function g(x|D), obtained using some learning algorithm.
2. E[y|x], the optimal regression.

Important
The key factor here is the dependence of the approximation on D.

Why?
The approximation may be very good for one specific training data set but very bad for another.
This is one reason for studying fusion of information at the decision level...


How do we measure the difference

We have that

Var(X) = E[(X − µ)^2]

We can do the same for our learned machine

Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])^2]

Now, add and subtract

E_D[g(x|D)]    (26)

Remark: this is the expected output of the machine g(x|D) over training sets D.


Thus, we have that

Our original variance expands as

Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])^2]
             = E_D[(g(x|D) − E_D[g(x|D)] + E_D[g(x|D)] − E[y|x])^2]
             = E_D[(g(x|D) − E_D[g(x|D)])^2]
               + 2 E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])]
               + (E_D[g(x|D)] − E[y|x])^2

Finally

E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])] = ?    (27)
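Filling in the answer to (27): the factor E_D[g(x|D)] − E[y|x] does not depend on the particular training set D, so it can be pulled out of the expectation, giving

E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])] = (E_D[g(x|D)] − E[y|x]) \, E_D[g(x|D) − E_D[g(x|D)]] = 0,

since E_D[g(x|D) − E_D[g(x|D)]] = 0. This is why only the two terms below survive.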


We have the Bias-Variance

Our Final Equation

E_D[(g(x|D) − E[y|x])^2] = E_D[(g(x|D) − E_D[g(x|D)])^2]  +  (E_D[g(x|D)] − E[y|x])^2
                                    (VARIANCE)                        (BIAS)

Where the variance
It represents the measure of the error between our machine g(x|D) and the expected output of the machine under x_i ∼ p(x|Θ).

Where the bias
It represents the quadratic error between the expected output of the machine under x_i ∼ p(x|Θ) and the optimal regression E[y|x].

77 / 96
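
As a quick illustration (not from the slides), the two terms can be estimated by Monte Carlo over many training sets D. The toy target sin(2πx), the noise level 0.3, the cubic polynomial used as g(x|D) and the sample sizes below are all assumptions chosen only for illustration.

# Hypothetical sketch: Monte Carlo estimate of the VARIANCE and BIAS terms above.
import numpy as np

rng = np.random.default_rng(0)

def true_regression(x):
    # E[y|x] for this toy problem (the noise added below has zero mean)
    return np.sin(2 * np.pi * x)

def train_g(degree=3, n=20):
    # Draw one training set D and fit g(x|D): a least-squares polynomial
    x = rng.uniform(0, 1, n)
    y = true_regression(x) + rng.normal(0, 0.3, n)
    return np.polyfit(x, y, degree)

x_test = np.linspace(0, 1, 50)
# g(x|D) on a grid, for many independent training sets D
preds = np.array([np.polyval(train_g(), x_test) for _ in range(500)])

g_bar = preds.mean(axis=0)                        # E_D[g(x|D)]
variance = ((preds - g_bar) ** 2).mean(axis=0)    # E_D[(g(x|D) - E_D[g(x|D)])^2]
bias_sq = (g_bar - true_regression(x_test)) ** 2  # (E_D[g(x|D)] - E[y|x])^2

print("average variance:", variance.mean())
print("average squared bias:", bias_sq.mean())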

Using this in our favor!!!

Something Notable
Introducing bias is equivalent to restricting the range of functions for which a model can account.
Typically this is achieved by removing degrees of freedom.

Examples
They would be lowering the order of a polynomial or reducing the number of weights in a neural network!!!

Ridge Regression
It does not explicitly remove degrees of freedom but instead reduces the effective number of parameters.

79 / 96

Example

In the case of a linear regression model

C(w) = ∑_{i=1}^{N} (d_i − w^T x_i)^2 + λ ∑_{j=1}^{d_0} w_j^2   (28)

Thus
This is ridge regression (weight decay), and the regularization parameter λ > 0 controls the balance between fitting the data and avoiding the penalty.
A small value for λ means the data can be fit tightly without causing a large penalty.
A large value for λ means a tight fit has to be sacrificed if it requires large weights.

80 / 96
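
A small numerical sketch of Eq. (28): the random data, the true weights and the closed-form solution used below (which the slides derive a few pages later) are assumptions, used only to show how a larger λ shrinks the weights.

# Hypothetical sketch: effect of the regularization parameter lambda in Eq. (28)
import numpy as np

rng = np.random.default_rng(1)
N, d0 = 30, 5
X = rng.normal(size=(N, d0))               # rows are the inputs x_i
w_true = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
d = X @ w_true + rng.normal(0, 0.5, N)     # targets d_i

def ridge_cost(w, lam):
    # C(w) = sum_i (d_i - w^T x_i)^2 + lam * sum_j w_j^2
    return np.sum((d - X @ w) ** 2) + lam * np.sum(w ** 2)

for lam in (0.0, 1.0, 100.0):
    # closed-form minimizer of C(w), anticipating Eq. (48)
    w = np.linalg.solve(X.T @ X + lam * np.eye(d0), X.T @ d)
    print(f"lambda={lam:6.1f}  ||w||={np.linalg.norm(w):.3f}  cost={ridge_cost(w, lam):.2f}")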

Important

The Bias
It favors solutions involving small weights, and the effect is to smooth the output function.

81 / 96

Now, we can carry out the optimization

First, we rewrite the cost function the following way

S(w) = ∑_{i=1}^{N} (d_i − f(x_i))^2   (29)

And we will use a generalized version for f

f(x_i) = ∑_{j=1}^{d_1} w_j φ_j(x_i)   (30)

Where
The free variables are the weights {w_j}_{j=1}^{d_1}.

83 / 96

Where

For φ_j(x_i), in our case we may use the Gaussian function

φ_j(x_i) = φ(x_i, x_j)   (31)

With

φ(x, x_j) = exp{−‖x − x_j‖^2 / (2σ^2)}   (32)

84 / 96
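
A minimal sketch of the basis function in Eq. (32); the width σ and the example points are assumptions for illustration.

# Hypothetical sketch: Gaussian radial basis function phi(x, x_j)
import numpy as np

def gaussian_rbf(x, x_j, sigma=1.0):
    # phi(x, x_j) = exp(-||x - x_j||^2 / (2 sigma^2))
    x, x_j = np.asarray(x, float), np.asarray(x_j, float)
    return np.exp(-np.sum((x - x_j) ** 2) / (2.0 * sigma ** 2))

print(gaussian_rbf([0.0, 0.0], [1.0, 1.0], sigma=0.5))  # far from the centre: close to 0
print(gaussian_rbf([1.0, 1.0], [1.0, 1.0], sigma=0.5))  # at the centre: exactly 1.0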

Thus

Final cost function assuming there is a regularization term per weight

C(w, λ) = ∑_{i=1}^{N} (d_i − f(x_i))^2 + ∑_{j=1}^{d_1} λ_j w_j^2   (33)

What do we do?
1 Differentiate the function with respect to the free variables.
2 Equate the results with zero.
3 Solve the resulting equations.

85 / 96
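
A minimal sketch of the cost in Eq. (33), with one regularization parameter per weight; the matrix Phi below is the design matrix of basis-function values (its construction is sketched a few slides further on), and the small numbers are made up for illustration.

# Hypothetical sketch: the regularized cost C(w, lambda) of Eq. (33)
import numpy as np

def cost(w, lam, Phi, d):
    # C(w, lambda) = sum_i (d_i - f(x_i))^2 + sum_j lambda_j * w_j^2,
    # where f(x_i) = sum_j w_j * phi_j(x_i), i.e. f = Phi w
    residual = d - Phi @ w
    return np.sum(residual ** 2) + np.sum(lam * w ** 2)

Phi = np.array([[1.0, 0.2], [0.3, 1.0], [0.5, 0.5]])   # N = 3 points, d_1 = 2 basis functions
d = np.array([1.0, 2.0, 1.5])
print(cost(w=np.array([0.5, 1.0]), lam=np.array([0.1, 0.1]), Phi=Phi, d=d))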

Differentiate the function with respect to the free variables.

First

∂C(w, λ)/∂w_j = −2 ∑_{i=1}^{N} (d_i − f(x_i)) ∂f(x_i)/∂w_j + 2 λ_j w_j   (34)

We get the differential ∂f(x_i)/∂w_j

∂f(x_i)/∂w_j = φ_j(x_i)   (35)

86 / 96
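
A small finite-difference check (a sketch, not part of the slides) that the analytic gradient of Eq. (34), with ∂f(x_i)/∂w_j = φ_j(x_i) from Eq. (35), matches a numerical derivative; Phi, d, w and the λ_j below are random, made-up values.

# Hypothetical sketch: checking the gradient of C(w, lambda) numerically
import numpy as np

rng = np.random.default_rng(2)
N, d1 = 15, 4
Phi = rng.normal(size=(N, d1))   # Phi[i, j] = phi_j(x_i)
d = rng.normal(size=N)
w = rng.normal(size=d1)
lam = np.array([0.1, 0.5, 1.0, 2.0])

def C(w):
    return np.sum((d - Phi @ w) ** 2) + np.sum(lam * w ** 2)

# dC/dw_j = -2 sum_i (d_i - f(x_i)) phi_j(x_i) + 2 lambda_j w_j
analytic = -2 * Phi.T @ (d - Phi @ w) + 2 * lam * w

eps = 1e-6
numeric = np.array([(C(w + eps * e) - C(w - eps * e)) / (2 * eps) for e in np.eye(d1)])
print(np.max(np.abs(analytic - numeric)))  # should be tiny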

Now

We have then

∑_{i=1}^{N} f(x_i) φ_j(x_i) + λ_j w_j = ∑_{i=1}^{N} d_i φ_j(x_i)   (36)

Something Notable
There are d_1 such equations, for 1 ≤ j ≤ d_1, each representing one constraint on the solution.
Since there are exactly as many constraints as there are unknowns, the system of equations has, except under certain pathological conditions, a unique solution.

87 / 96

Using Our Linear Algebra

We have then

φ_j^T f + λ_j w_j = φ_j^T d   (37)

Where

φ_j = [φ_j(x_1), φ_j(x_2), ..., φ_j(x_N)]^T,  f = [f(x_1), f(x_2), ..., f(x_N)]^T,  d = [d_1, d_2, ..., d_N]^T   (38)

88 / 96

Now
Since there is one of these equations for each j, each relating one scalar quantity to another, we can stack them:

[φ_1^T f, φ_2^T f, ..., φ_{d_1}^T f]^T + [λ_1 w_1, λ_2 w_2, ..., λ_{d_1} w_{d_1}]^T = [φ_1^T d, φ_2^T d, ..., φ_{d_1}^T d]^T   (39)

Now, if we define

Φ = [φ_1  φ_2  ...  φ_{d_1}]   (40)

Written in full form

Φ =
[ φ_1(x_1)  φ_2(x_1)  ...  φ_{d_1}(x_1) ]
[ φ_1(x_2)  φ_2(x_2)  ...  φ_{d_1}(x_2) ]
[    ...       ...    ...       ...     ]
[ φ_1(x_N)  φ_2(x_N)  ...  φ_{d_1}(x_N) ]   (41)

89 / 96
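
A minimal sketch of the design matrix Φ of Eq. (41): entry (i, j) is φ_j(x_i), here taken to be a Gaussian basis function centred at x_j; the sample data, the choice of centres and σ are assumptions.

# Hypothetical sketch: building the design matrix Phi with Gaussian basis functions
import numpy as np

def design_matrix(X, centres, sigma=1.0):
    # Phi[i, j] = exp(-||x_i - c_j||^2 / (2 sigma^2))
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))   # N = 10 inputs x_i in 2 dimensions
centres = X[:4]                # d_1 = 4 basis-function centres
Phi = design_matrix(X, centres, sigma=1.5)
print(Phi.shape)               # (N, d_1) = (10, 4)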

We can then

Define the following matrix equation

Φ^T f + Λ w = Φ^T d   (42)

Where

Λ =
[ λ_1   0   ...    0     ]
[  0   λ_2  ...    0     ]
[ ...  ...  ...   ...    ]
[  0    0   ...  λ_{d_1} ]   (43)

90 / 96

Now, we have that

The vector f can be decomposed into the product of two terms
The design matrix and the weight vector.

We have then

f_i = f(x_i) = ∑_{j=1}^{d_1} w_j φ_j(x_i) = φ_i^T w   (44)

Where

φ_i = [φ_1(x_i), φ_2(x_i), ..., φ_{d_1}(x_i)]^T   (45)

91 / 96

Furthermore

We get that

f = [f_1, f_2, ..., f_N]^T = [φ_1^T w, φ_2^T w, ..., φ_N^T w]^T = Φ w   (46)

Finally, we have that

Φ^T d = Φ^T f + Λ w
      = Φ^T Φ w + Λ w
      = [Φ^T Φ + Λ] w

92 / 96

Now...

We get finally

w = [Φ^T Φ + Λ]^{-1} Φ^T d   (47)

Remember
This equation is the most general form of the normal equation.

We have two cases
In standard ridge regression, λ_j = λ for 1 ≤ j ≤ d_1.
Ordinary least squares, where there is no weight penalty, i.e., λ_j = 0 for 1 ≤ j ≤ d_1.

93 / 96
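
A minimal sketch of Eq. (47); Λ is the diagonal matrix of Eq. (43), and solving the linear system is preferred over forming the explicit inverse. The random Phi and d are assumptions for illustration.

# Hypothetical sketch: solving the generalized normal equation of Eq. (47)
import numpy as np

def fit_weights(Phi, d, lambdas):
    # w = (Phi^T Phi + Lambda)^{-1} Phi^T d, with Lambda = diag(lambdas)
    Lambda = np.diag(lambdas)
    return np.linalg.solve(Phi.T @ Phi + Lambda, Phi.T @ d)

rng = np.random.default_rng(4)
Phi = rng.normal(size=(20, 5))
d = rng.normal(size=20)
w = fit_weights(Phi, d, lambdas=np.full(5, 0.1))
print(w)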

Thus, we have

First Case

w = [Φ^T Φ + λ I_{d_1}]^{-1} Φ^T d   (48)

Second Case

w = [Φ^T Φ]^{-1} Φ^T d   (49)

94 / 96
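
Both special cases follow from the same sketch above by choosing Λ; the data below are made up for illustration.

# Hypothetical sketch: Eq. (48) with Lambda = lambda * I, and Eq. (49) with no penalty
import numpy as np

rng = np.random.default_rng(5)
Phi = rng.normal(size=(50, 6))
d = rng.normal(size=50)
d1 = Phi.shape[1]

w_ridge = np.linalg.solve(Phi.T @ Phi + 0.5 * np.eye(d1), Phi.T @ d)   # Eq. (48), lambda = 0.5
w_ols   = np.linalg.solve(Phi.T @ Phi, Phi.T @ d)                      # Eq. (49)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))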

There are still several things that we need to look at...

First
What is the variance of the weight vector? The Variance Matrix.

Second
The prediction of the output at any of the training set inputs - the Projection Matrix.

Finally
The incremental algorithm for the problem!!!

96 / 96
