17 Machine Learning Radial Basis Functions

Transcript
Page 1: 17 Machine Learning Radial Basis Functions

Neural Networks: Radial Basis Function Networks

Andres Mendez-Vazquez

December 10, 2015

1 / 96

Page 2: 17 Machine Learning Radial Basis Functions

Outline

1 Introduction
   Main Idea
   Basic Radial-Basis Functions

2 Separability
   Cover's Theorem on the separability of patterns
   Dichotomy
   φ-separable functions
   The Stochastic Experiment
   The XOR Problem
   Separating Capacity of a Surface

3 Interpolation Problem
   What is gained?
   Feedforward Network
   Learning Process
   Radial-Basis Functions (RBF)

4 Introduction
   Description of the Problem
   Well-posed or ill-posed
   The Main Problem

5 Regularization Theory
   Solving the issue
   Bias-Variance Dilemma
   Measuring the difference between optimal and learned
   The Bias-Variance
   How can we use this?
   Getting a solution
   We still need to talk about...

2 / 96


Page 4: 17 Machine Learning Radial Basis Functions

Introduction

Observation
The back-propagation algorithm for the design of a multilayer perceptron, as described in the previous chapter, may be viewed as the application of a recursive technique known in statistics as stochastic approximation.

Now
We take a completely different approach by viewing the design of a neural network as a curve-fitting (approximation) problem in a high-dimensional space.

Thus
Learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data, under a statistical metric.

4 / 96


Page 7: 17 Machine Learning Radial Basis Functions

Thus

In the context of a neural network
The hidden units provide a set of "functions" that constitute a "basis" for the input patterns when they are expanded into the hidden space.

Name of these functions
Radial-basis functions.

5 / 96


Page 9: 17 Machine Learning Radial Basis Functions

History

These functions were first introduced
As the solution of the real multivariate interpolation problem.

Right now
It is now one of the main fields of research in numerical analysis.

6 / 96




Page 15: 17 Machine Learning Radial Basis Functions

A Basic Structure

We have the following structure
1 Input Layer to connect with the environment.
2 Hidden Layer applying a non-linear transformation.
3 Output Layer applying a linear transformation.

Example
[Figure: input nodes feeding nonlinear (hidden) nodes, combined by a single linear output node]

8 / 96
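To make the three-layer structure above concrete, here is a minimal forward-pass sketch in Python. The Gaussian form of the hidden nonlinearity, the two centres, and the output weights are illustrative assumptions, not values given in the lecture.

import numpy as np

# Assumed toy parameters: two Gaussian hidden units and one linear output.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # one centre per hidden (nonlinear) node
weights = np.array([0.5, -0.5])                 # output-layer (linear) weights
sigma = 1.0

def forward(x):
    # Hidden layer: nonlinear (Gaussian) response to the distance from each centre.
    h = np.exp(-np.linalg.norm(x - centers, axis=1) ** 2 / (2 * sigma ** 2))
    # Output layer: plain linear combination of the hidden responses.
    return weights @ h

print(forward(np.array([0.5, 0.5])))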

Page 16: 17 Machine Learning Radial Basis Functions

Why the non-linear transformation?

The justification
In a paper by Cover (1965), it is shown that a pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Thus
This is a good reason to make the dimension of the hidden space in a Radial-Basis Function (RBF) network high.

9 / 96



Page 19: 17 Machine Learning Radial Basis Functions

Cover’s Theorem

The statement in brief
A complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Actually
The full statement is quite a bit more involved...

11 / 96


Page 21: 17 Machine Learning Radial Basis Functions

Some facts

A fact
Once we know a set of patterns is linearly separable, the problem is easy to solve.

Consider
A family of surfaces that separates the space into two regions.

In addition
We have a set of patterns

H = \{x_1, x_2, \ldots, x_N\}    (1)

12 / 96



Page 25: 17 Machine Learning Radial Basis Functions

Dichotomy (Binary Partition)

Now
The pattern set is split into two classes, H_1 and H_2.

Definition
A dichotomy (binary partition) of the points is said to be separable with respect to the family of surfaces if a surface exists in the family that separates the points in the class H_1 from those in the class H_2.

Define
For each pattern x ∈ H, we define a set of real-valued measurement functions {φ_1(x), φ_2(x), ..., φ_{d_1}(x)}.

14 / 96


Page 28: 17 Machine Learning Radial Basis Functions

Thus

We define the following function (vector of measurements)

φ : H → R^{d_1}    (2)

Defined as

φ(x) = (φ_1(x), φ_2(x), ..., φ_{d_1}(x))^T    (3)

Now
Suppose that the pattern x is a vector in a d_0-dimensional input space.

15 / 96


Page 31: 17 Machine Learning Radial Basis Functions

Then...

We have that the mapping φ(x)
It maps points in d_0-dimensional space into corresponding points in a new space of dimension d_1.

Each of these functions φ_i(x)
It is known as a hidden function because it plays a role similar to that of a hidden unit in a feed-forward neural network.

Thus
The space spanned by the set of hidden functions {φ_i(x)}_{i=1}^{d_1} is called the hidden space or feature space.

16 / 96



Page 35: 17 Machine Learning Radial Basis Functions

φ-separable functions

Definition
A dichotomy {H_1, H_2} of H is said to be φ-separable if there exists a d_1-dimensional vector w such that
1. w^T φ(x) > 0 if x ∈ H_1.
2. w^T φ(x) < 0 if x ∈ H_2.

Clearly the hyperplane is defined by the equation

w^T φ(x) = 0    (4)

Now
The inverse image of this hyperplane,

Hyp^{-1} = \{ x \mid w^T φ(x) = 0 \}    (5)

defines the separating surface in the input space.

18 / 96


Page 39: 17 Machine Learning Radial Basis Functions

Now

Taking into consideration
A natural class of mappings is obtained by using a linear combination of r-wise products of the pattern vector coordinates.

They are called
The rth-order rational varieties.

A rational variety of order r in dimension d_0 is described by

\sum_{0 \le i_1 \le i_2 \le \cdots \le i_r \le d_0} a_{i_1 i_2 \cdots i_r}\, x_{i_1} x_{i_2} \cdots x_{i_r} = 0    (6)

where x_i is the ith coordinate of the input vector x and x_0 is set to unity in order to express the equation in homogeneous form.

19 / 96
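As an illustration of the map behind (Eq. 6), the sketch below (my own example, not part of the lecture) enumerates the monomials x_{i_1} x_{i_2} ... x_{i_r} with 0 ≤ i_1 ≤ ... ≤ i_r ≤ d_0 once x_0 is set to 1; a linear form in these coordinates is then an rth-order rational variety.

from itertools import combinations_with_replacement
import numpy as np

def monomial_features(x, r):
    """All monomials x_{i1} x_{i2} ... x_{ir} with 0 <= i1 <= ... <= ir <= d0,
    where x_0 = 1 is prepended to write the variety in homogeneous form."""
    x = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    indices = combinations_with_replacement(range(len(x)), r)
    return np.array([np.prod(x[list(i)]) for i in indices])

print(monomial_features([2.0, 3.0], r=2))
# -> [1. 2. 3. 4. 6. 9.]: the monomials 1, x1, x2, x1^2, x1*x2, x2^2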


Page 43: 17 Machine Learning Radial Basis Functions

Homogeneous Functions

Definition
A function f(x) is said to be homogeneous of degree n if, introducing a constant parameter λ and replacing the variable x with λx, we find:

f(λx) = λ^n f(x)    (7)

20 / 96

Page 44: 17 Machine Learning Radial Basis Functions

Homogeneous Equation

Equation (Eq. 6)
An rth-order product of entries x_i of x, namely x_{i_1} x_{i_2} \cdots x_{i_r}, is called a monomial.

Properties
For an input space of dimensionality d_0, there are

\binom{d_0}{r} = \frac{d_0!}{(d_0 - r)!\, r!}    (8)

monomials in (Eq. 6).

21 / 96


Page 46: 17 Machine Learning Radial Basis Functions

Example of these surfaces

Hyperplanes (first-order rational varieties)

22 / 96


Page 48: 17 Machine Learning Radial Basis Functions

Example of these surfaces

Quadrics (second-order rational varieties)

24 / 96

Page 49: 17 Machine Learning Radial Basis Functions

Example of these surfaces

Hyperspheres (quadrics with certain linear constraints on the coefficients)

25 / 96


Page 51: 17 Machine Learning Radial Basis Functions

The Stochastic Experiment

Suppose
The activation patterns x_1, x_2, ..., x_N are chosen independently.

Suppose
That all possible dichotomies of H = {x_1, x_2, ..., x_N} are equiprobable.

Now, let P(N, d_1) be the probability that a particular dichotomy picked at random is φ-separable

P(N, d_1) = \left(\frac{1}{2}\right)^{N-1} \sum_{m=0}^{d_1 - 1} \binom{N-1}{m}    (9)

27 / 96
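A short numerical check of (Eq. 9), written as a sketch (the values of N and d_1 below are arbitrary choices): for a fixed number of patterns, the probability of φ-separability climbs towards 1 as the hidden dimension d_1 grows.

from math import comb

def cover_probability(N, d1):
    """P(N, d1) of Eq. 9: probability that a random dichotomy of N points
    in general position is phi-separable with d1 degrees of freedom."""
    return (0.5 ** (N - 1)) * sum(comb(N - 1, m) for m in range(d1))

for d1 in (2, 5, 10, 20):
    print(d1, cover_probability(20, d1))   # approaches 1.0 as d1 grows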


Page 54: 17 Machine Learning Radial Basis Functions

What?

Basically, (Eq. 9) represents
The essence of Cover's Separability Theorem.

Something Notable
It is a statement of the fact that P(N, d_1) is a cumulative binomial distribution: the probability of getting d_1 − 1 or fewer heads in N − 1 flips of a fair coin.

Specifically
The higher we make the dimension of the hidden space in the radial-basis-function network, the closer the probability P(N, d_1) gets to one.

28 / 96


Page 57: 17 Machine Learning Radial Basis Functions

Final ingredients of Cover's Theorem

First
Nonlinear formulation of the hidden functions defined by φ_i(x), where x is the input vector and i = 1, 2, ..., d_1.

Second
High dimensionality of the hidden space compared to the input space. This dimensionality is determined by the value assigned to d_1 (i.e., the number of hidden units).

Then
In general, a complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

29 / 96



Page 62: 17 Machine Learning Radial Basis Functions

There is always an exception to every rule!!!

The XOR Problem

[Figure: the four XOR input patterns (0,0), (0,1), (1,0), (1,1) plotted on the unit square, labeled Class 1 and Class 2; no single straight line separates the two classes]

31 / 96

Page 63: 17 Machine Learning Radial Basis Functions

Now

We define the following radial functions

φ_1(x) = \exp\{-\|x - t_1\|^2\}, where t_1 = (1, 1)^T
φ_2(x) = \exp\{-\|x - t_2\|^2\}, where t_2 = (0, 0)^T

Then
If we apply our classic mapping φ(x) = [φ_1(x), φ_2(x)]:

Original → Mapping
(0, 1) → (0.3678, 0.3678)
(1, 0) → (0.3678, 0.3678)
(0, 0) → (0.1353, 1)
(1, 1) → (1, 0.1353)

32 / 96
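The mapping of the four XOR points can be reproduced with a few lines of code; this sketch simply evaluates the two Gaussian functions defined above at each input pattern.

import numpy as np

t1 = np.array([1.0, 1.0])
t2 = np.array([0.0, 0.0])

def phi(x):
    """phi(x) = [exp(-||x - t1||^2), exp(-||x - t2||^2)]."""
    x = np.asarray(x, dtype=float)
    return np.array([np.exp(-np.sum((x - t1) ** 2)),
                     np.exp(-np.sum((x - t2) ** 2))])

for x in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(x, np.round(phi(x), 4))
# (0,1) and (1,0) land on the same point, and a straight line in the
# phi_1-phi_2 plane now separates the two XOR classes.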


Page 65: 17 Machine Learning Radial Basis Functions

New Space

We have the following new φ1 − φ2 space

[Figure: the four patterns in the φ_1–φ_2 plane; the images of (0,1) and (1,0) coincide, and a straight line now separates Class 1 from Class 2]

33 / 96


Page 67: 17 Machine Learning Radial Basis Functions

Separating Capacity of a Surface

Something Notable
(Eq. 9) has an important bearing on the expected maximum number of randomly assigned patterns that are linearly separable in a multidimensional space.

Now, given our patterns {x_i}_{i=1}^{N}
Let N be a random variable defined as the largest integer such that the sequence is φ-separable.

We have that

Prob(N = n) = P(n, d_1) − P(n + 1, d_1)    (10)

35 / 96


Page 70: 17 Machine Learning Radial Basis Functions

Separating Capacity of a Surface

Then

Prob(N = n) = \left(\frac{1}{2}\right)^{n} \binom{n-1}{d_1 - 1}, \quad n = 0, 1, 2, \ldots    (11)

Remark: \binom{n}{d_1} = \binom{n-1}{d_1 - 1} + \binom{n-1}{d_1}, \quad 0 < d_1 < n

To interpret this
Recall the negative binomial distribution.

It is a repeated sequence of Bernoulli trials
With k failures preceding the rth success.

36 / 96


Page 73: 17 Machine Learning Radial Basis Functions

Separating Capacity of a Surface

Thus, we have that
Given p and q, the probabilities of success and failure, respectively, with p + q = 1.

Definition

p(K = k \mid p, q) = \binom{r + k - 1}{k} p^{r} q^{k}    (12)

What happens with p = q = 1/2 and k + r = n?
Any idea?

37 / 96


Page 76: 17 Machine Learning Radial Basis Functions

Separating Capacity of a Surface

Thus
(Eq. 11) is just the negative binomial distribution shifted d_1 units to the right, with parameters d_1 and 1/2.

Finally
N corresponds to the "waiting time" for the d_1-th failure in a sequence of tosses of a fair coin.

We have then

E[N] = 2 d_1
Median[N] = 2 d_1

38 / 96
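These values can be confirmed numerically from (Eq. 11); the sketch below is only an illustration, with d_1 = 5 chosen arbitrarily, summing the shifted negative binomial distribution.

from math import comb

def prob_N(n, d1):
    """Prob(N = n) = (1/2)^n * C(n-1, d1-1) from Eq. 11."""
    return (0.5 ** n) * comb(n - 1, d1 - 1)

d1 = 5
mean = sum(n * prob_N(n, d1) for n in range(d1, 1000))
print(mean)   # ~10.0, i.e. 2 * d1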


Page 79: 17 Machine Learning Radial Basis Functions

This allows us to define the Corollary to Cover's Theorem

A celebrated asymptotic result
The expected maximum number of randomly assigned patterns (vectors) that are linearly separable in a space of dimensionality d_1 is equal to 2 d_1.

Something Notable
This result suggests that 2 d_1 is a natural definition of the separating capacity of a family of decision surfaces having d_1 degrees of freedom.

39 / 96



Page 82: 17 Machine Learning Radial Basis Functions

Given a problem of non-linearly separable patterns

It is possible to see that
There is a benefit to be gained by mapping the input space into a new space of high enough dimension.

For this, we use a non-linear map
Quite similar to solving a difficult non-linear filtering problem by mapping it to a high dimension and then solving it as a linear filtering problem.

41 / 96



Page 85: 17 Machine Learning Radial Basis Functions

Consider the following architecture

A mapping from input space to hidden space, followed by a linear mapping to output space!!!

[Figure: input nodes → nonlinear (hidden) nodes → linear output node]

43 / 96

Page 86: 17 Machine Learning Radial Basis Functions

This can be seen as

We have the following map

s : R^{d_0} → R    (13)

Therefore
We may think of s as a hypersurface (graph) Γ ⊂ R^{d_0 + 1}.

44 / 96


Page 88: 17 Machine Learning Radial Basis Functions

Example
[Figure: the red planes represent the mappings and the gray plane is the linear separator]

45 / 96


Page 90: 17 Machine Learning Radial Basis Functions

General Idea

First
The training phase constitutes the optimization of a fitting procedure for the surface Γ. It is based on the known data points, given as input-output patterns.

Second
The generalization phase is synonymous with interpolation between the data points, the interpolation being performed along the constrained surface generated by the fitting procedure.

47 / 96


Page 94: 17 Machine Learning Radial Basis Functions

This leads to the theory of multi-variable interpolation

Interpolation Problem
Given a set of N different points {x_i ∈ R^{d_0} | i = 1, 2, ..., N} and a corresponding set of N real numbers {d_i ∈ R | i = 1, 2, ..., N}, find a function F : R^{d_0} → R that satisfies the interpolation condition:

F(x_i) = d_i,  i = 1, 2, ..., N    (14)

Remark
For strict interpolation as specified here, the interpolating surface is constrained to pass through all the training data points.

48 / 96



Page 97: 17 Machine Learning Radial Basis Functions

Radial-Basis Functions (RBF)

The function F has the following form (Powell, 1988)

F(x) = \sum_{i=1}^{N} w_i\, φ(\|x - x_i\|)    (15)

Where
{φ(\|x - x_i\|) | i = 1, ..., N} is a set of N arbitrary, generally non-linear functions, known as radial-basis functions, and \|·\| denotes a norm that is usually Euclidean.

In addition
The known data points x_i ∈ R^{d_0}, i = 1, 2, ..., N, are taken to be the centers of the radial-basis functions.

50 / 96


Page 100: 17 Machine Learning Radial Basis Functions

A Set of Simultaneous Linear Equations

Given

φ_{ji} = φ(\|x_j - x_i\|),  (j, i) = 1, 2, ..., N    (16)

Using (Eq. 14) and (Eq. 15), we get

\begin{pmatrix}
φ_{11} & φ_{12} & \cdots & φ_{1N} \\
φ_{21} & φ_{22} & \cdots & φ_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
φ_{N1} & φ_{N2} & \cdots & φ_{NN}
\end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{pmatrix}    (17)

51 / 96


Page 102: 17 Machine Learning Radial Basis Functions

Now

We can create the following vectors
d = [d_1, d_2, ..., d_N]^T (response vector).
w = [w_1, w_2, ..., w_N]^T (linear weight vector).

Now, we define an N × N matrix called the interpolation matrix

Φ = {φ_{ji} | (j, i) = 1, 2, ..., N}    (18)

Thus, we have

Φ w = d    (19)

52 / 96


Page 105: 17 Machine Learning Radial Basis Functions

From here

Assuming that Φ is a non-singular matrix

w = Φ^{-1} d    (20)

Question
How can we be sure that the interpolation matrix Φ is non-singular?

Answer
It turns out that, for a large class of radial-basis functions and under certain conditions, non-singularity holds!!!

53 / 96
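Putting (Eq. 16)–(Eq. 20) together, strict interpolation reduces to building the N × N matrix Φ and solving one linear system. The sketch below assumes a Gaussian radial-basis function and a small one-dimensional toy data set; neither choice is dictated by the slides.

import numpy as np

def gaussian(r, sigma=1.0):
    return np.exp(-r ** 2 / (2 * sigma ** 2))

def fit_rbf(X, d, sigma=1.0):
    """Strict interpolation: phi_ji = phi(||x_j - x_i||), then w = Phi^{-1} d."""
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    Phi = gaussian(r, sigma)                                    # interpolation matrix (Eq. 18)
    return np.linalg.solve(Phi, d)                              # w = Phi^{-1} d (Eq. 20)

def predict(x, X, w, sigma=1.0):
    """F(x) = sum_i w_i phi(||x - x_i||), Eq. 15."""
    return gaussian(np.linalg.norm(x - X, axis=-1), sigma) @ w

X = np.array([[0.0], [0.5], [1.0], [1.5]])       # centres = training points
d = np.sin(2 * np.pi * X[:, 0])
w = fit_rbf(X, d)
print(predict(np.array([0.5]), X, w), d[1])      # F passes through the training data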



Page 109: 17 Machine Learning Radial Basis Functions

Introduction

Observation
The strict interpolation procedure described may not be a good strategy for the training of RBF networks for certain classes of tasks.

Reason
If the number of data points is much larger than the number of degrees of freedom of the underlying physical process...

Thus
The network may end up fitting misleading variations due to idiosyncrasies or noise in the input data.

55 / 96



Page 113: 17 Machine Learning Radial Basis Functions

Well-posed

The Problem
Assume that we have a domain X and a range Y, both metric spaces.

They are related by a mapping

f : X → Y    (21)

Definition
The problem of reconstructing the mapping f is said to be well-posed if three conditions are satisfied: Existence, Uniqueness and Continuity.

57 / 96


Page 116: 17 Machine Learning Radial Basis Functions

Defining the meaning of this

Existence
For every input vector x ∈ X, there exists an output y = f(x), where y ∈ Y.

Uniqueness
For any pair of input vectors x, t ∈ X, we have f(x) = f(t) if and only if x = t.

Continuity
The mapping is continuous if for any ε > 0 there exists δ > 0 such that the condition d_X(x, t) < δ implies d_Y(f(x), f(t)) < ε.

58 / 96


Page 119: 17 Machine Learning Radial Basis Functions

Basically

Example
[Figure: a mapping from the domain X to the range Y]

59 / 96

Page 120: 17 Machine Learning Radial Basis Functions

Ill-Posed

Therefore
If any of these conditions is not satisfied, the problem is said to be ill-posed.

Basically
An ill-posed problem means that large data sets may contain a surprisingly small amount of information about the desired solution.

60 / 96



Page 123: 17 Machine Learning Radial Basis Functions

Learning from data

Rebuilding the physical phenomenon using the samples
[Figure: a physical phenomenon generating the observed data]

62 / 96

Page 124: 17 Machine Learning Radial Basis Functions

We have the following

Physical Phenomena
Speech, pictures, radar signals, sonar signals, seismic data.

The data itself is well-posed
But learning from such data, i.e. rebuilding the hypersurface, can be an ill-posed inverse problem.

63 / 96


Page 126: 17 Machine Learning Radial Basis Functions

Why

First
The existence criterion may be violated, in that a distinct output may not exist for every input.

Second
There may not be as much information in the training sample as we really need to reconstruct the input-output mapping uniquely.

Third
The unavoidable presence of noise or imprecision in real-life training data adds uncertainty to the reconstructed input-output mapping.

64 / 96


Page 129: 17 Machine Learning Radial Basis Functions

The noise problem

Getting out of the range
[Figure: mapping plus noise, pushing some outputs outside the intended range]

65 / 96

Page 130: 17 Machine Learning Radial Basis Functions

How?

This can happen when
There is a lack of information!!!

Lanczos, 1964
"A lack of information cannot be remedied by any mathematical trickery."

66 / 96



Page 133: 17 Machine Learning Radial Basis Functions

How do we solve the problem?

Something Notable
In 1963, Tikhonov proposed a new method called regularization for solving ill-posed problems.

Tikhonov
He was a Soviet and Russian mathematician known for important contributions to topology, functional analysis, mathematical physics, and ill-posed problems.

68 / 96


Page 135: 17 Machine Learning Radial Basis Functions

Also Known as Ridge Regression

Setup
We have:
Input signal {x_i ∈ R^{d_0}}_{i=1}^{N}.
Output signal {d_i ∈ R}_{i=1}^{N}.

In addition
Note that the output is assumed to be one-dimensional.

69 / 96


Page 137: 17 Machine Learning Radial Basis Functions

Now, assuming that you have an approximation function y = F(x)

Standard Error Term

E_s(F) = \frac{1}{2} \sum_{i=1}^{N} (d_i - y_i)^2 = \frac{1}{2} \sum_{i=1}^{N} (d_i - F(x_i))^2    (22)

Regularization Term

E_c(F) = \frac{1}{2} \|DF\|^2    (23)

Where
D is a linear differential operator.

70 / 96


Page 140: 17 Machine Learning Radial Basis Functions

Now

Ordinarily, y = F(x)
Normally, the function space representing the functional F is the L_2 space that consists of all real-valued functions f(x) with x ∈ R^{d_0}.

The quantity to be minimized in regularization theory is

E(f) = \frac{1}{2} \sum_{i=1}^{N} (d_i - f(x_i))^2 + \frac{1}{2} \lambda \|Df\|^2    (24)

Where
λ is a positive real number called the regularization parameter.
E(f) is called the Tikhonov functional.

71 / 96
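As a hedged illustration of how (Eq. 24) changes the RBF solution, the sketch below replaces the exact solve w = Φ^{-1} d by a regularized one. For simplicity the smoothness penalty ‖Df‖² is approximated here by the plain weight penalty λ‖w‖² (my own simplification, not the general differential operator D of the slides).

import numpy as np

def fit_rbf_regularized(Phi, d, lam=1e-2):
    """Minimize 0.5 * ||d - Phi w||^2 + 0.5 * lam * ||w||^2,
    which gives (Phi^T Phi + lam * I) w = Phi^T d."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ d)

With lam → 0 this recovers the strict-interpolation weights of (Eq. 20); a larger lam trades exact fitting for smoothness, which is exactly the trade-off discussed next under the bias-variance dilemma.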


Page 144: 17 Machine Learning Radial Basis Functions

Outline
1 Introduction
   Main Idea
   Basic Radial-Basis Functions

2 Separability
   Cover’s Theorem on the separability of patterns
   Dichotomy
   φ-separable functions
   The Stochastic Experiment
   The XOR Problem
   Separating Capacity of a Surface

3 Interpolation Problem
   What is gained?
   Feedforward Network
   Learning Process
   Radial-Basis Functions (RBF)

4 Introduction
   Description of the Problem
   Well-posed or ill-posed
   The Main Problem

5 Regularization Theory
   Solving the issue
   Bias-Variance Dilemma
   Measuring the difference between optimal and learned
   The Bias-Variance
   How can we use this?
   Getting a solution
   We still need to talk about...

72 / 96

Page 145: 17 Machine Learning Radial Basis Functions

Introduction

What did we see until now?
The design of learning machines from two main points of view:

Statistical Point of View
Linear Algebra and Optimization Point of View

Going back to the probability models
We might think of the machine to be learned as a function g(x|D)...

Something like curve fitting...

Under a data set

D = {(x_i, y_i) | i = 1, 2, ..., N}   (25)

Remark: Where the x_i ∼ p(x|Θ)!!!

73 / 96

Page 152: 17 Machine Learning Radial Basis Functions

Thus, we have that

Two main functions
A function g(x|D) obtained using some algorithm!!!
E[y|x], the optimal regression...

Important
The key factor here is the dependence of the approximation on D.

Why?
The approximation may be very good for a specific training data set but very bad for another.

This is the reason for studying fusion of information at the decision level...

74 / 96

Page 157: 17 Machine Learning Radial Basis Functions

How do we measure the difference

We have that

Var(X) = E[(X − µ)^2]

We can do that for our data

Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])^2]

Now, if we add and subtract

E_D[g(x|D)]   (26)

Remark: The expected output of the machine g(x|D)

75 / 96

Page 161: 17 Machine Learning Radial Basis Functions

Thus, we have that

Our original variance

Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])^2]
             = E_D[(g(x|D) − E_D[g(x|D)] + E_D[g(x|D)] − E[y|x])^2]
             = E_D[(g(x|D) − E_D[g(x|D)])^2
                   + 2 (g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])
                   + (E_D[g(x|D)] − E[y|x])^2]

Finally

E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])] = ?   (27)

76 / 96
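As a short check (this step is not spelled out on the slide), the cross term in (27) is zero because the second factor does not depend on D:

E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])]
   = (E_D[g(x|D)] − E[y|x]) · E_D[g(x|D) − E_D[g(x|D)]]
   = (E_D[g(x|D)] − E[y|x]) · 0
   = 0

So only the variance and squared-bias terms survive on the next slide.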

Page 165: 17 Machine Learning Radial Basis Functions

We have the Bias-Variance

Our Final Equation

E_D[(g(x|D) − E[y|x])^2] = E_D[(g(x|D) − E_D[g(x|D)])^2]   (VARIANCE)
                           + (E_D[g(x|D)] − E[y|x])^2       (BIAS)

Where the variance
It represents the measure of the error between our machine g(x|D) and the expected output of the machine under x_i ∼ p(x|Θ).

Where the bias
It represents the quadratic error between the expected output of the machine under x_i ∼ p(x|Θ) and the expected output of the optimal regression.

77 / 96
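A small simulation can make the decomposition concrete. The following sketch (illustrative only; the target function, noise level, and polynomial degrees are assumptions, not taken from the slides) estimates the variance and squared-bias terms for polynomial fits g(x|D) over many random training sets D:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_regression(x):
    # Plays the role of E[y|x] in the decomposition.
    return np.sin(2 * np.pi * x)

def fit_and_predict(degree, x_query, n_train=25, noise=0.3):
    # Draw one training set D and return g(x_query|D) for a polynomial fit.
    x = rng.uniform(0.0, 1.0, n_train)
    y = true_regression(x) + noise * rng.normal(size=n_train)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_query)

x0 = 0.35                       # evaluate the decomposition at a single point x
for degree in (1, 3, 7):
    preds = np.array([fit_and_predict(degree, x0) for _ in range(500)])
    mean_pred = preds.mean()                        # E_D[g(x|D)]
    variance = np.mean((preds - mean_pred) ** 2)    # E_D[(g - E_D[g])^2]
    bias_sq = (mean_pred - true_regression(x0)) ** 2
    print(f"degree={degree}: variance={variance:.4f}, bias^2={bias_sq:.4f}")
```

Low-degree fits tend to show a larger squared bias, while high-degree fits show a larger variance across training sets, which is the dilemma the next slides exploit.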

Page 169: 17 Machine Learning Radial Basis Functions

Outline
1 Introduction
   Main Idea
   Basic Radial-Basis Functions

2 Separability
   Cover’s Theorem on the separability of patterns
   Dichotomy
   φ-separable functions
   The Stochastic Experiment
   The XOR Problem
   Separating Capacity of a Surface

3 Interpolation Problem
   What is gained?
   Feedforward Network
   Learning Process
   Radial-Basis Functions (RBF)

4 Introduction
   Description of the Problem
   Well-posed or ill-posed
   The Main Problem

5 Regularization Theory
   Solving the issue
   Bias-Variance Dilemma
   Measuring the difference between optimal and learned
   The Bias-Variance
   How can we use this?
   Getting a solution
   We still need to talk about...

78 / 96

Page 170: 17 Machine Learning Radial Basis Functions

Using this in our favor!!!

Something Notable
Introducing bias is equivalent to restricting the range of functions for which a model can account.
Typically this is achieved by removing degrees of freedom.

Examples
They would be lowering the order of a polynomial or reducing the number of weights in a neural network!!!

Ridge Regression
It does not explicitly remove degrees of freedom but instead reduces the effective number of parameters.

79 / 96

Page 174: 17 Machine Learning Radial Basis Functions

Example

In the case of a linear regression model

C(w) = \sum_{i=1}^{N} (d_i − w^T x_i)^2 + λ \sum_{j=1}^{d_0} w_j^2   (28)

Thus
This is ridge regression (weight decay), and the regularization parameter λ > 0 controls the balance between fitting the data and avoiding the penalty.
A small value for λ means the data can be fit tightly without causing a large penalty.
A large value for λ means a tight fit has to be sacrificed if it requires large weights.

80 / 96
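As a concrete illustration (a minimal sketch; the synthetic data and the λ values are assumptions, not from the slides), the cost in (28) is minimized by the closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀd, and larger λ shrinks the weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear data: d = X w_true + noise.
N, d0 = 100, 5
X = rng.normal(size=(N, d0))
w_true = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
d = X @ w_true + 0.1 * rng.normal(size=N)

def ridge_weights(X, d, lam):
    # Minimizer of sum_i (d_i - w^T x_i)^2 + lam * sum_j w_j^2.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ d)

for lam in (0.0, 1.0, 100.0):
    w = ridge_weights(X, d, lam)
    print(f"lambda={lam:>6}: ||w|| = {np.linalg.norm(w):.3f}")
# Larger lambda shrinks the weight vector, trading a tight fit for a smaller penalty.
```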

Page 178: 17 Machine Learning Radial Basis Functions

Important

The Bias
It favors solutions involving small weights, and the effect is to smooth the output function.

81 / 96

Page 179: 17 Machine Learning Radial Basis Functions

Outline
1 Introduction
   Main Idea
   Basic Radial-Basis Functions

2 Separability
   Cover’s Theorem on the separability of patterns
   Dichotomy
   φ-separable functions
   The Stochastic Experiment
   The XOR Problem
   Separating Capacity of a Surface

3 Interpolation Problem
   What is gained?
   Feedforward Network
   Learning Process
   Radial-Basis Functions (RBF)

4 Introduction
   Description of the Problem
   Well-posed or ill-posed
   The Main Problem

5 Regularization Theory
   Solving the issue
   Bias-Variance Dilemma
   Measuring the difference between optimal and learned
   The Bias-Variance
   How can we use this?
   Getting a solution
   We still need to talk about...

82 / 96

Page 180: 17 Machine Learning Radial Basis Functions

Now, we can carry out the optimization

First, we rewrite the cost function in the following way

S(w) = \sum_{i=1}^{N} (d_i − f(x_i))^2   (29)

And we will use a generalized version for f

f(x_i) = \sum_{j=1}^{d_1} w_j φ_j(x_i)   (30)

Where
The free variables are the weights {w_j}, j = 1, ..., d_1.

83 / 96

Page 183: 17 Machine Learning Radial Basis Functions

Where

For φ_j(x_i), in our case, we may use a Gaussian

φ_j(x_i) = φ(x_i, x_j)   (31)

With

φ(x, x_j) = exp{ − (1 / (2σ^2)) ‖x − x_j‖^2 }   (32)

84 / 96
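A minimal sketch of this basis (the centres and the width σ are placeholders; in an RBF network the centres x_j are typically the training inputs themselves):

```python
import numpy as np

def gaussian_basis(x, center, sigma=1.0):
    # phi(x, x_j) = exp(-||x - x_j||^2 / (2 sigma^2)), as in Eq. (32).
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x = np.array([0.2, -0.1])
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
print([gaussian_basis(x, c, sigma=0.5) for c in centers])  # response of each basis unit
```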

Page 185: 17 Machine Learning Radial Basis Functions

Thus

Final cost function, assuming there is a regularization term per weight

C(w, λ) = \sum_{i=1}^{N} (d_i − f(x_i))^2 + \sum_{j=1}^{d_1} λ_j w_j^2   (33)

What do we do?
1 Differentiate the function with respect to the free variables.
2 Equate the results with zero.
3 Solve the resulting equations.

85 / 96

Page 189: 17 Machine Learning Radial Basis Functions

Differentiate the function with respect to the free variables.

First

∂C(w, λ)/∂w_j = −2 \sum_{i=1}^{N} (d_i − f(x_i)) ∂f(x_i)/∂w_j + 2 λ_j w_j   (34)

We get the differential ∂f(x_i)/∂w_j as

∂f(x_i)/∂w_j = φ_j(x_i)   (35)

86 / 96
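A quick finite-difference check of (34)–(35) can be done numerically (a sketch; the basis functions, data, and λ values here are placeholders, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder setup: N points, d1 Gaussian basis functions with fixed centres.
N, d1 = 20, 4
X = rng.uniform(-1, 1, size=N)
centers = np.linspace(-1, 1, d1)
Phi = np.exp(-(X[:, None] - centers[None, :]) ** 2 / (2 * 0.3 ** 2))  # Phi[i, j] = phi_j(x_i)
d = np.sin(np.pi * X)
lam = np.full(d1, 0.1)
w = rng.normal(size=d1)

def cost(w):
    f = Phi @ w
    return np.sum((d - f) ** 2) + np.sum(lam * w ** 2)      # Eq. (33)

def grad(w):
    f = Phi @ w
    return -2 * Phi.T @ (d - f) + 2 * lam * w               # Eq. (34) using (35)

# Compare the analytic gradient against central finite differences.
eps = 1e-6
numeric = np.array([(cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)
                    for e in np.eye(d1)])
print(np.max(np.abs(numeric - grad(w))))   # should be tiny (~1e-6 or smaller)
```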

Page 191: 17 Machine Learning Radial Basis Functions

Now

We have then

\sum_{i=1}^{N} f(x_i) φ_j(x_i) + λ_j w_j = \sum_{i=1}^{N} d_i φ_j(x_i)   (36)

Something Notable
There are d_1 such equations, for 1 ≤ j ≤ d_1, each representing one constraint on the solution.
Since there are exactly as many constraints as there are unknowns, this system of equations has, except under certain pathological conditions, a unique solution.

87 / 96

Page 194: 17 Machine Learning Radial Basis Functions

Using Our Linear Algebra

We have then

φ_j^T f + λ_j w_j = φ_j^T d   (37)

Where

φ_j = (φ_j(x_1), φ_j(x_2), ..., φ_j(x_N))^T,  f = (f(x_1), f(x_2), ..., f(x_N))^T,  d = (d_1, d_2, ..., d_N)^T   (38)

88 / 96

Page 196: 17 Machine Learning Radial Basis Functions

Now
Since there is one of these equations for each j, each relating one scalar quantity to another, we can stack them:

\begin{pmatrix} φ_1^T f \\ φ_2^T f \\ \vdots \\ φ_{d_1}^T f \end{pmatrix} + \begin{pmatrix} λ_1 w_1 \\ λ_2 w_2 \\ \vdots \\ λ_{d_1} w_{d_1} \end{pmatrix} = \begin{pmatrix} φ_1^T d \\ φ_2^T d \\ \vdots \\ φ_{d_1}^T d \end{pmatrix}   (39)

Now, if we define

Φ = [φ_1  φ_2  ...  φ_{d_1}]   (40)

Written in full form

Φ = \begin{pmatrix} φ_1(x_1) & φ_2(x_1) & \cdots & φ_{d_1}(x_1) \\ φ_1(x_2) & φ_2(x_2) & \cdots & φ_{d_1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ φ_1(x_N) & φ_2(x_N) & \cdots & φ_{d_1}(x_N) \end{pmatrix}   (41)

89 / 96
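A sketch of assembling Φ as in (41) from the Gaussian basis introduced earlier (taking the centres from the training inputs is an assumption consistent with an RBF network, not something stated on this slide):

```python
import numpy as np

def design_matrix(X, centers, sigma=0.5):
    """Build Phi with Phi[i, j] = phi_j(x_i) = exp(-||x_i - c_j||^2 / (2 sigma^2))."""
    X = np.atleast_2d(X)
    centers = np.atleast_2d(centers)
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # N x d1
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.linspace(0, 1, 8).reshape(-1, 1)   # N = 8 one-dimensional inputs
centers = X[::2]                           # d1 = 4 centres drawn from the inputs
Phi = design_matrix(X, centers)
print(Phi.shape)                           # (8, 4), i.e. N x d1 as in Eq. (41)
```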

Page 199: 17 Machine Learning Radial Basis Functions

We can then

Define the following matrix equation

Φ^T f + Λ w = Φ^T d   (42)

Where

Λ = \begin{pmatrix} λ_1 & 0 & \cdots & 0 \\ 0 & λ_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & λ_{d_1} \end{pmatrix}   (43)

90 / 96

Page 201: 17 Machine Learning Radial Basis Functions

Now, we have that

The vector f can be decomposed into the product of two terms
The design matrix and the weight vector.

We have then

f_i = f(x_i) = \sum_{j=1}^{d_1} w_j φ_j(x_i) = φ_i^T w   (44)

Where

φ_i = (φ_1(x_i), φ_2(x_i), ..., φ_{d_1}(x_i))^T   (45)

(Here φ_i denotes the i-th row of Φ, while φ_j in (38) denoted its j-th column.)

91 / 96

Page 204: 17 Machine Learning Radial Basis Functions

Furthermore

We get that

f = (f_1, f_2, ..., f_N)^T = (φ_1^T w, φ_2^T w, ..., φ_N^T w)^T = Φ w   (46)

Finally, we have that

Φ^T d = Φ^T f + Λ w = Φ^T Φ w + Λ w = [Φ^T Φ + Λ] w

92 / 96

Page 206: 17 Machine Learning Radial Basis Functions

Now...

We get finally

w = [Φ^T Φ + Λ]^{-1} Φ^T d   (47)

Remember
This equation is the most general form of the normal equation.

We have two cases
In standard ridge regression, λ_j = λ for 1 ≤ j ≤ d_1.
Ordinary least squares, where there is no weight penalty, i.e. all λ_j = 0 for 1 ≤ j ≤ d_1.

93 / 96
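A sketch of solving (47) directly (the Gaussian design matrix, targets, and per-weight regularization values below are placeholders for illustration only):

```python
import numpy as np

def regularized_weights(Phi, d, lam):
    """Solve [Phi^T Phi + Lambda] w = Phi^T d, Eq. (47), with Lambda = diag(lam_j)."""
    Lambda = np.diag(lam)
    return np.linalg.solve(Phi.T @ Phi + Lambda, Phi.T @ d)

# Tiny illustrative problem: Gaussian design matrix, noisy sine targets.
x = np.linspace(0.0, 1.0, 12)
centers = np.linspace(0.0, 1.0, 5)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.2 ** 2))   # N x d1
d = np.sin(2 * np.pi * x) + 0.05 * np.random.default_rng(3).normal(size=x.size)

lam = np.full(centers.size, 0.01)     # one regularization parameter per weight
w = regularized_weights(Phi, d, lam)
print(w)
```

Solving the linear system directly is preferable in practice to forming the explicit inverse in (47), although both express the same solution.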

Page 209: 17 Machine Learning Radial Basis Functions

Thus, we have

First Case

w = [Φ^T Φ + λ I_{d_1}]^{-1} Φ^T d   (48)

Second Case

w = [Φ^T Φ]^{-1} Φ^T d   (49)

94 / 96
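Both cases follow from the same recipe (again an illustration with placeholder data, not part of the slides):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 12)
centers = np.linspace(0.0, 1.0, 5)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.2 ** 2))
d = np.sin(2 * np.pi * x)

# First case: standard ridge regression, one shared lambda (Eq. 48).
lam = 0.1
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(centers.size), Phi.T @ d)

# Second case: ordinary least squares, no penalty (Eq. 49); Phi^T Phi may be
# ill-conditioned here, which is the practical reason for regularizing.
w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ d)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))
```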

Page 211: 17 Machine Learning Radial Basis Functions

Outline
1 Introduction
   Main Idea
   Basic Radial-Basis Functions

2 Separability
   Cover’s Theorem on the separability of patterns
   Dichotomy
   φ-separable functions
   The Stochastic Experiment
   The XOR Problem
   Separating Capacity of a Surface

3 Interpolation Problem
   What is gained?
   Feedforward Network
   Learning Process
   Radial-Basis Functions (RBF)

4 Introduction
   Description of the Problem
   Well-posed or ill-posed
   The Main Problem

5 Regularization Theory
   Solving the issue
   Bias-Variance Dilemma
   Measuring the difference between optimal and learned
   The Bias-Variance
   How can we use this?
   Getting a solution
   We still need to talk about...

95 / 96

Page 212: 17 Machine Learning Radial Basis Functions

There are still several things that we need to look at...

First
What is the variance of the weight vector? The Variance Matrix.

Second
The prediction of the output at any of the training set inputs - The Projection Matrix.

Finally
The incremental algorithm for the problem!!!

96 / 96
