17 Machine Learning Radial Basis Functions


Neural Networks: Radial Basis Function Networks

Andres Mendez-Vazquez

December 10, 2015


Outline

1. Introduction: Main Idea; Basic Radial-Basis Functions
2. Separability: Cover's Theorem on the separability of patterns; Dichotomy; φ-separable functions; The Stochastic Experiment; The XOR Problem; Separating Capacity of a Surface
3. Interpolation Problem: What is gained?; Feedforward Network; Learning Process; Radial-Basis Functions (RBF)
4. Introduction: Description of the Problem; Well-posed or ill-posed; The Main Problem
5. Regularization Theory: Solving the issue; Bias-Variance Dilemma; Measuring the difference between optimal and learned; The Bias-Variance; How can we use this?; Getting a solution; We still need to talk about...


Introduction

Observation
The back-propagation algorithm for the design of a multilayer perceptron, as described in the previous chapter, may be viewed as the application of a recursive technique known in statistics as stochastic approximation.

Now
We take a completely different approach, by viewing the design of a neural network as a curve-fitting (approximation) problem in a high-dimensional space.

Thus
Learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data, where "best fit" is measured under a statistical metric.


Thus

In the context of a neural network
The hidden units provide a set of "functions" that constitute a "basis" for the input patterns when they are expanded into the hidden space.

Name of these functions
Radial-basis functions.


History

These functions were first introduced
As the solution of the real multivariate interpolation problem.

Right now
It is now one of the main fields of research in numerical analysis.



A Basic Structure

We have the following structure:
1. Input Layer, to connect with the environment.
2. Hidden Layer, applying a non-linear transformation.
3. Output Layer, applying a linear transformation.

Example


[Figure: input nodes feeding a layer of nonlinear hidden nodes, followed by a single linear output node]
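To make the three-layer structure concrete, here is a minimal forward-pass sketch in Python (our own illustration; the Gaussian hidden function, the centers, and the weights are placeholders, not values from the slides):

```python
import numpy as np

# Hidden layer: nonlinear radial units; output layer: a single linear unit.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # placeholder RBF centers
weights = np.array([0.5, -0.5])                # placeholder output weights
sigma = 1.0                                    # placeholder width

def rbf_forward(x):
    """Input layer -> nonlinear hidden layer -> linear output layer."""
    x = np.asarray(x, dtype=float)
    hidden = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * sigma ** 2))
    return weights @ hidden                    # the output unit is purely linear

print(rbf_forward([0.5, 0.5]))
```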


Why the non-linear transformation?

The justification
In a paper by Cover (1965), a pattern-classification problem mapped to a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Thus
This is a good reason to make the dimension of the hidden space in a Radial-Basis Function (RBF) network high.



Cover’s Theorem

The Statement, Summarized
A complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Actually
The full theorem is quite a bit more involved...


Some facts

A fact
Once we know that a set of patterns is linearly separable, the problem is easy to solve.

Consider
A family of surfaces, each of which separates the space into two regions.

In addition
We have a set of N patterns

H = {x_1, x_2, ..., x_N}    (1)



Dichotomy (Binary Partition)

Now
The pattern set is split into two classes, H_1 and H_2.

Definition
A dichotomy (binary partition) of the points is said to be separable with respect to the family of surfaces if a surface exists in the family that separates the points in the class H_1 from those in the class H_2.

Define
For each pattern x ∈ H, we define a set of real-valued measurement functions {φ_1(x), φ_2(x), ..., φ_{d_1}(x)}.


Thus

We define the following function (Vector of measurements)

φ : H → R^{d_1}    (2)

Defined as

φ(x) = (φ_1(x), φ_2(x), ..., φ_{d_1}(x))^T    (3)

Now
Suppose that the pattern x is a vector in a d_0-dimensional input space.


Then...

We have that the mapping φ(x)
It maps points in the d_0-dimensional input space into corresponding points in a new space of dimension d_1.

Each of these functions φ_i(x)
It is known as a hidden function, because it plays a role similar to that of a hidden unit in a feedforward neural network.

Thus
The space spanned by the set of hidden functions {φ_i(x)}_{i=1}^{d_1} is called the hidden space or feature space.



φ-separable functions

Definition
A dichotomy {H_1, H_2} of H is said to be φ-separable if there exists a d_1-dimensional vector w such that

1. w^T φ(x) > 0 if x ∈ H_1.
2. w^T φ(x) < 0 if x ∈ H_2.

Clearly, the separating hyperplane in the hidden space is defined by the equation

w^T φ(x) = 0    (4)

Now
The inverse image of this hyperplane,

Hyp^{-1} = {x | w^T φ(x) = 0}    (5)

defines the separating surface in the input space.


Now

Taking into consideration
A natural class of mappings is obtained by using a linear combination of r-wise products of the pattern vector coordinates.

They are called
rth-order rational varieties.

A rational variety of order r in a space of dimension d_0 is described by

\sum_{0 \le i_1 \le i_2 \le \cdots \le i_r \le d_0} a_{i_1 i_2 \cdots i_r} x_{i_1} x_{i_2} \cdots x_{i_r} = 0    (6)

where x_i is the i-th coordinate of the input vector x, and x_0 is set to unity in order to express the equation in homogeneous form.
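For example, taking d_0 = 2 and r = 2 in (6) and writing out all the monomials with x_0 = 1 gives

a_{00} + a_{01} x_1 + a_{02} x_2 + a_{11} x_1^2 + a_{12} x_1 x_2 + a_{22} x_2^2 = 0,

which is the general equation of a conic (a quadric in the plane): depending on the coefficients it describes an ellipse, a parabola, a hyperbola, or a pair of lines.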


Homogeneous Functions

Definition
A function f(x) is said to be homogeneous of degree n if, introducing a constant parameter λ and replacing the variable x with λx, we find:

f(λx) = λ^n f(x)    (7)
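For example, f(x_1, x_2) = x_1^2 x_2 is homogeneous of degree 3, since

f(λx_1, λx_2) = (λx_1)^2 (λx_2) = λ^3 x_1^2 x_2 = λ^3 f(x_1, x_2).

In the homogeneous coordinates (x_0, x_1, ..., x_{d_0}), each rth-order monomial appearing in (6) is homogeneous of degree r in this sense.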


Homogeneous Equation

Equation (Eq. 6)
An rth-order product of entries x_i of x, namely x_{i_1} x_{i_2} \cdots x_{i_r}, is called a monomial.

Properties
For an input space of dimensionality d_0, there are

\binom{d_0}{r} = \frac{d_0!}{(d_0 - r)!\, r!}    (8)

monomials in (Eq. 6).


Example of these surfaces

Hyperplanes (first-order rational varieties)

Quadrics (second-order rational varieties)

Hyperspheres (quadrics with certain linear constraints on the coefficients)


The Stochastic Experiment

Suppose
The activation patterns x_1, x_2, ..., x_N are chosen independently.

Suppose
That all possible dichotomies of H = {x_1, x_2, ..., x_N} are equiprobable.

Now, let P(N, d_1) be the probability that a particular dichotomy picked at random is φ-separable. Then

P(N, d_1) = \left(\frac{1}{2}\right)^{N-1} \sum_{m=0}^{d_1 - 1} \binom{N-1}{m}    (9)
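A small numerical sketch of (9), our own illustration using only Python's standard library: it shows P(N, d_1) climbing toward 1 as the hidden dimension d_1 grows for a fixed number of patterns N, and that P(2d_1, d_1) = 1/2, which foreshadows the separating-capacity result discussed later.

```python
from math import comb

def p_separable(N: int, d1: int) -> float:
    """Cover's probability (Eq. 9) that a random dichotomy of N points
    in general position is phi-separable with d1 degrees of freedom."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(d1))

# The probability rises toward 1 as d1 grows relative to N = 20.
for d1 in (5, 10, 20, 40):
    print(d1, round(p_separable(20, d1), 4))
```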


What?

Basically, (Eq. 9) represents
The essence of Cover's Separability Theorem.

Something Notable
It is a statement of the fact that (9) is a cumulative binomial distribution: it equals the probability that N − 1 flips of a fair coin produce d_1 − 1 or fewer heads.

Specifically
The higher we make the dimension of the hidden space in the radial-basis-function network, the closer the probability P(N, d_1) gets to one.


Final Ingredients of Cover's Theorem

First
Nonlinear formulation of the hidden functions defined by φ_i(x), where x is the input vector and i = 1, 2, ..., d_1.

Second
High dimensionality of the hidden space compared to the input space. This dimensionality is determined by the value assigned to d_1 (i.e., the number of hidden units).

Then
In general, a complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.



There is always an exception to every rule!!!

The XOR Problem

[Figure: the four XOR input points in the (x_1, x_2) plane, labeled Class 1 and Class 2; the two classes cannot be separated by a single line in the input space]


Now

We define the following radial functions

φ_1(x) = exp{−‖x − t_1‖^2}, where t_1 = (1, 1)^T

φ_2(x) = exp{−‖x − t_2‖^2}, where t_2 = (0, 0)^T

Then
If we apply the mapping φ(x) = [φ_1(x), φ_2(x)] to the four input patterns:

Original Mapping
(0, 1) → (0.3678, 0.3678)
(1, 0) → (0.3678, 0.3678)
(0, 0) → (0.1353, 1)
(1, 1) → (1, 0.1353)
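A quick check of this mapping in Python (a minimal sketch of the example above, using NumPy): in the (φ_1, φ_2) plane the two XOR classes become linearly separable, for instance by the line φ_1 + φ_2 = 1.

```python
import numpy as np

# Radial functions of the XOR example:
# phi_i(x) = exp(-||x - t_i||^2), with centers t1 = (1, 1), t2 = (0, 0).
t1 = np.array([1.0, 1.0])
t2 = np.array([0.0, 0.0])

def phi(x):
    x = np.asarray(x, dtype=float)
    return np.array([np.exp(-np.sum((x - t1) ** 2)),
                     np.exp(-np.sum((x - t2) ** 2))])

for point in [(0, 1), (1, 0), (0, 0), (1, 1)]:
    print(point, np.round(phi(point), 4))
# (0, 1) -> [0.3679 0.3679]   (1, 0) -> [0.3679 0.3679]
# (0, 0) -> [0.1353 1.    ]   (1, 1) -> [1.     0.1353]
```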


New Space

We have the following new φ_1–φ_2 space

[Figure: the four mapped points in the (φ_1, φ_2) plane, labeled Class 1 and Class 2; the two classes are now linearly separable]



Separating Capacity of a Surface

Something Notable
(Eq. 9) has an important bearing on the expected maximum number of randomly assigned patterns that are linearly separable in a multidimensional space.

Now, given our patterns {x_i}_{i=1}^{N}
Let N be a random variable defined as the largest integer such that the sequence x_1, x_2, ..., x_N is φ-separable.

We have that

Prob(N = n) = P(n, d_1) − P(n + 1, d_1)    (10)


Separating Capacity of a Surface

Then

Prob(N = n) = \left(\frac{1}{2}\right)^{n} \binom{n-1}{d_1 - 1}, \quad n = 0, 1, 2, ...    (11)

Remark: \binom{n}{d_1} = \binom{n-1}{d_1 - 1} + \binom{n-1}{d_1}, for 0 < d_1 < n.

To interpret this
Recall the negative binomial distribution.

It is a repeated sequence of Bernoulli trials
With k failures preceding the rth success.


Separating Capacity of a Surface

Thus, we have that
Given p and q, the probabilities of success and failure, respectively, with p + q = 1.

Definition

P(K = k | p, q) = \binom{r + k - 1}{k} p^r q^k    (12)

What happens when p = q = 1/2 and k + r = n?
Any idea?


Separating Capacity of a Surface

Thus
(Eq. 11) is just the negative binomial distribution shifted d_1 units to the right, with parameters d_1 and 1/2.

Finally
N corresponds to the "waiting time" for the d_1-th failure in a sequence of tosses of a fair coin.

We then have

E[N] = 2 d_1,    Median[N] = 2 d_1
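A one-line check of the expectation, filling in the step: write N = d_1 + K, where K is the number of successes preceding the d_1-th failure in tosses of a fair coin; by the symmetry p = q = 1/2, K follows the negative binomial distribution (12) with r = d_1, so

E[N] = d_1 + E[K] = d_1 + d_1 \cdot \frac{q}{p} = d_1 + d_1 = 2 d_1.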


This allows us to state the Corollary to Cover's Theorem

A celebrated asymptotic result
The expected maximum number of randomly assigned patterns (vectors) that are linearly separable in a space of dimensionality d_1 is equal to 2 d_1.

Something Notable
This result suggests that 2 d_1 is a natural definition of the separating capacity of a family of decision surfaces having d_1 degrees of freedom.



Given a problem of non-linearly separable patterns

It is possible to see that
There is a benefit to be gained by mapping the input space into a new space of high enough dimension.

For this, we use a non-linear map
This is quite similar to solving a difficult non-linear filtering problem by mapping it to a higher dimension and then solving it as a linear filtering problem.



Take into consideration the following architecture

Mapping from input space to hidden space, followed by a linear mapping to output space.

[Figure: input nodes → nonlinear hidden nodes → linear output node]


This can be seen as

We have the following map

s : R^{d_0} → R    (13)

Therefore
We may think of s as a hypersurface (graph) Γ ⊂ R^{d_0 + 1}.


Example
[Figure: the red planes represent the mappings and the gray plane is the linear separator]



General Idea

First
The training phase constitutes the optimization of a fitting procedure for the surface Γ, based on the known data points presented as input-output patterns.

Second
The generalization phase is synonymous with interpolation between the data points, the interpolation being performed along the constrained surface generated by the fitting procedure.


This leads to the theory of multi-variable interpolation

Interpolation Problem
Given a set of N different points {x_i ∈ R^{d_0} | i = 1, 2, ..., N} and a corresponding set of N real numbers {d_i ∈ R | i = 1, 2, ..., N}, find a function F : R^{d_0} → R that satisfies the interpolation condition:

F(x_i) = d_i,    i = 1, 2, ..., N    (14)

Remark
For strict interpolation as specified here, the interpolating surface is constrained to pass through all the training data points.



Radial-Basis Functions (RBF)

The function F has the following form (Powell, 1988)

F(x) = \sum_{i=1}^{N} w_i φ(‖x − x_i‖)    (15)

Where
{φ(‖x − x_i‖) | i = 1, ..., N} is a set of N arbitrary, generally non-linear, functions known as radial-basis functions, and ‖·‖ denotes a norm that is usually Euclidean.

In addition
The known data points x_i ∈ R^{d_0}, i = 1, 2, ..., N, are taken to be the centers of the radial-basis functions.


A Set of Simultaneous Linear Equations

Given

φ_{ji} = φ(‖x_j − x_i‖),    j, i = 1, 2, ..., N    (16)

Using (Eq. 14) and (Eq. 15), we get

\begin{bmatrix} φ_{11} & φ_{12} & \cdots & φ_{1N} \\ φ_{21} & φ_{22} & \cdots & φ_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ φ_{N1} & φ_{N2} & \cdots & φ_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix}    (17)


Now

We can create the following vectors
d = [d_1, d_2, ..., d_N]^T (response vector)
w = [w_1, w_2, ..., w_N]^T (linear weight vector)

Now, we define an N × N matrix called the interpolation matrix

Φ = {φ_{ji} | j, i = 1, 2, ..., N}    (18)

Thus, we have

Φ w = d    (19)


From here

Assuming that Φ is a non-singular matrix

w = Φ^{-1} d    (20)

Question
How can we be sure that the interpolation matrix Φ is non-singular?

Answer
It turns out that for a large class of radial-basis functions, and under certain conditions on the data points, Φ is indeed non-singular.
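To make Eqs. (15)-(20) concrete, here is a small Python sketch (our own illustration; the Gaussian radial function, its width σ, and the toy target are assumptions, not part of the slides). It builds the interpolation matrix Φ for a set of sample points, solves Φw = d, and verifies that the resulting surface F passes through all the training points.

```python
import numpy as np

# Strict RBF interpolation with an assumed Gaussian phi(r) = exp(-r^2 / (2 sigma^2)).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))       # N = 20 points in R^{d0}, d0 = 2
d = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1]      # target values d_i (toy choice)
sigma = 0.5

def phi(r):
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

# Interpolation matrix Phi_{ji} = phi(||x_j - x_i||)   (Eqs. 16-17)
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Phi = phi(dist)

w = np.linalg.solve(Phi, d)                    # w = Phi^{-1} d   (Eq. 20)

def F(x):
    """F(x) = sum_i w_i * phi(||x - x_i||)   (Eq. 15)."""
    return w @ phi(np.linalg.norm(X - x, axis=1))

print(np.allclose([F(x) for x in X], d))       # the surface passes through all points
```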



Introduction

Observation
The strict interpolation procedure just described may not be a good strategy for the training of RBF networks, for certain classes of tasks.

Reason
The number of data points may be much larger than the number of degrees of freedom of the underlying physical process.

Thus
The network may end up fitting misleading variations due to idiosyncrasies or noise in the input data.



Well-posed

The Problem
Assume that we have a domain X and a range Y, both metric spaces.

They are related by a mapping

f : X → Y    (21)

Definition
The problem of reconstructing the mapping f is said to be well-posed if three conditions are satisfied: existence, uniqueness and continuity.


Defining the meaning of this

Existence
For every input vector x ∈ X, there does exist an output y = f(x), where y ∈ Y.

Uniqueness
For any pair of input vectors x, t ∈ X, we have f(x) = f(t) if and only if x = t.

Continuity
The mapping is continuous: for any ε > 0 there exists δ > 0 such that the condition d_X(x, t) < δ implies d_Y(f(x), f(t)) < ε.


Basically

Example
[Figure: a mapping f taking points of the domain X to points of the range Y]


Ill-Posed

Therefore
If any of these conditions is not satisfied, the problem is said to be ill-posed.

Basically
An ill-posed problem means that large data sets may contain a surprisingly small amount of information about the desired solution.



Learning from data

Rebuilding the physical phenomenon using the samples

[Figure: a physical phenomenon generating the observed data samples]


We have the following

Physical Phenomena
Speech, pictures, radar signals, sonar signals, seismic data.

The data itself is well posed
But learning from such data, i.e. rebuilding the hypersurface, can be an ill-posed inverse problem.


Why

First
The existence criterion may be violated, in that a distinct output may not exist for every input.

Second
There may not be as much information in the training sample as we really need to reconstruct the input-output mapping uniquely.

Third
The unavoidable presence of noise or imprecision in real-life training data adds uncertainty to the reconstructed input-output mapping.


The noise problem

Getting out of the range

[Figure: the mapping corrupted by additive noise, so that some outputs fall outside the range of the clean mapping]


How?

This can happen when
There is a lack of information!!!

Lanczos, 1964
"A lack of information cannot be remedied by any mathematical trickery."



How do we solve the problem?

Something Notable
In 1963, Tikhonov proposed a new method called regularization for solving ill-posed problems.

Tikhonov
He was a Soviet and Russian mathematician known for important contributions to topology, functional analysis, mathematical physics, and ill-posed problems.


Also Known as Ridge Regression

Setup
We have:

Input signal: {x_i ∈ R^{d_0}}_{i=1}^{N}
Output signal: {d_i ∈ R}_{i=1}^{N}

In addition
Note that the output is assumed to be one-dimensional.


Now, assuming that you have an approximation function y = F(x)

Standard Error Term

E_s(F) = \frac{1}{2} \sum_{i=1}^{N} (d_i − y_i)^2 = \frac{1}{2} \sum_{i=1}^{N} (d_i − F(x_i))^2    (22)

Regularization Term

E_c(F) = \frac{1}{2} ‖DF‖^2    (23)

Where
D is a linear differential operator.


Now

Ordinarily, y = F(x)
Normally, the function space representing F is the L_2 space that consists of all real-valued functions f(x) with x ∈ R^{d_0}.

The quantity to be minimized in regularization theory is

E(f) = E_s(f) + λ E_c(f) = \frac{1}{2} \sum_{i=1}^{N} (d_i − f(x_i))^2 + \frac{λ}{2} ‖Df‖^2    (24)

Where
λ is a positive real number called the regularization parameter, and E(f) is called the Tikhonov functional.
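To see what the regularization parameter λ does in practice, here is a hedged Python sketch. It assumes the standard regularization-network form of the solution for RBF centers placed at the data points, w = (Φ + λI)^{-1} d, which is not derived in the excerpt above; strict interpolation (Eq. 20) is recovered at λ = 0. The Gaussian width, the noise level, and the toy data are our own choices, not from the slides.

```python
import numpy as np

# Assumed regularized solution: w = (Phi + lambda * I)^{-1} d.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-1.0, 1.0, size=20))[:, None]      # 1-D inputs
d = np.sin(3.0 * X[:, 0]) + 0.2 * rng.standard_normal(20)   # noisy targets
sigma = 0.3

dist = np.abs(X - X.T)
Phi = np.exp(-dist ** 2 / (2.0 * sigma ** 2))               # Gaussian RBF matrix

for lam in (0.0, 0.1, 1.0):
    w = np.linalg.solve(Phi + lam * np.eye(len(X)), d)
    fit = Phi @ w
    print(f"lambda={lam:4.1f}  training error={np.mean((fit - d) ** 2):.4f}"
          f"  ||w||={np.linalg.norm(w):10.2f}")
# lambda = 0 reproduces the data exactly (and fits the noise); larger lambda
# trades training error for a smoother, smaller-norm weight vector.
```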



Introduction

What did we see until now?
The design of learning machines from two main points of view:
1. The statistical point of view.
2. The linear algebra and optimization point of view.

Going back to the probability models
We might think of the machine to be learned as a function g(x|D)...
Something like curve fitting...

Under a data set

D = {(x_i, y_i) | i = 1, 2, ..., N}    (25)

Remark: where the x_i ∼ p(x|Θ).


Thus, we have that

Two main functions
1. A function g(x|D), obtained using some learning algorithm.
2. E[y|x], the optimal regression.

Important
The key factor here is the dependence of the approximation on D.

Why?
The approximation may be very good for one specific training data set but very bad for another.
This is one reason for studying fusion of information at the decision level...


How do we measure the difference

We have that

Var(X) = E[(X − µ)^2]

We can do the same for our learned machine

Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])^2]

Now, add and subtract

E_D[g(x|D)]    (26)

Remark: this is the expected output of the machine g(x|D) over training sets D.


Thus, we have that

Our original variance expands as

Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])^2]
             = E_D[(g(x|D) − E_D[g(x|D)] + E_D[g(x|D)] − E[y|x])^2]
             = E_D[(g(x|D) − E_D[g(x|D)])^2]
               + 2 E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])]
               + (E_D[g(x|D)] − E[y|x])^2

Finally

E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])] = ?    (27)
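Filling in the answer to (27): the factor E_D[g(x|D)] − E[y|x] does not depend on the particular training set D, so it can be pulled out of the expectation, giving

E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])] = (E_D[g(x|D)] − E[y|x]) \, E_D[g(x|D) − E_D[g(x|D)]] = 0,

since E_D[g(x|D) − E_D[g(x|D)]] = 0. This is why only the two terms below survive.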


We have the Bias-Variance

Our Final Equation

E_D[(g(x|D) − E[y|x])^2] = E_D[(g(x|D) − E_D[g(x|D)])^2]  +  (E_D[g(x|D)] − E[y|x])^2
                                    (VARIANCE)                        (BIAS)

Where the variance
It represents the measure of the error between our machine g(x|D) and the expected output of the machine under x_i ∼ p(x|Θ).

Where the bias
It represents the quadratic error between the expected output of the machine under x_i ∼ p(x|Θ) and the optimal regression E[y|x].

77 / 96
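
As a quick illustration (not from the slides), the two terms can be estimated by Monte Carlo over many training sets D. The toy target sin(2πx), the noise level 0.3, the cubic polynomial used as g(x|D) and the sample sizes below are all assumptions chosen only for illustration.

# Hypothetical sketch: Monte Carlo estimate of the VARIANCE and BIAS terms above.
import numpy as np

rng = np.random.default_rng(0)

def true_regression(x):
    # E[y|x] for this toy problem (the noise added below has zero mean)
    return np.sin(2 * np.pi * x)

def train_g(degree=3, n=20):
    # Draw one training set D and fit g(x|D): a least-squares polynomial
    x = rng.uniform(0, 1, n)
    y = true_regression(x) + rng.normal(0, 0.3, n)
    return np.polyfit(x, y, degree)

x_test = np.linspace(0, 1, 50)
# g(x|D) on a grid, for many independent training sets D
preds = np.array([np.polyval(train_g(), x_test) for _ in range(500)])

g_bar = preds.mean(axis=0)                        # E_D[g(x|D)]
variance = ((preds - g_bar) ** 2).mean(axis=0)    # E_D[(g(x|D) - E_D[g(x|D)])^2]
bias_sq = (g_bar - true_regression(x_test)) ** 2  # (E_D[g(x|D)] - E[y|x])^2

print("average variance:", variance.mean())
print("average squared bias:", bias_sq.mean())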

Using this in our favor!!!

Something Notable
Introducing bias is equivalent to restricting the range of functions for which a model can account.
Typically this is achieved by removing degrees of freedom.

Examples
They would be lowering the order of a polynomial or reducing the number of weights in a neural network!!!

Ridge Regression
It does not explicitly remove degrees of freedom but instead reduces the effective number of parameters.

79 / 96

Example

In the case of a linear regression model

C(w) = ∑_{i=1}^{N} (d_i − w^T x_i)^2 + λ ∑_{j=1}^{d_0} w_j^2   (28)

Thus
This is ridge regression (weight decay), and the regularization parameter λ > 0 controls the balance between fitting the data and avoiding the penalty.
A small value for λ means the data can be fit tightly without causing a large penalty.
A large value for λ means a tight fit has to be sacrificed if it requires large weights.

80 / 96
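
A small numerical sketch of Eq. (28): the random data, the true weights and the closed-form solution used below (which the slides derive a few pages later) are assumptions, used only to show how a larger λ shrinks the weights.

# Hypothetical sketch: effect of the regularization parameter lambda in Eq. (28)
import numpy as np

rng = np.random.default_rng(1)
N, d0 = 30, 5
X = rng.normal(size=(N, d0))               # rows are the inputs x_i
w_true = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
d = X @ w_true + rng.normal(0, 0.5, N)     # targets d_i

def ridge_cost(w, lam):
    # C(w) = sum_i (d_i - w^T x_i)^2 + lam * sum_j w_j^2
    return np.sum((d - X @ w) ** 2) + lam * np.sum(w ** 2)

for lam in (0.0, 1.0, 100.0):
    # closed-form minimizer of C(w), anticipating Eq. (48)
    w = np.linalg.solve(X.T @ X + lam * np.eye(d0), X.T @ d)
    print(f"lambda={lam:6.1f}  ||w||={np.linalg.norm(w):.3f}  cost={ridge_cost(w, lam):.2f}")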

Important

The Bias
It favors solutions involving small weights, and the effect is to smooth the output function.

81 / 96

Now, we can carry out the optimization

First, we rewrite the cost function the following way

S(w) = ∑_{i=1}^{N} (d_i − f(x_i))^2   (29)

And we will use a generalized version for f

f(x_i) = ∑_{j=1}^{d_1} w_j φ_j(x_i)   (30)

Where
The free variables are the weights {w_j}_{j=1}^{d_1}.

83 / 96

Where

For φ_j(x_i), in our case we may use the Gaussian function

φ_j(x_i) = φ(x_i, x_j)   (31)

With

φ(x, x_j) = exp{−‖x − x_j‖^2 / (2σ^2)}   (32)

84 / 96
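
A minimal sketch of the basis function in Eq. (32); the width σ and the example points are assumptions for illustration.

# Hypothetical sketch: Gaussian radial basis function phi(x, x_j)
import numpy as np

def gaussian_rbf(x, x_j, sigma=1.0):
    # phi(x, x_j) = exp(-||x - x_j||^2 / (2 sigma^2))
    x, x_j = np.asarray(x, float), np.asarray(x_j, float)
    return np.exp(-np.sum((x - x_j) ** 2) / (2.0 * sigma ** 2))

print(gaussian_rbf([0.0, 0.0], [1.0, 1.0], sigma=0.5))  # far from the centre: close to 0
print(gaussian_rbf([1.0, 1.0], [1.0, 1.0], sigma=0.5))  # at the centre: exactly 1.0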

Thus

Final cost function assuming there is a regularization term per weight

C(w, λ) = ∑_{i=1}^{N} (d_i − f(x_i))^2 + ∑_{j=1}^{d_1} λ_j w_j^2   (33)

What do we do?
1 Differentiate the function with respect to the free variables.
2 Equate the results with zero.
3 Solve the resulting equations.

85 / 96
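
A minimal sketch of the cost in Eq. (33), with one regularization parameter per weight; the matrix Phi below is the design matrix of basis-function values (its construction is sketched a few slides further on), and the small numbers are made up for illustration.

# Hypothetical sketch: the regularized cost C(w, lambda) of Eq. (33)
import numpy as np

def cost(w, lam, Phi, d):
    # C(w, lambda) = sum_i (d_i - f(x_i))^2 + sum_j lambda_j * w_j^2,
    # where f(x_i) = sum_j w_j * phi_j(x_i), i.e. f = Phi w
    residual = d - Phi @ w
    return np.sum(residual ** 2) + np.sum(lam * w ** 2)

Phi = np.array([[1.0, 0.2], [0.3, 1.0], [0.5, 0.5]])   # N = 3 points, d_1 = 2 basis functions
d = np.array([1.0, 2.0, 1.5])
print(cost(w=np.array([0.5, 1.0]), lam=np.array([0.1, 0.1]), Phi=Phi, d=d))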

Differentiate the function with respect to the free variables.

First

∂C(w, λ)/∂w_j = −2 ∑_{i=1}^{N} (d_i − f(x_i)) ∂f(x_i)/∂w_j + 2 λ_j w_j   (34)

We get the differential ∂f(x_i)/∂w_j

∂f(x_i)/∂w_j = φ_j(x_i)   (35)

86 / 96
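
A small finite-difference check (a sketch, not part of the slides) that the analytic gradient of Eq. (34), with ∂f(x_i)/∂w_j = φ_j(x_i) from Eq. (35), matches a numerical derivative; Phi, d, w and the λ_j below are random, made-up values.

# Hypothetical sketch: checking the gradient of C(w, lambda) numerically
import numpy as np

rng = np.random.default_rng(2)
N, d1 = 15, 4
Phi = rng.normal(size=(N, d1))   # Phi[i, j] = phi_j(x_i)
d = rng.normal(size=N)
w = rng.normal(size=d1)
lam = np.array([0.1, 0.5, 1.0, 2.0])

def C(w):
    return np.sum((d - Phi @ w) ** 2) + np.sum(lam * w ** 2)

# dC/dw_j = -2 sum_i (d_i - f(x_i)) phi_j(x_i) + 2 lambda_j w_j
analytic = -2 * Phi.T @ (d - Phi @ w) + 2 * lam * w

eps = 1e-6
numeric = np.array([(C(w + eps * e) - C(w - eps * e)) / (2 * eps) for e in np.eye(d1)])
print(np.max(np.abs(analytic - numeric)))  # should be tiny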

Now

We have then

∑_{i=1}^{N} f(x_i) φ_j(x_i) + λ_j w_j = ∑_{i=1}^{N} d_i φ_j(x_i)   (36)

Something Notable
There are d_1 such equations, for 1 ≤ j ≤ d_1, each representing one constraint on the solution.
Since there are exactly as many constraints as there are unknowns, the system of equations has, except under certain pathological conditions, a unique solution.

87 / 96

Using Our Linear Algebra

We have then

φ_j^T f + λ_j w_j = φ_j^T d   (37)

Where

φ_j = [φ_j(x_1), φ_j(x_2), ..., φ_j(x_N)]^T,  f = [f(x_1), f(x_2), ..., f(x_N)]^T,  d = [d_1, d_2, ..., d_N]^T   (38)

88 / 96

Now
Since there is one of these equations for each j, each relating one scalar quantity to another, we can stack them:

[φ_1^T f, φ_2^T f, ..., φ_{d_1}^T f]^T + [λ_1 w_1, λ_2 w_2, ..., λ_{d_1} w_{d_1}]^T = [φ_1^T d, φ_2^T d, ..., φ_{d_1}^T d]^T   (39)

Now, if we define

Φ = [φ_1  φ_2  ...  φ_{d_1}]   (40)

Written in full form

Φ =
[ φ_1(x_1)  φ_2(x_1)  ...  φ_{d_1}(x_1) ]
[ φ_1(x_2)  φ_2(x_2)  ...  φ_{d_1}(x_2) ]
[    ...       ...    ...       ...     ]
[ φ_1(x_N)  φ_2(x_N)  ...  φ_{d_1}(x_N) ]   (41)

89 / 96
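
A minimal sketch of the design matrix Φ of Eq. (41): entry (i, j) is φ_j(x_i), here taken to be a Gaussian basis function centred at x_j; the sample data, the choice of centres and σ are assumptions.

# Hypothetical sketch: building the design matrix Phi with Gaussian basis functions
import numpy as np

def design_matrix(X, centres, sigma=1.0):
    # Phi[i, j] = exp(-||x_i - c_j||^2 / (2 sigma^2))
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))   # N = 10 inputs x_i in 2 dimensions
centres = X[:4]                # d_1 = 4 basis-function centres
Phi = design_matrix(X, centres, sigma=1.5)
print(Phi.shape)               # (N, d_1) = (10, 4)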

We can then

Define the following matrix equation

Φ^T f + Λ w = Φ^T d   (42)

Where

Λ =
[ λ_1   0   ...    0     ]
[  0   λ_2  ...    0     ]
[ ...  ...  ...   ...    ]
[  0    0   ...  λ_{d_1} ]   (43)

90 / 96

Now, we have that

The vector f can be decomposed into the product of two terms
The design matrix and the weight vector.

We have then

f_i = f(x_i) = ∑_{j=1}^{d_1} w_j φ_j(x_i) = φ_i^T w   (44)

Where

φ_i = [φ_1(x_i), φ_2(x_i), ..., φ_{d_1}(x_i)]^T   (45)

91 / 96

Furthermore

We get that

f = [f_1, f_2, ..., f_N]^T = [φ_1^T w, φ_2^T w, ..., φ_N^T w]^T = Φ w   (46)

Finally, we have that

Φ^T d = Φ^T f + Λ w
      = Φ^T Φ w + Λ w
      = [Φ^T Φ + Λ] w

92 / 96

Now...

We get finally

w = [Φ^T Φ + Λ]^{-1} Φ^T d   (47)

Remember
This equation is the most general form of the normal equation.

We have two cases
In standard ridge regression, λ_j = λ for 1 ≤ j ≤ d_1.
Ordinary least squares, where there is no weight penalty, i.e., λ_j = 0 for 1 ≤ j ≤ d_1.

93 / 96
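
A minimal sketch of Eq. (47); Λ is the diagonal matrix of Eq. (43), and solving the linear system is preferred over forming the explicit inverse. The random Phi and d are assumptions for illustration.

# Hypothetical sketch: solving the generalized normal equation of Eq. (47)
import numpy as np

def fit_weights(Phi, d, lambdas):
    # w = (Phi^T Phi + Lambda)^{-1} Phi^T d, with Lambda = diag(lambdas)
    Lambda = np.diag(lambdas)
    return np.linalg.solve(Phi.T @ Phi + Lambda, Phi.T @ d)

rng = np.random.default_rng(4)
Phi = rng.normal(size=(20, 5))
d = rng.normal(size=20)
w = fit_weights(Phi, d, lambdas=np.full(5, 0.1))
print(w)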

Thus, we have

First Case

w = [Φ^T Φ + λ I_{d_1}]^{-1} Φ^T d   (48)

Second Case

w = [Φ^T Φ]^{-1} Φ^T d   (49)

94 / 96
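
Both special cases follow from the same sketch above by choosing Λ; the data below are made up for illustration.

# Hypothetical sketch: Eq. (48) with Lambda = lambda * I, and Eq. (49) with no penalty
import numpy as np

rng = np.random.default_rng(5)
Phi = rng.normal(size=(50, 6))
d = rng.normal(size=50)
d1 = Phi.shape[1]

w_ridge = np.linalg.solve(Phi.T @ Phi + 0.5 * np.eye(d1), Phi.T @ d)   # Eq. (48), lambda = 0.5
w_ols   = np.linalg.solve(Phi.T @ Phi, Phi.T @ d)                      # Eq. (49)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))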

There are still several things that we need to look at...

First
What is the variance of the weight vector? The Variance Matrix.

Second
The prediction of the output at any of the training set inputs - the Projection Matrix.

Finally
The incremental algorithm for the problem!!!

96 / 96
