of 146 /146
Machine Learning for Data Mining Introduction to Bayesian Classiﬁers Andres Mendez-Vazquez August 3, 2015 1 / 71

Embed Size (px)

### Transcript of Machine Learning - Introduction to Bayesian Classification

Machine Learning for Data MiningIntroduction to Bayesian Classifiers

Andres Mendez-Vazquez

August 3, 2015

1 / 71

Outline

1 IntroductionSupervised LearningNaive Bayes

The Naive Bayes ModelThe Multi-Class Case

Minimizing the Average Risk

2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian

2 / 71

Classification problem

Training DataSamples of the form (d, h(d))

dWhere d are the data objects to classify (inputs)

h (d)h(d) are the correct class info for d, h(d) ∈ 1, . . .K

3 / 71

Classification problem

Training DataSamples of the form (d, h(d))

dWhere d are the data objects to classify (inputs)

h (d)h(d) are the correct class info for d, h(d) ∈ 1, . . .K

3 / 71

Classification problem

Training DataSamples of the form (d, h(d))

dWhere d are the data objects to classify (inputs)

h (d)h(d) are the correct class info for d, h(d) ∈ 1, . . .K

3 / 71

Outline

1 IntroductionSupervised LearningNaive Bayes

The Naive Bayes ModelThe Multi-Class Case

Minimizing the Average Risk

2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian

4 / 71

Classification Problem

GoalGiven dnew , provide h(dnew)

The Machinery in General looks...

SupervisedLearning

Training Info: Desired/Trget Output

INPUT OUTPUT

5 / 71

Outline

1 IntroductionSupervised LearningNaive Bayes

The Naive Bayes ModelThe Multi-Class Case

Minimizing the Average Risk

2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian

6 / 71

Naive Bayes Model

Task for two classesLet ω1, ω2 be the two classes in which our samples belong.

There is a prior probability of belonging to that classP (ω1) for class 1.P (ω2) for class 2.

The Rule for classification is the following one

P (ωi |x) = P (x|ωi) P (ωi)P (x) (1)

Remark: Bayes to the next level.

7 / 71

Naive Bayes Model

Task for two classesLet ω1, ω2 be the two classes in which our samples belong.

There is a prior probability of belonging to that classP (ω1) for class 1.P (ω2) for class 2.

The Rule for classification is the following one

P (ωi |x) = P (x|ωi) P (ωi)P (x) (1)

Remark: Bayes to the next level.

7 / 71

Naive Bayes Model

Task for two classesLet ω1, ω2 be the two classes in which our samples belong.

There is a prior probability of belonging to that classP (ω1) for class 1.P (ω2) for class 2.

The Rule for classification is the following one

P (ωi |x) = P (x|ωi) P (ωi)P (x) (1)

Remark: Bayes to the next level.

7 / 71

Naive Bayes Model

Task for two classesLet ω1, ω2 be the two classes in which our samples belong.

There is a prior probability of belonging to that classP (ω1) for class 1.P (ω2) for class 2.

The Rule for classification is the following one

P (ωi |x) = P (x|ωi) P (ωi)P (x) (1)

Remark: Bayes to the next level.

7 / 71

In Informal English

We have that

posterior = likelihood × prior − informationevidence (2)

BasicallyOne: If we can observe x.Two: we can convert the prior-information to the posterior information.

8 / 71

In Informal English

We have that

posterior = likelihood × prior − informationevidence (2)

BasicallyOne: If we can observe x.Two: we can convert the prior-information to the posterior information.

8 / 71

In Informal English

We have that

posterior = likelihood × prior − informationevidence (2)

BasicallyOne: If we can observe x.Two: we can convert the prior-information to the posterior information.

8 / 71

We have the following terms...

LikelihoodWe call p (x|ωi) the likelihood of ωi given x:

This indicates that given a category ωi : If p (x|ωi) is “large”, then ωiis the “likely” class of x.

Prior ProbabilityIt is the known probability of a given class.

However: We can use other tricks for it.

EvidenceThe evidence factor can be seen as a scale factor that guarantees that theposterior probability sum to one.

9 / 71

We have the following terms...

LikelihoodWe call p (x|ωi) the likelihood of ωi given x:

This indicates that given a category ωi : If p (x|ωi) is “large”, then ωiis the “likely” class of x.

Prior ProbabilityIt is the known probability of a given class.

However: We can use other tricks for it.

EvidenceThe evidence factor can be seen as a scale factor that guarantees that theposterior probability sum to one.

9 / 71

We have the following terms...

LikelihoodWe call p (x|ωi) the likelihood of ωi given x:

This indicates that given a category ωi : If p (x|ωi) is “large”, then ωiis the “likely” class of x.

Prior ProbabilityIt is the known probability of a given class.

However: We can use other tricks for it.

EvidenceThe evidence factor can be seen as a scale factor that guarantees that theposterior probability sum to one.

9 / 71

We have the following terms...

LikelihoodWe call p (x|ωi) the likelihood of ωi given x:

This indicates that given a category ωi : If p (x|ωi) is “large”, then ωiis the “likely” class of x.

Prior ProbabilityIt is the known probability of a given class.

However: We can use other tricks for it.

EvidenceThe evidence factor can be seen as a scale factor that guarantees that theposterior probability sum to one.

9 / 71

We have the following terms...

LikelihoodWe call p (x|ωi) the likelihood of ωi given x:

This indicates that given a category ωi : If p (x|ωi) is “large”, then ωiis the “likely” class of x.

Prior ProbabilityIt is the known probability of a given class.

However: We can use other tricks for it.

EvidenceThe evidence factor can be seen as a scale factor that guarantees that theposterior probability sum to one.

9 / 71

We have the following terms...

LikelihoodWe call p (x|ωi) the likelihood of ωi given x:

This indicates that given a category ωi : If p (x|ωi) is “large”, then ωiis the “likely” class of x.

Prior ProbabilityIt is the known probability of a given class.

However: We can use other tricks for it.

EvidenceThe evidence factor can be seen as a scale factor that guarantees that theposterior probability sum to one.

9 / 71

The most important term in all this

The factor

likelihood × prior − information (3)

10 / 71

Example

We have the likelihood of two classes

11 / 71

Example

We have the posterior of two classes when P (ω1) = 23 and P (ω2) = 1

3

12 / 71

Naive Bayes Model

In the case of two classes

P (x) =2∑

i=1p (x, ωi) =

2∑i=1

p (x|ωi) P (ωi) (4)

13 / 71

Error in this rule

We have that

P (error |x) ={

P (ω1|x) if we decide ω2

P (ω2|x) if we decide ω1(5)

Thus, we have that

P (error) =ˆ ∞−∞

P (error ,x) dx =ˆ ∞−∞

P (error |x) p (x) dx (6)

14 / 71

Error in this rule

We have that

P (error |x) ={

P (ω1|x) if we decide ω2

P (ω2|x) if we decide ω1(5)

Thus, we have that

P (error) =ˆ ∞−∞

P (error ,x) dx =ˆ ∞−∞

P (error |x) p (x) dx (6)

14 / 71

Classification Rule

Thus, we have the Bayes Classification Rule1 If P (ω1|x) > P (ω2|x) x is classified to ω1

2 If P (ω1|x) < P (ω2|x) x is classified to ω2

15 / 71

Classification Rule

Thus, we have the Bayes Classification Rule1 If P (ω1|x) > P (ω2|x) x is classified to ω1

2 If P (ω1|x) < P (ω2|x) x is classified to ω2

15 / 71

What if we remove the normalization factor?

Remember

P (ω1|x) + P (ω2|x) = 1 (7)

We are able to obtain the new Bayes Classification Rule1 If P (x|ω1) p (ω1) > P (x|ω2) P (ω2) x is classified to ω1

2 If P (x|ω1) p (ω1) < P (x|ω2) P (ω2) x is classified to ω2

16 / 71

What if we remove the normalization factor?

Remember

P (ω1|x) + P (ω2|x) = 1 (7)

We are able to obtain the new Bayes Classification Rule1 If P (x|ω1) p (ω1) > P (x|ω2) P (ω2) x is classified to ω1

2 If P (x|ω1) p (ω1) < P (x|ω2) P (ω2) x is classified to ω2

16 / 71

What if we remove the normalization factor?

Remember

P (ω1|x) + P (ω2|x) = 1 (7)

We are able to obtain the new Bayes Classification Rule1 If P (x|ω1) p (ω1) > P (x|ω2) P (ω2) x is classified to ω1

2 If P (x|ω1) p (ω1) < P (x|ω2) P (ω2) x is classified to ω2

16 / 71

We have several cases

If for some x we have P (x|ω1) = P (x|ω2)

The final decision relies completely from the prior probability.

On the Other hand if P (ω1) = P (ω2), the “state” is equally probableIn this case the decision is based entirely on the likelihoods P (x|ωi).

17 / 71

We have several cases

If for some x we have P (x|ω1) = P (x|ω2)

The final decision relies completely from the prior probability.

On the Other hand if P (ω1) = P (ω2), the “state” is equally probableIn this case the decision is based entirely on the likelihoods P (x|ωi).

17 / 71

How the Rule looks likeIf P (ω1) = P (ω2) the Rule depends on the term p (x|ωi)

18 / 71

The Error in the Second Case of Naive Bayes

Error in equiprobable classes

P (error) = 12

x0ˆ

−∞

p (x|ω2) dx + 12

∞̂

x0

p (x|ω1) dx (8)

Remark: P (ω1) = P (ω2) = 12

19 / 71

What do we want to prove?

Something NotableBayesian classifier is optimal with respect to minimizing theclassification error probability.

20 / 71

ProofStep 1

R1 be the region of the feature space in which we decide in favor of ω1

R2 be the region of the feature space in which we decide in favor of ω2

Step 2

Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9)

Thus

Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2)

= P (ω1)ˆ

R2

p (x|ω1) dx + P (ω2)ˆ

R1

p (x|ω2) dx

21 / 71

ProofStep 1

R1 be the region of the feature space in which we decide in favor of ω1

R2 be the region of the feature space in which we decide in favor of ω2

Step 2

Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9)

Thus

Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2)

= P (ω1)ˆ

R2

p (x|ω1) dx + P (ω2)ˆ

R1

p (x|ω2) dx

21 / 71

ProofStep 1

R1 be the region of the feature space in which we decide in favor of ω1

R2 be the region of the feature space in which we decide in favor of ω2

Step 2

Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9)

Thus

Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2)

= P (ω1)ˆ

R2

p (x|ω1) dx + P (ω2)ˆ

R1

p (x|ω2) dx

21 / 71

ProofStep 1

R1 be the region of the feature space in which we decide in favor of ω1

R2 be the region of the feature space in which we decide in favor of ω2

Step 2

Pe = P (x ∈ R2, ω1) + P (x ∈ R1, ω2) (9)

Thus

Pe = P (x ∈ R2|ω1) P (ω1) + P (x ∈ R1|ω2) P (ω2)

= P (ω1)ˆ

R2

p (x|ω1) dx + P (ω2)ˆ

R1

p (x|ω2) dx

21 / 71

Proof

It is more

Pe = P (ω1)ˆ

R2

p (ω1,x)P (ω1) dx + P (ω2)

ˆ

R1

p (ω2,x)P (ω2) dx (10)

Finally

Pe =ˆ

R2

p (ω1|x) p (x) dx +ˆ

R1

p (ω2|x) p (x) dx (11)

Now, we choose the Bayes Classification Rule

R1 : P (ω1|x) > P (ω2|x)R2 : P (ω2|x) > P (ω1|x)

22 / 71

Proof

It is more

Pe = P (ω1)ˆ

R2

p (ω1,x)P (ω1) dx + P (ω2)

ˆ

R1

p (ω2,x)P (ω2) dx (10)

Finally

Pe =ˆ

R2

p (ω1|x) p (x) dx +ˆ

R1

p (ω2|x) p (x) dx (11)

Now, we choose the Bayes Classification Rule

R1 : P (ω1|x) > P (ω2|x)R2 : P (ω2|x) > P (ω1|x)

22 / 71

Proof

It is more

Pe = P (ω1)ˆ

R2

p (ω1,x)P (ω1) dx + P (ω2)

ˆ

R1

p (ω2,x)P (ω2) dx (10)

Finally

Pe =ˆ

R2

p (ω1|x) p (x) dx +ˆ

R1

p (ω2|x) p (x) dx (11)

Now, we choose the Bayes Classification Rule

R1 : P (ω1|x) > P (ω2|x)R2 : P (ω2|x) > P (ω1|x)

22 / 71

Proof

Thus

P (ω1) =ˆ

R1

p (ω1|x) p (x) dx +ˆ

R2

p (ω1|x) p (x) dx (12)

Now, we have...

P (ω1)−ˆ

R1

p (ω1|x) p (x) dx =ˆ

R2

p (ω1|x) p (x) dx (13)

Then

Pe = P (ω1)−ˆ

R1

p (ω1|x) p (x) dx +ˆ

R1

p (ω2|x) p (x) dx (14)

23 / 71

Proof

Thus

P (ω1) =ˆ

R1

p (ω1|x) p (x) dx +ˆ

R2

p (ω1|x) p (x) dx (12)

Now, we have...

P (ω1)−ˆ

R1

p (ω1|x) p (x) dx =ˆ

R2

p (ω1|x) p (x) dx (13)

Then

Pe = P (ω1)−ˆ

R1

p (ω1|x) p (x) dx +ˆ

R1

p (ω2|x) p (x) dx (14)

23 / 71

Proof

Thus

P (ω1) =ˆ

R1

p (ω1|x) p (x) dx +ˆ

R2

p (ω1|x) p (x) dx (12)

Now, we have...

P (ω1)−ˆ

R1

p (ω1|x) p (x) dx =ˆ

R2

p (ω1|x) p (x) dx (13)

Then

Pe = P (ω1)−ˆ

R1

p (ω1|x) p (x) dx +ˆ

R1

p (ω2|x) p (x) dx (14)

23 / 71

Graphically P (ω1): Thanks Edith 2013 Class!!!

In Red

24 / 71

Thus we have´R1

p (ω1|x) p (x) dx =´

R1p (ω1,x) dx = PR1(ω1)

Thus

25 / 71

Finally

Finally

Pe = P (ω1)−ˆ

R1

[p (ω1|x)− p (ω2|x)] p (x) dx (15)

Thus, we have

Pe =

P (ω1)−ˆ

R1

p (ω1|x) p (x) dx

R1

p (ω2|x) p (x) dx

26 / 71

Finally

Finally

Pe = P (ω1)−ˆ

R1

[p (ω1|x)− p (ω2|x)] p (x) dx (15)

Thus, we have

Pe =

P (ω1)−ˆ

R1

p (ω1|x) p (x) dx

R1

p (ω2|x) p (x) dx

26 / 71

Pe for a non optimal rule

A great idea Edith!!!

27 / 71

Which decision function for minimizing the error

A single number in this case

28 / 71

Error is minimized by the Bayesian Naive Rule

ThusThe probability of error is minimized at the region of space in which:

R1 : P (ω1|x) > P (ω2|x)R2 : P (ω2|x) > P (ω1|x)

29 / 71

Error is minimized by the Bayesian Naive Rule

ThusThe probability of error is minimized at the region of space in which:

R1 : P (ω1|x) > P (ω2|x)R2 : P (ω2|x) > P (ω1|x)

29 / 71

Error is minimized by the Bayesian Naive Rule

ThusThe probability of error is minimized at the region of space in which:

R1 : P (ω1|x) > P (ω2|x)R2 : P (ω2|x) > P (ω1|x)

29 / 71

Pe for an optimal rule

A great idea Edith!!!

30 / 71

For M classes ω1, ω2, ..., ωM

We have that vector x is in ωi

P (ωi |x) > P (ωj |x) ∀j 6= i (16)

Something NotableIt turns out that such a choice also minimizes the classification errorprobability.

31 / 71

For M classes ω1, ω2, ..., ωM

We have that vector x is in ωi

P (ωi |x) > P (ωj |x) ∀j 6= i (16)

Something NotableIt turns out that such a choice also minimizes the classification errorprobability.

31 / 71

Outline

1 IntroductionSupervised LearningNaive Bayes

The Naive Bayes ModelThe Multi-Class Case

Minimizing the Average Risk

2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian

32 / 71

Minimizing the Risk

Something NotableThe classification error probability is not always the best criterion to beadopted for minimization.

All the errors get the same importance

HoweverCertain errors are more important than others.

For exampleReally serious, a doctor makes a wrong decision and a malign tumorgets classified as benign.Not so serious, a benign tumor gets classified as malign.

33 / 71

Minimizing the Risk

Something NotableThe classification error probability is not always the best criterion to beadopted for minimization.

All the errors get the same importance

HoweverCertain errors are more important than others.

For exampleReally serious, a doctor makes a wrong decision and a malign tumorgets classified as benign.Not so serious, a benign tumor gets classified as malign.

33 / 71

Minimizing the Risk

Something NotableThe classification error probability is not always the best criterion to beadopted for minimization.

All the errors get the same importance

HoweverCertain errors are more important than others.

For exampleReally serious, a doctor makes a wrong decision and a malign tumorgets classified as benign.Not so serious, a benign tumor gets classified as malign.

33 / 71

Minimizing the Risk

Something NotableThe classification error probability is not always the best criterion to beadopted for minimization.

All the errors get the same importance

HoweverCertain errors are more important than others.

For exampleReally serious, a doctor makes a wrong decision and a malign tumorgets classified as benign.Not so serious, a benign tumor gets classified as malign.

33 / 71

It is based on the following idea

In order to measure the predictive performance of a functionf : X → YWe use a loss function

` : Y × Y → R+ (17)

A non-negative function that quantifies how bad the prediction f (x) is thetrue label y.

Thus, we can say that` (f (x) , y) is the loss incurred by f on the pair (x, y).

34 / 71

It is based on the following idea

In order to measure the predictive performance of a functionf : X → YWe use a loss function

` : Y × Y → R+ (17)

A non-negative function that quantifies how bad the prediction f (x) is thetrue label y.

Thus, we can say that` (f (x) , y) is the loss incurred by f on the pair (x, y).

34 / 71

It is based on the following idea

In order to measure the predictive performance of a functionf : X → YWe use a loss function

` : Y × Y → R+ (17)

A non-negative function that quantifies how bad the prediction f (x) is thetrue label y.

Thus, we can say that` (f (x) , y) is the loss incurred by f on the pair (x, y).

34 / 71

Example

In classificationIn the classification case, binary or otherwise, a natural loss function is the0-1 loss where y′ = f (x) :

`(y′, y

)= 1

[y′ 6= y

](18)

35 / 71

Example

In classificationIn the classification case, binary or otherwise, a natural loss function is the0-1 loss where y′ = f (x) :

`(y′, y

)= 1

[y′ 6= y

](18)

35 / 71

Furthermore

For regression problems, some natural choices1 Squared loss: ` (y′, y) = (y′ − y)2

2 Absolute loss: ` (y′, y) = |y′ − y|

Thus, given the loss function, we can define the risk as

R (f ) = E(X ,Y ) [` (f (X) ,Y )] (19)

Although we cannot see the expected risk, we can use the sample toestimate the following

R̂ (f ) = 1n

N∑i=1

` (f (xi) , yi) (20)

36 / 71

Furthermore

For regression problems, some natural choices1 Squared loss: ` (y′, y) = (y′ − y)2

2 Absolute loss: ` (y′, y) = |y′ − y|

Thus, given the loss function, we can define the risk as

R (f ) = E(X ,Y ) [` (f (X) ,Y )] (19)

Although we cannot see the expected risk, we can use the sample toestimate the following

R̂ (f ) = 1n

N∑i=1

` (f (xi) , yi) (20)

36 / 71

Furthermore

For regression problems, some natural choices1 Squared loss: ` (y′, y) = (y′ − y)2

2 Absolute loss: ` (y′, y) = |y′ − y|

Thus, given the loss function, we can define the risk as

R (f ) = E(X ,Y ) [` (f (X) ,Y )] (19)

Although we cannot see the expected risk, we can use the sample toestimate the following

R̂ (f ) = 1n

N∑i=1

` (f (xi) , yi) (20)

36 / 71

Furthermore

For regression problems, some natural choices1 Squared loss: ` (y′, y) = (y′ − y)2

2 Absolute loss: ` (y′, y) = |y′ − y|

Thus, given the loss function, we can define the risk as

R (f ) = E(X ,Y ) [` (f (X) ,Y )] (19)

Although we cannot see the expected risk, we can use the sample toestimate the following

R̂ (f ) = 1n

N∑i=1

` (f (xi) , yi) (20)

36 / 71

Thus

Risk MinimizationMinimizing the empirical risk over a fixed class F ⊆ YX of functions leadsto a very important learning rule, named empirical risk minimization(ERM):

f̂N = arg minf∈F

R̂ (f ) (21)

If we knew the distribution P and do not restrict ourselves F , thebest function would be

f ∗ = arg minf

R (f ) (22)

37 / 71

Thus

Risk MinimizationMinimizing the empirical risk over a fixed class F ⊆ YX of functions leadsto a very important learning rule, named empirical risk minimization(ERM):

f̂N = arg minf∈F

R̂ (f ) (21)

If we knew the distribution P and do not restrict ourselves F , thebest function would be

f ∗ = arg minf

R (f ) (22)

37 / 71

Thus

For classification with 0-1 loss, f ∗ is called the Bayesian Classifier

f ∗ = arg miny∈Y

P (Y = y|X = x) (23)

38 / 71

The New Risk Function

Now, we have the following loss functions for two classesλ12 is the loss value if the sample is in class 1, but the functionclassified as in class 2.λ21 is the loss value if the sample is in class 1, but the functionclassified as in class 2.

We can generate the new risk function

r =λ12Pe (x, ω1) + λ21Pe (x, ω2)

=λ12

ˆR2

p (x, ω1) dx + λ21

ˆR1

p (x, ω2) dx

=λ12

ˆR2

p (x|ω1) P (ω1) dx + λ21

ˆR1

p (x|ω2) P (ω2) dx

39 / 71

The New Risk Function

Now, we have the following loss functions for two classesλ12 is the loss value if the sample is in class 1, but the functionclassified as in class 2.λ21 is the loss value if the sample is in class 1, but the functionclassified as in class 2.

We can generate the new risk function

r =λ12Pe (x, ω1) + λ21Pe (x, ω2)

=λ12

ˆR2

p (x, ω1) dx + λ21

ˆR1

p (x, ω2) dx

=λ12

ˆR2

p (x|ω1) P (ω1) dx + λ21

ˆR1

p (x|ω2) P (ω2) dx

39 / 71

The New Risk Function

Now, we have the following loss functions for two classesλ12 is the loss value if the sample is in class 1, but the functionclassified as in class 2.λ21 is the loss value if the sample is in class 1, but the functionclassified as in class 2.

We can generate the new risk function

r =λ12Pe (x, ω1) + λ21Pe (x, ω2)

=λ12

ˆR2

p (x, ω1) dx + λ21

ˆR1

p (x, ω2) dx

=λ12

ˆR2

p (x|ω1) P (ω1) dx + λ21

ˆR1

p (x|ω2) P (ω2) dx

39 / 71

The New Risk Function

Now, we have the following loss functions for two classesλ12 is the loss value if the sample is in class 1, but the functionclassified as in class 2.λ21 is the loss value if the sample is in class 1, but the functionclassified as in class 2.

We can generate the new risk function

r =λ12Pe (x, ω1) + λ21Pe (x, ω2)

=λ12

ˆR2

p (x, ω1) dx + λ21

ˆR1

p (x, ω2) dx

=λ12

ˆR2

p (x|ω1) P (ω1) dx + λ21

ˆR1

p (x|ω2) P (ω2) dx

39 / 71

The New Risk Function

Now, we have the following loss functions for two classesλ12 is the loss value if the sample is in class 1, but the functionclassified as in class 2.λ21 is the loss value if the sample is in class 1, but the functionclassified as in class 2.

We can generate the new risk function

r =λ12Pe (x, ω1) + λ21Pe (x, ω2)

=λ12

ˆR2

p (x, ω1) dx + λ21

ˆR1

p (x, ω2) dx

=λ12

ˆR2

p (x|ω1) P (ω1) dx + λ21

ˆR1

p (x|ω2) P (ω2) dx

39 / 71

The New Risk Function

The new risk function to be minimized

r = λ12P (ω1)ˆ

R2

p (x|ω1) dx + λ21P (ω2)ˆ

R1

p (x|ω2) dx (24)

Where the λ terms work as a weight factor for each errorThus, we could have something like λ12 > λ21!!!

ThenErrors due to the assignment of patterns originating from class 1 to class 2will have a larger effect on the cost function than the errors associatedwith the second term in the summation.

40 / 71

The New Risk Function

The new risk function to be minimized

r = λ12P (ω1)ˆ

R2

p (x|ω1) dx + λ21P (ω2)ˆ

R1

p (x|ω2) dx (24)

Where the λ terms work as a weight factor for each errorThus, we could have something like λ12 > λ21!!!

ThenErrors due to the assignment of patterns originating from class 1 to class 2will have a larger effect on the cost function than the errors associatedwith the second term in the summation.

40 / 71

The New Risk Function

The new risk function to be minimized

r = λ12P (ω1)ˆ

R2

p (x|ω1) dx + λ21P (ω2)ˆ

R1

p (x|ω2) dx (24)

Where the λ terms work as a weight factor for each errorThus, we could have something like λ12 > λ21!!!

ThenErrors due to the assignment of patterns originating from class 1 to class 2will have a larger effect on the cost function than the errors associatedwith the second term in the summation.

40 / 71

Now, Consider a M class problem

We haveRj , j = 1, ...,M be the regions where the classes ωj live.

Now, think the following errorAssume x belong to class ωk , but lies in Ri with i 6= k −→ the vector ismisclassified.

Now, for this error we associate the term λki (loss)With that we have a loss matrix L where (k, i) location corresponds tosuch a loss.

41 / 71

Now, Consider a M class problem

We haveRj , j = 1, ...,M be the regions where the classes ωj live.

Now, think the following errorAssume x belong to class ωk , but lies in Ri with i 6= k −→ the vector ismisclassified.

Now, for this error we associate the term λki (loss)With that we have a loss matrix L where (k, i) location corresponds tosuch a loss.

41 / 71

Now, Consider a M class problem

We haveRj , j = 1, ...,M be the regions where the classes ωj live.

Now, think the following errorAssume x belong to class ωk , but lies in Ri with i 6= k −→ the vector ismisclassified.

Now, for this error we associate the term λki (loss)With that we have a loss matrix L where (k, i) location corresponds tosuch a loss.

41 / 71

Thus, we have a general loss associated to each class ωi

Definition

rk =M∑

i=1λki

ˆRi

p (x|ωk) dx (25)

We want to minimize the global risk

r =M∑

k=1rkP (ωk) =

M∑i=1

ˆRi

M∑k=1

λkip (x|ωk) P (ωk) dx (26)

For thisWe want to select the set of partition regions Rj .

42 / 71

Thus, we have a general loss associated to each class ωi

Definition

rk =M∑

i=1λki

ˆRi

p (x|ωk) dx (25)

We want to minimize the global risk

r =M∑

k=1rkP (ωk) =

M∑i=1

ˆRi

M∑k=1

λkip (x|ωk) P (ωk) dx (26)

For thisWe want to select the set of partition regions Rj .

42 / 71

Thus, we have a general loss associated to each class ωi

Definition

rk =M∑

i=1λki

ˆRi

p (x|ωk) dx (25)

We want to minimize the global risk

r =M∑

k=1rkP (ωk) =

M∑i=1

ˆRi

M∑k=1

λkip (x|ωk) P (ωk) dx (26)

For thisWe want to select the set of partition regions Rj .

42 / 71

How do we do that?We minimize each integral

ˆRi

M∑k=1

λkip (x|ωk) P (ωk) dx (27)

We need to minimizeM∑

k=1λkip (x|ωk) P (ωk) (28)

We can do the followingIf x ∈ Ri , then

li =M∑

k=1λkip (x|ωk) P (ωk) < lj =

M∑k=1

λkjp (x|ωk) P (ωk) (29)

for all j 6= i.43 / 71

How do we do that?We minimize each integral

ˆRi

M∑k=1

λkip (x|ωk) P (ωk) dx (27)

We need to minimizeM∑

k=1λkip (x|ωk) P (ωk) (28)

We can do the followingIf x ∈ Ri , then

li =M∑

k=1λkip (x|ωk) P (ωk) < lj =

M∑k=1

λkjp (x|ωk) P (ωk) (29)

for all j 6= i.43 / 71

How do we do that?We minimize each integral

ˆRi

M∑k=1

λkip (x|ωk) P (ωk) dx (27)

We need to minimizeM∑

k=1λkip (x|ωk) P (ωk) (28)

We can do the followingIf x ∈ Ri , then

li =M∑

k=1λkip (x|ωk) P (ωk) < lj =

M∑k=1

λkjp (x|ωk) P (ωk) (29)

for all j 6= i.43 / 71

Remarks

When we have Kronecker’s delta

δki ={

0 if k 6= i1 if k = i

(30)

We can do the following

λki = 1− δki (31)

We finish withM∑

k=1,k 6=ip (x|ωk) P (ωk) <

M∑k=1,k 6=j

p (x|ωk) P (ωk) (32)

44 / 71

Remarks

When we have Kronecker’s delta

δki ={

0 if k 6= i1 if k = i

(30)

We can do the following

λki = 1− δki (31)

We finish withM∑

k=1,k 6=ip (x|ωk) P (ωk) <

M∑k=1,k 6=j

p (x|ωk) P (ωk) (32)

44 / 71

Remarks

When we have Kronecker’s delta

δki ={

0 if k 6= i1 if k = i

(30)

We can do the following

λki = 1− δki (31)

We finish withM∑

k=1,k 6=ip (x|ωk) P (ωk) <

M∑k=1,k 6=j

p (x|ωk) P (ωk) (32)

44 / 71

Then

We have that

p (x|ωj) P (ωj) < p (x|ωi) P (ωi) (33)

for all j 6= i.

45 / 71

Special Case: The Two-Class Case

We get

l1 =λ11p (x|ω1) P (ω1) + λ21p (x|ω2) P (ω2)l2 =λ12p (x|ω1) P (ω1) + λ22p (x|ω2) P (ω2)

We assign x to ω1 if l1 < l2(λ21 − λ22) p (x|ω2) P (ω2) < (λ12 − λ11) p (x|ω1) P (ω1) (34)

If we assume that λii < λij - Correct Decisions are penalized muchless than wrong ones

x ∈ ω1 (ω2) if l12 ≡p (x|ω1)p (x|ω2) > (<) P (ω2)

P (ω1)(λ21 − λ22)(λ12 − λ11) (35)

46 / 71

Special Case: The Two-Class Case

We get

l1 =λ11p (x|ω1) P (ω1) + λ21p (x|ω2) P (ω2)l2 =λ12p (x|ω1) P (ω1) + λ22p (x|ω2) P (ω2)

We assign x to ω1 if l1 < l2(λ21 − λ22) p (x|ω2) P (ω2) < (λ12 − λ11) p (x|ω1) P (ω1) (34)

If we assume that λii < λij - Correct Decisions are penalized muchless than wrong ones

x ∈ ω1 (ω2) if l12 ≡p (x|ω1)p (x|ω2) > (<) P (ω2)

P (ω1)(λ21 − λ22)(λ12 − λ11) (35)

46 / 71

Special Case: The Two-Class Case

We get

l1 =λ11p (x|ω1) P (ω1) + λ21p (x|ω2) P (ω2)l2 =λ12p (x|ω1) P (ω1) + λ22p (x|ω2) P (ω2)

We assign x to ω1 if l1 < l2(λ21 − λ22) p (x|ω2) P (ω2) < (λ12 − λ11) p (x|ω1) P (ω1) (34)

If we assume that λii < λij - Correct Decisions are penalized muchless than wrong ones

x ∈ ω1 (ω2) if l12 ≡p (x|ω1)p (x|ω2) > (<) P (ω2)

P (ω1)(λ21 − λ22)(λ12 − λ11) (35)

46 / 71

Special Case: The Two-Class Case

Definitionl12 is known as the likelihood ratio and the preceding test as the likelihoodratio test.

47 / 71

Outline

1 IntroductionSupervised LearningNaive Bayes

The Naive Bayes ModelThe Multi-Class Case

Minimizing the Average Risk

2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian

48 / 71

Decision Surface

Because the R1 and R2 are contiguousThe separating surface between both of them is described by

P (ω1|x)− P (ω2|x) = 0 (36)

Thus, we define the decision function as

g12 (x) = P (ω1|x)− P (ω2|x) = 0 (37)

49 / 71

Decision Surface

Because the R1 and R2 are contiguousThe separating surface between both of them is described by

P (ω1|x)− P (ω2|x) = 0 (36)

Thus, we define the decision function as

g12 (x) = P (ω1|x)− P (ω2|x) = 0 (37)

49 / 71

Which decision function for the Naive Bayes

A single number in this case

50 / 71

In general

FirstInstead of working with probabilities, we work with an equivalent functionof them gi (x) = f (P (ωi |x)).

Classic Example the Monotonically increasingf (P (ωi |x)) = ln P (ωi |x).

The decision test is nowclassify x in ωi if gi (x) > gj (x) ∀j 6= i.

The decision surfaces, separating contiguous regions, are described bygij (x) = gi (x)− gj (x) i, j = 1, 2, ...,M i 6= j

51 / 71

In general

FirstInstead of working with probabilities, we work with an equivalent functionof them gi (x) = f (P (ωi |x)).

Classic Example the Monotonically increasingf (P (ωi |x)) = ln P (ωi |x).

The decision test is nowclassify x in ωi if gi (x) > gj (x) ∀j 6= i.

The decision surfaces, separating contiguous regions, are described bygij (x) = gi (x)− gj (x) i, j = 1, 2, ...,M i 6= j

51 / 71

In general

FirstInstead of working with probabilities, we work with an equivalent functionof them gi (x) = f (P (ωi |x)).

Classic Example the Monotonically increasingf (P (ωi |x)) = ln P (ωi |x).

The decision test is nowclassify x in ωi if gi (x) > gj (x) ∀j 6= i.

The decision surfaces, separating contiguous regions, are described bygij (x) = gi (x)− gj (x) i, j = 1, 2, ...,M i 6= j

51 / 71

In general

FirstInstead of working with probabilities, we work with an equivalent functionof them gi (x) = f (P (ωi |x)).

Classic Example the Monotonically increasingf (P (ωi |x)) = ln P (ωi |x).

The decision test is nowclassify x in ωi if gi (x) > gj (x) ∀j 6= i.

The decision surfaces, separating contiguous regions, are described bygij (x) = gi (x)− gj (x) i, j = 1, 2, ...,M i 6= j

51 / 71

Outline

1 IntroductionSupervised LearningNaive Bayes

The Naive Bayes ModelThe Multi-Class Case

Minimizing the Average Risk

2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian

52 / 71

Gaussian Distribution

We can use the Gaussian distribution

p (x|ωi) = 1(2π)l/2 |Σi |

1/2exp

{−1

2 (x − µi)T Σ−1i (x − µi)

}(38)

Example

53 / 71

Gaussian DistributionWe can use the Gaussian distribution

p (x|ωi) = 1(2π)l/2 |Σi |

1/2exp

{−1

2 (x − µi)T Σ−1i (x − µi)

}(38)

Example

53 / 71

Some Properties

About ΣIt is the covariance matrix between variables.

ThusIt is positive definite.Symmetric.The inverse exists.

54 / 71

Some Properties

About ΣIt is the covariance matrix between variables.

ThusIt is positive definite.Symmetric.The inverse exists.

54 / 71

Outline

1 IntroductionSupervised LearningNaive Bayes

The Naive Bayes ModelThe Multi-Class Case

Minimizing the Average Risk

2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian

55 / 71

Influence of the Covariance Σ

Look at the following Covariance

Σ =[

1 00 1

]

It simple the unit Gaussian with mean µ

56 / 71

Influence of the Covariance Σ

Look at the following Covariance

Σ =[

1 00 1

]

It simple the unit Gaussian with mean µ

56 / 71

The Covariance Σ as a Rotation

Look at the following Covariance

Σ =[

4 00 1

]

Actually, it flatten the circle through the x − axis

57 / 71

The Covariance Σ as a Rotation

Look at the following Covariance

Σ =[

4 00 1

]

Actually, it flatten the circle through the x − axis

57 / 71

Influence of the Covariance Σ

Look at the following Covariance

Σa = RΣbRT with R =[

cos θ − sin θ− sin θ cos θ

]

It allows to rotate the axises

58 / 71

Influence of the Covariance Σ

Look at the following Covariance

Σa = RΣbRT with R =[

cos θ − sin θ− sin θ cos θ

]

It allows to rotate the axises

58 / 71

Now For Two Classes

Then, we use the following trick for two Classes i = 1, 2We know that the pdf of correct classification isp (x, ω1) = p (x|ωi) P (ωi)!!!

ThusIt is possible to generate the following decision function:

gi (x) = ln [p (x|ωi) P (ωi)] = ln p (x|ωi) + ln P (ωi) (39)

Thus

gi (x) = −12 (x − µi)T Σ−1

i (x − µi) + ln P (ωi) + ci (40)

59 / 71

Now For Two Classes

Then, we use the following trick for two Classes i = 1, 2We know that the pdf of correct classification isp (x, ω1) = p (x|ωi) P (ωi)!!!

ThusIt is possible to generate the following decision function:

gi (x) = ln [p (x|ωi) P (ωi)] = ln p (x|ωi) + ln P (ωi) (39)

Thus

gi (x) = −12 (x − µi)T Σ−1

i (x − µi) + ln P (ωi) + ci (40)

59 / 71

Now For Two Classes

Then, we use the following trick for two Classes i = 1, 2We know that the pdf of correct classification isp (x, ω1) = p (x|ωi) P (ωi)!!!

ThusIt is possible to generate the following decision function:

gi (x) = ln [p (x|ωi) P (ωi)] = ln p (x|ωi) + ln P (ωi) (39)

Thus

gi (x) = −12 (x − µi)T Σ−1

i (x − µi) + ln P (ωi) + ci (40)

59 / 71

Outline

1 IntroductionSupervised LearningNaive Bayes

The Naive Bayes ModelThe Multi-Class Case

Minimizing the Average Risk

2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian

60 / 71

Given a series of classes ω1, ω2, ..., ωM

We assume for each class ωj

The samples are drawn independetly according to the probability lawp (x|ωj)

We call those samples asi.i.d. — independent identically distributed random variables.

We assume in additionp (x|ωj) has a known parametric form with vector θj of parameters.

61 / 71

Given a series of classes ω1, ω2, ..., ωM

We assume for each class ωj

The samples are drawn independetly according to the probability lawp (x|ωj)

We call those samples asi.i.d. — independent identically distributed random variables.

We assume in additionp (x|ωj) has a known parametric form with vector θj of parameters.

61 / 71

Given a series of classes ω1, ω2, ..., ωM

We assume for each class ωj

The samples are drawn independetly according to the probability lawp (x|ωj)

We call those samples asi.i.d. — independent identically distributed random variables.

We assume in additionp (x|ωj) has a known parametric form with vector θj of parameters.

61 / 71

Given a series of classes ω1, ω2, ..., ωM

For example

p (x|ωj) ∼ N(µj ,Σj

)(41)

In our caseWe will assume that there is no dependence between classes!!!

62 / 71

Given a series of classes ω1, ω2, ..., ωM

For example

p (x|ωj) ∼ N(µj ,Σj

)(41)

In our caseWe will assume that there is no dependence between classes!!!

62 / 71

Now

Suppose that ωj contains n samples x1,x2, ...,xn

p (x1,x2, ...,xn |θj) =n∏

j=1p (xj |θj) (42)

We can see then the function p (x1,x2, ...,xn|θj) as a function of

L (θj) =n∏

j=1p (xj |θj) (43)

63 / 71

Now

Suppose that ωj contains n samples x1,x2, ...,xn

p (x1,x2, ...,xn |θj) =n∏

j=1p (xj |θj) (42)

We can see then the function p (x1,x2, ...,xn|θj) as a function of

L (θj) =n∏

j=1p (xj |θj) (43)

63 / 71

Example

L (θj) = log∏nj=1 p (x j |θj)

64 / 71

Outline

1 IntroductionSupervised LearningNaive Bayes

The Naive Bayes ModelThe Multi-Class Case

Minimizing the Average Risk

2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian

65 / 71

Maximum Likelihood on a Gaussian

Then, using the log!!!

ln L (ωi) = −n2 ln |Σi | −

12

n∑j=1

(xj − µi)T Σ−1i (xj − µi)

+ c2 (44)

We know thatdxT Ax

dx = Ax + AT x, dAxdx = A (45)

Thus, we expand equation44

−n2 ln |Σi | −

12

n∑j=1

[xj

T Σ−1i xj − 2xj

T Σ−1i µi + µi

T Σ−1i µi

]+ c2 (46)

66 / 71

Maximum Likelihood on a Gaussian

Then, using the log!!!

ln L (ωi) = −n2 ln |Σi | −

12

n∑j=1

(xj − µi)T Σ−1i (xj − µi)

+ c2 (44)

We know thatdxT Ax

dx = Ax + AT x, dAxdx = A (45)

Thus, we expand equation44

−n2 ln |Σi | −

12

n∑j=1

[xj

T Σ−1i xj − 2xj

T Σ−1i µi + µi

T Σ−1i µi

]+ c2 (46)

66 / 71

Maximum Likelihood on a Gaussian

Then, using the log!!!

ln L (ωi) = −n2 ln |Σi | −

12

n∑j=1

(xj − µi)T Σ−1i (xj − µi)

+ c2 (44)

We know thatdxT Ax

dx = Ax + AT x, dAxdx = A (45)

Thus, we expand equation44

−n2 ln |Σi | −

12

n∑j=1

[xj

T Σ−1i xj − 2xj

T Σ−1i µi + µi

T Σ−1i µi

]+ c2 (46)

66 / 71

Maximum Likelihood

Then

∂ ln L (ωi)∂µi

=n∑

j=1Σ−1

i (xj − µi) = 0

nΣ−1i

−µi + 1n

n∑j=1

xj

= 0

µ̂i = 1n

n∑j=1

xj

67 / 71

Maximum Likelihood

Then, we derive with respect to Σi

For this we use the following tricks:1 ∂ log|Σ|

∂Σ−1 = − 1|Σ| · |Σ| (Σ)T = −Σ

2 ∂Tr [AB]∂A = ∂Tr [BA]

∂A = BT

3 Trace(of a number)=the number4 Tr(AT B) = Tr

(BAT

)Thus

f (Σi) = −n2 ln |ΣI | −

12

n∑j=1

[(xj − µi)T Σ−1

i (xj − µi)]

+ c1 (47)

68 / 71

Maximum Likelihood

Thus

f (Σi) = −n2 ln |Σi |−

12

n∑j=1

[Trace

{(xj − µi)T Σ−1

i (xj − µi)}]

+c1 (48)

Tricks!!!

f (Σi) = −n2 ln |Σi |−

12

n∑j=1

[Trace

{Σ−1

i (xj − µi) (xj − µi)T}]

+c1 (49)

69 / 71

Maximum Likelihood

Thus

f (Σi) = −n2 ln |Σi |−

12

n∑j=1

[Trace

{(xj − µi)T Σ−1

i (xj − µi)}]

+c1 (48)

Tricks!!!

f (Σi) = −n2 ln |Σi |−

12

n∑j=1

[Trace

{Σ−1

i (xj − µi) (xj − µi)T}]

+c1 (49)

69 / 71

Maximum Likelihood

Derivative with respect to Σ∂f (Σi)∂Σi

= n2 Σi −

12

n∑j=1

[(xj − µi) (xj − µi)T

]T(50)

Thus, when making it equal to zero

Σ̂i = 1n

n∑j=1

(xj − µi) (xj − µi)T (51)

70 / 71

Maximum Likelihood

Derivative with respect to Σ∂f (Σi)∂Σi

= n2 Σi −

12

n∑j=1

[(xj − µi) (xj − µi)T

]T(50)

Thus, when making it equal to zero

Σ̂i = 1n

n∑j=1

(xj − µi) (xj − µi)T (51)

70 / 71

Exercises

Duda and HartChapter 3

3.1, 3.2, 3.3, 3.13

TheodoridisChapter 2

2.5, 2.7, 2.10, 2.12, 2.14, 2.17

71 / 71

Exercises

Duda and HartChapter 3

3.1, 3.2, 3.3, 3.13

TheodoridisChapter 2

2.5, 2.7, 2.10, 2.12, 2.14, 2.17

71 / 71

Exercises

Duda and HartChapter 3

3.1, 3.2, 3.3, 3.13

TheodoridisChapter 2

2.5, 2.7, 2.10, 2.12, 2.14, 2.17

71 / 71