Machine Learning - Introduction to Bayesian Classification

Transcript
Page 1: Machine Learning - Introduction to Bayesian Classification

Machine Learning for Data Mining: Introduction to Bayesian Classifiers

Andres Mendez-Vazquez

August 3, 2015


Page 2: Machine Learning - Introduction to Bayesian Classification

Outline

1 Introduction
    Supervised Learning
    Naive Bayes
        The Naive Bayes Model
        The Multi-Class Case
    Minimizing the Average Risk

2 Discriminant Functions and Decision Surfaces
    Introduction
    Gaussian Distribution
    Influence of the Covariance Σ
    Maximum Likelihood Principle
    Maximum Likelihood on a Gaussian


Page 3: Machine Learning - Introduction to Bayesian Classification

Classification problem

Training Data
Samples of the form (d, h(d))

d: the data objects to classify (inputs).

h(d): the correct class label for d, with h(d) ∈ {1, . . . , K}.

Page 7: Machine Learning - Introduction to Bayesian Classification

Classification Problem

Goal
Given a new data object dnew, provide h(dnew).

The machinery in general looks like this:

[Diagram: INPUT → Supervised Learning → OUTPUT, with the training info (the desired/target output) fed to the learner.]

Page 9: Machine Learning - Introduction to Bayesian Classification

Naive Bayes Model

Task for two classes
Let ω1, ω2 be the two classes to which our samples belong.

There is a prior probability of belonging to each class
P(ω1) for class 1.
P(ω2) for class 2.

The rule for classification is the following one

$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)} \quad (1)$$

Remark: Bayes to the next level.
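
To make Equation (1) concrete, here is a minimal sketch that turns likelihoods and priors into posteriors; the two Gaussian densities and the prior values are illustrative assumptions, not part of the slides.

```python
from scipy.stats import norm

# Assumed one-dimensional class-conditional densities and priors (illustrative only).
likelihood = {1: norm(loc=0.0, scale=1.0).pdf,   # p(x | w1)
              2: norm(loc=2.0, scale=1.0).pdf}   # p(x | w2)
prior = {1: 0.5, 2: 0.5}                          # P(w1), P(w2)

def posterior(x):
    """Bayes rule: P(wi | x) = p(x | wi) P(wi) / P(x)."""
    joint = {i: likelihood[i](x) * prior[i] for i in (1, 2)}
    evidence = sum(joint.values())                # P(x) = sum_i p(x | wi) P(wi)
    return {i: joint[i] / evidence for i in (1, 2)}

print(posterior(0.8))                             # the two posteriors sum to one
```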

Page 13: Machine Learning - Introduction to Bayesian Classification

In Informal English

We have that

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior information}}{\text{evidence}} \quad (2)$$

Basically
One: if we can observe x,
Two: we can convert the prior information into posterior information.

Page 16: Machine Learning - Introduction to Bayesian Classification

We have the following terms...

Likelihood
We call p(x|ωi) the likelihood of ωi given x:

Given a category ωi: if p(x|ωi) is "large", then ωi is the "likely" class of x.

Prior Probability
It is the known probability of a given class.

Remark: when we lack information about the class, we tend to use the uniform distribution.

However: we can use other tricks for it.

Evidence
The evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one.

Page 22: Machine Learning - Introduction to Bayesian Classification

The most important term in all this

The factor

$$\text{likelihood} \times \text{prior information} \quad (3)$$

Page 23: Machine Learning - Introduction to Bayesian Classification

Example

We have the likelihoods of the two classes.

Page 24: Machine Learning - Introduction to Bayesian Classification

Example

We have the posteriors of the two classes when P(ω1) = 2/3 and P(ω2) = 1/3.

Page 25: Machine Learning - Introduction to Bayesian Classification

Naive Bayes Model

In the case of two classes

$$P(x) = \sum_{i=1}^{2} p(x, \omega_i) = \sum_{i=1}^{2} p(x \mid \omega_i)\, P(\omega_i) \quad (4)$$

Page 26: Machine Learning - Introduction to Bayesian Classification

Error in this rule

We have that

$$P(\mathrm{error} \mid x) =
\begin{cases}
P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\
P(\omega_2 \mid x) & \text{if we decide } \omega_1
\end{cases} \quad (5)$$

Thus, we have that

$$P(\mathrm{error}) = \int_{-\infty}^{\infty} P(\mathrm{error}, x)\, dx = \int_{-\infty}^{\infty} P(\mathrm{error} \mid x)\, p(x)\, dx \quad (6)$$

Page 28: Machine Learning - Introduction to Bayesian Classification

Classification Rule

Thus, we have the Bayes Classification Rule
1 If P(ω1|x) > P(ω2|x), x is classified to ω1.
2 If P(ω1|x) < P(ω2|x), x is classified to ω2.

Page 30: Machine Learning - Introduction to Bayesian Classification

What if we remove the normalization factor?

Remember

$$P(\omega_1 \mid x) + P(\omega_2 \mid x) = 1 \quad (7)$$

We are able to obtain the new Bayes Classification Rule
1 If P(x|ω1) P(ω1) > P(x|ω2) P(ω2), x is classified to ω1.
2 If P(x|ω1) P(ω1) < P(x|ω2) P(ω2), x is classified to ω2.
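
A small sketch of this unnormalized rule: since P(x) is common to both posteriors, comparing P(x|ωi) P(ωi) directly gives the same decision. The densities and priors below are assumed for illustration.

```python
from scipy.stats import norm

P1, P2 = 0.6, 0.4                    # assumed priors P(w1), P(w2)
f1 = norm(0.0, 1.0).pdf              # assumed p(x | w1)
f2 = norm(2.0, 1.0).pdf              # assumed p(x | w2)

def classify(x):
    # P(x) cancels in the comparison, so only the products are needed.
    return 1 if f1(x) * P1 > f2(x) * P2 else 2

print([classify(x) for x in (-1.0, 0.5, 1.5, 3.0)])
```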

Page 33: Machine Learning - Introduction to Bayesian Classification

We have several cases

If for some x we have P(x|ω1) = P(x|ω2)
The final decision relies completely on the prior probabilities.

On the other hand, if P(ω1) = P(ω2), each "state" is equally probable
In this case the decision is based entirely on the likelihoods P(x|ωi).

Page 35: Machine Learning - Introduction to Bayesian Classification

How the Rule looks like
If P(ω1) = P(ω2), the rule depends only on the term p(x|ωi).

Page 36: Machine Learning - Introduction to Bayesian Classification

The Error in the Second Case of Naive Bayes

Error in equiprobable classes

$$P(\mathrm{error}) = \frac{1}{2}\int_{-\infty}^{x_0} p(x \mid \omega_2)\, dx + \frac{1}{2}\int_{x_0}^{\infty} p(x \mid \omega_1)\, dx \quad (8)$$

Remark: P(ω1) = P(ω2) = 1/2.
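
To see Equation (8) with numbers, the sketch below evaluates the two error integrals for two assumed unit-variance Gaussian likelihoods with equal priors; x0 is taken as the midpoint where the scaled densities cross.

```python
from scipy.stats import norm

f1, f2 = norm(0.0, 1.0), norm(2.0, 1.0)   # assumed p(x|w1), p(x|w2)
x0 = 1.0                                  # equal priors and variances -> midpoint of the means

# P(error) = 1/2 * int_{-inf}^{x0} p(x|w2) dx + 1/2 * int_{x0}^{inf} p(x|w1) dx
p_error = 0.5 * f2.cdf(x0) + 0.5 * (1.0 - f1.cdf(x0))
print(p_error)                            # about 0.159 for these densities
```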

Page 37: Machine Learning - Introduction to Bayesian Classification

What do we want to prove?

Something Notable
The Bayesian classifier is optimal with respect to minimizing the classification error probability.

Page 38: Machine Learning - Introduction to Bayesian Classification

Proof
Step 1
Let R1 be the region of the feature space in which we decide in favor of ω1.
Let R2 be the region of the feature space in which we decide in favor of ω2.

Step 2

$$P_e = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2) \quad (9)$$

Thus

$$\begin{aligned}
P_e &= P(x \in R_2 \mid \omega_1)\, P(\omega_1) + P(x \in R_1 \mid \omega_2)\, P(\omega_2) \\
    &= P(\omega_1)\int_{R_2} p(x \mid \omega_1)\, dx + P(\omega_2)\int_{R_1} p(x \mid \omega_2)\, dx
\end{aligned}$$

Page 42: Machine Learning - Introduction to Bayesian Classification

Proof

It is more

$$P_e = P(\omega_1)\int_{R_2} \frac{p(\omega_1, x)}{P(\omega_1)}\, dx + P(\omega_2)\int_{R_1} \frac{p(\omega_2, x)}{P(\omega_2)}\, dx \quad (10)$$

Finally

$$P_e = \int_{R_2} p(\omega_1 \mid x)\, p(x)\, dx + \int_{R_1} p(\omega_2 \mid x)\, p(x)\, dx \quad (11)$$

Now, we choose the Bayes Classification Rule
R1 : P(ω1|x) > P(ω2|x)
R2 : P(ω2|x) > P(ω1|x)

Page 45: Machine Learning - Introduction to Bayesian Classification

Proof

Thus

$$P(\omega_1) = \int_{R_1} p(\omega_1 \mid x)\, p(x)\, dx + \int_{R_2} p(\omega_1 \mid x)\, p(x)\, dx \quad (12)$$

Now, we have...

$$P(\omega_1) - \int_{R_1} p(\omega_1 \mid x)\, p(x)\, dx = \int_{R_2} p(\omega_1 \mid x)\, p(x)\, dx \quad (13)$$

Then

$$P_e = P(\omega_1) - \int_{R_1} p(\omega_1 \mid x)\, p(x)\, dx + \int_{R_1} p(\omega_2 \mid x)\, p(x)\, dx \quad (14)$$

Page 48: Machine Learning - Introduction to Bayesian Classification

Graphically P(ω1): Thanks Edith 2013 Class!!!

[Figure: P(ω1) shown in red.]

Page 49: Machine Learning - Introduction to Bayesian Classification

Thus we have

$$\int_{R_1} p(\omega_1 \mid x)\, p(x)\, dx = \int_{R_1} p(\omega_1, x)\, dx = P_{R_1}(\omega_1)$$

Page 50: Machine Learning - Introduction to Bayesian Classification

Finally

Finally

$$P_e = P(\omega_1) - \int_{R_1} \left[p(\omega_1 \mid x) - p(\omega_2 \mid x)\right] p(x)\, dx \quad (15)$$

Thus, we have

$$P_e = P(\omega_1) - \left[\int_{R_1} p(\omega_1 \mid x)\, p(x)\, dx - \int_{R_1} p(\omega_2 \mid x)\, p(x)\, dx\right]$$

Page 52: Machine Learning - Introduction to Bayesian Classification

Pe for a non-optimal rule

A great idea, Edith!!!

Page 53: Machine Learning - Introduction to Bayesian Classification

Which decision function for minimizing the error

A single number in this case


Page 54: Machine Learning - Introduction to Bayesian Classification

Error is minimized by the Bayesian Naive Rule

Thus
The probability of error is minimized by choosing the regions of space such that:
R1 : P(ω1|x) > P(ω2|x)
R2 : P(ω2|x) > P(ω1|x)

Page 57: Machine Learning - Introduction to Bayesian Classification

Pe for an optimal rule

A great idea Edith!!!


Page 58: Machine Learning - Introduction to Bayesian Classification

For M classes ω1, ω2, ..., ωM

We say that vector x is in ωi if

$$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \forall j \neq i \quad (16)$$

Something Notable
It turns out that such a choice also minimizes the classification error probability.

Page 61: Machine Learning - Introduction to Bayesian Classification

Minimizing the Risk

Something Notable
The classification error probability is not always the best criterion to be adopted for minimization.
It gives all errors the same importance.

However
Certain errors are more important than others.

For example
Really serious: a doctor makes a wrong decision and a malignant tumor gets classified as benign.
Not so serious: a benign tumor gets classified as malignant.

Page 65: Machine Learning - Introduction to Bayesian Classification

It is based on the following idea

In order to measure the predictive performance of a function f : X → Y
We use a loss function

$$\ell : Y \times Y \to \mathbb{R}^{+} \quad (17)$$

A non-negative function that quantifies how bad the prediction f(x) is when the true label is y.

Thus, we can say that
ℓ(f(x), y) is the loss incurred by f on the pair (x, y).

Page 68: Machine Learning - Introduction to Bayesian Classification

Example

In classification
In the classification case, binary or otherwise, a natural loss function is the 0-1 loss, where y′ = f(x):

$$\ell(y', y) = \mathbb{1}\left[y' \neq y\right] \quad (18)$$

Page 70: Machine Learning - Introduction to Bayesian Classification

Furthermore

For regression problems, some natural choices
1 Squared loss: ℓ(y′, y) = (y′ − y)².
2 Absolute loss: ℓ(y′, y) = |y′ − y|.

Thus, given the loss function, we can define the risk as

$$R(f) = \mathbb{E}_{(X,Y)}\left[\ell(f(X), Y)\right] \quad (19)$$

Although we cannot see the expected risk, we can use the sample to estimate the following

$$\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i) \quad (20)$$
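
A minimal sketch of the 0-1 loss of Equation (18) and the empirical risk of Equation (20); the toy samples and the threshold classifier are made-up assumptions.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    # l(y', y) = 1[y' != y]
    return float(y_pred != y_true)

def empirical_risk(f, xs, ys, loss=zero_one_loss):
    # R_hat(f) = (1/n) sum_i loss(f(x_i), y_i)
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)

# Toy samples and a simple threshold classifier (assumptions, not from the slides).
xs = np.array([-1.5, -0.2, 0.3, 1.1, 2.4])
ys = np.array([1, 1, 2, 2, 2])
f = lambda x: 1 if x < 0.0 else 2
print(empirical_risk(f, xs, ys))          # fraction of misclassified samples
```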

Page 74: Machine Learning - Introduction to Bayesian Classification

Thus

Risk Minimization
Minimizing the empirical risk over a fixed class F ⊆ Y^X of functions leads to a very important learning rule, named empirical risk minimization (ERM):

$$\hat{f}_N = \arg\min_{f \in F} \hat{R}(f) \quad (21)$$

If we knew the distribution P and did not restrict ourselves to F, the best function would be

$$f^{*} = \arg\min_{f} R(f) \quad (22)$$

Page 76: Machine Learning - Introduction to Bayesian Classification

Thus

For classification with the 0-1 loss, f∗ is called the Bayesian Classifier

$$f^{*}(x) = \arg\max_{y \in Y} P(Y = y \mid X = x) \quad (23)$$

Page 77: Machine Learning - Introduction to Bayesian Classification

The New Risk Function

Now, we have the following loss values for two classes
λ12 is the loss value if the sample is in class 1, but the function classifies it as class 2.
λ21 is the loss value if the sample is in class 2, but the function classifies it as class 1.

We can generate the new risk function

$$\begin{aligned}
r &= \lambda_{12}\, P_e(x, \omega_1) + \lambda_{21}\, P_e(x, \omega_2) \\
  &= \lambda_{12} \int_{R_2} p(x, \omega_1)\, dx + \lambda_{21} \int_{R_1} p(x, \omega_2)\, dx \\
  &= \lambda_{12} \int_{R_2} p(x \mid \omega_1)\, P(\omega_1)\, dx + \lambda_{21} \int_{R_1} p(x \mid \omega_2)\, P(\omega_2)\, dx
\end{aligned}$$

Page 82: Machine Learning - Introduction to Bayesian Classification

The New Risk Function

The new risk function to be minimized

$$r = \lambda_{12}\, P(\omega_1)\int_{R_2} p(x \mid \omega_1)\, dx + \lambda_{21}\, P(\omega_2)\int_{R_1} p(x \mid \omega_2)\, dx \quad (24)$$

Where the λ terms work as weight factors for each error
Thus, we could have something like λ12 > λ21!!!

Then
Errors due to the assignment of patterns originating from class 1 to class 2 will have a larger effect on the cost function than the errors associated with the second term in the summation.
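
A sketch of the decision implied by Equation (24): the λ factors only shift where the boundary sits. The loss values, priors and Gaussian likelihoods below are assumptions for illustration.

```python
from scipy.stats import norm

lam12, lam21 = 5.0, 1.0                            # assumed losses: class-1 errors cost 5x more
P1, P2 = 0.5, 0.5                                  # assumed priors
f1, f2 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf    # assumed likelihoods

def decide(x):
    # Assign x to w1 when lam12 * p(x|w1) P(w1) > lam21 * p(x|w2) P(w2).
    return 1 if lam12 * f1(x) * P1 > lam21 * f2(x) * P2 else 2

print([decide(x) for x in (0.5, 1.0, 1.5, 2.0)])   # the boundary moves toward class 2
```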

Page 85: Machine Learning - Introduction to Bayesian Classification

Now, Consider an M-class Problem

We have
Let Rj, j = 1, ..., M, be the regions where the classes ωj live.

Now, think of the following error
Assume x belongs to class ωk, but lies in Ri with i ≠ k −→ the vector is misclassified.

Now, to this error we associate the term λki (loss)
With that we have a loss matrix L whose (k, i) location corresponds to such a loss.

Page 88: Machine Learning - Introduction to Bayesian Classification

Thus, we have a general loss associated with each class ωk

Definition

$$r_k = \sum_{i=1}^{M} \lambda_{ki} \int_{R_i} p(x \mid \omega_k)\, dx \quad (25)$$

We want to minimize the global risk

$$r = \sum_{k=1}^{M} r_k\, P(\omega_k) = \sum_{i=1}^{M} \int_{R_i} \sum_{k=1}^{M} \lambda_{ki}\, p(x \mid \omega_k)\, P(\omega_k)\, dx \quad (26)$$

For this
We want to select the set of partition regions Rj.

Page 91: Machine Learning - Introduction to Bayesian Classification

How do we do that?
We minimize each integral

$$\int_{R_i} \sum_{k=1}^{M} \lambda_{ki}\, p(x \mid \omega_k)\, P(\omega_k)\, dx \quad (27)$$

We need to minimize

$$\sum_{k=1}^{M} \lambda_{ki}\, p(x \mid \omega_k)\, P(\omega_k) \quad (28)$$

We can do the following
If x ∈ Ri, then

$$l_i = \sum_{k=1}^{M} \lambda_{ki}\, p(x \mid \omega_k)\, P(\omega_k) < l_j = \sum_{k=1}^{M} \lambda_{kj}\, p(x \mid \omega_k)\, P(\omega_k) \quad (29)$$

for all j ≠ i.
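
A sketch of the minimum-expected-loss rule of Equation (29): for each candidate class i it accumulates li = Σk λki p(x|ωk) P(ωk) and assigns x to the smallest value. The three-class densities, priors and loss matrix are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Assumed 3-class setup: densities, priors and loss matrix are illustrative only.
likelihoods = [norm(-2.0, 1.0).pdf, norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf]
priors = np.array([0.3, 0.4, 0.3])
L = np.array([[0.0, 1.0, 2.0],    # L[k, i] = lambda_ki: loss of deciding class i
              [1.0, 0.0, 1.0],    # when the true class is k
              [4.0, 1.0, 0.0]])

def decide(x):
    p = np.array([f(x) for f in likelihoods]) * priors   # p(x|w_k) P(w_k), k = 1..M
    l = L.T @ p                                          # l_i = sum_k lambda_ki p(x|w_k) P(w_k)
    return int(np.argmin(l)) + 1                         # classes numbered 1..M

print([decide(x) for x in (-2.5, 0.1, 1.0, 3.0)])
```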

Page 94: Machine Learning - Introduction to Bayesian Classification

Remarks

When we use Kronecker's delta

$$\delta_{ki} = \begin{cases} 0 & \text{if } k \neq i \\ 1 & \text{if } k = i \end{cases} \quad (30)$$

We can do the following

$$\lambda_{ki} = 1 - \delta_{ki} \quad (31)$$

We finish with

$$\sum_{k=1,\, k \neq i}^{M} p(x \mid \omega_k)\, P(\omega_k) < \sum_{k=1,\, k \neq j}^{M} p(x \mid \omega_k)\, P(\omega_k) \quad (32)$$

Page 97: Machine Learning - Introduction to Bayesian Classification

Then

We have that

$$p(x \mid \omega_j)\, P(\omega_j) < p(x \mid \omega_i)\, P(\omega_i) \quad (33)$$

for all j ≠ i.

Page 98: Machine Learning - Introduction to Bayesian Classification

Special Case: The Two-Class Case

We get

$$\begin{aligned}
l_1 &= \lambda_{11}\, p(x \mid \omega_1)\, P(\omega_1) + \lambda_{21}\, p(x \mid \omega_2)\, P(\omega_2) \\
l_2 &= \lambda_{12}\, p(x \mid \omega_1)\, P(\omega_1) + \lambda_{22}\, p(x \mid \omega_2)\, P(\omega_2)
\end{aligned}$$

We assign x to ω1 if l1 < l2

$$(\lambda_{21} - \lambda_{22})\, p(x \mid \omega_2)\, P(\omega_2) < (\lambda_{12} - \lambda_{11})\, p(x \mid \omega_1)\, P(\omega_1) \quad (34)$$

If we assume that λii < λij (correct decisions are penalized much less than wrong ones)

$$x \in \omega_1\ (\omega_2) \ \text{ if } \ l_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > (<)\ \frac{P(\omega_2)}{P(\omega_1)} \cdot \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}} \quad (35)$$
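
Equation (35) reduces the decision to a likelihood ratio test, sketched below with assumed densities, priors and loss values.

```python
from scipy.stats import norm

f1, f2 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf    # assumed p(x|w1), p(x|w2)
P1, P2 = 0.5, 0.5                                  # assumed priors
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0    # assumed loss values

threshold = (P2 / P1) * (lam21 - lam22) / (lam12 - lam11)

def likelihood_ratio_test(x):
    l12 = f1(x) / f2(x)                            # the likelihood ratio
    return 1 if l12 > threshold else 2

print([likelihood_ratio_test(x) for x in (0.0, 1.0, 1.2, 2.0)])
```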

Page 101: Machine Learning - Introduction to Bayesian Classification

Special Case: The Two-Class Case

Definition
l12 is known as the likelihood ratio, and the preceding test as the likelihood ratio test.

Page 103: Machine Learning - Introduction to Bayesian Classification

Decision Surface

Because the regions R1 and R2 are contiguous
The separating surface between them is described by

$$P(\omega_1 \mid x) - P(\omega_2 \mid x) = 0 \quad (36)$$

Thus, we define the decision function as

$$g_{12}(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x) = 0 \quad (37)$$

Page 105: Machine Learning - Introduction to Bayesian Classification

Which decision function for the Naive Bayes

A single number in this case


Page 106: Machine Learning - Introduction to Bayesian Classification

In general

First
Instead of working with probabilities, we work with an equivalent function of them, gi(x) = f(P(ωi|x)).

Classic example: the monotonically increasing
f(P(ωi|x)) = ln P(ωi|x).

The decision test is now
Classify x in ωi if gi(x) > gj(x) ∀j ≠ i.

The decision surfaces, separating contiguous regions, are described by
gij(x) = gi(x) − gj(x) = 0,  i, j = 1, 2, ..., M,  i ≠ j.

Page 111: Machine Learning - Introduction to Bayesian Classification

Gaussian Distribution

We can use the Gaussian distribution

$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{l/2}\, |\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i)\right\} \quad (38)$$

Example
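
A minimal sketch of Equation (38), evaluating the multivariate Gaussian density directly from µi and Σi; the example mean and covariance are assumptions (scipy.stats.multivariate_normal would give the same values).

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """p(x|wi) for an l-dimensional Gaussian with mean mu and covariance Sigma."""
    l = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    norm_const = (2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])                  # example mean (assumption)
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])             # example covariance (assumption)
print(gaussian_pdf(np.array([1.0, 1.0]), mu, Sigma))
```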

Page 113: Machine Learning - Introduction to Bayesian Classification

Some Properties

About Σ
It is the covariance matrix of the variables.

Thus
It is symmetric.
It is positive definite (assumed here, so that the inverse exists).

Page 116: Machine Learning - Introduction to Bayesian Classification

Influence of the Covariance Σ

Look at the following covariance

$$\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

It is simply the unit Gaussian with mean µ.

Page 118: Machine Learning - Introduction to Bayesian Classification

The Covariance Σ as a Rotation

Look at the following covariance

$$\Sigma = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}$$

Actually, it stretches the circle into an ellipse along the x-axis.

Page 120: Machine Learning - Introduction to Bayesian Classification

Influence of the Covariance Σ

Look at the following covariance

$$\Sigma_a = R\, \Sigma_b\, R^{T} \quad \text{with} \quad R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

It allows us to rotate the axes.
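
The identity Σa = R Σb Rᵀ can be checked numerically; the sketch below rotates the axis-aligned covariance from the previous slide by an assumed angle θ and verifies that the eigenvalues (the shape of the ellipse) do not change.

```python
import numpy as np

theta = np.pi / 6                               # assumed rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Sigma_b = np.array([[4.0, 0.0],
                    [0.0, 1.0]])                # axis-aligned covariance from the previous slide
Sigma_a = R @ Sigma_b @ R.T                     # the same ellipse, with rotated axes
print(Sigma_a)
print(np.linalg.eigvalsh(Sigma_a))              # eigenvalues stay 1 and 4
```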

Page 122: Machine Learning - Introduction to Bayesian Classification

Now For Two Classes

Then, we use the following trick for the two classes i = 1, 2
We know that the pdf of correct classification is p(x, ωi) = p(x|ωi) P(ωi)!!!

Thus
It is possible to generate the following decision function:

$$g_i(x) = \ln\left[p(x \mid \omega_i)\, P(\omega_i)\right] = \ln p(x \mid \omega_i) + \ln P(\omega_i) \quad (39)$$

Thus

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) + \ln P(\omega_i) + c_i \quad (40)$$
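
Equation (40) as code: a quadratic discriminant per class. Since |Σi| differs between classes, its log is kept explicitly, while the common (l/2) ln 2π constant is dropped; the means, covariances and priors are assumptions for illustration.

```python
import numpy as np

def make_discriminant(mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 ln|Sigma| + ln P(w_i);
    the constant -(l/2) ln(2*pi), common to every class, is dropped."""
    Sigma_inv = np.linalg.inv(Sigma)
    log_det = np.log(np.linalg.det(Sigma))
    def g(x):
        diff = x - mu
        return -0.5 * diff @ Sigma_inv @ diff - 0.5 * log_det + np.log(prior)
    return g

# Assumed two-class parameters (illustrative only).
g1 = make_discriminant(np.zeros(2), np.eye(2), 0.5)
g2 = make_discriminant(np.array([2.0, 0.0]), np.eye(2), 0.5)
x = np.array([1.2, 0.3])
print(1 if g1(x) > g2(x) else 2)               # pick the class with the larger discriminant
```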

Page 126: Machine Learning - Introduction to Bayesian Classification

Given a series of classes ω1, ω2, ..., ωM

We assume for each class ωj
The samples are drawn independently according to the probability law p(x|ωj).

We call those samples
i.i.d. — independent identically distributed random variables.

We assume in addition
p(x|ωj) has a known parametric form with a vector θj of parameters.

Page 129: Machine Learning - Introduction to Bayesian Classification

Given a series of classes ω1, ω2, ..., ωM

For example

$$p(x \mid \omega_j) \sim N(\mu_j, \Sigma_j) \quad (41)$$

In our case
We will assume that there is no dependence between classes!!!

Page 131: Machine Learning - Introduction to Bayesian Classification

Now

Suppose that ωj contains n samples x1, x2, ..., xn

$$p(x_1, x_2, \ldots, x_n \mid \theta_j) = \prod_{k=1}^{n} p(x_k \mid \theta_j) \quad (42)$$

We can then see the function p(x1, x2, ..., xn|θj) as a function of θj

$$L(\theta_j) = \prod_{k=1}^{n} p(x_k \mid \theta_j) \quad (43)$$

Page 133: Machine Learning - Introduction to Bayesian Classification

Example

$$L(\theta_j) = \log \prod_{k=1}^{n} p(x_k \mid \theta_j)$$

Page 135: Machine Learning - Introduction to Bayesian Classification

Maximum Likelihood on a Gaussian

Then, using the log!!!

$$\ln L(\omega_i) = -\frac{n}{2} \ln|\Sigma_i| - \frac{1}{2} \sum_{j=1}^{n} (x_j - \mu_i)^{T} \Sigma_i^{-1} (x_j - \mu_i) + c_2 \quad (44)$$

We know that

$$\frac{d\, x^{T} A x}{dx} = Ax + A^{T} x, \qquad \frac{d\, Ax}{dx} = A \quad (45)$$

Thus, we expand equation (44)

$$-\frac{n}{2} \ln|\Sigma_i| - \frac{1}{2} \sum_{j=1}^{n} \left[x_j^{T} \Sigma_i^{-1} x_j - 2 x_j^{T} \Sigma_i^{-1} \mu_i + \mu_i^{T} \Sigma_i^{-1} \mu_i\right] + c_2 \quad (46)$$

Page 138: Machine Learning - Introduction to Bayesian Classification

Maximum Likelihood

Then

$$\begin{aligned}
\frac{\partial \ln L(\omega_i)}{\partial \mu_i} &= \sum_{j=1}^{n} \Sigma_i^{-1} (x_j - \mu_i) = 0 \\
n\, \Sigma_i^{-1} \left[-\mu_i + \frac{1}{n} \sum_{j=1}^{n} x_j\right] &= 0 \\
\hat{\mu}_i &= \frac{1}{n} \sum_{j=1}^{n} x_j
\end{aligned}$$

Page 139: Machine Learning - Introduction to Bayesian Classification

Maximum Likelihood

Then, we derive with respect to Σi
For this we use the following tricks:

1. $$\frac{\partial \log|\Sigma|}{\partial \Sigma^{-1}} = -\frac{1}{|\Sigma|} \cdot |\Sigma| \cdot \Sigma^{T} = -\Sigma$$
2. $$\frac{\partial\, \mathrm{Tr}[AB]}{\partial A} = \frac{\partial\, \mathrm{Tr}[BA]}{\partial A} = B^{T}$$
3. Trace(of a number) = the number.
4. $$\mathrm{Tr}(A^{T}B) = \mathrm{Tr}(BA^{T})$$

Thus

$$f(\Sigma_i) = -\frac{n}{2} \ln|\Sigma_i| - \frac{1}{2} \sum_{j=1}^{n} \left[(x_j - \mu_i)^{T} \Sigma_i^{-1} (x_j - \mu_i)\right] + c_1 \quad (47)$$

Page 140: Machine Learning - Introduction to Bayesian Classification

Maximum Likelihood

Thus

$$f(\Sigma_i) = -\frac{n}{2} \ln|\Sigma_i| - \frac{1}{2} \sum_{j=1}^{n} \left[\mathrm{Trace}\left\{(x_j - \mu_i)^{T} \Sigma_i^{-1} (x_j - \mu_i)\right\}\right] + c_1 \quad (48)$$

Tricks!!!

$$f(\Sigma_i) = -\frac{n}{2} \ln|\Sigma_i| - \frac{1}{2} \sum_{j=1}^{n} \left[\mathrm{Trace}\left\{\Sigma_i^{-1} (x_j - \mu_i)(x_j - \mu_i)^{T}\right\}\right] + c_1 \quad (49)$$

Page 142: Machine Learning - Introduction to Bayesian Classification

Maximum Likelihood

Derivative with respect to Σ

$$\frac{\partial f(\Sigma_i)}{\partial \Sigma_i} = \frac{n}{2} \Sigma_i - \frac{1}{2} \sum_{j=1}^{n} \left[(x_j - \mu_i)(x_j - \mu_i)^{T}\right]^{T} \quad (50)$$

Thus, when setting it equal to zero

$$\hat{\Sigma}_i = \frac{1}{n} \sum_{j=1}^{n} (x_j - \mu_i)(x_j - \mu_i)^{T} \quad (51)$$
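
The estimate Σ̂i of Equation (51), together with µ̂i from the earlier slide, are just the 1/n sample covariance and the sample mean. A minimal sketch on synthetic data drawn from an assumed Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)   # samples x_1, ..., x_n

mu_hat = X.mean(axis=0)                        # mu_hat = (1/n) sum_j x_j
centered = X - mu_hat                          # the MLE plugs mu_hat into (51)
Sigma_hat = centered.T @ centered / len(X)     # Sigma_hat = (1/n) sum_j (x_j - mu)(x_j - mu)^T

print(mu_hat)                                  # close to true_mu
print(Sigma_hat)                               # close to true_Sigma
```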

Page 144: Machine Learning - Introduction to Bayesian Classification

Exercises

Duda and Hart, Chapter 3
3.1, 3.2, 3.3, 3.13

Theodoridis, Chapter 2
2.5, 2.7, 2.10, 2.12, 2.14, 2.17
