
### Transcript of 06 Machine Learning - Naive Bayes

• Machine Learning for Data Mining: Introduction to Bayesian Classifiers

Andres Mendez-Vazquez

August 3, 2015

• Outline

1. Introduction
   - Supervised Learning
   - Naive Bayes
     - The Naive Bayes Model
     - The Multi-Class Case
   - Minimizing the Average Risk
2. Discriminant Functions and Decision Surfaces
   - Introduction
   - Gaussian Distribution
   - Influence of the Covariance
   - Maximum Likelihood Principle
   - Maximum Likelihood on a Gaussian

• Classification problem

Training Data
Samples of the form $(d, h(d))$:
- $d$ are the data objects to classify (inputs).
- $h(d)$ is the correct class label for $d$, with $h(d) \in \{1, \ldots, K\}$.
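The $(d, h(d))$ pairs can be sketched in Python as follows; the feature values, labels, and variable names are made up for illustration and do not come from the slides:

```python
# A minimal sketch of the training data: samples of the form (d, h(d)),
# where d is an input object and h(d) its correct class label in {1, ..., K}.
training_data = [
    ((5.1, 3.5), 1),  # (d, h(d)): input features, class label
    ((4.9, 3.0), 1),
    ((6.2, 2.9), 2),
    ((6.7, 3.1), 2),
]

# Split into inputs and labels, as a learner would consume them.
inputs = [d for d, _ in training_data]
labels = [h for _, h in training_data]

K = len(set(labels))  # number of distinct classes
print(K)  # -> 2
```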



• Classification Problem

Goal
Given $d_{new}$, provide $h(d_{new})$.

The machinery in general looks as follows:

[Diagram: a Supervised Learning block mapping INPUT to OUTPUT, trained with the desired/target output as training info.]
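The train-on-$(d, h(d))$ / predict-on-$d_{new}$ loop in the diagram can be sketched as a minimal interface; the class and method names below are hypothetical, and the learner is deliberately trivial:

```python
class MajorityClassifier:
    """A deliberately trivial supervised learner: it ignores the input
    features and always predicts the most frequent training label.
    It only illustrates the fit-on-(d, h(d)) / predict-on-d_new loop."""

    def fit(self, inputs, labels):
        # Training info: the desired/target outputs h(d).
        counts = {}
        for h in labels:
            counts[h] = counts.get(h, 0) + 1
        self.prediction = max(counts, key=counts.get)
        return self

    def predict(self, d_new):
        # Goal: given d_new, provide an estimate of h(d_new).
        return self.prediction

clf = MajorityClassifier().fit([(1.0,), (2.0,), (3.0,)], [1, 1, 2])
print(clf.predict((2.5,)))  # -> 1
```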


• Naive Bayes Model

Task for two classes
Let $\omega_1, \omega_2$ be the two classes to which our samples belong.

There is a prior probability of belonging to each class:
- $P(\omega_1)$ for class $\omega_1$.
- $P(\omega_2)$ for class $\omega_2$.

The rule for classification is the following one:

$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)} \tag{1}$$

Remark: Bayes to the next level.


• In Informal English

We have that

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior information}}{\text{evidence}} \tag{2}$$

Basically:
- One: if we can observe $x$,
- Two: we can convert the prior information into the posterior information.


• We have the following terms...

Likelihood
We call $p(x \mid \omega_i)$ the likelihood of $\omega_i$ given $x$: given a category $\omega_i$, if $p(x \mid \omega_i)$ is large, then $\omega_i$ is the likely class of $x$.

Prior Probability
It is the known probability of a given class. However, we can use other tricks to estimate it.

Evidence
The evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one.


• The most important term in all this

The factor

$$\text{likelihood} \times \text{prior information} \tag{3}$$

• Example

We have the likelihood of the two classes:

[Figure: class-conditional likelihoods $p(x \mid \omega_1)$ and $p(x \mid \omega_2)$.]

• Example

We have the posterior of the two classes when $P(\omega_1) = \frac{2}{3}$ and $P(\omega_2) = \frac{1}{3}$:

[Figure: posterior probabilities $P(\omega_1 \mid x)$ and $P(\omega_2 \mid x)$.]

• Naive Bayes Model

In the case of two classes:

$$P(x) = \sum_{i=1}^{2} p(x, \omega_i) = \sum_{i=1}^{2} p(x \mid \omega_i)\, P(\omega_i) \tag{4}$$
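Equations (1) and (4) can be checked numerically. This sketch assumes Gaussian class-conditional densities with made-up parameters (the slides do not specify a likelihood family) and reuses the priors $P(\omega_1) = 2/3$, $P(\omega_2) = 1/3$ from the earlier example:

```python
import math

def gaussian(x, mu, sigma):
    """Class-conditional density p(x | omega_i).  The Gaussian form is an
    assumption for illustration only."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = {1: 2 / 3, 2: 1 / 3}                # P(omega_1), P(omega_2)
params = {1: (0.0, 1.0), 2: (2.0, 1.0)}      # made-up (mu, sigma) per class

x = 1.0
likelihoods = {i: gaussian(x, *params[i]) for i in (1, 2)}

# Equation (4): P(x) = sum_i p(x | omega_i) P(omega_i)
evidence = sum(likelihoods[i] * priors[i] for i in (1, 2))

# Equation (1): the posteriors, which sum to one thanks to the evidence.
posteriors = {i: likelihoods[i] * priors[i] / evidence for i in (1, 2)}
print(round(posteriors[1] + posteriors[2], 10))  # -> 1.0
```

At this $x$ the two likelihoods are equal, so the posteriors reduce to the priors: $P(\omega_1 \mid x) = 2/3$.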

• Error in this rule

We have that

$$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases} \tag{5}$$

Thus, we have that

$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}, x)\, dx = \int_{-\infty}^{\infty} P(\text{error} \mid x)\, p(x)\, dx \tag{6}$$
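Equation (6) can be approximated numerically: when we decide by the Bayes rule, $P(\text{error} \mid x)$ is the smaller of the two posteriors, so the integrand $P(\text{error} \mid x)\, p(x)$ is $\min_i p(x \mid \omega_i) P(\omega_i)$. The sketch below assumes unit-variance Gaussian likelihoods with made-up means and equal priors; none of these numbers come from the slides:

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = (0.5, 0.5)   # made-up equal priors
mus = (0.0, 2.0)      # made-up class means, unit variance

def p_error():
    """Riemann-sum approximation of Eq. (6).  Under the Bayes rule the
    integrand P(error | x) p(x) equals min_i p(x | omega_i) P(omega_i)."""
    total, dx = 0.0, 0.01
    for k in range(int(20 / dx)):
        x = -10 + k * dx
        joint = [gaussian(x, mu, 1.0) * pr for mu, pr in zip(mus, priors)]
        total += min(joint) * dx
    return total

print(round(p_error(), 4))  # close to Phi(-1), i.e. about 0.1587
```

For this symmetric setup the exact Bayes error is $\Phi(-1) \approx 0.1587$, which the sum reproduces.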


• Classification Rule

Thus, we have the Bayes Classification Rule:
1. If $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, $x$ is classified to $\omega_1$.
2. If $P(\omega_1 \mid x) < P(\omega_2 \mid x)$, $x$ is classified to $\omega_2$.
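The rule above is a one-line comparison in code. This is a sketch, not the slides' own implementation; the function name is hypothetical, and the tie case (equal posteriors), which the rule leaves open, is broken here in favor of class 1:

```python
def bayes_classify(posterior_1, posterior_2):
    """Bayes classification rule for two classes: pick the class with the
    larger posterior P(omega_i | x).  Ties are decided arbitrarily
    in favor of class 1."""
    return 1 if posterior_1 >= posterior_2 else 2

print(bayes_classify(0.7, 0.3))  # -> 1
print(bayes_classify(0.2, 0.8))  # -> 2
```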


• What if we remove the normalization factor?

Remember:

$$P(\omega_1 \mid x) + P(\omega_2 \mid x) = 1 \tag{7}$$

We are able to obtain the new Bayes Classification Rule:
1. If $P(x \mid \omega_1) P(\omega_1) > P(x \mid \omega_2) P(\omega_2)$, $x$ is classified to $\omega_1$.
2. If $P(x \mid \omega_1) P(\omega_1) < P(x \mid \omega_2) P(\omega_2)$, $x$ is classified to $\omega_2$.
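Since the evidence $P(x)$ is a positive factor common to both sides, dividing by it can never flip the comparison, so the normalized and unnormalized rules agree. A quick numerical check with made-up likelihoods and priors:

```python
# Made-up numbers for one observation x.
likelihoods = {1: 0.30, 2: 0.10}   # p(x | omega_1), p(x | omega_2)
priors = {1: 0.40, 2: 0.60}        # P(omega_1), P(omega_2)

# Unnormalized rule: compare the products p(x | omega_i) P(omega_i).
unnormalized = {i: likelihoods[i] * priors[i] for i in (1, 2)}

# Normalized rule: compare the posteriors P(omega_i | x).
evidence = sum(unnormalized.values())
posteriors = {i: unnormalized[i] / evidence for i in (1, 2)}

decision_unnorm = max(unnormalized, key=unnormalized.get)
decision_posterior = max(posteriors, key=posteriors.get)
print(decision_unnorm == decision_posterior)  # -> True
```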


• We have several cases

If for some $x$ we have $P(x \mid \omega_1) = P(x \mid \omega_2)$
The final decision relies entirely on the prior probabilities.

On the