06 Machine Learning - Naive Bayes
Machine Learning for Data Mining: Introduction to Bayesian Classifiers
Andres Mendez-Vazquez
August 3, 2015
Outline
1 Introduction
  Supervised Learning
  Naive Bayes
  The Naive Bayes Model
  The Multi-Class Case
  Minimizing the Average Risk
2 Discriminant Functions and Decision Surfaces
  Introduction
  Gaussian Distribution
  Influence of the Covariance
  Maximum Likelihood Principle
  Maximum Likelihood on a Gaussian
Classification problem
Training Data
Samples of the form (d, h(d)), where:
d are the data objects to classify (the inputs).
h(d) is the correct class label for d, with h(d) ∈ {1, . . . , K}.
Classification Problem
Goal
Given d_new, provide h(d_new).
The machinery, in general, looks like this:
[Figure: a Supervised Learning block maps INPUT to OUTPUT, guided by training info (the desired/target output).]
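To make this setup concrete, here is a minimal sketch of the supervised-learning interface described above; the toy data and the nearest-neighbor rule are illustrative assumptions, not something from the slides.

# Training data: samples of the form (d, h(d)), with d the input and h(d) its class label.
training_data = [(1.0, 1), (1.2, 1), (3.8, 2), (4.1, 2)]

def predict(d_new):
    # Toy stand-in for a learned classifier: return the label of the closest training input.
    d, label = min(training_data, key=lambda pair: abs(pair[0] - d_new))
    return label

print(predict(3.9))  # -> 2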
Naive Bayes Model
Task for two classes
Let ω1, ω2 be the two classes to which our samples may belong.
There is a prior probability of belonging to each class: P(ω1) for class ω1 and P(ω2) for class ω2.
The rule for classification is the following one:

P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)}    (1)

Remark: Bayes' theorem, taken to the next level.
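A minimal sketch of equation (1) in code; the function name and the numbers in the example call are assumptions made up for illustration.

def bayes_posteriors(likelihoods, priors):
    # likelihoods[i] = P(x | omega_i) and priors[i] = P(omega_i), for one fixed observation x.
    # The evidence P(x) is the normalizer that makes the posteriors sum to one.
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Hypothetical likelihood values with uniform priors.
print(bayes_posteriors([0.6, 0.1], [0.5, 0.5]))  # -> [0.857..., 0.142...]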
In Informal English
We have that

posterior = (likelihood × prior information) / evidence    (2)

Basically
One: If we can observe x.
Two: We can convert the prior information to the posterior information.
We have the following terms...
Likelihood
We call p(x|ωi) the likelihood of ωi given x:
This indicates that, given a category ωi, if p(x|ωi) is large, then ωi is the likely class of x.
Prior Probability
It is the known probability of a given class.
Remark: because we often lack information about this probability, we tend to use the uniform distribution.
However: we can use other tricks for it.
Evidence
The evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one.
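To see how these three terms interact, here is a small worked example with invented numbers (not taken from the slides). Suppose, for some observed x,

p(x \mid \omega_1) = 0.5, \qquad p(x \mid \omega_2) = 0.2, \qquad P(\omega_1) = P(\omega_2) = 0.5.

Then

P(x) = 0.5 \cdot 0.5 + 0.2 \cdot 0.5 = 0.35, \qquad P(\omega_1 \mid x) = \frac{0.25}{0.35} = \frac{5}{7}, \qquad P(\omega_2 \mid x) = \frac{0.10}{0.35} = \frac{2}{7},

and the posteriors sum to one precisely because the evidence P(x) is their common normalizer.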
The most important term in all this
The factor

likelihood × prior information    (3)
Example
We have the likelihoods of the two classes:
[Figure: class-conditional likelihoods p(x|ω1) and p(x|ω2).]
Example
We have the posteriors of the two classes when P(ω1) = 2/3 and P(ω2) = 1/3:
[Figure: posterior probabilities P(ω1|x) and P(ω2|x) for these priors.]
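A sketch that reproduces this kind of posterior curve; the Gaussian likelihoods N(0, 1) and N(2, 1) are an assumption made here (the slides do not specify them), while the priors 2/3 and 1/3 match the example.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4.0, 6.0, 500)
# Assumed class-conditional likelihoods.
p_x_w1 = norm.pdf(x, loc=0.0, scale=1.0)    # p(x | omega_1)
p_x_w2 = norm.pdf(x, loc=2.0, scale=1.0)    # p(x | omega_2)
prior_w1, prior_w2 = 2.0 / 3.0, 1.0 / 3.0   # priors from the example

evidence = p_x_w1 * prior_w1 + p_x_w2 * prior_w2
post_w1 = p_x_w1 * prior_w1 / evidence      # P(omega_1 | x)
post_w2 = p_x_w2 * prior_w2 / evidence      # P(omega_2 | x)
assert np.allclose(post_w1 + post_w2, 1.0)  # posteriors sum to one

plt.plot(x, post_w1, label="P(w1 | x)")
plt.plot(x, post_w2, label="P(w2 | x)")
plt.legend()
plt.show()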
Naive Bayes Model
In the case of two classes

P(x) = \sum_{i=1}^{2} p(x, \omega_i) = \sum_{i=1}^{2} p(x \mid \omega_i) P(\omega_i)    (4)
Error in this rule
We have that

P(error \mid x) =
\begin{cases}
P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\
P(\omega_2 \mid x) & \text{if we decide } \omega_1
\end{cases}    (5)

Thus, we have that

P(error) = \int P(error, x)\, dx = \int P(error \mid x)\, p(x)\, dx    (6)
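A sketch that evaluates equations (5) and (6) numerically, reusing the assumed Gaussian likelihoods and the 2/3, 1/3 priors from the example above; the grid-based integration is only an approximation for illustration.

import numpy as np
from scipy.stats import norm

x = np.linspace(-10.0, 12.0, 20001)
dx = x[1] - x[0]
p_x_w1 = norm.pdf(x, 0.0, 1.0)
p_x_w2 = norm.pdf(x, 2.0, 1.0)
prior_w1, prior_w2 = 2.0 / 3.0, 1.0 / 3.0

p_x = p_x_w1 * prior_w1 + p_x_w2 * prior_w2   # evidence p(x), as in eq. (4)
post_w1 = p_x_w1 * prior_w1 / p_x
post_w2 = p_x_w2 * prior_w2 / p_x

# Eq. (5): deciding the more probable class leaves the other posterior as the error.
p_error_given_x = np.minimum(post_w1, post_w2)

# Eq. (6): P(error) = integral of P(error | x) p(x) dx, approximated on the grid.
p_error = np.sum(p_error_given_x * p_x) * dx
print(p_error)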
Classification Rule
Thus, we have the Bayes Classification Rule
1 If P(ω1|x) > P(ω2|x), x is classified to ω1.
2 If P(ω1|x) < P(ω2|x), x is classified to ω2.
What if we remove the normalization factor?
Remember

P(\omega_1 \mid x) + P(\omega_2 \mid x) = 1    (7)

We are able to obtain the new Bayes Classification Rule
1 If P(x|ω1) P(ω1) > P(x|ω2) P(ω2), x is classified to ω1.
2 If P(x|ω1) P(ω1) < P(x|ω2) P(ω2), x is classified to ω2.
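A minimal sketch of this unnormalized rule; the likelihood functions and priors passed in are hypothetical placeholders for whatever model is actually used.

def classify(x, likelihood_1, likelihood_2, prior_1, prior_2):
    # Compare P(x | w1) P(w1) against P(x | w2) P(w2); the evidence P(x) cancels,
    # so the decision is the same as comparing the posteriors directly.
    score_1 = likelihood_1(x) * prior_1
    score_2 = likelihood_2(x) * prior_2
    return 1 if score_1 > score_2 else 2

# Hypothetical Gaussian likelihoods with the 2/3, 1/3 priors used earlier.
from scipy.stats import norm
print(classify(1.5, lambda v: norm.pdf(v, 0, 1), lambda v: norm.pdf(v, 2, 1), 2/3, 1/3))  # -> 2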
We have several cases
If for some x we have P(x|ω1) = P(x|ω2)
The final decision depends entirely on the prior probabilities.
On the