• Date posted: 14-Aug-2015
• Category: Engineering


### Transcript of Machine Learning - Introduction to Bayesian Classification

1. Machine Learning for Data Mining: Introduction to Bayesian Classifiers. Andres Mendez-Vazquez. August 3, 2015. 1 / 71
2. Outline. 1 Introduction: Supervised Learning; Naive Bayes; The Naive Bayes Model; The Multi-Class Case; Minimizing the Average Risk. 2 Discriminant Functions and Decision Surfaces: Introduction; Gaussian Distribution; Influence of the Covariance; Maximum Likelihood Principle; Maximum Likelihood on a Gaussian. 2 / 71
3. Classification problem. Training data: samples of the form (d, h(d)), where d are the data objects to classify (inputs) and h(d) is the correct class information for d, with h(d) ∈ {1, . . . , K}. 3 / 71
7. Classification Problem. Goal: given d_new, provide h(d_new). The machinery in general looks like: (diagram) an input passes through a supervised learner, trained with the desired/target output, to produce the output. 5 / 71
9. Naive Bayes Model. Task for two classes: let ω1, ω2 be the two classes to which our samples belong. There is a prior probability of belonging to each class: P(ω1) for class ω1 and P(ω2) for class ω2. The rule for classification is P(ωi|x) = P(x|ωi) P(ωi) / P(x) (1). Remark: Bayes taken to the next level. 7 / 71
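Eq. (1) can be sketched numerically. A minimal illustration in Python, using the priors P(ω1) = 2/3, P(ω2) = 1/3 from the later example; the univariate Gaussian class-conditional densities (means 0 and 2, unit variance) are illustrative assumptions, not taken from the slides:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Univariate Gaussian density, used here as an illustrative p(x | w_i)."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (sqrt(2 * pi) * std)

def posterior(x, priors, likelihoods):
    """P(w_i | x) = p(x | w_i) P(w_i) / p(x), with p(x) as the normalizer."""
    joint = [lik(x) * pr for lik, pr in zip(likelihoods, priors)]
    evidence = sum(joint)                      # p(x) = sum_i p(x | w_i) P(w_i)
    return [j / evidence for j in joint]

# Hypothetical class-conditional densities and the priors from the example
likelihoods = [lambda x: gaussian_pdf(x, 0.0, 1.0),   # class w_1
               lambda x: gaussian_pdf(x, 2.0, 1.0)]   # class w_2
priors = [2 / 3, 1 / 3]

post = posterior(1.0, priors, likelihoods)
print(post)  # the posteriors sum to 1
```

At x = 1 the two illustrative likelihoods happen to be equal, so the posteriors reduce to the priors, matching the "several cases" discussion below.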
13. In Informal English. We have that posterior = (likelihood × prior information) / evidence (2). Basically: one, we can observe x; two, we can then convert the prior information into posterior information. 8 / 71
16. We have the following terms... Likelihood: we call p(x|ωi) the likelihood of ωi given x; this indicates that, given a category ωi, if p(x|ωi) is large, then ωi is the likely class of x. Prior probability: the known probability of a given class; remark: because we lack information about this class, we tend to use the uniform distribution; however, we can use other tricks for it. Evidence: the evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one. 9 / 71
22. The most important term in all this: the factor likelihood × prior information (3). 10 / 71
23. Example. We have the likelihoods of two classes (figure). 11 / 71
24. Example. We have the posteriors of the two classes when P(ω1) = 2/3 and P(ω2) = 1/3 (figure). 12 / 71
25. Naive Bayes Model. In the case of two classes: P(x) = Σ_{i=1}^{2} p(x, ωi) = Σ_{i=1}^{2} p(x|ωi) P(ωi) (4). 13 / 71
26. Error in this rule. We have that P(error|x) = P(ω1|x) if we decide ω2, and P(ω2|x) if we decide ω1 (5). Thus, P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx (6). 14 / 71
28. Classification Rule. Thus, we have the Bayes classification rule: (1) if P(ω1|x) > P(ω2|x), x is classified to ω1; (2) if P(ω1|x) < P(ω2|x), x is classified to ω2. 15 / 71
30. What if we remove the normalization factor? Remember that P(ω1|x) + P(ω2|x) = 1 (7). We then obtain the new Bayes classification rule: (1) if p(x|ω1) P(ω1) > p(x|ω2) P(ω2), x is classified to ω1; (2) if p(x|ω1) P(ω1) < p(x|ω2) P(ω2), x is classified to ω2. 16 / 71
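The unnormalized rule drops the shared evidence p(x) and compares p(x|ωi) P(ωi) directly. A minimal sketch, again assuming illustrative Gaussian likelihoods and the priors 2/3 and 1/3 from the earlier example:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Univariate Gaussian density, used as an illustrative p(x | w_i)."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (sqrt(2 * pi) * std)

def classify(x, priors, likelihoods):
    """Return the index i maximizing p(x | w_i) P(w_i); p(x) cancels out."""
    scores = [lik(x) * pr for lik, pr in zip(likelihoods, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

likelihoods = [lambda x: gaussian_pdf(x, 0.0, 1.0),   # class w_1, mean 0
               lambda x: gaussian_pdf(x, 2.0, 1.0)]   # class w_2, mean 2
priors = [2 / 3, 1 / 3]

print(classify(-1.0, priors, likelihoods))  # 0: w_1 wins near its mean
print(classify(3.0, priors, likelihoods))   # 1: w_2 wins near its mean
```

Because the same positive factor 1/p(x) would multiply every score, the argmax is unchanged, which is exactly why the normalization can be removed.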
33. We have several cases. If for some x we have p(x|ω1) = p(x|ω2), the final decision relies entirely on the prior probabilities. On the other hand, if P(ω1) = P(ω2), the states are equally probable, and the decision is based entirely on the likelihoods p(x|ωi). 17 / 71
35. How the rule looks. If P(ω1) = P(ω2), the rule depends only on the term p(x|ωi) (figure). 18 / 71
36. The Error in the Second Case of Naive Bayes. Error for equiprobable classes: P(error) = (1/2) ∫_{-∞}^{x0} p(x|ω2) dx + (1/2) ∫_{x0}^{∞} p(x|ω1) dx (8). Remark: P(ω1) = P(ω2) = 1/2. 19 / 71
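Eq. (8) can be evaluated numerically. A sketch assuming two equiprobable, unit-variance Gaussian likelihoods centered at 0 and 2 (illustrative choices, not from the slides); with equal priors and variances, the posteriors cross at the midpoint, so x0 = 1:

```python
from math import erf, sqrt

def gaussian_cdf(x, mean, std):
    """CDF of a univariate Gaussian, via the error function."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))

x0 = 1.0  # decision threshold: midpoint between the two means

# P(error) = 1/2 * P(x < x0 | w_2) + 1/2 * P(x > x0 | w_1), cf. Eq. (8)
p_error = (0.5 * gaussian_cdf(x0, 2.0, 1.0)
           + 0.5 * (1.0 - gaussian_cdf(x0, 0.0, 1.0)))
print(round(p_error, 4))  # 0.1587
```

Both error integrals here equal the standard normal tail Φ(-1) ≈ 0.1587, the probability mass of each class falling on the wrong side of the threshold.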
37. What do we want to prove? Something notable: the Bayesian classifier is optimal with respect to minimizing the classification error probability. 20 / 71
38. Proof