Machine Learning - Introduction to Bayesian Classification

1. Machine Learning for Data Mining: Introduction to Bayesian Classifiers. Andres Mendez-Vazquez, August 3, 2015.
2. Outline. 1 Introduction: Supervised Learning; Naive Bayes; The Naive Bayes Model; The Multi-Class Case; Minimizing the Average Risk. 2 Discriminant Functions and Decision Surfaces: Introduction; Gaussian Distribution; Influence of the Covariance; Maximum Likelihood Principle; Maximum Likelihood on a Gaussian.
3. Classification problem. Training Data: samples of the form (d, h(d)), where d are the data objects to classify (inputs) and h(d) is the correct class label for d, with h(d) ∈ {1, . . . , K}.
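As a concrete, hypothetical illustration of such training data, each sample pairs an input d with its label h(d); the feature values and labels below are made up for the example.

```python
# Hypothetical training set: each sample is a pair (d, h(d)), where d is the
# input object (here a 2-dimensional feature vector) and h(d) is its correct
# class label in {1, ..., K}.
training_data = [
    ((5.1, 3.5), 1),
    ((4.9, 3.0), 1),
    ((6.2, 2.8), 2),
    ((6.7, 3.1), 2),
]

inputs = [d for d, _ in training_data]
labels = [h for _, h in training_data]
K = len(set(labels))   # number of distinct classes
```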
7. Classification Problem. Goal: given d_new, provide h(d_new). The machinery, in general, is supervised learning: a learner maps an INPUT to an OUTPUT, guided by training info (the desired/target output).
9. Naive Bayes Model. Task for two classes: let ω_1, ω_2 be the two classes to which our samples belong. There is a prior probability of belonging to each class: P(ω_1) for class 1 and P(ω_2) for class 2. The rule for classification is the following: P(ω_i | x) = p(x | ω_i) P(ω_i) / P(x) (1). Remark: Bayes to the next level.
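A minimal numeric sketch of rule (1), assuming Gaussian class-conditional densities (the means and variances below are illustrative, not from the slides) and the priors P(ω_1) = 2/3, P(ω_2) = 1/3 used in the example later in the deck:

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities p(x | w_i); parameters are illustrative.
class_conditionals = [norm(0.0, 1.0), norm(2.0, 1.0)]
priors = np.array([2.0 / 3.0, 1.0 / 3.0])   # P(w_1), P(w_2)

def posteriors(x):
    # Bayes' rule: P(w_i | x) = p(x | w_i) P(w_i) / P(x).
    joint = np.array([p.pdf(x) for p in class_conditionals]) * priors   # p(x | w_i) P(w_i)
    evidence = joint.sum()                                              # P(x), as in eq. (4)
    return joint / evidence

print(posteriors(1.0))   # the two posteriors sum to 1
```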
13. In Informal English. We have that posterior = (likelihood × prior information) / evidence (2). Basically: if we can observe x, then we can convert the prior information into the posterior information.
16. We have the following terms... Likelihood: we call p(x | ω_i) the likelihood of ω_i given x; given a category ω_i, if p(x | ω_i) is large, then ω_i is the likely class of x. Prior Probability: the known probability of a given class. Remark: because we lack information about this class, we tend to use the uniform distribution; however, we can use other tricks for it. Evidence: the evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one.
22. The most important term in all this: the factor likelihood × prior information (3).
23. Example: the likelihoods of the two classes (figure).
24. Example: the posteriors of the two classes when P(ω_1) = 2/3 and P(ω_2) = 1/3 (figure).
25. Naive Bayes Model. In the case of two classes: P(x) = Σ_{i=1}^{2} p(x, ω_i) = Σ_{i=1}^{2} p(x | ω_i) P(ω_i) (4).
26. Error in this rule. We have that P(error | x) = P(ω_1 | x) if we decide ω_2, and P(ω_2 | x) if we decide ω_1 (5). Thus, we have that P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx (6).
28. Classification Rule. Thus, we have the Bayes classification rule: 1 if P(ω_1 | x) > P(ω_2 | x), x is classified to ω_1; 2 if P(ω_1 | x) < P(ω_2 | x), x is classified to ω_2.
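To connect this rule with equations (5) and (6): under the Bayes classification rule, P(error | x) = min(P(ω_1 | x), P(ω_2 | x)), so (6) can be evaluated numerically. Below is a minimal sketch (not from the slides) assuming Gaussian class-conditional densities with illustrative parameters and the example priors 2/3 and 1/3.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Assumed class-conditional densities p(x | w_i) and priors (illustrative values).
densities = [norm(0.0, 1.0), norm(2.0, 1.0)]
priors = np.array([2.0 / 3.0, 1.0 / 3.0])

def error_integrand(x):
    # P(error | x) p(x) = min_i p(x | w_i) P(w_i): the Bayes rule keeps the larger
    # joint term, so the smaller one is the probability mass misclassified at x.
    joint = np.array([d.pdf(x) for d in densities]) * priors
    return joint.min()

P_error, _ = quad(error_integrand, -np.inf, np.inf)   # equation (6)
print(P_error)
```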
30. What if we remove the normalization factor? Remember that P(ω_1 | x) + P(ω_2 | x) = 1 (7). We are able to obtain the new Bayes classification rule: 1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2), x is classified to ω_1; 2 if p(x | ω_1) P(ω_1) < p(x | ω_2) P(ω_2), x is classified to ω_2.
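A sketch of this unnormalized rule (again with assumed Gaussian likelihoods and, here, equal priors): comparing p(x | ω_i) P(ω_i) gives the same decision as comparing posteriors, since P(x) is common to both sides.

```python
from scipy.stats import norm

# Assumed class-conditional densities and priors (illustrative values).
p_x_given_w1 = norm(0.0, 1.0).pdf
p_x_given_w2 = norm(2.0, 1.0).pdf
P_w1, P_w2 = 0.5, 0.5

def classify(x):
    # Compare p(x | w_1) P(w_1) against p(x | w_2) P(w_2); P(x) cancels out.
    return 1 if p_x_given_w1(x) * P_w1 > p_x_given_w2(x) * P_w2 else 2

print([classify(x) for x in (-1.0, 0.9, 1.1, 3.0)])   # -> [1, 1, 2, 2]; threshold near x = 1
```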
33. We have several cases. If for some x we have p(x | ω_1) = p(x | ω_2), the final decision depends entirely on the prior probabilities. On the other hand, if P(ω_1) = P(ω_2), the states are equally probable, and the decision is based entirely on the likelihoods p(x | ω_i).
35. What the rule looks like: if P(ω_1) = P(ω_2), the rule depends on the term p(x | ω_i).
36. The Error in the Second Case of Naive Bayes. Error in equiprobable classes: P(error) = (1/2) ∫_{−∞}^{x_0} p(x | ω_2) dx + (1/2) ∫_{x_0}^{+∞} p(x | ω_1) dx (8), where x_0 is the decision threshold. Remark: P(ω_1) = P(ω_2) = 1/2.
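A numeric check of (8) under an assumed setting: two equiprobable Gaussian classes with equal variances, for which the threshold x_0 is the midpoint of the means (these parameters are illustrative, not from the slide).

```python
from scipy.stats import norm

# Assumed equiprobable Gaussian classes (illustrative parameters).
mu1, mu2, sigma = 0.0, 2.0, 1.0
x0 = (mu1 + mu2) / 2.0   # decision threshold for equal priors and equal variances

# Equation (8): P(error) = 1/2 * P(x < x0 | w_2) + 1/2 * P(x > x0 | w_1)
P_error = 0.5 * norm(mu2, sigma).cdf(x0) + 0.5 * (1.0 - norm(mu1, sigma).cdf(x0))
print(P_error)   # ~0.1587 for these parameters
```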
37. What do we want to prove? Something notable: the Bayesian classifier is optimal with respect to minimizing the classification error probability.
38. Proof