# Machine Learning - Introduction to Bayesian Classification

Date posted: 14-Aug-2015
Category: Engineering
### Transcript of Machine Learning - Introduction to Bayesian Classification

- 1. Machine Learning for Data Mining: Introduction to Bayesian Classifiers. Andres Mendez-Vazquez, August 3, 2015.
- 2. Outline: 1. Introduction (Supervised Learning; Naive Bayes: The Naive Bayes Model, The Multi-Class Case, Minimizing the Average Risk). 2. Discriminant Functions and Decision Surfaces (Introduction; Gaussian Distribution; Influence of the Covariance; Maximum Likelihood Principle; Maximum Likelihood on a Gaussian).
- 3. Classification problem. Training data: samples of the form (d, h(d)), where d is the data object to classify (input) and h(d) is the correct class label for d, with h(d) ∈ {1, ..., K}.
- 7. Classification problem. Goal: given d_new, provide h(d_new). The machinery in general: supervised learning with training info (desired/target output) mapping INPUT to OUTPUT.
- 9. Naive Bayes Model. Task for two classes: let ω1, ω2 be the two classes to which our samples may belong, with prior probabilities P(ω1) for class 1 and P(ω2) for class 2. The rule for classification is P(ωi|x) = p(x|ωi) P(ωi) / p(x) (1). Remark: Bayes to the next level.
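Rule (1) above can be sketched directly in code. This is a minimal illustration with hypothetical likelihood and prior values, not numbers from the slides:

```python
# Two-class Bayes rule: posterior = likelihood * prior / evidence,
# where the evidence p(x) is the sum of likelihood * prior over classes.

def posterior(likelihoods, priors):
    """Return P(w_i | x) for each class from p(x | w_i) and P(w_i)."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Hypothetical values: p(x|w1) = 0.6, p(x|w2) = 0.2, equal priors.
post = posterior([0.6, 0.2], [0.5, 0.5])
print(post)  # ≈ [0.75, 0.25]
```

Note that the evidence term only rescales: the ordering of the posteriors is already fixed by the products likelihood × prior.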
- 13. In informal English: posterior = (likelihood × prior information) / evidence (2). Basically: one, if we can observe x; two, we can convert the prior information into posterior information.
- 16. We have the following terms. Likelihood: we call p(x|ωi) the likelihood of ωi given x; if p(x|ωi) is large, then ωi is a likely class for x. Prior probability: the known probability of a given class; when we lack information about it, we tend to use the uniform distribution, though other choices are possible. Evidence: a scale factor that guarantees that the posterior probabilities sum to one.
- 22. The most important term in all this is the factor likelihood × prior information (3).
- 23. Example: the likelihoods of two classes (figure).
- 24. Example: the posteriors of the two classes when P(ω1) = 2/3 and P(ω2) = 1/3 (figure).
- 25. Naive Bayes Model. In the case of two classes: p(x) = Σ_{i=1}^{2} p(x, ωi) = Σ_{i=1}^{2} p(x|ωi) P(ωi) (4).
- 26. Error in this rule. We have P(error|x) = P(ω1|x) if we decide ω2, and P(ω2|x) if we decide ω1 (5). Thus P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx (6).
- 28. Classification rule. Thus we have the Bayes classification rule: (1) if P(ω1|x) > P(ω2|x), x is classified to ω1; (2) if P(ω1|x) < P(ω2|x), x is classified to ω2.
- 30. What if we remove the normalization factor? Remember that P(ω1|x) + P(ω2|x) = 1 (7). We obtain the equivalent Bayes classification rule: (1) if p(x|ω1) P(ω1) > p(x|ω2) P(ω2), x is classified to ω1; (2) if p(x|ω1) P(ω1) < p(x|ω2) P(ω2), x is classified to ω2.
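The unnormalized rule above compares the products p(x|ωi) P(ωi) directly and skips the evidence p(x) entirely. A minimal sketch with hypothetical numbers:

```python
# Two-class decision without the normalization factor p(x):
# compare likelihood * prior for each class.

def classify(lik1, lik2, prior1, prior2):
    """Return 1 or 2 by comparing p(x|w_i) * P(w_i) per class."""
    return 1 if lik1 * prior1 > lik2 * prior2 else 2

print(classify(0.2, 0.6, 0.5, 0.5))  # 2: likelihood favors class 2
print(classify(0.2, 0.6, 0.9, 0.1))  # 1: a strong prior overrides it
```

The second call illustrates the case discussed next: when likelihoods are close, the priors decide, and vice versa.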
- 33. We have several cases. If for some x we have p(x|ω1) = p(x|ω2), the final decision relies entirely on the prior probabilities. On the other hand, if P(ω1) = P(ω2), the states are equally probable and the decision is based entirely on the likelihoods p(x|ωi).
- 35. How the rule looks: if P(ω1) = P(ω2), the rule depends only on the terms p(x|ωi) (figure).
- 36. The error in the second case of Naive Bayes. Error for equiprobable classes: P(error) = (1/2) ∫_{-∞}^{x0} p(x|ω2) dx + (1/2) ∫_{x0}^{∞} p(x|ω1) dx (8), where x0 is the decision threshold. Remark: P(ω1) = P(ω2) = 1/2.
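Equation (8) can be checked numerically. A sketch with two hypothetical equiprobable 1-D Gaussian classes N(0, 1) and N(2, 1), for which the threshold is the midpoint x0 = 1 and the closed-form error is Φ(-1) ≈ 0.1587:

```python
# Numeric evaluation of Eq. (8): half the w2 mass left of x0 plus
# half the w1 mass right of x0, via Riemann sums.
import math

def pdf(x, mu):
    """Standard-variance Gaussian density with mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

x0, dx = 1.0, 1e-3
left = sum(pdf(-8 + i * dx, 2.0) * dx for i in range(int((x0 + 8) / dx)))
right = sum(pdf(x0 + i * dx, 0.0) * dx for i in range(int(8 / dx)))
p_error = 0.5 * left + 0.5 * right
print(p_error)  # close to 0.1587
```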
- 37. What do we want to prove? Something notable: the Bayesian classifier is optimal with respect to minimizing the classification error probability.
- 38. Proof. Step 1: let R1 be the region of the feature space in which we decide in favor of ω1, and R2 the region in which we decide in favor of ω2. Step 2: Pe = P(x ∈ R2, ω1) + P(x ∈ R1, ω2) (9). Thus Pe = P(x ∈ R2|ω1) P(ω1) + P(x ∈ R1|ω2) P(ω2) = P(ω1) ∫_{R2} p(x|ω1) dx + P(ω2) ∫_{R1} p(x|ω2) dx.
- 42. Proof. Moreover, Pe = P(ω1) ∫_{R2} [p(ω1, x)/P(ω1)] dx + P(ω2) ∫_{R1} [p(ω2, x)/P(ω2)] dx (10). Finally, Pe = ∫_{R2} p(ω1|x) p(x) dx + ∫_{R1} p(ω2|x) p(x) dx (11). Now we choose the Bayes classification rule: R1: P(ω1|x) > P(ω2|x); R2: P(ω2|x) > P(ω1|x).
- 45. Proof. Thus P(ω1) = ∫_{R1} p(ω1|x) p(x) dx + ∫_{R2} p(ω1|x) p(x) dx (12). Now we have P(ω1) − ∫_{R1} p(ω1|x) p(x) dx = ∫_{R2} p(ω1|x) p(x) dx (13). Then Pe = P(ω1) − ∫_{R1} p(ω1|x) p(x) dx + ∫_{R1} p(ω2|x) p(x) dx (14).
- 48. Graphically: P(ω1) shown in red (figure).
- 49. Thus we have ∫_{R1} p(ω1|x) p(x) dx = ∫_{R1} p(ω1, x) dx = P_{R1}(ω1) (figure).
- 50. Finally, Pe = P(ω1) − ∫_{R1} [p(ω1|x) − p(ω2|x)] p(x) dx (15), which follows from Pe = P(ω1) − ∫_{R1} p(ω1|x) p(x) dx + ∫_{R1} p(ω2|x) p(x) dx.
- 52. Pe for a non-optimal rule (figure).
- 53. Which decision function minimizes the error? A single number (the threshold) in this case (figure).
- 54. The error is minimized by the Bayesian rule. Thus the probability of error is minimized by choosing the regions R1: P(ω1|x) > P(ω2|x) and R2: P(ω2|x) > P(ω1|x).
- 57. Pe for an optimal rule (figure).
- 58. For M classes ω1, ω2, ..., ωM: vector x is assigned to ωi if P(ωi|x) > P(ωj|x) for all j ≠ i (16). Something notable: it turns out that such a choice also minimizes the classification error probability.
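The M-class rule (16) is just an argmax over posteriors, or equivalently over the products p(x|ωi) P(ωi). A minimal sketch with hypothetical numbers:

```python
# M-class Bayes decision: pick the class maximizing p(x|w_i) * P(w_i),
# which is equivalent to maximizing the posterior P(w_i|x).

def decide(likelihoods, priors):
    """Return the (zero-based) index i maximizing p(x|w_i) * P(w_i)."""
    scores = [l * p for l, p in zip(likelihoods, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

# Three classes with uniform priors: the largest likelihood wins.
print(decide([0.1, 0.5, 0.3], [1/3, 1/3, 1/3]))  # 1 (i.e., class w_2)
```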
- 61. Minimizing the risk. Something notable: the classification error probability is not always the best criterion to minimize, because it gives all errors the same importance. However, certain errors are more important than others. For example: really serious, a doctor makes a wrong decision and a malignant tumor gets classified as benign; not so serious, a benign tumor gets classified as malignant.
- 65. It is based on the following idea. To measure the predictive performance of a function f: X → Y, we use a loss function ℓ: Y × Y → R+ (17), a non-negative function that quantifies how bad the prediction f(x) is when the true label is y. Thus we can say that ℓ(f(x), y) is the loss incurred by f on the pair (x, y).
- 68. Example. In the classification case, binary or otherwise, a natural loss function is the 0-1 loss: with y′ = f(x), ℓ(y′, y) = 1 if y′ ≠ y and 0 otherwise (18).
- 70. Furthermore, for regression problems some natural choices are: (1) squared loss, ℓ(y′, y) = (y′ − y)²; (2) absolute loss, ℓ(y′, y) = |y′ − y|. Given the loss function, we can define the risk as R(f) = E_{(X,Y)}[ℓ(f(X), Y)] (19). Although we cannot evaluate the expected risk directly, we can use the sample to estimate the empirical risk R̂(f) = (1/n) Σ_{i=1}^{n} ℓ(f(xi), yi) (20).
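The empirical risk (20) with the 0-1 loss (18) can be computed in a few lines. A sketch on a tiny hypothetical sample with a fixed threshold classifier:

```python
# Empirical risk: average loss of f over the sample (x_i, y_i).

def empirical_risk(f, xs, ys, loss):
    """Eq. (20): (1/n) * sum of loss(f(x_i), y_i)."""
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)

zero_one = lambda y_hat, y: 0 if y_hat == y else 1  # Eq. (18)

f = lambda x: 1 if x > 0 else 0       # a fixed classifier (hypothetical)
xs = [-2.0, -0.5, 0.3, 1.5]           # hypothetical inputs
ys = [0, 1, 1, 1]                     # hypothetical labels
print(empirical_risk(f, xs, ys, zero_one))  # 0.25 (one mistake in four)
```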
- 74. Thus, risk minimization. Minimizing the empirical risk over a fixed class F ⊆ Y^X of functions leads to a very important learning rule, named empirical risk minimization (ERM): f̂n = arg min_{f ∈ F} R̂(f) (21). If we knew the distribution P and did not restrict ourselves to F, the best function would be f* = arg min_f R(f) (22).
- 76. Thus For classication with 0-1 loss, f is called the Bayesian Classier f = arg min yY P (Y = y|X = x) (23) 38 / 71
- 77. The new risk function. We now have the following loss values for two classes: λ12 is the loss if the sample is in class ω1 but is classified to class ω2; λ21 is the loss if the sample is in class ω2 but is classified to class ω1. We can generate the new risk function r = λ12 Pe(x, ω1) + λ21 Pe(x, ω2) = λ12 ∫_{R2} p(x, ω1) dx + λ21 ∫_{R1} p(x, ω2) dx = λ12 ∫_{R2} p(x|ω1) P(ω1) dx + λ21 ∫_{R1} p(x|ω2) P(ω2) dx.
- 82. The new risk function to be minimized is r = λ12 P(ω1) ∫_{R2} p(x|ω1) dx + λ21 P(ω2) ∫_{R1} p(x|ω2) dx (24), where the λ terms work as weight factors for each error. Thus we could have, say, λ12 > λ21. Then errors due to assigning patterns originating from class ω1 to class ω2 have a larger effect on the cost function than the errors associated with the second term in the summation.
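Minimizing the risk (24) pointwise means assigning x to ω1 whenever λ12 p(x|ω1) P(ω1) > λ21 p(x|ω2) P(ω2). A sketch with hypothetical numbers, where λ12 > λ21 makes ω1-errors costlier:

```python
# Risk-weighted two-class decision: the loss terms lambda_12 and
# lambda_21 weight the two kinds of error.

def risk_decide(lik1, lik2, p1, p2, lam12, lam21):
    """Return 1 or 2, minimizing the pointwise contribution to Eq. (24)."""
    return 1 if lam12 * lik1 * p1 > lam21 * lik2 * p2 else 2

# Equal likelihoods and priors: the costlier error tips the decision.
print(risk_decide(0.4, 0.4, 0.5, 0.5, lam12=2.0, lam21=1.0))  # 1
print(risk_decide(0.4, 0.4, 0.5, 0.5, lam12=1.0, lam21=2.0))  # 2
```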
- 85. Now consider an M-class problem. Let Rj, j = 1, ..., M, be the regions where the classes ωj live. Now think of the following error: assume x belongs to class ωk but lies in Ri with i ≠ k, so the vector is misclassified. For this error we associate the loss term λki, and with that we have a loss matrix L whose (k, i) entry is such a loss.
- 88. Thus we have a general loss associated with each class ωk. Definition: rk = Σ_{i=1}^{M} λki ∫_{Ri} p(x|ωk) dx (25). We want to minimize the global risk r = Σ_{k=1}^{M} rk P(ωk) = Σ_{i=1}^{M} ∫_{Ri} Σ_{k=1}^{M} λki p(x|ωk) P(ωk) dx (26). For this we want to select the set of partition regions Rj.
- 91. How do we do that? We minimize each integral ∫_{Ri} Σ_{k=1}^{M} λki p(x|ωk) P(ωk) dx (27), i.e., we need to minimize Σ_{k=1}^{M} λki p(x|ωk) P(ωk) (28). We can do the following: assign x to Ri when li = Σ_{k=1}^{M} λki p(x|ωk) P(ωk) < lj = Σ_{k=1}^{M} λkj p(x|ωk) P(ωk) (29) for all j ≠ i.
- 94. Remarks. When we use Kronecker's delta, δki = 1 if k = i and 0 otherwise (30), we can take λki = 1 − δki (31). We finish with Σ_{k=1, k≠i}^{M} p(x|ωk) P(ωk) < Σ_{k=1, k≠j}^{M} p(x|ωk) P(ωk) (32).
- 97. Then we have that p(x|ωj) P(ωj) < p(x|ωi) P(ωi) (33) for all j ≠ i.
- 98. Special case: the two-class case. We get l1 = λ11 p(x|ω1) P(ω1) + λ21 p(x|ω2) P(ω2) and l2 = λ12 p(x|ω1) P(ω1) + λ22 p(x|ω2) P(ω2). We assign x to ω1 if l1 < l2, i.e., if (λ21 − λ22) p(x|ω2) P(ω2) < (λ12 − λ11) p(x|ω1) P(ω1) (34). If we assume λii < λij (correct decisions are penalized much less than wrong ones), then x is assigned to ω1 (ω2) if the likelihood ratio l12 = p(x|ω1)/p(x|ω2) is greater (less) than the threshold [(λ21 − λ22)/(λ12 − λ11)] · [P(ω2)/P(ω1)].
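The likelihood-ratio form of rule (34) can be sketched directly; the loss values below are hypothetical, and with λ11 = λ22 = 0, λ12 = λ21 = 1 the rule reduces to the plain Bayes comparison:

```python
# Two-class decision via the likelihood ratio l12 = p(x|w1)/p(x|w2)
# against the loss- and prior-dependent threshold of Eq. (34).

def ratio_decide(lik1, lik2, p1, p2, lam11, lam12, lam21, lam22):
    """Assumes lam12 > lam11 and lik2 > 0."""
    threshold = ((lam21 - lam22) / (lam12 - lam11)) * (p2 / p1)
    return 1 if lik1 / lik2 > threshold else 2

# Symmetric 0-1 losses and equal priors: threshold is 1.
print(ratio_decide(0.5, 0.4, 0.5, 0.5, lam11=0, lam12=1, lam21=1, lam22=0))  # 1
```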
- 107. In general. First, instead of working with probabilities we work with an equivalent function of them, gi(x) = f(P(ωi|x)). Classic example: the monotonically increasing f(P(ωi|x)) = ln P(ωi|x). The decision test is now: classify x in ωi if gi(x) > gj(x) for all j ≠ i. The decision surfaces, separating contiguous regions, are described by gij(x) = gi(x) − gj(x) = 0, i, j = 1, 2, ..., M, i ≠ j.
- 111. Gaussian distribution. We can use the Gaussian class-conditional density p(x|ωi) = (1 / ((2π)^{l/2} |Σi|^{1/2})) exp(−(1/2)(x − μi)^T Σi^{−1} (x − μi)) (38) (example figure).
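Equation (38) can be written out explicitly for the 2-D case (l = 2). This is a plain-Python sketch with a hypothetical mean and covariance; a real implementation would typically use numpy or scipy:

```python
# 2-D Gaussian density of Eq. (38), with the 2x2 inverse and
# determinant computed by hand.
import math

def gauss2d(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma a 2x2 nested list."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    inv = [[ sigma[1][1] / det, -sigma[0][1] / det],
           [-sigma[1][0] / det,  sigma[0][0] / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    quad = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

print(gauss2d([0, 0], [0, 0], [[1, 0], [0, 1]]))  # 1/(2*pi) ≈ 0.1592
```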
- 113. Some properties of Σ. It is the covariance matrix between variables; thus it is positive definite, symmetric, and its inverse exists.
- 116. Influence of the covariance. Look at the covariance Σ = [[1, 0], [0, 1]]: it is simply the unit Gaussian centered at the mean (figure).
- 118. The covariance as a scaling. Look at the covariance Σ = [[4, 0], [0, 1]]: it stretches (flattens) the equal-density circle along the x axis (figure).
- 120. Influence of the covariance. Look at the covariance Σa = R Σb R^T with R = [[cos θ, −sin θ], [sin θ, cos θ]]: it allows rotating the axes (figure).
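The rotated covariance Σa = R Σb R^T above can be computed by hand for 2×2 matrices. A sketch starting from the diagonal Σb = diag(4, 1) of the previous slide and a hypothetical angle θ = 45°:

```python
# Build a rotated covariance: sigma_a = R @ sigma_b @ R^T for a
# 2x2 rotation R by angle theta, written out without numpy.
import math

def rotate_cov(sigma_b, theta):
    c, s = math.cos(theta), math.sin(theta)
    R = [[c, -s], [s, c]]
    RS = [[sum(R[i][k] * sigma_b[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]                       # R @ sigma_b
    return [[sum(RS[i][k] * R[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]                     # (R @ sigma_b) @ R^T

sigma_a = rotate_cov([[4, 0], [0, 1]], math.pi / 4)
print(sigma_a)  # ≈ [[2.5, 1.5], [1.5, 2.5]]: same eigenvalues, tilted axes
```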
- 122. Now for two classes. We use the following trick for two classes i = 1, 2: we know that the joint density of a correct classification is p(x, ωi) = p(x|ωi) P(ωi). Thus it is possible to generate the decision function gi(x) = ln[p(x|ωi) P(ωi)] = ln p(x|ωi) + ln P(ωi) (39), so gi(x) = −(1/2)(x − μi)^T Σi^{−1} (x − μi) + ln P(ωi) + ci (40).
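The discriminant (40) is easy to evaluate in the 1-D case, where the constant ci absorbs the −(1/2) ln(2π) term. A sketch with hypothetical class parameters:

```python
# 1-D Gaussian discriminant of Eq. (40):
# g_i(x) = -(x - mu_i)^2 / (2 var_i) - 0.5 ln(var_i) + ln P(w_i),
# dropping the shared -0.5*ln(2*pi) constant.
import math

def g(x, mu, var, prior):
    return -0.5 * (x - mu) ** 2 / var - 0.5 * math.log(var) + math.log(prior)

x = 1.4
g1 = g(x, mu=0.0, var=1.0, prior=0.5)   # hypothetical class w_1
g2 = g(x, mu=2.0, var=1.0, prior=0.5)   # hypothetical class w_2
print(1 if g1 > g2 else 2)  # 2: x is closer to mu_2
```

With equal variances and priors, comparing g1 and g2 reduces to picking the nearer mean, which is the geometric content of the decision surface g12(x) = 0.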
- 126. Given a series of classes ω1, ω2, ..., ωM. We assume for each class ωj that the samples are drawn independently according to the probability law p(x|ωj); we call such samples i.i.d. (independent identically distributed) random variables. We assume in addition that p(x|ωj) has a known parametric form with parameter vector θj.
- 129. Given a series of classes ω1, ω2, ..., ωM. For example, p(x|ωj) ∼ N(μj, Σj) (41). In our case we will assume that there is no dependence between classes.
- 131. Now suppose that ωj contains n samples x1, x2, ..., xn, with p(x1, x2, ..., xn|θj) = Π_{k=1}^{n} p(xk|θj) (42). We can then see p(x1, x2, ..., xn|θj) as a function of θj, the likelihood L(θj) = Π_{k=1}^{n} p(xk|θj) (43).
- 133. Example: the log-likelihood ln L(θj) = log Π_{k=1}^{n} p(xk|θj) (figure).
- 135. Maximum likelihood on a Gaussian. Then, using the log: ln L(μi, Σi) = −(n/2) ln|Σi| − (1/2) Σ_{j=1}^{n} (xj − μi)^T Σi^{−1} (xj − μi) + c2 (44). We know that d(x^T A x)/dx = (A + A^T) x and d(Ax)/dx = A (45). Thus, expanding equation (44): −(n/2) ln|Σi| − (1/2) Σ_{j=1}^{n} [xj^T Σi^{−1} xj − 2 xj^T Σi^{−1} μi + μi^T Σi^{−1} μi] + c2 (46).
- 138. Maximum likelihood. Then ∂ ln L(μi, Σi)/∂μi = Σ_{j=1}^{n} Σi^{−1} (xj − μi) = 0, i.e., −n Σi^{−1} μi + Σi^{−1} Σ_{j=1}^{n} xj = 0, so μ̂i = (1/n) Σ_{j=1}^{n} xj.
- 139. Maximum likelihood. Then we differentiate with respect to Σi. For this we use the following tricks: (1) ∂ ln|Σ^{−1}|/∂Σ^{−1} = (1/|Σ^{−1}|) ∂|Σ^{−1}|/∂Σ^{−1} = Σ^T; (2) ∂Tr[AB]/∂A = ∂Tr[BA]/∂A = B^T; (3) the trace of a number is the number itself; (4) Tr(A^T B) = Tr(B A^T). Thus f(μi, Σi) = −(n/2) ln|Σi| − (1/2) Σ_{j=1}^{n} (xj − μi)^T Σi^{−1} (xj − μi) + c1 (47).
- 140. Maximum likelihood. Thus f(μi, Σi) = −(n/2) ln|Σi| − (1/2) Σ_{j=1}^{n} Trace[(xj − μi)^T Σi^{−1} (xj − μi)] + c1 (48). Using the tricks: f(μi, Σi) = −(n/2) ln|Σi| − (1/2) Σ_{j=1}^{n} Trace[Σi^{−1} (xj − μi)(xj − μi)^T] + c1 (49).
- 142. Maximum likelihood. Taking the derivative with respect to Σi^{−1}: ∂f/∂Σi^{−1} = (n/2) Σi^T − (1/2) Σ_{j=1}^{n} [(xj − μi)(xj − μi)^T]^T (50). Thus, when making it equal to zero, Σ̂i = (1/n) Σ_{j=1}^{n} (xj − μ̂i)(xj − μ̂i)^T (51).
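The closed-form ML estimates just derived are the sample mean and the (biased, 1/n) sample covariance. A sketch for 1-D data with hypothetical values:

```python
# ML estimates for a 1-D Gaussian: mu_hat = sample mean,
# sigma2_hat = (1/n) sum of squared deviations (note 1/n, not 1/(n-1)).

def mle_gaussian(xs):
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

mu, sigma2 = mle_gaussian([1.0, 2.0, 3.0, 4.0])
print(mu, sigma2)  # 2.5 1.25
```

The 1/n factor makes Σ̂i a biased estimator of the true covariance; the unbiased variant divides by n − 1 instead.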
- 144. Exercises. Duda and Hart, Chapter 3: 3.1, 3.2, 3.3, 3.13. Theodoridis, Chapter 2: 2.5, 2.7, 2.10, 2.12, 2.14, 2.17.