Machine Learning: Introduction to Bayesian Classification

Machine Learning for Data Mining: Introduction to Bayesian Classifiers
Andres Mendez-Vazquez
August 3, 2015
Outline
1 Introduction
   Supervised Learning
   Naive Bayes
      The Naive Bayes Model
      The Multi-Class Case
   Minimizing the Average Risk
2 Discriminant Functions and Decision Surfaces
   Introduction
   Gaussian Distribution
   Influence of the Covariance Σ
   Maximum Likelihood Principle
   Maximum Likelihood on a Gaussian
Classification Problem
Training Data: samples of the form (d, h(d))
   d: the data objects to classify (inputs)
   h(d): the correct class information for d, h(d) ∈ {1, ..., K}
Classification Problem
Goal: given d_new, provide h(d_new).
The machinery in general: a supervised learner maps an INPUT to an OUTPUT, guided by training info (the desired/target output).
Naive Bayes Model
Task for two classes: let ω1, ω2 be the two classes to which our samples belong.
There is a prior probability of belonging to each class:
   P(ω1) for class 1.
   P(ω2) for class 2.
The rule for classification is the following:

P(ω_i|x) = P(x|ω_i) P(ω_i) / P(x)   (1)

Remark: Bayes' rule taken to the next level.
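As a small illustration (not part of the original slides), a minimal sketch of equation (1) in Python; the numeric likelihoods and priors below are made up for the example:

```python
import numpy as np

def posterior(likelihoods, priors):
    """Bayes rule: P(w_i|x) = P(x|w_i) P(w_i) / P(x),
    where the evidence P(x) = sum_j P(x|w_j) P(w_j)."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = likelihoods * priors      # P(x|w_i) P(w_i) for each class
    return joint / joint.sum()        # divide by the evidence P(x)

# Two hypothetical classes: p(x|w1) = 0.3, p(x|w2) = 0.1, equal priors.
post = posterior([0.3, 0.1], [0.5, 0.5])
```

Note how the evidence only rescales the two products so the posteriors sum to one; it never changes which class wins.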
In Informal English
We have that

posterior = (likelihood × prior information) / evidence   (2)

Basically:
   One: we can observe x.
   Two: we can convert the prior information into posterior information.
We have the following terms...
Likelihood: we call p(x|ω_i) the likelihood of ω_i given x.
   Given a category ω_i: if p(x|ω_i) is “large”, then ω_i is the “likely” class of x.
Prior Probability: the known probability of a given class.
   Remark: when we lack information about a class, we tend to use the uniform distribution.
   However, other choices are possible.
Evidence: the evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one.
The most important term in all this
The factor

likelihood × prior information   (3)
Example
We have the likelihoods of the two classes (figure).
Example
We have the posteriors of the two classes when P(ω1) = 2/3 and P(ω2) = 1/3 (figure).
Naive Bayes Model
In the case of two classes:

P(x) = Σ_{i=1}^{2} p(x, ω_i) = Σ_{i=1}^{2} p(x|ω_i) P(ω_i)   (4)
Error in this rule
We have that

P(error|x) = { P(ω1|x) if we decide ω2
             { P(ω2|x) if we decide ω1   (5)

Thus, we have that

P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error|x) p(x) dx   (6)
Classification Rule
Thus, we have the Bayes classification rule:
1 If P(ω1|x) > P(ω2|x), x is classified to ω1.
2 If P(ω1|x) < P(ω2|x), x is classified to ω2.
What if we remove the normalization factor?
Remember

P(ω1|x) + P(ω2|x) = 1   (7)

We obtain the equivalent Bayes classification rule:
1 If P(x|ω1) P(ω1) > P(x|ω2) P(ω2), x is classified to ω1.
2 If P(x|ω1) P(ω1) < P(x|ω2) P(ω2), x is classified to ω2.
We have several cases
If for some x we have P(x|ω1) = P(x|ω2):
   The final decision depends entirely on the prior probabilities.
On the other hand, if P(ω1) = P(ω2), each “state” is equally probable:
   In this case the decision is based entirely on the likelihoods P(x|ω_i).
How the rule looks
If P(ω1) = P(ω2), the rule depends only on the term p(x|ω_i) (figure).
The Error in the Second Case of Naive Bayes
Error for equiprobable classes:

P(error) = (1/2) ∫_{−∞}^{x0} p(x|ω2) dx + (1/2) ∫_{x0}^{∞} p(x|ω1) dx   (8)

Remark: P(ω1) = P(ω2) = 1/2.
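Equation (8) can be checked numerically. The sketch below (not from the slides) assumes two hypothetical class-conditional densities N(−1, 1) and N(1, 1) with equal priors, so by symmetry the threshold is x0 = 0:

```python
import numpy as np

def gauss(x, mu, sigma):
    """1-D normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-10.0, 10.0, 200001)   # fine grid standing in for (-inf, inf)
dx = x[1] - x[0]
x0 = 0.0                               # crossing point of the two posteriors
p1 = gauss(x, -1.0, 1.0)               # p(x|w1)
p2 = gauss(x, +1.0, 1.0)               # p(x|w2)

# P(error) = 1/2 * int_{-inf}^{x0} p(x|w2) dx + 1/2 * int_{x0}^{inf} p(x|w1) dx
p_error = 0.5 * p2[x < x0].sum() * dx + 0.5 * p1[x >= x0].sum() * dx
```

Each tail integral equals the standard normal CDF at −1, about 0.1587, which is the shaded overlap area in the usual picture.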
What do we want to prove?
Something notable: the Bayesian classifier is optimal with respect to minimizing the classification error probability.
Proof
Step 1
   Let R1 be the region of the feature space in which we decide in favor of ω1.
   Let R2 be the region of the feature space in which we decide in favor of ω2.
Step 2

P_e = P(x ∈ R2, ω1) + P(x ∈ R1, ω2)   (9)

Thus

P_e = P(x ∈ R2|ω1) P(ω1) + P(x ∈ R1|ω2) P(ω2)
    = P(ω1) ∫_{R2} p(x|ω1) dx + P(ω2) ∫_{R1} p(x|ω2) dx
Proof
Furthermore

P_e = P(ω1) ∫_{R2} [p(ω1, x)/P(ω1)] dx + P(ω2) ∫_{R1} [p(ω2, x)/P(ω2)] dx   (10)

Finally

P_e = ∫_{R2} p(ω1|x) p(x) dx + ∫_{R1} p(ω2|x) p(x) dx   (11)

Now, we choose the Bayes classification rule:
   R1: P(ω1|x) > P(ω2|x)
   R2: P(ω2|x) > P(ω1|x)
Proof
Thus

P(ω1) = ∫_{R1} p(ω1|x) p(x) dx + ∫_{R2} p(ω1|x) p(x) dx   (12)

Now, we have...

P(ω1) − ∫_{R1} p(ω1|x) p(x) dx = ∫_{R2} p(ω1|x) p(x) dx   (13)

Then

P_e = P(ω1) − ∫_{R1} p(ω1|x) p(x) dx + ∫_{R1} p(ω2|x) p(x) dx   (14)
Graphically P(ω1): Thanks Edith, 2013 Class!!!
Shown in red (figure).
Thus, we have

∫_{R1} p(ω1|x) p(x) dx = ∫_{R1} p(ω1, x) dx = P_{R1}(ω1)

(figure)
Finally

P_e = P(ω1) − ∫_{R1} [p(ω1|x) − p(ω2|x)] p(x) dx   (15)

Thus, we have

P_e = [P(ω1) − ∫_{R1} p(ω1|x) p(x) dx] + ∫_{R1} p(ω2|x) p(x) dx
P_e for a non-optimal rule
A great idea, Edith!!! (figure)
Which decision function minimizes the error?
A single number in this case (figure).
The error is minimized by the Bayes rule
Thus, the probability of error is minimized by choosing the regions of space:
   R1: P(ω1|x) > P(ω2|x)
   R2: P(ω2|x) > P(ω1|x)
P_e for an optimal rule
A great idea, Edith!!! (figure)
For M classes ω1, ω2, ..., ωM
We say that vector x is in ω_i if

P(ω_i|x) > P(ω_j|x) ∀ j ≠ i   (16)

Something notable: it turns out that such a choice also minimizes the classification error probability.
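Rule (16) is just an argmax over classes; since the evidence P(x) is common to every class, comparing p(x|ω_i)P(ω_i) suffices. A minimal sketch with made-up three-class numbers:

```python
import numpy as np

# Hypothetical three-class example: p(x|w_i) at one observed x, and priors P(w_i).
likelihoods = np.array([0.05, 0.20, 0.10])   # p(x|w1), p(x|w2), p(x|w3)
priors      = np.array([0.50, 0.25, 0.25])   # P(w1), P(w2), P(w3)

# Assign x to the class with the largest posterior; the evidence P(x)
# cancels, so argmax over p(x|w_i) * P(w_i) gives the same answer.
decided = int(np.argmax(likelihoods * priors))   # 0-based class index
```

Here the products are (0.025, 0.05, 0.025), so the rule picks the second class even though the first has the largest prior.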
Minimizing the Risk
Something notable: the classification error probability is not always the best criterion to be adopted for minimization.
   It gives all errors the same importance.
However, certain errors are more important than others. For example:
   Really serious: a doctor makes a wrong decision and a malignant tumor gets classified as benign.
   Not so serious: a benign tumor gets classified as malignant.
It is based on the following idea
In order to measure the predictive performance of a function f : X → Y, we use a loss function

ℓ : Y × Y → R+   (17)

a non-negative function that quantifies how bad the prediction f(x) is when the true label is y.
Thus, we say that ℓ(f(x), y) is the loss incurred by f on the pair (x, y).
Example
In the classification case, binary or otherwise, a natural loss function is the 0-1 loss, where y′ = f(x):

ℓ(y′, y) = 1[y′ ≠ y]   (18)
Furthermore
For regression problems, some natural choices are:
1 Squared loss: ℓ(y′, y) = (y′ − y)²
2 Absolute loss: ℓ(y′, y) = |y′ − y|
Thus, given the loss function, we can define the risk as

R(f) = E_{(X,Y)}[ℓ(f(X), Y)]   (19)

Although we cannot see the expected risk, we can use the sample to estimate

R̂(f) = (1/N) Σ_{i=1}^{N} ℓ(f(x_i), y_i)   (20)
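The empirical risk (20) is just an average of per-sample losses. A short sketch with a made-up sign classifier and a tiny hypothetical sample:

```python
import numpy as np

def empirical_risk(f, xs, ys, loss):
    """R_hat(f) = (1/N) sum_i loss(f(x_i), y_i), as in equation (20)."""
    return float(np.mean([loss(f(x), y) for x, y in zip(xs, ys)]))

zero_one = lambda yp, y: float(yp != y)   # 0-1 loss of equation (18)
squared  = lambda yp, y: (yp - y) ** 2    # squared loss, for regression

# Toy classifier: the sign of x (made-up example, not from the slides).
f = lambda x: 1 if x >= 0 else -1
xs = [-2.0, -0.5, 0.3, 1.0]
ys = [-1, 1, 1, 1]                        # one label disagrees with sign(x)
risk = empirical_risk(f, xs, ys, zero_one)
```

With one mistake out of four samples, the empirical 0-1 risk is 0.25, i.e. the training error rate.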
Thus
Risk minimization: minimizing the empirical risk over a fixed class F ⊆ Y^X of functions leads to a very important learning rule, named empirical risk minimization (ERM):

f̂_N = arg min_{f∈F} R̂(f)   (21)

If we knew the distribution P and did not restrict ourselves to F, the best function would be

f* = arg min_f R(f)   (22)
Thus
For classification with 0-1 loss, f* is called the Bayes classifier:

f*(x) = arg max_{y∈Y} P(Y = y|X = x)   (23)
The New Risk Function
Now, we have the following loss values for two classes:
   λ12 is the loss incurred if the sample is in class 1, but is classified as class 2.
   λ21 is the loss incurred if the sample is in class 2, but is classified as class 1.
We can generate the new risk function:

r = λ12 P_e(x, ω1) + λ21 P_e(x, ω2)
  = λ12 ∫_{R2} p(x, ω1) dx + λ21 ∫_{R1} p(x, ω2) dx
  = λ12 ∫_{R2} p(x|ω1) P(ω1) dx + λ21 ∫_{R1} p(x|ω2) P(ω2) dx
The New Risk Function
The new risk function to be minimized is

r = λ12 P(ω1) ∫_{R2} p(x|ω1) dx + λ21 P(ω2) ∫_{R1} p(x|ω2) dx   (24)

where the λ terms work as weight factors for each error. Thus, we could have, for example, λ12 > λ21.
Then, errors due to the assignment of patterns originating from class 1 to class 2 will have a larger effect on the cost function than the errors associated with the second term in the summation.
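A pointwise sketch of the weighted decision behind (24), assuming zero loss on correct decisions (λ11 = λ22 = 0); all numbers are hypothetical:

```python
def decide(p_x_w1, p_x_w2, P1, P2, lam12, lam21):
    """Return 1 or 2: the class whose choice incurs the smaller expected loss
    at this x (lambda_11 = lambda_22 = 0 assumed)."""
    l1 = lam21 * p_x_w2 * P2   # expected loss if we decide w1 (true class may be w2)
    l2 = lam12 * p_x_w1 * P1   # expected loss if we decide w2 (true class may be w1)
    return 1 if l1 < l2 else 2

# With lam12 >> lam21, a borderline x is pushed toward class 1,
# even though the plain Bayes rule would pick class 2 here.
weighted   = decide(0.2, 0.3, 0.5, 0.5, lam12=10.0, lam21=1.0)
unweighted = decide(0.2, 0.3, 0.5, 0.5, lam12=1.0, lam21=1.0)
```

The same likelihoods give different decisions once λ12 dominates, which is exactly the tumor example: make the costly mistake (missing class 1) as rare as possible.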
Now, consider an M-class problem
We have regions R_j, j = 1, ..., M, where the classes ω_j live.
Now, consider the following error: assume x belongs to class ω_k but lies in R_i with i ≠ k, so the vector is misclassified.
For this error we associate the loss λ_ki; with that we have a loss matrix L whose (k, i) entry corresponds to such a loss.
Thus, we have a general loss associated with each class ω_k
Definition

r_k = Σ_{i=1}^{M} λ_ki ∫_{R_i} p(x|ω_k) dx   (25)

We want to minimize the global risk

r = Σ_{k=1}^{M} r_k P(ω_k) = Σ_{i=1}^{M} ∫_{R_i} Σ_{k=1}^{M} λ_ki p(x|ω_k) P(ω_k) dx   (26)

For this, we want to select the set of partition regions R_j.
How do we do that?
We minimize each integral

∫_{R_i} Σ_{k=1}^{M} λ_ki p(x|ω_k) P(ω_k) dx   (27)

so, pointwise, we need to minimize

Σ_{k=1}^{M} λ_ki p(x|ω_k) P(ω_k)   (28)

We can do the following: assign x ∈ R_i if

l_i = Σ_{k=1}^{M} λ_ki p(x|ω_k) P(ω_k) < l_j = Σ_{k=1}^{M} λ_kj p(x|ω_k) P(ω_k)   (29)

for all j ≠ i.
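Rule (29) is a matrix-vector product followed by an argmin. A sketch with a hypothetical M = 3 loss matrix and made-up likelihoods:

```python
import numpy as np

# Hypothetical loss matrix L[k, i] = lambda_ki: loss when the true class is
# w_k but we decide w_i (zero on the diagonal, larger for "worse" confusions).
L = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
likelihoods = np.array([0.1, 0.3, 0.2])   # p(x|w_k) at the observed x
priors = np.array([1/3, 1/3, 1/3])        # P(w_k)

weighted = likelihoods * priors           # p(x|w_k) P(w_k)
l = L.T @ weighted                        # l_i = sum_k lambda_ki * weighted_k
decision = int(np.argmin(l))              # pick the class with the smallest l_i
```

With 0-1 losses (λ_ki = 1 − δ_ki, as in the next slide) this reduces to the plain maximum-posterior rule.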
Remarks
Using Kronecker’s delta

δ_ki = { 1 if k = i
       { 0 if k ≠ i   (30)

we can take

λ_ki = 1 − δ_ki   (31)

We finish with

Σ_{k=1, k≠i}^{M} p(x|ω_k) P(ω_k) < Σ_{k=1, k≠j}^{M} p(x|ω_k) P(ω_k)   (32)
Then
We have that

p(x|ω_j) P(ω_j) < p(x|ω_i) P(ω_i)   (33)

for all j ≠ i.
Special Case: The Two-Class Case
We get

l1 = λ11 p(x|ω1) P(ω1) + λ21 p(x|ω2) P(ω2)
l2 = λ12 p(x|ω1) P(ω1) + λ22 p(x|ω2) P(ω2)

We assign x to ω1 if l1 < l2, i.e.

(λ21 − λ22) p(x|ω2) P(ω2) < (λ12 − λ11) p(x|ω1) P(ω1)   (34)

If we assume that λii < λij (correct decisions are penalized much less than wrong ones):

x ∈ ω1 (ω2) if l12 ≡ p(x|ω1)/p(x|ω2) > (<) [P(ω2)/P(ω1)] × [(λ21 − λ22)/(λ12 − λ11)]   (35)
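Test (35) in code: compare the likelihood ratio against a single threshold built from the priors and losses. A sketch with assumed values:

```python
def lrt_class(p_x_w1, p_x_w2, P1, P2, lam11, lam12, lam21, lam22):
    """Likelihood-ratio test of equation (35): returns 1 or 2.
    Requires lam12 > lam11 and lam21 > lam22 (errors cost more)."""
    l12 = p_x_w1 / p_x_w2                                  # likelihood ratio
    threshold = (P2 / P1) * (lam21 - lam22) / (lam12 - lam11)
    return 1 if l12 > threshold else 2

# Hypothetical numbers: zero loss on correct decisions, unit loss on errors,
# equal priors; then the threshold is 1 and the test is the plain Bayes rule.
c = lrt_class(0.4, 0.2, 0.5, 0.5, lam11=0.0, lam12=1.0, lam21=1.0, lam22=0.0)
```

Changing only the λ values shifts the threshold without touching the densities, which is why the likelihood ratio is the natural test statistic here.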
Special Case: The Two-Class Case
Definition: l12 is known as the likelihood ratio, and the preceding test as the likelihood ratio test.
Decision Surface
Because the regions R1 and R2 are contiguous, the separating surface between them is described by

P(ω1|x) − P(ω2|x) = 0   (36)

Thus, we define the decision function as

g12(x) = P(ω1|x) − P(ω2|x) = 0   (37)
Which decision function for the Naive Bayes?
A single number in this case (figure).
In general
First, instead of working with probabilities, we work with an equivalent function of them, g_i(x) = f(P(ω_i|x)).
   Classic example, a monotonically increasing choice: f(P(ω_i|x)) = ln P(ω_i|x).
The decision test is now: classify x in ω_i if g_i(x) > g_j(x) ∀ j ≠ i.
The decision surfaces, separating contiguous regions, are described by
   g_ij(x) = g_i(x) − g_j(x) = 0,  i, j = 1, 2, ..., M,  i ≠ j
Gaussian Distribution
We can use the Gaussian distribution

p(x|ω_i) = [1 / ((2π)^{l/2} |Σ_i|^{1/2})] exp{ −(1/2)(x − µ_i)^T Σ_i^{−1} (x − µ_i) }   (38)

Example (figure).
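Equation (38) translates directly into a few lines of NumPy; this sketch evaluates the density at a point (the 2-D test values are made up):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate normal density of equation (38) for an l-dimensional x."""
    l = len(mu)
    diff = x - mu
    # Solve Sigma z = diff instead of forming the explicit inverse.
    quad = diff @ np.linalg.solve(Sigma, diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# At the mean with Sigma = I in 2-D, the density is 1 / (2*pi).
p = gaussian_density(np.array([0.0, 0.0]), np.array([0.0, 0.0]), np.eye(2))
```

Using `np.linalg.solve` rather than `np.linalg.inv` is the usual numerically safer choice for the quadratic form.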
Some Properties
About Σ: it is the covariance matrix between variables. Thus:
   It is positive definite.
   It is symmetric.
   Its inverse exists.
Influence of the Covariance Σ
Look at the following covariance:

Σ = [ 1 0
      0 1 ]

This is simply the unit Gaussian centered at the mean µ (figure).
The Covariance Σ as a Scaling
Look at the following covariance:

Σ = [ 4 0
      0 1 ]

It stretches the unit circle into an ellipse along the x-axis (figure).
Influence of the Covariance Σ
Look at the following covariance:

Σ_a = R Σ_b R^T  with  R = [ cos θ  −sin θ
                             sin θ   cos θ ]

It allows us to rotate the axes (figure).
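The rotated covariance Σ_a = R Σ_b Rᵀ can be built in a few lines; this sketch uses an assumed θ = π/4 and the diagonal Σ from the previous slide:

```python
import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # proper 2-D rotation matrix
Sigma_b = np.diag([4.0, 1.0])                     # axis-aligned covariance
Sigma_a = R @ Sigma_b @ R.T                       # rotated covariance

# A rotation preserves the eigenvalues: the ellipse keeps its axis lengths,
# only its orientation changes.
eigs = np.sort(np.linalg.eigvalsh(Sigma_a))
```

The eigenvectors of Σ_a are the columns of R, so the Gaussian's ellipse is tilted by θ while keeping the same shape.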
Now For Two Classes
Then, we use the following trick for two classes i = 1, 2. We know that the joint pdf of a correct classification is p(x, ω_i) = p(x|ω_i) P(ω_i).
Thus, it is possible to generate the following decision function:

g_i(x) = ln[p(x|ω_i) P(ω_i)] = ln p(x|ω_i) + ln P(ω_i)   (39)

Thus

g_i(x) = −(1/2)(x − µ_i)^T Σ_i^{−1} (x − µ_i) + ln P(ω_i) + c_i   (40)
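Equations (39)-(40) can be sketched as a small classifier (the means, priors, and test point are illustrative; the class-dependent −½ ln|Σi| term is kept explicitly instead of being folded into ci, and only the class-independent (l/2) ln 2π constant is dropped):

```python
import numpy as np

def discriminant(x, mu, Sigma, prior):
    """g_i(x) of Eq. (40); the class-independent (l/2) ln(2 pi) term is dropped."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Two classes with equal priors and unit covariance
mu1, mu2 = np.zeros(2), np.array([3.0, 3.0])
x = np.array([0.5, 0.5])
g1 = discriminant(x, mu1, np.eye(2), 0.5)
g2 = discriminant(x, mu2, np.eye(2), 0.5)
label = 1 if g1 > g2 else 2  # pick the class with the larger discriminant
```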
Outline
1 IntroductionSupervised LearningNaive Bayes
The Naive Bayes ModelThe MultiClass Case
Minimizing the Average Risk
2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian
Given a series of classes ω1, ω2, ..., ωM

We assume that for each class ωj the samples are drawn independently according to the probability law p(x | ωj).

We call such samples i.i.d.: independent, identically distributed random variables.

We assume in addition that p(x | ωj) has a known parametric form with parameter vector θj.
Given a series of classes ω1, ω2, ..., ωM

For example:

$$p(x \mid \omega_j) \sim N(\mu_j, \Sigma_j) \quad (41)$$

In our case, we will assume that there is no dependence between classes.
Now

Suppose that class ωj contains n samples x1, x2, ..., xn. By independence,

$$p(x_1, x_2, \ldots, x_n \mid \theta_j) = \prod_{k=1}^{n} p(x_k \mid \theta_j) \quad (42)$$

We can then see p(x1, x2, ..., xn | θj) as a function of θj, the likelihood:

$$L(\theta_j) = \prod_{k=1}^{n} p(x_k \mid \theta_j) \quad (43)$$
Example

In practice we work with the log-likelihood (figure of its plot omitted):

$$\ln L(\theta_j) = \ln \prod_{k=1}^{n} p(x_k \mid \theta_j) = \sum_{k=1}^{n} \ln p(x_k \mid \theta_j)$$
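The likelihood view of Eq. (43) can be illustrated numerically: draw i.i.d. samples and scan ln L over candidate parameter values. A sketch for a unit-variance 1-D Gaussian (the data, seed, and grid are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=500)  # i.i.d. draws, true mean 2.0

def log_likelihood(mu, x):
    """ln L(mu) = sum_k ln p(x_k | mu) for a unit-variance 1-D Gaussian."""
    return np.sum(-0.5 * np.log(2.0 * np.pi) - 0.5 * (x - mu) ** 2)

# The log-likelihood peaks near the true mean
grid = np.linspace(0.0, 4.0, 81)
best = max(grid, key=lambda m: log_likelihood(m, samples))
```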
Outline
1 IntroductionSupervised LearningNaive Bayes
The Naive Bayes ModelThe MultiClass Case
Minimizing the Average Risk
2 Discriminant Functions and Decision SurfacesIntroductionGaussian DistributionInfluence of the Covariance ΣMaximum Likelihood PrincipleMaximum Likelihood on a Gaussian
Maximum Likelihood on a Gaussian

Then, taking the log of the Gaussian likelihood for class ωi:

$$\ln L(\mu_i, \Sigma_i) = -\frac{n}{2}\ln|\Sigma_i| - \frac{1}{2}\sum_{j=1}^{n}(x_j - \mu_i)^T \Sigma_i^{-1}(x_j - \mu_i) + c_2 \quad (44)$$

We know that

$$\frac{\partial\, x^T A x}{\partial x} = Ax + A^T x, \qquad \frac{\partial\, Ax}{\partial x} = A \quad (45)$$

Thus, we expand equation (44):

$$-\frac{n}{2}\ln|\Sigma_i| - \frac{1}{2}\sum_{j=1}^{n}\left[x_j^T \Sigma_i^{-1} x_j - 2x_j^T \Sigma_i^{-1} \mu_i + \mu_i^T \Sigma_i^{-1} \mu_i\right] + c_2 \quad (46)$$
Maximum Likelihood

Then

$$\frac{\partial \ln L}{\partial \mu_i} = \sum_{j=1}^{n} \Sigma_i^{-1}(x_j - \mu_i) = 0$$

$$n\,\Sigma_i^{-1}\left(-\mu_i + \frac{1}{n}\sum_{j=1}^{n} x_j\right) = 0$$

$$\hat{\mu}_i = \frac{1}{n}\sum_{j=1}^{n} x_j$$
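The closed form µ̂i = (1/n) Σj xj is just the sample mean; a quick numerical check on synthetic data (the true mean, seed, and sample size are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=np.eye(2), size=1000)

# hat mu_i = (1/n) sum_j x_j  -- the sample mean
mu_hat = X.mean(axis=0)
```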
Maximum Likelihood

Then, we derive with respect to Σi. For this we use the following tricks:

1. $\frac{\partial \ln|\Sigma|}{\partial \Sigma^{-1}} = -\Sigma^T = -\Sigma$
2. $\frac{\partial\, \mathrm{Tr}[AB]}{\partial A} = \frac{\partial\, \mathrm{Tr}[BA]}{\partial A} = B^T$
3. The trace of a number is the number itself.
4. $\mathrm{Tr}(A^T B) = \mathrm{Tr}(BA^T)$

Thus

$$f(\Sigma_i) = -\frac{n}{2}\ln|\Sigma_i| - \frac{1}{2}\sum_{j=1}^{n}\left[(x_j - \mu_i)^T \Sigma_i^{-1}(x_j - \mu_i)\right] + c_1 \quad (47)$$
Maximum Likelihood

Thus, since each quadratic form is a number (trick 3):

$$f(\Sigma_i) = -\frac{n}{2}\ln|\Sigma_i| - \frac{1}{2}\sum_{j=1}^{n}\left[\mathrm{Tr}\left\{(x_j - \mu_i)^T \Sigma_i^{-1}(x_j - \mu_i)\right\}\right] + c_1 \quad (48)$$

Applying the cyclic trace trick (4):

$$f(\Sigma_i) = -\frac{n}{2}\ln|\Sigma_i| - \frac{1}{2}\sum_{j=1}^{n}\left[\mathrm{Tr}\left\{\Sigma_i^{-1}(x_j - \mu_i)(x_j - \mu_i)^T\right\}\right] + c_1 \quad (49)$$
Maximum Likelihood

Taking the derivative with respect to $\Sigma_i^{-1}$ (using tricks 1 and 2):

$$\frac{\partial f(\Sigma_i)}{\partial \Sigma_i^{-1}} = \frac{n}{2}\Sigma_i - \frac{1}{2}\sum_{j=1}^{n}\left[(x_j - \mu_i)(x_j - \mu_i)^T\right]^T \quad (50)$$

Thus, setting it equal to zero:

$$\hat{\Sigma}_i = \frac{1}{n}\sum_{j=1}^{n}(x_j - \mu_i)(x_j - \mu_i)^T \quad (51)$$
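Equation (51) is the sample covariance with a 1/n factor (not the unbiased 1/(n−1)); a quick check on synthetic data (the true covariance, seed, and sample size are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=5000)

mu_hat = X.mean(axis=0)
diff = X - mu_hat
# hat Sigma_i = (1/n) sum_j (x_j - mu)(x_j - mu)^T   (Eq. 51)
Sigma_hat = diff.T @ diff / len(X)
```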
Exercises

Duda and Hart, Chapter 3: 3.1, 3.2, 3.3, 3.13

Theodoridis, Chapter 2: 2.5, 2.7, 2.10, 2.12, 2.14, 2.17