06 Machine Learning - Naive Bayes
Machine Learning for Data Mining: Introduction to Bayesian Classifiers
Andres Mendez-Vazquez
August 3, 2015
Outline
1 Introduction
  Supervised Learning
  Naive Bayes
  The Naive Bayes Model
  The Multi-Class Case
  Minimizing the Average Risk
2 Discriminant Functions and Decision Surfaces
  Introduction
  Gaussian Distribution
  Influence of the Covariance
  Maximum Likelihood Principle
  Maximum Likelihood on a Gaussian
Classification problem
Training Data
Samples of the form (d, h(d)), where:
d are the data objects to classify (the inputs).
h(d) is the correct class label for d, with h(d) ∈ {1, . . . , K}.
Classification Problem
Goal
Given d_new, provide h(d_new).
The machinery, in general, looks like this:
[Figure: a Supervised Learning block maps INPUT to OUTPUT, guided by training info (the desired/target output).]
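To make this setup concrete, here is a minimal sketch of the supervised-learning interface described above; the toy data and the nearest-neighbor rule are illustrative assumptions, not something from the slides.

# Training data: samples of the form (d, h(d)), with d the input and h(d) its class label.
training_data = [(1.0, 1), (1.2, 1), (3.8, 2), (4.1, 2)]

def predict(d_new):
    # Toy stand-in for a learned classifier: return the label of the closest training input.
    d, label = min(training_data, key=lambda pair: abs(pair[0] - d_new))
    return label

print(predict(3.9))  # -> 2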
Naive Bayes Model
Task for two classes
Let ω1, ω2 be the two classes to which our samples may belong.
There is a prior probability of belonging to each class: P(ω1) for class ω1 and P(ω2) for class ω2.
The rule for classification is the following one:

P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)}    (1)

Remark: Bayes' theorem, taken to the next level.
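A minimal sketch of equation (1) in code; the function name and the numbers in the example call are assumptions made up for illustration.

def bayes_posteriors(likelihoods, priors):
    # likelihoods[i] = P(x | omega_i) and priors[i] = P(omega_i), for one fixed observation x.
    # The evidence P(x) is the normalizer that makes the posteriors sum to one.
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Hypothetical likelihood values with uniform priors.
print(bayes_posteriors([0.6, 0.1], [0.5, 0.5]))  # -> [0.857..., 0.142...]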
In Informal English
We have that

posterior = (likelihood × prior information) / evidence    (2)

Basically
One: If we can observe x.
Two: We can convert the prior information to the posterior information.
We have the following terms...
Likelihood
We call p(x|ωi) the likelihood of ωi given x:
This indicates that, given a category ωi, if p(x|ωi) is large, then ωi is the likely class of x.
Prior Probability
It is the known probability of a given class.
Remark: because we often lack information about this probability, we tend to use the uniform distribution.
However: we can use other tricks for it.
Evidence
The evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one.
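To see how these three terms interact, here is a small worked example with invented numbers (not taken from the slides). Suppose, for some observed x,

p(x \mid \omega_1) = 0.5, \qquad p(x \mid \omega_2) = 0.2, \qquad P(\omega_1) = P(\omega_2) = 0.5.

Then

P(x) = 0.5 \cdot 0.5 + 0.2 \cdot 0.5 = 0.35, \qquad P(\omega_1 \mid x) = \frac{0.25}{0.35} = \frac{5}{7}, \qquad P(\omega_2 \mid x) = \frac{0.10}{0.35} = \frac{2}{7},

and the posteriors sum to one precisely because the evidence P(x) is their common normalizer.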
The most important term in all this
The factor

likelihood × prior information    (3)
Example
We have the likelihoods of the two classes:
[Figure: class-conditional likelihoods p(x|ω1) and p(x|ω2).]
Example
We have the posteriors of the two classes when P(ω1) = 2/3 and P(ω2) = 1/3:
[Figure: posterior probabilities P(ω1|x) and P(ω2|x) for these priors.]
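A sketch that reproduces this kind of posterior curve; the Gaussian likelihoods N(0, 1) and N(2, 1) are an assumption made here (the slides do not specify them), while the priors 2/3 and 1/3 match the example.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4.0, 6.0, 500)
# Assumed class-conditional likelihoods.
p_x_w1 = norm.pdf(x, loc=0.0, scale=1.0)    # p(x | omega_1)
p_x_w2 = norm.pdf(x, loc=2.0, scale=1.0)    # p(x | omega_2)
prior_w1, prior_w2 = 2.0 / 3.0, 1.0 / 3.0   # priors from the example

evidence = p_x_w1 * prior_w1 + p_x_w2 * prior_w2
post_w1 = p_x_w1 * prior_w1 / evidence      # P(omega_1 | x)
post_w2 = p_x_w2 * prior_w2 / evidence      # P(omega_2 | x)
assert np.allclose(post_w1 + post_w2, 1.0)  # posteriors sum to one

plt.plot(x, post_w1, label="P(w1 | x)")
plt.plot(x, post_w2, label="P(w2 | x)")
plt.legend()
plt.show()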
Naive Bayes Model
In the case of two classes

P(x) = \sum_{i=1}^{2} p(x, \omega_i) = \sum_{i=1}^{2} p(x \mid \omega_i) P(\omega_i)    (4)
Error in this rule
We have that

P(error \mid x) =
\begin{cases}
P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\
P(\omega_2 \mid x) & \text{if we decide } \omega_1
\end{cases}    (5)

Thus, we have that

P(error) = \int P(error, x)\, dx = \int P(error \mid x)\, p(x)\, dx    (6)
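A sketch that evaluates equations (5) and (6) numerically, reusing the assumed Gaussian likelihoods and the 2/3, 1/3 priors from the example above; the grid-based integration is only an approximation for illustration.

import numpy as np
from scipy.stats import norm

x = np.linspace(-10.0, 12.0, 20001)
dx = x[1] - x[0]
p_x_w1 = norm.pdf(x, 0.0, 1.0)
p_x_w2 = norm.pdf(x, 2.0, 1.0)
prior_w1, prior_w2 = 2.0 / 3.0, 1.0 / 3.0

p_x = p_x_w1 * prior_w1 + p_x_w2 * prior_w2   # evidence p(x), as in eq. (4)
post_w1 = p_x_w1 * prior_w1 / p_x
post_w2 = p_x_w2 * prior_w2 / p_x

# Eq. (5): deciding the more probable class leaves the other posterior as the error.
p_error_given_x = np.minimum(post_w1, post_w2)

# Eq. (6): P(error) = integral of P(error | x) p(x) dx, approximated on the grid.
p_error = np.sum(p_error_given_x * p_x) * dx
print(p_error)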
Classification Rule
Thus, we have the Bayes Classification Rule
1 If P(ω1|x) > P(ω2|x), x is classified to ω1.
2 If P(ω1|x) < P(ω2|x), x is classified to ω2.
What if we remove the normalization factor?
Remember

P(\omega_1 \mid x) + P(\omega_2 \mid x) = 1    (7)

We are able to obtain the new Bayes Classification Rule
1 If P(x|ω1) P(ω1) > P(x|ω2) P(ω2), x is classified to ω1.
2 If P(x|ω1) P(ω1) < P(x|ω2) P(ω2), x is classified to ω2.
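A minimal sketch of this unnormalized rule; the likelihood functions and priors passed in are hypothetical placeholders for whatever model is actually used.

def classify(x, likelihood_1, likelihood_2, prior_1, prior_2):
    # Compare P(x | w1) P(w1) against P(x | w2) P(w2); the evidence P(x) cancels,
    # so the decision is the same as comparing the posteriors directly.
    score_1 = likelihood_1(x) * prior_1
    score_2 = likelihood_2(x) * prior_2
    return 1 if score_1 > score_2 else 2

# Hypothetical Gaussian likelihoods with the 2/3, 1/3 priors used earlier.
from scipy.stats import norm
print(classify(1.5, lambda v: norm.pdf(v, 0, 1), lambda v: norm.pdf(v, 2, 1), 2/3, 1/3))  # -> 2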
We have several cases
If for some x we have P(x|ω1) = P(x|ω2)
The final decision depends entirely on the prior probabilities.
On the