06 Machine Learning - Naive Bayes

  • Machine Learning for Data Mining: Introduction to Bayesian Classifiers

    Andres Mendez-Vazquez

    August 3, 2015

    1 / 71

  • Outline

    1 Introduction
        Supervised Learning
        Naive Bayes
            The Naive Bayes Model
            The Multi-Class Case
        Minimizing the Average Risk

    2 Discriminant Functions and Decision Surfaces
        Introduction
        Gaussian Distribution
        Influence of the Covariance
        Maximum Likelihood Principle
        Maximum Likelihood on a Gaussian

    2 / 71

  • Classification problem

    Training Data: samples of the form (d, h(d))

    d: the data objects to classify (inputs)

    h(d): the correct class label for d, with h(d) ∈ {1, . . . , K}

    3 / 71
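
To make this data format concrete, here is a tiny, purely hypothetical Python example of such training samples; the feature values and labels below are invented for illustration and do not come from the slides.

```python
# Hypothetical training set: each sample is a pair (d, h(d)),
# where d is the input object and h(d) ∈ {1, ..., K} is its correct class label.
training_data = [
    ([5.1, 3.5], 1),   # d = feature vector, h(d) = class label
    ([4.9, 3.0], 1),
    ([6.2, 2.9], 2),
    ([6.7, 3.1], 2),
]

inputs = [d for d, _ in training_data]   # the objects to classify
labels = [h for _, h in training_data]   # the correct class info h(d)
```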

  • Classification Problem

    Goal: given d_new, provide h(d_new)

    In general, the machinery looks like this:

    [Diagram: INPUT → Supervised Learning → OUTPUT, guided by Training Info (Desired/Target Output)]

    5 / 71
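
As a rough sketch of this input/output machinery, the toy learner below exposes a hypothetical fit/predict interface (not something defined in the slides): it is trained on (input, target output) pairs and then maps a new input to an output.

```python
# Minimal sketch of the supervised-learning machinery: train on (input, target) pairs,
# then map a new INPUT to an OUTPUT. The learner here is deliberately trivial.
class MajorityClassLearner:
    def fit(self, inputs, targets):
        # Training info: the desired/target outputs h(d).
        self.prediction = max(set(targets), key=targets.count)
        return self

    def predict(self, new_input):
        # OUTPUT for a new INPUT d_new: here, simply the most frequent training class.
        return self.prediction

learner = MajorityClassLearner().fit([[0.1], [0.4], [0.9]], [1, 1, 2])
print(learner.predict([0.5]))   # -> 1
```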

  • Naive Bayes Model

    Task for two classes: let ω_1, ω_2 be the two classes to which our samples belong.

    There is a prior probability of belonging to each class: P(ω_1) for class ω_1 and P(ω_2) for class ω_2.

    The rule for classification is the following one:

    P(ω_i | x) = P(x | ω_i) P(ω_i) / P(x)    (1)

    Remark: Bayes to the next level.

    7 / 71
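
As a minimal illustration of equation (1), the sketch below computes the posteriors for two classes, assuming Gaussian class-conditional likelihoods with made-up means, standard deviations, and priors (none of these numbers come from the slides).

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Univariate Gaussian density, used here as the class-conditional likelihood p(x | ω_i)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def posteriors(x, means, stds, priors):
    """Return P(ω_i | x) for every class i via Bayes' rule, equation (1)."""
    likelihoods = np.array([gaussian_pdf(x, m, s) for m, s in zip(means, stds)])
    joint = likelihoods * np.array(priors)   # p(x | ω_i) P(ω_i)
    return joint / joint.sum()               # divide by the evidence P(x)

# Hypothetical parameters for the two classes ω_1 and ω_2.
means, stds, priors = [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]
print(posteriors(1.2, means, stds, priors))  # [P(ω_1 | x=1.2), P(ω_2 | x=1.2)]
```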

  • In Informal English

    We have that

    posterior = (likelihood × prior information) / evidence    (2)

    Basically: One, if we can observe x; Two, we can convert the prior information into the posterior information.

    8 / 71

  • We have the following terms...

    Likelihood: we call p(x | ω_i) the likelihood of ω_i given x.

    This indicates that, given a category ω_i, if p(x | ω_i) is large, then ω_i is the likely class of x.

    Prior Probability: the known probability of a given class.

    Remark: because we lack information about the classes, we tend to use the uniform distribution.

    However, we can use other tricks for it.

    Evidence: the evidence factor can be seen as a scale factor that guarantees that the posterior probabilities sum to one.

    9 / 71
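
To make the "uniform prior versus other tricks" remark concrete, here is a small sketch contrasting a uniform prior with priors estimated as relative class frequencies from hypothetical training labels (the labels are invented for illustration).

```python
import numpy as np

# Hypothetical training labels h(d) for a two-class problem.
labels = np.array([1, 1, 2, 1, 2, 1, 1, 2])
classes = np.unique(labels)

# Uniform prior: used when we have no information about the classes.
uniform_prior = {c: 1.0 / len(classes) for c in classes}

# Empirical prior: relative frequency of each class in the training data.
empirical_prior = {c: float(np.mean(labels == c)) for c in classes}

print(uniform_prior)    # {1: 0.5, 2: 0.5}
print(empirical_prior)  # {1: 0.625, 2: 0.375}
```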

  • The most important term in all this

    The factor

    likelihood × prior information    (3)

    10 / 71

  • Example

    We have the likelihood of two classes

    11 / 71

  • Example

    We have the posteriors of the two classes when P(ω_1) = 2/3 and P(ω_2) = 1/3

    12 / 71
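
A sketch that reproduces this kind of posterior computation numerically, using the priors from the example, P(ω_1) = 2/3 and P(ω_2) = 1/3, but with assumed unit-variance Gaussian likelihoods, since the slide's exact likelihood curves are not reproduced in this transcript.

```python
import numpy as np

# Assumed class-conditional likelihoods (the slide's actual curves are not given here).
def p_x_given_w1(x):
    return np.exp(-0.5 * (x - 0.0) ** 2) / np.sqrt(2 * np.pi)

def p_x_given_w2(x):
    return np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2 * np.pi)

P_w1, P_w2 = 2.0 / 3.0, 1.0 / 3.0          # priors from the example

x = np.linspace(-4.0, 6.0, 1001)
evidence = p_x_given_w1(x) * P_w1 + p_x_given_w2(x) * P_w2
post_w1 = p_x_given_w1(x) * P_w1 / evidence
post_w2 = p_x_given_w2(x) * P_w2 / evidence

# The two posteriors sum to one everywhere; the crossing point shifts toward ω_2
# because ω_1 has the larger prior.
assert np.allclose(post_w1 + post_w2, 1.0)
```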

  • Naive Bayes Model

    In the case of two classes

    P(x) = Σ_{i=1}^{2} p(x, ω_i) = Σ_{i=1}^{2} p(x | ω_i) P(ω_i)    (4)

    13 / 71
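
Equation (4) is just the evidence P(x) obtained from the total probability rule; the short sketch below evaluates it at a single point, again under assumed unit-variance Gaussian likelihoods and the 2/3, 1/3 priors from the earlier example.

```python
import numpy as np

def gaussian(x, mean):
    """Assumed unit-variance likelihood p(x | ω_i); not the slide's actual densities."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

means = np.array([0.0, 2.0])               # hypothetical class means
priors = np.array([2.0 / 3.0, 1.0 / 3.0])  # P(ω_1), P(ω_2)

x0 = 1.2
evidence = np.sum(gaussian(x0, means) * priors)    # equation (4): Σ_i p(x0 | ω_i) P(ω_i)
posteriors = gaussian(x0, means) * priors / evidence
print(evidence, posteriors.sum())                  # the posteriors sum to 1 thanks to P(x)
```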

  • Error in this rule

    We have that

    P(error | x) = { P(ω_1 | x)  if we decide ω_2
                     P(ω_2 | x)  if we decide ω_1    (5)

    Thus, we have that

    P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx    (6)

    14 / 71
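
The integral in equation (6) can be approximated numerically. The sketch below does so under assumed unit-variance Gaussian likelihoods and equal priors, using the fact that, once we decide with the Bayes rule, P(error | x) is the smaller of the two posteriors, as in equation (5).

```python
import numpy as np

def gaussian(x, mean):
    """Assumed unit-variance likelihood p(x | ω_i); parameters are illustrative only."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

means = np.array([0.0, 2.0])
priors = np.array([0.5, 0.5])

x = np.linspace(-10.0, 12.0, 20001)                       # dense grid for the integral
likelihoods = np.stack([gaussian(x, m) for m in means])   # p(x | ω_i)
joint = likelihoods * priors[:, None]                     # p(x | ω_i) P(ω_i)
p_x = joint.sum(axis=0)                                   # evidence P(x)
post = joint / p_x                                        # posteriors P(ω_i | x)

# Deciding with the Bayes rule, we are wrong with probability min_i P(ω_i | x).
p_error_given_x = post.min(axis=0)

# Equation (6): P(error) = ∫ P(error | x) p(x) dx, approximated by a Riemann sum.
p_error = np.sum(p_error_given_x * p_x) * (x[1] - x[0])
print(p_error)
```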

  • Classification Rule

    Thus, we have the Bayes Classification Rule:
        1. If P(ω_1 | x) > P(ω_2 | x), x is classified to ω_1.
        2. If P(ω_1 | x) < P(ω_2 | x), x is classified to ω_2.

    15 / 71
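
In code, this rule is simply an argmax over the posteriors. The slide does not say what to do on a tie, so this hypothetical helper arbitrarily favors the first class in that case.

```python
import numpy as np

def bayes_classify(posteriors):
    """Bayes classification rule: pick the class ω_i with the largest posterior P(ω_i | x)."""
    return int(np.argmax(posteriors)) + 1   # 1 means ω_1, 2 means ω_2

print(bayes_classify([0.7, 0.3]))   # -> 1
print(bayes_classify([0.2, 0.8]))   # -> 2
```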

  • What if we remove the normalization factor?

    Remember

    P(ω_1 | x) + P(ω_2 | x) = 1    (7)

    We are able to obtain the new Bayes Classification Rule:
        1. If P(x | ω_1) P(ω_1) > P(x | ω_2) P(ω_2), x is classified to ω_1.
        2. If P(x | ω_1) P(ω_1) < P(x | ω_2) P(ω_2), x is classified to ω_2.

    16 / 71
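
Since P(x) is a positive factor common to both sides, dropping it cannot change which side is larger. The small check below verifies this numerically under assumed unit-variance Gaussian likelihoods and the 2/3, 1/3 priors; the parameters are illustrative only.

```python
import numpy as np

def gaussian(x, mean):
    """Assumed unit-variance likelihood p(x | ω_i); purely illustrative."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

means = np.array([0.0, 2.0])
priors = np.array([2.0 / 3.0, 1.0 / 3.0])

for x0 in np.linspace(-3.0, 5.0, 17):
    unnormalized = gaussian(x0, means) * priors      # P(x | ω_i) P(ω_i)
    normalized = unnormalized / unnormalized.sum()   # P(ω_i | x), divided by the evidence
    # Removing the common factor P(x) never changes the winning class.
    assert np.argmax(unnormalized) == np.argmax(normalized)

print("The normalized and unnormalized rules give the same decisions.")
```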

  • We have several cases

    If for some x we have P(x | ω_1) = P(x | ω_2): the final decision relies completely on the prior probability.

    On the