Source: lsong/teaching/CSE6740fall13/lecture8.pdf

Bayes Decision Rule and Naïve Bayes Classifier

Machine Learning I CSE 6740, Fall 2013

Le Song

Gaussian Mixture model

A density model $p(x)$ may be multi-modal: model it as a mixture of uni-modal distributions (e.g. Gaussians).

Consider a mixture of $K$ Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Learn the mixing proportions $\pi_k \in [0,1]$ (with $\sum_{k=1}^{K} \pi_k = 1$) and the mixture-component parameters $\mu_k, \Sigma_k$.

[Figure: two mixture components, $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$.]

Imagine generating your data. For each data point $x_i$:

Randomly choose a mixture component $z_i \in \{1, 2, \dots, K\}$ with probability $p(z_i \mid \pi) = \pi_{z_i}$

Then sample the actual value of $x_i$ from a Gaussian distribution: $p(x_i \mid z_i) = \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$

Joint distribution over $x$ and $z$: $p(x, z) = p(z)\, p(x \mid z) = \pi_z \, \mathcal{N}(x \mid \mu_z, \Sigma_z)$

Marginal distribution:

$$p(x) = \sum_{z=1}^{K} p(x, z) = \sum_{z=1}^{K} \pi_z \, \mathcal{N}(x \mid \mu_z, \Sigma_z)$$
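This two-stage generative process can be sketched in a few lines of numpy (a minimal sketch with 1-D components and illustrative parameters; the slides' formulas cover general covariances):

```python
import numpy as np

def sample_gmm(m, pi, mu, sigma, rng):
    """Sample m points from a 1-D Gaussian mixture: first draw the component
    z_i with probability pi[z_i], then draw x_i ~ N(mu[z_i], sigma[z_i]^2)."""
    z = rng.choice(len(pi), size=m, p=pi)   # mixture component for each point
    x = rng.normal(mu[z], sigma[z])         # x_i | z_i ~ N(mu_{z_i}, sigma_{z_i}^2)
    return x, z

rng = np.random.default_rng(0)
x, z = sample_gmm(1000, pi=np.array([0.3, 0.7]),
                  mu=np.array([-2.0, 3.0]),
                  sigma=np.array([1.0, 0.5]), rng=rng)
```

Roughly 70% of the samples should come from the second component, clustered near 3.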

Bayes rule

$$P(z \mid x) = \frac{P(x \mid z)\, P(z)}{P(x)} = \frac{P(x, z)}{\sum_z P(x, z)}$$

(posterior = likelihood × prior / normalization constant)

Prior: $p(z) = \pi_z$. Likelihood: $p(x \mid z) = \mathcal{N}(x \mid \mu_z, \Sigma_z)$

Posterior: $p(z \mid x) = \dfrac{\pi_z \, \mathcal{N}(x \mid \mu_z, \Sigma_z)}{\sum_{z'} \pi_{z'} \, \mathcal{N}(x \mid \mu_{z'}, \Sigma_{z'})}$

Details of EM

We intend to learn the parameters that maximize the log-likelihood of the data:

$$\ell(\theta; D) = \sum_{i=1}^{m} \log \sum_{z_i=1}^{K} p(x_i, z_i \mid \theta)$$

This objective is nonconvex, so maximizing it directly is difficult!

Expectation step (E-step): what do we take the expectation over?

$$f(\theta) = E_{q(z_1, z_2, \dots, z_m)}\left[\sum_{i=1}^{m} \log p(x_i, z_i \mid \theta)\right]$$

Maximization step (M-step): how do we maximize? $\theta^{t+1} = \arg\max_{\theta} f(\theta)$

E-step: what is $q(z_1, z_2, \dots, z_m)$?

$q(z_1, z_2, \dots, z_m)$: the posterior distribution of the latent variables

$$q(z_1, z_2, \dots, z_m) = \prod_{i=1}^{m} p(z_i \mid x_i, \theta^t)$$

For each data point $x_i$, compute $p(z_i = k \mid x_i)$ for each $k$:

$$\tau_k^i = p(z_i = k \mid x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{k'} \pi_{k'} \, \mathcal{N}(x_i \mid \mu_{k'}, \Sigma_{k'})}$$
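The responsibilities $\tau_k^i$ take only a few lines of numpy to compute (a 1-D sketch; function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated pointwise (broadcasts over arrays).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def e_step(x, pi, mu, sigma):
    """tau[i, k] = p(z_i = k | x_i): normalize pi_k * N(x_i | mu_k, sigma_k^2) over k."""
    # Unnormalized posterior, shape (m, K): pi_k * N(x_i | mu_k, sigma_k^2)
    weighted = pi[None, :] * gaussian_pdf(x[:, None], mu[None, :], sigma[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)

x = np.array([-2.0, 0.0, 3.0])
tau = e_step(x, pi=np.array([0.5, 0.5]),
             mu=np.array([-2.0, 3.0]), sigma=np.array([1.0, 1.0]))
```

Each row of `tau` sums to one; points close to a component's mean get nearly all their responsibility from that component.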

E-step: compute the expectation

$$f(\theta) = E_{q(z_1, z_2, \dots, z_m)}\left[\sum_{i=1}^{m} \log p(x_i, z_i \mid \theta)\right] = \sum_{i=1}^{m} E_{p(z_i \mid x_i, \theta^t)}\left[\log p(x_i, z_i \mid \theta)\right]$$

In the case of a Gaussian mixture:

$$f(\theta) = \sum_{i=1}^{m} E_{p(z_i \mid x_i, \theta^t)}\left[\log \pi_{z_i} - \tfrac{1}{2}(x_i - \mu_{z_i})^\top \Sigma_{z_i}^{-1} (x_i - \mu_{z_i}) - \tfrac{1}{2}\log|\Sigma_{z_i}| + c\right]$$

$$= \sum_{i=1}^{m} \sum_{k=1}^{K} \tau_k^i \left[\log \pi_k - \tfrac{1}{2}(x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) - \tfrac{1}{2}\log|\Sigma_k| + c\right]$$

M-step: maximize $f(\theta)$

$$f(\theta) = \sum_{i=1}^{m} \sum_{k=1}^{K} \tau_k^i \left[\log \pi_k - \tfrac{1}{2}(x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) - \tfrac{1}{2}\log|\Sigma_k| + c\right]$$

For instance, suppose we want to find $\pi_k$, subject to the constraint $\sum_{k=1}^{K} \pi_k = 1$.

Form the Lagrangian:

$$L = \sum_{i=1}^{m} \sum_{k=1}^{K} \tau_k^i \log \pi_k + \text{other terms} + \lambda\left(1 - \sum_{k=1}^{K} \pi_k\right)$$

Take the partial derivative and set it to 0:

$$\frac{\partial L}{\partial \pi_k} = \sum_{i=1}^{m} \frac{\tau_k^i}{\pi_k} - \lambda = 0 \;\Rightarrow\; \pi_k = \frac{1}{\lambda} \sum_{i=1}^{m} \tau_k^i \;\Rightarrow\; \lambda = m$$

(The last step follows by summing over $k$ and using $\sum_k \pi_k = 1$ and $\sum_k \tau_k^i = 1$.) Hence $\pi_k = \frac{1}{m} \sum_{i=1}^{m} \tau_k^i$.
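Given the responsibilities, all the M-step updates have closed forms; the $\pi_k$ update is the one derived on this slide, and the mean and variance updates follow from setting the other partial derivatives to zero (a 1-D sketch; names are illustrative):

```python
import numpy as np

def m_step(x, tau):
    """Closed-form M-step for a 1-D Gaussian mixture, given responsibilities tau (m x K)."""
    m = len(x)
    nk = tau.sum(axis=0)                      # effective count per component
    pi = nk / m                               # pi_k = (1/m) sum_i tau_k^i  (lambda = m)
    mu = (tau * x[:, None]).sum(axis=0) / nk  # responsibility-weighted mean
    var = (tau * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk  # weighted variance
    return pi, mu, np.sqrt(var)

x = np.array([-2.0, -1.8, 3.0, 3.2])
tau = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # hard assignments
pi, mu, sigma = m_step(x, tau)
```

With these hard assignments the update reduces to per-cluster sample statistics: equal mixing weights, means at -1.9 and 3.1.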

EM graphically

[Figure: the E-step builds a lower bound $f(\theta)$ that touches the log-likelihood $\ell(\theta; D)$ at the current iterate, $\ell(\theta^t; D)$; the M-step maximizes the bound to obtain $\theta^{t+1}$.]

Experiments with mixture of 3 Gaussians


Why do we need density estimation?

Learn the “shape” of the data cloud

Assess the likelihood of seeing a particular data point

Is this a typical data point? (high density value)

Is this an abnormal data point / outlier? (low density value)

Building block for more sophisticated learning algorithms

Classification, regression, graphical models …


Outlier detection

Declare data points in low-density regions as outliers:

First fit a density $p(x)$, and then set a threshold $t$.

Data points in regions with density smaller than $t$, i.e., $p(x) \le t$, are declared outliers.
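The threshold rule is direct to implement. A minimal sketch, fitting a single Gaussian for simplicity (the mixture density above works the same way; the threshold value here is purely illustrative):

```python
import numpy as np

def outliers_by_density(x, t):
    """Fit a 1-D Gaussian density to x, then flag points with p(x) <= t as outliers."""
    mu, sigma = x.mean(), x.std()
    density = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return density <= t

x = np.concatenate([np.zeros(99), [50.0]])   # one obvious outlier at 50
flags = outliers_by_density(x, t=1e-6)
```

Only the point far from the bulk of the data falls below the density threshold.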

Geometric method for outlier detection

Rather than first fitting a density and then thresholding, directly find the boundary between inliers and outliers.

[Figure: 2D data cloud with a decision boundary separating inliers from outliers.]

Minimum enclosing ball

Find the smallest ball that includes all data points.

Parameterize it with a center $c$ and radius $r$.

Optimization problem (a quadratically constrained quadratic program):

$$\min_{r, c} \; r^2 \quad \text{s.t.} \quad (x_i - c)^\top (x_i - c) \le r^2, \;\; \forall i = 1, \dots, m$$

[Figure: a ball of radius $r$ centered at $c$ enclosing the data.]

Find center $c$ and radius $r$

Lagrangian of the problem:

$$L = r^2 + \sum_{i=1}^{m} \alpha_i \left[(x_i - c)^\top (x_i - c) - r^2\right], \quad \alpha_i \ge 0$$

Set the partial derivatives of $L$ to 0:

$$\frac{\partial L}{\partial r} = 2r - 2\sum_{i=1}^{m} \alpha_i r = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i = 1$$

$$\frac{\partial L}{\partial c} = \sum_{i=1}^{m} \alpha_i (-2x_i + 2c) = 0 \;\Rightarrow\; c = \sum_{i=1}^{m} \alpha_i x_i$$

Note that $c$ and $r$ are not yet in explicit form (they depend on the $\alpha_i$).

Manipulate the Lagrangian

$$L = r^2 + \sum_{i=1}^{m} \alpha_i \left[(x_i - c)^\top (x_i - c) - r^2\right]$$

$$L = r^2 \left(1 - \sum_{i=1}^{m} \alpha_i\right) + \sum_{i=1}^{m} \alpha_i x_i^\top x_i - 2 c^\top \sum_{i=1}^{m} \alpha_i x_i + \left(\sum_{i=1}^{m} \alpha_i\right) c^\top c$$

Substituting $\sum_i \alpha_i = 1$ and $c = \sum_i \alpha_i x_i$:

$$L = \sum_{i=1}^{m} \alpha_i x_i^\top x_i - \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j x_i^\top x_j$$

Express the Lagrangian $L$ in $\alpha$

$$\max_{\alpha} \; L(\alpha) = \sum_{i=1}^{m} \alpha_i x_i^\top x_i - \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j x_i^\top x_j$$

$$\text{s.t.} \quad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge 0$$

A quadratic programming problem $L = b^\top \alpha - \alpha^\top A \alpha$ with linear constraints, where $A_{ij} = x_i^\top x_j$ and $b_i = x_i^\top x_i$.
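One simple way to attack this dual (a sketch, not a solver prescribed by the slides) is a Frank-Wolfe-style update over the simplex: the coordinate of the gradient $b - 2A\alpha$ that is largest corresponds to the point currently farthest from the center $c = \sum_i \alpha_i x_i$, so each step shifts mass toward that point:

```python
import numpy as np

def meb_dual(X, iters=2000):
    """Frank-Wolfe on max_a b^T a - a^T A a  s.t. sum(a) = 1, a >= 0,
    with A_ij = x_i^T x_j and b_i = x_i^T x_i.
    Returns the center c = sum_i a_i x_i and the radius r."""
    m = len(X)
    alpha = np.full(m, 1.0 / m)            # start at the center of the simplex
    for k in range(iters):
        c = alpha @ X                      # current center c = sum_i alpha_i x_i
        d2 = ((X - c) ** 2).sum(axis=1)    # squared distance of each point to c
        s = int(np.argmax(d2))             # farthest point = steepest dual coordinate
        gamma = 2.0 / (k + 3)              # standard Frank-Wolfe step size
        alpha = (1 - gamma) * alpha
        alpha[s] += gamma                  # move mass toward the vertex e_s
    c = alpha @ X
    r = np.sqrt(((X - c) ** 2).sum(axis=1).max())
    return c, r

# Four corners of a square: the exact ball has center (0, 0) and radius sqrt(2).
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
c, r = meb_dual(X)
```

A general-purpose QP solver works too, of course; this update just makes the "support points get the weight" structure of the dual visible.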

Experiment with minimum enclosing ball


Classification

Represent the data: a label is provided for each data point, e.g., $y \in \{-1, +1\}$.

Learn a classifier from the labeled data.

Outline

What is, theoretically, the best classifier? The Bayes decision rule for minimum error.

What is the difficulty of achieving the minimum error?

What do we do in practice?

Decision-making: divide the high-dimensional space

Distributions of samples from normal (positive class) and abnormal (negative class) tissues.

Class conditional distribution

In the classification case, we are given the label for each data point.

Class conditional distributions: $P(x \mid y = 1)$, $P(x \mid y = -1)$ (how to compute?)

Class priors: $P(y = 1)$, $P(y = -1)$ (how to compute?)

$$p(x \mid y = 1) = \mathcal{N}(x; \mu_1, \Sigma_1), \qquad p(x \mid y = -1) = \mathcal{N}(x; \mu_2, \Sigma_2)$$

How to come up with a decision boundary

Given the class conditional distributions $P(x \mid y = 1)$, $P(x \mid y = -1)$ and the class priors $P(y = 1)$, $P(y = -1)$:

$$P(x \mid y = 1) = \mathcal{N}(x; \mu_1, \Sigma_1), \qquad P(x \mid y = -1) = \mathcal{N}(x; \mu_2, \Sigma_2)$$

[Figure: two class conditional densities with an unknown decision boundary between them.]

Use Bayes rule

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)} = \frac{P(x, y)}{\sum_y P(x, y)}$$

(posterior = likelihood × prior / normalization constant)

Prior: $P(y)$. Likelihood (class conditional distribution): $p(x \mid y) = \mathcal{N}(x \mid \mu_y, \Sigma_y)$

Posterior: $P(y \mid x) = \dfrac{P(y)\, \mathcal{N}(x \mid \mu_y, \Sigma_y)}{\sum_{y'} P(y')\, \mathcal{N}(x \mid \mu_{y'}, \Sigma_{y'})}$

Bayes Decision Rule

Learning: the prior $p(y)$ and the class conditional distribution $p(x \mid y)$.

The posterior probability of a test point:

$$q_i(x) := P(y = i \mid x) = \frac{P(x \mid y = i)\, P(y = i)}{P(x)}$$

Bayes decision rule: if $q_i(x) > q_j(x)$, then $y = i$; otherwise $y = j$.

Alternatively: if the ratio $l(x) = \dfrac{P(x \mid y = i)}{P(x \mid y = j)} > \dfrac{P(y = j)}{P(y = i)}$, then $y = i$; otherwise $y = j$.

Or look at the log-likelihood ratio $h(x) = -\ln \dfrac{q_i(x)}{q_j(x)}$.
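As a concrete sketch of the rule with two 1-D Gaussian classes (parameters are illustrative): since $P(x)$ is a shared normalizer, comparing the posteriors $q_i(x)$ is the same as comparing $P(x \mid y = i)\,P(y = i)$.

```python
import numpy as np

def bayes_decide(x, prior, mu, sigma):
    """Bayes decision rule: return the class index i maximizing q_i(x) = P(y=i|x).
    The normalizer P(x) is common to all classes, so argmax of prior*likelihood suffices."""
    likelihood = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return int(np.argmax(prior * likelihood))

# Class 0: N(-1, 1); class 1: N(+1, 1); equal priors -> boundary at x = 0.
prior = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
```

With equal priors and equal variances the rule reduces to "pick the nearer mean".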

Example: Gaussian class conditional distributions

Depending on the Gaussian distributions, the decision boundary can be very different.

Decision boundary: $h(x) = -\ln \dfrac{q_i(x)}{q_j(x)} = 0$

Is the Bayes decision rule optimal?

Optimal with respect to what criterion?

Bayes error: the expected probability of a sample being incorrectly classified.

Given a sample $x$, we compute the posterior probabilities $q_1(x) := P(y = 1 \mid x)$ and $q_2(x) := P(y = 2 \mid x)$; the probability of error is

$$r(x) = \min\{q_1(x), q_2(x)\}$$

Bayes error:

$$E_{p(x)}[r(x)] = \int r(x)\, p(x)\, dx$$

Bayes error

The Bayes error:

$$\epsilon = E_{p(x)}[r(x)] = \int r(x)\, p(x)\, dx = \int \min\{q_1(x), q_2(x)\}\, p(x)\, dx$$

$$= \int \min\left\{\frac{P(y=1)\, p(x \mid y=1)}{p(x)},\; \frac{P(y=2)\, p(x \mid y=2)}{p(x)}\right\} p(x)\, dx$$

$$= \int \min\{P(y=1)\, p(x \mid y=1),\; P(y=2)\, p(x \mid y=2)\}\, dx$$

Bayes error II

$$\epsilon = \int \min\{P(y=1)\, p(x \mid y=1),\; P(y=2)\, p(x \mid y=2)\}\, dx$$

$$= \int_{L_1} P(y=1)\, p(x \mid y=1)\, dx + \int_{L_2} P(y=2)\, p(x \mid y=2)\, dx$$

[Figure: the curves $P(y=1)\,p(x \mid y=1)$ and $P(y=2)\,p(x \mid y=2)$; $L_1$ and $L_2$ are the regions where each term is the smaller of the two.]
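For two 1-D Gaussian classes, this integral can be evaluated numerically by summing the pointwise minimum on a grid (a sketch; equal priors and unit variances are illustrative choices, not the only case the formula covers):

```python
import numpy as np

def bayes_error(prior1, mu1, prior2, mu2, sigma=1.0):
    """Numerically integrate min{P(y=1)p(x|y=1), P(y=2)p(x|y=2)} dx on a dense grid."""
    x = np.linspace(-10.0, 10.0, 20001)
    dx = x[1] - x[0]
    norm = sigma * np.sqrt(2 * np.pi)
    p1 = prior1 * np.exp(-0.5 * ((x - mu1) / sigma) ** 2) / norm
    p2 = prior2 * np.exp(-0.5 * ((x - mu2) / sigma) ** 2) / norm
    return np.minimum(p1, p2).sum() * dx

# Equal priors, unit-variance classes at -1 and +1: the optimal boundary is x = 0,
# and the Bayes error is the Gaussian tail mass beyond one standard deviation.
eps = bayes_error(0.5, -1.0, 0.5, 1.0)
```

Here the answer is known in closed form, $\Phi(-1) \approx 0.1587$, which the grid sum reproduces.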

More on Bayes error

The Bayes error is the lower bound on the probability of classification error.

The Bayes decision rule is the theoretically best classifier, minimizing the probability of classification error.

However, computing the Bayes error or the Bayes decision rule is in general a very complex problem. Why?

We need density estimation.

We need to do integrals, e.g., $\int_{L_1} P(y=1)\, p(x \mid y=1)\, dx$.

What do people do in practice?

Use simplifying assumptions for $p(x \mid y = 1)$:

Assume $p(x \mid y = 1)$ is Gaussian.

Assume $p(x \mid y = 1)$ is fully factorized.

Rather than modeling the density $p(x \mid y = 1)$, directly go for the decision boundary $h(x) = -\ln \dfrac{q_i(x)}{q_j(x)}$:

Logistic regression

Support vector machines

Neural networks

Naïve Bayes Classifier

Use the Bayes decision rule for classification:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$

But assume $p(x \mid y = 1)$ is fully factorized:

$$p(x \mid y = 1) = \prod_{i=1}^{d} p(x_i \mid y = 1)$$

In other words, the variables corresponding to each dimension of the data are assumed independent given the label.
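A minimal classifier following this factorization (a sketch; modeling each per-dimension factor $p(x_i \mid y)$ as a 1-D Gaussian is one common choice, not the only one):

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes: p(x|y) = prod_i p(x_i|y), each factor a 1-D Gaussian."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([(y == c).mean() for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        # Small floor keeps the per-dimension variances strictly positive.
        self.sigma = np.array([X[y == c].std(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict(self, X):
        # Work in log space: log P(y) + sum_d log N(x_d | mu_{y,d}, sigma_{y,d}^2).
        # Shapes: X (n, d), mu/sigma (K, d) -> z (n, K, d) -> log_lik (n, K).
        z = (X[:, None, :] - self.mu[None, :, :]) / self.sigma[None, :, :]
        log_lik = (-0.5 * z ** 2 - np.log(self.sigma[None, :, :])).sum(axis=2)
        return self.classes[np.argmax(np.log(self.prior)[None, :] + log_lik, axis=1)]

X = np.array([[-2.0, -2.1], [-1.9, -2.0], [2.0, 2.1], [2.1, 1.9]])
y = np.array([-1, -1, 1, 1])
clf = GaussianNaiveBayes().fit(X, y)
pred = clf.predict(np.array([[-2.0, -2.0], [2.0, 2.0]]))
```

Working in log space avoids underflow when the product over many dimensions becomes tiny; the constant $-\tfrac{1}{2}\log 2\pi$ per dimension is shared across classes and can be dropped.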