Source: lsong/teaching/CSE6740fall13/lecture8.pdf

Bayes Decision Rule and Naïve Bayes Classifier

Machine Learning I CSE 6740, Fall 2013

Le Song

Gaussian Mixture model

A density model $p(x)$ may be multi-modal: model it as a mixture of uni-modal distributions (e.g. Gaussians).

Consider a mixture of $K$ Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Learn the mixing proportions $\pi_k \in [0,1]$ (with $\sum_{k=1}^{K} \pi_k = 1$) and the mixture-component parameters $\mu_k, \Sigma_k$.

[Figure: two mixture components, $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$.]

Imagine generating your data. For each data point $x_i$:

Randomly choose a mixture component $z_i \in \{1, 2, \dots, K\}$ with probability $p(z_i \mid \pi) = \pi_{z_i}$

Then sample the actual value of $x_i$ from a Gaussian distribution: $p(x_i \mid z_i) = \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$

Joint distribution over $x$ and $z$: $p(x, z) = p(z)\, p(x \mid z) = \pi_z \, \mathcal{N}(x \mid \mu_z, \Sigma_z)$

Marginal distribution:

$$p(x) = \sum_{z=1}^{K} p(x, z) = \sum_{z=1}^{K} \pi_z \, \mathcal{N}(x \mid \mu_z, \Sigma_z)$$
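This two-stage generative process can be sketched in a few lines of numpy (a minimal sketch with 1-D components and illustrative parameters; the slides' formulas cover general covariances):

```python
import numpy as np

def sample_gmm(m, pi, mu, sigma, rng):
    """Sample m points from a 1-D Gaussian mixture: first draw the component
    z_i with probability pi[z_i], then draw x_i ~ N(mu[z_i], sigma[z_i]^2)."""
    z = rng.choice(len(pi), size=m, p=pi)   # mixture component for each point
    x = rng.normal(mu[z], sigma[z])         # x_i | z_i ~ N(mu_{z_i}, sigma_{z_i}^2)
    return x, z

rng = np.random.default_rng(0)
x, z = sample_gmm(1000, pi=np.array([0.3, 0.7]),
                  mu=np.array([-2.0, 3.0]),
                  sigma=np.array([1.0, 0.5]), rng=rng)
```

Roughly 70% of the samples should come from the second component, clustered near 3.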

Bayes rule

$$P(z \mid x) = \frac{P(x \mid z)\, P(z)}{P(x)} = \frac{P(x, z)}{\sum_z P(x, z)}$$

(posterior = likelihood × prior / normalization constant)

Prior: $p(z) = \pi_z$. Likelihood: $p(x \mid z) = \mathcal{N}(x \mid \mu_z, \Sigma_z)$

Posterior: $p(z \mid x) = \dfrac{\pi_z \, \mathcal{N}(x \mid \mu_z, \Sigma_z)}{\sum_{z'} \pi_{z'} \, \mathcal{N}(x \mid \mu_{z'}, \Sigma_{z'})}$

Details of EM

We intend to learn the parameters that maximize the log-likelihood of the data:

$$\ell(\theta; D) = \sum_{i=1}^{m} \log \sum_{z_i=1}^{K} p(x_i, z_i \mid \theta)$$

This objective is nonconvex, so maximizing it directly is difficult!

Expectation step (E-step): what do we take the expectation over?

$$f(\theta) = E_{q(z_1, z_2, \dots, z_m)}\left[\sum_{i=1}^{m} \log p(x_i, z_i \mid \theta)\right]$$

Maximization step (M-step): how do we maximize? $\theta^{t+1} = \arg\max_{\theta} f(\theta)$

E-step: what is $q(z_1, z_2, \dots, z_m)$?

$q(z_1, z_2, \dots, z_m)$: the posterior distribution of the latent variables

$$q(z_1, z_2, \dots, z_m) = \prod_{i=1}^{m} p(z_i \mid x_i, \theta^t)$$

For each data point $x_i$, compute $p(z_i = k \mid x_i)$ for each $k$:

$$\tau_k^i = p(z_i = k \mid x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{k'} \pi_{k'} \, \mathcal{N}(x_i \mid \mu_{k'}, \Sigma_{k'})}$$
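The responsibilities $\tau_k^i$ take only a few lines of numpy to compute (a 1-D sketch; function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated pointwise (broadcasts over arrays).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def e_step(x, pi, mu, sigma):
    """tau[i, k] = p(z_i = k | x_i): normalize pi_k * N(x_i | mu_k, sigma_k^2) over k."""
    # Unnormalized posterior, shape (m, K): pi_k * N(x_i | mu_k, sigma_k^2)
    weighted = pi[None, :] * gaussian_pdf(x[:, None], mu[None, :], sigma[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)

x = np.array([-2.0, 0.0, 3.0])
tau = e_step(x, pi=np.array([0.5, 0.5]),
             mu=np.array([-2.0, 3.0]), sigma=np.array([1.0, 1.0]))
```

Each row of `tau` sums to one; points close to a component's mean get nearly all their responsibility from that component.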

E-step: compute the expectation

$$f(\theta) = E_{q(z_1, z_2, \dots, z_m)}\left[\sum_{i=1}^{m} \log p(x_i, z_i \mid \theta)\right] = \sum_{i=1}^{m} E_{p(z_i \mid x_i, \theta^t)}\left[\log p(x_i, z_i \mid \theta)\right]$$

In the case of a Gaussian mixture:

$$f(\theta) = \sum_{i=1}^{m} E_{p(z_i \mid x_i, \theta^t)}\left[\log \pi_{z_i} - \tfrac{1}{2}(x_i - \mu_{z_i})^\top \Sigma_{z_i}^{-1} (x_i - \mu_{z_i}) - \tfrac{1}{2}\log|\Sigma_{z_i}| + c\right]$$

$$= \sum_{i=1}^{m} \sum_{k=1}^{K} \tau_k^i \left[\log \pi_k - \tfrac{1}{2}(x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) - \tfrac{1}{2}\log|\Sigma_k| + c\right]$$

M-step: maximize $f(\theta)$

$$f(\theta) = \sum_{i=1}^{m} \sum_{k=1}^{K} \tau_k^i \left[\log \pi_k - \tfrac{1}{2}(x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) - \tfrac{1}{2}\log|\Sigma_k| + c\right]$$

For instance, suppose we want to find $\pi_k$, subject to the constraint $\sum_{k=1}^{K} \pi_k = 1$.

Form the Lagrangian:

$$L = \sum_{i=1}^{m} \sum_{k=1}^{K} \tau_k^i \log \pi_k + \text{other terms} + \lambda\left(1 - \sum_{k=1}^{K} \pi_k\right)$$

Take the partial derivative and set it to 0:

$$\frac{\partial L}{\partial \pi_k} = \sum_{i=1}^{m} \frac{\tau_k^i}{\pi_k} - \lambda = 0 \;\Rightarrow\; \pi_k = \frac{1}{\lambda} \sum_{i=1}^{m} \tau_k^i \;\Rightarrow\; \lambda = m$$

(The last step follows by summing over $k$ and using $\sum_k \pi_k = 1$ and $\sum_k \tau_k^i = 1$.) Hence $\pi_k = \frac{1}{m} \sum_{i=1}^{m} \tau_k^i$.
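Given the responsibilities, all the M-step updates have closed forms; the $\pi_k$ update is the one derived on this slide, and the mean and variance updates follow from setting the other partial derivatives to zero (a 1-D sketch; names are illustrative):

```python
import numpy as np

def m_step(x, tau):
    """Closed-form M-step for a 1-D Gaussian mixture, given responsibilities tau (m x K)."""
    m = len(x)
    nk = tau.sum(axis=0)                      # effective count per component
    pi = nk / m                               # pi_k = (1/m) sum_i tau_k^i  (lambda = m)
    mu = (tau * x[:, None]).sum(axis=0) / nk  # responsibility-weighted mean
    var = (tau * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk  # weighted variance
    return pi, mu, np.sqrt(var)

x = np.array([-2.0, -1.8, 3.0, 3.2])
tau = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # hard assignments
pi, mu, sigma = m_step(x, tau)
```

With these hard assignments the update reduces to per-cluster sample statistics: equal mixing weights, means at -1.9 and 3.1.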

EM graphically

[Figure: the E-step builds a lower bound $f(\theta)$ that touches the log-likelihood $\ell(\theta; D)$ at the current iterate, $\ell(\theta^t; D)$; the M-step maximizes the bound to obtain $\theta^{t+1}$.]

Experiments with mixture of 3 Gaussians


Why do we need density estimation?

Learn the “shape” of the data cloud

Assess the likelihood of seeing a particular data point

Is this a typical data point? (high density value)

Is this an abnormal data point / outlier? (low density value)

Building block for more sophisticated learning algorithms

Classification, regression, graphical models …


Outlier detection

Declare data points in low-density regions as outliers:

First fit a density $p(x)$, and then set a threshold $t$.

Data points in regions with density smaller than $t$, i.e., $p(x) \le t$, are declared outliers.
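The threshold rule is direct to implement. A minimal sketch, fitting a single Gaussian for simplicity (the mixture density above works the same way; the threshold value here is purely illustrative):

```python
import numpy as np

def outliers_by_density(x, t):
    """Fit a 1-D Gaussian density to x, then flag points with p(x) <= t as outliers."""
    mu, sigma = x.mean(), x.std()
    density = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return density <= t

x = np.concatenate([np.zeros(99), [50.0]])   # one obvious outlier at 50
flags = outliers_by_density(x, t=1e-6)
```

Only the point far from the bulk of the data falls below the density threshold.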

Geometric method for outlier detection

Rather than first fitting a density and then thresholding, directly find the boundary between inliers and outliers.

[Figure: 2D data cloud with a decision boundary separating inliers from outliers.]

Minimum enclosing ball

Find the smallest ball that includes all data points.

Parameterize it with a center $c$ and radius $r$.

Optimization problem (a quadratically constrained quadratic program):

$$\min_{r, c} \; r^2 \quad \text{s.t.} \quad (x_i - c)^\top (x_i - c) \le r^2, \;\; \forall i = 1, \dots, m$$

[Figure: a ball of radius $r$ centered at $c$ enclosing the data.]

Find center $c$ and radius $r$

Lagrangian of the problem:

$$L = r^2 + \sum_{i=1}^{m} \alpha_i \left[(x_i - c)^\top (x_i - c) - r^2\right], \quad \alpha_i \ge 0$$

Set the partial derivatives of $L$ to 0:

$$\frac{\partial L}{\partial r} = 2r - 2\sum_{i=1}^{m} \alpha_i r = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i = 1$$

$$\frac{\partial L}{\partial c} = \sum_{i=1}^{m} \alpha_i (-2x_i + 2c) = 0 \;\Rightarrow\; c = \sum_{i=1}^{m} \alpha_i x_i$$

Note that $c$ and $r$ are not yet in explicit form (they depend on the $\alpha_i$).

Manipulate the Lagrangian

$$L = r^2 + \sum_{i=1}^{m} \alpha_i \left[(x_i - c)^\top (x_i - c) - r^2\right]$$

$$L = r^2 \left(1 - \sum_{i=1}^{m} \alpha_i\right) + \sum_{i=1}^{m} \alpha_i x_i^\top x_i - 2 c^\top \sum_{i=1}^{m} \alpha_i x_i + \left(\sum_{i=1}^{m} \alpha_i\right) c^\top c$$

Substituting $\sum_i \alpha_i = 1$ and $c = \sum_i \alpha_i x_i$:

$$L = \sum_{i=1}^{m} \alpha_i x_i^\top x_i - \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j x_i^\top x_j$$

Express the Lagrangian $L$ in $\alpha$

$$\max_{\alpha} \; L(\alpha) = \sum_{i=1}^{m} \alpha_i x_i^\top x_i - \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j x_i^\top x_j$$

$$\text{s.t.} \quad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge 0$$

A quadratic programming problem $L = b^\top \alpha - \alpha^\top A \alpha$ with linear constraints, where $A_{ij} = x_i^\top x_j$ and $b_i = x_i^\top x_i$.
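One simple way to attack this dual (a sketch, not a solver prescribed by the slides) is a Frank-Wolfe-style update over the simplex: the coordinate of the gradient $b - 2A\alpha$ that is largest corresponds to the point currently farthest from the center $c = \sum_i \alpha_i x_i$, so each step shifts mass toward that point:

```python
import numpy as np

def meb_dual(X, iters=2000):
    """Frank-Wolfe on max_a b^T a - a^T A a  s.t. sum(a) = 1, a >= 0,
    with A_ij = x_i^T x_j and b_i = x_i^T x_i.
    Returns the center c = sum_i a_i x_i and the radius r."""
    m = len(X)
    alpha = np.full(m, 1.0 / m)            # start at the center of the simplex
    for k in range(iters):
        c = alpha @ X                      # current center c = sum_i alpha_i x_i
        d2 = ((X - c) ** 2).sum(axis=1)    # squared distance of each point to c
        s = int(np.argmax(d2))             # farthest point = steepest dual coordinate
        gamma = 2.0 / (k + 3)              # standard Frank-Wolfe step size
        alpha = (1 - gamma) * alpha
        alpha[s] += gamma                  # move mass toward the vertex e_s
    c = alpha @ X
    r = np.sqrt(((X - c) ** 2).sum(axis=1).max())
    return c, r

# Four corners of a square: the exact ball has center (0, 0) and radius sqrt(2).
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
c, r = meb_dual(X)
```

A general-purpose QP solver works too, of course; this update just makes the "support points get the weight" structure of the dual visible.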

Experiment with minimum enclosing ball


Classification

Represent the data: a label is provided for each data point, e.g., $y \in \{-1, +1\}$.

Learn a classifier from the labeled data.

Outline

What is, theoretically, the best classifier? The Bayes decision rule for minimum error.

What is the difficulty of achieving the minimum error?

What do we do in practice?

Decision-making: divide the high-dimensional space

Distributions of samples from normal (positive class) and abnormal (negative class) tissues.

Class conditional distribution

In the classification case, we are given the label for each data point.

Class conditional distributions: $P(x \mid y = 1)$, $P(x \mid y = -1)$ (how to compute?)

Class priors: $P(y = 1)$, $P(y = -1)$ (how to compute?)

$$p(x \mid y = 1) = \mathcal{N}(x; \mu_1, \Sigma_1), \qquad p(x \mid y = -1) = \mathcal{N}(x; \mu_2, \Sigma_2)$$

How to come up with a decision boundary

Given the class conditional distributions $P(x \mid y = 1)$, $P(x \mid y = -1)$ and the class priors $P(y = 1)$, $P(y = -1)$:

$$P(x \mid y = 1) = \mathcal{N}(x; \mu_1, \Sigma_1), \qquad P(x \mid y = -1) = \mathcal{N}(x; \mu_2, \Sigma_2)$$

[Figure: two class conditional densities with an unknown decision boundary between them.]

Use Bayes rule

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)} = \frac{P(x, y)}{\sum_y P(x, y)}$$

(posterior = likelihood × prior / normalization constant)

Prior: $P(y)$. Likelihood (class conditional distribution): $p(x \mid y) = \mathcal{N}(x \mid \mu_y, \Sigma_y)$

Posterior: $P(y \mid x) = \dfrac{P(y)\, \mathcal{N}(x \mid \mu_y, \Sigma_y)}{\sum_{y'} P(y')\, \mathcal{N}(x \mid \mu_{y'}, \Sigma_{y'})}$

Bayes Decision Rule

Learning: the prior $p(y)$ and the class conditional distribution $p(x \mid y)$.

The posterior probability of a test point:

$$q_i(x) := P(y = i \mid x) = \frac{P(x \mid y = i)\, P(y = i)}{P(x)}$$

Bayes decision rule: if $q_i(x) > q_j(x)$, then $y = i$; otherwise $y = j$.

Alternatively: if the ratio $l(x) = \dfrac{P(x \mid y = i)}{P(x \mid y = j)} > \dfrac{P(y = j)}{P(y = i)}$, then $y = i$; otherwise $y = j$.

Or look at the log-likelihood ratio $h(x) = -\ln \dfrac{q_i(x)}{q_j(x)}$.
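As a concrete sketch of the rule with two 1-D Gaussian classes (parameters are illustrative): since $P(x)$ is a shared normalizer, comparing the posteriors $q_i(x)$ is the same as comparing $P(x \mid y = i)\,P(y = i)$.

```python
import numpy as np

def bayes_decide(x, prior, mu, sigma):
    """Bayes decision rule: return the class index i maximizing q_i(x) = P(y=i|x).
    The normalizer P(x) is common to all classes, so argmax of prior*likelihood suffices."""
    likelihood = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return int(np.argmax(prior * likelihood))

# Class 0: N(-1, 1); class 1: N(+1, 1); equal priors -> boundary at x = 0.
prior = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
```

With equal priors and equal variances the rule reduces to "pick the nearer mean".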

Example: Gaussian class conditional distributions

Depending on the Gaussian distributions, the decision boundary can be very different.

Decision boundary: $h(x) = -\ln \dfrac{q_i(x)}{q_j(x)} = 0$

Is the Bayes decision rule optimal?

Optimal with respect to what criterion?

Bayes error: the expected probability of a sample being incorrectly classified.

Given a sample $x$, we compute the posterior probabilities $q_1(x) := P(y = 1 \mid x)$ and $q_2(x) := P(y = 2 \mid x)$; the probability of error is

$$r(x) = \min\{q_1(x), q_2(x)\}$$

Bayes error:

$$E_{p(x)}[r(x)] = \int r(x)\, p(x)\, dx$$

Bayes error

The Bayes error:

$$\epsilon = E_{p(x)}[r(x)] = \int r(x)\, p(x)\, dx = \int \min\{q_1(x), q_2(x)\}\, p(x)\, dx$$

$$= \int \min\left\{\frac{P(y=1)\, p(x \mid y=1)}{p(x)},\; \frac{P(y=2)\, p(x \mid y=2)}{p(x)}\right\} p(x)\, dx$$

$$= \int \min\{P(y=1)\, p(x \mid y=1),\; P(y=2)\, p(x \mid y=2)\}\, dx$$

Bayes error II

$$\epsilon = \int \min\{P(y=1)\, p(x \mid y=1),\; P(y=2)\, p(x \mid y=2)\}\, dx$$

$$= \int_{L_1} P(y=1)\, p(x \mid y=1)\, dx + \int_{L_2} P(y=2)\, p(x \mid y=2)\, dx$$

[Figure: the curves $P(y=1)\,p(x \mid y=1)$ and $P(y=2)\,p(x \mid y=2)$; $L_1$ and $L_2$ are the regions where each term is the smaller of the two.]
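For two 1-D Gaussian classes, this integral can be evaluated numerically by summing the pointwise minimum on a grid (a sketch; equal priors and unit variances are illustrative choices, not the only case the formula covers):

```python
import numpy as np

def bayes_error(prior1, mu1, prior2, mu2, sigma=1.0):
    """Numerically integrate min{P(y=1)p(x|y=1), P(y=2)p(x|y=2)} dx on a dense grid."""
    x = np.linspace(-10.0, 10.0, 20001)
    dx = x[1] - x[0]
    norm = sigma * np.sqrt(2 * np.pi)
    p1 = prior1 * np.exp(-0.5 * ((x - mu1) / sigma) ** 2) / norm
    p2 = prior2 * np.exp(-0.5 * ((x - mu2) / sigma) ** 2) / norm
    return np.minimum(p1, p2).sum() * dx

# Equal priors, unit-variance classes at -1 and +1: the optimal boundary is x = 0,
# and the Bayes error is the Gaussian tail mass beyond one standard deviation.
eps = bayes_error(0.5, -1.0, 0.5, 1.0)
```

Here the answer is known in closed form, $\Phi(-1) \approx 0.1587$, which the grid sum reproduces.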

More on Bayes error

The Bayes error is the lower bound on the probability of classification error.

The Bayes decision rule is the theoretically best classifier, minimizing the probability of classification error.

However, computing the Bayes error or the Bayes decision rule is in general a very complex problem. Why?

We need density estimation.

We need to do integrals, e.g., $\int_{L_1} P(y=1)\, p(x \mid y=1)\, dx$.

What do people do in practice?

Use simplifying assumptions for $p(x \mid y = 1)$:

Assume $p(x \mid y = 1)$ is Gaussian.

Assume $p(x \mid y = 1)$ is fully factorized.

Rather than modeling the density $p(x \mid y = 1)$, directly go for the decision boundary $h(x) = -\ln \dfrac{q_i(x)}{q_j(x)}$:

Logistic regression

Support vector machines

Neural networks

Naïve Bayes Classifier

Use the Bayes decision rule for classification:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$

But assume $p(x \mid y = 1)$ is fully factorized:

$$p(x \mid y = 1) = \prod_{i=1}^{d} p(x_i \mid y = 1)$$

In other words, the variables corresponding to each dimension of the data are assumed independent given the label.
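A minimal classifier following this factorization (a sketch; modeling each per-dimension factor $p(x_i \mid y)$ as a 1-D Gaussian is one common choice, not the only one):

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes: p(x|y) = prod_i p(x_i|y), each factor a 1-D Gaussian."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([(y == c).mean() for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        # Small floor keeps the per-dimension variances strictly positive.
        self.sigma = np.array([X[y == c].std(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict(self, X):
        # Work in log space: log P(y) + sum_d log N(x_d | mu_{y,d}, sigma_{y,d}^2).
        # Shapes: X (n, d), mu/sigma (K, d) -> z (n, K, d) -> log_lik (n, K).
        z = (X[:, None, :] - self.mu[None, :, :]) / self.sigma[None, :, :]
        log_lik = (-0.5 * z ** 2 - np.log(self.sigma[None, :, :])).sum(axis=2)
        return self.classes[np.argmax(np.log(self.prior)[None, :] + log_lik, axis=1)]

X = np.array([[-2.0, -2.1], [-1.9, -2.0], [2.0, 2.1], [2.1, 1.9]])
y = np.array([-1, -1, 1, 1])
clf = GaussianNaiveBayes().fit(X, y)
pred = clf.predict(np.array([[-2.0, -2.0], [2.0, 2.0]]))
```

Working in log space avoids underflow when the product over many dimensions becomes tiny; the constant $-\tfrac{1}{2}\log 2\pi$ per dimension is shared across classes and can be dropped.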