Classification – Decision boundary & Naïve Bayes
Source: aarti/Class/10315_Fall19/lecs/Lecture3.pdf · 32 slides

Transcript of "Classification – Decision boundary & Naïve Bayes"

Page 1: Title

Classification – Decision boundary & Naïve Bayes

Sub-lecturer: Mariya Toneva

Instructor: Aarti Singh

Machine Learning 10-315 Sept 4, 2019


Page 2: 1-dim Gaussian Bayes classifier

Class probability: Y ~ Bernoulli(θ)

Class conditional density: X | Y = y ~ Gaussian(μ_y, σ_y²)
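To spell out the classifier these two pieces define (standard Bayes rule, with θ = P(Y=1)):

$$P(Y=1 \mid X=x) \;=\; \frac{\theta\,\mathcal{N}(x;\mu_1,\sigma_1^2)}{\theta\,\mathcal{N}(x;\mu_1,\sigma_1^2) + (1-\theta)\,\mathcal{N}(x;\mu_0,\sigma_0^2)}, \qquad \mathcal{N}(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Big(\!-\frac{(x-\mu)^2}{2\sigma^2}\Big),$$

and the prediction is the label with the larger posterior.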

Page 3: d-dim Gaussian Bayes classifier

Class probability: Y ~ Bernoulli(θ)

Class conditional density: X | Y = y ~ Gaussian(μ_y, Σ_y)

Decision Boundary

Page 4: Decision Boundary of Gaussian Bayes

• The decision boundary is the set of points x where P(Y=1|X=x) = P(Y=0|X=x).

If the class conditional feature distribution P(X=x|Y=y) is a 2-dim Gaussian N(μ_y, Σ_y), the boundary has a closed form, sketched below.

Note: In general, this yields a quadratic equation in x. But if Σ_1 = Σ_0, the quadratic part cancels out and the decision boundary is linear.
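Filling in the algebra the slide leaves implicit (standard derivation, with θ = P(Y=1)): taking logs of θ N(x; μ_1, Σ_1) = (1−θ) N(x; μ_0, Σ_0) and expanding the quadratic forms gives

$$-\tfrac{1}{2}\,x^\top\big(\Sigma_1^{-1}-\Sigma_0^{-1}\big)x \;+\; \big(\mu_1^\top\Sigma_1^{-1}-\mu_0^\top\Sigma_0^{-1}\big)x \;+\; c \;=\; 0,$$

where the constant collects the remaining terms, $c = \log\frac{\theta}{1-\theta} - \tfrac{1}{2}\log\frac{|\Sigma_1|}{|\Sigma_0|} - \tfrac{1}{2}\big(\mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0\big)$. The $x^\top(\cdot)x$ term is exactly the quadratic part that vanishes when Σ_1 = Σ_0, leaving a linear boundary.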

Page 5: Multi-class problem, multi-dimensional input X

Page 6: Handwritten digit recognition

Multi-class classification. [Figure: digit classes plotted in a 2-dim feature space with axes φ1(X) and φ2(X).]

Note: 8 digits shown out of 10 (0, 1, …, 9); axes are obtained by nonlinear dimensionality reduction (later in course).

Page 7: Handwritten digit recognition

Training Data: n greyscale images (Input, X) and n labels (Label, Y). Each image is represented as a vector of intensity values at the d pixels (features).

Gaussian Bayes model:

P(Y = y) = p_y for all y in 0, 1, 2, …, 9, where p_0, p_1, …, p_9 sum to 1

P(X = x | Y = y) ~ N(μ_y, Σ_y) for each y, where μ_y is a d-dim vector and Σ_y is a d×d matrix

Page 8: Gaussian Bayes classifier

P(Y = y) = p_y for all y in 0, 1, 2, …, 9, where p_0, p_1, …, p_9 sum to 1

P(X = x | Y = y) ~ N(μ_y, Σ_y) for each y, where μ_y is a d-dim vector and Σ_y is a d×d matrix

How to learn the parameters p_y, μ_y, Σ_y from data?

Page 9: How many parameters do we need to learn?

Class probability:
P(Y = y) = p_y for all y in 0, 1, 2, …, 9, where p_0, p_1, …, p_9 sum to 1
→ K − 1 parameters if K labels

Class conditional distribution of features:
P(X = x | Y = y) ~ N(μ_y, Σ_y) for each y, where μ_y is a d-dim vector and Σ_y is a d×d matrix
→ Kd + Kd(d+1)/2 = O(Kd²) parameters if d features

Quadratic in dimension d! If d = 256×256 pixels, ~21.5 billion parameters!
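A quick check of that count (arithmetic spelled out here; K = 10 classes, d = 256 × 256 = 65,536 pixels):

$$Kd + K\,\frac{d(d+1)}{2} \;=\; 10\cdot 65{,}536 \;+\; 10\cdot\frac{65{,}536\cdot 65{,}537}{2} \;\approx\; 6.6\times 10^{5} + 2.15\times 10^{10} \;\approx\; 21.5\ \text{billion},$$

dominated by the K symmetric d×d covariance matrices.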

Page 10: Hand-written digit recognition

Input, X: images of hand-written digits. Label, Y: 0, 1, 2, …, 9.

Feature representation:
Grey-scale images – d-dim vector of d pixel intensities (continuous features)
Black-white images – d-dim binary (0/1) vector (discrete features)

Page 11: What about discrete features?

Training Data: n black-white images (Input, X) and n labels (Label, Y). Each image is represented as a vector of d binary features (black = 1, white = 0).

Discrete Bayes model:

P(Y = y) = p_y for all y in 0, 1, 2, …, 9, where p_0, p_1, …, p_9 sum to 1

P(X = x | Y = y): for each label y, maintain a probability table with 2^d − 1 entries

Page 12: How many parameters do we need to learn?

Class probability:
P(Y = y) = p_y for all y in 0, 1, 2, …, 9, where p_0, p_1, …, p_9 sum to 1
→ K − 1 parameters if K labels

Class conditional distribution of features:
P(X = x | Y = y): for each label y, maintain a probability table with 2^d − 1 entries
→ K(2^d − 1) parameters if d binary features

Exponential in dimension d!
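To see where 2^d − 1 comes from (worked example; not on the slide): the joint distribution of d binary features has 2^d outcomes, and the probabilities must sum to 1, so one entry is determined by the rest. For d = 2:

$$P(X{=}00\mid y),\;\; P(X{=}01\mid y),\;\; P(X{=}10\mid y)\ \text{are free};\qquad P(X{=}11\mid y) = 1 - \text{(their sum)},$$

giving 2² − 1 = 3 free parameters per class, and K(2^d − 1) overall.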

Page 13: What's wrong with too many parameters?

• How many training examples are needed to learn one parameter (the bias of a coin)?

• Need lots of training data to learn the parameters!

– Training data > number of parameters


Page 14: Naïve Bayes Classifier

• Bayes Classifier with the additional "naïve" assumption:

– Features are independent given class: P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)

– More generally: P(X_1, …, X_d | Y) = ∏_{i=1}^d P(X_i | Y)

• If the conditional independence assumption holds, NB is the optimal classifier! But it can perform worse otherwise.

Page 15: Conditional Independence

• X is conditionally independent of Y given Z: P(X | Y, Z) = P(X | Z),

i.e., the probability distribution governing X is independent of the value of Y, given the value of Z.

• Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)

• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Note: this does NOT mean Thunder is independent of Rain.

Page 16: Conditional vs. Marginal Independence

Example: wearing coats is independent of accidents once we condition on the fact that it rained; marginally (without conditioning on rain), the two are correlated, since rain makes both more likely.

Page 17: Naïve Bayes Classifier

• Bayes Classifier with the additional "naïve" assumption:

– Features are independent given class: P(X_1, …, X_d | Y) = ∏_{i=1}^d P(X_i | Y)

• How many parameters now?

Page 18: Handwritten digit recognition (continuous features)

Training Data: n greyscale images with d pixels (X) and n labels (Y).

How many parameters?

Class probability:
P(Y = y) = p_y for all y → K − 1 parameters if K labels

Class conditional distribution of features (using the Naïve Bayes assumption, which may not hold):
P(X_i = x_i | Y = y) ~ N(μ_i^(y), σ_i^2(y)) for each y and each pixel i → 2Kd parameters

Linear instead of quadratic in d!
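As a concrete illustration, a minimal MLE-fitting sketch of this Gaussian Naïve Bayes model (assumptions: `X` is an (n, d) numpy float array of pixel intensities, `y` is an (n,) integer label array; function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def fit_gaussian_nb(X, y, K=10):
    """MLE for Gaussian Naive Bayes: K class priors, plus a per-class,
    per-pixel mean and variance -- (K - 1) + 2Kd free parameters."""
    n, d = X.shape
    priors = np.zeros(K)          # p_y
    means = np.zeros((K, d))      # mu_i^(y)
    variances = np.zeros((K, d))  # sigma_i^2(y)
    for k in range(K):
        Xk = X[y == k]
        priors[k] = len(Xk) / n
        means[k] = Xk.mean(axis=0)
        variances[k] = Xk.var(axis=0) + 1e-6  # small floor avoids divide-by-zero
    return priors, means, variances

def predict(x, priors, means, variances):
    """Return argmax_y [ log p_y + sum_i log N(x_i; mu_i^(y), sigma_i^2(y)) ]."""
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * variances)
                               + (x - means) ** 2 / variances, axis=1))
    return np.argmax(log_post)
```

Working in log space turns the product over d pixels into a sum, which avoids numeric underflow when d is large.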

Page 19: Independent Gaussians

[Figure: contours of a 2-dim Gaussian over X = [X1; X2] centered at μ, shown for a Σ with non-zero off-diagonals and for a Σ with zero off-diagonals, d = 2.]

A Σ with zero off-diagonals is equivalent to assuming the features are independent Gaussians.
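In symbols (standard fact, stated here for completeness): a diagonal covariance factorizes the joint density,

$$\mathcal{N}(x;\mu,\Sigma) \;=\; \prod_{i=1}^{d}\mathcal{N}(x_i;\mu_i,\sigma_i^2) \qquad \text{when } \Sigma = \mathrm{diag}(\sigma_1^2,\dots,\sigma_d^2),$$

which is exactly the Naïve Bayes assumption applied to a Gaussian class conditional.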

Page 20: Handwritten digit recognition (discrete features)

Training Data: n black-white (1/0) images with d pixels (X) and n labels (Y).

How many parameters?

Class probability:
P(Y = y) = p_y for all y → K − 1 parameters if K labels

Class conditional distribution of features (using the Naïve Bayes assumption, which may not hold):
P(X_i = x_i | Y = y) – one probability value for each y and pixel i → Kd parameters

Linear instead of exponential in d!

Page 21: Naïve Bayes Classifier

• Bayes Classifier with the additional "naïve" assumption:

– Features are independent given class: P(X_1, …, X_d | Y) = ∏_{i=1}^d P(X_i | Y)

• Has fewer parameters, and hence requires less training data, even though the assumption may be violated in practice.

Page 22: How to learn parameters from data? MLE, MAP (Discrete case)

Page 23: Learning parameters in distributions

P(X = 1) = θ, P(X = 0) = 1 − θ

Learning θ is equivalent to learning the probability of heads in a coin flip. How do you learn that? Data = 5 coin flips, 3 of which come up heads. Answer: 3/5. Why??

Page 24: Bernoulli distribution

Data, D = sequence of coin flips (heads/tails)

• P(Heads) = θ, P(Tails) = 1 − θ

• Flips are i.i.d.:
– Independent events
– Identically distributed according to the Bernoulli(θ) distribution

Choose θ that maximizes the probability of the observed data.

Page 25: Maximum Likelihood Estimation (MLE)

Choose θ that maximizes the probability of the observed data (aka the likelihood):

θ̂_MLE = argmax_θ P(D | θ)

MLE of the probability of heads:

θ̂_MLE = n_H / (n_H + n_T) = 3/5

"Frequency of heads"
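The derivation behind that frequency estimate (standard; n_H heads and n_T tails among the flips): the likelihood of i.i.d. Bernoulli data is

$$L(\theta) = \prod_{j=1}^{n} \theta^{x_j}(1-\theta)^{1-x_j} = \theta^{n_H}(1-\theta)^{n_T}, \qquad \log L(\theta) = n_H\log\theta + n_T\log(1-\theta),$$

and setting the derivative to zero, $\frac{n_H}{\theta} - \frac{n_T}{1-\theta} = 0$, gives $\hat{\theta}_{MLE} = \frac{n_H}{n_H + n_T}$, which is 3/5 for the data above.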

Page 26: Multinomial distribution

Data, D = rolls of a die

• P(1) = p_1, P(2) = p_2, …, P(6) = p_6, with p_1 + … + p_6 = 1

• Rolls are i.i.d.:
– Independent events
– Identically distributed according to the Multinomial(θ) distribution, where θ = {p_1, p_2, …, p_6}

Choose θ that maximizes the probability of the observed data.

Page 27: Maximum Likelihood Estimation (MLE)

Choose θ that maximizes the probability of the observed data.

MLE of the probability of rolls:

p̂_y = (number of rolls that turn up y) / (total number of rolls)

"Frequency of roll y"
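Again a standard derivation, spelled out here (n_y = number of rolls that turn up y, n = total number of rolls): maximize the log-likelihood Σ_y n_y log p_y subject to Σ_y p_y = 1, using a Lagrange multiplier λ:

$$\frac{\partial}{\partial p_y}\Big[\sum_{y'} n_{y'}\log p_{y'} + \lambda\Big(1-\sum_{y'} p_{y'}\Big)\Big] = \frac{n_y}{p_y} - \lambda = 0 \;\Rightarrow\; p_y = \frac{n_y}{\lambda};\qquad \sum_y p_y = 1 \Rightarrow \lambda = n,\;\; \hat{p}_y = \frac{n_y}{n}.$$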

Page 28: Back to Naïve Bayes

Page 29: Naïve Bayes with discrete features

Training Data: n black-white images (Input, X) and n labels (Label, Y). Each image is represented as a vector of d binary features (black = 1, white = 0).

Discrete Naïve Bayes model:

P(Y = y) = p_y for all y in 0, 1, 2, …, 9, where p_0, p_1, …, p_9 sum to 1

P(X_i = x_i | Y = y) – one probability value for each y and pixel i

Page 30: Naïve Bayes Algo – Discrete features

• Training Data: {(x^(j), y^(j))}, j = 1, …, n, with x^(j) ∈ {0,1}^d

• Maximum Likelihood Estimates

– For class probability: p̂_y = (number of training examples with label y) / n

– For class conditional distribution: P̂(X_i = x_i | Y = y) = (number of examples with X_i = x_i and label y) / (number of examples with label y)

• NB Prediction for test data x: ŷ = argmax_y P̂(Y = y) ∏_{i=1}^d P̂(X_i = x_i | Y = y)

A runnable sketch of this algorithm follows below.
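A minimal sketch of the algorithm above (assumptions: `X` is an (n, d) 0/1 numpy array, `y` an (n,) integer array; the `alpha` smoothing knob is an addition anticipating the MAP fix discussed on the next two slides, not part of the MLE on this slide):

```python
import numpy as np

def fit_discrete_nb(X, y, K=10, alpha=0):
    """Naive Bayes with d binary features.
    alpha=0 gives the plain MLE (frequency counts);
    alpha=1 gives Laplace 'add-one' smoothing (see Pages 31-32)."""
    n, d = X.shape
    priors = np.array([(y == k).mean() for k in range(K)])  # p_y
    theta = np.zeros((K, d))                                # P(X_i = 1 | Y = y)
    for k in range(K):
        Xk = X[y == k]
        theta[k] = (Xk.sum(axis=0) + alpha) / (len(Xk) + 2 * alpha)
    return priors, theta

def predict(x, priors, theta, eps=1e-12):
    """argmax_y log p_y + sum_i log P(X_i = x_i | Y = y); log space
    avoids underflow when multiplying d per-pixel probabilities."""
    log_lik = x * np.log(theta + eps) + (1 - x) * np.log(1 - theta + eps)
    return np.argmax(np.log(priors + eps) + log_lik.sum(axis=1))
```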

Page 31: Issues with Naïve Bayes

• Issue 1: Usually, features are not conditionally independent:

P(X_1, …, X_d | Y) ≠ ∏_{i=1}^d P(X_i | Y)

Nonetheless, NB is the single most used classifier; particularly when data is limited, it works well.

• Issue 2: Typically use MAP estimates instead of MLE, since insufficient data may cause the MLE to be zero.

Page 32: Insufficient data for MLE

• What if you never see a training instance where X_1 = a when Y = b?

– e.g., b = {SpamEmail}, a = {'Earn'}

– P̂(X_1 = a | Y = b) = 0

• Thus, no matter what values X_2, …, X_d take:

P̂(X_1 = a, X_2 = x_2, …, X_d = x_d | Y = b) = P̂(X_1 = a | Y = b) ∏_{i=2}^d P̂(X_i = x_i | Y = b) = 0

• What now???
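A standard remedy, sketched here (this is the MAP estimation flagged as Issue 2 on Page 31; the specific version below is Laplace "add-one" smoothing, equivalent to a Beta(2, 2) prior on each pixel probability):

$$\hat{P}(X_i = 1 \mid Y = y) \;=\; \frac{\#\{X_i = 1,\ Y = y\} + 1}{\#\{Y = y\} + 2},$$

which is never exactly 0 or 1, so a single unseen feature value can no longer zero out the whole product. This is what the `alpha = 1` option in the sketch on Page 30 implements.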