Bayesian methods & Naïve Bayes

Bayesian methods & Naïve Bayes Lecture 18 David Sontag New York University Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate

Transcript of Bayesian methods & Naïve Bayes

Page 1:

Bayesian methods & Naïve Bayes, Lecture 18

David Sontag, New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate

Page 2:

Beta prior distribution – P(θ)

• The posterior distribution:

With a uniform prior, the posterior is simply proportional to the likelihood:

P(\theta) \propto 1, \qquad P(\theta \mid D) \propto P(D \mid \theta)

With a Beta(\beta_H, \beta_T) prior and \alpha_H observed heads, \alpha_T observed tails:

P(\theta \mid D) \propto \theta^{\alpha_H}(1-\theta)^{\alpha_T} \, \theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1}
                = \theta^{\alpha_H + \beta_H - 1}(1-\theta)^{\alpha_T + \beta_T - 1}
                = \mathrm{Beta}(\alpha_H + \beta_H,\ \alpha_T + \beta_T)
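To make the conjugate update concrete, here is a small Python sketch (not from the slides; the counts are made up) that folds the observed heads/tails counts into a Beta prior:

```python
from scipy.stats import beta

# Observed flips (alpha_H heads, alpha_T tails) and a Beta prior with
# hyperparameters (beta_H, beta_T), matching the notation above.
alpha_H, alpha_T = 7, 3
beta_H, beta_T = 2, 2

# Conjugacy: the posterior is again a Beta distribution.
posterior = beta(alpha_H + beta_H, alpha_T + beta_T)

print("Posterior: Beta(%d, %d)" % (alpha_H + beta_H, alpha_T + beta_T))
print("Posterior mean of theta:", posterior.mean())  # (a_H+b_H)/(a_H+b_H+a_T+b_T)
```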

Page 3:

Using Bayesian inference for prediction

• We now have a distribution over parameters
• For any specific f, a function of interest, compute the expected value of f:

E[f(\theta)] = \int f(\theta) \, P(\theta \mid D) \, d\theta

• Integral is often hard to compute
• As more data is observed, the posterior becomes more concentrated
• MAP (maximum a posteriori approximation): use the most likely parameter to approximate the expectation:

\hat\theta_{MAP} = \arg\max_\theta P(\theta \mid D), \qquad E[f(\theta)] \approx f(\hat\theta_{MAP})

Page 4:

MAP for Beta distribution

• MAP: use the most likely parameter:

P(\theta \mid D) \propto \theta^{\alpha_H + \beta_H - 1}(1-\theta)^{\alpha_T + \beta_T - 1} = \mathrm{Beta}(\alpha_H + \beta_H,\ \alpha_T + \beta_T)

\hat\theta_{MAP} = \arg\max_\theta P(\theta \mid D) = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}

• Beta prior equivalent to extra thumbtack flips
• As N → ∞, prior is "forgotten"
• But, for small sample size, prior is important!
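A quick numerical illustration of the MAP formula and of the prior being "forgotten" as N grows (my own sketch; the counts and prior strength are hypothetical):

```python
def theta_map(alpha_H, alpha_T, beta_H, beta_T):
    """Mode of the Beta(alpha_H + beta_H, alpha_T + beta_T) posterior."""
    return (alpha_H + beta_H - 1) / (alpha_H + beta_H + alpha_T + beta_T - 2)

beta_H, beta_T = 10, 10  # prior equivalent to 20 extra thumbtack flips

for n_heads, n_tails in [(3, 1), (300, 100)]:
    mle = n_heads / (n_heads + n_tails)
    map_est = theta_map(n_heads, n_tails, beta_H, beta_T)
    print(f"N={n_heads + n_tails:4d}  MLE={mle:.3f}  MAP={map_est:.3f}")

# With N=4 the MAP estimate (~0.545) is pulled toward the prior mean 0.5;
# with N=400 it (~0.739) is close to the MLE 0.750.
```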

Page 5:

What about continuous variables?

• Billionaire says: If I am measuring a continuous variable, what can you do for me?

• You say: Let me tell you about Gaussians…

Page 6:

Some properties of Gaussians

• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian:
  – X ~ N(µ, σ²)
  – Y = aX + b  ⇒  Y ~ N(aµ + b, a²σ²)
• Sum of independent Gaussians is Gaussian:
  – X ~ N(µ_X, σ²_X)
  – Y ~ N(µ_Y, σ²_Y)
  – Z = X + Y  ⇒  Z ~ N(µ_X + µ_Y, σ²_X + σ²_Y)

• Easy to differentiate, as we will see soon!
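These two closure properties can be checked by sampling; a minimal sketch with assumed parameter values (not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0
a, b = 0.5, 1.0

X = rng.normal(mu, sigma, size=1_000_000)
Y = a * X + b                              # affine transform of a Gaussian
W = rng.normal(-1.0, 2.0, size=X.shape)    # independent Gaussian
Z = X + W                                  # sum of independent Gaussians

print(Y.mean(), Y.std())   # close to a*mu + b = 2.0 and |a|*sigma = 1.5
print(Z.mean(), Z.var())   # close to mu + (-1) = 1.0 and sigma^2 + 2^2 = 13.0
```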

Page 7:

Learning a Gaussian

• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – Mean: µ
  – Variance: σ²

  i     Exam score x_i
  0     85
  1     95
  2     100
  3     12
  …     …
  99    89

Page 8:

MLE for Gaussian:

• Prob. of i.i.d. samples D = {x_1, …, x_N}:

P(D \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}

• Log-likelihood of data:

\ln P(D \mid \mu, \sigma) = -N \ln\left(\sigma\sqrt{2\pi}\right) - \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2}

\mu_{MLE}, \sigma_{MLE} = \arg\max_{\mu,\sigma} P(D \mid \mu, \sigma)

Page 9:

Your second learning algorithm: MLE for mean of a Gaussian

• What's MLE for the mean?

\mu_{MLE}, \sigma_{MLE} = \arg\max_{\mu,\sigma} P(D \mid \mu, \sigma)

\frac{d}{d\mu} \ln P(D \mid \mu, \sigma) = \sum_{i=1}^{N} \frac{x_i - \mu}{\sigma^2} = 0
\;\Rightarrow\; \sum_{i=1}^{N} x_i - N\mu = 0
\;\Rightarrow\; \mu_{MLE} = \frac{1}{N} \sum_{i=1}^{N} x_i


Page 10:

MLE for variance
• Again, set the derivative to zero:

\frac{d}{d\sigma} \ln P(D \mid \mu, \sigma) = -\frac{N}{\sigma} + \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{\sigma^3} = 0
\;\Rightarrow\; \sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_{MLE})^2

Page 11:

Learning Gaussian parameters

• MLE:

\mu_{MLE} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_{MLE})^2

• MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator:

\hat\sigma^2_{N-1} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu_{MLE})^2
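A minimal sketch of these estimators in code (the tiny sample is made up, not the exam-score table above):

```python
import numpy as np

x = np.array([85, 95, 100, 12, 89], dtype=float)  # made-up sample
N = len(x)

mu_mle = x.sum() / N                                # (1/N) sum_i x_i
var_mle = ((x - mu_mle) ** 2).sum() / N             # biased MLE: divides by N
var_unbiased = ((x - mu_mle) ** 2).sum() / (N - 1)  # unbiased: divides by N-1

# Equivalent NumPy calls: np.var(x, ddof=0) and np.var(x, ddof=1).
print(mu_mle, var_mle, var_unbiased)
```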

Page 12:

Bayesian learning of Gaussian parameters

• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution
• Prior for mean:

Page 13:

Naïve Bayes

Slides adapted from Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld

Page 14:

Bayesian Classification

• Problem statement:
  – Given features X_1, X_2, …, X_n
  – Predict a label Y

Page 15:

Example Application

• Digit Recognition

• X_1, …, X_n ∈ {0, 1} (black vs. white pixels)

• Y ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

[Figure: an example digit image is fed into the classifier, which outputs "5"]

Page 16:

The Bayes Classifier

• If we had the joint distribution on X_1, …, X_n and Y, we could predict using:

\arg\max_y P(Y = y \mid X_1, \ldots, X_n)

  – (for example: what is the probability that the image represents a 5, given its pixels?)
• So … how do we compute that?

Page 17:

The Bayes Classifier

• Use Bayes' rule!

P(Y \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid Y) \, P(Y)}{P(X_1, \ldots, X_n)}

Here P(X_1, …, X_n | Y) is the likelihood, P(Y) is the prior, and P(X_1, …, X_n) is the normalization constant.

• Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label.

Page 18:

The Bayes Classifier

• Let's expand this for our digit recognition task:

P(Y = y \mid X_1, \ldots, X_n) \propto P(X_1, \ldots, X_n \mid Y = y) \, P(Y = y)

• To classify, we'll simply compute these probabilities, one per class, and predict based on which one is largest

Page 19:

Model Parameters

• How many parameters are required to specify the likelihood, P(X_1, …, X_n | Y)?
  – (Supposing that each image is 30x30 pixels)
• The problem with explicitly modeling P(X_1, …, X_n | Y) is that there are usually way too many parameters:
  – We'll run out of space
  – We'll run out of time
  – And we'll need tons of training data (which is usually not available)

Page 20:

Naïve Bayes
• Naïve Bayes assumption:
  – Features are independent given class:

P(X_1, X_2 \mid Y) = P(X_1 \mid X_2, Y) \, P(X_2 \mid Y) = P(X_1 \mid Y) \, P(X_2 \mid Y)

  – More generally:

P(X_1, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)

• How many parameters now?
• Suppose X is composed of n binary features (see the worked count below)
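For concreteness, a worked count for the 30x30 binary-pixel digit setup from the earlier slide (n = 900 features, 10 classes): specifying P(X_1, …, X_900 | Y) explicitly needs 2^900 − 1 free parameters per class, i.e. about 10 · (2^900 − 1) ≈ 8.5 × 10^271 numbers in total. Under the Naïve Bayes assumption, we need only one Bernoulli parameter P(X_i = 1 | Y = y) per feature and class, i.e. 900 · 10 = 9000 parameters, plus 9 free parameters for the prior P(Y).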

Page 21:

The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each X_i, we have likelihood P(X_i | Y)
• Decision rule:

y^* = h_{NB}(x) = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)

If certain assumptions hold, NB is the optimal classifier! (they typically don't)

[Graphical model: the class variable Y is the parent of the features X_1, X_2, …, X_n]
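A sketch of this decision rule for binary features, computed in log space to avoid numerical underflow (the function and toy parameters are my own, not from the slides):

```python
import numpy as np

def nb_predict(x, log_prior, log_p_on, log_p_off):
    """argmax_y [ log P(y) + sum_i log P(x_i | y) ] for a binary feature vector x.

    log_prior : shape (n_classes,)      log P(Y = y)
    log_p_on  : shape (n_classes, n)    log P(X_i = 1 | Y = y)
    log_p_off : shape (n_classes, n)    log P(X_i = 0 | Y = y)
    """
    scores = log_prior + (log_p_on * x + log_p_off * (1 - x)).sum(axis=1)
    return int(np.argmax(scores))

# Toy usage: 2 classes, 3 binary features, made-up parameters.
p_on = np.array([[0.9, 0.2, 0.1],
                 [0.1, 0.8, 0.7]])
log_prior = np.log([0.5, 0.5])
print(nb_predict(np.array([1, 0, 0]), log_prior, np.log(p_on), np.log(1 - p_on)))  # -> 0
```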

Page 22:

A Digit Recognizer

•  Input: pixel grids

•  Output: a digit 0-9

Page 23:

Naïve Bayes for Digits (Binary Inputs)

• Simple version:
  – One feature F_ij for each grid position <i,j>

–  Possible feature values are on / off, based on whether intensity is more or less than 0.5 in underlying image

–  Each input maps to a feature vector, e.g.

–  Here: lots of features, each is binary valued

•  Naïve Bayes model:

•  Are the features independent given class?

•  What do we need to learn?
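A small sketch of the feature extraction described above, assuming grayscale intensities in [0, 1] (the helper name and the toy image are hypothetical):

```python
import numpy as np

def binarize(image, threshold=0.5):
    """Map a grayscale image (H x W, intensities in [0, 1]) to a flat vector of
    binary features: F_ij = 1 ("on") iff the pixel intensity exceeds the threshold."""
    return (np.asarray(image) > threshold).astype(int).ravel()

print(binarize([[0.9, 0.2],
                [0.4, 0.1]]))   # -> [1 0 0 0]
```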

Page 24:

What has to be learned?

P(Y = y), the prior over digit classes:

  y:   1    2    3    4    5    6    7    8    9    0
  P:  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1

P(F_ij = on | Y = y) for one pixel feature:

  y:   1     2     3     4     5     6     7     8     9     0
  P:  0.01  0.05  0.05  0.30  0.80  0.90  0.05  0.60  0.50  0.80

P(F_ij = on | Y = y) for a second pixel feature:

  y:   1     2     3     4     5     6     7     8     9     0
  P:  0.05  0.01  0.90  0.80  0.90  0.90  0.25  0.85  0.60  0.80

Page 25:

MLE for the parameters of NB
• Given dataset:
  – Count(A = a, B = b): number of examples where A = a and B = b
• MLE for discrete NB, simply:
  – Prior:
  – Observation distribution:

P(Y = y) = \frac{\mathrm{Count}(Y = y)}{\sum_{y'} \mathrm{Count}(Y = y')}

P(X_i = x \mid Y = y) = \frac{\mathrm{Count}(X_i = x, Y = y)}{\sum_{x'} \mathrm{Count}(X_i = x', Y = y)}
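A count-based sketch of these two estimators for binary features (the function and variable names are my own; it assumes every class appears at least once in the data):

```python
import numpy as np

def nb_mle(X, y, n_classes):
    """MLE for discrete NB with binary features.

    X : (N, n) array of 0/1 features,  y : (N,) array of class labels 0..n_classes-1.
    Returns prior[c] = P(Y = c) and p_on[c, i] = P(X_i = 1 | Y = c).
    """
    N, n = X.shape
    prior = np.zeros(n_classes)
    p_on = np.zeros((n_classes, n))
    for c in range(n_classes):
        mask = (y == c)
        prior[c] = mask.sum() / N        # Count(Y=c) / sum_c' Count(Y=c')
        p_on[c] = X[mask].mean(axis=0)   # i.e., average the examples of class c
    return prior, p_on
```

Note that each per-class estimate is just the average of that class's feature vectors, which is exactly the "averaging all of the examples together" described on the next slide.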

Page 26:

MLE for the parameters of NB

• Training amounts to, for each of the classes, averaging all of the examples together:

Page 27:

MAP estimation for NB
• Given dataset:
  – Count(A = a, B = b): number of examples where A = a and B = b
• MAP estimation for discrete NB, simply:
  – Prior:
  – Observation distribution:
• Called "smoothing". Corresponds to a Dirichlet prior!

P(Y = y) = \frac{\mathrm{Count}(Y = y)}{\sum_{y'} \mathrm{Count}(Y = y')}

P(X_i = x \mid Y = y) = \frac{\mathrm{Count}(X_i = x, Y = y) + a}{\sum_{x'} \mathrm{Count}(X_i = x', Y = y) + |X_i| \cdot a}

where a is the pseudo-count (prior strength) and |X_i| is the number of values feature X_i can take.
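The same estimator with pseudo-count smoothing, as a sketch for binary features (so |X_i| = 2); the function name and the default a = 1 are my own choices:

```python
import numpy as np

def nb_map(X, y, n_classes, a=1.0):
    """Laplace/Dirichlet-smoothed estimates for binary-feature NB.

    P(X_i = 1 | Y = c) = (Count(X_i = 1, Y = c) + a) / (Count(Y = c) + 2 * a)
    """
    N, n = X.shape
    prior = np.zeros(n_classes)
    p_on = np.zeros((n_classes, n))
    for c in range(n_classes):
        mask = (y == c)
        prior[c] = mask.sum() / N
        p_on[c] = (X[mask].sum(axis=0) + a) / (mask.sum() + 2 * a)  # |X_i| = 2
    return prior, p_on

# With a > 0, no conditional probability is ever exactly 0 or 1,
# even for (feature value, class) combinations never seen in training.
```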