### Transcript of Introduction to Bayesian inference - University of Cambridge, Thomas Alexander Brouwer

• Introduction to Bayesian inference

Thomas Alexander Brouwer

University of Cambridge

tab43@cam.ac.uk

17 November 2015

• Probabilistic models

- Describe how the data was generated using probability distributions
- Generative process
- Data D, parameters θ
- Want to find the best parameters θ for the data D: inference

• Topic modelling

- Documents D1, ..., DD
- Documents cover topics, with a distribution θd = (t1, ..., tT)
- Words in document, Dd = {wd,1, ..., wd,N}
- Some words are more prevalent in some topics
- Topics have a word distribution φt = (w1, ..., wV)
- Data is the words in documents D1, ..., DD; the parameters are the θd, φt

• From Blei's ICML 2012 tutorial

• Overview

Probabilistic models: Probability theory, Latent variable models

Bayesian inference: Bayes' theorem, Latent Dirichlet Allocation, Conjugacy

Graphical models

Gibbs sampling

Variational Bayesian inference

• Probability primer

- Random variable X
- Probability distribution p(X)
- Discrete distribution, e.g. coin flip or dice roll
- Continuous distribution, e.g. height distribution

• Multiple RVs

- Joint distribution p(X, Y)
- Conditional distribution p(X | Y)

• Probability rules

- Chain rule: p(X | Y) = p(X, Y) / p(Y), or p(X, Y) = p(X | Y) p(Y)
- Marginal rule: p(X) = Σ_Y p(X, Y) = Σ_Y p(X | Y) p(Y)

For continuous variables, p(x) = ∫ p(x, y) dy = ∫ p(x | y) p(y) dy.

We can add more conditional random variables if we want, so e.g. p(X, Y | Z) = p(X | Y, Z) p(Y | Z).
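As a sanity check, the chain rule and both forms of the marginal rule can be verified on a small discrete joint distribution (an illustrative sketch with made-up numbers, not from the slides):

```python
# A made-up joint distribution over X in {0, 1} and Y in {0, 1}.
p_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

# Marginal rule: p(X) = sum over Y of p(X, Y), and similarly for p(Y).
p_x = {x: sum(p_xy[(x, y)] for y in (0, 1)) for x in (0, 1)}
p_y = {y: sum(p_xy[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Chain rule: p(X | Y) = p(X, Y) / p(Y).
p_x_given_y = {(x, y): p_xy[(x, y)] / p_y[y] for (x, y) in p_xy}

# The second form of the marginal rule, p(X) = sum over Y of p(X | Y) p(Y),
# must give the same marginal back.
for x in (0, 1):
    total = sum(p_x_given_y[(x, y)] * p_y[y] for y in (0, 1))
    assert abs(total - p_x[x]) < 1e-12

print({x: round(v, 2) for x, v in p_x.items()})  # {0: 0.4, 1: 0.6}
```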

• Independence

- X and Y are independent if p(X, Y) = p(X) p(Y)
- Equivalently, if p(Y | X) = p(Y)

• Expectation and variance

- Expectation: E[X] = Σ_x x p(X = x), or E[X] = ∫ x p(x) dx for continuous variables
- Variance: V[X] = E[(X − E[X])²] = E[X²] − E[X]²,

where E[X²] = Σ_x x² p(X = x) or E[X²] = ∫ x² p(x) dx.

• Dice roll: E[X] = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 7/2

E[X²] = 1²·(1/6) + 2²·(1/6) + 3²·(1/6) + 4²·(1/6) + 5²·(1/6) + 6²·(1/6) = 91/6

So V[X] = 91/6 − (7/2)² = 35/12
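The same arithmetic can be checked exactly with Python's fractions module (an illustrative check, not part of the slides):

```python
from fractions import Fraction

# Fair six-sided die: each face 1..6 has probability 1/6.
p = Fraction(1, 6)

E = sum(x * p for x in range(1, 7))       # E[X]
E2 = sum(x * x * p for x in range(1, 7))  # E[X^2]
V = E2 - E ** 2                           # V[X] = E[X^2] - E[X]^2

print(E, E2, V)  # 7/2 91/6 35/12
```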

• Latent variable models

- Manifest or observed variables
- Latent or unobserved variables
- Latent variable models

• Probability distributions

Categorical distribution

- N possible outcomes, with probabilities (p1, ..., pN)
- Draw a single value, e.g. throw a die once
- Parameters θ = (p1, ..., pN)
- Discrete distribution, p(X = i) = pi
- Expectation for outcome i is pi, variance is pi(1 − pi)

• Probability distributions

Dirichlet distribution

- Draws are vectors x = (x1, ..., xN) s.t. Σ_i xi = 1
- In other words, draws are probability vectors: the parameter to the categorical distribution
- Parameters α = (α1, ..., αN)
- Continuous distribution, p(x) = (1 / B(α)) Π_i xi^(αi − 1),

where B(α) = Π_i Γ(αi) / Γ(Σ_i αi) and Γ(αi) = ∫₀^∞ y^(αi − 1) e^(−y) dy

- Expectation of the i-th element: E[xi] = αi / Σ_j αj
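A Dirichlet draw can be simulated with only the standard library by normalising independent Gamma draws (a standard construction; the numerical check below is an illustrative sketch, not from the slides):

```python
import random

def dirichlet(alpha, rng=random):
    """Draw from Dirichlet(alpha): normalise independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

random.seed(0)
alpha = [2.0, 3.0, 5.0]
draws = [dirichlet(alpha) for _ in range(20000)]

# Every draw is a probability vector: non-negative entries summing to 1.
assert all(abs(sum(d) - 1.0) < 1e-9 for d in draws)

# The empirical mean of element i approaches alpha_i / sum(alpha) = (0.2, 0.3, 0.5).
means = [sum(d[i] for d in draws) / len(draws) for i in range(len(alpha))]
print([round(m, 2) for m in means])  # close to [0.2, 0.3, 0.5]
```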

• Probabilistic models: Probability theory, Latent variable models

Bayesian inference: Bayes' theorem, Latent Dirichlet Allocation, Conjugacy

Graphical models

Gibbs sampling

Variational Bayesian inference

• Unfair dice

- Die with an unknown distribution, p = (p1, p2, p3, p4, p5, p6)
- We observe some throws and want to estimate p
- Say we observe 4, 6, 6, 4, 6, 3
- Perhaps p = (0, 0, 1/6, 2/6, 0, 3/6)

• Maximum likelihood

- Maximum likelihood solution: θ_ML = argmax_θ p(D | θ)
- Want to incorporate some prior belief or knowledge about our parameters

• Bayes theorem

Bayes' theorem: for any two random variables X and Y,

p(X | Y) = p(Y | X) p(X) / p(Y)

Proof: from the chain rule, p(X, Y) = p(Y | X) p(X) = p(X | Y) p(Y). Divide both sides by p(Y).

• Disease test

- Test for a disease with 99% accuracy
- 1 in 1000 people have the disease
- You tested positive. What is the probability that you have the disease?

• Disease test

- Let X = disease, and Y = positive
- Want to know p(X | Y), the probability of disease given a positive test
- From Bayes, p(X | Y) = p(Y | X) p(X) / p(Y)
- p(Y | X) = 0.99, p(X) = 0.001

p(Y) = p(Y, X) + p(Y, ¬X) = p(Y | X) p(X) + p(Y | ¬X) p(¬X) = 0.99 · 0.001 + 0.01 · 0.999 = 0.01098

- So p(X | Y) = 0.99 · 0.001 / 0.01098 ≈ 0.0902
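The calculation is easy to reproduce (an illustrative script using the slide's numbers):

```python
# Numbers from the slide: 99% accurate test, 1-in-1000 base rate.
p_pos_given_disease = 0.99   # p(Y | X)
p_pos_given_healthy = 0.01   # p(Y | not X)
p_disease = 0.001            # p(X)

# Marginal rule: p(Y) = p(Y | X) p(X) + p(Y | not X) p(not X).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: p(X | Y) = p(Y | X) p(X) / p(Y).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_pos, 5), round(p_disease_given_pos, 4))  # 0.01098 0.0902
```

Despite the "99% accurate" test, the posterior probability of disease is only about 9%, because the disease is so rare.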

• Bayes theorem for inference

- Want to find the best parameters θ for our model after observing the data D
- ML overfits by using p(D | θ)
- Need some way of using prior belief about the parameters
- Consider p(θ | D): our belief about the parameters after observing the data

• Bayesian inference

- Using Bayes' theorem, p(θ | D) = p(D | θ) p(θ) / p(D)
- Prior p(θ)
- Likelihood p(D | θ)
- Posterior p(θ | D)
- Maximum A Posteriori (MAP): θ_MAP = argmax_θ p(θ | D) = argmax_θ p(D | θ) p(θ)
- Bayesian inference: find the full posterior distribution p(θ | D)

• Intractability

- In our model we define the prior p(θ) and the likelihood p(D | θ)
- How do we find p(D)?
- p(D) = ∫ p(D, θ) dθ = ∫ p(D | θ) p(θ) dθ
- BUT: the space of possible values for θ is huge!
- Approximate Bayesian inference

• Latent Dirichlet Allocation

Generative process

- Draw document-to-topic distributions, θd ∼ Dir(α) (d = 1, ..., D)
- Draw topic-to-word distributions, φt ∼ Dir(β) (t = 1, ..., T)
- For each of the N words in each of the D documents:
  - Draw a topic from the document's topic distribution, zdn ∼ Multinomial(θd)
  - Draw a word from the topic's word distribution, wdn ∼ Multinomial(φzdn)

Note that our model's data is the words wdn we observe, and the parameters are the θd, φt. We have placed Dirichlet priors over the parameters, with their own parameters α, β.
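The generative process above can be sketched with the standard library alone (sizes and hyperparameter values below are made up for illustration; the Dirichlet draws use the Gamma-normalisation trick):

```python
import random

random.seed(1)
D, T, V, N = 4, 3, 10, 50  # documents, topics, vocabulary size, words per document
ALPHA, BETA = 0.5, 0.1     # symmetric Dirichlet hyperparameters

def dirichlet(concentration, size):
    """Symmetric Dirichlet draw: normalise independent Gamma draws."""
    g = [random.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(g)
    return [v / total for v in g]

theta = [dirichlet(ALPHA, T) for _ in range(D)]  # document-to-topic distributions
phi = [dirichlet(BETA, V) for _ in range(T)]     # topic-to-word distributions

docs = []
for d in range(D):
    words = []
    for _ in range(N):
        # Draw a topic from the document's topic distribution,
        z = random.choices(range(T), weights=theta[d])[0]
        # then a word from that topic's word distribution.
        words.append(random.choices(range(V), weights=phi[z])[0])
    docs.append(words)

print(len(docs), len(docs[0]))  # 4 50
```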

• Hyperparameters

In our model we have:

- Random variables: observed ones, like the words; and latent ones, like the topics
- Parameters: document-to-topic distributions θd and topic-to-word distributions φt
- Hyperparameters: these are parameters to the prior distributions over our parameters, so α and β

• Conjugacy

For a specific parameter θi, the prior p(θi) is conjugate to the likelihood p(D | θi) if the posterior of the parameter, p(θi | D), is of the same family as the prior.

e.g. the Dirichlet distribution is the conjugate prior for the categorical distribution.
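For the unfair-die example this conjugacy makes the posterior update a matter of counting (an illustrative sketch; the uniform Dir(1, ..., 1) prior is an assumption, not from the slides):

```python
from collections import Counter

alpha = [1.0] * 6              # symmetric Dirichlet prior over the six faces
throws = [4, 6, 6, 4, 6, 3]    # the observed throws from the slides

# Dirichlet is conjugate to the categorical likelihood: the posterior is
# Dirichlet again, with each alpha_i incremented by the count of face i.
counts = Counter(throws)
alpha_post = [alpha[i] + counts.get(i + 1, 0) for i in range(6)]

# The posterior mean of p_i is alpha_post_i / sum(alpha_post).
post_mean = [a / sum(alpha_post) for a in alpha_post]

print(alpha_post)  # [1.0, 1.0, 2.0, 3.0, 1.0, 4.0]
```

Unlike the maximum-likelihood estimate p = (0, 0, 1/6, 2/6, 0, 3/6), the posterior mean keeps some probability on the unseen faces.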

• Probabilistic models: Probability theory, Latent variable models

Bayesian inference: Bayes' theorem, Latent Dirichlet Allocation, Conjugacy

Graphical models

Gibbs sampling

Variational Bayesian inference

• Bayesian network

- Nodes are random variables (latent or observed)
- Arrows indicate dependencies
- The distribution of a node only depends on its parents (and things further down the network)
- Plates indicate repetition of variables

Graph: A → C, B → C, C → D

p(D | A, B, C) = p(D | C), BUT: p(C | A, B, D) ≠ p(C | A, B)

• Graph: A → C, B → C, C → D

Recall Bayes: p(X | Y, Z) = p(Y | X, Z) p(X | Z) / p(Y | Z)

p(C | A, B, D) = p(D | A, B, C) p(C | A, B) / p(D | A, B)

= p(D | C) p(C | A, B) / ∫_C p(C, D | A, B) dC

= p(D | C) p(C | A, B) / ∫_C p(D | A, B, C) p(C | A, B) dC

= p(D | C) p(C | A, B) / ∫_C p(D | C) p(C | A, B) dC

• Latent Dirichlet Allocation

Generative process

- Draw document-to-topic distributions, θd ∼ Dir(α) (d = 1, ..., D)
- Draw topic-to-word distributions, φt ∼ Dir(β) (t = 1, ..., T)
- For each of the N words in each of the D documents:
  - Draw a topic from the document's topic distribution, zdn ∼ Multinomial(θd)
  - Draw a word from the topic's word distribution, wdn ∼ Multinomial(φzdn)

• Latent Dirichlet Allocation

[Plate diagram of LDA] From http://parkcu.com/blog/

• Probabilistic models: Probability theory, Latent variable models

Bayesian inference: Bayes' theorem, Latent Dirichlet Allocation, Conjugacy

Graphical models

Gibbs sampling

Variational Bayesian inference

• Gibbs sampling

- Want to approximate p(θ | D) for parameters θ = (θ1, ..., θN)
- Cannot compute this exactly, but maybe we can draw samples from it
- We can then use these samples to estimate the distribution, or to estimate the expectation and variance

• Gibbs sampling

- For each parameter θi, write down its distribution conditional on the data and the values of the other parameters, p(θi | θ−i, D)
- If our model is conjugate, this gives closed-form expressions (meaning this distribution is of a known form, e.g. Dirichlet, so we can draw from it)
- Drawing new values for the parameters θi in turn will eventually converge to give draws from the true posterior, p(θ | D)
- Burn-in, thinning
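A minimal Gibbs sampler for a toy target (a bivariate standard normal with correlation ρ, not the LDA model) shows the mechanics, including burn-in and thinning; all values below are made up for illustration:

```python
import random

random.seed(2)
RHO = 0.8  # correlation of the toy bivariate normal target

# For this target each full conditional is itself Gaussian:
# x | y ~ N(RHO * y, 1 - RHO^2), and symmetrically for y | x.
def draw_conditional(other):
    return random.gauss(RHO * other, (1 - RHO ** 2) ** 0.5)

x = y = 0.0
samples = []
for i in range(20000):
    x = draw_conditional(y)  # resample x given the current y
    y = draw_conditional(x)  # resample y given the new x
    if i >= 1000 and i % 10 == 0:  # discard burn-in, then thin
        samples.append((x, y))

# The empirical correlation of the retained samples should be close to RHO.
n = len(samples)
mx = sum(s[0] for s in samples) / n
my = sum(s[1] for s in samples) / n
cov = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
vx = sum((s[0] - mx) ** 2 for s in samples) / n
vy = sum((s[1] - my) ** 2 for s in samples) / n
corr = cov / (vx * vy) ** 0.5
print(round(corr, 1))  # roughly 0.8
```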

• Latent Dirichlet Allocation

- Want to draw samples from p(θ, φ, z | w)
- w = {wd,n} d=1..D, n=1..N
- z = {zd,n} d=1..D, n=1..N
- θ = {θd} d=1..D
- φ = {φt} t=1..T

• Latent Dirichlet Allocation

For Gibbs sampling, we need the distributions:

- p(θd | θ−d, φ, z, w)
- p(φt | θ, φ−t, z, w)
- p(zd,n | θ, φ, z−(d,n), w)

• Latent Dirichlet Allocation

These are relatively straightforward to derive. For example:

p(zd,n | θ, φ, z−(d,n), w) = p(w | θ, φ, z) p(zd,n | θ, φ, z−(d,n)) / p(w | θ, φ, z−(d,n))

∝ p(wd,n | θ, φ, zd,n) p(zd,n | θ, φ, z−(d,n))

= p(wd,n | φzd,n) p(zd,n | θd)

= φzd,n,wd,n · θd,zd,n

where the first step follows from Bayes' theorem, the second from the fact that some terms do not depend on zd,n, and the third from the independence structure of our Bayesian network.