
Introduction to Bayesian inference

Thomas Alexander Brouwer

University of Cambridge

tab43@cam.ac.uk

17 November 2015

Probabilistic models

- Describe how data was generated using probability distributions
- Generative process
- Data D, parameters θ
- Want to find the best parameters θ for the data D: inference

Topic modelling

- Documents D1, ..., DD
- Documents cover topics, with a distribution θd = (t1, ..., tT)
- Words in document, Dd = {wd,1, ..., wd,N}
- Some words are more prevalent in some topics
- Topics have a word distribution φt = (w1, ..., wV)
- Data is the words in documents D1, ..., DD; parameters are the θd, φt

From Blei's ICML 2012 tutorial

Overview

Probabilistic models
    Probability theory
    Latent variable models

Bayesian inference
    Bayes' theorem
    Latent Dirichlet Allocation
    Conjugacy

Graphical models

Gibbs sampling

Variational Bayesian inference

Probability primer

I Random variable X

I Probability distribution p(X )

I Discrete distribution e.g. coin flip or dice roll

I Continuous distribution e.g. height distribution

Multiple RVs

I Joint distribution p(X ,Y )

I Conditional distribution p(X |Y )

Probability rules

- Chain rule: p(X|Y) = p(X,Y) / p(Y), or p(X,Y) = p(X|Y)p(Y)
- Marginal rule: p(X) = Σ_Y p(X,Y) = Σ_Y p(X|Y)p(Y)

For continuous variables,

p(x) = ∫_y p(x,y) dy = ∫_y p(x|y)p(y) dy

We can add more conditional random variables if we want, so e.g. p(X,Y|Z) = p(X|Y,Z)p(Y|Z).
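A quick numeric check of the chain and marginal rules on a toy joint distribution (the table values below are made up purely for illustration):

```python
# Toy joint distribution p(X, Y) over X in {0, 1}, Y in {0, 1}.
# Values are illustrative, chosen only so they sum to 1.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

# Marginal rule: p(X = x) = sum over y of p(X = x, Y = y)
p_x = {x: sum(p for (xi, y), p in joint.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (x, yi), p in joint.items() if yi == y) for y in (0, 1)}

# Chain rule: p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y)
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for (x, y) in joint}

# The marginal rule again, via the conditional: p(X) = sum_y p(X|y) p(y)
for x in (0, 1):
    recon = sum(p_x_given_y[(x, y)] * p_y[y] for y in (0, 1))
    assert abs(recon - p_x[x]) < 1e-12

print(p_x)
```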

Independence

I X and Y are independent if p(X ,Y ) = p(X )p(Y )

I Equivalently, if p(Y |X ) = p(Y )

Expectation and variance

- Expectation E[X] = Σ_x x · p(X = x), or E[X] = ∫_x x · p(x) dx
- Variance V[X] = E[(X − E[X])²] = E[X²] − E[X]²

where E[X²] = Σ_x x² · p(X = x) or E[X²] = ∫_x x² · p(x) dx

Dice roll:

E[X] = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 7/2

E[X²] = 1²·(1/6) + 2²·(1/6) + 3²·(1/6) + 4²·(1/6) + 5²·(1/6) + 6²·(1/6) = 91/6

So V[X] = 91/6 − (7/2)² = 35/12
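The dice-roll arithmetic above can be checked exactly with rational numbers:

```python
from fractions import Fraction

# Exact expectation and variance of a fair six-sided die,
# matching the worked example above.
outcomes = range(1, 7)
p = Fraction(1, 6)  # uniform probability of each face

e_x = sum(x * p for x in outcomes)       # E[X]
e_x2 = sum(x**2 * p for x in outcomes)   # E[X^2]
var = e_x2 - e_x**2                      # V[X] = E[X^2] - E[X]^2

print(e_x, e_x2, var)  # 7/2 91/6 35/12
```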

Latent variable models

I Manifest or observed variable

I Latent or unobserved variable

I Latent variable models

Probability distributions

Categorical distribution

- N possible outcomes, with probabilities (p1, ..., pN)
- Draw a single value, e.g. throw a die once
- Parameters θ = (p1, ..., pN)
- Discrete distribution, p(X = i) = pi
- Expectation for outcome i is pi; variance is pi(1 − pi)

Probability distributions

Dirichlet distribution

- Draws are vectors x = (x1, ..., xN) s.t. Σ_i xi = 1
- In other words, draws are probability vectors: the parameter to the categorical distribution
- Parameters θ = (α1, ..., αN) = α
- Continuous distribution, p(x) = (1 / B(α)) Π_i xi^(αi − 1)

  where B(α) = Π_i Γ(αi) / Γ(Σ_i αi) and Γ(αi) = ∫_0^∞ y^(αi − 1) e^(−y) dy

- Expectation for the ith element xi is E[xi] = αi / Σ_j αj
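A sketch of sampling from a Dirichlet without external libraries, using the standard construction of normalising independent Gamma(αi, 1) draws; the α values here are arbitrary toy choices. The empirical mean should approach αi / Σ_j αj, as the expectation formula above says:

```python
import random

def dirichlet_sample(alpha, rng=random):
    """Draw from Dirichlet(alpha) by normalising independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

random.seed(0)
alpha = [2.0, 3.0, 5.0]   # so E[x] = (0.2, 0.3, 0.5)
samples = [dirichlet_sample(alpha) for _ in range(20000)]

# Every draw is a probability vector...
assert all(abs(sum(x) - 1.0) < 1e-12 for x in samples)

# ...and the empirical means approach alpha_i / sum(alpha)
means = [sum(x[i] for x in samples) / len(samples) for i in range(3)]
print([round(m, 2) for m in means])
```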

Overview

Probabilistic models
    Probability theory
    Latent variable models

Bayesian inference
    Bayes' theorem
    Latent Dirichlet Allocation
    Conjugacy

Graphical models

Gibbs sampling

Variational Bayesian inference

Unfair dice

- Die with unknown distribution, p = (p1, p2, p3, p4, p5, p6)
- We observe some throws and want to estimate p
- Say we observe 4, 6, 6, 4, 6, 3
- Perhaps p = (0, 0, 1/6, 2/6, 0, 3/6)

Maximum likelihood

- Maximum likelihood solution: θ_ML = argmax_θ p(D|θ)
- Easily leads to overfitting
- Want to incorporate some prior belief or knowledge about our parameters
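For the categorical likelihood, the maximum-likelihood estimate is simply the relative frequencies, which reproduces the "perhaps" guess from the unfair-dice slide and shows the overfitting problem: every unseen face gets probability zero.

```python
from collections import Counter
from fractions import Fraction

# ML estimate for the die probabilities: relative frequencies.
throws = [4, 6, 6, 4, 6, 3]
counts = Counter(throws)
n = len(throws)

p_ml = [Fraction(counts.get(face, 0), n) for face in range(1, 7)]

# p_ml = (0, 0, 1/6, 1/3, 0, 1/2): faces 1, 2 and 5 get probability
# zero, even though six throws is far too little data to conclude that.
print(p_ml)
```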

Bayes' theorem

Bayes' theorem: For any two random variables X and Y,

p(X|Y) = p(Y|X) p(X) / p(Y)

Proof: From the chain rule, p(X,Y) = p(Y|X)p(X) = p(X|Y)p(Y). Divide both sides by p(Y).

Disease test

- Test for disease with 99% accuracy
- 1 in 1000 people have the disease
- You tested positive. What is the probability that you have the disease?

Disease test

- Let X = disease, and Y = positive
- Want to know p(X|Y): the probability of disease given a positive test
- From Bayes, p(X|Y) = p(Y|X) p(X) / p(Y)
- p(Y|X) = 0.99, p(X) = 0.001

p(Y) = p(Y,X) + p(Y,!X) = p(Y|X)p(X) + p(Y|!X)p(!X)
     = 0.99 · 0.001 + 0.01 · 0.999 = 0.01098

- So p(X|Y) = 0.99 · 0.001 / 0.01098 ≈ 0.0902
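The disease-test calculation, step by step:

```python
# Disease test posterior, following the slide's numbers.
p_pos_given_disease = 0.99   # test sensitivity (99% accuracy)
p_disease = 0.001            # 1 in 1000 people have the disease
p_pos_given_healthy = 0.01   # false-positive rate (1 - accuracy)
p_healthy = 1 - p_disease

# Marginal probability of a positive test (marginal rule)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# Bayes' theorem: p(disease | positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_pos, 5), round(p_disease_given_pos, 4))  # 0.01098 0.0902
```

Despite the "99% accurate" test, the posterior probability of disease is only about 9%, because the disease is so rare.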

Bayes' theorem for inference

- Want to find the best parameters θ for our model after observing the data D
- ML overfits by using p(D|θ)
- Need some way of using prior belief about the parameters
- Consider p(θ|D): our belief about the parameters after observing the data

Bayesian inference

- Using Bayes' theorem, p(θ|D) = p(D|θ) p(θ) / p(D)
- Prior p(θ)
- Likelihood p(D|θ)
- Posterior p(θ|D)
- Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|D) = argmax_θ p(D|θ) p(θ)
- Bayesian inference: find the full posterior distribution p(θ|D)
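A minimal sketch of the MAP estimate for the unfair die, assuming a symmetric Dirichlet prior; the choice α = 2 is hypothetical, and the update uses the standard Dirichlet-categorical MAP formula p_i = (c_i + α_i − 1) / (N + Σα_i − K):

```python
from collections import Counter
from fractions import Fraction

# MAP estimate for the die with a symmetric Dirichlet(alpha) prior.
# alpha = 2 is a hypothetical choice for illustration.
throws = [4, 6, 6, 4, 6, 3]
counts = Counter(throws)
K, N = 6, len(throws)   # number of faces, number of throws
alpha = 2

# p_i = (c_i + alpha - 1) / (N + K*alpha - K)
denom = N + K * alpha - K
p_map = [Fraction(counts.get(face, 0) + alpha - 1, denom) for face in range(1, 7)]

print(p_map)
```

Unlike the ML estimate, no face gets probability zero: the prior pulls the estimate away from the overfitted frequencies.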

Intractability

- In our model we define the prior p(θ) and likelihood p(D|θ)
- How do we find p(D)?
- p(D) = ∫ p(D, θ) dθ = ∫ p(D|θ) p(θ) dθ
- BUT: the space of possible values for θ is huge!
- Approximate Bayesian inference

Latent Dirichlet Allocation

Generative process

- Draw document-to-topic distributions, θd ~ Dir(α) (d = 1, ..., D)
- Draw topic-to-word distributions, φt ~ Dir(β) (t = 1, ..., T)
- For each of the N words in each of the D documents:
  - Draw a topic from the document's topic distribution, zdn ~ Multinomial(θd)
  - Draw a word from the topic's word distribution, wdn ~ Multinomial(φ_zdn)

Note that our model's data is the words wdn we observe, and the parameters are the θd, φt. We have placed Dirichlet priors over the parameters, with their own parameters α, β.
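The generative process above can be sketched directly in code. Sizes, vocabulary, and hyperparameter values are toy choices for illustration, and the Dirichlet draws reuse the normalised-Gamma construction:

```python
import random

# Toy LDA generative process: all sizes and hyperparameters are
# illustrative values, not anything from the slides.
random.seed(1)
D, T, N, V = 4, 2, 10, 6    # documents, topics, words per doc, vocab size
alpha, beta = 0.5, 0.5

def dirichlet(a, k):
    """Symmetric Dirichlet draw via normalised Gamma(a, 1) variables."""
    g = [random.gammavariate(a, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def categorical(p):
    return random.choices(range(len(p)), weights=p)[0]

theta = [dirichlet(alpha, T) for _ in range(D)]   # document-to-topic distributions
phi = [dirichlet(beta, V) for _ in range(T)]      # topic-to-word distributions

docs = []
for d in range(D):
    words = []
    for n in range(N):
        z = categorical(theta[d])   # topic for word n of document d
        w = categorical(phi[z])     # word drawn from that topic's distribution
        words.append(w)
    docs.append(words)

print(docs[0])
```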

Hyperparameters

In our model we have:

- Random variables: observed ones, like the words; and latent ones, like the topics
- Parameters: document-to-topic distributions θd and topic-to-word distributions φt
- Hyperparameters: these are parameters to the prior distributions over our parameters, so α and β

Conjugacy

For a specific parameter θi, p(θi) is conjugate to the likelihood p(D|θi) if the posterior of the parameter, p(θi|D), is of the same family as the prior.

e.g. the Dirichlet distribution is the conjugate prior for the categorical distribution.
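Conjugacy makes the posterior update trivial: with a Dirichlet(α) prior and categorical observations, the posterior is again Dirichlet, with parameters αi plus the observed counts. Using the unfair-die throws from earlier and a symmetric prior:

```python
from collections import Counter

# Dirichlet-categorical conjugacy: posterior parameters are
# prior pseudo-counts plus observed counts.
alpha_prior = [1, 1, 1, 1, 1, 1]   # symmetric prior over six die faces
throws = [4, 6, 6, 4, 6, 3]
counts = Counter(throws)

alpha_posterior = [a + counts.get(face, 0)
                   for face, a in enumerate(alpha_prior, start=1)]
print(alpha_posterior)  # [1, 1, 2, 3, 1, 4]
```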

Overview

Probabilistic models
    Probability theory
    Latent variable models

Bayesian inference
    Bayes' theorem
    Latent Dirichlet Allocation
    Conjugacy

Graphical models

Gibbs sampling

Variational Bayesian inference

Bayesian network

- Nodes are random variables (latent or observed)
- Arrows indicate dependencies
- Distribution of a node only depends on its parents (and things further down the network)
- Plates indicate repetition of variables

[Diagram: network with nodes A, B, C, D and arrows A → C, B → C, C → D]

p(D|A,B,C) = p(D|C)
BUT: p(C|A,B,D) ≠ p(C|A,B)

Recall Bayes': p(X|Y,Z) = p(Y|X,Z) p(X|Z) / p(Y|Z)

p(C|A,B,D) = p(D|A,B,C) p(C|A,B) / p(D|A,B)
           = p(D|C) p(C|A,B) / ∫_C p(C,D|A,B) dC
           = p(D|C) p(C|A,B) / ∫_C p(D|A,B,C) p(C|A,B) dC
           = p(D|C) p(C|A,B) / ∫_C p(D|C) p(C|A,B) dC

Latent Dirichlet Allocation

Generative process

- Draw document-to-topic distributions, θd ~ Dir(α) (d = 1, ..., D)
- Draw topic-to-word distributions, φt ~ Dir(β) (t = 1, ..., T)
- For each of the N words in each of the D documents:
  - Draw a topic from the document's topic distribution, zdn ~ Multinomial(θd)
  - Draw a word from the topic's word distribution, wdn ~ Multinomial(φ_zdn)

Latent Dirichlet Allocation

[Figure from http://parkcu.com/blog/]

Overview

Probabilistic models
    Probability theory
    Latent variable models

Bayesian inference
    Bayes' theorem
    Latent Dirichlet Allocation
    Conjugacy

Graphical models

Gibbs sampling

Variational Bayesian inference

Gibbs sampling

- Want to approximate p(θ|D) for parameters θ = (θ1, ..., θN)
- Cannot compute this exactly, but maybe we can draw samples from it
- We can then use these samples to estimate the distribution, or estimate the expectation and variance

Gibbs sampling

- For each parameter θi, write down its distribution conditional on the data and the values of the other parameters, p(θi | θ−i, D)
- If our model is conjugate, this gives closed-form expressions (meaning this distribution is of a known form, e.g. Dirichlet, so we can draw from it)
- Drawing new values for the parameters θi in turn will eventually converge to give draws from the true posterior, p(θ|D)
- Burn-in, thinning
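The scheme above, sketched on a toy target rather than LDA: a bivariate normal with zero means, unit variances and correlation ρ, where both conditionals are known in closed form (x | y ~ N(ρy, 1 − ρ²), and symmetrically). This is a standard textbook example, not anything from the slides:

```python
import random

# Minimal Gibbs sampler for a bivariate normal with correlation rho.
random.seed(2)
rho = 0.8
sd = (1 - rho**2) ** 0.5   # conditional standard deviation

x, y = 0.0, 0.0
samples = []
for step in range(60000):
    x = random.gauss(rho * y, sd)   # draw x conditional on the current y
    y = random.gauss(rho * x, sd)   # draw y conditional on the new x
    if step >= 5000 and step % 5 == 0:   # burn-in, then thinning
        samples.append((x, y))

n = len(samples)
mean_x = sum(s[0] for s in samples) / n
corr = sum(s[0] * s[1] for s in samples) / n   # E[xy] = rho for this target
print(round(mean_x, 2), round(corr, 2))
```

After burn-in, the retained draws behave like samples from the target: the empirical mean of x is near 0 and the empirical E[xy] is near ρ.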

Latent Dirichlet Allocation

- Want to draw samples from p(θ, φ, z | w)
- w = {wd,n}, d = 1..D, n = 1..N
- z = {zd,n}, d = 1..D, n = 1..N
- θ = {θd}, d = 1..D
- φ = {φt}, t = 1..T

Latent Dirichlet Allocation

For Gibbs sampling, we need the conditional distributions:

- p(θd | θ−d, φ, z, w)
- p(φt | θ, φ−t, z, w)
- p(zd,n | θ, φ, z−(d,n), w)

Latent Dirichlet Allocation

These are relatively straightforward to derive. For example:

p(zd,n | θ, φ, z−(d,n), w)
    = p(w | θ, φ, z) p(zd,n | θ, φ, z−(d,n)) / p(w | θ, φ, z−(d,n))
    ∝ p(wd,n | θ, φ, z) p(zd,n | θ, φ, z−(d,n))
    = p(wd,n | φ_zd,n) p(zd,n | θd)
    = φ_{zd,n, wd,n} · θ_{d, zd,n}

where the first step follows from Bayes' theorem, the second from the fact that some terms do not depend on zd,n, and the third from independence in our Bayesian network.
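The final expression says the conditional for a single topic assignment is proportional to φ_{t, wd,n} · θ_{d,t}; one Gibbs step for zd,n just normalises those weights and samples. The θ and φ values below are hand-picked toy distributions for illustration:

```python
import random

# One Gibbs step for a single topic assignment z_{d,n}:
# its conditional is proportional to phi[t][w] * theta_d[t].
theta_d = [0.7, 0.3]            # document d's topic distribution (toy values)
phi = [[0.5, 0.4, 0.1],         # topic 0's word distribution (toy values)
       [0.1, 0.2, 0.7]]         # topic 1's word distribution (toy values)
w = 2                           # observed word w_{d,n} (vocabulary index)

weights = [phi[t][w] * theta_d[t] for t in range(len(theta_d))]
total = sum(weights)
probs = [wt / total for wt in weights]   # normalise to get the conditional

random.seed(3)
z = random.choices(range(len(probs)), weights=probs)[0]
print(probs, z)
```

Here the rare word w = 2 is much more likely under topic 1, so the conditional favours topic 1 even though the document leans towards topic 0.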
