Introduction to Bayesian inference

Thomas Alexander Brouwer

University of Cambridge

[email protected]

17 November 2015

Probabilistic models

- Describe how the data was generated using probability distributions
- Generative process
- Data D, parameters θ
- Want to find the “best” parameters θ for the data D – inference

Topic modelling

- Documents D_1, ..., D_D
- Each document covers topics, with a distribution θ_d = (t_1, ..., t_T)
- Words in a document: D_d = {w_{d,1}, ..., w_{d,N}}
- Some words are more prevalent in some topics
- Topics have a word distribution φ_t = (w_1, ..., w_V)
- The data is the words in documents D_1, ..., D_D; the parameters are θ_d, φ_t

[Figure from Blei’s ICML-2012 tutorial]

Overview

- Probabilistic models: Probability theory, Latent variable models
- Bayesian inference: Bayes’ theorem, Latent Dirichlet Allocation, Conjugacy
- Graphical models
- Gibbs sampling
- Variational Bayesian inference

Probability primer

- Random variable X
- Probability distribution p(X)
- Discrete distribution – e.g. a coin flip or die roll
- Continuous distribution – e.g. the distribution of heights

Multiple RVs

- Joint distribution p(X,Y)
- Conditional distribution p(X|Y)

Probability rules

- Chain rule: p(X|Y) = p(X,Y) / p(Y), or equivalently p(X,Y) = p(X|Y) p(Y)
- Marginal rule: p(X) = Σ_Y p(X,Y) = Σ_Y p(X|Y) p(Y)

For continuous variables, p(x) = ∫ p(x,y) dy = ∫ p(x|y) p(y) dy.

We can add more conditioning random variables if we want, e.g. p(X,Y|Z) = p(X|Y,Z) p(Y|Z).
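As a quick sanity check, here is a small Python sketch that verifies the chain and marginal rules numerically on a made-up 2×2 joint distribution (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical joint distribution p(X, Y) over two binary variables;
# rows index X, columns index Y. Chosen only for illustration.
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

p_y = p_xy.sum(axis=0)        # marginal rule: p(Y) = sum_X p(X, Y)
p_x_given_y = p_xy / p_y      # chain rule rearranged: p(X|Y) = p(X, Y) / p(Y)

# Marginal rule via the conditional: p(X) = sum_Y p(X|Y) p(Y)
p_x = (p_x_given_y * p_y).sum(axis=1)
print(p_x, p_xy.sum(axis=1))  # both ways give the same marginal p(X)
```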

Independence

- X and Y are independent if p(X,Y) = p(X) p(Y)
- Equivalently, if p(Y|X) = p(Y)

Expectation and variance

- Expectation: E[X] = Σ_x x · p(X = x), or E[X] = ∫ x · p(x) dx for continuous variables
- Variance: V[X] = E[(X − E[X])²] = E[X²] − E[X]²,
  where E[X²] = Σ_x x² · p(X = x) or E[X²] = ∫ x² · p(x) dx

Dice roll: E[X] = 1 · 1/6 + 2 · 1/6 + 3 · 1/6 + 4 · 1/6 + 5 · 1/6 + 6 · 1/6 = 7/2

E[X²] = 1² · 1/6 + 2² · 1/6 + 3² · 1/6 + 4² · 1/6 + 5² · 1/6 + 6² · 1/6 = 91/6

So V[X] = 91/6 − (7/2)² = 35/12
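A quick numerical check of this dice calculation (a sketch, not part of the original slides):

```python
import numpy as np

faces = np.arange(1, 7)        # fair die: outcomes 1..6, each with probability 1/6
e_x = faces.mean()             # E[X] = 3.5 = 7/2
e_x2 = (faces ** 2).mean()     # E[X^2] = 91/6 ≈ 15.167
var_x = e_x2 - e_x ** 2        # V[X] = E[X^2] - E[X]^2 = 35/12 ≈ 2.917
print(e_x, e_x2, var_x)
```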

Latent variable models

- Manifest or observed variables
- Latent or unobserved variables
- Latent variable models

Probability distributions

Categorical distribution

- N possible outcomes, with probabilities (p_1, ..., p_N)
- Draw a single value – e.g. throw a die once
- Parameters θ = (p_1, ..., p_N)
- Discrete distribution, p(X = i) = p_i
- Expectation for outcome i is p_i, variance is p_i · (1 − p_i)
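For example, drawing from a categorical distribution with numpy might look like this minimal sketch (the probability vector is made up):

```python
import numpy as np

p = np.array([0.2, 0.1, 0.1, 0.3, 0.1, 0.2])   # hypothetical parameters (p_1, ..., p_6), summing to 1
x = np.random.choice(len(p), p=p)              # a single categorical draw: one of the outcomes 0..5
print(x)
```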

Probability distributions

Dirichlet distribution

- Draws are vectors x = (x_1, ..., x_N) such that Σ_i x_i = 1
- In other words, draws are probability vectors – the parameter to the categorical distribution
- Parameters θ = (α_1, ..., α_N) = α
- Continuous distribution, p(x) = (1 / B(α)) ∏_i x_i^(α_i − 1),
  where B(α) = ∏_i Γ(α_i) / Γ(Σ_i α_i) and Γ(α_i) = ∫_0^∞ y^(α_i − 1) e^(−y) dy
- Expectation of the i-th element is E[x_i] = α_i / Σ_j α_j
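A short sketch of drawing from a Dirichlet with numpy; the α values are chosen arbitrarily for illustration, and the last line checks the expectation formula empirically:

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])        # hypothetical parameters (alpha_1, ..., alpha_N)
x = np.random.dirichlet(alpha)           # one draw: a probability vector
print(x, x.sum())                        # the components sum to 1

samples = np.random.dirichlet(alpha, size=100_000)
print(samples.mean(axis=0), alpha / alpha.sum())   # empirical mean ≈ alpha_i / sum_j alpha_j
```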

Bayesian inference

Unfair dice

- A die with unknown distribution, p = (p_1, p_2, p_3, p_4, p_5, p_6)
- We observe some throws and want to estimate p
- Say we observe 4, 6, 6, 4, 6, 3
- Perhaps p = (0, 0, 1/6, 2/6, 0, 3/6)

Maximum likelihood

- Maximum likelihood solution: θ_ML = argmax_θ p(D|θ)
- Easily leads to overfitting
- Want to incorporate some prior belief or knowledge about our parameters
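For the unfair die above, the maximum likelihood estimate is simply the vector of observed frequencies, which is exactly the p = (0, 0, 1/6, 2/6, 0, 3/6) guess on the previous slide. A minimal sketch:

```python
import numpy as np

throws = np.array([4, 6, 6, 4, 6, 3])          # the observed throws from the slides
counts = np.bincount(throws, minlength=7)[1:]  # counts for faces 1..6
p_ml = counts / counts.sum()                   # ML estimate: (0, 0, 1/6, 2/6, 0, 3/6)
print(p_ml)                                    # faces never observed get probability 0 -> overfitting
```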

Bayes’ theorem

Bayes’ theorem: for any two random variables X and Y,

p(X|Y) = p(Y|X) p(X) / p(Y)

Proof: from the chain rule, p(X,Y) = p(Y|X) p(X) = p(X|Y) p(Y). Divide both sides by p(Y).

Disease test

- A test for a disease with 99% accuracy
- 1 in 1000 people have the disease
- You tested positive. What is the probability that you have the disease?

Disease test

- Let X = disease, and Y = positive
- Want to know p(X|Y) – the probability of disease given a positive test
- From Bayes’ theorem, p(X|Y) = p(Y|X) p(X) / p(Y)
- p(Y|X) = 0.99, p(X) = 0.001

p(Y) = p(Y,X) + p(Y,!X) = p(Y|X) p(X) + p(Y|!X) p(!X) = 0.99 · 0.001 + 0.01 · 0.999 = 0.01098

- So p(X|Y) = (0.99 · 0.001) / 0.01098 ≈ 0.0902
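The same calculation in a few lines of Python (just the arithmetic above, written out):

```python
p_disease = 0.001             # p(X): prior probability of having the disease (1 in 1000)
p_pos_given_disease = 0.99    # p(Y|X): probability of a positive test if you have the disease
p_pos_given_healthy = 0.01    # p(Y|!X): probability of a positive test if you do not

# p(Y) = p(Y|X) p(X) + p(Y|!X) p(!X)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: p(X|Y) = p(Y|X) p(X) / p(Y)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)    # ≈ 0.0902: only about a 9% chance of having the disease
```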

Bayes’ theorem for inference

- Want to find the “best” parameters θ for our model after observing the data D
- ML overfits by using only the likelihood p(D|θ)
- Need some way of using prior belief about the parameters
- Consider p(θ|D) – our belief about the parameters after observing the data

Bayesian inference

- Using Bayes’ theorem, p(θ|D) = p(D|θ) p(θ) / p(D)
- Prior p(θ)
- Likelihood p(D|θ)
- Posterior p(θ|D)
- Maximum A Posteriori (MAP): θ_MAP = argmax_θ p(θ|D) = argmax_θ p(D|θ) p(θ)
- Bayesian inference – find the full posterior distribution p(θ|D)

Intractability

- In our model we define the prior p(θ) and the likelihood p(D|θ)
- How do we find p(D)?
- p(D) = ∫ p(D, θ) dθ = ∫ p(D|θ) p(θ) dθ
- BUT: the space of possible values for θ is huge!
- Approximate Bayesian inference
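One way to see the problem: p(D) is an expectation of the likelihood under the prior, so in principle we could estimate it by Monte Carlo, but in high-dimensional parameter spaces this needs an astronomical number of samples. The sketch below uses a hypothetical one-parameter model (the prior, likelihood, and data are all made up, not from the slides):

```python
import numpy as np

def prior_sample():
    # Hypothetical prior p(theta): a standard normal over a single parameter.
    return np.random.normal(0.0, 1.0)

def likelihood(data, theta):
    # Hypothetical likelihood p(D|theta): data assumed i.i.d. normal around theta.
    return np.prod(np.exp(-0.5 * (data - theta) ** 2) / np.sqrt(2 * np.pi))

data = np.array([0.8, 1.2, 1.0])   # toy data, for illustration only

# Monte Carlo estimate of p(D) = ∫ p(D|theta) p(theta) dtheta = E_prior[p(D|theta)]
estimates = [likelihood(data, prior_sample()) for _ in range(100_000)]
print(np.mean(estimates))
```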

Latent Dirichlet Allocation

Generative process

- Draw document-to-topic distributions, θ_d ∼ Dir(α) (d = 1, ..., D)
- Draw topic-to-word distributions, φ_t ∼ Dir(β) (t = 1, ..., T)
- For each of the N words in each of the D documents:
  - Draw a topic from the document’s topic distribution, z_{d,n} ∼ Multinomial(θ_d)
  - Draw a word from that topic’s word distribution, w_{d,n} ∼ Multinomial(φ_{z_{d,n}})

Note that our model’s data is the words w_{d,n} we observe, and the parameters are the θ_d, φ_t. We have placed Dirichlet priors over the parameters, with their own parameters α, β.
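A compact numpy sketch of this generative process; the numbers of documents, topics, words and the hyperparameter values below are all made up for illustration:

```python
import numpy as np

D, T, V, N = 5, 3, 20, 50          # documents, topics, vocabulary size, words per document (toy values)
alpha, beta = 0.5, 0.1             # symmetric Dirichlet hyperparameters (assumed values)

theta = np.random.dirichlet(alpha * np.ones(T), size=D)   # document-to-topic distributions theta_d
phi = np.random.dirichlet(beta * np.ones(V), size=T)      # topic-to-word distributions phi_t

topics = np.zeros((D, N), dtype=int)
words = np.zeros((D, N), dtype=int)
for d in range(D):
    for n in range(N):
        z = np.random.choice(T, p=theta[d])          # draw a topic from the document's topic distribution
        topics[d, n] = z
        words[d, n] = np.random.choice(V, p=phi[z])  # draw a word from that topic's word distribution
```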

Hyperparameters

In our model we have:

- Random variables – observed ones, like the words; and latent ones, like the topic assignments
- Parameters – document-to-topic distributions θ_d and topic-to-word distributions φ_t
- Hyperparameters – these are parameters to the prior distributions over our parameters, so α and β

Conjugacy

For a specific parameter θ_i, the prior p(θ_i) is conjugate to the likelihood p(D|θ_i) if the posterior of the parameter, p(θ_i|D), is of the same family as the prior.

e.g. the Dirichlet distribution is the conjugate prior for the categorical distribution.
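Concretely, if the data D are categorical draws with counts n = (n_1, ..., n_N) and the prior is Dir(α), then (a standard result, sketched here rather than taken from the slides) the posterior is again a Dirichlet whose parameters are the prior parameters plus the observed counts:

```latex
p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)
  \propto \prod_i \theta_i^{n_i} \cdot \prod_i \theta_i^{\alpha_i - 1}
  = \prod_i \theta_i^{\alpha_i + n_i - 1},
\qquad \text{i.e. } \theta \mid D \sim \mathrm{Dir}(\alpha_1 + n_1, \ldots, \alpha_N + n_N).
```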

Graphical models

Bayesian network

- Nodes are random variables (latent or observed)
- Arrows indicate dependencies
- The distribution of a node only depends on its parents (and things further down the network)
- Plates indicate repetition of variables

[Diagram: Bayesian network with nodes A, B, C, D; A and B point to C, and C points to D]

p(D|A,B,C) = p(D|C)

BUT: p(C|A,B,D) ≠ p(C|A,B)


Recall Bayes’ theorem with an extra conditioning variable: p(X|Y,Z) = p(Y|X,Z) p(X|Z) / p(Y|Z)

p(C|A,B,D) = p(D|A,B,C) p(C|A,B) / p(D|A,B)
           = p(D|C) p(C|A,B) / ∫ p(C,D|A,B) dC
           = p(D|C) p(C|A,B) / ∫ p(D|A,B,C) p(C|A,B) dC
           = p(D|C) p(C|A,B) / ∫ p(D|C) p(C|A,B) dC


Latent Dirichlet Allocation

[Figure: the LDA graphical model (plate diagram), from http://parkcu.com/blog/]


Gibbs sampling

- Want to approximate p(θ|D) for parameters θ = (θ_1, ..., θ_N)
- Cannot compute this exactly, but maybe we can draw samples from it
- We can then use these samples to estimate the distribution, or to estimate the expectation and variance

Gibbs sampling

- For each parameter θ_i, write down its distribution conditional on the data and the values of the other parameters, p(θ_i|θ_{−i}, D)
- If our model is conjugate, this gives closed-form expressions (meaning this distribution is of a known form, e.g. Dirichlet, so we can draw from it)
- Drawing new values for the parameters θ_i in turn will eventually converge to give draws from the true posterior, p(θ|D)
- Burn-in (discard the early samples) and thinning (keep only every k-th sample) reduce the effect of the starting point and the correlation between successive draws
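A generic Gibbs loop then looks roughly like the sketch below. The function `sample_conditional` is a hypothetical stand-in for the closed-form conditional draws p(θ_i|θ_{−i}, D), and the burn-in and thinning settings are arbitrary:

```python
import numpy as np

def gibbs_sample(theta_init, data, sample_conditional, n_iter=10_000, burn_in=1_000, thin=10):
    """Generic Gibbs sampler sketch.

    sample_conditional(i, theta, data) is assumed to return one draw from
    p(theta_i | theta_{-i}, D); in a conjugate model this is a known distribution.
    """
    theta = np.array(theta_init, dtype=float)
    samples = []
    for it in range(n_iter):
        for i in range(len(theta)):
            theta[i] = sample_conditional(i, theta, data)   # resample one parameter at a time
        if it >= burn_in and (it - burn_in) % thin == 0:    # discard burn-in, keep every `thin`-th draw
            samples.append(theta.copy())
    return np.array(samples)   # approximate draws from the posterior p(theta | D)
```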

Latent Dirichlet Allocation

- Want to draw samples from p(θ, φ, z|w)
- w = {w_{d,n}}, d = 1..D, n = 1..N
- z = {z_{d,n}}, d = 1..D, n = 1..N
- θ = {θ_d}, d = 1..D
- φ = {φ_t}, t = 1..T

Latent Dirichlet Allocation

For Gibbs sampling, we need the conditional distributions:

- p(θ_d|θ_{−d}, φ, z, w)
- p(φ_t|θ, φ_{−t}, z, w)
- p(z_{d,n}|θ, φ, z_{−(d,n)}, w)

Latent Dirichlet Allocation

These are relatively straightforward to derive. For example:

p(z_{d,n}|θ, φ, z_{−(d,n)}, w) = p(w|θ, φ, z) p(z_{d,n}|θ, φ, z_{−(d,n)}) / p(w|θ, φ, z_{−(d,n)})
  ∝ p(w_{d,n}|θ, φ, z_{d,n}) p(z_{d,n}|θ, φ, z_{−(d,n)})
  = p(w_{d,n}|z_{d,n}, φ_{z_{d,n}}) p(z_{d,n}|θ_d)
  = φ_{z_{d,n}, w_{d,n}} · θ_{d, z_{d,n}}

Here the first step follows from Bayes’ theorem, the second from the fact that some terms do not depend on z_{d,n}, the third from independence in our Bayesian network, and the fourth from our model’s definition of those distributions. We then simply compute these probabilities for every possible value of z_{d,n} (i.e. each topic t = 1, ..., T), normalise them to sum to 1, and draw a new value with those probabilities!
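In code, resampling a single topic assignment z_{d,n} given the current θ and φ is then just the following sketch (assuming `theta` is a D×T array, `phi` a T×V array, and `w` the vocabulary index of the word w_{d,n}):

```python
import numpy as np

def resample_topic(d, w, theta, phi):
    probs = phi[:, w] * theta[d]                   # unnormalised phi_{t,w} * theta_{d,t} for each topic t
    probs /= probs.sum()                           # normalise so the probabilities sum to 1
    return np.random.choice(len(probs), p=probs)   # draw the new value of z_{d,n}
```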

Collapsed Gibbs sampler

In practice we actually want to find p(z|w), as we can estimate the θ_d, φ_t from the topic assignments. We integrate out the other parameters. This is called a collapsed Gibbs sampler.


Variational Bayesian inference

- Want to approximate p(θ|D) for parameters θ = (θ_1, ..., θ_N)
- Cannot compute this exactly, but maybe we can approximate it
- Introduce a new distribution q(θ) over the parameters, called the variational distribution
- We can choose the exact form of q ourselves, giving us a set of variational parameters ν – i.e. we have q(θ|ν)
- We then tweak ν so that q is as similar to p as possible!
- We want q to be easier to compute – we normally do this by assuming each of the parameters θ_i is independent in the posterior – the mean-field assumption

q(θ|ν) = ∏_i q(θ_i|ν_i)

KL-divergence

- We need some way of measuring similarity between distributions
- We use the KL-divergence between distributions q and p
- D_KL(q||p) = ∫ q(θ) log [ q(θ) / p(θ|D) ] dθ

ELBO

We can show that minimising D_KL(q||p) is equivalent to maximising something called the Evidence Lower Bound (ELBO), L.

L = ∫ q(θ) log p(θ, D) dθ − ∫ q(θ) log q(θ) dθ
  = E_q[log p(θ, D)] − E_q[log q(θ)]

If we choose the precise distribution for q, we can write down this expression. Then optimise by taking the derivative w.r.t. ν, setting it to 0, and solving, to give the variational parameter updates.
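For intuition, the ELBO can also be estimated by simple Monte Carlo: draw θ from q and average log p(θ, D) − log q(θ). The sketch below assumes a hypothetical one-parameter model with a Gaussian q (the `log_joint`, the data, and the variational parameter values are all made up):

```python
import numpy as np

def log_joint(theta, data):
    # Hypothetical log p(theta, D): standard normal prior, normal likelihood around theta.
    log_prior = -0.5 * theta ** 2 - 0.5 * np.log(2 * np.pi)
    log_lik = np.sum(-0.5 * (data - theta) ** 2 - 0.5 * np.log(2 * np.pi))
    return log_prior + log_lik

def elbo_estimate(mu, sigma, data, n_samples=10_000):
    # q(theta | nu) is a normal with variational parameters nu = (mu, sigma).
    theta = np.random.normal(mu, sigma, size=n_samples)
    log_q = -0.5 * ((theta - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    log_p = np.array([log_joint(t, data) for t in theta])
    return np.mean(log_p - log_q)   # E_q[log p(theta, D)] - E_q[log q(theta)]

data = np.array([0.8, 1.2, 1.0])    # toy data, for illustration only
print(elbo_estimate(mu=1.0, sigma=0.5, data=data))
```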

Convergence

We update the variational parameters ν in turn, and alternate updates until the value of the ELBO converges.

After convergence, our estimate of the posterior distribution of a parameter θ_i is q(θ_i|ν_i).

Choosing q

- Our choice of q determines how good our approximation to p is
- If our model has conjugacy, we simply choose the same family of distribution for q(θ_i) as the conditional we used for Gibbs sampling, p(θ_i|θ_{−i}, D)
- We then obtain very nice closed-form updates
- In non-conjugate models we need to use gradient descent to optimise the ELBO!
