
  • Introduction to Bayesian inference

    Thomas Alexander Brouwer

    University of Cambridge

    tab43@cam.ac.uk

    17 November 2015

  • Probabilistic models

    I Describe how data was generated using probability distributions

    I Generative process

    I Data D, parameters θ

    I Want to find the best parameters θ for the data D: inference

  • Topic modelling

    I Documents D_1, ..., D_D

    I Documents cover topics, with a distribution θ_d = (t_1, ..., t_T)

    I Words in document, D_d = {w_{d,1}, ..., w_{d,N}}

    I Some words are more prevalent in some topics

    I Topics have a word distribution φ_t = (w_1, ..., w_V)

    I Data is the words in documents D_1, ..., D_D; parameters are θ_d, φ_t

  • From Blei's ICML 2012 tutorial

  • Overview

    Probabilistic models
      Probability theory
      Latent variable models

    Bayesian inference
      Bayes theorem
      Latent Dirichlet Allocation
      Conjugacy

    Graphical models

    Gibbs sampling

    Variational Bayesian inference

  • Probability primer

    I Random variable X

    I Probability distribution p(X )

    I Discrete distribution e.g. coin flip or dice roll

    I Continuous distribution e.g. height distribution

  • Multiple RVs

    I Joint distribution p(X ,Y )

    I Conditional distribution p(X |Y )

  • Probability rules

    I Chain rule p(X|Y) = p(X,Y) / p(Y), or p(X,Y) = p(X|Y) p(Y)

    I Marginal rule p(X) = Σ_Y p(X,Y) = Σ_Y p(X|Y) p(Y)

      For continuous variables, p(x) = ∫ p(x,y) dy = ∫ p(x|y) p(y) dy

    We can add more conditional random variables if we want, so e.g.
    p(X,Y|Z) = p(X|Y,Z) p(Y|Z). (A small numeric check of these rules follows below.)
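A small numeric check of the chain and marginal rules, using a made-up joint distribution over two binary variables (the table values are purely illustrative, not from the slides):

```python
import numpy as np

# Illustrative joint distribution p(X, Y) over two binary variables.
# Rows index X, columns index Y; entries sum to 1.
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

p_y = p_xy.sum(axis=0)          # marginal rule: p(Y) = sum_X p(X, Y)
p_x_given_y = p_xy / p_y        # chain rule rearranged: p(X|Y) = p(X, Y) / p(Y)

# Marginal rule the other way round: p(X) = sum_Y p(X|Y) p(Y)
p_x = (p_x_given_y * p_y).sum(axis=1)
assert np.allclose(p_x, p_xy.sum(axis=1))   # matches sum_Y p(X, Y)
print(p_x)                                  # [0.4 0.6]
```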

  • Independence

    I X and Y are independent if p(X ,Y ) = p(X )p(Y )

    I Equivalently, if p(Y |X ) = p(Y )

  • Expectation and variance

    I Expectation E[X] = Σ_x x p(X = x), or E[X] = ∫ x p(x) dx

    I Variance V[X] = E[(X − E[X])²] = E[X²] − E[X]²

      where E[X²] = Σ_x x² p(X = x) or E[X²] = ∫ x² p(x) dx

  • Dice roll: E[X] = 1 · (1/6) + 2 · (1/6) + 3 · (1/6) + 4 · (1/6) + 5 · (1/6) + 6 · (1/6) = 7/2

    E[X²] = 1² · (1/6) + 2² · (1/6) + 3² · (1/6) + 4² · (1/6) + 5² · (1/6) + 6² · (1/6) = 91/6

    So V[X] = 91/6 − (7/2)² = 35/12 (checked in the short snippet below)
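A quick check of the dice-roll calculation using exact fractions:

```python
from fractions import Fraction

# Fair six-sided die: outcomes 1..6, each with probability 1/6.
p = Fraction(1, 6)
outcomes = range(1, 7)

e_x = sum(Fraction(x) * p for x in outcomes)        # E[X]    = 7/2
e_x2 = sum(Fraction(x)**2 * p for x in outcomes)    # E[X^2]  = 91/6
var_x = e_x2 - e_x**2                               # V[X]    = 35/12
print(e_x, e_x2, var_x)                             # 7/2 91/6 35/12
```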

  • Latent variable models

    I Manifest or observed variable

    I Latent or unobserved variable

    I Latent variable models

  • Probability distributions

    Categorical distribution

    I N possible outcomes, with probabilities (p1, ..., pN)

    I Draw a single value e.g. throw a dice once

    I Parameters θ = (p1, ..., pN)

    I Discrete distribution, p(X = i) = pi

    I Expectation for outcome i is pi, variance is pi (1 − pi)

  • Probability distributions

    Dirichlet distribution

    I Draws are vectors x = (x_1, ..., x_N) s.t. Σ_i x_i = 1

    I In other words, draws are probability vectors, i.e. the parameter to the categorical distribution

    I Parameters α = (α_1, ..., α_N)

    I Continuous distribution, p(x) = (1 / B(α)) ∏_i x_i^(α_i − 1)

      where B(α) = ∏_i Γ(α_i) / Γ(Σ_i α_i) and Γ(α_i) = ∫_0^∞ y^(α_i − 1) e^(−y) dy

    I Expectation of the i-th element x_i is E[x_i] = α_i / Σ_j α_j (a short sampling sketch follows below)
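A minimal numpy sketch of the point that a Dirichlet draw is itself a probability vector that can parameterise a categorical distribution (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([2.0, 3.0, 5.0])   # illustrative Dirichlet parameters
p = rng.dirichlet(alpha)            # one draw: a probability vector, sums to 1
print(p, p.sum())

# Use the draw as the parameter of a categorical distribution.
samples = rng.choice(len(alpha), size=1000, p=p)
print(np.bincount(samples, minlength=len(alpha)) / 1000)   # close to p

# The expectation of the i-th element is alpha_i / sum_j alpha_j.
print(alpha / alpha.sum())          # [0.2 0.3 0.5]
```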

    Probabilistic models
      Probability theory
      Latent variable models

    Bayesian inference
      Bayes theorem
      Latent Dirichlet Allocation
      Conjugacy

    Graphical models

    Gibbs sampling

    Variational Bayesian inference

  • Unfair dice

    I Dice with unknown distribution, p = (p1, p2, p3, p4, p5, p6)

    I We observe some throws and want to estimate p

    I Say we observe 4, 6, 6, 4, 6, 3

    I Perhaps p = (0, 0, 1/6, 2/6, 0, 3/6)

  • Maximum likelihood

    I Maximum likelihood solution, θ_ML = arg max_θ p(D|θ)

    I Easily leads to overfitting

    I Want to incorporate some prior belief or knowledge about our parameters
      (see the sketch below)
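A minimal sketch of the maximum-likelihood estimate for the unfair-dice example (the throws come from the earlier slide). Note the zero probabilities for faces that happen not to appear, which is the overfitting the slide warns about:

```python
import numpy as np

throws = [4, 6, 6, 4, 6, 3]                    # observed throws from the earlier slide

counts = np.bincount(throws, minlength=7)[1:]  # counts for faces 1..6
p_ml = counts / counts.sum()                   # ML estimate: relative frequencies
print(p_ml)                                    # [0. 0. 0.1667 0.3333 0. 0.5]

# Faces 1, 2 and 5 get probability exactly 0 after only six throws:
# the ML estimate claims they can never occur.
```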

  • Bayes theorem

    Bayes theorem: for any two random variables X and Y,

    p(X|Y) = p(Y|X) p(X) / p(Y)

    Proof: from the chain rule, p(X,Y) = p(Y|X) p(X) = p(X|Y) p(Y).
    Divide both sides by p(Y).

  • Disease test

    I Test for disease with 99% accuracy

    I 1 in a 1000 people have the disease

    I You tested positive. What is the probability that you have thedisease?

  • Disease test

    I Let X = disease, and Y = positive

    I Want to know p(X|Y), the probability of disease given a positive test

    I From Bayes, p(X|Y) = p(Y|X) p(X) / p(Y)

    I p(Y|X) = 0.99, p(X) = 0.001

      p(Y) = p(Y,X) + p(Y,!X) = p(Y|X) p(X) + p(Y|!X) p(!X)
           = 0.99 · 0.001 + 0.01 · 0.999 = 0.01098

    I So p(X|Y) = 0.99 · 0.001 / 0.01098 ≈ 0.0902 (a quick check follows below)
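A one-line check of the arithmetic (the 0.99 accuracy and 1-in-1000 prevalence are the figures from the slide):

```python
# Disease test: p(disease | positive) via Bayes' theorem.
p_pos_given_dis = 0.99   # test accuracy from the slide
p_dis = 0.001            # prevalence: 1 in 1000

p_pos = p_pos_given_dis * p_dis + (1 - p_pos_given_dis) * (1 - p_dis)
print(p_pos)                                   # 0.01098
print(p_pos_given_dis * p_dis / p_pos)         # 0.09016... i.e. about 9%
```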

  • Bayes theorem for inference

    I Want to find the best parameters θ for our model after observing the data D

    I ML overfits by using p(D|θ)

    I Need some way of using prior belief about the parameters

    I Consider p(θ|D), our belief about the parameters after observing the data

  • Bayesian inference

    I Using Bayes theorem, p(θ|D) = p(D|θ) p(θ) / p(D)

    I Prior p(θ)

    I Likelihood p(D|θ)

    I Posterior p(θ|D)

    I Maximum A Posteriori (MAP): θ_MAP = arg max_θ p(θ|D) = arg max_θ p(D|θ) p(θ)

    I Bayesian inference: find the full posterior distribution p(θ|D)
      (a MAP sketch for the dice example follows below)
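A minimal sketch of the MAP idea on the unfair-dice example, assuming a symmetric Dirichlet prior (the prior strength α_i = 2 is an arbitrary illustrative choice, not from the slides). With a Dirichlet prior and a categorical likelihood the MAP estimate has the closed form (n_i + α_i − 1) / (N + Σ_j (α_j − 1)):

```python
import numpy as np

throws = [4, 6, 6, 4, 6, 3]
counts = np.bincount(throws, minlength=7)[1:]   # counts n_i for faces 1..6

alpha = np.full(6, 2.0)   # illustrative symmetric Dirichlet prior

# MAP estimate: mode of the Dirichlet posterior Dir(alpha + counts)
theta_map = (counts + alpha - 1) / (counts.sum() + (alpha - 1).sum())
print(theta_map)          # no zero entries, unlike the ML estimate
```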

  • Intractability

    I In our model we define the prior p(θ) and likelihood p(D|θ)

    I How do we find p(D)?

    I p(D) = ∫ p(D,θ) dθ = ∫ p(D|θ) p(θ) dθ

    I BUT: the space of possible values for θ is huge!

    I Approximate Bayesian inference

  • Latent Dirichlet Allocation

    Generative process

    I Draw document-to-topic distributions, θ_d ∼ Dir(α) (d = 1, ..., D)

    I Draw topic-to-word distributions, φ_t ∼ Dir(β) (t = 1, ..., T)

    I For each of the N words in each of the D documents:

      I Draw a topic from the document's topic distribution, z_{d,n} ∼ Multinomial(θ_d)

      I Draw a word from the topic's word distribution, w_{d,n} ∼ Multinomial(φ_{z_{d,n}})

    Note that our model's data is the words w_{d,n} we observe, and the parameters
    are the θ_d, φ_t. We have placed Dirichlet priors over the parameters, with
    their own parameters α, β. (A generative-process sketch follows below.)
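A minimal numpy sketch of this generative process (document count, topic count, vocabulary size and hyperparameter values are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

D, T, V, N = 5, 3, 20, 50        # documents, topics, vocabulary size, words per document
alpha = np.full(T, 0.5)          # illustrative hyperparameters
beta = np.full(V, 0.1)

theta = rng.dirichlet(alpha, size=D)   # document-to-topic distributions theta_d
phi = rng.dirichlet(beta, size=T)      # topic-to-word distributions phi_t

docs = []
for d in range(D):
    words = []
    for n in range(N):
        z = rng.choice(T, p=theta[d])  # topic from the document's topic distribution
        w = rng.choice(V, p=phi[z])    # word from that topic's word distribution
        words.append(w)
    docs.append(words)

print(docs[0][:10])   # the observed data: word ids of the first document
```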

  • Hyperparameters

    In our model we have:

    I Random variables: observed ones, like the words, and latent ones, like the topics

    I Parameters: document-to-topic distributions θ_d and topic-to-word distributions φ_t

    I Hyperparameters: these are parameters to the prior distributions over our parameters, so α and β

  • Conjugacy

    For a specific parameter θ_i, the prior p(θ_i) is conjugate to the likelihood
    p(D|θ_i) if the posterior of the parameter, p(θ_i|D), is of the same family
    as the prior.

    e.g. the Dirichlet distribution is the conjugate prior for the categorical
    distribution (a worked example follows below).
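A short worked version of that example (standard Dirichlet-categorical conjugacy, not spelled out on the slide): with prior Dir(α) and N categorical observations with counts n = (n_1, ..., n_K),

```latex
p(\theta) \propto \prod_{i=1}^{K} \theta_i^{\alpha_i - 1},
\qquad
p(D \mid \theta) = \prod_{i=1}^{K} \theta_i^{n_i},
\qquad
p(\theta \mid D) \propto \prod_{i=1}^{K} \theta_i^{n_i + \alpha_i - 1}
  = \mathrm{Dir}(\theta \mid \alpha + n)
```

so the posterior is again a Dirichlet, with updated parameters α + n.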

    Probabilistic models
      Probability theory
      Latent variable models

    Bayesian inference
      Bayes theorem
      Latent Dirichlet Allocation
      Conjugacy

    Graphical models

    Gibbs sampling

    Variational Bayesian inference

  • Bayesian network

    I Nodes are random variables (latent or observed)

    I Arrows indicate dependencies

    I Distribution of a node only depends on its parents (and things further down the network)

    I Plates indicate repetition of variables

    [Bayesian network diagram over nodes A, B, C, D]

    p(D|A,B,C) = p(D|C)

    BUT: p(C|A,B,D) ≠ p(C|A,B)

  • [Same Bayesian network diagram over nodes A, B, C, D]

    Recall Bayes: p(X|Y,Z) = p(Y|X,Z) p(X|Z) / p(Y|Z)

    p(C|A,B,D) = p(D|A,B,C) p(C|A,B) / p(D|A,B)

               = p(D|C) p(C|A,B) / ∫_C p(C,D|A,B) dC

               = p(D|C) p(C|A,B) / ∫_C p(D|A,B,C) p(C|A,B) dC

               = p(D|C) p(C|A,B) / ∫_C p(D|C) p(C|A,B) dC

  • Latent Dirichlet Allocation

    Generative process

    I Draw document-to-topic distributions, θ_d ∼ Dir(α) (d = 1, ..., D)

    I Draw topic-to-word distributions, φ_t ∼ Dir(β) (t = 1, ..., T)

    I For each of the N words in each of the D documents:

      I Draw a topic from the document's topic distribution, z_{d,n} ∼ Multinomial(θ_d)

      I Draw a word from the topic's word distribution, w_{d,n} ∼ Multinomial(φ_{z_{d,n}})

  • Latent Dirichlet Allocation

    From http://parkcu.com/blog/

    Probabilistic models
      Probability theory
      Latent variable models

    Bayesian inference
      Bayes theorem
      Latent Dirichlet Allocation
      Conjugacy

    Graphical models

    Gibbs sampling

    Variational Bayesian inference

  • Gibbs sampling

    I Want to approximate p(θ|D) for parameters θ = (θ_1, ..., θ_N)

    I Cannot compute this exactly, but maybe we can draw samples from it

    I We can then use these samples to estimate the distribution, or to
      estimate the expectation and variance

  • Gibbs sampling

    I For each parameter θ_i, write down its distribution conditional on the
      data and the values of the other parameters, p(θ_i | θ_{−i}, D)

    I If our model is conjugate, this gives closed-form expressions (meaning
      this distribution is of a known form, e.g. Dirichlet, so we can draw from it)

    I Drawing new values for the parameters θ_i in turn will eventually converge
      to give draws from the true posterior, p(θ|D)

    I Burn-in, thinning (a toy sketch follows below)
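A toy illustration of the procedure, not from the slides: Gibbs sampling a bivariate normal with correlation ρ, whose two conditionals are known one-dimensional normals; the burn-in and thinning lines correspond to the last bullet:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: bivariate standard normal with correlation rho.
# Conditionals: x | y ~ N(rho*y, 1 - rho^2), y | x ~ N(rho*x, 1 - rho^2).
rho = 0.8
x, y = 0.0, 0.0                  # arbitrary initialisation
samples = []

for it in range(5000):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # draw x given the current y
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # draw y given the new x
    if it >= 500 and it % 5 == 0:                  # burn-in, thinning
        samples.append((x, y))

samples = np.array(samples)
print(samples.mean(axis=0))            # close to (0, 0)
print(np.corrcoef(samples.T)[0, 1])    # close to rho
```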

  • Latent Dirichlet Allocation

    I Want to draw samples from p(θ, φ, z | w)

    I w = {w_{d,n}}, d = 1..D, n = 1..N

    I z = {z_{d,n}}, d = 1..D, n = 1..N

    I θ = {θ_d}, d = 1..D

    I φ = {φ_t}, t = 1..T

  • Latent Dirichlet Allocation

    For Gibbs sampling, we need the distributions:

    I p(θ_d | θ_{−d}, φ, z, w)

    I p(φ_t | θ, φ_{−t}, z, w)

    I p(z_{d,n} | θ, φ, z_{−(d,n)}, w)

  • Latent Dirichlet Allocation

    These are relatively straightforward to derive. For example:

    p(z_{d,n} | θ, φ, z_{−(d,n)}, w) = p(w | θ, φ, z) p(z_{d,n} | θ, φ, z_{−(d,n)}) / p(w | θ, φ, z_{−(d,n)})

                                     ∝ p(w_{d,n} | θ, φ, z_{d,n}) p(z_{d,n} | θ, φ, z_{−(d,n)})

                                     = p(w_{d,n} | φ_{z_{d,n}}, z_{d,n}) p(z_{d,n} | θ_d)

                                     = φ_{z_{d,n}, w_{d,n}} · θ_{d, z_{d,n}}

    Where the first step follows from Bayes theorem, the second from the fact
    that some terms do not depend on z_{d,n}, and the third from independence in
    our Bayesian network. (A sketch of this update follows below.)
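A minimal sketch of the resulting update for a single topic assignment, assuming theta and phi hold the sampler's current values (array shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_topic(d, n, w, theta, phi, rng):
    """Draw z_{d,n} with probability proportional to phi[z, w[d][n]] * theta[d, z].

    theta: (D, T) current document-to-topic distributions
    phi:   (T, V) current topic-to-word distributions
    w:     list of lists of word ids; w[d][n] is the n-th word of document d
    """
    probs = phi[:, w[d][n]] * theta[d]   # unnormalised conditional over the T topics
    probs = probs / probs.sum()          # normalise
    return rng.choice(len(probs), p=probs)

# Illustrative usage with random current values for theta and phi:
D, T, V = 2, 3, 10
theta = rng.dirichlet(np.ones(T), size=D)
phi = rng.dirichlet(np.ones(V), size=T)
w = [[1, 4, 7], [0, 2, 9]]
print(resample_topic(0, 2, w, theta, phi, rng))
```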