Variational Methods - Carnegie Mellon School of Computer...

download Variational Methods - Carnegie Mellon School of Computer tom/10-702/Zoubin-702.pdf2003-04-24Variational Methods Zoubin Ghahramani zoubin@cs.cmu.edu ... joint distribution for hidden

of 38

  • date post

    24-May-2018
  • Category

    Documents

  • view

    213
  • download

    1

Embed Size (px)

Transcript of Variational Methods - Carnegie Mellon School of Computer...

  • Variational Methods

    Zoubin Ghahramani

    zoubin@cs.cmu.edu

    http://www.gatsby.ucl.ac.uk

    Statistical Approaches to Learning and DiscoveryCarnegie Mellon University

    April 2003

  • The Expectation Maximization (EM) algorithm

    Given observed/visible variables y, unobserved/hidden/latent/missing variables x,and model parameters , maximize the likelihood w.r.t. .

    L() = log p(y|) = logp(x,y|)dx,

    where we have written the marginal for the visibles in terms of an integral over thejoint distribution for hidden and visible variables.

    Using Jensens inequality, any distribution1 over hidden variables q(x) gives:

    L() = logq(x)

    p(x,y|)q(x)

    dx q(x) log

    p(x,y|)q(x)

    dx = F(q, ),

    defining the F(q, ) functional, which is a lower bound on the log likelihood.

    In the EM algorithm, we alternately optimize F(q, ) wrt q and , and we can provethat this will never decrease L().

    1s.t. q(x) > 0 if p(x, y|) > 0.

  • The E and M steps of EM

    The lower bound on the log likelihood:

    F(q, ) =q(x) log

    p(x,y|)q(x)

    dx =q(x) log p(x,y|)dx +H(q),

    where H(q) = q(x) log q(x)dx is the entropy of q. We iteratively alternate:

    E step: maximize F(q, ) wrt distribution over hidden variables given theparameters:

    q(k)(x) := argmaxq(x)

    F(q(x), (k1)

    ).

    M step: maximize F(q, ) wrt the parameters given the hidden distribution:

    (k) := argmax

    F(q(k)(x),

    )= argmax

    q(k)(x) log p(x,y|)dx,

    which is equivalent to optimizing the expected complete-data likelihood p(x,y|),since the entropy of q(x) does not depend on .

  • EM as Coordinate Ascent in F

  • The EM algorithm never decreases the log likelihood

    The difference between the log likelihood and the bound:

    L()F(q, ) = log p(y|)q(x) log

    p(x,y|)q(x)

    dx

    = log p(y|)q(x) log

    p(x|y, )p(y|)q(x)

    dx

    = q(x) log

    p(x|y, )q(x)

    dx = KL(q(x), p(x|y, )

    ),

    This is the Kullback-Liebler divergence; it is non-negative and zero if and only ifq(x) = p(x|y, ) (thus this is the E step). Although we are working with a boundon the likelihood, the likelihood is non-decreasing in every iteration:

    L((k1)

    )=

    E stepF(q(k), (k1)

    )

    M stepF(q(k), (k)

    )

    JensenL((k)

    ),

    where the first equality holds because of the E step, and the first inequality comesfrom the M step and the final inequality from Jensen. Usually EM converges to alocal optimum of L (although there are exceptions).

  • A Generative Model for Generative Models

    Gaussian

    Factor Analysis(PCA)

    Mixture of Factor Analyzers

    Mixture of Gaussians

    (VQ)

    CooperativeVector

    Quantization

    SBN,BoltzmannMachines

    Factorial HMM

    HMM

    Mixture of

    HMMs

    SwitchingState-space

    Models

    ICALinear

    DynamicalSystems (SSMs)

    Mixture ofLDSs

    NonlinearDynamicalSystems

    NonlinearGaussian

    Belief Nets

    mix

    mix

    mix

    switch

    red-dim

    red-dim

    dyn

    dyn

    dyn

    dyn

    dyn

    mix

    distrib

    hier

    nonlinhier

    nonlin

    distrib

    mix : mixture

    red-dim : reduced dimensiondyn : dynamicsdistrib : distributed representation

    hier : hierarchical

    nonlin : nonlinear

    switch : switching

  • Example: Dynamic Bayesian Networks

    Factorial Hidden Markov Models, Switching State Space Models, and NonlinearDynamical Systems, Nonlinear Dynamical Systems, ...

    S(1)

    t

    S(2)

    t

    S(3)

    t

    Yt

    S(1)

    t+1

    S(2)

    t+1

    S(3)

    t+1

    Yt+1

    S(1)

    t-1

    S(2)

    t-1

    S(3)

    t-1

    Yt-1

    X(M)

    2

    S 2

    Y2

    X(M)

    3

    S 3

    Y3

    X(M)

    1

    S 1

    Y1

    X(1)

    2 X(1)

    3X(1)

    1

    At

    Dt

    Ct

    Bt

    At+1

    Dt+1

    Ct+1

    Bt+1

    ...

    At+2

    Dt+2

    Ct+2

    Bt+2

  • Intractability

    For many models of interest, exact inference is not computationally feasible.This occurs for two (main) reasons:

    distributions may have complicated forms (non-linearities in generative model) explaining away causes coupling from observations: observing the value of a

    child induces dependencies amongst its parents

    Y

    X3 X4X1 X5X2

    1 2 311

    Y = X1 + 2 X2 + X3 + X4 + 3 X5

    We can still work with such models by using approximate inference techniques toestimate the latent variables.

  • Variational Appoximations

    Assume your goal is to maximize likelihood ln p(y|).Any distribution q(x) over the hidden variables defines a lower bound on ln p(y|):

    ln p(y|)

    x

    q(x) lnp(x,y|)q(x)

    = F(q, )

    Constrain q(x) to be of a particular tractable form (e.g. factorised) and maximiseF subject to this constraint

    E-step: Maximise F w.r.t. q with fixed, subject to the constraint on q,equivalently minimize:

    ln p(y|)F(q, ) =

    x

    q(x) lnq(x)

    p(x|y, )= KL(qp)

    The inference step therefore tries to find q closest to the exact posteriordistribution.

    M-step: Maximise F w.r.t. with q fixed

    (related to mean-field approximations)

  • Variational Approximations for Bayesian Learning

  • Model structure and overfitting:a simple example

    0 5 1020

    0

    20

    40

    M = 0

    0 5 1020

    0

    20

    40

    M = 1

    0 5 1020

    0

    20

    40

    M = 2

    0 5 1020

    0

    20

    40

    M = 3

    0 5 1020

    0

    20

    40

    M = 4

    0 5 1020

    0

    20

    40

    M = 5

    0 5 1020

    0

    20

    40

    M = 6

    0 5 1020

    0

    20

    40

    M = 7

  • Learning Model Structure

    Conditional Independence StructureWhat is the structure of the graph (i.e. what relations hold)?

    A

    C

    B

    D

    E

    Feature SelectionIs some input relevant to predicting some output ?

    Cardinality of Discrete Latent VariablesHow many clusters in the data?

    How many states in a hidden Markov model?SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNGVALMTTY

    Dimensionality of Real Valued Latent VectorsWhat choice of dimensionality in a PCA/FA model of the data?

    How many state variables in a linear-Gaussian state-space model?

  • Using Bayesian Occams Razor to Learn Model Structure

    Select the model class, m, with the highest probability given the data, y:

    p(m|y) = p(y|m)p(m)p(y)

    , p(y|m) =p(y|,m) p(|m) d

    Interpretation of the marginal likelihood (evidence): The probability thatrandomly selected parameters from the prior would generate y.

    Model classes that are too simple are unlikely to generate the data set.

    Model classes that are too complex can generate many possible data sets, so again,they are unlikely to generate that particular data set at random.

    too simple

    too complex

    "just right"

    All possible data sets

    P(Y

    |Mi)

    Y

  • Bayesian Model Selection: Occams Razor at Work

    0 5 1020

    0

    20

    40

    M = 0

    0 5 1020

    0

    20

    40

    M = 1

    0 5 1020

    0

    20

    40

    M = 2

    0 5 1020

    0

    20

    40

    M = 3

    0 5 1020

    0

    20

    40

    M = 4

    0 5 1020

    0

    20

    40

    M = 5

    0 5 1020

    0

    20

    40

    M = 6

    0 5 1020

    0

    20

    40

    M = 7

    0 1 2 3 4 5 6 70

    0.2

    0.4

    0.6

    0.8

    1

    M

    P(Y

    |M)

    Model Evidence

    demo: polybayes Subtleties of Occams Hill

  • Computing Marginal Likelihoods can beComputationally Intractable

    p(y|m) =p(y|,m) p(|m) d

    This can be a very high dimensional integral.

    The presence of latent variables results in additional dimensions that need tobe marginalized out.

    p(y|m) =

    p(y,x|,m) p(|m) dx d

    The likelihood term can be complicated.

  • Practical Bayesian approaches

    Laplace approximations:

    Appeals to asymptotic normality to make a Gaussian approximation about theposterior mode of the parameters.

    Large sample approximations (e.g. BIC).

    Markov chain Monte Carlo methods (MCMC):

    converge to the desired distribution in the limit, but: many samples are required to ensure accuracy. sometimes hard to assess convergence and reliably compute marginal likelihood.

    Variational approximations...

    Note: other deterministic approximations are also available now: e.g. Bethe/Kikuchiapproximations, Expectation Propagation, Tree-based reparameterizations.

  • Lower Bounding the Marginal Likelihood

    Variational Bayesian Learning

    Let the latent variables be x, data y and the parameters .We can lower bound the marginal likelihood (by Jensens inequality):

    ln p(y|m) = lnp(y,x,|m) dx d

    = lnq(x,)

    p(y,x,|m)q(x,)

    dx d

    q(x,) ln

    p(y,x,|m)q(x,)

    dx d.

    Use a simpler, factorised approximation to q(x,) qx(x)q():

    ln p(y|m) qx(x)q() ln

    p(y,x,|m)qx(x)q()

    dx d

    = Fm(qx(x), q(),y).

  • Variational Bayesian Learning . . .

    Maximizing this lower bound, Fm, leads to EM-like iterative updates:

    q(t+1)x (x) exp[

    ln p(x,y|,m) q(t) () d]

    E-like step

    q(t+1) () p(|m) exp

    [ln p(x,y|,m) q(t+1)x (x) dx

    ]M-like step

    Maximizing Fm is equivalent to minimizing KL-divergence between the approximateposterior, q() qx(x) and the true posterior, p(,x|y,m):

    ln p(y|m)Fm(qx(x), q(),y