
Applied Bayesian Nonparametrics

Special Topics in Machine Learning, Brown University CSCI 2950-P, Fall 2011

October 11: Variational Methods

Convexity & Jensen's Inequality

[Figure: a convex function f(x); the chord between the points a and b lies above the function between them.]
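The inequality the figure illustrates: for a convex function f and any 0 ≤ λ ≤ 1,

f(λa + (1 − λ)b) ≤ λ f(a) + (1 − λ) f(b)

and more generally f(E[x]) ≤ E[f(x)], which is the form used to derive the lower bounds that follow.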

Lower Bounds on Marginal Likelihood

ln p(X|θ) = L(q, θ) + KL(q || p)

C. Bishop, Pattern Recognition & Machine Learning
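In Bishop's notation, the two terms in this standard decomposition are

L(q, θ) = Σ_Z q(Z) ln { p(X, Z|θ) / q(Z) }

KL(q || p) = − Σ_Z q(Z) ln { p(Z|X, θ) / q(Z) }

Since KL(q || p) ≥ 0, the functional L(q, θ) is a lower bound on the log marginal likelihood ln p(X|θ).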

Expectation Maximization Algorithm

C. Bishop, Pattern Recognition & Machine Learning

After the E step: KL(q || p) = 0, so L(q, θold) = ln p(X|θold)

After the M step: ln p(X|θnew) = L(q, θnew) + KL(q || p), with KL(q || p) > 0 once θ has moved away from θold

E Step: Optimize distribution on hidden variables given parameters

M Step: Optimize parameters given distribution on hidden variables
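To make the two steps concrete, here is a minimal runnable sketch of EM for a two-component 1-D Gaussian mixture, the model fit on the following slides. The function name, initialization, and fixed iteration count are illustrative choices, not from the slides:

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Hypothetical sketch: EM for a 2-component 1-D Gaussian mixture."""
    n = len(x)
    # crude initialization: place the means at the data extremes
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: responsibilities r[n, k] = p(z_n = k | x_n, theta)
        log_lik = (-0.5 * np.log(2 * np.pi * var)
                   - 0.5 * (x[:, None] - mu) ** 2 / var)
        log_r = np.log(pi) + log_lik
        log_r -= log_r.max(axis=1, keepdims=True)   # numerical stabilization
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M step: weighted maximum likelihood given the responsibilities
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var

# usage: two well-separated clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 1.0, 300)])
print(em_gmm_1d(x))
```

Each E step computes the posterior assignment probabilities visualized on the slides below; each M step is a weighted maximum likelihood fit given those probabilities.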

EM: A Sequence of Lower Bounds

C. Bishop, Pattern Recognition & Machine Learning

[Figure: the log likelihood ln p(X|θ) as a function of θ, together with the lower bound L(q, θ); the bound touches the log likelihood at θold, and maximizing it yields θnew.]

Fitting Gaussian Mixtures

C. Bishop, Pattern Recognition & Machine Learning

[Figure: (a) complete data, labeled by true cluster assignments; (b) incomplete data: the points to be clustered, with labels discarded.]

Posterior Assignment Probabilities

C. Bishop, Pattern Recognition & Machine Learning

[Figure: (b) incomplete data: points to be clustered; (c) posterior probabilities of assignment to each cluster for those points.]
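The quantities visualized in panel (c) are the standard responsibilities

γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j)

the posterior probability that point x_n was generated by mixture component k.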

EM Algorithm

C. Bishop, Pattern Recognition & Machine Learning

[Figure: six panels (a)-(f) showing EM iterations fitting a two-component Gaussian mixture; the later panels are labeled L = 1, L = 2, L = 5, and L = 20 complete EM cycles.]

Pairwise Markov Random Fields

• Product of arbitrary positive clique potential functions

• Guaranteed Markov with respect to the corresponding graph

p(x) = (1/Z) ∏_{(s,t) ∈ E} ψst(xs, xt) ∏_{s ∈ V} ψs(xs)

V: set of nodes
E: set of edges connecting nodes
Z: normalization constant (partition function)

Markov Chain Factorizations
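The slide's equations did not survive transcription; the standard example consistent with the title is that a Markov chain's directed factorization is already a pairwise MRF with Z = 1:

p(x) = p(x1) ∏_{t=2}^{T} p(xt | xt−1) = ∏_{(s,t) ∈ E} ψst(xs, xt)

with ψ_{t−1,t}(x_{t−1}, x_t) = p(x_t | x_{t−1}) and the initial distribution p(x_1) absorbed into the first potential.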

Energy Functions

Interpretation and terminology from statistical physics
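In this convention (standard in statistical physics, stated here because the slide's formulas are missing from the transcript), each potential is an exponentiated negative energy:

ψst(xs, xt) = exp{ −Est(xs, xt) }   so that   p(x) = (1/Z) exp{ − Σ_{(s,t) ∈ E} Est(xs, xt) − Σ_{s ∈ V} Es(xs) }

Low-energy configurations are high-probability configurations, as in a Boltzmann (Gibbs) distribution.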

Approximate Inference Framework

• Choose a family of approximating distributions which is tractable. The simplest example is the fully factored family q(x) = ∏_{s ∈ V} qs(xs).

• Define a distance to measure the quality of different approximations. Two possibilities: KL(q || p) and KL(p || q).

• Find the approximation minimizing this distance.

Fully Factored Approximations

• For KL(p || q), trivially minimized by setting each factor to the true marginal, qs(xs) = p(xs)

• Doesn't provide a computational method!

At this minimum the divergence equals the sum of the marginal entropies minus the joint entropy:

KL(p || q) = Σ_s Hs(ps) − H(p)

Variational Approximations

ln Z = ln Σ_x p̃(x) = ln Σ_x q(x) { p̃(x) / q(x) }   (multiply by one)

ln Z ≥ Σ_x q(x) ln { p̃(x) / q(x) }   (Jensen's inequality)

where p̃(x) = ∏ ψ denotes the unnormalized product of potentials, so p(x) = p̃(x) / Z.

• Minimizing KL divergence maximizes a lower bound on the data likelihood

Free Energies

F(q) = Σ_x q(x) E(x) + Σ_x q(x) ln q(x)

Average energy: Σ_x q(x) E(x)
Negative entropy: Σ_x q(x) ln q(x)
Gibbs free energy: F(q)

Normalization: KL(q || p) = F(q) + ln Z

• Free energies are equivalent to KL divergence, up to a normalization constant

Mean Field Free Energy
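The transcript drops the slide's formula; for a pairwise MRF with fully factored q(x) = ∏_s qs(xs), the Gibbs free energy decouples into low-dimensional sums (a standard derivation):

FMF(q) = Σ_{(s,t) ∈ E} Σ_{xs,xt} qs(xs) qt(xt) Est(xs, xt) + Σ_{s ∈ V} Σ_{xs} qs(xs) Es(xs) + Σ_{s ∈ V} Σ_{xs} qs(xs) ln qs(xs)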

Mean Field Equations

• Add Lagrange multipliers to enforce the normalization constraints Σ_{xs} qs(xs) = 1

• Taking derivatives and simplifying, we find a set of fixed point equations:

qs(xs) ∝ exp{ −Es(xs) − Σ_{t ∈ Γ(s)} Σ_{xt} qt(xt) Est(xs, xt) }

where Γ(s) denotes the neighbors of node s.

• Updating one marginal at a time gives convergent coordinate descent
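As an illustration, here is a minimal sketch of these coordinate-descent updates for an Ising-style grid MRF, an assumed special case not taken from the slides. With states x ∈ {−1, +1}, Est(xs, xt) = −J xs xt, and Es(xs) = −h xs, the fixed point equation above reduces to m_s = tanh(h + J Σ_t m_t) for the mean m_s = E_q[x_s]:

```python
import numpy as np

def mean_field_ising(shape=(10, 10), J=0.4, h=0.1, n_sweeps=100):
    """Hypothetical sketch: naive mean field on a 4-connected Ising grid."""
    m = np.zeros(shape)  # means of the factored marginals q_s
    rows, cols = shape
    for _ in range(n_sweeps):
        for i in range(rows):
            for j in range(cols):
                # sum the current means of the 4-connected neighbors
                nbr = 0.0
                if i > 0:        nbr += m[i - 1, j]
                if i < rows - 1: nbr += m[i + 1, j]
                if j > 0:        nbr += m[i, j - 1]
                if j < cols - 1: nbr += m[i, j + 1]
                # one-site coordinate-descent update on the MF free energy
                m[i, j] = np.tanh(h + J * nbr)
    return m

# usage: approximate marginals p(x_s = +1) ≈ (1 + m_s) / 2
m = mean_field_ising()
print((1 + m) / 2)
```

Updating one site at a time, as here, is exactly the convergent coordinate descent described above; updating all sites in parallel can oscillate.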

Mean Field Message Passing

Want products of messages to be simple

Want expectations of log potential functions to be simple

Exponential Families

• Natural or canonical parameters determine a log-linear combination of sufficient statistics:

p(x | θ) = exp{ θᵀ φ(x) − A(θ) }

• Log partition function normalizes to produce a valid probability distribution:

A(θ) = ln Σ_x exp{ θᵀ φ(x) }
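A familiar example: the Bernoulli distribution p(x | μ) = μ^x (1 − μ)^(1−x), x ∈ {0, 1}, is exponential family with sufficient statistic φ(x) = x, natural parameter θ = ln{ μ / (1 − μ) } (the log-odds), and A(θ) = ln(1 + e^θ).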

Directed Mean Field

• Can derive updates using the exponential family form of the conditional distribution of each variable, given its parents

• Can also just take derivatives, collect terms, and simplify!

Variational Message Passing, Winn & Bishop, JMLR 2005

Structured Mean Field

[Figure: three graphs side by side: the original graph, the naïve (fully factored) mean field approximation, and a structured mean field approximation that retains tractable subgraphs.]

• Any subgraph for which exact inference is tractable leads to a mean-field-style approximation whose update equations are also tractable