Applied Bayesian Nonparametrics
Special Topics in Machine Learning Brown University CSCI 2950-P, Fall 2011
October 11: Variational Methods
Convexity & Jensen's Inequality
[Figure: a convex function f(x); the chord connecting (a, f(a)) and (b, f(b)) lies above the function at every intermediate point x_λ between a and b.]
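In symbols, the figure illustrates the defining inequality of convexity, which Jensen's inequality generalizes to arbitrary expectations:

$$f(\lambda a + (1 - \lambda) b) \;\le\; \lambda f(a) + (1 - \lambda) f(b), \qquad 0 \le \lambda \le 1,$$

and more generally $f(\mathbb{E}[x]) \le \mathbb{E}[f(x)]$ for convex $f$. Applied to the concave $\ln$ function (with the inequality reversed), this is the step that produces all of the lower bounds below.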
Lower Bounds on Marginal Likelihood
[Figure: the log marginal likelihood decomposes as $\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)$, with the bound $\mathcal{L}(q, \theta)$ and the gap $\mathrm{KL}(q \,\|\, p)$ shown as stacked segments.]
C. Bishop, Pattern Recognition & Machine Learning
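For reference (these definitions follow Bishop's Chapter 9 treatment; the slide presents them graphically):

$$\mathcal{L}(q, \theta) = \sum_Z q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}, \qquad \mathrm{KL}(q \,\|\, p) = -\sum_Z q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}$$

Since $\mathrm{KL}(q \,\|\, p) \ge 0$, the functional $\mathcal{L}(q, \theta)$ is a lower bound on $\ln p(X \mid \theta)$, tight exactly when $q(Z) = p(Z \mid X, \theta)$.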
Expectation Maximization Algorithm
C. Bishop, Pattern Recognition & Machine Learning
[Figure: in the E step, with $\theta^{\text{old}}$ fixed, setting $q(Z) = p(Z \mid X, \theta^{\text{old}})$ drives $\mathrm{KL}(q \,\|\, p) = 0$, so the bound touches the log likelihood: $\mathcal{L}(q, \theta^{\text{old}}) = \ln p(X \mid \theta^{\text{old}})$. In the M step, maximizing $\mathcal{L}(q, \theta)$ over $\theta$ yields $\theta^{\text{new}}$; the bound rises and a positive gap $\mathrm{KL}(q \,\|\, p)$ reopens, so $\ln p(X \mid \theta^{\text{new}}) \ge \ln p(X \mid \theta^{\text{old}})$.]
E Step: Optimize distribution on hidden variables given parameters
M Step: Optimize parameters given distribution on hidden variables
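Written out, the two steps alternate between the two arguments of the bound $\mathcal{L}(q, \theta)$ (the standard updates, matching the decomposition above):

$$\text{E step:} \quad q^{\text{new}}(Z) = p(Z \mid X, \theta^{\text{old}}), \qquad \text{M step:} \quad \theta^{\text{new}} = \arg\max_\theta \sum_Z q^{\text{new}}(Z) \ln p(X, Z \mid \theta)$$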
EM: A Sequence of Lower Bounds
C. Bishop, Pattern Recognition & Machine Learning
[Figure: the log likelihood $\ln p(X \mid \theta)$ as a curve over $\theta$; each EM iteration constructs a lower bound $\mathcal{L}(q, \theta)$ tangent to the curve at $\theta^{\text{old}}$ and moves to the bound's maximizer $\theta^{\text{new}}$, so the likelihood never decreases.]
Fitting Gaussian Mixtures
C. Bishop, Pattern Recognition & Machine Learning
[Figure: (a) complete data, labeled by true cluster assignments; (b) the same points with labels removed, i.e. the incomplete data actually available for clustering.]
Posterior Assignment Probabilities
C. Bishop, Pattern Recognition & Machine Learning
[Figure: (b) incomplete data: points to be clustered; (c) the same points colored by their posterior probabilities of assignment to each cluster.]
EM Algorithm
C. Bishop, Pattern Recognition & Machine Learning
[Figure: six panels (a)-(f) showing EM fitting a two-component Gaussian mixture to 2-D data on the range [-2, 2]; panels (c)-(f) show the fitted components after L = 1, 2, 5, and 20 EM iterations, with the contours converging onto the two clusters.]
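As a concrete companion to these figures, here is a minimal NumPy sketch of EM for a K-component Gaussian mixture. It is illustrative code written for this transcript, not the lecture's; the function name em_gmm, the initialization scheme, and the small covariance regularizer are all choices made for the example.

```python
import numpy as np

def em_gmm(X, K=2, n_iters=20, seed=0):
    """EM for a mixture of K full-covariance Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                      # mixing weights
    mu = X[rng.choice(N, size=K, replace=False)]  # means: K random data points
    Sigma = np.stack([np.eye(D)] * K)             # covariances: identity

    for _ in range(n_iters):
        # E step: responsibilities r[n, k] = p(z_n = k | x_n), in log space.
        log_r = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            prec = np.linalg.inv(Sigma[k])
            maha = np.einsum('nd,de,ne->n', diff, prec, diff)
            _, logdet = np.linalg.slogdet(Sigma[k])
            log_r[:, k] = np.log(pi[k]) - 0.5 * (maha + logdet + D * np.log(2 * np.pi))
        log_r -= log_r.max(axis=1, keepdims=True)  # stabilize before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M step: weighted maximum likelihood for weights, means, covariances.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma
```

Each sweep performs one E step (responsibilities) and one M step (weighted maximum likelihood), so the log likelihood is non-decreasing across iterations, which is exactly the lower-bound argument illustrated in the preceding slides.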
Pairwise Markov Random Fields
• Product of arbitrary positive clique potential functions:
$$p(x) = \frac{1}{Z} \prod_{(s,t) \in \mathcal{E}} \psi_{st}(x_s, x_t) \prod_{s \in \mathcal{V}} \psi_s(x_s)$$
where $\mathcal{V}$ is the set of nodes, $\mathcal{E}$ the set of edges connecting nodes, and $Z$ the normalization constant (partition function).
• Guaranteed Markov with respect to the corresponding graph
Markov Chain Factorizations
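The transcript preserves only this slide's title; the standard factorizations it presumably displayed express a Markov chain as a pairwise MRF. For a chain $x_1 \to x_2 \to \cdots \to x_T$:

$$p(x) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1}) = \frac{\prod_{t=2}^{T} p(x_{t-1}, x_t)}{\prod_{t=2}^{T-1} p(x_t)},$$

so the edge potentials can be taken to be pairwise marginals, the node potentials inverse marginals, and $Z = 1$.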
Energy Functions
Interpretation and terminology from statistical physics
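The slide's exact notation is not preserved; a consistent parameterization (the standard one) writes each potential as the exponential of a negative energy:

$$\psi_{st}(x_s, x_t) = \exp\{-E_{st}(x_s, x_t)\}, \qquad \psi_s(x_s) = \exp\{-E_s(x_s)\},$$
$$p(x) = \frac{1}{Z} \exp\Big\{-\sum_{(s,t) \in \mathcal{E}} E_{st}(x_s, x_t) - \sum_{s \in \mathcal{V}} E_s(x_s)\Big\},$$

so low-energy configurations are high-probability, as in a Boltzmann distribution.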
Approximate Inference Framework
• Choose a family of approximating distributions which is tractable. The simplest example is a fully factored distribution, $q(x) = \prod_{s \in \mathcal{V}} q_s(x_s)$.
• Define a distance to measure the quality of different approximations. Two possibilities: $\mathrm{KL}(p \,\|\, q)$ and $\mathrm{KL}(q \,\|\, p)$.
• Find the approximation within the family that minimizes this distance.
Fully Factored Approximations
• $\mathrm{KL}(p \,\|\, q)$ is trivially minimized by setting each factor to the true marginal, $q_s(x_s) = p(x_s)$.
• Doesn't provide a computational method! Computing the true marginals is exactly the hard inference problem we set out to solve.
• The minimum value is the difference between the sum of the marginal entropies and the joint entropy:
$$\min_q \mathrm{KL}(p \,\|\, q) = \sum_{s \in \mathcal{V}} H(p_s) - H(p)$$
Variational Approximations
Let $\tilde{p}(x) = \prod_{(s,t)} \psi_{st}(x_s, x_t) \prod_s \psi_s(x_s)$ denote the unnormalized model. For any distribution $q(x)$:

$$\ln Z = \ln \sum_x \tilde{p}(x) = \ln \sum_x q(x)\, \frac{\tilde{p}(x)}{q(x)} \quad \text{(multiply by one)}$$
$$\ge \sum_x q(x) \ln \frac{\tilde{p}(x)}{q(x)} \quad \text{(Jensen's inequality, since } \ln \text{ is concave)}$$

The slack in this bound is exactly $\mathrm{KL}(q \,\|\, p)$.
• Minimizing the KL divergence maximizes a lower bound on the data likelihood (observations enter through the potentials, so $\ln Z$ plays the role of the log likelihood).
Free Energies
Using $p(x) = \exp\{-E(x)\}/Z$, the KL divergence expands into statistical physics terminology:

$$\mathrm{KL}(q \,\|\, p) = \underbrace{\sum_x q(x)\, E(x)}_{\text{average energy}} + \underbrace{\sum_x q(x) \ln q(x)}_{\text{negative entropy}} + \underbrace{\ln Z}_{\text{normalization}}$$

The first two terms define the Gibbs free energy $F(q) = \mathbb{E}_q[E(x)] - H(q)$.
• Free energies are equivalent to the KL divergence, up to a normalization constant: $\mathrm{KL}(q \,\|\, p) = F(q) + \ln Z$.
Mean Field Free Energy
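For the fully factored family the free energy decomposes over nodes and edges; in the notation above (a standard derivation, since the slide's equations are not preserved):

$$F_{\mathrm{MF}}(q) = \sum_{(s,t) \in \mathcal{E}} \sum_{x_s, x_t} q_s(x_s)\, q_t(x_t)\, E_{st}(x_s, x_t) + \sum_{s \in \mathcal{V}} \sum_{x_s} q_s(x_s)\, E_s(x_s) + \sum_{s \in \mathcal{V}} \sum_{x_s} q_s(x_s) \ln q_s(x_s)$$

because both the average energy and the entropy factor across the independent marginals.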
Mean Field Equations
• Add Lagrange multipliers to enforce the normalization constraints $\sum_{x_s} q_s(x_s) = 1$.
• Taking derivatives and simplifying, we find a set of fixed point equations:
$$q_s(x_s) \propto \exp\Big\{-E_s(x_s) - \sum_{t \in \Gamma(s)} \sum_{x_t} q_t(x_t)\, E_{st}(x_s, x_t)\Big\}$$
where $\Gamma(s)$ denotes the neighbors of node $s$.
• Updating one marginal at a time gives convergent coordinate descent
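To make the coordinate descent concrete, here is a small sketch (written for this transcript, not the lecture's) for a binary pairwise MRF in Ising form, $p(x) \propto \exp\{\sum_{(s,t)} J_{st} x_s x_t + \sum_s h_s x_s\}$ with $x_s \in \{-1, +1\}$, where the fixed point equation reduces to $m_s = \tanh(h_s + \sum_t J_{st} m_t)$. The name mean_field_ising and the example graph are choices made for illustration.

```python
import numpy as np

def mean_field_ising(J, h, n_sweeps=100):
    """Naive mean field for an Ising MRF.

    J: symmetric (n, n) coupling matrix with zero diagonal.
    h: (n,) local fields.
    Returns m, where m[s] approximates E[x_s] under the factored q.
    """
    m = np.zeros(len(h))
    for _ in range(n_sweeps):
        for s in range(len(h)):
            # Coordinate descent: update one marginal holding the rest fixed.
            m[s] = np.tanh(h[s] + J[s] @ m)
    return m

# Example: a 3-node chain with attractive couplings and a field on node 0.
J = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
h = np.array([1.0, 0.0, 0.0])
print(mean_field_ising(J, h))  # all means positive: the field propagates
```

Each single-marginal update exactly minimizes the mean field free energy with the other marginals held fixed, which is why the sweep converges.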
Mean Field Message Passing
• Want products of messages to be simple
• Want expectations of log potential functions to be simple
Exponential Families
• Natural or canonical parameters determine a log-linear combination of sufficient statistics:
$$p(x \mid \theta) = \nu(x) \exp\{\theta^T \phi(x) - A(\theta)\}$$
• The log partition function normalizes to produce a valid probability distribution:
$$A(\theta) = \ln \int \nu(x) \exp\{\theta^T \phi(x)\}\, dx$$
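As a quick example (not on the slide), the Bernoulli distribution fits this template:

$$p(x \mid \mu) = \mu^x (1 - \mu)^{1 - x} = \exp\Big\{x \ln\frac{\mu}{1-\mu} + \ln(1 - \mu)\Big\},$$

so the natural parameter is the log-odds $\theta = \ln \frac{\mu}{1-\mu}$, the sufficient statistic is $\phi(x) = x$, and $A(\theta) = \ln(1 + e^\theta)$.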
Directed Mean Field
• Can derive updates using the exponential family form of the conditional distribution of each variable, given its parents
• Can also just take derivatives, collect terms, and simplify!
Variational Message Passing, Winn & Bishop, JMLR 2005
Structured Mean Field
[Figure: three graphs compared: the original graph; naïve mean field, which drops all edges; and structured mean field, which keeps tractable subgraphs intact.]
• Any subgraph for which inference is tractable leads to a mean-field-style approximation whose update equations are tractable
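In symbols (a standard formulation, added here since the slide shows only the picture): partition the nodes into disjoint blocks $V_1, \ldots, V_M$, each inducing a tractable subgraph, and let each factor keep its block's internal structure:

$$q(x) = \prod_{m=1}^{M} q_m(x_{V_m})$$

The coordinate update for one block averages the cross-block energies under the remaining factors and then runs exact inference within the block, so richer blocks give tighter bounds at the cost of more work per update.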