Bayesian Time Series Models
Zoubin Ghahramani
Department of Engineering, University of Cambridge
joint work with Matt Beal, Jurgen van Gael,
Yee Whye Teh, Yunus Saatci
Wednesday, 26 May 2010
Bayesian Learning
Apply the basic rules of probability to learning from data.
Data set: D = {x_1, . . . , x_n}. Models: m, m′, etc. Model parameters: θ.
Prior probability of models: P(m), P(m′), etc.
Prior probabilities of model parameters: P(θ|m).
Model of data given parameters (likelihood model): P(x|θ, m).
If the data are independently and identically distributed then:
P(D|θ, m) = ∏_{i=1}^n P(x_i|θ, m)
Posterior probability of model parameters:
P(θ|D, m) = P(D|θ, m) P(θ|m) / P(D|m)
Posterior probability of models:
P(m|D) = P(m) P(D|m) / P(D)
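To make these formulas concrete, here is a minimal numerical sketch (not from the talk; the coin-flip data and the two models are made-up assumptions): m is a fixed fair coin with no free parameters, and m′ is a Bernoulli model with a uniform prior on θ.

```python
import numpy as np
from scipy.special import betaln

# Made-up coin-flip data; 1 = heads.
D = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
n, h = len(D), D.sum()

# Model evidences:
#   m : fair coin, no parameters       -> P(D|m)  = 0.5**n
#   m': Bernoulli(theta), theta~U(0,1) -> P(D|m') = Beta(h+1, n-h+1)
log_ev = np.array([n * np.log(0.5), betaln(h + 1, n - h + 1)])

# Posterior over models, assuming equal priors P(m) = P(m') = 1/2.
post = np.exp(log_ev - log_ev.max())
post /= post.sum()
print("P(m|D) =", post[0], " P(m'|D) =", post[1])

# The posterior over parameters under m' is Beta(h+1, n-h+1):
print("E[theta|D, m'] =", (h + 1) / (n + 2))
```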
Bayesian Occam’s Razor and Model Comparison
Compare model classes, e.g. m and m′, using posterior probabilities given D:
p(m|D) = p(D|m) p(m) / p(D),   where   p(D|m) = ∫ p(D|θ, m) p(θ|m) dθ
Interpretations of the Marginal Likelihood (“model evidence”):
• The probability that randomly selected parameters from the prior would generate D.
• Probability of the data under the model, averaging over all possible parameter values.
• log_2 (1/p(D|m)) is the number of bits of surprise at observing data D under model m.
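The first interpretation suggests a simple Monte Carlo check, sketched below under assumed toy choices (Bernoulli likelihood, uniform prior): draw parameters from the prior and average the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

D = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])   # same made-up data
n, h = len(D), D.sum()

# p(D|m) ~= (1/S) * sum_s p(D|theta_s), with theta_s ~ p(theta|m).
S = 100_000
theta = rng.uniform(0.0, 1.0, size=S)          # draws from the prior
p_D = np.mean(theta**h * (1 - theta)**(n - h)) # average likelihood

print("p(D|m) ~=", p_D)
print("bits of surprise:", -np.log2(p_D))      # log2(1/p(D|m))
```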
Model classes that are too simple are unlikely to generate the data set.
Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.
[Figure: the marginal likelihood P(D|m) plotted across all possible data sets of size n, for models that are "too simple", "too complex", and "just right".]
Bayesian Model Comparison: Occam’s Razor at Work
[Figure: polynomial fits of order M = 0 to 7 to the same data set, together with a bar plot of the model evidence P(Y|M).]
For example, for quadratic polynomials (m = 2): y = a_0 + a_1 x + a_2 x^2 + ε, where ε ∼ N(0, σ^2) and parameters θ = (a_0, a_1, a_2, σ).
demo: polybayes
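In the same spirit as the polybayes demo, here is a hedged re-creation (the data, the prior scale τ, and the noise level σ below are illustrative assumptions): with a Gaussian prior w ∼ N(0, τ²I) on the coefficients, the evidence of a degree-M polynomial model has a closed form.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Made-up data from a quadratic with Gaussian noise.
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x - 5.0 * x**2 + rng.normal(0, 0.25, x.size)

# With w ~ N(0, tau^2 I) and noise N(0, sigma^2), marginalising w gives
#   y ~ N(0, tau^2 * Phi @ Phi.T + sigma^2 * I).
sigma, tau = 0.25, 2.0
for M in range(8):
    Phi = np.vander(x, M + 1, increasing=True)          # [1, x, ..., x^M]
    cov = tau**2 * Phi @ Phi.T + sigma**2 * np.eye(x.size)
    log_ev = multivariate_normal(np.zeros(x.size), cov).logpdf(y)
    print(f"M = {M}: log evidence = {log_ev:.1f}")
```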
Learning Model Structure
How many clusters in the data?
What is the intrinsic dimensionality of the data?
Is this input relevant to predicting that output?
What is the order of a dynamical system?
How many states in a hidden Markov model?
How many auditory sources in the input?
Which graph structure best models the data?
demo: run simple
Time Series Models
• Broadly two classes of time series models:
– fully observed models (e.g. AR, n-gram)
– hidden state models (e.g. HMM, state-space models (SSMs))
• How do we learn the dimensionality of a linear-Gaussian SSM?
– Variational Bayesian learning of SSMs
• Hidden Markov models (HMMs) are widely used, but how do we choose the number of hidden states?
– A non-parametric Bayesian approach: infinite HMMs.
• A single discrete state variable is a poor representation of the history. Can we do better?
– Factorial HMMs
• Can we make factorial HMMs non-parametric?
– Infinite factorial HMMs and the Markov Indian Buffet Process
Wednesday, 26 May 2010
Part I:
Variational Bayesian learning of
State-Space Models
Linear-Gaussian State-space models (SSMs)
[Figure: graphical model of the SSM, with hidden states x_1, x_2, x_3, ..., x_T, inputs u_1, u_2, u_3, ..., u_T, observations y_1, y_2, y_3, ..., y_T, and parameter matrices A, B, C, D.]
P(x_{1:T}, y_{1:T}|u_{1:T}) = P(x_1|u_1) P(y_1|x_1, u_1) ∏_{t=2}^T P(x_t|x_{t−1}, u_t) P(y_t|x_t, u_t)
Output equation: y_t = C x_t + D u_t + v_t
State dynamics equation: x_t = A x_{t−1} + B u_t + w_t
where x_t, u_t and y_t are real-valued vectors and v and w are uncorrelated zero-mean Gaussian noise vectors.
These models are a.k.a. stochastic linear dynamical systems or Kalman filter models; they are the continuous-state version of HMMs.
Forward–backward algorithm ≡ Kalman smoothing
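A minimal generative sketch of the two equations above (the particular matrices, input signal, and noise scales are made-up examples):

```python
import numpy as np

rng = np.random.default_rng(2)

T, kx, ky, ku = 100, 2, 1, 1
A = np.array([[0.99, 0.10], [-0.10, 0.99]])   # state dynamics matrix
B = 0.1 * np.ones((kx, ku))                   # input -> state
C = np.array([[1.0, 0.0]])                    # state -> output
D = np.zeros((ky, ku))                        # input -> output
u = rng.normal(size=(T, ku))                  # driving inputs

x = np.zeros((T, kx))
y = np.zeros((T, ky))
x[0] = rng.normal(size=kx)
y[0] = C @ x[0] + D @ u[0] + 0.1 * rng.normal(size=ky)
for t in range(1, T):
    x[t] = A @ x[t - 1] + B @ u[t] + 0.1 * rng.normal(size=kx)  # + w_t
    y[t] = C @ x[t] + D @ u[t] + 0.1 * rng.normal(size=ky)      # + v_t
```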
Lower Bounding the Marginal Likelihood: Variational Bayesian Learning
Let the latent variables be x, data y and the parameters θ.
We can lower bound the marginal likelihood (using Jensen’s inequality):
ln p(y|m) = ln ∫ p(y, x, θ|m) dx dθ
= ln ∫ q(x, θ) [p(y, x, θ|m) / q(x, θ)] dx dθ
≥ ∫ q(x, θ) ln [p(y, x, θ|m) / q(x, θ)] dx dθ.

Use a simpler, factorised approximation q(x, θ) ≈ q_x(x) q_θ(θ):

ln p(y|m) ≥ ∫ q_x(x) q_θ(θ) ln [p(y, x, θ|m) / (q_x(x) q_θ(θ))] dx dθ = F_m(q_x(x), q_θ(θ), y).
(some significant contributors to this framework from 1993–1998: C. M. Bishop, G. E. Hinton, T. S. Jaakkola, M. I. Jordan, D. J. C. MacKay, R. M. Neal, L. K. Saul)
Inferring the Number of Hidden States
[Figure: lower bound on the marginal likelihood (in nats) plotted against the dimensionality of the state space, k = 1, ..., 20.]
Variation of F, the lower bound on the marginal likelihood, with hidden state dimension k, for 10 random initialisations of VBEM. We can use this lower bound to infer/select the number of hidden states.
Inferring Regulatory Networks
[Figure: inferred gene regulatory network; nodes are genes (e.g. CASP8, GATA3, MAPK9, CASP4, PCNA, AKT1, EGR1) connected by inferred positive and negative interactions.]
• Gene–gene interactions present in ≥ 80% of the VB state-space models out of 10 random seeds and k = 14, at a confidence level of 99.8%.
• The number inside each node is the gene identity.
• Numbers on the edges represent the number of models (from 10 different random seeds) in which the interaction is supported at this confidence level.
• Dotted lines are negative interactions; continuous lines represent positive interactions.
Summary of Part I
• Bayesian machine learning
• Marginal likelihoods and Occam’s Razor
• Variational Bayesian lower bounds
• Application to learning the number of hidden states in linear-Gaussian state-space models
Part II
The Infinite Hidden Markov Model
Hidden Markov Models
Choosing the number of hidden states
• How do we choose K, the number of hidden states, in an HMM?
• Can we define a model with an unbounded number of hidden states, and a suitable inference algorithm?
Infinite Hidden Markov models
Alice in Wonderland
Hierarchical Urn Scheme for generating transitions in the iHMM (2002)
• n_ij is the number of previous transitions from state i to state j
• α, β, and γ are hyperparameters
• the probability of a transition from i to j is proportional to n_ij
• with probability proportional to βγ, jump to a new state
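A sketch of one transition under a two-level urn of this kind (a simplification; in particular, treating α as a self-transition bonus is an assumption of this sketch, not necessarily the exact published scheme):

```python
import numpy as np

rng = np.random.default_rng(3)

def next_state(i, n, oracle, alpha, beta, gamma):
    """n[i, j]: past transition counts; oracle[j]: past oracle draws."""
    K = len(oracle)
    # Level 1: reuse a past transition (weight n_ij), self-transition
    # bonus (alpha, an assumption here), or consult the oracle (beta).
    w = n[i].astype(float)
    w[i] += alpha
    w = np.append(w, beta)
    j = rng.choice(K + 1, p=w / w.sum())
    if j < K:
        return j
    # Level 2: the oracle picks by its own counts, or a brand-new state
    # with weight gamma -- so overall P(new state) is proportional to
    # beta * gamma, as stated above. Index K means "new state".
    w = np.append(oracle.astype(float), gamma)
    return rng.choice(K + 1, p=w / w.sum())
```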
Relating iHMMs to DPMs
• The infinite Hidden Markov Model is closely related to Dirichlet Process Mixture (DPM) models
• This makes sense:
– HMMs are time series generalisations of mixture models.
– DPMs are a way of defining mixture models with countably infinitely many components.
– iHMMs are HMMs with countably infinitely many states.
HMMs as sequential mixtures
Infinite Hidden Markov Models
Teh, Jordan, Beal and Blei (2005) derived iHMMs in terms of Hierarchical Dirichlet Processes.
Efficient inference in iHMMs?
Inference and Learning in HMMs and iHMMs
• HMM inference of hidden states p(s_t|y_1…y_T, θ):
– forward–backward = dynamic programming = belief propagation
• HMM parameter learning:
– Baum–Welch = expectation maximization (EM), or Gibbs sampling (Bayesian)
• iHMM inference and learning, p(s_t, θ|y_1…y_T):
– Gibbs sampling
• This is unfortunate: Gibbs can be very slow for time series!
• Can we use dynamic programming?
Dynamic Programming in HMMs: Forward–Backtrack Sampling
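A generic sketch of this dynamic program for a finite K-state HMM (the building block that beam sampling reuses): filter forwards, then sample the state sequence backwards.

```python
import numpy as np

rng = np.random.default_rng(4)

def ffbs(pi0, Pi, lik):
    """pi0: (K,) initial distribution; Pi: (K, K) transition matrix;
    lik[t, k] = p(y_t | s_t = k). Returns a joint posterior sample."""
    T, K = lik.shape
    alpha = np.zeros((T, K))            # filtered p(s_t | y_1..y_t)
    alpha[0] = pi0 * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):               # forward filtering
        alpha[t] = (alpha[t - 1] @ Pi) * lik[t]
        alpha[t] /= alpha[t].sum()
    s = np.zeros(T, dtype=int)          # backward sampling
    s[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * Pi[:, s[t + 1]]  # p(s_t | s_{t+1}, y_1..y_t)
        s[t] = rng.choice(K, p=w / w.sum())
    return s
```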
Beam Sampling
Auxiliary variables
Note: adding the auxiliary variables u does not change the distribution over the other variables.
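A sketch of how the slice variables u can be drawn and used (variable names are ours): each u_t is uniform under the probability of the transition actually taken, and conditioning on u truncates the infinite model to the finitely many transitions with probability above u_t.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_slices(s, pi0, Pi):
    """One auxiliary variable per time step, given the current state
    sequence s, initial distribution pi0 and transition matrix Pi."""
    T = len(s)
    u = np.empty(T)
    u[0] = rng.uniform(0.0, pi0[s[0]])
    for t in range(1, T):
        u[t] = rng.uniform(0.0, Pi[s[t - 1], s[t]])
    return u

# In the forward pass the transition weights then become indicators:
#   alpha[t, j] = lik[t, j] * sum_i alpha[t-1, i] * (Pi[i, j] > u[t])
# so only finitely many states can carry non-zero mass.
```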
Experiment: Text Prediction
Experiment: Changepoint Detection
Parallel and Distributed Implementations of iHMMs
• Recent work on parallel (.NET) and distributed (Hadoop) implementations of beam-sampling for iHMMs (Bratieres, Van Gael, Vlachos and Ghahramani, 2010).
• Applied to unsupervised learning of part-of-speech tags from Newswire text (10 million word sequences).
• Promising results; code available.
Part III
• Hidden Markov models represent the entire history of a sequence using a single state variable s_t
• This seems restrictive...
• It seems more natural to allow many hidden state variables, a “distributed representation” of state.
• …the Factorial Hidden Markov Model
Factorial HMMs
• Factorial HMMs (Ghahramani and Jordan, 1997)
• A kind of dynamic Bayesian network.
• Inference using variational methods or sampling.
• Have been used in a variety of applications (e.g. condition monitoring, biological sequences, speech recognition).
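A minimal generative sketch of a factorial HMM (the chain count, state count, transition matrix, and additive Gaussian output model below are illustrative assumptions): several chains evolve independently, and every observation depends on all of them.

```python
import numpy as np

rng = np.random.default_rng(6)

M, K, T, Dy = 3, 2, 50, 4                    # chains, states, steps, obs dim
Pi = np.array([[0.9, 0.1], [0.1, 0.9]])      # sticky 2-state transitions
W = rng.normal(size=(M, K, Dy))              # each chain's output contribution

s = np.zeros((T, M), dtype=int)
y = np.zeros((T, Dy))
s[0] = rng.integers(0, K, size=M)
for t in range(T):
    if t > 0:
        for m in range(M):                   # chains evolve independently
            s[t, m] = rng.choice(K, p=Pi[s[t - 1, m]])
    mean = sum(W[m, s[t, m]] for m in range(M))
    y[t] = mean + 0.1 * rng.normal(size=Dy)  # output couples all chains
```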
From factorial HMMs to infinite factorial HMMs?
• A non-parametric version where the number of chains is unbounded?
• In the infinite factorial HMM (ifHMM), each chain is binary (van Gael, Teh, and Ghahramani, 2008).
• Based on the Markov extension of the Indian Buffet Process (IBP).
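A finite-K sketch of the prior over binary chains (the Beta hyperparameters below are illustrative assumptions; letting K → ∞ gives the Markov IBP construction):

```python
import numpy as np

rng = np.random.default_rng(7)

K, T, alpha = 20, 100, 3.0
a = rng.beta(alpha / K, 1.0, size=K)   # P(chain k switches on: 0 -> 1)
b = rng.beta(1.0, 1.0, size=K)         # P(chain k stays on:    1 -> 1)

S = np.zeros((T, K), dtype=int)        # S[t, k]: state of chain k at time t
for t in range(1, T):
    p_on = np.where(S[t - 1] == 1, b, a)
    S[t] = rng.random(K) < p_on        # most chains stay off for small a_k
```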
Bars-in-time data
ifHMM Toy Experiment: Bars-in-time
[Figure: the observed data, the ground-truth chains, and the chains inferred by the ifHMM for the bars-in-time problem.]
ifHMM Experiment: Bars-in-time
Separating speech audio of multiple speakers in time
The Big Picture
Summary
• Bayesian methods provide a flexible framework for modelling.
• State space models can be learned using variational Bayesian methods.
• iHMMs provide a non-parametric sequence model where the number of states is not bounded a priori.
• Beam sampling provides an efficient exact dynamic programming-based MCMC method for iHMMs.
• ifHMMs extend iHMMs to multiple state variables in parallel.
• Future directions: new models, fast algorithms, and compelling applications.
Appendix
Binary Matrix Representation of Clustering
Binary Latent Feature Matrices
Indian Buffet Process
(Griffiths and Ghahramani, 2005)
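A short sketch of the IBP “restaurant” generative process: customer 1 takes Poisson(α) dishes; customer n takes each existing dish k with probability m_k/n and then Poisson(α/n) new dishes.

```python
import numpy as np

rng = np.random.default_rng(8)

def sample_ibp(N, alpha):
    """Returns a binary N x K feature matrix Z drawn from the IBP."""
    counts = []                              # m_k: customers per dish
    rows = []
    for n in range(1, N + 1):
        row = [rng.random() < m / n for m in counts]   # existing dishes
        for k, taken in enumerate(row):
            counts[k] += taken
        new = rng.poisson(alpha / n)                   # brand-new dishes
        counts.extend([1] * new)
        rows.append(row + [True] * new)
    Z = np.zeros((N, len(counts)), dtype=int)
    for n, row in enumerate(rows):
        Z[n, :len(row)] = row
    return Z

print(sample_ibp(10, alpha=2.0))
```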
HMM vs iHMM
Infinite Hidden Markov Models
• Basic idea: a two-level urn scheme to share information between states (Beal, Ghahramani, and Rasmussen, 2002)
• Derived from Hierarchical Dirichlet Processes (Teh, Jordan, Beal & Blei, 2006)
iHMMs and HDPs
Inference and Learning
Variational Bayesian Learning . . .
Maximization of this lower bound, Fm, can be done via EM-like iterative updates:
q_x^{(t+1)}(x) ∝ exp( ∫ ln p(x, y|θ, m) q_θ^{(t)}(θ) dθ )   (E-like step)

q_θ^{(t+1)}(θ) ∝ p(θ|m) exp( ∫ ln p(x, y|θ, m) q_x^{(t+1)}(x) dx )   (M-like step)
Maximizing F_m is equivalent to minimizing the KL-divergence between the approximate posterior, q_θ(θ) q_x(x), and the true posterior, p(θ, x|y, m):

ln p(y|m) − F_m(q_x(x), q_θ(θ), y) = ∫ q_x(x) q_θ(θ) ln [q_x(x) q_θ(θ) / p(θ, x|y, m)] dx dθ = KL(q‖p)
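A small numerical check of this identity on a made-up discrete toy model (x and θ each take a few values, with y fixed), confirming that ln p(y|m) − F_m equals KL(q‖p):

```python
import numpy as np

rng = np.random.default_rng(9)

X, TH = 3, 4                           # sizes of discrete x and theta
joint = rng.random((X, TH))            # unnormalised p(x, theta, y) at the observed y
log_p_y = np.log(joint.sum())          # ln p(y|m)
post = joint / joint.sum()             # true posterior p(x, theta | y, m)

qx = rng.dirichlet(np.ones(X))         # an arbitrary factorised q
qt = rng.dirichlet(np.ones(TH))
q = np.outer(qx, qt)

F = np.sum(q * np.log(joint / q))      # the lower bound F_m
kl = np.sum(q * np.log(q / post))      # KL(q || p)
print(log_p_y - F, kl)                 # identical up to rounding
```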
Hidden Markov Models
Experiment I - HMM data
Beam Sampling Properties