Bayesian Time Series Models
Zoubin Ghahramani
Department of Engineering, University of Cambridge
joint work with Matt Beal, Jurgen van Gael,
Yee Whye Teh, Yunus Saatci
Wednesday, 26 May 2010
Bayesian Learning
Apply the basic rules of probability to learning from data.
Data set: D = {x_1, . . . , x_n}. Models: m, m′, etc. Model parameters: θ.
Prior probability of models: P(m), P(m′), etc.
Prior probabilities of model parameters: P(θ|m).
Model of data given parameters (likelihood model): P(x|θ, m).
If the data are independently and identically distributed then:
P(D|θ, m) = ∏_{i=1}^n P(x_i|θ, m)
Posterior probability of model parameters:
P(θ|D, m) = P(D|θ, m) P(θ|m) / P(D|m)
Posterior probability of models:
P(m|D) = P(m) P(D|m) / P(D)
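To make these formulas concrete, here is a minimal numerical sketch (not from the talk; the coin-flip data and the two models are made-up assumptions): m is a fixed fair coin with no free parameters, and m′ is a Bernoulli model with a uniform prior on θ.

```python
import numpy as np
from scipy.special import betaln

# Made-up coin-flip data; 1 = heads.
D = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
n, h = len(D), D.sum()

# Model evidences:
#   m : fair coin, no parameters       -> P(D|m)  = 0.5**n
#   m': Bernoulli(theta), theta~U(0,1) -> P(D|m') = Beta(h+1, n-h+1)
log_ev = np.array([n * np.log(0.5), betaln(h + 1, n - h + 1)])

# Posterior over models, assuming equal priors P(m) = P(m') = 1/2.
post = np.exp(log_ev - log_ev.max())
post /= post.sum()
print("P(m|D) =", post[0], " P(m'|D) =", post[1])

# The posterior over parameters under m' is Beta(h+1, n-h+1):
print("E[theta|D, m'] =", (h + 1) / (n + 2))
```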
Bayesian Occam’s Razor and Model Comparison
Compare model classes, e.g. m and m′, using posterior probabilities given D:
p(m|D) = p(D|m) p(m) / p(D),   where   p(D|m) = ∫ p(D|θ, m) p(θ|m) dθ
Interpretations of the Marginal Likelihood (“model evidence”):
• The probability that randomly selected parameters from the prior would generate D.
• Probability of the data under the model, averaging over all possible parameter values.
• log_2 (1/p(D|m)) is the number of bits of surprise at observing data D under model m.
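The first interpretation suggests a simple Monte Carlo check, sketched below under assumed toy choices (Bernoulli likelihood, uniform prior): draw parameters from the prior and average the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

D = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])   # same made-up data
n, h = len(D), D.sum()

# p(D|m) ~= (1/S) * sum_s p(D|theta_s), with theta_s ~ p(theta|m).
S = 100_000
theta = rng.uniform(0.0, 1.0, size=S)          # draws from the prior
p_D = np.mean(theta**h * (1 - theta)**(n - h)) # average likelihood

print("p(D|m) ~=", p_D)
print("bits of surprise:", -np.log2(p_D))      # log2(1/p(D|m))
```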
Model classes that are too simple are unlikely to generate the data set.
Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.
[Figure: the marginal likelihood P(D|m) plotted across all possible data sets of size n, for models that are "too simple", "too complex", and "just right".]
Bayesian Model Comparison: Occam’s Razor at Work
[Figure: polynomial fits of order M = 0 to 7 to the same data set, together with a bar plot of the model evidence P(Y|M).]
For example, for quadratic polynomials (m = 2): y = a_0 + a_1 x + a_2 x^2 + ε, where ε ∼ N(0, σ^2) and parameters θ = (a_0, a_1, a_2, σ).
demo: polybayes
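In the same spirit as the polybayes demo, here is a hedged re-creation (the data, the prior scale τ, and the noise level σ below are illustrative assumptions): with a Gaussian prior w ∼ N(0, τ²I) on the coefficients, the evidence of a degree-M polynomial model has a closed form.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Made-up data from a quadratic with Gaussian noise.
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x - 5.0 * x**2 + rng.normal(0, 0.25, x.size)

# With w ~ N(0, tau^2 I) and noise N(0, sigma^2), marginalising w gives
#   y ~ N(0, tau^2 * Phi @ Phi.T + sigma^2 * I).
sigma, tau = 0.25, 2.0
for M in range(8):
    Phi = np.vander(x, M + 1, increasing=True)          # [1, x, ..., x^M]
    cov = tau**2 * Phi @ Phi.T + sigma**2 * np.eye(x.size)
    log_ev = multivariate_normal(np.zeros(x.size), cov).logpdf(y)
    print(f"M = {M}: log evidence = {log_ev:.1f}")
```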
Learning Model Structure
How many clusters in the data?
What is the intrinsic dimensionality of the data?
Is this input relevant to predicting that output?
What is the order of a dynamical system?
How many states in a hidden Markov model?
How many auditory sources in the input?
Which graph structure best models the data?
demo: run simple
Time Series Models
• Broadly two classes of time series models:
– fully observed models (e.g. AR, n-gram)
– hidden state models (e.g. HMM, state-space models (SSMs))
• How do we learn the dimensionality of a linear-Gaussian SSM?
– Variational Bayesian learning of SSMs
• Hidden Markov models (HMMs) are widely used, but how do we choose the number of hidden states?
– A non-parametric Bayesian approach: infinite HMMs.
• A single discrete state variable is a poor representation of the history. Can we do better?
– Factorial HMMs
• Can we make factorial HMMs non-parametric?
– Infinite factorial HMMs and the Markov Indian Buffet Process
Wednesday, 26 May 2010
Part I:
Variational Bayesian learning of
State-Space Models
Linear-Gaussian State-space models (SSMs)
[Figure: graphical model of the SSM, with hidden states x_1, x_2, x_3, ..., x_T, inputs u_1, u_2, u_3, ..., u_T, observations y_1, y_2, y_3, ..., y_T, and parameter matrices A, B, C, D.]
P(x_{1:T}, y_{1:T}|u_{1:T}) = P(x_1|u_1) P(y_1|x_1, u_1) ∏_{t=2}^T P(x_t|x_{t−1}, u_t) P(y_t|x_t, u_t)
Output equation: y_t = C x_t + D u_t + v_t
State dynamics equation: x_t = A x_{t−1} + B u_t + w_t
where x_t, u_t and y_t are real-valued vectors and v and w are uncorrelated zero-mean Gaussian noise vectors.
These models are a.k.a. stochastic linear dynamical systems or Kalman filter models; they are the continuous-state version of HMMs.
Forward–backward algorithm ≡ Kalman smoothing
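A minimal generative sketch of the two equations above (the particular matrices, input signal, and noise scales are made-up examples):

```python
import numpy as np

rng = np.random.default_rng(2)

T, kx, ky, ku = 100, 2, 1, 1
A = np.array([[0.99, 0.10], [-0.10, 0.99]])   # state dynamics matrix
B = 0.1 * np.ones((kx, ku))                   # input -> state
C = np.array([[1.0, 0.0]])                    # state -> output
D = np.zeros((ky, ku))                        # input -> output
u = rng.normal(size=(T, ku))                  # driving inputs

x = np.zeros((T, kx))
y = np.zeros((T, ky))
x[0] = rng.normal(size=kx)
y[0] = C @ x[0] + D @ u[0] + 0.1 * rng.normal(size=ky)
for t in range(1, T):
    x[t] = A @ x[t - 1] + B @ u[t] + 0.1 * rng.normal(size=kx)  # + w_t
    y[t] = C @ x[t] + D @ u[t] + 0.1 * rng.normal(size=ky)      # + v_t
```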
Lower Bounding the Marginal Likelihood: Variational Bayesian Learning
Let the latent variables be x, data y and the parameters θ.
We can lower bound the marginal likelihood (using Jensen’s inequality):
ln p(y|m) = ln ∫ p(y, x, θ|m) dx dθ
= ln ∫ q(x, θ) [p(y, x, θ|m) / q(x, θ)] dx dθ
≥ ∫ q(x, θ) ln [p(y, x, θ|m) / q(x, θ)] dx dθ.

Use a simpler, factorised approximation q(x, θ) ≈ q_x(x) q_θ(θ):

ln p(y|m) ≥ ∫ q_x(x) q_θ(θ) ln [p(y, x, θ|m) / (q_x(x) q_θ(θ))] dx dθ = F_m(q_x(x), q_θ(θ), y).
(some significant contributors to this framework from 1993–1998: C. M. Bishop, G. E. Hinton, T. S. Jaakkola, M. I. Jordan, D. J. C. MacKay, R. M. Neal, L. K. Saul)
Inferring the Number of Hidden States
[Figure: lower bound on the marginal likelihood (in nats) plotted against the dimensionality of the state space, k = 1, ..., 20.]
Variation of F, the lower bound on the marginal likelihood, with hidden state dimension k, for 10 random initialisations of VBEM. We can use this lower bound to infer/select the number of hidden states.
Inferring Regulatory Networks
[Figure: inferred gene regulatory network; nodes are genes (e.g. CASP8, GATA3, MAPK9, CASP4, PCNA, AKT1, EGR1) connected by inferred positive and negative interactions.]
• Gene–gene interactions present in ≥ 80% of the VB state-space models out of 10 random seeds and k = 14, at a confidence level of 99.8%.
• The number inside each node is the gene identity.
• Numbers on the edges represent the number of models (from 10 different random seeds) in which the interaction is supported at this confidence level.
• Dotted lines are negative interactions; continuous lines represent positive interactions.
Summary of Part I
• Bayesian machine learning
• Marginal likelihoods and Occam’s Razor
• Variational Bayesian lower bounds
• Application to learning the number of hidden states in linear-Gaussian state-space models
Part II
The Infinite Hidden Markov Model
Hidden Markov Models
Choosing the number of hidden states
• How do we choose K, the number of hidden states, in an HMM?
• Can we define a model with an unbounded number of hidden states, and a suitable inference algorithm?
Infinite Hidden Markov models
Alice in Wonderland
Hierarchical Urn Scheme for generating transitions in the iHMM (2002)
• n_ij is the number of previous transitions from state i to state j
• α, β, and γ are hyperparameters
• the probability of a transition from i to j is proportional to n_ij
• with probability proportional to βγ, jump to a new state
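A sketch of one transition under a two-level urn of this kind (a simplification; in particular, treating α as a self-transition bonus is an assumption of this sketch, not necessarily the exact published scheme):

```python
import numpy as np

rng = np.random.default_rng(3)

def next_state(i, n, oracle, alpha, beta, gamma):
    """n[i, j]: past transition counts; oracle[j]: past oracle draws."""
    K = len(oracle)
    # Level 1: reuse a past transition (weight n_ij), self-transition
    # bonus (alpha, an assumption here), or consult the oracle (beta).
    w = n[i].astype(float)
    w[i] += alpha
    w = np.append(w, beta)
    j = rng.choice(K + 1, p=w / w.sum())
    if j < K:
        return j
    # Level 2: the oracle picks by its own counts, or a brand-new state
    # with weight gamma -- so overall P(new state) is proportional to
    # beta * gamma, as stated above. Index K means "new state".
    w = np.append(oracle.astype(float), gamma)
    return rng.choice(K + 1, p=w / w.sum())
```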
Relating iHMMs to DPMs
• The infinite Hidden Markov Model is closely related to Dirichlet Process Mixture (DPM) models
• This makes sense:
– HMMs are time series generalisations of mixture models.
– DPMs are a way of defining mixture models with countably infinitely many components.
– iHMMs are HMMs with countably infinitely many states.
HMMs as sequential mixtures
Infinite Hidden Markov Models
Teh, Jordan, Beal and Blei (2005) derived iHMMs in terms of Hierarchical Dirichlet Processes.
Efficient inference in iHMMs?
Inference and Learning in HMMs and iHMMs
• HMM inference of hidden states p(s_t|y_1…y_T, θ):
– forward–backward = dynamic programming = belief propagation
• HMM parameter learning:
– Baum–Welch = expectation maximization (EM), or Gibbs sampling (Bayesian)
• iHMM inference and learning, p(s_t, θ|y_1…y_T):
– Gibbs sampling
• This is unfortunate: Gibbs can be very slow for time series!
• Can we use dynamic programming?
Dynamic Programming in HMMs: Forward–Backtrack Sampling
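A generic sketch of this dynamic program for a finite K-state HMM (the building block that beam sampling reuses): filter forwards, then sample the state sequence backwards.

```python
import numpy as np

rng = np.random.default_rng(4)

def ffbs(pi0, Pi, lik):
    """pi0: (K,) initial distribution; Pi: (K, K) transition matrix;
    lik[t, k] = p(y_t | s_t = k). Returns a joint posterior sample."""
    T, K = lik.shape
    alpha = np.zeros((T, K))            # filtered p(s_t | y_1..y_t)
    alpha[0] = pi0 * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):               # forward filtering
        alpha[t] = (alpha[t - 1] @ Pi) * lik[t]
        alpha[t] /= alpha[t].sum()
    s = np.zeros(T, dtype=int)          # backward sampling
    s[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * Pi[:, s[t + 1]]  # p(s_t | s_{t+1}, y_1..y_t)
        s[t] = rng.choice(K, p=w / w.sum())
    return s
```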
Beam Sampling
Auxiliary variables
Note: adding the auxiliary variables u does not change the distribution over the other variables.
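A sketch of how the slice variables u can be drawn and used (variable names are ours): each u_t is uniform under the probability of the transition actually taken, and conditioning on u truncates the infinite model to the finitely many transitions with probability above u_t.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_slices(s, pi0, Pi):
    """One auxiliary variable per time step, given the current state
    sequence s, initial distribution pi0 and transition matrix Pi."""
    T = len(s)
    u = np.empty(T)
    u[0] = rng.uniform(0.0, pi0[s[0]])
    for t in range(1, T):
        u[t] = rng.uniform(0.0, Pi[s[t - 1], s[t]])
    return u

# In the forward pass the transition weights then become indicators:
#   alpha[t, j] = lik[t, j] * sum_i alpha[t-1, i] * (Pi[i, j] > u[t])
# so only finitely many states can carry non-zero mass.
```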
Experiment: Text Prediction
Experiment: Changepoint Detection
Parallel and Distributed Implementations of iHMMs
• Recent work on parallel (.NET) and distributed (Hadoop) implementations of beam-sampling for iHMMs (Bratieres, Van Gael, Vlachos and Ghahramani, 2010).
• Applied to unsupervised learning of part-of-speech tags from Newswire text (10 million word sequences).
• Promising results; code available.
Part III
• Hidden Markov models represent the entire history of a sequence using a single state variable s_t
• This seems restrictive...
• It seems more natural to allow many hidden state variables, a “distributed representation” of state.
• …the Factorial Hidden Markov Model
Factorial HMMs
• Factorial HMMs (Ghahramani and Jordan, 1997)
• A kind of dynamic Bayesian network.
• Inference using variational methods or sampling.
• Have been used in a variety of applications (e.g. condition monitoring, biological sequences, speech recognition).
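A minimal generative sketch of a factorial HMM (the chain count, state count, transition matrix, and additive Gaussian output model below are illustrative assumptions): several chains evolve independently, and every observation depends on all of them.

```python
import numpy as np

rng = np.random.default_rng(6)

M, K, T, Dy = 3, 2, 50, 4                    # chains, states, steps, obs dim
Pi = np.array([[0.9, 0.1], [0.1, 0.9]])      # sticky 2-state transitions
W = rng.normal(size=(M, K, Dy))              # each chain's output contribution

s = np.zeros((T, M), dtype=int)
y = np.zeros((T, Dy))
s[0] = rng.integers(0, K, size=M)
for t in range(T):
    if t > 0:
        for m in range(M):                   # chains evolve independently
            s[t, m] = rng.choice(K, p=Pi[s[t - 1, m]])
    mean = sum(W[m, s[t, m]] for m in range(M))
    y[t] = mean + 0.1 * rng.normal(size=Dy)  # output couples all chains
```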
From factorial HMMs to infinite factorial HMMs?
• A non-parametric version where the number of chains is unbounded?
• In the infinite factorial HMM (ifHMM), each chain is binary (van Gael, Teh, and Ghahramani, 2008).
• Based on the Markov extension of the Indian Buffet Process (IBP).
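A finite-K sketch of the prior over binary chains (the Beta hyperparameters below are illustrative assumptions; letting K → ∞ gives the Markov IBP construction):

```python
import numpy as np

rng = np.random.default_rng(7)

K, T, alpha = 20, 100, 3.0
a = rng.beta(alpha / K, 1.0, size=K)   # P(chain k switches on: 0 -> 1)
b = rng.beta(1.0, 1.0, size=K)         # P(chain k stays on:    1 -> 1)

S = np.zeros((T, K), dtype=int)        # S[t, k]: state of chain k at time t
for t in range(1, T):
    p_on = np.where(S[t - 1] == 1, b, a)
    S[t] = rng.random(K) < p_on        # most chains stay off for small a_k
```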
Bars-in-time data
ifHMM Toy Experiment: Bars-in-time
[Figure: the observed data, the ground-truth chains, and the chains inferred by the ifHMM for the bars-in-time problem.]
ifHMM Experiment: Bars-in-time
Separating speech audio of multiple speakers in time
The Big Picture
Summary
• Bayesian methods provide a flexible framework for modelling.
• State space models can be learned using variational Bayesian methods.
• iHMMs provide a non-parametric sequence model where the number of states is not bounded a priori.
• Beam sampling provides an efficient exact dynamic programming-based MCMC method for iHMMs.
• ifHMMs extend iHMMs to multiple state variables in parallel.
• Future directions: new models, fast algorithms, and compelling applications.
Appendix
Binary Matrix Representation of Clustering
Binary Latent Feature Matrices
Indian Buffet Process
(Griffiths and Ghahramani, 2005)
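A short sketch of the IBP “restaurant” generative process: customer 1 takes Poisson(α) dishes; customer n takes each existing dish k with probability m_k/n and then Poisson(α/n) new dishes.

```python
import numpy as np

rng = np.random.default_rng(8)

def sample_ibp(N, alpha):
    """Returns a binary N x K feature matrix Z drawn from the IBP."""
    counts = []                              # m_k: customers per dish
    rows = []
    for n in range(1, N + 1):
        row = [rng.random() < m / n for m in counts]   # existing dishes
        for k, taken in enumerate(row):
            counts[k] += taken
        new = rng.poisson(alpha / n)                   # brand-new dishes
        counts.extend([1] * new)
        rows.append(row + [True] * new)
    Z = np.zeros((N, len(counts)), dtype=int)
    for n, row in enumerate(rows):
        Z[n, :len(row)] = row
    return Z

print(sample_ibp(10, alpha=2.0))
```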
HMM vs iHMM
Infinite Hidden Markov Models
• Basic idea: a two-level urn scheme to share information between states (Beal, Ghahramani, and Rasmussen, 2002)
• Derived from Hierarchical Dirichlet Processes (Teh, Jordan, Beal & Blei, 2006)
iHMMs and HDPs
Inference and Learning
Variational Bayesian Learning . . .
Maximization of this lower bound, Fm, can be done via EM-like iterative updates:
q_x^{(t+1)}(x) ∝ exp( ∫ ln p(x, y|θ, m) q_θ^{(t)}(θ) dθ )   (E-like step)

q_θ^{(t+1)}(θ) ∝ p(θ|m) exp( ∫ ln p(x, y|θ, m) q_x^{(t+1)}(x) dx )   (M-like step)
Maximizing F_m is equivalent to minimizing the KL-divergence between the approximate posterior, q_θ(θ) q_x(x), and the true posterior, p(θ, x|y, m):

ln p(y|m) − F_m(q_x(x), q_θ(θ), y) = ∫ q_x(x) q_θ(θ) ln [q_x(x) q_θ(θ) / p(θ, x|y, m)] dx dθ = KL(q‖p)
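A small numerical check of this identity on a made-up discrete toy model (x and θ each take a few values, with y fixed), confirming that ln p(y|m) − F_m equals KL(q‖p):

```python
import numpy as np

rng = np.random.default_rng(9)

X, TH = 3, 4                           # sizes of discrete x and theta
joint = rng.random((X, TH))            # unnormalised p(x, theta, y) at the observed y
log_p_y = np.log(joint.sum())          # ln p(y|m)
post = joint / joint.sum()             # true posterior p(x, theta | y, m)

qx = rng.dirichlet(np.ones(X))         # an arbitrary factorised q
qt = rng.dirichlet(np.ones(TH))
q = np.outer(qx, qt)

F = np.sum(q * np.log(joint / q))      # the lower bound F_m
kl = np.sum(q * np.log(q / post))      # KL(q || p)
print(log_p_y - F, kl)                 # identical up to rounding
```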
Hidden Markov Models
Experiment I - HMM data
Beam Sampling Properties