Bayesian Time Series Models - University of Oxford


Transcript of Bayesian Time Series Models - University of Oxford

Page 1: Bayesian Time Series Models - University of Oxford

Bayesian Time Series Models

Zoubin Ghahramani, Department of Engineering

University of Cambridge

joint work with Matt Beal, Jurgen van Gael,

Yee Whye Teh, Yunus Saatci

Wednesday, 26 May 2010

Page 2: Bayesian Time Series Models - University of Oxford

Bayesian Learning

Apply the basic rules of probability to learning from data.

Data set: $D = \{x_1, \ldots, x_n\}$. Models: $m$, $m'$, etc. Model parameters: $\theta$.

Prior probability of models: $P(m)$, $P(m')$, etc.
Prior probabilities of model parameters: $P(\theta|m)$
Model of data given parameters (likelihood model): $P(x|\theta, m)$

If the data are independently and identically distributed then:

$$P(D|\theta, m) = \prod_{i=1}^{n} P(x_i|\theta, m)$$

Posterior probability of model parameters:

$$P(\theta|D, m) = \frac{P(D|\theta, m)\, P(\theta|m)}{P(D|m)}$$

Posterior probability of models:

$$P(m|D) = \frac{P(m)\, P(D|m)}{P(D)}$$
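To make these formulas concrete, here is a minimal sketch (Python/NumPy, not from the slides) that evaluates the parameter posterior on a grid for a made-up Bernoulli (coin-flip) likelihood model; the data, grid, and prior are illustrative assumptions.

```python
import numpy as np

# Hypothetical i.i.d. coin-flip data D = {x_1, ..., x_n}, x_i in {0, 1}
D = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Grid over the model parameter theta = P(x = 1)
theta = np.linspace(0.001, 0.999, 999)

prior = np.ones_like(theta)          # P(theta | m): uniform prior over the grid
prior /= prior.sum()

# Likelihood P(D | theta, m) = prod_i P(x_i | theta, m)   (i.i.d. assumption)
likelihood = theta ** D.sum() * (1 - theta) ** (len(D) - D.sum())

# Marginal likelihood P(D | m) = sum_theta P(D | theta, m) P(theta | m)
evidence = np.sum(likelihood * prior)

# Posterior P(theta | D, m) = P(D | theta, m) P(theta | m) / P(D | m)
posterior = likelihood * prior / evidence

print("P(D | m) =", evidence)
print("posterior mean of theta:", np.sum(theta * posterior))
```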

Page 3: Bayesian Time Series Models - University of Oxford

Bayesian Occam’s Razor and Model Comparison

Compare model classes, e.g. $m$ and $m'$, using posterior probabilities given $D$:

$$p(m|D) = \frac{p(D|m)\, p(m)}{p(D)}, \qquad p(D|m) = \int p(D|\theta, m)\, p(\theta|m)\, d\theta$$

Interpretations of the Marginal Likelihood ("model evidence"):

• The probability that randomly selected parameters from the prior would generate D.

• Probability of the data under the model, averaging over all possible parameter values.

• $\log_2 \frac{1}{p(D|m)}$ is the number of bits of surprise at observing data D under model m.

Model classes that are too simple are unlikely to generate the data set.

Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

[Figure: the marginal likelihood P(D|m) plotted over all possible data sets of size n, for a model that is too simple, one that is too complex, and one that is "just right".]
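The first interpretation suggests a simple, if inefficient, estimator of the evidence: draw parameters from the prior and average the likelihood of D. A hedged sketch for the same made-up coin-flip setup, comparing a rigid model against a more flexible one:

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.array([1, 0, 1, 1, 0, 1, 1, 1])       # hypothetical data

def bits_of_surprise(evidence):
    # log2(1 / p(D|m)): bits of surprise at observing D under model m
    return -np.log2(evidence)

# Simple model class: theta fixed at 0.5 (no free parameters)
evidence_simple = 0.5 ** len(D)

# More flexible model class: theta ~ Uniform(0, 1), averaged over the prior
S = 100_000
theta = rng.uniform(0, 1, size=S)
lik = theta ** D.sum() * (1 - theta) ** (len(D) - D.sum())
evidence_flexible = lik.mean()                # Monte Carlo estimate of p(D|m)

print("simple:   p(D|m) =", evidence_simple,
      " surprise =", bits_of_surprise(evidence_simple), "bits")
print("flexible: p(D|m) =", evidence_flexible,
      " surprise =", bits_of_surprise(evidence_flexible), "bits")
```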


Page 4: Bayesian Time Series Models - University of Oxford

Bayesian Model Comparison: Occam’s Razor at Work

[Figure: polynomial fits of order M = 0 to M = 7 to the same data set (x from 0 to 10, y from -20 to 40), together with a bar chart of the model evidence P(Y|M) for M = 0, ..., 7.]

For example, for quadratic polynomials (m = 2): $y = a_0 + a_1 x + a_2 x^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and parameters $\theta = (a_0\ a_1\ a_2\ \sigma)$.

demo: polybayes
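A hedged sketch of the kind of computation behind this demo (the actual polybayes code is not shown in the slides): for a polynomial model that is linear in its coefficients, with a Gaussian prior on the coefficients and the noise level treated as fixed (a simplification; the slide includes σ in θ), the evidence p(y|m) is a Gaussian integral that can be evaluated in closed form for each order M. The data below are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Hypothetical data from a noisy quadratic, standing in for the demo's data
x = np.linspace(0, 10, 15)
y = 2.0 + 1.5 * x - 0.3 * x**2 + rng.normal(0, 2.0, size=x.shape)

sigma, alpha = 2.0, 10.0   # assumed fixed noise std and prior std on coefficients

def log_evidence(M):
    """log p(y | m) for order M, with a_j ~ N(0, alpha^2) and eps ~ N(0, sigma^2)."""
    Phi = np.vander(x, M + 1, increasing=True)     # design matrix [1, x, ..., x^M]
    cov = alpha**2 * Phi @ Phi.T + sigma**2 * np.eye(len(x))
    return multivariate_normal(mean=np.zeros(len(x)), cov=cov).logpdf(y)

for M in range(8):
    print(f"M = {M}: log p(y|m) = {log_evidence(M):8.2f}")
```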


Page 5: Bayesian Time Series Models - University of Oxford

Learning Model Structure

How many clusters in the data?

What is the intrinsic dimensionality of the data?

Is this input relevant to predicting that output?

What is the order of a dynamical system?

How many states in a hidden Markov model?

How many auditory sources in the input?

Which graph structure best models the data?

[Figure: example graph with nodes A, B, C, D, E.]

demo: run simple


Page 6: Bayesian Time Series Models - University of Oxford

Time Series Models

• Broadly two classes of time series models:
– fully observed models (e.g. AR, n-gram)
– hidden state models (e.g. HMM, state-space models (SSMs))

• How do we learn the dimensionality of a linear-Gaussian SSM?
– Variational Bayesian learning of SSMs

• Hidden Markov models (HMMs) are widely used, but how do we choose the number of hidden states?
– A non-parametric Bayesian approach: infinite HMMs.

• A single discrete state variable is a poor representation of the history. Can we do better?
– Factorial HMMs

• Can we make Factorial HMMs non-parametric?
– infinite factorial HMMs and the Markov Indian Buffet Process


Page 7: Bayesian Time Series Models - University of Oxford

Part I:

Variational Bayesian learning of

State-Space Models


Page 8: Bayesian Time Series Models - University of Oxford

Linear-Gaussian State-space models (SSMs)

[Figure: graphical model of a linear-Gaussian SSM: inputs u_1, u_2, ..., u_T, hidden states x_1, x_2, ..., x_T, outputs y_1, y_2, ..., y_T, with parameter matrices A, B, C, D.]

$$P(x_{1:T}, y_{1:T} | u_{1:T}) = P(x_1|u_1)\, P(y_1|x_1, u_1) \prod_{t=2}^{T} P(x_t|x_{t-1}, u_t)\, P(y_t|x_t, u_t)$$

Output equation: $y_t = C x_t + D u_t + v_t$

State dynamics equation: $x_t = A x_{t-1} + B u_t + w_t$

where $x_t$, $u_t$ and $y_t$ are real-valued vectors and $v$ and $w$ are uncorrelated zero-mean Gaussian noise vectors.

These models are a.k.a. stochastic linear dynamical systems or Kalman filter models. Continuous-state version of HMMs.

Forward–backward algorithm ≡ Kalman smoothing
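A minimal sketch (not from the talk) that simulates this generative model and runs the forward half of Kalman smoothing, i.e. the Kalman filter; the matrices, dimensions, and noise levels are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D latent, 1-D input, 1-D output linear-Gaussian SSM
A = np.array([[0.99, 0.1], [0.0, 0.95]])   # state dynamics matrix
B = np.array([[0.0], [0.5]])               # input-to-state matrix
C = np.array([[1.0, 0.0]])                 # state-to-output matrix
D = np.array([[0.0]])                      # input-to-output matrix
Q = 0.01 * np.eye(2)                       # covariance of state noise w_t
R = 0.1 * np.eye(1)                        # covariance of output noise v_t

T = 100
u = np.sin(0.1 * np.arange(T)).reshape(T, 1)

# Simulate:  x_t = A x_{t-1} + B u_t + w_t,   y_t = C x_t + D u_t + v_t
x = np.zeros((T, 2)); y = np.zeros((T, 1))
for t in range(T):
    prev = x[t - 1] if t > 0 else np.zeros(2)
    x[t] = A @ prev + B @ u[t] + rng.multivariate_normal(np.zeros(2), Q)
    y[t] = C @ x[t] + D @ u[t] + rng.multivariate_normal(np.zeros(1), R)

# Kalman filter: p(x_t | y_1..t, u_1..t); the backward pass would give smoothing
mu, P = np.zeros(2), np.eye(2)
filtered = []
for t in range(T):
    # predict step
    mu_p = A @ mu + B @ u[t]
    P_p = A @ P @ A.T + Q
    # update step
    S = C @ P_p @ C.T + R
    K = P_p @ C.T @ np.linalg.inv(S)
    mu = mu_p + K @ (y[t] - C @ mu_p - D @ u[t])
    P = (np.eye(2) - K @ C) @ P_p
    filtered.append(mu.copy())

print("final filtered state estimate:", filtered[-1])
```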


Page 9: Bayesian Time Series Models - University of Oxford

Lower Bounding the Marginal Likelihood: Variational Bayesian Learning

Let the latent variables be x, data y and the parameters θ.

We can lower bound the marginal likelihood (using Jensen’s inequality):

$$\ln p(y|m) = \ln \int p(y, x, \theta|m)\, dx\, d\theta = \ln \int q(x, \theta)\, \frac{p(y, x, \theta|m)}{q(x, \theta)}\, dx\, d\theta \geq \int q(x, \theta) \ln \frac{p(y, x, \theta|m)}{q(x, \theta)}\, dx\, d\theta.$$

Use a simpler, factorised approximation $q(x, \theta) \approx q_x(x)\, q_\theta(\theta)$:

$$\ln p(y|m) \geq \int q_x(x)\, q_\theta(\theta) \ln \frac{p(y, x, \theta|m)}{q_x(x)\, q_\theta(\theta)}\, dx\, d\theta = \mathcal{F}_m(q_x(x), q_\theta(\theta), y).$$

(some significant contributors to this framework from 1993-1998: C. M. Bishop, G. E. Hinton, T. S. Jaakkola, M. I. Jordan, D. J. C. MacKay, R. M. Neal, L. K. Saul)

Page 10: Bayesian Time Series Models - University of Oxford


Inferring the Number of Hidden States

[Figure: lower bound on the marginal likelihood (in nats) plotted against the dimensionality of the state space, k = 1, ..., 20.]

Variation of $\mathcal{F}$, the lower bound on the marginal likelihood, with hidden state dimension k for 10 random initialisations of VBEM. We can use this lower bound to infer/select the number of hidden states.


Page 11: Bayesian Time Series Models - University of Oxford


Inferring Regulatory Networks

[Figure: inferred gene regulatory network. Nodes are genes (e.g. CASP8, GATA3, MAPK9, CDC2, PCNA, RB1, AKT1, ...); edges are inferred gene-gene interactions.]

• Gene-gene interactions present in ≥ 80% of the VB state-space models out of 10 random seeds and k = 14, at a confidence level of 99.8%.

• The number inside each node is the gene identity.

• Numbers on the edges represent the number of models from 10 different random seeds in which the interaction is supported at this confidence level.

• Dotted lines are negative interactions, continuous lines represent positive interactions.


Page 12: Bayesian Time Series Models - University of Oxford

Summary of Part I

• Bayesian machine learning
• Marginal likelihoods and Occam's Razor
• Variational Bayesian lower bounds
• Application to learning the number of hidden states in linear-Gaussian state-space models


Page 13: Bayesian Time Series Models - University of Oxford

Part II

The Infinite Hidden Markov Model


Page 14: Bayesian Time Series Models - University of Oxford

Hidden Markov Models


Page 15: Bayesian Time Series Models - University of Oxford

Choosing the number of hidden states

• How do we choose K, the number of hidden states, in an HMM?

• Can we define a model with an unbounded number of hidden states, and a suitable inference algorithm?


Page 16: Bayesian Time Series Models - University of Oxford

Infinite Hidden Markov models


Page 17: Bayesian Time Series Models - University of Oxford

Alice in Wonderland


Page 18: Bayesian Time Series Models - University of Oxford

Hierarchical Urn Scheme for generating transitions in the iHMM (2002)

• n_ij is the number of previous transitions from state i to state j
• α, β, and γ are hyperparameters
• probability of a transition from i to j is proportional to n_ij
• with probability proportional to βγ, jump to a new state
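A hedged sketch of a single transition draw under the urn scheme as summarised on this slide; the full 2002 scheme uses a second, shared "oracle" urn (governed by α), which is collapsed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_transition(counts, i, beta, gamma):
    """Sample the next state from state i under the (simplified) urn scheme:
    move to existing state j with probability proportional to counts[i, j] = n_ij,
    or jump to a brand-new state with probability proportional to beta * gamma.
    (The shared oracle urn of the full scheme is omitted for illustration.)"""
    K = counts.shape[0]
    weights = np.append(counts[i].astype(float), beta * gamma)
    j = rng.choice(K + 1, p=weights / weights.sum())
    if j == K:                                   # a new state was created
        counts = np.pad(counts, ((0, 1), (0, 1)))  # grow the count matrix
    counts[i, j] += 1                            # record the transition i -> j
    return j, counts

# Generate a short state sequence starting from state 0
counts = np.zeros((1, 1), dtype=int)
state, states = 0, [0]
for _ in range(20):
    state, counts = sample_transition(counts, state, beta=1.0, gamma=2.0)
    states.append(state)
print(states)
print("number of states instantiated:", counts.shape[0])
```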


Page 19: Bayesian Time Series Models - University of Oxford

Relating iHMMs to DPMs

• The infinite Hidden Markov Model is closely related to Dirichlet Process Mixture (DPM) models

• This makes sense:
– HMMs are time series generalisations of mixture models.
– DPMs are a way of defining mixture models with countably infinitely many components.
– iHMMs are HMMs with countably infinitely many states.


Page 20: Bayesian Time Series Models - University of Oxford

HMMs as sequential mixtures


Page 21: Bayesian Time Series Models - University of Oxford

Infinite Hidden Markov Models


Page 22: Bayesian Time Series Models - University of Oxford

Infinite Hidden Markov Models


Page 23: Bayesian Time Series Models - University of Oxford

Infinite Hidden Markov Models

Teh, Jordan, Beal and Blei (2005) derived iHMMs in terms of Hierarchical Dirichlet Processes.


Page 24: Bayesian Time Series Models - University of Oxford

Efficient inference in iHMMs?


Page 25: Bayesian Time Series Models - University of Oxford

Inference and Learning in HMMs and iHMMs

• HMM inference of hidden states, p(s_t | y_1 ... y_T, θ):
– forward-backward = dynamic programming = belief propagation

• HMM parameter learning:
– Baum-Welch = expectation maximization (EM), or Gibbs sampling (Bayesian)

• iHMM inference and learning, p(s_t, θ | y_1 ... y_T):
– Gibbs sampling

• This is unfortunate: Gibbs can be very slow for time series!

• Can we use dynamic programming? (see the sketch below)
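For a finite HMM the dynamic programming in question is forward filtering followed by backward sampling, which draws the whole state sequence jointly from its posterior; beam sampling reuses this machinery inside the finite "beam" defined by the auxiliary variables. A minimal sketch for a finite-K HMM with fixed parameters (emission model and numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def ffbs(pi0, trans, lik):
    """Forward filtering, backward sampling for a finite HMM.
    pi0:   (K,)    initial state distribution
    trans: (K, K)  trans[i, j] = p(s_t = j | s_{t-1} = i)
    lik:   (T, K)  lik[t, k] = p(y_t | s_t = k) for fixed emission parameters
    Returns one sample from p(s_1..T | y_1..T, theta)."""
    T, K = lik.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi0 * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                     # forward pass (dynamic programming)
        alpha[t] = (alpha[t - 1] @ trans) * lik[t]
        alpha[t] /= alpha[t].sum()
    s = np.zeros(T, dtype=int)
    s[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):            # backward sampling pass
        w = alpha[t] * trans[:, s[t + 1]]
        s[t] = rng.choice(K, p=w / w.sum())
    return s

# Tiny illustrative example: 2 states, Gaussian emissions with means 0 and 3
y = np.concatenate([rng.normal(0, 1, 30), rng.normal(3, 1, 30)])
means = np.array([0.0, 3.0])
lik = np.exp(-0.5 * (y[:, None] - means[None, :]) ** 2)
pi0 = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
print(ffbs(pi0, trans, lik))
```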

Page 26: Bayesian Time Series Models - University of Oxford

Dynamic Programming in HMMs: Forward-Backtrack Sampling


Page 27: Bayesian Time Series Models - University of Oxford

Beam Sampling


Page 28: Bayesian Time Series Models - University of Oxford

Beam Sampling


Page 29: Bayesian Time Series Models - University of Oxford

Auxiliary variables

Note: adding the auxiliary variables u does not change the distribution over the other variables.
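A toy illustration (numbers made up) of how the auxiliary "slice" variable of beam sampling truncates the infinite set of candidate transitions to a finite beam:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative transition probabilities out of the current state, for the first few of
# the (conceptually infinite) next states; the remaining mass tails off towards zero.
pi_from_current = np.array([0.4, 0.25, 0.15, 0.1, 0.05, 0.03, 0.02])
next_state = 2                          # the currently instantiated assignment s_t

# Beam-sampling auxiliary variable: u_t ~ Uniform(0, pi(s_{t-1}, s_t))
u_t = rng.uniform(0, pi_from_current[next_state])

# Conditioned on u_t, only transitions with pi > u_t are possible, so the infinite
# transition matrix is truncated to a finite "beam" and forward-backward applies.
beam = np.flatnonzero(pi_from_current > u_t)
print("u_t =", u_t, " states inside the beam:", beam)
```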


Page 30: Bayesian Time Series Models - University of Oxford

Beam Sampling


Page 31: Bayesian Time Series Models - University of Oxford

Experiment: Text Prediction


Page 32: Bayesian Time Series Models - University of Oxford

Experiment: Changepoint Detection


Page 33: Bayesian Time Series Models - University of Oxford

Experiment: Changepoint Detection


Page 34: Bayesian Time Series Models - University of Oxford

Parallel and Distributed Implementations of iHMMs

• Recent work on parallel (.NET) and distributed (Hadoop) implementations of beam-sampling for iHMMs (Bratieres, Van Gael, Vlachos and Ghahramani, 2010).

• Applied to unsupervised learning of part-of-speech tags from Newswire text (10 million word sequences).

• Promising results; code available.


Page 35: Bayesian Time Series Models - University of Oxford

Part III

• Hidden Markov models represent the entire history of a sequence using a single state variable st

• This seems restrictive...

• It seems more natural to allow many hidden state variables, a “distributed representation” of state.

• …the Factorial Hidden Markov Model

Page 36: Bayesian Time Series Models - University of Oxford

Factorial HMMs

• Factorial HMMs (Ghahramani and Jordan, 1997)
• A kind of dynamic Bayesian network.
• Inference using variational methods or sampling.
• Have been used in a variety of applications (e.g. condition monitoring, biological sequences, speech recognition).


Page 37: Bayesian Time Series Models - University of Oxford

From factorial HMMs to infinite factorial HMMs?

• A non-parametric version where the number of chains is unbounded?
• In the infinite factorial HMM (ifHMM), each chain is binary (van Gael, Teh, and Ghahramani, 2008).

• Based on the Markov extension of the Indian Buffet Process (IBP).


Page 38: Bayesian Time Series Models - University of Oxford

Bars-in-time data


Page 39: Bayesian Time Series Models - University of Oxford

Bars-in-time data


Page 40: Bayesian Time Series Models - University of Oxford

ifHMM Toy Experiment: Bars-in-time

[Figure panels: Data, Ground truth, Inferred.]


Page 41: Bayesian Time Series Models - University of Oxford

ifHMM Experiment: Bars-in-time


Page 42: Bayesian Time Series Models - University of Oxford

separating speech audio of multiple speakers in time


Page 43: Bayesian Time Series Models - University of Oxford


Page 44: Bayesian Time Series Models - University of Oxford

The Big Picture


Page 45: Bayesian Time Series Models - University of Oxford

Summary

• Bayesian methods provide a flexible framework for modelling.

• State Space Models can be learned using variational Bayesian methods.

• iHMMs provide a non-parametric sequence model where the number of states is not bounded a priori.

• Beam sampling provides an efficient exact dynamic programming-based MCMC method for iHMMs.

• ifHMMs extend iHMMs to multiple state variables in parallel.

• Future directions: new models, fast algorithms, and compelling applications.


Page 46: Bayesian Time Series Models - University of Oxford

Appendix:


Page 47: Bayesian Time Series Models - University of Oxford

Binary Matrix Representation of Clustering


Page 48: Bayesian Time Series Models - University of Oxford

Binary Latent Feature Matrices


Page 49: Bayesian Time Series Models - University of Oxford

Indian Buffet Process

(Griffiths and Ghahramani, 2005)

Page 50: Bayesian Time Series Models - University of Oxford

Indian Buffet Process


Page 51: Bayesian Time Series Models - University of Oxford

Indian Buffet Process


Page 52: Bayesian Time Series Models - University of Oxford

HMM vs iHMM


Page 53: Bayesian Time Series Models - University of Oxford

HMM vs iHMM


Page 54: Bayesian Time Series Models - University of Oxford

Infinite Hidden Markov Models

• Basic idea: a two-level urn scheme to share information between states (Beal, Ghahramani, and Rasmussen, 2002).

• Derived from Hierarchical Dirichlet Processes (Teh, Jordan, Beal & Blei, 2006)


Page 55: Bayesian Time Series Models - University of Oxford

iHMMs and HDPs


Page 56: Bayesian Time Series Models - University of Oxford

Inference and Learning


Page 57: Bayesian Time Series Models - University of Oxford

Variational Bayesian Learning . . .

Maximization of this lower bound, $\mathcal{F}_m$, can be done via EM-like iterative updates:

E-like step: $\quad q_x^{(t+1)}(x) \propto \exp\left[ \int \ln p(x, y|\theta, m)\, q_\theta^{(t)}(\theta)\, d\theta \right]$

M-like step: $\quad q_\theta^{(t+1)}(\theta) \propto p(\theta|m) \exp\left[ \int \ln p(x, y|\theta, m)\, q_x^{(t+1)}(x)\, dx \right]$

Maximizing $\mathcal{F}_m$ is equivalent to minimizing the KL-divergence between the approximate posterior, $q_\theta(\theta)\, q_x(x)$, and the true posterior, $p(\theta, x|y, m)$:

$$\ln p(y|m) - \mathcal{F}_m(q_x(x), q_\theta(\theta), y) = \int q_x(x)\, q_\theta(\theta) \ln \frac{q_x(x)\, q_\theta(\theta)}{p(\theta, x|y, m)}\, dx\, d\theta = \mathrm{KL}(q \,\|\, p)$$
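These updates are model-agnostic; as a hedged illustration (not the SSM case from the talk), here they are written out for a small conjugate toy model in which both factors remain Gaussian and each update is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model: theta ~ N(0, 1), x_i | theta ~ N(theta, 1), y_i | x_i ~ N(x_i, 1)
n = 50
x_true = rng.normal(2.0, 1.0, n)
y = rng.normal(x_true, 1.0)

# Factorised approximation q_x(x) q_theta(theta), both Gaussian; arbitrary initialisation
m_theta, v_theta = 0.0, 1.0
m_x = np.zeros(n)

for it in range(50):
    # E-like step: q_x(x_i) ∝ exp( E_{q_theta}[ ln p(x_i, y_i | theta) ] )
    #   -> Gaussian with precision 2, mean (m_theta + y_i) / 2
    m_x = (m_theta + y) / 2.0
    v_x = 0.5
    # M-like step: q_theta(theta) ∝ p(theta) exp( E_{q_x}[ ln p(x | theta) ] )
    #   -> Gaussian with precision 1 + n, mean sum(m_x) / (1 + n)
    v_theta = 1.0 / (1.0 + n)
    m_theta = m_x.sum() / (1.0 + n)

# In this toy case the q mean matches the exact posterior mean of theta, while the
# factorised q underestimates the posterior variance (a typical feature of VB).
print("q(theta) mean:", m_theta, " std:", np.sqrt(v_theta))
```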


Page 58: Bayesian Time Series Models - University of Oxford

Hidden Markov Models


Page 59: Bayesian Time Series Models - University of Oxford

Experiment I - HMM data


Page 60: Bayesian Time Series Models - University of Oxford

Beam Sampling Properties
