Bayesian Time Series Models - University of Oxford


Bayesian Time Series Models

Zoubin Ghahramani
Department of Engineering, University of Cambridge

joint work with Matt Beal, Jurgen van Gael, Yee Whye Teh, and Yunus Saatci

Wednesday, 26 May 2010

Bayesian Learning

Apply the basic rules of probability to learning from data.

Data set: D = {x_1, ..., x_n}. Models: m, m′, etc. Model parameters: θ.

Prior probability of models: P(m), P(m′), etc. Prior probabilities of model parameters: P(θ|m). Model of data given parameters (likelihood model): P(x|θ, m).

If the data are independently and identically distributed then:

$$P(D \mid \theta, m) = \prod_{i=1}^{n} P(x_i \mid \theta, m)$$

Posterior probability of model parameters:

$$P(\theta \mid D, m) = \frac{P(D \mid \theta, m)\, P(\theta \mid m)}{P(D \mid m)}$$

Posterior probability of models:

$$P(m \mid D) = \frac{P(m)\, P(D \mid m)}{P(D)}$$
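As a concrete illustration of these three equations, here is a minimal grid-approximation sketch (a hypothetical example, not from the talk) for a Bernoulli model of coin flips, computing the iid likelihood, the evidence P(D|m), and the posterior P(θ|D, m):

```python
import numpy as np

# Grid-based illustration of Bayes' rule for a Bernoulli model of coin
# flips: theta is the heads probability, with a uniform prior on a grid.
def posterior_on_grid(data, grid_size=1001):
    theta = np.linspace(0.0, 1.0, grid_size)
    prior = np.ones(grid_size) / grid_size            # P(theta|m), uniform
    heads = sum(data)
    tails = len(data) - heads
    likelihood = theta**heads * (1.0 - theta)**tails  # P(D|theta,m), iid product
    evidence = np.sum(likelihood * prior)             # P(D|m), summed over the grid
    posterior = likelihood * prior / evidence         # P(theta|D,m), Bayes' rule
    return theta, posterior, evidence

theta, post, ev = posterior_on_grid([1, 1, 0, 1, 0, 1, 1])
post_mean = np.sum(theta * post)  # approx. (5+1)/(7+2) = 2/3 under the uniform prior
```

The same grid sum that normalises the posterior is the marginal likelihood used for model comparison below.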


Bayesian Occam’s Razor and Model Comparison

Compare model classes, e.g. m and m′, using posterior probabilities given D:

$$p(m \mid D) = \frac{p(D \mid m)\, p(m)}{p(D)}, \qquad p(D \mid m) = \int p(D \mid \theta, m)\, p(\theta \mid m)\, d\theta$$

Interpretations of the Marginal Likelihood (“model evidence”):

• The probability that randomly selected parameters from the prior would generate D.

• Probability of the data under the model, averaging over all possible parameter values.

• $\log_2 \frac{1}{p(D \mid m)}$ is the number of bits of surprise at observing data D under model m.

Model classes that are too simple are unlikely to generate the data set.

Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

[Figure: marginal likelihood P(D|m) plotted over all possible data sets of size n, for a "too simple", a "too complex", and a "just right" model class.]


Bayesian Model Comparison: Occam’s Razor at Work

[Figure: fits of polynomials of order M = 0, ..., 7 to the same data, together with the model evidence P(Y|M) for each order.]

For example, for quadratic polynomials (m = 2): $y = a_0 + a_1 x + a_2 x^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and the parameters are $\theta = (a_0, a_1, a_2, \sigma)$.

demo: polybayes
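The polybayes demo itself is not reproduced here, but the evidence comparison it illustrates can be sketched in a few lines of numpy: with a zero-mean Gaussian prior on the polynomial coefficients, the marginal likelihood is available in closed form, since under the model y ~ N(0, σ²I + α⁻¹ΦΦᵀ). The data, prior variance, and noise variance below are illustrative assumptions:

```python
import numpy as np

def log_evidence(x, y, order, noise_var=1.0, prior_var=100.0):
    """Log marginal likelihood p(y|M) of a degree-`order` polynomial model with
    the coefficients integrated out under a zero-mean Gaussian prior: under the
    model, y ~ N(0, noise_var*I + prior_var*Phi Phi^T)."""
    Phi = np.vander(x, order + 1, increasing=True)      # columns 1, x, x^2, ...
    C = noise_var * np.eye(len(y)) + prior_var * (Phi @ Phi.T)
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x - 0.5 * x**2 + rng.normal(0, 1.0, size=x.shape)  # quadratic truth
scores = [log_evidence(x, y, M) for M in range(8)]
best = int(np.argmax(scores))
```

With this quadratic data the evidence peaks at M = 2: lower orders cannot fit the curvature, and each extra order pays a log-determinant (Occam) penalty without fitting the data appreciably better.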


Learning Model Structure

How many clusters in the data?

What is the intrinsic dimensionality of the data?

Is this input relevant to predicting that output?

What is the order of a dynamical system?

How many states in a hidden Markov model?

How many auditory sources in the input?

Which graph structure best models the data?

demo: run simple


Time Series Models

• Broadly two classes of time series models:
  – fully observed models (e.g. AR, n-gram)
  – hidden state models (e.g. HMM, state-space models (SSMs))

• How do we learn the dimensionality of a linear-Gaussian SSM?
  – Variational Bayesian learning of SSMs

• Hidden Markov models (HMMs) are widely used, but how do we choose the number of hidden states?
  – A non-parametric Bayesian approach: infinite HMMs

• A single discrete state variable is a poor representation of the history. Can we do better?
  – Factorial HMMs

• Can we make factorial HMMs non-parametric?
  – Infinite factorial HMMs and the Markov Indian Buffet Process


Part I:

Variational Bayesian learning of

State-Space Models


Linear-Gaussian State-space models (SSMs)

[Figure: graphical model of the SSM, with inputs u_1, ..., u_T, hidden states x_1, ..., x_T, outputs y_1, ..., y_T, and parameter matrices A, B, C, D.]

$$P(x_{1:T}, y_{1:T} \mid u_{1:T}) = P(x_1 \mid u_1)\, P(y_1 \mid x_1, u_1) \prod_{t=2}^{T} P(x_t \mid x_{t-1}, u_t)\, P(y_t \mid x_t, u_t)$$

State dynamics equation: x_t = A x_{t-1} + B u_t + w_t

Output equation: y_t = C x_t + D u_t + v_t

where x_t, u_t and y_t are real-valued vectors and v and w are uncorrelated zero-mean Gaussian noise vectors.

These models are also known as stochastic linear dynamical systems or Kalman filter models; they are the continuous-state analogue of HMMs.

Forward–backward algorithm ≡ Kalman smoothing
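A minimal sketch of sampling a trajectory from the two equations above (the matrices, input sequence, and noise scales are arbitrary choices for illustration, not from the talk):

```python
import numpy as np

def simulate_ssm(A, B, C, D, u, x0, q_std=0.1, r_std=0.1, rng=None):
    """Sample a trajectory from the linear-Gaussian SSM:
    x_t = A x_{t-1} + B u_t + w_t,  y_t = C x_t + D u_t + v_t."""
    rng = rng or np.random.default_rng(0)
    T = u.shape[0]
    k, p = A.shape[0], C.shape[0]
    x = np.zeros((T, k))
    y = np.zeros((T, p))
    x_prev = x0
    for t in range(T):
        x[t] = A @ x_prev + B @ u[t] + rng.normal(0.0, q_std, size=k)  # state dynamics
        y[t] = C @ x[t] + D @ u[t] + rng.normal(0.0, r_std, size=p)    # output equation
        x_prev = x[t]
    return x, y

# A 2-d state with scalar input and output (all values made up):
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
u = np.ones((50, 1))
x, y = simulate_ssm(A, B, C, D, u, x0=np.zeros(2))
```

Inference in this model (recovering x from y) is exactly what Kalman smoothing computes.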


Variational Bayesian Learning: Lower Bounding the Marginal Likelihood

Let the latent variables be x, the data y, and the parameters θ.

We can lower bound the marginal likelihood (using Jensen's inequality):

$$\ln p(y \mid m) = \ln \int p(y, x, \theta \mid m)\, dx\, d\theta = \ln \int q(x, \theta)\, \frac{p(y, x, \theta \mid m)}{q(x, \theta)}\, dx\, d\theta \ge \int q(x, \theta) \ln \frac{p(y, x, \theta \mid m)}{q(x, \theta)}\, dx\, d\theta.$$

Use a simpler, factorised approximation, q(x, θ) ≈ q_x(x) q_θ(θ):

$$\ln p(y \mid m) \ge \int q_x(x)\, q_\theta(\theta) \ln \frac{p(y, x, \theta \mid m)}{q_x(x)\, q_\theta(\theta)}\, dx\, d\theta = \mathcal{F}_m(q_x(x), q_\theta(\theta), y).$$
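The Jensen bound above can be checked numerically on a toy discrete model; the joint probabilities below are made up for illustration. The bound holds for any q and is tight exactly when q is the true posterior:

```python
import numpy as np

# Numerical check of the Jensen bound for a discrete latent variable h:
# ln p(y) = ln sum_h p(y,h) >= sum_h q(h) ln[p(y,h)/q(h)] for any q(h).
p_joint = np.array([0.10, 0.25, 0.05])  # made-up values of p(y,h) at the observed y
log_p_y = np.log(p_joint.sum())         # exact log evidence

def elbo(q):
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * (np.log(p_joint) - np.log(q))))

q_uniform = np.ones(3) / 3
q_exact = p_joint / p_joint.sum()       # the true posterior p(h|y)
gap = log_p_y - elbo(q_uniform)         # > 0: the bound is loose for the wrong q
```

The gap is exactly KL(q‖p), which is why maximising the lower bound drives q towards the true posterior.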

(Some significant contributors to this framework from 1993-1998: C. M. Bishop, G. E. Hinton, T. S. Jaakkola, M. I. Jordan, D. J. C. MacKay, R. M. Neal, L. K. Saul.)

Inferring the Number of Hidden States

[Figure: variation of F, the lower bound on the marginal likelihood (in nats), with hidden state dimension k = 1, ..., 20, for 10 random initialisations of VBEM. We can use this lower bound to infer/select the number of hidden states.]


Inferring Regulatory Networks

[Figure: inferred gene regulatory network; nodes are genes (CASP8, GATA3, TP53I3, AKT1, and others) linked by the interactions described below.]

• Gene-gene interactions present in ≥ 80% of the VB state-space models out of 10 random seeds, with k = 14, at a confidence level of 99.8%.

• The number inside each node is the gene identity.

• Numbers on the edges represent the number of models (from 10 different random seeds) in which the interaction is supported at this confidence level.

• Dotted lines represent negative interactions; continuous lines represent positive interactions.


Summary of Part I

• Bayesian machine learning
• Marginal likelihoods and Occam's Razor
• Variational Bayesian lower bounds
• Application to learning the number of hidden states in linear-Gaussian state-space models


Part II

The Infinite Hidden Markov Model


Hidden Markov Models


Choosing the number of hidden states

• How do we choose K, the number of hidden states, in an HMM?

• Can we define a model with an unbounded number of hidden states, and a suitable inference algorithm?


Infinite Hidden Markov models


Alice in Wonderland


Hierarchical Urn Scheme for generating transitions in the iHMM (2002)

• n_ij is the number of previous transitions from state i to state j
• α, β, and γ are hyperparameters
• The probability of a transition from i to j is proportional to n_ij
• With probability proportional to βγ, jump to a new state
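A much-simplified generative sketch of this urn scheme (it collapses the two-level hierarchy of the full 2002 construction into a single urn, so it is illustrative only):

```python
import numpy as np
from collections import defaultdict

def sample_ihmm_states(T, beta=1.0, gamma=1.0, rng=None):
    """Simplified urn sketch of iHMM transitions: from state i, move to state j
    with probability proportional to the count n_ij, or to a brand-new state
    with probability proportional to beta*gamma.  (The actual 2002 scheme uses
    a two-level urn with an oracle; this collapses it for illustration.)"""
    rng = rng or np.random.default_rng(0)
    n = defaultdict(lambda: defaultdict(int))   # n[i][j]: transition counts
    states = [0]
    num_states = 1
    for _ in range(T - 1):
        i = states[-1]
        dests = list(n[i])
        weights = np.array([n[i][j] for j in dests] + [beta * gamma], dtype=float)
        idx = rng.choice(len(weights), p=weights / weights.sum())
        if idx == len(dests):                   # jump to a never-before-seen state
            j = num_states
            num_states += 1
        else:
            j = dests[idx]
        n[i][j] += 1
        states.append(j)
    return states, num_states

states, K = sample_ihmm_states(500, beta=1.0, gamma=2.0)
```

The rich-get-richer counts n_ij mean popular transitions are reused, while the βγ term keeps the number of visited states growing slowly with sequence length.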


Relating iHMMs to DPMs

• The infinite hidden Markov model is closely related to Dirichlet process mixture (DPM) models.

• This makes sense:
  – HMMs are time series generalisations of mixture models.
  – DPMs are a way of defining mixture models with countably infinitely many components.
  – iHMMs are HMMs with countably infinitely many states.


HMMs as sequential mixtures


Infinite Hidden Markov Models

Teh, Jordan, Beal and Blei (2005) derived iHMMs in terms of Hierarchical Dirichlet Processes.


Efficient inference in iHMMs?


Inference and Learning in HMMs and iHMMs

• HMM inference of hidden states, p(s_t|y_1...y_T, θ):
  – forward-backward = dynamic programming = belief propagation

• HMM parameter learning:
  – Baum-Welch = expectation maximization (EM), or Gibbs sampling (Bayesian)

• iHMM inference and learning, p(s_t, θ|y_1...y_T):
  – Gibbs sampling

• This is unfortunate: Gibbs can be very slow for time series!

• Can we use dynamic programming?
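The forward-backward recursion mentioned above, sketched for a discrete HMM with per-step normalisation (the transition and emission matrices are arbitrary examples):

```python
import numpy as np

def forward_backward(pi0, A, B, obs):
    """Posterior marginals p(s_t|y_1..y_T) for a discrete HMM via the
    forward-backward recursions, with per-step normalisation for stability."""
    T, K = len(obs), len(pi0)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi0 * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                    # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):           # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

pi0 = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition matrix, made up for the demo
B = np.array([[0.8, 0.2], [0.3, 0.7]])   # emission matrix B[state, symbol]
gamma = forward_backward(pi0, A, B, [0, 0, 1, 1, 1])
```

This O(TK²) dynamic program is exactly what becomes unavailable when K is unbounded, which is the problem beam sampling addresses.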

Dynamic Programming in HMMs: Forward-Backtrack Sampling


Beam Sampling


Auxiliary variables

Note: adding the auxiliary variables u does not change the distribution over the other variables.
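The effect of the auxiliary variable can be seen in one line: conditioned on u_t, only transitions whose probability exceeds u_t remain possible, so the infinite transition matrix is truncated to a finite set. A toy illustration (the probabilities are made up):

```python
import numpy as np

def allowed_transitions(trans_row, u):
    """Given the slice variable u_t ~ Uniform(0, pi(s_{t-1}, s_t)), only
    transitions with probability greater than u_t need be considered,
    which is a finite set even when the state space is unbounded."""
    return np.flatnonzero(trans_row > u)

row = np.array([0.50, 0.30, 0.15, 0.04, 0.01])  # made-up transition probabilities
small = allowed_transitions(row, 0.05)  # [0 1 2]: most transitions survive
large = allowed_transitions(row, 0.45)  # [0]: almost everything is pruned
```

Within the truncated set, ordinary forward-backtrack dynamic programming applies, which is what makes beam sampling efficient.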



Experiment: Text Prediction


Experiment: Changepoint Detection


Parallel and Distributed Implementations of iHMMs

• Recent work on parallel (.NET) and distributed (Hadoop) implementations of beam-sampling for iHMMs (Bratieres, Van Gael, Vlachos and Ghahramani, 2010).

• Applied to unsupervised learning of part-of-speech tags from Newswire text (10 million word sequences).

• Promising results; code available.


Part III

• Hidden Markov models represent the entire history of a sequence using a single state variable s_t.

• This seems restrictive...

• It seems more natural to allow many hidden state variables: a "distributed representation" of state.

• ...the factorial hidden Markov model

Factorial HMMs

• Factorial HMMs (Ghahramani and Jordan, 1997)
• A kind of dynamic Bayesian network
• Inference using variational methods or sampling
• Have been used in a variety of applications (e.g. condition monitoring, biological sequences, speech recognition)


From factorial HMMs to infinite factorial HMMs?

• A non-parametric version where the number of chains is unbounded?

• In the infinite factorial HMM (ifHMM), each chain is binary (van Gael, Teh, and Ghahramani, 2008).

• Based on the Markov extension of the Indian Buffet Process (IBP).


Bars-in-time data


ifHMM Toy Experiment: Bars-in-time

[Figure: three panels showing the bars-in-time data, the ground truth chains, and the inferred chains.]


ifHMM Experiment: Bars-in-time


Separating speech audio of multiple speakers in time


The Big Picture


Summary

• Bayesian methods provide a flexible framework for modelling.

• State-space models can be learned using variational Bayesian methods.

• iHMMs provide a non-parametric sequence model where the number of states is not bounded a priori.

• Beam sampling provides an efficient exact dynamic programming-based MCMC method for iHMMs.

• ifHMMs extend iHMMs to multiple state variables in parallel.

• Future directions: new models, fast algorithms, and compelling applications.


Appendix


Binary Matrix Representation of Clustering


Binary Latent Feature Matrices


Indian Buffet Process

(Griffiths and Ghahramani, 2005)
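A generative sketch of the plain IBP (not the Markov extension used in the ifHMM), following the standard culinary metaphor: customer i takes each previously-sampled dish k with probability m_k/i, then tries a Poisson(α/i) number of new dishes:

```python
import numpy as np

def sample_ibp(N, alpha=2.0, rng=None):
    """Draw a binary feature matrix from the Indian Buffet Process: customer i
    takes each existing dish k with probability m_k/i (m_k = how often dish k
    was taken before), then samples Poisson(alpha/i) brand-new dishes."""
    rng = rng or np.random.default_rng(0)
    Z = np.zeros((0, 0), dtype=int)
    for i in range(1, N + 1):
        m = Z.sum(axis=0)                                  # dish popularities
        old = (rng.random(Z.shape[1]) < m / i).astype(int)
        k_new = rng.poisson(alpha / i)                     # new dishes
        Z = np.pad(Z, ((0, 0), (0, k_new)))                # widen for new dishes
        Z = np.vstack([Z, np.concatenate([old, np.ones(k_new, dtype=int)])])
    return Z

Z = sample_ibp(10, alpha=2.0)   # 10 customers; the column count is random
```

Rows are objects, columns are latent features; the number of columns is unbounded a priori but finite for any finite data set.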


HMM vs iHMM


Infinite Hidden Markov Models

• Basic idea: a two-level urn scheme to share information between states (Beal, Ghahramani, and Rasmussen, 2002)

• Derived from hierarchical Dirichlet processes (Teh, Jordan, Beal & Blei, 2006)


iHMMs and HDPs


Inference and Learning


Variational Bayesian Learning . . .

Maximization of this lower bound, $\mathcal{F}_m$, can be done via EM-like iterative updates:

$$q_x^{(t+1)}(x) \propto \exp\left[ \int \ln p(x, y \mid \theta, m)\, q_\theta^{(t)}(\theta)\, d\theta \right] \quad \text{(E-like step)}$$

$$q_\theta^{(t+1)}(\theta) \propto p(\theta \mid m) \exp\left[ \int \ln p(x, y \mid \theta, m)\, q_x^{(t+1)}(x)\, dx \right] \quad \text{(M-like step)}$$

Maximizing $\mathcal{F}_m$ is equivalent to minimizing the KL divergence between the approximate posterior, $q_\theta(\theta)\, q_x(x)$, and the true posterior, $p(\theta, x \mid y, m)$:

$$\ln p(y \mid m) - \mathcal{F}_m(q_x(x), q_\theta(\theta), y) = \int q_x(x)\, q_\theta(\theta) \ln \frac{q_x(x)\, q_\theta(\theta)}{p(\theta, x \mid y, m)}\, dx\, d\theta = \mathrm{KL}(q \| p)$$


Hidden Markov Models


Experiment I - HMM data


Beam Sampling Properties
