Bayesian Time Series Models - University of Oxford


Bayesian Time Series Models

Zoubin Ghahramani
Department of Engineering, University of Cambridge

joint work with Matt Beal, Jurgen van Gael, Yee Whye Teh, and Yunus Saatci

Wednesday, 26 May 2010

Bayesian Learning

Apply the basic rules of probability to learning from data.

Data set: D = {x_1, ..., x_n}. Models: m, m′, etc. Model parameters: θ.

Prior probability of models: P(m), P(m′), etc. Prior probabilities of model parameters: P(θ|m). Model of data given parameters (likelihood model): P(x|θ, m).

If the data are independently and identically distributed then:

$$P(D \mid \theta, m) = \prod_{i=1}^{n} P(x_i \mid \theta, m)$$

Posterior probability of model parameters:

$$P(\theta \mid D, m) = \frac{P(D \mid \theta, m)\, P(\theta \mid m)}{P(D \mid m)}$$

Posterior probability of models:

$$P(m \mid D) = \frac{P(m)\, P(D \mid m)}{P(D)}$$
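As a concrete illustration of these three equations, here is a minimal grid-approximation sketch (a hypothetical example, not from the talk) for a Bernoulli model of coin flips, computing the iid likelihood, the evidence P(D|m), and the posterior P(θ|D, m):

```python
import numpy as np

# Grid-based illustration of Bayes' rule for a Bernoulli model of coin
# flips: theta is the heads probability, with a uniform prior on a grid.
def posterior_on_grid(data, grid_size=1001):
    theta = np.linspace(0.0, 1.0, grid_size)
    prior = np.ones(grid_size) / grid_size            # P(theta|m), uniform
    heads = sum(data)
    tails = len(data) - heads
    likelihood = theta**heads * (1.0 - theta)**tails  # P(D|theta,m), iid product
    evidence = np.sum(likelihood * prior)             # P(D|m), summed over the grid
    posterior = likelihood * prior / evidence         # P(theta|D,m), Bayes' rule
    return theta, posterior, evidence

theta, post, ev = posterior_on_grid([1, 1, 0, 1, 0, 1, 1])
post_mean = np.sum(theta * post)  # approx. (5+1)/(7+2) = 2/3 under the uniform prior
```

The same grid sum that normalises the posterior is the marginal likelihood used for model comparison below.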


Bayesian Occam’s Razor and Model Comparison

Compare model classes, e.g. m and m′, using posterior probabilities given D:

$$p(m \mid D) = \frac{p(D \mid m)\, p(m)}{p(D)}, \qquad p(D \mid m) = \int p(D \mid \theta, m)\, p(\theta \mid m)\, d\theta$$

Interpretations of the Marginal Likelihood (“model evidence”):

• The probability that randomly selected parameters from the prior would generate D.

• Probability of the data under the model, averaging over all possible parameter values.

• $\log_2 \frac{1}{p(D \mid m)}$ is the number of bits of surprise at observing data D under model m.

Model classes that are too simple are unlikely to generate the data set.

Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

[Figure: marginal likelihood P(D|m) plotted over all possible data sets of size n, for a "too simple", a "too complex", and a "just right" model class.]


Bayesian Model Comparison: Occam’s Razor at Work

[Figure: fits of polynomials of order M = 0, ..., 7 to the same data, together with the model evidence P(Y|M) for each order.]

For example, for quadratic polynomials (m = 2): $y = a_0 + a_1 x + a_2 x^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and the parameters are $\theta = (a_0, a_1, a_2, \sigma)$.

demo: polybayes
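The polybayes demo itself is not reproduced here, but the evidence comparison it illustrates can be sketched in a few lines of numpy: with a zero-mean Gaussian prior on the polynomial coefficients, the marginal likelihood is available in closed form, since under the model y ~ N(0, σ²I + α⁻¹ΦΦᵀ). The data, prior variance, and noise variance below are illustrative assumptions:

```python
import numpy as np

def log_evidence(x, y, order, noise_var=1.0, prior_var=100.0):
    """Log marginal likelihood p(y|M) of a degree-`order` polynomial model with
    the coefficients integrated out under a zero-mean Gaussian prior: under the
    model, y ~ N(0, noise_var*I + prior_var*Phi Phi^T)."""
    Phi = np.vander(x, order + 1, increasing=True)      # columns 1, x, x^2, ...
    C = noise_var * np.eye(len(y)) + prior_var * (Phi @ Phi.T)
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x - 0.5 * x**2 + rng.normal(0, 1.0, size=x.shape)  # quadratic truth
scores = [log_evidence(x, y, M) for M in range(8)]
best = int(np.argmax(scores))
```

With this quadratic data the evidence peaks at M = 2: lower orders cannot fit the curvature, and each extra order pays a log-determinant (Occam) penalty without fitting the data appreciably better.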


Learning Model Structure

How many clusters in the data?

What is the intrinsic dimensionality of the data?

Is this input relevant to predicting that output?

What is the order of a dynamical system?

How many states in a hidden Markov model?

How many auditory sources in the input?

Which graph structure best models the data?

demo: run simple


Time Series Models

• Broadly two classes of time series models:
  – fully observed models (e.g. AR, n-gram)
  – hidden state models (e.g. HMM, state-space models (SSMs))

• How do we learn the dimensionality of a linear-Gaussian SSM?
  – Variational Bayesian learning of SSMs

• Hidden Markov models (HMMs) are widely used, but how do we choose the number of hidden states?
  – A non-parametric Bayesian approach: infinite HMMs

• A single discrete state variable is a poor representation of the history. Can we do better?
  – Factorial HMMs

• Can we make factorial HMMs non-parametric?
  – Infinite factorial HMMs and the Markov Indian Buffet Process


Part I:

Variational Bayesian learning of

State-Space Models


Linear-Gaussian State-space models (SSMs)

[Figure: graphical model of the SSM, with inputs u_1, ..., u_T, hidden states x_1, ..., x_T, outputs y_1, ..., y_T, and parameter matrices A, B, C, D.]

$$P(x_{1:T}, y_{1:T} \mid u_{1:T}) = P(x_1 \mid u_1)\, P(y_1 \mid x_1, u_1) \prod_{t=2}^{T} P(x_t \mid x_{t-1}, u_t)\, P(y_t \mid x_t, u_t)$$

State dynamics equation: x_t = A x_{t-1} + B u_t + w_t

Output equation: y_t = C x_t + D u_t + v_t

where x_t, u_t and y_t are real-valued vectors and v and w are uncorrelated zero-mean Gaussian noise vectors.

These models are also known as stochastic linear dynamical systems or Kalman filter models; they are the continuous-state analogue of HMMs.

Forward–backward algorithm ≡ Kalman smoothing
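A minimal sketch of sampling a trajectory from the two equations above (the matrices, input sequence, and noise scales are arbitrary choices for illustration, not from the talk):

```python
import numpy as np

def simulate_ssm(A, B, C, D, u, x0, q_std=0.1, r_std=0.1, rng=None):
    """Sample a trajectory from the linear-Gaussian SSM:
    x_t = A x_{t-1} + B u_t + w_t,  y_t = C x_t + D u_t + v_t."""
    rng = rng or np.random.default_rng(0)
    T = u.shape[0]
    k, p = A.shape[0], C.shape[0]
    x = np.zeros((T, k))
    y = np.zeros((T, p))
    x_prev = x0
    for t in range(T):
        x[t] = A @ x_prev + B @ u[t] + rng.normal(0.0, q_std, size=k)  # state dynamics
        y[t] = C @ x[t] + D @ u[t] + rng.normal(0.0, r_std, size=p)    # output equation
        x_prev = x[t]
    return x, y

# A 2-d state with scalar input and output (all values made up):
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
u = np.ones((50, 1))
x, y = simulate_ssm(A, B, C, D, u, x0=np.zeros(2))
```

Inference in this model (recovering x from y) is exactly what Kalman smoothing computes.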


Variational Bayesian Learning: Lower Bounding the Marginal Likelihood

Let the latent variables be x, the data y, and the parameters θ.

We can lower bound the marginal likelihood (using Jensen's inequality):

$$\ln p(y \mid m) = \ln \int p(y, x, \theta \mid m)\, dx\, d\theta = \ln \int q(x, \theta)\, \frac{p(y, x, \theta \mid m)}{q(x, \theta)}\, dx\, d\theta \ge \int q(x, \theta) \ln \frac{p(y, x, \theta \mid m)}{q(x, \theta)}\, dx\, d\theta.$$

Use a simpler, factorised approximation, q(x, θ) ≈ q_x(x) q_θ(θ):

$$\ln p(y \mid m) \ge \int q_x(x)\, q_\theta(\theta) \ln \frac{p(y, x, \theta \mid m)}{q_x(x)\, q_\theta(\theta)}\, dx\, d\theta = \mathcal{F}_m(q_x(x), q_\theta(\theta), y).$$
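The Jensen bound above can be checked numerically on a toy discrete model; the joint probabilities below are made up for illustration. The bound holds for any q and is tight exactly when q is the true posterior:

```python
import numpy as np

# Numerical check of the Jensen bound for a discrete latent variable h:
# ln p(y) = ln sum_h p(y,h) >= sum_h q(h) ln[p(y,h)/q(h)] for any q(h).
p_joint = np.array([0.10, 0.25, 0.05])  # made-up values of p(y,h) at the observed y
log_p_y = np.log(p_joint.sum())         # exact log evidence

def elbo(q):
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * (np.log(p_joint) - np.log(q))))

q_uniform = np.ones(3) / 3
q_exact = p_joint / p_joint.sum()       # the true posterior p(h|y)
gap = log_p_y - elbo(q_uniform)         # > 0: the bound is loose for the wrong q
```

The gap is exactly KL(q‖p), which is why maximising the lower bound drives q towards the true posterior.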

(Some significant contributors to this framework from 1993-1998: C. M. Bishop, G. E. Hinton, T. S. Jaakkola, M. I. Jordan, D. J. C. MacKay, R. M. Neal, L. K. Saul.)

Inferring the Number of Hidden States

[Figure: variation of F, the lower bound on the marginal likelihood (in nats), with hidden state dimension k = 1, ..., 20, for 10 random initialisations of VBEM. We can use this lower bound to infer/select the number of hidden states.]


Inferring Regulatory Networks

[Figure: inferred gene regulatory network; nodes are genes (CASP8, GATA3, TP53I3, AKT1, and others) linked by the interactions described below.]

• Gene-gene interactions present in ≥ 80% of the VB state-space models out of 10 random seeds, with k = 14, at a confidence level of 99.8%.

• The number inside each node is the gene identity.

• Numbers on the edges represent the number of models (from 10 different random seeds) in which the interaction is supported at this confidence level.

• Dotted lines represent negative interactions; continuous lines represent positive interactions.


Summary of Part I

• Bayesian machine learning
• Marginal likelihoods and Occam's Razor
• Variational Bayesian lower bounds
• Application to learning the number of hidden states in linear-Gaussian state-space models


Part II

The Infinite Hidden Markov Model


Hidden Markov Models


Choosing the number of hidden states

• How do we choose K, the number of hidden states, in an HMM?

• Can we define a model with an unbounded number of hidden states, and a suitable inference algorithm?


Infinite Hidden Markov models


Alice in Wonderland


Hierarchical Urn Scheme for generating transitions in the iHMM (2002)

• n_ij is the number of previous transitions from state i to state j
• α, β, and γ are hyperparameters
• The probability of a transition from i to j is proportional to n_ij
• With probability proportional to βγ, jump to a new state
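A much-simplified generative sketch of this urn scheme (it collapses the two-level hierarchy of the full 2002 construction into a single urn, so it is illustrative only):

```python
import numpy as np
from collections import defaultdict

def sample_ihmm_states(T, beta=1.0, gamma=1.0, rng=None):
    """Simplified urn sketch of iHMM transitions: from state i, move to state j
    with probability proportional to the count n_ij, or to a brand-new state
    with probability proportional to beta*gamma.  (The actual 2002 scheme uses
    a two-level urn with an oracle; this collapses it for illustration.)"""
    rng = rng or np.random.default_rng(0)
    n = defaultdict(lambda: defaultdict(int))   # n[i][j]: transition counts
    states = [0]
    num_states = 1
    for _ in range(T - 1):
        i = states[-1]
        dests = list(n[i])
        weights = np.array([n[i][j] for j in dests] + [beta * gamma], dtype=float)
        idx = rng.choice(len(weights), p=weights / weights.sum())
        if idx == len(dests):                   # jump to a never-before-seen state
            j = num_states
            num_states += 1
        else:
            j = dests[idx]
        n[i][j] += 1
        states.append(j)
    return states, num_states

states, K = sample_ihmm_states(500, beta=1.0, gamma=2.0)
```

The rich-get-richer counts n_ij mean popular transitions are reused, while the βγ term keeps the number of visited states growing slowly with sequence length.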


Relating iHMMs to DPMs

• The infinite hidden Markov model is closely related to Dirichlet process mixture (DPM) models.

• This makes sense:
  – HMMs are time series generalisations of mixture models.
  – DPMs are a way of defining mixture models with countably infinitely many components.
  – iHMMs are HMMs with countably infinitely many states.


HMMs as sequential mixtures


Infinite Hidden Markov Models

Teh, Jordan, Beal and Blei (2005) derived iHMMs in terms of Hierarchical Dirichlet Processes.


Efficient inference in iHMMs?


Inference and Learning in HMMs and iHMMs

• HMM inference of hidden states, p(s_t|y_1...y_T, θ):
  – forward-backward = dynamic programming = belief propagation

• HMM parameter learning:
  – Baum-Welch = expectation maximization (EM), or Gibbs sampling (Bayesian)

• iHMM inference and learning, p(s_t, θ|y_1...y_T):
  – Gibbs sampling

• This is unfortunate: Gibbs can be very slow for time series!

• Can we use dynamic programming?
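The forward-backward recursion mentioned above, sketched for a discrete HMM with per-step normalisation (the transition and emission matrices are arbitrary examples):

```python
import numpy as np

def forward_backward(pi0, A, B, obs):
    """Posterior marginals p(s_t|y_1..y_T) for a discrete HMM via the
    forward-backward recursions, with per-step normalisation for stability."""
    T, K = len(obs), len(pi0)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi0 * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                    # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):           # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

pi0 = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition matrix, made up for the demo
B = np.array([[0.8, 0.2], [0.3, 0.7]])   # emission matrix B[state, symbol]
gamma = forward_backward(pi0, A, B, [0, 0, 1, 1, 1])
```

This O(TK²) dynamic program is exactly what becomes unavailable when K is unbounded, which is the problem beam sampling addresses.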

Dynamic Programming in HMMs: Forward-Backtrack Sampling


Beam Sampling


Auxiliary variables

Note: adding the auxiliary variables u does not change the distribution over the other variables.
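The effect of the auxiliary variable can be seen in one line: conditioned on u_t, only transitions whose probability exceeds u_t remain possible, so the infinite transition matrix is truncated to a finite set. A toy illustration (the probabilities are made up):

```python
import numpy as np

def allowed_transitions(trans_row, u):
    """Given the slice variable u_t ~ Uniform(0, pi(s_{t-1}, s_t)), only
    transitions with probability greater than u_t need be considered,
    which is a finite set even when the state space is unbounded."""
    return np.flatnonzero(trans_row > u)

row = np.array([0.50, 0.30, 0.15, 0.04, 0.01])  # made-up transition probabilities
small = allowed_transitions(row, 0.05)  # [0 1 2]: most transitions survive
large = allowed_transitions(row, 0.45)  # [0]: almost everything is pruned
```

Within the truncated set, ordinary forward-backtrack dynamic programming applies, which is what makes beam sampling efficient.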



Experiment: Text Prediction


Experiment: Changepoint Detection


Parallel and Distributed Implementations of iHMMs

• Recent work on parallel (.NET) and distributed (Hadoop) implementations of beam-sampling for iHMMs (Bratieres, Van Gael, Vlachos and Ghahramani, 2010).

• Applied to unsupervised learning of part-of-speech tags from Newswire text (10 million word sequences).

• Promising results; code available.


Part III

• Hidden Markov models represent the entire history of a sequence using a single state variable s_t.

• This seems restrictive...

• It seems more natural to allow many hidden state variables: a "distributed representation" of state.

• ...the factorial hidden Markov model

Factorial HMMs

• Factorial HMMs (Ghahramani and Jordan, 1997)
• A kind of dynamic Bayesian network
• Inference using variational methods or sampling
• Have been used in a variety of applications (e.g. condition monitoring, biological sequences, speech recognition)


From factorial HMMs to infinite factorial HMMs?

• A non-parametric version where the number of chains is unbounded?

• In the infinite factorial HMM (ifHMM), each chain is binary (van Gael, Teh, and Ghahramani, 2008).

• Based on the Markov extension of the Indian Buffet Process (IBP).


Bars-in-time data


ifHMM Toy Experiment: Bars-in-time

[Figure: three panels showing the bars-in-time data, the ground truth chains, and the inferred chains.]


ifHMM Experiment: Bars-in-time


Separating speech audio of multiple speakers in time


The Big Picture


Summary

• Bayesian methods provide a flexible framework for modelling.

• State-space models can be learned using variational Bayesian methods.

• iHMMs provide a non-parametric sequence model where the number of states is not bounded a priori.

• Beam sampling provides an efficient exact dynamic programming-based MCMC method for iHMMs.

• ifHMMs extend iHMMs to multiple state variables in parallel.

• Future directions: new models, fast algorithms, and compelling applications.


Appendix


Binary Matrix Representation of Clustering


Binary Latent Feature Matrices


Indian Buffet Process

(Griffiths and Ghahramani, 2005)
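A generative sketch of the plain IBP (not the Markov extension used in the ifHMM), following the standard culinary metaphor: customer i takes each previously-sampled dish k with probability m_k/i, then tries a Poisson(α/i) number of new dishes:

```python
import numpy as np

def sample_ibp(N, alpha=2.0, rng=None):
    """Draw a binary feature matrix from the Indian Buffet Process: customer i
    takes each existing dish k with probability m_k/i (m_k = how often dish k
    was taken before), then samples Poisson(alpha/i) brand-new dishes."""
    rng = rng or np.random.default_rng(0)
    Z = np.zeros((0, 0), dtype=int)
    for i in range(1, N + 1):
        m = Z.sum(axis=0)                                  # dish popularities
        old = (rng.random(Z.shape[1]) < m / i).astype(int)
        k_new = rng.poisson(alpha / i)                     # new dishes
        Z = np.pad(Z, ((0, 0), (0, k_new)))                # widen for new dishes
        Z = np.vstack([Z, np.concatenate([old, np.ones(k_new, dtype=int)])])
    return Z

Z = sample_ibp(10, alpha=2.0)   # 10 customers; the column count is random
```

Rows are objects, columns are latent features; the number of columns is unbounded a priori but finite for any finite data set.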


HMM vs iHMM


Infinite Hidden Markov Models

• Basic idea: a two-level urn scheme to share information between states (Beal, Ghahramani, and Rasmussen, 2002)

• Derived from hierarchical Dirichlet processes (Teh, Jordan, Beal & Blei, 2006)


iHMMs and HDPs


Inference and Learning


Variational Bayesian Learning . . .

Maximization of this lower bound, $\mathcal{F}_m$, can be done via EM-like iterative updates:

$$q_x^{(t+1)}(x) \propto \exp\left[ \int \ln p(x, y \mid \theta, m)\, q_\theta^{(t)}(\theta)\, d\theta \right] \quad \text{(E-like step)}$$

$$q_\theta^{(t+1)}(\theta) \propto p(\theta \mid m) \exp\left[ \int \ln p(x, y \mid \theta, m)\, q_x^{(t+1)}(x)\, dx \right] \quad \text{(M-like step)}$$

Maximizing $\mathcal{F}_m$ is equivalent to minimizing the KL divergence between the approximate posterior, $q_\theta(\theta)\, q_x(x)$, and the true posterior, $p(\theta, x \mid y, m)$:

$$\ln p(y \mid m) - \mathcal{F}_m(q_x(x), q_\theta(\theta), y) = \int q_x(x)\, q_\theta(\theta) \ln \frac{q_x(x)\, q_\theta(\theta)}{p(\theta, x \mid y, m)}\, dx\, d\theta = \mathrm{KL}(q \| p)$$


Hidden Markov Models


Experiment I - HMM data


Beam Sampling Properties
