Lecture 2: Stochastic Models and MCMC October 5, 2010 · Introduction Matrix in R Irreducibility...

Stochastic ModelsMarkov Chain Monte Carlo

IntroductionMatrix in R

Lecture 2: Stochastic Models and MCMC

October 5, 2010

Special Meeting 1/24



Dynamical System

Knowing the state xt at time t, a system allows us to determinext+1. This suggests the evolution of states in the system

xt+1 = φt(xt)

where φt maps from all the possible states to themselves.




Probabilistic Dynamical System

When the dynamical system is probabilistic it may be determinedby the transition probability

Pij = P(Xt+1 = j |Xt = i) = P(φt(i) = j)

for finitely many states i and j , with initial distributionp0(i) = P(X0 = i). This is a model for Markov chain.




Transition probability matrix.

When the system has a finite state S = {1, . . . ,M}, the Markovchain is represented by the transition probability matrix

P =

P1,1 P1,2 · · · P1,M

P2,1 P2,2 · · · P2,M

. . . . . . . . . . . . . . . . . . . . . . .PM,1 PM,2 · · · PM,M

Since P is a square matrix, we can also introduce the n-th powerPn whose (i , j)-entry is denoted by Pn

ij . It represents the n-steptransition probability

Pnij = P(Xn = j | X0 = i)




Transition Graph Representation

A Markov chain and its transition probabilities can be formulatedin terms of directed graph and function on its edges. Here Sbecomes the set of vertices of graph, and E = {(i , j) : Pij > 0}represents the collection of edges. The transition probability Pij ’scan be considered as a function on E .




Communication Classes

When we can find a series of edges

(i0, i1), (i1, i2), . . . , (in−1, in)

we call it a directed path from i0 to in. There is a directed pathfrom i and j if and only if Pn

i ,j > 0 for some n. We say that state jcan be reached from state i , and write “i → j” if Pn

ij > 0 for somen ≥ 0, that is, if there is a directed path from i and j . The twostates i and j are said to communicate, denoted by “i ↔ j ,” if wehave i → j and j → i . It is an equivalence relation, and therefore,the state space S can be partitioned into equivalent classes, whichwe call communication classes.




Irreducibility

The transition probability matrix P is said to be irreducible if itconsists of only one communication class. A communication classC is called recurrent if no state outside can be reached from anystate inside, that is,

Pni ,j = 0 for any i ∈ C and j 6∈ C , and for all n ≥ 0.

In particular, a state i is called absorbing if Pii = 1 since theMarkov chain has to stay at the state i forever (thus, “absorbed”)once it reaches there. Otherwise, that is, if a communication classis not recurrent, we call it transient.




An Example

The transition probability matrix

P =

0.3 0.1 0.5 0.10.2 0.4 0.2 0.20 0 1 00 0 0 1

corresponds to the previous example of graph representation. Herethe communication classes are {1, 2}, {3}, and {4}. Note that{3}, and {4} are absorbing.




Periodicity of Recurrent Classes

We say that a recurrent state i has the period d if Pnii > 0 only

when n = kd with some integer k. Furthermore, if state i has theperiod d and i ↔ j , then state j will have the same period d ; thus,a recurrent class must have a common period d . If a recurrentclass has the period d ≥ 2 then it is called periodic; otherwise,(that is, if d = 1) it is said to be aperiodic. If a recurrent class Cis aperiodic then

Pnij > 0 whenever i , j ∈ C

for sufficiently large n.




Eigenvalue and Eigenvector

A raw vector uT (which is a column vector u “transposed”) iscalled a left-eigenvector if it satisfies uT P = λuT . Then λ iscalled the corresponding eigenvalue. The eigenvalue λ = 1 isknown to be the largest in absolute value in any matrix P ofprobability transition matrix. Furthermore, Frobenius showed thatits corresponding left-eigenvector uT = [π1 · · ·πM ] is strictlypositive (i.e., πi > 0 for all i = 1, . . . ,M), and can be chosen asthe unique solution to the equations

πj =M∑i=1

πiPij for all j = 1, . . . ,M andM∑i=1

πi = 1.

Thus, π(j) is viewed as probability mass function, and called thestationary distribution.




Perron–Frobenius theorem

Historically this fundamental discovery was made by Perron in1907. Suppose that P is aperiodic and irreducible. Then P has aunique left-eigenvector uT = [π1 · · ·πM ] representing thestationary distribution and

limn→∞

Pnij = πj for all i = 1, . . . ,M.

Moreover, if λ2 is the second largest eigenvalue of P in absolutevalue, for any |λ2| < ρ < 1 we can find C > 0 such that

max1≤i ,j≤M

|Pnij − πj | < Cρn for all n = 1, 2, . . ..




Matrix Manipulation

A matrix and its multiplication with a row vector are introduced inR. By default data are filled into the matrix, starting from the firstcolumn to the last column with the specified number ncol ofcolumns.

> P = matrix(c(0, 0, 0.6, 1, 0.1, 0.4, 0, 0.9, 0), ncol=3)> P> P %*% P> c(0.2, 0.8, 0) %*% P %*% P

The result should define a transition probability matrix and givethe probability distribution after the second step given the initialdistribution p0(1) = 0.2, p0(2) = 0.8, and p0(3) = 0.




Stationary Distribution and Eigenvector

Compare the probability at the Nth step with the normalizedeigenvetor corresponding to λ = 1. Change the initial distributionand the size N, and compare the result. Note that eigen()calculates the left-eigenvectors when the transposed matrix t(P) isgiven.

> N = 20> A = P> for(i in 1:N) A = A %*% P> A> c(0.2, 0.8, 0) %*% A> uu = as.real(eigen(t(P))$vectors[,1])> uu / sum(uu)



Markov ChainExploration with RMetropolis Algorithm

Markov chain

Let S be a discrete state space and let (Xt , t = 0, 1, . . .) be adiscrete time stochastic process, taking its value on S . Then Xt iscalled a Markov chain if

P(Xt = xt |Xs = xs , s ≤ t ′) = P(Xt = xt |Xt′ = xt′)

for t ′ < t. Let P(x , y) be a transition probability, that is, aprobability distribution P(x , ·) on S for each x ∈ S . Then one canconstruct a Markov chain Xt so that

P(Xt = y |Xt−1 = x) = P(x , y).

Then we can easily see that P(Xt+n = y |Xt = x) = Pn(x , y).




Ergodicity and Limit Theorem

We call (Xt) (and its transition probability P) an ergodic Markovchain if P is irreducible and aperiodic. Now suppose that we havedevised an ergodic Markov chain whose stationary distribution is π.Then we have the following convergence theorem: If P isirreducible and aperiodic, then there is a unique stationarydistribution π such that

Pn(x , y) = P(Xn = y |X0 = x) → π(y) as n →∞,

regardless of the choice for x . In terms of sample path it impliesthat

XtL−→ π as t −→∞.




Random beta-Walk

Here we will introduce a random walk on (0, 1) in which the nextstep Xt+1 is determined by the beta distribution with parameterα = max(δ + θXt , 1) and β = max(δ + θ(1− Xt), 1). Theseparameters can control the behavior of walk.

I A smaller δ keeps a sample path closer to either of theboundary.

I The larger the value θ is the smaller the move of each stepbecomes.

Thus, 0 < δ < 1 will change the shape of stationary distribution ofthe random walk, and θ > 0 will influence the speed ofconvergence of random walk. Later we use this random walk as aproposal Markov chain on (0, 1).




Simulation in R

Download the R code “bwalk.r” and observe a sample path ofthe beta random walk.

I Choose a different choice of δ and θ.

I Change the running time and see how long it takes to displaya stationary behavior.

> source("bwalk.r")> sample.path = rwalk(move=bmove, trajectory=T, run.time=500,delta=0.8, theta=20)> par(mfrow=c(2,1))> plot(sample.path, type="l", xlab="time", ylab="state")




A Long Run Behavior

A long run behavior can be observed from the histogram of Xt

from the end of runs repeatedly.

I Change the running time, and see if the distribution of Xt isdifferent.

I Obtain the histogram of Xt for a different choice of δ and θ.

I Change the initial point by adding “init.state=1”, and seewhether it affects the long run behavior.

> sample.data = rwalk(move=bmove, run.time=100,delta=0.8, theta=20, sample.size=500)> hist(sample.data, freq=F, breaks=seq(0,1,by=0.05), col="red")




Emergence of Markov Chain Monte Carlo

In reality the “state” space for π(x) is not R. Either it is a subsetof Rn (in Bayesian applications), or it has a complex discretestructure (e.g., Ising model). For such models both algorithms viainverse probability transform and resampling method are notapplicable in general. By way of Markov chain convergencetheorem one can construct a Markov chain X whose stationarydistribution is π.

XtL−→ π as t −→∞.

This methodology gives a way to break the limitation of MonteCarlo simulation. But how can we make it happen?




Metropolis Algorithm

Let π be a pdf of interest on a state space S , and let Q be atransition probability on S . Define the acceptance probability by

A(x , y) := min

{π(y)Q(y , x)

π(x)Q(x , y), 1

}for a move from x to y satisfying Q(x , y) 6= 0 and Q(y , x) 6= 0.

Metropolis-Hastings Algorithm (MHA)

1. Choose an initial state X0 at time t = 0.

2. Suppose that Xt = x at time t. Then pick y according toQ(x , ·).

3. Move to Xt+1 = y with the probability A(x , y), or stay Xt+1 =x with the probability (1− A(x , y)).




Transition Probability for MHA

One should choose Q to be irreducible on S , and call it a proposalchain. In terms of transition probability, the above algorithm isgiven by

P(x , y) =

{Q(x , y)A(x , y) if x 6= y ;

Q(x , x) +∑

z∈S Q(x , z)(1− A(x , z)) if x = y .

Then P is irreducible, aperiodic and reversible transition probabilitywith the stationary distribution π.




Property of Time-Reversibility

Let π be a probability distribution on S . We call π stationary if∑x∈S

π(x)P(x , y) = π(y)

for all y ∈ S . A Markov chain P̃ is said to be time-reversed if thedetailed balance

π(x)P(x , y) = π(y)P̃(y , x)

holds between P and P̃. Moreover, P is called a reversible Markovchain when P = P̃. In particular if the probability distribution πsatisfies the detailed balance between the two Markov chains Pand P̃, then π becomes stationary for both P and P̃.




Reversibility for MHA

The detailed balance clearly holds for either x = y orπ(x)Q(x , y) = π(y)Q(y , x). Suppose that x 6= y , and thatπ(x)Q(x , y) > π(y)Q(y , x). Then

A(x , y) = π(y)Q(y ,x)π(x)Q(x ,y) and A(y , x) = 1,

and therefore, we obtain

π(x)P(x , y) = π(x)Q(x , y)A(x , y)= π(y)Q(y , x) = π(y)P(y , x).

It also holds for the case that π(x)Q(x , y) < π(y)Q(y , x) bysymmetry.




Metropolis for Normal Mixture

Here we use a random beta walk as a proposal chain, and run theMetropolis-Hastings Algorithm to generate a normal mixturedensity on [0, 1].

> source("nm.r")> source("metro.r")> sample = bwalk.metro(ff=nm, run.time=100, delta=0.8,theta=20, sample.size=500)> hist(sample, freq=F, breaks=seq(0,1,by=0.05), col="red")> x = seq(0,1,by=0.01)> lines(x, nm(x))

Change the running time and the parameters for random betawalk, and see how the histogram differs from the targetdistribution of normal mixture.


Lecture 2: Stochastic Models and MCMC October 5, 2010 · Introduction Matrix in R Irreducibility...

Documents

Transcript of Lecture 2: Stochastic Models and MCMC October 5, 2010 · Introduction Matrix in R Irreducibility...