1 Analysis of Markov Chains

1.1 Martingales

Martingales are certain sequences of dependent random variables which have found many applications in probability theory. In order to introduce them it is useful to first re-examine the notion of conditional probability. Recall that we have a probability space Ω on which random variables are defined. For simplicity we may assume the random variables assume only a countable set of values; however, the results are valid for random variables taking values in R. To obtain this greater generality requires a refinement of the notion of conditional probability and expectation as discussed below. The proofs remain valid without any essential change after replacing the word partition (see below) by the appropriate σ-algebra. The assumption that random variables take only countably many values dispenses with the need to introduce the concept of σ-algebras. For random variables X and Y the conditional probability P[X = a|Y] is itself a random variable. For every b ∈ Z such that P[Y = b] ≠ 0 we have

P[X = a | Y = b] = P[X = a and Y = b] / P[Y = b].

Similarly the conditional expectation E[X|Y] is a random variable which takes different values for different b's. We can put this in a slightly different language by saying that Y defines a partition of the probability space Ω;

Ω = ⋃_{b∈Z} A_b,   (disjoint union)   (1.1.1)

where A_b = {ω ∈ Ω | Y(ω) = b}. Thus conditional expectation is a random variable which is constant on each piece A_b of the partition defined by Y. With this picture in mind we redefine the notion of conditioning by making use of partitionings of the probability space. So assume we are given a partition of the probability space Ω as in (1.1.1); however, we do not require that this partition be defined by any specifically given random variable. It is just a partition which somehow has been specified. Each subset A_b is an event and if A_b has non-zero probability, then P[X|A_b] and E[X|A_b] make sense. It is convenient to introduce a succinct notation for a partition. Generally we use A or A_n to denote a partition or sequence of partitions which we shall


encounter shortly. Notice that each A_n is a partition of the probability space and the subscript n does not refer to the subsets comprising the partition A. By A ≺ A′ we mean every subset of Ω defined by the partition A is a union of subsets of Ω defined by the partition A′. In such a case we say A′ is a refinement of A. For example if Y and Z are random variables, we define the partition A^Y as

Ω = ⋃_b A^Y_b, where A^Y_b = {ω | Y(ω) = b},

and similarly for A^Z. Then the collection of intersections A^Y_b ∩ A^Z_c defines a partition A′ which is a refinement of both A^Y and A^Z. The set A^Y_b ∩ A^Z_c consists of all ω ∈ Ω such that Y(ω) = b and Z(ω) = c. For a random variable X the notion of P[X = a|A] simply means that for every subset of positive probability defined by the partition A we have a number which is the conditional probability of X = a given that subset. Thus P[X = a|A] itself is a random variable which is constant on each subset A_b of the partition defined by A. Similarly, the conditional expectation E[X|A] is a random variable which is constant on each subset defined by the partition A.

Given a partition A we say a random variable X is A-admissible if X is constant on every subset defined by A. For example, the random variables P[X = a|A] and E[X|A] are A-admissible. For an A-admissible random variable X clearly we have

E[X | A] = X. (1.1.2)

If A′ is a refinement of the partition A, then

E[E[X | A′] | A] = E[X | A]. (1.1.3)

This is a generalization of the statement E[E[X|Y]] = E[X] and its proof is again just a re-arrangement of a series.
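For readers who want to see (1.1.2) and (1.1.3) concretely, the following Python sketch checks the tower property on a small finite probability space; the six-point space, the partitions and the random variable are arbitrary illustrative choices, not anything taken from the text.

from fractions import Fraction as F

omega = range(6)
prob = {w: F(1, 6) for w in omega}               # uniform probability on six points
X = {0: 3, 1: -1, 2: 4, 3: 1, 4: 5, 5: -2}       # an arbitrary random variable

A      = [{0, 1, 2}, {3, 4, 5}]                  # coarse partition
A_fine = [{0, 1}, {2}, {3}, {4, 5}]              # refinement of A

def cond_exp(Y, partition):
    """E[Y | partition]: a random variable constant on each block of the partition."""
    E = {}
    for block in partition:
        p = sum(prob[w] for w in block)
        m = sum(prob[w] * Y[w] for w in block) / p
        for w in block:
            E[w] = m
    return E

lhs = cond_exp(cond_exp(X, A_fine), A)   # E[E[X | A'] | A]
rhs = cond_exp(X, A)                     # E[X | A]
assert lhs == rhs                        # the two random variables agree pointwise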

With the preliminaries out of the way we can now define a martingale. A sequence (Xl, Al) consisting of random variables Xl and partitions Al

such that Al ≺ Al+1 is called a martingale if

1. E[|Xl|] <∞.

2. E[Xl+1 | Al] = Xl.


If condition (2) is replaced by E[Xl+1 | Al] ≥ Xl (resp. E[Xl+1 | Al] ≤ Xl) then we refer to (Xl, Al) as a submartingale (resp. supermartingale). The statement "(Xl) is a martingale" means (Xl, Al) is a martingale where Al is the partition defined by the random variables X0, X1, · · · , Xl. The first condition is simply a necessary technical requirement since manipulations involving expectations are problematic unless we assume finiteness! The second condition is the essential feature. To motivate this requirement and why such sequences are called martingales we look at the following basic example:

Example 1.1.1 Suppose we are playing a fair game such as receiving (resp. paying) $1 for every appearance of an H (resp. a T) in tosses of a fair coin. Let Xn denote the total winnings after n trials. Let An denote the partition of the probability space according to the random variables X1, · · · , Xn. This means for every set of integers (a1, · · · , an) we define the subset A_{(a1,···,an)} as the set of ω ∈ Ω (i.e., paths) such that Xj = aj. In this case of course A_{(a1,···,an)} is either empty or consists of a single path. Since the coin is assumed to be fair, the expected winnings after the (n+1)st trial are the same as the actual winnings after the nth trial. This is just the second requirement. There was a misconception that in a fair game if one player doubles his/her last bet after every loss, then he/she will surely win. This strategy is called a martingale in gambling circles. One of the basic theorems about martingales is that under certain conditions no strategy will change the expected winnings of a fair game, which is zero. Intuitively, the flaw in the application of the martingale strategy to gambling is that while the player may win many times, the actual amount won every time is small. On the other hand, since no player has an infinite amount of money, there is positive probability of losing everything. On balance the expected winnings do not change. Many have lost considerable sums of money by relying on the appearance of winning strategies which do not exist. ♠
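As a small illustration of the point made in the example, the Python sketch below simulates the doubling strategy over a fixed horizon of fair coin tosses; the horizon and the number of trials are arbitrary choices made only for this illustration.

import random

def doubling_strategy(n_tosses):
    winnings, bet = 0, 1
    for _ in range(n_tosses):
        if random.random() < 0.5:      # H: win the current bet, restart at $1
            winnings += bet
            bet = 1
        else:                          # T: lose the bet, double the next one
            winnings -= bet
            bet *= 2
    return winnings

trials = 200_000
avg = sum(doubling_strategy(10) for _ in range(trials)) / trials
print(avg)   # close to 0: the strategy does not change the expected winnings

The empirical average hovers around 0, in line with the optional stopping results stated below.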

The significance of martingales is based on two fundamental theorems which we state and apply in this subsection and prove in the next. The idea behind the first theorem is making mathematics out of the fact that changing strategies does not affect the fairness of a game. To do so we need some definitions. Assume we are given a sequence of partitions An with An ≺ An+1. A random variable T : Ω → Z+ such that the set {ω ∈ Ω | T(ω) = n} is a union of subsets of An is called a Markov time. If in addition P[T < ∞] = 1, then T is called a stopping time. There is an important technical condition which appears in connection with many theorems involving martingales. It is called uniform integrability. Let Xn be a sequence of random variables. For a positive real number c let

Ω_{(n,c)} = {ω ∈ Ω | |Xn(ω)| > c},

and

I_{|Xn|>c}(ω) = 1 if ω ∈ Ω_{(n,c)};  0 otherwise.

That is, I_{|Xn|>c} is the indicator function of the set Ω_{(n,c)}. Uniform integrability of the sequence Xn means

sup_n E[|Xn| I_{|Xn|>c}] → 0 as c → ∞.

We will discuss the meaning and significance of this condition later. For the time being we will not dwell on uniform integrability and try to understand the implications of the martingale property assuming that this technical condition is fulfilled. The first fundamental result about martingales is a pair of related propositions known as the Optional Stopping Time and Optional Sampling Time theorems.

Proposition 1.1.1 Let (Xn, An) be a martingale, T a stopping time, and assume

1. E[|XT |] <∞;

2. E[Xn | T > n]P [T > n] −→ 0 as n→∞.

Then

E[XT ] = E[X1].

If (Xn, An) is a submartingale and the same hypotheses hold then E[XT] ≥ E[X1].


Proposition 1.1.2 Let (Xn, An) be a uniformly integrable martingale, and X∞ be its limit, then for any stopping time T1 we have

E[X∞ | AT1 ] = XT1 .

If T2 is any stopping time and P [T2 ≥ T1] = 1, then

E[XT2 | AT1] = XT1.

If (Xn, An) is a uniformly integrable submartingale, and the same hypotheses hold, then the same assertions are valid after replacing = by ≥.

To understand the meaning of these results in the context of games, note that T (the stopping time) is the mathematical expression of a strategy in a game. Proposition 1.1.1 asserts that the expected winnings E[XT] is in fact independent of the strategy and equal to E[X1]. To better appreciate the significance of these results and to maintain the continuity of the treatment we consider some examples.

Example 1.1.2 To appreciate the significance of the conditions of proposition 1.1.1 let us assume a person is playing a fair game and his/her rather bizarre strategy is to stop playing once he/she loses $100. Obviously with this strategy the assertion E[XT] = E[X1] of the proposition is no longer valid. However this does not violate proposition 1.1.1. Mathematically, the given strategy assigns to each path the time it reaches −$100. It is a simple exercise to show that T is a stopping time, but the second condition in proposition 1.1.1 is not fulfilled. The purpose of this example was to show that some hypothesis is necessary to make the remarkable conclusion of propositions 1.1.1 or 1.1.2 valid. This motivates the technical condition of uniform integrability which will be introduced later. ♠
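The failure mode can be seen numerically. In the following sketch the loss threshold is taken to be $10 rather than $100, and the game is capped at a finite horizon, purely to keep the simulation fast; both numbers are arbitrary.

import random

loss_level, horizon, trials = -10, 2000, 20_000
stopped_vals, capped_vals = [], []
for _ in range(trials):
    x = 0
    for _ in range(horizon):
        x += 1 if random.random() < 0.5 else -1
        if x == loss_level:
            break
    capped_vals.append(x)                       # X_{T ∧ horizon}
    if x == loss_level:
        stopped_vals.append(x)                  # X_T on the paths that stopped

print(sum(capped_vals) / trials)                # ≈ 0: a bounded stop keeps the mean
print(sum(stopped_vals) / len(stopped_vals))    # exactly the loss level on stopped paths

The stopped paths all end at the loss level, while the rare paths that have not yet stopped carry large positive values; it is exactly this mass that prevents E[Xn | T > n] P[T > n] from tending to 0.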

Example 1.1.3 Consider the simple random walk S0 = k, Sl = Xl + Sl−1 on Z which moves one unit to the right with probability p and one unit to the left with probability q = 1 − p. Assume p ≠ q and set

Yl = (q/p)^{Sl}.

Let Al be the partition defined by the random variables X1, · · · , Xl. Then (Yl, Al) is a martingale. In fact we have


E[Yl+1 | Al] = E[(q/p)^{Xl+1 + Sl} | Al]
             = (q/p)^{Sl} E[(q/p)^{Xl+1} | Al]
             = (q/p)^{Sl} [p·(q/p) + q·(p/q)]
             = Yl.

Now assume 0 < k < N and let T denote the time the random walk hits 0 or N. Then proposition 1.1.1 implies

E[(q/p)^{ST}] = E[Y0] = (q/p)^k.

Let

pk = P[Absorption at 0 | S0 = k],   1 − pk = P[Absorption at N | S0 = k].

Then

E[YT] = (q/p)^0 pk + (q/p)^N (1 − pk) = (q/p)^k.

Solving for pk we obtain

pk = (α^k − α^N) / (1 − α^N), where α = q/p.

We could have derived this familiar result for the gambler's ruin problem by more elementary means, but the use of martingales simplifies the treatment and demonstrates how martingales are applied. ♠
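A quick way to see the formula in action is to simulate the walk until absorption and compare the empirical absorption frequency with (α^k − α^N)/(1 − α^N); the values of p, k and N below are arbitrary.

import random

p, k, N, trials = 0.45, 3, 10, 100_000
alpha = (1 - p) / p

hits_zero = 0
for _ in range(trials):
    s = k
    while 0 < s < N:
        s += 1 if random.random() < p else -1
    hits_zero += (s == 0)

print(hits_zero / trials)                       # simulated p_k
print((alpha**k - alpha**N) / (1 - alpha**N))   # martingale formula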

Example 1.1.4 Let us consider a coin tossing experiment where H's appear with probability p and T's with probability q = 1 − p. We want to calculate the expected time of the appearance of a given pattern. Let us begin with a very simple pattern, say TT. First we have to construct a martingale which allows us to make the calculation. Assume there is a casino and one player enters the casino at the passage of each unit of time. Let us assume that every player bets $1 on T at first. If H appears the player loses $1. If T appears the player collects $(1 + p/q) to make the game fair. Since we are interested in the appearance of the pattern TT, we make the stipulation that if T appears, then the player bets the entire $(1 + p/q) on the next bet. It should be emphasized that a new player enters the casino after the passage of a unit of time regardless of whether H or T appears. Let Sl denote the total winnings of the casino after passage of l units of time, and T be the time of the appearance of TT. If T = N then

S_{N−1} = N − 2 − p/q,   S_N = N − 2 − p/q − (p/q)(1 + p/q) − p/q.

On the other hand, proposition 1.1.1 tells us E[ST] = E[S0] = 0. Therefore taking the expectation of S_T yields, after a simple calculation,

E[T] = 1/q + 1/q²

as the expected time of the first appearance of TT. Of course we had calculated this quantity by using generating functions (see example ??). It should be pointed out that in this example, and in the application of martingales generally, one should judiciously construct a sequence of random variables which form a martingale and it should be possible to make the calculation of the relevant quantity (ST in this case) easily. This is often the tricky step in the application of martingales. The fact that we chose a very simple pattern TT is not essential, and the argument works for complex patterns as well. ♠
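The answer is easy to test by simulation; q and the number of trials below are arbitrary choices.

import random

q, trials = 0.4, 200_000
total = 0
for _ in range(trials):
    prev_tail, t = False, 0
    while True:
        t += 1
        tail = random.random() < q
        if tail and prev_tail:
            break
        prev_tail = tail
    total += t

print(total / trials)          # simulated expected waiting time for TT
print(1 / q + 1 / q ** 2)      # the martingale answer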

An important and intuitive corollary to proposition 1.1.1 is

Corollary 1.1.1 (Wald Identity) Let X1, X2, · · · be a sequence of iid random variables with E[|Xi|] < ∞. Let T be a stopping time for the sequence Xi with E[T] < ∞. Then

E[X1 + · · ·+XT ] = E[X1]E[T ].

If furthermore E[Xi²] < ∞, then

E[(X1 + · · · + XT − T E[X1])²] = E[T] Var[X1].

Proof - Let An be the partition defined by the random variables X1, · · · , Xn. Set

Yl = X1 + · · ·+Xl − lE[X1].

It is clear that the sequence (Yl,Al) is a martingale with E[Yl] = 0. Therefore

0 = E[YT ] = E[X1 + · · ·+XT ]− E[T ]E[X1],


whence the first assertion. Set

Zl = (X1 + · · · + Xl − l E[X1])² − l Var[X1].

Then (Zl, Al) is a martingale, and applying proposition 1.1.1 we obtain the second assertion. ♣

Exercise ?? shows that Wald's identity is not valid if T is not a stopping time.
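As a sanity check on the first identity, the sketch below uses iid uniform summands and takes T to be the first time the partial sum exceeds a fixed level (a stopping time with finite expectation); the level and the distribution are arbitrary choices.

import random

level, trials = 25.0, 50_000
mean_X = 1.5                       # X_i uniform on [0, 3], so E[X_1] = 1.5
sum_ST, sum_T = 0.0, 0
for _ in range(trials):
    s, t = 0.0, 0
    while s < level:
        s += random.uniform(0.0, 3.0)
        t += 1
    sum_ST += s
    sum_T += t

print(sum_ST / trials)             # estimate of E[X_1 + ... + X_T]
print(mean_X * sum_T / trials)     # E[X_1] times the estimate of E[T]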

The second basic theorem on martingales is about convergence. It is useful to clarify the meaning of some of the notions of convergence which are commonly used in probability and understand their relationship. Let X1, X2, · · · be a sequence of random variables (defined on a probability space Ω).

1. (Lp Convergence) - Xn converges to a random variable X (defined on Ω) in Lp if

lim_{n→∞} E[|Xn − X|^p] = 0.

We will be mainly interested in p = 1, 2. The Cauchy-Schwarz inequality in this context can be stated as

|E[WZ]| ≤ √(E[W²] E[Z²]),

where W and Z are real-valued random variables. Substituting W = |Xn − X| and Z = 1 we deduce that convergence in L2 implies convergence in L1. Note that if Xn → X in L1 then E[Xn] converges to E[X].

2. (Pointwise Convergence) - Xn converges to X pointwise if for every ω ∈ Ω the sequence of numbers Xn(ω) converges to X(ω). Pointwise convergence does not imply convergence in L1 and E[Xn] may not converge to E[X] as shown in example ??. It is often more convenient to relax the notion of pointwise convergence to that of almost pointwise convergence. This means there is a subset Ω′ ⊂ Ω such that P[Ω′] = 1 and Xn → X pointwise on Ω′. We have already seen how deleting a set of probability zero makes it possible to make precise statements about the behavior of a sequence of random variables. For example, with probability 1 (i.e., in the complement of a set of paths of probability zero) a transient state is visited only finitely many times. Or the (strong) law of large numbers states that for an iid sequence of random variables X1, X2, · · · with mean µ = E[Xj], we have almost pointwise convergence of the sequence (X1 + · · · + Xn)/n to µ. Related to almost pointwise convergence is the weaker notion of convergence in probability which, for example, in the case of the Law of Large Numbers leads to the Weak Law of Large Numbers.

3. (Convergence in Distribution) - Let Fn be the distribution function of Xn and F that of X. Assume F is a continuous function. Xn converges to X in distribution means for every x ∈ R we have

P [Xn ≤ x] = Fn(x) −→ F (x) = P [X ≤ x].

The standard statement of the Central Limit Theorem is about convergence in distribution. Convergence in distribution does not imply almost pointwise convergence, but almost pointwise convergence implies convergence in distribution.

The essential point for our immediate application is that convergence in L1 implies convergence of expectations; however, weaker notions of convergence do not.

Recall the statement of the submartingale convergence theorem:

Theorem 1.1.1 Let (Xn,An) be a submartingale and assume

sup_n E[Xn⁺] < ∞,   |E[X1]| < ∞.

Then Xn converges almost everywhere to an integrable random variable X∞. If the submartingale (Xn, An) is uniformly integrable, then the convergence is also in L1.

The following example clarifies the need for the condition of uniform integrability.

Example 1.1.5 Consider the sequence of functions fn on [0, 1] defined as (draw a picture to see a sequence of spike functions)

fn(x) = 2^{2n} x             if 0 ≤ x ≤ 2^{−n},
        −2^{2n} x + 2^{n+1}  if 2^{−n} < x ≤ 2^{−n+1},
        0                    otherwise.


It is clear that

∫₀¹ fn(x) dx = 1,

and the sequence fn converges to the zero function everywhere on [0, 1]. Therefore

lim_{n→∞} ∫₀¹ fn(x) dx ≠ ∫₀¹ lim_{n→∞} fn(x) dx.

This is obviously an undesirable situation which should be avoided since when dealing with limiting values we want the integrals (e.g. expectations) to converge to the right values. The assumption of uniform integrability eliminates cases like this when we cannot interchange limit and integral. It is easy to see that the uniform integrability condition is not satisfied for the sequence fn. ♠
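The last sentence is easy to confirm numerically: the integral of each fn is 1, yet the truncated expectation E[fn 1_{fn > c}] stays near 1 as soon as the spike height 2^n exceeds c, so it cannot tend to 0 uniformly in n. The grid sizes and the cutoff c in the sketch below are arbitrary.

import numpy as np

def f(n, x):
    up   = np.where((x >= 0) & (x <= 2.0**-n), 2.0**(2 * n) * x, 0.0)
    down = np.where((x > 2.0**-n) & (x <= 2.0**(-n + 1)),
                    -(2.0**(2 * n)) * x + 2.0**(n + 1), 0.0)
    return up + down

c = 100.0
for n in [5, 10, 15]:
    m = 400_000
    dx = 2.0**(-n + 1) / m
    x = np.arange(m) * dx                       # grid covering the support of f_n
    y = f(n, x)
    total = y.sum() * dx                        # ≈ 1 for every n
    tail = np.where(y > c, y, 0.0).sum() * dx   # ≈ 1 once the spike height 2^n > c
    print(n, round(total, 4), round(tail, 4))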

Example 1.1.6 Let Ω be a probability space, A a σ-algebra, and X an A-admissible random variable with the property E[|X|] < ∞. Let An be a family of σ-algebras with An ≺ An+1 ≺ A for all n. Let Yn = E[X | An]. We show that (Yn, An) is a uniformly integrable martingale. We have

E[|Yn|] = E[|E[X|An]|] ≤ E[E[|X| | An]] = E[|X|] < ∞;

and

E[Yn+1 | An] = E[E[X|An+1] | An] = E[X | An] = Yn,

proving the martingale property. The martingale (Yn, An) is uniformly integrable. To prove this set Zn = E[|X| | An] and note that

E[(|X| − Zn)I[Zn≥a]] = 0,

where I_A denotes the indicator function of the set A. Therefore

E[|X| I_{[Zn≥a]}] = E[E[|X| | An] I_{[Zn≥a]}] ≥ E[|E[X|An]| I_{[Zn≥a]}] = E[|Yn| I_{[Zn≥a]}].

By the Markov inequality

P[Zn ≥ a] ≤ E[Zn]/a = E[|X|]/a,


and therefore P[Zn ≥ a] → 0 as a → ∞ uniformly in n. Since |X| is integrable, E[|X| I_{[Zn≥a]}] → 0 as a → ∞ uniformly in n, which implies uniform integrability of Yn. ♠

Example 1.1.7 A special case of Doob's martingale is when Ω = [0, 1], X is an integrable function on [0, 1] and An is the partition of [0, 1] into 2^n intervals of length 1/2^n. Then Xn is the step-function approximation to X which on each interval of the partition equals the average value of X on that interval (the integral of X over the interval divided by its length 1/2^n). In this fashion Doob's martingale gives a sequence of approximations to X where by increasing n the level of detail increases. ♠
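A minimal numerical sketch of this construction, with [0, 1] discretized on a fine grid and an arbitrary choice of integrable function X:

import numpy as np

M = 2 ** 14                                   # fine grid representing [0, 1]
x = (np.arange(M) + 0.5) / M
X = np.sqrt(x) * np.sin(8 * np.pi * x)        # an arbitrary integrable function

for n in [1, 3, 5, 7]:
    blocks = X.reshape(2 ** n, -1)                       # the 2^n dyadic intervals
    X_n = np.repeat(blocks.mean(axis=1), M // 2 ** n)    # E[X | A_n]: averages on blocks
    print(n, np.mean(np.abs(X_n - X)))                   # L^1 distance shrinks with n

The printed L^1 distances decrease with n, illustrating the convergence promised by corollary 1.1.2.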

It is a remarkable fact about uniformly integrable martingales that they are necessarily of the form of Doob's martingale in the sense described below:

Corollary 1.1.2 Let (Xn, An) be a uniformly integrable martingale, then Xn → X∞ almost everywhere and in L1 and Xn = E[X∞|An] almost everywhere.

Proof - In view of theorem 1.1.1 it only remains to show Xn = E[X∞|An]. It suffices to show that for every A ∈ An of positive probability, we have

Xn = E[Xn|A] = E[X∞|A].

In view of the martingale property, for m ≥ n we have E[Xn|A] = E[Xm|A]. Now

|E[Xm|A] − E[X∞|A]| ≤ E[|Xm − X∞| | A] ≤ E[|Xm − X∞|] / P[A],

which tends to 0 as m → ∞. The required result follows. ♣

At this point it is instructive to prove Kolmogorov's zero-one law. The proof elegantly uses some of the concepts we have introduced.

Corollary 1.1.3 Let X1, X2, · · · be independent random variables. Let Bn be the partition defined by the random variables Xn, Xn+1, · · · and T = ∩Bn. Let A ∈ T , then P[A] = 0 or 1.

Proof - Let An be the partition defined by the random variables X1, · · · , Xn, and A∞ = lim_{n→∞} An. It is clear from the hypotheses that A ∈ A∞. Let I_A denote the indicator function of the set A, and set Yn = E[I_A | An]. From A ∈ A∞ it follows that

Yn −→ E[IA | A∞] = IA (1.1.4)

almost surely and in L1. Since the random variables are independent and A ∈ T = ∩Bn, A is independent of An for every n. Therefore

Yn = E[IA | An] = P [A]. (1.1.5)

From (1.1.4) and (1.1.5) it follows that P[A] = 0 or 1. ♣

It is customary to call an event A ∈ T a tail event and a T-admissible random variable a tail function. An equivalent way of stating Kolmogorov's 0-1 law is

Corollary 1.1.4 A tail function is constant almost everywhere.

Example 1.1.8 Let us give a simple example of a tail event. Let a1, a2, · · · be random numbers chosen from [0, 1] independently and according to the uniform distribution. Whether or not the series Σ an e^{i2πnθ} converges is a tail event since it does not depend on any finite number of the an's. Therefore with probability one either all such series converge or diverge. It is not difficult to convince oneself that the latter alternative is the case. ♠

As another application of martingales we derive the Strong Law of Large Numbers (SLLN). This requires introducing the notion of a reverse martingale, which is a sequence X1, X2, · · · of random variables together with partitions An ≻ An+1 such that

E[Xn | An+1] = Xn+1.

Note that for reverse martingales the partition An is finer than An+1, as opposed to the case of martingales where the opposite is true.

Example 1.1.9 For our applications the reverse martingale which is of interest is the analogue of Doob's martingale. In fact assume that X is an integrable random variable and An ≻ An+1 are partitions, where n = 1, 2, · · ·. Then Yn = E[X|An] together with the partitions An defines a reverse martingale. Let A∞ = ∩An. Then

Yn −→ E[X | A∞]


almost everywhere and in L1. The proof of this fact is similar to that in the case of Doob's martingale and will not be repeated. ♠

Corollary 1.1.5 Let X1, X2, · · · be iid random variables with E[|Xk|] < ∞ and µ = E[Xk], and Sn = X1 + · · · + Xn. Then Sn/n converges to µ almost everywhere and in L1.

Proof - It is clear that for k ≤ n

E[Xk | Sn] = Sn/n   almost everywhere. (1.1.6)

Independence of the random variables X1, · · · , Xn from Xn+1, Xn+2, · · · and (1.1.6) imply

Sn/n = E[X1 | Sn, Xn+1, Xn+2, · · ·]
     = E[X1 | Sn, Sn+1, Sn+2, · · ·].

Let Bn denote the partition generated by Sn, Sn+1, Sn+2, · · · and T = ∩Bn. Since lim_{n→∞} E[X1|Bn] is a tail function, it is a constant c by corollary 1.1.4. From example 1.1.9 the convergence of E[X1|Bn] to c is almost everywhere and in L1. The constant c is necessarily µ since E[Sn/n] = µ. ♣
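A one-path illustration of the corollary (the exponential distribution and the sample sizes are arbitrary choices):

import random

mu = 2.0
for n in [10, 1_000, 100_000]:
    s = sum(random.expovariate(1 / mu) for _ in range(n))
    print(n, s / n)        # S_n / n approaches mu = 2.0 as n grows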

We now give some examples of martingales which directly involve Markov chains. Let X0, X1, · · · be a Markov chain with transition matrix P. Linear combinations of the states of the Markov chain are given as row vectors and functions on the state space are described as column vectors. Thus the action of the Markov chain on functions on the state space is by f → Pf. A function f on the state space is called harmonic (resp. subharmonic) if Pf = f (resp. Pf ≥ f).

Example 1.1.10 In this example we show how a Markov chain and a bounded function f on the state space give rise to a martingale. Let P denote the transition matrix of the Markov chain X0, X1, · · ·. Define

Y^f_n = f(Xn) − f(X0) − Σ_{k=0}^{n−1} (P − I)f(Xk).

Here (P − I)f(Xk) is the obvious matrix/vector multiplication. The boundedness condition on f is necessary to ensure existence of the infinite sums Σ_j P_{ij} f_j and integrability of the random variables Y^f_n when the state space is infinite. Let An be the partition defined by X0, · · · , Xn. It is easy to see that (Y^f_n, An) is a martingale. In fact, we have

Y^f_{n+1} − Y^f_n = f(Xn+1) − Pf(Xn).

Since E[f(Xn+1)|An] = Pf(Xn) we have

E[Y^f_{n+1}|An] − Y^f_n = E[Y^f_{n+1} − Y^f_n | An] = 0

proving the martingale property. This martingale is often called Levy's martingale. ♠
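The following Python sketch builds Y^f_n along simulated paths of a small three-state chain and checks that its expectation stays at Y^f_0 = 0; the transition matrix, the function f, the horizons and the sample size are all arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.6, 0.2, 0.2]])
f = np.array([1.0, -2.0, 4.0])
Pf = P @ f                                    # (Pf)(i) = sum_j P_ij f(j)

def levy_path(n_steps, start=0):
    """Y^f_n = f(X_n) - f(X_0) - sum_{k<n} (P - I)f(X_k) along one simulated path."""
    x, y = start, 0.0
    for _ in range(n_steps):
        y -= Pf[x] - f[x]                     # subtract (P - I)f(X_k)
        x_new = rng.choice(3, p=P[x])
        y += f[x_new] - f[x]                  # add f(X_{k+1}) - f(X_k)
        x = x_new
    return y

trials = 20_000
for n in [1, 5, 20]:
    print(n, np.mean([levy_path(n) for _ in range(trials)]))   # ≈ 0 for every n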

To a Markov chain X0, X1, · · · with transition matrix P we associate the operator P − I defined on functions on the state space. This operator in some ways resembles the Laplace operator on a domain in R^n or a manifold, and the analogy has led to a number of interesting results on Markov chains. Assume the transition matrix has the property that Pij ≠ 0 if and only if Pji ≠ 0. Then to the Markov chain we assign a graph whose vertices are the set of states and two vertices i and j are connected by an edge if Pij ≠ 0. For our purposes here the graph and the assumption on the transition matrix are not essential and are noted only for better pictorial representation of the Markov chain. A subset U ⊂ S is called a domain and its complement is denoted by U′. The boundary ∂U of U consists of the set of vertices (states) k ∈ U′ such that there is j ∈ U with Pjk ≠ 0. A problem which is analogous to a classical boundary value problem and is of intrinsic interest, for example in Markov decision problems, is to solve the equation

(P − I)u = φ on U, subject to u = ψ on ∂U. (1.1.7)

The solution u can in fact be defined on the entire state space S. The domain U can have empty boundary, for example if U = S. As before we let Ej[Y], where Y is a random variable depending on the Markov chain, denote the expectation of Y conditioned on X0 = j.

Proposition 1.1.3 Let T be the first hitting time of the boundary ∂U and assume for every j ∈ U, P[T < ∞] = 1. Then the function

u(j) = Ej[ψ(X_T) − Σ_{k=0}^{T−1} φ(Xk)],

defined for j ∈ U ∪ ∂U, is a solution to (1.1.7). If φ and ψ are non-negative then u is the unique bounded non-negative solution.

Proof - The fact that u solves the problem (1.1.7) follows easily by conditioning on X1 and is left to the reader. To prove uniqueness let u be a solution to (1.1.7) and Y^u_n = Yn be the Levy martingale associated to u:

Yn = u(Xn) − u(X0) − Σ_{k=0}^{n−1} (P − I)u(Xk).

For every integer m, T ∧m is also a stopping time and it follows that

Ej[Y_{T∧m}] = Ej[Y0] = 0.

Therefore the fact that (P − I)u = φ on U implies

u(j) = Ej[u(X_{T∧m}) − Σ_{k=0}^{(T∧m)−1} φ(Xk)].

Now taking the limit m → ∞ we see that

u(j) = Ej[ψ(X_T) − Σ_{k=0}^{T−1} φ(Xk)],

which is the desired uniqueness. Notice that in the application of the limit m → ∞ we made use of the assumption of non-negativity since otherwise the monotone convergence theorem may not have been applicable. ♣
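To see proposition 1.1.3 at work, the sketch below sets up a small simple random walk, solves (P − I)u = φ on U with u = ψ on ∂U as a linear system, and then checks one value of u against a Monte Carlo estimate of Ej[ψ(X_T) − Σ_{k<T} φ(Xk)]. Every numerical choice (the chain, U, φ, ψ, the sample size) is an arbitrary illustration.

import numpy as np

rng = np.random.default_rng(1)
S = 6                                    # states 0..5, interior U = {1,2,3,4}
P = np.zeros((S, S))
for j in range(1, S - 1):                # simple random walk in the interior
    P[j, j - 1] = P[j, j + 1] = 0.5
P[0, 0] = P[S - 1, S - 1] = 1.0          # boundary rows (not used by the formula)

U = [1, 2, 3, 4]
boundary = {0: 2.0, 5: -1.0}             # the values of ψ on ∂U
phi = {1: 0.3, 2: -0.2, 3: 0.1, 4: 0.0}  # the values of φ on U

# Linear system: for j in U,  sum_i P[j,i] u(i) - u(j) = φ(j),  u = ψ on ∂U.
A = np.zeros((len(U), len(U)))
b = np.zeros(len(U))
for r, j in enumerate(U):
    b[r] = phi[j]
    for c, i in enumerate(U):
        A[r, c] = P[j, i] - (1.0 if i == j else 0.0)
    for i, psi in boundary.items():
        b[r] -= P[j, i] * psi            # move the known boundary terms to the right side
u = np.linalg.solve(A, b)
print(dict(zip(U, np.round(u, 4))))

# Monte Carlo check of u(2) = E_2[ψ(X_T) - sum_{k<T} φ(X_k)].
vals = []
for _ in range(100_000):
    x, acc = 2, 0.0
    while x in phi:                      # while the walk is still inside U
        acc -= phi[x]
        x = x - 1 if rng.random() < 0.5 else x + 1
    vals.append(acc + boundary[x])
print(np.mean(vals))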

1.2 Some Proofs

There are results similar in spirit to the Optional Stopping Time theorem which are useful in martingale theory and in the proof of theorem 1.1.1. One such result is lemma 1.2.1 below which we first describe as a strategy in a game of chance. Consider an individual playing a game of chance against a casino. The game is played at regular intervals and every time the player has the option of playing or not playing. The player decides to bet or not bet at the (k + 1)st trial (codified by εk+1 = 1 or 0 in lemma 1.2.1) according to the information he/she has from the first k trials. The point of lemma 1.2.1 is that no matter what strategy is used (i.e., choice of εk's) the expected value will not improve. More precisely, we have


Lemma 1.2.1 Let (Xn, An) be a submartingale, and εk, k = 1, 2, · · ·, be Ak-admissible random variables taking values 1 and 0 only. Set Y1 = X1 and

Yk+1 = Yk + εk(Xk+1 −Xk).

Then E[Yn] ≤ E[Xn]. If (Xn,An) is a martingale, then E[Yn] = E[Xn].

Proof - We have

E[Yn+1 | An] = E[Yn + εn(Xn+1 − Xn) | An]
             = Yn + εn E[Xn+1 − Xn | An]
             ≥ Yn + εn(Xn − Xn)
             = Yn.

Therefore (Yn, An) is a submartingale (or martingale if (Xn, An) is so). We show by induction on n that E[Yn] ≤ E[Xn]. For n = 1 the result is obvious. Since

Xn+1 − Yn+1 = (1− εn)(Xn+1 −Xn) +Xn − Yn,

the submartingale property and the induction hypothesis imply

E[Xn+1 − Yn+1 | An] ≥ E[Xn − Yn | An] ≥ 0. (1.2.1)

Notice that if (Xn, An) were a martingale, then the inequalities in (1.2.1) can be replaced by =. The required result follows by taking another E and using E[E[Z | An]] = E[Z]. ♣

For a random variable X we let

X+ = max(X, 0), X− = min(X, 0).

It is immediate that if (Xn, An) is a submartingale, then so is the sequence (Xn⁺, An).

The key point in the proof of the (sub)martingale convergence theorem is the upcrossing lemma 1.2.2 below. We need some notation. Let X1, · · · , XN be real-valued random variables (defined as usual on a probability space Ω), and a < b be real numbers. For ω ∈ Ω let T1(ω) be the smallest integer i1 ≤ N such that X_{T1}(ω) ≤ a. If no such integer exists, we set T1(ω) = ∞. Let T2(ω) be the smallest integer i2, i1 < i2 ≤ N, such that X_{T2}(ω) ≥ b, with the same proviso T2(ω) = ∞ if no such i2 exists. Similarly, let T3(ω) be the smallest integer i3, i2 < i3 ≤ N, such that X_{T3}(ω) ≤ a, etc. For each ω let l be the number of finite Ti's and define the upcrossing random variable as

U_{ab}(ω) = l/2 if l is even;  (l − 1)/2 if l is odd.

Now we can prove

Lemma 1.2.2 With the above notation if (Xk, Ak), k = 1, 2, · · · , N, is a submartingale, then

E[U_{ab}] ≤ (1/(b − a)) E[(X_N − a)⁺].

Proof - It suffices to prove the lemma for the case a = 0 and Xj taking only non-negative values. In fact, we may consider ((Xk − a)⁺, Ak), which is also a submartingale, and the validity of the lemma for it implies that of the general case. For the proof of the assertion for a = 0 and Xj ≥ 0, define

εj(ω) = 1 if T1 ≤ j < T2, or T3 ≤ j < T4, or · · ·;
        0 if T2 ≤ j < T3, or T4 ≤ j < T5, or · · ·.

Define random variables Yk according to lemma 1.2.1. It is clear that YN(ω) ≥ b U_{0b}(ω) and therefore lemma 1.2.1 implies

E[U_{0b}] ≤ (1/b) E[YN] ≤ (1/b) E[XN],

which is the required result. ♣

Proof of Submartingale Convergence Theorem - The set

{ω | Xk(ω) does not converge to a finite or infinite limit}

coincides with the set

∪_{a,b∈Q} {ω | lim inf_{k→∞} Xk(ω) ≤ a < b ≤ lim sup_{k→∞} Xk(ω)}.

(Q is the set of rational numbers.) Therefore if Xk does not converge on a subset of Ω of positive probability, then

P[{ω | lim inf_{k→∞} Xk(ω) ≤ a < b ≤ lim sup_{k→∞} Xk(ω)}] > 0


for some rational numbers a < b. It follows that Xk has an infinite number of upcrossings on a set of positive probability and consequently E[U_{ab}] = ∞. Clearly U_{ab} is the monotone limit of U_{ab,N} as N → ∞, where U_{ab,N} means we are computing upcrossings for the random variables X1, · · · , XN only. Therefore

E[Uab,N ] −→ E[Uab].

Now

E[U_{ab,N}] ≤ (1/(b − a)) E[(X_N − a)⁺] ≤ (1/(b − a)) (sup_N E[X_N⁺] + a⁻) < ∞

contradicting E[U_{ab}] = ∞. Therefore E[U_{ab}] < ∞ and XN → X∞ almost everywhere. It remains to show X∞ is integrable. We have |XN| = 2XN⁺ − XN and the submartingale property implies E[XN] ≥ E[X1]. Therefore

E[|XN|] ≤ 2 sup_N E[XN⁺] − E[X1] < ∞.

By Fatou's lemma E[|X∞|] ≤ lim inf E[|XN|] < ∞ and therefore X∞ is integrable. Convergence in L1 of uniformly integrable submartingales follows from general measure theory. ♣

Propositions 1.1.1 and 1.1.2 are in the same spirit, and their proofs have much in common as demonstrated below.

Lemma 1.2.3 With the above notation let T∧n = min(T, n), then E[X_{T∧n}] = E[X1].

Proof - We have

E[Xn] = Σ_{i=1}^{n} E[Xn | T = i] P[T = i] + E[Xn | T > n] P[T > n]
      = Σ_{i=1}^{n} E[E[Xn | A1, · · · , Ai] | T = i] P[T = i] + E[Xn | T > n] P[T > n]
      = Σ_{i=1}^{n} E[Xi | T = i] P[T = i] + E[Xn | T > n] P[T > n]
      = E[XT | T ≤ n] P[T ≤ n] + E[Xn | T > n] P[T > n]
      = E[X_{T∧n}].

Since (Xn, An) is a martingale, we have E[Xn] = E[X1], and the required result follows. ♣

Proof of proposition 1.1.1 - We have

E[XT ] = E[XT | T ≤ n]P [T ≤ n] + E[XT | T > n]P [T > n].


From the last equality in the proof of lemma 1.2.3 it follows that

E[XT ] = E[XT∧n]−E[Xn | T > n]P [T > n]+E[XT | T > n]P [T > n]. (1.2.2)

Hypothesis (2) implies E[Xn | T > n] P[T > n] tends to 0 as n → ∞. From hypothesis (1) we have

E[|XT|] = Σ_{k=1}^{∞} E[|XT| | T = k] P[T = k] < ∞.

Therefore the tail of the convergent series on the right hand side tends to 0. It follows that

E[XT | T > n] P[T > n] ≤ Σ_{k=n+1}^{∞} E[|XT| | T = k] P[T = k] −→ 0

as n → ∞. Taking the limit n → ∞ on the right hand side of (1.2.2) we obtain the desired result for martingales. A minor modification of the same argument implies the assertion for submartingales. ♣

For the proof of proposition 1.1.2 we need

Lemma 1.2.4 Let (Xn, An) be a martingale and T a stopping time such that with probability 1, T ≤ N < ∞, then

E[XN |AT ] = XT .

Proof - For A ∈ AT of positive probability we have

E[XN I_A] = Σ_{n=1}^{N} E[XN I_{A∩[T=n]}].

In view of the martingale property this becomes

E[XN I_A] = Σ_{n=1}^{N} E[Xn I_{A∩[T=n]}] = E[XT I_A],

proving the lemma. ♣

Proof of proposition 1.1.2 - To prove the first assertion it suffices to show that for A ∈ A_{T1} we have

E[X∞IA] = E[XT1IA]. (1.2.3)


In view of corollary 1.1.2 and lemma 1.2.4 we have

E[X∞|Am] = Xm, E[Xm|AT1∧m] = XT1∧m.

It is clear that E[X∞|AT1∧m] = XT1∧m and consequently

E[X∞IA∩[T1≤m]] = E[XT1∧mIA∩[T1≤m]] = E[XT1IA∩[T1≤m]].

Taking the limit m → ∞ and applying the monotone convergence theorem to the positive and negative parts of X∞ we obtain (1.2.3). The assertion E[X_{T2}|A_{T1}] = X_{T1} follows from the first assertion, corollary ?? and the general property E[E[X∞|A_{T2}]|A_{T1}] = E[X∞|A_{T1}]. ♣


1.3 An Idea from Analysis

Spectral theory is one of the most fruitful ideas of analysis. In this and the following subsections we show how this idea can be successfully applied to solve some problems in probability theory. We do not intend to give a sophisticated treatment of the spectral theorem; rather we loosely explain the general idea and apply it to some random walk problems. Even understanding the broad outlines of the theory is a useful mathematical tool.

Let A be an n×n matrix with eigenvalues λ1, · · · , λn counted with multiplicity, and assume A is diagonalizable. The diagonalization of the matrix A can be accurately interpreted as taking the underlying vector space to be the space L(S) of functions (generally complex-valued) on a set S = {s1, · · · , sn} of cardinality n and the matrix A to be the operator of multiplication of an element ψ ∈ L(S) by the function ϕ_A which has value λj at sj. In this manner we have obtained the simplest form that a matrix can reasonably be expected to have. Of course not every matrix is diagonalizable but large classes of matrices, including an open dense subset of them, are. The idea of spectral theory is to try to do the same, to the extent possible, for infinite matrices or linear operators on infinite dimensional spaces. Even when diagonalization is possible for infinite matrices, several distinct possibilities present themselves which we now describe:

1. (Pure Point Spectrum) There is a countably infinite set S = {s1, s2, · · ·} and a complex-valued function ϕ_A defined on S such that the action of the matrix A is given by multiplication by ϕ_A. The underlying vector space is the vector space of square summable functions on S, i.e., if we set ψk = ψ(sk), then Σ_k |ψk|² < ∞. It often becomes necessary to take a weighted sum in the sense that there is a positive weight function 1/ck and the underlying vector space is the space of sequences ψk such that

Σ_k (1/ck) |ψk|² < ∞.

The weight function 1/ck is sometimes called the Plancherel measure.

2. (Absolutely Continuous Spectrum) There is an interval (a, b) (closed or open, a and/or b possibly ∞) and a function ϕ_A such that A is the operator of multiplication of functions on (a, b) by ϕ_A. There is a positive or non-negative function 1/c(λ) (called the Plancherel measure) such that the underlying vector space is the space of functions ψ on (a, b) with the property

∫_a^b |ψ(λ)|² dλ/c(λ) < ∞.

3. (Singular Continuous Spectrum) There is an uncountable set S of Lebesgue measure zero (such as the Cantor set) such that A can be realized as multiplication by a function ϕ_A on S. The underlying vector space is again the space of square integrable functions on S relative to some measure on S. One often hopes that the problem does not lead to this case.

4. Any combination of the above.

The first two cases and their combination are of greatest interest to us. In order to demonstrate the idea we look at some familiar examples and demonstrate the diagonalization process.

Example 1.3.1 Let A be the differentiation operator d/dx on the space of periodic functions with period 2π. Since

(d/dx) e^{inx} = in e^{inx},

in is an eigenvalue of d/dx. Writing a periodic function as a Fourier series

ψ(x) = Σ_{n∈Z} an e^{inx},

we see that the appropriate set S is S = Z and

ϕ_{d/dx}(n) = in.

The Plancherel measure is 1/ck = 1/(2π), and from the basic theory of Fourier series (Parseval's theorem) we know that

∫_{−π}^{π} ψ (d/dx)φ dx = −i Σ_{n∈Z} n an bn,

where ψ(x) = Σ_{n∈Z} an e^{inx} and φ(x) = Σ_{n∈Z} bn e^{inx}. ♠


Example 1.3.2 Let f be a periodic function of period 2π and A_f be the operator of convolution with f, i.e.,

A_f : ψ −→ (1/2π) ∫_{−π}^{π} f(x − y)ψ(y) dy.

Assume f(x) = Σ_n fn e^{inx} and ψ(x) = Σ_n an e^{inx}. Substituting the Fourier series for f and ψ in the definition of A_f we get

A_f(ψ) = (1/2π) Σ_{n,m} e^{imx} ∫_{−π}^{π} fn am e^{i(n−m)y} dy = Σ_m fm am e^{imx}.

This means that in the diagonalization of the operator A_f, the set S is Z, and the function ϕ_{A_f} is

ϕ_{A_f}(n) = fn.

The underlying vector space and Plancherel measure are the same as in example 1.3.1. Thus the Fourier series transforms convolutions into multiplication of Fourier coefficients. The fact that Fourier series simultaneously diagonalizes convolutions and differentiation reflects the fact that convolutions and differentiation commute and commuting diagonalizable matrices can be simultaneously diagonalized. Convolution operators occur frequently in many areas of mathematics and engineering. ♠
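The discrete analogue of this statement is the circular convolution theorem for the discrete Fourier transform, which is easy to verify with NumPy; the length-16 random sequences below are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
f = rng.standard_normal(16)
psi = rng.standard_normal(16)

# circular convolution of f and psi
conv = np.array([sum(f[(n - k) % 16] * psi[k] for k in range(16)) for n in range(16)])
lhs = np.fft.fft(conv)                    # Fourier coefficients of the convolution
rhs = np.fft.fft(f) * np.fft.fft(psi)     # product of the Fourier coefficients
print(np.allclose(lhs, rhs))              # True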

Example 1.3.3 Examples 1.3.1 and 1.3.2 have analogues for functions on R or R^n when the periodicity assumption is removed. For a function ψ of compact support (i.e., ψ vanishes outside a closed interval [a, b]) the Fourier transform is defined by

ψ̃(λ) = ∫_{−∞}^{∞} e^{−iλx} ψ(x) dx.

Integration by parts shows that under the Fourier transform the operator of differentiation d/dx becomes multiplication by iλ:

(dψ/dx)~(λ) = iλ ψ̃(λ).

Similarly let A_f denote the operator of convolution by an integrable function f:

A_f(ψ) = ∫_{−∞}^{∞} f(x − y)ψ(y) dy.

Then a change of variable shows

∫_{−∞}^{∞} e^{−iλx} ∫_{−∞}^{∞} f(x − y)ψ(y) dy dx = f̃(λ)ψ̃(λ).

Therefore the diagonalization process of the differentiation and convolution operators on functions on R leads to case (2) with S = R, d/dx ↔ iλ, and A_f ↔ f̃. For the underlying vector space it is convenient to start with the space of compactly supported functions on R and then try to extend the operators to L2(S). There are technical points which need clarification, but for the time being we are going to ignore them. ♠

Example 1.3.4 Let us apply the above considerations to the simple random walk on Z. The random walk is described by convolution with the function f on Z defined by

f(n) = p if n = 1;  q if n = −1;  0 otherwise.

Convolution on Z is defined similarly to the case on R except that the integral is replaced by a sum:

A_f(ψ)(n) = f ⋆ ψ(n) = Σ_{k∈Z} f(n − k)ψ(k).

Let ej be the function on Z defined by ej(n) = δjn where δjn is 1 if j = n and 0 otherwise. It is straightforward to see that the matrix of the operator A_f relative to the basis ej for the vector space of functions on Z is the matrix of transition probabilities for the simple random walk on Z. Example 1.3.2 suggests that this situation is dual to the one described in that example. For S we take the interval [−π, π]; to the state j corresponds the periodic function e^{ijx}, and the action of the transition matrix P is given by multiplication by the function

p e^{ix} + q e^{−ix} = (p + q) cos x + i(p − q) sin x.

The probability of being in state 0 at time 2l is therefore the constant term in the Fourier expansion of (p e^{ix} + q e^{−ix})^{2l}, viz.,

C(2l, l) p^l q^l,

which we had easily established before. ♠
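The constant Fourier coefficient can be extracted numerically and compared with the binomial expression; p, l and the grid size below are arbitrary.

import numpy as np
from math import comb

p, l = 0.3, 6
q = 1 - p
N = 4096
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
g = (p * np.exp(1j * x) + q * np.exp(-1j * x)) ** (2 * l)
const_term = g.mean().real        # (1/2π) ∫ g(x) dx over [-π, π] by the rectangle rule
print(const_term)
print(comb(2 * l, l) * p ** l * q ** l)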

Example 1.3.5 A more interesting example is the application of the idea of the spectral theorem to the reflecting random walk on Z+ where the point 0 is a reflecting barrier. The matrix of transition probabilities is given by

P = | 0    1    0    0   · · · |
    | 1/2  0    1/2  0   · · · |
    | 0    1/2  0    1/2 · · · |
    | ...  ...  ...  ...  ... |

Assume the diagonalization can be implemented so that P becomes multiplication by the function ϕ_P(x) = x on the space of functions on an interval which we take to be [−1, 1]. If the state n corresponds to the function φn then we must require

x φ0 = φ1,   x φn(x) = (1/2) φ_{n−1}(x) + (1/2) φ_{n+1}(x).

Using the elementary trigonometric identity

cos α cos β = (1/2) cos(α − β) + (1/2) cos(α + β)

we obtain the following functions φn:

φ0(x) = 1,  φn(x) = cos nθ, where θ = cos⁻¹ x. (1.3.1)

The polynomials φn(x) = cos(n cos⁻¹ x) are generally called Chebycheff polynomials. The above discussion should serve as a good motivation for the introduction of these polynomials, which have found a number of applications. For the Plancherel measure we seek a function 1/c(x) such that

∫_{−1}^{1} φn(x) φm(x) dx/c(x) = δmn.


It is clear that we can take

1/c(x) = 1/(π√(1 − x²)).

The coefficients of the matrix P^n can be computed in terms of the Chebycheff polynomials. In fact it is straightforward to see that

P^{(n)}_{jk} = ∫_{−1}^{1} x^n φj(x) φk(x) dx/c(x).

The idea of this simple example can be extended to considerably more complex Markov chains and the machinery of orthogonal polynomials can be used to this effect. ♠
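The three-term relation that drives this example is easy to confirm numerically; the grid and the range of n below are arbitrary.

import numpy as np

x = np.linspace(-1.0, 1.0, 2_001)
theta = np.arccos(x)
phi = lambda n: np.cos(n * theta)          # φ_n(x) = cos(n arccos x)
for n in range(1, 6):
    lhs = x * phi(n)
    rhs = 0.5 * phi(n - 1) + 0.5 * phi(n + 1)
    print(n, np.max(np.abs(lhs - rhs)))    # ≈ 0: x·φ_n = ½φ_{n-1} + ½φ_{n+1}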

In the next subsection we apply some ideas from Fourier analysis to the analysis of Markov chains.
