Wieslaw Zielonka …zielonka/Enseignement/MPRI/2014/...Non-existence of optimal strategies player 1...

Stochastic games

Wieslaw Zielonka

www.liafa.univ-paris-diderot.fr/~zielonka

October 25, 2014

Concurrent stochastic games

Game structure G = (S, A, B, δ):

S - the set of states,

for each state s, non-empty sets A(s) and B(s) of actions of players 0 and 1 available at s,

a probabilistic transition function: δ(s, a, b) ∈ ∆(S), for all s ∈ S and (a, b) ∈ A(s) × B(s).

∆(S) is the set of probability distributions over S.

δ(t|s, a, b)

will denote the probability of moving to state t when actions a and b are played at state s.
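As an illustration, a finite game structure can be stored in plain dictionaries. The sketch below is only one possible encoding; the two-state game, the action names and the helper `is_well_formed` are hypothetical and do not come from the slides. It checks that every δ(s, a, b) is a probability distribution over S.

```python
from fractions import Fraction

# Hypothetical two-state game, used only for illustration.
# delta[(s, a, b)] maps each successor t to the probability delta(t | s, a, b).
S = ["s", "t"]
A = {"s": ["a1", "a2"], "t": ["a1"]}   # actions of player 0 at each state
B = {"s": ["b1", "b2"], "t": ["b1"]}   # actions of player 1 at each state
half = Fraction(1, 2)
delta = {
    ("s", "a1", "b1"): {"s": half, "t": half},
    ("s", "a1", "b2"): {"t": Fraction(1)},
    ("s", "a2", "b1"): {"s": Fraction(1)},
    ("s", "a2", "b2"): {"s": half, "t": half},
    ("t", "a1", "b1"): {"t": Fraction(1)},
}

def is_well_formed(S, A, B, delta):
    """Check that delta(s, a, b) is a probability distribution over S
    for every state s and every joint action (a, b) in A(s) x B(s)."""
    for s in S:
        if not A[s] or not B[s]:
            return False                      # action sets must be non-empty
        for a in A[s]:
            for b in B[s]:
                dist = delta.get((s, a, b))
                if dist is None:
                    return False              # missing transition
                if any(t not in S or p < 0 for t, p in dist.items()):
                    return False              # unknown state or negative weight
                if sum(dist.values()) != 1:
                    return False              # weights must sum to 1
    return True
```

Exact fractions are used so that the sum-to-one test is not affected by rounding.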

Plays and histories

A play of G is an infinite sequence

ω = s0, (a0, b0), s1, (a1, b1), s2, (a2, b2), . . .

such that (ai, bi) ∈ A(si) × B(si) for all i ≥ 0.

The set of all plays is denoted Ω.

A history is a finite sequence

h = s0, (a0, b0), s1, (a1, b1), s2, (a2, b2), . . . , sn

alternating states and joint actions such that (ai, bi) ∈ A(si) × B(si) for all 0 ≤ i < n and terminating in a state.

The set of histories is denoted H.

Strategies

A strategy of player 0 is a mapping

σ : H → ∆(A)

such that, for each history h terminating in state s, σ(h) ∈ ∆(A(s)).

Strategies for player 1 are defined in a similar way.

We write σ(a|h) to denote the probability that σ assigns to playing action a given a history h.

Pure and memoryless strategies

A strategy σ is memoryless if for each history h, σ(h) = σ(s) where s is the last state of h.

(Memoryless strategy: for each state s there is a fixed probability distribution σ(·|s) and the player playing at s chooses an action a with probability σ(a|s), independently of previously visited states and previously played actions.)

A strategy σ is pure if for each finite history h and each action a, σ(a|h) ∈ {0, 1}.

Probability

Let h ∈ H be a history. By h+ we denote the cylinder generated by h:

h+ = {ω ∈ Ω | h < ω}

where h < ω means that h is a prefix of ω.

Given an initial state s ∈ S and strategies σ, τ of players 0 and 1 we can define a probability Ps,σ,τ over cylinders.
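The formula for the probability of a cylinder is not reproduced in the text. A minimal sketch, assuming the standard construction: the probability of h+ is 0 if h does not start in s, and otherwise the product, over the steps of h, of the probabilities of the two chosen actions and of the transition taken. The function name, the encoding of strategies as Python functions on history prefixes, and the two-state example (player 1 hits when both players pick the same side) are assumptions made for illustration.

```python
from fractions import Fraction

def cylinder_probability(s, history, sigma, tau, delta):
    """Probability of the cylinder h+ for history = [s0, (a0, b0), s1, ..., sn].
    sigma and tau map a history prefix (a tuple ending in a state) to a dict
    action -> probability; delta[(s, a, b)] is a dict state -> probability."""
    if history[0] != s:
        return Fraction(0)
    prob = Fraction(1)
    for i in range(1, len(history), 2):
        prefix = tuple(history[:i])            # prefix ending in the current state
        a, b = history[i]
        cur, nxt = history[i - 1], history[i + 1]
        prob *= sigma(prefix).get(a, 0)        # player 0 picks a
        prob *= tau(prefix).get(b, 0)          # player 1 picks b
        prob *= delta[(cur, a, b)].get(nxt, 0) # the game moves to nxt
    return prob

# Hypothetical example: both players choose a side uniformly at random.
half = Fraction(1, 2)

def sigma(prefix):
    return {"standL": half, "standR": half}

def tau(prefix):
    return {"throwL": half, "throwR": half}

delta = {}
for a in ("standL", "standR"):
    for b in ("throwL", "throwR"):
        same_side = a[-1] == b[-1]
        delta[("throw", a, b)] = {"hit": Fraction(1)} if same_side else {"throw": Fraction(1)}
        delta[("hit", a, b)] = {"hit": Fraction(1)}   # hit is absorbing

h = ["throw", ("standL", "throwR"), "throw", ("standR", "throwR"), "hit"]
print(cylinder_probability("throw", h, sigma, tau, delta))
```

Each of the two steps of `h` contributes 1/2 · 1/2 · 1, so the cylinder has probability 1/16.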

Payoff

Ps,σ,τ extends in a unique way to a probability over the σ-algebra generated by cylinders.

A payoff mapping is a bounded measurable mapping

u : Ω → R

We will suppose that player 0 wants to minimize the expected payoff while player 1 wants to maximize the expected payoff.

In the sequel

Es,σ,τ (u)

will denote the expectation of u with respect to the probability Ps,σ,τ.

Game value and optimal strategies

Always

$$\sup_{\tau} \inf_{\sigma} E_{s,\sigma,\tau}(u) \;\le\; \inf_{\sigma} \sup_{\tau} E_{s,\sigma,\tau}(u)$$

where the quantity on the left-hand side is called the lower game value and the quantity on the right-hand side is the upper game value. If the lower and the upper game values are equal then their common value is called the game value.

Let val(s) be the game value for the initial state s.

Strategy σ of player 0 is ε-optimal if for each strategy τ of player 1

Es,σ,τ (u) ≤ val(s) + ε

Strategy τ of player 1 is ε-optimal if for each strategy σ of player 0

val(s)− ε ≤ Es,σ,τ (u)

A 0-optimal strategy is called optimal (i.e. it is an ε-optimal strategy with ε = 0).

Reachability games

Let F ⊂ S be a set of states.

A reachability game Reach(F) is the game such that the payoff is 1 for all plays such that si ∈ F for some i ≥ 0 and otherwise, if si ∉ F for all i, the payoff is 0.

The necessity of randomized strategies

[Figure: transition diagram with two states, throw and hit; the edges are labelled with the joint actions (standL, throwL), (standR, throwR), (standR, throwL) and (standL, throwR).]

Player 1 wins (the state "hit" is reached) with probability 1 with a randomized strategy.

At each round player 1 throws a snowball and tries to hit player 0. Player 0 is seen either in the left window or in the right window (and he can change the window at each stage). standL, standR are the actions of player 0 (stand in the left or right window); throwL, throwR are the actions of player 1, throwing a snowball either at the left or at the right window.
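The need for randomization can be checked numerically. The sketch below assumes that a snowball hits exactly when it is thrown at the window where player 0 stands; the function names are hypothetical. It shows that a uniform throw hits with probability 1/2 in every round whatever player 0 does, while any pure throwing strategy can be dodged forever by a player 0 who knows it.

```python
from fractions import Fraction

def round_hit_probability(p_left, q_left):
    """Probability of a hit in one round when player 0 stands left with
    probability p_left and player 1 throws left with probability q_left."""
    return p_left * q_left + (1 - p_left) * (1 - q_left)

def survive_vs_uniform(n):
    """Probability that player 0 is not hit in the first n rounds when
    player 1 throws left or right with probability 1/2 each."""
    return Fraction(1, 2) ** n

def dodge_pure(throw_strategy, n):
    """Against a pure strategy of player 1 (a function from the list of past
    joint actions to 'throwL' or 'throwR'), player 0 stands at the other
    window. Returns the n joint actions played; none of them is a hit."""
    past = []
    for _ in range(n):
        b = throw_strategy(past)
        a = "standR" if b == "throwL" else "standL"
        past.append((a, b))
    return past

print(survive_vs_uniform(10))
```

Since the survival probability 1/2^n tends to 0, the uniform strategy reaches "hit" with probability 1, which no pure strategy of player 1 can guarantee.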

Non-existence of optimal strategies

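[Figure: transition diagram with states hide, home, wet and safe; the edges leaving hide are labelled with the joint actions (wait, run), (throw, hide) and (throw, run).]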

Player 1 is hidden behind a hill. He wins if he can reach the house dry. Player 0 has only one snowball at his disposal and he tries to hit player 1. At each stage player 0 chooses between throw and wait and player 1 chooses between run and hide.

Suppose that player 1 plays using the strategy σ such that at each round he chooses action run with probability ε > 0. For any strategy τ of player 0 he can hit player 1 with probability not greater than ε.
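A short calculation makes the bound concrete. Assume player 1 runs with probability ε at each round, independently, and player 0 throws his only snowball at round n (a pure strategy; randomized strategies are mixtures of these). Player 1 is hit only if he hid during rounds 0, ..., n−1 and runs at round n. The function name below is hypothetical.

```python
from fractions import Fraction

def hit_probability(eps, n):
    """Probability that player 1 is hit when player 0 throws at round n:
    player 1 hides in rounds 0..n-1 (probability (1 - eps)^n) and runs in
    round n (probability eps)."""
    return (1 - eps) ** n * eps

eps = Fraction(1, 10)
worst = max(hit_probability(eps, n) for n in range(50))
print(worst)   # the largest hit probability is reached for n = 0
```

The maximum is ε, obtained by throwing immediately, so player 1 wins with probability at least 1 − ε for every ε > 0, although no single choice of ε gives him probability 1.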

Repeated hide and run (Büchi game)

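[Figure: transition diagram of the repeated game with states hide, home, wet and safe; the edges are labelled with the joint actions (wait, run), (throw, hide), (throw, run) and (wait, hide).]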

In the repeated Hide or Run game player 1 returns behind the hill after each successful visit of the shelter. The transition diagram above indicates that, intuitively, if player 0 has thrown the snowball without hitting player 1 then player 0 cannot immediately prepare and use another snowball; he should wait until player 1 visits the shelter and returns behind the hill. Player 1 wins if he visits the shelter infinitely often without being hit (visits state home infinitely often).

ε-optimal strategy of player 1 in repeated hide and run

To win with probability > 1 − ε player 1 runs with probability

$$\varepsilon_k = 1 - (1-\varepsilon)^{1/2^{k+1}}$$

where k is the number of visits to the shelter. Thus he runs at each round with probability ε0 = 1 − (1 − ε)^{1/2} before the first visit to the shelter; after visiting the shelter for the first time but before the second visit he runs with probability ε1 = 1 − (1 − ε)^{1/4} at each round, etc.

With this strategy he should memorize the number of visits to the shelter, thus he needs an unbounded memory.

The probability that player 1 will never be hit by a snowball is

$$\prod_{k=0}^{\infty} (1-\varepsilon_k) = \prod_{k=0}^{\infty} (1-\varepsilon)^{1/2^{k+1}} = (1-\varepsilon)^{\sum_{k=0}^{\infty} \frac{1}{2^{k+1}}} = 1-\varepsilon$$
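The infinite product can be checked numerically. The sketch below uses hypothetical function names and floating-point arithmetic; it computes the partial products of (1 − ε_k), which equal (1 − ε)^(1 − 2^(−K)) after K factors and approach 1 − ε.

```python
import math

def run_probability(eps, k):
    """eps_k = 1 - (1 - eps)^(1 / 2^(k+1)): running probability after k visits."""
    return 1 - (1 - eps) ** (1 / 2 ** (k + 1))

def partial_product(eps, K):
    """Product of (1 - eps_k) for k = 0, ..., K-1."""
    prod = 1.0
    for k in range(K):
        prod *= 1 - run_probability(eps, k)
    return prod

for K in (1, 5, 20, 60):
    print(K, partial_product(0.1, K))
```

The printed values decrease towards 0.9 = 1 − ε for ε = 0.1, while each individual ε_k stays strictly positive.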

Extending the transition mapping

Fix a state s ∈ S. Let ξ ∈ ∆(A(s)) and η ∈ ∆(B(s)) be probability distributions over actions in s.

By δ(s, ξ, η) we denote the probability distribution over states such that for each state t

$$\delta(t \mid s, \xi, \eta) = \sum_{a \in A(s)} \sum_{b \in B(s)} \xi(a)\,\eta(b)\,\delta(t \mid s, a, b)$$

is the probability of moving to t if players choose in s their actions independently according to ξ and η respectively.
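The double sum translates directly into code. This is a minimal sketch; the dictionary encoding of δ, ξ and η and the function name `extend` are assumptions, and the example reuses the snowball game with a hit whenever both players pick the same side.

```python
from fractions import Fraction

def extend(delta, s, xi, eta):
    """Return delta(. | s, xi, eta) as a dict state -> probability, where
    xi and eta are dicts action -> probability for players 0 and 1."""
    out = {}
    for a, pa in xi.items():
        for b, pb in eta.items():
            for t, pt in delta[(s, a, b)].items():
                out[t] = out.get(t, 0) + pa * pb * pt
    return out

# Snowball game at state "throw" (transitions are deterministic here).
delta = {
    ("throw", "standL", "throwL"): {"hit": Fraction(1)},
    ("throw", "standR", "throwR"): {"hit": Fraction(1)},
    ("throw", "standL", "throwR"): {"throw": Fraction(1)},
    ("throw", "standR", "throwL"): {"throw": Fraction(1)},
}
xi = {"standL": Fraction(1, 3), "standR": Fraction(2, 3)}
eta = {"throwL": Fraction(1, 2), "throwR": Fraction(1, 2)}
print(extend(delta, "throw", xi, eta))
```

With a uniform η the hit probability is 1/3 · 1/2 + 2/3 · 1/2 = 1/2, independently of ξ.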

Calculating the set of states with positive value

Let ηu,s denote the uniform probability distribution over the actions B(s),

$$\eta_{u,s}(b) = \frac{1}{n} \quad \text{where } n = |B(s)|$$

Define inductively

F0 = F,

Fn+1 = Fn ∪ {s ∈ S | ∀a ∈ A(s), ∃t ∈ Fn, δ(t|s, a, ηu,s) > 0},

F∞ = ∪n Fn
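For a finite game the inductive definition is a simple fixed-point computation. The sketch below assumes finite S, A(s) and B(s) and the dictionary encoding used for illustration; the function name and the four-state example are hypothetical. It uses the fact that δ(t|s, a, ηu,s) > 0 holds exactly when δ(t|s, a, b) > 0 for some b ∈ B(s).

```python
def positive_value_states(S, A, B, delta, F):
    """Compute F_inf, the least set containing F and closed under the rule
    'for every a in A(s), some b in B(s) reaches the set with positive
    probability'. States are added in place, which yields the same limit
    as the union of the F_n."""
    current = set(F)
    changed = True
    while changed:
        changed = False
        for s in S:
            if s in current:
                continue
            if all(
                any(delta[(s, a, b)].get(t, 0) > 0 for b in B[s] for t in current)
                for a in A[s]
            ):
                current.add(s)
                changed = True
    return current

# Hypothetical example: from s0 player 0 can escape to a sink with "out".
S = ["s0", "s1", "goal", "sink"]
A = {"s0": ["in", "out"], "s1": ["a1", "a2"], "goal": ["x"], "sink": ["x"]}
B = {"s0": ["y"], "s1": ["b1", "b2"], "goal": ["y"], "sink": ["y"]}
delta = {
    ("s0", "in", "y"): {"s1": 1},
    ("s0", "out", "y"): {"sink": 1},
    ("s1", "a1", "b1"): {"goal": 1},
    ("s1", "a1", "b2"): {"s1": 1},
    ("s1", "a2", "b1"): {"s1": 1},
    ("s1", "a2", "b2"): {"goal": 1},
    ("goal", "x", "y"): {"goal": 1},
    ("sink", "x", "y"): {"sink": 1},
}
print(positive_value_states(S, A, B, delta, {"goal"}))
```

Here s1 enters the set after one round, while s0 never does because the action out leads surely to the sink.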

Theorem

If player 1 uses a memoryless strategy η such that η(s) = ηu,s for each state s then for each initial state s ∈ F∞ the game reaches the set F of states with a positive probability against every strategy of player 0.

Player 0 has a pure memoryless strategy σ such that for each initial state s ∈ S \ F∞ the probability of visiting F is 0 for each possible strategy of player 1.

Proof.

From the construction of Fn it follows that if the current position is in Fn \ Fn−1, n ≥ 1, and player 1 plays using the uniform distribution over available actions then whatever the action played by player 0 the game moves with a positive probability to a state in Fn−1. By induction on n, the set F = F0 is reached from every state of Fn with positive probability.

From the definition of F∞ it follows that for each state s ∈ S \ F∞ player 0 has an action a such that δ(t|s, a, ηu,s) = 0 for every t ∈ F∞. However this implies that δ(t|s, a, η) = 0 for every t ∈ F∞ and for any probability distribution η over the actions B(s). Thus it suffices that player 0 plays such an action a each time the game visits s and the game will remain in S \ F∞ forever.

Value iteration

For two matrices M1 and M2 of the same dimension k × n we define the distance

$$d(M_1, M_2) = \max_{1 \le i \le k,\; 1 \le j \le n} |M_1(i,j) - M_2(i,j)|$$

We write val[M] for the value of the matrix game M.

Theorem

(i) For each c ∈ R, val[M + cJ] = val[M] + c, where J is the matrix with all entries 1.

(ii) If M1 ≤ M2 component-wise then val[M1] ≤ val[M2].

(iii) |val[M1] − val[M2]| ≤ d(M1, M2)

Proof.

(i) Both players have the same optimal strategies in M and in M + cJ since σT · (M + cJ) · η = σT · M · η + c for all strategies σ, η.

(ii) For all strategies σ, η, σT ·M1 · η ≤ σT ·M2 · η.

(iii) Follows from (i) and (ii) since

M1 − d(M1,M2)J ≤ M2 ≤ M1 + d(M1,M2)J
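The three properties can be checked on small instances. The sketch below is restricted to 2 × 2 matrices, where the value has a well-known closed form; general matrix games need linear programming, which is not shown. It assumes that the row player is player 0 (minimizer) and the column player is player 1 (maximizer), and that entries are integers or fractions. The names `val2x2` and `dist` are hypothetical.

```python
from fractions import Fraction

def val2x2(M):
    """Value of a 2x2 matrix game with integer or Fraction entries;
    rows belong to the minimizer, columns to the maximizer."""
    (a, b), (c, d) = M
    upper = min(max(a, b), max(c, d))   # best pure guarantee of the minimizer
    lower = max(min(a, c), min(b, d))   # best pure guarantee of the maximizer
    if lower == upper:                  # saddle point in pure strategies
        return Fraction(lower)
    # No saddle point: both players mix and the value is given by this formula.
    return Fraction(a * d - b * c, a + d - b - c)

def dist(M1, M2):
    """d(M1, M2): the largest absolute difference between entries."""
    return max(abs(x - y) for r1, r2 in zip(M1, M2) for x, y in zip(r1, r2))

M1 = [[1, 0], [0, 1]]          # value 1/2
M2 = [[2, 0], [1, 3]]          # M1 <= M2 component-wise, value 3/2
shifted = [[x + 3 for x in row] for row in M1]

print(val2x2(M1), val2x2(shifted), val2x2(M2), dist(M1, M2))
```

On this pair, (i) gives val[M1 + 3J] = 1/2 + 3, (ii) gives 1/2 ≤ 3/2, and (iii) gives |1/2 − 3/2| = 1 ≤ d(M1, M2) = 2.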

In the sequel we will write

W0 = S \ F∞, W1 = F

Thus we know that the value of the reachability game Reach(F) is 0 on W0 and 1 on W1. For the states in S \ (W0 ∪ W1) the value of the reachability game is in (0, 1] and our aim is to find this value.

By ℱ we denote the set of mappings from S \ (W0 ∪ W1) into [0, 1].

For v ∈ ℱ and each state s ∉ W0 ∪ W1 we define a one-stage game Γs(v): players 0 and 1 choose actions (a, b) ∈ A(s) × B(s) and player 1 receives from player 0 the sum

$$\sum_{t \in S} \delta(t \mid s, a, b)\, v^*(t)$$

where

$$v^*(t) = \begin{cases} 0 & \text{if } t \in W_0 \\ 1 & \text{if } t \in W_1 \\ v(t) & \text{otherwise} \end{cases}$$

Note that Γs(v) can be seen as a matrix game with the payoff defined above.

Let us define an operator

Φ : ℱ → ℱ

such that, given v ∈ ℱ,

Φ(v)(s) = val[Γs(v)]

is the value of the one-stage game.

For v1, v2 ∈ ℱ we write v1 ≤ v2 if v1(s) ≤ v2(s) for each state s ∈ S \ (W0 ∪ W1).

Theorem

Φ is monotone: if v1, v2 ∈ ℱ are such that v1 ≤ v2 then Φ(v1) ≤ Φ(v2).

Φ is non-expanding: ‖Φ(v1) − Φ(v2)‖ ≤ ‖v1 − v2‖, where ‖v‖ = max_{s ∈ S\(W0∪W1)} |v(s)|.

Proof: monotonicity follows from property (ii) of matrix games (monotonicity of the value) and non-expansion from property (iii).

Let v0 ∈ ℱ be the mapping equal to 0 for all states.

Define by induction:

vi+1 = Φ(vi)

Theorem

The sequence vi is monotone, vi+1 ≥ vi for all i.

Let v∞ = lim vi. Then Φ(v∞) = v∞, i.e. v∞ is a fixed point of Φ (in fact v∞ is the smallest fixed point of Φ).

For each state s ∈ S \ (W0 ∪ W1), v∞(s) is the value of the reachability game Reach(F).

Player 0 has an optimal memoryless strategy (randomized in general), player 1 has an ε-optimal memoryless strategy in the game Reach(F).
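Value iteration can be traced by hand on the hide-or-run game. The sketch below assumes that the target set is {home, safe}, that wet belongs to W0, and that hide is the only remaining state, so that Γhide(v) is the 2 × 2 matrix with rows wait, throw (player 0) and columns run, hide (player 1). These modelling choices and the function names are assumptions made for illustration, and the 2 × 2 closed form replaces a general matrix-game solver.

```python
from fractions import Fraction

def val2x2(M):
    """Value of a 2x2 matrix game (rows minimize, columns maximize)."""
    (a, b), (c, d) = M
    upper = min(max(a, b), max(c, d))
    lower = max(min(a, c), min(b, d))
    if lower == upper:                  # saddle point in pure strategies
        return Fraction(lower)
    return Fraction(a * d - b * c, a + d - b - c)

def phi(v):
    """One application of Phi at the state hide, where v = v(hide).
    (wait, run) -> home: 1      (wait, hide) -> hide: v
    (throw, run) -> wet: 0      (throw, hide) -> safe: 1"""
    return val2x2([[Fraction(1), v], [Fraction(0), Fraction(1)]])

def value_iteration(steps):
    """Return [v_0, v_1, ..., v_steps] starting from v_0 = 0."""
    v = Fraction(0)
    seq = [v]
    for _ in range(steps):
        v = phi(v)
        seq.append(v)
    return seq

print(value_iteration(4))
```

Under these assumptions the iterates are 0, 1/2, 2/3, 3/4, 4/5, that is v_i = i/(i+1): the sequence is increasing and tends to 1 without reaching it, in line with player 1 having ε-optimal strategies only.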

Proof

v∞ exists since the sequence vi is monotone and bounded by the constant mapping equal to 1 for each state.

Since Φ is non-expanding the fact that vi → v∞ implies that Φ(vi) → Φ(v∞). But Φ(vi) = vi+1 and we can see that v∞ = Φ(v∞).

Let σ be a memoryless strategy of player 0 such that

in a state s ∈ S \ (W0 ∪ W1) he plays using a distribution αs ∈ ∆(A(s)) such that αs is optimal for him in the one-stage game Γs(v∞),

in each state s ∈ W0 he plays a fixed action such that all successor states are in W0 (we have seen previously that such an action exists),

in states s ∈ W1 he plays any fixed action.

Suppose that player 1 plays using any strategy τ .

Let

$$v^*_\infty(s) = \begin{cases} v_\infty(s) & \text{if } s \in S \setminus (W_0 \cup W_1) \\ 1 & \text{if } s \in W_1 \\ 0 & \text{if } s \in W_0 \end{cases}$$

Let us suppose that at some stage the game is in state s ∈ S \ (W0 ∪ W1). Let p(t|s) be the probability to move to a state t in the next stage (this probability can depend on the stage since the strategy of player 1 can depend on history).

Then

$$v^*_\infty(s) \ge \sum_{t \in S} p(t \mid s) \cdot v^*_\infty(t)$$

This is clearly true for states belonging to W0 ∪W1.

For states s ∈ S \ (W0 ∪ W1) this follows from optimality of the strategy σ of player 0 in the one-stage matrix game Γs(v∞). Indeed on the left-hand side we have the value of Γs(v∞) and on the right-hand side we have the payoff when player 0 (minimizer) plays his optimal strategy while player 1 (maximizer) plays any strategy.

Let

pi(s)

be the probability that the game with initial state s will reach F in at most i steps.

We shall prove that

pi(s) ≤ v∗∞(s)

for all i ≥ 0.

Certainly this inequality is true for states s ∈ W0 ∪ W1 (why?)

Certainly 0 = p0(s) ≤ v∗∞(s) for s ∉ F (recall that F = W1).

Suppose that pi(t) ≤ v∗∞(t) for all states t.

Then for s ∈ S \ (W0 ∪ W1)

$$p_{i+1}(s) = \sum_{t \in S} p(t \mid s) \cdot p_i(t) \le \sum_{t \in S} p(t \mid s) \cdot v^*_\infty(t) \le v^*_\infty(s)$$

Since the probability to reach F from s in any number of steps is the limit lim_{i→∞} pi(s), we get that the probability to reach F is bounded from above by v∗∞(s) if player 0 uses the memoryless strategy σ described above.

ε-optimal strategy for player 1

Let v0 ∈ ℱ be the constant mapping 0 for each state and inductively vi+1 = Φ(vi). Let k be such that ‖v∞ − vk‖ < ε.

Let us consider the following strategy of player 1 that will be applied in all states s ∈ S \ (W0 ∪ W1):

at stage 0, if the current state is s0 then player 1 plays an optimal strategy in the game Γs0(vk),

at stage 1, if the current state is s1 then player 1 plays an optimal strategy in the game Γs1(vk−1),

in general, at stage i < k, if the current state is si then player 1 plays an optimal strategy in the game Γsi(vk−i).

Let us note that for each i

$$v^*_i(s) \le \sum_{t \in S} p(t \mid s)\, v^*_{i-1}(t)$$

where p(t|s) is the probability of moving in one step to t from s (this probability can depend on the stage of the game). Indeed, on the left-hand side we have the value of the game Γs(vi−1) and on the right-hand side we have what player 1 effectively wins in such a game. Since he plays using his optimal strategy he wins at least the value of the game.

We shall prove by induction that

pi(s) ≥ vi(s)

where pi(s) is the probability to reach F in at most i steps.

This is clearly true for i = 0. Let 0 < i ≤ k, s ∉ W0 ∪ W1.

$$p_i(s) = \sum_{t \in S} p(t \mid s) \cdot p_{i-1}(t) \ge \sum_{t \in S} p(t \mid s) \cdot v^*_{i-1}(t) \ge v_i(s)$$

Since v∞(s) − vk(s) < ε we can see that the strategy of player 1 described above is ε-optimal (but not memoryless!).

A memoryless strategy σ of player 1 is proper if for each strategy τ of player 0 the probability to reach W0 ∪ W1 is 1.

We have seen that the uniform strategy of player 1 is proper (choose each action with the same probability).
