# Wieslaw Zielonka zielonka/Enseignement/MPRI/2014/... Non-existence of optimal strategies player 1...

Date posted: 29-Jun-2020



Stochastic games

Wieslaw Zielonka www.liafa.univ-paris-diderot.fr/~zielonka

October 25, 2014

Concurrent stochastic games

Game structure G = (S, A, B, δ):

◮ S is the set of states,

◮ for each state s, A(s) and B(s) are the non-empty sets of actions available at s to players 0 and 1 respectively,

◮ δ is a probabilistic transition function: δ(s, a, b) ∈ ∆(S) for all s ∈ S and (a, b) ∈ A(s) × B(s).

∆(S) is the set of probability distributions over S.

δ(t|s, a, b)

will denote the probability of moving to state t when actions a and b are played at state s.

Plays and histories

A play of G is an infinite sequence

ω = s0, (a0, b0), s1, (a1, b1), s2, (a2, b2), . . .

such that (ai , bi ) ∈ A(si)× B(si) for all i ≥ 0. The set of all plays is denoted Ω.

A history is a finite sequence

h = s0, (a0, b0), s1, (a1, b1), s2, (a2, b2), . . . , sn

alternating states and joint actions, terminating in a state, such that (ai, bi) ∈ A(si) × B(si) for all 0 ≤ i < n. The set of histories is denoted H.

Strategies

A strategy of player 0 is a mapping

σ : H → ∆(A)

such that, for each history h terminating in state s, σ(h) ∈ ∆(A(s)).

Strategies for player 1 are defined in a similar way. We write σ(a|h) to denote the probability assigned by σ of playing action a given a history h.

Pure and memoryless strategies

A strategy σ is memoryless if for each history h, σ(h) = σ(s), where s is the last state of h. (Memoryless strategy: for each state s there is a fixed probability distribution σ(·|s), and the player playing at s chooses an action a with probability σ(a|s), independently of the previously visited states and previously played actions.)

A strategy σ is pure if for each finite history h and each action a, σ(a|h) ∈ {0, 1}.

Probability

Let h ∈ H be a history. By h+ we denote the cylinder generated by h:

h+ = {ω ∈ Ω | h < ω}

where h < ω means that h is a prefix of ω. Given an initial state s ∈ S and strategies σ, τ of players 0 and 1 we can define a probability over cylinders.

Probability

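Let h = s0, (a0, b0), s1, (a1, b1), s2, (a2, b2), . . . , sn ∈ H. Then

$$
\begin{aligned}
P_{s,\sigma,\tau}(h^+) = \mathbb{I}_{s=s_0} \cdot {} & \sigma(a_0 \mid s_0)\,\tau(b_0 \mid s_0)\,\delta(s_1 \mid s_0, a_0, b_0) \\
\cdot {} & \sigma(a_1 \mid s_0, a_0, b_0, s_1)\,\tau(b_1 \mid s_0, a_0, b_0, s_1)\,\delta(s_2 \mid s_1, a_1, b_1) \cdots \\
\cdot {} & \sigma(a_{n-1} \mid s_0, a_0, b_0, \dots, s_{n-1})\,\tau(b_{n-1} \mid s_0, a_0, b_0, \dots, s_{n-1})\,\delta(s_n \mid s_{n-1}, a_{n-1}, b_{n-1})
\end{aligned}
$$

where $\mathbb{I}_{s=s_0}$ is 1 if s = s0 and 0 otherwise.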
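The product formula above can be evaluated mechanically. The following is a minimal sketch, assuming strategies are given as Python functions from a history prefix to a dictionary of action probabilities and that δ is a function returning a dictionary of successor probabilities. The names `cylinder_probability`, `toy_delta` and `uniform`, and the two-state toy game itself, are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def cylinder_probability(s, history, sigma, tau, delta):
    """P_{s,sigma,tau}(h+) for history = [s0, (a0, b0), s1, ..., sn]."""
    states = history[0::2]   # s0, s1, ..., sn
    moves = history[1::2]    # (a0, b0), ..., (a_{n-1}, b_{n-1})
    if states[0] != s:       # indicator I_{s = s0}
        return Fraction(0)
    prob = Fraction(1)
    for i, (a, b) in enumerate(moves):
        prefix = tuple(history[: 2 * i + 1])  # s0, (a0, b0), ..., s_i
        prob *= sigma(prefix).get(a, 0)
        prob *= tau(prefix).get(b, 0)
        prob *= delta(states[i], a, b).get(states[i + 1], 0)
    return prob

HALF = Fraction(1, 2)

# Hypothetical toy game: from state "u", matching actions lead to "v";
# otherwise the game stays where it is. State "v" is absorbing.
def toy_delta(s, a, b):
    if s == "u" and a == b:
        return {"v": Fraction(1)}
    return {s: Fraction(1)}

# Memoryless uniform strategy over the two actions "L" and "R".
def uniform(prefix):
    return {"L": HALF, "R": HALF}

h = ["u", ("L", "R"), "u", ("R", "R"), "v"]
p = cylinder_probability("u", h, uniform, uniform, toy_delta)
```

Here each of the two rounds contributes a factor 1/2 · 1/2 · 1, so `p` equals 1/16, and the same history has probability 0 from any initial state other than "u".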

Payoff

$P_{s,\sigma,\tau}$ extends in a unique way to a probability measure over the σ-algebra generated by cylinders. A payoff mapping is a bounded measurable mapping

u : Ω → R

We will suppose that player 0 wants to minimize the expected payoff while player 1 wants to maximize the expected payoff. In the sequel

$$
E_{s,\sigma,\tau}(u)
$$

will denote the expectation of u with respect to the probability $P_{s,\sigma,\tau}$.

Game value and optimal strategies

We always have

$$
\sup_{\tau} \inf_{\sigma} E_{s,\sigma,\tau}(u) \le \inf_{\sigma} \sup_{\tau} E_{s,\sigma,\tau}(u)
$$

where the quantity on the left-hand side is called the lower game value and the quantity on the right-hand side the upper game value. If the lower and the upper game values are equal, their common value is called the game value. Let val(s) be the game value for the initial state s. A strategy σ of player 0 is ε-optimal if for each strategy τ of player 1

Es,σ,τ (u) ≤ val(s) + ε

Strategy τ of player 1 is ε-optimal if for each strategy σ of player 0

val(s)− ε ≤ Es,σ,τ (u)

A 0-optimal strategy is called optimal (i.e. it is an ε-optimal strategy with ε = 0).

Reachability games

Let F ⊂ S be a set of states.

A reachability game Reach(F) is the game in which the payoff is 1 for all plays such that si ∈ F for some i ≥ 0, and 0 otherwise, i.e. if si ∉ F for all i.

The necessity of randomized strategies

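[Figure: transition diagram of the snowball game between player 0 and player 1, with two states, throw and hit. From throw, the joint actions (standL, throwL) and (standR, throwR) lead to hit, while (standR, throwL) and (standL, throwR) loop back to throw. Caption: player 1 wins (the state "hit" is reached) with probability 1 with a randomized strategy.]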

At each round player 1 throws a snowball and tries to hit player 0. Player 0 appears either in the left window or in the right window (and he can change windows at each stage). The actions of player 0 are standL and standR (stand in the left or right window); the actions of player 1 are throwL and throwR (throw a snowball at the left or at the right window).
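The role of randomization in the snowball game can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and for simplicity both players are assumed to use the same left/right probabilities at every round. When player 1 throws left or right with probability 1/2 each, the one-round hit probability is 1/2 whatever player 0 does, so the argument does not really depend on that simplification.

```python
from fractions import Fraction

def hit_probability_one_round(p_stand_left, p_throw_left):
    """Probability that the snowball is thrown at the window player 0 stands in."""
    return (p_stand_left * p_throw_left
            + (1 - p_stand_left) * (1 - p_throw_left))

def survival_probability(rounds, p_stand_left, p_throw_left):
    """Probability that state "hit" is not reached within the given number of rounds."""
    return (1 - hit_probability_one_round(p_stand_left, p_throw_left)) ** rounds

# A pure choice of player 1 (always throwL) is countered by always standing right.
pure_case = hit_probability_one_round(Fraction(0), Fraction(1))

# Against a uniform throw, every mixture of player 0 is hit with probability 1/2.
uniform_cases = [hit_probability_one_round(Fraction(k, 10), Fraction(1, 2))
                 for k in range(11)]

# Survival over 10 rounds against the uniform throw.
ten_rounds = survival_probability(10, Fraction(3, 10), Fraction(1, 2))
```

With the uniform throw, the probability of not being hit for n rounds is (1/2)^n, which tends to 0, whereas the pure strategy "always throwL" never hits a player who always stands on the right.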

Non-existence of optimal strategies

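[Figure: transition diagram of the hide-or-run game between player 1 and player 0, with states hide, home, wet and safe, and edges labelled (wait, run), (throw, hide) and (throw, run).]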

Player 1 is hidden behind a hill. He wins if he can reach the house dry. Player 0 has only one snowball at his disposal and tries to hit player 1 with it. At each stage player 0 chooses between throw and wait, and player 1 chooses between run and hide. Suppose that player 1 uses the strategy σ that, at each round, chooses action run with probability ε > 0. For any strategy τ of player 0, player 0 can hit player 1 with probability not

Repeated hide and run (Büchi game)

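[Figure: transition diagram of the repeated hide-or-run game between player 1 and player 0, with states hide, home, wet and safe, and edges labelled (wait, hide), (wait, run), (throw, hide) and (throw, run).]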

In the repeated Hide or Run game, player 1 returns behind the hill after each successful visit to the shelter. The transition diagram above indicates that, intuitively, if player 0 has thrown the snowball without hitting player 1, then player 0 cannot immediately prepare and use another snowball: he must wait until player 1 visits the shelter and returns behind the hill. Player 1 wins if he visits the shelter infinitely often without being hit (that is, visits state home infinitely often).

ε-optimal strategy of player 1 in repeated hide and run

To win with probability at least 1 − ε, player 1 runs with probability

$$
\varepsilon_k = 1 - (1-\varepsilon)^{1/2^{k+1}}
$$

where k is the number of visits to the shelter so far. Thus he runs at each round with probability $\varepsilon_0 = 1 - (1-\varepsilon)^{1/2}$ before the first visit to the shelter; after visiting the shelter for the first time but before the second visit, he runs at each round with probability $\varepsilon_1 = 1 - (1-\varepsilon)^{1/4}$, and so on.

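With this strategy he must memorize the number of visits to the shelter, thus he needs unbounded memory. Between the k-th and the (k+1)-th visit to the shelter he is hit with probability at most $\varepsilon_k$, so the probability that player 1 is never hit by a snowball is at least

$$
\prod_{k=0}^{\infty} (1-\varepsilon_k) = \prod_{k=0}^{\infty} (1-\varepsilon)^{1/2^{k+1}} = (1-\varepsilon)^{\sum_{k=0}^{\infty} 1/2^{k+1}} = 1-\varepsilon
$$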
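The infinite product can be checked numerically. The snippet below is a small sketch with hypothetical function names; it computes the run probabilities ε_k and the partial products of (1 − ε_k), which after K factors equal (1 − ε)^(1 − 2^(−K)) and therefore decrease towards 1 − ε.

```python
import math

def run_probability(eps, k):
    """epsilon_k = 1 - (1 - eps)^(1 / 2^(k+1)), used after k visits to the shelter."""
    return 1 - (1 - eps) ** (1 / 2 ** (k + 1))

def never_hit_product(eps, phases):
    """Partial product of (1 - epsilon_k) for k = 0, ..., phases - 1."""
    prod = 1.0
    for k in range(phases):
        prod *= 1 - run_probability(eps, k)
    return prod

eps = 0.1
after_3 = never_hit_product(eps, 3)     # equals (1 - eps)^(7/8) up to rounding
after_60 = never_hit_product(eps, 60)   # numerically indistinguishable from 1 - eps
```

Note that the run probabilities ε_k shrink quickly with k, which is why the strategy has to count the visits to the shelter.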

Extending the transition mapping

Fix a state s ∈ S. Let ξ ∈ ∆(A(s)) and η ∈ ∆(B(s)) be probability distributions over the actions available in s. By δ(s, ξ, η) we denote the probability distribution over states such that, for each state t,

$$
\delta(t \mid s, \xi, \eta) = \sum_{a \in A(s)} \sum_{b \in B(s)} \xi(a)\,\eta(b)\,\delta(t \mid s, a, b)
$$

is the probability of moving to t if the players choose their actions in s independently, according to ξ and η respectively.
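The double sum above translates directly into code. The following sketch assumes distributions are represented as dictionaries and δ as a function returning a dictionary; `extend_transition` and `snowball_delta` are hypothetical names, and the example reuses the snowball game from earlier in these notes.

```python
from fractions import Fraction

def extend_transition(delta, s, xi, eta):
    """Distribution delta(. | s, xi, eta) for mixed actions xi and eta at state s."""
    dist = {}
    for a, pa in xi.items():
        for b, pb in eta.items():
            for t, pt in delta(s, a, b).items():
                dist[t] = dist.get(t, 0) + pa * pb * pt
    return dist

# Snowball game: matching windows (standL/throwL or standR/throwR) lead to "hit".
def snowball_delta(s, a, b):
    if s == "throw" and a[-1] == b[-1]:
        return {"hit": Fraction(1)}
    return {s: Fraction(1)}

xi = {"standL": Fraction(1, 3), "standR": Fraction(2, 3)}
eta = {"throwL": Fraction(1, 2), "throwR": Fraction(1, 2)}
dist = extend_transition(snowball_delta, "throw", xi, eta)
```

With the uniform η, the state "hit" gets probability 1/3 · 1/2 + 2/3 · 1/2 = 1/2 and the state "throw" gets the remaining 1/2.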

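Calculating the set of states with positive value

Let $\eta_{u,s}$ denote the uniform probability distribution over the actions B(s),

$$
\eta_{u,s}(b) = \frac{1}{n} \quad \text{where } n = |B(s)|
$$

Define inductively

$$
\begin{aligned}
F_0 &= F, \\
F_{n+1} &= F_n \cup \{ s \in S \mid \forall a \in A(s),\ \exists t \in F_n,\ \delta(t \mid s, a, \eta_{u,s}) > 0 \}, \\
F_\infty &= \bigcup_n F_n
\end{aligned}
$$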

Theorem. If player 1 uses the memoryless strategy η such that η(s) = $\eta_{u,s}$ for each state s, then for each initial state s ∈ F∞ the game reaches the set F with positive probability against every strategy of player 0. Player 0 has a pure memoryless strategy σ such that, for each initial state s ∈ S \ F∞, the probability of visiting F is 0 against every strategy of player 1.

Proof. From the construction of Fn it follows that if the current state is in Fn \ Fn−1 with n ≥ 1 and player 1 plays the uniform distribution over his available actions, then whatever action player 0 plays, the game moves with positive probability to a state in Fn−1. By induction on n, F is reached with positive probability from every state of F∞.

From the definition of F∞ it follows that for each state s ∈ S \ F∞ player 0 has an action a such that δ(t|s, a, $\eta_{u,s}$) = 0 for every t ∈ F∞. Since $\eta_{u,s}$ gives positive probability to every action in B(s), this implies that δ(t|s, a, η) = 0 for every probability distribution η over B(s). Thus it suffices that player 0 plays such an action a each time the game visits s, and the game will remain in S \ F∞ forever.
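The inductive definition of F∞ can be computed as a fixpoint. The sketch below assumes finitely many states and actions; the function name, the data layout (dictionaries `A`, `B` and a transition table `trans`) and the five-state toy game are hypothetical. It uses the fact that δ(t|s, a, η_{u,s}) > 0 exactly when δ(t|s, a, b) > 0 for some b ∈ B(s), and it adds states in place rather than layer by layer, which yields the same limit set.

```python
def positive_value_states(states, A, B, delta, F):
    """Compute F_infinity: states from which uniform play by player 1 reaches F
    with positive probability against every strategy of player 0."""
    reach = set(F)
    changed = True
    while changed:
        changed = False
        for s in states:
            if s in reach:
                continue
            # s is added if EVERY action a of player 0 leaves SOME action b of
            # player 1 that moves into `reach` with positive probability.
            if all(
                any(delta(s, a, b).get(t, 0) > 0 for b in B[s] for t in reach)
                for a in A[s]
            ):
                reach.add(s)
                changed = True
    return reach

# Hypothetical toy game with target set F = {"goal"}.
trans = {
    ("goal", "a", "x"): {"goal": 1},
    ("trap", "a", "x"): {"trap": 1},
    ("s1", "a", "x"): {"goal": 1},
    ("s1", "a", "y"): {"trap": 1},
    ("s0", "a", "x"): {"s1": 1},
    ("s0", "c", "x"): {"s0": 0.5, "s1": 0.5},
    ("s2", "a", "x"): {"goal": 1},
    ("s2", "a", "y"): {"goal": 1},
    ("s2", "c", "x"): {"trap": 1},   # player 0 can avoid F by playing c
    ("s2", "c", "y"): {"trap": 1},
}
A = {"goal": ["a"], "trap": ["a"], "s1": ["a"], "s0": ["a", "c"], "s2": ["a", "c"]}
B = {"goal": ["x"], "trap": ["x"], "s1": ["x", "y"], "s0": ["x"], "s2": ["x", "y"]}
states = ["s0", "s1", "s2", "trap", "goal"]

f_inf = positive_value_states(states, A, B, lambda s, a, b: trans[(s, a, b)], {"goal"})
```

In this toy game s1 enters at the first step and s0 at the second, while s2 stays outside because player 0 can always play c there and move to the trap.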

Value iteration

For two matrices M1 and M2 of the same dimension k × n we define the distance

$$
d(M_1, M_2) = \max_{1 \le i \le k,\ 1 \le j \le n} |M_1(i,j) - M_2(i,j)|
$$

We denote by val[M] the value of the matrix game M.

Theorem

(i) For each c ∈ R, val[M + cJ] = val[M] + c, where J is the matrix with all entries equal to 1.

(ii) If M1 ≤ M2 component-wise then val[M1] ≤ val[M2].

(iii) |val[M1] − val[M2]| ≤ d(M1, M2).

Proof. (i) Both players have the same optimal strategies in M and in M + cJ, since $\sigma^T (M + cJ)\,\eta = \sigma^T M \eta + c$ for all strategies σ, η.

(ii) For all strategies σ, η, $\sigma^T M_1 \eta \le \sigma^T M_2 \eta$.

(iii) Follows from (i) and (ii) since

$$
M_1 - d(M_1, M_2)\,J \le M_2 \le M_1 + d(M_1, M_2)\,J
$$
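The three properties can be checked on small examples. The sketch below is restricted to 2 × 2 matrix games, for which the value has a well-known closed form; this restriction, the function names and the example matrices are assumptions made for illustration, and a general k × n game would require linear programming instead. Rows are taken to belong to the minimizing player 0 and columns to the maximizing player 1, matching the expression σᵀ·M·η.

```python
from fractions import Fraction

def value_2x2(M):
    """Value of a 2x2 matrix game (rows minimize, columns maximize)."""
    (a, b), (c, d) = M
    upper = min(max(a, b), max(c, d))   # best guarantee of the row player in pure strategies
    lower = max(min(a, c), min(b, d))   # best guarantee of the column player in pure strategies
    if upper == lower:                  # pure saddle point
        return Fraction(upper)
    # No saddle point: both players mix, and the value has this closed form.
    return Fraction(a * d - b * c) / Fraction(a + d - b - c)

def distance(M1, M2):
    """d(M1, M2) = largest entry-wise absolute difference."""
    return max(abs(x - y) for r1, r2 in zip(M1, M2) for x, y in zip(r1, r2))

def shift(M, c):
    """M + cJ."""
    return [[x + c for x in row] for row in M]

M1 = [[1, 0], [0, 1]]                   # one round of the snowball game
M2 = [[1, Fraction(1, 4)], [0, 1]]      # M1 <= M2 component-wise

v1, v2 = value_2x2(M1), value_2x2(M2)   # 1/2 and 4/7
```

On these matrices, `value_2x2(shift(M1, 3))` equals `v1 + 3` as in (i), `v1 <= v2` as in (ii), and `abs(v1 - v2)` is 1/14, which is below `distance(M1, M2)` = 1/4 as in (iii).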

In the sequel we will write

W0 = S \
