Politex: Regret Bounds for Policy Iteration Using Expert Prediction · 2019-06-05


Transcript of Politex: Regret Bounds for Policy Iteration Using Expert Prediction · 2019-06-05


Politex: Regret Bounds for Policy Iteration Using Expert Prediction
Yasin Abbasi-Yadkori, Peter L. Bartlett, Kush Bhatia, Nevena Lazić, Csaba Szepesvári, Gellért Weisz

Summary

- Setting: average-cost RL with discrete actions and value function approximation.
- Politex: softened and averaged policy iteration.
- If the value function error after τ steps satisfies
  $\|Q^\pi - \hat Q^\pi\|_\nu \le \varepsilon(\tau) = \varepsilon_0 + O(\sqrt{1/\tau})$,
  where $\varepsilon_0$ is the approximation error and ν is the stationary state-action distribution, then the regret of Politex in uniformly mixing MDPs is of the order
  $R_T = \tilde O(T^{3/4} + \varepsilon_0 T)$.
- The regret bound does not scale with the size of the MDP, does not depend on the "concentrability coefficient", and the algorithm is easy to implement (no confidence bounds required).

Politex algorithm

Input: phase length τ > 0, initial state $x_0$
Set $\hat Q_0(x, a) = 0$ for all $x, a$
for $i := 1, 2, \dots$ do
  Policy iteration: $\pi_i(\cdot \mid x) = \operatorname{argmin}_{u \in \Delta} \langle u, \hat Q_{i-1}(x, \cdot) \rangle$
  Politex: $\pi_i(\cdot \mid x) = \operatorname{argmin}_{u \in \Delta} \big\langle u, \sum_{j=0}^{i-1} \hat Q_j(x, \cdot) \big\rangle - \eta^{-1} H(u) \;\propto\; \exp\big(-\eta \sum_{j=0}^{i-1} \hat Q_j(x, \cdot)\big)$
  Execute $\pi_i$ for τ time steps and collect dataset $Z_i$
  Estimate $\hat Q_i$ from $Z_1, \dots, Z_i$ and $\pi_1, \dots, \pi_i$
end for

Different value estimation methods are possible. For non-linear Q-functions, one can maintain only the most recent n estimates.
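To make the tabular form of the Politex update concrete, here is a minimal NumPy sketch of the policy computation; the function name, array shapes, and random inputs are illustrative rather than part of the poster.

```python
import numpy as np

def politex_policy(q_estimates, eta):
    """Politex update: a Boltzmann policy over the negated sum of past Q estimates.

    q_estimates: list of arrays of shape (num_states, num_actions), one
                 estimate Q_hat_j per completed phase j = 0, ..., i-1.
    eta:         temperature / learning-rate parameter from the update above.
    Returns pi with pi[x, a] proportional to exp(-eta * sum_j Q_hat_j[x, a]).
    """
    q_sum = np.sum(q_estimates, axis=0)           # sum_{j=0}^{i-1} Q_hat_j(x, a)
    logits = -eta * q_sum
    logits -= logits.max(axis=1, keepdims=True)   # stabilise exp() numerically
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

# Two phases of (fake) Q estimates for 3 states and 2 actions.
q_hats = [np.random.rand(3, 2), np.random.rand(3, 2)]
print(politex_policy(q_hats, eta=0.5))
```

Since costs are minimized, the negated exponent places more probability on actions whose accumulated Q estimates are small; η controls how sharply the policy concentrates.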

Politex analysis

Assumptions:

- A1 (Unichain). MDP states form a single recurrent class.

- A2 (Uniform mixing). $\sup_\pi \|(\nu_\pi - \nu)^\top H_\pi\|_1 \le \exp(-\kappa^{-1})\, \|\nu_\pi - \nu\|_1$, where $H_\pi$ is the transition probability matrix over (s, a) pairs under π, and $\nu_\pi$ is the stationary state-action distribution.
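A2 can be probed numerically on a small tabular example. The sketch below builds a random MDP and a random policy (both purely illustrative), forms the state-action transition matrix $H_\pi$ and its stationary distribution $\nu_\pi$, and measures the one-step ℓ1 contraction factor that A2 requires to be at most $\exp(-\kappa^{-1})$; A2 asks for this uniformly over policies, which a single random π of course does not certify.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 2

# Random MDP kernel P[s, a, s'] and a fixed stochastic policy pi[s, a] (illustrative).
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((S, A));   pi /= pi.sum(axis=1, keepdims=True)

# State-action transition matrix H_pi[(s, a), (s', a')] = P[s, a, s'] * pi[s', a'].
H = np.einsum('kas,sb->kasb', P, pi).reshape(S * A, S * A)

# Stationary state-action distribution nu_pi: left eigenvector of H for eigenvalue 1.
vals, vecs = np.linalg.eig(H.T)
nu_pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
nu_pi = np.abs(nu_pi) / np.abs(nu_pi).sum()

# Empirical one-step contraction factor over random distributions nu:
# A2 asks for ||(nu - nu_pi)^T H_pi||_1 <= exp(-1/kappa) * ||nu - nu_pi||_1.
factors = []
for _ in range(1000):
    nu = rng.random(S * A); nu /= nu.sum()
    diff = nu - nu_pi
    factors.append(np.abs(diff @ H).sum() / np.abs(diff).sum())
print("worst observed contraction factor:", max(factors))
```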

Let $\lambda_\pi = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} c(x_t, \pi(x_t))$ be the average cost of a policy.

The regret decomposes as
$$R_T = \sum_{t=1}^{T} \big( c(x_t, a_t) - c(x^*_t, a^*_t) \big) = V_T + R_T + W_T,$$
where
$$V_T = \sum_{t=1}^{T} \big( c_t - \lambda_{\pi(t)} \big), \qquad R_T = \sum_{t=1}^{T} \big( \lambda_{\pi(t)} - \lambda_{\pi^*} \big), \qquad W_T = \sum_{t=1}^{T} \big( \lambda_{\pi^*} - c^*_t \big).$$

- $V_T$ and $W_T$ are the differences between instantaneous and average costs; w.h.p. they scale as $\kappa T^{3/4}$ and $\kappa \sqrt{T}$, respectively.

- $R_T$ (the middle term) is the pseudo-regret, bounded using the regret bound of the Exponentially Weighted Average (EWA) forecaster together with the value error bound.
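As a sanity check on the decomposition, the following sketch recombines the three terms from logged per-step costs. The array names, and the assumption that the average costs $\lambda_{\pi(t)}$ and the optimal policy's trajectory are available, are purely illustrative.

```python
import numpy as np

def regret_decomposition(costs, avg_costs, opt_costs, opt_avg_cost):
    """Illustrative computation of the three terms in the regret decomposition.

    costs:        c(x_t, a_t) incurred by the learner, shape (T,).
    avg_costs:    lambda_{pi(t)}, average cost of the policy used at step t, shape (T,).
    opt_costs:    c(x*_t, a*_t) incurred by the optimal policy, shape (T,).
    opt_avg_cost: lambda_{pi*}, scalar average cost of the optimal policy.
    """
    V_T = np.sum(costs - avg_costs)                    # instantaneous vs. average cost of pi(t)
    pseudo_regret = np.sum(avg_costs - opt_avg_cost)   # the term bounded via EWA
    W_T = np.sum(opt_avg_cost - opt_costs)             # average vs. instantaneous cost of pi*
    R_T = np.sum(costs - opt_costs)                    # total regret
    assert np.isclose(R_T, V_T + pseudo_regret + W_T)  # the sums telescope exactly
    return V_T, pseudo_regret, W_T
```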

Least squares policy estimation (LSPE)

Bounding $R_T$ requires $\hat Q_i(x, a) \in [b, b + Q_{\max}]$ and the bounded errors
$\|Q^{\pi_i} - \hat Q_i\|_{\nu^*},\; \|Q^{\pi_i} - \hat Q_i\|_{\mu^* \otimes \pi_i} \le \varepsilon(\tau) = \varepsilon_0 + O(\sqrt{1/\tau})$.

LSPE:
- Linear value function approximation $\hat Q^\pi = \Psi w_\pi$.
- Obtains a simulation-based solution to the projected Bellman equation $\Psi w = \Pi_\pi (c - \lambda \mathbf{1} + H \Psi w)$; a least-squares sketch of this fixed point appears after the assumptions below.

Under the additional assumptions below, w.h.p. LSPE satisfies the error bound, and $Q_{\max}$ is bounded.

- A3 (Features). The columns of $[\Psi\; \mathbf{1}]$ are linearly independent, and the features are bounded.
- A4 (Feature excitation). For any π, $\lambda_{\min}(\Psi^\top \mathrm{diag}(\nu_\pi)\, \Psi) \ge \sigma > 0$.
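To make the fixed point $\Psi w = \Pi_\pi(c - \lambda \mathbf{1} + H \Psi w)$ concrete, here is a hedged sketch of a batch least-squares solve under linear features. It is an LSTD-style stand-in rather than the iterative LSPE recursion of Yu and Bertsekas, and the function name, the regularizer, and the sampling assumptions (on-policy transitions and next actions) are illustrative.

```python
import numpy as np

def lstd_average_cost_q(phi, phi_next, costs, reg=1e-6):
    """One-shot least-squares solve of the projected average-cost Bellman
    equation Psi w = Pi_pi(c - lambda 1 + H Psi w) from a batch of transitions.

    phi:      features Psi(x_t, a_t),                         shape (T, d)
    phi_next: features Psi(x_{t+1}, a_{t+1}), a_{t+1} ~ pi,   shape (T, d)
    costs:    c(x_t, a_t),                                    shape (T,)
    Returns (w, lambda_hat) with Q_hat(x, a) = Psi(x, a) @ w.
    """
    lam_hat = costs.mean()                        # empirical average cost of pi
    A = phi.T @ (phi - phi_next) / len(costs)     # sample estimate of Psi^T D (I - H) Psi
    b = phi.T @ (costs - lam_hat) / len(costs)    # sample estimate of Psi^T D (c - lambda 1)
    w = np.linalg.solve(A + reg * np.eye(phi.shape[1]), b)
    return w, lam_hat
```

In the algorithm above, an estimate of this kind would play the role of $\hat Q_i$, computed from the datasets collected so far.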

Experiments

Politex + LSPE on queueing problems

[Figure: average cost versus training phase for the 4-queue and 8-queue environments, comparing LSPI, POLITEX, and RLSVI.]

Figure: Average cost at the end of each phase for the 4-queue and 8-queue environments (mean and std of 50 runs), for different η.

Politex + neural networks on Atari

[Figure: reward per episode versus game frames (up to 1e8), comparing POLITEX and DQN.]

Figure: Ms Pacman game scores obtained by the agents at the end of each game, using runs with different random seeds.

Related work

Y. Abbasi-Yadkori, N. Lazić, and C. Szepesvári. "Regret bounds for model-free LQ control." AISTATS (2019).

E. Even-Dar, S. M. Kakade, and Y. Mansour. "Online MDPs." Mathematics of Operations Research 34.3 (2009).

H. Yu and D. P. Bertsekas. "Convergence results for some temporal difference methods based on least squares." IEEE Transactions on Automatic Control 54.7 (2009).