
Politex: Regret Bounds for Policy Iteration Using Expert Prediction

Yasin Abbasi-Yadkori, Peter L. Bartlett, Kush Bhatia, Nevena Lazić, Csaba Szepesvári, Gellért Weisz

Summary

▶ Setting: average-cost RL with discrete actions and value function approximation.

▶ Politex: softened and averaged policy iteration.

▶ If, after τ steps of data collection, the estimate $\widehat{Q}_\pi$ of the value function $Q_\pi$ satisfies
  $\|\widehat{Q}_\pi - Q_\pi\|_\nu \le \varepsilon(\tau) = \varepsilon_0 + O(\sqrt{1/\tau})$,
  where ε₀ is the approximation error and ν is the stationary state-action distribution, then the regret of Politex in uniformly mixing MDPs is of the order
  $R_T = O(T^{3/4} + \varepsilon_0 T)$.

▶ The regret bound does not scale with the size of the MDP and does not depend on the "concentrability coefficient"; the algorithm is easy to implement (no confidence bounds required).

Politex algorithm

Input: phase length τ > 0, initial state $x_0$
Set $Q_0(x, a) = 0$ for all $x, a$
for $i := 1, 2, \dots$ do
    Policy iteration: $\pi_i(\cdot \mid x) = \operatorname{argmin}_{u \in \Delta} \langle u,\ Q_{i-1}(x, \cdot) \rangle$
    Politex: $\pi_i(\cdot \mid x) = \operatorname{argmin}_{u \in \Delta} \langle u,\ \sum_{j=0}^{i-1} Q_j(x, \cdot) \rangle - \eta^{-1} H(u) \;\propto\; \exp\!\big(-\eta \sum_{j=0}^{i-1} Q_j(x, \cdot)\big)$
    Execute $\pi_i$ for τ time steps and collect dataset $Z_i$
    Estimate $Q_i$ from $Z_1, \dots, Z_i$ and $\pi_1, \dots, \pi_i$
end for

Different value estimation methods are possible. For non-linear Q-functions, one can maintain only the most recent n estimates.
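For concreteness, here is a minimal Python sketch of the phase loop above. The environment interface (reset/step returning a cost) and the estimate_q routine are hypothetical placeholders standing in for whatever value-estimation method is used; this illustrates the update structure, not the authors' implementation.

```python
import numpy as np

def politex(env, estimate_q, num_phases, tau, eta, num_actions, seed=0):
    """Sketch of the Politex phase loop.

    Hypothetical interfaces (not from the paper): `env.reset()` returns a
    state, `env.step(a)` returns (next_state, cost), and
    `estimate_q(data, policy)` returns a callable state -> length-num_actions
    array of Q-estimates for the policy that generated `data`.
    """
    rng = np.random.default_rng(seed)
    q_estimates = []            # Q_1, ..., Q_{i-1} from earlier phases

    def policy(state):
        # pi_i(. | x) proportional to exp(-eta * sum_j Q_j(x, .)).
        # With no estimates yet this is uniform, matching Q_0 = 0.
        if not q_estimates:
            return int(rng.integers(num_actions))
        logits = -eta * np.sum([q(state) for q in q_estimates], axis=0)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(rng.choice(num_actions, p=probs))

    x = env.reset()
    for _ in range(num_phases):
        # Execute the current policy for tau steps; collect dataset Z_i.
        data = []
        for _ in range(tau):
            a = policy(x)
            x_next, cost = env.step(a)
            data.append((x, a, cost, x_next))
            x = x_next
        # Estimate Q_i for the executed policy and keep it for future phases.
        q_estimates.append(estimate_q(data, policy))
    return policy
```

With linear function approximation the running sum of Q-estimates collapses to a single weight vector, so storage does not grow with the number of phases; with non-linear Q-functions one can instead keep only the most recent n estimates, as noted above.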

Politex analysis

Assumptions:

▶ A1 (Unichain). MDP states form a single recurrent class.

▶ A2 (Uniform mixing). $\sup_\pi \|(\nu_\pi - \nu)^\top H_\pi\|_1 \le \exp(-\kappa^{-1})\, \|\nu_\pi - \nu\|_1$ for any state-action distribution ν, where $H_\pi$ is the transition probability matrix over (s, a) pairs under π, and $\nu_\pi$ is the stationary state-action distribution of π.

Let $\lambda_\pi = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} c(x_t, \pi(x_t))$ be the average cost of a policy. The regret decomposes as

$R_T = \sum_{t=1}^{T} \big( c(x_t, a_t) - c(x_t^*, a_t^*) \big) = V_T + \bar{R}_T + W_T$,

where

$V_T = \sum_{t=1}^{T} \big( c_t - \lambda_{\pi(t)} \big)$, $\quad \bar{R}_T = \sum_{t=1}^{T} \big( \lambda_{\pi(t)} - \lambda_{\pi^*} \big)$, $\quad W_T = \sum_{t=1}^{T} \big( \lambda_{\pi^*} - c_t^* \big)$.

▶ $V_T$ and $W_T$ are the differences between the instantaneous and average costs; with high probability they scale as $\kappa T^{3/4}$ and $\kappa \sqrt{T}$, respectively.

▶ $\bar{R}_T$ is the pseudo-regret, bounded using the regret bound of the Exponentially Weighted Average (EWA) forecaster together with the value error bound.
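To make the expert-prediction view explicit: at each state the actions play the role of experts and the Q-estimates play the role of loss vectors, so the Politex policy update is an EWA forecaster run independently per state. Below is a generic EWA sketch (an illustration of that reading, not code from the paper); its standard regret guarantee, scaling with the loss range and the square root of the number of rounds, is the ingredient that, together with the value error bound, controls $\bar{R}_T$.

```python
import numpy as np

class EWAForecaster:
    """Exponentially Weighted Average forecaster over K experts.

    In the Politex view, the experts at a fixed state x are the actions, and
    the loss vector revealed in phase i is the Q-estimate Q_i(x, .), so the
    played distribution matches pi_i(. | x) ~ exp(-eta * sum_j Q_j(x, .)).
    """

    def __init__(self, num_experts, eta):
        self.eta = eta
        self.cumulative_loss = np.zeros(num_experts)

    def weights(self):
        # Stabilised softmax of the negated cumulative losses.
        logits = -self.eta * self.cumulative_loss
        w = np.exp(logits - logits.max())
        return w / w.sum()

    def update(self, loss_vector):
        # Reveal the latest loss vector (e.g. Q_i(x, .) for this state).
        self.cumulative_loss += np.asarray(loss_vector, dtype=float)
```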

Least-squares policy evaluation (LSPE)

Bounding $\bar{R}_T$ requires $Q_i(x, a) \in [b, b + Q_{\max}]$ and bounded error

$\|Q_{\pi_i} - Q_i\|_{\nu^*},\ \|Q_{\pi_i} - Q_i\|_{\mu^* \otimes \pi_i} \le \varepsilon(\tau) = \varepsilon_0 + O(\sqrt{1/\tau})$.

LSPE:

▶ Linear value function approximation $Q_\pi = \Psi w_\pi$.

▶ Obtains a simulation-based solution to the projected Bellman equation $\Psi w = \Pi_\pi (c - \lambda \mathbf{1} + H \Psi w)$.

Under the additional assumptions below, LSPE satisfies the error bound with high probability, and $Q_{\max}$ is bounded; a simulation-based sketch follows the assumptions.

▶ A3 (Features). Columns of $[\Psi\ \mathbf{1}]$ are linearly independent, and features are bounded.

▶ A4 (Feature excitation). For any π, $\lambda_{\min}(\Psi^\top \operatorname{diag}(\nu_\pi) \Psi) \ge \sigma > 0$.
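A minimal sketch of a simulation-based LSPE-style solver for the projected Bellman equation above, under linear features: it estimates the average cost from the data and repeatedly fits $\Psi w$ to the one-step Bellman backup of the previous iterate. The data format, feature map, and number of inner iterations are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def lspe_linear(features, data, policy_probs, num_iters=50):
    """Simulation-based least-squares policy evaluation sketch (linear Q = Psi w).

    Illustrative assumptions (not the paper's exact implementation):
      features(x, a)   -> 1-D feature vector psi(x, a)
      data             -> list of transitions (x, a, cost, x_next) collected under pi
      policy_probs(x)  -> array of action probabilities pi(. | x)
    """
    # Empirical average cost of the policy (estimates lambda).
    lam = np.mean([cost for (_, _, cost, _) in data])

    # Feature matrix Psi over visited (x, a) pairs, and the expected
    # next-step features E_{a' ~ pi}[psi(x', a')] for each transition.
    Psi = np.array([features(x, a) for (x, a, _, _) in data])
    Psi_next = np.array([
        np.sum([p * features(x_next, a_next)
                for a_next, p in enumerate(policy_probs(x_next))], axis=0)
        for (_, _, _, x_next) in data
    ])
    costs = np.array([cost for (_, _, cost, _) in data])

    # Fixed-point iteration: least-squares fit of Psi w to the one-step
    # backup c - lambda*1 + (H Psi) w_k, the empirical analogue of the
    # projected Bellman equation Psi w = Pi_pi (c - lambda*1 + H Psi w).
    w = np.zeros(Psi.shape[1])
    for _ in range(num_iters):
        target = costs - lam + Psi_next @ w
        w, *_ = np.linalg.lstsq(Psi, target, rcond=None)
    return w
```

The resulting estimate $Q_i(x, a) = \psi(x, a)^\top w$ is what Politex adds to its running sum of Q-estimates.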

Experiments

Politex + LSPE on queueing problems

[Figure: average cost vs. phase (0–2000) for LSPI, POLITEX, and RLSVI on the 4-queue and 8-queue environments.]

Figure: Average cost at the end of each phase for the 4-queue and 8-queue environments (mean and std of 50 runs), for different η.

Politex + neural networks on Atari

[Figure: reward per episode vs. game frames (0 to 1e8) for POLITEX and DQN.]

Figure: Ms Pacman game scores obtained by the agents at the end of each game, using runs with different random seeds.

Related work

Y. Abbasi-Yadkori, N. Lazić, and C. Szepesvári. "Regret bounds for model-free LQ control." AISTATS (2019).

E. Even-Dar, S. M. Kakade, and Y. Mansour. "Online MDPs." Mathematics of Operations Research 34.3 (2009).

H. Yu and D. P. Bertsekas. "Convergence results for some temporal difference methods based on least squares." IEEE Transactions on Automatic Control 54.7 (2009).