Safe and Efficient Off-Policy Reinforcement Learning (NIPS 2016)
Yasuhiro Fujita
Preferred Networks Inc.
January 11, 2017
[Munos et al. 2016]
▶ Proposes a new off-policy multi-step RL method: Retrace(λ)
▶ Good theoretical properties: low-variance, safe and efficient
▶ It outperforms one-step Q-learning and existing multi-step variants
▶ Proves the convergence of Watkins's Q(λ) for the first time
Multi-step methods
▶ Multi-step methods have some advantages over single-step methods
▶ They can balance bias and variance
▶ They can propagate values more quickly
▶ Example: SARSA(λ)
▶ n-step return:
R_s^{(n)} = \sum_{t=s}^{s+n} \gamma^{t-s} r_t + \gamma^{n+1} Q(x_{s+n+1}, a_{s+n+1})
▶ λ-return based update rule:
\Delta Q(x_s, a_s) = \sum_{t \ge s} (\lambda\gamma)^{t-s} \delta_t
\delta_t = r_t + \gamma Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)
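As a sketch of the λ-return update above, the discounted sum of TD errors can be accumulated over a finite trajectory. This is an illustrative implementation, not the paper's code; the function name and data layout are my own.

```python
def lambda_return_update(Q, trajectory, gamma=0.99, lam=0.9):
    """Sketch of the SARSA(lambda) update above.

    Q: dict mapping (state, action) -> value.
    trajectory: list of (x_t, a_t, r_t) tuples sampled on-policy.
    Returns a dict of accumulated updates Delta Q(x_s, a_s).
    """
    T = len(trajectory)
    # One-step TD errors: delta_t = r_t + gamma*Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)
    deltas = []
    for t in range(T):
        x, a, r = trajectory[t]
        if t + 1 < T:
            xn, an, _ = trajectory[t + 1]
            target = r + gamma * Q[(xn, an)]
        else:
            target = r  # terminal step: no bootstrap term
        deltas.append(target - Q[(x, a)])
    # Delta Q(x_s, a_s) = sum_{t >= s} (lam * gamma)^(t - s) * delta_t
    updates = {}
    for s in range(T):
        x, a, _ = trajectory[s]
        g = sum((lam * gamma) ** (t - s) * deltas[t] for t in range(s, T))
        updates[(x, a)] = updates.get((x, a), 0.0) + g
    return updates
```

With λ = 1 this recovers the Monte-Carlo return; with λ = 0 it reduces to the one-step TD error, which is the bias/variance trade-off the slide refers to.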
Multi-step methods in off-policy settings
▶ Can we apply multi-step methods to off-policy cases?
▶ Policy evaluation: estimate Q^π from samples collected by µ (π ≠ µ)
▶ Control: estimate Q* from samples collected by µ
▶ In “Algorithms for Reinforcement Learning”, p. 57
Watkins’s Q(λ) [Watkins 1989]
▶ Classic multi-step algorithm for off-policy control
▶ Cut off traces whenever a non-greedy action is taken
R_s^{(n)} = \sum_{t=s}^{s+n} \gamma^{t-s} r_t + \gamma^{n+1} \max_a Q(x_{s+n+1}, a)
(for any n < \tau = \arg\min_{u \ge 1} I\{\pi_{s+u} \ne \mu_{s+u}\})
▶ Converges to Q* under a mild assumption (proved in this paper)
▶ Only slightly faster than one-step Q-learning if non-greedy actions are frequent (i.e. not "efficient")
General operator R
▶ To compare off-policy multi-step methods, consider the general operator R:
R Q(x, a) = Q(x, a) + E_\mu \left[ \sum_{t \ge 0} \gamma^t (c_1 \cdots c_t) \left( r_t + \gamma E_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t) \right) \right]
▶ Different non-negative coefficients c_s (traces) result in different methods
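A single-trajectory, sample-based estimate of this operator can make the role of the traces concrete. The sketch below uses my own names and a tabular Q; it is not code from the paper.

```python
import numpy as np

def general_operator_estimate(Q, traj, pi, cs, gamma=0.99):
    """One-trajectory estimate of R Q(x_0, a_0) for the general operator:
    Q(x_0, a_0) + sum_t gamma^t (c_1 ... c_t) * delta_t, where
    delta_t = r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t).

    Q:  array of shape [num_states, num_actions] (tabular Q).
    pi: target policy, array [num_states, num_actions].
    traj: list of (x_t, a_t, r_t, x_{t+1}) sampled from behaviour mu.
    cs: trace coefficients c_1, c_2, ... (cs[t-1] applies at step t).
    """
    x0, a0 = traj[0][0], traj[0][1]
    total = Q[x0, a0]
    trace = 1.0  # running product c_1 ... c_t (empty product = 1 at t = 0)
    for t, (x, a, r, xn) in enumerate(traj):
        if t > 0:
            trace *= cs[t - 1]
        exp_Q_next = np.dot(pi[xn], Q[xn])  # E_pi Q(x_{t+1}, .)
        delta = r + gamma * exp_Q_next - Q[x, a]
        total += gamma ** t * trace * delta
    return total
```

Plugging in the different c_s choices from the following slides (IS, TB, Retrace) turns this one estimator into each of the compared methods.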
Desired properties
▶ Low variance
▶ Variance of the online estimate is small
▶ ≈ V(c_1 · · · c_t) is small
▶ ≈ V(c) is small
▶ Safe
▶ Convergence to Q^π (policy evaluation) or Q* (control) is guaranteed
▶ Efficient
▶ Traces are not unnecessarily cut if π and µ are close
Comparison of properties
▶ Retrace(λ) is low-variance, safe and efficient
▶ Note that Watkins's Q(λ) ≠ Q^π(λ)
Importance Sampling (IS) [Precup et al. 2000]
c_s = \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}
▶ RQ = Q^π for any Q in this case
▶ If Q = 0, it just becomes the basic IS estimate \sum_{t \ge 0} \gamma^t \left( \prod_{s=1}^{t} c_s \right) r_t
▶ High variance, mainly due to the variance of the product \frac{\pi(a_1|x_1)}{\mu(a_1|x_1)} \cdots \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}
Off-policy Q^π(λ) and Q*(λ) [Harutyunyan et al. 2016]
c_s = \lambda
▶ A very recently proposed alternative
▶ Q^π(λ) for policy evaluation, Q*(λ) for control
▶ To guarantee convergence, π and µ must be sufficiently close:
▶ In policy evaluation, \lambda < \frac{1-\gamma}{\gamma\epsilon}, where \epsilon := \max_x \|\pi(\cdot|x) - \mu(\cdot|x)\|_1
▶ In control, \lambda < \frac{1-\gamma}{2\gamma}
▶ Available even if µ is unknown and/or non-Markovian
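To see how restrictive the policy-evaluation condition is, one can compute ε and the resulting bound on λ directly. This is an illustrative helper of my own (it assumes ε > 0, i.e. π ≠ µ):

```python
import numpy as np

def qpi_lambda_bound(pi, mu, gamma):
    """Sketch of the Q^pi(lambda) policy-evaluation condition:
    lambda < (1 - gamma) / (gamma * eps), with
    eps = max_x || pi(.|x) - mu(.|x) ||_1.

    pi, mu: policy tables of shape [num_states, num_actions].
    Returns (eps, upper bound on lambda). Assumes eps > 0.
    """
    eps = np.abs(np.asarray(pi) - np.asarray(mu)).sum(axis=1).max()
    return eps, (1 - gamma) / (gamma * eps)
```

For example, with γ = 0.99 and policies differing by probability 0.1 on a single action, ε = 0.2 and the bound is about 0.05, so only very small λ are allowed.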
Tree Backup (TB) [Precup et al. 2000]
c_s = \lambda \pi(a_s|x_s)
▶ The operator defines a contraction, thus is safe
▶ Not efficient because it cuts traces even if π = µ
▶ Available even if µ is unknown and/or non-Markovian
Retrace(λ)
c_s = \lambda \min\left(1, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\right)
▶ Proposed by this paper
▶ IS ratio truncated at 1
▶ If π is close to µ, c_s is close to 1, avoiding unnecessary trace cuts
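The three methods above differ only in their choice of c_s, so a small helper makes the comparison concrete. This is an illustrative sketch (names are mine), operating on the probabilities of the actions actually taken:

```python
import numpy as np

def traces(pi_probs, mu_probs, lam=1.0):
    """Per-step trace coefficients c_s for IS, TB and Retrace,
    given pi(a_s|x_s) and mu(a_s|x_s) for the sampled actions."""
    pi_probs = np.asarray(pi_probs, dtype=float)
    mu_probs = np.asarray(mu_probs, dtype=float)
    ratio = pi_probs / mu_probs
    return {
        "IS": ratio,                              # unbounded -> high variance
        "TB": lam * pi_probs,                     # cuts traces even if pi == mu
        "Retrace": lam * np.minimum(1.0, ratio),  # IS ratio truncated at 1
    }
```

In the on-policy case (π = µ, e.g. both uniform over four actions) Retrace keeps full traces of 1, while TB shrinks every step to 0.25, which is exactly the "not efficient" behaviour criticized above.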
Experiments on Atari 2600
▶ Trained asynchronously with 16 threads
▶ Each thread has a private replay memory holding 62,500 transitions
▶ Q-learning uses a minibatch of 64 transitions
▶ Retrace, TB and Q* use four 16-step sequences
Performance comparison
▶ Inter-algorithm scores of 0 and 1 correspond to the worst and best scores for a particular game, respectively
▶ Retrace(λ) performs best on 30 out of 60 games
Sensitivity to λ
▶ Note that the Q-learning scores are fixed across different λ
▶ Q* performs best for small values of λ
Conclusions
▶ Retrace(λ)
▶ is low-variance, safe and efficient
▶ outperforms one-step Q-learning and existing multi-step variants on Atari 2600
▶ (is already applied to A3C in another paper [Wang et al. 2016])
▶ Watkins’s Q(λ) now has a convergence guarantee
Future work
▶ Estimate µ if it is unknown
▶ Relaxing the Markov assumption in the control case to allow c_s > 1:
c_s = \lambda \min\left(\frac{1}{c_1 \cdots c_{s-1}}, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\right)
Theorem 3
▶ Convergence of sample-based online algorithm
▶ As a corollary, Watkins’s Q(λ) converges to Q∗
▶ Only cs is different
References I
[1] Anna Harutyunyan et al. "Q(λ) with Off-Policy Corrections". In: Proceedings of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.
[2] Remi Munos et al. "Safe and Efficient Off-Policy Reinforcement Learning". In: Proceedings of Neural Information Processing Systems (NIPS). 2016. arXiv: 1606.02647.
[3] Doina Precup, Richard S. Sutton, and Satinder P. Singh. "Eligibility Traces for Off-Policy Policy Evaluation". In: ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning (2000), pp. 759–766.
[4] Ziyu Wang et al. "Sample Efficient Actor-Critic with Experience Replay". In: arXiv (2016), pp. 1–20. arXiv: 1611.01224.
[5] Christopher John Cornish Hellaby Watkins. "Learning from delayed rewards". PhD thesis. Cambridge University, 1989.