Safe and Efficient Off-Policy Reinforcement Learning


Transcript of Safe and Efficient Off-Policy Reinforcement Learning

Page 1: Safe and Efficient Off-Policy Reinforcement Learning

Safe and Efficient Off-Policy Reinforcement Learning

NIPS 2016

Yasuhiro Fujita

Preferred Networks Inc.

January 19, 2017

Page 2: Safe and Efficient Off-Policy Reinforcement Learning

Safe and Efficient Off-Policy Reinforcement Learning

by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare

▶ Off-policy RL: learning the value function for one policy π, Qπ(x, a) = Eπ[r1 + γ r2 + γ^2 r3 + · · · | x0 = x, a0 = a], from data collected by another policy µ ≠ π

▶ Retrace(λ): a new off-policy multi-step RL algorithm

▶ Theoretical advantages

+ It converges for any π and µ (safe)

+ It makes the best use of samples if π and µ are close to each other (efficient)

+ Its variance is lower than that of importance sampling

▶ Empirical evaluation

▶ On Atari 2600 it beats one-step Q-learning (DQN) and the existing multi-step methods (Q∗(λ), Tree-Backup)

Page 3: Safe and Efficient Off-Policy Reinforcement Learning

Notation and definitions

▶ state x ∈ X

▶ action a ∈ A

▶ discount factor γ ∈ [0, 1]

▶ immediate reward r ∈ R

▶ policies π, µ : X × A → [0, 1]

▶ value function Qπ(x, a) := Eπ[r1 + γ r2 + γ^2 r3 + · · · | x0 = x, a0 = a]

▶ optimal value function Q∗ := maxπ Qπ

▶ EπQ(x, ·) := Σ_a π(a|x) Q(x, a)
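As a small illustration of the last definition, here is a minimal Python sketch (not from the slides; the tabular arrays Q and pi are hypothetical) of the expectation EπQ(x, ·):

```python
import numpy as np

# Minimal sketch, assuming a tabular setting: Q is a |X| x |A| array of
# action values and pi[x] is the action distribution pi(.|x) as a length-|A| array.
def expected_q(Q, pi, x):
    """E_pi Q(x, .) = sum_a pi(a|x) * Q(x, a)."""
    return float(np.dot(pi[x], Q[x]))
```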

Page 4: Safe and Efficient Off-Policy Reinforcement Learning

Policy evaluation

▶ Learning the value function for a policy π:

Qπ(x, a) = Eπ[r1 + γ r2 + γ^2 r3 + · · · | x0 = x, a0 = a]

▶ You can learn optimal control if π is a greedy policy with respect to the current estimate Q(x, a), e.g. Q-learning

▶ On-policy: learning from data collected by π

▶ Off-policy: learning from data collected by µ ̸= π

▶ Off-policy methods have advantages:

+ Sample-efficient (e.g. experience replay)

+ Exploration by µ

Page 5: Safe and Efficient Off-Policy Reinforcement Learning

On-policy multi-step methods

From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

▶ Temporal difference (or “surprise”) at time t: δt = rt + γ Q(xt+1, at+1) − Q(xt, at)

▶ You can use δt to estimate Qπ(xt, at) (one-step)

▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t? (multi-step)
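For illustration, a minimal Python sketch of the on-policy TD error along a sampled trajectory (the trajectory format and tabular Q are assumptions, not from the slides):

```python
# Sketch, assuming a tabular Q (dict of dicts or 2-D array) and a trajectory
# given as (x_t, a_t, r_t, x_{t+1}, a_{t+1}) tuples.
def one_step_td_errors(Q, trajectory, gamma):
    """delta_t = r_t + gamma * Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t) for each step."""
    return [r + gamma * Q[x_next][a_next] - Q[x][a]
            for (x, a, r, x_next, a_next) in trajectory]
```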

Page 6: Safe and Efficient Off-Policy Reinforcement Learning

TD(λ)

From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

▶ A popular multi-step algorithm for on-policy policy evaluation

▶ ∆tQ(x, a) = (γλ)^t δt, where λ ∈ [0, 1] is chosen to balance bias and variance

▶ Multi-step methods have advantages:

+ Rewards are propagated rapidly

+ Bias introduced by bootstrapping is reduced
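A hedged sketch of the resulting forward-view update for the starting pair, ∆Q(x0, a0) = Σt (γλ)^t δt; the `deltas` list would come from a TD-error routine like the earlier sketch:

```python
# Sketch: accumulate the TD(lambda) forward-view update
# Delta Q(x0, a0) = sum_t (gamma * lam)^t * delta_t over one trajectory.
def td_lambda_update(deltas, gamma, lam):
    return sum(((gamma * lam) ** t) * delta for t, delta in enumerate(deltas))
```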

Page 7: Safe and Efficient Off-Policy Reinforcement Learning

Off-policy multi-step algorithm

From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

▶ δt = rt + γ EπQ(xt+1, ·) − Q(xt, at)

▶ You can use δt to estimate Qπ(xt, at), e.g. Q-learning

▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t?

▶ δt might be less relevant to Qπ(xs, as) than in the on-policy case
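A minimal sketch of this expected-value TD error (again assuming hypothetical tabular arrays Q and pi of shape |X| × |A|):

```python
import numpy as np

# Sketch: delta_t = r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t),
# where the expectation over actions uses the target policy pi.
def expected_td_error(Q, pi, x, a, r, x_next, gamma):
    return r + gamma * float(np.dot(pi[x_next], Q[x_next])) - Q[x, a]
```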

Page 8: Safe and Efficient Off-Policy Reinforcement Learning

Importance Sampling (IS) [Precup et al. 2000]

From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

▶ ∆tQ(x, a) = γ^t (∏_{1≤s≤t} π(as|xs) / µ(as|xs)) δt

+ Unbiased estimate of Qπ

− Large (possibly infinite) variance, since π(as|xs) / µ(as|xs) is not bounded
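A sketch of the importance-sampling update for the starting pair (x0, a0); `ratios` is a hypothetical list of likelihood ratios π(as|xs)/µ(as|xs) for s = 1, 2, . . .:

```python
# Sketch: Delta Q(x0, a0) = sum_t gamma^t * (prod_{1<=s<=t} pi(a_s|x_s)/mu(a_s|x_s)) * delta_t.
# The empty product at t = 0 is 1; the running product is unbounded, hence the variance issue.
def importance_sampling_update(deltas, ratios, gamma):
    total, weight = 0.0, 1.0
    for t, delta in enumerate(deltas):
        if t >= 1:
            weight *= ratios[t - 1]  # extend the product to step s = t
        total += (gamma ** t) * weight * delta
    return total
```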

Page 9: Safe and Efficient Off-Policy Reinforcement Learning

Qπ(λ) [Harutyunyan et al. 2016]

From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

▶ ∆tQ(x, a) = (γλ)^t δt

+ Convergent if µ and π are sufficiently close to each other or λ is sufficiently small: λ < (1 − γ) / (γϵ), where ϵ := maxx ∥π(·|x) − µ(·|x)∥1

− Not convergent otherwise
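A small sketch of checking this sufficient condition, with π and µ as hypothetical |X| × |A| arrays of action probabilities:

```python
import numpy as np

# Sketch: lambda < (1 - gamma) / (gamma * eps), eps = max_x ||pi(.|x) - mu(.|x)||_1.
def q_pi_lambda_condition_holds(pi, mu, gamma, lam):
    eps = np.abs(pi - mu).sum(axis=1).max()
    if eps == 0.0:  # on-policy: any lambda in [0, 1] is fine
        return True
    return lam < (1.0 - gamma) / (gamma * eps)
```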

Page 10: Safe and Efficient Off-Policy Reinforcement Learning

Tree-Backup (TB) [Precup et al. 2000]

From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

▶ ∆tQ(x, a) = (γλ)^t (∏_{1≤s≤t} π(as|xs)) δt

+ Convergent for any π and µ

+ Works even if µ is unknown and/or non-Markov

− ∏_{1≤s≤t} π(as|xs) decays rapidly when near on-policy
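A sketch of the Tree-Backup trace coefficients cs = λ π(as|xs); the probability list is a hypothetical input:

```python
# Sketch: Tree-Backup uses c_s = lam * pi(a_s|x_s), so the trace product
# prod_s c_s shrinks quickly even when mu is close to pi.
def tree_backup_coefficients(pi_probs, lam):
    """pi_probs[s-1] = pi(a_s|x_s) along the sampled trajectory."""
    return [lam * p for p in pi_probs]
```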

Page 11: Safe and Efficient Off-Policy Reinforcement Learning

A unified view

▶ General algorithm: ∆Q(x, a) = Σ_{t≥0} γ^t (∏_{1≤s≤t} cs) δt

▶ None of the existing methods is perfect:

▶ Low variance (↔ IS)

▶ “Safe”, i.e. convergent for any π and µ (↔ Qπ(λ))

▶ “Efficient”, i.e. using full returns when on-policy (↔ Tree-Backup)
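All of the above methods are instances of this general update; a minimal sketch (not from the paper), where `coeffs` supplies the per-step coefficients cs:

```python
# Sketch: Delta Q(x0, a0) = sum_{t>=0} gamma^t * (prod_{1<=s<=t} c_s) * delta_t.
# Plugging in different coefficients recovers IS, Q^pi(lambda), Tree-Backup and Retrace.
def general_offpolicy_update(deltas, coeffs, gamma):
    """coeffs[s-1] = c_s; deltas[t] = delta_t; the empty product at t = 0 is 1."""
    total, trace = 0.0, 1.0
    for t, delta in enumerate(deltas):
        if t >= 1:
            trace *= coeffs[t - 1]
        total += (gamma ** t) * trace * delta
    return total
```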

Page 12: Safe and Efficient Off-Policy Reinforcement Learning

Choice of the coefficients cs

▶ Contraction speed

▶ Consider a general operator R:

RQ(x, a) = Q(x, a) + Eµ[Σ_{t≥0} γ^t (∏_{1≤s≤t} cs) δt]

▶ If 0 ≤ cs ≤ π(as|xs) / µ(as|xs), then R is a contraction and Qπ is its fixed point (thus the algorithm is “safe”):

|RQ(x, a) − Qπ(x, a)| ≤ η(x, a) ∥Q − Qπ∥

η(x, a) := 1 − (1 − γ) Eµ[Σ_{t≥0} γ^t (∏_{s=1}^{t} cs)]

▶ η = 0 for cs = 1 (“efficient”)

▶ Variance

▶ Choosing cs ≤ 1 results in low variance, since ∏_{1≤s≤t} cs ≤ 1
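For intuition, a rough Monte-Carlo sketch of the contraction factor η(x, a) for a given choice of coefficients, using truncated trajectories of coefficients sampled under µ (the input format is an assumption, not from the paper):

```python
# Sketch: eta(x, a) = 1 - (1 - gamma) * E_mu[ sum_{t>=0} gamma^t * prod_{s=1}^{t} c_s ],
# approximated from finite trajectories of coefficients c_1..c_T sampled under mu.
def estimate_eta(coeff_trajectories, gamma):
    values = []
    for coeffs in coeff_trajectories:
        trace, acc = 1.0, 0.0
        for t in range(len(coeffs) + 1):  # t = 0..T (truncated sum)
            if t >= 1:
                trace *= coeffs[t - 1]
            acc += (gamma ** t) * trace
        values.append(acc)
    return 1.0 - (1.0 - gamma) * (sum(values) / len(values))
```

With cs = 1 and long trajectories the inner sum approaches 1/(1 − γ), so the estimate gives η ≈ 0, matching the “efficient” case on the slide.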

Page 13: Safe and Efficient Off-Policy Reinforcement Learning

Retrace(λ)

From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

▶ ∆tQ(x, a) = γ^t (∏_{1≤s≤t} λ min(1, π(as|xs) / µ(as|xs))) δt

+ Variance is bounded

+ Convergent for any π and µ

+ Uses full returns when on-policy

− Doesn’t work if µ is unknown or non-Markov (↔ Tree-Backup)
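A sketch of the Retrace(λ) coefficients cs = λ min(1, π(as|xs)/µ(as|xs)), which could be plugged into a general update routine like the earlier sketch; the probability lists are hypothetical inputs:

```python
# Sketch: Retrace truncates the importance ratio at 1, so each c_s is bounded
# (low variance) and equals lam when on-policy (full returns for lam = 1).
def retrace_coefficients(pi_probs, mu_probs, lam):
    """pi_probs[s-1] = pi(a_s|x_s), mu_probs[s-1] = mu(a_s|x_s)."""
    return [lam * min(1.0, p / m) for p, m in zip(pi_probs, mu_probs)]
```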

Page 14: Safe and Efficient Off-Policy Reinforcement Learning

Evaluation on Atari 2600

▶ Trained asynchronously with 16 CPU threads [Mnih et al. 2016]

▶ Each thread has a private replay memory holding 62,500 transitions

▶ Q-learning uses a minibatch of 64 transitions

▶ Retrace, TB and Q∗(λ) (a control version of Qπ(λ)) use four 16-step sub-sequences

Page 15: Safe and Efficient Off-Policy Reinforcement Learning

Performance comparison

▶ Inter-algorithm scores are normalized so that 0 and 1 respectively correspond to the worst and best scores for a particular game

▶ λ = 1 performs best, except for Q∗(λ)

▶ Retrace(λ) performs best on 30 out of 60 games

Page 16: Safe and Efficient Off-Policy Reinforcement Learning

Sensitivity to the value of λ

▶ Retrace(λ) is robust and consistently outperforms Tree-Backup

▶ Q∗(λ) performs best for small values of λ

▶ Note that the Q-learning scores are fixed across different values of λ

Page 17: Safe and Efficient Off-Policy Reinforcement Learning

Conclusions

▶ Retrace(λ)

▶ is an off-policy multi-step value-based RL algorithm

▶ is low-variance, safe and efficient

▶ outperforms one-step Q-learning and existing multi-step variants on Atari 2600

▶ (has already been applied to A3C in another paper [Wang et al. 2016])

Page 18: Safe and Efficient Off-Policy Reinforcement Learning

References I

[1] Anna Harutyunyan et al. “Q(λ) with Off-Policy Corrections”. In: Proceedings of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.

[2] Volodymyr Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. 2016. arXiv: 1602.01783.

[3] Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation”. In: ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning (2000), pp. 759–766.

[4] Ziyu Wang et al. “Sample Efficient Actor-Critic with Experience Replay”. In: arXiv (2016), pp. 1–20. arXiv: 1611.01224.