Safe and Efficient Off-Policy Reinforcement Learning
Transcript of Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learning (NIPS 2016)
Yasuhiro Fujita
Preferred Networks Inc.
January 19, 2017
Safe and Efficient Off-Policy Reinforcement Learning
by Rémi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare
▶ Off-policy RL: learning the value function for one policy π,
Qπ(x, a) = Eπ[r1 + γ r2 + γ^2 r3 + ··· | x0 = x, a0 = a],
from data collected by another policy µ ≠ π
▶ Retrace(λ): a new off-policy multi-step RL algorithm
▶ Theoretical advantages
+ It converges for any π and µ (safe)
+ It makes the best use of samples if π and µ are close to each other (efficient)
+ Its variance is lower than that of importance sampling
▶ Empirical evaluation
▶ On Atari 2600 it beats one-step Q-learning (DQN) and the existing multi-step methods (Q∗(λ), Tree-Backup)
Notation and definitions
▶ state x ∈ X
▶ action a ∈ A
▶ discount factor γ ∈ [0, 1]
▶ immediate reward r ∈ R
▶ policies π, µ : X × A → [0, 1]
▶ value function Qπ(x, a) := Eπ[r1 + γ r2 + γ^2 r3 + ··· | x0 = x, a0 = a]
▶ optimal value function Q∗ := maxπ Qπ
▶ EπQ(x, ·) := Σa π(a|x) Q(x, a)
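As a quick illustration of the last definition, here is a minimal NumPy sketch of computing EπQ(x, ·) for a tabular Q and a tabular policy (the arrays and sizes are made up for illustration):

    import numpy as np

    n_states, n_actions = 4, 3
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(n_states, n_actions))              # tabular Q(x, a)
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # pi[x, a] = pi(a|x), rows sum to 1

    x = 2
    # EπQ(x, ·) = sum_a pi(a|x) * Q(x, a)
    expected_q = float(np.dot(pi[x], Q[x]))
    print(expected_q)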
Policy evaluation
▶ Learning the value function for a policy π:
Qπ(x, a) = Eπ[r1 + γ r2 + γ^2 r3 + ··· | x0 = x, a0 = a]
▶ You can learn optimal control if π is a greedy policy with respect to the current estimate Q(x, a), e.g. Q-learning
▶ On-policy: learning from data collected by π
▶ Off-policy: learning from data collected by µ ̸= π
▶ Off-policy methods have advantages:
+ Sample-efficient (e.g. experience replay)
+ Exploration by µ
On-policy multi-step methods
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ Temporal difference (or “surprise”) at time t: δt = rt + γ Q(xt+1, at+1) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at) (one-step)
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t? (multi-step)
TD(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ A popular multi-step algorithm for on-policy policy evaluation
▶ ∆tQ(x, a) = (γλ)^t δt, where λ ∈ [0, 1] is chosen to balance bias and variance (see the sketch after this list)
▶ Multi-step methods have advantages:
+ Rewards are propagated rapidly
+ Bias introduced by bootstrapping is reduced
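A minimal sketch of this forward-view TD(λ) update for the starting pair (x0, a0), assuming a tabular Q and an already collected on-policy trajectory (the function name and trajectory format are illustrative, not from the paper):

    import numpy as np

    def td_lambda_increment(Q, trajectory, gamma=0.99, lam=0.9):
        """Return sum_t (gamma * lam)^t * delta_t for one trajectory.

        trajectory: list of (x_t, a_t, r_t, x_{t+1}, a_{t+1}) tuples collected
        by following the evaluated policy pi.
        Q: 2-D array indexed as Q[x, a].
        """
        total = 0.0
        for t, (x, a, r, x_next, a_next) in enumerate(trajectory):
            delta = r + gamma * Q[x_next, a_next] - Q[x, a]   # on-policy TD error
            total += (gamma * lam) ** t * delta
        return total

    # usage (illustrative): Q[x0, a0] += alpha * td_lambda_increment(Q, trajectory)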
Off-policy multi-step algorithm
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ δt = rt + γ EπQ(xt+1, ·) − Q(xt, at)
▶ You can use δt to estimate Qπ(xt, at), e.g. Q-learning
▶ Can you use δt to estimate Qπ(xs, as) for all s ≤ t?
▶ δt might be less relevant to Qπ(xs, as) than in the on-policy case
Importance Sampling (IS) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = γ^t (∏_{1≤s≤t} π(as|xs)/µ(as|xs)) δt
+ Unbiased estimate of Qπ
− Large (possibly infinite) variance since π(as|xs)/µ(as|xs) is not bounded
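A minimal sketch of the importance-sampled update above, assuming a tabular Q and policies stored as arrays with pi[x, a] = π(a|x) and mu[x, a] = µ(a|x) (names and data layout are illustrative):

    import numpy as np

    def is_increment(Q, trajectory, pi, mu, gamma=0.99):
        """Return sum_t gamma^t * (prod_{1<=s<=t} pi/mu) * delta_t for (x_0, a_0).

        trajectory: list of (x_t, a_t, r_t, x_{t+1}) tuples collected under mu.
        """
        total, coef = 0.0, 1.0
        for t, (x, a, r, x_next) in enumerate(trajectory):
            if t > 0:
                coef *= pi[x, a] / mu[x, a]     # importance ratio, unbounded
            # off-policy TD error: r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t)
            delta = r + gamma * np.dot(pi[x_next], Q[x_next]) - Q[x, a]
            total += (gamma ** t) * coef * delta
        return total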
Qπ(λ) [Harutyunyan et al. 2016]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t δt
+ Convergent if µ and π are sufficiently close to each other or λ is sufficiently small: λ < (1 − γ)/(γϵ), where ϵ := maxx ∥π(·|x) − µ(·|x)∥1
− Not convergent otherwise
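A small sketch of evaluating this convergence threshold numerically, given hypothetical tabular π and µ; it illustrates how restrictive the bound can be when γ is close to 1:

    import numpy as np

    def qpi_lambda_threshold(pi, mu, gamma):
        """Largest lambda for which Qpi(lambda) is guaranteed to converge:
        lambda < (1 - gamma) / (gamma * eps), eps = max_x ||pi(.|x) - mu(.|x)||_1."""
        eps = np.abs(pi - mu).sum(axis=1).max()
        return (1.0 - gamma) / (gamma * eps)

    # hypothetical policies over 2 actions in 2 states
    pi = np.array([[0.9, 0.1], [0.5, 0.5]])
    mu = np.array([[0.6, 0.4], [0.5, 0.5]])
    print(qpi_lambda_threshold(pi, mu, gamma=0.99))   # ≈ 0.017: only very small lambda is safe here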
Tree-Backup (TB) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = (γλ)^t (∏_{1≤s≤t} π(as|xs)) δt
+ Convergent for any π and µ
+ Works even if µ is unknown and/or non-Markov
− ∏_{1≤s≤t} π(as|xs) decays rapidly when near on-policy
A unified view
▶ General algorithm: ∆Q(x, a) = Σ_{t≥0} γ^t (∏_{1≤s≤t} cs) δt (see the sketch after this list)
▶ None of the existing methods is perfect:
▶ Low variance (↔ IS)
▶ “Safe”, i.e. convergent for any π and µ (↔ Qπ(λ))
▶ “Efficient”, i.e. using full returns when on-policy (↔ Tree-Backup)
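A minimal sketch of the general update with the different choices of cs plugged in; the tabular representation, the trajectory format, and all function names are illustrative assumptions, not the authors' implementation:

    import numpy as np

    def off_policy_increment(Q, trajectory, pi, mu, coef_fn, gamma=0.99, lam=0.9):
        """General multi-step increment: sum_t gamma^t * (prod_{1<=s<=t} c_s) * delta_t.

        trajectory: list of (x_t, a_t, r_t, x_{t+1}) tuples collected under mu.
        coef_fn(pi, mu, lam, x, a) -> per-step trace coefficient c_s.
        """
        total, coef = 0.0, 1.0
        for t, (x, a, r, x_next) in enumerate(trajectory):
            if t > 0:
                coef *= coef_fn(pi, mu, lam, x, a)
            delta = r + gamma * np.dot(pi[x_next], Q[x_next]) - Q[x, a]
            total += (gamma ** t) * coef * delta
        return total

    # the methods above differ only in the choice of c_s:
    def cs_is(pi, mu, lam, x, a):       return pi[x, a] / mu[x, a]                   # IS
    def cs_qpi(pi, mu, lam, x, a):      return lam                                   # Qpi(lambda)
    def cs_tb(pi, mu, lam, x, a):       return lam * pi[x, a]                        # Tree-Backup
    def cs_retrace(pi, mu, lam, x, a):  return lam * min(1.0, pi[x, a] / mu[x, a])   # Retrace

Plugging cs_retrace into off_policy_increment gives the Retrace(λ) update introduced below.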
Choice of the coefficients cs
▶ Contraction speed
▶ Consider a general operator R:
RQ(x, a) = Q(x, a) + Eµ[Σ_{t≥0} γ^t (∏_{1≤s≤t} cs) δt]
▶ If 0 ≤ cs ≤ π(as|xs)/µ(as|xs), R is a contraction and Qπ is its fixed point (thus the algorithm is “safe”; a numeric check of this condition follows below):
|RQ(x, a) − Qπ(x, a)| ≤ η(x, a) ∥Q − Qπ∥, where
η(x, a) := 1 − (1 − γ) Eµ[Σ_{t≥0} γ^t (∏_{s=1}^{t} cs)]
▶ η = 0 for cs = 1, since then Eµ[Σ_{t≥0} γ^t] = 1/(1 − γ) (“efficient”)
▶ Variance
▶ cs ≤ 1 results in low variance since ∏_{1≤s≤t} cs ≤ 1
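A tiny numeric check of the “safe” condition 0 ≤ cs ≤ π(as|xs)/µ(as|xs) for the coefficient choices above, at a hypothetical state-action pair where µ takes an action that π rarely takes:

    pi_a, mu_a, lam = 0.1, 0.8, 0.9     # hypothetical pi(a|x), mu(a|x)
    ratio = pi_a / mu_a                 # pi/mu = 0.125
    for name, c in [("IS", ratio),
                    ("Qpi(lambda)", lam),
                    ("Tree-Backup", lam * pi_a),
                    ("Retrace", lam * min(1.0, ratio))]:
        print(f"{name:12s} c_s = {c:.3f}  0 <= c_s <= pi/mu: {0.0 <= c <= ratio}")
    # only Qpi(lambda) violates the condition here, matching its lack of a general guarantee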
Retrace(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
▶ ∆tQ(x, a) = γ^t (∏_{1≤s≤t} λ min(1, π(as|xs)/µ(as|xs))) δt (see the sketch below)
+ Variance is bounded
+ Convergent for any π and µ
+ Uses full returns when on-policy
− Doesn’t work if µ is unknown or non-Markov (↔ Tree-Backup)
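A tiny sketch contrasting the truncated products used by Retrace(λ) with the raw importance-sampling products, on a hypothetical sequence of likelihood ratios (the numbers are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    # hypothetical per-step likelihood ratios pi(a_s|x_s) / mu(a_s|x_s)
    ratios = rng.uniform(0.2, 3.0, size=16)
    lam = 1.0

    is_products = np.cumprod(ratios)                              # can blow up
    retrace_products = np.cumprod(lam * np.minimum(1.0, ratios))  # always stays in [0, 1]

    print("max IS product:     ", is_products.max())
    print("max Retrace product:", retrace_products.max())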
Evaluation on Atari 2600
▶ Trained asynchronously with 16 CPU threads [Mnih et al. 2016]
▶ Each thread has a private replay memory holding 62,500 transitions
▶ Q-learning uses a minibatch of 64 transitions
▶ Retrace, TB and Q∗(λ) (a control version of Qπ(λ)) use four 16-step sub-sequences
Performance comparison
▶ Inter-algorithm scores are normalized so that 0 and 1 respectively correspond to the worst and best scores for a particular game
▶ λ = 1 performs best, except for Q∗(λ)
▶ Retrace(λ) performs best on 30 out of 60 games
Sensitivity to the value of λ
▶ Retrace(λ) is robust and consistently outperforms Tree-Backup
▶ Q∗ performs best for small values of λ
▶ Note that the Q-learning scores are fixed across different λ
Conclusions
▶ Retrace(λ)
▶ is an off-policy multi-step value-based RL algorithm
▶ is low-variance, safe and efficient
▶ outperforms one-step Q-learning and existing multi-step variants on Atari 2600
▶ (is already applied to A3C in another paper [Wang et al. 2016])
References I
[1] Anna Harutyunyan et al. “Q(λ) with Off-Policy Corrections”. In: Proceedings of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.
[2] Volodymyr Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. 2016. arXiv: 1602.01783.
[3] Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation”. In: ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning (2000), pp. 759–766.
[4] Ziyu Wang et al. “Sample Efficient Actor-Critic with Experience Replay”. In:arXiv (2016), pp. 1–20. arXiv: 1611.01224.