Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learning
NIPS 2016
Yasuhiro Fujita
Preferred Networks Inc.
January 19, 2017
Safe and Efficient Off-Policy Reinforcement Learning
by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare
Off-policy RL: learning the value function of one policy π, Q^π(x, a) = E[r_1 + γ r_2 + γ^2 r_3 + ⋯ | x_0 = x, a_0 = a], from data collected by another policy μ ≠ π
Retrace(λ): a new off-policy multi-step RL algorithm
Theoretical advantages
+ It converges for any π and μ (safe)
+ It makes the best use of samples if μ and π are close to each other (efficient)
+ Its variance is lower than that of importance sampling
Empirical evaluation
On Atari 2600 it beats one-step Q-learning (DQN) and the existing multi-step methods (Q*(λ), Tree-Backup)
Notation and definitions
state x ∈ X
action a ∈ A
discount factor γ ∈ [0, 1]
immediate reward r ∈ R
policies π, μ : X × A → [0, 1]
value function Q^π(x, a) := E[r_1 + γ r_2 + γ^2 r_3 + ⋯ | x_0 = x, a_0 = a]
optimal value function Q* := max_π Q^π
E_π Q(x, ·) := Σ_a π(a|x) Q(x, a)
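Since the expectation E_π Q(x, ·) appears in every off-policy update below, here is a minimal NumPy sketch of it; the array shapes and names are my own assumptions, not from the slides:

```python
import numpy as np

def expected_q(q_values, pi_probs):
    """E_pi Q(x, .) = sum_a pi(a|x) * Q(x, a) for a single state x.

    q_values: array of shape (num_actions,) with the current estimates Q(x, a).
    pi_probs: array of shape (num_actions,) with pi(a|x); must sum to 1.
    """
    return float(np.dot(pi_probs, q_values))

# Example: three actions, a slightly greedy target policy pi.
q = np.array([1.0, 2.0, 0.5])
pi = np.array([0.1, 0.8, 0.1])
print(expected_q(q, pi))  # 0.1*1.0 + 0.8*2.0 + 0.1*0.5 = 1.75
```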
Policy evaluation
Learning the value function of a policy π:
Q^π(x, a) = E[r_1 + γ r_2 + γ^2 r_3 + ⋯ | x_0 = x, a_0 = a]
You can learn optimal control if π is the greedy policy with respect to the current estimate Q(x, a), e.g. Q-learning (see the sketch below)
On-policy: learning from data collected by π
Off-policy: learning from data collected by μ ≠ π
Off-policy methods have advantages:
+ Sample-efficient (e.g. experience replay)
+ Exploration by μ
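For reference, a hedged sketch of the one-step Q-learning update mentioned above, for a tabular Q; the step size alpha and all variable names are illustrative assumptions, not the authors' setup:

```python
import numpy as np

def q_learning_step(Q, x, a, r, x_next, gamma=0.99, alpha=0.1):
    """One-step off-policy update: the target uses max_a' Q(x', a'),
    i.e. pi is greedy with respect to the current estimate,
    regardless of which behaviour policy mu chose action a.

    Q: array of shape (num_states, num_actions), updated in place.
    """
    target = r + gamma * np.max(Q[x_next])
    Q[x, a] += alpha * (target - Q[x, a])
    return Q

# Usage with a toy table: 5 states, 3 actions, one observed transition.
Q = np.zeros((5, 3))
q_learning_step(Q, x=0, a=2, r=1.0, x_next=3)
```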
On-policy multi-step methods
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
Temporal difference (or surprise) at time t: δ_t = r_t + γ Q(x_{t+1}, a_{t+1}) − Q(x_t, a_t)
You can use δ_t to estimate Q^π(x_t, a_t) (one-step)
Can you use δ_t to estimate Q^π(x_s, a_s) for all s ≤ t? (multi-step)
TD(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
A popular multi-step algorithm for on-policy policy evaluation
ΔQ(x, a) = Σ_{t≥0} (γλ)^t δ_t, where λ ∈ [0, 1] is chosen to balance bias and variance (sketched below)
Multi-step methods have advantages:
+ Rewards are propagated rapidly
+ Bias introduced by bootstrapping is reduced
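A minimal sketch of this forward-view TD(λ) estimate on a single recorded on-policy trajectory, assuming a tabular Q; the trajectory layout and names are my own:

```python
import numpy as np

def td_lambda_increment(Q, states, actions, rewards, gamma=0.99, lam=0.9):
    """Delta Q(x_0, a_0) = sum_t (gamma * lam)^t * delta_t, with
    delta_t = r_t + gamma * Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t).

    states/actions have length T+1, rewards has length T (reward r_t is
    received on the transition from (x_t, a_t) to x_{t+1}).
    """
    increment, weight = 0.0, 1.0
    for t in range(len(rewards)):
        delta_t = (rewards[t]
                   + gamma * Q[states[t + 1], actions[t + 1]]
                   - Q[states[t], actions[t]])
        increment += weight * delta_t
        weight *= gamma * lam
    return increment

# Usage: update the estimate at the start of the trajectory.
Q = np.zeros((5, 3))
states, actions, rewards = [0, 2, 4, 1], [1, 0, 2, 1], [0.0, 1.0, 0.5]
Q[states[0], actions[0]] += 0.1 * td_lambda_increment(Q, states, actions, rewards)
```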
Off-policy multi-step algorithm
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
δ_t = r_t + γ E_π Q(x_{t+1}, ·) − Q(x_t, a_t) (computed in the sketch below)
You can use δ_t to estimate Q^π(x_t, a_t), e.g. Q-learning
Can you use δ_t to estimate Q^π(x_s, a_s) for all s ≤ t?
δ_t might be less relevant to Q^π(x_s, a_s) than in the on-policy case
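A minimal sketch of this off-policy (expected) TD error, again assuming a tabular Q and a table of target-policy probabilities; the names are illustrative:

```python
import numpy as np

def off_policy_td_error(Q, pi, x_t, a_t, r_t, x_next, gamma=0.99):
    """delta_t = r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t).

    Q:  array (num_states, num_actions) of current estimates.
    pi: array (num_states, num_actions); pi[x] is the target policy pi(.|x).
    The behaviour action a_t may have been drawn from any policy mu.
    """
    expected_next = np.dot(pi[x_next], Q[x_next])
    return r_t + gamma * expected_next - Q[x_t, a_t]

# With a greedy pi this target reduces to r_t + gamma * max_a Q(x_{t+1}, a),
# i.e. the Q-learning target mentioned above.
```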
Importance Sampling (IS) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
ΔQ(x, a) = Σ_{t≥0} γ^t (Π_{s=1}^t π(a_s|x_s) / μ(a_s|x_s)) δ_t (illustrated below)
+ Unbiased estimate of Q^π
− Large (possibly infinite) variance, since π(a_s|x_s) / μ(a_s|x_s) is not bounded
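The variance problem comes from the product of unbounded likelihood ratios. A toy illustration (numbers of my own choosing) of how the cumulative IS weight Π_{s=1}^t π(a_s|x_s)/μ(a_s|x_s) behaves along one behaviour trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: two actions; pi strongly prefers action 0, mu is uniform.
pi_probs = np.array([0.9, 0.1])
mu_probs = np.array([0.5, 0.5])

weight = 1.0
for t in range(1, 16):
    a_t = rng.choice(2, p=mu_probs)            # behaviour policy picks a_t
    weight *= pi_probs[a_t] / mu_probs[a_t]    # cumulative IS correction
    print(t, round(weight, 3))
# Whenever mu keeps drawing the action pi favours, the ratio 0.9/0.5 = 1.8
# compounds geometrically, so the weight (and hence the variance) explodes.
```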
Q^π(λ) [Harutyunyan et al. 2016]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
ΔQ(x, a) = Σ_{t≥0} (γλ)^t δ_t
+ Convergent if π and μ are sufficiently close to each other, or λ is sufficiently small: λ < (1 − γ) / (γ ε), where ε := max_x ‖π(·|x) − μ(·|x)‖_1 (computed in the sketch below)
− Not convergent otherwise
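A small sketch that computes ε = max_x ‖π(·|x) − μ(·|x)‖_1 and the corresponding upper bound on λ, for toy policy tables of my own choosing:

```python
import numpy as np

def q_pi_lambda_threshold(pi, mu, gamma):
    """Return (epsilon, lambda_max) for the Q^pi(lambda) condition
    lambda < (1 - gamma) / (gamma * epsilon),
    where epsilon = max_x ||pi(.|x) - mu(.|x)||_1."""
    epsilon = np.abs(pi - mu).sum(axis=1).max()
    return epsilon, (1.0 - gamma) / (gamma * epsilon)

# Two states, two actions: pi is near-greedy, mu is uniform.
pi = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([[0.5, 0.5], [0.5, 0.5]])
eps, lam_max = q_pi_lambda_threshold(pi, mu, gamma=0.99)
print(eps, lam_max)  # eps = 0.8, so lambda must stay below about 0.0126
```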
Tree-Backup (TB) [Precup et al. 2000]
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
ΔQ(x, a) = Σ_{t≥0} (γλ)^t (Π_{s=1}^t π(a_s|x_s)) δ_t
+ Convergent for any π and μ
+ Works even if μ is unknown and/or non-Markov
− Π_{s=1}^t π(a_s|x_s) decays rapidly when near on-policy (e.g. with four equally likely actions the product shrinks by a factor of 4 at every step, even when μ = π)
A unified view
General algorithm: ΔQ(x, a) = Σ_{t≥0} γ^t (Π_{s=1}^t c_s) δ_t
None of the existing methods has all three desired properties:
Low variance (lacking in IS)
Safe, i.e. convergent for any π and μ (lacking in Q^π(λ))
Efficient, i.e. using full returns when on-policy (lacking in Tree-Backup)
Choice of the coefficients c_s
Contraction speed
Consider a general operator R:
RQ(x, a) = Q(x, a) + E_μ[Σ_{t≥0} γ^t (Π_{s=1}^t c_s) δ_t]
If 0 ≤ c_s ≤ π(a_s|x_s) / μ(a_s|x_s), then R is a contraction and Q^π is its fixed point (thus the algorithm is safe):
|RQ(x, a) − Q^π(x, a)| ≤ η(x, a) ‖Q − Q^π‖, where
η(x, a) := 1 − (1 − γ) E_μ[Σ_{t≥0} γ^t (Π_{s=1}^t c_s)]
η = 0 for c_s = 1 (efficient)
Variance
c_s ≤ 1 results in low variance, since Π_{s=1}^t c_s ≤ 1 (the different choices of c_s are compared in the sketch below)
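To make the unified view concrete, here is a hedged sketch that evaluates the correction Σ_{t≥0} γ^t (Π_{s=1}^t c_s) δ_t on one recorded behaviour trajectory for the different choices of c_s (IS, Q^π(λ), Tree-Backup, and the Retrace coefficient introduced on the next slide). The tabular Q, the policy tables and all names are my own assumptions, not the authors' code:

```python
import numpy as np

def multi_step_correction(Q, pi, mu, states, actions, rewards,
                          gamma=0.99, lam=1.0, method="retrace"):
    """Delta Q(x_0, a_0) = sum_{t>=0} gamma^t (prod_{s=1}^t c_s) delta_t,
    with delta_t = r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t).

    Q: (num_states, num_actions) table of current estimates.
    pi, mu: (num_states, num_actions) tables of pi(a|x) and mu(a|x).
    states has length T+1; actions and rewards have length T.
    """
    total, weight = 0.0, 1.0
    for t in range(len(rewards)):
        x, a, x_next = states[t], actions[t], states[t + 1]
        if t > 0:  # the product over c_s starts at s = 1
            ratio = pi[x, a] / mu[x, a]
            c = {"is": ratio,                       # importance sampling
                 "qlambda": lam,                    # Q^pi(lambda)
                 "tb": lam * pi[x, a],              # Tree-Backup(lambda)
                 "retrace": lam * min(1.0, ratio),  # Retrace(lambda)
                 }[method]
            weight *= gamma * c
        delta = rewards[t] + gamma * np.dot(pi[x_next], Q[x_next]) - Q[x, a]
        total += weight * delta
    return total

# Usage: one trajectory, four different off-policy corrections.
Q = np.zeros((4, 2))
pi = np.array([[0.9, 0.1]] * 4)   # near-greedy target policy
mu = np.array([[0.5, 0.5]] * 4)   # uniform behaviour policy
traj = dict(states=[0, 1, 2, 3], actions=[0, 1, 0], rewards=[1.0, 0.0, 1.0])
for m in ("is", "qlambda", "tb", "retrace"):
    print(m, multi_step_correction(Q, pi, mu, **traj, method=m))
```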
Retrace(λ)
From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf
ΔQ(x, a) = Σ_{t≥0} γ^t (Π_{s=1}^t λ min(1, π(a_s|x_s) / μ(a_s|x_s))) δ_t
+ Variance is bounded
+ Convergent for any π and μ
+ Uses full returns when on-policy
− Doesn't work if μ is unknown or non-Markov (unlike Tree-Backup)
Evaluation on Atari 2600
Trained asynchronously with 16 CPU threads [Mnih et al. 2016]
Each thread has a private replay memory holding 62,500 transitions
Q-learning uses a minibatch of 64 transitions
Retrace, TB and Q*(λ) (a control version of Q^π(λ)) use four 16-step sub-sequences (see the sketch below)
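Below is a hedged sketch of how Retrace(λ) targets could be computed over one such 16-step sub-sequence in the control setting, using a greedy target policy so that E_π Q(x, ·) = max_a Q(x, a). The recursive backward pass is an equivalent rewriting of the sum of corrected TD errors; the data layout and names are my own illustration, not the authors' implementation.

```python
import numpy as np

def retrace_targets(Q, mu_probs, states, actions, rewards, gamma=0.99, lam=1.0):
    """Retrace(lambda) targets over a sub-sequence with a greedy target policy
    (so E_pi Q(x, .) = max_a Q(x, a) and pi(a|x) is 1 only for the greedy action).

    states has length T+1; actions, rewards, mu_probs have length T, where
    mu_probs[t] = mu(a_t | x_t) is the behaviour probability of the taken action.
    Returns T targets, one per visited pair (x_t, a_t).
    """
    T = len(rewards)
    targets = np.zeros(T)
    carry = 0.0  # G_{t+1} - Q(x_{t+1}, a_{t+1}); zero beyond the sub-sequence
    for t in reversed(range(T)):
        x, a, x_next = states[t], actions[t], states[t + 1]
        delta = rewards[t] + gamma * np.max(Q[x_next]) - Q[x, a]
        if t < T - 1:
            a_next = actions[t + 1]
            greedy = float(a_next == np.argmax(Q[x_next]))   # pi(a_{t+1}|x_{t+1})
            c = lam * min(1.0, greedy / mu_probs[t + 1])     # truncated IS ratio
            residual = delta + gamma * c * carry
        else:
            residual = delta
        targets[t] = Q[x, a] + residual
        carry = residual
    return targets
```

Replacing λ min(1, π(a|x)/μ(a|x)) with λ π(a|x) in the same backward pass would give Tree-Backup targets instead.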
Performance comparison
Inter-algorithm scores are normalized so that 0 and 1 respectively correspond to the worst and best scores for a particular game, i.e. normalized = (score − worst) / (best − worst)
λ = 1 performs best, except for Q*(λ)
Retrace(λ) performs best on 30 out of 60 games
Sensitivity to the value of λ
Retrace(λ) is robust and consistently outperforms Tree-Backup
Q*(λ) performs best for small values of λ
Note that the Q-learning scores are fixed across different values of λ
Conclusions
Retrace(λ) is an off-policy multi-step value-based RL algorithm
It is low-variance, safe and efficient
It outperforms one-step Q-learning and existing multi-step variants on Atari 2600
(It has already been applied to A3C in another paper [Wang et al. 2016])
References I
[1] Anna Harutyunyan et al. "Q(λ) with Off-Policy Corrections". In: Proceedings of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.
[2] Volodymyr Mnih et al. "Asynchronous Methods for Deep Reinforcement Learning". 2016. arXiv: 1602.01783.
[3] Doina Precup, Richard S. Sutton, and Satinder P. Singh. "Eligibility Traces for Off-Policy Policy Evaluation". In: ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning (2000), pp. 759–766.
[4] Ziyu Wang et al. "Sample Efficient Actor-Critic with Experience Replay". In: arXiv (2016), pp. 1–20. arXiv: 1611.01224.