# Safe and Efficient Off-Policy Reinforcement Learning


### Transcript of Safe and Efficient Off-Policy Reinforcement Learning

Safe and Efficient Off-Policy Reinforcement Learning (NIPS 2016)

Yasuhiro Fujita

Preferred Networks Inc.

January 19, 2017

Safe and Efficient Off-Policy Reinforcement Learning

by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare

Off-policy RL: learning the value function for one policy $\pi$, $Q^\pi(x, a) = \mathbb{E}[r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid x_0 = x, a_0 = a]$, from data collected by another policy $\mu \neq \pi$

Retrace($\lambda$): a new off-policy multi-step RL algorithm

Theoretical advantages:

+ It converges for any $\pi$, $\mu$ (safe)
+ It makes the best use of samples if $\pi$ and $\mu$ are close to each other (efficient)
+ Its variance is lower than importance sampling

Empirical evaluation: on Atari 2600 it beats one-step Q-learning (DQN) and the existing multi-step methods (Q*($\lambda$), Tree-Backup)

Notation and definitions

state $x \in \mathcal{X}$

action $a \in \mathcal{A}$

discount factor $\gamma \in [0, 1]$

immediate reward $r_t \in \mathbb{R}$

policies $\pi, \mu : \mathcal{X} \times \mathcal{A} \to [0, 1]$

value function $Q^\pi(x, a) := \mathbb{E}[r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid x_0 = x, a_0 = a]$

optimal value function $Q^* := \max_\pi Q^\pi$

$\mathbb{E}_\pi Q(x, \cdot) := \sum_a \pi(a|x) Q(x, a)$
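The last definition, the expectation of $Q$ over actions under $\pi$, can be read as a one-liner. A minimal Python sketch (names and the toy table are illustrative, not from the paper):

```python
def expected_q(pi, Q, x, actions):
    """E_pi Q(x, .) = sum_a pi(a|x) * Q(x, a)."""
    return sum(pi(a, x) * Q[(x, a)] for a in actions)

# Two actions under a uniform policy: the expectation is the plain average.
Q = {("s0", 0): 1.0, ("s0", 1): 3.0}
uniform = lambda a, x: 0.5
ev = expected_q(uniform, Q, "s0", [0, 1])  # 2.0
```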

Policy evaluation

Learning the value function for a policy $\pi$:

$Q^\pi(x, a) = \mathbb{E}[r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid x_0 = x, a_0 = a]$

You can learn optimal control if $\pi$ is a greedy policy with respect to the current estimate $Q(x, a)$, e.g. Q-learning

On-policy: learning from data collected by $\pi$

Off-policy: learning from data collected by $\mu \neq \pi$

Off-policy methods have advantages:

+ Sample-efficient (e.g. experience replay)
+ Exploration by $\mu$

On-policy multi-step methods

From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

Temporal difference (or surprise) at $t$: $\delta_t = r_t + \gamma Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)$

You can use $\delta_t$ to estimate $Q^\pi(x_t, a_t)$ (one-step)

Can you use $\delta_t$ to estimate $Q^\pi(x_s, a_s)$ for all $s \leq t$? (multi-step)
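The one-step TD error above is a direct transcription of the formula; a minimal sketch with an assumed tabular $Q$:

```python
def td_error(r, gamma, Q, x, a, x_next, a_next):
    """One-step TD error: delta_t = r_t + gamma*Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)."""
    return r + gamma * Q[(x_next, a_next)] - Q[(x, a)]

Q = {("s0", 0): 0.5, ("s1", 0): 1.0}
delta = td_error(1.0, 0.9, Q, "s0", 0, "s1", 0)  # 1.0 + 0.9*1.0 - 0.5 = 1.4
```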


TD($\lambda$)

A popular multi-step algorithm for on-policy policy evaluation

$\Delta Q(x, a) = \sum_{t \geq 0} (\gamma\lambda)^t \delta_t$, where $\lambda \in [0, 1]$ is chosen to balance bias and variance

Multi-step methods have advantages:

+ Rewards are propagated rapidly
+ Bias introduced by bootstrapping is reduced
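The forward-view sum above is easy to compute from a sequence of TD errors. A sketch (the error values are made up for illustration):

```python
def td_lambda_increment(deltas, gamma, lam):
    """Forward-view TD(lambda): sum_t (gamma*lam)^t * delta_t."""
    return sum((gamma * lam) ** t * d for t, d in enumerate(deltas))

# lam=0 keeps only the one-step error; lam=1 weights errors by gamma^t.
deltas = [1.0, 0.5, -0.2]
one_step = td_lambda_increment(deltas, gamma=0.9, lam=0.0)  # 1.0
full = td_lambda_increment(deltas, gamma=0.9, lam=1.0)      # 1.0 + 0.9*0.5 + 0.81*(-0.2)
```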


Off-policy multi-step algorithm

$\delta_t = r_t + \gamma \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t)$

You can use $\delta_t$ to estimate $Q^\pi(x_t, a_t)$, e.g. Q-learning

Can you use $\delta_t$ to estimate $Q^\pi(x_s, a_s)$ for all $s \leq t$?

$\delta_t$ might be less relevant to $Q^\pi(x_s, a_s)$ compared to the on-policy case
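The off-policy TD error replaces the sampled next action by an expectation under $\pi$. A sketch (tabular $Q$ and the greedy policy here are assumed for illustration):

```python
def expected_td_error(r, gamma, pi, Q, x, a, x_next, actions):
    """Off-policy TD error: delta_t = r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t)."""
    ev = sum(pi(b, x_next) * Q[(x_next, b)] for b in actions)
    return r + gamma * ev - Q[(x, a)]

# With pi greedy on Q, this is exactly the Q-learning error.
Q = {("s0", 0): 0.5, ("s1", 0): 2.0, ("s1", 1): 0.0}
greedy = lambda b, x: 1.0 if b == 0 else 0.0
delta = expected_td_error(1.0, 0.9, greedy, Q, "s0", 0, "s1", [0, 1])  # 1.0 + 0.9*2.0 - 0.5
```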


Importance Sampling (IS) [Precup et al. 2000]

$\Delta Q(x, a) = \sum_{t \geq 0} \gamma^t \left(\prod_{1 \leq s \leq t} \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\right) \delta_t$

+ Unbiased estimate of $Q^\pi$

− Large (possibly infinite) variance, since $\frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$ is not bounded
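The variance problem is visible already in the cumulative weight alone. A sketch with assumed policies showing the geometric blow-up:

```python
def is_weight(pi, mu, traj):
    """Cumulative importance weight: prod_s pi(a_s|x_s) / mu(a_s|x_s)."""
    w = 1.0
    for x, a in traj:
        w *= pi(a, x) / mu(a, x)
    return w

# A target policy concentrated on action 0 vs. a uniform behaviour policy:
# every matching step multiplies the weight by 0.9/0.5 = 1.8, so it grows
# geometrically with trajectory length.
pi = lambda a, x: 0.9 if a == 0 else 0.1
mu = lambda a, x: 0.5
w = is_weight(pi, mu, [("s", 0)] * 10)  # 1.8**10, about 357
```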


Q($\lambda$) [Harutyunyan et al. 2016]

$\Delta Q(x, a) = \sum_{t \geq 0} (\gamma\lambda)^t \delta_t$

+ Convergent if $\pi$ and $\mu$ are sufficiently close to each other, or $\lambda$ is sufficiently small: $\lambda < \frac{1 - \gamma}{\gamma\epsilon}$, where $\epsilon := \max_x \|\pi(\cdot|x) - \mu(\cdot|x)\|_1$

− Not convergent otherwise


Tree-Backup (TB) [Precup et al. 2000]

$\Delta Q(x, a) = \sum_{t \geq 0} (\gamma\lambda)^t \left(\prod_{1 \leq s \leq t} \pi(a_s|x_s)\right) \delta_t$

+ Convergent for any $\pi$ and $\mu$

+ Works even if $\mu$ is unknown and/or non-Markov

− $\prod_{1 \leq s \leq t} \pi(a_s|x_s)$ decays rapidly even when near on-policy


A unified view

General algorithm: $\Delta Q(x, a) = \sum_{t \geq 0} \gamma^t \left(\prod_{1 \leq s \leq t} c_s\right) \delta_t$

None of the existing methods is perfect:

Low variance (IS is not)

Safe, i.e. convergent for any $\pi$ and $\mu$ (Q($\lambda$) is not)

Efficient, i.e. using full returns when on-policy (Tree-Backup is not)

Choice of the coefficients $c_s$

Contraction speed: consider a general operator $\mathcal{R}$:

$\mathcal{R}Q(x, a) = Q(x, a) + \mathbb{E}_\mu\left[\sum_{t \geq 0} \gamma^t \left(\prod_{1 \leq s \leq t} c_s\right) \delta_t\right]$

If $0 \leq c_s \leq \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$, then $\mathcal{R}$ is a contraction and $Q^\pi$ is its fixed point (thus the algorithm is safe):

$|\mathcal{R}Q(x, a) - Q^\pi(x, a)| \leq \eta(x, a) \|Q - Q^\pi\|$

$\eta(x, a) := 1 - (1 - \gamma)\, \mathbb{E}_\mu\left[\sum_{t \geq 0} \gamma^t \prod_{s=1}^t c_s\right]$

$\eta = 0$ for $c_s = 1$ (efficient)

Variance: $c_s \leq 1$ results in low variance, since $\prod_{1 \leq s \leq t} c_s \leq 1$
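All the methods above instantiate the same general sum with different trace coefficients $c_s$. A sketch of the general increment, with the ratio and probability values assumed for illustration:

```python
def offpolicy_increment(deltas, cs, gamma):
    """General increment: sum_{t>=0} gamma^t (prod_{1<=s<=t} c_s) delta_t.

    cs[s-1] holds c_s; the t=0 term has an empty product, i.e. weight 1.
    """
    inc, trace = 0.0, 1.0
    for t, d in enumerate(deltas):
        if t > 0:
            trace *= cs[t - 1]
        inc += gamma ** t * trace * d
    return inc

# The methods differ only in the choice of c_s (assumed example values):
lam, ratio, pi_prob = 0.9, 2.0, 0.8
c_is = ratio                       # importance sampling: unbounded
c_tb = lam * pi_prob               # Tree-Backup: decays even on-policy
c_retrace = lam * min(1.0, ratio)  # Retrace: capped at lam
```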

Retrace($\lambda$)

$\Delta Q(x, a) = \sum_{t \geq 0} \gamma^t \left(\prod_{1 \leq s \leq t} \lambda \min\left(1, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\right)\right) \delta_t$

+ Variance is bounded

+ Convergent for any $\pi$ and $\mu$

+ Uses full returns when on-policy

− Doesn't work if $\mu$ is unknown or non-Markov (unlike Tree-Backup)
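Putting the pieces together, a minimal tabular sketch of one Retrace($\lambda$) update from a behaviour sub-sequence; all names and the toy setup are illustrative, not the paper's code:

```python
def retrace_update(Q, traj, pi, mu, actions, gamma=0.99, lam=1.0, alpha=0.1):
    """One Retrace(lambda) update of Q(x_0, a_0) from a behaviour sub-sequence.

    traj: list of (x_t, a_t, r_t, x_{t+1}); pi/mu map (action, state) to a
    probability.
    """
    x0, a0 = traj[0][0], traj[0][1]
    inc, trace = 0.0, 1.0
    for t, (x, a, r, x_next) in enumerate(traj):
        if t > 0:  # c_t = lam * min(1, pi(a_t|x_t) / mu(a_t|x_t))
            trace *= lam * min(1.0, pi(a, x) / mu(a, x))
        exp_q = sum(pi(b, x_next) * Q[(x_next, b)] for b in actions)
        delta = r + gamma * exp_q - Q[(x, a)]
        inc += gamma ** t * trace * delta
    Q[(x0, a0)] += alpha * inc
    return Q[(x0, a0)]

# On-policy (pi == mu) single transition with alpha=1 reduces to a plain
# expected-TD backup.
Q = {("s0", 0): 0.0, ("s1", 0): 0.0}
on_policy = lambda a, x: 1.0
new_q = retrace_update(Q, [("s0", 0, 1.0, "s1")], on_policy, on_policy,
                       actions=[0], gamma=0.9, lam=1.0, alpha=1.0)  # 1.0
```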


Evaluation on Atari 2600

Trained asynchronously with 16 CPU threads [Mnih et al. 2016]

Each thread has a private replay memory holding 62,500 transitions

Q-learning uses a minibatch of 64 transitions

Retrace, TB and Q*($\lambda$) (a control version of Q($\lambda$)) use four 16-step sub-sequences

Performance comparison

Inter-algorithm scores are normalized so that 0 and 1 respectively correspond to the worst and best scores for a particular game

$\lambda = 1$ performs best, except for Q*($\lambda$)

Retrace($\lambda$) performs best on 30 out of 60 games

Sensitivity to the value of $\lambda$

Retrace($\lambda$) is robust and consistently outperforms Tree-Backup

Q*($\lambda$) performs best for small values of $\lambda$

Note that the Q-learning scores are fixed across different $\lambda$

Conclusions

Retrace($\lambda$) is an off-policy multi-step value-based RL algorithm that:

+ is low-variance, safe and efficient
+ outperforms one-step Q-learning and existing multi-step variants on Atari 2600
+ is already applied to A3C in another paper [Wang et al. 2016]

References I

[1] Anna Harutyunyan et al. "Q(λ) with Off-Policy Corrections". In: Proceedings of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.

[2] Volodymyr Mnih et al. "Asynchronous Methods for Deep Reinforcement Learning". 2016. arXiv: 1602.01783.

[3] Doina Precup, Richard S. Sutton, and Satinder P. Singh. "Eligibility Traces for Off-Policy Policy Evaluation". In: ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning (2000), pp. 759–766.

[4] Ziyu Wang et al. "Sample Efficient Actor-Critic with Experience Replay". In: arXiv (2016), pp. 1–20. arXiv: 1611.01224.
