Safe and Efficient Off-Policy Reinforcement Learning

  • Safe and Efficient Off-Policy Reinforcement Learning (NIPS 2016)

    Yasuhiro Fujita

    Preferred Networks Inc.

    January 19, 2017

  • Safe and Efficient Off-Policy Reinforcement Learning

    by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare

    Off-policy RL: learning the value function for one policy $\pi$,
    $Q^\pi(x, a) = \mathbb{E}[r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid x_0 = x, a_0 = a]$,
    from data collected by another policy $\mu \neq \pi$

    Retrace($\lambda$): a new off-policy multi-step RL algorithm

    Theoretical advantages:
    + It converges for any $\pi$, $\mu$ (safe)
    + It makes the best use of samples if $\pi$ and $\mu$ are close to each other (efficient)
    + Its variance is lower than that of importance sampling

    Empirical evaluation: on Atari 2600 it beats one-step Q-learning (DQN) and the existing multi-step methods ($Q^*(\lambda)$, Tree-Backup)

  • Notation and definitions

    state $x \in \mathcal{X}$
    action $a \in \mathcal{A}$
    discount factor $\gamma \in [0, 1]$
    immediate reward $r \in \mathbb{R}$
    policies $\pi, \mu : \mathcal{X} \times \mathcal{A} \to [0, 1]$
    value function $Q^\pi(x, a) := \mathbb{E}[r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid x_0 = x, a_0 = a]$
    optimal value function $Q^* := \max_\pi Q^\pi$
    $\mathbb{E}_\pi Q(x, \cdot) := \sum_a \pi(a|x)\, Q(x, a)$
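    The expectation $\mathbb{E}_\pi Q(x, \cdot)$ recurs in every off-policy update below; as a minimal sketch (the function name and NumPy representation are my own, not from the slides):

    ```python
    import numpy as np

    def expected_q(q_values, pi_probs):
        """E_pi Q(x, .) = sum_a pi(a|x) Q(x, a).

        q_values: Q(x, a) for each action a, shape (num_actions,)
        pi_probs: pi(a|x) for each action a, shape (num_actions,)
        """
        return np.dot(pi_probs, q_values)
    ```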

  • Policy evaluation

    Learning the value function for a policy $\pi$:

    $Q^\pi(x, a) = \mathbb{E}[r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid x_0 = x, a_0 = a]$

    You can learn optimal control if $\pi$ is a greedy policy with respect to the current estimate $Q(x, a)$, e.g. Q-learning

    On-policy: learning from data collected by $\pi$

    Off-policy: learning from data collected by $\mu \neq \pi$

    Off-policy methods have advantages:
    + Sample-efficient (e.g. experience replay)
    + Exploration by $\mu$

  • On-policy multi-step methods

    From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

    Temporal difference (or "surprise") at time $t$: $\delta_t = r_t + \gamma Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)$

    You can use $\delta_t$ to estimate $Q^\pi(x_t, a_t)$ (one-step)

    Can you use $\delta_t$ to estimate $Q^\pi(x_s, a_s)$ for all $s \le t$? (multi-step)


  • TD($\lambda$)

    From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

    A popular multi-step algorithm for on-policy policy evaluation:

    $\Delta Q(x, a) = \sum_{t \ge 0} (\gamma\lambda)^t \delta_t$, where $\lambda \in [0, 1]$ is chosen to balance bias and variance

    Multi-step methods have advantages:
    + Rewards are propagated rapidly
    + Bias introduced by bootstrapping is reduced

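    As an illustration of this $\lambda$-weighted sum of TD errors, a minimal offline (forward-view) sketch for one recorded on-policy trajectory; the NumPy function below is my own sketch, not the authors' code:

    ```python
    import numpy as np

    def td_lambda_increment(rewards, q_sa, q_next_sa, gamma, lam):
        """Offline forward-view TD(lambda) increment for the first state-action pair.

        rewards:   r_t for t = 0..T-1 along one on-policy trajectory
        q_sa:      Q(x_t, a_t) for t = 0..T-1
        q_next_sa: Q(x_{t+1}, a_{t+1}) for t = 0..T-1 (0 at terminal states)
        Returns sum_{t>=0} (gamma * lam)^t * delta_t with
        delta_t = r_t + gamma * Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t).
        """
        deltas = rewards + gamma * q_next_sa - q_sa
        weights = (gamma * lam) ** np.arange(len(deltas))
        return np.sum(weights * deltas)
    ```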

  • Off-policy multi-step algorithm

    From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

    $\delta_t = r_t + \gamma\, \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t)$

    You can use $\delta_t$ to estimate $Q^\pi(x_t, a_t)$, e.g. Q-learning

    Can you use $\delta_t$ to estimate $Q^\pi(x_s, a_s)$ for all $s \le t$?

    $\delta_t$ might be less relevant to $Q^\pi(x_s, a_s)$ compared to the on-policy case


  • Importance Sampling (IS) [Precup et al. 2000]

    From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

    $\Delta Q(x, a) = \sum_{t \ge 0} \gamma^t \left( \prod_{1 \le s \le t} \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)} \right) \delta_t$

    + Unbiased estimate of $Q^\pi$

    - Large (possibly infinite) variance, since $\frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$ is not bounded


  • $Q^\pi(\lambda)$ [Harutyunyan et al. 2016]

    From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

    $\Delta Q(x, a) = \sum_{t \ge 0} (\gamma\lambda)^t \delta_t$ (with the off-policy $\delta_t$ above)

    + Convergent if $\pi$ and $\mu$ are sufficiently close to each other, or $\lambda$ is sufficiently small: $\lambda < \frac{1 - \gamma}{\gamma\epsilon}$, where $\epsilon := \max_x \|\pi(\cdot|x) - \mu(\cdot|x)\|_1$

    - Not convergent otherwise

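    For intuition about how restrictive this condition is, a small sketch (my own illustration, a single state for simplicity) that computes the largest $\lambda$ guaranteed to converge:

    ```python
    import numpy as np

    def q_pi_lambda_threshold(pi_probs, mu_probs, gamma):
        """Upper bound on lambda from the condition lambda < (1 - gamma) / (gamma * eps),
        where eps = || pi(.|x) - mu(.|x) ||_1 (single state here; assumes pi != mu)."""
        eps = np.abs(pi_probs - mu_probs).sum()
        return (1.0 - gamma) / (gamma * eps)

    # Example: with gamma = 0.99 and moderately different policies,
    # only very small values of lambda are guaranteed to converge.
    print(q_pi_lambda_threshold(np.array([0.9, 0.1]), np.array([0.5, 0.5]), 0.99))  # ~0.0126
    ```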

  • Tree-Backup (TB) [Precup et al. 2000]

    From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

    $\Delta Q(x, a) = \sum_{t \ge 0} (\gamma\lambda)^t \left( \prod_{1 \le s \le t} \pi(a_s|x_s) \right) \delta_t$

    + Convergent for any $\pi$ and $\mu$

    + Works even if $\mu$ is unknown and/or non-Markov

    - $\prod_{1 \le s \le t} \pi(a_s|x_s)$ decays rapidly even when near on-policy


  • A unified view

    General algorithm: $\Delta Q(x, a) = \sum_{t \ge 0} \gamma^t \left( \prod_{1 \le s \le t} c_s \right) \delta_t$

    None of the existing methods is perfect:
    Low variance (violated by IS)
    Safe, i.e. convergent for any $\pi$ and $\mu$ (violated by $Q^\pi(\lambda)$)
    Efficient, i.e. using full returns when on-policy (violated by Tree-Backup)
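    To make the unified view concrete, here is a minimal NumPy sketch of this general update together with the trace coefficients $c_s$ of the methods above; all function names are my own, and this is a sketch under those naming assumptions rather than the authors' implementation:

    ```python
    import numpy as np

    def general_increment(rewards, q_sa, exp_q_next, coeffs, gamma):
        """Delta Q(x_0, a_0) = sum_{t>=0} gamma^t (prod_{1<=s<=t} c_s) delta_t,
        with delta_t = r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t).

        rewards:    r_t for t = 0..T-1
        q_sa:       Q(x_t, a_t) for t = 0..T-1
        exp_q_next: E_pi Q(x_{t+1}, .) for t = 0..T-1 (0 at terminal states)
        coeffs:     c_s for s = 1..T-1 (the product over an empty range is 1)
        """
        deltas = rewards + gamma * exp_q_next - q_sa
        traces = gamma ** np.arange(len(deltas)) * np.concatenate(([1.0], np.cumprod(coeffs)))
        return np.sum(traces * deltas)

    # Trace coefficients of the methods above (pi_s = pi(a_s|x_s), mu_s = mu(a_s|x_s)):
    def is_coeff(pi_s, mu_s, lam):          return pi_s / mu_s                  # importance sampling
    def q_lambda_coeff(pi_s, mu_s, lam):    return lam                          # Q^pi(lambda)
    def tree_backup_coeff(pi_s, mu_s, lam): return lam * pi_s                   # Tree-Backup
    def retrace_coeff(pi_s, mu_s, lam):     return lam * min(1.0, pi_s / mu_s)  # Retrace(lambda)
    ```

    Retrace($\lambda$), introduced two slides below, is exactly this general form with $c_s = \lambda \min\!\left(1, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\right)$.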

  • Choice of the coefficients $c_s$

    Contraction speed:

    Consider a general operator $\mathcal{R}$:

    $\mathcal{R}Q(x, a) = Q(x, a) + \mathbb{E}_\mu\!\left[ \sum_{t \ge 0} \gamma^t \left( \prod_{1 \le s \le t} c_s \right) \delta_t \right]$

    If $0 \le c_s \le \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}$, then $\mathcal{R}$ is a contraction and $Q^\pi$ is its fixed point (thus the algorithm is safe):

    $|\mathcal{R}Q(x, a) - Q^\pi(x, a)| \le \eta(x, a)\, \|Q - Q^\pi\|$, where $\eta(x, a) := 1 - (1 - \gamma)\, \mathbb{E}_\mu\!\left[ \sum_{t \ge 0} \gamma^t \prod_{s=1}^{t} c_s \right]$

    $\eta = 0$ for $c_s = 1$ (efficient)

    Variance: $c_s \le 1$ results in low variance, since $\prod_{1 \le s \le t} c_s \le 1$

  • Retrace($\lambda$)

    From the presentation by the authors: https://ewrl.files.wordpress.com/2016/12/munos.pdf

    $\Delta Q(x, a) = \sum_{t \ge 0} \gamma^t \left( \prod_{1 \le s \le t} \lambda \min\!\left(1, \frac{\pi(a_s|x_s)}{\mu(a_s|x_s)}\right) \right) \delta_t$

    + Variance is bounded

    + Convergent for any $\pi$ and $\mu$

    + Uses full returns when on-policy

    - Doesn't work if $\mu$ is unknown or non-Markov (unlike Tree-Backup)

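    A self-contained sketch of this update on one recorded off-policy sub-sequence; the array layout and function name are my assumptions for illustration, not the authors' code:

    ```python
    import numpy as np

    def retrace_increment(rewards, q_sa, q_next, pi_next, pi_taken, mu_taken, gamma, lam):
        """Retrace(lambda) increment for the first state-action pair of a sub-sequence.

        rewards:  r_t, t = 0..T-1
        q_sa:     Q(x_t, a_t), t = 0..T-1
        q_next:   Q(x_{t+1}, .), shape (T, num_actions), zeros at terminal states
        pi_next:  pi(.|x_{t+1}), shape (T, num_actions)
        pi_taken: pi(a_t|x_t), t = 0..T-1 (only t >= 1 enters the traces)
        mu_taken: mu(a_t|x_t), behaviour-policy probabilities of the taken actions
        """
        # delta_t = r_t + gamma * E_pi Q(x_{t+1}, .) - Q(x_t, a_t)
        deltas = rewards + gamma * np.sum(pi_next * q_next, axis=1) - q_sa
        # c_s = lambda * min(1, pi(a_s|x_s) / mu(a_s|x_s)) for s = 1..T-1
        c = lam * np.minimum(1.0, pi_taken[1:] / mu_taken[1:])
        traces = gamma ** np.arange(len(deltas)) * np.concatenate(([1.0], np.cumprod(c)))
        return np.sum(traces * deltas)
    ```

    In the experiments on the following slides, updates of this form are applied to sampled 16-step sub-sequences.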

  • Evaluation on Atari 2600

    Trained asynchronously with 16 CPU threads [Mnih et al. 2016]

    Each thread has a private replay memory holding 62,500 transitions

    Q-learning uses a minibatch of 64 transitions

    Retrace, TB and $Q^*(\lambda)$ (a control version of $Q^\pi(\lambda)$) use four 16-step sub-sequences

  • Performance comparison

    Inter-algorithm scores are normalized so that 0 and 1 respectively correspond to the worst and best scores for a particular game

    $\lambda = 1$ performs best, except for $Q^*$

    Retrace($\lambda$) performs best on 30 out of 60 games

  • Sensitivity to the value of $\lambda$

    Retrace($\lambda$) is robust and consistently outperforms Tree-Backup

    $Q^*$ performs best for small values of $\lambda$

    Note that the Q-learning scores are fixed across different values of $\lambda$

  • Conclusions

    Retrace($\lambda$):
    is an off-policy multi-step value-based RL algorithm
    is low-variance, safe and efficient
    outperforms one-step Q-learning and the existing multi-step variants on Atari 2600
    (it has already been applied to A3C in another paper [Wang et al. 2016])

  • References I

    [1] Anna Harutyunyan et al. "Q(λ) with Off-Policy Corrections". In: Proceedings of Algorithmic Learning Theory (ALT). 2016. arXiv: 1602.04951.

    [2] Volodymyr Mnih et al. "Asynchronous Methods for Deep Reinforcement Learning". 2016. arXiv: 1602.01783.

    [3] Doina Precup, Richard S. Sutton, and Satinder P. Singh. "Eligibility Traces for Off-Policy Policy Evaluation". In: ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning (2000), pp. 759-766.

    [4] Ziyu Wang et al. "Sample Efficient Actor-Critic with Experience Replay". In: arXiv (2016), pp. 1-20. arXiv: 1611.01224.
