
  • Lecture 7: Policy Gradient

    Lecture 7: Policy Gradient

    David Silver

  • Lecture 7: Policy Gradient

    Outline

    1 Introduction

    2 Finite Difference Policy Gradient

    3 Monte-Carlo Policy Gradient

    4 Actor-Critic Policy Gradient

  • Lecture 7: Policy Gradient

    Introduction

    Policy-Based Reinforcement Learning

    In the last lecture we approximated the value or action-value function using parameters θ,

    V_θ(s) ≈ V^π(s)
    Q_θ(s, a) ≈ Q^π(s, a)

    A policy was generated directly from the value function

    e.g. using ε-greedy

    In this lecture we will directly parametrise the policy

    π_θ(s, a) = P[a | s, θ]

    We will focus again on model-free reinforcement learning
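    To make the parametrisation concrete, the sketch below implements a linear softmax policy over state-action features. This is an illustrative assumption, not the lecture's prescription: the feature function `features(state, action)` and the action set are placeholders.

```python
import numpy as np

def softmax_policy(theta, features, state, actions):
    """pi_theta(s, a) proportional to exp(theta . phi(s, a)).

    `features(state, action)` is an assumed feature function returning a
    vector with the same length as theta.
    """
    prefs = np.array([theta @ features(state, a) for a in actions])
    prefs -= prefs.max()              # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()        # P[a | s, theta] for each action in `actions`
```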

  • Lecture 7: Policy Gradient

    Introduction

    Value-Based and Policy-Based RL

    Value Based

    Learnt Value Function
    Implicit policy (e.g. ε-greedy)

    Policy Based

    No Value Function
    Learnt Policy

    Actor-Critic

    Learnt Value Function
    Learnt Policy

    [Venn diagram: Value-Based methods (value function), Policy-Based methods (policy), and their overlap, Actor-Critic (both)]

  • Lecture 7: Policy Gradient

    Introduction

    Advantages of Policy-Based RL

    Advantages:

    Better convergence properties

    Effective in high-dimensional or continuous action spaces

    Can learn stochastic policies

    Disadvantages:

    Typically converge to a local rather than global optimum

    Evaluating a policy is typically inefficient and high variance

  • Lecture 7: Policy Gradient

    Introduction

    Rock-Paper-Scissors Example

    Example: Rock-Paper-Scissors

    Two-player game of rock-paper-scissors

    Scissors beats paper
    Rock beats scissors
    Paper beats rock

    Consider policies for iterated rock-paper-scissors

    A deterministic policy is easily exploited
    A uniform random policy is optimal (i.e. Nash equilibrium)
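    A tiny simulation (illustrative only, not from the slides) makes the exploitability point: against a best-responding opponent, a fixed deterministic policy loses every round, while the uniform random policy breaks even in expectation against any fixed opponent.

```python
import random

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value

def payoff(mine, theirs):
    """+1 for a win, 0 for a draw, -1 for a loss, from our perspective."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

def average_payoff(my_policy, opp_policy, rounds=10_000):
    return sum(payoff(my_policy(), opp_policy()) for _ in range(rounds)) / rounds

always_rock = lambda: "rock"                    # deterministic policy
best_response = lambda: "paper"                 # opponent exploiting always_rock
uniform = lambda: random.choice(list(BEATS))    # uniform random (Nash equilibrium)

print(average_payoff(always_rock, best_response))  # about -1.0: exploited every round
print(average_payoff(uniform, best_response))      # about  0.0: cannot be exploited
```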

  • Lecture 7: Policy Gradient

    Introduction

    Aliased Gridworld Example

    Example: Aliased Gridworld (1)

    The agent cannot differentiate the grey states
    Consider features of the following form (for all N, E, S, W)

    φ(s, a) = 1(wall to N, a = move E)

    Compare value-based RL, using an approximate value function

    Q_θ(s, a) = f(φ(s, a), θ)

    To policy-based RL, using a parametrised policy

    π_θ(s, a) = g(φ(s, a), θ)
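    A minimal sketch of the two parametrisations on these features, assuming a linear form for both f and g (the encoding below is an illustrative choice, not the slide's): because the two grey states produce identical features, a greedy policy over the approximate values must pick the same action in both, whereas a softmax policy can assign probability 0.5 each to E and W.

```python
import numpy as np

ACTIONS = ["N", "E", "S", "W"]
DIRECTIONS = ["N", "E", "S", "W"]

def phi(walls, action):
    """phi(s, a): one indicator per (wall direction, action) pair,
    e.g. 1(wall to N, a = move E). The state enters only through the set
    of surrounding walls, so the aliased grey states share the same features."""
    return np.array([float(d in walls and action == a)
                     for d in DIRECTIONS for a in ACTIONS])

theta = np.zeros(len(DIRECTIONS) * len(ACTIONS))

def q_value(walls, action):
    """Value-based: Q_theta(s, a) = f(phi(s, a), theta), here linear."""
    return theta @ phi(walls, action)

def policy(walls):
    """Policy-based: pi_theta(s, a) = g(phi(s, a), theta), here a linear softmax."""
    prefs = np.array([theta @ phi(walls, a) for a in ACTIONS])
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

grey = {"N", "S"}                    # both grey states look like this
print(q_value(grey, "E"), policy(grey))
```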

  • Lecture 7: Policy Gradient

    Introduction

    Aliased Gridworld Example

    Example: Aliased Gridworld (2)

    Under aliasing, an optimal deterministic policy will either

    move W in both grey states (shown by red arrows), or
    move E in both grey states

    Either way, it can get stuck and never reach the money

    Value-based RL learns a near-deterministic policy

    e.g. greedy or ε-greedy

    So it will traverse the corridor for a long time

  • Lecture 7: Policy Gradient

    Introduction

    Aliased Gridworld Example

    Example: Aliased Gridworld (3)

    An optimal stochastic policy will randomly move E or W ingrey states

    π_θ(wall to N and S, move E) = 0.5

    π_θ(wall to N and S, move W) = 0.5

    It will reach the goal state in a few steps with high probability

    Policy-based RL can learn the optimal stochastic policy

  • Lecture 7: Policy Gradient

    Introduction

    Policy Search

    Policy Objective Functions

    Goal: given policy π_θ(s, a) with parameters θ, find the best θ

    But how do we measure the quality of a policy π_θ?

    In episodic environments we can use the start value

    J_1(θ) = V^{π_θ}(s_1) = E_{π_θ}[v_1]

    In continuing environments we can use the average value

    J_avV(θ) = Σ_s d^{π_θ}(s) V^{π_θ}(s)

    Or the average reward per time-step

    J_avR(θ) = Σ_s d^{π_θ}(s) Σ_a π_θ(s, a) R^a_s

    where d^{π_θ}(s) is the stationary distribution of the Markov chain for π_θ
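    The start-value objective can be estimated directly by Monte-Carlo rollouts; below is a minimal sketch under assumed interfaces (a gym-style `env` with `reset()`/`step()` and a `policy(theta, state)` that samples an action), none of which come from the slides.

```python
def start_value(theta, env, policy, episodes=100, gamma=1.0):
    """Monte-Carlo estimate of J_1(theta) = V^{pi_theta}(s_1) = E_{pi_theta}[v_1].

    Assumed interfaces: env.reset() -> state, env.step(action) -> (state, reward, done),
    policy(theta, state) -> action.
    """
    total_return = 0.0
    for _ in range(episodes):
        state, done = env.reset(), False
        episode_return, discount = 0.0, 1.0
        while not done:
            state, reward, done = env.step(policy(theta, state))
            episode_return += discount * reward
            discount *= gamma
        total_return += episode_return
    return total_return / episodes      # average return from the start state
```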

  • Lecture 7: Policy Gradient

    Introduction

    Policy Search

    Policy Optimisation

    Policy based reinforcement learning is an optimisation problem

    Find θ that maximises J(θ)

    Some approaches do not use gradient

    Hill climbing
    Simplex / amoeba / Nelder-Mead
    Genetic algorithms

    Greater efficiency often possible using gradient

    Gradient descent
    Conjugate gradient
    Quasi-Newton

    We focus on gradient descent, many extensions possible

    And on methods that exploit sequential structure
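    For contrast with the gradient-based methods used in the rest of the lecture, a gradient-free search such as hill climbing needs only black-box evaluations of the objective; a minimal sketch, where `J` is any (assumed) policy evaluation routine.

```python
import numpy as np

def hill_climb(J, theta, noise=0.1, iterations=100):
    """Stochastic hill climbing on a black-box objective J(theta):
    propose a random perturbation and keep it only if it scores higher."""
    best_score = J(theta)
    for _ in range(iterations):
        candidate = theta + noise * np.random.randn(*theta.shape)
        score = J(candidate)
        if score > best_score:        # maximise J(theta)
            theta, best_score = candidate, score
    return theta
```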

  • Lecture 7: Policy Gradient

    Finite Difference Policy Gradient

    Policy Gradient

    Let J() be any policy objective function

    Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of the policy, w.r.t. parameters θ

    Δθ = α ∇_θ J(θ)

    Where ∇_θ J(θ) is the policy gradient

    ∇_θ J(θ) = ( ∂J(θ)/∂θ_1, ..., ∂J(θ)/∂θ_n )ᵀ

    and α is a step-size parameter
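    In code the update is a single ascent step repeated until a local maximum is reached; a minimal sketch, where `grad_J(theta)` stands for any estimator of ∇_θ J(θ) (an assumed function, e.g. one of the estimators introduced later in the lecture).

```python
def gradient_ascent(grad_J, theta, alpha=0.01, steps=1000):
    """Repeatedly apply  delta_theta = alpha * grad_J(theta)  to ascend J(theta)
    towards a local maximum of the policy objective."""
    for _ in range(steps):
        theta = theta + alpha * grad_J(theta)
    return theta
```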


  • Lecture 7: Policy Gradient

    Finite Difference Policy Gradient

    Computing Gradients By Finite Differences

    To evaluate policy gradient of π_θ(s, a)

    For each dimension k ∈ [1, n]
    Estimate kth partial derivative of objective function w.r.t. θ
    By perturbing θ by small amount ε in kth dimension

    ∂J(θ)/∂θ_k ≈ ( J(θ + ε u_k) − J(θ) ) / ε

    where u_k is a unit vector with 1 in kth component, 0 elsewhere

    Uses n evaluations to compute policy gradient in n dimensions

    Simple, noisy, inefficient - but sometimes effective

    Works for arbitrary policies, even if policy is not differentiable
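    A minimal sketch of this estimator, treating J as a black-box evaluation of the policy (for example an average episode return); `J`, `theta`, and `epsilon` below are assumptions for illustration.

```python
import numpy as np

def finite_difference_gradient(J, theta, epsilon=1e-2):
    """Estimate the policy gradient by forward differences:
    dJ/dtheta_k ~= (J(theta + epsilon * u_k) - J(theta)) / epsilon,
    using one perturbed evaluation per dimension plus one unperturbed baseline."""
    theta = np.asarray(theta, dtype=float)
    baseline = J(theta)
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0                          # unit vector in the kth dimension
        grad[k] = (J(theta + epsilon * u_k) - baseline) / epsilon
    return grad
```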

  • Lecture 7: Policy Gradient

    Finite Difference Policy Gradient

    AIBO example

    Training AIBO to Walk by Finite Difference Policy Gradient

    those parameters to an Aibo and instructing it to time itself as it walked between two fixed landmarks (Figure 5). More efficient parameters resulted in a faster gait, which translated into a lower time and a better score. After completing an evaluation, the Aibo sent the resulting score back to the host computer and prepared itself for a new set of parameters to evaluate.⁶

    When implementing the algorithm described above, we chose to evaluate t = 15 policies per iteration. Since there was significant noise in each evaluation, each set of parameters was evaluated three times. The resulting score for that set of parameters was computed by taking the average of the three evaluations. Each iteration therefore consisted of 45 traversals between pairs of beacons and lasted roughly 7½ minutes. Since even small adjustments to a set of parameters could lead to major differences in gait speed, we chose relatively small values for each ε_j. To offset the small values of each ε_j, we accelerated the learning process by using a larger value of η = 2 for the step size.
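    A sketch of the kind of per-iteration loop this excerpt describes: generate t = 15 randomly perturbed candidate policies, score each as the average of three noisy walks, and nudge each parameter towards whichever perturbation scored better, scaled by the step size η = 2. The `evaluate_gait` routine and the exact per-dimension update rule are reconstructions for illustration, not the paper's verbatim code.

```python
import random

def policy_search_iteration(theta, eps, evaluate_gait, t=15, evals=3, eta=2.0):
    """One iteration of finite-difference policy search as described above.

    theta: current gait parameters; eps: per-parameter perturbation sizes (eps_j);
    evaluate_gait: assumed noisy black-box scorer of a parameter vector.
    """
    # Each candidate perturbs every parameter by +eps_j, 0, or -eps_j at random.
    candidates = [[p + random.choice((-e, 0.0, e)) for p, e in zip(theta, eps)]
                  for _ in range(t)]
    # Average several noisy evaluations per candidate to reduce variance.
    scores = [sum(evaluate_gait(c) for _ in range(evals)) / evals for c in candidates]

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("-inf")

    new_theta = []
    for j, (p, e) in enumerate(zip(theta, eps)):
        plus = mean([s for c, s in zip(candidates, scores) if c[j] > p])
        zero = mean([s for c, s in zip(candidates, scores) if c[j] == p])
        minus = mean([s for c, s in zip(candidates, scores) if c[j] < p])
        if zero >= plus and zero >= minus:
            step = 0.0                       # leaving this parameter alone looked best
        else:
            step = e if plus > minus else -e
        new_theta.append(p + eta * step)     # larger step size to offset small eps_j
    return new_theta
```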

    Fig. 5. The training environment for our experiments. Each Aibo times itself as it moves back and forth between a pair of landmarks (A and A′, B and B′, or C and C′).

    IV. RESULTS

    Our main result is that using the algorithm described in Section III, we were able to find one of the fastest known Aibo gaits. Figure 6 shows the performance of the best policy of each iteration during the learning process. After 23 iterations the learning algorithm produced a gait (shown in Figure 8) that yielded a velocity of 291 ± 3 mm/s, faster than both the best hand-tuned gaits and the best previously known learned gaits (see Figure 1). The parameters for this gait, along with the initial parameters and ε values, are given in Figure 7.

    Note that we stopped training after reaching a peak policy at 23 iterations, which amounted to just over 1000 field traversals in about 3 hours. Subsequent evaluations showed no further improvement, suggesting that the learning had plateaued.

    ⁶There is video of the training process at: www.cs.utexas.edu/AustinVilla/legged/learned-walk/

    [Plot omitted: velocity of the learned gait (mm/s) against number of training iterations, compared against hand-tuned and previously learned gaits from UT Austin Villa, UNSW, and the German Team.]

    Fig. 6. The velocity of the best gait from each iteration during training, compared to previous results. We were able to learn a gait significantly faster than both hand-coded gaits and previously learned gaits.

    Parameter                      Initial Value   ε       Best Value
    Front locus (height)           4.2             0.35    4.081
    Front locus (x offset)         2.8             0.35    0.574
    Front locus (y offset)         4.9             0.35    5.152
    Rear locus (height)            5.6             0.35    6.02
    Rear locus (x offset)          0.0             0.35    0.217
    Rear locus (y offset)          -2.8            0.35    -2.982
    Locus length                   4.893           0.35    5.285
    Locus skew multiplier          0.035           0.175   0.049
    Front height                   7.7             0.35    7.483
    Rear height                    11.2            0.35    10.843
    Time to move through locus     0.704           0.016   0.679
    Time on ground                 0.5             0.05    0.430

    Fig. 7. The initial policy, the amount of change (ε) for each parameter, and the best policy learned after 23 iterations. All values are given in centimeters, except time to move through locus, which is measured in seconds, and time on ground, which is a fraction.

    There are a couple of possible explanations for the amount of variation in the learning process, visible as the spikes in the learning curve shown in Figure 6. Despite the fact that we averaged over multiple evaluations to determine the score for each policy, there was still a fair amount of noise associated with the score for each policy. It is entirely possible that this noise led the search astray at times, causing temporary decreases in performance.