
### Transcript of Lecture 7: Policy Gradient (UCL Computer Science)

David Silver

Outline

1 Introduction


Policy-Based Reinforcement Learning

In the last lecture we approximated the value or action-value function using parameters θ,

V_\theta(s) \approx V^\pi(s), \quad Q_\theta(s, a) \approx Q^\pi(s, a)

A policy was generated directly from the value function

e.g. using ε-greedy

In this lecture we will directly parametrise the policy

\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]

We will focus again on model-free reinforcement learning


Value-Based and Policy-Based RL

Value Based

Learnt Value Function

Implicit policy (e.g. ε-greedy)

Policy Based

No Value Function

Learnt Policy

Actor-Critic

Learnt Value Function

Learnt Policy

[Diagram: Value-Based and Policy-Based RL overlap; Actor-Critic methods sit in the intersection, learning both a Value Function and a Policy]

Advantages of Policy-Based RL

Advantages:

Better convergence properties

Effective in high-dimensional or continuous action spaces

Can learn stochastic policies

Disadvantages:

Typically converge to a local rather than global optimum

Evaluating a policy is typically inefficient and high variance


Example: Rock-Paper-Scissors

Two-player game of rock-paper-scissors

Scissors beats paper

Rock beats scissors

Paper beats rock

Consider policies for iterated rock-paper-scissors

A deterministic policy is easily exploited

A uniform random policy is optimal (i.e. Nash equilibrium)
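A quick way to see why determinism is exploitable is to simulate both policies against an opponent that best-responds to our empirical action frequencies. The sketch below is purely illustrative and not part of the lecture; the opponent model and episode count are assumptions.

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value

def payoff(a, b):
    """Payoff to the first player: +1 win, 0 draw, -1 loss."""
    return 0 if a == b else (1 if BEATS[a] == b else -1)

def best_response(counts):
    """Opponent that plays the counter to our most frequent action so far."""
    most_common = max(ACTIONS, key=lambda a: counts[a])
    return next(a for a in ACTIONS if BEATS[a] == most_common)

def average_payoff(policy, episodes=5000):
    counts = {a: 0 for a in ACTIONS}
    total = 0
    for _ in range(episodes):
        ours = policy()
        total += payoff(ours, best_response(counts))
        counts[ours] += 1
    return total / episodes

print("deterministic rock:", average_payoff(lambda: "rock"))                  # tends to -1
print("uniform random    :", average_payoff(lambda: random.choice(ACTIONS)))  # tends to 0
```

The deterministic policy loses almost every round once the opponent has seen it, while the uniform random policy cannot be exploited.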


Example: Aliased Gridworld (1)

The agent cannot differentiate the grey states

Consider features of the following form (for all N, E, S, W):

\phi(s, a) = \mathbf{1}(\text{wall to N}, a = \text{move E})

Compare value-based RL, using an approximate value function

Q_\theta(s, a) = f(\phi(s, a), \theta)

To policy-based RL, using a parametrised policy

\pi_\theta(s, a) = g(\phi(s, a), \theta)


Example: Aliased Gridworld (2)

Under aliasing, an optimal deterministic policy will either

move W in both grey states (shown by red arrows)

move E in both grey states

Either way, it can get stuck and never reach the money

Value-based RL learns a near-deterministic policy

e.g. greedy or ε-greedy

So it will traverse the corridor for a long time


Example: Aliased Gridworld (3)

An optimal stochastic policy will randomly move E or W in grey states

\pi_\theta(\text{wall to N and S}, \text{move E}) = 0.5

\pi_\theta(\text{wall to N and S}, \text{move W}) = 0.5

It will reach the goal state in a few steps with high probability

Policy-based RL can learn the optimal stochastic policy
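To make this concrete, a linear softmax policy over indicator features like the one above can put equal probability on E and W in the aliased states, which no greedy (argmax) policy over a value function with the same features can do. A minimal sketch, with an assumed feature layout that is not from the lecture:

```python
import numpy as np

# Binary features phi(s, a) = 1(wall to direction d, a = move a'), so the two
# aliased grey states (walls to N and S) share identical feature vectors.
DIRS = ["N", "E", "S", "W"]

def phi(walls, action):
    x = np.zeros(len(DIRS) * len(DIRS))
    for i, d in enumerate(DIRS):
        for j, a in enumerate(DIRS):
            x[i * len(DIRS) + j] = 1.0 if (d in walls and action == a) else 0.0
    return x

def softmax_policy(theta, walls):
    """Linear softmax policy pi_theta(s, .) over the four move actions."""
    prefs = np.array([phi(walls, a) @ theta for a in DIRS])
    prefs -= prefs.max()                 # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

# Give equal positive weight to "move E" and "move W" whenever a wall is to the
# N or S; the stochastic policy then splits its mass between E and W.
theta = np.zeros(16)
for d in ["N", "S"]:
    for a in ["E", "W"]:
        theta[DIRS.index(d) * 4 + DIRS.index(a)] = 5.0

grey = softmax_policy(theta, walls={"N", "S"})
print(dict(zip(DIRS, grey.round(3))))    # ~{'N': 0, 'E': 0.5, 'S': 0, 'W': 0.5}

# By contrast, a greedy policy over Q_theta(s, a) = phi(s, a) . w must choose a
# single argmax action, and it chooses the SAME action in both grey states
# because their features are identical -- so it can get stuck in one corner.
```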


Policy Search

Policy Objective Functions

Goal: given a policy π_θ(s, a) with parameters θ, find the best θ

But how do we measure the quality of a policy π_θ?

In episodic environments we can use the start value

J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]

In continuing environments we can use the average value

J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) \, V^{\pi_\theta}(s)

Or the average reward per time-step

J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a) \, \mathcal{R}_s^a

where d^{\pi_\theta}(s) is the stationary distribution of the Markov chain for π_θ
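In practice each of these objectives can only be estimated from experience. As a minimal sketch (the environment interface, policy signature, and episode count are assumptions, not from the lecture), the start value J_1(θ) can be estimated by averaging returns over episodes rolled out from s_1:

```python
def estimate_start_value(env, policy, episodes=100, gamma=1.0):
    """Monte-Carlo estimate of J_1(theta) = E[v_1] under the policy pi_theta.

    `env` is assumed to expose reset() -> state and step(action) ->
    (state, reward, done); `policy(state)` samples an action.
    """
    total = 0.0
    for _ in range(episodes):
        state, done, ret, discount = env.reset(), False, 0.0, 1.0
        while not done:
            state, reward, done = env.step(policy(state))
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / episodes
```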


Policy Optimisation

Policy-based reinforcement learning is an optimisation problem

Find θ that maximises J(θ)

Some approaches do not use gradient

Hill climbing

Simplex / amoeba / Nelder-Mead

Genetic algorithms

Greater efficiency often possible using gradient

We focus on gradient descent, many extensions possible

And on methods that exploit sequential structure

Let J(θ) be any policy objective function

Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of the policy, w.r.t. parameters θ

\Delta\theta = \alpha \, \nabla_\theta J(\theta)

where ∇_θ J(θ) is the policy gradient

\nabla_\theta J(\theta) = \begin{pmatrix} \partial J(\theta) / \partial \theta_1 \\ \vdots \\ \partial J(\theta) / \partial \theta_n \end{pmatrix}

and α is a step-size parameter
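As a minimal sketch of this update rule (the gradient estimator is passed in as a function; the names and the toy objective are illustrative, not from the lecture):

```python
import numpy as np

def gradient_ascent(grad_J, theta, alpha=0.01, steps=1000):
    """Repeatedly apply the update Delta-theta = alpha * grad_theta J(theta)."""
    theta = np.asarray(theta, dtype=float).copy()
    for _ in range(steps):
        theta += alpha * grad_J(theta)   # ascend, since we maximise J
    return theta

# Example: maximise J(theta) = -(theta - 3)^2, whose gradient is -2(theta - 3).
print(gradient_ascent(lambda th: -2 * (th - 3.0), theta=[0.0]))  # -> approx [3.]
```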

!"#\$%"&'(%")*+,#'-'!"#\$%%" (%")*+,#'.+/0+,#

!"#\$%&'('%\$()&*+\$,*\$#&&-&\$\$\$\$."'%"\$'*\$%-/0'*,('-*\$.'("\$("#\$1)*%('-*\$,22&-3'/,(-& %,*\$0#\$)+#4\$(-\$%,(#\$,*\$#&&-&\$1)*%('-*\$\$\$\$\$\$\$\$\$\$\$\$\$

!"#\$2,&(',5\$4'11#*(',5\$-1\$("'+\$#&&-&\$1)*%('-*\$\$\$\$\$\$\$\$\$\$\$\$\$\$\$\$6\$("#\$7&,4'#*(\$%,*\$*-.\$0#\$)+#4\$(-\$)24,(#\$("#\$'*(#&*,5\$8,&',05#+\$'*\$("#\$1)*%('-*\$,22&-3'/,(-& 9,*4\$%&'('%:;\$\$\$\$\$\$

To evaluate the policy gradient of π_θ(s, a):

For each dimension k ∈ [1, n]:

Estimate the kth partial derivative of the objective function w.r.t. θ

By perturbing θ by a small amount ε in the kth dimension

\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}

where u_k is a unit vector with 1 in the kth component, 0 elsewhere

Uses n evaluations to compute policy gradient in n dimensions

Simple, noisy, inefficient - but sometimes effective

Works for arbitrary policies, even if policy is not differentiable
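A minimal sketch of the finite-difference estimator above; the objective `J` would itself be a noisy rollout estimate such as the Monte-Carlo one sketched earlier, and `eps` and the toy objective here are illustrative assumptions:

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-2):
    """Estimate grad_theta J(theta) with one forward perturbation per dimension."""
    theta = np.asarray(theta, dtype=float)
    J_base = J(theta)                      # evaluate the unperturbed policy once
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0                       # unit vector in the kth dimension
        grad[k] = (J(theta + eps * u_k) - J_base) / eps
    return grad                            # n perturbed evaluations for n dimensions

# Works even when J is only available as a black box, e.g. a timed robot gait.
print(finite_difference_gradient(lambda th: -np.sum((th - 1.0) ** 2),
                                 theta=np.array([0.0, 2.0])))  # ~[ 2., -2.]
```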

AIBO example

Training AIBO to Walk by Finite Difference Policy Gradient

those parameters to an Aibo and instructing it to time itself as it walked between two fixed landmarks (Figure 5). More efficient parameters resulted in a faster gait, which translated into a lower time and a better score. After completing an evaluation, the Aibo sent the resulting score back to the host computer and prepared itself for a new set of parameters to evaluate. [6]

When implementing the algorithm described above, we chose to evaluate t = 15 policies per iteration. Since there was significant noise in each evaluation, each set of parameters was evaluated three times. The resulting score for that set of parameters was computed by taking the average of the three evaluations. Each iteration therefore consisted of 45 traversals between pairs of beacons and lasted roughly 7 1/2 minutes. Since even small adjustments to a set of parameters could lead to major differences in gait speed, we chose relatively small values for each ε_j. To offset the small values of each ε_j, we accelerated the learning process by using a larger value of 2 for the step size.
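Only the noise-averaging evaluation step quoted above is sketched below; the candidate-generation and update rule live in the paper's Section III, which is not excerpted here. Names such as `time_walk` and the toy speed model are purely illustrative stand-ins for the physical timing runs:

```python
import random

def score_policy(params, time_walk, evaluations=3):
    """Average several noisy evaluations of one parameter setting."""
    return sum(time_walk(params) for _ in range(evaluations)) / evaluations

def evaluate_iteration(candidates, time_walk):
    """Score each of the t candidate policies for one iteration (t = 15 here)."""
    return [score_policy(c, time_walk) for c in candidates]

# Example with a simulated, noisy walk-speed measurement (purely illustrative).
def fake_time_walk(params):
    true_speed = 250.0 + 10.0 * params["front_height"]
    return true_speed + random.gauss(0.0, 15.0)   # measurement noise

candidates = [{"front_height": 7.7 + random.uniform(-0.35, 0.35)} for _ in range(15)]
print(evaluate_iteration(candidates, fake_time_walk))
```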

Fig. 5. The training environment for our experiments. Each Aibo times itself as it moves back and forth between a pair of landmarks (A and A', B and B', or C and C').

IV. RESULTS

Our main result is that using the algorithm described in Section III, we were able to find one of the fastest known Aibo gaits. Figure 6 shows the performance of the best policy of each iteration during the learning process. After 23 iterations the learning algorithm produced a gait (shown in Figure 8) that yielded a velocity of 291 ± 3 mm/s, faster than both the best hand-tuned gaits and the best previously known learned gaits (see Figure 1). The parameters for this gait, along with the initial parameters and ε values, are given in Figure 7.

Note that we stopped training after reaching a peak policy at 23 iterations, which amounted to just over 1000 field traversals in about 3 hours. Subsequent evaluations showed no further improvement, suggesting that the learning had plateaued.

[6] There is video of the training process at: www.cs.utexas.edu/AustinVilla/legged/learned-walk/

[Figure: "Velocity of Learned Gait during Training". Velocity (mm/s, 180-300) plotted against number of iterations (0-25), with reference lines for the learned and hand-tuned gaits of UT Austin Villa, UNSW, and the German Team.]

Fig. 6. The velocity of the best gait from each iteration during training, compared to previous results. We were able to learn a gait significantly faster than both hand-coded gaits and previously learned gaits.

| Parameter | Initial Value | ε | Best Value |
|---|---|---|---|
| Front locus (height) | 4.2 | 0.35 | 4.081 |
| Front locus (x offset) | 2.8 | 0.35 | 0.574 |
| Front locus (y offset) | 4.9 | 0.35 | 5.152 |
| Rear locus (height) | 5.6 | 0.35 | 6.02 |
| Rear locus (x offset) | 0.0 | 0.35 | 0.217 |
| Rear locus (y offset) | -2.8 | 0.35 | -2.982 |
| Locus length | 4.893 | 0.35 | 5.285 |
| Locus skew multiplier | 0.035 | 0.175 | 0.049 |
| Front height | 7.7 | 0.35 | 7.483 |
| Rear height | 11.2 | 0.35 | 10.843 |
| Time to move through locus | 0.704 | 0.016 | 0.679 |
| Time on ground | 0.5 | 0.05 | 0.430 |

Fig. 7. The initial policy, the amount of change (ε) for each parameter, and the best policy learned after 23 iterations. All values are given in centimeters, except time to move through locus, which is measured in seconds, and time on ground, which is a fraction.

There are a couple of possible explanations for the amount of variation in the learning process, visible as the spikes in the learning curve shown in Figure 6. Despite the fact that we averaged over multiple evaluations to determine the score for each policy, there was still a fair amount of noise associated with the score for each policy. It is entirely possible that this noise led the search astray at times, causing temporary decreases in performance.