
Page 1

RL 5: On-policy and off-policy algorithms

Michael Herrmann

University of Edinburgh, School of Informatics

27/01/2015

Page 2

Overview

Off-policy algorithms

Q-learning (last time)
R-learning (a variant of Q-learning)

On-policy algorithms

SARSA
TD(λ)
Actor-critic methods


Page 3

R-learning (see S&B [2] Section 11.2)

Similar to Q-learning, in particular for non-discounted, non-episodic problems

Consider the average reward ρ = lim_{n→∞} (1/n) ∑_{t=1}^{n} E[r_t]

Value is defined here as “above average”:

V(s) = ∑_{k=1}^{∞} E[r_{t+k} − ρ | s_t = s]

Q(s, a) = ∑_{k=1}^{∞} E[r_{t+k} − ρ | s_t = s, a_t = a]

Relative value function (relative to the average)

ρ is adapted and measures (average) success

Implies a different concept of optimality in non-episodic tasks

A. Schwartz (1993). A reinforcement learning method for maximizing undiscounted rewards. In ICML, 298-305.


Page 4

R-learning: Algorithm

1 Initialise ρ and Q (s, a)

2 Observe s_t and choose a_t (e.g. ε-greedy), execute a_t

3 Observe r_{t+1} and s_{t+1}

4 Update

Q_{t+1}(s_t, a_t) = (1 − η) Q_t(s_t, a_t) + η (r_{t+1} − ρ_t + max_a Q_t(s_{t+1}, a))

5 If Q(s_t, a_t) = max_a Q(s_t, a) then

ρ_{t+1} = (1 − α) ρ_t + α (r_{t+1} + max_a Q_t(s_{t+1}, a) − max_a Q_{t+1}(s_t, a))

Hint: Choose η ≫ α (otherwise, for r = 0, the Q-values may cease to change and the agent may get trapped in a suboptimal limit cycle).
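
A minimal Python sketch of steps 4 and 5, assuming a tabular Q stored as a dictionary indexed by (state, action); the function name and the default step sizes (chosen with η ≫ α, as in the hint) are illustrative and not part of the slides:

from collections import defaultdict

def r_learning_update(Q, rho, s, a, r, s_next, actions, eta=0.1, alpha=0.01):
    # Step 4: relative-value update with the average-reward estimate subtracted
    max_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * (r - rho + max_next)
    # Step 5: adapt rho only if the executed action is greedy in s
    if Q[(s, a)] == max(Q[(s, b)] for b in actions):
        rho = (1 - alpha) * rho + alpha * (r + max_next - max(Q[(s, b)] for b in actions))
    return rho

# Hypothetical usage with Q as a table of state-action values:
# Q, rho = defaultdict(float), 0.0
# rho = r_learning_update(Q, rho, s, a, r, s_next, actions)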


Page 5

R vs. Q: A simple example

Only one decision: the robot moves either to a nearby printer (“o.k.”) or to a distant mail room (“good”).

Similar to a two-armed bandit (2AB), but the waiting times differ for the “arms”: Q-learning with low γ favours the nearby goal, while its learning times get longer for larger γ. R-learning identifies the better choice quickly, based on trajectory-based reward averages. Note that the results may depend on the parameters.


Page 6

R-learning example: Access-control queuing task

Customers of four different priorities pay 1, 2, 4, or 8 (this is a reward) to be served
States are the number of free servers
Actions: the customer at the head of the queue is either served or rejected (and removed from the queue)
The proportion of high-priority customers in the queue is h = 0.5

A busy server becomes free with probability p = 0.06 on each time step (p and h are not known to the algorithm).

http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node67.html
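
The task can be sketched as a tiny simulator. The number of servers (10) and the interpretation of h as the probability that the head-of-queue customer has the highest priority are assumptions for illustration, as are all names:

import random

PRIORITIES = [1, 2, 4, 8]   # payment = reward for serving a customer
N_SERVERS = 10              # assumption: 10 servers as in S&B's version of the task
P_FREE = 0.06               # probability that a busy server becomes free per step
H = 0.5                     # proportion of high-priority customers in the queue

def head_of_queue():
    # Assumed interpretation of h: highest priority with probability H,
    # otherwise one of the lower priorities chosen uniformly.
    return 8 if random.random() < H else random.choice(PRIORITIES[:3])

def step(free_servers, priority, accept):
    # Serve or reject the customer at the head of the queue, then let each
    # busy server become free with probability P_FREE.
    reward = 0
    if accept and free_servers > 0:
        reward = priority
        free_servers -= 1
    busy = N_SERVERS - free_servers
    free_servers += sum(random.random() < P_FREE for _ in range(busy))
    return free_servers, head_of_queue(), reward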


Page 7

R-learning: Conclusions

An off-policy learning method for on-going learning
May be superior to discounted methods
May get trapped in limit cycles (exploration is important)
Success depends on the parameters η, α, ε (exploration rate)

For details see:

S. Mahadevan (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1-3), 159-195.


Page 8

Exploration-exploitation dilemma

Remember MABs: exploratory moves were necessary to find promising actions.
RL: actions influence the future reward. It is necessary to explore sequences of actions over all states, i.e. policies.
To find out whether a policy gives high reward, the agent has to follow this policy.

The agent can either

choose a policy, evaluate it, and move on to better policies

or

collect all available information and use it simultaneously to construct a good policy


Page 9

What is a policy?

Deterministic: Function that maps states to actions. π : S → A

a = π (s)

Example: the standard policy in Q-learning, a_t = π(s_t) = argmax_a Q(s_t, a) (after convergence).

Stochastic: Probability of an action given a state. π : S × A → [0, 1] with ∑_{a∈A_s} π(s, a) = 1 for all s

P(a|s) = π(s, a)

Examples: random, Boltzmann policy

Partial policy: π is not necessarily defined for all s ∈ S, e.g. a policy obtained from demonstration by a human teacher. It can be completed by defaults or combined with other partial policies.
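
For illustration, these three notions of a policy could be coded as follows (a sketch; Q is assumed to be a table of state-action values, and all names are illustrative):

import math
import random

def greedy_policy(Q, s, actions):
    # Deterministic: a = π(s) = argmax_a Q(s, a)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann_policy(Q, s, actions, temperature=1.0):
    # Stochastic: P(a|s) = π(s, a) proportional to exp(Q(s, a) / T), summing to 1 for every s
    weights = [math.exp(Q[(s, a)] / temperature) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]

def complete_partial_policy(partial, default_action):
    # Partial: defined only for some states; fill the remaining states with a default
    return lambda s: partial.get(s, default_action)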


Page 10

On-policy learning vs off-policy learning (preliminary)

On-policy (TD(λ), SARSA)

Start with a simple soft policy
Sample the state space with this policy
Improve the policy

Off-policy (Q-learning, R-learning)

Gather information from (partially) random moves
Evaluate states as if a greedy policy was used
Slowly reduce randomness


Page 11

SARSA and Q-learning

For both SARSA and Q-learning, the Q function can be represented as a look-up table

Q-learning (off-policy): a_t = argmax_a Q(s_t, a) (plus exploration)

Q_{t+1}(s_t, a_t) = (1 − η) Q_t(s_t, a_t) + η (r_{t+1} + γ max_a Q_t(s_{t+1}, a))

SARSA (on-policy):

Q_{t+1}(s_t, a_t) = (1 − η) Q_t(s_t, a_t) + η (r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}))

Q-learning: V(s_{t+1}) = max_a Q(s_{t+1}, a), but a_{t+1} can be anything.
SARSA: a_t ∼ π(s_t, ·), and the update rule learns the exact value function for π(s, a).

How does the policy improve for SARSA?
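
The difference shows up only in the bootstrap target of the update; a sketch with a tabular Q (names are illustrative):

def q_learning_target(Q, r_next, s_next, actions, gamma):
    # Off-policy: bootstrap with the greedy value, whatever a_{t+1} turns out to be
    return r_next + gamma * max(Q[(s_next, a)] for a in actions)

def sarsa_target(Q, r_next, s_next, a_next, gamma):
    # On-policy: bootstrap with the action actually chosen by the policy
    return r_next + gamma * Q[(s_next, a_next)]

def td_update(Q, s, a, target, eta):
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * target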


Page 12

SARSA algorithm

Initialise Q_0(s, a)

Repeat (for each episode)

Initialise s_0
Choose a_0 from s_0 using the policy derived from Q (e.g., ε-greedy)
Repeat (for each step of episode):

Take action a_t, observe r_{t+1}, s_{t+1}

Choose a_{t+1} from s_{t+1} using the policy derived from Q (e.g., ε-greedy)
Q_{t+1}(s_t, a_t) = (1 − η) Q_t(s_t, a_t) + η (r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}))
Set t = t + 1

until s_{t+1} is terminal

SARSA means: State → Action → Reward → State → Action
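
A minimal tabular version of this loop; the environment interface (reset() returning a state, step(a) returning next state, reward, and a termination flag) and all names are assumptions for illustration:

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(s, a)])       # exploit

def sarsa(env, actions, episodes=500, eta=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                             # Q_0(s, a) = 0
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:                                # repeat for each step of the episode
            s_next, r, done = env.step(a)              # take a_t, observe r_{t+1}, s_{t+1}
            a_next = epsilon_greedy(Q, s_next, actions, eps)
            target = r + gamma * (0.0 if done else Q[(s_next, a_next)])
            Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * target
            s, a = s_next, a_next                      # State → Action → Reward → State → Action
    return Q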


Page 13

SARSA vs. Q: The cliff walking task (S&B, example 6.6)

r = −1 for every step, r = −100 for falling down the cliff
ε-greedy with ε = 0.1 (see figure)
With ε-greedy and ε → 0, both methods converge to the (now safe) optimal path

Results for fixed ε


Page 14

On-policy learning vs off-policy learning (according to S&B)

On-policy methods

attempt to evaluate or improve the policy that is used to make decisions
often use soft action choice, i.e. π(s, a) > 0 ∀a
commit to always exploring and try to find the best policy that still explores
may become trapped in local minima

Off-policy methods

evaluate one policy while following another, e.g. try to evaluate the greedy policy while following a more exploratory scheme
the policy used for behaviour should be soft
the policies may not be sufficiently similar
may be slower (only the part after the last exploration is reliable), but remain more flexible if alternative routes appear

May lead to the same result (e.g. after greedification)


Page 15

TD: Temporal difference learning

TD learns the value function V(s) directly.
TD is on-policy, i.e. the resulting value function depends on the policy that is used.
Information from policy-dependent sampling of the value function is not used immediately to improve the policy.
TD learning as such is not an RL algorithm, but it can be used for RL if the transition probabilities p(s′|s, a) are known.
One way to use TD for RL is SARSA, where the transition probabilities are implicitly included in the state-action values.


Page 16

Temporal Difference (TD) Learning for Value Prediction

Ideal value function

V_t = ∑_{τ=t}^{∞} γ^{τ−t} r_τ = r_t + γ r_{t+1} + γ² r_{t+2} + · · ·

    = r_t + γ (r_{t+1} + γ r_{t+2} + · · ·)

    = r_t + γ ∑_{τ=t+1}^{∞} γ^{τ−(t+1)} r_τ

    = r_t + γ V_{t+1}

During learning, the value function is based on estimates of V_t and V_{t+1} and may not obey this relation. If all states and all actions are sampled sufficiently often, then the requirement of consistency, i.e. minimisation of the absolute value of the δ error (δ for διαφορά)

δ_{t+1} = r_t + γ V_{t+1} − V_t

will move the estimates V_t and V_{t+1} towards the ideal values.

Page 17

The simplest TD algorithm

Let V_t be the t-th iterate of a learning rule for estimating the value function V.

Let s_t be the state of the system at time step t.

δ_{t+1} = r_t + γ V_t(s_{t+1}) − V_t(s_t)

V_{t+1}(s) = V_t(s) + η δ_{t+1} if s = s_t, and V_{t+1}(s) = V_t(s) otherwise

V_{t+1}(s_t) = V_t(s_t) + η δ_{t+1} = V_t(s_t) + η (r_t + γ V_t(s_{t+1}) − V_t(s_t))
           = (1 − η) V_t(s_t) + η (r_t + γ V_t(s_{t+1}))

The update of the estimate V is an exponential average over the cumulative expected reward.


Page 18

TD(0) Algorithm

Initialise η and γ and execute after each state transition

function TD0(s, r, s1, V) {
    δ := r + γ * V[s1] − V[s];
    V[s] := V[s] + η * δ;
    return V;
}
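
The same update as a plain Python function over a tabular value estimate (a sketch; V is assumed to be a dictionary, and the parameter defaults are illustrative):

def td0_update(V, s, r, s_next, eta=0.1, gamma=0.99):
    # δ = r + γ V[s_next] − V[s]; move V[s] one step towards the target
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + eta * delta
    return V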

Remarks

If the algorithm converges, it must converge to a value function where the expected temporal differences are zero for all states.
The continuous version of the algorithm can be shown to be globally asymptotically stable.
TD(0) is a stochastic approximation algorithm. If the system is ergodic and the learning rate is appropriately decreased, it behaves like the continuous version.


Page 19

Robbins-Monro conditions

How to choose learning rates? If

∑_{t=0}^{∞} η_t = ∞ and ∑_{t=0}^{∞} η_t² < ∞,

then V_t(·) will behave as the temporally continuous variant

dV(·)/dt = r + (γP − I) V(·)

Choosing e.g. η_t = c t^{−α}, the conditions hold for α ∈ (1/2, 1]:
α > 1: forced convergence, but possibly without reaching the goal
α = 1: smallest step sizes, but still possible
α ≤ 1/2: large fluctuations can happen even after a long time
Iterate averaging (Polyak & Juditsky, 1992) gives the best possible asymptotic rate of convergence

Practically: fixed step sizes or finite-time reduction (see earlier slide)
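
As an illustration, a schedule of the form η_t = c t^{−α} with α ∈ (1/2, 1] satisfies both conditions; a sketch (names and defaults are illustrative):

def robbins_monro_rate(t, c=1.0, alpha=0.75):
    # For α in (1/2, 1]: the sum of η_t diverges while the sum of η_t² stays finite
    return c / (t + 1) ** alpha    # t + 1 avoids division by zero at t = 0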


Page 20

Actor-Critic Methods

The policy (actor) is represented independently of the (state) value function (critic)
A number of variants exist, in particular among the early reinforcement learning algorithms

Advantages¹

AC methods require minimal computation in order to select actions, which is beneficial in continuous cases, where search becomes a problem.
They can learn an explicitly stochastic policy, i.e. learn the optimal action probabilities. This is useful in competitive and non-Markov cases².
They are a plausible model of biological reinforcement learning.
Recently, an off-policy variant has also been proposed.

¹ Mark Lee, following Sutton & Barto
² See, e.g., Singh, Jaakkola, and Jordan (1994)


Page 21

Actor-Critic Methods

The actor aims at improving the policy (adaptive search element)
The critic evaluates the current policy (adaptive critic element)
Learning is based on the TD error δ_t
The reward is only known to the critic
The critic should improve as well
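
A minimal tabular sketch consistent with this picture: the critic maintains V(s) and computes the TD error δ, and the actor maintains action preferences that define a stochastic policy and are nudged by δ. All names and step sizes are illustrative, not a specific algorithm from the slides.

import math
import random
from collections import defaultdict

def actor_policy(prefs, s, actions):
    # Actor: softmax over action preferences p(s, a)
    weights = [math.exp(prefs[(s, a)]) for a in actions]
    return random.choices(actions, weights=weights)[0]

def actor_critic_step(V, prefs, s, a, r, s_next, gamma=0.99, eta_critic=0.1, eta_actor=0.1):
    delta = r + gamma * V[s_next] - V[s]     # TD error; the reward enters only here
    V[s] += eta_critic * delta               # critic: evaluate the current policy
    prefs[(s, a)] += eta_actor * delta       # actor: reinforce a in s if δ > 0
    return delta

# Hypothetical usage:
# V, prefs = defaultdict(float), defaultdict(float)
# a = actor_policy(prefs, s, actions)
# ... execute a, observe r and s_next ...
# actor_critic_step(V, prefs, s, a, r, s_next)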


Page 22

Example: Policies for the inverted pendulum

Exploitation (actor): escape from low-reward regions as fast as possible
Aims at maximising r
E.g. inverted pendulum task: wants to stay near the upright position
Preferentially greedy and deterministic

Exploration (critic): find examples where learning is optimal
Aims at maximising δ
E.g. inverted pendulum task: wants to move away from the upright position
Preferentially non-deterministic


Page 23

AC methods vs. other RL algorithms

SARSA and Q-learning do not have an explicit policy representation; in a sense they are thus “critic-only” algorithms.
There are also “actor-only” methods, which directly try to improve the policy, e.g. REINFORCE (Williams, 1992).
AC is advantageous for continuous problems (later!), where Q-learning and SARSA may become unstable due to the concomitant function approximation.


Page 24

Conclusions

Both on- and off-policy methods have their advantages

If a good starting policy is available, on-policy learning may be interesting, but it may not explore other policies well
If more exploration is necessary, then off-policy learning is perhaps advisable, but it may be slow

Actor-critic methods are of historical interest, but we will come back to them.
TD learning, including value iteration and policy iteration, will also be revisited shortly.
We need a theoretical framework to understand better how the algorithms work. For this purpose, we will study Markov decision problems (MDPs) next.

Literature: R-learning: S&B [2], Section 11.2; SARSA: S&B [2], Section 6.4
