Reinforcement Learning Lecture Temporal Difference Learning


Transcript of Reinforcement Learning Lecture Temporal Difference Learning

Page 1: Reinforcement Learning Lecture Temporal Difference Learning

Reinforcement Learning

Temporal Difference Learning

Temporal difference learning, TD prediction, Q-learning, eligibility traces. (Many slides from Marc Toussaint.)

Vien Ngo, MLR, University of Stuttgart

Page 2: Reinforcement Learning Lecture Temporal Difference Learning

Outline

• Temporal Difference Learning

• Q-learning

• Eligibility Traces


Page 3: Reinforcement Learning Lecture Temporal Difference Learning

Learning in MDPs

• Assume an unknown MDP {S, A, ·, ·, γ} (the agent does not know P, R).

• While interacting with the world, the agent collects data of the form D = {(s_t, a_t, r_t, s_{t+1})}_{t=1}^H

(state, action, immediate reward, next state)

• What could we learn from that?
– learn to predict the next state: P(s'|s, a)
– learn to predict the immediate reward: P(r|s, a) (or R(s, a))
– learn to predict the value: (s, a) ↦ Q(s, a)
– learn to predict the action: π(s, a)



Page 5: Reinforcement Learning Lecture Temporal Difference Learning

Model-based versus Model-free

Edward Tolman (1886–1959), Purposive Behavior in Animals and Men
(stimulus-stimulus, non-reinforcement driven)

Wolfgang Köhler (1887–1967): animals learn facts about the world that they could subsequently use in a flexible manner, rather than simply learning automatic responses.

Clark Hull (1884–1952), Principles of Behavior (1943):
learn stimulus-response mappings based on reinforcement



Page 8: Reinforcement Learning Lecture Temporal Difference Learning

Introduction to Model-Free Methods

– Monte-Carlo Methods, On-Policy MC Control

– Temporal Difference Learning, On-Policy SARSA Algorithm

– Off-Policy Q-Learning Algorithm


Page 9: Reinforcement Learning Lecture Temporal Difference Learning

Monte-Carlo Policy Evaluation
– MC policy evaluation: first-visit and every-visit methods

Algorithm 1 MC-PE: policy evaluation of policy π
1: Given π; Returns(s) = ∅, ∀s ∈ S
2: while (!converged) do
3:   Generate an episode τ = (s_0, a_0, r_0, ..., s_{T−1}, a_{T−1}, r_{T−1}, s_T) using π
4:   for each state s_t in τ do
5:     Compute the return of s_t: R(s_t) = Σ_{l=t}^{T−1} γ^{l−t} r_l
6:     Either: add R(s_t) into Returns(s_t) only at the first visit of s_t in τ (first-visit),
        or add R(s_t) into Returns(s_t) at every visit (every-visit)
7: Return V^π(s) = average(Returns(s))

– Converges as the number of visits to every state s goes to infinity.

(Introduction to RL, Sutton & Barto 1998)
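As a concrete illustration of Algorithm 1, here is a minimal Python sketch of first-visit (and every-visit) MC policy evaluation. The env object with reset()/step() and the policy callable are hypothetical stand-ins for an episodic environment interface, not something defined in the lecture.

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, gamma=0.9, n_episodes=1000, first_visit=True):
    """MC-PE sketch: estimate V^pi(s) by averaging sampled returns.

    Assumes a hypothetical env with reset() -> s and step(a) -> (s_next, r, done).
    """
    returns = defaultdict(list)                       # Returns(s), as in Algorithm 1
    for _ in range(n_episodes):
        # Generate an episode tau = (s_0, a_0, r_0, ..., s_T) using pi
        episode, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            episode.append((s, r))
            s = s_next
        # R(s_t) = sum_{l=t}^{T-1} gamma^(l-t) r_l, accumulated backwards
        G, first_visit_return = 0.0, {}
        for s_t, r_t in reversed(episode):
            G = r_t + gamma * G
            if first_visit:
                first_visit_return[s_t] = G           # overwritten until only the first visit remains
            else:
                returns[s_t].append(G)                # every-visit: record every occurrence
        for s_t, G_t in first_visit_return.items():
            returns[s_t].append(G_t)
    return {s: sum(Rs) / len(Rs) for s, Rs in returns.items()}
```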

Page 10: Reinforcement Learning Lecture Temporal Difference Learning

Monte-Carlo Policy Evaluation

• Alternatively, use incremental updates for MC-PE when obtaining a return R(s_t):

V^π(s_t) ← V^π(s_t) + α_t (R(s_t) − V^π(s_t))

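A one-line sketch of this incremental form, with V as a plain dict and a constant step size α (both assumptions for illustration):

```python
def mc_incremental_update(V, s_t, G_t, alpha=0.1):
    """Shift V(s_t) a fraction alpha toward the newly observed return G_t = R(s_t)."""
    V[s_t] = V[s_t] + alpha * (G_t - V[s_t])
```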

Page 11: Reinforcement Learning Lecture Temporal Difference Learning

Example: MC Method for Blackjack

• The state is 3-dimensional: the player's current sum (12-21), the dealer's one showing card (ace-10), and whether the player holds a usable ace (1/0).

• Actions: stick, hit

• Rewards:

R(·, stick) =
  +1.0 if the current sum is better than the dealer's
   0.0 if the current sum equals the dealer's
  −1.0 if the current sum is worse than the dealer's

R(·, hit) =
  −1.0 if the current sum > 21 (bust)
   0.0 otherwise

• If the current sum < 12, automatically hit.


Page 12: Reinforcement Learning Lecture Temporal Difference Learning

Example: MC Method for Blackjack

Approximate value functions for the policy that sticks only on 20 or 21.


Page 13: Reinforcement Learning Lecture Temporal Difference Learning

On-Policy MC Control Algorithm
Generalized policy iteration framework

Algorithm 2 On-policy MC Control Algorithm
1: Init an initial policy π_0
2: while (!converged) do
3:   Policy Evaluation: π_k → Q^{π_k}(s, a), ∀s, a
4:   Policy Improvement: greedy action selection
       π_{k+1}(s) = argmax_a Q^{π_k}(s, a), ∀s


Page 14: Reinforcement Learning Lecture Temporal Difference Learning

ε-Greedy Policy

• The ε-greedy policy chooses actions with probability

π(a|s) = ε/|A| + 1 − ε   if a = argmax_{a'∈A} Q(s, a')
π(a|s) = ε/|A|           otherwise

• Policy improvement: for a new ε-greedy policy π_{k+1} constructed upon the value function Q^{π_k}(s, a) of the previous ε-greedy policy π_k,

Q^{π_k}(s, π_{k+1}(s)) = Σ_a π_{k+1}(a|s) Q^{π_k}(s, a)
  = (ε/|A|) Σ_a Q^{π_k}(s, a) + (1 − ε) max_a Q^{π_k}(s, a)
  ≥ (ε/|A|) Σ_a Q^{π_k}(s, a) + (1 − ε) Σ_a [(π_k(a|s) − ε/|A|) / (1 − ε)] Q^{π_k}(s, a)
  = Σ_a π_k(a|s) Q^{π_k}(s, a) = V^{π_k}(s)

(The inequality holds because the weights (π_k(a|s) − ε/|A|)/(1 − ε) are nonnegative and sum to 1, so the weighted average cannot exceed the maximum.)

• Therefore V^{π_{k+1}}(s) ≥ V^{π_k}(s)

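A small sketch of sampling from the ε-greedy policy defined above; storing Q as a dict keyed by (s, a) is an implementation assumption, not part of the slides.

```python
import random

def epsilon_greedy_action(Q, s, actions, eps=0.1):
    """With prob. eps pick a uniformly random action, otherwise the greedy one,
    so the greedy action ends up with total probability eps/|A| + 1 - eps."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```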


Page 16: Reinforcement Learning Lecture Temporal Difference Learning

MC Control Algorithm with ε-Greedy

• Choose actions using ε-greedy.

• Converges if ε_k decreases to zero over time, e.g. ε_k = 1/k (a GLIE policy).

• A policy is GLIE (Greedy in the Limit with Infinite Exploration) if
  • all state-action pairs are visited infinitely often, and
  • in the limit (k → ∞) the policy becomes greedy:

    lim_{k→∞} π_k(a|s) = δ_a(argmax_{a'} Q^{π_k}(s, a'))

  where δ is a Dirac (indicator) function.


Page 17: Reinforcement Learning Lecture Temporal Difference Learning

MC Control Algorithm: Blackjack


Page 18: Reinforcement Learning Lecture Temporal Difference Learning

Temporal difference (TD) learning

• TD Prediction (TD(0))

• On-policy TD Control (SARSA Algorithm)

• Off-policy TD Control (Q-Learning)

• Eligibility Traces (TD(λ))


Page 19: Reinforcement Learning Lecture Temporal Difference Learning

Temporal Difference (TD) Prediction

Recall: V^π(s) = E[R(π(s), s) + γ V^π(s')]

• TD learning: given a new experience (s, a, r, s'),

V_new(s) = (1 − α) V_old(s) + α [r + γ V_old(s')]        (TD target: r + γ V_old(s'))
         = V_old(s) + α [r + γ V_old(s') − V_old(s)]     (TD error: r + γ V_old(s') − V_old(s))

• Reinforcement:
– more reward than expected (r > V_old(s) − γ V_old(s')) → increase V(s)
– less reward than expected (r < V_old(s) − γ V_old(s')) → decrease V(s)

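A minimal sketch of the TD(0) update for one experience, matching the equations above (dict-based V and a constant α are assumptions):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    td_error = r + gamma * V[s_next] - V[s]           # TD error (reward surprise)
    V[s] = V[s] + alpha * td_error
    return td_error
```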


Page 22: Reinforcement Learning Lecture Temporal Difference Learning

TD Prediction vs. MC Prediction

Figure: MC backup diagram

Figure: TD backup diagram


Page 23: Reinforcement Learning Lecture Temporal Difference Learning

TD Prediction vs. MC Prediction
Driving Home Example

State                        Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office, friday at 6  0                        30                     30
reach car, raining           5                        35                     40
exiting highway              20                       15                     35
2ndary road, behind truck    30                       10                     40
entering home street         40                       3                      43
arrive home                  43                       0                      43


Page 24: Reinforcement Learning Lecture Temporal Difference Learning

TD Prediction vs. MC Prediction

Figure: Monte-Carlo method
Figure: TD(0)

(Introduction to RL, Sutton & Barto 1998)


Page 25: Reinforcement Learning Lecture Temporal Difference Learning

TD Prediction vs. MC Prediction

TD:
• TD can learn before the termination of episodes.
• TD can be used for both non-episodic and episodic tasks.
• The update depends on a single stochastic transition ⇒ lower variance.
• Updates use bootstrapping ⇒ the estimate has some bias.
• TD updates exploit the Markov property.

MC:
• MC learning must wait until the end of episodes.
• MC only works for episodic tasks.
• The update depends on a sequence of many stochastic transitions ⇒ much larger variance.
• Unbiased estimate.
• MC updates do not exploit the Markov property, hence MC can be effective in non-Markov environments.


Page 26: Reinforcement Learning Lecture Temporal Difference Learning

TD vs. MC: A Random Walk Example


Page 27: Reinforcement Learning Lecture Temporal Difference Learning

On-Policy TD Control: SARSA


Page 28: Reinforcement Learning Lecture Temporal Difference Learning

On-Policy TD Control: SARSA

Figure: learning on the tuple (s, a, r, s', a'): SARSA

• Q-value update:

Q_{t+1}(s, a) = Q_t(s, a) + α (r_t + γ Q_t(s', a') − Q_t(s, a))


Page 29: Reinforcement Learning Lecture Temporal Difference Learning

On-Policy TD Control: SARSA

Algorithm 3 SARSA Algorithm
1: Init Q(s, a), ∀s ∈ S, a ∈ A
2: while (!converged) do
3:   Init a starting state s
4:   Select an action a from s using a policy derived from Q (e.g. ε-greedy)
5:   for each step of the episode do
6:     Execute a, observe r, s'
7:     Select an action a' from s' using a policy derived from Q (e.g. ε-greedy)
8:     Update: Q_{t+1}(s, a) = Q_t(s, a) + α_t (r + γ Q_t(s', a') − Q_t(s, a))
9:     s ← s'; a ← a'

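A Python sketch of Algorithm 3; the env.reset()/env.step() interface and the tabular dict representation of Q are assumptions, only the update rule comes from the slide.

```python
from collections import defaultdict
import random

def sarsa(env, actions, n_episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """On-policy TD control (SARSA) sketch with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)                            # Q(s, a), initialized to 0

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```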

Page 30: Reinforcement Learning Lecture Temporal Difference Learning

SARSA's Q-values converge w.p.1 to the optimal values as long as

• the learning rates satisfy Σ_t α_t(s, a) = ∞ and Σ_t α_t^2(s, a) < ∞,

• the policies π_t(s, a) derived from Q_t(s, a) are GLIE policies.

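For example, the common schedule α_t(s, a) = 1/t satisfies both conditions; a quick check (added here, not on the slide):

```latex
\sum_{t=1}^{\infty} \frac{1}{t} = \infty \quad (\text{the harmonic series diverges}),
\qquad
\sum_{t=1}^{\infty} \frac{1}{t^{2}} = \frac{\pi^{2}}{6} < \infty .
```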

Page 31: Reinforcement Learning Lecture Temporal Difference Learning

SARSA on Windy Gridworld Example

• The reward is −1.0 on every transition until the terminal goal state (cell G) is reached.

• Each move is shifted upward by the wind; the wind strength (number of cells shifted upwards) is given below each column.


Page 32: Reinforcement Learning Lecture Temporal Difference Learning

• The y-axis shows the cumulative number of times the goal was reached.


Page 33: Reinforcement Learning Lecture Temporal Difference Learning

Off-Policy TD Control: Q-Learning


Page 34: Reinforcement Learning Lecture Temporal Difference Learning

Off-Policy TD Control: Q-Learning

Off-Policy MC?

• Importance sampling to estimate an expectation of returns under the target trajectory distribution P_π using samples drawn from the behavior distribution P_µ:

E_{τ∼P_π}[ρ(τ)] = ∫ ρ(τ) P_π(τ) dτ
              = ∫ ρ(τ) (P_π(τ) / P_µ(τ)) P_µ(τ) dτ
              = E_{τ∼P_µ}[ρ(τ) P_π(τ) / P_µ(τ)]
              ≈ (1/N) Σ_{i=1}^{N} ρ(τ_i) P_π(τ_i) / P_µ(τ_i),   with τ_i ∼ P_µ

• Denote by P_µ(τ) the trajectory distribution under a policy µ, where τ = {s_0, a_0, s_1, a_1, ...}:

P_µ(τ) = P_0(s_0) ∏_t P(s_{t+1}|s_t, a_t) µ(a_t|s_t)

(P_π(τ) is defined analogously for the target policy π.)

Page 35: Reinforcement Learning Lecture Temporal Difference Learning

• Assume that in MC control the behavior policy used to generate data is µ(a|s) (i.e. P_µ(τ)).

• The target policy is π(a|s) (i.e. P_π(τ)).

• Set the importance weights to

w_t = P_π(τ_t) / P_µ(τ_t) = ∏_{i=t}^{T} π(a_i|s_i) / µ(a_i|s_i)

(the transition probabilities cancel, so only the policy ratios remain).

• The MC value update becomes (when observing a return ρ_t):

V(s_t) ← V(s_t) + α (w_t ρ_t − V(s_t))

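A sketch of computing the weight w_t from a recorded trajectory and applying the weighted MC update; pi_prob(a, s) and mu_prob(a, s), returning π(a|s) and µ(a|s), are assumed helper functions.

```python
def importance_weight(trajectory, t, pi_prob, mu_prob):
    """w_t = prod_{i=t}^{T} pi(a_i|s_i) / mu(a_i|s_i); the transition terms cancel."""
    w = 1.0
    for s_i, a_i in trajectory[t:]:                   # trajectory: list of (s_i, a_i) pairs
        w *= pi_prob(a_i, s_i) / mu_prob(a_i, s_i)    # requires mu(a|s) > 0 wherever pi(a|s) > 0
    return w

def off_policy_mc_update(V, s_t, G_t, w_t, alpha=0.1):
    """V(s_t) <- V(s_t) + alpha * (w_t * G_t - V(s_t))."""
    V[s_t] = V[s_t] + alpha * (w_t * G_t - V[s_t])
```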

Page 36: Reinforcement Learning Lecture Temporal Difference Learning

Off-Policy TD Control: Q-Learning

Off-Policy TD?

• The TD target ρ_t = r_t + γ V(s_{t+1}) is corrected by importance sampling, using only the single-step ratio π(a_t|s_t)/µ(a_t|s_t).

• The TD value update becomes (given a transition (s_t, a_t, r_t, s_{t+1})):

V(s_t) ← V(s_t) + α ( (π(a_t|s_t) / µ(a_t|s_t)) (r_t + γ V(s_{t+1})) − V(s_t) )

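The corresponding one-step off-policy TD update, again with the assumed pi_prob/mu_prob helpers:

```python
def off_policy_td_update(V, s, a, r, s_next, pi_prob, mu_prob, alpha=0.1, gamma=0.9):
    """Importance-weight only the one-step TD target with rho = pi(a|s) / mu(a|s)."""
    rho = pi_prob(a, s) / mu_prob(a, s)
    V[s] = V[s] + alpha * (rho * (r + gamma * V[s_next]) - V[s])
```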

Page 37: Reinforcement Learning Lecture Temporal Difference Learning

Off-Policy TD Control: Q-Learning

• The target policy is greedy: π(s_t) = argmax_a Q(s_t, a)

• The control policy is µ, e.g. ε-greedy w.r.t. Q(s, a)


Page 38: Reinforcement Learning Lecture Temporal Difference Learning

Off-Policy TD Control: Q-Learning

• Q-learning (Watkins, 1988): given a new experience (s, a, r, s'),

Q_new(s, a) = (1 − α) Q_old(s, a) + α [r + γ max_{a'} Q_old(s', a')]
            = Q_old(s, a) + α [r + γ max_{a'} Q_old(s', a') − Q_old(s, a)]

• Reinforcement:
– more reward than expected (r > Q_old(s, a) − γ max_{a'} Q_old(s', a')) → increase Q(s, a)
– less reward than expected (r < Q_old(s, a) − γ max_{a'} Q_old(s', a')) → decrease Q(s, a)

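A Python sketch of Q-learning in a control loop: behave ε-greedily, but bootstrap with max_{a'} Q(s', a') (the greedy target policy). The environment interface is again a hypothetical stand-in.

```python
from collections import defaultdict
import random

def q_learning(env, actions, n_episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Off-policy TD control sketch: epsilon-greedy behavior, greedy learning target."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # behavior policy mu: epsilon-greedy w.r.t. the current Q
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # target policy is greedy: bootstrap with max_a' Q(s', a')
            target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```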

Page 39: Reinforcement Learning Lecture Temporal Difference Learning

Q-Learning

(Introduction to RL, Sutton & Barto 1998)


Page 40: Reinforcement Learning Lecture Temporal Difference Learning

Q-learning convergence with prob 1

• Q-learning is a stochastic approximation of Q-Iteration:

Q-learning:  Q_new(s, a) = (1 − α) Q_old(s, a) + α [r + γ max_{a'} Q_old(s', a')]
Q-Iteration: ∀ s, a:  Q_{k+1}(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} Q_k(s', a')

We've shown convergence of Q-Iteration to Q*.

• Convergence of Q-learning:
Q-Iteration is a deterministic update: Q_{k+1} = T(Q_k)
Q-learning is a stochastic version:    Q_{k+1} = (1 − α) Q_k + α [T(Q_k) + η_k]

η_k is zero-mean noise!


Page 41: Reinforcement Learning Lecture Temporal Difference Learning

Q-learning convergence with prob. 1
The Q-learning algorithm converges w.p.1 as long as the learning rates satisfy

Σ_t α_t(s, a) = ∞  and  Σ_t α_t^2(s, a) < ∞

(Watkins and Dayan, ’Q-learning’. Machine Learning 1992)


Page 42: Reinforcement Learning Lecture Temporal Difference Learning

Q-learning vs. SARSA: The Cliff Example


Page 43: Reinforcement Learning Lecture Temporal Difference Learning

Q-Learning impact

• Q-Learning was the first provably convergent direct adaptive optimal control algorithm.

• Great impact on the field of Reinforcement Learning:
– smaller representation than models
– automatically focuses attention where it is needed, i.e., no sweeps through state space
– though it does not solve the exploration versus exploitation issue
– ε-greedy, optimistic initialization, etc.


Page 44: Reinforcement Learning Lecture Temporal Difference Learning

Unified View

(Introduction to RL, Sutton & Barto 1998)

Page 45: Reinforcement Learning Lecture Temporal Difference Learning

Eligibility traces

• Temporal Difference: based on a single experience (s_0, r_0, s_1)

V_new(s_0) = V_old(s_0) + α [r_0 + γ V_old(s_1) − V_old(s_0)]

• Longer experience sequence, e.g. (s_0, r_0, r_1, r_2, s_3):

temporal credit assignment, think further backwards: receiving r_2 also tells us something about V(s_0)

V_new(s_0) = V_old(s_0) + α [r_0 + γ r_1 + γ^2 r_2 + γ^3 V_old(s_3) − V_old(s_0)]



Page 47: Reinforcement Learning Lecture Temporal Difference Learning

Eligibility trace

• The n-step TD update of the value estimate

ΔV_t(s_t) = α [R^n_t − V_t(s_t)]

where R^n_t is the n-step return

R^n_t = r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n V_t(s_{t+n})

• The offline value update up to time T

V(s) ← V(s) + Σ_{t=0}^{T−1} ΔV_t(s), ∀s

• Error reduction property of the n-step backup

max_s |V^n_t(s) − V^π(s)| ≤ γ^n max_s |V_t(s) − V^π(s)|

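A sketch of computing R^n_t from a recorded episode; rewards is the list r_0, ..., r_{T−1}, states the list s_0, ..., s_T, and the sketch assumes t + n stays within the episode (near the end, the bootstrap term would simply be dropped).

```python
def n_step_return(rewards, t, n, V, states, gamma=0.9):
    """R^n_t = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n})."""
    G = 0.0
    for k in range(n):
        G += (gamma ** k) * rewards[t + k]            # discounted rewards for n steps
    G += (gamma ** n) * V[states[t + n]]              # bootstrap from the value estimate
    return G
```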

Page 48: Reinforcement Learning Lecture Temporal Difference Learning

TD(λ): Forward View

• TD(λ) is a weighted average of n-step backups over all n.

• Look into the future, compute the n-step returns, and average them with geometric weights:

R^λ_t = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R^n_t

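Building on the n_step_return sketch above, a λ-return for an episodic task: the weights (1 − λ)λ^{n−1} cover the n-step returns that bootstrap before the episode ends, and the remaining weight λ^{T−t−1} falls on the full Monte-Carlo return (a standard identity for finite episodes, stated here as an assumption of episodic tasks).

```python
def lambda_return(rewards, t, V, states, lam=0.9, gamma=0.9):
    """Forward-view TD(lambda) target R^lambda_t for an episode of length T = len(rewards)."""
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):                         # n-step returns that bootstrap in-episode
        G_lambda += (1 - lam) * (lam ** (n - 1)) * n_step_return(rewards, t, n, V, states, gamma)
    # the rest of the geometric weight goes to the complete Monte-Carlo return
    G_full = sum((gamma ** k) * rewards[t + k] for k in range(T - t))
    G_lambda += (lam ** (T - t - 1)) * G_full
    return G_lambda
```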

Page 49: Reinforcement Learning Lecture Temporal Difference Learning

TD(λ): Forward View

• TD(λ) is a weighted average of n-step backups over all n:

R^λ_t = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R^n_t


Page 50: Reinforcement Learning Lecture Temporal Difference Learning

TD(λ): Forward View

• 19-State Random Walk Task.


Page 51: Reinforcement Learning Lecture Temporal Difference Learning

TD(λ): Backward View

• At each step, the eligibility traces of all states are updated:

e_t(s) = γ λ e_{t−1}(s)          if s ≠ s_t
e_t(s) = γ λ e_{t−1}(s) + 1      if s = s_t

(an accumulating trace: decay everywhere, increment the visited state)


Page 52: Reinforcement Learning Lecture Temporal Difference Learning

TD(λ): Backward View

• TD(λ): remember where you've been recently (the "eligibility trace") and update those values as well:

e(s_t) ← e(s_t) + 1
∀s : V_new(s) = V_old(s) + α e(s) [r_t + γ V_old(s_{t+1}) − V_old(s_t)]
∀s : e(s) ← γ λ e(s)

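A sketch of one backward-view TD(λ) step with an accumulating trace, as on this slide; storing V and the traces e as dicts (and updating only states with a nonzero trace) is an implementation choice, not part of the slide.

```python
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.9):
    """Backward-view TD(lambda) for one transition (s, r, s')."""
    e[s] = e.get(s, 0.0) + 1.0                               # e(s_t) <- e(s_t) + 1
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # TD error of this transition
    for state, trace in list(e.items()):
        V[state] = V.get(state, 0.0) + alpha * trace * delta # update every traced state
        e[state] = gamma * lam * trace                       # then decay its trace
    return delta
```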


Page 54: Reinforcement Learning Lecture Temporal Difference Learning

TD(λ): Backward View


Page 55: Reinforcement Learning Lecture Temporal Difference Learning

TD(λ): Backward vs. Forward

• The two views provide equivalent offline updates (see the proof in Section 7.4 of the Introduction to RL book, Sutton & Barto).


Page 56: Reinforcement Learning Lecture Temporal Difference Learning

SARSA(λ)


Page 57: Reinforcement Learning Lecture Temporal Difference Learning

SARSA(λ): Example


Page 58: Reinforcement Learning Lecture Temporal Difference Learning

Q(λ)

• The n-step return

R^n_t = r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n max_a Q_t(s_{t+n}, a)

• Q(λ) algorithm by Watkins

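A heavily hedged sketch of how Watkins's Q(λ) combines these pieces: eligibility traces on (s, a) pairs, the greedy (Q-learning) target, and traces cut to zero whenever the next action actually taken is exploratory. The details (accumulating traces, the cut rule) follow Sutton & Barto's description rather than anything stated on this slide.

```python
def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, actions,
                          alpha=0.1, gamma=0.9, lam=0.9):
    """One transition of Watkins's Q(lambda) (sketch, tabular dict-based Q and traces e)."""
    a_greedy = max(actions, key=lambda a_: Q.get((s_next, a_), 0.0))
    delta = r + gamma * Q.get((s_next, a_greedy), 0.0) - Q.get((s, a), 0.0)
    e[(s, a)] = e.get((s, a), 0.0) + 1.0                  # accumulate trace for the visited pair
    for sa, trace in list(e.items()):
        Q[sa] = Q.get(sa, 0.0) + alpha * trace * delta
        if a_next == a_greedy:
            e[sa] = gamma * lam * trace                   # next action is greedy: decay traces
        else:
            e[sa] = 0.0                                   # exploratory action: cut all traces
    return delta
```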