Reinforcement Learning Lecture: Temporal Difference Learning


Reinforcement Learning

Temporal Difference Learning

Temporal difference learning, TD prediction, Q-learning, eligibility traces. (Many slides from Marc Toussaint.)

Vien Ngo, MLR, University of Stuttgart

Outline

• Temporal Difference Learning

• Q-learning

• Eligibility Traces

2/??

Learning in MDPs

• Assume an unknown MDP {S, A, ·, ·, γ} (the agent does not know P and R).

• While interacting with the world, the agent collects data of the form D = {(s_t, a_t, r_t, s_{t+1})}_{t=1}^H

(state, action, immediate reward, next state)

• What could we learn from that?
– learn to predict the next state: P(s'|s, a)
– learn to predict the immediate reward: P(r|s, a) (or R(s, a))
– learn to predict values: (s, a) ↦ Q(s, a)
– learn to predict actions: π(s, a)

3/??


Model-based versus Model-free

Edward Tolman (1886-1959), Purposive Behavior in Animals and Men
(stimulus-stimulus, non-reinforcement driven)

Wolfgang Köhler (1887-1967): animals learn facts about the world that they can subsequently use in a flexible manner, rather than simply learning automatic responses.

Clark Hull (1884-1952), Principles of Behavior (1943): animals learn stimulus-response mappings based on reinforcement.

4/??


Introduction to Model-Free Methods

– Monte-Carlo Methods, On-Policy MC Control

– Temporal Difference Learning, On-Policy SARSA Algorithm

– Off-Policy Q-Learning Algorithm

7/??

Monte-Carlo Policy Evaluation
– MC policy evaluation: first-visit and every-visit methods

Algorithm 1 MC-PE: policy evaluation of policy π
1: Given π; Returns(s) = ∅, ∀s ∈ S
2: while (!converged) do
3:   Generate an episode τ = (s_0, a_0, r_0, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T) using π
4:   for each state s_t in τ do
5:     Compute the return of s_t: R(s_t) = ∑_{l=t}^{T-1} γ^{l-t} r_l
6:     Either: add R(s_t) to Returns(s_t) only for the first occurrence of s_t in τ (first-visit),
       or add R(s_t) to Returns(s_t) for every occurrence (every-visit)
7: Return V^π(s) = average(Returns(s))

– Converges as the number of visits to every state s goes to infinity.

(Introduction to RL, Sutton & Barto 1998)

8/??

Monte-Carlo Policy Evaluation

• Alternatively, use incremental updates for MC-PE whenever a return R(s_t) is obtained:

V^π(s_t) ← V^π(s_t) + α_t (R(s_t) − V^π(s_t))

9/??
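Below is a minimal Python sketch of MC policy evaluation combining Algorithm 1 (first-visit) with the incremental update above. The `env` and `policy` objects are hypothetical placeholders for illustration: `policy(s)` returns an action, `env.reset()` a start state, and `env.step(a)` a `(next_state, reward, done)` tuple.

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, gamma=0.99, alpha=0.05, n_episodes=1000):
    """First-visit MC prediction with incremental value updates (a sketch)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        # generate an episode tau = (s0, a0, r0, ..., s_T) under the policy pi
        s, done, episode = env.reset(), False, []
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            episode.append((s, r))
            s = s_next
        # compute returns backwards; keep only the return of the first visit
        G, first_visit_return = 0.0, {}
        for t in reversed(range(len(episode))):
            s_t, r_t = episode[t]
            G = r_t + gamma * G
            first_visit_return[s_t] = G      # the earliest visit overwrites later ones
        # incremental update: V(s) <- V(s) + alpha (R(s) - V(s))
        for s_t, G_t in first_visit_return.items():
            V[s_t] += alpha * (G_t - V[s_t])
    return V
```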

Example: MC Method for Blackjack

• The state is 3-dimensional: the current sum (12-21), the dealer's one showing card (ace-10), and whether the player has a usable ace (1/0).

• Actions: stick, hit

• Rewards:

R(·, stick) =
+1.0 if the current sum is better than the dealer's
0.0 if the current sum equals the dealer's
−1.0 if the current sum is worse than the dealer's

R(·, hit) =
−1.0 if the current sum exceeds 21
0.0 otherwise

• If the current sum is below 12, the player automatically hits.

10/??

Example: MC Method for Blackjack

Approximate value functions for the policy that sticks only on 20 or 21.

11/??

On-Policy MC Control Algorithm
Generalized policy iteration framework

Algorithm 2 On-policy MC Control Algorithm
1: Init an initial policy π_0
2: while (!converged) do
3:   Policy Evaluation: π_k → Q^{π_k}(s, a), ∀s, a
4:   Policy Improvement: greedy action selection
     π_{k+1}(s) = argmax_a Q^{π_k}(s, a), ∀s

12/??

ε-Greedy Policy
• The ε-greedy policy chooses actions with probability

π(a|s) = ε/|A| + 1 − ε   if a = argmax_{a'∈A} Q(s, a')
π(a|s) = ε/|A|           otherwise

• Policy improvement: the new ε-greedy policy π_{k+1}, constructed from the value function Q^{π_k}(s, a) of the previous ε-greedy policy π_k, satisfies

Q^{π_k}(s, π_{k+1}(s)) = ∑_a π_{k+1}(a|s) Q^{π_k}(s, a)
  = (ε/|A|) ∑_a Q^{π_k}(s, a) + (1 − ε) max_a Q^{π_k}(s, a)
  ≥ (ε/|A|) ∑_a Q^{π_k}(s, a) + (1 − ε) ∑_a [(π_k(a|s) − ε/|A|) / (1 − ε)] Q^{π_k}(s, a)
  = ∑_a π_k(a|s) Q^{π_k}(s, a) = V^{π_k}(s)

The inequality holds because the weights (π_k(a|s) − ε/|A|)/(1 − ε) are non-negative and sum to one, so their weighted average cannot exceed the maximum.

• Therefore V^{π_{k+1}}(s) ≥ V^{π_k}(s)

13/??
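A minimal Python sketch of the ε-greedy selection above, assuming a tabular `Q` stored as a dict over `(state, action)` pairs and a finite action list `actions` (hypothetical names):

```python
import random

def epsilon_greedy_action(Q, s, actions, eps=0.1):
    """Pick the greedy action w.r.t. Q with prob. 1 - eps, otherwise a uniform action.

    This gives every action probability eps/|A|, and the greedy action the
    additional 1 - eps, matching the definition on the slide.
    """
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```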


MC Control Algorithm with ε-Greedy

• Choose actions using ε-greedy.

• Converges if ε_k decreases to zero over time, e.g. ε_k = 1/k (a GLIE policy); a sketch of such a schedule is given after this slide.

• A policy is GLIE (Greedy in the Limit with Infinite Exploration) if
  • all state-action pairs are visited infinitely often, and
  • in the limit (k → ∞) the policy becomes greedy:

lim_{k→∞} π_k(a|s) = δ_a(argmax_{a'} Q^{π_k}(s, a'))

where δ is the Dirac delta function.

14/??
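A tiny sketch of such a GLIE schedule (an illustrative assumption, not from the slides), to be paired with the ε-greedy selection sketched earlier:

```python
def glie_epsilon(k):
    """eps_k = 1/k: decays to zero while every action keeps nonzero probability."""
    return 1.0 / max(k, 1)

# usage (hypothetical): a = epsilon_greedy_action(Q, s, actions, eps=glie_epsilon(episode))
```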

MC Control Algorithm: Blackjack

15/??

Temporal difference (TD) learning

• TD Prediction (TD(0))

• On-policy TD Control (SARSA Algorithm)

• Off-policy TD Control (Q-Learning)

• Eligibility Traces (TD(λ))

16/??

Temporal difference (TD) Prediction
Recall

V^π(s) = E[ R(π(s), s) + γ V^π(s') ]

• TD learning: given a new experience (s, a, r, s'),

V_new(s) = (1 − α) V_old(s) + α [ r + γ V_old(s') ]        (TD target: r + γ V_old(s'))
         = V_old(s) + α [ r + γ V_old(s') − V_old(s) ]     (TD error: r + γ V_old(s') − V_old(s))

• Reinforcement:
– more reward than expected (r > V_old(s) − γ V_old(s')) → increase V(s)
– less reward than expected (r < V_old(s) − γ V_old(s')) → decrease V(s)

17/??
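A minimal Python sketch of the TD(0) update above for a single experience (s, a, r, s'); `V` is a hypothetical dict of state values:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    td_target = r + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```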


TD Prediction vs. MC Prediction

Figure: MC backup diagram

Figure: TD backup diagram

18/??

TD Prediction vs. MC Prediction
Driving Home Example

State                        | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time
leaving office, friday at 6  | 0                      | 30                   | 30
reach car, raining           | 5                      | 35                   | 40
exiting highway              | 20                     | 15                   | 35
2ndary road, behind truck    | 30                     | 10                   | 40
entering home street         | 40                     | 3                    | 43
arrive home                  | 43                     | 0                    | 43

19/??

TD Prediction vs. MC Prediction

Figure: Monte-Carlo method vs. TD(0)

(Introduction to RL, Sutton & Barto 1998)

20/??

TD Prediction vs. MC Prediction

TD:
• TD can learn before the termination of episodes.
• TD can be used for both non-episodic and episodic tasks.
• The update depends on a single stochastic transition ⇒ lower variance.
• Updates use bootstrapping ⇒ the estimate has some bias.
• TD updates exploit the Markov property.

MC:
• MC learning must wait until the end of an episode.
• MC only works for episodic tasks.
• The update depends on a sequence of many stochastic transitions ⇒ much larger variance.
• Unbiased estimate.
• MC updates do not exploit the Markov property, hence MC can be effective in non-Markov environments.

21/??

TD vs. MC: A Random Walk Example

22/??

On-Policy TD Control: SARSA

23/??

On-Policy TD Control: SARSA

Figure: Learning on the tuple (s, a, r, s', a'): SARSA

• Q-value updates:

Q_{t+1}(s, a) = Q_t(s, a) + α ( r_t + γ Q_t(s', a') − Q_t(s, a) )

24/??

On-Policy TD Control: SARSA

Algorithm 3 SARSA Algorithm
1: Init Q(s, a), ∀s ∈ S, a ∈ A
2: while (!converged) do
3:   Init a starting state s
4:   Select an action a from s using a policy derived from Q (e.g. ε-greedy)
5:   for each step of the episode do
6:     Execute a, observe r, s'
7:     Select an action a' from s' using a policy derived from Q (e.g. ε-greedy)
8:     Update: Q_{t+1}(s, a) = Q_t(s, a) + α_t ( r + γ Q_t(s', a') − Q_t(s, a) )
9:     s ← s'; a ← a'

(A Python sketch of this loop is given after this slide.)

25/??
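The promised sketch of the SARSA loop, under the same hypothetical `env` interface as before (`reset() -> state`, `step(a) -> (next_state, reward, done)`) and a finite action list:

```python
import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular SARSA following Algorithm 3 (a sketch, not a tuned implementation)."""
    Q = defaultdict(float)                      # Q[(s, a)]

    def behave(s):                              # epsilon-greedy policy derived from Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = behave(s)
        while not done:                         # for each step of the episode
            s_next, r, done = env.step(a)
            a_next = behave(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```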

SARSA's Q-values converge w.p.1 to the optimal values as long as

• the learning rates satisfy

∑_t α_t(s, a) = ∞  and  ∑_t α_t²(s, a) < ∞

• the policies π_t(s, a) derived from Q_t(s, a) are GLIE policies.

26/??

SARSA on Windy Gridworld Example

• The reward is −1.0 on every transition until the terminal goal state (cell G) is reached.

• Each move is additionally shifted upward by the wind; the wind strength, in number of cells shifted upward, is given below each column.

27/??

• The y-axis shows the cumulative number of goal reachings (completed episodes).

28/??

Off-Policy TD Control: Q-Learning

29/??

Off-Policy TD Control: Q-Learning

Off-Policy MC?

• Importance sampling to estimate the expectation of the return ρ(τ) under the target trajectory distribution P_π(τ), using trajectories sampled from a behavior distribution P_µ(τ):

E_{τ∼P_π}[ ρ(τ) ] = ∫ ρ(τ) P_π(τ) dτ
                  = ∫ ρ(τ) (P_π(τ) / P_µ(τ)) P_µ(τ) dτ
                  = E_{τ∼P_µ}[ ρ(τ) P_π(τ) / P_µ(τ) ]
                  ≈ (1/N) ∑_{i=1}^N ρ(τ_i) P_π(τ_i) / P_µ(τ_i),   τ_i ∼ P_µ

• Denote by P_π(τ) the trajectory distribution under a policy π, where τ = {s_0, a_0, s_1, a_1, ...}:

P_π(τ) = P_0(s_0) ∏_t P(s_{t+1}|s_t, a_t) π(a_t|s_t)

(and analogously P_µ(τ) for a behavior policy µ).

30/??

• Assume that in off-policy MC control the behavior policy used to generate data is µ(a|s) (trajectory distribution P_µ(τ)).

• The target policy is π(a|s) (trajectory distribution P_π(τ)).

• Set the importance weights as

w_t = P_π(τ_t) / P_µ(τ_t) = ∏_{i=t}^{T} π(a_i|s_i) / µ(a_i|s_i)

• The MC value update becomes (when observing a return ρ_t):

V(s_t) ← V(s_t) + α ( w_t ρ_t − V(s_t) )

31/??
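A short sketch of the trajectory importance weight and the weighted MC update above. `pi_prob(a, s)` and `mu_prob(a, s)` are hypothetical callables returning the action probabilities of the target and behavior policies:

```python
def importance_weight(trajectory, pi_prob, mu_prob, t):
    """w_t = prod_{i=t..T} pi(a_i|s_i) / mu(a_i|s_i) over the remaining (s, a) pairs."""
    w = 1.0
    for s_i, a_i in trajectory[t:]:
        w *= pi_prob(a_i, s_i) / mu_prob(a_i, s_i)
    return w

def off_policy_mc_update(V, s_t, return_t, w_t, alpha=0.05):
    """V(s_t) <- V(s_t) + alpha * (w_t * rho_t - V(s_t))."""
    v = V.get(s_t, 0.0)
    V[s_t] = v + alpha * (w_t * return_t - v)
```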

Off-Policy TD Control: Q-Learning

Off-Policy TD?

• The term ρ = r_t + γ V(s_{t+1}) is reweighted by importance sampling; only the single-step ratio π(a_t|s_t)/µ(a_t|s_t) is needed.

• The TD value update becomes (given a transition (s_t, a_t, r_t, s_{t+1})):

V(s_t) ← V(s_t) + α ( (π(a_t|s_t) / µ(a_t|s_t)) (r_t + γ V(s_{t+1})) − V(s_t) )

32/??
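The single-transition analogue, under the same hypothetical `pi_prob`/`mu_prob` callables as above:

```python
def off_policy_td_update(V, s, a, r, s_next, pi_prob, mu_prob, alpha=0.1, gamma=0.99):
    """Importance-sampled TD(0) update for one transition (s, a, r, s')."""
    rho = pi_prob(a, s) / mu_prob(a, s)          # pi(a|s) / mu(a|s)
    v = V.get(s, 0.0)
    V[s] = v + alpha * (rho * (r + gamma * V.get(s_next, 0.0)) - v)
```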

Off-Policy TD Control: Q-Learning

• The target policy is greedy: π(st) = arg maxaQ(st, a)

• The control (behavior) policy is µ, e.g. ε-greedy w.r.t. Q(s, a).

33/??

Off-Policy TD Control: Q-Learning

• Q-learning (Watkins, 1989): given a new experience (s, a, r, s'),

Q_new(s, a) = (1 − α) Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') ]
            = Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') − Q_old(s, a) ]

• Reinforcement:
– more reward than expected (r > Q_old(s, a) − γ max_{a'} Q_old(s', a')) → increase Q(s, a)
– less reward than expected (r < Q_old(s, a) − γ max_{a'} Q_old(s', a')) → decrease Q(s, a)

34/??
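A minimal sketch of the tabular Q-learning update above for one experience (s, a, r, s'); `Q` is a dict over `(state, action)` pairs and `actions` the finite action set (hypothetical names):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return td_error
```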

Q-Learning

(Introduction to RL, Sutton & Barto 1998)

35/??

Q-learning convergence with prob 1

• Q-learning is a stochastic approximation of Q-Iteration:

Q-learning:  Q_new(s, a) = (1 − α) Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') ]
Q-Iteration: ∀s, a:  Q_{k+1}(s, a) = R(s, a) + γ ∑_{s'} P(s'|s, a) max_{a'} Q_k(s', a')

We have already shown convergence of Q-Iteration to Q*.

• Convergence of Q-learning:
Q-Iteration is a deterministic update:  Q_{k+1} = T(Q_k)
Q-learning is a stochastic version:     Q_{k+1} = (1 − α) Q_k + α [ T(Q_k) + η_k ]

η_k is zero-mean noise.

36/??

Q-learning convergence with prob. 1
The Q-learning algorithm converges w.p.1 as long as the learning rates satisfy

∑_t α_t(s, a) = ∞  and  ∑_t α_t²(s, a) < ∞

(Watkins and Dayan, 'Q-learning'. Machine Learning, 1992)

37/??

Q-learning vs. SARSA: The Cliff Example

38/??

Q-Learning impact

• Q-Learning was the first provably convergent direct adaptive optimal control algorithm.

• Great impact on the field of Reinforcement Learning:
– smaller representation than models
– automatically focuses attention where it is needed, i.e., no sweeps through state space
– though it does not solve the exploration versus exploitation issue
– ε-greedy, optimistic initialization, etc.

39/??

Unified View

(Introduction to RL, Sutton & Barto 1998)

40/??

Eligibility traces

• Temporal Difference: based on a single experience (s_0, r_0, s_1)

V_new(s_0) = V_old(s_0) + α [ r_0 + γ V_old(s_1) − V_old(s_0) ]

• Longer experience sequence, e.g. (s_0, r_0, r_1, r_2, s_3):

temporal credit assignment, think further backwards: receiving the later rewards r_1 and r_2 also tells us something about V(s_0)

V_new(s_0) = V_old(s_0) + α [ r_0 + γ r_1 + γ² r_2 + γ³ V_old(s_3) − V_old(s_0) ]

41/??


Eligibility traces

• The n-step TD update

V_t(s_t) ← V_t(s_t) + α [ R_t^n − V_t(s_t) ]

where R_t^n is the n-step return

R_t^n = r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n V_t(s_{t+n})

• The offline value update accumulates the increments ΔV_t(s) up to time T:

V(s) ← V(s) + ∑_{t=0}^{T−1} ΔV_t(s)

• Error reduction property:

max_s | E[ R_t^n | s_t = s ] − V^π(s) | ≤ γ^n max_s | V_t(s) − V^π(s) |

42/??
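A sketch of the n-step return R_t^n from this slide, assuming an episode stored as lists `rewards = [r_0, ..., r_{T-1}]` and `states = [s_0, ..., s_T]` (hypothetical names):

```python
def n_step_return(rewards, states, V, t, n, gamma=0.99):
    """R_t^n = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n V(s_{t+n})."""
    G, discount = 0.0, 1.0
    end = min(t + n, len(rewards))               # truncate at the end of the episode
    for k in range(t, end):
        G += discount * rewards[k]
        discount *= gamma
    if t + n < len(states):                      # bootstrap only if s_{t+n} exists;
        G += discount * V.get(states[t + n], 0.0)  # V(terminal) assumed 0 (absent from dict)
    return G
```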

TD(λ): Forward View

• TD(λ) averages the n-step backups over different n.

• Look into the future, compute the n-step returns (MC-style evaluation for each n), then take their weighted average:

R_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} R_t^n

43/??
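A sketch of the forward-view λ-return for a finite episode, reusing the hypothetical `n_step_return` helper sketched after the n-step TD slide above; the remaining weight λ^{T−t−1} goes to the complete return:

```python
def lambda_return(rewards, states, V, t, lam=0.9, gamma=0.99):
    """R_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^{n-1} R_t^n + lam^{T-t-1} R_t."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):                    # finite n-step returns
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, states, V, t, n, gamma)
    # the remaining weight goes to the complete (Monte-Carlo) return
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lam
```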

TD(λ): Forward View

• TD(λ) averages the n-step backups over different n:

R_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} R_t^n

44/??

TD(λ): Forward View

• 19-State Random Walk Task.

45/??

TD(λ): Backward View

• At each step, the eligibility traces of all states are updated:

e_t(s) = γ λ e_{t−1}(s)         if s ≠ s_t
e_t(s) = γ λ e_{t−1}(s) + 1     if s = s_t

46/??

TD(λ): Backward View
• TD(λ): remember where you have been recently (the "eligibility trace") and update those values as well:

e(s_t) ← e(s_t) + 1
∀s: V_new(s) = V_old(s) + α e(s) [ r_t + γ V_old(s_{t+1}) − V_old(s_t) ]
∀s: e(s) ← γ λ e(s)

47/??
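A minimal sketch of this backward-view TD(λ) update for one transition (s_t, r_t, s_{t+1}); `V` and `e` are hypothetical dicts over states (reset `e` at the start of each episode):

```python
def td_lambda_update(V, e, s, r, s_next, alpha=0.1, gamma=0.99, lam=0.9):
    """Accumulating-trace TD(lambda): one online update over all eligible states."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # TD error
    e[s] = e.get(s, 0.0) + 1.0                               # e(s_t) <- e(s_t) + 1
    for x in list(e):
        V[x] = V.get(x, 0.0) + alpha * e[x] * delta          # V(x) <- V(x) + alpha e(x) delta
        e[x] *= gamma * lam                                  # e(x) <- gamma lambda e(x)
```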


TD(λ): Backward View

48/??

TD(λ): Backward vs. Forward

• The two views yield equivalent offline updates (see the proof in Section 7.4 of Introduction to RL, Sutton & Barto).

49/??

SARSA(λ)

50/??

SARSA(λ): Example

51/??

Q(λ)

• The n-step return:

r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n max_a Q_t(s_{t+n}, a)

• Q(λ) algorithm by Watkins.

52/??