Reinforcement Learning Lecture: Temporal Difference Learning


Reinforcement Learning

Temporal Difference Learning

Temporal difference learning, TD prediction, Q-learning, eligibility traces. (Many slides from Marc Toussaint.)

Vien Ngo, MLR, University of Stuttgart

Outline

• Temporal Difference Learning

• Q-learning

• Eligibility Traces

2/??

Learning in MDPs

• Assume an unknown MDP {S, A, ·, ·, γ} (the agent does not know P and R).

• While interacting with the world, the agent collects data of the form D = {(s_t, a_t, r_t, s_{t+1})}_{t=1}^H

(state, action, immediate reward, next state)

• What could we learn from that?
– learn to predict the next state: P(s'|s, a)
– learn to predict the immediate reward: P(r|s, a) (or R(s, a))
– learn to predict values: (s, a) ↦ Q(s, a)
– learn to predict actions: π(s, a)

3/??


Model-based versus Model-free

Edward Tolman (1886-1959), Purposive Behavior in Animals and Men
(stimulus-stimulus, non-reinforcement driven)

Wolfgang Köhler (1887-1967): animals learn facts about the world that they can subsequently use in a flexible manner, rather than simply learning automatic responses.

Clark Hull (1884-1952), Principles of Behavior (1943): animals learn stimulus-response mappings based on reinforcement.

4/??


Introduction to Model-Free Methods

– Monte-Carlo Methods, On-Policy MC Control

– Temporal Difference Learning, On-Policy SARSA Algorithm

– Off-Policy Q-Learning Algorithm

7/??

Monte-Carlo Policy Evaluation
– MC policy evaluation: first-visit and every-visit methods

Algorithm 1 MC-PE: policy evaluation of policy π
1: Given π; Returns(s) = ∅, ∀s ∈ S
2: while (!converged) do
3:   Generate an episode τ = (s_0, a_0, r_0, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T) using π
4:   for each state s_t in τ do
5:     Compute the return of s_t: R(s_t) = ∑_{l=t}^{T-1} γ^{l-t} r_l
6:     Either: add R(s_t) to Returns(s_t) only for the first occurrence of s_t in τ (first-visit),
       or add R(s_t) to Returns(s_t) for every occurrence (every-visit)
7: Return V^π(s) = average(Returns(s))

– Converges as the number of visits to every state s goes to infinity.

(Introduction to RL, Sutton & Barto 1998)

8/??

Monte-Carlo Policy Evaluation

• Alternatively, use incremental updates for MC-PE whenever a return R(s_t) is obtained:

V^π(s_t) ← V^π(s_t) + α_t (R(s_t) − V^π(s_t))

9/??
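Below is a minimal Python sketch of MC policy evaluation combining Algorithm 1 (first-visit) with the incremental update above. The `env` and `policy` objects are hypothetical placeholders for illustration: `policy(s)` returns an action, `env.reset()` a start state, and `env.step(a)` a `(next_state, reward, done)` tuple.

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, gamma=0.99, alpha=0.05, n_episodes=1000):
    """First-visit MC prediction with incremental value updates (a sketch)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        # generate an episode tau = (s0, a0, r0, ..., s_T) under the policy pi
        s, done, episode = env.reset(), False, []
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            episode.append((s, r))
            s = s_next
        # compute returns backwards; keep only the return of the first visit
        G, first_visit_return = 0.0, {}
        for t in reversed(range(len(episode))):
            s_t, r_t = episode[t]
            G = r_t + gamma * G
            first_visit_return[s_t] = G      # the earliest visit overwrites later ones
        # incremental update: V(s) <- V(s) + alpha (R(s) - V(s))
        for s_t, G_t in first_visit_return.items():
            V[s_t] += alpha * (G_t - V[s_t])
    return V
```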

Example: MC Method for Blackjack

• The state is 3-dimensional: the current sum (12-21), the dealer's one showing card (ace-10), and whether the player has a usable ace (1/0).

• Actions: stick, hit

• Rewards:

R(·, stick) =
+1.0 if the current sum is better than the dealer's
0.0 if the current sum equals the dealer's
−1.0 if the current sum is worse than the dealer's

R(·, hit) =
−1.0 if the current sum exceeds 21
0.0 otherwise

• If the current sum is below 12, the player automatically hits.

10/??

Example: MC Method for Blackjack

Approximate value functions for the policy that sticks only on 20 or 21.

11/??

On-Policy MC Control Algorithm
Generalized policy iteration framework

Algorithm 2 On-policy MC Control Algorithm
1: Init an initial policy π_0
2: while (!converged) do
3:   Policy Evaluation: π_k → Q^{π_k}(s, a), ∀s, a
4:   Policy Improvement: greedy action selection
     π_{k+1}(s) = argmax_a Q^{π_k}(s, a), ∀s

12/??

ε-Greedy Policy
• The ε-greedy policy chooses actions with probability

π(a|s) = ε/|A| + 1 − ε   if a = argmax_{a'∈A} Q(s, a')
π(a|s) = ε/|A|           otherwise

• Policy improvement: the new ε-greedy policy π_{k+1}, constructed from the value function Q^{π_k}(s, a) of the previous ε-greedy policy π_k, satisfies

Q^{π_k}(s, π_{k+1}(s)) = ∑_a π_{k+1}(a|s) Q^{π_k}(s, a)
  = (ε/|A|) ∑_a Q^{π_k}(s, a) + (1 − ε) max_a Q^{π_k}(s, a)
  ≥ (ε/|A|) ∑_a Q^{π_k}(s, a) + (1 − ε) ∑_a [(π_k(a|s) − ε/|A|) / (1 − ε)] Q^{π_k}(s, a)
  = ∑_a π_k(a|s) Q^{π_k}(s, a) = V^{π_k}(s)

The inequality holds because the weights (π_k(a|s) − ε/|A|)/(1 − ε) are non-negative and sum to one, so their weighted average cannot exceed the maximum.

• Therefore V^{π_{k+1}}(s) ≥ V^{π_k}(s)

13/??
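A minimal Python sketch of the ε-greedy selection above, assuming a tabular `Q` stored as a dict over `(state, action)` pairs and a finite action list `actions` (hypothetical names):

```python
import random

def epsilon_greedy_action(Q, s, actions, eps=0.1):
    """Pick the greedy action w.r.t. Q with prob. 1 - eps, otherwise a uniform action.

    This gives every action probability eps/|A|, and the greedy action the
    additional 1 - eps, matching the definition on the slide.
    """
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```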


MC Control Algorithm with ε-Greedy

• Choose actions using ε-greedy.

• Converges if ε_k decreases to zero over time, e.g. ε_k = 1/k (a GLIE policy); a sketch of such a schedule is given after this slide.

• A policy is GLIE (Greedy in the Limit with Infinite Exploration) if
  • all state-action pairs are visited infinitely often, and
  • in the limit (k → ∞) the policy becomes greedy:

lim_{k→∞} π_k(a|s) = δ_a(argmax_{a'} Q^{π_k}(s, a'))

where δ is the Dirac delta function.

14/??
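A tiny sketch of such a GLIE schedule (an illustrative assumption, not from the slides), to be paired with the ε-greedy selection sketched earlier:

```python
def glie_epsilon(k):
    """eps_k = 1/k: decays to zero while every action keeps nonzero probability."""
    return 1.0 / max(k, 1)

# usage (hypothetical): a = epsilon_greedy_action(Q, s, actions, eps=glie_epsilon(episode))
```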

MC Control Algorithm: Blackjack

15/??

Temporal difference (TD) learning

• TD Prediction (TD(0))

• On-policy TD Control (SARSA Algorithm)

• Off-policy TD Control (Q-Learning)

• Eligibility Traces (TD(λ))

16/??

Temporal difference (TD) Prediction
Recall

V^π(s) = E[ R(π(s), s) + γ V^π(s') ]

• TD learning: given a new experience (s, a, r, s'),

V_new(s) = (1 − α) V_old(s) + α [ r + γ V_old(s') ]        (TD target: r + γ V_old(s'))
         = V_old(s) + α [ r + γ V_old(s') − V_old(s) ]     (TD error: r + γ V_old(s') − V_old(s))

• Reinforcement:
– more reward than expected (r > V_old(s) − γ V_old(s')) → increase V(s)
– less reward than expected (r < V_old(s) − γ V_old(s')) → decrease V(s)

17/??
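A minimal Python sketch of the TD(0) update above for a single experience (s, a, r, s'); `V` is a hypothetical dict of state values:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    td_target = r + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```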


TD Prediction vs. MC Prediction

Figure: MC backup diagram

Figure: TD backup diagram

18/??

TD Prediction vs. MC Prediction
Driving Home Example

State                        | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time
leaving office, friday at 6  | 0                      | 30                   | 30
reach car, raining           | 5                      | 35                   | 40
exiting highway              | 20                     | 15                   | 35
2ndary road, behind truck    | 30                     | 10                   | 40
entering home street         | 40                     | 3                    | 43
arrive home                  | 43                     | 0                    | 43

19/??

TD Prediction vs. MC Prediction

Figure: Monte-Carlo method vs. TD(0)

(Introduction to RL, Sutton & Barto 1998)

20/??

TD Prediction vs. MC Prediction

TD:
• TD can learn before the termination of episodes.
• TD can be used for both non-episodic and episodic tasks.
• The update depends on a single stochastic transition ⇒ lower variance.
• Updates use bootstrapping ⇒ the estimate has some bias.
• TD updates exploit the Markov property.

MC:
• MC learning must wait until the end of an episode.
• MC only works for episodic tasks.
• The update depends on a sequence of many stochastic transitions ⇒ much larger variance.
• Unbiased estimate.
• MC updates do not exploit the Markov property, hence MC can be effective in non-Markov environments.

21/??

TD vs. MC: A Random Walk Example

22/??

On-Policy TD Control: SARSA

23/??

On-Policy TD Control: SARSA

Figure: Learning on the tuple (s, a, r, s', a'): SARSA

• Q-value updates:

Q_{t+1}(s, a) = Q_t(s, a) + α ( r_t + γ Q_t(s', a') − Q_t(s, a) )

24/??

On-Policy TD Control: SARSA

Algorithm 3 SARSA Algorithm
1: Init Q(s, a), ∀s ∈ S, a ∈ A
2: while (!converged) do
3:   Init a starting state s
4:   Select an action a from s using a policy derived from Q (e.g. ε-greedy)
5:   for each step of the episode do
6:     Execute a, observe r, s'
7:     Select an action a' from s' using a policy derived from Q (e.g. ε-greedy)
8:     Update: Q_{t+1}(s, a) = Q_t(s, a) + α_t ( r + γ Q_t(s', a') − Q_t(s, a) )
9:     s ← s'; a ← a'

(A Python sketch of this loop is given after this slide.)

25/??
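The promised sketch of the SARSA loop, under the same hypothetical `env` interface as before (`reset() -> state`, `step(a) -> (next_state, reward, done)`) and a finite action list:

```python
import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular SARSA following Algorithm 3 (a sketch, not a tuned implementation)."""
    Q = defaultdict(float)                      # Q[(s, a)]

    def behave(s):                              # epsilon-greedy policy derived from Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = behave(s)
        while not done:                         # for each step of the episode
            s_next, r, done = env.step(a)
            a_next = behave(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```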

SARSA's Q-values converge w.p.1 to the optimal values as long as

• the learning rates satisfy

∑_t α_t(s, a) = ∞  and  ∑_t α_t²(s, a) < ∞

• the policies π_t(s, a) derived from Q_t(s, a) are GLIE policies.

26/??

SARSA on Windy Gridworld Example

• The reward is −1.0 on every transition until the terminal goal state (cell G) is reached.

• Each move is additionally shifted upward by the wind; the wind strength, in number of cells shifted upward, is given below each column.

27/??

• The y-axis shows the cumulative number of goal reachings (completed episodes).

28/??

Off-Policy TD Control: Q-Learning

29/??

Off-Policy TD Control: Q-Learning

Off-Policy MC?

• Importance sampling to estimate the expectation of the return ρ(τ) under the target trajectory distribution P_π(τ), using trajectories sampled from a behavior distribution P_µ(τ):

E_{τ∼P_π}[ ρ(τ) ] = ∫ ρ(τ) P_π(τ) dτ
                  = ∫ ρ(τ) (P_π(τ) / P_µ(τ)) P_µ(τ) dτ
                  = E_{τ∼P_µ}[ ρ(τ) P_π(τ) / P_µ(τ) ]
                  ≈ (1/N) ∑_{i=1}^N ρ(τ_i) P_π(τ_i) / P_µ(τ_i),   τ_i ∼ P_µ

• Denote by P_π(τ) the trajectory distribution under a policy π, where τ = {s_0, a_0, s_1, a_1, ...}:

P_π(τ) = P_0(s_0) ∏_t P(s_{t+1}|s_t, a_t) π(a_t|s_t)

(and analogously P_µ(τ) for a behavior policy µ).

30/??

• Assume that in off-policy MC control the behavior policy used to generate data is µ(a|s) (trajectory distribution P_µ(τ)).

• The target policy is π(a|s) (trajectory distribution P_π(τ)).

• Set the importance weights as

w_t = P_π(τ_t) / P_µ(τ_t) = ∏_{i=t}^{T} π(a_i|s_i) / µ(a_i|s_i)

• The MC value update becomes (when observing a return ρ_t):

V(s_t) ← V(s_t) + α ( w_t ρ_t − V(s_t) )

31/??
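A short sketch of the trajectory importance weight and the weighted MC update above. `pi_prob(a, s)` and `mu_prob(a, s)` are hypothetical callables returning the action probabilities of the target and behavior policies:

```python
def importance_weight(trajectory, pi_prob, mu_prob, t):
    """w_t = prod_{i=t..T} pi(a_i|s_i) / mu(a_i|s_i) over the remaining (s, a) pairs."""
    w = 1.0
    for s_i, a_i in trajectory[t:]:
        w *= pi_prob(a_i, s_i) / mu_prob(a_i, s_i)
    return w

def off_policy_mc_update(V, s_t, return_t, w_t, alpha=0.05):
    """V(s_t) <- V(s_t) + alpha * (w_t * rho_t - V(s_t))."""
    v = V.get(s_t, 0.0)
    V[s_t] = v + alpha * (w_t * return_t - v)
```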

Off-Policy TD Control: Q-Learning

Off-Policy TD?

• The term ρ = r_t + γ V(s_{t+1}) is reweighted by importance sampling; only the single-step ratio π(a_t|s_t)/µ(a_t|s_t) is needed.

• The TD value update becomes (given a transition (s_t, a_t, r_t, s_{t+1})):

V(s_t) ← V(s_t) + α ( (π(a_t|s_t) / µ(a_t|s_t)) (r_t + γ V(s_{t+1})) − V(s_t) )

32/??
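The single-transition analogue, under the same hypothetical `pi_prob`/`mu_prob` callables as above:

```python
def off_policy_td_update(V, s, a, r, s_next, pi_prob, mu_prob, alpha=0.1, gamma=0.99):
    """Importance-sampled TD(0) update for one transition (s, a, r, s')."""
    rho = pi_prob(a, s) / mu_prob(a, s)          # pi(a|s) / mu(a|s)
    v = V.get(s, 0.0)
    V[s] = v + alpha * (rho * (r + gamma * V.get(s_next, 0.0)) - v)
```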

Off-Policy TD Control: Q-Learning

• The target policy is greedy: π(st) = arg maxaQ(st, a)

• The control (behavior) policy is µ, e.g. ε-greedy w.r.t. Q(s, a).

33/??

Off-Policy TD Control: Q-Learning

• Q-learning (Watkins, 1989): given a new experience (s, a, r, s'),

Q_new(s, a) = (1 − α) Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') ]
            = Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') − Q_old(s, a) ]

• Reinforcement:
– more reward than expected (r > Q_old(s, a) − γ max_{a'} Q_old(s', a')) → increase Q(s, a)
– less reward than expected (r < Q_old(s, a) − γ max_{a'} Q_old(s', a')) → decrease Q(s, a)

34/??
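A minimal sketch of the tabular Q-learning update above for one experience (s, a, r, s'); `Q` is a dict over `(state, action)` pairs and `actions` the finite action set (hypothetical names):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return td_error
```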

Q-Learning

(Introduction to RL, Sutton & Barto 1998)

35/??

Q-learning convergence with prob 1

• Q-learning is a stochastic approximation of Q-Iteration:

Q-learning:  Q_new(s, a) = (1 − α) Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') ]
Q-Iteration: ∀s, a:  Q_{k+1}(s, a) = R(s, a) + γ ∑_{s'} P(s'|s, a) max_{a'} Q_k(s', a')

We have already shown convergence of Q-Iteration to Q*.

• Convergence of Q-learning:
Q-Iteration is a deterministic update:  Q_{k+1} = T(Q_k)
Q-learning is a stochastic version:     Q_{k+1} = (1 − α) Q_k + α [ T(Q_k) + η_k ]

η_k is zero-mean noise.

36/??

Q-learning convergence with prob. 1
The Q-learning algorithm converges w.p.1 as long as the learning rates satisfy

∑_t α_t(s, a) = ∞  and  ∑_t α_t²(s, a) < ∞

(Watkins and Dayan, 'Q-learning'. Machine Learning, 1992)

37/??

Q-learning vs. SARSA: The Cliff Example

38/??

Q-Learning impact

• Q-Learning was the first provably convergent direct adaptive optimal control algorithm.

• Great impact on the field of Reinforcement Learning:
– smaller representation than models
– automatically focuses attention where it is needed, i.e., no sweeps through state space
– though it does not solve the exploration versus exploitation issue
– ε-greedy, optimistic initialization, etc.

39/??

Unified View

(Introduction to RL, Sutton & Barto 1998)

40/??

Eligibility traces

• Temporal Difference: based on a single experience (s_0, r_0, s_1)

V_new(s_0) = V_old(s_0) + α [ r_0 + γ V_old(s_1) − V_old(s_0) ]

• Longer experience sequence, e.g. (s_0, r_0, r_1, r_2, s_3):

temporal credit assignment, think further backwards: receiving the later rewards r_1 and r_2 also tells us something about V(s_0)

V_new(s_0) = V_old(s_0) + α [ r_0 + γ r_1 + γ² r_2 + γ³ V_old(s_3) − V_old(s_0) ]

41/??


Eligibility traces

• The n-step TD update

V_t(s_t) ← V_t(s_t) + α [ R_t^n − V_t(s_t) ]

where R_t^n is the n-step return

R_t^n = r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n V_t(s_{t+n})

• The offline value update accumulates the increments ΔV_t(s) up to time T:

V(s) ← V(s) + ∑_{t=0}^{T−1} ΔV_t(s)

• Error reduction property:

max_s | E[ R_t^n | s_t = s ] − V^π(s) | ≤ γ^n max_s | V_t(s) − V^π(s) |

42/??
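A sketch of the n-step return R_t^n from this slide, assuming an episode stored as lists `rewards = [r_0, ..., r_{T-1}]` and `states = [s_0, ..., s_T]` (hypothetical names):

```python
def n_step_return(rewards, states, V, t, n, gamma=0.99):
    """R_t^n = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n V(s_{t+n})."""
    G, discount = 0.0, 1.0
    end = min(t + n, len(rewards))               # truncate at the end of the episode
    for k in range(t, end):
        G += discount * rewards[k]
        discount *= gamma
    if t + n < len(states):                      # bootstrap only if s_{t+n} exists;
        G += discount * V.get(states[t + n], 0.0)  # V(terminal) assumed 0 (absent from dict)
    return G
```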

TD(λ): Forward View

• TD(λ) averages the n-step backups over different n.

• Look into the future, compute the n-step returns (MC-style evaluation for each n), then take their weighted average:

R_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} R_t^n

43/??
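A sketch of the forward-view λ-return for a finite episode, reusing the hypothetical `n_step_return` helper sketched after the n-step TD slide above; the remaining weight λ^{T−t−1} goes to the complete return:

```python
def lambda_return(rewards, states, V, t, lam=0.9, gamma=0.99):
    """R_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^{n-1} R_t^n + lam^{T-t-1} R_t."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):                    # finite n-step returns
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, states, V, t, n, gamma)
    # the remaining weight goes to the complete (Monte-Carlo) return
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lam
```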

TD(λ): Forward View

• TD(λ) averages the n-step backups over different n:

R_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} R_t^n

44/??

TD(λ): Forward View

• 19-State Random Walk Task.

45/??

TD(λ): Backward View

• At each step, the eligibility traces of all states are updated:

e_t(s) = γ λ e_{t−1}(s)         if s ≠ s_t
e_t(s) = γ λ e_{t−1}(s) + 1     if s = s_t

46/??

TD(λ): Backward View
• TD(λ): remember where you have been recently (the "eligibility trace") and update those values as well:

e(s_t) ← e(s_t) + 1
∀s: V_new(s) = V_old(s) + α e(s) [ r_t + γ V_old(s_{t+1}) − V_old(s_t) ]
∀s: e(s) ← γ λ e(s)

47/??
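A minimal sketch of this backward-view TD(λ) update for one transition (s_t, r_t, s_{t+1}); `V` and `e` are hypothetical dicts over states (reset `e` at the start of each episode):

```python
def td_lambda_update(V, e, s, r, s_next, alpha=0.1, gamma=0.99, lam=0.9):
    """Accumulating-trace TD(lambda): one online update over all eligible states."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # TD error
    e[s] = e.get(s, 0.0) + 1.0                               # e(s_t) <- e(s_t) + 1
    for x in list(e):
        V[x] = V.get(x, 0.0) + alpha * e[x] * delta          # V(x) <- V(x) + alpha e(x) delta
        e[x] *= gamma * lam                                  # e(x) <- gamma lambda e(x)
```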


TD(λ): Backward View

48/??

TD(λ): Backward vs. Forward

• The two views yield equivalent offline updates (see the proof in Section 7.4 of Introduction to RL, Sutton & Barto).

49/??

SARSA(λ)

50/??

SARSA(λ): Example

51/??

Q(λ)

• The n-step return:

r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n max_a Q_t(s_{t+n}, a)

• Q(λ) algorithm by Watkins.

52/??