Reinforcement Learning Lecture: Temporal Difference Learning
Transcript of the lecture on Temporal Difference Learning
Reinforcement Learning
Temporal Difference Learning
Temporal difference learning, TD prediction, Q-learning, eligibility traces. (Many slides from Marc Toussaint.)
Vien Ngo, MLR, University of Stuttgart
Outline
• Temporal Difference Learning
• Q-learning
• Eligibility Traces
Learning in MDPs
• Assume an unknown MDP {S, A, ·, ·, γ} (the agent does not know P, R).
• While interacting with the world, the agent collects data of the form D = {(s_t, a_t, r_t, s_{t+1})}_{t=1}^H
(state, action, immediate reward, next state)
• What could we learn from that?
– learn to predict the next state: P(s'|s, a)
– learn to predict the immediate reward: P(r|s, a) (or R(s, a))
– learn to predict the value: s, a ↦ Q(s, a)
– learn to predict the action: π(s, a)
Model-based versus Model-free
Edward Tolman (1886–1959), Purposive Behavior in Animals and Men:
(stimulus-stimulus, non-reinforcement driven)
Wolfgang Köhler (1887–1967): animals learn facts about the world that they can subsequently use in a flexible manner, rather than simply learning automatic responses.
Clark Hull (1884–1952), Principles of Behavior (1943):
learn stimulus-response mappings based on reinforcement.
Introduction to Model-Free Methods
– Monte-Carlo Methods, On-Policy MC Control
– Temporal Difference Learning, On-Policy SARSA Algorithm
– Off-Policy Q-Learning Algorithm
Monte-Carlo Policy Evaluation
– MC policy evaluation: first-visit and every-visit methods

Algorithm 1 MC-PE: policy evaluation of policy π
1: Given π; Returns(s) = ∅, ∀s ∈ S
2: while (!converged) do
3:   Generate an episode τ = (s_0, a_0, r_0, ..., s_{T−1}, a_{T−1}, r_{T−1}, s_T) using π
4:   for each state s_t in τ do
5:     Compute the return of s_t: R(s_t) = Σ_{l=t}^{T−1} γ^{l−t} r_l
6:     Either: add R(s_t) into Returns(s_t) only once, at the first occurrence of s_t in τ (first-visit),
       or add R(s_t) into Returns(s_t) at every occurrence (every-visit)
7: Return V^π(s) = average(Returns(s))

– Converges as the number of visits to every s goes to infinity.
(Introduction to RL, Sutton & Barto 1998)
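As a concrete sketch, the procedure above can be written in Python. The `env` interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`) and the `policy` callable are illustrative assumptions, not part of the lecture:

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, gamma=0.9, n_episodes=1000, first_visit=True):
    """Tabular Monte-Carlo policy evaluation (first-visit or every-visit)."""
    returns = defaultdict(list)  # Returns(s): list of observed returns per state
    for _ in range(n_episodes):
        # Generate an episode tau = (s_0, a_0, r_0, ..., s_T) following policy
        episode, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            episode.append((s, r))
            s = s_next
        # Compute returns backwards: R(s_t) = sum_{l=t}^{T-1} gamma^{l-t} r_l
        G, rets = 0.0, []
        for s_t, r_t in reversed(episode):
            G = r_t + gamma * G
            rets.append((s_t, G))
        rets.reverse()
        seen = set()
        for s_t, G in rets:
            if first_visit and s_t in seen:
                continue  # first-visit: record only the first occurrence in tau
            seen.add(s_t)
            returns[s_t].append(G)
    # V^pi(s) = average(Returns(s))
    return {s: sum(g) / len(g) for s, g in returns.items()}
```

On a deterministic two-state chain the averages reproduce the discounted returns exactly.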
Monte-Carlo Policy Evaluation
• Alternatively, incremental updates for MC-PE when obtaining a return R(s_t):

V^π(s_t) ← V^π(s_t) + α_t (R(s_t) − V^π(s_t))
Example: MC Method for Blackjack
• The state is 3-dimensional: current sum (12–21), the dealer's one showing card (ace–10), whether the player has a usable ace (1/0)
• Actions: stick, hit
• Rewards:

R(·, stick) = +1.0 if the current sum is better than the dealer's,
              0.0 if the current sum equals the dealer's,
              −1.0 if the current sum is worse than the dealer's

R(·, hit) = −1.0 if the current sum > 21,
            0.0 otherwise

• If the current sum < 12, automatically hit.
Example: MC Method for Blackjack
Approximate value functions for the policy that sticks only on 20 or 21.
On-Policy MC Control Algorithm
Generalized policy iteration framework

Algorithm 2 On-policy MC Control Algorithm
1: Init an initial policy π_0
2: while (!converged) do
3:   Policy Evaluation: π_k → Q^{π_k}(s, a), ∀s, a
4:   Policy Improvement: greedy action selection
     π_{k+1}(s) = argmax_a Q^{π_k}(s, a), ∀s
ε-Greedy Policy
• The ε-greedy policy chooses actions with probability

π(a|s) = ε/|A| + 1 − ε   if a = argmax_{a'∈A} Q(s, a')
         ε/|A|           otherwise

• Policy improvement: a new ε-greedy policy π_{k+1} constructed upon the value function Q^{π_k}(s, a) of the previous ε-greedy policy π_k satisfies

Q^{π_k}(s, π_{k+1}(s)) = Σ_a π_{k+1}(a|s) Q^{π_k}(s, a)
  = (ε/|A|) Σ_a Q^{π_k}(s, a) + (1 − ε) max_a Q^{π_k}(s, a)
  ≥ (ε/|A|) Σ_a Q^{π_k}(s, a) + (1 − ε) Σ_a [(π_k(a|s) − ε/|A|) / (1 − ε)] Q^{π_k}(s, a)
  = Σ_a π_k(a|s) Q^{π_k}(s, a) = V^{π_k}(s)

• Therefore V^{π_{k+1}}(s) ≥ V^{π_k}(s)
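A minimal Python sketch of ε-greedy action selection matching the definition above; the dictionary-of-Q-values representation is an assumption for illustration:

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """With probability eps pick uniformly (eps/|A| per action); otherwise pick
    the greedy action, so argmax_a Q(s,a) has total probability eps/|A| + 1 - eps."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

With ε = 0 this is the pure greedy policy; with ε = 1 it is uniformly random.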
MC Control Algorithm with ε-Greedy
• Choose actions using ε-greedy.
• Converges if ε_k decreases to zero over time, e.g. ε_k = 1/k (GLIE policy).
• A policy is GLIE (Greedy in the Limit with Infinite Exploration) if
  • all state-action pairs are visited infinitely often, and
  • in the limit (k → ∞) the policy becomes greedy:

lim_{k→∞} π_k(a|s) = δ_a(argmax_{a'} Q^{π_k}(s, a'))

where δ is a Dirac function.
MC Control Algorithm: Blackjack
Temporal difference (TD) learning
• TD Prediction (TD(0))
• On-policy TD Control (SARSA Algorithm)
• Off-policy TD Control (Q-Learning)
• Eligibility Traces (TD(λ))
Temporal Difference (TD) Prediction
Recall

V^π(s) = E[R(π(s), s) + γ V^π(s')]

• TD learning: given a new experience (s, a, r, s')

V_new(s) = (1 − α) V_old(s) + α [r + γ V_old(s')]        (TD target: r + γ V_old(s'))
         = V_old(s) + α [r + γ V_old(s') − V_old(s)]     (TD error in brackets)

• Reinforcement:
– more reward than expected (r > V_old(s) − γ V_old(s')) → increase V(s)
– less reward than expected (r < V_old(s) − γ V_old(s')) → decrease V(s)
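The TD(0) update can be sketched in a few lines of Python; the dictionary value table with default 0.0 is an assumption for illustration:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    r + gamma*V(s') is the TD target; the returned quantity is the TD error."""
    target = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))
    td_error = target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```

A positive TD error (more reward than expected) increases V(s); a negative one decreases it.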
TD Prediction vs. MC Prediction
Figure: MC backup diagram
Figure: TD backup diagram
TD Prediction vs. MC Prediction
Driving Home Example

State                       | Elapsed Time (min) | Predicted Time to Go | Predicted Total Time
leaving office, friday at 6 | 0                  | 30                   | 30
reach car, raining          | 5                  | 35                   | 40
exiting highway             | 20                 | 15                   | 35
2ndary road, behind truck   | 30                 | 10                   | 40
entering home street        | 40                 | 3                    | 43
arrive home                 | 43                 | 0                    | 43
TD Prediction vs. MC Prediction
Figure: Monte-Carlo method vs. TD(0)
(Introduction to RL, Sutton & Barto 1998)
TD Prediction vs. MC Prediction
TD:
• TD can learn before the termination of episodes.
• TD can be used for both non-episodic and episodic tasks.
• The update depends on a single stochastic transition ⇒ lower variance.
• Updates use bootstrapping ⇒ the estimate has some bias.
• TD updates exploit the Markov property.
MC:
• MC learning must wait until the end of an episode.
• MC only works for episodic tasks.
• The update depends on a sequence of many stochastic transitions ⇒ much larger variance.
• Unbiased estimate.
• MC updates do not exploit the Markov property, hence MC can be effective in non-Markov environments.
TD vs. MC: A Random Walk Example
On-Policy TD Control: SARSA
On-Policy TD Control: SARSA

Figure: Learning on the tuple (s, a, r, s', a'): SARSA
• Q-value update:

Q_{t+1}(s, a) = Q_t(s, a) + α (r_t + γ Q_t(s', a') − Q_t(s, a))
On-Policy TD Control: SARSA

Algorithm 3 SARSA Algorithm
1: Init Q(s, a), ∀s ∈ S, a ∈ A
2: while (!converged) do
3:   Init a starting state s
4:   Select an action a from s using a policy derived from Q (e.g. ε-greedy)
5:   for each step of the episode do
6:     Execute a, observe r, s'
7:     Select an action a' from s' using a policy derived from Q (e.g. ε-greedy)
8:     Update: Q_{t+1}(s, a) = Q_t(s, a) + α_t (r + γ Q_t(s', a') − Q_t(s, a))
9:     s ← s'; a ← a'
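The algorithm above can be sketched as a small tabular implementation; the `env` interface (`reset()`, `step(a)` returning `(next_state, reward, done)`) is an illustrative assumption:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=200, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular SARSA with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    Q = defaultdict(float)

    def select(s):
        # epsilon-greedy policy derived from Q
        if rng.random() < eps:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = select(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            if done:
                # no bootstrap at terminal states
                Q[(s, a)] += alpha * (r - Q[(s, a)])
            else:
                a2 = select(s2)  # on-policy: a' comes from the same policy
                Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
                s, a = s2, a2
    return Q
```

Note that a' is chosen by the same ε-greedy policy that is being evaluated, which is what makes SARSA on-policy.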
SARSA's Q-values converge w.p.1 to the optimal values as long as
• the learning rates satisfy Σ_t α_t(s, a) = ∞ and Σ_t α_t²(s, a) < ∞
• the policies π_t(s, a) derived from Q_t(s, a) are GLIE policies.
SARSA on Windy Gridworld Example
• The reward is −1.0 at every non-terminal step; the goal is cell G.
• Each move is shifted upward by the wind; the strength (the number of cells shifted upward) is given below each column.

• The y-axis shows the accumulated number of goal reachings.
Off-Policy TD Control: Q-Learning
Off-Policy TD Control: Q-Learning
Off-Policy MC?
• Importance sampling to estimate the expectation of returns under the target distribution P_π(τ) using samples from the behaviour distribution P_µ(τ):

E_{τ∼P_π(τ)}[ρ(τ)] = ∫ ρ(τ) P_π(τ) dτ
                   = ∫ ρ(τ) (P_π(τ)/P_µ(τ)) P_µ(τ) dτ
                   = E_{τ∼P_µ(τ)}[ρ(τ) P_π(τ)/P_µ(τ)]
                   ≈ (1/N) Σ_{i=1}^N ρ(τ_i) P_π(τ_i)/P_µ(τ_i)

• Denote the trajectory distribution of a policy µ by P^µ(τ), where τ = {s_0, a_0, s_1, a_1, ...}:

P^µ(τ) = P_0(s_0) Π_t P(s_{t+1}|s_t, a_t) µ(a_t|s_t)
• Assume in MC Control that the behaviour policy used to generate data is µ(a|s) (i.e. P_µ(τ)).
• The target policy is π(a|s) (i.e. P_π(τ)).
• Set the importance weights as

w_t = P_π(τ_t)/P_µ(τ_t) = Π_{i=t}^T π(a_i|s_i)/µ(a_i|s_i)

• The MC value update becomes (when observing a return ρ_t):

V(s_t) ← V(s_t) + α (w_t ρ_t − V(s_t))
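The importance weight above can be computed as a small helper; `pi` and `mu` are assumed to be callables returning action probabilities, and the trajectory is a list of (state, action) pairs:

```python
def importance_weight(traj, pi, mu, t=0):
    """w_t = prod_{i=t}^{T} pi(a_i | s_i) / mu(a_i | s_i)
    for a trajectory given as a list of (state, action) pairs."""
    w = 1.0
    for s, a in traj[t:]:
        w *= pi(a, s) / mu(a, s)
    return w
```

Because the weight is a product over the remaining trajectory, it can have very high variance for long horizons, which motivates the per-step TD correction on the next slide.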
Off-Policy TD Control: Q-Learning
Off-Policy TD?
• The term ρ = r_t + γ V(s_{t+1}) is estimated by importance sampling.
• The TD value update becomes (given a transition (s_t, a_t, r_t, s_{t+1})):

V(s_t) ← V(s_t) + α ((π(a_t|s_t)/µ(a_t|s_t)) (r_t + γ V(s_{t+1})) − V(s_t))
Off-Policy TD Control: Q-Learning
• The target policy is greedy: π(st) = arg maxaQ(st, a)
• The control policy is µ, e.g. ε-greedy w.r.t Q(s, a)
Off-Policy TD Control: Q-Learning
• Q-learning (Watkins, 1989): given a new experience (s, a, r, s')

Q_new(s, a) = (1 − α) Q_old(s, a) + α [r + γ max_{a'} Q_old(s', a')]
            = Q_old(s, a) + α [r + γ max_{a'} Q_old(s', a') − Q_old(s, a)]

• Reinforcement:
– more reward than expected (r > Q_old(s, a) − γ max_{a'} Q_old(s', a')) → increase Q(s, a)
– less reward than expected (r < Q_old(s, a) − γ max_{a'} Q_old(s', a')) → decrease Q(s, a)
Q-Learning
(Introduction to RL, Sutton & Barto 1998)
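The update above can be sketched as a complete tabular Q-learning loop; the `env` interface (`reset()`, `step(a)` returning `(next_state, reward, done)`) is an illustrative assumption:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=2000, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning: eps-greedy behaviour policy, greedy target policy."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # behaviour policy mu: epsilon-greedy w.r.t. Q
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # bootstrap with the greedy target policy: max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```

Unlike SARSA, the bootstrap uses max over a', not the action actually taken next, which is what makes the method off-policy.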
Q-learning convergence with prob. 1
• Q-learning is a stochastic approximation of Q-Iteration:

Q-learning: Q_new(s, a) = (1 − α) Q_old(s, a) + α [r + γ max_{a'} Q_old(s', a')]
Q-Iteration: ∀s, a: Q_{k+1}(s, a) = R(s, a) + γ Σ_{s'} P(s'|a, s) max_{a'} Q_k(s', a')

We have shown convergence of Q-Iteration to Q*.
• Convergence of Q-learning:
Q-Iteration is a deterministic update: Q_{k+1} = T(Q_k)
Q-learning is a stochastic version: Q_{k+1} = (1 − α) Q_k + α [T(Q_k) + η_k]
where the noise η_k has zero mean.
Q-learning convergence with prob. 1
The Q-learning algorithm converges w.p.1 as long as the learning rates satisfy

Σ_t α_t(s, a) = ∞ and Σ_t α_t²(s, a) < ∞

(Watkins and Dayan, 'Q-learning'. Machine Learning, 1992)
Q-learning vs. SARSA: The Cliff Example
Q-Learning impact
• Q-Learning was the first provably convergent direct adaptive optimal control algorithm.
• Great impact on the field of Reinforcement Learning:
– smaller representation than models
– automatically focuses attention where it is needed, i.e., no sweeps through the state space
– though it does not solve the exploration-versus-exploitation issue
– ε-greedy, optimistic initialization, etc.
Unified View
(Introduction to RL, Sutton & Barto 1998)
Eligibility traces
• Temporal Difference: based on a single experience (s_0, r_0, s_1)

V_new(s_0) = V_old(s_0) + α [r_0 + γ V_old(s_1) − V_old(s_0)]

• Longer experience sequence, e.g.: (s_0, r_0, r_1, r_2, s_3)
Temporal credit assignment, think further backwards: receiving r_2 also tells us something about V(s_0):

V_new(s_0) = V_old(s_0) + α [r_0 + γ r_1 + γ² r_2 + γ³ V_old(s_3) − V_old(s_0)]
Eligibility trace
• The n-step TD update

V_t(s_t) ← V_t(s_t) + α [R^n_t − V_t(s_t)]

where R^n_t is the n-step return

R^n_t = r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n V_t(s_{t+n})

• The offline value update up to time T accumulates the per-step increments ∆V_t(s):

V(s) ← V(s) + Σ_{t=0}^{T−1} ∆V_t(s)

• Error reduction (in the max norm)

|V^n_t − V^π| ≤ γ^n |V_t − V^π|
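The n-step return above can be written as a small helper; the dictionary value table is an assumption for illustration:

```python
def n_step_return(rewards, V, s_tn, n, gamma=0.9, t=0):
    """R^n_t = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * V(s_{t+n}),
    given the reward sequence, a value table V, and the state s_{t+n}."""
    G = sum(gamma**k * rewards[t + k] for k in range(n))
    return G + gamma**n * V.get(s_tn, 0.0)
```

For n = 1 this reduces to the TD(0) target; as n → ∞ (episode length) it becomes the Monte-Carlo return.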
TD(λ): Forward View
• TD(λ) averages the n-step backups over different n
• Look into the future, do MC evaluation, then average with weights:

R^λ_t = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R^n_t
TD(λ): Forward View
• 19-State Random Walk Task.
TD(λ): Backward View
• At each step, the eligibility traces of all states are updated:

e_t(s) = γλ e_{t−1}(s)       if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1   if s = s_t
TD(λ): Backward View
• TD(λ): remember where you have been recently (the "eligibility trace") and update those values as well:

e(s_t) ← e(s_t) + 1
∀s: V_new(s) = V_old(s) + α e(s) [r_t + γ V_old(s_{t+1}) − V_old(s_t)]
∀s: e(s) ← γλ e(s)
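The backward-view step can be sketched in Python with accumulating traces; the `defaultdict` tables for V and e are assumptions for illustration:

```python
from collections import defaultdict

def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8, terminal=False):
    """One backward-view TD(lambda) step with accumulating traces:
    e(s_t) += 1; then every state is updated by alpha * e(s) * delta_t
    and all traces decay by gamma*lambda."""
    e[s] += 1.0                               # e(s_t) <- e(s_t) + 1
    target = r + (0.0 if terminal else gamma * V[s_next])
    delta = target - V[s]                     # TD error delta_t
    for state in list(e):
        V[state] += alpha * e[state] * delta  # update every eligible state
        e[state] *= gamma * lam               # decay all traces
    return delta
```

With λ = 0 only the current state has a non-negligible trace after decay, recovering TD(0); with λ = 1 the credit assignment approaches the MC update.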
TD(λ): Backward View
TD(λ): Backward vs. Forward
• The two views yield equivalent offline updates (see the proof in Section 7.4 of Introduction to RL, Sutton & Barto).
SARSA(λ)
SARSA(λ): Example
Q(λ)
• The n-step return

r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n max_a Q_t(s_{t+n}, a)

• Q(λ) algorithm by Watkins