TD(0) prediction; Sarsa, on-policy learning; Q-Learning, off-policy learning

Transcript
Page 1:

• TD(0) prediction
• Sarsa, on-policy learning
• Q-Learning, off-policy learning
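The one-step update rules behind these three methods can be sketched as follows (a minimal illustration; the parameter names `alpha`/`gamma` and the array layout are assumptions, not from the slides):

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """TD(0) prediction: nudge V(s) toward the bootstrapped target r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """Sarsa (on-policy): the target uses the next action the behaviour policy actually took."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    """Q-learning (off-policy): the target uses the greedy next action, regardless of behaviour."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```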

Page 2:

• Actor-Critic

Page 3:

Unified View

Page 4:

N-step TD Prediction

Page 5:
Page 6:

Forward View

Page 7:

Random Walk

Page 8:

• 19-state random walk
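A minimal sketch of the 19-state random walk evaluated with n-step TD prediction, assuming the standard textbook setup (states 1–19, terminals at 0 and 20, reward −1 on the left exit and +1 on the right, left/right moves equiprobable); function and parameter names here are illustrative:

```python
import numpy as np

def episode(rng):
    """Generate one random-walk episode as lists of states and rewards."""
    s = 10                      # start in the centre state
    states, rewards = [s], [0.0]
    while 0 < s < 20:
        s += rng.choice([-1, 1])
        states.append(s)
        rewards.append(-1.0 if s == 0 else (1.0 if s == 20 else 0.0))
    return states, rewards

def n_step_td(n=4, alpha=0.1, gamma=1.0, episodes=100, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(21)            # V[0] and V[20] stay 0 (terminal states)
    for _ in range(episodes):
        states, rewards = episode(rng)
        T = len(states) - 1
        for t in range(T):
            end = min(t + n, T)
            # n-step return: discounted rewards, plus a bootstrap if we stopped early
            G = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, end + 1))
            if end < T:
                G += gamma ** n * V[states[end]]
            V[states[t]] += alpha * (G - V[states[t]])
    return V

print(np.round(n_step_td()[1:20], 2))   # learned values of the 19 non-terminal states
```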

Page 9:

• The n-step method is a simple version of TD(λ)

• Example: back up toward the average of the 2-step and 4-step returns (see the equations below)
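For reference, the standard n-step return and the averaged backup mentioned above, in Sutton & Barto notation (the slide's own equations did not survive extraction):

```latex
% n-step return: n real rewards, then bootstrap from the current value estimate
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{\,n-1} R_{t+n} + \gamma^{\,n} V(S_{t+n})

% Example backup target: average of the 2-step and 4-step returns
\tfrac{1}{2}\, G_t^{(2)} + \tfrac{1}{2}\, G_t^{(4)}
```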

Page 10:

Forward View, TD(λ)

• Weight each n-step return backup by λ^(n−1), where n is the time since visitation

• λ-return: the λ-weighted average of all the n-step returns

• Backup using the λ-return (see the equations below)
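The standard forms of the λ-return and the forward-view backup (Sutton & Barto notation), supplied here because the slide's equations were not extracted:

```latex
% λ-return: (1-λ)-normalised mixture of all n-step returns
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)}

% Episodic form: the final return carries all the remaining weight
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{\,n-1}\, G_t^{(n)} + \lambda^{\,T-t-1}\, G_t

% Forward-view backup
V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[G_t^{\lambda} - V(S_t)\bigr]
```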

Page 11:

Weighting of λ-return

Page 12:

Relationship with TD(0) and MC

Page 13:
Page 14:

Backward View

Page 15:

The book shows that the forward and backward views are actually equivalent
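The backward view in its standard tabular form, with accumulating eligibility traces:

```latex
% TD error at step t
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

% Accumulating trace: decay everywhere, bump the state just visited
e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}[s = S_t]

% Every state is updated in proportion to its trace
V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s) \quad \text{for all } s
```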

Page 16:

On-line, Tabular TD(λ)

Page 17:

• Update rule: the TD-error-times-trace update of the backward view (a code sketch follows below)

• As before, λ = 0 gives TD(0)
• Now, when λ = 1, you get MC, but:
  – it can be applied to continuing tasks
  – it works incrementally and on-line!
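A minimal sketch of on-line tabular TD(λ) with accumulating traces; the `env`/`policy` interface and the parameter names are assumptions for illustration, not something defined on the slides:

```python
import numpy as np

def td_lambda(env, policy, n_states, lam=0.9, alpha=0.1, gamma=1.0, episodes=100):
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)          # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            delta = target - V[s]       # TD error
            e[s] += 1.0                 # accumulating trace for the visited state
            V += alpha * delta * e      # update every state in proportion to its trace
            e *= gamma * lam            # decay all traces
            s = s_next
    return V
```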

Page 18:

Control: Sarsa(λ)
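Sarsa(λ) applies the same trace machinery to state-action pairs; in standard notation:

```latex
\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)

e_t(s,a) = \gamma\lambda\, e_{t-1}(s,a) + \mathbf{1}[s = S_t,\, a = A_t]

Q(s,a) \leftarrow Q(s,a) + \alpha\, \delta_t\, e_t(s,a) \quad \text{for all } s, a
```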

Page 19:

Gridworld Example

Page 20:

Watkins's Q(λ)

• Why isn’t Q-learning as easy as Sarsa?

Page 21:

Watkins's Q(λ)
• Why isn't Q-learning as easy as Sarsa? Because it is off-policy: once an exploratory (non-greedy) action is taken, the rest of the trajectory no longer follows the greedy target policy, so Watkins's Q(λ) cuts the eligibility traces to zero at that point (see below).
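One common way to write Watkins's Q(λ) (standard form; the trace cut is the key difference from Sarsa(λ)):

```latex
\delta_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)

e_t(s,a) =
\begin{cases}
\gamma\lambda\, e_{t-1}(s,a) + \mathbf{1}[s = S_t,\, a = A_t] & \text{if } A_t \text{ was greedy} \\
\mathbf{1}[s = S_t,\, a = A_t] & \text{otherwise (all traces reset)}
\end{cases}

Q(s,a) \leftarrow Q(s,a) + \alpha\, \delta_t\, e_t(s,a) \quad \text{for all } s, a
```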

Page 22:

• Accumulating traces
  – Eligibilities can be greater than 1
  – Could cause convergence problems

• Replacing traces (contrasted in the sketch below)
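The two trace updates differ only at the state just visited; all other traces decay by γλ in both cases:

```latex
\text{accumulating:}\quad e_t(S_t) = \gamma\lambda\, e_{t-1}(S_t) + 1
\qquad\qquad
\text{replacing:}\quad e_t(S_t) = 1
```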

Page 23:
Page 24:

• Example: why do accumulating traces do particularly poorly in this task?

Page 25:
Page 26:

Implementation Issues

• Could require significant amounts of computation
  – But most traces are very close to zero…
  – We can actually throw them out when they get very small
• Will want to use some type of efficient data structure (see the sketch below)
• In practice, increases computation only by a small multiple
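An illustrative sketch (not from the slides) of keeping only the active traces in a dictionary and pruning the tiny ones, so each step costs time proportional to the number of active traces rather than the size of the state space (V is assumed to be a dense value array):

```python
def decay_and_prune(traces, gamma, lam, threshold=1e-4):
    """Decay all active traces by gamma*lam and drop the ones that became negligible."""
    decay = gamma * lam
    for s in list(traces):
        traces[s] *= decay
        if traces[s] < threshold:
            del traces[s]

def td_lambda_step(V, traces, s, r, s_next, alpha, gamma, lam):
    """One on-line TD(lambda) step using a sparse dict of eligibility traces."""
    delta = r + gamma * V[s_next] - V[s]
    traces[s] = traces.get(s, 0.0) + 1.0      # accumulating trace for the visited state
    for state, e in traces.items():           # update only states with active traces
        V[state] += alpha * delta * e
    decay_and_prune(traces, gamma, lam)
```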

Page 27:

• AnonymousFeedback.net, or send to [email protected]

1. What’s been most useful to you (and why)?

2. What’s been least useful (and why)?

3. What could students do to improve the class?

4. What could Matt do to improve the class?