TD(0) prediction; Sarsa, on-policy learning; Q-learning, off-policy learning


Date posted: 23-Feb-2016


Monte Carlo Methods

TD(0) prediction
Sarsa, on-policy learning
Q-learning, off-policy learning

Actor-Critic

Unified View

N-step TD Prediction

Forward View

Random Walk

19-state random walk

n-step method is a simple version of TD(λ)

Example: backup average of 2-step and 4-step returns
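
The averaged backup above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names, the list-based episode layout, and the convention that a terminal state has value 0 are all assumptions made for the example.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return starting at time t (illustrative helper, not from the slides).

    rewards[k] is the reward received on the transition out of step k.
    values[k] is the current estimate V(s_k).
    The episode ends after len(rewards) transitions; the terminal value is 0.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = 0.0
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]
    if end < T:
        # Bootstrap from the value estimate only if the episode has not ended.
        g += gamma ** (end - t) * values[end]
    return g


def averaged_2_and_4_step_return(rewards, values, t, gamma=1.0):
    """Backup target that averages the 2-step and 4-step returns equally."""
    two = n_step_return(rewards, values, t, 2, gamma)
    four = n_step_return(rewards, values, t, 4, gamma)
    return 0.5 * two + 0.5 * four
```

Any set of non-negative weights that sums to 1 gives a valid averaged target; the equal split here simply matches the 2-step and 4-step example.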

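Forward View, TD(λ)
Weigh all n-step return backups by λ^(n-1) (time since visitation)
λ-return:
$$R_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}R_t^{(n)}$$

Backup using λ-return:
$$\Delta V_t(s_t) = \alpha\left[R_t^\lambda - V_t(s_t)\right]$$

Weighting of λ-return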

Relationship with TD(0) and MC
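
The relationship with TD(0) and MC can be checked numerically. The sketch below assumes an episodic task in which the n-step returns for one time step are already available as a list, with the last entry being the full Monte Carlo return; the function names are made up for this example.

```python
def lambda_weights(num_returns, lam):
    """Weights on the 1-step, 2-step, ... returns for an episode that ends.

    Each n-step return gets weight (1 - lam) * lam**(n - 1); the final
    (complete) return absorbs all remaining weight, lam**(num_returns - 1).
    """
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, num_returns)]
    weights.append(lam ** (num_returns - 1))
    return weights


def lambda_return(n_step_returns, lam):
    """Weighted average of the n-step returns (the last one is the MC return)."""
    weights = lambda_weights(len(n_step_returns), lam)
    return sum(w * g for w, g in zip(weights, n_step_returns))
```

With `lam = 0` all of the weight falls on the 1-step return, which is the TD(0) target; with `lam = 1` all of it falls on the complete return, which is the Monte Carlo target.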

Backward View

Book shows forward and backward views are actually equivalent

On-line, Tabular TD(λ)

Update rule:

As before, λ = 0 means TD(0)
Now, when λ = 1, you get MC, but
Can apply to continuing tasks
Works incrementally and on-line!
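
A minimal sketch of on-line tabular TD(λ) with accumulating eligibility traces, run on a random walk. Several details are assumptions for illustration rather than taken from the slides: the function name, the rewards of -1 and +1 at the left and right terminals, the start in the middle state, and the default step size.

```python
import random


def td_lambda_random_walk(n_states=19, episodes=10, alpha=0.1,
                          gamma=1.0, lam=0.8, seed=0):
    """Backward-view TD(lambda) prediction on a random walk (illustrative)."""
    rng = random.Random(seed)
    # Indices 0 and n_states + 1 are terminal states; their value stays 0.
    V = [0.0] * (n_states + 2)
    for _ in range(episodes):
        e = [0.0] * (n_states + 2)      # eligibility traces, reset per episode
        s = (n_states + 1) // 2         # assumed start: the middle state
        while 0 < s < n_states + 1:
            s_next = s + rng.choice((-1, 1))
            if s_next == 0:
                r = -1.0                # assumed reward at the left terminal
            elif s_next == n_states + 1:
                r = 1.0                 # assumed reward at the right terminal
            else:
                r = 0.0
            delta = r + gamma * V[s_next] - V[s]   # one-step TD error
            e[s] += 1.0                            # accumulating trace
            for i in range(1, n_states + 1):
                V[i] += alpha * delta * e[i]       # update every state by its trace
                e[i] *= gamma * lam                # decay all traces
            s = s_next
    return V[1:-1]
```

Setting `lam=0.0` leaves a trace only on the state just visited, so the loop reduces to TD(0). Every update uses only the current transition, which is why the method works incrementally and on-line.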

Control: Sarsa(λ)

Gridworld Example

Watkins's Q(λ)
Why isn't Q-learning as easy as Sarsa?

Watkins's Q(λ)
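
One way to see the difficulty: Q-learning learns about the greedy policy, so a trace is only meaningful while the agent keeps acting greedily. The sketch below shows a single update step in the style of Watkins's Q(λ), where all traces are cut to zero after an exploratory action. The data layout (`Q` as a dict of dicts, `e` as a dict keyed by state-action pairs), the function name, and the default parameters are assumptions for this example.

```python
def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next,
                          alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one Watkins-style Q(lambda) update in place; return the TD error.

    Q[state][action] holds action values; e[(state, action)] holds traces.
    s_next is assumed to be a state with an entry in Q.
    """
    best = max(Q[s_next].values())
    # The next action counts as greedy if it ties for the maximum value.
    next_is_greedy = Q[s_next][a_next] == best
    delta = r + gamma * best - Q[s][a]
    e[(s, a)] = e.get((s, a), 0.0) + 1.0
    for (si, ai), trace in list(e.items()):
        Q[si][ai] += alpha * delta * trace
        e[(si, ai)] = gamma * lam * trace
    if not next_is_greedy:
        # Exploratory action: earlier pairs no longer lie on the greedy path.
        e.clear()
    return delta
```

Because traces are dropped at every exploratory step, frequent exploration keeps them short, and learning then looks much like one-step Q-learning.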

Accumulating traces
Eligibilities can be greater than 1
Could cause convergence problems
Replacing traces
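
The difference between the two kinds of trace can be sketched for a single state. The function below is a hypothetical helper, not from the slides; it takes a visit pattern and returns the trace value after each step.

```python
def trace_history(visits, gamma=1.0, lam=0.9, replacing=False):
    """Trace of one state over time.

    visits is a sequence of booleans: True where the state is visited.
    Accumulating traces add 1 on each visit; replacing traces reset to 1.
    """
    e = 0.0
    history = []
    for visited in visits:
        e *= gamma * lam                # decay on every step
        if visited:
            e = 1.0 if replacing else e + 1.0
        history.append(e)
    return history
```

For a state visited on three consecutive steps with `gamma=1.0` and `lam=0.9`, the accumulating trace grows to roughly 1, 1.9, 2.71, while the replacing trace stays at 1.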

Example: why do accumulating traces do particularly poorly in this task?

Implementation Issues
Could require significant amounts of computation
But most traces are very close to zero
We can actually throw them out when they get very small
Will want to use some type of efficient data structure
In practice, increases computation only by a small multiple

AnonymousFeedback.net
Send to taylorm@eecs.wsu.edu
What's been most useful to you (and why)?

What's been least useful (and why)?

What could students do to improve the class?

What could Matt do to improve the class?