TD(0) prediction Sarsa , On-policy learning Q-Learning, Off-policy learning

• TD(0) prediction• Sarsa, On-policy learning• Q-Learning, Off-policy learning

• Actor-Critic

Unified View

N-step TD Prediction

Forward View

Random Walk

• 19-state random walk

• n-step method is simple version of TD(λ)

• Example: backup average of 2-step and 4-step returns

Forward View, TD(λ)

• Weigh all n-step return backups by λn-1

(time since visitation)• λ-return:

• Backup using λ-return:

Weighting of λ-return

Relationship with TD(0) and MC

Backward View

Book shows forward and backward views are actually equivalent

On-line, Tabular TD(λ)

• Update rule:

• As before, λ = 0 means TD(0)• Now, when λ = 1, you get MC, but– Can apply to continuing tasks– Works incrementally and on-line!

Control: Sarsa(λ)

Gridworld Example

Watkin’s Q(λ)

• Why isn’t Q-learning as easy as Sarsa?

Watkin’s Q(λ)• Why isn’t Q-learning as easy as Sarsa?

• Accumulating traces– Eligibilities can be greater than 1– Could cause convergence problems

• Replacing traces

• Example: why do accumulating traces do particularly poorly in this task?

Implementation Issues

• Could require significant amounts of computation– But most traces are very close to zero…– We can actually throw them out when they get very

small• Will want to use some type of efficient data

structure• In practice, increases computation only by a

small multiple

• AnonymousFeedback.net Send to taylorm@eecs.wsu.edu

1. What’s been most useful to you (and why)?

2. What’s been least useful (and why)?

3. What could students do to improve the class?

4. What could Matt do to improve the class?

TD(0) prediction Sarsa , On-policy learning Q-Learning, Off-policy learning

Documents

Transcript of TD(0) prediction Sarsa , On-policy learning Q-Learning, Off-policy learning

On-Policy Concurrent Reinforcement Learningcse.unl.edu/~lksoh/Classes/CSCE475_875_Fall15/Seminar...SARSA (on-policy method) converges to a stable Q value while the classic Q-learning

Lecture 7: Policy Gradient Reinforcement... · 2017-03-06 · Lecture 7: Policy Gradient Introduction Policy-Based Reinforcement Learning In the last lecture we approximated the value

CBN’s 5-year Monetary Policy Blueprint

Reinforcement Learning via Policy Optimizationhanxiaol/slides/rl-po.pdf · Policy Gradient r U( ) ˇr logP(˝;ˇ )R(˝) ˝˘P(;ˇ ) (7) I Analogous to SGD (so variance reduction is

Policy paper-no17.2013 σακελλαρόπουλος-φίτσιου-4

RL5: On-policyandoﬀ-policyalgorithms · Overview Oﬀ-policyalgorithms Q-learning(lasttime) R-learning(avariantofQ-learning) On-policyalgorithms SARSA TD( ) Actor-criticmethods

Macroeconomics Lecture 16. Review of the Previous Lecture Three Experiments –Fiscal Policy at Home –Fiscal Policy Abroad –Increase in Investment Demand.

Targeted policy making by transforming social networks

THESIS - COMMON EUROPEAN DEFENCE POLICY

Learning in Indefiniteness · Purushottam Kar (CSE/IITK) Learning in Indeﬁniteness August 2, 2010 8 / 60. Learning Limitations of PAC learning Most interesting function classes

Machine Learning Probabilistic Machine Learning · Machine Learning Probabilistic Machine Learning learning as inference, Bayesian Kernel Ridge regression = Gaussian Processes, Bayesian

Towards Safe Policy Improvement for Non-Stationary MDPs ...

Tivoli SecureWay Policy Directorpublib.boulder.ibm.com/tividd/td/SW_30/GC32-0737-00/zh... · 2007-09-29 · eÑ Tivoli® Policy Director O⌡µTivoli Policy Director ú Xñ {í ≥

Safe and Efficient Off-Policy Reinforcement Learning

Reinforcement Learning - Policy Search: Actor-Critic and ...mmartin/URL/Lecture5.pdf · Mario Martin (CS-UPC) Reinforcement Learning May 7, 2020 17 / 72. Approximated Cross-Entropy

SAMPLE EFFICIENT POLICY GRADIENT METHODS WITH …

Classifier-Based Approximate Policy Iteration

Machine Learning Learning with Graphical Models · Machine Learning Learning with Graphical Models Marc Toussaint University of Stuttgart Summer 2015

Optimal policy computation with Dynare

Bank of Greece - Monetary policy, 2014~2015 report