
Reinforcement learning

• Regular MDP
  – Given:
    • Transition model P(s' | s, a)
    • Reward function R(s)
  – Find:
    • Policy π(s)

• Reinforcement learning
  – Transition model and reward function initially unknown
  – Still need to find the right policy
  – "Learn by doing"

Reinforcement learning: Basic scheme

• In each time step (see the loop sketch below):
  – Take some action
  – Observe the outcome of the action: successor state and reward
  – Update some internal representation of the environment and policy
  – If you reach a terminal state, just start over (each pass through the environment is called a trial)

• Why is this called reinforcement learning?
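A minimal sketch of this scheme in Python, assuming a hypothetical environment with reset() and step(action) returning (successor state, reward, done) and an agent exposing select_action and update; none of these names come from the slides.

```python
# One trial: interact with the environment until a terminal state is reached.
def run_trial(env, agent):
    s = env.reset()                     # start of the trial
    done = False
    while not done:
        a = agent.select_action(s)      # take some action
        s_next, r, done = env.step(a)   # observe successor state and reward
        agent.update(s, a, r, s_next)   # update internal representation / policy
        s = s_next

# Each pass through the environment is one trial; just start over after each pass.
def train(env, agent, num_trials=100):
    for _ in range(num_trials):
        run_trial(env, agent)
```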

Applications of reinforcement learning

• Backgammon

http://www.research.ibm.com/massive/tdl.html

http://en.wikipedia.org/wiki/TD-Gammon

Applications of reinforcement learning

• Learning a fast gait for Aibos

[Videos: initial gait vs. learned gait]

Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. Nate Kohl and Peter Stone.

IEEE International Conference on Robotics and Automation, 2004.

Applications of reinforcement learning

• Stanford autonomous helicopter

Reinforcement learning strategies

• Model-based
  – Learn the model of the MDP (transition probabilities and rewards) and try to solve the MDP concurrently

• Model-free
  – Learn how to act without explicitly learning the transition probabilities P(s' | s, a)
  – Q-learning: learn an action-utility function Q(s,a) that tells us the value of doing action a in state s

Model-based reinforcement learning

• Basic idea: try to learn the model of the MDP (transition probabilities and rewards) and learn how to act (solve the MDP) simultaneously

• Learning the model:
  – Keep track of how many times state s' follows state s when you take action a, and update the transition probability P(s' | s, a) according to the relative frequencies
  – Keep track of the rewards R(s)

• Learning how to act:
  – Estimate the utilities U(s) using Bellman's equations
  – Choose the action that maximizes expected future utility:

    π*(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')

Model-based reinforcement learning

• Learning how to act (sketched in code below):
  – Estimate the utilities U(s) using Bellman's equations
  – Choose the action that maximizes expected future utility given the model of the environment we've experienced through our actions so far:

    π*(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')

• Is there any problem with this "greedy" approach?
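As a rough illustration, here is a minimal tabular sketch of this model-based scheme in Python. The class and method names and the fixed number of Bellman sweeps are assumptions for the example, not part of the slides: the transition model is estimated from relative frequencies, utilities are re-estimated with a few Bellman-update sweeps, and actions are chosen greedily under the current model (the "greedy" behaviour questioned above).

```python
from collections import defaultdict

# Rough tabular sketch of model-based RL (names and interfaces are illustrative).
class ModelBasedAgent:
    def __init__(self, actions, gamma=0.9):
        self.actions = actions
        self.gamma = gamma
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.rewards = {}                                     # s -> observed R(s)
        self.U = defaultdict(float)                           # utility estimates U(s)

    def P(self, s_next, s, a):
        # Relative-frequency estimate of P(s' | s, a)
        c = self.counts[(s, a)]
        total = sum(c.values())
        return c[s_next] / total if total else 0.0

    def expected_utility(self, s, a):
        # sum_s' P(s' | s, a) U(s') under the learned model
        return sum(self.P(s2, s, a) * self.U[s2] for s2 in list(self.counts[(s, a)]))

    def update(self, s, a, r, s_next):
        # Learning the model: update transition counts and observed rewards
        self.counts[(s, a)][s_next] += 1
        self.rewards[s] = r
        self.solve()

    def solve(self, sweeps=20):
        # Learning how to act: a few Bellman-update sweeps on the learned model
        for _ in range(sweeps):
            for s in list(self.rewards):
                best = max((self.expected_utility(s, a) for a in self.actions), default=0.0)
                self.U[s] = self.rewards[s] + self.gamma * best

    def select_action(self, s):
        # Greedy policy: pi*(s) = argmax_a sum_s' P(s' | s, a) U(s')
        return max(self.actions, key=lambda a: self.expected_utility(s, a))
```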

Exploration vs. exploitation

• Exploration: take a new action with unknown consequences
  – Pros:
    • Get a more accurate model of the environment
    • Discover higher-reward states than the ones found so far
  – Cons:
    • When you're exploring, you're not maximizing your utility
    • Something bad might happen

• Exploitation: go with the best strategy found so far
  – Pros:
    • Maximize reward as reflected in the current utility estimates
    • Avoid bad stuff
  – Cons:
    • Might also prevent you from discovering the true optimal strategy

Incorporating exploration

• Idea: explore more in the beginning, become more and more greedy over time

• Standard ("greedy") selection of optimal action:

    a = argmax_{a' ∈ A(s)} Σ_{s'} P(s' | s, a') U(s')

• Modified strategy (see the sketch below):

    a = argmax_{a' ∈ A(s)} f( Σ_{s'} P(s' | s, a') U(s'), N(s, a') )

  where f is an exploration function and N(s, a') is the number of times we've taken action a' in state s:

    f(u, n) = R+  if n < Ne,   u  otherwise

  (R+ is an optimistic reward estimate)
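A small Python sketch of this modified strategy. The helper names expected_utility(s, a) and visit_count(s, a) stand in for Σ_{s'} P(s' | s, a) U(s') and N(s, a), and the R_plus and N_e values are assumed constants, not values from the slides.

```python
# Sketch of the exploration function f(u, n): act optimistically until the
# (state, action) pair has been tried at least N_e times (R_plus, N_e are assumed values).
def f(u, n, R_plus=10.0, N_e=5):
    return R_plus if n < N_e else u

def select_action(s, actions, expected_utility, visit_count):
    # Modified strategy: a = argmax_a' f( sum_s' P(s'|s,a') U(s'), N(s, a') )
    return max(actions, key=lambda a: f(expected_utility(s, a), visit_count(s, a)))
```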

Model-free reinforcement learning

• Idea: learn how to act without explicitly learning the transition probabilities P(s' | s, a)
• Q-learning: learn an action-utility function Q(s,a) that tells us the value of doing action a in state s
• Relationship between Q-values and utilities:

    U(s) = max_a Q(s,a)

Model-free reinforcement learning

• Q-learning: learn an action-utility function Q(s,a) that tells us the value of doing action a in state s:

    U(s) = max_a Q(s,a)

• Equilibrium constraint on Q values:

    Q(s,a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s',a')

• Problem: we don't know (and don't want to learn) P(s' | s, a)

Temporal difference (TD) learning

• Equilibrium constraint on Q values:

    Q(s,a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s',a')

• Temporal difference (TD) update:
  – Pretend that the currently observed transition (s,a,s') is the only possible outcome and adjust the Q values towards the "local equilibrium":

    Q_local(s,a) = R(s) + γ max_{a'} Q(s',a')

    Q_new(s,a) = (1 − α) Q(s,a) + α Q_local(s,a)
               = Q(s,a) + α ( Q_local(s,a) − Q(s,a) )

    Q_new(s,a) = Q(s,a) + α ( R(s) + γ max_{a'} Q(s',a') − Q(s,a) )
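For instance, with illustrative values (chosen for this example, not from the slides) α = 0.5, γ = 0.9, R(s) = 1, Q(s,a) = 2, and max_{a'} Q(s',a') = 3: Q_local(s,a) = 1 + 0.9·3 = 3.7, so Q_new(s,a) = 0.5·2 + 0.5·3.7 = 2.85, i.e. the estimate moves halfway from 2 towards the local equilibrium 3.7.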

Temporal difference (TD) learning

• At each time step t (see the sketch below):
  – From current state s, select an action a:

      a = argmax_{a'} f( Q(s, a'), N(s, a') )

    where f is the exploration function and N(s, a') is the number of times we've taken action a' from state s
  – Get the successor state s'
  – Perform the TD update:

      Q(s,a) ← Q(s,a) + α ( R(s) + γ max_{a'} Q(s',a') − Q(s,a) )

    where α is the learning rate; it should start at 1 and decay as O(1/t), e.g., α(t) = 60/(59 + t)
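Putting the formulas above together, a minimal tabular Q-learning agent might look like the following Python sketch. The class name, the update(s, a, r, s_next) interface, and the R_plus and N_e values are illustrative assumptions; the action selection, TD update, and learning-rate decay follow the expressions on this slide.

```python
from collections import defaultdict

# Minimal tabular Q-learning sketch (illustrative interface, not the slides' code).
class QLearningAgent:
    def __init__(self, actions, gamma=0.9, R_plus=10.0, N_e=5):
        self.actions = actions
        self.gamma = gamma
        self.R_plus, self.N_e = R_plus, N_e
        self.Q = defaultdict(float)   # (s, a) -> Q-value estimate
        self.N = defaultdict(int)     # (s, a) -> times action a taken in state s
        self.t = 0                    # time step, used for the learning-rate decay

    def alpha(self):
        # Learning rate decaying as O(1/t), e.g. alpha(t) = 60 / (59 + t)
        return 60.0 / (59.0 + self.t)

    def f(self, u, n):
        # Exploration function: optimistic estimate until (s, a) has been tried N_e times
        return self.R_plus if n < self.N_e else u

    def select_action(self, s):
        # a = argmax_a' f( Q(s, a'), N(s, a') )
        return max(self.actions, key=lambda a: self.f(self.Q[(s, a)], self.N[(s, a)]))

    def update(self, s, a, r, s_next):
        # TD update: Q(s,a) <- Q(s,a) + alpha * ( R(s) + gamma * max_a' Q(s',a') - Q(s,a) )
        self.t += 1
        self.N[(s, a)] += 1
        best_next = max(self.Q[(s_next, a2)] for a2 in self.actions)
        self.Q[(s, a)] += self.alpha() * (r + self.gamma * best_next - self.Q[(s, a)])
```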

Function approximation

• So far, we've assumed a lookup table representation for the utility function U(s) or action-utility function Q(s,a)
• But what if the state space is really large or continuous?
• Alternative idea: approximate the utility function as a weighted linear combination of features:

    U(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)

  – RL algorithms can be modified to estimate these weights (see the sketch below)
• Recall: features for designing evaluation functions in games
• Benefits:
  – Can handle very large state spaces (games), continuous state spaces (robot control)
  – Can generalize to previously unseen states
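As a rough sketch of how such weights might be estimated, here is an assumed TD-style update for a linearly approximated action-utility function Q(s, a), a common variant of the U(s) formulation above. The feature extractor features(s, a), the step sizes, and the function names are all illustrative assumptions, not something given in the slides.

```python
# Sketch of TD-style weight updates for a linear approximation
# Q(s, a) ~ w1*f1(s, a) + ... + wn*fn(s, a); features(s, a) is a hypothetical
# function returning the feature list [f1(s, a), ..., fn(s, a)].
def q_value(weights, features, s, a):
    return sum(w * f for w, f in zip(weights, features(s, a)))

def td_weight_update(weights, features, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Move each weight in proportion to its feature value and the TD error
    best_next = max(q_value(weights, features, s_next, a2) for a2 in actions)
    error = r + gamma * best_next - q_value(weights, features, s, a)
    return [w + alpha * error * f for w, f in zip(weights, features(s, a))]
```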