Reinforcement Learning Lecture: Function Approximation


Reinforcement Learning

Function Approximation

Continuous state/action spaces, mean-square error, gradient temporal difference learning, least-squares temporal difference, least-squares policy iteration

Vien Ngo, MLR, University of Stuttgart

Outline

• Function Approximation

– Gradient Descent Methods.

– Batch Learning Methods


Value Iteration in Continuous MDP

$V(s) = \sup_a \Big[\, r(s, a) + \gamma \int P(s' \mid s, a)\, V(s')\, ds' \,\Big]$

(in continuous spaces the integral has to be approximated in practice; see the sketch below)
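The integral over next states is what makes this backup hard to compute exactly in continuous spaces. As an illustration only (not from the lecture), here is a minimal sketch of a sampled one-step backup; `reward`, `sample_next`, and `V` are assumed callables supplied by the user.

```python
import numpy as np

# Hypothetical sketch: approximate the continuous Bellman backup by
# Monte-Carlo integration over sampled next states s' ~ P(.|s, a).

def sampled_backup(s, actions, reward, sample_next, V, gamma=0.95, n_samples=100):
    """Return an approximation of sup_a [ r(s,a) + gamma * E_{s'}[V(s')] ]."""
    best = -np.inf
    for a in actions:
        next_states = [sample_next(s, a) for _ in range(n_samples)]
        backup = reward(s, a) + gamma * np.mean([V(sn) for sn in next_states])
        best = max(best, backup)
    return best
```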

Continuous state/actions in model-free RL

• DP with tabular lookup is fine in small finite state and action spaces: Q(s, a) is a |S| × |A| matrix of numbers, and π(a|s) is a |S| × |A| matrix of numbers.

• RL in large and continuous domains:
  – Backgammon: $10^{20}$ states (board size: 28)
  – Computer Go: $10^{35}$ states (board size: 9 × 9); $10^{171}$ states (board size: 19 × 19)
  – Autonomous helicopter: continuous high-dimensional state and action spaces

• In the following: two families of methods for handling large or continuous states/actions
  – use function approximation to estimate Q(s, a): gradient descent (TD with FA), batch learning.
  – optimize a parameterized π(a|s) (policy search – next lecture).


Value Function Approximation

(from Satinder Singh, RL: A tutorial at videolectures.net)

• Using function approximation: $V_t(s) = V(s, \theta_t)$

• Generalizing to unvisited states.

Function Approximation in ML

• Many possibilities for V(s, θ):
  – Neural network
  – Decision tree
  – Linear regression
  – Nearest neighbor
  – Gaussian process
  – Kernel methods
  – ...

• Training data: $\{(s_t, a_t, r_t, s_{t+1})\}$; regression targets for V:
  – MC methods: the return $R_t = \sum_{l=t}^{\infty} \gamma^{\,l-t} r_l$
  – TD(0): $r_t + \gamma V_t(s_{t+1})$
  – TD(λ): the λ-return $R^\lambda_t$

(a small sketch below computes these targets)
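For concreteness, an illustrative sketch (not from the slides) of how the three targets can be computed from one recorded episode. `rewards` and `values` are assumed inputs; `values` has length `len(rewards) + 1`, with the terminal state's value set to 0.

```python
import numpy as np

def mc_return(rewards, t, gamma):
    """Monte-Carlo target: R_t = sum_{l >= t} gamma^(l-t) * r_l."""
    return sum(gamma ** (l - t) * rewards[l] for l in range(t, len(rewards)))

def td0_target(rewards, values, t, gamma):
    """TD(0) target: r_t + gamma * V(s_{t+1})."""
    return rewards[t] + gamma * values[t + 1]

def lambda_return(rewards, values, t, gamma, lam):
    """lambda-return: weighted mixture of n-step returns (finite episode)."""
    T = len(rewards)
    G = 0.0
    for n in range(1, T - t):
        G_n = sum(gamma ** k * rewards[t + k] for k in range(n)) + gamma ** n * values[t + n]
        G += (1 - lam) * lam ** (n - 1) * G_n
    # remaining weight goes to the full Monte-Carlo return
    G += lam ** (T - t - 1) * mc_return(rewards, t, gamma)
    return G
```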


Performance Measure

• Minimize the mean-squared error (MSE) over some distribution P of the states:

$\rho(\theta_t) = \mathbb{E}_{s \sim P(\cdot)}\big[\, (V^\pi(s) - V_t(s, \theta_t))^2 \,\big]$

where $V^\pi(s)$ is the true value function of the policy π.

• In on-policy learning methods (e.g. SARSA), set P to the stationary distribution of policy π.

Stochastic Gradient-Descent Methods

• Update θ in direction of gradient:

$\theta_{t+1} = \theta_t - \tfrac{1}{2}\alpha_t \nabla_\theta \rho(\theta_t) = \theta_t + \alpha_t\, \mathbb{E}_P\big[\, (V^\pi(s) - V_t(s, \theta_t))\, \nabla_\theta V_t(s, \theta_t) \,\big]$

• Stochastic gradient descent (update from a single sampled state $s_t$):

$\theta_{t+1} = \theta_t + \alpha_t\, \big( V^\pi(s_t) - V_t(s_t, \theta_t) \big)\, \nabla_\theta V_t(s_t, \theta_t)$

(a minimal sketch follows below)
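A minimal sketch (not from the slides) of this update, assuming a linear parameterization (introduced a few slides later) so that ∇_θ V(s, θ) = φ(s); in practice the unknown V^π(s_t) is replaced by one of the targets on the next slide. The feature map `phi` is an illustrative assumption.

```python
import numpy as np

def sgd_value_update(theta, phi, s, target, alpha):
    """One step of theta <- theta + alpha * (target - V(s,theta)) * grad V(s,theta)."""
    features = phi(s)
    value = theta @ features            # V(s, theta) = theta . phi(s)
    return theta + alpha * (target - value) * features

# Example usage with hypothetical 3-dimensional polynomial features:
phi = lambda s: np.array([1.0, s, s ** 2])
theta = np.zeros(3)
theta = sgd_value_update(theta, phi, s=0.5, target=1.2, alpha=0.1)
```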

Stochastic Gradient-Descent Methods

• Bootstrapping: replace the unknown $V^\pi(s_t)$ by an estimate $v_t$. The estimate is called unbiased if $\mathbb{E}[v_t] = V^\pi(s_t)$.

• Using an unbiased estimate guarantees convergence to a local optimum (with suitable step sizes).

• The estimate $v_t$ in:
  – MC: the return $R_t$, unbiased

$\theta_{t+1} = \theta_t + \alpha_t\, \big( R_t - V(s_t, \theta_t) \big)\, \nabla_\theta V(s_t, \theta_t)$

  – TD(0): $r_t + \gamma V(s_{t+1}, \theta_t)$, biased (it bootstraps from the current estimate)

$\theta_{t+1} = \theta_t + \alpha_t\, \big( r_t + \gamma V(s_{t+1}, \theta_t) - V(s_t, \theta_t) \big)\, \nabla_\theta V(s_t, \theta_t)$

  – TD(λ): the λ-return $R^\lambda_t$, biased for λ < 1

$\theta_{t+1} = \theta_t + \alpha_t\, \big( R^\lambda_t - V(s_t, \theta_t) \big)\, \nabla_\theta V(s_t, \theta_t)$

Linear Function Approximation


Linear Function Approximation

• The estimate value function:

$V(s, \theta_t) = \theta_t^\top \phi(s)$

where $\theta \in \mathbb{R}^d$ is a parameter vector and $\phi : S \to \mathbb{R}^d$ maps states to d-dimensional feature vectors.

– Examples: polynomial, RBF, Fourier, or wavelet bases; tile coding. (These fixed bases can suffer from the curse of dimensionality.) A small feature-construction sketch follows below.
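As an illustration, a minimal sketch (not from the slides) of a linear value function with RBF features on a one-dimensional state space; the number of centers and the bandwidth are arbitrary choices.

```python
import numpy as np

def rbf_features(s, centers, width=0.5):
    """phi(s): one Gaussian bump per center, plus a constant bias feature."""
    bumps = np.exp(-0.5 * ((s - centers) / width) ** 2)
    return np.concatenate(([1.0], bumps))

centers = np.linspace(-2.0, 2.0, 9)        # 9 RBF centers on [-2, 2]
theta = np.zeros(centers.size + 1)         # d = 10 parameters

def V(s, theta):
    return theta @ rbf_features(s, centers)

# The tabular case of the next slide is recovered with one-hot (indicator)
# features: phi(s)[i] = 1 if s == s_i else 0, so that V(s_i, theta) = theta[i].
```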

Linear Function Approximation

Tabular-Lookup Features

• For a finite discrete state space $S = \{s_1, \ldots, s_n\}$:

$\phi(s) = \big( \delta_{s_1}(s),\; \delta_{s_2}(s),\; \ldots,\; \delta_{s_n}(s) \big)^\top, \quad \forall s \in S$

where $\delta_{s_i}(s)$ is the indicator (Kronecker delta) of state $s_i$.

• The value function $V(s, \theta)$, $\theta \in \mathbb{R}^n$:

$V(s, \theta) = [\theta_1, \theta_2, \ldots, \theta_n]\, \phi(s) = \theta^\top \phi(s)$

Linear Function Approximation

($10^{35}$ states, $10^{5}$ binary features and parameters.)
(Sutton, presentation at ICML 2009)

TD(λ) with Linear Function Approximation

• Applying stochastic approximation and bootstrapping, we can iteratively update the parameters (TD(0) with function approximation):

$\theta_{t+1} = \theta_t + \alpha_t\, \big[\, r_t + \gamma V(s_{t+1}, \theta_t) - V(s_t, \theta_t) \,\big]\, \phi(s_t)$

• TD(λ) (with an eligibility trace):

$e_t = \gamma \lambda\, e_{t-1} + \phi(s_t)$

$\theta_{t+1} = \theta_t + \alpha_t\, e_t\, \big[\, r_t + \gamma V(s_{t+1}, \theta_t) - V(s_t, \theta_t) \,\big]$

(a runnable sketch follows below)
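A minimal runnable sketch (not from the slides) of these updates over one recorded episode; the feature map `phi` and the episode format are assumptions.

```python
import numpy as np

def td_lambda_episode(theta, phi, episode, alpha=0.1, gamma=0.99, lam=0.8):
    """TD(lambda) with linear FA; episode is a list of (s, r, s_next, done)."""
    e = np.zeros_like(theta)                      # eligibility trace
    for s, r, s_next, done in episode:
        v = theta @ phi(s)
        v_next = 0.0 if done else theta @ phi(s_next)
        delta = r + gamma * v_next - v            # TD error
        e = gamma * lam * e + phi(s)              # decay and accumulate trace
        theta = theta + alpha * delta * e         # TD(lambda) update
    return theta
```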

Action-Value Function Approximation

• Q-value function: Q(s, a, θ)

• The mean-square error:

$\rho(\theta) = \mathbb{E}_{s, a \sim P(s, a)}\big[\, (Q^\pi(s, a) - Q(s, a, \theta))^2 \,\big]$

• Stochastic gradient update:

$\theta_{t+1} = \theta_t + \alpha_t\, \big( Q^\pi(s_t, a_t) - Q(s_t, a_t, \theta) \big)\, \nabla_\theta Q(s_t, a_t, \theta)$

• With linear function approximation:

$\theta_{t+1} = \theta_t + \alpha_t\, \big( Q^\pi(s_t, a_t) - Q(s_t, a_t, \theta) \big)\, \phi(s_t, a_t)$

Action-Value Linear Function Approximation

• SARSA(0):

$\theta_{t+1} = \theta_t + \alpha_t\, \big( r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big)\, \phi(s_t, a_t)$

• SARSA(λ):

$e_t = \gamma \lambda\, e_{t-1} + \phi(s_t, a_t)$

$\theta_{t+1} = \theta_t + \alpha_t\, e_t\, \big( r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big)$

SARSA(λ) with Linear Function Approximation

• Repeat (for each episode):
  – e = 0 (a vector of zeros); initialize $s_0$; t = 0
  – Repeat (for each step of the episode):
      $a_t = \pi(s_t)$, e.g. ε-greedy w.r.t. $\arg\max_a Q(s_t, a, \theta_t)$
      Take $a_t$, observe $r_t, s_{t+1}$
      $e_t = \gamma \lambda\, e_{t-1} + \phi(s_t, a_t)$
      $\theta_{t+1} = \theta_t + \alpha_t\, e_t\, \big[\, r_t + \gamma Q(s_{t+1}, \pi(s_{t+1}), \theta_t) - Q(s_t, a_t, \theta_t) \,\big]$
      t = t + 1
  – until $s_t$ is terminal
• until convergence

(a runnable sketch follows below)
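A minimal runnable sketch (not from the slides) of this loop. The environment interface (`reset`, `step`), the feature map `phi(s, a)`, and the discrete action set are assumptions.

```python
import numpy as np

def epsilon_greedy(theta, phi, s, actions, eps=0.1):
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    qs = [theta @ phi(s, a) for a in actions]
    return actions[int(np.argmax(qs))]

def sarsa_lambda(env, phi, actions, d, episodes=100,
                 alpha=0.05, gamma=0.99, lam=0.9, eps=0.1):
    theta = np.zeros(d)
    for _ in range(episodes):
        e = np.zeros(d)                               # reset trace each episode
        s = env.reset()
        a = epsilon_greedy(theta, phi, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(theta, phi, s_next, actions, eps)
            q = theta @ phi(s, a)
            q_next = 0.0 if done else theta @ phi(s_next, a_next)
            delta = r + gamma * q_next - q            # TD error
            e = gamma * lam * e + phi(s, a)           # eligibility trace
            theta = theta + alpha * delta * e
            s, a = s_next, a_next
    return theta
```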

SARSA(λ) with tile-coding function approximation


TD(λ) with Linear Function Approximation

• Convergence: if the state process $s_t$ is an ergodic Markov process whose stationary distribution equals the distribution under which updates are made (i.e. the on-policy distribution), TD(λ) with linear FA converges.

• The convergence property:

$\mathrm{MSE}(\theta_\infty) \le \dfrac{1 - \gamma\lambda}{1 - \gamma}\, \mathrm{MSE}(\theta^*)$

(Tsitsiklis & Van Roy, An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997)

• Is convergence guaranteed for off-policy methods (e.g. Q-learning with linear function approximation)? No, because:
  – bootstrapping,
  – the updates do not follow true gradients, and the sampling distribution differs from the on-policy distribution P.

Bootstrapping vs. Non-Bootstrapping


Q-Learning with Linear Function Approximation

A Counter-example from Baird (1995)


• Using exact updates (DP backups) for policy evaluation:

$\theta_{t+1} = \theta_t + \alpha \sum_{s} P(s)\, \big[\, \mathbb{E}\{ r_{t+1} + \gamma V_t(s') \mid s_t = s \} - V(s, \theta_t) \,\big]\, \nabla_\theta V_t(s, \theta_t)$

• If P(s) is uniform (≠ the true stationary distribution of the Markov chain), the asymptotic behaviour can become unstable.

Problems with Gradient-Descent Methods

• With nonlinear FA, even on-policy methods like SARSA may diverge (Tsitsiklis and Van Roy, 1997).

• With linear FA, off-policy methods like Q-learning may diverge (Baird, 1995).

• These are not true gradient-descent methods.

• Gradient TD methods (Sutton et al., 2008, 2009) address all of the above problems: they follow the true gradient direction of the projected Bellman error.


Gradient temporal difference learning

• GTD (gradient temporal difference learning)

• GTD2 (gradient temporal difference learning, version 2)

• TDC (temporal difference learning with gradient corrections)

Value Function Geometry

• Bellman operator:

$T V = R + \gamma P V$

(figure: the geometry of value functions in the space spanned by the feature vectors)

RMSBE: residual mean-squared Bellman error
RMSPBE: residual mean-squared projected Bellman error

TD performance measure

• Error from the true value: $\| V_\theta - V^* \|^2_P$

• Error in the Bellman update (used in the previous section: TD(0), GTD(0) methods): $\| V_\theta - T V_\theta \|^2_P$

• Error in the Bellman update after projection (GTD2 and TDC methods): $\| V_\theta - \Pi T V_\theta \|^2_P$

TD performance measure

• GTD(0): the norm of the expected TD update

$\mathrm{NEU}(\theta) = \mathbb{E}(\delta\phi)^\top\, \mathbb{E}(\delta\phi)$

• GTD2 and TDC: the norm of the expected TD update, weighted by the inverse covariance matrix of the features

$\mathrm{MSPBE}(\theta) = \mathbb{E}(\delta\phi)^\top\, \mathbb{E}(\phi\phi^\top)^{-1}\, \mathbb{E}(\delta\phi)$

(δ is the TD error. GTD2 and TDC differ only slightly in their derivation of the approximate gradient direction.)

Updates

• GTD(0):

$\theta_{t+1} = \theta_t + \alpha_t\, (\phi(s_t) - \gamma\phi(s_{t+1}))\, \phi(s_t)^\top w_t\; \pi(s_t, a_t)$

$w_{t+1} = w_t + \beta_t\, (\delta_t \phi(s_t) - w_t)\; \pi(s_t, a_t)$

• GTD2:

$\theta_{t+1} = \theta_t + \alpha_t\, (\phi(s_t) - \gamma\phi(s_{t+1}))\, \phi(s_t)^\top w_t$

$w_{t+1} = w_t + \beta_t\, (\delta_t - \phi(s_t)^\top w_t)\, \phi(s_t)$

• TDC:

$\theta_{t+1} = \theta_t + \alpha_t\, \delta_t \phi(s_t) - \alpha_t \gamma\, \phi(s_{t+1})\, \phi(s_t)^\top w_t$

$w_{t+1} = w_t + \beta_t\, (\delta_t - \phi(s_t)^\top w_t)\, \phi(s_t)$

where $\delta_t = r_t + \gamma\, \theta_t^\top \phi(s_{t+1}) - \theta_t^\top \phi(s_t)$ is the TD error.

(a sketch of the TDC update follows below)
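For concreteness, a minimal sketch (not from the slides) of the TDC update for a single transition, using the TD error as defined above; the step sizes are illustrative.

```python
import numpy as np

def tdc_update(theta, w, phi_t, phi_next, r, alpha=0.01, beta=0.05, gamma=0.99):
    """One TDC step; phi_t, phi_next are the feature vectors of s_t and s_{t+1}."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi_t)   # TD error
    theta = theta + alpha * delta * phi_t \
                  - alpha * gamma * phi_next * (phi_t @ w)     # gradient correction
    w = w + beta * (delta - phi_t @ w) * phi_t                 # auxiliary estimator
    return theta, w
```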

On Baird’s counterexample.


Summary

• Gradient TD algorithms with linear function approximation are guaranteed to converge under both general on-policy and off-policy training.

• The computational complexity per update is only O(n) (n is the number of features).

• The dependence on the size of the state space (the curse of dimensionality) is removed.

Batch Learning Methods


Batch Learning Methods

• Gradient-descent methods are simple and computationally efficient, but:
  – sensitive to the choice of learning rates and initial parameter values,
  – not sample-efficient.

• Batch learning is an alternative that searches for the best-fitting value function given the historical transition data:
  – Fitted Q-Iteration (FQI)
  – Experience replay in Deep Q-Networks (DQN)
  – Least-squares temporal difference (LSTD); least-squares policy iteration (LSPI)


Fitted Q-Iteration

• Given all experience $H = \{(s_t, a_t, r_t, s_{t+1})\}_{t=0}^{N}$.

• FQI approximates Q(s, a) by supervised regression, constructing at iteration k the training set

$D_k = \big\{\, \big( (s_i, a_i),\; \hat Q_k(s_i, a_i) \big) \,\big\}_{i=0}^{N}, \qquad \hat Q_k(s_i, a_i) = r_i + \gamma \max_{a'} \hat Q_{k-1}(s_{i+1}, a')$

• Hence FQI can be viewed as approximate Q-iteration.

• At each iteration, any regression technique can be used, with different choices of function approximator: neural networks, radial basis function networks, regression trees, etc. (a sketch follows below)
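A minimal sketch (not from the slides) of the FQI loop, using a tree-based regressor from scikit-learn as one possible choice; the data format and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor  # any regressor would do

def fitted_q_iteration(transitions, actions, n_iterations=50, gamma=0.99):
    """transitions: list of (s, a, r, s_next) with array states, discrete actions."""
    X = np.array([np.append(s, a) for s, a, r, s_next in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    model = None
    for k in range(n_iterations):
        if model is None:
            targets = rewards                      # Q_0 := 0, so the target is r
        else:
            q_next = np.column_stack([
                model.predict(np.array([np.append(s_next, a)
                                        for _, _, _, s_next in transitions]))
                for a in actions
            ])
            targets = rewards + gamma * q_next.max(axis=1)
        model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return model  # approximate Q(s, a); act greedily with respect to it
```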

Fitted Q-Iteration: Mountain Car Benchmark

• The control interval is Δt = 0.05 s.

• Actions are continuous, restricted to the interval [−4, 4].

• A neural network trained with RPROP is used, with a 3-5-5-1 configuration (input, two hidden layers, output).

• Training trajectories had a maximum length of 50 primitive control steps.

• After each trajectory/episode, run one FQI iteration (k = k + 1).

(Neural Fitted Q-Iteration, Martin Riedmiller, 2005.)

Experience Replay in Deep Q-networks (DQN)

• Given the training data $H = \{(s_i, a_i, r_i, s_{i+1})\}_{i=0}^{N}$.

• Recall the squared error:

$\rho_\theta = \sum_{i=0}^{N} \big( Q^\pi(s_i, a_i) - \hat Q(s_i, a_i, \theta) \big)^2 \;\propto\; \mathbb{E}_H\big[ (Q^\pi(s_i, a_i) - \hat Q(s_i, a_i, \theta))^2 \big] \;\approx\; \mathbb{E}_H\big[ (r_i + \gamma \max_a \hat Q(s_{i+1}, a, \theta) - \hat Q(s_i, a_i, \theta))^2 \big]$

• Using stochastic gradient descent:
  – sample a data point $(s, a, r, s') \sim H$
  – stochastic gradient update (from single samples):

$\theta_{k+1} = \theta_k + \alpha_k\, \big( r + \gamma \max_{a'} \hat Q(s', a', \theta_k) - \hat Q(s, a, \theta_k) \big)\, \nabla_\theta \hat Q(s, a, \theta_k)$

Experience Replay in Deep Q-networks (DQN)

• Q(s, a) is approximated using a deep neural network whose weights are θ.

• Loop until convergence:
  – take action $a_t$ ε-greedily according to $Q(s_t, a, \theta_t)$
  – add $(s_t, a_t, r_t, s_{t+1})$ to the replay memory D
  – sample a minibatch H of tuples $(s_i, a_i, r_i, s_{i+1})$ from D
  – optimize the squared error (as on the previous slide)

$\rho_\theta = \mathbb{E}_H\big[ (Q^\pi(s_i, a_i) - \hat Q(s_i, a_i, \theta))^2 \big]$

    and set $\theta_{t+1} = \theta^*$
  – t = t + 1

(a minimal sketch follows below)
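A heavily simplified sketch of the replay buffer and minibatch update in PyTorch (not the actual DQN implementation of Mnih et al.: no target network, no convolutional layers); network size and hyperparameters are arbitrary.

```python
import random
from collections import deque
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                    # replay memory D
gamma = 0.99

def store(s, a, r, s_next, done):
    replay.append((s, a, r, s_next, done))

def replay_update(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)    # sample minibatch H from D
    s, a, r, s_next, done = map(
        lambda x: torch.tensor(x, dtype=torch.float32), zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)     # squared TD error over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```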

DQN in Atari Games

(from David Silver, NIPS talk 2014)

DQN in Atari Games

• Q(s, a) is approximated using a deep neural network whose weights are θ.
• Input s is a stack of raw pixels from 4 consecutive frames.
• Control a is one of 18 joystick/button positions.

(from David Silver, NIPS talk 2014)

DQN in Atari Games

(from David Silver, NIPS talk 2014)

DQN in Atari Games: Experience Replay vs. No Replay

(from David Silver, NIPS talk 2014)

LSPI: Least Squares Policy Iteration

• Least-squares temporal difference (LSTD); least-squares policy iteration (LSPI). Two views:

– Bellman residual minimization

– Least-squares fixed-point approximation

Bellman residual minimization

• The Q-function of a given policy π fulfills, for any (s, a):

$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^\pi(s', \pi(s'))$

• Written as an optimization problem: minimize the Bellman residual error of the approximation $\hat Q$

$L(\hat Q) = \| R + \gamma P \Pi_\pi \hat Q - \hat Q \|$

where, in the notation of Lagoudakis & Parr, P is the transition matrix and $\Pi_\pi$ is the matrix encoding the policy π (so $P \Pi_\pi \hat Q$ gives the expected next-step value under π).

Bellman residual minimization

• The Bellman residual minimizing solution (the least-squares solution of an overconstrained system):

$\beta^\pi = \Big( (\Phi - \gamma P \Pi_\pi \Phi)^\top (\Phi - \gamma P \Pi_\pi \Phi) \Big)^{-1} (\Phi - \gamma P \Pi_\pi \Phi)^\top r$

• The solution $\beta^\pi$ is unique since the columns of Φ (the basis functions) are linearly independent by assumption. (See Lagoudakis & Parr, JMLR 2003, for details.)

(a small numerical comparison follows below)
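To make the two closed forms concrete, an illustrative sketch (not from the slides) on a random synthetic policy-evaluation problem; here `P_pi` stands in for the product P Π_π, all sizes are arbitrary, and the second formula is the fixed-point solution derived on the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sa, d, gamma = 12, 4, 0.9
Phi = rng.normal(size=(n_sa, d))                 # feature matrix, |S||A| x d
P_pi = rng.random((n_sa, n_sa))
P_pi /= P_pi.sum(axis=1, keepdims=True)          # random row-stochastic P * Pi_pi
r = rng.normal(size=n_sa)                        # reward vector

A = Phi - gamma * P_pi @ Phi
beta_brm = np.linalg.solve(A.T @ A, A.T @ r)     # Bellman residual minimization
beta_fp = np.linalg.solve(Phi.T @ A, Phi.T @ r)  # least-squares fixed point
```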

LSPI: Least Squares Fixed-Point Approximation

• Projection of $T_\pi Q$ back onto span(Φ):

$\hat T_\pi(Q) = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top (T_\pi Q)$

• Minimize the projected Bellman residual, i.e. solve

$\hat T_\pi(Q) = Q$

• The approximate fixed point:

$\beta^\pi = \big( \Phi^\top (\Phi - \gamma P \Pi_\pi \Phi) \big)^{-1} \Phi^\top r$

LSPI: Comparisons of two views

• The Bellman residual minimizing method focuses on the magnitude of the change.

• The least-squares fixed-point approximation focuses on the direction of the change; it is less stable and less predictable.

• Nevertheless, the least-squares fixed-point method is usually preferable, because:
  – learning the Bellman residual minimizing approximation requires double (paired) samples, and
  – experimentally, it often delivers superior policies.

(See Lagoudakis & Parr, JMLR 2003, for details.)

LSPI: LSTDQ algorithm

• LSTDQ estimates

$A = \Phi^\top (\Phi - \gamma P \Pi_\pi \Phi), \qquad b = \Phi^\top r$

from samples. Initialize A ← 0, b ← 0; then

• for each (s, a, r, s′) ∈ D:

$A \leftarrow A + \phi(s, a)\, \big( \phi(s, a) - \gamma\, \phi(s', \pi(s')) \big)^\top$

$b \leftarrow b + \phi(s, a)\, r$

• $\beta \leftarrow A^{-1} b$

(a sketch follows below)
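A minimal sketch (not from the slides); the feature map, the policy, and the small ridge regularizer (added only for numerical safety, not part of the slide's algorithm) are assumptions.

```python
import numpy as np

def lstdq(D, phi, policy, d, gamma=0.99, ridge=1e-6):
    """Estimate A and b from samples D = [(s, a, r, s_next), ...] and solve for beta."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, a, r, s_next in D:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    # small ridge term keeps A invertible when features are (nearly) dependent
    return np.linalg.solve(A + ridge * np.eye(d), b)
```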

LSPI algorithm

Given D and an initial policy π′:

• repeat
  – π ← π′
  – π′ ← LSTDQ(π), i.e. the greedy policy w.r.t. the Q-function given by $\beta^\pi$
• until π′ ≈ π (the policy no longer changes)

• return π

(a sketch follows below)
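A compact sketch (not from the slides) of the full LSPI loop, bundling an LSTDQ step as in the previous sketch; stopping on weight convergence is a simplification of the "until π ≈ π′" test. `phi(s, a)`, the action set, and the data D are assumed inputs.

```python
import numpy as np

def lstdq(D, phi, policy, d, gamma=0.99, ridge=1e-6):
    A, b = np.zeros((d, d)), np.zeros(d)
    for s, a, r, s_next in D:
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s_next, policy(s_next)))
        b += f * r
    return np.linalg.solve(A + ridge * np.eye(d), b)

def lspi(D, phi, actions, d, n_iterations=20, gamma=0.99):
    beta = np.zeros(d)
    for _ in range(n_iterations):
        # greedy policy w.r.t. the current Q(s, a) = beta . phi(s, a)
        policy = lambda s, beta=beta: max(actions, key=lambda a: beta @ phi(s, a))
        beta_new = lstdq(D, phi, policy, d, gamma)
        if np.allclose(beta_new, beta, atol=1e-6):   # weights (hence policy) converged
            break
        beta = beta_new
    return beta
```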

LSPI: Riding a bike

(from Alma A. M. Rahat's simulation)

• States: $\{\theta, \dot\theta, \omega, \dot\omega, \ddot\omega, \psi\}$, where θ is the angle of the handlebar, ω is the vertical angle of the bicycle, and ψ is the angle of the bicycle to the goal.

• Actions: $\{\tau, \nu\}$, where τ ∈ {−2, 0, 2} is the torque applied to the handlebar and ν ∈ {−0.02, 0, 0.02} is the displacement of the rider.

• For each a, the value function Q(s, a) uses 20 features:

$(1, \omega, \dot\omega, \omega^2, \omega\dot\omega, \theta, \dot\theta, \theta^2, \dot\theta^2, \theta\dot\theta, \omega\theta, \omega\theta^2, \omega^2\theta, \psi, \psi^2, \psi\theta, \bar\psi, \bar\psi^2, \bar\psi\theta)$

where $\bar\psi = \mathrm{sign}(\psi)\, \pi - \psi$.

LSPI: Riding a bike

(Converges after 8 iterations, based on 50,000 samples collected from 2,500 episodes.)

• Training samples were collected in advance by initializing the bicycle to a small random perturbation from the initial position (0, 0, 0, 0, 0, π/2) and running each episode for up to 20 steps using a purely random policy.

• Each successful ride must complete a distance of 2 kilometers.

• The experiment was repeated 100 times.

(from Lagoudakis & Parr, JMLR 2003)

LSPI: Riding a bike


Feature Selection/Building Problems

• Feature selection.

• Online/incremental feature learning.

Wu and Givan (2005); Keller et al. (2006); Mahadevan et al. (2006); Parr et al. (2007); Kolter and Ng (2009); Mahadevan and Liu (2010); Boots and Gordon (2010); Sun et al. (2011); etc.