Reinforcement Learning
Function Approximation

Continuous state/action space, mean-square error, gradient temporal difference learning, least-square temporal difference, least squares policy iteration

Vien Ngo, MLR, University of Stuttgart


Outline

• Function Approximation

• Batch Learning Methods

2/??

Value Iteration in Continuous MDP

V(s) = sup_a [ r(s, a) + γ ∫ P(s′|s, a) V(s′) ds′ ]
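This backup has no closed form in general; one common approximation is to discretize the state space and run tabular value iteration. A minimal numpy sketch, with the transition matrices `P` and rewards `r` invented purely for illustration:

```python
import numpy as np

# Toy discretization of a continuous MDP: n grid states, 2 actions.
# P[a] is an (n x n) row-stochastic matrix standing in for P(s'|s,a) ds',
# r[a] an immediate-reward vector -- both invented for illustration.
n, gamma = 5, 0.9
rng = np.random.default_rng(0)
P = [rng.dirichlet(np.ones(n), size=n) for _ in range(2)]
r = [rng.uniform(0, 1, size=n) for _ in range(2)]

V = np.zeros(n)
for _ in range(200):
    # V(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    V = np.max([r[a] + gamma * P[a] @ V for a in range(2)], axis=0)
```

Since the backup is a γ-contraction, 200 sweeps bring V within a tiny tolerance of its fixed point.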

3/??

Continuous state/actions in model-free RL

• DP with tabular lookup is fine in small finite state and action spaces: Q(s, a) is a |S| × |A| matrix of numbers, and π(a|s) is a |S| × |A| matrix of numbers.

• RL in large and continuous domains:
– Backgammon: 10^20 states (board size: 28)
– Computer Go: 10^35 states (board size 9 × 9); 10^171 states (board size 19 × 19)
– Autonomous helicopter: continuous high-dimensional state and action spaces

• In the following: two families of methods for handling large or continuous states/actions:
– use function approximation to estimate Q(s, a): gradient descent (TD with FA), batch learning.
– optimize a parameterized π(a|s) (policy search, next lecture).

4/?? Value Function Approximation

(from Satinder Singh, RL: A tutorial at videolectures.net)

• Using function approximation: V_t(s) = V(s, θ_t)

• Generalizing to unvisited states.

5/??

Function Approximation in ML

• Many possibilities for V(s, θ):
– Neural network
– Decision tree
– Linear regression
– Nearest neighbor
– Gaussian process
– Kernel methods
– ...

• Training data: {s_t, a_t, r_t, s_{t+1}}_i, with value targets:
– MC methods: R_t (= Σ_{l=t}^∞ γ^{l−t} r_l)
– TD(0): r_t + γ V_t(s_{t+1})
– TD(λ): R^λ_t

6/??

Performance Measure

• Minimize the mean-squared error (MSE) over some distribution P of the states:

ρ(θ_t) = E_{s∼P} [ (V^π(s) − V_t(s, θ_t))² ]

where V^π(s) is the true value function of the policy π.

• In on-policy learning methods (e.g. SARSA), set P to the stationary distribution of policy π.

7/??

• Update θ in the direction of the negative gradient:

θ_{t+1} = θ_t − (1/2) α_t ∇_θ ρ(θ_t)
        = θ_t + α_t E_P[ (V^π(s) − V_t(s, θ_t)) ∇_θ V_t(s, θ_t) ]

• Replacing the expectation by a single sample gives the stochastic update:

θ_{t+1} = θ_t + α_t (V^π(s_t) − V_t(s_t, θ_t)) ∇_θ V_t(s_t, θ_t)

8/??

• Bootstrapping: replace the unknown V^π(s_t) by an estimate v_t. If E[v_t] = V^π(s_t), the estimate is called unbiased.

• Using an unbiased estimate guarantees convergence to a local optimum.

• Choices of v_t:
– MC: unbiased

θ_{t+1} = θ_t + α_t (R_t − V(s_t, θ_t)) ∇_θ V(s_t, θ_t)

– TD(0): biased (the target bootstraps on the current estimate)

θ_{t+1} = θ_t + α_t (r_t + γ V(s_{t+1}, θ_t) − V(s_t, θ_t)) ∇_θ V(s_t, θ_t)

– TD(λ): biased for λ < 1

θ_{t+1} = θ_t + α_t (R^λ_t − V(s_t, θ_t)) ∇_θ V(s_t, θ_t)
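For concreteness, the MC and TD(0) variants of this update can be sketched with a linear approximator V(s, θ) = θᵀφ(s), whose gradient is just φ(s); the features and the episode data below are invented for illustration:

```python
import numpy as np

phi = lambda s: np.array([1.0, s, s * s])     # hypothetical features
theta = np.zeros(3)
alpha, gamma = 0.1, 0.9

# One invented episode of (state, reward) pairs.
episode = [(0.0, 1.0), (0.5, 0.0), (1.0, 2.0)]

# MC update: target is the observed return R_t (unbiased).
G = 0.0
for s, rwd in reversed(episode):
    G = rwd + gamma * G
    theta += alpha * (G - theta @ phi(s)) * phi(s)   # grad of theta.phi(s) is phi(s)

# TD(0) update: bootstrapped target r_t + gamma * V(s_{t+1}, theta) (biased).
# (The last state is treated as non-terminal here for brevity.)
for (s, rwd), (s_next, _) in zip(episode[:-1], episode[1:]):
    delta = rwd + gamma * theta @ phi(s_next) - theta @ phi(s)
    theta += alpha * delta * phi(s)
```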

9/??

Linear Function Approximation

10/??

Linear Function Approximation

• The estimated value function:

V(s, θ) = θ^T φ(s)

where θ ∈ R^d is a vector of parameters and φ : S → R^d is a mapping from states to d-dimensional feature vectors.

– Examples: polynomial, RBF, Fourier, wavelet bases, tile coding (these suffer from the curse of dimensionality).

11/??

Linear Function Approximation

Tabular-Lookup Features

• For a finite discrete state space S = {s_1, ..., s_n}:

φ(s) = (δ_{s_1}(s), δ_{s_2}(s), ..., δ_{s_n}(s))^T, ∀s ∈ S

where δ_{s_i}(s) is the Kronecker delta (1 if s = s_i, 0 otherwise).

• The value function V(s, θ), with θ ∈ R^n:

V(s, θ) = [θ_1, θ_2, ..., θ_n] (δ_{s_1}(s), δ_{s_2}(s), ..., δ_{s_n}(s))^T
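With these one-hot features, θᵀφ(s) reads out exactly one parameter per state, so linear function approximation subsumes the tabular case. A quick check (the value table is invented):

```python
import numpy as np

n = 4                                    # finite state space {0, 1, 2, 3}
phi = lambda s: np.eye(n)[s]             # Kronecker-delta (one-hot) features
theta = np.array([0.5, -1.0, 2.0, 0.0])  # one parameter per state = a value table

V = lambda s: theta @ phi(s)             # theta^T phi(s) is plain table lookup
```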

12/??

Linear Function Approximation

(10^35 states, 10^5 binary features and parameters.)
(Sutton, presentation at ICML 2009)

13/??

TD(λ) with Linear Function Approximation

• Applying stochastic approximation and bootstrapping, we can iteratively update the parameters (TD(0) with function approximation):

θ_{t+1} = θ_t + α_t [ r_t + γ V(s_{t+1}, θ_t) − V(s_t, θ_t) ] φ(s_t)

• TD(λ) (with eligibility trace):

e_t = γλ e_{t−1} + φ(s_t)
θ_{t+1} = θ_t + α_t e_t [ r_t + γ V(s_{t+1}, θ_t) − V(s_t, θ_t) ]
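A minimal sketch of the two updates above; the features and the stream of transitions (s, r, s′) are invented for illustration:

```python
import numpy as np

d, gamma, lam, alpha = 3, 0.9, 0.8, 0.1
phi = lambda s: np.array([1.0, s, s * s])   # hypothetical features
theta, e = np.zeros(d), np.zeros(d)

# Invented stream of transitions (s, r, s').
transitions = [(0.0, 1.0, 0.5), (0.5, 0.0, 1.0), (1.0, 2.0, 0.0)]
for s, rwd, s_next in transitions:
    delta = rwd + gamma * theta @ phi(s_next) - theta @ phi(s)
    e = gamma * lam * e + phi(s)            # accumulating eligibility trace
    theta = theta + alpha * delta * e       # credit all recently visited features
```

With λ = 0 the trace reduces to φ(s_t) and the TD(0) update is recovered.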

14/??

Action-Value Function Approximation

• Q-value function: Q(s, a, θ)

• The mean-squared error:

ρ(θ) = E_{s,a∼P(s,a)} [ (Q^π(s, a) − Q(s, a, θ))² ]

with gradient update

θ_{t+1} = θ_t + α_t (Q^π(s_t, a_t) − Q(s_t, a_t, θ)) ∇_θ Q(s_t, a_t, θ)

• With linear function approximation, Q(s, a, θ) = θ^T φ(s, a):

θ_{t+1} = θ_t + α_t (Q^π(s_t, a_t) − Q(s_t, a_t, θ)) φ(s_t, a_t)

15/??

Action-Value Linear Function Approximation

• SARSA(0):

θ_{t+1} = θ_t + α_t (r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)) φ(s_t, a_t)

• SARSA(λ):

e_t = γλ e_{t−1} + φ(s_t, a_t)
θ_{t+1} = θ_t + α_t e_t (r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))

16/??

SARSA(λ) with Linear Function Approximation

• While (not converged):
– e = 0 (a vector of zeros)
– initial state s = s_0, t = 0
– Repeat (for each step of the episode):

a_t = π(s_t), e.g. ε-greedy with respect to argmax_a Q(s_t, a, θ_t)
Take a_t, observe r_t, s_{t+1}
e_t = γλ e_{t−1} + φ(s_t, a_t)
θ_{t+1} = θ_t + α_t e_t [ r_t + γ Q(s_{t+1}, π(s_{t+1}), θ_t) − Q(s_t, a_t, θ_t) ]
t = t + 1

– until s_t is terminal.
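The loop above in Python, on an invented 3-state chain with one-hot state-action features (dynamics, reward, and all constants are made up for illustration):

```python
import numpy as np

# SARSA(lambda) with linear FA on an invented 3-state chain (goal: state 2).
n_s, n_a = 3, 2
d = n_s * n_a
gamma, lam, alpha, eps = 0.95, 0.9, 0.1, 0.5
rng = np.random.default_rng(1)

def phi(s, a):                            # one-hot state-action features
    f = np.zeros(d)
    f[s * n_a + a] = 1.0
    return f

def step(s, a):                           # toy dynamics: action 1 moves right
    s2 = min(s + 1, n_s - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == n_s - 1 else 0.0), s2

def act(s, theta):                        # epsilon-greedy w.r.t. Q(s, ., theta)
    if rng.random() < eps:
        return int(rng.integers(n_a))
    return int(np.argmax([theta @ phi(s, b) for b in range(n_a)]))

theta = np.zeros(d)
for _ in range(50):                       # episodes
    e = np.zeros(d)                       # e = 0 at the start of each episode
    s = int(rng.integers(n_s - 1))        # random non-goal start state
    a = act(s, theta)
    for _ in range(20):                   # steps
        rwd, s2 = step(s, a)
        a2 = act(s2, theta)
        delta = rwd + gamma * theta @ phi(s2, a2) - theta @ phi(s, a)
        e = gamma * lam * e + phi(s, a)   # accumulating eligibility trace
        theta = theta + alpha * delta * e
        s, a = s2, a2
        if s == n_s - 1:                  # terminal: goal reached
            break
```

Here reaching the right end (state 2) pays reward 1 and ends the episode; Q(2, ·) is never updated, which implicitly fixes the terminal value at zero.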

17/?? SARSA(λ) with tile-coding function approximation

18/?? TD(λ) with Linear Function Approximation

• Convergence proof: requires that the state process s_t is an ergodic Markov process whose stationary distribution is the same as the stationary distribution of the underlying MDP (i.e. the on-policy distribution).

• The convergence guarantee:

MSE(θ_∞) ≤ ((1 − γλ)/(1 − γ)) MSE(θ*)

(Tsitsiklis & Van Roy, An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997)

• Is there a convergence guarantee for off-policy methods (e.g. Q-learning with linear function approximation)? NO, because of:
– bootstrapping,
– not following the true gradient (i.e. sampling states from the wrong distribution P).

19/?? Bootstrapping vs. Non-Bootstrapping

20/?? Q-Learning with Linear Function Approximation

A Counter-example from Baird (1995)

21/??

• Using exact updates (DP backups) for policy evaluation:

θ_{t+1} = θ_t + α Σ_{s∼P(s)} [ E{r_{t+1} + γ V_t(s′) | s_t = s} − V_t(s, θ_t) ] ∇_θ V_t(s, θ_t)

• If P(s) is uniform (≠ the true stationary distribution of the Markov chain), then the asymptotic behaviour becomes unstable.

22/??

• With nonlinear FA, on-policy methods like SARSA may still diverge (Tsitsiklis and Van Roy, 1997).

• With linear FA, off-policy methods like Q-learning may diverge (Baird, 1995).

• These are not true gradient-descent methods.

• Gradient TD methods (Sutton et al., 2008, 2009) can solve all of the above problems: they follow the true gradient direction of the projected Bellman error.

23/?? • GTD (gradient temporal difference learning)

• GTD2 (gradient temporal difference learning, version 2)

• TDC (temporal difference learning with gradient corrections)

24/??

Value Function Geometry

• Bellman operator:

TV = R + γPV

(The plane is the space spanned by the feature vectors.)

RMSBE: residual mean-squared Bellman error
RMSPBE: residual mean-squared projected Bellman error

25/??

TD Performance Measure

• Error from the true value: ||V_θ − V*||²_P

• Error in the Bellman update (used in the previous section: TD(0), GTD(0) methods): ||V_θ − TV_θ||²_P

• Error in the Bellman update after projection (TDC and GTD2 methods): ||V_θ − ΠTV_θ||²_P

26/??

TD Performance Measure

• GTD(0): the norm of the expected TD update:

NEU(θ) = E(δφ)^T E(δφ)

• GTD2 and TDC: the norm of the expected TD update, weighted by the inverse covariance matrix of the features:

MSPBE(θ) = E(δφ)^T E(φφ^T)^{−1} E(δφ)

(δ is the TD error. GTD2 and TDC differ slightly in their derivation of the approximate gradient direction.)

27/??

• GTD(0):

θ_{t+1} = θ_t + α_t (φ(s_t) − γφ(s_{t+1})) (φ(s_t)^T w_t) π(s_t, a_t)
w_{t+1} = w_t + β_t (δ_t φ(s_t) − w_t) π(s_t, a_t)

• GTD2:

θ_{t+1} = θ_t + α_t (φ(s_t) − γφ(s_{t+1})) (φ(s_t)^T w_t)
w_{t+1} = w_t + β_t (δ_t − φ(s_t)^T w_t) φ(s_t)

• TDC:

θ_{t+1} = θ_t + α_t δ_t φ(s_t) − α_t γ φ(s_{t+1}) (φ(s_t)^T w_t)
w_{t+1} = w_t + β_t (δ_t − φ(s_t)^T w_t) φ(s_t)

where δ_t = r_t + γ θ_t^T φ(s_{t+1}) − θ_t^T φ(s_t) is the TD error.

28/??

On Baird's counterexample.
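A minimal sketch of one of the gradient-TD updates above (TDC), with hypothetical features and an invented stream of transitions (s, r, s′); the secondary weight vector w is learned with its own step size β:

```python
import numpy as np

# TDC sketch with linear features on an invented transition stream (s, r, s').
d, gamma = 3, 0.9
alpha, beta = 0.05, 0.1                      # step sizes for theta and w
phi = lambda s: np.array([1.0, s, s * s])    # hypothetical features
theta, w = np.zeros(d), np.zeros(d)

transitions = [(0.0, 1.0, 0.5), (0.5, 0.0, 1.0), (1.0, 2.0, 0.0)] * 20
for s, rwd, s2 in transitions:
    f, f2 = phi(s), phi(s2)
    delta = rwd + gamma * theta @ f2 - theta @ f            # TD error
    theta = theta + alpha * delta * f - alpha * gamma * f2 * (f @ w)
    w = w + beta * (delta - f @ w) * f                      # secondary weights
```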

29/?? Summary

• Gradient TD algorithms with linear function approximation are guaranteed convergent under both general on-policy and off-policy training.

• The computational complexity is only O(n) (n is the number of features).

• The curse of dimensionality is removed.

30/?? Batch Learning Methods

31/?? Batch Learning Methods

• Gradient-descent methods are simple and computationally efficient, but:
– sensitive to the choice of learning rates and initial parameter values,
– not sample-efficient.

• Batch learning is an alternative that searches for the best-fitting value function given the historical transition data:
– Fitted Q-Iteration (FQI)
– Experience replay in Deep Q-Networks (DQN)
– Least-squares temporal difference (LSTD); LSPI

32/?? Batch Learning Methods

33/??

Fitted Q-Iteration

• Given all experience H = {(s_t, a_t, r_t, s_{t+1})}_{t=0}^N.

• FQI approximates Q(s, a) by supervised regression, constructing at iteration k the training data

D_k = {((s_i, a_i), Q̂_k(s_i, a_i))}_{i=0}^N

where Q̂_k(s_i, a_i) = r_i + γ max_{a′} Q̂_{k−1}(s_{i+1}, a′).

• Hence, FQI can be viewed as approximate Q-iteration.

• At each iteration, any regression technique can be used, with different choices of function approximator: neural networks, radial basis function networks, regression trees, etc.
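A compact FQI sketch: one-hot features with least-squares regression as the supervised learner, on an invented 2-state, 2-action toy MDP (deterministic next state s′ = a; reward 1 only for action 1 in state 0). Everything here is made up for illustration:

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)

def phi(s, a):                                   # one-hot (s, a) features
    f = np.zeros(4)
    f[s * 2 + a] = 1.0
    return f

# Invented experience H = {(s_i, a_i, r_i, s'_i)}: s' = a, r = 1 iff (s,a) = (0,1).
H = [(int(rng.integers(2)), int(rng.integers(2))) for _ in range(200)]
H = [(s, a, float(s == 0 and a == 1), a) for s, a in H]

beta = np.zeros(4)                               # Q_k(s, a) = beta . phi(s, a)
X = np.array([phi(s, a) for s, a, _, _ in H])
for _ in range(50):                              # FQI iterations
    # regression targets: r_i + gamma * max_a' Q_{k-1}(s'_i, a')
    y = np.array([r + gamma * max(beta @ phi(s2, b) for b in range(2))
                  for _, _, r, s2 in H])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With one-hot features the regression simply averages the targets per (s, a) cell, so each iteration performs an exact approximate-Q-iteration backup and beta converges geometrically.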

34/??

Fitted Q-Iteration: Mountain Car Benchmark

• The control interval is Δt = 0.05 s.

• Actions are continuous, restricted to the interval [−4, 4].

• A neural network trained with RPROP, configuration 3-5-5-1 (input-hidden-hidden-output layers).

• Training trajectories had a maximum length of 50 primitive control steps.

• After each trajectory/episode, run one FQI iteration (k = k + 1).

(Neural fitted Q-learning, Martin Riedmiller, 2005.)

35/??

Experience Replay in Deep Q-networks (DQN)

• Given the training data H = {(s_i, a_i, r_i, s_{i+1})}_{i=0}^N.

• Recall the squared error:

ρ(θ) = Σ_{i=0}^N (Q^π(s_i, a_i) − Q̂(s_i, a_i, θ))²
     = E_H (Q^π(s_i, a_i) − Q̂(s_i, a_i, θ))²
     ≈ E_H (r_i + γ max_a Q̂(s_{i+1}, a, θ) − Q̂(s_i, a_i, θ))²

• Using stochastic gradient descent:
– sample a data point (s, a, r, s′) ∼ H
– stochastic gradient update (by single samples):

θ_{k+1} = θ_k + α_k (r + γ max_{a′} Q̂(s′, a′, θ_k) − Q̂(s, a, θ_k)) ∇_θ Q̂(s, a, θ_k)

36/?? Experience Replay in Deep Q-networks (DQN)

• Q(s, a) is approximated using a deep neural network whose weights are θ.

• Loop until convergence:
– take action a_t, ε-greedy according to Q(s_t, a, θ_t)
– add (s_t, a_t, r_t, s_{t+1}) to the replay memory D
– sample a subset H of (s_i, a_i, r_i, s_{i+1}) from D
– optimize the least-squares error (as on the previous slide)

ρ(θ) = E_H (Q^π(s_i, a_i) − Q̂(s_i, a_i, θ))²

and set θ_{t+1} = θ*
– t = t + 1
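The replay memory itself is simple; a minimal sketch (the class name, capacity, and toy interaction loop are invented):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)     # oldest transitions are evicted

    def add(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        # i.i.d. minibatch: breaks the correlation between consecutive steps
        return random.sample(list(self.buf), batch_size)

mem = ReplayMemory()
for t in range(100):                          # invented interaction loop
    mem.add(t, t % 4, 0.0, t + 1)
batch = mem.sample(32)
```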

37/??

DQN in Atari Games

(from David Silver, NIPS talk, 2014)

38/??

DQN in Atari Games

• Q(s, a) is approximated using a deep neural network whose weights are θ.
• Input s is a stack of raw pixels from 4 consecutive frames.
• Control a is one of 18 joystick/button positions.

(from David Silver, NIPS talk, 2014)

39/??

DQN in Atari Games

(from David Silver, NIPS talk, 2014)

40/??

DQN in Atari Games: Experience Replay vs. no Replay

(from David Silver, NIPS talk, 2014)

41/?? LSPI: Least Squares Policy Iteration

• Least-square temporal difference (LSTD) method; LSPI.

– Bellman residual minimization

– Least Squares Fixed-Point Approximation

42/??

Bellman residual minimization

• The Q-function for a given policy π fulfills, for any s, a:

Q^π(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) Q^π(s′, π(s′))

• Written as an optimization: minimize the Bellman residual error

L(Q^π) = ||R + γPΠQ^π − Q^π||

where Π is the matrix form of the policy, i.e. (ΠQ)(s′) = Q(s′, π(s′)).

43/??

Bellman residual minimization

• The solution of Bellman residual minimization (an overconstrained system):

β^π = ((Φ − γPΠΦ)^T (Φ − γPΠΦ))^{−1} (Φ − γPΠΦ)^T r

• The solution β^π is unique, since the columns of Φ (the basis functions) are linearly independent by definition.

(See Lagoudakis & Parr (JMLR 2003) for details.)

44/??

LSPI: Least Squares Fixed-Point Approximation

• Project T_π Q back onto span(Φ):

T̂_π(Q) = Φ(Φ^T Φ)^{−1} Φ^T (T_π Q)

• Require the projected Bellman residual to vanish:

T̂_π(Q) = Q

• The approximate fixed point:

β^π = (Φ^T (Φ − γPΠΦ))^{−1} Φ^T r

45/??

LSPI: Comparison of the two views

• The Bellman-residual-minimizing method focuses on the magnitude of the change.

• The least-squares fixed-point approximation focuses on the direction of the change.

• The least-squares fixed-point approximation is less stable and less predictable.

• Nevertheless, the least-squares fixed-point method might be preferable, because:
– learning the Bellman-residual-minimizing approximation requires doubled samples,
– experimentally, it often delivers superior policies.

(See Lagoudakis & Parr (JMLR 2003) for details.)

46/??

LSPI: LSTDQ algorithm

Estimate A = Φ^T (Φ − γPΠΦ) and b = Φ^T r from samples:

• A ← 0, b ← 0

• For each (s, a, r, s′) ∈ D:

A ← A + φ(s, a) (φ(s, a) − γφ(s′, π(s′)))^T
b ← b + φ(s, a) r

• β ← A^{−1} b
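A numpy sketch of this loop on an invented 2-state, 2-action batch (one-hot features; next state s′ = a; reward 1 only for (s, a) = (0, 1); the evaluated policy always takes action 1). For this toy problem the exact values are Q^π(0,1) = 1, Q^π(0,0) = γ, Q^π(1,1) = 0, which the solve recovers:

```python
import numpy as np

gamma, d = 0.9, 4

def phi(s, a):                       # one-hot (s, a) features: 2 states x 2 actions
    f = np.zeros(d)
    f[s * 2 + a] = 1.0
    return f

policy = lambda s: 1                 # fixed policy to evaluate: always action 1

# Invented batch D: s' = a, reward 1 iff (s, a) = (0, 1).
D = [(s, a, float(s == 0 and a == 1), a) for s in (0, 1) for a in (0, 1)] * 25

A = np.zeros((d, d))
b = np.zeros(d)
for s, a, r, s2 in D:
    f = phi(s, a)
    A += np.outer(f, f - gamma * phi(s2, policy(s2)))
    b += f * r
beta = np.linalg.solve(A, b)         # beta approximates Q^pi in feature space
```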

47/??

LSPI algorithm

Given D:

• repeat

π ← π′
π′ ← LSTDQ(π) (π′ is the greedy policy with respect to β^π)

until π ≈ π′

• return π

48/??

LSPI: Riding a bike

(from Alma A. M. Rahat's simulation.)

• States: {θ, θ̇, ω, ω̇, ω̈, ψ}, where θ is the angle of the handlebar, ω is the vertical angle of the bicycle, and ψ is the angle of the bicycle to the goal.

• Actions: {τ, ν}, where τ ∈ {−2, 0, 2} is the torque applied to the handlebar and ν ∈ {−0.02, 0, 0.02} is the displacement of the rider.

• For each a, the value function Q(s, a) uses the features

(1, ω, ω̇, ω², ωω̇, θ, θ̇, θ², θ̇², θθ̇, ωθ, ωθ², ω²θ, ψ, ψ², ψθ, ψ̄, ψ̄², ψ̄θ)

where ψ̄ = sign(ψ) π − ψ.

49/??

LSPI: Riding a bike

(Converges after 8 iterations, based on 50,000 samples collected from 2,500 episodes.)

• Training samples were collected in advance by initializing the bicycle to a small random perturbation from the initial position (0, 0, 0, 0, 0, π/2) and running each episode up to 20 steps using a purely random policy.

• Each successful ride must complete a distance of 2 kilometers.

• This experiment was repeated 100 times.

(from Lagoudakis & Parr, JMLR 2003)

50/??

LSPI: Riding a bike

51/??

Feature Selection/Building Problems

• Feature selection.

• Online/incremental feature learning.

Wu and Givan (2005); Keller et al. (2006); Parr et al. (2007); Mahadevan et al. (2006); Kolter and Ng (2009); Boots and Gordon (2010); Mahadevan and Liu (2010); Sun et al. (2011); etc.

52/??