Reinforcement Learning Lecture Function Approximation

Transcript of Reinforcement Learning Lecture Function Approximation

Page 1: Reinforcement Learning Lecture Function Approximation

Reinforcement Learning

Function Approximation

Continuous state/action space, mean-square error, gradient temporal difference learning, least-square temporal difference, least squares policy iteration

Vien Ngo, MLR, University of Stuttgart

Page 2: Reinforcement Learning Lecture Function Approximation

Outline

• Function Approximation

– Gradient Descent Methods

– Batch Learning Methods


Page 3: Reinforcement Learning Lecture Function Approximation

Value Iteration in Continuous MDP

V(s) = sup_a [ r(s, a) + γ ∫ P(s′ | s, a) V(s′) ds′ ]


Page 4: Reinforcement Learning Lecture Function Approximation

Continuous state/actions in model-free RL

• DP with tabular lookup is fine in small finite state and action spaces: Q(s, a) is a |S| × |A| matrix of numbers, and π(a|s) is a |S| × |A| matrix of numbers.

• RL in large and continuous domains:
– Backgammon: 10^20 states (board size: 28)
– Computer Go: 10^35 states (board size 9 × 9); 10^171 states (board size 19 × 19)
– Autonomous helicopter: continuous high-dimensional state and action spaces

• In the following: two families of methods for handling large or continuous states/actions:
– use function approximation to estimate Q(s, a): gradient descent (TD with FA), batch learning;
– optimize a parameterized π(a|s) (policy search – next lecture).

Page 7: Reinforcement Learning Lecture Function Approximation

Value Function Approximation

(from Satinder Singh, RL: A tutorial at videolectures.net)

• Using function approximation: V_t(s) = V(s, θ_t)

• Generalizing to unvisited states.


Page 8: Reinforcement Learning Lecture Function Approximation

Function Approximation in ML

• Many possibilities for V(s, θ):
– Neural network
– Decision tree
– Linear regression
– Nearest neighbor
– Gaussian process
– Kernel methods
– . . .

• Training data: {(s_t, a_t, r_t, s_{t+1})}, with targets:
– MC methods: the return R_t (= Σ_{l=t}^∞ γ^{l−t} r_l)
– TD(0): r_t + γ V_t(s_{t+1})
– TD(λ): the λ-return R_t^λ

Page 10: Reinforcement Learning Lecture Function Approximation

Performance Measure

• Minimizing the mean-squared error (MSE) over some distribution P of the states:

ρ(θ_t) = E_{s∼P} [ (V^π(s) − V_t(s, θ_t))^2 ]

where V^π(s) is the true value function of the policy π.

• Set P to the stationary distribution of the policy π in on-policy learning methods (e.g. SARSA).


Page 11: Reinforcement Learning Lecture Function Approximation

Stochastic Gradient-Descent Methods

• Update θ in the direction of the gradient:

θ_{t+1} = θ_t − (1/2) α_t ∇_θ ρ(θ_t)
        = θ_t + α_t E_P [ (V^π(s) − V_t(s, θ_t)) ∇_θ V_t(s, θ_t) ]

• Stochastic gradient descent (update from a single sampled state):

θ_{t+1} = θ_t + α_t (V^π(s_t) − V_t(s_t, θ_t)) ∇_θ V_t(s_t, θ_t)

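As a minimal sketch of this update (assuming a generic differentiable value function exposed through hypothetical value_fn and grad_value_fn helpers, which are not part of the lecture):

import numpy as np

def sgd_value_update(theta, s, target, value_fn, grad_value_fn, alpha):
    # theta_{t+1} = theta_t + alpha * (target - V(s, theta)) * grad_theta V(s, theta)
    error = target - value_fn(s, theta)
    return theta + alpha * error * grad_value_fn(s, theta)

# For a linear model V(s, theta) = theta^T phi(s), the gradient is just phi(s):
def linear_value(phi):
    value_fn = lambda s, theta: phi(s) @ theta
    grad_fn = lambda s, theta: phi(s)
    return value_fn, grad_fn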

Page 12: Reinforcement Learning Lecture Function Approximation

Stochastic Gradient-Descent Methods

• Bootstrapping: replace V^π(s_t) by an estimate v_t; if E[v_t(s_t)] = V^π(s_t), v_t is called an unbiased estimate.

• Using an unbiased estimate guarantees convergence to a local optimum.

• Choices of v_t:
– MC: unbiased

θ_{t+1} = θ_t + α_t (R_t − V(s_t, θ_t)) ∇_θ V(s_t, θ_t)

– TD(0): biased (the target bootstraps on the current estimate)

θ_{t+1} = θ_t + α_t (r_t + γ V(s_{t+1}, θ_t) − V(s_t, θ_t)) ∇_θ V(s_t, θ_t)

– TD(λ): biased for λ < 1

θ_{t+1} = θ_t + α_t (R_t^λ − V(s_t, θ_t)) ∇_θ V(s_t, θ_t)


Page 13: Reinforcement Learning Lecture Function Approximation

Linear Function Approximation


Page 14: Reinforcement Learning Lecture Function Approximation

Linear Function Approximation

• The estimated value function:

V(s, θ_t) = θ_t^⊤ φ(s)

where θ ∈ ℝ^d is a vector of parameters and φ : S → ℝ^d maps states to d-dimensional feature vectors.

– Examples: polynomial, RBF, Fourier, or wavelet bases, tile coding. (These suffer from the curse of dimensionality.)
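As an illustration, a linear value function with Gaussian RBF features could look like the following sketch (the grid of centers, the bandwidth, and the 2-D state space are made-up example choices, not taken from the lecture):

import numpy as np

def rbf_features(s, centers, bandwidth=0.5):
    # phi(s): Gaussian RBF activations of state s around fixed centers
    s = np.atleast_1d(s)
    sq_dist = np.sum((centers - s) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

# 25 centers on a regular grid over a 2-D state space (illustration only)
centers = np.array([[x, y] for x in np.linspace(0, 1, 5) for y in np.linspace(0, 1, 5)])
theta = np.zeros(len(centers))

def V(s, theta):
    return theta @ rbf_features(s, centers)   # V(s, theta) = theta^T phi(s)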

Page 15: Reinforcement Learning Lecture Function Approximation

Linear Function Approximation

Tabular-Lookup Features

• For a finite discrete state space S = {s_1, . . . , s_n}:

φ(s) = ( δ_{s_1}(s), δ_{s_2}(s), . . . , δ_{s_n}(s) )^⊤,  ∀s ∈ S

where δ_{s_i} is the (Kronecker) delta function: δ_{s_i}(s) = 1 if s = s_i, and 0 otherwise.

• The value function V(s, θ), with θ ∈ ℝ^n (one parameter per state):

V(s, θ) = [θ_1, θ_2, . . . , θ_n] × ( δ_{s_1}(s), δ_{s_2}(s), . . . , δ_{s_n}(s) )^⊤


Page 16: Reinforcement Learning Lecture Function Approximation

Linear Function Approximation

(10^35 states, 10^5 binary features and parameters.) (Sutton, presentation at ICML 2009)


Page 17: Reinforcement Learning Lecture Function Approximation

TD(λ) with Linear Function Approximation

• Applying stochastic approximation and bootstrapping, we can iteratively update the parameters (TD(0) with function approximation):

θ_{t+1} = θ_t + α_t [ r_t + γ V(s_{t+1}, θ_t) − V(s_t, θ_t) ] φ(s_t)

• TD(λ) (with eligibility trace):

e_t = γλ e_{t−1} + φ(s_t)
θ_{t+1} = θ_t + α_t e_t [ r_t + γ V(s_{t+1}, θ_t) − V(s_t, θ_t) ]

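The two update equations translate almost line by line into code. Below is a minimal sketch assuming a Gym-style environment (reset/step), a fixed evaluation policy, and a feature map phi: state → ℝ^d; these interfaces are assumptions, not given in the lecture:

import numpy as np

def td_lambda_linear(env, policy, phi, d, gamma=0.99, lam=0.8, alpha=0.01, episodes=100):
    # TD(lambda) policy evaluation with linear FA: V(s, theta) = theta^T phi(s)
    theta = np.zeros(d)
    for _ in range(episodes):
        s, done = env.reset(), False
        e = np.zeros(d)                       # eligibility trace
        while not done:
            s_next, r, done, _ = env.step(policy(s))
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v    # TD error
            e = gamma * lam * e + phi(s)      # e_t = gamma*lambda*e_{t-1} + phi(s_t)
            theta += alpha * delta * e        # theta_{t+1} = theta_t + alpha_t * delta_t * e_t
            s = s_next
    return theta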

Page 18: Reinforcement Learning Lecture Function Approximation

Action-Value Function Approximation

• Q-value function: Q(s, a, θ)

• The mean-squared error:

ρ(θ) = E_{(s,a)∼P} [ (Q^π(s, a) − Q(s, a, θ))^2 ]

• Stochastic gradient update:

θ_{t+1} = θ_t + α_t (Q^π(s_t, a_t) − Q(s_t, a_t, θ)) ∇_θ Q(s_t, a_t, θ)

• With linear function approximation:

θ_{t+1} = θ_t + α_t (Q^π(s_t, a_t) − Q(s_t, a_t, θ)) φ(s_t, a_t)


Page 19: Reinforcement Learning Lecture Function Approximation

Action-Value Linear Function Approximation

• SARSA(0):

θ_{t+1} = θ_t + α_t ( r_t + γ Q(s_{t+1}, a_{t+1}, θ_t) − Q(s_t, a_t, θ_t) ) φ(s_t, a_t)

• SARSA(λ):

e_t = γλ e_{t−1} + φ(s_t, a_t)
θ_{t+1} = θ_t + α_t e_t ( r_t + γ Q(s_{t+1}, a_{t+1}, θ_t) − Q(s_t, a_t, θ_t) )


Page 20: Reinforcement Learning Lecture Function Approximation

SARSA(λ) with Linear Function Approximation

• While (not converged):
– e = 0 (a vector of zeros)
– initial state s_0, t = 0
– Repeat (for each step of the episode):
a_t = π(s_t), e.g. ε-greedy according to argmax_a Q(s_t, a, θ_t)
Take a_t, observe r_t, s_{t+1}
e_t = γλ e_{t−1} + φ(s_t, a_t)
θ_{t+1} = θ_t + α_t e_t [ r_t + γ Q(s_{t+1}, π(s_{t+1}), θ_t) − Q(s_t, a_t, θ_t) ]
t = t + 1
– until s_t is terminal.

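A direct transcription of this pseudocode into Python might look as follows; it is a sketch in which the Gym-style env, the state-action feature map phi_sa, and the discrete action list are assumed interfaces, not given in the lecture:

import numpy as np

def sarsa_lambda_linear(env, phi_sa, actions, d, gamma=0.99, lam=0.9,
                        alpha=0.05, epsilon=0.1, episodes=200, seed=0):
    # SARSA(lambda) with linear FA: Q(s, a, theta) = theta^T phi(s, a)
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)

    def q(s, a):
        return theta @ phi_sa(s, a)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    for _ in range(episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        e = np.zeros(d)                                   # e = 0 at the start of each episode
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = None if done else eps_greedy(s_next)
            target = r if done else r + gamma * q(s_next, a_next)
            delta = target - q(s, a)                      # TD error
            e = gamma * lam * e + phi_sa(s, a)            # eligibility trace update
            theta += alpha * delta * e
            s, a = s_next, a_next
    return theta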

Page 21: Reinforcement Learning Lecture Function Approximation

SARSA(λ) with tile-coding function approximation


Page 22: Reinforcement Learning Lecture Function Approximation

TD(λ) with Linear Function Approximation

• Convergence proof: it holds if the state process s_t is an ergodic Markov process whose stationary distribution is the same as the stationary distribution of the underlying MDP under the policy (e.g. the on-policy distribution).

• The convergence property:

MSE(θ_∞) ≤ ((1 − γλ) / (1 − γ)) MSE(θ*)

(Tsitsiklis & Van Roy, An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997)

• Is there a convergence guarantee for off-policy methods (e.g. Q-learning with linear function approximation)? No, because of:
– bootstrapping,
– not following the true gradient (i.e. samples are not drawn from the on-policy distribution P).


Page 23: Reinforcement Learning Lecture Function Approximation

Bootstrapping vs. Non-Bootstrapping


Page 24: Reinforcement Learning Lecture Function Approximation

Q-Learning with Linear Function Approximation

A Counter-example from Baird (1995)


Page 25: Reinforcement Learning Lecture Function Approximation

• Using exact updates (DP backups) for policy evaluation:

θ_{t+1} = θ_t + α Σ_s P(s) [ E{ r_{t+1} + γ V_t(s′) | s_t = s } − V(s, θ_t) ] ∇_θ V_t(s, θ_t)

• If P(s) is uniform (≠ the true stationary distribution of the Markov chain), then the asymptotic behaviour becomes unstable.


Page 26: Reinforcement Learning Lecture Function Approximation

Problems with Gradient-Descent Methods

• With nonlinear FA, on-policy methods like SARSA may still diverge (Tsitsiklis and Van Roy, 1997).

• With linear FA, off-policy methods like Q-learning may diverge (Baird, 1995).

• They are not true gradient-descent methods.

• Gradient-TD methods (Sutton et al., 2008, 2009) solve all of the above problems: they follow the true gradient direction of the projected Bellman error.

Page 28: Reinforcement Learning Lecture Function Approximation

Gradient temporal difference learning

• GTD (gradient temporal difference learning)

• GTD2 (gradient temporal difference learning, version 2)

• TDC (temporal difference learning with gradient corrections)


Page 29: Reinforcement Learning Lecture Function Approximation

Value function geometry

• Bellman operator:

TV = R + γPV

(Figure: the geometry of value functions in the space spanned by the feature vectors.)

RMSBE: residual mean-squared Bellman error
RMSPBE: residual mean-squared projected Bellman error

Page 30: Reinforcement Learning Lecture Function Approximation

TD performance measure

• Error from the true value:

||V_θ − V*||^2_P

• Error in the Bellman update (used in the previous section: TD(0), GTD(0) methods):

||V_θ − TV_θ||^2_P

• Error in the Bellman update after projection (TDC and GTD2 methods):

||V_θ − ΠTV_θ||^2_P


Page 31: Reinforcement Learning Lecture Function Approximation

TD performance measure

• GTD(0): the norm of the expected TD update (NEU):

NEU(θ) = E[δφ]^⊤ E[δφ]

• GTD2 and TDC: the norm of the expected TD update, weighted by the inverse covariance matrix of the features (MSPBE):

MSPBE(θ) = E[δφ]^⊤ E[φφ^⊤]^{−1} E[δφ]

(δ is the TD error.)
(GTD2 and TDC differ slightly in their derivation of the approximate gradient direction.)


Page 32: Reinforcement Learning Lecture Function Approximation

Updates

• GTD(0):

θ_{t+1} = θ_t + α_t (φ(s_t) − γφ(s_{t+1})) (φ(s_t)^⊤ w_t) π(s_t, a_t)
w_{t+1} = w_t + β_t (δ_t φ(s_t) − w_t) π(s_t, a_t)

• GTD2:

θ_{t+1} = θ_t + α_t (φ(s_t) − γφ(s_{t+1})) (φ(s_t)^⊤ w_t)
w_{t+1} = w_t + β_t (δ_t − φ(s_t)^⊤ w_t) φ(s_t)

• TDC:

θ_{t+1} = θ_t + α_t δ_t φ(s_t) − α_t γ φ(s_{t+1}) (φ(s_t)^⊤ w_t)
w_{t+1} = w_t + β_t (δ_t − φ(s_t)^⊤ w_t) φ(s_t)

where δ_t = r_t + γ θ_t^⊤ φ(s_{t+1}) − θ_t^⊤ φ(s_t) is the TD error.
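For example, a single TDC step on one transition can be written as the following sketch, using only the quantities that appear in the equations above:

import numpy as np

def tdc_update(theta, w, phi_s, phi_s_next, r, gamma, alpha, beta):
    # delta_t = r_t + gamma * theta^T phi(s_{t+1}) - theta^T phi(s_t)
    delta = r + gamma * (theta @ phi_s_next) - (theta @ phi_s)
    # theta_{t+1} = theta_t + alpha*delta*phi(s_t) - alpha*gamma*phi(s_{t+1})*(phi(s_t)^T w_t)
    theta = theta + alpha * delta * phi_s - alpha * gamma * phi_s_next * (phi_s @ w)
    # w_{t+1} = w_t + beta*(delta - phi(s_t)^T w_t)*phi(s_t)
    w = w + beta * (delta - phi_s @ w) * phi_s
    return theta, w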

Page 33: Reinforcement Learning Lecture Function Approximation

On Baird’s counterexample.


Page 34: Reinforcement Learning Lecture Function Approximation

Summary

• Gradient-TD algorithms with linear function approximation are guaranteed to converge under both general on-policy and off-policy training.

• The computational complexity is only O(n) (n is the number of features).

• The curse of dimensionality is avoided, since the complexity is linear in the number of features.


Page 35: Reinforcement Learning Lecture Function Approximation

Batch Learning Methods


Page 36: Reinforcement Learning Lecture Function Approximation

Batch Learning Methods

• Gradient-descent methods are simple and computationally efficient, but:
– sensitive to the choice of learning rates and initial parameter values,
– not sample-efficient.

• Batch learning is an alternative that searches for the best-fitting value function given the historical transition data:
– Fitted Q-Iteration (FQI)
– Experience replay in Deep Q-Networks (DQN)
– Least-squares temporal difference (LSTD) methods; LSPI


Page 37: Reinforcement Learning Lecture Function Approximation

Batch Learning Methods


Page 38: Reinforcement Learning Lecture Function Approximation

Fitted Q-Iteration

• Given all experience H = {(s_t, a_t, r_t, s_{t+1})}_{t=0}^N.

• FQI approximates Q(s, a) by supervised regression, constructing at iteration k the training data

D_k = { ((s_i, a_i), Q̂_k(s_i, a_i)) }_{i=0}^N

where Q̂_k(s_i, a_i) = r_i + γ max_{a′} Q̂_{k−1}(s_{i+1}, a′).

• Hence, FQI can be seen as approximate Q-iteration.

• At each iteration, any regression technique can be used, with different choices of function approximators: neural networks, radial basis function networks, regression trees, etc.

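A compact FQI sketch is given below. It assumes a finite action set and uses a scikit-learn regression-tree ensemble as the regressor, which is only one of the choices mentioned above; the dataset format and hyperparameters are illustrative assumptions:

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor  # any regressor with fit/predict would do

def fitted_q_iteration(H, actions, gamma=0.95, iterations=50):
    # H is a list of transitions (s, a, r, s_next); actions is a finite action set
    X = np.array([np.append(s, a) for (s, a, _, _) in H])
    rewards = np.array([r for (_, _, r, _) in H])
    next_states = [s_next for (_, _, _, s_next) in H]

    model = None
    for _ in range(iterations):
        if model is None:
            targets = rewards                              # first iteration: regress on immediate rewards
        else:
            q_next = np.array([max(model.predict(np.append(s_n, a).reshape(1, -1))[0]
                                   for a in actions)
                               for s_n in next_states])
            targets = rewards + gamma * q_next             # r_i + gamma * max_a' Qhat_{k-1}(s_{i+1}, a')
        model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return model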

Page 39: Reinforcement Learning Lecture Function Approximation

Fitted Q-Iteration: Mountain Car Benchmark

• The control interval is Δt = 0.05 s.

• Actions are continuous, restricted to the interval [−4, 4].

• A neural network trained with RPROP is used, with a 3-5-5-1 configuration (input, two hidden layers, output).

• Training trajectories had a maximum length of 50 primitive control steps.

• After each trajectory/episode, run one FQI iteration (k = k + 1).

(Neural fitted Q-learning, Martin Riedmiller, 2005.)

Page 40: Reinforcement Learning Lecture Function Approximation

Experience Replay in Deep Q-networks (DQN)

• Given the training data H = {(s_i, a_i, r_i, s_{i+1})}_{i=0}^N.

• Recall the squared error:

ρ(θ) = Σ_{i=0}^N ( Q^π(s_i, a_i) − Q̂(s_i, a_i, θ) )^2
     = E_H [ ( Q^π(s_i, a_i) − Q̂(s_i, a_i, θ) )^2 ]
     ≈ E_H [ ( r_i + γ max_a Q̂(s_{i+1}, a, θ) − Q̂(s_i, a_i, θ) )^2 ]

• Using stochastic gradient descent:
– sample a data point (s, a, r, s′) ∼ H
– stochastic gradient update (from single samples):

θ_{k+1} = θ_k + α_k ( r + γ max_{a′} Q̂(s′, a′, θ_k) − Q̂(s, a, θ_k) ) ∇_θ Q̂(s, a, θ_k)


Page 41: Reinforcement Learning Lecture Function Approximation

Experience Replay in Deep Q-networks (DQN)

• Q(s, a) is approximated using a deep neural network whose weights are θ.

• Loop until convergence:
– take action a_t using ε-greedy acc. to Q(s_t, a, θ_t)
– add (s_t, a_t, r_t, s_{t+1}) into the replay memory D
– sample a subset H of transitions (s_i, a_i, r_i, s_{i+1}) from D
– optimize the squared error (as on the previous slide),

ρ(θ) = E_H [ ( Q^π(s_i, a_i) − Q̂(s_i, a_i, θ) )^2 ]

and set θ_{t+1} = θ* (the minimizer)
– t = t + 1

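The replay memory itself is simple; a minimal sketch follows (the capacity and batch size are arbitrary illustration values, and the Q-network update is only indicated in comments since the lecture leaves the network unspecified):

import random
from collections import deque

class ReplayMemory:
    # Fixed-size replay memory D of transitions (s, a, r, s_next, done)
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# Inside the training loop (sketch):
#   memory.add(s_t, a_t, r_t, s_next, done)
#   batch = memory.sample(32)
#   # regress Q(s_i, a_i, theta) towards r_i + gamma * max_a Q(s_{i+1}, a, theta) on the batch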

Page 42: Reinforcement Learning Lecture Function Approximation

DQN in Atari Games

from David Silver, a NIPS talk 2014

Page 43: Reinforcement Learning Lecture Function Approximation

DQN in Atari Games

• Q(s, a) is approximated using a deep neural network whose weights are θ.
• Input s is a stack of raw pixels from 4 consecutive frames.
• Control a is one of 18 joystick/button positions.

from David Silver, a NIPS talk 2014


Page 44: Reinforcement Learning Lecture Function Approximation

DQN in Atari Games

from David Silver, a NIPS talk 2014


Page 45: Reinforcement Learning Lecture Function Approximation

DQN in Atari Games: Experience Replay vs. No Replay

from David Silver, a NIPS talk 2014


Page 46: Reinforcement Learning Lecture Function Approximation

LSPI: Least Squares Policy Iteration

• Least-squares temporal difference (LSTD) methods; LSPI:

– Bellman residual minimization

– Least Squares Fixed-Point Approximation


Page 47: Reinforcement Learning Lecture Function Approximation

Bellman residual minimization

• The Q-function of a given policy π fulfills, for any (s, a):

Q^π(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) Q^π(s′, π(s′))

• Written as an optimization problem: minimize the Bellman residual error

L(Q^π) = || R + γ P Π Q^π − Q^π ||

where Π is the matrix describing the policy π, i.e. (P Π Q)(s, a) = Σ_{s′} P(s′ | s, a) Q(s′, π(s′)).


Page 48: Reinforcement Learning Lecture Function Approximation

Bellman residual minimization

• The least-squares solution of Bellman residual minimization (this is an overconstrained system):

β^π = ( (Φ − γPΠΦ)^⊤ (Φ − γPΠΦ) )^{−1} (Φ − γPΠΦ)^⊤ r

• The solution β^π of the system is unique, since the columns of Φ (the basis functions) are linearly independent by definition.
(See Lagoudakis & Parr (JMLR 2003) for details.)


Page 49: Reinforcement Learning Lecture Function Approximation

LSPI: Least Squares Fixed-Point Approximation

• Projection of T^π Q back onto span(Φ):

T̂^π(Q) = Φ (Φ^⊤ Φ)^{−1} Φ^⊤ (T^π Q)

• Minimize the projected Bellman residual:

T̂^π(Q) = Q

• The approximate fixed point:

β^π = ( Φ^⊤ (Φ − γPΠΦ) )^{−1} Φ^⊤ r
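Given explicit matrices Φ and PΠ and the reward vector r (written Phi, PPi, r below, with placeholder inputs left to the caller), the two closed-form solutions from this and the previous slide differ only in the left factor. A small sketch:

import numpy as np

def bellman_residual_solution(Phi, PPi, r, gamma):
    # beta = ((Phi - g*PPi*Phi)^T (Phi - g*PPi*Phi))^{-1} (Phi - g*PPi*Phi)^T r
    A = Phi - gamma * PPi @ Phi
    return np.linalg.solve(A.T @ A, A.T @ r)

def fixed_point_solution(Phi, PPi, r, gamma):
    # beta = (Phi^T (Phi - g*PPi*Phi))^{-1} Phi^T r
    return np.linalg.solve(Phi.T @ (Phi - gamma * PPi @ Phi), Phi.T @ r)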

Page 50: Reinforcement Learning Lecture Function Approximation

LSPI: Comparisons of two views

• The Bellman residual minimizing method focuses on the magnitude of the change.

• The least-squares fixed-point approximation focuses on the direction of the change.

• The least-squares fixed-point approximation is less stable and less predictable.

• Even so, the least-squares fixed-point method might be preferable, because:
– learning the Bellman residual minimizing approximation requires doubled samples;
– experimentally, it often delivers superior policies.

(See Lagoudakis & Parr (JMLR 2003) for details.)


Page 51: Reinforcement Learning Lecture Function Approximation

LSPI: LSTDQ algorithm

A = Φ^⊤ (Φ − γPΠΦ), b = Φ^⊤ r, estimated from samples:

• Initialize A ← 0, b ← 0
• For each (s, a, r, s′) ∈ D:

A ← A + φ(s, a) ( φ(s, a) − γ φ(s′, π(s′)) )^⊤
b ← b + φ(s, a) r

• β ← A^{−1} b

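In code, the sample-based accumulation reads as the following sketch (D is a list of transitions, phi_sa a state-action feature map, and policy the policy being evaluated; these interfaces are assumptions). In practice a small ridge term is often added to keep A invertible:

import numpy as np

def lstdq(D, phi_sa, policy, d, gamma=0.95, ridge=1e-6):
    # Accumulate A and b from samples and solve A beta = b
    A = ridge * np.eye(d)       # small ridge term keeps A invertible (optional)
    b = np.zeros(d)
    for (s, a, r, s_next) in D:
        phi = phi_sa(s, a)
        phi_next = phi_sa(s_next, policy(s_next))
        A += np.outer(phi, phi - gamma * phi_next)   # A <- A + phi(s,a)(phi(s,a) - gamma*phi(s',pi(s')))^T
        b += r * phi                                 # b <- b + phi(s,a)*r
    return np.linalg.solve(A, b)                     # beta = A^{-1} b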

Page 52: Reinforcement Learning Lecture Function Approximation

LSPI algorithm (given D)

• repeat:

π ← π′
π′ ← LSTDQ(π)  (π′ is the greedy policy w.r.t. the Q-function given by the weights β^π returned by LSTDQ)

• until π ≈ π′ (the policy no longer changes)
• return π

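The full LSPI loop then just alternates LSTDQ evaluation with greedy improvement. A sketch, reusing the lstdq function from the previous example and assuming a finite action set:

import numpy as np

def lspi(D, phi_sa, actions, d, gamma=0.95, max_iters=20, tol=1e-6):
    beta = np.zeros(d)
    for _ in range(max_iters):
        # greedy policy w.r.t. the current Q-function Q(s, a) = beta^T phi(s, a)
        policy = lambda s, b=beta: max(actions, key=lambda a: phi_sa(s, a) @ b)
        beta_new = lstdq(D, phi_sa, policy, d, gamma)   # policy evaluation (previous example)
        if np.linalg.norm(beta_new - beta) < tol:       # stop when the weights (hence the policy) stabilize
            return beta_new
        beta = beta_new
    return beta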

Page 53: Reinforcement Learning Lecture Function Approximation

LSPI: Riding a bike

(from Alma A. M. Rahat’s simulation)

• States: (θ, θ̇, ω, ω̇, ω̈, ψ), where θ is the angle of the handlebar, ω is the vertical angle of the bicycle, and ψ is the angle of the bicycle to the goal.
• Actions: (τ, ν), where τ ∈ {−2, 0, 2} is the torque applied to the handlebar and ν ∈ {−0.02, 0, 0.02} is the displacement of the rider.
• For each action a, the value function Q(s, a) uses 20 features:

(1, ω, ω̇, ω^2, ωω̇, θ, θ̇, θ^2, θ̇^2, θθ̇, ωθ, ωθ^2, ω^2θ, ψ, ψ^2, ψθ, ψ̄, ψ̄^2, ψ̄θ)

where ψ̄ = sign(ψ)·π − ψ.

Page 54: Reinforcement Learning Lecture Function Approximation

LSPI: Riding a bike

(Converges after 8 iterations, based on 50,000 samples collected from 2,500 episodes.)

• Training samples were collected in advance by initializing the bicycle to a small random perturbation from the initial position (0, 0, 0, 0, 0, π/2) and running each episode for up to 20 steps using a purely random policy.
• Each successful ride must complete a distance of 2 kilometers.
• This experiment was repeated 100 times.

from Lagoudakis & Parr (JMLR 2003)

Page 55: Reinforcement Learning Lecture Function Approximation

LSPI: Riding a bike


Page 56: Reinforcement Learning Lecture Function Approximation

Feature Selection/Building Problems

• Feature selection.

• Online/incremental feature learning.

Wu and Givan (2005); Keller et al. (2006); Parr et al. (2007); Mahadevan and Liu (2010); Mahadevan et al. (2006); Kolter and Ng (2009); Boots and Gordon (2010); Sun et al. (2011); etc.
