
Policy Gradient Methods: Pathwise Derivative Methods and Wrap-up

March 15, 2017

Pathwise Derivative Policy Gradient Methods

Policy Gradient Estimators: Review

Deriving the Policy Gradient, Reparameterized

[Figure: stochastic computation graph. Parameters θ feed into actions a1, ..., aT; states s1, ..., sT and noise z1, ..., zT feed into the actions and the total reward RT.]

- Episodic MDP: want to compute ∇θ E[RT]. We'll use ∇θ log π(at | st; θ).
- Reparameterize: at = π(st, zt; θ), where zt is noise from a fixed distribution.
- Only works if P(s2 | s1, a1) is known.

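To make the reparameterization concrete, here is a minimal PyTorch sketch (not part of the original slides; names and sizes are illustrative) of a Gaussian policy written as at = μθ(st) + σ · zt, so that gradients of any differentiable function of the action flow back into θ:

    import torch

    # Toy Gaussian policy, reparameterized: a_t = pi(s_t, z_t; theta) = s_t @ theta + exp(log_sigma) * z_t,
    # where z_t is noise from a fixed distribution (standard normal). Names and sizes are illustrative.
    torch.manual_seed(0)
    obs_dim, act_dim = 3, 2

    theta = torch.randn(obs_dim, act_dim, requires_grad=True)   # policy parameters
    log_sigma = torch.zeros(act_dim, requires_grad=True)        # state-independent log-std

    def policy(s, z):
        """Deterministic function of (s, z) given theta."""
        return s @ theta + torch.exp(log_sigma) * z

    s = torch.randn(obs_dim)   # a sampled state
    z = torch.randn(act_dim)   # fixed noise, sampled independently of theta
    a = policy(s, z)           # the action is now differentiable w.r.t. theta

    # Any differentiable function of the action can be backpropagated into theta.
    objective = a.sum()        # stand-in for a reward-dependent quantity
    objective.backward()
    print(theta.grad.shape)    # (obs_dim, act_dim)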

Using a Q-function

[Figure: the same stochastic computation graph, with noise variables z1, ..., zT feeding into the actions.]

d/dθ E[RT] = E[ Σ_{t=1}^T (dRT/dat) (dat/dθ) ]
           = E[ Σ_{t=1}^T (d/dat E[RT | at]) (dat/dθ) ]
           = E[ Σ_{t=1}^T (dQ(st, at)/dat) (dat/dθ) ]
           = E[ Σ_{t=1}^T d/dθ Q(st, π(st, zt; θ)) ]
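The last expression is what pathwise derivative methods compute in practice: plug the reparameterized action into a (learned) Q-function and backpropagate. A minimal sketch, assuming hypothetical PyTorch networks policy_net (for the mean action μθ) and q_net (for Qφ):

    import torch
    import torch.nn as nn

    # Hypothetical policy and critic networks; shapes are illustrative.
    obs_dim, act_dim, sigma = 3, 2, 0.1
    policy_net = nn.Linear(obs_dim, act_dim)   # mean action mu_theta(s)
    q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    # A batch of states and fixed noise samples standing in for one rollout.
    states = torch.randn(16, obs_dim)
    noise = torch.randn(16, act_dim)

    # a_t = pi(s_t, z_t; theta): reparameterized actions, differentiable w.r.t. theta.
    actions = policy_net(states) + sigma * noise
    q_values = q_net(torch.cat([states, actions], dim=-1))

    # Pathwise estimator: d/dtheta of sum_t Q(s_t, pi(s_t, z_t; theta)),
    # holding the critic parameters fixed.
    policy_objective = q_values.sum()
    policy_grads = torch.autograd.grad(policy_objective, policy_net.parameters())
    print([g.shape for g in policy_grads])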

SVG(0) Algorithm

- Learn Qφ to approximate Q^{π,γ}, and use it to compute gradient estimates.
- Pseudocode:

    for iteration = 1, 2, ... do
        Execute policy πθ to collect T timesteps of data
        Update πθ using g ∝ ∇θ Σ_{t=1}^T Q(st, π(st, zt; θ))
        Update Qφ using g ∝ ∇φ Σ_{t=1}^T (Qφ(st, at) − Q̂t)², e.g. with TD(λ)
    end for

N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. “Learning Continuous Control Policies by Stochastic Value Gradients”. arXiv preprint arXiv:1510.09142 (2015)
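For the critic update in the pseudocode above, one plain instantiation is least-squares regression of Qφ(st, at) onto target values Q̂t. A hedged sketch with placeholder targets (in practice Q̂t would come from TD(λ) estimates computed on the collected rollout):

    import torch
    import torch.nn as nn

    # Critic regression for SVG(0): fit Q_phi(s_t, a_t) toward targets Qhat_t.
    # The targets here are random placeholders; in practice they would be
    # TD(lambda) estimates computed from the rollout.
    obs_dim, act_dim = 3, 2
    q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    q_optim = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    states = torch.randn(32, obs_dim)    # (s_t, a_t) pairs from executing pi_theta
    actions = torch.randn(32, act_dim)
    q_targets = torch.randn(32, 1)       # placeholder Qhat_t values

    q_pred = q_net(torch.cat([states, actions], dim=-1))
    q_loss = ((q_pred - q_targets) ** 2).sum()   # sum_t (Q_phi(s_t, a_t) - Qhat_t)^2
    q_optim.zero_grad()
    q_loss.backward()
    q_optim.step()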


SVG(1) Algorithm

[Figure: the same stochastic computation graph as above.]

- Instead of learning Q, we learn:
  - A state-value function V ≈ V^{π,γ}
  - A dynamics model f, approximating st+1 = f(st, at) + ζt
- Given a transition (st, at, st+1), infer ζt = st+1 − f(st, at)
- Then Q(st, at) = E[rt + γ V(st+1)] = E[rt + γ V(f(st, at) + ζt)], with at = π(st, θ, ζt)
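As a rough illustration of the SVG(1) estimate (a sketch, not the authors' implementation), the code below infers the dynamics noise ζt from an observed transition, then differentiates rt + γ V(f(st, at) + ζt) with respect to the policy parameters; the reward is treated as an observed constant and policy noise is omitted for brevity:

    import torch
    import torch.nn as nn

    # Hypothetical learned value function V, dynamics model f, and policy (names illustrative).
    obs_dim, act_dim, gamma = 3, 2, 0.99
    v_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    f_net = nn.Linear(obs_dim + act_dim, obs_dim)   # f(s_t, a_t) approximates s_{t+1}
    policy_net = nn.Linear(obs_dim, act_dim)

    # One observed transition (s_t, a_t, r_t, s_{t+1}).
    s_t = torch.randn(obs_dim)
    a_t_taken = torch.randn(act_dim)
    r_t = torch.tensor(1.0)
    s_next = torch.randn(obs_dim)

    # Infer the dynamics noise zeta_t = s_{t+1} - f(s_t, a_t); it is held fixed.
    with torch.no_grad():
        zeta_t = s_next - f_net(torch.cat([s_t, a_t_taken]))

    # Recompute the action through the policy, then differentiate
    # r_t + gamma * V(f(s_t, a_t) + zeta_t) with respect to the policy parameters.
    a_t = policy_net(s_t)
    q_est = r_t + gamma * v_net(f_net(torch.cat([s_t, a_t])) + zeta_t)
    grads = torch.autograd.grad(q_est.sum(), policy_net.parameters())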

SVG(∞) Algorithm

[Figure: the same stochastic computation graph as above.]

- Just learn the dynamics model f
- Given the whole trajectory, infer all noise variables
- Freeze all policy and dynamics noise, and differentiate through the entire deterministic computation graph
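A toy sketch of the SVG(∞) idea under simplifying assumptions (known differentiable reward, policy noise omitted): unroll the learned dynamics model with the inferred noise frozen, and backpropagate the return through the whole deterministic graph:

    import torch
    import torch.nn as nn

    # Only a dynamics model f is learned; dynamics noise is inferred from the
    # trajectory and frozen, and the return is differentiated through the
    # unrolled deterministic graph. Reward is assumed known and differentiable.
    obs_dim, act_dim, gamma, T = 3, 2, 0.99, 10
    f_net = nn.Linear(obs_dim + act_dim, obs_dim)
    policy_net = nn.Linear(obs_dim, act_dim)

    def reward(s, a):
        # Stand-in differentiable reward function.
        return -(s ** 2).sum() - 0.01 * (a ** 2).sum()

    s = torch.randn(obs_dim)                            # initial state
    zetas = [torch.randn(obs_dim) for _ in range(T)]    # frozen dynamics noise per step

    total_return = torch.zeros(())
    for t in range(T):
        a = policy_net(s)                               # action from the current policy
        total_return = total_return + (gamma ** t) * reward(s, a)
        s = f_net(torch.cat([s, a])) + zetas[t]         # unroll the learned dynamics

    grads = torch.autograd.grad(total_return, policy_net.parameters())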

SVG Results

- Applied to 2D robotics tasks
- Overall: the different gradient estimators behave similarly

N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. “Learning Continuous Control Policies by Stochastic Value Gradients”. arXiv preprint arXiv:1510.09142 (2015)

Deterministic Policy Gradient

- For Gaussian actions, the variance of the score function policy gradient estimator goes to infinity as the action variance goes to zero
- Intuition: finite difference gradient estimators
- But the SVG(0) gradient is fine as σ → 0:

    ∇θ Σ_t Q(st, π(st, θ, ζt))

- Problem: there's no exploration.
- Solution: add noise to the policy, but estimate Q with TD(0), so it's valid off-policy
- The policy gradient is a little biased (even with Q = Q^π), but only because the state distribution is off; it gets the right gradient at every state

D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. “Deterministic Policy Gradient Algorithms”. ICML (2014)

Deep Deterministic Policy Gradient

- Incorporate the replay buffer and target network ideas from DQN for increased stability
- Use lagged (Polyak-averaged) versions of Qφ and πθ for fitting Qφ (towards Q^{π,γ}) with TD(0):

    Q̂t = rt + γ Qφ′(st+1, π(st+1; θ′))

- Pseudocode:

    for iteration = 1, 2, ... do
        Act for several timesteps, add data to replay buffer
        Sample minibatch
        Update πθ using g ∝ ∇θ Σ_{t=1}^T Q(st, π(st, zt; θ))
        Update Qφ using g ∝ ∇φ Σ_{t=1}^T (Qφ(st, at) − Q̂t)²
    end for

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. “Continuous control with deep reinforcement learning”. arXiv preprint arXiv:1509.02971 (2015)
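A hedged sketch of the two DDPG-specific ingredients above: the TD(0) target computed with lagged (Polyak-averaged) target networks, and the Polyak update itself. Network shapes and the minibatch are placeholders:

    import copy
    import torch
    import torch.nn as nn

    # Target networks and TD(0) target; shapes and the minibatch are placeholders.
    obs_dim, act_dim, gamma, tau = 3, 2, 0.99, 0.005
    q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    policy_net = nn.Linear(obs_dim, act_dim)
    q_target = copy.deepcopy(q_net)              # lagged Q_phi'
    policy_target = copy.deepcopy(policy_net)    # lagged pi_theta'

    # Minibatch sampled from the replay buffer (random placeholders here).
    s = torch.randn(64, obs_dim)
    a = torch.randn(64, act_dim)
    r = torch.randn(64, 1)
    s_next = torch.randn(64, obs_dim)

    # TD(0) target: Qhat_t = r_t + gamma * Q_phi'(s_{t+1}, pi(s_{t+1}; theta')).
    with torch.no_grad():
        a_next = policy_target(s_next)
        q_hat = r + gamma * q_target(torch.cat([s_next, a_next], dim=-1))

    # Critic regression loss toward the lagged target.
    q_loss = ((q_net(torch.cat([s, a], dim=-1)) - q_hat) ** 2).mean()

    # Polyak averaging: slowly move the target networks toward the online networks.
    with torch.no_grad():
        for p, p_targ in zip(q_net.parameters(), q_target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
        for p, p_targ in zip(policy_net.parameters(), policy_target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)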

DDPG Results

Applied to 2D and 3D robotics tasks and driving with pixel input

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. “Continuous control with deep reinforcement learning”. arXiv preprint arXiv:1509.02971 (2015)

Policy Gradient Methods: Comparison

- Two kinds of policy gradient estimator:
  - REINFORCE / score function estimator: ∇ log π(a | s) Â
    - Learn Q or V for variance reduction, to estimate Â
  - Pathwise derivative estimators (differentiate with respect to the action)
    - SVG(0) / DPG: d/da Q(s, a) (learn Q)
    - SVG(1): d/da (r + γ V(s′)) (learn f, V)
    - SVG(∞): d/dat (rt + γ rt+1 + γ² rt+2 + ...) (learn f)
- Pathwise derivative methods are more sample-efficient when they work (maybe), but work less generally due to high bias

Policy Gradient Methods vs Q-Function Regression Methods

- Q-function regression methods are more sample-efficient when they work, but don't work as generally
- Policy gradients are easier to debug and understand:
  - Don't have to deal with a “burn-in” period
  - When it's working, performance should be monotonically increasing
  - Diagnostics like KL, entropy, and the baseline's explained variance
- Q-function regression methods are more compatible with exploration and off-policy learning
- Policy-gradient methods are more compatible with recurrent policies
- Q-function regression methods CAN be used with continuous action spaces (e.g., S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. “Continuous deep Q-learning with model-based acceleration”. (2016)), but final performance is worse (so far)

Recent Papers on Connecting Policy Gradients and Q-function Regression

- B. O’Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih. “PGQ: Combining policy gradient and Q-learning”. (2016)
- Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, et al. “Sample Efficient Actor-Critic with Experience Replay”. (2016)
  - Uses adjusted returns: A. Harutyunyan, M. G. Bellemare, T. Stepleton, and R. Munos. “Q(λ) with Off-Policy Corrections”. (2016); N. Jiang and L. Li. “Doubly robust off-policy value evaluation for reinforcement learning”. (2016)
- O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. “Bridging the Gap Between Value and Policy Based Reinforcement Learning”. (2017)
- T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. “Reinforcement Learning with Deep Energy-Based Policies”. (2017)