
Variance Reduction for Policy Gradient Methods

March 13, 2017

Reward Shaping

▶ Reward shaping: r̃(s, a, s′) = r(s, a, s′) + γΦ(s′) − Φ(s) for an arbitrary "potential" Φ (a code sketch follows this slide)

▶ Theorem: r̃ admits the same optimal policies as r.¹

▶ Proof sketch: suppose Q∗ satisfies the Bellman equation (𝒯Q = Q). If we transform r → r̃, the policy's value function satisfies Q̃(s, a) = Q∗(s, a) − Φ(s)

▶ Q∗ satisfies the Bellman equation ⇒ Q̃ also satisfies the Bellman equation

¹ A. Y. Ng, D. Harada, and S. Russell. "Policy invariance under reward transformations: Theory and application to reward shaping". ICML, 1999.
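As a concrete illustration (my addition, not from the slides), a potential-based shaping wrapper can be written in a few lines of Python; `reward_fn` and `potential` are hypothetical callables standing in for r and Φ.

```python
def shape_reward(reward_fn, potential, gamma):
    """Build the shaped reward r~(s, a, s') = r(s, a, s') + gamma * Phi(s') - Phi(s).

    reward_fn : callable (s, a, s_next) -> float, the original reward r (hypothetical name)
    potential : callable (s) -> float, an arbitrary potential Phi (hypothetical name)
    gamma     : the MDP's discount factor
    """
    def shaped(s, a, s_next):
        return reward_fn(s, a, s_next) + gamma * potential(s_next) - potential(s)
    return shaped
```

Any choice of `potential` leaves the optimal policies (and, as shown on the next slides, the advantage function) unchanged.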

Reward Shaping

▶ Theorem: R̃ admits the same optimal policies as R. (A. Y. Ng, D. Harada, and S. Russell. "Policy invariance under reward transformations: Theory and application to reward shaping". ICML, 1999.)

▶ Alternative proof: the advantage function is invariant. Let's look at the effect on V^π and Q^π (a numerical check follows this slide):

  E[r̃_0 + γ r̃_1 + γ² r̃_2 + …]        (condition on either s_0 or (s_0, a_0))
  = E[(r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ²(r_2 + γΦ(s_3) − Φ(s_2)) + …]
  = E[r_0 + γ r_1 + γ² r_2 + … − Φ(s_0)]

Thus,

  Ṽ^π(s) = V^π(s) − Φ(s)
  Q̃^π(s, a) = Q^π(s, a) − Φ(s)
  Ã^π(s, a) = A^π(s, a)

Ã^π(s, π∗(s)) = 0 at all states ⇒ π is optimal
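A quick numerical sanity check of this invariance (my addition, not part of the slides): build a small random MDP, fix a policy π, solve for its value functions exactly, and verify Ṽ^π = V^π − Φ, Q̃^π = Q^π − Φ, and Ã^π = A^π. All sizes, the seed, and the random MDP are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random MDP: transitions P[s, a, s'] and expected rewards r[s, a]
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

# Fixed stochastic policy pi[s, a]
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

def exact_q_v(reward):
    """Exact Q^pi, V^pi for a reward table: V = (I - gamma*P_pi)^(-1) r_pi, then one backup."""
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state transition matrix under pi
    r_pi = np.einsum('sa,sa->s', pi, reward)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = reward + gamma * (P @ V)            # Q(s,a) = r(s,a) + gamma * E[V(s')]
    return Q, V

# Shape with an arbitrary potential Phi: r~(s,a) = r(s,a) + gamma*E[Phi(s')] - Phi(s)
Phi = rng.random(nS)
r_shaped = r + gamma * (P @ Phi) - Phi[:, None]

Q, V = exact_q_v(r)
Q_s, V_s = exact_q_v(r_shaped)

print(np.allclose(V_s, V - Phi))                         # True: V~ = V - Phi
print(np.allclose(Q_s, Q - Phi[:, None]))                # True: Q~ = Q - Phi
print(np.allclose(Q_s - V_s[:, None], Q - V[:, None]))   # True: A~ = A
```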

Reward Shaping and Problem Difficulty

▶ Shape with Φ = V∗ ⇒ the problem is solved in one step of value iteration

▶ Shaping leaves the policy gradient invariant (and just adds a baseline to the estimator):

  E[∇θ log πθ(a_0 | s_0) ((r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ²(r_2 + γΦ(s_3) − Φ(s_2)) + …)]
  = E[∇θ log πθ(a_0 | s_0) (r_0 + γ r_1 + γ² r_2 + … − Φ(s_0))]
  = E[∇θ log πθ(a_0 | s_0) (r_0 + γ r_1 + γ² r_2 + …)]

Reward Shaping and Policy Gradients

▶ First note the connection between the shaped reward and the advantage function:

  E_{s_1}[r_0 + γ V^π(s_1) − V^π(s_0) | s_0 = s, a_0 = a] = A^π(s, a)

Now considering the policy gradient and ignoring all but the first shaped reward (i.e., pretend γ = 0 after shaping), we get

  E[Σ_t ∇θ log πθ(a_t | s_t) r̃_t]        (shaping with Φ = V^π)
  = E[Σ_t ∇θ log πθ(a_t | s_t) (r_t + γ V^π(s_{t+1}) − V^π(s_t))]
  = E[Σ_t ∇θ log πθ(a_t | s_t) A^π(s_t, a_t)]

▶ Compromise: use a more aggressive discount γλ, with λ ∈ (0, 1): called generalized advantage estimation

  E[Σ_t ∇θ log πθ(a_t | s_t) Σ_{k=0}^∞ (γλ)^k r̃_{t+k}]

Reward Shaping – Summary

▶ The reward shaping transformation leaves the policy gradient and the optimal policy invariant

▶ Shaping with Φ ≈ V^π makes the consequences of actions more immediate

▶ Shaping, and then ignoring all but the first term, gives the policy gradient

[Aside] Reward Shaping: Very Important in Practice

▶ I. Mordatch, E. Todorov, and Z. Popovic. "Discovery of complex behaviors through contact-invariant optimization". ACM Transactions on Graphics (TOG) 31.4 (2012), p. 43

▶ Y. Tassa, T. Erez, and E. Todorov. "Synthesis and stabilization of complex behaviors through online trajectory optimization". Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 4906–4913

Variance Reduction for Policy Gradients

Variance Reduction

▶ We have the following policy gradient formula:

  ∇θ E_τ[R] = E_τ[Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) A^π(s_t, a_t)]

▶ A^π is not known, but we can plug in Â_t, an advantage estimator

▶ Previously, we showed that taking

  Â_t = r_t + r_{t+1} + r_{t+2} + … − b(s_t)

for any function b(s_t) gives an unbiased policy gradient estimator; b(s_t) ≈ V^π(s_t) gives variance reduction.
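The unbiasedness claim rests on a score-function identity that the slide leaves implicit; written out for a discrete action space (my addition):

```latex
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right]
  = b(s_t) \sum_a \nabla_\theta \pi_\theta(a \mid s_t)
  = b(s_t)\, \nabla_\theta \sum_a \pi_\theta(a \mid s_t)
  = b(s_t)\, \nabla_\theta 1 = 0,
```

so subtracting b(s_t) changes the variance of the estimator but not its expectation.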

The Delayed Reward Problem

▶ With policy gradient methods, we are confounding the effect of multiple actions:

  Â_t = r_t + r_{t+1} + r_{t+2} + … − b(s_t)

mixes the effects of a_t, a_{t+1}, a_{t+2}, …

▶ The SNR of Â_t scales roughly as 1/T

▶ Only a_t contributes to the signal A^π(s_t, a_t), but a_{t+1}, a_{t+2}, … contribute to the noise.

Variance Reduction with Discounts

▶ A discount factor γ, 0 < γ < 1, downweights the effect of rewards that are far in the future, i.e., it ignores long-term dependencies

▶ We can form an advantage estimator using the discounted return (a code sketch follows this slide):

  Â^γ_t = r_t + γ r_{t+1} + γ² r_{t+2} + … − b(s_t)

where the sum of rewards is the discounted return; this reduces to our previous estimator when γ = 1.

▶ So that the advantage has expectation zero, we should fit the baseline to the discounted value function

  V^{π,γ}(s) = E_τ[r_0 + γ r_1 + γ² r_2 + … | s_0 = s]

▶ Discount γ is similar to using a horizon of 1/(1 − γ) timesteps

▶ Â^γ_t is a biased estimator of the advantage function
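A minimal sketch of this discounted estimator (my illustration; `rewards` and `baselines` are assumed arrays collected from one rollout, with the baseline typically a fitted V^{π,γ}). Setting gamma = 1 recovers the undiscounted estimator from before.

```python
import numpy as np

def discounted_advantages(rewards, baselines, gamma):
    """A^gamma_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... - b(s_t)."""
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):            # discounted returns-to-go, computed backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(baselines)  # subtract the baseline b(s_t)
```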

Value Functions in the Future

▶ The baseline accounts for and removes the effect of past actions

▶ We can also use the value function to estimate future rewards:

  r_t + γ V(s_{t+1})                    cut off at one timestep
  r_t + γ r_{t+1} + γ² V(s_{t+2})       cut off at two timesteps
  …
  r_t + γ r_{t+1} + γ² r_{t+2} + …      ∞ timesteps (no V)

Value Functions in the Future

▶ Subtracting out baselines, we get advantage estimators (a code sketch follows this slide):

  Â^(1)_t = r_t + γ V(s_{t+1}) − V(s_t)
  Â^(2)_t = r_t + γ r_{t+1} + γ² V(s_{t+2}) − V(s_t)
  …
  Â^(∞)_t = r_t + γ r_{t+1} + γ² r_{t+2} + … − V(s_t)

▶ Â^(1)_t has low variance but high bias; Â^(∞)_t has high variance but low bias.

▶ Using an intermediate k (say, 20) gives an intermediate amount of bias and variance
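A sketch of the k-step estimator Â^(k)_t on a length-T rollout (my illustration; `values` is assumed to hold V(s_0), …, V(s_T), including one final bootstrap entry, and the k-step sum is truncated near the end of the rollout).

```python
import numpy as np

def k_step_advantages(rewards, values, gamma, k):
    """A^(k)_t = r_t + gamma*r_{t+1} + ... + gamma^{k-1}*r_{t+k-1} + gamma^k*V(s_{t+k}) - V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(k, T - t)                       # truncate if fewer than k steps remain
        ret = sum(gamma**i * rewards[t + i] for i in range(horizon))
        ret += gamma**horizon * values[t + horizon]   # bootstrap with the value function
        adv[t] = ret - values[t]
    return adv
```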

Finite-Horizon Methods: Advantage Actor-Critic

▶ A2C / A3C uses this fixed-horizon advantage estimator

▶ Pseudocode (a sketch of the loss follows this slide):

  for iteration = 1, 2, … do
      Agent acts for T timesteps (e.g., T = 20)
      For each timestep t, compute
          R̂_t = r_t + γ r_{t+1} + … + γ^{T−1−t} r_{T−1} + γ^{T−t} V(s_T)
          Â_t = R̂_t − V(s_t)
      R̂_t is the target value in the regression problem; Â_t is the estimated advantage
      Compute the loss gradient g = ∇θ Σ_{t=1}^{T} [ −log πθ(a_t | s_t) Â_t + c (V(s_t) − R̂_t)² ]
      g is plugged into a stochastic gradient descent variant, e.g., Adam
  end for

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. “Asynchronous methods for deep reinforcement learning”. (2016)
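As a concrete rendering of the loss inside the loop, here is a short PyTorch sketch (my choice of framework, not prescribed by the slide). `policy_logits`, `values`, `actions`, and `returns` are assumed tensors gathered from one T-step rollout, with `returns` holding the bootstrapped targets R̂_t and c the value-loss coefficient.

```python
import torch
from torch.distributions import Categorical

def a2c_loss(policy_logits, values, actions, returns, c=0.5):
    """Scalar whose gradient is g = grad_theta sum_t [ -log pi(a_t|s_t) * A_t + c * (V(s_t) - R_t)^2 ].

    policy_logits : (T, num_actions) logits of pi_theta(. | s_t)
    values        : (T,) predicted V(s_t)
    actions       : (T,) sampled actions a_t
    returns       : (T,) bootstrapped targets R_t
    """
    advantages = (returns - values).detach()   # A_t = R_t - V(s_t); no gradient flows through A_t
    log_probs = Categorical(logits=policy_logits).log_prob(actions)
    policy_loss = -(log_probs * advantages).sum()
    value_loss = c * (values - returns).pow(2).sum()
    return policy_loss + value_loss
```

The returned scalar is backpropagated and the resulting gradient handed to an optimizer such as Adam, matching the last line of the pseudocode.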

A3C Video

A3C Results

[Figure: learning curves of score vs. training time (0–14 hours) on five Atari games (Beamrider, Breakout, Pong, Q*bert, Space Invaders), comparing DQN, 1-step Q-learning, 1-step SARSA, n-step Q-learning, and A3C.]


TD(λ) Methods: Generalized Advantage Estimation

▶ Recall the finite-horizon advantage estimators

  Â^(k)_t = r_t + γ r_{t+1} + … + γ^{k−1} r_{t+k−1} + γ^k V(s_{t+k}) − V(s_t)

▶ Define the TD error δ_t = r_t + γ V(s_{t+1}) − V(s_t)

▶ By a telescoping sum,

  Â^(k)_t = δ_t + γ δ_{t+1} + … + γ^{k−1} δ_{t+k−1}

▶ Take an exponentially weighted average of the finite-horizon estimators (a code sketch follows this slide):

  Â^λ_t = (1 − λ)(Â^(1)_t + λ Â^(2)_t + λ² Â^(3)_t + …)

(the factor (1 − λ) makes the weights sum to 1)

▶ We obtain

  Â^λ_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + …

▶ This scheme was named generalized advantage estimation (GAE) in [1], though versions have appeared earlier, e.g., [2]. It is related to TD(λ).

[1] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. "High-dimensional continuous control using generalized advantage estimation". 2015.

[2] H. Kimura and S. Kobayashi. "An Analysis of Actor/Critic Algorithms Using Eligibility Traces: Reinforcement Learning with Imperfect Value Function". ICML, 1998.
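A sketch of the GAE computation above, using the backward recursion Â^λ_t = δ_t + γλ Â^λ_{t+1} (my illustration; `values` is assumed to contain a final entry V(s_T) for the last TD error).

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """A^lambda_t = delta_t + (gamma*lam)*delta_{t+1} + (gamma*lam)^2*delta_{t+2} + ..."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        running = delta + gamma * lam * running                 # exponentially weighted sum
        adv[t] = running
    return adv
```

With lam = 0 this reduces to Â^(1)_t = δ_t, and with lam = 1 it reduces to the discounted return minus V(s_t) (up to the final bootstrap), matching the bias/variance trade-off discussed above.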

Choosing parameters γ, λ

Performance as γ, λ are varied

[Figure: cost vs. number of policy iterations (0–500) on the 3D Biped task, with one curve per setting: (γ, λ) = (0.96, 0.96), (0.98, 0.96), (0.99, 0.96), (0.995, 0.92), (0.995, 0.96), (0.995, 0.98), (0.995, 0.99), (0.995, 1.0), (1, 0.96), and γ = 1 with no value function.]

TRPO+GAE Video