Variance Reduction for Policy Gradient Methods
March 13, 2017

Transcript of "Variance Reduction for Policy Gradient Methods"
(rail.eecs.berkeley.edu/deeprlcoursesp17/docs/lec6.pdf)

Page 1:

Variance Reduction for Policy Gradient Methods

March 13, 2017

Page 2:

Reward Shaping

Page 3:

Reward Shaping

Page 4:

Reward Shaping

▶ Reward shaping: r̃(s, a, s′) = r(s, a, s′) + γΦ(s′) − Φ(s) for an arbitrary "potential" function Φ (a short code sketch follows below)

▶ Theorem: r̃ admits the same optimal policies as r.¹

▶ Proof sketch: suppose Q∗ satisfies the Bellman equation (T Q = Q, where T is the Bellman backup operator). If we transform r → r̃, the transformed value function satisfies Q̃(s, a) = Q∗(s, a) − Φ(s)

▶ Q∗ satisfies the Bellman equation ⇒ Q̃ also satisfies the Bellman equation

¹ A. Y. Ng, D. Harada, and S. Russell. "Policy invariance under reward transformations: Theory and application to reward shaping". ICML. 1999.
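To make the transformation concrete, here is a minimal Python sketch of potential-based shaping. The helper name shaped_reward and the potential phi are illustrative assumptions, not from the slides; phi can be any function of the state.

    # Minimal sketch of potential-based reward shaping (illustrative names, not from the slides).
    def shaped_reward(r, s, s_next, phi, gamma=0.99):
        """r~(s, a, s') = r(s, a, s') + gamma * Phi(s') - Phi(s), for any potential function phi."""
        return r + gamma * phi(s_next) - phi(s)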

Page 5:

Reward Shaping

▶ Theorem: r̃ admits the same optimal policies as r. (A. Y. Ng, D. Harada, and S. Russell. "Policy invariance under reward transformations: Theory and application to reward shaping". ICML. 1999)

▶ Alternative proof: the advantage function is invariant. Let's look at the effect on Vπ and Qπ:

  E[r̃0 + γr̃1 + γ²r̃2 + . . .]        (condition on either s0 or (s0, a0))

= E[(r0 + γΦ(s1) − Φ(s0)) + γ(r1 + γΦ(s2) − Φ(s1)) + γ²(r2 + γΦ(s3) − Φ(s2)) + . . .]

= E[r0 + γr1 + γ²r2 + · · · − Φ(s0)]

Thus,

Ṽπ(s) = Vπ(s) − Φ(s)

Q̃π(s, a) = Qπ(s, a) − Φ(s)

Ãπ(s, a) = Aπ(s, a)

▶ maxₐ Aπ(s, a) = 0 at all states ⇒ π is optimal; since the advantage is invariant, the same policies are optimal under r̃
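As a sanity check on the invariance claim above, here is a small numeric sketch (my own construction, not from the lecture): it evaluates a fixed policy exactly on a random finite MDP, with and without potential-based shaping, and confirms that V and Q shift by −Φ(s) while the advantage is unchanged. All names (policy_eval, phi, etc.) are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    nS, nA, gamma = 4, 3, 0.9

    P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # transition probabilities P(s'|s,a)
    R = rng.random((nS, nA, nS))                                      # reward r(s, a, s')
    pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)    # stochastic policy pi(a|s)
    phi = rng.random(nS)                                              # arbitrary potential Phi(s)

    def policy_eval(P, R, pi, gamma):
        """Exact policy evaluation: return (V^pi, Q^pi)."""
        r_sa = (P * R).sum(axis=2)                    # expected reward r(s, a)
        P_pi = np.einsum('sa,san->sn', pi, P)         # state-to-state transitions under pi
        r_pi = (pi * r_sa).sum(axis=1)
        V = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
        Q = r_sa + gamma * P @ V                      # Q(s,a) = r(s,a) + gamma * E[V(s')]
        return V, Q

    V, Q = policy_eval(P, R, pi, gamma)
    R_shaped = R + gamma * phi[None, None, :] - phi[:, None, None]    # r~(s, a, s')
    V_sh, Q_sh = policy_eval(P, R_shaped, pi, gamma)

    assert np.allclose(V_sh, V - phi)                         # V~pi(s)   = Vpi(s)   - Phi(s)
    assert np.allclose(Q_sh, Q - phi[:, None])                # Q~pi(s,a) = Qpi(s,a) - Phi(s)
    assert np.allclose(Q_sh - V_sh[:, None], Q - V[:, None])  # A~pi = Api (advantage invariant)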

Page 6:

Reward Shaping and Problem Difficulty

▶ Shaping with Φ = V∗ ⇒ the problem is solved in one step of value iteration

▶ Shaping leaves the policy gradient invariant (it just adds a baseline to the estimator):

  E[∇θ log πθ(a0 | s0) ((r0 + γΦ(s1) − Φ(s0)) + γ(r1 + γΦ(s2) − Φ(s1)) + γ²(r2 + γΦ(s3) − Φ(s2)) + . . .)]

= E[∇θ log πθ(a0 | s0) (r0 + γr1 + γ²r2 + · · · − Φ(s0))]

= E[∇θ log πθ(a0 | s0) (r0 + γr1 + γ²r2 + . . .)]

Page 7:

Reward Shaping and Policy Gradients

▶ First, note the connection between the shaped reward and the advantage function:

  Es1[r0 + γVπ(s1) − Vπ(s0) | s0 = s, a0 = a] = Aπ(s, a)

▶ Now considering the policy gradient and ignoring all but the first shaped reward (i.e., pretending γ = 0 after shaping), we get

  E[Σt ∇θ log πθ(at | st) r̃t]

= E[Σt ∇θ log πθ(at | st) (rt + γVπ(st+1) − Vπ(st))]

= E[Σt ∇θ log πθ(at | st) Aπ(st, at)]

▶ Compromise: use a more aggressive discount γλ, with λ ∈ (0, 1); this is called generalized advantage estimation

  E[Σt ∇θ log πθ(at | st) Σ_{k=0}^∞ (γλ)^k r̃t+k]

Page 8:

Reward Shaping – Summary

▶ The reward shaping transformation leaves the policy gradient and the optimal policy invariant

▶ Shaping with Φ ≈ Vπ makes the consequences of actions more immediate

▶ Shaping, and then ignoring all but the first term, gives the policy gradient

Page 9:

[Aside] Reward Shaping: Very Important in Practice

▶ I. Mordatch, E. Todorov, and Z. Popovic. "Discovery of complex behaviors through contact-invariant optimization". ACM Transactions on Graphics (TOG) 31.4 (2012), p. 43

▶ Y. Tassa, T. Erez, and E. Todorov. "Synthesis and stabilization of complex behaviors through online trajectory optimization". Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE. 2012, pp. 4906–4913

Page 10:

Variance Reduction for Policy Gradients

Page 11:

Variance Reduction

▶ We have the following policy gradient formula:

  ∇θ Eτ[R] = Eτ[Σ_{t=0}^{T−1} ∇θ log π(at | st, θ) Aπ(st, at)]

▶ Aπ is not known, but we can plug in Ât, an advantage estimator

▶ Previously, we showed that taking

  Ât = rt + rt+1 + rt+2 + · · · − b(st)

for any function b(st) gives an unbiased policy gradient estimator. Choosing b(st) ≈ Vπ(st) gives variance reduction (see the sketch below).
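As a concrete illustration, here is a short sketch (array conventions and helper names are my own assumptions) that turns one trajectory's rewards and baseline values into the estimates Ât = rt + rt+1 + · · · − b(st); each Ât then multiplies ∇θ log πθ(at | st) in the gradient estimate.

    import numpy as np

    def advantage_estimates(rewards, baselines):
        """rewards = [r_0, ..., r_{T-1}], baselines = [b(s_0), ..., b(s_{T-1})].
        Returns A_hat_t = r_t + r_{t+1} + ... + r_{T-1} - b(s_t) for each t."""
        rewards = np.asarray(rewards, dtype=float)
        returns_to_go = np.cumsum(rewards[::-1])[::-1]   # undiscounted return-to-go at each t
        return returns_to_go - np.asarray(baselines, dtype=float)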

Page 12:

The Delayed Reward Problem

▶ With policy gradient methods, we are confounding the effects of multiple actions:

  Ât = rt + rt+1 + rt+2 + · · · − b(st)

mixes the effects of at, at+1, at+2, . . .

▶ The SNR of Ât scales roughly as 1/T

▶ Only at contributes to the signal Aπ(st, at), but at+1, at+2, . . . contribute to the noise.

Page 13:

Variance Reduction with Discounts

▶ The discount factor γ, 0 < γ < 1, downweights the effect of rewards that are far in the future, ignoring long-term dependencies

▶ We can form an advantage estimator using the discounted return:

  Âγt = rt + γrt+1 + γ²rt+2 + · · · − b(st)

where the reward terms form the discounted return; this reduces to our previous estimator when γ = 1 (see the sketch below).

▶ So that the advantage has expectation zero, we should fit the baseline to the discounted value function

  Vπ,γ(s) = Eτ[r0 + γr1 + γ²r2 + . . . | s0 = s]

▶ The discount γ is similar to using a horizon of 1/(1 − γ) timesteps

▶ Âγt is a biased estimator of the advantage function
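A corresponding sketch for the discounted estimator Âγt, again with assumed array conventions:

    import numpy as np

    def discounted_advantage_estimates(rewards, baselines, gamma=0.99):
        """A^gamma_t = r_t + gamma r_{t+1} + gamma^2 r_{t+2} + ... - b(s_t)."""
        T = len(rewards)
        disc_return = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):          # backward recursion: G_t = r_t + gamma * G_{t+1}
            running = rewards[t] + gamma * running
            disc_return[t] = running
        return disc_return - np.asarray(baselines, dtype=float)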

Page 14:

Value Functions in the Future

▶ The baseline accounts for and removes the effect of past actions

▶ We can also use the value function to estimate future rewards:

  rt + γV(st+1)                      cut off at one timestep
  rt + γrt+1 + γ²V(st+2)             cut off at two timesteps
  . . .
  rt + γrt+1 + γ²rt+2 + . . .        ∞ timesteps (no V)

Page 15:

Value Functions in the Future

▶ Subtracting out baselines, we get advantage estimators

  Â(1)t = rt + γV(st+1) − V(st)
  Â(2)t = rt + γrt+1 + γ²V(st+2) − V(st)
  . . .
  Â(∞)t = rt + γrt+1 + γ²rt+2 + · · · − V(st)

▶ Â(1)t has low variance but high bias; Â(∞)t has high variance but low bias.

▶ Using an intermediate k (say, 20) gives an intermediate amount of bias and variance (see the sketch below)
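For concreteness, a sketch of Â(k)t for a single timestep, under assumed conventions: values has one more entry than rewards, and values[T] is the bootstrap value (0 if the episode terminated).

    def k_step_advantage(rewards, values, t, k, gamma=0.99):
        """A^(k)_t = r_t + gamma r_{t+1} + ... + gamma^(k-1) r_{t+k-1} + gamma^k V(s_{t+k}) - V(s_t)."""
        T = len(rewards)
        horizon = min(k, T - t)                        # truncate if fewer than k steps remain
        ret = sum(gamma**i * rewards[t + i] for i in range(horizon))
        ret += gamma**horizon * values[t + horizon]    # bootstrap with V(s_{t+horizon})
        return ret - values[t]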

Page 16:

Finite-Horizon Methods: Advantage Actor-Critic

▶ A2C / A3C uses this fixed-horizon advantage estimator

▶ Pseudocode:

  for iteration = 1, 2, . . . do
      Agent acts for T timesteps (e.g., T = 20)
      For each timestep t, compute

          Rt = rt + γrt+1 + · · · + γ^(T−1−t) rT−1 + γ^(T−t) V(sT)
          Ât = Rt − V(st)

      Rt is the target value function in the regression problem; Ât is the estimated advantage function
      Compute the loss gradient g = ∇θ Σ_{t=1}^T [−log πθ(at | st) Ât + c(V(st) − Rt)²]
      g is plugged into a stochastic gradient descent variant, e.g., Adam
  end for

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. “Asynchronous methods for deep reinforcement learning”. (2016)
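To complement the pseudocode above, here is a sketch of how the targets Rt and advantages Ât might be computed for one T-step rollout (numpy, with my own array conventions; the neural-network and optimizer steps are omitted). R and Â then enter the loss Σt [−log πθ(at | st) Ât + c(V(st) − Rt)²].

    import numpy as np

    def a2c_targets(rewards, values, bootstrap_value, gamma=0.99):
        """rewards = [r_0, ..., r_{T-1}], values = [V(s_0), ..., V(s_{T-1})], bootstrap_value = V(s_T).
        Returns (R, A) with R_t = r_t + gamma r_{t+1} + ... + gamma^(T-t) V(s_T) and A_t = R_t - V(s_t)."""
        T = len(rewards)
        R = np.zeros(T)
        running = bootstrap_value
        for t in reversed(range(T)):           # backward recursion: R_t = r_t + gamma * R_{t+1}
            running = rewards[t] + gamma * running
            R[t] = running
        A = R - np.asarray(values, dtype=float)
        return R, A    # R: regression targets for the value function; A: advantage estimates for the policy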

Page 17:

A3C Video

Page 18:

A3C Results

[Figure: A3C learning curves on five Atari games (Beamrider, Breakout, Pong, Q*bert, Space Invaders). Each panel plots score vs. training time (hours), comparing DQN, 1-step Q, 1-step SARSA, n-step Q, and A3C.]

Page 19:

Reward Shaping

Page 20:

TD(λ) Methods: Generalized Advantage Estimation

▶ Recall the finite-horizon advantage estimators

  Â(k)t = rt + γrt+1 + · · · + γ^(k−1) rt+k−1 + γ^k V(st+k) − V(st)

▶ Define the TD error δt = rt + γV(st+1) − V(st)

▶ By a telescoping sum,

  Â(k)t = δt + γδt+1 + · · · + γ^(k−1) δt+k−1

▶ Take an exponentially weighted average of the finite-horizon estimators:

  Âλt = (1 − λ)(Â(1)t + λÂ(2)t + λ²Â(3)t + . . .)

▶ We obtain

  Âλt = δt + (γλ)δt+1 + (γλ)²δt+2 + . . .

▶ This scheme was named generalized advantage estimation (GAE) in [1], though versions appeared earlier, e.g., [2]. It is related to TD(λ).

[1] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. "High-dimensional continuous control using generalized advantage estimation". (2015)

[2] H. Kimura and S. Kobayashi. "An Analysis of Actor/Critic Algorithms Using Eligibility Traces: Reinforcement Learning with Imperfect Value Function". ICML. 1998
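The telescoping form above suggests a simple backward recursion. Here is a sketch under the same assumed conventions as before (values has one more entry than rewards once the bootstrap value V(sT) is appended):

    import numpy as np

    def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
        """A^lambda_t = delta_t + (gamma*lam) delta_{t+1} + (gamma*lam)^2 delta_{t+2} + ...,
        with delta_t = r_t + gamma V(s_{t+1}) - V(s_t)."""
        values = np.append(np.asarray(values, dtype=float), bootstrap_value)
        T = len(rewards)
        adv = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            running = delta + gamma * lam * running    # A_t = delta_t + gamma*lam * A_{t+1}
            adv[t] = running
        return adv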

Page 21:

Choosing parameters γ, λ

Performance as γ, λ are varied

[Figure: "3D Biped" learning curves, cost vs. number of policy iterations (0–500), for settings γ=0.96,λ=0.96; γ=0.98,λ=0.96; γ=0.99,λ=0.96; γ=0.995,λ=0.92; γ=0.995,λ=0.96; γ=0.995,λ=0.98; γ=0.995,λ=0.99; γ=0.995,λ=1.0; γ=1,λ=0.96; and γ=1 with no value function.]

Page 22:

TRPO+GAE Video