Page 1

Deterministic PG, Pathwise Derivatives

Deep Reinforcement Learning and Control

Katerina Fragkiadaki

Carnegie Mellon

School of Computer Science

Spring 2020, CMU 10-403

Page 2

Computing Gradients of Expectations

$$\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta} R(a, s)$$

Policy objective:

$$\max_\theta \; \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[R(\tau)\right]$$

Likelihood ratio gradient estimator:

$$\mathbb{E}_{\tau \sim P_\theta(\tau)}\left[\nabla_\theta \log P_\theta(\tau)\, R(\tau)\right] \qquad\qquad \mathbb{E}_{s \sim d_0(s),\, a \sim \pi_\theta(a|s)}\, \nabla_\theta \log \pi_\theta(a \mid s)\left[Q(s, a, w_1) - V(s, w_2)\right]$$

$$g = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log P_\theta(\tau^{(i)})\, R(\tau^{(i)}) = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) R(\tau^{(i)})$$

Qs:
• Do we have access to the reward function R(τ)?
• Do we have access to the analytic gradients of rewards w.r.t. actions a?
• For continuous actions a, do we have access to the analytic gradients of Q(s, a, w) or Q(s, a, w_1) − V(s, w_2) w.r.t. actions a?
• Have we used the latter anywhere?
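As a concrete illustration (my own sketch, not from the slides), here is a minimal PyTorch version of the sample estimate g above, assuming a toy Gaussian policy with fixed σ and pre-collected trajectories; the names `policy_mean`, `surrogate_loss`, and `trajectories` are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: a Gaussian policy with fixed sigma over 1-D actions.
policy_mean = nn.Linear(4, 1)   # maps a state of dim 4 to the action mean
sigma = 0.5                     # fixed exploration std (not learned)

def surrogate_loss(trajectories):
    """Likelihood-ratio estimator: mean over trajectories of
    sum_t log pi_theta(a_t | s_t) * R(tau).  Differentiating this
    surrogate w.r.t. the policy parameters gives the gradient g above."""
    total = 0.0
    for states, actions, ret in trajectories:    # states: [T,4], actions: [T,1], ret: scalar R(tau)
        dist = torch.distributions.Normal(policy_mean(states), sigma)
        log_probs = dist.log_prob(actions).sum() # sum_t log pi_theta(a_t | s_t)
        total = total + log_probs * ret          # R(tau) is a constant w.r.t. theta
    return -total / len(trajectories)            # negate: optimizers minimize

# Usage: loss = surrogate_loss(batch); loss.backward() leaves g in the parameters of policy_mean.
```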

Page 3

What if we have a deterministic policy?

$$\max_\theta \; \mathbb{E} \sum_{t=1}^{T} R(s_t, a_t), \qquad a = \pi_\theta(s)$$

Qs:
• Can we backpropagate through R?

$$\max_\theta \; \mathbb{E} \sum_{t=1}^{T} Q(s_t, a_t), \qquad a = \pi_\theta(s)$$

Q: does this expectation depend on θ?

Qs:
• Can we backpropagate through Q?

$$\mathbb{E} \sum_t \frac{dQ(s_t, a_t)}{d\theta} = \mathbb{E} \sum_t \frac{dQ(s_t, a_t)}{da_t}\,\frac{da_t}{d\theta}, \qquad a = \pi_\theta(s)$$

Page 4

Deriving the Policy Gradient, Reparameterized

• Episodic MDP: [graphical model: states s1 … sT, actions a1 … aT, return RT]
• Want to compute ∇θ 𝔼[RT]. We'll use ∇θ log π(at | st; θ).
• Reparameterize: at = π(st, zt; θ). zt is noise from a fixed distribution.
• Only works if P(s2 | s1, a1) is known.

Using a Q-function

[Graphical model: states s1 … sT, noise z1 … zT feeding actions a1 … aT, return RT]

Deep Deterministic Policy Gradients

$$\frac{d}{d\theta}\mathbb{E}\left[R_T\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{dR_T}{da_t}\frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{d}{da_t}\mathbb{E}\left[R_T \mid a_t\right]\frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{dQ(s_t, a_t)}{da_t}\frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{d}{d\theta} Q\big(s_t, \pi(s_t, z_t; \theta)\big)\right]$$

For a deterministic policy a = πθ(s):

$$\mathbb{E}\sum_t \frac{dQ(s_t, a_t)}{d\theta} = \mathbb{E}\sum_{t=1}^{T} \frac{dQ(s_t, a_t)}{da_t}\frac{da_t}{d\theta}$$

Continuous control with deep reinforcement learning, Lillicrap et al. 2016

Page 5

Deep Deterministic Policy Gradients

DDPG: deep Q-learning for continuous actions. We are following a stochastic behavior policy to collect data.

[Diagram: state s feeds an actor network DNN(θμ) that outputs the action a = μ(s; θμ); s and a feed a critic network DNN(θQ) that outputs Q(s, a).]

Continuous control with deep reinforcement learning, Lillicrap et al. 2016
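To make the "backpropagate dQ/da through the actor" idea concrete, here is a minimal, hypothetical PyTorch sketch of a DDPG-style actor update (critic training, target networks, and the replay buffer are omitted); network sizes and names are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2                              # hypothetical sizes

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())     # a = mu(s; theta_mu)
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                        # Q(s, a; theta_Q)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_update(states):
    """Deterministic policy gradient step: ascend Q(s, mu(s)) w.r.t. theta_mu.
    Autodiff computes dQ/da * da/dtheta, exactly the pathwise chain rule."""
    actions = actor(states)                                     # differentiable a = mu(s)
    q_values = critic(torch.cat([states, actions], dim=-1))
    loss = -q_values.mean()                                     # maximize Q -> minimize -Q
    actor_opt.zero_grad()
    loss.backward()                                             # gradients flow through the critic into the actor
    actor_opt.step()

# Usage: actor_update(torch.randn(32, state_dim))   # a batch of states from the replay buffer
```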

Page 6

Deep Deterministic Policy Gradients

Page 7

Computing Gradients of Expectations

When the variable w.r.t. which we are differentiating appears in the distribution Pθ(x), we use the likelihood ratio gradient estimator:

$$\nabla_\theta\, \mathbb{E}_{x \sim P_\theta(x)} f(x) = \mathbb{E}_{x \sim P_\theta(x)}\, \nabla_\theta \log P_\theta(x)\, f(x)$$

When the variable w.r.t. which we are differentiating appears inside the expectation:

$$\nabla_\theta\, \mathbb{E}\, f(x(\theta)) = \mathbb{E}_{x \sim P(x)}\, \nabla_\theta f(x(\theta)) = \mathbb{E}_{x \sim P(x)}\, \frac{df(x(\theta))}{dx}\frac{dx}{d\theta}$$

Re-parametrization trick: for some distributions we can switch from one gradient estimator to the other.

Q: From which to which? Why would we want to do so?
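As a numerical illustration (my own toy example, not from the slides), both estimators target the same gradient. For x ∼ 𝒩(θ, 1) and the hypothetical choice f(x) = x², the true gradient is ∇θ 𝔼[f(x)] = 2θ:

```python
import torch

torch.manual_seed(0)
theta = torch.tensor(1.5, requires_grad=True)
N = 200_000
f = lambda x: x ** 2          # true gradient of E[f(x)] w.r.t. theta is 2*theta = 3.0

# Likelihood-ratio (score function) estimator: E[ grad_theta log p_theta(x) * f(x) ]
x = torch.distributions.Normal(theta, 1.0).sample((N,))       # samples carry no dependence on theta
logp = torch.distributions.Normal(theta, 1.0).log_prob(x)
lr_grad = torch.autograd.grad((logp * f(x)).mean(), theta)[0]

# Pathwise (reparameterized) estimator: x = theta + z, z ~ N(0, 1), differentiate through x
z = torch.randn(N)
pathwise_grad = torch.autograd.grad(f(theta + z).mean(), theta)[0]

print(lr_grad.item(), pathwise_grad.item())   # both should be close to 3.0; the pathwise one typically has lower variance
```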

Page 8

Imagine we knew the reward function ρ(s, a)

[Computation graph: θ and s0 feed the policy πθ(s0), which produces the action a0; s0 and a0 feed the known reward function ρ(s0, a0) = r0.]

Legend:
• deterministic node: the value is a deterministic function of its input
• stochastic node: the value is sampled based on its input (which parametrizes the distribution to sample from)
• deterministic computation node

Page 9

Deterministic policy

[Computation graph: θ and s0 feed the deterministic policy a = πθ(s0); s0 and a0 feed the known reward ρ(s0, a0) = r0.]

I want to learn to maximize the average reward obtained:

$$\max_\theta \; \rho(s_0, a), \qquad a = \pi_\theta(s)$$

I can compute the gradient with the chain rule:

$$\nabla_\theta\, \rho(s, a) = \frac{d\rho}{da}\frac{da}{d\theta}$$

dρ/da is the derivative of the *known* reward function w.r.t. the action.
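A minimal sketch of this chain rule (my own example, with a hypothetical reward and policy chosen so the gradient is easy to check by hand): with ρ(s, a) = −(a − 3)² and a = πθ(s) = θ·s, autodiff recovers exactly dρ/da · da/dθ.

```python
import torch

s0 = torch.tensor(2.0)
theta = torch.tensor(0.5, requires_grad=True)

a = theta * s0            # deterministic policy a = pi_theta(s) = theta * s
rho = -(a - 3.0) ** 2     # known, differentiable reward rho(s, a)
rho.backward()

# Hand-computed chain rule: d rho/da = -2*(a - 3) = 4.0, da/dtheta = s0 = 2.0  ->  4.0 * 2.0 = 8.0
print(theta.grad.item())  # 8.0
```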

Page 10

Stochastic policy

[Computation graph: θ and s0 feed the stochastic policy πθ(s0), which samples the action a0; s0 and a0 feed the known reward ρ(s0, a0) = r0.]

I want to learn to maximize the average reward obtained:

$$\max_\theta \; \mathbb{E}_a\, \rho(s_0, a)$$

We now need ∇θ 𝔼_a ρ(s_0, a): the deterministic chain rule ∇θ ρ(s, a) = (dρ/da)(da/dθ) no longer applies directly, because a is sampled rather than computed deterministically from θ.

Page 11

Stochastic policy

[Same computation graph as before.]

I want to learn to maximize the average reward obtained:

$$\max_\theta \; \mathbb{E}_a\, \rho(s_0, a)$$

Likelihood ratio estimator (works for both continuous and discrete actions):

$$\mathbb{E}_a\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \rho(s_0, a)\right]$$

Page 12

Example: Gaussian policy

[Computation graph: θ and s0 feed the policy network, which outputs μθ(s) and σθ(s); the sampled action a0 and s0 feed the known reward ρ(s0, a0) = r0.]

$$a \sim \mathcal{N}\big(\mu(s, \theta), \Sigma(s, \theta)\big)$$

I want to learn to maximize the average reward obtained:

$$\max_\theta \; \mathbb{E}_a\, \rho(s_0, a)$$

Likelihood ratio estimator (works for both continuous and discrete actions):

$$\mathbb{E}_a\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \rho(s_0, a)\right]$$

If σ² is constant:

$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\big(a - \mu(s; \theta)\big)}{\sigma^2}\,\frac{\partial \mu(s; \theta)}{\partial \theta}$$
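To check the fixed-σ formula numerically, here is a small sketch (my own, with a hypothetical linear mean μ(s; θ) = θ·s); autodiff through `Normal.log_prob` should match the hand-derived expression.

```python
import torch

s, a, sigma = torch.tensor(2.0), torch.tensor(1.0), 0.5
theta = torch.tensor(0.3, requires_grad=True)

mu = theta * s                                        # mu(s; theta) = theta * s, so d mu / d theta = s
logp = torch.distributions.Normal(mu, sigma).log_prob(a)
logp.backward()

analytic = (a - mu.detach()) / sigma**2 * s           # (a - mu)/sigma^2 * d mu/d theta
print(theta.grad.item(), analytic.item())             # both ≈ 3.2
```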

Page 13

Example: Gaussian policy

$$a \sim \mathcal{N}\big(\mu(s, \theta), \Sigma(s, \theta)\big)$$

We can either:
• Assume σ fixed (not learned)
• Learn one value σ(s, θ) for all action coordinates (spherical or isotropic Gaussian)
• Learn σ_i(s, θ), i = 1…n (diagonal covariance)
• Learn a full covariance matrix Σ(s, θ)


Page 15

Re-parametrization for Gaussian

[Computation graph: θ and s0 feed the policy network, which outputs μθ(s) and σθ(s); independent noise z is sampled and combined with them to produce the action a0, which feeds ρ(s0, a0) = r0.]

Instead of: a ∼ 𝒩(μ(s, θ), Σ(s, θ))
We can write: a = μ(s, θ) + zσ(s, θ), with z ∼ 𝒩(0, I_{n×n})

Because:

$$\mathbb{E}_z\big(\mu(s, \theta) + z\sigma(s, \theta)\big) = \mu(s, \theta), \qquad \mathrm{Var}_z\big(\mu(s, \theta) + z\sigma(s, \theta)\big) = \sigma(s, \theta)^2\, I_{n \times n}$$

Qs:
• Does a depend on θ?
• Does z depend on θ?

$$\max_\theta \; \mathbb{E}_a\, \rho(s_0, a) \quad\longrightarrow\quad \max_\theta \; \mathbb{E}_z\, \rho(s_0, a(z))$$
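A small sketch of the reparameterization itself (hypothetical numbers standing in for μ(s, θ) and σ(s, θ)): sampling a = μ + z⊙σ with external noise z gives the same mean and variance as sampling a ∼ 𝒩(μ, σ²I), but a is now a differentiable function of the parameters while z is not.

```python
import torch

n = 3                                                       # action dimension
mu = torch.tensor([0.1, -0.2, 0.3], requires_grad=True)     # stand-in for mu(s, theta)
sigma = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)   # stand-in for sigma(s, theta)

z = torch.randn(10_000, n)    # z ~ N(0, I), independent of theta
a = mu + z * sigma            # a = mu + z * sigma, differentiable in mu and sigma

print(a.mean(dim=0))          # ≈ mu        (E_z[mu + z sigma] = mu)
print(a.var(dim=0))           # ≈ sigma**2  (Var_z[mu + z sigma] = sigma^2)
print(a.requires_grad, z.requires_grad)   # True False: gradients flow to the parameters through a, not z
```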

Page 16

Re-parametrization for Gaussian

Instead of: a ∼ 𝒩(μ(s, θ), Σ(s, θ))
We can write: a = μ(s, θ) + zσ(s, θ), with z ∼ 𝒩(0, I_{n×n})

$$\max_\theta \; \mathbb{E}_a\, \rho(s_0, a) \quad\longrightarrow\quad \max_\theta \; \mathbb{E}_z\, \rho(s_0, a(z))$$

What do we gain?

$$\nabla_\theta\, \mathbb{E}_z\left[\rho\big(a(\theta, z), s\big)\right] = \mathbb{E}_z\, \frac{d\rho\big(a(\theta, z), s\big)}{da}\frac{da(\theta, z)}{d\theta}, \qquad \frac{da(\theta, z)}{d\theta} = \frac{d\mu(s, \theta)}{d\theta} + z\,\frac{d\sigma(s, \theta)}{d\theta}$$

Page 17

Re-parametrization for Gaussian

Instead of: a ∼ 𝒩(μ(s, θ), Σ(s, θ))
We can write: a = μ(s, θ) + zσ(s, θ), with z ∼ 𝒩(0, I_{n×n})

What do we gain?

$$\nabla_\theta\, \mathbb{E}_z\left[\rho\big(a(\theta, z), s\big)\right] = \mathbb{E}_z\, \frac{d\rho\big(a(\theta, z), s\big)}{da}\frac{da(\theta, z)}{d\theta}, \qquad \frac{da(\theta, z)}{d\theta} = \frac{d\mu(s, \theta)}{d\theta} + z\,\frac{d\sigma(s, \theta)}{d\theta}$$

Sample estimate:

$$\nabla_\theta\, \frac{1}{N}\sum_{i=1}^{N}\rho\big(a(\theta, z_i), s\big) = \frac{1}{N}\sum_{i=1}^{N}\left.\frac{d\rho\big(a(\theta, z), s\big)}{da}\frac{da(\theta, z)}{d\theta}\right|_{z = z_i}$$
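A sketch of this sample estimate with hypothetical μ, σ, and reward ρ (all my own choices, for a single fixed state): draw N noise samples z_i, form a(θ, z_i) = μ + σ z_i, and let autodiff average dρ/da · da/dθ over the samples.

```python
import torch

theta = torch.tensor([0.2, 1.0], requires_grad=True)   # hypothetical (mean parameter, log-std parameter)
target = 1.5

def action(z):
    mu, sigma = theta[0], torch.exp(theta[1])           # mu(s, theta), sigma(s, theta) for a fixed state
    return mu + sigma * z                               # a(theta, z)

def rho(a):
    return -(a - target) ** 2                           # hypothetical known reward rho(s, a)

N = 50_000
z = torch.randn(N)                                      # z_i ~ N(0, 1)
estimate = rho(action(z)).mean()                        # (1/N) sum_i rho(a(theta, z_i), s)
grad = torch.autograd.grad(estimate, theta)[0]          # averages d rho/da * da/dtheta over the z_i
print(grad)                                             # gradient estimate of E_z[rho(a(theta, z), s)]
```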

Page 18

Re-parametrization for Gaussian

[Computation graph: z ∼ 𝒩(0, I) and the policy outputs μθ(s), σθ(s) combine into a = μ(s, θ) + z ⊙ σ(s, θ).]

Likelihood ratio grad estimator:

$$\mathbb{E}_a\, \nabla_\theta \log \pi_\theta(a \mid s)\, \rho(s, a)$$

Pathwise derivative:

$$\mathbb{E}_z\, \frac{d\rho\big(a(\theta, z), s\big)}{da}\frac{da(\theta, z)}{d\theta}$$

The pathwise derivative uses the derivative of the reward w.r.t. the action!

Page 19

Known MDP with known deterministic reward and dynamics functions

[Computation graph unrolled over time: θ feeds the policy πθ(s) at every step; a0 = πθ(s0) gives r0 = ρ(s0, a0) and s1 = T(s0, a0); a1 = πθ(s1) gives r1 = ρ(s1, a1); and so on.]

Can we apply the chain rule through deterministic policies?
Can we apply the chain rule through sampled actions?

Page 20

Re-parametrized Policy Gradients

• Episodic MDP: [graphical model: states s1 … sT, actions a1 … aT, return RT]
• We want to compute ∇θ 𝔼[RT].
• Reparameterize: at = π(st, zt; θ). zt is noise from a fixed distribution.
• Only works if P(s2 | s1, a1) is known.

The problem is: we do not know the reward function, nor the dynamics function.
Solution: we will approximate it with the Q function!

Page 21

Re-parametrized Policy Gradients

• Episodic MDP: [graphical model: states s1 … sT, noise z1 … zT feeding actions a1 … aT, return RT]
• We want to compute ∇θ 𝔼[RT].
• Reparameterize: at = π(st, zt; θ). zt is noise from a fixed distribution.
• Only works if P(s2 | s1, a1) is known.


Page 23

Re-parametrized Policy Gradients: Using a Q-function

[Graphical model: states s1 … sT, noise z1 … zT feeding actions a1 … aT, return RT]

$$\frac{d}{d\theta}\mathbb{E}\left[R_T\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{dR_T}{da_t}\frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{d}{da_t}\mathbb{E}\left[R_T \mid a_t\right]\frac{da_t}{d\theta}\right]$$

The problem is: we do not know the reward function!
Solution: we will approximate it with the Q function!

Page 24

Re-parametrized Policy Gradients

$$\frac{d}{d\theta}\mathbb{E}\left[R_T\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{dR_T}{da_t}\frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{d}{da_t}\mathbb{E}\left[R_T \mid a_t\right]\frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{dQ(s_t, a_t)}{da_t}\frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T} \frac{d}{d\theta} Q\big(s_t, \pi(s_t, z_t; \theta)\big)\right]$$

Learn Q_ϕ to approximate Q^{π,γ}, and use it to compute gradient estimates.

Learning continuous control by stochastic value gradients, Heess et al.

Page 25

Stochastic Value Gradients V0: the SVG(0) Algorithm

Learn Q_ϕ to approximate Q^{π,γ}, and use it to compute gradient estimates.

Pseudocode:

for iteration = 1, 2, … do
    Execute policy πθ to collect T timesteps of data
    Update πθ using g ∝ ∇θ Σ_{t=1}^T Q(s_t, π(s_t, z_t; θ))
    Update Q_ϕ using g ∝ ∇ϕ Σ_{t=1}^T (Q_ϕ(s_t, a_t) − Q̂_t)², e.g. with TD(λ)
end for

N. Heess, G. Wayne, D. Silver, et al. "Learning continuous control policies by stochastic value gradients". In: NIPS, 2015.
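A heavily simplified sketch of the SVG(0) loop above (my own, not the paper's implementation): it uses a tiny made-up environment, a reparameterized Gaussian policy, and one-step TD targets instead of TD(λ). All names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma, T = 4, 1, 0.99, 200           # hypothetical sizes / horizon

class ToyEnv:
    """Stand-in environment (illustrative only): drive a random state toward zero."""
    def reset(self):
        self.s = torch.randn(state_dim)
        return self.s
    def step(self, a):
        self.s = self.s + 0.1 * a.expand(state_dim)
        return self.s, -float(self.s.pow(2).sum()), False, {}

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, 2 * action_dim))        # outputs (mu, log_sigma)
q_fn = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                     nn.Linear(64, 1))                       # Q_phi(s, a)
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
q_opt = torch.optim.Adam(q_fn.parameters(), lr=1e-3)

def act(s):
    mu, log_sigma = policy(s).chunk(2, dim=-1)
    z = torch.randn_like(mu)                                 # z_t ~ N(0, I)
    return mu + torch.exp(log_sigma) * z                     # a_t = pi(s_t, z_t; theta), reparameterized

env = ToyEnv()
for iteration in range(100):
    # Execute policy pi_theta to collect T timesteps of data
    s = env.reset()
    S, A, R, S2 = [], [], [], []
    for t in range(T):
        with torch.no_grad():
            a = act(s)
        s2, r, done, _ = env.step(a)
        S.append(s); A.append(a); R.append(r); S2.append(s2)
        s = s2
    S, A, S2 = torch.stack(S), torch.stack(A), torch.stack(S2)
    R = torch.tensor(R).unsqueeze(-1)

    # Update Q_phi: minimize sum_t (Q_phi(s_t, a_t) - Qhat_t)^2 with one-step TD targets
    with torch.no_grad():
        target = R + gamma * q_fn(torch.cat([S2, act(S2)], dim=-1))
    q_loss = ((q_fn(torch.cat([S, A], dim=-1)) - target) ** 2).sum()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Update pi_theta: ascend sum_t Q_phi(s_t, pi(s_t, z_t; theta)) through the reparameterized action
    pi_loss = -q_fn(torch.cat([S, act(S)], dim=-1)).sum()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```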

Page 26

Stochastic Value Gradients V0

DPG in simulated physics:
• Physics domains are simulated in MuJoCo
• End-to-end learning of the control policy from raw pixels s
• Input state s is a stack of raw pixels from the last 4 frames
• Two separate convnets are used for Q and π
• Policy π is adjusted in the direction that most improves Q

[Diagram: state s feeds an actor DNN(θμ) that outputs μ(s; θ) and σ(s; θ); with noise z ∼ 𝒩(0, 1), the action is a = μ(s; θ) + zσ(s; θ); s and a feed a critic DNN(θQ) that outputs Q(s, a).]

Learning continuous control by stochastic value gradients, Heess et al.

Page 27

Compare with: Deep Deterministic Policy Gradients

No z!

[Diagram: state s feeds an actor DNN(θμ) that outputs the action a = μ(s; θ) directly; s and a feed a critic DNN(θQ) that outputs Q(s, a).]

Learning continuous control by stochastic value gradients, Heess et al.
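To make the "no z" distinction concrete, here is a tiny hypothetical sketch of the only line that changes between the two actor objectives; both backpropagate dQ/da through the action, but only SVG(0) injects noise into it. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 1
mu_net = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, action_dim))
log_sigma = nn.Parameter(torch.zeros(action_dim))            # only used by the SVG(0) actor
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.Tanh(), nn.Linear(32, 1))

states = torch.randn(16, state_dim)

# DDPG: deterministic actor, a = mu(s; theta) -- no noise inside the actor objective
a_ddpg = mu_net(states)
ddpg_actor_loss = -critic(torch.cat([states, a_ddpg], dim=-1)).mean()

# SVG(0): reparameterized stochastic actor, a = mu(s; theta) + sigma * z, z ~ N(0, I)
z = torch.randn(16, action_dim)
a_svg = mu_net(states) + torch.exp(log_sigma) * z
svg_actor_loss = -critic(torch.cat([states, a_svg], dim=-1)).mean()
```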