Deterministic PG, Pathwise Derivatives
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon
School of Computer Science
Spring 2020, CMU 10-403
Computing Gradients of Expectations
∇θ 𝔼_{a∼πθ}[R(a, s)]

g = (1/N) ∑_{i=1}^{N} ∇θ log Pθ(τ^{(i)}) R(τ^{(i)}) = (1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∇θ log πθ(a_t^{(i)} | s_t^{(i)}) R(τ^{(i)})
• Do we have access to the reward function?
• Do we have access to the analytic gradients of rewards w.r.t. actions a?
• For continuous actions a, do we have access to the analytic gradients of Q(s, a, w) or Q(s, a, w_1) − V(s, w_2) w.r.t. actions a?
• Have we used the latter anywhere?
Policy objective: max_θ 𝔼_{τ∼Pθ(τ)}[R(τ)]

Likelihood ratio gradient estimator:
𝔼_{τ∼Pθ(τ)}[∇θ log Pθ(τ) R(τ)]    𝔼_{s∼d_0(s), a∼πθ(a|s)}[∇θ log πθ(a|s) (Q(s, a, w_1) − V(s, w_2))]
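As a concrete illustration of the likelihood-ratio (score function) estimator above, here is a minimal PyTorch sketch. All names (policy_net, trajectories, optimizer) are illustrative assumptions, not part of the lecture; the policy is assumed to output logits over discrete actions. Note that the return R(τ) is never differentiated, only log πθ is.

import torch

def likelihood_ratio_gradient_step(policy_net, trajectories, optimizer):
    # g = 1/N sum_i sum_t grad_theta log pi_theta(a_t^i | s_t^i) * R(tau^i)
    # `trajectories` is assumed to be a list of (states, actions, traj_return) tuples.
    loss = 0.0
    for states, actions, traj_return in trajectories:
        logits = policy_net(states)                          # [T, num_actions]
        log_probs = torch.log_softmax(logits, dim=-1)        # log pi_theta(. | s_t)
        log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = loss - log_pi_a.sum() * traj_return           # surrogate whose gradient is -g
    (loss / len(trajectories)).backward()                    # R(tau) is NOT differentiated
    optimizer.step()
    optimizer.zero_grad()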
What if we have a deterministic policy?
With a deterministic policy a = πθ(s), the objective becomes

max_θ 𝔼[ ∑_{t=1}^{T} R(s_t, a_t) ]

Qs:
• Can we backpropagate through R?

Replacing the reward with a Q-function, still with a = πθ(s):

max_θ 𝔼[ ∑_{t=1}^{T} Q(s_t, a_t) ]

Q: Does this expectation depend on θ?
Qs:
• Can we backpropagate through Q?

If so, the chain rule gives

𝔼[ ∑_t dQ(s_t, a_t)/dθ ] = 𝔼[ ∑_t (dQ(s_t, a_t)/da_t) (da_t/dθ) ]
Deriving the Policy Gradient, Reparameterized
• Episodic MDP: θ → a_1, a_2, …, a_T; states s_1, s_2, …, s_T; return R_T
• Want to compute ∇θ 𝔼[R_T]. We'll use ∇θ log π(a_t | s_t; θ)
• Reparameterize: a_t = π(s_t, z_t; θ). z_t is noise from a fixed distribution.
• Only works if P(s_2 | s_1, a_1) is known.
Using a Q-function
• Same graphical model, with noise nodes z_1, z_2, …, z_T feeding the actions a_1, a_2, …, a_T.

d/dθ 𝔼[R_T] = 𝔼[ ∑_{t=1}^{T} (dR_T/da_t) (da_t/dθ) ] = 𝔼[ ∑_{t=1}^{T} (d/da_t) 𝔼[R_T | a_t] (da_t/dθ) ]
            = 𝔼[ ∑_{t=1}^{T} (dQ(s_t, a_t)/da_t) (da_t/dθ) ] = 𝔼[ ∑_{t=1}^{T} (d/dθ) Q(s_t, π(s_t, z_t; θ)) ]
Deep Deterministic Policy Gradients
For a deterministic policy a = πθ(s), the same pathwise gradient becomes

𝔼[ ∑_t dQ(s_t, a_t)/dθ ] = 𝔼[ ∑_{t=1}^{T} (dQ(s_t, a_t)/da_t) (da_t/dθ) ]

[Architecture: an actor network DNN(θ^μ) maps s to the action a; a critic network DNN(θ^Q) maps (s, a) to Q(s, a).]

Continuous control with deep reinforcement learning, Lillicrap et al. 2016
Deep Deterministic Policy Gradients
We are following a stochastic behavior policy to collect data. DDPG: Deep Q-learning for continuous actions, with a deterministic policy a = μ(θ) trained through the critic Q(s, a).

Continuous control with deep reinforcement learning, Lillicrap et al. 2016
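To make the gradient above concrete, here is a minimal sketch of a DDPG-style actor update in PyTorch. The names actor, critic, actor_optimizer, and states are illustrative assumptions; the replay buffer, target networks, and the critic update from the paper are omitted.

import torch

def ddpg_actor_update(actor, critic, actor_optimizer, states):
    # Deterministic policy gradient step: ascend dQ(s, mu(s; theta_mu)) / d theta_mu.
    actions = actor(states)                        # a = mu(s; theta_mu), differentiable in theta_mu
    actor_loss = -critic(states, actions).mean()   # maximizing Q  <=>  minimizing -Q
    actor_optimizer.zero_grad()
    actor_loss.backward()                          # chain rule: (dQ/da) * (da/d theta_mu)
    actor_optimizer.step()                         # only the actor's parameters are updated here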
Computing Gradients of Expectations

When the variable w.r.t. which we are differentiating appears in the distribution Pθ(x), use the likelihood ratio gradient estimator:
∇θ 𝔼_{x∼Pθ(x)}[f(x)] = 𝔼_{x∼Pθ(x)}[∇θ log Pθ(x) f(x)]

When the variable w.r.t. which we are differentiating appears inside the expectation:
∇θ 𝔼_{x∼P(x)}[f(x(θ))] = 𝔼_{x∼P(x)}[∇θ f(x(θ))] = 𝔼_{x∼P(x)}[ (df(x(θ))/dx) (dx/dθ) ]

Re-parametrization trick: for some distributions we can switch from one gradient estimator to the other.
Q: From which to which? Why would we want to do so?
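As a toy check of the two estimators (purely illustrative, not from the lecture), take f(x) = x² with x ∼ 𝒩(θ, 1), so ∇θ 𝔼[f(x)] = 2θ. The sketch below computes both estimates; the reparameterized (pathwise) one typically has much lower variance.

import torch

theta = torch.tensor(1.5, requires_grad=True)
N = 10_000

# Likelihood-ratio (score function) estimator: E[ d/dtheta log p_theta(x) * f(x) ]
x = theta.detach() + torch.randn(N)          # x ~ N(theta, 1); no gradient path through sampling
score = x - theta.detach()                   # d/dtheta log N(x; theta, 1) = (x - theta)
lr_estimate = (score * x.pow(2)).mean()

# Re-parameterized (pathwise) estimator: x = theta + z, z ~ N(0, 1)
z = torch.randn(N)
pathwise_estimate = torch.autograd.grad((theta + z).pow(2).mean(), theta)[0]

print(lr_estimate.item(), pathwise_estimate.item())   # both approximate 2 * theta = 3.0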
Imagine we knew the reward function ρ(s, a)
[Computation graph: the parameters θ and the state s_0 feed the policy πθ(s), which produces the action a_0; the reward function ρ(s, a) produces r_0.]
Legend:
• deterministic node: the value is a deterministic function of its input
• stochastic node: the value is sampled based on its input (which parametrizes the distribution to sample from)
• deterministic computation node
Deterministic policy
a = πθ(s)
I want to learn to maximize the average reward obtained: max_θ ρ(s_0, a)
I can compute the gradient with the chain rule:
∇θ ρ(s, a) = (dρ/da) (da/dθ)
where dρ/da is the derivative of the *known* reward function w.r.t. the action.
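A tiny autograd illustration of this chain rule (the policy and the "known" reward below are made-up stand-ins, not from the lecture):

import torch

theta = torch.tensor([0.3, -0.7], requires_grad=True)
s0 = torch.tensor([1.0, 2.0])

def policy(s, theta):                 # deterministic policy a = pi_theta(s)   (illustrative)
    return torch.tanh(s * theta)

def reward(s, a):                     # *known*, differentiable reward rho(s, a)   (illustrative)
    return -((a - 0.5 * s) ** 2).sum()

a0 = policy(s0, theta)
r0 = reward(s0, a0)
r0.backward()                         # theta.grad = (d rho / d a) * (d a / d theta)
print(theta.grad)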
Stochastic policy
[Same computation graph, but the action a_0 is now a stochastic node sampled from πθ(s).]
I want to learn to maximize the average reward obtained: max_θ 𝔼_a[ρ(s_0, a)]
We now need ∇θ 𝔼_a[ρ(s_0, a)]; the plain chain rule ∇θ ρ(s, a) = (dρ/da)(da/dθ) no longer applies directly, because a is sampled.
Stochastic policy
I want to learn to maximize the average reward obtained: max_θ 𝔼_a[ρ(s_0, a)]
𝔼_a[∇θ log πθ(a | s) ρ(s_0, a)]
Likelihood ratio estimator: works for both continuous and discrete actions.
Example: Gaussian policy
[Computation graph: θ and s_0 feed the policy, which outputs the mean μθ(s) and standard deviation σθ(s); the action a_0 is sampled and fed to ρ(s, a), producing r_0.]
a ∼ 𝒩(μ(s, θ), Σ(s, θ))
I want to learn to maximize the average reward obtained: max_θ 𝔼_a[ρ(s_0, a)]
Likelihood ratio estimator (works for both continuous and discrete actions): 𝔼_a[∇θ log πθ(a | s) ρ(s_0, a)]
If σ² is constant:
∇θ log πθ(a | s) = ((a − μ(s; θ)) / σ²) ∂μ(s; θ)/∂θ
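A small sketch (illustrative scalar-mean example, not from the lecture) checking the fixed-σ² formula against autograd:

import torch

sigma = 0.5
theta = torch.tensor([0.2, -1.0], requires_grad=True)
s = torch.tensor([1.0, 2.0])

mu = (s * theta).sum()                               # mu(s; theta)   (illustrative scalar mean)
a = torch.normal(mu.detach(), torch.tensor(sigma))   # sample an action; no gradient through sampling

log_prob = -((a - mu) ** 2) / (2 * sigma ** 2)       # log pi_theta(a|s), constants dropped
log_prob.backward()

analytic = (a - mu.detach()) / sigma ** 2 * s        # ((a - mu)/sigma^2) * d mu/d theta, with d mu/d theta = s
print(theta.grad, analytic)                          # the two gradients match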
Example: Gaussian policy
For the variance, we can either:
• Assume σ fixed (not learned)
• Learn one value σ(s, θ) shared across all action coordinates (spherical or isotropic Gaussian)
• Learn σ_i(s, θ), i = 1, …, n (diagonal covariance)
• Learn a full covariance matrix Σ(s, θ)
Re-parametrization for Gaussian
[Computation graph: θ and s_0 produce μθ(s) and σθ(s); independent noise z is sampled; the action a_0 is now a deterministic function of (μ, σ, z); ρ(s, a) produces r_0.]
Instead of: a ∼ 𝒩(μ(s, θ), Σ(s, θ))
We can write: a = μ(s, θ) + z σ(s, θ), with z ∼ 𝒩(0, I_{n×n})
Qs:
• Does a depend on θ?
• Does z depend on θ?
The objective max_θ 𝔼_a[ρ(s_0, a)] becomes max_θ 𝔼_z[ρ(s_0, a(z))]
Because:
𝔼_z[μ(s, θ) + z σ(s, θ)] = μ(s, θ),   Var_z[μ(s, θ) + z σ(s, θ)] = σ(s, θ)² I_{n×n}
Re-parametrization for Gaussian: what do we gain?

∇θ 𝔼_z[ρ(a(θ, z), s)] = 𝔼_z[ (dρ(a(θ, z), s)/da) (da(θ, z)/dθ) ],   where  da(θ, z)/dθ = dμ(s, θ)/dθ + z dσ(s, θ)/dθ
Re-parametrization for Gaussian: sample estimate

∇θ (1/N) ∑_{i=1}^{N} ρ(a(θ, z_i), s) = (1/N) ∑_{i=1}^{N} (dρ(a(θ, z), s)/da) (da(θ, z)/dθ) |_{z=z_i}
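A minimal sketch of this reparameterized sample estimate (the reward, mean, and standard-deviation functions below are illustrative assumptions). In PyTorch the same a = μ + zσ construction is also available as torch.distributions.Normal(mu, sigma).rsample().

import torch

theta = torch.tensor([0.5, 0.1], requires_grad=True)
s0 = torch.tensor([1.0, -1.0])

def rho(s, a):                                  # illustrative differentiable reward
    return -((a - s) ** 2).sum(dim=-1)

mu = s0 * theta[0]                              # mu(s, theta)      (illustrative)
sigma = torch.exp(theta[1]) * torch.ones(2)     # sigma(s, theta)   (illustrative, positive)

N = 64
z = torch.randn(N, 2)                           # z_i ~ N(0, I)
a = mu + z * sigma                              # a(theta, z_i) = mu + z_i * sigma
grad_estimate = torch.autograd.grad(rho(s0, a).mean(), theta)[0]
print(grad_estimate)                            # pathwise estimate of grad_theta E_z[rho(s0, a(theta, z))]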
Re-parametrization for Gaussian
z ∼ 𝒩(0, I),   a = μ(s, θ) + z ⊙ σ(s, θ)
The pathwise derivative uses the derivative of the reward w.r.t. the action!
Likelihood ratio gradient estimator: 𝔼_a[∇θ log πθ(a | s) ρ(s, a)]
Pathwise derivative: 𝔼_z[ (dρ(a(θ, z), s)/da) (da(θ, z)/dθ) ]
Known MDP with known deterministic reward and dynamics functions
[Computation graph unrolled in time: θ feeds the policy πθ(s) at every step; the known dynamics T(s, a) produce s_1 from (s_0, a_0), and the known reward ρ(s, a) produces r_0, r_1, ….]
Can we apply the chain rule through deterministic policies?
Can we apply the chain rule through sampled actions?
Re-parametrized Policy Gradients
We want to compute: ∇θ 𝔼[R_T]
• Episodic MDP: θ → a_1, …, a_T; states s_1, …, s_T; return R_T
• Reparameterize: a_t = π(s_t, z_t; θ). z_t is noise from a fixed distribution.
• This only works if P(s_2 | s_1, a_1) is known.
The problem is: we do not know the reward function, nor the dynamics function.
Solution: we will approximate it with the Q function!
Using a Q-function

d/dθ 𝔼[R_T] = 𝔼[ ∑_{t=1}^{T} (dR_T/da_t) (da_t/dθ) ] = 𝔼[ ∑_{t=1}^{T} (d/da_t) 𝔼[R_T | a_t] (da_t/dθ) ]
            = 𝔼[ ∑_{t=1}^{T} (dQ(s_t, a_t)/da_t) (da_t/dθ) ] = 𝔼[ ∑_{t=1}^{T} (d/dθ) Q(s_t, π(s_t, z_t; θ)) ]
Stochastic Value Gradients: SVG(0) Algorithm
• Learn Qϕ to approximate Q^{π,γ}, and use it to compute gradient estimates.
• Pseudocode:
  for iteration = 1, 2, … do
      Execute policy πθ to collect T timesteps of data
      Update πθ using g ∝ ∇θ ∑_{t=1}^{T} Q(s_t, π(s_t, z_t; θ))
      Update Qϕ using g ∝ ∇ϕ ∑_{t=1}^{T} (Qϕ(s_t, a_t) − Q̂_t)², e.g. with TD(λ)
  end for

N. Heess, G. Wayne, D. Silver, et al. "Learning continuous control policies by stochastic value gradients". In: NIPS 2015.
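A compact sketch of one SVG(0) update matching the pseudocode above. The networks actor and critic, their optimizers, the actor.mean/actor.std interface, and the q_targets tensor (standing in for the TD(λ) targets Q̂_t) are all illustrative assumptions; collecting the T timesteps of data with πθ is assumed to happen outside this function.

import torch

def svg0_update(actor, critic, actor_opt, critic_opt, states, actions, q_targets):
    # Actor update: g proportional to grad_theta sum_t Q_phi(s_t, pi(s_t, z_t; theta))
    z = torch.randn_like(actions)
    reparam_actions = actor.mean(states) + z * actor.std(states)   # a = mu + z * sigma   (illustrative interface)
    actor_loss = -critic(states, reparam_actions).sum()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Critic update: g proportional to grad_phi sum_t (Q_phi(s_t, a_t) - Qhat_t)^2
    critic_loss = (critic(states, actions) - q_targets).pow(2).sum()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()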
Stochastic Value Gradients SVG(0)
DPG in simulated physics:
• Physics domains are simulated in MuJoCo
• End-to-end learning of the control policy from raw pixels s
• Input state s is a stack of raw pixels from the last 4 frames
• Two separate convnets are used for Q and π
• Policy π is adjusted in the direction that most improves Q
[Architecture: s → DNN(θ^μ) → a, with noise z; (s, a) → DNN(θ^Q) → Q(s, a)]
z ∼ 𝒩(0, 1),   a = μ(s; θ) + z σ(s; θ)
Learning continuous control by stochastic value gradients, Heess et al.
Compare with: Deep Deterministic Policy Gradients
[Same architecture: s → DNN(θ^μ) → a; (s, a) → DNN(θ^Q) → Q(s, a)]
a = μ(θ), with no z!
Learning continuous control by stochastic value gradients, Heess et al.