
Global Convergence of Policy Optimization

Tengyang Xie and Wenbin Wan


Outline

Background

Global Convergence in Tabular MDPs

Global Convergence w/ Function Approximation

Neural Policy Gradient Methods


Section 1

Background


Markov Decision Process (MDP)

An (infinite-horizon discounted) MDP [Sutton and Barto, 1998; Puterman, 2014] is a tuple (S, A, P, R, γ, d0):

- state s ∈ S
- action a ∈ A
- transition function P : S × A → ∆(S)
- reward function R : S × A → [0, Rmax]
- discount factor γ ∈ [0, 1)
- initial state distribution d0 ∈ ∆(S)

(∆(·) denotes the probability simplex)


Notation for Value Functions and Policies

- policy π : S → ∆(A)
- trajectory induced by π: (s0, a0, r0, s1, a1, r1, . . .), where s0 ∼ d0, a_t ∼ π(·|s_t), r_t = R(s_t, a_t), s_{t+1} ∼ P(·|s_t, a_t), ∀t ≥ 0
- (state-)value function V^π(s) := E[∑_{t=0}^∞ γ^t r_t | s0 = s, π]
- Q-function Q^π(s, a) := E[∑_{t=0}^∞ γ^t r_t | s0 = s, a0 = a, π]
- advantage function A^π(s, a) := Q^π(s, a) − V^π(s)
- expected discounted return J(π) := E[∑_{t=0}^∞ γ^t r_t | s0 ∼ d0, π]
- optimal policy: π⋆; value function of π⋆: V⋆; Q-function of π⋆: Q⋆
- normalized discounted state occupancy d^π(s) := (1 − γ) ∑_{t=0}^∞ γ^t Pr[s_t = s | s0 ∼ d0, π], and d^π(s, a) := d^π(s) π(a|s)
- Bellman optimality operator T:  T V(s) := max_a E[R(s, a) + γ V(s′) | s, a]
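For a small tabular MDP, all of these quantities can be computed exactly. The following NumPy sketch (function and variable names such as policy_eval and occupancy are our own, not from the lecture) evaluates V^π, Q^π, d^π, and J(π) for a given policy; later sketches in this transcript reuse these helpers.

```python
import numpy as np

def policy_eval(P, R, gamma, pi):
    """Exact policy evaluation in a tabular MDP.
    P: (S, A, S) transition tensor, R: (S, A) rewards, pi: (S, A) policy."""
    S, A = R.shape
    P_pi = np.einsum("sap,sa->sp", P, pi)      # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)                # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # V^pi = (I - gamma P_pi)^{-1} r_pi
    Q = R + gamma * P @ V                      # Q^pi(s,a) = R(s,a) + gamma E[V^pi(s')]
    return V, Q

def occupancy(P, gamma, pi, d0):
    """Normalized discounted state occupancy d^pi(s) = (1-gamma) sum_t gamma^t Pr[s_t = s]."""
    S = d0.shape[0]
    P_pi = np.einsum("sap,sa->sp", P, pi)
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)

def expected_return(V, d0):
    """J(pi) = E_{s0 ~ d0}[V^pi(s0)]."""
    return d0 @ V
```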


Policy Parameterizations

- direct parameterization: πθ(a|s) = θ_{s,a}, where θ ∈ ∆(A)^{|S|}.

- softmax parameterization: πθ(a|s) = exp(θ_{s,a}) / ∑_{a′} exp(θ_{s,a′}), where θ ∈ R^{|S||A|}.

Example function approximation: θ_{s,a} ⇒ θ · φ_{s,a}
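A minimal sketch of the softmax parameterization and its score function ∇θ log πθ(a|s), assuming a tabular logit matrix theta of shape (|S|, |A|) (names are illustrative):

```python
import numpy as np

def softmax_policy(theta):
    """theta: (S, A) logits; returns pi_theta(a|s) as an (S, A) array."""
    z = theta - theta.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grad_log_softmax(theta, s, a):
    """grad_theta log pi_theta(a|s): nonzero only in row s, where it equals e_a - pi_theta(.|s)."""
    pi = softmax_policy(theta)
    g = np.zeros_like(theta)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g
```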


Policy Gradient Theorem

Objective function:

  max_θ J(πθ)

Theorem ([Sutton et al., 2000])

  ∇θJ(πθ) = 1/(1−γ) · E_{(s,a)∼d^{πθ}} [ ∇θ log πθ(a|s) · Q^{πθ}(s, a) ]      (1)

Corollary

  ∇θJ(πθ) = 1/(1−γ) · E_{(s,a)∼d^{πθ}} [ ∇θ log πθ(a|s) · A^{πθ}(s, a) ]
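For the tabular softmax parameterization, plugging the score function into Eq. (1) gives the closed form ∂J(πθ)/∂θ_{s,a} = 1/(1−γ) · d^{πθ}(s) · πθ(a|s) · A^{πθ}(s, a). A minimal sketch of the exact gradient, assuming the hypothetical policy_eval, occupancy, and softmax_policy helpers from the earlier sketches:

```python
import numpy as np

def softmax_policy_gradient(P, R, gamma, d0, theta):
    """Exact gradient of J(pi_theta) for a tabular softmax policy:
    dJ/dtheta[s, a] = d^pi(s) * pi(a|s) * A^pi(s, a) / (1 - gamma)."""
    pi = softmax_policy(theta)          # (S, A)
    V, Q = policy_eval(P, R, gamma, pi)
    A = Q - V[:, None]                  # advantage A^pi(s, a)
    d = occupancy(P, gamma, pi, d0)     # d^pi(s)
    return d[:, None] * pi * A / (1 - gamma)
```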


Section 2

Global Convergence in Tabular MDPs


Overview

The global convergence of policy gradient comes from its special structure.

There are two main ways to attain global convergence:

- Policy improvement — all stationary points are globally optimal [Bhandari and Russo, 2019]
- Bounding the performance difference — the performance difference between π and π⋆ can be bounded by (variants of) the policy gradient [Agarwal et al., 2019]


Ways to Attain Global Convergence: I. Policy Improvement

Warm-up: Policy Improvement Lemma
Let π be any policy and π+ the greedy policy w.r.t. Q^π. Then V^{π+}(s) ≥ V^π(s) for any s ∈ S.

Proof. For any s ∈ S, we have

  V^π(s) ≤ Q^π(s, π+(s))
         = E[r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t ∼ π+]
         ≤ E[r_{t+1} + γ Q^π(s_{t+1}, π+(s_{t+1})) | s_t = s, a_t ∼ π+]
         ≤ E[r_{t+1} + γ r_{t+2} + γ² V^π(s_{t+2}) | s_t = s, a_t ∼ π+, a_{t+1} ∼ π+]
         ⋮
         ≤ V^{π+}(s)
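The lemma is easy to sanity-check numerically. A minimal sketch, assuming the hypothetical policy_eval helper from the earlier sketch:

```python
import numpy as np

def greedy_improvement_holds(P, R, gamma, pi):
    """Check V^{pi+}(s) >= V^pi(s) for the greedy policy pi+ w.r.t. Q^pi."""
    V, Q = policy_eval(P, R, gamma, pi)
    S, A = R.shape
    pi_plus = np.zeros_like(pi)
    pi_plus[np.arange(S), Q.argmax(axis=1)] = 1.0   # greedy policy w.r.t. Q^pi
    V_plus, _ = policy_eval(P, R, gamma, pi_plus)
    return bool(np.all(V_plus >= V - 1e-10))        # small tolerance for round-off
```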


Ways to Attain Global Convergence: I. Policy Improvement

How about π′ := π + α(π+ − π), where α ∈ (0, 1)?

The policy can be improved along this direction!

Theorem (No spurious local optima, Theorem 1 in [Bhandari and Russo, 2019])
Under Assumptions 1–4 in [Bhandari and Russo, 2019], let πθ be a policy and π+ a policy iteration update of πθ. Take u to satisfy

  (d/dα) π_{θ+αu}(s) |_{α=0} = π+(s) − πθ(s),   ∀s ∈ S.

Then,

  (d/dα) J(θ + αu) |_{α=0} ≥ 1/(1−γ) · ‖V^{πθ} − T V^{πθ}‖_{1, d^{πθ}}.


Ways to Attain Global Convergence: I. Policy Improvement

How to prove the No-spurious-local-optima Theorem?

Lemma (Policy gradients for directional derivatives, Lemma 1 in [Bhandari and Russo, 2019])
For any θ and u, we have

  (d/dα) J(θ + αu) |_{α=0} = 1/(1−γ) · E_{s∼d^{πθ}} [ (d/dα) Q^{πθ}(s, π_{θ+αu}(s)) |_{α=0} ],

where d^{πθ} is the state occupancy induced by s0 ∼ d0.

Lower bounding the RHS of the lemma above then yields the No-spurious-local-optima Theorem.


Ways to Attain Global Convergence: I. Policy Improvement

What can we learn from the No-spurious-local-optima Theorem?

Recall the conclusion of the No-spurious-local-optima Theorem:

  (d/dα) J(θ + αu) |_{α=0} ≥ 1/(1−γ) · ‖V^{πθ} − T V^{πθ}‖_{1, d^{πθ}}.

The LHS is a directional derivative of J(θ) (up to the scale of u, it is a lower bound on the norm of the policy gradient in Eq. (1)), and the RHS equals 0 if and only if V^{πθ} = V⋆.

This implies that all stationary points of the policy gradient objective are globally optimal.


Ways to Attain Global Convergence: II. Bounding Performance Difference

Warm-up: Performance Difference Lemma

Lemma (The performance difference lemma [Kakade and Langford, 2002])
For all policies π, π′,

  J(π) − J(π′) = 1/(1−γ) · E_{(s,a)∼d^π} [ A^{π′}(s, a) ].

This lemma can be proved by directly simplifying the RHS using the definition of A^{π′}.
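The identity is also easy to check numerically on a small tabular MDP. A minimal sketch, assuming the hypothetical policy_eval and occupancy helpers from the earlier sketches:

```python
import numpy as np

def performance_difference_holds(P, R, gamma, d0, pi, pi_prime):
    """Check J(pi) - J(pi') == 1/(1-gamma) * E_{(s,a)~d^pi}[A^{pi'}(s, a)]."""
    V, _ = policy_eval(P, R, gamma, pi)
    V_p, Q_p = policy_eval(P, R, gamma, pi_prime)
    A_p = Q_p - V_p[:, None]                           # advantage of pi'
    d = occupancy(P, gamma, pi, d0)                    # state occupancy of pi
    lhs = d0 @ V - d0 @ V_p
    rhs = (d[:, None] * pi * A_p).sum() / (1 - gamma)  # expectation under d^pi(s) pi(a|s)
    return bool(np.isclose(lhs, rhs))
```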


Ways to Attain Global Convergence: II. Bounding Performance Difference

Using the Performance Difference Lemma — Gradient Domination Lemma
(directly parameterized policy classes and projected policy gradient)

Lemma (Gradient domination, Lemma 4.1 in [Agarwal et al., 2019])
For any policy π, we have

  J(π⋆) − J(π) ≤ ‖d^{π⋆}/d^π‖_∞ · max_{π̄} (π̄ − π)ᵀ ∇_π J(π)
              ≤ 1/(1−γ) · ‖d^{π⋆}/d0‖_∞ · max_{π̄} (π̄ − π)ᵀ ∇_π J(π),      (2)

where the max is over the set of all possible policies π̄.


Ways to Attain Global Convergence: II. Bounding Performance Difference

Proof of the gradient domination lemma.

By the performance difference lemma,

  J(π⋆) − J(π) = 1/(1−γ) · ∑_{s,a} d^{π⋆}(s) π⋆(a|s) A^π(s, a)
              ≤ 1/(1−γ) · ∑_s d^{π⋆}(s) max_a A^π(s, a)
              ≤ 1/(1−γ) · ‖d^{π⋆}/d^π‖_∞ · ∑_s d^π(s) max_a A^π(s, a),

where the last step uses max_a A^π(s, a) ≥ 0 for every s.


Ways to Attain Global Convergence: II. Bounding Performance Difference

Proof of the gradient domination lemma (cont.)

  ∑_s d^π(s) max_a A^π(s, a)
    = max_{π̄} ∑_{s,a} d^π(s) π̄(a|s) A^π(s, a)
    = max_{π̄} ∑_{s,a} d^π(s) (π̄(a|s) − π(a|s)) A^π(s, a)        (since ∑_a π(a|s) A^π(s, a) = 0)
    = (1 − γ) · max_{π̄} (π̄ − π)ᵀ ∇_π J(π)

Combining the two parts above completes the proof.


Ways to Attain Global Convergence: II. Bounding Performance Difference

How to use the gradient domination lemma?

Definition (First-order Stationarity)
A policy π ∈ ∆(A)^{|S|} is ε-stationary with respect to the initial state distribution d0 if

  max_{π+δ ∈ ∆(A)^{|S|}, ‖δ‖2 ≤ 1} δᵀ ∇_π J(π) ≤ ε,

where ∆(A)^{|S|} is the set of all possible policies.


Ways to Attain Global Convergence: II. Bounding Performance Difference

How to use the gradient domination lemma? (cont.)

Then, we have the following inequality:

  max_{π̄} (π̄ − π)ᵀ ∇_π J(π) ≤ max_{π+δ ∈ ∆(A)^{|S|}, ‖δ‖2 ≤ 1} δᵀ ∇_π J(π)      (3)

We can now connect the performance difference with the first-order stationarity condition (using the gradient domination lemma and Eq. (3)).

By applying classic first-order optimization results, we obtain the global convergence of (projected) policy gradient.


Ways to Attain Global Convergence: II. Bounding Performance Difference

Iteration complexity for the direct parameterization

Theorem
The projected gradient ascent algorithm (projecting onto the probability simplex after each gradient ascent step) on J(πθ) with stepsize (1−γ)³/(2γ|A|) satisfies

  min_{t≤T} {J(π⋆) − J(π_{θ^{(t)}})} ≤ ε   whenever   T > 64γ|S||A| / ((1−γ)⁶ ε²) · ‖d^{π⋆}/d0‖²_∞.
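A minimal sketch of this algorithm with exact gradients. The closed-form gradient ∂J/∂π(a|s) = d^π(s) Q^π(s, a)/(1−γ) for the direct parameterization and the Euclidean simplex projection are standard; the policy_eval and occupancy helpers are the hypothetical ones sketched earlier.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def projected_policy_gradient(P, R, gamma, d0, T, eta):
    """Projected policy gradient ascent for the direct parameterization pi(a|s)."""
    S, A = R.shape
    pi = np.ones((S, A)) / A                       # start from the uniform policy
    for _ in range(T):
        V, Q = policy_eval(P, R, gamma, pi)
        d = occupancy(P, gamma, pi, d0)
        grad = d[:, None] * Q / (1 - gamma)        # dJ/dpi(a|s)
        pi = np.stack([project_simplex(pi[s] + eta * grad[s]) for s in range(S)])
    return pi
```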


Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case (πθ(a|s) = exp(θ_{s,a}) / ∑_{a′} exp(θ_{s,a′}))

Challenge: Attaining the optimal policy (which is deterministic) requires sending the parameters to ∞.

Three types of algorithms:
1. regular policy gradient
2. policy gradient w/ entropic regularization
3. natural policy gradient


Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

1. Regular policy gradient only has asymptotic convergence (at this point)

Theorem (Global convergence for softmax parameterization, Theorem 5.1 in [Agarwal et al., 2019])
Assume we follow the gradient ascent update rule and that the distribution d0 is strictly positive, i.e., d0(s) > 0 for all states s. Suppose η ≤ (1−γ)²/5. Then, for all states s, V^{(t)}(s) → V⋆(s) as t → ∞.


Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

2. Polynomial convergence with relative entropy regularization

The relative-entropy regularized objective:

  L_λ(θ) := J(πθ) + λ/(|S||A|) · ∑_{s,a} log πθ(a|s) + λ log |A|,

where λ is a regularization parameter.

Its benefit: it keeps the parameters from becoming too large, as a means to ensure adequate exploration.
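A minimal sketch of evaluating L_λ(θ) and its exact gradient for the tabular softmax class. The closed-form gradient of the regularizer below follows from ∇_{θ_{s,a}} log πθ(a′|s) = 1{a = a′} − πθ(a|s); the policy_eval, occupancy, and softmax_policy helpers are the hypothetical ones sketched earlier.

```python
import numpy as np

def regularized_objective(P, R, gamma, d0, theta, lam):
    """L_lambda(theta) = J(pi_theta) + lam/(|S||A|) * sum_{s,a} log pi_theta(a|s) + lam*log|A|."""
    pi = softmax_policy(theta)
    S, A = R.shape
    V, _ = policy_eval(P, R, gamma, pi)
    return d0 @ V + lam / (S * A) * np.log(pi).sum() + lam * np.log(A)

def regularized_gradient(P, R, gamma, d0, theta, lam):
    """Exact gradient of L_lambda for the softmax parameterization."""
    pi = softmax_policy(theta)
    S, A = R.shape
    V, Q = policy_eval(P, R, gamma, pi)
    d = occupancy(P, gamma, pi, d0)
    grad_J = d[:, None] * pi * (Q - V[:, None]) / (1 - gamma)   # gradient of J(pi_theta)
    grad_reg = lam * (1.0 / (S * A) - pi / S)                   # gradient of the regularizer
    return grad_J + grad_reg
```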


Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

2. Polynomial convergence with relative entropy regularization (cont.)

Theorem (Iteration complexity with relative entropy regularization, Corollary 5.4 in [Agarwal et al., 2019])
Let β_λ := 8γ/(1−γ)³ + 2λ/|S|. Starting from any initial θ^{(0)}, consider gradient ascent on L_λ with λ = ε(1−γ) / (2‖d^{π⋆}/d0‖_∞) and η = 1/β_λ. Then we have

  min_{t<T} {J(π⋆) − J(π_{θ^{(t)}})} ≤ ε   if   T ≥ 320|S|²|A|² / ((1−γ)⁶ ε²) · ‖d^{π⋆}/d0‖²_∞.


Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

3. Natural policy gradient

Formulation:

  F(θ) = E_{(s,a)∼d^{πθ}} [ ∇θ log πθ(a|s) (∇θ log πθ(a|s))ᵀ ],
  θ^{(t+1)} = θ^{(t)} + η F(θ^{(t)})† ∇θJ(π_{θ^{(t)}}),

where M† denotes the Moore–Penrose pseudoinverse of the matrix M.
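A minimal sketch of one exact NPG step for the tabular softmax class, building the Fisher matrix and the policy gradient directly from their definitions (the policy_eval, occupancy, softmax_policy, and grad_log_softmax helpers are the hypothetical ones sketched earlier). For this class the update is known to reduce, up to null-space components of F, to θ ← θ + η A^{πθ}/(1−γ), which is what drives the dimension-free rate below.

```python
import numpy as np

def npg_step(P, R, gamma, d0, theta, eta):
    """One exact natural policy gradient step: theta + eta * F(theta)^dagger grad J."""
    S, A = R.shape
    pi = softmax_policy(theta)
    V, Q = policy_eval(P, R, gamma, pi)
    d_sa = occupancy(P, gamma, pi, d0)[:, None] * pi            # occupancy d^pi(s, a)
    # Score vectors grad_theta log pi(a|s), flattened, one row per (s, a).
    G = np.stack([grad_log_softmax(theta, s, a).ravel()
                  for s in range(S) for a in range(A)])
    w = d_sa.ravel()
    grad_J = G.T @ (w * Q.ravel()) / (1 - gamma)                # policy gradient, Eq. (1)
    F = G.T @ (G * w[:, None])                                  # Fisher information matrix
    direction = np.linalg.pinv(F) @ grad_J                      # F^dagger grad J
    return theta + eta * direction.reshape(S, A)
```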


Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

3. Natural policy gradient (cont.)

Theorem (Global convergence for natural policy gradient ascent, Theorem 5.7 in [Agarwal et al., 2019])
Suppose we run the natural policy gradient updates with θ^{(0)} = 0 and a fixed η > 0. For all T > 0, we have:

  J(π_{θ^{(T)}}) ≥ J(π⋆) − log |A| / (ηT) − 1 / ((1−γ)² T).

In particular, setting η ≥ (1−γ)² log |A|, NPG finds an ε-optimal policy in at most T ≤ 2/((1−γ)² ε) iterations, with no dependence on |S|, |A|, or ‖d^{π⋆}/d0‖²_∞.


Review of Tabular Results

What we covered:

- All first-order stationary points of policy gradient are globally optimal.
- The exact (projected) policy gradient w/ direct parameterization has an O( γ|S||A| / ((1−γ)⁶ε²) · ‖d^{π⋆}/d0‖²_∞ ) iteration complexity.
- The exact policy gradient w/ softmax parameterization has asymptotic convergence.
- The exact policy gradient w/ relative entropy regularization and softmax parameterization has an O( γ|S|²|A|² / ((1−γ)⁶ε²) · ‖d^{π⋆}/d0‖²_∞ ) iteration complexity.
- The exact natural policy gradient w/ softmax parameterization has a 2/((1−γ)²ε) iteration complexity.


Review of Tabular Results

What we did not cover, or future directions:

- Exploration w/ policy gradient (partially solved by Cai et al. [2019a])
- Stochastic policy gradient / sample-based results
- Actor-critic approaches
- A sharp analysis or improved algorithm regarding the distribution mismatch coefficient (e.g., Eq. (2))
- Sample complexity analysis
- Landscape of J(π)


Section 3

Global Convergence w/ Function Approximation


Overview

What we will cover:

- Natural Policy Gradient for Unconstrained Policy Class

- Projected Policy Gradient for Constrained Policy Classes

Challenge: how to capture the approximation error properly?


Natural Policy Gradient for Unconstrained Policy Class

Consider policy classes parameterized by θ ∈ R^d, where d ≪ |S||A|.

The update rule is still the exact natural policy gradient (see the tabular part). We also need to assume that log πθ(a|s) is a β-smooth function of θ for all s and a.

Example (Linear softmax policies)
For any state-action pair (s, a), suppose we have a feature mapping φ_{s,a} ∈ R^d with ‖φ_{s,a}‖²₂ ≤ β. Consider the policy class

  πθ(a|s) = exp(θ · φ_{s,a}) / ∑_{a′∈A} exp(θ · φ_{s,a′}),

with θ ∈ R^d. Then log πθ(a|s) is a β-smooth function.
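A minimal sketch of this policy class and its score function ∇θ log πθ(a|s) = φ_{s,a} − E_{a′∼πθ(·|s)}[φ_{s,a′}], assuming a dense feature tensor phi of shape (|S|, |A|, d) (names are illustrative):

```python
import numpy as np

def linear_softmax_policy(theta, phi):
    """pi_theta(a|s) proportional to exp(theta . phi[s, a]); phi: (S, A, d), theta: (d,)."""
    logits = phi @ theta                                  # (S, A)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def grad_log_linear_softmax(theta, phi, s, a):
    """grad_theta log pi_theta(a|s) = phi[s, a] - sum_{a'} pi_theta(a'|s) * phi[s, a']."""
    pi = linear_softmax_policy(theta, phi)
    return phi[s, a] - pi[s] @ phi[s]
```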


Natural Policy Gradient for Unconstrained Policy Class

Tools for analyzing NPG

The NPG update rule can still be written abstractly as

  θ^{(t+1)} = θ^{(t)} + η u^{(t)},   where u^{(t)} = F(θ^{(t)})† ∇θJ(π_{θ^{(t)}}).

We then leverage the connection between the NPG update and compatible function approximation [Sutton et al., 2000; Kakade, 2002]:

  L_ν(w; θ) := E_{(s,a)∼ν} [ (A^{πθ}(s, a) − w · ∇θ log πθ(a|s))² ],
  L⋆_ν(θ) := min_w L_ν(w; θ).

Setting ν(s, a) = d^{πθ}(s, a) gives u^{(t)} ∈ argmin_w L_ν(w; θ), and L⋆_ν(θ) is a measure of the approximation error of πθ.

(we can verify that L⋆_ν(θ) = 0 ⇒ πθ = π⋆)
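A minimal sketch of computing u^{(t)} as the compatible-function-approximation least-squares fit from samples, assuming arrays adv (sampled advantages), score (the corresponding ∇θ log πθ(a|s) rows), and weights drawn from ν (all names illustrative):

```python
import numpy as np

def compatible_fit(adv, score, weights):
    """Minimize L_nu(w; theta) = E_nu[(A^pi(s,a) - w . grad log pi(a|s))^2] over samples.
    adv: (n,), score: (n, d), weights: (n,) nonnegative sample weights from nu."""
    sw = np.sqrt(weights)
    # Weighted least squares: scale each row by sqrt(weight) and solve in the l2 sense.
    w_star, *_ = np.linalg.lstsq(score * sw[:, None], adv * sw, rcond=None)
    return w_star
```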


Natural Policy Gradient for Unconstrained Policy Class

NPG results

Consider the update rule θ^{(t+1)} = θ^{(t)} + η u^{(t)}, where u^{(t)} ∈ argmin_w L_{ν^{(t)}}(w; θ^{(t)}) and ν^{(t)} = d^{π_{θ^{(t)}}}(s, a).

Theorem
Let π⋆ be the optimal policy in the class, η = √(2 log |A| / (βW²T)), L⋆_{ν^{(t)}}(θ^{(t)}) ≤ ε_approx, and ‖u^{(t)}‖₂ ≤ W. Then we have

  min_{t≤T} {J(π⋆) − J(π_{θ^{(t)}})} ≤ (W √(2β log |A|)) / (1−γ) · 1/√T + √( 1/(1−γ)³ · ‖d^{π⋆}/d0‖_∞ · ε_approx ).


Projected Policy Gradient for Constrained Policy Classes

The constrained policy class is Π = {πθ : θ ∈ Θ}, where Θ ⊆ R^d is a convex set, taken as the feasible set of policies.

Following an intuition similar to the policy improvement part, we define the Bellman policy error in approximating π+_θ (the greedy policy w.r.t. Q^{πθ}) as

  L_BPE(θ; w) = E_s [ ∑_{a∈A} | π+_θ(a|s) − πθ(a|s) − wᵀ ∇θ πθ(a|s) | ].

Then, the approximation error can be captured by L_BPE(θ) := L_BPE(θ; w⋆(θ)), where

  w⋆(θ) = argmin_{w∈R^d : w+θ∈Θ} L_BPE(θ; w).

(it is easy to verify that L_BPE(θ) = 0 ⇒ πθ = π⋆)


Projected Policy Gradient for Constrained Policy Classes

Theorem
Suppose πθ is Lipschitz continuous and smooth for all θ ∈ Θ (with constants β₁ and β₂, respectively). Assume for all t < T,

  L_BPE(θ^{(t)}) ≤ ε_approx   and   ‖w⋆(θ^{(t)})‖₂ ≤ W⋆.

Let

  β = β₂|A| / (1−γ)² + 2γβ₁²|A|² / (1−γ)³.

Then, projected policy gradient ascent with stepsize η = 1/β satisfies

  min_{t<T} {J(π⋆) − J(π_{θ^{(t)}})} ≤ 1/(1−γ)³ · ‖d^{π⋆}/d0‖_∞ · ε_approx + (W⋆ + 1)ε,

for T ≥ 8β / ((1−γ)³ε²) · ‖d^{π⋆}/d0‖²_∞.


Section 4

Neural Policy Gradient Methods


Overparameterized Neural Policy

A two-layer neural network f((s, a); W, b) with input (s, a) and width m takes the form

  f((s, a); W, b) = (1/√m) ∑_{r=1}^m b_r · ReLU((s, a)ᵀ[W]_r),   ∀(s, a) ∈ S × A,

where
- (s, a) ∈ S × A ⊆ R^d
- ReLU : R → R is the rectified linear unit (ReLU) activation function, defined as ReLU(u) = 1{u > 0} · u
- {b_r}_{r∈[m]} and W = ([W]₁ᵀ, . . . , [W]_mᵀ)ᵀ ∈ R^{md} are the parameters.


Overparameterized Neural Policy

Using the two-layer neural network, we define the neural policies as

  πθ(a | s) = exp[τ · f((s, a); θ)] / ∑_{a′∈A} exp[τ · f((s, a′); θ)],   ∀(s, a) ∈ S × A,

and the feature mapping φθ = ([φθ]₁ᵀ, . . . , [φθ]_mᵀ)ᵀ : R^d → R^{md} of a two-layer neural network f((·, ·); θ) as

  [φθ]_r(s, a) = (b_r/√m) · 1{(s, a)ᵀ[θ]_r > 0} · (s, a),   ∀(s, a) ∈ S × A, ∀r ∈ [m].
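A minimal NumPy sketch of the two-layer network f((s, a); W, b) and the induced feature map φθ(s, a), with x standing for the concatenated (s, a) encoding (names are illustrative):

```python
import numpy as np

def two_layer_relu(x, W, b):
    """f(x; W, b) = (1/sqrt(m)) * sum_r b_r * ReLU(x . W_r); x: (d,), W: (m, d), b: (m,)."""
    m = b.shape[0]
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)

def relu_feature_map(x, W, b):
    """phi_theta(s, a): the r-th block is (b_r/sqrt(m)) * 1{x . W_r > 0} * x, stacked in R^{m d}."""
    m, d = W.shape
    active = (W @ x > 0).astype(float)                     # ReLU activation pattern
    return ((b * active)[:, None] * x[None, :] / np.sqrt(m)).reshape(m * d)
```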


Overparameterized Neural Policy

Policy Gradient and Fisher Information Matrix (Proposition 3.1 in [Wang et al., 2019])

For πθ defined previously, we have

  ∇θJ(πθ) = τ · E_{σ_{πθ}} [ Q^{πθ}(s, a) · (φθ(s, a) − E_{πθ}[φθ(s, a′)]) ],

  F(θ) = τ² · E_{σ_{πθ}} [ (φθ(s, a) − E_{πθ}[φθ(s, a′)]) (φθ(s, a) − E_{πθ}[φθ(s, a′)])ᵀ ].


Neural Policy Gradient Methods — Actor Update

To update θ_i, we set

  θ_{i+1} ← Π_B( θ_i + η · G(θ_i) · ∇̂θJ(π_{θ_i}) ),

where
- B = {α ∈ R^{md} : ‖α − W_init‖₂ ≤ R}, where R > 1 and W_init is the initial parameter
- Π_B : R^{md} → B is the projection operator onto the parameter space B ⊆ R^{md}
- G(θ_i) = I_{md} for policy gradient and G(θ_i) = (F(θ_i))^{-1} for natural policy gradient
- η is the learning rate and ∇̂θJ(π_{θ_i}) is an estimator of ∇θJ(π_{θ_i}):

  ∇̂θJ(π_{θ_i}) = (1/B) · ∑_{ℓ=1}^B Q_{ω_i}(s_ℓ, a_ℓ) · ∇θ log π_{θ_i}(a_ℓ | s_ℓ)
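A minimal sketch of this estimator and the projected actor step, with q_critic and grad_log_pi standing for caller-supplied callables for Q_{ω_i} and ∇θ log π_{θ_i} (names are illustrative):

```python
import numpy as np

def pg_estimate(theta, batch, q_critic, grad_log_pi):
    """(1/B) * sum_l Q_omega(s_l, a_l) * grad_theta log pi_theta(a_l | s_l) over a sampled batch."""
    grads = [q_critic(s, a) * grad_log_pi(theta, s, a) for (s, a) in batch]
    return np.mean(grads, axis=0)

def projected_actor_step(theta, grad_hat, eta, theta_init, radius):
    """theta <- Proj_B(theta + eta * grad_hat), where B is the l2-ball of the given
    radius centered at the initialization W_init."""
    theta_new = theta + eta * grad_hat
    shift = theta_new - theta_init
    norm = np.linalg.norm(shift)
    if norm > radius:
        theta_new = theta_init + shift * (radius / norm)   # project back onto B
    return theta_new
```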


Neural Policy Gradient Methods — Actor Update

Sampling from the Visitation Measure

Recall the policy gradient (Proposition 3.1 in [Wang et al., 2019]):

  ∇θJ(πθ) = τ · E_{σ_{πθ}} [ Q^{πθ}(s, a) · (φθ(s, a) − E_{πθ}[φθ(s, a′)]) ].

We need to sample from the visitation measure σ_{πθ}. Define a new MDP (S, A, P̃, ζ, r, γ) with Markov transition kernel P̃:

  P̃(s′ | s, a) = γ · P(s′ | s, a) + (1 − γ) · ζ(s′),   ∀(s, a, s′) ∈ S × A × S,

where
- P is the Markov transition kernel of the original MDP
- ζ is the initial state distribution
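A minimal sketch of sampling from this modified chain: at every step, restart from ζ with probability 1−γ, otherwise follow the original kernel P (tabular shapes, illustrative names):

```python
import numpy as np

def sample_visitation(P, zeta, pi, n_samples, gamma, seed=0):
    """Sample (s, a) pairs from the chain with kernel gamma*P + (1-gamma)*zeta.
    P: (S, A, S) transition tensor, zeta: (S,) restart distribution, pi: (S, A) policy."""
    rng = np.random.default_rng(seed)
    S, A = pi.shape
    s = rng.choice(S, p=zeta)
    samples = []
    for _ in range(n_samples):
        a = rng.choice(A, p=pi[s])
        samples.append((s, a))
        if rng.random() < 1 - gamma:
            s = rng.choice(S, p=zeta)        # restart from the initial distribution
        else:
            s = rng.choice(S, p=P[s, a])     # follow the original kernel
    return samples
```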


Neural Policy Gradient Methods — Actor Update

Inverting the Fisher Information Matrix

Recall that G(θ_i) = (F(θ_i))^{-1} for natural policy gradient.

Challenge: Inverting an estimator F̂(θ_i) of F(θ_i) can be infeasible, as F̂(θ_i) is a high-dimensional matrix that is possibly not invertible.

To resolve this issue, we estimate the natural policy gradient G(θ_i) · ∇θJ(π_{θ_i}) by solving

  min_{α∈B} ‖F̂(θ_i) · α − τ_i · ∇̂θJ(π_{θ_i})‖₂


Neural Policy Gradient Methods — Actor Update

Meanwhile, F̂(θ_i) is an unbiased estimator of F(θ_i) based on {(s_ℓ, a_ℓ)}_{ℓ∈[B]} sampled from σ_i, defined as

  F̂(θ_i) = (τ_i²/B) · ∑_{ℓ=1}^B (φ_{θ_i}(s_ℓ, a_ℓ) − E_{π_{θ_i}}[φ_{θ_i}(s_ℓ, a′_ℓ)]) (φ_{θ_i}(s_ℓ, a_ℓ) − E_{π_{θ_i}}[φ_{θ_i}(s_ℓ, a′_ℓ)])ᵀ.

The actor update of neural natural policy gradient takes the form

  τ_{i+1} ← τ_i + η,
  τ_{i+1} · θ_{i+1} ← τ_i · θ_i + η · argmin_{α∈B} ‖F̂(θ_i) · α − τ_i · ∇̂θJ(π_{θ_i})‖₂.
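A minimal sketch of this actor step. The inner least-squares problem over the ball B is solved here by plain projected gradient descent, which is only one of several reasonable choices (all names are illustrative):

```python
import numpy as np

def neural_npg_actor_step(theta, tau, F_hat, grad_hat, eta, theta_init, radius,
                          inner_iters=300, inner_lr=1e-2):
    """tau' <- tau + eta;  tau'*theta' <- tau*theta + eta * argmin_{alpha in B} ||F_hat a - tau g||_2."""
    alpha = theta_init.copy()
    for _ in range(inner_iters):
        residual = F_hat @ alpha - tau * grad_hat
        alpha = alpha - inner_lr * 2.0 * (F_hat.T @ residual)  # gradient of the squared norm
        shift = alpha - theta_init
        norm = np.linalg.norm(shift)
        if norm > radius:
            alpha = theta_init + shift * (radius / norm)       # project onto the ball B
    tau_next = tau + eta
    theta_next = (tau * theta + eta * alpha) / tau_next
    return theta_next, tau_next
```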


Neural Policy Gradient Methods — Actor Update

To summarize, at the i-th iteration:

Neural policy gradient obtains θ_{i+1} via projected gradient ascent using ∇̂θJ(π_{θ_i}), defined by

  ∇̂θJ(π_{θ_i}) = (1/B) · ∑_{ℓ=1}^B Q_{ω_i}(s_ℓ, a_ℓ) · ∇θ log π_{θ_i}(a_ℓ | s_ℓ)

Neural natural policy gradient solves

  min_{α∈B} ‖F̂(θ_i) · α − τ_i · ∇̂θJ(π_{θ_i})‖₂

and obtains θ_{i+1} according to

  τ_{i+1} ← τ_i + η,
  τ_{i+1} · θ_{i+1} ← τ_i · θ_i + η · argmin_{α∈B} ‖F̂(θ_i) · α − τ_i · ∇̂θJ(π_{θ_i})‖₂.


Neural Policy Gradient Methods — Critic Update

To compute ∇̂θJ(π_{θ_i}), it remains to obtain the critic Q_{ω_i} in

  ∇̂θJ(π_{θ_i}) = (1/B) · ∑_{ℓ=1}^B Q_{ω_i}(s_ℓ, a_ℓ) · ∇θ log π_{θ_i}(a_ℓ | s_ℓ).

For any policy π, the action-value function Q^π is the unique solution to the Bellman equation Q = T^π Q [Sutton and Barto, 1998]. Here T^π is the Bellman operator, which takes the form

  T^π Q(s, a) = E[ (1 − γ) · r(s, a) + γ · Q(s′, a′) ],   ∀(s, a) ∈ S × A.


Neural Policy Gradient Methods — Critic Update

Correspondingly, we aim to solve the following optimization problem:

  ω_i ← argmin_{ω∈B} E_{ς_i} [ (Q_ω(s, a) − T^{π_{θ_i}} Q_ω(s, a))² ],

where
- ς_i is the stationary state-action distribution
- T^{π_{θ_i}} is the Bellman operator associated with π_{θ_i}


Neural Policy Gradient Methods — Critic Update

We adopt neural temporal-difference learning (TD), studied in [Cai et al., 2019b], which solves the optimization problem above via stochastic semigradient descent [Sutton, 1988].

Specifically, an iteration of neural TD takes the form

  ω(t + 1/2) ← ω(t) − η_TD · ( Q_{ω(t)}(s, a) − (1 − γ) · r(s, a) − γ · Q_{ω(t)}(s′, a′) ) · ∇ω Q_{ω(t)}(s, a),
  ω(t + 1) ← argmin_{α∈B} ‖α − ω(t + 1/2)‖₂.
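A minimal sketch of one neural TD iteration on a sampled transition (s, a, r, s′, a′), with q_fn and grad_q_fn standing for caller-supplied callables for the critic network Q_ω and its parameter gradient (names are illustrative):

```python
import numpy as np

def neural_td_step(omega, transition, q_fn, grad_q_fn, eta_td, gamma, omega_init, radius):
    """One semigradient TD step followed by projection onto the ball B around omega_init."""
    s, a, r, s_next, a_next = transition
    td_error = q_fn(omega, s, a) - (1 - gamma) * r - gamma * q_fn(omega, s_next, a_next)
    omega_half = omega - eta_td * td_error * grad_q_fn(omega, s, a)
    shift = omega_half - omega_init
    norm = np.linalg.norm(shift)
    if norm > radius:
        omega_half = omega_init + shift * (radius / norm)   # argmin_{alpha in B} ||alpha - omega_half||
    return omega_half
```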


Neural Policy Gradient Methods — Critic Update

To summarize, we combine the actor updates and the critic update described by:

1. θ_{i+1} ← Π_B( θ_i + η · G(θ_i) · ∇̂θJ(π_{θ_i}) )

2. τ_{i+1} ← τ_i + η,
   τ_{i+1} · θ_{i+1} ← τ_i · θ_i + η · argmin_{α∈B} ‖F̂(θ_i) · α − τ_i · ∇̂θJ(π_{θ_i})‖₂

3. ω_i ← argmin_{ω∈B} E_{ς_i} [ (Q_ω(s, a) − T^{π_{θ_i}} Q_ω(s, a))² ].


Neural Policy Gradient Methods


Reference I

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.

Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.

Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019a.

Qi Cai, Zhuoran Yang, Jason D Lee, and Zhaoran Wang. Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems, pages 11312–11322, 2019b.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.


Reference II

Sham M Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.

Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, March 1998. ISBN 0-262-19398-1.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.
