
Global Convergence of Policy Optimization

Tengyang Xie and Wenbin Wan


Outline

Background

Global Convergence in Tabular MDPs

Global Convergence w/ Function Approximation

Neural Policy Gradient Methods


Section 1

Background


Markov Decision Process (MDP)

An (infinite-horizon discounted) MDP [Sutton and Barto, 1998; Puterman, 2014] is a tuple (S, A, P, R, γ, d_0):

- state s ∈ S
- action a ∈ A
- transition function P : S × A → ∆(S)
- reward function R : S × A → [0, R_max]
- discount factor γ ∈ [0, 1)
- initial state distribution d_0 ∈ ∆(S)

(∆(·) denotes the probability simplex)

Notations Regarding Value Function and Policy

- policy π : S → ∆(A)
- π-induced random trajectory: (s_0, a_0, r_0, s_1, a_1, r_1, ...), where s_0 ∼ d_0, a_t ∼ π(·|s_t), r_t = R(s_t, a_t), s_{t+1} ∼ P(·|s_t, a_t), ∀t ≥ 0
- (state-)value function V^π(s) := E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, π ]
- Q-function Q^π(s, a) := E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π ]
- advantage function A^π(s, a) := Q^π(s, a) − V^π(s)
- expected discounted return J(π) := E[ Σ_{t=0}^∞ γ^t r_t | s_0 ∼ d_0, π ]
- optimal policy π⋆; value function of π⋆: V⋆; Q-function of π⋆: Q⋆
- normalized discounted state occupancy d^π(s) := (1 − γ) Σ_{t=0}^∞ γ^t Pr[s_t = s | s_0 ∼ d_0, π], and d^π(s, a) := d^π(s) π(a|s)
- Bellman optimality operator T: (T V)(s) := max_a { R(s, a) + γ E[V(s′) | s, a] }
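
These quantities can all be estimated by rolling out trajectories. Below is a minimal sketch, assuming a hypothetical tabular MDP given as numpy arrays `P` (|S| x |A| x |S|), `R` (|S| x |A|), and `d0` (|S|), that estimates J(π) by averaging truncated discounted returns.

```python
import numpy as np

def rollout_return(P, R, d0, pi, gamma, horizon=1000, rng=None):
    """One truncated discounted return with s_0 ~ d0 and a_t ~ pi(.|s_t)."""
    rng = rng or np.random.default_rng()
    s = rng.choice(len(d0), p=d0)
    G, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(P.shape[1], p=pi[s])
        G += discount * R[s, a]
        discount *= gamma
        s = rng.choice(len(d0), p=P[s, a])
    return G

def estimate_J(P, R, d0, pi, gamma, n_rollouts=2000):
    """Monte Carlo estimate of J(pi) = E[sum_t gamma^t r_t | s_0 ~ d0, pi]."""
    rng = np.random.default_rng(0)
    return np.mean([rollout_return(P, R, d0, pi, gamma, rng=rng)
                    for _ in range(n_rollouts)])
```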

Policy Parameterizations

- direct parameterization: π_θ(a|s) = θ_{s,a}, where θ ∈ ∆(A)^{|S|}.
- softmax parameterization: π_θ(a|s) = exp(θ_{s,a}) / Σ_{a′} exp(θ_{s,a′}), where θ ∈ R^{|S||A|}.

Example function approximation: replace the table entry θ_{s,a} by a linear score θ · φ_{s,a}.
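
As an illustration, here is a minimal numpy sketch of the two tabular parameterizations and of the linear-softmax variant; the feature map `phi` is a hypothetical array of shape (|S|, |A|, d).

```python
import numpy as np

def direct_policy(theta):
    """Direct parameterization: theta is already a row-stochastic |S| x |A| table."""
    return theta

def softmax_policy(theta):
    """Softmax parameterization over an |S| x |A| table of logits."""
    z = theta - theta.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def linear_softmax_policy(theta, phi):
    """Softmax over linear scores theta . phi_{s,a}; phi has shape (|S|, |A|, d)."""
    logits = phi @ theta                           # resulting shape (|S|, |A|)
    return softmax_policy(logits)
```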

Policy Gradient Theorem

Objective function:

max_θ J(π_θ)

Theorem ([Sutton et al., 2000])

∇_θ J(π_θ) = (1 / (1 − γ)) E_{(s,a) ∼ d^{π_θ}} [ ∇_θ log π_θ(a|s) · Q^{π_θ}(s, a) ]   (1)

Corollary

∇_θ J(π_θ) = (1 / (1 − γ)) E_{(s,a) ∼ d^{π_θ}} [ ∇_θ log π_θ(a|s) · A^{π_θ}(s, a) ]
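
Eq. (1) suggests the usual sample-based estimator: draw (s, a) pairs (approximately) from the discounted occupancy and average score-function terms weighted by a Q-value estimate. A minimal sketch for the tabular softmax parameterization, assuming the samples and the Q estimates `Q_hat` are produced elsewhere:

```python
import numpy as np

def policy_gradient_estimate(samples, Q_hat, pi, gamma):
    """Monte Carlo estimate of Eq. (1) for the tabular softmax parameterization.

    samples: list of (s, a) pairs drawn (approximately) from d^{pi_theta}
    Q_hat:   |S| x |A| array of Q-value estimates for the current policy
    pi:      |S| x |A| array with pi[s, a] = pi_theta(a|s)
    """
    S, A = pi.shape
    grad = np.zeros((S, A))
    for s, a in samples:
        # gradient of log softmax w.r.t. the logits of state s: e_a - pi(.|s)
        score = -pi[s].copy()
        score[a] += 1.0
        grad[s] += score * Q_hat[s, a]
    return grad / (len(samples) * (1.0 - gamma))
```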

Section 2

Global Convergence in Tabular MDPs


Overview

The global convergence of policy gradient comes from its special structure.

There are two main ways to attain global convergence:

- Policy Improvement: all stationary points are globally optimal [Bhandari and Russo, 2019]
- Bounding Performance Difference: the performance difference between π and π⋆ can be bounded by (variants of) the policy gradient [Agarwal et al., 2019]

Ways to Attain Global Convergence: I. Policy Improvement

Warm-up: Policy Improvement Lemma
Let π be any policy and π+ the greedy policy w.r.t. Q^π. Then V^{π+}(s) ≥ V^π(s) for all s ∈ S.

Proof. For any s ∈ S, we have

V^π(s) ≤ Q^π(s, π+(s))
       = E[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t ∼ π+ ]
       ≤ E[ r_{t+1} + γ Q^π(s_{t+1}, π+(s_{t+1})) | s_t = s, a_t ∼ π+ ]
       ≤ E[ r_{t+1} + γ r_{t+2} + γ² V^π(s_{t+2}) | s_t = s, a_t ∼ π+, a_{t+1} ∼ π+ ]
       ⋮
       ≤ V^{π+}(s)


Ways to Attain Global Convergence: I. Policy Improvement

How about π′ := π + α(π+ − π), where α ∈ (0, 1)?

The policy can be improved along this direction!

Theorem (No spurious local optima, Theorem 1 in [Bhandari and Russo, 2019])
Under Assumptions 1-4 in [Bhandari and Russo, 2019], for a policy π_θ, let π+ be a policy iteration update of π_θ. Take u to satisfy

(d/dα) π_{θ+αu}(s) |_{α=0} = π+(s) − π_θ(s), ∀s ∈ S.

Then,

(d/dα) J(θ + αu) |_{α=0} ≥ (1 / (1 − γ)) ‖V^{π_θ} − T V^{π_θ}‖_{1, d^{π_θ}}.

Ways to Attain Global Convergence: I. Policy Improvement

How to prove the no-spurious-local-optima theorem?

Lemma (Policy gradients for directional derivatives, Lemma 1 in [Bhandari and Russo, 2019])
For any θ and u, we have

(d/dα) J(θ + αu) |_{α=0} = (1 / (1 − γ)) E_{s ∼ d^{π_θ}} [ (d/dα) Q^{π_θ}(s, π_{θ+αu}(s)) |_{α=0} ].

Lower-bounding the RHS of this lemma then yields the no-spurious-local-optima theorem.

Ways to Attain Global Convergence: I. Policy Improvement

What can we learn from the no-spurious-local-optima theorem?

Recall its conclusion:

(d/dα) J(θ + αu) |_{α=0} ≥ (1 / (1 − γ)) ‖V^{π_θ} − T V^{π_θ}‖_{1, d^{π_θ}}.

The LHS is a directional derivative of J(θ) (and hence is upper bounded by the norm of the policy gradient, Eq. (1)), while the RHS equals 0 if and only if V^{π_θ} = V⋆.

This implies that all stationary points of the policy gradient objective are globally optimal.

Ways to Attain Global Convergence: II. Bounding Performance Difference

Warm-up: Performance Difference Lemma

Lemma (The performance difference lemma [Kakade and Langford, 2002])
For all policies π, π′,

J(π) − J(π′) = (1 / (1 − γ)) E_{(s,a) ∼ d^π} [ A^{π′}(s, a) ].

This lemma can be proved by directly simplifying the RHS using the definition of A^{π′}.

Ways to Attain Global Convergence: II. Bounding Performance Difference

Usage of the Performance Difference Lemma: the Gradient Domination Lemma
(directly parameterized policy classes and projected policy gradient)

Lemma (Gradient domination, Lemma 4.1 in [Agarwal et al., 2019])
For any policy π, we have

J(π⋆) − J(π) ≤ ‖d^{π⋆} / d^π‖_∞ · max_{π̄} (π̄ − π)^⊤ ∇_π J(π)
            ≤ (1 / (1 − γ)) ‖d^{π⋆} / d_0‖_∞ · max_{π̄} (π̄ − π)^⊤ ∇_π J(π),   (2)

where the max is over the set of all possible policies.

Ways to Attain Global Convergence: II. Bounding Performance Difference

Proof of the gradient domination lemma.

By the performance difference lemma,

J(π⋆) − J(π) = (1 / (1 − γ)) Σ_{s,a} d^{π⋆}(s) π⋆(a|s) A^π(s, a)
             ≤ (1 / (1 − γ)) Σ_s d^{π⋆}(s) max_a A^π(s, a)
             ≤ (1 / (1 − γ)) ‖d^{π⋆} / d^π‖_∞ Σ_s d^π(s) max_a A^π(s, a),

where the last step uses max_a A^π(s, a) ≥ 0.

Ways to Attain Global Convergence: II. Bounding Performance Difference

Proof of the gradient domination lemma (cont.)

Σ_s d^π(s) max_a A^π(s, a)
  = max_{π̄} Σ_{s,a} d^π(s) π̄(a|s) A^π(s, a)
  = max_{π̄} Σ_{s,a} d^π(s) (π̄(a|s) − π(a|s)) A^π(s, a)
  = (1 − γ) max_{π̄} (π̄ − π)^⊤ ∇_π J(π)

Combining the two parts above completes the proof.

Ways to Attain Global Convergence: II. Bounding Performance Difference

How to use the gradient domination lemma?

Definition (First-order stationarity)
A policy π ∈ ∆(A)^{|S|} is ε-stationary with respect to the initial state distribution d_0 if

max_{π + δ ∈ ∆(A)^{|S|}, ‖δ‖_2 ≤ 1} δ^⊤ ∇_π J(π) ≤ ε,

where ∆(A)^{|S|} is the set of all possible policies.

Ways to Attain Global Convergence: II. Bounding Performance Difference

How to use the gradient domination lemma? (cont.)

Then we have the following inequality:

max_{π̄} (π̄ − π)^⊤ ∇_π J(π) ≤ 2√|S| · max_{π + δ ∈ ∆(A)^{|S|}, ‖δ‖_2 ≤ 1} δ^⊤ ∇_π J(π),   (3)

since ‖π̄ − π‖_2 ≤ 2√|S| for any two policies.

This connects the performance difference to the first-order stationarity condition (via the gradient domination lemma and Eq. (3)).

By applying classical first-order optimization results, we obtain the global convergence of (projected) policy gradient.

Ways to Attain Global Convergence: II. Bounding Performance Difference

Iteration complexity for direct parameterization

Theorem
The projected gradient ascent algorithm (projecting onto the probability simplex after each gradient ascent step) on J(π_θ) with stepsize (1 − γ)³ / (2γ|A|) satisfies

min_{t ≤ T} { J(π⋆) − J(π_{θ^{(t)}}) } ≤ ε,

whenever T > (64 γ |S| |A| / ((1 − γ)^6 ε²)) · ‖d^{π⋆} / d_0‖²_∞.
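
A minimal sketch of this procedure for the direct parameterization, assuming exact gradients are available through a hypothetical `grad_fn(pi)` that returns the |S| x |A| gradient table; the projection onto the simplex is done row by row.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0)

def projected_policy_gradient(grad_fn, pi0, gamma, n_actions, T):
    """Projected gradient ascent on J(pi) for the direct parameterization."""
    pi = pi0.copy()
    eta = (1 - gamma) ** 3 / (2 * gamma * n_actions)   # stepsize from the theorem
    for _ in range(T):
        g = grad_fn(pi)                                # |S| x |A| gradient table
        pi = np.apply_along_axis(project_to_simplex, 1, pi + eta * g)
    return pi
```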

Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case (π_θ(a|s) = exp(θ_{s,a}) / Σ_{a′} exp(θ_{s,a′}))

Challenge: attaining the optimal policy (which is deterministic) requires sending the parameters to ∞.

Three types of algorithms:
1. regular policy gradient
2. policy gradient w/ entropic regularization
3. natural policy gradient

Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

1. Regular policy gradient only has asymptotic convergence (at this point)

Theorem (Global convergence for softmax parameterization, Theorem 5.1 in [Agarwal et al., 2019])
Assume we follow the gradient ascent update rule and that the distribution d_0 is strictly positive, i.e., d_0(s) > 0 for all states s. Suppose η ≤ (1 − γ)² / 5. Then, for all states s, V^{(t)}(s) → V⋆(s) as t → ∞.

Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

2. Polynomial convergence with relative entropy regularization

The relative-entropy regularized objective:

L_λ(θ) := J(π_θ) + (λ / (|S||A|)) Σ_{s,a} log π_θ(a|s) + λ log |A|,

where λ is a regularization parameter.

Its benefit: it keeps the parameters from becoming too large, as a means to ensure adequate exploration.
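
A small sketch of this regularized objective for the tabular softmax parameterization; the value J(π_θ) is assumed to be supplied by a hypothetical evaluator `J_fn(theta)`.

```python
import numpy as np

def log_softmax_table(theta):
    """Row-wise log pi_theta(a|s) for an |S| x |A| table of logits."""
    z = theta - theta.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def relative_entropy_regularized_objective(theta, J_fn, lam):
    """L_lambda(theta) = J(pi_theta) + lam/(|S||A|) * sum_{s,a} log pi_theta(a|s) + lam*log|A|."""
    S, A = theta.shape
    log_pi = log_softmax_table(theta)
    return J_fn(theta) + lam / (S * A) * log_pi.sum() + lam * np.log(A)
```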

Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

2. Polynomial convergence with relative entropy regularization (cont.)

Theorem (Iteration complexity with relative entropy regularization, Corollary 5.4 in [Agarwal et al., 2019])
Let β_λ := 8γ / (1 − γ)³ + 2λ / |S|. Starting from any initial θ^{(0)}, consider gradient ascent on L_λ with λ = ε (1 − γ) / (2 ‖d^{π⋆} / d_0‖_∞) and η = 1/β_λ. Then we have

min_{t < T} { J(π⋆) − J(π_{θ^{(t)}}) } ≤ ε,   if T ≥ (320 |S|² |A|² / ((1 − γ)^6 ε²)) · ‖d^{π⋆} / d_0‖²_∞.

Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

3. Natural policy gradient

Formulation:

F(θ) = E_{(s,a) ∼ d^{π_θ}} [ ∇_θ log π_θ(a|s) (∇_θ log π_θ(a|s))^⊤ ]

θ^{(t+1)} = θ^{(t)} + η F(θ^{(t)})^† ∇_θ J(π_{θ^{(t)}}),

where M^† denotes the Moore-Penrose pseudoinverse of the matrix M.
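
A minimal tabular sketch of one natural policy gradient step under this formulation. It assumes exact quantities are supplied: `d_sa` is a hypothetical |S| x |A| array of occupancy weights d^{π_θ}(s, a) and `grad_J` is the |S| x |A| policy-gradient table.

```python
import numpy as np

def natural_policy_gradient_step(theta, d_sa, grad_J, eta):
    """One NPG update theta <- theta + eta * F(theta)^dagger grad_J
    for the tabular softmax parameterization (theta is |S| x |A|)."""
    S, A = theta.shape
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    # Fisher matrix: E_{(s,a)~d}[ grad log pi (grad log pi)^T ], built explicitly.
    F = np.zeros((S * A, S * A))
    for s in range(S):
        for a in range(A):
            score = np.zeros((S, A))
            score[s] = -pi[s]
            score[s, a] += 1.0                      # grad_theta log pi_theta(a|s)
            v = score.ravel()
            F += d_sa[s, a] * np.outer(v, v)

    step = np.linalg.pinv(F) @ grad_J.ravel()       # Moore-Penrose pseudoinverse
    return theta + eta * step.reshape(S, A)
```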

Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

3. Natural policy gradient (cont.)

Theorem (Global convergence for natural policy gradient ascent, Theorem 5.7 in [Agarwal et al., 2019])
Suppose we run the natural policy gradient updates with θ^{(0)} = 0 and a fixed η > 0. For all T > 0, we have:

J(π_{θ^{(T)}}) ≥ J(π⋆) − log |A| / (ηT) − 1 / ((1 − γ)² T).

In particular, setting η ≥ (1 − γ)² log |A|, NPG finds an ε-optimal policy in at most T ≤ 2 / ((1 − γ)² ε) iterations, with no dependence on |S|, |A|, or ‖d^{π⋆} / d_0‖²_∞.

Review of Tabular Results

What we covered:

- All first-order stationary points of the policy gradient objective are globally optimal.
- The exact (projected) policy gradient w/ direct parameterization has an O( γ|S||A| / ((1 − γ)^6 ε²) · ‖d^{π⋆}/d_0‖²_∞ ) iteration complexity.
- The exact policy gradient w/ softmax parameterization has asymptotic convergence.
- The exact policy gradient w/ relative entropy regularization and softmax parameterization has an O( γ|S|²|A|² / ((1 − γ)^6 ε²) · ‖d^{π⋆}/d_0‖²_∞ ) iteration complexity.
- The exact natural policy gradient w/ softmax parameterization has a 2 / ((1 − γ)² ε) iteration complexity.

Review of Tabular Results

What we did not cover / future directions:

- Exploration w/ policy gradient (partially addressed by Cai et al. [2019a])
- Stochastic policy gradient / sample-based results
- Actor-critic approaches
- A sharper analysis or improved algorithms regarding the distribution mismatch coefficient (e.g., Eq. (2))
- Sample complexity analysis
- Landscape of J(π)

Section 3

Global Convergence w/ Function Approximation


Overview

What we will cover:

- Natural Policy Gradient for Unconstrained Policy Class
- Projected Policy Gradient for Constrained Policy Classes

Challenge: how to capture the approximation error properly?

Natural Policy Gradient for Unconstrained Policy Class

Let the policy class be parameterized by θ ∈ R^d, where d ≪ |S||A|.

The update rule is still the exact natural policy gradient (see the tabular part). We also assume that log π_θ(a|s) is a β-smooth function of θ for all s and a.

Example (Linear softmax policies)
For any state-action pair (s, a), suppose we have a feature mapping φ_{s,a} ∈ R^d with ‖φ_{s,a}‖²_2 ≤ β. Consider the policy class

π_θ(a|s) = exp(θ · φ_{s,a}) / Σ_{a′∈A} exp(θ · φ_{s,a′}),

with θ ∈ R^d. Then log π_θ(a|s) is a β-smooth function of θ.

Natural Policy Gradient for Unconstrained Policy Class

Tools for analyzing NPG

The NPG update rule can be written abstractly as (with u^{(t)} = F(θ^{(t)})^† ∇_θ J(π_{θ^{(t)}}))

θ^{(t+1)} = θ^{(t)} + η u^{(t)}.

We then leverage the connection between the NPG update and compatible function approximation [Sutton et al., 2000; Kakade, 2002]:

L_ν(w; θ) := E_{(s,a) ∼ ν} [ (A^{π_θ}(s, a) − w · ∇_θ log π_θ(a|s))² ],
L⋆_ν(θ) := min_w L_ν(w; θ).

With ν(s, a) = d^{π_θ}(s, a), we have u^{(t)} ∈ argmin_w L_ν(w; θ), and L⋆_ν(θ) is a measure of the approximation error of π_θ.

(we can verify that L⋆_ν(θ) = 0 ⇒ π_θ = π⋆)
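
The inner minimization over w is an ordinary least-squares problem. A minimal sketch, assuming samples (s, a) from d^{π_θ} together with advantage estimates `A_hat` and score vectors `scores` (both hypothetical inputs computed elsewhere):

```python
import numpy as np

def compatible_function_approx(scores, A_hat):
    """Solve min_w E[(A(s,a) - w . grad log pi(a|s))^2] from samples.

    scores: (n, d) array, row i = grad_theta log pi_theta(a_i | s_i)
    A_hat:  (n,)  array of advantage estimates A^{pi_theta}(s_i, a_i)
    Returns the minimizer w (the abstract NPG direction u) and the
    empirical approximation error L_nu(w; theta).
    """
    w, *_ = np.linalg.lstsq(scores, A_hat, rcond=None)
    residual = A_hat - scores @ w
    return w, float(np.mean(residual ** 2))
```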

Natural Policy Gradient for Unconstrained Policy Class

NPG results

Consider the update rule θ^{(t+1)} = θ^{(t)} + η u^{(t)}, where u^{(t)} ∈ argmin_w L_{ν^{(t)}}(w; θ^{(t)}) and ν^{(t)} = d^{π_{θ^{(t)}}}(s, a).

Theorem
Let π⋆ be the optimal policy in the class, η = √(2 log |A| / (β W² T)), L⋆_{ν^{(t)}}(θ^{(t)}) ≤ ε_approx, and ‖u^{(t)}‖_2 ≤ W. Then we have

min_{t ≤ T} { J(π⋆) − J(π_{θ^{(t)}}) } ≤ (W √(2β log |A|) / (1 − γ)) · (1 / √T) + √( (1 / (1 − γ)³) ‖d^{π⋆} / d_0‖_∞ ε_approx ).

Projected Policy Gradient for Constrained Policy Classes

Let the constrained policy class be Π = {π_θ : θ ∈ Θ}, where Θ ⊆ R^d is a convex set.

Following an intuition similar to the policy improvement part, we define the Bellman policy error in approximating π_θ^+ (the greedy policy w.r.t. Q^{π_θ}) as

L_BPE(θ, w) := E_s [ Σ_{a∈A} | π_θ^+(a|s) − π_θ(a|s) − w^⊤ ∇_θ π_θ(a|s) | ].

Then the approximation error can be captured by L_BPE(θ) := L_BPE(θ, w⋆(θ)), where

w⋆(θ) = argmin_{w ∈ R^d : w + θ ∈ Θ} L_BPE(θ, w).

(it is easy to verify that L_BPE(θ) = 0 ⇒ π_θ = π⋆)

Projected Policy Gradient for Constrained Policy Classes

Theorem
Suppose π_θ is Lipschitz continuous and smooth for all θ ∈ Θ. Assume that for all t < T,

L_BPE(θ^{(t)}) ≤ ε_approx and ‖w⋆(θ^{(t)})‖_2 ≤ W⋆.

Let

β = β₂|A| / (1 − γ)² + 2γβ₁²|A|² / (1 − γ)³,

where β₁ and β₂ denote the Lipschitz and smoothness constants of π_θ. Then the projected policy gradient ascent w/ stepsize η = 1/β satisfies

min_{t < T} { J(π⋆) − J(π_{θ^{(t)}}) } ≤ (1 / (1 − γ)³) ‖d^{π⋆} / d_0‖_∞ ε_approx + (W⋆ + 1) ε,

for T ≥ (8β / ((1 − γ)³ ε²)) ‖d^{π⋆} / d_0‖²_∞.

Section 4

Neural Policy Gradient Methods


Overparameterized Neural Policy

A two-layer neural network f((s, a); W, b) with input (s, a) and width m takes the form

f((s, a); W, b) = (1/√m) Σ_{r=1}^m b_r · ReLU((s, a)^⊤ [W]_r), ∀(s, a) ∈ S × A,

where

- (s, a) ∈ S × A ⊆ R^d
- ReLU : R → R is the rectified linear unit (ReLU) activation function, defined as ReLU(u) = 1{u > 0} · u
- {b_r}_{r∈[m]} and W = ([W]_1^⊤, ..., [W]_m^⊤)^⊤ ∈ R^{md} are the parameters

Overparameterized Neural Policy

Using the two-layer neural network, we define the neural policy as

π_θ(a | s) = exp[τ · f((s, a); θ)] / Σ_{a′∈A} exp[τ · f((s, a′); θ)], ∀(s, a) ∈ S × A,

and the feature mapping φ_θ = ([φ_θ]_1^⊤, ..., [φ_θ]_m^⊤)^⊤ : R^d → R^{md} of the two-layer neural network f((·, ·); θ) as

[φ_θ]_r(s, a) = (b_r / √m) · 1{(s, a)^⊤ [θ]_r > 0} · (s, a), ∀(s, a) ∈ S × A, ∀r ∈ [m].
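
A minimal numpy sketch of the two-layer network, the induced softmax (energy-based) policy over a finite action set, and the feature map. The state-action embedding `sa_embed(s, a) ∈ R^d` is a hypothetical helper assumed to be defined elsewhere.

```python
import numpy as np

def two_layer_f(x, W, b):
    """f(x; W, b) = (1/sqrt(m)) * sum_r b_r * ReLU(x . W_r); W has shape (m, d)."""
    m = W.shape[0]
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)

def neural_policy(s, actions, W, b, tau, sa_embed):
    """pi_theta(a|s) proportional to exp(tau * f((s, a); W, b)) over a finite action set."""
    logits = np.array([tau * two_layer_f(sa_embed(s, a), W, b) for a in actions])
    logits -= logits.max()                          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def feature_map(x, W, b):
    """phi_theta(s, a) in R^{m*d}: [phi]_r = (b_r/sqrt(m)) * 1{x . W_r > 0} * x."""
    m = W.shape[0]
    gates = (W @ x > 0).astype(float)               # ReLU activation pattern, shape (m,)
    return ((b * gates)[:, None] * x[None, :] / np.sqrt(m)).ravel()
```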

Overparameterized Neural Policy

Policy Gradient and Fisher Information Matrix (Proposition 3.1 in [Wang et al., 2019])

For π_θ defined above, we have

∇_θ J(π_θ) = τ · E_{σ_{π_θ}} [ Q^{π_θ}(s, a) · (φ_θ(s, a) − E_{π_θ}[φ_θ(s, a′)]) ],

F(θ) = τ² · E_{σ_{π_θ}} [ (φ_θ(s, a) − E_{π_θ}[φ_θ(s, a′)]) (φ_θ(s, a) − E_{π_θ}[φ_θ(s, a′)])^⊤ ].

Neural Policy Gradient Methods
Actor Update

To update θ_i, we set

θ_{i+1} ← Π_B( θ_i + η · G(θ_i) · ∇̂_θ J(π_{θ_i}) ),

where

- B = {α ∈ R^{md} : ‖α − W_init‖_2 ≤ R}, where R > 1 and W_init is the initial parameter
- Π_B : R^{md} → B is the projection operator onto the parameter space B ⊆ R^{md}
- G(θ_i) = I_{md} for policy gradient and G(θ_i) = (F(θ_i))^{−1} for natural policy gradient
- η is the learning rate and ∇̂_θ J(π_{θ_i}) is an estimator of ∇_θ J(π_{θ_i}), given by

∇̂_θ J(π_{θ_i}) = (1/B) Σ_{ℓ=1}^B Q_{ω_i}(s_ℓ, a_ℓ) · ∇_θ log π_{θ_i}(a_ℓ | s_ℓ)
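
A sketch of this estimator and the projected actor step, assuming a critic `Q_omega(s, a)`, sampled pairs from the visitation measure, and score vectors `score_fn(s, a) = ∇_θ log π_θ(a|s)` are all provided as hypothetical callables.

```python
import numpy as np

def pg_estimator(batch, Q_omega, score_fn):
    """hat nabla_theta J = (1/B) sum_l Q_omega(s_l, a_l) * grad log pi(a_l|s_l)."""
    grads = [Q_omega(s, a) * score_fn(s, a) for s, a in batch]
    return np.mean(grads, axis=0)

def project_to_ball(theta, w_init, radius):
    """Projection onto B = {alpha : ||alpha - W_init||_2 <= R}."""
    diff = theta - w_init
    norm = np.linalg.norm(diff)
    return theta if norm <= radius else w_init + radius * diff / norm

def actor_step(theta, grad_hat, eta, w_init, radius, G=None):
    """theta_{i+1} = Pi_B(theta_i + eta * G * grad_hat); G=None means vanilla PG (G = I)."""
    direction = grad_hat if G is None else G @ grad_hat
    return project_to_ball(theta + eta * direction, w_init, radius)
```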

Neural Policy Gradient Methods
Actor Update

Sampling From the Visitation Measure

Recall the policy gradient (Proposition 3.1 in [Wang et al., 2019]):

∇_θ J(π_θ) = τ · E_{σ_{π_θ}} [ Q^{π_θ}(s, a) · (φ_θ(s, a) − E_{π_θ}[φ_θ(s, a′)]) ].

We need to sample from the visitation measure σ_{π_θ}. Define a new MDP (S, A, P̃, ζ, r, γ) with Markov transition kernel P̃:

P̃(s′ | s, a) = γ · P(s′ | s, a) + (1 − γ) · ζ(s′), ∀(s, a, s′) ∈ S × A × S,

where

- P is the Markov transition kernel of the original MDP
- ζ ∈ ∆(S) is the reset (initial) state distribution of the new MDP
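
Sampling from σ_{π_θ} then amounts to running the chain with the mixed kernel P̃: at each step, with probability γ follow the original transition and with probability 1 − γ reset to ζ. A minimal sketch with hypothetical callables `step(s, a)` (original transition), `reset()` (draw from ζ), and `policy(s)`:

```python
import numpy as np

def sample_visitation(policy, step, reset, gamma, n_samples, burn_in=100, rng=None):
    """Draw (s, a) pairs approximately from the visitation measure sigma_pi
    by simulating the modified kernel P~ = gamma * P + (1 - gamma) * zeta."""
    rng = rng or np.random.default_rng()
    samples, s = [], reset()
    for t in range(burn_in + n_samples):
        a = policy(s)
        if t >= burn_in:
            samples.append((s, a))
        # with prob gamma follow P(.|s, a); with prob 1 - gamma reset to zeta
        s = step(s, a) if rng.random() < gamma else reset()
    return samples
```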

Neural Policy Gradient Methods
Actor Update

Inverting the Fisher Information Matrix

Recall that G(θ_i) = (F(θ_i))^{−1} for natural policy gradient.

Challenge: inverting an estimator F̂(θ_i) of F(θ_i) can be infeasible, since F̂(θ_i) is a high-dimensional matrix that is possibly not invertible.

To resolve this issue, we estimate the natural policy gradient G(θ_i) · ∇_θ J(π_{θ_i}) by solving

min_{α ∈ B} ‖F̂(θ_i) · α − τ_i · ∇̂_θ J(π_{θ_i})‖_2

Neural Policy Gradient Methods
Actor Update

Meanwhile, F̂(θ_i) is an unbiased estimator of F(θ_i) based on {(s_ℓ, a_ℓ)}_{ℓ∈[B]} sampled from σ_i, defined as

F̂(θ_i) = (τ_i² / B) Σ_{ℓ=1}^B (φ_{θ_i}(s_ℓ, a_ℓ) − E_{π_{θ_i}}[φ_{θ_i}(s_ℓ, a′_ℓ)]) (φ_{θ_i}(s_ℓ, a_ℓ) − E_{π_{θ_i}}[φ_{θ_i}(s_ℓ, a′_ℓ)])^⊤.

The actor update of neural natural policy gradient takes the form

τ_{i+1} ← τ_i + η,
τ_{i+1} · θ_{i+1} ← τ_i · θ_i + η · argmin_{α ∈ B} ‖F̂(θ_i) · α − τ_i · ∇̂_θ J(π_{θ_i})‖_2.
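
A sketch of the natural-gradient direction computed this way, using a ridge-regularized least-squares solve in place of an explicit inverse; `F_hat` and `grad_hat` are the hypothetical estimated Fisher matrix and gradient, and the projection onto B is omitted for brevity.

```python
import numpy as np

def npg_direction(F_hat, grad_hat, tau_i, ridge=1e-6):
    """Approximate argmin_alpha || F_hat @ alpha - tau_i * grad_hat ||_2
    via a ridge-regularized normal-equations solve (constraint alpha in B omitted)."""
    d = F_hat.shape[0]
    A = F_hat.T @ F_hat + ridge * np.eye(d)
    b = F_hat.T @ (tau_i * grad_hat)
    return np.linalg.solve(A, b)

def neural_npg_actor_step(theta_i, tau_i, F_hat, grad_hat, eta):
    """tau_{i+1} <- tau_i + eta;  tau_{i+1} * theta_{i+1} <- tau_i * theta_i + eta * alpha*."""
    alpha_star = npg_direction(F_hat, grad_hat, tau_i)
    tau_next = tau_i + eta
    theta_next = (tau_i * theta_i + eta * alpha_star) / tau_next
    return theta_next, tau_next
```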

Neural Policy Gradient Methods
Actor Update

To summarize, at the i-th iteration:

Neural policy gradient obtains θ_{i+1} via projected gradient ascent using the estimator

∇̂_θ J(π_{θ_i}) = (1/B) Σ_{ℓ=1}^B Q_{ω_i}(s_ℓ, a_ℓ) · ∇_θ log π_{θ_i}(a_ℓ | s_ℓ).

Neural natural policy gradient solves

min_{α ∈ B} ‖F̂(θ_i) · α − τ_i · ∇̂_θ J(π_{θ_i})‖_2

and obtains θ_{i+1} according to

τ_{i+1} ← τ_i + η,
τ_{i+1} · θ_{i+1} ← τ_i · θ_i + η · argmin_{α ∈ B} ‖F̂(θ_i) · α − τ_i · ∇̂_θ J(π_{θ_i})‖_2.

Neural Policy Gradient Methods
Critic Update

To compute ∇̂_θ J(π_{θ_i}), it remains to obtain the critic Q_{ω_i} in

∇̂_θ J(π_{θ_i}) = (1/B) Σ_{ℓ=1}^B Q_{ω_i}(s_ℓ, a_ℓ) · ∇_θ log π_{θ_i}(a_ℓ | s_ℓ).

For any policy π, the action-value function Q^π is the unique solution of the Bellman equation Q = T^π Q [Sutton and Barto, 1998]. Here T^π is the Bellman operator, which takes the form

T^π Q(s, a) = E[ (1 − γ) · r(s, a) + γ · Q(s′, a′) ], ∀(s, a) ∈ S × A,

where s′ ∼ P(·|s, a) and a′ ∼ π(·|s′).

Neural Policy Gradient Methods
Critic Update

Correspondingly, we aim to solve the following optimization problem:

ω_i ← argmin_{ω ∈ B} E_{ς_i} [ (Q_ω(s, a) − T^{π_{θ_i}} Q_ω(s, a))² ],

where

- ς_i is the stationary state-action distribution
- T^{π_{θ_i}} is the Bellman operator associated with π_{θ_i}

Neural Policy Gradient Methods
Critic Update

We adopt neural temporal-difference (TD) learning as studied in [Cai et al., 2019b], which solves the optimization problem above via stochastic semigradient descent [Sutton, 1988].

Specifically, an iteration of neural TD takes the form

ω(t + 1/2) ← ω(t) − η_TD · ( Q_{ω(t)}(s, a) − (1 − γ) · r(s, a) − γ · Q_{ω(t)}(s′, a′) ) · ∇_ω Q_{ω(t)}(s, a),
ω(t + 1) ← argmin_{α ∈ B} ‖α − ω(t + 1/2)‖_2.
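
A minimal sketch of one neural TD step under these updates, with a generic differentiable critic given by hypothetical callables `Q(omega, s, a)` and `grad_Q(omega, s, a)` (returning ∂Q/∂ω), and the same ball projection as in the actor update:

```python
import numpy as np

def neural_td_step(omega, transition, Q, grad_Q, gamma, eta_td, w_init, radius):
    """One semigradient TD update followed by projection onto
    B = {alpha : ||alpha - w_init||_2 <= radius}.

    transition: a tuple (s, a, r, s_next, a_next) sampled under the current policy.
    """
    s, a, r, s_next, a_next = transition
    td_error = Q(omega, s, a) - (1.0 - gamma) * r - gamma * Q(omega, s_next, a_next)
    omega_half = omega - eta_td * td_error * grad_Q(omega, s, a)

    diff = omega_half - w_init
    norm = np.linalg.norm(diff)
    return omega_half if norm <= radius else w_init + radius * diff / norm
```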

Neural Policy Gradient Methods
Critic Update

To summarize, we combine the actor updates and the critic update described by:

1: θ_{i+1} ← Π_B( θ_i + η · G(θ_i) · ∇̂_θ J(π_{θ_i}) )

2: τ_{i+1} ← τ_i + η,
   τ_{i+1} · θ_{i+1} ← τ_i · θ_i + η · argmin_{α ∈ B} ‖F̂(θ_i) · α − τ_i · ∇̂_θ J(π_{θ_i})‖_2

3: ω_i ← argmin_{ω ∈ B} E_{ς_i} [ (Q_ω(s, a) − T^{π_{θ_i}} Q_ω(s, a))² ]

References

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.

Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.

Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019a.

Qi Cai, Zhuoran Yang, Jason D Lee, and Zhaoran Wang. Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems, pages 11312–11322, 2019b.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.

Sham M Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.

Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. ISBN 0-262-19398-1.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.