Markovian State and Action Abstractions for MDPs via Hierarchical MCTS

Aijun Bai†, Siddharth Srivastava‡, Stuart Russell†

July 12, 2016

UC Berkeley† | UTRC‡


Background

State Abstraction

State abstraction groups a set of states into a single unit:

• Ground MDP: M = 〈S,A,T,R,γ〉

• Abstract states: X = {x1, x2, · · · }

– A partition of the state space S

• Abstraction function: ϕ : S → X

– ϕ(s) ∈ X is the abstract state corresponding to ground state s ∈ S (illustrated in the sketch below)

Figure 1: Rooms domain
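For concreteness, a minimal Python sketch of such an abstraction function ϕ, assuming a hypothetical 8×8 grid partitioned into four rooms (the cell-to-room layout is illustrative, not the paper's map):

```python
# A minimal sketch of a state abstraction phi : S -> X for a Rooms-style grid.
# The 8x8 layout with four 4x4 rooms is a hypothetical example.

ROOM_OF_CELL = {}                      # ground state (row, col) -> abstract room id
for row in range(8):
    for col in range(8):
        ROOM_OF_CELL[(row, col)] = 2 * (row // 4) + (col // 4)

def phi(state):
    """Abstraction function: the abstract state (room id) of a ground state."""
    return ROOM_OF_CELL[state]

# Ground states in the same room map to the same abstract state.
assert phi((0, 0)) == phi((3, 3))
assert phi((0, 0)) != phi((0, 7))
```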


Non-Markovianness

State abstraction results in a reduced, high-level abstract state space. A well-known difficulty with state abstraction:

• Non-Markovianness: the abstract transition probability Pr(x′ | x, a) is not Markovian in general

• Aggregation probability: Pr(s | x)

– Depends on past actions and abstract states

– Depends on the policy being executed/computed


Safe State Abstraction

Safe state abstraction avoids the non-Markovian problem:

• Ignore only irrelevant state variables (Dietterich, 1999; Andre & Russell, 2002; Jong & Stone, 2005)

• Exploit particular model structure (e.g. bisimulation or homomorphism) (Dearden & Boutilier, 1997; Givan et al., 2003; Jiang et al., 2014; Anand et al., 2015)

However, safe state abstraction must be lossless, and is therefore:

• Not always possible

• Computationally difficult to find


The Weighting Function Approach

The weighting function approach approximates Pr(s | x) using a fixed weighting function w(s, x) (Bertsekas, 1995; Singh et al., 1995; Li et al., 2006):

• Superficially, the state-abstracted model can be written in a Markovian way:

– Tϕ(x′ | x, a) = ∑_{s′∈ϕ⁻¹(x′)} ∑_{s∈ϕ⁻¹(x)} T(s′ | s, a) w(s, x)

– Rϕ(x, a) = ∑_{s∈ϕ⁻¹(x)} R(s, a) w(s, x)

• Abstract MDP: 〈X,A,Tϕ,Rϕ,γ〉

However, a fixed weighting function cannot capture the true dynamics of the abstract system! (A sketch of this construction follows below.)
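A minimal sketch of the construction for a tabular ground MDP, assuming hypothetical dictionaries `T`, `R`, `w` and an abstraction function `phi` as inputs:

```python
from collections import defaultdict

def abstract_mdp(S, A, T, R, phi, w):
    """Build T_phi and R_phi from a fixed weighting function w(s, x).

    T[(s, a)] is a dict {s_next: prob}, R[(s, a)] is a scalar reward,
    phi(s) returns the abstract state of s, and w[(s, x)] is assumed to
    sum to 1 over the ground states s in each abstract state x.
    """
    T_phi = defaultdict(float)   # (x, a, x_next) -> probability
    R_phi = defaultdict(float)   # (x, a) -> expected reward
    for s in S:
        x = phi(s)
        for a in A:
            R_phi[(x, a)] += R[(s, a)] * w[(s, x)]
            for s_next, p in T[(s, a)].items():
                T_phi[(x, a, phi(s_next))] += p * w[(s, x)]
    return T_phi, R_phi
```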


Our Approach

State Abstraction from a POMDP Perspective

Applying state abstraction ϕ to a ground MDP M = 〈S,A,T,R,γ〉 actually creates a POMDP:

• Abstract states X as observations

• Observation function: Ω(x | s) = 1[x = ϕ(s)]

• POMDP(M,ϕ) = 〈S,A,X,T,R,Ω,γ〉

– Underlying MDP: M

• The belief state b(s) in POMDP(M,ϕ) replaces the ad-hoc weighting function (see the sketch below)
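An illustrative sketch (not the authors' implementation) of the deterministic observation function and a particle-style belief over ground states; `phi` and `sample_transition(s, a) -> s'` are assumed to be provided by a ground-MDP simulator:

```python
import random

def omega(x, s, phi):
    """Observation probability: the abstract state x is observed iff x = phi(s)."""
    return 1.0 if x == phi(s) else 0.0

def update_belief(particles, action, observation, sample_transition, phi,
                  max_tries=10000):
    """Particle-style belief update for POMDP(M, phi).

    particles: list of ground states approximating b(s).
    Keeps only sampled successors consistent with the observed abstract state.
    """
    new_particles = []
    tries = 0
    while len(new_particles) < len(particles) and tries < max_tries:
        tries += 1
        s = random.choice(particles)
        s_next = sample_transition(s, action)
        if phi(s_next) == observation:        # Omega(observation | s_next) = 1
            new_particles.append(s_next)
    return new_particles or particles          # fall back if rejection sampling fails
```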


Solving POMDP(M,ϕ) via Monte Carlo Tree Search

Exactly solving POMDP(M,ϕ) via dynamic programming is intractable. From a tree-based online planning perspective, the branching factors are:

• M: up to |S|× |A|

• POMDP(M,ϕ): up to |X|× |A|

We consider solving it online via MCTS:

• POMCP(M,ϕ): POMCP running on POMDP(M,ϕ)

– Build a search tree in the history space via sampling

– Provided with a simulator for the ground MDP
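A heavily simplified sketch of a POMCP-style simulation step over abstract histories, assuming a hypothetical ground-MDP simulator `step(s, a) -> (s_next, reward)`; for brevity it folds POMCP's expansion and rollout phases into one recursive function and keeps node statistics in plain dictionaries:

```python
import math
from collections import defaultdict

# Per-history and per-(history, action) statistics; histories are tuples of
# alternating actions and abstract observations. Illustrative sketch only.
N_h = defaultdict(int)
N_ha = defaultdict(int)
Q_ha = defaultdict(float)

def simulate(s, h, depth, actions, step, phi, gamma=0.95, c=1.0, max_depth=50):
    """One simulation from ground state s at abstract history h."""
    if depth >= max_depth:
        return 0.0

    def ucb(a):
        # UCB1 over the abstract history h; untried actions are preferred.
        if N_ha[(h, a)] == 0:
            return float("inf")
        return Q_ha[(h, a)] + c * math.sqrt(math.log(N_h[h] + 1) / N_ha[(h, a)])

    a = max(actions, key=ucb)
    s_next, r = step(s, a)
    x = phi(s_next)                           # observation = abstract state
    total = r + gamma * simulate(s_next, h + (a, x), depth + 1,
                                 actions, step, phi, gamma, c, max_depth)
    # Incremental Monte Carlo update of the (history, action) value.
    N_h[h] += 1
    N_ha[(h, a)] += 1
    Q_ha[(h, a)] += (total - Q_ha[(h, a)]) / N_ha[(h, a)]
    return total
```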


Action Abstraction on POMDP(M,ϕ)

A given state abstraction naturally induces an action abstraction:

• Extend the theory of options to a POMDP

• Obtain an SMDP with options O in history space H

• Options connect histories in one high-level step

– E.g., option o_{x→y} connects histories ending in x ∈ X to histories ending in y ∈ X (see the sketch below)
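A minimal sketch of such an induced option, with initiation and termination expressed over histories ending in an abstract state; the `inner_policy` field stands in for the low-level policy πo that is learned separately (all names here are illustrative):

```python
from dataclasses import dataclass
from typing import Callable, Tuple

History = Tuple  # alternating actions and abstract observations, ending in an observation

@dataclass
class Option:
    """Option o_{x -> y} induced by the state abstraction (illustrative sketch)."""
    source: object                               # abstract state x
    target: object                               # abstract state y
    inner_policy: Callable[[History], object]    # pi_o: history -> primitive action

    def can_initiate(self, history: History) -> bool:
        # The option is available exactly when the current abstract state is x.
        return history[-1] == self.source

    def terminates(self, history: History) -> bool:
        # Terminate as soon as the abstract state changes.
        return history[-1] != self.source

    def succeeded(self, history: History) -> bool:
        return history[-1] == self.target
```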


Value Function Decomposition for Options

Hierarchical policy for POMDP(M,ϕ): Π = {µ, πo1, πo2, · · · }

• µ : H → O is the overall option-selection policy

• πo is the inner policy for option o ∈ O

• MAXQ-like value function decomposition in history space:

– Qµ(h, o) = Vπo(h) + ∑_{h′∈H} γ^{|h′|−|h|} Pr(h′ | h, o) Vµ(h′)

– Qπo(h, a) = R(h, a) + γ ∑_{x∈X} Pr(x | h, a) Vπo(hax)
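To make the first decomposition equation concrete, a small sketch of how one sampled execution of option o starting at history h could be turned into a backup target: the discounted return collected while o runs (an estimate of Vπo(h)) plus the discounted value of the terminating history h′. The names `rewards` and `V_mu_at_h_prime` are hypothetical inputs:

```python
def option_backup_target(rewards, gamma, V_mu_at_h_prime):
    """MAXQ-like target for Q_mu(h, o) from one sampled execution of option o.

    rewards: primitive rewards r_1, ..., r_k collected while o was running;
    V_mu_at_h_prime: value estimate of the history h' where o terminated.
    """
    inside = sum(gamma ** t * r for t, r in enumerate(rewards))  # estimates V_{pi_o}(h)
    k = len(rewards)
    return inside + gamma ** k * V_mu_at_h_prime                 # tail term under mu

# Example: a 3-step option execution with reward -1 per step, gamma = 0.95,
# terminating at a history valued at 10 under mu.
target = option_backup_target([-1, -1, -1], 0.95, 10.0)
```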


Exploit Action Abstraction via Hierarchical MCTS

The resulting hierarchical MCTS algorithm — POMCP(M,ϕ,O):

• Learn µ by running high-level POMCP over options

• Learn πo by running low-level POMCP over primitive actions

• Invoke a nested MCTS when evaluating an option

• Update option/action values according to the value function decomposition (a simplified sketch follows below)
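A greatly simplified, hypothetical sketch of the two-level search: a high-level Monte Carlo loop that scores options and a nested execution of each option's inner policy until the abstract state changes. It omits UCB bookkeeping, belief maintenance, and tree reuse, assumes each option exposes `source` and an `inner_policy(s)` over ground states (a simplification of the history-based πo), and is not the authors' pseudo-code:

```python
import random

def run_option(s, option, step, phi, gamma, max_steps=100):
    """Execute an option with its inner policy until the abstract state changes.

    step(s, a) -> (s_next, reward) is a hypothetical ground-MDP simulator.
    Returns (discounted return inside the option, resulting state, elapsed steps).
    """
    ret, disc, x0 = 0.0, 1.0, phi(s)
    for t in range(max_steps):
        a = option.inner_policy(s)            # nested low-level decision
        s, r = step(s, a)
        ret += disc * r
        disc *= gamma
        if phi(s) != x0:                      # e.g. the robot entered another room
            return ret, s, t + 1
    return ret, s, max_steps

def rollout_value(s, options, step, phi, gamma, depth):
    """Tail estimate: follow uniformly random applicable options to a fixed depth."""
    if depth == 0:
        return 0.0
    applicable = [o for o in options if o.source == phi(s)]
    if not applicable:
        return 0.0
    ret, s_next, k = run_option(s, random.choice(applicable), step, phi, gamma)
    return ret + gamma ** k * rollout_value(s_next, options, step, phi, gamma, depth - 1)

def choose_option(s0, options, step, phi, gamma=0.95, sims_per_option=100, depth=5):
    """High-level Monte Carlo search over options (simplified stand-in for POMCP(M,phi,O))."""
    best_option, best_value = None, float("-inf")
    for o in options:
        if o.source != phi(s0):
            continue                          # option not applicable in this abstract state
        total = 0.0
        for _ in range(sims_per_option):
            ret, s_next, k = run_option(s0, o, step, phi, gamma)
            total += ret + gamma ** k * rollout_value(s_next, options, step, phi, gamma, depth)
        value = total / sims_per_option
        if value > best_value:
            best_option, best_value = o, value
    return best_option
```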


Theoretical Results

Theorems

• POMCP(M,ϕ) finds the optimal policy for a ground MDP M consistent with the input state abstraction ϕ

• The performance loss of POMCP(M,ϕ) is bounded by a constant multiple of an aggregation error introduced by grouping states with different optimal actions

• POMCP(M,ϕ,O) converges to a recursively optimal hierarchical policy for POMDP(M,ϕ) over the hierarchy defined by the input state and action abstractions


Experimental Evaluation

The Rooms Domain

The ROOMS[m,n,k] problem:

• A robot navigates in an m × n grid map containing k rooms

• Primitive actions: E, S, W and N

• Probability 0.2 of executing a random action

The input state and action abstractions:

• Abstract states: rooms

• Options: transitions between rooms
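A hypothetical minimal simulator for these dynamics; the wall layout, goal handling, and the −1 per-step reward are illustrative assumptions rather than details taken from the slides:

```python
import random

ACTIONS = {"E": (0, 1), "S": (1, 0), "W": (0, -1), "N": (-1, 0)}

def rooms_step(state, action, walls, goal, slip=0.2):
    """One step of a ROOMS-style grid simulator (illustrative sketch).

    state: (row, col); walls: set of blocked cells; goal: terminal cell.
    With probability `slip` a uniformly random primitive action is executed
    instead of the intended one. Rewards are an assumed -1 per step, 0 at the goal.
    """
    if state == goal:
        return state, 0.0                      # absorbing goal state
    if random.random() < slip:
        action = random.choice(list(ACTIONS))  # random-action noise
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    if nxt in walls:
        nxt = state                            # bump into a wall: stay in place
    return nxt, -1.0
```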


Experimental Results: The Rooms Domain

[Figure: discounted return vs. number of simulations (1 to 10^6, log scale) for UCT, UCTϕ, POMCP(M,ϕ) and POMCP(M,ϕ,O). (a) ROOMS[17, 17, 4]; (b) ROOMS[25, 13, 18].]


Conclusions

• Propose that state- and action-abstracted MDPs can be viewed as POMDPs

• Bound the performance loss induced by the abstraction

• Describe a hierarchical MCTS algorithm for approximately solving the abstract POMDP

– Converges to a recursively optimal hierarchical policy

– Improves over ground MCTS by orders of magnitude empirically


Questions?


References

Anand, A., Grover, A., Mausam, M., & Singla, P. (2015). ASAP-UCT: abstraction of state-action pairs in UCT. In Proceedings of the 24th International Conference on Artificial Intelligence, (pp. 1509–1515). AAAI Press.

Andre, D., & Russell, S. J. (2002). State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, (pp. 119–125).

Bertsekas, D. P. (1995). Dynamic programming and optimal control, vol. 1. Athena Scientific, Belmont, MA.

Dearden, R., & Boutilier, C. (1997). Abstraction and approximate decision-theoretic planning. Artificial Intelligence, 89(1), 219–283.

Dietterich, T. G. (1999). State abstraction in MAXQ hierarchical reinforcement learning. arXiv preprint cs/9905015.

Givan, R., Dean, T., & Greig, M. (2003). Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1), 163–223.

Jiang, N., Singh, S., & Lewis, R. (2014). Improving UCT planning via approximate homomorphisms. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, (pp. 1289–1296). International Foundation for Autonomous Agents and Multiagent Systems.

Jong, N. K., & Stone, P. (2005). State abstraction discovery from irrelevant state variables. Citeseer.

Li, L., Walsh, T. J., & Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. In ISAIM.

Singh, S. P., Jaakkola, T., & Jordan, M. I. (1995). Reinforcement learning with soft state aggregation. Advances in Neural Information Processing Systems, (pp. 361–368).

MDPs and POMDPs

Markov decision processes (MDPs) provide a rich framework for planning and learning under uncertainty in fully observable environments:

• An MDP is a tuple 〈S,A,T,R,γ〉

Partially observable Markov decision processes (POMDPs) extend MDPs to partially observable environments:

• A POMDP is a tuple 〈S,A,Z,T,R,Ω,γ〉

– Underlying MDP: 〈S,A,T,R,γ〉

The Pseudo Code

Figure 3: The overall POMCP(M,ϕ,O) algorithm

Aggregation Error

Definition

The aggregation error of a state abstraction 〈X,ϕ〉 for a ground MDP M = 〈S,A,T,R,γ〉 is e if ∃ a ∈ A such that, for all x ∈ X, all s with ϕ(s) = x, and all d ∈ [0, H], |V_d(s) − Q_d(s, a)| ≤ e, where V_d and Q_d are the optimal value and action-value functions at depth d in the search tree of M, and H is the maximal planning horizon.

Optimality Results for State Abstraction

Theorem

For a state abstraction 〈X,ϕ〉 of a ground MDP M = 〈S,A,T,R,γ〉 with aggregation error e, let s0 be the current state in the ground MDP M and let h0 with P(h0) = s0 be the corresponding history in POMDP(M,ϕ). Let Q∗(s, ·) and Q∗(h, ·) be the optimal action values of M and POMDP(M,ϕ) respectively. Let a∗ = argmax_{a∈A} Q∗(h0, a) be the optimal primitive action found in POMDP(M,ϕ) at history h0, and define an action-value error as E(a∗) = |max_{a∈A} Q∗(s0, a) − Q∗(s0, a∗)|. Suppose the maximal planning horizon is H; then E(a∗) ≤ 2He if γ = 1, and otherwise E(a∗) ≤ 2e · γ(1 − γ^H)/(1 − γ).
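As a sanity check, the γ < 1 bound is consistent with accumulating a per-step value error of at most 2e over H discounted steps (a sketch of this accumulation, not the paper's proof):

```latex
% Assumed derivation sketch: per-step error at most 2e, discounted from step 1 to H.
\[
E(a^*) \;\le\; \sum_{t=1}^{H} \gamma^{t} \cdot 2e
\;=\; 2e \cdot \frac{\gamma\,(1-\gamma^{H})}{1-\gamma},
\qquad
\lim_{\gamma \to 1} \; 2e \cdot \frac{\gamma\,(1-\gamma^{H})}{1-\gamma} \;=\; 2He .
\]
```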

Convergence Results with Action Abstraction

Theorem

With probability 1, POMCP(M,ϕ,O) converges to a recursively optimal hierarchical policy for POMDP(M,ϕ) over the hierarchy defined by the input state and action abstractions.

The Continuous Rooms Domain

The C-ROOMS[m,n,k] problem:

• Each cell has a size of 1 m²

• The position of the agent is represented as (x, y) coordinates

• An action moves the agent by an expected distance of 1 m

• Gaussian noise is added to each movement

The input state and action abstractions remain the same as in the rooms domain (see the sketch below).
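A hypothetical sketch of the continuous step dynamics, with an assumed isotropic Gaussian noise level (the slides do not give the actual variance) and wall handling omitted:

```python
import random

MOVES = {"E": (0.0, 1.0), "S": (1.0, 0.0), "W": (0.0, -1.0), "N": (-1.0, 0.0)}

def continuous_step(pos, action, sigma=0.2):
    """One C-ROOMS-style move: unit expected displacement plus Gaussian noise.

    pos: (x, y) coordinates; sigma is an illustrative noise level (assumption).
    Collision handling against walls is omitted for brevity.
    """
    dx, dy = MOVES[action]
    x = pos[0] + dx + random.gauss(0.0, sigma)
    y = pos[1] + dy + random.gauss(0.0, sigma)
    return (x, y)
```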

Experimental Results: The Continuous Rooms Domain

[Figure: discounted return vs. number of simulations (1 to 10^6, log scale) for UCT, UCTϕ, POMCP(M,ϕ) and POMCP(M,ϕ,O). (a) C-ROOMS[17, 17, 4]; (b) C-ROOMS[25, 13, 18].]