Page 1: Near-Optimal Control of Queueing Systems via Approximate One …faculty.nps.edu/jefferson.huang/2018-rlpn.pdf · 2018-03-21 · Near-Optimal Control of Queueing Systems via Approximate

Near-Optimal Control of Queueing Systems via

Approximate One-Step Policy Improvement

Jefferson Huang

March 21, 2018

“Reinforcement Learning for Processing Networks” Seminar

Cornell University

Performance Evaluation and Optimization

∃ various (approximate) methods for evaluating a fixed policy for an MDP.

• evaluate = compute value function

• methods include LSTD, TD(λ), …

Policy Improvement: If v_π is the exact value function for the policy π, then a policy π+ that is provably at least as good is given by:

$$\pi^+(x) \in \arg\min_{a \in A(x)} \left\{ c(x, a) + \delta\, \mathbb{E}[v_\pi(X) \mid X \sim p(\cdot \mid x, a)] \right\}, \quad x \in \mathbb{X} \tag{1}$$

• Discounted Costs: δ ∈ [0, 1), v_π gives expected discounted total cost from each state under π

• Average Costs¹: δ = 1, v_π is the “relative value function” under π

Policy Iteration is based on this idea.

¹ When the MDP is unichain (e.g., ergodic under every policy).
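
The improvement step in (1) can be sketched for a small finite MDP with discounted costs. This is only an illustration under stated assumptions: the 3-state, 2-action transition matrices and costs are made up, and the helper names `evaluate_policy` and `improve_policy` are hypothetical. Evaluation here is exact (a linear solve), so the improved policy is guaranteed to be at least as good.

```python
import numpy as np

def evaluate_policy(P, c, policy, delta):
    """Exact discounted value function of a stationary deterministic policy.

    P[a] is the transition matrix under action a, c[x, a] is the one-step
    cost, and policy[x] is the action taken in state x.
    """
    n = len(policy)
    P_pi = np.array([P[policy[x]][x] for x in range(n)])
    c_pi = np.array([c[x, policy[x]] for x in range(n)])
    # v = c_pi + delta * P_pi v  <=>  (I - delta * P_pi) v = c_pi
    return np.linalg.solve(np.eye(n) - delta * P_pi, c_pi)

def improve_policy(P, c, v, delta):
    """One-step improvement: minimise c(x, a) + delta * E[v(X)] over actions."""
    q = np.stack([c[:, a] + delta * P[a] @ v for a in range(len(P))], axis=1)
    return q.argmin(axis=1)

# Hypothetical MDP data, chosen only for illustration.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.5, 0.5, 0.0], [0.6, 0.3, 0.1], [0.5, 0.4, 0.1]],  # action 1
])
c = np.array([[0.0, 1.0], [2.0, 2.5], [6.0, 6.5]])  # c[x, a]
delta = 0.9

pi = np.zeros(3, dtype=int)                 # initial policy: always action 0
v_pi = evaluate_policy(P, c, pi, delta)     # exact value function of pi
pi_plus = improve_policy(P, c, v_pi, delta)
v_plus = evaluate_policy(P, c, pi_plus, delta)
```

Repeating the evaluate/improve pair until the policy stops changing gives policy iteration; stopping after one pass is the one-step improvement discussed in these slides.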

Approximate Policy Improvement

If v_π is replaced with an approximation v̂_π, then the “improved policy” π+, where

$$\pi^+(x) \in \arg\min_{a \in A(x)} \left\{ c(x, a) + \delta\, \mathbb{E}[\hat{v}_\pi(X) \mid X \sim p(\cdot \mid x, a)] \right\}, \quad x \in \mathbb{X} \tag{2}$$

is not necessarily better than π.

Questions:

1. When is (2) computationally tractable?

2. When is π+ close to being optimal?

Our focus is on MDPs modeling queueing systems.

Outline

Part 1: Using Analytically Tractable Policies²

• Average Costs

Part 2: Using Simulation and Interpolation³

• Average Costs

Part 3: Using Lagrangian Relaxations⁴

• Discounted Costs

² Bhulai, S. (2017). Value Function Approximation in Complex Queueing Systems. In Markov Decision Processes in Practice (pp. 33-62). Springer.

³ James, T., Glazebrook, K., & Lin, K. (2016). Developing effective service policies for multiclass queues with abandonment: asymptotic optimality and approximate policy improvement. INFORMS Journal on Computing, 28(2), 251-264.

⁴ Brown, D. B., & Haugh, M. B. (2017). Information relaxation bounds for infinite horizon Markov decision processes. Operations Research, 65(5), 1355-1379.

Part 1

Using Analytically Tractable Policies

Analytically Tractable Queueing Systems

Idea:

1. Start with systems whose Poisson equation is analytically solvable.

2. Use them to suggest analytically tractable policies for more complex systems.

Examples: (Bhulai 2017)

• M/Cox(r)/1 queue

• M/M/s queue

• M/M/s/s blocking system

• priority queue

The Poisson Equation

Let π be a policy (e.g., a fixed admission rule, a fixed priority rule).

In general, the Poisson equation looks like this:

$$g + h(x) = c(x, \pi(x)) + \sum_{y \in \mathbb{X}} p(y \mid x, \pi(x))\, h(y), \quad x \in \mathbb{X}.$$

We want to solve for the average cost g and the relative value function h(x) of π.

The Poisson equation is also called the “evaluation equation”.

• e.g., Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
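
For a finite state space, the Poisson equation is a linear system and can be solved directly. The sketch below assumes the chain under π is unichain and pins down h by the normalisation h(ref) = 0 at a chosen reference state; the two-state transition matrix and costs are hypothetical, as is the helper name `solve_poisson`.

```python
import numpy as np

def solve_poisson(P_pi, c_pi, ref=0):
    """Solve g + h(x) = c_pi(x) + sum_y P_pi[x, y] h(y) with h(ref) = 0.

    Assumes the chain induced by the policy is unichain, so that (g, h)
    is unique under this normalisation.
    """
    n = len(c_pi)
    # Unknown vector: (g, h(0), ..., h(n-1)).
    A = np.zeros((n + 1, n + 1))
    A[:n, 0] = 1.0                    # coefficient of g in every equation
    A[:n, 1:] = np.eye(n) - P_pi      # (I - P_pi) h
    A[n, 1 + ref] = 1.0               # normalisation row: h(ref) = 0
    b = np.append(c_pi, 0.0)
    sol = np.linalg.solve(A, b)
    return sol[0], sol[1:]

# Hypothetical two-state chain under a fixed policy.
P_pi = np.array([[0.5, 0.5], [0.2, 0.8]])
c_pi = np.array([1.0, 3.0])
g, h = solve_poisson(P_pi, c_pi)
```

For this small example the stationary distribution is (2/7, 5/7), so the average cost is 17/7, which the solver reproduces.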

The Poisson Equation for Queueing Systems

For queueing systems, the Poisson equation is often a linear difference equation.

• See, e.g., Mickens, R. (1991). Difference Equations: Theory and Applications. CRC Press.

Example: Poisson equation for a uniformized M/M/1 queue with λ + µ < 1 and linear holding cost rate c:

$$g + h(x) = cx + \lambda h(x+1) + \mu h(x-1) + (1 - \lambda - \mu) h(x), \quad x \in \{1, 2, \dots\},$$

$$g + h(0) = \lambda h(1) + (1 - \lambda) h(0).$$

This is a “second-order” difference equation.

Linear Difference Equations

Theorem (Bhulai 2017, Theorem 2.1)

Suppose f : {0, 1, …} → ℝ satisfies

$$f(x+1) + \alpha(x) f(x) + \beta(x) f(x-1) = q(x), \quad x \ge 1,$$

where β(x) ≠ 0 for all x ≥ 1.

If f_h : {0, 1, …} → ℝ is a “homogeneous solution”, i.e.,

$$f_h(x+1) + \alpha(x) f_h(x) + \beta(x) f_h(x-1) = 0, \quad x \ge 1,$$

then, letting the empty product be equal to one,

$$\frac{f(x)}{f_h(x)} = \frac{f(0)}{f_h(0)} + \left(\frac{f(1)}{f_h(1)} - \frac{f(0)}{f_h(0)}\right) \sum_{i=1}^{x} \prod_{j=1}^{i-1} \frac{\beta(j) f_h(j-1)}{f_h(j+1)} + \sum_{i=1}^{x} \prod_{j=1}^{i-1} \frac{\beta(j) f_h(j-1)}{f_h(j+1)} \sum_{j=1}^{i-1} \frac{q(j)}{f_h(j+1) \prod_{k=1}^{j} \frac{\beta(k) f_h(k-1)}{f_h(k+1)}}.$$

Application: M/M/1 Queue

Rewrite the Poisson equation in the form of the Theorem:

$$h(x+1) + \underbrace{\left(-\frac{\lambda + \mu}{\lambda}\right)}_{\alpha(x)} h(x) + \underbrace{\left(\frac{\mu}{\lambda}\right)}_{\beta(x)} h(x-1) = \underbrace{\frac{g - cx}{\lambda}}_{q(x)}, \quad x \ge 1.$$

Note that f_h ≡ 1 works as the “homogeneous solution”.

We also know that, for an M/M/1 queue with linear holding cost rate c,

$$g = \frac{c\lambda}{\mu - \lambda}.$$

So, normalizing h(0) = 0 and using h(1) = g/λ from the Poisson equation at x = 0, the Theorem gives

$$h(x) = \frac{g}{\lambda} \sum_{i=1}^{x} \left(\frac{\mu}{\lambda}\right)^{i-1} + \sum_{i=1}^{x} \left(\frac{\mu}{\lambda}\right)^{i-1} \sum_{j=1}^{i-1} \left(\frac{\lambda}{\mu}\right)^{j} \left(\frac{g - cj}{\lambda}\right) = \frac{c\,x(x+1)}{2(\mu - \lambda)}.$$
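
The closed form can be checked independently of the summation formula. Writing w(x) = h(x) − h(x − 1), the M/M/1 Poisson equation gives the first-order recursion w(x + 1) = (µ/λ) w(x) + (g − cx)/λ with w(1) = g/λ. The sketch below runs that recursion in exact rational arithmetic and compares it with c x(x + 1) / (2(µ − λ)); the rates and the function names are assumptions chosen for illustration.

```python
from fractions import Fraction

def h_closed_form(x, lam, mu, c):
    """Closed form h(x) = c x (x + 1) / (2 (mu - lam)), with h(0) = 0."""
    return c * x * (x + 1) / (2 * (mu - lam))

def h_from_recursion(x_max, lam, mu, c):
    """Build h(0..x_max) from the difference recursion of the Poisson equation."""
    g = c * lam / (mu - lam)
    h = [Fraction(0)]            # normalisation h(0) = 0
    w = g / lam                  # h(1) - h(0), from the equation at x = 0
    for x in range(1, x_max + 1):
        h.append(h[-1] + w)
        # w(x + 1) = (mu / lam) w(x) + (g - c x) / lam
        w = (mu / lam) * w + (g - c * x) / lam
    return h

# Hypothetical uniformised rates with lam + mu < 1.
lam, mu, c = Fraction(1, 5), Fraction(1, 2), Fraction(3)
g = c * lam / (mu - lam)
h_rec = h_from_recursion(10, lam, mu, c)
h_cf = [h_closed_form(x, lam, mu, c) for x in range(11)]

# Direct check of the Poisson equation at the interior states 1..9.
interior_ok = all(
    g + h_cf[x] == c * x + lam * h_cf[x + 1] + mu * h_cf[x - 1]
    + (1 - lam - mu) * h_cf[x]
    for x in range(1, 10)
)
```

Using `Fraction` avoids floating-point tolerance questions, so the comparisons are exact equalities.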

Other Analytically Tractable Systems

Relative value functions for the following systems are presented in (Bhulai 2017):

1. M/Cox(r)/1 Queue

• special cases: hyperexponential, hypoexponential, Erlang, and exponential service times

2. M/M/s Queue

• with infinite buffer, with no buffer (blocking system)

3. 2-class M/M/1 Priority Queue

Application to Analytically Intractable Systems

Idea:

1. Pick an initial policy whose relative value function can be written in terms of the relative value functions of simpler systems.

2. Do one-step policy improvement using that policy.

In (Bhulai 2017), this is applied to the following problems:

1. Routing Poisson arrivals to two different M/Cox(r)/1 queues.

• Initial Policy: Bernoulli routing

• Uses relative value function of M/Cox(r)/1 queue

2. Routing in a Multi-Skill Call Center

• Initial Policy: Static randomized policy that tries to route calls to agents with the fewest skills first

• Uses relative value function of M/M/s queue

3. Controlled Polling System with Switching Costs

• Initial Policy: cµ-rule

• Uses relative value function of priority queue

Controlled Polling System with Switching Costs

Two queues with independent Poisson arrivals at rates λ_1, λ_2, exponential service times with rates µ_1, µ_2 and holding cost rates c_1, c_2, respectively.

If queue i is currently being served, switching to queue j ≠ i costs s_i, i = 1, 2.

Problem: Dynamically assign the server to one of the two queues, so that the average cost incurred is minimized.

Do one-step policy improvement on the cµ-rule.

Results for λ_1 = λ_2 = 1, µ_1 = 6, µ_2 = 3, c_1 = 2, c_2 = 1, s_1 = s_2 = 2:

Policy                  Average Cost
cµ-Rule                 3.62894
One-Step Improvement    3.09895
Optimal Policy          3.09261
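
To read the table as suboptimality gaps, the reported average costs can be compared with the optimal value. The snippet below only does this arithmetic on the numbers from the slide; the dictionary keys and the helper `percent_above_optimal` are illustrative names. It also computes the c_i µ_i indices for the stated parameters, which is the quantity the cµ-rule ranks queues by.

```python
# Average costs as reported on the slide.
average_cost = {
    "c-mu rule": 3.62894,
    "one-step improvement": 3.09895,
    "optimal policy": 3.09261,
}

def percent_above_optimal(cost, optimal):
    """Relative suboptimality of a policy, in percent."""
    return 100.0 * (cost - optimal) / optimal

optimal = average_cost["optimal policy"]
gaps = {
    name: percent_above_optimal(cost, optimal)
    for name, cost in average_cost.items()
}

# c_i * mu_i for queues 1 and 2 with c = (2, 1) and mu = (6, 3).
c_mu_index = {1: 2 * 6, 2: 1 * 3}
priority_queue = max(c_mu_index, key=c_mu_index.get)
```

With these numbers the cµ-rule is roughly 17.3% above optimal, while the one-step improved policy is roughly 0.2% above optimal.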

Part 2

Using Simulation and Interpolation

Scheduling a Multiclass Queue with Abandonments

k queues with independent Poisson arrivals at rates λ_1, …, λ_k, exponential service times with rates µ_1, …, µ_k, and holding cost rates c_1, …, c_k, respectively.

Each customer in queue i = 1, …, k remains available for service for an exponentially distributed amount of time, with rate θ_i.

Each service completion from queue i = 1, …, k earns R_i; each abandonment from queue i costs D_i.

Problem: Dynamically assign the server to one of the k queues, so that the average cost incurred is minimized.

Relative Value Function

Let π be a policy, and select any reference state

$$x_r \in \mathbb{X} = \{(i_1, \dots, i_k) \in \{0, 1, \dots\}^k\}.$$

g_π = average cost incurred under π

r_π(x) = expected total cost to reach x_r under π, starting from state x

t_π(x) = expected time to reach x_r under π, starting from state x

Then the relative value function is

$$h_\pi(x) = r_\pi(x) - g_\pi t_\pi(x), \quad x \in \mathbb{X}.$$
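
The representation h = r − g t can be checked numerically on a simple single-queue example rather than the multiclass model. The sketch below is an assumption-laden illustration: it uses a uniformised M/M/1 queue truncated at N customers, measures time in uniformised steps, takes x_r = 0, and computes r and t by solving linear systems instead of simulating. It then confirms that h satisfies the Poisson equation.

```python
import numpy as np

# Hypothetical uniformised M/M/1 queue (lam + mu < 1), truncated at N customers.
lam, mu, c, N, ref = 0.2, 0.5, 3.0, 60, 0
P = np.zeros((N + 1, N + 1))
for x in range(N + 1):
    if x < N:
        P[x, x + 1] = lam          # arrival
    if x > 0:
        P[x, x - 1] = mu           # service completion
    P[x, x] = 1.0 - P[x].sum()     # uniformisation self-loop
cost = c * np.arange(N + 1)

# r and t: expected cost and number of steps until the first visit to ref.
# On the non-reference states they solve (I - Q) r = cost and (I - Q) t = 1.
others = [x for x in range(N + 1) if x != ref]
Q = P[np.ix_(others, others)]
r = np.zeros(N + 1)
t = np.zeros(N + 1)
r[others] = np.linalg.solve(np.eye(N) - Q, cost[others])
t[others] = np.linalg.solve(np.eye(N) - Q, np.ones(N))

# Average cost via renewal-reward over cycles that start and end at ref.
g = (cost[ref] + P[ref] @ r) / (1.0 + P[ref] @ t)
h = r - g * t

# Largest violation of g + h(x) = cost(x) + sum_y P[x, y] h(y).
residual = np.max(np.abs(g + h - cost - P @ h))
```

With ρ = λ/µ = 0.4 the truncation at N = 60 is negligible, so g and h agree with the infinite-buffer values cλ/(µ − λ) and c x(x + 1)/(2(µ − λ)) to numerical precision.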

Approximate Policy Improvement

Exact DP is infeasible for k > 3 classes.

(James, Glazebrook, Lin 2016) propose an approximate policy improvement algorithm.

Idea: Given a policy π, approximate its relative value function h_π as follows:

1. Simulate π to estimate its average cost g_π and the long-run frequency with which each state is visited.

2. Based on Step 1, select a set of initial states from which the relative value under π is estimated via simulation.

3. Estimate the relative value function by interpolating between the values estimated in Step 2.

4. Do policy improvement using the estimated relative value function.
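
Steps 1 and 2 can be sketched on a toy example. The code below is not the procedure from (James, Glazebrook, Lin 2016); it is a minimal illustration that assumes a single uniformised M/M/1 queue with a fixed policy, reference state 0, and made-up run lengths. It estimates g from one long run, then estimates the relative value at a few initial states as (simulated cost to reach the reference state) minus g times (simulated time to reach it).

```python
import random

lam, mu = 0.2, 0.5      # hypothetical uniformised rates, lam + mu < 1
c, ref = 3.0, 0         # holding cost rate and reference state

def step(x, rng):
    """One transition of the uniformised M/M/1 chain."""
    u = rng.random()
    if u < lam:
        return x + 1
    if u < lam + mu and x > 0:
        return x - 1
    return x

def estimate_average_cost(n_steps, rng):
    """Step 1: one long run giving the average cost and state visit counts."""
    x, total, visits = ref, 0.0, {}
    for _ in range(n_steps):
        total += c * x
        visits[x] = visits.get(x, 0) + 1
        x = step(x, rng)
    return total / n_steps, visits

def estimate_relative_value(x0, g_hat, m, rng):
    """Step 2: m replications of cost and time until ref is first reached."""
    cost_sum, time_sum = 0.0, 0
    for _ in range(m):
        x = x0
        while x != ref:
            cost_sum += c * x
            time_sum += 1
            x = step(x, rng)
    return (cost_sum - g_hat * time_sum) / m

rng = random.Random(0)
g_hat, visits = estimate_average_cost(200_000, rng)
h_hat = {x0: estimate_relative_value(x0, g_hat, 4_000, rng) for x0 in (0, 2, 4)}
```

For this toy queue the exact values are g = 2 and h(4) = 100, so the estimates can be sanity-checked; in the multiclass problem no such closed form is available, which is why interpolation (Step 3) is needed between the simulated states.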

Selecting States (Step 2)

S = set of initial states selected in Step 2, from which the relative value is estimated via simulation

S = S_anchor + S_support,

where

1. S_anchor = set of most frequently visited states (based on Step 1)

2. S_support = set of regularly spaced states

Parameters: How many states to include in S_anchor and S_support.
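
One way to picture the selection is sketched below. The grid construction (`grid_max`, `grid_step`) and the function name `select_states` are assumptions made for illustration; the slide does not say how the regularly spaced states are chosen in the paper. Anchor states are the most visited states from Step 1, support states form a regular grid, and the two sets are merged without duplicates.

```python
from itertools import product

def select_states(visit_counts, n_anchor, grid_max, grid_step):
    """Return anchor states, support states, and their duplicate-free union."""
    # Anchor set: the n_anchor most frequently visited states.
    ranked = sorted(visit_counts, key=visit_counts.get, reverse=True)
    anchor = ranked[:n_anchor]
    # Support set: a regular grid with the same dimension as the states.
    k = len(anchor[0])
    axis = range(0, grid_max + 1, grid_step)
    support = list(product(axis, repeat=k))
    # Union that keeps first occurrences in order.
    selected = list(dict.fromkeys(anchor + support))
    return anchor, support, selected

# Hypothetical visit counts from a Step 1 simulation with k = 2 classes.
visit_counts = {
    (0, 0): 500, (1, 0): 300, (0, 1): 250,
    (1, 1): 120, (2, 0): 80, (2, 1): 30,
}
anchor, support, selected = select_states(
    visit_counts, n_anchor=3, grid_max=4, grid_step=2
)
```

Here the anchor set has 3 states, the grid has 9, and one state, (0, 0), appears in both, so 11 distinct states are selected.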

Interpolation (Step 3)

Use an (augmented) radial basis function

$$h_\pi(x) \approx \sum_{i=1}^{n} \alpha_i\, \phi(\|x - x_i\|) + \sum_{j=1}^{d} \beta_j\, p_j(x)$$

where

• n = number of selected states in Step 2

• x_i = i-th selected state in Step 2

• φ(r) = r² log(r) (thin plate spline)

• ‖·‖ = Euclidean norm

• d = k + 1

• p_1(x) = 1, p_j(x) = number of customers in queue j − 1 (j = 2, …, k + 1)

Computing the Interpolation Parameters

x_i = i-th selected state in Step 2

f_i = estimated relative value starting from x_i

A_ij = φ(‖x_i − x_j‖) for i, j = 1, …, n

P_ij = p_j(x_i) for i = 1, …, n and j = 1, …, k + 1

Solve the following linear system of equations:

$$A\alpha + P\beta = f$$

$$P^{T}\alpha = 0$$
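
The linear system can be assembled and solved as one block system in (α, β). The sketch below uses the thin plate spline φ(r) = r² log(r) with φ(0) = 0 and the polynomial part 1, x_1, …, x_k. The selected states and their estimated values are made-up numbers for k = 2, and the function names are hypothetical.

```python
import numpy as np

def tps(r):
    """Thin plate spline phi(r) = r^2 log(r), with phi(0) = 0."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    pos = r > 0
    out[pos] = r[pos] ** 2 * np.log(r[pos])
    return out

def fit_rbf(X, f):
    """Solve A alpha + P beta = f together with P^T alpha = 0."""
    n, k = X.shape
    A = tps(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))
    P = np.hstack([np.ones((n, 1)), X])   # columns: 1, x_1, ..., x_k
    K = np.block([[A, P], [P.T, np.zeros((k + 1, k + 1))]])
    rhs = np.concatenate([f, np.zeros(k + 1)])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]

def evaluate_rbf(x, X, alpha, beta):
    """Evaluate the interpolant at a single state x."""
    x = np.asarray(x, dtype=float)
    radial = tps(np.linalg.norm(X - x, axis=1)) @ alpha
    return radial + beta[0] + beta[1:] @ x

# Hypothetical selected states (k = 2) and estimated relative values.
X = np.array(
    [[0, 0], [1, 0], [0, 1], [2, 2], [4, 0], [0, 4], [4, 4]], dtype=float
)
f = np.array([0.0, 3.0, 2.0, 14.0, 30.0, 20.0, 52.0])
alpha, beta = fit_rbf(X, f)
fitted = np.array([evaluate_rbf(x, X, alpha, beta) for x in X])
```

The fitted function reproduces the estimates at the selected states and can then be evaluated at any other state inside the policy improvement step. SciPy's `scipy.interpolate.RBFInterpolator` with `kernel="thin_plate_spline"` offers a library alternative to hand-rolling the solve.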

Example: Interpolation

2 classes, initial policy is the “Rµθ-rule”

m = number of replications for each selected state

Figure 1 in (James, Glazebrook, Lin 2016)

Example: Approximate Policy Improvement

Same problem

API = policy from one-step policy improvement

Figure 2 in (James, Glazebrook, Lin 2016)

Suboptimality of Heuristics

API(π) = one-step approximate policy improvement applied to π

k = 3 classes, ρ = ∑_{i=1}^{k} (λ_i/µ_i)

Figure 3 in (James, Glazebrook, Lin 2016); see the paper for details on this and other numerical studies.

Part 3

Using Lagrangian Relaxations

Multiclass Queue with Convex Holding Costs

k queues with independent Poisson arrivals at rates λ_1, …, λ_k and exponential service times with rates µ_1, …, µ_k, respectively.

If there are x_i customers in queue i, the holding cost rate is c_i(x_i), where c_i : {0, 1, …} → ℝ is nonnegative, increasing, and convex.

Each queue i has a buffer of size B_i.

Problem: Dynamically assign the server to one of the k queues, so that the discounted cost incurred is minimized.

Relaxed Problem

The following relaxation is considered in (Brown, Haugh 2017):

• The queues are grouped into G groups.

• The server can serve at most one queue per group, but can serve multiple groups simultaneously.

• Penalty ℓ for serving multiple groups simultaneously.

Under this relaxation, the value function decouples across groups (δ = discount factor):

$$v^{\ell}(x) = \frac{(G - 1)\ell}{1 - \delta} + \sum_{g} v^{\ell}_g(x_g)$$

v^ℓ can be used both in one-step “policy improvement” and to construct a lower bound on the optimal cost via an information relaxation.
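
The practical point of the decoupling is that only per-group value functions need to be stored, and the joint value is assembled on demand. The sketch below shows just that assembly step. The per-group tables, the penalty, and the function name `decoupled_value` are hypothetical; how the group value functions are computed is not covered here.

```python
def decoupled_value(x_groups, group_values, penalty, delta):
    """Evaluate (G - 1) * penalty / (1 - delta) + sum over groups of v_g(x_g)."""
    G = len(group_values)
    constant = (G - 1) * penalty / (1.0 - delta)
    return constant + sum(
        v_g[x_g] for v_g, x_g in zip(group_values, x_groups)
    )

# Hypothetical per-group value tables, keyed by each group's own state.
v_group_1 = {(0,): 0.0, (1,): 4.0, (2,): 9.5}
v_group_2 = {(0, 0): 0.0, (1, 0): 3.0, (0, 1): 2.0, (1, 1): 5.5}

value = decoupled_value(
    [(2,), (1, 1)], [v_group_1, v_group_2], penalty=0.5, delta=0.9
)
```

With G = 2, penalty 0.5 and δ = 0.9 the constant term is 5, so the assembled value here is 5 + 9.5 + 5.5 = 20.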

Suboptimality of Heuristics

Myopic: use one-step improvement with the value function v^m(x) = ∑_i c_i(x_i)

From (Brown, Haugh 2017); see the paper for details.
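
The myopic heuristic can be written out as a one-step improvement with v^m. The sketch below assumes a uniformised discrete-time version of the model (arrival and service rates act as per-step probabilities, with their sum at most 1) and that an arrival to a full buffer is lost; the rates, cost functions, and the name `myopic_action` are illustrative assumptions, not taken from (Brown, Haugh 2017).

```python
def myopic_action(x, lam, mu, cost_fns, buffers, delta):
    """Queue to serve under one-step improvement with v_m(x) = sum_i c_i(x_i)."""
    def v_m(state):
        return sum(c_i(x_i) for c_i, x_i in zip(cost_fns, state))

    def expected_next_value(a):
        total, stay = 0.0, 1.0
        for i, rate in enumerate(lam):        # arrival to queue i
            if x[i] < buffers[i]:
                nxt = list(x)
                nxt[i] += 1
                total += rate * v_m(nxt)
                stay -= rate
        if x[a] > 0:                          # service completion at queue a
            nxt = list(x)
            nxt[a] -= 1
            total += mu[a] * v_m(nxt)
            stay -= mu[a]
        return total + stay * v_m(x)          # remaining mass: no change

    # Serve a nonempty queue; default to queue 0 when the system is empty.
    candidates = [a for a in range(len(x)) if x[a] > 0] or [0]
    # The one-step cost v_m(x) does not depend on the action.
    return min(
        candidates, key=lambda a: v_m(x) + delta * expected_next_value(a)
    )

# Hypothetical two-class example with convex and linear holding costs.
lam, mu = (0.1, 0.1), (0.3, 0.4)
cost_fns = (lambda n: n ** 2, lambda n: 3 * n)
buffers = (10, 10)
serve_when_q2_long = myopic_action((1, 5), lam, mu, cost_fns, buffers, 0.95)
serve_when_q1_long = myopic_action((5, 1), lam, mu, cost_fns, buffers, 0.95)
```

Because only the service term depends on the action, this rule serves the nonempty queue with the largest µ_i [c_i(x_i) − c_i(x_i − 1)], which reduces to the cµ-rule when the holding costs are linear.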

Research Questions

1. Applications to other systems?

2. Performance guarantees for one-step improvement?

3. Other functions to use in one-step improvement?

4. Conditions under which one-step improvement is practical?
