Near-Optimal Control of Queueing Systems via Approximate One...

Near-Optimal Control of Queueing Systemsvia

Approximate One-Step Policy Improvement

Jefferson Huang

March 21, 2018

“Reinforcement Learning for Processing Networks” Seminar

Cornell University

Performance Evaluation and Optimization

∃ various (approximate) methods for evaluating a fixed policy for an MDP.

I evaluate = compute value function

I methods include LSTD, TD(λ), . . .

Policy Improvement: If vπ is the exact value function for the policy π, then apolicy π+ that is provably at least as good is given by:

π+(x) ∈ arg mina∈A(x)

{c(x , a) + δE[vπ(X ) | X ∼ p(·|x , a)]} , x ∈ X (1)

I Discounted Costs: δ ∈ [0, 1), vπ gives expected discounted total costfrom each state under π

I Average Costs1: δ = 1, vπ is the “relative value function” under π

Policy Iteration is based on this idea.

1when the MDP is unichain (e.g., ergodic under every policy)

1/27

Approximate Policy Improvement

If vπ is replaced with an approximation v̂π, then the “improved policy”π+ where

π+(x) ∈ arg mina∈A(x)

{c(x , a) + δE[v̂π(X ) | X ∼ p(·|x , a)]} , x ∈ X (2)

is not necessarily better than π.

Questions:

1. When is (2) computationally tractable?

2. When is π+ close to being optimal?

Our focus is on MDPs modeling queueing systems.

2/27

Outline

Part 1: Using Analytically Tractable Policies2

I Average Costs

Part 2: Using Simulation and Interpolation3

I Average Costs

Part 3: Using Lagrangian Relaxations4

I Discounted Costs

2Bhulai, S. (2017). Value Function Approximation in Complex Queueing Systems. In Markov Decision

Processes in Practice (pp. 33-62). Springer.3

James, T., Glazebrook, K., & Lin, K. (2016). Developing effective service policies for multiclass queues withabandonment: asymptotic optimality and approximate policy improvement. INFORMS Journal on Computing,28(2), 251-264.

4Brown, D. B., & Haugh, M. B. (2017). Information relaxation bounds for infinite horizon Markov decision

processes. Operations Research, 65(5), 1355-1379.3/27

Part 1

Using Analytically Tractable Policies

Analytically Tractable Policies Simulation & Interpolation Lagrangian Relaxation Research Questions 4/27

Analytically Tractable Queueing Systems

Idea:

1. Start with systems whose Poisson equation is analytically solvable.

2. Use them to suggest analytically tractable policies for more complexsystems.

Examples: (Bhulai 2017)

I M/Cox(r)/1 queue

I M/M/s queue

I M/M/s/s blocking system

I priority queue


The Poisson Equation

Let π be a policy (e.g., a fixed admission rule, a fixed priority rule).

In general, the Poisson equation looks like this:

g + h(x) = c(x ,π(x)) +∑y∈X

p(y |x ,π(x))h(y), x ∈ X.

We want to solve for the average cost g and the relative value function h(x) ofπ.

The Poisson equation is also called the “evaluation equation”.

I e.g., Puterman, M. L. (2005). Markov Decision Processes: DiscreteStochastic Dynamic Programming. John Wiley & Sons.


The Poisson Equation for Queueing Systems

For queueing systems, the Poisson equation is often a linear differenceequation.

I See e.g., Mickens, R. (1991). Difference Equations: Theory andApplications. CRC Press.

Example: Poisson equation for a uniformized M/M/1 queue with λ+ µ < 1and linear holding cost rate c:

g + h(x) = cx + λh(x + 1) + µh(x − 1) + (1 − λ− µ)h(x), x ∈ {1, 2, . . . },

g + h(0) = λh(1) + (1 − λ)h(0).

This is a “second-order” difference equation.


Linear Difference Equations

Theorem (Bhulai 2017, Theorem 2.1)

Suppose f : {0, 1, . . . }→ R satisfies

f (x + 1) + α(x)f (x) + β(x)f (x − 1) = q(x), x > 1

where β(x) 6= 0 for all x > 1.

If fh : {0, 1, . . . }→ R is a “homogeneous solution”, i.e.,

fh(x + 1) + α(x)fh(x) + β(x)fh(x − 1) = 0, x > 1,

then, letting the empty product be equal to one,

f (x)

fh(x)=

f (0)

fh(0)+

(f (1)

fh(1)−

f (0)

fh(0)

) x∑i=1

i∏j=1

β(j)fh(j − 1)

fh(j + 1)

+

x∑i=1

i∏j=1

β(j)fh(j − 1)

fh(j + 1)

i−1∑j=1

q(j)

fh(j + 1)∏j+1

k=1β(k)fh(k−1)

fh(k+1)


Application: M/M/1 Queue

Rewrite the Poisson equation in the form of the Theorem:

h(x + 1) +

(−λ+ µ

λ

)︸︷︷︸

α(x)

h(x) +(µλ

)︸︷︷︸β(x)

h(x − 1) =g − cx

λ︸︷︷︸q(x)

, x > 1.

Note that fh ≡ 1 works as the “homogeneous solution”.

We also know that, for an M/M/1 queue with linear holding cost rate c,

g =cλ

µ− λ.

So, according to the Theorem,

h(x) =g

λ

x∑i=1

(µλ

)i

+

x∑i=1

(µλ

)ii−1∑j=1

(λ

µ

)j+1 (g − cj

λ

)=

cx(x + 1)

2(µ− λ).


Other Analytically Tractable Systems

Relative value functions for the following systems are presented in (Bhulai2017):

1. M/Cox(r)/1 Queue

I special cases: hyperexponential, hypoexponential, Erlang, andexponential service times

2. M/M/s Queue

I with infinite buffer, with no buffer (blocking system)

3. 2-class M/M/1 Priority Queue


Application to Analytically Intractable Systems

Idea:

1. Pick an initial policy whose relative value function can be written interms of the relative value functions of simpler systems.

2. Do one-step policy improvement using that policy.

In (Bhulai 2017), this is applied to the following problems:

1. Routing Poisson arrivals to two different M/Cox(r)/1 queues.

I Initial Policy: Bernoulli routingI Uses relative value function of M/Cox(r)/1 queue

2. Routing in a Multi-Skill Call Center

I Initial Policy: Static randomized policy that tries to route calls toagents with the fewest skills first

I Uses relative value function of M/M/s queue

3. Controlled Polling System with Switching Costs

I Initial Policy: cµ-ruleI Uses relative value function of priority queue


Controlled Polling System with Switching Costs

Two queues with independent Poisson arrivals at rates λ1, λ2, exponentialservice times with rates µ1,µ2 and holding cost rates c1, c2, respectively.

If queue i is currently being served, switching to queue j 6= i costs si , i = 1, 2.

Problem: Dynamically assign the server to one of the two queues, so that theaverage cost incurred is minimized.

Do one-step policy improvement on the cµ-rule.

Results for λ1 = λ2 = 1, µ1 = 6, µ2 = 3, c1 = 2, c2 = 1, s1 = s2 = 2:

Policy Average Cost

cµ-Rule 3.62894One-Step Improvement 3.09895

Optimal Policy 3.09261


Part 2

Using Simulation and Interpolation


Scheduling a Multiclass Queue with Abandonments

k queues with independent Poisson arrivals at rates λ1, . . . , λk , exponentialservice times with rates µ1, . . . ,µk , and holding cost rates c1, . . . , ck ,respectively.

Each customer in queue i = 1, . . . , k remains available for service for anexponentially distributed amount of time, with rate θi .

Each service completion from queue i = 1, . . . , k earns Ri ; each abandonmentfrom queue i costs Di .

Problem: Dynamically assign the server to one of the k queues, so that theaverage cost incurred is minimized.


Relative Value Function

Let π be a policy, and select any reference state

xr ∈ X = {(i1, . . . , ik) ∈ {0, 1, . . . }k }.

gπ = average cost incurred under π

rπ(x) = expected total cost to reach xr under π, starting from state x

tπ(x) = expected time to reach xr under π, starting from state x

Then the relative value function is

hπ(x) = rπ(x) − gπtπ(x), x ∈ X.


Approximate Policy Improvement

Exact DP is infeasible for k > 3 classes.

(James, Glazebrook, Lin 2016) propose an approximate policy improvementalgorithm.

Idea: Given a policy π, approximate its relative value function hπ as follows:

1. Simulate π to estimate its average cost gπ and the long-run frequencywith which each state is visited.

2. Based on Step 1, select a set of initial states from which the relativevalue under π is estimated via simulation.

3. Estimate the relative value function by interpolating between the valuesestimated in Step 2.

4. Do policy improvement using the estimated relative value function.


Selecting States (Step 2)

S = set of initial states selected in Step 2, from which the relative value isestimated via simulation

S = Sanchor + Ssupport,

where

1. Sanchor = set of most frequently visited states (based on Step 1)

2. Ssupport = set of regularly spaced states

Parameters: How many states to include in Sanchor and Ssupport.


Interpolation (Step 3)

Use an (augmented) radial basis function

hπ(x) ≈n∑

i=1

αiφ(‖x − xi‖) +d∑

j=1

βjpj (x)

where

I n = number of selected states in Step 2

I xi = i th selected state in Step 2

I φ(r) = r 2 log(r) (thin plate spline)

I ‖ · ‖ = Euclidean norm

I d = k + 1

I p1(x) = 1, pj (x) = number of customers in queue j − 1


Computing the Interpolation Parameters

xi = i th selected state in Step 2

fi = estimated relative value starting from xi

Aij = φ(‖xi − xj‖) for i , j = 1, . . . n

Pij = pj (xi ) for i = 1, . . . , n and j = 1, . . . , k + 1

Solve the following linear system of equations:

Aα+ Pβ = f

PTα = 0


Example: Interpolation

2 classes, initial policy is the “Rµθ-rule”

m = number of replications for each selected state

Figure 1 in (James, Glazebrook, Lin 2016)


Example: Approximate Policy Improvement

Same problem

API = policy from one-step policy improvement

Figure 2 in (James, Glazebrook, Lin 2016)


Suboptimality of Heuristics

API(π) = one-step approximate policy improvement applied to π

k = 3 classes, ρ =∑k

i=1(λi/µi )

Figure 3 in (James, Glazebrook, Lin 2016); see the paper details on this andother numerical studies.


Part 3

Using Lagrangian Relaxations


Multiclass Queue with Convex Holding Costs

k queues with independent Poisson arrivals at rates λ1, . . . , λk and exponentialservice times with rates µ1, . . . ,µk , respectively.

If there are xi customers in queue i , the holding cost rate is ci (xi ) whereci : {0, 1, . . . }→ R is nonnegative, increasing, and convex.

Each queue i has a buffer of size Bi

Problem: Dynamically assign the server to one of the k queues, so that thediscounted cost incurred is minimized.


Relaxed Problem

The following relaxation is considered in (Brown, Haugh 2017):

I The queues are grouped into G groups.

I The server can serve at most one queue per group; can serve multiplegroups simulataneously.

I Penalty ` for serving multiple groups simultaneously.

Under this relaxation, the value function decouples across groups (α =discount factor):

v `(x) =(G − 1)`

1 − δ+∑g

v `g (xg )

v ` can be used in both one-step “policy improvement”, and to construct alower bound on the optimal cost via an information relaxation.


Suboptimality of Heuristics

Myopic: use one-step improvement with the value function vm(x) =∑

i ci (xi )

From (Brown, Haugh 2017); see the paper for details.


Research Questions

1. Applications to other systems?

2. Performance guarantees for one-step improvement?

3. Other functions to use in one-step improvement?

4. Conditions under which one-step improvement is practical?


Near-Optimal Control of Queueing Systems via Approximate One...

Documents

Transcript of Near-Optimal Control of Queueing Systems via Approximate One...