Overload Control for us-scale RPCs with Breakwater...DAGOR 25 Breakwater notifies rejected request...

Page 1:

Overload Control for μs-scale RPCs with Breakwater

Inho Cho, Ahmed Saeed, Joshua Fried, Seo Jin Park, Mohammad Alizadeh, Adam Belay

Page 2:

Trend: μs-scale RPCs

1. Fast Network: Network latency (~ 5 μs)

2. Fast Storage: M.2 NVMe SSD (~ 20 μs)

3. In-memory operations: Memcached, Redis, Ignite

Page 3:

Trend: High Fan-out

[Diagram labels: Internet, Web server (https://), Encryption, Cache, Storage, Remote memory]

Page 4:

Causes of Server Overload

- Load imbalance
- Unexpected user traffic
- Packet bursts
- Redirected traffic due to failure

Page 5:

Performance Without Overload Control

[Figure: Throughput (kreqs/s) and 50th and 99th percentile Latency (μs, log scale) vs. Clients' Demand (Mreqs/s), with the SLO marked]

Without overload control, server overload makes almost all requests violate the SLO.

Page 6:

Ideal Overload Control

[Diagram: Clients send Requests to the Server]

- The server should keep its request queue short, but not empty
- The server should inform clients about overload quickly

Page 7:

Strawman #1: Server-side AQM

[Diagram: Clients send Requests to the Server; the Server drops requests and sends a drop notification]

Page 9:

Strawman #1: Server-side AQM

The cost of packet processing is comparable to the service time.

[Diagram: at the Server, each Request incurs packet processing and request execution]

Page 10:

Strawman #2: Client Rate limiting

[Diagram: Clients send Requests to the Server]

Page 11:

Strawman #2: Client Rate limiting

Probing server status incurs high message overhead.

[Diagram: clients repeatedly ask the Server, "Hey server, are you busy?"]

Page 12:

Breakwater

Overload control scheme for μs-scale RPCs

Components                           Benefits
1. Credit-based admission control    Coordinates requests with minimum delay
2. Demand speculation                Minimizes message overhead
3. Delay-based AQM                   Ensures low tail latency

Page 13:

Breakwater’s benefits

Handles server overload for μs-scale RPCs with
(1) High throughput
(2) Low and bounded tail latency
(3) Scalability to a large number of clients

[Figure: Throughput (kreqs/s) and 99th percentile Latency (μs) vs. Clients' Demand (Mreqs/s), comparing Ideal and Breakwater, with the SLO marked]

Page 14:

Queueing delay as congestion signal

[Diagram: Clients and Server exchanging Request, Credit, and Response messages]

Page 15:

Comp. #1: Credit-based admission control

Breakwater controls the amount of incoming requests with credits.

For every RTT:
  If delay < target:
    credit += A
  Else:
    B = MAX(1 − β · (delay − target) / target, 0.5)
    credit ×= B

[Diagram: Clients and Server exchanging Request, Credit, and Response messages]
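The per-RTT rule on this slide can be sketched in Python as follows. This is a minimal illustration, not Breakwater's actual implementation: the function name, the argument names, and the example values for A and β are assumptions chosen for readability.

```python
def update_credits(credits: float, delay_us: float, target_us: float,
                   additive_step: float, beta: float) -> float:
    """Apply one per-RTT update to the server's credit pool.

    additive_step plays the role of A and beta the role of β on the slide;
    the slide does not give concrete values for either.
    """
    if delay_us < target_us:
        # Queueing delay is under the target: grow the pool additively.
        return credits + additive_step
    # Delay is at or over the target: shrink multiplicatively, in proportion
    # to the relative overshoot, but never by more than half in one step.
    overshoot = (delay_us - target_us) / target_us
    factor = max(1.0 - beta * overshoot, 0.5)
    return credits * factor


# Hypothetical values: target delay 80 μs, A = 4 credits, β = 0.2.
grown = update_credits(100.0, delay_us=50.0, target_us=80.0, additive_step=4.0, beta=0.2)
shrunk = update_credits(100.0, delay_us=120.0, target_us=80.0, additive_step=4.0, beta=0.2)
floored = update_credits(100.0, delay_us=1000.0, target_us=80.0, additive_step=4.0, beta=0.2)
```

The 0.5 floor bounds how sharply a single delay spike can cut the pool, so one bad measurement cannot collapse admission entirely.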

Page 16:

Comp. #1: Credit-based admission control

Breakwater controls the amount of incoming requests with credits.

[Diagram: Clients 1-4 and the Server exchanging Request, Credit, and Response messages; a client registers with the Server]

Page 18:

Comp. #1: Credit-based admission control

Breakwater controls the amount of incoming requests with credits.

[Diagram: Clients 1-4 and the Server exchanging Request, Credit, and Response messages; a client deregisters from the Server]

Page 19:

Demand Message Overhead

Server needs to know which client has demand.

[Diagram: Clients 1-4 and the Server exchanging Request, Credit, and Response messages]

Page 21:

Demand Message Overhead

Server needs to know which client has demand.

[Diagram: a client tells the Server, "I have n requests!"]

Page 22:

Impact of Credit-based Admission Control

[Figure: Throughput (kreqs/s) and 99th percentile Latency (μs) vs. Clients' Demand (Mreqs/s), comparing No overload control, Ideal, and credit, with the SLO marked]

Credit-based admission control has lower and bounded tail latency but lower throughput.

Page 23:

Piggybacking Demand Information

Breakwater piggybacks clients’ demand information onto requests.

[Diagram: a client tells the Server, "I have n requests!"]

Page 24:

Piggybacking Demand Information

Breakwater piggybacks clients’ demand information onto requests.

[Diagram: a client tells the Server, "I have n requests!"; later Requests carry "(I have n more requests)"]
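One way to picture piggybacking is a request header that carries the client's remaining demand, so no separate demand message is needed. The sketch below is illustrative only: the header layout, field names, and helper functions are hypothetical and do not reflect Breakwater's actual wire format.

```python
import struct
from dataclasses import dataclass

# Hypothetical header layout (an assumption, not Breakwater's format):
# 8-byte request id followed by a 4-byte count of requests still pending
# at the client, both in network byte order.
HEADER = struct.Struct("!QI")


@dataclass
class Request:
    request_id: int
    pending_demand: int  # "I have n more requests"
    payload: bytes


def encode(req: Request) -> bytes:
    """Serialize the request with its demand piggybacked in the header."""
    return HEADER.pack(req.request_id, req.pending_demand) + req.payload


def decode(data: bytes) -> Request:
    """Parse a request and recover the piggybacked demand."""
    request_id, pending_demand = HEADER.unpack_from(data)
    return Request(request_id, pending_demand, data[HEADER.size:])


wire = encode(Request(request_id=7, pending_demand=3, payload=b"get key"))
received = decode(wire)
```

The server reads `pending_demand` from every request it already has to process, so learning a client's demand costs a few header bytes rather than an extra message.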

Page 25:

Comp. #2: Demand Speculation

Breakwater speculates clients’ demand to minimize message overhead.

[Diagram: Clients and Server exchanging Request, Credit, and Response messages]

Page 28:

Impact of Adding Demand Speculation

[Figure: Throughput (kreqs/s) and 99th percentile Latency (μs) vs. Clients' Demand (Mreqs/s), comparing No overload control, Ideal, credit, and credit + demand spec., with the SLO marked]

Demand speculation improves throughput at the cost of higher tail latency.

Page 29:

Credit Overcommitment

Server issues more credit than the number of requests it can accommodate.

[Diagram: Clients and Server exchanging Request, Credit, and Response messages]

Page 30:

Incast Causing Long Queue

With credit overcommitment, multiple requests may arrive at the server at the same time.

[Diagram: Clients and Server exchanging Request, Credit, and Response messages]

Page 31:

Comp. #3: Delay-based AQM

To ensure low tail latency, the server drops requests if queueing delay exceeds a threshold.

[Diagram: Clients and Server exchanging Request, Credit, and Response messages; the Server drops a request]
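A delay-based AQM of the kind described on this slide can be sketched as a queue that timestamps requests on arrival and sheds any request that has waited longer than the threshold. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and checking the delay at dequeue time is an assumption, since the slide does not say where Breakwater performs the check.

```python
from collections import deque


class DelayAqmQueue:
    """FIFO request queue that sheds requests whose queueing delay is too high."""

    def __init__(self, threshold_us: float):
        self.threshold_us = threshold_us
        self._queue = deque()  # entries are (arrival_time_us, request)

    def enqueue(self, request, now_us: float) -> None:
        self._queue.append((now_us, request))

    def dequeue(self, now_us: float):
        """Return (next request to run or None, list of dropped requests)."""
        dropped = []
        while self._queue:
            arrived_us, request = self._queue.popleft()
            if now_us - arrived_us > self.threshold_us:
                # Waited longer than the threshold: drop it so the caller
                # can notify the client instead of serving it late.
                dropped.append(request)
                continue
            return request, dropped
        return None, dropped


queue = DelayAqmQueue(threshold_us=10.0)
queue.enqueue("a", now_us=0.0)
queue.enqueue("b", now_us=5.0)
queue.enqueue("c", now_us=12.0)
first = queue.dequeue(now_us=14.0)   # "a" waited 14 μs and is dropped; "b" runs
second = queue.dequeue(now_us=30.0)  # "c" waited 18 μs and is dropped
```

Returning the dropped requests to the caller leaves room for the rejection notice that the talk emphasizes, rather than discarding requests silently.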

Page 33:

Impact of Adding Delay-based AQM

[Figure: Throughput (kreqs/s) and 99th percentile Latency (μs) vs. Clients' Demand (Mreqs/s), comparing No overload control, Ideal, credit, credit + demand spec., and credit + demand spec. + AQM, with the SLO marked]

Breakwater achieves high throughput and low and bounded tail latency at the same time.

Page 34:

Evaluation

Testbed Setup
- xl170 in CloudLab
- 11 machines are connected to a single switch
- 10 client machines / 1 server machine
- Implementation on Shenango as an RPC layer

Synthetic Workload
- Clients generate requests with an open-loop Poisson process
- Requests spin-loop for a specified amount of time at the server
- Exponential service time distribution with 10 μs average
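The synthetic workload described above can be approximated with a short generator. This is a rough sketch, not the authors' load generator: the function name, parameters, and the example request rate are assumptions. Only the open-loop Poisson arrivals and the exponential service time with a 10 μs mean come from the slide.

```python
import random


def generate_workload(rate_rps: float, duration_s: float,
                      mean_service_us: float = 10.0, seed: int = 0):
    """Return a list of (arrival_time_s, service_time_us) pairs.

    Open loop: arrival times are drawn up front from a Poisson process and
    do not depend on when earlier requests complete.
    """
    rng = random.Random(seed)
    now_s = 0.0
    requests = []
    while True:
        # Exponential inter-arrival gaps yield a Poisson arrival process.
        now_s += rng.expovariate(rate_rps)
        if now_s >= duration_s:
            break
        # Time the server would spin for this request (exponential, mean 10 μs).
        service_us = rng.expovariate(1.0 / mean_service_us)
        requests.append((now_s, service_us))
    return requests


# Hypothetical load: 100,000 requests per second for 0.1 seconds.
workload = generate_workload(rate_rps=100_000, duration_s=0.1, seed=1)
mean_service = sum(s for _, s in workload) / len(workload)
```

Because the generator is open loop, offered load keeps arriving at the configured rate even when the server falls behind, which is what makes overload observable in the experiments.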

Page 35:

Evaluation

(1) Does Breakwater achieve high throughput and low tail latency even with demand spikes?

(2) Does Breakwater provide fast notification for rejected requests?

(3) Is Breakwater scalable to many clients?

Baselines:
- DAGOR: priority-based overload control used in WeChat
- SEDA: adaptive overload control for staged event-driven architecture

Page 36:

High Goodput with Fast Convergence

[Figure: Demand (kreqs/s) relative to Capacity, and Goodput (kreqs/s), vs. Time (seconds), comparing Breakwater, DAGOR, SEDA, Load, and Ideal; annotations: 0.5 s, 1.6 s]

Breakwater converges to higher goodput 25x faster than DAGOR and 79x faster than SEDA.

Page 37:

Low and Bounded Tail Latency

[Figure: Demand (kreqs/s) relative to Capacity, and 99th percentile Latency (ms, log scale), vs. Time (seconds), comparing Breakwater, DAGOR, and SEDA, with the SLO marked]

Breakwater maintains low tail latency even with load spikes.

Page 38:

Fast Notification of Reject

[Figure: Demand (kreqs/s) relative to Capacity, and Mean Reject Delay (ms, log scale), vs. Time (seconds), comparing Breakwater and DAGOR, with the SLO marked]

Breakwater notifies clients of rejected requests before the SLO is violated.

Page 39:

Scalability to Many Clients

[Figure: Goodput (kreqs/s) vs. the number of clients (100, 1k, 10k), comparing Breakwater, DAGOR, and SEDA; annotation: +10%]

Breakwater easily scales to 10,000 clients.

Page 40:

Conclusion

• Breakwater is a server-driven credit-based overload control system for μs-scale RPCs

• Breakwater’s key components include
(1) Credit-based admission control
(2) Demand speculation
(3) Delay-based AQM

• Our evaluation shows that Breakwater achieves
(1) Low & bounded tail latency with high throughput
(2) Fast notification for a rejected request
(3) Scalability to many clients

Page 41:

Thank you!

Questions?

Inho Cho <[email protected]>

Breakwater is available at inhocho89.github.io/breakwater/