R2P2: Making RPCs first-class datacenter citizens
Marios Kogias <[email protected]>
Datacenter Communication
• Infrastructure:
• Applications:
  • Data-stores, search, etc.
  • Complex fan-in/fan-out patterns
  • Tight tail-latency SLOs
  • Service time variability
  • µs-scale Remote Procedure Calls
[Figure: load balancer feeding a fan-in/fan-out tree of root and leaf nodes]
Q: What is a typical RPC stack?
Q: Identify the layers involved
Q: What is a latency SLO?
[Figure: RPC stack layers — Application, RPC, Transport — with queued requests P1, P2, P3]
Problems:
1. Ordering and head-of-line blocking
  • TCP imposes ordering of requests, but RPCs are independent
2. Deep packet inspection and connection termination, e.g. in an L7 LB
Outline
• R2P2, a transport protocol for RPCs that exposes the RPC abstraction to the network and enables in-network policy enforcement
• Use case: in-network RPC load balancing over R2P2
• Identify reusable system design principles
  • Suggested reading: Hints for Computer System Design
The R2P2 abstraction:
• Independent RR (request-response) pairs: not connections, not messages
• No protocol-enforced ordering
• No fate sharing: lost packets only affect the equivalent RR pair
• Per-RPC decisions: timeout, at-least-once / at-most-once
(A minimal client sketch follows.)
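To make the abstraction concrete, here is a minimal sketch of issuing one RPC as an independent request-response pair over UDP. The 5-byte header, the `call()` helper, and the router address are illustrative assumptions, not the real R2P2 wire format; the sketch only shows per-RPC timeouts and the fact that the reply may arrive from a host other than the one the request was sent to.

```python
# Sketch: one RPC as an independent request-response pair over UDP.
import socket
import struct

ROUTER = ("192.0.2.1", 5000)  # assumed router address (documentation IP)

def call(rpc_id: int, payload: bytes, timeout_s: float = 0.1) -> bytes:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout_s)  # per-RPC timeout, not per-connection state
    # Illustrative header: message type (0 = REQ0) plus an RPC id, so the
    # reply can be matched even if it arrives from a different host.
    sock.sendto(struct.pack("!BI", 0, rpc_id) + payload, ROUTER)
    try:
        data, _source = sock.recvfrom(65535)  # reply source may differ from
                                              # the request destination
        return data[5:]                       # strip the 5-byte header
    except socket.timeout:
        # A lost packet affects only this request-response pair; the
        # caller decides whether to retry (at-least-once) or not.
        raise TimeoutError(f"RPC {rpc_id} timed out")
    finally:
        sock.close()
```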
Paradigm Mismatch
Multiplexing independent RPCs over a reliable byte stream, e.g. TCP:
• TCP imposes ordering of requests, but RPCs are independent (illustrated below)
• Lost packets can affect several requests
• Deep packet inspection and connection termination (L7 LB)

R2P2's approach:
• Break point-to-point RPC semantics: request destination != reply source
• Per-request policy enforcement
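A toy illustration of the ordering problem, with made-up delay numbers: over one ordered byte stream, a delayed packet for P1 holds back the unrelated P2 and P3, while independent request-response pairs are delivered as soon as they arrive.

```python
# Toy model of head-of-line blocking: (request, network delay in ms).
in_flight = [("P1", 120.0), ("P2", 1.0), ("P3", 1.0)]

# TCP-like ordered delivery: each request waits for all earlier bytes.
done = 0.0
for req, delay in in_flight:
    done = max(done, delay)  # cannot be delivered before its predecessors
    print(f"{req} delivered at {done:.0f} ms over one TCP stream")

# Independent request-response pairs: each delivered on its own.
for req, delay in in_flight:
    print(f"{req} delivered at {delay:.0f} ms as an independent RR pair")
```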
[Figure: clients dispatching RPCs directly to servers S1 … SN]
Hint: Separate normal and worst case
RPC Policies Explored
• Network-based RPC load balancing [ATC 2019]
• Target selection based on request type [ATC 2019]
• And more to discuss later…
[Figure: clients sending requests through a middlebox to servers S1 … SN]
L7 load balancers:
• Terminate client connections
• Open other connections to the servers
Setup:
• 4 servers × 16 threads
• HTTP-based RPCs
• Service time: exponential distribution, mean 25 µs
• Max throughput: 2.56 MRPS
• NGINX with Join-Shortest-Queue (JSQ)

[Figure: 99th-percentile latency (µs, 0–500) vs. load for NGINX-JSQ]

L7 load balancers suffer from the paradigm mismatch and become I/O bottlenecks.
In-network Request-level Load Balancing
• Software DPDK R2P2 router: 5 µs latency overhead, IOPS-bottlenecked with 2 cores
• P4 dataplane on Barefoot Tofino: 1 µs latency overhead

Setup:
• 4 servers × 16 threads
• Service time: exponential distribution, mean 25 µs
• Max throughput: 2.56 MRPS

[Figure: 99th-percentile latency (µs, 0–500) vs. load (0–2.5 MRPS) for NGINX-JSQ and RANDOM]
Request-level Load Balancing
• Equivalent to L4 load balancing
• Could be implemented as L7 load balancing

How can we implement RPC load balancing with single-queue performance across multiple servers, while achieving high throughput and low latency?
Join-Bounded-Shortest-Queue, JBSQ(n)
• Split-queue model: one central "unbounded" queue and several distributed bounded queues
• Delays the scheduling decision for better placement
• Trade-off between throughput and tail latency:
  • High n can lead to bad placement (hurts tail latency)
  • Small n exposes the communication overhead (hurts throughput)
(A sketch of the dispatch rule follows the hint below.)

Always think about the trade-offs
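A minimal sketch of the JBSQ(n) dispatch rule as a toy in-memory model: a central unbounded queue feeds per-server queues bounded to n outstanding requests each. Class and method names are illustrative; this is not the R2P2 router code.

```python
# Toy model of the JBSQ(n) split-queue policy.
from collections import deque

class JBSQDispatcher:
    def __init__(self, num_servers: int, n: int):
        self.n = n                            # per-server bound
        self.central = deque()                # central "unbounded" queue
        self.outstanding = [0] * num_servers  # in-flight requests per server

    def submit(self, request):
        """A new request arrives: queue it centrally, then try to dispatch."""
        self.central.append(request)
        return self._drain()

    def on_reply(self, server: int):
        """A server completed a request, freeing one bounded slot."""
        self.outstanding[server] -= 1
        return self._drain()

    def _drain(self):
        """Dispatch to the shortest bounded queue while one is below n."""
        dispatched = []
        while self.central:
            server = min(range(len(self.outstanding)),
                         key=lambda s: self.outstanding[s])
            if self.outstanding[server] >= self.n:
                break  # all bounded queues are full; requests wait centrally
            self.outstanding[server] += 1
            dispatched.append((self.central.popleft(), server))
        return dispatched
```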
JBSQ RPC Load Balancing on R2P2
• Central queue of REQ0s in the middlebox
• Middlebox tracks outstanding requests per server, updated as each RPC completes
(See the usage sketch below.)
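Continuing the toy model above, the middlebox loop could drive the dispatcher like this. This is a usage sketch, not the actual middlebox; n=3 is an arbitrary example bound.

```python
# Toy middlebox loop over the JBSQDispatcher sketched above.
lb = JBSQDispatcher(num_servers=4, n=3)
print(lb.submit("REQ0-A"))  # [('REQ0-A', 0)] -> dispatched immediately
print(lb.submit("REQ0-B"))  # [('REQ0-B', 1)] -> next-shortest queue
print(lb.on_reply(0))       # []              -> central queue is empty
```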
JBSQ Evaluation
Setup:
• 4 servers (DPDK) × 16 threads
• Service time: exponential distribution, mean 10 µs
• 4-byte packets over R2P2

[Figure: 99th-percentile latency (µs, 0–150) vs. load for the JBSQ variants]
• More efficient HW implementation
Alternative Policies
[Figure: R2P2 header format, with fields including Header Size and PacketId/Packet Count]
• Alternative policies: STICKY, HASH, etc. (see the header sketch below)
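A hedged sketch of what carrying a per-request policy in the header could look like. Field widths, the rpc_id field, and the policy values are assumptions for illustration; only the Header Size and PacketId/Packet Count fields come from the slides.

```python
# Illustrative R2P2-style header with a per-request policy field.
import struct

# Policies the router could enforce per request (values are made up).
POLICY_JBSQ, POLICY_RANDOM, POLICY_STICKY, POLICY_HASH = range(4)

# hdr_size, policy, rpc_id, packet_id, packet_count (widths assumed)
HEADER_FMT = "!BBHHH"

def pack_header(policy: int, rpc_id: int,
                packet_id: int, packet_count: int) -> bytes:
    hdr_size = struct.calcsize(HEADER_FMT)
    return struct.pack(HEADER_FMT, hdr_size, policy,
                       rpc_id, packet_id, packet_count)

def unpack_header(data: bytes):
    hdr_size, policy, rpc_id, pkt_id, pkt_cnt = struct.unpack_from(
        HEADER_FMT, data)
    return hdr_size, policy, rpc_id, pkt_id, pkt_cnt, data[hdr_size:]
```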
Redis
• KV-store
• Master-slave replication

Setup:
• 3+1 DPDK servers
• USR Facebook workload
• Baseline: Linux TCP

[Figure: 99th-percentile latency (µs, 0–300) vs. load (0–2 MRPS) for TCP-DIRECT, RANDOM, and SW-JBSQ(20)]
Observations:
• R2P2 and DPDK increase throughput
• Scheduling benefits become more significant as service time variability increases
[Figure: 99th-percentile latency (µs, 0–150) vs. load (0–1.6 MRPS); service time: exponential distribution, mean 10 µs; 64-byte packets]
Lessons Learnt from R2P2
1. Pushing functionality into the network is a viable option
2. Programmable switches can undertake some of this functionality
3. Adding network hops for better scheduling can improve performance
Design Points to Remember
1. Try to place functionality in the right layer
  • Can you think of alternative RPC policies or functionality that could be implemented with this new abstraction?
2. Separate the normal and the worst case
  • Mention other use cases of this hint
3. Leave it to the client
  • Mention other use cases of this hint
Conclusion
• R2P2: a transport protocol for RPCs
  • Exposes the RPC abstraction to the network
  • Enables in-network policy enforcement
• In-network RPC load balancing
  • Software/hardware middlebox
  • JBSQ scheduling policy
• Extensible in-network policies