μTune: Auto-Tuned Threading for OLDI Microservices
Akshitha Sriraman, Thomas F. Wenisch (University of Michigan)
OSDI 2018, 10/08/2018
OLDI: From Monoliths to Microservices

• A monolithic service (> 100 ms SLO) is scaled out into RPC-connected microservices (sub-ms SLO)
• [Figure: a 20 μs system overhead is negligible for a 300 ms monolith request (300.02 ms), but adds 20% to a 100 μs microservice request (120 μs)]
• Sub-ms-scale system overheads must be characterized for microservices
Impact of Threading on Microservices

• Our focus: sub-ms overheads due to threading design
• The threading model that exhibits the best tail latency is a function of load
• [Figure: polling wins at low load by avoiding thread wakeups; blocking wins at high load by avoiding lock contention and spurious context switches]
Contributions

• A taxonomy of threading models
  – Structured understanding of threading implications
  – Reveals tail latency inflection points across load
  – The peak load-sustaining model is worse at low load
• μTune
  – Uses tail inflection insights to optimize tail latency
  – Tunes threading model & thread pool size across load
  – Simple interface: abstracts the threading model from RPC code
• Up to 1.9x tail latency improvement over a static throughput-optimized model
Outline

• Motivation
• A taxonomy of threading models
• μTune: simple interface design; automatic load adaptation system
• Evaluation
Mid-Tier Faces More Threading Overheads

• [Figure: front-end microserver → mid-tier microserver → leaf microservers 1, 2, ...]
• The mid-tier is subject to more threading overheads:
  – It manages RPC fan-out to many leaves
  – RPC layer interactions dominate its computation
• Threading overheads must be characterized for mid-tier microservices
A Taxonomy of Threading Models

Three binary design axes at the mid-tier's network socket (requests arrive from the front-end; responses return after leaf fan-out):
(1) Poll vs. Block
(2) In-line vs. Dispatch (a network poller hands requests to workers)
(3) Synchronous vs. Asynchronous

Eight resulting models:

                            Block   Poll
  Synchronous    In-line    SIB     SIP
                 Dispatch   SDB     SDP
  Asynchronous   In-line    AIB     AIP
                 Dispatch   ADB     ADP
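The eight models are simply the cross product of the three binary axes; a minimal sketch of how the taxonomy's names expand (illustrative only):

```python
from itertools import product

# The three binary axes of the taxonomy.
sync_axis = {"S": "Synchronous", "A": "Asynchronous"}
disp_axis = {"I": "In-line", "D": "Dispatch"}
wait_axis = {"B": "Block", "P": "Poll"}

# Cross product yields the eight models: SIB, SIP, SDB, SDP, AIB, AIP, ADB, ADP.
models = {
    s + d + w: f"{sync_axis[s]} {disp_axis[d]} {wait_axis[w]}"
    for s, d, w in product("SA", "ID", "BP")
}

print(models["SDB"])  # Synchronous Dispatch Block
```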
Latency Tradeoffs Across Threading Models

[Figure: HDSearch, synchronous models: 99th percentile tail latency (ms, 0-2) vs. load (10-10000 QPS, log scale); curves for In-line Block, In-line Poll, Dispatch Block, Dispatch Poll; saturation marked; a 1.4x gap at low load]

In-line Poll has the lowest low-load latency: it avoids thread wakeup delays
Latency Tradeoffs Across Threading Models (cont.)

[Same HDSearch figure, annotated across the load range:]
• At moderate load, In-line Poll faces contention; Dispatch Poll with one network poller is best
• At high load, Dispatch Block is best, as it does not waste CPU; polling models saturate (tail latency → ∞)
• No single threading model works best at all loads
Need for Automatic Load Adaptation: μTune

• Threading choice can significantly affect tail latency
• Threading latency trade-offs are not obvious
• Most software faces latency penalties due to static threading
• Opportunity: exploit trade-offs among threading models at run-time
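The poll-vs-block trade-off driving these curves can be sketched with a plain in-process queue (illustrative only; μTune itself operates on network sockets): a blocking receiver sleeps until woken, paying a wakeup delay at low load, while a polling receiver spins, avoiding that delay but burning CPU at high load.

```python
import queue

q = queue.Queue()

def recv_blocking():
    # Block: sleep until an item arrives; the thread wakeup sits on the critical path.
    return q.get(block=True)

def recv_polling():
    # Poll: spin on the queue, burning a core but skipping the wakeup delay.
    while True:
        try:
            return q.get_nowait()
        except queue.Empty:
            pass  # busy-wait: cheap at low load, wasteful at high load

q.put("req-1")
print(recv_polling())  # req-1
```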
μTune: Simple Interface

• Load adaptation: vary threading model & thread pool size at run-time
• Abstract threading-model boilerplate code away from RPC code
• [Figure: μTune's automatic load adaptation system sits between the application layer and the RPC/network layers]
• Simple interface: the developer defines only three functions covering the microservice's functionality: ProcessReq(), InvokeLeaf(), FinalizeResp()
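The three developer-supplied hooks might plug into the framework roughly like this. This is a hedged sketch: the real μTune is a C++/gRPC framework, and the `MidTierService` class and `handle` driver below are hypothetical names, with leaf calls stubbed out.

```python
class MidTierService:
    """Developer supplies only the three hooks; the framework owns all threading."""

    def ProcessReq(self, req):
        # Pre-process the front-end request (e.g., parse it and pick leaf shards).
        return {"query": req, "shards": [0, 1]}

    def InvokeLeaf(self, work):
        # Fan the request out to leaf microservers; stubbed here with strings.
        return [f"leaf{s}:{work['query']}" for s in work["shards"]]

    def FinalizeResp(self, leaf_results):
        # Merge leaf responses into the reply sent back to the front-end.
        return "|".join(leaf_results)

def handle(service, req):
    # The framework, not the service code, decides which threads run each stage.
    return service.FinalizeResp(service.InvokeLeaf(service.ProcessReq(req)))

print(handle(MidTierService(), "cat"))  # leaf0:cat|leaf1:cat
```

Because the service code never touches threads, the framework is free to run these stages in-line, dispatched, polling, or blocking without any change to the service.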
μTune: Goals & Challenges

• Simple interface between service code and μTune's threading framework
• Quick load change detection
• Fast threading model switches
• Scaling of thread pools
μTune System Design: Auto-Tuner

• Dynamically picks the threading model & thread pool sizes based on load
• Offline training: create a piecewise linear model mapping request rate to the best threading model (TM) and ideal thread pool sizes, e.g. for the synchronous models:

  Request rate       Best TM   Ideal no. of threads
  0 - 128 QPS        SIP       In-line: one
  . . .
  4096 - 8192 QPS    SDB       NW poller: one; workers: many (e.g. 50); resp. threads: many

  (For asynchronous μTune: AIP at low load; ADB at high load, with few workers (e.g. 4) and few response threads.)

• [Figure: diagrams of the eight threading models (SIB, SIP, SDB, SDP, AIB, AIP, ADB, ADP): per model, the front-end request arrives at the NW socket; either an in-line thread or a network thread (<block> or <poll>) with a task queue and notified workers handles it; the leaf performs the compute; asynchronous models add a separate response thread (<block> or <poll>) on the NW client socket]
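The trained table can be sketched as a piecewise lookup from request rate to a (model, pool sizes) pair. The thresholds and sizes below are illustrative placeholders, not the paper's trained values:

```python
import bisect

# (upper-bound QPS, model, thread-pool sizing) -- illustrative thresholds only.
TRAINED = [
    (128,  "AIP", {"inline": 1}),
    (1024, "ADP", {"nw_poller": 1, "workers": 2, "resp": 2}),
    (8192, "ADB", {"nw_poller": 1, "workers": 4, "resp": 4}),
]
BOUNDS = [row[0] for row in TRAINED]

def pick(qps):
    # Binary-search the piecewise table for the first range containing qps,
    # clamping to the last (highest-load) entry.
    i = min(bisect.bisect_left(BOUNDS, qps), len(TRAINED) - 1)
    return TRAINED[i][1], TRAINED[i][2]

print(pick(100))   # ('AIP', {'inline': 1})
print(pick(5000))  # ('ADB', {'nw_poller': 1, 'workers': 4, 'resp': 4})
```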
• Online, per request from the front-end (over gRPC):
  1. Record the request's arrival in a circular event buffer
  2. Compute the request rate
  3. Send the rate to the switching logic; switch to the best TM and thread pool sizes if needed
  4. ProcessRequest() → InvokeLeafAsync() → request to leaf
  5. Response from leaf → FinalizeResponse() → response to the front-end
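Request-rate detection via the circular event buffer can be sketched as a fixed-size ring of arrival timestamps: the rate is the number of inter-arrival gaps divided by the time span the buffer covers. The buffer size of 5 here is an arbitrary choice for the sketch.

```python
from collections import deque

class RateEstimator:
    def __init__(self, size=5):
        # Fixed-size ring of recent request-arrival timestamps;
        # deque(maxlen=...) drops the oldest entry on overflow.
        self.buf = deque(maxlen=size)

    def record(self, t):
        self.buf.append(t)

    def rate(self):
        # Requests per second over the window the buffer currently spans.
        if len(self.buf) < 2:
            return 0.0
        span = self.buf[-1] - self.buf[0]
        return (len(self.buf) - 1) / span if span > 0 else float("inf")

est = RateEstimator()
for t in (0.0, 0.25, 0.5, 0.75, 1.0):  # one request every 250 ms
    est.record(t)
print(est.rate())  # 4.0
```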
Experimental Setup

• μSuite [Sriraman '18] benchmark suite: a load generator, a mid-tier, and 4 or 16 leaf microservers
• Study μTune's adaptation in two load scenarios: steady-state and transient
Evaluation: μTune's Load Adaptation

[Figure: HDSearch, asynchronous models: 99th percentile tail latency (ms, 0-2) vs. load (20 QPS - 14K QPS); curves for In-line Poll, Dispatch Poll, Dispatch Block, Haque '15, Abdelzaher '99, and μTune; static models saturate (6.17 ms, then ∞) while μTune tracks the best model at every load, winning by up to 1.9x]

• Converges to the best threading model & pool sizes, improving tails by up to 1.9x
• <5% mean overhead
Conclusion

• Taxonomy of threading models: no single threading model has the best tail latency across all loads
• μTune: a threading model framework + load adaptation system
• Achieves up to 1.9x tail latency speedup over the best static model
μTune: Auto-Tuned Threading for OLDI Microservices
Akshitha Sriraman, Thomas F. Wenisch
University of Michigan
https://github.com/wenischlab/MicroTune
Poster number: 29