Using Load-Balancing To Build High-Performance Routers
Isaac Keslassy, Shang-Tse (Da) Chuang, Nick McKeown
Stanford University
Typical Router Architecture
[Figure: inputs and outputs, each at line rate R, connected through a switch fabric that is controlled by a scheduler.]
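Definitions: Traffic Matrix
Traffic matrix: $\Lambda = [\lambda_{ij}]$, for inputs $i = 1, \dots, N$ and outputs $j = 1, \dots, N$
Uniform traffic matrix: $\lambda_{ij} = \lambda$
[Figure: inputs i = 1…N and outputs j = 1…N, each at line rate R.]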
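Definitions: 100% Throughput
100% throughput: for any traffic matrix whose row and column sums are less than R, $\lambda_{ij} < \mu_{ij}$ for all $i, j$
[Figure: inputs i = 1…N and outputs j = 1…N, each at line rate R, with arrival rate $\lambda_{ij}$ and service rate $\mu_{ij}$ between input i and output j.]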
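As a minimal sketch of the condition on the traffic matrix, the following Python function checks whether every row sum and column sum stays below the line rate R. The function name and the two example matrices are illustrative assumptions, not part of the talk.

```python
def is_admissible(traffic, line_rate):
    """Return True if every row sum and column sum is strictly below line_rate."""
    n = len(traffic)
    row_sums = [sum(row) for row in traffic]
    col_sums = [sum(traffic[i][j] for i in range(n)) for j in range(n)]
    return all(s < line_rate for s in row_sums + col_sums)


R = 1.0
# Uniform matrix: every row and column sums to about 0.9, below R.
uniform = [[0.3] * 3 for _ in range(3)]
# Hot-spot matrix: two inputs overload output 0 (column sum 1.2 > R).
hot_spot = [[0.6, 0.0, 0.0],
            [0.6, 0.0, 0.0],
            [0.0, 0.0, 0.0]]

print(is_admissible(uniform, R))
print(is_admissible(hot_spot, R))
```

Only matrices that pass this check are covered by the 100% throughput definition; no switch can sustain the hot-spot matrix, because output 0 is oversubscribed.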
Router Wish List
Scale to High Linecard Speeds
- No Centralized Scheduler
- Optical Switch Fabric
- Low Packet-Processing Complexity
Scale to High Number of Linecards
- High Number of Linecards
- Arbitrary Arrangement of Linecards
Provide Performance Guarantees
- 100% Throughput Guarantee
- Delay Guarantee
- No Packet Reordering
Stanford 100Tb/s Router
“Optics in Routers” project: http://yuba.stanford.edu/or/
Some challenging numbers:
- 100Tb/s
- 160Gb/s linecards
- 640 linecards
100% Throughput in a Mesh Fabric
Router capacity = NR
Switch capacity = N²R
[Figure: inputs and outputs at line rate R connected by a full mesh; each mesh link runs at rate R.]
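To make the gap between NR and N²R concrete, here is a small arithmetic sketch using the numbers quoted for the Stanford 100Tb/s router (N = 640 linecards at R = 160Gb/s). The variable names are illustrative.

```python
N = 640          # number of linecards
R_GBPS = 160     # line rate per linecard, in Gb/s

router_capacity_gbps = N * R_GBPS        # N * R
mesh_capacity_gbps = N * N * R_GBPS      # N^2 * R, one rate-R link per input/output pair

print(router_capacity_gbps / 1000, "Tb/s of router capacity")
print(mesh_capacity_gbps / 1000, "Tb/s of mesh capacity")
```

Under these assumptions the router carries 102.4 Tb/s, while a mesh of rate-R links would need 65,536 Tb/s of internal capacity, a factor of N = 640 more.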
If Traffic Is Uniform
[Figure: inputs and outputs at line rate R connected by a mesh in which every link runs at rate R/N; each input at rate R sends R/N over each of its links.]
Real Traffic is Not Uniform
[Figure: the same mesh with links at rate R/N, now offered non-uniform traffic at rate R; whether it can carry the load is marked "?".]
Load-Balanced Switch
[Figure: two meshes in series between inputs and outputs at line rate R, a load-balancing stage followed by a forwarding stage; every mesh link runs at rate R/N.]
100% throughput for weakly mixing traffic (Valiant, C.-S. Chang)
Load-Balanced Switch
[Figure: animation in two frames showing packets labeled 1, 2 and 3 passing through the two meshes of the load-balanced switch.]
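Intuition: Proof of 100% Throughput
[Figure: the load-balanced switch with links at rate R/N; a is the traffic matrix at the inputs, b the traffic arriving at the second mesh, and C the capacity of the second mesh.]
Arrivals to second mesh: $b = \frac{1}{N} U a$, where $U = \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{bmatrix}$
Capacity of second mesh: $C = \frac{R}{N} U$
Second mesh: arrival rate < service rate: $C - b = \frac{1}{N}(R U - U a) > 0$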
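The intuition can be checked numerically. The sketch below assumes the first mesh spreads each input's traffic evenly over the N middle ports, so the rate from any middle port to output j is the j-th column sum of the traffic matrix divided by N. The function name and the example matrix are illustrative assumptions.

```python
def second_mesh_arrivals(a):
    """Rate from each middle port k to each output j after uniform spreading."""
    n = len(a)
    col_sums = [sum(a[i][j] for i in range(n)) for j in range(n)]
    # Every middle port sees the same 1/N share of each output's traffic.
    return [[col_sums[j] / n for j in range(n)] for _ in range(n)]


R = 1.0
# Non-uniform but admissible: every row and column sum is below R.
a = [[0.90, 0.00, 0.00],
     [0.00, 0.90, 0.00],
     [0.05, 0.05, 0.80]]
n = len(a)
link_rate = R / n                      # capacity of each mesh link

b = second_mesh_arrivals(a)
direct_mesh_ok = all(a[i][j] < link_rate for i in range(n) for j in range(n))
second_mesh_ok = all(b[k][j] < link_rate for k in range(n) for j in range(n))

print("single mesh with R/N links suffices:", direct_mesh_ok)
print("second mesh of load-balanced switch suffices:", second_mesh_ok)
```

For this matrix a single mesh with R/N links is overloaded (0.9 exceeds 1/3), while every link of the second mesh carries less than R/N.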
Packet Reordering
[Figure: the load-balanced switch with links at rate R/N, showing two packets labeled 1 and 2.]
Bounding Delay Difference Between Middle Ports
[Figure: the load-balanced switch with links at rate R/N, showing packets 1 and 2 at the middle ports; queue occupancy is measured in cells.]
UFS (Uniform Frame Spreading)
[Figure: the load-balanced switch with links at rate R/N, showing input queues holding packets 1, 2, 3 (a full frame), no packets, and packets 1, 2.]
FOFF (Full Ordered Frames First)
[Figure: the load-balanced switch with links at rate R/N, showing packets 1 and 2 at an input.]
Input Algorithm
- N FIFO queues corresponding to the N output flows
- Spread each flow uniformly: if last packet was sent to middle port k, send next to k+1.
- Every N time-slots, pick a flow:
  - If full frame exists, pick it and spread like UFS
  - Else if all frames are partial, pick one in round-robin order and send it
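The FOFF input algorithm can be sketched as follows. This is a hedged illustration, not the authors' implementation: the class name `FoffInput`, the 0-indexed ports, and the choice to serve full frames in round-robin order are assumptions made for the example.

```python
from collections import deque


class FoffInput:
    """One input linecard: N FIFO queues, one per output flow (hypothetical sketch)."""

    def __init__(self, n):
        self.n = n
        self.queues = [deque() for _ in range(n)]
        self.next_port = [0] * n      # next middle port for each flow
        self.rr_full = 0              # round-robin pointer over full frames (assumption)
        self.rr_partial = 0           # round-robin pointer over partial frames

    def enqueue(self, output, packet):
        self.queues[output].append(packet)

    def _pick(self, start, wanted):
        # First flow at or after `start` (cyclically) whose queue satisfies `wanted`.
        for offset in range(self.n):
            j = (start + offset) % self.n
            if wanted(self.queues[j]):
                return j
        return None

    def serve_frame(self):
        """Call once every N time-slots; returns [(middle_port, output, packet), ...]."""
        j = self._pick(self.rr_full, lambda q: len(q) >= self.n)
        if j is not None:
            self.rr_full = (j + 1) % self.n
            count = self.n                      # full frame: N packets
        else:
            j = self._pick(self.rr_partial, lambda q: len(q) > 0)
            if j is None:
                return []                       # nothing to send
            self.rr_partial = (j + 1) % self.n
            count = len(self.queues[j])         # partial frame: fewer than N packets
        sent = []
        for _ in range(count):
            port = self.next_port[j]
            sent.append((port, j, self.queues[j].popleft()))
            self.next_port[j] = (port + 1) % self.n   # sent to k, send next to k+1
        return sent


inp = FoffInput(3)
for p in ("a", "b"):
    inp.enqueue(0, p)                 # partial frame for output 0
for p in ("p1", "p2", "p3"):
    inp.enqueue(1, p)                 # full frame for output 1
first = inp.serve_frame()             # the full frame goes first
second = inp.serve_frame()            # then the partial frame
print(first)
print(second)
```

Because each flow remembers its last middle port, a partial frame followed by later packets of the same flow still visits the middle ports in cyclic order.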
Bounding Reordering
[Figure: the load-balanced switch with links at rate R/N, showing packets 1, 2, 3 and queues of up to N packets.]
FOFF
Output properties
- N FIFO queues corresponding to the N middle ports
- If there are N² packets, one of the head-of-line packets is in order and can depart
- Buffer size at most N² packets
[Figure: an output with N FIFO queues, one per middle port, holding packets from flows 1, 2 and 3.]
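One way to picture the output side is a resequencer that only inspects the head of each of its N FIFO queues. The sketch below assumes each packet carries its source input and a per-flow sequence number; that tagging, the class name, and the method names are assumptions for illustration and are not stated in the talk.

```python
from collections import deque


class ResequencingOutput:
    """One output linecard with a FIFO queue per middle port (hypothetical sketch)."""

    def __init__(self, n):
        self.queues = [deque() for _ in range(n)]
        self.expected = {}            # next sequence number expected from each input

    def receive(self, middle_port, source_input, seq, payload):
        self.queues[middle_port].append((source_input, seq, payload))

    def depart(self):
        """Release one head-of-line packet that is in order, or None if none is."""
        for q in self.queues:
            if q:
                source_input, seq, payload = q[0]
                if seq == self.expected.get(source_input, 0):
                    q.popleft()
                    self.expected[source_input] = seq + 1
                    return payload
        return None


out = ResequencingOutput(2)
out.receive(1, 0, 1, "second")        # arrives early through middle port 1
early = out.depart()                  # nothing is in order yet
out.receive(0, 0, 0, "first")         # arrives late through middle port 0
released = [out.depart(), out.depart()]
print(early, released)
```

Only head-of-line packets are examined, so the work per departure is bounded by the number of queues rather than by the buffer size.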
FOFF Properties
Property 1: FOFF maintains packet order.
Property 2: FOFF has O(1) complexity.
Property 3: Congestion buffers operate independently.
Property 4: FOFF keeps the average packet delay within a constant of that of an ideal output-queued router.
Corollary: FOFF has 100% throughput for any adversarial traffic.
Output-Queued Router?
[Figure: inputs and outputs at line rate R connected by a full mesh whose links each run at rate R.]
From Two Meshes to One Mesh
[Figure: the load-balanced switch with links at rate R/N; the In and Out ports of the same position are grouped as one linecard.]
[Figure: four linecards, each with an In and an Out port at rate R, connected to both the first mesh and the second mesh.]
[Figure: the two meshes merged into one combined mesh; each linecard attaches to it at rate 2R.]
Many Fabric Options
Options
- Space: Full uniform mesh
- Time: Round-robin crossbar
- Wavelength: Static WDM
[Figure: linecards, each with an In and an Out port, attached to "any spreading device" over N channels C1, C2, …, CN, each at rate 2R/N.]
AWGR (Arrayed Waveguide Grating Router)
A Passive Optical Component
- Wavelength i on input port j goes to output port (i+j-1) mod N
- Can shuffle information from different inputs
[Figure: an NxN AWGR between Linecards 1, 2, …, N on the input side and Linecards 1, 2, …, N on the output side; each input carries wavelengths 1, 2, …, N.]
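The AWGR routing rule can be tabulated directly. The sketch below takes wavelengths and ports to be numbered 1…N and maps a result of 0 to port N; that convention is an assumption, since the slide only gives the formula (i+j-1) mod N.

```python
def awgr_output_port(wavelength, input_port, n):
    """Output port for wavelength i on input port j, ports numbered 1..n (assumed)."""
    port = (wavelength + input_port - 1) % n
    return port if port != 0 else n   # assumption: a result of 0 denotes port n


N = 4
# table[j - 1][i - 1] is the output reached from input j on wavelength i.
table = [[awgr_output_port(i, j, N) for i in range(1, N + 1)]
         for j in range(1, N + 1)]
for j, row in enumerate(table, start=1):
    print("input", j, "->", row)
```

Each row is a permutation of 1…N, so every input reaches every output on exactly one wavelength, which is what lets a passive AWGR act as a uniform mesh.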
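Static WDM Switching: Packaging
[Figure: four linecards (A, B, C, D), each with an In and an Out port, connected through an AWGR, which is passive and consumes almost zero power. Each linecard sends on wavelengths A, B, C, D; after the AWGR the fibers carry A, A, A, A; B, B, B, B; C, C, C, C; and D, D, D, D.]
N WDM channels, each at rate 2R/N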
Scaling Problem
- For N < 64, an AWGR is a good solution.
- We want N = 640.
- Need to decompose.
A Different Representation of the Mesh
[Figure: four linecards, each with an In and an Out port at rate R, attached to the mesh at rate 2R; the same linecards are then redrawn on both sides of the mesh, joined by links at rate 2R/N.]
Example: N=8
[Figure: linecards 1 to 8 on each side, fully connected by links at rate 2R/8.]
When N is Too Large
Decompose into groups (or racks)
[Figure: linecards 1 to 8 at rate 2R each, arranged in groups; each group aggregates 4R and connects to the groups on the other side over links at rate 4R/4.]
[Figure: the general case with Group/Rack 1 to Group/Rack G, each holding linecards 1, 2, …, L at rate 2R; each group aggregates 2RL and connects to the groups on the other side over links at rate 2RL/G.]
When Linecards Fail
[Figure: Group/Rack 1 to Group/Rack G, each holding linecards 1, 2, …, L at rate 2R; each group aggregates 2RL and connects to the other side over links at rate 2RL/G.]
Solution: replace mesh with sum of permutations
[Figure: the mesh with links at rate 2RL/G drawn as a sum of permutations, each at rate 2RL/G, with G * 2RL/G ≤ 2RL.]
Hybrid Electro-Optical Architecture Using MEMS Switches
[Figure: Group/Rack 1 to Group/Rack G, each holding linecards 1, 2, …, L at rate 2R, interconnected through MEMS switches.]
When Linecards Fail
[Figure: the hybrid architecture, with Group/Rack 1 to Group/Rack G of linecards 1, 2, …, L at rate 2R interconnected through MEMS switches.]
Fiber Link Capacity
[Figure: Group/Rack 1 to Group/Rack G of linecards 1, 2, …, L at rate 2R interconnected through MEMS switches; each fiber link is fed by laser/modulators and a MUX.]
Link Capacity ≈ 64 λ's * 5 Gb/s/λ = 320 Gb/s = 2R
Number of MEMS Switches
Example: 2 Groups of 2 Linecards
[Figure: Group/Rack 1 and Group/Rack 2, each holding linecards 1 and 2 at rate 2R; each group aggregates 4R, which is carried between the groups over links at rate 2R.]
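Number of MEMS Switches
G groups, $L_i$ linecards in group $i$, $N = \sum_{i=1}^{G} L_i$, $L = \max_k L_k$
Theorem: $M \equiv L + G - 1$ MEMS switches are sufficient for bandwidth.
Examples:
- $N = 640,\ L = 16,\ G = 40 \Rightarrow M = 55$
- $L = G = \sqrt{N} \Rightarrow M \approx 2\sqrt{N}$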
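The bound is easy to evaluate for a given arrangement of linecards. The helper below is a sketch with a hypothetical name; it takes the number of linecards in each group and returns M = L + G - 1, with L the largest group size.

```python
def mems_switches_sufficient(group_sizes):
    """M = L + G - 1, where L is the largest group size and G the number of groups."""
    largest_group = max(group_sizes)   # L = max_k L_k
    groups = len(group_sizes)          # G
    return largest_group + groups - 1


balanced = [16] * 40                   # 40 racks of 16 linecards, N = 640
uneven = [16, 16, 4, 1]                # groups need not be the same size

print(sum(balanced), mems_switches_sufficient(balanced))
print(sum(uneven), mems_switches_sufficient(uneven))
```

The bound depends only on the largest group and the number of groups, so partially filled racks do not change M.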
Implementation of a 100Tb/s Load-Balanced Router
[Figure: Linecard Rack 1 to Linecard Rack G = 40, each holding L = 16 linecards at 160Gb/s, connected to a switch rack (< 100W) of 40 x 40 static MEMS switches numbered 1, 2, …, 55, 56.]
Packet Schedule
[Figure: Group A and Group B, each holding linecards 1 and 2 at rate 2R; each group aggregates 4R, which is carried between the groups over links at rate 2R.]
Rules for Packet Schedule
At each time-slot:
- Each transmitting linecard sends one packet
- Each receiving linecard receives one packet
- (MEMS constraint) Each transmitting group i sends at most one packet to each receiving group j through each MEMS connecting them
In a schedule of N time-slots:
- Each transmitting linecard sends exactly one packet to each receiving linecard
Packet Schedule
Tx Group  Tx LC  T+1  T+2  T+3  T+4
A         A1     ?    ?    ?    ?
A         A2     ?    ?    ?    ?
B         B1     ?    ?    ?    ?
B         B2     ?    ?    ?    ?
Packet Schedule
Tx Group  Tx LC  T+1  T+2  T+3  T+4
A         A1     A1   A2   B1   B2
A         A2     B2   A1   A2   B1
B         B1     B1   B2   A1   A2
B         B2     A2   B1   B2   A1
Bad Packet Schedule
Tx Group  Tx LC  T+1  T+2  T+3  T+4
A         A1     A1   A2   B1   B2
A         A2     B2   A1   A2   B1
B         B1     B1   B2   A1   A2
B         B2     A2   B1   B2   A1
Group Schedule
Tx Group  T+1  T+2  T+3  T+4
A         AB   AB   AB   AB
B         AB   AB   AB   AB
Good Packet Schedule
Tx Group  Tx LC  T+1  T+2  T+3  T+4
A         A1     A1   A2   B1   B2
A         A2     B2   B1   A2   A1
B         B1     B1   B2   A1   A2
B         B2     A2   A1   B2   B1
Theorem: There exists a polynomial-time algorithm that finds the correct packet schedule.
Verilog implementation < 50ms.
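The rules for the packet schedule can be checked mechanically. The sketch below validates the bad and good schedules shown for two groups of two linecards, assuming one MEMS connection per pair of groups per time-slot (consistent with the group schedule, where each group sends one packet to A and one to B in every slot). The function names and the dictionary encoding are illustrative assumptions; this is a checker, not the scheduling algorithm of the theorem.

```python
from collections import Counter


def group_of(linecard):
    """Group label of a linecard name, e.g. "A1" -> "A"."""
    return linecard[0]


def schedule_violations(schedule, links_per_group_pair=1):
    """Return a list of rule violations; an empty list means the schedule is valid.

    schedule maps each transmitting linecard to the receiving linecard it
    sends to in each time-slot.
    """
    linecards = sorted(schedule)
    n = len(linecards)
    violations = []
    # Over N time-slots, each transmitter sends exactly one packet to each receiver.
    for tx in linecards:
        if sorted(schedule[tx]) != linecards:
            violations.append(("coverage", tx))
    if violations:
        return violations
    for t in range(n):
        receivers = [schedule[tx][t] for tx in linecards]
        # Each receiving linecard receives one packet per time-slot.
        if len(set(receivers)) != n:
            violations.append(("receiver", t))
        # MEMS constraint: packets per (tx group, rx group) pair in this slot.
        pairs = Counter((group_of(tx), group_of(rx))
                        for tx, rx in zip(linecards, receivers))
        for pair, count in sorted(pairs.items()):
            if count > links_per_group_pair:
                violations.append(("mems", t, pair))
    return violations


BAD = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "A1", "A2", "B1"],
       "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "B1", "B2", "A1"]}
GOOD = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "B1", "A2", "A1"],
        "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "A1", "B2", "B1"]}

print(schedule_violations(BAD))
print(schedule_violations(GOOD))
```

The bad schedule satisfies the per-linecard rules but breaks the MEMS constraint at T+2 and T+4, where both linecards of a group send to the same receiving group; the good schedule passes all checks.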
Summary
The load-balanced switch
- Does not need any centralized scheduling
- Can use a mesh
Using FOFF
- It keeps packets in order
- It guarantees 100% throughput
Using the hybrid electro-optical architecture
- It scales to high port numbers
- It tolerates linecard failure
References
Initial Work
C.-S. Chang, D.-S. Lee and Y.-S. Jou, "Load Balanced Birkhoff-von Neumann Switches, part I: One-Stage Buffering," Computer Communications, Vol. 25, pp. 611-622, 2002.
Extensions
I. Keslassy, S.-T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N. McKeown, "Scaling Internet Routers Using Optics," ACM SIGCOMM '03, Karlsruhe, Germany, August 2003.
I. Keslassy, S.-T. Chuang and N. McKeown, "A Load-Balanced Switch with an Arbitrary Number of Linecards," IEEE Infocom '04, Hong Kong, March 2004.
Thank you.