Post on 24-Dec-2015

Using Load-Balancing To Build High-Performance Routers

Isaac Keslassy

Ph.D. Oral Examination

Department of Electrical Engineering

Stanford University

Typical Router Architecture

[Figure: inputs, each at line rate R, connected through a switch fabric, controlled by a scheduler, to outputs, each at line rate R.]

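Definitions: Traffic Matrix

Traffic matrix: $\Lambda = [\lambda_{ij}]$, where $\lambda_{ij}$ is the rate of traffic arriving at input $i$ and destined for output $j$, for $1 \le i, j \le N$.

Uniform traffic matrix: $\lambda_{ij} = \lambda$ for all $i, j$.

[Figure: N inputs (indexed by i) and N outputs (indexed by j), each at line rate R.]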

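Definitions: 100% Throughput

100% throughput: for any traffic matrix whose row and column sums are less than R, that is, $\sum_j \lambda_{ij} < R$ for every input $i$ and $\sum_i \lambda_{ij} < R$ for every output $j$, every flow is served faster than it arrives:

$$\lambda_{ij} < \mu_{ij},$$

where $\mu_{ij}$ is the service rate offered to traffic from input $i$ to output $j$.

[Figure: N inputs (indexed by i) and N outputs (indexed by j), each at line rate R.]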
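The two conditions in this definition can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function names `is_admissible` and `serves_all_flows`, the normalized line rate and the example matrices are illustrative assumptions.

```python
def is_admissible(traffic, line_rate):
    """True if every row sum and every column sum of the traffic matrix is below line_rate."""
    n = len(traffic)
    rows_ok = all(sum(row) < line_rate for row in traffic)
    cols_ok = all(sum(traffic[i][j] for i in range(n)) < line_rate for j in range(n))
    return rows_ok and cols_ok


def serves_all_flows(traffic, service):
    """True if lambda_ij < mu_ij for every input i and output j."""
    n = len(traffic)
    return all(traffic[i][j] < service[i][j] for i in range(n) for j in range(n))


R = 1.0  # line rate, normalized to 1 (assumption)
uniform = [[0.3] * 3 for _ in range(3)]   # lambda_ij = 0.3 for all i, j
hotspot = [[0.9, 0.0, 0.0],               # two inputs both target the first output
           [0.9, 0.0, 0.0],
           [0.0, 0.0, 0.0]]

print(is_admissible(uniform, R))   # True: every row and column sum stays below R
print(is_admissible(hotspot, R))   # False: the first column sums to 1.8
```

A switch has 100% throughput when `serves_all_flows` holds for every matrix that passes `is_admissible`, not just for one example.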

Router Wish List

Scale to High Linecard Speeds
- No Centralized Scheduler
- Optical Switch Fabric
- Low Packet-Processing Complexity

Scale to High Number of Linecards
- High Number of Linecards
- Arbitrary Arrangement of Linecards

Provide Performance Guarantees
- 100% Throughput Guarantee
- Delay Guarantee
- No Packet Reordering

Stanford 100Tb/s Router

“Optics in Routers” project: http://yuba.stanford.edu/or/

Some challenging numbers:
- 100 Tb/s
- 160 Gb/s linecards
- 640 linecards

100% Throughput in a Mesh Fabric

[Figure: a full mesh between inputs and outputs, each at line rate R. The rate needed on each mesh link is first marked "?" and then shown as R.]

Router capacity = $NR$

Switch capacity = $N^2 R$
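To get a feel for the gap between router capacity and mesh capacity, the two formulas can be evaluated with the N = 640 and R = 160 Gb/s figures quoted for the Stanford router. This is a small illustrative Python sketch; the function names are assumptions, and applying those figures to a single full-rate mesh is done here only for the sake of the arithmetic.

```python
def router_capacity(n, line_rate):
    """Aggregate line rate of n linecards: N * R."""
    return n * line_rate


def mesh_capacity(n, link_rate):
    """Aggregate rate of an n-by-n full mesh whose links each run at link_rate."""
    return n * n * link_rate


N = 640        # linecards
R_GBPS = 160   # line rate per linecard, in Gb/s

print(router_capacity(N, R_GBPS) / 1000)   # 102.4 (Tb/s)
print(mesh_capacity(N, R_GBPS) / 1000)     # 65536.0 (Tb/s), N times the router capacity
```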

If Traffic Is Uniform

[Figure: the same mesh between inputs and outputs at line rate R, with every mesh link now running at rate R/N.]

Real Traffic is Not Uniform

[Figure: the mesh with links at rate R/N, offered non-uniform traffic from inputs at line rate R; a question mark indicates that the R/N links may not be able to carry it.]

Load-Balanced Switch

[Figure: two meshes in series, each with links at rate R/N. The first mesh is the load-balancing stage and the second is the forwarding stage; inputs and outputs run at line rate R.]

100% throughput for weakly mixing traffic (Valiant, C.-S. Chang)

[Figure, two animation steps: packets labeled 1, 2 and 3 arrive at the inputs and are spread by the first mesh across the middle ports.]

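Intuition: 100% Throughput

[Figure: the two-stage switch, with a marking the traffic arriving at the inputs, b the traffic arriving at the second mesh, and C the capacity of the second mesh.]

Arrivals to second mesh:

$$b = \frac{1}{N}\, U a, \quad \text{where } U = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$

Here $a$ is the matrix of arrival rates at the inputs and $U$ is the all-ones matrix (shown for $N = 3$).

Capacity of second mesh:

$$C = \frac{R}{N}\, U$$

Second mesh: arrival rate < service rate:

$$C - b = \frac{1}{N}\,(R\,U - U a) > 0$$

Entry $(k, j)$ of $U a$ is the $j$-th column sum of $a$, which is less than $R$ for any admissible traffic, so the inequality holds entry by entry.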
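The intuition can be tried numerically. The sketch below is illustrative Python, not part of the slides; it reads a as the N×N matrix of arrival rates at the inputs (an assumption about the notation), computes b = (1/N) U a, and compares it with the R/N capacity of each link in the second mesh. The function names and example matrices are hypothetical.

```python
def second_mesh_arrivals(a):
    """b = (1/N) U a: entry (k, j) is the j-th column sum of a divided by N."""
    n = len(a)
    col_sums = [sum(a[i][j] for i in range(n)) for j in range(n)]
    return [[col_sums[j] / n for j in range(n)] for _ in range(n)]


def second_mesh_has_slack(a, line_rate):
    """True if every entry of b is below the link capacity R/N, i.e. C - b > 0."""
    n = len(a)
    link_capacity = line_rate / n
    b = second_mesh_arrivals(a)
    return all(b[k][j] < link_capacity for k in range(n) for j in range(n))


R = 1.0  # line rate, normalized to 1 (assumption)

# Admissible but far from uniform: each input sends everything to one output.
skewed = [[0.9, 0.0, 0.0],
          [0.0, 0.9, 0.0],
          [0.0, 0.0, 0.9]]

# Not admissible: the first output is offered 1.2 > R.
overloaded = [[0.6, 0.0, 0.0],
              [0.6, 0.0, 0.0],
              [0.0, 0.0, 0.0]]

print(second_mesh_has_slack(skewed, R))       # True
print(second_mesh_has_slack(overloaded, R))   # False
```

Note that the skewed matrix would overload a single mesh with R/N links (0.9 > 1/3 on three links), yet after spreading, every link of the second mesh carries only about 0.3.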

Packet Reordering

[Figure: the two-stage switch with packets 1 and 2 passing through different middle ports.]

Bounding Delay Difference Between Middle Ports

[Figure: the two-stage switch with packets 1 and 2 queued at different middle ports; queue lengths are counted in cells.]

UFS (Uniform Frame Spreading)

[Figure: the two-stage switch with packets 1, 2 and 3 queued at an input.]

FOFF (Full Ordered Frames First)

[Figure: the two-stage switch with packets 1 and 2 at an input.]

Input Algorithm
- N FIFO queues corresponding to the N output flows
- Spread each flow uniformly: if the last packet was sent to middle port k, send the next to k+1.
- Every N time-slots, pick a flow:
  - If a full frame exists, pick it and spread it like UFS
  - Else, if all frames are partial, pick one in round-robin order and send it

[Figure: an input with N FIFO queues, numbered 1, 2, 3, 4, …, N.]
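The input algorithm can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's implementation: the class name `FoffInput`, the 0-based port numbering, the use of two separate round-robin pointers (one for full frames, one for partial frames) and the choice to send a whole partial frame at once are all assumptions made to keep the example concrete.

```python
from collections import deque


class FoffInput:
    """Sketch of one FOFF input linecard with n per-output FIFO queues."""

    def __init__(self, n):
        self.n = n
        self.queues = [deque() for _ in range(n)]  # one FIFO per output flow
        self.next_port = [0] * n                   # per-flow pointer: next middle port to use
        self.full_rr = 0                           # round-robin pointer over full frames (assumption)
        self.partial_rr = 0                        # round-robin pointer over partial frames

    def enqueue(self, output, packet):
        self.queues[output].append(packet)

    def _pick(self, start, eligible):
        """First flow at or after `start`, cyclically, for which eligible(flow) is true."""
        for offset in range(self.n):
            flow = (start + offset) % self.n
            if eligible(flow):
                return flow
        return None

    def next_frame(self):
        """Call once every n time-slots; returns (flow, [(packet, middle_port), ...])."""
        flow = self._pick(self.full_rr, lambda f: len(self.queues[f]) >= self.n)
        if flow is not None:
            # A full frame exists: send exactly n packets, one to each middle port.
            self.full_rr = (flow + 1) % self.n
            count = self.n
        else:
            # All frames are partial: serve the next non-empty flow in round-robin order.
            flow = self._pick(self.partial_rr, lambda f: len(self.queues[f]) > 0)
            if flow is None:
                return None, []
            self.partial_rr = (flow + 1) % self.n
            count = len(self.queues[flow])
        sent = []
        for _ in range(count):
            port = self.next_port[flow]
            sent.append((self.queues[flow].popleft(), port))
            self.next_port[flow] = (port + 1) % self.n  # last sent to k, next goes to k+1
        return flow, sent


inp = FoffInput(3)
for name in ["a0", "a1", "a2", "a3"]:
    inp.enqueue(0, name)
inp.enqueue(1, "b0")

print(inp.next_frame())  # (0, [('a0', 0), ('a1', 1), ('a2', 2)])  full frame first
print(inp.next_frame())  # (0, [('a3', 0)])                        partial frame, round-robin
print(inp.next_frame())  # (1, [('b0', 0)])
```

Because each flow keeps its own pointer, the packets of a flow visit the middle ports in cyclic order regardless of how full and partial frames interleave.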

Bounding Reordering

[Figure: the two-stage switch with packets 1, 2 and 3 spread over the middle ports.]

FOFF

Output properties
- N FIFO queues corresponding to the N middle ports
- Buffer size less than $N^2$ packets
- If there are $N^2$ packets, one of the head-of-line packets is in order

[Figure: the N FIFO queues at an output, numbered 1, 2, 3, 4, …, N.]

FOFF Properties

Property 1: FOFF maintains packet order.

Property 2: FOFF has O(1) complexity.

Property 3: Congestion buffers operate independently.

Property 4: FOFF maintains an average packet delay within a constant of that of an ideal output-queued router.

Corollary: FOFF has 100% throughput for any adversarial traffic.

Output-Queued Router?

[Figure: a full mesh between inputs and outputs at line rate R, with each mesh link at rate R.]

From Two Meshes to One Mesh

[Figure: the two-stage switch with links at rate R/N; an input and an output that belong to the same physical linecard are marked "One linecard".]

[Figure: each linecard (In, Out) connects at rate R to the first mesh and at rate R to the second mesh.]

[Figure: the two meshes merged into one combined mesh, to which each linecard (In, Out) connects at rate 2R.]

Many Fabric Options

Options
- Space: Full uniform mesh
- Time: Round-robin crossbar
- Wavelength: Static WDM

[Figure: each linecard (In, Out) sends N channels C1, C2, …, CN, each at rate 2R/N, into any spreading device, which distributes the channels C1, C2, C3, …, CN among the linecards.]

AWGR (Arrayed Waveguide Grating Router)

A Passive Optical Component
- Wavelength i on input port j goes to output port (i+j-1) mod N
- Can shuffle information from different inputs

[Figure: Linecards 1, 2, …, N each send wavelengths λ1, λ2, …, λN into an N×N AWGR, whose output ports 1, 2, …, N feed Linecards 1, 2, …, N.]
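The routing rule is easy to tabulate. The following Python sketch is illustrative only; the function name `awgr_output_port` is hypothetical, and it assumes 1-indexed wavelengths and ports with a result of 0 from the modulo read as port N, since the slide does not say how that case is numbered.

```python
def awgr_output_port(wavelength, input_port, n):
    """Output port reached by `wavelength` entering on `input_port` (both 1-indexed)."""
    port = (wavelength + input_port - 1) % n
    return n if port == 0 else port  # assumption: residue 0 denotes port N


N = 4
for j in range(1, N + 1):
    row = [awgr_output_port(i, j, N) for i in range(1, N + 1)]
    print(f"input {j}: wavelengths 1..{N} -> outputs {row}")
# input 1: wavelengths 1..4 -> outputs [1, 2, 3, 4]
# input 2: wavelengths 1..4 -> outputs [2, 3, 4, 1]
# input 3: wavelengths 1..4 -> outputs [3, 4, 1, 2]
# input 4: wavelengths 1..4 -> outputs [4, 1, 2, 3]
```

Each input reaches every output on a distinct wavelength, and each output hears every input on a distinct wavelength, which is what lets the passive device act as a fixed uniform mesh.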

Static WDM Switching: Packaging

[Figure: four linecards (In, Out) connected through an AWGR, which is passive and consumes almost zero power. Fibers are labeled with the wavelengths they carry (A, B, C, D), and each linecard uses N WDM channels, each at rate 2R/N.]

Scaling Problem

- For N < 64, an AWGR is a good solution.
- We want N = 640.
- Need to decompose.

A Different Representation of the Mesh

[Figure: the mesh, to which each linecard (In, Out) connects at rate 2R, redrawn with the linecards shown on both sides of the mesh.]

[Figure: in this representation each linecard on one side is connected to each linecard on the other side by a link at rate 2R/N.]

Example: N=8

[Figure: linecards 1 to 8 on one side fully connected to linecards 1 to 8 on the other side by links at rate 2R/8.]

When N is Too Large

Decompose into groups (or racks)

[Figure: the example with linecards 1 to 8, each at rate 2R, arranged in groups; group aggregates are labeled 4R and links between groups 4R/4.]

[Figure: the general case with Group/Rack 1 to Group/Rack G, each holding linecards 1, 2, …, L at rate 2R. Each group aggregates rate 2RL, and each pair of groups is connected by a link at rate 2RL/G.]

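When Linecards Fail

[Figure: Group/Rack 1 to Group/Rack G, each holding linecards 1, 2, …, L at rate 2R, with aggregate rate 2RL per group and links at rate 2RL/G between groups; one link is marked 2RL.]

Solution: replace mesh with sum of permutations

[Figure: the mesh between groups, with links at rate 2RL/G, drawn as a sum of permutations, each with links at rate 2RL/G, so that $2RL = G \times 2RL/G$.]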
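One way to see that a uniform mesh between G groups is a sum of G permutations is to use cyclic shifts. The Python sketch below is illustrative; the slide does not say which permutations are used, so the cyclic-shift choice and the function names are assumptions.

```python
def cyclic_permutations(g):
    """Return g permutations; permutation s maps group i to group (i + s) % g."""
    return [[(i + s) % g for i in range(g)] for s in range(g)]


def sum_of_permutations(perms, rate):
    """Add up the rate matrices of the given permutations, each carrying `rate` per link."""
    g = len(perms[0])
    total = [[0] * g for _ in range(g)]
    for perm in perms:
        for src, dst in enumerate(perm):
            total[src][dst] += rate
    return total


G = 4
rate = 1  # stands for 2RL/G, in arbitrary units
mesh = sum_of_permutations(cyclic_permutations(G), rate)
for row in mesh:
    print(row)  # every row is [1, 1, 1, 1]: the uniform mesh at rate 2RL/G
```

Each group then sends a total of G times 2RL/G, that is 2RL, which matches the aggregate rate of a group.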

Hybrid Electro-Optical Architecture Using MEMS Switches

[Figure: Group/Rack 1 to Group/Rack G, each holding linecards 1, 2, …, L at rate 2R, with the groups on one side connected to the groups on the other side through MEMS switches.]

When Linecards Fail

[Figure: the same hybrid architecture with MEMS switches, shown for the case where linecards fail.]

Fiber Link Capacity

[Figure: the hybrid architecture with MEMS switches; each fiber is fed by lasers/modulators combined in a MUX.]

Link capacity ≈ 64 λ’s × 5 Gb/s per λ = 320 Gb/s = 2R

Example: 2 Groups of 2 Linecards

[Figure: Group/Rack 1 and Group/Rack 2, each with linecards 1 and 2 at rate 2R and an aggregate rate of 4R, shown on both sides; the groups are connected by links at rate 2R.]

Number of MEMS Switches

G groups, $L_i$ linecards in group $i$:

$$N = \sum_{i=1}^{G} L_i, \qquad L = \max_k L_k$$

Theorem: $M \equiv L + G - 1$ MEMS switches are sufficient for bandwidth.

Example: $N = 640$, $L = 16$, $G = 40$ gives $M = 55$.
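The theorem's count can be computed directly from the group sizes. This is a small illustrative Python sketch; the function name `mems_switches_sufficient` is hypothetical, and the uneven group sizes in the second call are made-up numbers used only to show that L is the largest group.

```python
def mems_switches_sufficient(group_sizes):
    """M = L + G - 1, with G the number of groups and L the largest group size."""
    g = len(group_sizes)
    largest = max(group_sizes)
    return largest + g - 1


# 40 groups of 16 linecards each: N = 640, L = 16, G = 40.
print(sum([16] * 40))                            # 640
print(mems_switches_sufficient([16] * 40))       # 55

# Uneven groups (hypothetical sizes): L is the maximum, here 16, and G = 3.
print(mems_switches_sufficient([16, 10, 4]))     # 18
```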

Packet Schedule

[Figure: Group A and Group B, each with linecards 1 and 2 at rate 2R and an aggregate rate of 4R, shown on both sides; the groups are connected by links at rate 2R.]

Rules for Packet Schedule

At each time-slot:
- Each transmitting linecard sends one packet
- Each receiving linecard receives one packet
- (MEMS constraint) Each transmitting group i sends at most one packet to each receiving group j through each MEMS connecting them

In a schedule of N time-slots:
- Each transmitting linecard sends exactly one packet to each receiving linecard

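Packet Schedule

[Table: an empty schedule with one row per transmitting linecard (Tx LC A1 and A2 in Tx Group A, B1 and B2 in Tx Group B) and one column per time-slot T+1 to T+4; every entry is "?".]

Bad Packet Schedule

Each entry is the receiving linecard that the transmitting linecard sends to in that time-slot.

Tx LC (group)   T+1  T+2  T+3  T+4
A1 (A)          A1   A2   B1   B2
A2 (A)          B2   A1   A2   B1
B1 (B)          B1   B2   A1   A2
B2 (B)          A2   B1   B2   A1

This schedule is bad because at T+2 and T+4 both linecards of a transmitting group send to the same receiving group (for example, at T+2 Group A sends two packets to Group A), which violates the MEMS constraint.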

Group Schedule

In every time-slot each transmitting group sends one packet to Group A and one to Group B.

Tx Group   T+1  T+2  T+3  T+4
A          AB   AB   AB   AB
B          AB   AB   AB   AB

Good Packet Schedule

Tx LC (group)   T+1  T+2  T+3  T+4
A1 (A)          A1   A2   B1   B2
A2 (A)          B2   B1   A2   A1
B1 (B)          B1   B2   A1   A2
B2 (B)          A2   A1   B2   B1

Theorem: There exists a polynomial-time algorithm that finds the correct packet schedule.
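The rules for a packet schedule can be turned into a checker and applied to the bad and good schedules from the slides. The sketch below is illustrative Python, not the polynomial-time algorithm of the theorem: it only validates a given schedule. The function names, the convention that the first character of a linecard name is its group, and the default of one MEMS path per pair of groups (as in the 2-group example) are assumptions.

```python
def group_of(linecard):
    """Group label of a linecard name such as "A1" (assumed: its first character)."""
    return linecard[0]


def check_schedule(schedule, mems_per_group_pair=1):
    """Return a list of rule violations; an empty list means the schedule is valid.

    `schedule` maps each transmitting linecard to the receiving linecards
    it sends to in consecutive time-slots.
    """
    senders = sorted(schedule)
    n_slots = len(schedule[senders[0]])
    violations = []
    for t in range(n_slots):
        # Each receiving linecard receives one packet per time-slot.
        receivers = [schedule[tx][t] for tx in senders]
        if len(set(receivers)) != len(receivers):
            violations.append(f"slot {t + 1}: a linecard receives more than one packet")
        # MEMS constraint: bounded number of packets between each pair of groups.
        pair_count = {}
        for tx in senders:
            pair = (group_of(tx), group_of(schedule[tx][t]))
            pair_count[pair] = pair_count.get(pair, 0) + 1
        for (src, dst), count in sorted(pair_count.items()):
            if count > mems_per_group_pair:
                violations.append(f"slot {t + 1}: group {src} sends {count} packets to group {dst}")
    # Over the whole schedule, each sender reaches every linecard exactly once.
    for tx in senders:
        if sorted(schedule[tx]) != senders:
            violations.append(f"{tx} does not send exactly once to every linecard")
    return violations


BAD = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "A1", "A2", "B1"],
       "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "B1", "B2", "A1"]}
GOOD = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "B1", "A2", "A1"],
        "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "A1", "B2", "B1"]}

print(check_schedule(GOOD))   # []
for problem in check_schedule(BAD):
    print(problem)            # four MEMS violations, at slots 2 and 4
```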

Summary

The load-balanced switch
- Does not need any centralized scheduling
- Can use a mesh

Using FOFF
- It keeps packets in order
- It guarantees 100% throughput

Using the hybrid electro-optical architecture
- It scales to high port numbers
- It tolerates linecard failure

Summary of Contributions

Load-Balanced Switch

I. Keslassy and N. McKeown, “Maintaining Packet Order in Two-Stage Switches,” Proceedings of IEEE Infocom '02, New York, June 2002.

I. Keslassy, S.-T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N. McKeown, “Scaling Internet Routers Using Optics,” ACM SIGCOMM '03, Karlsruhe, Germany, August 2003. Also in Computer Communication Review, vol. 33, no. 4, p. 189, October 2003.

I. Keslassy, S.-T. Chuang and N. McKeown, “A Load-Balanced Switch with an Arbitrary Number of Linecards,” to appear in Proceedings of IEEE Infocom ’04, Hong Kong, March 2004.

I. Keslassy, C.-S. Chang, N. McKeown and D.-S. Lee, “Maximizing the Throughput of Fixed Interconnection Networks,” in preparation.

Summary of Contributions

Packet-Switch Scheduling

I. Keslassy and N. McKeown, “Analysis of Scheduling Algorithms That Provide 100% Throughput in Input-Queued Switches,” Proceedings of the 39th Annual Allerton Conference on Communication, Control, and Computing, Monticello, Illinois, October 2001.

I. Keslassy, M. Kodialam, T. V. Lakshman and D. Stiliadis, “On Guaranteed Smooth Scheduling for Input-Queued Switches,” Proceedings of IEEE Infocom '03, San Francisco, California, April 2003.

I. Keslassy, R. Zhang-Shen and N. McKeown, “Maximum Size Matching is Unstable for Any Packet Switch,” IEEE Communications Letters, Vol. 7, No. 10, pp. 496-498, Oct. 2003.

I. Keslassy, M. Kodialam, T. V. Lakshman and D. Stiliadis, “On Guaranteed Smooth Scheduling for Input-Queued Switches,” submitted to IEEE/ACM Transactions on Networking.

Summary of Contributions

Scheduling in Optical Networks

I. Keslassy, M. Kodialam, T. V. Lakshman and D. Stiliadis, “Scheduling Schemes for Delay Graphs with Applications to Optical Packet Networks,” to appear in Proceedings of IEEE HPSR ’04, Phoenix, Arizona, April 2004.

Scheduling in Wireless Networks

I. Keslassy, M. Kodialam and T. V. Lakshman, “Faster Algorithms for Minimum-Energy Scheduling of Wireless Data Transmissions,” Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt '03), INRIA Sophia-Antipolis, France, March 2003.

Summary of Contributions

Router Buffer Sizing

G. Appenzeller, I. Keslassy and N. McKeown, “Sizing Router Buffers,” submitted to ACM SIGCOMM ’04.

Image Classification

I. Keslassy, M. Kalman, D. Wang, and B. Girod, “Classification of Compound Images Based on Transform Coefficient Likelihood,” Proceedings of the International Conference on Image Processing (ICIP '01), Thessaloniki, Greece, October 2001.

Merci!

- Nick McKeown
- Balaji Prabhakar
- Mark Horowitz, David Miller, Olav Solgaard
- John and Kate Wakerly (Stanford Graduate Fellowship)
- SNRC, DARPA/MARCO, Cisco, NSF
- Da Rui and Nandita
- Group Members: Gireesh, Greg, Guido, Martin, Masayoshi, Matthew, Mingjie, Pablo, Sundar, Theresa, Yashar
- Friends and Colleagues: Abtin, Alan, Allen, Amalia, Amelia, Anamaya, Ananthan, Arjun, Athina, Bill, Brian, Chang, Chandra, Changhua, Chao-Kai, Chao-Lin, Christine, Christophe, Damon, Dana, Daniel, Danny, David, Denise, Derek, Devavrat, Dimitri, Elif, Emilio, Eric, Flavio, Giulio, Hanna, In-Sung, Ingrid, Joachim, Jonathan, Ken, Kevin, Kostas, Kyoungsik, Lakshman, Laurence, Lizzi, Marcy, Marissa, Mark, Maureen, Max-David, Mayank, Milind, Mina, Mohsen, Murali, Myles, Nathan, Neda, Neha, Nick, Ofer, Paolo, Pascal, Paul, Peter, Prashanth, Rivi, Rong, Ruben, Ryan, Sam, Sylvia, Tali, Vinayak, Vincent, Yoav, … and the audience!
- In memory of my departed grandparents Z’’L.
- To My Family: Mamie, Papa, Maman, Michael and the numerous cousins…

Thank you.