
1


RF-Interconnect and its Applications to NoC Design

Frank Chang, Jason Cong and Glenn Reinman
E-mails: [email protected] [email protected]@cs.ucla.edu

NOCS Tutorial Course, May 10, 2009, San Diego, California

2

RF-Interconnect


3

Outline

• Future Network-on-Chip (NoC) needs and development trends
• Traditional baseband-interconnect constraints
• Multiband RF-Interconnect (RF-I) advantages
  – Scalability in latency, energy/bit, data rate (Gbps/link) and overhead (area/Gb)
  – On-chip demonstrations
  – Off-chip demonstrations
  – Remaining technology challenges
• Potential RF-I system applications

4

Current Trend in CMP

• 65nm CMOS 80-tile NoC
• 10x8 2D mesh network-on-chip running @ 4GHz
• Bisection bandwidth 256GB/s
• 1 TFLOPS @ 1V, about 98W
• Needs a total of 75 clock cycles from the lower left corner to the upper right corner

ISSCC 2007: An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS (Sriram Vangal et al., Intel)
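To put the 75-cycle figure in context, the short sketch below converts it to wall-clock time and to cycles per hop. The mesh size, clock and cycle count come from the slide; the hop-count formula assumes minimal (Manhattan) routing between opposite corners, which the slide does not state, and the function name is illustrative.

```python
# Slide figures: 10x8 mesh, 4 GHz clock, 75 cycles corner to corner.
COLS, ROWS = 10, 8
CLOCK_GHZ = 4.0
CORNER_TO_CORNER_CYCLES = 75


def corner_to_corner_hops(cols: int, rows: int) -> int:
    """Router-to-router hops between opposite corners (assumes minimal routing)."""
    return (cols - 1) + (rows - 1)


hops = corner_to_corner_hops(COLS, ROWS)
latency_ns = CORNER_TO_CORNER_CYCLES / CLOCK_GHZ   # cycles / (cycles per ns)
cycles_per_hop = CORNER_TO_CORNER_CYCLES / hops

print(f"hops={hops}, latency={latency_ns:.2f} ns, cycles/hop={cycles_per_hop:.2f}")
```

Under these assumptions each hop costs several cycles, which is the per-hop overhead that a long-range shortcut is meant to bypass.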


5

Future CMP Development

Trends:
• Heterogeneous/domain-specific system architecture
• Many-core massively parallel data processing
• System integration in deep-scaled CMOS technology
• Low supply voltage with sub-Vth digital operation

Issues:
• Performance increasingly dependent on inter-core or inter-system communications

6

Scaling of Traditional Interconnect

• Scaling reduces delay of logic gates but not of wires
• Latency is RC limited (~L²)
• Using CMOS repeaters reduces latency (~L) but receives no benefit from scaling
• Even low-swing signaling requires extensive equalization
• Waste of the broad bandwidth available from modern CMOS devices (fT > 150GHz, fmax > 250GHz)

7

Baseband Interconnect Issues

• Latency is large across the chip
• Bandwidth is RC limited (~1Gbps/wire)
• Communication pattern is fixed (non-reconfigurable)
• Energy consumption is high and not scalable (~10pJ/bit/cm)
• At 22nm technology, the total network power using buffers can be as high as 150W*
• Future microprocessors may encounter communication congestion, and most of the energy will be spent on “talking” instead of computing

*“Research Challenges for On-Chip Interconnection Networks,” IEEE Micro, 2007

8

Communication Challenges

• On-Chip Issues
  – # Cores in Chip-Multiprocessor (CMP) growing
    • Increasing bandwidth demand on interconnect
  – Wires scaling poorly compared to transistors
    • Increased latency to communicate between distant points on CMP
• Off-chip limited by chip-to-chip, board-to-board, board-to-backplane communications
• Requirements on future interconnect
  – Scalable, reliable
  – Support high traffic volume with low latency
  – Constrained by
    • Power
    • Silicon Area
    • Cost (compatibility with mainstream CMOS technology)

9

How Can RF Help?

• fT will exceed 600GHz at 16nm and fmax will even approach 1THz!
• Millimeter-wave CMOS circuits have been developed for 60GHz and recently for 324GHz bands*
• Enormous bandwidth will be available in the future, but it is widely neglected!
• EM waves travel at the (effective) speed of light (~7ps/mm in silicon)

*Huang, LaRocca and Chang, “324GHz CMOS Frequency Generator Using Linear Superposition Technique,” pp. 476-477, 2008 ISSCC

10

UCLA 90nm CMOS VCO at 324GHz (ISSCC 2008*)

[Figure: measured output spectrum of the 323.5GHz VCO; x-axis: Frequency (GHz), 323.038 to 324.0; y-axis: Pout (dBm), -100 to -70]

[Figure: On-Wafer VCO Test Setup at JPL]

CMOS Voltage Controlled Oscillator, measured with a subharmonic mixer driven by an 80GHz synthesizer local oscillator. The mixing frequency is (fVCO - 4*fLO) = fIF, or fVCO - 4*(80GHz) = 3.5GHz, yielding fVCO = 323.5GHz!

CMOS VCO designed by Frank Chang’s group at UCLA, fabricated in a 90nm process

*Huang, D., LaRocca, T., Chang, M.-C. F., “324GHz CMOS Frequency Generator Using Linear Superposition Technique,” IEEE International Solid-State Circuits Conference (ISSCC), 476-477, (Feb 2008) San Francisco, CA
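The measurement arithmetic on this slide can be checked in a few lines. The sketch simply rearranges the slide's relation fIF = fVCO - 4*fLO; the function and parameter names are illustrative, not part of the original test setup.

```python
def vco_frequency_ghz(f_lo_ghz: float, f_if_ghz: float, harmonic: int = 4) -> float:
    """Recover f_VCO from f_IF = f_VCO - harmonic * f_LO (subharmonic mixing)."""
    return harmonic * f_lo_ghz + f_if_ghz


# Slide values: 80 GHz local oscillator, 3.5 GHz intermediate frequency.
f_vco = vco_frequency_ghz(f_lo_ghz=80.0, f_if_ghz=3.5)
print(f"f_VCO = {f_vco} GHz")
```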


11

4xf0 by Linear Superposition

12

Communication beyond Baseband

• Ultra-high carrier frequencies can be generated and modulated by modern CMOS to enable simultaneous multiband communications with higher aggregate data rate

• On/off-chip Transmission lines and off-chip near-field antennas can guide waves (RF modulated data) from transmitter to receiver with recoverable attenuation in short distances (<30cm)


13

Multiband Communications

• Carrier frequencies can be generated and modulated using modern CMOS to enable simultaneous multiband communications with higher aggregate data rate
• Higher carrier frequencies can avoid baseband digital noise and cause less frequency dispersion across the band

14

RF-Interconnect Concept

[Figure: Bi-directional Bus]

Advantages:
• Higher combined data rate
• Simultaneous, bi-directional communications
• Re-configurable between bands
• Low in-band coupling for parallel bus
• Potentially with fewer I/O pins and smaller routing area

15

Differential Transmission Line

• Loss of 1.5dB/mm over 100GHz of bandwidth

Differential TL: IBM 90nm Process
Width: 3um
Spacing: 3um
Total Thickness: Two Top Metals = 1µm
Metal Resistivity: 0.0424Ohm/Sq

[Figure: cross-section of the differential line; dimension labels 3um, 3um, 3um; metal layers M8 and M7, 0.5um each]

16

Multiband FDMA-Interconnect

• In TX, each mixer up-converts individual baseband streams into specific frequency band (or channel)

• N different data streams (N=6 in exemplary figure above) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates

• In RX, individual signals are down-converted by mixer, and recovered after low-pass filter
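The three bullets above can be illustrated with an idealized, discrete-time sketch: each stream is up-converted onto its own carrier, the results are summed on one shared line, and each receiver mixes with its carrier and low-pass filters (here, a plain average over one bit period). Everything numeric is an assumption made for illustration: on-off keying as the modulation, carriers expressed as an integer number of cycles per bit so that channels are exactly orthogonal, and a noiseless, lossless line.

```python
import math

SAMPLES_PER_BIT = 64  # illustrative sampling resolution


def carrier(cycles_per_bit: int, n: int) -> float:
    return math.cos(2 * math.pi * cycles_per_bit * n / SAMPLES_PER_BIT)


def modulate(bits, cycles_per_bit):
    """TX mixer: multiply each baseband bit (0/1) by the channel carrier."""
    return [b * carrier(cycles_per_bit, n)
            for b in bits for n in range(SAMPLES_PER_BIT)]


def demodulate(line, cycles_per_bit):
    """RX mixer plus low-pass filter (average over one bit period), then slice."""
    bits = []
    for start in range(0, len(line), SAMPLES_PER_BIT):
        acc = sum(line[start + n] * carrier(cycles_per_bit, n)
                  for n in range(SAMPLES_PER_BIT))
        # A transmitted '1' averages to 0.5, a '0' (or another channel) to ~0.
        bits.append(1 if acc / SAMPLES_PER_BIT > 0.25 else 0)
    return bits


# Three streams on carriers of 4, 8 and 12 cycles per bit (hypothetical values).
streams = {4: [1, 0, 1, 1], 8: [0, 1, 1, 0], 12: [1, 1, 0, 0]}
shared_line = [sum(samples) for samples in
               zip(*(modulate(bits, f) for f, bits in streams.items()))]
recovered = {f: demodulate(shared_line, f) for f in streams}
print(recovered)
```

In real hardware the channel spacing, filter shape and line loss determine how much of this ideal separation survives; the sketch only shows why simultaneous streams on one wire can be pulled apart again.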

[Figure: signal spectrum on the shared line and signal power of the individual channels]

17

FDMA-Interconnect System

[Figure: FDMA-Interconnect system block diagram; labels: Data0, Data1, DataN-2, power]

18

Incredible CMOS Bandwidth

[Figure: frequency (GHz, 100 to 1000) versus channel length (nm, 20 to 70) for ft_DRAM, fmax_DRAM, ft_NFET and fmax_NFET]

Technology (nm)                        90    65    45    32    22    16
Number of Cores                         8    16    40    64   120   189
BW (Bisection) [GB/s]                  91   128   202   256   350   440
Chip Power (total) [W]                100   120   144   173   207   249
fmax [GHz]                            200   270   370   480   590   710
fvco [GHz]                            320   432   592   768   944  1136
Max Aggregate Data Rate [Gb/s/wire]   160   216   296   384   472   568

Maximum aggregate data rate for RF-Interconnect can reach 500Gb/s/wire @16nm Tech Node
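The last three rows of the table follow a simple pattern that can be verified numerically. The ratios used below (fvco = 1.6 x fmax, and aggregate data rate = fvco / 2) are inferred from the table entries themselves; the slide does not state them as rules, so treat them as an observation rather than the authors' model.

```python
# Rows copied from the slide's table.
TECH_NM = [90, 65, 45, 32, 22, 16]
FMAX_GHZ = [200, 270, 370, 480, 590, 710]
FVCO_GHZ = [320, 432, 592, 768, 944, 1136]
RATE_GBPS_PER_WIRE = [160, 216, 296, 384, 472, 568]


def derived_row(fmax_ghz: int) -> tuple[int, int]:
    """Apply the inferred ratios: fvco = 1.6 * fmax, rate = fvco / 2."""
    fvco = fmax_ghz * 16 // 10   # integer arithmetic avoids float rounding
    return fvco, fvco // 2


derived = [derived_row(f) for f in FMAX_GHZ]
for node, (fvco, rate) in zip(TECH_NM, derived):
    print(f"{node}nm: fvco={fvco} GHz, max aggregate rate={rate} Gb/s/wire")
```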


19

Advantages of RF-I over Parallel Bus

• Latency – speed-of-light data transmission
• Bandwidth – high aggregate data rate through simultaneous transmissions on multiple bands of RF modulated signals
• Area – avoid extensive use of repeaters
• Energy – low overall energy per bit
• Reconfigurability – efficient bidirectional and tunable communications via shared on/off-chip transmission lines or off-chip antennas

20

RF-Interconnect Demonstrations

• Off-chip (On-board) Simultaneous Dual-band Communications through RF-Interconnect (ISSCC 05)
• Inter-layer 3DIC RF-Interconnect (ISSCC 07)
• On-chip Simultaneous generation of multi-band carriers (RFIC 08)
• On-Chip Tri-band simultaneous communications (VLSI 09)

21

Off-Chip FDMA Links (ISSCC 05*)

• 2-carrier RF-I provides simultaneous off-chip communication between 4 CMOS chips in 0.18um technology

• 1 baseband and 1 RF band at 7.4GHz

• Selectivity between bands is achieved using bandpass or lowpass filtering.

• The RF carrier was modulated using BPSK.

• Using this scheme, simultaneous data rates of (2+2) Gb/s were achieved in both the baseband (2Gbps) and the RF band (2Gbps).

*J. Ko, J. Kim, Z. Xu, Q. Gu, C. Chien, and M.F. Chang, “An RF/Baseband FDMA-Interconnect Transceiver for Reconfigurable Multiple Access Chip-to-Chip Communication,” in 2005 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, February 2005

22

3D-IC Layer-to-Layer RF-Interconnect (ISSCC 07*)

• 3DIC RF-I in MIT-Lincoln Lab 0.18μm 3DIC technology
• Data is modulated with amplitude shift keying (ASK) modulation of a 25GHz carrier
• Low energy-per-bit: 0.39pJ/bit
• High data rate: 11Gb/s

(a) Schematic of 3DIC RF-I (b) Eye diagram with 11Gb/s data rate (c) Die photo of the 3DIC RF-I

*Q. Gu, Z. Xu, J. Ko and M.F. Chang, "Two 10Gbps/pin Low Power Interconnect Methods for 3D IC", 2007 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, vol.50, pp.448-449, Feb. 2007, San Francisco, California, USA

23

Simultaneous Sub-harmonic Injection Locked mm-Wave Frequency Generation *(RFIC 2008)

• Sub-harmonic injection-locked VCOs simultaneously lock to a single reference frequency
• Advantages:
  – Eliminate multiple PLLs
  – Low power consumption
  – Small area

[Figure: Master VCO, non-linear harmonic generator, slave VCOs]

*Sai-Wang Tam, Eran Socher, Alden Wong, Yu Wang, Lan Vu, M.F. Chang, "Simultaneous Sub-harmonic Injection-Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Sym., 2008

24

Sub-harmonic Injection Locked VCO* (RFIC 2008)

• LC-based VCO core
• Differential pair for odd harmonic generation
• Single-ended for even harmonic generation
• Injection locking to a high harmonic within the locking range of the VCO

This Work*
Process:                        IBM 90nm CMOS
Free-Running Frequency (GHz):   29.3
Max Locking Range (GHz):        5.6
Locking Harmonics:              2nd, 4th, 6th, 8th; 3rd, 5th, 7th
Power (mW):                     4

*Sai-Wang Tam, Eran Socher, Alden Wong, Yu Wang, Lan Vu, M.F. Chang, "Simultaneous Sub-harmonic Injection-Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Sym., 2008


25

Simultaneous Sub-harmonic Injection Locked Multi-Frequency Generation (RFIC 2008)

• Sub-harmonic injection-locked VCOs simultaneously lock to a single master reference frequency
• Advantages:
  – Eliminate multiple PLLs
  – Low power consumption
  – Small silicon area
• Demonstrated sub-harmonic 30GHz and 50GHz injection-locked VCOs in IBM 90nm process

[Figure: Master VCO, non-linear harmonic generator, slave VCO]

(a) Output spectrum of the 30GHz and 50GHz VCOs simultaneously locked to the same reference source at 9.7GHz
(b) Die photo of the 30GHz sub-harmonic injection-locked VCO

26

RF-I using ASK Modulation

• TX: The transformer couples the output of the VCO to the ASK modulator, and a simple modulator generates the ASK signal

• RX: The differential-mutual-mixer (self-mixer) acts as the envelope detector. A simple buffer and Schmitt trigger then recover the signal to rail-to-rail swing.

• No carrier synchronization is needed!

27

Simulated RF-I using ASK Modulation

[Figure: simulated waveforms; panels: VCO output (60GHz), ASK modulated signal, mixer output, 5Gbit/s data input]

28

Tri-Band On-Chip RF-Interconnect

• IBM 90nm process
• 5mm differential transmission line
• Total 3 channels: 2 RF + 1 baseband
• Differential mode for RF: 30GHz and 50GHz
• Common mode for baseband
• Total aggregate data rate is 10Gb/s

[Figure: 50GHz TX, 30GHz TX and baseband TX sharing one transmission line with the 50GHz RX, 30GHz RX and baseband RX]

29

Tri-band FDMA-Interconnect Layout

30

Tri-band On-Chip RF-I Test Results

[Figure: data output waveforms of the 30GHz, 50GHz and baseband channels; output spectrum of the RF bands, 30GHz and 50GHz]

Process:                            IBM 90nm CMOS Digital Process
Total 3 Channels:                   30GHz, 50GHz, Base Band
Data Rate in each channel:          RF Band: 4Gbps; Base Band: 2Gbps
Total Data Rate:                    10Gbps
Bit Error Rate Across all Bands:    <10E-9
Latency:                            6 ps/mm
Energy Per Bit (RF):                0.09* pJ/bit/mm
Energy Per Bit (BB):                0.125 pJ/bit/mm

*VCO power (5mW) can be shared by all (many tens) parallel RF-I links in the NoC and does not burden an individual link significantly.

31

Inter-channel Modulation in multi-band ASK Transmitter

• Switch in f2 is able to directly modulate the signal from f1
  ⇒ Causes severe inter-channel interference

• Additional transformer prevents signal current from flowing through the switch in the other channel
  ⇒ No inter-channel interference through direct modulation

32

Base Band Common Mode Interconnect*

• Base band is transmitted in common mode
• Using capacitive coupling method
• Common-mode swing is controlled to be about 100mV
• Low swing saves power

*“A 5.6mW 1-Gbps pair Pulse Signaling Transceiver for a Fully AC Coupled Bus”, Jongsun Kim, Ingrid Verbauwhede, Mau-Chung Frank Chang, JSSC, Vol. 40, No. 6, June 2005

33

Multi-Band ASK Receiver

• Differential mode for RF communications
  – Using inductive coupling with band-pass characteristic
  – It is able to filter out the undesired channel
• Common mode for baseband communication
  – Common-mode signal is tapped out at the center of the RX transformer loop

34

Signal to Interference Ratio (SIR)

• Determine the max effective communication distance using SIR
• Major source of interference: coupling from adjacent TL
• Side walls between TLs effectively suppress the cross-talk

35

Multi-band ASK RF-I Scaling

Technology  # of Carriers  Data rate per    Total data rate  Power  Energy per  Area (TX+RX)  Area/Gbit
                           carrier (Gb/s)   per wire (Gb/s)  (mW)   bit (pJ)    (mm2)         (µm2/Gbit)
90nm        3RF + 1 BB     5                20               20     1.00        0.022         1100
65nm        4RF + 1 BB     6                30               25     0.83        0.024         800
45nm        5RF + 1 BB     7                42               30     0.71        0.023         540
32nm        6RF + 1 BB     8                56               35     0.63        0.021         380
22nm        7RF + 1 BB     9                72               40     0.56        0.019         260
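The derived columns of the scaling table can be recomputed from the primary ones, which makes the scaling trend easy to check. The formulas below (total rate = carriers x rate per carrier, energy per bit = power / total rate, area per Gbit = area / total rate) are the natural reading of the column headers rather than something the slide spells out; small differences from the table are rounding.

```python
# (node, RF carriers, rate per carrier Gb/s, table total Gb/s,
#  power mW, table energy pJ/bit, area mm2, table area um2/Gbit)
ROWS = [
    ("90nm", 3, 5, 20, 20, 1.00, 0.022, 1100),
    ("65nm", 4, 6, 30, 25, 0.83, 0.024, 800),
    ("45nm", 5, 7, 42, 30, 0.71, 0.023, 540),
    ("32nm", 6, 8, 56, 35, 0.63, 0.021, 380),
    ("22nm", 7, 9, 72, 40, 0.56, 0.019, 260),
]


def derive(rf_carriers, rate_per_carrier, power_mw, area_mm2):
    """Recompute total rate, energy per bit and area per Gbit for one row."""
    total_gbps = (rf_carriers + 1) * rate_per_carrier   # +1 for the baseband channel
    energy_pj_per_bit = power_mw / total_gbps           # mW / (Gb/s) = pJ/bit
    area_um2_per_gbit = area_mm2 * 1e6 / total_gbps     # mm2 -> um2
    return total_gbps, energy_pj_per_bit, area_um2_per_gbit


derived = {node: derive(rf, rate, power, area)
           for node, rf, rate, _, power, _, area, _ in ROWS}
for node, (total, energy, area) in derived.items():
    print(f"{node}: {total} Gb/s, {energy:.2f} pJ/bit, {area:.0f} um2/Gbit")
```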

36

Comparison between Repeated Bus and Multi-band RF-I @ 32nm

Assumptions:
1. 32nm node; 30x repeater, FO4 = 8ps, Rwire = 306Ω/mm, Cwire = 315fF/mm, wire pitch = 0.2um, bus length = 2cm, f_bus = 1GHz, bus width 96 bytes
2. Repeater area = 0.022mm2
3. Bus physical width = 160um
4. In that width we can fit 13 transmission lines, each with 7 carriers carrying 8Gbps each; interconnect length = 2cm

                                  RF-I    Repeated Bus
# of wires                        13      448
Data rate per carrier (Gbit/s)    8       NA
# of carriers                     7       NA
Data rate per wire (Gbit/s)       56      1
Aggregate Data Rate (Gbit/s)      728     768
Bus Physical Width (um)           160     160
Transceiver Area (mm2)            0.27    0.022
Power (mW)                        455     6144
Energy per bit (pJ/bit)           0.63    8
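The aggregate data rates and energy-per-bit figures in the comparison can be reproduced from the listed assumptions. This is a back-of-the-envelope sketch: it takes the bus rate as bus width (96 bytes) times f_bus, which matches the 768 Gbit/s entry, and divides the stated power by the aggregate rate. Variable names are illustrative.

```python
# RF-I side: 13 transmission lines, 7 carriers each, 8 Gb/s per carrier.
rf_lines, rf_carriers, rf_gbps_per_carrier = 13, 7, 8
rf_power_mw = 455

# Repeated-bus side: 96-byte bus clocked at 1 GHz.
bus_width_bytes, bus_clock_ghz = 96, 1
bus_power_mw = 6144

rf_aggregate_gbps = rf_lines * rf_carriers * rf_gbps_per_carrier
bus_aggregate_gbps = bus_width_bytes * 8 * bus_clock_ghz

rf_pj_per_bit = rf_power_mw / rf_aggregate_gbps      # mW / (Gb/s) = pJ/bit
bus_pj_per_bit = bus_power_mw / bus_aggregate_gbps

print(f"RF-I: {rf_aggregate_gbps} Gb/s at {rf_pj_per_bit:.3f} pJ/bit")
print(f"Bus:  {bus_aggregate_gbps} Gb/s at {bus_pj_per_bit:.3f} pJ/bit")
print(f"Energy-per-bit ratio (bus / RF-I): {bus_pj_per_bit / rf_pj_per_bit:.1f}x")
```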


37

RF-I for CMP Inter-Core Communications

• RF-I built on top of the 2D mesh of a CMP NoC provides a “super-highway” network for inter-core communications
• Enables simultaneous multi-band communications by using multiple frequency carriers up to fmax of the super-scaled CMOS device (100-500GHz)
• Encodes data by phase or amplitude modulation
• Uses direct coupling between the transmission line and electronic transceivers
• Enhances performance with scaling (higher aggregate data rates, lower latency, lower energy/bit and lower area consumption/bit)

RF-I enables ultra-high performance CMP with low latency, low energy per bit, high aggregate data rate and bandwidth/route-reconfigurable inter-core and inter-core-memory communications.

[Figure: RF T-line overlaid on single-chip CMP, tapped with T/R circuits]

38

RC/RF/Optical Interconnect Comparison

• Comparison across process technology of…
  – Traditional RC parallel bus
  – RF-Interconnect
  – Optical Interconnect
• As process technology scales toward 22nm…
  – RF-I has the lowest latency
  – RF-I consumes the least energy
  – RF-I has the highest data rate density
• RF-I is fully compatible with modern CMOS technology

39

RF-I Fills the Technology Gap between RC Repeater and Optical Interconnect*

• On-Chip:
  – RC repeater is non-scalable
  – RF-I has better energy efficiency for d > 1mm
• Off-Chip**:
  – RF-I has better energy efficiency for d < 30cm
  – Overhead of Optical-I is too high
• RF-I may be the perfect fit for mid-range interconnect

[Figure: two panels, On-chip and Off-chip**]

*Sai-Wang Tam, et al., "Ultra-Low Power/Latency and Scalable Multiband RF-Interconnect for Reconfigurable," submitted to Proceedings of the IEEE
**H. Cho, et al., “Power comparison between high-speed electrical and optical interconnects for interchip communication,” J. Lightw. Technol., Sep. 2004.

40

Quick RF-I Summary

• Bandwidth – high aggregate data rates through simultaneous transmissions of multiple bands with RF carrier modulated signals (324GHz carrier recently realized in 90nm CMOS, Chang et al., 2008 ISSCC)
• Energy – low overall energy per bit (0.1pJ/bit/mm in 90nm to 0.05pJ/bit/mm in 22nm CMOS)
• Low Overhead – high data rate/wire, low area/Gigabit, and low latency due to speed-of-light data transmission
• Re-configurability – efficient simultaneous communications with adaptive bandwidths via shared on/off-chip transmission medium
• Total compatibility and scalability with mainstream digital CMOS technology
• Multicast support – scalable means to communicate from one transmitter to a number of receivers on chip

41

RF-I Enabled NoC Communication Architecture

42

Outline

• Application Diversity and NoC Motivation
• Adaptive RF-Interconnect in NoC Design
  – RF-I Overlaid on a Mesh NoC
  – Shortcut Selection
• Architecture implications
  – Performance improvement
  – Power Savings
  – Efficient multicast support
  – Deadlock
• Conclusions and Future Work





44

Communication Diversity

• Diverse communication patterns in parallel applications
  – Different models of parallelism
    • Data decomposition, pipelined parallelism, master/worker, etc.
  – Different data inputs
    • Can vary communication hotspots and bandwidth demand
  – Cache coherence
    • May favor broadcast/multicast
• Implies applications have different “ideal” NoCs
  – Topology
  – Bandwidth allocation
  – Latency
• NoC design alternatives
  – Traditional approach – Design for the general case
    • High bandwidth links in a uniform topology
  – Our approach – Provide bandwidth where it is needed
    • Reconfigurable RF-I flexibly allocates bandwidth

45

Architectural Considerations for RF-I

• Opportunities (both on and off chip)
  – High bandwidth communication
    • Data distribution across many-core topologies
    • Vital in keeping many-core designs active
  – Low latency communication
    • Enables users to apply parallel computing to a broader range of applications through faster synchronization and communication
    • Faster cache coherence protocols
  – Reconfigurability
    • Adapt NoC topology/bandwidth to the needs of the individual application
  – Power efficient communication
• Challenges
  – Frequency arbitration and Tx/Rx tuning
  – Application-specific modeling

46

Baseline Architecture

[Figure: 10x10 mesh floorplan. R (square) = router; C (circle) = processor core; $ (diamond) = cache bank; + (plus) = main memory interface]

• 10x10 mesh of pipelined routers
  – NoC runs at 2GHz
  – XY routing
• 64 4GHz 3-wide processor cores, each containing
  – 8KB L1 Data Cache
  – 8KB L1 Instruction Cache
• 32 L2 Cache Banks
  – 256KB each
  – Organized as shared NUCA cache
• 4 Main Memory Interfaces
  – Labeled with + in the figure

47

Quantifying Application Diversity

• For a 100-router (10x10 mesh) configuration:
• Measures messages sent from a router on the x-axis to a router on the y-axis
• Legend for the figure on the coming slide
  – Black: no traffic
  – Dark Blue: [1, mean/4)
  – Light Blue: [mean/4, mean/2)
  – White: [mean/2, 2*mean)
  – Orange: [2*mean, 4*mean]
  – Red: (4*mean, inf)
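The legend amounts to a small bucketing function over per-pair message counts. The sketch below is one way to write it, assuming integer message counts and a mean taken over the traffic matrix; the function name and colour strings are illustrative.

```python
def traffic_color(count: int, mean: float) -> str:
    """Map a router-to-router message count to the legend colour."""
    if count == 0:
        return "black"        # no traffic
    if count < mean / 4:
        return "dark blue"    # [1, mean/4)
    if count < mean / 2:
        return "light blue"   # [mean/4, mean/2)
    if count < 2 * mean:
        return "white"        # [mean/2, 2*mean)
    if count <= 4 * mean:
        return "orange"       # [2*mean, 4*mean]
    return "red"              # (4*mean, inf)


for sample in (0, 10, 25, 50, 200, 400, 401):
    print(sample, traffic_color(sample, mean=100))
```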

48

Messages Sent between Routers

[Figure: messages sent between routers for Barnes; regions of high communication are highlighted]

49

Messages Sent between Routers

[Figure: messages sent between routers for LU; regions of high communication are highlighted]

51

RF-I Physical Organization

• Physically
  – RF-I is a bundle of transmission lines
  – Connected to and shared between a set of RF-enabled routers
  – RF-enabled router consists of a Tx/Rx pair
  – Single-cycle transmission across a 400mm2 die
  – 16 carrier frequencies per transmission line
• @32nm, with NoC running @2GHz
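A quick plausibility check of the single-cycle claim, under assumptions that are not on this slide: a square die, a worst-case Manhattan route from corner to corner, the ~7ps/mm propagation delay quoted earlier in the tutorial, and no allowance for transmitter or receiver circuit delay.

```python
import math

DIE_AREA_MM2 = 400.0
DELAY_PS_PER_MM = 7.0      # propagation figure quoted earlier in the tutorial
NOC_CLOCK_GHZ = 2.0

side_mm = math.sqrt(DIE_AREA_MM2)            # assumes a square die
worst_path_mm = 2 * side_mm                  # assumes Manhattan corner-to-corner routing
flight_time_ps = worst_path_mm * DELAY_PS_PER_MM
cycle_time_ps = 1000.0 / NOC_CLOCK_GHZ
fits_in_one_cycle = flight_time_ps < cycle_time_ps

print(f"flight time {flight_time_ps:.0f} ps vs cycle {cycle_time_ps:.0f} ps "
      f"-> single cycle: {fits_in_one_cycle}")
```

The remaining slack in the cycle is what the Tx/Rx circuits would have to fit into; the sketch does not model that.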

52

RF-Enabled Routers

• RF-enable 50 routers
  - Represented in GREEN

[Figure: router R with routing tables; a 6th port is added, connecting the router's TX and RX to the transmission line]

53

RF-I Logical Organization

• Logically:
  - RF-I behaves as a set of N express channels
  - Each channel is assigned to a src, dest router pair (s,d)
• Reconfigured by:
  - Remapping shortcuts to match the needs of different applications

[Figure: two example logical configurations, LOGICAL A and LOGICAL B]

55

Architecture-Specific Shortcuts

• Design-time shortcuts
  – Referred to as static shortcuts in the remainder of this talk
• Selection criteria
  – Consider an optimization function for a topology:

$$W_{x,y} = \begin{cases} \text{length of shortest-path}(x,y) & \text{if } x \neq y \\ 0 & \text{if } x = y \end{cases}$$

  – We wish to minimize the total cost of the graph G representing the network-on-chip:

$$\text{Total-Cost}(G) = \sum_{\text{all } (x,y)} W_{x,y}$$
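The cost function can be evaluated directly with a breadth-first search from every router. The sketch below assumes unit-length links and a plain n x n mesh with no shortcuts yet; `mesh_graph`, `shortest_paths_from` and `total_cost` are illustrative names, not from the original work.

```python
from collections import deque


def mesh_graph(n: int) -> dict[int, list[int]]:
    """Directed adjacency lists for an n x n mesh; node id = row * n + col."""
    adj = {v: [] for v in range(n * n)}
    for r in range(n):
        for c in range(n):
            v = r * n + c
            if c + 1 < n:                     # horizontal link, both directions
                adj[v].append(v + 1)
                adj[v + 1].append(v)
            if r + 1 < n:                     # vertical link, both directions
                adj[v].append(v + n)
                adj[v + n].append(v)
    return adj


def shortest_paths_from(adj, src: int) -> dict[int, int]:
    """W[src][y] for every reachable y, in hops (BFS on unit-length links)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def total_cost(adj) -> int:
    """Total-Cost(G): sum of W[x][y] over all ordered pairs (x, y)."""
    return sum(sum(shortest_paths_from(adj, x).values()) for x in adj)


print(total_cost(mesh_graph(3)), total_cost(mesh_graph(10)))
```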

56

Shortcut-Selection Constraints

• Each router should have at most 6 ports
  – A router can be at most one shortcut source and at most one shortcut destination
• Total of B (budget) unidirectional shortcuts: B = 16
• For static shortcuts:
  – RF-enable the routers which are shortcut srcs/dests
  – At most 16 RF-enabled routers
• For adaptive shortcuts, shortcut srcs/dests are limited to
  – RF-enabled routers chosen at design time

57

Min Total-Cost(NoC): Heuristic 1

I) For each pair of non-adjacent routers i,j
  – Make a new candidate graph Gi,j with an edge between them
  – Calculate Total-Cost(Gi,j)
  – Record improvement as |Total-Cost(Gi,j) – Total-Cost(G)|

II) Select shortcut of edge (x,y) such that Gx,y had max improvement
  – Disallow any use of x as a src or y as a dest afterwards

III) Repeat (I) and (II) until budget B exhausted

Complexity: O(BV^5)

[Figure: mesh G with a candidate edge between routers i and j forming Gi,j; Gx,y is the best candidate]
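Steps (I) to (III) can be sketched as a greedy loop. This is an illustration under assumptions, not the authors' implementation: links have unit length, shortcuts are unidirectional edges, and a small 4x4 mesh with a budget of 2 is used so the exhaustive search runs quickly. All function names are hypothetical.

```python
from collections import deque
from itertools import permutations


def mesh_graph(n):
    """Directed adjacency lists for an n x n mesh; node id = row * n + col."""
    adj = {v: [] for v in range(n * n)}
    for r in range(n):
        for c in range(n):
            v = r * n + c
            if c + 1 < n:
                adj[v].append(v + 1)
                adj[v + 1].append(v)
            if r + 1 < n:
                adj[v].append(v + n)
                adj[v + n].append(v)
    return adj


def total_cost(adj):
    """Sum of shortest-path hop counts over all ordered router pairs."""
    cost = 0
    for src in adj:
        dist, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        cost += sum(dist.values())
    return cost


def select_shortcuts_greedy(adj, budget):
    """Heuristic 1: repeatedly add the shortcut with the largest cost improvement."""
    adj = {v: list(ns) for v, ns in adj.items()}      # work on a copy
    used_src, used_dst, chosen = set(), set(), []
    for _ in range(budget):
        base = total_cost(adj)
        best, best_gain = None, 0
        for i, j in permutations(adj, 2):
            if i in used_src or j in used_dst or j in adj[i]:
                continue                              # constraint or already adjacent
            adj[i].append(j)                          # candidate graph G_ij
            gain = base - total_cost(adj)
            adj[i].pop()
            if gain > best_gain:
                best, best_gain = (i, j), gain
        if best is None:
            break
        adj[best[0]].append(best[1])
        used_src.add(best[0])
        used_dst.add(best[1])
        chosen.append(best)
    return chosen, total_cost(adj)


baseline_cost = total_cost(mesh_graph(4))
chosen, final_cost = select_shortcuts_greedy(mesh_graph(4), budget=2)
print(chosen, baseline_cost, final_cost)
```

Every candidate requires a full all-pairs recomputation, which is what makes this heuristic expensive on the 10x10 mesh and motivates the cheaper alternative on the next slide.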

58

Min Total-Cost(NoC): Heuristic 2

(I) Calculate Wi,j for all pairs i,j in G
  – Record all Wi,j

(II) Select shortcut of edge (x,y) s.t. Wx,y = max(Wi,j)
  – Which is the graph diameter
  – Disallow any use of x as a src or y as a dest afterwards

(III) Repeat (I) and (II) until budget exhausted

Complexity: O(BV^3)

These shortcuts tend to perform within 1% of those chosen with Heuristic 1

[Figure: mesh G with the selected diameter shortcut]
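A sketch of the diameter-based heuristic follows, again under illustrative assumptions: unit-length links, unidirectional shortcuts, and a 4x4 mesh with a budget of 2. It uses a BFS per router for the all-pairs distances, which is a substitute for whatever all-pairs method underlies the O(BV^3) bound; names are hypothetical.

```python
from collections import deque


def mesh_graph(n):
    """Directed adjacency lists for an n x n mesh; node id = row * n + col."""
    adj = {v: [] for v in range(n * n)}
    for r in range(n):
        for c in range(n):
            v = r * n + c
            if c + 1 < n:
                adj[v].append(v + 1)
                adj[v + 1].append(v)
            if r + 1 < n:
                adj[v].append(v + n)
                adj[v + n].append(v)
    return adj


def shortest_paths_from(adj, src):
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def total_cost(adj):
    return sum(sum(shortest_paths_from(adj, x).values()) for x in adj)


def select_shortcuts_diameter(adj, budget):
    """Heuristic 2: repeatedly bridge the farthest still-eligible (src, dest) pair."""
    adj = {v: list(ns) for v, ns in adj.items()}      # work on a copy
    used_src, used_dst, chosen = set(), set(), []
    for _ in range(budget):
        best, best_w = None, 1                        # w > 1 means non-adjacent
        for x in adj:
            if x in used_src:
                continue
            for y, w in shortest_paths_from(adj, x).items():
                if y not in used_dst and w > best_w:
                    best, best_w = (x, y), w
        if best is None:
            break
        adj[best[0]].append(best[1])
        used_src.add(best[0])
        used_dst.add(best[1])
        chosen.append(best)
    return chosen, total_cost(adj)


baseline_cost = total_cost(mesh_graph(4))
chosen, final_cost = select_shortcuts_diameter(mesh_graph(4), budget=2)
print(chosen, baseline_cost, final_cost)
```

With ties broken by iteration order, the first shortcut in this toy run joins opposite corners of the mesh, which is the diameter pair the slide describes.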


59

Adaptive RF-I Shortcuts

• Assume a profile of communication for an application
  – Fi,j = count of messages sent between router i and router j
• Change optimization function:

$$\text{Total-Cost}(G) = \sum_{\text{all } (x,y)} F_{x,y} \cdot W_{x,y}$$

• To offset the effect of removing src/dest routers (already selected) from consideration
  – Alternate router-to-router shortcuts with region-to-region shortcuts
  – Allows placement of shortcuts at routers near a hotspot
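The traffic-weighted cost differs from the static one only in multiplying each distance by the profiled message count. The sketch below assumes a sparse profile stored as nested dictionaries and unit-length links; the 3x3 mesh, the profile values and the function names are made up for illustration.

```python
from collections import deque


def mesh_graph(n):
    """Directed adjacency lists for an n x n mesh; node id = row * n + col."""
    adj = {v: [] for v in range(n * n)}
    for r in range(n):
        for c in range(n):
            v = r * n + c
            if c + 1 < n:
                adj[v].append(v + 1)
                adj[v + 1].append(v)
            if r + 1 < n:
                adj[v].append(v + n)
                adj[v + n].append(v)
    return adj


def shortest_paths_from(adj, src):
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def weighted_total_cost(adj, freq):
    """Sum of F[x][y] * W[x][y] over all pairs that appear in the profile."""
    cost = 0
    for x, row in freq.items():
        dist = shortest_paths_from(adj, x)
        cost += sum(f * dist[y] for y, f in row.items())
    return cost


# Hypothetical profile: router 0 sends 10 messages to router 8, router 4 sends 1 to router 5.
profile = {0: {8: 10}, 4: {5: 1}}
g = mesh_graph(3)
cost_before = weighted_total_cost(g, profile)
g[0].append(8)                                  # add a unidirectional shortcut 0 -> 8
cost_after = weighted_total_cost(g, profile)
print(cost_before, cost_after)
```

Because the hot pair dominates the weighted sum, a shortcut placed on it removes most of the cost, which is the intuition behind adapting shortcuts to the application profile.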


61

Results on Performance Improvement

• Static Shortcuts
  – 20% reduction in latency on average
  – 11% increase in NoC power
• Adaptive Shortcuts with 50 RF-I routers
  – 32% reduction in latency on average
  – 24% increase in NoC power

62

Power Savings

• We can thin the baseline mesh links
  – From 16B…
  – …to 8B
  – …to 4B
• RF-I makes up the difference in performance while saving overall power!
  – RF-I provides bandwidth where most necessary
  – Baseline RC wires supply the rest

[Figure: mesh drawn with 16-byte, 8-byte and 4-byte links; router A requires high bandwidth to communicate with B]

63

Evaluation Methodology

• Used detailed interconnection network simulator - Garnet [1]
• Built probabilistic traces
  – To cover different communication patterns that may be exhibited by future applications
• Leveraged Orion [7], CosiNoC [6], IPEM [4] for power methodology

64

RF-I Enables Power Savings

[Figure: relative latency and relative power (0.0 to 1.4) for the traces uniform, uniDF, biDF, hotbiDF, 1Hotspot, 2Hotspot, 4Hotspot; series: Static, Adaptive, Baseline - 8B, Static - 8B, Adaptive - 8B, Baseline - 4B, Static - 4B, Adaptive - 4B]

• On average, adaptive shortcuts w/ 50 RF-enabled routers on a 4B mesh provide:
  – 62% (82%) power (area) savings over baseline
  – Performance comparable to baseline

65

RF-I Enables a Power Savings

• Adaptive RF-I enabled NoC
  - Cost effective in terms of both power and performance

67

RF-I Enabled Multicast

[Figure: multicast request scenario (Get S, Fill) on a conventional NoC compared with an RF-I enabled NoC, where one Tx reaches multiple Rx]

68

Virtual Circuit Tree Multicast [8]

• Demonstrated the importance of multicast for current/future NoCs
• Enhance NoC routers with additional state tables
  – Store routing information for dynamically discovered multicast trees
  – Dynamically spawn packets to follow the communication tree
• Reduce NoC congestion and multicast latency

69

RF-I Enabled Multicast

• RF-I provides a natural means for multicast
• Multiple receivers listen to the same channel
• Conceptually, an RF shortcut with multiple destinations

70

Multicast in our architecture

• 50 RF-enabled routers
• 16 adaptive shortcuts
• 34 routers left to tune to the multicast channel

• Allocate 1 channel for multicast
• 15 adaptive shortcuts
• 35 routers left to tune to the multicast channel

71

RF-I Enabled Multicast (cont)

• We accelerate two multicast messages: Fills and Invalidates
• Both of which are issued by cache banks
  – Limit multicast senders to be cache banks
• We use a coarse-grain arbitration scheme
  – to decide which component can use the multicast channel
  – A cache bank in a cluster is chosen as the designated multicast sender for a fixed period of time
  – The caches send multicast messages to the designated sender over conventional wires

72

Example Multicast (MC) Scenario

[Figure: a cache bank that wants to send MC msgs forwards them to the designated MC sender, which TRANSMITs on the multicast channel; the MC recipients RECEIVE]

73

Multicast Results

• On average, RF-I MC + SC provides:
  – 37% reduction in latency
  – at a cost of a 25% increase in NoC power
• 20 and 50 indicate:
  - % of distinct source-destination pairs
  - simulate multicast destination reuse
• Perform a fair comparison with VCT

74

Unified Analysis






76

Deadlock: To Avoid or Confront?

• South-Last Strategy [Ogras and Marculescu, 2006]
  – Routes which can lead to circular buffer dependence are forbidden, which avoids deadlock
  – Restricts shortcut selection
    • Results show that it effectively halves performance
• Deadlock Detection & Recovery (DDR)
  – Based on Duato and Pinkston’s theory [Duato and Pinkston 2001]
    • If deadlock occurs, route all packets in the network on a spare virtual channel
    • Use deadlock-free XY-routing
    • Packets entering the network after this point may be routed normally

77

How to detect deadlock…?

• Rather than detect that deadlock has occurred
  – Detect a sufficient condition for deadlock: circular buffer dependency
• Each router maintains a list of other routers waiting on it
• When the buffer at neighbor router d is full, sender s transmits a waiting-list message to the neighbor
  – Bit vector indicating which routers are waiting on s, as well as s’s ID
  – If a router is “waiting on itself,” circular buffer dependency has occurred
    • Raise DEADLOCK condition
  – If d’s buffer empties, s sends a one-time clear-waiting-list message to reset state

78

Deadlock Detection Example

• N inbound buffer at R21 fills up
• R11 can’t send to R21
• R11 tells R21:
  – {R11} waiting on you
• W inbound buffer at R11 fills up
• R10 tells R11:
  – {R10} waiting on you
• R11 tells R21:
  – {R10, R11} waiting on you
• If there is circular dependence
  – R21 will eventually see that it is waiting on itself: DEADLOCK!

[Figure: routers R10, R11 and R21 with their in/out buffers, and the waiting-list bit vectors (indexed by R0, R1, …, R10, R11, R21, …) after each message]
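The waiting-list protocol can be modelled in a few lines. The sketch below is a simplified, sequential model under stated assumptions: each router is blocked on at most one downstream router, waiting lists are Python integers used as bit vectors, and the clear-waiting-list message is omitted. The first two calls follow the slide's example; the third (R21 blocked on R10) is a hypothetical event added to close the cycle. Class and function names are illustrative.

```python
class Router:
    def __init__(self, rid: int):
        self.rid = rid
        self.waiting = 0         # bit i set => router i is waiting on this router
        self.blocked_on = None   # downstream router whose inbound buffer is full


def report_blocked(routers: dict, s: int, d: int) -> bool:
    """Router s cannot send to d. Forward waiting lists downstream.

    Returns True as soon as some router finds itself in its own waiting
    list, i.e. a circular buffer dependency (DEADLOCK condition).
    """
    routers[s].blocked_on = d
    while d is not None:
        message = routers[s].waiting | (1 << s)       # my waiters plus my own ID
        if routers[d].waiting & message == message:
            break                                     # d already knows all of this
        routers[d].waiting |= message
        if routers[d].waiting & (1 << d):
            return True                               # d is waiting on itself
        s, d = d, routers[d].blocked_on               # d forwards if it is blocked too
    return False


routers = {rid: Router(rid) for rid in (10, 11, 21)}
step1 = report_blocked(routers, 11, 21)   # R11 tells R21: {R11} waiting on you
step2 = report_blocked(routers, 10, 11)   # R10 tells R11, R11 forwards {R10, R11} to R21
step3 = report_blocked(routers, 21, 10)   # hypothetical: R21 blocks on R10, closing the cycle
print(step1, step2, step3, bin(routers[21].waiting))
```

In this model whichever router first sees its own bit raises the condition; which router that is depends on the order in which the buffers fill.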






80

Conclusion

RF-Interconnect:
• Enables adaptive NoC
  – Bandwidth can be flexibly allocated
  – To match the communication demands of applications
• Offers dramatic power and area savings
  – By simplifying baseline NoC topology
  – Provides performance of 16B mesh on a 4B mesh
    • 62% power savings, 82% area savings
• Natural means of multicast

81

Future Directions

• Fine-Grain Adaptation
  – On-Demand or Phase-Specific Shortcut/Multicast
• Message-Based Acceleration
  – Application-Specific Synchronization
  – Cache Coherence
  – NUCA Migration
• Deadlock Free Routing
  – Application-Specific Turn Removal
• Physical Implementation

82

References

[1] N. Agarwal, L-S. Peh, and N. Jha. Garnet: A detailed interconnection network model inside a full-system simulation framework. Technical Report CE-P08-001, Dept. of Electrical Engineering, Princeton University, 2007.

[2] M. Chang, J. Cong, A. Kaplan, M. Naik, G. Reinman, E. Socher and S.W. Tam, “CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect,” The 14th International Symposium on High-Performance Computer Architecture, Salt Lake City, UT, pp. 191-202, February 2008. (Best Paper Award)

[3] M. F. Chang, J. Cong, A. Kaplan, C. Liu, M. Naik, J. Premkumar, G. Reinman, E. Socher, and R. Tam, “Power Reduction of CMP Communication Networks via RF-Interconnects,” Proceedings of the 41st Annual International Symposium on Microarchitecture (MICRO), Lake Como, Italy, pages 376-387, November 2008.

[4] J. Cong and D.Z. Pan. Interconnect estimation and planning for deep submicron designs. In Proceedings of DAC-36, 1999.

[5] J.D. Owens, W.J. Dally, R. Ho, D.N. Jayasimha, S.W. Keckler, and L-S. Peh. Research challenges for on-chip interconnection networks. IEEE Micro, 27(5):96–108, 2007.

[6] A. Pinto, L. Carloni, and A. Sangiovanni-Vincentelli. Constraint-driven communication synthesis. In Design Automation Conference, June 2002.

[7] H. Wang, X. Zhu, L-S. Peh, and S. Malik. Orion: A power-performance simulator for interconnection networks. In Proceedings of MICRO-35, November 2002.

[8] N. Jerger, L. Peh, and M. Lipasti. Virtual Circuit Tree Multicasting: A Case for Hardware Multicast Support. International Symposium on Computer Architecture, June 2008.

For updated slides of this tutorial, please go to http://cadlab.cs.ucla.edu/~cong
