RF-Interconnect and its Applications to NoC...
-
Upload
doankhuong -
Category
Documents
-
view
223 -
download
1
Transcript of RF-Interconnect and its Applications to NoC...
1
1
RF-Interconnect and its Applications to
NoC Design
Frank Chang, Jason Cong and Glenn ReinmanE-mails: [email protected]
[email protected]@cs.ucla.edu
NOCS Tutorial Course, May 10 2009 San Diego, California
2
RF-Interconnect
2
3
Outline
• Future Network-on-Chip (NoC) needs and development trends
• Traditional baseband-interconnect constraints• Multiband RF-Interconnect (RF-I) advantages
– Scalability in latency, energy/bit, data rate (Gbps/link) and overhead (area/Gb)
– On-chip demonstrations– Off-chip demonstrations– Remaining technology challenges
• Potential RF-I system applications
4
Current Trend in CMP
• 65nm CMOS 80 tile NoC• 10X8 2D mesh network-on-
chip running @ 4GHz• Bisection bandwidth
256GB/s• 1 TFLOPS @ 1V about 98W• Needs total 75 Clk cycles
from the lower left corner to the upper right corner
ISSCC 2007: An 80-Tile 1.28TFLOPS Network-on-Chipin 65nm CMOS (Sriram Vangal et al., Intel)
3
5
Future CMP Development
Trends:• Heterogeneous/domain-specific system architecture• Many-core massive parallel data processing• System integration in deep-scaled CMOS technology• Low supply voltage with sub-Vth digital operation
Issues:• Performance increasingly dependent on inter-core or
inter-system communications
6
Scaling of Traditional Interconnect
• Scaling reduces delay of logic gates but not of wires• Latency is RC limited (~L2)• Using CMOS repeaters reduces latency (~L) but receives no benefit
from scaling• Even low swing signaling requires extensive equalization• Waste of broad bandwidth available from modern CMOS devices
(ft>150GHz, fmax>250GHz)
10Tf
4
7
Baseband Interconnect Issues
• Latency is large across the chip• Bandwidth is RC limited (~1Gbps/wire)• Communication pattern is fixed (non-reconfigurable)• Energy consumption is high and not scalable
(~10pJ/bit/cm)• At 22nm technology, the total network power using
buffer can be as high as 150W* • Future microprocessors may encounter
communication congestion and most of the energy will be spent on “talking” instead of computing*“Research Challenges for On-Chip Interconnection Networks,” IEEE Micro, 2007
8
Communication Challenges• On-Chip Issues
– # Cores in Chip-Multiprocessor (CMP) growing• Increasing bandwidth demand on interconnect
– Wires scaling poorly compared to transistors• Increased latency to communicate between distant points on
CMP
• Off-chip limited by chip-to-chip, board-to-board, board-to-backplane communications
• Requirements on future interconnect – Scalable, reliable– Support high traffic volume with low latency– Constrained by
• Power• Silicon Area• Cost (compatibility with mainstream CMOS technology)
5
9
How Can RF Help?• fT will exceed 600GHz
at16nm and fmax will even approach 1THz!
• Millimeter-wave CMOS circuits have been developed for 60GHz and recently for 324 GHz bands*
• Incredible bandwidth is available in future but most people neglect that!
• EM waves travel at the (effective) speed of light (~7ps/mm in Silicon)
*Huang, Larocca and Chang, “324GHz CMOS Frequency Generator using Linear Superposition Technique,” pp. 476- 477, 2008 ISSCC
10
-100
-90
-80
-70
323.038 323.238 323.438 323.638 323.838 324.0Frequency (GHz)
Pout
(dB
m)
UCLA 90nm CMOS VCO at 324GHz (ISSCC 2008*)
CMOS Voltage Controlled Oscillator, measured with a subharmonic mixer and driven with a 80 GHz synthesizer local oscillator. The mixing frequency is (fVCO - 4*fLO)=fIF, or fVCO -4*(80 GHz)= 3.5 GHz, yielding fVCO= 323.5 GHz!On-Wafer VCO Test Setup at JPL
CMOS VCO designed by Frank Chang’s group at UCLA, fabricated in 90nm process
323.5GHz VCO
*Huang, D., LaRocca T., Chang, M.-C. F., “324GHz CMOS Frequency Generator Using Linear Superposition Technique IEEE International Solid-State Circuits Conference (ISSCC), 476-477, (Feb 2008) San Francisco, CA
6
11
4xf0 by Linear Superposition4xfo by Linear Superposition
12
Communication beyond Baseband
• Ultra-high carrier frequencies can be generated and modulated by modern CMOS to enable simultaneous multiband communications with higher aggregate data rate
• On/off-chip Transmission lines and off-chip near-field antennas can guide waves (RF modulated data) from transmitter to receiver with recoverable attenuation in short distances (<30cm)
7
13
• Carrier frequencies can be generated and modulated using modern CMOS to enable simultaneous multiband communications with higher aggregate data rate
• Higher carrier frequencies can avoid basebanddigital noise and cause less frequency dispersion across the band
Multiband Communications
14
Bi-directional Bus
Advantages:• Higher combined
data rate• Simultaneous,
bi-directional communications
• Re-configurable between bands
• Low in-band coupling for parallel bus
• Potentially with fewer I/O pins and smaller routing area
RF-Interconnect Concept
0f
8
15
• Loss of 1.5dB/mm over 100GHz of Bandwidth
Differential TL: IBM 90nm ProcessWidth: 3um Spacing: 3umTotal Thickness: Two Top Metal = 1µmMetal Resistivity: 0.0424Ohm/Sq
3um 3um 3um
0.5um0.5um
M8M7
Differential Transmission Line
16
Multiband FDMA-Interconnect
• In TX, each mixer up-converts individual baseband streams into specific frequency band (or channel)
• N different data streams (N=6 in exemplary figure above) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
• In RX, individual signals are down-converted by mixer, and recovered after low-pass filter
Sig
nal S
pect
rum
Sig
nal P
ower
Sig
nal P
ower
Sign
al P
ower
Sig
nal P
ower
9
17
Data0
DataN-2
Dat
a 1po
wer
FDMA-Interconnect System
18
Incredible CMOS Bandwidth
100
200
300
400
500
600
700
800
900
1000
20 30 40 50 60 70channel length (nm)
freq
(GH
z)
ft_DRAM
fmax_DRAM
ft_NFET
fmax_NFET
Technology (nm) 90 65 45 32 22 16
Number of Cores 8 16 40 64 120 189BW (Bisection) [GB/s] 91 128 202 256 350 440Chip Power (total) [W] 100 120 144 173 207 249
fmax [GHz] 200 270 370 480 590 710
fvco [GHz] 320 432 592 768 944 1136Max Aggregate Data Rate [Gb/s/wire] 160 216 296 384 472 568
Maximum aggregate data rate for RF-Interconnect can reach 500Gb/s/wire @16nm Tech Node
10
19
Advantages of RF-I over Parallel Bus
• Latency – speed-of-light data transmission • Bandwidth – high aggregate data rate through
simultaneous transmissions on multiple bands of RF modulated signals
• Area – avoid extensive use of repeaters• Energy – low overall energy bit • Reconfigurability – efficient bidirectional and
tunable communications via shared on/off-chip transmission lines or off-chip antennas
20
RF-Interconnect Demonstrations
• Off-chip (On-board) Simultaneous Dual-band Communications through RF-Interconnect (ISSCC 05)
• Inter-layer 3DIC RF-Interconnect (ISSCC 07)• On-chip Simultaneous generation of multi-
band carriers (RFIC 08)• On-Chip Tri-band simultaneous
communications (VLSI 09)
11
21
Off-Chip FDMA Links (ISSCC 05*)• 2 carrier RF-I provide
simultaneous off chip between 4 CMOS chip in 0.18um technology
• 1 baseband and 1 RF band at 7.4GHz
• Selectivity between bands is achieved using bandpass or lowpass filtering.
• The RF carrier was modulated using BPSK.
• Using this scheme, simultaneous data rates of (2+2) Gb/s were achieved in both the baseband (2Gbps) and the RF band (2Gbps).
*J. Ko, J. Kim, Z. Xu, Q. Gu, C. Chien, and M.F. Chang, “An RF/Baseband FDMA-Interconnect Transceiver for Reconfigurable Multiple Access Chip-to-Chip Communication,” in 2005 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, February 2005
22
3D-IC Layer-to-Layer RF-Interconnect (ISSCC 07*)
• 3DIC RF-I in MIT-Lincoln Lab 0.18 μm3DIC technology
• Data is modulated with amplitude shift keying (ASK) modulation of a 25GHz carrier
• low energy-per-bit: 0.39pJ/bit
• high data rate: 11Gb/s
22
(a) Schematic of 3DIC RF-I (b) Eye diagram with 11Gb/s data rate (c) Die photo of the 3DIC RF-I
• Q. Gu, Z. Xu, J. Ko and M.F. Chang, "Two 10Gbps/pin Low Power Interconnect Methods for 3D IC", 2007 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, vol.50, pp.448-449, Feb. 2007, San Francisco, California, USA
12
23
Simultaneous Sub-harmonic Injection Locked mm-Wave Frequency Generation *(RFIC 2008)
• Using sub-harmonic injection-locked VCOs simultaneously lock to one single reference frequency
• Advantages:– Eliminate multiple PLLs– Low Power Consumption– Small Area
Master VCO
Non-linear HarmonicGenerator
Slave VCOs
*Sai-Wang Tam, Eran Socher, Alden Wong, Yu Wang, Lan Vu, M.F. Chang, "Simultaneous Sub-harmonic Injection-Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Sym., 2008
24
Sub-harmonic Injection Locked VCO* (RFIC 2008)
• LC-based VCO core• Differential pair for odd harmonic generation• Single-ended for even harmonic generation• Injection locking to high harmonic within locking range
of the VCO
ProcessFree Running
Frequency (GHz)
Max locking Range (GHz)
Locking Harmonics Power (mW)
This Work* IBM 90nm CMOS 29.3 5.6 2nd,4th, 6th, 8th
3rd, 5th, 7th 4
*Sai-Wang Tam, Eran Socher, Alden Wong, Yu Wang, Lan Vu, M.F. Chang, "Simultaneous Sub-harmonic Injection-Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Sym., 2008
13
25
25
Simultaneous Sub-harmonic Injection Locked Multi-Frequency Generation (RFIC 2008)
• Using sub-harmonic injection-locked VCOs simultaneous lock to one single master reference frequency
• Advantages:– Eliminate Multiple PLLs– Low Power Consumption– Small in Silicon Area
• Demonstrated Sub-harmonic 30GHz and 50GHz injection-locked VCO in IBM 90nm Process
Master VCO
Non‐linear HarmonicGenerator Slave VCO
(a) Output Spectrum of the 30GHz and 50GHz VCO simultaneously locked with the same reference source at 9.7GHz (b) Die Photo of the 30GHz Sub-harmonic Injection VCO
25
26
RF-I using ASK Modulation
• TX: The Transformer couples the output of the VCO to the ASK modulator and use a simple modulator to generate ASK signal
• RX: The differential-mutual-mixer (self-mixer) acts as the envelope detector. Then a simple buffer and Schmitt Trigger recover the signal to rail-to-rail swing.
• Don’t Need Carrier Synchronization!
14
27
Simulated RF-I using ASK ModulationVCO Output: 60GHZ
ASK modulated Signal
Mixer output5Gbit/s Data input
28
Tri-Band On-Chip RF-Interconnect
• IBM 90nm process• 5mm Differential Transmission Line• Total 3 Channels: 2RF + 1Baseband• Differential mode for RF: 30GHz and 50GHz• Common Mode for Baseband• Total Aggregate Data Rate is 10Gb/s
50GHzTX
30GHzTX
Base BandTX
50GHzRX
30GHzRX
Base BandRX
15
29
Tri-band FDMA-Interconnect Layout
30
Tri-band On-Chip RF-I Test Results
30GHz Channel50 GHz Channel
30GHz Channel
50GHz Channel
Base Band Channel
ProcessIBM 90nm CMOS Digital
Process
Total 3 Channels 30GHz, 50GHz, Base BandData Rate in each
channelRF Band: 4GbpsBase Band: 2Gbps
Total Data Rate 10GbpsBit Error Rate Across all Bands <10E‐9
Latency 6 ps/mmEnegry Per Bit (RF) 0.09*pJ/bit/mmEnegry Per Bit (BB) 0.125pJ/bit/mm
Data Output waveform Output Spectrum of the RF-Bands, 30GHz and 50GHz
*VCO power (5mW) can be shared by all (many tens) parallel RF-I links in NOC and does not burden individual link significantly.
16
31
Inter-channel Modulation in multi-band ASK Transmitter
• Switch in f2 is able to directly modulate the signal from f1
⇒ Cause severe Inter-channel interference
• Additional Transformer avoids signal current flowing through the switch in other channel
=> No Inter-channel interference through direct modulation
32
Base Band Common Mode Interconnect*
• Base Band is transmitted in Common Mode• Using Capacitive Coupling method:• Common Mode Swing is controlled to be about 100mV • Low Swing and save power
“A 5.6mW 1-Gbps pair Pulse Signaling Transceiver for a Fully AC Coupled Bus”, Jongsun Kim, Ingrid Verbauwhede, Mau-Chung Frank Chang, JSSC, VOL 40, No 6, June 2005
17
33
• Differential mode for RF communications
– Using inductive coupling with band-pass characteristic
– It is able to filter out the undesired channel
• Common mode for baseband communication
– Common mode signal is tapering out at the center of the RX transformer loop
Multi-Band ASK Receiver
34
Signal to Interference Ratio (SIR)
• Determine the max effective communication distance using SIR• Major source of interference: Coupling from adjacent TL• Side walls between TLs effectively suppress the cross-talk
18
35
Multi-band ASK RF-I Scaling
Technology # of Carriers data rate per carrier (Gb/s) Total Data rate per wire (Gb/s) Power (mW) Energy per bit(pJ) Area (TX+RX) mm2Area/Gbit
(µm2/Gbit)
90nm 3RF + 1 BB 5 20 20 1.00 0.022 1100
65nm 4RF + 1 BB 6 30 25 0.83 0.024 800
45nm 5RF + 1 BB 7 42 30 0.71 0.023 540
32nm 6RF + 1 BB 8 56 35 0.63 0.021 380
22nm 7RF + 1 BB 9 72 40 0.56 0.019 260
36
Comparison between Repeated Bus and Multi-band RF-I @ 32nm
Assumptions:1. 32nm node; 30x
repeater, FO4=8ps, Rwire = 306Ω/mm Cwire = 315fF/mm, wire pitch=0.2um, Bus length = 2cm, f_bus = 1GHz, Bus Width 96Byte
2. Repeaters Area = 0.022mm2
3. Bus physical width = 160um
4. In that width we can fit 13 transmission line, each with 7 carriers with carrying 8GbpsInterconnect length = 2cm
RF‐IRepeated
Bus# of wire 13 448
Data rate per carrier (Gbit/s) 8 NA
# of carrier 7 NAData rate per carrier
(Gbit/s) 56 1
Aggregate Data Rate 728 768Bus Physical Width 160 160
Transceiver Area (mm2) 0.27 0.022Power (mW) 455 6144
Energy per bit (pJ/bit) 0.63 8
19
37
RF-I built on top of 2D-Mesh of CMP-NoC facilitates “super-highway” network for inter-core communicationsEnables simultaneous multi-band communications by using multiple frequency carriers up to fmax of the super-scaled CMOS device (100-500GHz) Encodes data by phase or amplitude modulation Uses direct coupling between the transmission line and electronic transceivers Enhances performance with scaling (higher aggregate data rates, lower latency, lower energy/bit and lower area consumption/bit)
RF-I enables ultra-high performance CMP with low latency, low energy per bit, high aggregated data rate and bandwidth/route reconfigurable inter-core and inter-core-memory communications.
RF T-Line line overlaid on single-chip CMP tapped with T/R circuit
RF-I for CMP Inter-Core Communications
38
• Comparison across process technology of…
– Traditional RC parallel bus– RF-Interconnect– Optical Interconnect
• As process technology scales toward 22nm…
– RF-I has the lowest latency– RF-I consumes least energy– RF-I has highest data rate density
• RF-I is fully compatible with modern CMOS technology
RC/RF/Optical Interconnect Comparison
20
39
RF-I Fill the Technology Gap between RC Repeater and Optical Interconnect*
• On-Chip:– RC Repeater is non-scalable– RF-I has better energy efficiency d > 1mm
• Off-Chip**:– RF-I has better energy efficiency d < 30cm– Over-head of Optical-I is too high
• RF-I may be the prefect fit for the mid-range interconnect
*Sai-Wang Tam, et.al, "Ultra-Low Power/Latency and Scalable Multiband RF-Interconnect for Reconfigurable," Submitted to Proceeding of IEEE**H. Cho, et.al, “Power comparison between high-speed electrical and optical interconnects for interchip communication,” J. Lightw. Technol, Sep. 2004.
On-
chip
O
ff-ch
ip**
40
Quick RF-I Summary
• Bandwidth – high aggregate data rates through simultaneous transmissions of multiple bands with RF carrier modulated signals (324GH carrier recently realized in 90nm CMOS, Chang et al., 2008 ISSCC)
• Energy – low overall energy per bit (0.1pJ/bit/mm in 90nm to 0.05pJ/bit/mm in 22nm CMOS)
• Low Overhead –High data rate/wire and low area/Gigabit and low latency due to speed-of-light data transmission
• Re-configurability –efficient simultaneous communications with adaptive bandwidths via shared on/off-chip transmission medium
• Total compatibility and scalability with mainstream digital CMOS technology
• Multicast support – scalable means to communicate from one transmitter to a number of receivers on chip
21
41
RF-I Enabled NoC Communication Architecture
42
Outline• Application Diversity and NoC Motivation• Adaptive RF-Interconnect in NoC Design
– RF-I Overlaid on a Mesh NoC– Shortcut Selection
• Architecture implications– Performance improvement– Power Savings– Efficient multicast support– Deadlock
• Conclusions and Future Work
22
43
Outline• Application Diversity and NoC Motivation• Adaptive RF-Interconnect in NoC Design
– RF-I Overlaid on a Mesh NoC– Shortcut Selection
• Architecture implications– Performance improvement– Power Savings– Efficient multicast support– Deadlock
• Conclusions and Future Work
44
Communication Diversity• Diverse communication patterns in parallel applications
– Different models of parallelism• Data decomposition, pipelined parallelism, master/worker, etc
– Different data inputs• Can vary communication hotspots and bandwidth demand
– Cache coherence• May favor broadcast/multicast
• Implies applications have different “ideal” NoCs– Topology– Bandwidth allocation– Latency
• NoC design alternatives– Traditional approach – Design for the general case
• High bandwidth links in a uniform topology– Our approach – Provide bandwidth where it is needed
• Reconfigurable RF-I flexibly allocates bandwidth
23
45
Architectural Considerations for RF-I• Opportunities (both on and off chip)
– High bandwidth communication• Data distribution across many-core topologies• Vital in keeping many-core designs active
– Low latency communication• Enables users to apply parallel computing to a broader applications
through faster synchronization and communication• Faster cache coherence protocols
– Reconfigurability• Adapt NoC topology/bandwidth to the needs of the individual
application– Power efficient communication
• Challenges– Frequency arbitration and Tx/Rx tuning– Application-specific modeling
46
Baseline ArchitectureR R
CR
CR
CR RR
CR
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RR R R
R RR RC
RC
RC
R RRC
R
R RR RC
RC
RC
R RRC
R
R$
$
R$
$
R$
$
R$
RR RC
RC
RC
RC
RC
RC
RR RRRC
RC
RR
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RRRR RR RR
RR RRRR RC
RC
RC
RC
RC
RC
RR RRRC
RC
RR
RR RRRR RC
RC
RC
RC
RC
RC
RR RRRC
RC
RR
R$
R$
$$
R$
R$
$$
R$
R$
$$
R$
R$
$$$
$$$ $$$ $$$
$$$$$$
$$$ $$$RRR
$$$
$$$ $$$
$$$RRR
RRRRRR$$$ $$$
$$$ $$$
$$$$$$RRR RRR
RRR
RRRRRR
RRR RRR
$$$
$$$
$$$
$$$
$$$
$$$
$$$
RRRCCC CCC CCC CCC
CCCCCCCCCCCCCCC
CCC
CCC
CCC CCC CCC CCC CCC CCC
CCCCCCCCCCCCCCC
CCC CCC CCC CCC CCC
CCC
CCCCCCCCCCCCRRR RRR RRR RRR
RRRRRRRRRRRRRRR
RRR RRR RRR RRR RRR
RRR
RRR
RRR RRR RRR RRR RRR RRR
RRRRRRRRRRRRRRRRRR
RRR RRR RRR RRR
R (square) = routerC (circle) = processor core$ (diamond) = cache bank+ (plus) = main memory interface
• 10x10 mesh of pipelined routers– NoC runs at 2GHz– XY routing
• 64 4GHz 3-wide processor cores containing– 8KB L1 Data Cache– 8KB L1 Instruction Cache
• 32 L2 Cache Banks– 256KB each– Organized as shared NUCA
cache• 4 Main Memory Interfaces
– Labeled with + in the figure
24
47
Quantifying Application Diversity• For a 100 (10x10 mesh) router configuration:
• Measures messages sent from a router on x-axis to router on y-axis
• Legend for the figure on the coming slide– Black: no traffic– Dark Blue: [1, mean / 4)– Light Blue: [mean/4 , mean/2)– White : [mean/2, 2*mean)– Orange: [2*mean, 4*mean]– Red: (4*xmean, inf)
48
Messages Sent between Routers
Barnes
High communication
High communication
25
49
Messages Sent between Routers
LU
High communication
High communication
50
Outline• Application Diversity and NoC Motivation• Adaptive RF-Interconnect in NoC Design
– RF-I Overlaid on a Mesh NoC– Shortcut Selection
• Architecture implications– Performance improvement– Power Savings– Efficient multicast support– Deadlock
• Conclusions and Future Work
26
51
RF-I Physical Organization• Physically
– RF-I is a bundle of transmission lines
– Connected to and shared between set of RF-enabledrouters
– RF-enabled router consists of a Tx/Rx pair
– single cycle transmission across 400mm2 die
– 16 carrier frequencies per transmission line
• @32 nm, with NoC running @2GHz
52
RF-Enabled Routers
RF-Enable 50 Routers- Represented by GREEN Routing
Tables
RAdd 6th
Port
RX
TX
Transmission Line…
27
53
RF-I Logical Organization
• Logically:- RF-I behaves as set of N express channels- Each channel assigned to src, dest router pair (s,d)
• Reconfigured by:- remapping shortcuts to
match needs of different applications LOGICAL ALOGICAL B
54
Outline• Application Diversity and NoC Motivation• Adaptive RF-Interconnect in NoC Design
– RF-I Overlaid on a Mesh NoC– Shortcut Selection
• Architecture implications– Performance improvement– Power Savings– Efficient multicast support– Deadlock
• Conclusions and Future Work
28
55
Architecture-Specific ShortcutsDesign time shortcuts Referred to as static shortcuts in the remainder of this talk Selection Criteria
Consider an optimization function for a topologylength of shortest-path(x,y) if x != y
0 if x == yWx,y =
We wish to minimize the total cost of the graph G representing the network-on-chip
Σall(x,y)
Wx,yTotal-Cost(G) =
56
Shortcut-Selection Constraints• Each router should have at most 6 ports
– A router can be at most one shortcut source and at most one shortcut destination
• Total of B (budget) unidirectional shortcuts: B = 16
• For static shortcuts:– RF-enable routers which are shortcut srcs/dests– At most 16 RF-enabled routers
• For adaptive shortcuts, shortcut srcs/dests are limited to– RF-enabled routers chosen at design-time
29
57
Min Total-Cost(NoC): Heuristic 1
I) For each pair of non-adjacent routers i,j
– Make a new candidate graph Gi,j with an edge between them
– Calculate Total-Cost(Gi,j)– Record improvement as…
II) Select shortcut of edge (x,y) such that Gx,y had max improvement
– Disallow any use of x as a src or y as a dest afterwards
III) Repeat (I) and (II) until budget B exhausted
R RC
RC
RC
R RRC
R
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RR R R
R RR RC
RC
RC
R RRC
R
R RR RC
RC
RC
R RRC
R
R$
$
R$
$
R$
$
R$
RR RC
RC
RC
RC
RC
RC
RR RRRC
RC
RR
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RRRR RR RR
RR RRRR RC
RC
RC
RC
RC
RC
RR RRRC
RC
RR
RR RRRR RC
RC
RC
RC
RC
RC
RR RRRC
RC
RR
R$
R$
$$
R$
R$
$$
R$
R$
$$
R$
R$
$$$
$$$ $$$ $$$
$$$$$$
$$$ $$$RRR
$$$
$$$ $$$
$$$RRR
RRRRRR$$$ $$$
$$$ $$$
$$$$$$RRR RRR
RRR
RRRRRR
RRR RRR
$$$
$$$
$$$
$$$
$$$
$$$
$$$
RRR
CCC CCC CCC CCC
CCCCCCCCCCCCCCC
CCC
CCC
CCC CCC CCC CCC CCC CCC
CCCCCCCCCCCCCCC
CCC CCC CCC CCC CCC
CCC
CCCCCCCCCCCCRRR RRR RRR RRR
RRRRRRRRRRRRRRR
RRR RRR RRR RRR RRR
RRR
RRR
RRR RRR RRR RRR RRR RRR
RRRRRRRRRRRRRRRRRR
RRR RRR RRR RRR
G
|Total-Cost(Gi,j) – Total-Cost(G)|
i j
Gi,jGx,y is best
O(BV5)
58
Min Total-Cost(NoC): Heuristic 2
(I) Calculate Wi,j for all pairs i,jin G– Record all Wi,j
(II) Select shortcut of edge (x,y) s.tWx,y = max(Wi,j)
– Which is the graph diameter– Disallow any use of x as a src
or y as a dest afterwards
(III) Repeat (I) and (II) until budget exhausted
R RC
RC
RC
R RRC
R
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RR R R
R RR RC
RC
RC
R RRC
R
R RR RC
RC
RC
R RRC
R
R$
$
R$
$
R$
$
R$
RR RC
RC
RC
RC
RC
RC
RR RRRC
RC
RR
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RC
RRRR RR RR
RR RRRR RC
RC
RC
RC
RC
RC
RR RRRC
RC
RR
RR RRRR RC
RC
RC
RC
RC
RC
RR RRRC
RC
RR
R$
R$
$$
R$
R$
$$
R$
R$
$$
R$
R$
$$$
$$$ $$$ $$$
$$$$$$
$$$ $$$RRR
$$$
$$$ $$$
$$$RRR
RRRRRR$$$ $$$
$$$ $$$
$$$$$$RRR RRR
RRR
RRRRRR
RRR RRR
$$$
$$$
$$$
$$$
$$$
$$$
$$$
RRR
CCC CCC CCC CCC
CCCCCCCCCCCCCCC
CCC
CCC
CCC CCC CCC CCC CCC CCC
CCCCCCCCCCCCCCC
CCC CCC CCC CCC CCC
CCC
CCCCCCCCCCCCRRR RRR RRR RRR
RRRRRRRRRRRRRRR
RRR RRR RRR RRR RRR
RRR
RRR
RRR RRR RRR RRR RRR RRR
RRRRRRRRRRRRRRRRRR
RRR RRR RRR RRR
GThese shortcuts tend to perform within 1% as well as those chosen with heuristic 1
O(BV3)
30
59
Adaptive RF-I Shortcuts • Assume a profile of communication for an application
– Fi,j = count of messages sent between router i and router j• Change optimization function
• To offset effect of removing src/dest routers (already selected) from consideration– Alternate router-to-router shortcuts with region-to-region
shortcuts– Allows placement of shortcuts at routers near a hotspot
Σall(x,y)
(Fx,y Wx,y)Total-Cost(G) = .
60
Outline• Application Diversity and NoC Motivation• Adaptive RF-Interconnect in NoC Design
– RF-I Overlaid on a Mesh NoC– Shortcut Selection
• Architecture implications– Performance improvement– Power Savings– Efficient multicast support– Deadlock
• Conclusions and Future Work
31
61
Results on Performance Improvement• Static Shortcuts
– 20% reduction in latency on average– 11% increase in NoC power
• Adaptive Shortcuts with 50 RF-I routers– 32% reduction in latency on average– 24% increase in NoC power
62
Power Savings• We can thin the
baseline mesh links– From 16B…– …to 8B– …to 4B
• RF-I makes up the difference in performance while saving overall power!– RF-I provides
bandwidth where most necessary
– Baseline RC wires supply the rest
16 bytes8 bytes4 bytes
Requires high bw to communicate w/ B
A
B
32
63
Evaluation Methodology• Used detailed interconnection network
simulator - Garnet[1] • Built probabilistic traces
– To cover different communication patterns that may be exhibited by future applications
• Leveraged Orion[4], CosiNoC[5], IPEM[2] for power methodology
64
RF-I Enables Power Savings
Relative latency
Relative pow
er
0.0
0.2
0.4
0.6
0.8
1.0
1.21.4
uniform uniDF biDF hotbiDF 1Hotspot 2Hotspot 4Hotspot
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4Static Adaptive Baseline - 8B Static - 8B Adaptive - 8B Baseline - 4B Static - 4B Adaptive - 4B
• On average adaptive shortcuts w/ 50 RF-enabled routers on a 4B mesh • 62%(82%) power(area) savings over baseline• Performance comparable to baseline
33
65
RF-I Enables a Power Savings
• Adaptive RF-I enabled NoC- Cost Effective in terms of both power and performance
66
Outline• Application Diversity and NoC Motivation• Adaptive RF-Interconnect in NoC Design
– RF-I Overlaid on a Mesh NoC– Shortcut Selection
• Architecture implications– Performance improvement– Power Savings– Efficient multicast support– Deadlock
• Conclusions and Future Work
34
67
RF-I Enabled Multicast
Get S
2
1
3 4
2
1
1
1 1
1FILL
Fill
Conventional NoC
Request Scenario
Rx RxTx
RxTx
RxTx
RxTx
RxTx
RxTx
RxTx
RxTx
Tx
RF-I enabled NoC
68
Virtual Circuit Tree Multicast [6]• Demonstrated the importance of multicast for
current/future NoCs• Enhance NoC routers with additional state
tables– Store routing information for dynamically discovered
multicast trees– Dynamically spawn packets to follow
communication tree• Reduce NoC congestion and multicast latency
35
69
RF-I Enabled Multicast• RF-I provides natural means for multicast• Multiple receivers listen to same channel• Conceptually, RF shortcut with multiple
destinations
70
Multicast in our architecture
• 50 RF-enabled routers
• 16 adaptive shortcuts
• 34 routers left to tune to the multicast channel
• Allocate 1 channelfor multicast
• 15 adaptive shortcuts
• 35 routers left to tune to the multicast channel
36
71
RF-I Enabled Multicast (cont)• We accelerate two multicast messages: Fills and
Invalidates • Both of which are issued by cache banks
– Limit multicast senders to be cache banks• We use coarse-grain arbitration scheme
– to decide which component can use the multicast channel– A cache bank in a cluster is chosen as the designated
multicast sender for a fixed period of time– The caches sent multicast message to the designated – sender over conventional wires
72
Designated MCSender
Wants to send MC msgs
TRANSMITRECEIVE
MC recipients
Example Multicast(MC) Scenario
37
73
Multicast Results
• On average RF-I MC+ SC provides:• 37% reduction in latency• at a cost of 25% increase in NoC Power
• 20 and 50 indicates:- % of distinct source-destination pairs- simulate multicast destination reuse
• Perform a fair comparison with VCT
74
Unified Analysis
38
75
Outline• Application Diversity and NoC Motivation• Adaptive RF-Interconnect in NoC Design
– RF-I Overlaid on a Mesh NoC– Shortcut Selection
• Architecture implications– Performance improvement– Power Savings– Efficient multicast support– Deadlock
• Conclusions and Future Work
76
Deadlock: To Avoid or Confront?• South-Last Strategy [Ogras and Marculescu, 2006]
– Routes which can lead to circular buffer dependence are forbidden avoids deadlock
– Restricts shortcut selection• Results show that it effectively halves performance
• Deadlock Detection & Recovery (DDR)– Based on Duato and Pinkston’s theory [Duato and Pinkston
2001]• If deadlock occurs, route all packets in the network on a spare virtual
channel• Use deadlock-free XY-routing• Packets entering network after this point may be routed normally
39
77
How to detect deadlock…?• Rather than detect that deadlock has occurred
– Detect a sufficient condition for deadlock: circular buffer dependency
• Each router maintains a list of other routers waiting on it
• When buffer at neighbor router d is full, sender s transmits waiting-list message to neighbor– Bit vector indicating which routers are waiting on s, as well
as s’s ID– If a router is “waiting on itself,” circular buffer dependency
has occurred• Raise DEADLOCK condition
– If d’s buffer empties, s sends one time clear-waiting-list message to reset state
78
Deadlock Detection Example
• N inbound buffer at R21 fills up• R11 can’t send to R21• R11 tells R21:
– {R11} waiting on you• W inbound buffer at R11 fills up• R10 tells R11:
– {R10} waiting on you• R11 tells R21:
– {R10,R11} waiting on you• If there is circular dependence
– R21 will eventually see that it is waiting on itself DEADLOCK!
……
R11
R21
in out
inout
in
out
R10in
out…
……
…
…
…
0R0 R1 R10 R11 R21
… …0 0 1 0
1R0 R1 R10 R11 R21
… …0 0 0 0
1R0 R1 R10 R11 R21
… …0 0 1 01R0 R1 R10 R11 R21
… …0 0 1 1
40
79
Outline• Application Diversity and NoC Motivation• Adaptive RF-Interconnect in NoC Design
– RF-I Overlaid on a Mesh NoC– Shortcut Selection
• Architecture implications– Performance improvement– Power Savings– Efficient multicast support– Deadlock
• Conclusions and Future Work
80
ConclusionRF-Interconnect:• Enables adaptive NoC
– Bandwidth can be flexibly allocated– To match the communication demands of
applications• Offers dramatic power and area savings
– By simplifying baseline NoC topology– Provides performance of 16B mesh on a 4B mesh
• 62% power savings, 82% area savings
• Natural means of multicast
41
81
Future Directions• Fine-Grain Adaptation
– On-Demand or Phase-Specific Shortcut/Multicast• Message-Based Acceleration
– Application-Specific Synchronization– Cache Coherence– NUCA Migration
• Deadlock Free Routing– Application-Specific Turn Removal
• Physical Implementation
82
References[1] N. Agarwal, L-S Peh, and N. Jha. Garnet: A detailed interconnection
network model inside a full-system simulation framework. Technical Report CE-P08-001, Dept. of Electrical Engineering, Princeton University, 2007.
[2] M. Chang, J. Cong, A. Kaplan, M. Naik, G. Reinman, E. Socher and S.W. Tam, “CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect,” The 14th International Symposium on High-Performance Computer Architecture, Salt Lake City, UT, pp. 191-202, February 2008. (Best Paper Award)
[3] M. F. Chang, J. Cong, A. Kaplan, C. Liu, M. Naik, J. Premkumar, G. Reinman, E. Socher, and R. Tam, “Power Reduction of CMP Communication Networks via RFInterconnects,” Proceedings of the 41st Annual International Symposium on Microarchitecture (MICRO), Lake Como, Italy, pages 376-387, November 2008
[4] J. Cong and D.Z. Pan. Interconnect estimation and planning for deep submicron designs. In Proceedings of DAC-36, 1999.
[5] J.D. Owens, W.J. Dally, R. Ho, D.N. Jayasimha, S.W. Keckler, and L-S. Peh. Research challenges for on-chip interconnection networks. IEEE Micro, 27(5):96–108, 2007.
[6] A. Pinto, L. Carloni, and A. Sangiovanni-Vincentelli. Constraint-driven communication synthesis. In Design Automation Conference, June 2002
[7] H. Wang, X. Zhu, L-S. Peh, and S. Malik. Orion: A power performancesimulator for interconnection networks. In Proceedings of MICRO-35, November 2002.
[8] N. Jerger, L. Peh, and M. Lipasti. Virtual Circuit Tree Multicasting: A Case for Hardware Multicast Support. International Symposium on Computer Architecture, June 2008.
For updated slides of this tutorial, please go to http://cadlab.cs.ucla.edu/~cong