NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and...

47
1 NoCAlert (MICRO-2012) University of Cyprus University of Cyprus International Symposium on Microarchitecture, December 3 2012, Vancouver, Canada The Multicore Computer Architecture Laboratory (multiCAL) Ξ - Computer Architecture Research Group (Ξ - CARCH) EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures

Transcript of NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and...

Page 1: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

1 NoCAlert (MICRO-2012) University of Cyprus

University of Cyprus

International Symposium on Microarchitecture, December 3 2012, Vancouver, Canada

The Multicore Computer Architecture Laboratory (multiCAL) Ξ - Computer Architecture Research Group (Ξ - CARCH)

EuroCloud FP7 Project

NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures

Page 2: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

2 NoCAlert (MICRO-2012) University of Cyprus

Outline

Necessity of Networks-on-Chip (NoCs)

Reliability and NoCs

The NoCAlert Approach: Invariance Checking

Identifying invariances and examples

Evaluation

Results

Conclusion – Future Work

Page 3: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

3 NoCAlert (MICRO-2012) University of Cyprus

Outline

Necessity of Networks-on-Chip (NoCs)

Reliability and NoCs

The NoCAlert Approach: Invariance Checking

Identifying invariances and examples

Evaluation

Results

Conclusion – Future Work

Page 4: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

4 NoCAlert (MICRO-2012) University of Cyprus

The Network-on-Chip (NoC) paradigm

Ima

ge

co

urt

esy o

f C

. D

an

iloff

• On-chip interconnection fabric (backbone) to connect all nodes

• Modular design

• Structured Interconnect Layout

• Scalable and efficient

• Packet-based communication

Page 5: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

5 NoCAlert (MICRO-2012) University of Cyprus

Core Number Increases

Graph courtesy of www.crn.com

Intel 4004 4-bit

1971

Following Moore’s law the number of transistors/chip double approx. every 18-24 months Designers turn into integrating more cores to take advantage of parallelism

Page 6: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

6 NoCAlert (MICRO-2012) University of Cyprus

Core Number Increases

1971

2000

Page 7: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

7 NoCAlert (MICRO-2012) University of Cyprus

Core Number Increases

1971

2000 2007

Intel Core 2 Duo 2 Cores

Page 8: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

8 NoCAlert (MICRO-2012) University of Cyprus

Core Number Increases

1971

2000 2007

Intel Core i7 (Nehalem) 4 Cores

2008

Page 9: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

9 NoCAlert (MICRO-2012) University of Cyprus

Core Number Increases

1971

2000 2007 2008

AMD Opteron 2400 6 Cores

2009

Page 10: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

10 NoCAlert (MICRO-2012) University of Cyprus

Core Number Increases

1971

2000 2007 2008 2009 2010

IBM POWER7 8 Cores

Page 11: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

11 NoCAlert (MICRO-2012) University of Cyprus

Core Number Increases

1971

2000 2007 2008 2009 2010

Intel Xeon Westmere-EX 10 Cores

2011

Page 12: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

12 NoCAlert (MICRO-2012) University of Cyprus

Core Number Increases

1971

2000 2007 2008 2009 2010 2011 2012

Page 13: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

13 NoCAlert (MICRO-2012) University of Cyprus

Core Number Increases

1971

2000 2007 2008 2009 2010 2011 2012

Near Future

Intel Single-Chip Cloud Computer 48 Cores

Intel Polaris Chip 80 Cores

Page 14: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

14 NoCAlert (MICRO-2012) University of Cyprus

It’s already happening! Intel Polaris Chip

• Router is becoming part of the core design

• NoCs are becoming necessary

Tilera TILE64 – 64 Cores • 2D mesh NoC comprising • 5 independent networks

• one for each of 5 message classes

1971

2000 2007 2008 2009 2010 2011 2012

Near Future

Page 15: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

15 NoCAlert (MICRO-2012) University of Cyprus

Outline

Necessity of Networks-on-Chip (NoCs)

Reliability and NoCs

The NoCAlert Approach: Invariance Checking

Identifying invariances and examples

Evaluation

Results

Conclusion – Future Work

Page 16: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

16 NoCAlert (MICRO-2012) University of Cyprus

Reliability in the nano era • Aggressive transistor downsizing

– Increasing hardware variability

– Susceptibility to wear-out (accelerated aging effects)

• Permanent faults – Static (occurring at manufacture-time)

• Process Variability (PV), Manufacturing imperfections

– Dynamic (occurring at run-time, prolonged stressing component wear-out) • Electro-Migration (EM), Negative Bias Temperature Instability (NBTI), Oxide

breakdown, Stress-Induced Voiding (SIV), Hot Carrier Injection (HCI), etc.

• Transient faults (or Soft Errors – Single-Event Upsets, SEU) – Alpha particles (impurities in packaging/interconnect), Cosmic-ray-induced

neutrons, Neutron-induced 10B fission (interconnect layer insulator)

– Traditionally associated with memories • Error Correcting Codes (ECC) widely used in DRAM modules

Page 17: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

17 NoCAlert (MICRO-2012) University of Cyprus

Ominous predictions regarding reliability

* S.R. Nassif, N. Mehta, and Yu Cao. A resilience roadmap. In Proc. of the Design, Automation and Test in Europe Conference (DATE), 2010.

• This recent study from DATE-2010 signifies increases in failure probabilities by tens of orders of magnitude at 12 nm, as opposed to 45 nm. • Each new technology generation decreases IC lifetime by half [ITRS 2011]

Probability of failure

Impact of NBTI on failure probability trends

Challenge: “Designing reliable systems from unreliable components” *

* S. Borkar, "Designing reliable systems from unreliable components: the challenges of transistor variability and degradation," in IEEE Micro, Nov/Dec 2005.

Page 18: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

18 NoCAlert (MICRO-2012) University of Cyprus

NoC (un)reliability implications

• A single fault in the NoC can cause:

– Network disconnections

– Deadlocks (Network and Protocol-level)

– Lost packets

– Degraded performance

A single fault can paralyze the entire system (CMP)

• Protecting the NoC is of paramount importance

Page 19: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

19 NoCAlert (MICRO-2012) University of Cyprus

Outline

Necessity of Networks-on-Chip (NoCs)

Reliability and NoCs

The NoCAlert Approach: Invariance Checking

Identifying invariances and examples

Evaluation

Results

Conclusion – Future Work

Page 20: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

20 NoCAlert (MICRO-2012) University of Cyprus

NoCAlert: The Big Picture

• NoCAlert:

– Lightweight distributed invariance checkers

– Checkers behave like hardware assertions

– Checks for legality, not correctness

– Network’s operation is never interrupted

– Provides almost instantaneous fault detection

• Assumption:

– Packet/flit contents are protected with ECC

• NoCAlert protects against faults in the control logic

• Interesting observation

– Erroneous but legal module outputs are always benign

Page 21: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

21 NoCAlert (MICRO-2012) University of Cyprus

NoCAlert’s Terminology

• Invariance violation: – The breaking of a fundamental functional rule within the

context of a component’s operation

– e.g., the routing computation unit outputs an illegal direction

• Legality: – Illegal is an output that is impossible to occur, based on the set

of functional correctness rules of a given component

• Instantaneous fault detection: – Detect a fault as soon as it manifests (same clock cycle)

– Easier to recover

– Localized information could identify faulty location

Page 22: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

22 NoCAlert (MICRO-2012) University of Cyprus

Invariance Checking

• System is continuously (on-line) examined for illegal outputs

– An illegal output can be the result of some kind of fault

• Emulates assertions used in software

• Example: Assume a variable X cannot get the value 5

– assert(X!=5)

– In hardware this would be achieved with a comparison unit that raises a flag

Page 23: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

23 NoCAlert (MICRO-2012) University of Cyprus

Outline

Necessity of Networks-on-Chip (NoCs)

Reliability and NoCs

The NoCAlert Approach: Invariance Checking

Identifying invariances and examples

Evaluation

Results

Conclusion – Future Work

Page 24: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

24 NoCAlert (MICRO-2012) University of Cyprus

A generic (typical) NoC router micro-architecture

5x5 Crossbar

West Out

VA1Arbitration

VA2Arbitration

SA1Arbitration

SA2Arbitration

SA2 arbiters control the XBAR connections

Local Arbitration: Choose one specific

output VC in adjacent router

Global Arbitration:

Resolve global conflicts

Local Arbitration: One winning VC in

each port

Global Arbitration: Resolve global

conflicts

Routing Computation

Routing Computation:

Next-hop direction

RCVC0

RCVC1

RCVC2

RCVC3

One flit capacity

RCVC0

RCVC1

RCVC2

RCVC3

One flit capacity

RCVC0

RCVC1

RCVC2

RCVC3

One flit capacity

RCVC0

RCVC1

RCVC2

RCVC3

One flit capacity

RCVC0

RCVC1

RCVC2

RCVC3

One flit slot

West In

East In

VC ID

North InSouth In

Processing Element In

InputPort

East OutNorth OutSouth Out

Processing Element Out

Router Pipeline

* Network packets are broken into multiple flits. A flit is a flow control unit and it is the smallest unit of flow control in the NoC.

Page 25: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

25 NoCAlert (MICRO-2012) University of Cyprus

Identifying invariances within the NoC Router

• Identification of invariances relies on the modularity and hierarchy of the NoC Router

• The functional algorithm of each module is exhaustively inspected using a bottom-up approach – Identification of all the functional rules

– Identification of all the functionally illegal outputs

• End-to-end invariances at the network-level are identified

Network Level

Router Level

Input Port

FIFO Buffers

RC Unit

VA and SA

Arbiters

Crossbar Switch

Page 26: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

26 NoCAlert (MICRO-2012) University of Cyprus

Invariance categorization

• 32 invariances have been identified through detailed exploration of the router’s microarchitecture

• Identified invariances are categorized based on the router module they are associated with

– Routing Computation unit (3)

– Arbiters (10)

– Crossbar (3)

– Buffer State (12)

– Port-Level (3)

– End-to-End (network-level) (1)

Page 27: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

27 NoCAlert (MICRO-2012) University of Cyprus

Ensuring network correctness

• Which conditions must a reliable network satisfy?

• Four main conditions that ensure

functional correctness within

the network* – No packets are dropped

– Delivery time is bounded

– No data corruption occurs

– No new packet is generated

within the network

• Additional requirement:

Intra-packet flit ordering

* D. Borrione et al. A generic model for formally verifying NoC communication architectures: A case study. In NOCS 2007

Page 28: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

28 NoCAlert (MICRO-2012) University of Cyprus

Routing Algorithm

• Routing algorithms forbid some turns to avoid deadlocks and livelocks in the network

• E.g., Dimension-order XY routing

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R S (0,0)

D (2,3)

Page 29: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

29 NoCAlert (MICRO-2012) University of Cyprus

Invariance Example – Routing Algorithm

• Invariance violation due to forbidden turn according to the specification of the XY routing algorithm

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R S (0,0)

D (3,3)

Forbidden Turn

Page 30: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

30 NoCAlert (MICRO-2012) University of Cyprus

Invariance Example - Arbiters

0 0 0 0 1 0

0 0 0 0

5:1 Arbiter

Req

ues

ts G

rants

5:1 Arbiter

1 1 0 0 1

1 0 0 0 1

Gran

ts Req

ues

ts

• Grant is not allowed without a corresponding request

• Arbiter’s output must always be 1-hot • Can’t assign one resource (output port or VC)

to multiple contestants

0 0 0 0 0 1

0 0 1 0

5:1 Arbiter

Req

ues

ts G

rants

• If at least one request exists, the arbiter must grant one of the contestants

Page 31: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

31 NoCAlert (MICRO-2012) University of Cyprus

Faults that do not cause invariance violations

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R

R S (0,0)

D (3,3)

Forbidden Turn

Legal Turn

• Invariance checking only detects illegal outputs

• Does not necessarily detect incorrect outputs

Page 32: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

32 NoCAlert (MICRO-2012) University of Cyprus

Faults that do not cause invariance violations (Cont.)

• Two elemental questions arising by this kind of faults:

1. If such non-invariant upsets cause some other functional violation later on in the network, will the fault be caught by subsequent NoCAlert checkers?

2. If these non-invariant upsets are never caught by any subsequent NoCAlert checker, do they end up affecting the overall network correctness?

Page 33: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

33 NoCAlert (MICRO-2012) University of Cyprus

Outline

Necessity of Networks-on-Chip (NoCs)

Reliability and NoCs

The NoCAlert Approach: Invariance Checking

Identifying invariances and examples

Evaluation

Results

Conclusion – Future Work

Page 34: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

34 NoCAlert (MICRO-2012) University of Cyprus

Evaluation framework

• Tools used:

– GARNET cycle-accurate NoC simulator

• Used for extensive experimentation under fault presence

– Synopsys Design Compiler (for hardware synthesis)

• Used to assess the hardware overhead of NoCAlert

• Based on full Verilog HDL implementation of NoCAlert and synthesis using 65 nm commercial standard-cell libraries

• Compared against ForEVeR*, the current state-of-the-art

* R. Parikh and V. Bertacco, "Formally enhanced runtime verification to ensure noc functional correctness," In Proc. of the International Symposium on Microarchitecture (MICRO), 2011.

Page 35: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

35 NoCAlert (MICRO-2012) University of Cyprus

The ForEVeR mechanism *

• Epoch-based on-line fault detection mechanism – Achieved with the help of an additional lightweight checker network that is

assumed to be 100% reliable

– Contains run-time checks for arbitration stages and End-to-End coverage

• Counter-based scheme that uses notification packets

• Fault assessment occurs at the end of each epoch – If counter values not reconciled, a recovery mechanism is triggered

– In-flight data delivered to the intended destination via the checker network

• BUT, how to choose epoch duration (sensitivity to traffic injection rate, application-specific, etc.) False positives even in a fault-free environment!

* R. Parikh and V. Bertacco, "Formally enhanced runtime verification to ensure noc functional correctness," In Proc. of the International Symposium on Microarchitecture (MICRO), 2011.

Page 36: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

36 NoCAlert (MICRO-2012) University of Cyprus

Fault-injection framework

• Fault model: Single-bit, single-event transient faults

• Faults were injected at the inputs and outputs of every control module of a router

– RC units, VA and SA arbiters, Crossbar, and Status Tables

– One fault injected in each experiment

• Total number of fault locations:

– 11,808 for 8x8 2D mesh network

Logic Module

Checker Module

Logic Module

Checker Module

(a) (b)

Invariance Assertion

Input

Invariance Assertion

Output Output

Single-Bit FaultSingle-Bit Fault

Input

Page 37: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

37 NoCAlert (MICRO-2012) University of Cyprus

The Golden Reference report

• For each experiment, generate a Golden Reference (GR) report:

– A log of the entire network’s output (flit ejections), under a fault-free run.

– “Oracle” knowledge of what should normally happen

• “Contaminated” Logs are compared against the GR

– If all flits were delivered correctly (remember the four rules), and intra-packet order was maintained, the fault was benign (no system-level effect)

– Note that the global order of packets is allowed to change

Page 38: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

38 NoCAlert (MICRO-2012) University of Cyprus

Network’s State Affects Fault Detection

• The state of the network can influence fault detection behavior:

– Faults in an empty network are less likely to be masked

– Warmed-up networks might “hide” faults

• Need for testing at different states

– 7 different traffic injection ratios (10-40% in 5% increments)

– 3 different fault injection instances

• Faults were injected at cycle 0 (empty network), cycle 32 K, and cycle 64 K (warmed-up network)

– Resulting in a total of approx. 248 K simulations

Page 39: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

39 NoCAlert (MICRO-2012) University of Cyprus

Classification of NoCAlert’s detection outcomes

• Four main fault detection categories: – True positive: Detected non-benign fault

– True negative: Non-detected benign fault

– False positive: Detected benign fault • Can cause unnecessary fault recovery triggering

– False negative: Non-detected non-benign fault • Worst case

• Ideally, this should be ZERO

• Non-benign: Comparison against GR failed

• Benign: Successful comparison against GR

Page 40: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

40 NoCAlert (MICRO-2012) University of Cyprus

Outline

Necessity of Networks-on-Chip (NoCs)

Reliability and NoCs

The NoCAlert Approach: Invariance Checking

Identifying invariances and examples

Evaluation

Results

Conclusion – Future Work

Page 41: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

41 NoCAlert (MICRO-2012) University of Cyprus

Fault coverage breakdown

• Observations: – 0% false negatives for both schemes (no malicious faults escape) – False-positive percentages higher in a warmed-up network

• More faults are masked

– NoCAlert behaves slightly worse than ForEVeR in terms of false positives (ForEVeR is an epoch-based mechanism some faults vanish by end of epoch)

51.64 51.64 38.45 38.45 38.70 38.70

30.62 27.76 45.33 42.56 45.15 39.32

17.73 20.59 16.22 18.99 16.15 21.98

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

NoCAlert ForEVeR NoCAlert ForEVeR NoCAlert ForEVeR

Cycle 0 Cycle 32000 Cycle 64000

True Positive False Positive True Negative

Page 42: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

42 NoCAlert (MICRO-2012) University of Cyprus

Fault detection latency

• Observations:

– 97% of NoCAlert’s fault detections were instantaneous

– Significant fault detection latency improvement

• Up to 100x

– NoCAlert: on-line mechanism instantaneous detection

0

10

20

30

40

50

60

70

80

90

100

1 8 64 512 4096

Fau

lts

Det

ect

ed

(%

)

Detection Delay (Cycles)

NoCAlert

ForEVeR

Cumulative fault detection delay

distribution

Page 43: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

43 NoCAlert (MICRO-2012) University of Cyprus

Hardware overhead (synthesis using 65 nm libraries)

• Area overhead: ranges from 1.38% to 4.42% (3% average)

• Power* overhead : 0.3% - 1.3% (0.7% average)

• Critical path overhead: At most 3% (1% average)

• DMR-CL: Use Double Modular Redundancy (DMR) protection for control logic

• NoCAlert: Use NoCAlert protection for control logic

0

10

20

30

40

50

60

70

80

90

100

0

1

2

3

4

5

6

7

8

9

10

11

12

2VCs 3VCs 4VCs 5VCs 6VCs 7VCs 8VCs

Are

a O

verh

ead

Pe

rce

nta

ge

Are

a (N

orm

alis

ed

)

Number of VCs per Port

Baseline

NoCAlert

DMR-CL

NoCAlert Area Overhead (%)

DMR-CL Area Overhead (%)

* The power numbers were extracted from the Synopsys Design Compiler power report, with switching activity set to 50% for all nets. The operating voltage is 1 V.

Page 44: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

44 NoCAlert (MICRO-2012) University of Cyprus

Outline

Necessity of Networks-on-Chip (NoCs)

Reliability and NoCs

The NoCAlert Approach: Invariance Checking

Identifying invariances and examples

Evaluation

Results

Conclusion – Future Work

Page 45: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

45 NoCAlert (MICRO-2012) University of Cyprus

Conclusion

• Comprehensive on-line and real-time fault detection mechanism

• Ensures 0% false negatives within the NoC

• Based on the concept of invariance checking

– A collection of micro-checker modules dispersed throughout the router’s control logic modules

– Real-time hardware assertions

• Tremendous improvement in detection delay

• Extremely lightweight nature of NoCAlert in terms of area/power/timing overhead

Page 46: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

46 NoCAlert (MICRO-2012) University of Cyprus

Future Work

• Formally prove NoCAlert’s full coverage

• Fault localization

– Take advantage of the localized information provided by NoCAlert

– Try to pinpoint the faulty location

• Fault Recovery

– After fault localization

– Techniques to bypass faulty modules

Page 47: NoCAlert: An On-Line and Real-Time Fault Detection ...EuroCloud FP7 Project NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures . University

47 NoCAlert (MICRO-2012) University of Cyprus

Thank you!

QUESTIONS?