ECE 5745 Complex Digital ASIC Design Course Overview · Facebook Oculus I Starting to design ASIC...
Transcript of ECE 5745 Complex Digital ASIC Design Course Overview · Facebook Oculus I Starting to design ASIC...
ECE 5745 Complex Digital ASIC DesignCourse Overview
Christopher Batten
School of Electrical and Computer EngineeringCornell University
http://www.csl.cornell.edu/courses/ece5745
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
RTL
Devices
ISA
PL
Algorithm
μArch
Technology
Application
OS
Gates
Circuits
Complex Digital ASIC Design
I Course goal, structure, motivation. What is the goal of the course?. Why should students want to take this course?. How is the course structured?
I Activity 1: Evaluation of Integer Multiplier
I Case Study: Scalar vs. Vector Processors. Example design-space exploration. Example real ASIC chips
I Activity 2: Brainstorming for Sorting Accelerator
ECE 5745 Course Overview 2 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
The Computer Engineering Stack
Technology
Application
Gap too large to bridge in one step(but there are exceptions e.g., magnetic compass)
In its broadest definition, computer architecture is thedesign of the abstraction/implementation layers that allow us to
execute information processing applications efficientlyusing available manufacturing technologies
ECE 5745 Course Overview 3 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
The Computer Engineering Stack
In its broadest definition, computer architecture is thedesign of the abstraction/implementation layers that allow us to
execute information processing applications efficientlyusing available manufacturing technologies
Circuits
Devices
Programming Language
Algorithm
Co
mp
ute
rA
rch
itectu
re
Technology
Application
Register-Transfer Level
Instruction Set Architecture
Microarchitecture
Operating System
Gate Level
ECE 5745 Course Overview 3 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
What is Computer Architecture?
In its broadest definition, computer architecture is thedesign of the abstraction/implementation layers that allow us to
execute information processing applications efficientlyusing available manufacturing technologies
Circuits
Devices
Programming Language
Algorithm
Co
mp
ute
rA
rch
itectu
re
Technology
Application
Register-Transfer Level
Instruction Set Architecture
Microarchitecture
Operating System
Gate Level
Operating System
Gate Level
Register-Transfer Level
Instruction Set Architecture
Microarchitecture
Application Requirements • Suggest how to improve architecture • Provide revenue to fund development
Technology Constraints • Restrict what can be done efficiently • New technologies make new arch possible
Architecture provides feedback to guideapplication and technology research directions
ECE 5745 Course Overview 4 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Key Metrics in Computer Architecture
P
I$
D$
Network
Network
Network
P
I$
P
I$
P
I$
D$ D$ D$
I Primary Metrics. Execution time (cycles/task). Energy (Joules/task). Cycle time (ns/cycle). Area (µm2)
I Secondary Metrics. Performance (ns/task). Average power (Watts). Peak power (Watts). Cost ($). Design complexity. Reliability. Flexibility
Discuss qualitative first-order analysisfrom ECE 4750 on board
ECE 5745 Course Overview 5 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Unanswered Questions from ECE 4750
P
I$
D$
Network
Network
Network
Xcel P
I$
D$ D$ D$
Xcel
AcceleratedInstructions
I How can we quantitatively evaluatearea, cycle time, and energy?
I How do we actually implementprocessors, memories, andnetworks in a real chip?
I How should we implement/analyzeapplication-specific accelerators?
. Very loosely coupledmemory-mapped accelerators
. More tightly coupled co-processoraccelerators
. Specialized instructions andfunctional units
ECE 5745 Course Overview 6 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
ASIC: Application-Specific Integrated Circuit
P
I$
D$
Network
Network
Network
Xcel P
I$
D$ D$ D$
Xcel
AcceleratedInstructions
Performance (Tasks per Second)
SimpleProc
Processor PowerConstraint
En
erg
y (
Jo
ule
s p
er
Ta
sk)
C
B
AAAA
Superscalarw/ Deeper
Pipelines
Out-of-Order Superscalar
Superpipelined
D
EE
MulticoreFF
Multicore E
AAAA
Specialized Accelerators
ECE 5745 Course Overview 7 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
ASIC: Application-Specific Integrated Circuit
P
I$
D$
Network
Network
Network
Xcel P
I$
D$ D$ D$
Xcel
AcceleratedInstructions
Performance (Tasks per Second)
En
erg
y E
ffic
ien
cy (
Ta
sks p
er
Jo
ule
)
SimpleProcessor
Design PowerConstraint
High-PerformanceArchitectures
EmbeddedArchitectures
DesignPerformanceConstraint
Flexibility vs
. Spe
cializat
ion
CustomASIC
Less FlexibleAccelerator
More FlexibleAccelerator
ECE 5745 Course Overview 8 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Goal for ECE 5745 is to answer these questions!
P
I$
D$
Network
Network
Network
Xcel P
I$
D$ D$ D$
Xcel
AcceleratedInstructions
I How can we quantitatively evaluatearea, cycle time, and energy?
I How do we actually implementprocessors, memories, andnetworks in a real chip?
I How should we implement/analyzeapplication-specific accelerators?
. Very loosely coupledmemory-mapped accelerators
. More tightly coupled co-processoraccelerators
. Specialized instructions andfunctional units
ECE 5745 Course Overview 9 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Full Custom Design vs. Standard-Cell Design
L01 –Introduction 22
6.884 –S
pring 20052 Feb 2005
Full Custom D
esignDesigner is free to do anything, anywhere
–though each design team
usually imposes som
e disciplineM
ost time consum
ing design style–
Reserved for very high performance or very high volum
e devices (Intel m
icroprocessors, RF power amps for cellphones)
Requires complete custom
ization of all layers of wafer
Piece of full-custom m
ultiplier array, 1.0Pm
2-metal
Full-custom layoutin 1.0µm w/ 2 metal
layers
I Full-Custom Design (ECE 4740)
. Designer is free to do anything, anywhere; thoughteam usually imposes some design discipline
. Most time consuming design style; reserved forvery high performance or very high volume chips(Intel microprocessors, RF power amps forcellphones)
I Standard-Cell Design (ECE 5745)
. Fixed library of “standard cells” and SRAMmemory generators
. Register-transfer-level description is automaticallymapped to this library of standard cells, then thesecells are placed and routed automatically
. Enables agile hardware design methodology
ECE 5745 Course Overview 10 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Standard-Cell Design Methodology
L01 – Introduction 246.884 – Spring 2005 2 Feb 2005
Standard Cell ASICs• Also called Cell-Based ICs (CBICs)• Fixed library of cells plus memory generators• Cells can be synthesized from HDL, or entered in schematics• Cells placed and routed automatically• Requires complete set of custom masks for each design• Currently most popular hard-wired ASIC type (6.884 will use this)
Mem1 Mem
2
Cells arranged in rows
Generated memory arrays
L01 – Introduction 256.884 – Spring 2005 2 Feb 2005
Standard Cell DesignCells have standard height but vary in widthDesigned to connect power, ground, and wells by abutment
VDD Rail
GND Rail
Clock Rail
Cell I/O on M2Power
Rails in M1
Clock Rail (not typical)
NAND2 Flip-flop
Well Contact under Power Rail
Ripple carry adder with carrychain highlighted
ECE 5745 Course Overview 11 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Standard-Cell Design Methodology
Design in HDL
HDLSimulator
Place&Route
Switching Activity
PowerAnalysis
Synthesis
Standard Cells
Gate-Level Model
Layout
Area (μm2)
Execution Time(cycles/task)
Cycle Time (ns)
Energy (J/task)
Area (μm2)
Cycle Time (ns)
Energy (J/task)
Energy (J/task)
ECE 5745 Course Overview 12 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Example Standard-Cell Chip PlotMotivation Architectural Patterns Scale VT Core Maven VT Core Evaluation
Single-Lane Vector-Thread Unit w/ 256 Registers
MIT CSAIL Christopher Batten 32 / 42ECE 5745 Course Overview 13 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
What is Complex Digital ASIC Design?
Circuits
Devices
Programming Language
Algorithm
Co
mp
ute
rA
rch
itectu
re
Technology
Application
Register-Transfer Level
Instruction Set Architecture
Microarchitecture
Operating System
Gate Level
Circuits
Register-Transfer Level
Microarchitecture
Gate Level
layout ready for fabrication
Complex digital ASIC design isthe process of
quantitatively exploring the
area, cycle time, execution time, andenergy trade-offs
of various
automated standard-cell CAD tools and then to transform the most
promising design to
using
application-specific accelerators(and general-purpose proc+mem+net)
ECE 5745 Course Overview 14 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
RTL
Devices
ISA
PL
Algorithm
μArch
Technology
Application
OS
Gates
Circuits
Complex Digital ASIC Design
I Course goal, structure, motivation. What is the goal of the course?. Why should students want to take this course?. How is the course structured?
I Activity 1: Evaluation of Integer Multiplier
I Case Study: Scalar vs. Vector Processors. Example design-space exploration. Example real ASIC chips
I Activity 2: Brainstorming for Sorting Accelerator
ECE 5745 Course Overview 15 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Technology Scaling is Slowing
1940 1955 1970 1985 2000 2015 2030
Syste
m P
erf
orm
an
ce
VacuumTube
DiscreteTransistor
IntegratedBipolar
IntegratedCMOS
7nm, ~50B Transistors
New Technology?
TechnologyFallowPeriod?
New Technology?
Golden Ageof ChipDesign?
OR
ECE 5745 Course Overview 16 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Example Application Domain: Image Recognition
Starfish
Dog
ECE 5745 Course Overview 17 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Machine Learning: Training vs. Inference
forward"starfish"
Training
many images
Model
=? "dog"
labels
errorbackward
Inferenceforward
"dog"
fewimages
ECE 5745 Course Overview 18 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
ImageNet Large-Scale Visual Recognition ChallengeT
op
5 E
rro
r R
ate
28%26%
16%
12%
7%
'10
3.6% 2.3%3%
'11 '12 '13 '14 '15 '16 '17
HumanError Rate
Software: Deep Neural Network
Hardware: Graphics Processing Units0 0
14%
En
trie
s U
sin
g G
PU
s
74%
89%~100%
ECE 5745 Course Overview 19 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
ASIC Design for ML in the Cloud
NVIDIA DGX-1I Graphics processor
specialized just formachine learning
I Available as part of acomplete system withboth the software andhardware designed byNVIDIA
Intel NervanaI Custom ASIC specifically
designed to acceleratemachine learning
I Special hardware supportfor Winograd algorithms
I Collaboration betwenIntel and Facebook
I Maybe Facebook is alsobuilding their ownin-house chips?
Google TPU
I Custom ASIC specificallydesigned to accelerateGoogle’s TensorFlowC++ library
I Tightly integrated intoGoogle’s data centers
I 15–30× faster thancontemporary CPU andGPUs
ECE 5745 Course Overview 20 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
ASIC Design for ML at the Edge
Google Pixel
I Designed custom ASICcalled the Pixel VisualCore for imageprocessing
I PVC enables 5xperformanceimprovement inhigh-dynamic-rangeimaging
Facebook OculusI Starting to design ASIC
chips for Oculus VRheadsets
I Significant performancedemands under strictpower requirements
Amazon EchoI Developing AI ASICs so
Echo line can do moreon-board processing
I Reduces need forround-trip to cloud
I Co-design the algorithmsand the underlyinghardware
ECE 5745 Course Overview 21 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Top-five software companies areall making chips
I Facebook: w/ Intel, in-house AI chips?I Amazon: Echo, Oculus, networking chipsI Microsoft: Hiring for AI chips?I Google: TPU, Pixel, convergence?I Apple: SoCs for phones, wireless chips
Chip startup ecosystem formachine learning is thriving!
I GraphcoreI NervanaI CerebrasI Wave ComputingI Horizon RoboticsI CambriconI DeePhiI EsperantoI SambaNovaI EyerissI TenstorrentI MythicI ThinkForceI GroqI Lightmatter
ECE 5745 Course Overview 22 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
The field of complex digital ASIC designis experiencing a disruptive
sea change and has a critical choice:
1. A technological fallow period
2. A golden age of ASIC design
This course will help you appreciate and possiblycontribute to this golden age!
ECE 5745 Course Overview 23 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Course Motivation: Comp Arch Research Perspective
Transistors(Thousands)
MIPSR2K
IntelP4
Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten
1975 1980 1985 1990 1995 2000 2005 2010 2015
100
101
102
103
104
105
106
107
DECAlpha 21264
TypicalPower (W)
Frequency(MHz)
SPECintPerformance
~9%/year
~15%/year
Numberof Cores
Intel 48-Core Prototype
AMD 4-Core OpteronParallelizationandSpecialization
ECE 5745 Course Overview 24 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Cross-Layer Interaction is Critical
Circuits
Devices
Programming Language
Algorithm
Co
mp
ute
rA
rch
itectu
re
Technology
Application
Register-Transfer Level
Instruction Set Architecture
Microarchitecture
Operating System
Gate Level
Circuits
Register-Transfer Level
Microarchitecture
Gate Level
Need to quantitatively understand
area, cycle time, and energy
trade-offs to be able to create
new revolutionary architectures
that tackle the most
challenging physical design
issues
ECE 5745 Course Overview 25 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Course Motivation: Circuits Research Perspective
Your Digital Circuit Here
Fighter Airplane
~100,000 parts
Your Analog Circuit Here
Intel Sandy Bridge E
2.27 Billion transistors
ECE 5745 Course Overview 26 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Cross-Layer Interaction is Critical
Circuits
Devices
Programming Language
Algorithm
Co
mp
ute
rA
rch
itectu
re
Technology
Application
Register-Transfer Level
Instruction Set Architecture
Microarchitecture
Operating System
Gate Level
Circuits
Register-Transfer Level
Microarchitecture
Gate Level
Circuit-level researchers need to
appreciate the system-level
context for their subsystems;
cross-layer interaction can
generate some of the
most exciting research ideas
ECE 5745 Course Overview 27 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
RTL
Devices
ISA
PL
Algorithm
μArch
Technology
Application
OS
Gates
Circuits
Complex Digital ASIC Design
I Course goal, structure, motivation. What is the goal of the course?. Why should students want to take this course?. How is the course structured?
I Activity 1: Evaluation of Integer Multiplier
I Case Study: Scalar vs. Vector Processors. Example design-space exploration. Example real ASIC chips
I Activity 2: Brainstorming for Sorting Accelerator
ECE 5745 Course Overview 28 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Course Structure
Part 1ASIC Design
Overview
Part 2Digital CMOS
Circuits
Part 3CAD Algorithms
P P
MM
PrereqComputer
Architecture
ECE 5745 Course Overview 29 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Part 1: ASIC Design Overview
P P
MM
Topic 1Hardware
DescriptionLanguages
Topic 2CMOS Devices
Topic 3CMOS Circuits
Topic 4Full-Custom
DesignMethodology
Topic 5Automated
DesignMethodologies
Topic 7Clocking, Power Distribution,
Packaging, and I/O
Topic 8Testing and Verification
Topic 6Closing
theGap
ECE 5745 Course Overview 30 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Part 2: Digital CMOS Circuits
Topic 1
0
Sequential
StateTopic
11
Interconnect
Topic 9
Combinational
Logic
ECE 5745 Course Overview 31 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Part 3: CAD Algorithms
RTL to Logic
Synthesis
TechnologyIndependent
Synthesis
TechnologyDependentSynthesis
x = a'bc + a'bc'y = b'c' + ab' + ac
x = a'by = b'c' + ac
Placement
DetailedRouting
Global
Routing
Topic 12Synthesis Algorithms
Topic 13Physical Design Automation
ECE 5745 Course Overview 32 / 52
• Complex Digital ASIC Design • Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
Five-Week Design Project
P
I$
D$
Network
Network
Network
Xcel P
I$
D$ D$ D$
Xcel
AcceleratedInstructions
Performance (Tasks per Second)
En
erg
y E
ffic
ien
cy (
Ta
sks p
er
Jo
ule
)
SimpleProcessor
Design PowerConstraint
High-PerformanceArchitectures
EmbeddedArchitectures
DesignPerformanceConstraint
Flexibility vs
. Spe
cializat
ion
CustomASIC
Less FlexibleAccelerator
More FlexibleAccelerator
ECE 5745 Course Overview 33 / 52
Complex Digital ASIC Design • Activity 1 • Case Study: Scalar vs. Vector Processors Activity 2
RTL
Devices
ISA
PL
Algorithm
μArch
Technology
Application
OS
Gates
Circuits
Complex Digital ASIC Design
I Course goal, structure, motivation. What is the goal of the course?. Why should students want to take this course?. How is the course structured?
I Activity 1: Evaluation of Integer Multiplier
I Case Study: Scalar vs. Vector Processors. Example design-space exploration. Example real ASIC chips
I Activity 2: Brainstorming for Sorting Accelerator
ECE 5745 Course Overview 34 / 52
Complex Digital ASIC Design • Activity 1 • Case Study: Scalar vs. Vector Processors Activity 2
Fixed-Latency Iterative Multiplier Datapath
b_regb_mux_sel
32b
a_rega_mux_sel
32b
result_reg
result_mux_sel
result_en
32b
32b
32b
>>
<<
add_mux_sel
req_msg.areq_msg
32b resp_msg
req_msg.b
b_lsb
0
ECE 5745 Course Overview 35 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
RTL
Devices
ISA
PL
Algorithm
μArch
Technology
Application
OS
Gates
Circuits
Complex Digital ASIC Design
I Course goal, structure, motivation. What is the goal of the course?. Why should students want to take this course?. How is the course structured?
I Activity 1: Evaluation of Integer Multiplier
I Case Study: Scalar vs. Vector Processors. Example design-space exploration. Example real ASIC chips
I Activity 2: Brainstorming for Sorting Accelerator
ECE 5745 Course Overview 36 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Scalar Processors with Multithreading
Programmer'sLogicalView
μT0
Memory
μT1 μT2 μT3 μT4 μTi
ExecutionResources
ControlInstructions
MemoryInstructions
ArchitecturalRegisters
ECE 5745 Course Overview 37 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Scalar Processors with Multithreading
Programmer'sLogicalView
μT0
Memory
μT1 μT2 μT3 μT4 μTi
loop: load a, a_ptr load b, b_ptr add c, a, b store c, c_ptr a_ptr++ b_ptr++ c_ptr++ branch
Vector-Vector Add Masked Filter
loop: load a, a_ptr a_ptr++ branch a = 0 op0 op1 ... branch
ECE 5745 Course Overview 37 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Scalar Processors with Multithreading
Programmer'sLogicalView
μT0
Memory
μT1 μT2 μT3 μT4 μTi
TypicalCoreMicro-Architecture
En
erg
y
Performance
MIMDμT0
μT1
μT2
μT3
Instr Memory
Data Memory
Multi-threadedCores
ECE 5745 Course Overview 37 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Vector-SIMD Processors
Programmer'sLogicalView
CT0
elm 0 elm 1 elm 2 elm 3
Memory
CTj
elm i
ArchitecturalVector Registerwith 4 Elements
Vector-SIMDArithmetic
Instructions
Vector-SIMDMemory
Instructions
ECE 5745 Course Overview 38 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Vector-SIMD Processors
Programmer'sLogicalView
CT0
elm 0 elm 1 elm 2 elm 3
Memory
CTj
elm i
loop: vload A, a_ptr vload B, b_ptr vadd C, A, B vstore C, c_ptr a_ptr++ b_ptr++ c_ptr++ branch
loop: vload A, a_ptr vset F, A = 0 vop0 under flag F vop1 under flag F ... a_ptr++ branch
Vector-Vector Add Masked Filter
ECE 5745 Course Overview 38 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Vector-SIMD Processors
Programmer'sLogicalView
CT0
elm 0 elm 1 elm 2 elm 3
Memory
CTj
elm i
TypicalCoreMicro-Architecture 0
2
1
3
Instruction Memory
CP VIU
VMU
Data Memory
Ve
cto
r L
an
es
En
erg
y
Performance
MIMD
Vector-SIMD
ECE 5745 Course Overview 38 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Quantitative Area EvaluationMotivation Architectural Patterns Scale VT Core Maven VT Core Evaluation
Single-Lane Vector-Thread Unit w/ 256 Registers
MIT CSAIL Christopher Batten 32 / 42ECE 5745 Course Overview 39 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Quantitative Area Evaluation
Quad-Corew/ Vertical
Multithreading
Vector-SIMD
(8 elm/lane)
1 t
hre
ad
2 t
hre
ad
s4
th
rea
ds
8 t
hre
ad
s
1 c
ore
w/
4 la
ne
s
4 c
ore
s w
/ 1
lan
e e
a
data cacheinstruction cache
control processor
integer functional units
floating-point functional unitsmemory unitsregister filecontrol logic
Single-core multi-lane designreduces area by 15%
Multi-core single-lane design
increases area by 20%
1.75
1.25
1.00
0.50
0.00
0.25
0.75
1.50
No
rma
lize
d A
rea
ECE 5745 Course Overview 40 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Quantitative Performance and Energy Evaluation
0.8 1.0 1.2 1.4 1.6
Normalized Tasks / SecondNo
rma
lize
d E
ne
rgy / T
ask
1.6
1.4
1.2
1.0
0.8
0.6
0.4
25
20
15
10
5
0En
erg
y / T
ask (μ
J)
Multithreaded Multicore(increasing number ofthreads per core)
Quad-Corew/ Vertical
Multithreading
1 th
rea
d
2 th
rea
ds
4 th
rea
ds
8 th
rea
ds
Performance reduction with increasingthreads due to increased cycle time and
thread management overhead onfine-grain loops
Single-Lane Vector(increasing vlen)
Vector-SIMD
(4 core w/ 1 lane)
data cacheinst cache
control procint func unitsmemory unitsregister filecontrol logic
1 e
lm/la
ne
2 e
lm/la
ne
4 e
lm/la
ne
8 e
lm/la
ne
leakage
ECE 5745 Course Overview 41 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Simple RISC Processor ASIC
SP Datapath SP Regfile
RAMSubbank
(2KB)
RAMSubbank
(2KB)
RAMSubbank
(2KB)
RAMSubbank
(2KB)
AH
IPC
on
tro
ller
SP Control
RAM Interface
VCO
Figure 2: STC1 chip plot and die photo. On the chip plot, the Exec Unit is colored blue, and theFetch Unit is colored green and purple.
HostPC
PLX9050
ATB0Controller
PowerSupplies
ProbePoints
PCI
SDRAM
STC1
Host ATB0 Daughter Card
Figure 3: Test System Block Diagram.The test system includes the ATB0test baseboard, a daughter card withthe STC1 chip, and a host computerto control the test system.
5
ECE 5745 Course Overview 42 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Simple RISC Processor ASIC
1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 41 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
750 ------------- . . . . . . . . . . . . . . . . . . . . . . . . .740 . . . . . . . . . . . . . . . . . . . . . . . . .730 . . . . . . . . . . . . . . . . . . . . . . . . .720 . . . . . . . . . . . . . . . . . . . . . . . . .710 . . . . . . . . . . . . . . . . . . . . . . * * .700 ------------- . . . . . . . . . . . . . . . . . . . . * * * * .690 . . . . . . . . . . . . . . . . . . . * * * * * .680 . . . . . . . . . . . . . . . . . * * * * * * . .670 . . . . . . . . . . . . . . . . * * * * * * * . .660 . . . . . . . . . . . . . . . . * * * * * * * . .650 ------------- . . . . . . . . . . . . . . * * * * * * * * * * .640 . . . . . . . . . . . . . . * * * * * * * * * * .630 . . . . . . . . . . . . . * * * * * * * * * * . .620 . . . . . . . . . . . . * * * * * * * * * * * * .610 . . . . . . . . . . . * . * * * * * * * * * * . .600 ------------- . . . . . . . . . . . * * * * * * * * * * * * . .590 . . . . . . . . . . * * * * * * * * * * * * * * .580 . . . . . . . . . . * * * * * * * * * * * * * . .570 . . . . . . . . . * * * * * * * * * * * * * * * *560 . . . . . . . . . * * * * * * * * * * * * * * * *550 ------------- . . . . . . . * * * * * * * * * * * * * * * * * .540 . . . . . . . * * * * * * * * * * * * * * * * . *530 . . . . . . . * * * * * * * * * * * * * * * * * *520 . . . . . . * * * * * * * * * * * * * * * * * * *510 . . . . . . * * * * * * * * * * * * * * * * * * *500 ------------- . . . . . * * * * * * * * * * * * * * * * * * * *490 . . . . * * * * * * * * * * * * * * * * * * * * *480 . . . . * * * * * * * * * * * * * * * * * * * * *470 . . . * * * * * * * * * * * * * * * * * * * * * *460 . . . * * * * * * * * * * * * * * * * * . * * * *450 ------------- . . . * * * * * * * * * * * * * * * * * * * * * *440 . . * * * * * * * * * * * * * * * * * * * * * * *430 . . * * * * * * * * * * * * * * * * * * * * * * *420 . . * * * * * * * * * * * * * * * * * * * * * * *410 . * * * * * * * * * * * * * * * * * * * * * * * *400 ------------- . * * * * * * * * * * * * * * * * * * * * * * * *390 . . . . . * * * * *380 . . . . . * * * * *370 . . . . . * * * * *360 . . . . . * * * * *350 --- . . . . * * * * * *340 . . . . * * * * * *330 . . . . * * * * * *320 . . . * * * * * * *310 . . . * * * * * * *300 --- . . . * * * * * * *290 . . . * * * * * * *280 . . * * * * * * * *270 . . * * * * * * * *260 . . * * * * * * * *250 --- . . * * * * * * * *240 . * * * * * * * * *230 . * * * * * * * * *220 . * * * * * * * * *210 . * * * * * * * * *200 --- * * * * * * * * * *190 * * * * * * * * * *180 * * * * * * * * * *
Figure 5: Shmoo plot for awide voltage range. The hori-zontal axis plots supply volt-age in mV, and the verticalaxis plots frequency in MHz.A * indicates correct opera-tion at that operating pointand a dot indicates failure.The shmoo plot shows com-bined results from two di�er-ent chips.
la t9, input_end
forever:
la t8, input_start
la t7, output
loop:
lw t0, 0(t8)
addu t8, 4
bgez t0, pos
subu t0, $0, t0
pos:
sw t0, 0(t7)
addu t7, 4
bne t8, t9, loop
b forever
Figure 6: Test program for measuringtypical power consumption. The innerloop calculates the absolute values ofintegers in an input array and storesthe results to an output array. The in-put array had 10 integers during test-ing, and the program repeatedly runsthis inner loop forever.
7
I RISC processor w/ 8 KB SRAMI TSMC 0.18 µm processI 1.7× 2.1 mmI 610K TransistorsI 450 MHz at 1.8 V
ECE 5745 Course Overview 43 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Scale Vector-Thread Processor ASIC
ECE 5745 Course Overview 44 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Scale Vector-Thread Processor ASIC
TSMC 0.18µm • 7.14 Million Transistors • 16.6 mm2 Core Area
CP
Vector Lane 0
Vector Lane 1
Vector Lane 2
Cache Control
Cache Tag CAMs
CacheDataRAM
Banks VMU
VIU
XBAR
Vector Lane 3
ECE 5745 Course Overview 45 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Scale Energy vs. Performance Results
Motivation Vector-Threading Scale VT Processor Maven VT Processor Future Research
Energy-Efficiency of the Scale VT Processor
MIT CSAIL Christopher Batten 24 / 39ECE 5745 Course Overview 46 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
BRG Test Chip 1 (2016)
Taped-out Layout for BRGTC1
HostInterface
de
bu
g
RISCCore
SortAccel
Memory Arbitration Unit
SRAMBank(2KB)
SRAMBank(2KB)
SRAMBank(2KB)
SRAMBank(2KB)
diff clk (+)
diff clk (−)
singleended clk
rese
t
Ctrl
Reghost2chip
chip2host
LVDSRecv
clkdiv
clk tree
resettree
clk
ou
t
The testing platform enables running small
test programs on BRGTC1 to compare the
performance and energy of pure-software kernels
versus the HLS-generated sorting accelerator
2x2mm 1.3M transistors in IBM 130nm
RISC processor, 16KB SRAM
HLS-generated accelerators
Static Timing Analysis Freq. @ 246 MHz
Post-Silicon Evaluation Strategy
LV
DS
div
ide
d
clk
ou
t
LV
DS
clk
ou
t
ECE 5745 Course Overview 47 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
Celerity System-on-Chip Overview (2017)
Target Workload: High-Performance Embedded Computing
I 5× 5mm in TSMC 16 nm FFCI 385 million transistorsI 511 RISC-V cores
. 5 Linux-capable Rocket cores
. 496-core tiled manycore
. 10-core low-voltage arrayI 1 BNN acceleratorI 1 synthesizable PLLI 1 synthesizable LDO VregI 3 clock domainsI 672-pin flip chip BGA packageI 9-months from PDK access to
tape-out
ECE 5745 Course Overview 48 / 52
Complex Digital ASIC Design Activity 1 • Case Study: Scalar vs. Vector Processors • Activity 2
BRG Test Chip 2 (2018)
Memory
Instruction Memory Arbiter
L1 Data $
(32KB)
LLFU Arbiter
Int Mul/Div
FPU
L1 Instruction $
(32KB)
Ho
st
Inte
rfa
ce
Syn
the
siz
ab
le P
LL
ArbiterData
Block Diagram4xRV32IMAF cores with “smart”
sharing L1$/LLFU,synthesizable PLL
Taped-out Layout for BRGTC22x2mm, 1.2M-trans, IBM 130nm
Static Timing Analysis Freq. @ 500MHz
ECE 5745 Course Overview 49 / 52
Complex Digital ASIC Design Activity 1 Case Study: Scalar vs. Vector Processors • Activity 2 •
RTL
Devices
ISA
PL
Algorithm
μArch
Technology
Application
OS
Gates
Circuits
Complex Digital ASIC Design
I Course goal, structure, motivation. What is the goal of the course?. Why should students want to take this course?. How is the course structured?
I Activity 1: Evaluation of Integer Multiplier
I Case Study: Scalar vs. Vector Processors. Example design-space exploration. Example real ASIC chips
I Activity 2: Brainstorming for Sorting Accelerator
ECE 5745 Course Overview 50 / 52
Complex Digital ASIC Design Activity 1 Case Study: Scalar vs. Vector Processors • Activity 2 •
Brainstorming for Sorting Accelerator
I$
D$
Arbiter
XcelP
Performance (Tasks per Second)
En
erg
y E
ffic
ien
cy (
Ta
sks p
er
Jo
ule
)
SoftwareBaseline
Sort array of integers stored in memoryProc sends Xcel base address and sizeProc waits for Xcel to finish sorting
I Software Baseline. Insertion sort O(n2)
. 20% of dynamicinstructions are lw/sw
I Hardware Accelerator
. What is ideal speedupassuming we still useinsertion sort?
. What if we use adifferent algorithm?
. Sketch hardware impl ofa sorting accelerator
. Where will acceleratorbe in the energy/perfspace?
ECE 5745 Course Overview 51 / 52
Complex Digital ASIC Design Activity 1 Case Study: Scalar vs. Vector Processors Activity 2
RTL
Devices
ISA
PL
Algorithm
μArch
Technology
Application
OS
Gates
Circuits
Take-Away Points
I Complex digital ASIC design is the process ofquantitatively exploring the area, cycle time,execution time, and energy trade-offs ofgeneral-purpose and application-specific designsusing automated standard-cell CAD tools and thento transform the most promising design to layoutready for fabrication
I Course can provide useful experience withcross-layer interaction for students interested inpursuing research in computer architecture orcircuits or students interested in pursuing a careerin in industry development of ASICs
ECE 5745 Course Overview 52 / 52