ISSCC 2008 / SESSION 13 / MOBILE PROCESSING /...
Transcript of ISSCC 2008 / SESSION 13 / MOBILE PROCESSING /...
256 • 2008 IEEE International Solid-State Circuits Conference
ISSCC 2008 / SESSION 13 / MOBILE PROCESSING / 13.1
13.1 A Sub-1W to 2W Low-Power IA Processor for Mobile Internet Devices and Ultra-Mobile PCs in 45nm Hi-κ Metal Gate CMOS
Gianfranco Gerosa, Steve Curtis, Mike D’Addeo, Bo Jiang, Belliappa Kuttanna, Feroze Merchant, Binta Patel, Mohammed Taufique,Haytham Samarchi
Intel, Austin, TX
This paper describes a low-power Intel® Architecture (IA) processorspecifically designed for Mobile Internet Devices (MID) and Ultra-Mobile PCs (UMPC) where average power consumed is in the order ofa few hundred mW (as measured by MobileMark’05 OP @ 60 nitsbrightness) with performance similar to mainstream Ultra-MobilePCs. The design consists of an in-order pipeline capable of issuing 2instructions per cycle supporting 2 threads, 32KB instruction and24KB data L1 caches, independent integer and floating point execu-tion units, ×86 front end execution unit, a 512KB L2 cache and a 533MT/s dual-mode (GTL and CMOS) front-side-bus (FSB); a block dia-gram is shown in Fig. 13.1.1. The design contains 47M transistors ina die size under 25mm2 manufactured in a 9-metal 45nm CMOSprocess with optimized transistors for low leakage [1] packaged in aHalide-Free 441 ball, 14×13mm2 μFCBGA. Thermal Design Power(TDP) consumption is measured at 2W using a synthetic power-virustest at a frequency of 2GHz.
Features in this new micro-architecture are selected with low powerand high performance per watt efficiency in mind. The pipeline is tai-lored to execute IA instructions as single atomic operations consistingof a single destination register and up to three source-registers andadheres to the Load-Op-Store instruction format. Further, usingpower efficient algorithms in areas like instruction decoding andscheduling (traditionally complex circuits that are power hungry)achieves high performance per watt efficiency. In addition, supportfor Hyper-threading technology (HT) is added; the instruction sched-uling logic can find a pair of instructions from either the same threador across threads in a given cycle to dispatch. HT is a feature thatprovides high performance per watt (typically 30% increases in per-formance for a 15% increase in power) efficiency in an in-orderpipeline. Moreover, the use of specialized execution units is mini-mized. For example, the SIMD integer multiplier and Floating Pointdivider are used to execute instructions that would normally requirea dedicated scalar integer multiplier and integer divider respectively.Finally, other features like activity-based control of instruction issueand dispatch of operations on the Front Side Bus are added for powerreduction.
Several features support extended battery life including the newIntel® Deep Power Down Technology [2], which allows for a majorityof the CPU functionality to be powered down except for an array thatholds the micro-architectural state with very fast entry/exit times(<100us). Other micro-architecture features include Intel® 64Architecture support, Intel® Digital Media Boost (SSE3), and Intel®
Virtualization technology.
This design uses a “sea-of-Functional-Unit-Blocks (FUB)” methodolo-gy whereby all cluster hierarchies as well as all unit-level hierarchiesare flattened at the chip-level even though the logical partitioning ofthe chip-level RTL model is comprised of several logical clusters(Floating Point, Integer Execution, Memory Execution, Front-End,Bus Interface, and L2 Cache). This methodology essentially removesall cluster and unit-level hierarchical boundaries resulting in a phys-ical hierarchy where an object-based parallel editing scheme is usedfor physical design convergence; Fig. 13.1.7 shows the die photo withthe logical unit partitions. The physical database consists of 205unique FUBs (not including repeater stations) and 41K FUB-to-FUBinterconnects. 91% of the FUBs are designed using cell-based designtechniques using pre-characterized standard cells with 45% using“structured data-path” design techniques and 46% fully synthesizedrandom logic blocks. The remaining 9% are full-custom blocks.
The clock distribution is architected with power saving as a top pri-ority. Clock recombination in global distribution is limited to a fewcritical stages after carefully studying the power/skew trade off.Global clock routing is further reduced by: (1) implementing a grid-
less topology that routes clock to locations only if required, (2) align-ment of clock receivers to major clock spines, and (3) reducing thenumber of receivers on the clock network by minimizing clock cellcloning where clock tree synthesis is used. Flow automation and acell-based design approach are used to minimize implementationeffort and achieve fast clock convergence.
This processor uses a Register File (RF) design style for all corearrays including the 32KB L1 instruction cache, the 24KB L1 datacache and the 10.5KB C6 array which holds micro-Architecture stateduring the deep power down state (VCC=0.3V). These use 8T memo-ry cells to attain better SER characteristics, high performance cacheaccess time and lower voltage operation than traditional 6T SRAMcells. The L1 caches implement 1 bit per byte parity and no ECC. RFarrays comprise over 50% of the total core area and are an importantcontributor to overall chip power. Several power saving techniquesare employed: fine granularity sleep on word-line drivers with doublestacked PFETs with near zero wakeup times, sleep controls to allowfloating of read bit-lines on the ROM arrays (22Kuops × 60 bits/uop)during non-access cycles, and fine granularity power gating on non-accessed array banks. In addition, array bits on the “read-side” arearchitecturally pre-disposed to a value of ‘0’ to reduce leakage powerthrough the read stack.
The L2 cache is a phase-based, 8-way 512KB design supporting sin-gle cycle throughput for both read and write operations in a 9-cyclepipeline including tag lookup, data read, in-line ECC and transmit tothe L1. Figure 13.1.2 shows the L2 cache timing diagram. Tag, LRUand State bits are combined together into one array to minimize areaand power. Tag and data sub-arrays are 4.5KB and 17.5KB, respec-tively with 256 cells (bit-cell area = 0.3816μm2) per column and col-umn redundancy for optimum array efficiency. Average power reduc-tion techniques include: power-gating (PG) transistors for the word-line drivers, PG and sleep transistors (ST) for the memory arrays (seeFig. 13.1.3 and [3]), floating bit lines and tri-state-able write drivers.The ST implementation is based on PFETS, creating a virtual VCCwhich can go as low as 750mV reducing bit-cell leakage power up to2.5×. The 512kB L2 cache is dynamically chop-able down to 2-ways.For less demanding applications, sub-arrays of the unused ways arepowered down via PG and ST resulting in 10× leakage power reduc-tion per sub-array.
A dual mode IO buffer is implemented where both legacy Gunning-Transistor-Logic (GTL) signaling and a full CMOS swing can be sup-ported with a fuse-able option. In CMOS mode, the buffer can reliablytransmit data at 400 to 533MT/s while reducing the total Front-Side-Bus (FSB) power up to 2.5× (data-dependent) as compared to GTL inan ISO-slew rate comparison. Essentially the resistance-compensat-ed NFET pull-down impedance is reprogrammed to 55Ω and the on-die-termination (using R-compensated 55Ω PFETs) is turned off toeliminate DC power. Figure 13.1.4 shows the dual mode buffer andreceiver. IO leakage power is further reduced by “splitting” the 1.05VIO power supply (VCCP) and only keeping 21 pins “alive’ during thedeep power down state resulting in ~10% reduction of average power;Fig. 13.1.5 shows a platform implementation.
An IA microprocessor specifically micro-architected for low power andimplemented in 45nm CMOS technology is presented. This processoris suitable for MIDs and UMPCs given the range (~0.6W to 2.0W) ofmeasured TDP power on real world applications. 2 GHz core frequen-cies are achieved at 1.0V and 90°C; Fig. 13.1.6 shows measured powerat different power states and a shmoo of the processor running sever-al worst-case workloads.
Acknowledgments:The authors gratefully acknowledge the support of project manager ElenoraYoeli and the work of the talented and dedicated Intel design and product teamsin Austin, Texas that implemented this processor.
References:[1] K. Mistry, et al., “A 45nm Logic Technology with High-κ+ Metal GateTransistors, Strained Silicon, 9 Cu Interconnect Layers, 193nm Dry Patterning,and 100% Pb-free Packaging”, Tech. Dig. IEDM, Dec. 2007.[2] V. George, et al., “PENRYN: 45-nm Next Generation Intel® CoreTM 2Processor”, accepted to ASSCC Dig. Tech. Papers, Nov. 2007.[3] F. Hamzaoglu, et al., “A 153Mb SRAM Design with Dynamic StabilityEnhancement and Leakage Reduction in 45nm Hi-κ Metal Gate CMOSTechnology”, ISSCC Dig. Tech. Papers, pp. 376-377, Feb. 2008.
978-1-4244-2010-0/08/$25.00 ©2008 IEEE
Session_13_Penmor.qxp:Session_ 1/7/08 5:44 PM Page 256
257DIGEST OF TECHNICAL PAPERS •
Continued on Page 611
ISSCC 2008 / February 5, 2008 / 8:30 AM
Figure 13.1.1: Low-power IA processor architecture block diagram. Figure 13.1.2: L2 timer phase-based timing diagram.
Figure 13.1.3: L2 cache sleep circuit and shut-off (Idle) mode.
Figure 13.1.5: Split IO Power Supply to reduce IO leakage power in SLEEPstate. Figure 13.1.6: Power for several power states and VCC-Fmax shmoo @ 90°C.
Figure 13.1.4: Dual-mode Front-Side-Bus IO driver.
PMH
Per threadInteger
Register File
DataCache
DataTLBs
Fill Buffers
ALU
JEU
ALU
SIMD multiplier
FP adder
FP store
ALU
Shuffle
FP multiplier
FP move
FP ROM
FP divider
Per threadFP
Register File
Fault/Retire
2-wide Inst. Length
Decoder
Per-thread Instruction
Queues
InstructionCache
Branch Prediction UnitUROM
Per
Thr
ead
Pre
fetc
h B
uffe
rs
Inst. TLB
XLAT/ FL
AGU AGU
L2 Cache
DL1 prefetcher
Front-End Cluster
FP/SIMD execution cluster
Integer Execution Cluster
Memory Execution Cluster
ALU
Shifter
XLAT/ FL
FSBBIU
Bus Cluster
APIC
CLK
WL
blpch_b
rdysel_b
saen
sapch_b
latchenadjustable
pulsed clocks
ysel_b & sapch_b interlock
blpch_b & WL
interlock
Active Sleep/Shut-off
Virtual Vcc
VCCActive Sleep/Shut-off
Virtual Vcc
VCC
VCCP
NFETcontrol
PFETcontrol
DATA from core
CMOS ENABLE
+-
DATA IN_OUTDATA to core
2/3 VCCPGTL
1/2 VCCPCMOS
VREF
25; on/offGTL
55; on/offCMOS
Z (Ω)
55; remains on for ODTGTL
55; on/offCMOS
Z (Ω)
1.05VVRM
100 μF 2 μF 0.5 μF 0.5 nFbulk edge pkg on-die
VCCPC6
GTLREF
PACKAGE
DIE
21 IOs
VTTC6PDNfrom ChipSet
18 μF 4.5 μF 4.5 nFedge pkg
on-die
VCCP 182 IOs
C0FULLON
C1/2 C4SLEEP
C6DEEP
POWERDOWN
WA
TTS
0.016X
2.5
PO
WER
VIR
US
TE
ST
0.4X
0.12X
1.0X VCC1.20 * * * * * * * * * * * * * * * * * * * * *1.15 * * * * * * * * * * * * * * * * * * * * *1.10 * * * * * * * * * * * * * * * * * * * * A1.05 * * * * * * * * * * * * * * * * * * * C E1.00 * * * * * * * * * * * * * * * * * F G I J0.95 * * * * * * * * * * * * * * * K M E P J Q0.90 * * * * * * * * * * * * * K M T U J V J J0.85 * * * * * * * * * * * M G X J J Z J J J J0.80 * * * * * * * M \ ] ^ J J J J J J J J J J0.75 * _ M b G ^ c e J J J J J J J h J J J J J0.70 J J J J J J J J J k J J J J J J J J l n h
MHz
1250
1282
1316
1351
1389
1429
1471
1515
1563
1613
1667
1724
1786
1852
1923
2000
2083
2174
2273
2381
2500
2GHz @ Vcc=1.0V
0.70
1.20
Co
re V
CC
(V
olt
s)
1.25 2.5Core Frequency (GHz)
VCC1.20 * * * * * * * * * * * * * * * * * * * * *1.15 * * * * * * * * * * * * * * * * * * * * *1.10 * * * * * * * * * * * * * * * * * * * * A1.05 * * * * * * * * * * * * * * * * * * * C E1.00 * * * * * * * * * * * * * * * * * F G I J0.95 * * * * * * * * * * * * * * * K M E P J Q0.90 * * * * * * * * * * * * * K M T U J V J J0.85 * * * * * * * * * * * M G X J J Z J J J J0.80 * * * * * * * M \ ] ^ J J J J J J J J J J0.75 * _ M b G ^ c e J J J J J J J h J J J J J0.70 J J J J J J J J J k J J J J J J J J l n h
MHz
1250
1282
1316
1351
1389
1429
1471
1515
1563
1613
1667
1724
1786
1852
1923
2000
2083
2174
2273
2381
2500
2GHz @ Vcc=1.0V
0.70
1.20
Co
re V
CC
(V
olt
s)
1.25 2.5Core Frequency (GHz)
C0FULLON
C1/2 C4SLEEP
C6DEEP
POWERDOWN
WA
TTS
0.016X
2.5
PO
WER
VIR
US
TE
ST
0.4X
0.12X
1.0X VCC1.20 * * * * * * * * * * * * * * * * * * * * *1.15 * * * * * * * * * * * * * * * * * * * * *1.10 * * * * * * * * * * * * * * * * * * * * A1.05 * * * * * * * * * * * * * * * * * * * C E1.00 * * * * * * * * * * * * * * * * * F G I J0.95 * * * * * * * * * * * * * * * K M E P J Q0.90 * * * * * * * * * * * * * K M T U J V J J0.85 * * * * * * * * * * * M G X J J Z J J J J0.80 * * * * * * * M \ ] ^ J J J J J J J J J J0.75 * _ M b G ^ c e J J J J J J J h J J J J J0.70 J J J J J J J J J k J J J J J J J J l n h
MHz
1250
1282
1316
1351
1389
1429
1471
1515
1563
1613
1667
1724
1786
1852
1923
2000
2083
2174
2273
2381
2500
2GHz @ Vcc=1.0V
0.70
1.20
Co
re V
CC
(V
olt
s)
1.25 2.5Core Frequency (GHz)
VCC1.20 * * * * * * * * * * * * * * * * * * * * *1.15 * * * * * * * * * * * * * * * * * * * * *1.10 * * * * * * * * * * * * * * * * * * * * A1.05 * * * * * * * * * * * * * * * * * * * C E1.00 * * * * * * * * * * * * * * * * * F G I J0.95 * * * * * * * * * * * * * * * K M E P J Q0.90 * * * * * * * * * * * * * K M T U J V J J0.85 * * * * * * * * * * * M G X J J Z J J J J0.80 * * * * * * * M \ ] ^ J J J J J J J J J J0.75 * _ M b G ^ c e J J J J J J J h J J J J J0.70 J J J J J J J J J k J J J J J J J J l n h
MHz
1250
1282
1316
1351
1389
1429
1471
1515
1563
1613
1667
1724
1786
1852
1923
2000
2083
2174
2273
2381
2500
2GHz @ Vcc=1.0V
0.70
1.20
Co
re V
CC
(V
olt
s)
1.25 2.5Core Frequency (GHz)
13
Session_13_Penmor.qxp:Session_ 1/7/08 5:44 PM Page 257
611 • 2008 IEEE International Solid-State Circuits Conference 978-1-4244-2010-0/08/$25.00 ©2008 IEEE
ISSCC 2008 PAPER CONTINUATIONS
Figure 13.1.7: Low-power IA processor die micrograph.
Session_13_Runbacks.qxp:Session_ 1/7/08 5:50 PM Page 1