ISSCC 2008 / SESSION 13 / MOBILE PROCESSING /...

256 • 2008 IEEE International Solid-State Circuits Conference

ISSCC 2008 / SESSION 13 / MOBILE PROCESSING / 13.1

13.1 A Sub-1W to 2W Low-Power IA Processor for Mobile Internet Devices and Ultra-Mobile PCs in 45nm Hi-κ Metal Gate CMOS

Gianfranco Gerosa, Steve Curtis, Mike D’Addeo, Bo Jiang, Belliappa Kuttanna, Feroze Merchant, Binta Patel, Mohammed Taufique,Haytham Samarchi

Intel, Austin, TX

This paper describes a low-power Intel® Architecture (IA) processorspecifically designed for Mobile Internet Devices (MID) and Ultra-Mobile PCs (UMPC) where average power consumed is in the order ofa few hundred mW (as measured by MobileMark’05 OP @ 60 nitsbrightness) with performance similar to mainstream Ultra-MobilePCs. The design consists of an in-order pipeline capable of issuing 2instructions per cycle supporting 2 threads, 32KB instruction and24KB data L1 caches, independent integer and floating point execu-tion units, ×86 front end execution unit, a 512KB L2 cache and a 533MT/s dual-mode (GTL and CMOS) front-side-bus (FSB); a block dia-gram is shown in Fig. 13.1.1. The design contains 47M transistors ina die size under 25mm2 manufactured in a 9-metal 45nm CMOSprocess with optimized transistors for low leakage [1] packaged in aHalide-Free 441 ball, 14×13mm2 μFCBGA. Thermal Design Power(TDP) consumption is measured at 2W using a synthetic power-virustest at a frequency of 2GHz.

Features in this new micro-architecture are selected with low powerand high performance per watt efficiency in mind. The pipeline is tai-lored to execute IA instructions as single atomic operations consistingof a single destination register and up to three source-registers andadheres to the Load-Op-Store instruction format. Further, usingpower efficient algorithms in areas like instruction decoding andscheduling (traditionally complex circuits that are power hungry)achieves high performance per watt efficiency. In addition, supportfor Hyper-threading technology (HT) is added; the instruction sched-uling logic can find a pair of instructions from either the same threador across threads in a given cycle to dispatch. HT is a feature thatprovides high performance per watt (typically 30% increases in per-formance for a 15% increase in power) efficiency in an in-orderpipeline. Moreover, the use of specialized execution units is mini-mized. For example, the SIMD integer multiplier and Floating Pointdivider are used to execute instructions that would normally requirea dedicated scalar integer multiplier and integer divider respectively.Finally, other features like activity-based control of instruction issueand dispatch of operations on the Front Side Bus are added for powerreduction.

Several features support extended battery life including the newIntel® Deep Power Down Technology [2], which allows for a majorityof the CPU functionality to be powered down except for an array thatholds the micro-architectural state with very fast entry/exit times(<100us). Other micro-architecture features include Intel® 64Architecture support, Intel® Digital Media Boost (SSE3), and Intel®

Virtualization technology.

This design uses a “sea-of-Functional-Unit-Blocks (FUB)” methodolo-gy whereby all cluster hierarchies as well as all unit-level hierarchiesare flattened at the chip-level even though the logical partitioning ofthe chip-level RTL model is comprised of several logical clusters(Floating Point, Integer Execution, Memory Execution, Front-End,Bus Interface, and L2 Cache). This methodology essentially removesall cluster and unit-level hierarchical boundaries resulting in a phys-ical hierarchy where an object-based parallel editing scheme is usedfor physical design convergence; Fig. 13.1.7 shows the die photo withthe logical unit partitions. The physical database consists of 205unique FUBs (not including repeater stations) and 41K FUB-to-FUBinterconnects. 91% of the FUBs are designed using cell-based designtechniques using pre-characterized standard cells with 45% using“structured data-path” design techniques and 46% fully synthesizedrandom logic blocks. The remaining 9% are full-custom blocks.

The clock distribution is architected with power saving as a top pri-ority. Clock recombination in global distribution is limited to a fewcritical stages after carefully studying the power/skew trade off.Global clock routing is further reduced by: (1) implementing a grid-

less topology that routes clock to locations only if required, (2) align-ment of clock receivers to major clock spines, and (3) reducing thenumber of receivers on the clock network by minimizing clock cellcloning where clock tree synthesis is used. Flow automation and acell-based design approach are used to minimize implementationeffort and achieve fast clock convergence.

This processor uses a Register File (RF) design style for all corearrays including the 32KB L1 instruction cache, the 24KB L1 datacache and the 10.5KB C6 array which holds micro-Architecture stateduring the deep power down state (VCC=0.3V). These use 8T memo-ry cells to attain better SER characteristics, high performance cacheaccess time and lower voltage operation than traditional 6T SRAMcells. The L1 caches implement 1 bit per byte parity and no ECC. RFarrays comprise over 50% of the total core area and are an importantcontributor to overall chip power. Several power saving techniquesare employed: fine granularity sleep on word-line drivers with doublestacked PFETs with near zero wakeup times, sleep controls to allowfloating of read bit-lines on the ROM arrays (22Kuops × 60 bits/uop)during non-access cycles, and fine granularity power gating on non-accessed array banks. In addition, array bits on the “read-side” arearchitecturally pre-disposed to a value of ‘0’ to reduce leakage powerthrough the read stack.

The L2 cache is a phase-based, 8-way 512KB design supporting sin-gle cycle throughput for both read and write operations in a 9-cyclepipeline including tag lookup, data read, in-line ECC and transmit tothe L1. Figure 13.1.2 shows the L2 cache timing diagram. Tag, LRUand State bits are combined together into one array to minimize areaand power. Tag and data sub-arrays are 4.5KB and 17.5KB, respec-tively with 256 cells (bit-cell area = 0.3816μm2) per column and col-umn redundancy for optimum array efficiency. Average power reduc-tion techniques include: power-gating (PG) transistors for the word-line drivers, PG and sleep transistors (ST) for the memory arrays (seeFig. 13.1.3 and [3]), floating bit lines and tri-state-able write drivers.The ST implementation is based on PFETS, creating a virtual VCCwhich can go as low as 750mV reducing bit-cell leakage power up to2.5×. The 512kB L2 cache is dynamically chop-able down to 2-ways.For less demanding applications, sub-arrays of the unused ways arepowered down via PG and ST resulting in 10× leakage power reduc-tion per sub-array.

A dual mode IO buffer is implemented where both legacy Gunning-Transistor-Logic (GTL) signaling and a full CMOS swing can be sup-ported with a fuse-able option. In CMOS mode, the buffer can reliablytransmit data at 400 to 533MT/s while reducing the total Front-Side-Bus (FSB) power up to 2.5× (data-dependent) as compared to GTL inan ISO-slew rate comparison. Essentially the resistance-compensat-ed NFET pull-down impedance is reprogrammed to 55Ω and the on-die-termination (using R-compensated 55Ω PFETs) is turned off toeliminate DC power. Figure 13.1.4 shows the dual mode buffer andreceiver. IO leakage power is further reduced by “splitting” the 1.05VIO power supply (VCCP) and only keeping 21 pins “alive’ during thedeep power down state resulting in ~10% reduction of average power;Fig. 13.1.5 shows a platform implementation.

An IA microprocessor specifically micro-architected for low power andimplemented in 45nm CMOS technology is presented. This processoris suitable for MIDs and UMPCs given the range (~0.6W to 2.0W) ofmeasured TDP power on real world applications. 2 GHz core frequen-cies are achieved at 1.0V and 90°C; Fig. 13.1.6 shows measured powerat different power states and a shmoo of the processor running sever-al worst-case workloads.

Acknowledgments:The authors gratefully acknowledge the support of project manager ElenoraYoeli and the work of the talented and dedicated Intel design and product teamsin Austin, Texas that implemented this processor.

References:[1] K. Mistry, et al., “A 45nm Logic Technology with High-κ+ Metal GateTransistors, Strained Silicon, 9 Cu Interconnect Layers, 193nm Dry Patterning,and 100% Pb-free Packaging”, Tech. Dig. IEDM, Dec. 2007.[2] V. George, et al., “PENRYN: 45-nm Next Generation Intel® CoreTM 2Processor”, accepted to ASSCC Dig. Tech. Papers, Nov. 2007.[3] F. Hamzaoglu, et al., “A 153Mb SRAM Design with Dynamic StabilityEnhancement and Leakage Reduction in 45nm Hi-κ Metal Gate CMOSTechnology”, ISSCC Dig. Tech. Papers, pp. 376-377, Feb. 2008.

978-1-4244-2010-0/08/$25.00 ©2008 IEEE

Session_13_Penmor.qxp:Session_ 1/7/08 5:44 PM Page 256

257DIGEST OF TECHNICAL PAPERS •

Continued on Page 611

ISSCC 2008 / February 5, 2008 / 8:30 AM

Figure 13.1.1: Low-power IA processor architecture block diagram. Figure 13.1.2: L2 timer phase-based timing diagram.

Figure 13.1.3: L2 cache sleep circuit and shut-off (Idle) mode.

Figure 13.1.5: Split IO Power Supply to reduce IO leakage power in SLEEPstate. Figure 13.1.6: Power for several power states and VCC-Fmax shmoo @ 90°C.

Figure 13.1.4: Dual-mode Front-Side-Bus IO driver.

PMH

Per threadInteger

Register File

DataCache

DataTLBs

Fill Buffers

ALU

JEU

ALU

SIMD multiplier

FP adder

FP store

ALU

Shuffle

FP multiplier

FP move

FP ROM

FP divider

Per threadFP

Register File

Fault/Retire

2-wide Inst. Length

Decoder

Per-thread Instruction

Queues

InstructionCache

Branch Prediction UnitUROM

Per

Thr

ead

Pre

fetc

h B

uffe

rs

Inst. TLB

XLAT/ FL

AGU AGU

L2 Cache

DL1 prefetcher

Front-End Cluster

FP/SIMD execution cluster

Integer Execution Cluster

Memory Execution Cluster

ALU

Shifter

XLAT/ FL

FSBBIU

Bus Cluster

APIC

CLK

WL

blpch_b

rdysel_b

saen

sapch_b

latchenadjustable

pulsed clocks

ysel_b & sapch_b interlock

blpch_b & WL

interlock

Active Sleep/Shut-off

Virtual Vcc

VCCActive Sleep/Shut-off

Virtual Vcc

VCC

VCCP

NFETcontrol

PFETcontrol

DATA from core

CMOS ENABLE

+-

DATA IN_OUTDATA to core

2/3 VCCPGTL

1/2 VCCPCMOS

VREF

25; on/offGTL

55; on/offCMOS

Z (Ω)

55; remains on for ODTGTL

55; on/offCMOS

Z (Ω)

1.05VVRM

100 μF 2 μF 0.5 μF 0.5 nFbulk edge pkg on-die

VCCPC6

GTLREF

PACKAGE

DIE

21 IOs

VTTC6PDNfrom ChipSet

18 μF 4.5 μF 4.5 nFedge pkg

on-die

VCCP 182 IOs

C0FULLON

C1/2 C4SLEEP

C6DEEP

POWERDOWN

WA

TTS

0.016X

2.5

PO

WER

VIR

US

TE

ST

0.4X

0.12X

1.0X VCC1.20 * * * * * * * * * * * * * * * * * * * * *1.15 * * * * * * * * * * * * * * * * * * * * *1.10 * * * * * * * * * * * * * * * * * * * * A1.05 * * * * * * * * * * * * * * * * * * * C E1.00 * * * * * * * * * * * * * * * * * F G I J0.95 * * * * * * * * * * * * * * * K M E P J Q0.90 * * * * * * * * * * * * * K M T U J V J J0.85 * * * * * * * * * * * M G X J J Z J J J J0.80 * * * * * * * M \ ] ^ J J J J J J J J J J0.75 * _ M b G ^ c e J J J J J J J h J J J J J0.70 J J J J J J J J J k J J J J J J J J l n h

MHz

1250

1282

1316

1351

1389

1429

1471

1515

1563

1613

1667

1724

1786

1852

1923

2000

2083

2174

2273

2381

2500

2GHz @ Vcc=1.0V

0.70

1.20

Co

re V

CC

(V

olt

s)

1.25 2.5Core Frequency (GHz)

VCC1.20 * * * * * * * * * * * * * * * * * * * * *1.15 * * * * * * * * * * * * * * * * * * * * *1.10 * * * * * * * * * * * * * * * * * * * * A1.05 * * * * * * * * * * * * * * * * * * * C E1.00 * * * * * * * * * * * * * * * * * F G I J0.95 * * * * * * * * * * * * * * * K M E P J Q0.90 * * * * * * * * * * * * * K M T U J V J J0.85 * * * * * * * * * * * M G X J J Z J J J J0.80 * * * * * * * M \ ] ^ J J J J J J J J J J0.75 * _ M b G ^ c e J J J J J J J h J J J J J0.70 J J J J J J J J J k J J J J J J J J l n h

MHz

1250

1282

1316

1351

1389

1429

1471

1515

1563

1613

1667

1724

1786

1852

1923

2000

2083

2174

2273

2381

2500

2GHz @ Vcc=1.0V

0.70

1.20

Co

re V

CC

(V

olt

s)


C0FULLON

C1/2 C4SLEEP

C6DEEP

POWERDOWN

WA

TTS

0.016X

2.5

PO

WER

VIR

US

TE

ST

0.4X

0.12X

1.0X VCC1.20 * * * * * * * * * * * * * * * * * * * * *1.15 * * * * * * * * * * * * * * * * * * * * *1.10 * * * * * * * * * * * * * * * * * * * * A1.05 * * * * * * * * * * * * * * * * * * * C E1.00 * * * * * * * * * * * * * * * * * F G I J0.95 * * * * * * * * * * * * * * * K M E P J Q0.90 * * * * * * * * * * * * * K M T U J V J J0.85 * * * * * * * * * * * M G X J J Z J J J J0.80 * * * * * * * M \ ] ^ J J J J J J J J J J0.75 * _ M b G ^ c e J J J J J J J h J J J J J0.70 J J J J J J J J J k J J J J J J J J l n h

MHz

1250

1282

1316

1351

1389

1429

1471

1515

1563

1613

1667

1724

1786

1852

1923

2000

2083

2174

2273

2381

2500

2GHz @ Vcc=1.0V

0.70

1.20

Co

re V

CC

(V

olt

s)


VCC1.20 * * * * * * * * * * * * * * * * * * * * *1.15 * * * * * * * * * * * * * * * * * * * * *1.10 * * * * * * * * * * * * * * * * * * * * A1.05 * * * * * * * * * * * * * * * * * * * C E1.00 * * * * * * * * * * * * * * * * * F G I J0.95 * * * * * * * * * * * * * * * K M E P J Q0.90 * * * * * * * * * * * * * K M T U J V J J0.85 * * * * * * * * * * * M G X J J Z J J J J0.80 * * * * * * * M \ ] ^ J J J J J J J J J J0.75 * _ M b G ^ c e J J J J J J J h J J J J J0.70 J J J J J J J J J k J J J J J J J J l n h

MHz

1250

1282

1316

1351

1389

1429

1471

1515

1563

1613

1667

1724

1786

1852

1923

2000

2083

2174

2273

2381

2500

2GHz @ Vcc=1.0V

0.70

1.20

Co

re V

CC

(V

olt

s)


13

Session_13_Penmor.qxp:Session_ 1/7/08 5:44 PM 57

611 • 2008 IEEE International Solid-State Circuits Conference 978-1-4244-2010-0/08/$25.00 ©2008 IEEE

ISSCC 2008 PAPER CONTINUATIONS

Figure 13.1.7: Low-power IA processor die micrograph.

Session_13_Runbacks.qxp:Session_ 1/7/08 5:50 PM Page 1

ISSCC 2008 / SESSION 13 / MOBILE PROCESSING /...

Documents

Transcript of ISSCC 2008 / SESSION 13 / MOBILE PROCESSING /...