Power Managementece3056-sy.ece.gatech.edu/wp-content/uploads/sites/5… · · 2017-12-09v Power...
Power Management
Lecture notes S. Yalamanchili and S. Mukhopadhyay
Basic Trends
(3)
Technology Scaling
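• 30% scaling down in dimensions → doubles transistor density
• Power per transistor: $P = \alpha C V_{dd}^2 f + V_{dd} I_{leak}$
  - Vdd scaling → lower power
• Transistor delay: $\text{Delay} = k \cdot C \cdot \dfrac{V_{dd}}{(V_{dd} - V_t)^2}$
[Figure: MOSFET cross-section labeling gate, source, drain, body, oxide thickness t_ox, and channel length L]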
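The power and delay expressions on this slide can be explored numerically. The Python sketch below evaluates both for two supply voltages; every parameter value (alpha, C, f, I_leak, Vt, k) is an assumption chosen for illustration, not a figure from these notes.

```python
def total_power(alpha, c_farads, vdd, freq_hz, i_leak_amps):
    """P = alpha*C*Vdd^2*f + Vdd*I_leak (dynamic plus static power, in watts)."""
    dynamic = alpha * c_farads * vdd ** 2 * freq_hz
    static = vdd * i_leak_amps
    return dynamic + static


def gate_delay(k, c_farads, vdd, vt):
    """Delay = k*C*Vdd / (Vdd - Vt)^2; only meaningful for Vdd > Vt."""
    if vdd <= vt:
        raise ValueError("Vdd must exceed the threshold voltage Vt")
    return k * c_farads * vdd / (vdd - vt) ** 2


# Illustrative (assumed) values, not taken from the lecture notes.
ALPHA, C, F, I_LEAK, VT, K = 0.1, 1e-9, 2e9, 0.5, 0.3, 1.0

p_high = total_power(ALPHA, C, 1.0, F, I_LEAK)  # Vdd = 1.0 V
p_low = total_power(ALPHA, C, 0.8, F, I_LEAK)   # Vdd = 0.8 V
d_high = gate_delay(K, C, 1.0, VT)
d_low = gate_delay(K, C, 0.8, VT)
```

With these assumed numbers, lowering Vdd reduces power but increases delay, which is the tradeoff the rest of the notes build on.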
(4)
Moore’s Law
From wikipedia.org
• Performance scaled with number of transistors
• Dennard scaling*: power scaled with feature size
Goal: Sustain Performance Scaling
*R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
(5)
Parallelism and Power
IBM Power5 (Source: IBM)
AMD Trinity (Source: forwardthinking.pcmag.com)
• How much of the chip area is devoted to compute?
• Run many cores slower. Why does this reduce power?
(6)
Parallelism
• Concurrency + lower frequency → greater energy efficiency
$P = \alpha C V_{dd}^2 f + V_{dd} I_{leak}$
[Figure: a single core with its cache next to four cores, each with its own cache]
Example
• 4X #cores
• 0.75x voltage
• 0.5x frequency
• ~1X power
• 2X in performance
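The arithmetic behind this example can be checked with a short Python sketch. It assumes dynamic power dominates (leakage is ignored), that every core has the same activity factor and capacitance, and that the workload parallelizes perfectly; the function names are illustrative.

```python
def relative_dynamic_power(core_ratio, vdd_ratio, freq_ratio):
    """Dynamic power relative to the baseline: N * (V/V0)^2 * (f/f0)."""
    return core_ratio * vdd_ratio ** 2 * freq_ratio


def relative_throughput(core_ratio, freq_ratio):
    """Throughput relative to the baseline, assuming perfect parallel scaling."""
    return core_ratio * freq_ratio


power = relative_dynamic_power(4, 0.75, 0.5)   # 4 * 0.5625 * 0.5 = 1.125
throughput = relative_throughput(4, 0.5)       # 4 * 0.5 = 2.0
```

Under these assumptions the four slower cores draw about 1.125x the baseline dynamic power, consistent with the slide's "~1X", while delivering 2x the throughput.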
(7)
The Power Wall
• Power per transistor scales with frequency but also scales with Vdd
  - Lower Vdd can be compensated for with increased pipelining to keep throughput constant
  - Power per transistor is not the same as power per area → power density is the problem!
  - Multiple units can be run at lower frequencies to keep throughput constant, while saving power
$P = \alpha C V_{dd}^2 f + V_{dd} I_{leak}$
(8)
What is the Problem?
• Based on scaling using Pentium-class cores
• While Moore’s Law continues, scaling phenomena have changed
• Power densities are increasing with each generation
Mukhopadhyay and Yalamanchili (2009)
(9)
ITRS Roadmap for Logic Devices
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge, et al., 2008
Power Management Basics
(11)
What are my Options?
1. Better technology
  - Manufacturing
  - Better devices (FinFET)
  - New devices → non-CMOS? → this is the future
2. Be more efficient – activity management
  - Clock gating – dynamic energy/power
  - Power gating – static energy/power
  - Power state management – both
3. Improved architecture
  - Simpler pipelines
4. Parallelism
Not this course
Where does the power go?
(12)
Distribution of Power on Chip*
* From G. Chandra, P. Kapur and K. C. Saraswat, “Scaling Trends for Distribution of on chip Power”, Stanford University, 2011
Circa 2011 (50nm)
Modern Estimates are 20%-40% (courtesy S. Mukhopadhyay)
(13)
Activity Management
Clock Gating
• Turn off clock to a block of logic
• Eliminate unnecessary transitions/activity
• Clock distribution power
Power Gating
• Turn off power to a block of logic, e.g., core
• No leakage
[Figure: clock-gating circuit (clk, cond, gated clk, input, combinational logic) and power-gating circuit (Vdd, power gate transistor, Core 0, Core 1)]
(14)
Qualcomm Snapdragon
www.quora.com
• Heterogeneous architecture
• Different components used at different points in time
(15)
Multiple Voltage-Frequency Domains
Intel Sandy Bridge Processor
• Cores and ring in one DVFS domain
• Graphics unit in another DVFS domain
• Cores and portion of cache can be gated off
From E. Rotem et al., HotChips 2011
(16)
Processor Power States
• Performance States – P-states
  - Operate at different voltage/frequencies
    - Recall delay-voltage relationship
  - Lower voltage → lower leakage
  - Lower frequency → lower power (not the same as energy!)
  - Lower frequency → longer execution time
• Idle States – C-states
  - Sleep states
  - Differ in how much state is saved
• SW or HW managed transitions between states!
(17)
Example of P-states
• Software Managed Power States
• Changing Power States is not free
AMD Trinity A10-5800 APU: 100W TDP

CPU P-state          Voltage (V)   Freq (MHz)
HW Only (Boost)
  Pb0                1             2400
  Pb1                0.875         1800
SW-Visible
  P0                 0.825         1600
  P1                 0.812         1400
  P2                 0.787         1300
  P3                 0.762         1100
  P4                 0.75          900
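To see why lower power is not the same as lower energy, the table can be fed through the dynamic term of the power equation. The sketch below is a rough illustration only: it assumes dynamic power scales as V^2 * f, ignores leakage and transition costs, and assumes a CPU-bound task with a fixed cycle count so that execution time scales as 1/f.

```python
# (voltage in V, frequency in MHz) copied from the P-state table above.
P_STATES = {
    "Pb0": (1.0, 2400), "Pb1": (0.875, 1800),
    "P0": (0.825, 1600), "P1": (0.812, 1400), "P2": (0.787, 1300),
    "P3": (0.762, 1100), "P4": (0.75, 900),
}


def relative_to(ref, states=P_STATES):
    """Return {state: (power, time, energy)} relative to the reference state.

    power  ~ (V/V0)^2 * (f/f0)   (dynamic power only, an assumption)
    time   ~ f0/f                (fixed cycle count, an assumption)
    energy = power * time
    """
    v0, f0 = states[ref]
    result = {}
    for name, (v, f) in states.items():
        power = (v / v0) ** 2 * (f / f0)
        time = f0 / f
        result[name] = (power, time, power * time)
    return result


rel = relative_to("P0")
for name, (power, time, energy) in rel.items():
    print(f"{name}: power x{power:.2f}, time x{time:.2f}, energy x{energy:.2f}")
```

Under these assumptions, P4 draws less than half the dynamic power of P0 but runs the task about 1.8x longer, so the energy saving comes only from the voltage reduction.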
(18)
Example of P-states
From: http://www.intel.com/content/www/us/en/processors/core/2nd-gen-core-family-mobile-vol-1-datasheet.html
(19)
Management Knobs
• Each core can be in any one of multiple states
• How do I decide what state to set each core to?
  - Who decides? HW? SW?
• How do I decide when I can turn off a core?
• What am I saving? Static energy or dynamic energy?
(20)
Power Management
• Software controlled power management
  - Optimize power and/or energy
  - Orchestrated by the operating system or application libraries
  - Industry standard interfaces for power management
    - Advanced Configuration and Power Interface (ACPI)
      - https://www.acpica.org/
      - http://www.acpi.info/
• Hardware power management
  - Optimized power/energy
  - Failsafe operation, e.g., protect against thermal emergencies → automatically throttle when exceeding temperature bounds.
(21)
Power Management3.0
[Figure: die temperature vs. time showing thermal headroom; CPU DVFS states ordered by performance, with HW-only boost states (Pb0, Pb1) above the SW-visible states (P0, P1, P2, ..., Pmin); instructions/cycle vs. time]
• Convert thermal headroom to higher performance through boost
• Performance and energy efficiency depend on effective utilization of power and thermal headroom
(22)
Boosting
• Exploit package physics
  - Temperature changes on the order of milliseconds
• Use the thermal headroom
[Figure: Intel Sandy Bridge power vs. time; a low-power phase builds up thermal credits, followed by a turbo boost region between TDP power and max power lasting 10s of seconds, then throttling]
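One way to reason about how long a boost can last is a first-order thermal model with a single time constant. The sketch below is an assumption made for illustration, not the algorithm used by any particular processor; the function name and all temperatures, the thermal resistance, and the time constant are hypothetical values.

```python
import math


def time_to_reach(t_start, t_limit, t_steady, tau):
    """Seconds until a first-order thermal model reaches t_limit.

    Assumed model: T(t) = t_steady + (t_start - t_steady) * exp(-t / tau),
    where t_steady is the temperature the die would settle at if the
    current power level were held forever.
    """
    if t_steady <= t_limit:
        return math.inf      # this power level never reaches the limit
    if t_start >= t_limit:
        return 0.0           # already at or above the limit
    return tau * math.log((t_steady - t_start) / (t_steady - t_limit))


# Hypothetical numbers: 40 C ambient, 0.5 C/W thermal resistance,
# 20 s time constant, 95 C temperature limit, die starting at 60 C.
AMBIENT, R_TH, TAU, T_LIMIT, T_START = 40.0, 0.5, 20.0, 95.0, 60.0

boost_budget_s = time_to_reach(T_START, T_LIMIT, AMBIENT + R_TH * 130.0, TAU)
tdp_budget_s = time_to_reach(T_START, T_LIMIT, AMBIENT + R_TH * 100.0, TAU)
```

In this toy model a cooler starting temperature (more thermal credit) lengthens the boost window, and running at or below the sustainable power never hits the limit.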
(23)
Power Gating
Intel Sandy Bridge Processor
• Turn off components that are not being used
  - Lose all state information
• Costs of powering down
• Costs of powering up
• Smart shutdown
  - Models to guide decisions
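A simple model for guiding shutdown decisions is a break-even time: gating pays off only if the idle period is long enough for the leakage saved to exceed the energy spent powering down and back up. The sketch below assumes that model; the function names and all energy and power values are hypothetical.

```python
def break_even_time_s(e_down_j, e_up_j, p_leak_w, p_gated_w=0.0):
    """Idle time at which the leakage saved equals the transition energy."""
    saved_per_second = p_leak_w - p_gated_w
    if saved_per_second <= 0:
        raise ValueError("gating must reduce power to ever break even")
    return (e_down_j + e_up_j) / saved_per_second


def should_power_gate(predicted_idle_s, e_down_j, e_up_j, p_leak_w):
    """Gate only when the predicted idle period exceeds the break-even time."""
    return predicted_idle_s > break_even_time_s(e_down_j, e_up_j, p_leak_w)


# Hypothetical costs: 2 mJ to power down, 3 mJ to power up, 0.5 W leakage.
t_be = break_even_time_s(0.002, 0.003, 0.5)   # about 10 ms
```

The quality of the decision then depends on how well the idle period can be predicted, which this sketch does not model.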
(24)
Linux CPU Governors
(25)
Linux CPU Governors
Sources:
https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt
https://lwn.net/Articles/682391/
https://lwn.net/Articles/531853/
https://android.googlesource.com/kernel/common/+/a7827a2a60218b25f222b54f77ed38f57aebe08b/Documentation/cpu-freq/governors.txt
Performance: Set all cores to max supported frequency. (Max performance)
Powersave: Set all cores to min supported frequency. (Min power)
Userspace: Allow user to set any supported frequency.
Ondemand: Sample periodically, assess CPU load, set frequency. (Max performance) If load > up_threshold, set core frequency to max. If load < up_threshold, decrease core frequency; sampling_down_factor controls how often the decision to lower the frequency is re-evaluated while running at max frequency.
Conservative: Like ondemand, but increases and decreases core frequency more gradually.
SchedUtil: Uses CPU load as calculated by the scheduler’s per-entity load tracking (PELT).
Interactive: Designed for Android. Event (CPU idle state) based policy makes it more responsive than ondemand and conservative. Boosting of core frequencies is done via a heuristic. (Max performance when needed)
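The ondemand policy described above can be sketched as a single decision function. This is a simplification written for illustration and is not the kernel's implementation: the load-proportional choice below the threshold is an assumption, the default threshold of 80 is only an example value, and the frequency list is borrowed from the SW-visible P-states earlier in these notes.

```python
def ondemand_next_freq(load_pct, freqs_mhz, up_threshold=80):
    """Pick the next core frequency from the supported list.

    Above up_threshold: jump straight to the maximum frequency.
    Otherwise (assumed behavior): choose the lowest supported frequency
    that covers a target scaled linearly with load between min and max.
    """
    f_min, f_max = min(freqs_mhz), max(freqs_mhz)
    if load_pct > up_threshold:
        return f_max
    target = f_min + (f_max - f_min) * load_pct / 100
    return min(f for f in freqs_mhz if f >= target)


SUPPORTED_MHZ = [900, 1100, 1300, 1400, 1600]
```

For example, `ondemand_next_freq(90, SUPPORTED_MHZ)` selects 1600 MHz, while a 50% load maps to a 1250 MHz target and therefore to the 1300 MHz state.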
(26)
Parallelism
• Concurrency + lower frequency → greater energy efficiency
$P = \alpha C V_{dd}^2 f + V_{dd} I_{leak}$
[Figure: a single core with its cache next to four cores, each with its own cache]
Example
• 4X #cores
• 0.75x voltage
• 0.5x frequency
• 1X power
• 2X in performance
(27)
Simplify Core Design
AMD Bulldozer Core
ARM A7 Core (arm.com)
• Support for branch prediction, schedulers, etc. consumes more energy per instruction
• Can fit many more simpler cores on a die
(28)
Metrics
• Power efficiency
  - MIPS/watt
  - Ops/watt
• Energy efficiency
  - Joules/instruction
  - Joules/op
• Composite
  - Energy-delay product
  - Energy-delay²
Why are these useful?
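A small numeric comparison shows how the metrics can rank the same two designs differently. The Python sketch below computes each metric from total energy, execution time, and instruction count; the two configurations are hypothetical and chosen only to make the contrast visible.

```python
def metrics(energy_j, delay_s, instructions):
    """Compute the metrics listed above for one run of a workload."""
    power_w = energy_j / delay_s
    mips = instructions / 1e6 / delay_s
    return {
        "power_w": power_w,
        "mips_per_watt": mips / power_w,
        "joules_per_instruction": energy_j / instructions,
        "edp": energy_j * delay_s,          # energy-delay product
        "ed2p": energy_j * delay_s ** 2,    # energy-delay^2 product
    }


# Hypothetical runs of the same 1e9-instruction workload.
fast = metrics(energy_j=10.0, delay_s=1.0, instructions=1e9)
slow = metrics(energy_j=6.0, delay_s=2.0, instructions=1e9)
```

Here the slow configuration wins on joules/instruction and MIPS/watt, but the fast one wins on both composite metrics because they also penalize the longer execution time.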
Modeling
(30)
Microarchitectural Level Models
• How can we study power consumption without building circuits?
  - Models
• Models are available at multiple levels of abstraction.
We are interested in microarchitectural models
(31)
Processor Microarchitecture
[Figure: processor block diagram. Components: Instruction Cache, Instruction TLB, Branch Prediction, Fetch Queue, Instruction Decoder, Instruction Queue, Register Files, ALU, MUL, FPU, LD, ST, L1 Data Cache, Data TLB, L2 Data Cache, NoC Router, On-Chip Network. Stage labels: Fetch, Decode, Execute/Writeback, Memory, Network]
(32)
Energy/Power Calculation
• How do we calculate energy or power dissipation for a given microarchitecture?
• Energy/Power varies between:
  - Different ISA; ARM vs Intel x86
  - Different microarchitecture; in-order vs out-of-order
  - Different applications; memory vs compute-bound
  - Different technologies; 90nm vs 22nm technology
  - Different operating conditions; frequency, temperature
(33)
Architecture Activity (1)
[Figure: processor microarchitecture block diagram, repeated]
Activity 1: Instruction Fetch
icache.read++; fbuffer.write++;
• Collect activity counts of each architecture component (through simulation or measurement).
• List of components differs between microarchitectures.
• Activity counts at each component differ between applications.
(34)
Architecture Activity (2)
[Figure: processor microarchitecture block diagram, repeated]
Activity 2: Instruction Decode
fbuffer.read++; idecoder.logic++;
• Read/write accesses to caches, buffers, etc.
• Logical accesses to logic blocks such as decoder, ALUs, etc.
• Tradeoff of differentiating more access types (accuracy) vs simulation speed (complexity).
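The counter updates shown for fetch and decode can be sketched as a small bookkeeping class that a simulator might call at each pipeline event. This is an illustrative sketch, not the structure of any specific simulator; the class and function names are hypothetical, while the component names mirror the counters on these slides.

```python
from collections import Counter


class ActivityCounters:
    """Accumulates (component, access type) counts during simulation."""

    def __init__(self):
        self.counts = Counter()

    def record(self, component, access):
        self.counts[(component, access)] += 1


def fetch(counters):
    # Activity 1: read the instruction cache, write the fetch buffer.
    counters.record("icache", "read")
    counters.record("fbuffer", "write")


def decode(counters):
    # Activity 2: read the fetch buffer, exercise the decoder logic.
    counters.record("fbuffer", "read")
    counters.record("idecoder", "logic")


counters = ActivityCounters()
for _ in range(3):          # three instructions pass through fetch and decode
    fetch(counters)
    decode(counters)
```

Distinguishing more access types only requires more `(component, access)` keys, which is where the accuracy versus simulation speed tradeoff shows up.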
(35)
Power and Architecture Activity
• For example, at the nth clock cycle, the collected counters are:
  - Data cache:
    - read = 20, write = 12;
    - per-read energy = 0.5 nJ; per-write energy = 0.6 nJ;
    - Read energy = read × per-read energy = 10 nJ
    - Write energy = write × per-write energy = 7.2 nJ
    - Total activity energy = read + write energies = 17.2 nJ
    - If n = 50th clock cycle and clock frequency = 2 GHz, total activity power = energy × clock_freq / n = 688 mW
*Note: n / clock_freq = n clock periods in seconds; power = time average of energy
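The data cache example can be reproduced in a few lines of Python. The numbers are the ones given on the slide; the function and variable names are illustrative.

```python
def activity_energy_nj(counts, per_access_nj):
    """Sum of count * per-access energy over all access types, in nJ."""
    return sum(counts[kind] * per_access_nj[kind] for kind in counts)


def average_power_w(energy_nj, cycles, clock_hz):
    """Average power = energy / elapsed time, with elapsed = cycles / clock."""
    elapsed_s = cycles / clock_hz
    return energy_nj * 1e-9 / elapsed_s


counts = {"read": 20, "write": 12}
per_access_nj = {"read": 0.5, "write": 0.6}

energy_nj = activity_energy_nj(counts, per_access_nj)        # about 17.2 nJ
power_w = average_power_w(energy_nj, cycles=50, clock_hz=2e9)  # about 0.688 W
```

Fifty cycles at 2 GHz is 25 ns, so 17.2 nJ over 25 ns gives the 688 mW quoted above.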
(36)
Things to consider (1)
1. How do we calculate per-read/write energies?
• Per-access energies can be estimated from circuit-level designs and analyses.
• There are various open-source tools for this.
[Figure: Architecture Specification and Technology Parameters → Circuit-level Estimation Tool → Estimation Results: Area, Energy, Timing, etc.]
(37)
Things to consider (2)
2. Is per-access energy always the same?
• Per-access energy in fact depends on:
  - how many bits are switching
  - how they are switching (0→1 or 1→0)
• It is reasonable to assume constant per-access energy in long-term observation (e.g., n = 1B clock cycles); the number of switching bits is averaged (e.g., 50% of bits are switching).
• Most architecture simulators do not capture bit-level details due to simulation complexity.
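The bit-level dependence can be made concrete by counting rising and falling transitions between two consecutive values on a bus or register. The sketch below is illustrative only: the per-transition energies are hypothetical numbers, and real tools derive them from circuit-level analysis.

```python
def switching_bits(old, new, width=32):
    """Return (rising, falling): counts of 0->1 and 1->0 bit transitions."""
    mask = (1 << width) - 1
    changed = (old ^ new) & mask
    rising = bin(changed & new).count("1")    # bit was 0, is now 1
    falling = bin(changed & old).count("1")   # bit was 1, is now 0
    return rising, falling


def access_energy_pj(old, new, e_rise_pj, e_fall_pj, width=32):
    """Bit-accurate access energy under assumed per-transition energies."""
    rising, falling = switching_bits(old, new, width)
    return rising * e_rise_pj + falling * e_fall_pj


rising, falling = switching_bits(0b1100, 0b1010, width=4)   # (1, 1)
```

Averaged over many accesses with random data, about half the bits switch, which is why a constant per-access energy is a reasonable long-run approximation.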
(38)
Things to consider (3)
3. If a register file didn’t have read/write accesses but held data, what is the energy dissipation?
• Energy (or power) consists largely of dynamic and static dissipation.
• Dynamic (or switching) energy refers to energy dissipation due to switching activities.
• Static (or leakage) energy is the dissipation needed to keep the electronic system turned on.
• In this case, the register file has no dynamic energy dissipation but consumes static energy.
(39)
Example: A Simple Energy Model
• We can use a simple model of per-access energy for the architecture components

Common Components                  Access Energy (10^-12 joules)
Inst. Cache + TLB                  Read 19.22    Write 21.6
Data Cache + TLB                   Read 25.28    Write 27.26
Inst. Decoder                      Logic Switching 16.78
Inst. Registers                    Read 2.74     Write 4.38
FP. Registers                      Read 1.26     Write 1.98
Other Buffers                      Read 9.74     Write 11.18
ALU + Result Bus (interconnect)    Logic Switching 123.2
FPU + Result Bus (interconnect)    Logic Switching 241.02

• Each unit can be accessed multiple times depending on instruction type
• An Intel/AMD x86 instruction consumes 600 pJ ~ 4 nJ dynamic energy.
@16nm
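Combining this table with per-instruction access counts gives a dynamic energy estimate for an instruction. In the Python sketch below the per-access energies are copied from the table, but the dictionary keys and the access mix for an integer add are assumptions made for illustration; a real instruction touches more structures (buffers, queues, possibly caches more than once).

```python
# Per-access energies in pJ, from the table above. Key names are illustrative.
ACCESS_ENERGY_PJ = {
    ("icache", "read"): 19.22, ("icache", "write"): 21.6,
    ("dcache", "read"): 25.28, ("dcache", "write"): 27.26,
    ("idecoder", "logic"): 16.78,
    ("regs", "read"): 2.74, ("regs", "write"): 4.38,
    ("fpregs", "read"): 1.26, ("fpregs", "write"): 1.98,
    ("buffers", "read"): 9.74, ("buffers", "write"): 11.18,
    ("alu", "logic"): 123.2,
    ("fpu", "logic"): 241.02,
}


def dynamic_energy_pj(accesses):
    """Sum of count * per-access energy over all recorded accesses."""
    return sum(n * ACCESS_ENERGY_PJ[key] for key, n in accesses.items())


# Assumed (simplified) access mix for one integer add:
# fetch, decode, two register reads, one ALU operation, one register write.
INT_ADD = {
    ("icache", "read"): 1,
    ("idecoder", "logic"): 1,
    ("regs", "read"): 2,
    ("alu", "logic"): 1,
    ("regs", "write"): 1,
}

int_add_pj = dynamic_energy_pj(INT_ADD)   # about 169.06 pJ
```

Even in this stripped-down mix the ALU and result bus dominate, which matches the table's emphasis on interconnect energy.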
Thermal Issues
(41)
Thermal Issues
• Heat can cause damage to the chip
  - Need failsafe operation
• Thermal fields change the physical characteristics
  - Leakage current and therefore power increases
  - Delay increases
  - Device degradation becomes worse
• Cooling solution determines the permitted power dissipation
(42)
Thermal Design Power (TDP)
• This is the maximum power at which the part is designed to operate
  - Dictates the design of the cooling system
    - Max temperature → Tjmax
  - Typically fixed by worst case workload
• Parts are typically operating below the TDP
• Opportunities for turbo mode?
AMD Trinity APU
http://ecs.vancouver.wsu.edu/thermofluids-research
(43)
Heat Sink Limits on Performance
• Thermal design power (TDP)
• Determines the cooling solution & package limits
• Performance depends on effective utilization of this thermal headroom
• Convert thermal headroom to higher performance through boosting
[Figure: heat sink photo (www.legitreviews.com); temperature, power, and instructions/cycle vs. time for a workload, showing thermal headroom, boost power, HW boost states, and SW visible states]
(44)
Trinity TDP
Source: http://www.anandtech.com/show/6347/amd-a10-5800k-a8-5600k-review-trinity-on-the-desktop-part-2
(45)
Issues
• Cooling chips is now an issue for computer architects!
• Co-design the cooling system and the processor
• Some very “cool” new technologies
  - E.g., microfluidics!
(46)
Electrical and Fluidic I/Os
• Fluid flow through the microchannels carries heat out to an external heat exchanger (e.g., heat sink)
Courtesy L. Zheng (ECE) and Professor Muhannad Bakir (ECE)
(47)
Fabrication Examples
Electrical and fluidic microbumps, fluidic vias and fine wires
Micropin-fins (150 µm diameter and 225 µm diameter) and vias
Courtesy L. Zheng (ECE) and Professor Muhannad Bakir (ECE)
(48)
IBM Series Mainframe
www.03.ibm.com
(49)
Immersion Cooling
www.physic.org
(50)
Conclusions
• Power/energy is the leading driver of modern architecture design
• Power and energy management is key to scalability
• Need integrated power/energy, performance, thermal management in fielded systems
• What about energy/power efficient algorithms?
(51)
Study Guide
• Explain the difference between energy dissipation and power dissipation
• Distinguish between static power dissipation and dynamic power dissipation
• Explain dynamic voltage frequency scaling
  - What are power states?
  - Why is this an advantage?
  - What is the impact of DVFS on i) energy, ii) execution time, and iii) power
• Distinguish between clock gating and power gating
(52)
Study Guide (cont.)
• Define thermal design power (TDP)
• Name two schemes for preventing the chip from exceeding TDP. Explain how they achieve this goal
• What does boosting achieve?
• What is the difference between C-states and P-states?
• Name one power management technique that will save static power
• How does using many slower simpler cores improve power efficiency?
(53)
Study Guide (cont.)
• How is thermal design power (TDP) calculated?
• When using boost algorithms, what determines the duration of the high frequency operation?
• How does a power virus work?
• Describe how throttling works
• Know the power dissipation in some modern processor-memory systems drawn from the embedded, server, and high performance computing segments
(54)
Glossary
• Boosting
• C-states
• Dynamic Power and Energy
• Power Gating
• P-states
• Static Power and Energy
• Time constant
• Thermal Design Power (TDP)
• Throttling