Computer Structure 2014 – Advanced Topics
Lihu Rappoport and Adi Yoaz


  • Slide 1
  • Computer Structure 2014 Advanced Topics 1 Computer Structure – Advanced Topics. Lihu Rappoport and Adi Yoaz
  • Slide 2
  • Computer Structure 2014 Advanced Topics 2 Intel Core™ Arch
  • Slide 3
  • Computer Structure 2014 Advanced Topics 3 Haswell builds upon innovations in the 2nd and 3rd Generation Intel Core i3/i5/i7 Processors (Sandy Bridge and Ivy Bridge). [Tick/Tock development model diagram: Nehalem – new Intel arch, 45nm, Tock (1st Generation Intel Core); Westmere – Nehalem arch on 32nm, Tick; Sandy Bridge – new Intel arch, 32nm, Tock (2nd Generation Intel Core); Ivy Bridge – Sandy Bridge arch on 22nm, Tick (3rd Generation Intel Core); Haswell – new Intel arch, 22nm, Tock (4th Generation Intel Core)] Foil taken from IDF 2012
  • Slide 4
  • Computer Structure 2014 Advanced Topics 4 5th Generation Intel Core™ Processor: 14nm Process Technology, Broadwell micro-architecture
  • Slide 5
  • Computer Structure 2014 Advanced Topics 5 Haswell Core at a Glance. Next-generation branch prediction: improves performance and saves wasted work. Improved front-end: initiate TLB and cache misses speculatively, handle cache misses in parallel to hide latency, leverage improved branch prediction. Deeper buffers: extract more instruction parallelism, more resources when running a single thread. More execution units, shorter latencies; power down when not in use. More load/store bandwidth: better prefetching, better cache-line-split latency & throughput, double L2 bandwidth; new modes save power without losing performance. No pipeline growth: same branch misprediction latency, same L1/L2 cache latency. Foil taken from IDF 2012. [Diagram: front-end pipeline stages – Branch Prediction, ITLB, I-cache Tag/Data, μop-Cache Tag/Data, Decode, μop Queue, μop Allocation, Out-of-Order Execution]
  • Slide 6
  • Computer Structure 2014 Advanced Topics 6 Branch Prediction Unit. Predicts branch targets: direct calls and jumps – target provided by a Target Array; indirect calls and jumps – predicted either as having a fixed target or as having targets that vary based on the execution path; returns – predicted by a 16-entry Return Stack Buffer (RSB). For conditional branches it predicts taken or not-taken. The BPU makes predictions for 32 bytes at a time – twice the width of the fetch engine – which enables taken branches to be predicted with no penalty. From the Optimization Manual / IDF 2012. [Diagram: front-end pipeline – Branch Prediction, ITLB, I-cache, μop-Cache, Decode, μop Queue, μop Allocation, Out-of-Order Execution]
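To make the RSB mechanism concrete, here is a minimal C sketch of a 16-entry return stack buffer: predicted calls push their fall-through (return) address, predicted returns pop the most recent entry as the target. The data structure and function names are illustrative only, not Intel's implementation.

```c
#include <stdint.h>

#define RSB_SIZE 16            /* 16-entry Return Stack Buffer, as on the slide */

typedef struct {
    uint64_t entries[RSB_SIZE];
    unsigned top;              /* wraps around; the oldest entries are overwritten */
} rsb_t;

/* On a predicted CALL: push the fall-through address (the return target). */
static void rsb_push(rsb_t *rsb, uint64_t return_addr) {
    rsb->top = (rsb->top + 1) % RSB_SIZE;
    rsb->entries[rsb->top] = return_addr;
}

/* On a predicted RET: pop the most recent return address as the prediction. */
static uint64_t rsb_predict_return(rsb_t *rsb) {
    uint64_t target = rsb->entries[rsb->top];
    rsb->top = (rsb->top + RSB_SIZE - 1) % RSB_SIZE;
    return target;
}
```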
  • Slide 7
  • Computer Structure 2014 Advanced Topics 7 Instruction TLB. Initiate TLB and cache misses speculatively; handle cache misses in parallel to hide latency; leverages improved branch prediction. [Diagram: front-end pipeline – Branch Prediction, ITLB, I-cache, μop-Cache, Decode, μop Queue, μop Allocation, Out-of-Order Execution] From the Optimization Manual / IDF 2012
  • Slide 8
  • Computer Structure 2014 Advanced Topics 8 Instruction Fetch and Pre-decode. 32KB 8-way I-Cache, fetching an aligned 16 bytes per cycle. Typical programs average ~4 bytes per instruction; a misaligned target into the line or a taken branch out of the line reduces the effective number of instruction bytes fetched. In typical integer code there is a taken branch every ~10 instructions, which translates into a partial fetch every 3–4 cycles. The Pre-Decode Unit determines the length of each instruction, decodes all prefixes associated with instructions, and marks various properties of instructions for the decoders (for example, whether it is a branch). [Diagram: 32KB L1 I-Cache → Pre-decode → Instruction Queue → Decoders (+ MSROM) → μop Queue] From the Optimization Manual
  • Slide 9
  • Computer Structure 2014 Advanced Topics 9 Instruction Pre-decode and IQ. The pre-decode unit writes 6 instructions/cycle into the IQ. If a fetch line contains >6 instructions, it continues to pre-decode 6 instructions/cycle until all instructions in the fetch line are written into the IQ; the subsequent fetch line starts pre-decode only after the current fetch line completes. A fetch line of 7 instructions thus takes 2 cycles to pre-decode – an average of 3.5 inst/cycle (still higher than the IPC of most apps). Length-changing prefixes (LCPs) change the instruction length; each LCP within the fetch line takes 3 cycles to pre-decode. The Instruction Queue is 18 instructions deep. [Diagram: 32KB L1 I-Cache → Pre-decode → Instruction Queue → Decoders (+ MSROM) → μop Queue] From the Optimization Manual
  • Slide 10
  • Computer Structure 2014 Advanced Topics 10 Instruction Decode and μop Queue. 4 decoders decode instructions into μops: the 1st decoder can decode all instructions of up to 4 μops, while the other 3 decoders handle common single-μop instructions; instructions with >4 μops generate their μops from the MSROM. Up to 4 μops/cycle are delivered into the μop Queue, which buffers 56 μops (28 μops/thread when running 2 threads). The μop Queue decouples the front end and the out-of-order engine, and helps hide bubbles introduced between the various sources of μops in the front end. [Diagram: 32KB L1 I-Cache → Pre-decode → Instruction Queue → Decoders (+ MSROM) → μop Queue] From the Optimization Manual
  • Slide 11
  • Computer Structure 2014 Advanced Topics 11 Macro-Fusion. The IQ sends up to 5 instructions/cycle to the decoders, which can merge two adjacent instructions into a single μop. A macro-fused instruction executes with a single dispatch, which reduces latency and frees execution resources, increases decode, rename and retire bandwidth, and saves power by representing more work in fewer bits. The 1st instruction of the pair modifies the flags: CMP, TEST, ADD, SUB, AND, INC, DEC. The 2nd instruction of the pair is a conditional branch. Such pairs are common in many types of applications. [Diagram: L1 I-Cache → Pre-decode → Instruction Queue → Decoders (+ MSROM) → μop Queue] From the Optimization Manual
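For illustration, a loop like the following typically compiles into exactly the adjacent flag-setting + conditional-branch pairs that macro-fusion targets. Whether the pairs actually fuse depends on the compiler keeping them adjacent; the assembly in the comments is a rough sketch with hypothetical register choices.

```c
/* Illustrative only: both the element test and the loop back edge are
 * compare-and-branch pairs that the decoders can macro-fuse, e.g. roughly:
 *     cmp  rax, rdx        ; a[i] < limit ?
 *     jl   .count          ; candidate for fusion with the cmp
 *     ...
 *     cmp  rcx, rsi        ; i < n ?
 *     jne  .loop           ; candidate for fusion with the cmp
 * Actual code generation depends on the compiler. */
long count_below(const long *a, long n, long limit) {
    long count = 0;
    for (long i = 0; i < n; i++)
        if (a[i] < limit)
            count++;
    return count;
}
```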
  • Slide 12
  • Computer Structure 2014 Advanced Topics 12 Stack Pointer Tracker. PUSH, POP, CALL, RET implicitly update ESP – adding or subtracting an offset, which would otherwise require a dedicated μop. The Stack Pointer Tracker performs these implicit ESP updates at decode by tracking the accumulated offset (delta). Example: PUSH EAX; PUSH EBX; INC ESP. Without the tracker each PUSH decodes into ESP = ESP − 4 plus STORE [ESP]; with it, the PUSHes decode into STORE [ESP−4], EAX (delta = −4) and STORE [ESP−8], EBX (delta = −8), and since INC ESP (ESP = ADD ESP, 1) reads the architectural ESP, the tracker must first sync it (ESP = SUB ESP, 8; delta back to 0) – "Need to sync ESP!". Benefits: improves decode BW (PUSH, POP and RET become single-μop instructions), conserves rename, execution and retire resources, removes dependencies on ESP (stack operations can execute in parallel), and saves power. From the Optimization Manual and ITJ
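A minimal C sketch of the delta-tracking idea follows. It only models the bookkeeping: function names and the printed "μops" are illustrative, it assumes 4-byte pushes, and it assumes any explicit ESP use forces a sync, mirroring the slide's example.

```c
#include <stdio.h>

/* Illustrative sketch of the Stack Pointer Tracker: PUSH updates a small
 * decode-time delta instead of emitting an ESP-update uop; only when an
 * instruction reads/writes ESP explicitly is a sync uop inserted. */
static int esp_delta = 0;          /* speculative ESP offset vs. architectural ESP */

static void decode_push(const char *src) {
    esp_delta -= 4;                /* PUSH src  ->  STORE [ESP+delta], src */
    printf("STORE [ESP%+d], %s\n", esp_delta, src);
}

static void decode_explicit_esp_use(void) {
    if (esp_delta != 0) {          /* e.g. INC ESP needs the architectural ESP */
        printf("ESP = ADD ESP, %d   ; sync uop (same as SUB ESP, %d)\n",
               esp_delta, -esp_delta);
        esp_delta = 0;
    }
}

int main(void) {
    decode_push("EAX");            /* PUSH EAX -> STORE [ESP-4], EAX */
    decode_push("EBX");            /* PUSH EBX -> STORE [ESP-8], EBX */
    decode_explicit_esp_use();     /* INC ESP  -> emit sync uop first */
    printf("ESP = ADD ESP, 1\n");
    return 0;
}
```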
  • Slide 13
  • Computer Structure 2014 Advanced Topics 13 Micro-Fusion. Fuses multiple μops from the same instruction into a single μop. An instruction which decodes into a single micro-fused μop can be handled by all decoders, which improves the instruction bandwidth delivered from decode to retirement and saves power. A micro-fused μop is dispatched multiple times in the OOO, as it would be if it were not micro-fused. Micro-fused instructions: stores (comprised of 2 μops – store-address and store-data – fused into a single μop); load + op instructions, e.g., FADD DOUBLE PTR [RDI+RSI*8]; load + jump, e.g., JMP [RDI+200]. From the Optimization Manual
  • Slide 14
  • Computer Structure 2014 Advanced Topics 14 Decoded μop-Cache. Caches the μops coming out of the decoders – up to 1.5K μops (32 sets × 8 ways × 6 μops/way). The next time, μops are taken from the μop Cache (~80% hit rate for most applications). It is included in the IC and iTLB, and flushed on a context switch. Higher bandwidth and lower latency: more cycles sustaining 4 instructions/cycle; in each cycle it provides μops for instructions mapped to 32 bytes, and it is able to stitch across taken branches in the control flow. [Diagram: Branch Prediction Unit and μop-Cache feeding the μop Queue, alongside 32KB L1 I-Cache → Pre-decode → Instruction Queue → Decoders] From the Optimization Manual
  • Slide 15
  • Computer Structure 2014 Advanced Topics 15 Decoded μop-Cache. The Decoded μop Cache lets the normal front end sleep – decode one time instead of many times. The branch misprediction penalty is reduced: the correct path is also the most efficient path. Save power while increasing performance. Foil taken from IDF 2011. [Diagram: front end with the legacy decode pipeline asleep ("Zzzz") while the Branch Prediction Unit and μop-Cache feed the μop Queue]
  • Slide 16
  • Computer Structure 2014 Advanced Topics 16 Loop Stream Detector (LSD). The LSD detects small loops that fit in the μop queue; the μop queue then streams the loop, allowing the front-end to sleep until a branch misprediction inevitably ends it. Loops qualify for LSD replay if all of the following conditions are met: up to 28 μops, up to 8 taken branches, up to 8 32-byte chunks; all μops are also resident in the μop-Cache; no CALL or RET; no mismatched stack operations (e.g., more PUSHes than POPs). A loop of this kind is sketched below. [Diagram: front end asleep ("Zzzz") while the μop Queue replays the loop] From the Optimization Manual
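As an illustration, a reduction loop of the following shape typically meets the conditions above once compiled: a handful of μops, one taken branch per iteration, no CALL/RET and no stack imbalance. Whether a specific binary qualifies of course depends on the code the compiler actually emits.

```c
/* A tight loop of this shape usually qualifies for LSD replay: after the
 * first iteration the uop queue can stream its few uops while the
 * fetch/decode pipeline sleeps. */
double sum(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)   /* single taken branch per iteration */
        s += a[i];
    return s;
}
```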
  • Slide 17
  • Computer Structure 2014 Advanced Topics 17 The Renamer. Moves 4 μops/cycle from the μop-queue to the OOO. Renames architectural sources and destinations of the μops to micro-architectural sources and destinations; allocates resources to the μops (e.g., load or store buffers); binds each μop to an appropriate dispatch port. Up to 2 branches each cycle. Some μops are executed to completion during rename, effectively costing no execution bandwidth: a subset of register-to-register MOV and FXCH, zero-idioms, and NOP. [Diagram: in-order Allocate/Rename/Retire with Load Buffers, Store Buffers and Reorder Buffers feeding the OOO Scheduler] From the Optimization Manual
  • Slide 18
  • Computer Structure 2014 Advanced Topics 18 Dependency Breaking Idioms. Zero-Idiom – an instruction that zeroes a register: regardless of the input data, the output is always 0, e.g., XOR REG, REG and SUB REG, REG, so the μop has no dependency on its sources. Zero-idioms are detected and removed by the Renamer; they do not consume execution resources and have zero execution latency. Zero-idioms remove partial register dependencies and improve instruction parallelism. Ones-Idiom – an instruction that sets a register to all 1s: regardless of the input data the output is always "all 1s", e.g., CMPEQ XMM1, XMM1. As with the zero-idiom, the μop has no dependency on its sources, so it can execute as soon as it finds a free execution port; unlike the zero-idiom, however, the ones-idiom μop must execute. From the Optimization Manual
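In compiled code these idioms usually appear when the compiler materializes the constants 0 and all-ones. The sketch below uses intrinsics that compilers commonly lower to PXOR (a zero idiom) and PCMPEQD (a ones idiom); the exact instruction selection is up to the compiler, so treat it as an illustration of the pattern.

```c
#include <immintrin.h>

/* Zero and ones idioms as a compiler typically emits them:
 * zeroing a register via XOR/PXOR of itself breaks any dependency on the
 * register's previous value; comparing a register with itself yields all-ones
 * lanes. Both avoid loading constants from memory. */
__m128i zero_and_ones(void) {
    __m128i zeros = _mm_setzero_si128();            /* usually PXOR xmm, xmm      */
    __m128i ones  = _mm_cmpeq_epi32(zeros, zeros);  /* usually PCMPEQD xmm, xmm   */
    return _mm_sub_epi32(ones, zeros);              /* use both so neither is dead */
}
```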
  • Slide 19
  • Computer Structure 2014 Advanced Topics 19 Out-of-Order Cluster. The Scheduler queues μops until all source operands are ready, then schedules and dispatches ready μops to the available execution units in as close to a first-in-first-out (FIFO) order as possible. Physical Register File (PRF): instead of a centralized Retirement Register File, a single copy of every datum with no movement after calculation; this allows a significant increase in buffer sizes (dataflow window ~33% larger). Retirement retires μops in order and handles faults and exceptions. [Diagram: in-order Allocate/Rename/Retire with Load/Store/Reorder Buffers, out-of-order Scheduler, Int PRF and FP/INT Vector PRF] Foil taken from IDF 2011
  • Slide 20
  • Computer Structure 2014 Advanced Topics 20 Intel Advanced Vector Extensions. Vectors are a natural data-type for many apps. AVX extends the SSE FP instruction set to 256-bit operand size and extends all 16 XMM registers to 256 bits (the low 128 bits of each YMM register are the corresponding XMM register). New, non-destructive source syntax: VADDPS ymm1, ymm2, ymm3. New operations to enhance vectorization: broadcasts, masked load & store. Wide vectors + non-destructive source: more work with fewer instructions; extending the existing state is area- and power-efficient. Foil taken from IDF 2011
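As a usage sketch (assuming a compiler with AVX enabled, e.g. -mavx), the intrinsic _mm256_add_ps maps to the three-operand VADDPS shown on the slide; both sources are preserved, so no extra register-copy instructions are needed. The loop is illustrative and omits remainder handling.

```c
#include <immintrin.h>

/* 256-bit AVX addition with the non-destructive three-operand form. */
void add_arrays(float *dst, const float *a, const float *b, long n) {
    for (long i = 0; i + 8 <= n; i += 8) {            /* 8 floats per 256-bit YMM */
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&dst[i], _mm256_add_ps(va, vb));   /* VADDPS ymm, ymm, ymm */
    }
    /* a scalar remainder loop would handle n not divisible by 8 */
}
```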
  • Slide 21
  • Computer Structure 2014 Advanced Topics 21 Execution Cluster. The scheduler sees a matrix: 3 ports to 3 stacks of execution units – general purpose integer, SIMD (vector) integer, and SIMD floating point. Challenge: double the output of one of these stacks in a manner that is invisible to the others. [Diagram: Ports 0, 1 and 5, each feeding a GPR stack (ALU, JMP), a SIMD INT stack (VI ADD, VI MUL, VI Shuffle) and a SIMD FP stack (FP ADD, FP MUL, FP Shuffle, FP Bool, FP Blend, DIV)] Foil taken from IDF 2011
  • Slide 22
  • Computer Structure 2014 Advanced Topics 22 Execution Cluster – 256-bit Vectors. Solution: repurpose the existing data paths for dual use. SIMD integer and legacy SIMD FP use the legacy 128-bit stack style, while Intel AVX utilizes both 128-bit execution stacks – a cool implementation of Intel AVX: 256-bit Multiply + 256-bit ADD + 256-bit Load per clock, doubling your FLOPs with great energy efficiency. [Diagram: the Port 0/1/5 stacks with the FP units (FP ADD, FP Multiply, FP Blend, FP Shuffle, FP Boolean) paired across both 128-bit stacks] Foil taken from IDF 2011
  • Slide 23
  • Computer Structure 2014 Advanced Topics 23 Haswell AVX2: FMA & Peak FLOPS. 2 new FMA (Fused Multiply-Add) units provide 2× peak FLOPs/cycle. 2× cache bandwidth to feed the wide vector units: 2× 32-byte load/store for L1, 2× L2 bandwidth. 5-cycle FMA latency – same as an FP multiply. [Chart: peak FLOPS/clock vs. peak cache bytes/clock for Banias, Merom, Sandy Bridge and Haswell] Foil taken from IDF 2012
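To see where the doubled peak comes from: 2 FMA ports × 8 single-precision lanes × 2 FLOPs per FMA = 32 SP FLOPs/cycle (16 DP). The sketch below, assuming a toolchain with FMA enabled (e.g. -mfma), shows an FMA-bound inner loop; it is illustrative only and omits remainder handling.

```c
#include <immintrin.h>

/* FMA-bound dot product sketch: each _mm256_fmadd_ps is one fused multiply-add
 * over 8 single-precision lanes (16 FLOPs). With two FMA ports, two such uops
 * can issue per cycle. Independent accumulators help hide the 5-cycle FMA
 * latency (more accumulators would be needed to actually reach peak). */
float dot(const float *a, const float *b, long n) {
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    for (long i = 0; i + 16 <= n; i += 16) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i]),     _mm256_loadu_ps(&b[i]),     acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(&a[i + 8]), _mm256_loadu_ps(&b[i + 8]), acc1);
    }
    float tmp[8], s = 0.0f;
    _mm256_storeu_ps(tmp, _mm256_add_ps(acc0, acc1));   /* horizontal sum of 8 lanes */
    for (int k = 0; k < 8; k++) s += tmp[k];
    return s;                                            /* remainder elements omitted */
}
```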
  • Slide 24
  • Computer Structure 2014 Advanced Topics 24 Haswell Execution Units. [Diagram: Unified Reservation Station dispatching to 8 ports – Port 0: Integer ALU & Shift, FMA/FP Multiply, Vector Int Multiply, Vector Logicals, Vector Shifts, Divide, Branch; Port 1: Integer ALU & LEA, FMA/FP Multiply/FP Add, Vector Int ALU, Vector Logicals; Ports 2 & 3: Load & Store Address; Port 4: Store Data; Port 5: Integer ALU & LEA, Vector Shuffle, Vector Int ALU, Vector Logicals; Port 6: Integer ALU & Shift, Branch; Port 7: Store Address] Callouts: 2 FMA units double peak FLOPs, and two FP multiplies benefit legacy code; the new store AGU on Port 7 leaves Ports 2 & 3 open for loads; the new branch unit on Port 6 reduces Port 0 conflicts and provides a 2nd EU for branch-heavy code; the 4th integer ALU is great for integer workloads and frees Ports 0 & 1 for vector work. Intel Microarchitecture (Haswell). Foil taken from IDF 2012
  • Slide 25
  • Computer Structure 2014 Advanced Topics 25 Memory Cluster. The L1 D$ handles 2 × 256-bit loads + 1 × 256-bit store per cycle. The ML$ can service one cache line (64 bytes) each cycle. 72 load buffers keep load μops from allocation till retirement and re-dispatch blocked loads. 42 store buffers keep store μops from allocation until the store value is written to the L1 D$, or written to the line fill buffers for non-temporal stores. [Diagram: load/store-address and store-data ports, memory control and fill buffers between the 32KB L1 D$ and the 256KB ML$] From the Optimization Manual
  • Slide 26
  • Computer Structure 2014 Advanced Topics 26 L1 Data Cache. Handles 2 × 256-bit loads + 1 × 256-bit store per cycle. Maintains requests which cannot be completed: cache misses, unaligned accesses that split across cache lines, data not yet ready to be forwarded from a preceding store, loads blocked due to cache line replacement. Handles up to 10 outstanding cache misses in the LFBs while continuing to service incoming stores and loads. The L1 D$ is a write-back, write-allocate cache: stores that hit in the L1 D$ do not update the L2/LLC/Mem, and stores that miss the L1 D$ allocate a cache line. From the Optimization Manual
  • Slide 27
  • Computer Structure 2014 Advanced Topics 27 Stores. Stores to memory are executed in two phases. At EXE: fill the store buffer with the linear + physical address and with the data; once the store address and data are known, forward the store data to load operations that need it. After the store retires – the completion phase: first, the line must be in the L1 D$ in E or M MESI state (otherwise, fetch it using a Read-for-Ownership request from the L1 D$ → L2$ → LLC → L2$ and L1 D$ in other cores → memory); then read the data from the store buffer and write it to the L1 D$ in M state – done at retirement to preserve the order of memory writes – and release the store buffer entry taken by the store. The completion phase affects performance only if the store buffer becomes full (allocation stalls); loads needing the store data get it, via forwarding, when the store is executed. From the Optimization Manual
  • Slide 28
  • Computer Structure 2014 Advanced Topics 28 Store Forwarding. Forward data directly from a store to a load that needs it, instead of having the store write the data to the DCU and then the load read the data from the DCU. Store-to-load forwarding conditions: the store is the last store to that address prior to the load (this requires the addresses of all stores older than the load to be known); the store contains all the data being loaded; the load is from a write-back memory type; neither the load nor the store is a non-temporal access; the load is not a 4- or 8-byte load that crosses an 8-byte boundary, relative to the preceding 16- or 32-byte store; the load does not cross a 16-byte boundary of a 32-byte store. A C illustration follows. From the Optimization Manual
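The sketch below contrasts a forwardable pattern with one that violates the "store contains all data being loaded" condition. An optimizing compiler may eliminate these local stores entirely, so treat it as an illustration of the access pattern rather than a benchmark.

```c
#include <stdint.h>
#include <string.h>

/* In fwd_ok() the 8-byte load reads exactly the bytes the preceding 8-byte
 * store wrote, so the data can be forwarded from the store buffer.
 * In fwd_blocked() the 8-byte load needs bytes from two separate 4-byte
 * stores, so forwarding fails and the load waits for the stores to reach the
 * cache (a store-forwarding stall). */
uint64_t fwd_ok(uint64_t x) {
    uint64_t buf;
    memcpy(&buf, &x, sizeof buf);      /* one 8-byte store            */
    return buf;                        /* one 8-byte load: forwardable */
}

uint64_t fwd_blocked(uint32_t lo, uint32_t hi) {
    uint32_t halves[2] = { lo, hi };   /* two 4-byte stores            */
    uint64_t v;
    memcpy(&v, halves, sizeof v);      /* one 8-byte load spanning both */
    return v;
}
```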
  • Slide 29
  • Computer Structure 2014 Advanced Topics 29 Memory Disambiguation. A load may depend on a preceding store, so a load is blocked until all preceding store addresses are known. The disambiguation predictor predicts which loads do not depend on any previous store, and forwards data from a store or the L1 D$ even when not all previous store addresses are known. The prediction is verified, and if a conflict is detected, the load and all succeeding instructions are re-fetched. It always assumes a dependency between loads and earlier stores that have the same address bits 0:11. The following loads are not disambiguated: loads that cross the 16-byte boundary, and 32-byte Intel AVX loads that are not 32-byte aligned. From the Optimization Manual
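The bits 0:11 rule is what makes "4K aliasing" visible in ordinary code: a load whose address matches a recent store in its low 12 bits looks dependent on it even when it is not. The sketch below is illustrative (buffer layout and names are assumptions); padding one of the streams so the two pointers are not exactly a multiple of 4 KB apart avoids the false conflict.

```c
/* 4K-aliasing illustration: src is exactly 4096 bytes after dst, so the load
 * from src[i] has the same address bits 0:11 as the store to dst[i] issued
 * just before it; the disambiguation logic conservatively assumes a
 * dependency and delays the load. The caller must provide a buffer of at
 * least 1024 + n floats. */
float sum_with_alias(float *base, long n, float x) {
    float *dst = base;                 /* store stream                     */
    float *src = base + 1024;          /* load stream, 1024 floats = 4 KB later */
    float s = 0.0f;
    for (long i = 0; i < n; i++) {
        dst[i] = x;                    /* store to dst[i]                  */
        s += src[i];                   /* load from dst[i] + 4096: same bits 0:11 */
    }
    return s;
}
```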
  • Slide 30
  • Computer Structure 2014 Advanced Topics 30 Data Prefetching Two hardware prefetchers load data to the L1 D$ DCU prefetcher (the streaming prefetcher) Triggered by an ascending access to very recently loaded data Fetches the next line, assuming a streaming load Instruction pointer (IP)-based stride prefetcher Tracks individual load instructions, detecting a regular stride Prefetch address = current address + stride Detects strides of up to 2K bytes, both forward and backward From the Optimization Manual
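As an illustration of what the IP-based stride prefetcher can and cannot help with: the first loop below has a fixed 256-byte stride (well under the 2 KB limit) that the prefetcher can learn, while the pointer-chasing loop has no fixed stride at all. The code is a sketch, not a measurement.

```c
/* Fixed-stride access: the IP-based stride prefetcher tracks this load
 * instruction and can prefetch current address + stride ahead of use. */
long sum_strided(const long *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i += 32)   /* 32 longs * 8 bytes = 256-byte stride */
        s += a[i];
    return s;
}

/* Data-dependent (pointer-chasing) walk: no fixed stride to learn, so each
 * miss pays the full memory latency. */
struct node { struct node *next; long value; };
long sum_list(const struct node *p) {
    long s = 0;
    for (; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```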
  • Slide 31
  • Computer Structure 2014 Advanced Topics 31 Core Cache Size/Latency/Bandwidth

    | Metric | Nehalem | Sandy Bridge | Haswell |
    |---|---|---|---|
    | L1 Instruction Cache | 32K, 4-way | 32K, 8-way | 32K, 8-way |
    | L1 Data Cache | 32K, 8-way | 32K, 8-way | 32K, 8-way |
    | Fastest load-to-use | 4 cycles | 4 cycles | 4 cycles |
    | Load bandwidth | 16 Bytes/cycle | 32 Bytes/cycle (banked) | 64 Bytes/cycle |
    | Store bandwidth | 16 Bytes/cycle | 16 Bytes/cycle | 32 Bytes/cycle |
    | L2 Unified Cache | 256K, 8-way | 256K, 8-way | 256K, 8-way |
    | Fastest load-to-use (L2) | 10 cycles | 11 cycles | 11 cycles |
    | Bandwidth to L1 | 32 Bytes/cycle | 32 Bytes/cycle | 64 Bytes/cycle |
    | L1 Instruction TLB | 4K: 128, 4-way; 2M/4M: 7/thread | 4K: 128, 4-way; 2M/4M: 8/thread | 4K: 128, 4-way; 2M/4M: 8/thread |
    | L1 Data TLB | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: fractured | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: 4, 4-way | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: 4, 4-way |
    | L2 Unified TLB | 4K: 512, 4-way | 4K: 512, 4-way | 4K+2M shared: 1024, 8-way |

    All caches use 64-byte lines. Foil taken from IDF 2012
  • Slide 32
  • Computer Structure 2014 Advanced Topics 32 Buffer Sizes

    | Buffer | Nehalem | Sandy Bridge | Haswell |
    |---|---|---|---|
    | Out-of-order Window | 128 | 168 | 192 |
    | In-flight Loads | 48 | 64 | 72 |
    | In-flight Stores | 32 | 36 | 42 |
    | Scheduler Entries | 36 | 54 | 60 |
    | Integer Register File | N/A | 160 | 168 |
    | FP Register File | N/A | 144 | 168 |
    | Allocation Queue | 28/thread | 28/thread | 56 |

    Extract more parallelism in every generation. Foil taken from IDF 2012
  • Slide 33
  • Computer Structure 2014 Advanced Topics 33 Haswell Core Arch. [Block diagram: front end (Branch Prediction Unit, 32KB L1 I-Cache, Pre-decode, Instruction Queue, Decoders, μop Cache, μop Queue); in-order Allocate/Rename/Retire with Load Buffers, Store Buffers and Reorder Buffers; OOO Scheduler dispatching to Ports 0–7 (ALUs, Shift, LEA, JMP, DIV, FMA, MUL, Shuffle and vector units; Ports 2/3: Load/Store Address, Port 4: Store Data, Port 7: Store Address); memory cluster (32KB L1 D$, Fill Buffers, Memory Control, 256KB ML$)] From the Optimization Manual
  • Slide 34
  • Computer Structure 2014 Advanced Topics 34 Hyper Threading Technology
  • Slide 35
  • Computer Structure 2014 Advanced Topics 35 Thread-Level Parallelism. Multiprocessor systems have been used for many years, and there are known techniques to exploit multiprocessors. Software trends: applications consist of multiple threads or processes that can be executed in parallel on multiple processors. Thread-level parallelism (TLP): threads can be from the same application, from different applications running simultaneously, or from operating system services. Increasing single-thread performance becomes harder and is less and less power efficient. Chip Multi-Processing (CMP): two (or more) processors are put on a single die. From the Optimization Manual
  • Slide 36
  • Computer Structure 2014 Advanced Topics 36 Multi-Threading Schemes Multi-threading: a single processor executes multiple threads Time-slice multithreading The processor switches between software threads after a fixed period Can effectively minimize the effects of long latencies to memory Switch-on-event multithreading Switch threads on long latency events such as cache misses Works well for server applications that have many cache misses A deficiency of both time-slice MT and switch-on-event MT They do not cover for branch mis-predictions and long dependencies Simultaneous multi-threading (SMT) Multiple threads execute on a single processor simultaneously w/o switching Makes the most effective use of processor resources Maximizes performance vs. transistor count and power From the Optimization Manual
  • Slide 37
  • Computer Structure 2014 Advanced Topics 37 Hyper-Threading Technology. Two logical processors in each physical processor, sharing most execution/cache resources using SMT; they look like two processors to SW (OS and apps). Each logical processor executes a software thread, and the threads execute simultaneously on one physical processor. Each LP maintains its own architectural state: a complete set of architectural registers (general-purpose registers, control registers, machine state registers, debug registers) and instruction pointer. Each LP has its own interrupt controller; interrupts sent to a specific LP are handled only by it. From the Optimization Manual
  • Slide 38
  • Computer Structure 2014 Advanced Topics 38 Two Important Goals When one thread is stalled the other thread can continue to make progress Independent progress ensured by either Partitioning buffering queues and limiting the number of entries each thread can use Duplicating buffering queues A single active thread running on a processor with HT runs at the same speed as without HT Partitioned resources are recombined when only one thread is active From the Optimization Manual
  • Slide 39
  • Computer Structure 2014 Advanced Topics 39 Single-task And Multi-task Modes. MT-mode (Multi-task mode): two active threads, with some resources partitioned as described earlier. ST-mode (Single-task mode): ST0 / ST1 – only thread 0 / 1 is active; resources that are partitioned in MT-mode are re-combined to give the single active logical processor use of all of the resources. [State diagram: transitions between ST0, ST1, MT and Low Power – a thread executing HALT moves the processor from MT to ST0/ST1, a HALT by the remaining thread moves it to Low Power, and an interrupt moves it back] From the Optimization Manual
  • Slide 40
  • Computer Structure 2014 Advanced Topics 40 Thread Optimization. The OS should implement two optimizations. (1) Use HALT if only one logical processor is active: this allows the processor to transition to either the ST0 or ST1 mode. Otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do; this so-called idle loop can consume significant execution resources that could otherwise be used by the other active logical processor. (2) On a multi-processor system, the OS views logical processors similarly to physical processors, but can still differentiate and prefer to schedule a thread on a new physical processor rather than on the 2nd logical processor of the same physical processor – i.e., schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor. This allows SW threads to use different physical resources when possible. From the Optimization Manual
  • Slide 41
  • Computer Structure 2014 Advanced Topics 41 Physical Resource Sharing Schemes. Replicated resources: each thread has its own resource. Partitioned resources: the resource is partitioned in MT mode, and combined in ST mode. Competitively-shared resources: both threads compete for the entire resource. HT-unaware resources: both threads use the resource. Alternating resources: alternate between the active threads; if one thread is idle, the active thread uses the resource continuously. From the Optimization Manual
  • Slide 42
  • Computer Structure 2014 Advanced Topics 42 Front End Resource Sharing. The Instruction Pointer is replicated. Branch prediction resources are either duplicated or shared, e.g., the return stack buffer is duplicated. iTLB: the large-page iTLB is replicated; the small-page iTLB is partitioned. The I$ is competitively-shared. The decoder logic is shared and its use is alternated; if only one thread needs the decode logic, it gets the full decode bandwidth. The μop-Cache is partitioned. The μop Queue is partitioned: 56 μops in ST mode, 28 μops/thread in MT mode. From the Optimization Manual
  • Slide 43
  • Computer Structure 2014 Advanced Topics 43 Back End Resource Sharing Out-Of-Order Execution Register renaming tables are replicated The reservation stations are competitively-shared Execution units are HT unaware The re-order buffers are partitioned Retirement logic alternates between threads Memory The Load buffers and store buffers are partitioned The DTLB and STLB are competitively-shared The cache hierarchy and fill buffers are competitively-shared From the Optimization Manual
  • Slide 44
  • Computer Structure 2014 Advanced Topics 44 Power Management
  • Slide 45
  • Computer Structure 2014 Advanced Topics 45 Processor Power Components. The power consumed by a processor consists of: Dynamic power – the power for toggling transistors and lines from 0→1 or 1→0, equal to α·C·V²·f, where α is the activity factor, C the capacitance, V the voltage and f the frequency; and Leakage power – the leakage of transistors under voltage, a function of Z (the total size of all transistors), V (voltage) and t (temperature). Peak power must not exceed the thermal constraints: power generates heat, and the heat must be dissipated to keep transistors within the allowed temperature. Peak power therefore determines peak frequency (and thus peak performance), and also affects form factor, cooling solution cost, and acoustic noise. Average power determines battery life (for mobile devices), the electricity bill and the air-conditioning bill: Average power = Total Energy / Total time, including low-activity and idle time (~90% idle time for client).
  • Slide 46
  • Computer Structure 2014 Advanced Topics 46 Performance per Watt. In small form-factor devices the thermal budget limits performance. Old target: get max performance. New target: get max performance at a given power envelope. Performance per Watt: increasing f also requires increasing V (~linearly), and Dynamic Power = C·V²·f ≈ K·f³, so X% more performance costs ~3X% more power. A power-efficient feature must therefore be better than a 1:3 performance:power ratio – otherwise it is better to just increase frequency (and voltage). Vmin is the minimal operating voltage: once at Vmin, reducing frequency no longer reduces voltage, and at this point a feature is power efficient only if it is better than 1:1 performance:power. Active energy efficiency tradeoff: Energy_active = Power_active × Time_active ∝ Power_active / Perf_active, so an energy-efficient feature also needs better than 1:1 performance:power. The ratios are spelled out below.
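To spell out the ratios the slide relies on, under its own simplifying assumption that voltage scales roughly linearly with frequency (real V/f curves are piecewise):

```latex
P_{\mathrm{dyn}} = \alpha C V^{2} f
  \;\;\xrightarrow{\;V \,\propto\, f\;}\;\;
P_{\mathrm{dyn}} \approx K f^{3}
  \;\;\Rightarrow\;\;
\frac{\Delta P}{P} \approx 3\,\frac{\Delta f}{f}
  \quad\text{(+X\% frequency/performance costs about +3X\% power).}

\text{At } V = V_{\min}:\quad
P_{\mathrm{dyn}} = \alpha C V_{\min}^{2} f \;\propto\; f
  \;\;\Rightarrow\;\;
\frac{\Delta P}{P} \approx \frac{\Delta f}{f}
  \quad\text{(the 1:1 regime);}\qquad
E_{\mathrm{active}} = P_{\mathrm{active}}\, t_{\mathrm{active}}
  \;\propto\; \frac{P_{\mathrm{active}}}{\mathit{Perf}_{\mathrm{active}}}.
```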
  • Slide 47
  • Computer Structure 2014 Advanced Topics 47 Platform Power Processor average power is
  • Computer Structure 2014 Advanced Topics 60 Turbo Mode. P1 is the guaranteed frequency: CPU and GFX under simultaneous heavy load at worst-case conditions. Actual power has a high dynamic range. P0 is the max possible frequency – the Turbo frequency; the P1-to-P0 range covers a significant frequency range (GHz), used for single-threaded or lightly loaded applications and for GFX/CPU balancing. The OS treats P0 as any other P-state, requesting it when it needs more performance, but the P1-to-P0 range is fully H/W controlled: frequency transitions are handled completely in HW, and the PCU keeps the silicon within existing operating limits. Systems are designed to the same specs, with or without Turbo Mode. Pn is the energy-efficient state; below Pn the frequency is controlled by T-states (throttling). [Diagram: frequency scale from LFM through the T-state/throttle range and the OS-visible, OS-controlled P-states Pn…P1, up to the H/W-controlled P0 / 1-core turbo frequency]
  • Slide 61
  • Computer Structure 2014 Advanced Topics 61 Turbo Mode. [Bar chart: lightly threaded workload, no Turbo – Cores 0 and 1 run at the non-Turbo frequency (F); Cores 2 and 3 are power gated, zero power for inactive cores]
  • Slide 62
  • Computer Structure 2014 Advanced Topics 62 Turbo Mode. [Bar chart: lightly threaded workload – Cores 2 and 3 are power gated (zero power for inactive cores) and Turbo Mode uses the thermal budget of the inactive cores to increase the frequency of the active Cores 0 and 1 above the no-Turbo level]
  • Slide 63
  • Computer Structure 2014 Advanced Topics 63 Turbo Mode. [Bar chart, as on the previous slide: lightly threaded workload – inactive Cores 2 and 3 are power gated, and their thermal budget is used to raise the frequency of the active Cores 0 and 1]
  • Slide 64
  • Computer Structure 2014 Advanced Topics 64 Turbo Mode. [Bar chart: all four cores active, running workloads < TDP – Turbo Mode increases the frequency of all cores within the thermal headroom]
  • Slide 65
  • Computer Structure 2014 Advanced Topics 65 Turbo Mode. [Bar chart: lightly threaded workload and active cores < TDP – inactive cores are power gated (zero power) and Turbo Mode raises the frequency of the active cores within the thermal headroom]
  • Slide 66
  • Computer Structure 2014 Advanced Topics 66 Thermal Capacitance. Classic model: steady-state thermal resistance – the design guide for steady state; temperature rises as energy is delivered to the thermal solution. New model: steady-state thermal resistance AND dynamic thermal capacitance – the thermal solution's response is calculated in real time, giving a more realistic response to power changes. [Charts: temperature vs. time for the classic model response and for the more realistic response to power changes] Foil taken from IDF 2011
  • Slide 67
  • Computer Structure 2014 Advanced Topics 67 Intel Turbo Boost Technology 2.0. After idle periods the system accumulates energy budget and can accommodate high power/performance for a few seconds (P > TDP: responsiveness); in steady-state conditions the power stabilizes at TDP (sustained power). Build up thermal budget during idle periods, and use the accumulated energy budget to enhance the user experience. [Chart: power vs. time – sleep or low power, then a C0/P0 (Turbo) burst above TDP, then power settling at TDP] Foil taken from IDF 2011
  • Slide 68
  • Computer Structure 2014 Advanced Topics 68 Core and Graphic Power Budgeting. Cores and Graphics are integrated on the same die with separate voltage/frequency controls and tight HW control. The full package power specification is available for sharing, and the power budget can shift between the Cores and Graphics. [Chart: Core Power [W] vs. Graphics Power [W] – the realistic concurrent max power of applications (from heavy CPU workloads to heavy Graphics workloads) lies below the sum of the individual max power specifications, so the total package power can be shared; Sandy Bridge and next-gen parts Turbo above it for short periods] Foil taken from IDF 2011