μ-op Fission: Hyper-threading without the Hyper-headache
Anthony Cartolano, Robert Koutsoyannis, Daniel S. McFarlin
Carnegie Mellon University, Dept. of Electrical and Computer Engineering
[Chart: Streaming WHT Results on Intel Multicore: Best Versus Single Core]
Hierarchy of Parallel Hardware Features
Characterized by degree of sharing, i.e. (cost of communication)^-1
SIMD: share register file, functional units, cache hierarchy
SMP: share main memory and QPI link
[Figure: two dual-core CPUs, each pair of cores sharing a cache and memory; sharing decreases and communication cost rises along the spectrum SIMD -> SMT -> CMP -> SMP]

Hierarchy of Software Interfaces?
General-purpose SW threading model to rule them all: the affinity knob
Overhead: intrinsics vs. OpenMP vs. APIs + runtimes
Amortizing runtime overhead dictates partitioning: precludes SMT
[Figure: the same SIMD -> SMT -> CMP -> SMP spectrum, now ordered by coupling to hardware and by software overhead]

View SMT and SIMD as a continuum for parallel numeric code
Expose SMT to SW and encode SMT loop parallelization with intrinsics to minimize overhead
Requires HW/SW support and co-optimization to achieve

Outline
Motivation
SMT Hardware Bottlenecks
μ-op Fission
Required software support
Experimental Results
Concluding Remarks

Generic N-way SMT
Narrow Front-End vs. Wide Back-End

2-Way SMT Case Study
Front-end (fetch & decode) is implemented as fine-grained multi-threading
Done to reduce i-cache and decode latency
Can lead to back-end starvation
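To make the starvation arithmetic concrete, here is a back-of-the-envelope sketch in C; the widths (4-wide decode, 6-wide issue) are assumed for illustration and are not figures from the talk.

    /* Illustrative only: assumed pipeline widths, not from the talk.
     * With a fine-grained multithreaded front-end, the decoder
     * alternates between threads each cycle, so total decode bandwidth
     * stays fixed while the shared back-end can issue more uops per
     * cycle than the front-end supplies. */
    #include <stdio.h>

    int main(void) {
        const double decode_width = 4.0; /* assumed front-end uops/cycle */
        const double issue_width  = 6.0; /* assumed back-end uops/cycle  */
        const int    threads      = 2;   /* 2-way SMT                    */

        /* Round-robin fetch: each thread sees only a slice of decode. */
        printf("per-thread decode: %.1f uops/cycle\n",
               decode_width / threads);

        /* Even with perfect ILP, back-end utilization is capped by
         * what the front-end delivers: 4/6, roughly 67%. */
        printf("back-end utilization bound: %.0f%%\n",
               100.0 * decode_width / issue_width);
        return 0;
    }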
μ-op Fission
2 iterations compiler-fused with a clone bit
Equivalent to the unroll-and-jam optimization
Decoder fissions iterations into the available RATs
Reduces fetch & decode bandwidth and power
Allows a narrow front-end to keep a wide back-end pipeline full
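As a source-level picture of what "compiler-fused with a clone bit" means, here is a minimal sketch of the equivalent unroll-and-jam shape; the kernel and names are illustrative, and the clone-bit encoding itself is not expressible in plain C.

    /* Minimal sketch: two independent iteration bodies fused into one
     * loop body -- the shape μ-op fission is described as equivalent
     * to.  A fission-capable decoder would steer each body (marked by
     * a clone bit) into a different SMT context's RAT. */
    void fused_axpy(int n, float *restrict y, const float *restrict x,
                    float a)
    {
        /* n is assumed even here; the leftover iteration is omitted. */
        for (int i = 0; i < n; i += 2) {
            y[i]     += a * x[i];     /* iteration i   -> SMT context 0 */
            y[i + 1] += a * x[i + 1]; /* iteration i+1 -> SMT context 1 */
        }
    }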
Formal HW/SW Co-Design: The Data Pump Architecture
[Figure: a simple manycore of SIMD cores, each with a vector register file and local memory (VRF/LM); SW-parameterizable NUMA; SW-programmable memory controller; decoupled compute and communication instruction streams; asynchronous gather/scatter between MEM and LM]

High-Dimensional, Non-Uniform HW Parameter Space
Software Architecture: Spiral
Library generator for linear transforms (DFT, DCT, DWT, filters, ...) and recently more
Wide range of platforms supported: scalar, fixed point, vector, parallel, Verilog, GPU
Research goal: teach computers to write fast libraries
Complete automation of implementation and optimization
Conquer the high algorithm level for automation
When a new platform comes out: regenerate a retuned library
When a new platform paradigm comes out (e.g., CPU+GPU): update the tool rather than rewriting the library
Intel uses Spiral to generate parts of their MKL and IPP libraries
Spiral: A Domain-Specific Program Generator
[Figure: transform (user specified) -> fast algorithm in SPL (many choices) -> Σ-SPL -> C code, with constant folding and scheduling applied along the way]
But that's not all: optimization happens at all abstraction levels (parallelization, vectorization, loop optimizations, constant folding, scheduling)
Iteration of this process to search for the fastest implementation
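For flavor, this is the kind of flat, unrolled straight-line C such a generator emits for a tiny transform; it is a hand-written 4-point DFT in that style, not actual Spiral output.

    #include <complex.h>

    /* Hand-written in the style of generated code (not actual Spiral
     * output): a fully unrolled 4-point DFT with common subexpressions
     * factored out -- no loops, no twiddle-factor table. */
    void dft4(const double complex x[4], double complex y[4])
    {
        double complex t0 = x[0] + x[2];
        double complex t1 = x[0] - x[2];
        double complex t2 = x[1] + x[3];
        double complex t3 = x[1] - x[3];

        y[0] = t0 + t2;      /* DC term                        */
        y[1] = t1 - I * t3;  /* twiddle -i folded into +/- ops */
        y[2] = t0 - t2;
        y[3] = t1 + I * t3;
    }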
Spiral Formula Representation of SAR
[Figure: PFA SAR image formation pipeline: grid compute -> range interpolation -> azimuth interpolation -> 2D FFT]
Spiral's Automatically Generated PFA SAR Image Formation Code
[Chart: performance on newer platforms for 16-megapixel and 100-megapixel images]
Algorithm by J. Rudin (best paper award, HPEC 2007): 30 Gflop/s on Cell
Each implementation: vectorized, threaded, cache tuned, ~13 MB of code
Code was not written by a human

Required Software Support
Executable has stub code which initializes page tables and other CPU control registers at load time on all HW contexts
Compiler performs a virtual loop-unroll-and-jam on tagged loops
Maximizes sharing
SMT thread-cyclic partitioning of iterations i = 0..3, each running the body A;B;C;D:
  1 context:  th0: i = 0, 1, 2, 3
  2 contexts: th0: i = 0, 2   th1: i = 1, 3
  4 contexts: th0: i = 0   th1: i = 1   th2: i = 2   th3: i = 3

Compile-Time Fork Support
Need a lightweight fork mechanism
Presently, can only communicate between SMT register files via memory
Load/Store Queue prevents materialization in most cases
Prefer to have a multi-assign statement for the loop index, taking a vector input
Need a broadcast assignment for the loop's live-in set (see the sketch below)
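A hypothetical source-level rendering of these requirements; the intrinsic names (__smt_bcast, __smt_fork_iota) are invented here for illustration and do not correspond to any existing compiler API.

    /* Hypothetical sketch only: invented prototypes standing in for
     * the proposed fork/broadcast primitives. */
    extern void __smt_bcast(int n, float *dst, const float *src, float s);
    extern int  __smt_fork_iota(int nway);

    void scale(int n, float *dst, const float *src, float s)
    {
        /* Broadcast assignment: replicate the live-in set into every
         * SMT context's register file, avoiding a round trip through
         * memory (and hence the load/store queue). */
        __smt_bcast(n, dst, src, s);

        /* Multi-assign with a vector input: context k starts at
         * i = k, i.e. i = {0,1,2,3} for 4-way SMT, then strides by 4
         * -- the thread-cyclic partitioning shown above. */
        for (int i = __smt_fork_iota(4); i < n; i += 4)
            dst[i] = s * src[i];

        /* All contexts drain their ROBs (cf. waitrobs) on return. */
    }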
[Figure: distributing the loop index i to 4 SMT contexts]
Via memory (today):
  store 0, (a0,0); store 1, (a0,1); store 2, (a0,2); store 3, (a0,3);
  each context: load addr, a0; load (a0, APIC_ID), i;
With multi-assign and broadcast (proposed):
  i = {0,1,2,3};  fissions into i = 0; i = 1; i = 2; i = 3, one per context
  load addr, a0;  broadcast to all four contexts

Compile-Time Sample Loop Execution
Fused stream as emitted by the compiler (2-way):
  i = {0,1};
  load addr, a0;
  L0: cmp i, n;
  jmpgte END;
  A;B;C;D;
  add 2, i;
  flushgte jmp L0
  END: waitrobs
Fissioned stream on context 0:
  i = 0;
  load addr, a0;
  L0: cmp i, n;
  jmpgte END;
  A;B;C;D;
  add 2, i;
  jmp L0
  END: waitrobs
Fissioned stream on context 1:
  i = 1;
  load addr, a0;
  L0: cmp i, n;
  jmpgte END;
  A;B;C;D;
  add 2, i;
  jmp L0
  END: waitrobs

Experimental Setup
Used PTLSim, configured with a larger ROB and physical register file than Nehalem and a smaller number of functional units
Simulated μ-op fission with explicit unroll-and-jam of the source code coupled with penalized functional-unit latencies
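The slide's evaluation method, rendered as code: μ-op fission is approximated by explicitly unroll-and-jamming the source. The specific kernel below is illustrative, not one of the benchmarked kernels.

    /* Explicit 4-way unroll-and-jam standing in for 4-way μ-op
     * fission in the simulator: the four independent bodies model the
     * four per-context μ-op streams a fissioning decoder would
     * produce.  (Functional-unit latencies are penalized in the
     * PTLSim configuration, not in the source.) */
    void axpy_unroll4(int n, float *restrict y, const float *restrict x,
                      float a)
    {
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
        for (; i < n; i++)   /* remainder iterations */
            y[i] += a * x[i];
    }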
Experimental Results
Results Drilldown

Concluding Remarks
Demonstrated a HW/SW co-optimization approach to SMT parallelization
Preliminary evaluation suggests performance benefit for a range of numerical kernels
Scales with number of SMT contexts
Thread Fusion research suggests a 10-15% power consumption reduction is possible due to reduced fetch/decode
Future work: handling control flow with predication and diverge-merge
THANK YOU! Questions?
Speedup of Various Kernels via μ-op Fission SMT (percentage reduction in total cycle count):

  Kernel                        2-way SMT   4-way SMT
  SSE Interpolation                 9          20
  Scalar Interpolation             15          41
  Streaming WHT                     2          25
  Non-2-Power DFTs                  1           2
  Binary Matrix Factorization       4           6
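Since the chart's title says "speedup" but the metric is percent reduction in total cycle count, here is a quick conversion sketch (my arithmetic, not from the talk): a reduction r in cycles corresponds to a speedup of 1/(1-r).

    #include <stdio.h>

    /* Convert "percent reduction in total cycle count" to speedup:
     * speedup = old/new = 1 / (1 - r).  E.g. the 4-way SMT Scalar
     * Interpolation entry, 41% fewer cycles, is a 1.69x speedup. */
    int main(void) {
        const double reductions[] = {0.20, 0.41, 0.25, 0.02, 0.06};
        for (int i = 0; i < 5; i++)
            printf("%2.0f%% fewer cycles -> %.2fx speedup\n",
                   100.0 * reductions[i], 1.0 / (1.0 - reductions[i]));
        return 0;
    }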