Target Machine Architecture
Group 1, presented by Greg Klepic
Sections
● 5.1 The Memory Hierarchy
● 5.2 Data Representation
● 5.3 Instruction Set Architecture
● 5.4 Architecture and Implementation
● 5.5 Compiling for Modern Processors
Memory Hierarchy

Level                      Typical Access Time   Typical Capacity
Registers                  0.2–0.5 ns            256–1024 bytes
Primary (L1) cache         0.4–1 ns              32 KB–256 KB
L2 or L3 (on-chip) cache   4–30 ns               1 MB–32 MB
Off-chip cache             10–50 ns              up to 128 MB
Main memory                50–200 ns             256 MB–16 GB
Flash                      40–400 µs             4 GB–1 TB
Disk                       5–15 ms               500 GB and up
Tape                       1–50 s                effectively unlimited
Memory Hierarchy
● Moving down the hierarchy increases latency but also increases capacity.
● Registers are accessed in one clock cycle; everything else takes more.
● Keeping data moving up the hierarchy prevents idle clock cycles.
● Use of caches versus main memory and other slower sources is limited by physical size and cost.
● Latency is limited by technology and by the bandwidth of buses.
● Disk and tape drives are also limited by the mechanical speed of the drive.
Memory Hierarchy: Registers
● (Figure: x86-64 register diagram.)
● For 32-bit applications, only the white parts of the diagram are available.
● The XMM 128-bit registers are used for SSE vector operations.
Memory Hierarchy: Caches
● A cache is typically on the same chip as the CPU.
● Caches are much larger than registers but have slower access times; they are much faster than main memory.
● The L1 cache is divided into two parts: one for instructions, the other for data.
● The L3 cache is shared among the cores of a multi-core processor.
● Programs exploit locality to avoid cache misses.
● (Figure: cache hierarchy of a 4-core Intel Westmere processor.)
Caches: Locality
● Caches exploit locality to avoid cache misses.
● Spatial locality: the tendency of a program to access data sequentially, e.g. arrays.
● Temporal locality: the tendency to reuse certain data, e.g. a local variable in a loop.
● Cache miss: if data is not in the cache at the time it is needed, the processor must wait for memory (wasted clock cycles).
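The effect of spatial locality can be sketched with a pair of traversals over a 2-D array. This is an illustrative example, not from the slides: in a language with flat row-major storage (C, NumPy), the row-by-row loop touches consecutive addresses and misses the cache far less often than the column-by-column loop, even though both compute the same sum.

```python
# Row-major vs. column-major traversal of the same matrix.
N = 512
matrix = [[1] * N for _ in range(N)]

def sum_row_major(m):
    total = 0
    for i in range(len(m)):          # walk each row left to right:
        for j in range(len(m[0])):   # consecutive addresses, good spatial locality
            total += m[i][j]
    return total

def sum_col_major(m):
    total = 0
    for j in range(len(m[0])):       # walk down each column:
        for i in range(len(m)):      # strides of N elements, poor spatial locality
            total += m[i][j]
    return total

assert sum_row_major(matrix) == sum_col_major(matrix) == N * N
```

In Python the timing difference is muted by interpreter overhead; on a row-major array in C the row-major loop is typically several times faster for large N.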
Memory Hierarchy: The Processor/Memory Gap
● Over time, processor speed has increased at a much greater rate than RAM speed.
● If data had to be accessed from RAM frequently, the processor would be idle too often.
Memory: Alignment
● Operands typically appear in several sizes: 1, 2, 4, and 8 bytes.
● Some architectures require n-byte operands to appear at an address divisible by n.
● Others, like x86, do not, but run faster when operands are aligned.
● Buses are designed in such a way that if operands are not aligned, bits must be shifted, which takes time.
● Forcing alignment also allows offsets to be specified in words rather than bytes.
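The rounding-up step that allocators and compilers perform to satisfy such a divisible-by-n rule can be sketched in a few lines (an illustrative helper, not from the slides):

```python
# Round an address up to the next n-byte boundary (n must be a power of two).
def align_up(addr, n):
    # Adding n-1 then masking with ~(n-1) clears the low-order bits,
    # so the result is the smallest multiple of n that is >= addr.
    return (addr + n - 1) & ~(n - 1)

assert align_up(0x1003, 4) == 0x1004   # misaligned word moved to next boundary
assert align_up(0x1000, 4) == 0x1000   # already aligned: unchanged
assert align_up(0x1001, 8) == 0x1008   # 8-byte operand needs an 8-byte boundary
```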
Data Representation
● Operations interpret bits in memory in different ways.
● Data formats include instructions, addresses, binary integers, floating-point numbers, and characters.
● Integers come in half-word, word, and double-word lengths.
● Floating-point numbers come in single- and double-precision lengths.
Little-Endian vs. Big-Endian
● Placing the least significant byte of a multiword datum at the lowest address is called little-endian.
● Placing the most significant byte at the lowest address is called big-endian.
● Little-endian is tolerant of variations in operand size.
● x86 is little-endian; most other common architectures can be either.
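Both byte orders can be observed directly with Python's struct module, which lets you pack the same 32-bit value with '<' (little-endian) or '>' (big-endian):

```python
import struct

value = 0x12345678
little = struct.pack('<I', value)   # least significant byte at the lowest address
big    = struct.pack('>I', value)   # most significant byte at the lowest address

assert little == b'\x78\x56\x34\x12'
assert big    == b'\x12\x34\x56\x78'

# The tolerance of operand size mentioned above: in little-endian order the
# byte at the starting address is the same whether the value is later read
# back as a 2-byte or a 4-byte quantity.
assert little[0] == struct.pack('<H', value & 0xFFFF)[0]
```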
Operations on Characters
● Support varies from architecture to architecture.
● Some architectures can perform arithmetic and logical operations on 1-byte quantities.
● Most do not, and provide only loads and stores.
● x86 has instructions for strings of characters, e.g. copying, comparing, and searching.
● Vector operations in x86 can also be used on strings.
Integer Arithmetic: Representation
● There are two representations of integers: signed and unsigned.
● There are also two sets of operators; unsigned arithmetic is used for pointers.
● Unsigned integers are commonly written in hexadecimal preceded by 0x, e.g. 0x400 = 4×16^2 + 0×16^1 + 0×16^0 = 1024 = 0100 0000 0000 in binary.
● Signed arithmetic uses two's complement representation.
● (Examples of 4-bit two's complement integers were shown here.)
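The 4-bit two's complement examples the slide refers to can be reconstructed with a small helper (an illustrative sketch): the same bit pattern is read as unsigned or, if its top bit is set, as that value minus 2^width.

```python
# Interpret a bit pattern as a width-bit two's complement integer.
def to_signed(bits, width=4):
    # If the most significant bit is set, the value is bits - 2**width.
    return bits - (1 << width) if bits & (1 << (width - 1)) else bits

assert 0x400 == 4 * 16**2 + 0 * 16 + 0 == 1024   # the slide's hex example
assert to_signed(0b0111) == 7     # largest 4-bit positive value
assert to_signed(0b1000) == -8    # smallest: 1 followed by zeros
assert to_signed(0b1111) == -1    # all ones is -1
```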
Signed Integer Arithmetic: Addition
● The most significant bit of every negative number is one; for non-negative numbers it is zero. Non-negative two's complement integers are represented the same as unsigned.
● The smallest negative number in an n-bit range is 1 followed by n−1 zeros, and the normal rules apply to increasing magnitude.
● The addition algorithm is the same as for unsigned integers; no additional logic is needed.
● Overflow occurs when a result is too large to fit in a word.
● If addition of two non-negative integers gives a "negative" result (i.e., the most significant bit flips), there is overflow; similarly for two negative integers.
● Subtraction is adding the additive inverse.
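The overflow rule above can be sketched as a 4-bit adder (an illustrative model, not hardware): the sum is the same as unsigned, truncated to 4 bits, and overflow is flagged exactly when two like-signed operands produce a result of the opposite sign.

```python
WIDTH = 4

def add4(a, b):
    result = (a + b) & 0b1111                 # keep only 4 bits, as hardware does
    sign = 1 << (WIDTH - 1)                   # mask for the most significant bit
    # Overflow: operands agree in sign, result disagrees.
    overflow = (a & sign) == (b & sign) and (result & sign) != (a & sign)
    return result, overflow

assert add4(0b0011, 0b0010) == (0b0101, False)   # 3 + 2 = 5, fits
assert add4(0b0111, 0b0001) == (0b1000, True)    # 7 + 1 "goes negative": overflow
assert add4(0b1111, 0b0001) == (0b0000, False)   # -1 + 1 = 0, carry out but no overflow
```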
Floating-Point Representation
● Prior to 1985, floating-point formats were poorly defined and varied from machine to machine.
● IEEE standard 754 defines two sizes of floats: single and double precision.
● Single: a sign bit, an 8-bit exponent, and a 23-bit significand.
   ○ Represents numbers from roughly 10^-38 to 10^38.
● Double: an 11-bit exponent and a 52-bit significand.
   ○ Represents numbers from roughly 10^-308 to 10^308.
● With s the sign bit, sig the significand, and exp the exponent:
   ○ (−1)^s × sig × 2^exp is the value of a given float.
● The stored exponent field is biased: the true exp is obtained by subtracting 127 for single precision and 1023 for double precision.
● sig is always 1.x, unless the value is very close to zero, in which case sig = 0.x × 2^(min+1), where min is the smallest allowed exponent.
● Special bit patterns represent zero, ∞, −∞, and Not-a-Number values.
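The (−1)^s × sig × 2^exp decomposition can be recovered from the raw bits of a single-precision value; a sketch using Python's struct module (for normal numbers only — zeros, denormals, infinities, and NaNs would need the special-pattern cases above):

```python
import struct

def decode_single(x):
    # Reinterpret the float's 32 bits as an unsigned integer.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    s = bits >> 31                 # 1-bit sign
    e = (bits >> 23) & 0xFF        # 8-bit biased exponent field
    f = bits & 0x7FFFFF            # 23-bit fraction field
    exp = e - 127                  # remove the bias
    sig = 1 + f / 2**23            # implicit leading 1 for normal numbers
    return s, exp, sig

s, exp, sig = decode_single(-6.5)
assert (s, exp, sig) == (1, 2, 1.625)   # -6.5 = (-1)^1 * 1.625 * 2^2
```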
Instruction Set Architecture (ISA)
● An ISA comprises the instructions available on a given machine and their encoding in machine language.
● Instructions fall into three groups:
   ○ Computation: arithmetic and logical operations, tests, and comparisons on values in registers or memory (with the address held in a register).
   ○ Data movement: loads and stores between memory and registers, and copies from one register or memory location to another.
   ○ Control flow: branches, subroutine calls, and returns.
ISAs
● There are two differing philosophies:
● CISC (Complex Instruction Set Computing): do as much work as possible with each instruction, as in x86.
● RISC (Reduced Instruction Set Computing): maximize the number of instructions performed per second, as in ARM, MIPS, etc.
● RISC instructions are more suitable for pipelined architectures.
● Most CISC implementations convert instructions to RISC-like operations internally to facilitate pipelining.
Addressing Modes
● RISC systems allow computational instructions only on values held in registers; this is called a register-register architecture.
● CISC systems allow computational instructions to access operands directly in memory: a register-memory architecture.
● x86 allows two-address instructions, meaning one operand will be overwritten by the result.
● Other architectures allow a third address to specify where to send the result.
● Displacement addressing gives an address as a displacement relative to a base register; it is used by some RISC ISAs.
● In indexed addressing, the address is computed from two registers, e.g. one holding the address of an array and the other an index; ARM uses this.
● CISC machines have more complex addressing modes.
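The effective-address arithmetic behind the two RISC-style modes can be sketched as plain functions (the function names and register values here are hypothetical, purely for illustration):

```python
# Displacement addressing: address = base register + constant displacement,
# e.g. a MIPS load  lw $t0, 8($s0).
def displacement(base_reg, disp):
    return base_reg + disp

# Indexed addressing: address = base register + index register * element size,
# e.g. an ARM load  ldr r0, [r1, r2, lsl #2]  (scale 4 for word-sized elements).
def indexed(base_reg, index_reg, scale=4):
    return base_reg + index_reg * scale

array_base = 0x2000                              # hypothetical array address
assert displacement(array_base, 8) == 0x2008     # field at byte offset 8
assert indexed(array_base, 3) == 0x200C          # element 3 of a word array
```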
Conditions and Branching
● Condition codes in a special processor status register control branch flow.
● Operations may change these codes.
● A conditional move instruction moves a value only if the codes are set appropriately.
● Predication allows any operation to be marked as conditional.
● Branched code can then be made branchless: instructions in the branch that is not taken have no effect.
● Doing this can save time if both branches are short.
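Making short branched code branchless can be sketched with a maximum function. Python has no conditional-move or predicated instructions, so the mask trick below only mimics what such hardware does: −(a < b) is 0 or −1 (all ones in two's complement), so the expression selects b or a without a branch.

```python
def branchy_max(a, b):
    if a < b:          # a real branch the predictor must guess
        return b
    return a

def branchless_max(a, b):
    mask = -(a < b)                 # 0 if a >= b, -1 (all ones) if a < b
    return a ^ ((a ^ b) & mask)     # b when mask is all ones, otherwise a

for a, b in [(3, 7), (7, 3), (5, 5), (-2, -9)]:
    assert branchless_max(a, b) == branchy_max(a, b) == max(a, b)
```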
Architecture and Implementation
Four architectural breakthroughs led to modern processing:
1. Microprogramming: in the early 1960s IBM developed a way to share code among processor "families"; previously, code was written for a specific machine. This led to more concise instructions.
2. The microprocessor: by the mid-1970s a processor could be implemented on a single chip with local registers; registers were 8 bits to start and grew as transistor counts increased.
3. RISC: compute many small pipelined operations in parallel, sometimes with more than one pipeline.
4. Multicore processors: a limit on clock speed was reached due to heat; continued decreases in feature size led instead to the development of multicore processors.
● More recently, SoCs (systems on a chip) and HSAs (heterogeneous system architectures).
Compiling for Modern Processors
● Main concerns: effective use of the pipeline and registers.
● Reasons a pipeline may stall:
1. Cache misses: instructions or data are not ready in the cache.
2. Resource hazards: two instructions need the same functional unit.
3. Data hazards: an operand is needed but is still being computed by another instruction.
4. Control hazards: fetching cannot proceed until a branch is resolved.
● Branch prediction: predict the branch outcome based on past results and roll back execution if needed, to avoid control hazards.
● Instruction scheduling: reorder instructions at compile time to minimize stalling and maximize instruction-level parallelism.
Branch Prediction
● Early RISC processors provided delayed branch instructions.
   ○ The instruction following the branch is executed no matter what the outcome of the branch.
● This proved impractical due to scheduling conflicts.
● The modern solution is the branch predictor: guess the outcome of each branch and continue without knowing the result, backtracking as needed.
● Avoiding cache misses typically falls to the programmer; the compiler assumes cache hits, which is safe in most programs.
● Even fast caches have some delay, so instructions can be reordered to hide that delay, as long as the result of the program doesn't change.
Scheduling Dependences
● Flow dependence (true, or read-after-write, dependence): a later instruction uses a value produced by an earlier instruction.
● Anti-dependence (write-after-read dependence): a later instruction overwrites a value read by an earlier instruction.
● Output dependence (write-after-write dependence): a later instruction overwrites a value written by an earlier instruction.
● The second and third types can frequently be eliminated by the compiler renaming registers.
   ○ This increases the number of registers used but also increases instruction-level parallelism.
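The three dependence kinds can be sketched with plain variables standing in for registers (variable names here are hypothetical); the comments mark which pairs of "instructions" must stay ordered and which renaming frees up.

```python
# Flow (read-after-write): r2 needs the value r1 produces.
# This is a true dependence -- no renaming can remove it.
r1 = 10          # write r1
r2 = r1 + 1      # read r1

# Anti (write-after-read): the write to r1 must stay after the read in r3...
r3 = r1 * 2      # read r1
r1 = 99          # write r1
# ...but renaming the second write to a fresh register (r1b) removes the
# ordering constraint, so the two instructions could execute in parallel.
r1b = 99

# Output (write-after-write): two writes to r4 must stay ordered so later
# readers see the final value -- unless each write gets its own register.
r4 = 1
r4 = 2

assert (r2, r3, r1b, r4) == (11, 20, 99, 2)
```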
Register Allocation
● A two-stage process:
1. Identify "blocks" of code: sequences with no branches in or out.
   a. Within each block, assign a "virtual register" to each value.
2. The compiler maps the virtual registers of a subroutine to the architectural registers.
   a. It reuses the same register when possible.
   b. If there are not enough registers, values spill over into memory.
● Instruction scheduling then takes the instructions and may reorder them; this may increase the number of registers needed, but also decrease stalling.
Subroutine Calls
● A small subroutine may be inlined with the rest of the code; this increases code length but may help with register allocation.
● If inlining is not possible, the subroutine is treated separately, with registers allocated to the new routine.
   ○ If a value is used by the calling program, it must be spilled to memory.
   ○ The calling program must then reread the new values from memory after the call.
   ○ Sometimes the compiler must make conservative assumptions; if necessary, the subroutine must write every variable in scope back to memory.
● Inlining saves time but can increase cache usage.
Conclusions
● Changes in hardware, such as pipelining and RISC instructions, have increased the complexity of the machine code needed to take advantage of the hardware.
● The processor-memory gap (the relatively rapid increase of processor performance versus memory) necessitates taking advantage of the memory hierarchy.
● Optimizations to code take place at the ISA level, the compiler level, and the program level.