Pedro Trancoso - UCYtsik/EPL605/Chapter5-Memory.pdf · 1...

1

Memory Hierarchy Design

Pedro Trancoso

H&P Appendix C, H&P Chapter 5

2

The Beginning...

“Ideally one would desire an indefinitely large memory capacity such that any particular…word would be immediately available…We are…forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible”

A. W. Burks, H. H. Goldstine, and J. von Neumann (1946)

3

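The Motivation...

• Gap between processor performance and memory performance:

• Example (memory latency in clock cycles; the totals on the slide match lost instruction issue slots at 2, 4 and 6 instructions per clock):
  – Alpha 200MHz: 340ns / 5.0ns = 68 clk, x 2 = 136
  – Alpha 300MHz: 266ns / 3.3ns ≈ 80 clk, x 4 = 320
  – Alpha 566MHz: 180ns / 1.7ns ≈ 106 clk (the slide's total uses 108 clk), x 6 = 648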

4

Typical Memory Hierarchy

5

Cache Terminology...

• Cache, fully associative, write allocate, virtual memory, dirty bit, unified cache, memory stall cycles, block offset, misses per instruction, direct mapped, write back, block, valid bit, data cache, locality, block address, hit time, address trace, write through, cache miss, set, instruction cache, page fault, random replacement, average memory access time, miss rate, index field, cache hit, n-way set associative, no-write allocate, page, least-recently used, write buffer, miss penalty, tag field, write stall, ...

6

Review

• Cache access
  – Hit rate (cache hit), hit time
  – Miss rate (cache miss), miss penalty
• Locality
  – Temporal locality
    • If an item is referenced, it will tend to be referenced again soon (e.g. loops, reuse)
  – Spatial locality
    • If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g. straight-line code, array accesses)

7

Cache Performance

• Example:
  – CPI = 1, load/store = 50% of instructions, miss penalty = 25 clk, miss rate = 2%, speedup = ?

CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time

Memory stall cycles = Number of misses x Miss penalty
                    = IC x (Misses / Instruction) x Miss penalty
                    = IC x (Memory accesses / Instruction) x Miss rate x Miss penalty

Memory stall cycles = IC x Reads per instruction x Read miss rate x Read miss penalty
                    + IC x Writes per instruction x Write miss rate x Write miss penalty

(Misses / Instruction) = (Miss rate x Memory accesses) / Instruction count
                       = Miss rate x (Memory accesses / Instruction)
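
The example on this slide can be worked through with the stall-cycle formula. This is a sketch under two assumptions that the slide does not state: the speedup asked for is that of an ideal cache that never misses over the real cache, and every instruction makes one instruction fetch in addition to its data accesses, so memory accesses per instruction is 1 + 0.5 = 1.5.

```latex
\begin{align*}
\text{Memory stall cycles} &= IC \times 1.5 \times 0.02 \times 25 = 0.75 \times IC \\
\text{CPU time}_{\text{real}} &= (1.0 \times IC + 0.75 \times IC) \times \text{Clock cycle time}
                               = 1.75 \times IC \times \text{Clock cycle time} \\
\text{CPU time}_{\text{ideal}} &= 1.0 \times IC \times \text{Clock cycle time} \\
\text{Speedup} &= \frac{\text{CPU time}_{\text{real}}}{\text{CPU time}_{\text{ideal}}} = 1.75
\end{align*}
```

Under these assumptions the machine with the ideal cache would be 1.75 times faster.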

8

Cache

[Figure: cache organization, with the labels BLOCK, SET and WORD]

9

Four Questions

1. Where can a block be placed? (block placement)

2. How is a block found if it is in the cache? (block identification)

3. Which block should be replaced on a miss? (block replacement)

4. What happens on a write? (write strategy)

10

Where Can a Block Be Placed?

• Cache organization:
  – Direct mapped: each block has only one place it can appear
    Mapping = (Block address) MOD (Number of blocks in cache)
  – Fully associative: a block can be placed anywhere
  – Set associative: a block can be placed in a restricted set of places
    Mapping = (Block address) MOD (Number of sets in cache)
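
As a small illustration of the two mapping formulas, assume a hypothetical cache with 8 block frames and a block with block address 12 (neither number comes from the slide):

```latex
\begin{align*}
\text{Direct mapped (8 blocks):} \quad & 12 \bmod 8 = 4 \\
\text{2-way set associative (4 sets):} \quad & 12 \bmod 4 = 0 \\
\text{Fully associative (1 set):} \quad & 12 \bmod 1 = 0 \quad \text{(any of the 8 frames)}
\end{align*}
```

Direct mapped and fully associative are the two extremes of set associativity: one block per set and one set for the whole cache.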

11

How Is a Block Found if It Is in the Cache?

• Address:
  – Block offset
  – Block address: index + tag
  – Operation:
    • Index to find where the block may be,
    • Tag to check whether it is the specific block (a tag check is done against all candidate tags),
    • Offset to locate the data within the block
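
The address split described above can be sketched in C. The type `cache_fields` and the function `split_address` are hypothetical names introduced for illustration, and the cache geometry is passed in as parameters rather than taken from any machine in the slides.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields a cache extracts from an address. */
typedef struct {
    uint64_t tag;     /* compared with the stored tags (tag check) */
    uint64_t index;   /* selects the set where the block may be */
    uint64_t offset;  /* locates the data inside the block */
} cache_fields;

/*
 * Hypothetical helper: split a byte address for a cache with
 * block_bytes bytes per block and num_sets sets.
 * Hardware uses power-of-two sizes and plain bit slicing; division and
 * modulo give the same result and keep the sketch short.
 */
static cache_fields split_address(uint64_t addr, uint64_t block_bytes,
                                  uint64_t num_sets)
{
    assert(block_bytes > 0 && num_sets > 0);

    cache_fields f;
    uint64_t block_address = addr / block_bytes;  /* index + tag */

    f.offset = addr % block_bytes;
    f.index  = block_address % num_sets;  /* (Block address) MOD (sets) */
    f.tag    = block_address / num_sets;
    return f;
}
```

With 64-byte blocks and 256 sets, `split_address(0x12345, 64, 256)` yields tag 4, index 141 and offset 5. With `num_sets = 1` (fully associative) the index is always 0 and the whole block address becomes the tag.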

12

Which Block Should Be Replaced on a Miss?

• Direct mapped:
  – Only one block can be replaced
• Fully associative (and set associative):
  – Random (simple)
  – Least-recently used (LRU)
  – First in, first out (FIFO)

13

What Happens on a Write?

• Most accesses are reads (e.g. 10% stores and 37% loads for 5 SPECint2000 programs)
• On a read, the tag and the block can be read at the same time; on a write, however...
• Two options for writing:
  – Write through: write to both the block in the cache and the block in main memory
  – Write back: write only to the block in the cache. The modified block is written to main memory upon replacement (uses a dirty bit)

14

What Happens on a Write?

• Two options on a write miss (write miss policy):
  – Write allocate: the block is allocated (read miss + write)
  – No-write allocate: does not affect the cache; only the lower-level memory is modified
• Improvement...
  – Write buffer: the processor continues execution while the data is written to the buffer

15

Example: Alpha 21264 Data Cache

16

Cache Performance

• Exercise: Which organization has the lower miss rate:
  – 16 KB instruction cache + 16 KB data cache
  – 32 KB unified cache
  – Assume: 36% of instructions are data transfers, hit = 1 clk, miss penalty = 100 clk; a unified cache with a single port costs 1 extra clk when 2 requests arrive together; write-through with a write buffer (ignore stalls due to the write buffer)

Average memory access time = AMAT = Hit time + Miss rate x Miss penalty
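
The exercise can be set up symbolically from the numbers given. The miss rates of the three caches are not on the slide, so they are left as the symbols MR_I, MR_D and MR_U (instruction, data and unified); treating every data access to the unified cache as paying the extra clock is an assumption of this sketch. With one instruction fetch per instruction and 0.36 data accesses per instruction, the access mix is 1/1.36 ≈ 73.5% instruction and 0.36/1.36 ≈ 26.5% data.

```latex
\begin{align*}
\text{Miss rate}_{\text{split}} &= 0.735 \times MR_I + 0.265 \times MR_D \\
\text{AMAT}_{\text{split}} &= 0.735 \times (1 + MR_I \times 100) + 0.265 \times (1 + MR_D \times 100) \\
\text{AMAT}_{\text{unified}} &= 0.735 \times (1 + MR_U \times 100) + 0.265 \times (1 + 1 + MR_U \times 100)
\end{align*}
```

The extra clock in the unified case means the organization with the lower miss rate is not necessarily the one with the lower average memory access time.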

17

Cache Performance

• Exercise: cache miss penalty = 100 clk, all instructions take 1 clk, average miss rate = 2%, average memory references per instruction = 1.5, average cache misses per 1000 instructions = 30. Performance with and without the cache = ?

CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
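
A worked version of this exercise follows. The "without cache" case is a sketch that assumes every memory reference then pays the full 100 clk; the slide does not say how that case should be modelled.

```latex
\begin{align*}
\text{CPU time}_{\text{cache}} &= IC \times \left(1 + \tfrac{30}{1000} \times 100\right) \times \text{Clock cycle time}
                                = 4.0 \times IC \times \text{Clock cycle time} \\
\text{(check via miss rate)} \quad & 1 + 1.5 \times 0.02 \times 100 = 4.0 \\
\text{CPU time}_{\text{no cache}} &= IC \times (1 + 1.5 \times 100) \times \text{Clock cycle time}
                                   = 151 \times IC \times \text{Clock cycle time} \\
\frac{\text{CPU time}_{\text{no cache}}}{\text{CPU time}_{\text{cache}}} &= \frac{151}{4.0} \approx 37.8
\end{align*}
```

The two ways of counting misses agree because 1.5 references per instruction at a 2% miss rate is exactly 30 misses per 1000 instructions.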

18

Reducing the Miss Rate

• Types of misses (the "three C's")
  – Compulsory: first access to a block. Also called cold-start misses or first-reference misses
  – Capacity: the cache cannot contain all the blocks
  – Conflict: many blocks map to the same set. Also called collision misses or interference misses
  – (fourth C: Coherence…)

19

Six Basic Cache Optimizations

1. Larger block size to reduce miss rate
2. Bigger caches to reduce miss rate
3. Higher associativity to reduce miss rate
4. Multilevel caches to reduce miss penalty
5. Giving priority to read misses over writes to reduce miss penalty
6. Avoiding address translation during indexing of the cache to reduce hit time

20

Larger Block Size

• Reduces compulsory misses (spatial locality)

• Increases miss penalty

• May increase conflict and capacity misses

21

Larger Block Size

• Exercise: miss rate by block size and cache size

                      Cache size
Block size (bytes)    4K      16K     64K     256K
16                    8.57%   3.94%   2.04%   1.09%
32                    7.24%   2.87%   1.35%   0.70%
64                    7.00%   2.64%   1.06%   0.51%
128                   7.78%   2.77%   1.02%   0.49%
256                   9.51%   3.29%   1.15%   0.49%

  – Assume the memory system takes 80 clk of overhead and then delivers 16 bytes every 2 clk (16 bytes in 82 clk, 32 bytes in 84 clk, …). Which block size has the smallest average memory access time for each cache size?
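
The exercise can be checked with a short C sketch that applies AMAT = Hit time + Miss rate x Miss penalty to every table entry. The hit time of 1 clk is an assumption (the slide does not give one), and the function names are hypothetical.

```c
#include <assert.h>

#define N_BLOCK 5   /* block sizes in the table */
#define N_CACHE 4   /* cache sizes in the table: 4K, 16K, 64K, 256K */

static const int block_bytes[N_BLOCK] = {16, 32, 64, 128, 256};

/* Miss rates in percent, copied from the table (rows: block size). */
static const double miss_pct[N_BLOCK][N_CACHE] = {
    {8.57, 3.94, 2.04, 1.09},
    {7.24, 2.87, 1.35, 0.70},
    {7.00, 2.64, 1.06, 0.51},
    {7.78, 2.77, 1.02, 0.49},
    {9.51, 3.29, 1.15, 0.49},
};

/* Miss penalty in clk: 80 clk overhead plus 2 clk per 16 bytes. */
static double miss_penalty(int bytes)
{
    return 80.0 + 2.0 * (bytes / 16);
}

/* AMAT in clk for table row b and column c; hit time of 1 clk assumed. */
static double amat(int b, int c)
{
    assert(b >= 0 && b < N_BLOCK && c >= 0 && c < N_CACHE);
    return 1.0 + miss_pct[b][c] / 100.0 * miss_penalty(block_bytes[b]);
}

/* Block size in bytes with the smallest AMAT for cache column c. */
static int best_block_bytes(int c)
{
    int best = 0;
    for (int b = 1; b < N_BLOCK; b++)
        if (amat(b, c) < amat(best, c))
            best = b;
    return block_bytes[best];
}
```

Under the 1 clk hit-time assumption, `best_block_bytes` returns 32 for the 4K cache and 64 for the 16K, 64K and 256K caches: the larger miss penalty of big blocks outweighs their lower miss rate well before the miss rate itself starts rising.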

22

Bigger Caches

• Reduces miss rate

[Figure: total miss rate (0 to 0.12) versus cache size (4KB to 512KB), one curve per associativity: 1-way, 2-way, 4-way, 8-way]

23

Higher Associativity

• Reduces miss rate

[Figure: total miss rate (0 to 0.12) versus associativity (1-way to 8-way), one curve per cache size: 4KB, 8KB, 16KB, 32KB, 64KB, 128KB, 256KB, 512KB]

24

Multilevel Caches

• A faster cache or a bigger cache? Add a second-level cache (L2)
• Miss rates of the L2:
  – Local miss rate = misses in L2 / accesses to L2
  – Global miss rate = misses in L2 / memory accesses from the processor
• Multilevel inclusion / exclusion

AMAT = Hit time L1 + Miss rate L1 x Miss penalty L1
Miss penalty L1 = Hit time L2 + Miss rate L2 x Miss penalty L2

AMAT = Hit time L1 + Miss rate L1 x (Hit time L2 + Miss rate L2 x Miss penalty L2)

(Miss rate L2 in these formulas is the local miss rate.)

Average memory stalls per instruction = Misses per instruction L1 x Hit time L2 + Misses per instruction L2 x Miss penalty L2
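
To make the local and global miss rates concrete, here is a worked sketch with purely illustrative numbers that are not from the slide: 1000 memory references, 40 misses in L1, 20 misses in L2, hit time 1 clk for L1 and 10 clk for L2, and an L2 miss penalty of 200 clk.

```latex
\begin{align*}
\text{Miss rate}_{L1} &= 40 / 1000 = 4\% \\
\text{Local miss rate}_{L2} &= 20 / 40 = 50\% \\
\text{Global miss rate}_{L2} &= 20 / 1000 = 2\% = \text{Miss rate}_{L1} \times \text{Local miss rate}_{L2} \\
\text{AMAT} &= 1 + 0.04 \times (10 + 0.50 \times 200) = 1 + 0.04 \times 110 = 5.4 \text{ clk}
\end{align*}
```

The local L2 miss rate looks high because L1 has already filtered out the easy hits; the global rate is the better measure of how often the processor goes all the way to memory.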

25

Giving priority to read misses over writes

• Priority for read misses over writes
  – Serve the read before the write completes
  – Problems?

26

Avoiding Address Translation

• Virtually addressed cache
  – Problems: protection, process switches, aliases between operating system and user programs
• Virtually indexed, physically tagged cache
  – Problem: no bigger than a page size…only with associativity

[Figure: miss rate (0% to 25%) versus cache size (2KB to 1024KB) for three schemes: Purge, PIDs, Uniprocess]

27

Summary of Basic Cache Optimizations

Technique                      Hit time  Miss penalty  Miss rate  Hardware complexity  Comment
Larger block size                        -             +          0                    Trivial; P4 L2 128 byte
Larger cache size              -                       +          1                    Widely used, for L2
Higher associativity           -                       +          1                    Widely used
Multilevel caches                        +                        2                    Costly hardware; harder if L1 block != L2 block; widely used
Read priority over writes                +                        1                    Widely used
Avoiding address translation   +                                  1                    Widely used

(+ means the technique improves the factor, - means it hurts it, blank means no effect)

28

Eleven Advanced Optimizations of Cache Performance

• Reducing the hit time: small and simple caches, way prediction, and trace caches

• Increasing cache bandwidth: pipelined caches, multibanked caches, and non-blocking caches

• Reducing the miss penalty: critical word first and merging write buffers

• Reducing the miss rate: compiler optimizations

• Reducing the miss penalty or miss rate via parallelism: hardware prefetching and compiler prefetching

29

Reducing the Hit Time

• Small and simple caches
  – L1 does not usually increase in size with each processor generation (16KB in the Pentium III, 8KB in the Pentium 4!)
• Way prediction and pseudoassociative caches
• Trace caches
  – Dynamic sequences of instructions, including taken branches
  – Intel NetBurst (Pentium 4)
  – Problem: code replication

30

Increasing Bandwidth (1)

• Pipelined cache access
• Multibanked caches

[Figure: block addresses 0 to 7 spread across four banks]
Bank 0   Bank 1   Bank 2   Bank 3
0        1        2        3
4        5        6        7

31

Increasing Bandwidth (2)

• Nonblocking caches to increase cache bandwidth
  – Nonblocking or lockup-free cache: multiple outstanding requests (hit-under-miss, hit-under-multiple-miss, miss-under-miss)
  – Example:
    • Which is more important for fp programs: two-way set associativity or hit-under-one-miss? For integer programs? Assume for an 8KB data cache: miss rate for fp is 11.4% direct mapped and 10.7% 2-way; miss rate for int is 7.4% direct mapped and 6.0% 2-way. Miss penalty is 16 clk. Hit-under-one-miss reduces the average memory stall to 76% for fp and 81% for int.
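
The comparison can be worked out as average memory stall per access (miss rate x miss penalty). This sketch takes the 76% and 81% figures as the fraction of the blocking cache's stall time that remains with hit-under-one-miss, applied to the direct-mapped cache; that reading is an assumption.

```latex
\begin{align*}
\text{fp, direct mapped:} \quad & 0.114 \times 16 = 1.824 \text{ clk} \\
\text{fp, 2-way:} \quad & 0.107 \times 16 = 1.712 \text{ clk} \\
\text{fp, direct mapped + hit-under-one-miss:} \quad & 0.76 \times 1.824 \approx 1.39 \text{ clk} \\[4pt]
\text{int, direct mapped:} \quad & 0.074 \times 16 = 1.184 \text{ clk} \\
\text{int, 2-way:} \quad & 0.060 \times 16 = 0.960 \text{ clk} \\
\text{int, direct mapped + hit-under-one-miss:} \quad & 0.81 \times 1.184 \approx 0.96 \text{ clk}
\end{align*}
```

Under this reading, hit-under-one-miss helps fp programs more than 2-way associativity does (1.39 versus 1.71 clk), while for integer programs the two are about equal (0.96 clk each).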

32

Reducing the Miss Penalty

• Critical word first and early restart
  – Critical word first: requests the missed word first
  – Early restart: restarts execution as soon as the word arrives
• Merging write buffer
  – Multiword writes are faster than multiple single-word writes
  – Lower probability of the buffer becoming full

33

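Reducing the Miss Rate

• Compiler optimizations
  – Loop interchange

    /* Before: the inner loop strides through memory 100 elements at a time */
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

    /* After: the inner loop touches consecutive elements */
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];

[Figure: order in which x is traversed in each version; labels: i=0 j=0, i=0 j=100, i=1 j=0]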
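
A self-contained C sketch of the two loop nests makes the point that interchange changes only the access order, not the result. The array type (`int`) and the helper names `fill`, `all_scaled` and `check_equivalence` are assumptions made for the sketch; the slide gives only the loop bodies.

```c
#include <assert.h>

#define ROWS 5000
#define COLS 100

/* C stores arrays row by row: x[i][j] and x[i][j+1] are adjacent. */
static int x[ROWS][COLS];

/* Original order: the inner loop walks down a column (stride of COLS ints). */
static void scale_column_order(void)
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* Interchanged order: the inner loop walks along a row (stride of 1 int). */
static void scale_row_order(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}

/* Hypothetical helper: fill x with a known pattern. */
static void fill(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = i + j;
}

/* Hypothetical helper: 1 if every element equals factor * (i + j). */
static int all_scaled(int factor)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            if (x[i][j] != factor * (i + j))
                return 0;
    return 1;
}

/* Both orders must produce the same array from the same input. */
static void check_equivalence(void)
{
    fill();
    scale_column_order();
    assert(all_scaled(2));

    fill();
    scale_row_order();
    assert(all_scaled(2));
}
```

Calling `check_equivalence()` confirms that both versions compute the same values. The benefit of the interchanged version is purely in locality: each cache block brought in is fully used before the loop moves on.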

34

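Reducing the Miss Rate

• Blocking

[Figure: two matrix-multiplication diagrams (operands, "X", "=", result), presumably before and after blocking]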
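
Blocking is usually illustrated with matrix multiplication, so here is a minimal C sketch of that case. The matrix size `N`, the blocking factor `B`, the `int` element type and all function names are assumptions chosen for illustration; the slide itself names only the technique.

```c
#include <assert.h>

#define N 64   /* matrix dimension (assumed for the sketch) */
#define B 16   /* blocking factor (assumed); chosen in practice so the
                  submatrices in use fit in the cache */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Unblocked product x = y * z. */
static void matmul_naive(int x[N][N], int y[N][N], int z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked product: works on B x B submatrices; x must be zero on entry. */
static void matmul_blocked(int x[N][N], int y[N][N], int z[N][N])
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + B, N); j++) {
                    int r = 0;
                    for (int k = kk; k < min_int(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;  /* accumulate the partial sum of this block */
                }
}

static int in_y[N][N], in_z[N][N], out_naive[N][N], out_blocked[N][N];

/* Hypothetical self-check: both versions must give the same product. */
static void check_blocked_matches_naive(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            in_y[i][j] = i + j;
            in_z[i][j] = i - j;
            out_blocked[i][j] = 0;
        }
    matmul_naive(out_naive, in_y, in_z);
    matmul_blocked(out_blocked, in_y, in_z);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(out_naive[i][j] == out_blocked[i][j]);
}
```

The blocked version does the same arithmetic but reuses each B x B piece of `y` and `z` while it is still in the cache, which targets capacity misses rather than the stride problem that loop interchange addresses.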

35

Reducing the Miss Penalty or Miss Rate via Parallelism

• Hardware prefetching of instructions and data
  – Instruction prefetch: on a miss, fetch 2 blocks: the requested block (goes to the cache) and the next block (goes to the instruction stream buffer)
  – Data prefetch: calculates the stride between accesses to prefetch (UltraSPARC III: 8 simultaneous prefetches)
  – Example:
    • What is the effective miss rate when using prefetching? How much larger would a cache need to be to match the AMAT? Assume a 64KB data cache, prefetching reduces the data miss rate by 20%, misses per 1000 instructions = 36.9, 22% data references, 1 extra clk for the prefetch buffer, miss penalty 15 clk.
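
The first part of this example can be worked through as follows. The sketch assumes a hit time of 1 clk (not given on the slide) and reads the 20% reduction as the fraction of data misses that are found in the prefetch buffer at a cost of 1 extra clk.

```latex
\begin{align*}
\text{Data miss rate} &= \frac{36.9 / 1000}{0.22} \approx 16.8\% \\
\text{AMAT}_{\text{prefetch}} &= 1 + 0.168 \times 0.20 \times 1 + 0.168 \times 0.80 \times 15 \approx 3.05 \text{ clk} \\
\text{Effective miss rate} &= \frac{\text{AMAT}_{\text{prefetch}} - \text{Hit time}}{\text{Miss penalty}}
                            = \frac{3.05 - 1}{15} \approx 13.6\%
\end{align*}
```

The second part of the question needs miss rates for larger caches, which the slide does not provide, so it is left open here.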

36

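Hardware Prefetching

[Figure: performance improvement from hardware prefetching (y-axis from 1 to 2.2)]

Suite         Benchmark   Performance improvement
SPECint2000   gap         1.16
SPECint2000   mcf         1.45
SPECfp2000    fma3d       1.18
SPECfp2000    wupwise     1.20
SPECfp2000    galgel      1.21
SPECfp2000    facerec     1.26
SPECfp2000    swim        1.29
SPECfp2000    applu       1.32
SPECfp2000    lucas       1.40
SPECfp2000    mgrid       1.49
SPECfp2000    equake      1.97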

37

Reducing the Miss Penalty or Miss Rate via Parallelism

• Compiler-controlled prefetching
  – Register or cache prefetch
  – Faulting or nonfaulting (nonbinding)
• Helper threads:
  – Pre-execution [Luk-ISCA2001]
  – SUN Scouts [ISCA2004]

38

Cache Optimization Summary

39

Higher Main Memory Bandwidth

• Wider main memory

40

Higher Main Memory Bandwidth

• Simple interleaved memory
  – Memory banks: read/write multiple words simultaneously
  – Problems: new high-capacity memory chips lead to fewer banks; expansion is difficult
• Independent memory banks

41

Memory Technology

• DRAM technology
  – Multiplexed address lines: row access strobe (RAS), column access strobe (CAS)
  – Dynamic: data must be refreshed
  – Packaging: dual inline memory modules (DIMM)
• SRAM technology
  – S = Static

42

Memory Technology

43

Memory Technology

44

Memory Technology

45

Memory Technology

• Memory technology for embedded processors
  – ROM and Flash
• Improving memory performance in a standard DRAM chip
  – Fast page mode
  – Synchronous DRAM (SDRAM)
  – Double data rate (DDR)

46

Memory Technology

• Improving memory performance via a new DRAM interface: RAMBUS
  – Packet-switched bus (or split-transaction bus)
  – Second generation: direct RDRAM (DRDRAM)
  – Packaging: RIMM
• Comparing RAMBUS with DDR SDRAM
  – RDRAM and DRDRAM are expensive!
  – Performance?

47

RAMBUS

RAMBUS 1200MHz vs DDR333
RAMBUS 800MHz vs DDR266

48

Other results…

Results from: http://www.inqst.com/articles/p3ddr/p3ddrmain.htm

[Figure: bar chart, y-axis from 0 to 2000, comparing VIA Pro266, Intel 815E and Intel 820]

49

Memory Latency!

"A Performance Comparison of Contemporary DRAM Architectures"
Vinodh Cuppu, Bruce Jacob, Brian Davis, and Trevor Mudge

50

Other Topics

• Virtual memory
  – Segments or pages?
  – Translation lookaside buffer (TLB)
  – Page size?
• Protection and examples
  – Alpha & IA-32

51

Memory Hierarchy Design

• Superscalar CPU and Number of Ports to the Cache
• Speculative Execution and the Memory System
• Combining the Instruction Cache with Instruction Fetch and Decode Mechanisms
• Embedded Computer Caches and Real-Time Performance
• Embedded Computer Caches and Power
• I/O and Consistency of Cached Data

52

Example: Alpha 21264

53

Example: Sony PlayStation 2 (Emotion Engine)

54

Example: Sony PlayStation 2 (Emotion Engine)

55

Example: Sun Fire 6800