
Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs

Javier Lira ψ

Carlos Molina ф

Antonio González λ

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain

[email protected]

ф Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain

[email protected]

ψ Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain

[email protected]

Euro-Par 2009, Delft (The Netherlands) - August 27, 2009


Outline

Introduction

Methodology

Last Bank

Characterization of replacements in NUCA

Last Bank Optimizations

Conclusions


Introduction

CMPs have emerged as a dominant paradigm in system design.

1. Sustain performance improvements while reducing power consumption.

2. Take advantage of thread-level parallelism.

Commercial CMPs are currently available.

CMPs incorporate larger and shared last-level caches.

Wire delay is a key constraint.


NUCA

Non-Uniform Cache Architecture (NUCA) was first proposed at ASPLOS 2002 by Kim et al. [1].

NUCA divides a large cache into smaller, faster banks.

Banks close to the cache controller have lower latencies than banks farther away.

[1] C. Kim, D. Burger and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS ‘02


Methodology

Simulation tools: Simics + GEMS, CACTI v6.0
Benchmarks: PARSEC Benchmark Suite

Number of cores: 8
Core processor: Out-of-order SPARCv9
Main memory size: 4 GBytes
Memory bandwidth: 512 Bytes/cycle
L1 cache latency: 3 cycles
NUCA bank latency: 2 cycles
Router delay: 1 cycle
On-chip wire delay: 1 cycle
Main memory latency: 350 cycles (from core)
Private L1 data caches: 8 KBytes
Private L1 instr. caches: 8 KBytes
Shared L2 NUCA cache: 1 MByte, 256 banks
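To see how these latency parameters might combine, here is a minimal sketch. The hop-count model (a request crosses one router and one link per hop, in both directions, plus one bank access) is an assumption made for illustration and is not stated in the slides; `nuca_hit_latency` and `hops` are hypothetical names.

```python
# Latency figures taken from the configuration listed above.
BANK_LATENCY = 2      # cycles per NUCA bank access
ROUTER_DELAY = 1      # cycles per router traversed
WIRE_DELAY = 1        # cycles per on-chip link traversed
MEMORY_LATENCY = 350  # cycles, main memory as seen from the core


def nuca_hit_latency(hops: int) -> int:
    """Round-trip cycles for a hit in a bank `hops` links away.

    Assumed model: each hop costs one router delay plus one wire delay,
    the request and the reply travel the same path, and the bank is
    accessed once.
    """
    one_way = hops * (ROUTER_DELAY + WIRE_DELAY)
    return 2 * one_way + BANK_LATENCY


if __name__ == "__main__":
    for hops in (1, 4, 8):
        print(hops, "hops:", nuca_hit_latency(hops), "cycles")
    print("main memory:", MEMORY_LATENCY, "cycles")
```

Under this assumed model even a distant bank costs far less than the 350-cycle trip to main memory, which is why keeping soon-to-be-reused lines on chip matters.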


Baseline NUCA cache architecture

8 cores

256 banks

[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO ‘04


Last Bank

Data movements concentrate the most frequently accessed data in a few banks.

Data replacements in HOT banks are unfair.


Last Bank

An extra bank is included in the NUCA cache.

Acts as a victim cache, but it is not fully associative.

Gives evicted data a second chance to stay in the NUCA cache.
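The behaviour described on this slide can be sketched as a small set-associative store for evicted lines. This is a toy model under stated assumptions: the set count, associativity, modulo indexing, LRU replacement inside the Last Bank, and removal of a line on a hit are all illustrative choices, and `LastBank`, `insert_evicted` and `lookup` are hypothetical names.

```python
from collections import OrderedDict


class LastBank:
    """Toy set-associative Last Bank holding lines evicted from NUCA banks."""

    def __init__(self, num_sets: int = 4, ways: int = 2):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set; the oldest entry comes first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, addr: int) -> OrderedDict:
        # Assumed indexing: low-order bits of the line address pick the set.
        return self.sets[addr % self.num_sets]

    def insert_evicted(self, addr: int, data):
        """Store a line evicted from a regular bank.

        Returns the (addr, data) pair pushed out to main memory, or None.
        """
        s = self._set_for(addr)
        victim = None
        if addr not in s and len(s) >= self.ways:
            victim = s.popitem(last=False)  # drop the oldest line in the set
        s[addr] = data
        s.move_to_end(addr)
        return victim

    def lookup(self, addr: int):
        """On a miss in the regular banks, return the line's data or None.

        A hit removes the line, assuming it moves back into a regular bank.
        """
        return self._set_for(addr).pop(addr, None)
```

Because the structure is set-associative rather than fully associative, two evicted lines that map to the same set compete for its ways even when other sets are empty.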


Last Bank


Performance benefits are restricted by the Last Bank size.

Significant performance potential.

Analysis of reused addresses to find improvement points.


Characterization of replacements in NUCA

How many evicted addresses are later reused?

How many cycles does a reused address usually spend outside the NUCA before being reinserted?

Where were reused addresses located within the NUCA just before being evicted?

What action caused the eviction of reused addresses from the NUCA?
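The first two questions could be answered from a trace of cache events along the following lines. This is a minimal sketch, not the authors' measurement code: the trace format (`(cycle, kind, addr)` tuples) and the function name `reuse_stats` are assumptions, and reuse is counted per eviction event.

```python
def reuse_stats(events):
    """Summarise address reuse in an eviction/insertion trace.

    events: iterable of (cycle, kind, addr), kind in {"evict", "insert"},
    sorted by cycle (assumed trace format).
    Returns (fraction of evictions later reinserted, list of gaps in cycles).
    """
    evicted_at = {}  # addr -> cycle of its most recent eviction
    evictions = 0
    gaps = []        # cycles between an eviction and the next reinsertion
    for cycle, kind, addr in events:
        if kind == "evict":
            evictions += 1
            evicted_at[addr] = cycle
        elif kind == "insert" and addr in evicted_at:
            gaps.append(cycle - evicted_at.pop(addr))
    reused_fraction = len(gaps) / evictions if evictions else 0.0
    return reused_fraction, gaps


if __name__ == "__main__":
    trace = [
        (10, "insert", 0xA),   # first fill, not a reinsertion
        (100, "evict", 0xA),
        (150, "evict", 0xB),
        (600, "insert", 0xA),  # 0xA returns 500 cycles after its eviction
    ]
    print(reuse_stats(trace))
```

Binning the returned gaps (for example below 1,000 or below 100,000 cycles) gives a distribution of the time an address spends outside the cache.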


Reused address statistics


Nearly 70% of evicted addresses return to the NUCA cache.

Most reused addresses return to the NUCA at least twice.


Time between Eviction and Reinsertion


Nearly 30% of evicted addresses return in less than 100,000 cycles.

In blackscholes, almost 50% of reused addresses return to NUCA in less than 1,000 cycles.


Last location within the NUCA

Most reused addresses were evicted from Local Banks.

Most addresses replaced in Central Banks are not reused later.


Selective Last Bank

Target: To reduce pollution in Last Bank.

This mechanism selects which evicted data blocks are stored in the Last Bank.

Implemented Selective Last Bank: stores a data block if and only if it was evicted from a Local Bank. Otherwise, the block is sent back to main memory.
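The filter described above amounts to a single decision per evicted block. A minimal sketch follows; the `BankKind` enum, the function name `route_evicted_block` and the string return values are hypothetical and chosen only for illustration.

```python
from enum import Enum


class BankKind(Enum):
    LOCAL = "local"
    CENTRAL = "central"


def route_evicted_block(source_bank: BankKind) -> str:
    """Decide where a block evicted from the NUCA goes.

    Only blocks evicted from a Local Bank enter the Last Bank; every
    other evicted block goes straight back to main memory.
    """
    if source_bank is BankKind.LOCAL:
        return "last_bank"
    return "main_memory"


if __name__ == "__main__":
    print(route_evicted_block(BankKind.LOCAL))    # last_bank
    print(route_evicted_block(BankKind.CENTRAL))  # main_memory
```

Keeping the filter this simple means no extra per-line state is needed: the bank that performs the eviction already knows whether it is a Local Bank.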


LRU Prioritising Last Bank

Target: To maintain reused addresses in the NUCA cache.

Modifies the data eviction policy of the NUCA banks.

Prioritises lines that come from Last Bank during the data replacement process.

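[Figure: LRU stack of a four-way set, positions 0 (MRU) to 3 (LRU), each line tagged with a priority bit P. Before replacement: @A (P:0), @B (P:0), @C (P:0), @D (P:1). After: @D (P:0), @A (P:0), @B (P:0), @C (P:0).]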
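One reading of the diagram above is that a line arriving from the Last Bank carries a priority bit; when such a line reaches the LRU position, its bit is cleared and it moves to the MRU position instead of being evicted, and the next line in LRU order is considered. The sketch below implements that reading. It is an interpretation of the slide, not the authors' implementation, and `choose_victim` is a hypothetical name.

```python
def choose_victim(lru_stack):
    """Pick the line to evict from one set, honouring priority bits.

    lru_stack: non-empty list of [addr, priority_bit], index 0 = MRU and
    last index = LRU. The list is modified in place.
    Returns the address of the evicted line.
    """
    # Each prioritised line is spared once (its bit is cleared), so the
    # loop finishes after at most len(lru_stack) + 1 iterations.
    while True:
        addr, priority = lru_stack.pop()
        if not priority:
            return addr
        # Prioritised line: clear its bit and move it to the MRU position.
        lru_stack.insert(0, [addr, 0])


if __name__ == "__main__":
    stack = [["A", 0], ["B", 0], ["C", 0], ["D", 1]]
    print(choose_victim(stack), stack)
```

With the diagram's initial state, @D is spared and reordered to MRU with P cleared, and @C, now in the LRU position, is the line that leaves the bank.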


Results


Both optimizations increase Last Bank performance benefits.

There is still room for improvement.

Adaptive filters will be analysed in future work.


Conclusions

Data movements provoke unfair replacements in HOT banks.

Last Bank reduces the access latency of promptly reused addresses.

Huge performance potential.

Two optimizations are proposed:

Selective Last Bank: reduces pollution in the Last Bank.

LRU Prioritising Last Bank: maintains reused addresses in the NUCA cache.


Questions?