HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs Javier Lira ψ Carlos Molina ф Antonio González ψ,λ λ Intel Barcelona Research Center Intel Labs - UPC Barcelona, Spain [email protected] ф Dept. Enginyeria Informàtica Universitat Rovira i Virgili Tarragona, Spain [email protected] ψ Dept. Arquitectura de Computadors Universitat Politècnica de Catalunya Barcelona, Spain [email protected] IPDPS 2011, Anchorage, AK (USA) – May 17, 2011

HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs

Javier Lira ψ

Carlos Molina ф

Antonio González ψ,λ

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain

[email protected]

ф Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain

[email protected]

ψ Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain

[email protected]

IPDPS 2011, Anchorage, AK (USA) – May 17, 2011

Introduction

[Figure: CMP with Core 0 to Core 7 and the NUCA cache]

S-NUCA (Static NUCA)

One possible location in the NUCA

Simple

Trivial search of data

Does not leverage locality

D-NUCA (Dynamic NUCA)

Multiple candidate banks

Migration increases complexity

Not easy to find data

Optimize cache access latency

Motivation

Significant performance potential

Limited by the access scheme

Access schemes in D-NUCA

Directory is not an alternative

Needs to update block location on every migration

Reduces D-NUCA's potential

Potential bottleneck

Algorithmic-based schemes

            Performance   Energy
Serial      Low           Low
Parallel    High          High

Partitioned multicast (hybrid access scheme)

1st step: Local bank + central banks (9 banks)

2nd step: The other cores' local banks
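As a rough illustration of the two-step partitioned multicast, the following Python sketch counts the bank lookups one request sends, depending on where the block is found. It assumes the 8-core configuration evaluated in these slides, so the second step probes 7 banks (one local bank per other core). The function name and the `found_in` labels are hypothetical and only serve the example.

```python
NUM_CORES = 8                       # evaluated configuration
FIRST_STEP_BANKS = 9                # local bank + central banks, as stated on the slide
SECOND_STEP_BANKS = NUM_CORES - 1   # assumption: one local bank per other core


def partitioned_multicast_messages(found_in: str) -> int:
    """Return the number of bank lookups sent for one request.

    found_in is a hypothetical label: "first_step", "second_step" or "miss".
    """
    if found_in == "first_step":
        return FIRST_STEP_BANKS
    if found_in in ("second_step", "miss"):
        # Both steps were needed, so every candidate bank was probed.
        return FIRST_STEP_BANKS + SECOND_STEP_BANKS
    raise ValueError(f"unknown location: {found_in}")
```

Under these assumptions a request costs between 9 and 16 lookups, which is why cutting the number of messages per access matters.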

Serial vs Parallel

Reducing the number of messages required per access is crucial

Objectives

Optimize NUCA features

Provide fast access when the data is near the requesting core

Reduce network contention

Crucial for both performance and energy

Outline

Introduction and motivation

Methodology

HK-NUCA

Results

Conclusions

Methodology

Simulation tools:

Simics + GEMS

CACTI v6.0

Two scenarios:

Multi-programmed: mix of SPEC CPU2006

Parallel applications: PARSEC

Number of cores         8 (UltraSPARC IIIi)
Frequency               1.5 GHz
Main memory size        4 GBytes
Memory bandwidth        512 Bytes/cycle
Private L1 caches       8 x 32 KBytes, 2-way
Shared L2 NUCA cache    8 MBytes, 128 banks
NUCA bank               64 KBytes, 8-way
L1 cache latency        3 cycles
NUCA bank latency       4 cycles
Router delay            1 cycle
On-chip wire delay      1 cycle
Main memory latency     250 cycles (from core)
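The configuration above can be captured in a small Python record, which makes it easy to check that the bank count and bank size add up to the stated NUCA capacity. This is only a sketch: the class and method names are hypothetical, and the per-hop cost model (one router traversal plus one wire segment per hop) is an assumption, not something the slides specify.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class NucaConfig:
    cores: int = 8
    l1_size: int = 32 * KB     # private L1 per core, 2-way
    nuca_size: int = 8 * MB    # shared L2 NUCA cache
    banks: int = 128
    bank_size: int = 64 * KB   # 8-way
    bank_latency: int = 4      # cycles
    router_delay: int = 1      # cycles
    wire_delay: int = 1        # cycles

    def capacity_is_consistent(self) -> bool:
        # 128 banks x 64 KBytes should equal the 8 MBytes of the NUCA cache.
        return self.banks * self.bank_size == self.nuca_size

    def one_way_network_latency(self, hops: int) -> int:
        # Assumption: each hop costs one router traversal plus one wire segment.
        return hops * (self.router_delay + self.wire_delay)


cfg = NucaConfig()
print(cfg.capacity_is_consistent())
```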

Baseline architecture

D-NUCA cache: 8 MBytes, 128 banks

Bank: 64 KBytes, 8-way

Migration scheme: Gradual Promotion

Replacement: LRU

Access: Partitioned Multicast

[Figure: baseline CMP with Core 0 to Core 7]

HK-NUCA

Home Knows where to find data in the NUCA cache

Home bank knows which other banks have at least one data block that it manages

There is an HK-PTR per cache set in every bank.

HK-PTR: 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 0
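One way to read the example HK-PTR is as a bit vector with one bit per candidate bank. The Python sketch below decodes the 16-bit example into the list of banks the home bank would consider. The mapping of bit position to bank index (left to right, starting at 0) is an assumption made for illustration, and `banks_to_probe` is a hypothetical helper.

```python
HK_PTR_EXAMPLE = "0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 0"  # example bits from the slide


def banks_to_probe(hk_ptr_bits: str) -> list[int]:
    """Return the positions whose bit is set.

    Assumption: position i (left to right, from 0) stands for candidate bank i.
    """
    bits = hk_ptr_bits.split()
    return [i for i, bit in enumerate(bits) if bit == "1"]


print(banks_to_probe(HK_PTR_EXAMPLE))  # [2, 4, 5, 12, 14]
```

With this example only 5 of the 16 candidate banks would need to be contacted.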

HK-NUCA

(1) Fast access

(2) Call Home

(3) Parallel access

[Figure: the three search steps on the CMP with Core 0 to Core 7, using the example HK-PTR 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 0]
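The three steps can be sketched as a toy lookup in Python. This is one possible reading of the scheme, not the authors' implementation: it assumes that fast access probes the requester's local bank, that Call Home sends one message to the home bank, and that parallel access probes only the banks flagged in the home bank's HK-PTR. The function name, the dictionary-based bank model, and the message accounting are all hypothetical.

```python
def hk_nuca_search(tag, local_bank, home_bank, banks, hk_ptr):
    """Return (stage, messages) for one lookup in a toy model.

    banks:  dict mapping bank id -> set of tags stored in that bank
    hk_ptr: set of bank ids flagged in the home bank's HK-PTR
    """
    messages = 1  # (1) Fast access: probe the bank closest to the requester
    if tag in banks[local_bank]:
        return "fast access", messages

    if home_bank != local_bank:
        messages += 1  # (2) Call Home: ask the home bank
        if tag in banks[home_bank]:
            return "call home", messages

    # (3) Parallel access: probe only the banks flagged in the HK-PTR
    targets = [b for b in sorted(hk_ptr) if b not in (local_bank, home_bank)]
    messages += len(targets)
    for bank in targets:
        if tag in banks[bank]:
            return "parallel access", messages
    return "miss", messages


banks = {0: {"A"}, 1: {"H"}, 2: {"B"}, 3: set(), 4: set()}
print(hk_nuca_search("B", local_bank=0, home_bank=1, banks=banks, hk_ptr={2, 4}))
```

In this model the cost of a lookup grows with the number of set bits in the HK-PTR rather than with the total number of candidate banks.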

Managing Home knowledge

Actions that provoke an update of HK-PTR:

New data enters the cache

Eviction from the NUCA cache

Migration movements

Migrations are synchronized with HK-PTR updates
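The three update triggers listed above can be mimicked with a small Python model of the home knowledge for a single cache set. Because a bit means "at least one block", the sketch keeps a per-bank counter so it knows when a bit may be cleared. Those counters, the 16-bank width, and the class itself are assumptions for illustration; the slides do not say how the hardware decides when to clear a bit.

```python
class HomeKnowledge:
    """Toy home knowledge for one cache set at one home bank."""

    def __init__(self, num_banks: int = 16):
        # Hypothetical counters: blocks of this set currently held by each bank.
        self.counts = [0] * num_banks

    @property
    def hk_ptr(self) -> list[int]:
        # A bit is set when the bank holds at least one block of this set.
        return [1 if count > 0 else 0 for count in self.counts]

    def insert(self, bank: int) -> None:
        """New data enters the cache in `bank`."""
        self.counts[bank] += 1

    def evict(self, bank: int) -> None:
        """A block leaves the NUCA cache from `bank`."""
        if self.counts[bank] == 0:
            raise ValueError(f"bank {bank} holds no block of this set")
        self.counts[bank] -= 1

    def migrate(self, src: int, dst: int) -> None:
        """A block moves from `src` to `dst`; both entries change together."""
        self.evict(src)
        self.insert(dst)


hk = HomeKnowledge()
hk.insert(2)
hk.migrate(2, 5)
print(hk.hk_ptr)
```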

Overheads

Hardware: implementation of the HK-PTRs

Network: home knowledge updates

NUCA cache    8 MBytes
HK-PTRs       32 KBytes
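The 32 KBytes figure can be reproduced with a short back-of-the-envelope calculation in Python. It relies on two assumptions that the slides do not state explicitly: a 64-byte block size, and a 16-bit HK-PTR (the width of the example bit vector in these slides). Under those assumptions the numbers match the table.

```python
KB = 1024
BANKS = 128
BANK_SIZE = 64 * KB
WAYS = 8
BLOCK_SIZE = 64     # bytes; assumption, not stated in the slides
HK_PTR_BITS = 16    # assumption: width of the example HK-PTR

sets_per_bank = BANK_SIZE // (WAYS * BLOCK_SIZE)
hk_ptr_bytes = BANKS * sets_per_bank * HK_PTR_BITS // 8
overhead = hk_ptr_bytes / (BANKS * BANK_SIZE)

print(sets_per_bank, hk_ptr_bytes // KB, f"{overhead:.2%}")  # 128 32 0.39%
```

That is, one HK-PTR per set in every bank adds well under 1% of storage to the 8 MBytes NUCA cache.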

Performance results

Overall performance improvement of 4-6%

Workloads with high miss rate

Low miss rate, but high hit rate in the first two HK-NUCA stages

Low miss rate, high hit rate in the parallel access stage of HK-NUCA

HK-NUCA accuracy

[Figure: "Messages sent during the parallel access stage of HK-NUCA". Categories: 0 to 14 messages; scale: 0% to 20%]

85% of memory requests send fewer than 6 messages to the NUCA

On-chip network traffic

Avg messages sent per request

Partitioned Multicast    10.03
HK-NUCA (3-steps)        3.82
HK-NUCA (2-steps)        4.06
Perfect Search           1
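To put the averages above in relative terms, the following Python snippet computes how many fewer messages each scheme sends compared with partitioned multicast. The figures are copied from the table; the dictionary and variable names are just for this example.

```python
avg_messages = {
    "Partitioned Multicast": 10.03,
    "HK-NUCA (3-steps)": 3.82,
    "HK-NUCA (2-steps)": 4.06,
    "Perfect Search": 1.0,
}

baseline = avg_messages["Partitioned Multicast"]
# Fraction of messages saved relative to the partitioned multicast baseline.
reduction = {name: 1 - value / baseline for name, value in avg_messages.items()}

for name, saved in reduction.items():
    print(f"{name}: {saved:.1%} fewer messages than the baseline")
```

Both HK-NUCA variants send roughly 60% fewer messages per request than the baseline, while a perfect search would save about 90%.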

Energy consumption results

HK-NUCA reduces dynamic energy consumption by more than 50%

Conclusions

D-NUCA takes advantage of the non-uniformity of NUCA caches

D-NUCA benefits are restricted by the access scheme used

HK-NUCA is an access scheme for D-NUCA organizations

Allows fast accesses to data that is near the requesting core

Home knowledge reduces miss resolution time and network contention

Outperforms the best-performing access scheme by 6%

Reduces dynamic energy consumption by 50%

HK-NUCA: Boosting data searches in Dynamic NUCA for CMPs

Questions?

Migration is not the problem

[Figure: comparison of S-NUCA and D-NUCA]

Access scheme is the main limitation in D-NUCA