HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs
Javier Lira ψ
Carlos Molina ф
Antonio González ψ,λ
λ Intel Barcelona Research Center
Intel Labs - UPC
Barcelona, Spain
ф Dept. Enginyeria Informàtica
Universitat Rovira i Virgili
Tarragona, Spain
ψ Dept. Arquitectura de Computadors
Universitat Politècnica de Catalunya
Barcelona, Spain
IPDPS 2011, Anchorage, AK (USA) – May 17, 2011
Introduction
[Figure: 8-core CMP, Core 0 to Core 7]
NUCA
S-NUCA (Static NUCA)
One possible location in the NUCA
Simple
Trivial search of data
Does not leverage locality
D-NUCA (Dynamic NUCA)
Multiple candidate banks
Migration increases complexity
Not easy to find data
Optimize cache access latency
Motivation
Significant performance potential
Limited by the access scheme
Access schemes in D-NUCA
Directory is not an alternative:
Needs to update block location on every migration
Reduces D-NUCA potentiality
Potential bottleneck
Algorithmic-based schemes
Partitioned multicast (hybrid access scheme)
1st step: Local bank + central banks (9 banks)
2nd step: The other cores' local banks
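As a rough illustration of the two-step partitioned multicast above, the following sketch counts the messages one request sends. The 9 banks of the first step come from the slide; the 7 banks of the second step assume an 8-core CMP with one local bank per core, and the function name is hypothetical.

```python
FIRST_STEP_BANKS = 9    # local bank + central banks (from the slide)
SECOND_STEP_BANKS = 7   # assumption: one local bank for each of the 7 other cores


def partitioned_multicast_messages(found_in_first_step: bool) -> int:
    """Messages sent by a single request under partitioned multicast."""
    # Step 1: probe the local bank and the central banks in parallel.
    messages = FIRST_STEP_BANKS
    if not found_in_first_step:
        # Step 2: probe the other cores' local banks.
        messages += SECOND_STEP_BANKS
    return messages
```

Under these assumptions a request costs 9 messages when the first step hits and 16 otherwise, which is why the number of messages per access matters so much for this scheme.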
Serial vs Parallel
Scheme     Performance   Energy
Serial     Low           Low
Parallel   High          High
Reducing the number of messages required per access is crucial
Objectives
Optimize NUCA features
Provide fast access when the data is near the requesting core
Reduce network contention
Crucial for both performance and energy
Outline
Introduction and motivation
Methodology
HK-NUCA
Results
Conclusions
Methodology
Simulation tools: Simics + GEMS, CACTI v6.0
Two scenarios:
Multi-programmed: mix of SPEC CPU2006
Parallel applications: PARSEC
Parameter               Value
Number of cores         8 – UltraSPARC IIIi
Frequency               1.5 GHz
Main memory size        4 GBytes
Memory bandwidth        512 Bytes/cycle
Private L1 caches       8 x 32 KBytes, 2-way
Shared L2 NUCA cache    8 MBytes, 128 banks
NUCA bank               64 KBytes, 8-way
L1 cache latency        3 cycles
NUCA bank latency       4 cycles
Router delay            1 cycle
On-chip wire delay      1 cycle
Main memory latency     250 cycles (from core)
Baseline architecture
D-NUCA cache: 8 MBytes, 128 banks
Bank: 64 KBytes, 8-way
Migration scheme: Gradual Promotion
Replacement: LRU
Access: Partitioned Multicast
[Figure: baseline architecture, Core 0 to Core 7]
HK-NUCA
Home Knows where to find data in the NUCA cache
Home bank knows which other banks have at least one data block that it manages
There is an HK-PTR per cache set in every bank.
HK-PTR example: 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 0
HK-NUCA
[Figure: 8-core CMP (Core 0 to Core 7), with Core 0 highlighted]
(1) Fast access
(2) Call Home
(3) Parallel access
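The three access stages named on this slide, (1) fast access, (2) call home and (3) parallel access, can be sketched roughly as follows. This is a minimal model under stated assumptions, not the authors' implementation: it assumes the fast access probes only the requester's local bank, that each bank keeps a single HK-PTR (the slides describe one per cache set), and that bit i of the HK-PTR refers to bank i. The names `Bank` and `hk_nuca_lookup` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Bank:
    blocks: set = field(default_factory=set)  # addresses stored in this bank
    hk_ptr: int = 0  # bit i set: bank i holds at least one block homed here


def hk_nuca_lookup(banks, addr, local, home):
    """Return (index of the bank holding addr, or None on a miss; messages sent)."""
    # (1) Fast access: probe only the bank closest to the requesting core.
    messages = 1
    if addr in banks[local].blocks:
        return local, messages

    # (2) Call home: probe the home bank, which also supplies its HK-PTR.
    if home != local:
        messages += 1
        if addr in banks[home].blocks:
            return home, messages

    # (3) Parallel access: probe only the banks flagged in the HK-PTR.
    candidates = [
        i for i in range(len(banks))
        if (banks[home].hk_ptr >> i) & 1 and i not in (local, home)
    ]
    messages += len(candidates)
    for i in candidates:
        if addr in banks[i].blocks:
            return i, messages
    return None, messages  # miss in the NUCA cache
```

With the 16-bit pattern shown on the slide read left to right (banks 2, 4, 5, 12 and 14 flagged), a request that misses in the local and home banks sends 1 + 1 + 5 = 7 messages in this model, instead of probing all 16 candidate banks.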
Managing Home knowledge
Actions that provoke an update of HK-PTR:
New data enters the cache
Eviction from the NUCA cache
Migration movements
Migrations are synchronized with HK-PTR updates
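The three update triggers listed above (insertion, eviction, migration) can be illustrated with a small model of one HK-PTR. This is a sketch under assumptions: the `resident` sets exist only so the model can tell when a bank loses its last block, and are not a claim about how the hardware tracks this; the class and method names are hypothetical.

```python
class HomeKnowledge:
    """Model of one HK-PTR: bit i is set while bank i holds a block homed here."""

    def __init__(self, num_banks=16):
        self.hk_ptr = 0
        # Modelling aid only (assumption): blocks currently held by each bank.
        self.resident = [set() for _ in range(num_banks)]

    def insert(self, bank, addr):
        # New data enters the cache: flag the bank that received it.
        self.resident[bank].add(addr)
        self.hk_ptr |= 1 << bank

    def evict(self, bank, addr):
        # Eviction: clear the flag once the bank holds no block of this home.
        self.resident[bank].discard(addr)
        if not self.resident[bank]:
            self.hk_ptr &= ~(1 << bank)

    def migrate(self, src, dst, addr):
        # Migration: update destination and source together, so the
        # HK-PTR never loses track of the block while it moves.
        self.insert(dst, addr)
        self.evict(src, addr)
```

Setting the destination bit before clearing the source bit keeps the pointer conservative during a migration: it may briefly flag one bank too many, but it never misses the bank that holds the block.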
Overheads
Hardware implementation: HK-PTRs
Network: home knowledge updates
NUCA cache: 8 MBytes
HK-PTRs: 32 KBytes
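The 32 KBytes figure can be reproduced with a short back-of-the-envelope calculation. The bank count, bank size and associativity come from the methodology slide; the 64-byte block size and the 16-bit HK-PTR width (matching the 16-bit example shown earlier) are assumptions not stated in the slides.

```python
NUM_BANKS = 128            # from the methodology slide
BANK_BYTES = 64 * 1024     # 64 KBytes per bank
WAYS = 8                   # 8-way set associative
BLOCK_BYTES = 64           # assumption: not stated in the slides
HK_PTR_BITS = 16           # assumption: width of the HK-PTR example

sets_per_bank = BANK_BYTES // (BLOCK_BYTES * WAYS)
hk_ptr_bytes = NUM_BANKS * sets_per_bank * HK_PTR_BITS // 8
cache_bytes = NUM_BANKS * BANK_BYTES
overhead = hk_ptr_bytes / cache_bytes

print(f"{sets_per_bank} sets per bank")
print(f"HK-PTR storage: {hk_ptr_bytes // 1024} KBytes")
print(f"Overhead: {overhead:.2%} of the NUCA capacity")
```

Under these assumptions, one HK-PTR per cache set in each of the 128 banks adds up to 32 KBytes, well under 1% of the 8 MByte cache.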
Performance results
Overall performance improvement of 4-6%
Workloads with high miss rate
Low miss rate, but high hit rate in the first two HK-NUCA stages
Low miss rate, high hit rate in the parallel access stage of HK-NUCA
HK-NUCA accuracy
[Figure: messages sent during the parallel access stage of HK-NUCA; categories from 0 to 14 messages, scale from 0% to 20%]
85% of memory requests send fewer than 6 messages to the NUCA
On-chip network traffic
Access scheme            Avg. messages sent per request
Partitioned Multicast    10.03
HK-NUCA (3-steps)        3.82
HK-NUCA (2-steps)        4.06
Perfect Search           1
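To put these averages in relative terms, the sketch below computes the reduction in messages per request with respect to partitioned multicast. The values are the ones reported on the slide; the dictionary and function names are hypothetical.

```python
# Average messages sent per request, as reported on the slide.
avg_messages = {
    "Partitioned Multicast": 10.03,
    "HK-NUCA (3-steps)": 3.82,
    "HK-NUCA (2-steps)": 4.06,
    "Perfect Search": 1.0,
}


def reduction_vs_baseline(scheme, baseline="Partitioned Multicast"):
    """Percentage reduction in messages per request relative to the baseline."""
    base = avg_messages[baseline]
    return 100.0 * (base - avg_messages[scheme]) / base


for name in ("HK-NUCA (3-steps)", "HK-NUCA (2-steps)"):
    print(f"{name}: {reduction_vs_baseline(name):.1f}% fewer messages")
```

The 3-step variant sends about 61.9% fewer messages than partitioned multicast and the 2-step variant about 59.5% fewer.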
Energy consumption results
HK-NUCA reduces dynamic energy consumption by more than 50%
Conclusions
D-NUCA makes it possible to exploit the non-uniformity of NUCA caches
D-NUCA benefits are restricted by the access scheme used
HK-NUCA is an access scheme for D-NUCA organizations
Allows fast accesses to data that is near the requesting core
Home knowledge reduces miss resolution time and network contention
Outperforms the best-performing access scheme by 6%
Reduces dynamic energy consumption by 50%
HK-NUCA: Boosting data searches in Dynamic NUCA for CMPs
Questions?
Migration is not the problem
[Figure legend: S-NUCA, D-NUCA]
Access scheme is the main limitation in D-NUCA