
Hepmark project

Evaluation of HEP worker nodes
Michele Michelotto at pd.infn.it

8/2/2008 CSN5 Trieste - Michele Michelotto - INFN Padova

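Computing model

[Figure: tiered computing model. CERN Tier 0 and CERN Tier 1; Tier-1 centres in Germany, UK, France, Italy, Japan and the USA (BNL, FNAL); Tier-2 sites at labs and universities (Lab a, Uni a, Lab c, Uni n, Lab m, Lab b, Uni b, Uni y, Uni x); Tier3 physics departments (α, β, γ) and desktops. Overlays: grid for a regional group, grid for a physics study group.]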


Computing Needs
• Tape Storage:
– Very easy: events → Terabytes
• Disk Storage
– Easy again: events → Terabytes
– (1000x1000 or 1024x1024?)
– RAID protected or raw size?
• Computing Power
– Tricky: Event/sec? Sim or Reco?
– MIPS, CernUnit, MHz, Spec, SI2K….
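The two storage questions above (decimal or binary multiples, raw or RAID-protected size) change a quoted capacity by roughly 10% each. The sketch below only illustrates the arithmetic; the 1000 TB figure, the 8-disk array and the choice of RAID-5 are assumptions made for the example, not values from the slides.

```python
# Decimal (SI) versus binary reading of "one terabyte".
TB_DECIMAL = 1000 ** 4  # 10^12 bytes
TB_BINARY = 1024 ** 4   # 2^40 bytes


def decimal_tb_to_binary_tb(tb: float) -> float:
    """Express a capacity quoted in decimal TB in binary (2^40 byte) units."""
    return tb * TB_DECIMAL / TB_BINARY


def raid5_usable(raw_tb: float, disks_per_array: int) -> float:
    """Usable size of RAID-5 arrays: one disk per array holds parity.

    RAID-5 is an assumption for this example; the slides do not name a RAID level.
    """
    return raw_tb * (disks_per_array - 1) / disks_per_array


if __name__ == "__main__":
    raw = 1000.0  # hypothetical raw capacity in decimal TB
    print(f"{raw} TB decimal = {decimal_tb_to_binary_tb(raw):.1f} TB binary")
    print(f"{raw} TB raw = {raid5_usable(raw, 8):.1f} TB usable on 8-disk RAID-5")
```

A request and a pledge only compare cleanly when both state which of these conventions they use.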


T1 requirements

Columns: Experiment, share of LHC (%), then CPU (KSI2K), DISK (TB-N) and TAPE (TB), repeated for each of the seven yearly column groups

ALICE 22% 154 16 77 286 110 143 748 330 428 1727 789 888 3026 1512 1554 4431 2503 2691 5714 3426 3905

ATLAS 32% 224 40 112 416 160 208 1088 480 623 2513 1148 1292 4401 2199 2260 6446 3641 3913 8311 4983 5680

CMS 35% 245 86 123 455 175 228 1190 525 681 2748 1256 1413 4813 2405 2472 7050 3983 4280 9090 5450 6212

LHCB 11% 77 26 39 143 55 72 374 165 214 864 395 444 1513 756 777 2216 1252 1345 2857 1713 1952

Total LHC TIER1 700 168 350 1300 500 650 3400 1500 1946 7852 3588 4037 13753 6871 7061.8 20143 11380 12230 25972 15571 17748

BaBar 585 149 0 680 200 0 1215 350 0 1215 350 0 1215 350 0 700 350 0 400 350 0

CDF 900 66 0 820 100 15 1161 170 15 1290 220 15 1420 270 15 800 270 15 600 270 15

LHCB TIER2 0 0 0 150 0 0 600 0 0 900 0 0 1300 0 0 1600 0 0 1600 0 0

TOTAL GROUP I 1485 214 0 1650 300 15 2976 520 15 3405 570 15 3935 620 15 3100 620 15 2600 620 15

AMS2 32 2 16 25 5 16 32 5 24 180 16 128 180 28 232 0 28 232 0 28 232

ARGO 22 12 28 150 70 186 288 122 366 288 159 546 288 195 726 288 195 726 288 195 726

GLAST 5 10 0 200 50 10 200 70 20 200 100 30 200 100 30 200 100 30

MAGIC 1 20 5 4 25 5 8 25 6 12 25 6 16 25 6 16 25 6 16

PAMELA 4 20 10 16 25 12 32 25 14 48 25 16 64 0 16 64 0 16 64

Virgo 10 25 75 180 90 130 250 150 200 500 220 250 500 220 250 0 0 0 0 0 0

TOTAL GROUP II 64 43 119 400 190 352 820 344 640 1218 485 1004 1218 565 1318 513 345 1068 513 345 1068

All experiments 2249 426 469 3350 990 1017 7196 2364 2601 12475 4643 5056 18906 8056 8395 23756 12345 13313 29085 16536 18831

All w/ overlap factor 1874 387 469 2792 900 1017 5997 2149 2601 10396 4221 5056 15755 7324 8395 19796 11222 13313 24237 15033 18831

CNAF TOTAL (PLAN) 1874 387 469 3000 1000 1000 5997 2149 2601 10396 4221 5056 15755 7324 8395 19796 11222 13313 24237 15033 18831

CNAF ACTUAL 1570 400 510 3000 1000 ?

Relative contingency (2007 to 2012): 0% 20% 30% 40% 50% 50%

Absolute contingency 0 0 0 1199 429.8 520.2 3119 1266 1517 6302 2929 3358 9898 5611 6656 12119 7516 9416

Hard core (TOTAL - CONTINGENCY) 3000 1000 1000 4797 1719 2081 7277 2954 3539 9453 4394 5037 9898 5611 6656 12119 7516 9416

INFN T1 P2P 2005 1800 850 850 2400 1200 1000 5500 2500 2100 8000 4000 4100 11500 5800 6000

INFN T1 P2P 2007 - - - 1300 500 650 4500 2000 2100 6500 3200 3300 10000 5000 5000

INFN T1 P2P 2007 v3 3000 1300 1500 5500 2500 2600 8500 4100 4200 12000 6800 7100 16000 9500 11000

CNAF Plan September 2007, yearly column groups 2006 to 2012
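The contingency rows of the table can be reproduced from the CPU column of the "CNAF TOTAL (PLAN)" row. This is a minimal sketch, assuming the relative contingency percentages map in order to the years 2007 through 2012 and that the absolute contingency is the planned total times that percentage, rounded to the nearest unit. The function and variable names are illustrative.

```python
# CPU (KSI2K) from the "CNAF TOTAL (PLAN)" row, 2007 to 2012.
PLAN_CPU_KSI2K = {2007: 3000, 2008: 5997, 2009: 10396,
                  2010: 15755, 2011: 19796, 2012: 24237}

# Relative contingency in percent (assumed year mapping, see lead-in).
RELATIVE_CONTINGENCY_PCT = {2007: 0, 2008: 20, 2009: 30,
                            2010: 40, 2011: 50, 2012: 50}


def split_plan(total: int, pct: int) -> tuple[int, int]:
    """Return (absolute contingency, firm part) for one planned total."""
    contingency = (total * pct + 50) // 100  # integer round-half-up
    return contingency, total - contingency


if __name__ == "__main__":
    for year, total in PLAN_CPU_KSI2K.items():
        contingency, firm = split_plan(total, RELATIVE_CONTINGENCY_PCT[year])
        print(year, total, contingency, firm)
```

The contingency values match the table for every year. The firm part differs from the table by 1 KSI2K in 2008 and 2012, presumably because of rounding in the original spreadsheet.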


T1 + T2 cpu budget - LHC


SI2K frozen
• SI2K is the benchmark used up to now to measure the computing power of all the HEP experiments
– Computing power requested by an experiment
– Computing power provided by a Tier-[0,1,2]
• SI2K is the nickname for the SPEC CPU Int 2000 benchmark
– Came after Spec89, Spec Int 92 and Spec Int 95
– Declared obsolete by SPEC in 2006
– Replaced by SPEC with CPU Int 2006


Transition problem
• Impossible to find published SPEC Int 2000 results for the new processors (e.g. the not so new Clovertown 4-core)
• Impossible to find published SPEC Int 2006 results for old processors (before 2006)
– E.g. old P4 Xeon, P4, AMD 2xx
• You can’t convert from SI2000 to SI2006, but the ratio for x86 architectures is in the 137 – 172 range
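Although no exact conversion exists, the quoted ratio range still brackets what an SI2000 figure could correspond to in SI2006 units. The sketch below assumes the ratio is SI2000 divided by SI2006 (SI2000 scores are in the thousands, SI2006 scores in the tens); the function name and the 3000 SI2K input are illustrative.

```python
# Ratio range quoted for x86, assumed to mean SI2000 / SI2006.
RATIO_MIN = 137.0
RATIO_MAX = 172.0


def si2006_bounds(si2000: float) -> tuple[float, float]:
    """Return the (lowest, highest) SI2006 value compatible with the ratio range."""
    return si2000 / RATIO_MAX, si2000 / RATIO_MIN


if __name__ == "__main__":
    low, high = si2006_bounds(3000.0)  # hypothetical 3000 SI2K core
    print(f"3000 SI2K is roughly {low:.1f} to {high:.1f} SI2006")
```

The width of the interval (about 25%) is the reason a single conversion factor cannot be used for procurement or pledges.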


The SI2K inflation
• The main problem with SI2000 in our community: it is no longer proportional to the performance of HEP codes (as it used to be)
• You can buy processors with a huge SI2K number but with a smaller increase in real performance


Nominal SI vs real SI
• SI2K results for the last generation of processors are affected by inflation
• So CERN (and FZK) started to use a new currency: SI2K measured with “gcc”, the GNU C compiler, using two flavours of optimization
– High tuning: gcc -O3 -funroll-loops -march=$ARCH
– Low tuning: gcc -O2 -fPIC -pthread


Nominal SI vs real SI

• CERN proposal: use as site rating the “Real SI”, obtained by measuring SI with gcc-low and increasing the result by 50%
– Actually this makes sense only for a short period of time and for the last generation of processors
• Run n copies in parallel
– Where n is the number of cores in the worker node
– To take into account the drop in performance of a multicore machine when fully loaded


Too many SI2K
• Take as an example a worker node with two Intel Woodcrest dual core 5160 at 3.06 GHz
• SI2K nominal: 2929 – 3089 (min – max)
• SI2K sum on 4 cores: 11716 - 12536
• SI2K gcc-low: 5523
• SI2K gcc-high: 7034
• SI2K gcc-low + 50%: 8284


Even more
• Actually all the gcc results in the previous slide are on i386 (32 bit)
• If you would like to know how your code runs on a 64 bit machine, you can measure SPEC Int 2000 with gcc on x86_64
• So for the worker node with two Intel Woodcrest dual core 5160 at 3.06 GHz:
• SI2K nominal: 2929 – 3089 (min – max)
• SI2K on 4 cores: 11716 - 12536
• SI2K gcc-low: 6021
• SI2K gcc-high: 6409
• SI2K gcc-low + 50%: 9031
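The "gcc-low + 50%" figures on the two Woodcrest slides follow directly from the gcc-low measurements. A minimal sketch of that arithmetic, assuming the slides truncate the result to an integer (the function name is illustrative):

```python
def real_si(gcc_low_sum: int) -> int:
    """"Real SI" as proposed by CERN: gcc-low SI2K increased by 50%.

    Integer arithmetic truncates the half unit, which matches the slide values.
    """
    return gcc_low_sum * 3 // 2


if __name__ == "__main__":
    # gcc-low results for the dual Woodcrest 5160 node, taken from the slides.
    for label, gcc_low in (("i386", 5523), ("x86_64", 6021)):
        print(label, gcc_low, real_si(gcc_low))
```

Even after the 50% uplift, both values stay well below the lower nominal sum of 11716 for the same node, which is the inflation the slides describe.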


A scale factor
• All these numbers would be merely annoying in a world with a single architecture, in which only the clock improves over time
• You would be able to find a fixed ratio between all those numbers
• But in the real world the ratio depends on the CPU producer (Intel vs AMD) and on the processor generation (old Xeon vs new “core” Xeon)


The nominal SI2K

Big differences between Intel and AMD when SI2K/GHz is plotted


SI2K6rate


Which is the better?
• I started to measure the performance of HEP codes on several machines
• The goal was to find a “commercially maintained” benchmark to replace SI2K
• I compared HEP code with
– SI2K published results
– SI2K measured with gcc and “CERN” tuning
– SI2006 and SI2006 rate published results
– SI2006 and SI2006 with gcc4 (32 and 64 bit)


Babar TierA Results
Babar Stroili

[Figure: bar chart of the benchmark ratio (0% to 200%) for SI2K, SI2K CERN, SI2006, SI2006 gcc and BABAR on Opteron 2218, Opteron 275, Opteron 265, Xeon 5355, Xeon 5345, Xeon 5160, Xeon 2.8, Xeon 2.4 and PIII 1.26]

• If you normalize by core and clock, all new processors have the same performance
• Double that of the older generation CPUs
• SI2006 matches this pattern (published and gcc ratio constant)
• SI2000-cern is better than SI2K nominal
• SI2000 clearly doesn’t work


CMS sw SIM and Pythia
• CMS Montecarlo simulation (32 bit) and Pythia (64 bit) show the same performance once normalized
• Both published SPEC Int 2006 and SPEC Int 2006 with gcc show the same behaviour
• Published SI2K does not match HEP software
• SI2K cern is better, but not as good as SI2006


Atlas
• Here 100% is Xeon 5160
• Few results for SI2006+gcc, but no difference from CMS and Babar
• Few results also for published SI2006, because of several old architectures
• SI2K+gcc not bad
• Published SI2K heavily overestimates the new Xeon
• Atlas simulation, once normalized, performs the same on the new Intel “core” or AMD “opteron” (like CMS, Babar)


Many gaps
• Easy to find published SPEC results
– But only for new machines
• Difficult to measure:
– Not easy to get machines on loan from server resellers or producers
– Not easy to borrow machines from colleagues
– Always for short periods of time
– A SPEC run can last 15-20 hours
• Need a set of dedicated worker nodes for SPEC and HEP application measurements
– The set of WNs should be available to other INFN people who want to make similar measurements


Cache
• In the 80’s the latency was 3-10 clock cycles
• Now the latency is 1000s of clock cycles
• Importance of the cache architecture
– 1st level, 2nd level, 3rd level
– Cache latency
– Cache bandwidth
– Shared or exclusive?


4 core processor


Intel 54xx


AMD 4core


Load transactional

Performance doesn’t drop in the new 4-core processors

Clovertown drops wrt Harpertown

A dual core processor holds up only up to Load3


Perf/watt
• AMD Barcelona at 65nm: performance per watt similar to Intel Xeon at 45nm


Power when idle


Cache behaviour
• 54xx has lower latency even with a bigger cache
• The 3 processors behave very differently in the 4MB to 64MB range
• If your (HEP) application works in this range, you will see a big change in performance when changing processor


Memory intel vs amd
• Access time very similar
• At 1GB (typical footprint of a HEP application) the new AMD behaves better
• But the new Xeon 54xx are much better than the 53xx


Mem intel vs amd
• Who is faster?
• It depends on the block size
• In the red zones Intel is better
• In the green zone AMD is better


Cache behaviour
• We need to study the behaviour of typical HEP applications
– Simulation, event generation, Reconstruction, Analysis
– To understand how to write more efficient applications


Power issues
• Power consumption changes from one processor to another
– Clock, High-K dielectric, Active Power Management, Clock throttling


Power consumption


HT on or off
• Turning off Hyperthreading causes a 10% drop in performance but also a 20% drop in power consumption
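Taken at face value, the two percentages above imply that performance per watt improves when Hyperthreading is off. The short calculation below is a sketch of that reasoning (the function name is illustrative):

```python
def relative_perf_per_watt(perf_drop: float, power_drop: float) -> float:
    """Performance per watt after a change, relative to before (1.0 = unchanged)."""
    return (1.0 - perf_drop) / (1.0 - power_drop)


if __name__ == "__main__":
    # 10% less performance and 20% less power, as quoted on the slide.
    gain = relative_perf_per_watt(0.10, 0.20)
    print(f"perf/watt with HT off: {gain:.3f} x the HT-on value")
```

With 0.9 of the performance for 0.8 of the power, each watt delivers about 12.5% more work, at the cost of needing more boxes for the same total throughput.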


What about HEP?
• Need to measure the power usage of HEP applications
• Example: a big Tier2 with 500 boxes needs 100 kW
– Like the whole CED (computing centre) of INFN Padova
– About 800 MWh in one year
– Energy cost 0.12 Euro per kWh: energy bills of 100 kEuro/year
– A 10% improvement in power efficiency means 10 kEuro/year savings
– And savings on the infrastructure (power distribution, UPS, cooling)
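The Tier2 example above is a back-of-the-envelope estimate. The sketch below redoes it without rounding, assuming the 100 kW load is constant for all 8760 hours of the year; the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 h, assuming constant load all year


def yearly_energy_mwh(power_kw: float) -> float:
    """Energy drawn in one year, in MWh."""
    return power_kw * HOURS_PER_YEAR / 1000.0


def yearly_bill_keuro(power_kw: float, euro_per_kwh: float) -> float:
    """Yearly energy bill in kEuro."""
    return power_kw * HOURS_PER_YEAR * euro_per_kwh / 1000.0


if __name__ == "__main__":
    power_kw, boxes, price = 100.0, 500, 0.12  # values from the slide
    bill = yearly_bill_keuro(power_kw, price)
    print(f"power per box: {1000.0 * power_kw / boxes:.0f} W")
    print(f"energy: {yearly_energy_mwh(power_kw):.0f} MWh/year")
    print(f"bill: {bill:.1f} kEuro/year, 10% saving: {0.10 * bill:.1f} kEuro/year")
```

The unrounded figures (876 MWh and about 105 kEuro per year) are slightly above the rounded values quoted on the slide.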


Power meter
• Need a device to measure voltage and current
• With logging capabilities
• E.g. Fluke 1735


Financial request
• Need to buy a new worker node each time a new processor is released in the dual processor market segment
– Only if significantly new features are present
– One or two each for Intel and AMD per year
– 4 kEuro each (dual proc, 2GB/core, 1 disk)
– 2 boxes to start with


Manpower
• Padova:
– Michele Michelotto (Primo Tecnologo) 75%
– Alberto Crescente (CTER) 30%
– Roberto Ferrari (CTER) 30%
• Ferrara:
– Alberto Gianoli (Primo Tecnologo): 20%
• Bologna:
– Franco Brasolin (CTER): 20%


International Outlook
• Hepix is an organization where people from HEP computing centres meet twice a year (Spring in Europe, Fall in the USA)
• IHEPCCC asked Hepix to form two working groups to study storage and CPU benchmarking
• Real work started at the end of 2007
• CPU group chaired by H. Meinhard (CERN)


Milestone
• 2008
– Understand SPEC 2006. Propose a new benchmark to replace SI2K
– Measure the performance of the current architectures for Montecarlo SIM (evt/sec vs SPEC)
• 2008/2009
– Power performance
• 2009
– Cache profiling


Backup slides
• CPU technology
• Tick tock


Tick Tock


2 sockets – 8 cores – 16 logical CPUs


Improvements at 45nm


Test chip
