Distributed Training of Generative Adversarial
Networks for Fast Detector Simulation on
Intel® Xeon® HPC Cluster
Big Data Summit, 18 July 2018, NERSC, LBNL, Berkeley, CA
Sofia Vallecorsaβ, Vikram Saletoreα, Damian Podareanuη,
Federico Carminatiβ, Valeriu Codreanuη, G. Khattakβ, Hans Pabstα
αIntel Corp., βCERN, ηSURFsara
Outline
Deep Learning for fast simulation
Generative Adversarial Networks: model architecture
The training sample
Physics Performance
Computing Performance
Outlook and plans
Monte Carlo Simulation: Why
Understand how detector design affects measurements and physics
Correct for inefficiencies, inaccuracies, unknowns
Theory models to compare data against.
Detailed simulation of subatomic particles is essential for data analysis and detector design
A good simulation demonstrates that we understand the detectors and the physics we are studying
The problem
Complex physics and geometry modeling
Heavy computation requirements
>50% of WLCG power for simulations
Current code cannot cope (HL-LHC in 2025)
Currently available solutions are detector dependent
Focus on EM Calorimeter
WLCG: 200 computing centers in 20 countries, >600k cores
At CERN (20% of WLCG): 65k cores; 30 PB disk + >35 PB tape storage
Campana, CHEP 2016
ATLAS experiment:
https://indico.cern.ch/event/505613/contributions/2241721/attachments/1344208/2032301/Oral-90.pdf
Classical Monte Carlo simulation
1. Calculate the step the particle could travel before the next PHYSICS interaction
2. Check whether the step stays within the volume boundaries (GEOMETRY)
3. Propagate with the selected step
4. Repeat 2 and 3 until a volume boundary is reached, then restart from 1
5. Perform the PHYSICS process
Repeat stages 1 to 5 for every particle trajectory step, for every primary particle, and for every secondary particle (a toy sketch of this loop follows below).
simplified from A. Gheata
https://indico.cern.ch/event/656336/timetable/#20171110
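As an illustration of this stepping loop, here is a self-contained toy sketch in Python; the 1D geometry, exponential free path, and fixed energy-loss fraction are purely hypothetical stand-ins for the real physics and geometry engines.

```python
import random

# Toy stand-in for the stepping loop above; all constants are hypothetical.
MEAN_FREE_PATH = 2.0    # cm, assumed mean distance between interactions
DETECTOR_LENGTH = 25.0  # cm, assumed 1D "volume"

def simulate_primary(energy_gev):
    """Transport one particle through a 1D volume, returning its deposits."""
    x, deposits = 0.0, []
    while energy_gev > 0.01:
        # 1. Sample the step to the next PHYSICS interaction.
        step = random.expovariate(1.0 / MEAN_FREE_PATH)
        # 2. Check the step against the volume boundary (GEOMETRY).
        step = min(step, DETECTOR_LENGTH - x)
        # 3. Propagate with the selected step.
        x += step
        # 4. If we hit the volume boundary, the particle exits.
        if x >= DETECTOR_LENGTH:
            break
        # 5. PHYSICS process: deposit half the remaining energy (toy model).
        deposit = 0.5 * energy_gev
        deposits.append((x, deposit))
        energy_gev -= deposit
    return deposits

print(simulate_primary(100.0))  # deposits of one 100 GeV primary
```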
Deep Learning for fast simulation
Generic approach
Can encapsulate expensive computations
The inference step is faster than the algorithmic approach
Already parallelized and optimized for CPUs/HPCs
Industry is building highly optimized software, hardware, and cloud services
Improved, efficient and accurate fast simulation
Can we keep accuracy while doing things faster?
Requirements
Precise simulation results:
Detailed validation process
A fast inference step
Generic customizable tool
Easy-to-use and easily extensible framework
Large hyperparameter scans and meta-optimisation:
Training time under control
Scalability
Possibility to work across platforms
A DL engine for fast simulation
Start with time-consuming detectors
Reproduce particle showers in calorimeters
Train on detailed simulation
Test training on real data
Test different models
Generative Adversarial Networks
Embed training-inference cycle in simulation
Intel Parallel Computing Center 2017
Can image-processing approaches be useful?
Can we preserve accuracy while increasing speed?
Can we sustain the increase in detector complexity (future highly-granular calorimeters)?
How generic is this approach?
Can we “adjust” architecture to fit a large class of detectors?
What resources are needed?
A plan in two steps
• A first proof of concept
• Understand performance and validate accuracy
• Prove generalisation is possible
• Understand and optimise computing resources
• Reduce training time
• HPC friendly
CLIC Calorimeter
Array of absorber material and silicon sensors
[Figure: iSpy visualisation of the detector]
Detector output is essentially a 3D image (25 × 25 × 25 pixels)
Pierini, DS@HEP(*) http://cds.cern.ch/record/2254048#
CLIC (Compact Linear Collider) is a CERN project for a linear electron-positron accelerator reaching TeV energies
Associated electromagnetic calorimeter detector design(*)
Highly segmented (pixelized)
Segmentation is critical for particle identification and energy calibration.
https://github.com/tpmccauley/ispy-hepml
https://indico.hep.caltech.edu/indico/getFile.py/access?contribId=16&sessionId=9&resId=0&materialId=slides&confId=102
http://cds.cern.ch/record/2254048
Network architecture
3D conditional GAN:
- reproduces the full shower volume in one go
- with two auxiliary regression tasks
Based on 3D convolutions/deconvolutions to describe the whole volume (a sketch follows below)
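A minimal Keras sketch of such a network is shown below. Only the 25 × 25 × 25 output shape comes from the slides; the latent size, filter counts, and kernel sizes are illustrative assumptions, not the exact 3DGAN configuration.

```python
from keras.layers import (Input, Dense, Reshape, Flatten,
                          Conv3D, Conv3DTranspose)
from keras.models import Model

latent_dim = 200  # assumption: latent size is not given on the slides

# Generator: latent vector -> 25 x 25 x 25 energy-deposit volume
gen_in = Input(shape=(latent_dim,))
x = Dense(7 * 7 * 7 * 8, activation="relu")(gen_in)
x = Reshape((7, 7, 7, 8))(x)
x = Conv3DTranspose(32, 6, strides=2, activation="relu")(x)   # 7^3 -> 18^3
x = Conv3DTranspose(16, 8, activation="relu")(x)              # 18^3 -> 25^3
gen_out = Conv3D(1, 3, padding="same", activation="relu")(x)  # deposits >= 0
generator = Model(gen_in, gen_out, name="generator")

# Discriminator: volume -> real/fake plus an auxiliary regression head
disc_in = Input(shape=(25, 25, 25, 1))
y = Conv3D(16, 5, padding="same", activation="relu")(disc_in)
y = Conv3D(8, 5, strides=2, activation="relu")(y)
y = Flatten()(y)
real_fake = Dense(1, activation="sigmoid", name="real_fake")(y)
aux_energy = Dense(1, name="primary_energy")(y)  # auxiliary regression task
discriminator = Model(disc_in, [real_fake, aux_energy], name="discriminator")
```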
Conditioning and auxiliary tasks
Condition training on several input variables (particle type, energy, incidence angle)
Auxiliary regression tasks assigned to the discriminator: primary particle energy, deposited energy, incidence angle
Loss is a linear combination of three terms (see the sketch below):
Combined cross entropy (real/fake)
Mean absolute percentage error for the regression tasks
Easily generalisable to a multi-class approach (or a multi-discriminator approach), e.g. for the incidence angle
[Figure: real/fake probability vs. training epochs]
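Continuing the architecture sketch above, the combined loss can be expressed when compiling the discriminator; the cross entropy and MAPE losses are as stated on the slide, while the 0.1 weight on the regression head is an illustrative assumption, not the tuned value.

```python
from keras.optimizers import RMSprop

# Combined loss: binary cross entropy on the real/fake head plus mean
# absolute percentage error (MAPE) on the auxiliary regression head.
discriminator.compile(
    optimizer=RMSprop(),
    loss={"real_fake": "binary_crossentropy",
          "primary_energy": "mean_absolute_percentage_error"},
    loss_weights={"real_fake": 1.0, "primary_energy": 0.1},  # assumed weights
)
```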
Results validation: comparison to Monte Carlo data
[Figure: Geant4 vs. GAN-generated electron showers (100 GeV primary particle energy): average shower section, Y moment (width), single cell response]
Generation speedup
Inference:
Classical Monte Carlo simulation requires 17 sec/shower using Geant4
3DGAN takes 7 msec/shower: a speedup factor of about 2400
Using a trained model is very fast
Time to create an electron shower:

Method                      Machine                      Time/shower (msec)
Full simulation (Geant4)    Intel Xeon Platinum 8180     17000
3D GAN (batch size 128)     Intel Xeon Platinum 8180     7
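A rough sketch of how such a per-shower time can be measured, reusing the generator and latent_dim from the architecture sketch earlier; the warm-up call and batch size are illustrative.

```python
import time
import numpy as np

# Time one batch of generated showers with the trained generator.
batch = np.random.normal(size=(128, latent_dim)).astype("float32")
generator.predict(batch)                # warm-up (graph/thread setup)
start = time.time()
showers = generator.predict(batch)      # 128 showers in one batch
per_shower_ms = (time.time() - start) / len(batch) * 1e3
print("%.2f ms/shower" % per_shower_ms)
```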
Distributed training

Keras 2.1.3 / TensorFlow 1.9 (Intel optimised)
• AVX-512 FMA support
• Intel® MKL-DNN (with 3D convolution support)
Optimised multicore utilisation (see the sketch below):
• inter_op_parallelism_threads
• intra_op_parallelism_threads
• OMP_NUM_THREADS
Horovod 0.13.4
• MPI_Allreduce
Run on the TACC Stampede2 cluster:
• Dual-socket Intel Xeon 8160
• 2 × 24 cores per node, 192 GB RAM
• Intel® Omni-Path Architecture
Tested several MPI scheduling configurations:
• 2, 4, or 8 processes per node
• Best machine efficiency with 4 processes/node
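A sketch of how these knobs are typically wired up with the TensorFlow 1.x API; the thread counts match the inter_op = 2 / intra_op = 11 tuning quoted later, while KMP_BLOCKTIME is an assumed, commonly used Intel OpenMP setting rather than one stated on the slides.

```python
import os
import tensorflow as tf
from keras import backend as K

# Thread placement for one worker process.
os.environ["OMP_NUM_THREADS"] = "11"   # MKL-DNN compute threads per worker
os.environ["KMP_BLOCKTIME"] = "1"      # assumption: common Intel OpenMP tuning

config = tf.ConfigProto(
    inter_op_parallelism_threads=2,    # concurrent independent ops
    intra_op_parallelism_threads=11,   # threads inside a single op
)
K.set_session(tf.Session(config=config))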
Optimizing & Parallelizing Training
Keras: simplicity and high productivity
TensorFlow + MKL-DNN: special MKL-DNN build with 3D convolution support
Horovod:
Initialise, then wrap the TensorFlow/Keras optimizer in Horovod's DistributedOptimizer class
Broadcast shared variables to the Horovod world
Data loading: ensure data is loaded in parallel, adapting existing code to take the Horovod world size into account
Optimizer: RMSprop (see the sketch below)
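A minimal sketch of this wiring with the horovod.keras API; the dataset variables are placeholders.

```python
import horovod.keras as hvd
from keras.optimizers import RMSprop

hvd.init()  # one Horovod worker per MPI process

# Wrap the Keras optimizer: gradients are averaged across workers
# via MPI_Allreduce behind the scenes.
opt = hvd.DistributedOptimizer(RMSprop(lr=0.001))

# Broadcast initial weights from rank 0 so every worker starts in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# Parallel data loading: every worker reads its own shard of the dataset.
shard = slice(hvd.rank(), None, hvd.size())
# x_train, y_train = x_train[shard], y_train[shard]   # placeholder arrays
# model.compile(optimizer=opt, loss=...)
# model.fit(x_train, y_train, callbacks=callbacks, ...)
```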
GANs, Dataset, TF, & Runtime Options
Original GANs:
Conv filters: not a multiple of 16
Parameters: generator 872,736, discriminator 73,646; model size: 3.8 MB
Modified filters for optimized performance:
Conv filters: multiples of 16, for MKL-DNN optimizations
Parameters: generator 1,042,276, discriminator 174,966; model size: ~5 MB
Dataset: 200,000 samples (180,000 training, 20,000 validation)
TensorFlow 1.9 (private branch) + MKL-DNN (with 3D convolution support)
Batch size: 8 per worker, 4 workers per node; dual-socket Xeon nodes: 1 to 128
Tuning: inter_op = 2, intra_op = 11 (the Xeon 8160 has 24 cores per CPU); LR: 0.001; optimizer: RMSprop
Warm-up epochs: 5 (Facebook methodology, sketched below); training epochs: 20; Horovod fusion buffer: 64 MB
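Continuing the Horovod sketch above, a minimal version of that warm-up: ramp the learning rate from the single-worker value to hvd.size() times that value over the first five epochs (the "Facebook methodology" refers to the large-minibatch recipe of Goyal et al., 2017); the schedule below is an illustrative implementation.

```python
from keras.callbacks import LearningRateScheduler

base_lr, warmup_epochs = 0.001, 5

def warmup_lr(epoch):
    # Linear ramp from base_lr to base_lr * hvd.size() over the first
    # warmup_epochs epochs, then hold the scaled rate.
    if epoch < warmup_epochs:
        return base_lr * (1.0 + (hvd.size() - 1.0) * epoch / warmup_epochs)
    return base_lr * hvd.size()

callbacks.append(LearningRateScheduler(warmup_lr))
```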
[Figure: core affinity per node: 4 workers, each pinned to 12 cores (half a socket); workers 0 and 1 on socket 0, workers 2 and 3 on socket 1]
[Chart: High Energy Physics training speedup on a single node, relative to baseline GANs with TF+Eigen at 1 worker/node (1x): baseline GANs, TF+MKL-DNN, 1 worker/node: ~3x; GANs with modified filters, TF+MKL-DNN, 1 worker/node: ~6x; GANs with modified filters, TF+MKL-DNN, 4 workers/node: ~8x]