1
Distributed Training of Generative Adversarial
Networks for Fast Detector Simulation on
Intel® Xeon® HPC Cluster
Big Data Summit, 18 July 2018, NERSC, LBNL, Berkeley, CA
07/2018
Sofia Vallecorsaβ, Vikram Saletoreα, Damian Podareanuη,
Federico Carminatiβ, Valeriu Codreanuη, G. Khattakβ, Hans Pabstα
αIntel Corp., βCERN, ηSURFsara
2
Outline
Deep Learning for fast simulation
Generative Adversarial NetworksModel architecture
The training sample
Physics Performance
Computing Performance
Outlook and plans
2
3
Monte Carlo Simulation: Why
Understand how detector design affect
measurements and physics
Correct for inefficiencies, inaccuracies,
unknowns.
Theory models to compare data against.
Detailed simulation of subatomic particles is essential for data analysis, detector design
3
A good simulation demonstrates that we understand the detectors and the
physics we are studying
4
The problem
Complex physics and geometry modeling
Heavy computation requirements
>50% of WLCG power for simulations
Current code cannot cope (HL-LHC in 2025)
Currently available solutions detector dependent
Focus on EM Calorimeter
4
200 Computing centers in 20 countries:
> 600k cores
@CERN (20% WLCG): 65k cores;
30PB disk + >35PB tape storage
Campana, CHEP 2016
ATLAS experiment:
5
Classical Monte Carlo simulation
1. Calculate step particle could travel before doing a PHYSICS interaction
2a. Chek if step is within volume boundaries(GEOMETRY)
3. Propagate with selected step
5. PHYSICS process
4. Repeat 2,3 until reaching volume boundary. Restart from 1
Repeat stages 1 to 5 :
- For every particle trajectory step- For every primary particle - For every secondary particle
simplified from A. Gheata
6
Deep Learning for fast simulation
Generic approach
Can encapsulate expensive computations
Inference step is faster than algorithmic
approach
Already parallelized and optimized for
CPUs/HPCs.
Industry building highly optimized
software, hardware, and cloud services.
Improved, efficient and accurate fast simulation
6
Can we keep accuracy while doing things faster?
7
Requirements
Precise simulation results:
Detailed validation process
A fast inference step
Generic customizable tool
Easy-to-use and easily extensible framework
Large hyper parameters scans and meta-optimisation:
Training time under control
Scalability
Possibility to work across platforms
7
8
A DL engine for fast simulation
Start with time consuming detectors
Reproduce particle showers in calorimeters
Train on detailed simulation
Test training on real data
Test different models
Generative Adversarial Networks
Embed training-inference cycle in simulation
8
IntelParallel Computing Center 2017
9
Can image-processing approaches be useful?
Can we preserve accuracy while increasing speed?
Can we sustain the increase in detector complexity (future highly-granular calorimeters)?
How generic is this approach?
Can we “adjust” architecture to fit a large class of detectors?
What resources are needed?
9
A plan in two steps
• A first proof of concept
• Understand performance and
validate accuracy
• Prove generalisation is possible
• Understand and optimisecomputing resources
• Reduce training time
• HPC friendly
10
CLIC Calorimeter
Array of absorber material and silicon sensors
10
Ispy visualisation
25 2525
Detector output is
essentially a 3D image
Pierini, DS@HEP(*) http://cds.cern.ch/record/2254048#
CLIC (Compact LInear Collider) is a CERN project for a linear accelerator of electrons and positrons to TeVenergies
Associated electromagnetic calorimeter detector design(*)
Highly segmented (pixelized)
Segmentation is critical for particle identification and energy calibration.
12
Network architecture
3D conditional GAN
- reproduce full volumes of
shower reconstruct in one go
- with two auxiliary regression
tasks
Based on 3D
convolution/deconvolutions to
describe whole volume
12
Y
13
Conditioning and auxiliary tasks
Condition training on several input variables (particle type,
energy, incidence angle)
Auxiliary regression tasks assigned to the discriminator:
primary particle energy, deposited energy, incidence angle
Loss is linear combination of 3 terms:
Combined cross entropy (real/fake)
Mean absolute percentage error for regression tasks
13
Easily generalisable to multi-class approach (or multi-discriminator
approach): angle..
Real/fake probability
Epochs
14
RESULTS validationComparison to Monte Carlo data
14
Geant4GAN generated
GAN generated electron shower
Y moment (width)
Average shower section
Primary particle energy(100 GeV)
Single cell response
15
Outline
Deep Learning for fast simulation
Generative Adversarial NetworksModel architecture
The training sample
Physics Performance
Computing Performance
Outlook and plans
15
16
Generation speedup
Inference:
Classical Monte Carlo requires 17 secs/shower using Geant4
3DGAN takes 7 msec/shower
speedup factor > 2500!!
Using a trained model is very fast
Time to create an electron shower
Method MachineTime/Shower
(msec)
Full Simulation (geant4)
Intel Xeon Platinum 8180
17000
3D GAN(batch size 128)
Intel Xeon Platinum 8180
7
17
Use keras 2.13 /Tensorflow 1.9 (Intel optimised)
• AVX512 –FMA support
• Intel® MKL-DNN (with 3D convolution support)
Optimised multicore utilisation• inter_op_paralellism_threads
• intra_op_paralellism threads
• OMP_NUM_THREADS
Horovod 0.13.4• MPI_AllReduce
Distributed
training
Run on TACC Stampede2 cluster:• Dual socket Intel Xeon 8160
• 2x 24 cores per node, 192 GB RAM
• Intel® Omni-Path Architecture
Test several MPI schedulingconfigurations
• 2,4, 8 processes per nodes.
• Best machine efficiency with 4 processes/node
18
Optimizing & Parallelizing Training
Karas:
Simplicity and high productivity
TensorFlow + MKL-DNN
Special MKL-DNN build with 3D Conv Support
Horovod
Init & Wrapping TensorFlow/Karas optimizer inside HorovodDistributedOptimizer class
Broadcast shared variables to the Horovod World
Data Loading: Ensure data is loaded in parallel: Adapting existing code to take into account of
Horovod World Size
Optimizer: RMSprop18
19
GANs, Dataset, TF, & Runtime Options
Original GANs
Conv Filters: non-multiple of 16
Parameters: Generator: 872736 & Discriminator: 73646; Model Size: 3.8MB
Modified Filter for Optimized Performance
Modified Conv Filters: Multiple of 16 for MKL-DNN optimizations
Parameters: Generator: 1042276 & Discriminator: 174966; Model-Size: ~5MB
Dataset: 200000
Training Samples: 180000 & Validation: 20000
TensorFlow 1.9 (private branch) + MKL-DNN (w/ 3D Conv Support)
Batch Size: 8/Worker, # Workers/Node=4/Node; 2P Xeon Nodes: 1 to 128
Tuning: inter_op: 2 & Intra_op: 11 (Xeon® 8160 is 24C/CPU); LR: 0.001, Optimizer: RMSprop
Warmup Epochs: 5 (Facebook Methodology), Training Epochs: 20, horovod fusion buf: 64MB
19
12 Cores½-Skt-0Worker-0
12 Cores½-Skt-0Worker-1
12 Cores½-Skt-1 Worker-2
12 Cores½-Skt-1 Worker-3
20
1
3
6
8
0
1
2
3
4
5
6
7
8
9
10
Baseline GANs:1Wk/Node, TF+EIGEN
Baseline GANs:1Wk/Node, TF+MKL-
DNN
GANs+Modified Filters:1Wk/Node, TF+MKL-
DNN
GANs+Modified Filters:4Wk/Node, TF+MKL-
DNN
Sp
ee
du
p
High Energy Physics: 3D GANS Training Secs/Epoch PerformanceSingle-Node Intel(R) 2S Xeon(R) Stampede2/TACC
TensorFlow 1.9, MKL-DNN vs EIGEN
Perf. Improvement (Secs/Batch)
Baseline:140625
Secs/Epoch
Baseline:17831
Secs/Epoch
Single-Node Training Perf. optimisation
• 1 worker/node TF + Eigen (baseline)
• 1 worker/node TF + MKL-DNN
• 1 worker/node, TF+ MKL-DNN, optimised number of convolution filters
• 4 workers/node, TF+ MKL-DNN, optimised number of convolution filters
See in animation
21
1
3
6
8
0
1
2
3
4
5
6
7
8
9
10
Baseline GANs:1Wk/Node, TF+EIGEN
Baseline GANs:1Wk/Node, TF+MKL-
DNN
GANs+Modified Filters:1Wk/Node, TF+MKL-
DNN
GANs+Modified Filters:4Wk/Node, TF+MKL-
DNN
Sp
ee
du
p
High Energy Physics: 3D GANS Training Secs/Epoch PerformanceSingle-Node Intel(R) 2S Xeon(R) Stampede2/TACC
TensorFlow 1.9, MKL-DNN vs EIGEN
Perf. Improvement (Secs/Batch)
Baseline:140625
Secs/Epoch
Baseline:17831
Secs/Epoch
Single-Node Training Perf. optimisation
• 1 worker/node TF + Eigen (baseline)
• 1 worker/node TF + MKL-DNN
• 1 worker/node, TF+ MKL-DNN, optimised number of convolution filters
• 4 workers/node, TF+ MKL-DNN, optimised number of convolution filters
See in animation
22
1
3
6
8
0
1
2
3
4
5
6
7
8
9
10
Baseline GANs:1Wk/Node, TF+EIGEN
Baseline GANs:1Wk/Node, TF+MKL-
DNN
GANs+Modified Filters:1Wk/Node, TF+MKL-
DNN
GANs+Modified Filters:4Wk/Node, TF+MKL-
DNN
Sp
ee
du
p
High Energy Physics: 3D GANS Training Secs/Epoch PerformanceSingle-Node Intel(R) 2S Xeon(R) Stampede2/TACC
TensorFlow 1.9, MKL-DNN vs EIGEN
Perf. Improvement (Secs/Batch)
Baseline:140625
Secs/Epoch
Baseline:17831
Secs/Epoch
Single-Node Training Perf. optimisation
• 1 worker/node TF + Eigen (baseline)
• 1 worker/node TF + MKL-DNN
• 1 worker/node, TF+ MKL-DNN, optimised number of convolution filters
• 4 workers/node, TF+ MKL-DNN, optimised number of convolution filters
See in animation
23
1
3
6
8
0
1
2
3
4
5
6
7
8
9
10
Baseline GANs:1Wk/Node, TF+EIGEN
Baseline GANs:1Wk/Node, TF+MKL-
DNN
GANs+Modified Filters:1Wk/Node, TF+MKL-
DNN
GANs+Modified Filters:4Wk/Node, TF+MKL-
DNN
Sp
ee
du
p
High Energy Physics: 3D GANS Training Secs/Epoch PerformanceSingle-Node Intel(R) 2S Xeon(R) Stampede2/TACC
TensorFlow 1.9, MKL-DNN vs EIGEN
Perf. Improvement (Secs/Batch)
Baseline:140625
Secs/Epoch
Baseline:17831
Secs/Epoch
Single-Node Training Perf. optimisation
• 1 worker/node TF + Eigen (baseline)
• 1 worker/node TF + MKL-DNN
• 1 worker/node, TF+ MKL-DNN, optimised number of convolution filters
• 4 workers/node, TF+ MKL-DNN, optimised number of convolution filters
See in animation
24
17831
8998
4545
2288
1151
581
293
148
64
256
1024
4096
16384
65536
1 2 4 8 16 32 64 128
Seco
nd
s/Ep
och
Intel 2S Xeon(R) Nodes
High Energy Physics: 3D GANS Training Time PerformanceIntel 2S Xeon(R) on Stampede2/TACC, OPA Fabric
TensorFlow 1.9+horovod, IMPI, Core Aff. BKMs, 4 Workers/Node
2S Xeon 8160: Secs/Epoch
1.02.0
3.9
7.8
15.5
31
61
120
100% 100%98% 97% 97% 96% 95% 94%
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
1
2
4
8
16
32
64
128
256
1 2 4 8 16 32 64 128
Spe
ed
up
Eff
icie
ncy
Spe
ed
up
Intel(R) 2S Xeon(R) Nodes
High Energy Physics: 3D GANs Training Speedup PerformanceIntel 2S Xeon(R) on Stampede2/TACC, OPA Fabric
TensorFlow 1.9+MKL-DNN+horovod, Intel MPI, Core Aff. BKMs, 4 Workers/Node
2S Xeon 8160: Secs/Epoch Speedup Ideal Scaling Efficiency
128-Node Perf:148 Secs/Epoch
Multi-Node Time/Epoch Scaling Performance
Distributed training using data parallelism
94% scaling efficiency up to 128 nodes
25
Some performance degradation
Mostly in the low energy regions
for batchsize
Network optimised for the 100-
200 GeV central region
Applied warmup and scaling of
initial learning rate
Further investigation ongoing
Physics performance at scale
Data: Geant4
BatchSize=1024
BatchSize=4096
BatchSize=10240
26
Physics Performance at scale
Data: GeantV
BatchSize=1024
BatchSize=4096
BatchSize=10240
50 100 150 200 250 300
27
Conclusion & Plans
Distributed training process and optimisation to scale on clusters
is critical
Allows meta-optimisation and hyperparameter scans in order to
generalize to different detectors
Parallelizing training process and optimize scaling on clusters
Initial results are very promising
Reduced training time by x8 on single node
Linear scaling brings down training time to ~2min/Epoch on 128 nodes
Keep working on the understanding / optmisation of physics
performance at scale
First results are very promising from physics perspective
27
28
Questions?
www.cern.ch/openlab
30
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon, Movidius, Saffron and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
© 2018 Intel Corporation.
Legal notices & disclaimers
31
1
3
6
8
0
1
2
3
4
5
6
7
8
9
10
Baseline GANs:1Wk/Node, TF+EIGEN
Baseline GANs:1Wk/Node, TF+MKL-
DNN
GANs+Modified Filters:1Wk/Node, TF+MKL-
DNN
GANs+Modified Filters:4Wk/Node, TF+MKL-
DNN
Sp
ee
du
p
High Energy Physics: 3D GANS Training Secs/Epoch PerformanceSingle-Node Intel(R) 2S Xeon(R) Stampede2/TACC
TensorFlow 1.9, MKL-DNN vs EIGEN
Perf. Improvement (Secs/Batch)
Baseline:140625
Secs/Epoch
Baseline:17831
Secs/Epoch
Single-Node Training Perf. optimisation
• 1 worker/node TF + Eigen (baseline)
• 1 worker/node TF + MKL-DNN
• 1 worker/node, TF+ MKL-DNN, optimised number of convolution filters
• 4 workers/node, TF+ MKL-DNN, optimised number of convolution filters
32
Generator G generates data from random noise
Discriminator D learns how to distinguish real data from generated data
32
Simultaneously train two networks that compete and cooperate with each other:
Generative adversarial networks
Goodfellow et. al. 2014.: arXiv:1406.2661v1
Image source:
The counterfeiter/detective case
Counterfeiter shows the Monalisa
Detective says it is fake and gives feedback
Counterfeiter makes new Monalisa based on feedback
Iterate until detective is fooled
https://arxiv.org/pdf/1701.00160v1.pdf
Top Related