KungFu: Making Training in Distributed Machine Learning Adaptive
Luo Mai 1,2 [email protected]
With Guo Li 1, Marcel Wagenlander 1, Konstantinos Fertakis 1, Andrei-Octavian Brabete 1, Peter Pietzuch 1
Imperial College London 1, University of Edinburgh 2
Large-Scale Data & Systems Group

[Diagram: Worker 0–3 each run a dataflow over a data batch from the dataset and exchange ΔGradients via collective communication]
Distributed training systems are key to combining big data with large models.
Parameters in Distributed ML Systems
System parameters:
• Number of workers
• Communication topology
• …
Small batch or large batch? Ring or binary-tree topology?
Issues with Empirical Parameter Tuning
“Change batch size at data epoch 30, 60, and 90 when training with ImageNet.” [1]
“Linearly scale the learning rate with the #workers when training ResNet models.” [2]
“Set the topology to a ring by default.” [3]
[1] Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources, 2020
[2] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2018
[3] Horovod: fast and easy distributed deep learning in TensorFlow, 2018
Issue: these tuning rules are dataset-specific, model-specific and cluster-specific.
Automatic Parameter Adaptation
Example: OpenAI predicts the batch size based on the Gradient Noise Scale (GNS).
Intuition: GNS measures the variation across data batches.
Proposal:
• When GNS is small → keep the batch size
• When GNS is large → increase the batch size
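To make the proposal concrete, below is a minimal NumPy sketch of the two-batch GNS estimator described in OpenAI's large-batch training work; it is not KungFu's implementation, and the function name estimate_gns and its arguments are illustrative only.

    import numpy as np

    def estimate_gns(grad_small, grad_big, b_small, b_big):
        # Gradient vectors measured with a small and a large batch size.
        g_small_sq = float(np.sum(grad_small ** 2))
        g_big_sq = float(np.sum(grad_big ** 2))
        # Unbiased estimates of |G|^2 and tr(Sigma) from the two measurements.
        g_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
        trace_sigma = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)
        # The (simple) gradient noise scale: tr(Sigma) / |G|^2.
        return trace_sigma / g_sq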
A control loop monitors the training workers and uses the monitored metrics, such as gradient second-order metrics, to set parameters.
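A minimal sketch of such a control loop, with get_metric, decide and set_parameter as placeholder callables (these names are not part of KungFu's API):

    def control_loop(get_metric, decide, set_parameter, num_rounds):
        # Monitor a metric from the training workers and use it to set a parameter.
        for _ in range(num_rounds):
            metric = get_metric()          # e.g. a gradient second-order metric such as GNS
            new_value = decide(metric)     # policy decision
            if new_value is not None:
                set_parameter(new_value)   # e.g. batch size, #workers or topology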
Can we design a distributed ML system that supports adaptation?
Design challenges:
• How to support different types of adaptation?
• How to adapt based on a large volume of monitoring data?
• How to change the parameters of stateful workers?
Existing Approaches for Adaptation
Horovod Elastic, PyTorch Elastic, TensorFlow Elastic
[Diagram: adaptation policies such as a GNS Policy or an Elastic Policy consume training logs exported by these systems]

Contributions
• Adaptation Policies: monitoring training and adapting parameters
• Asynchronous collective communication layer: processing a large volume of monitoring data
• Dynamic worker membership tables: consistent adaptation for stateful workers
Contribution 1: Adaptation Policies
Write adaptation policies using expressive API functions:
A Policy combines monitoring with adaptation and is exposed through three API concepts: Policy, Optimizer and Hook.

    import kungfu as kf

    # The KungFu Optimizer wraps the existing optimizer and enables monitoring
    opt = SGDOptimizer(…)
    opt = kf.Optimizer(opt)

    # A Hook attaches an adaptation policy to training
    hook = kf.Hook(GNSPolicy(…))

    model, data = …
    model.train(data, opt, hook)

    # Adaptation API, e.g. scale out by one worker
    kf.resize(kf.size() + 1)
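The GNSPolicy is only named on the slide; the sketch below shows one possible shape for such a policy, assuming a hypothetical after_step callback that receives the monitored GNS value and reusing the kf.resize/kf.size calls shown above (adding a worker also grows the effective batch size).

    import kungfu as kf

    class GNSPolicy:
        def __init__(self, threshold, max_workers):
            self.threshold = threshold      # GNS value above which to scale out
            self.max_workers = max_workers

        def after_step(self, gns):          # hypothetical callback name
            # When the gradient noise scale is large, add a worker so that
            # the global batch size increases; otherwise keep the batch size.
            if gns > self.threshold and kf.size() < self.max_workers:
                kf.resize(kf.size() + 1)

Attaching the policy through a hook keeps the adaptation logic inside the training process, so it can consume the metrics produced by the monitoring dataflow described next.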
Contribution 2: Embedding Monitoring Inside Dataflow
[Dataflow diagram: gradient1–gradient3 feed both a Gradient-Noise-Scale operator (gns1) and an Allreduce operator (allreduce1)]
Monitoring takes advantage of optimisations in dataflow engines and collective communication operations.
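A rough TensorFlow sketch of the idea (not KungFu's actual operators): the noise metric is computed next to the allreduce in the same graph, so it reuses the all-reduced gradients and benefits from the engine's scheduling. The allreduce_fn argument stands in for a collective-communication op.

    import tensorflow as tf

    def monitored_gradients(grads, allreduce_fn, num_workers):
        # Average gradients across workers with the collective op.
        averaged = [allreduce_fn(g) / num_workers for g in grads]
        # Local vs. averaged gradient norms, computed inside the same dataflow.
        local_sq = tf.add_n([tf.reduce_sum(tf.square(g)) for g in grads])
        global_sq = tf.add_n([tf.reduce_sum(tf.square(g)) for g in averaged])
        # The gap between the two norms reflects gradient variation across
        # workers and can feed a gradient-noise-scale estimate.
        return averaged, local_sq - global_sq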
Coordinating synchronous allreduce prevents systems from scaling.
Asynchronous collective communication: the Allreduce operator
1. Passes messages asynchronously (each message carries a key, control fields and data)
2. Maintains allreduce state
[Diagram: concurrent operations allreduce1 and allreduce2 exchanging keyed messages]
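A toy sketch of a keyed, asynchronous allreduce state machine (an illustration of the general idea, not KungFu's communication layer): contributions for each key may arrive in any order, and the operator keeps per-key state until all peers have reported.

    import threading
    from collections import defaultdict

    class AsyncAllreduce:
        def __init__(self, num_peers):
            self.num_peers = num_peers
            self.sums = {}                                  # key -> partial sum
            self.pending = defaultdict(lambda: num_peers)   # key -> missing peers
            self.done = defaultdict(threading.Event)
            self.lock = threading.Lock()

        def on_message(self, key, tensor):
            # Invoked asynchronously whenever a peer's contribution arrives.
            with self.lock:
                self.sums[key] = tensor if key not in self.sums else self.sums[key] + tensor
                self.pending[key] -= 1
                if self.pending[key] == 0:
                    self.done[key].set()

        def result(self, key):
            # Blocks only the caller that needs this key's reduced value.
            self.done[key].wait()
            return self.sums[key]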
Contribution 3: Distributed Mechanisms for Parameter Adaptation
[Diagram: each worker holds the current value of the # workers parameter (e.g. 10)]
Distributed Mechanism for Parameter Adaptation
[Diagram: the gns, allreduce and avg operators look up a Membership table shared by the workers]
Idea: decouple system parameters from dataflow state.
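One way to realise this decoupling, sketched under the assumption of a simple versioned table (class and method names are illustrative, not KungFu's API): operators read the current worker membership from the table, so resizing the cluster only bumps the table version and never touches model or dataflow state.

    class Membership:
        def __init__(self, workers):
            self.version = 0
            self.workers = list(workers)     # e.g. worker network addresses

        def size(self):
            return len(self.workers)         # read by the gns/allreduce/avg operators

        def resize(self, new_workers):
            # Adopt a new worker set; dataflow state (weights, optimiser state)
            # is left untouched, so the change stays consistent across steps.
            self.workers = list(new_workers)
            self.version += 1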
Experimental Evaluation
[Plot: accuracy (%) over training time (seconds) with a small batch size]
[Plot: accuracy (%) over training time with a large batch size]
Large batch size finishes quickly, but accuracy suffers.
How Effectively Does KungFu Adapt?
[Plot: batch size over training time]
GNS predicts that the effective batch size should increase during training.
How Effectively Does KungFu Adapt?
[Plot: accuracy (%) and batch size (0–4096) over training time; legend: SBS Accuracy, LBS Accuracy, KungFu Accuracy, KungFu Batch Size]
Dynamic batch size: embedded monitoring and online adaptation are important for Adaptation Policies.
Compare KungFu with the state-of-the-art library Horovod.
[Plot: per-VM training throughput for 1, 8, 16 and 32 workers, Horovod vs. KungFu; 52% gap]
Conclusions: KungFu
KungFu makes distributed machine learning adaptive; current systems have no unified mechanism for adaptation.
1. Adaptation Policies
2. Embedding monitoring inside dataflow
3. Distributed mechanisms for consistent adaptation