
Transcript of KungFu: Making Training in Distributed Machine Learning Adaptive

Page 1: KungFu: Making Training in Distributed Machine Learning ...

KungFu: Making Training in Distributed Machine Learning Adaptive

Luo Mai 1,2 [email protected]

With Guo Li 1, Marcel Wagenlander 1, Konstantinos Fertakis 1, Andrei-Octavian Brabete 1, Peter Pietzuch 1

Imperial College London 1, University of Edinburgh 2


Large-Scale Data & Systems Group

Page 2: KungFu: Making Training in Distributed Machine Learning ...

Training in Distributed ML Systems

2

[Figure: each worker (Worker 0–3) runs a dataflow over a data batch drawn from the dataset and exchanges ΔGradients through collective communication]

Distributed training systems are key to combining big data with large models

Page 3: KungFu: Making Training in Distributed Machine Learning ...

Parameters in Distributed ML Systems

3

Hyper-parameters
• Batch size
• Learning rate
• …

System parameters
• Number of workers
• Communication topology
• …

[Figure: Worker 0–3 connected by a communication topology]

Users must tune parameters to optimise time-to-accuracy

Small batch or large batch? Ring or binary-tree?

Page 4: KungFu: Making Training in Distributed Machine Learning ...

Issues with Empirical Parameter Tuning

4

“Change batch size at data epoch 30, 60, and 90 when training with ImageNet.” [1]

“Linearly scale the learning rate with the #workers when training ResNet models.” [2]

“Set the topology to a ring by default.” [3]

[1] Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources, 2020
[2] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2018
[3] Horovod: fast and easy distributed deep learning in TensorFlow, 2018

The quotes above are examples of empirical parameter tuning.

Issue: such rules are dataset-specific, model-specific, and cluster-specific.

Page 5: KungFu: Making Training in Distributed Machine Learning ...

Automatic Parameter Adaptation

5

Example: OpenAI predicts the batch size based on the Gradient Noise Scale (GNS).

Intuition: GNS measures the noise in gradients computed from different data batches.

Proposal:
• When GNS is small → keep the batch size
• When GNS is large → increase the batch size
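To make the rule concrete, here is a minimal sketch of such a batch-size controller. It is illustrative only (not code from the talk); the gns_estimate input, the growth factor, and the cap are assumptions.

    # Illustrative sketch (not from the talk): a GNS-driven batch-size rule.
    # gns_estimate, the growth factor, and the 4096 cap are assumptions.
    def next_batch_size(batch_size, gns_estimate, growth=2, max_batch=4096):
        # Keep the batch size while the noise scale is small relative to it;
        # grow the batch once the noise scale exceeds it.
        if gns_estimate > batch_size:
            return min(batch_size * growth, max_batch)
        return batch_size

For example, with a batch size of 512 and an estimated GNS of 1500, this rule would grow the batch to 1024 on the next adaptation step.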

Page 6: KungFu: Making Training in Distributed Machine Learning ...

Proposals for Automatic Parameter Adaptation

6

A control loop monitors the training workers and uses the monitored metrics to set parameters.

Proposed approaches monitor different metrics: gradient second-order metrics, gradient variance, and worker performance metrics.

[Figure: a control loop performs monitoring and adaptation over the workers]

Page 7: KungFu: Making Training in Distributed Machine Learning ...

Open Challenges

7

Can we design a distributed ML system that supports adaptation?

Design challenges:

• How to support different types of adaptation?
• How to adapt based on a large volume of monitoring data?
• How to change the parameters of stateful workers?

Page 8: KungFu: Making Training in Distributed Machine Learning ...

Existing Approaches for Adaptation

8

[Figure: Worker 0–3 under existing adaptation approaches]

1. No unified mechanisms for adaptation: AutoScaling [MLSys'20], Horovod Elastic, PyTorch Elastic, and TensorFlow Elastic are custom adaptation frameworks without generic APIs.

2. Processing of monitoring data offline: workers write logs, which are then analysed with log analytics libraries.

3. Checkpoint-and-recover: write checkpoint → release resources → acquire resources → start training process → read checkpoint, which involves many adaptation steps.

Page 9: KungFu: Making Training in Distributed Machine Learning ...

KungFu Overview

9

Contributions:

1. Adaptation policies (e.g. GNS Policy, Elastic Policy) built from monitoring, communication and adaptation functions, supporting different types of adaptation.

2. Embedding monitoring operators inside the dataflow, on top of an asynchronous collective communication layer, to process a large volume of monitoring data.

3. Distributed mechanisms for parameter adaptation (monitoring training, adapting parameters) using dynamic worker membership tables, for consistent adaptation of stateful workers.

[Figure: the KungFu stack runs on top of TensorFlow/PyTorch/Keras workers]

Page 10: KungFu: Making Training in Distributed Machine Learning ...

Contribution 1: Adaptation Policies

10

Page 11: KungFu: Making Training in Distributed Machine Learning ...

Adaptation Policies

11

Goal: express the adaptation control loop over parameters.

Write adaptation policies using expressive API functions:

Monitoring
• grad_noise_scale
• grad_variance
• …

Communication
• allreduce
• broadcast
• …

Adaptation
• resize
• set_tree
• …

[Figure: policies form a control loop that monitors the workers and adapts parameters through the communication layer]
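As a rough illustration of how these API functions compose into a policy (a sketch, not a policy from the talk: the call signature of grad_variance and the shrink-on-low-variance rule are assumptions), a policy could also remove a worker when the averaged gradient variance stays low:

    import kungfu as kf

    # Illustrative sketch, not a policy from the talk: assumes grad_variance()
    # is called like grad_noise_scale(); the shrink-on-low-variance rule is invented.
    class VariancePolicy(kf.BasePolicy):
        threshold = 0.1     # illustrative threshold
        min_workers = 1

        def after_step(self):
            var = kf.grad_variance()          # monitoring function
            avg = kf.allreduce(var, 'avg')    # communication function
            if avg < self.threshold and kf.size() > self.min_workers:
                kf.resize(kf.size() - 1)      # adaptation function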

Page 12: KungFu: Making Training in Distributed Machine Learning ...


Example: Adaptation Policy for GNS

12

[Figure: across training steps N and N+1, a Hook invokes the Policy around the Optimizer]

1. Adaptation logic in policy functions:

    import kungfu as kf

    class GNSPolicy(kf.BasePolicy):
        def after_step(self):
            # self.prev holds the previous GNS average (initialisation elided on the slide)
            gns = kf.grad_noise_scale()
            avg = kf.allreduce(gns, 'avg')
            if avg > self.prev:
                kf.resize(kf.size() + 1)

2. KungFu Optimizer enables monitoring:

    opt = SGDOptimizer(…)
    opt = kf.Optimizer(opt)

3. KungFu Hook enables the policy:

    hook = kf.Hook(GNSPolicy(…))
    model, data = …
    model.train(data, opt, hook)

Page 13: KungFu: Making Training in Distributed Machine Learning ...

Contribution 2: Embedding Monitoring Inside Dataflow

13

Page 14: KungFu: Making Training in Distributed Machine Learning ...

Embedding Monitoring Inside Dataflow

14

Problem: High monitoring cost reduces the adaptation benefit.

Idea: Improve efficiency by adding monitoring operators to the dataflow graph.

[Figure: dataflow graph in which each gradient operator (gradient1–3) feeds a gradient-noise-scale operator (gns1–3) followed by an allreduce operator (allreduce1–3)]

Monitoring takes advantage of optimisations in dataflow engines and collective communication operations.
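For intuition, one plausible form of such a gns operator (a sketch, not KungFu's code) applies the standard two-batch gradient-noise-scale estimator to the per-worker gradient (small batch) and the allreduced gradient (large batch):

    import numpy as np

    # Sketch of what a gradient-noise-scale operator could compute next to the
    # allreduce. Illustrative, not KungFu's code; in practice both estimates
    # would be smoothed (e.g. with moving averages) before taking the ratio.
    def gns_estimate(local_grad, avg_grad, local_batch, global_batch):
        g_small_sq = float(np.sum(local_grad ** 2))   # per-worker (small-batch) gradient norm²
        g_big_sq = float(np.sum(avg_grad ** 2))       # allreduced (large-batch) gradient norm²
        # Unbiased estimates of the gradient noise (trace of the covariance) and signal.
        noise = (g_small_sq - g_big_sq) / (1.0 / local_batch - 1.0 / global_batch)
        signal = (global_batch * g_big_sq - local_batch * g_small_sq) / (global_batch - local_batch)
        return noise / signal

Since the allreduced gradient is already produced by the communication layer, such an estimator would add only a few element-wise reductions per gradient tensor.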

Page 15: KungFu: Making Training in Distributed Machine Learning ...

Challenges of Dataflow Collective Communication

15

Problem: Collective communication reduces dataflow performance.

1. The dataflow engine launches operators asynchronously.

[Figure: over time, Worker 0 may launch the allreduce1 operator before allreduce2 while Worker 1 launches them in the opposite order]

2. The Message Passing Interface (MPI) requires synchronous execution.

Coordinating synchronous allreduce prevents systems from scaling.

Page 16: KungFu: Making Training in Distributed Machine Learning ...

Making Collective Communication Asynchronous

16

Idea: Use asynchronous collective communication.

[Figure: Worker 0 and Worker 1 exchange collective messages (key, data, control data) for allreduce1 and allreduce2 over time, each maintaining collective state]

1. Pass messages asynchronously.
2. Maintain allreduce state.
3. Pass the complete result downstream.

No need for coordination in asynchronous collective communication
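The keyed-state idea can be sketched as follows (a minimal illustration, not KungFu's implementation; the class and method names are assumptions): each collective message carries a key identifying its allreduce, contributions are accumulated per key, and the completed result is passed downstream regardless of the order in which workers launch their operators.

    # Minimal sketch (not KungFu's implementation) of key-based asynchronous allreduce state.
    class AllreduceState:
        def __init__(self, num_workers):
            self.num_workers = num_workers
            self.partial = {}   # key -> (accumulated sum, number of contributions)

        def on_message(self, key, data):
            """Accumulate one worker's contribution; return the result when complete."""
            total, count = self.partial.get(key, (0.0, 0))
            total, count = total + data, count + 1
            if count == self.num_workers:
                del self.partial[key]
                return total            # complete: pass the result downstream
            self.partial[key] = (total, count)
            return None                 # incomplete: keep waiting, no coordination needed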

Page 17: KungFu: Making Training in Distributed Machine Learning ...

Contribution 3: Distributed Mechanisms for Parameter Adaptation

17

Page 18: KungFu: Making Training in Distributed Machine Learning ...

Issues When Adapting System Parameters

18

Problem: Parameter adaptation affects state consistency.

[Figure: dataflow for averaging GNS (gns → allreduce → avg), with the value of # workers (here 10) baked into the graph]

Other system parameters:
• Worker ranks
• Communication topology
• …

Adapting system parameters therefore often requires a system restart.

Page 19: KungFu: Making Training in Distributed Machine Learning ...

Distributed Mechanism for Parameter Adaptation

19

Idea: Decouple system parameters from dataflow state.

1. Expose system parameters as computational operators (e.g. a size_op in the gns → allreduce → avg dataflow).
2. Update the worker membership using collective communication.

[Figure: the KungFu communication layer propagates a parameter update to the dynamic worker membership tables on each worker]

Online parameter adaptation is consistent and fast
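A toy sketch of this decoupling (names and structure are assumptions, not KungFu's code): the worker count is read from a membership table at run time through a size operator, so resizing updates the table rather than invalidating the averaging dataflow.

    # Toy sketch (not KungFu's code): a size operator reads the current worker
    # membership at run time, so the averaging dataflow survives a resize.
    class Membership:
        def __init__(self, workers):
            self.workers = list(workers)     # updated via collective communication on resize

        def size(self):
            return len(self.workers)

    def average_gns(gns_sum, membership):
        # The divisor comes from the membership table, not from a constant
        # captured when the dataflow graph was built.
        return gns_sum / membership.size()

    membership = Membership(range(4))
    membership.workers.append(4)             # a worker joins; no graph rebuild needed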

Page 20: KungFu: Making Training in Distributed Machine Learning ...

Experimental Evaluation

20

Page 21: KungFu: Making Training in Distributed Machine Learning ...

How Effectively Does KungFu Adapt?

21

GNS policy, CIFAR-10 ResNet, 4 GPUs

[Plot: validation accuracy (%) vs. time (seconds); legend: SBS (small batch size) Accuracy]

Small batch size reaches high accuracy, but converges slowly

Small batch size

Page 22: KungFu: Making Training in Distributed Machine Learning ...

How Effectively Does KungFu Adapt?

22

[Plot: validation accuracy (%) vs. time (seconds); legend: SBS Accuracy, LBS (large batch size) Accuracy]

Large batch size finishes quickly, but accuracy suffers

Large batch size

GNS policy, CIFAR-10 ResNet, 4 GPUs

Page 23: KungFu: Making Training in Distributed Machine Learning ...

How Effectively Does KungFu Adapt?

23

[Plot: validation accuracy (%) and batch size (0–4096) vs. time (seconds); legend: SBS Accuracy, LBS Accuracy, KungFu Batch Size]

GNS predicts the effective batch size should increase during training

Batch size over time

GNS policy, CIFAR-10 ResNet, 4 GPUs

Page 24: KungFu: Making Training in Distributed Machine Learning ...

How Effectively Does KungFu Adapt?

24

[Plot: validation accuracy (%) and batch size (0–4096) vs. time (seconds); legend: SBS Accuracy, LBS Accuracy, KungFu Accuracy, KungFu Batch Size]

Embedded monitoring and online adaptation are important to Adaptation Policies

Dynamic batch size

GNS policy, CIFAR-10 ResNet, 4 GPUs

Page 25: KungFu: Making Training in Distributed Machine Learning ...

What is KungFu’s Distributed Performance?

25

Compare KungFu with a state-of-the-art library (Horovod).

32 VMs, K-80 GPU, ImageNet ResNet

[Plot: per-VM throughput vs. number of VMs (1, 8, 16, 32) for Horovod and KungFu, with a 52% gap at 32 VMs]

Asynchronous collective communication enables KungFu to scale better.

Page 26: KungFu: Making Training in Distributed Machine Learning ...

Conclusions: KungFu

26

KungFu @ GitHub: https://github.com/lsds/KungFu

KungFu makes distributed machine learning adaptive:
- Current systems have no unified mechanism for adaptation

1. Adaptation policies that realise complex adaptation
2. Embedding monitoring inside dataflow, via asynchronous collective communication
3. Distributed mechanisms for consistent adaptation, via dynamic worker membership tables