KungFu

Easy, adaptive and fast distributed machine learning.

Features

KungFu enables users to achieve fast and adaptive distributed machine learning. This is important because machine learning systems must cope with ever more complex models and increasingly complicated deployment environments. KungFu has the following unique features:

  • Simplicity: KungFu enables distributed training by adding only one line of code to the training program. KungFu is easy to deploy: it does not require partitioning resources as parameter servers do, nor heavyweight dependencies such as MPI as in Horovod.
  • Adaptive distributed training: KungFu provides many advanced distributed optimizers such as communication-efficient AD-PSGD and small-batch-efficient SMA to help you address the cases in which Synchronous SGD does not scale.
  • Monitoring: KungFu supports distributed SGD metrics such as gradient variance and gradient noise scale to help understand the training process with low overhead.
  • Online control: KungFu provides control operators such as barrier and resize_cluster to seamlessly reconfigure training, even in response to monitored metrics.
  • Extensibility: KungFu has a clean low-level API that allows an easy implementation of new distributed training, monitoring and control algorithms.

KungFu is fast and scalable. It exploits a high-performance implementation of communication, monitoring and control operators, and adopts a decentralized architecture. Please check out the performance of KungFu in the Benchmark section below.

Basic Usage

To use KungFu to scale out your TensorFlow training program, you simply need to make two changes:

  1. Wrap the optimizer in SynchronousSGDOptimizer or another distributed optimizer.

  2. Run distributed_initializer() after calling global_variables_initializer(). The distributed initializer synchronizes the initial variables on all workers.

import tensorflow as tf
from kungfu.tensorflow.v1.optimizers import SynchronousSGDOptimizer

# Build model...
loss = ...

# Add KungFu Distributed Optimizer
opt = tf.train.AdamOptimizer(0.01)
opt = SynchronousSGDOptimizer(opt)

# Make training operation
train_op = opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(opt.distributed_initializer()) # KungFu

    # Train your model for 10 steps.
    for step in range(10):
        sess.run(train_op)

See the TensorFlow Session and TensorFlow Keras examples for complete training programs.
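Switching to one of the adaptive optimizers listed under Features is intended to be a similarly small change. The sketch below is illustrative rather than copied from the official examples; it assumes that SynchronousAveragingOptimizer (SMA) and PairAveragingOptimizer are exported from the same kungfu.tensorflow.v1.optimizers module as SynchronousSGDOptimizer.

import tensorflow as tf
# Assumed import path: same module as SynchronousSGDOptimizer above.
from kungfu.tensorflow.v1.optimizers import SynchronousAveragingOptimizer

# Minimal stand-in model so the snippet is self-contained.
images = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])
logits = tf.matmul(images, tf.Variable(tf.zeros([784, 10])))
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# The only change from the Basic Usage example: wrap the optimizer in SMA.
# PairAveragingOptimizer can be substituted here for asynchronous training.
opt = tf.train.AdamOptimizer(0.01)
opt = SynchronousAveragingOptimizer(opt)
train_op = opt.minimize(loss)

# distributed_initializer() and the session loop are unchanged from Basic Usage.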

Run

Download the MNIST dataset (script) and run the following training script:

# Train a single-layer perceptron (SLP) model on the MNIST dataset using 4 CPUs for 10 epochs.
kungfu-run -np 4 python3 examples/mnist_slp.py --data-dir=./mnist

If you want to run this example on two machines (each with 8 GPUs), run the following on both machines:

# Assume the machines have NIC eth0 and their IPs are 192.168.0.1 and 192.168.0.2.
# Assume NUM_GPU_SLOTS=8, NUM_GPUS=16
kungfu-run -np $NUM_GPUS \
    -H 192.168.0.1:$NUM_GPU_SLOTS,192.168.0.2:$NUM_GPU_SLOTS -nic eth0 \
    python3 examples/mnist_slp.py  --data-dir=./mnist

Install

KungFu requires Python 3, CMake 3.5+, Golang 1.11+ and TensorFlow <=1.13.2.

# Install tensorflow CPU
pip3 install tensorflow==1.13.1
# pip3 install tensorflow-gpu==1.13.1 # Using GPUs

# Download the KungFu source code
git clone https://github.com/lsds/KungFu.git
cd KungFu

# Install KungFu
# export CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) # Parallel build.
pip3 install .

KungFu provides kungfu-run to launch a training program on a multi-GPU server. Use the following commands to build kungfu-run.

# Build and install kungfu-run in the given GOBIN directory.
GOBIN=$(pwd)/bin go install -v ./srcs/go/cmd/kungfu-run

# Check if kungfu-run is built
./bin/kungfu-run -help

Benchmark

We benchmark the performance of KungFu in a cluster that has 16 V100 GPUs hosted by 2 DGX-1 machines. The machines are interconnected by a 100 Gbps network. We benchmark the training throughput of ResNet-50, VGG16 and InceptionV3. These models represent different kinds of training workloads.

In the synchronous training case, we compare KungFu (SynchronousSGDOptimizer) with Horovod (0.16.1). Horovod uses OpenMPI 4.0.0. We evaluate a spectrum of batch sizes (from 256 to 4096) commonly used by SGD users; each batch is evenly shared by the 16 GPUs. KungFu outperforms Horovod on all tested models, in particular with small batch sizes, which significantly raise the frequency of synchronization.

[Figure: synchronous training throughput, KungFu vs. Horovod]

In the asynchronous training case, we compare KungFu (PairAveragingOptimizer) with TensorFlow parameter servers (1.13.1). We use the same range of batch sizes as above. KungFu exhibits better scalability as well.

[Figure: asynchronous training throughput, KungFu vs. TensorFlow parameter servers]

All benchmark scripts are available here.

Convergence

The synchronization algorithms (SynchronousSGDOptimizer, PairAveragingOptimizer and SynchronousAveragingOptimizer) can reach the same evaluation accuracy as Horovod. We validated this with the ResNet-50 and ResNet-101 models in the TensorFlow benchmark. You can also add your own KungFu distributed optimizer to the benchmark by adding one line of code; see here.

Contribute

Guideline
