
Bayesian Neural Network (BNN) Distributed Training

License: MIT

This repository contains code for performing distributed training of Bayesian Neural Network (BNN) models at scale on high-performance computing clusters such as Theta at the ALCF. Its main purpose is to serve as a tutorial for getting started with distributed training of BNNs on HPC clusters. In addition, an advanced model is included in the repository. This model is associated with an ADSP project for estimating gravitational-wave parameters using a combination of conventional neural network and Bayesian neural network layers. Its dataset is available on Theta and restricted to mmadsp users only; the code is provided for demonstration purposes. For further details about the ADSP, contact Argonne ALCF support.

The BNN models are implemented using the TensorFlow Probability library. Data-parallel distributed training is performed using Horovod.
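As a minimal sketch of the usual Horovod data-parallel pattern (illustrative only, not the repository's exact training script, and using toy stand-in data), each rank wraps its optimizer with Horovod's DistributedOptimizer and broadcasts the initial weights from rank 0:

    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    # Initialize Horovod: one MPI rank per process, launched via
    # horovodrun (locally) or aprun/mpirun (on the cluster).
    hvd.init()

    # Toy stand-in for MNIST-shaped data.
    x = np.random.rand(256, 28, 28).astype('float32')
    y = np.random.randint(0, 10, size=256)

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10),
    ])

    # Scale the learning rate by the worker count and wrap the optimizer
    # so gradients are allreduce-averaged across ranks every step.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
    model.compile(optimizer=opt,
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(
                      from_logits=True))

    model.fit(x, y, batch_size=32, epochs=1,
              # Broadcast rank 0's initial weights so all workers start
              # from an identical state.
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
              verbose=int(hvd.rank() == 0))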

Brief Background on BNNs:

Bayesian neural networks are one approach to capturing a network's uncertainty. The uncertainties in Bayesian modeling can be classified into two categories:

  1. Aleatoric uncertainty
  2. Epistemic uncertainty

Aleatoric uncertainty captures noise inherent in the observations/data, such as sensor measurement noise. Epistemic uncertainty is associated with the model parameters and can be reduced as more data is collected. Aleatoric uncertainty is further divided into homoscedastic and heteroscedastic uncertainty:

  • Homoscedastic uncertainty: uncertainty that stays constant for different inputs.
  • Heteroscedastic uncertainty: uncertainty that depends on the inputs to the model, with some inputs potentially having noisier outputs than others. Modeling it is particularly important for avoiding over-confident predictions.

Epistemic uncertainty is modeled by placing a prior distribution over the model parameters/weights and computing how those weights vary and converge during training; this is what Bayesian neural networks do. Aleatoric uncertainty, in contrast, is modeled by placing a distribution on the output of the model. Further details about Bayesian networks and variational inference for training can be found in the Jupyter notebook.
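As an illustrative sketch (assuming TensorFlow Probability's Keras layers; this is not necessarily the repository's exact model), both kinds of uncertainty can be expressed directly: Flipout layers place a learned Gaussian posterior over the weights (epistemic), while a DistributionLambda output layer places a Normal distribution with input-dependent scale on the predictions (heteroscedastic aleatoric):

    import tensorflow as tf
    import tensorflow_probability as tfp

    tfd = tfp.distributions

    model = tf.keras.Sequential([
        # Epistemic: DenseFlipout learns a Gaussian posterior over its
        # weights rather than point estimates.
        tfp.layers.DenseFlipout(32, activation='relu'),
        tfp.layers.DenseFlipout(2),
        # Aleatoric (heteroscedastic): the output is a Normal whose
        # scale is predicted from the input.
        tfp.layers.DistributionLambda(
            lambda t: tfd.Normal(loc=t[..., :1],
                                 scale=1e-3 + tf.math.softplus(t[..., 1:]))),
    ])

    # Maximize the data likelihood; the KL terms that the Flipout layers
    # add to model.losses complete the variational-inference objective.
    model.compile(optimizer='adam', loss=lambda y, rv_y: -rv_y.log_prob(y))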

Code Dependencies:

  • TensorFlow
  • TensorFlow Probability
  • Horovod

Dataset:

  • MNIST: hand-written digit dataset of 28x28 grayscale images in 10 classes, with 60000 training images and 10000 test images.

  • CIFAR-10: 60000 32x32 colour images in 10 classes, with 6000 images per class; 50000 training images and 10000 test images.
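Both datasets are bundled with tf.keras.datasets, so the splits above can be reproduced in a couple of lines (a quick check, not repository code):

    import tensorflow as tf

    # Both datasets download automatically on first use.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    (cx_train, cy_train), (cx_test, cy_test) = tf.keras.datasets.cifar10.load_data()
    print(x_train.shape)   # (60000, 28, 28)
    print(cx_train.shape)  # (50000, 32, 32, 3)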

Models:

  • Bayesian Neural Network with Flipout Convolutional Layers ('BNN_conv_flip')
  • Bayesian Neural Network with Non-Flipout Convolutional Layers ('BNN_conv_nonflip')
  • Bayesian Neural Network with Flipout Fully Connected Layers ('BNN_FC_flip')
  • Bayesian Neural Network with Non-Flipout Fully Connected Layers ('BNN_FC_nonflip')
  • Bayesian Neural Network with Flipout Convolutional Layers (3 VGG blocks) for CIFAR-10 data ('CIFAR10_BNN_model')
  • Convolutional Neural Network ('CNN_Conv')
  • Fully Connected Neural Network ('CNN_FC')
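The Flipout/non-Flipout distinction above maps onto different TensorFlow Probability layer classes; a brief sketch of the two variants (illustrative, assuming current tfp.layers names):

    import tensorflow_probability as tfp

    # Flipout variants decorrelate the per-example weight perturbations
    # within a mini-batch, which lowers gradient variance at a modest
    # extra compute cost.
    conv_flip = tfp.layers.Convolution2DFlipout(32, kernel_size=3,
                                                activation='relu')
    fc_flip = tfp.layers.DenseFlipout(10)

    # Non-Flipout variants draw a single weight sample per forward pass
    # via the plain reparameterization trick.
    conv_nonflip = tfp.layers.Convolution2DReparameterization(
        32, kernel_size=3, activation='relu')
    fc_nonflip = tfp.layers.DenseReparameterization(10)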

How to run the code:

  • Running on a local machine:

      horovodrun -n 2 -H localhost:2 python CNN_BNN_Model.py --flagfile=config_file.cfg
  • On the ALCF high-performance computing cluster (Theta):

    PPN=1         # MPI ranks per node (processes per node), e.g. 32, 16, or 8
    NUM_THDS=128
    
    aprun -n $((${COBALT_PARTSIZE} * ${PPN})) -N ${PPN} -cc depth -j 2 -d ${NUM_THDS} \
        -e OMP_NUM_THREADS=${NUM_THDS} -b python <path to the code>/CNN_BNN_Model.py \
        --flagfile=config_file.cfg
    
  • The submission script is provided in the repository.

  • Running the job with Balsam (Theta):

  • For more information about configuration options for running the code, use the help flag as follows:

      python CNN_BNN_Model.py --help

  • Example Results:

    • The time to train the BNN and the CNN with an increasing number of nodes is compared in Fig. 1 below.

    • The speed-up of the BNN and the CNN, computed from their training times, is compared in Fig. 2 below.

    • Training a Bayesian network means finding an optimal distribution over the trainable parameters, which is done using variational inference (VI). As training iterations progress, the weight posteriors converge; Fig. 3 shows an example with the weights initialized from a Gaussian prior.

    • Once the model is trained and the weight posteriors have converged, the model is used for inference. Inference is performed by running the model repeatedly (Monte Carlo iterations); the outputs form a predictive distribution, shown for 300 MC iterations with a fully connected BNN model (see the sketch below).
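A minimal sketch of such a Monte Carlo inference loop (mc_predict is a hypothetical helper, not a function from this repository; it assumes a Keras BNN whose variational layers resample weights on every call):

    import numpy as np

    # Hypothetical helper: variational/Flipout layers resample their
    # weights on every forward pass, so repeated calls trace out the
    # predictive distribution.
    def mc_predict(model, x, num_samples=300):
        preds = np.stack([model(x).numpy() for _ in range(num_samples)])
        # Mean as the point prediction; std as a simple uncertainty proxy.
        return preds.mean(axis=0), preds.std(axis=0)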

  • Research Articles: papers related to Bayesian neural networks:

    Papers for the gravitational-wave Bayesian model:

  • Additional Resources:

  • Contact

  • Acknowledgment

    This research was funded in part and used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This work describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the work do not necessarily represent the views of the U.S. DOE or the United States Government. Declaration of Interests: none.
