This repository simulates block floating point arithmetic in software at a reasonable time cost (~2x the normal training time).
This repository features...
- Training various neural networks with block floating point
- Fully configurable training environment
- Fully configurable block floating point precision and group size for the forward pass, weight gradient, and local gradient
- Train on ImageNet / CIFAR100
- Simple dynamic precision control
- Saving checkpoints, logs, etc.
- Install Docker on the target machine.
- Clone this repository
- Build the docker image:
docker build . -t $(whoami)/bfpsim:latest
(It will take a while, so take a coffee break ☕.)
After creating the docker container, you should run TensorBoard in the background. To do so, move to the tensorboard directory and execute ./run.sh [External Port]. Make sure you are in the tensorboard directory. The recommended external port is 6006, but you can change it if you know what you are doing. The first launch will take a while, since the TensorBoard docker image is different from the main image.
If you are running this on a remote server, make sure you have opened the external port (using ufw, iptables, etc.), then enter [Remote Server IP]:[External Port] in your beloved internet browser.
If you are running this on a local machine, just type http://localhost:[External Port] and you're good to go. 😆
- Clone this repository
- Install requirements listed below
torch >= 1.9.1
torchvision >= 0.5.0
numba >= 0.53.1
matplotlib >= 3.4.2
einops >= 0.3.0
slack_sdk
tensorboard >= 2.7.0
(tensorboard version is not crucial, though)
It's possible to send training status updates to your own Slack workspace. If you want to set this up, follow the Slack bot tutorial to get a bot token, put the token in a separate file named ./slackbot.token, and set the --slackbot option to true.
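For reference, below is a minimal sketch of posting such an update with slack_sdk. The channel name and message text are placeholders, and this is not necessarily how the repository's slackbot code is structured.

```python
# Minimal sketch (assumption: channel name and message text are placeholders).
from slack_sdk import WebClient

with open("./slackbot.token") as f:   # token file named as in the text above
    token = f.read().strip()

client = WebClient(token=token)
client.chat_postMessage(channel="#training", text="ResNet18/CIFAR100: epoch 10 finished")
```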
Executing ResNet18 on CIFAR100, with FP32
docker run --rm --gpus '"device=0"' --cpus="8" --user "$(id -u):$(id -g)" --mount type=bind,source=/dataset,target=/dataset --shm-size 24G --workdir /app -v "$(pwd)":/app $(whoami)/bfpsim:latest python3 -u /app/cifar.py --mode train --model ResNet18 --dataset CIFAR100 --log True
Executing ResNet18 on CIFAR100, with FP24
docker run --rm --gpus '"device=0"' --cpus="8" --user "$(id -u):$(id -g)" --mount type=bind,source=/dataset,target=/dataset --shm-size 24G --workdir /app -v "$(pwd)":/app $(whoami)/bfpsim:latest python3 -u /app/cifar.py --mode train --model ResNet18 --dataset CIFAR100 --log True --bfp ResNet18_FB24
Executing ResNet18 on ImageNet (original code), with FP24
docker run --rm --gpus '"device=0"' --cpus="64" --user "$(id -u):$(id -g)" --mount type=bind,source=/dataset,target=/dataset --shm-size 24G --workdir /app -v "$(pwd)":/app $(whoami)/bfpsim:latest python3 -u /app/imagenet.py --arch resnet18 --bfp ResNet18_FB12LG24 --log True
If you are training on ImageNet, I recommend reducing the number of training epochs with the --epoch 60 option; otherwise it will take quite a long time. The learning rate is automatically adjusted based on the total number of training epochs (reduced at every 1/3 of the training schedule).
Changing '"device=0"'
will be change gpu to run. It is possible to use several gpus like '"device=0,1"'
, but I can't sure it will run properly.😂
Each run is automatically added to the ./runs/ folder. Visualization is also available in TensorBoard, as mentioned above.
Making your own configuration file in ./conf_net is not simple...
First, "default":
defines the default configuration of BFP for any convolution or fully-connected layer.
"default":{
"fw_bit":4,
"fi_bit":4,
"bwo_bit":4,
"bwi_bit":4,
"biw_bit":4,
"bio_bit":4,
"fw_dim":[1,24,3,3],
"fi_dim":[1,24,3,3],
"bwo_dim":[1,24,3,3],
"bwi_dim":[1,24,3,3],
"biw_dim":[1,24,3,3],
"bio_dim":[1,24,3,3]
},
Each key means the following:
- fw : (True/False) apply BFP to the weight on the forward pass
- fi : (True/False) apply BFP to the input on the forward pass
- fo : (True/False) apply BFP to the output on the forward pass
- bwo : (True/False) apply BFP to the output gradient while calculating the weight gradient
- bwi : (True/False) apply BFP to the input feature map while calculating the weight gradient
- bwg : (True/False) apply BFP to the weight gradient
- bio : (True/False) apply BFP to the output gradient while calculating the local gradient
- biw : (True/False) apply BFP to the weight while calculating the local gradient
- big : (True/False) apply BFP to the local gradient
- fw_bit : mantissa bit length (precision) of the weight on the forward pass
- fi_bit : mantissa bit length (precision) of the input on the forward pass
- fo_bit : mantissa bit length (precision) of the output on the forward pass
- bwo_bit : mantissa bit length (precision) of the output gradient while calculating the weight gradient
- bwi_bit : mantissa bit length (precision) of the input feature map while calculating the weight gradient
- bwg_bit : mantissa bit length (precision) of the weight gradient
- bio_bit : mantissa bit length (precision) of the output gradient while calculating the local gradient
- biw_bit : mantissa bit length (precision) of the weight while calculating the local gradient
- big_bit : mantissa bit length (precision) of the local gradient
- fw_dim : group dimension of the weight on the forward pass
- fi_dim : group dimension of the input on the forward pass
- fo_dim : group dimension of the output on the forward pass
- bwo_dim : group dimension of the output gradient while calculating the weight gradient
- bwi_dim : group dimension of the input feature map while calculating the weight gradient
- bwg_dim : group dimension of the weight gradient
- bio_dim : group dimension of the output gradient while calculating the local gradient
- biw_dim : group dimension of the weight while calculating the local gradient
- big_dim : group dimension of the local gradient
The group dimension is provided as a list like [x,y,z,w]. Examples are shown below.
- Group size of 8, along the input channel direction:
[1,8,1,1]
- Group size of 9, grouped by kernel (weight):
[1,1,3,3]
- Group size of 216, as used in FlexBlock:
[1,24,3,3]
If the tensor size is not evenly divisible by the group size, the leftover elements are gathered into one smaller group. For example, if the input channel length is 53 (which is weird, though) and you group by 8, you get 7 groups, but the last group has 5 elements instead of 8. 😋
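To make the grouping concrete, below is a small NumPy sketch of block floating point over one dimension: each group shares the exponent of its largest element, and mantissas are rounded to a few bits. This is only an illustration (the function name, rounding, and clipping details are assumptions, not the repository's implementation); the real code groups 4-D tensors according to the *_dim settings, but the idea is the same.

```python
import numpy as np

def bfp_quantize_1d(x, group_size=8, mantissa_bits=4):
    """Illustrative sketch only: quantize a 1-D array in blocks that share one exponent."""
    out = np.zeros_like(x, dtype=np.float32)
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]           # a trailing group may be smaller
        max_abs = np.max(np.abs(group))
        if max_abs == 0:
            continue                                  # all-zero group stays zero
        shared_exp = np.floor(np.log2(max_abs))       # exponent of the largest element
        scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
        mant = np.clip(np.round(group / scale), -(2 ** mantissa_bits), 2 ** mantissa_bits)
        out[start:start + group_size] = mant * scale
    return out

# Small values sharing a group with a large one get flushed to zero; this loss is
# the "zero-setting error" used by the dynamic precision control described later.
print(bfp_quantize_1d(np.array([0.9, 0.01, -0.4, 0.003, 0.6], dtype=np.float32)))
```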
If you manually add an entry for a layer's name with "type":"default", that layer will not be converted to BFPConv2d and will run as a normal convolution instead. Just make sure to type the name correctly, and put net.
in front of the layer's name, as in the example below.
"net.conv1":{
"type":"default"
},
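Roughly speaking, the per-layer lookup behaves like the sketch below; the function name and the inlined config are illustrative, not the actual code in this repository.

```python
def select_layer_conf(conf, layer_name):
    """Illustrative: a layer-specific entry overrides "default"; an entry with
    "type": "default" means the layer keeps the ordinary (non-BFP) convolution."""
    entry = conf.get(layer_name, conf["default"])
    if entry.get("type") == "default":
        return None        # leave this layer as a plain convolution
    return entry           # convert this layer to BFPConv2d with these settings

conf = {
    "default": {"fw_bit": 4, "fi_bit": 4, "fw_dim": [1, 24, 3, 3]},   # trimmed for brevity
    "net.conv1": {"type": "default"},
}
print(select_layer_conf(conf, "net.conv1"))            # None -> normal convolution
print(select_layer_conf(conf, "net.layer1.0.conv1"))   # falls back to "default"
```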
Adding the optional argument --do
runs the dynamic optimizer with the simple method described in the original paper, FlexBlock.
docker run --rm --gpus '"device=5"' --cpus="64" --user "$(id -u):$(id -g)" --mount type=bind,source=/dataset,target=/dataset --shm-size 24G --workdir /app -v "$(pwd)":/app $(whoami)/bfpsim:latest python3 -u /app/imagenet.py --arch resnet18 --bfp ResNet18_FB12LG24 --do Simple/0.1/0.2/50/1/5 --do-color False
Executing ResNet18 on CIFAR100, with simple dynamic precision control
docker run --rm --gpus '"device=3"' --cpus="8" --user "$(id -u):$(id -g)" --mount type=bind,source=/dataset,target=/dataset --shm-size 24G --workdir /app -v "$(pwd)":/app $(whoami)/bfpsim:latest python3 -u /app/cifar.py --mode train --model ResNet18 --dataset CIFAR10 --log True --bfp ResNet18_FB24 --do Simple/0.6/0.8/0/1 --do-color False
Simple/0.6/0.8/0/1
Each field means the following (a sketch of the control logic follows the list):
- Simple : the control method
- 0.6 : zero-setting error threshold for decreasing the precision
- 0.8 : zero-setting error threshold for increasing the precision
- 0 : keep the precision from the provided file fixed for this many steps at the start of training
- 1 : keep the precision fixed for this many steps after a precision change
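The control rule itself is simple enough to restate in a few lines; the sketch below is only an illustration of the thresholds above (the bit-width bounds and the exact definition of the zero-setting error are assumptions, and the repository's implementation may differ).

```python
def adjust_precision(zero_setting_error, mantissa_bits, lower=0.6, upper=0.8,
                     min_bits=2, max_bits=8):
    """Illustrative "Simple" controller: a small zero-setting error means precision
    can be lowered; a large one means precision should be raised."""
    if zero_setting_error < lower:
        return max(min_bits, mantissa_bits - 1)   # decrease precision
    if zero_setting_error > upper:
        return min(max_bits, mantissa_bits + 1)   # increase precision
    return mantissa_bits                          # otherwise keep it unchanged
```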
If you enable the --do-color
option, the console window will shine like a rainbow :) and show the zero-setting error of each weight / weight gradient / local gradient. (Recommended; I really put some work into making it look like a hacker's console 😎)
The -tc
option loads a training configuration file from ./conf_train
. It only works with cifar.py
; you can preset various arguments in the file so you don't have to type them manually.
Execution example is shown below.
docker run --rm --gpus '"device=3"' --cpus="8" --user "$(id -u):$(id -g)" --mount type=bind,source=/dataset,target=/dataset --shm-size 24G --workdir /app -v "$(pwd)":/apwhoami)/bfpsim:latest python3 -u /app/cifar.py --mode train -tc ResNet18_CIFAR100_Mixed
By writing bfp-layer-conf-dict
and optimizer-dict
like a Python dict, you can change arguments over the training epochs. If you want to change the model's precision, make sure to provide scheduler-step as well: the learning rate scheduler has to be advanced to match the training status.
"bfp-layer-conf-dict":{
"0":"ResNet18_FB16",
"20":"ResNet18_FB12LG16",
"180":"ResNet18_FB16"
},
"optimizer-dict":{
"20":{
"scheduler-step":20
},
"180":{
"scheduler-step":180
}
}
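Conceptually, the training loop consults these dictionaries at every epoch boundary, switching the BFP configuration and fast-forwarding the LR scheduler. The sketch below illustrates the idea; load_bfp_layer_conf and apply_epoch_config are hypothetical names, not the repository's API.

```python
def load_bfp_layer_conf(name):
    # Hypothetical placeholder: the real code would rebuild the BFP layers
    # from the named configuration in ./conf_net.
    print(f"switching BFP layer configuration to {name}")

def apply_epoch_config(epoch, train_conf, scheduler):
    """Illustrative: react to entries keyed by the current epoch."""
    key = str(epoch)
    if key in train_conf.get("bfp-layer-conf-dict", {}):
        load_bfp_layer_conf(train_conf["bfp-layer-conf-dict"][key])
    opt = train_conf.get("optimizer-dict", {}).get(key, {})
    if "scheduler-step" in opt:
        for _ in range(opt["scheduler-step"]):
            scheduler.step()   # advance the (re-created) scheduler to match progress
```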
In fact, the code for setting the training config is quite old; I have no idea why it still works. 🤤
Needs some organizing... (not ready to be opened yet)
FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block Floating Point Support