The Zeus library is a set of files that constitute a minimal yet versatile deep learning training/testing system. It is based on pytorch-lightning and pycg.
Up until now, Zeus has the following files: `zeus.py`, `train.py`, `test.py`, `overfit.py`, `models/base_model.py`, and `configs/default/zeus.yaml`.
```
python train.py <CONFIG> --logger_type <wandb/tb> --wname <NAME>
```
- By default, `logger_type` is `wandb`. If you don't provide `wname`, a random wandb name will be generated.
Other useful flags:

- `--fast_dev_run`: Dry run. Just run one batch. This will not create checkpoints.
- `--profiler`: Value can be `simple` or `advanced`. Will print a log when trainer fit finishes.
- `--limit_<train/val/test>_batches`: Limit the number of respective batches.
- `--max_epochs`: Limit the number of training epochs.
- `--debug`: If there is a NaN issue, use this flag and manually inspect the stack trace (see the sketch after this list). You cannot get a pdb interactive environment because the previous frames and data are not saved.
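For reference, pytorch's anomaly detection is the standard way to trace a NaN back to the op that produced it; whether `--debug` wraps exactly this mechanism is an assumption:

```python
import torch

# Hedged sketch: anomaly detection records forward-pass stack traces and
# raises on the first backward op that returns NaN, pointing at the culprit.
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)   # forward already produces NaN
y.backward()        # raises, with a trace pointing at the sqrt above
```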
To resume training, for wandb (note that checkpoints need to be on this machine!):

```
python train.py wdb:<WDB-URL> --resume
```
For tensorboard:

```
python train.py <TB-HPARAM>.yaml --resume
```
To test, for wandb:

```
python test.py none --ckpt <WDB-URL>:<CKPT-ID> -v
```

`-v` is automatically expanded to `--visualize`.
```
python overfit.py <CONFIG> \
    --interval 10 \
    --log_dir ../overfit/<LOG-FOLDER> \
    --log_plot -v
```
On the remote server, run:

```
python upload_ckpt.py <WDB-URL>:<CKPT-ID>
```
`<WDB-URL>` can be of 2 types:

- `<USER>/<WANDB-PROJECT>/<WANDB-RUN-ID>`
- `<USER>/<WANDB-PROJECT>/<WANDB-NAME>`: This is not unique; you will be prompted if more than one candidate exists.
`<CKPT-ID>` can be:

- `last`: last epoch
- `best`: best epoch
- `3`: the 3rd epoch
- `all`: all epochs
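For example, to visualize the best checkpoint of a run (the user, project, and run ID below are made up):

```
python test.py none --ckpt alice/zeus-demo/1a2b3c4d:best -v
```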
To run a hyperparameter sweep, first write a config file like `sweeps/*.yaml`. Then run on the local machine (master):

```
wandb sweep sweeps/<CONFIG>.yaml
```

Then, on the remote cluster, run:

```
CUDA_VISIBLE_DEVICES=... wandb agent <SWEEP-ID> --count 1
```

where `count` specifies how many jobs this agent should run. Agents ask the wandb server for jobs to run, and each job calls our training script with `--exec` set.
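A minimal sketch of such a sweep config; the hyperparameter and metric names are hypothetical, and the `command` stanza shows one plausible way the agent could end up passing `--exec`:

```yaml
program: train.py
method: random
metric:
  name: val_loss            # hypothetical metric name
  goal: minimize
parameters:
  learning_rate:            # hypothetical hyperparameter
    values: [0.0001, 0.001, 0.01]
command:                    # standard wandb macros; --exec appended per the docs above
  - ${env}
  - python
  - ${program}
  - ${args}
  - --exec
```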
Here are some configurations that you can tweak. Please create a file named `zeus_config.yaml` in the same folder as `train.py` to override the ones in `configs/default/zeus.yaml`.
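For example (the keys below are purely hypothetical placeholders; consult `configs/default/zeus.yaml` for the real ones):

```yaml
# zeus_config.yaml -- values here shadow configs/default/zeus.yaml
logger_type: tb        # hypothetical key: default logger backend
checkpoint_every: 5    # hypothetical key: epochs between checkpoints
```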
Depending on your needs, many different profiling strategies can be used.

The built-in profiler of pytorch is actually very useful! You can use it in the following two ways:

- Using `pycg.exp`: Annotate your function with `pycg.exp.pt_profile` or a code block with `pycg.exp.pt_profile_named(xxx)`. Then run the code with `PT_PROFILE=1/2` set, where 1 will only profile CPU function calls while 2 will also record cuda calls (usually unnecessary).
- Using `pytorch_lightning` in `tensorboard`: First annotate your code block with `record_function()` (a sketch follows this list), then run the script with a logging dir and the flag `--profiler pytorch`. Open tensorboard in the logging dir and the profile tool will appear.
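For illustration, a minimal sketch of a `record_function()` annotation; the function body is placeholder computation, not Zeus code:

```python
import torch
from torch.profiler import record_function

def training_step(points: torch.Tensor) -> torch.Tensor:
    # Each named range below shows up as a labeled span in the profile trace.
    with record_function("feature_extraction"):
        feat = points @ points.t()      # placeholder work
    with record_function("loss_computation"):
        loss = feat.relu().mean()       # placeholder work
    return loss
```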
First, use the decorator `@exp.mem_profile` from `pycg.exp` to annotate your function (the class decorator does not work very well yet). Then follow the 2-step approach:

- Set the environment variable `MEM_PROFILE=1,<TH>`. This will print a brief description.
- Set the environment variable `MEM_PROFILE=2`. This will print a long description.
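A sketch of the annotation step; the function body is placeholder work, and only the decorator and environment variables come from the description above (`0.1` is an assumed example value for `<TH>`):

```python
import torch
from pycg import exp

@exp.mem_profile  # reports GPU memory usage when MEM_PROFILE is set
def build_features(points: torch.Tensor) -> torch.Tensor:
    feat = points.cuda() @ torch.randn(3, 64, device="cuda")  # placeholder work
    return torch.relu(feat)
```

Then run, e.g., `MEM_PROFILE=1,0.1 python train.py ...`.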
If your train/val code is slow and you suspect there is a bottleneck function, find it using the following steps:

- Use the flags `--max_epochs 1 --limit_<train/val/test>_batches 20 --profiler simple` to run the model for a few iterations. At exit the program will print the time used for each pytorch-lightning step. Check which part is the bottleneck, and especially mind the data loading part.
- `py-spy` gives you a real-time display of your functions and their computation utility (Active%). You can start it by running `sudo env "PATH=$PATH" py-spy top --pid <Your program>`.
- `kernprof` gives you per-line execution time, and is very useful if you do not have many iterations in your code. To use it, add `@pycg.exp.profile` to your function and run `kernprof -l train.py ...`. When the program exits, it will write a `.lprof` results file and print how to inspect it (via `python -m line_profiler`).
- The pytorch profiler is the ultimate tool of choice: it outputs a chrome trace of all CPU/GPU functions and solves many problems effectively. Just uncomment the line in `train.py` (a sketch of what such a wrapper could look like follows this list). The trace json file can be loaded using Chrome's `chrome://tracing`. (Note: pytorch-lightning also provides a pytorch profiler, but the documentation is very vague.)
- The `exp.Timer` class is also useful if you want to be flexible.
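For reference, a hedged sketch of such a profiling wrapper; the actual commented-out line in `train.py` may differ, and `trainer`/`model` are assumed to be in scope:

```python
import torch

# Profile the whole fit loop and export a chrome trace viewable in
# chrome://tracing.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
) as prof:
    trainer.fit(model)
prof.export_chrome_trace("trace.json")
```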
Nvidia is developing many tools for better performance diagnosis. The old one is the `nvvp` visual profiler, and the new ones are Nsight Compute and Nsight Systems.

The tools are also bundled with CUDA. To debug a program, simply run:

```
sudo -E env PATH=$PATH LD_LIBRARY_PATH=$LD_LIBRARY_PATH /usr/local/cuda-11.6/bin/nsight-sys
```

Note that this has to run as root. Versions older than 2020.5 do not profile memory. In my experience, it's hard to debug using this tool. For memory issues the best way is still memlab. Memlab is accurate -- it just ignores some cuda context overhead (~1.6GiB on my machine and ~800MiB on a 1080).
Just add `--gpus 2` to enable DDP, or add `--accelerator dp` to use DP.

Remember, when using DDP, your dataloader's batch size should be `self.hparams.batch_size // self.trainer.world_size` to match the original training (see the sketch below).
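A minimal sketch of that division inside a LightningModule (`MyModel` and `self.train_set` are hypothetical names):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class MyModel(pl.LightningModule):
    def train_dataloader(self):
        # Each DDP process loads its own shard, so divide the configured
        # batch size by world_size to keep the global batch size unchanged.
        per_gpu_bs = self.hparams.batch_size // self.trainer.world_size
        return DataLoader(self.train_set, batch_size=per_gpu_bs, shuffle=True)
```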
Usually, pytorch tutorials use something like `mp.spawn()`, which resembles `--accelerator ddp_spawn` and, according to PL, has many limitations. Internally, `ddp` in PL uses `subprocess.Popen` instead of `multiprocessing`, so you will observe the entire script being run again, because `__name__` is not set to `__mp_main__` as it would be under `multiprocessing`.
Another hint: `multiprocessing` has two start modes, `fork` and `spawn`, neither of which will run `__main__` again, but the latter will re-import all packages (because it creates a fresh python interpreter instance). Pytorch cuda tensors require `spawn` to work.
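To see why the `if __name__ == "__main__"` guard matters under `spawn`, here is a minimal standalone sketch (not Zeus code):

```python
import torch.multiprocessing as mp

def worker(rank: int):
    print(f"hello from worker {rank}")

if __name__ == "__main__":
    # Under spawn, each child re-imports this module in a fresh interpreter;
    # without this guard, an unguarded top-level mp.spawn would run again there.
    mp.spawn(worker, nprocs=2)
```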