
CSPDarkNet53 is a simple, highly modularized network architecture for image classification. Its design results in a homogeneous, multi-branch architecture with only a few hyper-parameters to set.

Paper: Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. "CSPNet: A New Backbone That Can Enhance Learning Capability of CNN." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshop), 2020.

The overall network architecture of CSPDarkNet53 is shown below:

Link
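The core idea is the cross-stage partial (CSP) connection: each stage's input is split into two parts, only one of which passes through the stage's convolutional blocks, and the two parts are merged again afterwards. The following is a minimal, illustrative MindSpore sketch of that pattern; the class and layer names here are ours, not the repository's actual definitions in src/cspdarknet53.py.

```python
import mindspore.nn as nn
import mindspore.ops as ops

class CSPBlock(nn.Cell):
    """Illustrative CSP stage: split, process one path, then merge."""
    def __init__(self, in_channels, out_channels, num_blocks):
        super().__init__()
        half = out_channels // 2
        # 1x1 convolutions split the stage input into two partial paths.
        self.shortcut_conv = nn.Conv2d(in_channels, half, kernel_size=1)
        self.main_conv = nn.Conv2d(in_channels, half, kernel_size=1)
        # Only the main path runs through the heavy convolutional blocks.
        self.blocks = nn.SequentialCell(
            [nn.Conv2d(half, half, kernel_size=3) for _ in range(num_blocks)])
        self.concat = ops.Concat(axis=1)
        # A transition layer fuses the two paths back together.
        self.transition = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def construct(self, x):
        shortcut = self.shortcut_conv(x)
        deep = self.blocks(self.main_conv(x))
        return self.transition(self.concat((shortcut, deep)))
```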

For details of the dataset used, refer to the paper.

- Dataset size: 125 GB, 1250k color images in 1000 classes
  - Train: 120 GB, 1200k images
  - Test: 5 GB, 50k images
- Data format: RGB images
  - Note: data is processed in src/dataset.py (see the loading sketch below)
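As a rough illustration of what src/dataset.py is responsible for, a minimal ImageNet-style input pipeline in MindSpore could look like the sketch below; the transform choices are typical defaults, not necessarily the repository's exact settings.

```python
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as vision

def create_dataset(data_dir, batch_size=64, training=True):
    """Build a batched ImageNet-style pipeline from a folder of class dirs."""
    dataset = ds.ImageFolderDataset(data_dir, shuffle=training)
    transforms = [
        vision.Decode(),      # raw bytes -> HWC RGB image
        vision.Resize(256),
        vision.CenterCrop(224),
        vision.HWC2CHW(),     # channel-first layout for the network
    ]
    dataset = dataset.map(operations=transforms, input_columns="image")
    return dataset.batch(batch_size, drop_remainder=training)
```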

Mixed precision training accelerates the training of deep neural networks by using both single-precision and half-precision data formats, while maintaining the accuracy achieved with single-precision training. It speeds up computation, reduces memory usage, and enables larger models or batch sizes to be trained on specific hardware.

For FP16 operators, if the input data type is FP32, the MindSpore backend will automatically run them at reduced precision. Users can check the reduced-precision operators by enabling the INFO log and then searching for 'reduce precision'.
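In MindSpore, mixed precision is typically switched on through the amp_level argument of the Model wrapper. The repository also ships its own helper in src/utils/auto_mixed_precision.py; the following is only a hedged sketch with a stand-in network.

```python
import mindspore.nn as nn
from mindspore import Model

net = nn.Dense(1024, 1000)  # stand-in; the real scripts build CSPDarkNet53
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
opt = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)

# amp_level="O3" casts the network to float16 while keeping the loss
# computation in float32, which is the usual choice on Ascend.
model = Model(net, loss_fn=loss, optimizer=opt, amp_level="O3")
```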

After installing MindSpore via the official website, you can start training and evaluation as follows:

- Running locally on Ascend

```bash
# run standalone training example with train.py
python train.py --is_distributed=0 --data_dir=$DATA_DIR > log.txt 2>&1 &

# run standalone training example with the shell script
bash run_standalone_train.sh [DEVICE_ID] [DATA_DIR] (optional)[PATH_CHECKPOINT]

# run distributed training example
bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_DIR] (optional)[PATH_CHECKPOINT]

# run evaluation example with eval.py
python eval.py --is_distributed=0 --per_batch_size=1 --pretrained=$PATH_CHECKPOINT --data_dir=$DATA_DIR > log.txt 2>&1 &

# run evaluation example with the shell script
bash run_eval.sh [DEVICE_ID] [DATA_DIR] [PATH_CHECKPOINT]
```

For distributed training, an HCCL configuration file in JSON format needs to be created in advance. Please follow the instructions at the link below: https://gitee.com/mindspore/models/tree/master/utils/hccl_tools
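For orientation, the Python side of a distributed run usually just initializes the communication group and enables data parallelism, roughly as sketched below (assuming standard MindSpore 1.x APIs; the rank table itself is consumed by the Ascend runtime via environment variables set in the launch script).

```python
from mindspore import context
from mindspore.communication.management import init, get_group_size
from mindspore.context import ParallelMode

# Configure graph-mode execution on Ascend, then join the HCCL group.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()
# Average gradients across all devices in data-parallel training.
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True,
                                  device_num=get_group_size())
```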

- Running on ModelArts

```text
# Train ImageNet 8p on ModelArts
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "is_distributed=1" in the default_config.yaml file.
#          Set "data_dir='/cache/data/ImageNet/train'" in the default_config.yaml file.
#          (optional) Set "checkpoint_url='s3://dir_to_pretrained/'" in the default_config.yaml file.
#          (optional) Set "pretrained='/cache/checkpoint_path/model.ckpt'" in the default_config.yaml file.
#          (optional) Set other parameters you need in the default_config.yaml file.
#       b. Add "enable_modelarts=True" on the web UI.
#          Add "is_distributed=1" on the web UI.
#          Add "data_dir='/cache/data/ImageNet/train'" on the web UI.
#          (optional) Add "checkpoint_url='s3://dir_to_pretrained/'" on the web UI.
#          (optional) Add "pretrained='/cache/checkpoint_path/model.ckpt'" on the web UI.
#          (optional) Add other parameters on the web UI.
# (2) (optional) Upload or copy your pretrained model to the S3 bucket if pretrained is set.
# (3) Upload a zip dataset to the S3 bucket. (You could also upload the original dataset, but it can be very slow.)
# (4) Set the code directory to "/path/cspdarknet53" on the web UI.
# (5) Set the startup file to "train.py" on the web UI.
# (6) Set the "Dataset path", "Output file path", and "Job log path" to your paths on the web UI.
# (7) Create your job.
#
# Eval ImageNet 1p on ModelArts
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "is_distributed=0" in the default_config.yaml file.
#          Set "per_batch_size=1" in the default_config.yaml file.
#          Set "data_dir='/cache/data/ImageNet/validation_preprocess'" in the default_config.yaml file.
#          Set "checkpoint_url='s3://dir_to_pretrained/'" in the default_config.yaml file.
#          Set "pretrained='/cache/checkpoint_path/model.ckpt'" in the default_config.yaml file.
#          (optional) Set other parameters you need in the default_config.yaml file.
#       b. Add "enable_modelarts=True" on the web UI.
#          Add "is_distributed=0" on the web UI.
#          Add "per_batch_size=1" on the web UI.
#          Add "data_dir='/cache/data/ImageNet/validation_preprocess'" on the web UI.
#          Add "checkpoint_url='s3://dir_to_pretrained/'" on the web UI.
#          Add "pretrained='/cache/checkpoint_path/model.ckpt'" on the web UI.
#          (optional) Add other parameters on the web UI.
# (2) Upload or copy your trained model to the S3 bucket.
# (3) Upload a zip dataset to the S3 bucket. (You could also upload the original dataset, but it can be very slow.)
# (4) Set the code directory to "/path/cspdarknet53" on the web UI.
# (5) Set the startup file to "eval.py" on the web UI.
# (6) Set the "Dataset path", "Output file path", and "Job log path" to your paths on the web UI.
# (7) Create your job.
```
The directory structure is as follows:

```text
.
└─cspdarknet53
  ├─README.md
  ├─ascend310_infer                 # application for Ascend 310 inference
  ├─model_utils
  │ ├─__init__.py                   # init file
  │ ├─config.py                     # parse arguments
  │ ├─device_adapter.py             # device adapter for ModelArts
  │ ├─local_adapter.py              # local adapter
  │ └─moxing_adapter.py             # moxing adapter for ModelArts
  ├─scripts
  │ ├─run_standalone_train.sh       # launch standalone training on Ascend (1p)
  │ ├─run_distribute_train.sh       # launch distributed training on Ascend (8p)
  │ ├─run_infer_310.sh              # shell script for Ascend 310 inference
  │ └─run_eval.sh                   # launch evaluation on Ascend
  ├─src
  │ ├─utils
  │ │ ├─__init__.py                 # module init file
  │ │ ├─auto_mixed_precision.py     # auto mixed precision
  │ │ ├─custom_op.py                # network operations
  │ │ ├─logging.py                  # custom logger
  │ │ ├─optimizers_init.py          # optimizer parameters
  │ │ ├─sampler.py                  # choose samples from the dataset
  │ │ └─var_init.py                 # weight initialization
  │ ├─__init__.py                   # parameter configuration
  │ ├─cspdarknet53.py               # network definition
  │ ├─dataset.py                    # data preprocessing
  │ ├─head.py                       # common head architecture
  │ ├─image_classification.py       # image classification
  │ ├─loss.py                       # customized CrossEntropy loss function
  │ └─lr_generator.py               # learning rate generator
  ├─mindspore_hub_conf.py           # MindSpore Hub configuration
  ├─default_config.yaml             # configurations
  ├─eval.py                         # evaluation script
  ├─postprogress.py                 # post-processing for Ascend 310 inference
  ├─export.py                       # export the network
  └─train.py                        # training script
```
Major parameters in default_config.yaml are:
```text
'data_dir'          # dataset directory
'pretrained'        # checkpoint file path
'is_distributed'    # whether to run distributed training
'per_batch_size'    # batch size on each device
'log_path'          # path for saving log files
```
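These values are parsed by model_utils/config.py, which merges the YAML defaults with command-line overrides. Conceptually it behaves like the sketch below; the function name and defaults here are illustrative, not the repository's exact code.

```python
import argparse
import yaml

def load_config(path="default_config.yaml"):
    """Read YAML defaults, then let command-line flags override them."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    parser = argparse.ArgumentParser(description="cspdarknet53")
    parser.add_argument("--data_dir", type=str, default=cfg.get("data_dir"))
    parser.add_argument("--pretrained", type=str, default=cfg.get("pretrained"))
    parser.add_argument("--is_distributed", type=int, default=cfg.get("is_distributed", 0))
    parser.add_argument("--per_batch_size", type=int, default=cfg.get("per_batch_size", 64))
    parser.add_argument("--log_path", type=str, default=cfg.get("log_path"))
    return parser.parse_args()
```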

Usage

You can start training using Python or shell scripts. The usage of the shell scripts is as follows:

- Ascend:

```bash
# distributed training (8p)
bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_DIR] (optional)[PATH_CHECKPOINT]
# standalone training
bash run_standalone_train.sh [DEVICE_ID] [DATA_DIR] (optional)[PATH_CHECKPOINT]
```

Notes: for RANK_TABLE_FILE, refer to Link, and device_ip can be obtained as described in Link. For large models like InceptionV3, it is better to export the environment variable export HCCL_CONNECT_TIMEOUT=600 to extend the HCCL connection-checking time from the default 120 seconds to 600 seconds. Otherwise, the connection may time out, since compilation time increases with model size.

The script binds processes to processor cores according to device_num and the total number of processor cores. If you do not want this behavior, remove the taskset operations in scripts/run_distribute_train.sh.

Launch

```bash
# training example
  python:
      python train.py --is_distributed=0 --pretrained=PATH_CHECKPOINT --data_dir=DATA_DIR > log.txt 2>&1 &

  shell:
      # distributed training example (8p)
      bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_DIR] (optional)[PATH_CHECKPOINT]
      # standalone training example
      bash run_standalone_train.sh [DEVICE_ID] [DATA_DIR] (optional)[PATH_CHECKPOINT]
```

Result

Training results will be stored in the example path. Checkpoints will be stored at ./checkpoint by default, and the training log will be redirected to ./log.txt.
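Checkpoint saving goes through MindSpore's standard callbacks; a sketch of the usual wiring follows (the step interval and maximum count are illustrative, see train.py for the values actually used).

```python
from mindspore.train.callback import CheckpointConfig, ModelCheckpoint

# Save a checkpoint every 1000 steps and keep at most 10 of them.
ckpt_cfg = CheckpointConfig(save_checkpoint_steps=1000, keep_checkpoint_max=10)
ckpt_cb = ModelCheckpoint(prefix="cspdarknet53", directory="./checkpoint",
                          config=ckpt_cfg)
# Passed to training as: model.train(epoch, dataset, callbacks=[ckpt_cb])
```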

Usage

You can start evaluation using Python or shell scripts. The usage of the shell scripts is as follows:

- Ascend:

```bash
bash run_eval.sh [DEVICE_ID] [DATA_DIR] [PATH_CHECKPOINT]
```

Launch

```bash
# eval example
  python:
      python eval.py --is_distributed=0 --per_batch_size=1 --pretrained=PATH_CHECKPOINT --data_dir=DATA_DIR > log.txt 2>&1 &

  shell:
      bash run_eval.sh [DEVICE_ID] [DATA_DIR] [PATH_CHECKPOINT]
```

The checkpoint passed to --pretrained is produced during the training process.
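Restoring the checkpoint for evaluation amounts to the standard MindSpore loading calls; a hedged sketch with a stand-in network and a hypothetical path:

```python
import mindspore.nn as nn
from mindspore import load_checkpoint, load_param_into_net

net = nn.Dense(1024, 1000)  # stand-in for the CSPDarkNet53 network
param_dict = load_checkpoint("path/to/model.ckpt")  # hypothetical path
load_param_into_net(net, param_dict)
net.set_train(False)  # switch to inference behavior for evaluation
```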

Result

The evaluation result will be stored in the example path; you can find the result in eval.log.

Before exporting the model, you must modify the config file default_config.yaml; the config item to modify is ckpt_file.

```bash
python export.py
```

Before inference, please refer to the MindSpore Inference with C++ Deployment Guide to set environment variables.

Before performing inference, the model must be exported first. AIR models can only be exported in an Ascend 910 environment; MINDIR models can be exported in any environment.
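For reference, export.py essentially reduces to a call to mindspore.export. The sketch below uses a stand-in network and assumes the usual 224x224 RGB input shape:

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor, export

net = nn.Conv2d(3, 16, kernel_size=3)  # stand-in for the restored network
# A dummy input fixes the graph's input shape (assumed 1x3x224x224 here).
dummy_input = Tensor(np.ones([1, 3, 224, 224]).astype(np.float32))
export(net, dummy_input, file_name="cspdarknet53", file_format="MINDIR")
```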

```bash
# Ascend 310 inference
bash run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [DEVICE_ID]
```

The inference result is saved in the current path; you can find results like the following in the acc.log file.

```text
Total data: 50000, top1 accuracy: 0.78458, top5 accuracy: 0.94254
```

Evaluation Performance

| Parameters                 | Ascend                                                          |
| -------------------------- | --------------------------------------------------------------- |
| Model Version              | CSPDarkNet53                                                    |
| Resource                   | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB; OS Euler2.8 |
| Uploaded Date              | 06/02/2021                                                      |
| MindSpore Version          | 1.2.0                                                           |
| Dataset                    | 1200k images                                                    |
| Batch_size                 | 64                                                              |
| Training Parameters        | default_config.yaml                                             |
| Optimizer                  | Momentum                                                        |
| Loss Function              | CrossEntropy                                                    |
| Outputs                    | probability                                                     |
| Loss                       | 1.78                                                            |
| Total time (8p)            | 14h                                                             |
| Checkpoint for Fine tuning | 217M (.ckpt file)                                               |
| Speed                      | 8p: 3977 imgs/sec                                               |
| Scripts                    | cspdarknet53 script                                             |

Inference Performance

| Parameters        | Ascend                                                          |
| ----------------- | --------------------------------------------------------------- |
| Model Version     | CSPDarkNet53                                                    |
| Resource          | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB; OS Euler2.8 |
| Uploaded Date     | 06/02/2021                                                      |
| MindSpore Version | 1.2.0                                                           |
| Dataset           | 50k images                                                      |
| Batch_size        | 1                                                               |
| Outputs           | probability                                                     |
| Accuracy          | acc=78.48% (TOP1), acc=94.21% (TOP5)                            |

Random seeds are set in train.py, ./src/utils/var_init.py, and ./src/utils/sampler.py.
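Seeding goes through MindSpore's standard mechanism; the line below is only illustrative, and the actual seed values live in the files listed above.

```python
from mindspore import set_seed

set_seed(1)  # illustrative value, not necessarily the one used by the scripts
```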

Please check the official homepage.