Title: Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access
- AWS P3.8xlarge instance
- GPU: NVIDIA V100 (16GB) x 4ea
- Memory: 244GB DDR4 DRAM
- CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
- NVLink 2.0
- PCIe 3.0
For the EuroSys '23 Artifact Evaluation Committee: if you do not have a machine that satisfies these requirements, we can provide the AWS instance we used. Let us know through the HotCRP portal.
- Operating system: Ubuntu 18.04
- CUDA v11.3
- CuDNN v8.2.1
- ProtoBuf v3.11.4
- Boost v1.65
- TBB (Threading Building-Blocks) v2017_U7
- PyTorch v1.9
- Matplotlib v3.3.4 (for generating graphs)
- build-essential
$ sudo apt update
$ sudo apt install build-essential
- C++ Library on Ubuntu
$ sudo apt-get install libtbb-dev libboost1.65-all-dev
- CUDA Toolkit v11.3 & CuDNN v8.2.1
DeepPlan works with the PyTorch DL framework. To run PyTorch, you must first install its dependencies, CUDA and CuDNN.
To install the CUDA Toolkit, see this link: Download Installer for Linux Ubuntu 18.04 x86_64
To install the CuDNN Library, see this link: Installation Guide and CuDNN Archive
- ProtoBuf v3.11.4
DeepPlan uses the ProtoBuf library to serialize and deserialize plans, so ProtoBuf is required to build DeepPlan. To install ProtoBuf, see the following link: https://github.com/protocolbuffers/protobuf/blob/main/src/README.md
DeepPlan requires a modified PyTorch (v1.9) framework. To simplify applying the code changes to the framework, we provide a patch file for DeepPlan. The following commands apply the patch to PyTorch v1.9.0.
$ cd $HOME
$ # Let's first clone the DeepPlan repository and set the path
$ git clone https://github.com/csl-ajou/DeepPlan/
$ DEEPPLAN_HOME=$HOME/DeepPlan
$
$ # Let's download the PyTorch v1.9.0 package and set the path
$ git clone --recursive https://github.com/pytorch/pytorch -b v1.9.0
$ PYTORCH_HOME=$HOME/pytorch
$
$ cd $PYTORCH_HOME
$ patch -p1 < $DEEPPLAN_HOME/pytorch.patch
After applying the patch file, compile PyTorch.
$ python3 setup.py install
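Once the build finishes, it can be worth confirming that the interpreter picks up a 1.9 build rather than another PyTorch installed earlier. The following is a minimal Python sketch, not part of DeepPlan: the helper name `matches_release` is hypothetical, and it assumes that a source build may report a version string with a local suffix after a `+` sign, which the helper ignores.

```python
def matches_release(installed, expected):
    """Return True if `installed` belongs to the `expected` release series.

    Anything after a '+' (a local build suffix) is ignored, and only as many
    dot-separated components as `expected` contains are compared.
    """
    core = installed.split("+", 1)[0]
    wanted = expected.split(".")
    return core.split(".")[: len(wanted)] == wanted


if __name__ == "__main__":
    try:
        import torch  # only succeeds if PyTorch is installed for this interpreter
    except ImportError:
        print("PyTorch is not importable from this interpreter.")
    else:
        status = "OK" if matches_release(torch.__version__, "1.9") else "unexpected version"
        print(torch.__version__, status)
```

If the check reports an unexpected version, a previously installed PyTorch package is probably shadowing the patched build.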
In addition to PyTorch, install the required pip modules using the commands below, from DeepPlan's home directory.
$ cd $DEEPPLAN_HOME
$ pip3 install -r requirements.txt
After successfully patching and building the PyTorch framework, we are ready to build DeepPlan to generate inference execution plans and the DL server prototype.
$ cd $DEEPPLAN_HOME
$ mkdir build
$ cd build
$ cmake -DCMAKE_PREFIX_PATH=$PYTORCH_HOME ..
$ make
You need to create a plan for a given model. In this tutorial, our target is ResNet50.
The Python module plan.py already imports the pre-trained models evaluated in the paper, so you can simply type the name of the model.
# Create Plan
$ cd $DEEPPLAN_HOME
$ mkdir -p plan_repo
$ python3 plan.py -m resnet50 -p plan_repo
# The plan generated by this command is saved in the plan_repo directory
If you want to take a look at generated plans (Table 3 in the paper), you can click the following links.
Once DeepPlan generates the execution plan for a given model, you can run model inference with the DeepPlan engine through the commands below, from DeepPlan's home directory.
Here, we use ResNet50 as an example. In this section, we describe how to run the five execution methods explained in our paper: Baseline (on-demand), PipeSwitch, DeepPlan (DHA), DeepPlan (PT), and DeepPlan (PT+DHA).
Before running the model inference, you have to set the PLAN_REPO environment variable, which specifies where plans are stored.
# The plan repository should be the same path that was specified above when creating the plan
$ export PLAN_REPO=$DEEPPLAN_HOME/plan_repo
$ cd $DEEPPLAN_HOME
- Baseline (on-demand)
$ ./build/benchmark -m resnet50 -e demand
You should see output similar to the following:
Benchmarking Inference resnet50
Average Latency : 17.7038 ms
- PipeSwitch (Bai et al. OSDI 2020)
$ ./build/benchmark -m resnet50 -e pipeline
You should see output similar to the following:
Benchmarking Inference resnet50
Average Latency : 11.981 ms
- DeepPlan (DHA)
$ ./build/benchmark -m resnet50 -e deepplan
You should see output similar to the following:
Benchmarking Inference resnet50
Average Latency : 11.2345 ms
- DeepPlan (PT)
$ ./build/benchmark -m resnet50 -e pipeline -d 0 2 # the -d option specifies the devices used for loading
You should see output similar to the following:
Benchmarking Inference resnet50
Average Latency : 9.39064 ms
- DeepPlan (PT+DHA)
$ ./build/benchmark -m resnet50 -e deepplan -d 0 2 # the -d option specifies the devices used for loading
You should see output similar to the following:
Benchmarking Inference resnet50
Average Latency : 8.36423 ms
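When comparing the runs above, it can help to extract the latency figure from each run's output programmatically. The following is a minimal Python sketch, assuming the benchmark prints a line of the form `Average Latency : <number> ms` as in the sample outputs; the helper name `parse_average_latency` is hypothetical and not part of DeepPlan.

```python
import re
from typing import Optional

# Matches lines such as "Average Latency : 17.7038 ms". The format is taken
# from the sample outputs above; the real tool's output may differ.
_LATENCY_RE = re.compile(r"Average Latency\s*:\s*([0-9]+(?:\.[0-9]+)?)\s*ms")


def parse_average_latency(output: str) -> Optional[float]:
    """Return the average latency in milliseconds, or None if no such line exists."""
    match = _LATENCY_RE.search(output)
    return float(match.group(1)) if match else None


sample = "Benchmarking Inference resnet50\nAverage Latency : 17.7038 ms\n"
print(parse_average_latency(sample))  # 17.7038
```

Returning `None` instead of raising keeps the helper usable in a loop over several log files, where some runs may have failed before printing a result.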
To reproduce the experimental results presented in the paper, we need the model plans. To simplify creating them, we provide the create_all_plans.sh shell script, which generates all model plans used in the experiments.
$ cd $DEEPPLAN_HOME/scripts
$ mkdir -p $DEEPPLAN_HOME/plan_repo/V100
$ export PLAN_REPO=$DEEPPLAN_HOME/plan_repo/V100
$ source create_all_plans.sh # the plan repository is created in PLAN_REPO path.
For all shell scripts, the PLAN_REPO variable, which points to the plan repository, must be set.
We provide experiment scripts for Figures 10, 12, 13, and 14. Run run.sh in the $DEEPPLAN_HOME/scripts/fig# directory, and the result will be logged in the same directory. If the Matplotlib library is installed on your machine, the graph will be drawn in fig#.pdf.
We evaluate the inference latency with a single batch for Baseline (on-demand), PipeSwitch, DeepPlan (DHA), DeepPlan (PT), and DeepPlan (PT+DHA). The results are normalized to Baseline (on-demand).
$ cd $DEEPPLAN_HOME/scripts/fig10
$ source run.sh
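Normalizing to the baseline means dividing each method's latency by the on-demand latency, so the baseline becomes 1.0 and lower values are better. The following is a minimal Python sketch of that arithmetic. The function name is hypothetical, and the inputs are the ResNet50 sample latencies from the tutorial commands in this README, used here only as placeholders; they are not the data behind Figure 10.

```python
def normalize_to_baseline(latencies_ms, baseline_key="demand"):
    """Divide every latency by the baseline latency (baseline maps to 1.0)."""
    baseline = latencies_ms[baseline_key]
    if baseline <= 0:
        raise ValueError("baseline latency must be positive")
    return {name: value / baseline for name, value in latencies_ms.items()}


# Keys mirror the benchmark arguments used in the tutorial (illustrative only).
samples = {
    "demand": 17.7038,
    "pipeline": 11.981,
    "deepplan": 11.2345,
    "pipeline -d 0 2": 9.39064,
    "deepplan -d 0 2": 8.36423,
}

for name, ratio in normalize_to_baseline(samples).items():
    print(f"{name:>16}: {ratio:.3f}")
```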
We perform this experiment on a four-GPU server in an AWS instance. This experiment measures the 99th-percentile latency, goodput, and cold-start for BERT-Base while increasing the number of model instances concurrently running on the GPUs.
$ cd $DEEPPLAN_HOME/scripts/fig12
$ source run.sh
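For readers unfamiliar with these metrics, the following Python sketch shows one common way to compute a 99th-percentile latency and goodput from per-request latencies. It is an illustration only: the nearest-rank percentile method, the definition of goodput as requests completed within a latency SLO per second, and all names and numbers are assumptions, not taken from DeepPlan's scripts or the paper.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of `values` using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def goodput(latencies_ms, slo_ms, duration_s):
    """Requests per second that finished within the SLO (assumed definition)."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    within_slo = sum(1 for latency in latencies_ms if latency <= slo_ms)
    return within_slo / duration_s


# Synthetic example: 100 requests over 10 seconds with two slow outliers.
latencies = [10.0] * 98 + [50.0, 200.0]
print(percentile_nearest_rank(latencies, 99))             # 50.0
print(goodput(latencies, slo_ms=100.0, duration_s=10.0))  # 9.9
```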
This experiment is similar to the experiment above (Figure 12), except that the evaluation model is changed from BERT-Base to BERT-Large and GPT2.
$ cd $DEEPPLAN_HOME/scripts/fig13
$ source run.sh
This experiment is also performed on a four-GPU server in an AWS instance. The experiments above (Figure 12 and Figure 13) run with a synthetic trace, whereas this experiment runs with a real-world trace derived from Microsoft Azure Functions. In this experiment, we evaluate three workloads of three hours each (9 hours in total).
To run this experiment, you need to prepare the Azure trace dataset: https://github.com/Azure/AzurePublicDataset/blob/master/AzureFunctionsDataset2019.md
The following commands download the Azure trace dataset.
$ cd $DEEPPLAN_HOME/scripts
$ source download_azure_trace_dataset.sh
# The `AZURE_TRACE_DIR` variable must be set so that the client can find the trace files
$ export AZURE_TRACE_DIR=$DEEPPLAN_HOME/scripts/azure-functions
$ cd $DEEPPLAN_HOME/scripts/fig14
$ source run.sh
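Because this experiment runs for several hours, it may be worth checking the environment before launching it. The following is a minimal Python sketch of such a pre-flight check, assuming only that `PLAN_REPO` and `AZURE_TRACE_DIR` must each point to an existing directory; the helper name `missing_directories` is hypothetical and not part of DeepPlan.

```python
import os


def missing_directories(var_names, environ=None):
    """Return the variable names that are unset or do not point to a directory."""
    env = os.environ if environ is None else environ
    problems = []
    for name in var_names:
        path = env.get(name)
        if not path or not os.path.isdir(path):
            problems.append(name)
    return problems


if __name__ == "__main__":
    bad = missing_directories(["PLAN_REPO", "AZURE_TRACE_DIR"])
    if bad:
        print("Unset or not a directory: " + ", ".join(bad))
    else:
        print("Environment looks ready.")
```

The check does not validate the contents of either directory; it only catches the case where a variable was never exported in the current shell.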