Paper: https://doi.org/10.1109/IISWC63097.2024.00012
Authors: Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park (KAIST)
This version (v0.1.0) is an update of the LLMServingSim artifact: it no longer drives an NPU simulator and uses a performance model instead. If you want to use the NPU simulator, refer to the artifact branch. Feature requests are welcome in the issue tab or via email.
We are preparing to integrate another NPU simulator in the next version.
git clone --recurse-submodules https://github.com/casys-kaist/LLMServingSim.git
cd LLMServingSim
Conda can be downloaded and installed as follows:
curl -O https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
bash Anaconda3-2024.06-1-Linux-x86_64.sh
Create and activate the environment from the provided environment file:
conda env create -p ./env -f ./environment.yml
conda activate ./env
Alternatively, create the environment manually and install the dependencies:
conda create -n env_name python=3.9
conda activate env_name
conda install conda-forge::libprotobuf=3.6.1
conda install conda-forge::cmake=3.15
conda install cctbx202208::boost-cpp=1.74.0
pip install -r requirements.txt
A common issue while building ASTRA-Sim is a protoc version error; if it occurs, see the solution at the bottom of this document.
cd astra-sim
./build/astra_analytical/build.sh
cd extern/graph_frontend/chakra
pip install .
cd ../../../..
The network and remote memory configs are now set automatically by inference_serving/config_generator.py; simply passing arguments to main.py is enough. See inference_serving/config_generator.py for more details.
Config & Dataset Path:
- Network config path: astra-sim/inputs/network/analytical/{config_name}.json
- Remote (host) memory config path: astra-sim/inputs/remote_memory/analytical/{config_name}.json
- Dataset path: dataset/{dataset_name}.tsv
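For illustration only, the sketch below shows roughly what the automatic config generation described above could look like; the JSON keys are hypothetical placeholders, and the actual format is defined by ASTRA-Sim and inference_serving/config_generator.py.

```python
import json
import os

def write_network_config(config_name, npu_num, npu_group, link_bw):
    # Illustrative sketch only: writes a network config into the path LLMServingSim uses.
    # The real key names come from ASTRA-Sim's analytical backend; the ones below are placeholders.
    path = f"astra-sim/inputs/network/analytical/{config_name}.json"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    config = {
        "npus-count": npu_num,           # hypothetical key
        "npu-groups": npu_group,         # hypothetical key
        "link-bandwidth-GBps": link_bw,  # hypothetical key
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2)

write_network_config("example", npu_num=16, npu_group=1, link_bw=256)
```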
Test Run
python main.py --model_name 'gpt3-6.7b' --hardware 'RTX3090' --npu_num 1 --npu_group 1 --npu_mem 40 \
--local_bw 1024 --remote_bw 512 --link_bw 256 --fp 16 --block_size 4 \
--dataset 'dataset/share-gpt-req100-rate10.tsv' --output 'output/example_run.csv' \
--verbose --req_num 10
or simply use
./run.sh
The current version only supports gpt3-6.7b and RTX3090. Instructions for adding a new model or hardware are given below.
Parameters | Supporting Options | Default Value | Notes |
---|---|---|---|
model_name | 'gpt3-6.7b' | 'gpt3-6.7b' | |
hardware | 'RTX3090' | 'RTX3090' | |
npu_num | Integer | 16 | |
max_batch | Integer | 0 | 0: no limit |
npu_group | Integer | 1 | |
npu_mem | Integer | 40 | GB |
local_bw | Integer | 1024 | GB/s |
remote_bw | Integer | 512 | GB/s |
link_bw | Integer | 256 | GB/s |
fp | Integer | 16 | bits |
block_size | Integer | 8 | |
dataset | Dataset Path | None | None: manually add requests in main.py |
output | Output CSV Path | None | None: no CSV output, only stdout |
gen | Flag | False | Skip initiation phase On/Off |
req_num | Integer | 100 | |
log_interval | Float | 0.5 | Throughput log interval (s) |
verbose | Flag | False | |
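As a usage example (not part of the repository), these parameters can also be swept programmatically, e.g. over npu_num:

```python
import subprocess

# Sweep the number of NPUs while reusing the other arguments from the test run above.
for npu_num in [1, 2, 4, 8]:
    subprocess.run([
        "python", "main.py",
        "--model_name", "gpt3-6.7b",
        "--hardware", "RTX3090",
        "--npu_num", str(npu_num),
        "--npu_group", "1",
        "--npu_mem", "40",
        "--dataset", "dataset/share-gpt-req100-rate10.tsv",
        "--output", f"output/sweep_npu{npu_num}.csv",
    ], check=True)
```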
The standard output shows which requests are being processed in each iteration of the simulator and displays the measured throughput at regular intervals.
Additionally, it provides a summary of the simulation at the end.
With the --verbose option, the log includes more detailed information, including memory loads and stores.
{output_filename}.csv contains the simulation result of each request. You can find an example in output/example_run.csv.
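The per-request CSV can be post-processed with standard tools; for example, a quick look with pandas (no column names are assumed here, so inspect the header of output/example_run.csv first):

```python
import pandas as pd

df = pd.read_csv("output/example_run.csv")
print(df.head())      # inspect the actual column names
print(df.describe())  # summary statistics over the per-request results

# If the CSV exposes a per-request latency column (the name is an assumption),
# tail latency could be computed like this:
# print(df["latency"].quantile(0.99))
```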
We used NVIDIA TensorRT-LLM to compile and run the model, and NVIDIA Nsight Systems to measure each layer's latency while it ran. You can follow the instructions here to generate a performance model, or use another tool to measure each layer's latency. Follow the format of the performance model in perf_model/example.csv or perf_model/RTX3090.csv.
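If you measure layer latencies with another tool, you can still emit a CSV of the same shape; the column names in the sketch below are hypothetical, so match them to perf_model/example.csv before using the file:

```python
import csv

# Hypothetical per-layer measurements: (model, hardware, layer, latency in ns).
# The real column layout is defined by perf_model/example.csv; adjust to match it.
measurements = [
    ("gpt3-6.7b", "RTX3090", "attention", 123456),
    ("gpt3-6.7b", "RTX3090", "mlp", 98765),
]

with open("perf_model/my_hardware.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "hardware", "layer", "latency_ns"])  # hypothetical header
    writer.writerows(measurements)
```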
The current version supports the GPT model architecture as compiled by TensorRT-LLM. If the model architecture does not follow GPT, or the performance model was not generated by TensorRT-LLM, some code in LLMServingSim must be modified:
inference_serving/utils.py: function getConfig
Add your new model configuration (n_embd, n_layer, n_head, vocab_size).
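As a rough sketch (the real getConfig in inference_serving/utils.py may be structured differently), adding a model could look like this; the gpt3-6.7b numbers are the published GPT-3 6.7B settings, and the new entry is a made-up example:

```python
def getConfig(model_name):
    # Sketch only: the actual getConfig may use a different structure or extra fields.
    configs = {
        "gpt3-6.7b":    {"n_embd": 4096, "n_layer": 32, "n_head": 32, "vocab_size": 50257},
        # Add your new model here with its own architecture parameters (values below are made up).
        "my-new-model": {"n_embd": 2048, "n_layer": 24, "n_head": 16, "vocab_size": 50257},
    }
    return configs[model_name]
```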
inference_serving/memory_model.py: functions calculateSizes & getWeight
The calculateSizes function calculates the input, weight, and output tensor sizes for each specific layer; change it according to the model architecture. The getWeight function calculates the total model size by retrieving the weights from calculateSizes; change it to match the model architecture as well.
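As a rough sketch of the arithmetic involved, assuming a standard GPT block (QKV projection, attention output projection, and a 4x MLP) and ignoring biases and layer norms; the real calculateSizes/getWeight may account for more:

```python
def layer_weight_bytes(n_embd, fp_bits=16):
    # Approximate weight size of one GPT block in bytes.
    bytes_per_param = fp_bits // 8
    qkv = 3 * n_embd * n_embd            # Q, K, V projections
    attn_out = n_embd * n_embd           # attention output projection
    mlp = 2 * (n_embd * 4 * n_embd)      # MLP up- and down-projections
    return (qkv + attn_out + mlp) * bytes_per_param

def model_weight_bytes(n_embd, n_layer, vocab_size, fp_bits=16):
    # Total model size: per-layer weights plus the token embedding table.
    bytes_per_param = fp_bits // 8
    return n_layer * layer_weight_bytes(n_embd, fp_bits) + vocab_size * n_embd * bytes_per_param

# GPT-3 6.7B at FP16 comes out to roughly 12-13 GiB with this estimate.
print(model_weight_bytes(n_embd=4096, n_layer=32, vocab_size=50257) / 2**30)
```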
inference_serving/generate_trace.py: function synthsizeTrace
This is the main function that generates the trace for each iteration. It uses calculateSizes to retrieve the input, weight, and output tensor sizes for each layer, then stacks the layers in the trace according to the model architecture. While changing this function, keep three things in mind (a sketch follows the list):
- Make sure the ATTENTION layer is well separated for each request
- Make sure the i-th layer's output size matches the (i+1)-th layer's input size
- Make sure ALLREDUCE operations are well placed for synchronization
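The sketch below is not the actual synthsizeTrace implementation, just an illustration of how the three rules above can be respected when stacking layers; the layer tuples, the parameter names, and the op names other than ATTENTION and ALLREDUCE are hypothetical:

```python
def build_iteration_trace(requests, layers, npus_per_group):
    # `layers` is assumed to be a list of (name, input_size, weight_size, output_size)
    # tuples whose sizes come from calculateSizes.
    trace = []
    prev_output = None
    for name, in_size, w_size, out_size in layers:
        # (2) The i-th layer's output size must match the (i+1)-th layer's input size.
        assert prev_output is None or prev_output == in_size, "layer size mismatch"
        if name == "ATTENTION":
            # (1) Attention is emitted once per request so each request's KV cache stays separate.
            for req in requests:
                trace.append({"op": name, "request": req, "input": in_size,
                              "weight": w_size, "output": out_size})
        else:
            trace.append({"op": name, "input": in_size, "weight": w_size, "output": out_size})
        # (3) When a layer is split across several NPUs (tensor parallelism),
        #     its partial results must be synchronized with an ALLREDUCE.
        if npus_per_group > 1 and name in ("ATTENTION_PROJ", "MLP_DOWN"):  # hypothetical op names
            trace.append({"op": "ALLREDUCE", "size": out_size})
        prev_output = out_size
    return trace
```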
We provide a function to test your trace generation; see trace_test/ for more details.
If you use LLMServingSim for your research, please cite our paper:
@INPROCEEDINGS{10763697,
author={Cho, Jaehong and Kim, Minsu and Choi, Hyunmin and Heo, Guseul and Park, Jongse},
booktitle={2024 IEEE International Symposium on Workload Characterization (IISWC)},
title={LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale},
year={2024},
volume={},
number={},
pages={15-29},
keywords={Technological innovation;Program processors;Simulation;Large language models;Heuristic algorithms;Redundancy;Software algorithms;Inference algorithms;Software;System analysis and design;Large language model (LLM);Inference serving;Simulation infrastructure;Neural processing unit (NPU);Processing-in-memory (PIM);Heterogeneous system},
doi={10.1109/IISWC63097.2024.00012}}
If the error you hit while building ASTRA-Sim is similar to the following, you can use one of the solutions below.
/home/<user>/LLMServingSim/astra-sim/extern/graph_frontend/chakra/et_def/et_def.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
17 | #error This file was generated by an older version of protoc which is
| ^~~~~
/home/<user>/LLMServingSim/astra-sim/extern/graph_frontend/chakra/et_def/et_def.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
18 | #error incompatible with your Protocol Buffer headers. Please
| ^~~~~
/home/<user>/LLMServingSim/astra-sim/extern/graph_frontend/chakra/et_def/et_def.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
19 | #error regenerate this file with a newer version of protoc.
| ^~~~~
This method explicitly sets the conda environment for CMake to use.
- Activate the Conda Environment: First, activate the desired conda environment.
  conda activate your_env_name
- Set the CMAKE_PREFIX_PATH Environment Variable: Add the path of the activated conda environment to the CMAKE_PREFIX_PATH environment variable.
  export CMAKE_PREFIX_PATH=$CONDA_PREFIX:$CMAKE_PREFIX_PATH
Alternatively, you can set CMAKE_PREFIX_PATH automatically through conda activation scripts.
- Activate the Conda Environment: First, activate the conda environment you want to modify.
  conda activate your_env_name
- Navigate to the Environment's Activation Script Directory: The activation scripts are located in the etc/conda/activate.d directory within your conda environment. If this directory does not exist, create it along with the deactivation directory.
  mkdir -p $CONDA_PREFIX/etc/conda/activate.d
  mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d
- Create and Edit the Activation Script: Create a script named set_cmake_prefix.sh to set CMAKE_PREFIX_PATH when the environment is activated.
  nano $CONDA_PREFIX/etc/conda/activate.d/set_cmake_prefix.sh
  Add the following content to this file:
  #!/bin/bash
  export OLD_CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH
  export CMAKE_PREFIX_PATH=$CONDA_PREFIX:$CMAKE_PREFIX_PATH
- Create and Edit the Deactivation Script: Create a script named unset_cmake_prefix.sh to reset CMAKE_PREFIX_PATH when the environment is deactivated.
  nano $CONDA_PREFIX/etc/conda/deactivate.d/unset_cmake_prefix.sh
  Add the following content to this file:
  #!/bin/bash
  export CMAKE_PREFIX_PATH=$OLD_CMAKE_PREFIX_PATH
  unset OLD_CMAKE_PREFIX_PATH
- Set Script Permissions: Ensure the scripts are executable.
  chmod +x $CONDA_PREFIX/etc/conda/activate.d/set_cmake_prefix.sh
  chmod +x $CONDA_PREFIX/etc/conda/deactivate.d/unset_cmake_prefix.sh