LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale

Publication

Paper: https://doi.org/10.1109/IISWC63097.2024.00012

Authors: Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park (KAIST)


Build LLMServingSim

This version (v0.1.0) is an update of the LLMServingSim artifact: instead of the NPU simulator used in the artifact, it uses a performance model.

If you want to use the NPU simulator, refer to the artifact branch. Request additional features in the issue tab or via email.

We are preparing to integrate another NPU simulator in our next version.

1. Git clone

git clone --recurse-submodules https://github.com/casys-kaist/LLMServingSim.git
cd LLMServingSim

2. Conda install (optional)

Anaconda can be downloaded and installed as follows.

curl -O https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
bash Anaconda3-2024.06-1-Linux-x86_64.sh

3. Install dependencies (tested with Python 3.9, GCC/G++ 7.5.0)

Using conda environment.yml (Recommended)

conda env create -p ./env -f ./environment.yml
conda activate ./env

Clean conda install

conda create -n env_name python=3.9
conda activate env_name
conda install conda-forge::libprotobuf=3.6.1
conda install conda-forge::cmake=3.15
conda install cctbx202208::boost-cpp=1.74.0

pip install -r requirements.txt

4. Build ASTRA-Sim and Chakra

A common issue while building ASTRA-Sim is a protoc version mismatch. If you hit an error about the protoc version, see the Common Errors section below.

cd astra-sim
./build/astra_analytical/build.sh
cd extern/graph_frontend/chakra
pip install .
cd ../../../..

Run LLMServingSim

1. Set input configurations

The network and remote memory configs are now generated automatically by inference_serving/config_generator.py.

Simply passing arguments to main.py is enough. See inference_serving/config_generator.py for more details.

Config & Dataset Path:

  • Network config path: astra-sim/inputs/network/analytical/{config_name}.json
  • Remote(Host) memory config path: astra-sim/inputs/remote_memory/analytical/{config_name}.json
  • Dataset path: dataset/{dataset_name}.tsv
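
To sanity-check what config_generator.py produced for a run, a minimal Python sketch like the following (an illustration, not part of the repository) simply lists and prints the generated JSON files under the paths above:

import json
from pathlib import Path

# Print every auto-generated network and remote-memory config
# under the paths listed above; run from the repository root.
for subdir in ["network/analytical", "remote_memory/analytical"]:
    for path in sorted(Path("astra-sim/inputs", subdir).glob("*.json")):
        print(f"--- {path} ---")
        print(json.dumps(json.loads(path.read_text()), indent=2))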

2. Run LLMServingSim

Test Run

python main.py --model_name 'gpt3-6.7b' --hardware 'RTX3090' --npu_num 1 --npu_group 1 --npu_mem 40 \
    --local_bw 1024 --remote_bw 512 --link_bw 256 --fp 16 --block_size 4 \
    --dataset 'dataset/share-gpt-req100-rate10.tsv' --output 'output/example_run.csv' \
    --verbose --req_num 10

or simply use

./run.sh

Parameters of main.py

The current version only supports gpt3-6.7b and RTX3090.

Instructions for adding a new model and hardware are in the Adding a New Model & Hardware section below.

| Parameter | Supported Options | Default | Notes |
|---|---|---|---|
| model_name | 'gpt3-6.7b' | 'gpt3-6.7b' | |
| hardware | 'RTX3090' | 'RTX3090' | |
| npu_num | Integer | 16 | |
| max_batch | Integer | 0 | 0: no limit |
| npu_group | Integer | 1 | |
| npu_mem | Integer | 40 | GB |
| local_bw | Integer | 1024 | GB/s |
| remote_bw | Integer | 512 | GB/s |
| link_bw | Integer | 256 | GB/s |
| fp | Integer | 16 | bits |
| block_size | Integer | 8 | |
| dataset | Dataset path | None | None: manually add requests in main.py |
| output | Output CSV path | None | None: no CSV output, stdout only |
| gen | Flag | False | Skip initiation phase on/off |
| req_num | Integer | 100 | |
| log_interval | Float | 0.5 | Throughput log interval (s) |
| verbose | Flag | False | |

Outputs of main.py

1. Standard output

The standard output shows which requests are being processed in each iteration of the simulator and displays the measured throughput at regular intervals.

Additionally, it provides a summary of the simulation at the end.

With the --verbose option, the log includes more detailed information, including memory loads and stores.

2. Output file

{output_filename}.csv contains the simulation results for each request.

You can find an example in output/example_run.csv.
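
To post-process the per-request results, a minimal pandas sketch like the following can be used; the exact column names depend on the simulator's output format, so the snippet only inspects them rather than assuming any:

import pandas as pd

# Load the per-request results written by main.py --output.
df = pd.read_csv("output/example_run.csv")

print(df.columns.tolist())  # see which fields are reported per request
print(df.head())            # first few requests
print(df.describe())        # summary statistics of the numeric columns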

Adding a New Model & Hardware

1. Make a new performance model

We used NVIDIA TensorRT-LLM to compile and run the model. While running the model, we used NVIDIA Nsight Systems to measure each layer's latency.

You can follow the instructions here to generate a performance model.

Or, you can use another tool to measure the latency of each layer. Follow the format of a performance model in perf_model/example.csv or perf_model/RTX3090.csv.
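
As a quick check that a new performance-model file is well-formed, the following sketch (illustrative only; it assumes the file is a headered CSV like perf_model/RTX3090.csv) parses it and reports its fields and row count:

import csv

# Inspect a performance-model CSV; the field names come from its header row.
with open("perf_model/RTX3090.csv", newline="") as f:
    reader = csv.DictReader(f)
    rows = list(reader)

print("fields:", reader.fieldnames)
print("rows:", len(rows))
print("first row:", rows[0] if rows else None)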

2. Modify functions

The current version supports the GPT model architecture as generated by TensorRT-LLM. If the model architecture does not follow GPT, or the performance model was not generated by TensorRT-LLM, some parts of the LLMServingSim code must be modified.

  1. inference_serving/utils.py: function getConfig

Add your new model configuration (n_embd, n_layer, n_head, vocab_size); a sketch illustrating this and step 2 follows this list.

  2. inference_serving/memory_model.py: functions calculateSizes & getWeight

The calculateSizes function computes the input, weight, and output tensor sizes for each layer. Change this function according to the model architecture.

The getWeight function computes the total model size by retrieving the per-layer weights from calculateSizes. Change this function according to the model architecture as well.

  3. inference_serving/generate_trace.py: function synthsizeTrace

This is the main function that generates the trace for each iteration. It uses calculateSizes to retrieve the input, weight, and output tensor sizes for each layer, then stacks the layers in the trace according to the model architecture.

When changing this function, keep three things in mind:

  • Make sure the ATTENTION layer is properly separated for each request

  • Make sure the i-th layer's output size matches the (i+1)-th layer's input size

  • Make sure the ALLREDUCE operation is properly placed for synchronization

We provide a function to test your trace generation. See trace_test/ for more details.
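
As a concrete illustration of steps 1 and 2 above, the sketch below adds a hypothetical model entry with the four fields named earlier and estimates its weight size using the standard GPT parameter-count approximation; the dictionary and function names are assumptions for illustration and do not mirror the actual code in inference_serving/utils.py or inference_serving/memory_model.py:

# Hypothetical sketch: the real logic lives in getConfig (utils.py)
# and calculateSizes/getWeight (memory_model.py).
MODEL_CONFIGS = {
    # GPT-3 6.7B architecture parameters (from the GPT-3 paper)
    "gpt3-6.7b": {"n_embd": 4096, "n_layer": 32, "n_head": 32, "vocab_size": 50257},
    # A new model would add its own entry here.
    "my-new-gpt": {"n_embd": 2048, "n_layer": 24, "n_head": 16, "vocab_size": 50257},
}

def estimate_weight_bytes(name: str, fp_bits: int = 16) -> int:
    """Rough GPT weight size: ~12 * n_layer * n_embd^2 + vocab_size * n_embd parameters."""
    c = MODEL_CONFIGS[name]
    params = 12 * c["n_layer"] * c["n_embd"] ** 2 + c["vocab_size"] * c["n_embd"]
    return params * fp_bits // 8

print(estimate_weight_bytes("gpt3-6.7b") / 2**30, "GiB")  # roughly 12-13 GiB at FP16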

Citation

If you use LLMServingSim for your research, please cite our paper:

@INPROCEEDINGS{10763697,
  author={Cho, Jaehong and Kim, Minsu and Choi, Hyunmin and Heo, Guseul and Park, Jongse},
  booktitle={2024 IEEE International Symposium on Workload Characterization (IISWC)}, 
  title={LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale}, 
  year={2024},
  volume={},
  number={},
  pages={15-29},
  keywords={Technological innovation;Program processors;Simulation;Large language models;Heuristic algorithms;Redundancy;Software algorithms;Inference algorithms;Software;System analysis and design;Large language model (LLM);Inference serving;Simulation infrastructure;Neural processing unit (NPU);Processing-in-memory (PIM);Heterogeneous system},
  doi={10.1109/IISWC63097.2024.00012}}

Common Errors

Error Example

If your error looks like the following, use one of the solutions below.

/home/<user>/LLMServingSim/astra-sim/extern/graph_frontend/chakra/et_def/et_def.pb.h:17:2: error: #error This file was generated by an older version of protoc which is
   17 | #error This file was generated by an older version of protoc which is
      |  ^~~~~
/home/<user>/LLMServingSim/astra-sim/extern/graph_frontend/chakra/et_def/et_def.pb.h:18:2: error: #error incompatible with your Protocol Buffer headers. Please
   18 | #error incompatible with your Protocol Buffer headers.  Please
      |  ^~~~~
/home/<user>/LLMServingSim/astra-sim/extern/graph_frontend/chakra/et_def/et_def.pb.h:19:2: error: #error regenerate this file with a newer version of protoc.
   19 | #error regenerate this file with a newer version of protoc.
      |  ^~~~~

Method 1: Setting Environment Variables

This method explicitly points CMake at the conda environment.

  1. Activate the Conda Environment: First, activate the desired conda environment.

    conda activate your_env_name
  2. Set the CMAKE_PREFIX_PATH Environment Variable: Add the path of the activated conda environment to the CMAKE_PREFIX_PATH environment variable.

    export CMAKE_PREFIX_PATH=$CONDA_PREFIX:$CMAKE_PREFIX_PATH

Method 2: Setting the Activation Script

  1. Activate the Conda Environment: First, activate the conda environment you want to modify.

    conda activate your_env_name
  2. Navigate to the Environment's Activation Script Directory: The activation scripts are located in the etc/conda/activate.d directory within your conda environment. If this directory does not exist, create it along with the deactivation directory.

    mkdir -p $CONDA_PREFIX/etc/conda/activate.d
    mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d
  3. Create and Edit the Activation Script: Create a script named set_cmake_prefix.sh to set the CMAKE_PREFIX_PATH when the environment is activated.

    nano $CONDA_PREFIX/etc/conda/activate.d/set_cmake_prefix.sh

    Add the following content to this file:

    #!/bin/bash
    export OLD_CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH
    export CMAKE_PREFIX_PATH=$CONDA_PREFIX:$CMAKE_PREFIX_PATH
  4. Create and Edit the Deactivation Script: Create a script named unset_cmake_prefix.sh to reset the CMAKE_PREFIX_PATH when the environment is deactivated.

    nano $CONDA_PREFIX/etc/conda/deactivate.d/unset_cmake_prefix.sh

    Add the following content to this file:

    #!/bin/bash
    export CMAKE_PREFIX_PATH=$OLD_CMAKE_PREFIX_PATH
    unset OLD_CMAKE_PREFIX_PATH
  5. Set Script Permissions: Ensure the scripts are executable.

    chmod +x $CONDA_PREFIX/etc/conda/activate.d/set_cmake_prefix.sh
    chmod +x $CONDA_PREFIX/etc/conda/deactivate.d/unset_cmake_prefix.sh
