
LLMPerf

A tool for evaluating the performance of LLM APIs.

Installation

git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .

Basic Usage

We implement performance tests for evaluating LLMs.

Load test

The load test spawns a number of concurrent requests to the LLM API and measures the inter-token latency and generation throughput per request and across concurrent requests. The prompt that is sent with each request is of the format:

Randomly stream lines from the following text. Don't generate eos tokens:
LINE 1,
LINE 2,
LINE 3,
...

The lines are randomly sampled from a collection of lines from Shakespeare's sonnets. Tokens are counted with the LlamaTokenizer regardless of which LLM API is being tested, so that prompts are consistent across different LLM APIs.
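The prompt construction described above can be sketched as follows. This is a minimal sketch, not LLMPerf's own code. SONNET_LINES, PREAMBLE, build_prompt, and count_tokens are hypothetical names. count_tokens splits on whitespace as a stand-in for the LlamaTokenizer, so the example runs without downloading a tokenizer.

```python
import random

# Hypothetical stand-in corpus; the real tool samples from Shakespeare sonnet lines.
SONNET_LINES = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
    "And summer's lease hath all too short a date;",
]

PREAMBLE = "Randomly stream lines from the following text. Don't generate eos tokens:\n"


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split); LLMPerf uses the LlamaTokenizer."""
    return len(text.split())


def build_prompt(target_tokens: int, lines: list[str], rng: random.Random) -> str:
    """Append randomly sampled lines until the prompt has at least target_tokens tokens.

    The result can overshoot the target by up to one line.
    """
    prompt = PREAMBLE
    while count_tokens(prompt) < target_tokens:
        prompt += rng.choice(lines) + "\n"
    return prompt


if __name__ == "__main__":
    prompt = build_prompt(40, SONNET_LINES, random.Random(0))
    print(prompt)
    print("tokens:", count_tokens(prompt))
```

Seeding the random.Random instance makes a given prompt reproducible between runs, which helps when comparing two APIs on identical inputs.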

To run the most basic load test, you can use the token_benchmark_ray script.
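The load-test metrics (inter-token latency and generation throughput, per request and across concurrent requests) can be sketched as below. This assumes each request records its send time and the arrival time of every streamed token. RequestTrace, inter_token_latency, request_throughput, and aggregate_throughput are hypothetical names. The exact definitions LLMPerf uses, for example whether throughput is measured from send time or from the first token, may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    start: float               # time the request was sent (seconds)
    token_times: list[float]   # arrival time of each generated token (seconds)


def inter_token_latency(trace: RequestTrace) -> float:
    """Mean gap between consecutive tokens; 0.0 if fewer than two tokens arrived."""
    times = trace.token_times
    if len(times) < 2:
        return 0.0
    return mean(b - a for a, b in zip(times, times[1:]))


def request_throughput(trace: RequestTrace) -> float:
    """Generated tokens per second for one request, measured from its send time."""
    if not trace.token_times:
        return 0.0
    elapsed = trace.token_times[-1] - trace.start
    return len(trace.token_times) / elapsed if elapsed > 0 else 0.0


def aggregate_throughput(traces: list[RequestTrace]) -> float:
    """Total tokens per second across concurrent requests over their shared wall-clock window."""
    finished = [t for t in traces if t.token_times]
    if not finished:
        return 0.0
    window = max(t.token_times[-1] for t in finished) - min(t.start for t in finished)
    total = sum(len(t.token_times) for t in finished)
    return total / window if window > 0 else 0.0
```

Dividing the total token count by the shared wall-clock window, rather than summing per-request rates, avoids overstating throughput when requests only partially overlap.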

Caveats and Disclaimers

The endpoint provider's backend may vary widely, so these results do not reflect how the software performs on any particular hardware.
The results may vary with time of day.
The results may vary with the load.
The results may not correlate with users’ workloads.

SambaNova Compatible APIs

Update the API information for the SambaNova LLM. These values are represented as configurable variables in the environment variables file, .env.

For example, enter an endpoint with the URL "https://api-stage.sambanova.net/api/predict/nlp/12345678-9abc-def0-1234-56789abcdef0/456789ab-cdef-0123-4567-89abcdef0123" in the env file (with no spaces) as:

BASE_URL="https://api-stage.sambanova.net"
PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
API_KEY="89abcdef-0123-4567-89ab-cdef01234567"

python token_benchmark_ray.py \
--model "sambanova/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api sambanova \
--additional-sampling-params '{}'
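The .env values map back onto the example endpoint URL. The sketch below uses hypothetical helpers, parse_env and endpoint_url, to rebuild that URL. The path layout /api/predict/nlp/PROJECT_ID/ENDPOINT_ID is inferred from the example URL above. The parser only handles simple KEY="value" lines, so this is not necessarily how the tool itself reads .env.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY="value" lines, skipping blanks and # comments (no escape handling)."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"')
    return env


def endpoint_url(env: dict[str, str]) -> str:
    """Rebuild the endpoint URL; the path layout is inferred from the README example."""
    base = env["BASE_URL"].rstrip("/")
    return f"{base}/api/predict/nlp/{env['PROJECT_ID']}/{env['ENDPOINT_ID']}"


if __name__ == "__main__":
    example = "\n".join([
        'BASE_URL="https://api-stage.sambanova.net"',
        'PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"',
        'ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"',
    ])
    # Prints the example endpoint URL given earlier in this section.
    print(endpoint_url(parse_env(example)))
```

Rebuilding the URL this way is a quick check that the three identifiers were copied into .env correctly before starting a long benchmark run.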

LiteLLM

LLMPerf can use LiteLLM to send prompts to LLM APIs. To see which environment variables to set for the provider, and which arguments to set for model and additional-sampling-params, see the LiteLLM Provider Documentation.

python token_benchmark_ray.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "litellm" \
--additional-sampling-params '{}'
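The --mean-input-tokens/--stddev-input-tokens and --mean-output-tokens/--stddev-output-tokens flags suggest that each request's token counts are drawn around a mean with some spread. The minimal sketch below assumes a normal distribution with non-positive draws resampled. sample_token_count is a hypothetical name; check the script's source to confirm how it actually samples.

```python
import random
from statistics import mean


def sample_token_count(mu: float, sigma: float, rng: random.Random) -> int:
    """Draw a positive integer token count from N(mu, sigma**2), resampling values <= 0."""
    if sigma == 0 and mu < 1:
        raise ValueError("mu must be >= 1 when sigma is 0, or no positive count exists")
    while True:
        n = int(rng.gauss(mu, sigma))
        if n > 0:
            return n


if __name__ == "__main__":
    rng = random.Random(42)
    # Mirrors --mean-input-tokens 550 --stddev-input-tokens 150 from the commands above.
    counts = [sample_token_count(550, 150, rng) for _ in range(1000)]
    print("min:", min(counts), "max:", max(counts), "mean:", round(mean(counts), 1))
```

Setting the standard deviation to 0 makes every request use the same token count, which can help isolate latency effects from prompt-length effects.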

Run python token_benchmark_ray.py --help for more details on the arguments.
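One way --num-concurrent-requests, --max-num-completed-requests, and --timeout could interact is sketched below with asyncio. A fixed pool of workers keeps at most N requests in flight until the completion target is reached or the timeout expires. fake_request and run_load_test are hypothetical, and fake_request sleeps instead of calling an LLM API. The script's name suggests it schedules work with Ray, so its actual behavior may differ.

```python
import asyncio
import random


async def fake_request(request_id: int) -> dict:
    """Hypothetical stand-in for one LLM API call: sleeps instead of streaming tokens."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return {"id": request_id}


async def run_load_test(
    num_concurrent: int, max_completed: int, timeout_s: float
) -> tuple[list[dict], int]:
    """Run max_completed requests with at most num_concurrent in flight.

    Returns the completed results (fewer if timeout_s expires first) and the
    peak number of requests that were in flight at once.
    """
    results: list[dict] = []
    next_id = 0
    in_flight = 0
    peak = 0

    async def worker() -> None:
        nonlocal next_id, in_flight, peak
        # No await between the check and the increment, so workers never share an id.
        while next_id < max_completed:
            request_id = next_id
            next_id += 1
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                results.append(await fake_request(request_id))
            finally:
                in_flight -= 1

    tasks = [asyncio.create_task(worker()) for _ in range(num_concurrent)]
    _, pending = await asyncio.wait(tasks, timeout=timeout_s)
    for task in pending:  # stop work still running when the timeout expires
        task.cancel()
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)
    return results, peak


if __name__ == "__main__":
    done, peak = asyncio.run(run_load_test(num_concurrent=3, max_completed=10, timeout_s=5.0))
    print(len(done), "completed; peak in flight:", peak)
```

With --num-concurrent-requests 1, as in the commands above, requests run strictly one after another, so the measurements reflect an unloaded endpoint.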

Saving Results

The results of the load test are saved in the results directory specified by the --results-dir argument. They are saved in two files: one with the summary metrics of the test, and one with the metrics of each individual request that is returned.
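A minimal sketch of writing the two result files follows. save_results, the file names, and the end_to_end_latency_s field are hypothetical; the real files may use different names and contain many more metrics.

```python
import json
from pathlib import Path
from statistics import mean


def save_results(results_dir: str, test_name: str, per_request: list[dict]) -> tuple[Path, Path]:
    """Write a summary file and a per-request file into results_dir (names are hypothetical)."""
    out = Path(results_dir)
    out.mkdir(parents=True, exist_ok=True)
    latencies = [r["end_to_end_latency_s"] for r in per_request]
    summary = {
        "num_completed_requests": len(per_request),
        "mean_end_to_end_latency_s": mean(latencies) if latencies else None,
    }
    summary_path = out / f"{test_name}_summary.json"
    individual_path = out / f"{test_name}_individual_responses.json"
    summary_path.write_text(json.dumps(summary, indent=2))
    individual_path.write_text(json.dumps(per_request, indent=2))
    return summary_path, individual_path
```

Keeping the per-request records alongside the summary lets you recompute other statistics, such as percentiles, later without rerunning the test.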
