Benchmarking

Overview

This AI Starter Kit evaluates the performance of different LLMs hosted in SambaStudio or SambaNova Cloud. It lets users configure LLMs with a range of parameters, so that a single experiment produces both model outputs and performance metrics. The kit includes:

  • Configurable SambaStudio and SambaNova Cloud connectors. The connectors generate answers from a deployed model.
  • An app with three functionalities:
    • A synthetic performance evaluation process with configurable options, which lets users obtain and compare metrics over synthetic data generated by the app.
    • A custom performance evaluation process with configurable options, which lets users obtain and compare metrics over their own custom prompts.
    • A chat interface with configurable options, which lets users interact with the model and get performance metrics.
  • Two bash scripts that form the core of the performance evaluations and give users more flexibility.

This sample is ready-to-use. We provide:

  • Instructions for setup with SambaStudio or SambaNova Cloud
  • Instructions for running the model as-is
  • Instructions for customizing the model

Before you begin

To perform this setup, you must be a SambaNova customer with a SambaStudio account or have a SambaNova Cloud API key (more details in the following sections). You also have to set up your environment before you can run or customize the starter kit.

These steps assume a Mac/Linux/Unix shell environment. If using Windows, you will need to adjust some commands for navigating folders, activating virtual environments, etc.

Clone this repository

Clone the starter kit repo.

git clone https://github.com/sambanova/ai-starter-kit.git

Set up the inference endpoint and environment variables

The next step is to set up your environment variables to use one of the models available from SambaNova. If you're a current SambaNova customer, you can deploy your models with SambaStudio. If you are not a SambaNova customer, you can self-service provision API endpoints using SambaNova Cloud.

  • If using SambaNova Cloud: follow the instructions here to set up your environment variables.

  • If using SambaStudio: follow the instructions here to set up your endpoint and environment variables.

Create the (virtual) environment

  1. (Recommended) Create a virtual environment and activate it (Python 3.11 recommended):

    python<version> -m venv <virtual-environment-name>
    source <virtual-environment-name>/bin/activate
  2. Install the required dependencies:

    cd benchmarking # If not already in the benchmarking folder
    pip install -r requirements.txt

Use the starter kit

When using the benchmarking starter kit, you have two options for running the program:

  • GUI Option: This option provides plots and configuration controls in a web browser.
  • CLI Option: This option allows you to run the program from the command line and provides more flexibility.

GUI Option

The GUI for this starter kit uses Streamlit, a Python framework for building web applications. This method is useful for analyzing outputs in a graphical manner since the results are shown via plots in the UI.

Deploy the starter kit GUI

Ensure you are in the benchmarking folder and run the following command:

streamlit run streamlit/app.py --browser.gatherUsageStats false 

After deploying the starter kit, you will see the following user interface:

perf_eval_image

Quickstart

After you've deployed the GUI, you can use the starter kit. The following sections give more detail; the general workflow is:

  1. In the left sidebar, select one of the three app functionalities (each is described in detail in the Full Walkthrough): Synthetic Performance Evaluation, Custom Performance Evaluation, or Performance on Chat.

  2. If the deployed LLM is a Composition of Experts (CoE), specify the desired expert in the corresponding text box and then set the configuration parameters. If the deployed LLM is not a CoE, simply set the configuration parameters.

  3. If the deployed LLM is a SambaNova Cloud endpoint, choose the sncloud option in the API type dropdown.

  4. Press the Run button. The program performs inference on the data and produces results in the middle of the screen. With the Performance on Chat functionality, users can interact with the LLM in a multi-turn chat interface.

Full Walkthrough

There are three options in the left sidebar for running the benchmarking tool. Pick the walkthrough that best suits your needs.

Synthetic Performance Evaluation

This option allows you to evaluate the performance of the selected LLM on synthetic data generated by this benchmarking tool.

  1. Enter a model name and choose the right API type

Note: Currently, specific prompting support is available for Llama2, Llama3, Mistral, Deepseek, Solar, and Eeve. Other instruction models can work, but the number of tokens may not match the values specified.

  • If the model specified is a CoE, specify the desired expert in the Model Name text box.
    • The model name should mirror the name shown in SambaStudio, preceded by COE/.
    • For example, the Samba-1 Turbo Llama-3-8B expert in SambaStudio is titled Meta-Llama-3-8B-Instruct, so the model name would be COE/Meta-Llama-3-8B-Instruct.
  • If the model is a standalone model, enter the full model name shown on the model card, for example Llama-2-70b-chat-hf.
  • If the model is hosted on SambaNova Cloud, use the model name exactly as it appears there, and choose sncloud in the API type dropdown.
    • For example, the Llama-3-8B model in SambaNova Cloud is titled llama3-8b, so that is the model name.
  2. Set the configuration parameters
  • Number of input tokens: The number of input tokens in the generated prompt. Default: 1000.
  • Number of output tokens: The number of output tokens the LLM can generate. Default: 1000.
  • Number of total requests: Number of requests sent. Default: 32. Note: the program can time out before all requests are sent. Configure the Timeout parameter accordingly.
  • Number of concurrent workers: The number of concurrent workers. Default: 1. For testing batching-enabled models, this value should be greater than the largest batch_size one needs to test. The batch sizes typically supported are 1, 4, 8, and 16.
  • Timeout: Number of seconds before the program times out. Default: 600 seconds.
  3. Run the performance evaluation
  • Click the Run! button. This starts the program, and a spinning indicator in the UI confirms that it is executing.
  • Depending on the parameter configuration, the run should take between 1 and 10 minutes. Some diagnostic and progress information is displayed in the terminal shell.
  4. Analyze results

    Note: Not all model endpoints currently support the calculation of server-side statistics. Depending on your choice of endpoint, you may see either client and server information, or you may see just the client-side information.

    Bar plots

    The plots compare (if available) the following:

    • Server metrics: These are performance metrics from the API.
    • Client metrics: These are performance metrics computed on the client side. Additionally, if the endpoint supports dynamic batching, the plots will show per-batch metrics.

    The results are composed of four bar plots:

    • ttft_s bar plot: This plot shows the median Time to First Token (TTFT) with the height of each colored bar and a small black distribution bar. One should see higher values and higher variance in the client-side metrics compared to the server-side metrics. This difference is mainly due to the request waiting in the queue to be served (for concurrent requests), which is not included in server-side metrics. There is also a small additional factor on the client-side due to the added latency of the API call to the client computer.

    • end_to_end_latency_s bar plot: This plot shows the median end-to-end latency with the height of each colored bar and a small black distribution bar. One should see higher values and higher variance in the client-side metrics compared to the server-side metrics. This difference is also mainly due to the request waiting in the queue to be served (for concurrent requests), which is not included in server-side metrics. There is also a small additional factor on the client-side due to the added latency of the API call to the client computer.

    • output_token_per_s_per_request bar plot: This plot shows the median number of output tokens per second per request with the height of each colored bar and a small black distribution bar. One should see good agreement between the client and server-side metrics. For endpoints that support dynamic batching, one should see a decreasing trend in metrics as the batch size increases.

    • throughput_token_per_s bar plot: This plot shows the median total tokens generated per second per batch with the height of each colored bar and a small black distribution bar. One should see good agreement between the client and server-side metrics. This metric represents the total number of tokens generated per second, which is the same as the previous metric for batch size = 1. However, for batch size > 1, it is estimated as the average of output_token_per_s_per_request * batch_size_used for each batch, to account for more tokens being generated due to concurrent requests being served in batch mode.
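As a rough illustration of the throughput estimate described above, the sketch below multiplies each request's output_token_per_s_per_request by its batch_size_used and averages the result per batch size. The record layout and sample values are hypothetical, chosen only to mirror the metric names in the plots; they are not the kit's actual output schema or implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-request records (assumed layout, not the kit's schema).
records = [
    {"batch_size_used": 1, "output_token_per_s_per_request": 100.0},
    {"batch_size_used": 4, "output_token_per_s_per_request": 80.0},
    {"batch_size_used": 4, "output_token_per_s_per_request": 90.0},
]


def throughput_by_batch_size(rows):
    """Average of (per-request tokens/s * batch size), grouped by batch size."""
    grouped = defaultdict(list)
    for row in rows:
        batch_size = row["batch_size_used"]
        grouped[batch_size].append(
            row["output_token_per_s_per_request"] * batch_size
        )
    return {size: mean(values) for size, values in sorted(grouped.items())}


throughput = throughput_by_batch_size(records)
print(throughput)
```

For batch size 1 the estimate equals the per-request rate, which matches the description of the two plots coinciding in that case.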

Custom Performance Evaluation

This option allows you to evaluate the performance of the selected LLM on your own custom dataset. The interface should look like this:

Custom Performance Evaluation

  1. Prep your dataset
  • The dataset needs to be in .jsonl format, that is, a file with one JSON object per line.
  • Each JSON object should have a prompt key whose value is the prompt you want to pass to the LLM.
    • You can use a different key instead of prompt, but all your JSON objects must use the same key.
  2. Enter the dataset path
  • The entered path should be an absolute path to your dataset.
    • For example: /Users/johndoe/Documents/my_dataset.jsonl
  3. Enter a model name and choose the right API type

Note: Currently, specific prompting support is available for Llama2, Llama3, Mistral, Deepseek, Solar, and Eeve. Other instruction models can work, but the number of tokens may not match the values specified.

  • If the model specified is a CoE, specify the desired expert in the Model Name text box.
    • The model name should mirror the name shown in SambaStudio, preceded by COE/.
    • For example, the Samba-1 Turbo Llama-3-8B expert in SambaStudio is titled Meta-Llama-3-8B-Instruct, so the model name would be COE/Meta-Llama-3-8B-Instruct.
  • If the model is a standalone model, enter the full model name shown on the model card, for example Llama-2-70b-chat-hf.
  • If the model is hosted on SambaNova Cloud, use the model name exactly as it appears there, and choose sncloud in the API type dropdown.
    • For example, the Llama-3-8B model in SambaNova Cloud is titled llama3-8b, so that is the model name.
  4. Set the configuration and tuning parameters
  • Number of concurrent workers: The number of concurrent workers. Default: 1. For testing batching-enabled models, this value should be greater than the largest batch_size one needs to test. The batch sizes typically supported are 1, 4, 8, and 16.
  • Timeout: Number of seconds before the program times out. Default: 600 seconds.
  • Max Output Tokens: Maximum number of tokens to generate. Default: 256.
  • Save LLM Responses: Whether to save the actual outputs of the LLM to an output file. The output file name will contain the response_texts suffix.
  5. Analyze results

    Note: Not all model endpoints currently support the calculation of server-side statistics. Depending on your choice of endpoint, you may see either client and server information, or you may see just the client-side information.

    Bar plots

    The plots compare (if available) the following:

    • Server metrics: These are performance metrics from the API.
    • Client metrics: These are performance metrics computed on the client side. Additionally, if the endpoint supports dynamic batching, the plots will show per-batch metrics.

    The results are composed of four bar plots:

    • ttft_s bar plot: This plot shows the median Time to First Token (TTFT) with the height of each colored bar and a small black distribution bar. One should see higher values and higher variance in the client-side metrics compared to the server-side metrics. This difference is mainly due to the request waiting in the queue to be served (for concurrent requests), which is not included in server-side metrics. There is also a small additional factor on the client-side due to the added latency of the API call to the client computer.

    • end_to_end_latency_s bar plot: This plot shows the median end-to-end latency with the height of each colored bar and a small black distribution bar. One should see higher values and higher variance in the client-side metrics compared to the server-side metrics. This difference is also mainly due to the request waiting in the queue to be served (for concurrent requests), which is not included in server-side metrics. There is also a small additional factor on the client-side due to the added latency of the API call to the client computer.

    • output_token_per_s_per_request bar plot: This plot shows the median number of output tokens per second per request with the height of each colored bar and a small black distribution bar. One should see good agreement between the client and server-side metrics. For endpoints that support dynamic batching, one should see a decreasing trend in metrics as the batch size increases.

    • throughput_token_per_s bar plot: This plot shows the median total tokens generated per second per batch with the height of each colored bar and a small black distribution bar. One should see good agreement between the client and server-side metrics. This metric represents the total number of tokens generated per second, which is the same as the previous metric for batch size = 1. However, for batch size > 1, it is estimated as the average of output_token_per_s_per_request * batch_size_used for each batch, to account for more tokens being generated due to concurrent requests being served in batch mode.
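The dataset format from the "Prep your dataset" step (one JSON object per line, all sharing the same prompt key) can be produced and checked with a short script. This is only a sketch: the helper names write_jsonl and validate_jsonl, the sample prompts, and the choice to skip blank lines are assumptions for illustration and are not part of the kit.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, prompts, key="prompt"):
    """Write one JSON object per line, each holding a single prompt."""
    with open(path, "w", encoding="utf-8") as handle:
        for prompt in prompts:
            handle.write(json.dumps({key: prompt}) + "\n")


def validate_jsonl(path, key="prompt"):
    """Return the number of prompts; raise ValueError on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # assumption: blank lines are ignored
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no}: not valid JSON") from exc
            if not isinstance(obj, dict) or key not in obj:
                raise ValueError(f"line {line_no}: missing key {key!r}")
            count += 1
    return count


# Hypothetical dataset written to a temporary directory.
dataset_path = Path(tempfile.mkdtemp()) / "my_dataset.jsonl"
write_jsonl(dataset_path, ["What is an LLM?", "Explain dynamic batching."])
num_prompts = validate_jsonl(dataset_path)
print(dataset_path, num_prompts)  # the app expects an absolute path like this
```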

Performance on Chat

This option allows you to measure performance during a multi-turn conversation with an LLM. The interface should look like this:

perf_on_chat_image

  1. Enter a model name and choose the right API type
  • If the model specified is a CoE, specify the desired expert in the Model Name text box.
    • The model name should mirror the name shown in SambaStudio, preceded by COE/.
    • For example, the Samba-1 Turbo Llama-3-8B expert in SambaStudio is titled Meta-Llama-3-8B-Instruct, so the model name would be COE/Meta-Llama-3-8B-Instruct.
  • If the model is a standalone model, enter the full model name shown on the model card, for example Llama-2-70b-chat-hf.
  • If the model is hosted on SambaNova Cloud, use the model name exactly as it appears there, and choose sncloud in the API type dropdown.
    • For example, the Llama-3-8B model in SambaNova Cloud is titled llama3-8b, so that is the model name.
  2. Set the configuration parameters
  • Max tokens to generate: Maximum number of tokens to generate. Default: 256.
  3. Start the chat session

After entering the model name and configuring the parameters, press Run! to activate the chat session.

  4. Ask anything and see results

Users can ask anything and get a generated answer, as shown in the image below. In addition to the back-and-forth conversation between the user and the LLM, there is an expander that users can click to see the following metrics for each LLM response:

  • Latency (s)
  • Throughput (tokens/s)
  • Time to first token (s)

perf_on_chat_image
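To make the three per-response metrics concrete, here is a minimal sketch of how latency, time to first token, and throughput could be measured on the client side for a streamed response. The function measure_stream, the injected clock, and the definition of throughput as total tokens divided by total latency are assumptions for illustration; the kit's own calculation may differ.

```python
import time


def measure_stream(tokens, clock=time.perf_counter):
    """Consume a token stream and return simple client-side timing metrics."""
    start = clock()
    first_token_at = None
    num_tokens = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # timestamp of the first token only
        num_tokens += 1
    end = clock()
    latency = end - start
    return {
        "latency_s": latency,
        "ttft_s": None if first_token_at is None else first_token_at - start,
        # Assumed definition: all tokens divided by end-to-end latency.
        "throughput_tokens_per_s": num_tokens / latency if latency > 0 else 0.0,
    }


# Deterministic demo: a fake clock stands in for real elapsed time
# (request sent at 0.0 s, first token at 0.5 s, stream finished at 2.5 s).
fake_times = iter([0.0, 0.5, 2.5])
metrics = measure_stream(
    ["Hello", ",", " world", "!"], clock=lambda: next(fake_times)
)
print(metrics)
```

Passing the clock as a parameter keeps the example reproducible; with a real endpoint the default time.perf_counter would be used instead.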

CLI Option

This method can be run from a terminal session. Use it to experiment with values beyond the limits set in the Streamlit app parameters. There are two ways to run the program from the terminal:

  • Run with a custom dataset via run_custom_dataset.sh
  • Run with a synthetic dataset via run_synthetic_dataset.sh

Custom Dataset

Note: Currently, specific prompting support is available for Llama2, Llama3, Mistral, Deepseek, Solar, and Eeve. Other instruction models can work, but the number of tokens may not match the values specified.

  1. Open the file run_custom_dataset.sh and configure the following parameters:
  • model-name: Model name to be used. If it's a CoE model, add the "COE/" prefix to the name. Example: "COE/Meta-Llama-3-8B-Instruct"
  • llm-api: API type to use. For a SambaNova Cloud model, double-check the model name, because SambaNova Cloud names are shorter than SambaStudio model names.
  • results-dir: Path to the results directory. Default: "./data/results/llmperf"
  • num-workers: Number of concurrent workers. Default: 1
  • timeout: Timeout in seconds. Default: 600
  • input-file-path: The location of the custom dataset that you want to evaluate with
  • save-llm-responses: Whether to save the actual outputs of the LLM to an output file. The output file will contain the response_texts suffix.

Note: You should leave the --mode parameter untouched - this indicates what dataset mode to use.

  2. Run the script
  • Run the following command in your terminal:
sh run_custom_dataset.sh
  • The evaluation process will start, and a progress bar will be shown until it completes.
  3. Analyze results
  • Results will be saved at the location specified in results-dir.
  • The name of the output files will depend on the input file name, mode name, and number of workers. You should see files that follow a similar format to the following:
<MODEL_NAME>_{FILE_NAME}_{NUM_CONCURRENT_WORKERS}_{MODE}
  • For each run, two files are generated with the following suffixes in the output file names: _individual_responses and _summary.

    • Individual responses file

      • This output file contains the number of input and output tokens, the number of total tokens, Time To First Token (TTFT), End-To-End Latency (E2E Latency), and Throughput from the server side (if available) and the client side, for each individual request sent to the LLM. Users can use this data for further analysis. The notebook notebooks/analyze-token-benchmark-results.ipynb provides some charts as a starting point.

individual_responses_image

  • Summary file

    • This file includes statistics such as percentiles, mean, and standard deviation that describe the number of input and output tokens, the number of total tokens, Time To First Token (TTFT), End-To-End Latency (E2E Latency), and Throughput from the client side. It also records information about the overall run, such as the inputs used, the number of errors, and the number of completed requests per minute.

summary_output_image
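The kind of statistics reported in the summary file (percentiles, mean, standard deviation) can be reproduced from per-request values with the Python standard library. The sketch below is an assumption-laden illustration: the summarize helper, the chosen percentiles, the interpolation method, and the sample TTFT values are hypothetical and do not reflect the kit's exact computation or file layout.

```python
from statistics import mean, quantiles, stdev


def summarize(values, percentiles=(25, 50, 75, 90, 95, 99)):
    """Return selected percentiles plus mean and sample standard deviation."""
    # quantiles(..., n=100) yields 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    summary = {f"p{p}": cuts[p - 1] for p in percentiles}
    summary["mean"] = mean(values)
    summary["stddev"] = stdev(values)
    return summary


# Hypothetical client-side TTFT values in seconds, one per request.
ttft_s = [0.2, 0.3, 0.4, 0.5, 0.6]
summary = summarize(ttft_s)
print(summary)
```

The same helper could be applied to end-to-end latency or throughput columns once they are loaded from the individual responses file.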

Synthetic Dataset

Note: Currently, specific prompting support is available for Llama2, Llama3, Mistral, Deepseek, Solar, and Eeve. Other instruction models can work, but the number of tokens may not match the values specified.

  1. Open the file run_synthetic_dataset.sh and configure the following parameters:
  • model-name: Model name to be used. If it's a CoE model, add the "COE/" prefix to the name. Example: "COE/Meta-Llama-3-8B-Instruct"
  • llm-api: API type to use. For a SambaNova Cloud model, double-check the model name, because SambaNova Cloud names are shorter than SambaStudio model names.
  • results-dir: Path to the results directory. Default: "./data/results/llmperf"
  • num-workers: Number of concurrent workers. Default: 1
  • timeout: Timeout in seconds. Default: 600
  • num-input-tokens: Number of input tokens to include in the request prompts. It's recommended to choose no more than 2000 tokens to avoid long wait times. Default: 1000.
  • num-output-tokens: Number of output tokens in the generation. It's recommended to choose no more than 2000 tokens to avoid long wait times. Default: 1000.
  • num-requests: Number of requests sent. Default: 32. Note: the program can time out before all requests are sent. Configure the timeout parameter accordingly.

Note: You should leave the --mode parameter untouched - this indicates what dataset mode to use.

  2. Run the script
  • Run the following command in your terminal:
sh run_synthetic_dataset.sh
  • The evaluation process will start, and a progress bar will be shown until it completes.
  3. Analyze results
  • Results will be saved at the location specified in results-dir.
  • The name of the output files will depend on the model name, number of input and output tokens, number of workers, and mode name. You should see files that follow a similar format to the following:
<MODEL_NAME>_{NUM_INPUT_TOKENS}_{NUM_OUTPUT_TOKENS}_{NUM_CONCURRENT_WORKERS}_{MODE}
  • For each run, two files are generated with the following suffixes in the output file names: _individual_responses and _summary.

    • Individual responses file

      • This output file contains the number of input and output tokens, the number of total tokens, Time To First Token (TTFT), End-To-End Latency (E2E Latency), and Throughput from the server side (if available) and the client side, for each individual request sent to the LLM. Users can use this data for further analysis. The notebook notebooks/analyze-token-benchmark-results.ipynb provides some charts as a starting point.

individual_responses_image

  • Summary file

    • This file includes statistics such as percentiles, mean, and standard deviation that describe the number of input and output tokens, the number of total tokens, Time To First Token (TTFT), End-To-End Latency (E2E Latency), and Throughput from the client side. It also records information about the overall run, such as the inputs used, the number of errors, and the number of completed requests per minute.

summary_output_image

  • An additional notebook, notebooks/multiple-models-benchmark.ipynb, helps users run multiple benchmarks with different experts and gather the performance results in a single table. A CoE endpoint is meant to be used for this analysis.
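Because the run can time out before all requests are sent, it can help to sanity-check the timeout against num-requests and num-workers before launching. The following back-of-the-envelope sketch assumes requests are served in waves of num-workers at a time and that each request takes roughly the same time; the function name, the 10-second latency, and the safety factor are hypothetical values for illustration, not measurements or kit defaults.

```python
import math


def estimate_min_timeout_s(num_requests, num_workers, expected_latency_s,
                           safety_factor=1.5):
    """Rough timeout estimate: number of waves times per-request latency."""
    waves = math.ceil(num_requests / num_workers)
    return waves * expected_latency_s * safety_factor


# Defaults from the parameter list (32 requests, 1 worker) with an assumed
# end-to-end latency of 10 s per request.
estimate = estimate_min_timeout_s(
    num_requests=32, num_workers=1, expected_latency_s=10.0
)
fits_default_timeout = estimate <= 600
print(estimate, fits_default_timeout)
```

If the estimate exceeds the configured timeout, either raise the timeout or reduce the number of requests or tokens.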

Batching vs non-batching benchmarking

This kit also supports SambaStudio models with Dynamic Batch Size, which improves model performance significantly.

To use a batching model, users first need to set up an endpoint that supports this feature; see this section for reference. Users also need to specify a number of workers greater than 1, either in the Streamlit app or in the terminal. Since the current maximum batch size is 16, it's recommended to choose a number of workers equal to or greater than 16 to test different batch sizes.

Here are some example parameters for using an endpoint with and without dynamic batching.

Non-batching setup:

If the user wants to send 32 requests to be processed sequentially, here are the parameter values that can work as an example:

  • Parameters:
    • Number of requests: 32
    • Number of concurrent workers: 1

The following Gantt chart shows the 32 requests being executed one after the other. (SambaNova Cloud with Llama3-8b was used for this example.)

sequential_requests

Batching setup:

If the user wants to send 60 requests to be processed in batch, it's important to consider the number of workers chosen.

For example, with the following parameter values:

  • Parameters:
    • Number of requests: 60
    • Number of concurrent workers: 21

The Gantt chart shows that the requests are batched and processed in groups of 1-16-4, because there are 21 workers sending requests in parallel. This setup took about 4 min 30 s.

sequential_requests

Another example is the following:

  • Parameters:
    • Number of requests: 60
    • Number of concurrent workers: 60

The Gantt chart shows that the requests are batched and processed in groups of 1-16-16-16-8-1-1-1, because there are 60 workers sending all requests in parallel. This setup took about 3 min.

sequential_requests
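The effect of the number of concurrent workers in the examples above can be sketched with a thread pool: at most num_workers requests are in flight at once, which is what gives a batching endpoint the opportunity to group them. This is a minimal sketch under stated assumptions; run_requests and fake_send_request are hypothetical stand-ins, and the kit's own scheduling and the endpoint's batching logic are not modeled here.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_requests(num_requests, num_workers, send_request):
    """Send requests through a pool and record the peak number in flight."""
    in_flight = 0
    peak = 0
    lock = threading.Lock()

    def tracked(index):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            return send_request(index)
        finally:
            with lock:
                in_flight -= 1

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(tracked, range(num_requests)))
    return results, peak


def fake_send_request(index):
    """Hypothetical stand-in for a real endpoint call."""
    time.sleep(0.01)
    return index


# Mirrors the first batching example: 60 requests, 21 concurrent workers.
results, peak = run_requests(60, 21, fake_send_request)
print(len(results), peak)
```

With num_workers set to 1 the same function degenerates to the sequential, non-batching case.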

Third-party tools and data sources

All the packages/tools are listed in the requirements.txt file in the project directory. Some of the main packages are listed below:

  • streamlit (version 1.37.0)
  • st-pages (version 0.5.0)
  • transformers (version 4.41.1)
  • python-dotenv (version 1.0.0)
  • requests (version 2.31.0)
  • seaborn (version 0.12.2)