
# Eval Kit

A tool for evaluating the performance of LLM APIs using the RAG Evaluation methodology.

## Installation

```bash
git clone https://github.com/sambanova/ai-starter-kit
cd ai-starter-kit/eval_jumpstart
python -m venv eval_jumpstart_env
source eval_jumpstart_env/bin/activate
pip install -r requirements.txt
```
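
The kit reads credentials from a `.env` file at the repository root (see the Logging section below). A minimal sketch of that file; `WANDB_API_KEY` comes from this README, while `SAMBANOVA_API_KEY` is an assumption about how the SambaNova endpoints are authenticated:

```
# .env at the repository root
# Required for logging results to Weights & Biases
WANDB_API_KEY=your-wandb-api-key
# Assumed credential for the SambaNova Cloud / SambaStudio endpoints
SAMBANOVA_API_KEY=your-sambanova-api-key
```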

## Basic Usage

The kit implements performance tests for evaluating LLMs; each run is driven by a YAML configuration file.

### Config File

Create a YAML configuration file to specify the evaluation settings.

Example `config.yaml`:

```yaml
# Dataset used for evaluation
eval_dataset:
  name: general_knowledge_data
  path: data/eval_data.csv

# Candidate models to evaluate
llms:
  - name: sncloud-llama3.1-405
    model_type: "sncloud"
    model_name: "Meta-Llama-3.1-405B-Instruct"
    max_tokens: 1024
    temperature: 0.0
  - name: sambastudio-llama2-70
    model_type: "sambastudio"
    model_name: "llama-2-70b-chat-hf"
    max_tokens: 1024
    temperature: 0.0
  - name: sncloud-llama3.1-70
    model_type: "sncloud"
    model_name: "Meta-Llama-3.1-70B-Instruct"
    max_tokens: 1024
    temperature: 0.0
  - name: sncloud-llama3.2-3
    model_type: "sncloud"
    model_name: "Meta-Llama-3.2-3B-Instruct"
    max_tokens: 1024
    temperature: 0.0

# Model used to score the candidates' responses
eval_llm:
  model_type: "sncloud"
  model_name: "Meta-Llama-3.1-405B-Instruct"
  max_tokens: 1024
  temperature: 0.0
```
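
The snippet below is a minimal sketch of reading such a file with PyYAML; the `load_config` helper is illustrative, not part of the kit's API:

```python
import yaml  # PyYAML

def load_config(path: str) -> dict:
    """Read evaluation settings from a YAML file (illustrative helper)."""
    with open(path, "r") as f:
        config = yaml.safe_load(f)
    # Basic sanity checks on the sections shown above
    for section in ("eval_dataset", "llms", "eval_llm"):
        if section not in config:
            raise KeyError(f"config.yaml is missing the '{section}' section")
    return config

config = load_config("config.yaml")
print([llm["name"] for llm in config["llms"]])
```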

### Evaluation Use Cases

The script below loads credentials, initializes Weave, and runs the evaluator end to end:

```python
from dotenv import load_dotenv
import sys
import os

# Make the kit and repository roots importable
current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, '..'))
repo_dir = os.path.abspath(os.path.join(kit_dir, '..'))
sys.path.append(kit_dir)
sys.path.append(repo_dir)

# Load API keys (e.g. WANDB_API_KEY) from the repository-level .env file
load_dotenv(os.path.join(repo_dir, '.env'), override=True)

import asyncio
import time
from utils.eval.evaluator import BaseWeaveEvaluator
from utils.visual.env_utils import get_wandb_key

# Weave logging requires a Weights & Biases API key
wandb_api_key = get_wandb_key()
if wandb_api_key:
    import weave
else:
    raise ValueError('WANDB_API_KEY is not set.')

weave.init('your-project-name')

evaluator = BaseWeaveEvaluator()

# Time the full evaluation run
start_time = time.time()
asyncio.run(evaluator.evaluate(name='dataset_name', filepath='dataset/path', use_concurrency=True))
end_time = time.time()
elapsed_time = end_time - start_time

print(f"Elapsed time: {elapsed_time:.2f} seconds")
```

More use cases are available in the notebooks.

## Metrics

The evaluation kit reports the following for each evaluated response:

- Score
- Reason
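
For illustration, a single scored record might look like this (the field names and scale are assumptions, not taken from the source):

```python
# Illustrative shape of one evaluation record (names and scale assumed)
record = {
    "question": "What is the capital of France?",
    "answer": "Paris is the capital of France.",
    "score": 5,  # rating assigned by the eval_llm
    "reason": "The response is correct and complete.",
}
```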

## Logging

Results can be logged to Weights & Biases (wandb) by setting your `WANDB_API_KEY` in the `.env` file.

## Third-party tools and data sources

All the packages/tools are listed in the `requirements.txt` file in the project directory.