forked from Psycoy/MixEval

The official evaluation suite and dynamic data release for MixEval.


philschmid/MixEval

 
 


🏠 Homepage | 🏆 Leaderboard | 📜 arXiv | 🤗 HF Dataset | 🤗 HF Paper | 𝕏 Twitter

MixEval (Fork)

MixEval is a dynamic benchmark that evaluates LLMs on real-world user queries mixed with existing benchmarks. It achieves a 0.96 model ranking correlation with Chatbot Arena and costs around $0.6 per run when using GPT-3.5 as the judge.

You can find more information and access the MixEval leaderboard here.

This is a fork of the original MixEval repository, which can be found here. I created this fork to make it easier to integrate and use MixEval during the training of new models. This fork includes several improvements that make usage easier and more flexible, including:

  • Evaluation of local models during or after training with transformers
  • Hugging Face Datasets integration to avoid the need for local files
  • Support for Hugging Face TGI or vLLM to accelerate evaluation and make it more manageable
  • Improved markdown outputs and timing information for training runs
  • Fixed pip install for remote or CI integration

News

[2024-09-23] Added the 2024-08-11 benchmark version and support for additional system prompts when using API-based models

[2024-06-29] Added support for the 2024-06-01 benchmark version and a local API to evaluate models with vLLM or TGI

Getting started

pip install vllm
pip install git+https://github.com/philschmid/MixEval --upgrade

Note: If you want to evaluate models that are not already included, take a look here; a Zephyr example is here.

Evaluating open LLMs

Using vLLM/TGI with a hosted or local API:

  1. Start your inference server:

vllm serve HuggingFaceH4/zephyr-7b-beta

  2. Run the evaluation:

MODEL_PARSER_API=$(echo $OPENAI_API_KEY) API_URL=http://localhost:8000/v1 python -m mix_eval.evaluate \
    --data_path hf://zeitgeist-ai/mixeval \
    --model_name local_api \
    --model_path HuggingFaceH4/zephyr-7b-beta \
    --benchmark mixeval_hard \
    --version 2024-08-11 \
    --batch_size 20 \
    --output_dir results \
    --api_parallel_num 20

Results: 2024-08-11 version

| Metric        | Score  |
| ------------- | ------ |
| PIQA          | 75.00% |
| ARC           | 66.70% |
| DROP          | 62.10% |
| BBH           | 59.50% |
| GSM8k         | 58.60% |
| BoolQ         | 53.10% |
| WinoGrande    | 50.00% |
| MATH          | 49.20% |
| CommonsenseQA | 47.10% |
| TriviaQA      | 44.00% |
| AGIEval       | 41.39% |
| HellaSwag     | 36.10% |
| GPQA          | 33.30% |
| MMLU          | 32.20% |
| SIQA          | 30.80% |
| OpenBookQA    | 25.00% |
| MMLU-Pro      | 20.00% |
| overall       | 42.55% |

Total time: 419.3 seconds (~7 minutes)
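The markdown score tables above can also be consumed programmatically, e.g. to compare runs during training. A minimal sketch, assuming the table text is available as a string; `parse_score_table` is a hypothetical helper, not part of MixEval's API:

```python
def parse_score_table(markdown: str) -> dict[str, float]:
    """Parse a two-column markdown score table (like the ones above)
    into a {metric: score} dict, with scores as fractions (0.75 for 75.00%)."""
    scores = {}
    for line in markdown.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2 or set(cells[1]) <= {"-", " "}:
            continue  # skip malformed rows and the header separator row
        metric, value = cells
        if value.endswith("%"):
            scores[metric] = float(value.rstrip("%")) / 100
    return scores

# Small excerpt of the 2024-08-11 table above:
table = """
| Metric | Score  |
| ------ | ------ |
| PIQA   | 75.00% |
| MMLU   | 32.20% |
"""
print(parse_score_table(table))
```

The header row is skipped automatically because its second cell does not end in `%`.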

Results: 2024-06-01 version

| Metric        | Score  |
| ------------- | ------ |
| BBH           | 87.50% |
| PIQA          | 62.50% |
| GSM8k         | 51.40% |
| OpenBookQA    | 50.00% |
| DROP          | 49.30% |
| BoolQ         | 48.60% |
| MATH          | 41.90% |
| CommonsenseQA | 40.00% |
| TriviaQA      | 39.40% |
| AGIEval       | 30.27% |
| HellaSwag     | 27.90% |
| MMLU          | 22.90% |
| GPQA          | 12.50% |
| SIQA          | 5.00%  |
| ARC           | 0.00%  |
| MBPP          | 0.00%  |
| overall       | 36.40% |

Total time: 440.5 seconds (~7.3 minutes)

Using vLLM/TGI with the new system message method:

MODEL_PARSER_API=$(echo $OPENAI_API_KEY) API_URL=http://localhost:8000/v1 python -m mix_eval.evaluate \
    --data_path hf://zeitgeist-ai/mixeval \
    --model_name local_api \
    --model_path HuggingFaceH4/zephyr-7b-beta \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --model_systemprompt "You are a helpful assistant to solve math challenges." \
    --batch_size 20 \
    --output_dir results \
    --api_parallel_num 20

Local Hugging Face model from path:

# MODEL_PARSER_API=<your openai api key>
MODEL_PARSER_API=$(echo $OPENAI_API_KEY) python -m mix_eval.evaluate \
    --data_path hf://zeitgeist-ai/mixeval \
    --model_path my/local/path \
    --output_dir results/agi-5 \
    --model_name local_chat \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --batch_size 20 \
    --api_parallel_num 20

Remote Hugging Face model with existing config:

# MODEL_PARSER_API=<your openai api key>
MODEL_PARSER_API=$(echo $OPENAI_API_KEY) python -m mix_eval.evaluate \
    --data_path hf://zeitgeist-ai/mixeval \
    --model_name zephyr_7b_beta \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --batch_size 20 \
    --output_dir results \
    --api_parallel_num 20

Remote Hugging Face model without a config, using defaults:

Note: We use the model name local_chat to avoid the need for a config file; the model is loaded directly from the Hugging Face model hub.

# MODEL_PARSER_API=<your openai api key>
MODEL_PARSER_API=$(echo $OPENAI_API_KEY) python -m mix_eval.evaluate \
    --data_path hf://zeitgeist-ai/mixeval \
    --model_path alignment-handbook/zephyr-7b-sft-full \
    --output_dir results/handbook-zephyr \
    --model_name local_chat \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --batch_size 20 \
    --api_parallel_num 20
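When calling the evaluator from your own training code, assembling the command programmatically keeps the flags in one place. A hedged sketch: `build_mixeval_cmd` is a hypothetical convenience wrapper, and the flags simply mirror the commands shown above:

```python
import os
import subprocess

def build_mixeval_cmd(model_path: str, output_dir: str,
                      benchmark: str = "mixeval_hard",
                      version: str = "2024-06-01",
                      batch_size: int = 20) -> list[str]:
    """Assemble the `python -m mix_eval.evaluate` invocation used above."""
    return [
        "python", "-m", "mix_eval.evaluate",
        "--data_path", "hf://zeitgeist-ai/mixeval",
        "--model_path", model_path,
        "--output_dir", output_dir,
        "--model_name", "local_chat",
        "--benchmark", benchmark,
        "--version", version,
        "--batch_size", str(batch_size),
        "--api_parallel_num", "20",
    ]

cmd = build_mixeval_cmd("alignment-handbook/zephyr-7b-sft-full",
                        "results/handbook-zephyr")
print(" ".join(cmd))
# To actually run it, pass the judge key through the environment, e.g.:
# subprocess.run(cmd, check=True,
#                env={**os.environ,
#                     "MODEL_PARSER_API": os.environ["OPENAI_API_KEY"]})
```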
