This project is still under revision; the blog post and technical report will be released soon. Arena-Hard is an evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries. We prompt GPT-4-Turbo as a judge to compare each model's responses against a baseline model (default: GPT-4-0314).
Check out our blog post for more details.
git clone https://github.com/lm-sys/arena-hard.git
cd arena-hard
pip install -r requirements.txt
pip install -r requirements-optional.txt # Optional dependencies (e.g., anthropic sdk)
We have pre-generated answers and judgments for many popular models. You can browse them with an online demo or download them by running:
> git clone https://huggingface.co/spaces/lmsys/arena-hard-browser
# copy answers/judgments to the data directory
> cp -r arena-hard-browser/data .
Then run
> python show_result.py
gpt-4-0125-preview | win-rate: 77.74 | average #tokens: 618
claude-3-opus-20240229 | win-rate: 60.36 | average #tokens: 539
claude-3-sonnet-20240229 | win-rate: 47.24 | average #tokens: 553
claude-3-haiku-20240307 | win-rate: 41.47 | average #tokens: 504
gpt-4-0613 | win-rate: 37.9 | average #tokens: 354
mistral-large-2402 | win-rate: 37.77 | average #tokens: 399
Qwen1.5-72B-Chat | win-rate: 36.08 | average #tokens: 473
mistral-medium | win-rate: 32.94 | average #tokens: 492
gpt-3.5-turbo-0613 | win-rate: 25.14 | average #tokens: 403
Fill in your API endpoint in config/api_config.yaml. We support OpenAI-compatible API servers. You can specify parallel to indicate the number of concurrent API requests (default: 1).
# example
gpt-3.5-turbo-0125:
    model_name: gpt-3.5-turbo-0125
    endpoints: null
    api_type: openai
    parallel: 8

[YOUR-MODEL-NAME]:
    model_name: [YOUR-MODEL-NAME]
    endpoints:
        - api_base: [YOUR-ENDPOINT-URL]
          api_key: [YOUR-API-KEY]
    api_type: openai
    parallel: 8
You may use an inference engine such as vLLM or SGLang to host your model with an OpenAI-compatible API server.
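For example, a minimal sketch of serving a model with vLLM's OpenAI-compatible server (the model name and port below are placeholders; check your vLLM version's documentation for the exact flags it supports):

python -m vllm.entrypoints.openai.api_server --model [YOUR-MODEL-NAME] --port 8000

The api_base in config/api_config.yaml would then point to http://localhost:8000/v1.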
In config/gen_answer_config.yaml, add your model name in model_list.
bench_name: arena-hard-v0.1
temperature: 0.0
max_tokens: 4096
num_choices: 1
model_list:
- [YOUR-MODEL-NAME]
Run the command to generate answers:
python gen_answer.py
Caching is implemented: the code will skip generating an answer when an existing answer/judgment for the same prompt is already present.
In config/judge_config.yaml, add your model name in model_list.
...
# Add your model below for evaluation
model_list:
- gpt-3.5-turbo-0125
- [YOUR-MODEL-NAME]
Run the command to generate judgments:
python gen_judgment.py
Judgment caching is also implemented: it will skip judgments that have already been generated, or prompts for which one of the model answers is missing.
Output model win rates. Optionally, use --full-stats for detailed results.
> python show_result.py
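For example, to also print the detailed statistics:

> python show_result.py --full-stats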
You can review individual judgment results using our UI code.
> python qa_browser.py --share