Commit

Add full leaderboard
CodingWithTim authored May 22, 2024
1 parent a209efb commit 6b26b4f
Showing 1 changed file with 49 additions and 0 deletions.
49 changes: 49 additions & 0 deletions README.md
@@ -3,6 +3,55 @@ Arena-Hard-Auto-v0.1 is an automatic evaluation tool for instruction-tuned LLMs.

Check out our blog post for more details about how Arena Hard Auto v0.1 works -> [Blog post link](https://lmsys.org/blog/2024-04-19-arena-hard/).

## Full Leaderboard (Updated: 05/21)
```console
gpt-4-turbo-2024-04-09 | score: 82.6 | 95% CI: (-1.7, 1.9) | average #tokens: 662
gpt-4-0125-preview | score: 78.0 | 95% CI: (-2.1, 2.4) | average #tokens: 619
gemini-1.5-pro-api-preview | score: 72.0 | 95% CI: (-2.1, 2.5) | average #tokens: 676
yi-large-preview | score: 71.5 | 95% CI: (-2.4, 2.0) | average #tokens: 720
claude-3-opus-20240229 | score: 60.4 | 95% CI: (-2.5, 2.5) | average #tokens: 541
glm-4 | score: 55.7 | 95% CI: (-2.4, 2.3) | average #tokens: 622
gemini-1.5-pro | score: 53.4 | 95% CI: (-2.8, 2.4) | average #tokens: 478
gpt-4-0314 | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 423
gemini-1.5-flash-api-preview | score: 49.6 | 95% CI: (-2.2, 2.8) | average #tokens: 642
claude-3-sonnet-20240229 | score: 46.8 | 95% CI: (-2.3, 2.7) | average #tokens: 552
claude-3-haiku-20240307 | score: 41.5 | 95% CI: (-2.5, 2.5) | average #tokens: 505
llama-3-70b-chat-hf | score: 41.1 | 95% CI: (-2.0, 2.2) | average #tokens: 583
gpt-4-0613 | score: 37.9 | 95% CI: (-2.8, 2.4) | average #tokens: 354
mistral-large-2402 | score: 37.7 | 95% CI: (-2.1, 2.6) | average #tokens: 400
mixtral-8x22b-instruct-v0.1 | score: 36.4 | 95% CI: (-2.4, 2.6) | average #tokens: 430
Qwen1.5-72B-Chat | score: 36.1 | 95% CI: (-2.0, 2.7) | average #tokens: 474
command-r-plus | score: 33.1 | 95% CI: (-2.8, 2.4) | average #tokens: 541
mistral-medium | score: 31.9 | 95% CI: (-1.9, 2.2) | average #tokens: 485
mistral-next | score: 27.4 | 95% CI: (-2.4, 2.4) | average #tokens: 297
gpt-3.5-turbo-0613 | score: 24.8 | 95% CI: (-1.9, 2.3) | average #tokens: 401
claude-2.0 | score: 24.0 | 95% CI: (-1.8, 1.8) | average #tokens: 295
dbrx-instruct | score: 23.9 | 95% CI: (-1.5, 1.5) | average #tokens: 415
Mixtral-8x7B-Instruct-v0.1 | score: 23.4 | 95% CI: (-2.0, 1.9) | average #tokens: 457
gpt-3.5-turbo-0125 | score: 23.3 | 95% CI: (-2.2, 1.9) | average #tokens: 329
Yi-34B-Chat | score: 23.1 | 95% CI: (-1.6, 1.8) | average #tokens: 611
Starling-LM-7B-beta | score: 23.0 | 95% CI: (-1.8, 1.8) | average #tokens: 530
claude-2.1 | score: 22.8 | 95% CI: (-2.3, 1.8) | average #tokens: 290
Snorkel-Mistral-PairRM-DPO | score: 20.7 | 95% CI: (-1.8, 2.2) | average #tokens: 564
llama-3-8b-chat-hf | score: 20.6 | 95% CI: (-2.0, 1.9) | average #tokens: 585
gpt-3.5-turbo-1106 | score: 18.9 | 95% CI: (-1.8, 1.6) | average #tokens: 285
gpt-3.5-turbo-0301 | score: 18.1 | 95% CI: (-1.9, 2.1) | average #tokens: 334
gemini-1.0-pro | score: 17.8 | 95% CI: (-1.2, 2.2) | average #tokens: 322
snowflake-arctic-instruct | score: 17.6 | 95% CI: (-1.8, 1.5) | average #tokens: 365
command-r | score: 17.0 | 95% CI: (-1.7, 1.8) | average #tokens: 432
phi-3-mini-128k-instruct | score: 15.4 | 95% CI: (-1.4, 1.4) | average #tokens: 609
tulu-2-dpo-70b | score: 15.0 | 95% CI: (-1.6, 1.3) | average #tokens: 550
Starling-LM-7B-alpha | score: 12.8 | 95% CI: (-1.6, 1.4) | average #tokens: 483
mistral-7b-instruct | score: 12.6 | 95% CI: (-1.7, 1.4) | average #tokens: 541
gemma-1.1-7b-it | score: 12.1 | 95% CI: (-1.3, 1.3) | average #tokens: 341
Llama-2-70b-chat-hf | score: 11.6 | 95% CI: (-1.5, 1.2) | average #tokens: 595
vicuna-33b-v1.3 | score: 8.6 | 95% CI: (-1.1, 1.1) | average #tokens: 451
gemma-7b-it | score: 7.5 | 95% CI: (-1.2, 1.3) | average #tokens: 378
Llama-2-7b-chat-hf | score: 4.6 | 95% CI: (-0.8, 0.8) | average #tokens: 561
gemma-1.1-2b-it | score: 3.4 | 95% CI: (-0.6, 0.8) | average #tokens: 316
gemma-2b-it | score: 3.0 | 95% CI: (-0.6, 0.6) | average #tokens: 369
```
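
For readers who want to work with the leaderboard programmatically, the rows above follow a regular `model | score | 95% CI | average #tokens` layout. The following is a minimal sketch of parsing one row in Python. It is not part of the repository: `parse_row`, `LINE_RE`, and the `ci_low`/`ci_high` field names are hypothetical, and the code assumes the two CI values are offsets relative to the score (which would explain why the `gpt-4-0314` row sits at exactly 50.0 with a `(0.0, 0.0)` interval).

```python
import re

# Hypothetical helper, not part of arena-hard: parses one leaderboard row.
LINE_RE = re.compile(
    r"^(?P<model>\S+)\s*\|\s*score:\s*(?P<score>[\d.]+)\s*\|\s*"
    r"95% CI:\s*\((?P<lo>-?[\d.]+),\s*(?P<hi>-?[\d.]+)\)\s*\|\s*"
    r"average #tokens:\s*(?P<tokens>\d+)\s*$"
)


def parse_row(line):
    """Turn one leaderboard row into a dict with absolute CI bounds."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized leaderboard row: {line!r}")
    score = float(m["score"])
    return {
        "model": m["model"],
        "score": score,
        # Assumption: the printed interval is (lower - score, upper - score).
        "ci_low": round(score + float(m["lo"]), 1),
        "ci_high": round(score + float(m["hi"]), 1),
        "avg_tokens": int(m["tokens"]),
    }


row = parse_row(
    "gpt-4-turbo-2024-04-09 | score: 82.6 | 95% CI: (-1.7, 1.9) | average #tokens: 662"
)
print(row)
```

Under that assumption, two models whose absolute intervals overlap (for example `gemini-1.5-pro-api-preview` and `yi-large-preview`) should not be read as clearly separated by this benchmark.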

## Install Dependencies
```
git clone https://github.com/lm-sys/arena-hard.git