Stars

⚖️ Evaluation

10 repositories

A general framework for evaluating the performance of large language models (LLMs) based on a peer-review mechanism among LLMs

Python · 16 stars · 2 forks · Updated Aug 3, 2024

Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas"

Python · 937 stars · 64 forks · Updated Sep 25, 2024

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.

Go · 139 stars · 5 forks · Updated Dec 17, 2024

AI Observability & Evaluation

Jupyter Notebook · 4,312 stars · 318 forks · Updated Dec 18, 2024

Doing simple retrieval from LLMs at various context lengths to measure accuracy

Jupyter Notebook · 1,611 stars · 177 forks · Updated Aug 17, 2024

Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al., arXiv:2403.16950)

Python · 40 stars · 1 fork · Updated Jul 11, 2024

Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!

Jupyter Notebook · 903 stars · 55 forks · Updated Dec 16, 2024

Prompt optimization from scratch

Python · 533 stars · 37 forks · Updated Dec 13, 2024

An open-source visual programming environment for battle-testing prompts to LLMs.

TypeScript · 2,417 stars · 189 forks · Updated Dec 16, 2024

Evals for agents

Python · 2 stars · 2 forks · Updated Dec 4, 2024