⚖️ Evaluation
A general framework for evaluating the performance of large language models (LLMs) based on a peer-review mechanism among the models themselves
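A minimal sketch of the peer-review idea, not the repository's actual implementation: each model answers a question, every model then rates every other model's answer, and the ratings are averaged. The `ask(model, prompt)` helper is a hypothetical stand-in for whatever model-calling client you use.

```python
# Peer-review evaluation sketch (illustrative only).
# `ask(model, prompt)` is a hypothetical helper that sends a prompt to the
# named model and returns its text response.

def peer_review(models, question, ask):
    # Each model answers the question.
    answers = {m: ask(m, question) for m in models}

    # Every model rates every other model's answer on a 1-10 scale.
    scores = {m: [] for m in models}
    for reviewer in models:
        for candidate, answer in answers.items():
            if reviewer == candidate:
                continue  # skip self-review
            prompt = (
                f"Question: {question}\n"
                f"Answer: {answer}\n"
                "Rate this answer from 1 (poor) to 10 (excellent). "
                "Reply with a single number."
            )
            reply = ask(reviewer, prompt)
            try:
                scores[candidate].append(float(reply.strip().split()[0]))
            except (ValueError, IndexError):
                pass  # ignore unparsable ratings

    # Average peer score per model.
    return {m: sum(s) / len(s) for m, s in scores.items() if s}
```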
Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas"
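The core recipe behind persona-driven synthesis can be sketched in a few lines, assuming the same hypothetical `ask(model, prompt)` helper: prepending a different persona to the same generation instruction steers the model toward diverse samples. The personas and the prompt below are illustrative, not from the paper's pipeline.

```python
# Persona-driven synthetic data sketch (illustrative only).
# `ask(model, prompt)` is the same hypothetical model-call helper as above.

personas = [
    "a middle-school math teacher",
    "a marine biologist studying coral reefs",
    "a freelance mobile game developer",
]

def synthesize(model, personas, ask):
    samples = []
    for persona in personas:
        prompt = (
            f"You are {persona}. "
            "Write one challenging math word problem drawn from your daily work."
        )
        samples.append({"persona": persona, "text": ask(model, prompt)})
    return samples
```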
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of LLM code generation.
Doing simple retrieval from LLMs at various context lengths to measure accuracy
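A rough sketch of this retrieval test, not the repository's harness: a "needle" sentence is buried at different depths inside contexts of different lengths, and accuracy is the fraction of runs in which the model repeats the needle back. `ask(model, prompt)` is a hypothetical helper, `filler` is any long distractor text, and lengths are counted in characters for simplicity.

```python
# Needle-in-a-haystack retrieval sketch (illustrative only).
# `filler` is any long distractor text; `ask(model, prompt)` is a hypothetical
# helper returning the model's text response.

NEEDLE = "The secret passphrase is 'blue-orchid-42'."

def needle_test(model, ask, filler,
                context_lengths=(1_000, 8_000, 32_000),   # in characters
                depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for length in context_lengths:
        haystack = (filler * (length // len(filler) + 1))[:length]
        hits = 0
        for depth in depths:
            pos = int(len(haystack) * depth)
            context = haystack[:pos] + " " + NEEDLE + " " + haystack[pos:]
            reply = ask(model, context + "\n\nWhat is the secret passphrase?")
            hits += "blue-orchid-42" in reply
        results[length] = hits / len(depths)  # retrieval accuracy per length
    return results
```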
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al., arXiv preprint arXiv:2403.16950)
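The pairwise-preference setup can be sketched as follows, again with a hypothetical `ask(judge, prompt)` helper and not the paper's exact protocol: instead of scoring each answer in isolation, the judge model is shown two candidate answers and asked which it prefers, with the order swapped to reduce position bias.

```python
# Pairwise-preference judging sketch (illustrative only).
# `ask(judge, prompt)` is a hypothetical helper returning the judge's response.

def pairwise_prefer(judge, question, answer_a, answer_b, ask):
    def vote(first, second):
        prompt = (
            f"Question: {question}\n\n"
            f"Answer 1: {first}\n\nAnswer 2: {second}\n\n"
            "Which answer is better? Reply with exactly '1' or '2'."
        )
        return ask(judge, prompt).strip()

    # Query twice with the order swapped to mitigate position bias.
    v1 = vote(answer_a, answer_b)   # '1' means answer_a preferred
    v2 = vote(answer_b, answer_a)   # '1' means answer_b preferred
    if v1 == "1" and v2 == "2":
        return "A"
    if v1 == "2" and v2 == "1":
        return "B"
    return "tie"  # inconsistent or unparsable votes
```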
Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
An open-source visual programming environment for battle-testing prompts to LLMs.