While there are plenty of public benchmarks for foundation models on hf.co, dev teams need benchmarks on their own private data + RAG strategies + fine-tuned models/public APIs before deploying in any business use case. Teams often end up using GPT4 autograding of the final LLM responses to generate these reports before deployment. While GPT4 autograding may make sense when a developer or a business person is reviewing pipeline performance manually, the approach is stupid and expensive once you scale to a benchmark of even 100 questions on private datasets.
To make it clear why this is expensive, let's do some quick math. Say you have 5 prompt variations that you want to test on 5 LLMs (GPT 3.5 + 3 fine-tuned + 1 Hugging Face public model) against 100 questions: that is 5 × 5 × 100 = 2500 question-answer pairs to grade. Assuming roughly 3K input tokens at $0.01/1K and 1K output tokens at $0.03/1K per grading call, you end up with a GPT4 autograding bill of 2500 × (3 × $0.01 + $0.03) = $150. For context, you could literally hire a human annotator with a grading script to evaluate those 2500 question-answer pairs for $5/hr × 20 hr = $100.
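For reference, a minimal sketch of the same arithmetic (the per-call token counts and prices are assumptions, not measured values):

```python
# Rough cost comparison: GPT4 autograding vs. a human annotator.
# Token counts and per-1K-token prices below are assumptions.
PROMPT_VARIATIONS = 5
MODELS = 5            # GPT 3.5 + 3 fine-tuned + 1 Hugging Face public model
QUESTIONS = 100
PAIRS = PROMPT_VARIATIONS * MODELS * QUESTIONS          # 2500 gradings

INPUT_TOKENS_K = 3    # ~3K tokens: question + golden answer + generated answer
OUTPUT_TOKENS_K = 1   # ~1K tokens of grading feedback
COST_PER_GRADING = INPUT_TOKENS_K * 0.01 + OUTPUT_TOKENS_K * 0.03

gpt4_bill = PAIRS * COST_PER_GRADING                    # 2500 * 0.06 = $150
human_bill = 5 * 20                                     # $5/hr * 20 hr = $100
print(f"GPT4 autograding: ${gpt4_bill:.0f} vs human annotator: ${human_bill:.0f}")
```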
![Screenshot 2024-02-09 at 5 33 15 PM](https://private-user-images.githubusercontent.com/31365087/303644268-322d88e6-f791-4c72-a4c3-bb067588d99f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3NDM4MDEsIm5iZiI6MTczOTc0MzUwMSwicGF0aCI6Ii8zMTM2NTA4Ny8zMDM2NDQyNjgtMzIyZDg4ZTYtZjc5MS00YzcyLWE0YzMtYmIwNjc1ODhkOTlmLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE2VDIyMDUwMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJhNGUyYjZjZDUzOTRiZGNmNjlhMGMwMDk1YmY3NjYyNjliMGIxOWFkMTc1YTdjYTdkMTYyNDViNDUxN2RiZWUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.REWpqYWiOipDLwyZex_UdOi76m1Snf4rpJCIfzAnt88)
- GPT4's autograding output is optimized for human reading, not for further consumption by RL pipelines or an analytics dashboard.
- Restricting GPT4 via the system prompt to answer with only a single numeric value produces mostly wrong grades, because the core value in the grading comes from letting it generate a sequence of tokens that guides itself to the correct score.
- While you can do greedy parsing of scores with tools like Pydantic or the OpenAI API's JSON mode, parsing reliability and format sanity are clear challenges here (see the sketch after this list).
- The most obvious flaw in GPT4 autograding is self-preference: for any answer generated by a GPT4+RAG pipeline, the grader tends to rate the generated answer higher than the original golden answer.
- The final grade isn't normalized along any axis. For example, a grade of 4/10 isn't linearly related to 2/10, and it doesn't follow a log scale or any other mathematical scaling logic either. So you end up feeding GPT4's written feedback into a sentiment-analysis BERT just to get a usable number.
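To illustrate the parsing problem, here is a minimal sketch assuming Pydantic v2; the `Grade` schema, the regex fallback, and the sample GPT4 outputs are hypothetical, not FrugalGrader code:

```python
# Hypothetical sketch: greedily parsing a numeric score out of GPT4 grading
# output with Pydantic. The schema and sample outputs are illustrative only.
import re
from pydantic import BaseModel, ValidationError


class Grade(BaseModel):
    score: int          # expected 0-10
    rationale: str


def parse_grade(raw: str) -> Grade | None:
    # Happy path: the model actually returned the JSON object we asked for.
    try:
        return Grade.model_validate_json(raw)
    except ValidationError:
        pass
    # Fallback: greedily grab the first "score ... N" looking fragment.
    match = re.search(r"(?:score|grade)\D{0,10}(\d{1,2})", raw, re.IGNORECASE)
    if match:
        return Grade(score=int(match.group(1)), rationale=raw.strip())
    return None  # formatting drifted too far; a human has to look at it


print(parse_grade('{"score": 7, "rationale": "Mostly matches the golden answer."}'))
print(parse_grade("I would give this answer a score of 7 out of 10 because..."))
```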
Most enterprise generative AI use cases right now, including fields like financial services, have a limited scope of instruction types and already have golden answers in place for benchmarking.
- Using a cross-encoder model, we can directly measure how close the LLM-generated answer is to the golden answer (see the sketch after this list). This is well within the core competency of cross-encoder models, which are trained on pairwise similarity tasks.
- While the grading isn't 100% flexible to immediate changes in grading policy, the generated grades scale sensibly around the numeric values.
- While GPT4 autograding will continue to be used for evaluating prompts and LLMs on open-ended or completely novel questions, we believe cross-encoders pre-trained on large duplicate-question-detection datasets (like the Quora model we use here) perform well for this setting.
- The model we have currently fine-tuned for internal usage is the Quora model. While it works perfectly fine out of the box, we recommend running PEFT on ~1000 of your own examples before deployment.
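As a minimal sketch of what cross-encoder grading looks like, assuming the sentence-transformers library and the public cross-encoder/quora-distilroberta-base checkpoint (the checkpoint, example answers, and 0-10 rescaling below are illustrative; the repo's actual model and thresholds may differ):

```python
# Minimal cross-encoder grading sketch; checkpoint and rescaling are assumptions.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/quora-distilroberta-base")

golden = "Net interest margin rose to 3.2% in Q4, up 40bps year over year."
generated = "The bank's Q4 net interest margin was 3.2%, a 40bps YoY increase."

# The cross-encoder scores the (generated, golden) pair jointly and returns a
# similarity/duplicate probability in [0, 1], which we rescale to a 0-10 grade.
score = model.predict([(generated, golden)])[0]
grade = round(float(score) * 10, 1)
print(f"similarity={score:.3f} -> grade={grade}/10")
```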
Step 1: Install MongoDB
Step 2: Check if Mongo is running properly on localhost

```bash
sudo systemctl status mongod
```
Step 3: Set up the repository

```bash
git clone https://github.com/never2average/FrugalGrader
cd FrugalGrader
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Step 4: Populate the prompt library with all your testcases
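For illustration, a hypothetical sketch of seeding test cases into the local MongoDB with pymongo; the database name, collection name, and document schema here are assumptions, so check llm_evals.py for the fields the repository actually expects:

```python
# Hypothetical seeding script; database/collection names and the document
# schema are assumptions, not FrugalGrader's actual schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["frugalgrader"]["prompt_library"]

testcases = [
    {
        "prompt_variation": "v1-concise",
        "question": "What was the YoY change in net interest margin in Q4?",
        "golden_answer": "Net interest margin rose 40bps YoY to 3.2% in Q4.",
    },
    # ... add the rest of your benchmark questions here
]

collection.insert_many(testcases)
print(f"Inserted {collection.count_documents({})} testcases")
```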
Step 5: Run evals on the full prompt library

```bash
python3 llm_evals.py
```
- CLI functions for easy and modular runs.
- Implement Atlas Dashboard connector for runtime logs.
- Integrate with PortKey for LLMOps teams.
- Implement an RLAIF module optimized at grading for PPO.
- Release PEFT Quora Model for financial services QA autograding. [from onfinance team]
Written by hand cause apparently all AI code documentation tools are so incompetent that they cannot generate a decent README.md for a repository.