The purpose of this project is to expand on the GradeMate app by enhancing the grading models (scoring, justification, feedback) and evaluation framework. This involves prompting, fine-tuning, evaluating, and orchestrating open-source Large Language Models (LLMs). Some noteworthy changes that we are exploring are: 1) leveraging open-source pre-trained LLMs instead of proprietary ones, 2) using grading-criteria focused models, and 3) developing a more rigorous evaluation framework. The project is conducted in collaboration with the Columbia University QMSS Innovation Lab.
- Python version: 3.10.12
- PyTorch version: 2.5.0+cu121
- CUDA version: 12.1
For the first stage of this project, access to a GPU with 15 GB of VRAM is needed (for models based on Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct); Google Colab meets this requirement. For a later stage, where we will fine-tune larger models, more than 35 GB of RAM will be needed.
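The following is a minimal sanity check you can run in a Colab cell to confirm the environment matches the versions and the 15 GB VRAM requirement above (the exact versions printed on your machine may differ slightly):

```python
# Quick environment check before loading any models.
import sys
import torch

print(f"Python:  {sys.version.split()[0]}")   # expect ~3.10.12
print(f"PyTorch: {torch.__version__}")        # expect ~2.5.0+cu121
print(f"CUDA:    {torch.version.cuda}")       # expect 12.1

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name} ({vram_gb:.1f} GB VRAM)")
    assert vram_gb >= 15, "Stage 1 needs a GPU with at least 15 GB of VRAM"
else:
    print("No CUDA device found -- switch the Colab runtime to a GPU.")
```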
Base pre-trained model (for further fine-tuning):
- Model Used: Llama-3.1-8B-Instruct
- Model Card: Llama-3.1-8B-Instruct
- Llama Models GitHub: meta-llama/llama-models
- Meta Release: Meta LLaMA 3.1 Performance
- Multi-lingual Support: Supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Model Selection: Selected for its multi-lingual capabilities and suitability for fine-tuning within the resource limits of Colab Pro's GPUs.
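Below is a minimal sketch of loading this base model with Hugging Face `transformers` and running a single grading-style prompt. The generation settings and prompt wording are illustrative assumptions, not the project's final configuration; access to `meta-llama` checkpoints is gated, so an approved Hugging Face token is required.

```python
# Sketch: load Llama-3.1-8B-Instruct and run one grading prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # halves memory vs. fp32
    device_map="auto",            # place layers on the available GPU(s)
)

messages = [
    {"role": "system", "content": "You grade student essays against a rubric."},
    {"role": "user", "content": "Score this essay's grammar from 1-5 and justify the score: ..."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```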
To streamline fine-tuning, we are building a separate LLM for each individual grading criterion. This allows for targeted fine-tuning based on the complexity of the writing criterion. Also, since each criterion may appear across multiple rubrics, organizing fine-tuning by criterion rather than by rubric simplifies the process, addressing each assessment dimension independently.
| Writing Criteria Type | Model Complexity (Parameters) | Fine-Tuning Required? |
|---|---|---|
| Grammar | Low (100M – 500M) | No |
| Language Appropriateness | Medium (500M – 1B) | Yes |
| Writing Cohesiveness and Context-Sensitive Tasks | High (1B – 6B) | Yes |
| Content | Very High (6B+) | Yes |
Note: Smaller models suit simpler tasks, while complex, interpretive tasks require larger, fine-tuned models that can accommodate both assessment and justification tasks for robust scoring across diverse grading rubrics.
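One way to organize the per-criterion setup is a small registry that maps each grading criterion to its assigned base model and fine-tuning flag. The sketch below mirrors the table above; the specific model IDs are placeholders chosen for illustration, not the project's final selections:

```python
# Illustrative per-criterion model registry (model IDs are placeholders).
from dataclasses import dataclass

@dataclass
class CriterionModel:
    criterion: str
    base_model: str
    needs_fine_tuning: bool

CRITERION_MODELS = [
    CriterionModel("grammar", "distilroberta-base", False),                          # Low: ~100M-500M
    CriterionModel("language_appropriateness", "meta-llama/Llama-3.2-1B-Instruct", True),  # Medium: ~500M-1B
    CriterionModel("cohesiveness", "meta-llama/Llama-3.2-3B-Instruct", True),        # High: ~1B-6B
    CriterionModel("content", "meta-llama/Llama-3.1-8B-Instruct", True),             # Very High: 6B+
]

def model_for(criterion: str) -> CriterionModel:
    """Look up the model assigned to a single grading criterion."""
    for entry in CRITERION_MODELS:
        if entry.criterion == criterion:
            return entry
    raise KeyError(f"No model registered for criterion: {criterion}")
```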
- Set-Up
- Prompting
- Fine-tuning: Refer to Fine-Tuning Guide
- Orchestration: Use LangChain to orchestrate the per-criterion LLM calls (see the sketch after this list).
- Deploy: Follow deployment guidelines from NLP Cloud
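The sketch below shows one way the orchestration step could wire a local model into a LangChain chain for a single criterion. The package names (`langchain-core`, `langchain-huggingface`), model ID, and prompt wording are assumptions for illustration; the project's actual prompts, models, and chain structure may differ:

```python
# Hedged sketch: orchestrating one per-criterion grading call with LangChain.
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline

# Wrap a local Hugging Face text-generation pipeline as a LangChain LLM.
llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-3.2-1B-Instruct",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 256},
)

prompt = PromptTemplate.from_template(
    "You are grading the '{criterion}' criterion.\n"
    "Rubric: {rubric}\n"
    "Essay: {essay}\n"
    "Return a 1-5 score with a short justification."
)

grading_chain = prompt | llm   # LCEL: the prompt feeds directly into the model

result = grading_chain.invoke({
    "criterion": "grammar",
    "rubric": "5 = no grammatical errors; 1 = pervasive errors.",
    "essay": "The experiment were conducted over three weeks...",
})
print(result)
```

Running one chain per criterion keeps each model's prompt and fine-tuned weights independent, which matches the per-criterion design described above.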