The purpose of this project is to expand on the GradeMate app by enhancing the grading models (scoring, justification, feedback) and evaluation framework. This involves prompting, fine-tuning, evaluating, and orchestrating open-source Large Language Models (LLMs). Some noteworthy changes that we are exploring are: 1) leveraging open-source pre-trained LLMs instead of proprietary ones, 2) using grading-criteria focused models, and 3) developing a more rigorous evaluation framework. The project is conducted in collaboration with the Columbia University QMSS Innovation Lab.
- Python version: 3.10.12
- PyTorch version: 2.5.0+cu121
- CUDA version: 12.1
For the first stage of this project, access to a GPU with 15 GB of VRAM is needed (for models based on Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct); Google Colab meets this requirement. For a later stage, where we will fine-tune larger models, more than 35 GB of RAM will be needed.
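The following is a minimal sanity check you can run in a Colab cell to confirm the environment matches the versions and the 15 GB VRAM requirement above (the exact versions printed on your machine may differ slightly):

```python
# Quick environment check before loading any models.
import sys
import torch

print(f"Python:  {sys.version.split()[0]}")   # expect ~3.10.12
print(f"PyTorch: {torch.__version__}")        # expect ~2.5.0+cu121
print(f"CUDA:    {torch.version.cuda}")       # expect 12.1

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name} ({vram_gb:.1f} GB VRAM)")
    assert vram_gb >= 15, "Stage 1 needs a GPU with at least 15 GB of VRAM"
else:
    print("No CUDA device found -- switch the Colab runtime to a GPU.")
```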
Base pre-trained model (for further fine-tuning):
- Model Used: Llama-3.1-8B-Instruct
- Model Card: Llama-3.1-8B-Instruct
- Llama Models GitHub: meta-llama/llama-models
- Meta Release: Meta LLaMA 3.1 Performance
- Multi-lingual Support: Supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Model Selection: Selected for its multi-lingual capabilities and suitability for fine-tuning within the resource limits of Colab Pro's GPUs.
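Below is a minimal sketch of loading this base model with Hugging Face `transformers` and running a single grading-style prompt. The generation settings and prompt wording are illustrative assumptions, not the project's final configuration; access to `meta-llama` checkpoints is gated, so an approved Hugging Face token is required.

```python
# Sketch: load Llama-3.1-8B-Instruct and run one grading prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # halves memory vs. fp32
    device_map="auto",            # place layers on the available GPU(s)
)

messages = [
    {"role": "system", "content": "You grade student essays against a rubric."},
    {"role": "user", "content": "Score this essay's grammar from 1-5 and justify the score: ..."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```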
To streamline fine-tuning, we are building a separate LLM for each individual grading criterion. This allows for targeted fine-tuning based on the complexity of the writing criterion. Also, since each criterion may appear across multiple rubrics, organizing fine-tuning by criterion rather than by rubric simplifies the process, addressing each assessment dimension independently.
| Writing Criteria Type | Model Complexity (Parameters) | Fine-Tuning Required? |
|---|---|---|
| Grammar | Low (100M – 500M) | No |
| Language Appropriateness | Medium (500M – 1B) | Yes |
| Writing Cohesiveness and Context-Sensitive Tasks | High (1B – 6B) | Yes |
| Content | Very High (6B+) | Yes |
Note: Smaller models suit simpler tasks, while complex, interpretive tasks require larger, fine-tuned models that can accommodate both assessment and justification tasks for robust scoring across diverse grading rubrics.
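One way to organize the per-criterion setup is a small registry that maps each grading criterion to its assigned base model and fine-tuning flag. The sketch below mirrors the table above; the specific model IDs are placeholders chosen for illustration, not the project's final selections:

```python
# Illustrative per-criterion model registry (model IDs are placeholders).
from dataclasses import dataclass

@dataclass
class CriterionModel:
    criterion: str
    base_model: str
    needs_fine_tuning: bool

CRITERION_MODELS = [
    CriterionModel("grammar", "distilroberta-base", False),                          # Low: ~100M-500M
    CriterionModel("language_appropriateness", "meta-llama/Llama-3.2-1B-Instruct", True),  # Medium: ~500M-1B
    CriterionModel("cohesiveness", "meta-llama/Llama-3.2-3B-Instruct", True),        # High: ~1B-6B
    CriterionModel("content", "meta-llama/Llama-3.1-8B-Instruct", True),             # Very High: 6B+
]

def model_for(criterion: str) -> CriterionModel:
    """Look up the model assigned to a single grading criterion."""
    for entry in CRITERION_MODELS:
        if entry.criterion == criterion:
            return entry
    raise KeyError(f"No model registered for criterion: {criterion}")
```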
- Set-Up
- Prompting
- Fine-tuning: Refer to Fine-Tuning Guide
- Orchestration: Use LangChain to orchestrate the per-criterion LLM calls (see the sketch after this list).
- Deploy: Follow deployment guidelines from NLP Cloud
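The sketch below shows one way the orchestration step could wire a local model into a LangChain chain for a single criterion. The package names (`langchain-core`, `langchain-huggingface`), model ID, and prompt wording are assumptions for illustration; the project's actual prompts, models, and chain structure may differ:

```python
# Hedged sketch: orchestrating one per-criterion grading call with LangChain.
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline

# Wrap a local Hugging Face text-generation pipeline as a LangChain LLM.
llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-3.2-1B-Instruct",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 256},
)

prompt = PromptTemplate.from_template(
    "You are grading the '{criterion}' criterion.\n"
    "Rubric: {rubric}\n"
    "Essay: {essay}\n"
    "Return a 1-5 score with a short justification."
)

grading_chain = prompt | llm   # LCEL: the prompt feeds directly into the model

result = grading_chain.invoke({
    "criterion": "grammar",
    "rubric": "5 = no grammatical errors; 1 = pervasive errors.",
    "essay": "The experiment were conducted over three weeks...",
})
print(result)
```

Running one chain per criterion keeps each model's prompt and fine-tuned weights independent, which matches the per-criterion design described above.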