Grading Models for GradeMate

Project Overview

The purpose of this project is to extend the GradeMate app by enhancing its grading models (scoring, justification, and feedback) and evaluation framework. This involves prompting, fine-tuning, evaluating, and orchestrating open-source Large Language Models (LLMs). Noteworthy directions we are exploring include: 1) leveraging open-source pre-trained LLMs instead of proprietary ones, 2) using grading-criteria-focused models, and 3) developing a more rigorous evaluation framework. The project is conducted in collaboration with the Columbia University QMSS Innovation Lab.

Requirements

  • Python version: 3.10.12
  • PyTorch version: 2.5.0+cu121
  • CUDA version: 12.1

For the first stage of this project, access to a GPU with 15 GB of VRAM is needed (for models based on Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct). Google Colab meets this requirement. For a later stage, where we will be fine-tuning larger models, more than 35 GB of RAM will be needed.
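To confirm the environment matches these versions, a quick check with standard torch APIs (a minimal sketch; the 15 GB figure comes from the requirement above):

```python
import sys

import torch

# Versions this project was tested against:
# Python 3.10.12, PyTorch 2.5.0+cu121, CUDA 12.1.
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)

if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)
    # Verify the GPU has roughly the 15 GB of VRAM needed for the
    # Llama-3.2-1B/3B-Instruct stage of the project.
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)} ({total_gb:.1f} GB VRAM)")
else:
    print("No CUDA GPU detected; switch to a GPU runtime in Colab.")
```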

Models Info

Main pre-trained model (for further fine-tuning): Llama-3.2-Instruct (1B and 3B variants). A loading sketch follows the feature list below.

Key Features

  • Multi-lingual Support: Supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
  • Model Selection: Selected for its multi-lingual capabilities and suitability for fine-tuning within the resource limits of Colab Pro's GPUs.
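As a reference, the base model can be loaded from the Hugging Face Hub roughly as follows. This is a minimal sketch, assuming the transformers library and access to the gated Llama weights; the model ID and prompt are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative: the project also uses the 1B-Instruct variant.
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision keeps the 3B model well under 15 GB VRAM
    device_map="auto",           # place weights on the available GPU
)

# Format a grading prompt with the model's built-in chat template.
messages = [
    {"role": "system", "content": "You are an essay grading assistant."},
    {"role": "user", "content": "Grade the following essay for grammar: ..."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```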

Modeling Strategy: Use Criterion-Specific LLMs

To streamline fine-tuning, we are building a separate LLM for each grading criterion. This lets fine-tuning effort be matched to the complexity of each writing criterion. Moreover, since a criterion may appear across multiple rubrics, organizing fine-tuning by criterion rather than by rubric simplifies the process: each assessment dimension is addressed independently and reused wherever it appears.
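Concretely, the orchestration layer can route each criterion to its own model. The registry below is a hypothetical sketch (the model names and the grade_criterion stub are illustrative, not the project's actual components); the sizes follow the complexity table in the next section:

```python
# Hypothetical registry mapping each grading criterion to a dedicated model.
CRITERION_MODELS = {
    "grammar": "grammar-model-300m",             # small, no fine-tuning needed
    "language_appropriateness": "lang-appr-1b",  # fine-tuned medium model
    "cohesiveness": "cohesion-3b",               # fine-tuned Llama-3.2-3B
    "content": "content-8b",                     # largest, fine-tuned
}

def grade_criterion(criterion: str, essay: str) -> dict:
    """Score one criterion with its dedicated model (illustrative stub)."""
    model_id = CRITERION_MODELS[criterion]
    # ... call the criterion-specific model here ...
    return {"criterion": criterion, "model": model_id, "score": None}

# A rubric is just a list of criteria, so the same criterion-level models
# are reused across every rubric that includes them.
rubric = ["grammar", "cohesiveness", "content"]
results = [grade_criterion(c, "Sample essay text.") for c in rubric]
```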

Writing Assessment Criteria

| Writing Criteria Type | Model Complexity (Parameters) | Fine-Tuning Required? |
| --- | --- | --- |
| Grammar | Low (100M - 500M) | No |
| Language Appropriateness | Medium (500M - 1B) | Yes |
| Writing Cohesiveness and Context-Sensitive Tasks | High (1B - 6B) | Yes |
| Content | Very High (6B+) | Yes |

Note: Smaller models suit simpler tasks, while complex, interpretive tasks require larger, fine-tuned models that can handle both assessment and justification, supporting robust scoring across diverse grading rubrics.
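Within Colab's memory limits, parameter-efficient fine-tuning is the natural fit for the larger criterion models. The sketch below uses LoRA adapters via the peft library; the hyperparameters are illustrative assumptions, not the settings from the project's Fine-Tuning Guide:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# LoRA trains small low-rank adapter matrices instead of all ~3B weights,
# keeping fine-tuning within Colab-class GPU memory.
lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # Llama attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```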

Steps

  1. Set-Up
  2. Prompting
  3. Fine-tuning: Refer to the Fine-Tuning Guide.
  4. Orchestration: Use LangChain to orchestrate LLM calls (see the sketch after this list).
  5. Deploy: Follow deployment guidelines from NLP Cloud.
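For step 4, a minimal orchestration sketch with LangChain (the prompt wording, model choice, and HuggingFacePipeline settings are illustrative assumptions, not the project's production chain):

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableParallel
from langchain_huggingface import HuggingFacePipeline

# One local text-generation pipeline; in the full design each criterion
# would get its own model, per the criterion-specific strategy above.
llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 256},
)

def criterion_chain(criterion: str):
    """Build a prompt -> LLM -> string chain for one grading criterion."""
    prompt = PromptTemplate.from_template(
        "Grade the following essay on {criterion}. "
        "Give a 1-5 score and a short justification.\n\nEssay:\n{essay}"
    ).partial(criterion=criterion)
    return prompt | llm | StrOutputParser()

# Fan out: run several criterion-specific chains over the same essay.
grader = RunnableParallel(
    grammar=criterion_chain("grammar"),
    content=criterion_chain("content"),
)
results = grader.invoke({"essay": "Sample essay text."})
```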
