ChemLactica

Table of contents

  • Description
  • Prerequisites
  • Installation
  • Usage
  • Tests

Description

Fine-tuning the Galactica and Gemma models on chemistry data from PubChem so that they operate on SMILES strings. The fine-tuned models integrate into a molecular optimization algorithm.
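The base Galactica checkpoints are published on the Hugging Face Hub (e.g. facebook/galactica-125m). As a minimal sketch of the starting point for fine-tuning (not this repository's actual loading code, which may differ):

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the public 125M-parameter Galactica checkpoint as the base model.
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/galactica-125m")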

Prerequisites

  • Python 3.11
  • conda

Installation

conda env create -n chemlactica -f environment.yml
conda activate chemlactica
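After activation, you can confirm that the environment provides the expected interpreter (Python 3.11, per the prerequisites above):

python --version  # should print Python 3.11.x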

Usage

Training

The script for training the model is train.py, which can be run from the command line with the following syntax:

python train.py --model_type galactica/125m --training_data_dir .small_data/train --valid_data_dir .small_data/valid --max_steps 128 --eval_steps 64 --track --eval_accumulation_steps 8

Here's what these arguments do:

  • --model_type <model_name> - type of model to train, one of galactica/125m, galactica/1.3B, galactica/20B
  • --training_data_dir - directory containing training data
  • --valid_data_dir - directory containing validation data
  • --max_steps - maximum number of steps to run training for
  • --eval_steps - the interval at which to run evaluation
  • --track - whether or not to track model checkpoints
  • --eval_accumulation_steps - the number of evaluation steps after which prediction tensors are moved from the GPU to the CPU (specified to avoid OOM errors)
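For reference, here is a minimal sketch of how these flags could be declared with argparse. The names mirror the list above, but the actual definitions in train.py may differ:

import argparse

parser = argparse.ArgumentParser(description="Fine-tune a Galactica model on chemistry data.")
parser.add_argument("--model_type", choices=["galactica/125m", "galactica/1.3B", "galactica/20B"],
                    help="type of model to train")
parser.add_argument("--training_data_dir", help="directory containing training data")
parser.add_argument("--valid_data_dir", help="directory containing validation data")
parser.add_argument("--max_steps", type=int, help="maximum number of training steps")
parser.add_argument("--eval_steps", type=int, help="interval at which to run evaluation")
parser.add_argument("--track", action="store_true", help="track model checkpoints")
parser.add_argument("--eval_accumulation_steps", type=int,
                    help="move prediction tensors from GPU to CPU every N evaluation steps")
args = parser.parse_args()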

Tests

The test that runs a small model with the same architecture as Galactica on a small dataset is located at /tests/precommit_test.py and can be run (from the tests directory) as follows:

python -m unittest precommit_test.py

This test is also run as part of the CI pipeline on the main branch, on a public GitHub runner.
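The actual test lives in /tests/precommit_test.py; as a rough sketch, a precommit test of this kind could simply drive the documented training CLI on the small dataset and assert a clean exit (the class and method names here are illustrative, and the flag values are taken from the Usage section above):

import subprocess
import unittest

class PrecommitTest(unittest.TestCase):
    def test_small_model_trains(self):
        # Run the documented training CLI on the small dataset
        # for a handful of steps and check for a clean exit.
        result = subprocess.run(
            [
                "python", "train.py",
                "--model_type", "galactica/125m",
                "--training_data_dir", ".small_data/train",
                "--valid_data_dir", ".small_data/valid",
                "--max_steps", "128",
                "--eval_steps", "64",
                "--eval_accumulation_steps", "8",
            ],
            capture_output=True, text=True,
        )
        self.assertEqual(result.returncode, 0, msg=result.stderr)

if __name__ == "__main__":
    unittest.main()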
