Chemlactica / Chemma: Large Language Models for Small Molecules

TL;DR

  • A family of models that understand small organic molecules written in SMILES, their basic properties, and similarities between molecules.
  • Chemlactica-125M 🤗 and Chemlactica-1.3B 🤗 are trained on top of Meta's Galactica models.
  • Chemma-2B 🤗 is built on top of Google's Gemma-2B.
  • All models are trained on 40B tokens covering 100M+ molecules from PubChem. The dataset is also available at 🤗.
  • A prompt like </s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES] generates a molecule with an SAS score of ~2.25 and ~0.62 similarity to the given molecule (a runnable sketch follows this list).
  • The models can be easily tuned to perform property prediction (~0.3 RMSE on FreeSolv from MoleculeNet).
  • Wrapped in a genetic-like optimization algorithm, the models outperform prior methods on all molecular optimization benchmarks we tried.
    • Practical Molecular Optimization: 17.5 vs 16.2 (previous SOTA: Genetic-guided GFlowNets).
    • Optimization for docking with AutoDock Vina: 3-4x fewer oracle calls to generate 100 good molecules than the previous SOTA.
    • QED optimization from the RetMol paper: 99% success rate with 10K oracle calls with Chemlactica-125M (vs. 96% with 50K calls).
  • All details in the paper Small Molecule Optimization with Large Language Models.
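
As a quick illustration of the prompt format above, here is a minimal generation sketch using the 🤗 transformers library. The model id below is an assumption based on the links above; check the model cards for the exact names.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yerevann/chemlactica-125m"  # assumed Hub id; see the model card links above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Condition generation on a target SAS score and a similarity anchor (aspirin),
# using the tag format shown in the TL;DR.
prompt = "</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, do_sample=True, max_new_tokens=64)
print(tokenizer.decode(out[0]))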

Table of contents

  • Description
  • Prerequisites
  • Installation
  • Usage
  • Tests

Description

Fine-tuning the Galactica and Gemma models on chemistry data from PubChem.

Prerequisites

  • Python 3.11
  • conda

Installation

conda env create -n chemlactica -f environment.yml
conda activate chemlactica

Usage

Pretraining

Instructions coming soon...

Fine-tuning

Instructions coming soon...

Molecular optimization

Instructions coming soon...

Tests

A test that runs a small model with the same architecture as Galactica on a small dataset is located at /tests/precommit_test.py and can be run as follows:

python -m unittest precommit_test.py

This test also runs as part of the CI pipeline on the main branch, on a public GitHub runner.
