TL;DR
- A family of models that understand small organic molecules written in SMILES, their basic properties, and similarities between molecules.
- Chemlactica-125M 🤗 and -1.3B 🤗 trained on top of Meta's Galactica models.
- Chemma-2B 🤗 is built on top of Google's Gemma-2B.
- All models are trained on 40B tokens covering 100M+ molecules from PubChem. The dataset is also available at 🤗.
- A prompt like
  `</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]`
  will generate a molecule with an SAS score of ~2.25 and a similarity of ~0.62 to the given molecule (see the generation sketch after this list).
- The models can be easily tuned to perform property prediction (~0.3 RMSE on FreeSolv from MoleculeNet).
- The models, wrapped in a genetic-algorithm-like optimization loop, beat the previous state of the art on all the molecular optimization benchmarks we tried.
- Practical Molecular Optimization: 17.5 vs 16.2 (previous SOTA: Genetic-guided GFlowNets).
- Optimization for docking with AutoDock Vina: 3-4x fewer oracle calls than the previous SOTA to generate 100 good molecules.
- QED optimization from the RetMol paper: 99% success rate with 10K oracle calls with Chemlactica-125M (vs. 96% with 50K calls).
- All details are in the paper *Small Molecule Optimization with Large Language Models*.
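As a quick illustration of the prompt format above, here is a minimal generation sketch using the 🤗 `transformers` library. The model identifier used below is an assumption for illustration; check the Hugging Face links above for the exact model names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub identifier; see the 🤗 links above for the exact model names.
model_id = "yerevann/chemlactica-125m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Request a molecule with SAS ~2.25 that is ~0.62 similar to aspirin.
prompt = "</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=64)

# The generated SMILES appears between [START_SMILES] and [END_SMILES].
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```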
Fine-tuning the Galactica models on chemistry data from PubChem.
- Python 3.11
- conda
conda env create -n ChemLactica -f environment.yml
conda activate ChemLactica
Instructions coming soon...
The test for running a small model with the same architecture as Galactica on a small dataset is located at /tests/precommit_test.py and can be run as follows:
python -m unittest precommit_test.py
This test also runs as part of the CI pipeline on the main branch, on a public GitHub runner.