TL;DR
- A family of models that understand small organic molecules written in SMILES, their basic properties, and similarities between molecules.
- Chemlactica-125M 🤗 and -1.3B 🤗 trained on top of Meta's Galactica models.
- Chemma-2B 🤗 is built on top of Google's Gemma-2B.
- All models are trained on 40B tokens covering 100M+ molecules from PubChem. The dataset is also available at 🤗.
- A prompt like
  `</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]`
  will generate a molecule with an SAS score of ~2.25 and a similarity of ~0.62 to the given molecule (see the generation sketch after this list).
- The models can be easily tuned to perform property prediction (~0.3 RMSE on FreeSolv from MoleculeNet).
- The models, wrapped in a genetic-algorithm-like optimization loop, beat the previous state of the art on all the molecular optimization benchmarks we tried.
- Practical Molecular Optimization: 17.5 vs 16.2 (previous SOTA: Genetic-guided GFlowNets).
- Optimization for docking with AutoDock Vina: 3-4x fewer oracle calls than the previous SOTA to generate 100 good molecules.
- QED optimization from the RetMol paper: 99% success rate with 10K oracle calls with Chemlactica-125M (vs. 96% with 50K calls).
- All details are in the paper *Small Molecule Optimization with Large Language Models*.
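As a quick illustration of the prompt format above, here is a minimal generation sketch using the 🤗 `transformers` library. The model identifier used below is an assumption for illustration; check the Hugging Face links above for the exact model names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub identifier; see the 🤗 links above for the exact model names.
model_id = "yerevann/chemlactica-125m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Request a molecule with SAS ~2.25 that is ~0.62 similar to aspirin.
prompt = "</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=64)

# The generated SMILES appears between [START_SMILES] and [END_SMILES].
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```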
Fine-tuning the Galactica models on chemistry data from PubChem.
- Python 3.11
- conda
conda env create -n ChemLactica -f environment.yml
conda activate ChemLactica
Instructions coming soon...
The test for running a small model with the same architecture as Galactica on a small dataset is located at /tests/precommit_test.py and can be run as follows:
python -m unittest precommit_test.py
This test also runs as part of the CI pipeline on the main branch, on a public GitHub runner.