Source code of ThermoFormer (Foundation Models for Science Workshop, NeurIPS 2024.)


ThermoFormer: Temperature-Aware Protein Representations 🏆🔥

ThermoFormer is a model built on top of the Hugging Face Transformers and PyTorch frameworks. It is designed to learn temperature-aware representations from millions of annotated protein sequences.

Prerequisites

  - Python 3.8+
  - PyTorch 2.4+
  - Transformers
  - Biopython

🏗️ Building the OGT-labeled Dataset

Step 1: Download Growth Temperatures for 21,498 Microorganisms

  1. Download the dataset temperature_data.tsv.
  2. Place temperature_data.tsv into the ogt_data/ directory.

Step 2: Download UniRef100

  1. Download uniref100.fasta from UniProt.
  2. Place uniref100.fasta into the ogt_data/ directory.

Step 3: Check the Directory

ls ogt_data
uniref100.fasta  temperature_data.tsv

Step 4: Generate the OGT-Annotated Dataset

python build_ogt_dataset.py \
    --fasta_file ogt_data/uniref100.fasta \
    --ogt_file ogt_data/temperature_data.tsv \
    --output_file ogt_data/annotated.csv
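
Conceptually, this step joins each FASTA record with the growth temperature of its source organism. The sketch below illustrates that join on tiny in-memory stand-ins; the `organism`/`ogt` column names and the `Tax=` header convention are assumptions for illustration, not the actual logic of build_ogt_dataset.py:

```python
import csv, io

# Tiny stand-ins for uniref100.fasta and temperature_data.tsv (hypothetical formats).
fasta = """>UniRef100_A0A000 hypothetical protein Tax=Escherichia coli
MSSKLLL
>UniRef100_B1B111 hypothetical protein Tax=Thermus thermophilus
MKKTTT
"""
ogt_tsv = "organism\togt\nEscherichia coli\t37.0\nThermus thermophilus\t68.0\n"

# Map organism name -> optimal growth temperature (OGT).
ogt = {row["organism"]: float(row["ogt"])
       for row in csv.DictReader(io.StringIO(ogt_tsv), delimiter="\t")}

def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)

# Join: keep only sequences whose source organism has a known OGT.
annotated = []
for header, seq in parse_fasta(fasta):
    organism = header.split("Tax=", 1)[1] if "Tax=" in header else None
    if organism in ogt:
        annotated.append((header.split()[0], seq, ogt[organism]))

print(annotated)
```

In practice the real script operates on the full UniRef100 file, so expect it to stream records rather than hold everything in memory.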

🚀 Running Inference

Once you have the annotated dataset, you can run inference:

python inference.py --file ogt_data/annotated.csv --output ogt_data/infer.csv
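
A quick way to inspect the resulting CSV is with the standard csv module. The snippet below writes a tiny stand-in file so it runs on its own; the `id`/`sequence`/`predicted_ogt` column names are assumptions, not necessarily what inference.py emits:

```python
import csv

# Hypothetical stand-in for ogt_data/infer.csv with assumed column names.
with open("infer_sample.csv", "w", newline="") as f:
    f.write("id,sequence,predicted_ogt\n")
    f.write("UniRef100_A0A000,MSSKLLL,36.2\n")

# Read the predictions back as dictionaries keyed by the header row.
with open("infer_sample.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    print(row["id"], row["predicted_ogt"])
```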

🧪 Using ThermoFormer in Python

from model.modeling_thermoformer import ThermoFormer
from model.tokenization_thermoformer import ThermoFormerTokenizer

tokenizer = ThermoFormerTokenizer()
model = ThermoFormer.from_pretrained("GinnM/ThermoFormer")

# Example usage:
sequence = "MSSKLLL..."
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
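
For downstream use, the per-residue hidden states are typically pooled into a single sequence embedding. Assuming the model returns a standard Hugging Face output with `last_hidden_state` (an assumption; check the actual output class), a masked mean-pool looks like the following, shown on dummy tensors so the sketch is self-contained:

```python
import torch

# Dummy stand-ins for model outputs: batch of 1, sequence length 8, hidden dim 16.
last_hidden_state = torch.randn(1, 8, 16)
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0, 0]])  # last 2 positions are padding

# Masked mean pooling: average only over real (non-padding) positions.
mask = attention_mask.unsqueeze(-1).float()                 # (1, 8, 1)
embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(embedding.shape)  # torch.Size([1, 16])
```

With the real model, `last_hidden_state` and `attention_mask` would come from `outputs` and `inputs` above.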

📜 Citation

If you use ThermoFormer in your research, please cite:

@inproceedings{li2024learning,
  title={Learning temperature-aware representations from millions of annotated protein sequences},
  author={Mingchen Li and Liang Zhang and Zilan Wang and Bozitao Zhong and Pan Tan and Jiabei Cheng and Bingxin Zhou and Liang Hong and Huiqun Yu},
  booktitle={NeurIPS 2024 Workshop Foundation Models for Science: Progress, Opportunities, and Challenges},
  year={2024},
  url={https://openreview.net/forum?id=sOU2rNqo90}
}

Happy hacking!
