ThermoFormer is a model built on top of the Hugging Face Transformers and PyTorch frameworks. It is designed to learn temperature-aware representations from millions of annotated protein sequences.
- Environment:
  - Python 3.8+
  - PyTorch 2.4+
  - Transformers (Hugging Face)
  - Biopython
- Download the dataset `temperature_data.tsv`.
- Place `temperature_data.tsv` into the `ogt_data/` directory.
- Download `Uniref100.fasta` from UniProt.
- Place `Uniref100.fasta` into the `ogt_data/` directory.
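The pairing that `build_ogt_dataset.py` performs can be sketched roughly as follows. This is a minimal stand-in, not the real script: the FASTA parser, the two-column TSV layout, and the output column names are all assumptions for illustration.

```python
import csv
import io

def read_fasta(handle):
    """Parse a FASTA stream into {identifier: sequence} (minimal parser, assumed format)."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

def annotate(fasta_handle, ogt_handle, out_handle):
    """Join sequences with per-entry growth temperatures into one CSV (column names assumed)."""
    sequences = read_fasta(fasta_handle)
    writer = csv.writer(out_handle)
    writer.writerow(["id", "sequence", "temperature"])
    for row in csv.reader(ogt_handle, delimiter="\t"):
        seq_id, temperature = row[0], row[1]
        if seq_id in sequences:
            writer.writerow([seq_id, sequences[seq_id], temperature])

# Tiny in-memory example standing in for the real files:
fasta = io.StringIO(">A0A001 desc\nMSSKL\nLLK\n>B0B002\nMKV\n")
ogt = io.StringIO("A0A001\t55.0\nB0B002\t37.0\n")
out = io.StringIO()
annotate(fasta, ogt, out)
print(out.getvalue())
```

In practice, Biopython's `Bio.SeqIO` would replace the hand-rolled parser; the inline version just keeps the sketch self-contained.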
```shell
$ ls ogt_data
uniref100.fasta  temperature_data.tsv
```
```shell
python build_ogt_dataset.py \
    --fasta_file ogt_data/uniref100.fasta \
    --ogt_file ogt_data/temperature_data.tsv \
    --output_file ogt_data/annotated.csv
```
Once you have the annotated dataset, you can run inference:
```shell
python inference.py --file ogt_data/annotated.csv --output ogt_data/infer.csv
```
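Once `infer.csv` exists, you can inspect it with ordinary CSV tooling. The snippet below is a sketch only: the prediction column name `pred_temperature` is an assumption, not the script's documented output.

```python
import csv
import io
import statistics

def summarize_predictions(handle, column="pred_temperature"):
    """Average a numeric column from an inference CSV (column name is an assumption)."""
    values = [float(row[column]) for row in csv.DictReader(handle)]
    return statistics.mean(values)

# Tiny stand-in for ogt_data/infer.csv:
demo = io.StringIO("id,pred_temperature\nA0A001,52.5\nB0B002,39.5\n")
print(summarize_predictions(demo))  # → 46.0
```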
You can also load the model and tokenizer directly in Python:

```python
from model.modeling_thermoformer import ThermoFormer
from model.tokenization_thermoformer import ThermoFormerTokenizer

tokenizer = ThermoFormerTokenizer()
model = ThermoFormer.from_pretrained("GinnM/ThermoFormer")

# Example usage:
sequence = "MSSKLLL..."
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
```
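The structure of `outputs` is not documented here; assuming the model returns a `last_hidden_state`-style tensor of shape `(batch, seq_len, hidden)`, one common way to get a single fixed-size vector per sequence is masked mean pooling, sketched below with stand-in tensors:

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Masked mean over the sequence dimension -> one vector per sequence."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (B, L, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # (B, 1)
    return summed / counts

# Stand-in tensors with the shapes a transformer typically returns:
hidden = torch.randn(2, 7, 16)                   # (batch, seq_len, hidden)
mask = torch.tensor([[1] * 7, [1] * 4 + [0] * 3])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 16])
```

The mask keeps padded positions from diluting the average, which matters whenever sequences in a batch have different lengths.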
If you use ThermoFormer in your research, please cite:

```bibtex
@inproceedings{li2024learning,
  title={Learning temperature-aware representations from millions of annotated protein sequences},
  author={Mingchen Li and Liang Zhang and Zilan Wang and Bozitao Zhong and Pan Tan and Jiabei Cheng and Bingxin Zhou and Liang Hong and Huiqun Yu},
  booktitle={NeurIPS 2024 Workshop on Foundation Models for Science: Progress, Opportunities, and Challenges},
  year={2024},
  url={https://openreview.net/forum?id=sOU2rNqo90}
}
```
Happy hacking! ✨