PFresGO: an attention mechanism-based deep-learning approach for protein annotation by integrating gene ontology inter-relationships
PFresGO is an attention-based deep-learning approach that incorporates hierarchical structures in Gene Ontology (GO) graphs and advances in natural language processing algorithms for the functional annotation of proteins.
This repository contains the scripts used to train the PFresGO model, together with the scripts for running protein function prediction.
- The code was developed and tested with Python 3.7
- TensorFlow 2.4.1
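A matching environment can be created, for example, with:

pip install tensorflow==2.4.1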
If you want to train PFresGO, run:
python train_PFresGO.py --num_hidden_layers 1 --ontology 'bp' --model_name 'BP_PFresGO'
If you want to use PFresGO for prediction, download the trained model from https://huggingface.co/datasets/Biocollab/PFresGO/tree/main
Then prepare your sequence file in FASTA format, generate the per-residue protein embeddings (see the fasta-embedding.py script), put the resulting .h5 file into ./Datasets/, and run:
python predict.py --num_hidden_layers 1 --ontology 'bp' --model_name 'BP_PFresGO' --res_embeddings './Datasets/per_residue_embeddings.h5'
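Before predicting, you can sanity-check the embedding file with a few lines of Python. This is a minimal sketch that assumes one dataset per protein identifier, each of shape (sequence_length, 1024), which is the layout the ProtT5 per-residue embedding script produces:

```python
import h5py

# Peek at the first few per-residue embeddings; assumes one dataset per
# protein identifier, each of shape (sequence_length, 1024).
with h5py.File("./Datasets/per_residue_embeddings.h5", "r") as f:
    for prot_id in list(f.keys())[:5]:
        print(prot_id, f[prot_id].shape)
```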
If you want to train the autoencoder, run:
python train_autoencoder.py --input_dims 1024 --model_name 'Autoencoder_128'
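For orientation, the shape of such an autoencoder might look like the minimal sketch below. The 128-dimensional bottleneck is inferred from the model name 'Autoencoder_128' and is an assumption; the actual architecture is defined in train_autoencoder.py.

```python
import tensorflow as tf

# Minimal dense autoencoder compressing 1024-d ProtT5 embeddings to 128 dims.
# The 128-d bottleneck is inferred from the model name and is an assumption;
# see train_autoencoder.py for the actual architecture.
input_dims, latent_dims = 1024, 128
inputs = tf.keras.Input(shape=(input_dims,))
encoded = tf.keras.layers.Dense(latent_dims, activation="relu")(inputs)
decoded = tf.keras.layers.Dense(input_dims)(encoded)
autoencoder = tf.keras.Model(inputs, decoded, name="Autoencoder_128")
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.summary()
```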
The per-residue protein embeddings are generated with the pre-trained language model ProtT5. Before you run this script, create the data directories and install the required packages via:
mkdir protT5
mkdir protT5/protT5_checkpoint
mkdir protT5/sec_struct_checkpoint
mkdir protT5/output
wget -nc -P protT5/sec_struct_checkpoint http://data.bioembeddings.com/public/embeddings/feature_models/t5/secstruct_checkpoint.pt
pip install torch transformers sentencepiece h5py
Then put your own FASTA-format protein sequences into ./Datasets/ and run:
python fasta-embedding.py --seq_path './Datasets/nrPDB-GO_2019.06.18_sequences.fasta'
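For reference, the core of the embedding step looks roughly like the sketch below, following the ProtTrans usage examples; fasta-embedding.py is the authoritative implementation. The Rostlab/prot_t5_xl_half_uniref50-enc checkpoint name, the output path, and the FASTA entry are illustrative assumptions.

```python
import re
import h5py
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Load the encoder-only ProtT5 model, following the ProtTrans usage examples.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").eval()

seqs = {"P12345": "MKTAYIAKQR"}  # hypothetical parsed FASTA entries
with h5py.File("./Datasets/per_residue_embeddings.h5", "w") as f:
    for pid, seq in seqs.items():
        seq = re.sub(r"[UZOB]", "X", seq)                    # map rare amino acids to X
        ids = tokenizer(" ".join(seq), return_tensors="pt")  # ProtT5 expects spaced residues
        with torch.no_grad():
            emb = model(**ids).last_hidden_state[0]          # (len + 1, 1024)
        f.create_dataset(pid, data=emb[: len(seq)].numpy())  # drop the trailing </s> token
```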
For the detailed configuration, refer to https://github.com/agemagician/ProtTrans
The GO term embeddings are generated with the pre-trained model Anc2Vec. Before you run this script, put your own .obo-format GO terms file into ./Datasets/ and install the Anc2Vec package via:
pip install -U "anc2vec @ git+https://github.com/aedera/anc2vec.git"
For the detailed configuration, refer to https://github.com/sinc-lab/anc2vec
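As a quick check that the installation works, the pre-trained embeddings can be loaded directly, following the usage documented in the Anc2Vec repository (a minimal sketch; embedding your own .obo file instead goes through the anc2vec.train module described there):

```python
import anc2vec

# Load the pre-trained Anc2Vec embeddings: a dict mapping GO ids to vectors.
embeds = anc2vec.get_embeddings()
print(embeds["GO:0008150"].shape)  # embedding of the 'biological_process' root
```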
To convert the sequence data into TFRecords for training, run:

python Seq2TFRecord.py -prot_list '../Datasets/nrPDB-GO_2019.06.18_train.txt' -num_threads 30 -num_shards 30 -tfr_prefix '../Datasets/TFRecords_sequences/PDB_GO_train'
The resulting TFRecords used for PFresGO training and validation are stored in ./Datasets/TFRecords_sequences/
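A quick way to sanity-check the generated shards is to count the serialized records. The feature schema is defined in Seq2TFRecord.py, so this sketch only iterates raw records:

```python
import glob
import tensorflow as tf

# Count serialized examples across all training shards; the parsing schema
# itself is defined in Seq2TFRecord.py.
shards = glob.glob("./Datasets/TFRecords_sequences/PDB_GO_train*")
dataset = tf.data.TFRecordDataset(shards)
print(sum(1 for _ in dataset), "records in", len(shards), "shards")
```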