PFresGO: an attention mechanism-based deep-learning approach for protein annotation by integrating gene ontology inter-relationships
PFresGO is an attention-based deep-learning approach that incorporates hierarchical structures in Gene Ontology (GO) graphs and advances in natural language processing algorithms for the functional annotation of proteins.
This repository contains the scripts used to train the PFresGO model, together with the scripts for running protein function prediction.
- The code was developed and tested with Python 3.7
- TensorFlow 2.4.1
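A matching environment can be created, for example, with:

pip install tensorflow==2.4.1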
If you want to train PFresGO, run:
python train_PFresGO.py --num_hidden_layers 1 --ontology 'bp' --model_name 'BP_PFresGO'
If you want to use PFresGO for prediction, download the trained model from https://huggingface.co/datasets/Biocollab/PFresGO/tree/main
Then prepare your sequence file in FASTA format, generate the per-residue protein embeddings (see the fasta-embedding.py script), put the resulting .h5 file into ./Datasets/, and run:
python predict.py --num_hidden_layers 1 --ontology 'bp' --model_name 'BP_PFresGO' --res_embeddings './Datasets/per_residue_embeddings.h5'
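Before predicting, you can sanity-check the embedding file with a few lines of Python. This is a minimal sketch that assumes one dataset per protein identifier, each of shape (sequence_length, 1024), which is the layout the ProtT5 per-residue embedding script produces:

```python
import h5py

# Peek at the first few per-residue embeddings; assumes one dataset per
# protein identifier, each of shape (sequence_length, 1024).
with h5py.File("./Datasets/per_residue_embeddings.h5", "r") as f:
    for prot_id in list(f.keys())[:5]:
        print(prot_id, f[prot_id].shape)
```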
If you want to train the autoencoder, run:
python train_autoencoder.py --input_dims 1024 --model_name 'Autoencoder_128'
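For orientation, the shape of such an autoencoder might look like the minimal sketch below. The 128-dimensional bottleneck is inferred from the model name 'Autoencoder_128' and is an assumption; the actual architecture is defined in train_autoencoder.py.

```python
import tensorflow as tf

# Minimal dense autoencoder compressing 1024-d ProtT5 embeddings to 128 dims.
# The 128-d bottleneck is inferred from the model name and is an assumption;
# see train_autoencoder.py for the actual architecture.
input_dims, latent_dims = 1024, 128
inputs = tf.keras.Input(shape=(input_dims,))
encoded = tf.keras.layers.Dense(latent_dims, activation="relu")(inputs)
decoded = tf.keras.layers.Dense(input_dims)(encoded)
autoencoder = tf.keras.Model(inputs, decoded, name="Autoencoder_128")
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.summary()
```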
The per-residue protein embeddings are generated with the pre-trained language model ProtT5. Before you run this script, create the data directories and install the required packages via:
mkdir protT5
mkdir protT5/protT5_checkpoint
mkdir protT5/sec_struct_checkpoint
mkdir protT5/output
wget -nc -P protT5/sec_struct_checkpoint http://data.bioembeddings.com/public/embeddings/feature_models/t5/secstruct_checkpoint.pt
pip install torch transformers sentencepiece h5py
Then put your own FASTA-format protein sequences into ./Datasets/ and run:
python fasta-embedding.py --seq_path './Datasets/nrPDB-GO_2019.06.18_sequences.fasta'
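For reference, the core of the embedding step looks roughly like the sketch below, following the ProtTrans usage examples; fasta-embedding.py is the authoritative implementation. The Rostlab/prot_t5_xl_half_uniref50-enc checkpoint name, the output path, and the FASTA entry are illustrative assumptions.

```python
import re
import h5py
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Load the encoder-only ProtT5 model, following the ProtTrans usage examples.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").eval()

seqs = {"P12345": "MKTAYIAKQR"}  # hypothetical parsed FASTA entries
with h5py.File("./Datasets/per_residue_embeddings.h5", "w") as f:
    for pid, seq in seqs.items():
        seq = re.sub(r"[UZOB]", "X", seq)                    # map rare amino acids to X
        ids = tokenizer(" ".join(seq), return_tensors="pt")  # ProtT5 expects spaced residues
        with torch.no_grad():
            emb = model(**ids).last_hidden_state[0]          # (len + 1, 1024)
        f.create_dataset(pid, data=emb[: len(seq)].numpy())  # drop the trailing </s> token
```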
For the detailed configuration, refer to https://github.com/agemagician/ProtTrans
The GO term embeddings are generated with the pre-trained model Anc2Vec. Before you run this script, put your own .obo-format GO terms file into ./Datasets/ and install the Anc2Vec package via:
pip install -U "anc2vec @ git+https://github.com/aedera/anc2vec.git"
For the detailed configuration, refer to https://github.com/sinc-lab/anc2vec
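As a quick check that the installation works, the pre-trained embeddings can be loaded directly, following the usage documented in the Anc2Vec repository (a minimal sketch; embedding your own .obo file instead goes through the anc2vec.train module described there):

```python
import anc2vec

# Load the pre-trained Anc2Vec embeddings: a dict mapping GO ids to vectors.
embeds = anc2vec.get_embeddings()
print(embeds["GO:0008150"].shape)  # embedding of the 'biological_process' root
```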
To convert the sequence data into TFRecords for training, run:

python Seq2TFRecord.py -prot_list '../Datasets/nrPDB-GO_2019.06.18_train.txt' -num_threads 30 -num_shards 30 -tfr_prefix '../Datasets/TFRecords_sequences/PDB_GO_train'
The resulting TFRecords used for PFresGO training and validation are stored in ./Datasets/TFRecords_sequences/
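A quick way to sanity-check the generated shards is to count the serialized records. The feature schema is defined in Seq2TFRecord.py, so this sketch only iterates raw records:

```python
import glob
import tensorflow as tf

# Count serialized examples across all training shards; the parsing schema
# itself is defined in Seq2TFRecord.py.
shards = glob.glob("./Datasets/TFRecords_sequences/PDB_GO_train*")
dataset = tf.data.TFRecordDataset(shards)
print(sum(1 for _ in dataset), "records in", len(shards), "shards")
```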