📝 Paper | 🌐 Webserver (CBBL-SKKU) | 🚩 Model & Dataset
The official implementation of the paper: HyPepTox-Fuse: An interpretable hybrid framework for accurate peptide toxicity prediction using NLP-based embeddings and conventional descriptors fusion
Update soon!
2024.12.31: Manuscript was submitted to Journal of Pharmaceutical Analysis (JPA)
This project is summarized into:
- Installing environment
- Preparing datasets
- Configurations
- Training models
- Predicting with trained models
- Running inference
- Citation
First, you need to install Miniconda (recommended) or Conda, then create a new environment with the following commands:
conda create -n hypeptox_fuse python=3.10
conda activate hypeptox_fuse
Second, install the required packages by running this command:
python -m pip install -r requirements.txt --no-cache-dir
(Optional) Third, if you want to run the Inferencer (`inferencer.py`), clone the modified iFeatureOmega package into the `src/` folder. In this package, we optimized the processing speed of the `_TPC()` function.
In this project, we used the benchmark dataset from ToxinPred3. We employed three NLP models: ESM-1, ESM-2, and ProtT5, together with concatenated conventional descriptors (CCDs) extracted from iFeatureOmega. We have already extracted the features for all of them, and they can be downloaded from OneDrive. You can also find the raw dataset inside the `raw_dataset/` folder or on the original website.
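Judging from the `dataset` keys in the configuration shown below (`dataset_root` plus one folder name per feature), the downloaded features are presumably organized as one sub-folder per feature type under the root folder. A minimal sketch that checks for this assumed layout (the folder names are taken from the config; the helper itself is not part of the repo):

```python
from pathlib import Path

# Assumed layout, inferred from the config keys `dataset_root`,
# `feature_1_name`, `feature_2_name`, `feature_3_name`, `handcraft_name`;
# this checker is illustrative, not an official repo utility.
EXPECTED_SUBFOLDERS = ["ESM_2", "ESM_1", "prot_t5_xl_uniref50", "CCD"]

def missing_feature_folders(dataset_root: str) -> list[str]:
    """Return the expected feature sub-folders that are absent."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
```

Running this before training makes a missing or misplaced feature download obvious immediately, instead of failing mid-epoch.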
You can find two configuration files inside the `configs/` folder:
- `config_HyPepToxFuse_Hybrid.yaml` for training the Hybrid model (NLP + CCDs)
- `config_HyPepToxFuse_NLP_only.yaml` for training the NLP-only model
There are some parameters you should pay particular attention to:
dataset:
dataset_root: Features/ # Root folder of features
feature_1_name: ESM_2 # Name of the folder of feature 1
feature_2_name: ESM_1 # Name of the folder of feature 2
feature_3_name: prot_t5_xl_uniref50 # Name of the folder of feature 3
handcraft_name: CCD # Name of the folder of feature CCD
...
model_config:
drop: 0.3 # Dropout ratio
gated_dim: 256 # projected dimension of 3 NLP-based features
handcraft_dim: 887 # dimension of CCD
input_dim_1: 2560 # Input dimension of feature 1
input_dim_2: 1280 # Input dimension of feature 2
input_dim_3: 1024 # Input dimension of feature 3
n_classes: 2 # Keep this value!
num_heads_attn: 2 # Number of attention heads in Feature Interactions component
num_heads_transformer: 2 # Number of attention heads in Transformer Fusion component
num_layers_transformers: 2 # Number of layers in Transformer Fusion component
num_mlp_layers: 4 # Number of MLP layers in Classifier component
...
trainer_config:
alpha_focal: # Keep these values!
- 0.5
- 0.5
batch_size: 256 # Training batch size
beta_focal: 1 # Keep this value!
epochs: 100 # Number of epochs
grad_accum: 1 # Number of gradient accumulation steps
k_fold: 5 # Number of folds
loss_fn: focal # Keep this value!
lr: 0.0001 # Training learning rate
ntxent_temperature: 0.1 # Temperature of NTXent loss
num_workers: 24 # Number of CPU cores
optimizer_scheduler:
gamma: 0.5 # Multiplicative factor of learning-rate decay
step_size: 20 # Number of epochs between learning-rate decreases
output_path: checkpoints/HyPepToxFuse_Hybrid # Output folder of checkpoints
threshold: 0.5 # Classification threshold
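The `model_config` block above maps one-to-one onto the Python dict that `HyPepToxFuse_Predictor` expects. A sketch of the equivalent dict with a couple of illustrative sanity checks (the values mirror the YAML above; the checks themselves are a generic multi-head-attention requirement, not repo code):

```python
# Mirrors the `model_config` section of config_HyPepToxFuse_Hybrid.yaml.
model_config = {
    "drop": 0.3,
    "gated_dim": 256,
    "handcraft_dim": 887,
    "input_dim_1": 2560,  # ESM-2 embedding size
    "input_dim_2": 1280,  # ESM-1 embedding size
    "input_dim_3": 1024,  # ProtT5 embedding size
    "n_classes": 2,
    "num_heads_attn": 2,
    "num_heads_transformer": 2,
    "num_layers_transformers": 2,
    "num_mlp_layers": 4,
}

# Illustrative checks: multi-head attention generally requires the model
# dimension to be divisible by the number of heads.
assert model_config["gated_dim"] % model_config["num_heads_attn"] == 0
assert model_config["gated_dim"] % model_config["num_heads_transformer"] == 0
```

In practice you would load this dict from the YAML file (e.g. with PyYAML) rather than write it by hand; keeping the keys identical to the config file makes the two interchangeable.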
To reproduce the results from the paper, run the following two commands:
python train_hybrid.py --config configs/config_HyPepToxFuse_Hybrid.yaml --cuda
python train_nlp_only.py --config configs/config_HyPepToxFuse_NLP_only.yaml --cuda
We provide the `HyPepToxFuse_Predictor()` module in `predict.py`; you can use it with this sample code:
from predict import HyPepToxFuse_Predictor
predictor = HyPepToxFuse_Predictor(
model_config=dict(), # A dict of model configuration, which can be easily loaded from yaml config file (`model_config` key).
ckpt_dir='/path/to/ckpt/dir', # The path to all folds' checkpoints.
nfold=5, # Number of checkpoints inside `ckpt_dir`. Please use 5, which is the default of our training configuration.
device='cpu', # Device, 'cpu' or 'cuda'.
)
# Predict one sample
output = predictor.predict_one(
f1=feature_1, # ESM-2, shape: (1, L, 2560)
f2=feature_2, # ESM-1, shape: (1, L, 1280)
f3=feature_3, # ProtT5, shape: (1, L, 1024)
fccd=feature_ccd, # CCD, shape: (1, 887)
threshold=0.5, # Classification threshold, default is 0.5
)
# Return: a tuple of toxicity (bool) and 5-fold probabilities (list[float])
# Predict list of samples
outputs = predictor(
f1s=feature_1_list, # ESM-2, a list of tensor shaped (1, L, 2560)
f2s=feature_2_list, # ESM-1, a list of tensor shaped (1, L, 1280)
f3s=feature_3_list, # ProtT5, a list of tensor shaped (1, L, 1024)
fccds=feature_ccd_list, # CCD, a list of tensor shaped (1, 887)
threshold=0.5, # Classification threshold, default is 0.5
)
# Return: a tuple of a list of toxicity labels (list[bool]) and a list of 5-fold probabilities (list[list[float]])
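The predictor expects per-residue embeddings of shape `(1, L, d)` with a shared sequence length `L`, plus a `(1, 887)` CCD vector; shape mismatches are the most likely source of errors here. A small, hypothetical shape-validation helper (not part of `predict.py`) you could run before calling the predictor:

```python
def validate_feature_shapes(f1_shape, f2_shape, f3_shape, fccd_shape):
    """Check feature shapes against the dimensions documented above.

    Each `*_shape` is a tuple, e.g. `tuple(tensor.shape)`. Returns the
    shared sequence length L, or raises ValueError on any mismatch.
    """
    expected = {
        "f1 (ESM-2)": (f1_shape, 2560),
        "f2 (ESM-1)": (f2_shape, 1280),
        "f3 (ProtT5)": (f3_shape, 1024),
    }
    lengths = set()
    for name, (shape, dim) in expected.items():
        if len(shape) != 3 or shape[0] != 1 or shape[2] != dim:
            raise ValueError(f"{name}: expected (1, L, {dim}), got {shape}")
        lengths.add(shape[1])
    if len(lengths) != 1:
        raise ValueError(f"sequence lengths differ across features: {lengths}")
    if tuple(fccd_shape) != (1, 887):
        raise ValueError(f"fccd: expected (1, 887), got {fccd_shape}")
    return lengths.pop()
```

For example, `validate_feature_shapes(tuple(f1.shape), tuple(f2.shape), tuple(f3.shape), tuple(fccd.shape))` returns the peptide length when everything lines up.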
You can download the best models from OneDrive.
Note: The `HyPepToxFuse_Predictor` only works when you already have the extracted features, so we also designed the `Inferencer`, which automatically extracts features, reads peptide sequences from a FASTA file, and saves the results to a CSV file, as described in the following section.
You can easily run inference with the `Inferencer`; follow this sample code to learn how to use it:
from predict import HyPepToxFuse_Predictor
from inferencer import Inferencer
# Initialize predictor
predictor = HyPepToxFuse_Predictor(...)
infer = Inferencer(predictor, device='cpu') # Use 'cpu' or 'cuda'; keep it consistent with the device of `predictor`
# Predict FASTA file
outputs = infer.predict_fasta_file(
fasta_file='/path/to/fasta/file', # The path to FASTA file, check FASTA format at: https://en.wikipedia.org/wiki/FASTA_format
threshold=0.5, # Classification threshold
batch_size=4 # Batch size per prediction iteration. Increase it for faster processing if you have enough computing resources.
)
# Save results to CSV file
infer.save_csv_file(outputs, '/path/to/csv/file')
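`predict_fasta_file` reads standard FASTA input, so a malformed file will fail only after feature extraction has started. If you want to sanity-check a file first, here is a minimal pure-Python FASTA parser (illustrative only; the `Inferencer` uses its own reader):

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []  # header text without the ">"
        else:
            chunks.append(line)  # sequence may span multiple lines
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

Checking that every record has a non-empty sequence (and that residues are valid amino-acid letters) before calling `predict_fasta_file` gives clearer error messages than a failure deep inside feature extraction.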
If you are interested in our work, please cite this:
Update soon!