This repository contains the bash commands, all the pyhton codes, and the intermediate and final results of this project related to the course of Laboratory of Bioinformatics 1 of the MSc in Bioinformatics - University of Bologna.
Altough this pipeline was designed specifically to model the Kunitz domain it could potentially work for any protein domain.
- Requirements
- 0. Study Workflow
- 1. Training set selection
- 2. MSA and HMM building
- 3. Test set preparation
- 4. E-value optimization and classification benchmark
- References
The python code was tested using python 3.10 on MacOS 13.
To run this pipeline the following programms need to be installed:
- CD-HIT 4.8.1
- blastpgp 2.2.26
- HMMER 3.3.2
conda create -n hmm_kunitz blast-legacy hmmer cd-hit
conda activate hmm_kunitz
Webtools links:
The aim of this study is to develop an HMM-based method which reliably identifies the presence of the Kunitz domain in UniProtKB/SwissProt sequences. In principle, a profile HMM can be derived from unaligned sequences by training. However, the parameters for a profile HMM are more accurately estimated from a multiple sequence alignment (MSA) and this has become the method of choice (Bateman and Haft, 2002). The MSA was retrieved from the alignment of 77 structures similar to the BPTI. The HMM was trained over this MSA and then UniProtKB/SwissProt was adopted to optimize and test the classification performance of the method.
The high-resolution structure of the bovine pancreatic trypsin inhibitor (1BPI) was chosen as the prototype of Kunitz domain to select the seeds (Parkin et al., 1996). The webtool PDBeFold was adopted to perform a pairwise structural alignment over the entire PDB (Krissinel and Henrick, 2005; Berman et al., 2000) . PDBeFold was run with default parameters and precision set to ‘highest’ and the results were selected by a Q-score > 0.75.
#download fasta sequences
fasta_similar_structure='bpt1_result_qscore_0.75.fasta.seq'
To avoid a bias in the HMM construction, CD-HIT v4.8.1 was adopted to cluster identical sequences and select the longest representative for each cluster (Fu et al., 2012). A threshold of 100% identity was chosen to maintain also sequences differing for only one residue, so that no information on variation went lost.
cd-hit -i $fasta_similar_structure -o seeds_filt.fasta -c 1
A multiple structure alignment between the representativses structures was performed using the PDBeFold. The MSA derived from the structure alignment, was downloaded and adopted as a training set for the HMM training.
#Extract pdb id with chain to align the structures on pdbefold
grep '^>' seeds_filt.fasta | cut -d' ' -f1 | cut -c6- > sets/pdb_id_training.list
#Align on PDBeFold and download the msa
msa='msa_seeds.seq'
grep . $msa > $msa.tmp && mv $msa.tmp $msa
The hmmbuild program provided by HMMER v3.3.2 was chosen to train the Kunitz’s HMM, leaving the optimal trimming on the MSA to the algorithm (Eddy, 2011). The HMM profile logo was plotted with Skylign (Wheeler et al., 2014).
hmmbuild kunitz.hmm $msa
The entire UniProt/SwissProt release_2023_02 (SP) was chosen as a test set and the annotation of Kunitz domain (PF00014) according to PFAM v35.0 was chosen as reference to evaluate the classification performance. Before testing the model, the test set was elaborated in order to avoid a bias in the model evaluation. The seed sequences and all high similarity proteins (>95%), were removed from SP in order to perform a fair test of the HMM. To identify the high-similarity proteins, blastpgp v2.2.26 (gapped-BLAST) was run with default parameters, using the training sequences as queries and the entire SP as target database (Altschul et al., 1997).
#map the training sequences PDB IDs to UniProt IDs
id_mapped='sets/id_mapped.list'
tail +2 $id_mapped | cut -f 2 > $id_mapped.tmp && mv $id_mapped.tmp $id_mapped
#find high similarity seq in the test set using blastpgb
fasta_db='../db/uniprot_sprot.fasta'
formatdb -i $fasta_db
blastpgp -i seeds_filt.fasta -d $fasta_db -m 8 -o blastpgp_results.bl8
#remove them with a py script using a treshold of 95% idenitity
p ../py_scripts/filter_blast_result.py blastpgp_results.bl8 sets/ids_similar95_to_remove 95
sort -u sets/ids_similar95_to_remove > sets/ids_similar95_to_remove.tmp && mv sets/ids_similar95_to_remove.tmp sets/ids_similar95_to_remove
#keep uniq from mapped id and high similarity
sort -u sets/ids_similar95_to_remove $id_mapped > sets/id_to_remove.tmp && mv sets/id_to_remove.tmp sets/id_to_remove
#modify the test set removing training seq
ids_to_remove='sets/id_to_remove'
p ../py_scripts/remove_fasta_by_id.py $ids_to_remove $fasta_db sprot_no_training.fasta
After that, all the sequences in SP were tested with hmmsearch using the option ‘--max’ which excludes all the heuristic filters. By default, hmmsearch reported in the result only sequences over an E-value threshold of 10. Since the computation of the E-value is influenced by the database search space (Finn et al., 2011), the test was performed once on the entire SP and then the random splitting was applied in order to avoid a bias due to different database size.
hmmsearch --cpu 6 --max --noali --tblout hmm_result kunitz.hmm sprot_no_training.fasta
#extract only ID and e-val columns from the hmm result file
grep -v "^#" hmm_result | awk -v OFS="\t" '$1=$1' | cut -f1 | cut -d '|' -f 2 > id_list
grep -v "^#" hmm_result | awk -v OFS="\t" '$1=$1' | cut -f5 > eval_list
paste id_list eval_list > hmm_result && rm id_list eval_list
In order to select the E-value maximizing the classification performance, the entire SP ID list was split into 2 subsets, using a python script. The subsets lists were generated randomly of equal length and with the same proportion of Kunitz proteins.
#download id list of no_kuntiz and kunitz (PF00014) in swissprot
kunitz_ids='sets/kunitz_390.list'
not_kunitz_ids='sets/no_kunitz_569126.list'
tail +2 $not_kunitz_ids > $not_kunitz_ids.tmp && mv $not_kunitz_ids.tmp $not_kunitz_ids
tail +2 $kunitz_ids > $kunitz_ids.tmp && mv $kunitz_ids.tmp $kunitz_ids
#remove the training seq from the kunitz id list
grep -vxf $ids_to_remove $kunitz_ids > sets/kunitz_no_training.list
#use the python script to create the subset 1 and 2
p ../py_scripts/random_split.py hmm_result sets/kunitz_no_training.list $not_kunitz_ids sets/subset_1 sets/subset_2
#check the uniqness of the ids and that there are no seq in common
cut -f1 sets/subset_1 sets/subset_2 | sort -u | wc
comm -12 <(sort sets/subset_1) <(sort sets/subset_2)
Subset1 was adopted to select the best threshold and subset2 was adopted as the test set. The role of the 2 subsets was then swapped to cross-validate the results. The 2 subsets lists were annotated with either 1 or 0 depending on the presence of the Kunitz domain according to PFAM and with the E-value previously resulted from hmmsearch. Since the distribution of Kunitz and non-Kunitz was skewed, Matthews’s correlation coefficient (MCC) (2) was adopted as classification score. Compared to accuracy (3) or F1 score, the MCC is a more reliable statistical coefficient which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset (Chicco and Jurman, 2020; Matthews, 1975).
The classification benchmark was tested by running a python script (Supplementary material) for an E-value threshold decreasing exponentially from 1e-1 to 1e-12. For each subset, the E-value threshold for which the model guaranteed the best MCC was identified, and after that, it was verified that the same outcome was achieved for the other subset. The average between the best threshold for both subset was applied in benchmarking the classification for the entire SP.
#optimization and classification test on the subsets + lineplot
p ../py_scripts/optimization.py sets/subset_1 sets/subset_2 1e-1 1e-12 > classification_result
#classification on SwissProt
cd sets
for i in $(seq 1 12);do p ../../py_scripts/classification.py <(cat subset_2 subset_1) 1e-$i;done
- Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
- Ascenzi,P. et al. The Bovine Basic Pancreatic Trypsin Inhibitor (Kunitz Inhibitor): A Milestone Protein. Curr. Protein Pept. Sci., 4, 231–251.
- Bateman,A. and Haft,D.H. (2002) HMM-based databases in InterPro. Brief. Bioinform., 3, 236–245.
- Berman,H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242. Chen,P. et al. (2013) Collagen VI in cancer and its biological mechanisms. Trend Mol. Med., 19, 410–417.
- Chicco,D. and Jurman,G. (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, 6.
- Cotabarren,J. et al. (2020) Biotechnological, biomedical, and agronomical applications of plant protease inhibitors with high stability: A systematic review. Plant Sci., 292, 110398.
- Eddy,S.R. (2011) Accelerated Profile HMM Searches. PLOS Comput. Biol., 7,e1002195.
- Finn,R.D. et al. (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Res., 39, W29–W37.
- Fry,B.G. et al. (2009) The Toxicogenomic Multiverse: Convergent Recruitment of
- Proteins Into Animal Venoms. Annu. Rev. Genomics Hum. Genet., 10, 483–511. Fu,L. et al. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152.
- Krissinel,E. and Henrick,K. (2005) Multiple Alignment of Protein Structures in Three Dimensions. In, R. Berthold,M. et al. (eds), Computational Life Sciences, Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 67–78.
- Matthews,B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta BBA - Protein Struct., 405, 442–451.
- Parkin,S. et al. (1996) Structure of bovine pancreatic trypsin inhibitor at 125 K defi- nition of carboxyl-terminal residues Gly57 and Ala58. Acta Crystallogr. D Biol. Crystallogr., 52, 18–29.
- Wheeler,T.J. et al. (2014) Skylign: a tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models. BMC Bioinformatics, 15, 7.