DeepRank2 is an open-source deep learning (DL) framework for data mining of protein-protein interfaces (PPIs) or single-residue missense variants. This package is an improved and unified version of two previously developed packages: DeepRank and DeepRank-GNN.
DeepRank2 transforms (PDB-formatted) molecular data into 3D representations (either grids or graphs) containing structural and physico-chemical information, which can be used for training neural networks. DeepRank2 also offers a pre-implemented training pipeline, using either CNNs (for grids) or GNNs (for graphs), as well as output exporters for evaluating performance.
Main features:
- Predefined atom-level and residue-level feature types
  - e.g. atom/residue type, charge, size, potential energy
  - All features' documentation is available here
- Predefined target types
  - binary class, CAPRI categories, DockQ, RMSD, and FNAT
- Flexible definition of both new features and targets
- Feature generation for both graphs and grids
- Efficient data storage in HDF5 format
- Support for both classification and regression (based on PyTorch and PyTorch Geometric)
DeepRank2's extensive documentation can be found here.
The package officially supports only the ubuntu-latest OS, on which it is regularly tested through the continuous integration workflows.
Before installing deeprank2 you need to install some dependencies. We advise using a conda environment with Python >= 3.10 installed. The following dependency installation instructions were last updated on 14/09/2023; in case of issues during installation, always refer to the official documentation linked below:
- MSMS: `conda install -c bioconda msms`.
  - Here for MacOS with M1 chip users.
- PyTorch
  - We support torch's CPU library as well as CUDA.
- PyG and its optional dependencies: `torch_scatter`, `torch_sparse`, `torch_cluster`, `torch_spline_conv`.
- DSSP 4
  - Check if `dssp` is installed: `dssp --version`. If this gives an error or shows a version lower than 4:
- GCC
  - Check if gcc is installed: `gcc --version`. If this gives an error, run `sudo apt-get install gcc`.
- For MacOS with M1 chip users only: install the conda version of PyTables.
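Before installing, it can help to see which of these dependencies are already present. The following is a minimal sketch using only the Python standard library. The executable names (`msms`, `dssp`, `gcc`) are taken from the commands above; the import names `torch`, `torch_geometric`, and `tables` are the module names of PyTorch, PyG, and PyTables. The helper `check_dependencies` is hypothetical and not part of deeprank2, and it only checks for presence, not for versions (so it will not tell you whether DSSP is at version 4).

```python
import importlib.util
import shutil
import sys

# Command-line tools and Python modules named in the list above.
EXECUTABLES = ["msms", "dssp", "gcc"]
MODULES = ["torch", "torch_geometric", "tables"]


def check_dependencies(executables=EXECUTABLES, modules=MODULES):
    """Return a dict mapping each dependency name to True if it was found."""
    report = {"python>=3.10": sys.version_info >= (3, 10)}
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        report[exe] = shutil.which(exe) is not None
    for mod in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[mod] = importlib.util.find_spec(mod) is not None
    return report


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name:20s} {'found' if found else 'MISSING'}")
```

Anything reported as missing still needs to be installed following the official instructions for that dependency.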
Once the dependencies are installed, you can install the latest stable release of deeprank2 from PyPI using pip:
pip install deeprank2
Alternatively, get all the new developments by cloning the repo and installing the editable version of the package with:
git clone https://github.com/DeepRank/deeprank2
cd deeprank2
pip install -e .'[test]'
The `test` extra is optional and installs test-related dependencies that are useful during development.
If you have installed the package from a cloned repository (second option above), you can check that all components were installed correctly using pytest. The quick test should be sufficient to ensure that the software works, while the full test covers a much broader range of settings.
Run `pytest tests/test_integration.py` for the quick test or just `pytest` for the full test (expect a few minutes to run).
If you would like to contribute to the package in any way, please see our guidelines.
The following section serves as a first guide to start using the package, using protein-protein interface (PPI) queries as an example. For an enhanced learning experience, we provide in-depth tutorial notebooks for generating PPI data, generating variants data, and for the training pipeline. For more details, see the extended documentation.
For each protein-protein complex (or protein structure containing a missense variant), a query can be created and added to the `QueryCollection` object, to be processed later on. Different types of queries exist:
- In a `ProteinProteinInterfaceResidueQuery` and `SingleResidueVariantResidueQuery`, each node represents one amino acid residue.
- In a `ProteinProteinInterfaceAtomicQuery` and `SingleResidueVariantAtomicQuery`, each node represents one atom within the amino acid residues.
A query takes as inputs:
- a `.pdb` file, representing the protein-protein structure,
- the ids of the chains composing the structure, and
- optionally, the corresponding position-specific scoring matrices (PSSMs), in the form of `.pssm` files.
from deeprank2.query import QueryCollection, ProteinProteinInterfaceResidueQuery

queries = QueryCollection()

# Append data points
queries.add(ProteinProteinInterfaceResidueQuery(
    pdb_path = "tests/data/pdb/1ATN/1ATN_1w.pdb",
    chain_id1 = "A",
    chain_id2 = "B",
    targets = {
        "binary": 0
    },
    pssm_paths = {
        "A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
        "B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
    }
))
queries.add(ProteinProteinInterfaceResidueQuery(
    pdb_path = "tests/data/pdb/1ATN/1ATN_2w.pdb",
    chain_id1 = "A",
    chain_id2 = "B",
    targets = {
        "binary": 1
    },
    pssm_paths = {
        "A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
        "B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
    }
))
queries.add(ProteinProteinInterfaceResidueQuery(
    pdb_path = "tests/data/pdb/1ATN/1ATN_3w.pdb",
    chain_id1 = "A",
    chain_id2 = "B",
    targets = {
        "binary": 0
    },
    pssm_paths = {
        "A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
        "B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
    }
))
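The three calls above differ only in the model file and the target value, so their arguments can be generated in a loop. The following is a minimal sketch, assuming the same `tests/data` layout as above. The helper `build_query_kwargs` is hypothetical (not part of deeprank2) and only builds plain dictionaries, so it runs without deeprank2 installed.

```python
from pathlib import PurePosixPath


def build_query_kwargs(labels, complex_id="1ATN", data_root="tests/data"):
    """Build keyword arguments for one query per model.

    `labels` maps a model name (e.g. "1ATN_1w") to its binary target.
    """
    root = PurePosixPath(data_root)
    pssm_dir = root / "pssm" / complex_id
    kwargs_list = []
    for model, binary in labels.items():
        kwargs_list.append({
            "pdb_path": str(root / "pdb" / complex_id / f"{model}.pdb"),
            "chain_id1": "A",
            "chain_id2": "B",
            "targets": {"binary": binary},
            # One PSSM file per chain, shared by all models of the complex.
            "pssm_paths": {
                chain: str(pssm_dir / f"{complex_id}.{chain}.pdb.pssm")
                for chain in ("A", "B")
            },
        })
    return kwargs_list


labels = {"1ATN_1w": 0, "1ATN_2w": 1, "1ATN_3w": 0}
all_kwargs = build_query_kwargs(labels)
print(all_kwargs[0]["pdb_path"])
```

Each dictionary can then be unpacked into a query, for instance `queries.add(ProteinProteinInterfaceResidueQuery(**kwargs))`, using the same argument names as in the calls above.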
The user is free to implement a custom query class. Each implementation requires the `build` method to be present.
The queries can then be processed into graphs only or both graphs and 3D grids, depending on which kind of network will be used later for training.
from deeprank2.features import components, conservation, contact, exposure, irc, surfacearea
from deeprank2.utils.grid import GridSettings, MapMethod

feature_modules = [components, conservation, contact, exposure, irc, surfacearea]

# Save data into 3D-graphs only
hdf5_paths = queries.process(
    "<output_folder>/<prefix_for_outputs>",
    feature_modules = feature_modules)

# Save data into 3D-graphs and 3D-grids
hdf5_paths = queries.process(
    "<output_folder>/<prefix_for_outputs>",
    feature_modules = feature_modules,
    grid_settings = GridSettings(
        # the number of points on the x, y, z edges of the cube
        points_counts = [20, 20, 20],
        # x, y, z sizes of the box in Å
        sizes = [1.0, 1.0, 1.0]),
    grid_map_method = MapMethod.GAUSSIAN)
Data can be split into sets using custom splits, according to the specific application. Once the training, validation and testing ids have been chosen (keys of the HDF5 file/s), the `DeeprankDataset` objects can be defined.
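How the ids are divided is left to the user. As one possible approach, the sketch below performs a reproducible random split of a list of ids using only the standard library. The function `split_ids`, its default fractions, and the example ids are all hypothetical; a purely random split is not always appropriate (for example when several entries derive from the same complex), so treat this as a starting point rather than a recommendation.

```python
import random


def split_ids(ids, valid_fraction=0.2, test_fraction=0.1, seed=42):
    """Shuffle ids reproducibly and split them into train/valid/test lists."""
    if not 0 <= valid_fraction + test_fraction < 1:
        raise ValueError("valid_fraction + test_fraction must be in [0, 1)")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(ids)
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    n_test = round(len(shuffled) * test_fraction)
    valid_ids = shuffled[:n_valid]
    test_ids = shuffled[n_valid:n_valid + n_test]
    train_ids = shuffled[n_valid + n_test:]
    return train_ids, valid_ids, test_ids


# Hypothetical ids; in practice these would be keys of the HDF5 file/s.
example_ids = [f"entry_{i:03d}" for i in range(100)]
train_ids, valid_ids, test_ids = split_ids(example_ids)
print(len(train_ids), len(valid_ids), len(test_ids))
```

The three resulting lists can be passed as the `subset` argument of the dataset objects.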
For training GNNs the user can create a `GraphDataset` instance:
from deeprank2.dataset import GraphDataset

node_features = ["bsa", "res_depth", "hse", "info_content", "pssm"]
edge_features = ["distance"]
target = "binary"
# Replace <ids> with the chosen keys of the HDF5 file/s
train_ids = [<ids>]
valid_ids = [<ids>]
test_ids = [<ids>]

# Creating GraphDataset objects
dataset_train = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    node_features = node_features,
    edge_features = edge_features,
    target = target
)
dataset_val = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = valid_ids,
    train = False,
    dataset_train = dataset_train
)
dataset_test = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = test_ids,
    train = False,
    dataset_train = dataset_train
)
For training CNNs the user can create a `GridDataset` instance:
from deeprank2.dataset import GridDataset

features = ["bsa", "res_depth", "hse", "info_content", "pssm", "distance"]
target = "binary"
# Replace <ids> with the chosen keys of the HDF5 file/s
train_ids = [<ids>]
valid_ids = [<ids>]
test_ids = [<ids>]

# Creating GridDataset objects
dataset_train = GridDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    features = features,
    target = target
)
dataset_val = GridDataset(
    hdf5_path = hdf5_paths,
    subset = valid_ids,
    train = False,
    dataset_train = dataset_train,
)
dataset_test = GridDataset(
    hdf5_path = hdf5_paths,
    subset = test_ids,
    train = False,
    dataset_train = dataset_train,
)
Let's define a `Trainer` instance, using for example one of the already existing GNNs, `NaiveNetwork`. Because `NaiveNetwork` is a GNN, it requires a dataset instance of type `GraphDataset`.
from deeprank2.trainer import Trainer
from deeprank2.neuralnets.gnn.naive_gnn import NaiveNetwork

trainer = Trainer(
    NaiveNetwork,
    dataset_train,
    dataset_val,
    dataset_test
)
The same can be done using a CNN, for example `CnnClassification`. Here a dataset instance of type `GridDataset` is required.
from deeprank2.trainer import Trainer
from deeprank2.neuralnets.cnn.model3d import CnnClassification

trainer = Trainer(
    CnnClassification,
    dataset_train,
    dataset_val,
    dataset_test
)
By default, the `Trainer` class creates the folder `./output` for storing the prediction information collected during training and testing. `HDF5OutputExporter` is the exporter used by default, but the user can specify any other implemented exporter or implement a custom one.
The optimizer (`torch.optim.Adam` by default) and the loss function can be defined using dedicated functions:
import torch
trainer.configure_optimizers(torch.optim.Adamax, lr = 0.001, weight_decay = 1e-04)
Then the `Trainer` can be trained and tested. The best model in terms of validation loss is saved by default; the user can change this behavior or indicate where to save the model using the `filename` parameter of the `train()` method.
trainer.train(
    nepoch = 50,
    batch_size = 64,
    validate = True,
    filename = "<my_folder/model.pth.tar>")
trainer.test()
The data generation process is rather efficient within DeepRank2. As an example, we show the time required to process the tutorials' PDB files available at this Zenodo address. Atomic resolution and a `distance_cutoff` of 5.5 Å were used for both PPIs and single-residue variants (SRVs), and the SRVs' `radius` was set to 10 Å. The experiments were run on an Apple M1 Pro, using 1 CPU only. Measurements were computed on 100 data points for PPIs and 96 for SRVs.
| | Feature modules used | Comments | Data processing speed [seconds/structure] | Memory [megabyte/structure] |
|---|---|---|---|---|
| PPIs | `components`, `contact`, `exposure`, `irc`, `secondary_structure`, `surfacearea`, 33 features in total. | `conservation` feature module was not used because PSSM files were not available for the data. | graph only: 3.58 (std 0.27); graph+grid: 12.92 (std 1.39) | graph only: 0.54 (std 0.07); graph+grid: 16.09 (std 0.44) |
| SRVs | `components`, `contact`, `exposure`, `irc`, `secondary_structure`, `surfacearea`, 26 features in total. | Same as above for `conservation`. | graph only: 1.82 (std 0.37); graph+grid: 2.58 (std 0.66) | graph only: 0.05 (std 0.01); graph+grid: 8.52 (std 8.50) |
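The per-structure figures above can be turned into a rough budget for a larger dataset. The sketch below is plain arithmetic under two assumptions that the measurements do not guarantee: that cost scales linearly with the number of structures, and that the structures are comparable in size to the tutorial data. The function `estimate_cost` and the dataset size of 1000 are hypothetical.

```python
def estimate_cost(n_structures, seconds_per_structure, megabytes_per_structure):
    """Return (total minutes, total megabytes), assuming linear scaling on 1 CPU."""
    total_minutes = n_structures * seconds_per_structure / 60
    total_megabytes = n_structures * megabytes_per_structure
    return total_minutes, total_megabytes


# Mean values for PPIs reported above; 1000 structures is a hypothetical size.
ppi_graph = estimate_cost(1000, 3.58, 0.54)
ppi_graph_grid = estimate_cost(1000, 12.92, 16.09)

for label, (minutes, megabytes) in {
    "graph only": ppi_graph,
    "graph+grid": ppi_graph_grid,
}.items():
    print(f"{label}: ~{minutes:.0f} min, ~{megabytes:.0f} MB")
```

Under these assumptions, generating grids in addition to graphs for PPIs costs roughly three to four times more processing time and about thirty times more storage.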
- Branching
  - When creating a new branch, please use the following convention: `<issue_number>_<description>_<author_name>`.
- Pull Requests
  - When creating a pull request, please use the following convention: `<type>: <description>`. Example types are `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`, and others based on the Angular convention.
- Software release
  - Before creating a new package release, make sure to have updated all version strings in the source code. An easy way to do it is to run `bump2version [part]` from the command line after having installed bump2version in your local environment. Instead of `[part]`, type the part of the version to increase, e.g. minor. The settings in `.bumpversion.cfg` will take care of updating all the files containing version strings.
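The branch and pull request conventions above can be checked mechanically. The following is a minimal sketch using regular expressions; it is not part of the repository's tooling. Two assumptions are made: only the example types listed above are accepted (the convention allows others), and branch descriptions and author names consist of letters and digits separated by underscores. The function names are hypothetical.

```python
import re

# Example types listed above; the Angular convention allows others too.
PR_TYPES = ("fix", "feat", "build", "chore", "ci", "docs", "style", "refactor", "perf", "test")

# "<type>: <description>" with a non-empty description.
PR_TITLE = re.compile(rf"^({'|'.join(PR_TYPES)}): \S.*$")

# "<issue_number>_<description>_<author_name>": digits, then at least two
# further underscore-separated alphanumeric parts (an assumed character set).
BRANCH_NAME = re.compile(r"^\d+_[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+$")


def is_valid_pr_title(title):
    """Return True if the title follows '<type>: <description>'."""
    return PR_TITLE.match(title) is not None


def is_valid_branch_name(name):
    """Return True if the name follows '<issue_number>_<description>_<author_name>'."""
    return BRANCH_NAME.match(name) is not None


print(is_valid_pr_title("docs: clarify installation steps"))
print(is_valid_branch_name("123_fix_typo_jdoe"))
```

Note that the branch pattern cannot tell where the description ends and the author name begins; it only checks the overall shape.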