Integration of molecular coarse-grained model into geometric representation learning framework for protein-protein complex property prediction
The original bioRxiv version link: [https://www.biorxiv.org/content/10.1101/2024.03.14.585015v1.abstract]
The accepted version link: [https://www.nature.com/articles/s41467-024-53583-w]
Basic Environment Configuration (Windows or Linux; the specific installation time depends on your system configuration):
- Python 3.9.18
- Pytorch 1.12.1
- CUDA tool kit 11.3.1
- Pytorch Geometric (PyG) for PyTorch 1.12.* and CUDA 11.3
- Pytorch-scatter 2.1.0
- Torchdrug 0.2.1
Step1. Generating the coarse-grained (CG) protein complex files as the initial input of MCGLPPI.
-
For both the pre-training phase and the downstream prediction phase, the framework requires the CG geometric parameters generated by MARTINI2 (https://pubs.acs.org/doi/10.1021/ct700324x) or MARTINI3 (https://www.nature.com/articles/s41592-021-01098-3) for each sample point as the input.
-
As MCGLPPI focuses on general protein-protein interaction (PPI) complex overall property prediction tasks, an explicit specification of the two interaction parts is required for each original PDB complex file (as a sample point). This can be easily achieved by setting the chain ID of each atom line to A (i.e., interaction part A) or B (i.e., interaction part B) in the corresponding PDB file, based on prior knowledge.
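For illustration, this relabeling can be scripted. The snippet below is a minimal sketch (a hypothetical helper, not a tool shipped with MCGLPPI); `part_a_chains`, the set of original chain IDs forming interaction part A, is an input you would define from prior knowledge.

```python
# Hypothetical helper (not part of MCGLPPI): rewrite the chain ID column of
# each ATOM/HETATM line so that all chains belonging to interaction part A
# become 'A' and the remaining chains become 'B'.

def relabel_chains(pdb_lines, part_a_chains):
    """Return PDB lines with the chain ID (column 22, index 21) set to 'A' or 'B'."""
    relabeled = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            new_chain = "A" if line[21] in part_a_chains else "B"
            line = line[:21] + new_chain + line[22:]
        relabeled.append(line)
    return relabeled
```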
-
The raw scripts for transforming a full-atom PDB file into the CG geometric parameters can be found at https://cgmartini.nl/docs/downloads/tools/proteins-and-bilayers.html, version 2.4 (for MARTINI2) and https://arxiv.org/abs/2212.01191 (for MARTINI3). On top of these, we provide further-optimized pipeline scripts and corresponding descriptions for generating the CG geometric parameters exactly in line with the input requirements of MCGLPPI, located under the /create_cg_geo_params/ path.
-
Briefly, after the CG transformation (applicable to both MARTINI2 and MARTINI3), three files per sample point will be read by the MCGLPPI CG Protein Class production script (as the input) to create the CG protein graph described in the manuscript: 1) PDB name-cg.pdb, containing the CG bead lines equivalent to the full-atom PDB file, which provide the particle type and coordinate information. 2) PDB name-cg_A.itp, containing the other geometric parameters used to calculate, e.g., the CG bond angles and dihedrals in interaction part A. 3) PDB name-cg_B.itp, analogous to PDB name-cg_A.itp but for interaction part B. We provide a simple demo in demo_samples/1a22/ as an example. The complete datasets provided below also follow this format (with the original full-atom PDB files added, in the format PDB name-aa.pdb, as an extra reference).
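As an illustration of what the PDB name-cg.pdb file carries, the sketch below extracts the bead name, interaction part, and coordinates from the CG bead (ATOM) lines. The column offsets follow the standard PDB ATOM record; the function name and returned layout are our own, not MCGLPPI's API.

```python
# Illustrative sketch only: pull bead names, interaction parts, and
# coordinates out of the CG bead lines of a "<PDB name>-cg.pdb" file.

def read_cg_beads(pdb_lines):
    """Return (bead_name, chain_id, (x, y, z)) for each CG bead line."""
    beads = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()   # bead name, e.g. 'BB' or 'SC1'
        chain = line[21]             # interaction part: 'A' or 'B'
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        beads.append((name, chain, xyz))
    return beads
```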
-
Thanks to the basic code logic of the Torchdrug repository (https://torchdrug.ai/), the first time the pre-training or downstream prediction scripts (see below) are run, they automatically parse the aforementioned three files for each sample point, then produce and store a pickle (.pkl) file containing all generated CG Protein Classes for the current dataset (e.g., cg2_pdbbind_strictdimer.pkl.gz in https://drive.google.com/file/d/14-0QF0b8JeXUU57yMzTdTPzvsNP_ayUv/view?usp=sharing, corresponding to a complete dataset folder in which each sub-folder represents one sample point), enabling rapid repeated use.
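On subsequent runs the cached file can be reloaded directly. A minimal sketch of the generic gzip + pickle loading pattern (the pickle layout itself is defined by Torchdrug/MCGLPPI internals, and the file name is only the example referenced above):

```python
# Generic gzip + pickle loading pattern for the cached dataset file; the
# pickle contents are defined by Torchdrug/MCGLPPI internals.
import gzip
import pickle

def load_cached_dataset(path):
    """Load a gzip-compressed pickle produced on the first parsing run."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

# e.g. dataset = load_cached_dataset("cg2_pdbbind_strictdimer.pkl.gz")
```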
Step2. CG diffusion-based pre-training using domain-domain interaction (DDI) templates based on the parsed pickle file.
-
Our assumption is that pre-training on DDI templates, which are critical subsets of PPIs in which the interaction typically occurs between domains rather than entire proteins, can enhance the model's ability on PPI binding-affinity-related prediction tasks with a relatively small sample amount.
-
The curated original dataset (including the full-atom PDB file and the corresponding three CG geometric files for each sample point) is provided at:
For full-atom and MARTINI2:
https://drive.google.com/file/d/1nYPwkMhpIrTifOLWL-NDpyLOrnuzczjE/view?usp=sharing
For MARTINI3:
https://drive.google.com/file/d/1S32LteGRSCCVPBM-Ig8MuVKibBMRqHiT/view?usp=sharing
Please note that these zipped files contain 50,359 3DID (https://3did.irbbarcelona.org/) sample points covering 15,983 DDI structure templates. However, we remove any DDI templates from the 3DID dataset that are identical to those present in our downstream datasets, so that 41,663 sample points remain for the main experiments in the manuscript. Based on this, we further provide the parsed pickle file for these 41,663 sample points (in which the original sample point names are included) for a quick start:
For MARTINI2:
https://drive.google.com/file/d/1FACIyhD-Jn1J6MpN7KuiU-uXo_gxgxh7/view?usp=sharing
For MARTINI3:
https://drive.google.com/file/d/1ptIM69OkHzsP-fJtsUBF1HTBccM3Iv29/view?usp=sharing
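For illustration, the overlap removal described above amounts to excluding any pre-training sample whose DDI template also appears in a downstream dataset. The sketch below expresses this as a set difference over template identifiers; the actual filtering in the manuscript matches the structure templates themselves, and the function name is hypothetical.

```python
# Hypothetical sketch of the DDI template overlap removal: drop every
# pre-training sample whose template identifier occurs downstream.

def filter_pretraining_set(pretrain_samples, downstream_templates):
    """Keep pre-training samples whose template never occurs downstream.

    pretrain_samples: iterable of (sample_name, template_id) pairs.
    """
    banned = set(downstream_templates)
    return [name for name, template in pretrain_samples if template not in banned]
```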
-
After preparing the parsed pickle file, we can run the corresponding scripts for MARTINI2 or MARTINI3 to pre-train the CG GearNet-Edge protein graph encoder (https://github.com/DeepGraphLearning/GearNet) for the downstream use:
MARTINI2 (run 1st and then run 2nd):
python cg_steps/cg_pretrain.py -c config/ppi_cg/cgdiff_1st.yaml
python cg_steps/cg_pretrain.py -c config/ppi_cg/cgdiff_2nd.yaml
MARTINI3 (run 1st and then run 2nd):
python cg3_steps/cg3_pretrain.py -c config/ppi_cg3/cg3diff_1st.yaml
python cg3_steps/cg3_pretrain.py -c config/ppi_cg3/cg3diff_2nd.yaml
We provide the pre-trained CG graph encoder (based on the aforementioned 41,663 MARTINI2 3DID subset) as an example (in pretrained_cgmodels, below is the checkpoint name):
cgdiff_seed0_gamma0.2_bs64_epoch50_dim256_length150_radius5_extra_step2_0_ls3did_fepoch200_bbfeatsFalse_miFalse.pth
The one below is the encoder pre-trained on the further-reduced 33,144-sample subset described in the original manuscript:
cgdiff_seed0_gamma0.2_bs64_epoch50_dim256_length150_radius5_extra_step2_0_lss3did_fepoch200_bbfeatsFalse_miFalse.pth
-
As a reference for the corresponding pre-training scripts at their original scale, we also clone them in this repository (in SiamDiff, original link: https://github.com/DeepGraphLearning/SiamDiff/tree/main).
Step3. Downstream complex overall property predictions, with or without the pre-trained CG graph encoder.
(1) dG predictions based on the PDBbind strict dimer dataset
-
This dataset only contains the strict dimer complexes curated from the PDBbind v2020 database (http://www.pdbbind.org.cn/download/pdbbind_2020_intro.pdf). We provide relevant files as follows:
-
Original data for full-atom and MARTINI2:
https://drive.google.com/file/d/1o8bDAZdQg-sRKdWpEA_5jRv05l0RwyRv/view?usp=sharing
-
Original data for MARTINI3:
https://drive.google.com/file/d/1pgPsGvvT3zfvaMfmSj5COqtIUMC1m471/view?usp=sharing
-
All dG labels for corresponding complex structures:
PDBBINDdimer_strict_index.csv in downstream_files/PDBBIND/
-
We also provide the corresponding pickle files for a quick start:
MARTINI2 and MARTINI3:
https://drive.google.com/file/d/14-0QF0b8JeXUU57yMzTdTPzvsNP_ayUv/view?usp=sharing
-
After preparing the source data for MCGLPPI, the .yaml execution scripts can be used to evaluate model performance under different data splitting settings. Please follow the instructions in the corresponding scripts to set the hyper-parameters for the evaluation configurations (the data splitting file ['index_path' argument] and whether to use the pre-trained graph encoder checkpoint ['model_checkpoint' argument] can both be specified in these scripts). Running examples, including training and evaluation, are as follows:
Example 1 (in a standard tenfold cross-validation (CV) setting):
MARTINI2:
python cg_steps/cg_downstream_1gpu_10CV_GBT.py -c config/ppi_cg/cg_pdbbind_gearnet_gbt_10CV.yaml
MARTINI3:
python cg3_steps/cg3_downstream_1gpu_10CV_GBT.py -c config/ppi_cg3/cg3_pdbbind_gearnet_gbt_10CV.yaml
Example 2 (in an overall TM-score-based splitting [<0.45: test set, 0.45~0.55: validation set, >0.55: training set]):
MARTINI2:
python cg_steps/cg_downstream_1gpu_10CV_GBT.py -c config/ppi_cg/cg_pdbbind_gearnet_gbt_TMscore.yaml
MARTINI3:
python cg3_steps/cg3_downstream_1gpu_10CV_GBT.py -c config/ppi_cg3/cg3_pdbbind_gearnet_gbt_TMscore.yaml
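The TM-score-based splitting in Example 2 can be sketched as below. In practice the assignment comes from the splitting file passed via the 'index_path' argument; placing the boundary values 0.45 and 0.55 into the validation set is our assumption for illustration.

```python
# Sketch of the overall TM-score-based splitting (<0.45: test,
# 0.45~0.55: validation, >0.55: training); boundary handling is assumed.

def tm_score_split(samples):
    """samples: iterable of (name, tm_score) -> dict of split name lists."""
    splits = {"train": [], "valid": [], "test": []}
    for name, tm in samples:
        if tm < 0.45:
            splits["test"].append(name)
        elif tm > 0.55:
            splits["train"].append(name)
        else:
            splits["valid"].append(name)
    return splits
```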
(2) dG predictions based on the ATLAS dataset
-
The used dataset is curated from https://onlinelibrary.wiley.com/doi/full/10.1002/prot.25260 and https://github.com/weng-lab/ATLAS/blob/master/README.md.
The purpose of examining MCGLPPI on this dataset is to check the feasibility of handling more complex geometric binding patterns (beyond strict dimers) for identifying similar structures generated by computational simulation.
-
We provide the relevant original data as follows:
-
Original data for full-atom and MARTINI2:
https://drive.google.com/file/d/1SA7fXpbF2r6co7KkPxn0LSeCe_Noau-J/view?usp=sharing
-
Original data for MARTINI3:
https://drive.google.com/file/d/1nai954uqdn47ZlOI5l_wfguIGgX3l4IA/view?usp=sharing
-
All dG labels for corresponding complex structures:
ATLAS.csv in downstream_files/ATLAS/
-
We also provide the corresponding pickle files for a quick start:
MARTINI2 and MARTINI3:
https://drive.google.com/file/d/1dWUoIPK_F4C_hKY5cG5aM1cQJS13azuv/view?usp=sharing
-
Running examples:
Example 1 (in a standard tenfold cross-validation (CV) setting):
MARTINI2:
python cg_steps/cg_downstream_1gpu_10CV_GBT.py -c config/ppi_cg/cg_atlas_gearnet_gbt_10CV.yaml
MARTINI3:
python cg3_steps/cg3_downstream_1gpu_10CV_GBT.py -c config/ppi_cg3/cg3_atlas_gearnet_gbt_10CV.yaml
(3) ddG predictions based on the AB-bind dataset
-
We demonstrate the potential of MCGLPPI to be extended to directly predicting ddG with simple modifications (see the manuscript for processing details). The ddG dataset used is AB-bind (https://pubmed.ncbi.nlm.nih.gov/26473627/), a multiple-point mutation dataset containing 1,101 sample points describing the binding affinity change (i.e., ddG) caused by multiple-point amino acid (AA) mutations on complexes formed by antibody or antibody-like binding. We provide relevant files as follows:
-
Original data for full-atom and MARTINI2:
https://drive.google.com/file/d/1vxGXXhtYJw_QmZi9PBjNH8U-AwYsAj5L/view?usp=sharing
-
All ddG labels for corresponding wild-type (WT) - mutant (MT) complex structure pairs:
M1101_label.csv in downstream_files/M1101/
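As background on the label convention, ddG is the binding free-energy change upon mutation, i.e., the mutant complex dG minus the wild-type complex dG. A trivial sketch (the actual labels are taken from M1101_label.csv, not computed this way):

```python
# ddG label convention: ddG = dG(MT complex) - dG(WT complex);
# a positive ddG indicates that the mutation weakens binding.

def ddg(dg_mutant, dg_wildtype):
    """Binding free-energy change caused by mutation (kcal/mol, typically)."""
    return dg_mutant - dg_wildtype
```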
-
We also provide the corresponding pickle file for a quick start:
MARTINI2:
https://drive.google.com/file/d/1LgR-CD7H4pUTXlWAcZ3kNzv9r05DVJwe/view?usp=sharing
-
Running example:
Example 1 (in a WT protein-protein complex type-based fivefold CV setting,
see https://academic.oup.com/bib/article/24/5/bbad310/7256145 for details,
the splitting file is retrieved from https://github.com/arantir123/MpbPPI):
MARTINI2:
python cg_steps_energy_injection/cg_downstream_1gpu_10CV_GBT.py -c config/ppi_cg/cg_m1101_gearnet_gbt_WTtype.yaml
(4) Protein-protein complex interface classifications based on the MANY/DC dataset
-
This dataset is used to examine the model's ability to distinguish biological interfaces from crystal artefacts (https://www.nature.com/articles/s41467-021-27396-0). We provide relevant files as follows:
-
Original data for full-atom and MARTINI2:
https://drive.google.com/file/d/18oEzeiqKT7tf7f9krFxCQ8QP-o8lLwar/view?usp=sharing
-
Original data for MARTINI3:
https://drive.google.com/file/d/19R6QQiT2NDC94Pv15pgolm_-cLfUXryy/view?usp=sharing
-
All binary classification labels for corresponding complex structures:
MANYDC.csv in downstream_files/MANYDC/
-
We also provide the corresponding pickle files for a quick start:
MARTINI2 and MARTINI3:
https://drive.google.com/file/d/1DiS9WX8zqKTWolEg4Ebm-fAziMiJvjqR/view?usp=sharing
-
Running examples:
Example 1 (in a conventional splitting where 80% of the MANY data points are selected as the training set and all DC data points form the test set):
MARTINI2:
python cg_steps/cg_downstream_1gpu_GBT_stats_cal.py -c config/ppi_cg/cg_manydc_gearnet_DCtest.yaml
MARTINI3:
python cg3_steps/cg3_downstream_1gpu_GBT_stats_cal.py -c config/ppi_cg3/cg3_manydc_gearnet_DCtest.yaml
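The conventional splitting in Example 1 can be sketched as below. The random seed and the handling of the remaining 20% of MANY entries are our assumptions, not fixed by this repository.

```python
# Sketch of the conventional MANY/DC splitting: a random 80% of the MANY
# entries form the training set, and every DC entry is used for testing.
import random

def many_dc_split(many_ids, dc_ids, seed=0):
    """Return (train_ids, test_ids) for the 80% MANY / full DC setting."""
    rng = random.Random(seed)
    shuffled = list(many_ids)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    return shuffled[:n_train], list(dc_ids)
```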
We have provided:
(1) our pre-processed CG downstream source data pickle files for quickly using MCGLPPI (e.g., the pickle files for the PDBbind strict dimer dataset in the link above),
(2) pre-trained CG graph encoder (in pretrained_cgmodels, below is the checkpoint name),
cgdiff_seed0_gamma0.2_bs64_epoch50_dim256_length150_radius5_extra_step2_0_ls3did_fepoch200_bbfeatsFalse_miFalse.pth
(3) corresponding .yaml execution scripts (e.g., ones based on the PDBbind strict dimer dataset) for a quick start.
Please follow the instructions in the corresponding .yaml files (e.g., config/ppi_cg/cg_pdbbind_gearnet_gbt_10CV.yaml) to set the hyper-parameters for the downstream evaluation configurations (supporting both training from scratch and fine-tuning from the pre-trained CG graph encoder checkpoint).
A running example (including the training and evaluation that produce the evaluation results on the test set):
After the environment configuration, the demo code usually takes dozens of minutes to finish. The evaluation results may vary depending on the actual installed virtual environment and the supporting hardware.
python cg_steps/cg_downstream_1gpu_10CV_GBT.py -c config/ppi_cg/cg_pdbbind_gearnet_gbt_10CV.yaml
(whether to use the pre-trained CG graph encoder checkpoint can be directly specified via the 'model_checkpoint' argument in the above .yaml file; if it is not specified, training from scratch is executed)
Please note that some of the data subsets used are small due to the limited availability of complex structures and their experimental labels, so the evaluation results on these subsets might vary across different hardware and environments. Nevertheless, a better trade-off between computational overhead and chemically plausible interaction descriptions within complexes can be guaranteed. Please check the original manuscript link for complete evaluation result reports.
Meta-learning enables complex cluster-specific few-shot binding affinity prediction for protein-protein interactions
The accepted version link: [https://pubs.acs.org/doi/10.1021/acs.jcim.4c01607]
Basic Environment Configuration (Windows or Linux; the specific installation time depends on your system configuration):
- Python 3.9.18
- Pytorch 1.12.1
- CUDA tool kit 11.3.1
- Pytorch Geometric (PyG) for PyTorch 1.12.* and CUDA 11.3
- Pytorch-scatter 2.1.0
- Torchdrug 0.2.1
We have provided (for how to generate the CG protein complex files as the basic model input, please check the MCGLPPI illustration above):
(1) our curated more comprehensive PPI dG dataset (including original data for full-atom and MARTINI2),
This dataset contains selected PPI structure samples originating from 1) SKEMPI 2.0 (https://life.bsc.es/pid/skempi2/), 2) PDBbind v2020 (http://www.pdbbind.org.cn/), 3) ATLAS (https://pubmed.ncbi.nlm.nih.gov/28160322/), and 4) the Kastritis et al. database (https://pubs.acs.org/doi/full/10.1021/pr9009854).
https://drive.google.com/file/d/1j7kj14wLFpACixuqjCIf1ztErK4sULML/view?usp=sharing
(2) our pre-processed CG source data pickle file (based on MARTINI2),
https://drive.google.com/file/d/1NpgCtAmIcyiUjjbcfq7uSfGymjQMR-IK/view?usp=sharing
(3) all dG labels for corresponding complex structures (including the pre-calculated original training cluster splitting features),
MAML_complete_index.csv in downstream_files/MAML/
(4) pre-trained CG graph encoder [in pretrained_cgmodels, using 45,714 (out of 50,359) 3DID sample points for pre-training, below is the checkpoint name],
cgdiff_seed0_gamma0.2_bs64_epoch50_dim256_length150_radius5_extra_step2_0_l3did_fepoch200_bbfeatsFalse_miFalse.pth
(5) corresponding running scripts for a quick start.
Please follow the instructions in config/ppi_cg_maml/cg_maml_gearnet_reg.yaml to set the hyper-parameters for the downstream evaluation configurations.
A running example:
After the environment configuration, the demo code usually takes dozens of minutes to finish. The evaluation results may vary depending on the actual installed virtual environment and the supporting hardware.
python cg_maml_steps/cg_downstream_1gpu_maml_reg.py -c config/ppi_cg_maml/cg_maml_gearnet_reg.yaml