MUBD-DecoyMaker 3.0: Making Maximal Unbiased Benchmarking Data Sets with Deep Reinforcement Learning
MUBD-DecoyMaker 3.0 is a new computational tool for building Maximal Unbiased Benchmarking Data Sets (MUBD) for in silico screening. Compared with the two earlier versions, i.e. MUBD-DECOYMAKER (the Pipeline Pilot-based version, or MUBD-DecoyMaker 1.0) and MUBD-DecoyMaker 2.0, MUBD-DecoyMaker 3.0 has two noteworthy features:
- Virtual molecules generated by a recurrent neural network (RNN)-based molecular generator with reinforcement learning (RL), instead of molecules from chemical libraries, constitute the unbiased decoy set (UDS) component of MUBD.
- The criteria (or rules) for an ideal decoy defined in the earlier versions are integrated into a new scoring function that RL uses to fine-tune the generator.
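To make the second point concrete, here is a minimal sketch of how several decoy criteria, each expressed as a score between 0 and 1, could be folded into a single reward for RL. The weighted geometric mean, the function name combine_scores and the example component scores are illustrative assumptions, not the scoring function shipped with MUBD-DecoyMaker 3.0.

```python
import math


def combine_scores(scores, weights):
    """Weighted geometric mean of component scores in [0, 1].

    Hypothetical aggregation for illustration only: a single failed
    criterion (score 0) drives the whole reward to 0.
    """
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and equal in length")
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("each score must lie in [0, 1]")
    if any(w <= 0.0 for w in weights):
        raise ValueError("weights must be positive")
    if any(s == 0.0 for s in scores):
        return 0.0
    total = sum(weights)
    return math.exp(sum(w * math.log(s) for s, w in zip(scores, weights)) / total)


# Made-up component scores: property match and dissimilarity to the ligand.
reward = combine_scores([0.9, 0.4], [1.0, 1.0])
```

With a multiplicative form like this, a candidate cannot compensate for failing one criterion by scoring highly on another, which is one reason such aggregations are common in RL-based molecular design.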
The steps below describe how to set up and run MUBD-DecoyMaker 3.0.
Because REINVENT is used to generate the virtual decoys of MUBD 3.0, users need to install the conda environment reinvent.v3.2. Please note that we have modified the packages reinvent_chemistry and reinvent_scoring in this repository to include our MUBD-specific scoring functions:
- Clone this repository and change into its directory.
- Merge the modifications into the original reinvent_chemistry and reinvent_scoring:
$ cp -r reinvent_chemistry/ reinvent_scoring/ ~/anaconda3/envs/reinvent.v3.2/lib/python3.7/site-packages
- Create a conda environment called MUBD3.0 (for preprocessing and postprocessing):
$ conda env create -f MUBD3.0.yml
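After copying the modified packages, it can be worth confirming that Python inside reinvent.v3.2 actually imports them from the site-packages directory used as the copy target. The following is a minimal sketch using only the standard library; the helper package_dir is hypothetical and not part of this repository. Run it with the reinvent.v3.2 environment activated.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    locations = spec.submodule_search_locations
    if locations:
        return Path(list(locations)[0])
    return Path(spec.origin).parent if spec.origin else None


# Print where each modified package resolves in the active environment.
for pkg in ("reinvent_chemistry", "reinvent_scoring"):
    print(pkg, "->", package_dir(pkg) or "not found")
```

If either path points somewhere other than the environment's site-packages, the copy step probably targeted a different environment or Python version.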
ACM Agonists is used as a test case to demonstrate how to build the MUBD-ACM-AGO data set with MUBD-DecoyMaker 3.0. All the test files are included in the resources directory.
Run get_ligands.py to process the raw ligand set. This script takes the raw ligands in SMILES format (raw_actives.smi) as input and outputs the unbiased ligand set Diverse_ligands.csv. Four additional property profiles, Diverse_ligands_PS.csv, Diverse_ligands_PS_maxmin.csv, Diverse_ligands_sims_maxmin.txt and Diverse_ligands_len.txt, are also written. Please use the --cure option to preprocess the SMILES if they have not been curated beforehand.
$ conda activate MUBD3.0
(MUBD3.0) $ python get_ligands.py
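A quick way to confirm that the ligand step finished is to check for the five output files named above. This is a minimal sketch; the helper missing_outputs is hypothetical, and it assumes the files are written to the directory you pass in (the current directory by default), which may differ from where get_ligands.py actually writes them.

```python
from pathlib import Path

# Output file names as listed in this README.
EXPECTED_OUTPUTS = [
    "Diverse_ligands.csv",
    "Diverse_ligands_PS.csv",
    "Diverse_ligands_PS_maxmin.csv",
    "Diverse_ligands_sims_maxmin.txt",
    "Diverse_ligands_len.txt",
]


def missing_outputs(directory="."):
    """Return the expected get_ligands.py outputs that are absent or empty."""
    base = Path(directory)
    return [
        name
        for name in EXPECTED_OUTPUTS
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


print(missing_outputs() or "all ligand files present")
```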
mk_config.py writes out the configuration for MUBD3.0 virtual decoy generation. To set up the configuration for each ligand automatically and then proceed to the next ligand, we provide gen_decoys.sh. Please replace </path/to/REINVENT> and </path/to/MUBD3.0> in the scripts with your own directories.
$ mkdir output
$ chmod +x ./gen_decoys.sh
$ conda activate reinvent.v3.2
(reinvent.v3.2) $ ./gen_decoys.sh
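Because generation runs once per ligand and can take a while, it may help to check which runs have produced their results file before moving on. The sketch below scans output/ for ligand_* directories and looks for results/scaffold_memory.csv in each. The helper decoy_run_status is hypothetical, and treating a missing file as an unfinished run is an assumption.

```python
from pathlib import Path


def decoy_run_status(output_dir="output"):
    """Split ligand_* run directories into finished and unfinished ones."""
    finished, unfinished = [], []
    for run_dir in sorted(Path(output_dir).glob("ligand_*")):
        if not run_dir.is_dir():
            continue
        result = run_dir / "results" / "scaffold_memory.csv"
        # A run counts as finished once its scaffold memory file exists.
        (finished if result.is_file() else unfinished).append(run_dir.name)
    return finished, unfinished


finished, unfinished = decoy_run_status()
print(f"{len(finished)} finished, unfinished: {unfinished}")
```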
After decoy generation, the potential decoy set for each ligand_$idx is stored in output/ligand_$idx/results/scaffold_memory.csv. Decoy refinement, which includes SMILES curation and molecular clustering, is then performed to obtain the unbiased decoy set Final_decoys.csv. We provide process_decoys.sh to run agglomerative_clustering.py and pool_decoys.py automatically.
$ chmod +x ./process_decoys.sh
$ conda activate MUBD3.0
(MUBD3.0) $ ./process_decoys.sh
Basic validation is conducted based on four metrics. Please go through the notebook basic_validation.ipynb for more details.
$ conda activate MUBD3.0
(MUBD3.0) $ jupyter notebook