This repo contains the code for running synthon-based ROCS queries, as described in Shape-Aware Synthon Search (SASS) for virtual screening of synthon-based chemical spaces.
SASS is a synthon-based virtual screening method that carries out shape similarity searches in the synthon space instead of the enumerated product space. Queries are fragmented, and reaction synthons are scored against query fragments to prioritize top synthon combinations. A tiny fraction of the full library is then instantiated and scored, thereby avoiding full enumeration/scoring and significantly accelerating large-scale, shape-based virtual screening.
Configure config_template.yml, and run:
python query_main.py
Note that users should install necessary third-party libraries.
- synthons: Building blocks that can be directly joined to form products. Synthons are different from reactants in that synthons have labeled connector atoms, and have been modified from the reactants (e.g. leaving group removed).
- connector atom: A dummy atom (e.g. U, Np, Pu, Am) used to designate the connecting vector where two compatible synthons join.
- chemical space: All molecules generated from exhaustively instantiating all combinations of compatible synthons.
- query: The input molecule for which shape-similar molecules in the chemical space are to be searched.
- query fragment: Fragments of a query molecule, generated by cleaving bonds of the query.
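To make the difference between the synthon space and the chemical space concrete, here is a minimal sketch using a made-up two-position reaction. The synthon IDs and the dictionary layout are hypothetical and are not taken from this repo; the point is only that the number of synthons grows additively while the number of products grows multiplicatively.

```python
from itertools import product
from math import prod

# Hypothetical toy reaction with two synthon positions (IDs are made up).
synthons = {
    "position_1": ["s1_a", "s1_b", "s1_c"],
    "position_2": ["s2_a", "s2_b"],
}

# The chemical space is every combination of one synthon per position.
chemical_space = list(product(*synthons.values()))

# Synthon count grows additively with library size ...
n_synthons = sum(len(ids) for ids in synthons.values())
# ... while the enumerated product count grows multiplicatively.
n_products = prod(len(ids) for ids in synthons.values())

print(f"synthons to score: {n_synthons}")
print(f"products if fully enumerated: {n_products}")
```

With thousands of synthons per position, the gap between the two counts is what makes scoring in the synthon space attractive.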
The overall workflow of SASS is shown below:
[Figure: overall SASS workflow]
where each step roughly corresponds to a step in the config file under the tasks key.
There are 7 valid tasks for SASS:
- ground_truth: Exhaustively enumerate the chemical space defined by the input synthons.
  - input: Reaction and synthon files.
  - output: Score list written to file (pkl).
- gen_synthon_conformers: Generate conformers of the synthons. This differs from generating conformers for full products because it involves connector atom substitution (for non-ring-forming synthons) or ring completion/deletion (for ring-forming synthons).
  - input: Molecule files.
  - output: Conformer files in oez format.
- score_synthons: Score synthons against query fragments with ROCS.
  - input: Synthon conformers; query fragments.
  - output: A score dictionary saved to file (pkl).
- select_synthons: Numerically aggregate the individual synthon scores to form pseudo-product scores. Keep the top-m pseudo-products, along with information on their constituent synthons.
  - input: Synthon score files.
  - output: An array of pseudo-products saved to file (pkl).
- combine_products: Combine the top-m pseudo-products from different reactions.
  - input: Pseudo-product files.
  - output: One array of pseudo-products saved to file.
- instantiate_products: Instantiate the pseudo-products (SASS selections).
  - input: Pseudo-products file.
  - output: Molecule file(s).
- rescore_products: Generate conformers of the instantiated SASS selections, then run ROCS with a given query molecule against this conformer database.
  - input: Molecule file(s), query molecule.
  - output: Score list written to file.
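The select_synthons and combine_products steps can be illustrated with a small, self-contained sketch. This is not the repo's implementation: the function names, the data layout (one dict of synthon id to score per synthon position), and the use of a plain sum as the aggregation are all assumptions made for illustration. It also loops over every score combination, which is only reasonable for a toy example.

```python
import heapq
from itertools import product


def select_top_pseudo_products(synthon_scores, m):
    """Return the m best synthon combinations as (score, synthon_ids) pairs.

    synthon_scores: one dict per synthon position, mapping a synthon id to its
    score against the matching query fragment.
    Summing the per-synthon scores is an assumed aggregation scheme.
    """
    positions = [list(scores.items()) for scores in synthon_scores]
    candidates = (
        (sum(score for _, score in combo), tuple(sid for sid, _ in combo))
        for combo in product(*positions)
    )
    return heapq.nlargest(m, candidates)


def combine_top_pseudo_products(per_reaction_top, m):
    """Merge per-reaction selections and keep the overall top m.

    Returns (score, reaction_name, synthon_ids) triples, best first.
    """
    merged = [
        (score, reaction, ids)
        for reaction, top in per_reaction_top.items()
        for score, ids in top
    ]
    return heapq.nlargest(m, merged)


# Hypothetical synthon scores for two made-up reactions.
rxn_a = [{"a1": 0.75, "a2": 0.5}, {"b1": 0.25, "b2": 1.0}]
rxn_b = [{"c1": 1.0}, {"d1": 0.625, "d2": 0.125}]

per_reaction = {
    "rxn_a": select_top_pseudo_products(rxn_a, m=2),
    "rxn_b": select_top_pseudo_products(rxn_b, m=2),
}
overall = combine_top_pseudo_products(per_reaction, m=3)
for score, reaction, ids in overall:
    print(reaction, ids, score)
```

Only the pseudo-products that survive this selection would then be instantiated and rescored as full molecules.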
To run SASS in production mode, only the SASS query is needed. To run SASS in validation mode, both the SASS query result and the ground truth are needed.
- SASS query: run the following tasks: gen_synthon_conformers, score_synthons, select_synthons, combine_products, instantiate_products, rescore_products. If the synthon conformers have already been generated (e.g. when using the same synthon data but a different query molecule), users can omit the gen_synthon_conformers task in the config. The output after the final step is a pkl file containing the scores and joined synthon ids of the top products.
- Ground truth: run the following tasks: ground_truth. The output is a pkl file containing the scores and joined synthon ids of the top products.
- For validation, a SASS query only needs to be run until the end of the combine_products step, and the pseudo-product scores can be used directly to calculate the query performance (comparing to the ground truth).