Evolutionary algorithm for molecular properties optimization

License

Notifications You must be signed in to change notification settings

jyryu3161/EvoMol

 
 


EvoMol

Installation

EvoMol was designed on Ubuntu (18.04+). Some features might be missing on other systems; in particular, drawing exploration trees is currently unavailable on Windows.

To install EvoMol on your system, run the appropriate commands in your terminal. The installation depends on Anaconda.

Linux

$ git clone https://github.com/jules-leguy/EvoMol.git     # Clone EvoMol
$ cd EvoMol                                               # Move into EvoMol directory
$ conda env create -f evomol_env.yml                      # Create conda environment
$ conda activate evomolenv                                # Activate environment
$ python -m pip install .                                 # Install EvoMol

Windows

$ git clone https://github.com/jules-leguy/EvoMol.git     # Clone EvoMol
$ cd EvoMol                                               # Move into EvoMol directory
$ conda env create -f evomol_env_windows.yml              # Create conda environment
$ conda activate evomolenv                                # Activate environment
$ python -m pip install .                                 # Install EvoMol

Quickstart

Launch a QED optimization for 500 steps. The evomolenv conda environment must be activated whenever you use EvoMol.

from evomol import run_model
run_model({
    "obj_function": "qed",
    "optimization_parameters": {
        "max_steps": 500
    },
    "io_parameters": {
        "model_path": "examples/1_qed"
    },
})

Model parametrization

To run a model, you need to pass a dictionary describing the run to the run_model function. This dictionary can have up to 4 entries that are described in this section.

Default values are represented in bold.

Objective function

The "obj_function" attribute can take the following values. Multi-objective functions can be nested to any depth.

  • Implemented functions: "qed", "plogp", "norm_plogp", "sascore", "norm_sascore", "clscore", "homo", "lumo" (see EvoMol article). "entropy_ifg", "entropy_gen_scaffolds", "entropy_shg_1" and "entropy_checkmol" can be used to maximize the entropy of descriptors, using IFGs, Murcko generic scaffolds, level 1 shingles and checkmol, respectively.
  • A custom function evaluating a SMILES. It is also possible to give a tuple (function, string function name).
  • A dictionary describing a multi-objective function containing the following entries.
    • "type" :
      • "linear_combination" (linear combination of the properties)
      • "product" (product of properties)
      • "sigm_lin" (passing the value of a unique objective through a linear function and a sigmoid function)
      • "product_sigm_lin" (product of the properties after passing each one through a linear function and a sigmoid function)
      • "gaussian" (passing the value of a unique objective function through a Gaussian function)
      • "opposite" (computing the opposite value of a unique objective function)
    • "functions" : list of functions (string keys describing implemented functions, custom functions or multi-objective functions).
    • Specific to the linear combination.
      • "coef" : list of coefficients.
    • Specific to the use of sigmoid/linear functions
      • "a" : list of a coefficients for the ax+b linear function definition.
      • "b" : list of b coefficients for the ax+b linear function definition.
      • "lambda" : list of λ coefficients for the sigmoid function definition.
    • Specific to the use of Gaussian functions
      • "mu" : μ parameter of the Gaussian
      • "sigma" : σ parameter of the Gaussian
  • "guacamol_v2" to run the goal-directed GuacaMol benchmarks.
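As a sketch of how these entries can be combined, the dictionaries below nest a Gaussian-shaped objective inside a linear combination. The nitrogen_count function, the coefficients and the output path are illustrative assumptions, not part of EvoMol. The sketch also assumes that a custom function receives a SMILES string and returns a number.

```python
def nitrogen_count(smiles: str) -> int:
    """Hypothetical custom objective: crude count of nitrogen atoms in a SMILES string."""
    # Counts aliphatic (N) and aromatic (n) nitrogen symbols by plain string matching.
    return smiles.count("N") + smiles.count("n")


# Gaussian transform of a single objective (assumed here to favor values near mu).
nitrogen_near_two = {
    "type": "gaussian",
    "functions": [nitrogen_count],
    "mu": 2,
    "sigma": 1,
}

# Linear combination of an implemented function and the nested objective above.
objective = {
    "type": "linear_combination",
    "functions": ["qed", nitrogen_near_two],
    "coef": [1, 0.5],
}


def launch(obj_function, model_path="examples/custom_objective"):
    """Run EvoMol with the given objective; the model path is a hypothetical example."""
    # Imported lazily so the dictionaries above can be built without EvoMol installed.
    from evomol import run_model

    run_model({
        "obj_function": obj_function,
        "optimization_parameters": {"max_steps": 100},
        "io_parameters": {"model_path": model_path},
    })
```

Calling launch(objective) inside the evomolenv environment would start the run. A custom function can also be passed on its own, for instance as the tuple (nitrogen_count, "n_count").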

Search space

The "action_space_parameters" attribute can be set with a dictionary containing the following entries.

  • "atoms" : text list of available heavy atoms ("C,N,O,F,P,S,Cl,Br").
  • "max_heavy_atoms": maximum molecular size in terms of number of heavy atoms (38).
  • "substitution": whether to use substitute atom type action (True).
  • "cut_insert": whether to use cut atom and insert carbon atom actions (True).
  • "move_group": whether to use move group action (True).
  • "use_rd_filters": whether to use the rd_filter program as a quality filter before inserting the mutated individuals in the population (False).

Optimization parameters

The "optimization_parameters" attribute can be set with a dictionary containing the following entries.

  • "pop_max_size" : maximum population size (1000).
  • "max_steps" : number of steps to be run before stopping EvoMol (1500).
  • "max_obj_calls" : number of calls to the objective functions before stopping EvoMol (float("inf")).
  • "k_to_replace" : number of individuals replaced at each step (2).
  • "selection" : whether the best individuals are selected to be mutated ("best") or they are selected randomly ("random").
  • "problem_type" : whether it is a maximization ("max") or minimization ("min") problem.
  • "mutation_max_depth" : maximum number of successive actions on the molecular graph during a single mutation (2).
  • "mutation_find_improver_tries" : maximum number of mutations to find an improver (50).
  • "guacamol_init_top_100" : whether to initialize the population with the 100 best scoring individuals of the GuacaMol ChEMBL subset when running the GuacaMol benchmarks (False). The list of SMILES must be given as initial population.
  • "mutable_init_pop" : if True, the individuals of the initial population can be freely mutated. If False, they can be branched but their atoms and bonds cannot be modified (True).
  • "n_max_desc" : maximum number of descriptors that can be handled when using an evaluator that relies on a vector of descriptors, such as entropy contribution (3,000,000).
  • "shuffle_init_pop" : whether to shuffle the SMILES at initialization.
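The entries above can be combined into a single dictionary. The following sketch sets up a minimization of the SAscore with both a step limit and a limit on objective calls; the specific values and the output path are arbitrary assumptions chosen for illustration.

```python
config = {
    "obj_function": "sascore",
    "optimization_parameters": {
        "problem_type": "min",       # treat the objective as a minimization problem
        "selection": "random",       # pick individuals to mutate at random
        "pop_max_size": 500,
        "max_steps": 1000,           # stop after this many steps...
        "max_obj_calls": 5000,       # ...or after this many objective evaluations
        "k_to_replace": 5,
        "mutation_max_depth": 1,     # a single action per mutation
    },
    "io_parameters": {
        "model_path": "examples/sascore_min",  # hypothetical output directory
    },
}


def launch(run_config):
    """Start the run; requires the evomolenv environment."""
    # Lazy import so the dictionary can be inspected without EvoMol installed.
    from evomol import run_model

    run_model(run_config)
```

Entries that are left out keep their default values, so only the parameters that differ from the defaults need to be listed.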

Input/Output parameters

The "io_parameters" attribute can be set with a dictionary containing the following entries.

  • "model_path" : path where to save model's output data ("EvoMol_model").
  • "smiles_list_init": list of SMILES describing the initial population (None: interpreting the "smiles_list_init_path" attribute). Note : not available when taking GuacaMol benchmarks.
  • "smiles_list_init_path" : path where to find the SMILES list text file describing the initial population (None: initialization of the population with a single methane molecule).
  • "external_tabu_list": list of SMILES that won't be generated by EvoMol.
  • "record_history" : whether to save exploration tree data. Must be set to True to later draw the exploration tree (False).
  • "record_all_generated_individuals" : whether to record a list of all individuals that are generated during the entire execution (even if they fail the objective function computation or if they are not inserted in the population as they are not improvers). The step number and the total number of calls to the objective function at the time of generation are also recorded.
  • "save_n_steps" : frequency (steps) of saving the data (100).
  • "print_n_steps" : frequency (steps) of printing current population statistics (1).
  • "dft_working_dir" : path where to save DFT optimization related files ("/tmp").
  • "dft_cache_files" : list of json files containing a cache of previously computed HOMO or LUMO values ([]).
  • "evaluation_strategy_parameters" : a dictionary with two entries. "evaluate_init_pop" sets the parameters given to the EvaluationStrategy instance when evaluating the initial population, and "evaluate_new_sol" sets the parameters used when evaluating new solutions during the optimization process. If None, both keys are set to an empty set of parameters (None).
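As an illustration of these entries, the sketch below writes an initial population to a text file and references it from an "io_parameters" dictionary. It assumes the SMILES list file holds one SMILES per line; the helper name, the temporary directory and the tabu list contents are hypothetical.

```python
import tempfile
from pathlib import Path


def write_smiles_file(smiles_list, path):
    """Write one SMILES per line (assumed file layout) and return the path as a string."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(smiles_list) + "\n")
    return str(path)


# Temporary working directory used only for this example.
workdir = Path(tempfile.mkdtemp())
init_path = write_smiles_file(["O=C(C)Oc1ccccc1C(=O)O"], workdir / "init_pop.smi")

io_parameters = {
    "model_path": str(workdir / "model"),   # where the output data is saved
    "smiles_list_init_path": init_path,     # initial population read from file
    "external_tabu_list": ["C", "CC"],      # SMILES that EvoMol must not generate
    "record_history": True,                 # needed to draw the exploration tree later
    "save_n_steps": 50,
}
```

The resulting dictionary can be passed as the "io_parameters" entry of the dictionary given to run_model.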

Examples

Drawing exploration trees

Large exploration tree

Performing a QED optimization run of 500 steps, while recording the exploration data.

from evomol import run_model

model_path = "examples/2_large_tree"

run_model({
    "obj_function": "qed",
    "optimization_parameters": {
        "max_steps": 500},
    "io_parameters": {
        "model_path": model_path,
        "record_history": True
    }
})

Plotting the exploration tree with solutions colored according to their score. Nodes represent solutions. Edges represent mutations that lead to an improvement in the population.

from evomol.plot_exploration import exploration_graph
exploration_graph(model_path=model_path, layout="neato")

[Figure: large exploration tree]

Detailed exploration tree

Performing the experiment of mutating a fixed core of acetylsalicylic acid to increase its QED value.

from evomol import run_model

model_path = "examples/3_detailed_tree"

run_model({
    "obj_function": "qed",
    "optimization_parameters": {
        "max_steps": 10,
        "pop_max_size": 10,
        "k_to_replace": 2,
        "mutable_init_pop": False
    },
    "io_parameters": {
        "model_path": model_path,
        "record_history": True,
        "smiles_list_init_path": "examples/acetylsalicylic_acid.smi"
    }
})

Plotting the exploration tree including molecular drawings, scores and action types performed during mutations. Also plotting a table of molecular drawings.

from evomol.plot_exploration import exploration_graph

exploration_graph(model_path=model_path, layout="dot", draw_actions=True, plot_images=True, draw_scores=True,
                  root_node="O=C(C)Oc1ccccc1C(=O)O", legend_scores_keys_strat=["total"], mol_size=0.3,
                  legend_offset=(-0.007, -0.05), figsize=(20, 20/1.5), legends_font_size=13)

[Figure: detailed exploration tree]

[Figure: detailed molecular drawings table]

Entropy and multi-objective optimization

Optimizing jointly the QED and the entropy of IFGs using a linear combination. The weights are set respectively to 1 and 10.

from evomol import run_model

model_path = "examples/4_entropy_optimization"

run_model({
    "obj_function": {
        "type": "linear_combination",
        "functions": ["qed", "entropy_ifg"],
        "coef": [1, 10]
    },
    "optimization_parameters": {
        "max_steps": 500,
        "pop_max_size": 1000
    },
    "io_parameters": {
        "model_path": model_path,
        "record_history": True
    },
})

Plotting the exploration trees representing the QED values.

from evomol.plot_exploration import exploration_graph
exploration_graph(model_path=model_path, layout="neato", prop_to_study_key="qed")

[Figure: exploration tree representing the QED values]

Environment variables and data requirements

CLscore

Because computing the CLscore depends on prior data, EvoMol needs to be given the data location. To do so, the $SHINGLE_LIBS environment variable must be set to the location of the shingle_libs folder that can be downloaded here.

DFT and Molecular Mechanics optimization

To perform DFT and Molecular Mechanics computation (necessary for HOMO and LUMO optimization), you need to bind Gaussian09 and OpenBabel with EvoMol.

To do so, the $OPT_LIBS variable must point to a folder containing :

  • run.sh : a script launching a DFT optimization with Gaussian09 of the input filepath given as parameter.
  • obabel/openbabel-2.4.1 : directory containing an installation of OpenBabel 2.4.1. Make sure to also set OpenBabel's $BABEL_DATADIR environment variable to $OPT_LIBS/obabel/openbabel-2.4.1/data.

You can install OpenBabel by following these instructions:

$ mkdir obabel && cd obabel                                                              # Create and move to installation directory
$ wget https://github.com/openbabel/openbabel/archive/refs/tags/openbabel-2-4-1.tar.gz   # Download sources
$ tar zxf openbabel-2-4-1.tar.gz                                                         # Extract sources
$ mv openbabel-openbabel-2-4-1 openbabel-2.4.1                                           # Rename directory
$ cd openbabel-2.4.1                                                                     # Go to installation directory
$ cmake .                                                                                # Prepare the build (requires that cmake and g++ are installed)
$ make && make install                                                                   # Compile, then install if compilation succeeds

Checkmol descriptor

In order to use the checkmol descriptor for entropy evaluation, the $CHECKMOL_EXE environment variable must point to the executable of the checkmol program.

OpenBabel must also be installed (see above section).

GuacaMol initial population

To run the GuacaMol goal-directed benchmarks with the best scoring molecules from their subset of ChEMBL as initial population, you need to:

  • Download the ChEMBL subset.
  • Give the path of the data using the "smiles_list_init_path" attribute.
  • Ensure that the "guacamol_init_top_100" attribute is set to True.
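Put together, the steps above could translate into a dictionary such as the following. This is a sketch only: both paths are placeholders to be replaced with the actual location of the downloaded ChEMBL subset and the desired output directory.

```python
guacamol_config = {
    "obj_function": "guacamol_v2",
    "optimization_parameters": {
        # Start from the 100 best scoring molecules of the ChEMBL subset.
        "guacamol_init_top_100": True,
    },
    "io_parameters": {
        "model_path": "examples/guacamol",                     # placeholder output path
        "smiles_list_init_path": "path/to/chembl_subset.smi",  # placeholder data path
    },
}
```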

rd_filters

To use the rd_filter program as a filter of solutions that can be inserted in the population, the $FILTER_RULES_DATA environment variable must point to a folder containing the rules.json and alert_collection.csv files.
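Since several features depend on environment variables, it can help to check them before starting a run. The helper below is a minimal sketch that is not part of EvoMol; the feature labels used as dictionary keys are hypothetical, while the variable names are the ones listed in this section.

```python
import os

# Environment variables named in this section, grouped by the feature that needs them.
# The keys are illustrative labels, not EvoMol identifiers.
REQUIRED_VARIABLES = {
    "clscore": ["SHINGLE_LIBS"],
    "dft": ["OPT_LIBS", "BABEL_DATADIR"],
    "checkmol": ["CHECKMOL_EXE"],
    "rd_filters": ["FILTER_RULES_DATA"],
}


def missing_variables(feature, environ=None):
    """Return the variables required by a feature that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES[feature] if not environ.get(name)]


def check_features(features, environ=None):
    """Map each requested feature to its list of missing variables."""
    return {feature: missing_variables(feature, environ) for feature in features}
```

For example, check_features(["clscore", "rd_filters"]) returns an empty list for each feature whose variables are all set in the current environment.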

Citing EvoMol

To reference EvoMol, please cite the following article.

Leguy, J., Cauchy, T., Glavatskikh, M., Duval, B., Da Mota, B. EvoMol: a flexible and interpretable evolutionary algorithm for unbiased de novo molecular generation. J Cheminform 12, 55 (2020). https://doi.org/10.1186/s13321-020-00458-z
