Evolutionary algorithm for molecular properties optimization

jyryu3161/EvoMol

EvoMol

Installation

EvoMol was developed on Ubuntu (18.04+). Some features may be missing on other systems. In particular, drawing exploration trees is currently unavailable on Windows.

To install EvoMol on your system, run the appropriate commands in your terminal. The installation depends on Anaconda.

Linux

$ git clone https://github.com/jules-leguy/EvoMol.git     # Clone EvoMol
$ cd EvoMol                                               # Move into EvoMol directory
$ conda env create -f evomol_env.yml                      # Create conda environment
$ conda activate evomolenv                                # Activate environment
$ python -m pip install .                                 # Install EvoMol

Windows

$ git clone https://github.com/jules-leguy/EvoMol.git     # Clone EvoMol
$ cd EvoMol                                               # Move into EvoMol directory
$ conda env create -f evomol_env_windows.yml              # Create conda environment
$ conda activate evomolenv                                # Activate environment
$ python -m pip install .                                 # Install EvoMol

Quickstart

Launching a QED optimization for 500 steps. Note that the evomolenv conda environment must be activated whenever you use EvoMol.

from evomol import run_model
run_model({
    "obj_function": "qed",
    "optimization_parameters": {
        "max_steps": 500
    },
    "io_parameters": {
        "model_path": "examples/1_qed"
    },
})

Model parametrization

To run a model, you need to pass a dictionary describing the run to the run_model function. This dictionary can have up to 4 entries that are described in this section.

Default values are shown in parentheses after each entry.
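As a rough orientation, the four top-level entries can be laid out as in the sketch below. The nested values are arbitrary illustrations, the output path is a hypothetical placeholder, and `unknown_keys` is a helper written for this example only (it is not part of EvoMol).

```python
# Skeleton of the dictionary passed to run_model, using only the four
# top-level keys described in this section. Nested values are illustrative.
config = {
    "obj_function": "qed",
    "action_space_parameters": {"atoms": "C,N,O"},
    "optimization_parameters": {"max_steps": 100},
    "io_parameters": {"model_path": "examples/skeleton_run"},  # hypothetical path
}

ALLOWED_KEYS = {
    "obj_function",
    "action_space_parameters",
    "optimization_parameters",
    "io_parameters",
}


def unknown_keys(cfg):
    """Return top-level keys that are not among the four documented entries.

    Hypothetical helper for catching typos before a run; not an EvoMol function.
    """
    return sorted(set(cfg) - ALLOWED_KEYS)


print(unknown_keys(config))  # prints []
```

With the evomolenv environment active, such a dictionary would then be passed to `run_model(config)` as in the Quickstart.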

Objective function

The "obj_function" attribute can take the following values.

  • Implemented functions: "qed", "plogp", "norm_plogp", "sascore", "norm_sascore", "clscore", "homo", "lumo" (see EvoMol article). "entropy_ifg", "entropy_gen_scaffolds", "entropy_shg_1" and "entropy_checkmol" can be used to maximize the entropy of descriptors, using respectively functional groups, Murcko generic scaffolds, level 1 shingles and checkmol.
  • A custom function that evaluates a SMILES. It is also possible to give a tuple (function, function name as a string).
  • A dictionary describing a multi-objective function containing the following entries.
    • "type" :
      • "linear_combination" (linear combination of the properties)
      • "product" (product of properties)
      • "sigm_lin" (passing the values of a unique objective through a linear function and a sigmoid function)
      • "product_sigm_lin" (product of the properties after passing a linear function and a sigmoid function).
    • "functions" : list of functions (string keys describing implemented functions or custom functions).
    • Specific to the linear combination.
      • "coef" : list of coefficients.
    • Specific to the use of sigmoid/linear functions
      • "a" : list of a coefficients for the ax+b linear function definition.
      • "b" : list of b coefficients for the ax+b linear function definition.
      • "lambda" : list of λ coefficients for the sigmoid function definition.
  • "guacamol_v2" to run the goal-directed GuacaMol benchmarks.
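The options above could be written as in the following sketch. It only builds the objective descriptions and does not call EvoMol. `nitrogen_count` is a toy custom function invented for this example, and the coefficient values are arbitrary. The exact shape EvoMol expects is assumed from the entries listed above.

```python
def nitrogen_count(smiles):
    """Toy custom objective: count 'N'/'n' characters in a SMILES string.

    Deliberately naive (it would also count the N in 'Na'); a real
    objective would typically parse the molecule first.
    """
    return float(smiles.count("N") + smiles.count("n"))


# A custom function can be given alone or as a (function, name) tuple.
named_objective = (nitrogen_count, "nitrogen_count")

# Multi-objective: linear combination of an implemented function and a
# custom one. The "coef" list has one coefficient per function.
linear_objective = {
    "type": "linear_combination",
    "functions": ["qed", nitrogen_count],
    "coef": [1.0, -0.1],
}

# Multi-objective: product of properties, each passed through a linear
# function ax+b and then a sigmoid. Values of a, b and lambda are arbitrary.
sigmoid_product_objective = {
    "type": "product_sigm_lin",
    "functions": ["qed", "norm_sascore"],
    "a": [1, 1],
    "b": [0, 0],
    "lambda": [1, 1],
}

print(nitrogen_count("CCN"))  # prints 1.0
```

Any of `named_objective`, `linear_objective` or `sigmoid_product_objective` would then be used as the value of "obj_function".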

Search space

The "action_space_parameters" attribute can be set with a dictionary containing the following entries.

  • "atoms" : text list of available heavy atoms ("C,N,O,F,P,S,Cl,Br").
  • "max_heavy_atoms": maximum molecular size in terms of number of heavy atoms (38).
  • "substitution": whether to use the substitute atom type action (True).
  • "cut_insert": whether to use the cut atom and insert carbon atom actions (True).
  • "move_group": whether to use the move group action (True).
  • "use_rd_filters": whether to use the rd_filter program as a quality filter before inserting the mutated individuals in the population (False).
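For instance, a search space restricted to a few atom types and smaller molecules might be described as below. The specific values are arbitrary choices for illustration; entries that are left out would fall back to the defaults listed above.

```python
# Restrict the search space: fewer atom types, smaller molecules, and no
# move group action. Omitted entries keep their documented defaults.
action_space_parameters = {
    "atoms": "C,N,O",        # text list, comma separated
    "max_heavy_atoms": 20,
    "move_group": False,
}

config = {
    "obj_function": "qed",
    "action_space_parameters": action_space_parameters,
}

# The "atoms" entry is a single string, not a Python list.
atom_symbols = action_space_parameters["atoms"].split(",")
print(atom_symbols)  # prints ['C', 'N', 'O']
```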

Optimization parameters

The "optimization_parameters" attribute can be set with a dictionary containing the following entries.

  • "pop_max_size" : maximum population size (1000).
  • "max_steps" : number of steps to be run (1500).
  • "k_to_replace" : number of individuals replaced at each step (2).
  • "selection" : whether the best individuals are selected to be mutated ("best") or they are selected randomly ("random").
  • "problem_type" : whether it is a maximization ("max") or minimization ("min") problem.
  • "mutation_max_depth" : maximum number of successive actions on the molecular graph during a single mutation (2).
  • "mutation_find_improver_tries" : maximum number of mutations to find an improver (50).
  • "guacamol_init_top_100" : whether to initialize the population with the 100 best-scoring individuals of the GuacaMol ChEMBL subset when running the GuacaMol benchmarks (True). The list of SMILES must be given as the initial population.
  • "mutable_init_pop" : if True, the individuals of the initial population can be freely mutated. If False, they can be branched but their atoms and bonds cannot be modified (True).
  • "n_max_desc": maximum number of descriptors that can be handled when using an evaluator relying on a vector of descriptors, such as entropy contribution (3,000,000).
  • "shuffle_init_pop": whether to shuffle the SMILES at initialization.
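As an illustration of these entries, the sketch below describes a minimization run with random selection. The choice of "sascore" as the objective and all numeric values are assumptions made for the example, and `replacement_is_consistent` is a hypothetical sanity check, not an EvoMol function.

```python
# Minimization run with random selection; numeric values are arbitrary.
optimization_parameters = {
    "problem_type": "min",
    "selection": "random",
    "pop_max_size": 200,
    "max_steps": 300,
    "k_to_replace": 5,
    "mutation_max_depth": 1,
}

config = {
    "obj_function": "sascore",
    "optimization_parameters": optimization_parameters,
}


def replacement_is_consistent(params):
    """Hypothetical check: the number of individuals replaced at each step
    should be positive and no larger than the maximum population size."""
    return 0 < params["k_to_replace"] <= params["pop_max_size"]


print(replacement_is_consistent(optimization_parameters))  # prints True
```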

Input/Output parameters

The "io_parameters" attribute can be set with a dictionary containing the following entries.

  • "model_path" : path where to save model's output data ("EvoMol_model").
  • "smiles_list_init": list of SMILES describing the initial population (None: interpreting the "smiles_list_init_path" attribute). Note: not available when running GuacaMol benchmarks.
  • "smiles_list_init_path" : path where to find the SMILES list text file describing the initial population (None: initialization of the population with a single methane molecule).
  • "external_tabu_list": list of SMILES that won't be generated by EvoMol.
  • "record_history" : whether to save exploration tree data. Must be set to True to draw the exploration tree afterwards (False).
  • "record_all_generated_individuals" : whether to record the list of all individuals generated during the entire execution (not necessarily inserted in the population). Also records the number of calls to the objective function at the time of insertion.
  • "save_n_steps" : frequency (steps) of saving the data (100).
  • "print_n_steps" : frequency (steps) of printing current population statistics (1).
  • "dft_working_dir" : path where to save DFT optimization related files ("/tmp").
  • "dft_cache_files" : list of json files containing a cache of previously computed HOMO or LUMO values ([]).
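The sketch below prepares an initial population file and a matching "io_parameters" dictionary. It assumes the SMILES list text file holds one SMILES per line, which the description above does not state explicitly. The file name, the directory and the chosen molecules are placeholders.

```python
import tempfile
from pathlib import Path

# Initial population; the molecules are arbitrary examples.
initial_smiles = ["C", "CCO", "c1ccccc1"]

# Assumption: the SMILES list file contains one SMILES per line.
workdir = Path(tempfile.mkdtemp())
smiles_file = workdir / "init_pop.smi"
smiles_file.write_text("\n".join(initial_smiles) + "\n")

io_parameters = {
    "model_path": str(workdir / "model"),
    "smiles_list_init_path": str(smiles_file),
    "external_tabu_list": ["CC(=O)O"],  # SMILES that must not be generated
    "record_history": True,             # needed to draw the exploration tree
    "save_n_steps": 50,
}

# Read the file back to confirm what was written.
written = smiles_file.read_text().splitlines()
print(written)  # prints ['C', 'CCO', 'c1ccccc1']
```

The same population could instead be passed directly as a Python list through "smiles_list_init".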

Drawing exploration trees

Large exploration tree

Performing a QED optimization run of 500 steps, while recording the exploration data.

from evomol import run_model

model_path = "examples/2_large_tree"

run_model({
    "obj_function": "qed",
    "optimization_parameters": {
        "max_steps": 500
    },
    "io_parameters": {
        "model_path": model_path,
        "record_history": True
    }
})

Plotting the exploration tree with solutions colored according to their score. Nodes represent solutions. Edges represent mutations that lead to an improvement in the population.

from evomol.plot_exploration import exploration_graph
exploration_graph(model_path=model_path, layout="neato")

Figure: large exploration tree.

Detailed exploration tree

Mutating a fixed core of acetylsalicylic acid to increase its QED value.

from evomol import run_model

model_path = "examples/3_detailed_tree"

run_model({
    "obj_function": "qed",
    "optimization_parameters": {
        "max_steps": 10,
        "pop_max_size": 10,
        "k_to_replace": 2,
        "mutable_init_pop": False
    },
    "io_parameters": {
        "model_path": model_path,
        "record_history": True,
        "smiles_list_init_path": "examples/acetylsalicylic_acid.smi"
    }
})

Plotting the exploration tree including molecular drawings, scores and action types performed during mutations. Also plotting a table of molecular drawings.

from evomol.plot_exploration import exploration_graph

exploration_graph(model_path=model_path, layout="dot", draw_actions=True, plot_images=True, draw_scores=True,
                  root_node="O=C(C)Oc1ccccc1C(=O)O", legend_scores_keys_strat=["total"], mol_size=0.3,
                  legend_offset=(-0.007, -0.05), figsize=(20, 20/1.5), legends_font_size=13)

Figure: detailed exploration tree.

Figure: detailed molecular drawings table.

Environment variables and data requirements

CLscore

Computing the CLscore depends on prior data, so EvoMol must be told where that data is located. To do so, set the $SHINGLE_LIBS environment variable to the location of the shingle_libs folder that can be downloaded here.

DFT and Molecular Mechanics optimization

To perform DFT and Molecular Mechanics computations (necessary for HOMO and LUMO optimization), you need to make Gaussian09 and OpenBabel available to EvoMol.

To do so, the $OPT_LIBS variable must point to a folder containing:

  • run.sh : a script launching a DFT optimization with Gaussian09 of the input filepath given as parameter.
  • obabel/openbabel-2.4.1 : directory containing an installation of OpenBabel 2.4.1. Make sure to also set OpenBabel's $BABEL_DATADIR environment variable to $OPT_LIBS/obabel/openbabel-2.4.1/data.

To install OpenBabel, you should compile the sources using the official instructions.
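The folder layout described above can be checked before launching a HOMO or LUMO run. The sketch below is one way to do it. Both functions are hypothetical helpers written for this example and are not part of EvoMol. The demonstration uses a throwaway directory rather than a real installation.

```python
import tempfile
from pathlib import Path


def missing_opt_libs_entries(opt_libs):
    """Return the expected entries that are absent from the $OPT_LIBS folder."""
    root = Path(opt_libs)
    missing = []
    if not (root / "run.sh").is_file():
        missing.append("run.sh")
    if not (root / "obabel" / "openbabel-2.4.1").is_dir():
        missing.append("obabel/openbabel-2.4.1")
    return missing


def babel_datadir_for(opt_libs):
    """Value that $BABEL_DATADIR should take for a given $OPT_LIBS folder."""
    return Path(opt_libs) / "obabel" / "openbabel-2.4.1" / "data"


# Demonstration on an empty temporary folder, then on a populated one.
demo_root = Path(tempfile.mkdtemp())
before = missing_opt_libs_entries(demo_root)

(demo_root / "run.sh").write_text("#!/bin/sh\n")  # stand-in for the real script
(demo_root / "obabel" / "openbabel-2.4.1").mkdir(parents=True)
after = missing_opt_libs_entries(demo_root)

print(before)  # prints ['run.sh', 'obabel/openbabel-2.4.1']
print(after)   # prints []
```

In practice the argument would be `os.environ["OPT_LIBS"]`.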

Checkmol descriptor

In order to use the checkmol descriptor for entropy evaluation, the $CHECKMOL_EXE environment variable must point to the executable of the checkmol program.

OpenBabel must also be installed (see above section).

GuacaMol initial population

To use EvoMol for GuacaMol goal-directed benchmark optimization with the best-scoring molecules from their subset of ChEMBL as the initial population, you need to:

  • Download the ChEMBL subset.
  • Give the path of the data using the "smiles_list_init_path" attribute.
  • Ensure that the "guacamol_init_top_100" attribute is set to True.
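Put together, the three steps above might translate into a dictionary like the following sketch. Both paths are placeholders to replace with your own locations, and the placement of each entry follows the parameter sections earlier in this document.

```python
# GuacaMol goal-directed benchmarks, starting from the ChEMBL subset.
guacamol_config = {
    "obj_function": "guacamol_v2",
    "optimization_parameters": {
        "guacamol_init_top_100": True,
    },
    "io_parameters": {
        "model_path": "examples/guacamol_run",                    # placeholder
        "smiles_list_init_path": "path/to/chembl_subset.smiles",  # placeholder
    },
}

# "smiles_list_init" is not available for GuacaMol benchmarks, so the
# initial population must come from a file path.
uses_inline_list = "smiles_list_init" in guacamol_config["io_parameters"]
print(uses_inline_list)  # prints False
```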

rd_filters

To use the rd_filter program as a filter of solutions that can be inserted in the population, the $FILTER_RULES_DATA environment variable must point to a folder containing the rules.json and alert_collection.csv files.
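The environment variables named in this section can be checked from Python before a run. The sketch below maps each feature to the variables this document associates with it. The mapping and the `missing_env_vars` helper are written for this example only and are not provided by EvoMol.

```python
import os

# Feature -> environment variables required according to this section.
# The checkmol descriptor additionally needs OpenBabel to be installed.
REQUIRED_ENV = {
    "clscore": ["SHINGLE_LIBS"],
    "homo": ["OPT_LIBS", "BABEL_DATADIR"],
    "lumo": ["OPT_LIBS", "BABEL_DATADIR"],
    "entropy_checkmol": ["CHECKMOL_EXE"],
    "use_rd_filters": ["FILTER_RULES_DATA"],
}


def missing_env_vars(feature, environ=None):
    """Return the variables required by `feature` that are unset or empty.

    `environ` defaults to the process environment; a plain dict can be
    passed instead, which is convenient for testing.
    """
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV.get(feature, []) if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_env_vars("homo", {"OPT_LIBS": "/opt/evomol_libs"}))  # prints ['BABEL_DATADIR']
```

Features that need no external data, such as "qed", yield an empty list.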
