EvoMol was designed on Ubuntu (18.04+). Some features might be missing on other systems. In particular, drawing exploration trees is currently unavailable on Windows.
To install EvoMol, run the commands that match your system in your terminal. The installation depends on Anaconda. The first block of commands below uses evomol_env.yml; the second uses evomol_env_windows.yml and is intended for Windows.
$ git clone https://github.com/jules-leguy/EvoMol.git # Clone EvoMol
$ cd EvoMol # Move into EvoMol directory
$ conda env create -f evomol_env.yml # Create conda environment
$ conda activate evomolenv # Activate environment
$ python -m pip install . # Install EvoMol
$ git clone https://github.com/jules-leguy/EvoMol.git # Clone EvoMol
$ cd EvoMol # Move into EvoMol directory
$ conda env create -f evomol_env_windows.yml # Create conda environment
$ conda activate evomolenv # Activate environment
$ python -m pip install . # Install EvoMol
Launching a QED optimization for 500 steps. Note that the evomolenv conda environment must be active whenever you use EvoMol.
from evomol import run_model

run_model({
    "obj_function": "qed",
    "optimization_parameters": {
        "max_steps": 500
    },
    "io_parameters": {
        "model_path": "examples/1_qed"
    },
})
To run a model, you need to pass a dictionary describing the run to the run_model function. This dictionary can have up to 4 entries that are described in this section.
Where an entry has a default value, it is given in parentheses at the end of the entry's description.
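As a small illustration of the dictionary layout, the sketch below lists the four top-level entries described in this section and flags any other key. The helper `unknown_entries` is hypothetical and not part of EvoMol; it only inspects the dictionary and does not launch a run.

```python
# The four top-level entries documented in this section.
KNOWN_ENTRIES = {
    "obj_function",
    "action_space_parameters",
    "optimization_parameters",
    "io_parameters",
}


def unknown_entries(config):
    """Return the top-level keys of `config` that are not documented entries.

    Hypothetical helper, not part of EvoMol.
    """
    return sorted(set(config) - KNOWN_ENTRIES)


config = {
    "obj_function": "qed",
    "optimization_parameters": {"max_steps": 500},
    "io_parameters": {"model_path": "examples/1_qed"},
}

# All keys of the quickstart configuration are documented entries.
print(unknown_entries(config))

# A misspelled key ("io_parameter") is reported.
print(unknown_entries({"obj_function": "qed", "io_parameter": {}}))
```

Running such a check before calling run_model is optional; it is only a convenience for catching typos early.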
The "obj_function" attribute can take the following values. Multi-objective functions can be nested to any depth.
- Implemented functions: "qed", "plogp", "norm_plogp", "sascore", "norm_sascore", "clscore", "homo", "lumo" (see EvoMol article). "entropy_ifg", "entropy_gen_scaffolds", "entropy_shg_1" and "entropy_checkmol" can be used to maximize the entropy of descriptors, respectively using IFGs, Murcko generic scaffolds, level 1 shingles and checkmol.
- A custom function evaluating a SMILES. It is also possible to give a tuple (function, string function name).
- A dictionary describing a multi-objective function containing the following entries.
  - "type":
    - "linear_combination" (linear combination of the properties)
    - "product" (product of the properties)
    - "sigm_lin" (passing the value of a unique objective function through a linear function and a sigmoid function)
    - "product_sigm_lin" (product of the properties after passing them through a linear function and a sigmoid function)
    - "gaussian" (passing the value of a unique objective function through a Gaussian function)
    - "opposite" (computing the opposite value of a unique objective function)
  - "functions": list of functions (string keys describing implemented functions, custom functions or multi-objective functions).
  - Specific to the linear combination:
    - "coef": list of coefficients.
  - Specific to the use of sigmoid/linear functions:
    - "a": list of a coefficients for the ax+b linear function definition.
    - "b": list of b coefficients for the ax+b linear function definition.
    - "lambda": list of λ coefficients for the sigmoid function definition.
  - Specific to the use of Gaussian functions:
    - "mu": μ parameter of the Gaussian.
    - "sigma": σ parameter of the Gaussian.
- "guacamol_v2" for taking the goal directed GuacaMol benchmarks.
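The options above can be combined. The following sketch builds a nested objective: a linear combination of the implemented "qed" function and a custom function passed through "sigm_lin". The custom function `nitrogen_fraction` is a hypothetical toy working on the SMILES string only, and the coefficient values are arbitrary illustrations, not recommended settings.

```python
def nitrogen_fraction(smiles):
    """Toy custom objective (hypothetical): share of 'N'/'n' characters in the SMILES."""
    if not smiles:
        return 0.0
    return sum(ch in "Nn" for ch in smiles) / len(smiles)


obj_function = {
    "type": "linear_combination",
    "functions": [
        "qed",  # implemented function, referenced by its string key
        {
            # Nested multi-objective entry wrapping a unique custom objective.
            "type": "sigm_lin",
            # Custom function given as a (function, string function name) tuple.
            "functions": [(nitrogen_fraction, "nitrogen_fraction")],
            "a": [1],       # illustrative a coefficient of ax+b
            "b": [0],       # illustrative b coefficient of ax+b
            "lambda": [1],  # illustrative λ coefficient of the sigmoid
        },
    ],
    "coef": [1, 0.5],  # one illustrative weight per entry of "functions"
}

print(nitrogen_fraction("CCN"))
```

The resulting dictionary would be passed as the "obj_function" entry of the dictionary given to run_model.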
The "action_space_parameters" attribute can be set with a dictionary containing the following entries.
- "atoms": text list of available heavy atoms ("C,N,O,F,P,S,Cl,Br").
- "max_heavy_atoms": maximum molecular size in terms of number of heavy atoms (38).
- "substitution": whether to use substitute atom type action (True).
- "cut_insert": whether to use cut atom and insert carbon atom actions (True).
- "move_group": whether to use move group action (True).
- "use_rd_filters": whether to use the rd_filter program as a quality filter before inserting the mutated individuals in the population (False).
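For illustration, an "action_space_parameters" dictionary restricting the search to small C/N/O molecules might look as follows. The chosen values are arbitrary examples; only the keys come from the list above.

```python
action_space_parameters = {
    "atoms": "C,N,O",         # comma-separated text list, same format as the default
    "max_heavy_atoms": 20,    # smaller than the default of 38
    "substitution": True,
    "cut_insert": True,
    "move_group": False,      # disable the move group action
    "use_rd_filters": False,
}

# The text list can be split to inspect the allowed heavy atoms.
allowed_atoms = action_space_parameters["atoms"].split(",")
print(allowed_atoms)
```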
The "optimization_parameters" attribute can be set with a dictionary containing the following entries.
- "pop_max_size": maximum population size (1000).
- "max_steps": number of steps to be run before stopping EvoMol (1500).
- "max_obj_calls": number of calls to the objective functions before stopping EvoMol (float("inf")).
- "k_to_replace": number of individuals replaced at each step (2).
- "selection": whether the best individuals are selected to be mutated ("best") or they are selected randomly ("random").
- "problem_type": whether it is a maximization ("max") or minimization ("min") problem.
- "mutation_max_depth": maximum number of successive actions on the molecular graph during a single mutation (2).
- "mutation_find_improver_tries": maximum number of mutations to find an improver (50).
- "guacamol_init_top_100": whether to initialize the population with the 100 best scoring individuals of the GuacaMol ChEMBL subset in case of taking the GuacaMol benchmarks (False). The list of SMILES must be given as initial population.
- "mutable_init_pop": if True, the individuals of the initial population can be freely mutated. If False, they can be branched but their atoms and bonds cannot be modified (True).
- "n_max_desc": maximum number of descriptors that can be handled when using an evaluator relying on a vector of descriptors, such as entropy contribution (3,000,000).
- "shuffle_init_pop": whether to shuffle the SMILES at initialization.
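As an example of these entries, the sketch below describes a minimization run with random selection and two stopping criteria. Treating "sascore" as a quantity to minimize and the numeric values are illustrative assumptions; the keys are the ones listed above.

```python
config = {
    "obj_function": "sascore",
    "optimization_parameters": {
        "problem_type": "min",     # minimize instead of maximize
        "selection": "random",     # mutate randomly selected individuals
        "pop_max_size": 500,
        "max_steps": 1000,         # stopping criterion on the number of steps
        "max_obj_calls": 5000,     # stopping criterion on objective function calls
        "k_to_replace": 5,
        "mutation_max_depth": 3,
    },
}

print(sorted(config["optimization_parameters"]))
```

This dictionary would be passed to run_model in the same way as in the quickstart example.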
The "io_parameters" attribute can be set with a dictionary containing the following entries.
- "model_path": path where to save model's output data ("EvoMol_model").
- "smiles_list_init": list of SMILES describing the initial population (None: interpreting the "smiles_list_init_path" attribute). Note: not available when taking GuacaMol benchmarks.
- "smiles_list_init_path": path where to find the SMILES list text file describing the initial population (None: initialization of the population with a single methane molecule).
- "external_tabu_list": list of SMILES that won't be generated by EvoMol.
- "record_history": whether to save exploration tree data. Must be set to True to later draw the exploration tree (False).
- "record_all_generated_individuals": whether to record a list of all individuals that are generated during the entire execution (even if they fail the objective function computation or if they are not inserted in the population as they are not improvers). The step number and the total number of calls to the objective function at the time of generation are also recorded.
- "save_n_steps": frequency (steps) of saving the data (100).
- "print_n_steps": frequency (steps) of printing current population statistics (1).
- "dft_working_dir": path where to save DFT optimization related files ("/tmp").
- "dft_cache_files": list of json files containing a cache of previously computed HOMO or LUMO values ([]).
- "evaluation_strategy_parameters": a dictionary that contains an entry "evaluate_init_pop" to set given parameters to the EvaluationStrategy instance in the context of the evaluation of the initial population. An entry "evaluate_new_sol" must also be present to set given parameters for the evaluation of new solutions during the optimization process. If None, both keys are set to an empty set of parameters (None).
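A possible "io_parameters" dictionary is sketched below. It starts the population from an explicit list of SMILES, forbids one molecule through the tabu list and records the exploration data. The output path and the SMILES are hypothetical placeholders.

```python
io_parameters = {
    "model_path": "examples/custom_io",          # hypothetical output directory
    "smiles_list_init": ["CC", "c1ccccc1"],      # initial population given directly
    "external_tabu_list": ["C"],                 # SMILES that must not be generated
    "record_history": True,                      # needed to draw the exploration tree later
    "record_all_generated_individuals": True,
    "save_n_steps": 50,                          # save data every 50 steps
    "print_n_steps": 10,                         # print statistics every 10 steps
}

print(len(io_parameters["smiles_list_init"]))
```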
Performing a QED optimization run of 500 steps, while recording the exploration data.
from evomol import run_model

model_path = "examples/2_large_tree"

run_model({
    "obj_function": "qed",
    "optimization_parameters": {
        "max_steps": 500
    },
    "io_parameters": {
        "model_path": model_path,
        "record_history": True
    }
})
Plotting the exploration tree with solutions colored according to their score. Nodes represent solutions. Edges represent mutations that lead to an improvement in the population.
from evomol.plot_exploration import exploration_graph
exploration_graph(model_path=model_path, layout="neato")
Performing the experiment of mutating a fixed core of acetylsalicylic acid to increase its QED value.
from evomol import run_model

model_path = "examples/3_detailed_tree"

run_model({
    "obj_function": "qed",
    "optimization_parameters": {
        "max_steps": 10,
        "pop_max_size": 10,
        "k_to_replace": 2,
        "mutable_init_pop": False
    },
    "io_parameters": {
        "model_path": model_path,
        "record_history": True,
        "smiles_list_init_path": "examples/acetylsalicylic_acid.smi"
    }
})
Plotting the exploration tree including molecular drawings, scores and action types performed during mutations. Also plotting a table of molecular drawings.
from evomol.plot_exploration import exploration_graph

exploration_graph(model_path=model_path, layout="dot", draw_actions=True, plot_images=True, draw_scores=True,
                  root_node="O=C(C)Oc1ccccc1C(=O)O", legend_scores_keys_strat=["total"], mol_size=0.3,
                  legend_offset=(-0.007, -0.05), figsize=(20, 20/1.5), legends_font_size=13)
Optimizing jointly the QED and the entropy of IFGs using a linear combination. The weights are set respectively to 1 and 10.
from evomol import run_model

model_path = "examples/4_entropy_optimization"

run_model({
    "obj_function": {
        "type": "linear_combination",
        "functions": ["qed", "entropy_ifg"],
        "coef": [1, 10]
    },
    "optimization_parameters": {
        "max_steps": 500,
        "pop_max_size": 1000
    },
    "io_parameters": {
        "model_path": model_path,
        "record_history": True
    },
})
Plotting the exploration trees representing the QED values.
from evomol.plot_exploration import exploration_graph
exploration_graph(model_path=model_path, layout="neato", prop_to_study_key="qed")
Because computing the CLscore depends on prior data, EvoMol needs to be given the location of this data. To do so, the $SHINGLE_LIBS environment variable must be set to the location of the shingle_libs folder that can be downloaded here.
To perform DFT and molecular mechanics computations (necessary for HOMO and LUMO optimization), you need to bind Gaussian09 and OpenBabel with EvoMol. To do so, the $OPT_LIBS environment variable must point to a folder containing:
- run.sh: a script launching a DFT optimization with Gaussian09 of the input filepath given as parameter.
- obabel/openbabel-2.4.1: directory containing an installation of OpenBabel 2.4.1. Make sure to also set OpenBabel's $BABEL_DATADIR environment variable to $OPT_LIBS/obabel/openbabel-2.4.1/data.
You can install OpenBabel by following these instructions:
$ mkdir obabel && cd obabel # Create and move to installation directory
$ wget -O openbabel-openbabel-2-4-1.tar.gz https://github.com/openbabel/openbabel/archive/refs/tags/openbabel-2-4-1.tar.gz # Download sources
$ tar zxf openbabel-openbabel-2-4-1.tar.gz # Extract sources
$ mv openbabel-openbabel-2-4-1 openbabel-2.4.1 # Rename directory
$ cd openbabel-2.4.1 # Go to installation directory
$ cmake . # Prepare the build (requires that cmake and g++ are installed)
$ make && make install # Compilation and installation
In order to use the checkmol descriptor for entropy evaluation, the $CHECKMOL_EXE environment variable must point to the executable of the checkmol program. OpenBabel must also be installed (see the section above).
To use EvoMol for GuacaMol goal directed benchmarks optimization using the best scoring molecules from their subset of ChEMBL as initial population, you need to:
- Download the ChEMBL subset.
- Give the path of the data using the "smiles_list_init_path" attribute.
- Ensure that the "guacamol_init_top_100" attribute is set to True.
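The three steps above might translate into a configuration like the following sketch. The file path and output directory are hypothetical placeholders for wherever the downloaded ChEMBL subset and the results are stored; the block only builds the dictionary and does not launch the run.

```python
config = {
    "obj_function": "guacamol_v2",  # goal directed GuacaMol benchmarks
    "optimization_parameters": {
        "guacamol_init_top_100": True,  # start from the 100 best scoring molecules
    },
    "io_parameters": {
        "model_path": "examples/guacamol",  # hypothetical output directory
        # Hypothetical location of the downloaded ChEMBL subset.
        "smiles_list_init_path": "data/guacamol_chembl_subset.smiles",
    },
}

print(config["obj_function"])
```

The dictionary would then be passed to run_model as in the other examples.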
To use the rd_filter program as a filter of solutions that can be inserted in the population, the $FILTER_RULES_DATA environment variable must point to a folder containing the rules.json and alert_collection.csv files.
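The environment variables named in this section can be checked before launching a run. The sketch below is a hypothetical pre-flight helper, not part of EvoMol: the feature names used as keys of `REQUIRED_VARIABLES` are illustrative labels, and the helper only reports variables that are unset or empty.

```python
import os

# Illustrative feature labels mapped to the environment variables described above.
REQUIRED_VARIABLES = {
    "clscore": ["SHINGLE_LIBS"],
    "dft": ["OPT_LIBS", "BABEL_DATADIR"],
    "entropy_checkmol": ["CHECKMOL_EXE"],
    "rd_filters": ["FILTER_RULES_DATA"],
}


def missing_variables(feature, environ=None):
    """Return the variables required by `feature` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES[feature] if not environ.get(name)]


def expected_babel_datadir(opt_libs):
    """Build the $BABEL_DATADIR value described above from the $OPT_LIBS folder."""
    return f"{opt_libs.rstrip('/')}/obabel/openbabel-2.4.1/data"


# Example with an explicit mapping instead of the real environment.
fake_environ = {"OPT_LIBS": "/opt/evomol_libs"}
print(missing_variables("dft", fake_environ))
print(expected_babel_datadir(fake_environ["OPT_LIBS"]))
```

Calling `missing_variables("clscore")` without a second argument inspects the real process environment.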
To reference EvoMol, please cite the following article.
Leguy, J., Cauchy, T., Glavatskikh, M., Duval, B., Da Mota, B. EvoMol: a flexible and interpretable evolutionary algorithm for unbiased de novo molecular generation. J Cheminform 12, 55 (2020). https://doi.org/10.1186/s13321-020-00458-z