This repository provides a drug-repurposing pipeline that predicts new drug candidates which can potentially treat a symptom related to a given disease, such as Duchenne muscular dystrophy (DMD), Huntington's disease (HD) or Osteogenesis imperfecta (OI). For these predictions, explanations are generated in the form of subgraphs of the input knowledge graph.
Previous research by Pablo Perdomo Quinteiro[^1] provided this drug-repurposing pipeline. This project builds upon that pipeline, focusing on improving the conceptual model that the input knowledge graph conforms to and determining whether the predictions and explanations improve as well.
In the implementation of the workflow, the keyword `prev` indicates knowledge graphs that comply with the original data model from previous research[^1], while `restr` indicates the restructured knowledge graph, which complies with the newly designed conceptual model. The keywords `dmd`, `hd` and `oi` indicate that the knowledge graph is built from entities related to the disease DMD, HD or OI as seeds, respectively.
Two kinds of knowledge graphs can be built for any given disease:

- a knowledge graph that aligns with the data model of previous research (original KG), and
- a knowledge graph that has undergone structural changes in order to conform to a newly designed conceptual model based on Foundational Ontologies (restructured KG).
The data that initially populates the knowledge graphs comes from the Monarch Initiative graph build of September 2021 and is accessed through the TSV exports of that build. The TSV files are also stored in the folder `localfetcher/deprecated_data` so that the fetching script in this repository can access them.
To collect all associations in the knowledge graph for a given disease, a list of seeds must first be initialized containing the most important entities related to the disease that serves as the foundation of the knowledge graph.
- `localfetcher/main.py` - Running this script prompts the user to enter the disease that will serve as the foundation of the knowledge graph. Seed lists already exist for DMD, HD and OI, containing the identifier of the disease itself and the identifiers of the most important causal/correlated genes. The output is a CSV file in `localfetcher/output` with all association triples derived from the initial seeds. The file name indicates which group of seeds was used and the date of creation (for example `dmd_monarch_associations_2024-07-18.csv`).
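For illustration, the core of this fetching step (selecting every association whose subject or object matches a seed) can be sketched with pandas; the file and column names below are assumptions for this sketch, not the script's actual identifiers:

```python
import pandas as pd

# Hypothetical seed list for DMD: the disease identifier plus a key causal gene.
seeds = ["MONDO:0010679", "HGNC:2928"]  # illustrative identifiers only

# Load a Monarch TSV export; the file and column names are assumed.
associations = pd.read_csv(
    "localfetcher/deprecated_data/monarch_associations.tsv",
    sep="\t", usecols=["subject", "predicate", "object"],
)

# Keep every triple in which a seed appears as subject or object.
seeded = associations[
    associations["subject"].isin(seeds) | associations["object"].isin(seeds)
]

seeded.to_csv("localfetcher/output/dmd_monarch_associations_example.csv", index=False)
```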
The Monarch Initiative associations collected by the fetcher described above (`localfetcher/main.py`) do not yet contain drug information. To add it, data from two datasets is used: DrugCentral, which contains drug-phenotype interactions, and the Therapeutic Target Database (TTD), which contains drug-protein interactions. To merge this drug information with the associations already included in the knowledge graph, the following scripts in the folder `kg_builder/original` need to be run in the given order:
- `kg_builder/original/1_restructurer_main.py` - Organizes the entities found in the Monarch Initiative associations into conceptual classes such that the concepts, relations and triples match the knowledge graph built in previous research[^1]. The reorganized nodes are stored in `kg_builder/original/output`, such as `prev_dmd_monarch_nodes.csv`. All associations are stored in the `output` folder within the subfolder of the relevant disease; for DMD, the file `output/dmd/prev_dmd_monarch_associations.csv` contains all associations that conform to the data model of the knowledge graph from previous research[^1].
- `kg_builder/original/2_drug_info_merger_main.ipynb` - Prepares the drug information from DrugCentral and TTD to be compatible with the Monarch Initiative associations, for example by acquiring the Human Phenotype identifiers of the disease entities found in the DrugCentral dataset, or the genes corresponding to the proteins that are drug targets in the TTD data. The relevant drug-disease pairs are stored in `kg_builder/original/output`, such as `matched_drug_to_disease_dmd.csv`; the relevant drug-gene pairs are stored in the same output folder, such as `matched_drug_targets_dmd.csv`.
- `kg_builder/original/3_kg_drug_info_merger_main.ipynb` - Transforms the found drug-disease and drug-gene pairs into associations that conform to the data model of the knowledge graph, using the correct relations between the entities. The associations are stored in the `output` folder; for DMD, the drug information associations are in `output/dmd/prev_dmd_drugcentral_associations.csv` and `output/dmd/prev_dmd_ttd_associations.csv`.
- `kg_builder/original/4_kg_builder_main.py` - Builds the knowledge graph containing the associations from Monarch Initiative, DrugCentral and TTD. For DMD, all nodes and edges of this complete knowledge graph are stored in `output/dmd/prev_dmd_kg_nodes.csv` and `output/dmd/prev_dmd_kg_edges.csv`.
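Conceptually, step 3 above is a reshaping exercise: each matched pair becomes a triple with an appropriate relation. A minimal pandas sketch of that idea (the column names and relation label are assumed for illustration, not the notebook's actual ones):

```python
import pandas as pd

# Matched pairs produced by step 2; the column names are assumed for this sketch.
drug_disease = pd.read_csv("kg_builder/original/output/matched_drug_to_disease_dmd.csv")

# Reshape every drug-phenotype pair into an association triple, inserting an
# illustrative relation label between the two entities.
triples = pd.DataFrame({
    "subject": drug_disease["drug_id"],
    "predicate": "treats",  # assumed relation label
    "object": drug_disease["phenotype_id"],
})

triples.to_csv("output/dmd/prev_dmd_drugcentral_associations_example.csv", index=False)
```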
For the restructured knowledge graph, the same drug information from DrugCentral and TTD needs to be merged with the Monarch Initiative associations; here, a single script performs all steps:
- `kg_builder/restructured/kg_builder_main.py` - Merges the drug information from DrugCentral and TTD into the knowledge graph with the Monarch Initiative associations. The complete knowledge graph is stored in two files, `output/dmd/restr_dmd_kg_nodes.csv` and `output/dmd/restr_dmd_kg_edges.csv`.
To analyse the built knowledge graphs, run `analyser/kg_analyser.ipynb`. In `analyser/data_params.py` the parameters can be set that determine which knowledge graphs are included. The analysis outputs multiple files in the `output` folder and the related disease subfolders, such as `output/dmd`. These files contain, for example, all existing triples in the knowledge graph and statistics for each node and edge type. The knowledge graphs are also stored as GEXF (Graph Exchange XML Format) files, so the networks can be loaded into various network visualization applications.
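The GEXF export itself is straightforward with networkx; a minimal sketch, assuming the edge CSV has subject/predicate/object-style columns (names assumed):

```python
import networkx as nx
import pandas as pd

# Column names below are assumed; adjust to the actual edge file layout.
edges = pd.read_csv("output/dmd/prev_dmd_kg_edges.csv")

# Build a directed graph with the predicate stored as an edge attribute,
# then write it out as GEXF for tools such as Gephi or Cytoscape.
graph = nx.from_pandas_edgelist(
    edges, source="subject", target="object",
    edge_attr="predicate", create_using=nx.DiGraph,
)
nx.write_gexf(graph, "output/dmd/prev_dmd_kg.gexf")
```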
Predictions are generated by training a graph neural network (GNN) model on one of the two KG variations. This process is taken from previous research[^1]; however, the scripts performing these steps in the pipeline have been modified to allow for different input variations while maintaining the foundation of the already developed method.
The results of the node embedding and GNN training steps are found in the following folder, given a knowledge graph complying with the original data model (`prev`) and the disease DMD (`dmd`):

`output/dmd/prev_e2v/run_001`
Multiple runs of the embedding and prediction process on the same knowledge graph are allowed (and even recommended for analyzing workflow performance); each run is stored in a separate folder, the first run for example in the subfolder `run_001`. Runs are started with the following script:

- `predictor/run_predictor_notebooks.py` - Runs the prediction pipeline (node embedding and GNN model training). To select which knowledge graph is used as input, adjust the values in `predictor/data_params.py`.
First, the nodes and edges of the knowledge graph need to be indexed, so that these indices can be used consistently throughout all steps of the prediction workflow to retrieve the correct node given an index.
- `predictor/1_loader.ipynb` - Running this Jupyter Notebook stores the indexed nodes and edges in files in the output folder; for example, for DMD and the restructured knowledge graph, the files are `output/dmd/restr_dmd_indexed_nodes.csv` and `output/dmd/restr_dmd_indexed_edges.csv`.
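Consistent indexing typically amounts to assigning each node one integer ID and rewriting the edge list in terms of those IDs. A minimal pandas sketch of that idea (column names such as `id` and `subject` are assumptions):

```python
import pandas as pd

# Column names ("id", "subject", "object") are assumptions for this sketch.
nodes = pd.read_csv("output/dmd/prev_dmd_kg_nodes.csv")
edges = pd.read_csv("output/dmd/prev_dmd_kg_edges.csv")

# Give every node a stable integer index ...
nodes = nodes.reset_index().rename(columns={"index": "node_idx"})
id_to_idx = dict(zip(nodes["id"], nodes["node_idx"]))

# ... and rewrite the edge endpoints with those indices, so that embedding,
# GNN training and explanation all refer to the same numbering.
edges["subject_idx"] = edges["subject"].map(id_to_idx)
edges["object_idx"] = edges["object"].map(id_to_idx)
```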
For the node embedding step and for training the GNN model, a number of hyperparameters need to be set. To get the best results from the workflow, these parameter values need to be optimized, which can be done using the following script:
- `hyperparameter_opt.py` - Runs hyperparameter optimization using random search for both the node embedding step and the GNN training process. The resulting optimized hyperparameter values are stored in TXT files in the main folder, such as `optimized_params_prev_dmd.txt`.
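Random search here means sampling hyperparameter combinations from fixed ranges and keeping the best-scoring one. A minimal sketch of such a loop (the search space, trial count, and scoring are illustrative placeholders, not the script's actual setup):

```python
import random

# Illustrative search space; the actual script defines its own ranges.
search_space = {
    "embedding_dim": [32, 64, 128, 256],
    "num_walks": [2, 4, 6, 8],
    "p": [0.5, 0.75, 1.0],
    "q": [0.5, 0.75, 1.0],
}

def evaluate(params):
    # Stand-in for running the embedding + GNN training and returning a
    # validation score; replaced here by a random number.
    return random.random()

best_score, best_params = float("-inf"), None
for _ in range(50):  # the number of trials is also an assumption
    candidate = {name: random.choice(values) for name, values in search_space.items()}
    score = evaluate(candidate)
    if score > best_score:
        best_score, best_params = score, candidate

print(best_params)
```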
For the node embedding step, the method Edge2vec[^2] has been implemented. The script can be found here:
- `predictor/2_edge2vec_embedding.ipynb` - Outputs the final transition matrix and node embeddings into the corresponding `run_xxx` folder.
| Parameters | DMD (original KG) | DMD (restructured KG) | HD (original KG) | HD (restructured KG) | OI (original KG) | OI (restructured KG) |
|---|---|---|---|---|---|---|
| Number of walks | 6 | 4 | 6 | 2 | 6 | 4 |
| Walk length | 7 | 7 | 7 | 7 | 7 | 7 |
| Embedding dimension | 128 | 32 | 64 | 128 | 128 | 32 |
| p | 1.0 | 0.75 | 0.5 | 1.0 | 1.0 | 0.5 |
| q | 0.5 | 0.5 | 0.75 | 1.0 | 0.5 | 0.5 |
| Epochs | 10 | 10 | 10 | 10 | 10 | 10 |
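Edge2vec biases the random walks with an edge-type transition matrix, but the final embedding step is still a skip-gram model trained on the walks. Purely as a conceptual illustration (not the actual edge2vec implementation), that last step with gensim looks roughly as follows, using toy walks:

```python
from gensim.models import Word2Vec

# `walks` would come from the biased random-walk phase; here two toy walks.
walks = [
    ["drug:A", "gene:B", "phenotype:C"],
    ["phenotype:C", "gene:B", "drug:D"],
]

# Skip-gram over the walks; 128 dimensions and 10 epochs mirror the
# original-DMD column of the table above.
model = Word2Vec(sentences=walks, vector_size=128, window=5,
                 min_count=0, sg=1, epochs=10)

vector = model.wv["drug:A"]  # learned embedding for one node
```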
For the GNN training step, this script is used:
- `predictor/3_predictor.ipynb` - Running this script outputs the resulting model weights into the corresponding `run_xxx` folder; these weights are used in the next step of the workflow, the generation of explanations. A subfolder `run_xxx/pred` is created in which further results of this step are stored; these are used for generating the explanations as well and also enable analysis of the prediction performance. They include, for example, the probability of an edge existing between a symptom and a drug as calculated by the trained GNN model, and the overall prediction performance scores, in various metrics, during and after training.
| Parameters | DMD (original KG) | DMD (restructured KG) | HD (original KG) | HD (restructured KG) | OI (original KG) | OI (restructured KG) |
|---|---|---|---|---|---|---|
| Hidden dimension | 128 | 64 | 256 | 256 | 256 | 64 |
| Output dimension | 256 | 64 | 64 | 64 | 64 | 128 |
| Layers | 4 | 2 | 4 | 6 | 2 | 2 |
| Aggregation function | mean | mean | mean | sum | mean | mean |
| Dropout | 0.1 | 0.2 | 0.1 | 0.2 | 0.2 | 0.1 |
| Learning rate | 0.012352 | 0.003191 | 0.015119 | 0.0364471 | 0.000606 | 0.026789 |
| Epochs | 200 | 150 | 150 | 150 | 100 | 150 |
| Edge negative sampling ratio | 0.5 | 1.0 | 1.5 | 0.5 | 1.5 | 1.0 |
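The hyperparameters above map onto a standard GNN encoder with a link-prediction head. A minimal PyTorch Geometric sketch of that shape (the repository's actual model may differ; depth, dimensions, aggregation and dropout would follow the table):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class LinkPredictor(torch.nn.Module):
    """Minimal two-layer encoder with a dot-product decoder; in the pipeline,
    depth, dimensions, aggregation and dropout follow the table above."""

    def __init__(self, in_dim, hidden_dim=128, out_dim=256, dropout=0.1):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim, aggr="mean")
        self.conv2 = SAGEConv(hidden_dim, out_dim, aggr="mean")
        self.dropout = dropout

    def encode(self, x, edge_index):
        # Message passing over the KG to produce node embeddings.
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=self.dropout, training=self.training)
        return self.conv2(x, edge_index)

    def decode(self, z, edge_pairs):
        # Probability of an edge as the dot product of its two node embeddings.
        return (z[edge_pairs[0]] * z[edge_pairs[1]]).sum(dim=-1).sigmoid()
```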
To analyse the predictions and accuracy of the trained GNN models, run `analyser/prediction_analyser.ipynb`. In `analyser/data_params.py` the parameters can be set that determine which GNN models are included, based on the knowledge graphs used as training data. The analyser outputs files in the folder corresponding to the disease subject and data model of the training knowledge graph, such as `output/dmd/prev_e2v`; examples are the overlap in predicted drug-symptom pairs between all independent runs of the GNN model trained on the same knowledge graph, and training curves. The analysis considers all runs that have been performed, using the prediction results from all run folders (`output/dmd/prev_e2v/run_xxx`). Analysis results that combine the prediction results from both knowledge graphs (`prev` and `restr`) of the same disease, such as the comparison of AUC-ROC and F1 scores between the GNN models trained on the differently structured knowledge graphs, are stored in the parent folder, such as `output/dmd`.
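The AUC-ROC and F1 comparison reduces to scoring each model's predicted edge probabilities against held-out positive and negative edges. A small scikit-learn sketch with placeholder arrays:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

# Placeholder arrays: 1 = true drug-symptom edge, 0 = negative sample,
# plus the GNN's predicted probability for each candidate edge.
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.3, 0.7, 0.4, 0.2])

auc = roc_auc_score(y_true, y_prob)
f1 = f1_score(y_true, y_prob >= 0.5)  # a 0.5 decision threshold is assumed

print(f"AUC-ROC: {auc:.3f}, F1: {f1:.3f}")
```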
Explanations are generated using the script:
- `predictor/4_explainer.ipynb` - As for the prediction process, use `predictor/data_params.py` to select the knowledge graph for which explanations are generated. In the Jupyter Notebook itself, it must be set which drug-symptom pairs are considered during explanation generation, by indicating in how many runs an included drug-symptom pair must be found. For example, with a threshold of `5`, a drug-symptom pair is included when it is found in at least 5 runs; in that case, for the original DMD KG, the explainer outputs the explanation graphs in the folder `output/dmd/prev_e2v/expl_5`. This folder contains all found complete and incomplete explanations; an explanation is considered complete when a direct or indirect path exists in the graph between the symptom and drug of the explained pair. The explanation graphs are stored in multiple formats, such as an image and the raw data (`gpickle`, `pkl`).
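Whether a stored explanation is complete can be verified by testing for a path between the drug and the symptom in the saved graph. A sketch using networkx on a pickled explanation (file name and node identifiers are illustrative):

```python
import pickle
import networkx as nx

# Load a stored explanation graph; the exact file name is illustrative.
with open("output/dmd/prev_e2v/expl_5/example_explanation.gpickle", "rb") as fh:
    explanation = pickle.load(fh)

drug, symptom = "drug:A", "phenotype:C"  # the explained pair (illustrative IDs)

# An explanation is complete when some direct or indirect path connects them.
complete = nx.has_path(explanation.to_undirected(), drug, symptom)
print("complete" if complete else "incomplete")
```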
The explainer's hyperparameters are not tuned per knowledge graph during the hyperparameter optimization step and are thus fixed for every input:
| Parameters | Values |
|---|---|
| Epochs | 700 |
| Number of hops | 1 |
| Maximum size of explanation | 15 |
| Search iterations | 10 |
| Learning rate | 0.01 |
The generated explanations are analyzed using:
- `analyser/explanation_analyser.ipynb` - This analysis script computes objective measurements to assess the explanations and determines how many complete and incomplete explanations the explainer yielded, given the number of drug-symptom pairs included during explanation generation.
For example, for analyzing the explanations on the DMD KGs, this script outputs the following file:
- `output/dmd/dmd_explanation_objective_measurements.csv` - Stores the objective measurements of the explanations found for each DMD KG.
For each KG and each set of drug-symptom pairs used for generating explanations, a file is stored that shows the yield of the explainer. For example, for the original DMD KG and the explanations generated for drug-symptom pairs found in at least 5 runs, the file is found here:
`output/dmd/prev_e2v/expl_5/dmd_prev_expl_5_explanation_results.csv`
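Such a results file can be inspected quickly with pandas; the `status` column name below is an assumption about the file layout:

```python
import pandas as pd

results = pd.read_csv("output/dmd/prev_e2v/expl_5/dmd_prev_expl_5_explanation_results.csv")

# Tally complete versus incomplete explanations; the column name is assumed.
print(results["status"].value_counts())
```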
Footnotes
[^1]: Master's thesis project of Pablo Perdomo Quinteiro

[^2]: Gao, Z., Fu, G., Ouyang, C. et al. edge2vec: Representation learning using edge semantics for biomedical knowledge discovery. BMC Bioinformatics 20, 306 (2019). https://doi.org/10.1186/s12859-019-2914-2