Update docs (jump-cellpainting#5)

* Update docs
* Update docs
* move out text
* docs
* more docs
* cleanup
* cleanup
* use mamba
* mamba
* docs
* new script!
* cleanup
* error checks
* docs
* more docs and options
* several updates
* new nb
* docs
* gzip
* Updates
* data files
* logging
* nan
* outputs
johnarevalo · Apr 16, 2023 · 79596e4
1 parent 35bda30
commit 79596e4
Showing 13 changed files with 1,376 additions and 2,892 deletions.
diff --git a/README.md b/README.md
@@ -4,18 +4,80 @@ Credits: [Lewis Mervin](https://github.com/lewismervin1) for the original source
 
 ## Setup
 
-1. [Install Python](https://www.python.org/downloads/)
-1. [Install Poetry](https://python-poetry.org/docs/#installation)
-1. [Install Poetry Environment](https://python-poetry.org/docs/basic-usage/#installing-dependencies): `poetry install`
+We use [mamba](https://mamba.readthedocs.io/en/latest/) to manage the computational environment.
 
-For Linux, see
+To install mamba, see the [installation instructions](https://mamba.readthedocs.io/en/latest/installation.html).
 
-- <https://github.com/python-poetry/poetry/issues/1917#issuecomment-1380429197> if installing six fails
-- <https://stackoverflow.com/a/75435100> if you get "does not contain any element" warning when running `poetry install`
+After installing mamba, run the following to create and activate the environment:
 
-## Run
+```bash
+# First, create the `compound-annotator` conda environment
+mamba env create --force --file environment.yml
 
-### Create annotation file
+# If you had already installed this environment and now want to update it
+mamba env update --file environment.yml --prune
+
+# Then, activate the environment and you're all set!
+mamba activate compound-annotator
+
+```
+
+## Drug Repurposing Hub annotations
+
+See notebook `repurposing-annotations.ipynb` for details.
+
+## ChEMBL annotations
+
+The steps below produce the following file:
+
+- `data/chembl_annotation_filtered.csv.gz`: ChEMBL annotation file, filtered to include only rows whose `standard_inchi_key` is present in the `compound.csv.gz` file (the metadata file from the jump-cellpainting/datasets repo).
+
+Here's how we'd use this file to annotate the `compound.csv.gz` file:
+
+```python
+import pandas as pd
+
+# Read in the compound metadata file
+compound_df = pd.read_csv("data/compound.csv.gz")
+
+# Read in the ChEMBL annotation file
+chembl_df = pd.read_csv("data/chembl_annotation_filtered.csv.gz")
+
+# Merge the two dataframes
+merged_df = compound_df.merge(chembl_df, left_on="Metadata_InChIKey", right_on="standard_inchi_key")
+
+# Check the shape (rows, columns) of the merged dataframe
+merged_df.shape
+# (44017, 11)
+
+# Select the first row and print the values of the columns
+merged_df.iloc[0]
+```
+
+```text
+Metadata_JCP2022                                         JCP2022_000003
+Metadata_InChIKey                           AAALVYBICLMAMA-UHFFFAOYSA-N
+Metadata_InChI        InChI=1S/C20H15N3O2/c24-19-15-11-17(21-13-7-3-...
+assay_chembl_id                                                   29499
+target_chembl_id                                              CHEMBL203
+assay_type                                                            B
+molecule_chembl_id                                         CHEMBL268868
+pchembl_value                                                       6.8
+confidence_score                                                      8
+standard_inchi_key                          AAALVYBICLMAMA-UHFFFAOYSA-N
+pref_name                                                          DAPH
+```
+
+The following files are also produced:
+
+- `data/chembl_annotation.csv.gz`: ChEMBL annotation file. This is the raw output of a SQL query run on the ChEMBL SQLite database to get a subset of the data that we need.
+- `data/inchikey_chembl_filtered.csv.gz`: Mapping of `standard_inchi_key` to `molecule_chembl_id` from the filtered ChEMBL annotation file.
+
+### Steps for producing ChEMBL annotations
+
+<details>
+
+#### Create annotation file
 
 On a VM with >40G disk space, download the ChEMBL SQLite database (4.2G compressed, 23G uncompressed)
 
@@ -83,7 +145,7 @@ standard_inchi_key: 56272
 pref_name: 6536
 ```
 
-### Create filtered annotation file
+#### Create filtered annotation file
 
 Filter the annotation file to only include rows with `standard_inchi_key` that are present in the `compound.csv.gz` file
 
@@ -140,7 +202,7 @@ gzcat data/compound.csv.gz | csvcut -c Metadata_InChIKey| tail -n +2 | sort | un
 csvgrep -c standard_inchi_key -f data/compound_inchi_key.txt <(gzcat data/chembl_annotation.csv.gz) | gzip > data/chembl_annotation_filtered.csv.gz
 ```
 
-### Create mapping between `standard_inchi_key` and `chembl_id`
+#### Create mapping between `standard_inchi_key` and `chembl_id`
 
 Run SQL query to get mapping between `standard_inchi_key` and `chembl_id`
 
@@ -168,6 +230,13 @@ gzcat data/inchikey_chembl.csv.gz | wc -l
 # 2304876
 ```
 
+Count the number of rows in the `compound_inchi_key.txt` file
+
+```sh
+wc -l data/compound_inchi_key.txt
+# 116753
+```
+
 Now find rows in `data/inchikey_chembl.csv.gz` whose `standard_inchi_key` is present in `data/compound_inchi_key.txt`
 
 ```sh
@@ -188,7 +257,4 @@ standard_inchi_key: 30072
 pref_name: 2508
 ```
 
-```sh
-wc -l data/compound_inchi_key.txt
-# 116753
-```
+</details>
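
Note on the filtering step: the README filters `chembl_annotation.csv.gz` with `csvgrep -f`, which keeps rows whose `standard_inchi_key` exactly matches a line of `compound_inchi_key.txt`. For readers who prefer to stay in pandas, the sketch below shows what an equivalent filter might look like. It is not part of this commit. The helper name `filter_annotations` and the tiny in-memory tables are hypothetical stand-ins for the real files; only the column names `Metadata_InChIKey`, `Metadata_JCP2022`, `standard_inchi_key`, and `target_chembl_id` come from the README.

```python
import pandas as pd


def filter_annotations(annotations: pd.DataFrame, compounds: pd.DataFrame) -> pd.DataFrame:
    """Keep annotation rows whose standard_inchi_key appears in the compound table.

    Hypothetical helper: an exact-match filter, analogous to the csvgrep -f step.
    """
    keys = set(compounds["Metadata_InChIKey"].dropna())
    mask = annotations["standard_inchi_key"].isin(keys)
    return annotations[mask].reset_index(drop=True)


# Made-up tables standing in for compound.csv.gz and chembl_annotation.csv.gz
compounds = pd.DataFrame(
    {
        "Metadata_JCP2022": ["JCP_A", "JCP_B"],
        "Metadata_InChIKey": ["KEY-A", "KEY-B"],
    }
)
annotations = pd.DataFrame(
    {
        "standard_inchi_key": ["KEY-A", "KEY-A", "KEY-C"],
        "target_chembl_id": ["T1", "T2", "T3"],
    }
)

filtered = filter_annotations(annotations, compounds)

# A left merge with indicator=True also shows which compounds have no annotation
merged = compounds.merge(
    filtered,
    left_on="Metadata_InChIKey",
    right_on="standard_inchi_key",
    how="left",
    indicator=True,
)
unannotated = int((merged["_merge"] == "left_only").sum())
print(len(filtered), len(merged), unannotated)
```

The README example uses the default inner merge, which silently drops compounds without a ChEMBL match. Using `how="left"` with `indicator=True`, as here, is one way to keep those compounds visible. Because one compound can have several annotation rows, the merged table can be longer than the compound table.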