Update docs (jump-cellpainting#5)
* Update docs

* Update docs

* move out text

* docs

* more docs

* cleanup

* cleanup

* use mamba

* mamba

* docs

* new script!

* cleanup

* error checks

* docs

* more docs and options

* several updates

* new nb

* docs

* gzip

* Updates

* data files

* logging

* nan

* outputs
shntnu authored Apr 16, 2023
1 parent 35bda30 commit 79596e4
Showing 13 changed files with 1,376 additions and 2,892 deletions.
94 changes: 80 additions & 14 deletions README.md
@@ -4,18 +4,80 @@ Credits: [Lewis Mervin](https://github.com/lewismervin1) for the original source

## Setup

1. [Install Python](https://www.python.org/downloads/)
1. [Install Poetry](https://python-poetry.org/docs/#installation)
1. [Install Poetry Environment](https://python-poetry.org/docs/basic-usage/#installing-dependencies): `poetry install`
We use [mamba](https://mamba.readthedocs.io/en/latest/) to manage the computational environment.

For Linux, see
To install mamba see [instructions](https://mamba.readthedocs.io/en/latest/installation.html).

- <https://github.com/python-poetry/poetry/issues/1917#issuecomment-1380429197> if installing six fails
- <https://stackoverflow.com/a/75435100> if you get "does not contain any element" warning when running `poetry install`
After installing mamba, run the following to create and activate the environment:

## Run
```bash
# First, create the `compound-annotator` conda environment
mamba env create --force --file environment.yml

### Create annotation file
# If you had already installed this environment and now want to update it
mamba env update --file environment.yml --prune

# Then, activate the environment and you're all set!
mamba activate compound-annotator

```

## Drug Repurposing Hub annotations

See notebook `repurposing-annotations.ipynb` for details.

## ChEMBL annotations

The steps below produce the following file:

- `data/chembl_annotation_filtered.csv.gz`: ChEMBL annotation file filtered to only include rows with `standard_inchi_key` that are present in the `compound.csv.gz` file (this is the metadata file from the jump-cellpainting/datasets repo).

Here's how we'd use this file to annotate the `compound.csv.gz` file:

```python
import pandas as pd

# Read in the compound metadata file
compound_df = pd.read_csv("data/compound.csv.gz")

# Read in the ChEMBL annotation file
chembl_df = pd.read_csv("data/chembl_annotation_filtered.csv.gz")

# Merge the two dataframes
merged_df = compound_df.merge(chembl_df, left_on="Metadata_InChIKey", right_on="standard_inchi_key")

# Count the number of rows in the merged dataframe
merged_df.shape
# (44017, 11)

# Select the first row and print the values of the columns
merged_df.iloc[0]
```

```text
Metadata_JCP2022 JCP2022_000003
Metadata_InChIKey AAALVYBICLMAMA-UHFFFAOYSA-N
Metadata_InChI InChI=1S/C20H15N3O2/c24-19-15-11-17(21-13-7-3-...
assay_chembl_id 29499
target_chembl_id CHEMBL203
assay_type B
molecule_chembl_id CHEMBL268868
pchembl_value 6.8
confidence_score 8
standard_inchi_key AAALVYBICLMAMA-UHFFFAOYSA-N
pref_name DAPH
```
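Note that `merge` performs an inner join by default, so compounds without a ChEMBL match do not appear in `merged_df`. The following is a minimal sketch of how one might count matched and unmatched compounds with a left join and pandas' `indicator` option. It uses small made-up frames in place of the real files; the toy values and the `annotation_coverage` helper are illustrative assumptions, not part of this repo.

```python
import pandas as pd


def annotation_coverage(compound_df, chembl_df):
    """Return (n_annotated, n_unannotated) compounds, matched by InChIKey."""
    # One row per annotated key, so the left join cannot duplicate compounds
    annotated_keys = chembl_df[["standard_inchi_key"]].drop_duplicates()
    merged = compound_df.merge(
        annotated_keys,
        how="left",
        left_on="Metadata_InChIKey",
        right_on="standard_inchi_key",
        indicator=True,  # adds a `_merge` column: "both" or "left_only"
    )
    n_annotated = int((merged["_merge"] == "both").sum())
    n_unannotated = int((merged["_merge"] == "left_only").sum())
    return n_annotated, n_unannotated


# Toy stand-ins for data/compound.csv.gz and data/chembl_annotation_filtered.csv.gz
compound_df = pd.DataFrame(
    {
        "Metadata_JCP2022": ["JCP2022_A", "JCP2022_B", "JCP2022_C"],
        "Metadata_InChIKey": ["KEY-A", "KEY-B", "KEY-C"],
    }
)
chembl_df = pd.DataFrame(
    {
        "standard_inchi_key": ["KEY-A", "KEY-A", "KEY-C"],
        "target_chembl_id": ["CHEMBL1", "CHEMBL2", "CHEMBL3"],
    }
)

n_annotated, n_unannotated = annotation_coverage(compound_df, chembl_df)
print(n_annotated, n_unannotated)
```

The sketch assumes each compound appears once in the compound frame. With the real files, the same helper could be called on the two frames read with `pd.read_csv`.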

The following files are also produced:

- `data/chembl_annotation.csv.gz`: ChEMBL annotation file. This is the raw output of a SQL query run on the ChEMBL SQLite database to get a subset of the data that we need.
- `data/inchikey_chembl_filtered.csv.gz`: Mapping of `standard_inchi_key` to `molecule_chembl_id` from the filtered ChEMBL annotation file.

### Steps for producing ChEMBL annotations

<details>

#### Create annotation file

On a VM with >40G disk space, download the ChEMBL SQLite database (4.2G compressed, 23G uncompressed)

@@ -83,7 +145,7 @@ standard_inchi_key: 56272
pref_name: 6536
```

### Create filtered annotation file
#### Create filtered annotation file

Filter the annotation file to only include rows with `standard_inchi_key` that are present in the `compound.csv.gz` file

@@ -140,7 +202,7 @@ gzcat data/compound.csv.gz | csvcut -c Metadata_InChIKey| tail -n +2 | sort | un
csvgrep -c standard_inchi_key -f data/compound_inchi_key.txt <(gzcat data/chembl_annotation.csv.gz) | gzip > data/chembl_annotation_filtered.csv.gz
```
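For readers without `csvkit`, the same filtering step can be approximated in Python. This is a minimal sketch using only the standard library, assuming the annotation file has a `standard_inchi_key` column and the key file holds one key per line. The `filter_annotations` helper and the toy data are hypothetical and not part of this repo.

```python
import csv
import gzip
import os
import tempfile


def filter_annotations(annotation_path, keys_path, output_path):
    """Keep annotation rows whose standard_inchi_key is listed in keys_path."""
    with open(keys_path) as f:
        keys = {line.strip() for line in f if line.strip()}

    kept = 0
    # Stream row by row so the full annotation file never sits in memory
    with gzip.open(annotation_path, "rt", newline="") as src, \
            gzip.open(output_path, "wt", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["standard_inchi_key"] in keys:  # exact match on the key
                writer.writerow(row)
                kept += 1
    return kept


# Toy demonstration with made-up keys written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    annotation_path = os.path.join(tmp, "chembl_annotation.csv.gz")
    keys_path = os.path.join(tmp, "compound_inchi_key.txt")
    output_path = os.path.join(tmp, "chembl_annotation_filtered.csv.gz")

    with gzip.open(annotation_path, "wt", newline="") as f:
        f.write("standard_inchi_key,pchembl_value\nKEY-A,6.8\nKEY-B,5.1\nKEY-A,7.2\n")
    with open(keys_path, "w") as f:
        f.write("KEY-A\nKEY-C\n")

    n_kept = filter_annotations(annotation_path, keys_path, output_path)
    with gzip.open(output_path, "rt", newline="") as f:
        filtered_rows = list(csv.DictReader(f))

print(n_kept)
```

Holding the keys in a set keeps each lookup cheap, which matters when the key file has over a hundred thousand entries.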

### Create mapping between `standard_inchi_key` and `chembl_id`
#### Create mapping between `standard_inchi_key` and `chembl_id`

Run SQL query to get mapping between `standard_inchi_key` and `chembl_id`

@@ -168,6 +230,13 @@ gzcat data/inchikey_chembl.csv.gz | wc -l
# 2304876
```
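For illustration, a mapping query of this kind could look like the sketch below. It runs against an in-memory SQLite database with two toy tables named after ChEMBL's `compound_structures` and `molecule_dictionary`. The table contents, the reduced column set, and the query itself are assumptions made for this example; they are not the query used to produce `data/inchikey_chembl.csv.gz`.

```python
import sqlite3

# In-memory stand-in for the ChEMBL SQLite database; the two toy tables
# only mimic the columns this sketch needs (an assumption, not the full schema)
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY, chembl_id TEXT, pref_name TEXT);
    CREATE TABLE compound_structures (molregno INTEGER PRIMARY KEY, standard_inchi_key TEXT);
    INSERT INTO molecule_dictionary VALUES (1, 'CHEMBL_X1', 'NAME ONE'), (2, 'CHEMBL_X2', NULL);
    INSERT INTO compound_structures VALUES (1, 'KEY-A'), (2, 'KEY-B');
    """
)

# Join structures to molecules to get one (InChIKey, ChEMBL ID, name) row each
mapping_query = """
    SELECT cs.standard_inchi_key, md.chembl_id, md.pref_name
    FROM compound_structures AS cs
    JOIN molecule_dictionary AS md ON md.molregno = cs.molregno
    ORDER BY cs.standard_inchi_key
"""
mapping_rows = conn.execute(mapping_query).fetchall()
conn.close()

for row in mapping_rows:
    print(row)
```

In the toy data one molecule has no `pref_name`, which comes back as `None`.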

Count the number of rows in the `compound_inchi_key.txt` file

```sh
wc -l data/compound_inchi_key.txt
# 116753
```

Now find rows in `data/inchikey_chembl.csv.gz` that have `standard_inchi_key` that are present in `data/compound_inchi_key.txt`

```sh
@@ -188,7 +257,4 @@ standard_inchi_key: 30072
pref_name: 2508
```

```sh
wc -l data/compound_inchi_key.txt
# 116753
```
</details>