Credits: Lewis Mervin for the original source code.
- Install Python
- Install Poetry
- Install Poetry Environment:
poetry install
For Linux, see
- python-poetry/poetry#1917 (comment) if installing six fails
- https://stackoverflow.com/a/75435100 if you get a "does not contain any element" warning when running poetry install
On a VM with more than 40 GB of disk space, download the ChEMBL SQLite database (4.2 GB compressed, 23 GB uncompressed)
wget https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_31_sqlite.tar.gz
tar -xvzf chembl_31_sqlite.tar.gz
This produces a directory chembl_31 with the following structure:
chembl_31
└── chembl_31_sqlite
    ├── INSTALL_sqlite
    └── chembl_31.db
Run SQL query to extract ChEMBL annotation
sqlite3 -header -csv chembl_31/chembl_31_sqlite/chembl_31.db < sql/extract_chembl_annotation.sql | gzip > data/chembl_annotation.csv.gz
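For readers who prefer to stay inside the Poetry environment, the same export can be sketched with Python's standard library. This is a minimal sketch, not the repository's method: the function name export_query_to_csv_gz is hypothetical, it assumes the .sql file holds a single SELECT statement, and numeric formatting may differ slightly from the sqlite3 command-line output.

```python
import csv
import gzip
import sqlite3
from contextlib import closing


def export_query_to_csv_gz(db_path, sql_text, out_path):
    """Run a single SELECT statement and write the result to a gzipped CSV.

    Hypothetical helper: assumes sql_text holds exactly one statement.
    Returns the number of data rows written (header excluded).
    """
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.execute(sql_text)
        # Column names of the result set become the CSV header.
        header = [column[0] for column in cursor.description]
        n_rows = 0
        with gzip.open(out_path, "wt", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            for row in cursor:
                # NULL values arrive as None, which csv.writer writes as an empty field.
                writer.writerow(row)
                n_rows += 1
    return n_rows


# Example call, using the paths from the command above:
# with open("sql/extract_chembl_annotation.sql") as f:
#     export_query_to_csv_gz(
#         "chembl_31/chembl_31_sqlite/chembl_31.db",
#         f.read(),
#         "data/chembl_annotation.csv.gz",
#     )
```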
View the first 5 lines of the annotation file (the header plus 4 data rows)
head -n 5 <(gzcat data/chembl_annotation.csv.gz)
assay_chembl_id,target_chembl_id,assay_type,molecule_chembl_id,pchembl_value,confidence_score,standard_inchi_key,pref_name
1714633,CHEMBL3987582,B,CHEMBL4107559,6.07,7,UVVXRMZCPKQLAO-OAHLLOKOSA-N,
1714649,CHEMBL3987582,B,CHEMBL4107559,5.86,7,UVVXRMZCPKQLAO-OAHLLOKOSA-N,
1714633,CHEMBL3987582,B,CHEMBL4108338,6.15,7,OZBMIGDQBBMIRA-CQSZACIVSA-N,
1714649,CHEMBL3987582,B,CHEMBL4108338,5.84,7,OZBMIGDQBBMIRA-CQSZACIVSA-N,
Rendered as a table:
assay_chembl_id | target_chembl_id | assay_type | molecule_chembl_id | pchembl_value | confidence_score | standard_inchi_key | pref_name |
---|---|---|---|---|---|---|---|
1714633 | CHEMBL3987582 | B | CHEMBL4107559 | 6.07 | 7 | UVVXRMZCPKQLAO-OAHLLOKOSA-N | |
1714649 | CHEMBL3987582 | B | CHEMBL4107559 | 5.86 | 7 | UVVXRMZCPKQLAO-OAHLLOKOSA-N | |
1714633 | CHEMBL3987582 | B | CHEMBL4108338 | 6.15 | 7 | OZBMIGDQBBMIRA-CQSZACIVSA-N | |
1714649 | CHEMBL3987582 | B | CHEMBL4108338 | 5.84 | 7 | OZBMIGDQBBMIRA-CQSZACIVSA-N | |
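On systems without gzcat, the same preview can be produced from Python. The helper below is a hypothetical sketch (preview_csv_gz is not part of the repository); it returns the header and the first few data rows of any gzipped CSV file.

```python
import csv
import gzip
from itertools import islice


def preview_csv_gz(path, n_rows=4):
    """Return (header, rows) for the first n_rows data rows of a gzipped CSV."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        # An empty file yields an empty header instead of raising StopIteration.
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows


# Example call:
# header, rows = preview_csv_gz("data/chembl_annotation.csv.gz")
```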
Count the number of lines in the annotation file (wc -l also counts the header line)
gzcat data/chembl_annotation.csv.gz | wc -l
# 1185184
Count the number of unique standard_inchi_key values in the annotation file
gzcat data/chembl_annotation.csv.gz | csvcut -c standard_inchi_key | tail -n +2 | sort | uniq | wc -l
# 556272
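Both counts can also be obtained in a single pass with Python. This is a sketch under the assumption that the file has a header row; count_rows_and_unique is a hypothetical name. Unlike wc -l, the row count returned here excludes the header.

```python
import csv
import gzip


def count_rows_and_unique(path, column):
    """Return (number of data rows, number of distinct values in column)."""
    n_rows = 0
    unique_values = set()
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            n_rows += 1
            unique_values.add(row[column])
    return n_rows, len(unique_values)


# Example call:
# count_rows_and_unique("data/chembl_annotation.csv.gz", "standard_inchi_key")
```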
Filter the annotation file to only include rows with a standard_inchi_key that is present in the compound.csv.gz file
wget https://raw.githubusercontent.com/jump-cellpainting/datasets/0682dd2d52e4d68208ab4af3a0bd114ca557cb0e/metadata/compound.csv.gz
mv compound.csv.gz data/
gzcat data/compound.csv.gz | csvcut -c Metadata_InChIKey | tail -n +2 | sort | uniq > data/compound_inchi_key.csv
Now find rows in data/chembl_annotation.csv.gz whose standard_inchi_key is present in data/compound_inchi_key.csv
csvgrep -c standard_inchi_key -f data/compound_inchi_key.csv <(gzcat data/chembl_annotation.csv.gz) | gzip > data/chembl_annotation_filtered.csv.gz
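The filtering step amounts to an exact-match set membership test on one column. A minimal Python sketch of that idea follows; load_keys and filter_rows_by_keys are hypothetical names, and the sketch assumes the key file holds one InChIKey per line with no header, as produced by the previous command. Blank lines in the key file are skipped here, which is a choice of this sketch rather than documented csvgrep behavior.

```python
import csv
import gzip


def load_keys(path):
    """Read one key per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_rows_by_keys(in_path, out_path, column, keys):
    """Copy rows whose value in column is in keys; return the number kept."""
    kept = 0
    with gzip.open(in_path, "rt", newline="", encoding="utf-8") as src, \
         gzip.open(out_path, "wt", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[column] in keys:
                writer.writerow(row)
                kept += 1
    return kept


# Example call:
# keys = load_keys("data/compound_inchi_key.csv")
# filter_rows_by_keys("data/chembl_annotation.csv.gz",
#                     "data/chembl_annotation_filtered.csv.gz",
#                     "standard_inchi_key", keys)
```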
View the first 5 lines of the filtered annotation file (the header plus 4 data rows)
head -n 5 <(gzcat data/chembl_annotation_filtered.csv.gz)
assay_chembl_id,target_chembl_id,assay_type,molecule_chembl_id,pchembl_value,confidence_score,standard_inchi_key,pref_name
1931436,CHEMBL4523105,B,CHEMBL3716578,5.75,9,GUUWHOSUKOCRHG-UHFFFAOYSA-N,
1931437,CHEMBL4523105,B,CHEMBL3716578,5.85,9,GUUWHOSUKOCRHG-UHFFFAOYSA-N,
1931437,CHEMBL4523105,B,CHEMBL4571346,5.01,9,ILKYRSSTSHMXTC-UHFFFAOYSA-N,
446514,CHEMBL2094132,B,CHEMBL1112,6.3,5,CEUORZQYGODEFX-UHFFFAOYSA-N,ARIPIPRAZOLE
Rendered as a table:
assay_chembl_id | target_chembl_id | assay_type | molecule_chembl_id | pchembl_value | confidence_score | standard_inchi_key | pref_name |
---|---|---|---|---|---|---|---|
1931436 | CHEMBL4523105 | B | CHEMBL3716578 | 5.75 | 9 | GUUWHOSUKOCRHG-UHFFFAOYSA-N | |
1931437 | CHEMBL4523105 | B | CHEMBL3716578 | 5.85 | 9 | GUUWHOSUKOCRHG-UHFFFAOYSA-N | |
1931437 | CHEMBL4523105 | B | CHEMBL4571346 | 5.01 | 9 | ILKYRSSTSHMXTC-UHFFFAOYSA-N | |
446514 | CHEMBL2094132 | B | CHEMBL1112 | 6.3 | 5 | CEUORZQYGODEFX-UHFFFAOYSA-N | ARIPIPRAZOLE |
Count the number of lines in the filtered annotation file (wc -l also counts the header line)
gzcat data/chembl_annotation_filtered.csv.gz | wc -l
# 44018
Count the number of unique standard_inchi_key values in the filtered annotation file
gzcat data/chembl_annotation_filtered.csv.gz | csvcut -c standard_inchi_key | tail -n +2 | sort | uniq | wc -l
# 4718
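As a quick sanity check, the counts reported above can be combined into two summary figures. The snippet below only does arithmetic on those numbers; it assumes no CSV field contains an embedded newline, so that the wc -l result equals the number of data rows plus one header line.

```python
# Counts copied from the command outputs above.
filtered_lines = 44018        # wc -l on the filtered file, header included
unique_keys_filtered = 4718   # unique standard_inchi_key values after filtering
unique_keys_all = 556272      # unique standard_inchi_key values before filtering

data_rows = filtered_lines - 1  # drop the header line
rows_per_compound = data_rows / unique_keys_filtered
fraction_of_keys_kept = unique_keys_filtered / unique_keys_all

print(f"annotation rows per matched InChIKey: {rows_per_compound:.2f}")
print(f"share of annotated InChIKeys retained: {fraction_of_keys_kept:.2%}")
```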
Here are all the commands in one place to create chembl_annotation_filtered.csv.gz from chembl_annotation.csv.gz and compound.csv.gz:
commit=0682dd2d52e4d68208ab4af3a0bd114ca557cb0e
wget https://raw.githubusercontent.com/jump-cellpainting/datasets/${commit}/metadata/compound.csv.gz
mv compound.csv.gz data/
gzcat data/compound.csv.gz | csvcut -c Metadata_InChIKey | tail -n +2 | sort | uniq > data/compound_inchi_key.csv
csvgrep -c standard_inchi_key -f data/compound_inchi_key.csv <(gzcat data/chembl_annotation.csv.gz) | gzip > data/chembl_annotation_filtered.csv.gz
Run the SQL query to get the mapping between standard_inchi_key and chembl_id
sqlite3 -header -csv chembl_31/chembl_31_sqlite/chembl_31.db < sql/extract_chembl_inchikey_mapping.sql | gzip > data/inchikey_chembl.csv.gz
View the first 5 lines of the inchikey_chembl.csv.gz file (the header plus 4 data rows)
head -n 5 <(gzcat data/inchikey_chembl.csv.gz)
molecule_chembl_id,standard_inchi_key,pref_name
CHEMBL4972698,AAAADVYFXUUVEO-UHFFFAOYSA-N,
CHEMBL492934,AAAAEENPAALFRN-UHFFFAOYSA-N,
CHEMBL4097563,AAAAJHGLNDAXFP-VNKVACROSA-N,
CHEMBL246893,AAAAKTROWFNLEP-UHFFFAOYSA-N,
Rendered as a table:
molecule_chembl_id | standard_inchi_key | pref_name |
---|---|---|
CHEMBL4972698 | AAAADVYFXUUVEO-UHFFFAOYSA-N | |
CHEMBL492934 | AAAAEENPAALFRN-UHFFFAOYSA-N | |
CHEMBL4097563 | AAAAJHGLNDAXFP-VNKVACROSA-N | |
CHEMBL246893 | AAAAKTROWFNLEP-UHFFFAOYSA-N | |
Count the number of lines in the inchikey_chembl.csv.gz file (wc -l also counts the header line)
gzcat data/inchikey_chembl.csv.gz | wc -l
# 2304876
Now find rows in data/inchikey_chembl.csv.gz whose standard_inchi_key is present in data/compound_inchi_key.csv
csvgrep -c standard_inchi_key -f data/compound_inchi_key.csv <(gzcat data/inchikey_chembl.csv.gz) | gzip > data/inchikey_chembl_filtered.csv.gz
View the first 5 lines of the inchikey_chembl_filtered.csv.gz file (the header plus 4 data rows)
head -n 5 <(gzcat data/inchikey_chembl_filtered.csv.gz)
molecule_chembl_id,standard_inchi_key,pref_name
CHEMBL592894,AAAJHRMBUHXWLD-UHFFFAOYSA-N,
CHEMBL268868,AAALVYBICLMAMA-UHFFFAOYSA-N,DAPH
CHEMBL1734241,AAAZRMGPBSWFDK-UHFFFAOYSA-N,
CHEMBL3449946,AABSTWCOLWSFRA-UHFFFAOYSA-N,
Rendered as a table:
molecule_chembl_id | standard_inchi_key | pref_name |
---|---|---|
CHEMBL592894 | AAAJHRMBUHXWLD-UHFFFAOYSA-N | |
CHEMBL268868 | AAALVYBICLMAMA-UHFFFAOYSA-N | DAPH |
CHEMBL1734241 | AAAZRMGPBSWFDK-UHFFFAOYSA-N | |
CHEMBL3449946 | AABSTWCOLWSFRA-UHFFFAOYSA-N | |
Count the number of lines in the inchikey_chembl_filtered.csv.gz file (wc -l also counts the header line)
gzcat data/inchikey_chembl_filtered.csv.gz | wc -l
# 30073
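Once the filtered mapping exists, it can be loaded into a lookup table for downstream use. The sketch below is one possible way to do that and is not taken from the repository: load_inchikey_to_chembl is a hypothetical name, and values are kept as lists because the sketch does not assume that each InChIKey maps to exactly one ChEMBL ID.

```python
import csv
import gzip
from collections import defaultdict


def load_inchikey_to_chembl(path):
    """Map each standard_inchi_key to the list of molecule_chembl_id values seen for it."""
    mapping = defaultdict(list)
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            mapping[row["standard_inchi_key"]].append(row["molecule_chembl_id"])
    return dict(mapping)


# Example call:
# lookup = load_inchikey_to_chembl("data/inchikey_chembl_filtered.csv.gz")
# lookup.get("AAALVYBICLMAMA-UHFFFAOYSA-N")
```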