johnarevalo/compound-annotator

compound-annotator

Credits: Lewis Mervin for the original source code.

Setup

  1. Install Python
  2. Install Poetry
  3. Install the Poetry environment: poetry install

For Linux, see

Run

Create annotation file

On a VM with more than 40 GB of free disk space, download the ChEMBL SQLite database (4.2 GB compressed, 23 GB uncompressed)

wget https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_31_sqlite.tar.gz
tar -xvzf chembl_31_sqlite.tar.gz

This produces a directory chembl_31 with the following structure:

chembl_31
└── chembl_31_sqlite
    ├── INSTALL_sqlite
    └── chembl_31.db
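
Before starting the download, it can help to confirm that the working directory has enough free space. The snippet below is a minimal sketch using only the Python standard library; the function name `free_disk_gb` is hypothetical, and the 40 GB threshold is simply the figure quoted above.

```python
import shutil

REQUIRED_GB = 40  # threshold quoted in this README


def free_disk_gb(path: str = ".") -> float:
    """Return the free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    free = free_disk_gb(".")
    status = "enough" if free > REQUIRED_GB else "NOT enough"
    print(f"{free:.1f} GB free: {status} for the ChEMBL download and extraction")
```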

Run the SQL query to extract the ChEMBL annotations

sqlite3 -header -csv chembl_31/chembl_31_sqlite/chembl_31.db < sql/extract_chembl_annotation.sql | gzip > data/chembl_annotation.csv.gz
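
If the sqlite3 command-line tool is not available, the same export can be approximated from Python. This is a sketch, not the repository's own code: the function name `export_query_to_csv_gz` is hypothetical, and it assumes the .sql file contains exactly one SELECT statement (the standard-library `sqlite3` module refuses to run several statements in a single `execute` call).

```python
import csv
import gzip
import sqlite3
from pathlib import Path


def export_query_to_csv_gz(db_path: str, sql_path: str, out_path: str) -> int:
    """Run the single SELECT stored in `sql_path` against `db_path` and write
    the result, with a header row, to a gzip-compressed CSV.

    Returns the number of data rows written (header excluded).
    """
    query = Path(sql_path).read_text()
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(query)
        # cursor.description holds one tuple per result column; item 0 is the name.
        header = [column[0] for column in cursor.description]
        n_rows = 0
        with gzip.open(out_path, "wt", newline="") as handle:
            writer = csv.writer(handle, lineterminator="\n")
            writer.writerow(header)
            for row in cursor:
                writer.writerow(row)  # NULL values (None) become empty fields
                n_rows += 1
    finally:
        conn.close()
    return n_rows


# Example call, using the paths from the command above:
# export_query_to_csv_gz(
#     "chembl_31/chembl_31_sqlite/chembl_31.db",
#     "sql/extract_chembl_annotation.sql",
#     "data/chembl_annotation.csv.gz",
# )
```

Rows are streamed from the cursor rather than fetched all at once, so memory use stays modest even for a result with over a million rows.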

View the first 5 lines of the annotation file (the header plus 4 rows)

head -n 5 <(gzcat data/chembl_annotation.csv.gz)
assay_chembl_id,target_chembl_id,assay_type,molecule_chembl_id,pchembl_value,confidence_score,standard_inchi_key,pref_name
1714633,CHEMBL3987582,B,CHEMBL4107559,6.07,7,UVVXRMZCPKQLAO-OAHLLOKOSA-N,
1714649,CHEMBL3987582,B,CHEMBL4107559,5.86,7,UVVXRMZCPKQLAO-OAHLLOKOSA-N,
1714633,CHEMBL3987582,B,CHEMBL4108338,6.15,7,OZBMIGDQBBMIRA-CQSZACIVSA-N,
1714649,CHEMBL3987582,B,CHEMBL4108338,5.84,7,OZBMIGDQBBMIRA-CQSZACIVSA-N,

Rendered as a table:

| assay_chembl_id | target_chembl_id | assay_type | molecule_chembl_id | pchembl_value | confidence_score | standard_inchi_key | pref_name |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1714633 | CHEMBL3987582 | B | CHEMBL4107559 | 6.07 | 7 | UVVXRMZCPKQLAO-OAHLLOKOSA-N | |
| 1714649 | CHEMBL3987582 | B | CHEMBL4107559 | 5.86 | 7 | UVVXRMZCPKQLAO-OAHLLOKOSA-N | |
| 1714633 | CHEMBL3987582 | B | CHEMBL4108338 | 6.15 | 7 | OZBMIGDQBBMIRA-CQSZACIVSA-N | |
| 1714649 | CHEMBL3987582 | B | CHEMBL4108338 | 5.84 | 7 | OZBMIGDQBBMIRA-CQSZACIVSA-N | |

Count the number of lines in the annotation file (the count includes the header line)

gzcat data/chembl_annotation.csv.gz | wc -l
# 1185184

Count the number of unique standard_inchi_key in the annotation file

gzcat data/chembl_annotation.csv.gz | csvcut -c standard_inchi_key | tail -n +2 | sort | uniq | wc -l
# 556272
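
Both counts can also be obtained in one pass from Python, without csvkit. The sketch below uses only the standard library; `count_rows_and_unique` is a hypothetical helper name. Unlike `wc -l`, it counts data rows only, so its first number is expected to be one less than the line count reported above.

```python
import csv
import gzip


def count_rows_and_unique(path: str, column: str) -> tuple[int, int]:
    """Return (number of data rows, number of distinct values in `column`)
    for a gzip-compressed CSV file with a header row."""
    n_rows = 0
    unique_values = set()
    with gzip.open(path, "rt", newline="") as handle:
        for record in csv.DictReader(handle):
            n_rows += 1
            unique_values.add(record[column])
    return n_rows, len(unique_values)


# Example call, using the file created above:
# count_rows_and_unique("data/chembl_annotation.csv.gz", "standard_inchi_key")
```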

Create filtered annotation file

Filter the annotation file to keep only the rows whose standard_inchi_key is present in the compound.csv.gz file. First download compound.csv.gz and extract its unique InChIKeys, one per line

wget https://raw.githubusercontent.com/jump-cellpainting/datasets/0682dd2d52e4d68208ab4af3a0bd114ca557cb0e/metadata/compound.csv.gz
mv compound.csv.gz data/
gzcat data/compound.csv.gz | csvcut -c Metadata_InChIKey| tail -n +2 | sort | uniq > data/compound_inchi_key.csv
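
For reference, the key-extraction step can be sketched in Python as follows. The helper name `write_unique_keys` is hypothetical; the column name `Metadata_InChIKey` is the one used in the command above. The output is a headerless file with one key per line. Python sorts by code point, so the line order may differ from the locale-dependent order of `sort`, but the set of keys is the same.

```python
import csv
import gzip


def write_unique_keys(csv_gz_path: str, column: str, out_path: str) -> int:
    """Write the distinct values of `column` to `out_path`, one per line and
    without a header. Returns the number of distinct values."""
    with gzip.open(csv_gz_path, "rt", newline="") as handle:
        values = {record[column] for record in csv.DictReader(handle)}
    with open(out_path, "w") as out:
        for value in sorted(values):
            out.write(value + "\n")
    return len(values)


# Example call, using the paths from the command above:
# write_unique_keys("data/compound.csv.gz", "Metadata_InChIKey",
#                   "data/compound_inchi_key.csv")
```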

Now find the rows in data/chembl_annotation.csv.gz whose standard_inchi_key is present in data/compound_inchi_key.csv

csvgrep -c standard_inchi_key -f data/compound_inchi_key.csv <(gzcat data/chembl_annotation.csv.gz) | gzip > data/chembl_annotation_filtered.csv.gz
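
The `csvgrep -f` call above keeps a row when its standard_inchi_key cell exactly matches a line of the key file. A minimal Python sketch of the same exact-match filter is shown below; `filter_csv_gz_by_keys` is a hypothetical helper name. One deliberate difference: blank lines in the key file are ignored here.

```python
import csv
import gzip


def filter_csv_gz_by_keys(in_path: str, keys_path: str, out_path: str,
                          column: str = "standard_inchi_key") -> int:
    """Copy the rows of `in_path` whose `column` value appears in `keys_path`
    (one key per line, no header) to `out_path`, keeping the header row.

    Returns the number of data rows kept.
    """
    with open(keys_path) as handle:
        # A set gives constant-time membership tests; blank lines are skipped.
        keys = {line.strip() for line in handle if line.strip()}

    kept = 0
    with gzip.open(in_path, "rt", newline="") as src, \
            gzip.open(out_path, "wt", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                lineterminator="\n")
        writer.writeheader()
        for record in reader:
            if record[column] in keys:
                writer.writerow(record)
                kept += 1
    return kept


# Example call, using the paths from the command above:
# filter_csv_gz_by_keys("data/chembl_annotation.csv.gz",
#                       "data/compound_inchi_key.csv",
#                       "data/chembl_annotation_filtered.csv.gz")
```

The same function applies unchanged to any other gzip-compressed CSV that has a standard_inchi_key column.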

View the first 5 lines of the filtered annotation file (the header plus 4 rows)

head -n 5 <(gzcat data/chembl_annotation_filtered.csv.gz)
assay_chembl_id,target_chembl_id,assay_type,molecule_chembl_id,pchembl_value,confidence_score,standard_inchi_key,pref_name
1931436,CHEMBL4523105,B,CHEMBL3716578,5.75,9,GUUWHOSUKOCRHG-UHFFFAOYSA-N,
1931437,CHEMBL4523105,B,CHEMBL3716578,5.85,9,GUUWHOSUKOCRHG-UHFFFAOYSA-N,
1931437,CHEMBL4523105,B,CHEMBL4571346,5.01,9,ILKYRSSTSHMXTC-UHFFFAOYSA-N,
446514,CHEMBL2094132,B,CHEMBL1112,6.3,5,CEUORZQYGODEFX-UHFFFAOYSA-N,ARIPIPRAZOLE

Rendered as a table:

| assay_chembl_id | target_chembl_id | assay_type | molecule_chembl_id | pchembl_value | confidence_score | standard_inchi_key | pref_name |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1931436 | CHEMBL4523105 | B | CHEMBL3716578 | 5.75 | 9 | GUUWHOSUKOCRHG-UHFFFAOYSA-N | |
| 1931437 | CHEMBL4523105 | B | CHEMBL3716578 | 5.85 | 9 | GUUWHOSUKOCRHG-UHFFFAOYSA-N | |
| 1931437 | CHEMBL4523105 | B | CHEMBL4571346 | 5.01 | 9 | ILKYRSSTSHMXTC-UHFFFAOYSA-N | |
| 446514 | CHEMBL2094132 | B | CHEMBL1112 | 6.3 | 5 | CEUORZQYGODEFX-UHFFFAOYSA-N | ARIPIPRAZOLE |

Count the number of lines in the filtered annotation file (the count includes the header line)

gzcat data/chembl_annotation_filtered.csv.gz | wc -l
# 44018

Count the number of unique standard_inchi_key in the filtered annotation file

gzcat data/chembl_annotation_filtered.csv.gz | csvcut -c standard_inchi_key | tail -n +2 | sort | uniq | wc -l
# 4718

Here are all the commands in one place to create chembl_annotation_filtered.csv.gz from chembl_annotation.csv.gz and compound.csv.gz:

commit=0682dd2d52e4d68208ab4af3a0bd114ca557cb0e

wget https://raw.githubusercontent.com/jump-cellpainting/datasets/${commit}/metadata/compound.csv.gz

mv compound.csv.gz data/

gzcat data/compound.csv.gz | csvcut -c Metadata_InChIKey| tail -n +2 | sort | uniq > data/compound_inchi_key.csv

csvgrep -c standard_inchi_key -f data/compound_inchi_key.csv <(gzcat data/chembl_annotation.csv.gz) | gzip > data/chembl_annotation_filtered.csv.gz

Create mapping between standard_inchi_key and chembl_id

Run SQL query to get mapping between standard_inchi_key and chembl_id

sqlite3 -header -csv chembl_31/chembl_31_sqlite/chembl_31.db < sql/extract_chembl_inchikey_mapping.sql  | gzip > data/inchikey_chembl.csv.gz

View the first 5 lines of the inchikey_chembl.csv.gz file (the header plus 4 rows)

head -n 5 <(gzcat data/inchikey_chembl.csv.gz)
molecule_chembl_id,standard_inchi_key,pref_name
CHEMBL4972698,AAAADVYFXUUVEO-UHFFFAOYSA-N,
CHEMBL492934,AAAAEENPAALFRN-UHFFFAOYSA-N,
CHEMBL4097563,AAAAJHGLNDAXFP-VNKVACROSA-N,
CHEMBL246893,AAAAKTROWFNLEP-UHFFFAOYSA-N,

Rendered as a table:

| molecule_chembl_id | standard_inchi_key | pref_name |
| --- | --- | --- |
| CHEMBL4972698 | AAAADVYFXUUVEO-UHFFFAOYSA-N | |
| CHEMBL492934 | AAAAEENPAALFRN-UHFFFAOYSA-N | |
| CHEMBL4097563 | AAAAJHGLNDAXFP-VNKVACROSA-N | |
| CHEMBL246893 | AAAAKTROWFNLEP-UHFFFAOYSA-N | |

Count the number of lines in the inchikey_chembl.csv.gz file (the count includes the header line)

gzcat data/inchikey_chembl.csv.gz | wc -l
# 2304876

Now find the rows in data/inchikey_chembl.csv.gz whose standard_inchi_key is present in data/compound_inchi_key.csv

csvgrep -c standard_inchi_key -f data/compound_inchi_key.csv <(gzcat data/inchikey_chembl.csv.gz) | gzip > data/inchikey_chembl_filtered.csv.gz

View the first 5 lines of the inchikey_chembl_filtered.csv.gz file (the header plus 4 rows)

head -n 5 <(gzcat data/inchikey_chembl_filtered.csv.gz)
molecule_chembl_id,standard_inchi_key,pref_name
CHEMBL592894,AAAJHRMBUHXWLD-UHFFFAOYSA-N,
CHEMBL268868,AAALVYBICLMAMA-UHFFFAOYSA-N,DAPH
CHEMBL1734241,AAAZRMGPBSWFDK-UHFFFAOYSA-N,
CHEMBL3449946,AABSTWCOLWSFRA-UHFFFAOYSA-N,

Rendered as a table:

| molecule_chembl_id | standard_inchi_key | pref_name |
| --- | --- | --- |
| CHEMBL592894 | AAAJHRMBUHXWLD-UHFFFAOYSA-N | |
| CHEMBL268868 | AAALVYBICLMAMA-UHFFFAOYSA-N | DAPH |
| CHEMBL1734241 | AAAZRMGPBSWFDK-UHFFFAOYSA-N | |
| CHEMBL3449946 | AABSTWCOLWSFRA-UHFFFAOYSA-N | |

Count the number of lines in the inchikey_chembl_filtered.csv.gz file (the count includes the header line)

gzcat data/inchikey_chembl_filtered.csv.gz | wc -l
# 30073
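
Once the filtered mapping file exists, it can be loaded into a dictionary for lookups. The following is an illustrative sketch rather than code from this repository: `load_inchikey_to_chembl` is a hypothetical name, and the column names are taken from the header shown above. Each key maps to a list, on the assumption that one standard_inchi_key could in principle be associated with more than one molecule_chembl_id.

```python
import csv
import gzip
from collections import defaultdict


def load_inchikey_to_chembl(path: str) -> dict[str, list[str]]:
    """Map each standard_inchi_key to the molecule_chembl_id values listed
    for it in a gzip-compressed mapping CSV."""
    mapping = defaultdict(list)
    with gzip.open(path, "rt", newline="") as handle:
        for record in csv.DictReader(handle):
            mapping[record["standard_inchi_key"]].append(
                record["molecule_chembl_id"]
            )
    return dict(mapping)


# Example call, using the file created above:
# mapping = load_inchikey_to_chembl("data/inchikey_chembl_filtered.csv.gz")
# mapping.get("AAALVYBICLMAMA-UHFFFAOYSA-N")
```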
