In the recent mol2vec paper, Jaeger et al. treat the features returned by the RDKit Morgan fingerprint as "words" and a compound as a "sentence" in order to generate fixed-length embeddings. Here we reproduce 200-element embeddings from a download of all SDF files in the PubChem compound database.
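As a rough illustration of the "words" and "sentence" idea, the sketch below turns Morgan identifiers into a list of string tokens. It assumes a dictionary shaped like the one RDKit fills in through the bitInfo argument of its Morgan fingerprint functions (identifier mapped to (atom index, radius) pairs). The helper name sentence_from_bit_info, the toy identifiers, and the atom-then-radius ordering are assumptions made for this example, not code from this repository.

```python
def sentence_from_bit_info(bit_info, n_atoms, max_radius):
    """Order Morgan identifiers by atom index, then radius, and return them as string 'words'."""
    # Index identifiers by (atom index, radius) so they can be looked up in order.
    by_position = {}
    for identifier, environments in bit_info.items():
        for atom_idx, radius in environments:
            by_position[(atom_idx, radius)] = identifier

    sentence = []
    for atom_idx in range(n_atoms):
        for radius in range(max_radius + 1):
            identifier = by_position.get((atom_idx, radius))
            if identifier is not None:
                sentence.append(str(identifier))
    return sentence


# Hypothetical identifiers for a three-atom molecule (made up, not real RDKit output).
toy_bit_info = {
    101: ((0, 0), (2, 0)),  # the same radius-0 environment on atoms 0 and 2
    202: ((1, 0),),         # radius-0 environment on atom 1
    303: ((1, 1),),         # radius-1 environment centred on atom 1
}
sentence = sentence_from_bit_info(toy_bit_info, n_atoms=3, max_radius=1)
print(sentence)
```

A list of such sentences, one per compound, is the kind of corpus a word-embedding model such as gensim's Word2Vec can be trained on.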
Ensure that gensim is installed via:
pip install gensim
First, download the PubChem compound SDF corpus by running:
python ../pubchem_dataset/
Note: the script assumes that a /media/data/pubchem directory exists for this large download (approx. 19 GB as of November 2017).
Then generate the embeddings file via:
You can then use these embeddings as a fixed-length alternative to fingerprints derived directly from RDKit. A full implementation as a featurizer for DeepChem is a work in progress.
Example code that uses the vec.txt file created by the above script can be found in eval_mol2vec_results.
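For orientation, here is a minimal standard-library sketch of how such a file could be read and turned into one fixed-length vector per compound. It assumes vec.txt is in the plain-text word2vec format (a "count dimension" header line, then one "word v1 v2 ..." line per identifier), which is what gensim writes when saving in non-binary word2vec format. The function names are hypothetical, and averaging the word vectors is one simple choice (summing is another). The toy data uses 3-element vectors in place of the 200-element ones.

```python
import io


def load_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vdim' per line."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def embed_sentence(words, vectors, dim):
    """Average the vectors of the known words; unknown words are ignored."""
    total = [0.0] * dim
    known = 0
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue
        known += 1
        for i, value in enumerate(vec):
            total[i] += value
    if known == 0:
        return total  # all-zero vector when no word is in the vocabulary
    return [value / known for value in total]


# Toy stand-in for vec.txt (assumed layout); a real file would be opened with open(path).
toy_file = io.StringIO("2 3\n111 1.0 2.0 3.0\n222 3.0 4.0 5.0\n")
dim, vectors = load_word2vec_text(toy_file)
embedding = embed_sentence(["111", "222", "999"], vectors, dim)
print(embedding)
```

The resulting list has the same length for every compound, regardless of how many identifiers the compound contains, which is what makes it usable as a drop-in feature vector.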