mol2vec implementation

In the recent mol2vec paper, Jaeger et al. treat the substructure identifiers returned by the RDKit Morgan fingerprint as "words" and a compound as a "sentence", and train fixed-length embeddings on them. Here we reproduce 200-element embeddings from a download of all SDF files in the PubChem Compound database.
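To make the "words" and "sentence" idea concrete, here is a minimal sketch of how a compound might be turned into a sentence of Morgan identifiers. It is an illustration under stated assumptions, not necessarily what mol2vec.py in this directory does: the function names are hypothetical, the ordering (atom by atom, smallest radius first) is one plausible choice, and it uses RDKit's `AllChem.GetMorganFingerprint` with its `bitInfo` argument, which newer RDKit releases mark as deprecated.

```python
def sentence_from_bit_info(bit_info, num_atoms, radius=1):
    """Order Morgan identifiers atom by atom, smallest radius first.

    bit_info maps identifier -> ((atom_index, radius), ...), which is the
    structure RDKit fills in when bitInfo is passed to the fingerprint call.
    """
    per_atom = {atom: {} for atom in range(num_atoms)}
    for identifier, environments in bit_info.items():
        for atom_idx, env_radius in environments:
            per_atom[atom_idx][env_radius] = identifier

    sentence = []
    for atom in range(num_atoms):
        for r in range(radius + 1):
            if r in per_atom[atom]:
                # Identifiers become string "words" for word2vec-style training.
                sentence.append(str(per_atom[atom][r]))
    return sentence


def smiles_to_sentence(smiles, radius=1):
    """Hypothetical helper: SMILES string -> list of identifier words."""
    # RDKit is imported here so the pure helper above works without it.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    bit_info = {}
    AllChem.GetMorganFingerprint(mol, radius, bitInfo=bit_info)
    return sentence_from_bit_info(bit_info, mol.GetNumAtoms(), radius)
```

With RDKit installed, `smiles_to_sentence("CCO")` would return one word per atom and radius; writing one such sentence per line gives a corpus that a word2vec-style trainer can consume.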

Setup

Ensure that gensim is installed via:

pip install gensim

Creating training corpus

First, download the PubChem Compound SDF corpus by running:

bash ../pubchem_dataset/download_pubchem_ftp.sh

Note: the script assumes that a /media/data/pubchem directory exists for this large download (approximately 19 GB as of November 2017).
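Because the download is large, it can help to check the target directory before starting. The following is a small optional sketch and not part of this repository: `check_download_dir` is a hypothetical helper, and the 19 GB threshold is the November 2017 figure quoted above, so the current size may well be larger.

```python
import os
import shutil

# Approximate corpus size from this README (November 2017); likely larger today.
REQUIRED_BYTES = 19 * 10**9


def check_download_dir(path="/media/data/pubchem", required_bytes=REQUIRED_BYTES):
    """Return (ok, message) describing whether path can hold the download."""
    if not os.path.isdir(path):
        return False, f"{path} does not exist"
    free = shutil.disk_usage(path).free
    if free < required_bytes:
        return False, f"only {free / 10**9:.1f} GB free in {path}"
    return True, "ok"


if __name__ == "__main__":
    ok, message = check_download_dir()
    print(message)
```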

Then generate the embeddings file via:

./train_mol2vec.sh

You can then use these embeddings as a fixed-length alternative to fingerprints derived directly from RDKit. A full implementation as a featurizer for DeepChem is a work in progress.

Example code for using the vec.txt file created by the above script can be found in eval_mol2vec_results.py.
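For orientation, here is a minimal sketch of reading such a file and combining word vectors into one compound vector. It rests on assumptions that this README does not confirm: that vec.txt is in the word2vec text format (a header line with vocabulary size and dimension, then one word and its values per line), and that a compound vector is formed by summing the vectors of its words. The function names are hypothetical; see the evaluation script for what this repository actually does.

```python
def load_vectors(path):
    """Parse a word2vec-style text file into {word: [float, ...]}."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        header = handle.readline().split()
        dim = int(header[1])  # header is assumed to be "<vocab_size> <dim>"
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def embed_sentence(sentence, vectors, dim):
    """Sum the vectors of all known words; unknown words are ignored."""
    total = [0.0] * dim
    for word in sentence:
        vec = vectors.get(word)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total
```

Usage would look like `vectors, dim = load_vectors("vec.txt")` followed by `embed_sentence(words, vectors, dim)`, where `words` is the list of identifier strings for one compound. Ignoring unknown words is a simplification; mapping them to a dedicated "unknown" vector is another option.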
