Commit: update documentation

zhebrak authored Dec 12, 2018
1 parent 1ef94d3 commit 32497bc
Showing 1 changed file: README.md, 205 additions and 138 deletions.
__For more details, please refer to the [paper](https://arxiv.org/abs/1811.12823).__

## Dataset

We propose [a benchmarking dataset](https://media.githubusercontent.com/media/molecularsets/moses/master/data/dataset.csv) refined from the ZINC database.

The set is based on the ZINC Clean Leads collection. It contains 4,591,276 molecules in total, filtered by molecular weight in the range from 250 to 350 Daltons, number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms, atoms other than C, N, S, O, F, Cl, Br, and H, or cycles longer than 8 atoms. The molecules were then filtered with medicinal chemistry filters (MCFs) and PAINS filters.
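The filtering rules above can be sketched as a single predicate. This is an illustrative sketch only, not the actual dataset pipeline: the function name and its arguments are hypothetical, and the descriptor values (molecular weight, rotatable bond count, XlogP, atom symbols, largest ring size, charge flag) are assumed to be precomputed elsewhere, e.g. with RDKit.

```python
# Hypothetical sketch of the ZINC Clean Leads filtering rules described above.
# All descriptor values are assumed to be precomputed by an external tool.

ALLOWED_ATOMS = {"C", "N", "S", "O", "F", "Cl", "Br", "H"}

def passes_filters(mol_weight, n_rotatable_bonds, xlogp,
                   atom_symbols, max_ring_size, has_charged_atoms):
    """Return True if a molecule satisfies the dataset filtering rules."""
    if not (250 <= mol_weight <= 350):          # weight in 250..350 Daltons
        return False
    if n_rotatable_bonds > 7:                   # at most 7 rotatable bonds
        return False
    if xlogp > 3.5:                             # XlogP <= 3.5
        return False
    if has_charged_atoms:                       # no charged atoms
        return False
    if not set(atom_symbols) <= ALLOWED_ATOMS:  # only C, N, S, O, F, Cl, Br, H
        return False
    if max_ring_size > 8:                       # no cycles longer than 8 atoms
        return False
    return True

# A molecule satisfying every constraint:
print(passes_filters(300.0, 4, 2.1, ["C", "N", "O", "H"], 6, False))  # prints True
```

The MCF and PAINS substructure filters mentioned above are applied on top of these rules and are not reproduced in this sketch.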

The dataset contains 1,936,962 molecular structures. For experiments, we also provide training, test, and scaffold test (TestSF) sets.
Besides standard uniqueness and validity metrics, MOSES provides other metrics to assess the overall quality of generated molecules. Fragment similarity (Frag) and Scaffold similarity (Scaff) are cosine similarities between vectors of fragment or scaffold frequencies, respectively, of the generated and test sets. Nearest neighbor similarity (SNN) is the average similarity of generated molecules to the nearest molecule from the test set. Internal diversity (IntDiv) is one minus the average pairwise similarity of generated molecules. Fréchet ChemNet Distance (FCD) measures the difference in distributions of last-layer activations of ChemNet.
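The vector-based metrics can be illustrated in a few lines of pure Python. This is a toy sketch, not the MOSES implementation: real fragment/scaffold frequencies and fingerprints come from RDKit, while here frequency vectors are plain dicts and fingerprints are plain bit sets; IntDiv is computed as one minus the average pairwise Tanimoto similarity, a common convention.

```python
import math

# Toy sketches of Frag/Scaff-style cosine similarity, SNN, and IntDiv.
# Frequency vectors are {key: count} dicts; fingerprints are sets of bit indices.

def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two frequency vectors (used for Frag/Scaff)."""
    keys = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(k, 0) * freq_b.get(k, 0) for k in keys)
    na = math.sqrt(sum(v * v for v in freq_a.values()))
    nb = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def snn(gen_fps, test_fps):
    """Average similarity of each generated molecule to its nearest test molecule."""
    return sum(max(tanimoto(g, t) for t in test_fps) for g in gen_fps) / len(gen_fps)

def internal_diversity(gen_fps):
    """One minus the average pairwise similarity within the generated set."""
    sims = [tanimoto(a, b)
            for i, a in enumerate(gen_fps) for b in gen_fps[i + 1:]]
    return 1.0 - sum(sims) / len(sims)

gen = [{1, 2, 3}, {2, 3, 4}]   # two "generated" fingerprints
test = [{1, 2, 3, 5}]          # one "test" fingerprint
print(snn(gen, test))          # (3/4 + 2/5) / 2 = 0.575
print(internal_diversity(gen)) # 1 - 2/4 = 0.5
```

FCD is not sketched here: it is a Fréchet distance between multivariate Gaussians fitted to ChemNet activations, which requires the pretrained ChemNet model.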

<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th rowspan="2">Model</th>
<th rowspan="2">Valid (↑)</th>
<th rowspan="2">Unique@1k (↑)</th>
<th rowspan="2">Unique@10k (↑)</th>
<th colspan="2">FCD (↓)</th>
<th colspan="2">SNN (↓)</th>
<th colspan="2">Frag (↑)</th>
<th colspan="2">Scaff (↑)</th>
<th rowspan="2">IntDiv (↑)</th>
<th rowspan="2">Filters (↑)</th>
</tr>
<tr>
<th>Test</th>
<th>TestSF</th>
<th>Test</th>
<th>TestSF</th>
<th>Test</th>
<th>TestSF</th>
<th>Test</th>
<th>TestSF</th>
</tr>
</thead>
<tbody>
<tr>
<th>CharRNN</th>
<td>0.9598</td>
<td><b>1.0000</b></td>
<td>0.9993</td>
<td>0.3233</td>
<td>0.8355</td>
<td>0.4606</td>
<td>0.4492</td>
<td>0.9977</td>
<td>0.9962</td>
<td>0.7964</td>
<td>0.1281</td>
<td><b>0.8561</b></td>
<td>0.9920</td>
</tr>
<tr>
<th>VAE</th>
<td>0.9528</td>
<td><b>1.0000</b></td>
<td>0.9992</td>
<td><b>0.2540</b></td>
<td><b>0.6959</b></td>
<td>0.4684</td>
<td>0.4547</td>
<td><b>0.9978</b></td>
<td><b>0.9963</b></td>
<td><b>0.8277</b></td>
<td>0.0925</td>
<td>0.8548</td>
<td>0.9925</td>
</tr>
<tr>
<th>AAE</th>
<td>0.9341</td>
<td><b>1.0000</b></td>
<td><b>1.0000</b></td>
<td>1.3511</td>
<td>1.8587</td>
<td>0.4191</td>
<td>0.4113</td>
<td>0.9865</td>
<td>0.9852</td>
<td>0.6637</td>
<td><b>0.1538</b></td>
<td>0.8531</td>
<td>0.9759</td>
</tr>
<tr>
<th>ORGAN</th>
<td>0.8731</td>
<td>0.9910</td>
<td>0.9260</td>
<td>1.5748</td>
<td>2.4306</td>
<td>0.4745</td>
<td>0.4593</td>
<td>0.9897</td>
<td>0.9883</td>
<td>0.7843</td>
<td>0.0632</td>
<td>0.8526</td>
<td><b>0.9934</b></td>
</tr>
<tr>
<th>JTN-VAE</th>
<td><b>1.0000</b></td>
<td>0.9980</td>
<td>0.9972</td>
<td>4.3769</td>
<td>4.6299</td>
<td><b>0.3909</b></td>
<td><b>0.3902</b></td>
<td>0.9679</td>
<td>0.9699</td>
<td>0.3868</td>
<td>0.1163</td>
<td>0.8495</td>
<td>0.9566</td>
</tr>
</tbody>
</table>

For comparison of molecular properties, we computed the Fréchet distance between distributions of molecules in the generated and test sets. Below, we provide plots for lipophilicity (logP), Synthetic Accessibility (SA), Quantitative Estimation of Drug-likeness (QED), Natural Product-likeness (NP), and molecular weight.
|weight|
|---|
|![weight](images/weight.png)|
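For a single scalar property, the Fréchet distance has a simple closed form if each distribution is approximated by a Gaussian (an assumption made here purely for illustration): d² = (μ₁ − μ₂)² + (σ₁ − σ₂)². A minimal sketch:

```python
import math

# Sketch: Fréchet (2-Wasserstein) distance between two one-dimensional
# property distributions under a Gaussian approximation of each sample.
# Closed form: d^2 = (mu1 - mu2)^2 + (sigma1 - sigma2)^2.

def mean_std(xs):
    """Mean and population standard deviation of a sample."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)

def frechet_distance_1d(xs, ys):
    """Fréchet distance between Gaussian fits of two property samples."""
    mu_x, sd_x = mean_std(xs)
    mu_y, sd_y = mean_std(ys)
    return math.sqrt((mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2)

# Equal spread, means shifted by 1 => distance equals the mean shift:
print(frechet_distance_1d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # prints 1.0
```

FCD generalizes the same idea to the multivariate Gaussian fit of ChemNet activations, where the variance term becomes a trace expression over covariance matrices.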

# Installation

## Docker

1. Install [docker](https://docs.docker.com/install/) and [nvidia-docker](https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)).

2. Pull an existing image from DockerHub:

```
docker pull molecularsets/moses
```

or clone the repository and build it manually:


```
git lfs install
git clone https://github.com/molecularsets/moses.git
nvidia-docker image build --tag molecularsets/moses moses/
```

3. Create a container:
```
nvidia-docker run -it --network="host" --shm-size 1G molecularsets/moses
```

4. The dataset and source code are available inside the docker container:
```
docker exec -it <container name> bash
```

## Manually
Alternatively, install dependencies and MOSES manually.

1. Clone the repository:
```
git lfs install
git clone https://github.com/molecularsets/moses.git
```

2. [Install RDKit](https://www.rdkit.org/docs/Install.html) for metrics calculation.

3. Install MOSES:
```
python setup.py install
```


# Benchmarking your models

* Install MOSES as described in the previous section.

* Calculate metrics for the trained model:

```
python scripts/metrics/eval.py --ref_path <reference dataset> --gen_path <generated dataset>
```

# Platform usage

## Training

```
python scripts/<model name>/train.py \
--train_load <train dataset> \
--model_save <path to model> \
--config_save <path to config> \
--vocab_save <path to vocabulary>
```
For more details run `python scripts/<model name>/train.py --help`.

## Generation

```
python scripts/<model name>/sample.py \
--model_load <path to model> \
--vocab_load <path to vocabulary> \
--config_load <path to config> \
--n_samples <number of samples> \
--gen_save <path to generated dataset>
```

For more details run `python scripts/<model name>/sample.py --help`.

## Evaluation

```
python scripts/metrics/eval.py \
--ref_path <reference dataset> \
--gen_path <generated dataset>
```

For more details run `python scripts/metrics/eval.py --help`.

## End-to-End launch

You can run pretty much everything with:
```
python scripts/run.py
```
This will **download** the dataset, **train** the models, **generate** new molecules, and **calculate** the metrics. Evaluation results will be saved in `metrics.csv`.

You can specify the device and/or model by running:
```
python scripts/run.py --device cuda:5 --model aae
```

For more details run `python scripts/run.py --help`.
