[ICLR 2024] Domain-Agnostic Molecular Generation with Chemical Feedback

MolGen

Code for the paper "Domain-Agnostic Molecular Generation with Self-feedback".

🔔 News

📕 Requirements

To run the code, set up the dependencies by restoring our conda environment:

conda env create -f MolGen/environment.yml -n $Your_env_name$

and then:

conda activate $Your_env_name$

📚 Resource Download

You can download the pre-trained model via this link1, and the fine-tuned models via this link2.

Moreover, the dataset used for downstream tasks can be found here.

The expected structure of files is:

moldata
├── checkpoint 
│   ├── molgen.pkl              # pre-trained model
│   ├── syn_qed_model.pkl       # fine-tuned model for QED optimization on synthetic data
│   ├── syn_plogp_model.pkl     # fine-tuned model for p-logP optimization on synthetic data
│   ├── np_qed_model.pkl        # fine-tuned model for QED optimization on natural product data
│   └── np_plogp_model.pkl      # fine-tuned model for p-logP optimization on natural product data
├── finetune
│   ├── np_test.csv             # natural product test data
│   ├── np_train.csv            # natural product train data
│   ├── plogp_test.csv          # synthetic test data for p-logP optimization
│   ├── qed_test.csv            # synthetic test data for QED optimization
│   └── zinc250k.csv            # synthetic train data
├── generate                    # generate molecules
├── output                      # molecule candidates
└── vocab_list
    └── zinc.npy                # SELFIES alphabet
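Before running any of the scripts, it can help to confirm that the downloads landed in the layout above. The following is a minimal sketch, not part of the repository: the function name `missing_entries` is hypothetical, and it assumes the `moldata` folder sits in the current working directory. The fine-tuned checkpoints are listed because they appear in the expected structure; if you plan to produce them yourself by fine-tuning, it is normal for them to be reported as missing at first.

```python
from pathlib import Path

# Relative paths copied from the expected structure listed above.
EXPECTED_FILES = [
    "checkpoint/molgen.pkl",
    "checkpoint/syn_qed_model.pkl",
    "checkpoint/syn_plogp_model.pkl",
    "checkpoint/np_qed_model.pkl",
    "checkpoint/np_plogp_model.pkl",
    "finetune/np_test.csv",
    "finetune/np_train.csv",
    "finetune/plogp_test.csv",
    "finetune/qed_test.csv",
    "finetune/zinc250k.csv",
    "vocab_list/zinc.npy",
]
EXPECTED_DIRS = ["generate", "output"]


def missing_entries(root):
    """Return the expected files and folders that are absent under `root`."""
    root = Path(root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    return missing


if __name__ == "__main__":
    # Assumption: `moldata` is located in the current working directory.
    missing = missing_entries("moldata")
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All expected files and folders were found.")
```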

🚀 How to run

  • Fine-tune

    • First, preprocess the fine-tuning dataset by generating candidate molecules with our pre-trained model. The preprocessed data will be stored in the folder output.
        cd MolGen
        bash preprocess.sh
    • Then fine-tune the model with the self-feedback paradigm. The fine-tuned model will be stored in the folder checkpoint.
        bash finetune.sh
  • Generate

    To generate molecules, run the script below. Set checkpoint_path to choose between the pre-trained model and a fine-tuned model.

    cd MolGen
    bash generate.sh
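The choice of checkpoint can be sketched as a small lookup over the file names listed in the expected structure. This helper is illustrative only and not part of the repository: the function name `checkpoint_path`, the `domain` and `target` labels, and the default root `moldata` are all assumptions. It only builds a path; how `generate.sh` consumes `checkpoint_path` is not shown in this README.

```python
from pathlib import Path

# File names copied from the `checkpoint` folder in the expected structure.
# The (domain, target) keys are hypothetical labels chosen for this sketch.
FINE_TUNED = {
    ("synthetic", "qed"): "syn_qed_model.pkl",
    ("synthetic", "plogp"): "syn_plogp_model.pkl",
    ("natural_product", "qed"): "np_qed_model.pkl",
    ("natural_product", "plogp"): "np_plogp_model.pkl",
}


def checkpoint_path(root="moldata", domain=None, target=None):
    """Build the path to the pre-trained model, or to a fine-tuned one.

    With no domain and no target, the pre-trained `molgen.pkl` is selected.
    """
    ckpt_dir = Path(root) / "checkpoint"
    if domain is None and target is None:
        return ckpt_dir / "molgen.pkl"
    try:
        return ckpt_dir / FINE_TUNED[(domain, target)]
    except KeyError:
        raise ValueError(f"no fine-tuned model for {(domain, target)!r}") from None


if __name__ == "__main__":
    print(checkpoint_path())                                # pre-trained model
    print(checkpoint_path(domain="synthetic", target="qed"))  # fine-tuned model
```

The printed path can then be used as the checkpoint_path value when generating molecules.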

Citation

If you use or extend our work, please cite the paper as follows:

@article{fang2023molecular,
  title={Molecular Language Model as Multi-task Generator},
  author={Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Fan, Xiaohui and Chen, Huajun},
  journal={arXiv preprint arXiv:2301.11259},
  year={2023}
}