Machine learning of the thermodynamic properties of molecular liquids with graph neural networks.
The first part of this work (feedforward NN) has been published in the following article:
Predicting Thermodynamic Properties of Alkanes by High-throughput Force Field Simulation and Machine Learning
conda install matplotlib scikit-learn
conda install -c pytorch pytorch
conda install -c dglteam dgl-cuda10.1
conda install -c openbabel openbabel
conda install -c rdkit rdkit
A machine learning workflow can be separated into four steps: fingerprint calculation, data splitting, model training and prediction.
(All scripts used here are located in the run directory.)
- Calculate fingerprints
./gen-fp.py -i ../data/nist-CH-tc.txt -e morgan1,simple -o out
Several encoders are available, each suited to a different purpose. Multiple fingerprints can (and often should) be combined for better performance.
- wyz - Handcrafted substructure fingerprint for predicting the properties of alkanes.
- morgan1 - Extended-connectivity substructure counts with a radius of one. To suppress overfitting, substructures that occur in fewer than 200 molecules are dropped.
- simple - Four features describing the global structure of a molecule: number of atoms, molecular weight, maximum shortest-path length, and number of rotatable bonds.
- Split the data into training and validation sets using 5-fold cross-validation
./split-data.py -i ../data/nist-CH-tc.txt -o out
- Train the model (see the following examples)
- Predict properties for new molecules (see the following examples)
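The data splitting step can be illustrated with a minimal sketch of 5-fold cross-validation in plain Python. This is not the code of split-data.py: the function name `k_fold_partitions`, the seed, and the toy SMILES list are assumptions made for illustration, and whether the real script splits by molecule or by data point is not stated here.

```python
import random

def k_fold_partitions(molecules, k=5, seed=0):
    """Yield one (training, validation) pair of lists for each of the k folds."""
    shuffled = list(molecules)
    random.Random(seed).shuffle(shuffled)
    # Deal molecules round-robin so that fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield training, validation

# Toy input (hypothetical); the real data files hold many more molecules.
smiles = ["C", "CC", "CCC", "CCCC", "CCCCC", "CCCCCC",
          "CC(C)C", "CC(C)CC", "C=C", "C#C"]
partitions = list(k_fold_partitions(smiles, k=5, seed=0))
for n, (train, valid) in enumerate(partitions, start=1):
    print(f"partition {n}: {len(train)} training, {len(valid)} validation")
```

Each molecule appears in exactly one validation set, so training on each partition in turn gives five independent validation scores.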
Here is an example of learning the isobaric heat capacity of alkanes using the handcrafted wyz fingerprint, as described in this article.
./gen-fp.py -i ../data/alkanes-npt-2018v3.txt -e wyz -o out
./split-data.py -i ../data/alkanes-npt-2018v3.txt -o out
./train.py -i ../data/alkanes-npt-2018v3.txt -t Cp -f out/fp_wyz -p out/part-1.txt -o out/result
./predict.py -d out/result -e wyz -i CCCCCC,300,1
Here is an example of learning the critical temperature of hydrocarbons using the morgan1 and simple fingerprints.
./gen-fp.py -i ../data/nist-CH-tc.txt -e morgan1,simple -o out
./split-data.py -i ../data/nist-CH-tc.txt -o out
./train.py -i ../data/nist-CH-tc.txt -t tc -f out/fp_morgan1,out/fp_simple -p out/part-1.txt -o out/result
./predict.py -d out/result -e predefinedmorgan1,simple -i CCCCCC
- Note that the length of the morgan1 fingerprint depends on the molecular structures in the training set. For prediction, the predefinedmorgan1 encoder should be used to calculate the morgan1 fingerprint.
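The reason a predefined encoder is needed at prediction time can be shown with a small sketch. It assumes that a count fingerprint is a vector over a vocabulary of substructures fixed during training. The functions `build_vocabulary` and `encode`, the string substructure keys, and the threshold of 2 are all hypothetical and only keep the toy example small (the morgan1 encoder uses 200 molecules); this is not the implementation in gen-fp.py.

```python
from collections import Counter

def build_vocabulary(training_counts, min_molecules):
    """Keep substructures that occur in at least `min_molecules` training molecules."""
    occurrence = Counter()
    for counts in training_counts:
        occurrence.update(counts.keys())  # each molecule counts once per substructure
    return sorted(key for key, n in occurrence.items() if n >= min_molecules)

def encode(counts, vocabulary):
    """Map substructure counts onto the fixed vocabulary; unseen keys are ignored."""
    return [counts.get(key, 0) for key in vocabulary]

# Hypothetical substructure counts; a real encoder would derive them from SMILES.
training_counts = [
    {"CH3": 2, "CH2": 4},
    {"CH3": 2, "CH2": 1},
    {"CH3": 3, "CH2": 1, "CH": 1},
]
vocabulary = build_vocabulary(training_counts, min_molecules=2)

# A new molecule containing a substructure ("C") that was never seen in training.
new_molecule = {"CH3": 4, "C": 1}
fingerprint = encode(new_molecule, vocabulary)
print(vocabulary, fingerprint)
```

Because the vocabulary is frozen after training, the fingerprint of a new molecule always has the length the trained model expects, even when the molecule contains substructures that were dropped or never seen.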
Here is an example of learning the cohesive energy using molecular graphs and force field atom types.
- The node feature is a one-hot vector representing the atom type of each atom in the force field.
- The molecular graphs and node features are loaded from data/msdfiles.zip.
./gen-fp.py -i ../data/All-npt_rand.txt -e simple -o out
./split-data.py -i ../data/All-npt_rand.txt -o out
./train-gcn.py -i ../data/All-npt_rand.txt -t einter -f out/fp_simple -p out/part-1.txt -o out/result
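The one-hot node feature used in this example can be sketched as follows. The atom type names and the helper `one_hot_node_features` are hypothetical; the real atom types come from the force field and the files in data/msdfiles.zip, and train-gcn.py may build the features differently.

```python
def one_hot_node_features(atom_types, known_types):
    """Return one row per atom, with a single 1 in the column of its atom type."""
    index = {atom_type: i for i, atom_type in enumerate(known_types)}
    features = []
    for atom_type in atom_types:
        row = [0] * len(known_types)
        row[index[atom_type]] = 1
        features.append(row)
    return features

# Hypothetical force field atom types (placeholder names, not taken from the data).
known_types = ["c_4h2", "c_4h3", "h_1"]
# Carbon atoms of propane: CH3-CH2-CH3.
propane_carbons = ["c_4h3", "c_4h2", "c_4h3"]
features = one_hot_node_features(propane_carbons, known_types)
for atom_type, row in zip(propane_carbons, features):
    print(atom_type, row)
```

The width of each row equals the number of atom types in the force field, so every molecule yields node features of the same dimension regardless of its size.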
Here is an example of learning the cohesive energy using molecular graphs, force field parameters and vacuum simulation results.
- The node features are the LJ parameters and charge of each atom in the force field.
- Each 1-2, 1-3 and 1-4 pair is treated as an edge. The edge feature is the pair distance distribution extracted from a vacuum simulation of a single molecule at 300 K.
- The molecular graphs are loaded from data/msdfiles.zip.
- The edge features are loaded from data/distfiles.zip.
- The node features are calculated from a force field file named dump-MGI.ppf, which should be put into the data directory.
./gen-fp.py -i ../data/All-npt_rand.txt -e simple -o out
./split-data.py -i ../data/All-npt_rand.txt -o out
./train-ffgcn.py -i ../data/All-npt_rand.txt -t einter -f out/fp_simple -p out/part-1.txt -o out/result
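To make the edge definition in this example concrete, here is a minimal sketch that enumerates 1-2, 1-3 and 1-4 pairs from a bond graph. It assumes that a pair is classified by its shortest path in bonds (one, two or three), which is an interpretation that matters mainly for rings; the adjacency dictionary and function names are hypothetical, and this is not how train-ffgcn.py reads its graphs.

```python
from collections import deque

def topological_distances(adjacency, source):
    """Breadth-first search: number of bonds from `source` to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist

def bonded_pairs(adjacency, max_bonds=3):
    """Group atom pairs by bond separation: 1 -> 1-2, 2 -> 1-3, 3 -> 1-4 pairs."""
    pairs = {n: [] for n in range(1, max_bonds + 1)}
    for i in sorted(adjacency):
        for j, d in topological_distances(adjacency, i).items():
            if i < j and 1 <= d <= max_bonds:  # i < j avoids listing a pair twice
                pairs[d].append((i, j))
    return pairs

# Carbon skeleton of n-pentane as an adjacency list (hydrogens omitted for brevity).
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
pairs = bonded_pairs(adjacency)
for separation, found in pairs.items():
    print(f"1-{separation + 1} pairs:", found)
```

Each pair returned here would become one edge of the graph, to which an edge feature such as the pair distance distribution could then be attached.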