The project contains two separate parts. The first extracts structural information from CIF files and compiles it into a pandas dataframe. The second uses machine learning to analyze the relationship between crystal structure and band gap, and provides a user interface that outputs the predicted band gap value.
The first component of the computational material project is the preprocessing of .cif files, an internationally standardized data format that stores the structural information of molecules. We are motivated to provide a tool that reads files in bulk and extracts their data, as a first step towards any statistical or computational analysis.
Building on the Materials Project and its API access, we also designed a download function that helps people without a computer science background select materials within a band gap range of interest, without learning the complex syntax required by the original pymatgen package.
In the CIF processing folder, CIF_process.py can be executed directly from the command line by calling python CIF_process.py. Follow the sequential prompts to provide your information, and you will obtain a csv file built from either the downloaded data or the local files.
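As a rough illustration of the kind of extraction involved, the sketch below pulls the six cell parameters out of CIF text using only the standard library. It is not the project's CIFconvert() implementation: the helper name parse_cell_parameters and the example CIF text are hypothetical, and only the `_cell_length_*` and `_cell_angle_*` data names come from the CIF standard.

```python
import re

# Standard CIF data names for the six cell parameters.
CELL_TAGS = (
    "_cell_length_a", "_cell_length_b", "_cell_length_c",
    "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
)


def parse_cell_parameters(cif_text):
    """Return a dict mapping each cell tag to a float (hypothetical helper)."""
    params = {}
    for tag in CELL_TAGS:
        # Match "<tag> <number>" at the start of a line; a trailing
        # uncertainty such as the "(2)" in 5.4300(2) is left out of the group.
        match = re.search(
            rf"^\s*{re.escape(tag)}\s+([-+]?\d*\.?\d+)",
            cif_text,
            flags=re.MULTILINE,
        )
        if match is None:
            raise ValueError(f"missing {tag} in CIF text")
        params[tag] = float(match.group(1))
    return params


# Made-up CIF fragment for illustration only.
example_cif = """\
data_example
_cell_length_a    5.4300(2)
_cell_length_b    5.4300
_cell_length_c    5.4300
_cell_angle_alpha 90.0
_cell_angle_beta  90.0
_cell_angle_gamma 90.0
"""

row = parse_cell_parameters(example_cif)
print(row)
```

A list of such dictionaries, one per file, can be passed to pandas.DataFrame and written out with to_csv to obtain a table similar in spirit to the one described above.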
A special note regarding the test file of the CIF processing program:
The test file is named with the manualtest* prefix rather than the conventional test prefix. This is because the test evaluates the returned raw data, and we will ask you to provide confidential information to help the testing. See the prompts in the script for more information.
An exploratory study that puts this data processing program into practice is also provided, and makes up the second part of the project. We applied several machine learning models to a data frame, generated by CIF_process.py, containing all materials from the Materials Project whose band gap falls in the visible light spectrum. The generated data file is provided in doc/dataset in this repo.
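To make "band gap in the visible light spectrum" concrete, the sketch below converts a wavelength range into band gap bounds in eV using E = hc / λ. The 380 to 750 nm limits are an assumption made for illustration; the source does not state the exact bounds used to build the dataset, and the function names are hypothetical.

```python
# Planck constant times the speed of light, expressed in eV*nm.
HC_EV_NM = 1239.84198


def wavelength_nm_to_ev(wavelength_nm):
    """Photon energy in eV for a wavelength in nm, using E = hc / lambda."""
    return HC_EV_NM / wavelength_nm


# Assumed visible range in nm; the limits used by the project are not stated.
VISIBLE_NM = (380.0, 750.0)

# Longer wavelengths carry less energy, so the bounds swap order.
gap_min_ev = wavelength_nm_to_ev(VISIBLE_NM[1])
gap_max_ev = wavelength_nm_to_ev(VISIBLE_NM[0])


def in_visible_range(band_gap_ev):
    """True if the band gap lies within the assumed visible window."""
    return gap_min_ev <= band_gap_ev <= gap_max_ev


print(f"{gap_min_ev:.2f} eV to {gap_max_ev:.2f} eV")  # 1.65 eV to 3.26 eV
```

Under these assumed limits, a band gap of 2.0 eV would count as visible, while 1.1 eV would not.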
Hypothesis: the structure of substances is strongly correlated with the band gap values.
Method: To predict the band gap from the structural information, the following 6 parameters extracted from the cif dataframe are used:
- Lengths of the edges in the x, y, and z directions
- Angles between the edges and the three Cartesian axes
A couple of neural network models were adapted and trained on these parameters.
Results:
- Deep learning for regression task: MSE of 0.14
- K-means clustering (with 3 clusters): 10.8% accuracy
- Deep learning classification with one-hot encoding: 36.7% accuracy
We also applied other machine learning models, such as decision trees and random forests, to explore the same hypothesis.
Interpretation: Despite a decent training curve (the training cost decreases and converges over the epochs), both the supervised and unsupervised classification tasks produce very low accuracy, and the regression task's evaluation metric requires further assessment.
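One point to keep in mind when reading the clustering accuracy: cluster indices produced by K-means are arbitrary, so a direct comparison with class labels can understate the agreement. The source does not say how the reported accuracy was computed. The sketch below shows one common way to score a clustering, by taking the best one-to-one mapping of cluster ids to classes. The function name and the toy labels are hypothetical and are not project data.

```python
from itertools import permutations


def aligned_accuracy(true_labels, cluster_ids, n_classes):
    """Best accuracy over all one-to-one mappings of cluster ids to classes."""
    best = 0.0
    for mapping in permutations(range(n_classes)):
        # mapping[c] is the class assigned to cluster c under this candidate.
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, hits / len(true_labels))
    return best


# Toy labels made up for illustration.
true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [2, 2, 0, 0, 1, 0]

# Naive comparison treats cluster ids as if they were class labels.
raw = sum(t == c for t, c in zip(true_labels, cluster_ids)) / len(true_labels)
aligned = aligned_accuracy(true_labels, cluster_ids, 3)
print(raw, aligned)
```

Trying every permutation is only practical for a small number of clusters, such as the 3 used here.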
The relatively negative results suggest a weak correlation between the structural information and the band gap in the current database. The hypothesis requires further refinement, since the data set was selected by band gap alone, without controlling for other factors in the substances (such as the types of atoms). Therefore, it is very likely that the database contains a lot of noise.
The pair plot of all the features in the extracted dataframe shows that the cell parameters are strongly correlated with one another. We thus revised our hypothesis based on this discovery.
Hypothesis: For the band gap of a substance to fall in the visible light spectrum, the cell parameters must satisfy a specific relationship, which in turn confines the size of the unit cell.
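Since the revised hypothesis concerns the size of the unit cell, it may help to see how the six cell parameters combine into a single volume. The sketch below uses the standard formula for a general (triclinic) cell. The function name is hypothetical and the numbers are illustrative, not taken from bg_struct.csv.

```python
import math


def unit_cell_volume(a, b, c, alpha_deg, beta_deg, gamma_deg):
    """Volume of a general cell from edge lengths and angles given in degrees."""
    cos_a, cos_b, cos_g = (
        math.cos(math.radians(angle))
        for angle in (alpha_deg, beta_deg, gamma_deg)
    )
    # V = abc * sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
    #                + 2 cos(alpha) cos(beta) cos(gamma))
    return a * b * c * math.sqrt(
        1 - cos_a**2 - cos_b**2 - cos_g**2 + 2 * cos_a * cos_b * cos_g
    )


# Cubic cell: the formula reduces to a**3.
cubic_volume = unit_cell_volume(5.43, 5.43, 5.43, 90, 90, 90)

# Hexagonal-style cell with gamma = 120 degrees.
hex_volume = unit_cell_volume(3.0, 3.0, 5.0, 90, 90, 120)

print(round(cubic_volume, 3), round(hex_volume, 3))
```

A derived feature like this is one way to check whether the visible-band-gap materials cluster in a limited range of cell sizes.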
Method: Neural Network
Results: R2 score = 0.60
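For reference, the sketch below shows how an R2 score relates to the mean squared error: R2 = 1 - MSE / Var(y), so an MSE value can only be judged against the spread of the target values. The function names and the toy numbers are hypothetical and do not reproduce the project's results.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy band gaps in eV, made up for illustration.
y_true = [1.8, 2.2, 2.6, 3.0]
y_pred = [2.0, 2.2, 2.4, 3.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(round(mse, 4), round(r2, 4))
```

A model that always predicts the mean of the targets scores 0 on this metric, which gives a natural baseline for comparison.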
On the right-hand side, users can enter the crystal lattice constants and choose the machine learning method. The left-hand side then outputs the predicted band gap value.
Install and activate the environment with finalProject.yml by:
conda env create -f finalProject.yml
conda activate comma_env
In the console, execute the following command, where package_path is the path to the folder containing this Readme (computational-material-project):
pip install package_path
It can then be imported in the installed environment as comma.
computational-material-project
-----
setup.py
finalProject.yml
CIF processing/
|-CIF_process.py
|-df_CIF.py
|-manualtest_df_CIF.ipynb
|-manualtest_df_CIF.py
computational-materials/
|-tests/
| |-NN_metrics.py
| |-Neural_Network.py
| |-test_bandgap_dt_rf.py
| |-test_cif_conversion.py
| |-test_nn.py
|-models/
| |-Neural_Network.py
| |-band_gap_prediction.ipynb
| |-bandgap_dt_rf.py
| |-comp_material.ipynb
| |-user_interface_dt_rf.ipynb
|-quality_test/
| |-NN_metrics.py
|-optimization/
| |-hypersearch_nn.py
| |-hyper_tuning_DT_RF.ipynb
examples/
|-NN_demo.ipynb
|-DTandRF_demo.ipynb
doc/
|-DIRECT_finproj.pptx
|-Use case and component specification.md
|-dataset/
| |-bg_struct.csv
|-Image/
| |-pairplot.png
| |-actual vs predicted.png
| |-optimal nn model loss.png
| |-User Interface to predict band gap.png
See the examples folder for more demonstrations of predicting features with the available tools.
- Add more flexible and customizable components in the CIF processing program, so that users could select properties other than band gaps for downloading.
- Currently, CIFconvert() only extracts the cell lengths and cell angles from the cif files. We hope to let users choose which parameters to extract from the cif files.
- Building on the improvements above, we are also interested in designing a UI for this component.
- The exploration of the structure-property relationship with machine learning indicates that the current hypothesis of a correlation between crystalline cell parameters and the band gap needs further refinement, as the example dataset bg_struct.csv didn't control for non-structural factors that could affect the band gap.