Project title: Simulation of pre-biotic formation of metabolic networks with the appearance of heritable catalysts of random specificity
Author: AC Du Plessis, Prospective MSc Bioinformatics and Computational Biology
Although many in silico simulations of the Origin of Life (OoL) have been attempted, no single simulation encapsulates all of its aspects. As this is only a master's project, we did not set out to solve the OoL, but rather to gain some insight into the catalyst vs replicator debate. The simulation works from the assumption that life could have started in an alkaline hydrothermal vent system like the Lost City hydrothermal vent system. Using the minimal metabolism of JCVI syn3A as a guide, we simulate the prebiotic chemical environment of a hydrothermal vent system by adding "catalysts" to in silico pores in a vent and test whether a chemical reaction network can form based on the chemistry of JCVI syn3A.
The preprocessing script is not included in the main program to keep the script that has to be run on the HPC as small as possible. This also allows
As input we used the full_jcvi_rxn_list.csv file, which was compiled from the supplementary materials of Breuer et al. (2019), Essential metabolism for a minimal cell. Format:
Reaction nr,Reaction equation
S001,D-glucose 6-phosphate <==> D-fructose 6-phosphate
S002,D-fructose 6-phosphate + ATP --> ADP + D-fructose 1,6-biphosphate + H+
Two files will be created:
- chem_dict.txt
- rxn numbers.csv
The chemical dictionary contains every metabolite extracted from the provided metabolic reaction list, together with its assigned number. (Numbers were assigned because metabolite names stored as strings got mixed up in later pandas DataFrame operations.) Format:
D-GLUCOSE 6-PHOSPHATE: 1
D-FRUCTOSE 6-PHOSPHATE: 2
The reaction number file contains the metabolic reaction information extracted from the provided metabolic reaction list. As with the previous file, every reaction is assigned a number, and all metabolites are represented by their metabolite numbers to avoid string-handling problems. Format:
reactions,reactants,products
1,[1],[2]
2,"[2, 3]","[11, 4, 16]"
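The mapping from the input format to the two output formats can be sketched as follows. This is a minimal illustration, not the contents of data_preprocessing.py: the function name `preprocess` is hypothetical, metabolites are numbered in order of first appearance (which will not necessarily reproduce the numbers shown in the samples above), and stoichiometric coefficients are not handled.

```python
import re


def preprocess(rows):
    """Turn (reaction id, equation) pairs into a metabolite dictionary
    and a list of (reaction number, reactant IDs, product IDs) tuples.

    Assumption: metabolites are separated by " + " and the two sides of
    an equation by "<==>" or "-->", as in the sample input.
    """
    chem_dict = {}   # upper-cased metabolite name -> integer ID
    reactions = []

    def ids(side):
        out = []
        for name in side.split(" + "):
            key = name.strip().upper()
            if key not in chem_dict:
                # Number metabolites in order of first appearance.
                chem_dict[key] = len(chem_dict) + 1
            out.append(chem_dict[key])
        return out

    for number, (_rxn_id, equation) in enumerate(rows, start=1):
        left, right = re.split(r"\s*(?:<==>|-->)\s*", equation, maxsplit=1)
        reactions.append((number, ids(left), ids(right)))
    return chem_dict, reactions


rows = [
    ("S001", "D-glucose 6-phosphate <==> D-fructose 6-phosphate"),
    ("S002", "D-fructose 6-phosphate + ATP --> ADP + D-fructose 1,6-biphosphate + H+"),
]
chem_dict, reactions = preprocess(rows)
for name, number in chem_dict.items():
    print(f"{name}: {number}")
```

Storing only integers in the reaction list keeps later table operations free of string comparisons, which is the motivation given above for the numbering.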
The reverse simulation (starting with the full JCVI reaction network and deleting reactions with every iteration to see how quickly the network disappears) is small enough that it was only run locally.
It makes use of the following Python modules:
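The idea of the reverse simulation can be illustrated with a small sketch. It is not the actual reverse simulation script: the function name `reverse_simulation`, the `deletions_per_step` and `seed` parameters, and the use of "number of metabolites still taking part in a reaction" as the tracked quantity are all assumptions made for illustration.

```python
import random


def reverse_simulation(reactions, deletions_per_step=1, seed=0):
    """Delete random reactions step by step and record network size.

    reactions: list of (number, reactant IDs, product IDs) tuples.
    Returns a list of (reactions left, metabolites still connected).
    """
    rng = random.Random(seed)
    remaining = list(reactions)
    history = []
    while True:
        # Metabolites that still take part in at least one reaction.
        metabolites = set()
        for _number, reactants, products in remaining:
            metabolites.update(reactants, products)
        history.append((len(remaining), len(metabolites)))
        if not remaining:
            break
        # Remove randomly chosen reactions for this iteration.
        for _ in range(min(deletions_per_step, len(remaining))):
            remaining.pop(rng.randrange(len(remaining)))
    return history


demo_reactions = [(1, [1], [2]), (2, [2, 3], [11, 4, 16])]
history = reverse_simulation(demo_reactions)
for n_reactions, n_metabolites in history:
    print(n_reactions, n_metabolites)
```

A fixed seed makes a run repeatable; varying it gives a spread of decay curves for the same starting network.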
- matplotlib
- networkx
- numpy
- pandas
- csv
- random
- concurrent.futures
- pip install the third-party modules above (matplotlib, networkx, numpy and pandas); csv, random and concurrent.futures are part of the Python standard library and need no installation
- Perform the data preprocessing on full_jcvi_rxn_list.csv using data_preprocessing.py
- Add the output files from the data preprocessing, together with the updated smiles.csv, to the same directory and run the reverse simulation script
The script should produce:
- A "systemfitness.csv" file that will contain the fitness values for all systems after every 5 iterations
- A "full.svg" image file that contains an overview of the full JCVI chemical reaction network before any deletions take place
- A "reverse_sim_network.gml" file that contains a key-value list describing the full reaction network
All dependencies for this program can be found in the requirements file. The files in main_15000_4000 are used to run a simulation of 15 000 systems over 4 000 iterations, and the files in main_60000_8000 to run a simulation of 60 000 systems over 8 000 iterations.
This simulation is very computationally expensive and requires the use of a high performance computing (HPC) unit/cluster.
- Use the same "chem_dict.txt" and "rxn numbers.csv" files from the reverse simulation produced by the data preprocessing script
- Create a new directory and add the files in HPC setup files and Main simulation
- Change the PBS directives in the simulation file to fit the capacity of the HPC you are using and then submit the script
The script will generate the following:
- 4 cythonized files - 2 for the more_cyhtonized.py file and 2 for the fastest_cython.py file
- A "fitness.csv" file containing the fitnesses for all systems over all iterations
- A "system.csv" file containing the system information of all systems
- An output, an error and a log file that report anything out of the ordinary that happened during the run
The network visualization can be done by running the "vis1.pbs" script in Data visualization and analysis. The fitness and standard deviation graphs can be generated by running the "vis1.pbs" script in Fitness Calculation.
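As a rough illustration of what the fitness and standard deviation graphs are built from, the sketch below computes a mean and a standard deviation per iteration. The layout of "fitness.csv" is not documented here, so the sketch assumes one row per iteration and one column per system, without a header row; the function name `fitness_summary` is hypothetical and this is not the code run by the PBS scripts.

```python
import csv
import io
import statistics


def fitness_summary(handle):
    """Return a (mean, population standard deviation) pair per row.

    Assumption: each row holds the fitness of every system at one
    iteration, with no header row.
    """
    summary = []
    for row in csv.reader(handle):
        values = [float(v) for v in row if v != ""]
        if not values:
            continue  # skip blank lines
        summary.append((statistics.mean(values), statistics.pstdev(values)))
    return summary


# In-memory stand-in for a real fitness file with two systems.
demo = io.StringIO("0.1,0.3\n0.4,0.8\n")
summary = fitness_summary(demo)
for mean, std in summary:
    print(f"{mean:.3f} {std:.3f}")
```

For a real run, the `io.StringIO` object would be replaced by an opened file, for example `open("fitness.csv", newline="")`.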
The data used to compile the JCVI reaction list was obtained from: Marian Breuer, Tyler M. Earnest, Chuck Merryman, Kim S. Wise, Lijie Sun, Michaela R. Lynott, Clyde A. Hutchison, Hamilton O. Smith, John D. Lapek, David J. Gonzalez, Valérie de Crécy-Lagard, Drago Haas, Andrew D. Hanson, Piyush Labhsetwar, John I. Glass, Zaida Luthey-Schulten (2019). Essential metabolism for a minimal cell. eLife 8:e36842.
The virtual environment is set up automatically when the simulation script is run. It makes use of the requirements file together with the "source sim2env/bin/activate" command.
Sample PBS directives:
#!/bin/bash
#PBS -N simulation
#PBS -l select=1:ncpus=80:mem=80GB:vnode=n12.hpc
#PBS -l walltime=160:00:00
#PBS -o sim.out
#PBS -e sim.err
#PBS -m abe
#PBS -M youremailaddress
MIT LICENSE
Whether you have questions, ideas, or just want to grab a coffee and chat about the Origin of Life, I’d love to hear from you! You can reach me at:
Email: [email protected]