MERGE is a method that combines direct coupling analysis and machine learning to predict a protein's fitness from its sequence. It requires a binary parameter file produced by plmc and a set of variant-fitness pairs.
The most important steps for model construction are briefly described below. Step-by-step instructions are given here. To generate a model of the fitness landscape of a protein and explore it in silico, the following files are required:
- protein sequence in fasta format
- variant-fitness pairs in csv format
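The exact CSV layout that MERGE expects is not specified in this overview. As a minimal sketch, assuming a header line `variant,fitness` (hypothetical column names), substitutions written as wild-type residue, 1-based position and new residue (e.g. `A12G`), and multiple substitutions joined by `/`, the pairs could be loaded and checked against the wild-type sequence like this:

```python
import csv
import io
import re

# Assumed notation: wild-type residue, 1-based position, substituted residue.
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def read_variant_fitness(handle, wild_type):
    """Return a list of (substitutions, fitness) pairs read from a CSV handle.

    The column names 'variant' and 'fitness' are assumptions of this sketch.
    """
    pairs = []
    for row in csv.DictReader(handle):
        substitutions = []
        for token in row["variant"].split("/"):
            match = SUBSTITUTION.match(token.strip())
            if match is None:
                raise ValueError("cannot parse substitution: " + token)
            wt_residue = match.group(1)
            position = int(match.group(2))
            new_residue = match.group(3)
            if not 1 <= position <= len(wild_type):
                raise ValueError("position out of range: " + token)
            # The stated wild-type residue must agree with the sequence.
            if wild_type[position - 1] != wt_residue:
                raise ValueError("wild-type mismatch: " + token)
            substitutions.append((wt_residue, position, new_residue))
        pairs.append((substitutions, float(row["fitness"])))
    return pairs


# Toy data for illustration only; not taken from any real experiment.
example_csv = io.StringIO("variant,fitness\nM1A,0.8\nK2R/T3S,1.3\n")
pairs = read_variant_fitness(example_csv, wild_type="MKT")
```

Validating each substitution against the wild-type sequence early catches off-by-one numbering errors before any model is trained.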
- Generate a multiple sequence alignment (MSA) using jackhmmer
To generate a multiple sequence alignment, the target sequence must be provided in fasta format and the inclusion threshold (--incT) must be set.
jackhmmer [-options] <seqfile> <seqdb>
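As a hedged illustration of the call above, the following sketch assembles a jackhmmer command line from Python. The file names, the threshold value, the iteration count and the thread count are placeholders rather than recommendations; `--incT`, `-N`, `--cpu`, `--noali` and `-A` are standard jackhmmer options.

```python
import shlex


def jackhmmer_command(seqfile, seqdb, inc_t, sto_out, iterations=5, cpus=4):
    """Build the argument list for a jackhmmer run (nothing is executed here)."""
    return [
        "jackhmmer",
        "--incT", str(inc_t),   # bit-score inclusion threshold
        "-N", str(iterations),  # maximum number of search iterations
        "--cpu", str(cpus),     # number of worker threads
        "--noali",              # omit alignments from the text output
        "-A", sto_out,          # save the alignment of hits (Stockholm format)
        seqfile,
        seqdb,
    ]


# Placeholder file names and threshold; choose values suited to your protein.
command = jackhmmer_command(
    "target.fasta", "uniref100.fasta", inc_t=100, sto_out="target.sto"
)
print(shlex.join(command))
# To actually run the search: subprocess.run(command, check=True)
```

Building the command as a list avoids shell quoting problems when file paths contain spaces.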
- Post-process the MSA
Next, the MSA is post-processed by
- excluding all positions where the wild-type sequence has a gap,
- excluding all positions that contain more than 30 % gaps,
- excluding all sequences that contain more than 50 % gaps.
The script sto2a2m.py can be found here.
python sto2a2m.py -sto <stoFile>
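The three filtering rules can be illustrated with a short sketch. This is not the code of sto2a2m.py. It assumes that the first sequence is the wild type, that the rules are applied in the order listed, that the wild type counts towards the column gap fraction, and that sequence gap fractions are computed after the columns have been removed.

```python
GAPS = "-."


def postprocess_msa(sequences, max_column_gaps=0.30, max_sequence_gaps=0.50):
    """Filter equally long aligned sequences; sequences[0] is the wild type."""
    wild_type = sequences[0]
    n_sequences = len(sequences)

    # 1) Keep only columns where the wild type has a residue.
    columns = [i for i, residue in enumerate(wild_type) if residue not in GAPS]

    # 2) Of those, keep columns with at most 30 % gaps across all sequences.
    columns = [
        i for i in columns
        if sum(seq[i] in GAPS for seq in sequences) / n_sequences <= max_column_gaps
    ]
    if not columns:
        return []

    # 3) Trim every sequence to the kept columns, then drop sequences
    #    that consist of more than 50 % gaps.
    trimmed = ["".join(seq[i] for i in columns) for seq in sequences]
    return [
        seq for seq in trimmed
        if sum(residue in GAPS for residue in seq) / len(seq) <= max_sequence_gaps
    ]


# Toy alignment for illustration only.
msa = [
    "MK-TA",  # wild type
    "MKGTA",
    "M--T-",
    "-----",
]
filtered = postprocess_msa(msa)
```

In this toy alignment, column 3 is removed because the wild type has a gap there, columns 2 and 5 are removed because half of the sequences have a gap, and the all-gap sequence is dropped.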
- Infer parameters for the Potts model using PLMC
Once the a2m file has been generated, the parameters of the statistical model are inferred.
plmc [options] alignmentfile
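A possible way to assemble the plmc call from Python is sketched below. The file names and the focus identifier are placeholders, and which options are appropriate depends on the protein. Only options from the plmc usage message are used here: `-o` (binary parameter file), `-f` (focus sequence), `-n` (number of cores) and `-m` (maximum number of iterations).

```python
import shlex


def plmc_command(alignment, params_out, focus=None, cores=None, max_iter=None):
    """Build the argument list for a plmc run (nothing is executed here)."""
    command = ["plmc", "-o", params_out]  # -o: write the binary parameter file
    if focus is not None:
        command += ["-f", focus]          # -f: identifier of the focus sequence
    if cores is not None:
        command += ["-n", str(cores)]     # -n: number of cores (OpenMP build)
    if max_iter is not None:
        command += ["-m", str(max_iter)]  # -m: maximum number of iterations
    command.append(alignment)             # the alignment file comes last
    return command


# Placeholder names; replace them with your own files and identifier.
command = plmc_command("target.a2m", "target.params", focus="target", cores=4)
print(shlex.join(command))
# To actually run the inference: subprocess.run(command, check=True)
```

The resulting parameter file is the binary input that MERGE requires.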
- Construct and explore the model of the fitness landscape using MERGE
Finally, a model of the fitness landscape is generated. See the example for details on how to use MERGE.
- Download the latest version of UniRef100 (this can take a while; the file is larger than 100 GB)
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz
- Unzip the file to get a fasta file
gzip -d uniref100.fasta.gz
For further information, see the UniProt help or here
- Download the tarball
wget http://eddylab.org/software/hmmer/hmmer.tar.gz
- Unpack the tarball
tar zxf hmmer.tar.gz
- Enter the directory 'hmmer-3.4'
cd hmmer-3.4
- Set the installation path (adjust "/your/install/path" accordingly!)
./configure --prefix /your/install/path
- Build HMMER
make
- Run self tests (optional)
make check
- Install programs and man pages
make install
- Add executable to PATH for session (adjust "/your/install/path" accordingly!)
export PATH="/your/install/path/bin:$PATH"
or permanently (adjust "/your/install/path" accordingly!)
echo 'export PATH="/your/install/path/bin:$PATH"' >> ~/.bashrc
- Exit the directory 'hmmer-3.4'
cd ..
For further information, see the HMMER documentation
- Clone the plmc repository
git clone https://github.com/debbiemarkslab/plmc.git
- Enter the directory 'plmc'
cd plmc
- Build with GCC and OpenMP to enable multicore parallelism
make all-openmp
- Add executable to PATH for session (the build places the plmc binary in the 'bin' subdirectory of the repository; adjust "/your/install/path" accordingly!)
export PATH="/your/install/path/bin:$PATH"
or permanently (adjust "/your/install/path" accordingly!)
echo 'export PATH="/your/install/path/bin:$PATH"' >> ~/.bashrc
- Exit the directory 'plmc'
cd ..
For further information see plmc repository
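Before moving on, it can be useful to confirm that both command-line tools are reachable from the current environment. A minimal sketch using only the Python standard library (the function name is illustrative):

```python
import shutil


def find_missing_tools(tools=("jackhmmer", "plmc")):
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("jackhmmer and plmc are available.")
```

If a tool is reported as missing, re-check the `export PATH=...` line used for the corresponding installation.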
- Clone the MERGE repository
git clone https://github.com/amillig/MERGE.git
- Enter the MERGE directory
cd MERGE
- Install the dependencies
pip install -r requirements.txt
- Import MERGE as a module in Python
import merge
“Combining evolutionary probability and machine learning enables data-driven protein engineering with minimized experimental effort” by Alexander-Maurice Illig, Niklas E. Siedhoff, Mehdi D. Davari*, and Ulrich Schwaneberg*
MERGE was developed and written by Alexander-Maurice Illig at RWTH Aachen University.
MERGE uses binary parameter files generated with plmc, which was written by John Ingraham.