Combining evolutionary probability and machine learning enables data-driven protein engineering with minimized experimental effort


MERGE

MERGE is a method that combines direct coupling analysis and machine learning to predict a protein's fitness from its sequence. It requires a binary parameter file generated by plmc and a set of variant-fitness pairs.


Usage

The most important steps for model construction are briefly described below. Step-by-step instructions are given here. To generate a model of the fitness landscape of a protein and explore it in silico, the following files are required:

  • protein sequence in fasta format
  • variant-fitness pairs in csv format
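As a rough illustration of the second input, the sketch below reads variant-fitness pairs from a csv file. The layout is an assumption made for this example only (a header row, then `variant,fitness` rows, substitutions written as `A12G` and combined with `/`); check the MERGE example for the format the tool actually expects. The function name `read_variant_fitness` is hypothetical.

```python
import csv
import io
import re

# Assumed (hypothetical) notation: <wild type><position><mutant>, e.g. "A12G".
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def read_variant_fitness(handle, delimiter=","):
    """Return a list of (substitutions, fitness) tuples from an open csv handle."""
    pairs = []
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # skip the header row
    for row in reader:
        if not row:
            continue  # ignore empty lines
        variant, fitness = row[0].strip(), float(row[1])
        substitutions = []
        for token in variant.split("/"):  # assumed separator for multiple substitutions
            match = SUBSTITUTION.match(token)
            if match is None:
                raise ValueError(f"cannot parse substitution {token!r}")
            wild_type, position, mutant = match.groups()
            substitutions.append((wild_type, int(position), mutant))
        pairs.append((substitutions, fitness))
    return pairs


# Toy data for illustration only, not real measurements.
toy_csv = io.StringIO("variant,fitness\nA12G,1.30\nA12G/K45R,0.85\n")
pairs = read_variant_fitness(toy_csv)
print(pairs)
```

Parsing the substitutions early makes it easy to catch typos in variant names before any model is trained.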
  1. Generate a multiple sequence alignment (MSA) using jackhmmer

To generate a multiple sequence alignment, the target sequence must be provided in fasta format and the inclusion threshold (--incT) must be set.

jackhmmer [-options] <seqfile> <seqdb>
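A concrete call could be assembled as in the following sketch. The file names, the threshold value, the iteration count, and the thread count are placeholders chosen for illustration, not recommended settings; the helper `jackhmmer_command` is hypothetical. The flags used (`--incT`, `-N`, `--cpu`, `--noali`, `-A`) are standard jackhmmer options.

```python
import shlex


def jackhmmer_command(seqfile, seqdb, inc_t, sto_out, iterations=5, cpus=4):
    """Build a jackhmmer argument list (all values are placeholders)."""
    return [
        "jackhmmer",
        "--incT", str(inc_t),   # inclusion threshold (bit score)
        "-N", str(iterations),  # maximum number of iterations
        "--cpu", str(cpus),     # number of worker threads
        "--noali",              # omit alignments from the text output
        "-A", sto_out,          # save the alignment of hits in Stockholm format
        seqfile,                # target sequence in fasta format
        seqdb,                  # sequence database, e.g. the UniRef100 fasta file
    ]


command = jackhmmer_command(
    "target.fasta", "uniref100.fasta", inc_t=100, sto_out="target.sto"
)
print(shlex.join(command))
# To actually run the search: subprocess.run(command, check=True)
```

Building the command as a list keeps file names with spaces intact when it is later passed to `subprocess.run`.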
  2. Post-process the MSA

Next, the MSA is post-processed by

  • excluding all positions where the wild-type sequence has a gap,
  • excluding all positions that contain more than 30 % gaps,
  • excluding all sequences that contain more than 50 % gaps.

The script sto2a2m.py can be found here.

python sto2a2m.py -sto <stoFile>
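The three filtering rules can be illustrated on a toy alignment. This is a minimal sketch, not the code of sto2a2m.py: it assumes that the first sequence is the wild type, that gaps are written as `-`, and that the filters are applied in the order listed, with column gap fractions computed over all sequences before any sequence is removed. The function name `postprocess_msa` is hypothetical.

```python
def postprocess_msa(sequences, max_col_gaps=0.30, max_seq_gaps=0.50, gap="-"):
    """Filter a list of aligned, equal-length strings; sequences[0] is the wild type."""
    wild_type = sequences[0]
    n_sequences = len(sequences)

    # 1) Drop positions where the wild-type sequence has a gap.
    keep = [i for i, char in enumerate(wild_type) if char != gap]

    # 2) Drop positions that contain more than 30 % gaps.
    keep = [
        i for i in keep
        if sum(seq[i] == gap for seq in sequences) / n_sequences <= max_col_gaps
    ]
    if not keep:
        return []  # nothing left to align against

    trimmed = ["".join(seq[i] for i in keep) for seq in sequences]

    # 3) Drop sequences that contain more than 50 % gaps in the remaining positions.
    return [seq for seq in trimmed if seq.count(gap) / len(seq) <= max_seq_gaps]


# Toy alignment for illustration only; the first entry plays the wild type.
toy_msa = ["MK-LV", "MKALV", "MR-L-", "-----"]
filtered = postprocess_msa(toy_msa)
print(filtered)  # ['MKL', 'MKL', 'MRL']
```

In the toy case, the third column is removed because the wild type has a gap there, the last column because half of its entries are gaps, and the all-gap sequence because it exceeds the 50 % limit.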
  3. Infer parameters for the Potts model using PLMC

Once the a2m file is generated, the parameters of the statistical model are inferred.

plmc [options] alignmentfile
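Since MERGE reads the binary parameter file, the call needs at least the `-o` option of plmc, which saves the estimated parameters. The sketch below assembles such a call; the file names, the focus identifier, and the core count are placeholders, and the helper `plmc_command` is hypothetical. Regularization and iteration settings are deliberately left at the plmc defaults here, which may not match the settings used by the MERGE authors.

```python
import shlex


def plmc_command(alignment, params_out, focus=None, ncores=None):
    """Build a plmc argument list that writes the binary parameter file."""
    command = ["plmc", "-o", params_out]  # -o: save estimated parameters (binary)
    if focus is not None:
        command += ["-f", focus]          # -f: identifier of the focus sequence
    if ncores is not None:
        command += ["-n", str(ncores)]    # -n: maximum number of OpenMP threads
    command.append(alignment)             # alignment file comes last
    return command


command = plmc_command("target.a2m", "target.params", focus="target", ncores=4)
print(shlex.join(command))
# To actually run the inference: subprocess.run(command, check=True)
```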
  4. Construct and explore the model of the fitness landscape using MERGE

Finally, a model of the fitness landscape is generated. See the example for details on how to use MERGE.

Prerequisites

1. Get the UniRef100 database

  1. Download the latest version of UniRef100 (this can take a while, as the file is larger than 100 GB)
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz
  2. Decompress the file to get a fasta file
gzip -d uniref100.fasta.gz

For further information, see the UniProt help or here

2. Installing HMMER

  1. Download the tarball
wget http://eddylab.org/software/hmmer/hmmer.tar.gz
  2. Unpack the tarball
tar zxf hmmer.tar.gz
  3. Enter the directory 'hmmer-3.4'
cd hmmer-3.4
  4. Set the installation path (adjust "/your/install/path" accordingly!)
./configure --prefix /your/install/path
  5. Build HMMER
make
  6. Run self tests (optional)
make check
  7. Install programs and man pages
make install
  8. Add executable to PATH for session (adjust "/your/install/path" accordingly!)
export PATH="/your/install/path/bin:$PATH"

or permanently (adjust "/your/install/path" accordingly!)

echo 'export PATH="/your/install/path/bin:$PATH"' >> ~/.bashrc
  9. Exit the directory 'hmmer-3.4'
cd ..

For further information, see the HMMER documentation

3. Installing PLMC

  1. Clone the plmc repository
git clone https://github.com/debbiemarkslab/plmc.git
  2. Enter the directory 'plmc'
cd plmc
  3. Build with GCC and OpenMP to enable multicore parallelism
make all-openmp
  4. Add executable to PATH for session (adjust "/your/install/path" accordingly, e.g. to the cloned plmc directory, whose bin subdirectory holds the compiled executable)
export PATH="/your/install/path/bin:$PATH"

or permanently (adjust "/your/install/path" accordingly!)

echo 'export PATH="/your/install/path/bin:$PATH"' >> ~/.bashrc
  5. Exit the directory 'plmc'
cd ..

For further information, see the plmc repository
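Before moving on, it can help to confirm that both external programs are reachable through PATH. The following sketch uses only the Python standard library; the helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(names=("jackhmmer", "plmc")):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("jackhmmer and plmc are available.")
```

If a tool is reported as missing in a new shell, the `export PATH=...` line was probably only set for the previous session.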

4. Installing MERGE

  1. Clone the MERGE repository
git clone https://github.com/amillig/MERGE.git
  2. Enter the MERGE directory
cd MERGE
  3. Install the dependencies
pip install -r requirements.txt
  4. Import MERGE as a module in Python
import merge

References

“Combining evolutionary probability and machine learning enables data-driven protein engineering with minimized experimental effort” by Alexander-Maurice Illig, Niklas E. Siedhoff, Mehdi D. Davari*, and Ulrich Schwaneberg*

Author

MERGE was developed and written by Alexander-Maurice Illig at RWTH Aachen University.

Credits

MERGE uses binary parameter files that are generated with plmc written by John Ingraham.

License

License
