MERGE Tutorial: Example for the Dataset YAP1 HUMAN Fields2012-singles
To build a model of the fitness landscape of a protein and explore it in silico, the following files are required:
- protein sequence in fasta format (see yap1.fasta)
- variant-fitness-pairs in csv format (see yap1.csv)
To generate a multiple sequence alignment, the target sequence must be provided in fasta format and the inclusion threshold (--incT) has to be set. It is recommended to start with
jackhmmer [-options] <seqfile> <seqdb>
Options:
- --incT <x> : consider sequences >= this score threshold as significant
- --cpu <n> : number of parallel CPU workers to use for multithreads
- --noali : don't output alignments, so output is smaller
- -A <f> : save multiple alignment of hits to file <f>
For the dataset YAP1 (
jackhmmer --incT 17 --cpu 4 --noali -A yap1.sto yap1.fasta uniref100.fasta
In a next step, the MSA is being post-processed by
- excluding all positions, where the wild type sequence has a gap,
- excluding all positions that contain more than 30 % gaps,
- excluding all sequences that contain more than 50 % gaps.
The script sto2a2m.py can be found here.
python sto2a2m.py -sto yap1.sto
Once the a2m file is generated, we proceed with infering the parameters of the statitstical model.
Usage: plmc [options] alignmentfile
Options:
- -o <f> : save estimated parameters to file <f> (binary)
- -n <n> : maximum number of threads to use in OpenMP
- -le <x> : set L2 lambda for couplings (e_ij)
- -lh <x> : set L2 lambda for fields (h_i)
- -m <n> : maximum number of iterations
- -g : model sequence likelihoods only by coding, non-gapped portions
- -f <s> : select only uppercase, non-gapped sites from a focus sequence
For the YAP1 dataset we choose
-
$48$ threads to use in OpenMP $\texttt{le} = 0.2 * (N_\text{sites} - 1) = 5.8$ $\texttt{lh} = 0.01$ -
$3500$ maximum iterations - yap1 as focus sequence
plmc -o yap1.params -n 48 -le 5.8 -lh 0.01 -m 3500 -g -f yap1 yap1.a2m
Run the script yap1_example.py.
python yap1_example.py
First, import the required package
import merge
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from scipy.stats import spearmanr
Next, user input is required:
- startingPosition : startingPosition != 1 if the sequence in the fasta file does not start at residue 1
- paramsFile : binary file outputted by PLMC
- csvFile : ensure that separator is ';' and not ','. The name of the column including the variants should be 'mutant'. Multiple substituted variants should be separated by ','. Variants should be given in the form F3A (PHE at position 3 is subsituted by ALA).
- fitnessColumn : key of the column including the fitness values
startingPosition = 170
paramsFile = 'yap1.params'
csvFile = 'yap1.csv'
fitnessColumn = 'linear'
In a next step, sequence representations of the wild type yap1_wt_encoded.npy and all variants given in yap1.csv are generated :
encodeCls = merge.Encode(startingPosition, paramsFile)
np.save('yap1_wt_encoded.npy', encodeCls._encode_wt())
data = merge.get_data(fitnessColumn, csvFile, encodeCls)
merge.generate_dataframe(data, 'yap1_encoded.csv')
Finally, a model of the fitness landscape is generated using 80 % of the data and evaluated in terms of the coefficient of determination
df = pd.read_csv('yap1_encoded.csv', sep=';')
xWt = np.load('yap1_wt_encoded.npy')
x = df.iloc[:,2:].to_numpy()
deltaE = merge.X_to_deltaE(x, xWt)
y = df.iloc[:,1].to_numpy()
xTrain, xTest, deltaETrain, deltaETest, yTrain, yTest = train_test_split(x, deltaE, y, train_size=0.8, random_state=42)
model = merge.CombinedPredictor()
model.train(xTrain, deltaETrain, yTrain)
print('R2 score (test set): %.2f '%(r2_score(yTest, model.predict(xTest, deltaETest))))
print('Spearman rho (test set): %.2f '%(spearmanr(yTest, model.predict(xTest, deltaETest))[0]))
Optionally, the fitness landscape can be explored to generate improved variants applying Metropolis Hastings algorithm. For this purpose, additional user input is required:
- yWt : Fitness value of the wild type
- maxSubstitutions : Maximum number of subsitutions to introduce during exploration
- factor : Only classify variants as improved if fitness >= factor * yWt
- nWalkers : Number of walkers to generate
- nCores : Number of CPUs to use for parallel execution
yWt = 1.0
maxSubstitutions = 3
factor = 1.1
nWalkers = 96
nCores = 8
explore = merge.Explore(encodeCls, model, yWt, maxSubstitutions=maxSubstitutions, factor=factor)
results = explore.scrape_landscape(nWalkers, nCores=nCores)
df = pd.DataFrame(results, columns=['Variant', 'Predicted Fitness'])
df.to_csv('improved_variants.csv', index=False, sep=';')