Jue Wang ([email protected])
Doug Tischer ([email protected])
Sidney Lisanza ([email protected])
This folder contains scripts for running protein hallucination using the RoseTTAFold structure-prediction model.
See README in root folder of rfdesign
repository.
Examples of full design workflows can be found in rfdesign/tutorials/
.
To run 100 design trajectories for scaffolding PD1 binding interface using 600 steps of gradient descent:
conda activate SE3
python hallucinate.py --pdb=/path/to/pd1.pdb --mask=20-25,A119-140,5-15,A63-82,0-5 \
--out=outdir/test --num=100 --steps=g600
To run 5 trajectories of 300 MCMC steps each, with output files numbered 5-9, modeling the
hallucinated protein in the context of a binding target pdl1.pdb
, fixing amino acids at certain positions, avoiding cysteines, turning on surface non-polar penalty and net-charge losses, and initializing the sequence with a previously hallucinated design:
python hallucinate.py --pdb=/path/to/pd1.pdb --mask=20-25,A119-140,5-15,A63-82,0-5 \
--receptor /path/to/pdl1.pdb \
--out outdir/test --steps m300 --num 5 --start_num 5 \
--use_template A163-191 --force_aa A119,A122-124,A130,A133 --exclude_aa C \
--w_surfnp 1 --w_nc 0.02 --spike 0.999 --spike_fas outdir/test_0.fas
The best way to understand how all the arguments can be used is to take a look
at the examples in tutorials/
.
hallucinate.py
takes as inputs a template pdb structure (--pdb
) and residue
ranges of positions in the template that should be constrained in the
hallucination (either --mask
or --contigs
).
The option
--mask=20-25,A119-140,5-15,B63-82,5
tells the script to hallucinate 20-25 residues, then constrain a region to mimic chain A residues 119-140 of the template, followed by 5-15 hallucinated residues, then mimic chain B residues 63-82, then hallucinate 5 residues. The length of each hallucinated region is sampled uniformly randomly from the specified range for each design, and kept fixed during optimization.
The option
--contigs=A119-140,B63-82 --len=80-100
tells the script to hallucinate a protein of 80-100 residues, and place
constrained regions ("contigs") corresponding to chain A residues 119-140 and
chain B residues 63-82 anywhere along its length. The difference from --mask
is that the length of the hallucinated regions are not sampled independently.
By default, --contigs
will randomly shuffle the order of contigs (i.e. B63-82
of the input pdb may appear before A119-140 in the output sequence). Setting
--keep_order True
will keep the contigs in the same order they appear in the
provided argument. When using --mask
, the contigs will always be kept in the
order provided, although they don't have to be provided in the same order that
they appear in the reference pdb.
When called with (for example) --out=outdir/test
, hallucinate.py
will write
outputs to the folder outdir
prefixed by test_0
, test_1
, etc. For each
design, these files will be output:
.npz
: numpy-format file with pairwise distance and orientation distributions for the design..trb
: pickled dictionary with metadata, including the indices of the constrained motifs in both the reference structure and hallucination..fas
: fasta file of the designed sequence..pdb
: backbone (N,Ca,C) structure of the design, as generated by TRFold minimization.
To generate design models with sidechains you can run a short relax using
scripts/trfold_relax.sh FOLDER
where FOLDER contains your hallucinate.py
outputs. This is a wrapper for
scripts/RosettaTR/trfold_relax.py
. This uses the TRFold .pdb file as an
input and outputs structures to FOLDER/trf_relax
. For further postprocessing
and analysis instructions, see scripts/README.md
.
Typically, hallucinations are generated by running gradient descent (GD) from a
randomly initialized sequence. This can be done with e.g. --steps=g400
(between
400-600 steps is recommended). The default values of --learning_rate
and
--drop
gave the best results in test runs but these should probably be tuned
to each new problem.
After GD, the loss can be improved further (and some non-differentiable losses
used) via Markov Chain Monte Carlo (MCMC). This is done with e.g.
--steps=m1000
. To start from an existing design, the --spike=0.999 --spike_fas designed_seq.fas
arguments are used, and the --mask
argument
should be set to the correct string for the previous design (see
tutorials/halluc_binder/
for an example).
GD and MCMC stages can be done in the same run, e.g. --steps=g600,m2000
. MCMC
can also be run without running GD first, but GD+MCMC is much more efficient.
By default, the optimization loss is an equally weighted average of cce and
entropy. You can choose different loss weights by passing the --w_cce
,
--w_entropy
, --w_kl
, --w_crmsd
, and other options. Using a loss weight of -1
will disable it if it is enabled by default, and a weight of 0 will turn off
the loss but still output its value for analysis.
Other losses are:
- rep: repulsive loss to avoid clashes with a binding partner (protein or
small molecule) specified by e.g.
--rep_pdb /path/to/receptor.pdb
. This is a partial Lennard-Jones potential (Alford 2017) (essentially a smoothed rectified linear function) with a characteristic distance e.g.--rep_sigma
in angstroms (Alford 2017). The loss is computed as the mean of all partial LJ potentials between all backbone atoms in the hallucination and all heavy atoms in the binding partner after aligning the hallucination to the motif. Typical weights are 1-5. - atr: attractive loss to encourage proximity to binding partner, computed as
partial LJ potential as in Alford 2017, assessed on the hallucination and
target similarly to
rep
. Use with--atr_pdb
and--atr_sigma
. Typical weights are 0.01-0.1. - rog: radius of gyration, to discourage formation of extended helices. Takes
the form of an exponential linear unit with distance threshold
--rog_thresh
in angstroms. The loss is 0 below the threshold and linear with any excess rog above the threshold. Typical weight is 1. Typical threshold (for proteins 50-100 AAs) is 14-16.