Skip to content

Latest commit

 

History

History
 
 

hallucination

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Hallucination design script

Jue Wang ([email protected])
Doug Tischer ([email protected])
Sidney Lisanza ([email protected])

This folder contains scripts for running protein hallucination using the RoseTTAFold structure-prediction model.

Installation

See README in root folder of rfdesign repository.

Usage

Examples of full design workflows can be found in rfdesign/tutorials/.

Quick start

To run 100 design trajectories for scaffolding PD1 binding interface using 600 steps of gradient descent:

conda activate SE3
python hallucinate.py --pdb=/path/to/pd1.pdb --mask=20-25,A119-140,5-15,A63-82,0-5 \
                      --out=outdir/test --num=100 --steps=g600 

To run 5 trajectories of 300 MCMC steps each, with output files numbered 5-9, modeling the hallucinated protein in the context of a binding target pdl1.pdb, fixing amino acids at certain positions, avoiding cysteines, turning on surface non-polar penalty and net-charge losses, and initializing the sequence with a previously hallucinated design:

python hallucinate.py --pdb=/path/to/pd1.pdb --mask=20-25,A119-140,5-15,A63-82,0-5 \
                      --receptor /path/to/pdl1.pdb \
                      --out outdir/test --steps m300 --num 5 --start_num 5 \
                      --use_template A163-191 --force_aa A119,A122-124,A130,A133 --exclude_aa C \
                      --w_surfnp 1 --w_nc 0.02 --spike 0.999 --spike_fas outdir/test_0.fas

The best way to understand how all the arguments can be used is to take a look at the examples in tutorials/.

Inputs

hallucinate.py takes as inputs a template pdb structure (--pdb) and residue ranges of positions in the template that should be constrained in the hallucination (either --mask or --contigs).

The option

--mask=20-25,A119-140,5-15,B63-82,5

tells the script to hallucinate 20-25 residues, then constrain a region to mimic chain A residues 119-140 of the template, followed by 5-15 hallucinated residues, then mimic chain B residues 63-82, then hallucinate 5 residues. The length of each hallucinated region is sampled uniformly randomly from the specified range for each design, and kept fixed during optimization.

The option

--contigs=A119-140,B63-82 --len=80-100

tells the script to hallucinate a protein of 80-100 residues, and place constrained regions ("contigs") corresponding to chain A residues 119-140 and chain B residues 63-82 anywhere along its length. The difference from --mask is that the length of the hallucinated regions are not sampled independently.

By default, --contigs will randomly shuffle the order of contigs (i.e. B63-82 of the input pdb may appear before A119-140 in the output sequence). Setting --keep_order True will keep the contigs in the same order they appear in the provided argument. When using --mask, the contigs will always be kept in the order provided, although they don't have to be provided in the same order that they appear in the reference pdb.

Outputs

When called with (for example) --out=outdir/test, hallucinate.py will write outputs to the folder outdir prefixed by test_0, test_1, etc. For each design, these files will be output:

  • .npz: numpy-format file with pairwise distance and orientation distributions for the design.
  • .trb: pickled dictionary with metadata, including the indices of the constrained motifs in both the reference structure and hallucination.
  • .fas: fasta file of the designed sequence.
  • .pdb: backbone (N,Ca,C) structure of the design, as generated by TRFold minimization.

To generate design models with sidechains you can run a short relax using

scripts/trfold_relax.sh FOLDER

where FOLDER contains your hallucinate.py outputs. This is a wrapper for scripts/RosettaTR/trfold_relax.py. This uses the TRFold .pdb file as an input and outputs structures to FOLDER/trf_relax. For further postprocessing and analysis instructions, see scripts/README.md.

Optimization methods

Typically, hallucinations are generated by running gradient descent (GD) from a randomly initialized sequence. This can be done with e.g. --steps=g400 (between 400-600 steps is recommended). The default values of --learning_rate and --drop gave the best results in test runs but these should probably be tuned to each new problem.

After GD, the loss can be improved further (and some non-differentiable losses used) via Markov Chain Monte Carlo (MCMC). This is done with e.g. --steps=m1000. To start from an existing design, the --spike=0.999 --spike_fas designed_seq.fas arguments are used, and the --mask argument should be set to the correct string for the previous design (see tutorials/halluc_binder/ for an example).

GD and MCMC stages can be done in the same run, e.g. --steps=g600,m2000. MCMC can also be run without running GD first, but GD+MCMC is much more efficient.

Losses

By default, the optimization loss is an equally weighted average of cce and entropy. You can choose different loss weights by passing the --w_cce, --w_entropy, --w_kl, --w_crmsd, and other options. Using a loss weight of -1 will disable it if it is enabled by default, and a weight of 0 will turn off the loss but still output its value for analysis.

Other losses are:

  • rep: repulsive loss to avoid clashes with a binding partner (protein or small molecule) specified by e.g. --rep_pdb /path/to/receptor.pdb. This is a partial Lennard-Jones potential (Alford 2017) (essentially a smoothed rectified linear function) with a characteristic distance e.g. --rep_sigma in angstroms (Alford 2017). The loss is computed as the mean of all partial LJ potentials between all backbone atoms in the hallucination and all heavy atoms in the binding partner after aligning the hallucination to the motif. Typical weights are 1-5.
  • atr: attractive loss to encourage proximity to binding partner, computed as partial LJ potential as in Alford 2017, assessed on the hallucination and target similarly to rep. Use with --atr_pdb and --atr_sigma. Typical weights are 0.01-0.1.
  • rog: radius of gyration, to discourage formation of extended helices. Takes the form of an exponential linear unit with distance threshold --rog_thresh in angstroms. The loss is 0 below the threshold and linear with any excess rog above the threshold. Typical weight is 1. Typical threshold (for proteins 50-100 AAs) is 14-16.