Protein design and variant prediction using autoregressive generative models

kellisfm/seqdesign-pytorch

SeqDesign

SeqDesign is a generative, unsupervised model for biological sequences. It learns functional constraints from unaligned sequences to predict the effects of mutations and to generate novel sequences, including ones with insertions and deletions. For more information, see the bioRxiv preprint.

This version of the codebase is compatible with Python 3 and PyTorch. It also implements Fast Wavenet generation.
A TensorFlow version is also available.

Installation

See INSTALL.md.

Examples

See the examples directory for examples of training, mutation effect prediction, and generation.

Usage

Run any script with the -h argument to list all of its available arguments.

Training

Given a fasta file of training sequences, run:

run_autoregressive_fr --dataset <your_dataset>.fa

Sequences are weighted uniformly by default. To set per-sequence weights, append a colon and a weight to each FASTA header, e.g. :1.0.
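The header convention above can be applied to an existing FASTA file with a few lines of Python. This is a minimal sketch, not part of SeqDesign: the helper name add_fasta_weights and the weights dictionary are hypothetical, and it assumes the weight only needs to be the last colon-separated field of the header. Check how the training script parses headers before relying on it.

```python
def add_fasta_weights(lines, weights, default=1.0):
    """Append ':<weight>' to every FASTA header line.

    lines:   iterable of FASTA lines (header lines start with '>').
    weights: dict mapping header text (without '>') to a float weight.
    default: weight used for headers missing from `weights`.

    Hypothetical helper for illustration; not a SeqDesign function.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            name = line[1:]
            weight = weights.get(name, default)
            out.append(f">{name}:{weight}")
        else:
            # Sequence lines pass through unchanged.
            out.append(line)
    return out


example = [">seq1", "MKTAYIAK", ">seq2", "MKTAYLAK"]
weighted = add_fasta_weights(example, {"seq1": 0.5})
print("\n".join(weighted))
```

Writing the returned lines to a new .fa file gives a dataset that can be passed to --dataset as shown above.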

Mutation effect prediction

Deterministic:

calc_logprobs_seqs_fr --sess <your_sess> --dropout-p 1.0 --num-samples 1 --input <input>.fa --output <output>.csv

Average of 500 samples:

calc_logprobs_seqs_fr --sess <your_sess> --dropout-p 0.5 --num-samples 500 --input <input>.fa --output <output>.csv
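The difference between the two modes can be illustrated with a toy example. The sketch below is not SeqDesign's model or code: toy_log_prob is a made-up stand-in for one forward pass, and it assumes that --dropout-p behaves like a keep probability (which would be consistent with 1.0 being labeled deterministic above, but is not confirmed here). It only shows why a single pass suffices without dropout, while stochastic passes are averaged over many samples.

```python
import random


def toy_log_prob(weights, keep_p, rng):
    """Toy stand-in for one forward pass (NOT the SeqDesign model).

    Each term is kept with probability keep_p and rescaled by 1/keep_p,
    so the expected value equals the plain sum of the weights.
    """
    total = 0.0
    for w in weights:
        if rng.random() < keep_p:
            total += w / keep_p
    return total


def mean_log_prob(weights, keep_p, num_samples, seed=0):
    """Average the toy score over num_samples stochastic passes."""
    rng = random.Random(seed)
    samples = [toy_log_prob(weights, keep_p, rng) for _ in range(num_samples)]
    return sum(samples) / num_samples


weights = [-0.5, -1.25, -0.25]

# keep_p = 1.0: nothing is dropped, so one pass gives the exact score.
deterministic = mean_log_prob(weights, keep_p=1.0, num_samples=1)

# keep_p = 0.5: each pass is noisy, so many passes are averaged.
averaged = mean_log_prob(weights, keep_p=0.5, num_samples=500)

print(deterministic, averaged)
```

In this toy setting the averaged value approaches the deterministic one as the number of samples grows; whether the same holds for the real model depends on its architecture.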

Sequence generation

generate_sample_seqs_fr --sess <your_sess>

Use the --fast-generation argument for Fast Wavenet.

Data availability

See the examples directory to download training sequences, mutation effect predictions, and generated sequences.
