Codon-Optimization

A deep-learning-based approach to genetic codon prediction and optimization. We propose an LSTM-Transducer model for this task, which yields modest improvements in accuracy and perplexity over frequency-based methods when predicting codon choice.
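For readers unfamiliar with frequency-based codon selection, the following is a minimal sketch of one such baseline: for each amino acid, always emit the codon seen most often in a training set. It is an illustration only and is not taken from this repository's baseline code; the function names (`split_codons`, `fit_frequency_baseline`, `predict`) are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(seq):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def fit_frequency_baseline(sequences):
    """Return a dict mapping each amino acid to its most frequent codon."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for codon in split_codons(seq):
            aa = CODON_TABLE.get(codon)  # skips codons with ambiguous bases
            if aa is not None:
                counts[aa][codon] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def predict(protein, best_codon):
    """Back-translate a protein string using the most frequent codon per residue."""
    return "".join(best_codon[aa] for aa in protein)
```

A learned model such as the LSTM-Transducer differs from this sketch in that it can condition the codon choice on sequence context rather than on the amino acid alone.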

This was originally implemented as an undergraduate project in Google Colab using namedtensor, a PyTorch wrapper, for Harvard's CS287r Machine Learning for Natural Language Processing course. After this work was presented as a poster at MLCB 2019, the code was revised for readability and reproducibility *. Note that no model-generated sequences have been experimentally tested for expression in the lab against frequency baselines.

Data

The models were tested on highly expressed genes of E. coli MG1655 and human (hg19). The highly expressed gene sets in data/ (data/ecoli.heg.fasta and data/human_HE.fasta) can be used directly to train a new model; alternatively, any set of transcripts in FASTA format can be used as input. The script src/download_human_genes.py was used to resolve nucleotide sequences for the human housekeeping gene set.
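To check that a custom transcript set is in the expected shape before training, a small FASTA reader such as the sketch below may help. It uses only the standard library and is not the loader used in src/; the function name `read_fasta` and the choice to key records by the first word of the header are assumptions made for illustration.

```python
def read_fasta(handle):
    """Parse FASTA text from an open file-like object.

    Returns a dict mapping each record ID (first word of the header line)
    to its upper-cased sequence with line breaks removed.
    """
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else "unnamed"
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records
```

For example, `with open("data/ecoli.heg.fasta") as fh: records = read_fasta(fh)` would load the E. coli set, after which sequences whose length is not a multiple of three can be flagged before they reach the model.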

Running the code

After downloading a set of transcripts in FASTA format from the above links and removing redundancy (e.g. using CD-HIT), a model can be trained and predictions generated on a random train/val/test split using the command:

python src/main.py --data-file [data/datafile.fasta]
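The random train/val/test split mentioned above can be pictured with the following sketch. The 80/10/10 proportions, the fixed seed, and the function name `split_records` are assumptions for illustration; the fractions and seeding actually used by src/main.py are not stated here.

```python
import random


def split_records(ids, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle record IDs reproducibly and split them into train/val/test lists."""
    ids = sorted(ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test
```

Removing redundant sequences before splitting matters because near-identical transcripts landing on both sides of the split would inflate validation and test scores.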

Different models can be selected for the codon layer using the --codon-model-name flag and for the amino acid layer using the --aa-model-name flag.

To run baseline models:

python src/main.py --data-file [data/datafile.fasta] --run-baselines
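When comparing a trained model against the baselines, the two metrics named in the introduction can be computed as in the sketch below. It assumes the usual definitions (accuracy as the fraction of positions where the predicted codon matches the true codon, perplexity as the exponential of the mean negative log-probability assigned to the true codon); the function names are hypothetical and the evaluation code in src/ may differ in detail.

```python
import math


def codon_accuracy(predicted, actual):
    """Fraction of positions where the predicted codon equals the true codon."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def perplexity(true_codon_probs):
    """Perplexity from the probabilities a model assigned to each true codon.

    Each probability must be greater than zero; lower perplexity is better,
    and 1.0 means every true codon was predicted with certainty.
    """
    mean_nll = -sum(math.log(p) for p in true_codon_probs) / len(true_codon_probs)
    return math.exp(mean_nll)
```

Under these definitions, a model that spreads probability evenly over the k synonymous codons of every residue has a perplexity of k at those positions, which gives a rough reference point for reading results.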

* As this work is not being actively continued, the code has only been tested on CPU, although a --gpu flag is provided, and it has not yet been used to reproduce the results table or free energy analysis from the original Colab experiments.

samgoldman97/Codon-Optimization