Gene calling with Deep Neural Networks.
This software is undergoing active testing and development. Build on it at your own risk.
Set up and train models for de novo prediction of gene structure, that is, perform "gene calling" and identify which base pairs in a genome belong to the UTR/CDS/Intron of genes. Train one model for a wide variety of genomes.
For realistically sized datasets, a GPU will be necessary for acceptable performance.
The example below and all provided models should run on an NVIDIA GPU with 11 GB of memory (e.g. GTX 1080 Ti).
The driver for the GPU must also be installed. During development we have used
- nvidia-driver-495
- nvidia-driver-510
and many in between.
See https://github.com/gglyptodon/helixer-docker
Additionally, please see notes on usage, which will differ slightly from the example below.
Please see full installation instructions
Please additionally see dev installation instructions
This example focuses only on applying trained models for gene calling.
Information on training and evaluating the models can be found in docs.
NOTE: the extensively evaluated models from the paper are available by running
git checkout v0.2.0
and following the instructions therein. However, at that version they could not yet be used to generate gff3 files.
We are working towards training another round of models with the current architecture. For now, a preliminary land plant model is available and will be used for the rest of the example.
The best models for each or all lineages can be downloaded automatically with the fetch_helixer_models.py script.
The available lineages are land_plant, vertebrate, invertebrate, and fungi.
Info on the downloaded models (and any new releases) can be found here: https://uni-duesseldorf.sciebo.de/s/lQTB7HYISW71Wi0
Note: to use a non-default model, set --model-filepath <path/to/model.h5> to override the lineage default for Helixer.py.
# download an example chromosome
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-47/fasta/arabidopsis_lyrata/dna/Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz
gunzip Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz
# run all Helixer components from fa to gff3
Helixer.py --lineage land_plant --fasta-path Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa \
--species Arabidopsis_lyrata --gff-output-path Arabidopsis_lyrata_chromosome8_helixer.gff3
The above runs three main steps: conversion of sequence to numerical matrices in preparation (fasta2h5.py), prediction of base-wise probabilities with the Deep Learning based model (helixer/prediction/HybridModel.py), and post-processing into primary gene models (helixer_post_bin). See the respective help functions for additional usage information, if necessary.
# example broken into individual steps
fasta2h5.py --species Arabidopsis_lyrata --h5-output-path Arabidopsis_lyrata.h5 --fasta-path Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa
# the exact location ($HOME/.local/share/) of the model comes from appdirs
# the model was downloaded when Helixer.py was called above
# this example code is for _linux_ and will need to be modified for other OSs
HybridModel.py --load-model-path $HOME/.local/share/Helixer/models/land_plant.h5 --test-data Arabidopsis_lyrata.h5 --overlap --val-test-batch-size 32 -v
helixer_post_bin Arabidopsis_lyrata.h5 predictions.h5 100 0.1 0.8 60 Arabidopsis_lyrata_chromosome8_helixer.gff3
Output: The main output of the above commands is the gff3 file (Arabidopsis_lyrata_chromosome8_helixer.gff3) which contains the predicted genic structure (where the exons, introns, and coding regions are for every predicted gene in the genome). You can find more about the format here, and you can readily derive other formats, such as a fasta file of the proteome, using a standard parser, for instance gffread.
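As a quick sanity check on the output, the feature types in a gff3 file can be tallied with a few lines of Python. This is a minimal sketch that relies only on the generic GFF3 layout (nine tab-separated columns, with the feature type in column three); the function name `count_gff3_features` and the example records are made up for illustration and are not part of Helixer.

```python
from collections import Counter
from typing import Iterable


def count_gff3_features(lines: Iterable[str]) -> Counter:
    """Count features per type (GFF3 column 3) in an iterable of lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives ("##...") and comments ("#...").
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns.
        if len(columns) != 9:
            continue
        counts[columns[2]] += 1
    return counts


# Hand-written example records; coordinates and IDs are hypothetical.
example = [
    "##gff-version 3",
    "chr8\texample\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr8\texample\tmRNA\t100\t900\t.\t+\t.\tID=gene1.1;Parent=gene1",
    "chr8\texample\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=gene1.1",
    "chr8\texample\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=gene1.1",
    "chr8\texample\tCDS\t150\t400\t.\t+\t0\tID=cds1;Parent=gene1.1",
    "chr8\texample\tCDS\t600\t850\t.\t+\t2\tID=cds1;Parent=gene1.1",
]

feature_counts = count_gff3_features(example)
print(dict(feature_counts))
```

To apply it to the real output, pass an open file handle instead of the example list, e.g. `count_gff3_features(open("Arabidopsis_lyrata_chromosome8_helixer.gff3"))`.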
Most parameters of Helixer.py have been set to a reasonable default; nevertheless, there are a couple where the best setting is genome dependent.
It is of course critical to choose a model appropriate for your phylogenetic range or trained on species that generalize well to your target species. When in doubt, selection via --lineage is recommended, as this will use the best available model for that lineage.
Subsequence length controls how much of the genome the Neural Network can see at once, and should ideally be comfortably longer than the typical gene.
For genomes with large genes (i.e. genomic loci > 20 kbp are frequent), --subsequence-length should be increased. This is particularly common for vertebrates and invertebrates but can also happen in plants. For efficiency, the overlap parameters should increase as well. It might then be necessary to decrease --batch-size if the GPU runs out of memory.
However, these should definitely not be higher than the N50, or even the N90, of the genome, nor so high that a reasonable batch size cannot be used.
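To compare a candidate subsequence length against the N50 or N90, both can be computed from the sequence lengths of the assembly. The following is a minimal sketch using the common definition of Nx (the length L such that sequences of length >= L together cover at least x% of the assembly); the helper names `fasta_lengths` and `nx_length` are hypothetical and not part of Helixer.

```python
from typing import Iterable, List, Sequence


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def nx_length(lengths: Sequence[int], percent: int) -> int:
    """Return the Nx (e.g. percent=50 for N50) of the given sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        # Integer arithmetic avoids floating point rounding at the threshold.
        if covered * 100 >= percent * total:
            return length
    return min(lengths)


# Toy assembly with five sequences (hypothetical lengths).
toy_lengths = [100, 80, 60, 40, 20]
n50 = nx_length(toy_lengths, 50)
n90 = nx_length(toy_lengths, 90)
print(n50, n90)
```

For a real genome, `fasta_lengths(open("Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa"))` yields the lengths to pass to `nx_length`, and the result can be compared directly with the intended --subsequence-length.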
- fungi, leave as is (--subsequence-length 21384 --overlap-offset 10692 --overlap-core-length 16038)
- plants, leave as is, or try up to --subsequence-length 106920 --overlap-offset 53460 --overlap-core-length 80190
- invertebrates, set to --subsequence-length 213840 --overlap-offset 106920 --overlap-core-length 160380
- vertebrates, set to --subsequence-length 213840 --overlap-offset 106920 --overlap-core-length 160380
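In all of the settings listed above, the overlap offset is one half and the overlap core length is three quarters of the subsequence length. The sketch below derives the three flags from a chosen subsequence length using those ratios and assembles a Helixer.py command line. The ratios are an observation about the listed values, not a documented rule, and the helper names `overlap_flags` and `helixer_command` as well as the divisibility check are assumptions made for this example.

```python
from typing import List


def overlap_flags(subsequence_length: int) -> List[str]:
    """Derive overlap flags using the 1/2 and 3/4 ratios seen in the list above."""
    # Assumption of this sketch: require divisibility by 4 so that both
    # derived values are whole numbers.
    if subsequence_length <= 0 or subsequence_length % 4 != 0:
        raise ValueError("subsequence_length must be a positive multiple of 4")
    offset = subsequence_length // 2
    core_length = subsequence_length * 3 // 4
    return [
        "--subsequence-length", str(subsequence_length),
        "--overlap-offset", str(offset),
        "--overlap-core-length", str(core_length),
    ]


def helixer_command(lineage: str, fasta_path: str, species: str,
                    gff_output_path: str, subsequence_length: int) -> List[str]:
    """Build the argument list for a Helixer.py call with custom lengths."""
    return [
        "Helixer.py",
        "--lineage", lineage,
        "--fasta-path", fasta_path,
        "--species", species,
        "--gff-output-path", gff_output_path,
    ] + overlap_flags(subsequence_length)


command = helixer_command(
    lineage="land_plant",
    fasta_path="Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa",
    species="Arabidopsis_lyrata",
    gff_output_path="Arabidopsis_lyrata_chromosome8_helixer.gff3",
    subsequence_length=106920,
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` to start the run; printing it first makes it easy to check the derived values against the list above.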
Felix Stiehler, Marvin Steinborn, Stephan Scholz, Daniela Dey, Andreas P M Weber, Alisandra K Denton, Helixer: Cross-species gene annotation of large eukaryotic genomes using deep learning, Bioinformatics, btaa1044, https://doi.org/10.1093/bioinformatics/btaa1044
Please check back to see if the post-processor and additional developments to the core Helixer functionality have been published.