PGAP

NCBI Prokaryotic Genome Annotation Pipeline

The NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs and pseudogenes.

NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the NCBI Prokaryotic Genome Annotation Pipeline was developed in 2001, and the pipeline is regularly upgraded to improve structural and functional annotation quality (Haft DH et al. 2018; Tatusova T et al. 2016). Recent improvements use curated protein profile hidden Markov models (HMMs), including TIGRFAMs and new HMMs for antimicrobial resistance proteins, and curated complex domain architectures for functional annotation of proteins.

Get started by watching this webinar!

Instructions

To run PGAP you will need Linux or a compatible container technology, CWL (Common Workflow Language), and about 30 GB of supplemental data. We provide instructions here for running under the CWL reference implementation, cwltool. Full instructions for installing, running, and interpreting the results may be found in our wiki.
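As a rough illustration of the requirements above, the following Python sketch checks that a CWL runner is on the PATH and that roughly 30 GB of disk space is free, then assembles a cwltool command line. It is not part of PGAP. The helper names (`check_prerequisites`, `build_command`) are hypothetical, and the workflow and job file names (`pgap.cwl`, `input.yaml`) are assumptions used only for illustration; the wiki remains the authoritative source for the actual invocation.

```python
import shutil

# Approximate size of the supplemental data, as stated in the text above.
REQUIRED_FREE_GB = 30


def check_prerequisites(data_dir: str = ".", runner: str = "cwltool") -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    # shutil.which returns None when the executable is not on the PATH.
    if shutil.which(runner) is None:
        problems.append(f"{runner} not found on PATH")
    # Compare free space (decimal gigabytes) with the approximate requirement.
    free_gb = shutil.disk_usage(data_dir).free / 1e9
    if free_gb < REQUIRED_FREE_GB:
        problems.append(
            f"only {free_gb:.1f} GB free in {data_dir}; about {REQUIRED_FREE_GB} GB needed"
        )
    return problems


def build_command(workflow: str, job_file: str, runner: str = "cwltool") -> list[str]:
    """Build the basic 'runner workflow job-file' argument list (nothing is executed)."""
    return [runner, workflow, job_file]


if __name__ == "__main__":
    for issue in check_prerequisites():
        print("WARNING:", issue)
    # File names below are assumed for illustration only.
    print("Would run:", " ".join(build_command("pgap.cwl", "input.yaml")))
```

The script only reports what it finds and prints the command it would run; it does not download data or start the workflow.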

References

NCBI

NCBI prokaryotic genome annotation pipeline.
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J.
Nucleic Acids Res. 2016 Aug 19;44(14):6614-24. Epub 2016 Jun 24.

RefSeq: an update on prokaryotic genome annotation and curation.
Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O'Neill K, Li W, Chitsaz F, Derbyshire MK, Gonzales NR, Gwadz M, Lu F, Marchler GH, Song JS, Thanki N, Yamashita RA, Zheng C, Thibaud-Nissen F, Geer LY, Marchler-Bauer A, Pruitt KD.
Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860.

Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI.
Ciufo S, Kannan S, Sharma S, Badretdin A, Clark K, Turner S, Brover S, Schoch CL, Kimchi A, DiCuccio M.
Int J Syst Evol Microbiol. 2018 Jul;68(7):2386-2392.

GeneMarkS-2+

Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes.
Lomsadze A, Gemayel K, Tang S, Borodovsky M.
Genome Res. 2018;28(7):1079-1089.

TIGRFAMs

TIGRFAMs: a protein family resource for the functional identification of proteins.
Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O.
Nucleic Acids Res. 2001 Jan 1;29(1):41-3.

The TIGRFAMs database of protein families.
Haft DH, Selengut JD, White O.
Nucleic Acids Res. 2003 Jan 1;31(1):371-3.

TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes.
Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR, White O.
Nucleic Acids Res. 2007 Jan;35(Database issue):D260-4. Epub 2006 Dec 6.

TIGRFAMs and Genome Properties in 2013.
Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E.
Nucleic Acids Res. 2013 Jan;41(Database issue):D387-95. doi: 10.1093/nar/gks1234. Epub 2012 Nov 28.

LICENSING TERMS

NCBI PGAP CWL

The NCBI PGAP CWL and other code authored by NCBI is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as United States Government employees and thus cannot be copyrighted. This software is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite NCBI in any work or product based on this material.

Third-party tools

The Docker image contains third-party tools distributed under the licensing terms of the respective license holders.

GeneMarkS-2+

GeneMarkS-2+ is distributed as part of PGAP with limited rights of use and redistribution from the Georgia Tech Research Corporation. See the full text of the license.

TIGRFAMs

The original TIGRFAMs database was a research project of the J. Craig Venter Institute (JCVI). TIGRFAMs, short for The Institute for Genomic Research's database of protein families, is a collection of manually curated protein families focusing primarily on prokaryotic sequences. It consists of hidden Markov models (HMMs), multiple sequence alignments, Gene Ontology (GO) terminology, Enzyme Commission (EC) numbers, gene symbols, protein family names, descriptive text, cross-references to related models in TIGRFAMs and other databases, and pointers to the literature. The work has been described in the articles listed in the References section above, and any use of the TIGRFAMs database must give proper attribution by citing those four articles.

As of April 2018, rights were transferred to the National Center for Biotechnology Information (NCBI), National Library of Medicine, NIH, for the data to be made available for distribution under a Creative Commons Attribution-ShareAlike 4.0 license. Please see (https://creativecommons.org/licenses/by-sa/4.0/) for a brief summary of the license and (https://creativecommons.org/licenses/by-sa/4.0/legalcode) to see the full text.
