forked from sdwfrost/gbmunge

samordil/gbmunge

Latest commit: adbead3 by Simon Frost, Dec 19, 2023


gbmunge

Munge GenBank files into FASTA sequences and tab-separated metadata.

This small C program extracts the following fields from a GenBank file:

  • name
  • accession
  • length
  • submission date
  • host
  • country
  • collection date

In addition, gbmunge reformats dates (for example, 31-DEC-2001 becomes 2001-12-31), which makes them easier for downstream software such as BEAST to digest. It also cleans country names and matches them to three-letter ISO3 codes (for example, GBR).
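The date reformatting described above can be sketched in C as follows. This is a minimal illustration, not the code gbmunge actually uses: the function names `reformat_date` and `month_number` are hypothetical, and the accepted partial-date inputs (`MON-YYYY` and `YYYY`) are an assumption based on the reduced-precision dates such as 2012-04 that appear in the sample output.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map a three-letter month abbreviation (any case)
   to 1..12, or return 0 if it is not recognised. */
static int month_number(const char *mon)
{
    static const char *names[] = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                                  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"};
    for (int i = 0; i < 12; i++) {
        int match = 1;
        for (int j = 0; j < 3; j++) {
            if (toupper((unsigned char)mon[j]) != names[i][j]) {
                match = 0;
                break;
            }
        }
        if (match)
            return i + 1;
    }
    return 0;
}

/* Hypothetical function: rewrite a GenBank-style date as an ISO-style date.
   "31-DEC-2001" -> "2001-12-31", "Apr-2012" -> "2012-04", "2012" -> "2012".
   Returns 0 on success, -1 if the input is not recognised or out is too small. */
int reformat_date(const char *in, char *out, size_t outlen)
{
    int day = 0, year = 0, n = 0, w = -1;
    char mon[4] = "";
    int len = (int)strlen(in);

    /* Full precision: DD-MON-YYYY (the whole string must be consumed) */
    if (sscanf(in, "%2d-%3[A-Za-z]-%4d%n", &day, mon, &year, &n) == 3 && n == len) {
        int m = month_number(mon);
        if (m == 0 || day < 1 || day > 31)
            return -1;
        w = snprintf(out, outlen, "%04d-%02d-%02d", year, m, day);
    } else if (n = 0, sscanf(in, "%3[A-Za-z]-%4d%n", mon, &year, &n) == 2 && n == len) {
        /* Month precision: MON-YYYY */
        int m = month_number(mon);
        if (m == 0)
            return -1;
        w = snprintf(out, outlen, "%04d-%02d", year, m);
    } else if (n = 0, sscanf(in, "%4d%n", &year, &n) == 1 && n == len) {
        /* Year precision: YYYY */
        w = snprintf(out, outlen, "%04d", year);
    }
    return (w < 0 || (size_t)w >= outlen) ? -1 : 0;
}
```

Calling `reformat_date("31-DEC-2001", buf, sizeof buf)` with a 16-byte buffer leaves `2001-12-31` in `buf`. Keeping the reduced-precision forms as shorter strings, rather than padding them with a made-up day, preserves the precision of the original record.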

Usage

gbmunge [-h] -i <Genbank_file> -f <sequence_output> -o <metadata_output> [-t] [-s]
  • Genbank_file: filename of GenBank-formatted sequence file (normally downloaded as sequence.gb)
  • sequence_output: filename of FASTA output
  • metadata_output: filename of tab-separated metadata
  • -t: flag to
    • only output sequences with collection dates (of any precision)
    • name sequences as {accession}_{collection_date}
  • -s: flag to include sequences in the tab-separated metadata file

Building

git clone https://github.com/sdwfrost/gbmunge
cd gbmunge
make

This builds the gbmunge executable in the src/ directory. Add that directory to your PATH, or move the executable to a directory that is already on it.

Testing

A GenBank file of MERS Coronavirus sequences is provided in the test/ directory.

cd test
../src/gbmunge -i sequence.gb -f sequence.fas -o sequence.txt -t

Here are the first few lines of output in sequence.txt:

name	accession	length	submission_date	host	country_original	country	countrycode	collection_date
JX869059_2012-06-13	JX869059	30119	2012-12-04	Homo sapiens	NA	NA	NA	2012-06-13
KC164505_2012-09-11	KC164505	30111	2013-07-12	Homo sapiens	United Kingdom	United Kingdom	GBR	2012-09-11
KC667074_2012-09-19	KC667074	30112	2013-04-30	Homo sapiens	United Kingdom: England	United Kingdom	GBR	2012-09-19
KC776174_2012-04	KC776174	30030	2013-03-25	Homo sapiens	Jordan	Jordan	JOR	2012-04
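The relationship between the country_original, country and countrycode columns in this output can be illustrated with a short C sketch. It is an assumption about the general approach, not the gbmunge implementation: `clean_country` is a hypothetical function, and the lookup table contains only the two countries that appear in the sample output, whereas a real table would cover every country.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

struct country {
    const char *name;
    const char *iso3;
};

/* Illustrative table only: just the countries seen in the sample output. */
static const struct country countries[] = {
    {"United Kingdom", "GBR"},
    {"Jordan", "JOR"},
};

/* Hypothetical function: copy the country part of a GenBank country value
   ("United Kingdom: England" -> "United Kingdom") into name and return its
   ISO3 code. Missing, over-long or unknown values yield "NA". */
const char *clean_country(const char *raw, char *name, size_t namelen)
{
    if (namelen == 0)
        return "NA";

    /* Keep only the text before the first ':' (the region follows it). */
    size_t n = raw ? strcspn(raw, ":") : 0;

    /* Trim trailing whitespace from the country part. */
    while (n > 0 && isspace((unsigned char)raw[n - 1]))
        n--;

    /* Treat an empty value, or one that does not fit, as missing. */
    if (n == 0 || n >= namelen) {
        snprintf(name, namelen, "NA");
        return "NA";
    }

    memcpy(name, raw, n);
    name[n] = '\0';

    for (size_t i = 0; i < sizeof countries / sizeof countries[0]; i++) {
        if (strcmp(name, countries[i].name) == 0)
            return countries[i].iso3;
    }
    return "NA";
}
```

With this sketch, `clean_country("United Kingdom: England", name, sizeof name)` returns `GBR` and leaves `United Kingdom` in `name`, matching the third data row above.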

Credits

This code uses a slightly modified version of the GBParsy parser downloaded from the Google Code Archive. I modified it because I found that the parsing of the LOCUS field wasn't working properly.
