Merge pull request #376 from Ecogenomics/moving_to_ar53
Moving to ar53
aaronmussig authored Apr 7, 2022
2 parents 3734926 + 40eb08b commit ddd074a
Showing 49 changed files with 1,836 additions and 770 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/release-publish.yml
@@ -40,6 +40,9 @@ jobs:
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }}
- name: Wait for PyPI to publish package
run: sleep 60s
shell: bash
- name: Build and push
uses: docker/build-push-action@v2
with:
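The fixed `sleep 60s` above assumes PyPI finishes publishing within a minute. A hedged alternative, not part of this commit, is to poll PyPI's JSON API until the release is visible. The sketch below assumes the package is published under the name `gtdbtk`; `is_published` and `wait_for_release` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def is_published(package, version, timeout=10):
    """Return True if PyPI's JSON API already serves this exact release."""
    url = f"https://pypi.org/pypi/{package}/{version}/json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        # A 404 means the release is not visible yet; anything else is unexpected.
        if err.code == 404:
            return False
        raise


def wait_for_release(package, version, attempts=30, delay=10,
                     check=is_published, sleep=time.sleep):
    """Poll until the release is visible; return False if it never appears."""
    for attempt in range(attempts):
        if check(package, version):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    # Hypothetical usage from a workflow step; exits non-zero on timeout.
    raise SystemExit(0 if wait_for_release("gtdbtk", "2.0.1") else 1)
```

The `check` and `sleep` parameters are injectable only so the loop can be exercised without network access.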
18 changes: 14 additions & 4 deletions README.md
@@ -7,17 +7,27 @@
[![Docker Image Version (latest by date)](https://img.shields.io/docker/v/ecogenomic/gtdbtk?sort=date&color=299bec&label=docker)](https://hub.docker.com/r/ecogenomic/gtdbtk)
[![Docker Pulls](https://img.shields.io/docker/pulls/ecogenomic/gtdbtk?color=299bec&label=pulls)](https://hub.docker.com/r/ecogenomic/gtdbtk)

<b>[GTDB-Tk v1.5.0](https://ecogenomics.github.io/GTDBTk/announcements.html) was released on April 23, 2021 along with new reference data for [GTDB R06-RS202](https://gtdb.ecogenomic.org/). Upgrading is recommended.</b>
<b> Please note v1.5.0+ is not compatible with GTDB R05-RS95. </b>
<b>[GTDB-Tk v2.0.1](https://ecogenomics.github.io/GTDBTk/announcements.html) was released on April xx, 2022 along with new reference data for [GTDB R07-RS207](https://gtdb.ecogenomic.org/). Upgrading is recommended.</b>
<b> Please note v2.0.1+ is not compatible with GTDB R06-RS202. </b>

GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Database Taxonomy [GTDB](https://gtdb.ecogenomic.org/). It is designed to work with recent advances that allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples. It can also be applied to isolate and single-cell genomes. The GTDB-Tk is open source and released under the [GNU General Public License (Version 3)](https://www.gnu.org/licenses/gpl-3.0.en.html).

Notifications about GTDB-Tk releases will be available through the GTDB Twitter account (https://twitter.com/ace_gtdb).

Please post questions and issues related to GTDB-Tk on the Issues section of the GitHub repository. Questions related to the [GTDB](https://gtdb.ecogenomic.org/) should be sent to the [GTDB team](https://gtdb.ecogenomic.org/about).
Please post questions and issues related to GTDB-Tk on the Issues section of the GitHub repository. Questions related to the [GTDB](https://gtdb.ecogenomic.org/) can be posted on the [GTDB Forum](https://forum.gtdb.ecogenomic.org/) or sent to the [GTDB team](https://gtdb.ecogenomic.org/about).

## New Features

GTDB-Tk v2.0.1 includes the following new features:
- GTDB-Tk now uses a **divide-and-conquer** approach where the bacterial reference tree is split into multiple order-level subtrees. This reduces the memory requirements of GTDB-Tk from **320 GB** of RAM when using the full GTDB R07-RS207 reference tree to approximately **35 GB**. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree, use the `--full-tree` flag.
- Archaeal classification now uses a refined set of 53 archaeal-specific marker genes based on the recent publication by [Dombrowski et al., 2020](https://www.nature.com/articles/s41467-020-17408-w). This set of archaeal marker genes is now used by GTDB for curating the archaeal taxonomy.
- All directories containing intermediate results are **now removed** by default at the end of the `classify_wf` and `de_novo_wf` pipelines. If you wish to retain these intermediate files, use the `--keep-intermediates` flag.
- All MSA files produced by the `align` step are now compressed with gzip.
- The classification summary and failed genomes files are now the only files linked in the root directory of `classify_wf`.
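
Because the `align` step now writes gzip-compressed MSA files, scripts that read them need to open the files through `gzip`. A minimal sketch, assuming the MSA is FASTA-formatted; `read_msa` is a hypothetical helper and not part of GTDB-Tk:

```python
import gzip
from pathlib import Path


def read_msa(path):
    """Return {sequence_id: aligned_sequence} from a plain or gzipped FASTA MSA."""
    path = Path(path)
    # Pick the opener from the file extension; both are used in text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    chunks = {}
    seq_id = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:].split(None, 1)
                seq_id = header[0] if header else ""
                chunks[seq_id] = []
            elif seq_id is not None:
                chunks[seq_id].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}
```

Opening in text mode (`"rt"`) yields `str` lines for both compressed and uncompressed input, so no per-line decoding is needed.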


## Documentation
https://ecogenomics.github.io/GTDBTk/
Documentation for GTDB-Tk can be found [here](https://ecogenomics.github.io/GTDBTk/).

## References

13 changes: 6 additions & 7 deletions docs/requirements.txt
@@ -1,7 +1,6 @@
sphinx
sphinx-argparse
sphinx-rtd-theme
recommonmark
sphinx_rtd_theme
sphinx-sitemap
nbsphinx
sphinx ~= 4.1.0
sphinx-argparse ~= 0.2.0
sphinx-rtd-theme ~= 0.5.0
recommonmark ~= 0.7.0
sphinx-sitemap ~= 2.2.0
nbsphinx ~= 0.8.0
21 changes: 21 additions & 0 deletions docs/src/announcements.rst
@@ -1,6 +1,27 @@
Announcements
=============


GTDB R207 available
-------------------

*April xx, 2022*

* GTDB Release 207 is now available and will be used from version ``2.0.1`` and up.
* This version of GTDB-Tk requires a new version of the GTDB-Tk reference package
`gtdbtk_r207_data.tar.gz <https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/auxillary_files>`_.


GTDB R202 available
-------------------

*April 23, 2021*

* GTDB Release 202 is now available and will be used from version ``1.5.0`` and up.
* This version of GTDB-Tk requires a new version of the GTDB-Tk reference package
`gtdbtk_r202_data.tar.gz <https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/auxillary_files>`_.


GTDB R95 available
------------------

2 changes: 1 addition & 1 deletion docs/src/changelog.rst
@@ -52,7 +52,7 @@ Change log
* Check if stdout is being piped to a file before adding colour.
* (`#283 <https://github.com/Ecogenomics/GTDBTk/issues/283>`_) Significantly improved ``classify`` performance (noticeable when running trees > 1,000 taxa).
* Automatically cap pplacer CPUs to 64 unless specifying ``--pplacer_cpus`` to prevent pplacer from hanging.
* (`#262 <https://github.com/Ecogenomics/GTDBTk/issues/262>`_) Added ``--write_single_copy_genes`` to the ``identify`` command. Writes unaligned single-copy AR122/BAC120 marker genes to disk.
* (`#262 <https://github.com/Ecogenomics/GTDBTk/issues/262>`_) Added ``--write_single_copy_genes`` to the ``identify`` command. Writes unaligned single-copy AR53/BAC120 marker genes to disk.
* When running ``-version`` warn if GTDB-Tk is not running the most up-to-date version (disable via ``GTDBTK_VER_CHECK = False`` in ``config.py``). If GTDB-Tk encounters an error it will silently continue (3 second timeout).
* (`#276 <https://github.com/Ecogenomics/GTDBTk/issues/276>`_) Renamed the column ``aa_percent`` to ``msa_percent`` in ``summary.tsv`` (produced by ``classify``).
* (`#286 <https://github.com/Ecogenomics/GTDBTk/pull/286>`_) Fixed a file not found error when the reference data is a symbolic link (thanks `davidealbanese <https://github.com/davidealbanese>`_!).
2 changes: 1 addition & 1 deletion docs/src/commands/align.rst
@@ -3,7 +3,7 @@
align
=====

Create a multiple sequence alignment based on the AR122/BAC120 marker set.
Create a multiple sequence alignment based on the AR53/BAC120 marker set.


Arguments
4 changes: 2 additions & 2 deletions docs/src/commands/classify_wf.rst
@@ -8,15 +8,15 @@ Classify workflow

For arguments and output files, see each of the individual steps:

* :ref:`commands/infer`
* :ref:`commands/identify`
* :ref:`commands/align`
* :ref:`commands/classify`

The classify workflow consists of three steps: ``identify``, ``align``, and ``classify``.

The ``identify`` step calls genes using `Prodigal <http://compbio.ornl.gov/prodigal/>`_,
and uses HMM models and the `HMMER <http://hmmer.org/>`_ package to identify the
120 bacterial and 122 archaeal marker genes used for phylogenetic inference
120 bacterial and 53 archaeal marker genes used for phylogenetic inference
(`Parks et al., 2018 <https://www.ncbi.nlm.nih.gov/pubmed/30148503>`_). Multiple
sequence alignments (MSA) are obtained by aligning marker genes to their respective HMM model.
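
The three steps can also be chained by hand. The sketch below only builds the command lines. The option names (`--genome_dir`, `--identify_dir`, `--align_dir`, `--out_dir`, `--cpus`) are not shown in this diff and are assumptions to verify with `gtdbtk <command> -h`. `classify_steps` and the directory layout are hypothetical.

```python
import shlex


def classify_steps(genome_dir, out_dir, cpus=1):
    """Return the three commands that the classify workflow chains together."""
    identify_dir = f"{out_dir}/identify"
    align_dir = f"{out_dir}/align"
    classify_dir = f"{out_dir}/classify"
    cpus = str(cpus)
    return [
        # Step 1: call genes and identify marker genes.
        ["gtdbtk", "identify", "--genome_dir", genome_dir,
         "--out_dir", identify_dir, "--cpus", cpus],
        # Step 2: align the identified markers.
        ["gtdbtk", "align", "--identify_dir", identify_dir,
         "--out_dir", align_dir, "--cpus", cpus],
        # Step 3: place genomes in the reference tree and classify them.
        ["gtdbtk", "classify", "--genome_dir", genome_dir,
         "--align_dir", align_dir, "--out_dir", classify_dir, "--cpus", cpus],
    ]


if __name__ == "__main__":
    for command in classify_steps("genomes", "gtdbtk_out", cpus=8):
        print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` so that a failing step stops the chain.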

18 changes: 9 additions & 9 deletions docs/src/examples/classify_wf.ipynb
@@ -133,7 +133,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"However, it is sometimes more useful to just read the summary files which detail markers identified from either the archaeal 122, or bacterial 120 marker set."
"However, it is sometimes more useful to just read the summary files which detail markers identified from either the archaeal 53, or bacterial 120 marker set."
]
},
{
@@ -152,7 +152,7 @@
}
],
"source": [
"cat /tmp/gtdbtk/identify/gtdbtk.ar122.markers_summary.tsv "
"cat /tmp/gtdbtk/identify/gtdbtk.ar53.markers_summary.tsv "
]
},
{
@@ -201,7 +201,7 @@
"### Results\n",
"It is important to pay attention to the output, if a genome had a low number of markers identified it will be excluded from the analysis at this step. A warning will appear if that is the case.\n",
"\n",
"Depending on the domain, a prefixed file of either `ar122` or `bac120` will appear containing the MSA of the user genomes and the GTDB genomes, or just the user genomes (`gtdbtk.ar122.msa.fasta` and `gtdbtk.ar122.user_msa.fasta` respectively.)"
"Depending on the domain, a prefixed file of either `ar53` or `bac120` will appear containing the MSA of the user genomes and the GTDB genomes, or just the user genomes (`gtdbtk.ar53.msa.fasta` and `gtdbtk.ar53.user_msa.fasta` respectively.)"
]
},
{
@@ -213,9 +213,9 @@
"name": "stdout",
"output_type": "stream",
"text": [
"\u001B[0m\u001B[38;5;27malign\u001B[0m \u001B[38;5;51mgtdbtk.ar122.user_msa.fasta\u001B[0m \u001B[38;5;27midentify\u001B[0m\n",
"\u001B[38;5;51mgtdbtk.ar122.filtered.tsv\u001B[0m gtdbtk.log\n",
"\u001B[38;5;51mgtdbtk.ar122.msa.fasta\u001B[0m gtdbtk.warnings.log\n"
"\u001B[0m\u001B[38;5;27malign\u001B[0m \u001B[38;5;51mgtdbtk.ar53.user_msa.fasta\u001B[0m \u001B[38;5;27midentify\u001B[0m\n",
"\u001B[38;5;51mgtdbtk.ar53.filtered.tsv\u001B[0m gtdbtk.log\n",
"\u001B[38;5;51mgtdbtk.ar53.msa.fasta\u001B[0m gtdbtk.warnings.log\n"
]
}
],
@@ -264,7 +264,7 @@
"metadata": {},
"source": [
"### Results\n",
"The two main files output (one again, depending on their domain) are the summary file, and the reference tree containing those genomes (`gtdbtk.ar122.summary.tsv`, and `gtdbtk.ar122.classify.tree` respectively). Classification of the genomes are present in the summary file."
"The two main files output (one again, depending on their domain) are the summary file, and the reference tree containing those genomes (`gtdbtk.ar53.summary.tsv`, and `gtdbtk.ar53.classify.tree` respectively). Classification of the genomes are present in the summary file."
]
},
{
@@ -276,8 +276,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"\u001B[0m\u001B[38;5;27mclassify\u001B[0m \u001B[38;5;51mgtdbtk.ar122.summary.tsv\u001B[0m gtdbtk.warnings.log\n",
"\u001B[38;5;51mgtdbtk.ar122.classify.tree\u001B[0m gtdbtk.log\n"
"\u001B[0m\u001B[38;5;27mclassify\u001B[0m \u001B[38;5;51mgtdbtk.ar53.summary.tsv\u001B[0m gtdbtk.warnings.log\n",
"\u001B[38;5;51mgtdbtk.ar53.classify.tree\u001B[0m gtdbtk.log\n"
]
}
],
2 changes: 1 addition & 1 deletion docs/src/files/markers_summary.tsv.rst
@@ -4,7 +4,7 @@ markers_summary.tsv
===================

A summary of unique, duplicated, and missing markers within the 120 bacterial marker set,
or the 122 archaeal marker set for each submitted genome.
or the 53 archaeal marker set for each submitted genome.

For each genome:

2 changes: 1 addition & 1 deletion docs/src/files/pplacer.domain.json.rst
@@ -31,7 +31,7 @@ Example
}
], "metadata":
{"invocation":
"pplacer -m WAG -j 3 -c \/release89\/pplacer\/gtdb_r89_ar122.refpkg -o classify_output\/classify\/intermediate_results\/pplacer\/pplacer.ar122.json align_output\/align\/gtdbtk.ar122.user_msa.fasta"
"pplacer -m WAG -j 3 -c \/release89\/pplacer\/gtdb_r89_ar53.refpkg -o classify_output\/classify\/intermediate_results\/pplacer\/pplacer.ar53.json align_output\/align\/gtdbtk.ar53.user_msa.fasta"
}, "version": 3, "fields":
["distal_length", "edge_num", "like_weight_ratio", "likelihood",
"pendant_length"
2 changes: 1 addition & 1 deletion docs/src/files/pplacer.domain.out.rst
@@ -16,7 +16,7 @@ Example

.. code-block:: text
Running pplacer v1.1.alpha19-0-g807f6f3 analysis on align_output/align/gtdbtk.ar122.user_msa.fasta...
Running pplacer v1.1.alpha19-0-g807f6f3 analysis on align_output/align/gtdbtk.ar53.user_msa.fasta...
Didn't find any reference sequences in given alignment file. Using supplied reference alignment.
Pre-masking sequences... sequence length cut from 5124 to 5114.
Determining figs... figs disabled.
2 changes: 1 addition & 1 deletion docs/src/files/summary.tsv.rst
@@ -4,7 +4,7 @@
summary.tsv
===========

Classifications provided by the GTDB-Tk are in the files \<prefix>.bac120.summary.tsv and \<prefix>.ar122.summary.tsv for bacterial and archaeal genomes, respectively. These are tab separated files with the following columns:
Classifications provided by the GTDB-Tk are in the files \<prefix>.bac120.summary.tsv and \<prefix>.ar53.summary.tsv for bacterial and archaeal genomes, respectively. These are tab separated files with the following columns:

* user_genome: Unique identifier of query genome taken from the FASTA file of the genome.
* classification: GTDB taxonomy string inferred by the GTDB-Tk. An unassigned species (i.e., ``s__``) indicates that the query genome is either i) placed outside a named genus or ii) the ANI to the closest intra-genus reference genome with an AF >=0.65 is not within the species-specific ANI circumscription radius.
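
The two columns described above are enough to list genomes that were left without a species assignment. A minimal sketch, assuming the file has a header row and that the taxonomy string separates ranks with semicolons; `load_classifications` and `unassigned_species` are hypothetical helpers:

```python
import csv


def load_classifications(summary_path):
    """Map each user_genome to its classification string from a summary.tsv file."""
    with open(summary_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["user_genome"]: row["classification"] for row in reader}


def unassigned_species(classifications):
    """Return genomes whose last rank is an empty species assignment (s__)."""
    return sorted(
        genome
        for genome, taxonomy in classifications.items()
        if taxonomy.split(";")[-1].strip() == "s__"
    )
```

With the new marker-set naming, the archaeal input would be `<prefix>.ar53.summary.tsv` and the bacterial input `<prefix>.bac120.summary.tsv`.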
8 changes: 4 additions & 4 deletions docs/src/installing/index.rst
@@ -34,12 +34,12 @@ Hardware requirements
- Storage
- Time
* - Archaea
- ~13 GB
- ~27 GB
- ~34 GB
- ~65 GB
- ~1 hour / 1,000 genomes @ 64 CPUs
* - Bacteria
- ~215 GB
- ~27 GB
- ~320 GB (20 GB for divide-and-conquer)
- ~65 GB
- ~1 hour / 1,000 genomes @ 64 CPUs

.. note::
2 changes: 1 addition & 1 deletion gtdbtk/__init__.py
@@ -29,4 +29,4 @@
__status__ = 'Production'
__title__ = 'GTDB-Tk'
__url__ = 'https://github.com/Ecogenomics/GTDBTk'
__version__ = '1.7.0'
__version__ = '2.0.1'
13 changes: 8 additions & 5 deletions gtdbtk/__main__.py
@@ -48,14 +48,17 @@ def print_help():
decorate -> Decorate tree with GTDB taxonomy
Tools:
infer_ranks -> Establish taxonomic ranks of internal nodes using RED
ani_rep -> Calculates ANI to GTDB representative genomes
trim_msa -> Trim an untrimmed MSA file based on a mask
export_msa -> Export the untrimmed archaeal or bacterial MSA file
infer_ranks -> Establish taxonomic ranks of internal nodes using RED
ani_rep -> Calculates ANI to GTDB representative genomes
trim_msa -> Trim an untrimmed MSA file based on a mask
export_msa -> Export the untrimmed archaeal or bacterial MSA file
remove_labels -> Remove labels (bootstrap values, node labels) from a Newick tree
convert_to_itol -> Convert a GTDB-Tk Newick tree to an iTOL tree
Testing:
test -> Validate the classify_wf pipeline with 3 archaeal genomes
check_install -> Verify third party programs and GTDB reference package.
check_install -> Verify third party programs and GTDB reference package
Use: gtdbtk <command> -h for command specific help
''' % __version__)
6 changes: 5 additions & 1 deletion gtdbtk/biolib_lite/seq_io.py
@@ -122,15 +122,19 @@ def read_fasta_seq(fasta_file, keep_annotation=False):

try:
open_file = open
mode = 'r'
if fasta_file.endswith('.gz'):
open_file = gzip.open
mode = 'rb'

seq_id = None
annotation = None
seq = None
with open_file(fasta_file, 'r') as f:
with open_file(fasta_file, mode) as f:

for line in f.readlines():
if isinstance(line, bytes):
line = line.decode()
# skip blank lines
if not line.strip():
continue