Merge pull request #376 from Ecogenomics/moving_to_ar53

Moving to ar53
Ecogenomics · Apr 7, 2022 · ddd074a
2 parents 3734926 + 40eb08b
commit ddd074a

Showing 49 changed files with 1,836 additions and 770 deletions.
diff --git a/.github/workflows/release-publish.yml b/.github/workflows/release-publish.yml
@@ -40,6 +40,9 @@ jobs:
         with:
           username: ${{ secrets.DOCKER_USERNAME }}
           password: ${{ secrets.DOCKER_PASSWORD }}
+      - name: Wait for PyPI to publish package
+        run: sleep 60s
+        shell: bash
       - name: Build and push
         uses: docker/build-push-action@v2
         with:

diff --git a/README.md b/README.md
@@ -7,17 +7,27 @@
 [![Docker Image Version (latest by date)](https://img.shields.io/docker/v/ecogenomic/gtdbtk?sort=date&color=299bec&label=docker)](https://hub.docker.com/r/ecogenomic/gtdbtk)
 [![Docker Pulls](https://img.shields.io/docker/pulls/ecogenomic/gtdbtk?color=299bec&label=pulls)](https://hub.docker.com/r/ecogenomic/gtdbtk)
 
-<b>[GTDB-Tk v1.5.0](https://ecogenomics.github.io/GTDBTk/announcements.html) was released on April 23, 2021 along with new reference data for [GTDB R06-RS202](https://gtdb.ecogenomic.org/). Upgrading is recommended.</b>  
-<b> Please note v1.5.0+ is not compatible with GTDB R05-RS95. </b>
+<b>[GTDB-Tk v2.0.1](https://ecogenomics.github.io/GTDBTk/announcements.html) was released on April xx, 2022 along with new reference data for [GTDB R07-RS207](https://gtdb.ecogenomic.org/). Upgrading is recommended.</b>  
+<b> Please note v2.0.1+ is not compatible with GTDB R06-RS202. </b>
 
 GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Database Taxonomy [GTDB](https://gtdb.ecogenomic.org/). It is designed to work with recent advances that allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples. It can also be applied to isolate and single-cell genomes. The GTDB-Tk is open source and released under the [GNU General Public License (Version 3)](https://www.gnu.org/licenses/gpl-3.0.en.html).
 
 Notifications about GTDB-Tk releases will be available through the GTDB Twitter account (https://twitter.com/ace_gtdb).
 
-Please post questions and issues related to GTDB-Tk on the Issues section of the GitHub repository. Questions related to the [GTDB](https://gtdb.ecogenomic.org/) should be sent to the [GTDB team](https://gtdb.ecogenomic.org/about). 
+Please post questions and issues related to GTDB-Tk on the Issues section of the GitHub repository. Questions related to the [GTDB](https://gtdb.ecogenomic.org/) can be posted on the [GTDB Forum](https://forum.gtdb.ecogenomic.org/) or sent to the [GTDB team](https://gtdb.ecogenomic.org/about).
+
+## New Features
+
+GTDB-Tk v2.0.1 includes the following new features:
+- GTDB-Tk now uses a **divide-and-conquer** approach where the bacterial reference tree is split into multiple order-level subtrees. This reduces the memory requirements of GTDB-Tk from **320 GB** of RAM when using the full GTDB R07-RS207 reference tree to approximately **35 GB**. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree, use the `--full-tree` flag.
+- Archaeal classification now uses a refined set of 53 archaeal-specific marker genes based on the recent publication by [Dombrowski et al., 2020](https://www.nature.com/articles/s41467-020-17408-w). This set of archaeal marker genes is now used by GTDB for curating the archaeal taxonomy.
+- All directories containing intermediate results are **now removed** by default at the end of the `classify_wf` and `de_novo_wf` pipelines. If you wish to retain these intermediate files, use the `--keep-intermediates` flag.
+- All MSA files produced by the `align` step are now compressed with gzip.
+- The classification summary and failed genomes files are now the only files linked in the root directory of `classify_wf`.
+
 
 ## Documentation
-https://ecogenomics.github.io/GTDBTk/
+Documentation for GTDB-Tk can be found [here](https://ecogenomics.github.io/GTDBTk/).
 
 ## References
 

diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -1,7 +1,6 @@
-sphinx
-sphinx-argparse
-sphinx-rtd-theme
-recommonmark
-sphinx_rtd_theme
-sphinx-sitemap
-nbsphinx
+sphinx ~= 4.1.0
+sphinx-argparse ~= 0.2.0
+sphinx-rtd-theme ~= 0.5.0
+recommonmark ~= 0.7.0
+sphinx-sitemap ~= 2.2.0
+nbsphinx ~= 0.8.0
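
Note on the `~=` pins above: under PEP 440, `~= 4.1.0` is a "compatible release" clause, equivalent to `>= 4.1.0, == 4.1.*`. The sketch below illustrates that rule for plain numeric versions only. The helper names `parse` and `compatible` are hypothetical and are not part of pip or of this repository; real tooling should rely on pip or the `packaging` library instead.

```python
def parse(version):
    """Split a purely numeric version such as '4.1.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split('.'))


def compatible(candidate, spec):
    """Return True if `candidate` satisfies `~= spec` (numeric versions only).

    Simplified illustration of the PEP 440 compatible-release rule:
    the candidate must be >= spec and share every component of spec
    except the last one.
    """
    base = parse(spec)
    if len(base) < 2:
        raise ValueError('~= needs at least two version components')
    cand = parse(candidate)
    # Pad the candidate with zeros so that '4.1' compares equal to '4.1.0'.
    cand = cand + (0,) * (len(base) - len(cand))
    prefix = base[:-1]
    return cand >= base and cand[:len(prefix)] == prefix


if __name__ == '__main__':
    for version in ('4.1.0', '4.1.2', '4.2.0', '4.0.9'):
        print(version, compatible(version, '4.1.0'))
```

With these pins, a docs build may pick up patch releases (for example a later `4.1.x` of Sphinx) but not a new minor series.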
diff --git a/docs/src/announcements.rst b/docs/src/announcements.rst
@@ -1,6 +1,27 @@
 Announcements
 =============
 
+
+GTDB R207 available
+-------------------
+
+*April xx, 2022*
+
+* GTDB Release 207 is now available and will be used from version ``2.0.1`` and up.
+* This version of GTDB-Tk requires a new version of the GTDB-Tk reference package
+  `gtdbtk_r207_data.tar.gz <https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/auxillary_files>`_.
+
+
+GTDB R202 available
+-------------------
+
+*April 23, 2021*
+
+* GTDB Release 202 is now available and will be used from version ``1.5.0`` and up.
+* This version of GTDB-Tk requires a new version of the GTDB-Tk reference package
+  `gtdbtk_r202_data.tar.gz <https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/auxillary_files>`_.
+
+
 GTDB R95 available
 ------------------
 

diff --git a/docs/src/changelog.rst b/docs/src/changelog.rst
@@ -52,7 +52,7 @@ Change log
 * Check if stdout is being piped to a file before adding colour.
 * (`#283 <https://github.com/Ecogenomics/GTDBTk/issues/283>`_) Significantly improved ``classify`` performance (noticeable when running trees > 1,000 taxa).
 * Automatically cap pplacer CPUs to 64 unless specifying ``--pplacer_cpus`` to prevent pplacer from hanging.
-* (`#262 <https://github.com/Ecogenomics/GTDBTk/issues/262>`_) Added ``--write_single_copy_genes`` to the ``identify`` command. Writes unaligned single-copy AR122/BAC120 marker genes to disk.
+* (`#262 <https://github.com/Ecogenomics/GTDBTk/issues/262>`_) Added ``--write_single_copy_genes`` to the ``identify`` command. Writes unaligned single-copy AR53/BAC120 marker genes to disk.
 * When running ``-version`` warn if GTDB-Tk is not running the most up-to-date version (disable via ``GTDBTK_VER_CHECK = False`` in ``config.py``). If GTDB-Tk encounters an error it will silently continue (3 second timeout).
 * (`#276 <https://github.com/Ecogenomics/GTDBTk/issues/276>`_) Renamed the column ``aa_percent`` to ``msa_percent`` in ``summary.tsv`` (produced by ``classify``).
 * (`#286 <https://github.com/Ecogenomics/GTDBTk/pull/286>`_) Fixed a file not found error when the reference data is a symbolic link (thanks `davidealbanese <https://github.com/davidealbanese>`_!).

diff --git a/docs/src/commands/align.rst b/docs/src/commands/align.rst
@@ -3,7 +3,7 @@
 align
 =====
 
-Create a multiple sequence alignment based on the AR122/BAC120 marker set.
+Create a multiple sequence alignment based on the AR53/BAC120 marker set.
 
 
 Arguments

diff --git a/docs/src/commands/classify_wf.rst b/docs/src/commands/classify_wf.rst
@@ -8,15 +8,15 @@ Classify workflow
 
 For arguments and output files, see each of the individual steps:
 
-* :ref:`commands/infer`
+* :ref:`commands/identify`
 * :ref:`commands/align`
 * :ref:`commands/classify`
 
 The classify workflow consists of three steps: ``identify``, ``align``, and ``classify``.
 
 The ``identify`` step calls genes using `Prodigal <http://compbio.ornl.gov/prodigal/>`_,
 and uses HMM models and the `HMMER <http://hmmer.org/>`_ package to identify the
-120 bacterial and 122 archaeal marker genes used for phylogenetic inference
+120 bacterial and 53 archaeal marker genes used for phylogenetic inference
 (`Parks et al., 2018 <https://www.ncbi.nlm.nih.gov/pubmed/30148503>`_). Multiple
 sequence alignments (MSA) are obtained by aligning marker genes to their respective HMM model.
 

diff --git a/docs/src/examples/classify_wf.ipynb b/docs/src/examples/classify_wf.ipynb
@@ -133,7 +133,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "However, it is sometimes more useful to just read the summary files which detail markers identified from either the archaeal 122, or bacterial 120 marker set."
+    "However, it is sometimes more useful to just read the summary files which detail markers identified from either the archaeal 53, or bacterial 120 marker set."
    ]
   },
   {
@@ -152,7 +152,7 @@
     }
    ],
    "source": [
-    "cat /tmp/gtdbtk/identify/gtdbtk.ar122.markers_summary.tsv "
+    "cat /tmp/gtdbtk/identify/gtdbtk.ar53.markers_summary.tsv "
    ]
   },
   {
@@ -201,7 +201,7 @@
     "### Results\n",
     "It is important to pay attention to the output, if a genome had a low number of markers identified it will be excluded from the analysis at this step. A warning will appear if that is the case.\n",
     "\n",
-    "Depending on the domain, a prefixed file of either `ar122` or `bac120` will appear containing the MSA of the user genomes and the GTDB genomes, or just the user genomes (`gtdbtk.ar122.msa.fasta` and `gtdbtk.ar122.user_msa.fasta` respectively.)"
+    "Depending on the domain, a prefixed file of either `ar53` or `bac120` will appear containing the MSA of the user genomes and the GTDB genomes, or just the user genomes (`gtdbtk.ar53.msa.fasta` and `gtdbtk.ar53.user_msa.fasta` respectively.)"
    ]
   },
   {
@@ -213,9 +213,9 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "\u001B[0m\u001B[38;5;27malign\u001B[0m                      \u001B[38;5;51mgtdbtk.ar122.user_msa.fasta\u001B[0m  \u001B[38;5;27midentify\u001B[0m\n",
-      "\u001B[38;5;51mgtdbtk.ar122.filtered.tsv\u001B[0m  gtdbtk.log\n",
-      "\u001B[38;5;51mgtdbtk.ar122.msa.fasta\u001B[0m     gtdbtk.warnings.log\n"
+      "\u001B[0m\u001B[38;5;27malign\u001B[0m                      \u001B[38;5;51mgtdbtk.ar53.user_msa.fasta\u001B[0m  \u001B[38;5;27midentify\u001B[0m\n",
+      "\u001B[38;5;51mgtdbtk.ar53.filtered.tsv\u001B[0m  gtdbtk.log\n",
+      "\u001B[38;5;51mgtdbtk.ar53.msa.fasta\u001B[0m     gtdbtk.warnings.log\n"
      ]
     }
    ],
@@ -264,7 +264,7 @@
    "metadata": {},
    "source": [
     "### Results\n",
-    "The two main files output (one again, depending on their domain) are the summary file, and the reference tree containing those genomes (`gtdbtk.ar122.summary.tsv`, and `gtdbtk.ar122.classify.tree` respectively). Classification of the genomes are present in the summary file."
+    "The two main files output (one again, depending on their domain) are the summary file, and the reference tree containing those genomes (`gtdbtk.ar53.summary.tsv`, and `gtdbtk.ar53.classify.tree` respectively). Classification of the genomes are present in the summary file."
    ]
   },
   {
@@ -276,8 +276,8 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "\u001B[0m\u001B[38;5;27mclassify\u001B[0m                    \u001B[38;5;51mgtdbtk.ar122.summary.tsv\u001B[0m  gtdbtk.warnings.log\n",
-      "\u001B[38;5;51mgtdbtk.ar122.classify.tree\u001B[0m  gtdbtk.log\n"
+      "\u001B[0m\u001B[38;5;27mclassify\u001B[0m                    \u001B[38;5;51mgtdbtk.ar53.summary.tsv\u001B[0m  gtdbtk.warnings.log\n",
+      "\u001B[38;5;51mgtdbtk.ar53.classify.tree\u001B[0m  gtdbtk.log\n"
      ]
     }
    ],

diff --git a/docs/src/files/markers_summary.tsv.rst b/docs/src/files/markers_summary.tsv.rst
@@ -4,7 +4,7 @@ markers_summary.tsv
 ===================
 
 A summary of unique, duplicated, and missing markers within the 120 bacterial marker set, 
-or the 122 archaeal marker set for each submitted genome.
+or the 53 archaeal marker set for each submitted genome.
 
 For each genome:
 

diff --git a/docs/src/files/pplacer.domain.json.rst b/docs/src/files/pplacer.domain.json.rst
@@ -31,7 +31,7 @@ Example
         }
       ], "metadata":
       {"invocation":
-        "pplacer -m WAG -j 3 -c \/release89\/pplacer\/gtdb_r89_ar122.refpkg -o classify_output\/classify\/intermediate_results\/pplacer\/pplacer.ar122.json align_output\/align\/gtdbtk.ar122.user_msa.fasta"
+        "pplacer -m WAG -j 3 -c \/release89\/pplacer\/gtdb_r89_ar53.refpkg -o classify_output\/classify\/intermediate_results\/pplacer\/pplacer.ar53.json align_output\/align\/gtdbtk.ar53.user_msa.fasta"
       }, "version": 3, "fields":
       ["distal_length", "edge_num", "like_weight_ratio", "likelihood",
         "pendant_length"

diff --git a/docs/src/files/pplacer.domain.out.rst b/docs/src/files/pplacer.domain.out.rst
@@ -16,7 +16,7 @@ Example
 
 .. code-block:: text
     
-    Running pplacer v1.1.alpha19-0-g807f6f3 analysis on align_output/align/gtdbtk.ar122.user_msa.fasta...
+    Running pplacer v1.1.alpha19-0-g807f6f3 analysis on align_output/align/gtdbtk.ar53.user_msa.fasta...
     Didn't find any reference sequences in given alignment file. Using supplied reference alignment.
     Pre-masking sequences... sequence length cut from 5124 to 5114.
     Determining figs... figs disabled.

diff --git a/docs/src/files/summary.tsv.rst b/docs/src/files/summary.tsv.rst
@@ -4,7 +4,7 @@
 summary.tsv
 ===========
 
-Classifications provided by the GTDB-Tk are in the files \<prefix>.bac120.summary.tsv and \<prefix>.ar122.summary.tsv for bacterial and archaeal genomes, respectively. These are tab separated files with the following columns:
+Classifications provided by the GTDB-Tk are in the files \<prefix>.bac120.summary.tsv and \<prefix>.ar53.summary.tsv for bacterial and archaeal genomes, respectively. These are tab separated files with the following columns:
 
 * user_genome: Unique identifier of query genome taken from the FASTA file of the genome.
 * classification: GTDB taxonomy string inferred by the GTDB-Tk. An unassigned species (i.e., ``s__``) indicates that the query genome is either i) placed outside a named genus or ii) the ANI to the closest intra-genus reference genome with an AF >=0.65 is not within the species-specific ANI circumscription radius.

diff --git a/docs/src/installing/index.rst b/docs/src/installing/index.rst
@@ -34,12 +34,12 @@ Hardware requirements
      - Storage
      - Time
    * - Archaea
-     - ~13 GB
-     - ~27 GB
+     - ~34 GB
+     - ~65 GB
      - ~1 hour / 1,000 genomes @ 64 CPUs
    * - Bacteria
-     - ~215 GB
-     - ~27 GB
+     - ~320 GB (20 GB for divide-and-conquer)
+     - ~65 GB
      - ~1 hour / 1,000 genomes @ 64 CPUs
 
 .. note::

diff --git a/gtdbtk/__init__.py b/gtdbtk/__init__.py
@@ -29,4 +29,4 @@
 __status__ = 'Production'
 __title__ = 'GTDB-Tk'
 __url__ = 'https://github.com/Ecogenomics/GTDBTk'
-__version__ = '1.7.0'
+__version__ = '2.0.1'
diff --git a/gtdbtk/__main__.py b/gtdbtk/__main__.py
@@ -48,14 +48,17 @@ def print_help():
     decorate -> Decorate tree with GTDB taxonomy
 
   Tools:
-    infer_ranks -> Establish taxonomic ranks of internal nodes using RED
-    ani_rep     -> Calculates ANI to GTDB representative genomes
-    trim_msa    -> Trim an untrimmed MSA file based on a mask
-    export_msa  -> Export the untrimmed archaeal or bacterial MSA file
+    infer_ranks     -> Establish taxonomic ranks of internal nodes using RED
+    ani_rep         -> Calculates ANI to GTDB representative genomes
+    trim_msa        -> Trim an untrimmed MSA file based on a mask
+    export_msa      -> Export the untrimmed archaeal or bacterial MSA file
+    remove_labels   -> Remove labels (bootstrap values, node labels) from a Newick tree
+    convert_to_itol -> Convert a GTDB-Tk Newick tree to an iTOL tree
+ 
 
   Testing:
     test          -> Validate the classify_wf pipeline with 3 archaeal genomes 
-    check_install -> Verify third party programs and GTDB reference package.
+    check_install -> Verify third party programs and GTDB reference package
 
   Use: gtdbtk <command> -h for command specific help
     ''' % __version__)

diff --git a/gtdbtk/biolib_lite/seq_io.py b/gtdbtk/biolib_lite/seq_io.py
@@ -122,15 +122,19 @@ def read_fasta_seq(fasta_file, keep_annotation=False):
 
     try:
         open_file = open
+        mode = 'r'
         if fasta_file.endswith('.gz'):
             open_file = gzip.open
+            mode = 'rb'
 
         seq_id = None
         annotation = None
         seq = None
-        with open_file(fasta_file, 'r') as f:
+        with open_file(fasta_file, mode) as f:
 
             for line in f.readlines():
+                if isinstance(line, bytes):
+                    line = line.decode()
                 # skip blank lines
                 if not line.strip():
                     continue
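
Note on the `seq_io.py` hunk above: the change opens gzip files in binary mode (`'rb'`) and decodes each line by hand. A possible alternative, shown here only as a minimal sketch and not as the project's implementation, is to open both plain and gzip files in text mode (`'rt'`), which `gzip.open` supports, so every line arrives as `str`. The functions `iter_fasta_lines` and `read_fasta` are hypothetical names used for illustration.

```python
import gzip


def iter_fasta_lines(path):
    """Yield text lines from a plain or gzip-compressed file."""
    opener = gzip.open if path.endswith('.gz') else open
    # 'rt' asks gzip.open to decode bytes to str, so no per-line decode is needed.
    with opener(path, 'rt') as handle:
        for line in handle:
            yield line


def read_fasta(path):
    """Return {seq_id: sequence} for a FASTA file (hypothetical helper)."""
    chunks = {}
    seq_id = None
    for line in iter_fasta_lines(path):
        line = line.strip()
        if not line:
            # Skip blank lines, as the original function does.
            continue
        if line.startswith('>'):
            # The identifier is the first whitespace-delimited token.
            seq_id = line[1:].split(None, 1)[0]
            chunks[seq_id] = []
        elif seq_id is not None:
            chunks[seq_id].append(line)
    return {key: ''.join(parts) for key, parts in chunks.items()}
```

Iterating over the handle instead of calling `readlines()` also avoids loading the whole file into memory, which may matter for large gzip-compressed MSA files.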