Merge branch 'staging' into issue540
# Conflicts:
#	gtdbtk/ani_screen.py
#	gtdbtk/main.py
#	gtdbtk/markers.py
pchaumeil committed Apr 24, 2024
2 parents 5dc36e7 + a4e59a2 commit b1096b8
Showing 37 changed files with 1,061 additions and 508 deletions.
22 changes: 16 additions & 6 deletions Dockerfile
@@ -1,5 +1,6 @@
# How to build and deploy the Docker image:
# docker build --build-arg VER=1.2.3 --no-cache -t ecogenomic/gtdbtk:latest -t ecogenomic/gtdbtk:1.2.3 .
# docker run -v /host/gtdbtk_io:/data -v /host/release_data:/refdata ecogenomic/gtdbtk classify_wf --genome_dir /data/genomes --out_dir /data/output
# docker push ecogenomic/gtdbtk:latest && sudo docker push ecogenomic/gtdbtk:1.2.3

FROM python:3.8-slim-bullseye
@@ -15,14 +16,19 @@ RUN apt-get update -y -m && \
libgomp1 \
libgsl25 \
libgslcblas0 \
build-essential \
curl \
hmmer=3.* \
mash=2.2.* \
prodigal=1:2.6.* \
fasttree=2.1.* \
unzip && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
ln -s /usr/bin/fasttreeMP /usr/bin/FastTreeMP
ln -s /usr/bin/fasttreeMP /usr/bin/FastTreeMP && \
curl https://sh.rustup.rs -sSf | sh -s -- -y

ENV PATH="/root/.cargo/bin:${PATH}"

# ---------------------------------------------------------------------------- #
# ----------------------------- INSTALL PPLACER ------------------------------ #
@@ -34,11 +40,15 @@ RUN wget https://github.com/matsen/pplacer/releases/download/v1.1.alpha19/pplace
rm -rf pplacer-Linux-v1.1.alpha19

# ---------------------------------------------------------------------------- #
# ----------------------------- INSTALL FASTANI ------------------------------ #
# ------------------------------ INSTALL SKANI ------------------------------- #
# ---------------------------------------------------------------------------- #
RUN wget https://github.com/ParBLiSS/FastANI/releases/download/v1.32/fastANI-Linux64-v1.32.zip -q && \
unzip fastANI-Linux64-v1.32.zip -d /usr/bin && \
rm fastANI-Linux64-v1.32.zip

RUN wget https://github.com/bluenote-1577/skani/archive/refs/tags/v0.2.1.tar.gz
RUN tar -xvf v0.2.1.tar.gz
RUN cd skani-0.2.1 && cargo install --path . --root /usr
RUN chmod +x /usr/bin/skani
RUN cd ../
RUN rm -rf v0.2.1.tar.gz skani-0.2.1

# ---------------------------------------------------------------------------- #
# --------------------- SET GTDB-TK MOUNTED DIRECTORIES ---------------------- #
@@ -51,7 +61,7 @@ ENV GTDBTK_DATA_PATH="/refdata/"
# --------------------------- INSTALL PIP PACKAGES --------------------------- #
# ---------------------------------------------------------------------------- #
RUN python -m pip install --upgrade pip && \
python -m pip install gtdbtk==${VER}
python -m pip install gtdbtk==${VER} \

# ---------------------------------------------------------------------------- #
# ---------------------------- SET THE ENTRYPOINT ---------------------------- #
5 changes: 3 additions & 2 deletions README.md
@@ -37,8 +37,8 @@ Documentation for GTDB-Tk can be found [here](https://ecogenomics.github.io/GTDB

## ✨ New Features

GTDB-Tk v2.3.0+ includes the following new features:
- New functionality ``convert_to_species`` function to convert GTDB genome IDs to GTDB species names
GTDB-Tk v2.4.0+ includes the following new features:
- `FastANI` has been replaced by `skani` as the primary tool for computing Average Nucleotide Identity (ANI). Users may notice slight variations in the results compared to those obtained using `FastANI`.


## 📈 Performance
@@ -63,6 +63,7 @@ We strongly encourage you to cite the following 3rd party dependencies:

* Matsen FA, et al. 2010. [pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree](https://www.ncbi.nlm.nih.gov/pubmed/21034504). <i>BMC Bioinformatics</i>, 11:538.
* Jain C, et al. 2019. [High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries](https://www.nature.com/articles/s41467-018-07641-9). <i>Nat. Communications</i>, doi: 10.1038/s41467-018-07641-9.
* Shaw J, Yu YW. 2023. [Fast and robust metagenomic sequence comparison through sparse chaining with skani](https://www.nature.com/articles/s41592-023-02018-3). <i>Nature Methods</i>, 20:1661–1665.
* Hyatt D, et al. 2010. [Prodigal: prokaryotic gene recognition and translation initiation site identification](https://www.ncbi.nlm.nih.gov/pubmed/20211023). <i>BMC Bioinformatics</i>, 11:119. doi: 10.1186/1471-2105-11-119.
* Price MN, et al. 2010. [FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/). <i>PLoS One</i>, 5, e9490.
* Eddy SR. 2011. [Accelerated profile HMM searches](https://www.ncbi.nlm.nih.gov/pubmed/22039361). <i>PLOS Comp. Biol.</i>, 7:e1002195.
11 changes: 11 additions & 0 deletions docs/src/announcements.rst
@@ -1,6 +1,17 @@
Announcements
=============

GTDB-Tk 2.4.0 available
-----------------------

*April 24, 2024*

* GTDB-Tk version ``2.4.0`` is now available.
* This version of GTDB-Tk requires a new version of the GTDB-Tk reference package (Release 220).
`gtdbtk_r220_data.tar.gz <https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/>`_.



GTDB-Tk 2.3.0 available
-----------------------

30 changes: 30 additions & 0 deletions docs/src/changelog.rst
@@ -2,6 +2,36 @@
Change log
==========


2.4.0
-----

Bug Fixes:

* (`#576 <https://github.com/Ecogenomics/GTDBTk/issues/576>`_) When all genomes fail the prodigal step in classify_wf, the
bac120 summary file is still produced, with all failed genomes listed as 'Unclassified'.
* (`#573 <https://github.com/Ecogenomics/GTDBTk/issues/573>`_) When running the 3 classify steps independently, a genome can be filtered out in the align
step but still be classified in the identify step. To avoid duplicate rows, the genome is classified with a warning.
* (`#540 <https://github.com/Ecogenomics/GTDBTk/issues/540>`_) Empty files are skipped during the sketch step of Mash;
they are then caught in the prodigal step and returned as 'Unclassified'.
* (`#549 <https://github.com/Ecogenomics/GTDBTk/issues/549>`_) `--force` has been modified to deal with #540. Prodigal
wasn't returning the empty files as failed genomes; it was only skipping them. These genomes are now returned in the summary file and flagged as 'Unclassified'.

Major Changes:

* FastANI has been replaced by skani as the primary tool for computing Average Nucleotide Identity (ANI). Users may notice slight variations in the results compared to those obtained using FastANI.
* In the generated `summary.tsv` files, several columns have been renamed for clarity and consistency. The following columns have been affected:

- "`fastani_reference`" column has been renamed to "`closest_genome_reference`".
- "`fastani_reference_radius`" column has been renamed to "`closest_genome_reference_radius`".
- "`fastani_taxonomy`" column has been renamed to "`closest_genome_taxonomy`".
- "`fastani_ani`" column has been renamed to "`closest_genome_ani`".
- "`fastani_af`" column has been renamed to "`closest_genome_af`".

These changes have been implemented to improve the readability and understanding of the data within the `summary.tsv` files. Users should update their scripts or processes accordingly to reflect these renamed column headers.
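
For scripts that still expect the pre-2.4.0 headers, the renames listed in this change log entry can be applied mechanically. The following is a minimal sketch, not part of GTDB-Tk: the function names `upgrade_header` and `upgrade_summary` are hypothetical, and it assumes only that the summary file is tab-separated with a single header row. Only the five column names are taken from the change log.

```python
import csv
import io

# Old -> new column names, copied from the 2.4.0 change log entry.
RENAMED_COLUMNS = {
    "fastani_reference": "closest_genome_reference",
    "fastani_reference_radius": "closest_genome_reference_radius",
    "fastani_taxonomy": "closest_genome_taxonomy",
    "fastani_ani": "closest_genome_ani",
    "fastani_af": "closest_genome_af",
}


def upgrade_header(columns):
    """Return the header with pre-2.4.0 names replaced by their 2.4.0 names."""
    return [RENAMED_COLUMNS.get(name, name) for name in columns]


def upgrade_summary(src, dst):
    """Copy a tab-separated summary from src to dst, rewriting only the header row.

    Hypothetical helper: src and dst are open text file objects.
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to write
    writer.writerow(upgrade_header(header))
    writer.writerows(reader)  # data rows are passed through unchanged


# Illustrative in-memory input; the row values are made up.
old = io.StringIO("user_genome\tfastani_reference\tfastani_ani\ng1\tGCF_1\t97.5\n")
new = io.StringIO()
upgrade_summary(old, new)
print(new.getvalue(), end="")
```

With real files, open both with `newline=""`, as the `csv` module documentation recommends. The reverse mapping (new names back to old) would let an unmodified legacy script read 2.4.0 output, at the cost of hiding the rename.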



2.3.2
-----

29 changes: 15 additions & 14 deletions docs/src/commands/ani_rep.rst
@@ -49,20 +49,21 @@ Output

.. code-block:: text
[2020-04-13 10:51:58] INFO: GTDB-Tk v1.1.0
[2020-04-13 10:51:58] INFO: gtdbtk ani_rep --genome_dir genomes/ --out_dir ani_rep/ --cpus 70
[2020-04-13 10:51:58] INFO: Using GTDB-Tk reference data version r89: /release89
[2020-04-13 10:51:59] INFO: Using Mash version 2.2.2
[2020-04-13 10:51:59] INFO: Creating Mash sketch file: ani_rep/intermediate_results/mash/gtdbtk.user_query_sketch.msh
==> Sketching 3 of 3 (100.0%) genomes
[2020-04-13 10:51:59] INFO: Creating Mash sketch file: ani_rep/intermediate_results/mash/gtdbtk.gtdb_ref_sketch.msh
==> Sketching 24706 of 24706 (100.0%) genomes
[2020-04-13 10:53:13] INFO: Calculating Mash distances.
[2020-04-13 10:53:14] INFO: Calculating ANI with FastANI.
==> Processing 874 of 874 (100.0%) comparisons.
[2020-04-13 10:53:23] INFO: Summary of results saved to: ani_rep/gtdbtk.ani_summary.tsv
[2020-04-13 10:53:23] INFO: Closest representative hits saved to: ani_rep/gtdbtk.ani_closest.tsv
[2020-04-13 10:53:23] INFO: Done.
[2024-03-27 16:43:25] INFO: GTDB-Tk v2.3.2
[2024-03-27 16:43:25] INFO: gtdbtk ani_rep --batchfile genomes/500_batchfile.tsv --out_dir user_vs_reps --cpus 90
[2024-03-27 16:43:25] INFO: Using GTDB-Tk reference data version r214: /srv/db/gtdbtk/official/release214_skani/release214
[2024-03-27 16:43:25] INFO: Loading reference genomes.
[2024-03-27 16:43:25] INFO: Using Mash version 2.2.2
[2024-03-27 16:43:25] INFO: Creating Mash sketch file: user_vs_reps/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2024-03-27 16:43:27] INFO: Completed 500 genomes in 1.42 seconds (351.61 genomes/second).
[2024-03-27 16:43:27] INFO: Creating Mash sketch file: user_vs_reps/intermediate_results/mash/gtdbtk.gtdb_ref_sketch.msh
[2024-03-27 16:46:55] INFO: Completed 85,205 genomes in 3.47 minutes (24,519.48 genomes/minute).
[2024-03-27 16:46:55] INFO: Calculating Mash distances.
[2024-03-27 16:47:37] INFO: Calculating ANI with skani v0.2.1.
[2024-03-27 16:47:45] INFO: Completed 4,383 comparisons in 7.68 seconds (570.58 comparisons/second).
[2024-03-27 16:47:46] INFO: Summary of results saved to: user_vs_reps/gtdbtk.ani_summary.tsv
[2024-03-27 16:47:46] INFO: Closest representative hits saved to: user_vs_reps/gtdbtk.ani_closest.tsv
[2024-03-27 16:47:46] INFO: Done.
4 changes: 2 additions & 2 deletions docs/src/commands/check_install.rst
@@ -42,7 +42,7 @@ Output
[2020-11-04 09:35:16] INFO: Checking that all third-party software are on the system path:
[2020-11-04 09:35:16] INFO: |-- FastTree OK
[2020-11-04 09:35:16] INFO: |-- FastTreeMP OK
[2020-11-04 09:35:16] INFO: |-- fastANI OK
[2020-11-04 09:35:16] INFO: |-- skani OK
[2020-11-04 09:35:16] INFO: |-- guppy OK
[2020-11-04 09:35:16] INFO: |-- hmmalign OK
[2020-11-04 09:35:16] INFO: |-- hmmsearch OK
@@ -57,6 +57,6 @@ Output
[2020-11-04 09:35:20] INFO: |-- msa OK
[2020-11-04 09:35:20] INFO: |-- metadata OK
[2020-11-04 09:35:20] INFO: |-- taxonomy OK
[2020-11-04 09:47:36] INFO: |-- fastani OK
[2020-11-04 09:47:36] INFO: |-- skani OK
[2020-11-04 09:47:36] INFO: |-- mrca_red OK
[2020-11-04 09:47:36] INFO: Done.
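
Because `skani` replaces `fastANI` in the list of required executables, an environment set up for an earlier release may no longer pass this check. A minimal sketch of a comparable PATH check is shown here. It is an illustration only and not GTDB-Tk's own implementation: the helper name `missing_executables` is hypothetical, and the executable names are the ones visible in the log excerpt above.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on the system PATH."""
    return [name for name in names if shutil.which(name) is None]


# Executable names taken from the check_install log excerpt above.
required = ["FastTree", "FastTreeMP", "skani", "guppy", "hmmalign", "hmmsearch"]

missing = missing_executables(required)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All listed executables were found on PATH.")
```

`shutil.which` only confirms that a file with that name is executable and reachable through PATH; it does not verify the tool's version.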