novoalab/SeqTagger: Super-fast and accurate demultiplexing of direct RNA-seq runs.

About SeqTagger

What is SeqTagger?

SeqTagger is a super-fast and accurate demultiplexing algorithm for direct RNA nanopore sequencing datasets. It supports both RNA002 and RNA004 kits, and both fast5 and pod5 files.

How does SeqTagger work?

The workflow follows the standard direct RNA sequencing library preparation protocol, except that the default RT adapters are exchanged for barcode-containing RT adapters. SeqTagger then basecalls the DNA barcode from the direct RNA sequencing data using custom basecalling models. Finally, the basecalled barcodes are aligned against the reference sequences for all barcodes, and low-confidence predictions are removed in a filtering step.

How many barcodes are supported?

Currently, SeqTagger supports the following models and barcodes:

Chemistry	Number of barcodes	SeqTagger Model	Barcode Sequences
RNA002	4	b04_RNA002	b04_RNA002_barcodes
RNA002	96	b96_RNA002	b96_RNA002_barcodes
RNA004	4	b04_RNA004	b04_RNA004_barcodes

Please note: These models do not work well on Nano-tRNAseq libraries. You can find pre-trained tRNA demultiplexing models here.

Does it work on all RNA types?

Yes, as long as the RNA molecule has a poly(A) tail (e.g. mRNAs, lncRNAs, etc.) or you have in vitro polyadenylated the sample prior to sequencing.

Please note: Nano-tRNAseq libraries do not have standard poly(A) RNA tails, and thus should not be used with the models listed above. You can find SeqTagger Dockerfiles with pre-trained tRNA demultiplexing models here (also available for RNA004 chemistries).

Running SeqTagger

Download test data for both RNA002 and RNA004:

mkdir -p seqtagger; cd seqtagger
wget https://public-docs.crg.es/enovoa/public/seqtagger/test_data/ \
  -q --show-progress -r -c -nc -np -nH --cut-dirs=3 --reject="index.html*"

It's handy to define an alias prior to using seqtagger:

alias seqtagger="docker run --gpus all -u $UID:$GID -v `pwd`:/data lpryszcz/seqtagger"

This binds the directory you are in when you define the alias to /data in the Docker container, so input and output paths passed to seqtagger are given under /data.
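Because the alias is written in double quotes, `pwd` and the user and group IDs are expanded once, at the moment the alias is defined. In bash, GID is also typically unset unless you export it yourself. As an alternative, here is a minimal sketch of a shell function that resolves these values on every call. The name seqtagger_here is hypothetical and not part of SeqTagger; the docker options are the same ones used in the alias.

```shell
# Hypothetical helper (not part of SeqTagger).
# Same docker invocation as the alias, but the user ID, group ID and
# working directory are looked up each time the function is called.
seqtagger_here() {
  docker run --gpus all \
    -u "$(id -u):$(id -g)" \
    -v "$PWD":/data \
    lpryszcz/seqtagger "$@"
}
```

It is called the same way as the alias, for example: seqtagger_here mRNA -k models/b04_RNA004 -r -i /data/test_data/RNA004 -o /data/demux. Unlike an alias, a function is also available inside non-interactive bash scripts.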

Then, running it is as easy as:

seqtagger mRNA -k models/b04_RNA004 -r -i /data/test_data/RNA004 -o /data/demux

Note that you can provide multiple input directories containing fast5/pod5 files after -i.

Results will be saved in tab-delimited files (gzip-compressed): demux/RNA004.demux.tsv.gz

In addition, boxplots of per-barcode quality will be saved in the corresponding directory, in a file ending with .boxplot.pdf.
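For a quick overview of a run, the result table can be summarised per barcode. The sketch below assumes the table has a header row and a column named barcode. The column layout is not described here, so check the header first (for example with gzip -dc demux/RNA004.demux.tsv.gz | head -n 3) and pass the actual column name as the second argument. count_by_barcode is a hypothetical helper, not part of SeqTagger.

```shell
# Hypothetical helper (not part of SeqTagger).
# Usage: count_by_barcode TABLE.tsv.gz [COLUMN_NAME]
# Prints "<barcode><TAB><number of reads>" for each value in the column.
# Assumption: the table has a header row; COLUMN_NAME defaults to "barcode".
count_by_barcode() {
  column=${2:-barcode}
  gzip -dc "$1" | awk -F'\t' -v col="$column" '
    NR == 1 {
      # Locate the requested column in the header row.
      for (i = 1; i <= NF; i++) if ($i == col) c = i
      if (!c) { print "no column named " col > "/dev/stderr"; exit 1 }
      next
    }
    { n[$c]++ }
    END { for (b in n) print b "\t" n[b] }
  ' | sort
}
```

With the test data this would be run on the host as count_by_barcode demux/RNA004.demux.tsv.gz.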

Please note: You can now also run SeqTagger through the MasterOfPores3 Nextflow workflow.

Split reads by barcode

Split Fast5 files

If you wish to split Fast5 file(s) by barcode, execute:

seqtagger fast5_split_by_barcode.py -b 50 -i /data/demux/RNA004.demux.tsv.gz \
  -f /data/test_data/RNA004 -o /data/demux/RNA004

Here, -b specifies the baseQ cut-off. This will generate one output folder per barcode, with files named demux/RNA004/bc_?/reads_*.fast5, where ? is the barcode number.

Split FastQ files

We don't provide a FastQ example in the test_data. If you wish to split FastQ file(s) by barcode:

# first concatenate all FastQ files into one
cat guppy/run1/*.fastq.gz > guppy/run1.fastq.gz
# and split reads using baseQ cut-off of 50
seqtagger fastq_split_by_barcode.py -b 50 -o /data/demux/run1 -i /data/demux/run1.demux.tsv.gz -f /data/guppy/run1.fastq.gz

This will save one FastQ file per barcode, named demux/run1.demux.bc_?.fq.gz, where ? is the barcode number.
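As a sanity check after splitting, you may want to count the reads in each per-barcode FastQ file. The following is a minimal sketch, assuming standard four-line FastQ records. fastq_read_counts is a hypothetical helper, not part of SeqTagger.

```shell
# Hypothetical helper (not part of SeqTagger).
# Usage: fastq_read_counts FILE.fq.gz [FILE.fq.gz ...]
# Prints "<file><TAB><number of reads>" for each gzip-compressed FastQ file.
# Assumption: every record spans exactly four lines.
fastq_read_counts() {
  for f in "$@"; do
    lines=$(gzip -dc "$f" | wc -l | tr -d ' ')
    printf '%s\t%d\n' "$f" "$((lines / 4))"
  done
}
```

For the example above, this could be run on the host as fastq_read_counts demux/run1.demux.bc_?.fq.gz.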

Split BAM files

We don't provide a BAM example in the test_data. If you wish to split BAM file(s) by barcode:

seqtagger bam_split_by_barcode.py -i /data/demux/run1.demux.tsv.gz -f /data/run1.mapped.bam -o /data/run1.mapped

This will save one BAM file per barcode, named run1.mapped.bc_?.bam, where ? is the barcode number.

Dependencies and versions

You'll need a CUDA-compatible (NVIDIA) GPU and CUDA v10 or newer installed on your system.

Additionally, you'll need to install Docker and the NVIDIA Container Toolkit.

Versions tested:

Software	Version
CUDA	10, 11, 12
Docker	25+
NVIDIA Container Toolkit	1.14
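Before the first run, it can help to confirm that the required tools are visible on the host. The sketch below only checks that the docker and nvidia-smi commands are on the PATH; it does not verify versions or that containers can actually reach the GPU. check_seqtagger_deps is a hypothetical helper, not part of SeqTagger.

```shell
# Hypothetical helper (not part of SeqTagger).
# Reports whether docker and nvidia-smi can be found on the PATH.
# Returns 0 if both are present, 1 otherwise.
check_seqtagger_deps() {
  status=0
  for tool in docker nvidia-smi; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: MISSING"
      status=1
    fi
  done
  return "$status"
}

# For an end-to-end check of GPU access from inside a container, the
# NVIDIA Container Toolkit documentation suggests a sample workload
# along these lines:
#   docker run --rm --gpus all ubuntu nvidia-smi
```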

License Information

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC-ND 4.0), available here, with the exception of the bonito module, which retains its original license. The full text of the licenses, including modified code, can be found in the bonito directory.

License Dependencies

ONT 1.0: bonito
- Licensed under the Oxford Nanopore Technologies Public License 1.0. Full license text available at ONT 1.0 License.
MPL 2.0: pod5, ont_fast5_api
- Licensed under the Mozilla Public License 2.0. Full license text available at MPL 2.0 License.
BSD 3-Clause: pandas, seaborn, joblib
- Licensed under the BSD 3-Clause License. Full license text available at BSD 3-Clause License.
MIT: mappy, pysam
- Licensed under the MIT License. Full license text available at MIT License.
OTHER: pytorch, numpy
- Full license text for pytorch is available at pytorch License.
- Full license text for numpy is available at numpy License.

Please ensure compliance with each license's terms and conditions.

Patent Information

LPP, GD and EMN have filed patent applications (EP24382340 and EP24383144) based on this work at the European Patent Office.

Citation

If you found this work helpful, please cite:

Pryszcz LP*, Diensthuber G*, Llovera L, Medina R, Delgado-Tejedor A, Cozzuto L, Ponomarenko J and Novoa EM#. SeqTagger, a rapid and accurate tool to demultiplex direct RNA nanopore sequencing datasets. bioRxiv 2024. doi: https://www.biorxiv.org/content/10.1101/2024.10.29.620808
