This page provides a third-party guide to modifying an existing GTF file by extending the 3′ ends of transcripts with the program peaks2utr. The extension is needed because the reads obtained with some transcriptome sequencing methods, including Chromium Single Cell Gene Expression from 10X Genomics, are unevenly distributed along transcripts, so reads that map beyond the annotated 3′ ends may be lost. The program versions and example files used in this guide follow our own experience with medaka, gained as part of the activity of the NBRP Medaka Project of Japan.
1.1. Installing peaks2utr v1.1.2
The installation requires Python 3.8 to 3.10 (not 3.11, the latest version at the time of writing).
Follow the official instructions to install with pip.
pip install peaks2utr
Also see the Supplementary Data of the publication by the developers (Haese-Hill et al., Bioinformatics 2023).
To keep a different Python version as the default (such as the latest version, 3.11), create an environment dedicated to peaks2utr.
conda create -n peaks2utr python=3.10
conda activate peaks2utr
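Because the version constraint above is easy to overlook, a small check before installing may help. The following is a minimal sketch; the function name py_version_ok is hypothetical and is not part of peaks2utr or conda.

```shell
# Hypothetical helper: succeed only if the given Python version string
# is 3.8, 3.9, or 3.10 (the range stated for peaks2utr v1.1.2 above).
py_version_ok() {
    pv_major=${1%%.*}        # "3.10.4" -> "3"
    pv_minor=${1#*.}         # "3.10.4" -> "10.4"
    pv_minor=${pv_minor%%.*} # "10.4"   -> "10"
    [ "$pv_major" -eq 3 ] && [ "$pv_minor" -ge 8 ] && [ "$pv_minor" -le 10 ]
}
```

Inside the activated environment, it could be used as `py_version_ok "$(python -c 'import platform; print(platform.python_version())')" && pip install peaks2utr`.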
1.2. Installing cellranger v7.1.0
Follow the official instructions from 10X Genomics.
Use a GTF file from Ensembl if available, for two reasons. First, 10X Genomics recommends a GTF file from Ensembl on its official page ('If the species is available from the Ensembl database, we recommend using the files from there'). Second, our trial of running peaks2utr with a GTF file from NCBI failed (see Part 5).
e.g., Oryzias_latipes.ASM223467v1.110.gtf.gz from Ensembl
e.g., Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz downloaded from Ensembl
The program peaks2utr assumes Chromium scRNA-seq data as input (Part 4.1), but it also accepts bulk RNA-seq data (Part 4.2). Choose one of these two options.
4.1. Use a 10X Chromium single-cell RNA-seq data set
Build the cellranger reference from the genome assembly (assembly.fna) and the GTF file (genemodel.gtf) (see the official guide).
cellranger mkref --genome=custom_ref --genes=genemodel.gtf --fasta=assembly.fna
Map the Chromium scRNA-seq reads in the directory fastq_dir with cellranger count.
cellranger count --id=run_count --fastqs=fastq_dir --transcriptome=custom_ref
4.2. Use bulk RNA-seq data
Map the trimmed reads onto the genome assembly (assembly.fna) with hisat2 or an equivalent aligner.
Use the BAM file produced in Part 4 (Part 4.1 or 4.2).
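For the bulk RNA-seq route, the mapping could look like the sketch below. It assumes paired-end reads and that hisat2 and samtools are installed; the function names run and map_bulk, the DRY_RUN switch, and all output file names are hypothetical. By default the commands are only printed, so they can be reviewed before running them with DRY_RUN=0.

```shell
# Print one command (DRY_RUN=1, the default here) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: map_bulk ASSEMBLY READS_1 READS_2 PREFIX
# Builds a hisat2 index, maps paired-end reads, then sorts and indexes the BAM.
map_bulk() {
    run hisat2-build "$1" "$4.index" &&
    run hisat2 -x "$4.index" -1 "$2" -2 "$3" -S "$4.sam" &&
    run samtools sort -o "$4.sorted.bam" "$4.sam" &&
    run samtools index "$4.sorted.bam"
}
```

For example, `DRY_RUN=0 map_bulk assembly.fna reads_1.fq.gz reads_2.fq.gz bulk` would leave bulk.sorted.bam, which can then be given to peaks2utr in place of the cellranger BAM file.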
peaks2utr --gtf genemodel.gtf run_count/outs/possorted_genome_bam.bam -o genemodelNEW.gtf
Consider adjusting the parameter --max-distance ('maximum distance in bases that UTR can be from a transcript'; default, 200 bp) depending on typical UTR lengths and other genomic spacing trends in the species of interest.
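One way to get a rough feel for typical UTR lengths is to summarize the 3′ UTR features already present in the input GTF. The sketch below is an assumption-laden illustration, not part of peaks2utr: it assumes an uncompressed, tab-separated GTF that labels 3′ UTRs as three_prime_utr in the third column (the Ensembl convention), and the function name utr3_median_length is hypothetical. UTRs split across exons are counted per segment, so the result is only a rough guide.

```shell
# Hypothetical helper: print the median length (bp) of three_prime_utr
# features in an uncompressed GTF file; fail if the file has none.
utr3_median_length() {
    awk -F '\t' '$3 == "three_prime_utr" { print $5 - $4 + 1 }' "$1" |
        sort -n |
        awk '{ v[NR] = $1 }
             END {
                 if (NR == 0) exit 1
                 if (NR % 2) print v[(NR + 1) / 2]
                 else print int((v[NR / 2] + v[NR / 2 + 1]) / 2)
             }'
}
```

After `utr3_median_length genemodel.gtf`, a value could be passed in the same way as the command above, for example `peaks2utr --gtf --max-distance 500 genemodel.gtf run_count/outs/possorted_genome_bam.bam -o genemodelNEW.gtf`, where 500 is only a placeholder and not a recommendation.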
We completed this whole process without problems using GTF files from Ensembl for medaka, zebrafish, and mouse (see Part 2 above). With GTF files from NCBI for these three species, we had to clean up the attribute (9th) column before running peaks2utr, after which we managed to complete the process.
6.1. Open the output file summary_stats.txt
6.2. Use agat_sp_statistics.pl in AGAT (Another GTF/GFF Analysis Toolkit)
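As a quick cross-check that needs no extra software, the number and total length of features of one type can be compared between the original and the modified GTF files. The sketch below is only an illustration: the function name feature_length_summary is hypothetical, and which feature type is informative (for example transcript or exon) depends on how the output file labels the extended regions, which should be checked in the file itself.

```shell
# Hypothetical helper: for one feature type (column 3) in an uncompressed GTF,
# print "type<TAB>count<TAB>total length in bp".
feature_length_summary() {
    awk -F '\t' -v type="$2" '
        $3 == type { n++; total += $5 - $4 + 1 }
        END { printf "%s\t%d\t%d\n", type, n, total }
    ' "$1"
}
```

Running `feature_length_summary genemodel.gtf transcript` and `feature_length_summary genemodelNEW.gtf transcript` side by side gives a before-and-after comparison. For the AGAT route, a typical call is `agat_sp_statistics.pl --gff genemodelNEW.gtf -o genemodelNEW.stats.txt`, where the output file name is arbitrary.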
cellranger mkref --genome=custom_refNEW --genes=genemodelNEW.gtf --fasta=assembly.fna
cellranger count --id=run_countNEW --fastqs=fastq_dir --transcriptome=custom_refNEW
9.1. Confirm the expected increase in the mapping rate ('Reads Mapped Confidently to Transcriptome') by comparing it with the value obtained in Part 4.1.
The gene models of the Japanese medaka (Oryzias latipes Hd-rR) that we modified are available at Figshare. Please cite DOI 10.6084/m9.figshare.24080463 when you use this resource in your publication.
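The comparison can be read off the two web summaries, or pulled from the metrics_summary.csv file that cellranger count writes into each outs directory. The sketch below assumes that file has one header line and one value line, with quoted fields where values contain commas; the function name metric_value is hypothetical.

```shell
# Hypothetical helper: print the value of one named column from a
# two-line CSV (header line, value line), honouring double-quoted fields.
metric_value() {
    awk -v want="$2" '
        # Split one CSV line into fields[]; commas inside quotes are kept.
        function split_csv(line, fields,    n, i, c, inq, cur) {
            n = 0; cur = ""; inq = 0
            for (i = 1; i <= length(line); i++) {
                c = substr(line, i, 1)
                if (c == "\"") inq = !inq
                else if (c == "," && !inq) { fields[++n] = cur; cur = "" }
                else cur = cur c
            }
            fields[++n] = cur
            return n
        }
        NR == 1 { n = split_csv($0, header) }
        NR == 2 {
            split_csv($0, value)
            for (i = 1; i <= n; i++)
                if (header[i] == want) { print value[i]; found = 1 }
            exit
        }
        END { exit !found }
    ' "$1"
}
```

For example, `metric_value run_count/outs/metrics_summary.csv "Reads Mapped Confidently to Transcriptome"` and the same call on run_countNEW print the two percentages to compare.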
We thank Osamu Nishimura at RIKEN BDR and Satoshi Ansai at Kyoto Univ for discussion.
Haese-Hill et al., Bioinformatics 2023 'peaks2utr: a robust Python tool for the annotation of 3′ UTRs'
Bilgic et al., eLife 2023 'Truncated radial glia as a common precursor in the late corticogenesis of gyrencephalic mammals'
Lawson et al., eLife 2020 'An improved zebrafish transcriptome annotation for sensitive and comprehensive detection of cell type-specific genes'
Improved zebrafish gene models produced in this study