Genome sequence annotation

Step-by-step guide for UTR modeling

This page is a third-party guide to modifying an existing GTF file by extending the 3′ ends of transcripts with the program peaks2utr. The extension is needed because some transcriptome sequencing methods, including Chromium Single Cell Gene Expression of 10X Genomics, produce reads that are biased toward transcript ends; reads that fall outside the annotated gene models are not counted, which can cause data loss. The program versions and example files used in this guide reflect our own experience with medaka, gained as part of the activity of the NBRP Medaka Project of Japan.

1. Installation

1.1. Installing peaks2utr v1.1.2

Installation requires Python 3.8 to 3.10 (not 3.11, the latest version at the time of writing).

Follow the official instructions to install with pip.

pip install peaks2utr

Also see the Supplementary Data of the publication by the developers (Haese-Hill et al., Bioinformatics 2023).

To keep a different Python version as the default (such as the latest version, 3.11), create a conda environment dedicated to peaks2utr and run the pip command above inside it.

conda create -n peaks2utr python=3.10
conda activate peaks2utr
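
Before running pip, it can help to confirm that the active interpreter falls within the range stated above. The following is a minimal sketch; the helper name python_version_ok is hypothetical and not part of peaks2utr, and the accepted versions are simply those listed in this section.

```shell
# Succeeds only for the Python versions stated above as supported (3.8 to 3.10).
# python_version_ok is a hypothetical helper name, not part of peaks2utr.
python_version_ok() {
  case "$1" in
    3.8|3.9|3.10) return 0 ;;
    *) return 1 ;;
  esac
}

# Check whichever python3 comes first on PATH (e.g. inside the conda environment).
if command -v python3 >/dev/null 2>&1; then
  ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
  if python_version_ok "$ver"; then
    echo "Python $ver is within the supported range"
  else
    echo "Python $ver is outside the supported range (3.8 to 3.10)"
  fi
fi
```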

1.2. Installing cellranger v7.1.0

Follow the official instructions from 10X Genomics.

2. Prepare a GTF file (genemodel.gtf)

Use a GTF file from Ensembl if one is available, for two reasons. First, 10X Genomics recommends Ensembl files on its official page ('If the species is available from the Ensembl database, we recommend using the files from there'). Second, our attempts to run peaks2utr with unmodified GTF files from NCBI failed (see Part 5).

e.g., Oryzias_latipes.ASM223467v1.110.gtf.gz from Ensembl

3. Prepare a genome assembly file (assembly.fna)

e.g., Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz downloaded from Ensembl
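
The commands in the following parts refer to the uncompressed names genemodel.gtf and assembly.fna, whereas the Ensembl example files above are gzip-compressed. A minimal sketch of this renaming step is given below, assuming that both example files were downloaded into the current directory; the helper name unpack_as is hypothetical.

```shell
# Write an uncompressed copy of a gzip file under the name used in this guide.
# The compressed original is left in place. unpack_as is a hypothetical helper name.
unpack_as() {
  gzip -dc "$1" > "$2"
}

# Apply it to the Ensembl example files if they are in the current directory.
if [ -f Oryzias_latipes.ASM223467v1.110.gtf.gz ]; then
  unpack_as Oryzias_latipes.ASM223467v1.110.gtf.gz genemodel.gtf
fi
if [ -f Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz ]; then
  unpack_as Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz assembly.fna
fi
```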

4. Prepare a transcript read mapping file (.bam)

The program peaks2utr assumes Chromium scRNA-seq data as input (Part 4.1), but it also accepts bulk RNA-seq data (Part 4.2). Choose one of these two options.

4.1. Use 10X Chromium single cell RNA-seq data set

Build a Cell Ranger reference from the genome assembly (assembly.fna) and the GTF file (genemodel.gtf) (see the official guide).

cellranger mkref --genome=custom_ref --genes=genemodel.gtf --fasta=assembly.fna

Map Chromium scRNA-seq reads in the directory fastq_dir with cellranger count

cellranger count --id=run_count --fastqs=fastq_dir --transcriptome=custom_ref

4.2. Use bulk RNA-seq data

Map the trimmed reads onto the genome assembly (assembly.fna) with hisat2 or an equivalent aligner.
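
One possible way to produce the BAM file from bulk RNA-seq reads is sketched below, assuming paired-end reads and that hisat2 and samtools are installed. The function name map_bulk_reads, the index name assembly_index, and the read and output file names are all hypothetical placeholders; aligner options are left at their defaults and may need tuning.

```shell
# Build a HISAT2 index, map paired-end reads, and write a sorted, indexed BAM.
# map_bulk_reads and every file name passed to it are hypothetical.
map_bulk_reads() {
  assembly=$1; reads1=$2; reads2=$3; out_bam=$4
  hisat2-build "$assembly" assembly_index
  hisat2 -x assembly_index -1 "$reads1" -2 "$reads2" | samtools sort -o "$out_bam" -
  samtools index "$out_bam"
}

# Example call (commented out because the read file names are placeholders):
# map_bulk_reads assembly.fna trimmed_R1.fastq.gz trimmed_R2.fastq.gz bulk_sorted.bam
```

The resulting BAM file would then take the place of run_count/outs/possorted_genome_bam.bam in Part 5.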

5. Run peaks2utr

Use the BAM file made in Part 4 (Part 4.1 or 4.2). The command below uses the Cell Ranger output from Part 4.1.

peaks2utr --gtf genemodel.gtf run_count/outs/possorted_genome_bam.bam -o genemodelNEW.gtf

Consider adjusting the parameter --max-distance ('maximum distance in bases that UTR can be from a transcript'; default: 200 bp) according to typical UTR lengths and other trends in genomic spacing in the species of interest.
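
A run with a non-default --max-distance could look like the sketch below. The wrapper name run_peaks2utr_maxdist and the output naming scheme are hypothetical, and the value 500 in the example call is arbitrary, not a recommendation.

```shell
# Run peaks2utr with an explicit --max-distance and an output name that records it.
# run_peaks2utr_maxdist is a hypothetical wrapper around the command shown above.
run_peaks2utr_maxdist() {
  bam=$1; max_distance=$2
  peaks2utr --gtf genemodel.gtf "$bam" \
    --max-distance "$max_distance" \
    -o "genemodelNEW.maxdist${max_distance}.gtf"
}

# Example call (commented out; it needs peaks2utr and the BAM file from Part 4):
# run_peaks2utr_maxdist run_count/outs/possorted_genome_bam.bam 500
```

Writing each setting to its own output file makes it possible to compare the results of different values side by side in Part 6.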

We completed the whole process without problems using GTF files from Ensembl for medaka, zebrafish, and mouse (see Part 2 above). With GTF files from NCBI for the same three species, we first had to clean up the attribute (9th) column; after that, peaks2utr ran to completion.
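
The exact cleanup is not described in this guide. As one hypothetical starting point for inspecting a GTF file, the sketch below counts records whose 9th column carries an attribute with an empty value (for example transcript_id ""). It only reports and does not modify the file, and whether this pattern is what prevents peaks2utr from running is an assumption. The function name count_empty_attributes is hypothetical.

```shell
# Count GTF records whose 9th column contains an attribute with an empty value,
# e.g. transcript_id "". Comment lines and malformed records are ignored.
# count_empty_attributes is a hypothetical helper name.
count_empty_attributes() {
  awk -F '\t' '
    !/^#/ && NF == 9 && $9 ~ /[A-Za-z_]+ ""/ { n++ }
    END { print n + 0 }
  ' "$1"
}

# Report on the input annotation if it is present in the current directory.
if [ -f genemodel.gtf ]; then
  count_empty_attributes genemodel.gtf
fi
```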

6. Analyze the peaks2utr output

6.1. Open the output file summary_stats.txt

6.2. Use agat_sp_statistics.pl in AGAT (Another GTF/GFF Analysis Toolkit)
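
As a quick complement to these two outputs, the number of records per feature type (3rd column) can be tabulated for the annotation before and after peaks2utr. The sketch below assumes both GTF files are in the current directory; the helper name feature_counts and the AGAT output file name are hypothetical. The AGAT call is skipped when the script or the file is not found.

```shell
# Print "feature_type<TAB>count" for every feature type in a GTF/GFF file.
# feature_counts is a hypothetical helper name.
feature_counts() {
  awk -F '\t' '!/^#/ && NF >= 3 { c[$3]++ } END { for (f in c) print f "\t" c[f] }' "$1" | sort
}

# Tabulate the annotation before and after peaks2utr, if the files are present.
for gtf in genemodel.gtf genemodelNEW.gtf; do
  if [ -f "$gtf" ]; then
    echo "== $gtf"
    feature_counts "$gtf"
  fi
done

# AGAT summary of the extended annotation (skipped if AGAT or the file is missing).
if command -v agat_sp_statistics.pl >/dev/null 2>&1 && [ -f genemodelNEW.gtf ]; then
  agat_sp_statistics.pl --gff genemodelNEW.gtf -o genemodelNEW.agat_stats.txt
fi
```

Differences between the two tables show which feature types changed in number after the peaks2utr run.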

7. Format the reference and gene model modified by peaks2utr

cellranger mkref --genome=custom_refNEW --genes=genemodelNEW.gtf --fasta=assembly.fna

8. Map scRNA-seq reads

cellranger count --id=run_countNEW --fastqs=fastq_dir --transcriptome=custom_refNEW

9. Analyze the output

9.1. Confirm the expected increase in the mapping rate ('Reads Mapped Confidently to Transcriptome') by comparing it with the value from the run in Part 4.1.
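
The comparison can be made from the metrics_summary.csv file that Cell Ranger writes into the outs directory of each run. The sketch below assumes that this file has one header line followed by one line of values, and that the run directories are named run_count and run_countNEW as in this guide; the helper name metric_value is hypothetical.

```shell
# Print the value of one named column from a Cell Ranger metrics_summary.csv.
# $1: path to the CSV file, $2: exact column name. metric_value is a hypothetical helper.
metric_value() {
  awk -v name="$2" '
    # Split one CSV line into fields, keeping commas inside double-quoted values.
    function split_csv(line, out,    n, i, ch, inq, field) {
      n = 0; field = ""; inq = 0
      for (i = 1; i <= length(line); i++) {
        ch = substr(line, i, 1)
        if (ch == "\"") inq = !inq
        else if (ch == "," && !inq) { out[++n] = field; field = "" }
        else field = field ch
      }
      out[++n] = field
      return n
    }
    NR == 1 { nh = split_csv($0, header) }
    NR == 2 { split_csv($0, value); for (i = 1; i <= nh; i++) if (header[i] == name) print value[i] }
  ' "$1"
}

# Print the mapping rate of the original and the re-mapped run side by side.
for run in run_count run_countNEW; do
  csv="$run/outs/metrics_summary.csv"
  if [ -f "$csv" ]; then
    printf '%s\t%s\n' "$run" "$(metric_value "$csv" "Reads Mapped Confidently to Transcriptome")"
  fi
done
```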

Access to the modified medaka .gtf file

The gene models of the Japanese medaka (Oryzias latipes Hd-rR) that we modified are available at Figshare. Cite them with DOI 10.6084/m9.figshare.24080463 when you use this resource in your publication.

Acknowledgments

We thank Osamu Nishimura at RIKEN BDR and Satoshi Ansai at Kyoto Univ for discussion.

References

Tools

Haese-Hill et al., Bioinformatics 2023 'peaks2utr: a robust Python tool for the annotation of 3′ UTRs'

Similar efforts

Bilgic et al., eLife 2023 'Truncated radial glia as a common precursor in the late corticogenesis of gyrencephalic mammals'
Lawson et al., eLife 2020 'An improved zebrafish transcriptome annotation for sensitive and comprehensive detection of cell type-specific genes'
    Improved zebrafish gene models produced in this study
