Skip to content

Latest commit

 

History

History

scRNAseq_HPV_branch

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 

HPV mapping Tutorial

Introduction

This tutorial is to detect and identify HPV from single-cell RNA-sequencing (scRNA-seq) raw data based on CellRanger.

Pipeline input

  1. Host scRNA-seq raw data in .fastq (one sample in one folder)
  2. HPV complete genome in .fna
  3. HPV annotation file in .gft

Please note that there are dozens or hundreds of types for HPV. Make sure the complete genome is the right one you need.

In this example, we use HPV type 16 downaloaded from NIBI database. When asking "select file source," you can select RefSeq or GenBank, whichever is available. ViruSite is a genome reference database that can choose multiple microbiome genomes and download them as one file.

Pipeline output

Viral gene expression matrix

Install CellRanger

The first step is to install the lateset Cell Ranger. Once the tar.gz file is downloaded in a wanted directory, type the following in the terminal to decompress into current directory.

cd /fs/ess/PCON0022/tools
tar -xzvf cellranger-7.1.0.tar.gz

Create HPV genome reference

Once we have the virus genome and annotation files, we will generate the an HPV genome reference file based on CellRanger. Before we run the code, we need to manually make some changes in the .gft file, as virus sequencing does not contain an "exon" column. Please do the following in the .gft file: Change the third column "CDS" into "exon".

The HPV reference genome can now be built by typing:

tools=/fs/ess/PCON0022/tools
cd /fs/ess/PCON0022
$tools/cellranger-7.1.0/cellranger mkref \
          --genome=hpv_ref_genome \
          --fasta=path/to/hpv.fna \
          --genes=path/to/hpv.gtf

The output files in the folder "hpv_ref_genome" include the following:

Set Cell Ranger parameters

Parameters in the Cell Ranger can be modified based on your need. In this case, we set the parameters as follows:

  1. Find a file named "cellranger-7.1.0/lib/bin/parameters.toml".
  2. Change the third row in this file into "star_parameters = "--alignIntronMax 1 --genomeSAindexNbases 6".

Mapping HPV in scRNA-data

We can now get the viral gene expression matrix using the CellRanger count function. Typing the following:

wd=path/to/where/to/save/output
CellRanger=/fs/ess/PCON0022/tools/cellranger-7.1.0-virus/cellranger
FastqFolder=path/to/host/fastq/
Refer=/fs/ess/PCON0022/hpv_ref_genome
cd $wd
${CellRanger} count --id=hpv_mapping_output --transcriptome=${Refer} --fastqs=${FastqFolder} --sample=CC --chemistry=SC3P_auto --localcores=8  --localmem=64

Please note that in the "--sample" the sample name has to be the same as the prefix name in the scRNA-seq raw fastq file. For example, the scRNA-seq raw fastq file is called "CC_S1_L001_R1_001.fastq.gz", so we have "--sample=CC."

The file we submitted for mapping is also attached as "cc_hpv_mapping.sh".

Virus gene expression matrix

We can go to the folder "hpv_mapping_output" we created for saving the output in the above step. The folder "hpv_mapping_output" shall contain the following files:

The gene expression matrix for further analysis under R or python can be found under folder "outs". For example, we use the matrix under the folder "filtered_feature_bc_matrix":

Contact

Author: Yingjie Li

Methods for manuscript

The “mkref” function in Cell Ranger (10X Genomics; version 7.1) was used to build the HPV reference files. To detect HPV present in the single cells, we aligned the data to the HPV reference files using the “count” function in Cell Ranger. The parameters “alignIntronMax” and “genomeSAinndexNbases” for the STAR algorithm (integrated into Cell Ranger) were adjusted based on the characteristics of the viral genome. The “alignIntronMax” parameter, representing the maximum intro length allowed in the algorithm, was set to 1 because viruses do not contain introns. To achieve a balance between memory size and speed, the “genomeSAinndexNbases” parameter was set to 6 based on the algorithm min(14, log2(genome length)/2-1). The “genomeSAinndexNbases” was set to 6 bases on algorithm min(14, log2(genome length)/2-1) to balance the computational memory size and speed.