This tutorial is to detect and identify HPV from single-cell RNA-sequencing (scRNA-seq) raw data based on CellRanger.
- Host scRNA-seq raw data in .fastq (one sample in one folder)
- HPV complete genome in .fna
- HPV annotation file in .gft
Please note that there are dozens or hundreds of types for HPV. Make sure the complete genome is the right one you need.
In this example, we use HPV type 16 downaloaded from NIBI database. When asking "select file source," you can select RefSeq or GenBank, whichever is available. ViruSite is a genome reference database that can choose multiple microbiome genomes and download them as one file.
Viral gene expression matrix
The first step is to install the lateset Cell Ranger. Once the tar.gz file is downloaded in a wanted directory, type the following in the terminal to decompress into current directory.
cd /fs/ess/PCON0022/tools
tar -xzvf cellranger-7.1.0.tar.gz
Once we have the virus genome and annotation files, we will generate the an HPV genome reference file based on CellRanger. Before we run the code, we need to manually make some changes in the .gft file, as virus sequencing does not contain an "exon" column. Please do the following in the .gft file: Change the third column "CDS" into "exon".
The HPV reference genome can now be built by typing:
tools=/fs/ess/PCON0022/tools
cd /fs/ess/PCON0022
$tools/cellranger-7.1.0/cellranger mkref \
--genome=hpv_ref_genome \
--fasta=path/to/hpv.fna \
--genes=path/to/hpv.gtf
The output files in the folder "hpv_ref_genome" include the following:
Parameters in the Cell Ranger can be modified based on your need. In this case, we set the parameters as follows:
- Find a file named "cellranger-7.1.0/lib/bin/parameters.toml".
- Change the third row in this file into "star_parameters = "--alignIntronMax 1 --genomeSAindexNbases 6".
We can now get the viral gene expression matrix using the CellRanger count function. Typing the following:
wd=path/to/where/to/save/output
CellRanger=/fs/ess/PCON0022/tools/cellranger-7.1.0-virus/cellranger
FastqFolder=path/to/host/fastq/
Refer=/fs/ess/PCON0022/hpv_ref_genome
cd $wd
${CellRanger} count --id=hpv_mapping_output --transcriptome=${Refer} --fastqs=${FastqFolder} --sample=CC --chemistry=SC3P_auto --localcores=8 --localmem=64
Please note that in the "--sample" the sample name has to be the same as the prefix name in the scRNA-seq raw fastq file. For example, the scRNA-seq raw fastq file is called "CC_S1_L001_R1_001.fastq.gz", so we have "--sample=CC."
The file we submitted for mapping is also attached as "cc_hpv_mapping.sh".
We can go to the folder "hpv_mapping_output" we created for saving the output in the above step. The folder "hpv_mapping_output" shall contain the following files:
The gene expression matrix for further analysis under R or python can be found under folder "outs". For example, we use the matrix under the folder "filtered_feature_bc_matrix":
Author: Yingjie Li
The “mkref” function in Cell Ranger (10X Genomics; version 7.1) was used to build the HPV reference files. To detect HPV present in the single cells, we aligned the data to the HPV reference files using the “count” function in Cell Ranger. The parameters “alignIntronMax” and “genomeSAinndexNbases” for the STAR algorithm (integrated into Cell Ranger) were adjusted based on the characteristics of the viral genome. The “alignIntronMax” parameter, representing the maximum intro length allowed in the algorithm, was set to 1 because viruses do not contain introns. To achieve a balance between memory size and speed, the “genomeSAinndexNbases” parameter was set to 6 based on the algorithm min(14, log2(genome length)/2-1). The “genomeSAinndexNbases” was set to 6 bases on algorithm min(14, log2(genome length)/2-1) to balance the computational memory size and speed.