Read processing scripts for PerturbSci-Kinetics. The bioRxiv preprint: https://doi.org/10.1101/2023.01.29.526143
- Fastq input: Paired-end full-coverage bulk RNA-seq.
- Sample ID: a text file listing one sample prefix per line. The R1 and R2 files of a sample should share the same prefix.
- Reference fasta: the fasta file of the reference genome. It is used during read pileup.
- Index: the STAR index of the reference genome.
- Output folder: the directory for all output files.
- Script folder: the folder for all sub scripts.
- Other parameters: the number of cores and the directories of the required packages.
- Trim adapter sequences by automatic detection.
- STAR alignment.
- Filter aligned reads.
- Merge bams and sort the merged bam.
- Summarize the base identities of reads mapped to each genomic location.
- Inherent SNP calling.
- A vcf file containing background mutations in RNA.
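
The pileup and SNP calling steps above can be illustrated with a minimal sketch. It is not the repository's implementation: the function name `call_background_snps`, the in-memory `pileup` and `reference` dictionaries, and the thresholds `min_depth` and `min_alt_fraction` are all assumptions made for illustration.

```python
from collections import Counter


def call_background_snps(pileup, reference, min_depth=10, min_alt_fraction=0.5):
    """Return (chrom, pos, ref, alt) for positions dominated by a non-reference base.

    pileup:    dict mapping (chrom, pos) -> string of bases observed at that position
    reference: dict mapping (chrom, pos) -> reference base
    Both thresholds are hypothetical defaults, not values taken from the pipeline.
    """
    snps = []
    for (chrom, pos), bases in sorted(pileup.items()):
        depth = len(bases)
        if depth < min_depth:
            continue  # too few reads to make a call
        ref_base = reference[(chrom, pos)]
        counts = Counter(bases.upper())
        # Most frequent base that differs from the reference, if any.
        alt_base, alt_count = max(
            ((b, c) for b, c in counts.items() if b != ref_base),
            key=lambda bc: bc[1],
            default=(None, 0),
        )
        if alt_base is not None and alt_count / depth >= min_alt_fraction:
            snps.append((chrom, pos, ref_base, alt_base))
    return snps


example_pileup = {
    ("chr1", 100): "TTTTCCCCCCCC",  # 8 of 12 reads show C instead of T
    ("chr1", 101): "AAAAAAAAAAAA",  # matches the reference
    ("chr1", 102): "GGT",           # below the depth threshold
}
example_reference = {("chr1", 100): "T", ("chr1", 101): "A", ("chr1", 102): "G"}
background = call_background_snps(example_pileup, example_reference)
```

Positions reported this way would be written to the background vcf, so that they can later be excluded when T>C mutations are counted.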
Single cell whole/nascent transcriptomes reprocessing steps (/whole_tx_processing/Main_processing.sh)
- Fastq input: Paired-end PerturbSci-Kinetics demultiplexed whole transcriptome fastq files.
- Sample ID: a text file listing one sample prefix per line. The R1 and R2 files of a sample should share the same prefix.
- Reference fasta: the fasta file of the reference genome.
- Index: the STAR index of the reference genome.
- Gtf file: the annotation file for the matched reference genome. It is used in feature counting.
- Reference SNP file: the SNP vcf file generated from the script above. It is used to filter out inherent mutations in the RNA.
- Output folder: the directory for all output files.
- Script folder: the folder for all sub scripts.
- Cutoff: only cell barcodes with a read count above this cutoff are kept for further processing.
- Custom barcode folder: the folder for all barcodes.
- RT barcode, ligation barcode: pickle files containing all valid barcode sequences with at most 1 mismatch.
- Barcodes: the text file containing all RT+ligation barcode combinations.
- Other parameters: the number of cores and the directories of the required packages.
- Change file names of fastq to make them callable in the following steps.
- Attach UMI sequences on R1 to headers of R2.
- Trim potential polyA sequences from the 3'end of R2.
- STAR alignment.
- Filter aligned reads.
- Remove PCR duplicates based on both mapped genomic coordinates and UMI.
- Generate single-cell sam files.
- Transform the alignment information in single-cell sams to tables at the single-base level.
- Identify T>C mutations on each single read and extract read names of nascent reads.
- Extract nascent reads from single-cell sams.
- Gene-level feature counting on both single-cell whole/nascent sams and re-format the single-cell gene expression matrix.
- Two R object files containing the single-cell whole and nascent tx expression count matrices, respectively.
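
The T>C identification step can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's code: alignments are assumed to be ungapped, the strand of each read is assumed to be known, base qualities are ignored, and the names `count_tc_mutations`, `nascent_read_names`, and `min_mutations` are hypothetical.

```python
def count_tc_mutations(read_bases, ref_bases, chrom, start, strand, snp_sites):
    """Count T>C mismatches on one read, skipping known background SNP positions.

    On the minus strand a T>C conversion in the RNA appears as A>G on the reference.
    snp_sites is a set of (chrom, pos) taken from the background SNP vcf.
    """
    ref_from, read_to = ("T", "C") if strand == "+" else ("A", "G")
    n_mutations = 0
    for offset, (ref_b, read_b) in enumerate(zip(ref_bases, read_bases)):
        is_conversion = ref_b == ref_from and read_b == read_to
        if is_conversion and (chrom, start + offset) not in snp_sites:
            n_mutations += 1
    return n_mutations


def nascent_read_names(reads, snp_sites, min_mutations=1):
    """Return names of reads carrying at least min_mutations T>C mutations.

    min_mutations is a hypothetical threshold chosen for this sketch.
    """
    names = []
    for r in reads:
        n = count_tc_mutations(
            r["read"], r["ref"], r["chrom"], r["start"], r["strand"], snp_sites
        )
        if n >= min_mutations:
            names.append(r["name"])
    return names


snp_sites = {("chr1", 14)}
reads = [
    {"name": "r1", "chrom": "chr1", "start": 10, "strand": "+", "ref": "ATTGT", "read": "ACTGC"},
    {"name": "r2", "chrom": "chr1", "start": 10, "strand": "+", "ref": "ATTGT", "read": "ATTGT"},
    {"name": "r3", "chrom": "chr1", "start": 50, "strand": "-", "ref": "GAAC", "read": "GGAC"},
    {"name": "r4", "chrom": "chr1", "start": 13, "strand": "+", "ref": "TT", "read": "TC"},
]
nascent = nascent_read_names(reads, snp_sites)
```

In this example `r4` is not reported because its only T>C mismatch falls on a background SNP site.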
- Fastq input: Paired-end PerturbSci-Kinetics demultiplexed sgRNA fastq files.
- Cutoff: only cell barcodes with more sgRNA UMIs than this cutoff will be considered.
- SgRNA correction file: pickle files containing all valid sgRNA sequences with at most 1 mismatch.
- SgRNA annotation df: a txt file listing the sgRNA names and the corresponding gene symbols. It is used during the expression matrix construction.
- Other parameters are roughly the same as above.
- Change file names of fastq to make them callable in the following steps.
- One-step sgRNA identification, de-duplication, and counting.
- Re-format the single-cell sgRNA expression matrix.
- An R object file containing a single-cell sgRNA expression matrix.
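
A minimal sketch of sgRNA identification, de-duplication, and counting is shown below. It assumes the correction file behaves like a lookup table from every sequence within one mismatch to its valid sgRNA sequence. That table is rebuilt in memory here rather than loaded from a pickle, and the function names and the `(cell, umi, sequence)` read layout are hypothetical.

```python
from collections import defaultdict


def one_mismatch_table(whitelist):
    """Map each valid sequence and each of its 1-mismatch variants to the valid sequence."""
    whitelist = set(whitelist)
    table = {seq: seq for seq in whitelist}
    ambiguous = set()
    for seq in whitelist:
        for i in range(len(seq)):
            for base in "ACGTN":
                if base == seq[i]:
                    continue
                variant = seq[:i] + base + seq[i + 1:]
                if variant in whitelist:
                    continue  # exact matches always win
                if variant in table and table[variant] != seq:
                    ambiguous.add(variant)  # within 1 mismatch of two sgRNAs
                else:
                    table[variant] = seq
    for variant in ambiguous:
        del table[variant]
    return table


def count_sgrna_umis(reads, table, cutoff):
    """Count unique UMIs per (cell, sgRNA) and keep cells with total UMIs > cutoff."""
    umis = defaultdict(set)
    for cell, umi, seq in reads:
        sgrna = table.get(seq)
        if sgrna is not None:
            umis[(cell, sgrna)].add(umi)  # set membership removes duplicate UMIs
    per_cell = defaultdict(dict)
    for (cell, sgrna), seen in umis.items():
        per_cell[cell][sgrna] = len(seen)
    return {c: counts for c, counts in per_cell.items() if sum(counts.values()) > cutoff}


table = one_mismatch_table(["ACGT", "TTTT"])
reads = [
    ("c1", "u1", "ACGT"),
    ("c1", "u1", "ACGT"),  # duplicate of the read above
    ("c1", "u2", "ACGA"),  # one mismatch from ACGT
    ("c1", "u3", "TTTT"),
    ("c2", "u1", "GGGG"),  # matches no sgRNA
    ("c2", "u1", "TTTA"),  # one mismatch from TTTT
]
counts = count_sgrna_umis(reads, table, cutoff=1)
```

Variants that lie within one mismatch of two different sgRNAs are dropped from the table in this sketch, which is one possible way to avoid ambiguous assignments.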
- Parameters are roughly the same as those in single-cell processing scripts.
- Change file names of fastq to make them callable in the following steps.
- Attach UMI sequences on R1 to headers of R2.
- Trim potential adapter sequences from the 3'end of R1 and R2.
- STAR alignment.
- Filter aligned reads.
- PCR duplicates removal by picard.
- Transform the alignment information in sams to tables at the single-base level.
- Split the alignment info table into small sub tables.
- Identify T>C mutations on each read pair.
- Merge mutation info identified from all sub tables under one sample, and extract names of nascent reads.
- Extract nascent reads from sams.
- Gene-level feature counting on both whole/nascent bams and re-format the gene expression matrix.
- Two R object files containing the gene x sample whole and nascent tx expression count matrices, respectively.
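
The split, identify, and merge steps above can be sketched as follows. The row layout `(read_name, ref_base, read_base)`, the function names, and the `max_rows` parameter are assumptions for illustration. Strand handling and SNP filtering are left out to keep the example short.

```python
from itertools import groupby


def split_by_read(rows, max_rows):
    """Yield sub tables of roughly max_rows rows without splitting a read pair.

    rows must be sorted by read name so that all rows of a pair are adjacent.
    """
    chunk = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        if chunk and len(chunk) + len(group) > max_rows:
            yield chunk
            chunk = []
        chunk.extend(group)
    if chunk:
        yield chunk


def mutated_reads(chunk):
    """Return the names of read pairs with at least one T>C row in this sub table."""
    return {name for name, ref_b, read_b in chunk if ref_b == "T" and read_b == "C"}


def merge_results(per_chunk_results):
    """Merge the per-sub-table name sets into one set for the sample."""
    return set().union(*per_chunk_results)


rows = [
    ("p1", "T", "C"),
    ("p1", "A", "A"),
    ("p2", "T", "T"),
    ("p2", "G", "G"),
    ("p3", "T", "C"),
]
chunks = list(split_by_read(rows, max_rows=3))
nascent_names = merge_results(mutated_reads(c) for c in chunks)
```

Because each sub table is processed independently, the per-chunk calls could be distributed across cores before merging.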
- filter_dT_cells(): Get the single-cell whole tx expression matrix from the output R object of the preprocessing script.
- gene_id2gene_names(): Convert gene ids to gene symbols using the matched gtf file.
- gRNA_cell_reformatting(): Read and reformat the sgRNA single-cell expression matrix to make it compatible with the integration with whole tx info.
- match_whole_nascent_txme_with_gRNA(): Integrate whole tx data with sgRNA info, identify sgRNA-based singlets, and return an integrated object.
- synth_deg_bootstrapping_NTC_vs_KD(): Calculate synthesis and degradation rates on cell populations. Also perform permutation tests between perturbations and NTC to examine the statistical significance.
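
The permutation test mentioned for synth_deg_bootstrapping_NTC_vs_KD() can be illustrated with a generic sketch. The test statistic used here (absolute difference in means), the function name `permutation_p_value`, and its parameters are assumptions. The source does not specify how the rates are estimated or which statistic is compared, so this only shows the general technique.

```python
import random
from statistics import mean


def permutation_p_value(kd_values, ntc_values, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for the difference in means between two groups.

    kd_values:  per-replicate estimates for a perturbation (hypothetical input)
    ntc_values: per-replicate estimates for the NTC group (hypothetical input)
    """
    rng = random.Random(seed)
    observed = abs(mean(kd_values) - mean(ntc_values))
    pooled = list(kd_values) + list(ntc_values)
    n_kd = len(kd_values)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(mean(pooled[:n_kd]) - mean(pooled[n_kd:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    return (hits + 1) / (n_permutations + 1)


p_same = permutation_p_value([1, 2, 3], [1, 2, 3])
p_shifted = permutation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4])
```

A small p-value indicates that a difference as large as the observed one rarely arises when group labels are assigned at random.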