Quality control of CRISPR edits using deep amplicon sequencing

DeepGenotype

Calculates the frequencies of protein-level mutations from deep-sequencing reads of CRISPR-edited cells

Features

  • Calculate genotypes with respect to protein/payload expressibility and correctness
  • Automatically finds coding regions
  • Supported CRISPR editing types: tagging/insertion and SNP/base-editing
  • Works with both Illumina and PacBio reads
  • Batch process a list of samples
  • Invokes CRISPResso2 to perform read quality-trimming, alignment, and DNA-level genotype calculation

Inputs

There are two required input files:

  • Fastq files (can be gzipped or not)
  • A csv file (examples are provided in example_csv); the columns are explained below

Outputs:

  • A result table, written as both a csv file and an xlsx file, containing per-sample information on:
    • Protein-level genotype frequencies
    • Two metrics that quantify DNA-level mismatches in edited alleles
      • weighted average of the percent identity of the reads (that aligned to the HDR amplicon)
      • weighted average of the number of mismatches of the reads (that aligned to the HDR amplicon)
  • CRISPResso2 output that includes (but is not limited to) the following:
    • Read alignment rate
    • Sequence-level genotype frequencies table
    • Read-to-genotype assignment information
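
The two DNA-level metrics above are weighted averages over the reads that aligned to the HDR amplicon. The following is a minimal sketch of such a calculation, assuming each distinct allele is weighted by its read count (the README does not spell out the weighting, so this is an assumption). The function name and the example numbers are hypothetical.

```python
def weighted_average(values, read_counts):
    """Average of `values`, each weighted by the number of reads supporting it."""
    total_reads = sum(read_counts)
    if total_reads == 0:
        raise ValueError("no reads to average over")
    return sum(v * n for v, n in zip(values, read_counts)) / total_reads


# Hypothetical alleles aligned to the HDR amplicon:
# percent identity of each allele and the number of reads carrying it.
percent_identity = [100.0, 99.0, 95.0]
read_counts = [80, 15, 5]

avg_identity = weighted_average(percent_identity, read_counts)
print(f"{avg_identity:.2f}")
```

The same function would apply to the number of mismatches per allele in place of percent identity.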

Installation

NOTE: if you installed DeepGenotype before 2025-01-15, please reinstall DeepGenotype to update CRISPResso2 to 2.3.1 to enable read quality-trimming.

create a conda environment and activate it

module load anaconda # if on the hpc
conda create -n DeepGenotype python=3.9
conda activate DeepGenotype

install CRISPResso2

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install CRISPResso2==2.3.1

verify CRISPResso2 installation

CRISPResso -h

clone DeepGenotype repo and install dependencies

git clone https://github.com/czbiohub-sf/DeepGenotype
cd DeepGenotype
pip install . # or pip install biopython==1.78 pandas requests openpyxl==3.1.2

verify DeepGenotype installation

cd DeepGenotype # must be in the DeepGenotype/DeepGenotype directory
python DeepGenotype.py
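
As an optional extra check, the sketch below reports whether the Python packages named in the pip command and the CRISPResso executable can be found in the active environment. It is not part of DeepGenotype; the function name check_environment is hypothetical, and the import name "Bio" for biopython is the only mapping assumed.

```python
import importlib.util
import shutil

# Import names for the pip packages listed above (biopython is imported as "Bio").
REQUIRED_MODULES = ["Bio", "pandas", "requests", "openpyxl"]


def check_environment():
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    status = {
        name: importlib.util.find_spec(name) is not None
        for name in REQUIRED_MODULES
    }
    # CRISPResso2 provides a command-line executable named CRISPResso.
    status["CRISPResso"] = shutil.which("CRISPResso") is not None
    return status


for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```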

 

Usage:

cd DeepGenotype # must be in the DeepGenotype/DeepGenotype directory
python DeepGenotype.py --path2csv example_csv/test.csv --path2workDir test_dir/ --path2fastqDir test_dir/fastq_dir/

All paths are relative to DeepGenotype.py
Please make sure the following two python scripts are in the same directory as DeepGenotype.py:
    process_alleles_freq_table_INS.py
    process_alleles_freq_table_SNP.py

Optional arguments

--fastq_R1_suffix    (default "_R1_001.fastq.gz")
--fastq_R2_suffix    (default "_R2_001.fastq.gz")
--single_fastq_suffix    (use this option for single-end reads as well as PacBio reads; you need to specify the suffix, e.g. fastq.gz)
--quantification_window_size    (default 50, which overrides CRISPResso2's default of 1)
--fastp_options_string    options to pass to fastp [default = '--cut_front --cut_tail --cut_mean_quality 30 --cut_window_size 30'], which performs quality trimming from both ends of each read using a sliding window of 30 and a mean quality threshold of 30; see the fastp documentation for more options
--n_processes    number of cores to use for parallel processing, use with caution since increasing this parameter will significantly increase the memory required [default=1]
--skip_crispresso    skip CRISPResso if results already exist [default=False]
--min_reads_post_filter    if minimum number of reads post filtering is unmet, CRISPResso will be run again with less stringent quality trimming [default=50]
--min_reads_for_genotype    if minimum number of reads for genotype is unmet, the genotype together with its reads will be dropped [default=3]
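
When launching many runs from a script, it can help to assemble the command programmatically. The sketch below only builds the argument list from flags documented above; build_command is a hypothetical helper, not part of DeepGenotype, and it assumes every option passed to it takes a value.

```python
import shlex


def build_command(path2csv, path2workDir, path2fastqDir, **options):
    """Assemble a DeepGenotype.py command line from the documented flags."""
    cmd = [
        "python", "DeepGenotype.py",
        "--path2csv", path2csv,
        "--path2workDir", path2workDir,
        "--path2fastqDir", path2fastqDir,
    ]
    # Each keyword argument becomes "--name value".
    for flag, value in options.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


cmd = build_command(
    "example_csv/test.csv", "test_dir/", "test_dir/fastq_dir/",
    quantification_window_size=50,
    min_reads_for_genotype=3,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from inside the DeepGenotype/DeepGenotype directory.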

 

To run the PacBio test dataset (insertion mode)

load conda, and activate the DeepGenotype conda environment

module load anaconda
conda activate DeepGenotype

run DeepGenotype

# in DeepGenotype/DeepGenotype directory
python DeepGenotype.py \
--path2csv example_csv/test_pacbio.csv \
--path2workDir test_PacBio \
--path2fastqDir test_PacBio/fastq \
--single_fastq_suffix .fastq

NOTE: to run DeepGenotype in the background (so that it is safe to close the terminal), prepend nohup and append & to the command (or use screen, tmux, etc.; instructions not listed here):

nohup python DeepGenotype.py \
--path2csv example_csv/test_pacbio.csv \
--path2workDir test_PacBio \
--path2fastqDir test_PacBio/fastq \
--single_fastq_suffix .fastq &

To check the terminal output (while running in the background)

cat nohup.out

The completed nohup.out should look like this

[DeepGenotype.py][INFO]  Genome edit type: INS
[DeepGenotype.py][INFO]  Processing sample: HEK-nocap-CLTA-R1_ccs.lbc89--lbc89.lbc89--lbc89
[DeepGenotype.py][INFO]  ...running CRISPResso
[DeepGenotype.py][INFO]  ...parsing allele frequency table and re-calculating allele frequencies
[DeepGenotype.py][INFO]  ...done
[DeepGenotype.py][INFO]  Processing sample: HEK-nocap-CLTA-R2_ccs.lbc90--lbc90.lbc90--lbc90
[DeepGenotype.py][INFO]  ...running CRISPResso
[DeepGenotype.py][INFO]  ...parsing allele frequency table and re-calculating allele frequencies
[DeepGenotype.py][INFO]  ...done
[DeepGenotype.py][INFO]  Processing sample: HEK-nocap-CLTA-R3_ccs.lbc91--lbc91.lbc91--lbc91
[DeepGenotype.py][INFO]  ...running CRISPResso
[DeepGenotype.py][INFO]  ...parsing allele frequency table and re-calculating allele frequencies
[DeepGenotype.py][INFO]  ...done
[DeepGenotype.py][INFO]  Done processing all samples in the csv file
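
For long runs, a small script can summarize nohup.out instead of reading it by eye. The sketch below relies only on the log lines shown above ("Processing sample: ..." and the final "Done processing all samples in the csv file"); summarize_log is a hypothetical helper, and the embedded log is a shortened copy of the example output.

```python
import re

SAMPLE_RE = re.compile(r"\[DeepGenotype\.py\]\[INFO\]\s+Processing sample: (\S+)")
FINISHED_MARKER = "Done processing all samples in the csv file"


def summarize_log(text):
    """Return (sample IDs seen so far, whether the completion line is present)."""
    samples = SAMPLE_RE.findall(text)
    finished = FINISHED_MARKER in text
    return samples, finished


example_log = """\
[DeepGenotype.py][INFO]  Genome edit type: INS
[DeepGenotype.py][INFO]  Processing sample: HEK-nocap-CLTA-R1_ccs.lbc89--lbc89.lbc89--lbc89
[DeepGenotype.py][INFO]  ...running CRISPResso
[DeepGenotype.py][INFO]  ...done
[DeepGenotype.py][INFO]  Done processing all samples in the csv file
"""

samples, finished = summarize_log(example_log)
print(len(samples), "sample(s) started; run finished:", finished)
```

To use it on a real run, read the file first, e.g. summarize_log(open("nohup.out").read()).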

 

Example 1: To run MiSeq test dataset (insertion mode)

load conda, and activate the DeepGenotype conda environment

module load anaconda
conda activate DeepGenotype

run DeepGenotype in the background (and thus safe to close the terminal)

nohup python DeepGenotype.py \
--path2csv example_csv/test_INS.csv \
--path2workDir test_MiSeq_INS \
--path2fastqDir test_MiSeq_INS/fastq &

To check the terminal output (while running in the background)

cat nohup.out

The completed nohup.out should look like this (only first 9 lines shown)

[DeepGenotype.py][INFO]  Genome edit type: INS
[DeepGenotype.py][INFO]  Processing sample: mNGplate19_sorted_A2_DDX6-C
[DeepGenotype.py][INFO]  ...running CRISPResso
[DeepGenotype.py][INFO]  ...parsing allele frequency table and re-calculating allele frequencies
[DeepGenotype.py][INFO]  ...done
[DeepGenotype.py][INFO]  Processing sample: mNGplate19_sorted_A3_LSM14A-N
[DeepGenotype.py][INFO]  ...running CRISPResso
[DeepGenotype.py][INFO]  ...parsing allele frequency table and re-calculating allele frequencies
[DeepGenotype.py][INFO]  ...done
...

 

Example 2: To run MiSeq test dataset (SNP mode)

load conda, and activate the DeepGenotype conda environment

module load anaconda
conda activate DeepGenotype

run DeepGenotype in the background (and thus safe to close the terminal)

nohup python DeepGenotype.py \
--path2csv example_csv/test_MiSeq_SNP.csv \
--path2workDir test_MiSeq_SNP \
--path2fastqDir test_MiSeq_SNP/fastq &

To check the terminal output (while running in the background)

cat nohup.out

The completed nohup.out should look like this (only first 9 lines shown)

[DeepGenotype.py][INFO]  Genome edit type: INS
[DeepGenotype.py][INFO]  Processing sample: mNGplate19_sorted_A2_DDX6-C
[DeepGenotype.py][INFO]  ...running CRISPResso
[DeepGenotype.py][INFO]  ...parsing allele frequency table and re-calculating allele frequencies
[DeepGenotype.py][INFO]  ...done
[DeepGenotype.py][INFO]  Processing sample: mNGplate19_sorted_A3_LSM14A-N
[DeepGenotype.py][INFO]  ...running CRISPResso
[DeepGenotype.py][INFO]  ...parsing allele frequency table and re-calculating allele frequencies
[DeepGenotype.py][INFO]  ...done
...

 

Explanation of columns in the input csv file

The input csv should contain the following columns with the exact names

  • Sample_ID (e.g. mNGplate19_sorted_A2_DDX6-C)
    Important note: For paired-end sequencing, only one Sample_ID is needed. We automatically find both R1 and R2 fastq files.
    Check fastq file suffix parameters --fastq_R1_suffix and --fastq_R2_suffix in the Usage section.
    For single-ended reads, set --single_fastq_suffix to the suffix of the fastq file.
    Also check if you need Fastq_extra_suffix (below)

  • gene_name (e.g. DDX6)

  • ENST_id (e.g. ENST00000620157)

  • WT_amplicon_sequence

  • HDR_amplicon_sequence

  • gRNA_sequence

  • edit_type (e.g. INS or SNP; note that deletions (DEL) are not supported at this point)
        INS = insertion, SNP = single nucleotide polymorphism, DEL = deletion

  • payload_block_index (e.g. 1 or 2 ...)
        Default is 1. This parameter is only needed when there are multiple blocks of SNPs or insertions/deletions between the WT and HDR amplicons.
        This parameter defines which block of SNPs or insertions/deletions is the payload.
        Blocks are ordered from left to right with respect to the amplicon sequence.
        For example: if there are 2 blocks of SNPs, where the first block is a recut SNP and the second block contains the SNPs of interest (the payload), then payload_block_index should be set to 2, and the first block of SNPs will be analyzed for protein-changing mutations if it is in the coding region.
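
To make the left-to-right block numbering concrete, here is a minimal sketch for the simplest case only: WT and HDR amplicons of equal length (SNP-only edits), where each run of consecutive mismatching positions is treated as one block. This definition of a block is an assumption for illustration; DeepGenotype's own block detection (which also has to handle insertions) may differ. The function name and the toy sequences are hypothetical.

```python
def mismatch_blocks(wt, hdr):
    """Return (start, end) spans (0-based, end exclusive) of consecutive mismatches."""
    if len(wt) != len(hdr):
        raise ValueError("this sketch only handles equal-length amplicons")
    blocks, start = [], None
    for i, (a, b) in enumerate(zip(wt, hdr)):
        if a != b and start is None:
            start = i                   # a new block opens
        elif a == b and start is not None:
            blocks.append((start, i))   # the current block closes
            start = None
    if start is not None:               # block running to the end of the amplicon
        blocks.append((start, len(wt)))
    return blocks


# Toy sequences: one single-base change, then a two-base change further right.
wt = "ACGTACGTACGTACGT"
hdr = "ACCTACGTACGAGCGT"

blocks = mismatch_blocks(wt, hdr)
for index, (start, end) in enumerate(blocks, start=1):
    print(index, start, end, wt[start:end], "->", hdr[start:end])
```

In this toy case, if the second block were the edit of interest and the first a recut SNP, payload_block_index would be 2.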

  • Fastq_extra_suffix (Optional)

       Extra suffix needed for mapping Sample_ID to corresponding fastq file names.

       Please note that the common (as opposed to extra) suffixes are the following values by default:
       "_R1_001.fastq.gz"
       "_R2_001.fastq.gz"

       For example if your fastq file names are:
                                                  "mNGplate19_sorted_A2_DDX6-C_S90_R1_001.fastq.gz"
                                                  "mNGplate19_sorted_A2_DDX6-C_S90_R2_001.fastq.gz"
        and your sample name is "mNGplate19_sorted_A2_DDX6-C", and "S90" is another variable part of the name
       Then you should add "S90" to the "Fastq_extra_suffix" column
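
Before starting a long run, it can be useful to sanity-check the csv header and the fastq naming. The sketch below is not DeepGenotype's own validation. It assumes that payload_block_index and Fastq_extra_suffix may be omitted (the first has a default, the second is marked optional), and it uses a deliberately loose file-name check: the name starts with the Sample_ID, ends with the common suffix, and contains the extra suffix in between. Both function names are hypothetical.

```python
import csv
import io

# Columns treated as mandatory in this sketch (an assumption, see above).
REQUIRED_COLUMNS = [
    "Sample_ID", "gene_name", "ENST_id", "WT_amplicon_sequence",
    "HDR_amplicon_sequence", "gRNA_sequence", "edit_type",
]


def missing_columns(csv_text):
    """Return the required column names that are absent from the csv header."""
    header = csv.DictReader(io.StringIO(csv_text)).fieldnames or []
    return [col for col in REQUIRED_COLUMNS if col not in header]


def looks_like_fastq_for(file_name, sample_id, extra_suffix="",
                         common_suffix="_R1_001.fastq.gz"):
    """Loose check that file_name is sample_id + (extra part) + common_suffix."""
    if not (file_name.startswith(sample_id) and file_name.endswith(common_suffix)):
        return False
    middle = file_name[len(sample_id):len(file_name) - len(common_suffix)]
    return extra_suffix in middle if extra_suffix else middle == ""


print(missing_columns("Sample_ID,gene_name,ENST_id\n"))
print(looks_like_fastq_for(
    "mNGplate19_sorted_A2_DDX6-C_S90_R1_001.fastq.gz",
    "mNGplate19_sorted_A2_DDX6-C",
    extra_suffix="S90",
))
```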

 

Helper commands

Paginated view of fastq files

compressed

gzip -dc example.fastq.gz | less

uncompressed

less example.fastq

Count the number of lines in a fastq file (divide by 4 to get the read count)

compressed

gzip -dc example.fastq.gz | wc -l

uncompressed

cat example.fastq | wc -l

Count the number of lines with a specific sequence in a fastq file

compressed

gzip -dc example.fastq.gz | grep -c 'your-sequence-here'

uncompressed

cat example.fastq | grep -c 'your-sequence-here'

for matching sequences at the beginning of the line, add a ^: '^your-sequence-here'
for matching sequences at the end of the line, add a $: 'your-sequence-here$'
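
The same counts can be obtained from Python, which is convenient when checking many files. This is a minimal sketch using only the standard library; it assumes the standard 4-lines-per-record FASTQ layout and treats the search sequence as a plain string rather than a regular expression. The function names are hypothetical.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz compression."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Number of reads, assuming 4 lines per FASTQ record."""
    with open_fastq(path) as handle:
        return sum(1 for _ in handle) // 4


def count_lines_with(path, sequence):
    """Number of lines containing `sequence` (like grep -c with a plain string)."""
    with open_fastq(path) as handle:
        return sum(1 for line in handle if sequence in line)
```

For example, count_reads("example.fastq.gz") returns the read count directly, and count_lines_with("example.fastq.gz", "your-sequence-here") mirrors the grep command above.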

License

This project is licensed under the BSD 3-Clause license - see the LICENSE file for details.
