An insertion caller for Illumina paired-end WGS data.
There are three options for obtaining INSurVeyor: conda, singularity and compiling the source code.
To avoid conflicts with other packages, it is recommended to install insurveyor in a dedicated environment:
conda create -n insurveyor-env -c bioconda -c conda-forge insurveyor
conda activate insurveyor-env
insurveyor.py --version
The latest image can be obtained under Releases.
In order to compile the code, the following are required:
- A C and a C++ compiler are required. If the GCC suite is used, version 4.9.3 or above is required.
- A recent version of CMake (3.5 or above)
The following commands should be sufficient:
git clone https://github.com/kensung-lab/INSurVeyor
cd INSurVeyor/
./build_htslib.sh
cmake -DCMAKE_BUILD_TYPE=Release . && make
It is recommended to build the version of htslib provided with the code by using the provided build_htslib.sh. autoconf (https://www.gnu.org/software/autoconf/) and zlib development files (http://zlib.net) are required. On Ubuntu, they can be installed with
sudo apt-get install autoconf zlib1g-dev
Alternatively, build_htslib_full.sh will build a version of htslib with support for bz2, lzma and curl, but development files for all three must be installed.
If the provided htslib is not built correctly, the building process will search for an installed copy. Version 1.13 or higher is required.
For further information on htslib, please refer to https://github.com/samtools/htslib
Python is necessary to run INSurVeyor. Libraries NumPy (http://www.numpy.org/), PyFaidx (https://github.com/mdshw5/pyfaidx) and PySam (https://github.com/pysam-developers/pysam) are required. If Python 2 is used, numpy 1.16.6, pyfaidx 0.5.9 and pysam 0.16.0.1 are the recommended (i.e., tested) versions. If Python 3 is used, then numpy 1.21.2, pyfaidx 0.5.9.1 and pysam 0.16.0.1 were tested.
INSurVeyor needs a BAM/CRAM file, a (possibly empty) working directory and reference genome in FASTA format. The BAM/CRAM file must be coordinate-sorted and indexed. Furthermore, the MC and the MQ tag must be present for all primary alignments, when applicable.
Recent versions of BWA MEM (0.7.17) will add the MC tag. The easiest (but probably not the fastest) way to add the MQ tag is to use Picard FixMateInformation (http://broadinstitute.github.io/picard/command-line-overview.html#FixMateInformation)
java -jar picard.jar FixMateInformation I=file.bam
If you are using the singularity image, you can run INSurVeyor with the following command:
singularity run insurveyor.sif --threads N_THREADS BAM_FILE WORKDIR REFERENCE_FASTA
The command
singularity run insurveyor.sif
will print the help message of the software.
If you compiled the code, you can use
python insurveyor.py --threads N_THREADS BAM_FILE WORKDIR REFERENCE_FASTA
If you installed INSurVeyor using conda, the command is
insurveyor.py --threads N_THREADS BAM_FILE WORKDIR REFERENCE_FASTA
The output is a standard VCF file. It will be placed under WORKDIR/out.pass.vcf.gz. These are the insertions that INSurVeyor deemed confident enough.
The file WORKDIR/out.vcf.gz contains all of the insertions, including those that did not pass the filters. Most of them will be false positives. It is not recommend to you use this file unless for specific situations (e.g., you are looking for something specific).
For long insertions (in theory longer than twice the maximum valid insert size, in practice it depends on the quality of the library), INSurVeyor may not be able to assemble the whole sequence. In this case, the insertion will contain an INFO/INCOMPLETE_ASSEMBLY flag, and SVINSSEQ will contain a prefix and a suffix of the insertion, joined by a dash "-".
A demo is provided in the folder demo. You can run it with the command
mkdir workdir
python insurveyor.py demo/reads.bam workdir/ demo/ref.fa
The demo should run in less than a minute. If it ran successfully, a file workdir/out.pass.vcf.gz will be generated. It should contain two insertions and they should look like these:
ref 170872 T_INS_0 T <INS> . PASS END=170872;SVTYPE=INS;SVLEN=339;SVINSSEQ=AAGAAGTAGAGGAATGGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCATGAGGTCAGGAGATCGAGACCATCCTGGCTAACAAGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGCGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAAGCGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTCCGCAGTCCGGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA;SPLIT_READS=10,16;DISCORDANT=30,50;ALGORITHM=transurveyor;LEFT_ANCHOR=170448-170871;RIGHT_ANCHOR=170872-171336;LEFT_ANCHOR_CIGAR=7X417=190S;RIGHT_ANCHOR_CIGAR=149S458=7X;AVG_STABLE_NM=0,0;STABLE_DEPTHS=28,35;SPANNING_READS=0,0;MHLEN=0;SCORES=1,1 GT 1
ref 714813 A_INS_0 a <INS> . PASS END=714814;SVTYPE=INS;SVLEN=78;SVINSSEQ=GTATAGTATATACTGTATATACTATATAGTATAGTATATACTGTATATACTATATAGTATAGTATATACTGTATATAC;SPLIT_READS=22,16;DISCORDANT=20,25;ALGORITHM=assembly;LEFT_ANCHOR=714722-714812;RIGHT_ANCHOR=714814-714903;AVG_STABLE_NM=0,0;STABLE_DEPTHS=28,26;SPANNING_READS=27,27 GT 1
In many cases the user will need a complete set of SVs (i.e., deletions, tandem duplications, inversions, etc.). In this case, we currently recommend to use Manta (https://github.com/Illumina/manta) as a SV caller, and to replace the insertions provided by Manta with the insertions provided by INSurVeyor. This is simple and can be done with bcftools. Note that the output of Manta and INSurVeyor must have the same sample; if not, you can use bcftools reheader.
bcftools view manta.vcf.gz -i "SVTYPE!='INS'" -O z -o manta.noINS.vcf.gz
tabix -p vcf manta.noINS.vcf.gz
tabix -p vcf surveyor.vcf.gz
bcftools concat -a manta.noINS.vcf.gz surveyor.vcf.gz -O z -o merged.vcf.gz
Alternatively, if the user prefers a slightly more complete set of insertions, and does not mind having duplicated insertions, they can retain Manta insertions when merging.
tabix -p vcf manta.vcf.gz
tabix -p vcf surveyor.vcf.gz
bcftools concat -a manta.vcf.gz surveyor.vcf.gz -O z -o merged.vcf.gz
When analysing populations, it is often useful to obtain a multi-sample VCF that merges/clusters SVs from different samples. In order to achieve this, you can use https://github.com/Mesh89/SurVClusterer Other alternatives include https://github.com/DecodeGenetics/svimmer and https://github.com/mkirsche/Jasmine , although we have not personally tested them.
Rajaby, R., Liu, DX., Au, C.H. et al. INSurVeyor: improving insertion calling from short read sequencing data. Nat Commun 14, 3243 (2023). https://doi.org/10.1038/s41467-023-38870-2