A snakemake
workflow for simulating integration of a virus/vector into a host.
The repo includes a host and viral reference for testing your installation. To run this test, you will need to have one of conda
, docker
or singularity
installed.
If you have conda
installed, you can run the pipeline with snakemake
, which automatically downloads dependencies using conda
.
# clone git repo
git clone https://github.com/aehrc/vector-integration-simulation-pipeline.git
cd vector-integration-simulation-pipeline
# create conda environment called 'snakemake' containing snakemake
conda create -n snakemake -c conda-forge -c bioconda snakemake -y
conda activate snakemake
# run with test data
snakemake --configfile test/config/simulation.yml --jobs 1 --use-conda --conda-frontend conda
You can find out more about snakemake options in the snakemake documentation.
To run with your own host and viral references, you will need to specify these in the config file (see below).
Another way to run is using the docker
container, which contains all dependencies, with either docker
or singularity
. To get the results from inside the container, you will need to bind-mount the directory where you would like the results to be written when you run the container. For example:
docker run \
--rm -it \
-v ${PWD}/test_results:/opt/simvi/test/out szsctt/simvi:latest \
snakemake --jobs 1 --configfile test/config/simulation.yml
The results should appear in a directory created in the current working directory called test_results
. Compare these results to those in the directory example_results
.
To run with your own host and viral references, you will need to specify these in the config file (see below), and bind-mount the config file and the references into the container.
It's also possible to run the container with singularity
instead of docker
, which requires slightly different bind-mounts:
singularity pull simvi.sif docker://szsctt/simvi:latest
mkdir -p test_results/.snakemake
singularity exec \
-B ${PWD}/test_results:/opt/simvi/test/out \
-B ${PWD}/test_results/.snakemake:/opt/simvi/.snakemake \
simvi.sif \
/opt/simvi/run_singularity.sh --configfile /opt/simvi/test/config/simulation.yml --jobs 1
You can also give additional arugments which will be passed to snakemake
.
Note that since singularity containers are read-only, we need to also bind-mount a folder for snakemake to write to (in this case, /opt/simvi/.snakemake
inside the container to ${PWD}/test_results/.snakemake
outside the container).
In order to simulate data, snakemake
will:
- Parses the config file to generate all combinations of the specified paramters. Each combination of parameters is a 'condition', and each condition has one or more 'replicates'. All replicates and conditions have a different random seed.
- Simulates integration using the
python3
scriptscripts/insert_virus.py
. This script outputs afasta
file consisting of the host with viral sequences inserted, and two files describing the location and properties of the integrated virus and episomes art_illumina
is used to simulate paried-end reads based on thefasta
file from the previous step- A script
scripts/annotate_reads.py
annotates the reads crossing host/viral junctions for each integration
For each dataset, the reads can be found in the output directory under sim_reads
, and the information about the integrations can be found under sim_ints.
As outlined in the QuickStart section above, running the pipeline requires either snakemake
and conda
/mamba
, or docker
/singularity
. If using conda
, other dependencies are automatically downloaded by snakemake
(using the --use-conda
) argument using the environment files in the envs
directory. If using the container with docker
/singularity
, dependences are already installed inside the container.
Specify all inputs and options in a config file, which is provided to snakemake
. An example config file can be found at test/config/simulation.yml
.
The config file is organised into datasets - in the example above, there is one dataset test
. A config file should have one or more datasets. The results for each dataset will be output in a directory with the same name as the dataset (test
in the case of the supplied config).
If you use docker
or singularity
to run the pipeline with your own references, make sure to bind-mount your data and output directories into the container, and that any paths in your config file are correct inside the container.
For each dataset, the following parameters should be specified in the config file.
Specify the output directory with the key out_directory
. Output files can be found in this directory, under the dataset name. The path should be either absolute, or relative to the directory in which you run the workflow (usually the directory which you cloned the repo into). If you are using the container version, the Snakefile
is located at /opt/simvi/Snakefile
.
Use the keys hosts
and viruses
to specify a dict of host and viruses, respectivley, in which the keys in each dict are a name for that reference, and the value is the path (absolute or relative to the snakefile) to that reference in fasta
format.
Each unique combination of simulation parameters (including the host and viral references specified above) is a 'condition'. Specify the number of replicates to perform for each condition. Each replicate will have a random seed - specify the seed for the first condition with the key initial_seed
, and each additional replicate will have a seed incremented by the key seed_increment
. This random seed is used for both simuating integrations (python script), and simulating reads (art_illumina
).
The user may specify a number of desired integration properties. Most properties are specified as either probabilities (start with p_
) or the mean of a Poisson distribution (lambda_
).
int_num
: The number of integrationsmin_sep
: The minimum separation (in bp) between each integrationepi_num
: The number of extra viral sequences (episomes) added to the output fasta. Episomes are also subject to being whole/subsequences, rearrangements and deletionsp_whole
: Probability that each integration/episome will consist of the whole viral sequence.min_len
: If integration/episome is not whole, its minimum length (in bp)p_rearrange
: Probability that an integration/episome is rearrangedp_delete
: Probability that an integration/episome has a deletionlambda_split
: During rearrangmenet/deletion, number of fragments in which to split the viral sequencep_overlap
: Probability of an overlap (common sequence between host and virus) at each junctionp_gap
: Probability of a gap (random bases added between host and virus) at each junctionlambda_junction
: Mean of Poisson distribution of number of bases involved in each gap/overlap junctionp_host_deletion
: Probability of a deletion from the host at the integration sitelambda_host_deletion
: Mean of Poisson distribution of number of bases deleted from host at integration site
The parameters read_len
(read length), fcov
(fold coverage), frag_len
(mean fragment length), frag_std
(standard deviation of fragment length), and seq_sys
will be used during read simulation with art_illumina
. Further details of these paramters can be found at the art manpage.
The main outputs of the pipeline are:
- Fasta file containing host sequence with integrated viral sequences (and episomes, if appropriate)
- Paired-end reads from
art_illumina
in fastq and sam format - Text file containing location and properties of each integration, in a tab-separated format
The text file with the properties of each integration ('*int-info.tsv') has the following columns:
id
: a unique number for each integrationchr
: the name of the host reference chromosome/contig in which integration occurredhPos
: position in the original host reference chromosome/contig at which integration occurredleftStart
,leftStop
: coordinates of the ambiguous bases (containing a gap or overlap if there is one) on the left side of the integration in the output fasta file. This will only be the same ashPos
for the first integration on each chromsome/contigrightStart
,rightStop
: same asleftStart
,leftStop
but for the right side of the integrationhDeleted
: number of bases deleted from the host chromosome/contig at the integration site. Deleted bases occur after the right side of the integrationhDeleted_input_fasta
: probably just ignore this column - it should be the same ashDeleted
virus
: name of viral reference from which integrated sequence was takenvBreakpoints
: a list of the parts of the virus which were integrated. For example [0, 1611] means that bases 0 through 1611 from the viral reference were integrated. [1400, 1531];[1335, 1400] means that there was a rearrangement: bases 1400 through 1531 and bases 1335 through 1400 were insertedvOris
: the orientation of each part of the virus listed in vBreakpoints, + for sense and - for antisensejuncTypes
: the type of the left and right junctions. 'clean' means there's nothing between the host and viral sequence, 'gap' means random bases were inserted at the jucntion, 'overlap' means that there was homology between the host and vector at the junction. This is a comma-separated list with two elements - the first is the left junction, and the second is the right junction.juncBases
: the sequences at the left and right junctions. This is a comma separated list with two elements - the first is for the left junction and the second is for the rightjuncLengths
: the number of bases involved in the left and right junctions.This is a comma separated list with two elements - the first is for the left junction and the second is for the rightwhole
: True if the whole virus was integrated, False otherwiserearrangement
: True if the integration involved a rearrangement (integration of two or more non-contiguous pieces of the viral reference), False otherwisedeletion
: True if the part of the virus integrated harbors a deletion, false otherwisen_swaps
: If there was a rearrangement, the number of times that pieces of the integrated part of the virus were swappedn_delete
: If there was a deletion, the number of pieces deleted from the integrated part of the virus
After annotation with simulated reads, the following columns are added ('*int-info.annotated.tsv'):
left_chimeric
: Read IDs of the chimeric reads that span the left junctionright_chimeric
: Read IDs of the chimeric reads that span the left junctionleft_discord
: Read pairs which straddle the left junction, so there's one read aligned to host and one to virusright_discord
: Read pairs which straddle the right junction, so there's one read aligned to virus and one to hostmultiple_discord
: Read pairs which span multiple integration junctions, with one read in host and the other in virus/vector. These should be called as integrations, but it's not clear what the coordinates of the integration should be.fake_discord
: read pairs where both reads map to either host or vector, but span more than one integration. In other words, read pairs where one both reads map to either host or vector, but are interrupted by a sequence of the other type. For example, a pair where one read maps to the virus of one integration, and the other to the virus of the next integration. These should not be called as integrations, but will appear to have a longer template length than expected when mappeddiscordant_and_chimeric
: Read pairs which are both discordant (because one read maps to host, and the other to vector, although one of the reads may not be mapped in its entirety), but also one member of the pair is chimeric
The text file with the properties of each episomal sequence ('*epi-info.tsv') has the following columns:
id
: a unique number for each episomevirus
: name of viral reference from which episomal sequence was takenstart
,stop
: coordinates of viral reference from which episomal sequence was takenpieces
: a list of the parts of the virus which are episomal. For example [0, 1611] means that bases 0 through 1611 from the viral reference constitute the episome. [1400, 1531];[1335, 1400] means that there was a rearrangement: the episome consists of bases 1400 through 1531 and bases 1335 through 1400oris
: the orientation of each part of the virus listed in pieces column, + for sense and - for antisenseis_whole
: True if the whole virus is episomal, False otherwiseis_rearrange
: True if the episome is rearranged (episome of two or more non-contiguous pieces of the viral reference), False otherwiseis_deletion
: True if the episome harbors a deletion, false otherwisen_swaps
: If there was a rearrangement, the number of times that pieces of the episome were randomly swappedn_delete
: If there was a deletion, the number of pieces deleted from the episome