- Copy Assembly
- Index Assembly
- Copy SNPable Mask
- Convert SNPable Mask
- Subset Assembly
- Link Long Assembly
- Index Long Assembly
- Create Samples List
- Copy Per-Sample Alignments
- Approximate Depth
- Mask
- Multihetsep
- MSMC
- Generate Random Seed
- Multihetsep
- MSMC
- Combine Bootstrap Results
- Plot
These scripts assume a specific directory structure and naming scheme. You can modify the scripts to avoid it, but following the expected layout as-is is fairly straightforward.
First, create the directory structure:
mkdir some_project_dir
cd some_project_dir
mkdir -p data job_files/{success,failed}
git clone https://github.com/pickettbd/msmc-slurmPipeline scripts-msmc
When you run scripts stored in the scripts-msmc directory, you must run them from the main project directory (some_project_dir), not from the scripts-msmc or data directories.
This pipeline requires a "reference" genome assembly as input, which is handled in step 00. Technically, per-sample read data is also required, but this pipeline assumes you have already generated alignments from those reads (step 08), so the reads themselves are ignored. Two additional input files are required:
data/species.txt
To generate intelligent SLURM job names, the species this analysis is being performed on is incorporated into the job names; this name is read from a file called data/species.txt. It must contain exactly one line with a single string (no whitespace!); abbreviations are encouraged for brevity. For example, we ran this analysis on the bluefin trevally, so our entry in this file was bft. No check is performed to ensure this file exists, so make sure you create it and put a value in it.
data/genomeSize.txt
This information is required for the bootstrapping steps. You must provide the expected haploid genome size in bases (i.e., no k/m/g suffixes allowed) on a single line; only one entry is allowed. For example, with the bluefin trevally, our value was 782400000 (782.4 Mb). Commas are allowed (they are simply ignored), so 782,400,000 would also be sufficient.
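For example, both files can be created from the shell (using the bluefin trevally values from above):
# substitute your own species abbreviation and genome size
echo "bft" > data/species.txt
echo "782400000" > data/genomeSize.txt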
The script associated with this step (00-copyAsm.sh) is just a placeholder. If you run it, it will display a brief message reminding you what needs to be done during this step; running the script does not complete the step for you.
The goal is to copy the "reference" genome (a de novo assembly is okay) and make it accessible to the scripts in this pipeline. The assembly, in fasta format, must end up at data/assembly/asm.fa. To accomplish this, you could do the following:
mkdir -p data/assembly
cd data/assembly
ln -s /path/to/my/scaffolds.fa asm.fa
cd ../..
Unlike step 00, the script associated with this step (01-indexAsm.sh) will actually complete the step for you.
The goal is to have a Samtools faidx index of the assembly located at data/assembly/asm.fa.fai.
If you already have this, you may simply copy or link the file to this location and skip running the script.
If not, run the script and it will generate the index for you.
This script runs the job "locally" on the node you are using because fasta indexes usually take a few seconds to create, making them not worth the overhead required for a full (SLURM) "job".
To run this script, first ensure you are in the main project directory (i.e., the parent directory of data, job_files, and scripts-msmc). Then run: ./scripts-msmc/01-indexAsm.sh
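If you prefer to build the index by hand instead (assuming samtools is available in your environment), the script's work amounts to:
# creates data/assembly/asm.fa.fai alongside the assembly
samtools faidx data/assembly/asm.fa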
If you have multiple samples, you will need a SNPable mask. If you have only one sample, please skip to step 04.
Similar to step 00, this step is a placeholder.
Generating a SNPable mask is beyond the scope of this repository, but a separate repository provides a pipeline for generating one.
Once you have a SNPable mask, copy or link it to data/assembly/asm_snpable-mask.fa.gz. For example:
cd data/assembly
ln -s /path/to/where/the/SNPable_mask/mask_35_50.fa.gz asm_snpable-mask.fa.gz
cd -
MSMC requires a SNPable mask in bed format instead of fasta format.
The script associated with this step (03-convertSNPableMaskToBed.slurm) will run a modified version of makeMappabilityMask.py from msmc-tools. It runs on the SLURM-controlled cluster as a job; submit it by running the following command from the main project directory: ./scripts-msmc/03-convertSNPableMaskToBed.submit
The MSMC manual suggests dropping short scaffolds/contigs from your assembly if you have contigs/scaffolds instead of chromosomes. This step creates a copy of the assembly with all contigs/scaffolds shorter than 500 kb dropped. To change the cutoff, edit the scripts.
Similar to step 01, this step runs "locally": run ./scripts-msmc/04-subsetAsm.sh. It will create a copy of the assembly, without the shortest sequences, at data/assembly/asm_ge500kb.fa.
You may skip this step if you do not wish to subset your assembly or if it is already sufficiently contiguous (e.g., chromosome-level).
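For reference, the subsetting logic amounts to something like this sketch (assuming samtools >= 1.9 for --region-file and the index from step 01; long_seq_names.txt is a hypothetical scratch file):
# keep only sequence names whose length (column 2 of the .fai) is >= 500 kb
awk '$2 >= 500000 { print $1 }' data/assembly/asm.fa.fai > data/assembly/long_seq_names.txt
# extract those sequences into the subset assembly
samtools faidx --region-file data/assembly/long_seq_names.txt data/assembly/asm.fa > data/assembly/asm_ge500kb.fa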
This step is another placeholder. You need to put the "long" assembly at data/assembly/asm_long.fa.
If you followed step 04, you will need to run the following:
cd data/assembly
ln -s asm_ge500kb.fa asm_long.fa
cd -
If you skipped step 04 because your assembly was already sufficiently contiguous, point to the original assembly instead:
cd data/assembly
ln -s asm.fa asm_long.fa
cd -
More generally, wherever your assembly (with only the longest sequences) is, copy or link it:
cd data/assembly
ln -s /path/to/long_subset_assembly/asm.fa asm_long.fa
cd -
Similar to step 01, the script associated with this step (06-indexLongAsm.sh) will actually complete the step for you. The goal is to have a Samtools faidx index of the "long" assembly located at data/assembly/asm_long.fa.fai.
If you already have this (e.g., if you skipped subsetting in step 04), simply copy or link the existing *.fai file. Otherwise, run this step from the main project directory: ./scripts-msmc/06-indexLongAsm.sh
MSMC needs masks based on alignments of per-sample sequences to the assembly. These masks are not the SNPable mask: the SNPable mask is sample-independent, while these masks are generated on a per-sample basis. The per-sample sequences can be short (e.g., Illumina) or long (e.g., PacBio) reads; if they are PacBio reads, they should be self-corrected. The assembly must be the "long" assembly, not the original assembly (unless the two are the same, which is fine).
For some of the remaining steps, actions are performed on a per-sample basis.
To reliably complete the correct tasks and assess their completeness, a simple list of samples must be compiled.
Create a file called data/samples.list with one record per line; each record is the name of a sample.
The sample names need not follow a pattern (e.g., sample1, sample2, ..., sampleN); arbitrary names (e.g., sally, bob, ..., mae) are perfectly valid.
The sample identifiers should be alphanumeric strings without whitespace.
If you have only one sample, you must still create this file!
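For example, for two hypothetical samples named sample1 and sample2:
printf "%s\n" sample1 sample2 > data/samples.list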
As with the SNPable mask, creating the alignments is beyond the scope of this repository; however, a separate repository can align self-corrected PacBio reads to a genome assembly with Minimap2.
The final alignments for each sample must be in bam format and named after the following pattern: data/alns/${SAMPLE_NAME}-pacbio-reads_x_asm-long.bam. ${SAMPLE_NAME} must match the sample identifiers in data/samples.list created in the previous step. The bam index file must also be copied or linked. Both files can be copied or linked (shown for a single sample called "sample1") to these locations:
mkdir -p data/alns
cd data/alns
ln -s /path/to/my/sample1.bam sample1-pacbio-reads_x_asm-long.bam
ln -s /path/to/my/sample1.bam.bai sample1-pacbio-reads_x_asm-long.bam.bai
cd ../..
If your alignments already follow a naming pattern in another location, you could automate this process with something like the following:
cd data/alns
while read SAMPLE
do
ln -s /path/to/the/alignments/${SAMPLE}.bam ${SAMPLE}-pacbio-reads_x_asm-long.bam
ln -s /path/to/the/alignments/${SAMPLE}.bam.bai ${SAMPLE}-pacbio-reads_x_asm-long.bam.bai
done < ../samples.list
cd -
Later scripts require the sequencing depth for each sample, which can be approximated. The script 09-approxDepth.slurm will approximate the depth for each sample, each as a separate job on the cluster. To submit the jobs, simply run ./scripts-msmc/09-approxDepth.submit from the main project directory.
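Conceptually, the approximation resembles the following sketch (assuming samtools; the provided script may compute it differently):
# mean depth over all positions for one sample's alignments
samtools depth -a data/alns/sample1-pacbio-reads_x_asm-long.bam | awk '{ sum += $3; n++ } END { print sum / n }'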
To prepare to run MSMC, we need to create the input files. The next two steps do just that.
The script 10-mask.slurm will create the input mask file for a single "chromosome" (or scaffold/contig) for each sample. Each sample's chromosomes are submitted as separate jobs on the cluster via job arrays. To submit the jobs, simply run ./scripts-msmc/10-mask.submit from the main project directory. The output files will be in data/masks.
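In the standard MSMC workflow, per-sample masks come from msmc-tools' bamCaller.py; a representative invocation for one sample and one scaffold (the sequence name, depth value, and output names are illustrative, and the provided script may differ in details) looks like:
# assumes msmc-tools scripts are on your PATH; "30" is the approximate depth from step 09
samtools mpileup -q 20 -Q 20 -C 50 -u -r scaffold_1 -f data/assembly/asm_long.fa \
    data/alns/sample1-pacbio-reads_x_asm-long.bam \
  | bcftools call -c -V indels \
  | bamCaller.py 30 data/masks/sample1_scaffold_1_mask.bed.gz \
  | gzip -c > data/masks/sample1_scaffold_1.vcf.gz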
The script 11-multihetsep.slurm will create the input multihetsep file for a single "chromosome" (or scaffold/contig) for all samples together. Each chromosome will be a separate job on the cluster. To submit the jobs, simply run ./scripts-msmc/11-multihetsep.submit from the main project directory. The output files will be in data/multihetsep; there will be one file per "chromosome" matching the filename pattern ${SEQUENCE_HEADER}.multihetsep.txt.
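Under the hood, multihetsep generation in the MSMC workflow is typically handled by msmc-tools' generate_multihetsep.py; a representative invocation for one scaffold and two samples (all file names are illustrative) is:
# combine per-sample masks, the SNPable (mappability) mask, and per-sample VCFs into one multihetsep file
generate_multihetsep.py \
    --mask data/masks/sample1_scaffold_1_mask.bed.gz \
    --mask data/masks/sample2_scaffold_1_mask.bed.gz \
    --mask data/masks/scaffold_1_snpable-mask.bed.gz \
    data/masks/sample1_scaffold_1.vcf.gz data/masks/sample2_scaffold_1.vcf.gz \
    > data/multihetsep/scaffold_1.multihetsep.txt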
This is what we have all been waiting for. The input files have been generated; now we just run MSMC. The script 12-msmc.slurm will run MSMC for this set of samples on all chromosomes in a single run. It runs on the cluster as a job; to submit it, run ./scripts-msmc/12-msmc.submit from the main project directory. The output files will be in data/msmc.
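For orientation, the underlying MSMC call looks roughly like this (the output prefix and thread count are illustrative):
# one run over all per-scaffold multihetsep files; produces data/msmc/msmc.final.txt among others
msmc -t 8 -o data/msmc/msmc data/multihetsep/*.multihetsep.txt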
To increase our confidence in our results, we should perform bootstrap replicates of what we already did. That basically entails generating new multihetsep files for each bootstrap replicate.
To generate random bootstrap replicates of the input data, we need a random number.
You may generate this in any way you wish, but the number must be placed in data/random_seed-bootstrap.txt. This file must consist of a single line containing a single number: no whitespace and no non-numeric characters (only 0-9). These scripts assume a 16-bit unsigned integer (0-65535), though that may not be strictly necessary as far as the underlying programs are concerned.
The number is preserved in that file for the sake of reproducibility.
The script 13-getRandomSeed.sh will generate a random number for you and place it in that file, formatted correctly. To do so, simply run ./scripts-msmc/13-getRandomSeed.sh from the main project directory.
This step generates the bootstrap replicates of the multihetsep files. Each replicate will be located in its own directory under data/multihetsep/bootstrap. If we were doing 1,000 replicates, the directories would be named data/multihetsep/bootstrap/round_1, data/multihetsep/bootstrap/round_2, ..., data/multihetsep/bootstrap/round_1000.
Each of these directories will contain files with names identical to those in the original data/multihetsep directory (i.e., ${SEQUENCE_HEADER}.multihetsep.txt), though the file contents will differ.
The script 14-multihetsepBootstrapPrep.slurm will generate all of these replicates in a single go. The default number of replicates is 1,000; to change this, edit the scripts. To run this job on the cluster, run ./scripts-msmc/14-multihetsepBootstrapPrep.submit from the main project directory.
Each run of 15-msmcBootstrap.slurm will run MSMC for one set of samples on all chromosomes for a single bootstrap replicate. Each bootstrap replicate runs as a separate job on the cluster; to submit them, run ./scripts-msmc/15-msmcBootstrap.submit from the main project directory. The output files for each run will be located in separate directories matching the filepath pattern data/msmc/bootstrap/round_${ROUND_NUMBER}.
This step will combine all the bootstrapped MSMC output files (each called msmc.final.txt) into a single file called data/msmc/bootstrap/msmc-bootstrap_concat.tsv. An extra column, "Bootstrap_Round", will be prepended to make the data "tidy" for easy use with tidy structures in R.
This step is run locally instead of as a job on the cluster because it is relatively fast (<1 minute).
To combine these bootstrap results, run ./scripts-msmc/16-concatenateMsmcBootstrap.sh from the main project directory.
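The combination itself is simple; a minimal sketch of the idea follows (the provided script is the authoritative version):
# write the header once, then append every round's data rows with the round number prepended
{
    printf 'Bootstrap_Round\t'
    head -n 1 data/msmc/bootstrap/round_1/msmc.final.txt
    for d in data/msmc/bootstrap/round_*
    do
        awk -v r="${d##*_}" 'NR > 1 { print r "\t" $0 }' "${d}/msmc.final.txt"
    done
} > data/msmc/bootstrap/msmc-bootstrap_concat.tsv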
Finally, we need to visualize the results. Only one visualization is demonstrated here. Data visualization is often dataset-dependent; take care to explore your data manually rather than relying solely on the automated plotting script.
The script 17-plot.sh runs locally instead of as a job. It will run plot.R for you, creating data/plot.pdf. To create the PDF, run ./scripts-msmc/17-plot.sh from the main project directory.