Scripts used for the graph genome indexing
CACTUS was run on single chromosomes on an Eleanor cloud instance with 16 cores, 96 GB of RAM and 320 GB of disk space. HAL2VG was run on each chromosome separately on the same virtual machine. The script used to perform the alignments and the general configuration file for CACTUS are saved in the cactus folder. Downstream processing of each VG archive was performed on Eddie, the University of Edinburgh high performance computing platform.
To generate the indexes on an SGE cluster, you need nodes with at least 500 GB of RAM and at least 2 TB of available disk space (4 TB is better). Then simply run:
./submit.sh
The final graphs, together with the XG and GCSA indexes, will be written to ./GRAPH.
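Because the indexing step needs a large amount of scratch space, it can help to check the free space before submitting. The snippet below is a minimal sketch and not part of the repository: the helper names `free_gb` and `has_enough_space` are hypothetical, and the 2000 GB default simply mirrors the 2 TB figure given above. It only checks disk space in the chosen directory; the RAM requirement applies to the cluster nodes and is not checked here.

```shell
# Illustrative pre-flight check (hypothetical helpers, not shipped with the scripts).

# Print the free space, in whole GB, on the filesystem holding the given directory.
free_gb() {
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if the directory has at least the requested number of GB free.
# Arguments: directory (default .), minimum GB (default 2000, roughly 2 TB).
has_enough_space() {
    [ "$(free_gb "${1:-.}")" -ge "${2:-2000}" ]
}

# Example use: submit only when the working directory has enough room.
# has_enough_space . 2000 && ./submit.sh
```

Raising the second argument to 4000 corresponds to the more comfortable 4 TB figure.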
If needed, it is possible to add new variants to a pre-existing graph (see here). To do so, proceed as follows:
- Create a compliant VCF using the GraphVCF.py script (detailed use of the script can be seen in the submitted script GenerateGraphVCF.sh).
- List the new VCF files in a file.
- Run the AddToGraph.sh script in the scripts folder, providing the list of vg graphs to expand (-g), the list of VCF files to use (-v) and the reference genome to expand (-s).
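The last two steps above can be sketched as follows. Only the `-g`, `-v` and `-s` options come from the description; everything else is an assumption made for illustration: the file names, the `REFNAME` placeholder, the `./scripts/` path, and the idea that each list is a plain-text file with one path per line. The command is printed rather than executed so that it can be inspected first.

```shell
# Assumption: each list is a plain-text file with one path per line.
# write_list OUTFILE PATH... writes every PATH on its own line of OUTFILE.
write_list() {
    out="$1"
    shift
    printf '%s\n' "$@" > "$out"
}

# Work in a scratch directory so nothing in the repository is overwritten.
workdir="$(mktemp -d)"

# Hypothetical per-chromosome graphs and the matching VCFs made with GraphVCF.py.
write_list "$workdir/graphs.list" chr1.vg chr2.vg
write_list "$workdir/vcfs.list" chr1.new.vcf chr2.new.vcf

# REFNAME stands in for the reference genome to expand.
echo "./scripts/AddToGraph.sh -g $workdir/graphs.list -v $workdir/vcfs.list -s REFNAME"
```

Check GenerateGraphVCF.sh and AddToGraph.sh themselves for the exact input format they expect before relying on this layout.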
Downstream analyses were performed using the wrapper bagpipe, which supports WGS, ATAC-seq, RRBS and RNA-seq analyses on an SGE/UGE cluster environment. This pipeline also includes scripts to perform graph genome alignment and variant calling when a graph genome is provided.
## VCF metrics
VCF-based metrics and analyses can be found in the vcf_processing folder. Within this folder there are two subfolders:
- SV_SPECIFIC: identification of SVs specific to a single breed, starting from the vg VCF files.
- STATISTICS: calculation of metrics on the VCFs generated through the multiple analyses.
# ATAC-seq analyses
Scripts for processing the ATAC-seq results generated through bagpipe are collected in the ATA-seq folder, and are separated into two distinct scripts, numbered in the order in which they are run:
- Removal of the blank from the analysed samples
- Extraction of the peaks and cross-referencing for each genome
Non-reference sequences can be detected using the nextflow workflow included in detectSequences/nf-GraohSeq. The workflow requires nextflow to be installed.
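A quick way to confirm that nextflow is available before launching the workflow is sketched below. The helper name `have_tool` is hypothetical; the check only looks for the executable on the PATH and does not validate the workflow's own parameters, which are not described here.

```shell
# Succeed if the named executable can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool nextflow; then
    # Print the installed version for the record.
    nextflow -version
else
    echo "nextflow not found on PATH; install it before running the workflow" >&2
fi
```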
Andrea Talenti, Jessica Powell, Johanneke D Hemmink, Elizabeth AJ Cook, David Wragg, Siddharth Jayaraman, Edith Paxton, Chukwunonso Ezeasor, Emmanuel T Obishakin, Ebere R Agusi, Abdulfatai Tijjani, Karen Marshall, Andressa Fisch, Beatriz Ferreira, Ali Qasim, Umer N Chaudhry, Pamela Wiener, Philip Toye, Liam J Morrison, Timothy Connelley, James Prendergast. A cattle graph genome incorporating global breed diversity. bioRxiv 2021.06.23.449389; doi: https://doi.org/10.1101/2021.06.23.449389