Sanger Pipeline

Pipeline used by PPCG for Whole Genome Sequencing quality control, alignment and variant calling.


Libraries and Tools

Reference Files


If starting from BAM files, use samtools to group reads by name, split them into read 1 and read 2, and convert them to FASTQ:

samtools collate -u -O "$FILE" | samtools fastq -1 "${FILE%.bam}_r1.fastq" -2 "${FILE%.bam}_r2.fastq" -0 /dev/null -s /dev/null -n

Split by lane

FASTQ files were split by lane using fastqsplit. Note that fastqsplit requires gzipped files (pigz offers fast parallel compression) and Illumina FASTQ files in the Casava 1.8 or later format. For FASTQ files produced by versions older than Casava 1.8, a modified version of fastqsplit can be used.

The guideline for FASTQ file naming is {sample}_{flowcell}_{barcode}_L00{lane}_R1.fastq.gz
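
The naming guideline can be checked mechanically before conversion. The sketch below is illustrative only: the function names fastq_name and is_valid_fastq_name are hypothetical, and the pattern assumes that sample, flowcell and barcode contain no underscores, that lanes are single digits, and that the second read of a pair uses R2 in place of R1.

```shell
# Hypothetical helpers; not part of fastqsplit or biobambam2.

# Build a per-lane FASTQ name:
# {sample}_{flowcell}_{barcode}_L00{lane}_R{read}.fastq.gz
# Arguments: sample flowcell barcode lane read
fastq_name() {
    printf '%s_%s_%s_L00%s_R%s.fastq.gz\n' "$1" "$2" "$3" "$4" "$5"
}

# Succeed only if the given file name follows the guideline.
# Assumes no underscores inside sample, flowcell or barcode.
is_valid_fastq_name() {
    printf '%s\n' "$1" |
        grep -Eq '^[^_]+_[^_]+_[^_]+_L00[0-9]_R[12]\.fastq\.gz$'
}
```

For example, `fastq_name TestS1 H5F3GDSXX ATTACTCG 1 1` prints TestS1_H5F3GDSXX_ATTACTCG_L001_R1.fastq.gz, and `is_valid_fastq_name` rejects a name such as TestS1_L001_R1.fastq.gz because the flowcell and barcode fields are missing.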

Conversion to BAM

In this step, the per-lane FASTQ files were converted to unmapped BAM files with read group information, using fastqtobam from the biobambam2 package.

Options for fastqtobam:

  • gz: set to 1 if the input FASTQ files are gzipped, 0 otherwise.
  • namescheme: set to pairedfiles for paired-end sequencing.
  • RGCN: sequencing centre name; can be set to your institution's name or abbreviation.
  • RGID: read group ID; set to a unique ID for this lane, ideally unique across all lanes of all samples.
  • RGSM: sample ID; a unique ID across all of your samples.
  • RGPL: sequencing platform. We accept ILLUMINA only.
  • RGLB: sequencing library ID; set to a library ID string of your choice.
  • I: input FASTQ file name; specify it a second time for the second file of a pair.


/path/to/biobambam2/bin/fastqtobam gz=1 namescheme=pairedfiles \
RGSM=TestS1 \
RGLB=lib-Test \
I=TestS1_L001_R1.fastq.gz \
I=TestS1_L001_R2.fastq.gz \
> TestS1_L001.bam
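
The example above sets only some of the read group options listed earlier. As a rough sketch, a small wrapper can print a command that sets all of them for one lane. The function name fastqtobam_cmd is hypothetical, and the sketch assumes that fastqtobam is on the PATH, that files are named {sample}_L00{lane}_R1/R2.fastq.gz as in the example above, and that {sample}_L00{lane} is an acceptable read group ID.

```shell
# Hypothetical wrapper: prints (does not run) a fastqtobam command for one lane.
# Arguments: sample library centre lane
# Assumption: {sample}_L00{lane} is used both as file prefix and as RGID.
fastqtobam_cmd() {
    prefix="${1}_L00${4}"
    printf '%s ' fastqtobam gz=1 namescheme=pairedfiles \
        "RGID=$prefix" "RGSM=$1" "RGLB=$2" "RGCN=$3" RGPL=ILLUMINA \
        "I=${prefix}_R1.fastq.gz" "I=${prefix}_R2.fastq.gz"
    printf '> %s.bam\n' "$prefix"
}
```

Calling `fastqtobam_cmd TestS1 lib-Test WCM 1` prints a single command line that can be reviewed first and then piped to `sh` to run it.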

Merging to single BAM

If there were multiple per-lane BAM files, they were merged into a single unmapped per-sample BAM file using bamcat from biobambam2 or another tool, such as Picard's MergeSamFiles.

    /path/to/biobambam2/bin/bamcat \
    md5=1 \
    md5filename=TestS1.merge.bam.md5 \
    I=TestS1_L001.bam \
    I=TestS1_L002.bam \
    > TestS1.bam


    $ module load picard
    $ java -jar $EBROOTPICARD/picard.jar MergeSamFiles I=../fastqtobam/TESTS-H5F3GDSXX-ATTACTCG+TAATATTA_L001.bam I=../fastqtobam/TESTS-H5F3GDSXX-ATTACTCG+TATTCTTA_L001.bam I=../fastqtobam/TESTS-H5F3GDSXX-ATTACTCG+TAATCTTA_L001.bam O=TESTS.unmapped.bam
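
When a sample has many lanes, the list of I= arguments can be generated rather than typed by hand. This is only a sketch: bamcat_inputs is a hypothetical helper, and it assumes the per-lane files sit in the current directory and are named {sample}_L00{lane}.bam without spaces.

```shell
# Hypothetical helper: prints one I=<file> argument per per-lane BAM of a sample.
# Argument: sample
bamcat_inputs() {
    for bam in "$1"_L00?.bam; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$bam" ] || continue
        printf 'I=%s\n' "$bam"
    done
}
```

The output can be passed to the merge command, for example `/path/to/biobambam2/bin/bamcat $(bamcat_inputs TestS1) > TestS1.bam`.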

Alternative Tools for Merging

As of 2021, biobambam2 is outdated and unsupported, as is the libmaus2 library it depends on. Alternatively, gatk FastqToSam can be used to obtain the same results.

Example code to merge the two files PCAWG.faeb4dd6-68d9-4bed-82e8-d12adca6b28c_6_r1.fastq.gz and PCAWG.faeb4dd6-68d9-4bed-82e8-d12adca6b28c_6_r2.fastq.gz:

        # FastqToSam requires an output file; the --OUTPUT name below is illustrative.
        gatk FastqToSam --SEQUENCING_CENTER  WCM \
                        --READ_GROUP_NAME    faeb4dd6-68d9-4bed-82e8-d12adca6b28c_6 \
                        --SAMPLE_NAME        faeb4dd6-68d9-4bed-82e8-d12adca6b28c \
                        --PLATFORM           ILLUMINA \
                        --LIBRARY_NAME       sample_name_lib \
                        -F1                  PCAWG.faeb4dd6-68d9-4bed-82e8-d12adca6b28c_6_r1.fastq.gz \
                        -F2                  PCAWG.faeb4dd6-68d9-4bed-82e8-d12adca6b28c_6_r2.fastq.gz \
                        --OUTPUT             faeb4dd6-68d9-4bed-82e8-d12adca6b28c_6.unmapped.bam

Make sure to use ILLUMINA as the platform argument, since it is required in later steps to successfully validate samples.
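
The platform tag can be checked on the unmapped BAM before moving on. The following is a minimal sketch, not part of the pipeline: check_rg_platform is a hypothetical function that reads SAM header lines from standard input, so it can be fed by `samtools view -H`.

```shell
# Hypothetical check: reads SAM header lines on stdin and fails if there is
# no @RG line, or if any @RG line lacks the tag PL:ILLUMINA.
check_rg_platform() {
    awk -F '\t' '
        $1 == "@RG" {
            seen = 1
            ok = 0
            for (i = 2; i <= NF; i++) if ($i == "PL:ILLUMINA") ok = 1
            if (!ok) bad = 1
        }
        END {
            if (seen && !bad) exit 0
            exit 1
        }
    '
}
```

A possible use is `samtools view -H TestS1.bam | check_rg_platform || echo "read group platform is not ILLUMINA"`.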


In this step, cgpNgsQc inspects the metadata table and assigns UUIDs to each donor and sample.

This step needs a tab-delimited metadata file with the following header:

column 1: Donor_ID 
column 2: Tissue_ID 
column 3: is_normal 
column 4: is_normal_for_donor 
column 5: Sample_ID
column 6: relative_file_path

Create a tab-separated file with content like this example (test_input.tsv):

        Donor_ID Tissue_ID is_normal is_normal_for_donor Sample_ID relative_file_path
        TestP0  WG  Y   Y   TestS0  ./TestS0.bam
        TestP0  Primary N       TestS1  ./TestS1.bam

Note that the table needs to list a normal and a tumour sample for each donor.
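
Because tabs are easy to lose when copying the example, it can help to write the file with printf and run a quick sanity check before validation. This is a sketch under assumptions: both function names are hypothetical, the check is not part of cgpNgsQc, and the empty is_normal_for_donor field on the tumour row is inferred from the gap in the example.

```shell
# Hypothetical helper: write the example metadata with real tab characters.
# Argument: output file name
# Assumption: is_normal_for_donor is left empty on the tumour row.
write_example_metadata() {
    {
        printf 'Donor_ID\tTissue_ID\tis_normal\tis_normal_for_donor\tSample_ID\trelative_file_path\n'
        printf 'TestP0\tWG\tY\tY\tTestS0\t./TestS0.bam\n'
        printf 'TestP0\tPrimary\tN\t\tTestS1\t./TestS1.bam\n'
    } > "$1"
}

# Hypothetical pre-check: every data row must have six tab-separated fields,
# and every donor needs at least one normal (Y) and one tumour (N) sample.
check_metadata() {
    awk -F '\t' '
        NR == 1 { next }
        NF != 6 { bad = 1 }
        $3 == "Y" { normal[$1] = 1 }
        $3 == "N" { tumour[$1] = 1 }
        { donors[$1] = 1 }
        END {
            for (d in donors) if (!(d in normal) || !(d in tumour)) bad = 1
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

For example, `write_example_metadata test_input.tsv && check_metadata test_input.tsv` exits with status 0 for the table above and with status 1 if a donor lacks either sample type.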

The cgpNgsQc package comes in a Docker container, but it can be executed with udocker if Docker is unavailable on your system.

Below is example code for loading the Docker image into udocker:

        udocker pull
        udocker create
        udocker name <CONTAINER ID for> <name>

The CONTAINER ID can be found with udocker ps.

Below is example code to run the validation step using udocker:

udocker run -v `pwd`:`pwd` -w `pwd` udocker_cgp-ngs-qc -in test_input.tsv -out test_output.tsv -format tsv

A successful run produces a test_output.tsv file with UUIDs that cgpmap can use for alignment.


Example code is well documented in For Australian data, version 2.0.3 was used, and was run instead of

Reference for mapping can be downloaded from:

The image can be downloaded using Singularity with:

export CGPMAP_VER=2.0.3
singularity pull docker://$CGPMAP_VER

Below is example code for one sample using Singularity:

    singularity exec -i --bind
    --workdir {working_dir} --home {`home` under \`working_dir}:/home
    [path/to/dockstore-cgpmap-2.0.3.simg] -reference
    /mnt/reference/core_ref_GRCh37d5.tar.gz -bwa_idx
    /mnt/reference/bwa_idx_GRCh37d5.tar.gz -sample {sample UUID} TestS1.bam
  • --workdir: workspace directory
  • --home: a directory where all result files are stored

Once alignment is done, it is recommended to rename the resulting BAM file to something more descriptive, such as TestS1.mapped.bam.

Variant calling

Information on how to run cgpwgs can be found in its wiki.

The reference can be downloaded using:

mkdir ref
cd ref
echo '' \
| xargs -tI {} bash -c 'curl -L {} | tar --strip-components 1 -zx'

The image can be downloaded using Singularity with:

export CGPWGS_VER=1.1.2
singularity pull docker://$CGPWGS_VER

Example code using Singularity:

singularity exec -i --bind
--workdir {working_dir} --home {working_dir}/home:home
-exclude="NC_007605,hs37d5,GL%" -species="human" -assembly="GRCh37d5"
  • --workdir: workspace directory
  • --home: a directory where all result files are stored


