The Personal Cancer Genome Reporter (PCGR) is a stand-alone software package for functional annotation and translation of individual cancer genomes for precision oncology. It interprets both somatic SNVs/InDels and copy number aberrations. The software extends basic gene and variant annotations from Ensembl’s Variant Effect Predictor (VEP) with oncology-relevant, up-to-date annotations retrieved flexibly through vcfanno, and produces interactive HTML reports intended for clinical interpretation.
- Nov 27th 2018: 0.7.0 release
- Bundle update and bug fixing (see CHANGELOG)
- Reporting germline variants for cancer predisposition? Check out github.com/sigven/cpsr
- May 14th 2018: 0.6.2.1 release
- May 9th 2018: 0.6.2 release
- Fixed various bugs reported by users (see CHANGELOG)
- Data bundle update (ClinVar, KEGG, CIViC, UniProt, DiseaseOntology)
- May 2nd 2018: 0.6.1 release
- Fixed bugs in tier assignment
- April 25th 2018: 0.6.0 release
- Updated data sources
- Enabling specification of tumor type of input sample
- New tier system for classification of variants (ACMG-like)
- VCF validation can be turned off
- Tumor DP/AF presets
- JSON dump of report content
- GRCh38 support
- Runs under Python3
- November 29th 2017: 0.5.3 release
- Fixed bug with propagation of default options
- November 23rd 2017: 0.5.2 release
- November 15th 2017: 0.5.1 pre-release
- Bug fixing (VCF validation)
- November 14th 2017: 0.5.0 pre-release
- Updated version of VEP (v90)
- Updated versions of ClinVar, UniProt KB, CIViC, CBMDB
- Removal of ExAC (replaced by gnomAD), removal of COSMIC due to licensing restrictions
- Users can analyze samples run without matching control (i.e. tumor-only)
- PCGR pipeline is now configured through a TOML-based configuration file
- Bug fixes / general speed improvements
- Work in progress: Export of report data through JSON
IMPORTANT: If you use PCGR, please cite the publication:
Sigve Nakken, Ghislain Fournous, Daniel Vodák, Lars Birger Aasheim, Ola Myklebost, and Eivind Hovig. Personal Cancer Genome Reporter: variant interpretation report for precision oncology (2017). Bioinformatics. 34(10):1778–1780. doi:10.1093/bioinformatics/btx817
- VEP v94 - Variant Effect Predictor (GENCODE v28/v19 as the gene reference dataset)
- CIViC - Clinical interpretations of variants in cancer (November 12th 2018)
- ClinVar - Database of variants with clinical significance (November 2018)
- DoCM - Database of curated mutations (v3.2, April 2016)
- CBMDB - Cancer Biomarkers database (January 17th 2018)
- IntOGen catalog of driver mutations - (May 2016)
- DisGeNET - Database of gene-tumor type associations (May 2017)
- Cancer Hotspots - Resource for statistically significant mutations in cancer (v2 - 2017)
- dbNSFP v3.5 - Database of non-synonymous functional predictions (August 2017)
- TCGA release 13 - somatic mutations discovered across 33 tumor type cohorts (The Cancer Genome Atlas)
- UniProt/SwissProt KnowledgeBase 2018_10 - Resource on protein sequence and functional information (November 2018)
- Pfam v32 - Database of protein families and domains (September 2018)
- DGIdb - Database of targeted cancer drugs (v3.0.2, January 2018)
- ChEMBL - Manually curated database of bioactive molecules (v24.1, June 2018)
- CancerMine v6 - Literature-derived database of tumor suppressor genes/proto-oncogenes (November 2018)
An installation of Python (version 3.6) is required to run PCGR. Check that Python is installed by typing python --version in your terminal window. In addition, a Python library for parsing configuration files encoded with TOML is needed. To install, simply run the following command:
pip install toml
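The two prerequisites above can also be checked from Python itself. The following is a minimal sketch, not part of PCGR: the function name check_environment is hypothetical, and treating Python 3.6 as a lower bound (rather than an exact requirement) is an assumption.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 6)):
    """Return a list of problems; an empty list means both prerequisites are met."""
    problems = []
    # ASSUMPTION: 3.6 is treated as a minimum version, not an exact match.
    if sys.version_info < min_version:
        problems.append("Python %d.%d or newer is required" % min_version)
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec("toml") is None:
        problems.append("the 'toml' package is missing (pip install toml)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

An empty result only says that the interpreter and the TOML library are present; it does not check Docker or the data bundle.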
- Install the Docker engine on your preferred platform
- installing Docker on Linux
- installing Docker on Mac OS
- NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with PCGR (an example being mounting of data volumes)
- Test that Docker is running, e.g. by typing docker ps or docker images in the terminal window
- Adjust the computing resources dedicated to Docker, i.e.:
- Memory: minimum 5GB
- CPUs: minimum 4
- How to - Mac OS X
a. Clone the PCGR GitHub repository (includes run script and configuration file): git clone https://github.com/sigven/pcgr.git
b. Download and unpack the latest data bundles in the PCGR directory
- grch37 data bundle - 20181119 (approx 9Gb)
- grch38 data bundle - 20181119 (approx 14Gb)
- Unpacking:
gzip -dc pcgr.databundle.grch37.YYYYMMDD.tgz | tar xvf -
c. Pull the PCGR Docker image (dev) from DockerHub (approx 5.1Gb):
docker pull sigven/pcgr:dev
(PCGR annotation engine)
a. Download and unpack the latest software release (0.7.0)
b. Download and unpack the assembly-specific data bundle in the PCGR directory
- grch37 data bundle - 20181119 (approx 9Gb)
- grch38 data bundle - 20181119 (approx 14Gb)
- Unpacking:
gzip -dc pcgr.databundle.grch37.YYYYMMDD.tgz | tar xvf -
A _data/_ folder within the _pcgr-X.X_ software folder should now have been produced
c. Pull the PCGR Docker image (0.7.0) from DockerHub (approx 5.1Gb):
docker pull sigven/pcgr:0.7.0
(PCGR annotation engine)
The PCGR workflow accepts two types of input files:
- An unannotated, single-sample VCF file (>= v4.2) with called somatic variants (SNVs/InDels)
- A copy number segment file
PCGR can be run with either or both of the two input files present.
- We strongly recommend that the input VCF is compressed and indexed using bgzip and tabix
- If the input VCF contains multi-allelic sites, these will be subject to decomposition
- Variants used for reporting should be designated as 'PASS' in the VCF FILTER column
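Before running PCGR, it can be useful to see how many records in a VCF are marked 'PASS' and how many sites are multi-allelic. The sketch below is an illustration under stated assumptions, not PCGR's own validation: the helper names open_vcf and summarize_vcf_lines are hypothetical, and only the standard eight fixed VCF columns are assumed.

```python
import gzip


def open_vcf(path):
    """Open a plain or bgzip/gzip-compressed VCF file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def summarize_vcf_lines(lines):
    """Count records, PASS records, and multi-allelic sites in VCF lines."""
    summary = {"records": 0, "pass": 0, "multiallelic": 0}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            raise ValueError("VCF record has fewer than 8 columns: %r" % line)
        summary["records"] += 1
        if fields[6] == "PASS":  # column 7 is FILTER
            summary["pass"] += 1
        if "," in fields[4]:  # column 5 is ALT; a comma separates alleles
            summary["multiallelic"] += 1
    return summary
```

For a file on disk, the two helpers combine as `with open_vcf(path) as handle: summarize_vcf_lines(handle)`.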
The tab-separated values file with copy number aberrations MUST contain the following four columns:
- Chromosome
- Start
- End
- Segment_Mean
Here, Chromosome, Start, and End denote the chromosomal segment, and Segment_Mean denotes the log2 ratio for a particular segment, which is a common output of somatic copy number alteration callers. Note that coordinates must be one-based (i.e. chromosomes start at 1, not 0). The initial part of a copy number segment file that is correctly formatted according to PCGR's requirements is shown below:
Chromosome Start End Segment_Mean
1 3218329 3550598 0.0024
1 3552451 4593614 0.1995
1 4593663 6433129 -1.0277
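The format rules for the copy number segment file can be checked with a few lines of Python. This is a minimal sketch rather than PCGR's own validation logic: the function name validate_cna is hypothetical, and the check that End is not smaller than Start is an added sanity check, not a documented requirement.

```python
import csv
import io

REQUIRED_COLUMNS = ["Chromosome", "Start", "End", "Segment_Mean"]


def validate_cna(handle):
    """Return a list of error messages for a tab-separated CNA segment file."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    errors = []
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        try:
            start, end = int(row["Start"]), int(row["End"])
            float(row["Segment_Mean"])  # log2 ratio must be numeric
        except (TypeError, ValueError):
            errors.append("line %d: non-numeric value" % line_number)
            continue
        if start < 1:
            errors.append("line %d: Start must be >= 1 (one-based coordinates)" % line_number)
        if end < start:  # ASSUMPTION: sanity check, not a stated PCGR rule
            errors.append("line %d: End is smaller than Start" % line_number)
    return errors


example = (
    "Chromosome\tStart\tEnd\tSegment_Mean\n"
    "1\t3218329\t3550598\t0.0024\n"
    "1\t3552451\t4593614\t0.1995\n"
    "1\t4593663\t6433129\t-1.0277\n"
)
print(validate_cna(io.StringIO(example)))
```

For a file on disk, pass an open text handle, e.g. `validate_cna(open(path, newline=""))`.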
The PCGR configuration file, formatted using TOML (an easy to read file format), enables the user to configure a number of options in the PCGR workflow, related to the following:
- Tumor type of input sample
- Tier model
- Sequencing depth/allelic support thresholds
- MSI prediction
- Mutational signatures analysis
- Mutational burden analysis (e.g. target size)
- VCF to MAF conversion
- Tumor-only analysis options (i.e. exclusion of germline variants/enrichment for somatic calls)
- VEP/vcfanno options
- Log-ratio thresholds for gains/losses in CNA analysis
See the PCGR documentation for more details about the exact usage of the configuration options.
The PCGR software bundle comes with a default configuration file (pcgr.toml), to be used as a starting point for running the PCGR workflow.
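To see which sections and options a TOML configuration file defines before editing it, the file can be parsed and summarized. The sketch below is illustrative only: the function name config_sections is hypothetical, and the section and key names in EXAMPLE_TEXT are invented placeholders, not actual PCGR options. It uses the toml package mentioned above, or the standard-library tomllib when available (Python 3.11 and later).

```python
try:
    import tomllib  # standard library from Python 3.11
    _loads = tomllib.loads
except ImportError:
    import toml  # third-party package installed with `pip install toml`
    _loads = toml.loads


def config_sections(text):
    """Map each TOML section (table) name to a sorted list of its option names."""
    config = _loads(text)
    return {
        name: sorted(options)
        for name, options in config.items()
        if isinstance(options, dict)  # keep tables, ignore top-level scalars
    }


# HYPOTHETICAL section and key names, for illustration only;
# consult pcgr.toml for the real ones.
EXAMPLE_TEXT = """
[example_section]
example_threshold = 0.8
example_flag = false
"""

print(config_sections(EXAMPLE_TEXT))
```

For the shipped default file, read it first, e.g. `config_sections(open("pcgr.toml").read())`.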
A tumor sample report is generated by calling the Python script pcgr.py, which takes the following arguments and options:
usage: pcgr.py [-h] [--input_vcf INPUT_VCF] [--input_cna INPUT_CNA]
[--force_overwrite] [--version] [--basic]
[--docker-uid DOCKER_USER_ID] [--no-docker]
pcgr_dir output_dir {grch37,grch38} configuration_file
sample_id
Personal Cancer Genome Reporter (PCGR) workflow for clinical interpretation of
somatic nucleotide variants and copy number aberration segments
positional arguments:
pcgr_dir PCGR base directory with accompanying data directory,
e.g. ~/pcgr-0.7.0
output_dir Output directory
{grch37,grch38} Genome assembly build: grch37 or grch38
configuration_file PCGR configuration file (TOML format)
sample_id Tumor sample/cancer genome identifier - prefix for
output files
optional arguments:
-h, --help show this help message and exit
--input_vcf INPUT_VCF
VCF input file with somatic query variants
(SNVs/InDels). (default: None)
--input_cna INPUT_CNA
Somatic copy number alteration segments (tab-separated
values) (default: None)
--force_overwrite By default, the script will fail with an error if any
output file already exists. You can force the
overwrite of existing result files by using this flag
(default: False)
--version show program's version number and exit
--basic Run functional variant annotation on VCF through
VEP/vcfanno, omit other analyses (i.e. CNA, MSI,
report generation etc. (STEP 4) (default: False)
--docker-uid DOCKER_USER_ID
Docker user ID. Default is the host system user ID. If
you are experiencing permission errors, try setting
this up to root (`--docker-uid root`) (default: None)
--no-docker Run the PCGR workflow in a non-Docker mode (see
install_no_docker/ folder for instructions (default:
False)
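When PCGR is run for many samples, the command line described by the usage text above can be assembled in Python and passed to subprocess.run. This is a sketch under stated assumptions: the function name build_pcgr_command is hypothetical, the script is assumed to be reachable as pcgr.py from the working directory, and only flags listed in the usage text are used.

```python
def build_pcgr_command(pcgr_dir, output_dir, assembly, configuration_file,
                       sample_id, input_vcf=None, input_cna=None,
                       force_overwrite=False, script="pcgr.py"):
    """Build the argument list for a pcgr.py call (does not execute it)."""
    if assembly not in ("grch37", "grch38"):
        raise ValueError("assembly must be 'grch37' or 'grch38'")
    if input_vcf is None and input_cna is None:
        raise ValueError("provide input_vcf, input_cna, or both")
    # ASSUMPTION: 'python' on PATH is the interpreter intended for PCGR.
    command = ["python", script]
    if input_vcf is not None:
        command += ["--input_vcf", input_vcf]
    if input_cna is not None:
        command += ["--input_cna", input_cna]
    if force_overwrite:
        command.append("--force_overwrite")
    # Positional arguments, in the order given by the usage text.
    command += [pcgr_dir, output_dir, assembly, configuration_file, sample_id]
    return command
```

The returned list can be run with `subprocess.run(command, check=True)`. Because no shell is involved in that case, paths starting with ~ are not expanded automatically; os.path.expanduser can be applied to them first.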
The examples folder contains input files from two tumor samples sequenced within TCGA (GRCh37 only). It also contains PCGR configuration files customized for these cases. A report for a colorectal tumor case can be generated by running the following command in your terminal window:
python pcgr.py --input_vcf ~/pcgr-0.7.0/examples/tumor_sample.COAD.vcf.gz \
  --input_cna ~/pcgr-0.7.0/examples/tumor_sample.COAD.cna.tsv \
  ~/pcgr-0.7.0 ~/pcgr-0.7.0/examples grch37 ~/pcgr-0.7.0/examples/pcgr_conf.COAD.toml tumor_sample.COAD
This command will run the Docker-based PCGR workflow and produce the following output files in the examples folder:
- tumor_sample.COAD.pcgr_acmg.grch37.html - An interactive HTML report for clinical interpretation
- tumor_sample.COAD.pcgr_acmg.grch37.pass.vcf.gz - Bgzipped VCF file with rich set of annotations for precision oncology
- tumor_sample.COAD.pcgr_acmg.grch37.pass.tsv.gz - Compressed vcf2tsv-converted file with rich set of annotations for precision oncology
- tumor_sample.COAD.pcgr_acmg.grch37.snvs_indels.tiers.tsv - Tab-separated values file with variants organized according to tiers of functional relevance
- tumor_sample.COAD.pcgr_acmg.grch37.json.gz - Compressed JSON dump of HTML report content
- tumor_sample.COAD.pcgr_acmg.grch37.cna_segments.tsv.gz - Compressed tab-separated values file with annotations of gene transcripts that overlap with somatic copy number aberrations
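The output files listed above follow a common naming pattern, which makes it easy to check after a run that everything was produced. The sketch below assumes the pattern sample_id.pcgr_acmg.assembly.suffix seen in the example holds in general; the pcgr_acmg part is taken from the example file names and may differ with other settings. The function names are hypothetical.

```python
import os

# Suffixes taken from the example output file names listed above.
OUTPUT_SUFFIXES = [
    "html",
    "pass.vcf.gz",
    "pass.tsv.gz",
    "snvs_indels.tiers.tsv",
    "json.gz",
    "cna_segments.tsv.gz",
]


def expected_output_names(sample_id, assembly, tag="pcgr_acmg"):
    """Return the expected output file names for one sample."""
    # ASSUMPTION: the 'pcgr_acmg' tag is constant, as in the example run.
    return ["%s.%s.%s.%s" % (sample_id, tag, assembly, suffix)
            for suffix in OUTPUT_SUFFIXES]


def missing_outputs(output_dir, sample_id, assembly):
    """Return the expected output files that do not exist in output_dir."""
    return [name for name in expected_output_names(sample_id, assembly)
            if not os.path.exists(os.path.join(output_dir, name))]
```

Note that some files depend on the input provided. For instance, the CNA segments file is only expected when a copy number segment file was given, so a missing file is not necessarily an error.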