####Introduction. With advances in Cancer Genomics, Mutation Annotation Format (MAF) is being widley accepted and used to store variants detected. The Cancer Genome Atlas Project has seqenced over 30 different cancers with sample size of each cancer type being over 200. The resulting data consisting of genetic variants is stored in the form of Mutation Annotation Format. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner either from TCGA sources or any in-house studies as long as the data is in MAF format. Maftools can also handle ICGC Simple Somatic Mutation format.
maftools is on 👉 bioRxiv
Please cite the below if you find this tool useful for you.
Mayakonda, A. and H.P. Koeffler, Maftools: Efficient analysis, visualization and summarization of MAF files from large-scale cohort based cancer studies. bioRxiv, 2016. doi: http://dx.doi.org/10.1101/052662
MAF files contains many fields ranging from chromosome names to cosmic annotations. However, most of the analysis in maftools uses following fields.
- Mandatoty fields: Hugo_Symbol, Chromosome, Start_Position, End_position, Variant_Classification, Variant_Type and Tumor_Sample_Barcode.
- Recommended optional fields: non MAF specific fields containing vaf and amino acid change information. Complete specififcation of MAF files can be found on NCI TCGA page.
NOTE: If you have variants stored as VCFs, convert them to MAF using vcf2maf. Merge MAFs from all samples into a single MAF before processing with maftools.
A complete documentation of maftools using TCGA LAML1 as a case study can be found here.
- Analysis
- Detect cancer driver genes based on positional clustering of variants2.
- Detect Mutually exclusive set of genes5.
- Compare two MAF files (cohorts) to detect differentially mutated genes.
- Add pfam domains and summarize.
- Extract mutational signatures and compare them to validated signatures.
- Tumor heterogenity and MATH (Mutant-Allele Tumor Heterogeneity) score estimation.
- Read and summarize GISTIC results.
- Visualization
- Make oncoplots.
- Make lollipop plots.
- Map variants on copy number (CBS) segments
- Forest plots
- Plot Transitions and Transversions.
- Plot maf summary.
- CoOncoplots
- Genecloud
- Rainfall plots and change point detection
- Annotation
- Annotate variants locally using Oncotator API.
- Convert Annovar annotations into MAF.
- Convert ICGC simple somatic mutation format into MAF.
Easy way: Install from Bioconductor. If you dont want to use devel bioc, install from Github.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
useDevel()
biocLite("maftools")
Install from Github
#Install Bioconductor dependencies.
source("http://bioconductor.org/biocLite.R")
biocLite("ComplexHeatmap")
biocLite("VariantAnnotation")
biocLite("Biostrings")
#Install maftools from github repository.
library("devtools")
install_github(repo = "PoisonAlien/maftools")
First we read maf file using function read.maf
which reads maf, summarises variants in various ways, creates oncomatrix and performs sorting.
require(maftools)
#read TCGA maf file for LAML
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
laml = read.maf(maf = laml.maf, removeSilent = T, useAll = F)
MAF file is read and stored as an MAF object of S4 class. Each slot can be accessed using directly @
. However there are accessor methods which can be used also.
#MAF file summarized by Genes
getGeneSummary(laml)
#MFA file summarized by samples (Tumor Sample Barcode)
getSampleSummary(laml)
#Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = 'laml')
plotmafSummary(maf = laml, rmOutlier = T, addStat = 'median', dashboard = TRUE)
oncoplot
This function uses modified oncoprint script from ComplexHeatmap package by Zuguang Gu with added functionalities, while taking care of format conversions.
#We will plot top ten mutated genes
oncoplot(maf = laml, top = 10)
Use arguments drawRowBar, drawColBar to control side and upper barplots.
Each Single Nucleotide Variant can be classified into Trasition or Transversion. Variants can also be divided into six different classes, which helps to know us which kind of conversions are more frequent in a given type of cancer.
laml.titv = titv(maf = laml, useSyn = T)
It also returns a list of dataframes with raw counts for each conversion, fraction of each conversion and Ti to Tv ratios.
###Lollipop plots for amino acid changes. We can map protein changes on to the Protein structure similar to those draw by ProteinPaint or MutationMapper. Labelling requires ggrepel package.
lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = 'all')
###Genecloud
geneCloud(input = laml, minMut = 3)
###Integrating somatic variants and copy number alterations
####Detcting cancer causing genes.
maftools comes with the function oncodrive
which identifies cancer genes (driver) from a given MAF. oncodrive
is a based on algorithm oncodriveCLUST which was originally implemented in Python. Concept is based on the fact that most of the variants in cancer causing genes are enriched at few specific loci (aka hotspots). This method takes advantage of such positions to identify cancer genes. If you use this function, please cite OncodriveCLUST article2.
laml.sig = oncodrive(maf = aml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')
plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = T)
####Summarizing by pfam domains.
pfamDomain
adds domain information and summarizes amino acid changes accoriding to the domains that are affected. This serves the puposes of knowing what domain in given cancer cohort, is most frequently affected. This function is inspired from Pfam annotation modulce of MuSic tool3.
laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)
#Protein summary (Printing first 7 columns for display convenience)
laml.pfam$proteinSummary[,1:7, with = F]
#Domain summary (Printing first 3 columns for display convenience)
laml.pfam$domainSummary[,1:3, with = F]
####Comparing two cohorts for differentially mutated genes
####Rainfall plots
We can also annotate variants using oncotator API4.
var.file = system.file('extdata', 'variants.tsv', package = 'maftools')
#This is what input looks like
var = read.delim(var.file, sep = '\t')
head(var)
#Annotate
var.maf = oncotate(maflite = var.file, header = T)
This is quite time consuming if input is big.
Many genes in cancer show strong exclusiveness in mutation pattern. We can detect such pair of genes using mutExclusive
which runs comet_exact_test
from cometExactTest
package for significance5.
mutExclusive(maf = laml, genes = c('NPM1', 'RUNX1'))
#Output
## n.00 n.01 n.10 n.11 gene1 gene2 pval
## 1 124 18 54 0 NPM1 RUNX1 0.00109823009964897
#We can visualize this pair using oncostrip. `oncostrip` draws a matrix similar to [oncoprint](http://www.cbioportal.org/faq.jsp#what-are-oncoprints) on [cBioPortal](http://www.cbioportal.org/index.do).
oncostrip(maf = laml, genes = c('NPM1', 'RUNX1'), sort = T, legend = T, removeNonMutated = T)
####Tumor Heterogenity Tumors are generally heterogenous i.e, consist of multiple clones. This heterogenity can be inferred by clustering variant allele frequencies. We will manually mention vaf column. Requires mclust package.
#We will run this for sample TCGA.AB.2972
inferHetrogentiy(maf = laml, tsb = 'TCGA.AB.2972', vafCol = 'TumorVAF_WU')
Every cancer, as it progresses leaves a signature characterised by specific pattern of nucleotide substitutions. Alexandrov et.al have shown such signatures, derived from over 7000 cancer samples. Such signatures can be extracted by decomposiong matrix of nucleotide substitutions, classified into 96 substitution classes based on immediate bases sorrouding the mutated base. Extracted signatures can also be compared to those 21 validated signatures6.
ExtractSignatures uses nmf
from NMF package to decompose the matrix and extract signatures7.
NOTE: Reading fasta file is memory consuming and it occupies ~3gb of memory while extracting adjacent bases from human genome.
Trying this on TCGA Liver MAF file.
#Read MAF file
maf = "hgsc.bcm.edu_LIHC.IlluminaGA_DNASeq.1.somatic.maf"
lihc = read.maf(maf = maf, removeSilent = T, useAll = F)
#Extract adjacent bases from immediate 3' and 5', classify them into 96 conversion classes.
lihc.tnm = trinucleotideMatrix(maf = lihc, ref_genome = 'hg19.fa', prefix = 'chr', add = T, useSyn = T)
## reading fasta (this might take a while)..
## Extracting adjacent bases..
## matrix of dimension 198x96
# Decompose extracted matrix using Non-negative Matrix Factorization.
lihc.signatures = extractSignatures(mat = lihc.tnm)
## Estimating best rank..
## Using 2 as a best-fit rank based on maximum cophenetic correlation coefficient.
## Comparing against experimentally validated 21 signatures.. (See Alexandrov et.al Nature 2013 for details.)
## Found Signature_1 most similar to validated Signature_4. Correlation coeff: 0.547099730968907
## Found Signature_2 most similar to validated Signature_12. Correlation coeff: 0.647694192108207
Signature_2 which corelates will validated Signature_12 was observed in Liver samples characterised by T>C mutations showing transcriptional strand bias8.
For full documentation please refer to vignette.
####References.
- Cancer Genome Atlas Research, N., Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med, 2013. 368(22): p. 2059-74.
- Tamborero, D., A. Gonzalez-Perez, and N. Lopez-Bigas, OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes. Bioinformatics, 2013. 29(18): p. 2238-44.
- Dees, N.D., et al., MuSiC: identifying mutational significance in cancer genomes. Genome Res, 2012. 22(8): p. 1589-98.
- Ramos, A.H., et al., Oncotator: cancer variant annotation tool. Hum Mutat, 2015. 36(4): p. E2423-9.
- Leiserson, M.D., et al., CoMEt: a statistical approach to identify combinations of mutually exclusive alterations in cancer. Genome Biol, 2015. 16: p. 160.
- Alexandrov, L.B., et al., Signatures of mutational processes in human cancer. Nature, 2013. 500(7463): p. 415-21.
- Gaujoux, R. and C. Seoighe, A flexible R package for nonnegative matrix factorization. BMC Bioinformatics, 2010. 11: p. 367.
- data.table at warp speed 1.9.6
- ggplot2