R package for identifying motif matches and motif enrichment in DNA sequences

wkopp/motifcounter

motifcounter - R package for analysing TFBSs in DNA sequences.

This software package grew out of the work that I did to obtain my PhD.

If it is of help for your analysis, please cite

  @Manual{,
    title = {motifcounter: R package for analysing TFBSs in DNA sequences},
    author = {Wolfgang Kopp},
    year = {2017},
    doi = {10.18129/B9.bioc.motifcounter}
  }

A detailed description of the compound Poisson model is available in

@article{improvedcompound,
  title={An improved compound Poisson model for the number of motif hits in DNA sequences},
  author={Kopp, Wolfgang and Vingron, Martin},
  journal={Bioinformatics},
  pages={btx539},
  year={2017},
  publisher={Oxford University Press}
}

A detailed description of the dynamic programming-based approach is available in

@article{withoutpoisson,
  title={DNA Motif Match Statistics Without Poisson Approximation},
  author={Kopp, Wolfgang and Vingron, Martin},
  journal={Journal of Computational Biology},
  year={2019},
  publisher={Mary Ann Liebert, Inc.}
}

Usage

# Estimate a background model on a set of sequences
bg <- readBackground(sequences, order)

# Normalize a given PFM
new_motif <- normalizeMotif(motif)

# Evaluate the scores along a given sequence
scores <- scoreSequence(sequence, motif, bg)

# Evaluate the motif hits along a given sequence
hits <- motifHits(sequence, motif, bg)

# Evaluate the average score profile
score_profile <- scoreProfile(sequences, motif, bg)

# Evaluate the average motif hit profile
hit_profile <- motifHitProfile(sequences, motif, bg)

# Compute the motif hit enrichment
enrichment <- motifEnrichment(sequences, motif, bg)

Hallmarks of motifcounter

The motifcounter package facilitates the analysis of transcription factor binding sites (TFBSs) in DNA sequences. It can be used to scan a set of DNA sequences for known motifs (e.g. from TRANSFAC or JASPAR) in order to determine the positions and enrichment of TFBSs in the sequences.

An analysis with motifcounter therefore requires the following inputs:

  1. a position frequency matrix (PFM) which represents the TF affinity towards the DNA
  2. a background model, which is estimated from a given DNA sequence and which serves as a reference for the statistical analysis.
  3. a desired false positive level for identifying putative TFBSs in DNA sequences. For example, a reasonable choice is a false positive level at which only one in 1000 positions is falsely called a TFBS.
  4. a given DNA sequence, which is subject to the TFBS analysis.
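
To make the false positive level in item 3 concrete, the following base R sketch works through the arithmetic. It does not call motifcounter; the sequence length and motif length are hypothetical values, and it assumes the level applies per position and per strand.

```r
# Illustrative arithmetic only; this does not call motifcounter.
alpha <- 0.001        # false positive level: one in 1000 positions
seq_length <- 10000   # hypothetical sequence length in bp
motif_length <- 10    # hypothetical motif length in bp

# Number of positions at which a motif of this length can start
n_positions <- seq_length - motif_length + 1

# Expected number of falsely called TFBSs on one strand and on both strands
# (assumption: alpha applies independently to each position and strand)
expected_single <- alpha * n_positions
expected_both <- 2 * alpha * n_positions
```

Under these assumptions, roughly 10 false calls are expected per strand in a 10 kb sequence, so even a stringent level yields some hits by chance.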

The package aims to improve motif hit enrichment analysis. To this end, the package offers a number of features:

  1. motifcounter supports the use of higher-order Markov models to account for the sequence composition in unbound DNA segments. This improves the reliability of the enrichment analysis, because higher-order sequence features occur commonly in natural DNA sequences (e.g. CpG islands).
  2. The package automatically accounts for self-overlapping motif structures (see footnote 1). Accounting for them reduces false positives in the enrichment test, which are particularly common for repeat-like and palindromic motifs. motifcounter determines self-overlapping motif hit occurrences not only on a single DNA strand, but (by default) also with respect to the reverse strand.
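
The effect of self-overlap can be seen in a toy example. The sketch below uses exact word matching in base R rather than motifcounter's score-based hits, and the helper `hit_positions` and the example sequence are made up for illustration.

```r
# Toy illustration with exact word matches; motifcounter itself works with
# PFM-based scores, so this is only an analogy.
# hit_positions() is a hypothetical helper, not part of motifcounter.
hit_positions <- function(word, sequence) {
  k <- nchar(word)
  # All start positions at which a word of length k fits into the sequence
  starts <- seq_len(nchar(sequence) - k + 1)
  # Keep the start positions whose window equals the word
  starts[substring(sequence, starts, starts + k - 1) == word]
}

sequence <- "GGAAAAAAGGACGTGG"

# A repeat-like word overlaps itself and produces a clump of adjacent hits
repeat_hits <- hit_positions("AAAA", sequence)

# A word that cannot overlap itself produces an isolated hit
single_hit <- hit_positions("ACGT", sequence)
```

Here the run of six A's produces three mutually overlapping hits for `AAAA` (start positions 3, 4 and 5), whereas `ACGT` matches once at position 11. Counting the three clumped hits as independent events would overstate the evidence for enrichment.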

Enrichment model

motifcounter implements two analytic approximations of the distribution of the number of motif hits in random DNA sequences that can optionally be used for the enrichment test:

  1. A compound Poisson approximation
  2. A combinatorial approximation

Both approximations yield highly accurate results for stringent false positive levels. For long DNA sequences or a large set of individual sequences (total sequence length > 10 kb), we recommend using the compound Poisson approximation. If a relaxed false positive level is preferred for identifying TFBSs, we recommend the combinatorial approximation instead.
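
As a rough intuition for the compound Poisson idea, the number of hits can be viewed as a Poisson-distributed number of clumps, each contributing a random number of hits. The base R sketch below computes the resulting mean and variance from the standard compound Poisson formulas. The clump rate and the clump size distribution are hypothetical values, not quantities produced by motifcounter.

```r
# Illustrative only: moments of a generic compound Poisson distribution.
lambda <- 2                       # hypothetical expected number of clumps
clump_sizes <- 1:3                # possible numbers of hits per clump
clump_probs <- c(0.7, 0.2, 0.1)   # hypothetical clump size probabilities

# First and second moments of the clump size
mean_size <- sum(clump_sizes * clump_probs)
second_moment <- sum(clump_sizes^2 * clump_probs)

# Compound Poisson: E[N] = lambda * E[S], Var[N] = lambda * E[S^2]
mean_hits <- lambda * mean_size
var_hits <- lambda * second_moment

# A plain Poisson model with the same mean would have variance mean_hits
dispersion <- var_hits / mean_hits
```

With these values the variance (4.8) exceeds the mean (2.8), so a plain Poisson model with the same mean would understate the spread of hit counts.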

Installation

An easy way to install motifcounter is to use the devtools R package.

#install.packages("devtools")
library(devtools)
install_github("wkopp/motifcounter", build_vignettes=TRUE)

Alternatively, the package can be cloned or downloaded from this GitHub repository, built with R CMD build, and installed with R CMD INSTALL.

Getting started

The motifcounter package contains a tutorial that illustrates:

  1. how to determine position- and strand-specific TF motif binding sites,
  2. how to analyse the profile of motif hit occurrences across a set of aligned sequences, and
  3. how to test for motif enrichment in a given set of sequences.

The tutorial can be found in the package-vignette:

library(motifcounter)
vignette("motifcounter")

Acknowledgements

Thanks to matthuska for reviewing and commenting on the package.


1: Self-overlapping motifs induce **clumps of motif hits** (that is, mutually overlapping motif hits) when a DNA sequence is scanned for hits. **Motif clumping** therefore affects the distribution of the number of motif hits and, in turn, the enrichment test.