This software package grew out of the work that I did to obtain my PhD.
If it is of help for your analysis, please cite
@Manual{,
title = {motifcounter: R package for analysing TFBSs in DNA sequences},
author = {Wolfgang Kopp},
year = {2017},
doi = {10.18129/B9.bioc.motifcounter}
}
A detailed description of the compound Poisson model is available in
@article{improvedcompound,
title={An improved compound Poisson model for the number of motif hits in DNA sequences},
author={Kopp, Wolfgang and Vingron, Martin},
journal={Bioinformatics},
pages={btx539},
year={2017},
publisher={Oxford University Press}
}
A detailed description of the dynamic programming based approach is available in
@article{withoutpoisson,
title={DNA Motif Match Statistics Without Poisson Approximation},
author={Kopp, Wolfgang and Vingron, Martin},
journal={Journal of Computational Biology},
year={2019},
publisher={Mary Ann Liebert, Inc.}
}
# Estimate a background model on a set of sequences
bg <- readBackground(sequences, order)
# Normalize a given PFM
new_motif <- normalizeMotif(motif)
# Evaluate the scores along a given sequence
scores <- scoreSequence(sequence, motif, bg)
# Evaluate the motif hits along a given sequence
hits <- motifHits(sequence, motif, bg)
# Evaluate the average score profile
score_profile <- scoreProfile(sequences, motif, bg)
# Evaluate the average motif hit profile
hit_profile <- motifHitProfile(sequences, motif, bg)
# Compute the motif hit enrichment
enrichment <- motifEnrichment(sequences, motif, bg)
The motifcounter package facilitates the analysis of transcription factor binding sites (TFBSs) in DNA sequences.
It can be used to scan a set of DNA sequences for known motifs
(e.g. from TRANSFAC or JASPAR) in order to determine the positions
and enrichment of TFBSs in the sequences.
An analysis with motifcounter requires the following inputs:
- a position frequency matrix (PFM), which represents the TF affinity towards the DNA.
- a background model, which is estimated from a given DNA sequence and serves as a reference for the statistical analysis.
- a desired false positive level for identifying putative TFBSs in DNA sequences. For example, a reasonable choice is a level at which only one in 1000 positions is falsely called a TFBS.
- a DNA sequence, which is subject to the TFBS analysis.
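To make two of these inputs concrete, here is a minimal base R sketch. The PFM counts are made up, and `normalize_pfm` is a hypothetical helper written for this illustration; it is not the package's `normalizeMotif` and only shows the general idea of turning counts into column-wise probabilities. The second part shows what a false positive level of one in 1000 implies for a sequence of a given length.

```r
# Toy position frequency matrix (PFM): rows are nucleotides, columns are
# motif positions. The counts are invented for illustration only.
pfm <- matrix(
    c(8,  0, 1, 9,
      1,  0, 8, 0,
      0, 10, 1, 0,
      1,  0, 0, 1),
    nrow = 4, byrow = TRUE,
    dimnames = list(c("A", "C", "G", "T"), NULL)
)

# Hypothetical helper (not part of motifcounter): add a small pseudo-count
# and rescale every column so that it sums to one.
normalize_pfm <- function(pfm, pseudo = 0.01) {
    pfm <- pfm + pseudo
    sweep(pfm, 2, colSums(pfm), "/")
}

norm_pfm <- normalize_pfm(pfm)

# Expected number of falsely called positions when scanning one strand of a
# sequence of length seq_len at false positive level alpha.
alpha <- 0.001
seq_len <- 10000
n_positions <- seq_len - ncol(pfm) + 1
expected_false_hits <- alpha * n_positions
```

With these assumed numbers, roughly ten positions per 10 kb would be called by chance alone, which is why the enrichment test needs a model for the number of hits in random sequences.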
The package aims to improve motif hit enrichment analysis. To this end, the package offers a number of features:
- motifcounter supports the use of higher-order Markov models to account for the sequence composition in unbound DNA segments. This improves the reliability of the enrichment analysis, because higher-order sequence features occur commonly in natural DNA sequences (e.g. CpG islands).
- The package automatically accounts for self-overlapping motif structures (see footnote 1). This is important for reducing the false positives obtained from the enrichment test, which are prevalent for repeat-like and palindromic motifs.
- motifcounter determines self-overlapping motif hit occurrences not only on a single DNA strand, but (by default) also with respect to the reverse strand.
- motifcounter implements two analytic approximations of the distribution of the number of motif hits in random DNA sequences that can optionally be used for the enrichment test:
  - A compound Poisson approximation
  - A combinatorial approximation
Both approximations yield highly accurate results for stringent false positive levels. If you intend to analyse long DNA sequences or a large set of individual sequences (total sequence length >10kb), we recommend using the compound Poisson approximation. If, on the other hand, a relaxed false positive level is preferred for identifying TFBSs, we recommend the combinatorial approximation.
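As an illustration of what a higher-order background model captures, the following base R sketch estimates an order-1 Markov model from a single toy sequence. `estimate_markov1` is a hypothetical helper introduced here, and the sequence and pseudo-count are assumptions; in practice the background would be estimated with the package's `readBackground`, whose internals may differ from this sketch.

```r
# Hypothetical helper (not a motifcounter function): estimate the transition
# probabilities P(next base | current base) of an order-1 Markov model.
estimate_markov1 <- function(sequence, pseudo = 1) {
    alphabet <- c("A", "C", "G", "T")
    bases <- strsplit(toupper(sequence), "")[[1]]
    # Pair every base with its successor
    from <- factor(bases[-length(bases)], levels = alphabet)
    to <- factor(bases[-1], levels = alphabet)
    # Dinucleotide counts plus a pseudo-count to avoid zero probabilities
    counts <- table(from, to) + pseudo
    # Normalise each row so that it sums to one
    prop.table(counts, margin = 1)
}

# Toy sequence in which every C is followed by a G
trans <- estimate_markov1("ACGCGCGTATACGCGCGTTA")
round(trans, 2)
```

In this toy sequence the transition from C to G dominates its row, a dependence between neighbouring bases that an order-0 model (independent nucleotide frequencies) cannot represent.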
An easy way to install motifcounter is to use the devtools R package.
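# Uncomment the next line if devtools is not installed yet
# install.packages("devtools")
library(devtools)
# Install motifcounter from GitHub and build its vignette
install_github("wkopp/motifcounter", build_vignettes=TRUE)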
Alternatively, the package can be cloned or downloaded from this GitHub repository, built via R CMD build, and installed via the R CMD INSTALL command.
The motifcounter package contains a tutorial that illustrates:
- how to determine position- and strand-specific TF motif binding sites,
- how to analyse the profile of motif hit occurrences across a set of aligned sequences, and
- how to test for motif enrichment in a given set of sequences.
The tutorial can be found in the package vignette:
library(motifcounter)
vignette("motifcounter")
Thanks to matthuska for reviewing and commenting on the package.
1: Self-overlapping motifs induce **clumps of motif hits** (that is, mutually overlapping motif hits) when a DNA sequence is scanned for hits. As a consequence of **motif clumping**, the distribution of the number of motif hits and, thus, the enrichment test are affected.
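The clumping effect described in this footnote can be sketched in base R with exact word matches standing in for motif hits. `find_hits` is a hypothetical helper, and the sequence and words are chosen only for illustration; the package itself works with score thresholds on PFMs rather than exact words.

```r
# Hypothetical helper: start positions of all (possibly overlapping) exact
# occurrences of `word` in `sequence`.
find_hits <- function(sequence, word) {
    n <- nchar(sequence) - nchar(word) + 1
    if (n < 1) return(integer(0))
    starts <- seq_len(n)
    windows <- substring(sequence, starts, starts + nchar(word) - 1)
    starts[windows == word]
}

sequence <- "GGAAAAAAGGCACGGAAA"

# "AAAA" overlaps itself, so a run of six A's yields three hits
repeat_hits <- find_hits(sequence, "AAAA")

# "CACG" cannot overlap itself on the same strand, so its hits stay isolated
plain_hits <- find_hits(sequence, "CACG")

# Hits that start less than one word length apart overlap and are assigned
# to the same clump
clump_id <- cumsum(c(TRUE, diff(repeat_hits) >= nchar("AAAA")))
```

Here the three hits of the repeat-like word fall into a single clump, so treating them as three independent events would overstate the evidence for enrichment.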