mdozmorov/ChIP-seq_notes

Notes on ChIP-seq and other-seq-related tools

License: MIT. PRs welcome.

ChIP-seq, ATAC-seq related tools and genomics data analysis resources. Please, contribute and get in touch! See MDmisc notes for other programming and genomics-related notes.

Databases

  • UniBind database - TFBS predictions of approx. 56 million TFBSs with experimental and computational support for direct TF-DNA interactions for 644 TFs in > 1000 cell lines and tissues. Processed approx. 10,000 public ChIP-seq datasets from nine species using ChIP-eat. ChIP-eat combines both computational (high PWM score) and experimental (centrality to ChIP-seq peak summit) support to find high-confidence direct TF-DNA interactions in a ChIP-seq experiment-specific manner, uses the DAMO tool. Input data - ReMap 2018 and GTRD. Robust and permissive collections. Over 197,000 Cis-regulatory modules. Downloads of BED, FASTA, PWMs, Tracks for the UCSC GenomeBrowser, API, Enrichment analysis, online with or without background, differential enrichment. UniBind Enrichment BitBucket.

    Paper Puig, Rafael Riudavets, Paul Boddie, Aziz Khan, Jaime Abraham Castro-Mondragon, and Anthony Mathelier. “UniBind: Maps of High-Confidence Direct TF-DNA Interactions across Nine Species” BMC Genomics, (December 2021) https://doi.org/10.1186/s12864-021-07760-6

    Gheorghe, Marius, Geir Kjetil Sandve, Aziz Khan, Jeanne Chèneby, Benoit Ballester, and Anthony Mathelier. “A Map of Direct TF–DNA Interactions in the Human Genome.” Nucleic Acids Research 47, no. 4 (February 28, 2019): e21–e21. https://doi.org/10.1093/nar/gky1210

  • ReMap is an integrative analysis of Homo sapiens, Mus musculus and Arabidopsis thaliana transcriptional regulators from DNA-binding experiments such as ChIP-seq, ChIP-exo, DAP-seq from public sources (GEO, ENCODE, ENA). Human hg38 and Arabidopsis TAIR10. All peaks, non-redundant peaks, cis-Regulatory Modules. GitHub. Download genomic coordinates.

    Paper Chèneby, Jeanne, Zacharie Ménétrier, Martin Mestdagh, Thomas Rosnet, Allyssa Douida, Wassim Rhalloussi, Aurélie Bergon, Fabrice Lopez, and Benoit Ballester. “[ReMap 2020: A Database of Regulatory Regions from an Integrative Analysis of Human and Arabidopsis DNA-Binding Sequencing Experiments](https://doi.org/10.1093/nar/gkz945).” Nucleic Acids Research, October 29, 2019

    Hammal, Fayrouz, Pierre de Langen, Aurélie Bergon, Fabrice Lopez, and Benoit Ballester. “ReMap 2022: A Database of Human, Mouse, Drosophila and Arabidopsis Regulatory Regions from an Integrative Analysis of DNA-Binding Sequencing Experiments.” Nucleic Acids Research, November 9, 2021, gkab996. https://doi.org/10.1093/nar/gkab996.

  • ADASTRA - the database of Allelic Dosage-corrected Allele-Specific human Transcription factor binding sites (over 500K sites across 1073 human TFs and 649 cell types, reprocessed data from GTRD, pipeline at GitHub) at nearly 270K SNPs. Background Allele Dosage (BAD) maps. Many SNPs overlap eQTLs.
    Paper Abramov, Sergey, Alexandr Boytsov, Daria Bykova, Dmitry D. Penzar, Ivan Yevshin, Semyon K. Kolmykov, Marina V. Fridman, et al. “Landscape of Allele-Specific Transcription Factor Binding in the Human Genome.” Nature Communications 12, no. 1 (December 2021): 2751. https://doi.org/10.1038/s41467-021-23007-0.
  • ANANASTRA - ANnotation and enrichment ANalysis of Allele-Specific TRAnscription factor binding at SNPs. Annotates a given list of SNPs with allele-specific binding events across a wide range of transcription factors and cell types using ADASTRA. Enrichment analysis of SNPs in cell type-specific TFBSs (Fisher's exact, one-sided). API.
    Paper Boytsov, Alexandr, Sergey Abramov, Ariuna Z Aiusheeva, Alexandra M Kasianova, Eugene Baulin, Ivan A Kuznetsov, Yurii S Aulchenko, et al. “ANANASTRA: Annotation and Enrichment Analysis of Allele-Specific Transcription Factor Binding at SNPs.” Nucleic Acids Research, April 21, 2022, gkac262. https://doi.org/10.1093/nar/gkac262.
  • Catchitt - method for predicting TFBSs, leader of ENCODE-DREAM challenge. Other methods - table in supplementary. AUPRC to benchmark performance. DNAse-seq is the best predictor, RNA-seq and sequence-based features are not informative. Java implementation, predicted peaks for 32 transcription factors in 22 primary cell types and tissues (682 total) BED hg19 files, conservative and relaxed predictions, download.
    Paper Keilwagen, Jens, Stefan Posch, and Jan Grau. “Accurate Prediction of Cell Type-Specific Transcription Factor Binding.” Genome Biology 20, no. 1 (December 2019). https://doi.org/10.1186/s13059-018-1614-y
  • C4S DB - Comprehensive Collection and Comparison for ChIP-Seq Database. Over 16K human ChIP-seq experiments. Data aligned to GRCh37 (hs37d5) genome. "Gene browser" and "global similarity" views. Search for gene symbol, tissue/cell line, ChIP target, sample description/ID.
    Paper Anzawa, Hayato, and Kengo Kinoshita. “C4S DB: Comprehensive Collection and Comparison for ChIP-Seq Database.” Journal of Molecular Biology, May 2023, 168157. https://doi.org/10.1016/j.jmb.2023.168157.
  • RAEdb - enhancer database. Enhancers identified from STARR-seq and MPRA studies. Epromoters - promoters containing enhancers. Human (hg38)/mouse (mm10) data, select cell lines. BED/FASTQ download. Links to EnhancerAtlas, VISTA, SuperEnhancer databases.
    Paper Cai, Zena, Ya Cui, Zhiying Tan, Gaihua Zhang, Zhongyang Tan, Xinlei Zhang, and Yousong Peng. “RAEdb: A Database of Enhancers Identified by High-Throughput Reporter Assays.” Database: The Journal of Biological Databases and Curation 2019 (January 1, 2019). https://doi.org/10.1093/database/bay140.
  • ChIP-Atlas - a large database and analysis suite of public ChIP-seq and DNAse-seq experiments (Over 76K experiments, SRA uniformly processed data). Analyses: Visualization of peaks in IGV browser, BED file download, Target genes identification, Colocalization of factors (antigens), Enrichment analysis - permutation enrichment of BED regions, with custom background possible. GitHub, Documentation.
    Paper Oki, Shinya, Tazro Ohta, Go Shioi, Hideki Hatanaka, Osamu Ogasawara, Yoshihiro Okuda, Hideya Kawaji, Ryo Nakaki, Jun Sese, and Chikara Meno. “ChIP‐Atlas: A Data‐mining Suite Powered by Full Integration of Public ChIP‐seq Data” EMBO Reports, (December 2018) https://doi.org/10.15252/embr.201846255
  • TRRUST database (Transcriptional Regulatory Relationships Unraveled by Sentence-based Text mining). Over 8K regulatory interactions for 800 TFs in human, and over 6K interactions for 828 mouse TFs. Mouse and human TF regulatory networks overlap, complement each other. More information than in PAZAR, TFactS, TRED, TFe databases. Download, TSV format. Tools: 1. Search a gene, 2. Enrichment of key regulators for query genes.
    Paper Han, Heonjong, Jae-Won Cho, Sangyoung Lee, Ayoung Yun, Hyojin Kim, Dasom Bae, Sunmo Yang, et al. “TRRUST v2: An Expanded Reference Database of Human and Mouse Transcriptional Regulatory Interactions.” Nucleic Acids Research 46, no. D1 (January 4, 2018): D380–86. https://doi.org/10.1093/nar/gkx1013.
  • GTRD - transcription factor binding sites and data (ChIP-seq, ChIP-exo, DNAse-seq, MNase-seq, ATAC-seq, RNA-seq), uniformly processed, over 35K experiments. Seven species, TFs linked to CIS-BP. All cell types are assigned ontology terms. Experiment search, processed data/peaks download (BED, bigBed, bigWig).

    Paper Yevshin, Ivan, Ruslan Sharipov, Tagir Valeev, Alexander Kel, and Fedor Kolpakov. “GTRD: A Database of Transcription Factor Binding Sites Identified by ChIP-Seq Experiments.” Nucleic Acids Research 45, no. D1 (January 4, 2017): D61–67. https://doi.org/10.1093/nar/gkw951.

    Kolmykov, Semyon, Ivan Yevshin, Mikhail Kulyashov, Ruslan Sharipov, Yury Kondrakhin, Vsevolod J Makeev, Ivan V Kulakovskiy, Alexander Kel, and Fedor Kolpakov. “GTRD: An Integrated View of Transcription Regulation.” Nucleic Acids Research 49, no. D1 (January 8, 2021): D104–11. https://doi.org/10.1093/nar/gkaa1057.

  • Cistrome DB v3.0 - a resource of ChIP-seq, ATAC-seq and DNase-seq data from humans and mice. One-page interface to search by target gene and cell type, by genomic region, find similar BED sets for the uploaded BED.
    Paper Taing, Len, Ariaki Dandawate, Sehi L’Yi, Nils Gehlenborg, Myles Brown, and Clifford A Meyer. “Cistrome Data Browser: Integrated Search, Analysis and Visualization of Chromatin Data.” Nucleic Acids Research, November 16, 2023, gkad1069. https://doi.org/10.1093/nar/gkad1069.
  • Cistrome DB - ChIP-seq peaks for TFs, histone modifications, DNAse/ATAC. Downloadable cell type-specific, hg38 BED files. Toolkit to answer questions like "What factors regulate your gene of interest?", "What factors bind in your interval?", "What factors have a significant binding overlap with your peak set?"

    Paper Zheng R, Wan C, Mei S, Qin Q, Wu Q, Sun H, Chen CH, Brown M, Zhang X, Meyer CA, Liu XS. Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Res, 2018 Nov 20. https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gky1094/5193328

    Mei S, Qin Q, Wu Q, Sun H, Zheng R, Zang C, Zhu M, Wu J, Shi X, Taing L, Liu T, Brown M, Meyer CA, Liu XS. Cistrome data browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse. Nucleic Acids Res, 2017 Jan 4;45(D1):D658-D662. https://academic.oup.com/nar/article/45/D1/D658/2333932

  • CODEX ChIP-seq - CODEX provides access to processed and curated NGS experiments, including ChIP-Seq (transcription factors and histones), RNA-Seq and DNase-Seq. Human, mouse. Download tracks, analyze correlations, motifs, compare between organisms, more.
    Paper Sánchez-Castillo, Manuel and Ruau, David and Wilkinson, Adam C. and Ng, Felicia S.L. and Hannah, Rebecca and Diamanti, Evangelia and Lombard, Patrick and Wilson, Nicola K. and Gottgens, Berthold. "CODEX: a next-generation sequencing experiment database for the haematopoietic and embryonic stem cell communities" Nucleic Acids Research, Database Issue, September 2014 https://doi.org/10.1093/nar/gku895
  • hTFtarget - database of TF-gene target regulations from >7K human ChIP-seq experiments.
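
Several resources in this list (for example, ANANASTRA) report enrichment with a one-sided Fisher's exact test. The following is a minimal standard-library sketch of that test for a 2x2 table; the function name and the table layout are assumptions made for illustration and are not taken from any of the tools above.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided (enrichment) Fisher's exact test for a 2x2 table.

    Hypothetical layout:
                        in TFBS set   not in TFBS set
        query SNPs           a               b
        background SNPs      c               d

    Returns P(X >= a) under the hypergeometric null distribution.
    """
    query = a + b              # size of the query list
    in_set = a + c             # all SNPs that fall in the TFBS set
    total = a + b + c + d
    denom = comb(total, query)
    upper = min(query, in_set)
    # Sum the hypergeometric probabilities of tables at least as extreme.
    tail = sum(
        comb(in_set, k) * comb(total - in_set, query - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


if __name__ == "__main__":
    # 3 of 4 query SNPs overlap the set versus 1 of 4 background SNPs.
    print(fisher_exact_greater(3, 1, 1, 3))
```

Exact integer arithmetic via `math.comb` keeps the sketch dependency-free; for large tables a statistics library is the more practical choice.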

Motif DBs

  • CIS-BP (The Catalog of Inferred Sequence Binding Preferences) - database of inferred sequence binding preferences. DNA sequence preferences for >1,000 TFs encompassing 54 different DBD classes from 131 diverse eukaryotes. PBM microarray assays to analyze TF binding preferences. Closely related DBDs (70% Amino Acid identity) almost always have very similar DNA sequence preferences, enabling inference of motifs for approx. 34% of the 70,000 known or predicted eukaryotic TFs. Tools to scan single sequence for TF binding, two sequences for differential TF binding (including SNP effect scan), protein scan, motif scan. Bulk download of PWMs, protein sequences, TF information, logos.
    Paper Weirauch, Matthew T., Ally Yang, Mihai Albu, Atina G. Cote, Alejandro Montenegro-Montero, Philipp Drewe, Hamed S. Najafabadi, et al. “Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity.” Cell 158, no. 6 (September 2014): 1431–43. https://doi.org/10.1016/j.cell.2014.08.009.
  • HOCOMOCO (Homo sapiens comprehensive model collection) - TFBS models and PWMs. Human- and mouse-specific models. HOCOMOCO v11 contains binding models for 453 mouse and 680 human transcription factors and includes 1302 mononucleotide and 576 dinucleotide position weight matrices. Uniformly processed data from GTRD, peaks called with four peak callers. Used ChIPMunk in four computational models, including using DNA shape. Added MoLoTool, a web app to scan DNA sequences for TFBSs with PWMs. One model per TF is manually selected. Twice as many models as in JASPAR.

    Paper Kulakovskiy, Ivan V., Yulia A. Medvedeva, Ulf Schaefer, Artem S. Kasianov, Ilya E. Vorontsov, Vladimir B. Bajic, and Vsevolod J. Makeev. “HOCOMOCO: A Comprehensive Collection of Human Transcription Factor Binding Sites Models.” Nucleic Acids Research 41, no. D1 (January 1, 2013): D195–202. https://doi.org/10.1093/nar/gks1089.

    Kulakovskiy, Ivan V., Ilya E. Vorontsov, Ivan S. Yevshin, Anastasiia V. Soboleva, Artem S. Kasianov, Haitham Ashoor, Wail Ba-alawi, et al. “HOCOMOCO: Expansion and Enhancement of the Collection of Transcription Factor Binding Sites Models.” Nucleic Acids Research 44, no. D1 (January 4, 2016): D116–25. https://doi.org/10.1093/nar/gkv1249.

    Kulakovskiy, Ivan V, Ilya E Vorontsov, Ivan S Yevshin, Ruslan N Sharipov, Alla D Fedorova, Eugene I Rumynskiy, Yulia A Medvedeva, et al. “HOCOMOCO: Towards a Complete Collection of Transcription Factor Binding Models for Human and Mouse via Large-Scale ChIP-Seq Analysis.” Nucleic Acids Research 46, no. D1 (January 4, 2018): D252–59. https://doi.org/10.1093/nar/gkx1106.

  • SwissRegulon - a database of regulatory motifs (PWMs) across model organisms (prokaryotes, eukaryotes). Data partly comes from JASPAR and TRANSFAC, reprocessing of ChIP-seq experiments. GBrowse for browsing TFBSs. Other tools.
    Paper Pachkov, Mikhail, Piotr J. Balwierz, Phil Arnold, Evgeniy Ozonov, and Erik van Nimwegen. “SwissRegulon, a Database of Genome-Wide Annotations of Regulatory Sites: Recent Updates.” Nucleic Acids Research 41, no. D1 (November 23, 2012): D214–20. https://doi.org/10.1093/nar/gks1145.
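
The motif databases above distribute position weight matrices that are typically used to scan sequences for candidate binding sites. As a rough illustration of what such a scan does, here is a small sketch that converts a count matrix to log-odds scores and scores every window on the forward strand. The pseudocount, the uniform background, and the function names are assumptions chosen for the example, not the settings of any listed tool.

```python
from math import log2

BASES = "ACGT"


def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts (list of dicts) to log-odds scores."""
    pwm = []
    for col in counts:
        total = sum(col[b] for b in BASES) + 4 * pseudocount
        pwm.append(
            {b: log2(((col[b] + pseudocount) / total) / background) for b in BASES}
        )
    return pwm


def scan(seq, pwm):
    """Return (start, score) for every forward-strand window of motif width."""
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing N or other ambiguity codes
        hits.append((start, sum(pwm[j][ch] for j, ch in enumerate(window))))
    return hits


if __name__ == "__main__":
    # Toy 3-bp motif "TGA" observed in 8 hypothetical sites.
    toy_counts = [
        {"A": 0, "C": 0, "G": 0, "T": 8},
        {"A": 0, "C": 0, "G": 8, "T": 0},
        {"A": 8, "C": 0, "G": 0, "T": 0},
    ]
    toy_pwm = counts_to_pwm(toy_counts)
    print(max(scan("CCTGACC", toy_pwm), key=lambda hit: hit[1]))
```

A real scan would also score the reverse complement and apply a score or p-value threshold.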

ChIP-seq

ChIP-seq pipelines

  • ChIP-AP - ChIP-seq analysis pipeline integrating multiple tools and peak callers (FastQC, Clumpify and BBDuk from the BBMap Suite, Trimmomatic, BWA, Samtools, deepTools, MACS2, GEM, SICER2, HOMER, Genrich, IDR, and the MEME-Suite). QC, cleanup, alignment, peak-calling, pathway analysis. High-confidence peaks based on overlaps by different peak callers. Input - single- or paired-end FASTQ files or aligned BAM files. Conda installable. Command line and GUI. Documentation.
    Paper Suryatenggara, Jeremiah, Kol Jia Yong, Danielle E. Tenen, Daniel G. Tenen, and Mahmoud A. Bassal. "ChIP-AP: an integrated analysis pipeline for unbiased ChIP-seq analysis." Briefings in Bioinformatics 23, no. 1 (January 2022) https://doi.org/10.1093/bib/bbab537

Normalization

  • BAMscale - a one-step tool for either 1) quantifying and normalizing the coverage of peaks or 2) generating scaled BigWig files for easy visualization of commonly used DNA-seq capture-based methods.

  • CHIPIN - ChIP-seq Intersample Normalization using gene expression. Assumption - non-differential genes should have non-differential peaks.

  • S3norm - ChIP-seq normalization of both sequencing depth and signal-to-noise ratio to a common reference. Negative Binomial for modeling background, converts counts to -log10(p-values), uses a monotonic nonlinear model to match the means of the common peaks and backgrounds in two datasets. https://github.com/guanjue/S3norm

    • Xiang, Guanjue, Cheryl Keller, Belinda Giardine, Lin An, Ross Hardison, and Yu Zhang. “S3norm: Simultaneous Normalization of Sequencing Depth and Signal-to-Noise Ratio in Epigenomic Data.” BioRxiv, January 1, 2018, 506634. https://doi.org/10.1101/506634.
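
The S3norm entry above describes matching the mean signal of common peaks and common background between a dataset and a reference using a monotonic nonlinear model. The sketch below illustrates that idea with one possible monotonic model, a power law `y = a * x**b`, fitted by bisection. The functional form, the fitting procedure, and all names are assumptions for illustration; this is not the S3norm code.

```python
from math import log


def _mean(values):
    return sum(values) / len(values)


def fit_power_norm(peak_vals, bg_vals, ref_peak_mean, ref_bg_mean):
    """Find (a, b) so that a * x**b has the reference means in peaks and background.

    Assumes positive signals, peak values above background values, and
    ref_peak_mean > ref_bg_mean, which makes the peak/background ratio
    increase with b so that bisection applies.
    """
    target = log(ref_peak_mean / ref_bg_mean)

    def gap(b):
        peak = _mean([x ** b for x in peak_vals])
        bg = _mean([x ** b for x in bg_vals])
        return log(peak) - log(bg) - target

    lo, hi = 0.0, 1.0
    while gap(hi) < 0:          # widen the bracket until the sign changes
        hi *= 2
        if hi > 64:
            raise ValueError("could not bracket the exponent")
    for _ in range(200):        # plain bisection on the exponent
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    b = (lo + hi) / 2
    a = ref_bg_mean / _mean([x ** b for x in bg_vals])
    return a, b


if __name__ == "__main__":
    peaks, background = [8.0, 10.0, 12.0], [1.0, 2.0, 3.0]
    a, b = fit_power_norm(peaks, background, ref_peak_mean=40.0, ref_bg_mean=2.5)
    print(a, b, [a * x ** b for x in peaks])
```

Two constraints (peak mean and background mean) determine the two parameters, which is why a two-parameter monotonic curve is enough for this toy version.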

CUT&RUN

Quality control

  • ChIPQC - Quality metrics for ChIPseq data

  • phantompeakqualtools - This package computes informative enrichment and quality measures for ChIP-seq/DNase-seq/FAIRE-seq/MNase-seq data. It can also be used to obtain robust estimates of the predominant fragment length or characteristic tag shift values in these assays. https://github.com/kundajelab/phantompeakqualtools
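
Fragment-length estimates of the kind mentioned for phantompeakqualtools are commonly derived from strand cross-correlation: the plus-strand and minus-strand read-start profiles are correlated at increasing shifts, and the shift with the highest correlation approximates the fragment length. The following is a simplified single-chromosome sketch under that assumption; it is not the package's implementation, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def strand_cross_correlation(plus_starts, minus_starts, chrom_len, max_shift):
    """Correlate plus-strand 5' end counts with minus-strand counts at each shift."""
    plus = [0] * chrom_len
    minus = [0] * chrom_len
    for pos in plus_starts:
        plus[pos] += 1
    for pos in minus_starts:
        minus[pos] += 1
    profile = []
    for shift in range(max_shift + 1):
        # Compare plus[i] with minus[i + shift] over the overlapping range.
        profile.append((shift, pearson(plus[:chrom_len - shift], minus[shift:])))
    return profile


if __name__ == "__main__":
    # Toy data: minus-strand read starts sit 50 bp downstream of plus-strand starts.
    plus_reads = [100, 300, 500, 700]
    minus_reads = [p + 50 for p in plus_reads]
    profile = strand_cross_correlation(plus_reads, minus_reads, 1000, 100)
    print(max(profile, key=lambda item: item[1]))
```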

Peaks

  • Benchmarking of 14 ChIP-seq tools for peak calling and differential analysis (described in supplementary). Experimental and simulated data, narrow and broad peaks, with/without input, replicates. DEG enrichment analysis. Poor agreement. MACS2 performs OK for sharp peaks. Figure 7 - decision tree for tool selection.
    Paper Steinhauser, Sebastian, Nils Kurzawa, Roland Eils, and Carl Herrmann. “A Comprehensive Comparison of Tools for Differential ChIP-Seq Analysis,” n.d., 14.
  • LanceOtron - deep learning-based peak caller from TF and histone ChIP-seq, ATAC-seq, DNAse-seq. Input - bigWig coverage file (+input, if available). Image recognition using wide and deep model (logistic regression producing enrichment scores, CNN, multilayer perceptron, Fig. 1c, Methods). Trained on hand-labeled data. Outperforms MACS2. Visualization using MLV genome visualization software. Website with videos, documentation.
    Paper Hentges, Lance D., Martin J. Sergeant, Damien J. Downes, Jim R. Hughes, and Stephen Taylor. "LanceOtron: a deep learning peak caller for ATAC-seq, ChIP-seq, and DNase-seq." bioRxiv (2021). https://doi.org/10.1101/2021.01.25.428108

Enhancers

  • ROSE - rank-ordering of super-enhancers using H3K27ac ChIP-seq data, by the Young lab.

  • LILI - a pipeline by Boeva lab for detection of super-enhancers using H3K27ac ChIP-seq data, which includes explicit correction for copy number variation inherent to cancer samples. The pipeline is based on the ROSE algorithm originally developed by the Young lab.

  • CenhANCER - a cancer enhancer database, curating public H3K27ac ChIP-seq data from 805 primary tissue samples and 671 cell line samples across 41 cancer types. 57 029 408 typical enhancers, 978 411 super-enhancers and 226 726 enriched transcription factors. Annotated with SNPs. Table 1 - comparison with other resources (CancerEnD, OncoBase, OncoCis, ENdb, DiseaseEnhancer, SEdb, SEanalysis).

    Paper Luo, Zhi-Hui, Meng-Wei Shi, Yuan Zhang, Dan-Yang Wang, Yi-Bo Tong, Xue-Ling Pan, and ShanShan Cheng. “CenhANCER: A Comprehensive Cancer Enhancer Database for Primary Tissues and Cell Lines.” Database 2023 (May 18, 2023): baad022. https://doi.org/10.1093/database/baad022.

Visualization

Intersections

  • BedSect - web server for intersection analysis of genomic regions, UpSet and correlation plots. Gene-centric, GREAT enrichment analysis. Integrated with the GTRD database. GitHub.
    Paper Mishra, Gyan Prakash, Arup Ghosh, Atimukta Jha, and Sunil Kumar Raghav. “BedSect: An Integrated Web Server Application to Perform Intersection, Visualization, and Functional Annotation of Genomic Regions From Multiple Datasets.” Frontiers in Genetics 11 (February 5, 2020): 3. https://doi.org/10.3389/fgene.2020.00003.
  • Intervene - command line and web server for Venn diagrams of overlaps of genomic regions (up to six sets), UpSet plot, correlation heatmap. Python (pybedtools, Seaborn, Matplotlib) and R (UpSetR, Corrplot, Vennerable). BitBucket.
    Paper Khan, Aziz, and Anthony Mathelier. “Intervene: A Tool for Intersection and Visualization of Multiple Gene or Genomic Region Sets.” BMC Bioinformatics 18, no. 1 (December 2017): 287. https://doi.org/10.1186/s12859-017-1708-7.
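
The intersection tools above compare sets of genomic regions. As a rough picture of the underlying computation, here is a small sketch that intersects two region sets on a single chromosome and reports a base-pair Jaccard index. Half-open `(start, end)` coordinates, as in BED, are assumed, and the function names are illustrative rather than taken from BedSect or Intervene.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def intersect(a, b):
    """Intersect two sorted, merged interval lists with a two-pointer sweep."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def jaccard(a, b):
    """Base-pair Jaccard index: covered intersection over covered union."""
    a, b = merge(a), merge(b)
    inter = sum(end - start for start, end in intersect(a, b))
    union = (
        sum(end - start for start, end in a)
        + sum(end - start for start, end in b)
        - inter
    )
    return inter / union if union else 0.0


if __name__ == "__main__":
    set_a = [(0, 100), (200, 300)]
    set_b = [(50, 150), (250, 400)]
    print(intersect(merge(set_a), merge(set_b)), jaccard(set_a, set_b))
```

For multi-chromosome data the same functions can be applied per chromosome and the base-pair totals summed before taking the ratio.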

Motif analysis

  • memes - an R package interfacing MEME suite (DREME, AME, FIMO, TOMTOM). Using universalmotif_df R/Bioconductor object to make results compatible across tools. De novo motif discovery, differential motifs, known motif enrichment analysis. Visualization capabilities. Case example on ChIP-seq peaks in Drosophila wing development. Requires installation of MEME suite. Docker container with RStudio and everything configured. Bioconductor, pkgdown website.
    Paper Nystrom, Spencer L, and Daniel J McKay. “Memes: A Motif Analysis Environment in R Using Tools from the MEME Suite.” PLOS COMPUTATIONAL BIOLOGY, n.d., 14.
  • ChEA3 - transcription factor enrichment in gene lists. Six reference libraries of TF regulatory signatures (ARCHS4, ENCODE, GTEx, ReMap, Enrichr, Literature). Fisher's exact test. Outperforms VIPER, DoRothEA, BART, TFEA.ChIP, oPOSSUM, MAGICACT.
    Paper Keenan, Alexandra B, Denis Torre, Alexander Lachmann, Ariel K Leong, Megan L Wojciechowicz, Vivian Utti, Kathleen M Jagodnik, Eryk Kropiwnicki, Zichen Wang, and Avi Ma’ayan. “[ChEA3: Transcription Factor Enrichment Analysis by Orthogonal Omics Integration](https://doi.org/10.1093/nar/gkz446).” Nucleic Acids Research, (July 2, 2019)
  • TFEA.ChIP - R package for transcription factor enrichment of gene lists (hypergeometric and GSEA) using experimental ChIP-seq datasets (ENCODE, GEO). Tested on known signatures, compared with two PWM-based and ChIP-based, performs comparably or better.

  • PWMScan - web tool for scanning entire genomes with a position-specific weight matrix. Multiple genomes and assemblies hosted on the server. Multiple PWM collections for Eukaryotic DNA (JASPAR, HOCOMOCO, SwissRegulon, UniPROBE, CIS-BP, from Jolma, Isakova publications). matrix_scan C program for matching PWMs. Compared with other motif scanning tools (PoSSuMsearch, Patser, RSAT, STORM, HOMER), overlap >99%. Output - BEDdetail format. Code.

    Paper Ambrosini, Giovanna, Romain Groux, and Philipp Bucher. “PWMScan: A Fast Tool for Scanning Entire Genomes with a Position-Specific Weight Matrix.” Edited by John Hancock. Bioinformatics 34, no. 14 (July 15, 2018): 2483–84. https://doi.org/10.1093/bioinformatics/bty127.
  • gimmemotifs - framework for TF motif analysis using an ensemble of motif predictors. maelstrom tool to detect differential motif activity between multiple different conditions. Includes manually curated database of motifs. Benchmark of 14 motif detection tools - Homer, MEME, BioProspector are among the top performing. Extensive analysis results. Documentation. Tweet with updates

  • DECOD - Differential motif finder. k-mer-based. http://gene.ml.cmu.edu/DECOD/

    • Huggins, Peter, Shan Zhong, Idit Shiff, Rachel Beckerman, Oleg Laptenko, Carol Prives, Marcel H. Schulz, Itamar Simon, and Ziv Bar-Joseph. “DECOD: Fast and Accurate Discriminative DNA Motif Finding.” Bioinformatics 27, no. 17 (September 1, 2011): 2361–67. https://doi.org/10.1093/bioinformatics/btr412.
  • Non-redundant TF motif matches genome-wide - Clustering of 2179 motif models. hg38/mm10 BED files download with coordinates

  • homerkit - Read HOMER motif analysis output in R.

  • LISA - epigenetic Landscape In silico Subtraction analysis, enriched TFs and chromatin regulators in a list of genes.

  • Logolas - R package for Enrichment Depletion Logos (EDLogos) and String Logos.

  • marge - API for HOMER in R for Genomic Analysis using Tidy Conventions, GitHub

  • motifbreakR - R package for predicting the disruptiveness of single nucleotide polymorphism on TFBSs. SNPs may be a list of rsIDs or a BED file. Includes MotifDB PWMs and others (ENCODE, Factorbook, Hocomoco, homer).

  • motifStack - R package for plotting stacked logos for single or multiple DNA, RNA and amino acid sequence.

  • pyjaspar - A Pythonic interface to query and access JASPAR transcription factor motifs

  • RcisTarget - R package to identify transcription factor binding motifs enriched on a list of genes or genomic regions.

  • rGADEM - R package for de novo motif discovery.
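
Tools such as motifbreakR assess how a SNP changes a motif match. A bare-bones way to picture this is to score the best motif window overlapping the SNP for the reference and the alternate allele and compare the two. The sketch below does that on the forward strand only, with a hypothetical log-odds matrix; it is an illustration of the idea and not the scoring scheme of any package listed above.

```python
def best_overlapping_score(seq, snp_pos, pwm):
    """Best PWM score among forward-strand windows that cover snp_pos.

    Assumes len(seq) >= len(pwm) and that seq contains only A, C, G, T.
    """
    width = len(pwm)
    first = max(0, snp_pos - width + 1)
    last = min(snp_pos, len(seq) - width)
    return max(
        sum(pwm[j][seq[start + j]] for j in range(width))
        for start in range(first, last + 1)
    )


def snp_effect(seq, snp_pos, alt, pwm):
    """Return (ref_score, alt_score, alt_score - ref_score) for a substitution."""
    ref_score = best_overlapping_score(seq, snp_pos, pwm)
    alt_seq = seq[:snp_pos] + alt + seq[snp_pos + 1:]
    alt_score = best_overlapping_score(alt_seq, snp_pos, pwm)
    return ref_score, alt_score, alt_score - ref_score


if __name__ == "__main__":
    # Hypothetical log-odds matrix favouring the 3-mer "TGA".
    toy_pwm = [
        {"A": -2.0, "C": -2.0, "G": -2.0, "T": 1.5},
        {"A": -2.0, "C": -2.0, "G": 1.5, "T": -2.0},
        {"A": 1.5, "C": -2.0, "G": -2.0, "T": -2.0},
    ]
    # G>C substitution at 0-based position 3 disrupts the central base of the match.
    print(snp_effect("CCTGACC", 3, "C", toy_pwm))
```

A negative difference suggests the alternate allele weakens the match, a positive one that it creates or strengthens a site.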

Differential peak detection

  • diffTF - differential TF activity calculation and integration with RNA-seq data for classification of TFs into activators or repressors. Differential analysis on consensus peaks using permutations or statistics (diffPeaks). Input: BAM, fasta files, RNA-seq counts, external TFBS data (HOCOMOCO, JASPAR, TRRUST, or ReMap). Applied to several datasets, including multiomics, recovers known biology, experimental validation. Implemented as a Snakemake pipeline. Singularity, conda installation. Documentation.
    Paper Berest, Ivan, Christian Arnold, Armando Reyes-Palomares, Giovanni Palla, Kasper Dindler Rasmussen, Holly Giles, Peter-Martin Bruch, et al. “Quantification of Differential Transcription Factor Activity and Multiomics-Based Classification into Activators and Repressors: DiffTF.” Cell Reports 29, no. 10 (December 2019): 3147-3159.e12. https://doi.org/10.1016/j.celrep.2019.10.106.

Enrichment

  • UniBind Enrichment Analysis, also differential enrichment. Input - BED file in hg38 version. LOLA as an enrichment and database engine.

Interpretation

  • Lisa - web server to determine the transcription factors and chromatin regulators that are directly responsible for the perturbation of a differentially expressed gene set (chrom-PR score). Using public and custom human and mouse DNase-seq, and H3K27ac ChIP-seq profiles (CistromeDB). Input: list of differential genes. GitHub.
    Paper Qin, Qian, Jingyu Fan, Rongbin Zheng, Changxin Wan, Shenglin Mei, Qiu Wu, Hanfei Sun, et al. “Lisa: Inferring Transcriptional Regulators through Integrative Modeling of Public Chromatin Accessibility and ChIP-Seq Data.” Genome Biology 21, no. 1 (December 2020): 32. https://doi.org/10.1186/s13059-020-1934-6.
  • Cistrome-GO - functional enrichment analysis of genes regulated by TFs in human and mouse. Solo mode (ChIP-seq peaks only) or ensemble mode (integrates ChIP-seq peaks and RNA-seq differentially expressed genes). Implementation of BETA method. MACS2 peaks, DESeq2 output. Gene-centric regulatory potential (RP) score (sum of peaks, exponentially weighted by distance). Human (hg19/hg38), Mouse (mm9/mm10).
    Paper Li, Shaojuan, Changxin Wan, Rongbin Zheng, Jingyu Fan, Xin Dong, Clifford A. Meyer, and X. Shirley Liu. “Cistrome-GO: A Web Server for Functional Enrichment Analysis of Transcription Factor ChIP-Seq Peaks.” Nucleic Acids Research 47, no. W1 (July 2, 2019): W206–11. https://doi.org/10.1093/nar/gkz332.
  • Toolkit for Cistrome Data Browser - online tool to answer questions like:

    • What factors regulate your gene of interest?
    • What factors bind in your interval?
    • What factors have a significant binding overlap with your peak set?
  • Chongzhi Zang software page, http://faculty.virginia.edu/zanglab/software.htm

    • BART (Binding Analysis for Regulation of Transcription), a bioinformatics tool for predicting functional transcription factors (TFs) that bind at genomic cis-regulatory regions to regulate gene expression in the human or mouse genomes, given a query gene set or a ChIP-seq dataset as input. http://bartweb.uvasomrc.io/
    • MARGE (Model-based Analysis of Regulation of Gene Expression), a comprehensive computational method for inference of cis-regulation of gene expression leveraging public H3K27ac genomic profiles in human or mouse. http://cistrome.org/MARGE/
    • MANCIE (Matrix Analysis and Normalization by Concordant Information Enhancement), a computational method for high-dimensional genomic data integration. https://cran.r-project.org/web/packages/MANCIE/index.html
    • SICER (Spatial-clustering Identification of ChIP-Enriched Regions), a ChIP-Seq data analysis method. https://home.gwu.edu/~wpeng/Software.htm
  • UROPA - Universal RObust Peak Annotator, a command line based tool intended for genomic region annotation. Definition of overlap/proximity types. Documentation
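
The Cistrome-GO entry above mentions a gene-centric regulatory potential score in which peaks are summed with weights that decay exponentially with distance. A minimal sketch of such a score follows. The specific decay function (halving every `half_decay` bases), the 10 kb default, and the 100 kb distance cutoff are assumptions chosen for illustration, not parameters confirmed by the notes.

```python
def regulatory_potential(tss, peak_centers, half_decay=10_000, max_dist=100_000):
    """Sum of peak weights that halve every `half_decay` bp away from the TSS.

    Peaks farther than `max_dist` from the TSS are ignored.
    """
    score = 0.0
    for center in peak_centers:
        distance = abs(center - tss)
        if distance <= max_dist:
            score += 2 ** (-distance / half_decay)
    return score


if __name__ == "__main__":
    tss = 1_000_000
    # Peaks at the TSS, 10 kb downstream, 20 kb upstream, and 500 kb away.
    peaks = [tss, tss + 10_000, tss - 20_000, tss + 500_000]
    print(regulatory_potential(tss, peaks))
```

Genes can then be ranked by this score, with nearby peaks contributing far more than distal ones.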

Excludable

  • CUT&RUN blacklists for human (hg38) and mouse (mm10) genomes. Different biochemical properties than ChIP-seq, SEACR peak caller that uses global background. 20 C&R negative control datasets per human/mouse genome, consistently called artifactual peaks (the highest 0.1% signals in more than 30% of replicates, peaks extended by 1Kb) are assembled into blacklists. Also contain mitochondrial sequences (NUMTs). Tested bowtie2 and bowtie alignment strategies. Cover approximately 0.2% of the genome, removing reads overlapping them increases variability among samples (PCA). Compared with the Boyle's Blacklist-generated lists. BED coordinates in supplementary.
    Paper Nordin, Anna, Gianluca Zambanini, Pierfrancesco Pagella, and Claudio Cantù. “The CUT&RUN Blacklist of Problematic Regions of the Genome.” Preprint. Genomics, November 14, 2022. https://doi.org/10.1101/2022.11.11.516118.
  • Blacklist - Application for making ENCODE Blacklists, and links to canonical blacklists. C, C++.
    Paper Amemiya, Haley M., Anshul Kundaje, and Alan P. Boyle. “The ENCODE Blacklist: Identification of Problematic Regions of the Genome.” Scientific Reports 9, no. 1 (December 2019): 9354. https://doi.org/10.1038/s41598-019-45839-z.
  • GEM - mappability calculations for each genomic region, accounting for mismatches. Pre-calculated UCSC genome browser tracks for human and mouse. Mappability of genes, both protein-coding and non-protein coding. RPKUM - unique exons for quantifying gene expression.
    Paper Derrien, Thomas, Jordi Estellé, Santiago Marco Sola, David G. Knowles, Emanuele Raineri, Roderic Guigó, and Paolo Ribeca. “Fast Computation and Applications of Genome Mappability.” PloS One 7, no. 1 (2012): e30377. https://doi.org/10.1371/journal.pone.0030377.
  • Greenscreen - an approach for removing false-positive peaks (ultra-high noise) from ChIP-seq data (also, CUT&RUN) using MACS2 (broadpeak setting, optimized significance threshold and merging distance to match Blacklist-created regions). As effective as canonical blacklists, improves true factor binding overlap, improves Standardized Standard Deviation (SSD -> 1), improves replicate correlation structure. Uses as few as three samples, 99.9% overlap with Blacklist-created regions, smaller genomic footprint, same performance as Blacklist-generated.
    Paper Klasfeld, Sammy, and Doris Wagner. “Greenscreen Decreases Type I Errors and Increases True Peak Detection in Genomic Datasets Including ChIP-Seq.” Preprint. Genomics, March 1, 2022. https://doi.org/10.1101/2022.02.27.482177.
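
The CUT&RUN blacklist entry above describes assembling regions from bins that fall among the highest signals in a sizeable share of negative-control replicates, then extending them. The sketch below mirrors that logic on binned signal vectors for a single chromosome. The binning, the tie handling, and the function names are assumptions for illustration; the defaults echo the 0.1%, 30%, and 1 Kb figures quoted in the notes, but this is not the published pipeline.

```python
def top_fraction_bins(signal, fraction):
    """Indices of the bins holding the top `fraction` of signal values (at least one)."""
    n_top = max(1, int(len(signal) * fraction))
    order = sorted(range(len(signal)), key=lambda i: signal[i], reverse=True)
    return set(order[:n_top])


def recurrent_regions(replicates, bin_size, fraction=0.001, min_share=0.3, extend=1000):
    """Regions whose bins are top-ranked in more than `min_share` of replicates."""
    counts = {}
    for signal in replicates:
        for idx in top_fraction_bins(signal, fraction):
            counts[idx] = counts.get(idx, 0) + 1
    keep = sorted(i for i, c in counts.items() if c / len(replicates) > min_share)
    regions = []
    for idx in keep:
        start = max(0, idx * bin_size - extend)
        end = (idx + 1) * bin_size + extend
        if regions and start <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], end)  # merge with previous region
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]


if __name__ == "__main__":
    # Four toy control replicates over ten 1 kb bins; bin 3 is recurrently high.
    controls = [
        [1, 2, 1, 90, 2, 1, 1, 3, 1, 2],
        [2, 1, 1, 80, 1, 2, 1, 1, 2, 1],
        [1, 1, 2, 5, 1, 1, 2, 70, 1, 1],
        [1, 3, 1, 95, 2, 1, 1, 2, 1, 1],
    ]
    print(recurrent_regions(controls, bin_size=1000, fraction=0.1, extend=500))
```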

DNAse-seq

  • DNAse-seq analysis guide. Tools for QC, peak calling, analysis, footprint detection, motif analysis, visualization, all-in-one tools (Table 2)
    • Liu, Yongjing, Liangyu Fu, Kerstin Kaufmann, Dijun Chen, and Ming Chen. “A Practical Guide for DNase-Seq Data Analysis: From Data Management to Common Applications.” Briefings in Bioinformatics, July 12, 2018. https://doi.org/10.1093/bib/bby057.

ATAC-seq

  • awesome-atac-analysis Awesome ATAC-seq analysis by Nathan Sheffield.

  • Benchmarking ATAC-seq peak calling by Austin Montgomery

  • ATAC-seq analysis considerations. Considering multiple workflows, settling on csaw-based. Normalization by library complexity (downsampling) is important. Workflow and GitHub with all scripts.

    Paper Reske, Jake J., Mike R. Wilson, and Ronald L. Chandler. “ATAC-Seq Normalization Method Can Significantly Affect Differential Accessibility Analysis and Interpretation.” Epigenetics & Chromatin 13, no. 1 (December 2020): 22. https://doi.org/10.1186/s13072-020-00342-y.
  • UNMC_ATACseq_Tutorial - An open-source interactive pipeline tutorial for differential ATAC-seq footprint analysis on the cloud (Google, AWS, Azure)

  • OCHROdb - a database of open chromatin regions (over 1.4M). 828 DNAse-I experiments, 194 cell lines, uniformly processed, QC'd, peaks called using Hotspot, regulatory elements clustered across all samples, batch effect corrected, reproducible peaks statistically selected. Data from ENCODE, Roadmap Epigenomics Mapping Consortium (REMC), Blueprint Epigenome and Genomics of Gene Regulation (GGR). Downloadable metadata, curated DHS dataset (full and chromosome-specific, BED format with cell/tissue-specific columns with accessibility values), visualized in JBrowse.

    Paper Shooshtari, Parisa, Samantha Feng, Viswateja Nelakuditi, Reza Asakereh, Nader Hosseini Naghavi, Justin Foong, Michael Brudno, and Chris Cotsapas. “Developing OCHROdb, a Comprehensive Quality Checked Database of Open Chromatin Regions from Sequencing Data.” Scientific Reports 13, no. 1 (May 18, 2023): 8106. https://doi.org/10.1038/s41598-022-26791-x.
  • DNAseI hypersensitive sites from 733 biosamples (439 cell and tissue types and states). NMF to simplify pattern detection. NMF patterns better explain heritability. Data at ENCODE and Zenodo, data browser. Twitter, data download
    Paper Meuleman, Wouter, Alexander Muratov, Eric Rynes, Jessica Halow, Kristen Lee, Daniel Bates, Morgan Diegel, et al. “Index and Biological Spectrum of Human DNase I Hypersensitive Sites.” Nature, July 29, 2020. https://doi.org/10.1038/s41586-020-2559-3.

ATAC-seq pipelines

  • ENCODE ATAC-seq pipeline - ATAC-seq and DNase-seq processing pipeline by Anshul Kundaje

  • TOBIAS (Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal) - transcription factor footprinting framework for ATAC-seq data. Corrects for Tn5 bias (ATACorrect module, Figure 1). Outperforms HINT-ATAC, PIQ, Wellington, similar or better performance compared with msCentipede. Validated using paired ATAC-seq and ChIP-seq data. Visualization of aggregated ATAC-seq signals, differential and time course analysis, TF clustering, network building. Input - BAM file, genome FASTA, BED peaks. Output - bigWigs of uncorrected, corrected signals, expected and corrected symbols. Conda, Nextflow implementation.

    Paper Bentsen, Mette, Philipp Goymann, Hendrik Schultheis, Kathrin Klee, Anastasiia Petrova, René Wiegandt, Annika Fust, et al. “ATAC-Seq Footprinting Unravels Kinetics of Transcription Factor Binding during Zygotic Genome Activation.” Nature Communications 11, no. 1 (August 26, 2020): 4267. https://doi.org/10.1038/s41467-020-18035-1.
  • HINT-ATAC - a footprinting method considering ATAC-seq protocol biases. Uses a position dependency model (PDM) to learn the cleavage preferences (Methods). Compared against three footprinting methods, DNase2TF, PIQ, Wellington. PDMs are crucial for correction of cleavage bias for ATAC-seq for all methods. Also improves correction for DNase-seq data. Comparison of protocols, Omni-ATAC (best performance), Fast-ATAC. Part of RGT, Regulatory Genomics Toolbox. Tutorial.
    Paper Li, Zhijian, Marcel H. Schulz, Thomas Look, Matthias Begemann, Martin Zenke, and Ivan G. Costa. “Identification of Transcription Factor Binding Sites Using ATAC-Seq.” Genome Biology, (December 2019). https://doi.org/10.1186/s13059-019-1642-2
  • HMMRATAC - hidden Markov model for ATAC-seq to identify open chromatin regions. Parametric modeling of nucleosome-free regions and three nucleosomal features (mono-, di-, and tri-nucleosomes). First, train on 1000 auto-selected regions, then predict. Tested on "active promoters" and "strong enhancers" chromatin states (positive examples), and "heterochromatin" (negative examples). Compared with MACS2, F-seq.
    Paper Tarbell, Evan D, and Tao Liu. “HMMRATAC: A Hidden Markov ModeleR for ATAC-Seq.” Nucleic Acids Research, June 14, 2019, gkz533. https://doi.org/10.1093/nar/gkz533
  • ATACseqQC - R package for ATAC-seq quality control and analysis. QC, preprocessing, read shift, peak calling, motif analysis, enrichment in nucleosome-free regions, plotting (heatmaps, library complexity). Table 1 - summary of functions. Additional material - examples of commands, table of comparison with other pipelines.
    Paper Ou, Jianhong, Haibo Liu, Jun Yu, Michelle A. Kelliher, Lucio H. Castilla, Nathan D. Lawson, and Lihua Julie Zhu. “ATACseqQC: A Bioconductor Package for Post-Alignment Quality Assessment of ATAC-Seq Data.” BMC Genomics 19, no. 1 (December 2018): 169. https://doi.org/10.1186/s12864-018-4559-3.
  • atac_chip_preprocess - Preprocessing workflow for ATAC-seq and ChIP-seq data, Nextflow pipeline.

  • ATAC-seq peak calling using MACS2: macs2 callpeak --nomodel --nolambda --keep-dup all --call-summits -f BAMPE -g hs

  • ATACProc - ATAC-seq processing pipeline

  • atacseq - nf-core ATAC-seq peak-calling and differential analysis pipeline.

  • pepatac - A modular, containerized pipeline for ATAC-seq data processing. Examples and documentation
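
The MACS2 command listed above omits the input and output arguments. A minimal sketch of assembling the full call from Python follows; the helper name, the file name `sample.bam`, the sample name, and the output directory are hypothetical, while `-t`, `-n`, and `--outdir` are the standard `macs2 callpeak` options for the treatment file, output prefix, and output directory.

```python
import shlex


def macs2_atac_command(bam_path, sample_name, outdir="macs2_out"):
    """Build the argument list for the ATAC-seq MACS2 call from the notes.

    The flags after `-g hs` are copied from the command in the list above;
    the paths and names are placeholders supplied by the caller.
    """
    return [
        "macs2", "callpeak",
        "-t", bam_path,        # paired-end BAM with ATAC-seq alignments
        "-n", sample_name,     # prefix for output files
        "--outdir", outdir,
        "-f", "BAMPE",         # use real fragments from read pairs
        "-g", "hs",            # human effective genome size
        "--nomodel",
        "--nolambda",
        "--keep-dup", "all",
        "--call-summits",
    ]


cmd = macs2_atac_command("sample.bam", "sample")
print(shlex.join(cmd))  # shell-ready string for logging or copy-paste
```

Passing the list to `subprocess.run(cmd, check=True)` would run the call, assuming `macs2` is installed and on the `PATH`.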

Histone-seq

Peaks were called using the Homer program ‘findPeaks’ with the style ‘histone’. Peaks within 1 kb were merged into a single peak. Broad peaks in H3K36me3, H3K27me3 and H3K9me3 were called using the Homer program ‘findPeaks’ with the options ‘-region -size 1000 -minDist 2500’. With these options, the initial peaks are 1 kb wide and peaks within 2.5 kb are merged.
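
The merging step described above (peaks within 1 kb, or 2.5 kb for broad marks, collapsed into one) can be sketched as follows. This is an illustration only: the function is hypothetical, and measuring distance as the gap between the end of one peak and the start of the next is an assumption that may differ from how Homer defines it.

```python
def merge_peaks(peaks, max_gap=1000):
    """Merge (chrom, start, end) peaks on the same chromosome.

    Two peaks are merged when the gap between them is at most `max_gap` bp.
    Assumption: distance is measured edge to edge, not center to center.
    """
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical toy peaks.
peaks = [
    ("chr1", 100, 600),
    ("chr1", 1200, 1800),  # 600 bp after the first peak, so merged with it
    ("chr1", 5000, 5500),  # 3200 bp after the merged peak, so kept separate
    ("chr2", 200, 700),
]
merged = merge_peaks(peaks)
```

Calling `merge_peaks(peaks, max_gap=2500)` mimics the broad-peak distance; with these toy peaks the result is the same because the remaining gap is 3200 bp.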

  • DEScan2 - broad peak (histone, ATAC, DNase) analysis (peak caller, peak filtering and alignment across replicates, creation of a count matrix). Peak caller uses a moving window and calculates a Poisson likelihood of a peak as compared to a region outside the window. https://bioconductor.org/packages/release/bioc/html/DEScan2.html

    • Righelli, Dario, John Koberstein, Nancy Zhang, Claudia Angelini, Lucia Peixoto, and Davide Risso. “Differential Enriched Scan 2 (DEScan2): A Fast Pipeline for Broad Peak Analysis.” PeerJ Preprints, 2018.
  • HMCan and HMCan-diff - histone ChIP-seq peak caller (and differential) that accounts for CNV, also for CG bias. Hidden Markov Model to detect peak signal. Control-FREEC to detect CNV in ChIP-seq data. Outperforms others, CCAT second best. https://www.cbrc.kaust.edu.sa/hmcan/

    • Ashoor, Haitham, Aurélie Hérault, Aurélie Kamoun, François Radvanyi, Vladimir B. Bajic, Emmanuel Barillot, and Valentina Boeva. “HMCan: A Method for Detecting Chromatin Modifications in Cancer Samples Using ChIP-Seq Data.” Bioinformatics (Oxford, England) 29, no. 23 (December 1, 2013): 2979–86. https://doi.org/10.1093/bioinformatics/btt524.
    • Ashoor, Haitham, Caroline Louis-Brennetot, Isabelle Janoueix-Lerosey, Vladimir B. Bajic, and Valentina Boeva. “HMCan-Diff: A Method to Detect Changes in Histone Modifications in Cells with Different Genetic Characteristics.” Nucleic Acids Research 45, no. 8 (05 2017): e58. https://doi.org/10.1093/nar/gkw1319.
  • RSEG - ChIP-seq analysis for identifying genomic regions and their boundaries marked by diffusive histone modification markers, such as H3K36me3 and H3K27me3, http://smithlabresearch.org/software/rseg/
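
The moving-window Poisson idea mentioned in the DEScan2 entry above can be illustrated with a small sketch. It is not the DEScan2 implementation: the bin counts, the window and flank sizes, and the use of an upper-tail Poisson probability as the score are all assumptions made for illustration.

```python
import math


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), summed term by term."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # P(X = k), computed in log space to avoid overflow for large k.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    while True:
        total += term
        i += 1
        term *= lam / i
        # Past the mode the terms shrink monotonically; stop when negligible.
        if i > lam and term <= total * 1e-16:
            break
    return min(total, 1.0)


def score_windows(counts, window=3, flank=10):
    """Score each window against the rate estimated from flanking bins."""
    scores = []
    for start in range(len(counts) - window + 1):
        end = start + window
        background = counts[max(0, start - flank):start] + counts[end:end + flank]
        if not background:
            scores.append(1.0)  # nothing to compare against
            continue
        expected = window * sum(background) / len(background)
        scores.append(poisson_sf(sum(counts[start:end]), expected))
    return scores


# Hypothetical binned read counts with one enriched stretch in the middle.
counts = [1] * 10 + [8, 9, 10] + [1] * 10
scores = score_windows(counts)
best_start = scores.index(min(scores))
```

Smaller scores mean the window holds more reads than its surroundings would predict; here the lowest score falls on the window covering the three enriched bins.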

Broad peak analysis

  • EDD - Enriched Domain Detector, a ChIP-seq peak caller for detection of megabase domains of enrichment.

  • epic2 - an ultraperformant reimplementation of SICER. It focuses on speed, low memory overhead and ease of use.

  • DEScan2 - Integrated peak and differential caller, specifically designed for broad epigenomic signals, R package.

Technology

  • ATAC-STARR-seq - updated protocol that combines transposase-accessible chromatin sequencing (ATAC-seq) with self-transcribing active regulatory region sequencing (STARR-seq) to selectively assay the regulatory potential of accessible DNA. Includes protocols for plasmid library generation, reporter assay, and data analysis (peak-within-peak calling; DESeq2 adapted to normalize reporter RNA read counts to plasmid DNA read counts; duplicates are kept). Agrees with ATAC-seq, much less noisy. GSE181317 - data. GitHub - computational pipeline.
    Paper Hansen, Tyler J., and Emily Hodges. “Identifying Transcription Factor-Bound Activators and Silencers in the Chromatin Accessible Human Genome Using ATAC-STARR-Seq.” Preprint. Genomics, March 28, 2022. https://doi.org/10.1101/2022.03.25.485870.
  • CLIP-seq (cross-linking and immunoprecipitation) technology, detects RNA sites bound by a protein. Figure 1 - technology overview, Figure 2 - details of HITS-CLIP/iCLIP/irCLIP/eCLIP/PAR-CLIP/Proximity-CLIP. Computational analysis, Table 3 - peak detection software. Databases (doRiNA, ENCORI, POSTAR3).
    Paper Hafner, Markus, Maria Katsantoni, Tino Köster, James Marks, Joyita Mukherjee, Dorothee Staiger, Jernej Ule, and Mihaela Zavolan. "CLIP and complementary methods." Nature Reviews Methods Primers 1, no. 1 (2021): 1-23. https://doi.org/10.1038/s43586-021-00018-1
  • STARR-seq (self-transcribing active regulatory region sequencing) technology for enhancer identification. 3 min Video protocol. Applied to Drosophila genome. The majority (55.6%) of identified enhancers were located within introns, especially in the first intron (37.2%), and in intergenic regions (22.6%). Many genes appeared to be regulated by several independently functioning enhancers.
    Paper Arnold, Cosmas D., Daniel Gerlach, Christoph Stelzer, Łukasz M. Boryń, Martina Rath, and Alexander Stark. “Genome-Wide Quantitative Enhancer Activity Maps Identified by STARR-Seq.” Science 339, no. 6123 (March 2013): 1074–77. https://doi.org/10.1126/science.1232542.
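
The idea of normalizing reporter RNA counts to plasmid DNA counts, mentioned in the ATAC-STARR-seq entry above, can be illustrated with a simplified sketch. This is not the adapted DESeq2 procedure from the paper: the depth scaling to counts per million, the pseudocount, and the toy counts are all assumptions.

```python
import math


def log2_activity(rna_counts, dna_counts, pseudocount=1.0):
    """Per-region log2 ratio of depth-scaled RNA over depth-scaled DNA counts.

    Assumption: a plain counts-per-million scaling with a pseudocount stands
    in for the DESeq2-based normalization described in the entry above.
    """
    rna_total = sum(rna_counts)
    dna_total = sum(dna_counts)
    scores = []
    for rna, dna in zip(rna_counts, dna_counts):
        rna_cpm = (rna + pseudocount) / rna_total * 1e6
        dna_cpm = (dna + pseudocount) / dna_total * 1e6
        scores.append(math.log2(rna_cpm / dna_cpm))
    return scores


# Hypothetical counts for three regions with equal plasmid representation.
rna = [400, 100, 25]
dna = [100, 100, 100]
activity = log2_activity(rna, dna)
```

Positive values suggest more reporter RNA than the plasmid input alone would predict, negative values less; a real analysis would also use replicates and a statistical test.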

Machine learning

  • maxATAC - TFBS prediction from ATAC-seq (bulk and pseudobulk) in any cell type (whole genome, chromosome, or region). Deep dilated convolutional neural networks, bigWig and BED predictions of TFBSs. Models available for 127 human TFs (h5 files). Outperforms baseline (average ChIP-seq signal, motif scanning) for most TFs and cell lines. AUPR is similar to the top performer in the ENCODE-DREAM in vivo TFBS prediction challenge (0.4). OMNI-ATAC-seq data for three cell lines, to be available. ATAC-seq signal is scaled per replicate to 20 million mapped reads (RP20M) and min-max normalized to the 99th percentile signal. Python, separate functions for each step (prepare, average, normalize, train, predict, benchmark, peaks, variants). Tweet 1, Tweet 2.
    Paper Cazares, Tareian A, Faiz W Rizvi, Balaji Iyer, Xiaoting Chen, Michael Kotliar, Joseph A Wayman, Anthony Bejjani, et al. “MaxATAC: Genome-Scale Transcription-Factor Binding Prediction from ATAC-Seq with Deep Neural Networks.” Preprint. Bioinformatics, January 29, 2022. https://doi.org/10.1101/2022.01.28.478235.
  • Segmentation and genome annotation (SAGA) algorithms review. Methods and tools for finding patterns from multiple ChIP-seq, histone-seq, etc. measures (Table 1). Hidden Markov Model (HMM), Dynamic Bayesian Network (DBN) algorithms. HMM intuition, math, solution algorithms. Visualization. Future work, challenges.
    Paper Libbrecht, Maxwell W., Rachel C. W. Chan, and Michael M. Hoffman. “Segmentation and Genome Annotation Algorithms for Identifying Chromatin State and Other Genomic Patterns.” Edited by Tamar Schlick. PLOS Computational Biology 17, no. 10 (October 14, 2021): e1009423. https://doi.org/10.1371/journal.pcbi.1009423.
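
The signal normalization described in the maxATAC entry above (scaling to 20 million mapped reads, then min-max normalization against the 99th percentile) can be sketched as follows. This is not maxATAC's code: the function names, the nearest-rank percentile, and the clipping of values above the percentile to 1.0 are assumptions.

```python
def rp20m_scale(signal, mapped_reads, target_reads=20_000_000):
    """Scale a coverage track as if the library had `target_reads` mapped reads."""
    factor = target_reads / mapped_reads
    return [value * factor for value in signal]


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list; `q` is an integer in 0-100."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling of q * n / 100
    return ordered[rank - 1]


def minmax_to_percentile(signal, q=99):
    """Min-max normalize using the q-th percentile as the maximum.

    Assumption: values above the percentile are clipped to 1.0.
    """
    low = min(signal)
    high = percentile(signal, q)
    if high == low:
        return [0.0 for _ in signal]
    return [min(1.0, (value - low) / (high - low)) for value in signal]


# Hypothetical coverage track from a library with 40 million mapped reads.
raw = [float(v) for v in range(1, 101)]
scaled = rp20m_scale(raw, mapped_reads=40_000_000)
normalized = minmax_to_percentile(scaled)
```

Using a high percentile instead of the absolute maximum keeps a few extreme bins from compressing the rest of the track toward zero.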

Misc
