A collection of tools for Hi-C data analysis

xtmgah/HiC_tools

Hi-C data analysis tools and papers

MIT License PR's Welcome

Tools are sorted by publication date, newest on top. Unpublished tools are listed at the end of each section. See Hi-C data notes and single-cell Hi-C notes for more. Please, contribute and get in touch! See MDmisc notes for other data science and genomics-related notes.

Pipelines

  • Juicer - Java full pipeline to convert raw reads into Hi-C maps, visualized in Juicebox. Calls domains, loops, CTCF binding sites. .hic file format for storing multi-resolution Hi-C data.

    Paper
    • Durand, Neva C., Muhammad S. Shamim, Ido Machol, Suhas S. P. Rao, Miriam H. Huntley, Eric S. Lander, and Erez Lieberman Aiden. “Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments.” Cell Systems 3, no. 1 (July 2016)

    • Rao, Suhas S. P., Miriam H. Huntley, Neva C. Durand, Elena K. Stamenova, Ivan D. Bochkov, James T. Robinson, Adrian L. Sanborn, et al. “A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping.” Cell 159, no. 7 (December 18, 2014) - Juicer analysis example. TADs defined by frequent interactions. Enriched in CTCF and cohesin members. Five domain types. A1 and A2 enriched in genes. Chr 19 contains a 6th pattern, B4. Enrichment in different histone modification marks. TADs are preserved across cell types. Yet, differences between GM12878 and IMR90 were detected. Boundary detection by scanning the image. Refs to the original paper.

  • HiC-Pro - Python and command line-based optimized and flexible pipeline for Hi-C data processing. hicpro2juicebox tool to generate Juicebox-compatible files (requires juicebox_clt.jar)

    Paper
    • Servant, Nicolas, Nelle Varoquaux, Bryan R. Lajoie, Eric Viara, Chong-Jian Chen, Jean-Philippe Vert, Edith Heard, Job Dekker, and Emmanuel Barillot. “HiC-Pro: An Optimized and Flexible Pipeline for Hi-C Data Processing.” Genome Biology 16 (December 1, 2015) - HiC pipeline, references to other pipelines, comparison. From raw reads to normalized matrices. Normalization methods, fast and memory-efficient implementation of iterative correction normalization (ICE). Data format. Using genotyping information to phase contact maps.
  • HiCUP - Perl-based pipeline, alignment only, output - BAM files.

    Paper
    • Wingett, Steven, Philip Ewels, Mayra Furlan-Magaril, Takashi Nagano, Stefan Schoenfelder, Peter Fraser, and Simon Andrews. “HiCUP: Pipeline for Mapping and Processing Hi-C Data.” F1000Research 4 (2015) - HiCUP pipeline, alignment only, removes artifacts (religations, duplicate reads) creating BAM files. Details about Hi-C sequencing artifacts. Used in conjunction with other pipelines.
  • FAN-C - Python pipeline for Hi-C processing. Input - raw FASTQ (aligned using BWA or Bowtie2, artifact filtering) or pre-aligned BAMs. KR or ICE normalization. Analysis and Visualization (contact distance decay, A/B compartment detection, TAD/loop detection, Average TAD/loop profiles, saddle plots, triangular heatmaps, comparison of two heatmaps). Automatic or modular. Compatible with .cool and .hic formats. Tweet1, Tweet2. Table 1 - detailed comparison of 13 Hi-C processing tools
    Paper - Kruse, Kai, Clemens B. Hug, and Juan M. Vaquerizas. “[FAN-C: A Feature-Rich Framework for the Analysis and Visualisation of Chromosome Conformation Capture Data](https://doi.org/10.1186/s13059-020-02215-9).” Genome Biology 21, no. 1 (December 2020)
  • GITAR - full Hi-C pre-processing, normalization, TAD detection, and visualization. Python scripts wrapping other tools. Table 1 summarizes the functionality of existing tools.

    Paper
    • Calandrelli, Riccardo, Qiuyang Wu, Jihong Guan, and Sheng Zhong. “GITAR: An Open Source Tool for Analysis and Visualization of Hi-C Data.” Genomics, Proteomics & Bioinformatics 16, no. 5 (2018): 365–72. https://doi.org/10.1016/j.gpb.2018.06.006
  • HiCdat - Hi-C processing pipeline and downstream analysis/visualization. Analyses: normalization, correlation, visualization, comparison, distance decay, PCA, interaction enrichment test, epigenomic enrichment/depletion.

    Paper
  • TADbit - TADbit is a complete Python library to deal with all steps to analyze, model and explore 3C-based data. With TADbit, the user can map FASTQ files to obtain raw interaction binned matrices (Hi-C like matrices), normalize and correct interaction matrices, identify and compare the Topologically Associating Domains (TADs), build 3D models from the interaction matrices, and finally, extract structural properties from the models. TADbit is complemented by TADkit for visualizing 3D models.

    Paper
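
The iterative correction (ICE) normalization mentioned in the HiC-Pro entry above can be illustrated with a minimal matrix-balancing sketch. This is not the HiC-Pro or iced implementation: the function name `balance_matrix`, the square-root (damped) update, and the toy matrix are assumptions made for illustration. The idea is to rescale rows and columns of a symmetric contact matrix by per-bin bias factors until all non-empty bins have equal total coverage.

```python
def balance_matrix(matrix, max_iter=200, tol=1e-6):
    """Balance a symmetric contact matrix so all non-empty rows sum to the same value.

    Returns the balanced matrix and a per-bin bias vector such that
    balanced[i][j] is approximately matrix[i][j] / (bias[i] * bias[j]).
    """
    n = len(matrix)
    balanced = [[float(v) for v in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(max_iter):
        sums = [sum(row) for row in balanced]
        covered = [s for s in sums if s > 0]
        if not covered:
            break
        mean = sum(covered) / len(covered)
        # Damped (square-root) update; empty bins are left untouched.
        scale = [(s / mean) ** 0.5 if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                balanced[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
        if max(abs(s - 1.0) for s in scale) < tol:
            break
    return balanced, bias


# Toy symmetric contact matrix (hypothetical counts).
raw = [[10, 4, 2],
       [4, 8, 6],
       [2, 6, 20]]
balanced, bias = balance_matrix(raw)
row_sums = [sum(row) for row in balanced]
```

After balancing, `row_sums` should be nearly constant across bins, and `bias` holds the per-bin factors that were divided out. Real pipelines additionally filter low-coverage bins and work on sparse matrices.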

QC, quality control

  • qc3C - Hi-C quality assessment method based on non-naturally occurring k-mers containing ligation artifacts (the proportion of "signal"). Details of various types for read-pairs, valid and invalid configurations. Tested on simulated and experimental data. Works on FASTQ or BAM files. Output - the breakdown of valid and invalid pairs (numbers, stacked barplots). Compatible with MultiQC. Conda, Docker, Singularity installations. Scripts for the paper

  • HiCSampler - a Python script for subsetting .hic files

  • HiCNoiseMeasurer - a Python script to measure noise in .hic files using the auto-correlation function

Capture-C

Capture-C peaks

  • CHiCAGO protocol for Capture Hi-C analysis. Introduction into 3C-based technologies, as compared with Hi-C. Statistical model for background noise estimation, normalization, weighted p-value correction. Comparison with other tools (HiCapTools, CHiCMaxima, CHiCANE), downstream analysis with Peaky, Chicdiff. Preprocessing with HiCUP, input files (Table 1), how to create auxiliary files and set parameters for different restriction enzymes (R, Python scripts), QC, visualization. CHiCAGO R package, chicagoTools, PCHiCdata R package. See below for CHiCAGO paper.

  • Peaky - Bayesian sparse variable selection approach. The model proposes that for any given bait, the expected CHi-C signal at each prey fragment is expressed as a sum of contributions from a set of fragments directly contacting that bait. https://github.com/cqgd/pky

  • ChiCMaxima - a pipeline for detection and visualization of chromatin loops in Capture Hi-C data. Loess smoothing combined with a background model to detect significant interactions. Comparison with GOTHiC and CHiCAGO. https://github.com/yousra291987/ChiCMaxima

  • HiCapTools - A software package that can design sequence capture probes for targeted chromosome capture applications and analyze sequencing output to detect proximities involving targeted fragments. Two probes are designed for each feature while avoiding repeat elements and non-unique regions. The data analysis suite processes alignment files to report genomic proximities for each feature at restriction fragment level and is isoform-aware for gene features. Statistical significance of contact frequencies is evaluated using an empirically derived background distribution. https://github.com/sahlenlab/HiCapTools

  • CHiCAGO is a Capture Hi-C data processing method that filters out contacts that are expected by chance given the linear proximity of the interacting fragments on the genome and takes into account the asymmetric biases introduced by the capture step used in the Capture Hi-C approach. Two-component background model (Delaporte distribution) - Brownian motion (Neg. Binom.) and technical noise (Poisson). Account for distance. https://bioconductor.org/packages/Chicago/, Tweet by Mikhail Spivakov: Running Chicago with data generated w/ a 4-cutter such as DpnII? Default settings were tuned on 6-cutter data (HindIII) & not optimal for this. Our suggested settings for DpnII are: MaxLBrowndist = 75000, binsize = 1500, minFragLen=75, maxFragLen=1200.

HiChIP

4C

Resolution improvement

  • VEHiCLE - a variational autoencoder (feature extraction, dimensionality reduction) and Generative Adversarial Network (maps low-dimensional vectors to Hi-C maps) for Hi-C resolution enhancement. Uses a combination of four loss functions: adversarial loss, variational loss, mean square error, and insulation score loss (interesting!). Intro into VAEs, GANs, loss functions. Uses GM12878, IMR90, K562, HMEC data. Compared using five metrics (similarity, reproducibility) against HiCPlus, DeepHiC, HiCSR, outperforms all. Improves TAD identification, 3D structure modeling. Python implementation.

  • HiCRes - resolution estimation, based on the linear dependence of the 20th percentile of coverage on the window size used to assess coverage. Includes preseq for estimating and predicting library complexity, bowtie2 and HiCUP for estimating Hi-C-specific QC metrics. Relatively insensitive to enzyme of choice. Implemented as Docker/Singularity images. Requires significant computational resources, like 5 hours on a 40 CPU cluster.

  • HiCSR - enhancement of Hi-C contact maps using a Generative Adversarial Network trained to optimize a custom loss function (weighted adversarial loss, pixel-wise L1 loss, and a feature reconstruction loss). An increase in resolution refers to recovering additional Hi-C contacts, "saturating" downsampled and noisy Hi-C matrices, not increasing the number of pixels. Representation learning with autoencoder with several convolutional layers and skip connections, then using it for the generator to create new matrices with discriminator telling them fake or real. Compared with HiCPlus, HiCNN, hicGAN, DeepHiC. Reproducibility is better using four metrics. Python3 PyTorch implementation https://github.com/PSI-Lab/HiCSR

  • DeepHiC - a generative adversarial network (GAN) for enhancing Hi-C data. Does not change the bin size, enhances the content of Hi-C data. Reconstructs the content from ~1% of the original data. Outperforms BoostHiC, HiCPlus, HiCNN. Online tool: http://sysomics.com/deephic/, code: https://github.com/omegahh/DeepHiC

  • hicGAN - improving resolution (saturation) of Hi-C data using Generative Adversarial Networks. Generator - five inner residual blocks to fight vanishing gradient (each block has two convolutional layers and batch normalization) and an outer skip connection. The discriminator has three convolutional blocks. Evaluation metrics: MSE, signal-to-noise ratio, structure similarity index, chromatin loop score. Compared against HiCPlus. Python, Tensorflow implementation

  • HiCNN - a computational method for resolution enhancement. A modification of the HiCPlus approach, using very deep (54 layers, five types of layers) convolutional neural network. A Hi-C matrix of regular resolution is transformed into the high-resolution but very sparse matrix, HiCNN predicts the missing values. Pearson and MSE evaluation metrics, overlap of Fit-Hi-C-detected significant interactions - perform similar or slightly better than HiCPlus. PyTorch implementation. http://dna.cs.miami.edu/HiCNN/

  • Boost-HiC - infer fine-resolution contact frequencies in Hi-C data, performs well even on 0.1% of the raw data. TAD boundaries remain. Better than HiCPlus. It can be used for differential analysis (comparison) of two Hi-C maps. https://github.com/LeopoldC/Boost-HiC

  • mHi-C - recovering alignment of multi-mapped reads in Hi-C data. Generative model to estimate probabilities for each bin-pair originating from a given origin. Reproducibility of contact matrices (stratum-adjusted correlation), reproducibility and number of significant interactions are improved. Novel interactions. Enrichment of TAD boundaries in LINE and SINE repetitive elements. Multi-mapping is not sensitive to trimming. Read filtering strategy (Figure 1, supplementary figures are very visual). https://github.com/keleslab/mHiC

  • HIFI - Hi-C Interaction Frequency Inference for restriction fragment-resolution analysis of Hi-C data. Sparsity is resolved by using dependencies between neighboring restriction fragments, with Markov Random Fields performing the best. Better resolves TADs and sub-TADs, significant interactions. CTCF, RAD21, SMC3, ZNF143 are enriched around TAD boundaries. Matrices normalized for fragment-specific biases. https://github.com/BlanchetteLab/HIFI

  • HiCPlus - increasing resolution of Hi-C data using convolutional neural network, mean squared error as a loss function. Basically, smoothing parts of Hi-C image, then binning into smaller parts. Performs better than bilinear/bicubic smoothing. https://github.com/zhangyan32/HiCPlus

Simulation

  • FreeHi-C v.2.0 - simulation of realistic Hi-C matrices with user- or data-driven spike-ins. Spike-ins are introduced on read-level and converted to interaction frequency level. Benchmark of HiCcompare, multiHiCcompare, diffHiC, and Selfish. Assessment of FDR, power, significance order, PRC and AUROC, genomic properties. GM12878 and A549 replicates of experimental Hi-C data. Three simulation settings with varying background distribution of interaction frequencies, spike-in proportions, sequencing depth. Figure 5 - summary of performances for all methods and comparison types. Subjective top performers: multiHiCcompare, HiCcompare, diffHiC, Selfish.

  • FreeHi-C - Hi-C data simulation based on properties of experimental Hi-C data. Preserves A/B compartments, TADs, the correlation between replicates (HiCRep), significant interactions, improves power to detect differential interactions. Robust to sequencing depth changes. Tested on replicates of GM12878, A549 human cancer cells, malaria P. falciparum. Compared with poorly performing Sim3C. All simulated data are at https://zenodo.org/record/3345896. Python3 implementation https://github.com/keleslab/FreeHiC

Normalization

CNV-aware normalization

  • Hi-C data normalization considering CNVs. Extension of matrix-balancing algorithm to either retain the copy-number variation effect (LOIC) or remove them (CAIC). ICE itself can lead to misrepresentation of the contact probabilities between CNV regions. Estimating CNV directly from Hi-C data correcting for GC content, mappability, fragment length using Poisson regression. LOIC - the sum of contacts for a given genomic bin is proportional to CNV. CAIC - raw interaction counts are the product of a CNV bias matrix and the expected contact counts at a given genomic distance. Data, and cancer-hic-norm - Normalization of cancer Hi-C data, scripts for the manuscript. LOIC and CAIC methods are implemented in the iced Python package, https://github.com/hiclib/iced

  • HiCapp - Iterative correction-based caICB method. Method to adjust for the copy number variants in Hi-C data. Loess-like idea - we converted the problem of removing the biases across chromosomes to the problem of minimizing the differences across the count-distance curves of different chromosomes. Our method assumes equal representation of genomic locus pairs with similar genomic distances located on different chromosomes if there were no bias in the Hi-C maps. https://bitbucket.org/mthjwu/hicapp

  • OneD - CNV bias-correction method, addresses the problem of partial aneuploidy. Bin-centric counts are modeled using the negative binomial distribution, and its parameters are estimated using splines. A hidden Markov model is fit to infer the copy number for each bin. Each Hi-C matrix entry is corrected by dividing its value by the square root of the product of CNVs for the corresponding bins. Reproducibility score (eigenvector decomposition and comparison) to measure improvement in the similarity between replicated Hi-C data. https://github.com/qenvio/dryhic
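
The final correction step described for OneD above (dividing each matrix entry by the square root of the product of the copy numbers of its two bins) can be sketched as follows. This is a minimal illustration, not the dryhic code: the function name `correct_for_copy_number` and the toy inputs are assumptions, and the copy numbers are taken as already inferred (OneD itself estimates them with a hidden Markov model).

```python
import math


def correct_for_copy_number(matrix, copy_number):
    """Divide each contact count by sqrt(cn[i] * cn[j]) of the two bins involved."""
    n = len(matrix)
    if len(copy_number) != n:
        raise ValueError("copy_number must have one value per bin")
    if any(cn <= 0 for cn in copy_number):
        raise ValueError("copy numbers must be positive")
    return [
        [matrix[i][j] / math.sqrt(copy_number[i] * copy_number[j]) for j in range(n)]
        for i in range(n)
    ]


# Toy example: bin 0 is amplified four-fold relative to bin 1 (hypothetical values).
counts = [[8.0, 4.0],
          [4.0, 2.0]]
corrected = correct_for_copy_number(counts, [4.0, 1.0])
```

In this toy case the amplification fully explains the differences between entries, so every corrected value is the same.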

Reproducibility

  • HiCrep.py - a fast Python implementation of stratum-adjusted correlation coefficient metric for measuring similarity between Hi-C datasets (HiCrep method, originally in R). Can be used for MDS. Evaluated on 90 datasets from 4D Nucleome. More than 20 times faster on a single CPU. Results are the same as R implementation.

  • HPRep - reproducibility measure for HiChIP and PLAC-seq data. HiCrep-inspired. Reorganize data into anchor-centric interaction bins, normalize (fragment length, GC content, mappability, log2 obs/exp), smooth, stratify by distance (concatenate bins with the same distance from anchors). Considering "AND" (bin-pairs), "XOR" (one anchor bin), "NOT" (no interactions, ignored) bin pairs. Distance metric - weighted Pearson correlation (pairs of columns) stratified by distance. Compared with HiCRep, HiC-Spector, and naive Pearson on mouse H3K4me3 PLAC-seq data (brain and mESCs), human H3K27ac HiChIP data from GM12878 and K562, human H3K4me3 PLAC-seq brain data. HPRep shows higher similarity for replicates and more differentiation between cell lines, robust to downsampling. Nearly the same results can be achieved by analysing one chromosome (for speed).

  • IDR2D - Irreproducible Discovery Rate that identifies replicable interactions in ChIP-PET, HiChIP, and Hi-C data. Includes the original 1D IDR version (https://github.com/nboley/idr). Resolves multiple pairwise interactions. 

  • 3DChromatin_ReplicateQC - Comparison of four Hi-C reproducibility assessment tools, HiCRep, GenomeDISCO, HiC-Spector, QuASAR-Rep. Tested the effects of noise, sparsity, resolution. Spearman doesn't work well. All tools performed similarly, worsening expectedly. QuASAR has a QC tool measuring the level of noise. https://github.com/kundajelab/3DChromatin_ReplicateQC

  • HiCRep - Similarity assessment using generalized Cochran-Mantel-Haenszel statistics M2. Spearman/Pearson doesn't work. 2-step procedure: Smooth the matrix, then CMH statistics. Basically, splitting data by distance chunks, Pearson on each chunk, summarize. Simple and well-thought stats. Methods: Hi-C datasets with replicates, including 11 ENCODE datasets. R package https://github.com/MonkeyLB/hicrep, and Python implementation

  • QuASAR - Hi-C quality and reproducibility measure using spatial consistency between local and regional signals. Finds the maximum useful resolution by comparing quality and replicate scores of replicates. Part of HiFive pipeline

  • HiC-Spector - reproducibility metric to quantify the similarity between contact maps using spectral decomposition. Decomposing Laplacian matrices and sum the Euclidean distance between eigenvectors. https://github.com/gersteinlab/HiC-spector

  • localtadsim - Analysis of TAD similarity using a variation of information (VI) metric as a local distance measure. 23 human Hi-C datasets, Hi-C Pro processed into 100kb matrices, Armatus to call TADs. Defining structurally similar and variable regions. Comparison with previous studies of genomic similarity. Cancer-normal comparison - regions containing pan-cancer genes are structurally conserved in normal-normal pairs, not in cancer-cancer. https://github.com/Kingsford-Group/localtadsim
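
The "split by distance, Pearson on each chunk, summarize" idea from the HiCRep entry above can be sketched in a few lines. This is a simplified illustration and not the HiCRep or HiCrep.py implementation: it skips the smoothing step, weights each stratum by its size times the product of the raw standard deviations (HiCRep uses variance-stabilized values), and the function name and toy matrix are assumptions.

```python
import math


def stratum_adjusted_correlation(a, b, max_distance):
    """Weighted average of per-diagonal Pearson correlations between two matrices."""
    n = len(a)
    numerator = 0.0
    denominator = 0.0
    for d in range(max_distance + 1):
        # Stratum d: all bin pairs separated by exactly d bins.
        x = [a[i][i + d] for i in range(n - d)]
        y = [b[i][i + d] for i in range(n - d)]
        m = len(x)
        if m < 2:
            continue
        mean_x = sum(x) / m
        mean_y = sum(y) / m
        var_x = sum((v - mean_x) ** 2 for v in x) / m
        var_y = sum((v - mean_y) ** 2 for v in y) / m
        if var_x == 0 or var_y == 0:
            continue  # correlation is undefined for a constant stratum
        cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / m
        rho = cov / math.sqrt(var_x * var_y)
        weight = m * math.sqrt(var_x * var_y)
        numerator += weight * rho
        denominator += weight
    if denominator == 0:
        raise ValueError("no stratum with non-zero variance")
    return numerator / denominator


# Toy symmetric contact matrix (hypothetical counts).
contacts = [[0, 5, 3, 1, 0],
            [5, 0, 6, 2, 1],
            [3, 6, 0, 7, 4],
            [1, 2, 7, 0, 8],
            [0, 1, 4, 8, 0]]
negated = [[-v for v in row] for row in contacts]
same = stratum_adjusted_correlation(contacts, contacts, max_distance=3)
opposite = stratum_adjusted_correlation(contacts, negated, max_distance=3)
```

Stratifying by distance keeps the strong distance decay shared by all Hi-C maps from dominating the score, which is why plain Pearson or Spearman on whole matrices is uninformative.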

AB compartments

  • POSSUM - A/B compartment detection method in super-resolution Hi-C matrices. PCA of Sparse SUper Massive Matrices. Calculates eigenvectors for sparse matrices using the power method (Figure 1, Methods). New GM12878 data at 500bp resolution (42 billion read pairs, 33 billion contacts). Genes can span compartments, but gene promoters almost exclusively (95%) are located in A compartments. Distinguishing loops formed by extrusion and non-extrusion mechanisms (SIP, HiCCUPS, Fit-Hi-C for detection), high resolution of Hi-C data is important. Applied to other datasets, organisms. A part of the Juicer pipeline Eigenvector, C++ POSSUM code on Jordan Rowley's lab GitHub. Other tools: HiCSampler, HiCNoiseMeasurer. Tweet1 by Doug Phanstiel, Tweet2 by Jordan Rowley

  • dcHiC - differential A/B compartment analysis of Hi-C data. Uses Multiple Factor Analysis (MFA), an extension of PCA which combines Hi-C maps before performing generalized PCA. Analogous to weighted PCA in which every dataset is normalized for its biases (Methods). Multivariate distance measure to estimate statistical significance of compartment differences. Applied to mouse neuronal differentiation, mouse hematopoietic system, human cell Hi-C data. Gene enrichment analysis shows biologically relevant signal. Input - sparse matrix, hic, cool files.

  • Calder - multi-scale compartment and sub-compartment detection, improvement over dichotomous AB compartment detection. Clustering contact similarities (Fisher's z-transformed correlations) into high intra and low inter-region similarities, followed by a divisive hierarchical clustering within each domain. The likelihood of nested sub-domains can be estimated using a mixture log-normal distribution. Detailed methods, complex. Eight subcompartments, 4 within the A and 4 within the B compartment, balanced set, in contrast to SNIPER. Expected associations with active/inactive genomic annotations. Nested compartments may be associated with TADs/loops. Analysis of domain repositioning across 114 cell lines. 40kb resolution. R package, named after Alexander Calder, an American sculptor. Supplementary Data 1 - IDs and links to Hi-C, ChIP-seq, and RNA-seq datasets; Data 2 - hg19 BED files of Complete domain hierarchies inferred by CALDER from 127 Hi-C contact maps; Data 7 - coordinates of Repositioned compartment domains between normal and cancer cell lines derived from breast, prostate, and pancreatic tissue samples

  • SNIPER - 3D subcompartment (A1, A2, B1, B2, B3) identification from low-coverage Hi-C datasets. A neural network based on a denoising autoencoder (9 layers) and a multi-layer perceptron. Sigmoidal activation of inputs, ReLU, softmax on outputs. Dropout, binary cross-entropy. exp(-1/C) transformation of Hi-C matrices. Applied to GM12878 and 8 additional cell types to compare subcompartment changes. Compared with Rao2014 annotations, outperforms Gaussian HMM and MEGABASE.

  • Eigenvector - Juicer's native tool. The eigenvector can be used to delineate compartments in Hi-C data at coarse resolution; the sign of the eigenvector typically indicates the compartment. The eigenvector is the first principal component of the Pearson's matrix.
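
As a rough illustration of the eigenvector approach described above, the sketch below computes the Pearson correlation matrix of a small observed/expected matrix and extracts its leading eigenvector by power iteration, then reads compartments off the sign. It is not Juicer's implementation: the function names, the seeded random start, and the toy checkerboard matrix are assumptions, and real data also needs distance normalization and handling of empty bins.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x > 0 and sd_y > 0 else 0.0


def leading_eigenvector(matrix, n_iter=200, seed=0):
    """Leading eigenvector of the row-wise Pearson correlation matrix (power method)."""
    n = len(matrix)
    corr = [[pearson(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    vec = [rng.random() for _ in range(n)]
    for _ in range(n_iter):
        nxt = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            raise ValueError("power iteration collapsed to the zero vector")
        vec = [v / norm for v in nxt]
    return vec


# Toy observed/expected matrix with a two-compartment checkerboard pattern.
obs_exp = [[2.0, 2.0, 0.5, 0.5],
           [2.0, 2.0, 0.5, 0.5],
           [0.5, 0.5, 2.0, 2.0],
           [0.5, 0.5, 2.0, 2.0]]
eigenvector = leading_eigenvector(obs_exp)
```

Bins 0 and 1 receive one sign and bins 2 and 3 the other. The overall sign is arbitrary, so assigning "A" or "B" requires an external track such as gene density or GC content.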

Peak/Loop callers

  • ZipHi-C - a Bayesian framework based on a Hidden Markov Random Field model to detect significant interactions and experimental biases in Hi-C data. Predecessors - HMRFBayesHiC, FastHiC. Borrows information from neighboring loci. Tested on simulated and experimental data, fewer false positives than FastHiC, Juicer, HiCExplorer. Detailed stats methods.

  • NeoLoopFinder - detecting chromatin interactions induced by all kinds of structural variants (SVs). Input - a Hi-C contact matrix and a list of SV breakpoints. Output - genome-wide CNV profile, CNV segments, local assembly around SVs (graph-based algorithm), corrected Hi-C matrix for newly assembled regions and normalized for CNV effect and allelic effect, chromatin loops in rearranged regions (Peakachu), enhancer-hijacking events (needs H3K27ac data). CNVs are detected by HMM-based segmentation module. Includes visualization module. Neo-loop detection in 50 cancer Hi-C datasets from cell lines and patient samples (17 cancer types). Cancer-specific neoloops, associated genes, epigenomic enrichments. Methods - DI + HMM. Video, 20m

  • LASCA - loop/significant contact caller that fits a Weibull distribution-based model to each diagonal. DBSCAN to cluster adjacent significant pixels. Works with Hi-C data from any species, tested on human, C. elegans, S. cerevisiae. Filters according to Aggregate Peak Analysis patterns may be used to refine calls. Compared with HiCCUPS, MUSTACHE, demonstrates good overlap. Also identifies non-CTCF-driven loops. Input - .cool files. Python code

  • HiC-ACT - improved chromatin loop detection considering spatial dependency (especially at high 5-10kb resolution). Aggregated Cauchy Test (ACT) based approach accounting for possible correlations between adjacent loci pairs from high-resolution Hi-C data. Combine a set of p-values, T statistics following Cauchy distribution under arbitrary dependence structure. Need the local smoothing bandwidth size. Post-processing of results from loop callers that assume independence among loci. Input - bin-pair identifiers and the corresponding p-values. Tested on GM12878 and mESC data. The improvement in power is most pronounced in low-depth (downsampled) data. Fast, implemented in R. GitHub

  • Peakachu - loop prediction from Hi-C data using random forest on loop-specific pixel intensities within 11x11 window. ChIA-PET and HiChIP provide positive training examples. H3K27ac HiChIP better predicts short-range interactions, CTCF ChIA-PET is better for longer interactions. 10Kb resolution data, GM12878 and K562 cell lines. Excels for short-range interactions. Detects more loops than Fit-Hi-C, HiCCUPS, with good overlap. FDR estimated on auxin-treated vs. untreated HCT-116 cells, about 0.2%. Model trained using data from Hi-C performs well in other technologies, Micro-C, DNA SPRITE. Robust to sequencing depth. MCC to select best model. 3-fold cross-validation. Balanced training, same number of negative examples (with short and long distances between interacting loci). Predicted loops for 56 cell/tissue types.

  • FIREcaller - an R package to call frequently interacting regions from Hi-C data, as well as clustered super-FIREs. Normalization using HiCNormCis to regress out systematic biases. Converts normalized cis-interactions into Z-scores, calculates one-sided p-values and classifies bins as FIRE/nonFIRE. Also outputs continuous FIREscore (-ln(p-value)). FIREs are tissue-specific, can distinguish samples. Associated with H3K27ac and H3K4me3 signal.

  • HiCExplorer's hicDetectLoops for loop detection. Review and critique of HiCCUPS, HOMER, GOTHiC, cLoops, FastHiC. Models the distance dependence of chromatin interactions with a continuous negative binomial distribution, detects interaction counts with p-values smaller than a threshold, then filters them. https://github.com/deeptools/HiCExplorer/

  • Chromosight - loop and pattern detection, computer vision-based (borders, FIREs, hairpins, and centromeres) in Hi-C maps. Takes in a single, whole-genome contact map, text-based bedGraph2d, and binary cool formats, ICE-normalizes. Sliding window, pattern detection using Pearson correlation with the template, then series of filters. Output - text-based. Outperforms HiCExplorer, HiCCUPS, HOMER, cooltools, in the order of decreasing F1. Tested on synthetic Hi-C data mimicking S. cerevisiae genome, benchmark data at Zenodo, Python3 code

  • SIP - loop caller using image analysis. Regional maxima-based, peaks called in a sliding window. Distance-normalized Hi-C matrices, image adjusted using Gaussian blur, contrast enhancement, White Top-Hat correction, identified peaks then filtered by peak enrichment, empirical FDR, loop decay. Comparison with HiCCUPS and cLoops callers. Robust to noise, sequencing depth, much faster, good agreement, improved detection rate. SIPMeta - average metaplots of loops on bias-corrected images for better representation. Java implementation, works with .hic and .cool files https://github.com/PouletAxel/SIP

  • Mustache - loop detection from Hi-C and Micro-C maps. Scale-space theory, detection of blob-shaped objects in a multi-scale representation of contact maps, Gaussian kernels with increasing scales. Differences of adjacent Gaussians guide the search for local maxima. Series of filtering steps to minimize false positives. Corrected for multiple testing p-values of blobs. Applied to GM12878 and K562 Hi-C data, and HFFc6 cell line Micro-C data, 5kb resolution. Compared with HiCCUPS, SIP comparison added, detects similar and more loops flanked by convergent CTCF, RAD21, SMC3, loops confirmed by ChIA-PET and HiChIP data. Python3 tool, Conda/Docker wrapped, handles .hic/.cool files. https://github.com/ay-lab/mustache, Tweet

  • FitHiC2 - protocol to install/run FitHiC Python3 tool/scripts. Fit of non-increasing cubic splines to distance-interaction frequency decay to identify significant interactions in individual matrices. Accounts for biases derived from KR (ICE, or other) normalization (HiCKRy). Works with fixed-bin- or restriction cut site resolution data. Overview of FitHiC algorithm, accounting for biases. Flexible input options, from HiC-Pro, Juicer, and other tools, validPairs file format. Post-processing to prioritize highly significant interactions supported by the nearby loci, and filter noisy detections. HTML report, flexible BED-derived output format, conversion to formats for WashU epigenome and UCSC browsers. Installable using conda, pip, GitHub. Comparable methods - HiCCUPS, HOMER, GOTHiC, HiC-DC, a brief description of each. Tested on three datasets. GitHub: https://github.com/ay-lab/fithic, Executable on Code Ocean: https://codeocean.com/capsule/4528858/tree/v3, Data: https://zenodo.org/record/3380589

  • FIREcaller - an R package to detect frequently interacting regions (FIREs, <200Kb interactions). Within-sample (HiCNormCis) and cross-sample (quantile) normalization, converting FIRE counts to Z-scores, taking significant ones. Schmitt data https://yunliweb.its.unc.edu/FIREcaller/

  • coolpup.py - Pile-up (aggregation, averaging) analysis of Hi-C data (.cool format) for visualizing and identifying chromatin loops from several sparse datasets, e.g., single-cell. Visualization using plotpup.py script. Scripts for the paper: https://github.com/Phlya/coolpuppy_paper/tree/master/Nagano, tool: https://github.com/Phlya/coolpuppy

  • cLoops2 - improved pipeline for analysing Hi-TrAC/TrAC data. Peak/loop calling, differentially enriched loops, annotation, resolution estimation, similarity, aggregation/visualization. Improved blockDBSCAN clustering algorithm. Outperforms MACS2, SICER, HOMER, SEACR

  • cLoops - DBSCAN-based algorithm for the detection of chromatin loops in ChIA-PET, Hi-C, HiChIP, Trac-looping data. Local permutation-based estimation of statistical significance, several tests for enrichment over the background. Outperforms diffHiC, Fit-Hi-C, GOTHiC, HiCCUPS, HOMER. https://github.com/YaqiangCao/cLoops

  • FitHiChIP - significant peak caller in HiChIP and PLAC-seq data. Accounts for assay-specific biases, as well as for the distance effect. 3D differential loops detection. Methods. https://github.com/ay-lab/FitHiChIP

  • StripeCaller - A toolkit for analyzing architectural stripes. Architectural stripes, created by extensive loading of cohesin near CTCF anchors, with Nipbl and Rad21 help. Little overlap between B cells and ESCs. Architectural stripes are sites for tumor-inducing TOP2beta DNA breaks. ATP is required for loop extrusion, cohesin translocation, but not required for maintenance. Replication or transcription is not important for loop extrusion. Zebra algorithm for detecting architectural stripes, image analysis, math in Methods. Human lymphoblastoid cells, mouse ESCs, mouse B-cells activated with LPS, CH12 B lymphoma cells, wild-type, treated with hydroxyurea (blocks DNA replication), flavopiridol (blocks transcription, PolII elongation), oligomycin (blocks ATP). Hi-C, ChIA-PET, ChIP-seq, ATAC-seq, and more Data1, Data2.

  • HiC-DC - significant interaction detection using the zero-inflated negative binomial model and accounting for biases like GC content, mappability. Compared with Fit-Hi-C, more conservative. Robust to sequencing depth. Detects significant, biologically relevant interactions at all length scales, including sub-TADs. BWA-MEM alignment (Python script), then processing in R. https://bitbucket.org/leslielab/hic-dc/src/master/

  • GOTHiC - R package for peak calling in individual HiC datasets, while accounting for noise. https://www.bioconductor.org/packages/release/bioc/html/GOTHiC.html

    • Mifsud, Borbala, Inigo Martincorena, Elodie Darbo, Robert Sugar, Stefan Schoenfelder, Peter Fraser, and Nicholas M. Luscombe. “GOTHiC, a Probabilistic Model to Resolve Complex Biases and to Identify Real Interactions in Hi-C Data.” Edited by Mark Isalan. PLOS ONE 12, no. 4 (April 5, 2017) - The GOTHiC (genome organization through HiC) algorithm uses a simple binomial distribution model to simultaneously remove coverage-associated biases in Hi-C data and detect significant interactions, assuming a global background interaction frequency for each pair of loci. Uses the Benjamini–Hochberg multiple-testing correction to control the false discovery rate.
  • HMRFBayesHiC - a hidden Markov random field-based Bayesian peak caller to identify long-range chromatin interactions from Hi-C data. Borrowing information from neighboring loci. Previous peak calling methods, Fit-Hi-C. Interactions between enhancers and promoters as a benchmark. https://yunliweb.its.unc.edu/HMRFBayesHiC/

  • FastHiC - hidden Markov random field (HMRF)-based peak caller, fast and well-performing. https://yunliweb.its.unc.edu/fasthic/

  • FitHiC - Python tool for detection of significant chromatin interactions, https://noble.gs.washington.edu/proj/fit-hi-c/

  • HiCPeaks - Python CPU-based implementation of BH-FDR and HICCUPS, two peak calling algorithms for Hi-C data, proposed by Rao et al. 2014. Text-to-cooler Hi-C data converter, two scripts to call peaks, and one for visualization (creation of a .png file). PyPI repo

  • HOMER - Perl scripts for normalization, visualization, significant interaction detection, motif discovery. Does not correct for bias. http://homer.ucsd.edu/homer/interactions/
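
Several of the callers above (GOTHiC, the BH-FDR script in HiCPeaks) combine a per-pair significance test with Benjamini–Hochberg correction. A minimal sketch of that idea in plain Python, assuming a GOTHiC-like background in which the chance of a random ligation between two loci is proportional to the product of their relative coverages. The function names, the factor of 2, and the toy counts are illustrative assumptions, not code from either package:

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more reads."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy input (hypothetical): relative coverage per locus and observed pair counts.
coverage = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
observed = {("A", "B"): 60, ("A", "C"): 30, ("B", "D"): 25, ("C", "D"): 5}
total_reads = sum(observed.values())

pairs = list(observed)
pvals = [
    # Assumed background: p_ij = 2 * cov_i * cov_j (either read end can hit either locus).
    binomial_upper_tail(observed[(a, b)], total_reads, 2 * coverage[a] * coverage[b])
    for a, b in pairs
]
for pair, p, q in zip(pairs, pvals, benjamini_hochberg(pvals)):
    print(pair, f"p={p:.3g}", f"q={q:.3g}")
```

Pairs with a small q-value have more reads than the coverage-based background would suggest. Real tools work on millions of pairs and use vectorized statistics libraries rather than an explicit sum.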

Differential interactions

  • HICDCPlus - an R/Bioconductor package for Hi-C/Hi-ChIP interaction calling (directly from raw data, negative binomial regression accounting for genomic distance, GC content, mappability, restriction enzyme-based bin size) and differential analysis (DESeq2). Includes TAD (TopDom) and A/B compartment callers. Input - HiC-Pro or Juicer. Output compatible with visualization in Juicebox and HiCExplorer. Compared with diffHiC, multiHiCcompare, Selfish, provides better results. Normalization by ChIP-seq input may not be helpful. BitBucket

  • CHESS (Comparison of Hi-C Experiments using Structural Similarity) - comparative analysis of Hi-C matrices and automatic feature extraction (TADs, loops, stripes). Image analysis-based structural similarity index (SSIM, combines brightness, contrast, and structure differences, S = 1 - identical matrices, <1 - differences) to assign similarity score and an associated p-value (from empirical distribution of SSIMs, two types of background model, Methods) to pairs of genomic regions. Obs/exp transformation, differential matrix, denoise, smooth, binarize, feature extraction using close morphology filter, k-means clustering, classification. Works with low sequencing depth, high noise data. Outperforms diffHiC, HOMER, and ACCOST. Applied to interspecies comparison of syntenic regions (Synteny portal), WT and Zelda-depleted Drosophila, B-cell lymphoma, Capture-C analysis. Input - Juicer, Cooler, or FAN-C format, plus .bedpe for regions to compare (chess pairs to generate). Python, scikit-image module. Documentation

  • BART3D - transcriptional regulators associated with differential chromatin interactions from Hi-C data. Input: HiC-Pro matrices, .hic Juicer files, .cool files. Output - ranked lists of transcriptional regulators. Distance-based normalization to average of individual matrices. Difference detection by a paired t-test of normalized interactions within 200kb (Figure 1A). Differential interactions are mapped to the union of DNAseI hypersensitive sites, then standard BART algorithm. Python implementation.

  • DiffGR - differentially interacting genomic regions. Stratum-adjusted correlation coefficient (SCC) (HiCrep-inspired) to measure similarity of local TAD regions. Focus on within-TAD interactions. Simulated data at various levels of sparsity, noise, HiCseg for TAD calling. 2D mean filter for smoothing, KR normalization. Permutation test to estimate the significance of SCC changes. FDR depends on the proportion of altered TADs. R implementation

  • Serpentine - differential analysis of two Hi-C maps using the 2D serpentine-binning method. A serpentine is a subset of connected pixels defined by thresholds in control and experimental contact maps. Serpentines are then compared using the Mean-Deviation plot. Helps to alleviate the effect of sparsity. Uses HiCcompare functionality. Normalization does not help. Python package, currently processes full 1500x1500 matrices. https://github.com/koszullab/serpentine

  • ACCOST (Altered Chromatin COnformation STatistics) - distance-aware differential Hi-C analysis. Extends the statistical model of DESeq by using the size factors to model the genomic distance effect. The use of the MD plot. Compared with diffHiC, FIND, and HiCcompare. Evaluated on human, mouse, plasmodium Hi-C data.

  • multiHiCcompare - joint normalization of multiple Hi-C datasets using cyclic loess regression through pairs of MD plots (minus-distance). Data-driven normalization accounting for the between-dataset biases. Per-distance edgeR-based testing of significant interactions. https://bioconductor.org/packages/multiHiCcompare/

  • Chicdiff - differential interaction detection in Capture Hi-C data. Signal normalization based on the CHiCAGO framework, differential testing using DESeq2. Accounting for distance effect by the Independent Hypothesis Weighting (IHW) method to learn p-value weights based on the distance to maximize the number of rejected null hypotheses. https://github.com/RegulatoryGenomicsGroup/chicdiff

  • Selfish - comparative analysis of replicate Hi-C experiments via a self-similarity measure - local similarity borrowed from image comparison. Check reproducibility, detect differential interactions. Boolean representation of contact matrices for reproducibility quantification. Deconvoluting local interactions with a Gaussian filter (putting a Gaussian bell around a pixel), then comparing derivatives between contact maps for each radius. Simulated (Zhou method) and real comparison with FIND - better performance, especially on low fold-changes. Stronger enrichment of relevant epigenomic features. Matlab implementation https://github.com/ucrbioinfo/Selfish

  • HiCcompare - joint normalization of two Hi-C datasets using loess regression through an MD plot (minus-distance). Data-driven normalization accounting for the between-dataset biases. Per-distance permutation testing of significant interactions. https://bioconductor.org/packages/HiCcompare/

  • diffloop - Differential analysis of chromatin loops (ChIA-PET). edgeR framework. https://bioconductor.org/packages/diffloop/

  • FIND - differential chromatin interaction detection comparing the local spatial dependency between interacting loci. Previous strategies - simple fold-change comparisons, binomial model (HOMER), count-based (edgeR). FIND exploits a spatial Poisson process model to detect differential chromatin interactions that show a significant change in their interaction frequency and the interaction frequency of their adjacent bins. "Variogram" concept. For each point, compare densities between conditions using Fisher's test. Explored various multiple testing correction methods, used r^th ordered p-values (rOP) method. Benchmarking against edgeR in simulated settings - FIND outperforms at shorter distances, edgeR has more false positives at longer distances. Real Hi-C data normalized using KR and MA normalizations. R package https://bitbucket.org/nadhir/find/downloads/

    • Mohamed Nadhir, Djekidel, Yang Chen, and Michael Q. Zhang. “FIND: DifFerential Chromatin INteractions Detection Using a Spatial Poisson Process.” Genome Research, February 12, 2018. https://doi.org/10.1101/gr.212241.116.
  • AP - aggregation preference - parameter, to quantify TAD heterogeneity. Call significant interactions within a TAD, cluster with DBSCAN, calculate weighted interaction density within each cluster, average. AP measures are reproducible. Comparison of TADs in Gm12878 and IMR90 - stable TADs change their aggregation preference, these changes correlate with LINEs, Lamin B1 signal. Can detect structural changes (block split) in TADs. https://github.com/XiaoTaoWang/TADLib

  • diffHiC - Differential contacts using the full pipeline for Hi-C data. Explanation of the technology, binning. MA normalization, edgeR-based. Comparison with HOMER. https://bioconductor.org/packages/diffHic/
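
HiCcompare and multiHiCcompare normalize on the MD plot, where M is the log2 ratio of interaction frequencies and D is the distance between the two bins. A minimal sketch of computing M and D for two toy matrices follows. Per-distance mean centering is used here as a deliberately crude stand-in for the loess fit those packages use, and the function names and pseudocount are assumptions for illustration:

```python
from collections import defaultdict
from math import log2


def md_values(mat1, mat2, pseudocount=1.0):
    """Return (i, j, D, M) per upper-triangle pair, with M = log2(IF2 / IF1)."""
    n = len(mat1)
    out = []
    for i in range(n):
        for j in range(i, n):
            m = log2((mat2[i][j] + pseudocount) / (mat1[i][j] + pseudocount))
            out.append((i, j, j - i, m))  # D is the distance in bins
    return out


def center_per_distance(md):
    """Subtract the mean M at each distance D (simplified stand-in for loess)."""
    by_distance = defaultdict(list)
    for _, _, d, m in md:
        by_distance[d].append(m)
    mean_m = {d: sum(v) / len(v) for d, v in by_distance.items()}
    return [(i, j, d, m - mean_m[d]) for i, j, d, m in md]


# Toy symmetric contact matrices (hypothetical counts).
hic1 = [[50, 20, 5], [20, 60, 18], [5, 18, 55]]
hic2 = [[90, 45, 12], [45, 110, 30], [12, 30, 100]]

for i, j, d, m in center_per_distance(md_values(hic1, hic2)):
    print(f"bins ({i},{j})  D={d}  centered M={m:+.3f}")
```

After centering, M values scatter around zero at every distance, so the remaining large |M| values are candidates for differential interactions rather than artifacts of depth or distance-dependent bias.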

Differential abundance

  • Milo - an R package for differential abundance testing on scRNA-seq data between two groups or multiple conditions. Building a graph on the first 40 components of PCA, defining neighborhoods using a graph sampling algorithm. Each neighborhood (partially overlapping, in contrast to discrete clustering) contains cells from different conditions - differential abundance is tested using a negative binomial GLM. Tested on simulated datasets (dyntoy), a time course of mouse thymic epithelial cells development, liver cirrhosis analysis. Replicated datasets needed, batch corrected. Competitors: DA-seq, Cydar. Code to reproduce results for the paper

TAD callers

  • HiCKey - hierarchical TAD caller, comparison of TADs across samples. A generalized likelihood-ratio (GLR) test for detecting change-points in an interaction matrix that follows negative binomial distribution (Methods). Bottom-up approach to detect hierarchy. Tested on Forcato simulated data with nested TADs, TPR/FPR/difference/Fowlkes-Mallows index to estimate performance. Applied to seven cell lines. TAD hierarchy is up to four levels. Compared with TADtree, 3DNetMod, IC-Finder, HiCSeg. Colocalization within a 2-bin distance. Input - normalized, distance effect removed matrix in sparse text format, output - TAD start coordinate, hierarchy level, p-value of the changepoint. C++ implementation. Did not compare with SpectralTAD hierarchical caller.

  • GRiNCH - TAD detection algorithm using Graph Regularized Nonnegative matrix factorization. Graph captures distance dependence. Also smoothes/imputes Hi-C matrices. Compared with Directionality Score, Insulation Index, rGMAP, Armatus, HiCseg, TopDom using several metrics - Davies-Bouldin index, Delta Contact Counts, Rand Index, Mutual Information. GRiNCH detects consistent similarly-sized TADs, robust to different resolutions, boundaries show robust enrichment in known markers of TAD boundaries (TFBSs and histone), detects consistent Fit-Hi-C significant interactions (area under precision-recall curve). Applied to mouse neural development and pluripotency reprogramming to confirm known and discover new boundary regulators. Applied to SPRITE and HiChIP data. Project, GitHub, and Visualization tutorial

  • BHi-Cect - identification of the full hierarchy of chromosomal interactions (TADs). Spectral clustering starting from the whole chromosome, detecting nested BHi-Cect Partition Trees (BPTs), partitioned in non-contiguous and interwoven enclaves, inspired by fractal globule idea. Variation of information to test the agreement between two clustering results, overlap-based metrics to test correspondence with TADs. Correspondence analysis of enclaves association with TF content. Gene enrichment. Different enclaves show different epigenomic and gene expression signatures, bottom enclaves are most crisply defined. Resolution affects what enclave size can be detected. https://github.com/princeps091-binf/BHi-Cect

  • TADBD - TAD caller using a multi-scale Haar diagonal template (sum of on-diagonal squares minus the sum of off-diagonal squares). Compared with HiCDB, IC-Finder, EAST (also using Haar features), TopDom, HiCseg using simulated (Forcato) and experimental (K562 and IMR90) data. ICE-normalized data. MCC, Jaccard. Fast. R package https://github.com/bioinfo-lab/TADBD/

  • TADpole - hierarchical TAD boundary caller. Preprocessing by filtering sparse rows, transforming the matrix into its Pearson correlation coefficient matrix, running PCA on it and retaining 200 PCs, transforming into a Euclidean distance matrix, clustering using the Constrained Incremental Sums of Squares clustering (rioja::chclust(, coniss)), estimating significance, Calinski-Harabasz index to estimate the optimal number of clusters (chromatin subdivisions). Benchmarking using Zufferey 2018 datasets, mouse limb bud development with genomic inversions from Kraft 2019. Resolution, normalization, sequencing depth. Metrics: the Overlap Score, the Measure of Concordance, all from Zufferey 2018. Enrichment in epigenomic marks. DiffT metric for differential analysis (on binarized TAD/non-TAD matrices). Compared with 22 TAD callers, including hierarchical (CaTCH, GMAP, Matryoshka, PSYCHIC). https://github.com/3DGenomes/TADpole

  • HiCDB - TAD boundary detection using local relative insulation (LRI) metric, improved stability, less parameter tuning, cross-resolution, differential boundary detection, lower computations, visualization. Review of previous methods, directionality index, insulation score. Math of LRI. GSEA-like enrichment in genome annotations (CTCF). Differential boundary detection using the intersection of extended boundaries. Compared with Armatus, DI, HiCseg, IC-finder, Insulation, TopDom on 40kb datasets. Accurately detects smaller-scale boundaries. Differential TADs are enriched in cell-type-specific genes. https://github.com/ChenFengling/RHiCDB

  • OnTAD - hierarchical TAD caller, Optimal Nested TAD caller. Sliding window, adaptive local minimum search algorithm, similar to TopDom. C++ implementation. https://github.com/anlin00007/OnTAD. OnTAD for coolers - a Python wrapper to work with .cool files.

    • An, Lin, Tao Yang, Jiahao Yang, Johannes Nuebler, Qunhua Li, and Yu Zhang. “Hierarchical Domain Structure Reveals the Divergence of Activity among TADs and Boundaries,” July 3, 2018. - Intro about TADs, Dixon's directionality index, Insulation score. Other hierarchical callers - TADtree, rGMAP, Arrowhead, 3D-Net, IC-Finder. Limitations of current callers - ad hoc thresholds, sensitivity to sequencing depth and mapping resolution, long running time and large memory usage, insufficient performance evaluation. Boundaries are asymmetric - some have more contacts with other boundaries, support for asymmetric loop extrusion model. Performance comparison with DomainCaller, rGMAP, Arrowhead, TADtree. Stronger enrichment of CTCF and two cohesin proteins RAD21 and SMC3. TAD-adjR^2 metric quantifying the proportion of variance in the contact frequencies explained by TAD boundaries. Reproducibility of TAD boundaries - Jaccard index, tested at different sequencing depths and resolutions. Boundaries of hierarchical TADs are more active - more CTCF, epigenomic features, TFBSs, expressed genes. Super-boundaries - shared by 5 or more TADs, highly active. Rao-Huntley 2014 Gm12878 data. Distance correction - subtracting the mean counts at each distance.
  • 3D-NetMod - hierarchical, nested, partially overlapping TAD detection using graph theory. Community detection method based on the maximization of network modularity, Louvain-like locally greedy algorithm, repeated several (20) times to avoid local maxima, then getting consensus. Tuning parameters are estimated over a sequence search. Benchmarked against TADtree, directionality index, Arrowhead. ICE-normalized brain data from Geschwind (human data) and Jiang (mouse data) studies. Computationally intensive. Python implementation https://bitbucket.org/creminslab/3dnetmod_method_v1.0_10_06_17

    • Norton, Heidi K., Daniel J. Emerson, Harvey Huang, Jesi Kim, Katelyn R. Titus, Shi Gu, Danielle S. Bassett, and Jennifer E. Phillips-Cremins. “Detecting Hierarchical Genome Folding with Network Modularity.” Nature Methods 15, no. 2 (February 2018): 119–22. https://doi.org/10.1038/nmeth.4560.
  • deDoc - TAD detection minimizing structural entropy of the Hi-C graph (structural information theory). Detects optimal resolution (= minimal entropy). Pooled 10 single-cell Hi-C analysis. Intro about TADs, a brief description of TAD callers, including hierarchical. Works best on raw, non-normalized data, highly robust to sparsity (0.1% of the original data sufficient). Compared with five TAD callers (Armatus, TADtree, Arrowhead, MrTADFinder, Domaincall (DI)), and a classical graph modularity detection algorithm. Enrichment in CTCF, housekeeping genes, H3K4me3, H4K20me1, H3K36me3. Other benchmarks - weighted similarity, number, length of TADs. Detects hierarchy over different passes. Java implementation (won't run on Mac) https://github.com/yinxc/structural-information-minimisation

    • Li, Angsheng, Xianchen Yin, Bingxiang Xu, Danyang Wang, Jimin Han, Yi Wei, Yun Deng, Ying Xiong, and Zhihua Zhang. “Decoding Topologically Associating Domains with Ultra-Low Resolution Hi-C Data by Graph Structural Entropy.” Nature Communications 9, no. 1 (15 2018): 3265. https://doi.org/10.1038/s41467-018-05691-7.
  • CaTCH - identification of hierarchical TAD structure. Reciprocal insulation (RI) index. Benchmarked against Dixon's TADs (diTADs). CTCF enrichment as a benchmark, enrichment of TADs in differentially expressed genes. https://github.com/zhanyinx/CaTCH_R

  • HiTAD - hierarchical TAD identification, different resolutions, correlation with chromosomal compartments, replication timing, gene expression. Adaptive directionality index approach. Data sources, methods for comparing TAD boundaries, reproducibility. H3K4me3 enriched and H3K4me1 depleted at boundaries. TAD boundaries (but not sub-TADs) separate replication timing, A/B compartments, gene expression. https://github.com/XiaoTaoWang/TADLib

  • IC-Finder - Segmentations of HiC maps into hierarchical interaction compartments, http://membres-timc.imag.fr/Daniel.Jost/DJ-TIMC/Software.html

  • ClusterTAD - A clustering method for identifying topologically associated domains (TADs) from Hi-C data, https://github.com/BDM-Lab/ClusterTAD

  • EAST - Efficient and Accurate Detection of Topologically Associating Domains from Contact Maps, Haar-like features (rectangles on images) and a function that quantifies TAD properties: frequency within is high, outside - low, boundaries must be strong. Objective - finding a set of contiguous non-overlapping domains maximizing the function. Restricted by the maximum length of TADs. Boundaries are enriched in CTCF, RNA PolII, H3K4me3, H3K27ac. https://github.com/ucrbioinfo/EAST

  • TADtree - Hierarchical (nested) TAD identification. Two ways of TAD definition: 1D and 2D. Normalization by distance. Enrichment over the background. http://compbio.cs.brown.edu/software/

  • TopDom - An efficient and Deterministic Method for identifying Topological Domains in Genomes. The method is based on the general observation that within-TAD interactions are stronger than between-TAD. binSignal value as the average of nearby contact frequency, fitting a curve, finding local minima, testing them for significance. Fast, takes linear time. Detects similar domains to HiCseg and Dixon's directionality index. Found expected enrichment in CTCF, histone marks. Housekeeping genes and overall gene density are close to TAD boundaries, differentially expressed genes are not. http://zhoulab.usc.edu/TopDom/

  • TADtool - wrapper for directionality index and insulation score TAD callers.

  • Arboretum-Hi-C - a multitask spectral clustering method to identify differences in genomic architecture. Intro about the 3D genome organization, TAD differences, and conservation. Assessment of different clustering approaches using different distance measures, as well as raw contacts. Judging clustering quality by enrichment in genomic regulatory signals (Histone marks, LADs, early vs. late replication timing, TFs like POLII, TAF, TBP, CTCF, P300, CMYC, cohesin components, SINE, LINE, LTR) and by numerical methods (Davies-Bouldin index, silhouette score, others). Although spectral clustering on contact counts performed best, spectral + Spearman correlation was chosen. Comparing cell types identifies biologically relevant differences as quantified by enrichment. Peak counts or average signal within regions were used for enrichment. Data https://zenodo.org/record/49767, and Arboretum-HiC https://bitbucket.org/roygroup/arboretum-hic

  • Armatus - TAD detection at different resolutions, Dynamic programming method. https://github.com/kingsfordgroup/armatus

  • HiCseg - TAD detection by maximization of likelihood based block-wise segmentation model. 2D segmentation rephrased as 1D segmentation - not contours, but borders. Statistical framework, solved with dynamic programming. Dixon data as gold standard. Hausdorff distance to compare segmentation quality. Parameters (from TopDom paper): nb_change_max = 500, distrib = 'G' and model = 'Dplus'. https://cran.r-project.org/web/packages/HiCseg/index.html

  • domaincaller - A Python implementation of the original DI domain caller, https://github.com/XiaoTaoWang/domaincaller
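
Many of the callers above (TopDom, insulation score, OnTAD) start from the same observation: contacts that span a TAD boundary are weaker than contacts inside a TAD. A minimal sketch of a TopDom-style binSignal follows, computed as the mean contact frequency between the `window` bins up to and including bin i and the `window` bins after it, with boundaries taken at local minima. The exact window definition is an assumption here, and the curve fitting and significance test of the real method are omitted:

```python
def bin_signal(matrix, window):
    """Mean contact frequency between bins just upstream and just downstream of each bin."""
    n = len(matrix)
    signal = []
    for i in range(n):
        upstream = range(max(0, i - window + 1), i + 1)    # includes bin i
        downstream = range(i + 1, min(n, i + 1 + window))  # bins after bin i
        values = [matrix[a][b] for a in upstream for b in downstream]
        # The last bin has no downstream partner, so its signal is undefined.
        signal.append(sum(values) / len(values) if values else None)
    return signal


def local_minima(signal):
    """Indices whose signal is strictly lower than both neighbours."""
    minima = []
    for i in range(1, len(signal) - 1):
        prev, cur, nxt = signal[i - 1], signal[i], signal[i + 1]
        if None in (prev, cur, nxt):
            continue
        if cur < prev and cur < nxt:
            minima.append(i)
    return minima


# Toy matrix (hypothetical): two 4-bin TADs, strong contacts inside, weak between.
toy = [[10 if i // 4 == j // 4 else 1 for j in range(8)] for i in range(8)]

signal = bin_signal(toy, window=2)
print("binSignal:", signal)
print("candidate boundaries after bins:", local_minima(signal))
```

On real data the signal is noisy, which is why the callers listed above add smoothing, statistical testing, or multi-scale windows before reporting boundaries.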

Architectural stripes

  • Stripenn - a computer vision-based method for architectural stripes detection using Canny edge detection. Scores stripes by median p-value and stripiness based on the continuity of interaction signal. Input - .cool files, optionally normalized. Output - coordinates and scores of the predicted stripes. Applicable to Hi-C, HiChIP, Micro-C data. Introduction to the biology of architectural stripes, review of previous methods (Zebra from Vian et al. 2018, domainClassifyR, CHESS for comparing 3D domains and stripes). Analysis of stripes from B and T lymphocytes identifies stripe anchors enriched in the transcriptionally active compartments, architectural proteins mediating loop extrusion. Stripes are strongly conserved, correspond to TAD boundaries, subtle changes are associated with transcriptional output. Python, three functions (compute, score, seeimage). Video 16m

Differential/timecourse TAD analysis

Prediction of 3D features

  • DL2021_HI-C - deep-learning approach for increasing Hi-C data resolution by appending additional information about genome sequence. Two algorithms: the image-to-image model (modified after VEHiCLE), which enhances Hi-C resolution by itself, and the sequence-to-image model (modified after Akita), which uses additional information about the underlying genome sequence for further resolution improvement. Both models are combined with the simple head model (Figure 4). Details of network architecture, training, testing, validation (5 metrics). Various architecture modifications. VEHiCLE by itself performs well.

  • L-HiC-Reg (Local HiC-Reg) - a Random Forests based regression method to predict high-resolution contact counts in new cell lines, and a network-based framework to identify candidate cell line-specific gene networks targeted by a set of variants from a Genome-wide association study (GWAS). Trained on chromosome-specific 1Mb segments in one cell line to predict in another. 55 cell lines, 10 annotations (CTCF, RAD21, TBP, histone marks, DNAse), imputing missing annotations with Avocado. Outperforms GeneHancer and JEME. Networks for 15 GWASs in all cell lines are available

    Preprint
  • CCIP (CTCF-mediated Chromatin Interaction Prediction) - predicting CTCF-mediated convergent and tandem loops with transitivity. Transitivity definition from the network of multiple CTCF-interacting regions, convergent and tandem. Incorporating the GCP (graph connecting probability) score, together with CTCF, RAD21, directional CTCF motif one-hot encoding into random forest. GCP is the most important predictive feature. Compared with Lollipop and CTCF-MP.

  • ChINN - chromatin interaction neural network, predicting chromatin interactions from DNA sequence. Trained on CTCF- and RNA PolII-mediated loops, as well as on Hi-C data. Gradient boosting trained on functional annotation, distance, or both as predictors. ChINN - CNN trained on sequence. Convergent CTCF orientation is an important predictor, other motifs complement predictive power. Applied to 6 new chronic lymphocytic leukemia samples, patient-specific interactions, validated by Hi-C and 4C.

  • compartmap - A/B compartment reconstruction from bulk and scRNA-seq/scATAC-seq. Steps: preprocessing and summarizing the single-cell assay data; eBayes shrinkage of summarized features towards a global or local target; computing the shrinkage correlation estimate of summarized features; computing domains via SVD (Methods). Also detects genomic rearrangements. Applied to K562 data. Bioconductor, GitHub, Tweet

  • Hi-ChIP-ML - machine learning models for TAD boundary prediction using epigenomic data (ChIP-seq, Histone-seq signal, 5 and 18 feature sets) in Drosophila. Linear regression with 4 regularization types, gradient boosting, bidirectional (to capture both sides around a boundary) LSTM. Methods describe data construction, loss function, model architectures, machine learning framework. One feature at a time to measure feature importance. Chriz factor is the top predictor, then H3K4me1, H3K27me1. Python, sklearn, keras. Colab notebook.

  • 3DPredictor - Prediction of enhancer-promoter interactions from gene expression, CTCF binding and orientation, distance between interacting loci. Also predicts changes in 3D genome organization - genomic rearrangements. Benchmarking of TargetFinder, its performance is overestimated. Class balance slightly improves performance. Two random chromosomes for validation, the rest - for training (10-90% split). Gradient boosting for prediction, significantly better than Random Forest. Multiple performance metrics. Model trained in one cell type can predict in another. Predicts quantitative level of interactions. Other tools - EP2Vec, CTCF-MP, HiC-Reg. GitHub, Web tool

    • Belokopytova, Polina S., Miroslav A. Nuriddinov, Evgeniy A. Mozheiko, Daniil Fishman, and Veniamin Fishman. "Quantitative prediction of enhancer–promoter interactions." Genome research 30, no. 1 (2020): 72-84.
  • Akita - Chromatin interaction prediction from sequence only using CNN. Takes in 1Mb sequence and predicts interactions at 2Kb resolution. Distance between interacting regions is included. Allows for understanding the effect of mutations. Tested on Rao Hi-C datasets, Micro-C data. Basenji network architecture. 80/10/10 training, validation, test sets.

  • 3D-GNOME 2.0 - modeling 3D structure changes due to structural variants (SVs). Reference data - GM12878 ChIA-PET data (CTCF, RNAPII). Changes are modeled by removing or adding contacts between chromatin interaction anchors depending on SV (deletion, duplication, insertion, inversion). Web tool, Bitbucket

  • HiC-Reg - Predicting Hi-C contact counts from one-dimensional regulatory signals (Histone marks, CTCF, RAD21, Tbp, DNAse). Random Forest regression. Feature encoding - distance between two regions, pair-concat, window, multi-cell. Works across chromosomes (some chromosomes are worse than others) and cell lines (Gm12878, K562, Huvec, Hmec, Nhek, can be used to predict interactions on new cell lines). Selection of the most important features using multi-task group LASSO (distance, CTCF, Tbp, H4K20me1, DNAse, others). Predicted contacts correspond well to the original contacts (distance-stratified Pearson correlation), define TADs similar to the originals (Jaccard), define significant contacts (Fit-Hi-C) more enriched in CTCF binding. Validated on HBA1 and PAPPA gene promoters. Hi-C normalization doesn't have much effect. https://github.com/Roy-lab/HiC-Reg

  • TADBoundaryDectector - TAD boundary prediction from sequence only using deep learning models. 12 architectures tested, with three convolutional and an LSTM layer performed best. Methods, Implementation in Keras-TensorFlow. Model evaluation using different criteria, 96% accuracy reported. Deep learning outperforms feature-based models, among which Boosted Trees, Random Forest, elastic net logistic regression are the best performers. Data augmentation (aka feature engineering) by randomly shifting TAD boundary regions by some base pairs of length (0-100). Tested on Drosophila data. https://github.com/lincshunter/TADBoundaryDectector

  • 3DEpiLoop - prediction of 3D interactions from 1D epigenomic profiles using Random Forest trained on CTCF peaks (histone modifications and TFBSs are the most important predictors). https://bitbucket.org/4dnucleome/3depiloop

  • PRISMR - a polymer-based method (strings and binders switch SBS polymer model) to model 3D chromatin folding, to predict enhancer-promoter contacts, and to model the effect of structural variations (deletions, duplications, inversions) on the 3D genome organization. Input - Hi-C data. Infers minimal factors that shape chromatin folding and its equilibrium under the laws of physics, without prior assumptions or tunable parameters. Simulated annealing Monte Carlo optimization procedure that minimizes the distance between the predicted polymer model and the input contact matrix under a Bayesian weighting factor to avoid overfitting. Tested on an EPHA4 locus associated with limb malformations, and more. Newly generated capture Hi-C of mouse limb buds at embryonic day 11.5, and human skin fibroblasts, GEO. Public murine CH12-LX cells and human IMR90 cells. Implementation - the LAMMPS package, code not available.

  • Predicting TAD boundaries using training data and making new predictions. Bayesian network (BNFinder method), random forest vs. basic k-means clustering, ChromHMM, cdBEST. Using sequence k-mers and ChIP-seq data from modENCODE for prediction - CTCF ChIP-seq performs best. Used Boruta package for feature selection. The Bayesian network performs best. To read on their BNFinder method
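
Several predictors above (HiC-Reg, L-HiC-Reg, 3DEpiLoop) turn 1D epigenomic tracks into a feature vector for a pair of bins before fitting a regression model. A minimal sketch of two such encodings, loosely following the "pair-concat" and "window" names in the HiC-Reg entry, is shown below. The exact layout of the vectors, the function names, and the toy signals are assumptions for illustration and are not taken from the HiC-Reg code:

```python
def pair_concat(signals, i, j):
    """Features of bin i, then bin j, then the genomic distance in bins."""
    marks = sorted(signals)  # fixed mark order keeps vectors comparable
    features = [signals[m][i] for m in marks] + [signals[m][j] for m in marks]
    features.append(abs(j - i))
    return features


def window(signals, i, j):
    """pair_concat plus the mean signal of each mark over the bins between i and j."""
    marks = sorted(signals)
    lo, hi = sorted((i, j))
    between = range(lo + 1, hi)
    means = [
        sum(signals[m][b] for b in between) / len(between) if len(between) else 0.0
        for m in marks
    ]
    base = pair_concat(signals, i, j)
    # Keep the distance as the last element of the vector.
    return base[:-1] + means + base[-1:]


# Toy per-bin signals (hypothetical values for two marks over five bins).
signals = {
    "CTCF": [1.0, 0.0, 2.0, 4.0, 3.0],
    "H3K27ac": [0.5, 0.5, 1.5, 0.0, 2.0],
}

print("pair-concat:", pair_concat(signals, 0, 4))
print("window:     ", window(signals, 0, 4))
```

Vectors like these, one per bin pair, would form the rows of the training matrix, with the observed contact count as the response.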

SNP-oriented

  • iRegNet3D - Integrated Regulatory Network 3D (iRegNet3D) is a high-resolution regulatory network comprised of interfaces of all known transcription factor (TF)-TF, TF-DNA interaction interfaces, as well as chromatin-chromatin interactions and topologically associating domain (TAD) information from different cell lines. Goal: SNP interpretation. Input: One or several SNPs, rsIDs, or genomic coordinates. Output: For one or two SNPs, on-screen information of their disease-related info, connection over TF-TF and chromatin interaction networks, and whether they interact in 3D and located within TADs. For multiple SNPs, the same info downloadable as text files. http://iregnet3d.yulab.org/index/

  • 3DSNP - 3DSNP database integrating SNP epigenomic annotations with chromatin loops. Linear closest gene, 3D interacting gene, eQTL, 3D interacting SNP, chromatin states, TFBSs, conservation. For individual SNPs. http://cbportal.org/3dsnp/

  • HUGIn - tissue-specific Hi-C linear display of anchor position and around. Overlay gene expression and epigenomic data. Association of SNPs with genes based on Hi-C interactions. Tissue-specific. http://yunliweb.its.unc.edu/HUGIn/

    • Martin, Joshua S, Zheng Xu, Alex P Reiner, Karen L Mohlke, Patrick Sullivan, Bing Ren, Ming Hu, and Yun Li. “HUGIn: Hi-C Unifying Genomic Interrogator.” Edited by Inanc Birol. Bioinformatics 33, no. 23 (December 1, 2017)

CNV and Structural variant detection

Visualization

De novo genome scaffolding

3D modeling

Deconvolution

  • THUNDER - cell-type deconvolution of Hi-C data. NMF, uses informative interactions within and between chromosomes (top 1000 features by Fano factor), reformatted into matrix form by concatenation. Needs number of cell types k. Tested on in silico mixture of cell types. Outperforms TOAST, CIBERSORT. R implementation.
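
The THUNDER entry mentions selecting informative interactions by Fano factor before NMF. A minimal sketch of that selection step only, where the Fano factor of a feature is its variance divided by its mean across samples, follows. The function names and toy matrix are hypothetical, and the NMF step itself is not shown:

```python
from statistics import mean, pvariance


def fano_factor(values):
    """Variance-to-mean ratio; 0.0 for features with a zero mean."""
    mu = mean(values)
    return pvariance(values) / mu if mu > 0 else 0.0


def top_features(matrix, k):
    """Indices of the k rows (features) with the highest Fano factor.

    matrix: rows are features (bin-pair contacts), columns are samples.
    """
    scores = [fano_factor(row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda r: scores[r], reverse=True)
    return ranked[:k]


# Toy contact counts (hypothetical): three bin pairs across four samples.
contacts = [
    [5.0, 5.0, 5.0, 5.0],  # constant across samples, uninformative
    [1.0, 9.0, 1.0, 9.0],  # highly variable
    [4.0, 6.0, 4.0, 6.0],  # mildly variable
]

print("selected features:", top_features(contacts, k=2))
```

Features that barely vary across samples carry little information about the cell-type mixture, which is the motivation for ranking by a dispersion measure before factorization.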

Haplotype separation

Papers

Methodological Reviews

General Reviews

  • Bouwman, Britta A. M., and Wouter de Laat. “Getting the Genome in Shape: The Formation of Loops, Domains and Compartments.” Genome Biology 16 (August 10, 2015) - TAD/loop formation review. Convergent CTCF, cohesin, mediator, different scenarios of loop formation. Stability and dynamics of TADs. Rich source of references.

  • Chakraborty, Abhijit, and Ferhat Ay. “The Role of 3D Genome Organization in Disease: From Compartments to Single Nucleotides.” Seminars in Cell & Developmental Biology 90 (June 2019): 104–13. - 3D genome structure and disease. Evolution of technologies from FISH to variants of chromatin conformation capture. Hierarchical 3D organization, Table 1 summarizes each layer and its involvement in disease. Rearrangement of TADs/loops in cancer and other diseases. Specific examples of the biological importance of TADs, loops as means of distal communication.

  • Zheng, H., and Xie, W. (2019). "The role of 3D genome organization in development and cell differentiation." Nat. Rev. Mol. Cell Biol. - 3D structure of the genome and its changes during gametogenesis, embryonic development, lineage commitment, differentiation. Changes in developmental disorders and diseases. Chromatin compartments and TADs. Chromatin changes during X chromosome inactivation. Promoter-enhancer interactions established during development are accompanied by gene expression changes. Polycomb-mediated interactions may repress developmental genes. References to many studies.

  • Yu, Miao, and Bing Ren. “The Three-Dimensional Organization of Mammalian Genomes.” Annual Review of Cell and Developmental Biology 33 (06 2017) - 3D genome structure review. The role of gene promoters, enhancers, and insulators in regulating gene expression. Imaging-based tools, all flavors of chromatin conformation capture technologies. 3D features - chromosome territories, topologically associated domains (TADs), the association of TAD boundaries with replication domains, CTCF binding, transcriptional activity, housekeeping genes, genome reorganization during mitosis. Use of 3D data to annotate noncoding GWAS SNPs. 3D genome structure change in disease.

  • Fraser, J., C. Ferrai, A. M. Chiariello, M. Schueler, T. Rito, G. Laudanno, M. Barbieri, et al. “Hierarchical Folding and Reorganization of Chromosomes Are Linked to Transcriptional Changes in Cellular Differentiation.” Molecular Systems Biology 11, no. 12 (December 23, 2015) - 3D genome organization parts. Well-written and detailed. References. Technologies: FISH, 3C, 4C, 5C, Hi-C, GCC, TCC, ChIA-PET. Typical resolution - 40bp to 1Mb. LADs - conserved, but some are cell type-specific. Chromosome territories. Cell type-specific. Inter-chromosomal interactions may be important to define cell-specific interactions. A/B compartments identified by PCA. Chromatin loops, marked by CTCF and Cohesin binding, sometimes, with Mediator. Transcription factories.

  • Dekker, Job, Marc A. Marti-Renom, and Leonid A. Mirny. “Exploring the Three-Dimensional Organization of Genomes: Interpreting Chromatin Interaction Data.” Nature Reviews. Genetics 14, no. 6 (June 2013) - 3D genome review. Chromosomal territories, transcription factories. Details of each 3C technology. Exponential decay of interaction frequencies. Box 2: A/B compartments (several Mb), TAD definition, size (hundreds of kb). TADs are largely stable, A/B compartments are tissue-specific. Adjacent TADs are not necessarily of opposing signs, may jointly form A/B compartments. Genes co-expression, enhancer-promoters interactions are confined to TADs. 3D modeling.

  • Witten, Daniela M., and William Stafford Noble. “On the Assessment of Statistical Significance of Three-Dimensional Colocalization of Sets of Genomic Elements.” Nucleic Acids Research 40, no. 9 (May 2012)

Technology

  • Hi-C 3.0 protocol. Evaluation of 12 experimental Hi-C protocols - they capture different 3D genome features with different efficiencies. Additional crosslinking with DSG improves signal-to-noise, loop detection, reduced compartment detection. Evaluating 4 restriction enzymes, MNase, DdeI, DpnII, HindIII. 4 cell lines - H1-hESCs, differentiated endoderms, HFF, HeLa (two cell cycle stages). 63 libraries total. All protocols detect cell type-specific differences, A/B compartments, insulation strength. MNase digestion improves loop detection. Anchors for multi-loop interactions can be detected. Double enzyme use improves loop detection. Evaluation of enrichment of CTCF, SMC3, H3K4me3, H3K27ac at loop boundaries. Ultra-deeply sequenced maps using Hi-C, Micro-C, and Hi-C 3.0 protocols (not yet available). cLIMS Hi-C data management system. Scripts to reproduce results. Tweet

  • Chromatin conformation capture technologies, from 3C to imaging. 3D structures (nucleolus, nuclear speckles, polycomb bodies, chromosome territories, A/B compartments, TADs, loops), their roles in gene expression (sometimes, conflicting), replication timing, DNA repair. Agreement (and disagreement) examples between 3C methods and FISH. Description of ligation-based (3C - Hi-C) and ligation-free (GAM, SPRITE, DamC) technologies. Multi-way interactions primarily occur within TADs. TAD formation, loop extrusion mechanisms. Association between replication timing and A/B compartments. Effect of mechanical forces on chromosome folding.

  • Review of Hi-C, Capture Hi-C, and Capture-C technologies, their computational preprocessing. Experimental protocols, similarities and differences, types of reads (figures), details of alignment, read orientation, elimination of artifacts, quality metrics. A brief overview of preprocessing tools. Example preprocessing of three types of data. Java tool for preprocessing all types of data, Diachromatic (Differential Analysis of Chromatin Interactions by Capture), GOPHER (Generator Of Probes for capture Hi-C Experiments at high Resolution) for genome cutting, probe design

  • Chromosome conformation capture technologies, 4C, 5C, Hi-C, ChIP-loop, ChIA-PET. From microscopy observations (constrained movement of genomic loci, LADs, preferential stability of chromosome conformation and its independence from transcription), to technology details (Figure 1). Examples of alpha- and beta-globin locus studies by different technologies, X chromosome inactivation, HOXA-d gene clusters. The future vision of single-cell, single-allele investigation of chromatin interactions.

  • Review of technologies for studying the 3D structure of the genome. From microscopy to 3C techniques revealing CTCF and cohesin as the key proteins for establishing chromatin loops. TADs are unlikely over large distances >>1Mb. Details of 3C, 4C, 5C, Hi-C, ChIP-PET, and other derivatives. A/B compartments and their subdivision. TADs, their conservation, ~35-50% still seem to change. CTCF (directionality of binding important) and cohesin. Diseases and the 3D genome, examples. Key steps in data analysis and interpretation, software, visualization. Hi-C data specifics - chimeric reads, mapping, data representation as fixed or enzyme-sized bins, normalization, detection of A/B compartments and TAD boundaries, significant interactions. Hi-C analysis tools: HiC-Pro, HiCUP, HOMER, Juicer. Tools for 3D modeling.

  • Hi-CO - chromatin conformation capture with nucleosome orientation technology. Ligation of DNA entry or exit points at every nucleosome locus in a micrococcal nuclease (MNase)-fragmented genome. Produces ~300M reads. Computational analysis - simulated annealing - molecular dynamics, determines 3D positions and orientation of every nucleosome. Not suitable for single-cell genome analysis, only detects pairwise ligations. Applied to the S. cerevisiae genome. Protocol. Example data and tutorial

  • RADICL-seq - RNA And DNA Interacting Complexes Ligated and sequenced - RNA-chromatin interaction profiling in intact nuclei. Crosslinking, digesting with DNase I, RNase H treatment (reduces ribosomal contamination), a bridge adapter specifically ligating RNA and DNA, reverse crosslinking, conversion to double-stranded DNA, cutting to 25-27bp length with EcoP15I enzyme. RNA- and DNA tags mapped, quantified by distance. Improvement over other technologies (MARGI, GRID-seq, SPRITE) - fewer cells, fewer spurious interactions, more trans interactions (lncRNAs in particular), more long-distance interactions. Four technology advantages in Discussion. Analysis - accounts for coverage, one-sided binomial test to call significant interactions. Data - mESC, oli-neu (from mOPCs). Visualization as 25kb contact matrix. Several noncoding transcripts interact in trans. Integration with CAGE data, comparison with Hi-C, association with TAD boundaries, A/B compartments, repeat elements. Differential analysis, cell-type-specific interactions. Code, data

  • tagHi-C - low-input tagmentation-based Hi-C. Applied to 10 major blood cell types of mouse hematopoiesis. Changes in compartments and the Rabl configuration defining chromatin condensation. Gene-body-associating domains are a general property of highly-expressed genes. Spatial chromatin loops link GWAS SNPs to candidate blood-phenotype genes. HiC-Pro to Juicer. GEO GSE142216 - RNA-seq, replicates, GEO GSE152918 - tagHi-C data, replicates, combined .hic files

  • scsHi-C - sister-chromatid-sensitive Hi-C to explore interactions between the sister chromatids. Distinguishing cis from trans sister contacts based on 4-thio-thymidine (4sT) labeling. Paired organization of sister chromatids in interphase and complete separation in mitosis. TADs that exhibit tight pairing are heterochromatin marked by H3K27me3. Chromatids are predominantly linked at TAD boundaries, within TADs - more flexible. Investigation of looping mechanism - NIPBL-depletion, Sororin degradation. Jupyter notebooks for each analysis https://github.com/gerlichlab/scshic_analysis

  • 4C technology, wet-lab protocol, and data analysis and visualization. R-based processing pipeline pipe4C, configuration parameters

  • SisterC Hi-C technology to test interactions between sister chromatids. Uses BrdU incorporation in S-phase and single-strand degradation by UV/Hoechst treatment to obtain inter-sister or intra-sister interactions. Findings about the alignment of chromatids, strong at centromeres, looser (~35kb spaced) interactions along arms, loops up to 50kb. Tested on the S. cerevisiae genome. distiller-nf, pairtools, cooltools for processing.

  • Pore-C chromatin conformation capture using Oxford Nanopore Technologies, direct sequencing of multi-way chromatin contacts without amplification (concatemers, HOLR - high-order and long-range contacts). >18 times higher efficiency as compared with SPRITE, enrichment in concatemers. Tested on the Gm12878 cell line. Hi-C matrix from Pore-C well resembles published data. Concatemers show significantly lower distance decay. Concatemers better resolve complex cancer rearrangements, well-suited for de novo genome scaffolding. Pore-C tools and a Snakemake pipeline, detection of multi-way interactions

  • Tiled-C low-input 3C technology, requiring >20,000 cells. Applied for in vivo mouse erythroid differentiation, alpha-globin genes. TADs are pre-existing, regulatory interactions gradually form during differentiation. Integration with scRNA-seq data (CITE-seq technology) and ATAC-seq data. Analyzed using CCseqBasic pipeline and TiledC. Data (Tiled-C, CITE-seq, ATAC-seq)

  • Red-C (RNA Ends on DNA Capture) technology - proximity ligation-based technology to understand RNA-DNA interactome (3' and 5' ends of the RNA molecule associated with a given DNA site). Technology description (Methods, Results, Figure 1). Assesses all coding and noncoding RNA classes (piRNAs, tRNAs, rRNAs, snRNAs, scRNAs, srpRNAs, vlinc, antisense RNAs). Applied to K562 cells. The RNA-DNA matrices show a wide distribution of RNA contacts along extended genomic regions. The highest number of contacts was observed for mRNA, linc and vlinc RNAs. Enhancer RNAs (eRNAs) produced from strong enhancers interact with other strong enhancers on the same chromosome. MIR3648 and MIR3687 interact with repressed chromatin genome-wide. Investigation of mRNA interaction preference across gene body. Discovery of novel chromatin-interacting RNAs (X-RNAs). Processed Red-C and RNA-seq data, GitHub, Nextflow processing pipeline

  • ChIA-Drop technology - multiplex chromatin interaction analysis via droplet-based and barcode-linked sequencing. 10X genomics technology, instead of cells, each gel-bead-in-emulsion contains crosslinked (but not ligated) chromatin and unique barcode. Singleton reads removed, distance constraints further refine multi-way interactions. FISH agrees with ChIA-Drop. Only approx. 20% of promoters are regulated by multiple enhancers, in Drosophila S2 cells. Data. Supplementary - notes about data processing, overview of ChIA-Dropbox software for visualization, Python implementation. Drosophila S2 cell line ChIA-Drop data, Hi-C

  • ChIA-Dropbox - analysis and visualization of ChIA-Drop data (10X Genomics-based GEMs containing cross-linked chromatin). Alignment, demultiplexing, reconstruction of intra-chromosomal chromatin droplets from barcodes, drops inter-chromosomal, generates visualizations compatible with Juicebox, BASIC browser, its own ChIA-View (R/Shiny app). QC plots, multi-way interactions visualization, cluster/fragment/promoter views. Global and RNAPII-enriched GEO GSE109355 - ChIA-Drop data for Drosophila

  • Dip-C technology for single-cell Hi-C, multiplex end-tagging amplification (META). Detects approx. 5 times more contacts. Possible to make haplotype-separated Hi-C maps, detect CNVs, resolve X-chromosome inactivation. Approx. 10kb-resolution Hi-C matrices, 3D genome reconstruction at 20kb resolution. PCA on chromatin compartments separates cell types. Comparison with Nagano data, bulk Gm12878 Hi-C. dip-c - Tools to analyze Dip-C (or other 3C/Hi-C) data, hickit. Processed data: Gm12878, 17 cells, PBMC, 18 cells

  • Methyl-HiC technology, in situ Hi-C and WGBS. Comparable Hi-C matrices, TADs. 20% fewer CpGs overall, more CpGs in open chromatin. Proximal CpGs correlate irrespectively of loop anchors, weaker for inter-chromosomal interactions. Application to single-cell, mouse ESCs under different conditions. Relevant clustering, cluster-specific genes. Methods for wet-lab and computational processing. Bulk (replicates) and single-cell Methyl-HiC data. Scripts in https://bitbucket.org/dnaase/bisulfitehic/src/master/, Bhmem pipeline to map bisulfite-converted reads, Juicer pipeline for processing, VC normalization, HiCRep at 1Mb matrix similarity.

  • Review of chromatin conformation capture technologies, from image-based methods (FISH), through proximity ligation (3/4/5C, Hi-C, TCC, ChIA-PET, scHi-C), to ligation-free methods (GAM, SPRITE, ChIA-Drop). Details of each technology (Table 1, Figures), comparison of them (Table 2).

  • Optimization steps for Hi-C wet-lab protocol. Pitfalls and their effect on the downstream quality. Recommendations for each step.

  • DLO Hi-C technology (digestion-ligation-only Hi-C). Uses two rounds of digestion and ligation, without biotin and pull-down. Allows for early evaluation of Hi-C quality. Cost-effective, high signal-to-noise-ratio. Tested on THP-1 (human monocytes) and K562 cells. Data processed with ChIA-PET Tool, normalized with ICE

  • HIPMap - high-throughput imaging and analysis pipeline to map the location of gene loci within the 3D space. FISH in a 384-well plate format, automated imaging.

  • HiC method description, 1Mb, Gm06990. Small chromosomes, but 18, interact. Compartment A associated with open chromatin. 1Mb, 100kb resolution

  • 3C technology, the matrix of interaction frequencies, application to reveal spatial information, applied to yeast (S. cerevisiae) genome. Interphase and metaphase chromosomes show different patterns of interactions. Distance-dependent decay of interaction frequencies. Basic observations on chromosome size, inter-chromosomal interactions.

    • Dekker, Job, Karsten Rippe, Martijn Dekker, and Nancy Kleckner. “Capturing Chromosome Conformation.” Science (New York, N.Y.) 295, no. 5558 (February 15, 2002)

  • The Arima kit uses two restriction enzymes: ^GATC (DpnII), G^ANTC (N can be any of the 4 genomic bases) (HinfI), which after ligation of DNA ends generates 4 possible ligation junctions in the chimeric reads: GATC-GATC, GANT-GATC, GANT-ANTC, GATC-ANTC. Source: 'Hi-C library preparation' Methods section from Du et al., “DNA Methylation Is Required to Maintain Both DNA Replication Timing Precision and 3D Genome Organization Integrity.”
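
The four ligation junctions listed for the Arima kit can be derived from the two cut sites. The sketch below assumes that the 5' overhangs are filled in and the blunt ends are ligated, as in standard in situ Hi-C; the helper names `filled_ends` and `ligation_junctions` are hypothetical and not part of any Arima software.

```python
from itertools import product

# Recognition site and top-strand cut offset (position of "^" within the site).
ENZYMES = {
    "DpnII": ("GATC", 0),   # ^GATC
    "HinfI": ("GANTC", 1),  # G^ANTC
}

def filled_ends(site, cut):
    """Return (left_end, right_end) of a palindromic 5'-overhang site after fill-in."""
    left = site[: len(site) - cut]  # end of the upstream fragment once the overhang is filled
    right = site[cut:]              # start of the downstream fragment
    return left, right

def ligation_junctions(enzymes):
    """All junction sequences formed by joining any left end to any right end."""
    ends = [filled_ends(site, cut) for site, cut in enzymes.values()]
    lefts = [left for left, _ in ends]
    rights = [right for _, right in ends]
    return sorted(left + right for left, right in product(lefts, rights))

junctions = ligation_junctions(ENZYMES)
print(junctions)
```

Knowing the junction sequences matters for read processing: chimeric reads are typically split or trimmed at these motifs before alignment.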

Micro-C

  • Micro-Capture-C (MCC) technology. Advanced and optimized 3C technology, using micrococcal nuclease (MNase), intact cells permeabilized with digitonin, ultra-deep sequencing, plus sequencing the ligation junction. 330 viewpoints (promoters, enhancers, CTCF binding sites) in primary mouse erythroid and embryonic stem cells. Many biological observations, cohesin is required for CTCF contacts. Enhancer-promoter contacts are mediated by at least partly different mechanisms than cohesin-mediated interactions. Computational pipeline. GEO GSE153256 - mouse spleen erythroid cells MCC data. GEO GSE144336 - mouse primary erythroid cells, embryonic stem cells, HUDEP-2 cells. MCC, CTCF/Rad21/NIPBL ChIP-seq.

  • Ultra-deep Micro-C maps of human embryonic stem cells and fibroblasts. Compared with in situ Hi-C (DpnII 4bp cutter), Micro-C allows for the detection of ~20,000 additional loops, improved signal-to-noise ratio. Similar distance-dependent decay, recovery of A/B compartments, better recovery of close-range interactions. High-resolution interaction boundaries are not created equal - most are CTCF+ and YY1+, some are CTCF- and YY1+, CTCF- and YY1-, and completely negative boundaries. Multiple, weak pause sites of SMC complexes. Distiller-nf processing, ICE normalization, other Mirnylab tools. Data at https://data.4dnucleome.org/

  • Micro-C (MNase digestion Hi-C) data analysis. Nucleosome-level (~100-200bp) resolution Hi-C, captures all structures in regular Hi-C data. Stripes/flames structures correspond to enhancer-promoter interactions, colocalize with PolII, CTCF. TADs are further split into "micro TADs", insulation score. Active boundaries colocalize with CpG islands, promoters, tRNA genes. Inactive boundaries are in repeats. Micro TADs subdivided into five subgroups. Two-start zig-zag model tetra-nucleosome stacks. Mouse stem cell, 38 biological replicates. Twitter

  • Micro-C (MNase digestion Hi-C) technology and basic analysis. Human embryonic stem cells H1-ESC and differentiated human foreskin fibroblasts (HFFc6). Captures standard Hi-C features, with many additional interaction peaks ("dots"). Enrichment of classical marks of TAD boundaries (Fig 3C) - RAD21, TAF1, PHF8, CTCF, TBP, POL2RA, YY1, and more.

  • Micro-C technology - mononucleosome resolution mapping in yeast. Micrococcal nuclease to fragment chromatin. Yeast does not have TADs, but Micro-C revealed self-associating domains (chromatin interaction domains, CIDs) driven by the number of genes. Enrichment of histone modifications. Data

  • Micro-C scripts and documentation for Dovetail™ Micro-C Kit. Uses the Micrococcal nuclease (MNase) enzyme instead of restriction enzymes for chromatin digestion, yielding 146 bp fragments distributed frequently across the genome. Detailed computational processing of Micro-C data, generation of valid pairs using bwa mem, pairtools, juicer_tools, cooler and other genomics tools, QC, contact matrix generation. Introduction to Dovetail™ Micro-C Technology Video, 5min, Breaking the Hi-C Resolution Barrier with Micro-C Video, 45 min. Micro-C datasets, MicroC Snakemake pipeline by Eric Davis.
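
The insulation score mentioned in the Micro-C analysis entry above can be sketched as follows. This is one common formulation (mean contact frequency between the bins upstream and downstream of each position), not the exact procedure of any paper in this list; the function name, window size, and toy matrix are assumptions.

```python
def insulation_scores(matrix, window):
    """Mean contact frequency crossing each bin, within `window` bins on either side."""
    n = len(matrix)
    scores = [None] * n  # bins too close to the matrix edge stay None
    for i in range(window, n - window):
        crossing = [
            matrix[a][b]
            for a in range(i - window, i)          # upstream bins
            for b in range(i + 1, i + window + 1)  # downstream bins
        ]
        scores[i] = sum(crossing) / len(crossing)
    return scores

# Toy 8-bin map: two domains (bins 0-3 and 4-7), strong contacts within, weak between.
toy = [[10 if (a < 4) == (b < 4) else 1 for b in range(8)] for a in range(8)]
scores = insulation_scores(toy, window=2)
print(scores)
```

Local minima of the score mark positions that few contacts cross, which is how domain boundaries are usually called from it.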

Multi-way interactions

  • ImmunoGAM - an extension of genome architecture mapping (GAM), thin nuclear cryosections. Works with small numbers of cells (approx. 400-1500 cells). Immunoselection of cryosections (incubation with antibodies). Applied to profile oligodendrocytes (OLGs), pyramidal glutamatergic neurons (PGNs), dopaminergic neurons (DNs) of mouse brain. Integration with pseudobulk RNA-seq and ATAC-seq. Insulation score to detect domains, extensive cell type-specific TAD, AB compartment rearrangement. "Melting" - contact loss at highly expressed long neuronal genes (MELTRON pipeline to calculate a melting score https://github.com/DominikSzabo1/Meltron). Methods: GAM data window calling and QC. Extensive supplementary material. GEO GSE148792 - GAM sequencing data from DN, PGN and OLG, non-normalized co-segregation matrices, normalized pairwise chromatin contact maps and raw GAM segregation tables. Microscopy images. A public UCSC Genome Browser session contains all data produced, as well as all published data utilized in this study. Tweet

  • MATCHA - a computational framework for the analysis of multi-way chromatin interactions (hyperedges). Graph embedding fed into Hyper-SAGNN (self-attention based graph neural network for hypergraphs). Applied to SPRITE and ChIA-Drop data, Gm12878 at 1Mb and 100kb resolution.

  • Multi-contact 3C (MC-3C, based on conventional 3C, C-walk, and multi-contact 4C approaches) technology reveals distinct chromosome territories with very little mixing, never entanglement, same with chromosomal compartment domains (A-A, B-B interactions predominant, A-B - minimal to none). Analysis of C-walks - connected paths of pairwise interactions. Compared with C-walks generated from Hi-C data. Permutation analysis of the significance of insulated, mixed, and intermediate domains. PacBio sequencing, processing with SMRT Analysis package and a custom pipeline

  • MIA-Sig (multiplex interactions analysis by signal processing algorithms) - de-noise the multi-interaction data, assess the statistical significance of chromatin complexes, identify topological domains and multi-way inter-TAD contacts. Tailored for ChIA-Drop, also works with SPRITE and GAM data. Uses the assumption that the genomic distance of a true multiway interaction should be more evenly spaced to remove noise.

  • MC-4C multi-way contacts technology and computational protocols. ~2 weeks, ~$600/sample, best for <120kb regions. Computational protocol, test data included

  • MC-4C - Multi-way interactions technology, uses Nanopore MinION (or, PacBio) sequencing. Cross-linking, cutting with four-cutter and six-cutter enzymes, circularization, cutting with Cas9 gRNA designed to the viewpoint region, selective amplification of concatemers with primers specific to the viewpoint. Rigorous filtering strategy that distinguishes reads coming from individual alleles. Compared with genome-wide multi-contact technologies C-walks, SPRITE, GAM, Tri-C. Applied to mouse beta-globin (fetal liver where hemoglobin genes are expressed, and brain, where they are silent) and protocadherin-alpha (same tissues, vice versa) loci. Super enhancers can form hubs, target multiple genes. WAPL deletion in HAP1 (leukemia) cells stimulates the collision of CTCF-anchored domain loops to form rosette-like structures. MC-4C processing pipeline, Visualization of the analyzed data, raw data, processed data matrices

  • Multiplex-GAM - Genome Architecture Mapping technology. Multiple nuclear profiles (2-3 NPs, slices of nuclei) are sequenced together. Updates of the SLICE method for GAM data analysis. Details on the dependencies between the number of GAM samples, probability to detect interactions, the number of NPs per sample, technical parameters. TAD boundaries match those from Hi-C. Method-specific contacts exist, but GAM contacts are enriched in CTCF-enhancer contacts and other active transcription annotations, in contrast to heterochromatin-prone Hi-C contacts. No data yet

  • GAM (genome architecture mapping) - restriction- and ligation-free chromatin conformation capture technology. Isolates and sequences DNA content of many ultra-thin (~0.22um) cryo-fixed nuclear slices, whole-genome amplification. ~30kb matrix is built on frequencies of co-occurrences of regions in multiple slices. Applied to mouse ESCs. Normalization using linkage disequilibrium (better than ICE). GAM and Hi-C matrices are highly correlated, A/B compartments, TADs significantly overlap. Enhancers and active genes significantly interact. Multi-way interactions, triplets, super-enhancers are enriched in three-way interactions, confirmed by FISH. SLICE (Statistical Inference of co-segregation) mathematical model to assign significance of interactions (Supplementary Note 1). Negative binomial modeling of significant interactions, log-normal noise. Supplementary Table 2 - genomic coordinates of triplet (three-way interacting) TADs. Raw data, processed matrices, Python scripts, GEO GSE64881 - mouse ES cell GAM matrices at 1Mb and 30kb resolution

  • SPRITE (Split-Pool Recognition of Interactions by Tag Extension) - Multi-way interactions technology, barcode system based on repeated pooling and splitting of crosslinked DNA hairballs to uncover clustered DNA fragments (SPRITE clusters). Does not involve proximity ligation. Transcriptionally active interaction hubs around nuclear speckles, inactive - around nucleolus. These nuclear bodies constrain the 3D packaging. Applied to mouse embryonic stem cells (mESCs) and GM12878, data correlates well with Hi-C. Data: 25kb-1Mb inter- and intra-chromosomal interactions, Processing Pipeline

  • C-walks, multi-way technology, genome-wide. TADs organize chromosomal territories. Active and inactive TAD properties. Methods: Good mathematical description of insulation score calculations. Filter TADs smaller than 250kb. Inter-chromosomal contacts are rare, ~7-10%. Concatemers (more than two contacts) are unlikely.

  • TM3C - Tethered multiple 3C technology to probe multi-point contacts. NHEK, KBM7 cells, detected the Philadelphia chromosome, investigated triple contacts in the IGF2-H19 locus at 40kb, detected typical genomic structures (chromosomal compartments, distance-decay, TADs), reconstructed 3D genome at 1Mb resolution. A two-phase mapping strategy that separately maps chimeric subsequences within a single read (Methods). Multiple 4-cutter restriction enzymes

  • INGRID - chromatin conformation capture technology using in-gel replication of interacting DNA segments. Detects multi-way interactions. Protocol, demonstration of beta-globin gene locus.
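
The GAM entry above describes a matrix built on how often two regions co-occur in the same nuclear slice, normalized using linkage disequilibrium. A minimal sketch of that idea follows, assuming a binary segregation table (windows by slices). It computes only the raw linkage D = f_ab - f_a * f_b, not the normalized variant or the SLICE model, and the function name and toy table are hypothetical.

```python
def cosegregation(seg_table):
    """seg_table[w][s] is 1 if window w was detected in nuclear slice s, else 0."""
    n_windows = len(seg_table)
    n_slices = len(seg_table[0])
    freq = [sum(row) / n_slices for row in seg_table]  # detection frequency per window
    coseg = [[0.0] * n_windows for _ in range(n_windows)]
    linkage = [[0.0] * n_windows for _ in range(n_windows)]
    for a in range(n_windows):
        for b in range(n_windows):
            # Fraction of slices in which both windows are present.
            both = sum(x & y for x, y in zip(seg_table[a], seg_table[b])) / n_slices
            coseg[a][b] = both
            linkage[a][b] = both - freq[a] * freq[b]  # D = f_ab - f_a * f_b
    return coseg, linkage

# Toy table (hypothetical): 3 genomic windows across 4 slices.
table = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],  # always found together with window 0
    [0, 0, 1, 1],  # never found together with window 0
]
coseg, linkage = cosegregation(table)
print(coseg[0][1], linkage[0][1], linkage[0][2])
```

Subtracting the product of the individual detection frequencies corrects for windows that are simply detected more often, which is the motivation for a linkage-based normalization.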

Imaging

  • OligoFISSEQ - fluorescence in situ sequencing (FISSEQ) of barcoded oligopaint probes to enable the rapid visualization of many targeted genomic regions. Three strategies for interrogation of targets: sequencing by ligation (SBL, O-LIT), sequencing by synthesis (SBS, O-SIT), sequencing by hybridization (SBH, O-HIT). Applied to human diploid fibroblast cells, 36 regions across 6 chromosomes in just 4-8 rounds of sequencing. Fine tracing of 46 regions on X chromosome. Combined with OligoSTORM, allows for single-molecule super-resolution imaging. Whole genome-ready technology. Figure-specific source data, Analysis scripts

  • MERFISH - Super-resolution imaging technology, reconstruction of 3D structure in single cells at 30kb resolution, 1.2Mb region of Chr21 in IMR90 cells. Distance maps obtained by microscopy show small distance for loci within, and larger between, TADs. TAD-like structures exist in single cells. 2.5Mb region of Chr21 in HCT116 cells, cohesin depletion does not abolish TADs, only alters their preferential positioning. Multi-point (triplet) interactions are prevalent. TAD boundaries are highly heterogeneous in single cells. Diffraction-limited and STORM (stochastic optical reconstruction microscopy) imaging. GitHub

Normalization

  • Lyu, Hongqiang, Erhu Liu, and Zhifang Wu. “Comparison of Normalization Methods for Hi-C Data.” BioTechniques 68, no. 2 (2020) - a comprehensive analysis of six Hi-C normalization methods for their ability to remove systematic biases. The introduction provides a good classification and overview of different normalization methods, including the latest methods for cross-sample normalization, such as "multiHiCcompare." Human and mouse Hi-C data were used, only cis interaction matrices are considered. A systematic protocol for benchmarking is presented. Several benchmarks were performed, including statistical quality, the influence of resolution, consistency of distance-dependent changes in interaction frequency, reproducibility of the 3D architecture. multiHiCcompare is reported as outperforming other methods on a range of performance metrics.

  • Imakaev, Maxim, Geoffrey Fudenberg, Rachel Patton McCord, Natalia Naumova, Anton Goloborodko, Bryan R. Lajoie, Job Dekker, and Leonid A. Mirny. “Iterative Correction of Hi-C Data Reveals Hallmarks of Chromosome Organization.” Nature Methods 9, no. 10 (October 2012) - ICE - Iterative Correction and Eigenvector decomposition, normalization of Hi-C data. Assumption - all loci should have equal visibility. Deconvolution into eigenvectors/values.

  • Yaffe, Eitan, and Amos Tanay. “Probabilistic Modeling of Hi-C Contact Maps Eliminates Systematic Biases to Characterize Global Chromosomal Architecture.” Nature Genetics 43, no. 11 (November 2011) - Sources of biases: 1) non-specific ligation (large distance between pairs); 2) length of each ligated fragment; 3) GC content and nucleotide composition; 4) Mappability. Normalization. Enrichment of long-range interactions in active promoters. General aggregation of active chromosomal domains. Chromosomal territories, high-activity and two low-activity genomic clusters
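
The equal-visibility assumption behind ICE (Imakaev et al., above) can be illustrated with a minimal matrix-balancing sketch: each bin is repeatedly divided by its relative coverage until all rows sum to the same value. This is a simplified pure-Python illustration, not the authors' implementation; the function name, iteration count, and toy matrix are assumptions, and real pipelines also filter low-coverage bins first.

```python
def iterative_correction(matrix, n_iter=200):
    """Balance a symmetric contact matrix so that all non-empty rows have equal sums."""
    n = len(matrix)
    m = [[float(v) for v in row] for row in matrix]
    bias = [1.0] * n  # accumulated per-bin correction factors
    for _ in range(n_iter):
        sums = [sum(row) for row in m]
        nonzero = [s for s in sums if s > 0]
        mean_sum = sum(nonzero) / len(nonzero)
        # Relative coverage of each bin; empty bins are left untouched.
        step = [s / mean_sum if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                m[i][j] /= step[i] * step[j]
        bias = [b * s for b, s in zip(bias, step)]
    return m, bias

# Toy symmetric matrix (hypothetical) with unequal coverage per bin.
raw = [
    [10, 4, 2],
    [4, 20, 6],
    [2, 6, 30],
]
balanced, bias = iterative_correction(raw)
row_sums = [sum(row) for row in balanced]
print([round(s, 3) for s in row_sums])
```

The returned `bias` vector satisfies raw[i][j] = balanced[i][j] * bias[i] * bias[j], so the correction stays symmetric and can be stored separately from the matrix.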

TAD detection

Spectral clustering

  • Y. X Rachel Wang, Purnamrita Sarkar, Oana Ursu, Anshul Kundaje and Peter J. Bickel, "Network modelling of topological domains using Hi-C data" - TAD analysis using graph-theoretical (network-based) methods. Treats TADs as a "community" within the network. Shows that naive spectral clustering is generally ineffective, leaving gaps in the data.

  • Liu, Sijia, Pin-Yu Chen, Alfred Hero, and Indika Rajapakse. “Dynamic Network Analysis of the 4D Nucleome.” BioRxiv, January 1, 2018. - Temporal Hi-C data analysis using graph theory. Integrated with RNA-seq data. Network-based approaches such as von Neumann graph entropy, network centrality, and multilayer network theory are applied to reveal universal patterns of the dynamic genome. Toeplitz normalization. Graph Laplacian matrix. Detailed statistics.

  • Norton, Heidi K, Harvey Huang, Daniel J Emerson, Jesi Kim, Shi Gu, Danielle S Bassett, and Jennifer E Phillips-Cremins. “Detecting Hierarchical 3-D Genome Domain Reconfiguration with Network Modularity,” November 22, 2016. - Graph theory for TAD identification. Louvain-like local greedy algorithm to maximize network modularity. Vary resolution parameter, hierarchical TAD identification. Hierarchical spatial variance minimization method. ROC analysis to quantify performance. Adjusted Rand score to quantify the TAD overlap.

  • Chen, Jie, Alfred O. Hero, and Indika Rajapakse. “Spectral Identification of Topological Domains.” Bioinformatics (Oxford, England) 32, no. 14 (15 2016) - Spectral algorithm to define TADs. Laplacian graph segmentation using Fiedler vector iteratively. Toeplitz normalization to remove distance effect. Spectral TADs do not overlap with Dixon's, but better overlap with CTCF. Python implementation https://github.com/shappiron/TAD-Laplacian-Identification
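
The Toeplitz normalization mentioned in the entries above removes the distance effect by dividing each contact by the average contact at the same genomic distance. The sketch below is written under that reading and is not taken from the linked implementation; the function name and toy matrix are assumptions.

```python
def toeplitz_normalize(matrix):
    """Divide every entry by the mean of its diagonal (all pairs at the same bin distance).

    Assumes a symmetric matrix, so only the upper triangle is used for the means.
    """
    n = len(matrix)
    diag_mean = []
    for d in range(n):
        values = [matrix[i][i + d] for i in range(n - d)]
        diag_mean.append(sum(values) / len(values))
    return [
        [matrix[i][j] / diag_mean[abs(i - j)] if diag_mean[abs(i - j)] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]

# Toy map (hypothetical): contacts decay with distance, plus one enriched pair.
n = 5
toy = [[100.0 / (1 + abs(i - j)) for j in range(n)] for i in range(n)]
toy[0][3] = toy[3][0] = 75.0  # three times stronger than the decay alone predicts
normalized = toeplitz_normalize(toy)
print(normalized[0][1], normalized[0][3], normalized[1][4])
```

After normalization, pairs that follow the average decay are close to 1, so enriched pairs such as bins 0 and 3 stand out regardless of their distance.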

URLs

Courses
