Genome scaffolding based on HiC data in heterozygous and high ploidy genomes

tanghaibao/allhic

ALLHIC: Genome scaffolding based on HiC data

     _       _____     _____     ____  ____  _____   ______
    / \     |_   _|   |_   _|   |_   ||   _||_   _|.' ___  |
   / _ \      | |       | |       | |__| |    | | / .'   \_|
  / ___ \     | |   _   | |   _   |  __  |    | | | |
_/ /   \ \_  _| |__/ | _| |__/ | _| |  | |_  _| |_\ `.___.'\
|____| |____||________||________||____||____||_____|`.____ .'

This software is currently under active development. DO NOT USE.

Authors: Haibao Tang (tanghaibao), Xingtan Zhang (tangerzhang)
Email: [email protected]
License: BSD

Installation

The easiest way to install allhic is to download the latest binary from the releases page and make it executable with chmod +x.

If you have Go installed, you can build from source with:

go get -u -t -v github.com/tanghaibao/allhic/...
go install github.com/tanghaibao/allhic/cmd/allhic

Usage

Extract

Extract does a fair amount of preprocessing: 1) extract inter-contig links into a more compact form, specifically into .clm; 2) extract intra-contig links and build a distribution; 3) count the restriction sites to be used in normalization (similar to LACHESIS); 4) bundle the inter-contig links into pairs of contigs.

allhic extract tests/test.bam tests/seq.fasta.gz
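
To illustrate step 3 above, here is a minimal sketch of counting restriction sites per contig. It is not the ALLHIC implementation: the function names are hypothetical, and the GATC motif is assumed from the test.counts_GATC.txt file name used in the examples.

```python
import re


def count_sites(sequence, motif="GATC"):
    """Count occurrences of a restriction motif, allowing overlaps."""
    # The lookahead makes each match zero-width, so overlapping hits count too.
    pattern = re.compile(f"(?={re.escape(motif)})", re.IGNORECASE)
    return sum(1 for _ in pattern.finditer(sequence))


def count_sites_per_contig(fasta_lines, motif="GATC"):
    """Return {contig_name: (site_count, length)} from an iterable of FASTA lines."""
    counts, name, chunks = {}, None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seq = "".join(chunks)
                counts[name] = (count_sites(seq, motif), len(seq))
            # Keep only the first word of the header as the contig name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seq = "".join(chunks)
        counts[name] = (count_sites(seq, motif), len(seq))
    return counts
```

Sequence lines are joined before counting, so a site that spans a line break is still found. For a gzipped FASTA such as tests/seq.fasta.gz, the lines could be read with gzip.open(path, "rt").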

Prune

The prune step is optional for typical inbred diploid genomes. However, pruning will improve the assembly quality of polyploid genomes. It prunes the pairs file to remove allelic/cross-allelic links.

allhic prune tests/Allele.ctg.table tests/test.pairs.txt

See the help string of allhic prune for the format of Allele.ctg.table.
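
As a rough illustration of the idea behind pruning, the sketch below drops links between contigs that belong to the same allelic group. It is a simplification under stated assumptions: the in-memory inputs are hypothetical (they are not the Allele.ctg.table or pairs file formats), and only direct allelic links are handled, not cross-allelic ones.

```python
def prune_allelic_links(allele_groups, pairs):
    """Drop links between contigs that share an allelic group.

    allele_groups: iterable of contig-name collections, one per allelic group
                   (hypothetical in-memory form, not the Allele.ctg.table format).
    pairs: iterable of (contig_a, contig_b, n_links) tuples (also hypothetical).
    """
    group_of = {}
    for group_id, contigs in enumerate(allele_groups):
        for contig in contigs:
            group_of.setdefault(contig, set()).add(group_id)

    kept = []
    for a, b, n_links in pairs:
        shared = group_of.get(a, set()) & group_of.get(b, set())
        if not shared:  # the two contigs are not allelic to each other
            kept.append((a, b, n_links))
    return kept
```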

Partition

Given a target k, the number of partitions, the goal of partitioning is to separate all the contigs into k clusters. As with all clustering algorithms, there is an optimization goal here. ALLHIC uses the same method as LACHESIS: hierarchical clustering with average linkage.

[Figures: contig network before and after partitioning]

allhic partition tests/test.counts_GATC.txt tests/test.pairs.txt 2
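
The average-linkage clustering described above can be sketched in a few lines. This is a toy version, not the ALLHIC code: the function name and the input layout are hypothetical, and the link scores are assumed to be already normalized (for example by restriction site counts).

```python
def average_linkage(links, contigs, k):
    """Merge contigs into k clusters, always joining the two clusters with
    the highest average link score between their members.

    links: {frozenset({x, y}): score}; missing pairs count as 0 (assumed layout).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[c] for c in contigs]

    def score(a, b):
        total = sum(links.get(frozenset((x, y)), 0.0) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        best, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = score(clusters[i], clusters[j])
                if s > best:
                    best, best_pair = s, (i, j)
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The pairwise scan is quadratic in the number of clusters at every merge, which is fine for a demonstration but would need a priority queue or cached scores for real assemblies.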

Optimize

Given a set of Hi-C contacts between contigs, as specified in the clmfile, reconstruct the highest scoring ordering and orientations for these contigs.

Optimize uses a genetic algorithm (GA) to search for the best-scoring solution. GAs have been successfully applied to genome scaffolding tasks in the past (see ALLMAPS; Tang et al. Genome Biology, 2015).

[Figure: genetic algorithm optimization]

allhic optimize tests/test.counts_GATC.2g1.txt tests/test.clm
allhic optimize tests/test.counts_GATC.2g2.txt tests/test.clm
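
To give a feel for how a GA can search for a contig ordering, here is a deliberately small, mutation-only sketch. It is not the ALLHIC optimizer: the scoring function (link count divided by distance in the ordering), the inversion mutation, and all parameter values are assumptions chosen for illustration, and contig orientation is ignored.

```python
import random


def order_score(order, links):
    """Higher when strongly linked contigs sit close together in the order.

    links: {(contig_a, contig_b): n_links} with contig_a != contig_b.
    """
    pos = {c: i for i, c in enumerate(order)}
    return sum(n / abs(pos[a] - pos[b]) for (a, b), n in links.items())


def mutate(order, rng):
    """Reverse a random segment of the ordering (an inversion)."""
    i, j = sorted(rng.sample(range(len(order) + 1), 2))
    return order[:i] + order[i:j][::-1] + order[j:]


def ga_order(contigs, links, generations=200, pop_size=20, seed=0):
    """Evolve a population of orderings and return the best one found."""
    rng = random.Random(seed)
    population = [rng.sample(contigs, len(contigs)) for _ in range(pop_size)]
    for _ in range(generations):
        # Add one mutant per individual, then keep only the fittest.
        population += [mutate(ind, rng) for ind in population]
        population.sort(key=lambda o: order_score(o, links), reverse=True)
        population = population[:pop_size]
    return population[0]
```

For example, ga_order(["a", "b", "c"], {("a", "b"): 10, ("b", "c"): 10}) should place b between a and c, since that ordering puts both linked pairs at distance 1.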

Build

Build the genome release, including .agp and .fasta output.

allhic build tests/test.counts_GATC.2g?.tour tests/seq.fasta.gz tests/asm-2g.chr.fasta
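
For readers unfamiliar with AGP, the sketch below shows how an ordered, oriented list of contigs maps onto AGP rows. It follows the general AGP 2.x column layout but is not taken from the ALLHIC source: the function name, the in-memory tour representation, the fixed 100 bp gap of unknown size, and the proximity_ligation evidence tag are all assumptions.

```python
def tour_to_agp(object_name, tour, lengths, gap_size=100):
    """Return tab-separated AGP lines for one scaffold.

    tour: list of (contig_name, orientation), orientation being '+' or '-'.
    lengths: {contig_name: length_in_bp}.
    gap_size: assumed fixed gap between contigs (component type 'U').
    """
    rows, start, part = [], 1, 0
    for idx, (contig, orientation) in enumerate(tour):
        if idx > 0:
            # Gap row between consecutive contigs.
            part += 1
            end = start + gap_size - 1
            rows.append((object_name, start, end, part, "U", gap_size,
                         "scaffold", "yes", "proximity_ligation"))
            start = end + 1
        # Component row for the contig itself ('W' = WGS contig).
        part += 1
        end = start + lengths[contig] - 1
        rows.append((object_name, start, end, part, "W", contig,
                     1, lengths[contig], orientation))
        start = end + 1
    return ["\t".join(str(field) for field in row) for row in rows]
```

Coordinates are 1-based and inclusive, so each row starts one base after the previous row ends.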

Plot

Use d3.js to visualize the heatmap.

allhic plot tests/test.bam tests/test.counts_GATC.2g1.tour

[Figure: allhic plot heatmap]

Pipeline

The pipeline follows the steps described above: extract, partition, optimize, and build (the optional prune step is omitted here). In summary, we have:

allhic extract tests/test.bam tests/seq.fasta.gz
allhic partition tests/test.counts_GATC.txt tests/test.pairs.txt 2
allhic optimize tests/test.counts_GATC.2g1.txt tests/test.clm
allhic optimize tests/test.counts_GATC.2g2.txt tests/test.clm
allhic build tests/test.counts_GATC.2g?.tour tests/seq.fasta.gz tests/asm-2g.chr.fasta

Or, in a single step:

allhic pipeline tests/test.bam tests/seq.fasta.gz 2

In summary, the pipeline requires a BAM file and the contigs FASTA file. The user then needs to specify the restriction enzyme used and the number of groups k to partition into. Outputs include a reconstructed chromosome AGP file (describing how the contigs are linked together) and a chromosomal FASTA file.
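
For scripting the step-by-step variant with a different k, the command list can be generated rather than typed. The helper below is a hypothetical sketch: it only assembles argument lists, and it assumes the intermediate file names follow the pattern seen in the examples above (prefix.counts_GATC.txt, prefix.pairs.txt, prefix.clm, and one .txt/.tour pair per group).

```python
def pipeline_commands(bam, fasta, k, enzyme="GATC", out_fasta="asm.chr.fasta"):
    """Return the allhic invocations for extract, partition, optimize, build.

    File names are inferred from the README examples, not from the tool itself.
    """
    prefix = bam[:-len(".bam")] if bam.endswith(".bam") else bam
    counts = f"{prefix}.counts_{enzyme}.txt"
    cmds = [
        ["allhic", "extract", bam, fasta],
        ["allhic", "partition", counts, f"{prefix}.pairs.txt", str(k)],
    ]
    tours = []
    for i in range(1, k + 1):
        group = f"{prefix}.counts_{enzyme}.{k}g{i}"
        cmds.append(["allhic", "optimize", f"{group}.txt", f"{prefix}.clm"])
        tours.append(f"{group}.tour")
    cmds.append(["allhic", "build", *tours, fasta, out_fasta])
    return cmds
```

Each returned list could then be passed to subprocess.run(cmd, check=True) so that a failing step stops the run.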

WIP features

  • Add partition split inside "partition"
  • Use clustering when k = 1
  • Isolate matrix generation to "plot"
  • Add "pipeline" to simplify execution
  • Make "build" to merge subgroup tours
  • Provide better error messages for "file not found"
  • Plot the boundary of the contigs in "plot" using genome.json
  • Add dot plot to "plot"
  • Compare numerical output with Lachesis
  • Improve Ler0 results
  • Translate "prune" from C++ code to golang
  • Add test suites

Reference

TBD