Developed for assembling genomes from second-generation sequencing data, especially Illumina short reads. It is comparable to SOAPdenovo, and approaches SOAPdenovo2 in some aspects.

fanagislab/DBG_assembly

1. Function
  Genome assembly is one of the most challenging problems in bioinformatics, owing both to the complexity of the task and to its high demand for computational resources. This package was developed to assemble large genomes from second-generation sequencing data, especially Illumina short reads. A typical DBG (de Bruijn graph) assembly filters the raw reads, corrects the sequencing errors in the reads, constructs contigs with the DBG, and links the contigs into scaffolds using read-pair relations. This package follows that scheme, with each step implemented as a separate module. The underlying algorithm has many similarities to SOAPdenovo; its performance in assembled length and accuracy is comparable to SOAPdenovo and approaches SOAPdenovo2 in some aspects.

2. Installation
  Each module (clean_illumina, correct_error, DBG_contig, link_scaffold) has its own Makefile. Simply "cd" into the subdirectory of each module and type "make", and all the executables will be generated.
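
  The build step can also be scripted. Below is a minimal sketch in Python, assuming "make" is available on the PATH and the four module directories sit under a common root. The names build_all and run are illustrative and not part of the package.

```python
import subprocess
from pathlib import Path

# Module names as listed in the Installation section.
MODULES = ["clean_illumina", "correct_error", "DBG_contig", "link_scaffold"]


def build_all(root=".", run=subprocess.run):
    """Run `make` inside each module subdirectory under `root`.

    `run` is injectable so the loop can be exercised without a real build.
    """
    built = []
    for module in MODULES:
        module_dir = Path(root) / module
        # check=True makes a failed build raise instead of passing silently.
        run(["make"], cwd=module_dir, check=True)
        built.append(module_dir)
    return built
```

  Calling build_all() from the repository root should be equivalent to running "make" by hand in each of the four subdirectories.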

3. Module and usage
  The subdirectory "test/" contains test data of 40X simulated Illumina reads for the E. coli genome (4.6 Mb), together with all the commands used, which are stored in the "work.sh" file in each subdirectory. Below is a recommended workflow:
  
  a. remove low-quality bases and adapter contaminants from the raw reads with the module clean_illumina
	
	##require the resulting reads to have a maximum error rate of 0.01 and a minimum read length of 75
	clean_lowqual -e 0.01 -r 75 ../00.raw_reads/Ecoli_readlen250_insert400_20X_250_400_1.fq.gz ./Ecoli_readlen250_insert400_20X_250_400_1.fq.gz.nonLowQual.gz ./Ecoli_readlen250_insert400_20X_250_400_1.fq.gz.nonLowQual.stat 
	
	##remove the adapter sequences by pairwise alignment (ungapped dynamic programming) between reads and adapters
	clean_adapter -a ../../clean_illumina/illumina_NEB_adapter.fa -r 75 -s 12 Ecoli_readlen250_insert400_20X_250_400_1.fq.gz.nonLowQual.gz Ecoli_readlen250_insert400_20X_250_400_1.fq.gz.nonLowQual.gz.nonAdapter.gz Ecoli_readlen250_insert400_20X_250_400_1.fq.gz.nonLowQual.gz.nonAdapter.stat
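
  To illustrate the idea of ungapped read-versus-adapter alignment used in step a, here is a minimal Python sketch. It is not the clean_adapter implementation: the function names, the thresholds min_match and max_mismatch_rate, and the decision to cut the read at the adapter start are all assumptions made for illustration, and they are not claimed to correspond to the -s or -r options.

```python
def find_adapter(read, adapter, min_match=12, max_mismatch_rate=0.1):
    """Return the leftmost read position where the adapter aligns without gaps.

    The adapter is slid along the read; at each start position the overlapping
    bases are compared one by one (no insertions or deletions allowed).
    Returns None if no position passes the thresholds.
    """
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_match:
            break  # overlaps only get shorter from here on
        matches = sum(1 for i in range(overlap) if read[start + i] == adapter[i])
        mismatches = overlap - matches
        if matches >= min_match and mismatches <= max_mismatch_rate * overlap:
            return start
    return None


def trim_adapter(read, adapter, min_len=75, **thresholds):
    """Cut the read at the adapter start; drop it (None) if it becomes too short."""
    start = find_adapter(read, adapter, **thresholds)
    trimmed = read if start is None else read[:start]
    return trimmed if len(trimmed) >= min_len else None
```

  For example, trim_adapter("T" * 80 + "AGATCGGAAGAGC", "AGATCGGAAGAGC") keeps the first 80 bases, while a read that is shorter than min_len after trimming is discarded.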

  b. generate the k-mer frequency table and correct base errors in the reads based on the k-mer frequency with the module correct_error
	
	##count k-mer frequencies and generate the k-mer frequency table in compressed bit format
	##clean_reads.lib is a list file containing the paths of all the clean read files, one path per line
	 kmerfreq -k 17 -m 1 -q 10 ./clean_reads.lib
	
	##allow at most 2 bases to be modified in each read; the input clean_reads.lib.kmer.freq.cz is generated by kmerfreq
	 correct_error_reads -k 17 -c 2 ./clean_reads.lib.kmer.freq.cz  ./clean_reads.lib
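
  The principle behind step b can be sketched in a few lines of Python: k-mers that occur rarely across all reads are likely to contain sequencing errors, and a read is corrected by changing a small number of bases so that its k-mers become frequent again. This is a simplified illustration under stated assumptions, not the algorithm of kmerfreq or correct_error_reads: it ignores reverse complements and base qualities, uses a greedy search, and the names count_kmers, correct_read, min_freq and max_changes are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(read, counts, k, min_freq):
    """Number of k-mers in the read seen fewer than min_freq times."""
    return sum(1 for i in range(len(read) - k + 1)
               if counts[read[i:i + k]] < min_freq)


def correct_read(read, counts, k, min_freq=2, max_changes=2):
    """Greedily substitute up to max_changes bases to reduce weak k-mers."""
    for _ in range(max_changes):
        best_score = weak_kmers(read, counts, k, min_freq)
        best_read = read
        if best_score == 0:
            break  # every k-mer is already well supported
        for pos in range(len(read)):
            for base in "ACGT":
                if base == read[pos]:
                    continue
                candidate = read[:pos] + base + read[pos + 1:]
                score = weak_kmers(candidate, counts, k, min_freq)
                if score < best_score:
                    best_score, best_read = score, candidate
        if best_read == read:
            break  # no single substitution helps
        read = best_read
    return read
```

  The exhaustive search over every position is quadratic in read length and is only meant to make the idea concrete; a production tool would restrict the search to positions covered by low-frequency k-mers.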
  
  c. assemble the error-corrected reads into contig sequences using the DBG (de Bruijn graph) method with the module DBG_contig
	
	##construct the k-mer de Bruijn graph (K=31) and output the contig sequences; the input file corrected_reads.lib contains the paths of all the corrected read files, one path per line
	debruijn_contig -f 2 -k 31 -r 250 -t 10 -i 0.1 -M 125 -o Ecoli_corrected_reads ./corrected_reads.lib 2> Ecoli_corrected_reads.contig.log
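
  As a rough illustration of how contigs arise from a de Bruijn graph in step c, the Python sketch below builds a graph whose nodes are (k-1)-mers and whose edges are the distinct k-mers in the reads, then reports each maximal unbranched path as a contig. It is a textbook simplification and not the debruijn_contig implementation: it ignores reverse complements, k-mer frequency filtering, and tip or bubble removal, and it skips isolated cycles. All function names are illustrative.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    succ, pred = defaultdict(set), defaultdict(set)
    for kmer in kmers:
        succ[kmer[:-1]].add(kmer[1:])
        pred[kmer[1:]].add(kmer[:-1])
    return dict(succ), dict(pred)


def contigs(reads, k):
    """Return the maximal non-branching paths of the graph as sequences."""
    succ, pred = build_graph(reads, k)
    nodes = set(succ) | set(pred)

    def is_simple(node):
        # Exactly one way in and one way out: the path passes straight through.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    result = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # paths start only at branch points or dead ends
        for nxt in sorted(succ.get(node, ())):
            path, cur = node + nxt[-1], nxt
            while is_simple(cur):
                cur = next(iter(succ[cur]))
                path += cur[-1]
            result.append(path)
    return result
```

  With overlapping reads from a repeat-free sequence the graph is a single path and one contig is returned; where two sequences share a (k-1)-mer, the paths are cut at the shared node, which is why repeats fragment an assembly.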
  
  d. map paired-end reads onto the contig sequences and link the contigs into scaffolds using pair relations with the module link_scaffold
	
	##use contigs longer than 125 bp and raw reads no shorter than 250 bp. Note that either raw reads or corrected reads can be used for scaffolding; here we used only the 400-bp insert raw reads, whose file paths are stored in raw_reads.lib.
	map_pair -l 125 -r 250 -o ./maping_results/ ./Ecoli_corrected_reads.contig.seq.fa  ./raw_reads.lib
	
	##set the insert size to 400 bp; the input raw_reads.lib.map_pair.2ctg.lib is generated by map_pair
	link_scaffold -i 400 -o Ecoli_corrected_reads.contig  Ecoli_corrected_reads.contig.seq.fa ./raw_reads.lib.map_pair.2ctg.lib

	##iterative scaffolding with longer insert size reads
	##If you have data with multiple insert sizes, use them sequentially from the shortest to the longest insert size, taking the scaffold sequences from the previous run as the input contig sequences for the current run.
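
  The iterative scheme described above amounts to sorting the libraries by insert size and feeding each round's output into the next round. A minimal Python sketch follows, assuming a caller-supplied function scaffold_once that runs one round of map_pair and link_scaffold and returns the path of the resulting scaffold sequence file. Both iterative_scaffold and scaffold_once are hypothetical names; the output file naming of link_scaffold is deliberately left to the caller.

```python
def iterative_scaffold(contig_file, libraries, scaffold_once):
    """Scaffold with each library in turn, shortest insert size first.

    libraries      -- iterable of (insert_size, lib_file) pairs
    scaffold_once  -- callable(sequence_file, insert_size, lib_file) that runs
                      one scaffolding round and returns the new sequence file
    """
    current = contig_file
    for insert_size, lib_file in sorted(libraries, key=lambda lib: lib[0]):
        # The scaffolds from the previous round become the "contigs" of this one.
        current = scaffold_once(current, insert_size, lib_file)
    return current
```

  Processing short inserts first lets nearby contigs be joined before the longer inserts are used to bridge the larger remaining gaps.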


4. Reference (During the development of this package, we published two papers on genome assembly algorithms)
  Zhenyu Li, Yanxiang Chen, Desheng Mu, et al. Bicheng Yang and Wei Fan. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Briefings in Functional Genomics, vol. 11, no. 1, 25-37 (2012).
  Wei Fan* & Ruiqiang Li. Test driving genome assemblers. Nature Biotechnology, vol. 30, no. 4 (2012).
