fanagislab/DBG_assembly: Developed for assembling genomes with second-generation sequencing data, especially Illumina short reads. It is comparable to SOAPdenovo, and approaches SOAPdenovo2 in some aspects.

1. Function
Genome assembly is one of the most challenging problems in bioinformatics, both because of the high complexity of the task and because of its high demand for computational resources. This package was developed to assemble large genomes from second-generation sequencing data, especially Illumina short reads. A typical DBG (de Bruijn graph) assembly consists of filtering the raw reads, correcting the sequencing errors in the reads, constructing contigs with the DBG, and linking the contigs into scaffolds using read-pair relations. This package follows that scheme, with each step implemented as a separate module. The underlying algorithm has many similarities to SOAPdenovo; its performance in assembled length and accuracy is comparable to SOAPdenovo and approaches SOAPdenovo2 in some aspects.

2. Installation
There is a Makefile in each module (clean_illumina, correct_error, DBG_contig, link_scaffold). To build a module, "cd" into its subdirectory and type "make"; repeat this for each module and all the executables will be generated.

3. Module and usage
The subdirectory "test/" contains test data of 40X simulated Illumina reads for the E. coli genome (4.6 Mb), together with all the commands used, stored in the file "work.sh" in each subdirectory. Below is a recommended workflow:

a. Remove low-quality bases and adapter contaminants from the raw reads with module clean_illumina

##require the resulting reads to have a maximum error rate of 0.01 and a minimum read length of 75
clean_lowqual -e 0.01 -r 75 ../00.raw_reads/Ecoli_readlen250_insert400_20X_250_400_1.fq.gz ./Ecoli_readlen250_insert400_20X_250_400_1.fq.gz.nonLowQual.gz ./Ecoli_readlen250_insert400_20X_250_400_1.fq.gz.nonLowQual.stat
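
The README does not describe how clean_lowqual evaluates the error rate, so the following is only a minimal Python sketch of one plausible interpretation: each Phred quality score Q is converted to an error probability 10^(-Q/10), and a read passes if it is long enough and its mean error probability does not exceed the limit. The Phred+33 encoding, the use of the mean, and the function names are all assumptions made for illustration; the real tool may trim bases rather than reject whole reads.

```python
def error_probabilities(quality, offset=33):
    """Convert a FASTQ quality string into per-base error probabilities.

    Assumes Phred+offset encoding (offset 33 is an assumption here).
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality]


def passes_filter(quality, max_error_rate=0.01, min_length=75):
    """Hypothetical filter: long enough and mean error rate low enough."""
    if not quality or len(quality) < min_length:
        return False
    probs = error_probabilities(quality)
    return sum(probs) / len(probs) <= max_error_rate
```

With this sketch, a 100-base read whose bases all have quality "I" (Q40) passes, while a read of the same length with quality "#" (Q2) throughout is rejected.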

##exclude the adapter sequences by pairwise alignment (non-gap dynamic programming) between reads and adapters
clean_adapter -a ../../clean_illumina/illumina_NEB_adapter.fa -r 75 -s 12 Ecoli_readlen250_insert400_20X_250_400_1.fq.gz.nonLowQual.gz Ecoli_readlen250_insert400_20X_250_400_1.fq.gz.nonLowQual.gz.nonAdapter.gz Ecoli_readlen250_insert400_20X_250_400_1.fq.gz.nonLowQual.gz.nonAdapter.stat
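
To illustrate what an alignment without gaps between a read and an adapter can look like, here is a small Python sketch. It slides the adapter along the read, counts mismatches in the overlapping region, and cuts the read at the first position where the adapter fits. This is not the code of clean_adapter: the parameters min_overlap and max_mismatch_rate are hypothetical and are not claimed to correspond to the -s option, and the example adapter sequence is for illustration only.

```python
def find_adapter(read, adapter, min_overlap=12, max_mismatch_rate=0.1):
    """Return the leftmost read position where the adapter aligns
    without gaps, or -1 if no position is good enough.

    min_overlap and max_mismatch_rate are hypothetical parameters.
    """
    for start in range(len(read) - min_overlap + 1):
        # The adapter may run past the end of the read, so only the
        # overlapping part is compared.
        overlap = min(len(adapter), len(read) - start)
        mismatches = sum(
            1 for i in range(overlap) if read[start + i] != adapter[i]
        )
        if mismatches <= max_mismatch_rate * overlap:
            return start
    return -1


def trim_adapter(read, adapter, min_length=75,
                 min_overlap=12, max_mismatch_rate=0.1):
    """Cut the read at the adapter; return None if it becomes too short."""
    pos = find_adapter(read, adapter, min_overlap, max_mismatch_rate)
    trimmed = read if pos < 0 else read[:pos]
    return trimmed if len(trimmed) >= min_length else None
```

For example, trim_adapter("ACGT" * 20 + "AGATCGGAAGAGC" + "TTTT", "AGATCGGAAGAGC") keeps only the first 80 bases of the read.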

b. Generate the k-mer frequency table and correct the base errors in the reads based on the k-mer frequency with module correct_error

##count k-mer frequency and generate the k-mer frequency table in compressed bit format
##clean_reads.lib is a list file that contains the paths of all clean read files, one path per line
kmerfreq -k 17 -m 1 -q 10 ./clean_reads.lib
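
As a rough picture of what a k-mer frequency table contains, the Python sketch below counts every k-mer in a set of reads. It is not the kmerfreq implementation, which writes a compressed bit format. Merging each k-mer with its reverse complement and skipping k-mers that contain N are assumptions made here for illustration.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(reads, k=17):
    """Count k-mers, merging each k-mer with its reverse complement.

    Strand merging and skipping of 'N' are assumptions of this sketch.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:
                continue
            # Use the lexicographically smaller strand as the key.
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts
```

In such a table, k-mers that come from sequencing errors tend to appear only once or twice, while genuine k-mers appear at roughly the sequencing depth, which is what makes frequency-based correction possible.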

##allow at most 2 bases to be modified in each read; the input clean_reads.lib.kmer.freq.cz is generated by kmerfreq
correct_error_reads -k 17 -c 2 ./clean_reads.lib.kmer.freq.cz ./clean_reads.lib
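
The idea of frequency-based correction can be sketched in Python as follows. A k-mer seen fewer than min_count times is treated as "weak", and the read is changed one base at a time, each time choosing the substitution that removes the most weak k-mers, up to max_changes substitutions. This greedy search, the min_count threshold and all function names are assumptions for illustration. The README only states that at most 2 bases per read may be modified, and correct_error_reads may work quite differently.

```python
from collections import Counter


def count_kmers(reads, k):
    """Plain single-strand k-mer counter used by this sketch."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmer_count(read, counts, k, min_count):
    """Number of k-mers in the read seen fewer than min_count times."""
    return sum(
        1 for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    )


def correct_read(read, counts, k=17, min_count=3, max_changes=2):
    """Greedy correction: at most max_changes single-base substitutions."""
    for _ in range(max_changes):
        best, best_weak = read, weak_kmer_count(read, counts, k, min_count)
        if best_weak == 0:
            break  # every k-mer is already well supported
        for pos in range(len(read)):
            for base in "ACGT":
                if base == read[pos]:
                    continue
                candidate = read[:pos] + base + read[pos + 1:]
                weak = weak_kmer_count(candidate, counts, k, min_count)
                if weak < best_weak:
                    best, best_weak = candidate, weak
        if best == read:
            break  # no substitution improves the read
        read = best
    return read
```

The sketch tries every position and base, which is far too slow for real data but keeps the logic easy to follow.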

c. Assemble the error-corrected reads into contig sequences with the DBG (de Bruijn graph) method using module DBG_contig

##construct the k-mer de Bruijn graph (K=31) and get the contig sequences; the input file corrected_reads.lib contains the paths of all corrected read files, one path per line
debruijn_contig -f 2 -k 31 -r 250 -t 10 -i 0.1 -M 125 -o Ecoli_corrected_reads ./corrected_reads.lib 2> Ecoli_corrected_reads.contig.log
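
For readers new to the method, the following Python sketch shows the core of contig construction on a de Bruijn graph: every k-mer becomes an edge between its (k-1)-mer prefix and suffix, and each maximal path without branches is reported as a contig. It is a toy version written under several assumptions, not the debruijn_contig algorithm: it handles one strand only, ignores k-mer frequencies, does not remove tips or bubbles, and skips isolated cycles.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            successors[kmer[:-1]].add(kmer[1:])
            predecessors[kmer[1:]].add(kmer[:-1])
    return successors, predecessors


def build_contigs(reads, k):
    """Return the sequences of all maximal non-branching paths."""
    succ, pred = build_graph(reads, k)
    nodes = set(succ) | set(pred)

    def is_simple(node):
        # Exactly one way in and one way out: the path continues.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for node in nodes:
        if is_simple(node):
            continue  # paths start only at branching or end nodes
        for nxt in succ.get(node, ()):
            path = node + nxt[-1]
            while is_simple(nxt):
                nxt = next(iter(succ[nxt]))
                path += nxt[-1]
            contigs.append(path)
    return sorted(contigs)
```

Two overlapping reads such as "ACGTTGCA" and "GTTGCATG" with k=4 merge into the single contig "ACGTTGCATG", whereas reads that disagree after a shared prefix produce a branch and therefore separate contigs.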

d. Map paired-end reads onto the contig sequences and link the contigs into scaffolds using pair relations with module link_scaffold

##use contigs longer than 125 bp and raw reads no shorter than 250 bp. Either raw reads or corrected reads can be used for scaffolding; here we used only the raw reads with 400-bp insert size, whose file paths are stored in raw_reads.lib.
map_pair -l 125 -r 250 -o ./maping_results/ ./Ecoli_corrected_reads.contig.seq.fa ./raw_reads.lib

##set insert size 400 bp, the input raw_reads.lib.map_pair.2ctg.lib is generated by map_pair
link_scaffold -i 400 -o Ecoli_corrected_reads.contig Ecoli_corrected_reads.contig.seq.fa ./raw_reads.lib.map_pair.2ctg.lib

##iterative scaffolding with longer insert size reads
##If you have data with multiple insert sizes, use them sequentially from the shortest to the longest insert size, taking the scaffold sequences produced by the previous run as the input contig sequences of the current run.
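
The ordering rule above can be written down as a short Python sketch. The function scaffold_once is a hypothetical callback standing in for one round of map_pair followed by link_scaffold. It is assumed to take the current sequence file, an insert size and a library list file, and to return the path of the resulting scaffold file. How the real output files are named is not specified here and is left to that callback.

```python
def iterative_scaffolding(contig_file, libraries, scaffold_once):
    """Run scaffolding rounds from the shortest to the longest insert size.

    libraries is a list of (insert_size, lib_file) tuples.
    scaffold_once is a hypothetical callback for one scaffolding round.
    """
    current = contig_file
    for insert_size, lib_file in sorted(libraries, key=lambda item: item[0]):
        # The output of the previous round is the input of this round.
        current = scaffold_once(current, insert_size, lib_file)
    return current
```

Sorting inside the function means the caller may list the libraries in any order.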

4. Reference (during the development of this package, we published two papers on genome assembly algorithms)
Zhenyu Li, Yanxiang Chen, Desheng Mu, et al. Bicheng Yang and Wei Fan. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Briefings in Functional Genomics, volume 11, number 1, 25-37 (2012).
Wei Fan* & Ruiqiang Li. Test driving genome assemblers. Nature Biotechnology, volume 30, number 4 (2012).
