Preprocessing

Assemble datasets

First, assemble your reads into contigs. You can use metaSPAdes, SGA or metaFlye for this step.

metaSPAdes

SPAdes is a short-read assembler based on the de Bruijn graph approach, and metaSPAdes is its dedicated metagenomic mode. Use metaSPAdes to assemble short reads into contigs. A sample command is given below.

spades.py --meta -1 Reads_1.fastq -2 Reads_2.fastq -o /path/output_folder -t 16

SGA

SGA (String Graph Assembler) is a short-read assembler based on the overlap-layout-consensus (more recently, string graph) approach. Use SGA to assemble short reads into contigs. Sample commands are given below; you may change the parameters to suit your datasets.

sga preprocess -o reads.fastq --pe-mode 1 Reads_1.fastq Reads_2.fastq
sga index -a ropebwt -t 16 --no-reverse reads.fastq
sga correct -k 41 --learn -t 16 -o reads.k41.fastq reads.fastq
sga index -a ropebwt -t 16 reads.k41.fastq
sga filter -x 2 -t 16 reads.k41.fastq
sga fm-merge -m 45 -t 16 reads.k41.filter.pass.fa
sga index -t 16 reads.k41.filter.pass.merged.fa
sga overlap -m 55 -t 16 reads.k41.filter.pass.merged.fa
sga assemble -m 95 reads.k41.filter.pass.merged.asqg.gz
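
The sample commands encode the k-mer size in the intermediate file names, so changing a parameter means editing several lines consistently. The following Python sketch rebuilds the same command sequence from a few parameters. The function name `sga_commands`, its arguments, and the `<prefix>.k<k>` naming are illustrative assumptions modelled on the sample commands above; they are not part of SGA itself.

```python
def sga_commands(read1, read2, prefix="reads", k=41, threads=16,
                 merge_overlap=45, min_overlap=55, assemble_overlap=95):
    """Return the SGA steps shown above as argv lists.

    Hypothetical helper: intermediate file names follow the pattern used in
    the sample commands (reads.k41.fastq, reads.k41.filter.pass.fa, ...).
    """
    t = str(threads)
    raw = f"{prefix}.fastq"
    corrected = f"{prefix}.k{k}.fastq"
    filtered = f"{prefix}.k{k}.filter.pass.fa"
    merged = f"{prefix}.k{k}.filter.pass.merged.fa"
    graph = f"{prefix}.k{k}.filter.pass.merged.asqg.gz"
    return [
        ["sga", "preprocess", "-o", raw, "--pe-mode", "1", read1, read2],
        ["sga", "index", "-a", "ropebwt", "-t", t, "--no-reverse", raw],
        ["sga", "correct", "-k", str(k), "--learn", "-t", t, "-o", corrected, raw],
        ["sga", "index", "-a", "ropebwt", "-t", t, corrected],
        ["sga", "filter", "-x", "2", "-t", t, corrected],
        ["sga", "fm-merge", "-m", str(merge_overlap), "-t", t, filtered],
        ["sga", "index", "-t", t, merged],
        ["sga", "overlap", "-m", str(min_overlap), "-t", t, merged],
        ["sga", "assemble", "-m", str(assemble_overlap), graph],
    ]
```

Each returned list can be passed in order to `subprocess.run(cmd, check=True)`, provided `sga` is on your PATH.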

metaFlye

Flye is a long-read assembler based on the repeat graph approach, and metaFlye is its dedicated metagenomic mode. Use metaFlye to assemble long reads into contigs. A sample command is given below.

flye --meta --pacbio-raw reads.fasta --genome-size estimated_metagenome_size --out-dir /path/output_folder --threads 16

Bin contigs

Next, bin the resulting contigs using an existing contig-binning tool. For our experiments we used MaxBin (version 2.2.5; the first command below) and SolidBin with the --use_sfs option (the remaining three commands).

perl MaxBin-2.2.5/run_MaxBin.pl -contig contigs.fasta -abund abundance.abund -thread 8 -out /path/output_folder
python scripts/gen_kmer.py /path/to/data/contig.fasta 1000 4 
sh gen_cov.sh 
python SolidBin.py --contig_file /path/to/contigs.fasta --composition_profiles /path/to/kmer_4.csv --coverage_profiles /path/to/cov_inputtableR.tsv --output /output/result.tsv --log /output/log.txt --use_sfs

Prepare binning results

The binning output file should contain delimiter-separated (e.g., comma-separated) values of the form contig_identifier,bin_number, with one line per contig. Its contents should look similar to the examples given below. Contigs are named according to their original identifiers, and bin numbering starts from 1. You can use the prepResult command to format an initial binning result into the .csv format with contig identifiers and bin IDs. Further details can be found here and on the next page.
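
Many binning tools write one FASTA file per bin rather than a contig-to-bin table. Purely as an illustration of the target format, the following Python sketch builds such a table from a folder of per-bin FASTA files. It is not the prepResult implementation; the function names, the `*.fasta` pattern, and the rule of numbering bins in sorted file-name order are assumptions made for this example.

```python
import csv
from pathlib import Path


def fasta_ids(path):
    """Yield the identifier (first word after '>') of each FASTA record."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # skip malformed headers with no identifier
                    yield fields[0]


def bins_to_csv(bin_dir, out_csv, pattern="*.fasta"):
    """Write 'contig_identifier,bin_number' rows for every per-bin FASTA file.

    Assumption: bins are numbered from 1 in sorted file-name order.
    Returns the number of bin files that were read.
    """
    bin_files = sorted(Path(bin_dir).glob(pattern))
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out, lineterminator="\n")
        for bin_number, bin_file in enumerate(bin_files, start=1):
            for contig_id in fasta_ids(bin_file):
                writer.writerow([contig_id, bin_number])
    return len(bin_files)
```

For example, `bins_to_csv("/path/output_folder", "initial_binning.csv")` would scan a hypothetical output folder and write one row per contig.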

Example binned inputs

Example metaSPAdes binned input

NODE_1_length_507141_cov_16.465306,1
NODE_2_length_487410_cov_94.354557,1
NODE_3_length_483145_cov_59.410818,1
NODE_4_length_468490_cov_20.967912,2
NODE_5_length_459607_cov_59.128379,2
...
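
metaSPAdes contig identifiers encode the node number, the contig length, and the k-mer coverage. If you want to inspect these values for the contigs in a bin, a minimal Python sketch such as the following can parse them. `parse_spades_name` is a hypothetical helper, and the regular expression assumes the `NODE_<n>_length_<len>_cov_<cov>` pattern seen in the example above.

```python
import re

# Assumed pattern, taken from the example identifiers above.
NODE_NAME = re.compile(r"^NODE_(\d+)_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_spades_name(name):
    """Return (node_number, length, coverage) parsed from a contig identifier."""
    match = NODE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a metaSPAdes-style contig name: {name!r}")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))
```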

Example SGA binned input

contig-0,1
contig-1,2
contig-2,1
contig-3,1
contig-4,2
...

Example Flye binned input

edge_1,1
edge_2,2
edge_3,1
edge_4,1
edge_5,2
...
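
The three examples above share the same two-column layout, so a prepared file can be checked before it is passed on. The following Python sketch reports rows that do not match that layout. `check_binning_file` is a hypothetical helper, and treating a contig that appears more than once as a problem is an assumption based on the one-line-per-contig description.

```python
import csv


def check_binning_file(path, delimiter=","):
    """Return a list of problems found in a 'contig_identifier,bin_number' file."""
    problems = []
    seen = set()
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # ignore blank lines
            if len(row) != 2:
                problems.append(f"line {line_no}: expected 2 fields, found {len(row)}")
                continue
            contig_id, bin_field = row[0].strip(), row[1].strip()
            if not contig_id:
                problems.append(f"line {line_no}: empty contig identifier")
            elif contig_id in seen:
                # Assumption: each contig is assigned to exactly one bin.
                problems.append(f"line {line_no}: contig {contig_id} listed more than once")
            seen.add(contig_id)
            try:
                valid_bin = int(bin_field) >= 1  # bin numbering starts from 1
            except ValueError:
                valid_bin = False
            if not valid_bin:
                problems.append(
                    f"line {line_no}: bin number must be an integer >= 1, found {bin_field!r}"
                )
    return problems
```

An empty list means every row has a contig identifier and a bin number of at least 1.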

Before using Flye assemblies for binning

Before using Flye assemblies for binning, use the gfa2fasta command to extract the edge sequences. Further details can be found here and on the next page.
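
To illustrate what extracting edge sequences involves, the following Python sketch reads the segment (`S`) lines of a GFA file and writes them as FASTA records. It is not the gfa2fasta implementation; the function names are hypothetical, and the code assumes GFA 1 formatting, where each segment line holds the name in the second tab-separated field and the sequence in the third.

```python
def gfa_segments(gfa_path):
    """Yield (name, sequence) for every segment (S) line of a GFA 1 file."""
    with open(gfa_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip non-segment lines and segments whose sequence is omitted ('*').
            if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
                yield fields[1], fields[2]


def write_edge_fasta(gfa_path, fasta_path, width=60):
    """Write each segment as a FASTA record and return the number written."""
    count = 0
    with open(fasta_path, "w") as out:
        for name, sequence in gfa_segments(gfa_path):
            out.write(f">{name}\n")
            # Wrap the sequence at a fixed line width.
            for start in range(0, len(sequence), width):
                out.write(sequence[start:start + width] + "\n")
            count += 1
    return count
```

The resulting identifiers match the segment names in the graph, which is what lets a binning result such as `edge_1,1` be related back to the assembly graph.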