hangxue-wustl/scalepopgen

 
 

Introduction

scalepopgen is a fully automated Nextflow-based pipeline that takes VCF or PLINK binary files as input and applies a variety of open-source tools to carry out comprehensive population genomic analyses. In addition, Python and R scripts combine and plot the results of the analyses, which gives the user an immediate impression of the genomic patterns in the analyzed samples.

Broadly, the pipeline consists of the following four “sub-workflows”:

  • filtering and basic statistics
  • explore genetic structure
  • phylogeny using treemix
  • signatures of selection

The sub-workflows can be used separately or in combination with each other.

The pipeline runs on any Linux operating system and requires three dependencies: Nextflow, Java, and a software container or environment system such as conda, mamba, singularity or docker. Of these systems, we highly recommend mamba. The pipeline can be run on a local Linux system as well as on high-performance computing (HPC) clusters. Note that the user needs to install only the three dependencies listed above; Nextflow automatically downloads the remaining tools required for the analyses.
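
Before the first run it can help to confirm that the dependencies are actually on the PATH. The following is a minimal sketch, not part of scalepopgen; the function name missing_tools is hypothetical, and the list of tools to check depends on which container or environment system you chose.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with scalepopgen):
# print every command from the argument list that is not found on PATH.
# Returns 0 when all commands are available, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example call, assuming mamba is the chosen environment system
# (swap in conda, singularity or docker as needed):
# missing_tools nextflow java mamba || echo "install the tools listed above first"
```

The check only tests that the commands exist; it does not verify versions.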

Usage

::: note If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data. :::

After successfully installing Nextflow, Java and one of the container or environment systems, download scalepopgen:

git clone https://github.com/Popgen48/scalepopgen.git

INPUT FILES

All VCF files need to be split by chromosome and indexed with tabix. The VCF inputs should be listed in a comma-separated input sheet with the extension ".csv" and a header row exactly like the one in the example below. Please note that chromosome names must not contain any punctuation marks.

vcf_input.csv:

chrom,vcf,vcf_idx
chr1,chrom1.vcf.gz,chrom1.vcf.gz.tbi
chr2,chrom2.vcf.gz,chrom2.vcf.gz.tbi
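
The rules above (exact header, one row per chromosome, no punctuation in chromosome names) can be checked before launching the pipeline. This is a sketch under stated assumptions: check_vcf_sheet is a hypothetical helper, not part of scalepopgen, and it interprets "no punctuation" as "letters and digits only", which may be stricter than the pipeline's own check.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with scalepopgen):
# check a VCF input sheet for the exact header, three fields per row,
# and chromosome names made of letters and digits only (an assumed
# reading of the "no punctuation marks" rule).
# Prints one message per problem; exit status is 0 only if none were found.
check_vcf_sheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "chrom,vcf,vcf_idx") {
                print "line 1: unexpected header: " $0
                bad = 1
            }
            next
        }
        NF != 3 {
            print "line " NR ": expected 3 fields, found " NF
            bad = 1
            next
        }
        $1 !~ /^[A-Za-z0-9]+$/ {
            print "line " NR ": chromosome name must be alphanumeric: " $1
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example call:
# check_vcf_sheet vcf_input.csv && echo "sheet looks fine"
```

The sketch does not test whether the listed VCF and index files exist or are valid.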

In addition to the VCF input sheet, a sample map file of individuals and populations is required. The sample map has two tab-delimited columns and no header line: the first column contains individual IDs and the second contains population IDs, as shown in the example below. The file name must end with ".map".

sample.map:

ind1  pop1
ind2  pop1
ind3  pop2
ind4  pop2
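
A quick sanity check of the sample map can catch space-separated columns or a wrong file extension early. The sketch below is illustrative only; check_sample_map and pop_sizes are hypothetical helper names, not part of scalepopgen.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with scalepopgen):
# verify that the file name ends with ".map" and that every line has
# exactly two non-empty, tab-delimited columns.
check_sample_map() {
    case "$1" in
        *.map) ;;
        *) echo "file name must end with .map: $1"; return 1 ;;
    esac
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 == "" {
            print "line " NR ": expected <individual><TAB><population>"
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Hypothetical helper: print the number of individuals per population
# as "<population><TAB><count>", sorted by population ID.
pop_sizes() {
    awk -F'\t' '{ n[$2]++ } END { for (p in n) print p "\t" n[p] }' "$1" | sort
}

# Example calls:
# check_sample_map sample.map && pop_sizes sample.map
```

Reviewing the per-population counts is a simple way to spot misspelled population IDs before running the analyses.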

Similarly, PLINK binary files need to be specified in a comma-separated input sheet with a header row, but with the extension ".p.csv".

plink_input.p.csv:

prefix,bed,bim,fam
popgen,popgen.bed,popgen.bim,popgen.fam
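
The PLINK sheet can also be written by a small script that first confirms the three files of the fileset are present. This is a minimal sketch: write_plink_sheet is a hypothetical helper, not part of scalepopgen, and taking the prefix column from the base name of the file path is an assumption based on the example above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with scalepopgen):
# write a PLINK input sheet for one dataset to standard output.
# Fails if any of <prefix>.bed, <prefix>.bim or <prefix>.fam is missing.
write_plink_sheet() {
    path_prefix=$1
    for ext in bed bim fam; do
        if [ ! -f "$path_prefix.$ext" ]; then
            echo "missing file: $path_prefix.$ext" >&2
            return 1
        fi
    done
    # Assumption: the "prefix" column is the file name without directory.
    name=$(basename "$path_prefix")
    printf 'prefix,bed,bim,fam\n'
    printf '%s,%s.bed,%s.bim,%s.fam\n' \
        "$name" "$path_prefix" "$path_prefix" "$path_prefix"
}

# Example call; the output name uses the required ".p.csv" extension:
# write_plink_sheet popgen > plink_input.p.csv
```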

The workflow implements many programs and tools, and therefore has many parameters that need to be set and provided in a YAML file. To make this easier, we developed a command-line interface (CLI) that helps to specify the options for each sub-workflow. We highly recommend using the CLI to create the parameter file, as it guides the user through the various options and at the same time checks the input formats.

The CLI can be downloaded and installed with the following commands:

git clone https://github.com/Popgen48/scalepopgen-cli.git
cd scalepopgen-cli/
# Optional: update pip first
# pip install --upgrade pip
pip3 install --no-cache-dir -r requirements.txt --user

Start the CLI with:

python scalepopgen_cli.py

Navigate through different sub-workflows and their options.

Once you have selected and specified the parameters for the analyses you want to perform, save them to a YAML file and pass its path to the -params-file option.

Now you can run scalepopgen:

nextflow run scalepopgen/ \
   -profile <docker/singularity/conda/mamba> \
   -params-file <path/to/parameters.yml> \
   -qs <maximum number of processes>

Note that the CLI also generates a separate folder with the prefix citation_; this folder contains the relevant references in BibTeX format. These references should be cited in the manuscript.

To test the functionality after git clone, run the following command, which uses a small dataset:

nextflow run scalepopgen/ \
   -profile test,<docker/singularity/conda/mamba> \
   -qs 10

To reproduce the results discussed in the paper, run:

nextflow run scalepopgen/ \
   -profile test_full,<docker/singularity/conda/mamba> \
   -qs 10

::: warning Custom config files, including those provided by the Nextflow option -c, can be used to provide any other configuration, except for the parameters; see docs. :::

Additional Remarks

Detailed documentation of the tools is being updated here.

Credits

scalepopgen was mainly written by @BioInf2305 with contributions from @NPogo.

Many thanks to the nf-core community for their assistance and help in the development of this pipeline.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

A list of references for the additional tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
