GitHub - Peirong777/PyBSASeq: A novel algorithm for BSA-Seq data analysis

Note:

It is strongly recommended to have fisher installed on your system; it is fast when dealing with large datasets.
If you use PyBSASeq in your publications, please cite: Zhang, J., Panthee, D.R. PyBSASeq: a simple and effective algorithm for bulked segregant analysis with whole-genome sequencing data. BMC Bioinformatics 21, 99 (2020). https://doi.org/10.1186/s12859-020-3435-8

PyBSASeq

A novel algorithm for BSA-Seq data analysis

Python 3.6 or above is required to run the script.

Usage

$ python PyBSASeq.py -i input -o output -p popstrct -b fbsize,sbsize

Here are the details of the options used in the script:

input – the name of the input file (the GATK4-generated tsv file)
output – the name of the output file
popstrct – population structure; three choices available: F2 for an F2 population, RIL for a population of recombinant inbred lines, or BC for a backcross population
fbsize – the number of individuals in the first bulk
sbsize – the number of individuals in the second bulk
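
As an illustration, the same command can be assembled and launched from Python. This is only a sketch: build_command is a hypothetical helper, and the file names and bulk sizes are placeholders rather than values from the repository.

```python
import sys


def build_command(input_file, output_file, popstrct, fbsize, sbsize,
                  script="PyBSASeq.py"):
    """Assemble the argument list for the usage line shown above."""
    if popstrct not in ("F2", "RIL", "BC"):
        raise ValueError("popstrct must be F2, RIL, or BC")
    return [
        sys.executable, script,
        "-i", input_file,
        "-o", output_file,
        "-p", popstrct,
        "-b", f"{fbsize},{sbsize}",  # first bulk size, second bulk size
    ]


# Placeholder values for illustration only.
cmd = build_command("my_snps.tsv", "my_results.csv", "F2", 50, 50)
print(" ".join(cmd[1:]))
# To actually run the script: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a file name contains spaces.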

The default cutoff p-value for identifying significant SNPs (sSNPs) from the SNP dataset is 0.01 (alpha), and the default cutoff p-value for identifying sSNPs from the simulated dataset is 0.1 (smalpha). These values can be changed using the following option:

-v alpha,smalpha

alpha and smalpha should each be in the range of 0.0 – 1.0, and the chosen values should make statistical sense. The greater the smalpha value, the higher the threshold and the lower the false positive rate.
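
A small sketch of how a value in the alpha,smalpha form could be parsed and range-checked is given here. parse_cutoffs is a hypothetical helper written for this illustration; it is not taken from PyBSASeq.py, and the script may validate its input differently.

```python
def parse_cutoffs(value, defaults=(0.01, 0.1)):
    """Parse an 'alpha,smalpha' string; fall back to the documented defaults."""
    if value is None:
        return defaults
    # Unpacking fails with ValueError unless exactly two values are given.
    alpha, smalpha = (float(x) for x in value.split(","))
    for name, p in (("alpha", alpha), ("smalpha", smalpha)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {p}")
    return alpha, smalpha


print(parse_cutoffs(None))
print(parse_cutoffs("0.05,0.1"))
```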

The default size of the sliding window is 2000000 (base pairs) and the incremental step is 10000 (base pairs), and their values can be changed using the following option:

-s slidingWindowSize,incrementalStep
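
A rough sketch of how such a window could slide along one chromosome follows. It assumes that the per-window statistic is the fraction of SNPs flagged as sSNPs and that windows start at position 0 and advance by the step until the chromosome end; both are assumptions made for illustration. sliding_windows is a hypothetical helper, not a function from PyBSASeq.py, and the toy window and step sizes only keep the example short.

```python
from bisect import bisect_left


def sliding_windows(positions, is_ssnp, chrom_len,
                    size=2_000_000, step=10_000):
    """Yield (start, total_snps, ssnp_count, ratio) for each window.

    positions must be sorted ascending; is_ssnp is a parallel list of bools.
    Each window covers the half-open interval [start, start + size).
    """
    # Prefix sums let us count sSNPs in any index range in O(1).
    prefix = [0]
    for flag in is_ssnp:
        prefix.append(prefix[-1] + int(flag))

    start = 0
    while start < chrom_len:
        lo = bisect_left(positions, start)
        hi = bisect_left(positions, start + size)
        total = hi - lo
        ssnp = prefix[hi] - prefix[lo]
        ratio = ssnp / total if total else None  # None for an empty window
        yield start, total, ssnp, ratio
        start += step


# Toy data: five SNPs on a 30 bp "chromosome", window 10 bp, step 5 bp.
positions = [1, 5, 12, 18, 25]
flags = [True, False, True, True, False]
windows = list(sliding_windows(positions, flags, chrom_len=30, size=10, step=5))
for w in windows:
    print(w)
```

With the defaults of 2000000 and 10000, consecutive windows overlap almost entirely, which is why the SNP counts are taken from a sorted position list instead of being recounted for every window.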

Workflow

SNP filtering
Perform Fisher's exact test using the AD values of each SNP in both bulks. A SNP is identified as an sSNP if its p-value is less than alpha. In the meantime, simulated REF/ALT reads of each SNP are obtained via simulation under the null hypothesis, and Fisher's exact test is also performed using these simulated AD values; a simulated SNP is counted as an sSNP if its p-value is less than smalpha. Identification of sSNPs from the simulated dataset is for threshold calculation. A file named "COMPLETE.txt" will be written to the working directory if Fisher's exact test is successful, and the results of Fisher's exact test are saved in a .csv file. The "COMPLETE.txt" file needs to be deleted if starting over is desired.
Threshold calculation. The result is saved in the "threshold.txt" file. The "threshold.txt" file needs to be deleted if starting over is desired (e.g., if the size of the sliding window is changed).
Plotting.
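
The Fisher's exact test and null-simulation steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code in PyBSASeq.py: the two-sided test is written out with math.comb (Python 3.8 or later) instead of the fisher package, the null model assumes an F2 population in which each plant carries 0, 1, or 2 ALT alleles with probability 1/4, 1/2, 1/4, and the AD values and bulk size are made-up toy numbers.

```python
import random
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table with the same margins.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    # Sum the tables that are no more likely than the observed one.
    return min(1.0, sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7)))


def simulate_null_ad(depth, bulk_size, rng):
    """Simulate (REF, ALT) read counts for one F2 bulk under the null (assumed model)."""
    alt_alleles = sum(rng.choices((0, 1, 2), weights=(1, 2, 1), k=bulk_size))
    freq = alt_alleles / (2 * bulk_size)  # ALT allele frequency in the bulk
    alt_reads = sum(rng.random() < freq for _ in range(depth))
    return depth - alt_reads, alt_reads


rng = random.Random(0)
alpha, smalpha = 0.01, 0.1               # default cutoffs from the text
ad_bulk1, ad_bulk2 = (35, 5), (8, 30)    # toy (REF, ALT) read counts

p = fisher_exact_two_sided(*ad_bulk1, *ad_bulk2)
is_ssnp = p < alpha

# Same read depths, but reads drawn under the null hypothesis.
sm1 = simulate_null_ad(sum(ad_bulk1), 50, rng)
sm2 = simulate_null_ad(sum(ad_bulk2), 50, rng)
sm_p = fisher_exact_two_sided(*sm1, *sm2)
sm_is_ssnp = sm_p < smalpha

print(p, is_ssnp, sm_p, sm_is_ssnp)
```

Repeating the simulated half over many SNPs gives the proportion of sSNPs expected by chance, which is the kind of quantity a threshold can be derived from.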

Dataset

The file snp_final.tsv.bz2 contains the GATK4-generated SNP dataset, produced from the sequencing data of Yang et al. The sequence reads were treated as either single-end or paired-end when aligned to the reference genome. Significantly more SNPs were identified by the latter approach; however, the results of BSA-Seq analysis were very similar. Only the tsv file generated by the former approach is included here because of the 25 MB file size limitation.

A small test file is included here for testing purposes; to test the Python script, just issue the command python PyBSASeq.py in a terminal.
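
To take a quick look at a dataset before running the analysis, a tsv file can be previewed without decompressing it first. The sketch below uses only the standard library; preview_tsv is a hypothetical helper, and no particular column layout is assumed since it depends on how the GATK4 table was generated.

```python
import bz2
import csv
from itertools import islice


def preview_tsv(path, n_rows=5):
    """Return the header and the first n_rows data rows of a tsv file.

    Files ending in .bz2 are read through bz2.open; others are read as plain text.
    """
    opener = bz2.open if str(path).endswith(".bz2") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = list(islice(reader, n_rows))
    return header, rows


# Example call (requires the file to be present in the working directory):
# header, rows = preview_tsv("snp_final.tsv.bz2")
```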

Other methods for BSA-Seq data analysis

BSA-Seq data analysis can also be done using either the SNP index (allele frequency) method or the G-statistic method. I implemented both methods in Python for the purpose of comparison: PySNPIndex and PyGStatistic. The Python implementation of the original G-statistic method by Magwene can be found here (just found this site today, 6/27/2019), and the R implementation of both methods by Mansfeld can be found here.
