Skip to content

A novel algorithm with high detection power for BSA-Seq data analysis - the significant structural variant method

License

Notifications You must be signed in to change notification settings

dblhlx/PyBSASeq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Note: It is strongly recommended to have fisher installed on your system. It is fast in dealing with large datasets.

Citation: If you used PyBSASeq in your pulications, please cite:

  • Zhang, J., Panthee, D.R. PyBSASeq: a simple and effective algorithm for bulked segregant analysis with whole-genome sequencing data. BMC Bioinformatics 21, 99 (2020). https://doi.org/10.1186/s12859-020-3435-8
  • Zhang, J., Panthee, D.R. Next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes. G3 Genes|Genomes|Genetics, jkab400 (2021), https://doi.org/10.1093/g3journal/jkab400

PyBSASeq

A novel algorithm with high detection power for BSA-Seq data analysis - the significant structural variant method

Python 3.6 or above is required to run the script.

Usage

$ python PyBSASeq.py -i input -o output -p popstrct -b fbsize,sbsize

Here are the details of the options used in the script:

  • input – the names of the input files (the GATK4-generated tsv/csv file). If you have variant calling data from both the parents and the bulks, the format is as follows: parents.tsv,bulks.tsv. If you have only the variant calling data of the bulks, the format is as follows: bulks.tsv. The script PyBSASeq.py can handle both situations now. The script and the input files should be in the same folder.
  • output – the name of the output file. Default is BSASeq.csv
  • popstrct – population structure; three choices available: F2 for an F2 population, RIL for a population of recombinant inbred lines, or BC for a backcross population
  • fbsize – the number of individuals in the first bulk
  • sbsize – the number of individuals in the second bulk

The default cutoff p-value for identifying significant structural variants (sSV) from the SV dataset is 0.01 (alpha), and the default cutoff p-value for identifying sSVs from the simulated dataset is 0.01 (smalpha). These values can be changed using the following options:

-v alpha,smalpha

alpha and smalpha should be in the range of 0.0 – 1.0, the chosen value should make statistical sense. The greater the smalpha value, the higher the threshold and the lower the false positive rate.

The default size of the sliding window is 2000000 (base pairs) and the incremental step is 10000 (base pairs), and their values can be changed using the following option:

-s slidingWindowSize,incrementalStep

Four files (sliding_windows.csv, sv_fagz.csv, sv_fagz_fep.csv, and threshold.txt) and a folder in the date_time format containing BSASeq.csv and BSASeq.pdf will be generated in the ./Results folder after succesfully running the script. If gaps between subplots in PyBSASeq.pdf are too wide or too narrow, we can rerun the script to fine-tune the gaps by changing the values of a and/or b using the options below:

-a True -g a,b,c,d,e,f

  • a - the horizontal gap between subplots
  • b - the vertical gap between subplots
  • c, d, e, and f - the top, bottom, left, and right margins of the plot, respectively

The default values for a, b, c, d, e, and f are 0.028, 0.056, 0.0264, 0.054, 0.076, 0.002, 0.002, respectively. The script uses the sliding_windows.csv and threshold.txt files generated previously for plotting, and it is very fast.

If two or more peaks/valleys and all the values in between are beyond the confidence intervals/thresholds, only the highest peak or the lowerest valley will be identified as the peak/valley of this region. We can rerun the script to identify the positions of the other peaks/valleys and test their significance using the option below:

-e a1,b1,c1,a2,b2,c2,......,an,bn,cn

  • a - chromosome id
  • b - the start point of a chromosomal fragment
  • c - the end point of a chromosomal fragment

Right now, this option will not work if the chromosome IDs in the reference genome sequences are not digits, with the exception of sex chromosomes; we can use 1000 - 1005 to respectively represent sex chromosomes X, Y, Z, W, U, and V when specify regions on these chromosomes. Multiple regions on the same chromsome can be selected.

Workflow

  1. Structural variant (SV) filtering
  2. Perform Fisher's exact test using the AD values of each SV in both bulks. A SV would be identified as an sSV if its p-value is less than alpha. In the meantime, simulated REF/ALT reads of each SV is obtained via simulation under null hypothesis, and Fisher's exact test is also performed using these simulated AD values; for each SV, it would be an sSV if its p-value is less than smalpha. Identification of sSVs from the simulated dataset is for threshold calculation.
  3. Threshold calculation. The result is saved in the threshold.txt file. The threshold.txt file needs to be deleted if starting over is desired (e.g, if the size of the sliding window is changed).
  4. Plotting.

Test datasets

Two SV datasets for testing purpose, one from rice while the other from maize, are stored in the Data folder. The rice .csv files were generated via GATK4 using the sequencing data from the work of Lahari et al while the maize .csv files were extracted from .csv/.tsv files generated by Zheng et al.

Interpret the results

In the file PyBSASeq.pdf, peaks/valleys in panels B, C, and D that are beyond their genome-wide thresholds/confidence intervals very likely represent genomic regions controlling the trait of interest. The genomic region-specific thresholds/confidence intervals of these peaks/valleys are calculated to verrify their significance status. The results are presented in the file BSASeq.csv. A peak/valley is significant if its significance_status is equal to 1, not significant otherwise.

In BSASeq.csv, each row indicates a sliding window that contains a potentially significant peak/valley. The meanning of each column is shown below.

  • firstbulk_id.AvgLD: the average sequencing depth of the first bulk
  • secondbulk_id.AvgLD: the average sequencing depth of the second bulk
  • sSV: the number of the significant SVs in the sliding window
  • totalSV: the number of the total SVs in the sliding window
  • sSV/totalSV: the sSV/totalSV ratio in the sliding window
  • Threshold_sSV: the threshold of the sSV/totalSV ratio of the sliding window
  • GS: the G-statistic value of the sliding window
  • Threshold_GS: the threshold of the G-statistic value of the sliding window
  • DAF: the allele frequency difference of the sliding window
  • DAF_CI_LB: the DAF confidence interval lower bound of the sliding window
  • DAF_CI_UB: the DAF confidence interval upper bound of the sliding window
  • Threshold_DAF: the DAF threshold of the sliding window when parental genome sequences are not avaiable (or not used)
  • pvalue_tt: the t-test p-value of the sliding window
  • Significance_SSV: the significant status of the sliding window when using the significant SV method
  • Significance_GS: the significant status of the sliding window when using the G-statistic method
  • Significance_AF: the significant status of the sliding window when using the allele frequency method
  • Significance_TT: the significant status of the sliding window when using the t-test method

Bug report

Please report here if encounter any issue. I can receive issue notifications now.

Other methods for BSA-Seq data analysis

BSA-Seq data analysis can be done using either the SNP index (allele frequency) method or the G-statistic method as well. I implemented both methods in Python for the purpose of comparison: PySNPIndex and PyGStatistic. The Python implementation of the original G-statistic method by Magwene can be found here (just found this site today, 6/27/2019), and the R implementation of both methods by Mansfeld can be found here.

About

A novel algorithm with high detection power for BSA-Seq data analysis - the significant structural variant method

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages