FASTX Sampler: Randomly Subsampling Sequence Files

Sequence datasets in genomics and metagenomics can get dauntingly large, requiring copious time and compute resources to analyze. Many sequencers can produce tens of millions of reads per sample, and many experiments involve tens to hundreds of samples, each with multiple technical replicates, resulting in gigabytes to terabytes of data. Reducing the size of the input files by randomly subsampling sequences allows you to explore data more quickly. In this chapter, I’ll show how to use Python’s random module to select some portion of the reads from FASTA/FASTQ sequence files.

You will learn about:

  • Non-deterministic sampling

Write a Python program called sampler.py that will probabilistically sample one or more input FASTA files into an output directory. The inputs for this program will be generated by your synth.py program. You can run make fasta to create files of 1K, 10K, and 100K reads in this directory. You can then use these files for testing your program:

$ ./sampler.py -m 2 tests/inputs/n1k.fa
  1: n1k.fa
Wrote 2 sequences from 1 file to directory "out".
$ cat out/n1k.fa
>34
AACATCAGGTATGGTCATCAGTTTTAGGATTTGAAGTAATTCTTCGCGAATCTTCGATCT
CTATAGGATCAGGAATTATACTTAACTTTATACTATAAGTGAAATAAACTCACTATGAAA
TTGGTAGTGGAACAGCAGAAGTTCAGATGATTTATCAGAAAAGTAATAGTGAGTAATCCT
TTAGATTTA
>40
TAGATTGCATCAGGGATTCAGGGCTGACCTTGTTGCACAGCATAAACAACTGATACACAC
AGACTATCTACTATACCATAAACATCTTGCTACTACAATTTCAGGTTCCTATGGATTTAA
TTGGCGCTTTATTTATCTGA
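
The selection step might be sketched as follows. This is a minimal illustration and not the chapter's solution: the names `parse_fasta` and `sample` are hypothetical, the parser is a bare-bones stand-in for a real FASTA reader, and treating a maximum of 0 as "no limit" is an assumption based on the default shown in the usage. Each record is kept when a random draw falls below the requested fraction, and a seed makes the selection reproducible.

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def sample(records: Iterable[Record],
           percent: float = 0.1,
           max_reads: int = 0,
           seed: Optional[int] = None) -> Iterator[Record]:
    """Keep each record with probability `percent`.

    Assumption: max_reads == 0 means "no limit".
    """
    rng = random.Random(seed)  # private generator, so the seed is local
    taken = 0
    for rec in records:
        # rng.random() is in [0, 1), so percent=1.0 keeps every record
        if rng.random() < percent:
            yield rec
            taken += 1
            if max_reads and taken == max_reads:
                break
```

Because every read is an independent draw, the number of sequences written varies from run to run unless a seed is given, and a 10% sample of 1K reads will rarely contain exactly 100 reads.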

Here is the usage your program should create for -h or --help:

$ ./sampler.py -h
usage: sampler.py [-h] [-f format] [-p reads] [-m max] [-s seed] [-o DIR]
                  FILE [FILE ...]

Probabilistically subset FASTA files

positional arguments:
  FILE                  Input FASTA/Q file(s)

optional arguments:
  -h, --help            show this help message and exit
  -f format, --format format
                        Input file format (default: fasta)
  -p reads, --percent reads
                        Percent of reads (default: 0.1)
  -m max, --max max     Maximum number of reads (default: 0)
  -s seed, --seed seed  Random seed value (default: None)
  -o DIR, --outdir DIR  Output directory (default: out)
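
An interface like the one above could be declared with `argparse` roughly as follows. This is a sketch under stated assumptions rather than the reference solution: `build_parser` and `get_args` are hypothetical names, the input files are taken as plain strings instead of being opened, the format choices are assumed to be `fasta` and `fastq`, and the accepted range for `--percent` (greater than 0, at most 1) is a guess at what the test suite expects.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line options shown in the usage."""
    parser = argparse.ArgumentParser(
        prog='sampler.py',
        description='Probabilistically subset FASTA files',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    # Assumption: files are kept as names here, not opened by argparse
    parser.add_argument('file', metavar='FILE', nargs='+',
                        help='Input FASTA/Q file(s)')
    # Assumption: only these two formats are accepted
    parser.add_argument('-f', '--format', metavar='format',
                        default='fasta', choices=['fasta', 'fastq'],
                        help='Input file format')
    parser.add_argument('-p', '--percent', metavar='reads', type=float,
                        default=0.1, help='Percent of reads')
    parser.add_argument('-m', '--max', metavar='max', type=int,
                        default=0, help='Maximum number of reads')
    parser.add_argument('-s', '--seed', metavar='seed', type=int,
                        default=None, help='Random seed value')
    parser.add_argument('-o', '--outdir', metavar='DIR',
                        default='out', help='Output directory')
    return parser


def get_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse and validate arguments (argv=None reads sys.argv)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: the "percent" is a fraction in the range (0, 1]
    if not 0 < args.percent <= 1:
        parser.error(f'--percent "{args.percent}" must be between 0 and 1')
    return args
```

Calling `parser.error` prints the short usage and exits with a nonzero status, which is the behavior a test for a bad value would typically check. Setting `type=int` on `--seed` likewise makes `argparse` reject a non-integer seed on its own.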

A passing test suite looks like this:

$ make test
python3 -m pytest -xv --disable-pytest-warnings --flake8 --pylint 
--pylint-rcfile=../pylintrc --mypy sampler.py tests/sampler_test.py
============================= test session starts ==============================
...
collected 15 items

sampler.py::FLAKE8 PASSED                                                [  6%]
sampler.py::mypy PASSED                                                  [ 12%]
tests/sampler_test.py::FLAKE8 SKIPPED                                    [ 18%]
tests/sampler_test.py::mypy PASSED                                       [ 25%]
tests/sampler_test.py::test_exists PASSED                                [ 31%]
tests/sampler_test.py::test_usage PASSED                                 [ 37%]
tests/sampler_test.py::test_bad_file PASSED                              [ 43%]
tests/sampler_test.py::test_bad_pct PASSED                               [ 50%]
tests/sampler_test.py::test_bad_seed PASSED                              [ 56%]
tests/sampler_test.py::test_bad_format PASSED                            [ 62%]
tests/sampler_test.py::test_defaults_one_file PASSED                     [ 68%]
tests/sampler_test.py::test_fastq_input PASSED                           [ 75%]
tests/sampler_test.py::test_defaults_multiple_file PASSED                [ 81%]
tests/sampler_test.py::test_max_reads PASSED                             [ 87%]
tests/sampler_test.py::test_options PASSED                               [ 93%]
::mypy PASSED                                                            [100%]
===================================== mypy =====================================

Success: no issues found in 2 source files
======================== 15 passed, 1 skipped in 3.73s =========================

Author

Ken Youens-Clark