drlabl · Jun 19, 2021 · 8bba7de
1 parent 5200710
commit 8bba7de
Showing 6 changed files with 43 additions and 5 deletions.
diff --git a/12_mrna/show_patterns.py b/12_mrna/show_patterns.py
@@ -66,7 +66,8 @@ def main():
 
     pool = map(aa_to_codon.get, args.protein + '*')
     for i, codons in enumerate(product(*pool), start=1):
-        print(f"{i:5}: {''.join(codons)}")
+        # print(f"{i:5}: {''.join(codons)}")
+        print(f"{''.join(codons)}")
 
 
 # --------------------------------------------------

diff --git a/17_synth/README.md b/17_synth/README.md
@@ -1,4 +1,17 @@
-# DNA Synthesizer
+# DNA Synthesizer: Creating Synthetic Data with Markov Chains
+
+A Markov chain models a sequence as a series of states in which the probability of each next state depends only on the current state.
+It can be considered a machine learning (ML) algorithm because it learns
+these transition patterns from input data.
+This exercise shows how to use Markov chains trained on a set of DNA sequences to generate novel DNA sequences.
+Along the way, you will:
+
+* Read some number of input sequence files to find all the unique k-mers for a
+given k.
+* Create a Markov chain using these k-mers to produce some number of novel
+sequences of lengths bounded by a minimum and maximum.
+* Learn about generators.
+* Use a random seed to replicate random selections.
 
 Write a program `synth.py` that uses Markov chains trained on input DNA files to create novel DNA sequences:
 
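A minimal sketch of the Markov chain idea described in the README text above, not the book's `synth.py`. The function names `build_chain` and `generate`, the toy training sequences, and the choice to key the chain on (k-1)-mer prefixes are all assumptions made for illustration, and the sketch assumes `k` is at least 2.

```python
import random
from collections import defaultdict


def build_chain(seqs, k):
    """Map each (k-1)-mer prefix to the bases observed to follow it."""
    chain = defaultdict(list)
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            chain[kmer[:-1]].append(kmer[-1])
    return chain


def generate(chain, k, min_len, max_len, rng):
    """Walk the chain from a random prefix until the target length or a dead end."""
    target = rng.randint(min_len, max_len)
    seq = rng.choice(sorted(chain))  # sorted keys keep the choice reproducible
    while len(seq) < target:
        options = chain.get(seq[-(k - 1):])
        if not options:  # no observed k-mer starts with this prefix
            break
        seq += rng.choice(options)
    return seq


# Toy training data (hypothetical); a fixed seed makes the random choices replayable
rng = random.Random(1)
chain = build_chain(["ACGTACGTTGCA", "TTGCAACGT"], k=3)
novel = generate(chain, k=3, min_len=10, max_len=15, rng=rng)
print(novel)
```

Storing repeated next bases in a list means `rng.choice` picks each base in proportion to how often it followed the prefix in the training data, so no explicit probability table is needed.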

diff --git a/18_fastx_sampler/README.md b/18_fastx_sampler/README.md
@@ -1,4 +1,12 @@
-# FASTX Sampler
+# FASTX Sampler: Randomly Subsampling Sequence Files
+
+Sequence datasets in genomics and metagenomics can get dauntingly large, requiring copious time and compute resources to analyze. Many sequencers can produce tens of millions of reads per sample, and many experiments involve tens to hundreds of samples, each with multiple technical replicates, resulting in gigabytes to terabytes of data.
+Reducing the size of the input files by randomly subsampling sequences allows you to explore data more quickly.
+In this chapter, I’ll show how to use Python’s random module to select some portion of the reads from FASTA/FASTQ sequence files.
+
+You will learn about:
+
+* Non-deterministic sampling
 
 Write a Python program called `sampler.py` that will probabilistically sample one or more input FASTA files into an output directory.
 The inputs for this program will be generated by your `synth.py` program.
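A minimal sketch of probabilistic subsampling with Python's `random` module, as the README text above describes. It is not the book's `sampler.py`: the helper name `sample_records`, its `pct` and `seed` parameters, and the in-memory list of made-up reads are assumptions for illustration, and a real program would read records from FASTA/FASTQ files instead.

```python
import random


def sample_records(records, pct, seed=None):
    """Keep each record independently with probability pct (0.0 to 1.0)."""
    rng = random.Random(seed)
    # random() returns a float in [0.0, 1.0), so pct=1.0 keeps everything
    return [rec for rec in records if rng.random() < pct]


# Made-up (id, sequence) pairs standing in for parsed reads
reads = [(f"read{i}", "ACGT") for i in range(1000)]
subset = sample_records(reads, pct=0.1, seed=1)
print(len(subset))
```

Because each record is kept or dropped independently, the output size only approximates `pct` of the input. Passing the same seed reproduces the same selection, and omitting it gives a different subset on each run.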

diff --git a/19_blastomatic/README.md b/19_blastomatic/README.md
@@ -1,4 +1,17 @@
-# BLASTOMATIC
+# Blastomatic: Parsing Delimited Text Files
+
+Delimited text files are a standard way to encode columnar data. You are likely familiar with spreadsheets like Microsoft Excel or Google Sheets, where each worksheet may hold a data set with columns across the top and records running down. You can export this data to a text file where the columns of data are delimited, or separated, by a character.
+Quite often the delimiter is a comma, and the file will have an extension
+of .csv. 
+This format is called CSV for comma-separated values. 
+When the delimiter is a tab, the extension may be .tab, .txt, or .tsv (for tab-separated values).
+The first line of the file usually contains the names of the columns. Notably, this is not the case with the tabular output from BLAST (Basic Local Alignment Search Tool), one of the most popular tools in bioinformatics for comparing sequences.
+In this chapter, I will show you how to parse this output and merge the BLAST results with metadata from another delimited text file using the csv and pandas modules.
+
+In this exercise, you will:
+
+* Learn how to use csvkit and csvchk to view delimited text files
+* Learn how to use the csv and pandas modules to parse delimited text files
 
 Write a program called `blastomatic.py` that will select BLAST hits above a given percent ID and will merge them with annotations and print the query sequence ID, the percent ID, the depth, and the lat/lon:
 
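A minimal sketch of parsing header-less BLAST tabular output with the `csv` module and merging it with a second delimited file, as the README text above describes. It is not the book's `blastomatic.py`. The column list is the default order for BLAST's tabular format (`-outfmt 6`). The sample rows, the metadata column names (`seq_id`, `depth`, `lat_lon`), the 90% cutoff, and the choice to join on `sseqid` are all assumptions for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6), which has no header row
BLAST_COLS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Made-up hits for illustration only
hits_text = (
    "q1\ts1\t98.5\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q2\ts2\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-20\t90\n"
)
# Hypothetical metadata with a header row; the column names are assumptions
meta_text = 'seq_id,depth,lat_lon\ns1,500,"22.7,-158.0"\ns2,25,"31.2,-64.5"\n'

# Index the annotations by sequence ID for quick lookup
annots = {row["seq_id"]: row for row in csv.DictReader(io.StringIO(meta_text))}

# Supply the field names ourselves because the BLAST output lacks a header
hits = csv.DictReader(io.StringIO(hits_text), fieldnames=BLAST_COLS, delimiter="\t")
merged = []
for hit in hits:
    if float(hit["pident"]) >= 90 and hit["sseqid"] in annots:
        meta = annots[hit["sseqid"]]
        merged.append((hit["qseqid"], hit["pident"], meta["depth"], meta["lat_lon"]))

print(merged)
```

The `csv` module returns every field as a string, so `pident` has to be converted with `float` before the comparison.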

diff --git a/README.md b/README.md
@@ -2,6 +2,8 @@
 
 This is the repository for the book [Mastering Python for Bioinformatics](https://learning.oreilly.com/library/view/reproducible-bioinformatics-with/9781098100872/) (O'Reilly, 2021, ISBN 9781098100889).
 
+See [O'Reilly's website](https://get.oreilly.com/ind_mastering-python-for-bioinformatics-ch1.html) for a free download of the preface and first chapter.
+
 # Author
 
 Ken Youens-Clark <[email protected]>
diff --git a/app01_makefiles/yeast/Makefile b/app01_makefiles/yeast/Makefile
@@ -1,6 +1,7 @@
 .PHONY: all fasta features test clean
 
-FEATURES = http://downloads.yeastgenome.org/curation/chromosomal_feature/SGD_features.tab
+FEATURES = http://downloads.yeastgenome.org/curation/$\
+	chromosomal_feature/SGD_features.tab
 
 all: fasta genome chr-count chr-size features gene-count verified-genes uncharacterized-genes gene-types terminated-genes test