kyclark committed Jun 19, 2021
1 parent 5200710 commit 8bba7de
Showing 6 changed files with 43 additions and 5 deletions.
3 changes: 2 additions & 1 deletion 12_mrna/show_patterns.py
@@ -66,7 +66,8 @@ def main():

     pool = map(aa_to_codon.get, args.protein + '*')
     for i, codons in enumerate(product(*pool), start=1):
-        print(f"{i:5}: {''.join(codons)}")
+        # print(f"{i:5}: {''.join(codons)}")
+        print(f"{''.join(codons)}")


# --------------------------------------------------
15 changes: 14 additions & 1 deletion 17_synth/README.md
@@ -1,4 +1,17 @@
# DNA Synthesizer
# DNA Synthesizer: Creating Synthetic Data with Markov Chains

A Markov chain is a model of a sequence in which the probability of each element depends only on the state that immediately precedes it.
It can be considered a machine learning (ML) algorithm because it learns
these transition patterns from input data.
In this exercise, I’ll show how to use Markov chains trained on a set of DNA sequences to generate novel DNA sequences.
You will:

* Read some number of input sequence files to find all the unique k-mers for a
given k.
* Create a Markov chain using these k-mers to produce some number of novel
sequences of lengths bounded by a minimum and maximum.
* Learn about generators.
* Use a random seed to replicate random selections.
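
The steps in the list above can be sketched in a few lines of Python. This is a minimal illustration and not the book's `synth.py`: the function names `read_kmers` and `gen_seqs`, their parameters, and the decision to stop early when a prefix has no recorded successor are all assumptions made for this example. It also assumes `k` is at least 2 and `min_len` is at least `k - 1`.

```python
import random
from collections import defaultdict


def read_kmers(seqs, k):
    """Map each (k-1)-base prefix to the bases seen right after it."""
    chain = defaultdict(list)
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            chain[kmer[:-1]].append(kmer[-1])
    return dict(chain)


def gen_seqs(chain, num, min_len, max_len, seed=None):
    """Generator that lazily yields `num` sequences built from the chain."""
    rng = random.Random(seed)  # a fixed seed replays the same choices
    starts = sorted(chain)     # sorted so the seed alone decides the picks
    width = len(starts[0])     # prefix length, i.e. k - 1
    for _ in range(num):
        target = rng.randint(min_len, max_len)
        seq = rng.choice(starts)
        while len(seq) < target:
            options = chain.get(seq[-width:])
            if not options:    # dead end: no k-mer starts with this prefix
                break
            seq += rng.choice(options)
        yield seq
```

Because `gen_seqs` is a generator, no sequence is built until the caller asks for it, for example with `for seq in gen_seqs(chain, num=3, min_len=5, max_len=10, seed=1)`. Storing repeated successor bases in a list, rather than a set, means that more frequent k-mers are chosen more often. A sequence can come out shorter than `min_len` if the walk reaches a prefix with no successor.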

Write a program `synth.py` that uses Markov chains trained on input DNA files to create novel DNA sequences:

10 changes: 9 additions & 1 deletion 18_fastx_sampler/README.md
@@ -1,4 +1,12 @@
# FASTX Sampler
# FASTX Sampler: Randomly Subsampling Sequence Files

Sequence datasets in genomics and metagenomics can get dauntingly large, requiring copious time and compute resources to analyze. Many sequencers can produce tens of millions of reads per sample, and many experiments involve tens to hundreds of samples, each with multiple technical replicates, resulting in gigabytes to terabytes of data.
Reducing the size of the input files by randomly subsampling sequences allows you to explore data more quickly.
In this chapter, I’ll show how to use Python’s random module to select some portion of the reads from FASTA/FASTQ sequence files.

You will learn about:

* Non-deterministic sampling
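
As a rough sketch of the idea, each record can be kept with a fixed probability by comparing a random draw against that probability. The helper names `read_fasta` and `sample` and the hand-rolled FASTA reader are hypothetical and used only to keep the example self-contained. This is not the book's `sampler.py`.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample(records, pct, rng):
    """Yield each record with probability `pct` (a float from 0 to 1)."""
    for rec in records:
        # rng.random() returns a float in [0, 1), so pct=1.0 keeps everything
        if rng.random() < pct:
            yield rec
```

The number of records kept varies from run to run, which is what makes the sampling non-deterministic. Passing a seeded generator such as `random.Random(1)` makes a given run repeatable.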

Write a Python program called `sampler.py` that will probabilistically sample one or more input FASTA files into an output directory.
The inputs for this program will be generated by your `synth.py` program.
15 changes: 14 additions & 1 deletion 19_blastomatic/README.md
@@ -1,4 +1,17 @@
# BLASTOMATIC
# Blastomatic: Parsing Delimited Text Files

Delimited text files are a standard way to encode columnar data. You are likely familiar with spreadsheet programs like Microsoft Excel or Google Sheets, where each worksheet may hold a data set with column names across the top and records running down. You can export this data to a text file where the columns of data are delimited, or separated, by a character.
Quite often the delimiter is a comma, and the file will have an extension
of .csv.
This format is called CSV for comma-separated values.
When the delimiter is a tab, the extension may be .tab, .txt, or .tsv for tab-separated values.
The first line of the file usually will contain the names of the columns. Notably, this is not the case with the tabular output from BLAST (Basic Local Alignment Search Tool), one of the most popular tools in bioinformatics used to compare sequences.
In this chapter, I will show you how to parse this output and merge the BLAST results with metadata from another delimited text file using the csv and pandas modules.

In this exercise, you will:

* Learn how to use csvkit and csvchk to view delimited text files
* Learn how to use the csv and pandas modules to parse delimited text files
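
Because BLAST's tabular output has no header line, the column names have to be supplied when parsing. The sketch below uses the csv module with the twelve standard field names of BLAST's tabular format, then joins the hits to a metadata file. The function names, the annotation columns (`seq_id`, `depth`, `lat_lon`), the join on `sseqid`, and the inline sample data are assumptions for illustration, not the actual files or code from the book.

```python
import csv
import io

# The twelve default columns of BLAST tabular output (-outfmt 6)
BLAST_COLS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_hits(blast_fh, min_pctid):
    """Return hits whose percent ID is at least `min_pctid`."""
    reader = csv.DictReader(blast_fh, fieldnames=BLAST_COLS, delimiter="\t")
    return [row for row in reader if float(row["pident"]) >= min_pctid]


def merge(hits, annot_fh):
    """Yield (qseqid, pident, depth, lat_lon) for hits with metadata."""
    # Hypothetical annotation columns: seq_id, depth, lat_lon
    annots = {row["seq_id"]: row for row in csv.DictReader(annot_fh)}
    for hit in hits:
        meta = annots.get(hit["sseqid"])
        if meta:
            yield hit["qseqid"], hit["pident"], meta["depth"], meta["lat_lon"]


# Made-up sample data, held in memory instead of files
blast = (
    "q1\ts1\t98.5\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q2\ts2\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-10\t90\n"
)
annots = "seq_id,depth,lat_lon\ns1,50,12.5 -45.1\ns2,200,30.0 -60.0\n"
```

Calling `list(merge(parse_hits(io.StringIO(blast), 90.0), io.StringIO(annots)))` keeps only the first hit, since the second falls below the cutoff. Values come back as strings because the csv module does no type conversion.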

Write a program called `blastomatic.py` that will select BLAST hits above a given percent ID and will merge them with annotations and print the query sequence ID, the percent ID, the depth, and the lat/lon:

2 changes: 2 additions & 0 deletions README.md
@@ -2,6 +2,8 @@

This is the repository for the book [Mastering Python for Bioinformatics](https://learning.oreilly.com/library/view/reproducible-bioinformatics-with/9781098100872/) (O'Reilly, 2021, ISBN 9781098100889).

See [O'Reilly's website](https://get.oreilly.com/ind_mastering-python-for-bioinformatics-ch1.html) for a free download of the preface and first chapter.

# Author

Ken Youens-Clark <[email protected]>
3 changes: 2 additions & 1 deletion app01_makefiles/yeast/Makefile
@@ -1,6 +1,7 @@
.PHONY: all fasta features test clean

-FEATURES = http://downloads.yeastgenome.org/curation/chromosomal_feature/SGD_features.tab
+FEATURES = http://downloads.yeastgenome.org/curation/$\
+chromosomal_feature/SGD_features.tab

all: fasta genome chr-count chr-size features gene-count verified-genes uncharacterized-genes gene-types terminated-genes test
