adding klg python6 and python7 problemset answers

labbces · Oct 21, 2019 · df7a26a
1 parent 704a020
commit df7a26a
Showing 14 changed files with 928 additions and 0 deletions.
diff --git a/problemsets/answers/Python_06_KLG_1.py b/problemsets/answers/Python_06_KLG_1.py
@@ -0,0 +1,14 @@
+#!/usr/bin/env python3
+
+# Python 6 - IO - Problem Set
+# ===================
+
+# 1. Write a script to do the following to [Python_06.txt](https://raw.githubusercontent.com/prog4biol/pfb2019/master/files/Python_06.txt)
+#    - Open and read the contents.  
+#    - Uppercase each line
+#    - Print each line to the STDOUT
+
+in_file = open('Python_06.txt', 'r')
+
+for line in in_file:
+    print(line.strip().upper()) ## strip whitespace and uppercase the line
diff --git a/problemsets/answers/Python_06_KLG_2.py b/problemsets/answers/Python_06_KLG_2.py
@@ -0,0 +1,13 @@
+#!/usr/bin/env python3
+
+# Python 6 - IO - Problem Set
+# ===================
+
+# 2. Modify the script in the previous problem to write the contents to a new file called "Python_06_uc.txt"
+
+in_file = open('Python_06.txt', 'r')
+out_file = open('Python_06_uc.txt', 'w')
+
+for line in in_file:
+    new_line = line.strip().upper() ## strip whitespace and uppercase the line
+    out_file.write(new_line + "\n")
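Review note on Python_06_KLG_1.py and Python_06_KLG_2.py: neither script closes its file handles explicitly. A minimal sketch of the same read, uppercase, write loop using `with`, which closes both files when the block ends. The helper name `uppercase_file` and the sample file names are hypothetical, and the demo writes its own throwaway input so it runs without Python_06.txt.

```python
import os
import tempfile


def uppercase_file(in_path, out_path):
    """Copy in_path to out_path with every line uppercased (hypothetical helper)."""
    # 'with' closes both handles even if an exception is raised mid-loop
    with open(in_path, 'r') as in_file, open(out_path, 'w') as out_file:
        for line in in_file:
            out_file.write(line.rstrip('\n').upper() + '\n')


# Demo on a throwaway file, since Python_06.txt may not be present
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'sample.txt')
dst = os.path.join(tmp_dir, 'sample_uc.txt')
with open(src, 'w') as handle:
    handle.write('first line\nsecond line\n')

uppercase_file(src, dst)
with open(dst) as handle:
    result = handle.read()
print(result, end='')
```

Calling `uppercase_file('Python_06.txt', 'Python_06_uc.txt')` would cover problem 2 under the same assumptions.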
diff --git a/problemsets/answers/Python_06_KLG_3.py b/problemsets/answers/Python_06_KLG_3.py
@@ -0,0 +1,27 @@
+#!/usr/bin/env python3
+
+# Python 6 - IO - Problem Set
+# ===================
+
+# 3. Open and print the reverse complement of each sequence in [Python_06.seq.txt](https://raw.githubusercontent.com/prog4biol/pfb2019/master/files/Python_06.seq.txt). Each line is in the following format: `seqName\tsequence\n`. Make sure to print the output in FASTA format, including the sequence name and a note in the description that this is the reverse complement. Print to STDOUT and capture the output into a file with a command line redirect '>'.
+#    - **Remember it is always a good idea to start with a test set for which you know the correct output.**
+
+
+in_file = open('Python_06.seq.txt', 'r')
+
+
+for line in in_file:
+    line = line.strip() ## strip leading and trailing whitespace
+    items = line.split()
+    dna = items[1]
+    dna_lower = dna.lower()
+    dna_c = dna_lower.replace('a', 'T')
+    dna_c = dna_c.replace('t', 'A')
+    dna_c = dna_c.replace('g', 'C')
+    dna_c = dna_c.replace('c', 'G')
+    print(">" + items[0] + " reverse complement\n" + dna_c[::-1]) ## FASTA: header with a note, then the sequence
+
+
+## here's how I would redirect to a file
+## ./Python_06_KLG_3.py >new_python6_seq.txt
+
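Review note on Python_06_KLG_3.py: the chain of `replace()` calls works because the sequence is lowercased first, but a translation table does the complement in one pass. A minimal sketch, assuming plain A/C/G/T input; the helper names and the test record are made up for illustration.

```python
# Translation table covering both cases; any other character (e.g. N) passes through
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def to_fasta(name, dna):
    """Format one record as FASTA with a note in the description (hypothetical helper)."""
    return '>' + name + ' reverse complement\n' + reverse_complement(dna)


record = 'seqA\tATGCCC\n'  # made-up line in seqName<TAB>sequence format
name, dna = record.strip().split('\t')
print(to_fasta(name, dna))
```

Starting from a short sequence whose reverse complement is easy to work out by hand follows the "test set" advice in the problem statement.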
diff --git a/problemsets/answers/Python_06_KLG_4.py b/problemsets/answers/Python_06_KLG_4.py
@@ -0,0 +1,24 @@
+#!/usr/bin/env python3
+
+# Python 6 - IO - Problem Set
+# ===================
+
+# 4. Open the [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) file [Python_06.fastq](https://raw.githubusercontent.com/prog4biol/pfb2019/master/files/Python_06.fastq) and go through each line of the file. Count the number of lines and the number of characters per line. Have your program report the:  
+#     - total number of lines  
+#     - total number of characters  
+#     - average line length   
+
+in_file = open('Python_06.fastq', 'r')
+
+lines      = 0 ## counters start at 0 so the totals are not off by one
+characters = 0
+line_lens = []
+for line in in_file:
+    line = line.strip() ## strip leading and trailing whitespace
+    lines += 1
+    characters += len(line)
+    line_lens.append(len(line))
+
+print('Total lines:', str(lines))
+print('Total characters:', str(characters))
+print('Average line length:', str(sum(line_lens)/len(line_lens)))
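Review note on Python_06_KLG_4.py: the average can be taken from the two running totals, so the `line_lens` list is optional. A minimal sketch, with a made-up four-line FASTQ record standing in for Python_06.fastq and a hypothetical helper name `line_stats`.

```python
import io


def line_stats(handle):
    """Return (line count, character count, average line length) for a file-like object."""
    lines = 0
    characters = 0
    for line in handle:
        lines += 1
        characters += len(line.rstrip('\n'))  # exclude the trailing newline
    average = characters / lines if lines else 0.0
    return lines, characters, average


# Made-up record: header, sequence, separator, qualities
sample = io.StringIO('@read1\nACGT\n+\nIIII\n')
total_lines, total_chars, avg_len = line_stats(sample)
print('Total lines:', total_lines)
print('Total characters:', total_chars)
print('Average line length:', avg_len)
```

Whether newline characters should count toward the totals is a choice the problem leaves open; this sketch excludes them.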
diff --git a/problemsets/answers/Python_06_KLG_5.py b/problemsets/answers/Python_06_KLG_5.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+
+# Python 6 - IO - Problem Set
+# ===================
+
+# 5. Write your first FASTA parser. This is a script that reads in a FASTA file and stores each FASTA record separately for easy access for future analysis.
+
+# Things to keep in mind:
+#    - open your file
+#    - read each line
+#    - is your line a header line? is it a sequence line?
+#    - does a single FASTA record have one line of sequence or multiple lines of sequence?
+
+#    HINTS: use file I/O, if statements and dictionaries to write your first FASTA parser. Some other useful functions and methods are find, split, string concatenation.
+
+#    At the end, your script should return the following:
+
+#    fastaDict = {
+#       'seq1' : 'AAGAGCAGCTCGCGCTAATGTGATAGATGGCGGTAAAGTAAATGTCCTATGGGCCACCAATTATGGTGTATGAGTGAATCTCTGGTCCGAGATTCACTGAGTAACTGCTGTACACAGTAGTAACACGTGGAGATCCCATAAGCTTCACGTGTGGTCCAATAAAACACTCCGTTGGTCAAC' ,
+#       'seq2' : 'GCCACAGAGCCTAGGACCCCAACCTAACCTAACCTAACCTAACCTACAGTTTGATCTTAACCATGAGGCTGAGAAGCGATGTCCTGACCGGCCTGTCCTAACCGCCCTGACCTAACCGGCTTGACCTAACCGCCCTGACCTAACCAGGCTAACCTAACCAAACCGTGAAAAAAGGAATCT' ,
+#       'seq3' : 'ATGAAAGTTACATAAAGACTATTCGATGCATAAATAGTTCAGTTTTGAAAACTTACATTTTGTTAAAGTCAGGTACTTGTGTATAATATCAACTAAAT' ,
+#       'seq4' : 'ATGCTAACCAAAGTTTCAGTTCGGACGTGTCGATGAGCGACGCTCAAAAAGGAAACAACATGCCAAATAGAAACGATCAATTCGGCGATGGAAATCAGAACAACGATCAGTTTGGAAATCAAAATAGAAATAACGGGAACGATCAGTTTAATAACATGATGCAGAATAAAGGGAATAATCAATTTAATCCAGGTAATCAGAACAGAGGT' }
+
+in_file = open('python6.fasta', 'r')
+
+seq_dict = {}
+for line in in_file:
+    line = line.strip() ## strip leading and trailing whitespace
+    if line.startswith(">"):
+        header = line[1:].split()[0] ## sequence name only: drop ">" and any description
+        seq_dict[header] = ''
+    else:
+        seq_dict[header] += line
+
+print(seq_dict)
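Review note on Python_06_KLG_5.py: wrapping the loop in a function makes the parser reusable in later problem sets. A minimal sketch, assuming every header line carries an identifier after `>`; the function name `parse_fasta` and the sample records are hypothetical.

```python
import io


def parse_fasta(handle):
    """Return {sequence name: sequence} from a FASTA file-like object."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            name = line[1:].split()[0]  # identifier only, without '>' or description
            records[name] = ''
        elif name is not None:
            records[name] += line  # join sequences split over several lines
    return records


sample = io.StringIO('>seq1 first test record\nAAGAG\nCAGCT\n>seq2\nGCCAC\n')
fasta_dict = parse_fasta(sample)
print(fasta_dict)
```

The keys come out as `seq1`, `seq2`, matching the shape of the `fastaDict` shown in the problem statement.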
diff --git a/problemsets/answers/Python_06_KLG_6.py b/problemsets/answers/Python_06_KLG_6.py
@@ -0,0 +1,143 @@
+#!/usr/bin/env python3
+
+# Python 6 - IO - Problem Set
+# ===================
+
+
+# 6. You are going to generate a couple of gene lists that are saved in files, add their contents to sets, and compare them.
+
+# __Generate Gene Lists:__
+
+
+# _Get all genes:_
+
+# 1. Go to [Ensembl Biomart](http://useast.ensembl.org/biomart/martview/4b8fb1941e75e7763e8c4ccf1ffcd9c5).
+# 2. In dropdown box, select "Ensembl Genes 98"  (or most current version)
+# 3. In dropdown box, select "Alpaca Genes" 
+# 4. On the left, click Attributes
+# 5. Expand GENE:
+# 6. Deselect "transcript stable ID".
+# 7. Click Results (top left)
+# 8. Export all results to "File" "TSV" --> GO
+# 9. Rename the file to "alpaca_all_genes.tsv"
+
+# _In the same Ensembl window, follow the steps below to get genes that have been labeled with Gene Ontology term "stem cell proliferation". For extra information on stem cell proliferation, check out  [stem cell proliferation](http://purl.obolibrary.org/obo/GO_0072089)_
+
+# 10. Click "Filters"
+# 11. Under "Gene Ontology", check "Go term name" and enter "stem cell proliferation"
+# 12. Click Results (top left)
+# 13. Export all results to "File" "TSV" --> GO
+# 14. Rename the file to "alpaca_stemcellproliferation_genes.tsv"
+
+# _In the same Ensembl window, follow the steps below to get genes that have been labeled with Gene Ontology term "pigmentation". For extra information on pigmentation, check out [pigmentation](http://purl.obolibrary.org/obo/GO_0043473)_
+
+
+# 15. Click "Filters"
+# 16. Under "Gene Ontology", check "Go term name" and enter "pigmentation"
+# 17. Click Results (top left)
+# 18. Export all results to "File" "TSV" --> GO
+# 19. Rename the file to "alpaca_pigmentation_genes.tsv"
+
+
+# __Open each of the three files and add the geneIDs to a Set. One Set per file.__
+
+
+
+
+alpaca_genes_f = open('alpaca_all_genes.tsv', 'r')
+
+alpaca_genes = set()
+
+header_line = 0
+for line in alpaca_genes_f:
+    if header_line == 0:
+        header_line +=1
+        pass
+    else:
+        gene = line.rstrip().split("\t")[0]
+        alpaca_genes.add(gene)
+
+alpaca_p_genes_f = open('alpaca_pigmentation_genes.tsv', 'r')
+
+alpaca_p_genes = set()
+
+header_line = 0
+for line in alpaca_p_genes_f:
+    if header_line == 0:
+        header_line +=1
+        pass
+    else:
+        gene = line.rstrip().split("\t")[0]
+        alpaca_p_genes.add(gene)
+
+alpaca_sc_genes_f = open('alpaca_stemcellproliferation_genes.tsv', 'r')
+alpaca_sc_genes = set()
+
+header_line = 0
+for line in alpaca_sc_genes_f:
+    if header_line == 0:
+        header_line +=1
+        pass
+    else:
+        gene = line.rstrip().split("\t")[0]
+        alpaca_sc_genes.add(gene)
+
+
+# A. Find all the genes that are not cell proliferation genes.
+
+print(len(alpaca_genes - alpaca_sc_genes))
+
+
+# B. Find all genes that are both stem cell proliferation genes and pigment genes.  
+# *Note* Make sure to NOT add the header to your set.  
+print('Genes that are both stem cell proliferation and pigmentation:')
+print(alpaca_sc_genes & alpaca_p_genes)
+
+
+
+# __Now, let's do it again with transcription factors.__
+
+# 1. Go back to your Ensembl Biomart window
+# 2. Deselect the "GO Term Name"
+# 3. Select "GO Term Accession"
+# 4. Enter these two accessions IDs which in most organisms will be all the transcription factors
+#    - GO:0006355 is "regulation of transcription, DNA-dependent".
+#    - GO:0003677 is "DNA binding"
+# 5.  Click Results (top left)
+# 6. Export all results to "File" "TSV" --> GO
+# 7. Rename the file to "alpaca_transcriptionFactors.tsv"
+
+# __Open these two files: 1) the transcription factor gene list file and 2) the cell proliferation gene list file. Add each to a Set, One Set per file__
+
+
+alpaca_tf_genes_f = open('alpaca_transcriptionFactors.tsv', 'r')
+alpaca_tf_genes = set()
+
+header_line = 0
+for line in alpaca_tf_genes_f:
+    if header_line == 0:
+        header_line +=1
+        pass
+    else:
+        gene = line.rstrip().split("\t")[0]
+        alpaca_tf_genes.add(gene)
+
+
+
+# A. Find all the genes that are transcription factors for cell proliferation
+
+print('Genes that are transcription factors for cell proliferation:')
+print(alpaca_sc_genes & alpaca_tf_genes)
+print(len(alpaca_sc_genes & alpaca_tf_genes))
+
+# __Now do the same on the command line with `comm` command. You might need to `sort` each file first.__
+
+## first drop the header line, pull out the first column of each file, and sort it (unique IDs only)
+#tail -n +2 alpaca_stemcellproliferation_genes.tsv | cut -f 1 | sort -u >alpaca_stemcellproliferation_genes.sorted.tsv
+#tail -n +2 alpaca_transcriptionFactors.tsv | cut -f 1 | sort -u >alpaca_transcriptionFactors.sorted.tsv
+
+## take the common elements from each file
+#comm -12 alpaca_transcriptionFactors.sorted.tsv alpaca_stemcellproliferation_genes.sorted.tsv
+#comm -12 alpaca_transcriptionFactors.sorted.tsv alpaca_stemcellproliferation_genes.sorted.tsv | wc -l
+
+
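Review note on Python_06_KLG_6.py: the header-skipping loop is repeated four times, once per file. A minimal sketch of one shared helper, assuming the gene ID is in the first tab-separated column as in the answer above. The name `read_gene_ids` and the IDs G1 to G4 are made up so the example runs without the BioMart exports.

```python
import io


def read_gene_ids(handle):
    """Return the set of IDs in the first tab-separated column, skipping the header."""
    next(handle, None)  # discard the header line (no error on an empty file)
    genes = set()
    for line in handle:
        fields = line.rstrip('\n').split('\t')
        if fields[0]:
            genes.add(fields[0])
    return genes


# Made-up IDs standing in for the three exported TSV files
all_genes = read_gene_ids(io.StringIO('Gene stable ID\nG1\nG2\nG3\nG4\n'))
sc_genes = read_gene_ids(io.StringIO('Gene stable ID\nG2\nG3\n'))
tf_genes = read_gene_ids(io.StringIO('Gene stable ID\nG3\nG4\n'))

print(sorted(all_genes - sc_genes))  # not stem cell proliferation genes
print(sorted(sc_genes & tf_genes))   # present in both lists
```

With real files, each call would take an open handle, for example `with open('alpaca_all_genes.tsv') as f: all_genes = read_gene_ids(f)`.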
diff --git a/problemsets/answers/Python_07_KLG_1.py b/problemsets/answers/Python_07_KLG_1.py
@@ -0,0 +1,93 @@
+#!/usr/bin/env python3
+
+# Python 7 - Regular Expressions - Problem Set
+# ===================
+
+# 1. In the file [Python_07_nobody.txt](https://raw.githubusercontent.com/prog4biol/pfb2019/master/files/Python_07_nobody.txt) find every occurrence of 'Nobody' and print out the position.
+
+## here I want to use regular expressions and resulting match objects to find the occurrences of Nobody
+## and print out their positions
+
+import re
+
+nobody_file = open('Python_07_nobody.txt', 'r')
+
+
+line_num = 1
+for line in nobody_file:
+    matches = re.finditer('Nobody', line) ## finditer yields every match on the line, not just the first
+    for match in matches:
+        #print(match.start(), match.end()) ## this will print out the start and end, but is simplified with the span() method
+        print("Match found on line", line_num, "at position", match.span())
+    line_num += 1
+
+
+# 2. In the file [Python_07_nobody.txt](https://raw.githubusercontent.com/prog4biol/pfb2019/master/files/Python_07_nobody.txt) substitute every occurrence of 'Nobody' with your favorite name and write an output file with that person's name (ex. Michael.txt).
+
+# 3. Using pattern matching, find all the header lines in [Python_07.fasta](https://raw.githubusercontent.com/prog4biol/pfb2019/master/files/Python_07.fasta). Note that the format for a header in a fasta file is a line that starts with a greater than symbol and is followed by some text (e.g. `>seqName description` where seqName is the sequence name or identifier. The identifier cannot have spaces in it. The description that follows it can have spaces.)
+
+# 4. If a line matches the format of a FASTA header, extract the sequence name and description using sub patterns (groups). 
+# 	- Print lines something like this `id:"extracted seq name" desc:"extracted description"`
+
+# 5. Create a FASTA parser, or modify your FASTA parser from the previous problem set, to use regular expressions. Also make sure your parser can deal with a sequence that is split over many lines.
+
+# 6. The enzyme ApoI has a restriction site: R^AATTY where R and Y are degenerate nucleotides. See the IUPAC table to identify the nucleotide possibilities for the R and Y. Write a regular expression to find and print all occurrences of the site in the following sequence [Python_07_ApoI.fasta](https://raw.githubusercontent.com/prog4biol/pfb2019/master/files/Python_07_ApoI.fasta). 
+
+# ```
+# >seq1
+# GAATTCAAGTTCTTGTGCGCACACAAATCCAATAAAAACTATTGTGCACACAGACGCGAC
+# TTCGCGGTCTCGCTTGTTCTTGTTGTATTCGTATTTTCATTTCTCGTTCTGTTTCTACTT
+# AACAATGTGGTGATAATATAAAAAATAAAGCAATTCAAAAGTGTATGACTTAATTAATGA
+# GCGATTTTTTTTTTGAAATCAAATTTTTGGAACATTTTTTTTAAATTCAAATTTTGGCGA
+# AAATTCAATATCGGTTCTACTATCCATAATATAATTCATCAGGAATACATCTTCAAAGGC
+# AAACGGTGACAACAAAATTCAGGCAATTCAGGCAAATACCGAATGACCAGCTTGGTTATC
+# AATTCTAGAATTTGTTTTTTGGTTTTTATTTATCATTGTAAATAAGACAAACATTTGTTC
+# CTAGTAAAGAATGTAACACCAGAAGTCACGTAAAATGGTGTCCCCATTGTTTAAACGGTT
+# GTTGGGACCAATGGAGTTCGTGGTAACAGTACATCTTTCCCCTTGAATTTGCCATTCAAA
+# ATTTGCGGTGGAATACCTAACAAATCCAGTGAATTTAAGAATTGCGATGGGTAATTGACA
+# TGAATTCCAAGGTCAAATGCTAAGAGATAGTTTAATTTATGTTTGAGACAATCAATTCCC
+# CAATTTTTCTAAGACTTCAATCAATCTCTTAGAATCCGCCTCTGGAGGTGCACTCAGCCG
+# CACGTCGGGCTCACCAAATATGTTGGGGTTGTCGGTGAACTCGAATAGAAATTATTGTCG
+# CCTCCATCTTCATGGCCGTGAAATCGGCTCGCTGACGGGCTTCTCGCGCTGGATTTTTTC
+# ACTATTTTTGAATACATCATTAACGCAATATATATATATATATATTTAT
+# ```
+
+
+# 7. Determine the site(s) of the physical cut(s) by ApoI in the above sequence. Print out the sequence with "^" at the cut site.
+
+#   Hints:  
+#    - Use `sub()`  
+#    - Use subpatterns (parentheses and `group()` ) to find the cut site within the pattern.
+#    - Example: if the pattern is GACGT^CT the following sequence
+
+# ```
+# AAAAAAAAGACGTCTTTTTTTAAAAAAAAGACGTCTTTTTTT
+# ```
+# we want to display the cut site like this:
+
+# ```
+# AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT
+# ```
+
+# 8. Now that you've done your restriction digest, determine the lengths of your fragments and sort them by length (in the same order they would separate on an electrophoresis gel).
+
+# Hint: Convert this string:
+
+# ```
+# AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT
+# ```
+
+# Into this list:
+
+# ```
+# ["AAAAAAAAGACGT","CTTTTTTTAAAAAAAAGACGT","CTTTTTTT"]
+# ```
+
+# 9. Download this file: ftp://ftp.neb.com/pub/rebase/bionet.txt of enzymes and their cut sites to fill a dictionary of enzymes paired with their recognition patterns. Be aware of the header lines, and be aware of how the columns are delimited. You'll modify this script in the next question.
+
+# 10. Write a script which takes two command line arguments: the name of an enzyme and a fasta file with a sequence to be cut. Load a dictionary of enzyme names and cut sites from the code you developed in question 9.
+#    If the enzyme is present in the dictionary, and can cut the sequence, print out:
+#      - the sequence, annotated with cut sites
+#      - the number of fragments
+#      - the fragments in their natural order (unsorted)
+#      - the fragments in sorted order (largest to smallest)
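Review note on problems 6 to 8, which have no answer in this commit: one possible approach, not the course's reference solution. The sketch assumes R = A or G and Y = C or T from the IUPAC table, captures the bases on each side of the cut as two groups, and reinserts them around `^` with `sub()`. The helper names are hypothetical, and the GACGT^CT sequence from the hints serves as the known test case.

```python
import re

# R = A or G, Y = C or T (IUPAC); the cut falls between R and AATTY
APOI = re.compile(r'([AG])(AATT[CT])')


def mark_cuts(seq, pattern=APOI):
    """Insert '^' at each cut site; group 1 ends where the enzyme cuts."""
    return pattern.sub(r'\1^\2', seq)


def fragments_by_length(marked):
    """Split a '^'-marked sequence and sort the fragments longest first."""
    return sorted(marked.split('^'), key=len, reverse=True)


# The GACGT^CT example from the hints, used here as a known test case
example = re.compile(r'(GACGT)(CT)')
marked = mark_cuts('AAAAAAAAGACGTCTTTTTTTAAAAAAAAGACGTCTTTTTTT', example)
print(marked)
print(marked.split('^'))
print([len(fragment) for fragment in fragments_by_length(marked)])

# A short made-up sequence containing one ApoI site
print(mark_cuts('CCGAATTCGG'))
```

Sorting longest first is a choice made here to match the "largest to smallest" wording in problem 10; reversing the sort gives the opposite order.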