conda install biopython
- Create a new FASTA parser that uses BioPython to get the sequence name, description, and sequence.
- Add in some code to print out stats about your FASTA records in your mult-FASTA file:
- total number of sequences
- total number of nucleotides
- average length of sequences
- shortest seqeunce length
- longest seqeunce length
- average GC content
- highest GC content
- lowest GC content
seqeunce count:
total number of nucleotides:
avg len:
shortest len:
longest len:
avg GC content:
lowest GC content:
highest GC content:
- use a small test set
- Run your code on https://github.com/prog4biol/pfb2019/blob/master/files/Python_08.fasta
Preparation:
- Download uniprot_sprot using the Unix command 'wget':
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
Make sure to not add this to your gitHub Repository. It is tooooo big and with cause problems
- Unzip the file using the Unix command 'gunzip':
gunzip uniprot_sprot.fasta.gz
This will create a file uniprot_sprot.fasta
Do not add uniprot_sprot.fasta to your github repo. It is too big.* To be safe, find your .gitignore in the root of your github repository. Add uniprot_sprot.fasta*
anywhere in the file. Make sure to add this file to our index as you are updating your repo.
- What does the file contain? How many records? Does it look intact? How do you know?
Extract IDs from fasta file
-
with the
Bio.SeqIO
module, generate a list of all the IDs in the fasta file. How many are there? -
Make a list of all the descriptions. The description field almost always has a field OS=... that includes a species or strain designation. Here's an example
sp|A9N862|AAEB_SALPB p-hydroxybenzoic acid efflux pump subunit AaeB OS=Salmonella paratyphi B (strain ATCC BAA-1250 / SPB7) GN=aaeB PE=3 SV=1
Here the genus is Salmonella and the species is paratyphi. There is also a strain 'B (strain ATCC BAA-1250 / SPB7). You can ignore this part. Using regular expressions, extract just the genus and species and count the number of sequences present for that genus/species combination. List comprehensions make this kind of data processing quick to code, but you might want to start by going step by step in a for loop.
- Make a new fasta file of all the sequences containing the species 'Salmonella paratyphi B'. Include the 'B' for this part of the exercise. Call this protein file s_paratyphi.prot.fa. You'll want to loop through all the sequence records, extract the description, find matches to 'Salmonella paratyphi B' and convert to fasta.
For extra credit These questions will take some research and set up. Spend some time reading about how to run blast and ask for help as needed.
- Blast this protein against the S. paratyphi B proteins. You can do this remotely or locally with a blast binary or with biopython.
- Print the E-value and the score and the length of the alignment and the % similiarity (not % identity)
Install NCBI Blast+
- Download NCBI BLAST+
- Find the ftp link to the executables on the page you found in your google search BLAST+ executables
Installers and source code are available from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ .
- Make sure you are a guest user, click continue
- Since we are using Macs, click on the
ncbi-blast-2.6.0+.dmg
- continue, continue, agree, install
- Now in a NEW terminal window, you will have the blast executables available. (blastn, blastx, tblastn, tblastx, blastp, makeblastdb, blastdbcmd)
Run BLAST+
- First format you FASTA file so that BLAST+ can use it as a database
makeblastdb -in [FASTAFILE] -dbtype [nucl or prot] -parse_seqids
-in
is the switch for the FASTA formated sequence file that you want to use as your BLAST db-dbtype
needs to beprot
ornucl
. This has to correspond to the sequence type in your FASTA fileparse_seqids
makes it possible for you to retrieve individual sequences by name usingblastdbcmd
- Run
blastp -help
for information about running the command and formatting output. - Run
blastp -query [Your Query FASTA File] -db [BLAST FORMATED DB FASTA FILE] -out [output file name] -evalue [evalue cutoff] -outfmt [5 for XML; 6 for TAB; etc]
-query
A FASTA formated sequence file with one or more query sequences-db
The file name of the FASTA formated file you formated withmakeblastdb
-out
A name of your choice for your output file, otherwise, the output is printed to the screen-evalue
The Expectation value (E) threshold for returning hits. 1e-5 is a common cutoff (Bill will say 1e-2, but we will be a tad more conservative)-outfmt
Choose the output format of your BLAST report as XML(5)-outfmt 5
. TAB(6) is also common output but unparsable by BioPython.
Parse BLAST Output
- Use BioPython to parse your XML BLAST results. Print out all the hit sequence ID that are better than 1e-5 as well as their descriptions in tab separated columns.