Because each dataset uses a different format, each one needs its own parsing script:
parse_signalp_euk.py
parse_signalp_gram-.py
parse_signalp_gram-.-.py
parse_signalp_gram+.py
parse_signalp_gram+.-.py
parse_spds_euk.py
parse_spds_gram-.py
parse_spds_gram-.-.py
parse_spds_gram+.py
parse_spds_gram+.-.py
- Prefix: `parse_` marks a parsing script; `signalp` and `spds` stand for the two datasets, SignalP and SPDS17.
- Suffix: the biological category (`euk`, `gram-`, `gram+`). A trailing `.-` selects the negative samples; without it, the script handles the positive samples.
- E.g.
  - `parse_signalp_euk.py`: parses eukaryote data from the SignalP dataset
  - `parse_signalp_gram+.py`: parses the positive samples of Gram-positive bacteria from the SignalP dataset
  - `parse_signalp_gram+.-.py`: parses the negative samples of Gram-positive bacteria from the SignalP dataset
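The naming convention can be decoded mechanically. The sketch below is only an illustration: `describe_script` and `ScriptInfo` are hypothetical names that are not part of this repository, and the pattern covers only the file names listed above.

```python
import re
from typing import NamedTuple

# Dataset keys as they appear in the script names, mapped to the dataset names.
DATASETS = {"signalp": "SignalP", "spds": "SPDS17"}


class ScriptInfo(NamedTuple):
    dataset: str    # "SignalP" or "SPDS17"
    category: str   # "euk", "gram-" or "gram+"
    negative: bool  # True when the name carries the trailing ".-"


# parse_<dataset>_<category>[.-].py
_PATTERN = re.compile(r"^parse_(signalp|spds)_(euk|gram[+-])(\.-)?\.py$")


def describe_script(filename: str) -> ScriptInfo:
    """Split a parser script name into dataset, category and sample type."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a recognised parser script name: {filename!r}")
    dataset, category, negative_marker = match.groups()
    return ScriptInfo(DATASETS[dataset], category, negative_marker is not None)


if __name__ == "__main__":
    for name in ("parse_signalp_euk.py", "parse_spds_gram-.-.py"):
        print(name, describe_script(name))
```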
`fasta.py`
- `to_fasta`: takes IDs and sequences and converts them to FASTA format
- `read_fasta`: takes the path of a FASTA file and reads it into an array
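For readers unfamiliar with the FASTA format, here is a minimal sketch of what two such helpers might look like. The signatures and return types are assumptions made for illustration; the actual functions in `fasta.py` may differ.

```python
from typing import Iterable, List, Tuple


def to_fasta(ids: Iterable[str], sequences: Iterable[str]) -> str:
    """Render (id, sequence) pairs as FASTA text, one record per pair."""
    records = [f">{seq_id}\n{seq}" for seq_id, seq in zip(ids, sequences)]
    if not records:
        return ""
    return "\n".join(records) + "\n"


def read_fasta(path: str) -> List[Tuple[str, str]]:
    """Read a FASTA file into a list of (id, sequence) tuples.

    Sequences that span several lines are joined into one string.
    """
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))  # flush the last record
    return records
```

Writing the output of `to_fasta` to a file and passing that path to `read_fasta` should give back the original pairs.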
`fix_sequence.py`
- `fix_sequence`: takes sequences and a length and returns the sequences fixed to that length (96 in this work)
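One way to fix sequences to a given length is sketched below. How the real `fix_sequence` shortens or extends a sequence is not described here, so the behaviour shown (keep the first `length` residues, pad short sequences at the end with `X`) and the `pad_char` parameter are assumptions for illustration only.

```python
from typing import Iterable, List


def fix_sequence(sequences: Iterable[str], length: int = 96,
                 pad_char: str = "X") -> List[str]:
    """Return every sequence cut or padded to exactly `length` characters.

    Assumption: longer sequences keep their first `length` residues and
    shorter ones are right-padded with `pad_char`.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if len(pad_char) != 1:
        raise ValueError("pad_char must be a single character")
    return [seq[:length].ljust(length, pad_char) for seq in sequences]


if __name__ == "__main__":
    fixed = fix_sequence(["MKTLLLTLVV", "M" * 120])
    print([len(seq) for seq in fixed])
```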
For convenience, a Makefile is provided.
Usage: `make TARGET`
`TARGET` starts with the dataset name, followed by a `.` and the CATEGORY; if the negative samples are wanted, append `.-`.
E.g.
`make spds.gram-`: parses the positive samples of Gram-negative bacteria from the SPDS17 dataset
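The target naming rule can be written down as a small function. This is a sketch with a hypothetical helper, `make_target`, that only builds the target string; it assumes the dataset keys `signalp` and `spds` and the categories `euk`, `gram-` and `gram+`, and it does not check whether the Makefile defines a rule for every combination.

```python
DATASETS = ("signalp", "spds")
CATEGORIES = ("euk", "gram-", "gram+")


def make_target(dataset: str, category: str, negative: bool = False) -> str:
    """Build a Makefile target name: <dataset>.<category>[.-]"""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    target = f"{dataset}.{category}"
    if negative:
        target += ".-"  # the trailing ".-" requests the negative samples
    return target


if __name__ == "__main__":
    # Prints the command line for the negative Gram-negative SPDS17 samples.
    print("make", make_target("spds", "gram-", negative=True))
```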
WebLogo (http://weblogo.threeplusone.com/) provides online and offline tools for producing sequence logos. To install the offline tool:
sudo easy_install weblogo
The script automatically converts all of the data into sequence logos.
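A batch conversion of this kind could be sketched as follows. This is not the repository's script: the function names, the `*.fasta` glob and the directory layout are assumptions, and the sketch relies on the `weblogo` command being installed and on its `-f` (input file), `-o` (output file) and `-F` (output format) options.

```python
import subprocess
from pathlib import Path
from typing import List


def build_weblogo_command(fasta_path, out_dir, fmt: str = "png") -> List[str]:
    """Return the weblogo command line for one FASTA file."""
    fasta = Path(fasta_path)
    out_file = Path(out_dir) / f"{fasta.stem}.{fmt}"
    return ["weblogo", "-f", str(fasta), "-o", str(out_file), "-F", fmt]


def make_logos(fasta_dir, out_dir, fmt: str = "png") -> None:
    """Run weblogo once for every *.fasta file found in fasta_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(fasta_dir).glob("*.fasta")):
        # check=True raises CalledProcessError if weblogo exits with an error
        subprocess.run(build_weblogo_command(fasta, out_dir, fmt), check=True)
```

Calling `make_logos("data", "logos")` would then write one image per FASTA file into `logos`, where both directory names are placeholders.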
The online version is available at http://weblogo.threeplusone.com/create.cgi. To create a logo there:
- Select the FASTA file
- Adjust the scheme as desired
- Click to generate
The result should be a sequence logo.