Cuterle is a bioinformatic tool that creates an output file (extracted_domains.fasta) containing every domain annotated by InterProScan (~.tsv file) for the list of proteins (~.fasta file) submitted.
Cuterle relies on two of the InterPro analyses (InterProScan also offers others):
- Pfam (XX.X) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
- SMART (X.X) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
For every protein, Cuterle picks whichever of the two analyses returns more results.
Running in manual mode produces one output folder with this structure:
YYYY/MM/DD_Analysis_number_X:
Directory which contains the sequences_draw images, scaled down so they load quickly in the browser
- sequences_draw
Directory containing the sequences_draw images (created via -draw_image option)
- domains_list.csv
~.csv file reporting how many times each domain has been found
- extracted_domains.fasta
~.fasta file containing every extracted domain
- graphical_output.html
~.html file providing a browsable graphical output
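As a rough illustration of what domains_list.csv holds, the sketch below counts domain names and writes them as (domain_name,count) rows. The function name `write_domain_counts`, the presence of a header row, the sort order and the sample names are assumptions of this example, not a description of Cuterle's implementation.

```python
import csv
import io
from collections import Counter


def write_domain_counts(domain_names, out):
    """Write one (domain_name, count) row per distinct domain to `out`.

    Hypothetical helper: the header row and the ordering (most frequent
    first) are choices of this sketch.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["domain_name", "count"])
    for name, count in Counter(domain_names).most_common():
        writer.writerow([name, count])


# Made-up domain names, one entry per extracted domain.
names = ["VWFA", "VWFD", "VWFA", "TIL", "VWFA", "VWFD"]
buffer = io.StringIO()
write_domain_counts(names, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```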
Index
- Suggested use
- Limitations
- Getting started
- Usage - Manual mode
- Examples manual mode syntax
- Usage - Assisted mode
- Usage - Graphical mode
- Output example - HTML_file
- Output example - Fasta list
- Output example - Sequence's draw
- How to get a ~.tsv file
- Log
- Next updates
This program is meant as a quality-of-life tool for extracting domains.
Exempli gratia
I want to extract a specific domain (IPR002035) from a transcriptome:
1. Download the target transcriptome, obtaining a transcriptome.fasta file
2. Run an InterProScan analysis against the transcriptome, obtaining a transcriptome_result.tsv file:
./interproscan.sh -o ./transcriptome_result.tsv -i ./transcriptome.fasta -f tsv -dp
3. Run Cuterle:
python3 main.py -m -tsv transcriptome_result.tsv -fasta transcriptome.fasta -accession IPR002035
4. Be happy with your extracted_domains.fasta result file
Thanks to the command-line arguments, points 2 and 3 are scriptable, saving A LOT of time.
(Happiness can't be scripted; tough life)
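One possible way to script the InterProScan and Cuterle steps is sketched below. The two command lines are the ones shown in the example; the helper names `build_commands` and `run_pipeline`, the `_result.tsv` naming rule and the location of `interproscan.sh` are assumptions of this sketch.

```python
import subprocess
from pathlib import Path


def build_commands(fasta, accession, interproscan="./interproscan.sh"):
    """Return the InterProScan and Cuterle command lines for one fasta file.

    Hypothetical helper: the tsv name is derived as <stem>_result.tsv,
    mirroring the example file names.
    """
    fasta = Path(fasta)
    tsv = fasta.with_name(fasta.stem + "_result.tsv")
    scan = [interproscan, "-o", str(tsv), "-i", str(fasta), "-f", "tsv", "-dp"]
    cuterle = ["python3", "main.py", "-m", "-tsv", str(tsv),
               "-fasta", str(fasta), "-accession", accession]
    return scan, cuterle


def run_pipeline(fasta, accession):
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(fasta, accession):
        subprocess.run(command, check=True)


# Building the commands has no side effects, so they can be inspected first.
scan, cuterle = build_commands("transcriptome.fasta", "IPR002035")
print(" ".join(scan))
print(" ".join(cuterle))
# To actually run both steps: run_pipeline("transcriptome.fasta", "IPR002035")
```

Looping `run_pipeline` over several fasta files covers the case of many transcriptomes, as long as InterProScan and Cuterle are both reachable from the working directory.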
Post Scriptum
With multiple transcriptomes to scan, you should first run an HMMER analysis to create a reduced fasta list to use in point 2
- This program does nothing more than extracting the domains identified by InterProScan
- Non-canonical domains may not be identified
- This program is not intended to replace software that performs complete protein analysis, like SMART
- Python3
- pip
Install the required Python packages by running the following command from the project's root directory:
# Install requirements
pip install -r requirements.txt
Manual mode is available from release 1.2.0 onward, making the program script-friendly.
Asking the program for help:
python3 main.py -h
usage: main.py [options]
-----------------------------------------------------------------
IF NO OPTION IS SELECTED, THE PROGRAM WILL RUN IN [ASSISTED MODE]
-----------------------------------------------------------------
DESCRIPTION
Cuterle is a bioinformatic tool.
It returns an output file containing every domain annotated by InterProScan.
Pfam or SMART analysis are choosen by which method has more matches.
LIST OF OUTPUT FILE
extracted_domains.fasta - contains every domains extracted
[optional] domains_list.csv - contains the table's raw data (domain_name,count)
[optional] domains_view[seq_name].jpg - schematic domains draw FOR EACH sequence
NAME FORMAT
The name for every sequence added to extracted_domain.fasta is [>1,2,3,4,5,6]
1 - Protein accession (e.g. P51587)
2 - Length of the domain (e.g. [DOMAIN LENGHT: (150)])
3 - Start location of the domain (e.g. [START: 50])
4 - End location of the domain (e.g. [END: 200])
5 - InterPro annotations - description (e.g. [BRCA2 repeat])
6 - InterPro accession (e.g. [IPR002035])
It is possible to CHANGE the order for every tag;
e.g. [-nf 1] or [-nf 1,2,3,4] or [-nf 5,4,3,2,6,2,2,1]
DO NOT USE SPACE between the number!
------------------------------------------
optional arguments:
-h, --help show this help message and exit
-m Enable the manual mode. -tsv and -fasta argument are requested
-tsv file.tsv Input file containing the tsv file output from InterPro
-fasta file.fasta Input file containing the fasta sequences
-a Pfam or SMART Prior choice between 'Pfam' and 'SMART'. Read the documentation.
-nf NF Name format. Read the documentation. Format: [1,2,3,4,5,6]
-accession ACCESSION InterPro annotations - accession (e.g. IPR002035)
-draw_image FOR EACH sequences create a ~.jpg file reporting sequence+domains
python3 main.py -m -tsv vwf_Homo_sapiens.tsv -fasta vwf_Homo_sapiens.fasta -nf 1,2,3 -draw_image
python3 main.py -m -tsv vwf_Homo_sapiens.tsv -fasta vwf_Homo_sapiens.fasta -a SMART -nf 6,2,1,2,3 -accession IPR002035
An ultra-simple-gui has been created. So bad it's good.
In terminal run:
python3 main.py
If no optional argument is given, the program will run in assisted mode (which is very verbose).
Once you run main.py in the terminal, the program requests the two input files (~.tsv and ~.fasta).
Each input file is checked to make sure it exists and has the right format.
Please be sure to use the right format.
If you are not sure how to get the tsv file, see How to get a ~.tsv file.
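What such an input check might look like is sketched below. Cuterle's real checks are not documented here, so everything in this example is an assumption: the function names, the rule that a fasta file starts with a ">" header, and the minimum of 11 tab-separated columns for an InterProScan tsv row.

```python
import tempfile
from pathlib import Path


def _first_content_line(path):
    """Return the first non-blank line of a text file, or None."""
    with Path(path).open() as handle:
        for line in handle:
            if line.strip():
                return line.rstrip("\n")
    return None


def looks_like_fasta(path):
    """Hypothetical check: file exists and its first record starts with '>'."""
    if not Path(path).is_file():
        return False
    first = _first_content_line(path)
    return first is not None and first.startswith(">")


def looks_like_interproscan_tsv(path, min_columns=11):
    """Hypothetical check: file exists and its first row has enough columns."""
    if not Path(path).is_file():
        return False
    first = _first_content_line(path)
    return first is not None and len(first.split("\t")) >= min_columns


# Demo on made-up files written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    fasta = Path(folder) / "demo.fasta"
    fasta.write_text(">P51587\nMPIGSKERPT\n")
    tsv = Path(folder) / "demo.tsv"
    tsv.write_text("\t".join(["P51587", "md5", "10", "Pfam", "PF00000", "demo",
                              "1", "10", "1e-5", "T", "01-01-2020"]) + "\n")
    fasta_ok = looks_like_fasta(fasta)
    tsv_ok = looks_like_interproscan_tsv(tsv)
    swapped_ok = looks_like_fasta(tsv)  # a tsv passed where a fasta is expected
    missing_ok = looks_like_fasta(Path(folder) / "missing.fasta")

print(fasta_ok, tsv_ok, swapped_ok, missing_ok)
```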
A summary table (with "Accession ID", "Domain name" and "Domains' number found" as headers) is printed graphically.
From v2.1.0 onward, a graphical_output.html file is created automatically.
All the extracted domains have the following default syntax:
>[{1}] - [LENGTH: {2}] - [START: {3}] - [END: {4}] - [{5}] - [{6}]
- First line: the header shown above
- Second line: the extracted domain sequence
Where every {number} refers to the following information:
- {1} - Protein accession (e.g. P51587)
- {2} - Length of the domain (e.g. [LENGTH: 150])
- {3} - Start location of the domain (e.g. [START: 50])
- {4} - End location of the domain (e.g. [END: 200])
- {5} - InterPro annotations - description (e.g. [BRCA2 repeat])
- {6} - InterPro annotations - accession (e.g. [IPR002035])
The syntax can be changed only in manual mode.
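To make the relationship between the -nf value and the resulting header concrete, here is a minimal sketch. It follows the default syntax and the tag numbers listed above, but the function name `build_header`, the error handling and the way repeated tags are rendered are assumptions; Cuterle's own code may differ.

```python
# Tag templates taken from the default header syntax; the dict itself is
# a construct of this sketch, not Cuterle's internal representation.
TEMPLATES = {
    "1": "[{}]",          # protein accession
    "2": "[LENGTH: {}]",  # length of the domain
    "3": "[START: {}]",   # start location
    "4": "[END: {}]",     # end location
    "5": "[{}]",          # InterPro description
    "6": "[{}]",          # InterPro accession
}


def build_header(fields, name_format="1,2,3,4,5,6"):
    """Build a fasta header from `fields` in the order given by `name_format`.

    `name_format` is a comma-separated list of tag numbers without spaces,
    as required by -nf. Tags may repeat.
    """
    parts = []
    for tag in name_format.split(","):
        if tag not in TEMPLATES:
            raise ValueError(f"unknown tag in name format: {tag!r}")
        parts.append(TEMPLATES[tag].format(fields[tag]))
    return ">" + " - ".join(parts)


# Example values reused from the tag list above.
fields = {"1": "P51587", "2": 150, "3": 50, "4": 200,
          "5": "BRCA2 repeat", "6": "IPR002035"}
default_header = build_header(fields)
short_header = build_header(fields, "6,2,1")
print(default_header)
print(short_header)
```

A format containing a space, such as "1, 2", is rejected by this sketch, which matches the instruction not to put spaces between the numbers.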
Every domain has a default color, which is the same across all proteins. There are 9 colors; any additional domains are colored in gray.
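The coloring rule can be sketched as a simple lookup: the first nine distinct domains each get their own color and every later one falls back to gray. The palette entries, the function name `assign_colors` and the first-seen ordering are assumptions of this example; the actual colors Cuterle uses are not listed here.

```python
# Hypothetical 9-color palette; only its size (9) comes from the text above.
PALETTE = ["red", "blue", "green", "orange", "purple",
           "cyan", "magenta", "yellow", "brown"]


def assign_colors(domain_names, palette=PALETTE, fallback="gray"):
    """Map each distinct domain name to a color, in first-seen order.

    Once the palette is exhausted, every further domain gets `fallback`.
    The same domain always maps to the same color.
    """
    colors = {}
    for name in domain_names:
        if name not in colors:
            index = len(colors)
            colors[name] = palette[index] if index < len(palette) else fallback
    return colors


# Eleven made-up domain names, with one repeat to show color reuse.
domains = [f"domain_{i}" for i in range(11)] + ["domain_0"]
colors = assign_colors(domains)
print(colors["domain_0"], colors["domain_8"], colors["domain_9"], colors["domain_10"])
```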
Draw layout:
- Sequence name
- Scale applied (if the scale is 1, it's hidden)
- Draw of the protein with its domains
There are two main ways to get a tsv file from InterPro:
- Follow the InterProScan guide to install and run it on a local machine
- Use the official InterProScan website to submit the fasta file and obtain the tsv file (like in the screenshot below):
Deleting the log file will reset the date counter
TOP PRIORITY
- HMMer output support (Maybe in v.2.2.0)
MEDIUM PRIORITY
- None
LOW PRIORITY
- None