functional annotation updates #230

amcooksey · 2023-04-12T22:04:57Z

first thoughts on how to do updates:
-re-run functional annotation pipeline
-copy new functional annotation directory to analysis folder (preferably on apollo-stage, otherwise CERES)
-re-run final-workflow.cwl to generate genomic_annotated.gff (or possibly just the gff annotation portion; preferably on apollo-stage)
-remove NCBI ref track from apollo (there may be more steps necessary here)
-add new NCBI ref track (there may be more steps necessary here)
-push changes to apollo-prod, i5k-stage, i5k-prod
-re-run createsymlinks
-update tripal functional annotation page

amcooksey · 2023-07-10T19:40:33Z

It looks like we will have to use the new version of InterProScan in order to use the updated databases. The json and xml outputs from the newer versions are substantially bigger (60M with our current version vs 1.8G with the latest), but if we gzip those files we can keep the size down pretty well (1.8G -> 207M).
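A minimal sketch of the gzip step, assuming the outputs are ordinary files on disk. The function name `gzip_file` and the example file name are hypothetical and are not part of our pipeline:

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: Path, level: int = 6) -> Path:
    """Write a gzipped copy of src next to it (src name + '.gz') and return its path."""
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        # copyfileobj streams in chunks, so a multi-GB file is never held in memory
        shutil.copyfileobj(fin, fout)
    return dst
```

Usage would look something like `gzip_file(Path("proteins.json"))` (a hypothetical file name), which leaves the original file in place and writes `proteins.json.gz` beside it.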

amcooksey · 2023-07-27T23:45:09Z

interproscan 5.45-80_3 (what we currently use)
json -- 60M
xml -- 9.3M
json.gz -- 7.3M
xml.gz -- 7M
interproscan 5.54-87 (we rejected it because the outputs were too big)
json -- 1.6G
xml -- 1.2G
json.gz -- 182M
xml.gz -- 150M
interproscan 5.63-95 (latest version)
json -- 1.8G
xml -- 1.3G
json.gz -- 207M
xml.gz -- 171M
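For a rough sense of how well each output compresses, the ratios can be worked out from the sizes listed above. This is only a back-of-the-envelope sketch: it treats 1G as 1000M, and the names `sizes_mb` and `ratios` are made up for illustration.

```python
# (uncompressed, gzipped) sizes in MB, copied from the list above.
# Assumption: 1G is treated as 1000M, so the ratios are approximate.
sizes_mb = {
    "5.45-80_3": {"json": (60, 7.3), "xml": (9.3, 7)},
    "5.54-87": {"json": (1600, 182), "xml": (1200, 150)},
    "5.63-95": {"json": (1800, 207), "xml": (1300, 171)},
}


def ratios(sizes):
    """Return uncompressed/gzipped size ratios per version and format."""
    return {
        version: {fmt: raw / gz for fmt, (raw, gz) in formats.items()}
        for version, formats in sizes.items()
    }


for version, formats in ratios(sizes_mb).items():
    for fmt, ratio in formats.items():
        print(f"{version} {fmt}: about {ratio:.1f}x smaller after gzip")
```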

So, long story short, the newer versions have much larger outputs but they compress well.

mpoelchau · 2023-07-28T16:50:32Z

I'm inclined to remove the json and xml from the output that we provide. That said, do the gff3, tsv and gaf files that we produce convey the same information that the xml and json files do?

amcooksey · 2023-07-28T17:54:27Z

We can definitely remove the json. I think we can remove the xml too, since the gff3, tsv and gaf files should cover the same information. Those files will just be more difficult for someone to parse, but I'm not sure anyone is doing that.
I will double check with the InterPro people.

amcooksey · 2023-07-28T20:58:53Z

InterProScan support replied:
The JSON and XML formats are the most comprehensive. For instance,
InterProScan reports GO terms from two sources: InterPro and PANTHER. GO
terms from both resources are reported in the JSON and XML formats, but
only the InterPro GO terms are reported in the TSV/GFF3 format. We plan
to address this but we are not sure when this will be done.

[We pull the GO from the XML into the GAF file so I think we can avoid this problem.]

Another example is the version of resources used in InterProScan. The
XML and JSON formats report which version of InterPro, Pfam, etc. were
used, but not the GFF3 and TSV formats.

[Our readme file specifies which version of InterProScan we used and that is associated with a specific set of analysis versions.]

Finally, if you want to keep the score or e-value of matches reported by
InterProScan, only the XML and JSON formats include such information.

[Not sure how attached we are to the scores or e-values. Does anyone look at them?]

In a nutshell, we recommend using the XML and JSON formats, especially
if you plan to keep results for a long time. But the TSV/GFF3 formats
are also suitable in other cases, e.g. you are simply trying to check
whether your sequence is annotated by a specific Pfam domain.
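As an illustration of the note above about pulling GO terms from the XML, here is a minimal sketch. It is not our actual GAF-generation code. It assumes GO cross-references appear as `go-xref` elements with an `id` attribute, and it ignores the XML namespace because the namespace string may differ between InterProScan releases. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def go_ids(source) -> set:
    """Collect every GO id found in an InterProScan-style XML file.

    `source` may be a file name or a binary file object. Assumption: GO terms
    are stored as <go-xref id="GO:..."/> elements somewhere in the tree.
    """
    found = set()
    for _, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "go-xref":
            go_id = elem.get("id")
            if go_id:
                found.add(go_id)
    return found
```

This gathers GO ids for the whole file rather than per protein, so a real GAF writer would also need to track which protein each term belongs to. For a gzipped file, passing `gzip.open(path, "rb")` as `source` should work, since `iterparse` accepts file objects.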

amcooksey self-assigned this Apr 12, 2023
