functional annotation updates #230

Open
amcooksey opened this issue Apr 12, 2023 · 5 comments
@amcooksey

first thoughts on how to do updates:
-re-run functional annotation pipeline
-copy new functional annotation directory to analysis folder (preferably on apollo-stage, otherwise CERES)
-re-run final-workflow.cwl to generate genomic_annotated.gff (or possibly just the gff annotation portion; preferably on apollo-stage)
-remove NCBI ref track from apollo (there may be more steps necessary here)
-add new NCBI ref track (there may be more steps necessary here)
-push changes to apollo-prod, i5k-stage, i5k-prod
-re-run createsymlinks
-update tripal functional annotation page

@amcooksey amcooksey self-assigned this Apr 12, 2023
@amcooksey
Author

It looks like we will have to use the new version of InterProScan in order to use the updated databases. The json and xml outputs from the newer versions are substantially bigger (json: 60M vs 1.8G), but if we gzip those files we can keep the size down pretty well (1.8G -> 207M).
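
As a rough illustration of the gzip step, here is a minimal Python sketch using only the standard library. The function name `gzip_file` and the `keep_original` flag are made up for this example and are not part of the existing pipeline.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: Path, keep_original: bool = True) -> Path:
    """Compress src to src.gz and return the path of the compressed file.

    Hypothetical helper for illustration; not part of the annotation pipeline.
    """
    dst = src.with_name(src.name + ".gz")
    # Stream in chunks so a multi-gigabyte json/xml file is never held in memory.
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if not keep_original:
        src.unlink()
    return dst
```

Calling `gzip_file(Path("results.json"))` would write `results.json.gz` next to the input (the file name here is only a placeholder).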

@amcooksey
Author

amcooksey commented Jul 27, 2023

interproscan 5.45-80_3 (what we currently use)
json -- 60M
xml -- 9.3M
json.gz -- 7.3M
xml.gz -- 7M

interproscan 5.54-87 (we rejected it because the outputs were too big)
json -- 1.6G
xml -- 1.2G
json.gz -- 182M
xml.gz -- 150M

interproscan 5.63-95 (latest version)
json -- 1.8G
xml -- 1.3G
json.gz -- 207M
xml.gz -- 171M

So, long story short, the newer versions have much larger outputs but they compress well.
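
To put numbers on "compress well", the sizes listed above can be turned into compression ratios. This is a small sketch that assumes the M and G suffixes are binary units (as reported by tools like `ls -lh`). With that assumption the two newer versions come out at about 8 to 9 times smaller after gzip. The names `parse_size`, `compression_ratio` and `SIZES` are only for this example.

```python
# Assumption: sizes use binary units (1G = 1024M), as in `ls -lh` output.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

# (uncompressed, gzipped) sizes copied from the comment above.
SIZES = {
    "5.45-80_3": {"json": ("60M", "7.3M"), "xml": ("9.3M", "7M")},
    "5.54-87": {"json": ("1.6G", "182M"), "xml": ("1.2G", "150M")},
    "5.63-95": {"json": ("1.8G", "207M"), "xml": ("1.3G", "171M")},
}


def parse_size(text: str) -> float:
    """Convert a size such as '1.8G' or '150 M' to bytes."""
    text = text.replace(" ", "").upper()
    return float(text[:-1]) * UNITS[text[-1]]


def compression_ratio(raw: str, gz: str) -> float:
    """Return how many times smaller the gzipped file is."""
    return parse_size(raw) / parse_size(gz)


for version, formats in SIZES.items():
    for fmt, (raw, gz) in formats.items():
        ratio = compression_ratio(raw, gz)
        print(f"{version} {fmt}: {raw} -> {gz} ({ratio:.1f}x smaller)")
```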

@mpoelchau
Contributor

I'm inclined to remove the json and xml from the output that we provide. That said, do the gff3, tsv and gaf files that we produce convey the same information that the xml and json files do?

@amcooksey
Author

We can definitely remove the json. I think we can remove the xml too, since the other files should cover the same information. They will just be more difficult for someone to parse, but I'm not sure anyone is doing that.
I will double check with the InterPro people.

@amcooksey
Author

amcooksey commented Jul 28, 2023

InterProScan support replied:
The JSON and XML formats are the most comprehensive. For instance,
InterProScan reports GO terms from two sources: InterPro and PANTHER. GO
terms from both resources are reported in the JSON and XML formats, but
only the InterPro GO terms are reported in the TSV/GFF3 format. We plan
to address this but we are not sure when this will be done.

[We pull the GO from the XML into the GAF file so I think we can avoid this problem.]

Another example is the version of resources used in InterProScan. The
XML and JSON formats report which version of InterPro, Pfam, etc. were
used, but not the GFF3 and TSV formats.

[Our readme file specifies which version of InterProScan we used, and that is associated with a specific set of analysis versions.]

Finally, if you want to keep the score or e-value of matches reported by
InterProScan, only the XML and JSON formats include such information.

[Not sure how attached we are to the scores or e-values. Does anyone look at them?]

In a nutshell, we recommend using the XML and JSON formats, especially
if you plan to keep results for a long time. But the TSV/GFF3 formats
are also suitable in other cases, e.g. you are simply trying to check
whether your sequence is annotated by a specific Pfam domain.
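
If the XML is kept only in gzipped form, the GO terms and e-values mentioned in the reply can still be read without unpacking the file to disk. The sketch below streams a `.xml.gz` file with the Python standard library. It assumes that GO cross-references appear as `go-xref` elements with an `id` attribute and that match elements have names ending in `-match` with an `evalue` attribute. Those names are assumptions about the InterProScan XML layout and should be checked against real output, including whether PANTHER-derived GO terms use the same element. This is not the code our pipeline uses to build the GAF file, and `summarize_matches` is a made-up name.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a namespace prefix, e.g. '{uri}go-xref' becomes 'go-xref'."""
    return tag.rsplit("}", 1)[-1]


def summarize_matches(path):
    """Collect GO ids and e-values from a gzipped InterProScan XML file.

    Element and attribute names are assumptions; verify against real output.
    """
    go_ids = set()
    evalues = []
    with gzip.open(path, "rb") as handle:
        # iterparse reads the file incrementally instead of loading it whole.
        for _, elem in ET.iterparse(handle, events=("end",)):
            name = local_name(elem.tag)
            if name == "go-xref" and elem.get("id"):
                go_ids.add(elem.get("id"))
            elif name.endswith("-match") and elem.get("evalue"):
                evalues.append(float(elem.get("evalue")))
            elif name == "protein":
                # Drop the finished protein's children to limit memory use.
                elem.clear()
    return go_ids, evalues
```

Something like `go_ids, evalues = summarize_matches("results.xml.gz")` (placeholder file name) would return the set of GO ids and the list of e-values found in the file.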
