Missing information for --diamond-path #8

Closed
xapple opened this issue Jun 19, 2024 · 2 comments

Comments

@xapple

xapple commented Jun 19, 2024

The ViPER pipeline seems to offer a classification feature, which is essential for making sense of any viral data. However, it is not clear how to use this feature.

The documentation says that one should add the path to a diamond database after --diamond-path. Yes, but which database, and of what kind? Are we supposed to come up with our own collection of taxonomically annotated sequences? What did you use in your research? Any suggestions? Ideally, a default should be provided and should auto-download on the first run to create a smoother user experience.

As it stands, I'm not sure how you intend users to run ViPER when no classification database is bundled.

@LanderDC
Contributor

LanderDC commented Jun 20, 2024

The 'classification' feature of ViPER generates a Krona chart based on a diamond blastx alignment against a database whose sequences carry NCBI accession numbers. Krona then takes the accessions of the 25 best hits, determines their lowest common ancestor, and displays the result in a pie chart.
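
To illustrate the lowest-common-ancestor idea mentioned above, here is a minimal Python sketch. It is not Krona's actual implementation: it assumes each hit has already been resolved to a root-to-leaf lineage, and the function name `lowest_common_ancestor` and the simplified lineages are hypothetical.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-to-leaf prefix shared by all lineages."""
    if not lineages:
        return []
    shared = list(lineages[0])
    for lineage in lineages[1:]:
        # Count the leading ranks on which both lineages agree.
        keep = 0
        for a, b in zip(shared, lineage):
            if a != b:
                break
            keep += 1
        shared = shared[:keep]
    return shared


# Hypothetical, simplified lineages for three hits of one contig.
hits = [
    ["Viruses", "Riboviria", "Picornaviridae", "Enterovirus"],
    ["Viruses", "Riboviria", "Picornaviridae", "Kobuvirus"],
    ["Viruses", "Riboviria", "Picornaviridae", "Enterovirus"],
]
lca = lowest_common_ancestor(hits)
print(lca[-1] if lca else "root")  # prints "Picornaviridae"
```

In this toy case the hits disagree at the genus level, so the contig would be assigned to the family they share.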

In our lab, we currently use NCBI's nr database formatted for diamond. Because the nr database is currently ~300 GB, it is undesirable to auto-download it on the first run. To use the nr database, you will have to download it as a FASTA file (see here for info on the current best way to generate the complete nr FASTA file) and format it as a diamond database with diamond makedb. Alternatively, if you're only interested in the viruses in your data, you can download only the virus sequences from the nr database, which will massively reduce the runtime and database size.

I hope this answers your questions. I will update the documentation to make this clearer for other users.
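
As a rough sketch of the formatting step, the Python snippet below assembles a `diamond makedb` call. The file names `nr.fasta` and `nr`, the helper name `diamond_makedb_command`, and the thread count are placeholders chosen for illustration; only the `--in`, `--db`, and `--threads` options of `diamond makedb` are assumed.

```python
import shlex
from typing import List


def diamond_makedb_command(fasta_path: str, db_prefix: str, threads: int = 4) -> List[str]:
    """Build the argument list for formatting a FASTA file as a diamond database."""
    return [
        "diamond", "makedb",
        "--in", fasta_path,        # protein FASTA file, e.g. the downloaded nr sequences
        "--db", db_prefix,         # output prefix; diamond writes <db_prefix>.dmnd
        "--threads", str(threads),
    ]


# Placeholder paths: adjust to wherever the FASTA file actually lives.
cmd = diamond_makedb_command("nr.fasta", "nr")
print(shlex.join(cmd))  # prints: diamond makedb --in nr.fasta --db nr --threads 4
```

To actually build the database, the list can be passed to `subprocess.run(cmd, check=True)`. The resulting `.dmnd` file is presumably what `--diamond-path` should point to.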

However, if you are looking to reliably identify (novel) viruses, particularly phages, I recommend using the contig output from ViPER and running it through genomad.
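
For the genomad route, a hedged Python sketch of the call might look as follows. It assumes genomad's `end-to-end` subcommand with positional input, output, and database arguments (worth confirming with `genomad end-to-end --help`). The paths and the helper name `genomad_end_to_end_command` are hypothetical, and the name of ViPER's contig output file is not specified here.

```python
import shlex
from typing import List


def genomad_end_to_end_command(contigs_fasta: str, output_dir: str, database_dir: str) -> List[str]:
    """Build the argument list for a genomad end-to-end run on a contig FASTA file."""
    return ["genomad", "end-to-end", contigs_fasta, output_dir, database_dir]


# Placeholder paths: replace with the ViPER contig file and a local genomad database.
genomad_cmd = genomad_end_to_end_command("contigs.fasta", "genomad_output", "genomad_db")
print(shlex.join(genomad_cmd))
```

As with any external tool, the list can be handed to `subprocess.run(genomad_cmd, check=True)` once genomad and its database are installed.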

@xapple
Author

xapple commented Jun 20, 2024

I see. Thanks a lot for your answer. It's very clear now.
In fact, it's so well written that I would suggest you add it to the manual or the README.md.
This would greatly help new or aspiring bioinformaticians understand how to use the tool.

@xapple xapple closed this as completed Jun 20, 2024