Two level classification: Species & Genus #105
Comments
This might be needed for the existing classifiers (except In this case, where the fuzzy matching might have assigned a species (e.g. matches Phytophthora alni with one base pair change), this could be demoted to a genus level match (e.g. matches a Phytophthora genus only entry perfectly). Should be able to use our single isolate control plate to gauge how often this happens.
Depends on various settings, but yes, one case on the control single isolates with the
Likewise for
See GitHub issue #105 which covers this problem in general. By adding lots of NCBI sequences to the DB at genus level only, we can now have the situation where we get a perfect match at genus level (e.g. 'Phytophthora'), but there is a species level match at one base pair away (e.g. 'Phytophthora fallax'). Before this change, the 'onebp' classifier would just report the genus in this case. Now it will report the species level match instead.
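The behaviour change described in that commit message can be illustrated with a minimal sketch. This is not the tool's actual code: the `Match` record, the `call_taxon` function and the `prefer_species` flag are hypothetical names introduced here, and the only rule modelled is "among DB entries within one edit, prefer a species level entry over a genus-only one".

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    """Hypothetical record for one DB entry matched to a query sequence."""

    genus: str
    species: str  # empty string for a genus-only DB entry
    edits: int  # edit distance between the query and the DB entry


def call_taxon(matches: List[Match], prefer_species: bool = True) -> Optional[str]:
    """Pick a label from the DB matches within one edit of the query.

    With prefer_species=False the closest match wins (the old behaviour
    described above); with prefer_species=True a species level match one
    edit away beats a perfect genus-only match (the new behaviour).
    """
    close = [m for m in matches if m.edits <= 1]
    if not close:
        return None
    if prefer_species:
        species_hits = [m for m in close if m.species]
        if species_hits:
            best = min(species_hits, key=lambda m: m.edits)
            return f"{best.genus} {best.species}"
    best = min(close, key=lambda m: m.edits)
    return f"{best.genus} {best.species}".strip()


# The example from the commit message: perfect genus match, species 1bp away.
example = [Match("Phytophthora", "", 0), Match("Phytophthora", "fallax", 1)]
old_call = call_taxon(example, prefer_species=False)
new_call = call_taxon(example, prefer_species=True)
```

Here `old_call` is the genus only label and `new_call` is the species label, matching the before and after described in the commit message.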
Some of the output from the new
This was done with
Cross reference #597, might want a rank-specific classifier setting?
The optimal methods for classification at species or genus level are likely different. Intuitively we might want to allow one or two edits between a DB entry and a sample when calling species, but a far larger edit distance when calling at genus level.
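As a rough illustration of rank-specific thresholds, the sketch below computes a plain Levenshtein edit distance and maps it to a rank. The function names and the default thresholds (`species_max=2`, `genus_max=10`) are assumptions made for this example only; the issue does not propose specific values, and the existing classifiers may measure similarity differently.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion from a
                    curr[j - 1] + 1,  # insertion into a
                    prev[j - 1] + (char_a != char_b),  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def rank_for(
    query: str, db_seq: str, species_max: int = 2, genus_max: int = 10
) -> Optional[str]:
    """Return the finest rank this DB entry could support for the query.

    The thresholds are illustrative guesses, not tuned values.
    """
    distance = edit_distance(query, db_seq)
    if distance <= species_max:
        return "species"
    if distance <= genus_max:
        return "genus"
    return None
```

For instance, `rank_for("ACGTACGT", "ACGTACGA")` is one substitution away and so falls under the species threshold, while a sequence four edits away would only support a genus level call with these defaults.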
Issues #101 / #102 look at adding sequences to the database where the exact species is untrusted, but the genus is fine. In particular, the motivation here is to include some sister genus level entries in the DB. Here we expect the existing classifier methods to label more of the previously unknown sequences at genus level (and occasionally downgrade a species-specific prediction to genus level only).
It is my expectation that the strict classifiers (`identity` and `onebp`) will only classify a minority of the previously unknown sequences (depending heavily on the DB coverage vs our environment sample diversity), while the fuzzy classifiers (`blast`, `swarmid`, `swarm`) will make many more genus level calls.

It would therefore seem practical to enhance the tool to support a two level classification, one set of methods aimed at species level, and another set aimed at genus level. These could be run in series (e.g. if no species is called, try for a genus), or perhaps in parallel (effectively as implemented now but with a results synthesis step).
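The two arrangements could be sketched roughly as follows. Everything here is hypothetical: the `Classifier` type, the function names, and in particular the conflict rule in `synthesize` (reporting a conflict when the species call belongs to a different genus than the genus call) are assumptions for illustration, not a description of the tool.

```python
from typing import Callable, Optional, Tuple

# A classifier takes a sequence and returns a taxon label, or None if no call.
Classifier = Callable[[str], Optional[str]]
Call = Tuple[Optional[str], Optional[str]]  # (rank, label)


def classify_in_series(
    seq: str, species_method: Classifier, genus_method: Classifier
) -> Call:
    """Series: try for a species first; only if that fails, try for a genus."""
    species = species_method(seq)
    if species is not None:
        return ("species", species)
    genus = genus_method(seq)
    if genus is not None:
        return ("genus", genus)
    return (None, None)


def synthesize(species_call: Optional[str], genus_call: Optional[str]) -> Call:
    """Parallel: both methods already ran; combine their results.

    Assumed rule: keep the species call when it agrees with the genus call
    (or there is none), and flag a conflict otherwise.
    """
    if species_call is None:
        return ("genus", genus_call) if genus_call is not None else (None, None)
    if genus_call is None or species_call.split()[0] == genus_call:
        return ("species", species_call)
    return ("conflict", f"{species_call};{genus_call}")


# Toy stand-ins for a strict species method and a fuzzy genus method.
species_db = {"ACGT": "Phytophthora fallax"}
genus_db = {"ACGT": "Phytophthora", "ACGA": "Phytophthora"}

series_hit = classify_in_series("ACGT", species_db.get, genus_db.get)
series_fallback = classify_in_series("ACGA", species_db.get, genus_db.get)
parallel_hit = synthesize(species_db.get("ACGT"), genus_db.get("ACGT"))
```

Running in series is cheaper because the genus method is only invoked for sequences without a species call, whereas the parallel form makes disagreements between the two methods visible at the synthesis step.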