Two level classification: Species & Genus #105

Open
peterjc opened this issue Apr 2, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@peterjc
Owner

peterjc commented Apr 2, 2019

The optimal methods for classification at species or genus level are likely different. Intuitively we might want to allow one or two edits between a DB entry and a sample when calling species, but a far larger edit distance when calling at genus level.

Issues #101 and #102 look at adding sequences to the database where the exact species is untrusted, but the genus is fine. In particular, the motivation here is to include some sister genus level entries in the DB. With these in place we expect the existing classifier methods to label more of the previously unknown sequences at genus level (and occasionally downgrade a species-specific prediction to genus level only).

It is my expectation that the strict classifiers (identity and onebp) will only classify a minority of the previously unknown sequences (this depends heavily on the DB coverage versus our environmental sample diversity), while the fuzzy classifiers (blast, swarmid, swarm) will make many more genus level calls.

It would therefore seem practical to enhance the tool to support a two level classification, one set of methods aimed at species level, and another set aimed at genus level. These could be run in series (e.g. if no species is called, try for a genus), or perhaps in parallel (effectively as implemented now but with a results synthesis step).
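The "in series" option above can be sketched as follows. This is only an illustration, not the tool's implementation: the toy `DB`, the function names, and the thresholds `species_max` and `genus_max` are all hypothetical, and a substitution-only (Hamming) distance stands in for a real edit distance to keep the example short.

```python
# Hypothetical toy database: marker sequence -> taxonomic label.
# A one-word label is treated as a genus-only entry (an assumption).
DB = {
    "ACGTACGTAC": "Phytophthora alni",
    "ACGTTCGTAA": "Phytophthora fallax",
    "TTGTACGGAC": "Phytophthora",
}


def hamming(a, b):
    """Substitution-only distance; None if the lengths differ.

    A simplification: a real classifier would also allow insertions
    and deletions.
    """
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def labels_within(query, max_dist):
    """Return the set of DB labels within max_dist of the query."""
    hits = set()
    for seq, label in DB.items():
        dist = hamming(query, seq)
        if dist is not None and dist <= max_dist:
            hits.add(label)
    return hits


def classify_two_level(query, species_max=1, genus_max=3):
    """Try a strict species call first, then a relaxed genus call."""
    # Level 1: strict threshold, species-level entries only.
    species = {x for x in labels_within(query, species_max) if " " in x}
    if species:
        return ";".join(sorted(species))
    # Level 2: relaxed threshold, collapse every hit to its genus.
    genera = {x.split()[0] for x in labels_within(query, genus_max)}
    return ";".join(sorted(genera))  # empty string means no call


print(classify_two_level("ACGTACGTAT"))  # one substitution from P. alni
print(classify_two_level("ACGTACGGGG"))  # three substitutions from P. alni
```

The "in parallel" alternative would instead run both levels unconditionally and reconcile the two answers afterwards, which makes the synthesis rule explicit at the cost of extra work per sequence.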

peterjc added the enhancement label Apr 2, 2019
@peterjc
Owner Author

peterjc commented Apr 4, 2019

This might be needed for the existing classifiers (except identity) if the DB contains Phytophthora species plus almost identical entries labelled as Phytophthora (genus only).

In this case, where the fuzzy matching might have assigned a species (e.g. matches Phytophthora alni with one base pair change), this could be demoted to a genus level match (e.g. matches a Phytophthora genus only entry perfectly).

Should be able to use our single isolate control plate to gauge how often this happens.

@peterjc
Owner Author

peterjc commented Apr 4, 2019

This depends on various settings, but yes: there is one case on the single isolate controls with the onebp classifier where a TP (Phytophthora fallax) becomes a FN (just Phytophthora):

$ for f in assess_sample_L5-*_onebp_v0.0.15.tsv; do echo $f; xsv table $f | head -n 2; done
assess_sample_L5-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  31  5   9767  0.93         1.00         0.67       0.78  0.0036        0.36
assess_sample_L5-and-1358-PnotP-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  31  5   9767  0.93         1.00         0.67       0.78  0.0036        0.36
assess_sample_L5-and-8336-Peronosporaceae-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          63  31  6   9767  0.91         1.00         0.67       0.77  0.0037        0.37
...
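For readers checking the OVERALL rows, the summary columns can be recomputed from the four counts. The sensitivity, specificity, precision and F1 formulas are the standard ones. The Hamming-loss and Ad-hoc-loss formulas below are assumptions inferred from the numbers shown, not taken from the tool's source, and the function name `assess` is hypothetical.

```python
def assess(tp, fp, fn, tn):
    """Recompute the summary columns from TP, FP, FN and TN counts.

    No guard against zero denominators; this is a sketch only.
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
        # Assumed definitions, consistent with the rows shown:
        "Hamming-loss": (fp + fn) / total,
        "Ad-hoc-loss": (fp + fn) / (tp + fp + fn),
    }


# Counts copied from the onebp OVERALL rows.
for counts in [(64, 31, 5, 9767), (63, 31, 6, 9767)]:
    metrics = assess(*counts)
    print(counts, {name: round(value, 4) for name, value in metrics.items()})
```

Losing a single TP to a FN moves sensitivity from 64/69 to 63/69, which is the 0.93 to 0.91 drop in the table.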

Likewise for blast, one sample TP (Phytophthora fallax) becomes a FN (just Phytophthora). However, here the Phytophthora genus only entries also have a positive effect by reducing the FP count, but that is due to #106 (we were accepting weak matches):

$ for f in assess_sample_L5-*_blast_v0.0.15.tsv; do echo $f; xsv table $f | head -n 2; done
assess_sample_L5-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  62  5   9736  0.93         0.99         0.51       0.66  0.0068        0.51
assess_sample_L5-and-1358-PnotP-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  62  5   9736  0.93         0.99         0.51       0.66  0.0068        0.51
assess_sample_L5-and-8336-Peronosporaceae-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          63  31  6   9767  0.91         1.00         0.67       0.77  0.0037        0.37
...

peterjc added a commit that referenced this issue Apr 16, 2019
See GitHub issue #105 which covers this problem in general.

By adding lots of NCBI sequences to the DB at genus level only,
we can now have the situation where we get a perfect match at
genus level (e.g. 'Phytophthora'), but there is a species level
match at one base pair away (e.g. 'Phytophthora fallax').

Before this change, the 'onebp' classifier would just report
the genus in this case. Now it will report the species level
match instead.
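
The behaviour change described in this commit message can be illustrated with a small sketch. It is not the actual onebp code: the function `call_onebp`, its `prefer_species` flag, and the convention that a one-word label is a genus-only entry are all assumptions made for the example.

```python
def is_species(label):
    """Assume 'Genus species' labels contain a space; genus-only do not."""
    return " " in label


def call_onebp(matches, prefer_species):
    """Pick a label from (label, edit_distance) pairs.

    prefer_species=False mimics the old behaviour (closest match wins);
    prefer_species=True mimics the new behaviour (a species-level match
    within 1bp beats a perfect genus-only match).
    """
    hits = [(label, dist) for label, dist in matches if dist <= 1]
    if not hits:
        return ""
    if prefer_species:
        species = sorted({label for label, _ in hits if is_species(label)})
        if species:
            return ";".join(species)
    best = min(dist for _, dist in hits)
    return ";".join(sorted({label for label, dist in hits if dist == best}))


# The situation from the commit message: a perfect genus-only match
# plus a species-level match one base pair away.
matches = [("Phytophthora", 0), ("Phytophthora fallax", 1)]
print(call_onebp(matches, prefer_species=False))
print(call_onebp(matches, prefer_species=True))
```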
peterjc added a commit that referenced this issue Apr 17, 2019
@peterjc
Owner Author

peterjc commented Jun 12, 2019

Some of the output from the new edit-graph command added in #144 makes me think we can try a fuzzier genus level classifier (e.g. up to 2bp edit distance).

@peterjc
Owner Author

peterjc commented Feb 17, 2021

This was done with 1s3g in v0.7.3, and 1s2g, 1s4g and 1s5g in v0.7.4 - but this is all effectively a special case, not the generalisation I was pondering with this issue.

@peterjc
Owner Author

peterjc commented Feb 23, 2024

Cross reference #597, might want a rank-specific classifier setting?
