Two level classification: Species & Genus #105

peterjc · 2019-04-02T10:03:43Z

The optimal methods for classification at species or genus level are likely different. Intuitively we might want to allow one or two edits between a DB entry and a sample when calling species, but a far larger edit distance when calling at genus level.

Issue #101 / #102 looks at adding sequences to the database where the exact species is untrusted, but the genus is fine. In particular, the motivation here is to include some sister genus level entries in the DB. Here we expect the existing classifier methods to label more of the previously unknown sequences as genus level (and occasionally downgrade a species specific prediction to genus level only).

It is my expectation that the strict classifiers (identity and onebp) will only classify a minority of the previously unknown sequences (this depends heavily on the DB coverage vs our environment sample diversity) while the fuzzy classifiers (blast, swarmid, swarm) will make many more genus level calls.

It would therefore seem practical to enhance the tool to support a two level classification, one set of methods aimed at species level, and another set aimed at genus level. These could be run in series (e.g. if no species is called, try for a genus), or perhaps in parallel (effectively as implemented now but with a results synthesis step).
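As a rough illustration of the "in series" option, here is a minimal Python sketch. The `Classifier` signature, the `classify_in_series` function and the toy lookup tables are all hypothetical names made up for this example, not part of the tool; each classifier is assumed to return a taxon name, or `None` when it makes no call.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical signature: a classifier takes a sequence and returns a taxon
# name, or None when it declines to make a call.
Classifier = Callable[[str], Optional[str]]


def classify_in_series(
    seq: str,
    species_methods: Iterable[Classifier],
    genus_methods: Iterable[Classifier],
) -> Tuple[str, str]:
    """Return (rank, taxon), trying species-level methods before genus-level."""
    for rank, methods in (("species", species_methods), ("genus", genus_methods)):
        for method in methods:
            taxon = method(seq)
            if taxon:
                return rank, taxon
    return "unknown", ""


# Toy stand-ins for real classifiers: exact lookups in two small tables.
species_db: Dict[str, str] = {"ACGT": "Phytophthora fallax"}
genus_db: Dict[str, str] = {"ACGA": "Phytophthora"}

for query in ("ACGT", "ACGA", "TTTT"):
    print(query, classify_in_series(query, [species_db.get], [genus_db.get]))
```

The parallel option would instead run every method on every sequence and reconcile the answers afterwards, which needs an explicit rule for conflicts between ranks.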

peterjc · 2019-04-04T08:05:23Z

This might be needed for the existing classifiers (except identity) if the DB contains Phytophthora species plus almost identical entries labelled as Phytophthora (genus only).

In this case, where the fuzzy matching might have assigned a species (e.g. matches Phytophthora alni with one base pair change), this could be demoted to a genus level match (e.g. matches a Phytophthora genus only entry perfectly).

Should be able to use our single isolate control plate to gauge how often this happens.

peterjc · 2019-04-04T10:58:18Z

Depends on various settings, but yes: there is one case on the single isolate controls with the onebp classifier, where a TP (Phytophthora fallax) becomes a FN (just Phytophthora):

$ for f in assess_sample_L5-*_onebp_v0.0.15.tsv; do echo $f; xsv table $f | head -n 2; done
assess_sample_L5-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  31  5   9767  0.93         1.00         0.67       0.78  0.0036        0.36
assess_sample_L5-and-1358-PnotP-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  31  5   9767  0.93         1.00         0.67       0.78  0.0036        0.36
assess_sample_L5-and-8336-Peronosporaceae-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          63  31  6   9767  0.91         1.00         0.67       0.77  0.0037        0.37
...
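For reference, the summary columns in the OVERALL rows above can be recomputed from the four counts. The sketch below uses the standard definitions for sensitivity, specificity, precision and F1, and takes Hamming loss as (FP + FN) over all four counts. The Ad-hoc-loss formula, (FP + FN) / (TP + FP + FN), is an assumption inferred from the figures shown here rather than taken from the tool's documentation. The function name `assess` is hypothetical.

```python
def assess(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Recompute the summary metrics from a confusion matrix."""

    def ratio(num: float, den: float) -> float:
        # Avoid ZeroDivisionError on empty categories.
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
        "Hamming-loss": ratio(fp + fn, tp + fp + fn + tn),
        # Assumed definition, chosen because it matches the tables above.
        "Ad-hoc-loss": ratio(fp + fn, tp + fp + fn),
    }


# Counts from the first OVERALL row (onebp, plain L5 database).
for name, value in assess(tp=64, fp=31, fn=5, tn=9767).items():
    print(f"{name}: {value:.4f}")
```

With 9767 true negatives dominating the total, specificity and Hamming loss barely move when a single TP becomes a FN, so sensitivity and the ad-hoc loss are the columns to watch.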

Likewise for blast, we lose one TP (Phytophthora fallax), which becomes a FN (just Phytophthora). However, here the Phytophthora genus only entries are also having a positive effect by reducing the FP count, but that's due to #106 (we were accepting weak matches):

$ for f in assess_sample_L5-*_blast_v0.0.15.tsv; do echo $f; xsv table $f | head -n 2; done
assess_sample_L5-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  62  5   9736  0.93         0.99         0.51       0.66  0.0068        0.51
assess_sample_L5-and-1358-PnotP-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  62  5   9736  0.93         0.99         0.51       0.66  0.0068        0.51
assess_sample_L5-and-8336-Peronosporaceae-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          63  31  6   9767  0.91         1.00         0.67       0.77  0.0037        0.37
...

See GitHub issue #105 which covers this problem in general. By adding lots of NCBI sequences to the DB at genus level only, we can now have the situation where we get a perfect match at genus level (e.g. 'Phytophthora'), but there is a species level match at one base pair away (e.g. 'Phytophthora fallax'). Before this change, the 'onebp' classifier would just report the genus in this case. Now it will report the species level match instead.
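The change in behaviour described above can be sketched as a tie-breaking rule. This is only an illustration under stated assumptions, not the tool's actual code: `call_onebp`, `is_species` and the `prefer_species` switch are hypothetical names, matches are given as (taxon, edit distance) pairs, and a one-word taxon name is assumed to mean a genus only entry.

```python
from typing import List, Optional, Tuple

Match = Tuple[str, int]  # (taxon name, edit distance to the query)


def is_species(taxon: str) -> bool:
    """Assumption: 'Genus species' is species level, a lone 'Genus' is not."""
    return len(taxon.split()) >= 2


def call_onebp(matches: List[Match], prefer_species: bool = True) -> Optional[str]:
    """Pick one taxon from the DB matches within 1bp of the query."""
    close = [m for m in matches if m[1] <= 1]
    if not close:
        return None
    if prefer_species:
        # New behaviour: a species match within 1bp beats a perfect genus match.
        species = [m for m in close if is_species(m[0])]
        if species:
            return min(species, key=lambda m: m[1])[0]
    # Old behaviour (and fallback): simply take the closest match.
    return min(close, key=lambda m: m[1])[0]


matches = [("Phytophthora", 0), ("Phytophthora fallax", 1)]
print(call_onebp(matches, prefer_species=False))
print(call_onebp(matches, prefer_species=True))
```

A real classifier may need to report several species when more than one lies within 1bp; this sketch returns only the closest for brevity.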

peterjc · 2019-06-12T16:24:27Z

Some of the output from the new edit-graph command added in #144 makes me think we can try a more fuzzy genus level classifier (e.g. up to 2bp edit distance).

peterjc · 2021-02-17T18:33:40Z

This was done with 1s3g in v0.7.3, and 1s2g, 1s4g and 1s5g in v0.7.4 - but this is all effectively a special case, not the generalisation I was pondering with this issue.
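To make the special case concrete, here is a minimal sketch of a distance based two threshold classifier, assuming the 1sXg names mean "species at up to 1bp, genus at up to X bp" edit distance. That reading, the function names `edit_distance` and `classify_1sxg`, and the toy database are assumptions for illustration; the real implementation may differ in how it computes distances and resolves conflicts.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def classify_1sxg(seq: str, db: Dict[str, str], genus_max: int = 3) -> Tuple[str, List[str]]:
    """db maps reference sequence -> taxon ('Genus species' or just 'Genus')."""
    distances = [(edit_distance(seq, ref), taxon) for ref, taxon in db.items()]
    # Species level: only entries with a species name, within 1bp.
    species = sorted({t for d, t in distances if d <= 1 and len(t.split()) >= 2})
    if species:
        return "species", species
    # Genus level: any entry within genus_max, reduced to its genus name.
    genera = sorted({t.split()[0] for d, t in distances if d <= genus_max})
    if genera:
        return "genus", genera
    return "unknown", []


toy_db = {"ACGTACGTAC": "Phytophthora fallax", "TTTTGGGGCC": "Peronospora"}
for query in ("ACGTACGTAA", "ACGTACGAAA", "GGGGGGGGGG"):
    print(query, classify_1sxg(query, toy_db, genus_max=3))
```

The generalisation discussed in this issue would replace the two hard-coded distance rules with arbitrary methods per rank.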

peterjc · 2024-02-23T15:59:52Z

Cross reference #597, might want a rank-specific classifier setting?

peterjc added the enhancement New feature or request label Apr 2, 2019

peterjc mentioned this issue Apr 4, 2019

NCBI import as genus level #101

Closed

peterjc mentioned this issue Apr 16, 2019

Tweak onebp to consider 1bp diff species matches over perfect genus #110

Merged

peterjc mentioned this issue Nov 22, 2019

Clustering database entries to flag taxonomic anomolies #18

Open

peterjc mentioned this issue Feb 23, 2024

Expand classifiers with distance based 1s7g to 1s9g methods #604

Merged
