I actually think the most important performance problem is not storing the samples in two ways, but the way duplicate sample names are identified and merged, which is extremely inefficient.
This looks like an O(N^2) approach: I'm not certain, but I'd bet Python's list.count() has to scan the entire list on every call, and then the samples are looped through again elsewhere (and I think other times as well). So there are some algorithmic issues. This should be achievable in one linear pass through the sample objects. It could probably be done very quickly using pandas, but even with plain sample objects a single loop should work.
Basically, just fixing the counting so it happens in one pass would probably be a huge speed benefit, and should be really simple to implement.
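The single-pass idea can be sketched with collections.Counter. This is a hypothetical illustration, not peppy's actual code; the function name and input shape are assumptions:

```python
from collections import Counter

def find_duplicates(sample_names):
    """Return names that appear more than once, in one O(N) pass.

    Contrast with calling list.count(name) once per name, which is
    O(N^2): every count() call re-scans the entire list.
    """
    counts = Counter(sample_names)  # one pass builds all the counts
    return [name for name, n in counts.items() if n > 1]

names = ["s1", "s2", "s1", "s3", "s1", "s2"]
print(find_duplicates(names))  # ['s1', 's2']
```

Counter preserves first-seen order, so the duplicates come back in the order they first appear in the sample list.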
nsheff changed the title "Performance idea: find duplicate samples more efficiently." to "Performance idea: find duplicate samples more efficiently" on Mar 23, 2023
Raised in #432 (comment)
The quadratic counting is in the duplicate-detection code here:
peppy/peppy/project.py, lines 637 to 649 (commit cac87fb)
and then the samples are looped through again here:
peppy/peppy/project.py, line 590 (commit cac87fb)
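The pandas route mentioned above could look roughly like this. It is a sketch under assumptions: the sample names are taken to live in a column/Series called sample_name, which may not match peppy's actual internals:

```python
import pandas as pd

# Hypothetical: sample names as they might appear in a sample-table column.
names = pd.Series(["s1", "s2", "s1", "s3", "s1", "s2"], name="sample_name")

counts = names.value_counts()                     # one vectorized pass
duplicates = counts[counts > 1].index.tolist()    # names seen more than once
print(sorted(duplicates))  # ['s1', 's2']
```

Since peppy already stores the sample table as a DataFrame, a vectorized value_counts (or Series.duplicated) avoids any per-name re-scan of the list.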