Performance idea: find duplicate samples more efficiently #436

Closed
nsheff opened this issue Mar 23, 2023 · 2 comments

nsheff commented Mar 23, 2023

Raised in #432 (comment)

I actually think the most important performance-related problem is not storing the samples in two ways, but the way duplicate sample names are identified and merged, which is extremely inefficient.

This looks like an O(N^2) approach, since Python's list.count() has to go through every item in the list and it gets called once per sample name:

peppy/peppy/project.py

Lines 637 to 649 in cac87fb

def _get_duplicated_sample_ids(self, sample_names_list: List) -> set:
    return set(
        [
            sample_id
            for sample_id in track(
                sample_names_list,
                description="Detecting duplicate sample names",
                disable=not (self.is_sample_table_large and self.progressbar),
                console=Console(file=sys.stderr),
            )
            if sample_names_list.count(sample_id) > 1
        ]
    )

and then samples are looped through again here:

) = self._get_duplicated_and_not_duplicated_samples(

and I think other times as well. So there are some algorithmic issues. This should be achievable in a single linear pass through the sample objects. It could probably be done very quickly using pandas (see the sketch below), but even with plain sample objects a single loop should work.
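
For instance, the pandas version could look something like this (a hypothetical sketch, not code from peppy; the helper name is made up for illustration):

import pandas as pd
from typing import List, Set

def get_duplicated_sample_ids_pandas(sample_names_list: List[str]) -> Set[str]:
    # Series.duplicated(keep=False) flags every occurrence of a repeated
    # name, so selecting those rows and taking the unique values gives
    # the set of duplicated sample IDs in one vectorized pass.
    names = pd.Series(sample_names_list)
    return set(names[names.duplicated(keep=False)].unique())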

Basically, just fixing the counting so the list is only traversed once would probably give a huge speed benefit, and should be really simple to implement.
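
As a minimal sketch of that single-pass counting idea (illustrative only, not the patch that was eventually merged), collections.Counter does the counting in one O(N) pass:

from collections import Counter
from typing import List, Set

def get_duplicated_sample_ids(sample_names_list: List[str]) -> Set[str]:
    # Count every name once, then keep the names that appear more than once.
    counts = Counter(sample_names_list)
    return {name for name, count in counts.items() if count > 1}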

nsheff changed the title from "Performance idea: find duplicate samples more efficiently." to "Performance idea: find duplicate samples more efficiently" on Mar 23, 2023

nsheff commented Mar 23, 2023

@neil-phan this is what we discussed

nsheff commented Mar 27, 2023

This is now released in 0.35.5.
