Performance idea: find duplicate samples more efficiently #436

Closed
nsheff opened this issue Mar 23, 2023 · 2 comments

nsheff commented Mar 23, 2023

Raised in #432 (comment)

I actually think the most important performance-related problem is not storing the samples in two ways, but the way duplicate sample names are identified and merged, which is extremely inefficient.

This looks like an O(N^2) approach, since Python's list.count() has to go through every item in the list and it gets called once per sample name:

peppy/peppy/project.py

Lines 637 to 649 in cac87fb

def _get_duplicated_sample_ids(self, sample_names_list: List) -> set:
    return set(
        [
            sample_id
            for sample_id in track(
                sample_names_list,
                description="Detecting duplicate sample names",
                disable=not (self.is_sample_table_large and self.progressbar),
                console=Console(file=sys.stderr),
            )
            if sample_names_list.count(sample_id) > 1
        ]
    )

and then samples are looped through again here:

) = self._get_duplicated_and_not_duplicated_samples(

and I think other times as well. So there are some algorithmic issues. This should be achievable in a single linear pass through the sample objects. It could probably be done very quickly using pandas (see the sketch below), but even with plain sample objects a single loop should work.
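
For instance, the pandas version could look something like this (a hypothetical sketch, not code from peppy; the helper name is made up for illustration):

import pandas as pd
from typing import List, Set

def get_duplicated_sample_ids_pandas(sample_names_list: List[str]) -> Set[str]:
    # Series.duplicated(keep=False) flags every occurrence of a repeated
    # name, so selecting those rows and taking the unique values gives
    # the set of duplicated sample IDs in one vectorized pass.
    names = pd.Series(sample_names_list)
    return set(names[names.duplicated(keep=False)].unique())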

Basically, just fixing the counting so the list is only traversed once would probably give a huge speed benefit, and should be really simple to implement.
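
As a minimal sketch of that single-pass counting idea (illustrative only, not the patch that was eventually merged), collections.Counter does the counting in one O(N) pass:

from collections import Counter
from typing import List, Set

def get_duplicated_sample_ids(sample_names_list: List[str]) -> Set[str]:
    # Count every name once, then keep the names that appear more than once.
    counts = Counter(sample_names_list)
    return {name for name, count in counts.items() if count > 1}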

nsheff changed the title from "Performance idea: find duplicate samples more efficiently." to "Performance idea: find duplicate samples more efficiently" on Mar 23, 2023

nsheff commented Mar 23, 2023

@neil-phan this is what we discussed

nsheff commented Mar 27, 2023

This is now released in 0.35.5.
