Why are we ignoring the duplicate hashes #34

Closed

ramsane opened this issue Apr 2, 2023 · 8 comments · Fixed by #48

ramsane commented Apr 2, 2023

While I was trying to solve the common score problem (having a single score for each test_file), I realized that I need to keep the duplicate hashes. In the current code, at every step (both while removing the boilerplate and while finding the common hashes), only the unique hashes are returned.

Is this on purpose? Or did we face any issues while trying to keep the repeated hashes?

blingenf (Owner) commented Apr 2, 2023

This is done for efficiency reasons. Comparison between files is performed using set intersection, so all fingerprints must be unique.
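
For illustration, a minimal sketch of that kind of set-style comparison with NumPy (the example arrays are hypothetical, not the library's actual code):

import numpy as np

# hypothetical winnowed fingerprints for two files
hashes1 = np.array([11, 42, 97, 63])
hashes2 = np.array([42, 63, 7])

# np.intersect1d returns the unique common values; assume_unique=True
# skips an internal deduplication step, which is only valid because
# each fingerprint appears at most once per array
overlap, comm1, comm2 = np.intersect1d(hashes1, hashes2,
                                       assume_unique=True,
                                       return_indices=True)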

ramsane (Author) commented Apr 2, 2023

Okay. If that is the reason, we could add an option to include duplicates as well. This seems very important, because dropping them can lead to very poor highlighting of the code, even when the match percentage is high.

With a few small changes, the highlighting of the code improved drastically. I don't think this adds any significant performance overhead, though I still need to verify that.

Here are the before and after pictures.

[before screenshot: 2023-04-03-01-36-29-image]

[after screenshot: 2023-04-03-01-36-47-image]

Lowercasing the code

It also occurred to me that the case of the letters matters while generating the hashes. Before tokenizing, I think it is better to convert the code to lowercase (not in place) so that these mismatches do not occur. I'm aware that most programming languages are case sensitive, but since we are not compiling the code, this makes our algorithm a little more robust.

ramsane (Author) commented Apr 2, 2023

Here are the changes that I've tried:

...
..
def get_document_fingerprints(doc, k, window_size, boilerplate=[]):
    ..
    ..
    # keep duplicate hashes by turning off winnow's duplicate removal
    hashes, idx = winnow(hashed_kgrams(doc, k=k), window_size=window_size,
                         remove_duplicates=False)
    if len(boilerplate) > 0:
        ..
        ..
        # use a boolean mask from np.isin instead of np.intersect1d plus a
        # delete, because the mask preserves duplicates in the original
        # hashes array
        hash_mask = np.isin(hashes, boilerplate, invert=True)
        hashes = hashes[hash_mask]
        idx = idx[hash_mask]
    return hashes, idx

def find_fingerprint_overlap(hashes1, hashes2, idx1, idx2):
    ..
    ..
    hashes1_mask = np.isin(hashes1, hashes2)
    hashes2_mask = np.isin(hashes2, hashes1)
    return idx1[hashes1_mask], idx2[hashes2_mask]
    ..
..

For lowercasing the code, we can do it in filter_code() in utils.py:

   tokens = lexer.get_tokens(code.lower())

Hope this makes sense

blingenf (Owner) commented Apr 2, 2023

In theory, this approach should be much slower. Set intersection generally has a complexity of O(min(N, M)), whereas I believe each isin operation involves an O((N + M) log(N + M)) sort. That said, having looked into this briefly, I think numpy's intersect1d function isn't as efficient as it could be even when assume_unique=True. It might actually be better here to use plain Python sets.
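
For comparison, a tiny sketch of the plain-Python-set alternative mentioned above (hypothetical, not existing library code):

import numpy as np

hashes1 = np.array([11, 42, 97, 63])
hashes2 = np.array([42, 63, 7])

# building the sets is linear and the intersection only needs to probe
# the smaller set, so this avoids the sort-based path np.isin typically takes
overlap = set(hashes1.tolist()) & set(hashes2.tolist())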

For improving the highlighting, I think the most efficient solution would be to have a mapping from the unique hashes back to the full list (so a single unique hash could, for example, point back to multiple locations). Then all relevant areas could be highlighted as needed. This would allow the more efficient set operations to be used, though it would introduce some additional memory overhead for the bookkeeping.
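
A rough sketch of that bookkeeping idea (all names here are hypothetical, not part of the current codebase; hashes and idx are the NumPy arrays produced by winnowing):

from collections import defaultdict

def build_hash_index(hashes, idx):
    # map each unique fingerprint to every index it covers in the file
    positions = defaultdict(list)
    for h, i in zip(hashes.tolist(), idx.tolist()):
        positions[h].append(i)
    return positions

def expand_overlap(index1, index2):
    # intersect the unique fingerprints, then expand each shared
    # fingerprint back to all of its occurrences in both files
    shared = index1.keys() & index2.keys()
    idx1 = sorted(i for h in shared for i in index1[h])
    idx2 = sorted(i for h in shared for i in index2[h])
    return idx1, idx2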

Lowercasing is a good idea for case-insensitive languages, but it should happen after tokenization. Otherwise the lexer may fail to recognize keywords for case-sensitive languages.
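
A minimal sketch of lowercasing after lexing (assuming the Pygments lexer already used in filter_code(); the SQL example is just for illustration):

from pygments.lexers import get_lexer_by_name

code = "SELECT Name FROM Users WHERE Id = 1"
lexer = get_lexer_by_name("sql")

# lex the original text first so the lexer sees the real code,
# then lowercase only the token values used for fingerprinting
tokens = [(tok_type, value.lower())
          for tok_type, value in lexer.get_tokens(code)]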

ramsane (Author) commented Apr 3, 2023

For the duplicate hashes, I had a similar approach in mind, but it required more coding and the current one seemed easier. 😅 We should go with maintaining indices for each hash.

And for lowercasing the code, I think it makes sense. What was I thinking...

Now that the duplication of hashes is resolved, I think it is feasible to have an overall score for a test_file after comparing it with several reference files.

Once we have all the indices, getting the final score is as easy as:

slices = get_copied_slices(final_idx, k)
score = np.sum(slices[1] - slices[0]) / len(self.detector.file_data[test_f].raw_code)
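
A sketch of how that aggregation could look once the matched indices from every reference comparison are collected (the variable names and the import path are assumptions; get_copied_slices is assumed to group consecutive indices into start/end pairs as in utils.py):

import numpy as np
from copydetect.utils import get_copied_slices

k = 25  # hypothetical k-gram length
raw_code_length = 500  # length of the test file's raw code
# hypothetical per-reference index arrays for one test file
per_reference_idx = [np.array([10, 11, 12]), np.array([12, 13, 40])]

# merge and deduplicate the indices across all reference comparisons
final_idx = np.unique(np.concatenate(per_reference_idx))

slices = get_copied_slices(final_idx, k)
score = np.sum(slices[1] - slices[0]) / raw_code_length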

Is it possible to implement all of this in the existing code with minimal changes?

blingenf (Owner) commented Apr 6, 2023

It might be a little while before I get around to working on it, but this shouldn't be too bad of a refactor.

blingenf self-assigned this Apr 6, 2023
ankostis (Contributor)

Are we expecting any progress here?

blingenf (Owner)

I still intend to work on this; it's just a slightly larger refactor and I've been distracted by other projects. I'll try to get around to it soon.
