Why are we ignoring the duplicate hashes #34

Closed

ramsane opened this issue Apr 2, 2023 · 8 comments · Fixed by #48

ramsane commented Apr 2, 2023

While I was trying to solve the common score problem (having a single score for each test_file), I realized that I need to keep the duplicate hashes. In the current code, at every step (both while removing the boilerplate and while finding the common hashes), only the unique hashes are returned.

Is this on purpose? Or did we face any issues while trying to keep the repeated hashes?

blingenf (Owner) commented Apr 2, 2023

This is done for efficiency reasons. Comparison between files is performed using set intersection, so all fingerprints must be unique.
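
For illustration, a minimal sketch of that kind of set-style comparison with NumPy (the example arrays are hypothetical, not the library's actual code):

import numpy as np

# hypothetical winnowed fingerprints for two files
hashes1 = np.array([11, 42, 97, 63])
hashes2 = np.array([42, 63, 7])

# np.intersect1d returns the unique common values; assume_unique=True
# skips an internal deduplication step, which is only valid because
# each fingerprint appears at most once per array
overlap, comm1, comm2 = np.intersect1d(hashes1, hashes2,
                                       assume_unique=True,
                                       return_indices=True)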

ramsane (Author) commented Apr 2, 2023

Okay. If that is the reason, we could add an option to include duplicates as well. This seems very important, because dropping them can lead to very poor highlighting of the code, even when the match percentage is high.

With a few small changes, the highlighting of the code improved drastically. I don't think this adds any significant performance overhead, though I still need to verify that.

Here are the before and after pictures.

[before screenshot: 2023-04-03-01-36-29-image]

[after screenshot: 2023-04-03-01-36-47-image]

Lowercasing the code

It also occurred to me that the case of the letters matters while generating the hashes. Before tokenizing, I think it is better to convert the code to lowercase (not in place) so that these mismatches do not occur. I'm aware that most programming languages are case sensitive, but since we are not compiling the code, this makes our algorithm a little more robust.

ramsane (Author) commented Apr 2, 2023

Here are the changes that I've tried:

...
..
def get_document_fingerprints(doc, k, window_size, boilerplate=[]):
    ..
    ..
    # keep duplicate hashes by turning off winnow's duplicate removal
    hashes, idx = winnow(hashed_kgrams(doc, k=k), window_size=window_size,
                         remove_duplicates=False)
    if len(boilerplate) > 0:
        ..
        ..
        # use a boolean mask from np.isin instead of np.intersect1d plus a
        # delete, because the mask preserves duplicates in the original
        # hashes array
        hash_mask = np.isin(hashes, boilerplate, invert=True)
        hashes = hashes[hash_mask]
        idx = idx[hash_mask]
    return hashes, idx

def find_fingerprint_overlap(hashes1, hashes2, idx1, idx2):
    ..
    ..
    hashes1_mask = np.isin(hashes1, hashes2)
    hashes2_mask = np.isin(hashes2, hashes1)
    return idx1[hashes1_mask], idx2[hashes2_mask]
    ..
..

For lowercasing the code, we can do it in filter_code() in utils.py:

   tokens = lexer.get_tokens(code.lower())

Hope this makes sense

blingenf (Owner) commented Apr 2, 2023

In theory, this approach should be much slower. Set intersection generally has a complexity of O(min(N, M)), whereas I believe each isin operation involves an O((N + M) log(N + M)) sort. That said, having looked into this briefly, I think numpy's intersect1d function isn't as efficient as it could be even when assume_unique=True. It might actually be better here to use plain Python sets.
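
For comparison, a tiny sketch of the plain-Python-set alternative mentioned above (hypothetical, not existing library code):

import numpy as np

hashes1 = np.array([11, 42, 97, 63])
hashes2 = np.array([42, 63, 7])

# building the sets is linear and the intersection only needs to probe
# the smaller set, so this avoids the sort-based path np.isin typically takes
overlap = set(hashes1.tolist()) & set(hashes2.tolist())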

For improving the highlighting, I think the most efficient solution would be to have a mapping from the unique hashes back to the full list (so a single unique hash could, for example, point back to multiple locations). Then all relevant areas could be highlighted as needed. This would allow the more efficient set operations to be used, though it would introduce some additional memory overhead for the bookkeeping.
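
A rough sketch of that bookkeeping idea (all names here are hypothetical, not part of the current codebase; hashes and idx are the NumPy arrays produced by winnowing):

from collections import defaultdict

def build_hash_index(hashes, idx):
    # map each unique fingerprint to every index it covers in the file
    positions = defaultdict(list)
    for h, i in zip(hashes.tolist(), idx.tolist()):
        positions[h].append(i)
    return positions

def expand_overlap(index1, index2):
    # intersect the unique fingerprints, then expand each shared
    # fingerprint back to all of its occurrences in both files
    shared = index1.keys() & index2.keys()
    idx1 = sorted(i for h in shared for i in index1[h])
    idx2 = sorted(i for h in shared for i in index2[h])
    return idx1, idx2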

Lowercasing is a good idea for case-insensitive languages, but it should happen after tokenization. Otherwise the lexer may fail to recognize keywords for case-sensitive languages.
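
A minimal sketch of lowercasing after lexing (assuming the Pygments lexer already used in filter_code(); the SQL example is just for illustration):

from pygments.lexers import get_lexer_by_name

code = "SELECT Name FROM Users WHERE Id = 1"
lexer = get_lexer_by_name("sql")

# lex the original text first so the lexer sees the real code,
# then lowercase only the token values used for fingerprinting
tokens = [(tok_type, value.lower())
          for tok_type, value in lexer.get_tokens(code)]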

ramsane (Author) commented Apr 3, 2023

For the duplicate hashes, I had a similar approach in mind, but it required more coding and the current one seemed easier. 😅 We should go with maintaining indices for each hash.

And for lowercasing the code, I think it makes sense. What was I thinking...

Now that the duplication of hashes is resolved, I think it is feasible to have an overall score for a test_file after comparing it with several reference files.

Once we have all the indices, getting the final score is as easy as:

slices = get_copied_slices(final_idx, k)
score = np.sum(slices[1] - slices[0]) / len(self.detector.file_data[test_f].raw_code)
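
A sketch of how that aggregation could look once the matched indices from every reference comparison are collected (the variable names and the import path are assumptions; get_copied_slices is assumed to group consecutive indices into start/end pairs as in utils.py):

import numpy as np
from copydetect.utils import get_copied_slices

k = 25  # hypothetical k-gram length
raw_code_length = 500  # length of the test file's raw code
# hypothetical per-reference index arrays for one test file
per_reference_idx = [np.array([10, 11, 12]), np.array([12, 13, 40])]

# merge and deduplicate the indices across all reference comparisons
final_idx = np.unique(np.concatenate(per_reference_idx))

slices = get_copied_slices(final_idx, k)
score = np.sum(slices[1] - slices[0]) / raw_code_length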

Is it possible to implement all of this in the existing code with minimal changes?

blingenf (Owner) commented Apr 6, 2023

It might be a little while before I get around to working on it, but this shouldn't be too bad of a refactor.

blingenf self-assigned this Apr 6, 2023
ankostis (Contributor)

Are we expecting any progress here?

blingenf (Owner)

I still intend to work on this; it's just a slightly larger refactor and I've been distracted by other projects. I'll try to get around to it soon.
