Why are we ignoring the duplicate hashes? #34
Comments
This is done for efficiency reasons. Comparison between files is performed using set intersection, so all fingerprints must be unique.
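A toy sketch of what that comparison looks like (illustrative only, not the project's actual code; the hash values are made up):

```python
# With unique fingerprints, overlap detection reduces to a plain set
# intersection, which is average-case O(min(N, M)) for sets of size N and M.
hashes1 = {101, 202, 303, 404}   # unique fingerprints of file 1
hashes2 = {202, 404, 505}        # unique fingerprints of file 2
shared = hashes1 & hashes2       # {202, 404}
```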
Okay. If that is the reason, we could add an option to include the duplicates as well. This seems very important, because ignoring them can lead to very poor highlighting of the code even when the match is high (in terms of percentage). With the addition of small changes, the highlighting improved drastically. I don't think this will add any significant performance overhead, though I need to check that. Here are the before and after pictures.

Lowercasing the code

It also occurred to me that while generating the hashes, the case of the letters matters. Before tokenizing, I think it is better to convert the code to lowercase (not in place) so that these cases do not occur. I'm aware that most programming languages are case sensitive, but since we are not compiling the code, adding this makes our algorithm a little more robust.
Here are the changes that I've tried:

```python
...
def get_document_fingerprints(doc, k, window_size, boilerplate=[]):
    ...
    # changing the option of removing duplicates to False
    hashes, idx = winnow(hashed_kgrams(doc, k=k), window_size=window_size,
                         remove_duplicates=False)
    if len(boilerplate) > 0:
        ...
        # Doing it this way instead of np.intersect1d followed by a delete
        # because it preserves the duplicates in the original array (hashes)
        hash_mask = np.isin(hashes, boilerplate, invert=True)
        hashes = hashes[hash_mask]
        idx = idx[hash_mask]
    return hashes, idx


def find_fingerprint_overlap(hashes1, hashes2, idx1, idx2):
    ...
    hashes1_mask = np.isin(hashes1, hashes2)
    hashes2_mask = np.isin(hashes2, hashes1)
    return idx1[hashes1_mask], idx2[hashes2_mask]
...
```

For lowering the case, we can do it with `tokens = lexer.get_tokens(code.lower())`. Hope this makes sense.
In theory, this approach should be much slower. Set intersection generally has a complexity of O(min(N, M)), whereas I believe each `np.isin` call will be considerably more expensive.

For improving the highlighting, I think the most efficient solution would be to have a mapping from the unique hashes back to the full list (so a single unique hash, for example, could point back to multiple locations). Then all relevant areas could be highlighted as needed. This would allow the more efficient set operations to be used, though it would introduce some additional memory requirements through the bookkeeping.

Lowercasing is a good idea for non-case-sensitive languages, but that should happen after tokenization. Otherwise the lexer may fail to recognize keywords for case-sensitive languages.
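A rough sketch of the bookkeeping described above (illustrative only, not an actual implementation; the helper names `build_hash_locations` and `find_fingerprint_overlap_with_duplicates` are made up, and `hashes`/`idx` are assumed to be the parallel arrays produced by winnowing, as in the snippet earlier in the thread):

```python
import numpy as np

def build_hash_locations(hashes, idx):
    """Map each unique fingerprint hash to every index it occurs at."""
    locations = {}
    for h, i in zip(hashes.tolist(), idx.tolist()):
        locations.setdefault(h, []).append(i)
    return locations

def find_fingerprint_overlap_with_duplicates(hashes1, hashes2, idx1, idx2):
    """Intersect the unique hashes, then expand back to all occurrences."""
    loc1 = build_hash_locations(hashes1, idx1)
    loc2 = build_hash_locations(hashes2, idx2)
    shared = loc1.keys() & loc2.keys()  # set intersection on unique hashes
    if not shared:
        return np.array([], dtype=int), np.array([], dtype=int)
    out1 = np.sort(np.concatenate([loc1[h] for h in shared]))
    out2 = np.sort(np.concatenate([loc2[h] for h in shared]))
    return out1, out2
```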
For duplicate hashes, I had a similar approach in mind, but it required more coding and the current one seemed easier. 😅 We should go with maintaining indices for each hash. And lowercasing after tokenization makes sense; what was I thinking...

Now that the duplication of hashes is resolved, I think it is feasible to have an overall score for a test_file after comparing it with several reference files. Once we have all the indices, getting the final score is as easy as:

```python
slices = get_copied_slices(final_idx, k)
score = np.sum(slices[1] - slices[0]) / len(self.detector.file_data[test_f].raw_code)
```

Is it possible to implement all of this in the existing code with minimal changes?
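A hypothetical sketch of how that per-test-file score could be assembled, assuming `get_copied_slices(idx, k)` returns the `[starts, stops]` array used above and that the overlap indices from each reference comparison have already been collected; the function name `overall_score` is made up:

```python
import numpy as np

# get_copied_slices is assumed to be in scope, as in the snippet above.

def overall_score(all_idx, k, raw_code_length):
    """Combine overlap indices from several reference files into one score."""
    if not all_idx:
        return 0.0
    # merge and deduplicate the fingerprint indices from every comparison
    final_idx = np.unique(np.concatenate(all_idx))
    slices = get_copied_slices(final_idx, k)  # [starts, stops] of copied regions
    return np.sum(slices[1] - slices[0]) / raw_code_length
```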
It might be a little while before I get around to working on it, but this shouldn't be too bad of a refactor.
Are we expecting any progress here?
I still intend to work on this; it's just a slightly larger refactor and I've been distracted by other projects. I'll try to get around to it soon.
While I was trying to solve the common-score problem (having a single score for each test_file), I figured that I have to keep the duplicate hashes. In the current code, at every step (while removing the boilerplate and while finding the common hashes), we keep only the unique hashes.

Is this on purpose? Or did we face any issues while trying to keep the repeated hashes?