So different results #18

Closed
AnnaKholkina opened this issue Jul 11, 2022 · 8 comments · Fixed by #19

AnnaKholkina commented Jul 11, 2022

Hi! We compared the same two files with different versions of copydetect and got completely different similarity results. What is causing this?
Version 0.3.0:

>>> print(copydetect.__version__)
0.3.0
>>> fp1 = copydetect.CodeFingerprint("solution1.c", 25, 1)
>>> fp2 = copydetect.CodeFingerprint("solution2.c", 25, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
>>> print(similarities[0],similarities[1])
0.851063829787234 0.8439716312056738

Version 0.4.0:

>>> print(copydetect.__version__)
0.4.0
>>> fp1 = copydetect.CodeFingerprint("solution1.c", 25, 1)
>>> fp2 = copydetect.CodeFingerprint("solution2.c", 25, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
>>> print(similarities[0],similarities[1])
0.41025641025641024 0.41025641025641024
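
For context, the two values are directional: as I understand compare_files, similarities[0] is the fraction of the first file's fingerprinted code that is also found in the second file, and similarities[1] is the reverse. A minimal sketch with the same parameters as above:

import copydetect

fp1 = copydetect.CodeFingerprint("solution1.c", 25, 1)
fp2 = copydetect.CodeFingerprint("solution2.c", 25, 1)
token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
# similarities[0]: how much of solution1.c is matched in solution2.c
# similarities[1]: how much of solution2.c is matched in solution1.c
print(f"solution1 -> solution2: {similarities[0]:.2%}")
print(f"solution2 -> solution1: {similarities[1]:.2%}")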

solution1.c:

#include <stdio.h>
 int main() {
    int f, g;
    int tmp ;
    int a[10];
    
    // read in the number of values n
    f=0;
    // build an array of n numbers
    for(f = 0 ; f < 10; f++) { 
        a[f]=f;
    }
    for(f = 0 ; f < 9; f++)
       for(g = 0 ; g < 9- f  ; g++)
           if(a[g] > a[g+1]) {           
              // if they are in the wrong order,
              // swap them.
              tmp = a[g];
              a[g] = a[g+1] ;
              a[g+1] = tmp; 
           }
 }

solution2.c:

#include <stdio.h>
 int main() {
    int k, l;
    int tmp ;
    int a[10];
    // read in the number of values n

    // build an array of n numbers
    for(k = 0 ; k < 10; k++) { 
        a[k]=k;
    }
    for(k = 0 ; k < 9; k++) { 
       // compare two adjacent elements.
       for(l = 0 ; l < 9- k  ; l++) {  
           if(a[l] > a[l+1]) {           
              // if they are in the wrong order,
              // swap them.
              tmp = a[l];
              a[l] = a[l+1] ;
              a[l+1] = tmp; 
           }
        }
    }
 }

ghost commented Jul 11, 2022

Did you check what changed between versions in the comparison algorithm?
What exactly is the question: why do two different versions of the program give different results?
From what I see, version 0.4.0 gives a more accurate copy rate, which is a consequence of the logic updates in the newer version.


AnnaKholkina commented Jul 11, 2022

> Did you check what changed between versions in the comparison algorithm?
> What exactly is the question: why do two different versions of the program give different results?
> From what I see, version 0.4.0 gives a more accurate copy rate, which is a consequence of the logic updates in the newer version.

Yes, that part of the code has been changed, but how should we interpret the new result?
The fact is that the second file is plagiarized from the first, yet the reported similarity is quite low.

Release notes 0.4.0:
- Fix/feature: the similarity matrix is no longer necessarily square. There will no longer be large gaps when test files != reference files.
- Bug fix: similarity is now based on the number of fingerprints rather than the number of tokens. This improves detection for files with large amounts of duplication (e.g., XML files). (See the sketch below.)
- Feature: fp argument for CodeFingerprint: fingerprints can now be initialized with file pointers rather than just a file path.
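
To make that change concrete, here is a rough, self-contained sketch (my own illustration, not copydetect's actual implementation) of why the two denominators can differ so much: a "token"-based score divides the matched material by the number of k-gram hashes, while a "fingerprint"-based score divides the matched winnowed hashes by the number of winnowed hashes. The tokenizer and window size below are simplifications.

import re

def kgram_hashes(code, k=25):
    # crude whitespace/punctuation tokenization, for illustration only
    tokens = re.findall(r"\w+|\S", code)
    return [hash(tuple(tokens[i:i + k])) for i in range(len(tokens) - k + 1)]

def winnow(hashes, w=4):
    # keep the position of the minimum hash in every window of w hashes
    kept = set()
    for i in range(max(len(hashes) - w + 1, 1)):
        window = hashes[i:i + w]
        kept.add(i + window.index(min(window)))
    return [hashes[i] for i in sorted(kept)]

def one_sided_similarity(a, b):
    # fraction of hashes in a that also occur anywhere in b
    shared = set(a) & set(b)
    return sum(h in shared for h in a) / len(a)

h1 = kgram_hashes(open("solution1.c").read())
h2 = kgram_hashes(open("solution2.c").read())
print("over all k-gram hashes:    ", one_sided_similarity(h1, h2))
print("over winnowed fingerprints:", one_sided_similarity(winnow(h1), winnow(h2)))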


ghost commented Jul 11, 2022

@AnnaKholkina so it says the bug fix has improved "detection for files with large amounts of duplication".
As far as I can see, your samples are of that kind, and the new values say the files are ~84% duplicated. To me that sounds fairer than the old ~41%.


AnnaKholkina commented Jul 11, 2022

@alexmechanic The problem is just that 41% is the new result and 84% is the old one. We have more confidence in the old result.


ghost commented Jul 11, 2022

@AnnaKholkina oh well, I beg your pardon, I can see the issue now.

blingenf self-assigned this Jul 12, 2022
blingenf added the bug (Something isn't working) label Jul 12, 2022
blingenf (Owner) commented

Thanks for reporting this issue. I have a fix ready (#19) and will get a new release out this weekend after double checking that it's doing what it should.

I've also added a few sanity checks to the unit tests to make sure something like this doesn't escape notice again in the future.
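
Purely as an illustration (these are not the actual tests added to the repository, and the file paths are placeholders), a sanity check of that kind could be a small pytest asserting that two near-duplicate files score well above a plagiarism threshold:

import copydetect

def test_near_duplicates_score_high():
    # placeholder paths; any pair of near-identical source files would do
    fp1 = copydetect.CodeFingerprint("tests/data/solution1.c", 25, 1)
    fp2 = copydetect.CodeFingerprint("tests/data/solution2.c", 25, 1)
    _, similarities, _ = copydetect.compare_files(fp1, fp2)
    assert similarities[0] > 0.8
    assert similarities[1] > 0.8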

blingenf linked a pull request Jul 16, 2022 that will close this issue
blingenf (Owner) commented

The fix is now released as copydetect==0.4.2.

AnnaKholkina (Author) commented

Thank you for the response!
