So different results #18

Closed
AnnaKholkina opened this issue Jul 11, 2022 · 8 comments · Fixed by #19

AnnaKholkina commented Jul 11, 2022

Hi! We compared the same two files with different versions of copydetect and got completely different similarity results. What is causing this?
Version 0.3.0:

>>> print(copydetect.__version__)
0.3.0
>>> fp1 = copydetect.CodeFingerprint("solution1.c", 25, 1)
>>> fp2 = copydetect.CodeFingerprint("solution2.c", 25, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
>>> print(similarities[0],similarities[1])
0.851063829787234 0.8439716312056738

Version 0.4.0:

>>> print(copydetect.__version__)
0.4.0
>>> fp1 = copydetect.CodeFingerprint("solution1.c", 25, 1)
>>> fp2 = copydetect.CodeFingerprint("solution2.c", 25, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
>>> print(similarities[0],similarities[1])
0.41025641025641024 0.41025641025641024
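
For context, the two values are directional: as I understand compare_files, similarities[0] is the fraction of the first file's fingerprinted code that is also found in the second file, and similarities[1] is the reverse. A minimal sketch with the same parameters as above:

import copydetect

fp1 = copydetect.CodeFingerprint("solution1.c", 25, 1)
fp2 = copydetect.CodeFingerprint("solution2.c", 25, 1)
token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
# similarities[0]: how much of solution1.c is matched in solution2.c
# similarities[1]: how much of solution2.c is matched in solution1.c
print(f"solution1 -> solution2: {similarities[0]:.2%}")
print(f"solution2 -> solution1: {similarities[1]:.2%}")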

solution1.c:

#include <stdio.h>
 int main() {
    int f, g;
    int tmp ;
    int a[10];
    
    // read in the number of values n
    f=0;
    // build an array of n numbers
    for(f = 0 ; f < 10; f++) { 
        a[f]=f;
    }
    for(f = 0 ; f < 9; f++)
       for(g = 0 ; g < 9- f  ; g++)
           if(a[g] > a[g+1]) {           
              // if they are in the wrong order,
              // swap them.
              tmp = a[g];
              a[g] = a[g+1] ;
              a[g+1] = tmp; 
           }
 }

solution2.c:

#include <stdio.h>
 int main() {
    int k, l;
    int tmp ;
    int a[10];
    // read in the number of values n

    // build an array of n numbers
    for(k = 0 ; k < 10; k++) { 
        a[k]=k;
    }
    for(k = 0 ; k < 9; k++) { 
       // compare two adjacent elements.
       for(l = 0 ; l < 9- k  ; l++) {  
           if(a[l] > a[l+1]) {           
              // if they are in the wrong order,
              // swap them.
              tmp = a[l];
              a[l] = a[l+1] ;
              a[l+1] = tmp; 
           }
        }
    }
 }

ghost commented Jul 11, 2022

Did you check what changed between versions in the comparison algorithm?
What exactly is the question: why do two different versions of the program give different results?
From what I see, version 0.4.0 gives a more accurate copy rate, which is a consequence of the logic updates in the newer version.


AnnaKholkina commented Jul 11, 2022

> Did you check what changed between versions in the comparison algorithm?
> What exactly is the question: why do two different versions of the program give different results?
> From what I see, version 0.4.0 gives a more accurate copy rate, which is a consequence of the logic updates in the newer version.

Yes, that part of the code has been changed, but how should we interpret the new result?
The fact is that the second file is plagiarized from the first, yet the reported similarity is quite low.

Release notes 0.4.0:
- Fix/feature: the similarity matrix is no longer necessarily square. There will no longer be large gaps when test files != reference files.
- Bug fix: similarity is now based on the number of fingerprints rather than the number of tokens. This improves detection for files with large amounts of duplication (e.g., XML files). (See the sketch below.)
- Feature: fp argument for CodeFingerprint: fingerprints can now be initialized with file pointers rather than just a file path.
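
To make that change concrete, here is a rough, self-contained sketch (my own illustration, not copydetect's actual implementation) of why the two denominators can differ so much: a "token"-based score divides the matched material by the number of k-gram hashes, while a "fingerprint"-based score divides the matched winnowed hashes by the number of winnowed hashes. The tokenizer and window size below are simplifications.

import re

def kgram_hashes(code, k=25):
    # crude whitespace/punctuation tokenization, for illustration only
    tokens = re.findall(r"\w+|\S", code)
    return [hash(tuple(tokens[i:i + k])) for i in range(len(tokens) - k + 1)]

def winnow(hashes, w=4):
    # keep the position of the minimum hash in every window of w hashes
    kept = set()
    for i in range(max(len(hashes) - w + 1, 1)):
        window = hashes[i:i + w]
        kept.add(i + window.index(min(window)))
    return [hashes[i] for i in sorted(kept)]

def one_sided_similarity(a, b):
    # fraction of hashes in a that also occur anywhere in b
    shared = set(a) & set(b)
    return sum(h in shared for h in a) / len(a)

h1 = kgram_hashes(open("solution1.c").read())
h2 = kgram_hashes(open("solution2.c").read())
print("over all k-gram hashes:    ", one_sided_similarity(h1, h2))
print("over winnowed fingerprints:", one_sided_similarity(winnow(h1), winnow(h2)))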


ghost commented Jul 11, 2022

@AnnaKholkina so it says the bug fix has improved "detection for files with large amounts of duplication".
As far as I can see, your samples are of that kind, and the new values say the files are ~84% duplicated. To me that sounds fairer than the old ~41%.


AnnaKholkina commented Jul 11, 2022

@alexmechanic The problem is just that 41% is the new result and 84% is the old one. We have more confidence in the old result.


ghost commented Jul 11, 2022

@AnnaKholkina oh well, I beg your pardon, I can see the issue now.

blingenf self-assigned this Jul 12, 2022
blingenf added the bug (Something isn't working) label Jul 12, 2022
blingenf (Owner) commented

Thanks for reporting this issue. I have a fix ready (#19) and will get a new release out this weekend after double checking that it's doing what it should.

I've also added a few sanity checks to the unit tests to make sure something like this doesn't escape notice again in the future.
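
Purely as an illustration (these are not the actual tests added to the repository, and the file paths are placeholders), a sanity check of that kind could be a small pytest asserting that two near-duplicate files score well above a plagiarism threshold:

import copydetect

def test_near_duplicates_score_high():
    # placeholder paths; any pair of near-identical source files would do
    fp1 = copydetect.CodeFingerprint("tests/data/solution1.c", 25, 1)
    fp2 = copydetect.CodeFingerprint("tests/data/solution2.c", 25, 1)
    _, similarities, _ = copydetect.compare_files(fp1, fp2)
    assert similarities[0] > 0.8
    assert similarities[1] > 0.8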

blingenf linked a pull request Jul 16, 2022 that will close this issue
blingenf (Owner) commented

The fix is now released as copydetect==0.4.2.

AnnaKholkina (Author) commented

Thank you for the response!
