De-anonymizing Programmers via Code Stylometry

This project is a Python implementation of the work by Caliskan et al., applied to the Java language.

Installation

Clone the repo and install the Python packages:

git clone https://github.com/rebryk/code_stylometry.git
cd code_stylometry
pip install -r requirements.txt

Usage

Dataset

The training and validation data consist of source code for solutions to programming tasks from Google Code Jam, an international programming competition.

data/metadata.json contains a description of the rounds and problems that were used to create the dataset.

To download the corpus to your machine, run the following commands:

cd data/
python crawler.py

This downloads the Java solution files into the data/codejam directory. File paths look like data/codejam/<round_id>/<problem_id>/<username>.java
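Because the author name is encoded in the file name, the downloaded corpus can be grouped by author straight from the paths. The following is a minimal sketch under that assumption; `group_solutions_by_author` is a hypothetical helper written for illustration and is not part of this repository.

```python
from collections import defaultdict
from pathlib import Path


def group_solutions_by_author(root="data/codejam"):
    """Map each username to the list of that user's .java solution paths.

    Hypothetical helper: it assumes the layout described above,
    <root>/<round_id>/<problem_id>/<username>.java
    """
    solutions = defaultdict(list)
    # Exactly two directory levels (round, problem) sit above each file.
    for path in sorted(Path(root).glob("*/*/*.java")):
        # The file name without its .java suffix is the username.
        solutions[path.stem].append(str(path))
    return dict(solutions)
```

The resulting dictionary can be used to keep only authors with enough solutions and to build the label for each file.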

Features

The features package contains a few useful functions:

  • calculate_features_for_files(files)
    This function calculates sets of features for the given source files.
    Usage example: samples = calculate_features_for_files(['A.java', 'B.java'])
  • build_dataset(samples)
    Builds a pandas data frame from the given list of feature sets.
    Usage example: df = build_dataset(samples)
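The two functions above are meant to be chained. The sketch below shows one way to do that; the import path `from features import ...` is an assumption based on the package name mentioned above, and `build_feature_frame` is a hypothetical wrapper, not part of the repository.

```python
def build_feature_frame(files):
    """Compute features for the given .java files and return them as one dataset.

    Hypothetical wrapper around the two functions described above.
    """
    # Assumption: both functions are importable from the top-level
    # `features` package when run from the repository root.
    from features import build_dataset, calculate_features_for_files

    samples = calculate_features_for_files(files)  # one feature set per file
    return build_dataset(samples)  # pandas data frame, per the description above
```

Called from the repository root, `build_feature_frame(['A.java', 'B.java'])` would be equivalent to running the two usage examples above in sequence.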

Training and validation

You can open De-anonymizing Programmers via Code Stylometry.ipynb to see how the methods above are used to de-anonymize 100 users with 9 code files each.

CatBoost is used to train the model.
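As a rough illustration of how such a model could be trained and validated, the sketch below holds out a fixed number of files per author and fits a CatBoost classifier on the rest. The hold-out scheme, the function names, and the default CatBoost settings are assumptions made for this example; the notebook may use a different protocol. `features` is assumed to be a list of numeric feature rows and `authors` the matching list of usernames.

```python
from collections import defaultdict


def split_per_author(authors, n_validation=1):
    """Return (train_idx, valid_idx), holding out each author's last n_validation samples."""
    if n_validation < 1:
        raise ValueError("n_validation must be at least 1")
    by_author = defaultdict(list)
    for index, author in enumerate(authors):
        by_author[author].append(index)

    train_idx, valid_idx = [], []
    for author, indices in by_author.items():
        if len(indices) <= n_validation:
            raise ValueError(f"not enough samples for author {author!r}")
        train_idx.extend(indices[:-n_validation])
        valid_idx.extend(indices[-n_validation:])
    return sorted(train_idx), sorted(valid_idx)


def train_and_score(features, authors, n_validation=1):
    """Fit CatBoost on the training split and return (model, validation accuracy)."""
    # Imported lazily so the split helper works without catboost installed.
    from catboost import CatBoostClassifier

    train_idx, valid_idx = split_per_author(authors, n_validation)
    model = CatBoostClassifier(loss_function="MultiClass", verbose=False)
    model.fit([features[i] for i in train_idx], [authors[i] for i in train_idx])
    accuracy = model.score(
        [features[i] for i in valid_idx], [authors[i] for i in valid_idx]
    )
    return model, accuracy
```

Holding out files per author keeps every author represented in both splits, which matters when each author has only a handful of files.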

License

MIT