De-anonymizing Programmers via Code Stylometry

This project is a Python implementation of the code stylometry approach of Caliskan et al., applied to Java source code.

Installation

Clone the repository and install the required Python packages:

git clone https://github.com/rebryk/code_stylometry.git
cd code_stylometry
pip install -r requirements.txt

Usage

Dataset

The training and validation data consist of the source code of solutions to programming tasks from Google Code Jam, an international programming competition.

data/metadata.json describes the rounds and problems that were used to create the dataset.

To download the corpus to your machine, run:

cd data/
python crawler.py

This downloads the Java solution files into the data/codejam directory. File paths look like data/codejam/<round_id>/<problem_id>/<username>.java
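Because the author of each solution is encoded in the file name, the path layout above is enough to recover labels for training. The following is a minimal sketch, not code from this repository: the helper names parse_solution_path and group_by_author are hypothetical, and it assumes every path ends in <round_id>/<problem_id>/<username>.java.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def parse_solution_path(path):
    """Split .../<round_id>/<problem_id>/<username>.java into its three parts."""
    p = PurePosixPath(path)
    if p.suffix != ".java" or len(p.parts) < 3:
        raise ValueError(f"unexpected solution path: {path}")
    round_id, problem_id = p.parts[-3], p.parts[-2]
    # The file stem (name without ".java") is taken to be the username.
    return round_id, problem_id, p.stem


def group_by_author(paths):
    """Map each username to the sorted list of that user's solution files."""
    by_author = defaultdict(list)
    for path in paths:
        _, _, username = parse_solution_path(path)
        by_author[username].append(path)
    return {user: sorted(files) for user, files in by_author.items()}
```

Grouping by username makes it easy to check how many solutions each author has before building a dataset.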

Features

The features package contains a few useful functions:

calculate_features_for_files(files)
Calculates a feature set for each of the given source files.
Usage example: samples = calculate_features_for_files(['A.java', 'B.java'])
build_dataset(samples)
Builds a pandas data frame from the given list of feature sets.
Usage example: df = build_dataset(samples)
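To give a sense of what a feature set for one source file can contain, here is a small illustrative sketch of layout-style features. It is an assumption for illustration only: toy_layout_features is a hypothetical function, and the actual features computed by the features package may differ and are likely much richer.

```python
def toy_layout_features(source):
    """Compute a few illustrative layout features from Java source text.

    Hypothetical example; not the feature set used by this repository.
    """
    lines = source.splitlines()
    n_chars = max(len(source), 1)  # avoid division by zero on empty input
    n_lines = max(len(lines), 1)
    return {
        # Share of characters that are tabs or spaces (indentation habits).
        "tab_ratio": source.count("\t") / n_chars,
        "space_ratio": source.count(" ") / n_chars,
        # Share of lines that are blank.
        "empty_line_ratio": sum(1 for line in lines if not line.strip()) / n_lines,
        "avg_line_length": sum(len(line) for line in lines) / n_lines,
        # Share of lines consisting only of an opening brace.
        "brace_on_new_line": sum(1 for line in lines if line.strip() == "{") / n_lines,
    }
```

Each file yields one dictionary of numeric values, which is the kind of per-file record that can be assembled into a table with one row per file.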

Training and validation

Open De-anonymizing Programmers via Code Stylometry.ipynb to see how the functions above are used to de-anonymize 100 users with 9 code files each.

CatBoost is used to train the model.
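An experiment with a fixed number of users and files per user needs a balanced subset of the corpus and a way to hold out files for validation. The sketch below shows one possible way to do this, assuming a dictionary that maps usernames to file lists; select_authors and leave_one_out_folds are hypothetical helpers, and whether the notebook uses this particular validation scheme is an assumption.

```python
import random


def select_authors(files_by_author, n_authors=100, files_per_author=9, seed=0):
    """Pick n_authors users with enough files, keeping files_per_author files each."""
    rng = random.Random(seed)  # fixed seed for a reproducible selection
    eligible = sorted(
        user for user, files in files_by_author.items()
        if len(files) >= files_per_author
    )
    if len(eligible) < n_authors:
        raise ValueError(f"only {len(eligible)} users have enough files")
    chosen = rng.sample(eligible, n_authors)
    return {
        user: rng.sample(sorted(files_by_author[user]), files_per_author)
        for user in chosen
    }


def leave_one_out_folds(selected):
    """Yield (train, validation) lists of (path, author) pairs.

    Fold k holds out the k-th file of every author, so each author
    appears in both the training and the validation part.
    """
    n_folds = min(len(files) for files in selected.values())
    for k in range(n_folds):
        train, valid = [], []
        for author, files in selected.items():
            for i, path in enumerate(files):
                (valid if i == k else train).append((path, author))
        yield train, valid
```

With 100 users and 9 files each, this produces 9 folds of 800 training files and 100 validation files; the feature rows for each part can then be passed to the classifier.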

License

MIT
