IN5000 TU Delft - MSc Computer Science
This repository contains the scripts to reproduce and extend the work for the MSc thesis in Computer Science on developing a framework for identifying the evolution patterns of open-source software projects.
The MSc Thesis paper is available here.
- Python >= 3.10
- MongoDB (another database can be used, but the scripts must be adapted accordingly)
Run in the terminal:
# Scripts requirements
pip install -r requirements.txt
# Scripts + code formatting requirements
pip install -r requirements-ci.txt
The following environment variables must be set in order to run the scripts:
GITHUB_AUTH_TOKEN
MONGODB_HOST
MONGODB_PORT
MONGODB_DATABASE
MONGODB_USER
MONGODB_PASSWORD
MONGODB_QPARAMS
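As a sketch of how these variables fit together, the MongoDB settings can be assembled into a standard connection URI. The helper below is illustrative and not part of the repository's scripts:

```python
import os


def build_mongo_uri() -> str:
    """Assemble a MongoDB connection URI from the environment variables.

    `build_mongo_uri` is a hypothetical helper, not part of the repository.
    """
    user = os.environ["MONGODB_USER"]
    password = os.environ["MONGODB_PASSWORD"]
    host = os.environ["MONGODB_HOST"]
    port = os.environ["MONGODB_PORT"]
    database = os.environ["MONGODB_DATABASE"]
    qparams = os.environ.get("MONGODB_QPARAMS", "")
    uri = f"mongodb://{user}:{password}@{host}:{port}/{database}"
    return f"{uri}?{qparams}" if qparams else uri
```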
The data mining uses the GitHub API to gather repository data. The API version used for this project is 2022-11-28:
https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28
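The API version is pinned per request via the `X-GitHub-Api-Version` header, as documented by GitHub. A minimal sketch (the helper function itself is illustrative, not part of the mining scripts):

```python
GITHUB_API_VERSION = "2022-11-28"


def github_headers(token: str) -> dict:
    """Build request headers that pin the GitHub REST API version.

    The header names follow the GitHub REST API documentation; the
    helper itself is illustrative and not part of the mining scripts.
    """
    return {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
        "X-GitHub-Api-Version": GITHUB_API_VERSION,
    }
```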
Move to the data_mining folder and run:
# Move to the data_mining folder
cd data_mining
# Run GitHub repositories mining
python repositories.py
# Run GitHub statistics gathering
python statistics.py
# In case of missing statistics, fill the gaps with zeros
python statistics_fill_gaps.py
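The gap-filling step can be pictured as reindexing each repository's monthly statistics over the full observation window and inserting zeros for missing months. A minimal sketch, assuming statistics keyed by "YYYY-MM" strings (the actual data shape in the scripts may differ):

```python
def fill_monthly_gaps(stats: dict, start: str, end: str) -> dict:
    """Fill missing "YYYY-MM" months with zeros between start and end (inclusive).

    A simplified stand-in for statistics_fill_gaps.py.
    """
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    filled = {}
    while (year, month) <= (end_year, end_month):
        key = f"{year:04d}-{month:02d}"
        filled[key] = stats.get(key, 0)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return filled
```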
Move to the data_processing folder and run the following scripts:
# Move to the data_processing folder
cd data_processing
# Create the patterns clustering model (saved in the models/phases folder)
python time_series_phases.py
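Phase modeling splits each metric time series at break points into a sequence of patterns. The thesis's exact method is not shown here; as a toy stand-in, break points can be taken where the slope of the series changes sign:

```python
def slope_sign_breakpoints(series: list) -> list:
    """Return indices where the slope of a metric time series changes sign.

    A toy stand-in for the phase/break-point modeling in
    time_series_phases.py; the real method may differ.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    breakpoints = []
    prev = None
    for i in range(1, len(series)):
        cur = sign(series[i] - series[i - 1])
        if prev is not None and cur != prev and cur != 0:
            breakpoints.append(i - 1)
        if cur != 0:
            prev = cur
    return breakpoints
```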
# Move to the data_processing folder
cd data_processing
# Cluster repositories based on their metrics phases and metrics patterns similarity
# The clustering model is saved in the models/clustering folder
python time_series_clustering.py
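The clustering idea can be sketched with a simple per-position mismatch distance between equal-length pattern sequences and greedy threshold grouping. This is a simplified illustration, not the actual clustering model:

```python
def pattern_distance(a: list, b: list) -> float:
    """Fraction of positions where two equal-length pattern sequences disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_by_threshold(patterns: dict, max_dist: float) -> list:
    """Greedily group repositories whose pattern distance stays under max_dist.

    A simplified stand-in for the model built by time_series_clustering.py.
    """
    clusters = []
    for name, seq in patterns.items():
        for cluster in clusters:
            representative = patterns[cluster[0]]
            if pattern_distance(seq, representative) <= max_dist:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```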
# Move to the data_processing folder
cd data_processing
# Create the forecasting models for each metric
# The forecasting models are saved in the models/forecasting folder
# One subfolder will be present for each cluster
python time_series_forecasting.py
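As a naive baseline for what a per-cluster forecasting model does, the next values of a metric can be forecast as the mean of a trailing window. This is only a baseline sketch, not the models trained by time_series_forecasting.py:

```python
def moving_average_forecast(history: list, window: int = 3, steps: int = 1) -> list:
    """Forecast the next `steps` values as the mean of the trailing window.

    A naive baseline, not the per-cluster models produced by
    time_series_forecasting.py.
    """
    values = list(history)
    forecast = []
    for _ in range(steps):
        nxt = sum(values[-window:]) / min(window, len(values))
        forecast.append(nxt)
        values.append(nxt)
    return forecast
```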
Move to the evaluation folder and run the following scripts:
# Move to the evaluation folder
cd evaluation
# Framework insights - Patterns Modeling
python patterns_modeling.py
# Framework insights - Clustering
python multivariate_clustering.py
# Framework insights - Forecasting
python patterns_forecasting_models.py
# Framework insights - Features importance
python forecasting_features_importance_ablation.py
# N-grams
python months_n_grams.py
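The n-gram analysis can be pictured as counting contiguous sub-sequences over each repository's ordered monthly phase labels. A minimal sketch (the label names here are placeholders):

```python
from collections import Counter


def month_n_grams(labels: list, n: int = 2) -> Counter:
    """Count n-grams over an ordered monthly sequence of phase labels.

    An illustrative sketch of the idea behind months_n_grams.py.
    """
    return Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))
```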
Move to the data_pipeline folder and run the following scripts:
# Move to the data_pipeline folder
cd data_pipeline
# Run the pipeline
python repository_pipeline.py
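Conceptually, the pipeline chains the earlier stages (mining, phase modeling, clustering, forecasting) for a single repository. A minimal sketch with hypothetical stage callables, not the actual structure of repository_pipeline.py:

```python
def run_pipeline(repo: str, stages: list) -> dict:
    """Run each stage in order, accumulating results keyed by stage name.

    The stage functions are hypothetical placeholders for the steps
    performed by repository_pipeline.py.
    """
    results = {"repository": repo}
    for name, stage in stages:
        results[name] = stage(results)
    return results
```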
To ensure high code quality, the Black formatter can be run as follows:
# Check formatting
black --check ./
# Format files
black ./
To reproduce and/or process more data using SonarCloud, follow these steps:
- Create a SonarCloud free account
- Create a project and get the related SONAR_TOKEN
- Create the project folder in the sonar_scanner folder
- Create the pull_releases.sh and sonar.sh scripts, following the existing examples
- Run in the terminal:
# Move to the project folder
cd project_folder
# Pull repository code releases from GitHub
./pull_releases.sh
# Process code with SonarCloud
./sonar.sh
- Retrieve the metrics results from the SonarCloud API and store them in JSON files, following the existing examples
- Run in the terminal:
# Process the metrics time series to obtain the break points and patterns sequence
python process_nar_data.py
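For the metrics-retrieval step, SonarCloud exposes component measures through its public Web API (`api/measures/component`). A sketch of building the request URL; the project key and metric keys below are placeholders:

```python
from urllib.parse import urlencode

SONARCLOUD_API = "https://sonarcloud.io/api/measures/component"


def measures_url(project_key: str, metric_keys: list) -> str:
    """Build the SonarCloud Web API URL for fetching component measures.

    The endpoint path follows the public SonarCloud Web API; the project
    and metric keys used here are placeholders.
    """
    query = urlencode({"component": project_key, "metricKeys": ",".join(metric_keys)})
    return f"{SONARCLOUD_API}?{query}"
```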
The models are available in the models folder of this repository, as well as in the following HuggingFace Collection.
The datasets are available in the same collection and can be placed in the data folder. Alternatively, they can be created by running the scripts.
A simple UI to visualize the time series data is available in the following repository: https://github.com/IN5000-MB-TUD/data-app
Project developed for the course IN5000 - Master's thesis of the 2023/2024 academic year at TU Delft.
Author:
- Mattia Bonfanti
- [email protected]
- Master's in Computer Science - Software Technology Track