Use data from transfermarkt-scraper to build a clean, public football (soccer) dataset. This includes data such as clubs, games, players and player appearances from a number of national and international competitions and seasons.
Automate the data pipeline to keep these assets up to date and publicly available on well-known data catalogs for the data community's convenience.
✅ Kaggle ✅ data.world
All project data assets are kept inside the data folder. This is a DVC repository, therefore all files for the current revision can be pulled from remote storage with the dvc pull command.
ℹ️ Read access to the DVC remote storage for this project is required to successfully run dvc pull. Contributors should feel free to grant themselves access by adding their AWS IAM user ARN to this whitelist. Have a look at this PR for an example.
Raw data within this folder can be updated by running transfermarkt-scraper with the 1_acquire.py script.
$ python 1_acquire.py --asset all --season 2021
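To refresh several seasons in one go, the command above can be wrapped in a small helper. The following is only a sketch: the `--asset` and `--season` flags come from the example above, while the function names `build_acquire_command` and `acquire_seasons` are hypothetical and not part of the project. It assumes the script accepts one season per invocation and is run from the repository root.

```python
import subprocess
import sys


def build_acquire_command(asset, season):
    """Return the argv list for a single 1_acquire.py run.

    Only the flags shown in the README example are used.
    """
    return [sys.executable, "1_acquire.py", "--asset", asset, "--season", str(season)]


def acquire_seasons(seasons, asset="all", runner=subprocess.run):
    """Run the acquire script once per season (hypothetical helper).

    `runner` is injectable so the loop can be exercised without
    actually launching the scraper.
    """
    for season in seasons:
        # check=True makes subprocess.run raise CalledProcessError
        # if the script exits with a non-zero status.
        runner(build_acquire_command(asset, season), check=True)
```

Calling `acquire_seasons([2020, 2021])` would then run the script twice, stopping at the first failing season.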
The preparation scripts transform scraped raw data into a cleaned, validated data package that serves as the basis for further analysis in this project. You may run them to produce the prepared dataset within data/prep using 2_prepare.py.
$ python 2_prepare.py [--raw-files-location data/raw]
For reference on the types of assets produced by this script, check out the published datasets linked above.
The preparation step uses raw data as input, hence raw files need to be available locally in order to run this step. You may pull raw assets by running dvc pull as mentioned earlier, or acquire new and updated raw assets via 1_acquire.py.
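Because the preparation step fails without local raw files, it can help to check for them before launching it. The sketch below is an illustration under stated assumptions, not project code: the helper names are hypothetical, the default path data/raw is taken from the command shown earlier, and DVC pointer files (`*.dvc`) and `.gitignore` files are assumed not to count as raw data, since they can be present before anything has been pulled.

```python
import sys
from pathlib import Path


def raw_files_available(raw_dir="data/raw"):
    """Return True if raw_dir holds at least one real data file.

    Assumption: `*.dvc` pointer files and `.gitignore` files are
    bookkeeping, not data, so they are skipped.
    """
    root = Path(raw_dir)
    if not root.is_dir():
        return False
    # Walk subfolders too, since the layout under data/raw is not assumed.
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        if path.suffix == ".dvc" or path.name == ".gitignore":
            continue
        return True
    return False


def build_prepare_command(raw_dir="data/raw"):
    """Return the argv list for 2_prepare.py, using the flag shown in the README."""
    return [sys.executable, "2_prepare.py", "--raw-files-location", raw_dir]
```

If `raw_files_available()` returns `False`, running dvc pull or 1_acquire.py first is the likely next step before passing the result of `build_prepare_command()` to `subprocess.run`.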
Define all the necessary infrastructure for the project in the cloud with Terraform.