transfermarkt-datasets

Use data from transfermarkt-scraper to build a clean, public football (soccer) dataset. This includes data such as clubs, games, players and player appearances from a number of national and international competitions and seasons.

Automate the data pipeline to keep these assets up to date and publicly available on well-known data catalogs for the data community's convenience.

✅ Kaggle ✅ data.world

data

All project data assets are kept inside the data folder. This is a DVC repository, therefore all files for the current revision can be pulled from remote storage with the dvc pull command.

ℹ️ Read access to the DVC remote storage for this project is required to successfully run dvc pull. Contributors should feel free to grant themselves access by adding their AWS IAM user ARN to this whitelist. Have a look at this PR for an example.
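For scripted setups, the pull step can be wrapped in a few lines of Python. This is a minimal sketch rather than part of the project: the helper names `dvc_pull_command` and `pull_data` are hypothetical, and it assumes the `dvc` executable is on your PATH and that the code runs from the repository root.

```python
import subprocess


def dvc_pull_command(target=None):
    """Build the argument list for `dvc pull`, optionally limited to one target path."""
    command = ["dvc", "pull"]
    if target is not None:
        command.append(str(target))
    return command


def pull_data(target=None, dry_run=False):
    """Run `dvc pull` (hypothetical wrapper) and return the command that was built."""
    command = dvc_pull_command(target)
    if not dry_run:
        # check=True raises CalledProcessError when DVC exits with an error,
        # for example when read access to the remote has not been granted.
        subprocess.run(command, check=True)
    return command


# Dry run: only shows the command, nothing is downloaded.
print(" ".join(pull_data(dry_run=True)))
```

Calling `pull_data()` without `dry_run=True` performs the actual download, so it fails in the same way as the plain command if your IAM user has not been added to the whitelist.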

Raw data within this folder can be updated by running transfermarkt-scraper with the 1_acquire.py script.

$ python 1_acquire.py --asset all --season 2021
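To refresh several seasons in one go, the command above can be repeated from a small driver. The sketch below is illustrative only: `acquire_command` and `acquire_seasons` are hypothetical helpers, and it assumes `--season` accepts years other than 2021 in the same format. Only the `--asset all` value shown above is used, since other asset names are not documented here.

```python
import subprocess
import sys


def acquire_command(season, asset="all"):
    """Build the argument list for one 1_acquire.py run, mirroring the command above."""
    return [sys.executable, "1_acquire.py", "--asset", asset, "--season", str(season)]


def acquire_seasons(seasons, dry_run=False):
    """Run 1_acquire.py once per season (hypothetical helper) and return the commands."""
    commands = [acquire_command(season) for season in seasons]
    for command in commands:
        if not dry_run:
            # Stop at the first failing season instead of silently continuing.
            subprocess.run(command, check=True)
    return commands


# Dry run: list the script invocations without scraping anything.
for command in acquire_seasons([2020, 2021], dry_run=True):
    print(" ".join(command[1:]))
```

Running the seasons sequentially keeps the load on the scraper predictable and makes it obvious which season failed if one of the runs exits with an error.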

prep

Scripts for transforming scraped raw data into a cleaned, validated data package that can be used as the basis of further analysis in this project. You may run these scripts to produce the prepared dataset within data/prep using 2_prepare.py.

$ python 2_prepare.py [--raw-files-location data/raw]

For reference on the types of assets produced by this script, check out the published datasets linked above.

The preparation step uses raw data as input, so raw files need to be available locally to run this step. You may pull raw assets by running dvc pull as mentioned earlier, or acquire new and updated raw assets via 1_acquire.py.
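The dependency on local raw files can be made explicit with a guard that runs before the preparation script. This is a sketch under stated assumptions: `has_raw_files`, `prepare_command` and `prepare` are hypothetical names, and "raw files are available" is interpreted simply as "the folder exists and contains at least one file", which may be looser than what 2_prepare.py actually needs.

```python
import subprocess
import sys
from pathlib import Path


def has_raw_files(raw_dir):
    """Return True if raw_dir exists and contains at least one file at any depth."""
    path = Path(raw_dir)
    return path.is_dir() and any(p.is_file() for p in path.rglob("*"))


def prepare_command(raw_dir="data/raw"):
    """Build the argument list for 2_prepare.py with an explicit raw files location."""
    return [sys.executable, "2_prepare.py", "--raw-files-location", str(raw_dir)]


def prepare(raw_dir="data/raw", dry_run=False):
    """Check for raw files, then run 2_prepare.py (hypothetical wrapper)."""
    if not has_raw_files(raw_dir):
        raise FileNotFoundError(
            f"No raw files under {str(raw_dir)!r}; run `dvc pull` or 1_acquire.py first."
        )
    command = prepare_command(raw_dir)
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

Failing early with a message that names the two ways of obtaining raw data is usually friendlier than letting the preparation script stop halfway on a missing input.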

infra

Define all the necessary infrastructure for the project in the cloud with Terraform.

