Skip to content

A flexible template for doing reproducible data science in Python.

License

Notifications You must be signed in to change notification settings

hackalog/easydata

Repository files navigation

Build Status CircleCI Coverage Status Documentation Status

EasyData

A python framework and git template for data scientists, teams, and workshop organizers aimed at making your data science reproducible

For most of us, data science is 5% science, 60% data cleaning, and 35% IT hell. Easydata focuses the 95% by helping you deliver

  • reproducible python environments,
  • reproducible datasets, and
  • reproducible workflows

In other words, Easydata is a template, library, and workflow that lets you get up and running with your data science analysis, quickly and reproducibly.

What is Easydata?

Easydata is a framework for building custom data science git repos that provides:

  • An prescribed workflow for collaboration, storytelling,
  • A python framework to support this workflow
  • A makefile wrapper for conda and pip environment management
  • prebuilt dataset recipes, and
  • a vast library of training materials and documentation around doing reproducible data science.

Easydata is not

  • an ETL tooklit
  • A data analysis pipeline
  • a containerization solution, or
  • a prescribed data format.

Requirements to use this framework:

  • anaconda (or miniconda)
  • python3.6+ (we use f-strings. So should you)
  • Cookiecutter Python package >= 1.4.0: This can be installed with pip by or conda depending on how you manage your Python packages:

once you've installed anaconda, you can install the remaining requirements (including cookiecutter) by doing:

conda create --name easydata python=3
conda activate easydata
python -m pip install -f requirements.txt

To start a new project, run:


cookiecutter https://github.com/hackalog/easydata

To find out more


A good place to start is with reproducible environments. We have a tutorial here: Getting Started with EasyData Environments.

The next place to look is in the customized documentation that is in any EasyData created repo. It is customized to the settings that you put in your template. These are reference documents that can be found under references/easydata that are customized to your repo that cover:

  • more on conda environments
  • more on paths
  • git configuration (including setting up ssh with GitHub)
  • git workflows
  • tricks for using Jupyter notebooks in an EasyData environment
  • troubleshooting
  • recommendations for how to share your work

Furthermore, see:

The resulting directory structure


The directory structure of your new project looks like this:

  • LICENSE
    • Terms of use for this repo
  • Makefile
    • top-level makefile. Type make for a list of valid commands
  • Makefile.include
    • Global includes for makefile routines. Included by Makefile.
  • Makefile.env
    • Command for maintaining reproducible conda environment. Included by Makefile.
  • README.md
    • this file
  • catalog
    • Data catalog. This is where config information such as data sources and data transformations are saved
    • catalog/config.ini
      • Local Data Store. This configuration file is for local data only, and is never checked into the repo.
  • data
    • Data directory. often symlinked to a filesystem with lots of space
    • data/raw
      • Raw (immutable) hash-verified downloads
    • data/interim
      • Extracted and interim data representations
    • data/interim/cache
      • Dataset cache
    • data/processed
      • The final, canonical data sets for modeling.
  • docs
    • Sphinx-format documentation files for this project.
    • docs/Makefile: Makefile for generating HTML/Latex/other formats from Sphinx-format documentation.
  • notebooks
    • Jupyter notebooks. Naming convention is a number (for ordering), the creator's initials, and a short - delimited description, e.g. 1.0-jqp-initial-data-exploration.
  • reference
    • Data dictionaries, documentation, manuals, scripts, papers, or other explanatory materials.
    • reference/easydata: Easydata framework and workflow documentation.
    • reference/templates: Templates and code snippets for Jupyter
    • reference/dataset: resources related to datasets; e.g. dataset creation notebooks and scripts
  • reports
    • Generated analysis as HTML, PDF, LaTeX, etc.
    • reports/figures
      • Generated graphics and figures to be used in reporting
  • environment.yml
    • The user-readable YAML file for reproducing the conda/pip environment.
  • environment.(platform).lock.1yml
    • resolved versions, result of processing environment.yml
  • setup.py
    • Turns contents of MODULE_NAME into a pip-installable python module (pip install -e .) so it can be imported in python code
  • MODULE_NAME
    • Source code for use in this project.
    • MODULE_NAME/__init__.py
      • Makes MODULE_NAME a Python module
    • MODULE_NAME/data
      • code to fetch raw data and generate Datasets from them
    • MODULE_NAME/analysis
      • code to turn datasets into output products

Installing development requirements

The first time:

make create_environment
git init
git add .
git commit -m "initial import"
git branch easydata   # tag for future easydata upgrades

Subsequent updates:

make update_environment

In case you need to delete the environment later:

conda deactivate
make delete_environment

Credits and Thanks