
Data Engineering Best Practices

Code for the blog post Data Engineering Best Practices - #1. Data flow & Code.

Project

Assume we are extracting customer and order information from upstream sources and creating a daily report of the number of orders.
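
To make that concrete, here is a minimal PySpark sketch of the kind of daily order-count aggregation the project builds. The adventureworks.fct_orders table is created by the setup below; the order_date column name is an assumption for illustration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_order_report").getOrCreate()

# Orders fact table created by the project's ETL (see the checks below).
orders = spark.table("adventureworks.fct_orders")

# Count orders per day; `order_date` is a hypothetical column name.
daily_order_counts = (
    orders.groupBy(F.to_date("order_date").alias("order_dt"))
    .agg(F.count("*").alias("num_orders"))
)
daily_order_counts.show()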

[Diagram: Data architecture]

Setup

If you'd like to code along, you'll need the following prerequisites:

  1. git version >= 2.37.1
  2. Docker version >= 20.10.17 and Docker Compose v2 version >= v2.10.2. Make sure Docker is running with docker ps.
  3. pgcli

Run the following commands in a terminal. If you are using Windows, set up Ubuntu via WSL and run them from that terminal.

git clone https://github.com/josephmachado/data_engineering_best_practices.git
cd data_engineering_best_practices
make up # Spin up containers
make ddl # Create tables & views
make ci # Run checks & tests
make etl # Run the ETL
make spark-sh # Open a Spark shell to inspect the created tables

Then, inside the Spark shell:

spark.sql("select partition from adventureworks.sales_mart group by 1").show() // one row per `make etl` run
spark.sql("select count(*) from businessintelligence.sales_mart").show() // 59
spark.sql("select count(*) from adventureworks.dim_customer").show() // 1000 * number of etl runs
spark.sql("select count(*) from adventureworks.fct_orders").show() // 10000 * number of etl runs
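
The first Spark query above returns one row per run because, as the comment suggests, each ETL run appends its output under a fresh partition value. Here is a minimal sketch of that write pattern, assuming a run timestamp as the partition value (the project may derive it differently):

import datetime

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partitioned_append").getOrCreate()

# Toy frame standing in for the computed sales mart.
df = spark.createDataFrame([(1, 10.0)], ["order_id", "amount"])

# Hypothetical run identifier; the project may derive its partition value differently.
run_id = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")

(
    df.withColumn("partition", F.lit(run_id))
    .write.mode("append")
    .partitionBy("partition")
    .saveAsTable("adventureworks.sales_mart")
)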

You can see the results of the data quality (DQ) checks using make meta:

select * from ge_validations_store limit 1;
exit
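
The ge_validations_store table name suggests these checks run on Great Expectations. As a rough sketch of what one such expectation looks like (this uses the legacy great_expectations pandas API; the project's actual suites and column names may differ):

import pandas as pd
import great_expectations as ge

# Toy customer frame; the real checks run against the warehouse tables.
customers = ge.from_pandas(pd.DataFrame({"customer_id": [1, 2, 3]}))

# Expect every customer to have an id; results like this are what
# get persisted to the ge_validations_store queried above.
result = customers.expect_column_values_to_not_be_null("customer_id")
print(result.success)  # True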

Use make down to spin down containers.

Architecture

[Diagram: Data architecture]

