albertochong/deltalake-architecture
Project logo

Delta Lake is an open-source project that enables building a Lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.



📝 Table of Contents

  • About
  • Architecture
  • Prerequisites
  • Usage
  • Built Using
  • Authors

🧐 About

This project implements a simple ETL process using PySpark with the Delta Lake framework. The PySpark job consumes a filesystem called landing-zone containing files in JSON format. Along the way we apply some time travel techniques, write in Delta format for table management control, and more.

🔧 ELT Architecture with Delta Lake

(Architecture diagram)

Prerequisites

Spark 3.1.1 (https://spark.apache.org/downloads.html)

🎈 Usage

delta-bronze.py

The PySpark script [delta-bronze.py] reads data from a filesystem called landing-zone using the Delta Lake dependencies, which are JAR packages configured in the Spark session and make the Delta Lake framework available. After the script runs, the data is written to the directory specified in the code. Inside the written table a directory called _delta_log is created, which stores incremental files with the table metadata, named like 00000000000000000000.json, 00000000000000000001.json, and so on. Each JSON file under the _delta_log folder holds information such as added/removed Parquet files (for atomicity), stats (for optimized performance and data skipping), partitionBy (for partition pruning), read versions (for time travel), and commitInfo (for auditing).
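
A minimal sketch of what such a bronze job can look like, assuming the io.delta:delta-core_2.12:1.0.0 package and placeholder paths (landing-zone/, delta/bronze/events); adjust to the repository's actual configuration:

```python
from pyspark.sql import SparkSession

# Spark session with the Delta Lake package and SQL extensions configured.
# Package coordinates and paths below are assumptions, not taken from the repo.
spark = (
    SparkSession.builder
    .appName("delta-bronze")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the raw JSON files dropped in the landing zone.
raw_df = spark.read.json("landing-zone/")

# Write as a Delta table; this creates the _delta_log directory with the
# incremental commit files (00000000000000000000.json, ...).
raw_df.write.format("delta").mode("overwrite").save("delta/bronze/events")
```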


🔧 Running the tests

pyspark < src/delta-bronze.py

Results on delta/bronze


delta-silver.py

In this step, the PySpark script [delta-silver.py] reads the data in Delta format, which brings a performance gain because the data is stored as Parquet and benefits from _delta_log metadata management. The processing steps remove unnecessary columns and prepare the tables, joining them for MDW modeling with normalized datasets.
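
As a sketch of this step, assuming the same Delta-enabled Spark session as in the bronze example and placeholder table names and columns rather than the repository's actual schema:

```python
from pyspark.sql import SparkSession

# Assumes the session is launched with the same Delta Lake configs
# (spark.jars.packages, spark.sql.extensions) as in the bronze sketch.
spark = SparkSession.builder.appName("delta-silver").getOrCreate()

# Read the bronze Delta tables (paths are placeholders).
orders = spark.read.format("delta").load("delta/bronze/orders")
customers = spark.read.format("delta").load("delta/bronze/customers")

# Drop columns that are not needed downstream (illustrative column names).
orders_clean = orders.drop("raw_payload", "ingestion_debug_info")

# Join the datasets to build the normalized model for the MDW.
silver = orders_clean.join(customers, on="customer_id", how="inner")

silver.write.format("delta").mode("overwrite").save("delta/silver/orders_customers")
```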

🔧 Running the tests

pyspark < src/delta-silver.py

Results on delta/silver


delta-gold.py

The PySpark script [delta-gold.py] is responsible for enriching the data: this is where we treat and refine the data for the business area, or for whoever will consume it. In this script I left an example of how to use time travel by passing a parameter to the read we declared, .option("versionAsOf", "0"). Images of the results after ingestion are shown below.
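
A short sketch of reading a previous table version with versionAsOf; the paths and the aggregation are illustrative assumptions, not the repository's actual logic:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes a Delta-enabled Spark session as in the bronze sketch.
spark = SparkSession.builder.appName("delta-gold").getOrCreate()

# Time travel: pin the read to version 0 of the silver Delta table.
silver_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", "0")
    .load("delta/silver/orders_customers")
)

# Refine/aggregate for the business consumers (illustrative aggregation).
gold = (
    silver_v0.groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

gold.write.format("delta").mode("overwrite").save("delta/gold/customer_totals")
```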

🔧 Running the tests

pyspark < src/delta-gold.py

Results on delta/gold


⛏️ Built Using

  • Apache Spark (PySpark) 3.1.1
  • Delta Lake
  • Python

✍️ Authors

  • albertochong
