GitHub - albertochong/deltalake-architecture

Delta Lake is an open-source project that enables building a Lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.

📝 Table of Contents

About
Architecture
Usage
Built Using
Authors

🧐 About

This project implements a simple ETL pipeline using PySpark with the Delta Lake framework. The PySpark jobs consume JSON files from a filesystem directory named landing-zone. The pipeline writes its tables in Delta format for table management and demonstrates features such as time travel.

🔧 Architecture ELT Delta Lake

Prerequisites

Spark: 3.1.1 https://spark.apache.org/downloads.html
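
Delta Lake releases are tied to specific Spark versions, so the jar pulled into the session has to match Spark 3.1.1. The sketch below shows one way the session could be configured. The Maven coordinate `io.delta:delta-core_2.12:1.0.0`, the app name, and the helper name `build_session` are assumptions for illustration; the repository's scripts may pin a different version.

```python
# Assumed Maven coordinate: Delta Lake 1.0.x is the release line built for Spark 3.1.x.
DELTA_PACKAGE = "io.delta:delta-core_2.12:1.0.0"

# Settings that enable Delta Lake's SQL extensions and catalog in a Spark session.
DELTA_CONF = {
    "spark.jars.packages": DELTA_PACKAGE,
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_session(app_name="delta-etl"):
    """Create a SparkSession with the Delta Lake settings applied (hypothetical helper)."""
    # Imported here so the configuration above can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With `spark.jars.packages` set, Spark resolves the jar at startup, so the first run needs network access to a Maven repository.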

🎈 Usage

delta-bronze.py

The PySpark script [delta-bronze.py] reads data from a filesystem directory called landing-zone. The Delta Lake dependencies are jar packages declared in Spark's session config, which is what makes the Delta Lake framework available to the script. After the script runs, the data is written to the directory given in the code.

Inside the written table there is a directory called _delta_log, which stores the table's metadata as incremental commit files named 00000000000000000000.json, 00000000000000000001.json, and so on. Each JSON file under _delta_log records information such as add/remove Parquet files (for atomicity), stats (for optimized performance and data skipping), partitionBy (for partition pruning), readVersion (for time travel), and commitInfo (for audit).
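
To make the _delta_log layout more concrete, here is a minimal sketch that counts the action types in each commit file using only the standard library. It relies on each line of a commit file being a JSON object with a single top-level key naming the action (`add`, `remove`, `commitInfo`, and so on). The function name `summarize_delta_log` is hypothetical and not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_delta_log(table_path):
    """Count the action types recorded in each commit file of a Delta table."""
    log_dir = Path(table_path) / "_delta_log"
    summary = {}
    # Commit files are named by zero-padded version, so sorting by name sorts by version.
    for commit_file in sorted(log_dir.glob("*.json")):
        if not commit_file.stem.isdigit():
            continue  # skip anything that is not a plain <version>.json commit file
        actions = Counter()
        with commit_file.open() as handle:
            for line in handle:
                if line.strip():
                    # The single top-level key of each line names the action.
                    actions.update(json.loads(line).keys())
        summary[int(commit_file.stem)] = dict(actions)
    return summary
```

Calling `summarize_delta_log("delta/bronze")` after the bronze script has run would return a dictionary keyed by version number; the path is an assumption based on the results directory named in this README.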

🔧 Running the tests

pyspark < src/delta-bronze.py

Results on delta/bronze

delta-silver.py

In this step, the PySpark script [delta-silver.py] reads the bronze data in Delta format. Reads benefit from the underlying Parquet storage and from the metadata management provided by _delta_log. The script then removes unnecessary columns and joins tables to prepare normalized datasets for MDW modeling.
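
The silver step described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of delta-silver.py: the function names, the paths, the join key, and the list of unwanted columns are all hypothetical parameters supplied by the caller.

```python
def columns_to_keep(all_columns, unwanted):
    """Return the columns that survive the cleanup, preserving their original order."""
    unwanted = set(unwanted)
    return [name for name in all_columns if name not in unwanted]


def build_silver(spark, left_path, right_path, join_key, unwanted, target_path):
    """Read two bronze Delta tables, drop unneeded columns, join them, write silver.

    Every argument is a placeholder for illustration; join_key must not be
    listed in unwanted, or the join has nothing to match on.
    """
    left = spark.read.format("delta").load(left_path)
    right = spark.read.format("delta").load(right_path)
    left = left.select(*columns_to_keep(left.columns, unwanted))
    right = right.select(*columns_to_keep(right.columns, unwanted))
    # Joining on a column name (not an expression) keeps a single copy of the key.
    joined = left.join(right, on=join_key, how="inner")
    joined.write.format("delta").mode("overwrite").save(target_path)
    return joined
```

Keeping the column selection in a plain function makes it possible to check the cleanup rule without starting Spark.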

🔧 Running the tests

pyspark < src/delta-silver.py

Results on delta/silver

delta-gold.py

The PySpark script [delta-gold.py] is responsible for enriching the data. This is where the data is treated and refined for the business area or whoever will consume it. The script also includes an example of time travel, using the reader option .option("versionAsOf", "0"). The images in the repository show the result after ingestion.
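
A minimal sketch of reading an earlier version with `versionAsOf` follows. The helper names `available_versions` and `read_version` are hypothetical, as is any table path passed to them. Note that a commit file being present does not guarantee the version is still readable, since the underlying data files may have been cleaned up.

```python
from pathlib import Path


def available_versions(table_path):
    """List the table versions that have a commit file in _delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    return sorted(
        int(p.stem) for p in log_dir.glob("*.json") if p.stem.isdigit()
    )


def read_version(spark, table_path, version):
    """Load the table as it was at the given version (Delta time travel)."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", str(version))
        .load(table_path)
    )
```

For example, `read_version(spark, path, 0)` corresponds to the `.option("versionAsOf", "0")` call mentioned above and returns the table as first written.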

🔧 Running the tests

pyspark < src/delta-gold.py

