The goal of this repo is to show off some of my skills in Databricks and Terraform, I hope you enjoy it ;)
In order to do this, I designed a CI/CD pipeline and a data ingestion/processing pipeline that takes real estate listing data simulated by a Lambda funtion into S3, ingests it into Databricks via Autoloader and transforms it through a medallion architecture of Delta Live Tables in order to expose it as a Databricks Dashboard.
Real Estate Inc. has a backend service, which emits JSON messages to an S3 bucket, whenever a listing on their website is created, updated, or deleted. This data needs to be flattened and consumed via a dashboard, containing only the currently active listings on the website.
The solution consists of two main parts:
- Lambda Functions: For handling the publishing service and the ingestion of listing CRUD events.
- Data Pipeline: Using Delta Live Tables (DLT) to process and manage the data, implementing both Bronze, Silver, and Gold tables, including an SCD2 table for historical changes.
- CI/CD Pipeline: Utilizing GitHub Actions for continuous integration and deployment.
- Infrastructure as Code: Managing infrastructure with Terraform.
This Lambda function simulates the behavior of the publishing service by generating and updating listings. It handles:
- Creation of Listings: Generates new listings with random data.
- Updating Listings: Updates existing listings with random modifications.
- Deletion of Listings: Deletes existing listings based on a random choice.
This Lambda function captures the CRUD events from the publishing service and reflects them onto a staging bucket. It handles:
- Detection of Events: Listens for object creation and deletion events in the S3 bucket.
- Processing Events: Retrieves the file contents for created objects and constructs a message payload.
- Storing Events: Writes the processed events to a staging S3 bucket with a unique hash identifier.
- Raw Listings Data: Contains raw listings data with S3 events, capturing JSON objects from the sales_and_rentals listings.
- Flattened Listings Data: Processes the raw data to flatten the JSON structure, removing nested fields and ensuring each column represents a property from the original JSON document.
- Current Listings Data: Maintains the latest state of each listing, including deletions, and ensures the data is available for querying from a Data Warehouse.
- Historical Listings Data: Tracks the SCD Type 2 history for listings, including records that have been deleted, with start and end timestamps for validity and an
is_current
boolean to indicate the current record.
The infrastructure is managed in its entirety by Terraform.
GitHub Actions is used for continuous integration and deployment. The workflow automates the following steps:
- Zipping Lambda Functions: Packages the Lambda functions.
- Terraform Initialization: Initializes Terraform in the workspace.
- Terraform Plan and Apply: Applies the Terraform plan to deploy the infrastructure.
The CI/CD pipeline ensures that any changes to the codebase or infrastructure are automatically tested and deployed, maintaining consistency and reliability in the deployment process.