This repository contains the projects completed as part of the Udacity Data Engineering Nanodegree.
A brief description of each of the projects can be found below.
The scripts created for each project can be found in the corresponding sub-directories.
Full details of what was studied in the Nanodegree can be found in the Syllabus.
Projects were completed for the four main modules of the course:
- Data Modelling
  - In this project, learners will model event data to create a non-relational database and ETL pipeline for a music streaming app, defining queries and tables for a database built using Apache Cassandra (see the Cassandra sketch below).
- Cloud Data Warehouses
  - In this project, learners will act as data engineers for a streaming music service. They are tasked with building an ELT pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for an analytics team to find insights into what songs their users are listening to (see the Redshift staging sketch below).
- Spark and Data Lakes
  - STEDI Human Balance Analytics - In this project, learners will act as data engineers for the STEDI team, building a data lakehouse solution for sensor data that trains a machine learning model. They will build an ELT (Extract, Load, Transform) pipeline for the lakehouse architecture: loading data from an AWS S3 data lake, processing it into analytics tables using Spark and AWS Glue, and loading the results back into the lakehouse (see the Spark/Glue sketch below).
- Automate Data Pipelines
  - In this project, learners will build high-grade data pipelines from reusable tasks that can be monitored and provide easy backfills for a music streaming company, Sparkify. They will move JSON logs of user activity and JSON song metadata from S3 and process them in Sparkify's data warehouse in Amazon Redshift. To complete the project, learners will create their own custom operators to perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step (see the custom-operator sketch below).

Further details can be found in the README files for each project.
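The Data Modelling project designs Apache Cassandra tables around the queries they serve. Below is a minimal, illustrative sketch of that query-first pattern using the `cassandra-driver` package; the keyspace, table, and column names are assumptions for illustration, not the project's actual schema.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Connect to a local Cassandra instance and create a keyspace for the sketch.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("sparkify")

# Query-first design: the primary key mirrors the query
# "give me the artist, song and length heard during session X, item Y".
session.execute("""
    CREATE TABLE IF NOT EXISTS session_songs (
        session_id      int,
        item_in_session int,
        artist          text,
        song_title      text,
        song_length     float,
        PRIMARY KEY (session_id, item_in_session)
    )
""")

# The ETL step would read the event files and insert one row per event.
session.execute(
    "INSERT INTO session_songs (session_id, item_in_session, artist, song_title, song_length) "
    "VALUES (%s, %s, %s, %s, %s)",
    (338, 4, "Faithless", "Music Matters (Mark Knight Dub)", 495.31),
)
```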
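The Cloud Data Warehouses project stages raw JSON from S3 into Redshift with `COPY` and then transforms it into dimensional tables with `INSERT ... SELECT`. A minimal sketch of that flow using `psycopg2` follows; the cluster endpoint, bucket, IAM role, and column names are placeholders, not the project's real configuration.

```python
import psycopg2  # pip install psycopg2-binary

# Placeholder connection details and IAM role -- substitute your own.
conn = psycopg2.connect(
    host="my-cluster.xxxxxx.us-west-2.redshift.amazonaws.com",
    dbname="dev", user="awsuser", password="...", port=5439,
)
cur = conn.cursor()

# Stage raw JSON event logs from S3 into a staging table.
cur.execute("""
    COPY staging_events
    FROM 's3://my-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT AS JSON 's3://my-bucket/log_json_path.json'
    REGION 'us-west-2'
""")

# Transform staged rows into a dimensional fact table (simplified columns).
cur.execute("""
    INSERT INTO songplays (start_time, user_id, song_id, session_id)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.user_id, s.song_id, e.session_id
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong'
""")

conn.commit()
cur.close()
conn.close()
```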
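The STEDI project moves sensor data between zones of the lakehouse using AWS Glue jobs, which run Spark under the hood. A minimal PySpark-style sketch of one such step is shown below; the S3 paths, field names, and join keys are illustrative assumptions rather than the project's actual layout.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stedi_trusted_zone").getOrCreate()

# Illustrative landing-zone paths -- the real project runs as Glue jobs with its own buckets.
customers = spark.read.json("s3://stedi-lake/customer/landing/")
accelerometer = spark.read.json("s3://stedi-lake/accelerometer/landing/")

# Keep only customers who consented to share data for research,
# then join their accelerometer readings to build a trusted-zone table.
consented = customers.filter(F.col("shareWithResearchAsOfDate").isNotNull())
trusted = (
    accelerometer
    .join(consented, accelerometer["user"] == consented["email"], "inner")
    .select("user", "timestamp", "x", "y", "z")
)

# Write the analytics table back to the lakehouse (trusted zone).
trusted.write.mode("overwrite").parquet("s3://stedi-lake/accelerometer/trusted/")
```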
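The Automate Data Pipelines project wraps repeated work in custom Airflow operators. Below is a minimal sketch of a data-quality operator, assuming Airflow 2.x with the Postgres provider for the Redshift connection; the operator name and check format are illustrative, not the project's required interface.

```python
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Run simple SQL checks against Redshift and fail the task if any check fails."""

    def __init__(self, redshift_conn_id="redshift", checks=None, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        # Each check pairs a SQL statement with the result it is expected to return.
        self.checks = checks or []

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for check in self.checks:
            records = hook.get_records(check["sql"])
            if not records or records[0][0] != check["expected"]:
                raise ValueError(f"Data quality check failed: {check['sql']}")
            self.log.info("Data quality check passed: %s", check["sql"])
```

In a DAG, such an operator would be instantiated with checks like `{"sql": "SELECT COUNT(*) FROM users WHERE userid IS NULL", "expected": 0}`, so a failed check stops the pipeline before bad data reaches the analytics tables.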