The ETL pipeline uses Amazon S3 as the data lake and AWS Glue ETL to access the data lake. Amazon Athena queries a view over the processed data, and Amazon QuickSight visualizes the result.
Data Engineering Immersion Day project.
The project covers the following tasks: data validation and ETL with AWS Glue, producing tables that can be queried with Amazon Athena and visualized with Amazon QuickSight.
Data architecture that needs to be created:
- Retrieve data from RDS Postgres and save it to the data lake as CSV files.
- Add a Glue crawler to create the Data Catalog.
- Perform ETL using Glue Studio.
- Create a view with Athena.
- Create a visualization with QuickSight that displays a sport events graph.
Prerequisites:
- AWS Account
- IAM permission policies for Glue and S3
Import the dataset from RDS Postgres to the data lake
Use the AWS CLI to import the data.
Data lake (S3)
The files are stored in the S3 "tickets" directory.
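The project itself uses the AWS CLI for this step. As an alternative illustration, the following is a minimal boto3 sketch of writing one table's rows as a CSV file under the "tickets" prefix. The bucket name, the table name and the `tickets/<table>/<table>.csv` key layout are assumptions made for the example, not details taken from the project.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialise a header and an iterable of row tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def tickets_key(table_name, prefix="tickets"):
    """Build the object key for one table's CSV file (assumed layout)."""
    return f"{prefix}/{table_name}/{table_name}.csv"


def upload_csv(bucket, table_name, header, rows, s3_client=None):
    """Upload one table as CSV and return the key that was written."""
    if s3_client is None:
        import boto3  # imported lazily so the helpers above work without it
        s3_client = boto3.client("s3")
    key = tickets_key(table_name)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=rows_to_csv(header, rows).encode("utf-8"),
    )
    return key
```

Keeping one folder per table is a common choice because a Glue crawler can then infer one table per folder.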
Add Crawler Process
Create the data catalog (database and tables) in Glue, then edit the schema of each table.
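The crawler can also be created from code instead of the console. The sketch below builds the arguments for the boto3 Glue `create_crawler` call and then starts the crawler. The crawler name, role ARN, database name and S3 path are placeholders to be supplied by the caller, not values from this project.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for glue.create_crawler with one S3 target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request, glue_client=None):
    """Create the crawler described by `request`, then start it."""
    if glue_client is None:
        import boto3  # imported lazily so crawler_request works without it
        glue_client = boto3.client("glue")
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
```

Schema edits on the resulting tables would still be done afterwards, as described above.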
Run Job in Glue Studio
Check for incorrect schemas and create a job that writes the processed data in Parquet format.
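One way to make the schema check concrete is to compare the column types the crawler inferred with the types you expect. The sketch below works on a column list in the shape returned under `Table.StorageDescriptor.Columns` by the Glue `get_table` call, and produces tuples in the `(source, source_type, target, target_type)` form accepted by Glue's ApplyMapping transform. The column names and expected types are hypothetical.

```python
def find_schema_mismatches(catalog_columns, expected_types):
    """Return {column: (actual_type, expected_type)} for columns that differ."""
    mismatches = {}
    for column in catalog_columns:
        name, actual = column["Name"], column["Type"]
        expected = expected_types.get(name)
        if expected is not None and expected != actual:
            mismatches[name] = (actual, expected)
    return mismatches


def to_apply_mapping(catalog_columns, expected_types):
    """Build ApplyMapping-style tuples, casting only the listed columns."""
    return [
        (c["Name"], c["Type"], c["Name"], expected_types.get(c["Name"], c["Type"]))
        for c in catalog_columns
    ]


# Hypothetical columns, as a crawler might infer them from CSV files.
columns = [
    {"Name": "id", "Type": "string"},
    {"Name": "start_date", "Type": "string"},
    {"Name": "city", "Type": "string"},
]
expected = {"id": "bigint", "start_date": "timestamp"}
print(find_schema_mismatches(columns, expected))
```

Columns without an entry in `expected` keep their inferred type in the mapping.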
Create Glue Crawler for Parquet Files
Add a crawler and run it. Once the crawler has finished running, the new tables are added to the catalog.
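If the crawler is started from code, a small polling loop can wait for it to finish before listing the tables. This sketch assumes the crawler has already been started and relies on the `State` field returned by the Glue `get_crawler` call, which reads `READY` once a run has finished. The poll interval and timeout are arbitrary example values.

```python
import time


def wait_for_crawler(glue_client, name, poll_seconds=15,
                     timeout_seconds=900, sleep=time.sleep):
    """Poll until the crawler is READY; return the seconds spent waiting."""
    waited = 0
    while True:
        state = glue_client.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError(f"crawler {name!r} still {state} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


def list_table_names(glue_client, database):
    """Return table names from the first page of glue.get_tables."""
    response = glue_client.get_tables(DatabaseName=database)
    return [table["Name"] for table in response["TableList"]]
```

The `sleep` parameter exists only so the loop can be exercised without real delays. A database with many tables would need pagination, which is left out here.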
Create View (Athena)
Query the data and create a view with Amazon Athena. Use Athena workgroups to control query access and costs.
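As a rough illustration of this step, the sketch below wraps a SELECT statement in `CREATE OR REPLACE VIEW` and submits it through the boto3 Athena `start_query_execution` call, with the workgroup passed explicitly. The view name, SELECT statement, database and workgroup are placeholders chosen by the caller, not names from this project.

```python
def create_view_sql(view_name, select_sql):
    """Wrap a SELECT statement in a CREATE OR REPLACE VIEW statement."""
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{select_sql.strip()}"


def athena_request(query, database, workgroup, output_location=None):
    """Build keyword arguments for athena.start_query_execution."""
    request = {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,
    }
    if output_location:
        # Not needed when the workgroup already enforces a result location.
        request["ResultConfiguration"] = {"OutputLocation": output_location}
    return request


def run_query(request, athena_client=None):
    """Submit the query and return its execution id."""
    if athena_client is None:
        import boto3  # imported lazily so the builders work without it
        athena_client = boto3.client("athena")
    return athena_client.start_query_execution(**request)["QueryExecutionId"]
```

Routing every query through a named workgroup is what lets per-workgroup settings, such as data scan limits, apply to it.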