This README summarizes the Sparkify AWS DWH project, explains how to run the Python scripts, and describes the files in the repository. Comments are used throughout the code and each function has a docstring.
Sparkify music data is transferred from an external S3 bucket to an AWS Redshift cluster using the AWS CLI and Python.
Using the song and event datasets given, a star schema optimized for queries on song play analysis has been designed and implemented.
### STAGING TABLES
- staging_events - includes a sample of the event (log) data, filtered to the /2018/11 prefix
- staging_songs - includes a sample of the song data (song_data), filtered to the A/B/C prefix
See the schema images for more information on data types and distribution / sort keys.
### DIMENSION TABLES
- users - users in the app
- songs - songs in the music database
- artists - artists in the music database
- time - timestamps of records in songplays, broken down into specific units
### FACT TABLE
- song_play - includes all user sessions and songs played over time
### DISTRIBUTION
Note that smaller tables that are queried frequently use an ALL distribution style. Everything else uses EVEN (the Redshift default).
Please refer to the schema images in this project to identify the DISTKEYs and SORTKEYs for both the staging and star schemas.
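As a rough illustration of how these choices translate into Redshift DDL (the column lists below are simplified examples; the actual definitions, DISTKEYs and SORTKEYs live in `sql_queries.py` and the schema images), a small dimension table can be created with `DISTSTYLE ALL` while the fact table carries a `DISTKEY` and `SORTKEY`:

```python
# Simplified, illustrative DDL only -- see sql_queries.py for the real definitions
user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INTEGER PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    ) DISTSTYLE ALL;                       -- small table, copied to every node
"""

songplay_table_create = """
    CREATE TABLE IF NOT EXISTS song_play (
        songplay_id INTEGER IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP SORTKEY,     -- sort on time for time-based queries
        user_id     INTEGER,
        level       VARCHAR,
        song_id     VARCHAR DISTKEY,       -- distribute on a frequent join column
        artist_id   VARCHAR,
        session_id  INTEGER,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```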
- create an AWS account and add the IAM and EC2 services
- create an S3 bucket and add the external data source to your bucket (optional)
- create an IAM user with AdministratorAccess; this user does not need access to the console. Its access keys are used to communicate with the AWS CLI. You need to create the user BEFORE creating access keys.
- create the access keys and save them somewhere safe. You will need to add them to the 'dwh.cfg' file in this project.
- choose a name for the cluster and database, and set a password in the dwh.cfg file.
Basically, fill in the blanks in the dwh.cfg file (an example layout is sketched below).
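The exact section and key names depend on how `dwh.cfg` is laid out in this repo, so treat the following only as an assumed example of the kind of values that need filling in:

```
# dwh.cfg -- example layout only; match the section/key names the scripts expect
[AWS]
KEY    = <your access key id>
SECRET = <your secret access key>

[CLUSTER]
HOST               = <cluster endpoint, known after the cluster is created>
CLUSTER_IDENTIFIER = sparkify-cluster
DB_NAME            = sparkifydb
DB_USER            = sparkify_user
DB_PASSWORD        = <your password>
DB_PORT            = 5439

[IAM_ROLE]
ARN = <ARN of a role that can read from S3>

[S3]
LOG_DATA  = s3://<your-bucket>/log_data
SONG_DATA = s3://<your-bucket>/song_data
```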
- Clone this repo to your local computer or download the zip files.
- Create a virtual environment in the same folder as the scripts.
NOTE : ensure you create a `.gitignore` file and add `dwh.cfg` to it before you commit to a public repository, or remove your credentials first.
### SCRIPTS (SEE BELOW FOR MORE DETAIL ON WHAT EACH SCRIPT DOES)
Ensure that the sql_queries.py and dwh.cfg files are defined before running the scripts.
Run the scripts in the following order:
- create_cluster.py
- create_tables.py
- etl.py
Each script has comments where necessary.
- use boto3 to create an AWS Redshift cluster; the steps are included in create_cluster.py (see the boto3 sketch after this list)
- create or replace the tables in Redshift with create_tables.py
- copy and insert data from the external S3 bucket into the newly created Redshift tables with etl.py. NOTE : to avoid high charges, ensure you only load samples of the data - not all files! You can do this using prefixing (see the COPY sketch after this list)
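The following is a minimal sketch of the kind of boto3 call `create_cluster.py` makes; the region, node type/count and config keys are assumptions, so adjust them to match this repo's `dwh.cfg` and the script itself:

```python
import configparser

import boto3

# Read credentials and cluster settings from dwh.cfg (section/key names assumed)
config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",                        # assumed region
    aws_access_key_id=config["AWS"]["KEY"],
    aws_secret_access_key=config["AWS"]["SECRET"],
)

# Request a small multi-node cluster; create_cluster.py may also set up an IAM
# role and security group so Redshift can read from S3 and accept connections
redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    ClusterIdentifier=config["CLUSTER"]["CLUSTER_IDENTIFIER"],
    DBName=config["CLUSTER"]["DB_NAME"],
    MasterUsername=config["CLUSTER"]["DB_USER"],
    MasterUserPassword=config["CLUSTER"]["DB_PASSWORD"],
)
```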
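And a minimal sketch of the kind of COPY statement `etl.py` runs for the staging load, assuming the connection details and IAM role ARN come from `dwh.cfg`; restricting the S3 path to the `/2018/11` prefix is what keeps the load (and the cost) small:

```python
import configparser

import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

# Connection details are assumed to live in dwh.cfg
conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# COPY only the 2018/11 prefix of the log data into the events staging table
staging_events_copy = """
    COPY staging_events
    FROM 's3://<your-bucket>/log_data/2018/11'
    IAM_ROLE '{}'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""".format(config["IAM_ROLE"]["ARN"])

cur.execute(staging_events_copy)
conn.commit()
conn.close()
```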