A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in the app.
In this project, we build an ETL pipeline that loads this data into a data warehouse hosted on Amazon Redshift.
- /images - Screenshots.
- Analytics.ipynb - A notebook containing basic analytics on the data warehouse.
- create_cluster.ipynb - A notebook containing code to create a Redshift cluster.
- create_cluster.py - A script to create a Redshift cluster.
- create_tables.py - A script to drop and create the tables (see the sketch after this list).
- etl.py - A script that loads data from S3 into staging tables and then into the fact and dimension tables, using the given dataset on S3.
- sql_queries.py - A script containing the SQL queries.
- dwh.cfg - Configuration file for AWS credentials and cluster settings.
- delete_cluster.py - A script to delete the Redshift cluster.
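For orientation, create_tables.py typically follows a drop-and-recreate pattern like the sketch below. This is a minimal sketch, assuming psycopg2 is installed, that sql_queries.py exposes `drop_table_queries` and `create_table_queries` lists, and that dwh.cfg stores the connection values under the section/key names shown; the actual script may differ.

```python
# Minimal sketch of a drop-and-recreate flow (assumed structure, not the
# project's exact code). Assumes sql_queries.py exposes drop_table_queries
# and create_table_queries, and dwh.cfg holds the cluster connection values.
import configparser
import psycopg2

from sql_queries import create_table_queries, drop_table_queries  # assumed names


def run_queries(cur, conn, queries):
    """Execute each SQL statement and commit after every one."""
    for query in queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    # Section/key names below are assumptions about dwh.cfg.
    conn = psycopg2.connect(
        host=config.get("CLUSTER", "HOST"),
        dbname=config.get("CLUSTER", "DB_NAME"),
        user=config.get("CLUSTER", "DB_USER"),
        password=config.get("CLUSTER", "DB_PASSWORD"),
        port=config.get("CLUSTER", "DB_PORT"),
    )
    cur = conn.cursor()

    run_queries(cur, conn, drop_table_queries)    # drop tables if they exist
    run_queries(cur, conn, create_table_queries)  # recreate staging + star schema

    conn.close()


if __name__ == "__main__":
    main()
```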
You'll be working with two datasets that reside in S3. Here are the S3 links for each:
- Song data - s3://udacity-dend/song_data
- Log data - s3://udacity-dend/log_data
- Log data JSON path - s3://udacity-dend/log_json_path.json
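If you want to preview the raw files before loading them, something like the following works; a quick sketch assuming boto3 is available, that the udacity-dend bucket allows unsigned public reads, and that it lives in us-west-2.

```python
# Quick peek at the raw dataset files (illustrative only; assumes boto3 is
# installed and the udacity-dend bucket permits unsigned public access).
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client(
    "s3",
    region_name="us-west-2",  # assumed bucket region
    config=Config(signature_version=UNSIGNED),
)

for prefix in ("song_data", "log_data"):
    resp = s3.list_objects_v2(Bucket="udacity-dend", Prefix=prefix, MaxKeys=5)
    for obj in resp.get("Contents", []):
        print(obj["Key"])
```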
songplays - records in event data associated with song plays (an illustrative CREATE TABLE sketch follows this schema overview). Columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
users - users in the app. Columns: user_id, first_name, last_name, gender, level
songs - songs in the music database. Columns: song_id, title, artist_id, year, duration
artists - artists in the music database. Columns: artist_id, name, location, latitude, longitude
time - timestamps of records in songplays, broken down into units. Columns: start_time, hour, day, week, month, year, weekday
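As an illustration of how the fact table can be declared in sql_queries.py, here is a hedged sketch of a CREATE TABLE statement for songplays built from the columns above; the data types, IDENTITY key, and DISTKEY/SORTKEY choices are assumptions, not the project's exact DDL.

```python
# Illustrative DDL for the songplays fact table (column types and the
# DISTKEY/SORTKEY choices are assumptions based on the columns listed above).
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR DISTKEY,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```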
$ python3 create_cluster.py
$ python3 create_tables.py
$ python3 etl.py
$ python3 delete_cluster.py
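Conceptually, the etl.py step first issues Redshift COPY commands to pull the raw JSON from S3 into staging tables and then runs INSERT ... SELECT statements into the star schema. The snippet below sketches the staging load only; the staging table names, the IAM role placeholder, and the us-west-2 region are assumptions, and the real queries live in sql_queries.py.

```python
# Sketch of the staging-load queries (table names, the role ARN placeholder,
# and the region are assumptions; the project's actual COPY statements live
# in sql_queries.py).
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{role_arn}'
    REGION 'us-west-2'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""

staging_songs_copy = """
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '{role_arn}'
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
"""
```

In etl.py these strings would be filled in with the role ARN from dwh.cfg and passed to `cur.execute`.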
Run the Analytics notebook (Analytics.ipynb) to explore insights from the data warehouse.
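For example, a query along these lines (an illustrative sketch reusing the assumed dwh.cfg keys from above, not necessarily one of the notebook's actual queries) lists the five most-played songs:

```python
# Illustrative analytics query (a sketch; the notebook's queries may differ).
# Reuses the assumed dwh.cfg section/key names from the earlier sketch.
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config.get("CLUSTER", "HOST"),
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    port=config.get("CLUSTER", "DB_PORT"),
)
cur = conn.cursor()

# Top five most-played songs in the songplays fact table.
cur.execute("""
    SELECT s.title, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 5;
""")
for title, plays in cur.fetchall():
    print(title, plays)

conn.close()
```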