A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in the app.
In this project, we build an ETL pipeline that loads this data into a data warehouse hosted on Amazon Redshift.
- /images - Screenshots.
- Analytics.ipynb - A notebook containing basic analytics on the data warehouse.
- create_cluster.ipynb - A notebook containing code to create a Redshift cluster.
- create_cluster.py - A script to create a Redshift cluster.
- create_tables.py - A script to drop and create the tables (see the sketch after this list).
- etl.py - A script that loads data from S3 into staging tables and then into the fact and dimension tables, using the given dataset on S3.
- sql_queries.py - A script containing the SQL queries.
- dwh.cfg - Configuration file for AWS credentials and cluster settings.
- delete_cluster.py - A script to delete the Redshift cluster.
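For orientation, create_tables.py typically follows a drop-and-recreate pattern like the sketch below. This is a minimal sketch, assuming psycopg2 is installed, that sql_queries.py exposes `drop_table_queries` and `create_table_queries` lists, and that dwh.cfg stores the connection values under the section/key names shown; the actual script may differ.

```python
# Minimal sketch of a drop-and-recreate flow (assumed structure, not the
# project's exact code). Assumes sql_queries.py exposes drop_table_queries
# and create_table_queries, and dwh.cfg holds the cluster connection values.
import configparser
import psycopg2

from sql_queries import create_table_queries, drop_table_queries  # assumed names


def run_queries(cur, conn, queries):
    """Execute each SQL statement and commit after every one."""
    for query in queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    # Section/key names below are assumptions about dwh.cfg.
    conn = psycopg2.connect(
        host=config.get("CLUSTER", "HOST"),
        dbname=config.get("CLUSTER", "DB_NAME"),
        user=config.get("CLUSTER", "DB_USER"),
        password=config.get("CLUSTER", "DB_PASSWORD"),
        port=config.get("CLUSTER", "DB_PORT"),
    )
    cur = conn.cursor()

    run_queries(cur, conn, drop_table_queries)    # drop tables if they exist
    run_queries(cur, conn, create_table_queries)  # recreate staging + star schema

    conn.close()


if __name__ == "__main__":
    main()
```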
You'll be working with two datasets that reside in S3. Here are the S3 links for each:
- Song data - s3://udacity-dend/song_data
- Log data - s3://udacity-dend/log_data
- Log data JSON path - s3://udacity-dend/log_json_path.json
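If you want to preview the raw files before loading them, something like the following works; a quick sketch assuming boto3 is available, that the udacity-dend bucket allows unsigned public reads, and that it lives in us-west-2.

```python
# Quick peek at the raw dataset files (illustrative only; assumes boto3 is
# installed and the udacity-dend bucket permits unsigned public access).
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client(
    "s3",
    region_name="us-west-2",  # assumed bucket region
    config=Config(signature_version=UNSIGNED),
)

for prefix in ("song_data", "log_data"):
    resp = s3.list_objects_v2(Bucket="udacity-dend", Prefix=prefix, MaxKeys=5)
    for obj in resp.get("Contents", []):
        print(obj["Key"])
```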
songplays - records in event data associated with song plays (an illustrative CREATE TABLE sketch follows this schema overview). Columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
users - users in the app. Columns: user_id, first_name, last_name, gender, level
songs - songs in the music database. Columns: song_id, title, artist_id, year, duration
artists - artists in the music database. Columns: artist_id, name, location, latitude, longitude
time - timestamps of records in songplays, broken down into units. Columns: start_time, hour, day, week, month, year, weekday
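As an illustration of how the fact table can be declared in sql_queries.py, here is a hedged sketch of a CREATE TABLE statement for songplays built from the columns above; the data types, IDENTITY key, and DISTKEY/SORTKEY choices are assumptions, not the project's exact DDL.

```python
# Illustrative DDL for the songplays fact table (column types and the
# DISTKEY/SORTKEY choices are assumptions based on the columns listed above).
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR DISTKEY,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```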
$ python3 create_cluster.py
$ python3 create_tables.py
$ python3 etl.py
$ python3 delete_cluster.py
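Conceptually, the etl.py step first issues Redshift COPY commands to pull the raw JSON from S3 into staging tables and then runs INSERT ... SELECT statements into the star schema. The snippet below sketches the staging load only; the staging table names, the IAM role placeholder, and the us-west-2 region are assumptions, and the real queries live in sql_queries.py.

```python
# Sketch of the staging-load queries (table names, the role ARN placeholder,
# and the region are assumptions; the project's actual COPY statements live
# in sql_queries.py).
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{role_arn}'
    REGION 'us-west-2'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""

staging_songs_copy = """
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '{role_arn}'
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
"""
```

In etl.py these strings would be filled in with the role ARN from dwh.cfg and passed to `cur.execute`.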
Run the Analytics notebook (Analytics.ipynb) to explore insights from the data warehouse.
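For example, a query along these lines (an illustrative sketch reusing the assumed dwh.cfg keys from above, not necessarily one of the notebook's actual queries) lists the five most-played songs:

```python
# Illustrative analytics query (a sketch; the notebook's queries may differ).
# Reuses the assumed dwh.cfg section/key names from the earlier sketch.
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config.get("CLUSTER", "HOST"),
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    port=config.get("CLUSTER", "DB_PORT"),
)
cur = conn.cursor()

# Top five most-played songs in the songplays fact table.
cur.execute("""
    SELECT s.title, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 5;
""")
for title, plays in cur.fetchall():
    print(title, plays)

conn.close()
```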