A music streaming startup, Sparkify, has grown its user base and song database even more and wants to move its data warehouse to a data lake. Its data resides in S3: a directory of JSON logs on user activity in the app, and a directory of JSON metadata on the songs in the app.
In this project, we will build an ETL pipeline for a data lake hosted on S3: we load the raw data from S3, process it into analytics tables using Spark, and write those tables back to S3. The Spark job is deployed on an AWS EMR cluster.
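As a rough sketch, the entry point of etl.py mainly needs a Spark session that can read from and write to S3. The hadoop-aws package and version below are assumptions (they must match your local Spark/Hadoop build); on EMR the S3 connector is already on the classpath, so that config line can be dropped there.

```python
from pyspark.sql import SparkSession

# Build a Spark session that can talk to S3. The hadoop-aws coordinates are an
# assumption for local runs; on EMR this extra package is not needed.
spark = (
    SparkSession.builder
    .appName("sparkify-data-lake-etl")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
    .getOrCreate()
)
```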
- /images - Some screenshots.
- etl.py - Spark script that performs the ETL job.
- dl.cfg - Configuration file for your AWS credentials.
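A minimal sketch of how etl.py might pick up those credentials, assuming a ConfigParser-style dl.cfg with an [AWS] section (the section and key names are assumptions — match whatever your script actually reads):

```python
import configparser
import os

# Assumed dl.cfg layout (section/key names are an assumption):
#
#   [AWS]
#   AWS_ACCESS_KEY_ID=...
#   AWS_SECRET_ACCESS_KEY=...

config = configparser.ConfigParser()
config.read("dl.cfg")

# Expose the credentials to the S3 connector via environment variables.
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```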
You'll be working with two datasets that reside in S3. Here are the S3 links for each:
- Song data - s3://udacity-dend/song_data
- Log data - s3://udacity-dend/log_data
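Both datasets can be read directly with Spark's JSON reader. A sketch, assuming the usual nested layout of the song data and the year/month layout of the log data (the wildcard depths and the s3a:// scheme are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-read-datasets").getOrCreate()

# Song data is nested several directories deep, hence the wildcards.
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# Log data: one JSON file per day of user activity.
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# Inspect the inferred schemas before building the analytics tables.
song_df.printSchema()
log_df.printSchema()
```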
- songplays - records in event data associated with song plays (see the sketch after this list for one way to build it). Columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - Columns: user_id, first_name, last_name, gender, level
- songs - Columns: song_id, title, artist_id, year, duration
- artists - Columns: artist_id, name, location, latitude, longitude
- time - Columns: start_time, hour, day, week, month, year, weekday
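As referenced above, here is one way the songplays fact table could be assembled from the two datasets. The join keys, the NextSong filter, the partitioning, and the output bucket are assumptions based on the fields present in the raw JSON, not prescriptions from the project spec:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sparkify-songplays").getOrCreate()

# Re-read the raw data (see the read sketch above); the output bucket is a placeholder.
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
output_data = "s3a://your-output-bucket/"

# Keep only actual song plays and convert the millisecond epoch 'ts' to a timestamp.
plays = (
    log_df.filter(F.col("page") == "NextSong")
          .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
)

# Join log events with song metadata on title/artist/duration to recover
# song_id and artist_id, then assemble the songplays columns listed above.
songplays = (
    plays.join(
        song_df,
        (plays.song == song_df.title)
        & (plays.artist == song_df.artist_name)
        & (plays.length == song_df.duration),
        "left",
    )
    .withColumn("songplay_id", F.monotonically_increasing_id())
    .select(
        "songplay_id",
        "start_time",
        F.col("userId").alias("user_id"),
        "level",
        "song_id",
        "artist_id",
        F.col("sessionId").alias("session_id"),
        "location",
        F.col("userAgent").alias("user_agent"),
    )
    .withColumn("year", F.year("start_time"))
    .withColumn("month", F.month("start_time"))
)

# Write back to the data lake as parquet, partitioned by year and month.
songplays.write.mode("overwrite").partitionBy("year", "month").parquet(
    output_data + "songplays"
)
```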
- Fill in your AWS credentials in the dl.cfg file
- Create an EMR cluster
- Copy your code to the EMR master node
- Execute etl.py as a spark-submit job (see the example below)
- Check the processed results in S3
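For reference, the last few steps might look roughly like this from a terminal; the key pair, master node DNS, and output bucket are placeholders:

```sh
# Copy the script and config to the EMR master node (key and host are placeholders)
scp -i ~/your-key.pem etl.py dl.cfg hadoop@<emr-master-public-dns>:~/

# SSH in and submit the job to the cluster
ssh -i ~/your-key.pem hadoop@<emr-master-public-dns>
spark-submit etl.py

# Afterwards, list the parquet output written to your S3 bucket
aws s3 ls s3://your-output-bucket/ --recursive
```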