(Udacity: Data Engineering Nanodegree) | [email protected] | 2019-04-16
This project is part of Udacity's Data Engineering Nanodegree.
After installing python3 + postgresql libraries and dependencies, run from the command line:
python3 create_tables.py
(to create the DB and tables in PostgreSQL)
python3 etl.py
(to process all the input data into the DB)
This Project-1 handles data of a music streaming startup, Sparkify. The data set consists of JSON files and contains two parts:
- ./data/song_data: static data about artists and songs
- ./data/log_data: event data of service usage, e.g. who listened to which song, when, where, and with which client
Below are some figures about the data set (results after running etl.py):
- ./data/song_data: 71 files
- ./data/log_data: 30 files
- songplays: 6820
- (unique) users: 97
- songs: 71
- artists: 69
- songplays with a matching song and artist in the songs and artists tables: 1 (song_id = SOZCTXZ12AB0182364, artist_id = AR5KOSW1187FB35FF4)
Project builds an ETL pipeline (Extract, Transform, Load) to create the DB and tables, fetch data from the JSON files, process the data, and insert it into the DB. As technologies, Project-1 uses Python, SQL, and a PostgreSQL DB.
The Sparkify analytics database (called here sparkifydb) schema has a star design: one Fact Table holding the business data (songplays), supported by Dimension Tables (users, songs, artists, time). The Fact Table answers one of the key questions: what songs users are listening to. The DB schema is the following (a table definition sketch follows the list below):
SparkifyDB schema as ER Diagram.
- songplays: song play data together with user, artist, and song info (columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
- users: user info (columns: user_id, first_name, last_name, gender, level)
- songs: song info (columns: song_id, title, artist_id, year, duration)
- artists: artist info (columns: artist_id, name, location, latitude, longitude)
- time: detailed time info about song plays (columns: start_time, hour, day, week, month, year, weekday)
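As an illustration, the songplays fact table could be declared roughly as follows. This is only a sketch, kept as a Python string the way the project scripts keep their SQL; the exact column types in the real DDL may differ:

```python
# Sketch of the songplays fact table DDL, kept as a Python string.
# Column names follow the schema listed above; the types are an assumption.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```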
Project has two scripts:
- create_tables.py: This script drops existing tables and creates new ones.
- etl.py: This script uses data in ./data/song_data and ./data/log_data, processes it, and inserts the processed data into DB.
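A minimal sketch of the create_tables.py flow, assuming psycopg2 is used as the Postgres driver and the DROP/CREATE statements are kept in lists. The statements and connection parameters below are illustrative placeholders, not the project's actual SQL:

```python
import psycopg2

# Illustrative placeholders; the real script holds the full DROP/CREATE SQL for all five tables.
drop_table_queries = ["DROP TABLE IF EXISTS songplays;"]
create_table_queries = ["CREATE TABLE IF NOT EXISTS songplays (songplay_id SERIAL PRIMARY KEY);"]


def main():
    # Connection parameters are assumptions; adjust them to your local setup.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()

    for query in drop_table_queries:
        cur.execute(query)
    print("Tables dropped successfully")

    for query in create_table_queries:
        cur.execute(query)
    print("Tables created successfully")

    conn.close()


if __name__ == "__main__":
    main()
```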
Python3 is recommended as the environment. The most convenient way to install Python is Anaconda (https://www.anaconda.com/distribution/), either via the GUI or the command line. Also, the following libraries are needed in the Python environment for Jupyter Notebook and PostgreSQL to work (example install commands follow the list):
- postgresql (+ dependencies) to enable the scripts and Jupyter to connect to the PostgreSQL DB.
- jupyter (+ dependencies) to enable Jupyter Notebook.
- ipython-sql (https://anaconda.org/conda-forge/ipython-sql) to make Jupyter Notebook and SQL queries against PostgreSQL work together. NOTE: you may need to install this library from the command line.
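For example, with Anaconda the libraries can be installed from the command line roughly like this (package and channel names are assumptions and may vary with your setup):
conda install postgresql jupyter
conda install -c conda-forge ipython-sql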
Run from the command line:
python3 create_tables.py
Output: the script prints "Tables dropped successfully" and "Tables created successfully" if all tables were dropped and created without errors.
Run from the command line:
python3 etl.py
Output: the script crawls through all the data directories, tells how many files it has to process, and prints its progress to the console as it works through them. After successfully processing a directory, the script summarises the results:
- "All xx files processed OK in aaa/abc123" (e.g. data/song_data) and
- "All xx files processed OK in aaa/def456" (e.g. data/log_data).
etl.py works in the following way to clean up and process the source data:
- The script crawls through all source data JSON files.
- From each file in the ./data/song_data directory, the script selects the song fields into a song_data DataFrame and inserts them into the songs table, and the artist fields into an artist_data DataFrame and inserts them into the artists table.
- From each file in the ./data/log_data directory, the script filters the entries whose page field is NextSong. It then derives time_data from the ts field and inserts it into the time table, and user_data, which is inserted into the users table.
- The log data is also processed into songplay_data. The song_id and artist_id fields are looked up from the songs and artists tables using the song_select query; the other fields are taken from the log data. The resulting songplay_data is finally inserted into the songplays table (see the processing sketch after this list).
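Below is a minimal sketch of the log-file part of this flow, assuming pandas handles the JSON (column names follow the raw log fields); the actual INSERT statements and the database cursor from etl.py are left out:

```python
import glob

import pandas as pd

# Walk the log data directory; each file is newline-delimited JSON.
for filepath in glob.glob("data/log_data/**/*.json", recursive=True):
    df = pd.read_json(filepath, lines=True)

    # Keep only actual song plays.
    df = df[df["page"] == "NextSong"]

    # Derive the time dimension from the millisecond timestamp column 'ts'.
    t = pd.to_datetime(df["ts"], unit="ms")
    time_data = list(zip(t, t.dt.hour, t.dt.day, t.dt.isocalendar().week,
                         t.dt.month, t.dt.year, t.dt.weekday))

    # User dimension rows come straight from the log columns.
    user_data = df[["userId", "firstName", "lastName", "gender", "level"]]

    # In etl.py these rows are then inserted into the time, users, and songplays tables.
```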
Project-1 also contains the following files:
- etl.ipynb: Jupyter Notebook for interactively developing and running the Python code used in etl.py.
- test.ipynb: Jupyter Notebook for interactively running test SQL queries against the DB (an example is shown below).
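In test.ipynb, the ipython-sql extension is typically used via notebook cells like the following; the connection string is an assumption about the local DB name and credentials and may differ in your setup:
%load_ext sql
%sql postgresql://student:student@127.0.0.1/sparkifydb
%sql SELECT * FROM songplays LIMIT 5;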
Project-1 provides the customer startup Sparkify with tools to analyse its service data and helps answer key business questions like "What songs are users listening to?".