Detailed code explanations, class definitions, and Spark session initialization instructions are included as Markdown text inside the respective .ipynb files.
This README explains how to set up your environment for a successful run. For the code breakdown, open the assignment notebooks.
Here's what each assignment covers (illustrative PySpark sketches follow each list below):
Note: This is tested only on a Linux environment.
Assignment 1: Movies Dataset (1,697,533 reviews)
- Find the item with the lowest rating
- Find the item with the highest rating
- Find the item with the longest reviews
- Perform a desired DataFrame operation
- Store the output in a Parquet file
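To make these tasks concrete, here is a minimal PySpark sketch (not the notebooks' exact code). It interprets "lowest/highest rating" as lowest/highest average rating and assumes a DataFrame `df` with columns `asin` (item id), `overall` (rating), and `reviewText`; the real column names and logic are documented in the notebooks.

```python
from pyspark.sql import functions as F

# Average rating per item, then the lowest- and highest-rated items.
avg_ratings = df.groupBy("asin").agg(F.avg("overall").alias("avg_rating"))
lowest_rated = avg_ratings.orderBy(F.asc("avg_rating")).limit(1)
highest_rated = avg_ratings.orderBy(F.desc("avg_rating")).limit(1)

# Item with the longest review text.
longest_review = (
    df.withColumn("review_len", F.length("reviewText"))
      .orderBy(F.desc("review_len"))
      .select("asin", "review_len")
      .limit(1)
)

lowest_rated.show()
highest_rated.show()
longest_review.show()
```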
Assignment 2: Books Dataset (8,898,041 reviews)
- Find the item with the lowest rating
- Find the item with the highest rating
- Find the item with the longest reviews
- Store the output in a Parquet file
- Ingest the data into a PostgreSQL table
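For the PostgreSQL ingest, a hedged sketch of a JDBC write is shown below. The table name and password are placeholders, and the host, user, and database match the pgcli command further down; the actual ingest code lives in Assignment_2.ipynb.

```python
# Illustrative only: the table name and password are placeholders, and the
# PostgreSQL JDBC driver jar from /bin must be on the Spark classpath
# (see the note on /bin below).
(
    df.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/books")
      .option("dbtable", "reviews")          # hypothetical table name
      .option("user", "root")
      .option("password", "<your-password>")
      .option("driver", "org.postgresql.Driver")
      .mode("overwrite")
      .save()
)
```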
- Python
- Jupyter
- PySpark
- Docker (Postgres image)
The /bin directory contains a .jar file that lets the PySpark module connect to any external database over JDBC.
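One way to make that jar visible to PySpark (a sketch, not necessarily the notebooks' exact setup) is to point `spark.jars` at it when building the SparkSession; the jar file name below is a placeholder for whatever actually ships in bin/.

```python
from pyspark.sql import SparkSession

# The jar name is a placeholder; use the actual file found in bin/.
spark = (
    SparkSession.builder
    .appName("assignment")
    .config("spark.jars", "bin/postgresql-jdbc.jar")
    .getOrCreate()
)
```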
Assignment_1.ipynb doesn't require docker to be installed. Only Assignment_2.ipynb requires Docker.
The data directory, named raw_lake, is not uploaded. It contains all the data files: the input files in *.json format and the output files in *.parquet format. Instructions on how to download and store the data can be found inside the assignment notebooks.
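As a rough sketch of the I/O flow (file names below are placeholders; the notebooks give the real paths):

```python
# Read the downloaded *.json input from raw_lake and write *.parquet output back.
df = spark.read.json("raw_lake/reviews.json")
df.write.mode("overwrite").parquet("raw_lake/reviews.parquet")
```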
Open a terminal and enter the commands below. For Assignment_1.ipynb (Docker not required):
git clone "https://github.com/snehangsude/TensorIoT.git"
cd TensorIoT
mkdir raw_lake
python3 -m venv .iot
source .iot/bin/activate
pip install -r requirements.txt
For Assignment_2.ipynb (Docker required):
sudo snap install docker
git clone "https://github.com/snehangsude/TensorIoT.git"
cd TensorIoT
mkdir raw_lake
python3 -m venv .iot
source .iot/bin/activate
pip install -r requirements.txt
Once the above commands have run, open the notebooks in either of two ways:
- VS Code: execute `code .`, open Assignment_1.ipynb / Assignment_2.ipynb, and click on Run > Run all cells.
- Jupyter: execute `jupyter notebook`, open Assignment_1.ipynb / Assignment_2.ipynb, and click on Run > Run all cells.
Since the data is stored in PostgreSQL, we will want to query it from the database, so the environment also includes pgcli.
To start a pgcli shell:
pgcli -h localhost -u root -d books
Note: This should be run only after Assignment_2.ipynb has completed successfully. At the password prompt, use the same password provided when starting Postgres in Docker.
Sample logs are provided here to illustrate basic runs of the notebooks with logging enabled.
- Assignment_1.ipynb logs: prefix movies
- Assignment_2.ipynb logs: prefix books
To get detailed PySpark logs, enable the 'INFO' log level inside the notebooks. For more information, view any of the notebooks.
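One way to do that inside a notebook (a sketch; the notebooks show the exact cell) is through the SparkContext log level:

```python
# Raise Spark's log verbosity to INFO for detailed logs.
spark.sparkContext.setLogLevel("INFO")
```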