YouTube video classification at scale using distributed computing and bi-LSTMs. Website here.
We propose YouTube video classification at scale, leveraging a dataset of over 8 million videos with high-throughput computing and a system design built for large data handling. The project requires training on terabytes of video and audio files to then predict a video's type from several thousand labels. Our solution uses two bidirectional LSTM (bi-LSTM) networks, one for audio and one for video, trained on a Spark-based distributed file system enabled by Elephas: Distributed Deep Learning with Keras & Spark. The infrastructure consists of a custom cluster of p2.xlarge instances on AWS, using Amazon Machine Images (AMIs) to spin up compatible nodes. We found this approach produced reliable system performance and effective classification with our final model. We wrote a more detailed report here.
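For concreteness, below is a minimal sketch of the kind of bidirectional LSTM classifier we train with Keras. The actual architecture lives in `create_youtube_model.py`; the layer sizes, feature dimensions, and label count here are illustrative assumptions rather than our exact settings.

```python
# Illustrative bi-LSTM classifier sketch; see create_youtube_model.py for the real model.
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

NUM_FRAMES = 300     # assumed number of frame-level feature vectors per video
FEATURE_DIM = 1024   # assumed dimensionality of each frame-level feature
NUM_LABELS = 3862    # approximate YouTube-8M vocabulary size; adjust to your dataset version


def build_bilstm_classifier():
    """Build a frame-level bidirectional LSTM multi-label classifier (hypothetical sizes)."""
    model = Sequential()
    model.add(Bidirectional(LSTM(128), input_shape=(NUM_FRAMES, FEATURE_DIM)))
    model.add(Dense(NUM_LABELS, activation="sigmoid"))  # one sigmoid output per label
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

An analogous network is trained on the audio features; only the input dimensionality changes.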
Below we describe the steps required to run the code and reproduce our setup.
Please find the YouTube-8M dataset here. Follow the instructions to transfer the data to an S3 bucket.
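If you prefer to script the transfer rather than use the console, here is a minimal sketch using `boto3`; the bucket name, local directory, and key prefix are placeholders for whatever you choose.

```python
# Hypothetical helper: upload locally downloaded TFRecord shards to S3.
import os

import boto3


def upload_shards(local_dir="yt8m/frame", bucket="my-yt8m-bucket", prefix="frame"):
    """Upload every .tfrecord file in local_dir to s3://bucket/prefix/."""
    s3 = boto3.client("s3")
    for name in sorted(os.listdir(local_dir)):
        if name.endswith(".tfrecord"):
            s3.upload_file(os.path.join(local_dir, name), bucket, f"{prefix}/{name}")
```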
- Launch an Amazon EC2 `m4.xlarge` instance with Amazon Linux 2018.03.0 as the operating system.
- Install Python 3.6 with `yum install python36`.
- Install a large collection of dependencies with `python3 -m pip install elephas`.
- Critical: uninstall the `pyspark` package installed by the previous command with `python3 -m pip uninstall pyspark`.
- Go to the AWS console, select this instance, and create an image by selecting Actions > Image > Create Image.
- Go to EMR, select Create Cluster, and then go to Advanced Options.
- Using the `emr-5.23` release dependencies, select:
  - Spark 2.4.0
  - Hadoop 2.8.5
  - Ganglia 3.7.2
  - Zeppelin 0.8.1
- Size the cluster as you wish, but ensure the master's EBS root volume is >= the AMI's EBS root volume (created above).
- On the last page, under images, select the previously created machine image.
- `ssh` into the master node.
  - You may need to install Git with `yum install git`.
- Follow the directions in the notebook `Tensorflow-spark-connector.ipynb` to install the TensorFlow-Spark connector (a sketch of reading TFRecords with the connector appears after these steps).
  - In essence this involves a) installing Apache Maven and b) using it to build the TensorFlow-Spark connector.
- (Optional) Follow the instructions here to mount the S3 bucket on your local machine, then use `hadoop distcp` to move the files onto HDFS. The code reads directly from S3, so you would need to change the path if you do this step. We believe it may provide some performance benefit, but we elected to stick with S3 for simplicity.
- Add to your `~/.bashrc`:

  ```
  export PATH=/usr/lib/spark:$PATH
  export PYSPARK_PYTHON=/usr/bin/python3
  ```

- Reload it with `source ~/.bashrc`.
- Clone this repo to the master, or copy `train_youtube_elephas.py` and `create_youtube_model.py` to the master.
- Submit the Spark job with:

  ```
  spark-submit --jars ecosystem/spark/spark-tensorflow-connector/target/spark-tensorflow-connector_2.11-1.10.0.jar train_youtube_elephas.py
  ```

Watch as Spark distributes the dataset and performs model training!
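Under the hood, the connector lets Spark read the YouTube-8M TFRecord shards directly into a DataFrame. The sketch below shows the general pattern; the S3 path and the exact column names are assumptions, and `train_youtube_elephas.py` contains the actual loading code.

```python
# Sketch of reading YouTube-8M TFRecords via the spark-tensorflow-connector.
# The S3 path below is a placeholder for your own bucket layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yt8m-load").getOrCreate()

df = (spark.read
      .format("tfrecords")              # registered by the connector jar passed to --jars
      .option("recordType", "Example")  # video-level records; "SequenceExample" for frame-level
      .load("s3://my-yt8m-bucket/video/train*.tfrecord"))

df.printSchema()  # inspect the decoded columns (e.g. mean_rgb, mean_audio, labels)
```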
We had to tune the maximum memory allotted to the various processes. If you run into any trouble with memory, consider adjusting the flags `--driver-memory` to change the maximum memory available to the driver script, `--executor-memory` for the executors, and finally `--conf spark.driver.maxResultSize=SIZE`, where `SIZE` is the maximum expected serialized result your `model.train` method returns.
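To give a sense of how Elephas fits in, here is a rough sketch of the distributed training step inside a script like `train_youtube_elephas.py`. The function name, hyperparameters, and data preparation are assumptions for illustration; note also that recent Elephas releases name the training entry point `fit`, while older ones name it `train` (the method referenced above).

```python
# Rough sketch of distributed Keras training with Elephas; not the exact code we ran.
from elephas.spark_model import SparkModel
from elephas.utils.rdd_utils import to_simple_rdd


def train_distributed(sc, model, x_train, y_train, epochs=5, batch_size=32):
    """Train a compiled Keras model across the cluster (illustrative settings)."""
    # x_train / y_train: numpy arrays assembled from the TFRecord DataFrame (details omitted).
    rdd = to_simple_rdd(sc, x_train, y_train)  # RDD of (features, label) pairs
    spark_model = SparkModel(model, frequency='epoch', mode='asynchronous')
    spark_model.fit(rdd, epochs=epochs, batch_size=batch_size,
                    verbose=0, validation_split=0.1)
    return spark_model
```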
To easily monitor the Spark application, including tracking the runtime of each Job, enable port forwarding with:

```
ssh -i /path/to/key -4 -L 3000:MASTER-DNS:4040 hadoop@MASTER-DNS
```

where `MASTER-DNS` is the DNS name found on the EMR cluster page.
In our experiments, we set `--num-executors` to the number of nodes in the cluster and `--executor-cores` to the number of vCPUs per node.