YouTube video classification at scale using distributed computing and bi-LSTMs. Website here.
We propose YouTube video classification at scale, leveraging a dataset of over 8 million videos with high-throughput computing and a system design built for large data handling. The project requires training on terabytes of video and audio files to then predict a video's type from several thousand labels. Our solution uses two bidirectional LSTM (bi-LSTM) networks, one for audio and one for video, trained on a Spark-based distributed file system enabled by Elephas: Distributed Deep Learning with Keras & Spark. The infrastructure consists of a custom cluster of p2.xlarge instances on AWS, using Amazon Machine Images (AMIs) to spin up compatible nodes. We found this approach produced reliable system performance and effective classification with our final model. We wrote a more detailed report here.
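For concreteness, below is a minimal sketch of the kind of bidirectional LSTM classifier we train with Keras. The actual architecture lives in `create_youtube_model.py`; the layer sizes, feature dimensions, and label count here are illustrative assumptions rather than our exact settings.

```python
# Illustrative bi-LSTM classifier sketch; see create_youtube_model.py for the real model.
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

NUM_FRAMES = 300     # assumed number of frame-level feature vectors per video
FEATURE_DIM = 1024   # assumed dimensionality of each frame-level feature
NUM_LABELS = 3862    # approximate YouTube-8M vocabulary size; adjust to your dataset version


def build_bilstm_classifier():
    """Build a frame-level bidirectional LSTM multi-label classifier (hypothetical sizes)."""
    model = Sequential()
    model.add(Bidirectional(LSTM(128), input_shape=(NUM_FRAMES, FEATURE_DIM)))
    model.add(Dense(NUM_LABELS, activation="sigmoid"))  # one sigmoid output per label
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

An analogous network is trained on the audio features; only the input dimensionality changes.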
Below we describe the steps required to run the code and reproduce our setup.
Please find the YouTube-8M dataset here. Follow the instructions to transfer the data to an S3 bucket.
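If you prefer to script the transfer rather than use the console, here is a minimal sketch using `boto3`; the bucket name, local directory, and key prefix are placeholders for whatever you choose.

```python
# Hypothetical helper: upload locally downloaded TFRecord shards to S3.
import os

import boto3


def upload_shards(local_dir="yt8m/frame", bucket="my-yt8m-bucket", prefix="frame"):
    """Upload every .tfrecord file in local_dir to s3://bucket/prefix/."""
    s3 = boto3.client("s3")
    for name in sorted(os.listdir(local_dir)):
        if name.endswith(".tfrecord"):
            s3.upload_file(os.path.join(local_dir, name), bucket, f"{prefix}/{name}")
```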
- Launch an Amazon EC2 `m4.xlarge` instance with Amazon Linux 2018.03.0 as the operating system.
- Install Python 3.6 with `yum install python36`.
- Install a large collection of dependencies with `python3 -m pip install elephas`.
- Critical: uninstall the `pyspark` package installed by the previous command with `python3 -m pip uninstall pyspark`.
- Go to the AWS console, select this instance, and create an image by selecting Actions > Image > Create Image.
- Go to EMR, select Create Cluster, and then go to Advanced Options.
- Using the `emr-5.23` release dependencies, select:
  - Spark 2.4.0
  - Hadoop 2.8.5
  - Ganglia 3.7.2
  - Zeppelin 0.8.1
- Size the cluster as you wish, but ensure the master's EBS root volume is >= the AMI's EBS root volume (created above).
- On the last page, under images, select the previously created machine image.
- `ssh` into the master node.
  - You may need to install Git with `yum install git`.
- Follow the directions in the notebook `Tensorflow-spark-connector.ipynb` to install the TensorFlow-Spark connector (a sketch of reading TFRecords with the connector appears after these steps).
  - In essence this involves a) installing Apache Maven and b) using it to build the TensorFlow-Spark connector.
- (Optional) Follow the instructions here to mount the S3 bucket on your local machine, then use `hadoop distcp` to move the files onto HDFS. The code reads directly from S3, so you would need to change the path if you do this step. We believe it may provide some performance benefit, but we elected to stick with S3 for simplicity.
- Add to your `~/.bashrc`:

  ```
  export PATH=/usr/lib/spark:$PATH
  export PYSPARK_PYTHON=/usr/bin/python3
  ```

- Reload it with `source ~/.bashrc`.
- Clone this repo to the master, or copy `train_youtube_elephas.py` and `create_youtube_model.py` to the master.
- Submit the Spark job with:

  ```
  spark-submit --jars ecosystem/spark/spark-tensorflow-connector/target/spark-tensorflow-connector_2.11-1.10.0.jar train_youtube_elephas.py
  ```

Watch as Spark distributes the dataset and performs model training!
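Under the hood, the connector lets Spark read the YouTube-8M TFRecord shards directly into a DataFrame. The sketch below shows the general pattern; the S3 path and the exact column names are assumptions, and `train_youtube_elephas.py` contains the actual loading code.

```python
# Sketch of reading YouTube-8M TFRecords via the spark-tensorflow-connector.
# The S3 path below is a placeholder for your own bucket layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yt8m-load").getOrCreate()

df = (spark.read
      .format("tfrecords")              # registered by the connector jar passed to --jars
      .option("recordType", "Example")  # video-level records; "SequenceExample" for frame-level
      .load("s3://my-yt8m-bucket/video/train*.tfrecord"))

df.printSchema()  # inspect the decoded columns (e.g. mean_rgb, mean_audio, labels)
```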
We had to tune the maximum memory allotted to the various processes. If you run into any trouble with memory, consider adjusting the flags `--driver-memory` to change the maximum memory available to the driver script, `--executor-memory` for the executors, and finally `--conf spark.driver.maxResultSize=SIZE`, where `SIZE` is the maximum expected serialized result your `model.train` method returns.
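To give a sense of how Elephas fits in, here is a rough sketch of the distributed training step inside a script like `train_youtube_elephas.py`. The function name, hyperparameters, and data preparation are assumptions for illustration; note also that recent Elephas releases name the training entry point `fit`, while older ones name it `train` (the method referenced above).

```python
# Rough sketch of distributed Keras training with Elephas; not the exact code we ran.
from elephas.spark_model import SparkModel
from elephas.utils.rdd_utils import to_simple_rdd


def train_distributed(sc, model, x_train, y_train, epochs=5, batch_size=32):
    """Train a compiled Keras model across the cluster (illustrative settings)."""
    # x_train / y_train: numpy arrays assembled from the TFRecord DataFrame (details omitted).
    rdd = to_simple_rdd(sc, x_train, y_train)  # RDD of (features, label) pairs
    spark_model = SparkModel(model, frequency='epoch', mode='asynchronous')
    spark_model.fit(rdd, epochs=epochs, batch_size=batch_size,
                    verbose=0, validation_split=0.1)
    return spark_model
```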
To easily monitor the Spark application, including tracking the runtime of each Job, enable port forwarding with:

```
ssh -i /path/to/key -4 -L 3000:MASTER-DNS:4040 hadoop@MASTER-DNS
```

where `MASTER-DNS` is the DNS name found on the EMR cluster page.
In our experiments, we set `--num-executors` to the number of nodes in the cluster and `--executor-cores` to the number of vCPUs per node.