BigSift: Automated Debugging of Big Data Analytics in Data-Intensive Scalable Computing (SoCC 2017)
Developing big data analytics often involves trial-and-error debugging, due to the unclean nature of datasets or wrong assumptions made about the data. When errors (e.g., program crashes or outlier results) arise, developers want to pinpoint the root cause and explain the sources of anomalies. To address this problem, BigSift takes an Apache Spark program, a user-defined test oracle function, and a dataset as input, and outputs a minimum set of input records that reproduces the same test failure, by combining insights from delta debugging with data provenance. The technical contribution of BigSift is the design of systems optimizations that bring automated debugging closer to reality for data-intensive scalable computing.
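To make the delta-debugging idea concrete, here is a minimal single-machine sketch in Python. It is not BigSift's implementation, which runs on Spark and uses data provenance to narrow the search before minimizing. The names `ddmin` and `test_fails` are hypothetical, and the oracle is assumed to return True when the failure is reproduced. The result is 1-minimal (removing any single record makes the failure disappear), which is not necessarily the globally smallest set.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(records: Sequence[T], test_fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `records` to a 1-minimal list on which `test_fails` is still True."""
    current = list(records)
    if not test_fails(current):
        raise ValueError("the full input must reproduce the failure")

    n = 2  # granularity: number of chunks the input is split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False

        # 1) Try each chunk on its own.
        for subset in subsets:
            if test_fails(subset):
                current, n, reduced = subset, 2, True
                break

        # 2) Otherwise try removing one chunk at a time.
        if not reduced:
            for i in range(len(subsets)):
                complement = [r for j, s in enumerate(subsets) if j != i for r in s]
                if test_fails(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # 3) Otherwise split more finely, or stop at single-record chunks.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)

    return current


# Hypothetical oracle: the job "fails" whenever a negative record is present.
print(ddmin([3, 8, -1, 4, 9, 2], lambda rs: any(r < 0 for r in rs)))  # prints [-1]
```

Each call to `test_fails` corresponds to re-running the job on a candidate subset, which is why reducing the number of such runs matters at scale.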
BigSift exposes an interactive web interface where a user can monitor a big data analytics job running remotely on the cloud, write a user-defined test oracle function, and then trigger the automated debugging process. BigSift also provides a set of predefined test oracle functions, which can be used for explaining common types of anomalies in big data analytics; for example, finding the origin of an output value that is more than k standard deviations away from the median. The demonstration video is available at https://youtu.be/jdBsCd61a1Q.
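As an illustration of what such an oracle might look like, the following Python sketch flags a failure when any output value lies more than k standard deviations from the median. It is an assumption-laden sketch, not one of BigSift's predefined functions: the name `make_outlier_oracle`, the use of the population standard deviation, and the convention that True means "failure observed" are all choices made here for illustration.

```python
import statistics
from typing import Callable, Sequence


def make_outlier_oracle(k: float) -> Callable[[Sequence[float]], bool]:
    """Build an oracle that returns True when some output value is more than
    k standard deviations away from the median of the outputs."""

    def oracle(outputs: Sequence[float]) -> bool:
        if len(outputs) < 2:
            return False  # too few values to speak of an outlier
        median = statistics.median(outputs)
        spread = statistics.pstdev(outputs)  # population standard deviation
        return any(abs(value - median) > k * spread for value in outputs)

    return oracle


# Hypothetical job outputs: 95 is far from the median of 10.
is_anomalous = make_outlier_oracle(2)
print(is_anomalous([10, 11, 9, 10, 12, 10, 95]))  # prints True
print(is_anomalous([9, 10, 11, 10, 9, 11]))       # prints False
```

Parameterizing the oracle by k keeps the threshold explicit, so the same function can be reused for stricter or looser definitions of an anomaly.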
This project is developed by Professor Miryung Kim's Software Engineering and Analysis Laboratory at UCLA. If you encounter any problems, please open an issue or feel free to contact us:
Muhammad Ali Gulzar: Assistant Professor at Virginia Tech, [email protected];
Siman Wang: Software Engineer at Snap;
Miryung Kim: Professor at UCLA, [email protected];
The source code of BigSift is available at https://github.com/maligulzar/bigdebug/tree/bigsift-demo
Before building the Docker container, first download the following two files and place them under the BigSift-Zeppelin directory:
spark-2.1.1-SNAPSHOT-bin-2.2.0.tgz, available at BigSift.
Zeppelin binary with all interpreters, available at Zeppelin. Extract the file using tar -xzf zeppelin.tar.gz in the docker directory.
Now install Docker on your local machine (follow the instructions here). After the installation is complete, launch the 'Docker' application, which starts the Docker service (e.g., the whale-like icon on your Mac status bar). If this step is successful, you should be able to type 'docker' on your command line console.
Assuming that you have installed Docker and are currently in the BigSift-Zeppelin directory, you should see "DockerFile" under this directory. The following command creates a Docker image using the DockerFile under the current directory ('.') and assigns "spark" as the name of the image.
docker build -t spark .
Building the Docker image from this recipe takes several minutes: Docker pulls the required packages and runs each command in the DockerFile. The process is slow because it downloads Spark, Scala, and the other tools required for your subsequent assignments. Make sure that your machine has enough memory and hard disk space (several GB; about 2.17 GB on my machine) to finish this step. You should see messages similar to the following on your screen.
bash-3.2$ docker build -t spark .
Sending build context to Docker daemon 1.954GB
Step 1/34 : FROM debian:jessie
---> 25fc9eb3417f
Step 2/34 : MAINTAINER Getty Images "https://github.com/gettyimages"
---> Using cache
---> 3106ccca439d
...
To list all the containers along with their status, run
docker ps -a
Once the docker image is built, you can start the cluster using docker-compose.
docker-compose up
This command initializes the cluster using the recipe docker-compose.yml. Use it only when launching the cluster for the first time. Afterwards, use docker-compose start to start the cluster.
Starting dockerspark_master_1 ...
Starting dockerspark_master_1 ... done
Starting dockerspark_worker_1 ...
Starting dockerspark_worker_1
Starting dockerspark_zeppelin_1 ...
Starting dockerspark_zeppelin_1 ... done
Attaching to dockerspark_master_1, dockerspark_worker_1, dockerspark_zeppelin_1
zeppelin_1 | Zeppelin start [ OK ]
Give this step a few seconds to set up everything and start all the nodes.
Now the cluster has been set up. Go to port 6060 of your local machine (localhost:6060) to access the Zeppelin notebook.
Use the following command to attach to any container in the cluster.
docker exec -it <container-name> /bin/bash
where the container names, such as dockerspark_master_1, are printed on the screen in Step 4.
Use the following command to shut down the cluster. Make sure you have transferred all the important data from the containers to the host machine. Otherwise, data left in the containers may be lost if the containers are later removed.
docker-compose stop
If a Spark job cannot be submitted through the notebook (Spark Context not present exception), restart the cluster using docker-compose down and then docker-compose up.
The down command brings down the entire application and removes its containers and networks. By default it keeps images and named volumes; these are removed only when the --rmi or --volumes options are passed.
Please refer to our SoCC'17 paper, Automated Debugging in Data-Intensive Scalable Computing, for more details.
@inproceedings{10.1145/3127479.3131624,
  author = {Gulzar, Muhammad Ali and Interlandi, Matteo and Han, Xueyuan and Li, Mingda and Condie, Tyson and Kim, Miryung},
  title = {Automated Debugging in Data-Intensive Scalable Computing},
  year = {2017},
  isbn = {9781450350280},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3127479.3131624},
  doi = {10.1145/3127479.3131624},
  booktitle = {Proceedings of the 2017 Symposium on Cloud Computing},
  pages = {520–534},
  numpages = {15},
  keywords = {big data, data provenance, fault localization, data-intensive scalable computing (DISC), and data cleaning, automated debugging},
  location = {Santa Clara, California},
  series = {SoCC '17}
}