An application that takes as input a list of Twitter user IDs and, every 5 seconds, emits the number of tweet actions for each user that appears in the input list. The application performs an inner join between a static DataFrame (staticDF) of the Twitter IDs and the streaming input DataFrame, and groups the result by user ID.
This application was developed to analyze the Higgs Twitter Dataset. The Higgs dataset was built by monitoring the spreading processes on Twitter before, during, and after the announcement of the discovery of a new particle with the features of the Higgs boson. Each row in the dataset has the format <userA, userB, timestamp, interaction>, where an interaction can be a retweet (RT), a mention (MT), or a reply (RE). We have split the dataset into a number of small files so that it can be used to emulate streaming data. Download the split dataset onto your master VM.
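Below is a minimal sketch of what tweetactions.py might look like, based only on the description above. The column names, the space separator, the console sink, and the placeholder list of user IDs are assumptions, not the assignment's reference implementation; adjust them to match your setup.

```python
# tweetactions.py -- a minimal sketch, assuming Spark Structured Streaming
# and the <userA, userB, timestamp, interaction> row format described above.
import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("TweetActions").getOrCreate()

# Schema for the Higgs dataset rows: <userA, userB, timestamp, interaction>.
schema = StructType([
    StructField("userA", LongType()),
    StructField("userB", LongType()),
    StructField("timestamp", LongType()),
    StructField("interaction", StringType()),
])

# Hypothetical static DataFrame holding the input list of Twitter user IDs.
user_ids = [(1,), (2,), (3,)]  # replace with the real input list
staticDF = spark.createDataFrame(user_ids, ["userA"])

# Streaming DataFrame over the monitoring directory passed on the command line.
streamDF = (spark.readStream
            .schema(schema)
            .option("sep", " ")  # assuming space-separated files
            .csv(sys.argv[1]))

# Inner join on the user ID, then count tweet actions per user.
counts = (streamDF.join(staticDF, "userA")
          .groupBy("userA")
          .count())

# Emit the updated counts every 5 seconds.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .trigger(processingTime="5 seconds")
         .start())

query.awaitTermination()
```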
This script emulates a Twitter stream by doing the following (see the sketch after this list):
- Copies the entire split dataset to HDFS; this serves as the staging directory.
- Creates a monitoring directory on HDFS; this is the directory the streaming application listens to.
- Periodically moves the split dataset files from the staging directory to the monitoring directory using the hadoop fs -mv command.
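A minimal sketch of such an emulation script is shown below, assuming Python with subprocess calls to the hadoop CLI. The directory paths, the local split-directory name, and the 5-second interval are all hypothetical; substitute the values for your cluster.

```python
# stream_emulator.py -- a sketch of the stream-emulation steps listed above.
# Paths and the move interval are assumptions; adjust to your environment.
import glob
import os
import subprocess
import time

LOCAL_SPLIT_DIR = "higgs-splits"      # local directory holding the split files
STAGING_DIR = "/higgs/staging"        # hypothetical HDFS staging directory
MONITORING_DIR = "/higgs/monitoring"  # hypothetical HDFS monitoring directory
INTERVAL_SECONDS = 5                  # delay between file moves

def hdfs(*args):
    """Run a 'hadoop fs' subcommand, raising on failure."""
    subprocess.run(["hadoop", "fs", *args], check=True)

# Create the staging and monitoring directories on HDFS.
hdfs("-mkdir", "-p", STAGING_DIR, MONITORING_DIR)

# Copy the entire split dataset into the staging directory.
split_files = sorted(glob.glob(os.path.join(LOCAL_SPLIT_DIR, "*")))
hdfs("-put", *split_files, STAGING_DIR)

# Periodically move one file at a time into the monitoring directory so the
# streaming application sees the data arrive incrementally.
for f in split_files:
    hdfs("-mv", STAGING_DIR + "/" + os.path.basename(f), MONITORING_DIR)
    time.sleep(INTERVAL_SECONDS)
```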
Submit this Spark job using the following command:
```
spark-submit --verbose tweetactions.py <path_to_monitoring_dir_in_hdfs>
```