- Ankita Chikodi
- Arkil Thakkar
- Nehal Sharma
- Shravani Pande
In this project, we aim to reduce human intervention in log file analysis and debugging. Our methodology treats log mining as an NLP problem and uses natural language processing techniques to extract key features from the logs. We train a Random Forest classifier as the detection model. The generated log data is processed in real time through a Kafka and Spark streaming pipeline. If an anomaly is detected, system administrators are notified of the anomalous behavior traced in the log files. Our model demonstrates accurate predictive performance (F1-score of 93%).
The log data is sent through a Kafka producer and received as a Spark streaming object. The log data in the Spark streaming object is pre-processed and fed to a pre-trained model built from historical logs. After pre-processing, the model predicts whether the entry is an anomaly, and if so an alert e-mail is sent to the user.
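For illustration, the producer side can be as small as the following sketch; it assumes the `kafka-python` package, a broker on `localhost:9092`, and the topic name `test` used in the setup steps at the end of this document:

```python
# Minimal Kafka producer sketch; assumes the kafka-python package and a
# local broker on localhost:9092 with a topic named "test".
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Stream the raw HDFS log file into the topic line by line.
with open("HDFS.log") as log_file:
    for line in log_file:
        producer.send("test", line.encode("utf-8"))

producer.flush()  # ensure all buffered records reach the broker
```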
The offline training pipeline proceeds through the following stages (a code sketch follows this list):

- Raw Data: log data generated through HDFS.
- Log DataFrame: unstructured logs preprocessed into a dataframe.
- Event Sequence: each BlockID grouped with its sequence of events.
- Label Dictionary: a dictionary of BlockIDs and their event sequences.
- Label Mapping: each BlockID mapped to its associated label, i.e. Anomaly or Normal.
- TF-IDF: building a matrix by converting the event sequences into TF-IDF form.
- Normalized Vector: normalizing the matrix generated from TF-IDF.
- Dumping Model: storing the trained model.
- Kafka Logs: sending the Kafka logs through the producer.
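The following is a minimal sketch of how these stages could be wired together with scikit-learn. The file names (`HDFS.log`, `anomaly_label.csv`, `log_anomaly_model.pkl`), the block-ID regex, the use of the last token of each line as a stand-in for an extracted event ID, and the hyperparameters are all illustrative assumptions, not the project's exact code:

```python
# Sketch of the offline training pipeline; file names, the block-ID
# regex, and hyperparameters are illustrative assumptions.
import csv
import re

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# 1. Group events by BlockID to form event sequences.
block_re = re.compile(r"blk_-?\d+")
sequences = {}
with open("HDFS.log") as log_file:
    for line in log_file:
        match = block_re.search(line)
        if match:
            # The last token stands in for a properly extracted event ID.
            sequences.setdefault(match.group(), []).append(line.split()[-1])

# 2. Map each BlockID to its label; assumes a CSV with BlockId,Label
#    columns where Label is "Anomaly" or "Normal".
with open("anomaly_label.csv") as label_file:
    labels = {row["BlockId"]: row["Label"] for row in csv.DictReader(label_file)}

block_ids = [b for b in sequences if b in labels]
docs = [" ".join(sequences[b]) for b in block_ids]
y = [1 if labels[b] == "Anomaly" else 0 for b in block_ids]

# 3. Build the TF-IDF matrix over event sequences and L2-normalize it.
vectorizer = TfidfVectorizer()
X = Normalizer().fit_transform(vectorizer.fit_transform(docs))

# 4. Train the Random Forest and dump it for the streaming job to load.
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
joblib.dump((vectorizer, model), "log_anomaly_model.pkl")
```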
Processing
The new data is sent through the Kafka producer in chunks, which are received by the Spark streaming object in real time. Pre-processing is performed on this object, and the result is given to the pre-trained model, which computes the label of the log. If an anomaly is found, all the data associated with it is sent to the system administrator by e-mail.
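A condensed sketch of this streaming side, assuming Spark 2.x with the `spark-streaming-kafka-0-8` connector named in the setup steps below; the preprocessing, SMTP host, and e-mail addresses are simplified placeholders:

```python
# Sketch of the real-time scoring job; preprocessing and mail details
# are simplified placeholders.
import smtplib
from email.mime.text import MIMEText

import joblib
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from sklearn.preprocessing import Normalizer

vectorizer, model = joblib.load("log_anomaly_model.pkl")

def alert(chunk):
    # Notify the administrator; host and addresses are assumptions.
    msg = MIMEText(chunk)
    msg["Subject"] = "Log anomaly detected"
    with smtplib.SMTP("localhost") as smtp:
        smtp.sendmail("alerts@example.com", ["admin@example.com"],
                      msg.as_string())

def score(rdd):
    # Each record is a (key, message) pair from the Kafka topic.
    for _, chunk in rdd.collect():
        features = Normalizer().transform(vectorizer.transform([chunk]))
        if model.predict(features)[0] == 1:  # 1 == Anomaly
            alert(chunk)

sc = SparkContext(appName="LogMining")
ssc = StreamingContext(sc, batchDuration=5)
stream = KafkaUtils.createDirectStream(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"})
stream.foreachRDD(score)

ssc.start()
ssc.awaitTermination()
```

Collecting each micro-batch to the driver keeps the scikit-learn model off the executors, which is fine for a sketch but would not scale to very high log volumes.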
Output:
- Classifying the anomalies by severity: Critical, High, Moderate, or Low.
- Alerting the appropriate user groups according to the anomaly's criticality (a routing sketch follows this list).
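One hypothetical way to implement this routing is to bucket the classifier's anomaly probability into a severity level and map each level to a recipient group. The thresholds and addresses below are invented for illustration; the project may derive severity differently:

```python
# Hypothetical severity routing; thresholds on the model's anomaly
# probability and the recipient groups are illustrative assumptions.
SEVERITY_GROUPS = {
    "Critical": ["oncall@example.com", "admins@example.com"],
    "High": ["admins@example.com"],
    "Moderate": ["ops@example.com"],
    "Low": ["ops-digest@example.com"],
}

def severity(anomaly_probability):
    """Bucket a predict_proba score into a severity level."""
    if anomaly_probability >= 0.95:
        return "Critical"
    if anomaly_probability >= 0.85:
        return "High"
    if anomaly_probability >= 0.70:
        return "Moderate"
    return "Low"

def recipients_for(anomaly_probability):
    return SEVERITY_GROUPS[severity(anomaly_probability)]
```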
- Start Zookeeper: open a new terminal and type `zkserver`
- Start the Kafka server: open a new terminal and type `.\bin\windows\kafka-server-start.bat .\config\server.properties`
- Create a Kafka topic: open a new terminal and type `kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test`
- Initialize a local producer: open a new terminal and type `kafka-console-producer.bat --broker-list localhost:9092 --topic test --new-producer < HDFS.log`
- Run the Spark job: `./spark-submit.sh --jars spark-streaming-kafka-0-8-assembly_2.11-2.3.3.jar Spark_Log_mining.py`