SJSU272Spring2019/Project-Group-2

Anomaly Detection as a Service

Team Members:

  • Ankita Chikodi
  • Arkil Thakkar
  • Nehal Sharma
  • Shravani Pande

Project Description:

In this project, we aim to reduce human intervention in log-file analysis and debugging. Our methodology treats log mining as an NLP problem and uses techniques from natural language processing to extract key features from the logs, on which we train a Random Forest classifier. Newly generated log data is processed in real time through a Kafka and Spark streaming pipeline, and when an anomaly is detected, system administrators are notified about the anomalous behavior traced in the log files. Our model demonstrates accurate predictive performance (F1-score: 93%).

System Architecture:

Log data is transferred through a Kafka producer and received as a Spark streaming object. The log data in the streaming object is pre-processed and fed to a model pre-trained on older logs. The model then predicts whether each entry is an anomaly and, if so, sends an e-mail alert to the user.

image

Data Preprocessing:

Raw Data

Log data generated through HDFS

image

Log DataFrame

Unstructured logs preprocessed into a DataFrame

image
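The parsing step can be sketched as follows. This is a minimal illustration assuming the standard HDFS log layout (date, time, pid, level, component, content); the sample lines are invented and the field names are our own:

```python
import re

import pandas as pd

# Hypothetical raw HDFS log lines (format: date time pid level component: content)
raw_lines = [
    "081109 203615 148 INFO dfs.DataNode$PacketResponder: "
    "PacketResponder 1 for block blk_38865049064139660 terminating",
    "081109 203807 222 INFO dfs.DataNode$PacketResponder: "
    "PacketResponder 0 for block blk_-6952295868487656571 terminating",
]

# Split each unstructured line into structured fields
pattern = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{6}) (?P<pid>\d+) "
    r"(?P<level>\w+) (?P<component>\S+): (?P<content>.*)$"
)

rows = [pattern.match(line).groupdict() for line in raw_lines]
df = pd.DataFrame(rows)
print(df[["level", "component"]])
```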

Event Sequence

Each block ID has a sequence of associated events

image

Label Dictionary

Dictionary of blockIDs and event sequences

image
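Building that dictionary might look like the sketch below, assuming log lines have already been mapped to event template IDs (the block IDs and event IDs here are made up):

```python
from collections import defaultdict

# Hypothetical (block ID, event template ID) pairs from the parsed logs
parsed = [
    ("blk_1", "E5"),
    ("blk_1", "E22"),
    ("blk_2", "E5"),
    ("blk_1", "E11"),
]

# Collect the ordered event sequence observed for each block ID
sequences = defaultdict(list)
for block_id, event_id in parsed:
    sequences[block_id].append(event_id)

print(dict(sequences))
```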

Label Mapping

Mapping block IDs to their associated label, i.e., Anomaly or Normal

image
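The label-mapping step can be sketched like this; the labels dictionary stands in for a ground-truth labels file such as the one distributed with the public HDFS log dataset, and all IDs are hypothetical:

```python
# Hypothetical ground-truth labels keyed by block ID; in practice these
# come from a labels file shipped with the HDFS log dataset
labels = {"blk_1": "Anomaly", "blk_2": "Normal"}

# Event sequences per block, produced by the preprocessing step
sequences = {"blk_1": ["E5", "E22", "E11"], "blk_2": ["E5"]}

# Join each sequence into one token string and attach its label,
# yielding (document, label) pairs for supervised training
dataset = [(" ".join(ev), labels[blk]) for blk, ev in sequences.items()]
print(dataset)
```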

Natural Language Processing:

TF-IDF

Building a matrix by converting event sequences into TF-IDF form

image
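A minimal sketch of this step using scikit-learn's TfidfVectorizer; the event-sequence strings are invented, and the project's own vectorizer settings may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each block's event sequence is treated as one "document" of event-ID tokens
docs = ["E5 E22 E11", "E5 E5 E22", "E9 E11"]

vectorizer = TfidfVectorizer()      # default tokenizer keeps 2+ character tokens like "E5"
X = vectorizer.fit_transform(docs)  # sparse matrix: (n_sequences, n_distinct_events)
print(X.shape)
```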

Normalized Vector

Normalizing the matrix generated by TF-IDF

image
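The normalization can be sketched with scikit-learn's normalize. The matrix values below are toy numbers; note that TfidfVectorizer applies L2 normalization by default, so an explicit pass like this is only needed when that default is disabled:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Toy TF-IDF weight matrix: one row per block event sequence
M = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Scale each row to unit L2 norm so sequence length doesn't dominate
M_norm = normalize(M, norm="l2")
print(M_norm[0])  # → [0.6 0.8]
```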

Model Building:

Dumping Model

Storing the trained model

image
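Training and persisting the model with joblib might look like this sketch; the features and labels are toy values, whereas the real project trains on the TF-IDF vectors of the labeled event sequences:

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Toy feature vectors and labels (1 = anomaly, 0 = normal)
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, 0, 0]

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Dump the trained model to disk so the streaming job can reload it later
path = os.path.join(tempfile.mkdtemp(), "rf_model.pkl")
joblib.dump(clf, path)

reloaded = joblib.load(path)
print(reloaded.predict([[0.85, 0.15]]))
```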

Data Pipeline:

Kafka Logs

Sending the Kafka logs through the producer

image

Processing

New data is sent through the Kafka producer in chunks and received by a Spark streaming object in real time. Pre-processing is performed on this object, and the result is given to the pre-trained model, which computes a label for each log. If an anomaly is found, all the data associated with it is e-mailed to the system administrator.

image
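The per-batch logic can be sketched without a running cluster. Here classify() is a stand-in for the pre-trained Random Forest, notify_admin() stubs out the e-mail step, and all block and event IDs are made up:

```python
def classify(event_sequence):
    # Stand-in for model.predict(): flag sequences missing the
    # hypothetical "terminating" event E11 as anomalous
    return "Anomaly" if "E11" not in event_sequence else "Normal"

alerts = []

def notify_admin(block_id, events):
    # In the real pipeline this would e-mail the system administrator
    alerts.append((block_id, events))

# Simulated batch of event sequences arriving from the stream
incoming = {"blk_1": ["E5", "E22", "E11"], "blk_2": ["E5", "E22"]}

for block_id, events in incoming.items():
    if classify(events) == "Anomaly":
        notify_admin(block_id, events)

print(alerts)
```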

Output:

image

Future Enhancement:

  • Classifying anomalies by severity, such as Critical, High, Moderate, or Low.
  • Alerting different user groups according to the anomaly's criticality.

How to use this:

  • Start Zookeeper: open a new terminal and run
    zkserver

  • Start the Kafka server: open a new terminal and run
    .\bin\windows\kafka-server-start.bat .\config\server.properties

  • Create a Kafka topic: open a new terminal and run
    kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

  • Initialize a local producer: open a new terminal and run
    kafka-console-producer.bat --broker-list localhost:9092 --topic test --new-producer < HDFS.log

  • Submit the Spark job:
    ./spark-submit.sh --jars spark-streaming-kafka-0-8-assembly_2.11-2.3.3.jar Spark_Log_mining.py
