- Ankita Chikodi
- Arkil Thakkar
- Nehal Sharma
- Shravani Pande
In this project, we aim to reduce human intervention in log file analysis and debugging. Our methodology treats log mining as an NLP problem and uses natural language processing techniques to extract key features from the logs. We train a Random Forest classifier as the detection model. The generated log data is processed in real time through a Kafka and Spark streaming pipeline. If an anomaly is detected, system administrators are notified of the anomalous behavior traced in the log files. Our model demonstrates accurate predictive performance (F1-score of 93%).
The log data is sent through a Kafka producer and received as a Spark streaming object. The log data in the Spark streaming object is pre-processed and fed to a pre-trained model built from historical logs. After pre-processing, the model predicts whether the entry is an anomaly, and if so an alert e-mail is sent to the user.
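For illustration, the producer side can be as small as the following sketch; it assumes the `kafka-python` package, a broker on `localhost:9092`, and the topic name `test` used in the setup steps at the end of this document:

```python
# Minimal Kafka producer sketch; assumes the kafka-python package and a
# local broker on localhost:9092 with a topic named "test".
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Stream the raw HDFS log file into the topic line by line.
with open("HDFS.log") as log_file:
    for line in log_file:
        producer.send("test", line.encode("utf-8"))

producer.flush()  # ensure all buffered records reach the broker
```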
The offline training pipeline proceeds through the following stages (a code sketch follows this list):

- Raw Data: log data generated through HDFS.
- Log DataFrame: unstructured logs preprocessed into a dataframe.
- Event Sequence: each BlockID grouped with its sequence of events.
- Label Dictionary: a dictionary of BlockIDs and their event sequences.
- Label Mapping: each BlockID mapped to its associated label, i.e. Anomaly or Normal.
- TF-IDF: building a matrix by converting the event sequences into TF-IDF form.
- Normalized Vector: normalizing the matrix generated from TF-IDF.
- Dumping Model: storing the trained model.
- Kafka Logs: sending the Kafka logs through the producer.
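The following is a minimal sketch of how these stages could be wired together with scikit-learn. The file names (`HDFS.log`, `anomaly_label.csv`, `log_anomaly_model.pkl`), the block-ID regex, the use of the last token of each line as a stand-in for an extracted event ID, and the hyperparameters are all illustrative assumptions, not the project's exact code:

```python
# Sketch of the offline training pipeline; file names, the block-ID
# regex, and hyperparameters are illustrative assumptions.
import csv
import re

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# 1. Group events by BlockID to form event sequences.
block_re = re.compile(r"blk_-?\d+")
sequences = {}
with open("HDFS.log") as log_file:
    for line in log_file:
        match = block_re.search(line)
        if match:
            # The last token stands in for a properly extracted event ID.
            sequences.setdefault(match.group(), []).append(line.split()[-1])

# 2. Map each BlockID to its label; assumes a CSV with BlockId,Label
#    columns where Label is "Anomaly" or "Normal".
with open("anomaly_label.csv") as label_file:
    labels = {row["BlockId"]: row["Label"] for row in csv.DictReader(label_file)}

block_ids = [b for b in sequences if b in labels]
docs = [" ".join(sequences[b]) for b in block_ids]
y = [1 if labels[b] == "Anomaly" else 0 for b in block_ids]

# 3. Build the TF-IDF matrix over event sequences and L2-normalize it.
vectorizer = TfidfVectorizer()
X = Normalizer().fit_transform(vectorizer.fit_transform(docs))

# 4. Train the Random Forest and dump it for the streaming job to load.
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
joblib.dump((vectorizer, model), "log_anomaly_model.pkl")
```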
Processing
The new data is sent through the Kafka producer in chunks, which are received by the Spark streaming object in real time. Pre-processing is performed on this object, and the result is given to the pre-trained model, which computes the label of the log. If an anomaly is found, all the data associated with it is sent to the system administrator by e-mail.
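A condensed sketch of this streaming side, assuming Spark 2.x with the `spark-streaming-kafka-0-8` connector named in the setup steps below; the preprocessing, SMTP host, and e-mail addresses are simplified placeholders:

```python
# Sketch of the real-time scoring job; preprocessing and mail details
# are simplified placeholders.
import smtplib
from email.mime.text import MIMEText

import joblib
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from sklearn.preprocessing import Normalizer

vectorizer, model = joblib.load("log_anomaly_model.pkl")

def alert(chunk):
    # Notify the administrator; host and addresses are assumptions.
    msg = MIMEText(chunk)
    msg["Subject"] = "Log anomaly detected"
    with smtplib.SMTP("localhost") as smtp:
        smtp.sendmail("alerts@example.com", ["admin@example.com"],
                      msg.as_string())

def score(rdd):
    # Each record is a (key, message) pair from the Kafka topic.
    for _, chunk in rdd.collect():
        features = Normalizer().transform(vectorizer.transform([chunk]))
        if model.predict(features)[0] == 1:  # 1 == Anomaly
            alert(chunk)

sc = SparkContext(appName="LogMining")
ssc = StreamingContext(sc, batchDuration=5)
stream = KafkaUtils.createDirectStream(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"})
stream.foreachRDD(score)

ssc.start()
ssc.awaitTermination()
```

Collecting each micro-batch to the driver keeps the scikit-learn model off the executors, which is fine for a sketch but would not scale to very high log volumes.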
Output:
- Classifying the anomalies by severity: Critical, High, Moderate, or Low.
- Alerting the appropriate user groups according to the anomaly's criticality (a routing sketch follows this list).
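One hypothetical way to implement this routing is to bucket the classifier's anomaly probability into a severity level and map each level to a recipient group. The thresholds and addresses below are invented for illustration; the project may derive severity differently:

```python
# Hypothetical severity routing; thresholds on the model's anomaly
# probability and the recipient groups are illustrative assumptions.
SEVERITY_GROUPS = {
    "Critical": ["oncall@example.com", "admins@example.com"],
    "High": ["admins@example.com"],
    "Moderate": ["ops@example.com"],
    "Low": ["ops-digest@example.com"],
}

def severity(anomaly_probability):
    """Bucket a predict_proba score into a severity level."""
    if anomaly_probability >= 0.95:
        return "Critical"
    if anomaly_probability >= 0.85:
        return "High"
    if anomaly_probability >= 0.70:
        return "Moderate"
    return "Low"

def recipients_for(anomaly_probability):
    return SEVERITY_GROUPS[severity(anomaly_probability)]
```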
- Start Zookeeper: open a new terminal and type `zkserver`
- Start the Kafka server: open a new terminal and type `.\bin\windows\kafka-server-start.bat .\config\server.properties`
- Create a Kafka topic: open a new terminal and type `kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test`
- Initialize a local producer: open a new terminal and type `kafka-console-producer.bat --broker-list localhost:9092 --topic test --new-producer < HDFS.log`
- Run the Spark job: `./spark-submit.sh --jars spark-streaming-kafka-0-8-assembly_2.11-2.3.3.jar Spark_Log_mining.py`