Deployment code for the Anomaly Diagnosis Framework
author & maintainer = "Burak Aksar"
email = "[email protected]"
version = "1.0.0"
1-) Create a local virtual environment in the folder
python3 -m venv ml_venv
2-) Activate venv
source ml_venv/bin/activate
3-) Install requirements
pip install -r requirements.txt
Run the Jupyter notebook from inside the venv, not from your system-wide Python installation
./ml_venv/bin/jupyter notebook
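To confirm that a notebook kernel or script is really running inside the virtual environment, a quick check along these lines can be pasted into the first cell. This is a minimal sketch; the helper name `in_virtualenv` is illustrative and not part of this repository.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the notebook was most likely started with a Jupyter installed outside `ml_venv`.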
Under the analysis folder you will find the scripts needed to replicate the unknown apps, unknown inputs, and default anomaly diagnosis experiments.
Use predict.py to train a model; the trained model can then be used through the RuntimePredictor class in the runtime folder.
At a high level, E2EWatch requires the following components to provide diagnosis results at runtime in another production system:
- Monitoring framework that can collect numeric telemetry data from compute nodes while applications are running. Even though we only experiment with LDMS, it can be adapted to other popular monitoring frameworks, such as Ganglia or Examon, by modifying the wrappers in the data collection phase.
- Labeled data that is composed of anomalous and normal compute node telemetry data. It is possible to create labeled data sets using a suite of applications and synthetic anomalies. Another option is to use telemetry data labeled by users.
- Backend web service that can provide telemetry data on the fly to the trained model. We use the existing Django web application deployed on the monitoring server. It is possible to use other backend web services that can handle client requests and query data from the database. If runtime diagnosis is not necessary, it is also possible to run the pickled model after the application run is completed.
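The offline path mentioned in the last component (running the pickled model after the application run is completed) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the min/max/mean features, the metric names, and the `ThresholdModel` stand-in are all hypothetical and are not the repository's actual API or the feature set used by E2EWatch. The only requirement placed on the loaded model is a scikit-learn style `predict` method.

```python
import pickle
import statistics


def load_model(path):
    """Load a previously pickled model (only unpickle files you trust)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def extract_features(samples):
    """Collapse per-metric time series into one flat feature vector.

    `samples` maps a metric name to the list of values collected during the run.
    """
    features = []
    for name in sorted(samples):  # fixed order so columns match the training data
        values = samples[name]
        features.extend([min(values), max(values), statistics.fmean(values)])
    return features


def diagnose(model, samples):
    """Return the model's label for one compute node's telemetry."""
    return model.predict([extract_features(samples)])[0]


class ThresholdModel:
    """Hypothetical stand-in for a trained classifier, for demonstration only."""

    def predict(self, rows):
        # Index 2 holds the mean of the first metric in sorted order.
        return ["anomalous" if row[2] > 0.9 else "normal" for row in rows]


# Made-up telemetry for a single node; real data would come from the monitoring database.
node_telemetry = {"cpu_user": [0.95, 0.97, 0.99], "mem_free": [0.20, 0.25, 0.22]}
print(diagnose(ThresholdModel(), node_telemetry))
```

In a real deployment, `ThresholdModel()` would be replaced by the object returned from `load_model`, and the feature extraction would have to match whatever was used when the model was trained.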
E2EWatch: An End-to-End Anomaly Diagnosis Framework for Production HPC Systems
Authors: Burak Aksar (1), Benjamin Schwaller (2), Omar Aaziz (2), Vitus J. Leung (2), Jim Brandt (2), Manuel Egele (1), Ayse K. Coskun (1)
Affiliations: (1) Department of Electrical and Computer Engineering, Boston University (2) Sandia National Laboratories
This work has been partially funded by Sandia National Laboratories. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under Contract DE-NA0003525.
This project is licensed under the BSD 3-Clause License; see the LICENSE file for details.