We build a hybrid-deployed benchmark in the cloud-edge environment, which contains four widely used microservice systems. These systems are adapted for hybrid deployment and integrated into a unified monitoring framework. Please click here for the details of the Benchmark.
MicroCERCL constructs a heterogeneous dynamic topology stack from metric data. After anomaly detection, it trains a graph neural network model to accurately localize the root cause in the cloud-edge environment without relying on historical data.
- Python 3.7 is recommended; other Python 3 versions should also work.
- Git
git clone https://github.com/WDCloudEdge/MicroCERCL.git
cd MicroCERCL
python3.7 -m pip install -r requirements.txt
Change the dataset and other configs in Config.py
python3.7 ./main.py
The dataset contains three folders corresponding to Bookinfo, Hipster, and SockShop, where the root cause is located within a hybrid deployment scenario. Each folder is split into second-level folders by root cause microservice (or instance). Each root cause service folder contains the label information (xxx_label.txt) for all injected failures and is further split into third-level folders, one per failure sample, according to the label file. Each failure sample contains one fourth-level folder for every hybrid-deployed microservice system, and each of these system folders holds three types of monitoring data: metrics, traces, and logs (Bookinfo failure samples have no logs). (Figure: directory structure of the dataset.)
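The four-level folder hierarchy described above can be traversed with a short helper. This is a minimal sketch, not part of MicroCERCL: the names `FailureSample` and `iter_failure_samples` are hypothetical, and it assumes only that every level is a plain directory and that label files such as xxx_label.txt are regular files, which it skips.

```python
from pathlib import Path
from typing import Iterator, List, NamedTuple


class FailureSample(NamedTuple):
    """One hybrid-deployed system folder inside one failure sample (hypothetical type)."""
    benchmark: str   # first level, e.g. Bookinfo, Hipster, or SockShop
    root_cause: str  # second level: root cause service (or instance)
    sample: str      # third level: one injected failure
    system: str      # fourth level: one hybrid-deployed microservice system
    path: Path       # folder holding the metrics, traces, and logs


def _subdirs(folder: Path) -> List[Path]:
    # Only directories count as levels; files such as xxx_label.txt are ignored.
    return sorted(p for p in folder.iterdir() if p.is_dir())


def iter_failure_samples(dataset_root: Path) -> Iterator[FailureSample]:
    """Yield one record per fourth-level system folder under dataset_root."""
    for benchmark in _subdirs(dataset_root):
        for root_cause in _subdirs(benchmark):
            for sample in _subdirs(root_cause):
                for system in _subdirs(sample):
                    yield FailureSample(
                        benchmark.name, root_cause.name, sample.name, system.name, system
                    )
```

Calling `list(iter_failure_samples(Path("path/to/dataset")))` would then give a flat list that can be filtered by benchmark or root cause before loading any monitoring data.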
It contains all the monitoring data of the hybrid-deployed microservice systems at the time a failure occurs. (Figure: monitoring data of a failure sample.)
File | Description |
---|---|
call.csv | Time-series call latency between microservices, including P99, P95, and P90, which denote the 99th, 95th, and 90th percentiles of the latency data. |
graph.csv | Time-series topologies, containing each instance, the server where it is located, and the service call relationships. |
instance.csv | Time-series metrics of each instance, containing CPU usage, memory usage, and network transmit packets. |
latency.csv | The time-series latency of microservices, including P99, P95, and P90, which denote the 99th, 95th, and 90th percentiles of the latency data. |
resource.csv | Time-series metrics of instances within a specific namespace, containing the total CPU usage and memory usage. |
success_rate.csv | Time-series success rate of microservices. |
svc_metric.csv | Time-series metrics of microservices (the average of its instances), containing CPU usage, CPU limit, memory usage, memory limit, FS write, FS read, FS usage, net receive, net transmit, and network transmit packets. |
svc_qps.csv | Time-series QPS (queries per second) of microservices. |
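To make the P99, P95, and P90 columns concrete, the sketch below computes percentiles from raw latency values with the nearest-rank method. It is an illustration only: the README does not state which percentile definition the monitoring pipeline uses, so nearest-rank is an assumption, and the `percentile` helper is hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing so that integer percentiles give an exact rank.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative latencies in milliseconds (made-up values, not taken from the dataset).
latencies_ms = [12.0, 15.0, 11.0, 240.0, 14.0, 13.0, 16.0, 18.0, 17.0, 19.0]
p99, p95, p90 = (percentile(latencies_ms, p) for p in (99, 95, 90))
```

High percentiles are sensitive to a single slow call, which is why a latency spike tends to show up in P99 before it shows up in P90.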
File | Description |
---|---|
abnormal.pkl | Records data with missing structure, abnormal status codes, or error messages, excluding data with abnormal net latency. |
abnormal_half.pkl | For the namespace where the file is located, records the data of abnormal.pkl after removing the service information of other namespaces from the trace data (only this namespace's service information is kept). |
inbound.pkl | For the namespace where the file is located, records data containing service calls from other namespaces to this namespace. |
inbound_half.pkl | For the namespace where the file is located, records the data of inbound.pkl after removing the service information of other namespaces from the trace data (only this namespace's service information is kept). |
normal.pkl | Records data with complete structure and normal status code, including data with abnormal net latency. |
outbound.pkl | For the namespace where the file is located, records data containing service calls from this namespace to other namespaces. |
outbound_half.pkl | For the namespace where the file is located, records the data of outbound.pkl after removing the service information of other namespaces from the trace data (only this namespace's service information is kept). |
trace_net_latency.pkl | Statistics on request latency data and response latency data between a pair of service calls. |
trace_pod_latency.pkl | Statistics on latency data between sending a request and receiving a response between a pair of service calls. |
Each instance (container) has a .pkl file, containing all business logs of the container.
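The trace and log files are stored as .pkl files, so they can be read with Python's standard `pickle` module. The sketch below is hedged: the object types stored inside the files are not documented here, so the hypothetical helper `load_pickles` only loads each file and returns it by name. If the files were written from library objects (for example DataFrames), that library must be installed for unpickling to succeed. Only unpickle files from a source you trust.

```python
import pickle
from pathlib import Path
from typing import Any, Dict


def load_pickles(folder: Path) -> Dict[str, Any]:
    """Load every .pkl file directly inside folder, keyed by file name without extension."""
    loaded: Dict[str, Any] = {}
    for path in sorted(folder.glob("*.pkl")):
        with path.open("rb") as fh:
            loaded[path.stem] = pickle.load(fh)
    return loaded
```

A quick way to explore an unfamiliar folder is to print `type(obj)` for each loaded value before writing any parsing code.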
MicroCERCL/
├── .gitignore
├── Config.py
├── MetricCollector.py
├── README.md
├── anomaly_detection.py
├── graph.py
├── main.py
├── model.py
├── model_aggregate.py
├── model_attention.py
├── requirements.txt
└── util/
    ├── KubernetesClient.py
    ├── PrometheusClient.py
    └── utils.py
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.