Support monitoring AWS Cloud EKS (apache#10199)
pg-yang authored Dec 31, 2022
1 parent bfb90f6 commit 8de8c7c
Showing 37 changed files with 4,352 additions and 2 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/skywalking.yaml
@@ -645,6 +645,8 @@ jobs:
config: test/e2e-v2/cases/exporter/kafka/e2e.yaml
- name: Virtual MQ
config: test/e2e-v2/cases/virtual-mq/e2e.yaml
- name: AWS Cloud EKS
config: test/e2e-v2/cases/aws/eks/e2e.yaml
steps:
- uses: actions/checkout@v3
with:
4 changes: 4 additions & 0 deletions docs/en/changes/changes.md
@@ -68,12 +68,16 @@
* Fix `time_bucket` of `ServiceTraffic` not set correctly in `slowSql` of MAL.
* Correct the TopN record query DAO of BanyanDB.
* Tweak interval settings of BanyanDB.
* Support monitoring AWS Cloud EKS.

#### UI

* Add Zipkin Lens UI to webapp, and proxy it to context path `/zipkin`.
* Migrate the build tool from vue cli to Vite4.
* Fix Instance Relation and Endpoint Relation dashboards show up.
* Add Micrometer icon
* Update MySQL UI to support MariaDB
* Add AWS menu for supporting AWS monitoring

#### Documentation

100 changes: 100 additions & 0 deletions docs/en/setup/backend/backend-aws-eks-monitoring.md
@@ -0,0 +1,100 @@
# AWS Cloud EKS monitoring
SkyWalking uses the OpenTelemetry Collector with the [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) to transfer EKS metrics to the
[OpenTelemetry receiver](opentelemetry-receiver.md) and then into the [Meter System](./../../concepts-and-designs/meter.md).

### Data flow
1. The OpenTelemetry Collector fetches metrics from EKS via the [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) and pushes them to the SkyWalking OAP Server via the OpenCensus gRPC exporter or the OpenTelemetry gRPC exporter.
2. The SkyWalking OAP Server parses the expressions with [MAL](../../concepts-and-designs/mal.md) to filter, calculate, and aggregate the metrics, then stores the results.

### Set up
1. Deploy [amazon/aws-otel-collector](https://hub.docker.com/r/amazon/aws-otel-collector) with the [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) to EKS.
2. Configure the SkyWalking [OpenTelemetry receiver](opentelemetry-receiver.md).
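
For step 2, a minimal sketch of the relevant `receiver-otel` section of the OAP `application.yml`, assuming the defaults that ship with this change. The rule list shown is the default one; it can be trimmed to the rules you actually need, as long as `aws-eks/*` stays in it so that the rule files under `otel-rules/aws-eks/` are loaded.

```yaml
receiver-otel:
  selector: ${SW_OTEL_RECEIVER:default}
  default:
    # "oc" accepts the OpenCensus gRPC exporter, "otlp" the OpenTelemetry gRPC exporter.
    enabledHandlers: ${SW_OTEL_RECEIVER_ENABLED_HANDLERS:"oc,otlp"}
    # "aws-eks/*" enables every rule file in otel-rules/aws-eks/.
    enabledOtelRules: ${SW_OTEL_RECEIVER_ENABLED_OTEL_RULES:"apisix,k8s/*,istio-controlplane,vm,mysql/*,postgresql/*,oap,aws-eks/*"}
```

Both values can also be overridden through the `SW_OTEL_RECEIVER_ENABLED_HANDLERS` and `SW_OTEL_RECEIVER_ENABLED_OTEL_RULES` environment variables named in the placeholders.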

### EKS Monitoring
The [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) provides metrics across multiple dimensions of EKS: cluster, node, service, etc.
Accordingly, SkyWalking observes the status and payload of the EKS cluster, which is cataloged as a `LAYER: AWS_EKS` `Service` in the OAP. The k8s nodes are recognized as `LAYER: AWS_EKS` `instance`s, and the k8s services are recognized as `endpoint`s.
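
This mapping is driven by the `expSuffix` of each MAL rule file. The sketch below illustrates the three shapes involved. The cluster line matches the cluster rule file added in this change; the label names used for nodes (`NodeName`) and services (`Service`) are assumptions for illustration and may differ from the shipped `eks-node.yaml` and `eks-service.yaml`.

```yaml
# Illustrative only: three separate rule files, shown as three YAML documents.
---
# EKS cluster -> Service entity in the AWS_EKS layer
expSuffix: service(['cluster'], Layer.AWS_EKS)
---
# k8s node -> Instance entity of the cluster service (label name assumed)
expSuffix: instance(['cluster'], ['NodeName'], Layer.AWS_EKS)
---
# k8s service -> Endpoint entity of the cluster service (label name assumed)
expSuffix: endpoint(['cluster'], ['Service'], Layer.AWS_EKS)
```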

#### Specify Job Name

SkyWalking distinguishes AWS Cloud EKS metrics by the `job_name` attribute, whose value must be `aws-cloud-eks-monitoring`.
You can use an OTEL Collector processor to add the attribute as follows:

```yaml
processors:
  resource/job-name:
    attributes:
      - key: job_name
        value: aws-cloud-eks-monitoring
        action: insert
```
Note: if you don't specify the `job_name` attribute, SkyWalking OAP will ignore the metrics.

#### Supported Metrics
| Monitoring Panel | Unit | Metric Name | Catalog | Description | Data Source |
|---------------------------------------|---------|--------------------------------------------|------------|--------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Node Count | | eks_cluster_node_count | Service | The node count of the EKS cluster | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Failed Node Count | | eks_cluster_failed_node_count | Service | The failed node count of the EKS cluster | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Pod Count (namespace dimension)       |         | eks_cluster_namespace_count                | Service    | The count of pods in the EKS cluster (namespace dimension)   | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Pod Count (service dimension)         |         | eks_cluster_service_count                  | Service    | The count of pods in the EKS cluster (service dimension)     | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Dropped Count (per second) | count/s | eks_cluster_net_rx_dropped | Service | Network RX dropped count | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count (per second) | count/s | eks_cluster_net_rx_error | Service | Network RX error count | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Dropped Count (per second) | count/s | eks_cluster_net_tx_dropped                 | Service    | Network TX dropped count                                     | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count (per second)   | count/s | eks_cluster_net_tx_error                   | Service    | Network TX error count                                       | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Pod Count                             |         | eks_cluster_node_pod_number                | Instance   | The count of pods running on the node                        | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| CPU Utilization | percent | eks_cluster_node_cpu_utilization | Instance | The CPU Utilization of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Memory Utilization | percent | eks_cluster_node_memory_utilization | Instance | The Memory Utilization of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX | bytes/s | eks_cluster_node_net_rx_bytes | Instance | Network RX bytes of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count                | count/s | eks_cluster_node_net_rx_error              | Instance   | Network RX error count of the node                           | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX                            | bytes/s | eks_cluster_node_net_tx_bytes              | Instance   | Network TX bytes of the node                                 | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count                | count/s | eks_cluster_node_net_tx_error              | Instance   | Network TX error count of the node                           | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Disk IO Write                         | bytes/s | eks_cluster_node_disk_io_write             | Instance   | The IO write bytes of the node                               | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Disk IO Read                          | bytes/s | eks_cluster_node_disk_io_read              | Instance   | The IO read bytes of the node                                | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| FS Utilization                        | percent | eks_cluster_node_fs_utilization            | Instance   | The filesystem utilization of the node                       | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| CPU Utilization | percent | eks_cluster_node_pod_cpu_utilization | Instance | The CPU Utilization of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Memory Utilization | percent | eks_cluster_node_pod_memory_utilization | Instance | The Memory Utilization of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX | bytes/s | eks_cluster_node_pod_net_rx_bytes | Instance | Network RX bytes of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count | count/s | eks_cluster_node_pod_net_rx_error | Instance | Network RX error count of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX                            | bytes/s | eks_cluster_node_pod_net_tx_bytes          | Instance   | Network TX bytes of the pod running on the node              | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count                | count/s | eks_cluster_node_pod_net_tx_error          | Instance   | Network TX error count of the pod running on the node        | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| CPU Utilization                       | percent | eks_cluster_service_pod_cpu_utilization    | Endpoint   | The CPU Utilization of the pods that belong to the service   | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Memory Utilization                    | percent | eks_cluster_service_pod_memory_utilization | Endpoint   | The Memory Utilization of the pods that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX                            | bytes/s | eks_cluster_service_pod_net_rx_bytes       | Endpoint   | Network RX bytes of the pods that belong to the service      | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count                | count/s | eks_cluster_service_pod_net_rx_error       | Endpoint   | Network RX error count of the pods that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX                            | bytes/s | eks_cluster_service_pod_net_tx_bytes       | Endpoint   | Network TX bytes of the pods that belong to the service      | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count                | count/s | eks_cluster_service_pod_net_tx_error       | Endpoint   | Network TX error count of the pods that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |

### Customizations
You can customize your own metrics/expression/dashboard panel.
The metrics definition and expression rules are found in `/config/otel-rules/aws-eks/`.
The AWS Cloud EKS dashboard panel configurations are found in `/config/ui-initialized-templates/aws_eks`.
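
As a rough illustration, a custom rule file could follow the same layout as the cluster rule file added in this change. The sketch below is hypothetical: the file name `eks-custom.yaml`, the prefix `eks_cluster_custom`, and the rule name are made up for the example, while the `filter`, `expPrefix`, `expSuffix`, source metric, and `sum` call reuse forms that appear in the shipped cluster rules.

```yaml
# Hypothetical file: config/otel-rules/aws-eks/eks-custom.yaml
# Only accept metrics tagged by the collector with the EKS job name.
filter: "{ tags -> tags.job_name == 'aws-cloud-eks-monitoring' }"
# Derive the SkyWalking service name from the cluster name.
expPrefix: tag({tags -> tags.cluster = 'aws-eks-cluster::' + tags.ClusterName})
expSuffix: service(['cluster'], Layer.AWS_EKS)
# Hypothetical prefix; the stored metric name is <metricPrefix>_<name>.
metricPrefix: eks_cluster_custom
metricsRules:
  # Stored as eks_cluster_custom_net_rx_dropped_total.
  - name: net_rx_dropped_total
    exp: node_network_rx_dropped.sum(['cluster'])
```

With `aws-eks/*` in the enabled OTEL rules, a file placed in that directory should be picked up alongside the shipped ones. A dashboard widget that reads the new metric still has to be added separately.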

### OTEL Configuration Sample With AWS Container Insights Receiver

```yaml
extensions:
  health_check:
receivers:
  awscontainerinsightreceiver:
processors:
  resource/job-name:
    attributes:
      - key: job_name
        value: aws-cloud-eks-monitoring
        action: insert
exporters:
  otlp:
    endpoint: oap-service:11800
    tls:
      insecure: true
  logging:
    loglevel: debug
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [resource/job-name]
      exporters: [otlp,logging]
  extensions: [health_check]
```
Refer to the [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) documentation for more information.
3 changes: 3 additions & 0 deletions docs/en/setup/backend/opentelemetry-receiver.md
@@ -40,5 +40,8 @@ for identification of the metric data.
| Metrics of MYSQL| otel-rules/mysql.yaml | prometheus/mysqld_exporter -> OpenTelemetry Collector -- OC/OTLP exporter --> SkyWalking OAP Server |
| Metrics of PostgreSQL| otel-rules/postgresql.yaml | postgres_exporter -> OpenTelemetry Collector -- OC/OTLP exporter --> SkyWalking OAP Server |
| Metrics of Apache APISIX| otel-rules/apisix.yaml | apisix prometheus plugin -> OpenTelemetry Collector -- OC/OTLP exporter --> SkyWalking OAP Server |
| Metrics of AWS Cloud EKS| otel-rules/aws-eks/eks-cluster.yaml |AWS Container Insights Receiver -> OpenTelemetry Collector -- OC/OTLP exporter --> SkyWalking OAP Server |
| Metrics of AWS Cloud EKS| otel-rules/aws-eks/eks-service.yaml |AWS Container Insights Receiver -> OpenTelemetry Collector -- OC/OTLP exporter --> SkyWalking OAP Server |
| Metrics of AWS Cloud EKS| otel-rules/aws-eks/eks-node.yaml |AWS Container Insights Receiver -> OpenTelemetry Collector -- OC/OTLP exporter --> SkyWalking OAP Server |

**Note**: You can also use OpenTelemetry exporter to transport the metrics to SkyWalking OAP directly. See [OpenTelemetry Exporter](./backend-meter.md#opentelemetry-exporter).
4 changes: 4 additions & 0 deletions docs/menu.yml
@@ -179,6 +179,10 @@ catalog:
catalog:
- name: "Linux Monitoring"
path: "/en/setup/backend/backend-vm-monitoring"
- name: "AWS Cloud Monitoring"
catalog:
- name: "EKS Monitoring"
path: "/en/setup/backend/backend-aws-eks-monitoring"
- name: "Browser Monitoring"
path: "/en/setup/service-agent/browser-agent"
- name: "Gateway Monitoring"
@@ -136,7 +136,12 @@ public enum Layer {
     /**
      * Apache APISIX is an open source, dynamic, scalable, and high-performance cloud native API gateway.
      */
-    APISIX(21, true);
+    APISIX(21, true),
+
+    /**
+     * EKS (Amazon Elastic Kubernetes Service) is k8s service provided by AWS Cloud
+     */
+    AWS_EKS(22, true);

     private final int value;
     /**
@@ -60,6 +60,7 @@ public class UITemplateInitializer {
Layer.FAAS.name(),
Layer.APISIX.name(),
Layer.VIRTUAL_MQ.name(),
Layer.AWS_EKS.name(),
"custom"
};
private final UITemplateManagementService uiTemplateManagementService;
@@ -353,7 +353,7 @@ receiver-otel:
   selector: ${SW_OTEL_RECEIVER:default}
   default:
     enabledHandlers: ${SW_OTEL_RECEIVER_ENABLED_HANDLERS:"oc,otlp"}
-    enabledOtelRules: ${SW_OTEL_RECEIVER_ENABLED_OTEL_RULES:"apisix,k8s/*,istio-controlplane,vm,mysql/*,postgresql/*,oap"}
+    enabledOtelRules: ${SW_OTEL_RECEIVER_ENABLED_OTEL_RULES:"apisix,k8s/*,istio-controlplane,vm,mysql/*,postgresql/*,oap,aws-eks/*"}

 receiver-zipkin:
   selector: ${SW_RECEIVER_ZIPKIN:-}
@@ -0,0 +1,52 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This will parse a textual representation of a duration. The formats
# accepted are based on the ISO-8601 duration format {@code PnDTnHnMn.nS}
# with days considered to be exactly 24 hours.
# <p>
# Examples:
# <pre>
# "PT20.345S" -- parses as "20.345 seconds"
# "PT15M" -- parses as "15 minutes" (where a minute is 60 seconds)
# "PT10H" -- parses as "10 hours" (where an hour is 3600 seconds)
# "P2D" -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
# "P2DT3H4M" -- parses as "2 days, 3 hours and 4 minutes"
# "P-6H3M" -- parses as "-6 hours and +3 minutes"
# "-P6H3M" -- parses as "-6 hours and -3 minutes"
# "-P-6H+3M" -- parses as "+6 hours and -3 minutes"
# </pre>

filter: "{ tags -> tags.job_name == 'aws-cloud-eks-monitoring' }" # The OpenTelemetry job name
expPrefix: tag({tags -> tags.cluster = 'aws-eks-cluster::' + tags.ClusterName})
expSuffix: service(['cluster'], Layer.AWS_EKS)
metricPrefix: eks_cluster
metricsRules:
  - name: node_count
    exp: cluster_node_count.downsampling(LATEST)
  - name: failed_node_count
    exp: cluster_failed_node_count.downsampling(LATEST)
  - name: namespace_count
    exp: namespace_number_of_running_pods.sum(['Namespace','cluster'])
  - name: service_count
    exp: service_number_of_running_pods.sum(['Service','cluster'])
  - name: net_rx_dropped
    exp: node_network_rx_dropped
  - name: net_rx_error
    exp: node_network_rx_errors
  - name: net_tx_dropped
    exp: node_network_tx_dropped
  - name: net_tx_error
    exp: node_network_tx_errors