forked from apache/skywalking
Support monitoring AWS Cloud EKS (apache#10199)
Showing 37 changed files with 4,352 additions and 2 deletions.
@@ -0,0 +1,100 @@
# AWS Cloud EKS monitoring
SkyWalking leverages the OpenTelemetry Collector with the [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) to transfer metrics to the
[OpenTelemetry receiver](opentelemetry-receiver.md) and into the [Meter System](./../../concepts-and-designs/meter.md).

### Data flow
1. The OpenTelemetry Collector fetches metrics from EKS via the [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) and pushes them to the SkyWalking OAP Server via the OpenCensus gRPC exporter or the OpenTelemetry gRPC exporter.
2. The SkyWalking OAP Server parses the expressions with [MAL](../../concepts-and-designs/mal.md) to filter, calculate, and aggregate the metrics, then stores the results.

### Set up
1. Deploy [amazon/aws-otel-collector](https://hub.docker.com/r/amazon/aws-otel-collector) with the [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) to EKS.
2. Configure the SkyWalking [OpenTelemetry receiver](opentelemetry-receiver.md).

### EKS Monitoring
The [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) provides multi-dimensional metrics for the EKS cluster, nodes, services, etc.
Accordingly, SkyWalking observes the status and payload of the EKS cluster, which is cataloged as a `LAYER: AWS_EKS` `Service` in the OAP. Meanwhile, the k8s nodes are recognized as `LAYER: AWS_EKS` `instance`s, and the k8s services are recognized as `endpoint`s.

#### Specify Job Name

SkyWalking identifies AWS Cloud EKS metrics by the `job_name` attribute, whose value must be `aws-cloud-eks-monitoring`.
You can use an OTEL Collector processor to add the attribute as follows:

```yaml
processors:
  resource/job-name:
    attributes:
      - key: job_name
        value: aws-cloud-eks-monitoring
        action: insert
```
Note: if you don't specify the `job_name` attribute, the SkyWalking OAP will ignore the metrics.
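
The `insert` action of the OpenTelemetry resource processor only adds the attribute when the key is absent. If something earlier in your pipeline may already set a different `job_name` (an assumption about your environment, not something this document describes), the following variant is a minimal sketch that overwrites it with `upsert` instead:

```yaml
processors:
  resource/job-name:
    attributes:
      - key: job_name
        value: aws-cloud-eks-monitoring
        # `insert` leaves an existing job_name untouched;
        # `upsert` adds the key or overwrites its current value.
        action: upsert
```

Either way, the processor takes effect only when it is also listed under `service.pipelines.metrics.processors`.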

#### Supported Metrics
| Monitoring Panel | Unit | Metric Name | Catalog | Description | Data Source |
|---|---|---|---|---|---|
| Node Count | | eks_cluster_node_count | Service | The node count of the EKS cluster | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Failed Node Count | | eks_cluster_failed_node_count | Service | The failed node count of the EKS cluster | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Pod Count (namespace dimension) | | eks_cluster_namespace_count | Service | The count of pods in the EKS cluster (namespace dimension) | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Pod Count (service dimension) | | eks_cluster_service_count | Service | The count of pods in the EKS cluster (service dimension) | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Dropped Count (per second) | count/s | eks_cluster_net_rx_dropped | Service | Network RX dropped count | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count (per second) | count/s | eks_cluster_net_rx_error | Service | Network RX error count | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Dropped Count (per second) | count/s | eks_cluster_net_tx_dropped | Service | Network TX dropped count | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count (per second) | count/s | eks_cluster_net_tx_error | Service | Network TX error count | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Pod Count | | eks_cluster_node_pod_number | Instance | The count of pods running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| CPU Utilization | percent | eks_cluster_node_cpu_utilization | Instance | The CPU Utilization of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Memory Utilization | percent | eks_cluster_node_memory_utilization | Instance | The Memory Utilization of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX | bytes/s | eks_cluster_node_net_rx_bytes | Instance | Network RX bytes of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count | count/s | eks_cluster_node_net_rx_error | Instance | Network RX error count of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX | bytes/s | eks_cluster_node_net_tx_bytes | Instance | Network TX bytes of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count | count/s | eks_cluster_node_net_tx_error | Instance | Network TX error count of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Disk IO Write | bytes/s | eks_cluster_node_net_rx_bytes | Instance | The IO write bytes of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) | | ||
| Disk IO Read | bytes/s | eks_cluster_node_net_rx_bytes | Instance | The IO read bytes of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) | | ||
| FS Utilization | percent | eks_cluster_node_net_rx_bytes | Instance | The filesystem utilization of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) | | ||
| CPU Utilization | percent | eks_cluster_node_pod_cpu_utilization | Instance | The CPU Utilization of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) | | ||
| Memory Utilization | percent | eks_cluster_node_pod_memory_utilization | Instance | The Memory Utilization of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) | | ||
| Network RX | bytes/s | eks_cluster_node_pod_net_rx_bytes | Instance | Network RX bytes of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) | | ||
| Network RX Error Count | count/s | eks_cluster_node_pod_net_rx_error | Instance | Network RX error count of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) | | ||
| Network TX | bytes/s | eks_cluster_node_pod_net_tx_bytes | Instance | Network TX bytes of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count | count/s | eks_cluster_node_pod_net_tx_error | Instance | Network TX error count of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| CPU Utilization | percent | eks_cluster_service_pod_cpu_utilization | Endpoint | The CPU Utilization of the pods that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Memory Utilization | percent | eks_cluster_service_pod_memory_utilization | Endpoint | The Memory Utilization of the pods that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX | bytes/s | eks_cluster_service_pod_net_rx_bytes | Endpoint | Network RX bytes of the pods that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count | count/s | eks_cluster_service_pod_net_rx_error | Endpoint | Network RX error count of the pods that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX | bytes/s | eks_cluster_service_pod_net_tx_bytes | Endpoint | Network TX bytes of the pods that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count | count/s | eks_cluster_service_pod_net_tx_error | Endpoint | Network TX error count of the pods that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |

### Customizations
You can customize your own metrics, expressions, and dashboard panels.
The metric definitions and expression rules are found in `/config/otel-rules/aws-eks/`.
The AWS Cloud EKS dashboard panel configurations are found in `/config/ui-initialized-templates/aws_eks`.
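
As an illustration of customizing a rule, the fragment below sketches one extra entry in the `metricsRules` list of a rule file such as `eks-cluster.yaml`. The rule name `net_rx_error_total` and its expression are hypothetical and not part of this commit; the sketch assumes the `cluster` tag added by the file's `expPrefix` and reuses the `sum([...])` form that the existing rules already use.

```yaml
# Fragment of a rule file under /config/otel-rules/aws-eks/
metricPrefix: eks_cluster
metricsRules:
  # ... existing rules stay as they are ...

  # Hypothetical rule: RX errors summed per cluster.
  # With the prefix above, the metric name would be
  # eks_cluster_net_rx_error_total.
  - name: net_rx_error_total
    exp: node_network_rx_errors.sum(['cluster'])
```

A new metric still needs a matching dashboard panel in the UI templates before it shows up in a dashboard.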

### OTEL Configuration Sample With AWS Container Insights Receiver

```yaml
extensions:
  health_check:
receivers:
  awscontainerinsightreceiver:
processors:
  resource/job-name:
    attributes:
      - key: job_name
        value: aws-cloud-eks-monitoring
        action: insert
exporters:
  otlp:
    endpoint: oap-service:11800
    tls:
      insecure: true
  logging:
    loglevel: debug
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [resource/job-name]
      exporters: [otlp,logging]
  extensions: [health_check]
```
Refer to [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) for more information.
52 changes: 52 additions & 0 deletions
oap-server/server-starter/src/main/resources/otel-rules/aws-eks/eks-cluster.yaml
@@ -0,0 +1,52 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This will parse a textual representation of a duration. The formats
# accepted are based on the ISO-8601 duration format {@code PnDTnHnMn.nS}
# with days considered to be exactly 24 hours.
# <p>
# Examples:
# <pre>
# "PT20.345S" -- parses as "20.345 seconds"
# "PT15M" -- parses as "15 minutes" (where a minute is 60 seconds)
# "PT10H" -- parses as "10 hours" (where an hour is 3600 seconds)
# "P2D" -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
# "P2DT3H4M" -- parses as "2 days, 3 hours and 4 minutes"
# "P-6H3M" -- parses as "-6 hours and +3 minutes"
# "-P6H3M" -- parses as "-6 hours and -3 minutes"
# "-P-6H+3M" -- parses as "+6 hours and -3 minutes"
# </pre>

filter: "{ tags -> tags.job_name == 'aws-cloud-eks-monitoring' }" # The OpenTelemetry job name
expPrefix: tag({tags -> tags.cluster = 'aws-eks-cluster::' + tags.ClusterName})
expSuffix: service(['cluster'], Layer.AWS_EKS)
metricPrefix: eks_cluster
metricsRules:
  - name: node_count
    exp: cluster_node_count.downsampling(LATEST)
  - name: failed_node_count
    exp: cluster_failed_node_count.downsampling(LATEST)
  - name: namespace_count
    exp: namespace_number_of_running_pods.sum(['Namespace','cluster'])
  - name: service_count
    exp: service_number_of_running_pods.sum(['Service','cluster'])
  - name: net_rx_dropped
    exp: node_network_rx_dropped
  - name: net_rx_error
    exp: node_network_rx_errors
  - name: net_tx_dropped
    exp: node_network_tx_dropped
  - name: net_tx_error
    exp: node_network_tx_errors